arXiv:2505.17050v1 Announce Type: cross
Abstract: Project-Based Learning (PBL) involves a variety of highly correlated multimodal data, making it a vital educational approach within STEM disciplines. With the rapid development of multimodal large language models (MLLMs), researchers have begun exploring their potential to enhance tasks such as information retrieval, knowledge comprehension, and data generation in educational settings. However, existing benchmarks fall short in providing both a free-form output structure and a rigorous human expert validation process, limiting their effectiveness in evaluating real-world educational tasks. Additionally, few methods have developed automated pipelines to assist with the complex responsibilities of teachers leveraging MLLMs, largely due to model hallucination and instability, which lead to unreliable implementation. To address this gap, we introduce PBLBench, a novel benchmark designed to evaluate complex reasoning grounded in domain-specific knowledge and long-context understanding, thereby challenging models with tasks that closely resemble those handled by human experts. To establish reliable ground truth, we adopt the Analytic Hierarchy Process (AHP), utilizing expert-driven pairwise comparisons to derive structured and weighted evaluation criteria. We assess the performance of 15 leading MLLMs/LLMs using PBLBench and demonstrate that even the most advanced models achieve only 59% rank accuracy, underscoring the significant challenges presented by this benchmark. We believe PBLBench will serve as a catalyst for the development of more capable AI agents, ultimately aiming to alleviate teacher workload and enhance educational productivity.

Expert Commentary: Utilizing Multimodal Large Language Models in Project-Based Learning

Project-Based Learning (PBL) is a pedagogical approach that draws on a variety of highly correlated multimodal data, making it a valuable method within STEM disciplines. With the emergence of multimodal large language models (MLLMs), researchers are now exploring how these advanced AI models can enhance educational tasks related to information retrieval, knowledge comprehension, and data generation in PBL settings.

This study highlights the challenges faced by current benchmarks in evaluating the performance of MLLMs in educational contexts. The lack of a free-form output structure and of rigorous human expert validation in existing benchmarks limits their effectiveness in assessing real-world educational tasks. Additionally, model hallucination and instability pose obstacles to building automated pipelines that support teachers in using MLLMs effectively.

Multi-disciplinary Nature

The concepts discussed in this article touch upon a variety of disciplines, including computer science, education, artificial intelligence, and cognitive science. The integration of MLLMs in PBL requires a multi-disciplinary approach to address the complex challenges involved in leveraging advanced AI technology in educational settings.

Relation to Multimedia Information Systems

The utilization of MLLMs in PBL aligns with the broader field of multimedia information systems, where the integration of various modes of data (text, images, videos) is crucial for enhancing information retrieval and knowledge dissemination. The incorporation of MLLMs in PBL emphasizes the importance of considering multimodal data in educational contexts for more effective learning outcomes.

Future Implications

The introduction of PBLBench as a novel benchmark for evaluating MLLMs in complex reasoning tasks signifies a step forward in addressing the limitations of current evaluation methods. By incorporating the Analytic Hierarchy Process (AHP) for structured evaluation criteria, this benchmark aims to challenge AI models with tasks that require domain-specific knowledge and long-context understanding, mirroring the tasks handled by human experts.
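To make the AHP step concrete, the sketch below shows how pairwise expert comparisons are typically turned into normalized criterion weights. The 3x3 comparison matrix and the criteria it ranks are purely hypothetical illustrations (the paper's actual criteria and judgments are not reproduced here); the weight derivation uses the standard row geometric mean approximation with Saaty's consistency check.

```python
import math

# Hypothetical pairwise comparison matrix on Saaty's 1-9 scale for three
# illustrative criteria (NOT the actual PBLBench criteria).
# A[i][j] = how strongly criterion i is preferred over criterion j.
A = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]

def ahp_weights(A):
    """Approximate AHP priority weights via the row geometric mean method."""
    n = len(A)
    gm = [math.prod(row) ** (1.0 / n) for row in A]
    total = sum(gm)
    return [g / total for g in gm]

def consistency_ratio(A, w):
    """Saaty consistency ratio; CR < 0.10 is conventionally acceptable."""
    n = len(A)
    # Estimate the principal eigenvalue from (A w)_i / w_i.
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(Aw[i] / w[i] for i in range(n)) / n
    ci = (lam - n) / (n - 1)
    ri = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}[n]  # random index table
    return ci / ri if ri else 0.0

weights = ahp_weights(A)
cr = consistency_ratio(A, weights)
print([round(w, 3) for w in weights])  # e.g. [0.648, 0.230, 0.122]
print(round(cr, 3))                    # well below the 0.10 threshold
```

The resulting weights can then score free-form model outputs against each criterion, which is the structured, expert-grounded evaluation the benchmark relies on.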

Overall, the findings of this study underscore the challenges and opportunities presented by integrating MLLMs in PBL: even the most advanced of the 15 evaluated MLLMs/LLMs achieved only 59% rank accuracy on PBLBench. As AI technology continues to advance, the development of more capable AI agents through benchmarks like PBLBench has the potential to alleviate teacher workload and enhance educational productivity in the future.
