arXiv:2504.06637v1 Announce Type: new
Abstract: Large Language Models (LLMs) and Large Multimodal Models (LMMs) demonstrate impressive problem-solving skills in many tasks and domains. However, their ability to reason with complex images in academic domains has not been systematically investigated. To bridge this gap, we present SCI-Reason, a dataset for complex multimodal reasoning in academic areas. SCI-Reason aims to test and improve the reasoning ability of large multimodal models using real complex images in academic domains. The dataset contains 12,066 images and 12,626 question-answer pairs extracted from PubMed, divided into training, validation, and test splits. Each question-answer pair also contains an accurate and efficient inference chain as a guide to improving the inference properties of the dataset. With SCI-Reason, we performed a comprehensive evaluation of 8 well-known models. The best-performing model, Claude-3.7-Sonnet, only achieved an accuracy of 55.19%. Error analysis shows that more than half of the model failures are due to breakdowns in multi-step inference chains rather than errors in primary visual feature extraction. This finding underscores the inherent limitations in reasoning capabilities exhibited by current multimodal models when processing complex image analysis tasks within authentic academic contexts. Experiments on open-source models show that SCI-Reason not only enhances reasoning ability but also demonstrates cross-domain generalization in VQA tasks. We also explore future applications of model inference capabilities in this domain, highlighting its potential for future research.
SCI-Reason: Enhancing Multimodal Reasoning in Academic Domains
Large Language Models (LLMs) and Large Multimodal Models (LMMs) have showcased remarkable problem-solving abilities across various tasks and domains. However, their effectiveness at reasoning with complex images in academic domains has yet to be thoroughly examined. To bridge this gap, the authors introduce SCI-Reason, a dataset designed to evaluate and enhance the reasoning capabilities of large multimodal models using real complex images in academic contexts.
The SCI-Reason dataset consists of 12,066 images and 12,626 question-answer pairs extracted from PubMed, a widely used repository of scholarly articles. The dataset is divided into training, validation, and test splits, providing a consistent basis for both training and evaluation. Notably, each question-answer pair is accompanied by a well-defined and efficient inference chain, which serves as a guide for training and verifying multi-step reasoning.
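The structure described above can be sketched with a couple of illustrative records. The field names, file names, and chain contents here are assumptions for illustration only, not the dataset's actual schema:

```python
from collections import Counter

# Hypothetical records mirroring the description above: each
# question-answer pair carries an image reference, a question, an
# answer, a split label, and a multi-step inference chain.
records = [
    {
        "split": "train",
        "image": "PMC123456_fig2.png",
        "question": "Which panel shows the treated group?",
        "answer": "Panel B",
        "inference_chain": [
            "Identify the panels labelled A and B.",
            "Locate the legend mapping labels to conditions.",
            "Match 'treated' in the legend to panel B.",
        ],
    },
    {
        "split": "test",
        "image": "PMC654321_fig1.png",
        "question": "What does the arrow in panel C indicate?",
        "answer": "A lesion",
        "inference_chain": [
            "Find panel C and the arrow within it.",
            "Read the caption describing the arrow.",
        ],
    },
]

# Tally how many examples fall into each split.
split_counts = Counter(r["split"] for r in records)
print(split_counts)  # Counter({'train': 1, 'test': 1})
```

The explicit inference chain is what distinguishes this format from plain VQA pairs: it lets a supervising signal target each intermediate reasoning step, not just the final answer.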
To probe the limitations of existing multimodal models, the authors conducted a comprehensive evaluation of eight well-known models on SCI-Reason. Strikingly, even the best-performing model, Claude-3.7-Sonnet, achieves an accuracy of only 55.19%. This suggests that current multimodal models have inherent limitations in their reasoning capabilities when faced with complex image analysis tasks in academic domains.
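The accuracy figure reported above is presumably an exact-match score over the test split. A minimal sketch of such a scorer, assuming a model callable that maps an image and question to a textual answer (the stub model below is purely illustrative, not any real LMM API):

```python
def exact_match_accuracy(model, examples):
    """Fraction of examples whose predicted answer matches exactly
    (case- and whitespace-insensitive)."""
    correct = 0
    for ex in examples:
        prediction = model(ex["image"], ex["question"])
        if prediction.strip().lower() == ex["answer"].strip().lower():
            correct += 1
    return correct / len(examples)

def stub_model(image, question):
    # Stand-in for a real multimodal model: always answers "Panel B".
    return "Panel B"

examples = [
    {"image": "fig1.png", "question": "Which panel?", "answer": "Panel B"},
    {"image": "fig2.png", "question": "Which panel?", "answer": "Panel A"},
]
print(exact_match_accuracy(stub_model, examples))  # 0.5
```

Real benchmark harnesses often add answer normalization or multiple-choice option matching on top of this; the paper's exact scoring protocol may differ.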
The error analysis also pinpoints the primary source of model failures: over half are attributable to breakdowns in multi-step inference chains rather than to errors in primary visual feature extraction. This finding highlights the pressing need to improve the reasoning capabilities of multimodal models in order to tackle complex academic reasoning tasks effectively.
While the focus of SCI-Reason is primarily on advancing multimodal reasoning within academic domains, the experiments also shed light on cross-domain generalization. Open-source models trained on SCI-Reason not only improve their reasoning within academic contexts but also perform well on Visual Question Answering (VQA) tasks in other domains.
The implications of these findings go beyond the realm of academic research. As multimedia information systems continue to evolve, incorporating animations, augmented reality, and virtual reality, the ability to reason with complex images becomes increasingly crucial. SCI-Reason serves as a stepping stone towards unlocking the full potential of large multimodal models in these advanced multimedia systems.
Looking towards the future, the dataset opens up promising avenues for further research. In AI-assisted academic work, the inference capabilities of multimodal models could be leveraged to enhance knowledge synthesis and literature review, and even to automate aspects of academic research. As multimodal models advance, they may also find applications in diverse fields such as medical diagnostics, image recognition, and content generation.
SCI-Reason represents a significant contribution to the field of multimodal reasoning. By highlighting the limitations, exploring cross-domain generalization, and envisioning the potential applications, this dataset encourages researchers to tackle the challenges of complex image analysis within academic domains and beyond.