Multi-modal large language models (MLLMs) have achieved remarkable progress
and demonstrated powerful knowledge comprehension and reasoning abilities.
However, the mastery of domain-specific knowledge, which is essential for
evaluating the intelligence of MLLMs, continues to be a challenge. Current
multi-modal benchmarks for domain-specific knowledge concentrate on
multiple-choice questions and are predominantly available in English, which
imposes limitations on the comprehensiveness of the evaluation. To this end, we
introduce CMMU, a novel benchmark for multi-modal and multi-type question
understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7
subjects, covering knowledge from primary to high school. The questions can be
categorized into 3 types: multiple-choice, multiple-response, and
fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we
propose a rigorous evaluation strategy called ShiftCheck for assessing
multiple-choice questions. The strategy aims to reduce position bias, minimize
the influence of randomness on correctness, and perform a quantitative analysis
of position bias. We evaluate seven open-source MLLMs along with GPT-4V,
Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a
significant challenge to recent MLLMs.

Multi-modal large language models (MLLMs) have made significant advances and now show impressive comprehension and reasoning abilities across text and images. Evaluating their intelligence and domain-specific knowledge, however, remains a challenge. To address this, the authors of the article introduce a new benchmark called CMMU, which focuses on multi-modal and multi-type question understanding and reasoning in Chinese.

CMMU consists of 3,603 questions across 7 subjects, covering knowledge from primary through high school. The questions come in three types: multiple-choice, multiple-response, and fill-in-the-blank, a mix that poses a greater challenge to MLLMs than multiple-choice-only benchmarks. The benchmark thus serves as a platform for evaluating MLLMs in Chinese, enabling a more comprehensive assessment of their domain-specific knowledge.
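To make the question formats concrete, here is a minimal sketch of how a CMMU-style question record and its per-type scoring could be represented in Python. The field names and the scoring rules are illustrative assumptions for exposition, not the benchmark's actual schema or metric.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Question:
    """Hypothetical CMMU-style record; field names are illustrative, not the real schema."""
    subject: str                  # e.g. "physics", "history"
    grade_band: str               # "primary", "middle school", or "high school"
    qtype: str                    # "multiple-choice", "multiple-response", or "fill-in-the-blank"
    stem: str                     # question text (Chinese)
    image_path: Optional[str]     # path to the associated figure, if any
    options: List[str] = field(default_factory=list)  # empty for fill-in-the-blank
    answers: List[str] = field(default_factory=list)  # one or more gold answers

def score(q: Question, prediction: List[str]) -> float:
    """Toy scoring: fill-in-the-blank compares blanks in order; choice questions
    use an all-or-nothing set match (no partial credit for multiple-response)."""
    gold = [a.strip() for a in q.answers]
    pred = [p.strip() for p in prediction]
    if q.qtype == "fill-in-the-blank":
        return 1.0 if pred == gold else 0.0
    return 1.0 if set(pred) == set(gold) else 0.0
```

Accuracy over the 3,603 questions would then simply be the mean of these per-question scores, optionally broken down by subject, grade band, or question type.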

In addition to introducing the CMMU benchmark, the article proposes a rigorous evaluation strategy called ShiftCheck for assessing multiple-choice questions. The strategy aims to minimize position bias, reduce the impact of randomness on correctness, and provide a quantitative analysis of position bias. By applying ShiftCheck, the authors aim to make the assessment of MLLMs' multiple-choice performance fairer and more reliable.
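The exact ShiftCheck procedure is specified in the paper; the sketch below is only one plausible reading of the core idea, namely to circularly shift the answer options so the correct answer occupies different positions, re-query the model for each shift, and count the question as correct only if every shifted variant is answered correctly. The function signature, the ask_model callback, and the returned statistics are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional

def shift_check(stem: str,
                options: List[str],
                gold_index: int,
                ask_model: Callable[[str, List[str]], int],
                num_shifts: Optional[int] = None) -> Dict[str, object]:
    """Illustrative ShiftCheck-style check (not necessarily the paper's exact procedure):
    rotate the options, ask the model once per rotation, and require correctness
    under every rotation. Picked positions are recorded for position-bias analysis."""
    n = len(options)
    shifts = num_shifts or n
    picked_positions: List[int] = []
    correct_under_all_shifts = True
    for s in range(shifts):
        shifted_options = options[s:] + options[:s]   # circular shift of the option list
        shifted_gold = (gold_index - s) % n           # where the gold answer lands after the shift
        choice = ask_model(stem, shifted_options)     # model returns the index it picks
        picked_positions.append(choice)
        correct_under_all_shifts = correct_under_all_shifts and (choice == shifted_gold)
    return {"correct": correct_under_all_shifts, "picked_positions": picked_positions}
```

Aggregating picked_positions over many questions gives a simple quantitative view of position bias, for example whether a model keeps choosing the first option regardless of where the shift places the correct answer.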

The evaluation of seven open-source MLLMs, together with GPT-4V, Gemini-Pro, and Qwen-VL-Plus, shows that CMMU indeed poses a significant challenge to these state-of-the-art models. The findings highlight the need for further improvements in MLLMs' domain-specific knowledge and in their ability to handle multi-modal, multi-type questions.

This article has important implications for the wider field of multimedia information systems and related technologies such as animation, artificial reality, augmented reality, and virtual reality. As MLLMs continue to advance and demonstrate powerful knowledge comprehension and reasoning abilities, they are expected to play a growing role in multimedia applications. Their ability to understand and reason over different types of information, including visual and textual data, is particularly relevant in the context of multimedia systems.

Moreover, the introduction of CMMU as a Chinese-language benchmark expands the scope of assessment beyond English, which has dominated existing benchmarks. This highlights the importance of considering different languages and cultures when evaluating MLLMs, and underscores the multi-disciplinary nature of these models, which must handle varied linguistic and cultural contexts to reach proficient understanding and reasoning.

By addressing the limitations in evaluating MLLMs' domain-specific knowledge and extending evaluation to languages beyond English, the article contributes to advancing natural language processing and its intersection with multimedia information systems. It encourages researchers and practitioners to pursue more comprehensive evaluations and to tackle the challenges posed by multi-modal, multi-type questions in different languages, thereby improving MLLMs' ability to understand and reason across diverse domains.