The rapid evolution of Multi-modality Large Language Models (MLLMs) has
catalyzed a shift in computer vision from specialized models to general-purpose
foundation models. Nevertheless, the abilities of MLLMs in low-level visual perception and understanding remain inadequately assessed. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate the potential abilities of MLLMs in three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the
LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped
with a human-asked question focusing on its low-level attributes. We then
measure the correctness of MLLMs in answering these questions. b) To examine
the description ability of MLLMs on low-level information, we propose the
LLDescribe dataset, consisting of long expert-labelled golden low-level text descriptions for 499 images, and a GPT-involved comparison pipeline between
outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we
further measure their visual quality assessment ability, i.e., how well their predicted quality scores align with human opinion scores. Specifically, we design a softmax-based strategy that enables
MLLMs to predict quantifiable quality scores, and evaluate them on various
existing image quality assessment (IQA) datasets. Our evaluation across the
three abilities confirms that MLLMs possess preliminary low-level visual
skills. However, these skills are still unstable and relatively imprecise,
indicating the need for specific enhancements to MLLMs in these abilities.
We hope that our benchmark can encourage the research community to delve deeper
to discover and enhance these untapped potentials of MLLMs. Project Page:
https://q-future.github.io/Q-Bench.

Multimodal Large Language Models: Assessing Low-Level Visual Skills

The field of computer vision has experienced a shift from specialized models to more general-purpose foundation models, thanks to the rapid evolution of Multi-modality Large Language Models (MLLMs). These models have shown great potential across a variety of tasks, but their ability to perceive and understand low-level visual information has not been systematically assessed. To address this gap, a team of researchers presents Q-Bench, a benchmark designed to systematically evaluate the potential abilities of MLLMs in three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.

Assessing Low-Level Visual Perception

To evaluate the low-level perception ability of MLLMs, the researchers have constructed the LLVisionQA dataset. This dataset consists of 2,990 images from diverse sources, each accompanied by a human-asked question focusing on its low-level attributes. The MLLMs are then evaluated based on their correctness in answering these questions. This task provides insights into how well MLLMs understand and perceive low-level visual characteristics.
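
To make the protocol concrete, below is a minimal sketch of how such a correctness check could be scored. The annotation field names and the `query_mllm` function are hypothetical placeholders standing in for an actual MLLM call; the official Q-Bench pipeline may match free-form answers more carefully (for instance with a GPT judge) rather than by exact string comparison.

```python
import json
import random


def query_mllm(image_path: str, question: str, candidates: list[str]) -> str:
    """Placeholder for a real MLLM call (e.g. a LLaVA- or InstructBLIP-style wrapper).

    It guesses at random here so that the sketch runs end to end.
    """
    return random.choice(candidates)


def evaluate_llvisionqa(annotation_file: str) -> float:
    """Compute answer accuracy on an LLVisionQA-style annotation file.

    Assumed (hypothetical) record format:
        {"image": "path.jpg", "question": "...",
         "candidates": ["A", "B", ...], "correct_answer": "A"}
    """
    with open(annotation_file) as f:
        records = json.load(f)

    correct = 0
    for rec in records:
        answer = query_mllm(rec["image"], rec["question"], rec["candidates"])
        # A stricter protocol could let a GPT judge decide whether a free-form
        # answer matches the correct option; exact matching keeps the sketch simple.
        correct += int(answer.strip().lower() == rec["correct_answer"].strip().lower())

    return correct / len(records)
```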

Evaluating Low-Level Visual Description

In addition to perception, the researchers assess the ability of MLLMs to describe low-level information. The LLDescribe dataset is introduced, containing expert-labelled golden low-level text descriptions for 499 images. A GPT-involved comparison pipeline is then used to compare the outputs of MLLMs with these expert descriptions, examining how accurately MLLMs can describe low-level visual information.
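
The sketch below illustrates this GPT-as-judge comparison pattern under stated assumptions: the prompt wording, the 0-2 scale, and the completeness/preciseness/relevance criteria are illustrative choices rather than the paper's exact rubric, and the `openai` client usage assumes an OPENAI_API_KEY is configured in the environment.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are comparing a model-generated description of an image
against an expert-written "golden" low-level description.

Golden description:
{golden}

Model description:
{predicted}

Rate the model description from 0 (poor) to 2 (good) on each criterion:
completeness, preciseness, relevance.
Answer with three integers separated by spaces."""


def judge_description(golden: str, predicted: str, model: str = "gpt-4") -> list[int]:
    """Ask a GPT judge to score an MLLM description against the golden text.

    The criteria and 0-2 scale here are illustrative assumptions, not the
    official Q-Bench rubric; only the GPT-involved comparison pattern is
    taken from the benchmark description.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(golden=golden, predicted=predicted)}],
        temperature=0,
    )
    # Expects a reply such as "2 1 2"; real code would validate this output.
    return [int(tok) for tok in response.choices[0].message.content.split()[:3]]
```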

Measuring Visual Quality Assessment

Beyond perception and description, Q-Bench also measures the visual quality assessment ability of MLLMs. A softmax-based strategy is designed to let MLLMs predict quantifiable quality scores. The researchers evaluate the MLLMs on various existing image quality assessment (IQA) datasets, comparing the predicted scores against human opinion scores. This assessment shows how well MLLMs can judge the visual quality of images.
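
One plausible instantiation of such a softmax-based scoring strategy, consistent with the description above, is to read the model's next-token logits after a quality-related prompt and take the softmax probability of a positive word (e.g. "good") against a negative word (e.g. "poor") as the score; the exact prompt and word pair used by Q-Bench may differ. Below is a minimal PyTorch sketch, together with the Spearman rank correlation (SRCC) that IQA datasets typically use to measure agreement with human mean opinion scores (MOS).

```python
import torch
from scipy.stats import spearmanr


def softmax_quality_score(logits: torch.Tensor, good_id: int, poor_id: int) -> float:
    """Map next-token logits to a scalar quality score in [0, 1].

    `logits` is the MLLM's logit vector at the position following a
    quality-related prompt (the exact prompt is an assumption here).
    The score is the softmax probability of the positive word against the
    negative word, so higher means better predicted quality.
    """
    pair = torch.stack([logits[good_id], logits[poor_id]])
    return torch.softmax(pair, dim=0)[0].item()


def srcc(predicted: list[float], mos: list[float]) -> float:
    """Spearman rank correlation between predicted scores and human MOS."""
    rho, _ = spearmanr(predicted, mos)
    return rho
```

In practice, `softmax_quality_score` would be called once per image, and the resulting list of scores would be correlated against the dataset's MOS values with `srcc`.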

The evaluation across these three abilities confirms that MLLMs possess preliminary low-level visual skills. However, it also reveals that these skills are still relatively unstable and imprecise, indicating the need for specific enhancements. The multi-disciplinary nature of the benchmark highlights the intersection of computer vision, natural language processing, and artificial intelligence.

The findings and insights gained from Q-Bench open up avenues for future research and enhancements in MLLMs. The benchmark serves as a call to action for the research community to delve deeper into uncovering and improving the untapped potential of MLLMs in perceiving, describing, and assessing low-level visual information. By focusing on these aspects, we can push the boundaries of multimedia information systems and immersive technologies such as augmented and virtual reality, leading to more advanced and effective applications across a range of domains.

More information about the Q-Bench benchmark can be found on the project page: https://q-future.github.io/Q-Bench.
