arXiv:2403.20194v1 Announce Type: new
Abstract: This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on a distinct capability, mirroring the cognitive progression from basic perception to logical reasoning and ultimately to advanced creativity. ConvBench comprises 577 meticulously curated multi-turn conversations encompassing 215 tasks reflective of real-world demands. Automatic evaluations quantify response performance at each turn and overall conversation level. Leveraging the capability hierarchy, ConvBench enables precise attribution of conversation mistakes to specific levels. Experimental results reveal a performance gap between multi-modal models, including GPT4-V, and human performance in multi-turn conversations. Additionally, weak fine-grained perception in multi-modal models contributes to reasoning and creation failures. ConvBench serves as a catalyst for further research aimed at enhancing visual dialogues.

ConvBench: A Multi-Turn Conversation Evaluation Benchmark for Large Vision-Language Models

In the field of multimedia information systems, the development of Large Vision-Language Models (LVLMs) has gained significant attention. These models are designed to understand and generate text while also incorporating visual information. ConvBench, a novel benchmark presented in this paper, focuses on evaluating the performance of LVLMs in multi-turn conversations.

Unlike existing benchmarks, which assess individual capabilities in single-turn dialogues, ConvBench is organized around a three-level multimodal capability hierarchy that mimics human cognitive processes: perception, reasoning, and creativity. Each level targets a distinct capability, progressing from basic perception to logical reasoning and finally to creativity, which allows a more structured assessment of LVLM performance.

ConvBench comprises 577 carefully curated multi-turn conversations covering 215 tasks that reflect real-world demands. Responses are evaluated automatically at each turn and at the overall conversation level. Combined with the capability hierarchy, these turn-level scores allow conversation mistakes to be attributed to specific levels, facilitating a deeper understanding of model performance.
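The attribution idea described above can be illustrated with a minimal sketch. It is not the paper's evaluation pipeline: the one-turn-per-level mapping, the 0-10 score scale, the mean used for the conversation-level score, the failure threshold, and the names `ConversationScores` and `attribute_failure` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# The three levels named in the abstract, ordered from lowest to highest.
LEVELS = ("perception", "reasoning", "creativity")


@dataclass
class ConversationScores:
    """Hypothetical per-turn scores for one conversation (0-10 scale assumed)."""
    perception: float
    reasoning: float
    creativity: float

    def overall(self) -> float:
        # Assumption: the conversation-level score is the plain mean of the turns.
        return (self.perception + self.reasoning + self.creativity) / 3


def attribute_failure(scores: ConversationScores,
                      threshold: float = 5.0) -> Optional[str]:
    """Return the lowest level scoring below `threshold`, or None if none does.

    Assumption: a failure at a lower level is treated as the likely cause of
    failures at the levels stacked on top of it.
    """
    for level in LEVELS:
        if getattr(scores, level) < threshold:
            return level
    return None


example = ConversationScores(perception=3.0, reasoning=4.0, creativity=8.0)
print(attribute_failure(example))  # perception
print(example.overall())           # 5.0
```

In this sketch a conversation with weak perception and weak reasoning is attributed to perception, since it is the lowest failing level in the hierarchy. The benchmark's actual scoring and attribution rules may differ.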

The experiments conducted with ConvBench reveal a performance gap between multi-modal models, including GPT4-V, and humans in multi-turn conversations. They also indicate that weak fine-grained perception contributes to failures in reasoning and creation, so there is still room for improvement in LVLMs, particularly in perception.

The concepts presented in ConvBench may have implications for the wider field of multimedia information systems. Because LVLMs combine visual and textual information, they could benefit applications such as animations, artificial reality, augmented reality, and virtual reality, all of which rely on the integration of visuals and language. A benchmark such as ConvBench offers one way to measure how well LVLMs handle the multi-turn visual dialogue that such applications may require.

Furthermore, the multi-level design of ConvBench, which stacks perception, reasoning, and creativity, reflects the cognitive processes involved in human conversation. By studying and strengthening these capabilities in LVLMs, researchers can work toward models that come closer to human-level performance in multi-turn conversations.


In summary, ConvBench is a multi-turn conversation evaluation benchmark for Large Vision-Language Models. Its three-level capability hierarchy and curated conversations support evaluation at each turn and attribution of errors to specific levels. The reported experiments reveal a gap between multi-modal models and human performance, pointing to fine-grained perception as an area for improvement. Progress on these capabilities could support more engaging and meaningful interactions between humans and machines.
