arXiv:2504.16405v1
Abstract: The rapid development of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities.
Among these, understanding image-evoked emotions aims to enhance MLLMs’ empathy, with significant applications in areas such as human-machine interaction and advertising recommendation. However, current evaluations of this capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking.
To this end, we introduce EEmo-Bench, a novel benchmark dedicated to analyzing the emotions evoked by images across diverse content categories.
Our core contributions include:
1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ the Valence-Arousal-Dominance (VAD) dimensions as emotional attributes for assessment. Following this methodology, 1,960 images are collected and manually annotated.
2) We design four tasks to evaluate MLLMs’ ability to capture the emotions evoked by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the models’ proficiency in joint and comparative analysis.
In total, we collect 6,773 question-answer pairs and conduct a thorough assessment of 19 commonly used MLLMs.
The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, their analytical capabilities along certain evaluation dimensions remain suboptimal.
Our EEmo-Bench paves the way for further research aimed at enhancing the comprehensive perception and understanding capabilities of MLLMs concerning image-evoked emotions, which is crucial for machine-centric emotion perception and understanding.
Enhancing Multi-Modal Large Language Models (MLLMs) with Image-Evoked Emotions
This article introduces the concept of image-evoked emotions and its relevance to enhancing the empathy of multi-modal large language models (MLLMs). MLLMs have gained significant attention in various domains, including human-machine interaction and advertising recommendation. However, the evaluation of MLLMs’ understanding of image-evoked emotions remains limited and lacks a systematic, comprehensive assessment.
The Importance of Emotion in MLLMs
Emotion plays a crucial role in human communication and understanding, and the ability to perceive and understand emotions is highly desirable in MLLMs. By incorporating image-evoked emotions into MLLMs, these models can better empathize with users and provide more tailored responses and recommendations.
The EEmo-Bench Benchmark
To address these limitations, the authors introduce EEmo-Bench, a novel benchmark specifically designed to evaluate MLLMs’ understanding of image-evoked emotions. EEmo-Bench focuses on analyzing the emotions evoked by images across diverse content categories.
The benchmark includes the following core contributions:
- Diversity of evoked emotions: To assess emotional attributes, the authors adopt an emotion ranking strategy and utilize the Valence-Arousal-Dominance (VAD) model. A dataset of 1,960 images is collected and manually annotated accordingly (one possible shape of such an annotation record is sketched after this list).
- Four evaluation tasks: Four tasks are designed to evaluate MLLMs’ ability to capture evoked emotions and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, an image-pairwise setting is introduced to probe joint and comparative analysis of image pairs.
- Thorough assessment of MLLMs: A comprehensive evaluation of 19 commonly used MLLMs is conducted over the 6,773 collected question-answer pairs. The results highlight how different models perform across the individual evaluation dimensions.
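To make the setup more concrete, the following minimal Python sketch shows how a VAD-plus-ranking annotation record and a Perception-style multiple-choice question-answer pair might be represented. The field names, rating scales, emotion labels, and prompt wording are illustrative assumptions, not the benchmark’s released schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical annotation record; field names and scales are assumptions
# chosen for illustration, not the exact EEmo-Bench format.
@dataclass
class ImageAnnotation:
    image_id: str
    ranked_emotions: List[str]  # emotions ordered from most to least strongly evoked
    valence: float              # pleasantness of the evoked feeling (e.g., on a 1-9 scale)
    arousal: float              # intensity / activation of the evoked feeling
    dominance: float            # sense of control conveyed by the scene

def build_perception_question(ann: ImageAnnotation, distractors: List[str]) -> dict:
    """Turn one annotation into a Perception-style multiple-choice QA pair.

    This mirrors the general idea of asking a model for the dominant evoked
    emotion; the benchmark's actual prompt wording may differ.
    """
    options = [ann.ranked_emotions[0]] + distractors[:3]
    return {
        "image_id": ann.image_id,
        "question": "Which emotion is most strongly evoked by this image?",
        "options": sorted(options),
        "answer": ann.ranked_emotions[0],
    }

if __name__ == "__main__":
    ann = ImageAnnotation(
        image_id="example_0001",
        ranked_emotions=["awe", "contentment", "excitement"],
        valence=7.5, arousal=5.0, dominance=6.0,
    )
    print(build_perception_question(ann, distractors=["fear", "disgust", "sadness"]))
```

Under the same assumed schema, a Ranking item could ask the model to order the annotated emotions, and an Assessment item could ask it to rate valence, arousal, or dominance against the annotated scores.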
Insights and Future Directions
The results of the EEmo-Bench benchmark reveal that while some proprietary and large-scale open-source MLLMs show promising overall performance, there are still areas in which these models’ analytical capabilities can be improved. This highlights the need for further research and innovation to enhance MLLMs’ comprehension and perception of image-evoked emotions.
The concepts discussed in this article are highly relevant to the wider field of multimedia information systems, as they bridge the gap between textual data and visual content analysis. Incorporating image-evoked emotions into MLLMs opens up new avenues for research in areas such as virtual and augmented reality.
The multi-disciplinary nature of the concepts presented here underscores the importance of collaboration between researchers from fields such as computer vision, natural language processing, and psychology. By combining expertise from these diverse domains, we can develop more sophisticated MLLMs that truly understand and respond to the emotions evoked by visual stimuli.
In conclusion, the EEmo-Bench benchmark serves as a stepping stone for future research in enhancing the comprehension and perception capabilities of MLLMs in the context of image-evoked emotions. This research has significant implications for machine-centric emotion perception and understanding, with applications ranging from personalized user experiences to improved advertising recommendations.