arXiv:2602.05496v1 Announce Type: new
Abstract: Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
Expert Commentary: Exploring Explainable Multimodal Emotion Recognition
Explainable Multimodal Emotion Recognition (EMER) holds immense potential for applications such as human-computer interaction and social media analytics. In this study, the authors highlight the challenges current approaches face, particularly in cue-level perception and reasoning.
One of the key issues addressed in the study is the limitation of general-purpose modality encoders in capturing fine-grained emotional cues. Because these encoders are typically pretrained to model global structure and general semantics, they tend to be insensitive to subtle emotional signals. This motivates specialized modules such as the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the perception of emotional cues in multimedia content; a rough sketch of what such a bridge might look like follows.
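The abstract does not describe the internals of VECB or AECB, so the snippet below is only a hypothetical sketch of the general idea: a lightweight adapter that sits on top of a frozen modality encoder and uses learnable queries to pull out cue-level detail. All module names, dimensions, and the attention-based design here are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical sketch of an emotional cue "bridge" adapter (not the paper's VECB/AECB).
import torch
import torch.nn as nn

class EmotionalCueBridge(nn.Module):
    """Lightweight adapter placed on top of a frozen modality encoder."""

    def __init__(self, feat_dim: int, cue_dim: int = 256, num_cue_queries: int = 8):
        super().__init__()
        # Learnable queries attend to encoder features to extract cue-level detail.
        self.cue_queries = nn.Parameter(torch.randn(num_cue_queries, cue_dim))
        self.proj = nn.Linear(feat_dim, cue_dim)
        self.attn = nn.MultiheadAttention(cue_dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(cue_dim, cue_dim)

    def forward(self, encoder_feats: torch.Tensor) -> torch.Tensor:
        # encoder_feats: (batch, seq_len, feat_dim) from a frozen video/audio encoder.
        keys = self.proj(encoder_feats)
        queries = self.cue_queries.unsqueeze(0).expand(encoder_feats.size(0), -1, -1)
        cues, _ = self.attn(queries, keys, keys)   # (batch, num_cue_queries, cue_dim)
        return self.out(cues)                      # cue tokens passed on to the language model

# Example with made-up 768-dimensional frame features from a frozen video encoder.
bridge = EmotionalCueBridge(feat_dim=768)
fake_feats = torch.randn(2, 32, 768)               # 2 clips, 32 frame tokens each
cue_tokens = bridge(fake_feats)
print(cue_tokens.shape)                            # torch.Size([2, 8, 256])
```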
Furthermore, the creation of the EmoCue dataset is a significant contribution to the field, as it aims to provide better supervision for emotional cues and improve cue-level reasoning. The dataset, along with the EmoCue-360 automated metric and EmoCue-Eval benchmark, enables researchers to evaluate the performance of EMER models more effectively.
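The abstract only states that EmoCue-360 "extracts and matches emotional cues using semantic similarity", so the exact scoring procedure is unknown. As a minimal sketch under that assumption, a cue-level metric could embed predicted and reference cue phrases with a sentence encoder and compute a similarity-thresholded F1; the model name, threshold, and greedy matching scheme below are illustrative choices, not the paper's metric.

```python
# Illustrative cue matching via semantic similarity (not the actual EmoCue-360 formula).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do

def cue_match_f1(predicted_cues, reference_cues, threshold=0.6):
    """Precision/recall over cue phrases, counting a cue as matched if its best
    cosine similarity against the other side exceeds the threshold."""
    if not predicted_cues or not reference_cues:
        return 0.0
    pred_emb = model.encode(predicted_cues, convert_to_tensor=True)
    ref_emb = model.encode(reference_cues, convert_to_tensor=True)
    sim = util.cos_sim(pred_emb, ref_emb)                      # (num_pred, num_ref)
    precision = (sim.max(dim=1).values >= threshold).float().mean().item()
    recall = (sim.max(dim=0).values >= threshold).float().mean().item()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = ["furrowed brows", "trembling voice"]
ref = ["frowning expression", "shaky tone of voice", "clenched fists"]
print(cue_match_f1(pred, ref))
```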
This research not only advances the field of EMER but also connects to the broader domain of multimedia information systems. By coupling video and audio processing with language-model-based reasoning, the proposed XEmoGPT framework lays a strong foundation for future developments in multimodal emotion recognition, with potential applications in interactive settings such as augmented and virtual reality.
Overall, this study showcases the multi-disciplinary nature of EMER and emphasizes the importance of developing explainable models that can accurately perceive and reason over emotional cues in multimedia content.