arXiv:2505.06685v1 Announce Type: new
Abstract: Emotion understanding in videos aims to accurately recognize and interpret individuals’ emotional states by integrating contextual, visual, textual, and auditory cues. While Large Multimodal Models (LMMs) have demonstrated significant progress in general vision-language (VL) tasks, their performance in emotion-specific scenarios remains limited. Moreover, fine-tuning LMMs on emotion-related tasks often leads to catastrophic forgetting, hindering their ability to generalize across diverse tasks. To address these challenges, we present Emotion-Qwen, a tailored multimodal framework designed to enhance both emotion understanding and general VL reasoning. Emotion-Qwen incorporates a sophisticated Hybrid Compressor based on the Mixture of Experts (MoE) paradigm, which dynamically routes inputs to balance emotion-specific and general-purpose processing. The model is pre-trained in a three-stage pipeline on large-scale general and emotional image datasets to support robust multimodal representations. Furthermore, we construct the Video Emotion Reasoning (VER) dataset, comprising more than 40K bilingual video clips with fine-grained descriptive annotations, to further enrich Emotion-Qwen’s emotional reasoning capability. Experimental results demonstrate that Emotion-Qwen achieves state-of-the-art performance on multiple emotion recognition benchmarks, while maintaining competitive results on general VL tasks. Code and models are available at https://anonymous.4open.science/r/Emotion-Qwen-Anonymous.
Expert Commentary:
Emotion understanding in videos is a complex task that requires integrating contextual, visual, textual, and auditory cues. Large Multimodal Models (LMMs) have shown strong results on general vision-language (VL) tasks, but their performance in emotion-specific scenarios remains limited, and fine-tuning them on emotion tasks risks catastrophic forgetting. The Emotion-Qwen framework presented in this article addresses these challenges by incorporating a Hybrid Compressor based on the Mixture of Experts (MoE) paradigm.
The MoE-based Hybrid Compressor allows Emotion-Qwen to dynamically route inputs, balancing emotion-specific processing with general-purpose reasoning. According to the authors, this routing helps mitigate the catastrophic forgetting that typically follows fine-tuning LMMs on emotion-related tasks, so the model retains its ability to generalize across diverse tasks. In addition, Emotion-Qwen is pre-trained in a three-stage pipeline on large-scale general and emotional image datasets, which strengthens its multimodal representations.
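To make the routing idea concrete, the sketch below shows how an MoE-style hybrid compressor could softly gate visual tokens between an emotion-specific expert and a general-purpose expert. This is a minimal illustration under assumed dimensions and layer choices; the class and module names (HybridCompressorSketch, emotion_expert, general_expert, router) are hypothetical and do not reflect the paper's actual implementation.

```python
import torch
import torch.nn as nn


class HybridCompressorSketch(nn.Module):
    """Illustrative MoE-style compressor: a learned gate softly routes
    visual tokens between an emotion expert and a general expert.
    Names and dimensions are assumptions, not the paper's implementation."""

    def __init__(self, dim: int = 1024, compressed_dim: int = 512):
        super().__init__()
        # Two lightweight "experts" that compress visual features.
        self.emotion_expert = nn.Sequential(nn.Linear(dim, compressed_dim), nn.GELU())
        self.general_expert = nn.Sequential(nn.Linear(dim, compressed_dim), nn.GELU())
        # Router produces per-token weights over the two experts.
        self.router = nn.Linear(dim, 2)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, dim)
        gate = torch.softmax(self.router(visual_tokens), dim=-1)  # (B, N, 2)
        emo = self.emotion_expert(visual_tokens)                  # (B, N, compressed_dim)
        gen = self.general_expert(visual_tokens)
        # Weighted mixture balances emotion-specific and general processing.
        return gate[..., 0:1] * emo + gate[..., 1:2] * gen


if __name__ == "__main__":
    tokens = torch.randn(2, 256, 1024)            # dummy visual tokens
    mixed = HybridCompressorSketch()(tokens)
    print(mixed.shape)                            # torch.Size([2, 256, 512])
```

A soft, per-token mixture like this keeps both experts active during fine-tuning, which is one plausible way such a design could preserve general-purpose capability while specializing for emotion cues.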
One notable contribution of this work is the Video Emotion Reasoning (VER) dataset, which comprises more than 40K bilingual video clips with fine-grained descriptive annotations. The dataset further enriches Emotion-Qwen's emotional reasoning capability, and the authors report state-of-the-art results on multiple emotion recognition benchmarks while maintaining competitive performance on general VL tasks.
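The abstract does not specify VER's annotation schema, but fine-grained, bilingual reasoning datasets of this kind are commonly distributed as one record per clip. The snippet below sketches a hypothetical record layout and a simple JSON Lines loader; all field names (clip_id, video_path, emotion_label, reasoning) are illustrative assumptions rather than the actual VER format.

```python
import json

# Hypothetical VER-style record; the real schema is not specified in the
# abstract, so every field name here is an illustrative assumption.
example_record = {
    "clip_id": "ver_000123",
    "video_path": "clips/ver_000123.mp4",
    "language": "zh",                      # bilingual corpus, e.g. "zh" or "en"
    "emotion_label": "surprise",
    "reasoning": "The speaker's eyes widen and their pitch rises sharply "
                 "as the unexpected news is delivered.",
}


def load_ver_annotations(path: str) -> list[dict]:
    """Load a JSON Lines file of VER-style annotations (one record per line)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    print(json.dumps(example_record, ensure_ascii=False, indent=2))
```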
From a multidisciplinary perspective, the Emotion-Qwen framework integrates concepts from computer vision, natural language processing, and machine learning to enable robust emotion understanding in videos. The model’s success in both emotion-specific and general VL tasks showcases the potential of multimodal approaches in the field of multimedia information systems.
Overall, the Emotion-Qwen framework represents a notable advance in video emotion understanding and illustrates the value of multidisciplinary approaches in developing sophisticated AI models for complex tasks.
For more information and access to the code and models for Emotion-Qwen, visit the project’s page at https://anonymous.4open.science/r/Emotion-Qwen-Anonymous.