Multi-modal multi-label emotion recognition (MMER) aims to identify relevant
emotions from multiple modalities. The challenge of MMER is how to effectively
capture discriminative features for multiple labels from heterogeneous data.
Recent studies are mainly devoted to exploring various fusion strategies to
integrate multi-modal information into a unified representation for all labels.
However, such a learning scheme not only overlooks the specificity of each
modality but also fails to capture individual discriminative features for
different labels. Moreover, dependencies of labels and modalities cannot be
effectively modeled. To address these issues, this paper presents ContrAstive
feature Reconstruction and AggregaTion (CARAT) for the MMER task. Specifically,
we devise a reconstruction-based fusion mechanism to better model fine-grained
modality-to-label dependencies by contrastively learning modal-separated and
label-specific features. To further exploit the modality complementarity, we
introduce a shuffle-based aggregation strategy to enrich co-occurrence
collaboration among labels. Experiments on two benchmark datasets, CMU-MOSEI and
M3ED, demonstrate the effectiveness of CARAT over state-of-the-art methods. Code
is available at https://github.com/chengzju/CARAT.
Multi-modal multi-label emotion recognition (MMER) is a challenging task that aims to identify all relevant emotions from multiple modalities. Rather than relying on a single modality, such as text or audio, MMER draws on several, such as text, audio, and video, to build a more complete picture of the emotions being expressed.
The challenge in MMER lies in capturing discriminative features for multiple labels from heterogeneous data, that is, data of different types such as text, audio, and video, each with its own characteristics. The goal is to combine these modalities so that the resulting features are discriminative for each emotion label.
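To make the "multi-label" part concrete, the following minimal sketch shows how a multi-label decision differs from picking a single best class: each label gets its own independent probability, and every label above a threshold is kept. The label names, logits, and the 0.5 threshold are illustrative assumptions, not values from the paper.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, label_names, threshold=0.5):
    """Keep every label whose independent sigmoid probability meets the threshold.

    Unlike a softmax/argmax decision, zero, one, or several labels may be returned.
    """
    return [name for name, z in zip(label_names, logits) if sigmoid(z) >= threshold]

# Hypothetical label set and scores, chosen only for illustration.
labels = ["happy", "sad", "angry", "surprised"]
logits = [2.1, -1.3, -0.4, 0.7]
print(predict_labels(logits, labels))  # prints ['happy', 'surprised']
```

Here two emotions are predicted for the same sample, which is why a method needs features that are discriminative for each label separately.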
Recent studies have focused on fusion strategies that integrate multi-modal information into a single unified representation shared by all labels. However, this approach overlooks the specificity of each modality and fails to capture individual discriminative features for different labels. It also fails to model the dependencies between labels and modalities effectively.
To address these issues, the authors of this paper propose ContrAstive feature Reconstruction and AggregaTion (CARAT), a new approach for MMER. CARAT uses a reconstruction-based fusion mechanism to better model fine-grained modality-to-label dependencies. By contrastively learning modal-separated and label-specific features, CARAT can capture the unique characteristics of each modality and label.
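The idea of contrastively learning label-specific features can be illustrated with a generic supervised contrastive loss: features that belong to the same label (for example, the "happy" feature from text and the "happy" feature from audio) are pulled together, and features of different labels are pushed apart. This is a minimal sketch of that general technique, not the loss used in CARAT; the function names, the temperature of 0.1, and the toy vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def supervised_contrastive_loss(features, label_ids, temperature=0.1):
    """Generic SupCon-style loss (hypothetical helper, not the paper's code).

    features:  list of vectors, e.g. one per (modality, label) pair
    label_ids: label id of each vector; vectors sharing an id are positives
    """
    n = len(features)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and label_ids[j] == label_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarity of the anchor to every other feature.
        logits = {j: cosine(features[i], features[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(z) for z in logits.values()))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Toy features: the first two vectors are similar, as are the last two.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supervised_contrastive_loss(feats, [0, 0, 1, 1]))  # labels match the geometry
print(supervised_contrastive_loss(feats, [0, 1, 0, 1]))  # labels contradict it
```

The first call yields a much smaller loss than the second, because in the first case features with the same label id are already close to each other.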
In addition, CARAT introduces a shuffle-based aggregation strategy to enrich co-occurrence collaboration among labels. The strategy is intended to exploit the complementarity of the modalities so that the model accounts for how different emotion labels relate to one another and tend to occur together.
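The text above does not spell out how the shuffle works, so the following is only one plausible reading, offered as a sketch: for each label, a label-specific feature is drawn from a randomly chosen modality, and the drawn features are concatenated into one aggregated vector. The function name, the data layout, and the random selection rule are all assumptions; the released code should be consulted for the actual procedure.

```python
import random

def shuffle_aggregate(label_features, rng):
    """Hypothetical modality-wise shuffle (not taken from the paper's code).

    label_features: {label: {modality: feature_vector}}
    For each label, pick one modality at random and concatenate the chosen
    label-specific vectors in sorted label order.
    Returns the aggregated vector and the modality chosen for each label.
    """
    aggregated, chosen = [], {}
    for label in sorted(label_features):
        modality = rng.choice(sorted(label_features[label]))
        chosen[label] = modality
        aggregated.extend(label_features[label][modality])
    return aggregated, chosen

# Toy label-specific features per modality, for illustration only.
features = {
    "happy": {"text": [0.9, 0.1], "audio": [0.7, 0.3], "video": [0.8, 0.2]},
    "sad":   {"text": [0.2, 0.8], "audio": [0.1, 0.9], "video": [0.3, 0.7]},
}
vector, picked = shuffle_aggregate(features, random.Random(0))
print(picked)
print(vector)
```

Under this reading, repeated draws expose a downstream classifier to different modality combinations for the same set of labels, which is one way modality complementarity could inform label co-occurrence.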
The effectiveness of CARAT is demonstrated through experiments on two benchmark datasets, CMU-MOSEI and M3ED. The results show that CARAT outperforms state-of-the-art methods in multi-modal multi-label emotion recognition.
In the wider field of multimedia information systems, CARAT contributes to the study of how to effectively integrate multi-modal information for emotion recognition. By considering the specificities of each modality and capturing individual discriminative features, CARAT provides a more nuanced understanding of emotions.
Furthermore, CARAT is relevant to the fields of animation, artificial reality, augmented reality, and virtual reality. These fields often involve multi-modal data, since animations and virtual environments typically combine visual and audio components. CARAT’s approach of combining different modalities could be applied to enhance emotional realism and immersion in these multimedia experiences.