Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent
machines. Previous efforts in this area are dominated by the supervised
learning paradigm. Despite significant progress, supervised learning is meeting
its bottleneck due to the longstanding data scarcity issue in AVER. Motivated
by recent advances in self-supervised learning, we propose Hierarchical
Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that
leverages large-scale self-supervised pre-training on vast unlabeled
audio-visual data to promote the advancement of AVER. Following prior art in
self-supervised audio-visual representation learning, HiCMAE adopts two primary
forms of self-supervision for pre-training, namely masked data modeling and
contrastive learning. Unlike these methods, which focus exclusively on top-layer
representations and neglect explicit guidance of intermediate layers,
HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual
feature learning and improve the overall quality of learned representations. To
verify the effectiveness of HiCMAE, we conduct extensive experiments on 9
datasets covering both categorical and dimensional AVER tasks. Experimental
results show that our method significantly outperforms state-of-the-art
supervised and self-supervised audio-visual methods, which indicates that
HiCMAE is a powerful audio-visual emotion representation learner. Code and
models will be publicly available at https://github.com/sunlicai/HiCMAE.

Audio-Visual Emotion Recognition (AVER) is an important area in the fields of multimedia information systems, animation, artificial reality, augmented reality, and virtual reality. It involves developing intelligent machines that can recognize and understand human emotions from audio-visual data. The article highlights the limitations of the traditional supervised learning paradigm and proposes a novel self-supervised framework, the Hierarchical Contrastive Masked Autoencoder (HiCMAE), to address the data scarcity issue in AVER.

The HiCMAE framework leverages large-scale self-supervised pre-training on unlabeled audio-visual data to enhance the performance of AVER systems. It adopts two primary forms of self-supervision: masked data modeling and contrastive learning. These techniques help in learning high-quality representations of audio-visual features.
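The two pre-training signals can be illustrated with a minimal sketch: masked data modeling hides a large fraction of patch tokens so the model must reconstruct them, while contrastive learning pulls paired audio and visual clip embeddings together with an InfoNCE-style loss. The function names, shapes, and the 0.75 mask ratio below are illustrative assumptions, not HiCMAE's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(tokens, mask_ratio=0.75):
    """Masked data modeling sketch: hide a random fraction of patch
    tokens; the encoder sees only the visible ones and the decoder
    must reconstruct the hidden ones."""
    n = tokens.shape[0]
    n_mask = int(n * mask_ratio)
    idx = rng.permutation(n)
    mask = np.zeros(n, dtype=bool)
    mask[idx[:n_mask]] = True          # True = hidden token
    return tokens[~mask], mask

def info_nce(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss: each audio embedding's
    positive is the visual embedding from the same clip (the diagonal);
    all other pairs in the batch act as negatives."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature     # pairwise cosine similarities
    log_p_av = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_va = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -0.5 * (np.diag(log_p_av).mean() + np.diag(log_p_va).mean())

tokens = rng.standard_normal((16, 8))  # 16 patch tokens, embedding dim 8
visible, mask = random_mask(tokens)    # 4 visible, 12 masked

a = rng.standard_normal((4, 8))        # batch of 4 paired clips
v = rng.standard_normal((4, 8))
loss = info_nce(a, v)
```

In the actual framework both losses are computed on transformer features and optimized jointly; the sketch only shows the shape of each objective.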

What sets HiCMAE apart from previous approaches is its emphasis on hierarchical audio-visual feature learning. While previous methods supervise only the top-layer representations, HiCMAE adds explicit guidance for intermediate layers through a three-pronged strategy, improving the overall quality of the learned representations.
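The intermediate-layer guidance can be sketched as a loss applied at every encoder layer rather than only at the top. The weighting scheme and the use of a plain reconstruction loss below are illustrative assumptions; the paper's actual three-pronged strategy is more elaborate.

```python
import numpy as np

def layerwise_loss(layer_features, layer_targets, weights=None):
    """Hierarchical guidance sketch: compute a loss at every
    intermediate layer (here, mean-squared error against a
    per-layer target) and combine them with per-layer weights,
    instead of supervising the top layer alone."""
    n_layers = len(layer_features)
    if weights is None:
        weights = [1.0 / n_layers] * n_layers  # uniform weighting (assumed)
    per_layer = [np.mean((f - t) ** 2)
                 for f, t in zip(layer_features, layer_targets)]
    total = sum(w * l for w, l in zip(weights, per_layer))
    return total, per_layer

rng = np.random.default_rng(1)
feats = [rng.standard_normal((4, 8)) for _ in range(3)]  # 3 encoder layers
targs = [rng.standard_normal((4, 8)) for _ in range(3)]
total, per_layer = layerwise_loss(feats, targs)
```

Because every layer receives a gradient signal directly, lower layers are no longer trained only through backpropagation from the top, which is the intuition behind hierarchical feature learning.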

To validate the effectiveness of HiCMAE, extensive experiments are conducted on 9 datasets covering both categorical and dimensional AVER tasks. The experimental results demonstrate that HiCMAE outperforms state-of-the-art supervised and self-supervised audio-visual methods. This indicates that HiCMAE is a powerful audio-visual emotion representation learner, capable of improving the performance of AVER systems.

The multi-disciplinary nature of this content is evident in its connections to various fields. In multimedia information systems, HiCMAE contributes to the development of intelligent machines that can process and interpret audio-visual data in relation to human emotions. In animation, artificial reality, augmented reality, and virtual reality, HiCMAE can enable more realistic and immersive experiences by incorporating emotion recognition capabilities into virtual environments.

Overall, this article introduces a promising framework, HiCMAE, for enhancing Audio-Visual Emotion Recognition. Its self-supervised learning approach and hierarchical feature learning strategy address the limitations imposed by data scarcity. The experimental results indicate its superiority over existing methods and highlight its potential for applications in multimedia information systems, animation, artificial reality, augmented reality, and virtual reality.
