Video-based facial affect analysis has recently attracted increasing
attention owing to its critical role in human-computer interaction. Previous
studies mainly focus on developing various deep learning architectures and
training them in a fully supervised manner. Although significant progress has
been achieved by these supervised methods, the longstanding lack of large-scale
high-quality labeled data severely hinders their further improvements.
Motivated by the recent success of self-supervised learning in computer vision,
this paper introduces a self-supervised approach, termed Self-supervised Video
Facial Affect Perceiver (SVFAP), to address the dilemma faced by supervised
methods. Specifically, SVFAP leverages masked facial video autoencoding to
perform self-supervised pre-training on massive unlabeled facial videos.
Considering that large spatiotemporal redundancy exists in facial videos, we
propose a novel temporal pyramid and spatial bottleneck Transformer as the
encoder of SVFAP, which not only enjoys low computational cost but also
achieves excellent performance. To verify the effectiveness of our method, we
conduct experiments on nine datasets spanning three downstream tasks, including
dynamic facial expression recognition, dimensional emotion recognition, and
personality recognition. Comprehensive results demonstrate that SVFAP can learn
powerful affect-related representations via large-scale self-supervised
pre-training and it significantly outperforms previous state-of-the-art methods
on all datasets. Codes will be available at https://github.com/sunlicai/SVFAP.

Video-based facial affect analysis has become an increasingly important area of study in the field of human-computer interaction. The ability to accurately analyze facial expressions and emotions plays a critical role in developing effective and intuitive human-computer interfaces. Previous research in this area has focused on developing deep learning architectures and training them using supervised learning methods. While these approaches have led to significant advancements, they rely on large-scale labeled datasets, which are often unavailable or difficult to obtain.

In this paper, the authors propose a novel approach called the Self-supervised Video Facial Affect Perceiver (SVFAP), inspired by the success of self-supervised learning in computer vision. By leveraging self-supervised pre-training on massive amounts of unlabeled facial videos, SVFAP aims to overcome the limitations of supervised methods and improve the performance of video-based facial affect analysis.

To achieve this, SVFAP uses masked facial video autoencoding as its self-supervised pre-training task. Because facial videos contain substantial spatiotemporal redundancy, a large fraction of each video can be masked out and the model trained to reconstruct the missing content, encouraging it to capture meaningful facial structure and motion rather than memorize raw pixels. As the encoder, the authors introduce a temporal pyramid and spatial bottleneck Transformer, which delivers strong performance at low computational cost.
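To make the pre-training objective more concrete, below is a minimal PyTorch sketch of masked video autoencoding in the MAE style. It is an illustration under stated assumptions, not the paper's implementation: the plain Transformer encoder and decoder, the 90% masking ratio, the tube patch size, and all tensor shapes are placeholders, and the actual SVFAP encoder uses the temporal pyramid and spatial bottleneck design rather than the generic blocks shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedVideoAutoencoder(nn.Module):
    """Minimal MAE-style sketch for video patches (illustrative, not SVFAP itself).

    A video is split into spatiotemporal "tube" patches, most patches are masked,
    only the visible ones are encoded, and a light decoder reconstructs the
    pixels of the masked patches. Positional embeddings are omitted for brevity.
    """

    def __init__(self, patch_dim=1536, embed_dim=384, depth=4, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, embed_dim)
        enc_layer = nn.TransformerEncoderLayer(embed_dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        dec_layer = nn.TransformerEncoderLayer(embed_dim, nhead=6, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
        self.head = nn.Linear(embed_dim, patch_dim)  # reconstruct patch pixels

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim), i.e. a tokenized face video
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))

        # Random masking: keep only a small visible subset of patches.
        noise = torch.rand(B, N, device=patches.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        ids_keep = ids_shuffle[:, :num_keep]
        visible = torch.gather(patches, 1,
                               ids_keep.unsqueeze(-1).expand(-1, -1, D))

        # Encode only the visible patches (the expensive part of the model).
        latent = self.encoder(self.patch_embed(visible))

        # Append mask tokens, restore the original patch order, and decode.
        mask_tokens = self.mask_token.expand(B, N - num_keep, -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1,
                            ids_restore.unsqueeze(-1).expand(-1, -1, latent.size(-1)))
        pred = self.head(self.decoder(full))

        # Reconstruction loss is computed on the masked patches only.
        mask = torch.ones(B, N, device=patches.device)
        mask.scatter_(1, ids_keep, 0.0)
        loss = (F.mse_loss(pred, patches, reduction="none").mean(-1) * mask).sum() / mask.sum()
        return loss


# Usage: e.g. 16 frames of 160x160 face crops -> 8x10x10 = 800 tube patches
# of dimension 2x16x16x3 = 1536 (all numbers chosen for illustration only).
model = MaskedVideoAutoencoder(patch_dim=2 * 16 * 16 * 3)
dummy = torch.randn(2, 800, 2 * 16 * 16 * 3)
print(model(dummy).item())
```

Because only the small visible subset passes through the heavy encoder, this style of pre-training scales to large amounts of unlabeled video at modest cost, which is the property the paper exploits.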

To evaluate the effectiveness of SVFAP, the authors conducted experiments on nine datasets covering three different downstream tasks: dynamic facial expression recognition, dimensional emotion recognition, and personality recognition. The results consistently demonstrated that SVFAP outperformed previous state-of-the-art methods on all datasets. This highlights the power of large-scale self-supervised pre-training in learning affect-related representations from unlabeled data.
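For a rough sense of how a pre-trained encoder of this kind can be adapted to a downstream task such as dynamic facial expression recognition, the sketch below reuses the hypothetical encoder from the previous example, attaches a linear classification head, and fine-tunes with cross-entropy. The seven-class setup, mean pooling, and hyperparameters are assumptions for illustration, not the paper's actual fine-tuning recipe.

```python
import torch
import torch.nn as nn


class AffectClassifier(nn.Module):
    """Downstream fine-tuning sketch: pre-trained encoder plus a linear head.

    `pretrained` is assumed to be the MaskedVideoAutoencoder from the previous
    sketch; only its patch embedding and encoder are reused, the decoder is dropped.
    """

    def __init__(self, pretrained, embed_dim=384, num_classes=7):
        super().__init__()
        self.patch_embed = pretrained.patch_embed
        self.encoder = pretrained.encoder
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) -- full video, no masking at fine-tuning time
        tokens = self.encoder(self.patch_embed(patches))
        return self.head(tokens.mean(dim=1))  # mean-pool over space-time tokens


# Fine-tuning step skeleton (the data and labels below are placeholders).
model = AffectClassifier(MaskedVideoAutoencoder(patch_dim=1536))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
videos = torch.randn(2, 800, 1536)
labels = torch.tensor([0, 3])
loss = nn.functional.cross_entropy(model(videos), labels)
loss.backward()
optimizer.step()
```

The same recipe, with a regression head and a suitable loss, would apply to dimensional emotion recognition and personality recognition, which is what makes a single pre-trained encoder reusable across all three downstream tasks.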

This research has significant implications for fields related to multimedia information systems, animation, and artificial, augmented, and virtual reality. In multimedia systems, accurate real-time facial affect analysis can enhance user experience and enable more natural interaction with digital content. Animation can benefit from improved facial expression recognition, leading to more realistic and expressive characters. In augmented and virtual reality applications, the ability to analyze facial affect can enable more immersive and emotionally responsive experiences.

Overall, this paper presents an innovative approach to video-based facial affect analysis, pairing large-scale self-supervised pre-training with an efficient Transformer encoder. The results demonstrate that strong affect-related representations can be learned without large labeled datasets, opening new avenues for improving human-computer interaction and carrying wide-ranging implications for multimedia information systems and virtual reality technologies.