In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to
unsupervised audio-visual speech representation learning. The latent space is
structured to dissociate the latent dynamical factors that are shared between
the modalities from those that are specific to each modality. A static latent
variable is also introduced to encode the information that is constant over
time within an audiovisual speech sequence. The model is trained in an
unsupervised manner on an audiovisual emotional speech dataset, in two stages.
In the first stage, a vector quantized VAE (VQ-VAE) is learned independently
for each modality, without temporal modeling. The second stage consists of
learning the MDVAE model on the intermediate representation of the VQ-VAEs
before quantization. The disentanglement between static versus dynamical and
modality-specific versus modality-common information occurs during this second
training stage. Extensive experiments are conducted to investigate how
audiovisual speech latent factors are encoded in the latent space of MDVAE.
These experiments include manipulating audiovisual speech, audiovisual facial
image denoising, and audiovisual speech emotion recognition. The results show
that MDVAE effectively combines the audio and visual information in its latent
space. They also show that the learned static representation of audiovisual
speech can be used for emotion recognition with little labeled data, achieving better accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
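
To make the two-stage procedure more concrete, here is a minimal sketch in PyTorch of the stage-one model: one VQ-VAE per modality, with no temporal modeling. All names, layer sizes, and the codebook size are illustrative assumptions rather than the authors' implementation; the important detail is the continuous pre-quantization output, which is what the MDVAE is trained on in stage two.

```python
# Stage 1 sketch: one VQ-VAE per modality, no temporal modeling.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class ModalityVQVAE(nn.Module):
    def __init__(self, input_dim, latent_dim=64, num_codes=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.codebook = nn.Embedding(num_codes, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim))

    def pre_quantization(self, x):
        # Continuous encoder output; stage 2 (the MDVAE) is trained on this.
        return self.encoder(x)

    def quantize(self, z_e):
        # Nearest codebook entry (straight-through gradient trick omitted).
        dists = torch.cdist(z_e, self.codebook.weight)
        return self.codebook(dists.argmin(dim=-1))

    def forward(self, x):
        z_e = self.pre_quantization(x)
        x_hat = self.decoder(self.quantize(z_e))
        return x_hat, z_e
```

In practice, each modality would use an encoder suited to its data (for instance, convolutional layers for face images) rather than the toy MLPs above; the sketch only illustrates where the pre-quantization features come from.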

This article introduces the multimodal and dynamical variational autoencoder (MDVAE), applied to unsupervised audio-visual speech representation learning. The key idea behind MDVAE is to structure the latent space so that it separates the information shared between the audio and visual modalities from the information specific to each modality, while also capturing the temporal dynamics within an audiovisual speech sequence and the attributes that remain constant over time.
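
One way to picture this factorization is a generative model with a single static latent variable per sequence and per-frame dynamical latent variables, some shared across modalities and some modality-specific. The sketch below shows only the decoding side, with assumed variable names (w, z_av, z_a, z_v) and dimensions; the structured inference network and temporal priors used to train the actual MDVAE are omitted.

```python
# Sketch of the MDVAE latent factorization (generative side only).
# Variable names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class MDVAEDecoderSketch(nn.Module):
    def __init__(self, w_dim=32, zav_dim=16, za_dim=8, zv_dim=8, feat_dim=64):
        super().__init__()
        # Each modality is reconstructed (in its VQ-VAE feature space) from:
        #   w    : static latent, constant over the whole sequence
        #   z_av : dynamical latent shared by audio and video (one per frame)
        #   z_a / z_v : dynamical latents specific to one modality (one per frame)
        self.audio_dec = nn.Linear(w_dim + zav_dim + za_dim, feat_dim)
        self.video_dec = nn.Linear(w_dim + zav_dim + zv_dim, feat_dim)

    def forward(self, w, z_av, z_a, z_v):
        # w: (B, w_dim); z_*: (B, T, dim)
        T = z_av.shape[1]
        w_seq = w.unsqueeze(1).expand(-1, T, -1)  # broadcast static latent over time
        audio_feats = self.audio_dec(torch.cat([w_seq, z_av, z_a], dim=-1))
        video_feats = self.video_dec(torch.cat([w_seq, z_av, z_v], dim=-1))
        return audio_feats, video_feats
```

Roughly speaking, the manipulation experiments mentioned in the abstract amount to swapping one of these latent variables between two sequences while keeping the others fixed, and observing what changes in the reconstruction.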

This research is a testament to the multi-disciplinary nature of multimedia information systems, as it combines techniques from computer vision, machine learning, and speech processing. By integrating both audio and visual modalities, this approach paves the way for more immersive and realistic multimedia experiences.

Animation, augmented reality, and virtual reality are all fields that stand to benefit from advances in audio-visual processing. By effectively combining audio and visual information in a shared latent space, MDVAE opens up possibilities for more realistic and interactive virtual environments. Imagine a virtual reality game where characters not only look real but also sound realistic when they speak; this level of fidelity can greatly enhance the user's immersion and overall experience.

Furthermore, this research addresses the challenge of disentangling static versus dynamical and modality-specific versus modality-common information, which is crucial for tasks such as audiovisual facial image denoising and emotion recognition. In particular, the learned static representation of audiovisual speech provides meaningful features for recognizing emotion, even when only a small amount of labeled data is available. The results show that MDVAE outperforms unimodal baselines and even a state-of-the-art supervised model based on an audiovisual transformer architecture.
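
As an illustration of how such a static representation could be reused for emotion recognition with few labels, the snippet below fits a linear probe on static latent vectors assumed to have been extracted from a frozen pretrained model. It is a generic sketch of this kind of protocol using scikit-learn, not the paper's exact evaluation pipeline.

```python
# Sketch: emotion recognition from the static latent with few labels.
# `latents` is assumed to hold one static vector per audiovisual sequence,
# extracted beforehand from a frozen pretrained model (hypothetical step).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def evaluate_few_label(latents: np.ndarray, labels: np.ndarray, n_labeled=100):
    """Train a linear probe on a small labeled subset and report test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        latents, labels, train_size=n_labeled, stratify=labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)
```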

Overall, this research showcases the potential of multimodal and dynamical approaches in the field of multimedia information systems. By harnessing both audio and visual modalities, models like MDVAE can support more immersive experiences in animation, augmented reality, and virtual reality, and their ability to disentangle static, dynamical, modality-specific, and modality-common factors opens up applications such as emotion recognition and facial image denoising.

Read the original article