Generating vivid and diverse 3D co-speech gestures is crucial for various
applications in animating virtual avatars. While most existing methods can
generate gestures from audio directly, they usually overlook that emotion is
a key factor in authentic co-speech gesture generation. In this work,
we propose EmotionGesture, a novel framework for synthesizing vivid and diverse
emotional co-speech 3D gestures from audio. Since emotion is often
entangled with the rhythmic beat in speech audio, we first develop an
Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features
as well as model their correlation via a transcript-based visual-rhythm
alignment. Then, we propose an initial-pose-based Spatial-Temporal Prompter
(STP) to generate future gestures from the given initial poses. The STP
effectively models the spatial-temporal correlations between the initial poses
and the future gestures, thus producing a spatial-temporally coherent pose
prompt. Once we obtain the pose prompt, emotion features, and audio beat
features, we generate 3D co-speech gestures through a transformer architecture.
However, the poses in existing datasets often contain jittering artifacts,
which leads to unstable generated gestures. To address this issue, we propose
an effective objective function, dubbed the Motion-Smooth Loss. Specifically,
we model motion offsets to compensate for the jittery ground truth, forcing the
generated gestures to be smooth. Finally, we present an emotion-conditioned VAE
to sample emotion features,
enabling us to generate diverse emotional results. Extensive experiments
demonstrate that our framework outperforms the state-of-the-art, achieving
vivid and diverse emotional co-speech 3D gestures. Our code and dataset will be
released at the project page:
https://xingqunqi-lab.github.io/Emotion-Gesture-Web/

EmotionGesture: Synthesizing Vivid and Diverse Emotional Co-Speech 3D Gestures

In the field of multimedia information systems, the generation of realistic and expressive virtual avatars has become a crucial research area. One important aspect of animating virtual avatars is generating co-speech gestures that are synchronized with speech. The ability to generate vivid and diverse 3D co-speech gestures is essential for applications such as virtual reality, augmented reality, and other interactive media.

The article introduces EmotionGesture, a novel framework for synthesizing emotional co-speech 3D gestures from audio. Unlike existing methods, EmotionGesture takes into account the emotion in speech audio, which is often overlooked but plays a significant role in generating authentic gestures. The framework consists of several modules that work together to produce coherent and expressive gestures.

Emotion-Beat Mining Module (EBM)

The Emotion-Beat Mining module is responsible for extracting emotion and audio beat features from the speech audio. It also models the correlation between these features through a transcript-based visual-rhythm alignment. This module is crucial for capturing the emotional content of the speech and its rhythmic characteristics, which are important cues for gesture generation.
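As a rough illustration of such a two-branch design, the PyTorch sketch below separates frame-level audio features into a clip-level emotion embedding and frame-level beat features. The shared convolutional stem, the GRU emotion branch, and all layer sizes are assumptions made for illustration, and the transcript-based visual-rhythm alignment described in the paper is not shown.

```python
import torch
import torch.nn as nn

class EmotionBeatMiningSketch(nn.Module):
    """Illustrative two-branch encoder: one branch mines a clip-level emotion
    feature, the other mines frame-level beat features from the same audio.
    Layer choices and sizes are assumptions, not the paper's architecture."""

    def __init__(self, audio_dim=128, feat_dim=256):
        super().__init__()
        # Shared stem over frame-level audio features (e.g., MFCCs).
        self.stem = nn.Sequential(
            nn.Conv1d(audio_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Emotion branch: temporal summarization yields a clip-level embedding.
        self.emotion_branch = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Beat branch: keeps the temporal axis to preserve rhythmic structure.
        self.beat_branch = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, audio):                              # audio: (B, T, audio_dim)
        h = self.stem(audio.transpose(1, 2))               # (B, feat_dim, T)
        _, emotion = self.emotion_branch(h.transpose(1, 2))
        emotion = emotion.squeeze(0)                       # (B, feat_dim), clip-level
        beat = self.beat_branch(h).transpose(1, 2)         # (B, T, feat_dim)
        return emotion, beat
```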

Spatial-Temporal Prompter (STP)

The Spatial-Temporal Prompter module generates future gestures based on the given initial poses. This module effectively models the spatial-temporal correlations between the initial poses and the future gestures, producing a spatial-temporal coherent pose prompt. By considering the relationships between poses over time, the STP ensures that the generated gestures are natural and coherent.
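A minimal sketch of a prompter along these lines is shown below. It assumes the initial poses are a short sequence of flattened joint coordinates and unrolls a coarse pose prompt over a fixed future horizon; the GRU encoder-decoder and all dimensions are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatialTemporalPrompterSketch(nn.Module):
    """Illustrative prompter: encodes a few initial pose frames and unrolls a
    coarse pose prompt for the future frames. The GRU encoder/decoder and the
    per-frame spatial mixing layer are assumptions, not the paper's design."""

    def __init__(self, pose_dim=126, hidden=256, future_len=88):
        super().__init__()
        # pose_dim: assumed flattened (x, y, z) joint coordinates per frame.
        self.future_len = future_len
        self.spatial = nn.Linear(pose_dim, hidden)      # mixes joints within a frame
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, init_poses):                      # (B, T_init, pose_dim)
        h = torch.relu(self.spatial(init_poses))
        _, state = self.temporal(h)                     # summarize the initial motion
        # Unroll the summary over the future horizon to form a coherent prompt.
        query = state.transpose(0, 1).repeat(1, self.future_len, 1)
        out, _ = self.decoder(query, state)
        return self.to_pose(out)                        # (B, future_len, pose_dim)
```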

Transformer Architecture

The framework uses a transformer architecture to generate 3D co-speech gestures based on the pose prompts, emotion, and audio beat features. The transformer architecture is a powerful deep learning model that can capture complex relationships between different input modalities. In this case, it allows the framework to generate gestures that are synchronized with the speech and reflect the emotional content.
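For concreteness, a hedged sketch of this fusion step is given below: the pose prompt, frame-level beat features, and clip-level emotion feature are projected to a shared dimension, summed, and refined by a standard Transformer encoder. The additive fusion and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GestureTransformerSketch(nn.Module):
    """Illustrative fusion backbone: pose-prompt tokens are combined with beat
    and emotion features and refined into the final gesture sequence.
    The additive fusion and the dimensions are illustrative assumptions."""

    def __init__(self, pose_dim=126, feat_dim=256, layers=4, heads=8):
        super().__init__()
        self.pose_in = nn.Linear(pose_dim, feat_dim)
        self.beat_in = nn.Linear(feat_dim, feat_dim)
        self.emotion_in = nn.Linear(feat_dim, feat_dim)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.pose_out = nn.Linear(feat_dim, pose_dim)

    def forward(self, pose_prompt, beat, emotion):
        # pose_prompt: (B, T, pose_dim); beat: (B, T, feat_dim); emotion: (B, feat_dim)
        tokens = (self.pose_in(pose_prompt)
                  + self.beat_in(beat)
                  + self.emotion_in(emotion).unsqueeze(1))  # broadcast over time
        return self.pose_out(self.backbone(tokens))         # (B, T, pose_dim)
```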

Motion-Smooth Loss

To address the issue of jittering effects in existing datasets, the framework introduces an objective function called the Motion-Smooth Loss. This loss function models motion offsets to compensate for jittery ground-truth data, ensuring that the generated gestures are stable and smooth. By enforcing smoothness in the gestures, the framework improves the overall quality and coherence of the animations.
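One plausible reading of this objective is sketched below: predicted frame-to-frame offsets are matched against a temporally smoothed version of the ground-truth offsets, so the model is pushed toward smooth motion instead of reproducing jitter. The 3-frame averaging filter and the L1 penalty are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def motion_smooth_loss_sketch(pred, gt, weight=1.0):
    """Illustrative smoothness term. pred, gt: (B, T, pose_dim).
    Matches predicted motion offsets to low-pass-filtered ground-truth offsets;
    the filter width and the L1 penalty are assumptions."""
    pred_offset = pred[:, 1:] - pred[:, :-1]       # predicted frame-to-frame offsets
    gt_offset = gt[:, 1:] - gt[:, :-1]             # ground-truth offsets (may jitter)
    # Average each offset channel over a 3-frame window to suppress jitter.
    channels = gt_offset.size(-1)
    kernel = torch.ones(channels, 1, 3, device=gt.device, dtype=gt.dtype) / 3.0
    gt_smooth = F.conv1d(gt_offset.transpose(1, 2), kernel,
                         padding=1, groups=channels).transpose(1, 2)
    return weight * F.l1_loss(pred_offset, gt_smooth)
```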

Emotion-Conditioned VAE

The framework incorporates an emotion-conditioned Variational Autoencoder (VAE) to sample emotion features. This allows for the generation of diverse emotional results, as the VAE can learn and sample from a distribution of emotion features. By conditioning the generation process on emotion, the framework can produce gestures that express different emotions, adding further richness and variability to the animations.
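A compact sketch of such a conditional VAE is given below, assuming a one-hot emotion label as the condition; the feature, latent, and label sizes are arbitrary placeholders, and the sketch only illustrates the sample-at-inference idea rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class EmotionVAESketch(nn.Module):
    """Illustrative conditional VAE over emotion features: training encodes an
    emotion feature into a latent Gaussian; at inference, sampling the latent
    (conditioned on an emotion label) yields diverse emotion features.
    All sizes and the one-hot conditioning are assumptions."""

    def __init__(self, feat_dim=256, latent_dim=64, num_emotions=8):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Linear(feat_dim + num_emotions, 2 * latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + num_emotions, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, emotion_feat, label_onehot):
        mu, logvar = self.encoder(
            torch.cat([emotion_feat, label_onehot], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.decoder(torch.cat([z, label_onehot], dim=-1))
        return recon, mu, logvar  # a KL term would be added during training

    def sample(self, label_onehot):
        # Draw a fresh latent to produce a diverse emotion feature at test time.
        z = torch.randn(label_onehot.size(0), self.latent_dim,
                        device=label_onehot.device)
        return self.decoder(torch.cat([z, label_onehot], dim=-1))
```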

In summary, EmotionGesture presents a comprehensive framework for synthesizing vivid and diverse emotional co-speech 3D gestures. By accounting for emotion, spatial-temporal correlations, and motion smoothness, the framework produces high-quality animations that are closely synchronized with speech. The multi-disciplinary nature of this work lies in its integration of audio analysis, computer vision, natural language processing, and deep learning techniques. This research contributes to the wider field of multimedia information systems, including applications in virtual reality, augmented reality, and other interactive media.

Read the original article