We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D
Expression and Gesture generation of arbitrary length. While previous works
focused on co-speech gesture or expression generation individually, the joint
generation of synchronized expressions and gestures remains barely explored. To
address this, our diffusion-based co-speech motion generation transformer
enables uni-directional information flow from expression to gesture,
facilitating improved matching of joint expression-gesture distributions.
Furthermore, we introduce an outpainting-based sampling strategy for
arbitrarily long sequence generation in diffusion models, offering flexibility and
computational efficiency. Our method provides a practical solution that
produces high-quality, synchronized expressions and gestures driven by speech.
Evaluated on two public datasets, our approach achieves
state-of-the-art performance both quantitatively and qualitatively.
Additionally, a user study confirms the superiority of DiffSHEG over prior
approaches. By enabling the real-time generation of expressive and synchronized
motions, DiffSHEG showcases its potential for various applications in the
development of digital humans and embodied agents.

DiffSHEG (Diffusion-based Speech-driven Holistic 3D Expression and Gesture generation) is a method that jointly generates synchronized 3D facial expressions and body gestures from speech. While previous studies have focused on generating co-speech gestures or expressions separately, their joint generation has been underexplored. DiffSHEG addresses this gap with a diffusion-based co-speech motion generation transformer that better matches the joint expression-gesture distribution.
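To make the idea of a diffusion-based motion generator more concrete, the sketch below shows a generic conditional diffusion training step: clean motion is noised, and a transformer is asked to predict that noise given the diffusion step and per-frame audio features. The module name, tensor shapes, and dimensions are illustrative assumptions, not DiffSHEG's actual implementation.

```python
# Minimal sketch of a conditional diffusion training step for co-speech motion.
# Module name, shapes, and dimensions are illustrative assumptions, not the
# paper's actual implementation.
import torch
import torch.nn as nn

T_STEPS = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T_STEPS)      # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class MotionDenoiser(nn.Module):
    """Predicts the injected noise from noisy motion, the step t, and audio features."""
    def __init__(self, motion_dim=200, audio_dim=128, hidden=256):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim + audio_dim + 1, hidden)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.out_proj = nn.Linear(hidden, motion_dim)

    def forward(self, x_t, t, audio):
        # x_t: (B, L, motion_dim), audio: (B, L, audio_dim), t: (B,)
        t_feat = t.float()[:, None, None].expand(-1, x_t.size(1), 1) / T_STEPS
        h = self.in_proj(torch.cat([x_t, audio, t_feat], dim=-1))
        return self.out_proj(self.backbone(h))

def training_step(model, x0, audio):
    """One training step: predict the noise that was added to clean motion x0."""
    t = torch.randint(0, T_STEPS, (x0.size(0),))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t][:, None, None]
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise    # forward diffusion
    return nn.functional.mse_loss(model(x_t, t, audio), noise)
```

At inference time, the same network is applied iteratively to turn random noise into a motion sequence that follows the input speech.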

What sets DiffSHEG apart is its uni-directional information flow from expression to gesture: the expression stream conditions gesture generation, but not the other way around, which yields more coherent and better-synchronized output. The approach draws on speech processing, computer vision, and motion generation, integrating them into a holistic system that produces expressive, synchronized motions in real time.
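One plausible way to realize this one-way coupling, sketched below, is to let the gesture stream cross-attend to expression features while the expression stream never attends to gestures. The block and its layer sizes are hypothetical illustrations, not DiffSHEG's exact architecture.

```python
# Hypothetical sketch of uni-directional expression-to-gesture conditioning.
# Layer names and sizes are assumptions, not DiffSHEG's actual architecture.
import torch
import torch.nn as nn

class ExpressionToGestureBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.expr_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gest_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gesture queries attend to expression features; there is no reverse path,
        # so information flows only from expression to gesture.
        self.gest_cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, expr, gest):
        # expr, gest: (B, L, dim) per-frame features of the two branches
        expr = expr + self.expr_self_attn(expr, expr, expr)[0]
        gest = gest + self.gest_self_attn(gest, gest, gest)[0]
        gest = gest + self.gest_cross_attn(gest, expr, expr)[0]   # one-way coupling
        return expr, gest

# Usage: expressions are refined independently; gestures are refined while
# also reading the expression stream.
block = ExpressionToGestureBlock()
expr, gest = block(torch.randn(2, 64, 256), torch.randn(2, 64, 256))
```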

One of the key contributions of DiffSHEG is its outpainting-based sampling strategy for generating arbitrarily long sequences with diffusion models. This strategy offers flexibility and computational efficiency, making DiffSHEG practical in a range of scenarios. Evaluated on two public datasets, the method demonstrates state-of-the-art performance both quantitatively and qualitatively.
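The sketch below illustrates the general idea of outpainting-style sampling: motion is generated window by window, and at every denoising step the frames that overlap the previous window are reset to appropriately noised copies of what was already generated, so only the new frames remain free. The window and overlap sizes, the noise schedule, and the `model` interface (a noise predictor taking noisy motion, a step index, and audio, as in the earlier sketch) are assumptions rather than the paper's exact procedure.

```python
# Illustrative sketch of outpainting-style sampling for arbitrarily long motion.
# Window/overlap sizes, the schedule, and the model interface are assumptions,
# not the paper's exact sampling procedure.
import torch

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def add_noise(x0, t):
    """Forward-diffuse known frames so they match the noise level at step t."""
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * torch.randn_like(x0)

@torch.no_grad()
def denoise_window(model, audio_win, motion_dim, known_prefix=None):
    """Sample one window; frames covered by `known_prefix` are held fixed."""
    B, L = audio_win.size(0), audio_win.size(1)
    x = torch.randn(B, L, motion_dim)
    for t in reversed(range(T_STEPS)):
        if known_prefix is not None:
            # Outpainting: overwrite the overlap with noised, already-generated frames.
            x[:, :known_prefix.size(1)] = add_noise(known_prefix, t)
        eps = model(x, torch.full((B,), t, dtype=torch.long), audio_win)
        a, a_bar = alphas[t], alphas_bar[t]
        x = (x - (1 - a) / (1 - a_bar).sqrt() * eps) / a.sqrt()   # DDPM mean
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)          # DDPM noise term
    return x

@torch.no_grad()
def sample_long(model, audio, motion_dim, window=128, overlap=32):
    """Slide a window over the audio; each window outpaints past the last overlap."""
    out = denoise_window(model, audio[:, :window], motion_dim)
    start = window - overlap
    while start + window <= audio.size(1):
        win = denoise_window(model, audio[:, start:start + window], motion_dim,
                             known_prefix=out[:, -overlap:])
        out = torch.cat([out, win[:, overlap:]], dim=1)
        start += window - overlap
    return out   # any leftover tail shorter than one window is ignored in this sketch
```

Because each new window only denoises a fixed number of frames, memory and latency stay bounded regardless of the total sequence length, which is consistent with the flexibility and efficiency the authors emphasize.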

To validate the effectiveness of DiffSHEG, a user study was conducted, confirming its superiority over prior approaches and reaffirming its potential for applications in digital humans and embodied agents.

The significance of DiffSHEG lies in its ability to generate high-quality, synchronized expressions and gestures driven by speech. This capability opens the door to applications in human-computer interaction, virtual reality, animation, and robotics. By seamlessly integrating speech-driven expression and gesture generation, DiffSHEG paves the way for more natural and immersive interactions between humans and machines.

Read the original article