arXiv:2508.12020v1 Announce Type: new
Abstract: The Audio-to-3D-Gesture (A2G) task has enormous potential for applications in virtual reality, computer graphics, and beyond. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect human preference for the generated 3D gestures. To address this problem, exploring human preference and an objective quality assessment metric for AI-generated 3D human gestures is becoming increasingly significant. In this paper, we introduce the Ges-QA dataset, which includes 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency. Moreover, we collect binary classification labels to indicate whether the generated gestures match the emotions of the audio. Equipped with our Ges-QA dataset, we propose a multi-modal transformer-based neural network with three branches for the video, audio and 3D skeleton modalities, which can score A2G content in multiple dimensions. Comparative experimental results and ablation studies demonstrate that Ges-QAer achieves state-of-the-art performance on our dataset.
Expert Commentary: Exploring Human Preference and Quality Assessment for AI-generated 3D Human Gestures
The Audio-to-3D-Gesture (A2G) task holds significant potential for applications in virtual reality, computer graphics, and beyond. However, current evaluation metrics like Fréchet Gesture Distance and Beat Constancy may not accurately capture human preference for generated 3D gestures. This gap highlights the need to better understand human perception and to develop objective quality assessment metrics for AI-generated 3D human gestures.
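To make the limitation concrete, Fréchet Gesture Distance is typically computed as the Fréchet distance between Gaussians fitted to features of real and generated gesture sequences, analogous to FID for images. The sketch below shows that computation under the assumption of a generic feature extractor (the paper's specific feature space is not described here); it measures distributional similarity, which is exactly why it can miss per-sample human preference.

```python
import numpy as np
from scipy import linalg


def frechet_gesture_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two gesture feature sets.

    real_feats, gen_feats: (N, D) arrays of features extracted from real and
    generated gesture sequences. The feature extractor is an assumption here;
    different A2G papers use different encoders.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

Because the score depends only on the two feature distributions, a generator can match the distribution well while still producing individual gestures that humans judge as poorly timed or emotionally mismatched.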
Multi-disciplinary in nature, this research bridges multimedia information systems, animation, artificial reality, augmented reality, and virtual reality. By introducing the Ges-QA dataset, which comprises 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency, the authors lay a solid foundation for further exploration in this domain.
The inclusion of binary classification labels indicating whether the generated gestures match the emotion of the audio adds another layer of complexity to the task. The dataset enables the development of a multi-modal transformer-based neural network with separate branches for the video, audio, and 3D skeleton modalities. This design allows A2G content to be scored across multiple dimensions, providing a more comprehensive assessment of gesture quality; a rough sketch of such a three-branch scorer follows.
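The following is a minimal sketch of what a three-branch, transformer-based quality scorer could look like, not the authors' actual Ges-QAer architecture. All layer sizes, feature dimensions, the temporal average pooling, and the two output heads (regression scores plus an emotion-match logit) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ThreeBranchScorer(nn.Module):
    """Hypothetical three-branch scorer: one transformer encoder per modality
    (video, audio, 3D skeleton), fused into multidimensional quality scores
    and a binary audio-emotion-match prediction."""

    def __init__(self, video_dim=1024, audio_dim=128, skel_dim=165,
                 d_model=256, n_scores=2):
        super().__init__()
        # Project each modality's per-frame features into a shared width.
        self.proj = nn.ModuleDict({
            "video": nn.Linear(video_dim, d_model),
            "audio": nn.Linear(audio_dim, d_model),
            "skeleton": nn.Linear(skel_dim, d_model),
        })
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # nn.TransformerEncoder deep-copies the layer, so each branch gets its
        # own independent parameters.
        self.encoders = nn.ModuleDict({
            name: nn.TransformerEncoder(layer, num_layers=2) for name in self.proj
        })
        self.score_head = nn.Linear(3 * d_model, n_scores)  # e.g. quality, consistency
        self.emotion_head = nn.Linear(3 * d_model, 1)        # emotion-match logit

    def forward(self, video, audio, skeleton):
        # Each input: (batch, time, feature_dim) per-frame features.
        pooled = []
        for name, x in {"video": video, "audio": audio, "skeleton": skeleton}.items():
            h = self.encoders[name](self.proj[name](x))
            pooled.append(h.mean(dim=1))  # temporal average pooling per branch
        fused = torch.cat(pooled, dim=-1)
        return self.score_head(fused), self.emotion_head(fused)


# Illustrative usage with random tensors in place of real extracted features.
model = ThreeBranchScorer()
scores, emotion_logit = model(torch.randn(4, 120, 1024),
                              torch.randn(4, 120, 128),
                              torch.randn(4, 120, 165))
```

A late-fusion design like this keeps each modality's encoder simple and lets the fused representation drive both the multidimensional scores and the emotion-match classification; the actual Ges-QAer model may fuse modalities differently.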
The comparative experimental results and ablation studies presented in the paper showcase the effectiveness of the proposed Ges-QAer model, which demonstrates state-of-the-art performance on the Ges-QA dataset. This research not only advances the assessment of AI-generated 3D human gestures but also underscores the importance of incorporating human preference into evaluation metrics for such tasks.