arXiv:2505.18614v1 Announce Type: cross
Abstract: Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.
Expert Commentary
Lyrics translation in animated musicals poses challenges that call for a multidisciplinary approach. A translated line must carry the meaning of the original while preserving musical rhythm, syllabic structure, and poetic style, and it must also stay aligned with the visual and auditory cues on screen. The Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL) is presented as the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, it enables richer and more expressive translations than text-only methods.
Building on the benchmark, the proposed Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT) leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. According to the reported experiments, SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, which underscores the value of multimodal, multilingual approaches to lyrics translation for animated musicals.
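To make the idea of a syllabic constraint concrete, here is a minimal sketch of checking whether a candidate translated line fits a target syllable count. It is an illustration only and not the SylAVL-CoT method, whose details the abstract does not give. The function names `count_syllables` and `meets_constraint` are hypothetical, and the vowel-group heuristic is a rough, English-only approximation that miscounts some words (for example, it overcounts the silent "e" in "bothered").

```python
import re


def count_syllables(line: str) -> int:
    """Estimate English syllables by counting vowel groups in each word.

    Assumption: a naive heuristic, not a pronunciation dictionary.
    """
    total = 0
    for word in re.findall(r"[a-z]+", line.lower()):
        n = len(re.findall(r"[aeiouy]+", word))
        # Drop a trailing silent "e" ("time"), but keep "-le" endings ("little").
        if (
            n > 1
            and word.endswith("e")
            and not word.endswith("le")
            and word[-2] not in "aeiouy"
        ):
            n -= 1
        total += max(n, 1)  # every word counts as at least one syllable
    return total


def meets_constraint(candidate: str, target: int, tolerance: int = 0) -> bool:
    """Return True if the candidate's estimated syllable count is within
    `tolerance` of the target count taken from the original line."""
    return abs(count_syllables(candidate) - target) <= tolerance


# Keep only the candidate lines that fit a three-syllable slot.
candidates = ["let it go", "let it go now", "release it"]
singable = [c for c in candidates if meets_constraint(c, target=3)]
```

A check like this could act as a filter over candidate translations, with `tolerance` relaxed where the melody allows an extra or missing syllable. A production system would more plausibly rely on a pronunciation dictionary or language-specific syllabifier than on this heuristic.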
These advances contribute to the broader field of multimedia information systems and have implications for related areas such as animation, artificial reality, augmented reality, and virtual reality. Incorporating text, audio, and video into the translation process gives a model more context for conveying meaning, emotion, and cultural nuance than the lyric text alone provides.