arXiv:2409.15545v1 Announce Type: cross
Abstract: The subjective nature of music emotion introduces inherent bias in both recognition and generation, especially when relying on a single audio encoder, emotion classifier, or evaluation metric. In this work, we conduct a study on Music Emotion Recognition (MER) and Emotional Music Generation (EMG), employing diverse audio encoders alongside the Frechet Audio Distance (FAD), a reference-free evaluation metric. Our study begins with a benchmark evaluation of MER, highlighting the limitations associated with using a single audio encoder and the disparities observed across different measurements. We then propose assessing MER performance using FAD from multiple encoders to provide a more objective measure of music emotion. Furthermore, we introduce an enhanced EMG approach designed to improve both the variation and prominence of generated music emotion, thus enhancing realism. Additionally, we investigate the realism disparities between the emotions conveyed in real and synthetic music, comparing our EMG model against two baseline models. Experimental results underscore the emotion bias problem in both MER and EMG and demonstrate the potential of using FAD and diverse audio encoders to evaluate music emotion objectively.
The Subjective Nature of Music Emotion and Its Impact on Recognition and Generation
Music has long been recognized as a powerful medium for evoking emotions in listeners. However, the subjective nature of music emotion makes it challenging to objectively measure and evaluate these emotions. This inherent bias affects both Music Emotion Recognition (MER) and Emotional Music Generation (EMG), two important areas in multimedia information systems.
In the field of MER, researchers have traditionally relied on a single audio encoder to extract features from music, with a classifier trained on those features to predict the emotions conveyed. This approach, while convenient, ignores the diverse ways in which different encoders perceive and represent music. As a result, the measured performance of an MER system can vary widely depending on the choice of encoder, as the sketch below illustrates.
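The following minimal sketch (not the paper's code) shows why encoder choice matters: the same clips and labels are classified from two different embedding spaces, and the resulting accuracies need not agree. The features and encoder projections are random placeholders standing in for real pretrained audio encoders, used only so the script runs end to end.

```python
# Sketch: how MER accuracy can swing with the choice of audio encoder.
# The "encoders" below are stand-ins (random projections), not real models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

n_clips, n_emotions = 200, 4
labels = rng.integers(0, n_emotions, size=n_clips)       # e.g. happy/sad/angry/calm
raw_audio_features = rng.normal(size=(n_clips, 128))     # placeholder for audio input

# Each stand-in encoder "hears" the audio through a different fixed projection,
# so the downstream classifier sees a different embedding space.
encoders = {
    "encoder_a": rng.normal(size=(128, 64)),
    "encoder_b": rng.normal(size=(128, 64)),
}

for name, projection in encoders.items():
    embeddings = raw_audio_features @ projection          # encoder-specific embeddings
    clf = LogisticRegression(max_iter=1000)
    acc = cross_val_score(clf, embeddings, labels, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
# With real encoders and real labels, the gap between these numbers is the
# encoder-induced bias the article is concerned with.
```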
To address this limitation, the authors propose using the Frechet Audio Distance (FAD), a reference-free evaluation metric, alongside multiple audio encoders. FAD compares the statistics of two sets of embeddings (each modeled as a Gaussian) rather than relying on any single classifier's predictions, so aggregating it over the outputs of several encoders yields a more objective measure of music emotion. This multi-disciplinary approach, combining insights from audio signal processing, machine learning, and psychology, has the potential to significantly improve the performance and reliability of MER systems.
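As a concrete illustration, here is a minimal FAD computation under the standard two-Gaussian formulation, evaluated with two hypothetical encoders of different embedding dimensions. The embedding arrays are random placeholders; in the paper's setting they would come from pretrained audio encoders applied to two groups of clips (for example, clips labeled with different emotions).

```python
# Sketch: Fréchet Audio Distance between two sets of embeddings.
# FAD = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)), with each set
# modeled as a Gaussian with mean mu and covariance S.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
encoder_dims = {"encoder_a": 64, "encoder_b": 128}        # hypothetical encoders

for name, dim in encoder_dims.items():
    group_1 = rng.normal(loc=0.0, size=(500, dim))        # placeholder embeddings, group 1
    group_2 = rng.normal(loc=0.3, size=(500, dim))        # placeholder embeddings, group 2
    print(f"{name}: FAD = {frechet_audio_distance(group_1, group_2):.3f}")
```

Because the distance is computed over embedding distributions rather than classifier outputs, reporting it under several encoders gives a picture of music emotion that no single encoder can bias on its own.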
The article also explores EMG, which focuses on generating music that conveys specific emotions. While previous EMG models have achieved some success, they often struggle to produce music that is both varied and emotionally evocative. To overcome this limitation, the authors propose an enhanced EMG approach designed to improve both the variation and prominence of the generated music's emotion, drawing on insights from music theory, computational creativity, and human-computer interaction.
In addition to evaluating the performance of their EMG model, the authors investigate the realism disparities between emotions conveyed in real and synthetic music, comparing their model against two baselines (a protocol sketched below). This comparison highlights the challenges EMG models face in capturing the nuances and complexities of human emotional expression. By addressing these challenges, the field of EMG can contribute to the development of more realistic and emotionally engaging multimedia experiences.
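A hedged sketch of such a comparison protocol follows: for each emotion, FAD is computed between embeddings of real clips and embeddings of clips produced by each generative model, with lower values indicating that the synthetic music sits closer to the real music of that emotion in embedding space. The model names, emotion labels, and embeddings here are hypothetical placeholders, not the paper's data or results.

```python
# Sketch: real-vs-synthetic realism gap per emotion, measured with FAD.
import numpy as np
from scipy.linalg import sqrtm

def fad(a: np.ndarray, b: np.ndarray) -> float:
    mu = a.mean(axis=0) - b.mean(axis=0)
    ca, cb = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    cm = sqrtm(ca @ cb)
    return float(mu @ mu + np.trace(ca + cb - 2.0 * np.real(cm)))

rng = np.random.default_rng(1)
emotions = ["happy", "sad", "angry", "calm"]
models = ["proposed_emg", "baseline_1", "baseline_2"]      # hypothetical model names

for emotion in emotions:
    real = rng.normal(size=(300, 64))                      # embeddings of real clips
    for model in models:
        synthetic = rng.normal(loc=0.2, size=(300, 64))    # embeddings of generated clips
        print(f"{emotion:>6} | {model:<13} FAD = {fad(real, synthetic):.3f}")
# Repeating this table under several encoders, as the article advocates,
# guards against any one encoder dominating the realism comparison.
```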
Relevance to Multimedia Information Systems and Virtual Realities
The concepts discussed in the article are highly relevant to the wider field of multimedia information systems. Multimedia information systems deal with the storage, retrieval, and analysis of multimedia data, including audio, images, and videos. Emotion recognition and generation play a crucial role in enhancing the user experience and personalization of such systems.
Animations, artificial reality, augmented reality, and virtual realities are all domains that can benefit from advancements in music emotion recognition and generation. For example, in virtual reality applications, the incorporation of emotionally engaging music can significantly enhance the sense of immersion and presence. Similarly, in animations and augmented reality experiences, the ability to generate music that effectively conveys specific emotions can enhance the storytelling and overall impact of the content.
By addressing the inherent biases and limitations of current approaches, the research presented in this article contributes to more accurate, reliable, and emotionally engaging multimedia information systems. The breadth of the concepts involved, spanning audio signal processing, machine learning, psychology, music theory, computational creativity, and human-computer interaction, underscores how many disciplines must work together to advance multimedia technologies.