Generating music with emotion is an important task in automatic music
generation, in which emotion is evoked through a variety of musical elements
(such as pitch and duration) that change over time and interact with one
another. However, prior research on deep learning-based emotional music
generation has rarely explored the contribution of different musical elements
to emotion, let alone deliberately manipulated these elements to alter the
emotion of music, which limits fine-grained, element-level control over
emotion. To address this gap, we present a novel approach
employing musical element-based regularization in the latent space to
disentangle distinct elements, investigate their roles in distinguishing
emotions, and further manipulate elements to alter musical emotions.
Specifically, we propose MusER, a VQ-VAE-based model that incorporates a
regularization loss to enforce correspondence between musical element
sequences and specific dimensions of the latent variable sequences, providing
a new solution for disentangling discrete sequences.
Taking advantage of the disentangled latent vectors, we devise a two-level
decoding strategy in which multiple decoders attend to latent vectors with
different semantics to better predict the elements. By visualizing the latent
space, we show that MusER yields a disentangled and interpretable latent
space, and we gain insights into the contribution of distinct elements to the
emotional dimensions (i.e., arousal and valence). Experimental results
demonstrate that MusER outperforms state-of-the-art models for generating
emotional music in both objective and subjective evaluations. In addition, we
rearrange music through element transfer and alter the emotion of music by
transferring emotion-distinguishable elements.
In this article, the authors discuss the importance of generating music with emotion and highlight a gap in prior research on deep learning-based emotional music generation. They propose MusER, a novel approach that employs musical element-based regularization in the latent space to enable fine-grained, element-level control over emotion.
MusER is a VQ-VAE-based model that incorporates a regularization loss to ensure that musical element sequences correspond to specific dimensions of the latent variable sequences. This makes it possible to disentangle distinct elements and to investigate their roles in distinguishing emotions; by manipulating these elements, MusER can alter the emotional quality of the generated music.
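To make the regularization idea concrete, below is a minimal sketch assuming the latent sequence is partitioned into equal-width slices, one per element, and each slice is trained to predict its own element's token sequence. The class and parameter names (ElementRegularizer, d_slice, vocab_sizes) are illustrative assumptions, not MusER's actual implementation.

```python
# A minimal sketch of element-based latent regularization, NOT MusER's actual code.
# Assumption: the latent sequence z has shape (batch, time, K * d_slice), and the
# k-th slice is encouraged to encode the k-th musical element (e.g., pitch,
# duration, velocity) by predicting that element's token sequence.
import torch
import torch.nn as nn

class ElementRegularizer(nn.Module):
    def __init__(self, d_slice: int, vocab_sizes: list[int]):
        super().__init__()
        # One lightweight classification head per element; each head only
        # ever sees its own latent slice.
        self.heads = nn.ModuleList([nn.Linear(d_slice, v) for v in vocab_sizes])
        self.d_slice = d_slice
        self.ce = nn.CrossEntropyLoss()

    def forward(self, z: torch.Tensor, element_targets: list[torch.Tensor]):
        # z: (B, T, K * d_slice); element_targets[k]: (B, T) token ids
        loss = torch.zeros((), device=z.device)
        for k, head in enumerate(self.heads):
            z_k = z[..., k * self.d_slice : (k + 1) * self.d_slice]
            logits = head(z_k)  # (B, T, vocab_k)
            loss = loss + self.ce(
                logits.reshape(-1, logits.size(-1)),
                element_targets[k].reshape(-1),
            )
        return loss / len(self.heads)
```

Penalizing each slice for failing to predict its own element pushes element-specific information into that slice; in training, such a term would be combined with the usual VQ-VAE reconstruction and codebook losses.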
The authors also introduce a two-level decoding strategy that includes multiple decoders, each attending to latent vectors with different semantics; this improves the prediction of musical elements. By visualizing the latent space, they show that it is disentangled and interpretable, and they draw insights into how different elements contribute to emotional dimensions such as arousal and valence.
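One plausible reading of this strategy, sketched below under the same slice layout: a shared first-level pass summarizes the full latent sequence, and second-level per-element decoders each combine their own slice with that shared context. Module choices and names here are our assumptions, not the paper's architecture.

```python
# A minimal sketch of one plausible two-level decoding scheme, NOT the paper's code.
# Level 1 builds a shared context over the full latent sequence; level 2 gives each
# element its own decoder, fed that element's latent slice plus the shared context.
import torch
import torch.nn as nn

class TwoLevelDecoder(nn.Module):
    def __init__(self, d_slice: int, vocab_sizes: list[int], n_heads: int = 4):
        super().__init__()
        d_full = d_slice * len(vocab_sizes)  # must be divisible by n_heads
        # Level 1: shared self-attention over the whole latent sequence.
        self.shared = nn.TransformerEncoderLayer(
            d_model=d_full, nhead=n_heads, batch_first=True
        )
        # Level 2: one recurrent decoder and output head per element.
        self.decoders = nn.ModuleList(
            [nn.GRU(d_slice + d_full, d_slice, batch_first=True) for _ in vocab_sizes]
        )
        self.heads = nn.ModuleList([nn.Linear(d_slice, v) for v in vocab_sizes])
        self.d_slice = d_slice

    def forward(self, z: torch.Tensor) -> list[torch.Tensor]:
        # z: (B, T, K * d_slice)
        shared = self.shared(z)  # (B, T, d_full), context seen by every element
        logits = []
        for k, (dec, head) in enumerate(zip(self.decoders, self.heads)):
            z_k = z[..., k * self.d_slice : (k + 1) * self.d_slice]
            h_k, _ = dec(torch.cat([z_k, shared], dim=-1))
            logits.append(head(h_k))  # (B, T, vocab_k)
        return logits
```

The design intuition is that each element's decoder specializes on the latent slice carrying that element's semantics, while the shared context keeps the elements coherent with one another.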
Experimental results show that MusER outperforms state-of-the-art models for emotional music generation in both objective and subjective evaluations. Additionally, the authors rearrange music through element transfer, altering a piece's emotional quality by transferring emotion-distinguishable elements.
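Given the slice layout assumed above, element transfer reduces to swapping one element's latent slice between two encoded pieces before decoding. A hedged sketch, in which encoder, decoder, and the slice index are hypothetical placeholders:

```python
# A minimal sketch of element transfer between two encoded pieces, under the same
# assumed slice layout as above; encoder/decoder and the index k are placeholders.
import torch

def transfer_element(z_src: torch.Tensor, z_tgt: torch.Tensor,
                     k: int, d_slice: int) -> torch.Tensor:
    """Copy element k's latent slice from z_src into z_tgt."""
    z_out = z_tgt.clone()
    z_out[..., k * d_slice : (k + 1) * d_slice] = \
        z_src[..., k * d_slice : (k + 1) * d_slice]
    return z_out

# Hypothetical usage: if slice k=0 encodes pitch and pitch is found to
# distinguish valence, grafting it from a high-valence piece into a
# low-valence piece should nudge the decoded music toward higher valence.
# z_a, z_b = encoder(piece_a), encoder(piece_b)   # (B, T, K * d_slice)
# remix = decoder(transfer_element(z_a, z_b, k=0, d_slice=16))
```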
From a multidisciplinary perspective, this research integrates concepts from deep learning, music theory, and human emotion. It explores the relationship between musical elements and emotions, shedding light on how specific variations in pitch, duration, and other elements can evoke different emotional responses in listeners. The disentanglement and manipulation of these elements highlight the potential for more precise control over the emotional quality of music.
In the context of multimedia information systems and animations, this research contributes to the development of intelligent music generation algorithms. By understanding the connection between music, emotions, and different musical elements, systems can generate customized music for various contexts, such as video games, films, and virtual reality experiences. This approach enhances user engagement and immersion by creating music that aligns with the desired emotional atmosphere.
Furthermore, the latent-space disentanglement and interpretability techniques behind MusER could be applied to domains beyond music generation. Similar techniques could be used in augmented or virtual reality systems to create immersive and emotionally evocative experiences: the ability to manipulate specific elements and dimensions of the virtual environment can greatly enhance the user's sense of presence and emotional connection.