Despite recent progress in text-to-audio (TTA) generation, we show that the
state-of-the-art models, such as AudioLDM, trained on datasets with an
imbalanced class distribution, such as AudioCaps, are biased in their
generation performance. Specifically, they excel in generating common audio
classes while underperforming in the rare ones, thus degrading the overall
generation performance. We refer to this problem as long-tailed text-to-audio
generation. To address this issue, we propose a simple retrieval-augmented
approach for TTA models. Specifically, given an input text prompt, we first
leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve
relevant text-audio pairs. The features of the retrieved audio-text data are
then used as additional conditions to guide the learning of TTA models. We
enhance AudioLDM with our proposed approach and denote the resulting augmented
system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a
state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the
existing approaches by a large margin. Furthermore, we show that Re-AudioLDM
can generate realistic audio for complex scenes, rare audio classes, and even
unseen audio types, indicating its potential in TTA tasks.
Addressing Bias in Text-to-Audio Generation: A Multi-Disciplinary Approach
As technology continues to advance, text-to-audio (TTA) generation has seen significant progress. However, state-of-the-art models such as AudioLDM can inherit biases when trained on datasets with imbalanced class distributions, such as AudioCaps. This article introduces the concept of long-tailed text-to-audio generation, in which models excel at generating common audio classes but struggle with rare ones, degrading overall generation performance.
To combat this issue, the authors propose a retrieval-augmented approach for TTA models. The process involves leveraging a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs based on an input text prompt. The features of the retrieved audio-text data then guide the learning of TTA models. By enhancing AudioLDM with this approach, the researchers introduce Re-AudioLDM, which achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37 on the AudioCaps dataset.
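To make the retrieval step concrete, below is a minimal sketch of how CLAP embeddings could be used to fetch the most relevant text-audio pairs for a prompt. The helper names (clap_encode_text, the pre-computed caption_embeddings index) are assumptions for illustration, not the authors' code or the CLAP library's exact API.

```python
# Minimal sketch of CLAP-based retrieval (hypothetical helpers; not the
# authors' implementation). We assume a pre-computed matrix of L2-normalised
# CLAP text embeddings for the training captions.
import numpy as np

def clap_encode_text(texts):
    """Placeholder for a CLAP text encoder returning L2-normalised embeddings."""
    raise NotImplementedError  # e.g. wrap a pretrained CLAP checkpoint here

def retrieve_pairs(prompt, caption_embeddings, captions, audio_paths, k=10):
    """Return the top-k (caption, audio) pairs whose CLAP text embedding is
    closest to the prompt embedding under cosine similarity."""
    query = clap_encode_text([prompt])[0]        # shape (d,)
    sims = caption_embeddings @ query            # cosine similarity if rows are normalised
    top = np.argsort(-sims)[:k]
    return [(captions[i], audio_paths[i]) for i in top]

# The retrieved pairs are then encoded (e.g. with the CLAP audio/text towers)
# and supplied as additional conditioning features to the diffusion model.
```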
This work stands at the intersection of multiple disciplines. First, it draws upon natural language processing techniques to retrieve relevant text-audio pairs using the CLAP model. Second, it leverages machine learning methodologies to enhance TTA models with the retrieved audio-text data. Finally, it applies evaluation metrics from the field of multimedia information systems, specifically Frechet Audio Distance, to assess the performance of Re-AudioLDM.
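For readers unfamiliar with the metric, FAD compares the distribution of embeddings extracted from reference audio with that of generated audio (in practice using VGGish-style embeddings) via the Fréchet distance between fitted Gaussians. The sketch below shows only that underlying formula, not the official FAD tooling.

```python
# Sketch of the Fréchet distance underlying FAD: fit a Gaussian to each set of
# audio embeddings and compare mean and covariance. Illustrative only.
import numpy as np
from scipy import linalg

def frechet_distance(emb_ref, emb_gen):
    """emb_ref, emb_gen: arrays of shape (num_clips, embedding_dim)."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from numerical noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```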
The relevance of this research to multimedia information systems lies in its aim to improve the generation performance of TTA models. Generating realistic audio for complex scenes, rare audio classes, and even unseen audio types holds great potential for various multimedia applications. For instance, in animation, augmented reality, and virtual reality, the ability to generate high-quality and diverse audio content is crucial for creating immersive experiences. By addressing bias in TTA generation, Re-AudioLDM opens up new possibilities for enhancing multimedia systems across these domains.
In conclusion, the retrieval-augmented approach presented in this article shows how bias in text-to-audio generation can be addressed. Despite the challenges posed by datasets with imbalanced class distributions, Re-AudioLDM demonstrates state-of-the-art performance and the ability to generate realistic audio across different scenarios. Moving forward, further research in this area could explore applying similar approaches to other text-to-multimedia tasks, paving the way for more inclusive and accurate multimedia content creation.