Large audio-video language models can generate descriptions for both video
and audio. However, they sometimes ignore audio content, producing audio
descriptions solely reliant on visual information. This paper refers to this as
audio hallucinations and analyzes them in large audio-video language models. We
gather 1,000 sentences by inquiring about audio information and annotate whether each sentence contains a hallucination. If a sentence is hallucinated, we also
categorize the type of hallucination. The results reveal that 332 sentences are
hallucinated with distinct trends observed in nouns and verbs for each
hallucination type. Based on this, we tackle a task of audio hallucination
classification using pre-trained audio-text models in the zero-shot and
fine-tuning settings. Our experimental results reveal that the zero-shot models
achieve higher performance (52.2% F1) than a random baseline (40.3%), and the fine-tuned models achieve 87.9%, outperforming the zero-shot models.

Analysis of Audio Hallucinations in Large Audio-Video Language Models

In this paper, the authors address the issue of audio hallucinations in large audio-video language models. These models have the capability to generate descriptions for both video and audio content, but often ignore the audio aspect and rely solely on visual information, resulting in inaccurate audio descriptions. This phenomenon is referred to as audio hallucination.

To investigate this problem, the authors collected 1,000 sentences by specifically asking the models for audio information and then annotated each one to indicate whether it contained a hallucination. The analysis revealed that 332 sentences showed signs of audio hallucination. Additionally, the authors categorized each hallucinated sentence by hallucination type and observed distinct trends in the nouns and verbs associated with each type.
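The paper defines its own annotation scheme; purely as an illustration of what such labeled data might look like, here is a minimal sketch of a per-sentence record. The field names and the example type label are hypothetical and are not the paper's released schema.

```python
# Hypothetical per-sentence annotation record; field names and the example
# hallucination-type label are illustrative only, not the paper's schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioDescriptionAnnotation:
    video_id: str                  # clip the model was asked about
    sentence: str                  # model-generated audio description
    is_hallucinated: bool          # True if the described audio is not present
    hallucination_type: Optional[str] = None  # set only when is_hallucinated


example = AudioDescriptionAnnotation(
    video_id="clip_0001",
    sentence="A dog barks loudly in the background.",
    is_hallucinated=True,
    hallucination_type="sound not present in the audio",
)
```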

This research highlights the multidisciplinary nature of the concepts discussed, combining elements from multimedia information systems, animation, artificial reality, augmented reality, and virtual reality. By studying the limitations and inaccuracies of audio description generation, it contributes to the advancement of technologies that aim to create more immersive and realistic multimedia experiences.

The authors then tackle the task of audio hallucination classification using pre-trained audio-text models in both zero-shot and fine-tuning settings. The zero-shot models achieve an F1 score of 52.2%, outperforming random classification (40.3%), while the fine-tuned models perform even better, reaching 87.9% F1.
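As a rough sketch of how a pre-trained audio-text model could be applied to this task in a zero-shot fashion, the snippet below scores a generated sentence against the corresponding audio with CLAP and flags low-similarity sentences as likely hallucinations. The checkpoint, threshold, and decision rule are assumptions for illustration, not the paper's exact pipeline.

```python
# Zero-shot audio-hallucination check with a pre-trained audio-text model
# (CLAP). Minimal sketch under stated assumptions, not the paper's method.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

MODEL_NAME = "laion/clap-htsat-unfused"  # assumed checkpoint
model = ClapModel.from_pretrained(MODEL_NAME).eval()
processor = ClapProcessor.from_pretrained(MODEL_NAME)


def audio_text_score(audio_path: str, sentence: str) -> float:
    """Return an audio-text similarity logit; low values suggest the sentence
    describes sounds that are not actually present in the audio."""
    waveform, _ = librosa.load(audio_path, sr=48_000)  # CLAP expects 48 kHz
    inputs = processor(text=[sentence], audios=[waveform],
                       sampling_rate=48_000, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_audio[0, 0].item()


# Illustrative threshold only; in practice it would be calibrated on
# annotated data such as the 1,000 labeled sentences described above.
SCORE_THRESHOLD = 5.0


def is_hallucinated(audio_path: str, sentence: str) -> bool:
    return audio_text_score(audio_path, sentence) < SCORE_THRESHOLD
```

A fine-tuning setting would instead train a classifier on the annotated sentences (for example, on top of the same audio and text embeddings), which is consistent with the sizable gap the authors report between zero-shot and fine-tuned results.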

This research has significant implications across several domains. In multimedia information systems, it can inform improved algorithms for generating accurate and comprehensive audio descriptions of video content. For animation and virtual reality, it can enhance realism and immersion by incorporating more accurate audio representations. Furthermore, in augmented reality applications, where real-world scenes are augmented with virtual elements, accurate audio descriptions can give users a more interactive and engaging experience.

The findings and methodologies presented in this paper contribute to the broader field of multimedia information systems, as well as related areas such as animation, artificial reality, augmented reality, and virtual reality. The work highlights the importance of considering all sensory modalities when generating multimedia content and underscores the need for continued advances in audio processing and synthesis technologies.

Read the original article