arXiv:2407.21721v1
Abstract: Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the closed-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen nor heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of the prior knowledge from large-scale pre-trained vision-language models. To properly evaluate the open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.
Expert Commentary: Open-Vocabulary Audio-Visual Semantic Segmentation
In the field of multimedia information systems, audio-visual semantic segmentation (AVSS) is the task of segmenting and classifying the objects that produce sound in a video, using both the visual frames and the accompanying audio. Traditional AVSS approaches identify and classify only the pre-defined categories present in their training data. In practical applications, however, a system must also detect and recognize novel categories that never appear during training. This is where open-vocabulary AVSS comes into play.
Open-Vocabulary AVSS: A Challenging Task
Open-vocabulary audio-visual semantic segmentation extends AVSS to open-world scenarios beyond the annotated label space: the model must recognize and segment all categories, including those that have never been seen or heard during training. The task is challenging because the model must generalize to new categories without any labeled examples of them, relying instead on knowledge transferred from other sources.
The OV-AVSS Framework
The authors of this paper propose the first open-vocabulary AVSS framework called OV-AVSS. This framework consists of two main components:
- Universal sound source localization module: fuses audio and visual features to locate all potential sounding objects in the video, drawing on cues from both modalities to improve localization accuracy.
- Open-vocabulary classification module: assigns a category to each localized object using prior knowledge from large-scale pre-trained vision-language models, which allows it to recognize novel categories beyond the training label set (a minimal sketch of both stages follows this list).
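To make the two-stage design more concrete, below is a minimal PyTorch sketch of how such a pipeline could look: cross-attention fuses visual and audio tokens, and candidate regions are then classified by cosine similarity against text embeddings from a (notionally frozen) vision-language encoder. The module names, tensor shapes, and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch (not the OV-AVSS implementation): a toy two-stage
# audio-visual pipeline. Stage 1 fuses audio and visual tokens with
# cross-attention; stage 2 classifies candidate regions by cosine similarity
# to category text embeddings, standing in for a pre-trained vision-language
# model such as CLIP. All sizes and names are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualFusion(nn.Module):
    """Cross-attention: visual tokens query audio tokens (keys/values)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, Nv, D), audio_tokens: (B, Na, D)
        fused, _ = self.attn(visual_tokens, audio_tokens, audio_tokens)
        return self.norm(visual_tokens + fused)  # residual connection


def open_vocab_classify(region_embeds: torch.Tensor,
                        text_embeds: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Score each candidate region against category-name text embeddings.

    region_embeds: (B, Q, D) embeddings of candidate sounding objects.
    text_embeds:   (C, D) embeddings of category names (base + novel),
                   e.g. from a frozen text encoder.
    Returns class probabilities of shape (B, Q, C).
    """
    region_embeds = F.normalize(region_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = region_embeds @ text_embeds.T / temperature
    return logits.softmax(dim=-1)


if __name__ == "__main__":
    B, Nv, Na, Q, D, C = 2, 196, 10, 5, 256, 8  # toy sizes
    fusion = AudioVisualFusion(dim=D)
    fused_visual = fusion(torch.randn(B, Nv, D), torch.randn(B, Na, D))
    # Pretend the first Q fused tokens are the localized sounding-object queries.
    probs = open_vocab_classify(fused_visual[:, :Q], torch.randn(C, D))
    print(probs.shape)  # torch.Size([2, 5, 8])
```

Because the category set is represented only by text embeddings, adding a new (novel) category at test time amounts to encoding its name, with no retraining of the segmentation backbone.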
Evaluation and Results
To evaluate open-vocabulary AVSS, the authors construct the AVSBench-OV benchmark by splitting the categories of AVSBench-semantic into a base set used for training and a held-out novel set reserved for zero-shot testing. Experiments on this benchmark demonstrate the strong segmentation and zero-shot generalization ability of the OV-AVSS model.
On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU (mean intersection over union) on base categories and 29.14% mIoU on novel categories. These results surpass the state-of-the-art zero-shot method by 41.88% (base categories) and 20.61% (novel categories), as well as the open-vocabulary method by 10.2% (base categories) and 11.6% (novel categories).
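For readers less familiar with the metric, the short sketch below shows one way per-split mean IoU could be computed from predicted and ground-truth label maps. The category ids and masks are toy values chosen for illustration; this is not the AVSBench-OV evaluation code.

```python
# Hedged sketch: computing mIoU separately over base and novel category ids
# from dense label maps. Shapes and category splits are illustrative only.
import numpy as np


def miou(pred: np.ndarray, gt: np.ndarray, categories) -> float:
    """Mean IoU over the given category ids; skips categories absent from both maps."""
    ious = []
    for c in categories:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0


# Toy example: 0 = background, base categories {1, 2}, novel category {3}.
gt = np.array([[1, 1, 3],
               [2, 2, 3],
               [0, 0, 0]])
pred = np.array([[1, 1, 3],
                 [2, 0, 0],
                 [0, 0, 0]])
print("base mIoU:", miou(pred, gt, [1, 2]))   # 0.75: IoU 1.0 for class 1, 0.5 for class 2
print("novel mIoU:", miou(pred, gt, [3]))     # 0.5: the unseen class 3
```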
Implications and Future Directions
The concept of open-vocabulary audio-visual semantic segmentation has implications for a wide range of multimedia information systems. As the field progresses, the ability to recognize and segment novel categories without prior training data will become increasingly valuable in practical applications. Additionally, the integration of audio and visual cues, as demonstrated in the OV-AVSS framework, highlights the multidisciplinary nature of AVSS and its relevance to related fields such as animation, augmented reality, and virtual reality.
Future research can explore more advanced open-vocabulary AVSS models and datasets to push the boundaries of zero-shot generalization and enable practical applications in real-world scenarios. The public release of the OV-AVSS code on GitHub provides a valuable resource for researchers and practitioners interested in advancing the field.