arXiv:2403.19002v1 Announce Type: new
Abstract: This paper addresses the issue of active speaker detection (ASD) in noisy environments and formulates a robust active speaker detection (rASD) problem. Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance. To overcome this, we propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features. These features are then utilized in an ASD model, and both tasks are jointly optimized in an end-to-end framework. Our proposed framework mitigates residual noise and audio quality reduction issues that can occur in a naive cascaded two-stage framework that directly uses separated speech for ASD, and enables the two tasks to be optimized simultaneously. To further enhance the robustness of the audio features and handle inherent speech noises, we propose a dynamic weighted loss approach to train the speech separator. We also collected a real-world noise audio dataset to facilitate investigations. Experiments demonstrate that non-speech audio noises significantly impact ASD models, and our proposed approach improves ASD performance in noisy environments. The framework is general and can be applied to different ASD approaches to improve their robustness. Our code, models, and data will be released.

Active Speaker Detection in Noisy Environments: A Robust Approach

Active speaker detection (ASD) is an essential task in multimedia information systems: given an audio-visual stream, the goal is to determine which visible person, if any, is speaking at each moment. In real-world scenarios, however, ambient non-speech noise can significantly degrade the performance of ASD models. This paper introduces a robust approach, termed robust active speaker detection (rASD), which addresses the challenge of detecting the active speaker accurately in noisy environments.
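To make the task concrete, here is a minimal sketch of what an ASD model consumes and produces. All names (AsdHead, audio_feats, face_feats) and dimensions are illustrative assumptions, not details from the paper: the model sees synchronized per-frame audio and face-track features and emits a per-frame speaking score.

```python
import torch
import torch.nn as nn

class AsdHead(nn.Module):
    """Toy ASD head: fuse per-frame audio and face features into a speaking logit.
    Hypothetical module for illustration; not the paper's architecture."""
    def __init__(self, audio_dim=128, visual_dim=128, hidden=256):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + visual_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, audio_feats, face_feats):
        # audio_feats: (B, T, audio_dim); face_feats: (B, T, visual_dim)
        x = torch.cat([audio_feats, face_feats], dim=-1)
        return self.score(torch.relu(self.fuse(x))).squeeze(-1)  # (B, T) logits
```

For example, `AsdHead()(torch.randn(2, 50, 128), torch.randn(2, 50, 128))` yields a (2, 50) tensor of per-frame speaking logits for two clips of 50 frames each.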

Existing ASD approaches leverage both audio and visual modalities to improve accuracy. However, non-speech sounds in the surrounding environment can interfere with the speaker’s voice, leading to performance degradation. To overcome this, the proposed rASD framework introduces a novel strategy that utilizes audio-visual speech separation as guidance to learn noise-free audio features. These features are then fed into an ASD model in an end-to-end framework, where both the speech separation and ASD tasks are jointly optimized.
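The following is a hedged sketch of how such separation-guided joint training might be wired, assuming a shared audio encoder whose features feed both a separation head and the ASD model. Module names (RobustAudioEncoder, sep_head, joint_step), the batch keys, and the fixed weight `lam` are illustrative assumptions; the paper's actual architecture and objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RobustAudioEncoder(nn.Module):
    """Encoder pushed toward noise-suppressed audio features by the separation loss."""
    def __init__(self, n_mels=80, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, noisy_spec):   # noisy_spec: (B, T, n_mels)
        return self.net(noisy_spec)  # (B, T, feat_dim)

def joint_step(encoder, sep_head, asd_model, batch, optimizer, lam=0.5):
    """One joint training step. sep_head maps features back to the spectrogram
    space (e.g. nn.Linear(128, 80)); asd_model scores per-frame speaking."""
    feats = encoder(batch["noisy_spec"])
    # Separation branch: reconstruct clean speech from the shared features,
    # which guides the encoder toward noise-free representations.
    sep_loss = F.l1_loss(sep_head(feats), batch["clean_spec"])
    # ASD branch: classify per-frame speaking activity from the same features.
    logits = asd_model(feats, batch["face_feats"])
    asd_loss = F.binary_cross_entropy_with_logits(logits, batch["labels"].float())
    loss = asd_loss + lam * sep_loss  # single end-to-end objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

The design point mirrored here is that the separated speech itself is never handed to the ASD model; instead, the separation loss shapes the shared audio features, which is how the framework avoids the residual noise and quality loss of a cascaded pipeline.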

This approach connects to broader concerns in multimedia information systems. The integration of audio and visual modalities follows the field's core principle of processing and analyzing different forms of media jointly. The audio-visual speech separation component, which isolates speech from surrounding non-speech sound, also echoes techniques used in film and animation post-production, where dialogue must be separated from background audio; and robust ASD is directly relevant to augmented- and virtual-reality applications, where ambient noise is rarely controlled.

The proposed rASD framework also addresses the residual-noise and audio-quality-reduction issues that arise in a naive cascaded two-stage framework, where separated speech is fed directly into the ASD model. By jointly optimizing the speech separation and ASD tasks, the framework avoids this degradation. The dynamic weighted loss introduced to train the speech separator further enhances the robustness of the audio features, making the framework more resilient to noises inherent in the speech signals themselves.
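As a rough illustration of what a dynamically weighted separation loss could look like: the sketch below down-weights training samples whose reference speech is itself suspected to be noisy, using a per-sample confidence score. This particular weighting rule is an assumption for illustration, not the paper's exact formulation.

```python
import torch

def dynamic_weighted_sep_loss(pred_spec, ref_spec, confidence):
    """Illustrative dynamically weighted separation loss.
    pred_spec, ref_spec: (B, T, F) predicted and reference spectrograms.
    confidence: (B,) scores in [0, 1]; higher means the reference speech
    is believed to be cleaner (how to estimate it is left open here)."""
    per_sample = (pred_spec - ref_spec).abs().mean(dim=(1, 2))  # (B,) L1 errors
    weights = confidence / confidence.sum().clamp_min(1e-8)     # normalize over batch
    return (weights * per_sample).sum()
```

The intent is that samples with unreliable "clean" references contribute less to the separator's gradient, so inherent noise in the training speech does not corrupt the learned audio features.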

To validate the effectiveness of the rASD framework, the authors conducted experiments using a real-world noise audio dataset they collected for this purpose. The experiments demonstrate that non-speech audio noises have a significant impact on ASD models, confirming the need for robust approaches, and that the proposed framework improves ASD accuracy and robustness in noisy environments. Because the framework is general, it can also be applied to different existing ASD approaches to improve their robustness.

In conclusion, this paper presents the rASD framework, a robust approach to active speaker detection in noisy environments. Using audio-visual speech separation as guidance and jointly optimizing the two tasks end-to-end are central to its effectiveness. By tackling ambient noise, a problem shared across multimedia information systems, animation production, and augmented- and virtual-reality applications, the work's contribution extends beyond active speaker detection itself.

Read the original article