arXiv:2408.06468v1
Abstract: This paper presents a novel multi-channel speech enhancement approach, FoVNet, that enables highly efficient speech enhancement within a configurable field of view (FoV) of a smart-glasses user without needing specific target-talker(s) directions. It advances over prior works by enhancing all speakers within any given FoV, with a hybrid signal processing and deep learning approach designed with high computational efficiency. The neural network component is designed with ultra-low computation (about 50 MMACS). A multi-channel Wiener filter and a post-processing module are further used to improve perceptual quality. We evaluate our algorithm with a microphone array on smart glasses, providing a configurable, efficient solution for augmented hearing on energy-constrained devices. FoVNet excels in both computational efficiency and speech quality across multiple scenarios, making it a promising solution for smart glasses applications.
Enhancing Speech in Smart Glasses with FoVNet
In multimedia information systems, a key challenge is improving the user's experience across different types of content, and speech enhancement plays a crucial role in audio quality and user satisfaction. The recent emergence of smart glasses has opened new opportunities for audio augmentation and for virtual and augmented reality experiences. However, enhancing speech on smart glasses is uniquely challenging because these energy-constrained devices offer very limited computational budgets.
The FoVNet (Field of View Network) approach presented in this paper addresses these challenges with a novel multi-channel speech enhancement technique designed specifically for smart glasses. Unlike prior approaches that enhance specific target-talker directions, FoVNet enhances all speakers within a configurable field of view, without requiring the directions of individual talkers. Its hybrid design combines classical signal processing with a lightweight deep learning model to achieve high computational efficiency.
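To make the notion of a configurable field of view concrete, the sketch below shows one simple way a system could decide whether an estimated talker direction falls inside an FoV centered on the wearer's gaze. This is only an illustration of the geometric idea: the function name, angle convention, and gating logic are assumptions, and FoVNet itself does not rely on explicit per-talker direction estimates.

```python
def in_fov(doa_deg, fov_center_deg=0.0, fov_width_deg=60.0):
    """Return True if an estimated azimuth (degrees) lies inside a
    configurable field of view centered on the wearer's gaze.

    Illustrative only: names and the gating logic are assumptions,
    not FoVNet's actual interface.
    """
    # Wrap the angular difference into [-180, 180) before comparing.
    diff = (doa_deg - fov_center_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_width_deg / 2.0


# Example: with a 60-degree FoV straight ahead, a talker at -25 degrees
# falls inside the FoV, while one at 90 degrees falls outside it.
print(in_fov(-25.0))  # True
print(in_fov(90.0))   # False
```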
The neural network component of FoVNet is particularly noteworthy for its ultra-low computational cost of only about 50 MMACS (million multiply-accumulate operations per second). This low computational demand makes it well suited to smart glasses and other energy-constrained devices. A multi-channel Wiener filter and a post-processing module further improve the perceptual quality of the enhanced speech.
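For readers unfamiliar with the multi-channel Wiener filter, here is a minimal NumPy sketch of its standard per-frequency formulation, assuming the speech and noise spatial covariance matrices have already been estimated. It illustrates the general technique only; the paper's actual filter design, covariance estimation, and post-processing are not shown.

```python
import numpy as np

def mwf_weights(phi_x, phi_n, ref_mic=0):
    """Standard per-frequency multi-channel Wiener filter weights.

    phi_x : (M, M) estimated speech spatial covariance matrix
    phi_n : (M, M) estimated noise spatial covariance matrix
    Returns the (M,) filter estimating the speech at the reference mic:
    w = (phi_x + phi_n)^{-1} phi_x e_ref.
    """
    phi_y = phi_x + phi_n                  # noisy-signal covariance
    # Solve phi_y @ W = phi_x rather than forming an explicit inverse.
    w_all = np.linalg.solve(phi_y, phi_x)  # (M, M)
    return w_all[:, ref_mic]

def apply_mwf(Y, phi_x, phi_n, ref_mic=0):
    """Apply the filter to one STFT bin: Y is (M, T) complex frames."""
    w = mwf_weights(phi_x, phi_n, ref_mic)
    return w.conj() @ Y                    # (T,) enhanced frames
```

In practice such a filter is applied per frequency bin of a short-time Fourier transform, with the covariance estimates supplied by an upstream estimator (in FoVNet's case, informed by the neural network); those components are beyond this sketch.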
The evaluation of FoVNet with a microphone array mounted on smart glasses demonstrates its effectiveness as a configurable and efficient solution for augmented hearing. The algorithm excels in both computational efficiency and speech quality across multiple scenarios, making it a promising candidate for smart-glasses applications.
From a broader perspective, FoVNet’s contribution to the field of multimedia information systems is significant. It showcases the important role of signal processing and deep learning in enhancing speech and audio in multimedia applications. By addressing the specific challenges of smart glasses and energy-constrained devices, FoVNet expands the possibilities for creating immersive and realistic multimedia experiences in augmented and virtual realities.
Overall, the FoVNet approach presented in this paper is a valuable addition to speech enhancement within multimedia information systems. Its hybrid design, combining signal processing and deep learning, offers a practical way to enhance speech on smart glasses. As smart glasses continue to evolve and become more prevalent across industries, speech enhancement algorithms like FoVNet will play a crucial role in creating immersive and engaging multimedia experiences.