arXiv:2407.11492v1 Announce Type: cross
Abstract: Stuttering is a common speech impediment that is caused by irregular disruptions in speech production, affecting over 70 million people across the world. Standard automatic speech processing tools do not take speech ailments into account and are thereby not able to generate meaningful results when presented with stuttered speech as input. The automatic detection of stuttering is an integral step towards building efficient, context-aware speech processing systems. While previous approaches explore both statistical and neural approaches for stuttering detection, all of these methods are uni-modal in nature. This paper presents MMSD-Net, the first multi-modal neural framework for stuttering detection. Experiments and results demonstrate that incorporating the visual signal significantly aids stuttering detection, and our model yields an improvement of 2-17% in the F1-score over existing state-of-the-art uni-modal approaches.

The Multi-Disciplinary Nature of Stuttering Detection in Multimedia Information Systems

Stuttering, a common speech impediment affecting over 70 million people worldwide, presents a unique challenge for automatic speech processing systems. The irregular disruptions in speech production characteristic of this condition make it difficult for standard tools to generate meaningful results when presented with stuttered speech as input. The development of efficient, context-aware speech processing systems therefore requires a reliable method for automatic stuttering detection.

In recent years, researchers have explored both statistical and neural approaches to stuttering detection. However, all of these methods have been uni-modal, relying solely on the audio signal. This is where MMSD-Net, the multi-modal neural framework presented in the paper, breaks new ground.

By incorporating visual signals in addition to audio signals, MMSD-Net introduces a multi-modal approach to stuttering detection. This innovation opens up new possibilities and brings together expertise from various disciplines, including speech processing, computer vision, and multimedia information systems.
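To make the multi-modal idea concrete, here is a minimal late-fusion sketch in Python. This is purely illustrative and not the paper's architecture: MMSD-Net learns its fusion with neural components, whereas the fixed weighted average and the `fuse_scores`/`is_stutter` helpers below are hypothetical.

```python
# Hypothetical late-fusion sketch: combine independent audio and visual
# stutter scores into one decision. The fixed weight here stands in for
# what a real multi-modal model would learn from data.

def fuse_scores(audio_score: float, visual_score: float,
                audio_weight: float = 0.6) -> float:
    """Weighted average of per-modality stutter probabilities."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score

def is_stutter(audio_score: float, visual_score: float,
               threshold: float = 0.5) -> bool:
    """Classify a segment as stuttered if the fused score passes the threshold."""
    return fuse_scores(audio_score, visual_score) >= threshold

# A strong visual cue can tip a borderline audio score over the threshold,
# which is the intuition behind adding the visual modality.
print(is_stutter(0.45, 0.80))  # fused score: 0.6*0.45 + 0.4*0.80 = 0.59
```

The point of the sketch is only that the two modalities contribute complementary evidence: with audio alone (score 0.45) the segment would be missed, while the fused score crosses the decision threshold.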

Multimedia information systems, which span technologies such as animation, augmented reality, and virtual reality, rely on combining different modalities to enhance user experience and improve overall system effectiveness. The integration of visual signals into stuttering detection follows the same principle, leveraging the complementary nature of audio and visual information.

MMSD-Net’s ability to achieve a 2-17% improvement in the F1-score over existing state-of-the-art uni-modal approaches demonstrates the potential of multi-modal neural frameworks in stuttering detection. This finding not only contributes to the field of speech processing but also highlights the importance of incorporating multi-disciplinary approaches in related domains.
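For context on the reported metric, the F1-score is the harmonic mean of precision and recall over detection counts. The sketch below uses made-up counts (not the paper's data) to show how reducing missed stutter events raises F1:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall, from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: a uni-modal detector vs. a multi-modal one
# that recovers some missed stutter events (fewer false negatives).
uni_modal = f1_score(tp=70, fp=20, fn=30)    # ~0.737
multi_modal = f1_score(tp=80, fp=20, fn=20)  # 0.800
print(round(multi_modal - uni_modal, 3))
```

Because F1 penalizes both false alarms and misses, a gain of 2-17 points indicates that the visual signal is recovering genuine stutter events rather than simply flagging more segments.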

In conclusion, the development of MMSD-Net marks a significant advance in stuttering detection and showcases the benefits of a multi-modal approach. By integrating visual signals and drawing on expertise from fields such as computer vision and multimedia information systems, the researchers have paved the way for more context-aware and effective speech processing systems for individuals with speech impediments.
