Reducing Complexity and Enhancing Robustness in Speech Emotion Recognition
Representations derived from models like BERT and HuBERT have revolutionized speech emotion recognition, achieving remarkable performance. However, these representations come with a high memory and computational cost, as they were not specifically designed for emotion recognition tasks. In this article, we uncover lower-dimensional subspaces within these pre-trained representations that can significantly reduce model complexity without compromising emotion estimation accuracy. Furthermore, we introduce a novel approach to incorporate label uncertainty, in the form of grader opinion variance, into the models, resulting in improved generalization capacity and robustness. Additionally, we conduct experiments to evaluate the robustness of these emotion models against acoustic degradations and find that the reduced-dimensional representations maintain similar performance to their full-dimensional counterparts, making them highly promising for real-world applications.
Abstract: Representations derived from models such as BERT (Bidirectional Encoder Representations from Transformers) and HuBERT (Hidden-Unit BERT) have helped achieve state-of-the-art performance in dimensional speech emotion recognition. Despite their large dimensionality, and even though these representations are not tailored for emotion recognition tasks, they are frequently used to train large speech emotion models with high memory and computational costs. In this work, we show that there exist lower-dimensional subspaces within these pre-trained representational spaces that offer a reduction in downstream model complexity without sacrificing performance on emotion estimation. In addition, we model label uncertainty in the form of grader opinion variance, and demonstrate that such information can improve the model's generalization capacity and robustness. Finally, we compare the robustness of the emotion models against acoustic degradations and observe that the reduced-dimensional representations retain performance similar to that of the full-dimensional representations, without significant regression in dimensional emotion estimation.
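The abstract does not specify how the lower-dimensional subspaces are found or how grader variance enters training. As a minimal illustrative sketch (not the paper's actual method), the two ideas can be approximated with a PCA projection of the pre-trained embeddings and a grader-variance-weighted regression loss; the dimensions (768-dim features, 64-dim subspace) and the weighting scheme are assumptions for illustration only.

```python
import numpy as np

def fit_pca_subspace(embeddings, k):
    """Fit a k-dimensional linear subspace to pre-trained embeddings
    (n_samples x d) via PCA. Illustrative stand-in for whatever
    reduction the paper actually uses."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # SVD of the centered data; rows of vt are principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]  # (d,), (k, d) projection basis

def project(embeddings, mean, components):
    """Project full-dimensional embeddings into the subspace."""
    return (embeddings - mean) @ components.T

def uncertainty_weighted_mse(pred, target, grader_var, eps=1e-6):
    """Weight each utterance's squared error inversely by grader
    opinion variance, so low-agreement labels contribute less.
    A hypothetical scheme; the paper's formulation may differ."""
    w = 1.0 / (grader_var + eps)
    w = w / w.sum()
    return float(np.sum(w * (pred - target) ** 2))

# Example: reduce hypothetical 768-dim HuBERT-style features to 64 dims.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))
mean, comps = fit_pca_subspace(X, k=64)
Z = project(X, mean, comps)
print(Z.shape)  # (200, 64)
```

A downstream emotion regressor trained on `Z` instead of `X` then operates on a 12x smaller input, which is the complexity reduction the work targets.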