Speech emotion recognition (SER) systems aim to recognize the human emotional
state during human-computer interaction. Most existing SER systems are trained
on utterance-level labels. However, not all frames in an audio clip carry an
affective state consistent with the utterance-level label, which makes it
difficult for the model to identify the true emotion of the audio and causes
it to perform poorly. To address this problem, we propose a frame-level
emotional state alignment method for SER. First, we fine-tune a HuBERT model
with the task-adaptive pretraining (TAPT) method to obtain an SER system, and
cluster embeddings extracted from its transformer layers to form frame-level
pseudo-emotion labels. Then, the pseudo labels are used to pretrain HuBERT, so
that each frame output of HuBERT carries corresponding emotional information.
Finally, we fine-tune the pretrained HuBERT for SER by adding an attention
layer on top of it, which focuses only on those frames whose emotions are more
consistent with the utterance-level label. Experimental results on IEMOCAP
indicate that our proposed method outperforms state-of-the-art (SOTA) methods.

SER systems and the importance of frame-level emotional state alignment

Speech emotion recognition (SER) systems play a crucial role in human-computer interaction, as they aim to identify and understand human emotional states. However, most existing SER systems are trained based on utterance-level labels, which poses a challenge when it comes to accurately capturing emotions within individual frames of audio.

This lack of alignment between frame-level emotional states and utterance-level labels can lead to poor performance of SER models, as they struggle to differentiate the true emotion portrayed in the audio. To overcome this limitation, a frame-level emotional state alignment method has been proposed.

The role of task-adaptive pretraining (TAPT)

The proposed method utilizes a task-adaptive pretraining (TAPT) approach to fine-tune the HuBERT model. HuBERT is a transformer-based self-supervised speech model, and frame-level embeddings can be extracted from its transformer layers. After fine-tuning HuBERT with the TAPT method, these embeddings are clustered to generate frame-level pseudo-emotion labels.

Clustering allows for the grouping of frames with similar emotional characteristics, providing an approximation of emotional labels at the frame level. The extracted embeddings and pseudo labels are then used for pretraining HuBERT, aligning each frame output of the model with corresponding emotional information.
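The clustering step described above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the frame embeddings are simulated with random vectors (in practice they would come from the fine-tuned HuBERT's transformer layers), and the embedding dimension and cluster count are placeholder values.

```python
import numpy as np
from sklearn.cluster import KMeans

# Simulated stand-in for HuBERT frame embeddings: in the real method these
# would be extracted from a transformer layer of the TAPT-fine-tuned model.
rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(500, 64))  # (num_frames, hidden_dim)

# Cluster frames so that acoustically/emotionally similar frames share an id;
# the cluster count here is an assumption for illustration.
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(frame_embeddings)  # one pseudo-label per frame
```

Each frame's cluster id then serves as its pseudo-emotion label during the subsequent pretraining of HuBERT.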

The importance of attention and fine-tuning

To further enhance the SER system, an attention layer is added on top of the pretrained HuBERT model. This layer enables the model to focus on frames that are more emotionally consistent with the utterance-level label, improving the overall accuracy of the system.

Finally, the pretrained HuBERT model, along with the attention layer, undergoes fine-tuning specifically for SER. This fine-tuning process ensures that the model is optimized to recognize and classify emotions accurately, leveraging the information captured at both the frame and utterance levels.
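A simple way to realize such an attention layer is attentive pooling: each frame receives a learned scalar weight, and the weighted sum of frame features feeds an utterance-level classifier. The sketch below is an assumption about the layer's form, not the paper's exact architecture; the hidden dimension (768, typical of HuBERT-base) and the four emotion classes (common in IEMOCAP experiments) are illustrative choices.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Hypothetical attention layer placed on top of pretrained HuBERT:
    weights each frame by learned relevance before classification."""
    def __init__(self, hidden_dim: int, num_emotions: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)            # per-frame scalar score
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, hidden_dim), e.g. HuBERT frame outputs
        weights = torch.softmax(self.score(frames), dim=1)  # (B, T, 1), sums to 1 over frames
        pooled = (weights * frames).sum(dim=1)              # (B, hidden_dim)
        return self.classifier(pooled)                      # (B, num_emotions)

pool = AttentivePooling(hidden_dim=768, num_emotions=4)
logits = pool(torch.randn(2, 50, 768))  # two utterances, 50 frames each
```

During fine-tuning, the softmax weights let gradients concentrate on frames whose features match the utterance label, which is the intuition behind attending only to emotionally consistent frames.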

The potential applications and multi-disciplinary nature of the proposed method

The proposed frame-level emotional state alignment method has shown promising results when evaluated on the IEMOCAP dataset. By outperforming state-of-the-art methods in SER, this approach opens up new possibilities for real-world applications.

Moreover, the methodology presented in this article highlights the multi-disciplinary nature of SER. It combines techniques from natural language processing (NLP), audio signal processing, and machine learning to achieve accurate emotion recognition. This cross-disciplinary collaboration is essential for developing robust SER systems that can better understand and respond to human emotions, thereby enhancing human-computer interaction experiences across various domains.
