In this paper, we propose a multi-level speech emotion recognition (SER) system that captures both acoustic and linguistic emotional features using only speech input. In particular, our approach uses an acoustic feature extractor (HuBERT) to process the input waveform, capturing acoustic emotional features while simultaneously performing automatic speech recognition (ASR) to implicitly learn linguistic information. The ASR-decoded transcriptions are then fed into a linguistic feature extractor (BERT) to explicitly encode linguistic emotional features. To combine these features, we introduce a temporal gated fusion method that dynamically modulates the contribution of each modality, addressing modality incongruity issues. We further integrate multi-attribute learning for emotion-related attributes such as gender and speaking style to enhance SER performance. To address the gradient conflicts inherent in multi-attribute learning, we propose a two-stage fine-tuning framework employing adapters. Additionally, to mitigate the negative impact of ASR errors, we introduce an error correction module and a contrastive learning method that aligns representations learned from ground-truth text with those learned from the decoded transcriptions. Comprehensive experimental results on the IEMOCAP and MELD datasets demonstrate that our method enhances SER performance without requiring textual input. Compared to the acoustic model baseline, our approach achieves a 10.85% improvement in unweighted accuracy on the IEMOCAP dataset.
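
As a rough illustration of the temporal gated fusion idea mentioned above, the PyTorch-style sketch below computes a per-frame sigmoid gate from the concatenated acoustic and linguistic features and uses it as a convex combination of the two streams. The module name, the projection layers, and the assumption that the BERT-derived features have already been aligned to the acoustic frame rate are hypothetical choices for illustration; the paper's exact gating formulation may differ.

```python
import torch
import torch.nn as nn


class TemporalGatedFusion(nn.Module):
    """Minimal sketch of frame-level gated fusion of acoustic and linguistic
    features. The specific gating form (sigmoid gate over concatenated
    features, applied as a convex combination) is an assumption, not
    necessarily the formulation used in the paper."""

    def __init__(self, acoustic_dim: int, linguistic_dim: int, fused_dim: int):
        super().__init__()
        # Project both streams to a shared dimension before fusing.
        self.proj_a = nn.Linear(acoustic_dim, fused_dim)
        self.proj_l = nn.Linear(linguistic_dim, fused_dim)
        # Gate computed per time step from both modalities.
        self.gate = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, acoustic: torch.Tensor, linguistic: torch.Tensor) -> torch.Tensor:
        # acoustic:   (batch, T, acoustic_dim)   e.g. HuBERT frame features
        # linguistic: (batch, T, linguistic_dim) e.g. BERT features assumed
        #             to be aligned or resampled to the same T frames
        a = self.proj_a(acoustic)
        l = self.proj_l(linguistic)
        # Per-frame gate in [0, 1] controlling each modality's contribution.
        g = torch.sigmoid(self.gate(torch.cat([a, l], dim=-1)))
        # When the modalities disagree (modality incongruity), the gate can
        # down-weight the less reliable stream at the affected frames.
        return g * a + (1.0 - g) * l
```

In this sketch, the gate is recomputed at every frame, so the fusion can lean on the acoustic stream where the decoded transcription is unreliable and on the linguistic stream where the text is informative.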