
Multi-attribute Learning for Multi-level Emotion Recognition from Speech

Yuan Gao, Kyoto University, Japan, gao.yuan.75x@st.kyoto-u.ac.jp; Hao Shi, Kyoto University, Japan; Chenhui Chu, Kyoto University, Japan; Tatsuya Kawahara, Kyoto University, Japan
 
Suggested Citation
Yuan Gao, Hao Shi, Chenhui Chu and Tatsuya Kawahara (2025), "Multi-attribute Learning for Multi-level Emotion Recognition from Speech", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 2, e20. http://dx.doi.org/10.1561/116.20250025

Publication Date: 23 Jul 2025
© 2025 Y. Gao, H. Shi, C. Chu and T. Kawahara
 
Open Access

This article is published under the terms of the CC BY-NC license.


In this article:
Introduction 
Related Work 
Proposed Method 
Experimental Setup 
Results and Analysis 
Conclusion 
References 

Abstract

In this paper, we propose a multi-level speech emotion recognition (SER) system that captures both acoustic and linguistic emotional features using only speech input. In particular, our approach employs an acoustic feature extractor (HuBERT) to process the input waveform, capturing acoustic emotional features while simultaneously performing automatic speech recognition (ASR) to implicitly learn linguistic information. The ASR-decoded transcriptions are then fed into a linguistic feature extractor (BERT) to explicitly encode linguistic emotional features. To combine these features, we introduce a temporal gated fusion method that dynamically modulates the contribution of each modality, addressing modality incongruity issues. Furthermore, integrating multi-attribute learning for emotion-related attributes such as gender and speaking style enhances SER performance. To address the gradient conflicts inherent in multi-attribute learning, we propose a two-stage fine-tuning framework employing adapters. Additionally, to mitigate the negative impact of ASR errors, we introduce an error correction module and a contrastive learning method that aligns representations learned from the ground-truth text and the decoded transcriptions. Comprehensive experimental results on the IEMOCAP and MELD datasets validate that our method enhances SER performance without requiring textual input. Compared to the acoustic model baseline, our approach achieves a 10.85% improvement in unweighted accuracy on the IEMOCAP dataset.
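As a rough illustration of the temporal gated fusion idea described above (a minimal sketch, not the authors' implementation), the snippet below assumes the acoustic (e.g., HuBERT) and linguistic (e.g., BERT) feature sequences have already been projected to a common dimension and aligned to the same length; the module name, dimensions, and gating form are illustrative assumptions.

```python
# Illustrative sketch only: a per-frame gated fusion of two modality streams.
import torch
import torch.nn as nn


class TemporalGatedFusion(nn.Module):
    """Fuse acoustic and linguistic sequences with a learned per-frame gate."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate network inspects both modalities at every time step.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, acoustic: torch.Tensor, linguistic: torch.Tensor) -> torch.Tensor:
        # acoustic, linguistic: (batch, time, dim), assumed pre-aligned in time.
        g = self.gate(torch.cat([acoustic, linguistic], dim=-1))
        # g near 1 emphasizes the acoustic stream, near 0 the linguistic stream,
        # letting the model down-weight a modality when the two disagree.
        return g * acoustic + (1.0 - g) * linguistic


if __name__ == "__main__":
    fusion = TemporalGatedFusion(dim=256)
    a = torch.randn(2, 100, 256)  # e.g., projected HuBERT frame features
    t = torch.randn(2, 100, 256)  # e.g., projected BERT token features
    print(fusion(a, t).shape)     # torch.Size([2, 100, 256])
```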

DOI: 10.1561/116.20250025