This paper proposes a novel speaker-specific articulatory feature (AF) extraction model based on knowledge distillation (KD) for speaker recognition. First, an AF extractor is trained as a teacher model to extract AF profiles from the input speech. Next, a KD-based speaker embedding extraction method is proposed that distills the speaker-specific information in the teacher's AF profiles into a student model through multi-task learning: the student's lower layers not only capture speaker characteristics from acoustic features but also learn speaker-specific features from the AF profiles, yielding a more robust speaker representation. Finally, speaker embeddings are extracted from a high-level layer of the student and used to train a probabilistic linear discriminant analysis (PLDA) model for speaker recognition. In the experiments, the speaker embedding models were trained on the VoxCeleb2 dataset, the AF extractor was trained on the LibriSpeech dataset, and performance was evaluated on the VoxCeleb1 dataset. The results show that the proposed KD-based models outperform the baseline models without KD, and that concatenating the multimodal features improves performance further.
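To make the multi-task KD setup concrete, the sketch below shows one plausible form of the student objective: a speaker-classification loss on acoustic features combined with a distillation loss that pulls the lower layers toward the frozen teacher's AF profiles. This is a minimal illustration, not the authors' implementation; the layer sizes, the MSE distillation criterion, and the weight `alpha` are all assumptions.

```python
# Minimal sketch (assumed PyTorch) of the multi-task KD objective described
# in the abstract. All architecture details and hyperparameters here are
# illustrative, not taken from the paper.
import torch
import torch.nn as nn

class StudentEmbedder(nn.Module):
    def __init__(self, feat_dim=80, af_dim=24, emb_dim=192, n_speakers=5994):
        super().__init__()
        # Lower layers shared by both tasks (speaker + AF distillation).
        self.lower = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        # High-level layer from which speaker embeddings are extracted.
        self.embed = nn.Linear(512, emb_dim)
        self.spk_head = nn.Linear(emb_dim, n_speakers)  # speaker classification
        self.af_head = nn.Linear(512, af_dim)           # predicts teacher AF profiles

    def forward(self, x):
        h = self.lower(x)
        emb = self.embed(h)
        return self.spk_head(emb), self.af_head(h), emb

def multitask_kd_loss(spk_logits, af_pred, spk_labels, teacher_af, alpha=0.5):
    # Speaker loss from acoustic features, plus a distillation term toward
    # the frozen teacher's AF profiles (MSE is an assumption; the paper may
    # use a different distillation criterion).
    ce = nn.functional.cross_entropy(spk_logits, spk_labels)
    kd = nn.functional.mse_loss(af_pred, teacher_af)
    return ce + alpha * kd
```

In such a setup, `alpha` balances how strongly the shared lower layers are regularized by the teacher's AF profiles relative to the speaker-classification objective; at test time only the embedding path would be used, with the embeddings passed to the PLDA back end.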