APSIPA Transactions on Signal and Information Processing > Vol 14 > Issue 1

Target Speaker Extractor Training with Diverse Speaker Conditions and Synthetic Data

Yun Liu, National Institute of Informatics, Japan AND Sokendai, Japan, yunliu@nii.ac.jp; Xuechen Liu, National Institute of Informatics, Japan; Xiaoxiao Miao, Duke Kunshan University, China; Junichi Yamagishi, National Institute of Informatics, Japan AND Sokendai, Japan
 
Suggested Citation
Yun Liu, Xuechen Liu, Xiaoxiao Miao and Junichi Yamagishi (2025), "Target Speaker Extractor Training with Diverse Speaker Conditions and Synthetic Data", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e30. http://dx.doi.org/10.1561/116.20250054

Publication Date: 20 Oct 2025
© 2025 Y. Liu, X. Liu, X. Miao and J. Yamagishi
 
Subjects
Speech and spoken language processing,  Audio signal processing
 
Keywords
Target speaker extraction, curriculum learning, synthetic data, speech dataset
 

Open Access

This article is published under the terms of the CC BY-NC license.

In this article:
Introduction 
Related Work 
Libri2Vox Dataset 
Synthetic Libri2Vox Dataset 
Constructing TSE Models on Libri2Vox 
Experiments 
Main Results 
Evaluation on Real-world Recordings 
Ablation Study Regarding Synthetic Data 
Ablation Study Regarding Datasets 
Conclusion 
Appendix 
References 

Abstract

Target speaker extraction (TSE) is essential for many speech processing applications, particularly in complex acoustic environments. However, current TSE systems lack robustness under real-world conditions due to limited training-data diversity and unrealistic noise. To address these challenges, we first constructed Libri2Vox, a new dataset combining clean target speech from LibriTTS with interference speech from VoxCeleb2, which contains real acoustic variations, channel effects, and ambient conditions. To increase speaker variability, we augmented Libri2Vox with synthetic speakers generated by two speech anonymization methods: SynVox2 and SALT (speaker anonymization through latent space transformation). We further propose a three-stage curriculum learning approach that first trains a seed TSE model on real data and then progressively introduces synthetic speakers at varying levels of speaker similarity. Experiments with four neural TSE models show that Libri2Vox’s rich acoustic variations, combined with synthetic speaker integration through curriculum learning, consistently improve performance across common evaluation metrics. We also confirmed that the ratio of synthetic to real speakers matters for performance.
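The staged introduction of synthetic speakers described in the abstract can be pictured as a simple epoch-indexed schedule. The sketch below is illustrative only: the stage boundaries, ratios, and function names are assumptions for exposition, not values or code from the paper.

```python
import random

# Hypothetical three-stage curriculum: a seed TSE model trains on real
# mixtures only, after which synthetic (anonymized) target speakers are
# mixed in at an increasing ratio. All numbers below are assumptions.
STAGES = [
    {"name": "seed",   "epochs": range(0, 10),  "synthetic_ratio": 0.0},
    {"name": "mix",    "epochs": range(10, 20), "synthetic_ratio": 0.3},
    {"name": "refine", "epochs": range(20, 30), "synthetic_ratio": 0.5},
]

def synthetic_ratio(epoch: int) -> float:
    """Fraction of training mixtures whose target speaker is synthetic
    (e.g. SynVox2/SALT-generated) at the given epoch."""
    for stage in STAGES:
        if epoch in stage["epochs"]:
            return stage["synthetic_ratio"]
    # Past the last stage, keep the final mixing ratio.
    return STAGES[-1]["synthetic_ratio"]

def sample_speaker_pool(epoch: int, rng: random.Random) -> str:
    """Draw which pool ('synthetic' or 'real') one training example's
    target speaker comes from, per the current curriculum stage."""
    return "synthetic" if rng.random() < synthetic_ratio(epoch) else "real"
```

A data loader would consult `sample_speaker_pool` when assembling each mixture, so early epochs see only real speakers and later epochs see a controlled share of synthetic ones; the final ratio is the knob the abstract notes must be tuned.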

DOI:10.1561/116.20250054