APSIPA Transactions on Signal and Information Processing > Vol 14 > Issue 1

Target Speaker Extractor Training with Diverse Speaker Conditions and Synthetic Data

Yun Liu, National Institute of Informatics, Japan AND Sokendai, Japan, yunliu@nii.ac.jp; Xuechen Liu, National Institute of Informatics, Japan; Xiaoxiao Miao, Duke Kunshan University, China; Junichi Yamagishi, National Institute of Informatics, Japan AND Sokendai, Japan
 
Suggested Citation
Yun Liu, Xuechen Liu, Xiaoxiao Miao and Junichi Yamagishi (2025), "Target Speaker Extractor Training with Diverse Speaker Conditions and Synthetic Data", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e30. http://dx.doi.org/10.1561/116.20250054

Publication Date: 20 Oct 2025
© 2025 Y. Liu, X. Liu, X. Miao and J. Yamagishi
 
Subjects
Speech and spoken language processing,  Audio signal processing
 
Keywords
Target speaker extraction, curriculum learning, synthetic data, speech dataset
 

Open Access

This article is published under the terms of the CC BY-NC license.

In this article:
Introduction 
Related Work 
Libri2Vox Dataset 
Synthetic Libri2Vox Dataset 
Constructing TSE Models on Libri2Vox 
Experiments 
Main Results 
Evaluation on Real-world Recordings 
Ablation Study Regarding Synthetic Data 
Ablation Study Regarding Datasets 
Conclusion 
Appendix 
References 

Abstract

Target speaker extraction (TSE) is essential for many speech processing applications, particularly in complex acoustic environments. However, current TSE systems lack robustness under real-world conditions due to limited training-data diversity and unrealistic noise. To address these challenges, we first constructed Libri2Vox, a new dataset combining clean target speech from LibriTTS with interference speech from VoxCeleb2, which contains real acoustic variations, channel effects, and ambient conditions. To increase speaker variability, we augmented Libri2Vox with synthetic speakers generated by two speech anonymization methods: SynVox2 and SALT (speaker anonymization through latent space transformation). We further propose a three-stage curriculum learning approach that first trains a seed TSE model on real data and then progressively introduces synthetic speakers at varying levels of speaker similarity. Experiments with four neural TSE models show that Libri2Vox’s rich acoustic variations, combined with synthetic speaker integration through curriculum learning, consistently improve performance across common evaluation metrics. We also confirmed that the ratio of synthetic to real speakers matters for performance.
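The staged introduction of synthetic speakers described in the abstract can be pictured as a simple epoch-indexed schedule. The sketch below is illustrative only: the stage boundaries, ratios, and function names are assumptions for exposition, not values or code from the paper.

```python
import random

# Hypothetical three-stage curriculum: a seed TSE model trains on real
# mixtures only, after which synthetic (anonymized) target speakers are
# mixed in at an increasing ratio. All numbers below are assumptions.
STAGES = [
    {"name": "seed",   "epochs": range(0, 10),  "synthetic_ratio": 0.0},
    {"name": "mix",    "epochs": range(10, 20), "synthetic_ratio": 0.3},
    {"name": "refine", "epochs": range(20, 30), "synthetic_ratio": 0.5},
]

def synthetic_ratio(epoch: int) -> float:
    """Fraction of training mixtures whose target speaker is synthetic
    (e.g. SynVox2/SALT-generated) at the given epoch."""
    for stage in STAGES:
        if epoch in stage["epochs"]:
            return stage["synthetic_ratio"]
    # Past the last stage, keep the final mixing ratio.
    return STAGES[-1]["synthetic_ratio"]

def sample_speaker_pool(epoch: int, rng: random.Random) -> str:
    """Draw which pool ('synthetic' or 'real') one training example's
    target speaker comes from, per the current curriculum stage."""
    return "synthetic" if rng.random() < synthetic_ratio(epoch) else "real"
```

A data loader would consult `sample_speaker_pool` when assembling each mixture, so early epochs see only real speakers and later epochs see a controlled share of synthetic ones; the final ratio is the knob the abstract notes must be tuned.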

DOI:10.1561/116.20250054