
Target Speaker Extraction under Noisy Underdetermined Conditions Using Conditional Variational Autoencoder, Global Style Token, and Neural Postfilter

Rui Wang, Nagoya University, Japan, rui.wang@g.sp.m.is.nagoya-u.ac.jp; Takuya Fujimura, Nagoya University, Japan; Tomoki Toda, Nagoya University, Japan
 
Suggested Citation
Rui Wang, Takuya Fujimura and Tomoki Toda (2025), "Target Speaker Extraction under Noisy Underdetermined Conditions Using Conditional Variational Autoencoder, Global Style Token, and Neural Postfilter", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e2. http://dx.doi.org/10.1561/116.20240067

Publication Date: 27 Jan 2025
© 2025 R. Wang, T. Fujimura and T. Toda
 
Subjects
Speech and spoken language processing
 
Keywords
Target speaker extraction, multichannel source separation, conditional variational autoencoder (CVAE), speech enhancement
 


Open Access

This article is published under the terms of the CC BY-NC license.


In this article:
Introduction 
Directional TSE Based on Dual-Channel System 
Related Work on TSE under Noisy Underdetermined Conditions 
Proposed Method for Enhancing the Extracted Target 
Experimental Evaluation 
Conclusion 
References 

Abstract

Target speaker extraction (TSE) acts as a front-end processing technology for various speech applications, such as automatic speech recognition. However, TSE has long faced challenges in underdetermined environments and in the presence of noise. In this paper, we propose a dual-channel system for directional TSE under noisy underdetermined conditions. In our approach, we utilize two source models that integrate conditional variational autoencoders (CVAEs) with global style tokens (GSTs) to learn representations of the noisy single-speaker speech and the noisy mixed speech within a geometric source separation framework, where the GSTs generate conditional variables for the CVAEs. To address residual noise in the extracted target signal under various noisy conditions, we introduce a conditional neural postfilter with a GST that estimates a complex time-frequency (T-F) mask for denoising. Additionally, we propose a joint network in which the conditional neural postfilter is trained jointly with a CVAE and a shared GST module. The experimental results demonstrate that our proposed dual-channel TSE method achieves better performance than conventional approaches under noisy underdetermined conditions.
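To make the postfiltering step of the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of a conditional neural postfilter that predicts a complex T-F mask and applies it to the STFT of the extracted target signal. The conditioning vector stands in for a GST-style embedding; all layer sizes, names, and the 512-point STFT setup are assumptions for illustration only.

```python
# Hypothetical sketch of a conditional complex-mask postfilter (assumptions,
# not the paper's architecture): a small frame-wise network, conditioned on a
# style embedding, predicts real and imaginary mask parts that are applied to
# the complex STFT of the extracted target to suppress residual noise.
import torch
import torch.nn as nn

class ConditionalComplexMaskPostfilter(nn.Module):
    def __init__(self, n_freq: int = 257, cond_dim: int = 128, hidden: int = 256):
        super().__init__()
        # Per-frame input: magnitude spectrum concatenated with the conditioning vector.
        self.net = nn.Sequential(
            nn.Linear(n_freq + cond_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * n_freq),  # real and imaginary mask components
        )

    def forward(self, spec: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # spec: complex STFT of the extracted target, shape (batch, freq, time)
        # cond: conditioning (style) embedding, shape (batch, cond_dim)
        mag = spec.abs().transpose(1, 2)                   # (batch, time, freq)
        c = cond.unsqueeze(1).expand(-1, mag.size(1), -1)  # broadcast over frames
        mask = self.net(torch.cat([mag, c], dim=-1))       # (batch, time, 2*freq)
        mask_r, mask_i = mask.chunk(2, dim=-1)
        complex_mask = torch.complex(mask_r, mask_i).transpose(1, 2)
        return complex_mask * spec                         # masked (denoised) STFT

# Example usage with an assumed 16 kHz signal and a 512-point STFT.
x = torch.randn(1, 16000)
spec = torch.stft(x, n_fft=512, hop_length=128, return_complex=True)  # (1, 257, T)
cond = torch.randn(1, 128)  # stand-in for a GST-derived embedding
enhanced_spec = ConditionalComplexMaskPostfilter()(spec, cond)
enhanced = torch.istft(enhanced_spec, n_fft=512, hop_length=128)
```

In the paper's joint network, such a postfilter would be trained together with a CVAE source model and a shared GST module rather than in isolation as shown here.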

DOI:10.1561/116.20240067