APSIPA Transactions on Signal and Information Processing > Vol 8 > Issue 1

Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks

Zhaojie Luo, Kobe University, Japan, luozhaojie@me.cs.scitec.kobe-u.ac.jp; Jinhui Chen, Kobe University, Japan; Tetsuya Takiguchi, Kobe University, Japan; Yasuo Ariki, Kobe University, Japan
 
Suggested Citation
Zhaojie Luo, Jinhui Chen, Tetsuya Takiguchi and Yasuo Ariki (2019), "Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks", APSIPA Transactions on Signal and Information Processing: Vol. 8: No. 1, e10. http://dx.doi.org/10.1017/ATSIP.2019.3

Publication Date: 04 Mar 2019
© 2019 Zhaojie Luo, Jinhui Chen, Tetsuya Takiguchi and Yasuo Ariki
 
Keywords
Continuous wavelet transform; Emotional voice conversion; Generative adversarial networks; Variational autoencoder; F0 features
 

Open Access

This is published under the terms of the Creative Commons Attribution licence.

In this article:
I. INTRODUCTION 
II. FEATURE EXTRACTION AND PROCESSING 
III. TRAINING MODEL: VA-GAN 
IV. EXPERIMENTS 
V. CONCLUSIONS 

Abstract

In this paper, we propose a novel neutral-to-emotional voice conversion (VC) model that can effectively learn a mapping from neutral to emotional speech with limited emotional voice data. Although conventional VC techniques have achieved tremendous success in spectral conversion, the lack of representations of the fundamental frequency (F0), which explicitly carries prosody information, is still a major limiting factor for emotional VC. To overcome this limitation, we outline the practical elements of the cross-wavelet transform (XWT) method and show how it can be applied to synthesize diverse representations of F0 features for emotional VC. The idea is (1) to decompose F0 into representations at different temporal levels using the continuous wavelet transform (CWT); (2) to use XWT to combine different CWT-F0 features and synthesize interaction XWT-F0 features; and (3) to use both the CWT-F0 features and the corresponding XWT-F0 features to train the emotional VC model. Moreover, to better measure similarities between the converted and real F0 features, we apply a VA-GAN training model, which combines a variational autoencoder (VAE) with a generative adversarial network (GAN). In the VA-GAN model, the VAE learns the latent representations of high-dimensional features (CWT-F0, XWT-F0), while the discriminator of the GAN can use the learned feature representations as a basis for a VAE reconstruction objective.
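As a rough illustration of steps (1) and (2) above, the following NumPy sketch decomposes two F0 contours with a CWT and combines the coefficients into a cross-wavelet term. It is not the authors' implementation: the Mexican hat mother wavelet, the five dyadic scales, the synthetic contours, and the function names `mexican_hat`, `cwt`, `xwt`, and `znorm` are all assumptions made for the example. The cross-wavelet term uses the standard definition, the product of one signal's CWT with the complex conjugate of the other's.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet; an assumed choice for this sketch."""
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def cwt(signal, scales):
    """CWT by direct convolution; returns an array of shape (len(scales), len(signal))."""
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    coeffs = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Truncate the wavelet support so the kernel never exceeds the signal length.
        half = min(int(5 * s), (n - 1) // 2)
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)
        coeffs[i] = np.convolve(signal, kernel, mode="same")
    return coeffs

def xwt(a, b, scales):
    """Cross-wavelet transform: CWT of a times the complex conjugate of the CWT of b."""
    return cwt(a, scales) * np.conj(cwt(b, scales))

def znorm(x):
    """Zero-mean, unit-variance normalization of a contour."""
    return (x - x.mean()) / x.std()

# Synthetic stand-ins for interpolated log-F0 contours (hypothetical data).
rng = np.random.default_rng(0)
frames = np.arange(400)
neutral = np.sin(2 * np.pi * frames / 100.0) + 0.1 * rng.standard_normal(400)
emotional = 1.5 * np.sin(2 * np.pi * frames / 80.0) + 0.1 * rng.standard_normal(400)

scales = 2.0 ** np.arange(1, 6)  # assumed dyadic scales: 2, 4, 8, 16, 32 frames
cwt_neutral = cwt(znorm(neutral), scales)      # CWT-F0 features, one row per scale
cwt_emotional = cwt(znorm(emotional), scales)
xwt_f0 = xwt(znorm(neutral), znorm(emotional), scales)  # XWT-F0 interaction features
```

Each row of `cwt_neutral` describes the contour at one temporal scale, and `xwt_f0` is large where both contours carry energy at the same scale and frame. Because the assumed wavelet is real-valued, the conjugate has no effect here; with a complex wavelet such as Morlet, the result would also carry relative phase.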

DOI:10.1017/ATSIP.2019.3