now publishers - Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks

APSIPA Transactions on Signal and Information Processing > Vol 8 > Issue 1

Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks

Zhaojie Luo, Kobe University, Japan, luozhaojie@me.cs.scitec.kobe-u.ac.jp; Jinhui Chen, Kobe University, Japan; Tetsuya Takiguchi, Kobe University, Japan; Yasuo Ariki, Kobe University, Japan

Suggested Citation

Zhaojie Luo, Jinhui Chen, Tetsuya Takiguchi and Yasuo Ariki (2019), "Neutral-to-emotional voice conversion with cross-wavelet transform F0 using generative adversarial networks", APSIPA Transactions on Signal and Information Processing: Vol. 8: No. 1, e10. http://dx.doi.org/10.1017/ATSIP.2019.3

Publication Date: 04 Mar 2019

Keywords

Continuous wavelet transform, Emotional voice conversion, Generative adversarial networks, Variational autoencoder, F0 features

Open Access

This is published under the terms of the Creative Commons Attribution licence.

Abstract

In this paper, we propose a novel neutral-to-emotional voice conversion (VC) model that can effectively learn a mapping from neutral to emotional speech with limited emotional voice data. Although conventional VC techniques have achieved tremendous success in spectral conversion, the lack of representations of the fundamental frequency (F0), which explicitly carries prosody information, is still a major limiting factor for emotional VC. To overcome this limitation, we outline the practical elements of the cross-wavelet transform (XWT) method and show how it is applied to synthesize diverse representations of F0 features in emotional VC. The idea is (1) to decompose F0 into representations at different temporal levels using the continuous wavelet transform (CWT); (2) to use XWT to combine different CWT-F0 features into interaction XWT-F0 features; and (3) to use both the CWT-F0 and the corresponding XWT-F0 features to train the emotional VC model. Moreover, to better measure similarities between the converted and real F0 features, we apply a VA-GAN training model, which combines a variational autoencoder (VAE) with a generative adversarial network (GAN). In the VA-GAN model, the VAE learns latent representations of the high-dimensional features (CWT-F0, XWT-F0), while the discriminator of the GAN uses the learned feature representations as a basis for the VAE reconstruction objective.
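
The feature pipeline described in the abstract (CWT decomposition of F0, followed by XWT combination of CWT-F0 features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The Mexican hat mother wavelet, the dyadic scales, the zero-mean and unit-variance normalization of log-F0, the synthetic contours, and the choice to pair a neutral contour with an emotional one are all assumptions made for illustration, and the helper names `mexican_hat`, `cwt_f0`, and `xwt` are hypothetical. The cross-wavelet transform is taken in its standard form, the product of one wavelet transform with the complex conjugate of the other.

```python
import numpy as np


def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet evaluated at dimensionless time t."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)


def cwt_f0(log_f0, scales):
    """Decompose a continuous log-F0 contour into one row per temporal scale.

    Returns an array of shape (len(scales), len(log_f0)).
    """
    x = np.asarray(log_f0, dtype=float)
    x = (x - x.mean()) / x.std()  # zero mean, unit variance (assumed preprocessing)
    out = np.empty((len(scales), x.size))
    for i, s in enumerate(scales):
        half = int(np.ceil(5 * s))  # truncate the wavelet at +/- 5 scale widths
        t = np.arange(-half, half + 1) / s
        kernel = mexican_hat(t) / np.sqrt(s)
        full = np.convolve(x, kernel, mode="full")
        out[i] = full[half:half + x.size]  # keep the centered part
    return out


def xwt(w_x, w_y):
    """Cross-wavelet transform of two CWT arrays: W_x times conj(W_y)."""
    return w_x * np.conj(w_y)


# Synthetic contours in Hz (illustrative only, not data from the paper).
frames = np.arange(200)
neutral_f0 = 120.0 + 10.0 * np.sin(2 * np.pi * frames / 100.0)
emotional_f0 = 150.0 + 30.0 * np.sin(2 * np.pi * frames / 100.0 + 0.5)

scales = [1, 2, 4, 8, 16]  # hypothetical dyadic scales, in frames
cwt_neutral = cwt_f0(np.log(neutral_f0), scales)
cwt_emotional = cwt_f0(np.log(emotional_f0), scales)
xwt_features = xwt(cwt_neutral, cwt_emotional)

print(cwt_neutral.shape, xwt_features.shape)
```

Because the Mexican hat wavelet is real-valued, the conjugate has no effect here and the XWT reduces to an element-wise product. With a complex wavelet such as Morlet, the same `xwt` function would also carry the relative phase between the two contours. Real F0 tracks would first need interpolation over unvoiced frames so that the contour is continuous before the transform is applied.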

DOI:10.1017/ATSIP.2019.3

I. INTRODUCTION
II. FEATURE EXTRACTION AND PROCESSING
III. TRAINING MODEL: VA-GAN
IV. EXPERIMENTS
V. CONCLUSIONS
