APSIPA Transactions on Signal and Information Processing > Vol 9 > Issue 1

An evaluation of voice conversion with neural network spectral mapping models and WaveNet vocoder

Patrick Lumban Tobing, Graduate School of Information Science, Nagoya University, Japan, patrick.lumbantobing@g.sp.m.is.nagoya-u.ac.jp , Yi-Chiao Wu, Graduate School of Information Science, Nagoya University, Japan, Tomoki Hayashi, Graduate School of Information Science, Nagoya University, Japan, Kazuhiro Kobayashi, Information Technology Center, Nagoya University, Japan, Tomoki Toda, Information Technology Center, Nagoya University, Japan
 
Suggested Citation
Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi and Tomoki Toda (2020), "An evaluation of voice conversion with neural network spectral mapping models and WaveNet vocoder", APSIPA Transactions on Signal and Information Processing: Vol. 9: No. 1, e26. http://dx.doi.org/10.1017/ATSIP.2020.24

Publication Date: 25 Nov 2020
© 2020 Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi and Tomoki Toda
 
Keywords
Voice conversion; Neural network; Spectral mapping; WaveNet vocoder; Oversmoothed parameters
 

Open Access

This is published under the terms of the Creative Commons Attribution licence.

In this article:
I. INTRODUCTION 
II. COMPARISON TO PREVIOUS WORK 
III. SPECTRAL CONVERSION MODELS WITH NN-BASED ARCHITECTURES 
IV. WAVEFORM GENERATION MODELS WITH WN VOCODER 
V. EXPERIMENTAL EVALUATION 
VI. CONCLUSION 

Abstract

This paper presents an evaluation of parallel voice conversion (VC) with neural network (NN)-based statistical models for spectral mapping and waveform generation. The NN-based architectures for spectral mapping include deep NN (DNN), deep mixture density network (DMDN), and recurrent NN (RNN) models. The WaveNet (WN) vocoder is employed as a high-quality NN-based waveform generation model. In VC, however, quality degradation still occurs owing to the oversmoothed characteristics of the estimated speech parameters. To address this problem, we apply post-conversion to the converted features, based on direct waveform modification with spectrum differential and a global variance postfilter. To preserve consistency with the post-conversion, we further propose a spectrum differential loss for the spectral modeling. The experimental results demonstrate that: (1) the RNN-based spectral modeling achieves higher accuracy, with a faster convergence rate and better generalization, than the DNN-/DMDN-based models; (2) the RNN-based spectral modeling is also capable of producing a less oversmoothed spectral trajectory; (3) the use of the proposed spectrum differential loss improves the performance in same-gender conversions; and (4) the proposed post-conversion on converted features for the WN vocoder in VC yields the best performance in both naturalness and speaker similarity compared to the conventional use of the WN vocoder.
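The abstract names two ingredients, a spectrum differential and a global variance (GV) postfilter. As a rough illustration, they can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the formulation used in the paper: the function names, the toy random features, the target GV value, and the order of operations (GV postfilter first, then the differential) are all hypothetical choices made for the example.

```python
import numpy as np

def spectrum_differential(source_feats, converted_feats):
    """Per-frame difference between converted and source spectral features."""
    return converted_feats - source_feats

def gv_postfilter(converted_feats, target_gv, eps=1e-8):
    """Rescale each feature dimension so its variance over time matches target_gv.

    The per-dimension mean is left unchanged; only the spread around it is scaled.
    """
    mean = converted_feats.mean(axis=0, keepdims=True)
    converted_gv = converted_feats.var(axis=0, keepdims=True)
    scale = np.sqrt(target_gv / (converted_gv + eps))
    return scale * (converted_feats - mean) + mean

# Toy data (hypothetical): 200 frames of 4-dimensional spectral features.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 4))
# An "oversmoothed" converted trajectory: too little variance over time.
converted = 0.5 * rng.normal(size=(200, 4)) + 1.0
target_gv = np.full(4, 1.0)  # assumed GV statistics of the target speaker

filtered = gv_postfilter(converted, target_gv)
diff = spectrum_differential(source, filtered)
```

In this sketch, adding `diff` back to `source` recovers the postfiltered features, which mirrors the idea of modifying the source signal by a differential instead of resynthesizing it from scratch.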

DOI:10.1017/ATSIP.2020.24