now publishers - Emotion-controllable Speech Synthesis Using Emotion Soft Label, Utterance-level Prosodic Factors, and Word-level Prominence

APSIPA Transactions on Signal and Information Processing > Vol 13 > Issue 1

Emotion-controllable Speech Synthesis Using Emotion Soft Label, Utterance-level Prosodic Factors, and Word-level Prominence

Xuan Luo, The University of Tokyo, Japan; Shinnosuke Takamichi, The University of Tokyo, Japan, shinnosuke_takamichi@ipc.i.u-tokyo; Yuki Saito, The University of Tokyo, Japan; Tomoki Koriyama, CyberAgent, Japan; Hiroshi Saruwatari, The University of Tokyo, Japan

Suggested Citation

Xuan Luo, Shinnosuke Takamichi, Yuki Saito, Tomoki Koriyama and Hiroshi Saruwatari (2024), "Emotion-controllable Speech Synthesis Using Emotion Soft Label, Utterance-level Prosodic Factors, and Word-level Prominence", APSIPA Transactions on Signal and Information Processing: Vol. 13: No. 1, e2. http://dx.doi.org/10.1561/116.00000242

Publication Date: 13 Feb 2024

Keywords

Emotion-controllable speech synthesis, Expressive speech synthesis, Controllable speech synthesis, Text to speech, Speech emotion recognition

Journal details

Open Access

This is published under the terms of CC BY-NC.

In this article:

Abstract

We propose a two-stage emotion-controllable text-to-speech (TTS) model that can increase the diversity of intra-emotion variation and also preserve inter-emotion controllability in synthesized speech. Conventional emotion-controllable TTS models increase the diversity of intra-emotion variation by controlling fine-grained emotion strengths; however, such models cannot control various prosodic factors (e.g., pitch). While other methods directly condition TTS models on intuitive prosodic factors, they cannot control emotions. Our proposed two-stage emotion-controllable TTS model extends the Tacotron2 model with a speech emotion recognizer (SER) and a prosodic factor generator (PFG) to solve this problem. In the first stage, we condition our model on emotion soft labels predicted by the SER model to enable inter-emotion controllability. In the second stage, we fine-condition our model on utterance-level prosodic factors and word-level prominence generated by the PFG model from emotion soft labels, which provides intra-emotion diversity. Due to this two-stage control design, we can increase intra-emotion diversity at both the utterance and word levels, and also preserve inter-emotion controllability. The experiments achieved 1) 51% emotion-distinguishable accuracy on average when conditioning on soft labels of three emotions, 2) average linear controllability scores of 0.95 when fine-conditioning on prosodic factors and prominence, respectively, and 3) comparable audio quality to conventional models.
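The two-stage control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the emotion names, the linear stand-in for the PFG, and every function and parameter name below are assumptions made for illustration only. The sketch shows stage 1 producing an emotion soft label from SER outputs, and stage 2 deriving prosodic factors from that label and letting a user rescale them.

```python
import math

# Hypothetical emotion inventory: the abstract says three emotions
# are used but does not name them here.
EMOTIONS = ("emotion_a", "emotion_b", "emotion_c")


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def emotion_soft_label(ser_logits):
    """Stage 1 (assumed form): SER scores become a soft label,
    i.e. a probability per emotion rather than a single hard class."""
    return softmax(ser_logits)


def generate_prosodic_factors(soft_label, weights, bias):
    """Stage 2 stand-in for the PFG: a plain linear map from the soft
    label to utterance-level factors. The real PFG is a learned model;
    the linear form is an assumption for this sketch."""
    return [
        sum(w * p for w, p in zip(row, soft_label)) + b
        for row, b in zip(weights, bias)
    ]


def fine_condition(factors, scale):
    """User-side control: rescale each generated factor before it is
    passed to the TTS model (hypothetical control interface)."""
    return [f * s for f, s in zip(factors, scale)]


if __name__ == "__main__":
    label = emotion_soft_label([2.0, 0.5, 0.1])
    # One illustrative factor (e.g. a pitch-like value), made-up weights.
    factors = generate_prosodic_factors(label, [[0.0, 1.0, -1.0]], [0.0])
    adjusted = fine_condition(factors, [1.2])
    print(dict(zip(EMOTIONS, label)), factors, adjusted)
```

In this sketch, changing the soft label changes which emotion the model is steered toward (inter-emotion control), while rescaling the generated factors varies the rendering within one emotion (intra-emotion diversity).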

DOI:10.1561/116.00000242

Introduction
Related Work
Proposed Work
Experimental Setup
Evaluation
Conclusion and Discussion
Appendix
References
