Emotion-controllable Speech Synthesis Using Emotion Soft Label, Utterance-level Prosodic Factors, and Word-level Prominence

Xuan Luo, Shinnosuke Takamichi, Yuki Saito, Tomoki Koriyama, Hiroshi Saruwatari

Research output: Contribution to journalArticlepeer-review

Abstract

We propose a two-stage emotion-controllable text-to-speech (TTS) model that can increase the diversity of intra-emotion variation and also preserve inter-emotion controllability in synthesized speech. Conventional emotion-controllable TTS models increase the diversity of intra-emotion variation by controlling fine-grained emotion strengths; however, such models cannot control various prosodic factors (e.g., pitch). While other methods directly condition TTS models on intuitive prosodic factors, they cannot control emotions. Our proposed two-stage emotion-controllable TTS model extends the Tacotron2 model with a speech emotion recognizer (SER) and a prosodic factor generator (PFG) to solve this problem. In the first stage, we condition our model on emotion soft labels predicted by the SER model to enable inter-emotion controllability. In the second stage, we fine-condition our model on utterance-level prosodic factors and word-level prominence generated by the PFG model from emotion soft labels, which provides intra-emotion diversity. Due to this two-stage control design, we can increase intra-emotion diversity at both the utterance and word levels, and also preserve inter-emotion controllability. The experiments achieved 1) 51% emotion-distinguishable accuracy on average when conditioning on soft labels of three emotions, 2) average linear controllability scores of 0.95 when fine-conditioning on prosodic factors and prominence, respectively, and 3) comparable audio quality to conventional models.

Original languageEnglish
Article numbere2
JournalAPSIPA Transactions on Signal and Information Processing
Volume13
Issue number1
DOIs
Publication statusPublished - 2024 Feb 13
Externally publishedYes

Keywords

  • controllable speech synthesis
  • Emotion-controllable speech synthesis
  • expressive speech synthesis
  • speech emotion recognition
  • text to speech

ASJC Scopus subject areas

  • Signal Processing
  • Information Systems

Fingerprint

Dive into the research topics of 'Emotion-controllable Speech Synthesis Using Emotion Soft Label, Utterance-level Prosodic Factors, and Word-level Prominence'. Together they form a unique fingerprint.

Cite this