TY - JOUR
T1 - Emotion-controllable Speech Synthesis Using Emotion Soft Label, Utterance-level Prosodic Factors, and Word-level Prominence
AU - Luo, Xuan
AU - Takamichi, Shinnosuke
AU - Saito, Yuki
AU - Koriyama, Tomoki
AU - Saruwatari, Hiroshi
N1 - Publisher Copyright:
© 2024 X. Luo, S. Takamichi, Y. Saito, T. Koriyama and H. Saruwatari.
PY - 2024/2/13
Y1 - 2024/2/13
N2 - We propose a two-stage emotion-controllable text-to-speech (TTS) model that increases the diversity of intra-emotion variation while preserving inter-emotion controllability in synthesized speech. Conventional emotion-controllable TTS models increase intra-emotion diversity by controlling fine-grained emotion strengths; however, they cannot control specific prosodic factors (e.g., pitch). Other methods condition TTS models directly on intuitive prosodic factors but cannot control emotions. To address this problem, our two-stage model extends Tacotron2 with a speech emotion recognizer (SER) and a prosodic factor generator (PFG). In the first stage, we condition the model on emotion soft labels predicted by the SER model, which enables inter-emotion controllability. In the second stage, we fine-condition the model on utterance-level prosodic factors and word-level prominence generated by the PFG model from the emotion soft labels, which provides intra-emotion diversity. Owing to this two-stage control design, the model increases intra-emotion diversity at both the utterance and word levels while preserving inter-emotion controllability. In experiments, it achieved 1) an average emotion-distinguishability accuracy of 51% when conditioned on the soft labels of three emotions, 2) an average linear controllability score of 0.95 when fine-conditioned on prosodic factors and prominence, and 3) audio quality comparable to that of conventional models.
AB - We propose a two-stage emotion-controllable text-to-speech (TTS) model that increases the diversity of intra-emotion variation while preserving inter-emotion controllability in synthesized speech. Conventional emotion-controllable TTS models increase intra-emotion diversity by controlling fine-grained emotion strengths; however, they cannot control specific prosodic factors (e.g., pitch). Other methods condition TTS models directly on intuitive prosodic factors but cannot control emotions. To address this problem, our two-stage model extends Tacotron2 with a speech emotion recognizer (SER) and a prosodic factor generator (PFG). In the first stage, we condition the model on emotion soft labels predicted by the SER model, which enables inter-emotion controllability. In the second stage, we fine-condition the model on utterance-level prosodic factors and word-level prominence generated by the PFG model from the emotion soft labels, which provides intra-emotion diversity. Owing to this two-stage control design, the model increases intra-emotion diversity at both the utterance and word levels while preserving inter-emotion controllability. In experiments, it achieved 1) an average emotion-distinguishability accuracy of 51% when conditioned on the soft labels of three emotions, 2) an average linear controllability score of 0.95 when fine-conditioned on prosodic factors and prominence, and 3) audio quality comparable to that of conventional models.
KW - controllable speech synthesis
KW - emotion-controllable speech synthesis
KW - expressive speech synthesis
KW - speech emotion recognition
KW - text-to-speech
UR - http://www.scopus.com/inward/record.url?scp=85189864509&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85189864509&partnerID=8YFLogxK
U2 - 10.1561/116.00000242
DO - 10.1561/116.00000242
M3 - Article
AN - SCOPUS:85189864509
SN - 2048-7703
VL - 13
JO - APSIPA Transactions on Signal and Information Processing
JF - APSIPA Transactions on Signal and Information Processing
IS - 1
M1 - e2
ER -