TY - JOUR
T1 - Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model
AU - Saeki, Takaaki
AU - Takamichi, Shinnosuke
AU - Saruwatari, Hiroshi
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
AB - This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of the output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality: it is challenging to produce high-quality speech in a low-latency setup that makes little use of the unobserved future sentence (hereafter, 'lookahead'). To resolve this issue, we propose an incremental TTS method that uses a pseudo lookahead generated with a language model to take future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading, and it uses the pretrained GPT-2 language model, which captures large-scale linguistic knowledge, to generate the lookahead. Evaluation results show that our method 1) achieves higher speech quality than a method that takes only the observed information into account and 2) achieves speech quality equivalent to that obtained by waiting for observation of the future context.
KW - contextual embedding
KW - end-to-end text-to-speech synthesis
KW - incremental text-to-speech synthesis
KW - language model
UR - http://www.scopus.com/inward/record.url?scp=85104652677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85104652677&partnerID=8YFLogxK
U2 - 10.1109/LSP.2021.3073869
DO - 10.1109/LSP.2021.3073869
M3 - Article
AN - SCOPUS:85104652677
SN - 1070-9908
VL - 28
SP - 857
EP - 861
JO - IEEE Signal Process. Lett.
JF - IEEE Signal Processing Letters
M1 - 9406329
ER -