TY - JOUR
T1 - Simultaneous Execution of Dereverberation, Denoising, and Speaker Separation Using a Neural Beamformer for Adapting Robots to Real Environments
AU - Nagano, Daichi
AU - Nakazawa, Kazuo
N1 - Publisher Copyright:
© Fuji Technology Press Ltd.
PY - 2022/12
Y1 - 2022/12
N2 - It remains challenging for robots to accurately perform sound source localization and speech recognition in a real environment with reverberation, noise, and the voices of multiple speakers. Accordingly, we propose “U-TasNet-Beam,” a speech extraction method for extracting only the target speaker’s voice from all ambient sounds in a real environment. U-TasNet-Beam is a neural beamformer comprising three elements: a neural network for removing reverberation and noise, a second neural network for separating the voices of multiple speakers, and a minimum variance distortionless response (MVDR) beamformer. Experiments with simulated data and recorded data show that the proposed U-TasNet-Beam can improve the accuracy of sound source localization and speech recognition in robots compared to the conventional methods in a noisy, reverberant, and multi-speaker environment. In addition, we propose the spatial correlation matrix loss (SCM loss) as a loss function for the neural network learning the spatial information of the sound. By using the SCM loss, we can improve the speech extraction performance of the neural beamformer.
AB - It remains challenging for robots to accurately perform sound source localization and speech recognition in a real environment with reverberation, noise, and the voices of multiple speakers. Accordingly, we propose “U-TasNet-Beam,” a speech extraction method for extracting only the target speaker’s voice from all ambient sounds in a real environment. U-TasNet-Beam is a neural beamformer comprising three elements: a neural network for removing reverberation and noise, a second neural network for separating the voices of multiple speakers, and a minimum variance distortionless response (MVDR) beamformer. Experiments with simulated data and recorded data show that the proposed U-TasNet-Beam can improve the accuracy of sound source localization and speech recognition in robots compared to the conventional methods in a noisy, reverberant, and multi-speaker environment. In addition, we propose the spatial correlation matrix loss (SCM loss) as a loss function for the neural network learning the spatial information of the sound. By using the SCM loss, we can improve the speech extraction performance of the neural beamformer.
KW - communication robot
KW - denoising
KW - dereverberation
KW - neural beamformer
KW - speech extraction
UR - http://www.scopus.com/inward/record.url?scp=85144285978&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85144285978&partnerID=8YFLogxK
U2 - 10.20965/jrm.2022.p1399
DO - 10.20965/jrm.2022.p1399
M3 - Article
AN - SCOPUS:85144285978
SN - 0915-3942
VL - 34
SP - 1399
EP - 1410
JO - Journal of Robotics and Mechatronics
JF - Journal of Robotics and Mechatronics
IS - 6
ER -