Speech Enhancement of Noisy and Reverberant Speech for Text-to-Speech

Authors: Cassia Valentini-Botinhao, Junichi Yamagishi

#Speech processing
#Speech enhancement

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Text-to-speech voices created from noisy and reverberant recordings are of lower quality. A simple way to improve them is to raise the quality of the recordings before text-to-speech training, using speech enhancement methods such as noise suppression and dereverberation. In this paper, we take this approach and perform the enhancement with a recurrent neural network. The network is trained on parallel data of clean and lower quality recordings of speech. The lower quality data was created artificially by adding recordings of environmental noise to studio-quality recordings of speech and by convolving these clean recordings with room impulse responses. We trained separate networks on noise-only data, reverberation-only data, and data containing both reverberation and additive noise. In all cases, voices trained with lower quality data that had been enhanced by these networks were of significantly higher quality. For the noise-only case, the enhanced synthetic voice ranked as high as the voice trained with clean data. For the most realistic and challenging scenario, in which both noise and reverberation were present, the improvements were more modest but still significant.
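
The data-creation step described in the abstract (adding environmental noise to clean speech and convolving clean speech with a room impulse response) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the `snr_db` parameter, the truncation of the reverberant signal to the clean length, and the choice to apply reverberation before adding noise in the combined condition are all hypothetical and are not specified in the abstract.

```python
import numpy as np


def add_reverb(clean, rir):
    """Convolve clean speech with a room impulse response (RIR).

    Assumption: the output is truncated to the clean length so that the
    clean/degraded pair stays sample-aligned for parallel training.
    """
    wet = np.convolve(clean, rir, mode="full")
    return wet[: len(clean)]


def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a target signal-to-noise ratio in dB.

    Assumption: snr_db is a hypothetical parameter; the abstract does not
    state which SNR levels were used.
    """
    # Tile or trim the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that speech_power / (gain**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


def make_conditions(clean, noise, rir, snr_db):
    """Build the three degraded versions named in the abstract."""
    reverberant = add_reverb(clean, rir)
    return {
        "noise_only": add_noise(clean, noise, snr_db),
        "reverb_only": reverberant,
        # Assumption: noise is added after reverberation.
        "noise_and_reverb": add_noise(reverberant, noise, snr_db),
    }
```

Each degraded signal would then be paired with its clean original to form the parallel training data for the corresponding enhancement network. For long impulse responses, an FFT-based convolution would likely be faster than `np.convolve`.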
