Research
Research projects, papers, books, etc.
- Paper
Using text and acoustic features in predicting glottal excitation waveforms for parametric speech synthesis with recurrent neural networks
- #speech-processing
- #speech-synthesis
Interspeech 2016
This work studies the use of deep learning methods to directly model glottal excitation waveforms from context-dependent text features in a text-to-speech synthesis system. Glottal vocoding is integrated into a deep neural network-based text-to-speech framework where text and acoustic features can be flexibly used as either network inputs or outputs. Long short-term memory recurrent neural networks are utilised in two stages: first, in mapping text features to acoustic features, and second, in predicting glottal waveforms from the text and/or acoustic features. Results show that predicting the excitation directly from text features yields quality similar to predicting it from acoustic features, and both outperform a baseline system that uses a fixed glottal pulse for excitation generation.
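A minimal sketch of the two-stage LSTM setup the abstract describes, assuming a PyTorch framing; all dimensions, hidden sizes, and names here are illustrative assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical per-frame feature dimensions; the paper's actual
# configuration is not given in the abstract above.
TEXT_DIM = 300      # context-dependent linguistic (text) features
ACOUSTIC_DIM = 60   # vocoder acoustic parameters
WAVEFORM_DIM = 400  # glottal excitation waveform samples per frame

class LSTMRegressor(nn.Module):
    """Generic frame-level LSTM regressor, reused for both stages."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):    # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)
        return self.proj(h)  # (batch, frames, out_dim)

# Stage 1: map text features to acoustic features.
text_to_acoustic = LSTMRegressor(TEXT_DIM, ACOUSTIC_DIM)

# Stage 2: predict glottal excitation waveforms; per the abstract the
# input may be text features, acoustic features, or both (concatenated).
excitation_model = LSTMRegressor(TEXT_DIM + ACOUSTIC_DIM, WAVEFORM_DIM)

text = torch.randn(1, 100, TEXT_DIM)   # 100 frames of text features
acoustic = text_to_acoustic(text)      # predicted acoustic features
glottal = excitation_model(torch.cat([text, acoustic], dim=-1))
print(glottal.shape)                   # torch.Size([1, 100, 400])
```

The point of the two-stage split is that the excitation model can be trained and evaluated with different input conditions (text only, acoustic only, or both), which is the comparison the abstract reports.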