Interviews & Lectures

Rocling 2022

[Keynote Talk] Speech Synthesis Research 2.0

  • #GenerativeModels
  • #DeepfakeDetection
  • #SpeechProcessing
  • #ImageProcessing

Speaker: Junichi Yamagishi
Conference: THE 34TH CONFERENCE ON COMPUTATIONAL LINGUISTICS AND SPEECH PROCESSING (Rocling 2022)
Organizer: Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
Venue: Taipei, Taiwan
Date: November 22, 2022
URL: https://rocling2022.github.io
Video: https://youtu.be/m9tClt2Iaf8?feature=shared

The Yamagishi Laboratory at the National Institute of Informatics researches text-to-speech (TTS) and voice conversion (VC) technologies. Having achieved TTS and VC methods that reproduce human-level naturalness and speaker similarity, we introduce three challenging projects we are currently working on as the next phase of our research.

(1) Rakugo speech synthesis [1]
As a challenging, entertainment-oriented application of speech synthesis technology, we have focused on rakugo, a traditional Japanese performing art of comic storytelling, and have worked on learning and reproducing the skills of a professional storyteller using speech synthesis. This project aims to achieve an “AI storyteller” that entertains listeners, a task entirely different from conventional speech synthesis, whose primary purpose is to convey information or answer questions. The main story of a rakugo performance consists of conversations among the various characters who appear in it. All of these characters are performed by a single rakugo storyteller, who changes their voice so that listeners can tell the characters apart and be entertained. Reproducing these characteristics of rakugo speech with machine learning requires both performance data and advanced modeling techniques. We therefore constructed a corpus of rakugo speech free of noise and audience sounds, with the cooperation of an Edo-style rakugo performer, and modeled the data using deep learning. In addition, we benchmarked our system through subjective evaluation, comparing the generated rakugo speech with performances by rakugo storytellers of different ranks (“Zenza/前座,” “Futatsume/二つ目,” and “Shinuchi/真打”).

(2) Speech intelligibility enhancement [2]
In remote communication, such as online conferencing, environmental background noise is present on both the speaker's and the listener's sides. Speech intelligibility enhancement manipulates the speech signal so that it is not masked by the noise on the listener's side while keeping the volume unchanged. Because no “correct teacher data” exists, this is not a simple conversion task; deep learning had not previously been applied to it, and there had been little technological progress, even though many practical applications exist, such as enhancing the intelligibility of station announcements. We therefore proposed a network structure called “iMetricGAN” and its training method. Complex, non-differentiable speech intelligibility and quality indexes are treated as the outputs of the discriminator of a generative adversarial network: the discriminator learns to approximate these indexes, and, guided by the approximated indexes, the generator automatically transforms an input speech signal into an enhanced, easier-to-hear signal. Listening experiments confirmed that this transformation significantly improves keyword recognition in noisy environments.
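The core trick, using a discriminator as a differentiable surrogate for a black-box metric, can be sketched in a few lines of PyTorch. This is a toy illustration, not the paper's architecture: the tiny networks, the 64-dimensional features, and the `intelligibility_proxy` stand-in for real metrics such as SIIB or ESTOI are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

def intelligibility_proxy(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a non-differentiable intelligibility/quality metric,
    # computed outside the autograd graph; one score in (0, 1) per utterance.
    with torch.no_grad():
        return torch.sigmoid(x.mean(dim=1, keepdim=True))

G = nn.Sequential(nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 64))  # generator
D = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))   # metric approximator

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

speech = torch.randn(8, 64)  # toy batch of 64-dim speech features

for _ in range(50):
    # D step: regress the discriminator onto the true (black-box) metric scores.
    enhanced = G(speech).detach()
    d_loss = nn.functional.mse_loss(D(enhanced), intelligibility_proxy(enhanced))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # G step: push the *predicted* score toward its maximum; gradients flow
    # through D, which is what makes the non-differentiable metric usable.
    enhanced = G(speech)
    g_loss = nn.functional.mse_loss(D(enhanced), torch.ones(8, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The design point is that only D ever sees the real metric; G is optimized purely against D's differentiable approximation of it.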

(3) Speaker anonymization [3, 4]
Now that it is becoming easy to build speech synthesis systems that digitally clone someone's voice from ‘found’ data on social media, there is a need to mask the speaker information in speech, along with other sensitive attributes that deserve protection. This is a new research topic, and what speaker anonymization should achieve has not yet been clearly defined. We proposed a speaker anonymization method that combines speech synthesis and speaker recognition technologies. Our approach decomposes speech into three components: prosody, phoneme information, and a speaker embedding vector called the x-vector, which is widely used in speaker recognition. It anonymizes the speaker's individuality by averaging only the x-vector with those of K other speakers, and a neural vocoder then re-synthesizes a high-quality speech waveform. We also introduce a speech database and evaluation metrics for comparing speaker anonymization methods.
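The x-vector averaging step described above can be sketched with NumPy. A minimal sketch under stated assumptions: the candidate pool of x-vectors, the 192-dimensional embedding size, and the heuristic of averaging the K least-similar pool speakers (by cosine similarity) are all illustrative choices, not the exact selection strategy of the cited papers.

```python
import numpy as np

def anonymize_xvector(source: np.ndarray, pool: np.ndarray, k: int = 10) -> np.ndarray:
    """Replace a speaker's x-vector with the mean of K pool x-vectors.

    source: (d,) x-vector of the speaker to anonymize.
    pool:   (n, d) x-vectors of candidate speakers.
    Returns a unit-norm pseudo-speaker x-vector.
    """
    # Cosine similarity between the source and every candidate in the pool.
    sims = pool @ source / (np.linalg.norm(pool, axis=1) * np.linalg.norm(source))
    farthest = np.argsort(sims)[:k]        # K least-similar speakers (illustrative choice)
    pseudo = pool[farthest].mean(axis=0)   # average them into a pseudo-speaker
    return pseudo / np.linalg.norm(pseudo) # re-normalize to unit length

# Toy usage: a random pool of 100 candidate x-vectors.
rng = np.random.default_rng(0)
pool = rng.normal(size=(100, 192))
source = rng.normal(size=192)
pseudo = anonymize_xvector(source, pool, k=10)
```

Prosody and phoneme information are left untouched; only this pseudo x-vector is fed to the neural vocoder, so the re-synthesized speech keeps the content but not the speaker's identity.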

References
[1] Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Shinji Takaki, Junichi Yamagishi, “Modeling of Rakugo Speech and Its Limitations: Toward Speech Synthesis That Entertains Audiences,” IEEE Access, vol. 8, pp. 138149-138161, July 2020.
[2] Haoyu Li, Junichi Yamagishi, “Multi-Metric Optimization Using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3000-3011, Sept. 2021.
[3] Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen, Massimiliano Todisco, Nicholas Evans, Jean-Francois Bonastre, “Speaker Anonymization Using X-vector and Neural Waveform Models,” 10th ISCA Speech Synthesis Workshop (SSW10), Sept. 2019.
[4] Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi, Natalia Tomashenko, “Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models,” Odyssey 2022: The Speaker and Language Recognition Workshop, June 2022.