News
PhD Thesis Public Defense
We will hold a public defense of a doctoral thesis. Everyone is welcome to attend; however, to help us estimate attendance and arrange building access in advance, please contact us via the inquiry page if you plan to attend.
Date and time: Wednesday, January 28, 2026, 14:30-15:30
Venue: Room 1810, 18th floor, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Presenter: Yun Liu
Thesis title: Training and Data Strategies for Target Speaker Extraction
Abstract: Target speaker extraction (TSE) aims to isolate the voice of a specific speaker from complex acoustic mixtures containing multiple speakers and background noise. Despite recent advances in deep neural network approaches, current TSE systems face significant challenges in real-world environments, primarily due to limited speaker diversity in training data and the gap between artificially constructed training conditions and authentic acoustic scenarios. This thesis addresses these fundamental limitations through a comprehensive framework combining curriculum learning strategies with synthetic speaker generation.
The thesis makes four key contributions. First, we identify and evaluate multiple difficulty measures for TSE training data, demonstrating that speaker similarity, measured as the cosine distance between speaker embeddings, provides the most reliable indicator of extraction difficulty. Second, we develop a multi-stage curriculum learning framework that progressively trains models from easier to more challenging examples, achieving up to 0.97 dB improvement in signal-to-distortion ratio over conventional random sampling. Third, we leverage generative speech models to create synthetic speakers via voice conversion as additional training data, finding that optimal performance is achieved with a 50%-50% balance between real and synthetic speakers. Fourth, we introduce Libri2Vox, a novel dataset that bridges the gap between controlled experimental conditions and real-world complexity by combining clean speech from LibriTTS with recordings from VoxCeleb2, covering diverse acoustic conditions and over 7,000 speakers in total.
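To illustrate the curriculum idea described above, the following minimal sketch ranks target/interferer training pairs by the cosine similarity of their speaker embeddings and splits them into easy-to-hard stages. All names here (`build_curriculum`, the embedding dictionary, the pair format) are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_curriculum(pairs, embeddings, n_stages=3):
    """Hypothetical curriculum builder.

    pairs:      list of (target_id, interferer_id) tuples
    embeddings: dict mapping speaker id -> embedding vector
    Returns n_stages lists of pairs, ordered from easy mixtures
    (dissimilar speakers, low cosine similarity) to hard mixtures
    (similar speakers, high cosine similarity).
    """
    ranked = sorted(
        pairs,
        key=lambda p: cosine_similarity(embeddings[p[0]], embeddings[p[1]]),
    )
    # Split the ranked list into roughly equal consecutive stages.
    base, extra = divmod(len(ranked), n_stages)
    stages, i = [], 0
    for j in range(n_stages):
        size = base + (1 if j < extra else 0)
        stages.append(ranked[i:i + size])
        i += size
    return stages
```

A training loop would then draw batches from stage 0 first and progressively mix in later stages, matching the easy-to-hard schedule the abstract describes.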
Experimental evaluations across multiple modern and state-of-the-art TSE architectures (Conformer, BLSTM, SpeakerBeam, and VoiceFilter) demonstrate consistent improvements of 0.5 to 2.2 dB in extraction performance. Notably, models trained with our approach show substantial improvements on real-world recordings, achieving positive quality gains where conventional methods degrade. Across all experiments, the integration of curriculum learning, synthetic speaker augmentation, and realistic acoustic conditions significantly enhances model robustness and generalization.
This thesis establishes that strategic training data utilization through curriculum design and synthetic augmentation can substantially improve TSE systems. The proposed methods are architecture-agnostic and can be readily integrated into existing TSE training pipelines. These findings have broader implications for speech processing tasks facing similar challenges of limited training data and domain mismatch, contributing to the development of more robust speech technologies for real-world deployment.