Research

Research projects, papers, books, etc.

  • Paper

Learning Relationship Between Speaker Embeddings and Descriptions of Speaker Traits

Authors: X. Liu, J. Yamagishi, X. Wang, and E. Cooper

  • #Speech processing
  • #Quality assessment
  • #Speaker recognition

IEEE Transactions on Audio, Speech and Language Processing

Speech perception research reveals important connections between audio signals and perceptual speaker characteristics. Addressing this intersection, this study explores the relationship between textual descriptions of perceivable speaker characteristics and speech representations by establishing a joint learning space. For this purpose, we construct a dataset through extensive crowd-sourced listening tests based on VoxCeleb, in which participants provided detailed evaluations of diverse speaker attributes. These evaluations are transformed into structured textual descriptions, creating paired data that captures nuanced speaker characteristics. Using this data, we extract speaker and text embeddings with the corresponding pre-trained encoders. Our specialized linking networks then use contrastive learning and generative transformations to align these embeddings in a unified space. We apply them to cross-modal speaker retrieval in both English and Japanese, and extend them to a multilingual scenario. Experimental results highlight the value of our curated dataset of listener-perceived speaker traits.
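
To give a rough sense of what aligning speaker and text embeddings with contrastive learning can look like, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' linking networks: the symmetric InfoNCE-style loss, the temperature value, the function names, and the random toy embeddings are all hypothetical, and the trainable projection layers and generative transformations mentioned in the abstract are omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(spk, txt, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed choice, not taken from the paper).

    Row i of `spk` and row i of `txt` are treated as a matched pair;
    every other row in the batch serves as a negative.
    """
    logits = l2_normalize(spk) @ l2_normalize(txt).T / temperature
    idx = np.arange(len(spk))
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # speaker -> text
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> speaker
    return 0.5 * (loss_s2t + loss_t2s)

def retrieve(query, gallery, top_k=1):
    """Return indices of the top_k gallery rows most similar to each query row."""
    sims = l2_normalize(query) @ l2_normalize(gallery).T
    return np.argsort(-sims, axis=1)[:, :top_k]

# Toy data: random stand-ins for pre-trained speaker and text embeddings.
rng = np.random.default_rng(0)
spk_emb = rng.normal(size=(4, 8))
txt_emb = spk_emb + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs

aligned_loss = symmetric_contrastive_loss(spk_emb, txt_emb)
shuffled_loss = symmetric_contrastive_loss(spk_emb, txt_emb[::-1])  # mismatched pairs
top1 = retrieve(txt_emb, spk_emb)[:, 0]  # text-to-speaker retrieval
```

In this toy setup the loss is lower for correctly paired embeddings than for mismatched ones, and cross-modal retrieval reduces to a nearest-neighbor search by cosine similarity in the shared space. In a real system the embeddings would first pass through trained projection networks rather than being compared directly.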