Member

Xin Wang (王 鑫)
National Institute of Informatics, Digital Content and Media Sciences Research Division
Project Associate Professor (2023–present)
PRESTO Researcher, Japan Science and Technology Agency (2023–present)
Research Areas
Speech information processing
Speech synthesis
Biometric authentication
Biography
Xin Wang is a Project Associate Professor at the National Institute of Informatics and a PRESTO researcher at the Japan Science and Technology Agency. He has been a member of the organizing teams of the ASVspoof Challenges since 2019 and of the VoicePrivacy initiatives in 2020 and 2022. His research focuses on speech synthesis, anti-spoofing, and other speech security- and privacy-related tasks. He is a member of IEEE, ISCA, and ASJ. He was a guest editor of the Computer Speech & Language special issue on Advances in Automatic Speaker Verification Anti-spoofing, and served on the appointed team of the ISCA Special Interest Group on Security and Privacy in Speech Communication from 2022 to 2024.
Career
- 2015–2018: Ph.D. student, SOKENDAI (The Graduate University for Advanced Studies)
- 2018–2019: Project Researcher, National Institute of Informatics
- 2019–2023: Project Assistant Professor, National Institute of Informatics
- 2023–present: Project Associate Professor, National Institute of Informatics
- 2023–present: PRESTO Researcher, Japan Science and Technology Agency
Awards
- 2017: 11th IEEE Signal Processing Society Japan Student Best Paper Award
- 2018: SOKENDAI Award
- 2019: SSW Best Paper Award (test-of-time award for SSW9, 2016)
Academic Activities
- 2022–2024: Appointed team member, ISCA Special Interest Group on Security and Privacy in Speech Communication
- Member of IEEE, ISCA, and the Acoustical Society of Japan
Recent Research
- ASVspoof 5: Design, collection and validation of resources for spoofing, deepfake, and adversarial attack detection using crowdsourced speech
- Proactive Detection of Speaker Identity Manipulation with Neural Watermarking
- A Benchmark for Multi-speaker Anonymization
- Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis
- Towards An Integrated Approach for Expressive Piano Performance Synthesis from Music Scores
- SpoofCeleb: Speech Deepfake Detection and SASV in the Wild
- Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches
- A Preliminary Study on Long-Form In-the-Wild Audio Spoofing Detection
- Exploring Active Data Selection Strategies for Continuous Training in Deepfake Detection
- ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
- DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input
- Speaker Detection by the Individual Listener and the Crowd: Parametric Models Applicable to Bonafide and Deepfake Speech
- Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
- Spoof Diarization: “What Spoofed When” in Partially Spoofed Audio
- To what extent can ASV systems naturally defend against spoofing attacks?
- Malacopula: Adversarial Automatic Speaker Verification Attacks using a Neural-based Generalised Hammerstein Model
- ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale
- The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation
- Spoofing Attack Augmentation: Can Differently-Trained Attack Models Improve Generalisation?
- SynVox2: Towards a privacy-friendly VoxCeleb2 dataset
- Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?
- Collaborative Watermarking for Adversarial Speech Synthesis
- Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances
- [arXiv] Speaker-Text Retrieval via Contrastive Learning
- Speaker Anonymization using Orthogonal Householder Neural Network
- Towards single integrated spoofing-aware speaker verification embeddings
- Range-Based Equal Error Rate for Spoof Localization
- Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms
- ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild
- Hiding speaker’s sex in speech using zero-evidence speaker representation in an analysis/synthesis pipeline
- Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?
- Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders
- Investigating Active-learning-based Training Data Selection for Speech Spoofing Countermeasure
- The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance
- Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions
- Privacy and utility of x-vector based speaker anonymization
- Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models
- Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation
- Investigating self-supervised front ends for speech spoofing countermeasures
- Estimating the Confidence of Speech Spoofing Countermeasure
- Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances
- The VoicePrivacy 2020 Challenge: Results and findings
- Benchmarking and challenges in security and privacy for voice biometrics
- ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection
- Multi-Task Learning in Utterance-Level and Segmental-Level Spoof Detection
- An Initial Investigation for Detecting Partially Spoofed Audio
- A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection
- Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
- A Multi-Level Attention Model for Evidence-Based Fact Checking
- How Similar or Different Is Rakugo Speech Synthesizer to Professional Performers?
- End-to-End Text-to-Speech using Latent Duration based on VQ-VAE
- ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech
- Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation
- Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
- [arXiv] Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis
- Design Choices for X-vector Based Speaker Anonymization
- Reverberation Modeling for Source-Filter-based Neural Vocoder
- Introducing the VoicePrivacy Initiative
- Using Cyclic Noise as the Source Signal for Neural Source-Filter-based Speech Waveform Model
- [Domestic conference] How close can rakugo speech synthesis come to human rakugo performers?
- Modeling of Rakugo Speech and Its Limitations: Toward Speech Synthesis That Entertains Audiences
- Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals
- ASVspoof 2019: a large-scale public database of synthesized, converted and replayed speech
- Effect of choice of probability distribution, randomness, and search methods for alignment modeling in sequence-to-sequence text-to-speech synthesis using hard alignment
- Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation
- Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings
- Neural source-filter waveform models for statistical parametric speech synthesis
- [arXiv] Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model
- A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis
- Rakugo speech synthesis using segment-to-segment neural transduction and style tokens — toward speech synthesis for entertaining audiences
- Initial investigation of encoder-decoder end-to-end TTS using marginalization of monotonic hard alignments
- Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis
- Speaker Anonymization Using X-vector and Neural Waveform Models
- ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection
- Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
- Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
- Audiovisual speaker conversion: jointly and simultaneously transforming facial expression and acoustic characteristics
- Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language
- Neural source-filter-based waveform model for statistical parametric speech synthesis
- STFT spectral loss for training a neural speech waveform model
- [Domestic conference] Use and evaluation of Tacotron and context features in rakugo speech synthesis
- [Domestic conference] Investigating neural source-filter waveform model for statistical parametric speech synthesis
- Fundamental Frequency Modeling for Neural-Network-Based Statistical Parametric Speech Synthesis
- Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects
- [arXiv] Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
- Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama’s voice using GAN, WaveNet and low-quality found data
- Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis
- A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis
- Speech Waveform Synthesis from MFCC Sequences with Generative Adversarial Networks
- Cyborg Speech: Deep Multilingual Speech Synthesis for Generating Segmental Foreign Accent with Natural Prosody
- [Domestic conference] Do prosodic manual annotations matter for Japanese speech synthesis systems with WaveNet vocoder?
- [Domestic conference] Investigation of WaveNet for Text-to-Speech Synthesis
- [Domestic conference] Stealing your vocal identity from the internet: cloning Obama’s voice from found data using GAN and Wavenet
- [Domestic conference] Generating Segment-Level Foreign-Accented Synthetic Speech with Natural Speech Prosody
- Investigating very deep highway networks for parametric speech synthesis
- [Domestic conference] Autoregressive quantized F0 modeling using a recurrent neural network with feedback links
- An RNN-based quantized f0 model with multi-tier feedback links for text-to-speech synthesis
- [NII Technical Report] A simple RNN-plus-highway network for statistical parametric speech synthesis
- An autoregressive recurrent mixture density network for parametric speech synthesis
- [Domestic conference] Investigation on an autoregressive recurrent mixture density network for parametric speech synthesis
- [Domestic conference] Investigating the impact of a neural network’s depth on spectral and F0 modelling for parametric speech synthesis
- Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network Based Text-to-Speech Synthesis
- The NII speech synthesis entry for Blizzard Challenge 2016
- Investigating Very Deep Highway Networks for Parametric Speech Synthesis
- A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora
- Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
- Using text and acoustic features in predicting glottal excitation waveforms for parametric speech synthesis with recurrent neural networks
- Enhance the word vector with prosodic information for the recurrent neural network based TTS system
- Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System using Deep Recurrent Neural Networks
- [Domestic conference] Using Continuous Representation of Various Linguistic Units for Recurrent Neural Network based TTS Synthesis