Member

Research Fields
Speech information processing
Speech synthesis
Speaker verification
Biometric authentication
Media forensics
Biography
Junichi Yamagishi received a Ph.D. from the Tokyo Institute of Technology in 2006 for a thesis that pioneered speaker-adaptive speech synthesis. He is a Professor at the National Institute of Informatics, Tokyo, Japan. Previously, he held an EPSRC Career Acceleration Fellowship in the Centre for Speech Technology Research (CSTR) at the University of Edinburgh, U.K., from 2011 to 2016. He has authored and co-authored more than 400 refereed papers in international journals and conferences.
He served as an elected member of the IEEE Speech and Language Technical Committee from 2013 to 2019, as an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing from 2014 to 2017, as chairperson of ISCA SynSIG from 2017 to 2021, as a Senior Area Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing from 2019 to 2023, and as a member of the Asia-Pacific Signal and Information Processing Association (APSIPA) Multimedia Security and Forensics Technical Committee from 2018 to 2023. He has been a member of the IEEE Signal Processing Society Education Board since 2021.
He received the Itakura Prize from the Acoustical Society of Japan (2010), the Kiyasu Special Industrial Achievement Award from the Information Processing Society of Japan (2013), the Young Scientists' Prize from the Minister of Education, Culture, Sports, Science and Technology (2014), the JSPS Prize from the Japan Society for the Promotion of Science (2016), the best paper award at the IEEE International Workshop on Information Forensics and Security (WIFS) (2017), the DOCOMO Mobile Science Award (2018), the IEEE Biometrics Council BTAS/IJCB 5-Year Highest Impact Award (2023), and the Award for Science and Technology (Research Category) of the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology (2025).
Career History
- 2006: Completed the doctoral program at the Tokyo Institute of Technology (Doctor of Engineering)
- 2006–2007: Visiting researcher, University of Edinburgh, U.K. (JSPS Postdoctoral Fellow)
- 2007–2011: Research Fellow, University of Edinburgh
- 2011–2013: EPSRC Career Acceleration Fellow, University of Edinburgh
- 2013–2019: Associate Professor, Digital Content and Media Sciences Research Division, National Institute of Informatics (NII)
- 2013–2020: Senior Research Fellow (concurrent), University of Edinburgh
- 2019–present: Professor, Digital Content and Media Sciences Research Division, NII
- 2020: Awarded the title of Honorary Professor by the University of Edinburgh
- 2021–present: Deputy Director, Synthetic Media International Research Center, NII
Awards
- 2007: Tejima Doctoral Dissertation Award
- 2010: Itakura Memorial Research Award, Acoustical Society of Japan
- 2014: Young Scientists' Prize, Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology (MEXT)
- 2016: JSPS Prize, Japan Society for the Promotion of Science
- 2018: DOCOMO Mobile Science Award, Award of Excellence in the Advanced Technology Category
- 2023: Telecommunications Advancement Foundation Award, Telecom Interdisciplinary Research Award (Special Commendation)
- 2023: IEICE ISS Paper Award and IEICE Best Paper Award
- 2023 and 2024: IEEE Biometrics Council BTAS/IJCB 5-Year Highest Impact Award
- 2025: Award for Science and Technology (Research Category), Commendation for Science and Technology by MEXT
Professional Activities
- 2013–2019: IEEE Signal Processing Society (SPS), Speech & Language Technical Committee
- 2013–present: IEEE Senior Member
- 2014–2017: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Associate Editor
- 2017–2021: ISCA Special Interest Group on Speech Synthesis (SynSIG), Chairperson
- 2018–2023: Asia-Pacific Signal and Information Processing Association (APSIPA), Multimedia Security and Forensics Technical Committee
- 2019–present: Information Processing Society of Japan (IPSJ), Senior Member
- 2019–2023: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Senior Area Editor
- 2021–present: IEEE Signal Processing Society, Education Board Member
- Also a member of ISCA, IEICE, and the Acoustical Society of Japan
Recent Research
- Proactive Detection of Speaker Identity Manipulation with Neural Watermarking
- Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs
- The First VoicePrivacy Attacker Challenge
- Towards An Integrated Approach for Expressive Piano Performance Synthesis from Music Scores
- Explaining Speaker and Spoof Embeddings via Probing
- [Arxiv] Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data
- [Arxiv] It Takes Two: Real-time Co-Speech Two-person’s Interaction Generation via Reactive Auto-regressive Diffusion Model
- The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction
- Improving curriculum learning for target speaker extraction with synthetic speakers
- Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion
- Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches
- [Arxiv] AfriHuBERT: A self-supervised speech representation model for African languages
- Speech Generation for Indigenous Language Education
- A Preliminary Study on Long-Form In-the-Wild Audio Spoofing Detection
- Exploring Active Data Selection Strategies for Continuous Training in Deepfake Detection
- Quantifying Source Speaker Leakage in One-to-One Voice Conversion
- Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis
- ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations
- DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input
- Experimental evaluation of MOS, AB and BWS listening test designs
- Speaker Detection by the Individual Listener and the Crowd: Parametric Models Applicable to Bonafide and Deepfake Speech
- Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
- Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems
- Target Speaker Extraction with Curriculum Learning
- Spoof Diarization: “What Spoofed When” in Partially Spoofed Audio
- To what extent can ASV systems naturally defend against spoofing attacks?
- ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale
- [Japanese translation] Objective Evaluation of Synthetic Speech and the VoiceMOS Challenge
- The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation
- Bridging Textual and Tabular Worlds for Fact Verification: A Lightweight, Attention-Based Model
- Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction
- A review on subjective and objective evaluation of synthetic speech
- Spoofing Attack Augmentation: Can Differently-Trained Attack Models Improve Generalisation?
- SynVox2: Towards a privacy-friendly VoxCeleb2 dataset
- Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?
- eKYC-DF: A Large-Scale Deepfake Dataset for Developing and Evaluating eKYC Systems
- Analysis of Fine-grained Counting Methods for Masked Face Counting: A Comparative Study
- Joint Speaker Encoder and Neural Back-end Model for Fully End-to-End Automatic Speaker Verification with Multiple Enrollment Utterances
- The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains
- Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-supervised setting
- [Arxiv] Speaker-Text Retrieval via Contrastive Learning
- Exploring Isolated Musical Notes as Pre-training Data for Predominant Instrument Recognition in Polyphonic Music
- XFEVER: Exploring Fact Verification across Languages
- How Close are Other Computer Vision Tasks to Deepfake Detection?
- Speaker Anonymization using Orthogonal Householder Neural Network
- Cyber Vaccine for Deepfake Immunity
- Towards single integrated spoofing-aware speaker verification embeddings
- Range-Based Equal Error Rate for Spoof Localization
- Controlling Multi-Class Human Vocalization Generation via a Simple Scheme of Segment-based Labeling
- Improving Generalization Ability of Countermeasures for New Mismatch Scenario by Combining Multiple Advanced Regularization Terms
- Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech
- BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer
- Revisiting Pathologies of Neural Models under Input Reduction
- ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild
- Hiding speaker’s sex in speech using zero-evidence speaker representation in an analysis/synthesis pipeline
- Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?
- Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders
- Joint Noise Reduction and Listening Enhancement for Full-End Speech Enhancement
- Investigating Active-learning-based Training Data Selection for Speech Spoofing Countermeasure
- Analysis of Master Vein Attacks on Finger Vein Recognition Systems
- The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance
- Outlier-Aware Training for Improving Group Accuracy Disparities
- Mitigating the Diminishing Effect of Elastic Weight Consolidation
- Spoofing-Aware Attention based ASV Back-end with Multiple Enrollment Utterances and a Sampling Strategy for the SASV Challenge 2022
- Analyzing Language-Independent Speaker Anonymization Framework under Unseen Conditions
- The VoiceMOS Challenge 2022
- DDS: A new device-degraded speech dataset for speech enhancement
- Privacy and utility of x-vector based speaker anonymization
- [Journal article] Automated Fact Verification: Lessons Learned from Participating in the FEVER Challenge
- [Journal article] The Current State of Deepfake Generation and Detection
- Language-Independent Speaker Anonymization Approach using Self-Supervised Pre-Trained Models
- Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation
- Investigating self-supervised front ends for speech spoofing countermeasures
- LDNET: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech
- Estimating the Confidence of Speech Spoofing Countermeasure
- Generalization Ability of MOS Prediction Networks
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis
- Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances
- Master Face Attacks on Face Recognition Systems
- SVSNet: An End-to-end Speaker Voice Similarity Assessment Model
- The VoicePrivacy 2020 Challenge: Results and findings
- Use of speaker recognition approaches for learning and evaluating embedding representations of musical instrument sounds
- Effects of Image Processing Operations on Adversarial Noise and Their Use in Detecting and Correcting Adversarial Images
- Optimizing Tandem Speaker Verification and Anti-Spoofing Systems
- Effectiveness of Detection-based and Regression-based Approaches for Estimating Mask-Wearing Ratio
- Benchmarking and challenges in security and privacy for voice biometrics
- [Arxiv] LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example
- OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild
- ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection
- Multi-Task Learning in Utterance-Level and Segmental-Level Spoof Detection
- Multi-Metric Optimization Using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement
- An Initial Investigation for Detecting Partially Spoofed Audio
- A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection
- Preliminary study on using vector quantization latent spaces for consistent performance TTS/VC systems
- How do Voices from Past Speech Synthesis Challenges Compare Today?
- Exploring Disentanglement with Multilingual and Monolingual VQ-VAE
- Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
- A Multi-Level Attention Model for Evidence-Based Fact Checking
- Fashion-Guided Adversarial Attack on Person Segmentation
- How Similar or Different Is Rakugo Speech Synthesizer to Professional Performers?
- Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm
- End-to-End Text-to-Speech using Latent Duration based on VQ-VAE
- ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech
- Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation
- Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model
- Generation and Detection of Media Clones
- Preventing Fake Information Generation Against Media Clone Attacks
- Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
- Color Transfer to Anonymized Gait Images While Maintaining Anonymization
- A Method for Identifying Origin of Digital Images Using a Convolution Neural Network
- An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning
- Viable Threat on News Reading: Generating Biased News Using Natural Language Models
- [Arxiv] Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis
- An initial investigation on optimizing tandem speaker verification and countermeasure systems using reinforcement learning
- Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion
- Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions
- Latent linguistic embedding for cross-lingual text-to-speech and voice conversion
- NAUTILUS: a Versatile Voice Cloning System
- The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment
- Design Choices for X-vector Based Speaker Anonymization
- Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction
- Reverberation Modeling for Source-Filter-based Neural Vocoder
- Introducing the VoicePrivacy Initiative
- Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
- Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement
- Using Cyclic Noise as the Source Signal for Neural Source-Filter-based Speech Waveform Model
- iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning
- [Domestic conference] How Close Can Rakugo Speech Synthesis Come to Human Rakugo Performers?
- [Arxiv] An Investigation of the Relation Between Grapheme Embeddings and Pronunciation for Tacotron-based Systems
- Security of Facial Forensics Models Against Adversarial Attacks
- Generating Master Faces for Use in Performing Wolf Attacks on Face Recognition Systems
- Modeling of Rakugo Speech and Its Limitations: Toward Speech Synthesis That Entertains Audiences
- Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals
- ASVspoof 2019: a large-scale public database of synthesized, converted and replayed speech
- Effect of choice of probability distribution, randomness, and search methods for alignment modeling in sequence-to-sequence text-to-speech synthesis using hard alignment
- Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation
- Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings
- Generating Sentiment-Preserving Fake Online Reviews Using Neural Language Models and Their Human- and Machine-based Detection
- Bootstrapping Non-Parallel Voice Conversion from Speaker-Adaptive Text-to-Speech
- Neural source-filter waveform models for statistical parametric speech synthesis
- An RGB Gait Anonymization Model for Low-Quality Silhouettes
- [Arxiv] Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model
- A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis
- Multi-task Learning For Detecting and Segmenting Manipulated Facial Images and Videos
- Rakugo speech synthesis using segment-to-segment neural transduction and style tokens — toward speech synthesis for entertaining audiences
- Initial investigation of encoder-decoder end-to-end TTS using marginalization of monotonic hard alignments
- Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis
- Speaker Anonymization Using X-vector and Neural Waveform Models
- MOSNet: Deep Learning based Objective Assessment for Voice Conversion
- GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-Spectrogram
- ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection
- Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
- Does the Lombard Effect Improve Emotional Communication in Noise? — Analysis of Emotional Speech Acted in Noise
- Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
- [Arxiv] A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation
- Cycle-consistent adversarial networks for non-parallel vocal effort based speaking style conversion
- Attentive Filtering Networks for Audio Replay Attack Detection
- Audiovisual speaker conversion: jointly and simultaneously transforming facial expression and acoustic characteristics
- Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks
- Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language
- Neural source-filter-based waveform model for statistical parametric speech synthesis
- STFT spectral loss for training a neural speech waveform model
- Capsule-Forensics: Using Capsule Networks to Detect Forged Images and Videos
- Spatio-Temporal Generative Adversarial Network for Gait Anonymization
- [Arxiv] Training a Neural Speech Waveform Model using Spectral Losses of Short-Time Fourier Transform and Continuous Wavelet Transform
- [Domestic conference] Use and Evaluation of Tacotron and Context Features in Rakugo Speech Synthesis
- [Domestic conference] Investigating neural source-filter waveform model for statistical parametric speech synthesis
- [Arxiv] Non linear time compression of clear and normal speech at high rates
- Scaling and bias codes for modeling speaker-adaptive DNN-based speech synthesis systems
- Transforming acoustic characteristics to deceive playback spoofing countermeasures of speaker verification systems
- MesoNet: a Compact Facial Video Forgery Detection Network
- Identifying Computer-Translated Paragraphs using Coherence Features
- Unsupervised speaker adaptation for DNN-based speech synthesis using input codes
- Complex-Valued Restricted Boltzmann Machine for Speaker-Dependent Speech Parameterization From Complex Spectra
- Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder
- Multimodal speech synthesis architecture for unsupervised speaker adaptation
- Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects
- Integrated Presentation Attack Detection and Automatic Speaker Verification: Common Features and Gaussian Back-end Fusion
- Speaker-independent raw waveform model for glottal excitation
- Expressive Speech Synthesis Using Sentiment Embeddings
- Modular Convolutional Neural Network for Discriminating between Computer-Generated Images and Photographic Images
- [Arxiv] Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis
- Transformation on Computer-Generated Facial Image to Avoid Detection by Spoofing Detector
- t-DCF: a Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification
- A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment
- ASVspoof 2017 Version 2.0: meta-data analysis and baseline enhancements
- Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama’s voice using GAN, WaveNet and low-quality found data
- The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods
- A Comparison Between STRAIGHT, Glottal, and Sinusoidal Vocoding in Statistical Parametric Speech Synthesis
- Speech Enhancement of Noisy and Reverberant Speech for Text-to-Speech
- Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis
- A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis
- Speech Waveform Synthesis from MFCC Sequences with Generative Adversarial Networks
- Cyborg Speech: Deep Multilingual Speech Synthesis for Generating Segmental Foreign Accent with Natural Prosody
- High-Quality Nonparallel Voice Conversion Based on Cycle-Consistent Adversarial Network
- [Domestic conference] Do prosodic manual annotations matter for Japanese speech synthesis systems with WaveNet vocoder?
- Investigating different representations for modeling and controlling multiple emotions in DNN-based speech synthesis
- [Domestic conference] Investigation of WaveNet for Text-to-Speech Synthesis
- [Domestic conference] Stealing your vocal identity from the internet: cloning Obama's voice from found data using GAN and Wavenet
- [Domestic conference] Generating Segment-Level Foreign-Accented Synthetic Speech with Natural Speech Prosody
- [Domestic conference] Unsupervised Speaker Adaptation Based on Speaker Similarity for DNN Speech Synthesis Using Degraded Speech
- [Domestic conference] Complex Spectral Sequence Modeling with a Recurrent Complex-Valued Restricted Boltzmann Machine
- [Domestic conference] Cross-Lingual Voice Conversion Using CycleGAN
- [Domestic conference] Analyzing the impact of including listener perception annotations in RNN-based emotional speech synthesis
- [Domestic conference] High-Quality Non-Parallel Voice Conversion Using CycleGAN
- Identifying Computer-Generated Text Using Statistical Analysis
- Distinguishing Computer Graphics from Natural Images Using Convolution Neural Networks
- An Approach for Gait Anonymization Using Deep Learning
- Investigating very deep highway networks for parametric speech synthesis
- [Domestic conference] Unsupervised Speaker Adaptation Based on Speaker Similarity for DNN Speech Synthesis
- [Domestic conference] Autoregressive quantized F0 modeling using a recurrent neural network with feedback links
- Investigating different representations for modeling multiple emotions in DNN-based speech synthesis
- Learning word vector representations based on acoustic counts
- The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection
- Reducing mismatch in training of DNN-based glottal excitation models in a statistical parametric text-to-speech system
- Complex-valued restricted Boltzmann machine for direct learning of frequency spectra
- Misperceptions of the emotional content of natural and vocoded speech in a car
- Direct modeling of frequency spectra and waveform generation based on phase recovery for DNN-based speech synthesis
- An RNN-based quantized f0 model with multi-tier feedback links for text-to-speech synthesis
- Principles for learning controllable TTS from annotated and latent variation
- Speech intelligibility in cars: The effect of speaking style, noise and listener age
- [Domestic conference] Complex RBM: A Complex-Valued Extension of the Restricted Boltzmann Machine and Its Application and Evaluation on Speech Signals
- [Domestic conference] Adversarial-Training-Based Post-Filtering of STFT Spectrograms
- Influence of speaker familiarity on blind and visually impaired children’s and young adults’ perception of synthetic voices
- Introduction to the Issue on Spoofing and Countermeasures for Automatic Speaker Verification
- [NII Technical Report] A simple RNN-plus-highway network for statistical parametric speech synthesis
- An autoregressive recurrent mixture density network for parametric speech synthesis
- Non-parallel voice conversion using i-vector PLDA: towards unifying speaker verification and transformation
- Adapting and controlling DNN-based speech synthesis using input codes
- [Domestic conference] Investigation on an autoregressive recurrent mixture density network for parametric speech synthesis
- ASVspoof: The Automatic Speaker Verification Spoofing and Countermeasures Challenge
- [Domestic conference] A comparative study on modeling and controlling emotional acoustic parameters in neural networks based Japanese and Spanish speech synthesis
- [Domestic conference] Speech Waveform Generation Based on Phase Recovery from FFT Spectra for DNN-Based Text-to-Speech Synthesis
- Continuous Expressive Speaking Styles Synthesis based on CVSM and MR-HMM
- [Domestic conference] Investigating the impact of a neural network’s depth on spectral and F0 modelling for parametric speech synthesis
- [Domestic conference] Investigation of Speaker, Gender, and Age Codes in DNN-Based Text-to-Speech Synthesis
- Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network Based Text-to-Speech Synthesis
- The NII speech synthesis entry for Blizzard Challenge 2016
- Multidimensional scaling of systems in the Voice Conversion Challenge 2016
- Investigating Very Deep Highway Networks for Parametric Speech Synthesis
- A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora
- Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis
- Parallel and cascaded deep neural networks for text-to-speech synthesis
- Development and evaluation of a statistical parametric synthesis system for operatic singing in German
- Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
- Analysis of the Voice Conversion Challenge 2016 Evaluation Results
- The Voice Conversion Challenge 2016
- Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis
- The SIWIS database: a multilingual speech database with acted emphasis
- A Hierarchical Predictor of Synthetic Speech Naturalness Using Neural Networks
- Majorisation-minimisation based optimisation of the composite autoregressive system with application to glottal inverse filtering
- Using text and acoustic features in predicting glottal excitation waveforms for parametric speech synthesis with recurrent neural networks
- Enhance the word vector with prosodic information for the recurrent neural network based TTS system
- Applying Spectral Normalisation and Efficient Envelope Estimation and Statistical Transformation for the Voice Conversion Challenge 2016
- Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System using Deep Recurrent Neural Networks
- [Domestic conference] Performance Evaluation of HMM-, DNN-, and RNN-Based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Data
- Voice Liveness Detection for Speaker Verification based on a Tandem Single/Double-channel Pop Noise Detector
- [Domestic conference] Investigation of Voice Liveness Detection for Speaker Verification Considering Phoneme Information
- Privacy-preserving sound to degrade automatic speaker verification performance
- A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis
- Initial investigation of speech synthesis based on complex-valued neural networks
- Testing the consistency assumption: Pronunciation variant forced alignment in read and spontaneous speech synthesis
- Wavelet-based decomposition of F0 as a secondary task for DNN-based speech synthesis with multi-task learning
- Deep neural network-guided unit selection synthesis
- Anti-Spoofing for Text-Independent Speaker Verification: An Initial Database, Comparison of Countermeasures, and Human Performance
- [Domestic conference] Using Continuous Representation of Various Linguistic Units for Recurrent Neural Network based TTS Synthesis
- [Domestic conference] Deep Auto-Encoder-Based Low-Dimensional Acoustic Feature Extraction from FFT Spectra for Statistical Parametric Speech Synthesis
- The use of articulatory movement data in speech synthesis applications: an overview –Application of articulatory movements using machine learning algorithms–
- [Japanese translation] Machine-Learning-Based Applications of Articulatory Movement Data
- Intelligibility of time-compressed synthetic speech: Compression method and speaking style
- A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis
- Reconstructing Voices within the Multiple-Average-Voice-Model Framework
- Influence of speaker familiarity on blind and visually impaired children’s perception of synthetic voices in audio games
- Deep neural network context embeddings for model selection in rich-context HMM synthesis
- Multiple Feed-forward Deep Neural Networks for Statistical Parametric Speech Synthesis
- Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning
- Voice liveness detection algorithms based on pop noise caused by human breath for automatic speaker verification
- ASVspoof 2015: the First Automatic Speaker Verification Spoofing and Countermeasures Challenge
- Human vs Machine Spoofing Detection on Wideband and Narrowband Data
- A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis
- [Domestic conference] Investigation of Privacy-Preserving Sound That Degrades the Performance of Automatic Speaker Verification Systems
- [Domestic conference] Voice Liveness Detection for Speaker Verification Considering the Frequency Characteristics of Pop Noise
- [Domestic conference] Statistical Parametric Speech Synthesis Based on Multiple Feed-Forward Deep Neural Networks
- [Domestic conference] Efficient Pitch Estimation on Natural Opera-Singing by a Spectral Correlation based Strategy
- ALISA: An automatic lightly supervised speech segmentation and alignment tool
- [Arxiv] Deep Denoising Auto-encoder for Statistical Speech Synthesis
- [Journal article] The Current State of Spoken-Language Resources
- Constructing a Deep Neural Network Based Spectral Model for Statistical Speech Synthesis
- SAS: A speaker verification spoofing database containing diverse attacks
- Methods for applying dynamic sinusoidal models to statistical parametric speech synthesis
- Emotion transplantation through adaptation in HMM-based speech synthesis
- [Journal article] State-of-the-Art Speech Synthesis Technology for People with Speech Disabilities
- [Domestic conference] Construction of a Statistical Speech Synthesis System Using Deep-Neural-Network-Based Acoustic Feature Extraction and Acoustic Models