ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
Tomoki Hayashi (@kan-bayashi) [1,2], Ryuichi Yamamoto [3], Katsuki Inoue [4], Takenori Yoshimura [1,2], Shinji Watanabe [5], Tomoki Toda [1], Kazuya Takeda [1], Yu Zhang [6], Xu Tan [7]
[1] Nagoya University, [2] Human Dataware Lab. Co., Ltd., [3] LINE Corp., [4] Okayama University, [5] Johns Hopkins University, [6] Google AI, [7] Microsoft Research
Background
- The era of End-to-End Text-to-Speech (E2E-TTS)
- Various advantages of E2E-TTS
  - Require no language-dependent expert knowledge
  - Require no alignment between text and speech
- More and more new research ideas
  - Style control / Multi-speaker / Multi-lingual / etc.
ICASSP2020 ESPNET-TTS: UNIFIED, REPRODUCIBLE, AND INTEGRATABLE OPEN SOURCE END-TO-END TEXT-TO-SPEECH TOOLKIT 2
[Figure: E2E-TTS pipeline. Text ("Hello, world!") -> Text2Mel -> Mel2Wav -> Speech; both stages are labelled "Neural Network"]
We need to accelerate this research and prepare comparable baselines!
We introduce ESPnet-TTS, a new open-source toolkit for E2E-TTS
What is ESPnet-TTS?
- Open-source E2E-TTS toolkit
  - Apache 2.0 license / PyTorch as the main network engine
- Developed for the research community
  - Easy to reproduce state-of-the-art models
  - Can be used as a baseline to check performance
1. Support of various Text2Mel models
   - Include autoregressive (AR), non-AR, and multi-speaker models
2. Support of various Mel2Wav models
   - Include both AR and the latest non-AR models
3. Unified and reproducible Kaldi-style recipes
   - Support 10+ recipes including En, Jp, Zn, and more
   - Provide pretrained models of all recipes
   - Integratable with ASR functions
(Extension of )
ESPnet-TTS functions
Supported Text2Mel models
[Figure: E2E-TTS pipeline (Text -> Text2Mel -> Mel2Wav -> Speech) with the Text2Mel stage highlighted ("This part!")]
[Figure: Tacotron 2 [Shen+, 2018] (autoregressive). Blocks: input sequence, CNN+BLSTM encoder, attention, LSTM decoder, prenet, postnet, next output]
[Figure: Transformer-TTS [Li+, 2018] (autoregressive). Blocks: input sequence, encoder prenet, positional encoding, Transformer encoder, decoder prenet, positional encoding, Transformer decoder, postnet, next output]
[Figure: FastSpeech [Ren+, 2019] (non-autoregressive). Blocks: input sequence, embedding, positional encoding, Transformer encoder, duration predictor, duration, length regulator, Transformer decoder, output sequence]
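The length regulator in the FastSpeech figure can be illustrated with a small sketch. It is not ESPnet's implementation: it only shows the idea of repeating each encoder state according to its predicted duration, so that the decoder receives a frame-level sequence and can produce all frames in parallel. The function name `length_regulate` and the toy values are assumptions made for this example.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat the i-th encoder state durations[i] times along the time axis."""
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per encoder state")
    frames: List[List[float]] = []
    for state, duration in zip(hidden, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # A duration of 0 simply drops the token from the output.
        frames.extend(list(state) for _ in range(duration))
    return frames


# Toy example: three input tokens with 2-dimensional hidden states.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]  # hypothetical output of a duration predictor
frames = length_regulate(hidden, durations)
print(len(frames))  # number of output frames = sum(durations)
```

Because the output length is fixed by the durations before decoding starts, no attention alignment has to be found at inference time.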
Multi-speaker extension (1)
- Extension with pretrained speaker embedding
  - Use X-vector [Snyder+, 2018] trained on the VoxCeleb corpus
[Figure: Tacotron 2 [Shen+, 2018] next to multi-speaker Tacotron 2 [Jia+, 2018]. The multi-speaker model adds a pretrained X-vector extractor that takes reference audio; its output is combined with the network by add / concat]
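The "Add / Concat" box can be sketched as follows. This is only an illustration under assumptions: the slide does not say exactly where the embedding is injected, so the sketch assumes it is combined with every encoder hidden state, and the helper name `condition_on_speaker` is hypothetical. For the "add" mode a real model would typically project the embedding to the hidden size first; here the sizes are simply required to match.

```python
from typing import List, Sequence


def condition_on_speaker(
    hidden: Sequence[Sequence[float]],
    spk_emb: Sequence[float],
    mode: str = "concat",
) -> List[List[float]]:
    """Combine one utterance-level speaker embedding with every hidden state."""
    if mode == "concat":
        # Feature dimension grows from D to D + len(spk_emb).
        return [list(h) + list(spk_emb) for h in hidden]
    if mode == "add":
        if any(len(h) != len(spk_emb) for h in hidden):
            raise ValueError("add mode needs matching dimensions")
        return [[a + b for a, b in zip(h, spk_emb)] for h in hidden]
    raise ValueError(f"unknown mode: {mode}")


hidden = [[1.0, 2.0], [3.0, 4.0]]  # toy encoder states
x_vector = [0.5, -0.5]             # toy stand-in for a pretrained X-vector
concat_out = condition_on_speaker(hidden, x_vector, mode="concat")
add_out = condition_on_speaker(hidden, x_vector, mode="add")
print(concat_out)
print(add_out)
```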
Multi-speaker extension (2)
- Extension with pretrained speaker embedding
  - Apply the same idea to the other models
[Figure: Multi-speaker Transformer-TTS and multi-speaker FastSpeech. Each model adds a pretrained X-vector extractor that takes reference audio; its output is combined with the network by add / concat]
※ EXPERIMENTAL
Supported Mel2Wav models
[Figure: E2E-TTS pipeline (Text -> Text2Mel -> Mel2Wav -> Speech) with the Mel2Wav stage highlighted ("This part!")]
[Figure: WaveNet [Oord+, 2016] (autoregressive). Blocks: mel spectrogram, upsample network, previous waveform, deep causal dilated CNN, posterior, sampling, next waveform. Mixture of Logistics (MoL) and Softmax support]
[Figure: Parallel WaveGAN [Yamamoto+, 2020] (non-autoregressive). Blocks: mel spectrogram, upsample network, noise sequence, deep dilated CNN, waveform sequence]
[Figure: MelGAN [Kumar+, 2019] (non-autoregressive). Blocks: mel spectrogram, upsample deep CNN, waveform sequence]
Support the combination of these GAN-based models
Please check Ryuichi's presentation at this ICASSP.
Other remarkable functions
- Dynamic batch size to maximize GPU utilization
  - Change the batch size dynamically according to the sequence length
- Gradient accumulation
  - Emulate a larger batch size even with a single GPU
- Guided attention loss [Tachibana+, 2017]
  - Constrain the attention weights to be diagonal
- Attention constraint decoding [Ping+, 2017]
  - Decode stably even for a long input sentence
- Forward attention [Zhang+, 2018]
  - Attention mechanism with causal regularization
- CBHG [Wang+, 2017]
  - Upsample the frequency resolution
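The dynamic batch size idea can be sketched as below. The slide does not give the actual criterion, so this example assumes a budget on the padded size of a batch (longest sequence times number of sequences). The function name `make_dynamic_batches` and the parameter `max_padded_frames` are hypothetical.

```python
from typing import List, Sequence


def make_dynamic_batches(
    lengths: Sequence[int], max_padded_frames: int
) -> List[List[int]]:
    """Group utterance indices so that each padded batch fits a frame budget."""
    # Sort from longest to shortest so the first item sets the padded length.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        if current:
            longest = lengths[current[0]]
            # Padded cost if this utterance were added to the current batch.
            if longest * (len(current) + 1) > max_padded_frames:
                batches.append(current)
                current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


lengths = [120, 80, 400, 390, 60, 100]  # toy frame counts per utterance
batches = make_dynamic_batches(lengths, max_padded_frames=800)
print(batches)  # -> [[2, 3], [0, 5, 1, 4]]
```

Long utterances end up in small batches and short ones in large batches, which keeps memory use per batch roughly constant.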
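The guided attention loss can be written out in a few lines. The sketch follows the penalty from [Tachibana+, 2017], W[t][n] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)), averaged with the attention weights. It uses plain Python lists for readability; the function names and the default g = 0.2 are choices made for this example, not ESPnet's API.

```python
import math
from typing import List, Sequence


def guided_attention_weights(
    n_text: int, n_frames: int, sigma: float = 0.2
) -> List[List[float]]:
    """Penalty matrix: near 0 on the diagonal, approaching 1 far from it."""
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2 * sigma**2))
            for n in range(n_text)
        ]
        for t in range(n_frames)
    ]


def guided_attention_loss(
    attention: Sequence[Sequence[float]], sigma: float = 0.2
) -> float:
    """Mean of attention * penalty; attention is indexed [frame][text]."""
    n_frames, n_text = len(attention), len(attention[0])
    penalty = guided_attention_weights(n_text, n_frames, sigma)
    total = sum(
        attention[t][n] * penalty[t][n]
        for t in range(n_frames)
        for n in range(n_text)
    )
    return total / (n_frames * n_text)


diagonal = [[1.0 if t == n else 0.0 for n in range(4)] for t in range(4)]
anti_diagonal = [[1.0 if t + n == 3 else 0.0 for n in range(4)] for t in range(4)]
print(guided_attention_loss(diagonal))       # -> 0.0
print(guided_attention_loss(anti_diagonal))  # larger: attention is off-diagonal
```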
ESPnet-TTS recipes
Unified, reproducible recipe
- All-in-one Kaldi-style recipe
  - Include all procedures needed to reproduce the results
  - Have a unified design for both ASR and TTS recipes
The same data format for ASR and TTS recipes
ASR and TTS recipes can be converted to each other
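A shared data format makes the conversion between ASR and TTS recipes mostly a matter of swapping inputs and targets. The sketch below assumes the Kaldi-style `wav.scp` and `text` tables (one "utterance-id value" pair per line); the utterance IDs, paths, and the helper names are hypothetical, and the actual ESPnet conversion scripts are not shown on the slide.

```python
from typing import Dict, List, Tuple


def parse_kaldi_table(content: str) -> Dict[str, str]:
    """Parse lines of the form '<utt_id> <value>' into a dict."""
    table: Dict[str, str] = {}
    for line in content.strip().splitlines():
        utt_id, value = line.split(maxsplit=1)
        table[utt_id] = value
    return table


def make_pairs(wav_scp: str, text: str, task: str) -> List[Tuple[str, str, str]]:
    """Return (utt_id, input, target) triples for the given task."""
    wavs = parse_kaldi_table(wav_scp)
    texts = parse_kaldi_table(text)
    pairs = []
    for utt_id in sorted(wavs.keys() & texts.keys()):
        if task == "asr":    # speech in, text out
            pairs.append((utt_id, wavs[utt_id], texts[utt_id]))
        elif task == "tts":  # text in, speech out
            pairs.append((utt_id, texts[utt_id], wavs[utt_id]))
        else:
            raise ValueError(f"unknown task: {task}")
    return pairs


# Hypothetical contents of a Kaldi-style data directory.
WAV_SCP = "utt001 /path/to/utt001.wav\nutt002 /path/to/utt002.wav\n"
TEXT = "utt001 HELLO WORLD\nutt002 GOOD MORNING\n"

asr_pairs = make_pairs(WAV_SCP, TEXT, task="asr")
tts_pairs = make_pairs(WAV_SCP, TEXT, task="tts")
print(asr_pairs[0])
print(tts_pairs[0])
```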
Supported recipes
- Support 10+ recipes covering 10 languages

Corpus name     Lang                          Recipe type
Arctic          En                            Adaptation
Blizzard 2017   En                            Single
CSMSC           Zn                            Single
JNAS            Jp                            Multi
JVS             Jp                            Adaptation
JSUT            Jp                            Single
LibriTTS        En                            Multi
LJSpeech        En                            Single
M-AILABS        En, De, Fr, Es, Pl, Uk, Ru    Single
TWEB            En                            Single
VAIS1000        Vi                            Single

We provide pretrained models of all recipes
Integration with ASR
- ASR-based evaluation for TTS
  - Automatically check the deletion or repetition of words
- Advanced recipes combining TTS with ASR
  - ASR-TTS cycle-consistency training [Karthick+, 2019]
  - Semi-supervised ASR-TTS training [Karita+, 2019]
  - Non-parallel voice conversion
    - Cascade ASR + TTS system
    - VCC2020 baseline system (http://www.vc-challenge.org/)
We can combine TTS with ASR for development and for new research ideas
※Not merged yet
ESPnet-TTS performance
Experimental condition
- Evaluation with the LJSpeech dataset
  - #training 12,600 / #validation 250 / #evaluation 250
- Comparison methods (input type, [attention type])
  - Tacotron 2 (Char, Forward)
  - Tacotron 2 (Char, Location)
  - Transformer (Char)
  - Transformer (Phoneme)
  - FastSpeech (Char) ※1
  - FastSpeech (Phoneme) ※1
  The same MoL-WaveNet trained w/ natural feats is used
- Comparison with other toolkits
  - CSTR/Merlin: Conventional TTS + WORLD [Morise+, 2016]
  - NVIDIA/tacotron2: Pretrained ※2 Tacotron 2 + WaveGlow
  - Mozilla/TTS: Pretrained ※2 Tacotron 2 + WaveRNN
※1 We did not use knowledge distillation
※2 Data split is different. The evaluation samples might be included in training data.
Objective evaluation (CER)
- Character error rate (CER)
  - ASR model: Transformer trained on LibriSpeech

Method                        Sub [%]   Del [%]   Ins [%]   CER [%]
Tacotron 2 (Char, Forward)    0.4       1.0       3.6※      5.0
Tacotron 2 (Char, Location)   0.5       1.2       0.3       2.1
Transformer (Char)            0.6       1.7       0.5       2.8
Transformer (Phoneme)         0.5       1.8       0.5       2.8
FastSpeech (Char)             0.3       0.9       0.3       1.6
FastSpeech (Phoneme)          0.4       1.3       0.4       2.1
Groundtruth (Raw)             0.3       0.7       0.3       1.3

※ Only one sample failed to stop the generation
Tacotron 2 is more robust than Transformer-TTS
FastSpeech is the most robust thanks to its non-AR architecture
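The Sub / Del / Ins breakdown in the CER table comes from aligning each ASR hypothesis with its reference text. A minimal sketch of that computation is given below, using a standard edit-distance dynamic program; CER is (substitutions + deletions + insertions) divided by the number of reference characters. The function name and the toy strings are assumptions for illustration, not the scoring tool used for the slide.

```python
from typing import Dict


def cer_breakdown(ref: str, hyp: str) -> Dict[str, float]:
    """Count substitutions, deletions, and insertions between ref and hyp."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (total edits, substitutions, deletions, insertions).
    dp = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = (i, 0, i, 0)  # delete all reference characters
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, j)  # insert all hypothesis characters
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # match, no cost
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = dp[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = dp[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    edits, subs, dels, inss = dp[-1][-1]
    return {
        "sub": subs,
        "del": dels,
        "ins": inss,
        "cer": 100.0 * edits / len(ref),  # in percent
    }


# Toy example: one dropped "l", one wrong vowel, one extra "!".
result = cer_breakdown("hello world", "helo wurld!")
print(result)
```

In a TTS evaluation, many deletions suggest skipped words and many insertions suggest repetition or a failure to stop, which is why the breakdown is reported alongside the total.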
Objective evaluation (RTF)
- Real-time factor (RTF) of Char-based models
  - Calculate the speed for only the Text2Mel part
  - GPU: Titan V / CPU: Xeon Gold 6154 3 GHz x 16 threads

Method                  RTF on CPU       RTF on GPU
Tacotron 2 (Forward)    0.216 ± 0.016    0.104 ± 0.006
Tacotron 2 (Location)   0.225 ± 0.016    0.094 ± 0.009
Transformer             0.851 ± 0.076    0.634 ± 0.025
FastSpeech              0.015 ± 0.005    0.003 ± 0.004
Tacotron 2 is faster than Transformer-TTS
FastSpeech is much faster than real-time thanks to its non-AR architecture
- (For reference) RTF of non-AR Mel2Wav models

Method             RTF on CPU   RTF on GPU
Parallel WaveGAN   0.734        0.016
MelGAN             0.137        0.002
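The real-time factor is the processing time divided by the duration of the generated speech, so values below 1 mean faster than real time. The sketch below assumes the duration is derived from the number of generated mel frames; the hop size, sampling rate, and timings are hypothetical values, and the slide does not say whether the reported "±" is a standard deviation, so that is an assumption too.

```python
import statistics
from typing import List, Tuple


def real_time_factor(
    elapsed_sec: float, n_frames: int, hop_size: int, sample_rate: int
) -> float:
    """RTF = processing time / duration of the generated speech."""
    audio_sec = n_frames * hop_size / sample_rate
    return elapsed_sec / audio_sec


def summarize(rtfs: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over utterances."""
    return statistics.mean(rtfs), statistics.stdev(rtfs)


# Hypothetical measurements: (seconds spent, mel frames generated).
measurements = [(0.45, 400), (0.52, 460), (0.61, 520)]
rtfs = [
    real_time_factor(sec, frames, hop_size=256, sample_rate=22050)
    for sec, frames in measurements
]
mean_rtf, std_rtf = summarize(rtfs)
print(f"RTF = {mean_rtf:.3f} +/- {std_rtf:.3f}")
```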
Subjective evaluation (MOS)
- Mean opinion score (MOS) on naturalness
  - #subjects = 101 @ Amazon Mechanical Turk

Method                        MOS (± 95% CI)
Tacotron 2 (Char, Forward)    4.14 ± 0.06
Tacotron 2 (Char, Location)   4.20 ± 0.06
Transformer (Char)            4.17 ± 0.06
Transformer (Phoneme)         4.25 ± 0.06
CSTR/Merlin                   2.69 ± 0.09
NVIDIA/tacotron2※             4.21 ± 0.06
Mozilla/TTS※                  3.91 ± 0.07
Groundtruth (Raw)             4.46 ± 0.05

Please check the samples from the QR code!
Tacotron 2 and Transformer-TTS have almost the same performance
Our best model achieves performance comparable to the state of the art
※ The evaluation samples might be included in training data.
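A "MOS ± 95% CI" entry can be computed from the individual ratings as sketched below. The slides do not state how the interval was obtained; this example assumes the common normal approximation, mean ± 1.96 × s / sqrt(n), and the ratings are made-up toy data.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(scores: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of the confidence interval)."""
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


ratings = [5, 4, 4, 5, 3, 4, 5, 4]  # toy 1-to-5 naturalness ratings
mos, ci = mos_with_ci(ratings)
print(f"{mos:.2f} +/- {ci:.2f}")  # -> 4.25 +/- 0.49
```

With many more ratings per system, as in a crowdsourced test, the interval shrinks roughly with the square root of the number of ratings.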
Demonstration
- Demo notebooks with Google Colab
  1. E2E-TTS real-time demonstration
     https://bit.ly/2Vex0Iw
     You can generate your favorite sentence in En, Jp, Zn!
  2. E2E-TTS recipe tutorial
     https://bit.ly/3bhv0ow
     You can learn the TTS recipe flow online!
Closing
- Conclusion
  - Introduced open-source toolkit ESPnet-TTS
    - Developed for the research community
    - Make E2E-TTS more user-friendly
    - Accelerate the research in this field
  - Provide various Text2Mel and Mel2Wav models
  - Provide reproducible recipes including various langs
  - Achieved the performance comparable to SoTA
We always welcome your feature requests and pull requests!

 
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISHA NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
A NEURAL MACHINE LANGUAGE TRANSLATION SYSTEM FROM GERMAN TO ENGLISH
 
IRJET- On-Screen Translator using NLP and Text Detection
IRJET- On-Screen Translator using NLP and Text DetectionIRJET- On-Screen Translator using NLP and Text Detection
IRJET- On-Screen Translator using NLP and Text Detection
 
Recent Trends in Translation of Programming Languages using NLP Approaches
Recent Trends in Translation of Programming Languages using NLP ApproachesRecent Trends in Translation of Programming Languages using NLP Approaches
Recent Trends in Translation of Programming Languages using NLP Approaches
 


ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit

  • 1. ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
      Tomoki Hayashi (@kan-bayashi)¹,², Ryuichi Yamamoto³, Katsuki Inoue⁴, Takenori Yoshimura¹,², Shinji Watanabe⁵, Tomoki Toda¹, Kazuya Takeda¹, Yu Zhang⁶, Xu Tan⁷
      ¹Nagoya University, ²Human Dataware lab. Co., Ltd., ³LINE Corp., ⁴Okayama University, ⁵Johns Hopkins University, ⁶Google AI, ⁷Microsoft Research
  • 2. Background
      □ The era of End-to-End Text-to-Speech (E2E-TTS)
      □ Various advantages of E2E-TTS
        ■ Requires no language-dependent expert knowledge
        ■ Requires no alignment between text and speech
      □ More and more new research ideas
        ■ Style control / Multi-speaker / Multi-lingual / etc.
      [Figure: "Hello, world!" → Text2Mel → Mel2Wav → Speech (neural network)]
      We definitely need to accelerate the research and prepare a comparable baseline!
      ICASSP2020 ESPNET-TTS: UNIFIED, REPRODUCIBLE, AND INTEGRATABLE OPEN SOURCE END-TO-END TEXT-TO-SPEECH TOOLKIT
  • 3. Background
      We introduce ESPnet-TTS, a new open-source toolkit for E2E-TTS
  • 4. What is ESPnet-TTS?
      □ Open-source E2E-TTS toolkit
        ■ Apache 2.0 license / PyTorch as the main network engine
      □ Developed for the research community
        ■ Easy to reproduce state-of-the-art models
        ■ Can be used as a baseline to check performance
      1. Support for various Text2Mel models
        ■ Includes autoregressive (AR), non-AR, and multi-speaker models
      2. Support for various Mel2Wav models
        ■ Includes both AR and the latest non-AR models
      3. Unified and reproducible Kaldi-style recipes
        ■ Supports 10+ recipes including En, Jp, Zh, and more
        ■ Provides pretrained models for all recipes
        ■ Integratable with ASR functions (extension of ESPnet)
  • 5. ESPnet-TTS functions
  • 6. Supported Text2Mel models
      [Figure: "Hello, world!" → Text2Mel → Mel2Wav → Speech (neural network)]
  • 7. Supported Text2Mel models
      [Figure: the same pipeline with the Text2Mel stage marked "This part!"]
      Autoregressive:
        ■ Tacotron 2 [Shen+, 2018]
          Blocks: Input sequence, CNN+BLSTM Encoder, Attention, Prenet, LSTM Decoder, Postnet, Next output
        ■ Transformer-TTS [Li+, 2018]
          Blocks: Input sequence, Encoder Prenet, Positional Encoding, Transformer Encoder, Decoder Prenet, Positional Encoding, Transformer Decoder, Postnet, Next output
      Non-autoregressive:
        ■ FastSpeech [Ren+, 2019]
          Blocks: Input sequence, Embedding, Positional Encoding, Transformer Encoder, Duration Predictor, Duration, Length Regulator, Transformer Decoder, Output sequence
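The Length Regulator block listed for FastSpeech is what makes non-autoregressive generation possible: each encoder state is repeated for as many frames as its predicted duration, so the decoder receives a sequence of the final length in one pass. The following is a minimal pure-Python sketch of that idea, not ESPnet's implementation. The names `length_regulate` and `alpha` are hypothetical, and `alpha` is assumed to act as a simple duration scaling factor.

```python
from typing import List, Sequence


def length_regulate(
    states: Sequence[Sequence[float]],
    durations: Sequence[int],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Repeat each encoder state according to its duration (in frames).

    alpha is an assumed speed-control factor: values above 1.0 lengthen
    the output, values below 1.0 shorten it.
    """
    if len(states) != len(durations):
        raise ValueError("states and durations must have the same length")
    expanded: List[List[float]] = []
    for state, duration in zip(states, durations):
        n_frames = max(0, int(round(duration * alpha)))
        # Copy the state so that output frames do not share one list object.
        expanded.extend(list(state) for _ in range(n_frames))
    return expanded


if __name__ == "__main__":
    hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    print(length_regulate(hidden, [2, 0, 3]))
```

A token with duration 0 simply disappears from the output, which is why the quality of the duration predictor matters for this architecture.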
  • 8. Multi-speaker extension (1)
      □ Extension with pretrained speaker embedding
        ■ Use X-Vector [Snyder+, 2018] trained on the VoxCeleb corpus
      [Figure: Tacotron 2 [Shen+, 2018] next to Multi-speaker Tacotron 2 [Jia+, 2018]. The multi-speaker version adds Reference audio → Pretrained X-vector Extractor → Add / Concat to the Tacotron 2 blocks (Input sequence, CNN+BLSTM Encoder, Attention, Prenet, LSTM Decoder, Postnet, Next output)]
  • 9. Multi-speaker extension (2)
      □ Extension with pretrained speaker embedding
        ■ Apply the same idea to the other models
      [Figure: Multi-speaker Transformer-TTS and Multi-speaker FastSpeech (※ EXPERIMENTAL). Each adds Reference audio → Pretrained X-vector Extractor → Add / Concat to the blocks of the single-speaker model]
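The "Add / Concat" box in the multi-speaker figures can be illustrated with a short sketch. It assumes that the speaker embedding is combined with every encoder output frame, as in [Jia+, 2018]; the slides do not state the exact integration point. The function name `integrate_speaker_embedding` is hypothetical, and for the "add" mode the x-vector is assumed to have already been projected to the encoder dimension.

```python
from typing import List, Sequence


def integrate_speaker_embedding(
    encoder_states: Sequence[Sequence[float]],
    xvector: Sequence[float],
    mode: str = "concat",
) -> List[List[float]]:
    """Combine one utterance-level speaker embedding with every encoder frame."""
    if mode == "concat":
        # Each frame grows from D to D + len(xvector) dimensions.
        return [list(state) + list(xvector) for state in encoder_states]
    if mode == "add":
        # Assumes the x-vector was projected to the encoder dimension beforehand.
        for state in encoder_states:
            if len(state) != len(xvector):
                raise ValueError("add mode needs matching dimensions")
        return [[s + x for s, x in zip(state, xvector)] for state in encoder_states]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    states = [[1.0, 2.0], [3.0, 4.0]]
    print(integrate_speaker_embedding(states, [0.5], mode="concat"))
    print(integrate_speaker_embedding(states, [0.5, 0.5], mode="add"))
```

Concatenation keeps the speaker information in separate dimensions at the cost of a wider decoder input, while addition keeps the dimension fixed but needs a projection layer in front of it.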
  • 10. Supported Mel2Wav models
      [Figure: "Hello, world!" → Text2Mel → Mel2Wav → Speech (neural network)]
  • 11. Supported Mel2Wav models
      [Figure: the same pipeline with the Mel2Wav stage marked "This part!"]
      Autoregressive:
        ■ WaveNet [Oord+, 2016]
          Blocks: Mel spectrogram, Upsample network, Previous waveform, Deep causal dilated CNN, Posterior, Sampling, Next waveform
          Mixture of Logistics (MoL) and Softmax support
      Non-autoregressive:
        ■ Parallel WaveGAN [Yamamoto+, 2020]
          Blocks: Mel spectrogram, Upsample network, Noise sequence, Deep dilated CNN, Waveform sequence
        ■ MelGAN [Kumar+, 2019]
          Blocks: Mel spectrogram, Upsample deep CNN, Waveform sequence
        Support the combination of these GAN-based models
  • 12. Supported Mel2Wav models
      Parallel WaveGAN [Yamamoto+, 2020]: please check Ryuichi's presentation at this ICASSP.
  • 13. Other remarkable functions
      □ Dynamic batch size to maximize GPU utilization
        ■ Change the batch size dynamically according to the sequence length
      □ Gradient accumulation
        ■ Pseudo-increase the batch size even with a single GPU
      □ Guided attention loss [Tachibana+, 2017]
        ■ Constrain the attention weights to be diagonal
      □ Attention constraint decoding [Ping+, 2017]
        ■ Decode stably even with a long input sentence
      □ Forward attention [Zhang+, 2018]
        ■ Attention mechanism with causal regularization
      □ CBHG [Wang+, 2017]
        ■ Upsample the frequency resolution
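To make "constrain the attention weights to be diagonal" concrete, the sketch below follows the penalty form proposed in [Tachibana+, 2017]: each attention weight is multiplied by a factor that is zero on the normalized diagonal and grows towards one away from it. This is a pure-Python illustration, not the ESPnet code. The function names are hypothetical, and the default `sigma=0.2` is an assumed width that should be treated as a tunable hyperparameter.

```python
import math
from typing import List, Sequence


def guided_attention_weights(n_frames: int, n_text: int, sigma: float = 0.2) -> List[List[float]]:
    """Penalty matrix W[t][n] = 1 - exp(-(n/N - t/T)^2 / (2 * sigma^2))."""
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * sigma ** 2))
            for n in range(n_text)
        ]
        for t in range(n_frames)
    ]


def guided_attention_loss(attention: Sequence[Sequence[float]], sigma: float = 0.2) -> float:
    """Mean of attention * penalty over all (output frame, input token) pairs."""
    n_frames, n_text = len(attention), len(attention[0])
    penalty = guided_attention_weights(n_frames, n_text, sigma)
    total = sum(
        attention[t][n] * penalty[t][n]
        for t in range(n_frames)
        for n in range(n_text)
    )
    return total / (n_frames * n_text)


if __name__ == "__main__":
    diagonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    reversed_alignment = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
    print(guided_attention_loss(diagonal), guided_attention_loss(reversed_alignment))
```

A perfectly diagonal alignment incurs no penalty, while an alignment that runs against the reading order is penalized, which pushes the model towards monotonic alignments early in training.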
  • 14. ESPnet-TTS recipes
  • 15. Unified, reproducible recipe
      □ All-in-one Kaldi-style recipe
        ■ Includes all procedures needed to reproduce the results
        ■ Has a unified design for both ASR and TTS recipes
      The same data format for ASR and TTS recipes
  • 16. Unified, reproducible recipe
      ASR and TTS recipes can be converted to each other
  • 17. Supported recipes
      □ Support 10+ recipes covering 10 languages
      Corpus name     Lang                          Recipe type
      Arctic          En                            Adaptation
      Blizzard 2017   En                            Single
      CSMSC           Zh                            Single
      JNAS            Jp                            Multi
      JVS             Jp                            Adaptation
      JSUT            Jp                            Single
      LibriTTS        En                            Multi
      LJSpeech        En                            Single
      M-AILABS        En, De, Fr, Es, Pl, Uk, Ru    Single
      TWEB            En                            Single
      VAIS1000        Vi                            Single
      We provide pretrained models for all recipes
  • 18. Integration with ASR
      □ ASR-based evaluation for TTS
        ■ Automatically check for deletion or repetition of words
      □ Advanced recipes combining TTS with ASR
        ■ ASR-TTS cycle-consistency training [Karthick+, 2019]
        ■ Semi-supervised ASR-TTS training [Karita+, 2019]
        ■ Non-parallel voice conversion
          ● Cascade ASR + TTS system
          ● VCC2020 baseline system (http://www.vc-challenge.org/)
      We can combine TTS with ASR for development and new research ideas
      ※ Not merged yet
  • 19. ESPnet-TTS performance
  • 20. Experimental condition
      □ Evaluation with the LJSpeech dataset
        ■ #training 12,600 / #validation 250 / #evaluation 250
      □ Comparison methods (input type, [attention type])
        ■ Tacotron 2 (Char, Forward)
        ■ Tacotron 2 (Char, Location)
        ■ Transformer (Char)
        ■ Transformer (Phoneme)
        ■ FastSpeech (Char)※1
        ■ FastSpeech (Phoneme)※1
        The same MoL-WaveNet trained with natural features is used
      □ Comparison with other toolkits
        ■ CSTR/Merlin: Conventional TTS + WORLD [Morise+, 2016]
        ■ NVIDIA/tacotron2: Pretrained※2 Tacotron 2 + WaveGlow
        ■ Mozilla/TTS: Pretrained※2 Tacotron 2 + WaveRNN
      ※1 We did not use knowledge distillation.
      ※2 Data split is different. The evaluation samples might be included in training data.
  • 21. Objective evaluation (CER)
      □ Character error rate (CER)
        ■ ASR model: Transformer trained on Librispeech
      Method                         Sub [%]   Del [%]   Ins [%]   CER [%]
      Tacotron 2 (Char, Forward)     0.4       1.0       3.6※      5.0
      Tacotron 2 (Char, Location)    0.5       1.2       0.3       2.1
      Transformer (Char)             0.6       1.7       0.5       2.8
      Transformer (Phoneme)          0.5       1.8       0.5       2.8
      FastSpeech (Char)              0.3       0.9       0.3       1.6
      FastSpeech (Phoneme)           0.4       1.3       0.4       2.1
      Groundtruth (Raw)              0.3       0.7       0.3       1.3
      ※ Only one sample failed to stop the generation
  • 22. Objective evaluation (CER)
      Tacotron 2 is more robust than Transformer-TTS
  • 23. Objective evaluation (CER)
      FastSpeech is the most robust thanks to its non-AR architecture
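The Sub, Del, and Ins columns are the substitution, deletion, and insertion counts of a minimum edit-distance alignment between the reference text and the ASR transcript of the synthesized speech, each divided by the reference length; CER is their sum. The sketch below shows that computation for a single utterance in pure Python. It is an illustration under assumptions, not the scoring tool used for the table: the names `error_counts` and `cer` are hypothetical, and text normalization is left out.

```python
from typing import Tuple


def error_counts(ref: str, hyp: str) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    # Each cell holds (total_cost, substitutions, deletions, insertions).
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                diag = prev[j - 1]
            else:
                t, s, d, n = prev[j - 1]
                diag = (t + 1, s + 1, d, n)
            t, s, d, n = prev[j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cur[j - 1]
            insertion = (t + 1, s, d, n + 1)
            cur.append(min(diag, deletion, insertion, key=lambda c: c[0]))
        prev = cur
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    subs, dels, ins = error_counts(ref, hyp)
    return 100.0 * (subs + dels + ins) / len(ref)


if __name__ == "__main__":
    print(error_counts("hello world", "hallo word"), cer("hello world", "hallo word"))
```

Read this way, a high insertion rate points to repeated or runaway output, which matches the footnote on the sample that failed to stop, and a high deletion rate points to skipped words.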
  • 24. Objective evaluation (RTF)
      □ Real-time factor (RTF) of Char-based models
        ■ Calculate the speed for only the Text2Mel part
        ■ GPU: Titan V / CPU: Xeon Gold 6154 3 GHz x 16 threads
      Method                   RTF on CPU       RTF on GPU
      Tacotron 2 (Forward)     0.216 ± 0.016    0.104 ± 0.006
      Tacotron 2 (Location)    0.225 ± 0.016    0.094 ± 0.009
      Transformer              0.851 ± 0.076    0.634 ± 0.025
      FastSpeech               0.015 ± 0.005    0.003 ± 0.004
  • 25. Objective evaluation (RTF)
      Tacotron 2 is faster than Transformer-TTS
  • 26. Objective evaluation (RTF)
      FastSpeech is much faster than real-time thanks to its non-AR architecture
  • 27. Objective evaluation (RTF)
      □ (For reference) RTF of non-AR Mel2Wav models
      Method              RTF on CPU    RTF on GPU
      Parallel WaveGAN    0.734         0.016
      MelGAN              0.137         0.002
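The real-time factor is the processing time divided by the duration of the audio that is produced, so values below 1 mean faster than real time. For a Text2Mel-only measurement the audio duration has to be derived from the number of generated mel frames. The sketch below shows one way to do that; the hop size of 256 samples and the 22,050 Hz sampling rate are assumptions chosen for illustration, and the names `real_time_factor` and `measure_rtf` are hypothetical.

```python
import time
from typing import Callable


def real_time_factor(
    elapsed_sec: float,
    n_frames: int,
    hop_size: int = 256,       # assumed frame shift in samples
    sample_rate: int = 22050,  # assumed sampling rate in Hz
) -> float:
    """Processing time divided by the duration of the generated audio."""
    audio_sec = n_frames * hop_size / sample_rate
    if audio_sec <= 0:
        raise ValueError("no audio was generated")
    return elapsed_sec / audio_sec


def measure_rtf(
    synthesize: Callable[[], int],
    hop_size: int = 256,
    sample_rate: int = 22050,
) -> float:
    """Time one call of `synthesize`, which must return the number of mel frames."""
    start = time.perf_counter()
    n_frames = synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, n_frames, hop_size, sample_rate)


if __name__ == "__main__":
    # Stand-in for a real Text2Mel model: pretend that 500 frames were generated.
    print(measure_rtf(lambda: 500))
```

Averaging this value over the evaluation set and reporting its spread gives numbers in the same form as the tables above.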
  • 28. Subjective evaluation (MOS)
      □ Mean opinion score (MOS) on naturalness
        ■ #subjects = 101 @ Amazon Mechanical Turk
      Method                         MOS (± 95% CI)
      Tacotron 2 (Char, Forward)     4.14 ± 0.06
      Tacotron 2 (Char, Location)    4.20 ± 0.06
      Transformer (Char)             4.17 ± 0.06
      Transformer (Phoneme)          4.25 ± 0.06
      CSTR/Merlin                    2.69 ± 0.09
      NVIDIA/tacotron2※              4.21 ± 0.06
      Mozilla/TTS※                   3.91 ± 0.07
      Groundtruth (Raw)              4.46 ± 0.05
      Please check the samples from the QR code!
  • 29. Subjective evaluation (MOS)
      Tacotron 2 and Transformer-TTS have almost the same performance
  • 30. Subjective evaluation (MOS)
      Our best model can achieve performance comparable to the state of the art
      ※ The evaluation samples might be included in training data.
  • 31. Subjective evaluation (MOS)
      Please check the samples from the QR code!
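Each entry in the MOS table is a mean rating with a 95% confidence interval. The slides do not say how the interval was computed; the sketch below assumes the common normal approximation, mean ± 1.96 × s / √n, where s is the sample standard deviation. The function name `mos_with_ci` and the example ratings are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean, half-width of the CI) under a normal approximation."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    # Sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(ratings)
    half_width = z * std / math.sqrt(len(ratings))
    return mean, half_width


if __name__ == "__main__":
    example_ratings = [5, 4, 4, 5, 3, 4, 5, 4]  # made-up ratings on a 1 to 5 scale
    mean, ci = mos_with_ci(example_ratings)
    print(f"{mean:.2f} ± {ci:.2f}")
```

Two systems whose intervals overlap heavily, such as the Tacotron 2 and Transformer rows, should not be ranked on the basis of the mean alone.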
  • 32. Demonstration
      □ Demo notebooks with Google Colab
        1. E2E-TTS real-time demonstration: https://bit.ly/2Vex0Iw
           You can generate your favorite sentence in En, Jp, Zh!
        2. E2E-TTS recipe tutorial: https://bit.ly/3bhv0ow
           You can learn the TTS recipe flow online!
  • 33. Closing
      □ Conclusion
        ■ Introduced the open-source toolkit ESPnet-TTS
          ● Developed for the research community
          ● Makes E2E-TTS more user-friendly
          ● Accelerates research in this field
        ■ Provides various Text2Mel and Mel2Wav models
        ■ Provides reproducible recipes covering various languages
        ■ Achieved performance comparable to SoTA
      We always welcome your feature requests and pull requests!