2. Background
• The era of end-to-end text-to-speech (E2E-TTS)
• Various advantages of E2E-TTS
  – Requires no language-dependent expert knowledge
  – Requires no alignment between text and speech
• More and more new research ideas
  – Style control / multi-speaker / multi-lingual / etc.
ICASSP2020 ESPNET-TTS: UNIFIED, REPRODUCIBLE, AND INTEGRATABLE OPEN SOURCE END-TO-END TEXT-TO-SPEECH TOOLKIT
[Figure: E2E-TTS pipeline — "Hello, world!" text → Text2Mel → Mel2Wav neural networks → speech]
We definitely need to accelerate the research
and prepare comparable baselines!
3. Background (cont.)
We introduce ESPnet-TTS,
a new open-source toolkit for E2E-TTS.
4. What is ESPnet-TTS?
• Open-source E2E-TTS toolkit
  – Apache 2.0 license / PyTorch as the main network engine
• Developed for the research community
  – Easy to reproduce state-of-the-art models
  – Can be used as a baseline to check performance
1. Support for various Text2Mel models
  – Includes autoregressive (AR), non-AR, and multi-speaker models
2. Support for various Mel2Wav models
  – Includes both AR and the latest non-AR models
3. Unified and reproducible Kaldi-style recipes
  – Support 10+ recipes including En, Jp, Zh, and more
  – Provide pretrained models for all recipes
  – Integratable with ASR functions
(Extension of ESPnet)
8. Multi-speaker extension (1)
• Extension with a pretrained speaker embedding
  – Use x-vectors [Snyder+, 2018] trained on the VoxCeleb corpus
[Figure: Multi-speaker Tacotron 2 [Jia+, 2018], built on Tacotron 2 [Shen+, 2018] — a pretrained x-vector extractor embeds the reference audio, and the embedding is added to / concatenated with the CNN+BLSTM encoder output before the attention LSTM decoder (prenet, postnet, next-output feedback)]
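The add/concat step above can be sketched in a few lines. This is a minimal NumPy illustration, not ESPnet-TTS code; the function name `integrate_spk_embedding` and the tensor shapes are assumptions made for clarity:

```python
import numpy as np

def integrate_spk_embedding(encoder_out, spk_emb, mode="concat"):
    """Integrate an utterance-level speaker embedding with encoder outputs.

    encoder_out: (T, D) encoder hidden states.
    spk_emb:     (S,)   x-vector for the reference audio.
    mode: "add" broadcasts the embedding over time and adds it
          (in practice a learned linear projection matches S to D);
          "concat" appends it to every encoder frame.
    """
    T, D = encoder_out.shape
    if mode == "add":
        assert spk_emb.shape[0] == D, "add requires matching dimensions"
        return encoder_out + spk_emb[None, :]            # (T, D)
    if mode == "concat":
        tiled = np.tile(spk_emb[None, :], (T, 1))        # (T, S)
        return np.concatenate([encoder_out, tiled], axis=1)  # (T, D + S)
    raise ValueError(f"unknown mode: {mode}")
```

Because the embedding is utterance-level, the same vector conditions every time step of the encoder output.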
9. Multi-speaker extension (2)
• Extension with a pretrained speaker embedding
  – Apply the same idea to the other models
[Figure: Multi-speaker Transformer-TTS and multi-speaker FastSpeech (※ EXPERIMENTAL) — in both, a pretrained x-vector extractor embeds the reference audio and the embedding is added to / concatenated with the Transformer encoder output (encoder prenet, positional encoding). Transformer-TTS then runs the Transformer decoder with decoder prenet, positional encoding, postnet, and next-output feedback; FastSpeech instead uses a duration predictor and length regulator to expand the encoder output into the output sequence]
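The FastSpeech length regulator mentioned above is simple enough to sketch directly: each phoneme-level hidden state is repeated by its predicted (integer) duration so the expanded sequence matches the mel-spectrogram length. A minimal NumPy illustration, not the toolkit's implementation; the function name is chosen for this example:

```python
import numpy as np

def length_regulator(encoder_out, durations):
    """FastSpeech-style length regulator.

    encoder_out: (T_in, D) phoneme-level hidden states.
    durations:   (T_in,)  integer frame counts per input token.
    Returns: (sum(durations), D) frame-level sequence.
    """
    # np.repeat with a per-row count expands each hidden state in place.
    return np.repeat(encoder_out, durations, axis=0)
```

Because expansion is a deterministic repeat, the decoder can then generate all frames in parallel, which is what makes the model non-autoregressive.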
10. Supported Mel2Wav models
[Figure: E2E-TTS pipeline — "Hello, world!" text → Text2Mel → Mel2Wav neural networks → speech]
11. Supported Mel2Wav models
[Figure: E2E-TTS pipeline with the Mel2Wav stage highlighted ("This part!"), and the three supported architectures:
 – WaveNet [Oord+, 2016] (autoregressive): the mel spectrogram is expanded by an upsampling network; a deep causal dilated CNN maps previous waveform samples to a posterior, from which the next sample is drawn. Mixture of Logistics (MoL) and Softmax outputs are supported.
 – Parallel WaveGAN [Yamamoto+, 2020] (non-autoregressive): the upsampled mel spectrogram conditions a deep dilated CNN that maps a noise sequence to the waveform sequence.
 – MelGAN [Kumar+, 2019] (non-autoregressive): an upsampling deep CNN maps the mel spectrogram directly to the waveform sequence.
 Combinations of these GAN-based models are supported.]
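For intuition about the "deep (causal) dilated CNN" blocks above: the receptive field of a stack of dilated convolutions follows simple arithmetic, which is why a few layers can cover thousands of waveform samples. A sketch; the default layer/stack counts are illustrative, not ESPnet-TTS settings:

```python
def wavenet_receptive_field(layers=10, stacks=3, kernel_size=2):
    """Receptive field (in samples) of a WaveNet-style stack of dilated
    convolutions with dilations 1, 2, 4, ..., 2**(layers-1), repeated
    `stacks` times.

    Each layer with dilation d and kernel size k widens the receptive
    field by (k - 1) * d, starting from a single sample.
    """
    dilations = [2 ** i for i in range(layers)] * stacks
    return (kernel_size - 1) * sum(dilations) + 1
```

With the illustrative defaults (3 stacks of 10 layers, kernel size 2), the receptive field is 3,070 samples, i.e. roughly 0.14 s of 22.05 kHz audio.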
12. Supported Mel2Wav models (cont.)
For Parallel WaveGAN [Yamamoto+, 2020], please check
Ryuichi's presentation at this ICASSP.
13. Other remarkable functions
• Dynamic batch size to maximize GPU utilization
  – Change the batch size dynamically according to sequence length
• Gradient accumulation
  – Pseudo-increase the batch size even with a single GPU
• Guided attention loss [Tachibana+, 2017]
  – Constrain the attention weights to be diagonal
• Attention constraint decoding [Ping+, 2017]
  – Decode stably with long input sentences
• Forward attention [Zhang+, 2018]
  – Attention mechanism with causal regularization
• CBHG [Wang+, 2017]
  – Upsample the frequency resolution
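The guided attention loss in the list above has a compact form: a soft diagonal mask penalizes attention mass far from the diagonal. A minimal NumPy sketch of the penalty from [Tachibana+, 2017]; the sigma value and function name are illustrative:

```python
import numpy as np

def guided_attention_loss(att, sigma=0.2):
    """Guided attention loss sketch [Tachibana+, 2017].

    att: (T_out, T_in) attention weight matrix (rows sum to 1).
    Builds a penalty W[n, t] = 1 - exp(-(n/T_out - t/T_in)^2 / (2*sigma^2))
    that is 0 on the normalized diagonal and grows toward 1 off it,
    then averages it weighted by the attention matrix.
    """
    T_out, T_in = att.shape
    n = np.arange(T_out)[:, None] / T_out  # normalized output positions
    t = np.arange(T_in)[None, :] / T_in    # normalized input positions
    W = 1.0 - np.exp(-((n - t) ** 2) / (2 * sigma ** 2))
    return float(np.mean(att * W))
```

A perfectly diagonal alignment incurs (near-)zero loss, while attention that wanders off the diagonal is pushed back, which speeds up alignment learning early in training.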
15. Unified, reproducible recipes
• All-in-one Kaldi-style recipes
  – Include all procedures needed to reproduce the results
  – Have a unified design for both ASR and TTS recipes
The same data format is used for ASR and TTS recipes.
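The shared data format can be illustrated with a tiny parser: Kaldi-style tables are plain text files of "utt-id content" lines, so one helper serves both `text` (transcriptions: ASR targets / TTS inputs) and `wav.scp` (audio: ASR inputs / TTS targets). A hedged sketch, not toolkit code; the example utterance IDs follow the LJSpeech naming but the contents are made up:

```python
def read_kaldi_table(lines):
    """Parse Kaldi-style table lines ("<utt-id> <content...>") into a dict.

    Works identically for `text` (utt-id -> transcription) and
    `wav.scp` (utt-id -> wav path or pipe command), which is what lets
    ASR and TTS recipes share data preparation.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, content = line.split(maxsplit=1)
        table[utt_id] = content
    return table
```

Because both tasks key everything by utterance ID, swapping inputs and targets is enough to turn a TTS data directory into an ASR one and vice versa.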
16. Unified, reproducible recipes (cont.)
ASR and TTS recipes can be converted to each other.
17. Supported recipes
• Support 10+ recipes covering 10 languages
Corpus name Lang Recipe type
Arctic En Adaptation
Blizzard 2017 En Single
CSMSC Zh Single
JNAS Jp Multi
JVS Jp Adaptation
JSUT Jp Single
LibriTTS En Multi
LJSpeech En Single
M-AILABS En, De, Fr, Es, Pl, Uk, Ru Single
TWEB En Single
VAIS1000 Vi Single
We provide pretrained models for all recipes.
18. Integration with ASR
• ASR-based evaluation for TTS
  – Automatically check for deleted or repeated words
• Advanced recipes combining TTS with ASR
  – ASR-TTS cycle-consistency training [Karthick+, 2019]
  – Semi-supervised ASR-TTS training [Karita+, 2019]
  – Non-parallel voice conversion
    · Cascaded ASR + TTS system
    · VCC2020 baseline system (http://www.vc-challenge.org/)
We can combine TTS with ASR
for development and new research ideas.
※Not merged yet
20. Experimental conditions
• Evaluation on the LJSpeech dataset
  – #training 12,600 / #validation 250 / #evaluation 250
• Compared methods (input type, [attention type])
  – Tacotron 2 (Char, Location)
  – Tacotron 2 (Char, Forward)
  – Transformer (Char)
  – Transformer (Phoneme)
  – FastSpeech (Char)※1
  – FastSpeech (Phoneme)※1
• Comparison with other toolkits
  – CSTR/Merlin: Conventional TTS + WORLD [Morise+, 2016]
  – NVIDIA/tacotron2: Pretrained※2 Tacotron 2 + WaveGlow
  – Mozilla/TTS: Pretrained※2 Tacotron 2 + WaveRNN
※1 We did not use knowledge distillation.
※2 Data split is different; the evaluation samples might be included in the training data.
The same MoL-WaveNet trained with natural features is used.
21. Objective evaluation (CER)
• Character error rate (CER)
  – ASR model: Transformer trained on LibriSpeech
Method Sub [%] Del [%] Ins [%] CER [%]
Tacotron 2 (Char, Forward) 0.4 1.0 3.6※ 5.0
Tacotron 2 (Char, Location) 0.5 1.2 0.3 2.1
Transformer (Char) 0.6 1.7 0.5 2.8
Transformer (Phoneme) 0.5 1.8 0.5 2.8
FastSpeech (Char) 0.3 0.9 0.3 1.6
FastSpeech (Phoneme) 0.4 1.3 0.4 2.1
Groundtruth (Raw) 0.3 0.7 0.3 1.3
※Only one sample failed to stop the generation
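The CER metric in the table above is derived from the Levenshtein edit distance between reference and ASR-recognized text: (substitutions + deletions + insertions) / reference length. A plain-Python sketch of the metric itself (the actual evaluation feeds ASR hypotheses into it):

```python
def char_error_rate(ref, hyp):
    """Character error rate via Levenshtein distance."""
    r, h = list(ref), list(hyp)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)
```

Deletions flag dropped words, insertions flag repetitions or non-stopping generation (the failure marked ※ above), which is why the per-error-type breakdown is informative for TTS robustness.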
22. Objective evaluation (CER) (cont.)
Tacotron 2 is more robust than Transformer-TTS.
23. Objective evaluation (CER) (cont.)
FastSpeech is the most robust,
thanks to its non-AR architecture.
24. Objective evaluation (RTF)
• Real-time factor (RTF) of Char-based models
  – Measured for the Text2Mel part only
  – GPU: Titan V / CPU: Xeon Gold 6154 3 GHz × 16 threads
Method RTF on CPU RTF on GPU
Tacotron 2 (Forward) 0.216 ± 0.016 0.104 ± 0.006
Tacotron 2 (Location) 0.225 ± 0.016 0.094 ± 0.009
Transformer 0.851 ± 0.076 0.634 ± 0.025
FastSpeech 0.015 ± 0.005 0.003 ± 0.004
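RTF itself is simple arithmetic: wall-clock synthesis time divided by the duration of the generated audio, so RTF < 1 means faster than real time. A sketch; the timing harness and function names are illustrative, not the benchmark script used for the table above:

```python
import time

def real_time_factor(synth_fn, text, audio_duration_sec):
    """Measure the real-time factor of a synthesis callable.

    synth_fn:           callable performing synthesis for `text`.
    audio_duration_sec: duration of the audio it generates, in seconds.
    Returns elapsed_time / audio_duration (RTF < 1 => faster than real time).
    """
    start = time.perf_counter()
    synth_fn(text)
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_sec
```

Under this definition, FastSpeech's 0.003 GPU RTF in the table means one second of mel-spectrogram frames is produced in about three milliseconds.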
25. Objective evaluation (RTF) (cont.)
Tacotron 2 is faster than Transformer-TTS.
26. Objective evaluation (RTF) (cont.)
FastSpeech is much faster than real time,
thanks to its non-AR architecture.
27. Objective evaluation (RTF) (cont.)
• (For reference) RTF of non-AR Mel2Wav models
Method RTF on CPU RTF on GPU
Parallel WaveGAN 0.734 0.016
MelGAN 0.137 0.002
28. Subjective evaluation (MOS)
• Mean opinion score (MOS) on naturalness
  – #subjects = 101 via Amazon Mechanical Turk
Method MOS (± 95% CI)
Tacotron 2 (Char, Forward) 4.14 ± 0.06
Tacotron 2 (Char, Location) 4.20 ± 0.06
Transformer (Char) 4.17 ± 0.06
Transformer (Phoneme) 4.25 ± 0.06
CSTR/Merlin 2.69 ± 0.09
NVIDIA/tacotron2※ 4.21 ± 0.06
Mozilla/TTS※ 3.91 ± 0.07
Groundtruth (Raw) 4.46 ± 0.05
Please check the samples
via the QR code!
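The MOS ± 95% CI figures in the table above follow the usual normal approximation: mean ± 1.96 × standard error. A small illustrative helper, not the evaluation script:

```python
import math

def mos_with_ci(scores):
    """Mean opinion score with a 95% confidence interval.

    scores: list of per-rating opinion scores (e.g. 1-5).
    Returns (mean, half_width) where the interval is mean +/- half_width,
    using the normal approximation 1.96 * sqrt(sample_var / n).
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)
    return mean, half_width
```

With large rating counts the intervals shrink, which is why systems separated by less than the ±0.06 margins above should be read as statistically comparable.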
29. Subjective evaluation (MOS) (cont.)
Tacotron 2 and Transformer-TTS achieve
almost the same performance.
30. Subjective evaluation (MOS) (cont.)
Our best model achieves performance
comparable to the state of the art.
※ The evaluation samples might be included in training data.
32. Demonstration
• Demo notebooks on Google Colab
  1. E2E-TTS real-time demonstration
     https://bit.ly/2Vex0Iw
     You can generate your favorite sentences in En, Jp, Zh!
  2. E2E-TTS recipe tutorial
     https://bit.ly/3bhv0ow
     You can learn the TTS recipe flow online!
33. Closing
• Conclusion
  – Introduced the open-source toolkit ESPnet-TTS
    · Developed for the research community
    · Makes E2E-TTS more user-friendly
    · Accelerates research in this field
  – Provides various Text2Mel and Mel2Wav models
  – Provides reproducible recipes covering various languages
  – Achieves performance comparable to the state of the art
We always welcome
your feature requests and pull requests!