2. Background
• The era of end-to-end text-to-speech (E2E-TTS)
• Various advantages of E2E-TTS
  – Requires no language-dependent expert knowledge
  – Requires no alignment between text and speech
• More and more new research ideas
  – Style control / multi-speaker / multi-lingual / etc.
ICASSP2020 ESPNET-TTS: UNIFIED, REPRODUCIBLE, AND INTEGRATABLE OPEN SOURCE END-TO-END TEXT-TO-SPEECH TOOLKIT
[Figure: E2E-TTS pipeline — "Hello, world!" text → Text2Mel → Mel2Wav neural networks → speech]
We definitely need to accelerate the research
and prepare comparable baselines!
3. Background (cont.)
We introduce ESPnet-TTS,
a new open-source toolkit for E2E-TTS.
4. What is ESPnet-TTS?
• Open-source E2E-TTS toolkit
  – Apache 2.0 license / PyTorch as the main network engine
• Developed for the research community
  – Easy to reproduce state-of-the-art models
  – Can be used as a baseline to check performance
1. Support for various Text2Mel models
  – Includes autoregressive (AR), non-AR, and multi-speaker models
2. Support for various Mel2Wav models
  – Includes both AR and the latest non-AR models
3. Unified and reproducible Kaldi-style recipes
  – Support 10+ recipes including En, Jp, Zh, and more
  – Provide pretrained models for all recipes
  – Integratable with ASR functions
(Extension of ESPnet)
8. Multi-speaker extension (1)
• Extension with a pretrained speaker embedding
  – Use x-vectors [Snyder+, 2018] trained on the VoxCeleb corpus
[Figure: Multi-speaker Tacotron 2 [Jia+, 2018], built on Tacotron 2 [Shen+, 2018] — a pretrained x-vector extractor embeds the reference audio, and the embedding is added to / concatenated with the CNN+BLSTM encoder output before the attention LSTM decoder (prenet, postnet, next-output feedback)]
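The add/concat step above can be sketched in a few lines. This is a minimal NumPy illustration, not ESPnet-TTS code; the function name `integrate_spk_embedding` and the tensor shapes are assumptions made for clarity:

```python
import numpy as np

def integrate_spk_embedding(encoder_out, spk_emb, mode="concat"):
    """Integrate an utterance-level speaker embedding with encoder outputs.

    encoder_out: (T, D) encoder hidden states.
    spk_emb:     (S,)   x-vector for the reference audio.
    mode: "add" broadcasts the embedding over time and adds it
          (in practice a learned linear projection matches S to D);
          "concat" appends it to every encoder frame.
    """
    T, D = encoder_out.shape
    if mode == "add":
        assert spk_emb.shape[0] == D, "add requires matching dimensions"
        return encoder_out + spk_emb[None, :]            # (T, D)
    if mode == "concat":
        tiled = np.tile(spk_emb[None, :], (T, 1))        # (T, S)
        return np.concatenate([encoder_out, tiled], axis=1)  # (T, D + S)
    raise ValueError(f"unknown mode: {mode}")
```

Because the embedding is utterance-level, the same vector conditions every time step of the encoder output.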
9. Multi-speaker extension (2)
• Extension with a pretrained speaker embedding
  – Apply the same idea to the other models
[Figure: Multi-speaker Transformer-TTS and multi-speaker FastSpeech (※ EXPERIMENTAL) — in both, a pretrained x-vector extractor embeds the reference audio and the embedding is added to / concatenated with the Transformer encoder output (encoder prenet, positional encoding). Transformer-TTS then runs the Transformer decoder with decoder prenet, positional encoding, postnet, and next-output feedback; FastSpeech instead uses a duration predictor and length regulator to expand the encoder output into the output sequence]
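The FastSpeech length regulator mentioned above is simple enough to sketch directly: each phoneme-level hidden state is repeated by its predicted (integer) duration so the expanded sequence matches the mel-spectrogram length. A minimal NumPy illustration, not the toolkit's implementation; the function name is chosen for this example:

```python
import numpy as np

def length_regulator(encoder_out, durations):
    """FastSpeech-style length regulator.

    encoder_out: (T_in, D) phoneme-level hidden states.
    durations:   (T_in,)  integer frame counts per input token.
    Returns: (sum(durations), D) frame-level sequence.
    """
    # np.repeat with a per-row count expands each hidden state in place.
    return np.repeat(encoder_out, durations, axis=0)
```

Because expansion is a deterministic repeat, the decoder can then generate all frames in parallel, which is what makes the model non-autoregressive.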
10. Supported Mel2Wav models
[Figure: E2E-TTS pipeline — "Hello, world!" text → Text2Mel → Mel2Wav neural networks → speech]
11. Supported Mel2Wav models
[Figure: E2E-TTS pipeline with the Mel2Wav stage highlighted ("This part!"), and the three supported architectures:
 – WaveNet [Oord+, 2016] (autoregressive): the mel spectrogram is expanded by an upsampling network; a deep causal dilated CNN maps previous waveform samples to a posterior, from which the next sample is drawn. Mixture of Logistics (MoL) and Softmax outputs are supported.
 – Parallel WaveGAN [Yamamoto+, 2020] (non-autoregressive): the upsampled mel spectrogram conditions a deep dilated CNN that maps a noise sequence to the waveform sequence.
 – MelGAN [Kumar+, 2019] (non-autoregressive): an upsampling deep CNN maps the mel spectrogram directly to the waveform sequence.
 Combinations of these GAN-based models are supported.]
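For intuition about the "deep (causal) dilated CNN" blocks above: the receptive field of a stack of dilated convolutions follows simple arithmetic, which is why a few layers can cover thousands of waveform samples. A sketch; the default layer/stack counts are illustrative, not ESPnet-TTS settings:

```python
def wavenet_receptive_field(layers=10, stacks=3, kernel_size=2):
    """Receptive field (in samples) of a WaveNet-style stack of dilated
    convolutions with dilations 1, 2, 4, ..., 2**(layers-1), repeated
    `stacks` times.

    Each layer with dilation d and kernel size k widens the receptive
    field by (k - 1) * d, starting from a single sample.
    """
    dilations = [2 ** i for i in range(layers)] * stacks
    return (kernel_size - 1) * sum(dilations) + 1
```

With the illustrative defaults (3 stacks of 10 layers, kernel size 2), the receptive field is 3,070 samples, i.e. roughly 0.14 s of 22.05 kHz audio.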
12. Supported Mel2Wav models (cont.)
For Parallel WaveGAN [Yamamoto+, 2020], please check
Ryuichi's presentation at this ICASSP.
13. Other remarkable functions
• Dynamic batch size to maximize GPU utilization
  – Change the batch size dynamically according to sequence length
• Gradient accumulation
  – Pseudo-increase the batch size even with a single GPU
• Guided attention loss [Tachibana+, 2017]
  – Constrain the attention weights to be diagonal
• Attention constraint decoding [Ping+, 2017]
  – Decode stably with long input sentences
• Forward attention [Zhang+, 2018]
  – Attention mechanism with causal regularization
• CBHG [Wang+, 2017]
  – Upsample the frequency resolution
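The guided attention loss in the list above has a compact form: a soft diagonal mask penalizes attention mass far from the diagonal. A minimal NumPy sketch of the penalty from [Tachibana+, 2017]; the sigma value and function name are illustrative:

```python
import numpy as np

def guided_attention_loss(att, sigma=0.2):
    """Guided attention loss sketch [Tachibana+, 2017].

    att: (T_out, T_in) attention weight matrix (rows sum to 1).
    Builds a penalty W[n, t] = 1 - exp(-(n/T_out - t/T_in)^2 / (2*sigma^2))
    that is 0 on the normalized diagonal and grows toward 1 off it,
    then averages it weighted by the attention matrix.
    """
    T_out, T_in = att.shape
    n = np.arange(T_out)[:, None] / T_out  # normalized output positions
    t = np.arange(T_in)[None, :] / T_in    # normalized input positions
    W = 1.0 - np.exp(-((n - t) ** 2) / (2 * sigma ** 2))
    return float(np.mean(att * W))
```

A perfectly diagonal alignment incurs (near-)zero loss, while attention that wanders off the diagonal is pushed back, which speeds up alignment learning early in training.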
15. Unified, reproducible recipes
• All-in-one Kaldi-style recipes
  – Include all procedures needed to reproduce the results
  – Have a unified design for both ASR and TTS recipes
The same data format is used for ASR and TTS recipes.
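The shared data format can be illustrated with a tiny parser: Kaldi-style tables are plain text files of "utt-id content" lines, so one helper serves both `text` (transcriptions: ASR targets / TTS inputs) and `wav.scp` (audio: ASR inputs / TTS targets). A hedged sketch, not toolkit code; the example utterance IDs follow the LJSpeech naming but the contents are made up:

```python
def read_kaldi_table(lines):
    """Parse Kaldi-style table lines ("<utt-id> <content...>") into a dict.

    Works identically for `text` (utt-id -> transcription) and
    `wav.scp` (utt-id -> wav path or pipe command), which is what lets
    ASR and TTS recipes share data preparation.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, content = line.split(maxsplit=1)
        table[utt_id] = content
    return table
```

Because both tasks key everything by utterance ID, swapping inputs and targets is enough to turn a TTS data directory into an ASR one and vice versa.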
16. Unified, reproducible recipes (cont.)
ASR and TTS recipes can be converted to each other.
17. Supported recipes
• Support 10+ recipes covering 10 languages
Corpus name Lang Recipe type
Arctic En Adaptation
Blizzard 2017 En Single
CSMSC Zh Single
JNAS Jp Multi
JVS Jp Adaptation
JSUT Jp Single
LibriTTS En Multi
LJSpeech En Single
M-AILABS En, De, Fr, Es, Pl, Uk, Ru Single
TWEB En Single
VAIS1000 Vi Single
We provide pretrained models for all recipes.
18. Integration with ASR
• ASR-based evaluation for TTS
  – Automatically check for deleted or repeated words
• Advanced recipes combining TTS with ASR
  – ASR-TTS cycle-consistency training [Karthick+, 2019]
  – Semi-supervised ASR-TTS training [Karita+, 2019]
  – Non-parallel voice conversion
    · Cascaded ASR + TTS system
    · VCC2020 baseline system (http://www.vc-challenge.org/)
We can combine TTS with ASR
for development and new research ideas.
※Not merged yet
20. Experimental conditions
• Evaluation on the LJSpeech dataset
  – #training 12,600 / #validation 250 / #evaluation 250
• Compared methods (input type, [attention type])
  – Tacotron 2 (Char, Location)
  – Tacotron 2 (Char, Forward)
  – Transformer (Char)
  – Transformer (Phoneme)
  – FastSpeech (Char)※1
  – FastSpeech (Phoneme)※1
• Comparison with other toolkits
  – CSTR/Merlin: Conventional TTS + WORLD [Morise+, 2016]
  – NVIDIA/tacotron2: Pretrained※2 Tacotron 2 + WaveGlow
  – Mozilla/TTS: Pretrained※2 Tacotron 2 + WaveRNN
※1 We did not use knowledge distillation.
※2 Data split is different; the evaluation samples might be included in the training data.
The same MoL-WaveNet trained with natural features is used.
21. Objective evaluation (CER)
• Character error rate (CER)
  – ASR model: Transformer trained on LibriSpeech
Method Sub [%] Del [%] Ins [%] CER [%]
Tacotron 2 (Char, Forward) 0.4 1.0 3.6※ 5.0
Tacotron 2 (Char, Location) 0.5 1.2 0.3 2.1
Transformer (Char) 0.6 1.7 0.5 2.8
Transformer (Phoneme) 0.5 1.8 0.5 2.8
FastSpeech (Char) 0.3 0.9 0.3 1.6
FastSpeech (Phoneme) 0.4 1.3 0.4 2.1
Groundtruth (Raw) 0.3 0.7 0.3 1.3
※Only one sample failed to stop the generation
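The CER metric in the table above is derived from the Levenshtein edit distance between reference and ASR-recognized text: (substitutions + deletions + insertions) / reference length. A plain-Python sketch of the metric itself (the actual evaluation feeds ASR hypotheses into it):

```python
def char_error_rate(ref, hyp):
    """Character error rate via Levenshtein distance."""
    r, h = list(ref), list(hyp)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)
```

Deletions flag dropped words, insertions flag repetitions or non-stopping generation (the failure marked ※ above), which is why the per-error-type breakdown is informative for TTS robustness.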
22. Objective evaluation (CER) (cont.)
Tacotron 2 is more robust than Transformer-TTS.
23. Objective evaluation (CER) (cont.)
FastSpeech is the most robust,
thanks to its non-AR architecture.
24. Objective evaluation (RTF)
• Real-time factor (RTF) of Char-based models
  – Measured for the Text2Mel part only
  – GPU: Titan V / CPU: Xeon Gold 6154 3 GHz × 16 threads
Method RTF on CPU RTF on GPU
Tacotron 2 (Forward) 0.216 ± 0.016 0.104 ± 0.006
Tacotron 2 (Location) 0.225 ± 0.016 0.094 ± 0.009
Transformer 0.851 ± 0.076 0.634 ± 0.025
FastSpeech 0.015 ± 0.005 0.003 ± 0.004
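RTF itself is simple arithmetic: wall-clock synthesis time divided by the duration of the generated audio, so RTF < 1 means faster than real time. A sketch; the timing harness and function names are illustrative, not the benchmark script used for the table above:

```python
import time

def real_time_factor(synth_fn, text, audio_duration_sec):
    """Measure the real-time factor of a synthesis callable.

    synth_fn:           callable performing synthesis for `text`.
    audio_duration_sec: duration of the audio it generates, in seconds.
    Returns elapsed_time / audio_duration (RTF < 1 => faster than real time).
    """
    start = time.perf_counter()
    synth_fn(text)
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_sec
```

Under this definition, FastSpeech's 0.003 GPU RTF in the table means one second of mel-spectrogram frames is produced in about three milliseconds.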
25. Objective evaluation (RTF) (cont.)
Tacotron 2 is faster than Transformer-TTS.
26. Objective evaluation (RTF) (cont.)
FastSpeech is much faster than real time,
thanks to its non-AR architecture.
27. Objective evaluation (RTF) (cont.)
• (For reference) RTF of non-AR Mel2Wav models
Method RTF on CPU RTF on GPU
Parallel WaveGAN 0.734 0.016
MelGAN 0.137 0.002
28. Subjective evaluation (MOS)
• Mean opinion score (MOS) on naturalness
  – #subjects = 101 via Amazon Mechanical Turk
Method MOS (± 95% CI)
Tacotron 2 (Char, Forward) 4.14 ± 0.06
Tacotron 2 (Char, Location) 4.20 ± 0.06
Transformer (Char) 4.17 ± 0.06
Transformer (Phoneme) 4.25 ± 0.06
CSTR/Merlin 2.69 ± 0.09
NVIDIA/tacotron2※ 4.21 ± 0.06
Mozilla/TTS※ 3.91 ± 0.07
Groundtruth (Raw) 4.46 ± 0.05
Please check the samples
via the QR code!
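The MOS ± 95% CI figures in the table above follow the usual normal approximation: mean ± 1.96 × standard error. A small illustrative helper, not the evaluation script:

```python
import math

def mos_with_ci(scores):
    """Mean opinion score with a 95% confidence interval.

    scores: list of per-rating opinion scores (e.g. 1-5).
    Returns (mean, half_width) where the interval is mean +/- half_width,
    using the normal approximation 1.96 * sqrt(sample_var / n).
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(var / n)
    return mean, half_width
```

With large rating counts the intervals shrink, which is why systems separated by less than the ±0.06 margins above should be read as statistically comparable.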
29. Subjective evaluation (MOS) (cont.)
Tacotron 2 and Transformer-TTS achieve
almost the same performance.
30. Subjective evaluation (MOS) (cont.)
Our best model achieves performance
comparable to the state of the art.
※ The evaluation samples might be included in training data.
32. Demonstration
• Demo notebooks on Google Colab
  1. E2E-TTS real-time demonstration
     https://bit.ly/2Vex0Iw
     You can generate your favorite sentences in En, Jp, Zh!
  2. E2E-TTS recipe tutorial
     https://bit.ly/3bhv0ow
     You can learn the TTS recipe flow online!
33. Closing
• Conclusion
  – Introduced the open-source toolkit ESPnet-TTS
    · Developed for the research community
    · Makes E2E-TTS more user-friendly
    · Accelerates research in this field
  – Provides various Text2Mel and Mel2Wav models
  – Provides reproducible recipes covering various languages
  – Achieves performance comparable to the state of the art
We always welcome
your feature requests and pull requests!