Real-time neural text-to-speech
with sequence-to-sequence acoustic model
and WaveGlow or single Gaussian WaveRNN vocoders
Takuma Okamoto1, Tomoki Toda2,1, Yoshinori Shiga1 and Hisashi Kawai1
1National Institute of Information and Communications Technology (NICT), Japan
2Nagoya University, Japan
1
Introduction!
Problems and purpose!
Sequence-to-sequence acoustic model with full-context label input!
Real-time neural vocoders!
WaveGlow vocoder
Proposed single Gaussian WaveRNN vocoder
Experiments!
Alternative sequence-to-sequence acoustic model (NOT included in the proceedings)!
Conclusions
Outline
2
High-fidelity text-to-speech (TTS) systems!
WaveNet outperformed conventional TTS systems in 2016 -> End-to-end neural TTS
Tacotron 2 (+ WaveNet vocoder) J. Shen et al., ICASSP 2018
Text (English) -> [Tacotron 2] -> mel-spectrogram -> [WaveNet vocoder] -> speech waveform
Jointly optimizing text analysis, duration and acoustic models with a single neural network
No text analysis, no phoneme alignment, and no fundamental frequency analysis
Problem
NOT directly applicable to pitch accent languages
Tacotron for pitch accent language (Japanese) Y. Yasuda et al., ICASSP 2019
Phoneme and accentual type sequence input (instead of character sequence)
Conventional pipeline model with full-context label input > sequence-to-sequence acoustic model
Introduction
Realizing high-fidelity synthesis comparable to human speech!!
3
Problems in real-time neural TTS systems!
Results of sequence-to-sequence acoustic model for pitch accent language
Full-context label input > phoneme and accentual type sequence
Many investigations for end-to-end TTS
Introducing Autoregressive (AR) WaveNet vocoder -> CANNOT realize real-time synthesis
Parallel WaveNet with linguistic feature input
High-quality real-time TTS but complicated teacher-student training with additional loss functions required
Purpose: Developing real-time neural TTS for pitch accent languages !
Sequence-to-sequence acoustic model with full-context label input based on Tacotron structure
Jointly optimizing phoneme duration and acoustic models
Real-time neural vocoders without complicated teacher-student training
WaveGlow vocoder
Proposed single Gaussian WaveRNN vocoder
Problems and purpose
4
Sequence-to-sequence acoustic model with full-context label input based on Tacotron
structure!
Input: full-context label vector (phoneme-level sequence)
Removing past and future 2-phoneme contexts, which the bidirectional LSTM structure can capture (478 dims -> 130 dims)
1 x 1 convolution layer instead of embedding layer
Sequence-to-sequence acoustic model
[Figure: Tacotron 2-based architecture. Input text -> text analyzer -> full-context label vector -> 1 x 1 conv (replacing the embedding layer) -> 3 conv layers -> bidirectional LSTM -> location-sensitive attention -> 2 LSTM layers with 2-layer pre-net -> linear projections for mel-spectrogram and stop token -> 5-conv-layer post-net -> neural vocoder -> speech waveform. Replaced components: full-context label input and 1 x 1 conv layer.]
5
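A 1 x 1 convolution is equivalent to applying one linear projection independently at every timestep, which is why it can replace the embedding lookup once the input is a continuous 130-dim full-context label vector rather than a discrete symbol index. A minimal NumPy sketch (the 512-dim output width is an illustrative assumption, not taken from the slides):

```python
import numpy as np

def conv1x1(labels, weight, bias):
    # labels: (T, 130) phoneme-level full-context label vectors
    # weight: (512, 130), bias: (512,)
    # Each output frame depends only on the same input frame,
    # exactly like an embedding layer but for continuous inputs.
    return labels @ weight.T + bias  # (T, 512)

rng = np.random.default_rng(0)
T = 7
x = rng.standard_normal((T, 130))
W = rng.standard_normal((512, 130))
b = rng.standard_normal(512)
y = conv1x1(x, W, b)
assert y.shape == (T, 512)
# Per-timestep equivalence: frame 3 is just W @ x[3] + b.
assert np.allclose(y[3], W @ x[3] + b)
```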
Generative flow-based model!
Image generative model: Glow + raw audio generative model: WaveNet
Training stage: speech waveform + acoustic feature -> white noise
Synthesis stage: white noise + acoustic feature -> speech waveform
Investigated WaveGlow vocoder!
Acoustic feature: mel-spectrogram (80 dims)
Training time
About 1 month using 4 GPUs (NVIDIA V100)
Inference time as real-time factor (RTF)
0.1: using a GPU (NVIDIA V100)
4.0: using CPUs (Intel Xeon Gold 6148)
WaveGlow
R. Prenger et al., ICASSP 2019
Directly training real-time parallel generative model
without teacher-student training
[Figure: WaveGlow structure. Ground-truth waveform x is squeezed to vectors and passed through 12 flow steps, each an invertible 1 x 1 convolution W_k followed by an affine coupling layer, in which a WaveNet f_i conditioned on the upsampled acoustic feature h predicts log s_j and t_j for the affine transform of x_b given x_a, mapping x to white noise z.]
6
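The affine coupling layer is invertible regardless of the conditioning network, because the half x_a that parameterizes the scale and shift passes through unchanged. A toy NumPy sketch with an arbitrary stand-in for the WaveNet (the real model stacks 12 such flows, alternating with invertible 1 x 1 convolutions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for WaveGlow's conditioning WaveNet: any function of
# (x_a, h) works, since invertibility never requires inverting it.
def wavenet_like(x_a, h):
    log_s = np.tanh(0.5 * x_a + h)  # scale, kept small for stability
    t = np.sin(x_a) + h             # shift
    return log_s, t

def coupling_forward(x, h):
    x_a, x_b = np.split(x, 2)
    log_s, t = wavenet_like(x_a, h)
    return np.concatenate([x_a, np.exp(log_s) * x_b + t])

def coupling_inverse(z, h):
    x_a, z_b = np.split(z, 2)
    log_s, t = wavenet_like(x_a, h)   # recomputed from the unchanged half
    return np.concatenate([x_a, (z_b - t) * np.exp(-log_s)])

x = rng.standard_normal(8)
h = rng.standard_normal(4)
z = coupling_forward(x, h)
x_rec = coupling_inverse(z, h)
assert np.allclose(x, x_rec)  # exact inverse, no teacher-student training
```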
WaveRNN!
Sparse WaveRNN
Real-time inference with a mobile CPU
Dual-softmax
16-bit linear PCM is split into coarse and fine 8-bit parts
-> two samplings are required to synthesize one audio sample
Proposed single Gaussian WaveRNN!
Predicting mean and standard deviation of next sample
Continuous values can be predicted
Initially proposed in ClariNet (W. Ping et al., ICLR 2019)
Applied to FFTNet (T. Okamoto et al., ICASSP 2019)
Only one sampling is sufficient to synthesize one audio sample
WaveRNN vocoders for CPU inference
[Figure: (a) WaveRNN with dual-softmax: the acoustic feature h (37 or 80 dims) passes through an upsampling layer and, concatenated with the past coarse c_{t-1}, past fine f_{t-1} and current coarse c_t 8-bit samples, feeds a masked GRU (1024 units) whose split output layers O_1-O_4 (512 -> 256 -> 256) produce softmaxes for c_t and f_t. (b) Proposed SG-WaveRNN: the upsampled acoustic feature h (37 or 80 dims) and the past waveform sample x_{t-1} feed a GRU (1024 units) whose output layers O_h, O_x, O_1, O_2 (1024 -> 256 -> 2) predict mu_t and log sigma_t of the next sample.]
Early investigation for real-time synthesis using a CPU
N. Kalchbrenner et al., ICML 2018
7
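The two sampling schemes can be contrasted in a few lines of NumPy (illustrative sketch; the network outputs are stubbed with constants):

```python
import numpy as np

# Dual-softmax WaveRNN: a 16-bit sample is split into coarse and fine
# 8-bit halves, each drawn from its own softmax -> two samplings per
# audio sample.
def split_16bit(sample):            # sample in [0, 65535]
    return sample >> 8, sample & 0xFF   # (coarse, fine)

def combine_8bit(coarse, fine):
    return (coarse << 8) | fine

# Proposed SG-WaveRNN: the network outputs (mu, log_sigma) of a single
# Gaussian over the continuous next sample -> one sampling suffices.
def sg_sample(mu, log_sigma, rng):
    return mu + np.exp(log_sigma) * rng.standard_normal()

s = 43210
c, f = split_16bit(s)
assert 0 <= c < 256 and 0 <= f < 256
assert combine_8bit(c, f) == s      # lossless round trip

rng = np.random.default_rng(2)
x_next = sg_sample(0.1, np.log(0.05), rng)  # one continuous-valued draw
assert np.isfinite(x_next)
```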
Noise shaping method considering auditory perception!
Improving synthesis quality by reducing spectral distortion due to prediction error
Implemented by MLSA filter with averaged mel-cepstra
Effective for categorical and single Gaussian WaveNet and FFTNet vocoders
T. Okamoto et al., SLT 2018, ICASSP 2019
Noise shaping for neural vocoders
K. Tachibana et al., ICASSP 2018
[Figure: (a) Training stage: acoustic features are extracted from the speech corpus; a time-invariant noise shaping filter is calculated from averaged mel-cepstra; the speech signal is filtered by the time-invariant noise weighting filter into a residual signal, quantized, and used to train the WaveNet / FFTNet. (b) Synthesis stage: the WaveNet / FFTNet generates the residual signal from acoustic features, which is dequantized and inverse-filtered into the reconstructed speech signal.]
Investigating impact for WaveGlow and WaveRNN vocoders 8
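The train/synthesis pipeline can be sketched with a first-order filter standing in for the MLSA filter built from averaged mel-cepstra (an assumption made for brevity; the quantization step between the stages, whose noise the inverse filter spectrally shapes toward perceptually masked regions, is omitted, so reconstruction here is exact):

```python
import numpy as np

A = 0.9  # illustrative first-order coefficient, NOT the paper's MLSA filter

def noise_weighting(x):
    # Training stage: FIR weighting filter r[n] = x[n] - A * x[n-1]
    r = np.copy(x)
    r[1:] -= A * x[:-1]
    return r

def inverse_filtering(r):
    # Synthesis stage: IIR inverse x[n] = r[n] + A * x[n-1]
    x = np.zeros_like(r)
    x[0] = r[0]
    for n in range(1, len(r)):
        x[n] = r[n] + A * x[n - 1]
    return x

rng = np.random.default_rng(3)
speech = rng.standard_normal(200)
residual = noise_weighting(speech)           # training target for the vocoder
reconstructed = inverse_filtering(residual)  # applied to vocoder output
assert np.allclose(reconstructed, speech)
```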
Speech corpus!
Japanese female corpus: about 22 h (test set: 20 utterances)
Sampling frequency: 24 kHz
Sequence-to-sequence acoustic model (following Tacotron 2’s settings)!
Input: full-context label vector (130 dim)
Neural vocoders (w/wo noise shaping)!
Single Gaussian AR WaveNet
Vanilla WaveRNN with dual softmax
Proposed single Gaussian WaveRNN
WaveGlow
Acoustic features!
Simple acoustic features (SAF): fundamental frequency + mel-cepstra (37 dims)
Mel-spectrograms (MELSPC): 80 dims
Experimental conditions
9
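For the MELSPC setting, an 80-band mel filterbank at 24 kHz can be sketched in NumPy as follows (n_fft = 1024 is an illustrative assumption; the actual analysis parameters are not stated on this slide):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=24000, n_fft=1024, n_mels=80):
    # Triangular filters evenly spaced on the mel scale, spanning 0..sr/2.
    hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i : i + 3]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

fb = mel_filterbank()
frame = np.random.default_rng(4).standard_normal(1024)
power = np.abs(np.fft.rfft(frame)) ** 2
melspc = np.log(fb @ power + 1e-10)  # one 80-dim MELSPC frame
assert fb.shape == (80, 513)
assert melspc.shape == (80,)
```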
Subjective evaluation!
Listening subjects: 15 Japanese native speakers
18 conditions x 20 utterances = 360 stimuli per subject
Results!
Vanilla and single Gaussian WaveRNNs require noise shaping
Noise shaping is NOT effective for WaveGlow
Neural TTS systems with a sequence-to-sequence acoustic model and neural vocoders can realize higher-quality
synthesis than the STRAIGHT vocoder under the analysis-synthesis condition
MOS results and demo
[Figure: 5-point MOS results for Original, STRAIGHT, AR SG-WaveNet, WaveRNN, SG-WaveRNN and WaveGlow under the SAF, SAF (NS), MELSPC, MELSPC (NS), TTS and TTS (NS) conditions.]
10
Evaluation condition!
Using a GPU (NVIDIA V100)
Simple PyTorch implementation
Results!
Sequence-to-sequence acoustic model + WaveGlow can realize real-time neural TTS with an RTF of 0.16
Single Gaussian WaveRNN can synthesize about twice as fast as vanilla WaveRNN
Results of real-time factor (RTF)
Real-time high-fidelity neural TTS for Japanese can be realized 11
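The real-time factor used throughout is simply synthesis time divided by the duration of the audio produced; RTF < 1 means faster than real time:

```python
# RTF = synthesis time / audio duration. The reported 0.16 means that
# 1 s of 24 kHz speech is generated in 0.16 s on the GPU.
def real_time_factor(synthesis_seconds, n_samples, sample_rate=24000):
    return synthesis_seconds / (n_samples / sample_rate)

# 1.6 s to synthesize 10 s of 24 kHz audio -> RTF 0.16
assert abs(real_time_factor(1.6, 240000) - 0.16) < 1e-9
```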
Real-time neural TTS with sequence-to-sequence acoustic model and WaveGlow or
single Gaussian WaveRNN vocoders!
Sequence-to-sequence acoustic model with full-context label input
WaveGlow and proposed single Gaussian WaveRNN vocoders
Realizing real-time high-fidelity neural TTS using a sequence-to-sequence acoustic model and WaveGlow vocoder with a
real-time factor of 0.16
Future work!
Implementing real-time inference with a CPU (such as sparse WaveRNN and LPCNet)
Comparing sequence-to-sequence acoustic model with conventional pipeline TTS models
T. Okamoto, T. Toda, Y. Shiga and H. Kawai, “Tacotron-based acoustic model using phoneme alignment for practical
neural text-to-speech systems,” IEEE ASRU 2019@Singapore, Dec. 2019 (to appear)
Conclusions
12

More Related Content

What's hot

Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementNAVER Engineering
 
Introduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detectionIntroduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detectionNAVER Engineering
 
Dereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domainsDereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domainsTakuya Yoshioka
 
150807 Fast R-CNN
150807 Fast R-CNN150807 Fast R-CNN
150807 Fast R-CNNJunho Cho
 
Voice Activity Detection using Single Frequency Filtering
Voice Activity Detection using Single Frequency FilteringVoice Activity Detection using Single Frequency Filtering
Voice Activity Detection using Single Frequency FilteringTejus Adiga M
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...NU_I_TODALAB
 
Digital signal processing through speech, hearing, and Python
Digital signal processing through speech, hearing, and PythonDigital signal processing through speech, hearing, and Python
Digital signal processing through speech, hearing, and PythonMel Chua
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionNU_I_TODALAB
 
Object Detection Methods using Deep Learning
Object Detection Methods using Deep LearningObject Detection Methods using Deep Learning
Object Detection Methods using Deep LearningSungjoon Choi
 
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...a3labdsp
 
Signal to-noise-ratio of signal acquisition in global navigation satellite sy...
Signal to-noise-ratio of signal acquisition in global navigation satellite sy...Signal to-noise-ratio of signal acquisition in global navigation satellite sy...
Signal to-noise-ratio of signal acquisition in global navigation satellite sy...Alexander Decker
 
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...Tomoki Hayashi
 
Sequence Learning with CTC technique
Sequence Learning with CTC techniqueSequence Learning with CTC technique
Sequence Learning with CTC techniqueChun Hao Wang
 
Speaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderSpeaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderAkira Tamamori
 
Digital Watermarking Applications and Techniques: A Brief Review
Digital Watermarking Applications and Techniques: A Brief ReviewDigital Watermarking Applications and Techniques: A Brief Review
Digital Watermarking Applications and Techniques: A Brief ReviewEditor IJCATR
 
Fcv rep darrell
Fcv rep darrellFcv rep darrell
Fcv rep darrellzukun
 
Auro tripathy - Localizing with CNNs
Auro tripathy -  Localizing with CNNsAuro tripathy -  Localizing with CNNs
Auro tripathy - Localizing with CNNsAuro Tripathy
 
ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]Dongmin Choi
 
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation..."Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...Edge AI and Vision Alliance
 

What's hot (20)

Deep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech EnhancementDeep Learning Based Voice Activity Detection and Speech Enhancement
Deep Learning Based Voice Activity Detection and Speech Enhancement
 
Introduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detectionIntroduction to deep learning based voice activity detection
Introduction to deep learning based voice activity detection
 
Dereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domainsDereverberation in the stft and log mel frequency feature domains
Dereverberation in the stft and log mel frequency feature domains
 
150807 Fast R-CNN
150807 Fast R-CNN150807 Fast R-CNN
150807 Fast R-CNN
 
Voice Activity Detection using Single Frequency Filtering
Voice Activity Detection using Single Frequency FilteringVoice Activity Detection using Single Frequency Filtering
Voice Activity Detection using Single Frequency Filtering
 
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
Missing Component Restoration for Masked Speech Signals based on Time-Domain ...
 
Digital signal processing through speech, hearing, and Python
Digital signal processing through speech, hearing, and PythonDigital signal processing through speech, hearing, and Python
Digital signal processing through speech, hearing, and Python
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-Attention
 
Object Detection Methods using Deep Learning
Object Detection Methods using Deep LearningObject Detection Methods using Deep Learning
Object Detection Methods using Deep Learning
 
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
 
Signal to-noise-ratio of signal acquisition in global navigation satellite sy...
Signal to-noise-ratio of signal acquisition in global navigation satellite sy...Signal to-noise-ratio of signal acquisition in global navigation satellite sy...
Signal to-noise-ratio of signal acquisition in global navigation satellite sy...
 
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Te...
 
Sequence Learning with CTC technique
Sequence Learning with CTC techniqueSequence Learning with CTC technique
Sequence Learning with CTC technique
 
Speaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet VocoderSpeaker Dependent WaveNet Vocoder
Speaker Dependent WaveNet Vocoder
 
Digital Watermarking Applications and Techniques: A Brief Review
Digital Watermarking Applications and Techniques: A Brief ReviewDigital Watermarking Applications and Techniques: A Brief Review
Digital Watermarking Applications and Techniques: A Brief Review
 
Fcv rep darrell
Fcv rep darrellFcv rep darrell
Fcv rep darrell
 
Auro tripathy - Localizing with CNNs
Auro tripathy -  Localizing with CNNsAuro tripathy -  Localizing with CNNs
Auro tripathy - Localizing with CNNs
 
Pycon apac 2014
Pycon apac 2014Pycon apac 2014
Pycon apac 2014
 
ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]ViT (Vision Transformer) Review [CDM]
ViT (Vision Transformer) Review [CDM]
 
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation..."Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
"Semantic Segmentation for Scene Understanding: Algorithms and Implementation...
 

Similar to Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders

Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...Lviv Startup Club
 
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...NUGU developers
 
Taras Sereda "Waveglow. Generative modeling for audio synthesis"
Taras Sereda "Waveglow. Generative modeling for audio synthesis"Taras Sereda "Waveglow. Generative modeling for audio synthesis"
Taras Sereda "Waveglow. Generative modeling for audio synthesis"Fwdays
 
129966863283913778[1]
129966863283913778[1]129966863283913778[1]
129966863283913778[1]威華 王
 
Toward wave net speech synthesis
Toward wave net speech synthesisToward wave net speech synthesis
Toward wave net speech synthesisNAVER Engineering
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)IRJET Journal
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition finalArchit Vora
 
CHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐ
CHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐCHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐ
CHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐlykhnh386525
 
The method of comparing two audio files
The method of comparing two audio filesThe method of comparing two audio files
The method of comparing two audio filesMinh Anh Nguyen
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Compression presentation 415 (1)
Compression presentation 415 (1)Compression presentation 415 (1)
Compression presentation 415 (1)Godo Dodo
 
Analysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition TechniquesAnalysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition Techniquesidescitation
 
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...IRJET Journal
 
The method of comparing two audio files
The method of comparing two audio filesThe method of comparing two audio files
The method of comparing two audio filesMinh Anh Nguyen
 
Sampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptxSampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptxHamzaJaved306957
 

Similar to Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders (20)

Final presentation
Final presentationFinal presentation
Final presentation
 
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
Grant Reaber “Wavenet and Wavenet 2: Generating high-quality audio with neura...
 
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
[NUGU CONFERENCE 2019] 트랙 A-4 : Zero-shot learning for Personalized Text-to-S...
 
Taras Sereda "Waveglow. Generative modeling for audio synthesis"
Taras Sereda "Waveglow. Generative modeling for audio synthesis"Taras Sereda "Waveglow. Generative modeling for audio synthesis"
Taras Sereda "Waveglow. Generative modeling for audio synthesis"
 
129966863283913778[1]
129966863283913778[1]129966863283913778[1]
129966863283913778[1]
 
Toward wave net speech synthesis
Toward wave net speech synthesisToward wave net speech synthesis
Toward wave net speech synthesis
 
A1mpeg12 2004
A1mpeg12 2004A1mpeg12 2004
A1mpeg12 2004
 
add9.5.ppt
add9.5.pptadd9.5.ppt
add9.5.ppt
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)
 
Speech Signal Processing
Speech Signal ProcessingSpeech Signal Processing
Speech Signal Processing
 
Speech recognition final
Speech recognition finalSpeech recognition final
Speech recognition final
 
CHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐ
CHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐCHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐ
CHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐ
 
H0814247
H0814247H0814247
H0814247
 
The method of comparing two audio files
The method of comparing two audio filesThe method of comparing two audio files
The method of comparing two audio files
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Compression presentation 415 (1)
Compression presentation 415 (1)Compression presentation 415 (1)
Compression presentation 415 (1)
 
Analysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition TechniquesAnalysis of PEAQ Model using Wavelet Decomposition Techniques
Analysis of PEAQ Model using Wavelet Decomposition Techniques
 
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
 
The method of comparing two audio files
The method of comparing two audio filesThe method of comparing two audio files
The method of comparing two audio files
 
Sampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptxSampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptx
 

Recently uploaded

UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadaditya806802
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptNarmatha D
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate productionChinnuNinan
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasad
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
 
Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.ppt
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 

Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders

  • 1. Title: Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders
    Takuma Okamoto1, Tomoki Toda2,1, Yoshinori Shiga1 and Hisashi Kawai1
    1National Institute of Information and Communications Technology (NICT), Japan
    2Nagoya University, Japan
  • 2. Outline
    - Introduction
    - Problems and purpose
    - Sequence-to-sequence acoustic model with full-context label input
    - Real-time neural vocoders
      - WaveGlow vocoder
      - Proposed single Gaussian WaveRNN vocoder
    - Experiments
    - Alternative sequence-to-sequence acoustic model (not included in the proceedings)
    - Conclusions
  • 3. Introduction
    - High-fidelity text-to-speech (TTS) systems
      - WaveNet outperformed conventional TTS systems in 2016 -> end-to-end neural TTS
    - Tacotron 2 (+ WaveNet vocoder) [J. Shen et al., ICASSP 2018]
      - Text (English) -> [Tacotron 2] -> mel-spectrogram -> [WaveNet vocoder] -> speech waveform
      - Jointly optimizes text analysis, duration, and acoustic models with a single neural network
      - No text analysis, no phoneme alignment, and no fundamental frequency analysis
    - Problem: NOT directly applicable to pitch accent languages
    - Tacotron for a pitch accent language (Japanese) [Y. Yasuda et al., ICASSP 2019]
      - Phoneme and accentual-type sequence input (instead of a character sequence)
      - Conventional pipeline model with full-context label input > sequence-to-sequence acoustic model
    - Goal: realizing high-fidelity synthesis comparable to human speech
  • 4. Problems and purpose
    - Problems in real-time neural TTS systems
      - Results of sequence-to-sequence acoustic models for pitch accent languages: full-context label input > phoneme and accentual-type sequence, despite many investigations of end-to-end TTS
      - Introducing an autoregressive (AR) WaveNet vocoder CANNOT realize real-time synthesis
      - Parallel WaveNet with linguistic feature input: high-quality real-time TTS, but complicated teacher-student training with additional loss functions is required
    - Purpose: developing real-time neural TTS for pitch accent languages
      - Sequence-to-sequence acoustic model with full-context label input based on the Tacotron structure, jointly optimizing phoneme duration and acoustic models
      - Real-time neural vocoders without complicated teacher-student training: WaveGlow vocoder and proposed single Gaussian WaveRNN vocoder
  • 5. Sequence-to-sequence acoustic model
    - Sequence-to-sequence acoustic model with full-context label input based on the Tacotron structure
      - Input: full-context label vector (phoneme-level sequence)
      - Reducing past and future 2 contexts based on the bidirectional LSTM structure (478 dims -> 130 dims)
      - 1 x 1 convolution layer instead of the embedding layer
    [Figure: Tacotron 2-style architecture: input text -> text analyzer -> full-context label vector -> 1 x 1 conv (replaced component) -> 3 conv layers -> bidirectional LSTM -> location-sensitive attention -> 2 LSTM layers with 2-layer pre-net -> linear projections (mel-spectrogram and stop token) -> 5-conv-layer post-net -> neural vocoder -> speech waveform]
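A 1 x 1 convolution over a phoneme-level sequence is simply a shared linear map applied independently at each time step, which is why it can replace the embedding lookup once the input is a continuous-valued full-context label vector rather than a discrete symbol ID. A minimal pure-Python sketch, with toy dimensions standing in for the 130-dim label vectors:

```python
import random

def conv1x1(seq, weight, bias):
    """Apply a shared linear projection (a 1 x 1 convolution) to each
    time step of a sequence of feature vectors."""
    out = []
    for frame in seq:  # frame: list of in_dim floats
        out.append([
            sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weight, bias)
        ])
    return out

# Toy setting: 4 phoneme-level label vectors of 3 dims -> 2-dim encodings.
random.seed(0)
in_dim, out_dim = 3, 2
weight = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
bias = [0.0] * out_dim
labels = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.2, 0.2, 0.2], [0.9, 0.1, 0.0]]
encoded = conv1x1(labels, weight, bias)
assert len(encoded) == 4 and len(encoded[0]) == 2
```

Unlike an embedding table, this projection is defined for any real-valued input vector, so the rest of the Tacotron encoder is unchanged.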
  • 6. WaveGlow [R. Prenger et al., ICASSP 2019]
    - Generative flow-based model
      - Image generative model Glow + raw-audio generative model WaveNet
      - Directly trains a real-time parallel generative model without teacher-student training
      - Training stage: speech waveform + acoustic feature -> white noise
      - Synthesis stage: white noise + acoustic feature -> speech waveform
    - Investigated WaveGlow vocoder
      - Acoustic feature: mel-spectrogram (80 dims)
      - Training time: about 1 month using 4 GPUs (NVIDIA V100)
      - Inference time as real-time factor (RTF): 0.1 using a GPU (NVIDIA V100); 4.0 using CPUs (Intel Xeon Gold 6148)
    [Figure: WaveGlow flow: the ground-truth waveform x is squeezed to vectors and passed through 12 flow steps, each with an invertible 1 x 1 convolution (W_k) and an affine coupling layer whose scales and shifts (log s_j, t_j) are predicted by a WaveNet conditioned on the upsampled acoustic feature h]
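The affine coupling layer in the figure is exactly invertible, which is what lets WaveGlow train on waveform-to-noise and synthesize noise-to-waveform with the same network: the forward pass computes x'_b = x_b * s + t, with (s, t) predicted from the untouched half x_a, and the inverse recovers x_b as (x'_b - t) / s. A toy sketch, where `predict_st` is a hypothetical stand-in for the conditioning WaveNet:

```python
def coupling_forward(x_a, x_b, predict_st):
    s, t = predict_st(x_a)                 # scale and shift predicted from x_a
    x_b_new = [xb * sj + tj for xb, sj, tj in zip(x_b, s, t)]
    return x_a, x_b_new

def coupling_inverse(x_a, x_b_new, predict_st):
    s, t = predict_st(x_a)                 # same prediction: x_a is unchanged
    x_b = [(xb - tj) / sj for xb, sj, tj in zip(x_b_new, s, t)]
    return x_a, x_b

def predict_st(x_a):
    """Hypothetical stand-in for the WaveNet that predicts (s, t) from x_a."""
    s = [1.0 + abs(a) for a in x_a]        # strictly positive scales
    t = [0.5 * a for a in x_a]
    return s, t

x_a, x_b = [0.3, -0.7], [1.2, 0.4]
_, y_b = coupling_forward(x_a, x_b, predict_st)
_, x_b_rec = coupling_inverse(x_a, y_b, predict_st)
assert all(abs(u - v) < 1e-12 for u, v in zip(x_b, x_b_rec))
```

Because the inverse never needs to invert the WaveNet itself (it only re-runs it on x_a), synthesis of all samples in a flow step is fully parallel.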
  • 7. WaveRNN vocoders for CPU inference
    - WaveRNN [N. Kalchbrenner et al., ICML 2018]: early investigation of real-time synthesis using a CPU
      - Sparse WaveRNN: real-time inference with a mobile CPU
      - Dual softmax: 16-bit linear PCM is split into coarse and fine 8 bits -> two samplings are required to synthesize one audio sample
    - Proposed single Gaussian WaveRNN (SG-WaveRNN)
      - Predicts the mean and standard deviation of the next sample, so continuous values can be predicted
      - Initially proposed in ClariNet (W. Ping et al., ICLR 2019); applied to FFTNet (T. Okamoto et al., ICASSP 2019)
      - Only one sampling is sufficient to synthesize one audio sample
    [Figure: (a) WaveRNN with dual softmax: the upsampled acoustic feature h (37 or 80 dims) and the past coarse/fine 8-bit samples c_{t-1}, f_{t-1} (plus the current coarse c_t) feed a masked GRU with softmax outputs for c_t and f_t; (b) proposed SG-WaveRNN: h and the past waveform sample x_{t-1} feed a GRU (1024 units) predicting mu_t and log sigma_t]
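The contrast between the two output layers can be sketched directly: the dual-softmax WaveRNN splits each 16-bit sample into coarse and fine 8-bit halves and draws two categorical samples per audio sample, while SG-WaveRNN draws one continuous sample from a predicted Gaussian. In the sketch below, mu and log sigma are toy values standing in for the GRU outputs:

```python
import math
import random

def split_16bit(x):
    """Split an unsigned 16-bit sample into coarse and fine 8-bit parts
    (the two symbols the dual-softmax WaveRNN must sample in turn)."""
    assert 0 <= x < 2 ** 16
    return x >> 8, x & 0xFF

def join_16bit(c, f):
    return (c << 8) | f

def sample_single_gaussian(mu, log_sigma, rng):
    """SG-WaveRNN output: one draw x_t ~ N(mu, sigma^2) per audio sample."""
    return mu + math.exp(log_sigma) * rng.gauss(0.0, 1.0)

# Dual-softmax view: two 8-bit symbols per sample.
c, f = split_16bit(43981)
assert join_16bit(c, f) == 43981

# Single Gaussian view: one continuous draw per sample.
rng = random.Random(0)
x_t = sample_single_gaussian(0.1, math.log(0.05), rng)
```

Halving the number of samplings per output sample is the source of the roughly 2x CPU/GPU inference speedup the deck reports for SG-WaveRNN over vanilla WaveRNN.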
  • 8. Noise shaping for neural vocoders [K. Tachibana et al., ICASSP 2018]
    - Noise shaping method considering auditory perception
      - Improves synthesis quality by reducing spectral distortion due to prediction error
      - Implemented by an MLSA filter with averaged mel-cepstra
      - Effective for categorical and single Gaussian WaveNet and FFTNet vocoders [T. Okamoto et al., SLT 2018, ICASSP 2019]
    - This work investigates its impact on the WaveGlow and WaveRNN vocoders
    [Figure: (a) training stage: acoustic features are extracted from the speech corpus, the time-invariant noise shaping filter is calculated, the residual signal is generated by inverse filtering and quantized, and WaveNet/FFTNet is trained on it; (b) synthesis stage: the generated residual is dequantized and passed through the time-invariant noise weighting filter to reconstruct the speech signal]
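The scheme is the classic inverse-filtering idea: train the vocoder on a spectrally flattened residual, then apply the time-invariant shaping filter at synthesis so the prediction noise is pushed under the speech spectral envelope where it is masked. In the deck the filter is an MLSA filter built from averaged mel-cepstra; the sketch below uses a simple first-order filter pair as a hypothetical stand-in just to show the invertible training/synthesis relationship:

```python
def inverse_filter(x, a=0.9):
    """Training stage: residual e[n] = x[n] - a * x[n-1] (flattening filter,
    a stand-in for the inverse MLSA filter)."""
    e, prev = [], 0.0
    for s in x:
        e.append(s - a * prev)
        prev = s
    return e

def shaping_filter(e, a=0.9):
    """Synthesis stage: x[n] = e[n] + a * x[n-1] (exact inverse of the above,
    a stand-in for the MLSA noise weighting filter)."""
    x, prev = [], 0.0
    for s in e:
        prev = s + a * prev
        x.append(prev)
    return x

speech = [0.0, 0.5, 0.8, 0.3, -0.2, -0.6]
residual = inverse_filter(speech)        # what the vocoder is trained on
reconstructed = shaping_filter(residual) # what the listener hears
assert all(abs(u - v) < 1e-12 for u, v in zip(speech, reconstructed))
```

Any noise the vocoder adds to the residual is shaped by the synthesis filter in the same way, which is why the distortion follows the speech spectrum instead of being flat.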
  • 9. Experimental conditions
    - Speech corpus: Japanese female corpus, about 22 h (test set: 20 utterances); sampling frequency: 24 kHz
    - Sequence-to-sequence acoustic model (with Tacotron 2's settings): input full-context label vector (130 dims)
    - Neural vocoders (with/without noise shaping)
      - Single Gaussian AR WaveNet
      - Vanilla WaveRNN with dual softmax
      - Proposed single Gaussian WaveRNN
      - WaveGlow
    - Acoustic features
      - Simple acoustic features (SAF): fundamental frequency + mel-cepstra (37 dims)
      - Mel-spectrograms (MELSPC): 80 dims
  • 10. MOS results and demo
    - Subjective evaluation: 15 Japanese native listeners; 18 conditions x 20 utterances = 360 stimuli per subject
    - Results
      - Vanilla and single Gaussian WaveRNNs require noise shaping
      - Noise shaping is NOT effective for WaveGlow
      - Neural TTS systems with the sequence-to-sequence acoustic model and neural vocoders realize higher-quality synthesis than the STRAIGHT vocoder under the analysis-synthesis condition
    [Figure: MOS scores (1-5) for AR SG-WaveNet, WaveRNN, SG-WaveRNN, and WaveGlow with SAF and MELSPC features, with and without noise shaping (NS), in analysis-synthesis and TTS conditions, compared with STRAIGHT and original speech]
  • 11. Results of real-time factor (RTF)
    - Evaluation condition: a GPU (NVIDIA V100) with a simple PyTorch implementation
    - Results
      - Sequence-to-sequence acoustic model + WaveGlow realizes real-time neural TTS with an RTF of 0.16
      - Single Gaussian WaveRNN synthesizes about twice as fast as vanilla WaveRNN
    - Real-time high-fidelity neural TTS for Japanese can be realized
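Real-time factor is simply wall-clock synthesis time divided by the duration of the generated audio, so RTF < 1 means faster than real time. A quick sketch at the corpus's 24 kHz sampling rate (the sample count and timing below are illustrative, not measured values from the deck):

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate=24000):
    """RTF = wall-clock synthesis time / duration of generated audio.
    Values below 1.0 mean faster-than-real-time synthesis."""
    return synthesis_seconds / (num_samples / sample_rate)

# E.g. generating 5 s of 24 kHz audio (120000 samples) in 0.8 s:
rtf = real_time_factor(0.8, 120000)
assert abs(rtf - 0.16) < 1e-9
```

At RTF 0.16, each second of audio takes 0.16 s to generate, leaving headroom for the acoustic model in a streaming pipeline.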
  • 12. Conclusions
    - Real-time neural TTS with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders
      - Sequence-to-sequence acoustic model with full-context label input
      - WaveGlow and proposed single Gaussian WaveRNN vocoders
      - Realized real-time high-fidelity neural TTS with the sequence-to-sequence acoustic model and WaveGlow vocoder at a real-time factor of 0.16
    - Future work
      - Implementing real-time inference with a CPU (such as sparse WaveRNN and LPCNet)
      - Comparing the sequence-to-sequence acoustic model with conventional pipeline TTS models: T. Okamoto, T. Toda, Y. Shiga and H. Kawai, "Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems," IEEE ASRU 2019, Singapore, Dec. 2019 (to appear)