NII
Software from NII toward end-to-end speech synthesis: a tutorial on Tacotron and WaveNet (Part 1)
contact: wangxin@nii.ac.jp
we welcome critical comments, suggestions, and discussion
Xin WANG, Yusuke YASUDA
National Institute of Informatics, Japan
2019-01-27
NII
Part 1: WaveNet
contact: wangxin@nii.ac.jp
we welcome critical comments, suggestions, and discussion
Xin WANG
National Institute of Informatics, Japan
2019-01-27
Neural waveform models
SELF-INTRODUCTION
• PhD (2015-2018)
• M.A. (2012-2015)
• http://tonywangx.github.io
3
CONTENTS
• Introduction: text-to-speech synthesis
• Neural waveform models
• Summary & software
4
Text-to-speech synthesis (TTS) [1,2]
NII’s TTS systems
5
INTRODUCTION
[1] P. Taylor. Text-to-Speech Synthesis. Cambridge University Press, 2009.
[2] T. Dutoit. An Introduction to Text-to-speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
[Figure: Text → text-to-speech (TTS) → speech waveform; NII's TTS systems in 2013, 2016, and 2018]
6
INTRODUCTION
Text-to-speech synthesis (TTS)
❑ Architectures
[Pipeline TTS: Text → text analyzer → linguistic features → acoustic model(s) → acoustic features → waveform model → waveform]
Acoustic features: spectral features, F0, band-aperiodicity, etc.
7
INTRODUCTION
Text-to-speech synthesis (TTS)
❑ Architectures
[Pipeline TTS: Text → text analyzer → linguistic features → acoustic model(s) → acoustic features → waveform model → waveform; the acoustic and waveform models form the TTS backend]
[End-to-end TTS: Text → end-to-end TTS model → waveform]
8
INTRODUCTION
Text-to-speech synthesis (TTS)
❑ Architectures
[Both the pipeline and the end-to-end architecture end in a waveform module]
Tutorial part 1: neural waveform modeling
Tutorial part 2: end-to-end TTS
CONTENTS
• Introduction: text-to-speech synthesis
• Neural waveform modeling
  - Overview
  - Autoregressive models
  - Normalizing flow
  - STFT-based training criterion
• Summary
9
10
OVERVIEW
Task definition
Spectrogram
11
OVERVIEW
Task definition
[Figure: a spectrogram — N frames of acoustic features c1, c2, c3, …, cN — and the corresponding waveform values o1, o2, o3, …, oT over T sampling points; neural waveform models map one to the other]
12
OVERVIEW
Task definition
[Figure: acoustic features c1, c2, c3, …, cN mapped to waveform values o1, o2, …, oT; the model defines p(o_{1:T} | c_{1:N}; Θ)]
13
OVERVIEW
Naïve model
❑ Simple network + maximum likelihood
• Feedforward network
[Figure: c1, …, cN → up-sampling → hidden layers → output distribution p(o_t | c_{1:N}; Θ) at each time t]
14
OVERVIEW
Naïve model
❑ Simple network + maximum likelihood
• RNN
[Figure: c1, …, cN → up-sampling → recurrent hidden layers → output distribution at each time t]
15
OVERVIEW
Naïve model
❑ Simple network + maximum likelihood
• Dilated CNN [1,2]
[Figure: c1, …, cN → up-sampling → dilated convolutional hidden layers → output distribution at each time t]
[1] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[2] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions
on Acoustics, Speech, and Signal Processing, 37(3):328–339, 1989.
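Dilation matters because it makes the receptive field grow exponentially with depth. As a rough illustration (the layer counts below are typical WaveNet-style settings chosen for the example, not taken from this slide):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# A WaveNet-style stack: dilations double each layer, repeated in blocks.
dilations = [2 ** i for i in range(10)] * 3   # 1, 2, 4, ..., 512, repeated 3x
print(receptive_field(2, dilations))          # 3070
```

Thirty layers of kernel-size-2 convolutions thus cover about 3 000 past samples, versus 31 samples without dilation.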
16
OVERVIEW
Naïve model
[Figure: c1, …, cN → up-sampling → hidden layers → per-sample output distributions over o1, …, oT]
p(o_{1:T} | c_{1:N}; Θ) = p(o_1 | c_{1:N}; Θ) ··· p(o_T | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | c_{1:N}; Θ)
Weak temporal model <---------------> strongly correlated data o_{1:T}: a mismatch
17
OVERVIEW
Towards better models
Weak temporal model <---------------> strongly correlated data: a mismatch. Possible remedies:
• Better model: a stronger network between c_{1:N} and o_{1:T}
• Easier target: transform o_{1:T} before modeling
• Better criterion: a different training criterion
• Combination: a knowledge-based transform plus a better model
18
OVERVIEW
[Taxonomy of neural waveform models]
• Autoregressive models: WaveNet, SampleRNN, WaveRNN, FFTNet, universal neural vocoder
• Normalizing flow / feature transformation: Parallel WaveNet, ClariNet, WaveGlow, FloWaveNet
• Training criterion in frequency domain: neural source-filter model, RNN + STFT loss
• Knowledge-based transform (speech/signal processing): LPCNet, LP-WaveNet, ExcitNet, GlotNet, subband WaveNet
☛ Please check references in the appendix
CONTENTS
• Introduction: text-to-speech synthesis
• Neural waveform models
  - Overview
  - Autoregressive models
  - Normalizing flow
  - STFT-based training criterion
• Summary & software
19
20
PART I: AUTOREGRESSIVE MODELS
Core idea
❑ Naïve model
[Figure: each output o1, …, oT conditioned only on c1, …, cN]
p(o_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | c_{1:N}; Θ)
21
PART I: AUTOREGRESSIVE MODELS
Core idea
❑ Autoregressive (AR) model
[Figure: each output o_t conditioned on the previous samples as well as c1, …, cN]
p(o_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | o_{1:t-1}, c_{1:N}; Θ)
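The AR factorization forces strictly sequential sampling: each sample must exist before the next can be drawn. A toy sketch of this loop — the linear predictor `toy_ar_predict` is a made-up stand-in for the neural network, used only to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_ar_predict(history, cond):
    """Stand-in for the network: predicts the mean of p(o_t | o_{1:t-1}, c)."""
    prev = history[-1] if history else 0.0
    return 0.9 * prev + 0.1 * cond

def generate(cond_per_sample, noise_std=0.01):
    """Sequential AR generation: each new sample feeds the next step."""
    out = []
    for c in cond_per_sample:
        mu = toy_ar_predict(out, c)
        out.append(mu + noise_std * rng.normal())  # draw o_t given the past
    return np.array(out)

wave = generate(np.zeros(100))
print(wave.shape)  # (100,)
```

The loop cannot be vectorized over t, which is exactly the generation-speed problem discussed later in this part.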
22
PART I: AUTOREGRESSIVE MODELS
Core idea
❑ Autoregressive (AR) model
• Training: use the natural waveform for feedback (teacher forcing [1])
[Figure: natural samples o_{t-1} fed back as network input when predicting o_t]
[1] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
24
PART I: AUTOREGRESSIVE MODELS
Core idea
❑ Autoregressive (AR) model
• Training: use the natural waveform for feedback (teacher forcing [1])
• Generation: feed back the model's own generated samples
[Figure: generated samples ô fed back step by step]
p(ô_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(ô_t | ô_{t-P:t-1}, c_{1:N}; Θ)
[1] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
25
PART I: AUTOREGRESSIVE MODELS
[Recap of the model taxonomy: WaveNet highlighted among the autoregressive models]
WaveNet
❑ Network structure
• Details: http://id.nii.ac.jp/1001/00185683/ and http://tonywangx.github.io/pdfs/wavenet.pdf
[Figure: WaveNet's dilated causal CNN maps c1, …, cN and past samples to the distribution of each output sample]
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu.
WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
27
PART I: AUTOREGRESSIVE MODELS
[Recap of the model taxonomy: WaveNet and SampleRNN highlighted among the autoregressive models]
☛ Please check SampleRNN in the appendix
❑ Issues with WaveNet & SampleRNN
Generation is very, very … very slow
❑ Acceleration?
• Engineering-based: parallelize/simplify computation
• Knowledge-based: speech / signal processing theory
28
PART I: AUTOREGRESSIVE MODELS
WaveRNN
❑ Linear-time AR dependency in WaveNet
❑ Subscale AR dependency + batch sampling
[Figure: with a linear AR dependency, 16 waveform values take 16 sequential steps to generate]
29
PART I: AUTOREGRESSIVE MODELS
WaveRNN
❑ Linear-time AR dependency in WaveNet
❑ Subscale AR dependency + batch sampling
• Generate multiple samples at the same time
[Figure: subscale reordering lets batches of samples be generated together — 10 steps instead of 16 for the same 16 values]
31
PART I: AUTOREGRESSIVE MODELS
[Recap of the model taxonomy: autoregressive models highlighted]
❑ AR models
p(o_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | o_{1:t-1}, c_{1:N}; Θ)
Generation is still slow
32
PART I: AUTOREGRESSIVE MODELS
[Recap of the model taxonomy]
Non-autoregressive model?
CONTENTS
• Introduction: text-to-speech synthesis
• Neural waveform models
  - Overview
  - Autoregressive models
  - Normalizing flow
  - STFT-based training criterion
• Summary & software
33
34
PART II: NORMALIZING FLOW-BASED MODELS
General idea
• o with strong temporal correlation ---> z with weak temporal correlation
• Principle of changing random variables
[Figure: training maps o through f^{-1} to z and evaluates the criterion against the base density; generation samples ẑ and maps it through f to ô]
p_o(o | c; Θ) = p_z(z = f^{-1}(o) | c; Θ) · |det(∂f^{-1}(o)/∂o)|
☛ https://en.wikipedia.org/wiki/Probability_density_function#Dependent_variables_and_change_of_variables
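For a scalar affine transform, the change-of-variables rule can be checked numerically. A minimal sketch with a standard-normal base density (the affine map and the test values are chosen for illustration):

```python
import math

def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def density_of_o(o, mu, sigma):
    """p_o(o) for o = f(z) = sigma * z + mu with z ~ N(0, 1):
    p_z(f_inv(o)) * |d f_inv / d o|, where f_inv(o) = (o - mu) / sigma."""
    z = (o - mu) / sigma
    return standard_normal_pdf(z) / sigma

# Sanity check against the closed-form N(o; mu, sigma^2) density.
mu, sigma, o = 0.5, 2.0, 1.3
direct = math.exp(-0.5 * ((o - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
assert abs(density_of_o(o, mu, sigma) - direct) < 1e-12
```

The Jacobian factor 1/σ is what keeps the transformed density normalized; flows stack many such transforms and multiply their Jacobian determinants.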
35
PART II: NORMALIZING FLOW-BASED MODELS
General idea
• o ---> Gaussian noise sequence z
• Multiple transformations: a normalizing flow
[Figure: training composes the inverse transforms f_M^{-1}, f_{M-1}^{-1}, …, f_1^{-1} from o down to z ~ N(0, I); generation composes f_1, …, f_M from ẑ up to ô]
p_o(o | c; Θ_1, …, Θ_M) = p_z(z = f_1^{-1} ∘ ··· ∘ f_{M-1}^{-1} ∘ f_M^{-1}(o) | c) · ∏_{m=1}^{M} |det J(f_m^{-1})|
D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530–1538, 2015.
36
PART II: NORMALIZING FLOW-BASED MODELS
[Recap of the model taxonomy: normalizing flow / feature transformation highlighted]
p_o(o | c; Θ_1, …, Θ_M) = p_z(z = f_1^{-1} ∘ ··· ∘ f_{M-1}^{-1} ∘ f_M^{-1}(o) | c) · ∏_{m=1}^{M} |det J(f_m^{-1})|
• f_m(·) in the linear time domain: Parallel WaveNet, ClariNet
• f_m(·) in the subscale time domain: WaveGlow, FloWaveNet
• Different time complexity of f_m(·), different training strategies
A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, et al. Parallel WaveNet:
Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
W. Ping, K. Peng, and J. Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
S. Kim, S.-g. Lee, J. Song, and S. Yoon. Flowavenet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
R. Prenger, R. Valle, and B. Catanzaro. Waveglow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
37
PART II: NORMALIZING FLOW-BASED MODELS
ClariNet & Parallel WaveNet
❑ Generation process
[Figure: noise ẑ^{(0)}_{1:T} ~ N(0, I) is pushed through f_1(·), …, f_M(·); at each step a CNN predicts μ_{1:T} and σ_{1:T} from c_{1:N}]
ẑ^{(1)}_t = ẑ^{(0)}_t · σ_t + μ_t
✓ Parallel computation
✓ Fast generation
* The initial condition may be implemented differently
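Because the CNN emits μ_{1:T} and σ_{1:T} for every t at once, each flow step is a single vectorized affine operation rather than a T-step loop. A sketch with random stand-ins for the CNN outputs (the actual networks predict these from c and the previous ẑ):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000

z0 = rng.normal(size=T)                    # white-noise input z^(0)
mu = 0.1 * rng.normal(size=T)              # stand-in for CNN-predicted means
sigma = np.exp(0.1 * rng.normal(size=T))   # stand-in for predicted scales (> 0)

# z^(1)_t = z^(0)_t * sigma_t + mu_t, computed for all t in parallel
z1 = z0 * sigma + mu
print(z1.shape)  # (1000,)
```

Generation cost is then dominated by M parallel network passes instead of T sequential sampling steps.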
38
PART II: NORMALIZING FLOW-BASED MODELS
ClariNet & Parallel WaveNet
❑ Naïve training process
[Figure: the natural waveform o_{1:T} is inverted through f_M^{-1}(·), …, f_1^{-1}(·) down to z^{(0)}_{1:T}, which is scored under N(0, I); each inverse step uses the CNN's μ_{1:T} and σ_{1:T}]
z^{(0)}_t = (z^{(1)}_t − μ_t) / σ_t
Training time ~ O(T)
39
PART II: NORMALIZING FLOW-BASED MODELS
[Figure: with an inverse-AR flow [1], training (inverting o to z) is sequential, ~ O(T), while generation is parallel, ✓ ~ O(1); the original WaveNet is the opposite: parallel training ✓ ~ O(1), sequential generation ~ O(T)]
[1] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, pages 4743–4751, 2016.
40
PART II: NORMALIZING FLOW-BASED MODELS
ClariNet & Parallel WaveNet
❑ Fast training: knowledge distillation
• Teacher (a pre-trained WaveNet) gives p(ô_t | ô_{1:t-1}, c_{1:T}; Θ_teacher)
• Student (the flow f_1, …, f_M driven by noise z ~ N(0, I)) gives p(ô_t | z_{1:T}, c_{1:T}; Θ_student)
• Student learns from teacher
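When both teacher and student emit Gaussian output distributions, as in ClariNet, the per-sample distillation criterion has a closed form: the KL divergence between two univariate Gaussians. A sketch of that closed form (the regularization ClariNet adds on top is omitted):

```python
import math

def gaussian_kl(mu_s, sig_s, mu_t, sig_t):
    """KL( N(mu_s, sig_s^2) || N(mu_t, sig_t^2) ): student-to-teacher
    divergence, available in closed form for Gaussian outputs."""
    return (math.log(sig_t / sig_s)
            + (sig_s ** 2 + (mu_s - mu_t) ** 2) / (2 * sig_t ** 2)
            - 0.5)

print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # 0.0: identical distributions
```

The closed form avoids the Monte-Carlo sampling that Parallel WaveNet's original distillation loss requires.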
41
PART II: NORMALIZING FLOW-BASED MODELS
[Recap of the model taxonomy]
❑ Parallel WaveNet & ClariNet
✓ No AR dependency in generation
• Need a pre-trained teacher WaveNet
Without a teacher WaveNet?
WaveGlow
❑ Why ~ O(T)? Dependency in the linear time domain
❑ Reduce T? Dependency in the subscale time domain
42
PART II: NORMALIZING FLOW-BASED MODELS
[Figure: with f_1^{-1}(·) defined over a subscale (reshaped) time axis, both generation and training operate in parallel across the shortened sequence]
R. Prenger, R. Valle, and B. Catanzaro. Waveglow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
44
PART II: NORMALIZING FLOW-BASED MODELS
[Recap of the model taxonomy]
Simple model without normalizing flow? ☛ Neural source-filter model
CONTENTS
• Introduction: text-to-speech synthesis
• Neural waveform models
  - Overview
  - Autoregressive models
  - Normalizing flow
  - STFT-based training criterion
• Summary & software
45
46
PART III: STFT-BASED TRAINING CRITERION
Neural source-filter model
❑ Naïve model and STFT-based criterion
[Figure: acoustic features c1, …, cN → model → generated waveforms]
X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis.
arXiv preprint arXiv:1810.11946, 2018.
47
PART III: STFT-BASED TRAINING CRITERION
Neural source-filter model
❑ Naïve model and STFT-based criterion
[Figure: generated and natural waveforms each pass through STFT; the training criterion is the spectral amplitude error between them]
X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis.
arXiv preprint arXiv:1810.11946, 2018.
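The spectral amplitude error can be sketched in a few lines of NumPy. This is a minimal single-resolution version (one frame length and hop, chosen here for illustration); the paper's actual criterion combines several STFT configurations:

```python
import numpy as np

def stft_amplitude_error(x, y, frame_len=320, hop=80):
    """Mean squared error between log spectral amplitudes of two waveforms."""
    win = np.hanning(frame_len)
    def log_spec(s):
        frames = [s[i:i + frame_len] * win
                  for i in range(0, len(s) - frame_len + 1, hop)]
        return np.log(np.abs(np.fft.rfft(frames, axis=-1)) + 1e-5)
    return float(np.mean((log_spec(x) - log_spec(y)) ** 2))

t = np.arange(16000) / 16000.0
natural = np.sin(2 * np.pi * 220 * t)
print(stft_amplitude_error(natural, natural))  # 0.0
```

Because the loss compares amplitude spectra rather than individual samples, the model is not penalized for phase differences that are perceptually irrelevant.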
48
PART III: STFT-BASED TRAINING CRITERION
Neural source-filter model
❑ Naïve model and STFT-based criterion
[Figure: a sine-plus-harmonics source signal derived from F0 drives the model; the criterion is still the STFT spectral amplitude error against natural waveforms]
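The sine-plus-harmonics source driven by a per-sample F0 contour can be sketched as below. The harmonic count and 1/h weighting are invented for illustration; the paper defines its own source module:

```python
import numpy as np

def sine_excitation(f0_per_sample, sr=16000, n_harmonics=3):
    """Sum of harmonic sines whose instantaneous frequency follows F0;
    F0 = 0 marks unvoiced regions (output forced to zero there)."""
    phase = 2 * np.pi * np.cumsum(f0_per_sample) / sr
    voiced = (f0_per_sample > 0).astype(np.float64)
    sig = sum(np.sin((h + 1) * phase) / (h + 1) for h in range(n_harmonics))
    return voiced * sig

f0 = np.full(16000, 220.0)   # one second of a steady 220 Hz contour
e = sine_excitation(f0)
print(e.shape)  # (16000,)
```

The cumulative-sum phase is what lets a time-varying F0 contour produce a continuous-phase excitation.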
49–54
PART III: STFT-BASED TRAINING CRITERION
Neural source-filter model
❑ Examples
[Spectrogram examples: the sine source signal alone shows only harmonics and no formants]
55
PART III: STFT-BASED TRAINING CRITERION
Neural source-filter model
❑ Results
• Generation is at least 100 times faster than WaveNet
• The model is smaller than WaveNet
• Speech quality is similar to WaveNet
[Figure: quality (MOS, 1–5) for natural speech, the WORLD vocoder, u-law WaveNet, Gaussian WaveNet, and the neural source-filter model, in both copy-synthesis and text-to-speech]
https://nii-yamagishilab.github.io/samples-nsf/index.html
CONTENTS
• Introduction: text-to-speech synthesis
• Neural waveform models
• Summary & software
56
57
SUMMARY
[Recap of the full model taxonomy: autoregressive models (WaveNet, SampleRNN, WaveRNN, FFTNet, universal neural vocoder); normalizing flow / feature transformation (Parallel WaveNet, ClariNet, WaveGlow, FloWaveNet); frequency-domain training criteria (neural source-filter model, RNN + STFT loss); knowledge-based transforms from speech/signal processing (LPCNet, LP-WaveNet, ExcitNet, GlotNet, subband WaveNet)]
58
SOFTWARE
NII neural network toolkit
[Recap of the model taxonomy]
NII neural network toolkit
❑ Neural waveform models
1. Toolkit cores: https://github.com/nii-yamagishilab/project-CURRENNT-public.git
2. Toolkit scripts: https://github.com/nii-yamagishilab/project-CURRENNT-scripts
59
SOFTWARE
NII neural network toolkit
❑ Useful slides
• http://tonywangx.github.io/pdfs/CURRENNT_TUTORIAL.pdf
• Other related slides http://tonywangx.github.io/slides.html
60
SOFTWARE
61
REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K.
Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. In Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410–2419, 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255. IEEE, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658, 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP (submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv preprint arXiv:1811.11913, 2018.
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv preprint arXiv:1811.04769, 2018.
LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846, 2018.
End of Part 1
62
Codes, scripts, slides
http://nii-yamagishilab.github.io
This work was partially supported by JST CREST Grant Number
JPMJCR18A6, Japan and by MEXT KAKENHI Grant Numbers (16H06302,
16K16096, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.
Text-to-speech (TTS)
❑ Speech samples from NII's TTS system
63
INTRODUCTION
[Figure: NII's TTS systems in 2013, 2016, and 2018]
64
PART I: AUTOREGRESSIVE MODELS
SampleRNN
❑ Idea
• Hierarchical / multi-resolution dependency
S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. Samplernn: An unconditional end-to-end
neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
65
PART I: AUTOREGRESSIVE MODELS
SampleRNN
❑ Example network structure
• R: time resolution increased by ×2 per tier
[Figure: Tier 1, Tier 2, Tier 3]
66
PART I: AUTOREGRESSIVE MODELS
SampleRNN
❑ Example network structure
• Training
• R: time resolution increased by ×2 per tier
[Figure: Tier 1 (R = 1), Tier 2 (R = 2), Tier 3 (R = 4) over samples 1 … T during training]
67
PART I: AUTOREGRESSIVE MODELS
SampleRNN
❑ Example network structure
• Generation process
[Figure: Tier 1 (R = 1), Tier 2 (R = 2), Tier 3 (R = 4); coarser tiers run first and condition the finer tiers as samples 1 … T are generated]
68
PART I: AUTOREGRESSIVE MODELS
WaveNet
❑ Variants
• u-law discrete waveform ---> continuous-valued waveform
  - Mixture of logistic distributions [1]
  - GMM / single Gaussian [2]
• Quantization noise shaping [3], related noise-shaping method [4]
[1] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[2] W. Ping, K. Peng, and J. Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
[3] T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda. Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 26(7):1173–1180, 2018.
[4] K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation. In Proc. ICASSP, pages 5664–5668. IEEE, 2018.
[Figure: discrete softmax output vs. continuous GMM/Gaussian output]
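The u-law companding that the original WaveNet applies before its 256-way softmax can be sketched as follows (a minimal version; real implementations may differ in rounding and scaling details):

```python
import numpy as np

def mulaw_encode(x, mu=255):
    """mu-law compand a waveform in [-1, 1] to integer levels 0..mu."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(idx, mu=255):
    """Invert the companding back to a waveform in [-1, 1]."""
    y = 2 * (idx.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 101)
err = np.max(np.abs(x - mulaw_decode(mulaw_encode(x))))
print(err < 0.05)  # True: coarse quantization, but finer near zero
```

The logarithmic compression allocates more of the 256 levels to small amplitudes, which is why 8-bit u-law sounds far better than 8-bit linear PCM.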
69
PART I: AUTOREGRESSIVE MODELS
WaveRNN
❑ WaveNet is inefficient
• Computation cost
  1. Impractical for 16-bit PCM (softmax of size 65536)
  2. Very deep network (50 dilated CNN layers …)
  3. …
• Time latency
  1. Generation time ~ O(waveform_length)
70
PART I: AUTOREGRESSIVE MODELS
WaveRNN
❑ WaveRNN strategies
• Computation cost
  1. 16-bit PCM softmax ---> two-level (coarse/fine) softmax
  2. Very deep network ---> recurrent + feedforward layers
  3. …
• Time latency
  1. Generation time ---> subscale dependency + batch sampling
N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu. Efficient neural audio synthesis. In Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410–2419, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
 
Introduction of Algorithm.pdf
Introduction of Algorithm.pdfIntroduction of Algorithm.pdf
Introduction of Algorithm.pdfLaxmiMobile1
 
signal and system ch1-1 for students of NCNU EE
signal and system ch1-1 for students of  NCNU EEsignal and system ch1-1 for students of  NCNU EE
signal and system ch1-1 for students of NCNU EEssuserf111f7
 
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack FilteringReal-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack FilteringKohei Hayashi
 
Anomaly Detection in Sequences of Short Text Using Iterative Language Models
Anomaly Detection in Sequences of Short Text Using Iterative Language ModelsAnomaly Detection in Sequences of Short Text Using Iterative Language Models
Anomaly Detection in Sequences of Short Text Using Iterative Language ModelsCynthia Freeman
 
Efficient Implementation of Self-Organizing Map for Sparse Input Data
Efficient Implementation of Self-Organizing Map for Sparse Input DataEfficient Implementation of Self-Organizing Map for Sparse Input Data
Efficient Implementation of Self-Organizing Map for Sparse Input Dataymelka
 
Automated seismic-to-well ties?
Automated seismic-to-well ties?Automated seismic-to-well ties?
Automated seismic-to-well ties?UT Technology
 
UPDATED Sampling Lecture (2).pptx
UPDATED Sampling Lecture (2).pptxUPDATED Sampling Lecture (2).pptx
UPDATED Sampling Lecture (2).pptxHarisMasood20
 
Lecture 2- Practical AD and DA Conveters (Online Learning).pptx
Lecture 2- Practical AD and DA Conveters (Online Learning).pptxLecture 2- Practical AD and DA Conveters (Online Learning).pptx
Lecture 2- Practical AD and DA Conveters (Online Learning).pptxHamzaJaved306957
 
Direct digital frequency synthesizer
Direct digital frequency synthesizerDirect digital frequency synthesizer
Direct digital frequency synthesizerVenkat Malai Avichi
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 

Similar to エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~ (20)

U4301106110
U4301106110U4301106110
U4301106110
 
Sampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptxSampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptx
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01
 
Lecture 1 (ADSP).pptx
Lecture 1 (ADSP).pptxLecture 1 (ADSP).pptx
Lecture 1 (ADSP).pptx
 
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG dataSpectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
 
l1.ppt
l1.pptl1.ppt
l1.ppt
 
l1.ppt
l1.pptl1.ppt
l1.ppt
 
Welcome to Introduction to Algorithms, Spring 2004
Welcome to Introduction to Algorithms, Spring 2004Welcome to Introduction to Algorithms, Spring 2004
Welcome to Introduction to Algorithms, Spring 2004
 
CHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐ
CHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐCHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐ
CHƯƠNG 2 KỸ THUẬT TRUYỀN DẪN SỐ - THONG TIN SỐ
 
l1.ppt
l1.pptl1.ppt
l1.ppt
 
Introduction of Algorithm.pdf
Introduction of Algorithm.pdfIntroduction of Algorithm.pdf
Introduction of Algorithm.pdf
 
signal and system ch1-1 for students of NCNU EE
signal and system ch1-1 for students of  NCNU EEsignal and system ch1-1 for students of  NCNU EE
signal and system ch1-1 for students of NCNU EE
 
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack FilteringReal-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
 
Anomaly Detection in Sequences of Short Text Using Iterative Language Models
Anomaly Detection in Sequences of Short Text Using Iterative Language ModelsAnomaly Detection in Sequences of Short Text Using Iterative Language Models
Anomaly Detection in Sequences of Short Text Using Iterative Language Models
 
Efficient Implementation of Self-Organizing Map for Sparse Input Data
Efficient Implementation of Self-Organizing Map for Sparse Input DataEfficient Implementation of Self-Organizing Map for Sparse Input Data
Efficient Implementation of Self-Organizing Map for Sparse Input Data
 
Automated seismic-to-well ties?
Automated seismic-to-well ties?Automated seismic-to-well ties?
Automated seismic-to-well ties?
 
UPDATED Sampling Lecture (2).pptx
UPDATED Sampling Lecture (2).pptxUPDATED Sampling Lecture (2).pptx
UPDATED Sampling Lecture (2).pptx
 
Lecture 2- Practical AD and DA Conveters (Online Learning).pptx
Lecture 2- Practical AD and DA Conveters (Online Learning).pptxLecture 2- Practical AD and DA Conveters (Online Learning).pptx
Lecture 2- Practical AD and DA Conveters (Online Learning).pptx
 
Direct digital frequency synthesizer
Direct digital frequency synthesizerDirect digital frequency synthesizer
Direct digital frequency synthesizer
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 

More from Yamagishi Laboratory, National Institute of Informatics, Japan

More from Yamagishi Laboratory, National Institute of Informatics, Japan (11)

DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement DDS: A new device-degraded speech dataset for speech enhancement
DDS: A new device-degraded speech dataset for speech enhancement
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
Analyzing Language-Independent Speaker Anonymization Framework under Unseen C...
 
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
Spoofing-aware Attention Back-end with Multiple Enrollment and Novel Trials S...
 
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
Odyssey 2022: Language-Independent Speaker Anonymization Approach using Self-...
 
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
Odyssey 2022: Investigating self-supervised front ends for speech spoofing co...
 
Generalization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction NetworksGeneralization Ability of MOS Prediction Networks
Generalization Ability of MOS Prediction Networks
 
Estimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasureEstimating the confidence of speech spoofing countermeasure
Estimating the confidence of speech spoofing countermeasure
 
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
Attention Back-end for Automatic Speaker Verification with Multiple Enrollmen...
 
How do Voices from Past Speech Synthesis Challenges Compare Today?
 How do Voices from Past Speech Synthesis Challenges Compare Today? How do Voices from Past Speech Synthesis Challenges Compare Today?
How do Voices from Past Speech Synthesis Challenges Compare Today?
 
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio SynthesisText-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

NII Software for End-to-End Speech Synthesis: A Tutorial on Tacotron and WaveNet (Part 1)

  • 1. NII Tacotron & WaveNet tutorial — Xin WANG, Yusuke YASUDA, National Institute of Informatics, Japan, 2019-01-27. Contact: wangxin@nii.ac.jp — we welcome critical comments, suggestions, and discussion.
  • 2. Part 1: WaveNet — neural waveform models. Xin WANG, National Institute of Informatics, Japan, 2019-01-27. Contact: wangxin@nii.ac.jp — we welcome critical comments, suggestions, and discussion.
  • 3. SELF-INTRODUCTION — Xin WANG: PhD (2015-2018); M.A. (2012-2015). http://tonywangx.github.io
  • 4. CONTENTS — Introduction: text-to-speech synthesis; Neural waveform models; Summary & software.
  • 5. INTRODUCTION — Text-to-speech synthesis (TTS) [1,2]: Text → Text-to-speech (TTS) → Speech waveform. NII's TTS systems (2013, 2016, 2018). [1] P. Taylor. Text-to-Speech Synthesis. Cambridge University Press, 2009. [2] T. Dutoit. An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
  • 6. INTRODUCTION — TTS architectures. Pipeline: Text → Text analyzer → Linguistic features → Acoustic model(s) → Acoustic features → Waveform model → Waveform. Acoustic features include spectral features, F0, band-aperiodicity, etc.
  • 7. INTRODUCTION — TTS architectures. Pipeline (TTS backend model): Text → Text analyzer → Linguistic features → Acoustic model(s) → Acoustic features → Waveform model → Waveform. End-to-end TTS model: Text → Waveform.
  • 8. INTRODUCTION — TTS architectures. Both the backend pipeline and the end-to-end model contain a waveform module. Tutorial part 1: neural waveform modeling; tutorial part 2: end-to-end TTS.
  • 9. CONTENTS — Introduction: text-to-speech synthesis; Neural waveform models (overview, autoregressive models, normalizing flow, STFT-based training criterion); Summary.
  • 11. OVERVIEW — Task definition. Input: acoustic features c_1, c_2, c_3, ..., c_N (e.g., a spectrogram with N frames); output: waveform values o_1, o_2, o_3, o_4, ..., o_T (T waveform sampling points).
  • 12. OVERVIEW — Task definition. A neural waveform model defines the distribution p(o_{1:T} | c_{1:N}; Θ) of the waveform values o_{1:T} given the acoustic features c_{1:N}.
  • 13. OVERVIEW — Naïve model: simple network + maximum likelihood, e.g., a feedforward network. The features c_{1:N} are up-sampled to the waveform rate, passed through hidden layers, and the network outputs a per-step distribution such as p(o_4 | c_{1:N}; Θ).
  • 14. OVERVIEW — Naïve model: simple network + maximum likelihood, e.g., an RNN over the up-sampled features, again producing a per-step output distribution.
  • 15. OVERVIEW — Naïve model: simple network + maximum likelihood, e.g., a dilated CNN [1,2] over the up-sampled features. [1] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015. [2] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328–339, 1989.
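  The appeal of the dilated CNN above is that a short stack of layers covers a long span of waveform samples. As a minimal sketch (not from the slides; the kernel size and dilation schedule are illustrative assumptions), the receptive field of stacked dilated causal convolutions can be computed as:

```python
def receptive_field(kernel_size, dilations):
    """Number of input samples visible to one output sample of a
    stack of dilated causal convolutions (one layer per dilation)."""
    return 1 + (kernel_size - 1) * sum(dilations)

# A WaveNet-style dilation cycle 1, 2, 4, ..., 512, repeated 3 times,
# with kernel size 2 (hypothetical settings for illustration):
dilations = [2 ** i for i in range(10)] * 3
print(receptive_field(2, dilations))  # 3070 samples
```

  Doubling the dilation per layer makes the receptive field grow exponentially with depth, which is why such stacks are practical for raw audio.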
  • 16. OVERVIEW — Naïve model: p(o_{1:T} | c_{1:N}; Θ) = p(o_1 | c_{1:N}; Θ) ··· p(o_T | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | c_{1:N}; Θ). Mismatch: a weak temporal model versus strongly correlated waveform data o_{1:T}.
  • 17. OVERVIEW — Towards better models. Starting from the naïve model (weak temporal model vs. strongly correlated data), the options are: a better model; an easier target (transform the data before modeling); a better criterion; a knowledge-based transform; or a combination of these.
  • 18. OVERVIEW — Model families: autoregressive models (WaveNet, WaveRNN, SampleRNN, FFTNet); normalizing flow / feature transformation (Parallel WaveNet, ClariNet, WaveGlow, FloWaveNet); training criterion in the frequency domain (neural source-filter model, RNN + STFT loss); knowledge-based transforms from speech/signal processing (LPCNet, LP-WaveNet, ExcitNet, GlotNet, Subband WaveNet); universal neural vocoder. ☛ Please check the references in the appendix.
  • 19. CONTENTS — Introduction: text-to-speech synthesis; Neural waveform models (overview, autoregressive models, normalizing flow, STFT-based training criterion); Summary & software.
  • 20. PART I: AUTOREGRESSIVE MODELS — Core idea. Recall the naïve model: p(o_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | c_{1:N}; Θ).
  • 21. PART I: AUTOREGRESSIVE MODELS — Core idea. Autoregressive (AR) model: condition each sample on the previous samples, p(o_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | o_{1:t-1}, c_{1:N}; Θ).
  • 22. PART I: AUTOREGRESSIVE MODELS — Core idea. AR model training: use the natural waveform for feedback (teacher forcing [1]). [1] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
  • 23. PART I: AUTOREGRESSIVE MODELS — Core idea. AR model training (cont.): during training, the feedback inputs are the natural waveform samples, so all time steps can be computed in parallel.
  • 24. PART I: AUTOREGRESSIVE MODELS — Core idea. AR model generation: feed back the generated waveform, p(ô_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(ô_t | ô_{t-P:t-1}, c_{1:N}; Θ).
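  The generation loop above is inherently sequential, since each sample depends on previously generated samples. A toy sketch (the weighted-sum "network" and its parameters are made up for illustration; a real model outputs a full distribution, not just a mean):

```python
import random

random.seed(0)

P = 3                          # receptive field of the toy model
theta = [0.5, 0.3, 0.1]        # stand-in for trained parameters

def ar_mean(history):
    """Toy AR 'network': predicted mean = weighted sum of the last P samples."""
    return sum(w * x for w, x in zip(theta, history))

# Generation: each new sample is fed back as input, so the loop
# runs one step per waveform sample (this is why AR generation is slow).
generated = [0.0] * P          # zero padding as the initial condition
for t in range(16):
    mean = ar_mean(generated[-P:])
    generated.append(mean + 0.01 * random.gauss(0.0, 1.0))

print(len(generated) - P)      # 16 samples, produced one at a time
```

  At training time, by contrast, the history slots would be filled with natural waveform samples (teacher forcing), so every step's loss can be computed in parallel.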
  • 25. PART I: AUTOREGRESSIVE MODELS — WaveNet. WaveNet sits in the autoregressive family (with WaveRNN, SampleRNN, FFTNet), alongside the normalizing-flow family (Parallel WaveNet, ClariNet, WaveGlow, FloWaveNet), frequency-domain training criteria (neural source-filter model, RNN + STFT loss), knowledge-based transforms from speech/signal processing (LPCNet, LP-WaveNet, ExcitNet, GlotNet, Subband WaveNet), and the universal neural vocoder.
  • 26. PART I: AUTOREGRESSIVE MODELS — WaveNet network structure. Details: http://id.nii.ac.jp/1001/00185683/ and http://tonywangx.github.io/pdfs/wavenet.pdf. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • 27. PART I: AUTOREGRESSIVE MODELS — Issues with WaveNet & SampleRNN: generation is very, very slow. Acceleration options: engineering-based (parallelize/simplify the computation) or knowledge-based (speech/signal processing theory). ☛ Please check SampleRNN in the appendix.
  • 28. PART I: AUTOREGRESSIVE MODELS — WaveRNN. WaveNet has a linear-time AR dependency: 16 waveform values take 16 sequential steps to generate. WaveRNN instead uses a subscale AR dependency + batch sampling.
  • 29. PART I: AUTOREGRESSIVE MODELS — WaveRNN. Subscale AR dependency + batch sampling: generate multiple samples at the same time, e.g., 16 values in 10 steps instead of 16.
  • 30. PART I: AUTOREGRESSIVE MODELS — WaveRNN (cont.): once a subscale sub-sequence is far enough ahead, the next sub-sequence can be generated in parallel with it, which is how the 16 sequential steps shrink to 10.
  • 31. PART I: AUTOREGRESSIVE MODELS — Summary of AR models: p(o_{1:T} | c_{1:N}; Θ) = ∏_{t=1}^{T} p(o_t | o_{1:t-1}, c_{1:N}; Θ); generation is still slow.
  • 32. PART I: AUTOREGRESSIVE MODELS — Can we build a non-autoregressive model?
  • 33. CONTENTS — Introduction: text-to-speech synthesis; Neural waveform models (overview, autoregressive models, normalizing flow, STFT-based training criterion); Summary & software.
  • 34. PART II: NORMALIZING FLOW-BASED MODELS — General idea: transform o, which has strong temporal correlation, into z with weak temporal correlation. By the change-of-variables principle, p_o(o | c; Θ) = p_z(z = f^{-1}(o) | c; Θ) |det ∂f^{-1}(o)/∂o|. Training evaluates z = f^{-1}(o) under the criterion; generation draws ẑ and outputs ô = f(ẑ). ☛ https://en.wikipedia.org/wiki/Probability_density_function#Dependent_variables_and_change_of_variables
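  The change-of-variables formula above can be checked numerically on the simplest possible flow, a 1-D affine map o = f(z) = a·z + b with z ~ N(0, 1) (the scale a and shift b here are arbitrary illustrative values, standing in for network-predicted parameters):

```python
import math

def gauss_pdf(x):
    """Standard normal density p_z(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

a, b = 2.0, 0.5          # illustrative affine flow parameters

def p_o(o):
    z = (o - b) / a                      # z = f^{-1}(o)
    return gauss_pdf(z) * abs(1.0 / a)   # |det d f^{-1}(o) / d o| = 1/|a|

# Sanity check: the transformed density still integrates to ~1
# (Riemann sum over [-10, 10), which covers nearly all the mass).
total = sum(p_o(-10.0 + 0.01 * i) * 0.01 for i in range(2000))
print(round(total, 3))
```

  Dropping the Jacobian factor 1/|a| would make the "density" integrate to |a| instead of 1, which is exactly the term the flow-based training criterion must account for.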
  • 35. PART II: NORMALIZING FLOW-BASED MODELS — General idea: map o to a Gaussian noise sequence z ~ N(0; I) through multiple transformations (a normalizing flow): p_o(o | c; Θ_1, ..., Θ_M) = p_z(z = f_1^{-1} ∘ ··· ∘ f_{M-1}^{-1} ∘ f_M^{-1}(o) | c) ∏_{m=1}^{M} |det J(f_m^{-1})|. Training runs the inverse chain f_M^{-1}, ..., f_1^{-1}; generation runs f_1, ..., f_M on sampled noise ẑ. D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530–1538, 2015.
  • 36. PART II: NORMALIZING FLOW-BASED MODELS — Flow-based waveform models differ in where f_m(·) operates: in the linear time domain (Parallel WaveNet, ClariNet) or in the subscale time domain (WaveGlow, FloWaveNet); the different time complexities of f_m(·) lead to different training strategies. A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017. W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018. S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018. R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
  • 37. 37 PART II: NORMALIZING FLOW-BASED MODELS ClariNet & parallel WaveNet q Generation process • A CNN predicts σ_{1:T} and μ_{1:T} from the noise ẑ^{(0)}_{1:T} and c_{1:N}; each flow step applies ẑ^{(1)}_t = ẑ^{(0)}_t · σ_t + μ_t for all t at once ü Parallel computation ü Fast generation * Initial condition may be implemented differently
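The per-step affine transform on this slide is what makes generation parallel: given the noise and the predicted μ_{1:T}, σ_{1:T}, every sample is produced at once. A minimal numpy sketch, with random μ and σ standing in for the CNN outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8

z0 = rng.standard_normal(T)               # input noise, one value per time step
mu = rng.standard_normal(T)               # shifts (stand-in for CNN output)
sigma = np.exp(rng.standard_normal(T))    # positive scales (stand-in for CNN output)

# One flow step: every time index is transformed simultaneously,
# z1[t] = z0[t] * sigma[t] + mu[t], with no sequential loop.
z1 = z0 * sigma + mu

# The inverse used during naive training recovers the noise exactly.
z0_recovered = (z1 - mu) / sigma
```

The catch, covered on the next slides, is that during training μ_t and σ_t depend on earlier inverted samples, so the inversion itself becomes sequential.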
  • 38. 38 PART II: NORMALIZING FLOW-BASED MODELS ClariNet & parallel WaveNet q Naïve training process • Inverting the flow, z^{(0)}_t = (z^{(1)}_t − μ_t)/σ_t, requires μ_t and σ_t computed from the already-inverted z^{(0)}_{1:t−1}, so the inversion is sequential • Training time ~ O(T)
  • 39. 39 PART II: NORMALIZING FLOW-BASED MODELS [comparison diagram: the original WaveNet trains in ~ O(1) (✓ parallel, via teacher forcing) but generates in ~ O(T); a normalizing flow with inverse-AR transforms [1] generates in ~ O(1) (✓ parallel) but its naïve training is ~ O(T)] [1] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, pages 4743–4751, 2016.
  • 40. 40 PART II: NORMALIZING FLOW-BASED MODELS ClariNet & parallel WaveNet q Fast training: knowledge distillation • Teacher (a pre-trained WaveNet) gives p(ô_t | ô_{1:t−1}, c_{1:T}, teacher) • Student gives p(ô_t | z_{1:T}, c_{1:T}, student) • The student learns to match the teacher's output distribution
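ClariNet makes the student-teacher matching tractable by using Gaussian output distributions, for which the KL divergence has a closed form. A sketch of that per-sample term (the `gaussian_kl` helper name is illustrative; the published method adds further regularization and frame-level losses):

```python
import math

def gaussian_kl(mu_s, sig_s, mu_t, sig_t):
    # Closed-form KL( N(mu_s, sig_s^2) || N(mu_t, sig_t^2) ): the per-sample
    # distillation term when both student and teacher predict Gaussians.
    return (math.log(sig_t / sig_s)
            + (sig_s ** 2 + (mu_s - mu_t) ** 2) / (2.0 * sig_t ** 2)
            - 0.5)

# A student matching the teacher exactly incurs zero loss ...
kl_same = gaussian_kl(0.0, 1.0, 0.0, 1.0)
# ... and any mismatch in mean or variance is penalized.
kl_diff = gaussian_kl(0.5, 1.0, 0.0, 1.0)
```

Because the KL is available in closed form, no sampling from the student is needed to evaluate the distillation loss.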
  • 41. 41 PART II: NORMALIZING FLOW-BASED MODELS q Parallel WaveNet & ClariNet ü No AR dependency in generation • But: need a pre-trained teacher WaveNet • Can we train without a teacher WaveNet? [roadmap diagram: autoregressive models (WaveNet, SampleRNN, WaveRNN, FFTNet, universal neural vocoder) | knowledge-based transform from speech/signal processing (LPCNet, LP-WaveNet, ExcitNet, GlotNet, Subband WaveNet) | normalizing flow / feature transformation (Parallel WaveNet, ClariNet)]
  • 42. WaveGlow q Why ~ O(T)? Dependency in the linear time domain q Reduce T? Dependency in the subscale time domain 42 PART II: NORMALIZING FLOW-BASED MODELS [diagram: training/generation dependency patterns of f_1^{-1} over linear time indices 1 ... T vs. subscale indices 1, 5, 9, 13, ...] R. Prenger, R. Valle, and B. Catanzaro. Waveglow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.
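The "reduce T" idea can be sketched as WaveGlow's squeeze step, which trades time length for channels (a toy version; WaveGlow folds groups of consecutive samples into channels before the flow, and the group size used here is an arbitrary choice):

```python
import numpy as np

def squeeze(x, group=4):
    # Fold every `group` consecutive samples into channels, shortening the
    # time axis from T to T // group; any remainder samples are dropped.
    T = x.shape[0]
    return x[: T - T % group].reshape(-1, group)

x = np.arange(16)
y = squeeze(x, group=4)   # 16 time steps become 4 steps of 4 channels
```

After squeezing, the flow operates over a shorter sequence, which shortens the sequential dependency chain discussed on this slide.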
  • 44. 44 PART II: NORMALIZING FLOW-BASED MODELS [roadmap diagram: autoregressive models (WaveNet, SampleRNN, WaveRNN, FFTNet, universal neural vocoder) | knowledge-based transform from speech/signal processing (LPCNet, LP-WaveNet, ExcitNet, GlotNet, Subband WaveNet) | normalizing flow / feature transformation (Parallel WaveNet, ClariNet, WaveGlow, FloWaveNet)] • A simple model without normalizing flow? ---> Neural source-filter model
  • 45. l Introduction: text-to-speech synthesis l Neural waveform models • Overview • Autoregressive models • Normalizing flow • STFT-based training criterion l Summary & software CONTENTS 45
  • 46. 46 PART III: STFT-BASED TRAINING CRITERION Neural source-filter model q Naïve model and STFT-based criterion [diagram: model conditioned on c_{1:N} produces a generated waveform over time steps 1 ... T] X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
  • 47. 47 PART III: STFT-BASED TRAINING CRITERION Neural source-filter model q Naïve model and STFT-based criterion [diagram: STFTs of the generated and natural waveforms are compared via a spectral amplitude error] X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
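The spectral amplitude error on this slide can be sketched with a hand-rolled STFT (a minimal illustration; the published model uses several STFT configurations and computes the distance inside the training graph, and all names here are hypothetical):

```python
import numpy as np

def stft_amplitude(x, frame_len=64, hop=16):
    # Frame the waveform, apply a Hann window, and take FFT magnitudes.
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    win = np.hanning(frame_len)
    return np.abs(np.fft.rfft(np.stack(frames) * win, axis=-1))

def spectral_amplitude_loss(generated, natural):
    # Mean squared error between log spectral amplitudes of two waveforms.
    a = stft_amplitude(generated)
    b = stft_amplitude(natural)
    return float(np.mean((np.log(a + 1e-7) - np.log(b + 1e-7)) ** 2))

t = np.arange(1024) / 16000.0
natural = np.sin(2 * np.pi * 220.0 * t)
loss_same = spectral_amplitude_loss(natural, natural)            # identical input
loss_diff = spectral_amplitude_loss(np.sin(2 * np.pi * 440.0 * t), natural)
```

Because the criterion is computed on spectral amplitudes rather than individual samples, the model can be trained without any autoregressive factorization.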
  • 48. 48 PART III: STFT-BASED TRAINING CRITERION Neural source-filter model q Naïve model and STFT-based criterion [diagram: F0 ---> sine & harmonics excitation ---> model conditioned on c_{1:N} ---> generated waveform; STFTs of the generated and natural waveforms are compared via a spectral amplitude error]
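The sine-plus-harmonics source shown in the diagram can be sketched directly from an F0 track (illustrative only; the actual NSF source module also adds Gaussian noise and treats voiced/unvoiced regions more carefully, and `harmonic_source` is a hypothetical helper name):

```python
import numpy as np

def harmonic_source(f0, sr=16000, n_harmonics=3, amp=0.1):
    # Sum of a sine at F0 and its harmonics; f0 is a per-sample F0 track
    # (0 marks unvoiced samples, which here simply produce silence).
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)        # instantaneous phase
    sig = sum(np.sin((h + 1) * phase) for h in range(n_harmonics))
    return amp * sig * (f0 > 0)

f0_track = np.full(1600, 200.0)   # 0.1 s of a flat 200 Hz pitch
excitation = harmonic_source(f0_track)
```

Using the cumulative sum of F0 for the phase keeps the excitation continuous even when the pitch track varies over time.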
  • 49. 49 PART III: STFT-BASED TRAINING CRITERION Neural source-filter model q Examples [spectrogram: the source excitation contains only harmonics, no formants]
  • 50.–54. PART III: STFT-BASED TRAINING CRITERION Neural source-filter model q Examples [slides 50–54: spectrogram examples of neural source-filter model output]
  • 55. 55 PART III: STFT-BASED TRAINING CRITERION Neural source-filter model q Results • Faster than WaveNet (at least 100 times faster generation) • Smaller than WaveNet • Speech quality is similar to WaveNet [MOS bar chart, copy-synthesis and text-to-speech: Natural, WORLD vocoder, WaveNet u-law, WaveNet Gaussian, Neural source-filter] https://nii-yamagishilab.github.io/samples-nsf/index.html
  • 56. l Introduction: text-to-speech synthesis l Neural waveform models l Summary & software CONTENTS 56
  • 57. 57 SUMMARY [taxonomy diagram: autoregressive models (WaveNet, WaveRNN, SampleRNN, FFTNet, universal neural vocoder) | normalizing flow (WaveGlow, FloWaveNet, Parallel WaveNet, ClariNet) | knowledge-based transform (LPCNet, LP-WaveNet, ExcitNet, GlotNet, Subband WaveNet) | STFT-based criterion (neural source-filter model, RNN + STFT loss)]
  • 58. 58 WaveNet WaveRNN SampleRNN FFTNet WaveGlow FloWaveNet Neural source- filter model Parallel WaveNet ClariNet RNN + STFT loss Model o1:T c1:N Criterion Model o1:T c1:N Criterion Transform Model o1:T c1:N Criterion Universal neural vocoder Model o1:T c1:N Criterion Knowledge-based Transform SOFTWARE NII neural network toolkit LPCNet LP-WaveNet ExcitNet GlotNet Subband WaveNet
  • 59. NII neural network toolkit q Neural waveform models 1. Toolkit cores: https://github.com/nii-yamagishilab/project-CURRENNT-public.git 2. Toolkit scripts: https://github.com/nii-yamagishilab/project-CURRENNT-scripts 59 SOFTWARE
  • 60. NII neural network toolkit q Useful slides • http://tonywangx.github.io/pdfs/CURRENNT_TUTORIAL.pdf • Other related slides: http://tonywangx.github.io/slides.html 60 SOFTWARE
  • 61. 61 REFERENCE WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016. WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. In Proc. ICML, pages 2410–2419, 2018. FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc. ICASSP, pages 2251–2255. IEEE, 2018. Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural vocoding. arXiv preprint arXiv:1811.06292, 2018. Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658. IEEE, 2018. Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML, pages 3918–3926, 2018. ClariNet: W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018. FloWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018. WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018. RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform model. In Proc. ICASSP (submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. arXiv preprint arXiv:1810.11946, 2018. LP-WaveNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. arXiv preprint arXiv:1811.11913, 2018. GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018. ExcitNet: E. Song, K. Byun, and H.-G. Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. arXiv preprint arXiv:1811.04769, 2018. LPCNet: J.-M. Valin and J. Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. arXiv preprint arXiv:1810.11846, 2018.
  • 62. End of Part 1 62 Codes, scripts, slides http://nii-yamagishilab.github.io This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan and by MEXT KAKENHI Grant Numbers (16H06302, 16K16096, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.
  • 63. Text-to-speech (TTS) q Speech samples from NII’s TTS system 63 INTRODUCTION Text Waveform Text Waveform NII TTS system 2013 2016 2018 Year
  • 64. 64 PART I: AUTOREGRESSIVE MODELS SampleRNN q Idea • Hierarchical / multi-resolution dependency S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
  • 65. 65 PART I: AUTOREGRESSIVE MODELS SampleRNN q Example network structure • R: time resolution increased by ×2 per tier Tier 1 Tier 2 Tier 3
  • 66. 66 PART I: AUTOREGRESSIVE MODELS SampleRNN q Example network structure • Training • R: time resolution increased by ×2 per tier [diagram: Tier 1 (R = 1, sample level), Tier 2 (R = 2), Tier 3 (R = 4) over time steps 1 ... T]
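The tier structure can be sketched by how conditioning is upsampled between tiers (a toy version; SampleRNN uses learned upsampling layers rather than plain repetition, and the tier sizes here are arbitrary):

```python
import numpy as np

def upsample_conditioning(h, ratio):
    # A higher (slower) tier runs once every `ratio` samples; its output is
    # repeated so the next tier receives one conditioning vector per step.
    return np.repeat(h, ratio, axis=0)

tier3 = np.arange(4).reshape(4, 1)          # 4 coarse steps (R = 4)
tier2 = upsample_conditioning(tier3, 2)     # 8 steps (R = 2)
tier1 = upsample_conditioning(tier2, 2)     # 16 steps (R = 1, sample level)
```

Each tier therefore models a different time scale, and only the lowest tier operates sample by sample.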
  • 67. 67 PART I: AUTOREGRESSIVE MODELS SampleRNN q Example network structure • Generation process [diagram: Tier 3 (R = 4) and Tier 2 (R = 2) condition Tier 1 (R = 1), which generates one sample at a time over steps 1 ... T]
  • 68. 68 PART I: AUTOREGRESSIVE MODELS WaveNet q Variants • u-law discrete waveform ---> continuous-valued waveform § Mixture of logistic distributions [1] § GMM / single Gaussian [2] • Quantization noise shaping [3], related noise shaping method [4] [diagram: softmax over discrete levels vs. GMM/Gaussian density over continuous values] [1] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017. [2] W. Ping, K. Peng, and J. Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018. [3] T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda. Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(7):1173–1180, 2018. [4] K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation. In Proc. ICASSP, pages 5664–5668. IEEE, 2018.
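The u-law companding mentioned above maps a continuous waveform onto 256 discrete classes for the softmax output. A self-contained sketch using the standard u-law formulas (helper names are illustrative):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # Compand a waveform in [-1, 1] to 256 discrete levels (8 bits), as used
    # by the original WaveNet's softmax output.
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)   # class indices 0 .. 255

def mu_law_decode(idx, mu=255):
    y = 2.0 * (idx.astype(np.float64) / mu) - 1.0
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-0.9, 0.9, 100)
x_hat = mu_law_decode(mu_law_encode(x))
max_err = float(np.max(np.abs(x - x_hat)))   # quantization noise stays small
```

The logarithmic companding allocates more quantization levels to small amplitudes, which is why 256 classes suffice perceptually even though the raw signal is 16-bit.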
  • 69. 69 PART I: AUTOREGRESSIVE MODELS WaveRNN q WaveNet is inefficient • Computation cost 1. Impractical for 16-bit PCM (softmax of size 65536) 2. Very deep network (50 dilated CNN layers ...) 3. ... • Time latency 1. Generation time ~ O(waveform_length)
  • 70. 70 PART I: AUTOREGRESSIVE MODELS WaveRNN q WaveRNN strategies • Computation cost 1. Impractical for 16-bit PCM (softmax of size 65536) ---> two-level softmax 2. Very deep network (50 dilated CNN layers ...) ---> recurrent + feedforward layers 3. ... • Time latency 1. Generation time ~ O(waveform_length) ---> subscale dependency + batched generation N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu. Efficient neural audio synthesis. In Proc. ICML, pages 2410–2419, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR.
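The two-level softmax can be sketched as splitting each 16-bit sample into coarse and fine 8-bit halves, so each softmax covers 256 classes instead of 65536 (helper names are illustrative):

```python
import numpy as np

def split_coarse_fine(sample16):
    # Shift a signed 16-bit sample to unsigned 0..65535, then split it into
    # a coarse byte (high 8 bits) and a fine byte (low 8 bits).
    u = sample16.astype(np.int64) + 32768
    return u // 256, u % 256

def merge_coarse_fine(coarse, fine):
    # Reassemble the original signed 16-bit sample.
    return coarse * 256 + fine - 32768

s = np.array([-32768, -1, 0, 12345, 32767])
coarse, fine = split_coarse_fine(s)
restored = merge_coarse_fine(coarse, fine)
```

In WaveRNN the fine part is predicted conditioned on the coarse part, so no information is lost relative to a single 65536-way softmax.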