エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群～ TacotronとWaveNetのチュートリアル (Part 1)～

NII
Tacotron WaveNet
1contact: wangxin@nii.ac.jp
we welcome critical comments, suggestions, and discussion
Xin WANG, Yusuke YASUDA
National Institute of Informatics, Japan
2019-01-27

NII
Part 1: WaveNet ssddddd
2contact: wangxin@nii.ac.jp
we welcome cri4cal comments, sugges4ons, and discussion
Xin WANG
National Institute of Informatics, Japan
2019-01-27
Neural waveform models

l
l
l
l PhD (2015-2018)
l M.A (2012-2015)
l h0p://tonywangx.github.io
SELF-INTRODUCTION
3

l Introduction: text-to-speech synthesis
l Neural waveform models
l Summary & software
CO N TEN TS
4

Text-to-speech synthesis (TTS) [1,2]
NII’s TTS systems
5
INTRODUCTION
[1] P. Taylor. Text-to-Speech Synthesis. Cambridge University Press, 2009.
[2] T. Dutoit. An Introduction to Text-to-speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
Text
Text-to-speech
(TTS)
Speech
waveform
Text Waveform Text Waveform
NII
TTS system
2013 2016 2018 Year

6
INTRODUCTION
Text-to-speech synthesis (TTS)
q Architectures
text text-to-speech speechText
Text
analyzer
Acous6c
model(s)
Linguistic
features
Waveform
Waveform
model
Acoustic
features
…
*
| *
||
Linguistic features Acoustic features
Spectral features,F0,
Band-aperiodicity, etc.

7
INTRODUCTION
q Architectures
TTS backend model
Text
analyzer
Acoustic
model(s)
Linguistic
features
Waveform
Waveform
model
Acoustic
features
Text
Text
analyzer
Linguistic
features
Waveform
End-to-end TTS modelText Waveform

8
INTRODUCTION
q Architectures
TTS backend model sss
Text
analyzer
Acoustic
model(s)
Linguistic
features
Waveform
Waveform
model
Text
Text
analyzer
Linguistic
features
Waveform
End-to-end TTS modelText Waveform
Waveform
module
Waveform
module
Tutorial part 1: neural waveform modeling
Tutorial part 2: end-to-end TTS
Acoustic
features

l Neural waveform modeling
• Overview
• Autoregressive models
• Normalizing flow
• STFT-based training criterion
l Summary
CONTENTS
9

10
OVERVIEW
Task definition
Spectrogram

11
OVERVIEW
Task definition
…
c1 c2 cNc3
Waveform
values
Spectrogram
1 2 3 4 T…
o1 o2 o3 o4 oT
T waveform sampling points
N frames

Neural waveform models
12
OVERVIEW
Task definition
1 2 3 4 T…
o1 o2 o3 o4 oT
c1
…
c2 cNc3
p(o1:T |c1:N ; ⇥)
Waveform
values

13
OVERVIEW
1 2 3 4 T
…
c1
…
c2 cNc3
Naïve model
q Simple network + maximum-likelihood
• Feedforward network
Hidden
layers
Distribution
p(o4|c1:N ; ⇥)
…Up-sampling
Waveform
values

14
OVERVIEW
1 2 3 4 T…
c1
…
c2 cNc3
Naïve model
• RNN
…Up-sampling
Hidden
layers
Distribution
Waveform
values

15
OVERVIEW
1 2 3 4 T…
c1
…
c2 cNc3
Naïve model
• Dilated CNN [1,2]
…
Hidden
layers
Distribution
Waveform
values
Up-sampling
[1] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[2] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions
on Acoustics, Speech, and Signal Processing, 37(3):328–339, 1989.

16
OVERVIEW
1 2 3 4 T…
…
Naïve model
…
c1 c2 cNc3
o1 o2 o3 o4 oT
p(o1:T |c1:N ; ⇥) = p(o1|c1:N ; ⇥)· · ·p(oT |c1:N ; ⇥) =
TY
t=1
p(ot|c1:N ; ⇥)
o1:TWeak temporal model <---------------> strongly correlated data
mismatch
Hidden
layers
Distribution
Waveform
values
Up-sampling

17
OVERVIEW
Towards better models
Naïve
model
o1:T
c1:N
Criterion
Better
model
o1:T
c1:N
Criterion
Better model Easier target
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Better criterion
Weak temporal model <---------------> strongly correlated data
mismatch
Combination
Better
Model
o1:T
c1:N
Criterion
Knowledge-
based
Transform

18
OVERVIEW
WaveNet
WaveRNN
SampleRNN
FFTNet
WaveGlow
FloWaveNet Neural source-
filter model
Parallel WaveNet
ClariNet RNN + STFT loss
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Autoregressive models
Normlizing flow / feature transformation Training criterion in frequency domain
Universal neural vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
Speech/signal processing
☛ Please check reference in the appendix
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet

• Overview
CONTENTS
19

20
PART I: AUTOREGRESSIVE MODELS
Core idea
q Naïve model
…
…
c1 c2 cNc3
p(o1:T |c1:N ; ⇥) =
TY
t=1
p(ot|c1:N ; ⇥)
1 2 3 4 T…
o1 o2 o3 o4 oT

21
Core idea
q Autoregressive (AR) model
1 2 3 4 T…
…
…
c1 c2 cNc3
o1 o2 o3 o4 oT
1 2 3
p(o1:T |c1:N ; ⇥) =
TY
t=1
p(ot|o1:t 1, c1:N ; ⇥)

22
Core idea
• Training: use natural waveform for feedback (teacher forcing [1])
1 2 3 4 T…
…
…
c1 c2 cNc3
o1 o2 o3 o4 oT
1 2 3
[1] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.

23
Core idea
1 2 3 4 T…
…
…
c1 c2 cNc3
o1 o2 o3 o4 oT
1 2 3
Natural
waveform

24
Core idea
• Generation:
…
…
c1 c2 cNc3
1 2
1
bo1
2
bo2
3
bo3
4
3
bo4
T…
boT
p(bo1:T |c1:N ; ⇥) =
QT
t=1 p(bot|bot P :t 1, c1:N ; ⇥)
Generated
waveform

LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet
SampleRNN
25
WaveRNN FFTNet
WaveGlow
ﬁlter model
Parallel WaveNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
WaveNet Autoregressive models

WaveNet
q Network structure
• Details: http://id.nii.ac.jp/1001/00185683/
http://tonywangx.github.io/pdfs/wavenet.pdf
26
1 2 3 4 T
1 2 3
c1 c2 cNc3
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu.
WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

SampleRNN
27
WaveRNN FFTNet
LPCNet
LP-WaveNet
ExcitNet
GlotNet
WaveGlow
filter model
Parallel WaveNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Subband WaveNet Universal neural vocoder
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
☛ Please check SampleRNN in the appendix
q Issues with WaveNet & Sample RNN
Generation is very very … very slow
q Acceleration?
• Engineering-based: parallelize/simplify computation
• Knowledge-based: speech / signal processing theory

28
WaveRNN
q Linear time AR dependency in WaveNet
q Subscale AR dependency + batch sampling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Waveform
values
16 steps to generate

29
WaveRNN
• Generate multiple samples at the same time
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
step1
step2
step3
…

30
WaveRNN
• Generate multiple samples at the same time
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
step1
step2
step3
…

SampleRNN
31
WaveRNN FFTNet
WaveGlow
ﬁlter model
Parallel WaveNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
q AR models
Generation is still slow
p(o1:T |c1:N ; ⇥) =
TY
t=1
p(ot|o1:t 1, c1:N ; ⇥)
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet

SampleRNN
32
WaveRNN FFTNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Normlizing flow / feature transformation
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
Non-autoregressive model?
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet

• Overview
CONTENTS
33

34
PART II: NORMALIZING FLOW-BASED MODELS
General idea
• o with strong temporal correla3on ---> z with weak temporal correla3on
• Principle of changing random variable
Model s
Criterion
⇥
o
f 1
(o)
c
Model s⇥
c
bo
po(o|c; ⇥, ) = pz(z = f 1
(o)|c; ⇥) det
@f 1
(o)
@o
bz
f (bz)
z
training generation
☛ https://en.wikipedia.org/wiki/Probability_density_function#Dependent_variables_and_change_of_variables

35
PART II: NO RM ALIZIN G FLOW -BASED
MODELSGeneral idea
• o ---> Gaussian noise sequence z
• Multiple transformations: normalizing flow
o
c
bo
…
f 1
M
(·)
N(o; I)
Criterion
c
…
N(o; I)
bz
z
po(o|c; 1, · · · , M ) = pz(z = f 1
1
· · · f 1
M 1
f 1
M
(o)|c)
MY
m=1
det J(f 1
m
)
f 1
1
(·)
f 1
M 1
(·)
f M
(·)
f M 1
(·)
f 1
(·)
training generation
D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530–1538, 2015.

SampleRNN
36
WaveRNN FFTNetLPCNet
LP-WaveNet
ExcitNet
GlotNet
Model
o1:T
c1:N
Criterion
Subband WaveNet
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
Speech production model
Model
o1:T
c1:N
Criterion
Transform
po(o|c; 1, · · · , M ) = pz(z = f 1
1
· · · f 1
M 1
f 1
M
(o)|c)
MY
m=1
det J(f 1
m
)
WaveGlow
FloWaveNet
Parallel WaveNet
ClariNet f m
(·) in linear time domain
f m
(·) in subscale time domain
f m
(·)• Different time complexity , different training strategies
A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, et al. Parallel WaveNet:
Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
W. Ping, K. Peng, and J. Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
S. Kim, S.-g. Lee, J. Song, and S. Yoon. Flowavenet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
R. Prenger, R. Valle, and B. Catanzaro. Waveglow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002, 2018.

37
ClariNet & parallel WaveNet
q Genera(on process
…
N(o; I)
c1:N
bo1:T
f M
(·)
f M 1
(·)
f 1
(·) CNN 1
(·)
1 2 3 4 T
1 2 3 4 T
1:T
µ1:T
bz
(0)
1:T
1 2 3 4 T
ü Parallel computation
ü Fast generation
* Initial condition may be implemented differently
1 2 3 4 T
bz
(0)
1:T
bz
(1)
1:T
bz
(1)
t = bz
(0)
t · t + µt
bz
(1)
1:T

38
q Naïve training process
…
N(o; I)
c1:N
1 2 3 4 T
1
1
2
2
3
3
4
4
T
T
CNN 1
(·)
1 2 3 4
1:T
µ1:T
T
f 1
M
(·)
f 1
1
(·)
f 1
M 1
(·)
o1:T
z
(1)
1:T
z
(0)
1:T
Training 7me ~ O(T)
z
(0)
t = (z
(1)
t µt)/ t
z
(1)
1:T
z
(0)
1:T

39
o
c
…
f 1
M
(·)
N(o; I)
Criterion
z
f 1
1
(·)
f 1
M 1
(·)
Training Generation
(Inverse-AR [1])
Normalizing
flow
bo
c
…
N(o; I)
bz
f M
(·)
f M 1
(·)
f 1
(·)
Original
WaveNet
1 2 3 4
1 2 3
1 2 3 4
1 2 3
~ O(T)
[1] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow.
In Proc. NIPS, pages 4743–4751, 2016.
✓ ~ O(1)
✓ ~ O(1) ~ O(T)

40
q Fast training : knowledge distilling
• Teacher gives
• Student gives
…
N(o; I)
f M
(·)
f M 1
(·)
f 1
(·)
c1:N
bo1:T
1 2 3
p(bot|z1:T , c1:T , student)
p(bot|bo1:t 1, c1:T , teacher)
Student learns from teacher
Teacher
Pre-trained WaveNet
Student

SampleRNN
41
WaveRNN FFTNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
Parallel WaveNet
ClariNet
q Parallel WaveNet & ClariNet
ü No AR dependency in generation
Need pre-trained WaveNet
Without teacher WaveNet ?
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet

WaveGlow
q Why ~ O(T) ? Dependency in linear time domain
q Reduce T? Dependency in subscale time domain
42
f 1
1
(·)
Generation Training
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1 2 3 4 T
1 2 3 4
Generation
1 2 3 4 T
1 2 3 4
Training

WaveGlow
q Why ~ O(T) ? Dependency in linear time domain
q Reduce T? Dependency in subscale time domain
43
f 1
1
(·)
Generation Training
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1
2
3
4
5 9 13
1 2 3 4 T
1 2 3 4
Generation
1 2 3 4 T
1 2 3 4
Training
1
2
3
4
5 9 13
1 2 3 4

SampleRNN
44
WaveRNN FFTNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
Parallel WaveNet
ClariNet
WaveGlow
FloWaveNet
?Simple model without normalizing flow
Neural source-
filter model
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet

• Overview
CONTENTS
45

46
PART III: STFT-BASED TRAINING CRITERION
1 2 3 4 T…
c1
…
c2 cNc3
Neural source-filter model
q Naïve model and STFT-based criterion
…
Generated
waveforms
X. Wang, S. Takaki, and J. Yamagishi. Neural source-ﬁlter-based waveform model for statistical parametric speech synthesis.
arXiv preprint arXiv:1810.11946, 2018.

47
1 2 3 4 T…
c1
…
c2 cNc3
…
Generated
waveforms
Natural
waveforms
STFT
STFT
Spectral amplitude error
1 2 3 4 T
X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis.
arXiv preprint arXiv:1810.11946, 2018.

48
PART III: STFT-BASED TRAIN IN G CRITERIO N
1 2 3 4 T…
c1
…
c2 cNc3
Neural source-ﬁlter model
…
Generated
waveforms
1 2 3 4 T
Natural
waveforms
STFT
STFT
Spectral amplitude error
1 2 3 4 T
Sine &
harmonics F0

49
q Examples
Only harmonics
No formants

50
PART III: STFT-BASED TRAIN IN G CRITERIO N
q Examples

51
q Examples

52
q Examples

53
q Examples

54
q Examples

55
Neural source-ﬁlter model
q Results
• Faster than WaveNet (at least 100 2mes)
• Smaller than WaveNet
• Speech quality is similar to WaveNet
Natural WOR WAD WAC NSF
1
2
3
4
5
Quality(MOS)
WORLD
vocoder
WaveNet
u-law
WaveNet
Gaussian
Neural
source-filter
Copy-synthesis
Text-to-speech
https://nii-yamagishilab.github.io/samples-nsf/index.html

CO N TEN TS
56

57
WaveNet
WaveRNN
SampleRNN
FFTNet
WaveGlow
filter model
Parallel WaveNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
SU M M ARY
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet

58
WaveNet
WaveRNN
SampleRNN
FFTNet
WaveGlow
filter model
Parallel WaveNet
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Transform
Model
o1:T
c1:N
Criterion
Model
o1:T
c1:N
Criterion
Knowledge-based
Transform
SOFTWARE
NII neural network toolkit
LPCNet
LP-WaveNet
ExcitNet
GlotNet
Subband WaveNet

q Neural waveform models
1. Toolkit cores: https://github.com/nii-yamagishilab/project-CURRENNT-public.git
2. Toolkit scripts: https://github.com/nii-yamagishilab/project-CURRENNT-scripts
59
SOFTWARE

q Useful slides
• http://tonywangx.github.io/pdfs/CURRENNT_TUTORIAL.pdf
• Other related slides http://tonywangx.github.io/slides.html
60
SO FTW ARE

61
REFERENCE
WaveNet: A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K.
Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
SampleRNN: S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. Samplernn: An
unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
WaveRNN: N. Kalchbrenner, E. Elsen, K. Simonyan, et.al. Efficient neural audio synthesis. In J. Dy and A. Krause, editors,
Proc. ICML, volume 80 of Proceedings of Machine Learning Research, pages 2410–2419, 10–15 Jul 2018.
FFTNet: Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu. FFTNet: A real-time speaker-dependent neural vocoder. In Proc.
ICASSP, pages 2251–2255. IEEE, 2018.
Universal vocoder: J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, and R. Barra-Chicote. Robust universal neural
vocoding. arXiv preprint arXiv:1811.06292, 2018.
Subband WaveNet: T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband wavenet vocoder
covering entire audible frequency range with limited acoustic features. In Proc. ICASSP, pages 5654–5658. 2018.
Parallel WaveNet: A. van den Oord, Y. Li, I. Babuschkin, et. al.. Parallel WaveNet: Fast high-fidelity speech synthesis. In Proc. ICML,
pages 3918–3926, 2018.
ClariNet: W. Ping, K. Peng, and J. Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint
arXiv:1807.07281, 2018.
FlowWaveNet: S. Kim, S.-g. Lee, J. Song, and S. Yoon. Flowavenet: A generative flow for raw audio. arXiv preprint
arXiv:1811.02155, 2018.
WaveGlow: R. Prenger, R. Valle, and B. Catanzaro. Waveglow: A flow-based generative network for speech synthesis. arXiv
preprint arXiv:1811.00002, 2018.
RNN+STFT: S. Takaki, T. Nakashika, X. Wang, and J. Yamagishi. STFT spectral loss for training a neural speech waveform
model. In Proc. ICASSP (submitted), 2018.
NSF: X. Wang, S. Takaki, and J. Yamagishi. Neural source-filter-based waveform model for statistical parametric
speech synthesis. arXiv preprint arXiv:1810.11946, 2018.
LP-WavNet: M.-J. Hwang, F. Soong, F. Xie, X. Wang, and H.-G. Kang. Lp-wavenet: Linear prediction-based wavenet speech
synthesis. arXiv preprint arXiv:1811.11913, 2018.
GlotNet: L. Juvela, V. Tsiaras, B. Bollepalli, M. Airaksinen, J. Yamagishi, and P. Alku. Speaker-independent raw waveform
model for glottal excitation. arXiv preprint arXiv:1804.09593, 2018.
ExcitNet: E. Song, K. Byun, and H.-G. Kang. Excitnet vocoder: A neural excitation model for parametric speech synthesis
systems. arXiv preprint arXiv:1811.04769, 2018.
LPCNet: J.-M. Valin and J. Skoglund. Lpcnet: Improving neural speech synthesis through linear prediction. arXiv preprint
arXiv:1810.11846, 2018.

End of Part 1
62
Codes, scripts, slides
http://nii-yamagishilab.github.io
This work was partially supported by JST CREST Grant Number
JPMJCR18A6, Japan and by MEXT KAKENHI Grant Numbers (16H06302,
16K16096, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.

Text-to-speech (TTS)
q Speech samples from NII’s TTS system
63
INTRODUCTION
Text Waveform Text Waveform
NII
TTS system
2013 2016 2018 Year

64
SampleRNN
q Idea
• Hierarchical / multi-resolution dependency
S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. Samplernn: An unconditional end-to-end
neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.

65
SampleRNN
q Example network structure
• R: time resolution increased by * 2
Tie 1
Tie 2
Tie 3

66
SampleRNN
• Training
• R: 8me resolu8on increased by * 2
1 2 3 4 5 6 7 8 9 T
Tie 1:
R = 1
Tie 2:
R = 2
Tie 3:
R = 4
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9

67
SampleRNN
• Generation process
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
Tie 1:
R = 1
Tie 2:
R = 2
Tie 3:
R = 4
1 2
1
3
2
4
3
5
4
6
5
7
6
8
7
9
8 9
T

68
WaveNet
q Variants
• u-Law discrete waveform ---> continuous-valued waveform
§ Mixture of logistic distribution [1]
§ GMM / Single-Gaussian [2]
• Quantization noise shaping [3], related noise shaping method [4]
[1] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[2] W. Ping, K. Peng, and J. Chen. Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018.
[3] T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda. Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 26(7):1173–1180, 2018.
[4] K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation. In Proc. ICASSP, pages 5664–5668. IEEE, 2018.
1 2
1 2 3 4
3
softmax
1 2
1 2 3 4
3
GMM/Gaussian

69
WaveRNN
q WaveNet is ineﬃcient
• Computa4on cost
1. Imprac4cal for 16 bit PCM (so?max of size 65536)
2. Very deep network (50 dilated CNN …)
3. …
• Time latency
1. Genera4on 4me ~ O(waveform_length)
1 2 3 4 T
1 2 3

70
WaveRNN
q WaveRNN strategies
• Computation cost
1. Impractical for 16 bit PCM (softmax of size 65536)
2. Very deep network (50 dilated CNN …)
3. ...
• Time latency
1. Generation time ~ O(waveform_length)
1 2 3 4 T
1 2 3
Recurrent + feedfoward layers
N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K.
Kavukcuoglu. Efﬁcient neural audio synthesis. In J. Dy and A. Krause, editors, Proc. ICML, volume 80 of Proceedings of Machine Learning
Research, pages 2410–2419, Stock- holmsm ̈assan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
Two-level softmax
RNN + Feedforward
Subscale dependency + batch

エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群～ TacotronとWaveNetのチュートリアル (Part 1)～

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群～ TacotronとWaveNetのチュートリアル (Part 1)～

Similar to エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群～ TacotronとWaveNetのチュートリアル (Part 1)～ (20)

More from Yamagishi Laboratory, National Institute of Informatics, Japan

More from Yamagishi Laboratory, National Institute of Informatics, Japan (11)

Recently uploaded

Recently uploaded (20)