12. Things not described and Guesses
• Kernel size of the dilated filters: 2
• Number of layers (ResNet blocks): 4×10 to 6×10
• Number of channels in the hidden layers: hundreds? 256?
• Another activation function inside a residual block? Maybe not
• Batch normalization: no reason not to use it
• Sampling frequency: "at least 16 kHz"
• Where do the skip connections branch out? Every 10 layers?
• Do the skip connections have weights? Yes?
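The guesses about kernel size and layer count can be sanity-checked against the receptive field. The sketch below is only a rough check: it assumes each 10-layer block doubles its dilation from 1 to 512 (a pattern not stated on this slide), a kernel size of 2, and a 16 kHz sampling rate. The function names are hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of a stack of dilated causal convolutions."""
    # Each layer widens the field by (kernel_size - 1) * dilation samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed dilation pattern for one 10-layer block: 1, 2, 4, ..., 512.
BLOCK = [2 ** i for i in range(10)]


def receptive_field_ms(num_blocks, kernel_size=2, sample_rate=16000):
    """Return (samples, milliseconds) for num_blocks stacked blocks."""
    samples = receptive_field(kernel_size, BLOCK * num_blocks)
    return samples, 1000.0 * samples / sample_rate


for n in (4, 5, 6):
    samples, ms = receptive_field_ms(n)
    print(f"{n} blocks ({10 * n} layers): {samples} samples, {ms:.1f} ms")
```

Under these assumptions, 4 blocks cover 4093 samples (about 256 ms), which is in the same range as the 240 ms receptive field quoted for the TTS experiments.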
14. Text-to-Speech (TTS)
• Single-speaker speech dataset
• North American English dataset: 24.6 hours
• Mandarin Chinese dataset: 34.8 hours
• Receptive field: 240 ms
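• Ad hoc architecture (diagram):
[Diagram: linguistic features h_i (possibly phonemes) pass through "another model" and "yet another model" to give the duration duration(t), the fundamental frequency F0(t), and the time-aligned linguistic feature h(t); these are fed to WaveNet, which outputs Audio(t).]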
※ The symbols used here differ from those in the paper.
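One step in this pipeline is turning the per-unit features h_i into a per-sample sequence h(t) once durations are known. The slide does not say how this is done. The sketch below assumes the simplest option, repeating each h_i for the number of samples its duration covers. The function name, the phoneme labels, and the durations are all hypothetical.

```python
def upsample_features(features, durations_s, sample_rate=16000):
    """Repeat each per-unit feature h_i for the samples its duration covers."""
    h_t = []
    for h_i, duration in zip(features, durations_s):
        n_samples = round(duration * sample_rate)
        h_t.extend([h_i] * n_samples)
    return h_t


# Hypothetical phoneme-level features and predicted durations (seconds).
phonemes = ["HH", "AH", "L", "OW"]
durations = [0.05, 0.08, 0.06, 0.12]

h_t = upsample_features(phonemes, durations)
print(len(h_t), "conditioning frames, one per audio sample")
```

In a real system h_i would be a feature vector rather than a label, and F0(t) would be appended to h(t) in the same per-sample layout.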
15. TTS: Mean Opinion Score
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
16. Speech Recognition
• TIMIT dataset (possibly ~4 hours)
• Add a pooling layer after the dilated convolutions
• with 160× downsampling (does this mean the 7th layer?)
• Then a few non-causal convolutions
• One loss to predict the next sample (same as the ordinary WaveNet)
• And another loss to classify the frame
• 18.8 PER, the best score among raw-audio models
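The 160× downsampling can be read as frame pooling, as the sketch below illustrates. It assumes mean pooling over non-overlapping windows and TIMIT's 16 kHz sampling rate; the slide does not say which kind of pooling is used. The function name and the dummy input are hypothetical.

```python
def mean_pool(activations, window=160):
    """Average non-overlapping windows; a trailing partial window is dropped."""
    n_frames = len(activations) // window
    return [
        sum(activations[i * window:(i + 1) * window]) / window
        for i in range(n_frames)
    ]


sample_rate = 16000
# One second of dummy per-sample activations (a single channel).
one_second = [float(i % 160) for i in range(sample_rate)]

frames = mean_pool(one_second)
frame_ms = 1000.0 * 160 / sample_rate
print(len(frames), "frames per second, each spanning", frame_ms, "ms")
```

Under these assumptions, one second of audio becomes 100 frames of 10 ms each. Note that 160 is not a power of two (2^7 = 128), so the factor matches a frame length more naturally than a particular dilation layer.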