WaveNet

Tsuguo Mogami

An introduction to WaveNet, presented at the TIS-Albert seminar.

WaveNet: A Generative
Model for Raw Audio
TIS + Albert study session
2017/01/24
Tsuguo Mogami (最上嗣生)
tsuguo_mogami@albert2005.co.jp

Why?
• Autoregressive models (e.g. PixelCNN) have been very successful.
• → What about audio?
• We want to do this with a CNN, which is more efficient than an RNN.

Contributions
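• Speech synthesis of unprecedented quality.
• An architecture that uses dilated convolutions to stay efficient despite having a large receptive field.
• (Speech recognition as well)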

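What is dilated convolution?
https://github.com/vdumoulin/conv_arithmetic
Roughly speaking, when you would really like to use a filter with a large kernel size, a dilated convolution gives a result close to that of the large kernel without increasing the amount of computation.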
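As a minimal sketch of the idea above (not code from the paper), the NumPy function below applies a 1-D causal convolution whose taps are spaced `dilation` samples apart. The function name and the toy signal are assumptions made for illustration. A 2-tap filter with dilation 2 covers the same span as a dense 3-tap filter whose middle weight is zero, while using only two multiplications per output sample.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """Causal 1-D convolution: y[t] = sum_j w[j] * x[t - (K - 1 - j) * dilation].

    Samples before the start of the signal are treated as zero.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    pad = (k - 1) * dilation                  # how far back the filter reaches
    xp = np.concatenate([np.zeros(pad), x])   # left padding keeps the output causal
    y = np.zeros(len(x))
    for j in range(k):
        # Tap j looks (k - 1 - j) * dilation samples into the past.
        y += w[j] * xp[j * dilation : j * dilation + len(x)]
    return y

x = np.arange(1.0, 9.0)                                   # toy signal 1..8
sparse = dilated_causal_conv1d(x, [1.0, 1.0], dilation=2)      # 2 taps, span 3
dense = dilated_causal_conv1d(x, [1.0, 0.0, 1.0], dilation=1)  # 3 taps, span 3
```

Here `sparse` and `dense` are identical, which is the sense in which the dilated filter stands in for a larger kernel.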

Stack of dilated causal convolutional layers
This is a conceptual diagram of how the receptive field grows; the actual network is a repetition of ResNet-style blocks.

Repetition Structure
1,2,4,…,512, 1,2,4,…,512, 1,2,4,…,512.
The 1…512 block is suspected to be repeated 16 times.
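To get a feel for how the repetition affects the receptive field, here is a small sketch. It assumes a kernel size of 2 (a guess, not a figure stated in the paper) and uses the standard formula for stacked causal convolutions: receptive field = 1 + (kernel_size - 1) × (sum of dilations). The helper names and the 16 kHz default are illustrative.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field, in samples, of a stack of dilated causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

# One block of dilations 1, 2, 4, ..., 512 (ten layers).
block = [2 ** i for i in range(10)]

def stacked_receptive_field(n_repeats, kernel_size=2):
    """Receptive field when the 1...512 block is repeated n_repeats times."""
    return receptive_field(block * n_repeats, kernel_size)

def to_ms(n_samples, sample_rate_hz=16000):
    """Convert a length in samples to milliseconds."""
    return 1000.0 * n_samples / sample_rate_hz

one_block = receptive_field(block)            # 1024 samples
three_blocks = stacked_receptive_field(3)     # 3070 samples
```

With these assumptions each additional block adds 1023 samples, so the receptive field grows only linearly with the number of repetitions even though it grows exponentially within a block.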

Autoregression
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
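Autoregression here means that the joint probability of the waveform is factorized as p(x) = ∏_t p(x_t | x_1, …, x_{t-1}), and each sample is drawn after all previous ones. The loop below is only a sketch of that sampling procedure: `toy_predict_probs` is a hypothetical stand-in for the network, not WaveNet itself, and the 256 output classes follow the quantization described in these slides.

```python
import numpy as np

def generate(predict_probs, n_samples, n_classes=256, seed=0):
    """Draw samples one at a time, each conditioned on everything drawn so far."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        probs = predict_probs(samples)   # p(x_t | x_1, ..., x_{t-1})
        samples.append(int(rng.choice(n_classes, p=probs)))
    return samples

def toy_predict_probs(history, n_classes=256):
    """Hypothetical placeholder model: favors values close to the previous sample."""
    prev = history[-1] if history else n_classes // 2
    logits = -0.05 * (np.arange(n_classes) - prev) ** 2
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

waveform_codes = generate(toy_predict_probs, n_samples=20)
```

Because every new sample needs the previous ones, generation is inherently sequential, whereas training can evaluate all time steps in parallel since the true past is known.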

Residual block and the entire architecture
The figure is a little hard to follow, so I redraw it in conventional notation.

Gated activation units
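• $z = \tanh(W_{f,k} * x + V_{f,k}^{\top} h) \odot \sigma(W_{g,k} * x + V_{g,k}^{\top} h)$
• k: layer index; f for filter, g for gate
• $\odot$: elementwise multiplication
• h: condition (person, text, etc.)
• Why?
• Introduced in PixelCNN (arXiv:1606.05328)
• The authors attributed the inferiority of earlier CNN generative models to PixelRNN to the LSTM's gate structure, and so introduced an LSTM-like gate.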
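The gated unit can be sketched in a few lines of NumPy. This is an illustration rather than the paper's implementation: the arguments `filter_pre` and `gate_pre` are assumed to be the outputs of the filter and gate convolutions, and `cond_filter` / `cond_gate` are assumed to be the already-projected conditioning terms (zero when there is no conditioning).

```python
import numpy as np

def sigmoid(a):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def gated_activation(filter_pre, gate_pre, cond_filter=0.0, cond_gate=0.0):
    """tanh(filter) * sigmoid(gate), with optional additive conditioning terms."""
    return np.tanh(filter_pre + cond_filter) * sigmoid(gate_pre + cond_gate)

f = np.array([1.0, 1.0, 1.0])
g = np.array([-50.0, 0.0, 50.0])      # gate closed, half open, fully open
z = gated_activation(f, g)
```

The sigmoid branch acts as a per-element valve on the tanh branch: with a strongly negative gate input the output is close to 0, and with a strongly positive one it is close to tanh of the filter input.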

Input/output
http://musyoku.github.io/2016/09/18/wavenet-a-generative-model-for-raw-audio/
Roughly speaking, the signal is quantized on a log scale and encoded into 256 levels.

Things not described, and guesses
• Kernel size of the dilation filters: 2
• Number of layers (ResNet blocks): 4*10 ~ 6*10
• Number of channels in hidden layers: hundreds? 256?
• Any other activation function in a Res-block? Maybe not
• Batch normalization: no reason not to use it
• Sampling frequency: 'at least 16kHz'
• Where do the skip connections branch out? Every 10 layers?
• Do the skip connections have weights? Yes?

Text-to-Speech (TTS)
• Single-speaker speech datasets
• North American English dataset: 24.6 hr
• Mandarin Chinese dataset: 34.8 hr
• Receptive field: 240 ms
• Ad hoc architecture, as in the diagram:
[Diagram: WaveNet produces Audio(t); two auxiliary models ("another model" and "yet another model") appear alongside the labels linguistic feature h_i (possibly phoneme), fundamental frequency F0(t), duration(t), and linguistic feature h(t).]
Note: the notation here differs from the paper.

TTS: Mean Opinion Score
https://deepmind.com/blog/wavenet-generative-model-raw-audio/

Speech Recognition
• TIMIT dataset (possibly ~4 hr)
• Add a pooling layer after the dilated convolutions, with 160x downsampling (does it mean the 7th layer?)
• Then a few non-causal convolutions.
• A loss to predict the next sample (same as ordinary WaveNet)
• And a loss to classify the frame
• 18.8 PER, which is the best score among raw-audio models.
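The 160x downsampling can be pictured as averaging the per-sample activations over non-overlapping windows; at 16 kHz, 160 samples correspond to 10 ms frames. The function below is a minimal sketch of such mean pooling. The name `mean_pool_frames`, the choice to drop an incomplete final window, and the toy shapes are assumptions for illustration.

```python
import numpy as np

def mean_pool_frames(activations, factor=160):
    """Average activations of shape (time, channels) over windows of `factor` steps.

    Any trailing samples that do not fill a whole window are dropped.
    """
    a = np.asarray(activations, dtype=float)
    n_frames = a.shape[0] // factor
    a = a[: n_frames * factor]
    return a.reshape(n_frames, factor, *a.shape[1:]).mean(axis=1)

one_second = np.ones((16000, 4))          # 1 s of 4-channel activations at 16 kHz
frames = mean_pool_frames(one_second)     # shape (100, 4): one row per 10 ms frame
```

The frame-level classifier then operates on these 100 frames per second, while the next-sample loss still works at the full sample rate.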

(Multi-speaker) Speech Generation
• Conditioned on the speaker
• 44 hours of data (from 109 speakers)

μ-law transformation (ITU-T, 1988)
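• $f(x_t) = \operatorname{sign}(x_t)\,\dfrac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)}$, with $-1 < x_t < 1$ and $\mu = 255$
• This divides the range between -1 and 1 into 256 levels.
• Roughly speaking, it simply encodes the signal on a log scale.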
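The companding and quantization can be sketched as follows, assuming μ = 255 and input samples already scaled to [-1, 1]. The rounding convention used to map the companded value onto the integer codes 0 to 255 is one common choice and is an assumption here, as are the function names.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress x in [-1, 1] with the mu-law curve and quantize to codes 0..mu."""
    x = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # still in [-1, 1]
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)   # round to nearest code

def mu_law_decode(codes, mu=255):
    """Map integer codes 0..mu back to approximate amplitudes in [-1, 1]."""
    y = 2 * np.asarray(codes, dtype=float) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.linspace(-1.0, 1.0, 101)
codes = mu_law_encode(x)
x_hat = mu_law_decode(codes)
```

Because the curve is logarithmic, small amplitudes get finer quantization steps than large ones, so the reconstruction error is smallest near zero and largest near ±1.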
