14. This Year's Themes
• Grounded Sequence to Sequence Transduction
• General-Purpose Sentence Representation Learning
• Multilingual End-to-end ASR for Incomplete Data
17. Multilingual End-to-end ASR for Incomplete Data
[Figure: error rate (%) vs. amount of training data available per language (hours, 0 to 1000). Languages with almost no training data and languages with insufficient training data sit at the high-error end of the curve; resource-rich languages sit at the low-error end. Annotation: "Error reduction!"]
[Figure: the "incomplete data" setting. Text data and speech data from the target language plus other languages, split into unpaired data and paired data, with optional extra knowledge such as a lexicon. Approaches: multilingual training and adaptation, learning algorithms for unpaired data, and new architectures and training methods, each with an assigned owner.]
18. Exploring Better Units for End-to-end Speech Recognition
08/02/18
Takaaki Hori
(MERL)
Shinji Watanabe
(JHU)
Jaejin Cho (JHU)
Jiro Nishitoba (Retrieva)
• Incorporation of word-based RNN language model (Takaaki)
• Exploring subword-based end-to-end ASR (Jiro)
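The subword exploration above typically starts from byte-pair-encoding (BPE) style segmentation. As a minimal sketch of the idea (the toy corpus and the helper names `most_frequent_pair`/`merge_pair` are illustrative, not from the actual experiments):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a corpus of symbol-tuple words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; words are split into characters,
# with "_" as a word-end marker.
corpus = {tuple("cat_"): 5, tuple("cats_"): 2, tuple("eat_"): 3}
for _ in range(3):  # three BPE merge operations
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(sorted(corpus))
```

After a few merges, frequent character sequences ("at", "at_", "cat_") become single units, giving a vocabulary between characters and words.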
19. Recognition Units
• Speech recognition offers several choices of recognition unit
[Figure: hybrid CTC/attention end-to-end architecture. Input frames x1 … xT pass through a shared encoder; a CTC branch and an attention decoder both predict the output labels y1, y2, … within a single deep network.]
• Character: a _ c a t _ e a t s _ ...
• Word: a cat eats ...
(e.g., for the sentence "A cat eats …")
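The hybrid CTC/attention model on this slide is scored by interpolating the two branches. A minimal sketch of that score combination, with made-up hypothesis probabilities (the weight 0.3 is illustrative, not the value used in the experiments):

```python
import math

def joint_score(log_p_ctc, log_p_att, ctc_weight=0.3):
    """Interpolate CTC and attention log-probabilities for one hypothesis,
    as in hybrid CTC/attention training and decoding."""
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att

# Two candidate hypotheses with made-up branch probabilities (log domain).
hyp_a = joint_score(math.log(0.20), math.log(0.50))
hyp_b = joint_score(math.log(0.40), math.log(0.30))
print(hyp_a, hyp_b)
```

Because the attention branch is weighted more heavily here, hypothesis A wins despite its lower CTC probability; tuning the weight trades off the two branches' error characteristics.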
33. Results with the Convolutive Stacked Bottleneck Architecture
• Significant improvement from multilingual features: 1.6%-5% on 50 h (full sets)
• Smaller performance degradation (i.e., larger improvement) with smaller amounts of data
• No dependence on the target language being part of the feature-training data (Tok Pisin, Georgian)
35. Results of Multilingual Joint Training

Model               Features   Swahili  Amharic  Tok Pisin  Georgian
                               %CER     %CER     %CER       %CER
Monoling            FBANK      28.6     45.3     32.2       34.8
Monoling            Multiling  26.4     40.4     26.8       33.2
Multiling (LT-Out)  FBANK      27.4     41.2     27.7       33.6
Multiling (f.tune)  FBANK      27.8     -        27.5       33.3
Multiling (f.tune)  Multiling  -        -        -          -
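The gains in the table are easiest to compare as relative CER reductions; for example, multilingual features take the monolingual Swahili system from 28.6% to 26.4% CER:

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

# Swahili, monolingual model: FBANK vs. multilingual features (from the table).
print(round(relative_reduction(28.6, 26.4), 1))  # ~7.7% relative CER reduction
```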
36. Text-to-Speech
• Conventional TTS system: text → preprocessing → feature extraction → duration model / F0 model / spectrum model → feature vectors → SP-based vocoder → speech
  - Requires many modules
  - Each module must be optimized separately
• E2E-TTS system: text → deep network → neural vocoder → speech
  - Can be built from a single neural network
  - Can be optimized jointly across all modules
37. Tacotron2
• Fully neural TTS system with human-level quality
• Generates a mel spectrogram with a spectrogram prediction network
• Generates the waveform with a WaveNet vocoder
• Implemented in ESPnet
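The spectrogram prediction network targets a mel-scale spectrogram; the mel scale itself is the standard frequency warping below (a self-contained sketch, independent of the actual ESPnet implementation):

```python
import math

def hz_to_mel(f):
    """HTK-style mel-scale mapping from frequency in Hz."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, from mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(700.0))              # ≈ 781.2 mel
print(mel_to_hz(hz_to_mel(8000.0)))  # round trip ≈ 8000 Hz
```

Mel filterbanks space their bands evenly on this scale, compressing high frequencies the way human pitch perception does, which is why mel spectrograms are a common intermediate target for neural vocoders.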
40. Major accomplishments (1/2)
(1) Built multi-lingual end-to-end ASR systems for 17 languages
(2) Significant improvement with novel architecture and training
methods (submitted 3 papers to SLT’18)
08/02/18 JSALT2018 closing session
[Figure: the hybrid CTC/attention architecture (shared encoder feeding CTC and attention-decoder branches).]
41. Major accomplishments (2/2)
(3) Built end-to-end ASR-TTS chain and unpaired data training
[Figure: ASR-TTS chain. ASR maps speech X to text Y; TTS maps the text back to speech, closing the loop for unpaired-data training.]
(4) ESPnet: an open-source end-to-end speech processing toolkit
Developed for this workshop (GitHub stars grew from 196 to 330 during the workshop)
Supports state-of-the-art seq-to-seq models and both ASR and TTS recipes
Follows Kaldi-style recipes, so Kaldi experiments can be ported easily
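A Kaldi-style recipe is driven by a single staged shell script, so each step (data preparation, feature extraction, training) can be rerun independently. A minimal sketch of that convention (the stage names are illustrative, not the actual ESPnet recipe):

```shell
#!/usr/bin/env bash
# Minimal sketch of a Kaldi-style staged recipe driver (run.sh).
set -euo pipefail

stage=0        # first stage to run
stop_stage=2   # last stage to run

if [ "${stage}" -le 0 ] && [ "${stop_stage}" -ge 0 ]; then
  echo "stage 0: data preparation"
fi
if [ "${stage}" -le 1 ] && [ "${stop_stage}" -ge 1 ]; then
  echo "stage 1: feature extraction"
fi
if [ "${stage}" -le 2 ] && [ "${stop_stage}" -ge 2 ]; then
  echo "stage 2: network training"
fi
```

Setting `stage`/`stop_stage` resumes a recipe mid-pipeline, which is what makes porting experiments between Kaldi and ESPnet straightforward.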