ICML Reading Group: Deep Speech 2
1. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Amodei, et al.
ICML 2016 Reading Group, 2016/07/21 @ Dwango Seminar Room
Preferred Infrastructure, Inc.
西鳥羽 二郎
20. Lookahead Convolution
Bidirectional GRUs give good accuracy, but they cannot be run online with low latency.
Here W_k^(6) and b_k^(6) denote the k'th column of the weight matrix and k'th bias, respectively. Once we have computed a prediction for P(c_t | x), we compute the CTC loss [13] L(ŷ, y) to measure the error in prediction. During training, we can evaluate the gradient ∇_ŷ L(ŷ, y) with respect to the network outputs given the ground-truth character sequence y. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use Nesterov's Accelerated gradient method for training [41].
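To make the quoted training recipe concrete, here is a minimal PyTorch sketch (not the authors' implementation; the model, tensor shapes, and hyperparameters below are placeholder assumptions) that computes the CTC loss on the network outputs and updates the parameters with Nesterov's accelerated gradient:

    import torch
    import torch.nn as nn

    # Placeholder sizes (assumptions): T time steps, N utterances, C characters incl. blank
    T, N, C = 100, 8, 29

    rnn = nn.GRU(input_size=161, hidden_size=256)   # stand-in for the real acoustic model
    out = nn.Linear(256, C)                          # per-time-step character scores
    ctc = nn.CTCLoss(blank=0)                        # CTC loss L(y_hat, y)
    opt = torch.optim.SGD(list(rnn.parameters()) + list(out.parameters()),
                          lr=3e-4, momentum=0.99, nesterov=True)  # Nesterov's accelerated gradient

    x = torch.randn(T, N, 161)                       # dummy spectrogram features
    y = torch.randint(1, C, (N, 20))                 # ground-truth character sequences
    x_lens = torch.full((N,), T, dtype=torch.long)
    y_lens = torch.full((N,), 20, dtype=torch.long)

    h, _ = rnn(x)
    log_probs = out(h).log_softmax(dim=-1)           # log P(c_t | x)
    loss = ctc(log_probs, y, x_lens, y_lens)
    loss.backward()                                  # gradients for all parameters via backprop
    opt.step()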
[Figure: the bidirectional recurrence spans t1, t2, …, tn]
Computing the output at any time step requires all of the inputs from t1 through tn.
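The lookahead (row) convolution replaces the backward recurrence with a linear combination over a small window of future activations, so only a fixed amount of future context is needed and streaming, low-latency decoding stays possible. A rough NumPy sketch of that idea (my reading of the row-convolution description in the arXiv version; the exact form and names here are assumptions):

    import numpy as np

    def lookahead_convolution(h, W):
        # h: (T, d) activations of the forward (unidirectional) RNN
        # W: (tau + 1, d) per-channel weights over the current and next tau frames
        # r[t, i] = sum_{j=0..tau} W[j, i] * h[t + j, i], zero-padded past the last frame
        T, d = h.shape
        context = W.shape[0]
        h_pad = np.vstack([h, np.zeros((context - 1, d))])
        r = np.zeros_like(h)
        for j in range(context):
            r += W[j] * h_pad[j:j + T]   # channel-wise product with the frame j steps ahead
        return r

    # Toy usage: tau = 2 future frames, 4 channels
    h = np.random.randn(10, 4)
    W = np.random.randn(3, 4)
    r = lookahead_convolution(h, W)      # same shape as h; each step sees at most 2 future frames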
26. Decoder Scoring
At inference time, CTC models are paired with a language model trained on a bigger corpus of text. We use a specialized beam search (Hannun et al., 2014b) to find the transcription y that maximizes

Q(y) = log(p_RNN(y|x)) + α log(p_LM(y)) + β wc(y)

where wc(y) is the number of words (English) or characters (Chinese) in the transcription y. The weight α controls the relative contributions of the language model and the CTC network. The weight β encourages more words in the transcription. These parameters are tuned on a held-out development set.
Reading the terms of Q(y):
- Q(y): score of the transcription (character string)
- p_RNN(y|x): probability of the character sequence output by the neural network
- p_LM(y): score given by the language model
- wc(y): word count
α and β are adjusted according to the training data.
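A minimal sketch of the scoring function used inside the beam search (the numbers and the stand-in log-probabilities below are made up for illustration):

    def decoder_score(y, log_p_rnn, log_p_lm, alpha, beta):
        # Q(y) = log p_RNN(y|x) + alpha * log p_LM(y) + beta * wc(y)
        # wc(y): word count for English; for Chinese it would be the character count
        wc = len(y.split())
        return log_p_rnn + alpha * log_p_lm + beta * wc

    # Toy usage with made-up scores; alpha and beta are tuned on held-out data
    q = decoder_score("hello world", log_p_rnn=-4.2, log_p_lm=-7.5, alpha=0.8, beta=1.2)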
28. (Sequence-wise) Batch Normalization
(Sequence-wise) Batch Normalization is used for regularization.
Batch Normalization is applied only to the input coming from the layer below; it is not applied to the horizontal (recurrent) input.
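A toy PyTorch sketch of this split (my own simplification, not the authors' implementation): batch normalization is applied to the feed-forward term W·x_t from the layer below, with statistics pooled over all time steps and utterances in the minibatch, while the recurrent term U·h_{t-1} is left unnormalized.

    import torch
    import torch.nn as nn

    class SequenceWiseBNRNNLayer(nn.Module):
        """Toy RNN layer: BatchNorm only on the bottom-up input W x_t,
        never on the horizontal (recurrent) connection U h_{t-1}."""
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.W = nn.Linear(input_size, hidden_size, bias=False)   # input from the layer below
            self.U = nn.Linear(hidden_size, hidden_size, bias=False)  # recurrent connection
            self.bn = nn.BatchNorm1d(hidden_size)                     # sequence-wise batch norm

        def forward(self, x):                      # x: (T, N, input_size)
            T, N, _ = x.shape
            wx = self.W(x)                         # (T, N, hidden_size)
            # "Sequence-wise": statistics pooled over all time steps and batch items
            wx = self.bn(wx.reshape(T * N, -1)).reshape(T, N, -1)
            h = x.new_zeros(N, self.U.in_features)
            outs = []
            for t in range(T):
                h = torch.relu(wx[t] + self.U(h))  # no normalization on U h_{t-1}
                outs.append(h)
            return torch.stack(outs)               # (T, N, hidden_size)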
29. SortaGrad
Curriculum learning
- Early in CTC training, the model tends to output blank strings, so the loss can easily become very large
- Sort the training data by sequence length and train on the shorter audio clips first (see the sketch below)
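A minimal sketch of the SortaGrad schedule (the dataset layout and names are assumptions): the first epoch iterates over utterances from shortest to longest, and later epochs fall back to ordinary shuffled minibatches.

    import random

    def sortagrad_batches(dataset, batch_size, epoch):
        # dataset: list of (features, transcript) pairs; len(features) = number of frames
        if epoch == 0:
            order = sorted(dataset, key=lambda ex: len(ex[0]))   # shortest utterances first
        else:
            order = list(dataset)
            random.shuffle(order)                                # normal random order afterwards
        for i in range(0, len(order), batch_size):
            yield order[i:i + batch_size]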
Architecture                   Baseline  BatchNorm  GRU
5-layer, 1 RNN                 13.55     14.40      10.53
5-layer, 3 RNN                 11.61     10.56       8.00
7-layer, 5 RNN                 10.77      9.78       7.79
9-layer, 7 RNN                 10.83      9.52       8.19
9-layer, 7 RNN, no SortaGrad   11.96      9.78
Table 1: Comparison of WER on a development set as we vary depth of RNN, application of BatchNorm and SortaGrad, and type of recurrent hidden unit. All networks have …
[Figure: Cost vs. Iteration (×10…)]
When SortaGrad is not used, accuracy drops.
39. Supplement: arXiv Version
Topics covered in https://arxiv.org/abs/1512.02595 but not in the ICML version:
- Striding (in convolution)
- Language Modeling
- Scalability and Data parallelism
- Memory allocation
- Node and cluster architecture
- GPU Implementation of CTC Loss Function
- Batch Dispatch
- Data Augmentation
- Beam Search