Aspects of Feature Extraction in DNN Acoustic Models

Copyright©2015 NTT corp. All Rights Reserved.
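Takuya Yoshioka (NTT Communication Science Laboratories)
Co-authors: Marc Delcroix, Masakiyo Fujimoto, Tomohiro Nakatani
July 15, 2015
IEICE Speech (SP) meeting, special session
“DNNs as Feature Extractors,” invited talk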
1

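These slides were used in the special session “DNNs as Feature
Extractors” at the IEICE Speech (SP) meeting held in July 2015.
Concepts that people working in this field are expected to know
(CMLLR, STC, beamformers, the DNN-HMM hybrid training pipeline,
etc.) are not explained.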
2

Purpose of the special session
3
[Excerpted from http://www.ieice.org/~sp/jpn/meeting/2015/201507c.html]

[Figure labels: feature extraction, classification]
DNN = feature extractor + classifier
5

DNN = feature extractor + classifier
...but this is not all of the feature extraction that actually takes place
6

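DNN-HMM hybrid speech recognition system
[Block diagram. Labels: waveform; MelFB (mel filterbank analysis); DNN; HMM; WFST decoder; HMM topology; pronunciation lexicon; language model; word sequence]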
7

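The actual feature extraction pipeline
[Block diagram. Labels: waveform; MelFB (mel filterbank analysis); DNN; HMM; WFST decoder; HMM topology; pronunciation lexicon; language model; word sequence]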
8

Purpose of this talk
Discuss feature extraction as a whole in today's speech recognition
systems based on the DNN-HMM hybrid,
with robustness as the central theme
9

Robustness-related tasks used in the authors' experiments
AMI, NTT Meeting, CHiME-3
Others: CHiME-1, REVERB, Aurora4, etc.
AMI photo:
[Excerpted from T. Hain and S. Renals, “Meeting recognition,” Interspeech 2010 tutorial material]
CHiME-3 photo:
[Excerpted from http://spandh.dcs.shef.ac.uk/chime_challenge/]
10

Outline
Robustness of DNN-HMM hybrid acoustic models
Approaches to improving it
• CNN
• Model adaptation
• CMLLR
• Auto-encoder
• Speech enhancement
DNNs as feature extractors
• DNN-GMM tandem
11

Outline
• CNN
• Model adaptation
• CMLLR
• Auto-encoder
• Speech enhancement
• DNN-GMM tandem
12

Important properties of DNNs related to robustness
• Efficiency of distributed representations
• Importance of depth
13

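[1/2] Efficiency of distributed representations
Mixture model vs. neural network
[Figure: the mixture model covers the space with template1, template2, template3, ...; the neural network partitions the space by neuron1, neuron2, neuron3]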
14
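
The contrast in the figure can be made concrete with a small count. The sketch below is illustrative only and is not taken from the talk: the three templates, the three separating lines, and the sampling grid are assumed toy values. It counts how many regions of a 2-D plane each scheme can tell apart.

```python
# Toy comparison: regions distinguished by 3 templates vs. 3 neurons.
# All numbers here are illustrative assumptions, not values from the talk.

# Mixture-model view: each point is represented by its nearest template.
TEMPLATES = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]

# Neural-network view: each neuron splits the plane with a line w.x + b = 0.
NEURONS = [((1.0, 0.0), 0.0),    # x = 0
           ((0.0, 1.0), 0.0),    # y = 0
           ((1.0, 1.0), -1.0)]   # x + y = 1


def nearest_template(point):
    """Index of the template closest to the point (local representation)."""
    x, y = point
    dists = [(x - tx) ** 2 + (y - ty) ** 2 for tx, ty in TEMPLATES]
    return dists.index(min(dists))


def activation_pattern(point):
    """On/off state of every neuron (distributed representation)."""
    x, y = point
    return tuple(int(w[0] * x + w[1] * y + b > 0) for w, b in NEURONS)


def count_regions(code_fn):
    """Count distinct codes over a grid of sample points."""
    grid = [(i + 0.25, j + 0.4) for i in range(-3, 4) for j in range(-3, 4)]
    return len({code_fn(p) for p in grid})


template_regions = count_regions(nearest_template)
neuron_regions = count_regions(activation_pattern)
print(template_regions, neuron_regions)  # 3 7
```

With templates, the number of regions equals the number of templates. With neurons, regions are formed by combinations of on/off states, so the same number of units can distinguish more regions.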

[2/2] Importance of depth
1 hidden layer vs. 7 hidden layers
Task                        1 hidden layer   7 hidden layers
SWB [Seide+, 2011]          24.1%            17.0%
MIT-OCW [informal exp]      26.3%            22.7%
REVERB [Delcroix+, 2015]    ~30%             ~24%
[Seide+, 2011] F. Seide, et al., “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” ASRU 2011
[Delcroix+, 2015] M. Delcroix, et al., “Strategies for distant speech recognition in reverberant environments,” Eurasip J. ASP, 2015
16

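Multi-layer feature transformation is important
= in other words, the width of this part (marked in the figure) is what matters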
17

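Multi-layer feature transformation is important
= in other words, the width of this part (marked in the figure) is what matters
...Why?
18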

Small perturbations at the input layer tend to shrink
as they pass through each layer [Yu+, 2013]
[Yu+, 2013] D. Yu, et al., “Feature Learning in Deep Neural Networks - A Study on Speech
Recognition Tasks,” ICLR 2013.
[Figure labels: $h_l$, $h_{l+1}$]
19

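Small perturbations at the input layer tend to shrink
as they pass through each layer [Yu+, 2013]
When $h_l$ is perturbed to $h_l + \delta_l$, how does $h_{l+1}$ change?
[Yu+, 2013] D. Yu, et al., “Feature Learning in Deep Neural Networks - A Study on Speech
Recognition Tasks,” ICLR 2013.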
20

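$\delta_{l+1} = \sigma(W_l (h_l + \delta_l)) - \sigma(W_l h_l)$
$\approx \mathrm{diag}(\sigma'(W_l h_l)) \, W_l^T \delta_l$
[Figure labels: $h_l$, $h_l + \delta_l$, $h_{l+1}$, $h_{l+1} + \delta_{l+1}$]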
21

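$\delta_{l+1} = \sigma(W_l (h_l + \delta_l)) - \sigma(W_l h_l)$
$\approx \mathrm{diag}(\sigma'(W_l h_l)) \, W_l^T \delta_l$
$\|\delta_{l+1}\| < \|\mathrm{diag}(h_{l+1} \odot (1 - h_{l+1})) \, W_l^T\| \, \|\delta_l\|$
Each entry of $h_{l+1} \odot (1 - h_{l+1})$ is at most 0.25
Most of the weights in $W_l$ are small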
22
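
The shrinking effect can be checked numerically. The following is a minimal sketch under assumed toy settings (layer width 20, 5 sigmoid layers without bias, weights drawn uniformly from [-0.1, 0.1], and a row-wise weight layout); none of these values come from the talk or from [Yu+, 2013]. It pushes an input and a slightly perturbed copy through the same layers and records the norm of their difference after each layer.

```python
import math
import random

random.seed(0)
DIM, LAYERS = 20, 5  # assumed toy sizes


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, h):
    """One sigmoid layer without bias: sigma(W h), W stored row by row."""
    return [sigmoid(sum(w * x for w, x in zip(row, h))) for row in weights]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


# Assumed toy network: small random weights in [-0.1, 0.1].
weights = [[[random.uniform(-0.1, 0.1) for _ in range(DIM)]
            for _ in range(DIM)] for _ in range(LAYERS)]

h = [random.uniform(0.0, 1.0) for _ in range(DIM)]
h_perturbed = [x + random.uniform(-0.01, 0.01) for x in h]

# delta_norms[l] is the norm of the perturbation at the input of layer l.
delta_norms = [norm([a - b for a, b in zip(h_perturbed, h)])]
for W in weights:
    h, h_perturbed = layer(W, h), layer(W, h_perturbed)
    delta_norms.append(norm([a - b for a, b in zip(h_perturbed, h)]))

for l, d in enumerate(delta_norms):
    print(f"layer {l}: ||delta|| = {d:.3e}")
```

In this setting the perturbation norm decreases at every layer, because the sigmoid slope never exceeds 0.25 and the weights are small. With large weights the bound no longer forces shrinkage.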

In the neighborhood of correctly learned samples,
correct recognition results are likely to be obtained
24

Robustness remains as a major challenge
in the deep learning acoustic model
[Huang+, 2014]
[Huang+, 2014] Y. Huang, “A Comparative Analytic Study on the Gaussian
Mixture and Context Dependent Deep Neural Network Hidden Markov Models,”
Interspeech 2014
25

[Excerpted from Huang+, 2014]
26

...How robust is the DNN by itself?
27

Training pipeline of the DNN-HMM hybrid acoustic model
[Flow diagram. Steps: GMM-HMM build, AlignGen, DNN train. Data and models: training data, GMM, HMM, alignments, DNN]
28

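Training pipeline of the DNN-HMM hybrid acoustic model
[Flow diagram. Steps: GMM-HMM build, AlignGen, DNN train. Data and models: training data, GMM, HMM, DNN]
We want to evaluate these separately
29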

AMI corpus
Synchronized headset and table-top recordings
[Excerpted from T. Hain and S. Renals, “Meeting recognition,” Interspeech 2010 tutorial material]
30

[Yoshioka+, 2015]
HMM/Alignment   DNN input   %WER
table-top       table-top   43.1
headset         headset     26.4
[Yoshioka+, 2015] T. Yoshioka and M. J. F. Gales, “Environmentally robust ASR front-end
for deep neural network acoustic models,” CSL, 2015
31

[Yoshioka+, 2015]
HMM/Alignment   DNN input   %WER
table-top       table-top   43.1
headset         headset     26.4
headset         table-top   41.3
This gap is the robustness (or lack of it) of the DNN by itself
32
[Yoshioka+, 2015] T. Yoshioka and M. J. F. Gales, “Environmentally robust ASR front-end
for deep neural network acoustic models,” CSL, 2015
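
The WER figures in this table can be broken down with simple arithmetic. The sketch below assumes that the gap highlighted on the slide is the one between the two rows that share headset alignments; the slide text does not say which rows the annotation points at, so the variable names and their interpretation are assumptions.

```python
# %WER values from the table above [Yoshioka+, 2015].
# Keys are (HMM/alignment source, DNN input).
wer = {
    ("table-top", "table-top"): 43.1,
    ("headset", "headset"): 26.4,
    ("headset", "table-top"): 41.3,
}

# Gap between the fully table-top and the fully headset system.
total_gap = wer[("table-top", "table-top")] - wer[("headset", "headset")]
# Gap that remains when only the DNN input is switched to table-top.
dnn_input_gap = wer[("headset", "table-top")] - wer[("headset", "headset")]
# Gap left over when only the alignments are switched to table-top.
alignment_gap = wer[("table-top", "table-top")] - wer[("headset", "table-top")]

print(f"total gap:                {total_gap:.1f} points")
print(f"DNN input changed only:   {dnn_input_gap:.1f} points")
print(f"alignments changed only:  {alignment_gap:.1f} points")
```

Under that reading, 14.9 of the 16.7 points remain even when the alignments come from the headset recordings, so most of the degradation arises on the DNN input side.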

Outline
• CNN
• Model adaptation
• CMLLR
• Auto-encoder
• Speech enhancement
• DNN-GMM tandem
34

[Figure labels: waveform, MelFB, state likelihoods]
Each topic corresponds to one of the steps
of the feature extraction pipeline
35

CNNs in image recognition
[Excerpted from http://deeplearning.net/tutorial/lenet.html]
37

CNNs in speech recognition
The first-layer filters are made wide along the time axis
conv → pool → conv → pool → fully connected
38
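
One convolution and pooling stage of such a network can be sketched as follows. This is a toy illustration, not the configuration used in the talk: the patch size (3 frames by 10 frequency bins), the hand-set peak-detecting kernel, and the pooling size of 2 are all assumptions. The kernel spans every frame of the patch, reflecting the wide time extent of the first-layer filters, and slides along the frequency axis only.

```python
def conv_over_frequency(frames, kernel):
    """Slide a kernel that spans all frames along the frequency axis."""
    n_bins, width = len(frames[0]), len(kernel[0])
    out = []
    for f in range(n_bins - width + 1):
        s = sum(kernel[t][k] * frames[t][f + k]
                for t in range(len(frames)) for k in range(width))
        out.append(max(0, s))  # ReLU
    return out


def max_pool(values, size):
    """Non-overlapping max pooling."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


def patch_with_peak(peak_bin, n_frames=3, n_bins=10):
    """Toy log-mel patch: every frame has one spectral peak at peak_bin."""
    return [[1 if b == peak_bin else 0 for b in range(n_bins)]
            for _ in range(n_frames)]


KERNEL = [[-1, 2, -1]] * 3  # assumed 3-frame x 3-bin peak detector

conv_a = conv_over_frequency(patch_with_peak(4), KERNEL)
conv_b = conv_over_frequency(patch_with_peak(3), KERNEL)
pooled_a, pooled_b = max_pool(conv_a, 2), max_pool(conv_b, 2)
print(conv_a, conv_b)      # the peak response moves by one position
print(pooled_a, pooled_b)  # both are [0, 6, 0, 0]
```

The two patches differ by a one-bin frequency shift. Their convolution outputs differ, but the pooled outputs coincide because both peaks fall into the same pooling window. A shift that crosses a window boundary would still change the pooled output.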

Aurora4 [informal experiment]
[Bar chart: WER in % (axis from 12 to 15) for a fully connected DNN, a CNN with 1 convolutional layer, and a CNN with 2 convolutional layers]
40

Aurora4 [informal experiments]
The CNNs have one convolutional layer
Model             pooling   clean   additive noise   channel noise   both
Fully connected   -         5.5     9.7              11.1            22.2
CNN               2         5.5     9.2              9.4             20.1
CNN               3         4.9     9.0              8.4             19.7
41

Aurora4 [informal experiments]
The CNNs have one convolutional layer
Model             pooling   clean   additive noise   channel noise   both
Fully connected   -         5.5     9.7              11.1            22.2
CNN               2         5.5     9.2              9.4             20.1
CNN               3         4.9     9.0              8.4             19.7
Pooling is especially effective when there is no additive noise
42

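Speaker adaptation of DNN acoustic models
[Flow diagram. Steps: SI decode, Adapt, SA decode. Data and models: target-speaker data, SI model, 1-best, SA model, word sequence. Legend: model, data]
SI: speaker-independent, SA: speaker-adapted
44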

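Speaker adaptation of DNN acoustic models
[Flow diagram. Steps: SI decode, AlignGen, Adapt, SA decode. Data and models: target-speaker data, SI model, 1-best, SA model, word sequence. Legend: model, data]
SI: speaker-independent, SA: speaker-adapted
45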

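Retraining approach
Pro: applicable to any neural network (e.g., CNN)
Con: requires a relatively large amount of adaptation data
[Diagram: SI alignments and target-speaker data are fed to Backprop]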
46

Linear-layer insertion approach
[Diagram: SI model with the insertion points LIN/FDLR, LHN, LHUC]
LIN: Linear Input Network; FDLR: Feature-space Discriminative Linear Regression
LHN: Linear Hidden Network; LHUC: Learning Hidden Unit Contributions
47
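
A minimal sketch of two of these insertions follows. The toy SI network (2 inputs, 2 sigmoid hidden units, 1 linear output), its weights, and the function names are hypothetical. The LHUC scaling of the form 2·sigmoid(a) is one common parameterization and is assumed here rather than taken from the slide. LHN, which inserts a linear layer after a hidden layer, is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Hypothetical toy SI model: 2 inputs -> 2 sigmoid hidden units -> 1 output.
W_HIDDEN = [[0.5, -0.3], [0.8, 0.1]]
W_OUT = [[1.0, -1.0]]


def forward(x, lin=None, lhuc=None):
    """Forward pass with an optional LIN matrix and optional LHUC parameters."""
    if lin is not None:               # LIN: linear transform of the input
        x = matvec(lin, x)
    hidden = [sigmoid(z) for z in matvec(W_HIDDEN, x)]
    if lhuc is not None:              # LHUC: rescale each hidden unit
        hidden = [2.0 * sigmoid(a) * h for a, h in zip(lhuc, hidden)]
    return matvec(W_OUT, hidden)[0]


x = [0.2, -0.4]
baseline = forward(x)
lin_init = forward(x, lin=[[1.0, 0.0], [0.0, 1.0]])      # identity matrix
lhuc_init = forward(x, lhuc=[0.0, 0.0])                  # scale 2 * 0.5 = 1
lin_adapted = forward(x, lin=[[1.1, 0.0], [0.05, 0.9]])  # after some updates
print(baseline, lin_init, lhuc_init, lin_adapted)
```

Both insertions start from a point where the adapted model reproduces the SI model exactly. Adaptation then updates only the inserted parameters on the target speaker's data while the SI weights stay fixed.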

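Adapting the last layer vs. adapting the first layer [Delcroix+, 2015]
[Line chart: %WER (axis from 22 to 24) against the number of epochs (0, 10, 20, 30); curves: last layer, first layer, LIN]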
48

SAT of DNN acoustic models with CMLLR [Yoshioka+, 2015]
[Diagram: HLDA features → CMLLR transform of speaker s (39x39)]
[AMI table-top]   dev      eval
HLDA (SI)         42.5 %   42.8 %
CMLLR             40.0 %   39.7 %
SAT: Speaker Adaptive Training
50

SAT of DNN acoustic models with CMLLR [Yoshioka+, 2015]
[Diagram: HLDA features → CMLLR transform of speaker s (39x39)]
The standard input to DNN acoustic models is log-mel...
51

CMLLR for log-mel features
[Diagram: log-mel features → STC → CMLLR transform of speaker s (96x96)]
STC: Semi-Tied Covariance transform
52
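
At run time, applying CMLLR means passing each feature frame through a speaker-specific affine transform, x' = A_s x + b_s. The sketch below shows only this application step on 2-dimensional toy features. The speaker labels, the matrices, and the helper names are hypothetical, and estimating the transforms, which is normally done with a GMM-HMM, is not shown.

```python
def apply_cmllr(frame, A, b):
    """Return A x + b for one feature frame x."""
    return [sum(a * x for a, x in zip(row, frame)) + bias
            for row, bias in zip(A, b)]


# Hypothetical per-speaker transforms (A_s, b_s) for 2-dimensional features.
# In the slide the matrix would be 96x96.
TRANSFORMS = {
    "spk1": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # identity: no change
    "spk2": ([[0.5, 0.0], [0.0, 2.0]], [1.0, -1.0]),
}


def adapt_utterance(frames, speaker):
    """Apply the speaker's transform to every frame of an utterance."""
    A, b = TRANSFORMS[speaker]
    return [apply_cmllr(frame, A, b) for frame in frames]


frames = [[2.0, 1.0], [4.0, 0.5]]
print(adapt_utterance(frames, "spk1"))  # [[2.0, 1.0], [4.0, 0.5]]
print(adapt_utterance(frames, "spk2"))  # [[2.0, 1.0], [3.0, 0.0]]
```

The adapted frames, rather than the raw ones, are then used as DNN input in both training and decoding, which is what makes the training speaker adaptive.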

Training with Single Pass Re-training (SPR)
[Flow diagram. Labels: MFCC data; FBANK data; MFCC ML; SPR; FBANK ML; STC; FBANK SAT; CMLLR; transforms; speaker-adapted features; DNN training]
53

CMLLR for log-mel features [Yoshioka+, 2015]
[Diagram: log-mel features → STC → CMLLR transform of speaker s (96x96)]
[AMI table-top]    dev      eval
log-mel (SI)       42.6 %   40.2 %
CMLLR              37.4 %   37.4 %
CMLLR w/ bdiag     37.3 %   36.6 %
54

Utterance-level model adaptation based on clustering
[Diagram. Labels: HLDA features; i-vector extraction; cluster assignment; log-mel features; STC; CMLLR transform of cluster c]
55

[AMI table-top]   dev      eval
log-mel           41.9 %   40.9 %
utt-CMLLR         41.0 %   40.0 %

[AMI headset]     dev      eval
log-mel           27.8 %   24.2 %
utt-CMLLR         26.9 %   23.5 %
Informal experiments; similar results are reported in [Yoshioka+, 2015]
56

A DNN that outputs clean speech from noisy speech
Trained on data consisting of
(clean, noisy) pairs
58

[CHiME-1]   Clean training   Noise adaptive training
w/o DAE     40.6 %           9.6 %
w/ DAE      23.2 %           10.7 %
Performance degraded (9.6 % → 10.7 %).
What should we do...?
Results from [Araki+, 2015] S. Araki, et al., “Exploring multi-channel features for
denoising-autoencoder-based speech enhancement,” ICASSP, 2015
59

Integrated DAE [Narayanan+, 2014]
[Narayanan+, 2014] A. Narayanan and D. Wang, “Investigation of Speech Separation
as a Front-End for Noise Robust Speech Recognition,” IEEE T. ASLP, 2014
60

Integrated DAE [Narayanan+, 2014]
Do something different from the acoustic model
• Use a different model [Weninger+, 2013]
• Use different features
[Weninger+, 2013] F. Weninger, et al., “The Munich feature enhancement approach to the 2nd
CHiME challenge using BLSTM recurrent neural networks,” CHiME-2, 2013
61

Use different features
Phone-class features
[Mimura+, 2015]
Spatial features
[Araki+, 2015]
[Mimura+, 2015] M. Mimura, et al., “Deep autoencoders augmented with phone-class feature
for reverberant speech recognition,” ICASSP, 2015
62

[CHiME-1]                    Clean training   Noise adaptive training
w/o DAE                      40.6 %           9.6 %
w/ DAE                       23.2 %           10.7 %
w/ DAE + spatial features    17.8 %           8.9 %
Improved
Results from [Araki+, 2015]
63

Microphone-array noise suppression
• Beamformer
• Source separation
Single-channel noise suppression
• SS, Wiener filter, Ephraim-Malah filter
• Front-end VTS
Dereverberation
65

Rule of thumb
Linear time-invariant filters are highly effective regardless of the task
• WPE dereverberation (1ch, M-ch)
• Beamformers (D-&-S, MVDR, MWF)
$y(t) = \sum_m \sum_k h_{m,k} \, x_m(t - k)$
66
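
The filtering formula on this slide translates directly into code. The sketch below is illustrative and makes several assumptions: a two-microphone toy signal, samples before t = 0 treated as zero, and hand-set delay-and-sum coefficients. Methods such as WPE or MVDR estimate the coefficients h[m][k] from the observed data, which is not shown here.

```python
def multichannel_filter(channels, filters):
    """y(t) = sum_m sum_k h[m][k] * x[m][t - k], with x[m][t] = 0 for t < 0."""
    n_samples = len(channels[0])
    y = []
    for t in range(n_samples):
        acc = 0.0
        for x_m, h_m in zip(channels, filters):
            for k, h_mk in enumerate(h_m):
                if t - k >= 0:
                    acc += h_mk * x_m[t - k]
        y.append(acc)
    return y


# Toy scene (assumed): the source reaches microphone 1 one sample later
# than microphone 0.
source = [1.0, -2.0, 3.0, 0.5, 0.0]
x0 = source
x1 = [0.0] + source[:-1]

# Delay-and-sum: delay channel 0 by one sample, then average both channels.
filters = [[0.0, 0.5],   # h[0][k]
           [0.5, 0.0]]   # h[1][k]

y = multichannel_filter([x0, x1], filters)
print(y)  # [0.0, 1.0, -2.0, 3.0, 0.5]
```

Once the channels are time-aligned, averaging them reproduces the source in this noise-free example. With uncorrelated noise on each microphone, the same averaging would attenuate the noise relative to the speech.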

Distant speech recognition front-end of NTT Communication Science Laboratories
[Diagram: M-ch waveform → WPE dereverberation → M-ch waveform → beamformer → 1-ch waveform]
[relative gains; columns: WPE, beamformer]
REVERB [Delcroix+, 2015]: 1ch: 20+%, 8ch: ~37%, 8ch: ~27%
AMI [Yoshioka+, 2014]: 1ch: ~5%, 8ch: ~15%
67
[Yoshioka+, 2014] T. Yoshioka, et al., “Impact of single-microphone dereverberation on
DNN-based meeting transcription systems,” ICASSP, 2014

Outline
• CNN
• Model adaptation
• CMLLR
• Auto-encoder
• Speech enhancement
• DNN-GMM tandem
68

69

DNN-GMM tandem:
Features obtained from a DNN are fed to a GMM classifier
• The various techniques developed for GMMs can be used
• Still sometimes used for SAT
(transfer learning, cross-lingual ASR)
70

SAT with DNN-GMM tandem
[Block diagram. Labels: MFCC; STC; HLDA; log-mel; STC; CMLLR; Concat; CMLLR; MPE-SAT GMM-HMMs; Mean-MLLR; GMM decode; Words]
71
[Yoshioka+, 2014 (2)] T. Yoshioka, et al., “Investigation of unsupervised
adaptation of DNN acoustic models with filter bank input,” ICASSP 2014.

Hybrid vs. Tandem
HLDA log-mel
Hybrid SAT 23.6% 22.7%
Tandem SAT 23.1% 22.3%
[AMI headset]
72
Results from [Yoshioka+, 2014 (2)]

Hybrid and tandem are complementary
                      HLDA     log-mel
Hybrid SAT            23.6%    22.7%
Tandem SAT            23.1%    22.3%
hybrid+tandem (CNC)   -        21.2%
[AMI headset]
73
Results from [Yoshioka+, 2014 (2)]

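Summary
Improving each step of feature extraction is important
Robustness-related topics that I personally think
need further research
• Adaptation of CNNs/LSTMs
• Adaptation of DNNs trained with sequence-discriminative criteria
• Raw audio input (learning the MelFB)
• Online processing for speech enhancement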
74
