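Audio Captioning with Built-in Keyword Estimation

2-R2-10
○ Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda, Shoichiro Saito
NTT Media Intelligence Laboratories
Acoustical Society of Japan, 2020 Autumn Meeting
Day 2, 18:30-19:10
□ For an overview in 2 minutes:
  ■ Please read up to page 5
□ For the details in 10 minutes:
  ■ Please read up to page 21
In the poster and Q&A session, we will present and answer questions based on these slides.
It would be helpful if you could have them open in a separate browser window.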

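2
Problem to solve: Audio Captioning
■ A task of generating a sentence that describes an environmental sound
■ Input: recorded audio data (environmental sound)
■ Output: a natural-language sentence that describes the input sound
[Figure: sounds "Hello!", "vroooooom", "BEEP BEEP" → caption "A man is speaking and cars are driving in the background."]
We have set up an Audio Captioning demo page on the Moodle page.
Please give it a try!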

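3
The problem with Audio Captioning
■ The sentence that describes a sound is not uniquely determined
[Figure: two captions for the same sounds "Hello!", "vroooooom", "BEEP BEEP"]
  A man is speaking and cars are driving in the background.
  A male is talking in front of many vehicles passing by.
Both sentences are correct, yet they are completely different at the word level.
■ Many words can describe a single acoustic event or scene, so the candidate
reference sentences explode combinatorially → unstable training, etc.
  ◆ The event "car": {car, automobile, vehicle, wheels, …}
  ◆ The place "intersection": {road, roadway, intersection, street, …}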
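As a rough illustration of the combinatorial growth, the sketch below counts the word pairs that can be formed from just the two word sets listed on this slide. Only the word lists come from the slide; the variable names are illustrative assumptions.

```python
from itertools import product

# Word alternatives listed on the slide for one event and one place.
event_words = ["car", "automobile", "vehicle", "wheels"]
place_words = ["road", "roadway", "intersection", "street"]

# Every (event word, place word) pair is an equally valid way to describe
# the same sound, so the number of valid descriptions multiplies.
pairs = list(product(event_words, place_words))
print(len(pairs))  # 16
```

With only two slots of four words each there are already 16 valid pairings; each additional slot multiplies the count again.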

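4
How we solved it
■ Estimate the candidate words (keywords) jointly, reducing the ambiguity
[Figure: system overview; the light-blue region is the contribution of this work]
  Input sounds: "Hello!", "vroooooom", "BEEP BEEP"
  Captioning path: Embeddings → Transformer-encoder → Transformer-decoder → estimated caption "A man is speaking in front of driving cars"; the error is minimized against the reference caption "A man is speaking and cars are driving in the background."
  Reference keywords: {man, speak, car, drive, background}, extracted from the reference caption
  Keyword path: keyword estimation DNN → estimated keywords "man, car, speak, drive, front"; the error is minimized against the reference keywords "man, speak, car, drive, background"
  Conditioning: keyword Embeddings condition the Transformer-decoder
Point 1: Rule-based generation of ground-truth keywords for the training data
Point 2: Keyword estimation that applies weakly labeled acoustic event detection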

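5
What were the results?
                                      CIDEr    SPICE    SPIDEr
Without keyword estimation            23.3     9.1      16.2
With keyword estimation (this work)   25.8     9.7      17.7
Human-provided keywords (oracle)      27.5     10.1     18.8
This work / without keywords          110.7%   106.6%   109.3%
This work / oracle                    93.8%    96.0%    94.1%
■ Experimental data: Clotho dataset [Drossos+, ICASSP 2020]
  ■ Training data: 2,894 clips (about 16 hours; 5 captions per clip)
  ■ Evaluation data: 1,045 clips
■ Metrics: widely used in caption-generation tasks
  ■ CIDEr [Vedantam+, CVPR 2015]: fluency of the sentence
  ■ SPICE [Anderson+, ECCV 2016]: appropriateness of word choice
  ■ SPIDEr [Liu+, ICCV 2017]: overall score
1. Estimating keywords improves caption-generation accuracy
2. Performance is comparable to the case where keywords are given by humans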
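The percentage rows of the table can be reproduced from the raw scores. The following is a minimal sketch of that arithmetic; the dictionary keys and the function name are illustrative assumptions, and only the numbers come from the slide.

```python
# Scores copied from the results table, in the order (CIDEr, SPICE, SPIDEr).
scores = {
    "without_keywords": (23.3, 9.1, 16.2),
    "with_keywords": (25.8, 9.7, 17.7),
    "oracle_keywords": (27.5, 10.1, 18.8),
}

def relative_percent(system, reference):
    """Return system/reference per metric as a percentage, rounded to 0.1."""
    return tuple(
        round(100.0 * s / r, 1)
        for s, r in zip(scores[system], scores[reference])
    )

gain_over_no_keywords = relative_percent("with_keywords", "without_keywords")
ratio_to_oracle = relative_percent("with_keywords", "oracle_keywords")
print(gain_over_no_keywords)  # (110.7, 106.6, 109.3)
print(ratio_to_oracle)        # (93.8, 96.0, 94.1)
```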

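7
Why audio captioning is needed
■ Presenting environment-recognition results in a form users can easily understand
■ Anomalous sound detection: from just "something is wrong" to "how it is wrong"
  Today: "Anomaly detected."
  Next generation: "There is a high-pitched sound, like metal scraping, that is not normally heard. It may be an anomaly."
■ Hearable devices: combined with estimates of "when / where / what", provide
easy-to-understand situation descriptions & concierge functions
  Today:
  ・Cat, lower left
  ・Woman, right
  ・Car, upper right, approaching
  Next generation:
  ・A car is approaching from the woman's side
  (+ Why not escort her to the safe side?)
  ・There is an adorable cat nearby
  (+ Why not use it as a conversation starter?)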

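8
Formulation: the task itself is very simple
■ Input: a sequence of acoustic features
■ Output: a sequence of words
■ Common approach: an Encoder-Decoder model
  Embed the acoustic features, then estimate the probability of the n-th word
  given the observed sound and the words up to the (n-1)-th
■ We want to condition the decoder on keywords
  When estimating the probability of the n-th word,
  also take the keywords into account
How should we estimate the keyword set m?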
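The equations for this slide did not survive extraction. The following is a hedged reconstruction from the bullet text and from the symbols that appear in the model diagram (ν for the encoder output, m for the keyword set, w_n for the n-th word). The names Φ, Enc, and Dec, and the bold history vector, are assumptions introduced here for illustration.

```latex
% \Phi: input acoustic feature sequence (assumed symbol)
% \boldsymbol{w}_{n-1} = (w_1, \dots, w_{n-1}): words generated so far
\begin{align}
  \nu &= \mathrm{Enc}(\Phi)
    && \text{(embed the acoustic features)} \\
  p(w_n \mid \nu, \boldsymbol{w}_{n-1}) &= \mathrm{Dec}(\nu, \boldsymbol{w}_{n-1})
    && \text{(standard encoder-decoder)} \\
  p(w_n \mid \nu, m, \boldsymbol{w}_{n-1}) &= \mathrm{Dec}(\nu, m, \boldsymbol{w}_{n-1})
    && \text{(decoder conditioned on the keyword set } m\text{)}
\end{align}
```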

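9
Challenges in making this work
■ How do we create ground-truth keywords in the first place?
  ■ We assume the training data contains only word sequences
  ■ A method for generating keyword training data is needed
■ Given ground-truth keywords, how do we estimate them?
  ■ Sounds and sentences have ordering, duration, and overlap, but keywords do not
  ■ We want to estimate "when" a ground-truth keyword occurred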
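10
Extracting ground-truth keywords
■ Apply part-of-speech tagging and lemmatization, and use the frequently occurring words
  ■ Extract only nouns, verbs, adjectives, and adverbs, and convert them to their base form
  ■ The C=50 most frequent words over all training data are used as keywords
[Figure: Training captions → POS-tagging + Lemmatize (noun, verb, adjective, or adverb?) → four parts of speech → count all candidates and keep the most frequent C=50 lemmas except "be" → keywords]
Training caption: A muddled noise of broken channel of the TV
  Four parts of speech: muddle, noise, break, channel, TV
  Keywords: noise, break
Training caption: A person is turning a map over and over
  Four parts of speech: person, be, turn, map, over
  Keywords: person, turn, over
Training caption: An office chair is squeaking as someone bends back and forth in it
  Four parts of speech: office, chair, be, squeak, someone, bend, back, forth
  Keywords: office, chair, squeak, someone, bend
Training caption: A flying bee is buzzing loudly around an object and its wing hits it
  Four parts of speech: fly, bee, be, buzz, loud, around, object, wing, hit
  Keywords: fly, bee, buzz, loud, object, wing, hit
Training caption: Birds or small animals rustling around in an outdoor area
  Four parts of speech: bird, small, animal, rustle, around, outdoor, area
  Keywords: bird, animal, rustle, outdoor
…
Step 1: extract the base forms of nouns, verbs, adjectives, and adverbs
Step 2: take the words that occur frequently in the training data as keywords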
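The two steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes POS tagging and lemmatization have already been done by an external tool, so each caption arrives as a list of (lemma, POS) pairs. The tag names, function names, and toy captions are all hypothetical.

```python
from collections import Counter

# Assumed tag set for the four kept parts of speech (hypothetical names).
KEPT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def build_keyword_vocab(tagged_captions, vocab_size=50, excluded=("be",)):
    """Count kept-POS lemmas over all captions and keep the most frequent ones."""
    counts = Counter(
        lemma
        for caption in tagged_captions
        for lemma, pos in caption
        if pos in KEPT_POS and lemma not in excluded
    )
    return [lemma for lemma, _ in counts.most_common(vocab_size)]

def caption_keywords(tagged_caption, vocab):
    """Ground-truth keywords of one caption: its kept-POS lemmas found in vocab."""
    vocab_set = set(vocab)
    keywords = []
    for lemma, pos in tagged_caption:
        if pos in KEPT_POS and lemma in vocab_set and lemma not in keywords:
            keywords.append(lemma)
    return keywords

# Toy, hand-tagged captions (hypothetical data, not from Clotho).
captions = [
    [("a", "DET"), ("person", "NOUN"), ("be", "VERB"), ("turn", "VERB"),
     ("map", "NOUN"), ("over", "ADV")],
    [("a", "DET"), ("person", "NOUN"), ("be", "VERB"), ("open", "VERB"),
     ("door", "NOUN")],
    [("bird", "NOUN"), ("be", "VERB"), ("sing", "VERB"), ("loud", "ADV")],
    [("bird", "NOUN"), ("sing", "VERB"), ("over", "ADV"), ("the", "DET"),
     ("door", "NOUN")],
]

vocab = build_keyword_vocab(captions, vocab_size=5)
first_caption_keywords = caption_keywords(captions[0], vocab)
print(sorted(vocab))
print(first_caption_keywords)
```

With a vocabulary of 5 words, "turn" and "map" occur only once in the toy data and are dropped. This mirrors how "muddle", "channel", and "TV" disappear from the first caption in the figure.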

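11
Keyword estimation
■ Keyword estimation that applies weakly labeled acoustic event detection
  ■ Estimate the sounds occurring at each time frame, apply max pooling, and train by minimizing binary cross-entropy
  ■ The K=5 words with the highest probability after pooling are taken as the estimated keywords
[Figure: a spectrogram (frequency vs. time) of a clip in which a train stops, the doors open, and people get off. A DNN estimates per-frame probabilities (probability vs. time; curves labeled train, dog, door, person). Max pooling over time gives one probability per word (train, people, bird, cat, dog, …, engine, door, car, street), which is compared by binary cross-entropy with the keywords "train, people, door" of the caption "The train has stopped and people are getting out from the door".]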
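The pooling, loss, and top-K selection described on this slide can be sketched as follows. This is a plain-Python illustration under stated assumptions: the frame-wise probabilities are toy numbers standing in for the DNN output, the function names are hypothetical, and no gradient computation or training loop is shown.

```python
import math

def max_pool_over_time(frame_probs):
    """frame_probs[t][c] -> one probability per keyword c: the max over frames t."""
    return [max(column) for column in zip(*frame_probs)]

def binary_cross_entropy(clip_probs, targets, eps=1e-7):
    """Mean binary cross-entropy between pooled probabilities and 0/1 labels."""
    total = 0.0
    for p, z in zip(clip_probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= z * math.log(p) + (1 - z) * math.log(1.0 - p)
    return total / len(clip_probs)

def top_k_keywords(clip_probs, vocab, k=5):
    """Return the k vocabulary words with the highest pooled probability."""
    ranked = sorted(range(len(vocab)), key=lambda c: clip_probs[c], reverse=True)
    return [vocab[c] for c in ranked[:k]]

# Toy example: 3 time frames, 5 candidate keywords (hypothetical numbers).
vocab = ["train", "people", "door", "bird", "car"]
frame_probs = [
    [0.9, 0.1, 0.2, 0.05, 0.3],   # train arrives
    [0.4, 0.2, 0.8, 0.10, 0.2],   # doors open
    [0.1, 0.7, 0.6, 0.02, 0.1],   # people get off
]
targets = [1, 1, 1, 0, 0]          # clip-level (weak) labels: train, people, door

clip_probs = max_pool_over_time(frame_probs)
loss = binary_cross_entropy(clip_probs, targets)
estimated = top_k_keywords(clip_probs, vocab, k=3)
print(clip_probs)
print(estimated)
```

Max pooling is what makes weak labels workable here. The label only says that "door" occurs somewhere in the clip, and the maximum over time lets a single confident frame satisfy it.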

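Overall model
12
[Figure (a): overall model. The base is a standard Transformer; the keyword estimation branch is the contribution of this work.]
  Audio side: Audio embedding (VGGish), Linear (audio dim. reduction), Dropout, then 3x encoder blocks (Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm), giving the encoder output ν
  Text side: Word embedding (fastText), Linear (word dim. reduction), Dropout, then 3x decoder blocks (Masked Multi-Head Attention, Add & Norm, Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm), followed by Linear and Softmax, giving p(w_n | ν, m, w_{n-1})
  Keyword estimation branch: ν → p(z_c | ν) → Sort & Select → Keyword embedding (fastText) → Linear (word dim. reduction) → m → Concat with the encoder output
[Figure (b): inside the keyword estimation branch. Linear, ReLU, Linear, Sigmoid, and Max layers map ν to p(z_c | ν), the probability that the c-th word is a keyword.]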
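The "Sort & Select" and "Concat" blocks can be read as building a decoder memory of length T+K, where T is the number of audio frames and K the number of keywords. This reading is suggested by the "Memory ind. (T+K)" axis in the attention figures. The sketch below illustrates that reading with plain lists. Concatenation along the sequence axis, a shared embedding dimension after the Linear reductions, and all names and numbers are assumptions.

```python
def build_decoder_memory(encoder_output, keyword_probs, keyword_embeddings, k=5):
    """Append the embeddings of the k most probable keywords to the encoder output.

    encoder_output:      list of T vectors (one per audio frame)
    keyword_probs:       list of C pooled keyword probabilities
    keyword_embeddings:  list of C vectors with the same dimension as the frames
    """
    # Sort & Select: indices of the k most probable keywords.
    ranked = sorted(range(len(keyword_probs)),
                    key=lambda c: keyword_probs[c], reverse=True)
    selected = ranked[:k]
    # Concat: stack keyword vectors after the audio frames (sequence axis).
    memory = list(encoder_output) + [keyword_embeddings[c] for c in selected]
    return memory, selected

# Toy shapes: T=4 frames, C=6 candidate keywords, dimension 2 (hypothetical values).
encoder_output = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
keyword_probs = [0.2, 0.9, 0.1, 0.8, 0.05, 0.6]
keyword_embeddings = [[float(c), -float(c)] for c in range(6)]

memory, selected = build_decoder_memory(
    encoder_output, keyword_probs, keyword_embeddings, k=3
)
print(len(memory), selected)
```

Under this reading the decoder attends to audio frames and keywords through the same attention mechanism, which is why the attention maps can show whether a generated word relied on the audio or on a keyword.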

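Experimental viewpoints
13
■ Performance evaluation
1. How effective is keyword estimation? Comparison with other methods
  ■ Baseline: the Baseline system of DCASE2020 Challenge task6
  ■ LSTM: LSTM-based Seq2Seq
  ■ Transformer: Transformer-based Seq2Seq
■ Ablation study
2. What happens when the number of estimated keywords changes?
  ■ Ours (K=10): 10 keywords are estimated
3. How much does performance degrade compared with inputting human-made keywords?
  ■ Oracle1: the keywords provided with the Clotho dataset are input
4. What if the keyword estimation accuracy were 100%?
  ■ Oracle2: the ground-truth keywords created by the proposed method are input
■ Analysis of DNN behavior
5. What kind of keywords are estimated?
6. How are the keywords used?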

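Experimental results
14
■ Performance evaluation
1. How effective is keyword estimation? Comparison with other methods
  ■ Baseline: the Baseline system of DCASE2020 Challenge task6
  ■ LSTM: LSTM-based Seq2Seq
  ■ Transformer: Transformer-based Seq2Seq
Finding: among captioning systems that take only audio as input, the proposed method performs best
■ Explicitly solving the keyword estimation problem improves performance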

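Experimental results
15
■ Ablation study
2. What happens when the number of estimated keywords changes?
  ■ Ours (K=10): 10 keywords are estimated
Finding: estimating too many keywords does not improve performance
■ 95% of the training data has 5 or fewer keywords
■ When more keywords are estimated than actually exist, no accuracy gain is observed,
probably because the ambiguity of captioning is not reduced after all
Future work: it would be better to also estimate the appropriate number of keywords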

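Experimental results
16
■ Ablation study
3. How much does performance degrade compared with inputting human-made keywords?
  ■ Oracle1: the keywords provided with the Clotho dataset are input
Finding: accuracy is almost the same as when keywords are given by humans
■ BLEU-1 (a measure of word match)
  ■ Oracle 53.4 → proposed method 52.1 (about 97.6% of the oracle)
■ ROUGE-L (a measure of grammatical match)
  ■ Oracle 35.1 → proposed method 34.2 (about 97.4% of the oracle)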

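Experimental results
17
■ Ablation study
4. What if the keyword estimation accuracy were 100%?
  ■ Oracle2: the ground-truth keywords created by the proposed method are input
Finding: if keyword accuracy improves, performance may improve further
■ The keyword estimation accuracy of the proposed method is 48.1%
※ Percentage of estimated keywords that were included in ground-truth
Future work: keywords were estimated with a naive method this time. Applying
state-of-the-art methods from acoustic event detection, image captioning,
document summarization, and similar fields could raise performance further.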
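The accuracy figure above is defined as the percentage of estimated keywords that appear in the ground truth. A minimal sketch of that computation follows. Pooling the counts over all clips (rather than averaging per clip) is an assumption, the function name is hypothetical, and the ground-truth sets in the toy example are invented for illustration.

```python
def keyword_precision(estimated, ground_truth):
    """Percentage of estimated keywords that appear in the ground-truth set.

    estimated, ground_truth: lists (one entry per clip) of keyword lists.
    Counts are pooled over all clips before dividing.
    """
    hits = 0
    total = 0
    for est, ref in zip(estimated, ground_truth):
        ref_set = set(ref)
        hits += sum(1 for keyword in est if keyword in ref_set)
        total += len(est)
    return 100.0 * hits / total if total else 0.0

# Estimated keywords as shown in the analysis slides; the ground-truth lists
# here are hypothetical stand-ins.
estimated = [
    ["bird", "chirp", "sing", "distance", "background"],
    ["machine", "run", "engine", "turn", "move"],
]
ground_truth = [
    ["bird", "chirp", "sing"],
    ["machine", "run"],
]

precision = keyword_precision(estimated, ground_truth)
print(precision)  # 50.0
```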

Analysis of DNN internals (1/3)
18
(a) 20100422.waterfall.birds.wav
R0: A bunch of birds are chirping and singing
R1: Birds are chirping and loudly singing in the forest
Pred.: Birds are chirping and singing in the background
Est. keywords: ['bird', 'chirp', 'sing', 'distance', 'background']
[Figure panels (i)-(iv): input spectrogram (Freq. [kHz] vs. Time [s]); keyword probabilities before pooling (Probability vs. Time [s]); decoder attention for Decoder Layers 1-3 (Memory ind. (T+K), split into Audio and Key., vs. Text position)]
How to read the figure:
  R0, R1: reference captions (2 of the 5 are shown)
  Pred.: estimated caption
  Est. keywords: estimated keywords
  Decoder attention: which time frames of the audio, or which keywords, were used when generating each word
Bird, sing, and chirp are estimated as keywords

19
(b) WasherSpinCycleWindDown4BeepEndSignal.wav
R0: A machine is running at first then slows down
R1: An airplane engine is at a high idle and slows down to a slower
Pred.: A machine is running and then stops
Est. keywords: ['machine', 'run', 'engine', 'turn', 'move']
[Figure: input spectrogram (Freq. [kHz] vs. Time [s]); keyword probabilities (Probability vs. Time [s]); decoder attention (Memory ind. (T+K), split into Audio and Key., vs. Text position)]
The {machine, airplane} ambiguity is resolved by the keywords

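20
(c) Door.wav
R0: A person opens a door with a key then he closes the door from
R1: A door is being unlatched, creaking open and being fastened again
Pred.: A person is opening and closing a door
Est. keywords: ['door', 'open', 'close', 'times', 'someone']
[Figure: input spectrogram (Freq. [kHz] vs. Time [s]); keyword probabilities (Probability vs. Time [s]); decoder attention (Memory ind. (T+K), split into Audio and Key., vs. Text position)]
The {close, fasten} ambiguity is resolved by the keywords
■ Even though the clip contains no human sound, the co-occurrence that "opening a
door" is usually done by a person has been learned and inferred automatically
■ This inference was not possible with conventional acoustic event detection; it
became possible because the problem was cast as a cross-modal task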

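Take-home message
21
■ What is Audio Captioning?
  ■ A task of generating a sentence that describes an environmental sound
■ What was the challenge?
  ■ Many words can describe a single acoustic event or scene, so the candidate
  reference sentences explode combinatorially → unstable training, etc.
■ How did we solve it?
  ■ We applied weakly labeled acoustic event detection to jointly estimate the keywords
  of the sentence, and conditioned sentence generation on them to reduce the ambiguity
■ What happened?
  ■ Performance improved over conventional methods. The decoder's behavior also became
  easier to analyze, which improved the interpretability of the results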
