Hung-yi Lee (李宏毅) / When Speech Processing Meets Deep Learning

Deep Learning
and its Application to
Speech Processing
Hung-yi Lee

[Diagram: Spoken Content → Speech Recognition → Recognition Output]
Speech Recognition
How to do speech recognition with deep learning?
Deep
Learning

People imagine ……
[Figure: a DNN that outputs “大家好我今天 ….” (“Hello everyone, today I ….”)]
This is not true!
A DNN can only take fixed-length vectors as input and output.
Here the input and output are sequences with different lengths.

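Recurrent Neural Network
How about Recurrent Neural Network (RNN)?
[Figure: unrolled RNN. Inputs x1, x2, x3 produce outputs y1, y2, y3; the same weights Wi, Wh, and Wo are reused at every time step, with Wh linking one step to the next.]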
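The recurrence in the figure can be sketched in a few lines. This is only an illustration: the names Wi, Wh, and Wo come from the slide, while the tanh activation, the zero initial state, and all sizes and numbers are assumptions made for the example.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, Wi, Wh, Wo):
    """Run a simple RNN over a list of input vectors, one output per input."""
    h = [0.0] * len(Wh)              # assumed initial hidden state
    ys = []
    for x in xs:                     # the same Wi, Wh, Wo are reused at every step
        pre = [p + q for p, q in zip(matvec(Wi, x), matvec(Wh, h))]
        h = [math.tanh(v) for v in pre]
        ys.append(matvec(Wo, h))
    return ys

# Made-up weights: 2-dim input, 2-dim hidden state, 1-dim output
Wi = [[0.5, -0.2], [0.1, 0.3]]
Wh = [[0.0, 0.4], [-0.3, 0.2]]
Wo = [[1.0, -1.0]]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = rnn_forward(xs, Wi, Wh, Wo)
print(len(xs), len(ys))
```

Note that the output sequence always has exactly as many elements as the input sequence, which is the limitation the next slides address.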

How about Recurrent Neural Network (RNN)?
Input: (vector sequence), one vector per 0.01s
Output: (character sequence) 好 好 好 棒 棒 棒 棒 棒
Trimming: “好棒” ("great")
Problem? Why can’t it be “好棒棒”?

•  Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]
Add an extra symbol “φ” representing “null”
好 φ φ 棒 φ φ φ φ → “好棒”
好 φ φ 棒 φ 棒 φ φ → “好棒棒”
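A minimal sketch of why the null symbol helps, using the frame sequences from the slides. Only the collapsing step is shown (merge repeated symbols, then drop “φ”); training with the CTC loss is not covered, and the function names are made up for illustration.

```python
from itertools import groupby

BLANK = "φ"

def trim(frames):
    """Naive trimming: merge each run of identical symbols into one."""
    return "".join(sym for sym, _ in groupby(frames))

def ctc_collapse(frames, blank=BLANK):
    """CTC-style collapse: merge runs of identical symbols, then drop the blank."""
    return "".join(sym for sym, _ in groupby(frames) if sym != blank)

print(trim("好好好棒棒棒棒棒"))        # naive trimming can never output a doubled character
print(ctc_collapse("好φφ棒φφφφ"))
print(ctc_collapse("好φφ棒φ棒φφ"))    # the φ between the two 棒 keeps them separate
```

With plain trimming, consecutive identical characters always merge, so “好棒棒” is unreachable; a “φ” between two identical characters keeps both.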

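Sequence-to-sequence Learning
•  Sequence-to-sequence learning: the input and output are both sequences, with different lengths.
•  Example: acoustic feature sequence → character sequence
[Figure: the network reads the input sequence and ends with a vector containing all information about the input utterance; target output: “機器學習” ("machine learning")]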

[Figure: from the vector for “機器學習”, the network outputs 機, 器, 學, 習 and then keeps going: 慣, 性, ……]
Don’t know when to stop

[Figure: from the vector for “機器學習”, the network outputs 機, 器, 學, 習, 。 and stops]
Add a symbol “。” (period) to mark the end
[Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]
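The stopping rule can be sketched as a decoding loop. The `step` function below is a hypothetical stand-in for a trained decoder (here just a lookup table), and `max_len` is an assumed safety limit; neither comes from the slides.

```python
END = "。"

def decode(step, max_len=20):
    """Call step(previous_symbol) repeatedly until it emits END or max_len is reached."""
    out, prev = [], None
    for _ in range(max_len):
        sym = step(prev)
        if sym == END:        # the end symbol tells the decoder to stop
            break
        out.append(sym)
        prev = sym
    return "".join(out)

# Hypothetical decoder: a fixed table from the previous symbol to the next one
NEXT = {None: "機", "機": "器", "器": "學", "學": "習", "習": END}
print(decode(NEXT.get))
```

Without an end symbol in its vocabulary, the loop would only stop at `max_len`, which mirrors the "don't know when to stop" problem.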

Spoken Content Retrieval
[Diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result]

People think ……
•  Transcribe spoken content into text by speech recognition
•  Use text retrieval approach to search the transcriptions
[Diagram: Spoken Content → Speech Recognition Models → Text → Text Retrieval → Retrieval Result; the Query comes from the learner; speech recognition is treated as a Black Box]

People think ……
Spoken Content Retrieval = Speech Recognition + Text Retrieval

Problem?
•  Good spoken content retrieval needs a good speech recognition system.
•  In real applications, such high-quality recognition models are not available
•  E.g., YouTube
•  Different languages/accents
•  Different recording environments
•  Hope for spoken content retrieval
•  Don’t completely rely on accurate speech recognition
•  Accurate spoken content retrieval, even under poor speech recognition

Spoken Content Retrieval
[Diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result, marked “Beyond Cascading?”]
•  Is the cascading of speech recognition and text retrieval the only solution to spoken content retrieval?

Beyond Cascading Speech Recognition and Text Retrieval
•  5 directions
•  Modified Speech Recognition for Retrieval Purposes
•  Exploiting Information not present in ASR outputs
•  Directly Matching on Acoustic Level without ASR
•  Semantic Retrieval of Spoken Content
•  Interactive Retrieval and Efficient Presentation of Retrieved Objects
Overview paper "Spoken Content Retrieval —Beyond
Cascading Speech Recognition with Text Retrieval"
http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf

Our Point
Spoken Content Retrieval
Speech Recognition
+
Text Retrieval
=

Interact with Humans
[Diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result, marked “Beyond Cascading?”; further module: Interaction with the user]

Semantic Analysis
[Diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result, marked “Beyond Cascading?”; further modules: Semantic Analysis, Interaction with the user]

Unsupervised Learning
•  Machine reads lots of text on the Internet ……
蔡英文 520宣誓就職 (“Tsai Ing-wen sworn into office on May 20”)
馬英九 520宣誓就職 (“Ma Ying-jeou sworn into office on May 20”)
蔡英文 and 馬英九 are something very similar
“You shall know a word by the company it keeps”
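One simple way to make "the company it keeps" concrete is to represent each word by counts of its neighbouring words and compare those counts. This is a count-based sketch, not the learning method used in the talk; the word segmentation of the two example sentences, the window size, and the third (English) sentence are assumptions added for contrast.

```python
import math
from collections import Counter

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vecs = {}
    for words in sentences:
        for i, w in enumerate(words):
            ctx = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    ["蔡英文", "520", "宣誓", "就職"],     # assumed segmentation of the slide's sentence
    ["馬英九", "520", "宣誓", "就職"],
    ["dog", "chases", "the", "cat"],       # unrelated toy sentence, not from the slides
]
v = context_vectors(corpus)
print(cosine(v["蔡英文"], v["馬英九"]))    # identical contexts, so similarity is close to 1
print(cosine(v["蔡英文"], v["dog"]))       # no shared context words
```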

Semantic Analysis
•  Let machine read lots of documents.
•  Each word is represented as a vector
[Figure: words plotted as points in a vector space: dog, cat, rabbit, jump, run, flower, tree]

Semantic Analysis
•  Even the distances between the vectors have some meaning.
Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014

Key Term Extraction [Interspeech 2015] (with 沈昇勳)
[Diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result, marked “Beyond Cascading?”; further modules: Semantic Analysis, Key Term Extraction, Interaction with the user]

Summarization
[Diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result, marked “Beyond Cascading?”; further modules: Semantic Analysis, Key Term Extraction, Summarization, Interaction with the user]

Speech Summarization
Extractive Summaries
Select the most informative segments to form a compact version
[Figure: Retrieved Audio File (1 hour long) → Summary (10 minutes)]
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html

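Speech Summarization
•  Write the summary in your own words (Abstractive Summaries)
•  Machine learns to do abstractive summarization from 2,000,000 training examples
[Figure: example summaries written by Human and by Machine]
NTU Department of Electrical Engineering: 盧柏儒、徐翊祥
NTU Department of Computer Science and Information Engineering: 葉正杰、周儒杰
(TA: 余朗祺)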

Question Answering
[Diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result, marked “Beyond Cascading?”; further modules: Semantic Analysis, Key Term Extraction, Summarization, Question-answering (question in, answer out), Interaction with the user]

Without Speech Recognition?
[Diagram: Spoken Content → Speech Recognition → Recognition Output → Retrieval → Retrieval Result, marked “Beyond Cascading?”; further modules: Semantic Analysis, Key Term Extraction, Summarization, Question-answering (question in, answer out), Interaction with the user]

Outline
Very Brief Introduction of Deep Learning
Towards Machine Comprehension of Spoken Content
•  Overview
•  Example I: Speech Question Answering
•  Example II: Interactive Spoken Content Retrieval
•  Example III: What can machine learn from audio without any supervision

Speech Question Answering
•  Machine answers questions based on the information in spoken content
Question: “What is a possible origin of Venus’ clouds?” → ……… answer

Speech Question Answering
•  TOEFL Listening Comprehension Test by Machine
•  Example:
Audio Story: (The original story is 5 min long.)
Question: “What is a possible origin of Venus’ clouds?”
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere

Simple Baselines
[Bar chart: Accuracy (%) of Naive Approaches (1) to (7), with the random level marked]
(2) select the shortest choice as answer
(4) select the choice that is semantically most similar to the other choices
Experimental setup: 717 for training, 124 for validation, 122 for testing
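Two of these naive approaches are easy to sketch on the example choices from the earlier slide. Baseline (2) is implemented as stated. For baseline (4), the slide's semantic similarity measure is not specified, so word overlap (Jaccard similarity) is used here purely as an assumed stand-in.

```python
def shortest_choice(choices):
    """Baseline (2): pick the choice with the fewest words."""
    return min(choices, key=lambda k: len(choices[k].split()))

def jaccard(a, b):
    """Word-overlap similarity between two strings (stand-in for semantic similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def most_similar_choice(choices):
    """Baseline (4), approximated: pick the choice most similar to all the others."""
    def score(k):
        return sum(jaccard(choices[k], choices[j]) for j in choices if j != k)
    return max(choices, key=score)

choices = {
    "A": "gases released as a result of volcanic activity",
    "B": "chemical reactions caused by high surface temperatures",
    "C": "bursts of radio energy from the planet's surface",
    "D": "strong winds that blow dust into the atmosphere",
}
print(shortest_choice(choices), most_similar_choice(choices))
```

Neither baseline looks at the audio story or the question, which is why they serve only as a floor for the learned models.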

Supervised Learning
[Bar chart: Accuracy (%) of Naive Approaches (1) to (7)]
Memory Network: 39.2% (proposed by FB AI group)
Interspeech 2016 (with 曾柏翔)

Model Architecture
Question: “what is a possible origin of Venus’ clouds?” → Semantic Analysis → Question Semantics
Audio Story → Speech Recognition → Semantic Analysis
Recognition output for the story: “…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano erupt on earth ……”
Attention (highlighting the key points) over the story, driven by the question semantics → Answer
Select the choice most similar to the answer
The attention mechanism is similar to Memory Network
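The attention step can be sketched with plain vectors. This is an illustration only: the dot-product scoring, the softmax weighting, and every number below are assumptions, and the vectors stand in for whatever the semantic analysis modules would produce.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(question, story):
    """Weight each story vector by its match with the question, then take the weighted sum."""
    weights = softmax([dot(question, s) for s in story])
    answer = [sum(w * s[i] for w, s in zip(weights, story)) for i in range(len(question))]
    return weights, answer

def pick_choice(answer, choices):
    """Select the choice whose vector has the largest dot product with the answer vector."""
    return max(choices, key=lambda k: dot(answer, choices[k]))

# Made-up 3-dim vectors for the question, three story sentences, and two choices
question = [1.0, 0.0, 0.0]
story = [[4.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.0, 2.0]]
choices = {"A": [1.0, 1.0, 0.0], "B": [0.0, 0.0, 1.0]}

weights, answer = attend(question, story)
print(pick_choice(answer, choices))
```

The story sentence that best matches the question receives most of the weight, so it dominates the answer vector that is then compared with each choice.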

Model Architecture
Word-based Attention

Model Architecture
Sentence-based Attention


Supervised Learning
[Bar chart: Accuracy (%) of Naive Approaches (1) to (7)]
Memory Network: 39.2% (proposed by FB AI group)
Word-based Attention: 48.3%
Interspeech 2016 (with 曾柏翔)

Interact with Users
•  Interactive retrieval is helpful.
user: “深度學習” (“deep learning”)
system: Do you mean the “深度學習” related to machine learning, or the “深度學習” related to education?

Audio is hard to browse
•  When the system returns the retrieval results, the user doesn’t know what he/she gets at first glance
[Figure: Retrieval Result]

Interactive retrieval of spoken content
[Diagram: the user sends a query to Spoken Content Retrieval, which searches the Spoken Content and produces Retrieval Results]
Directly showing the retrieval results is probably not a good idea.

Interactive retrieval of spoken content
[Diagram: the user sends a query to Spoken Content Retrieval, which searches the Spoken Content and produces Retrieval Results]
Given the current situation, which action should be taken?
•  “Give me an example.”
•  “Is it relevant to XXX?”
•  “Can you give me another query?”
•  “Show the results.”
•  ……

Interactive retrieval of spoken content
[Diagram: the user sends a query to Spoken Content Retrieval, which searches the Spoken Content and produces Retrieval Results; features → State Estimation → state → Action Decision → action]
state: the degree of clarity from the retrieval results
Decide the actions by intrinsic policy π(s)
•  The policy π(s) is a function
•  Input: state s, output: action a
[Interspeech 2012][ICASSP 2013]

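Interactive retrieval of spoken content
[Diagram: the user sends a query to Spoken Content Retrieval, which searches the Spoken Content and produces Retrieval Results; features → DNN (covering both State Estimation and Action Decision) → one output per action (“Is it relevant to XXX?”, “Give me an example.”, “Show the results.”, ……) → Max]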

Interactive retrieval of spoken content
[Diagram: the user sends a query to Spoken Content Retrieval, which searches the Spoken Content and produces Retrieval Results; features → DNN → one output per action (“Is it relevant to XXX?”, “Give me an example.”, “Show the results.”, ……) → Max]
Learned from historical interaction
Goal: maximizing return (Retrieval Quality - User labor)
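The "Max" step and the return can be sketched as follows, assuming the DNN emits one score per candidate action. The linear scoring layer, the feature values, and the labor costs are hypothetical placeholders; the trained network and its training procedure are not shown.

```python
ACTIONS = ["Is it relevant to XXX?", "Give me an example.", "Show the results."]

def action_scores(features, weights):
    """Stand-in for the DNN: one linear score per action (weights are made up)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def choose_action(scores):
    """The "Max" box: take the action with the highest score."""
    best = max(range(len(ACTIONS)), key=lambda i: scores[i])
    return ACTIONS[best]

def episode_return(retrieval_quality, labor_costs):
    """Return = final retrieval quality minus the total user labor spent."""
    return retrieval_quality - sum(labor_costs)

features = [0.2, 0.9]                              # hypothetical features of the current results
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # hypothetical, one row per action
print(choose_action(action_scores(features, weights)))
print(episode_return(0.8, [0.1, 0.2]))
```

Each extra question costs the user some labor, so asking is only worthwhile when it improves retrieval quality by more than that cost.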

Experimental Results
•  Broadcast news, semantic retrieval
•  Optimization Target: Retrieval Quality - User labor
[Bar chart: Retrieval Quality (MAP); labels: Hand-crafted, Deep Learning, Previous Method, (state + decision)]
submitted to Interspeech 2016 (with 吳彥諶、林子翔)

Unsupervised Learning
Machine listens to lots of audio books

(TA: )
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder (accepted by Interspeech 2016)

Audio Word to Vector
•  Consider an audio segment corresponding to an unknown word
[Figure: audio segment → Deep Learning → vector]
(with TA 沈家豪)

•  The audio segments corresponding to words with similar pronunciations are close to each other.
[Figure: Deep Learning maps audio segments to vectors; segments of “ever” and “never” lie near each other, as do segments of “dog” and “dogs”]

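How to evaluate
[Figure: audio segments of “never” and “ever” each go through Deep Learning to give a vector; the Cosine Similarity of the two vectors is compared with the Phoneme sequence edit distance of the two words]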

Experimental Results
More similar pronunciation → larger cosine similarity.
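The two quantities being compared can be sketched as follows. The phoneme sequences use ARPAbet-style symbols and are an assumption for illustration; the embedding vectors themselves would come from the trained model and are not reproduced here.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute, or match at no cost
        prev = cur
    return prev[-1]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))

never = ["N", "EH", "V", "ER"]
ever = ["EH", "V", "ER"]
dog = ["D", "AO", "G"]
print(edit_distance(never, ever), edit_distance(never, dog))
```

The evaluation then checks whether word pairs with a small phoneme edit distance also tend to have a large cosine similarity between their audio embeddings.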

Interesting Observation
•  Projecting the embedding vectors to 2-D
[Figure: 2-D plot of the embeddings for “day”, “days”, “say”, and “says”]

Spoken Content Retrieval without Speech Recognition
[Diagram: the user speaks a spoken query (“US President”) into a Handheld device; the query is matched against the Spoken Content]
Computing similarity between spoken queries and audio files on signal level
[Hazen, ASRU 09] [Zhang Glass, ASRU 09] [Chan Lee, Interspeech 10] [Zhang Glass, ICASSP 11] [Gupta, Interspeech 11] [Zhang Glass, Interspeech 11] [Huijbregts, ICASSP 11] [Chan Lee, Interspeech 11]
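The slide does not name a matching algorithm. Dynamic time warping (DTW) is one common way to compare two feature sequences of different lengths, so the sketch below assumes DTW with Euclidean frame distance. It aligns a whole query with a whole segment; searching inside long audio files would need a subsequence variant, which is not shown.

```python
import math

def dtw(query, doc):
    """Dynamic time warping cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(query), len(doc)
    # D[i][j] = best cost of aligning the first i query frames with the first j doc frames
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(query[i - 1], doc[j - 1])   # Euclidean frame distance
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Toy 1-dim "features": the second sequence is a slower version of the first
fast = [[0.0], [1.0]]
slow = [[0.0], [0.0], [1.0], [1.0]]
other = [[5.0], [6.0]]
print(dtw(fast, slow), dtw(fast, other))
```

Because the alignment may stretch or compress time, the same word spoken at different speeds can still match with low cost, and no transcription is needed.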

Speech Recognition
•  Why spoken content retrieval without speech recognition?
•  Lots of audio files in different languages on the Internet
•  Most languages have little annotated data for training speech recognition systems.
•  Some audio files are produced in several different languages
•  Some languages do not even have a written form

Speech Recognition

Concluding Remarks
Very Brief Introduction of Deep Learning
Towards Machine Comprehension of Spoken Content
•  Overview
•  Example I: Speech Question Answering
•  Example II: Interactive Spoken Content Retrieval
•  Example III: What can machine learn from audio without any supervision
