11. People think ……
• Transcribe spoken content into text by speech recognition
• Use a text retrieval approach to search the transcriptions
[Diagram labels: Spoken Content, Speech Recognition Models, Black Box, Text, Text Retrieval, Query, learner, Retrieval Result]
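The cascade above can be sketched in a few lines. This is a minimal illustration, not the system shown on the slide: it assumes the transcriptions have already been produced by some speech recognizer (not shown), and the function names `build_index` and `search` and the TF-IDF-style scoring are hypothetical choices made for this example.

```python
import math
from collections import Counter


def build_index(transcripts):
    """Count words per transcript and document frequency per word.

    `transcripts` maps an audio file name to its ASR output (assumed given).
    """
    docs = {name: Counter(text.lower().split()) for name, text in transcripts.items()}
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    return docs, df


def search(query, docs, df):
    """Rank audio files by a simple TF-IDF-style score against the text query."""
    n = len(docs)
    scores = {}
    for name, counts in docs.items():
        score = sum(
            counts[term] * math.log(1 + n / df[term])
            for term in query.lower().split()
            if term in counts
        )
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical transcriptions standing in for ASR output.
transcripts = {
    "a.wav": "the us president gave a speech",
    "b.wav": "a lecture on deep learning",
}
docs, df = build_index(transcripts)
print(search("us president", docs, df))
```

Because the search only ever sees the transcriptions, any word the recognizer gets wrong is invisible to retrieval.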
13. Problem?
• Good spoken content retrieval needs a good speech recognition system.
• In real applications, such high-quality recognition models are not available
• E.g., YouTube
• Different languages/accents
• Different recording environments
• Hope for spoken content retrieval
• Don't rely completely on accurate speech recognition
• Accurate spoken content retrieval, even under poor speech recognition
15. Beyond Cascading Speech Recognition and Text Retrieval
• 5 directions
• Modified Speech Recognition for Retrieval Purposes
• Exploiting Information not present in ASR outputs
• Directly Matching on Acoustic Level without ASR
• Semantic Retrieval of Spoken Content
• Interactive Retrieval and Efficient Presentation of Retrieved Objects
Overview paper "Spoken Content Retrieval —Beyond
Cascading Speech Recognition with Text Retrieval"
http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf
24. Speech Summarization
• Extractive summaries: select the most informative segments to form a compact version
[Diagram labels: Retrieved Audio File (1 hour long), Summary (10 minutes)]
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html
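Extractive summarization as described on this slide (pick the most informative segments until the summary is short enough) can be sketched as a greedy selection. This is only an illustration under stated assumptions: the informativeness scores are taken as given, and the function name `extractive_summary` and the greedy strategy are hypothetical, not the method used in the referenced lecture.

```python
def extractive_summary(segments, budget_seconds):
    """Greedily pick the highest-scoring segments that fit in the time budget.

    `segments` is a list of (start_sec, end_sec, score) tuples; the scores
    are assumed to come from some informativeness model (not shown).
    The chosen segments are returned in their original time order.
    """
    ranked = sorted(segments, key=lambda seg: seg[2], reverse=True)
    chosen, used = [], 0.0
    for start, end, score in ranked:
        duration = end - start
        if used + duration <= budget_seconds:
            chosen.append((start, end, score))
            used += duration
    return sorted(chosen)


# Hypothetical segments: four 10-second segments with made-up scores.
segments = [(0, 10, 0.2), (10, 20, 0.9), (20, 30, 0.5), (30, 40, 0.8)]
print(extractive_summary(segments, budget_seconds=20))
```

Returning the segments in time order keeps the summary playable as a shortened version of the original audio.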
25. Speech Summarization
• Write the summary in your own words (Abstractive Summaries)
• Machine learns to do abstractive summarization from 2,000,000 training examples
[Figure labels: Human, Machine]
NTU Electrical Engineering: 盧柏儒, 徐翊祥
NTU Computer Science and Information Engineering: 葉正杰, 周儒杰
(TA: 余朗祺)
29. Speech Question Answering
• Machine answers questions based on the information in spoken content
[Diagram labels: question "What is a possible origin of Venus' clouds?", ………, answer]
30. Speech Question Answering
• TOEFL Listening Comprehension Test by Machine
• Example:
Audio Story: (The original story is 5 min long.)
Question: "What is a possible origin of Venus' clouds?"
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
31. Simple Baselines
[Chart: accuracy (%) of naive approaches (1) through (7); "random" is marked on the chart]
• (2) Select the shortest choice as the answer
• (4) Select the choice that is semantically most similar to the other choices
Experimental setup: 717 for training, 124 for validation, 122 for testing
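Two of the naive approaches named on this slide can be sketched without looking at the audio at all. This is a rough illustration under stated assumptions: choice length is measured in words, and word-set (Jaccard) overlap is swapped in as a crude stand-in for semantic similarity, since the slide does not say how similarity was computed. The function names are hypothetical.

```python
def shortest_choice(choices):
    """Baseline (2): return the index of the choice with the fewest words."""
    return min(range(len(choices)), key=lambda i: len(choices[i].split()))


def jaccard(a, b):
    """Word-set overlap between two strings (a stand-in for semantic similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def most_similar_choice(choices):
    """Baseline (4): return the index of the choice most similar to the others."""
    def total_similarity(i):
        return sum(jaccard(choices[i], choices[j]) for j in range(len(choices)) if j != i)
    return max(range(len(choices)), key=total_similarity)


choices = [
    "gases released as a result of volcanic activity",
    "chemical reactions caused by high surface temperatures",
    "bursts of radio energy from the planet's surface",
    "strong winds that blow dust into the atmosphere",
]
print("shortest:", "ABCD"[shortest_choice(choices)])
print("most similar to others:", "ABCD"[most_similar_choice(choices)])
```

Neither function uses the story or the question, which is what makes them useful as a floor for comparison.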
51. Audio Word to Vector
• Consider an audio segment corresponding to an unknown word
[Diagram label: Deep Learning]
with (TA: 沈家豪)
52. Audio Word to Vector
• The audio segments corresponding to words with similar pronunciations are close to each other.
[Diagram label: Deep Learning]
53. Audio Word to Vector
• The audio segments corresponding to words with similar pronunciations are close to each other.
[Figure: plotted audio segments labeled "ever", "never", "dog", and "dogs"; diagram label: Deep Learning]
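The "close to each other" property can be checked with a nearest-neighbour lookup once each audio segment has a fixed-length vector. The sketch below is only an illustration: the 2-dimensional vectors are made-up toy values, not outputs of the model on the slide, and the names `cosine` and `nearest` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def nearest(name, vectors):
    """Return the name of the segment whose vector is most similar to `name`."""
    others = (other for other in vectors if other != name)
    return max(others, key=lambda other: cosine(vectors[name], vectors[other]))


# Toy vectors (assumed, for illustration only): one per audio segment.
vectors = {
    "ever_1": [0.9, 0.1],
    "never_1": [0.8, 0.3],
    "dog_1": [0.1, 0.9],
    "dogs_1": [0.2, 0.95],
}
print(nearest("dog_1", vectors))
print(nearest("ever_1", vectors))
```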
58. Spoken Content Retrieval without Speech Recognition
• Computing similarity between spoken queries and audio files on signal level
[Diagram labels: user, handheld device, spoken query "US President", Spoken Content]
[Hazen, ASRU 09] [Zhang & Glass, ASRU 09] [Chan & Lee, Interspeech 10] [Zhang & Glass, ICASSP 11] [Gupta, Interspeech 11] [Zhang & Glass, Interspeech 11] [Huijbregts, ICASSP 11] [Chan & Lee, Interspeech 11]
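One common way to compare two utterances at the signal level is dynamic time warping (DTW) over frame-level feature sequences. The slide does not name a specific algorithm, so DTW is an assumption here, and the sketch below is a plain textbook version. The feature frames are taken as given (for example, one short vector per time frame); the function name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature frames.

    `a` and `b` are lists of equal-dimension vectors (one per time frame).
    Frames are compared with Euclidean distance; the alignment may stretch
    or compress time, so speaking rate differences are tolerated.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy 1-dimensional "features": the second sequence is a slower version of the first.
query = [(0.0,), (1.0,), (2.0,)]
slow_query = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]
other = [(5.0,), (5.0,), (5.0,)]
print(dtw_distance(query, slow_query), dtw_distance(query, other))
```

A smaller distance means the spoken query and the audio segment are more alike, with no transcription step in between.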
59. Spoken Content Retrieval without Speech Recognition
• Why spoken content retrieval without speech recognition?
• Lots of audio files in different languages on the Internet
• Most languages have little annotated data for training speech recognition systems.
• Some audio files are produced in several different languages
• Some languages do not even have a written form