When Attention Meets Speech Applications
Kyu J. Han, Ramon Prieto, Tao Ma
ASAPP, One World Trade Center, 80th Floor, New York, NY 10007
asapp.com
September 16, 2019
Intro
“ATTENTION” In Interspeech 2019
Very Deep Self-attention Networks for End-to-End Speech Recognition
Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention
Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile
Phonetically-aware embeddings - Wide Residual Networks with Time-Delay Neural Networks and Self Attention models for the 2018 NIST Speaker Recognition Evaluation
A Hierarchical Attention Network-Based Approach for Depression Detection from Transcribed Clinical Interviews
RWTH ASR System for LibriSpeech: Hybrid vs Attention
Speaker Adaptation for Attention-Based End-to-End Speech Recognition
Large Margin Training for Attention Based End-to-End Speech Recognition
Predicting Group-Level Skin Attention to Short Movies from Audio-Based LSTM-Mixture of Experts Models
Attention model for articulatory features detection
Attention based Hybrid I-vector BLSTM Model for Language Recognition
Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS
Self Attention in Variational Sequential Learning for Summarization
Speech Emotion Recognition in Dyadic Dialogues with Attentive Interaction Modeling
Conversational Emotion Analysis via Attention Mechanisms
An analysis of local monotonic attention variants
Lattice generation in attention-based speech recognition models
A Time Delay Neural Network with Shared Weight Self-Attention for Small-Footprint Keyword Spotting
Individual differences in implicit attention to phonetic detail in speech perception
Learning how to listen: A temporal-frequential attention model for sound event detection
An Online Attention-based Model for Speech Recognition
Online Hybrid CTC/Attention Architecture for End-to-end Speech Recognition
The influence of distraction on speech processing: How selective is selective attention?
Environment-dependent Attention-driven Recurrent Convolutional Neural Network for Robust Speech Enhancement
Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation
Using Attention Networks and Adversarial Augmentation for Styrian Dialect Continuous Sleepiness and Baby Sound Recognition
Multi-task multi-resolution char-to-BPE cross-attention decoder for end-to-end speech recognition
Multi-Stride Self-Attention for Speech Recognition
Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning
Attention-based word vector prediction with LSTMs and its application to the OOV problem in ASR
Multi-stream Network With Temporal Attention For Environmental Sound Classification
Few-Shot Audio Classification with Attentional Graph Neural Networks
Vectorized Beam Search for CTC-Attention-based Speech Recognition
Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition
Spatio-Temporal Attention Pooling for Audio Scene Classification
Multi-Scale Time-Frequency Attention for Rare Sound Event Detection
A new time-frequency attention mechanism for TDNN and CNN-LSTM-TDNN - with application to language identification
An Attention-Based Hybrid Network for Automatic Detection of Alzheimer’s Disease from Narrative Speech
Automatic Hierarchical Attention Neural Network for Detecting Alzheimer’s Disease
Neural Text Clustering with Document-level Attention based on Dynamic Soft Labels
End-to-End Multi-Channel Speech Enhancement Using Inter-Channel Time-Restricted Attention on Raw Waveform
Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition
Cross-Attention End-to-End ASR for Two-Party Conversations
Self-Attention Transducers for End-to-End Speech Recognition
Variational Attention using Articulatory Priors for generating Code Mixed Speech using Monolingual Corpora
Intro
● Around 50 papers with titles including “ATTENTION”
● Applied across diverse areas
○ Speech recognition
○ Speaker recognition
○ Language recognition
○ Emotion recognition
○ Speech synthesis
○ Audio classification
○ Event detection
○ Semantic classification
“ATTENTION” In Interspeech 2019
TABLE OF CONTENTS
1. Attention
2. Attention in Speech Recognition
3. Attention in Speaker Recognition
4. Pay Attention to Challenges!
5. Conclusions / Q&A
Attention
ATTENTION
● Understands where to pay more attention
● Common to humans
○ Visual attention
○ Auditory attention
○ Social attention
● Common to human decision making
○ Family meeting
○ House price
● In neural networks,
○ “Generating sequences with RNNs”, by A. Graves (2013)
■ Soft windowing
■ Gaussian convolution
■ Location-aware attention
○ “Neural machine translation by jointly learning to align and translate”, by D. Bahdanau, K. Cho and Y. Bengio (2014/2015)
■ Content-aware attention (sketched below)
○ “Attention is all you need”, by A. Vaswani, et al. (2017)
■ Multi-head attention
■ No recurrence
[Images: commons.wikimedia.org, giphy.com, cbsnews.com, metroatlantahome.com]
A. Graves, “Generating sequences with recurrent neural networks”, 2013.
D. Bahdanau, et al., “Neural machine translation by jointly learning to align and translate”, 2014/2015.
A. Vaswani, et al., “Attention is all you need”, 2017.
TABLE OF CONTENTS
1. Attention
2. Attention in Speech Recognition
3. Attention in Speaker Recognition
4. Pay Attention to Challenges!
5. Conclusions / Q&A
Attention in End-to-End ASR
[Figure: CTC, RNN Transducer, and Seq-to-Seq architectures]
R. Prabhavalkar, et al., “A comparison of sequence-to-sequence models for speech recognition”, 2017.
Attention in End-to-End ASR
● CTC + attention (2018)
○ Hybrid attention
○ Implicit LM
○ Component attention
○ 20% relative improvement in WER
A. Das, et al., “Advancing connectionist temporal classification with attention modeling”, 2018.
Attention in End-to-End ASR
● RNN-T + attention (2017)
○ Combines RNN-T w/ attention
○ Content-aware attention
○ Marginal improvement obtained
[Figure: RNN-T vs. RNN-T w/ attention]
R. Prabhavalkar, et al., “A comparison of sequence-to-sequence models for speech recognition”, 2017.
Attention in End-to-End ASR
[Figure: Seq-to-Seq architecture]
R. Prabhavalkar, et al., “A comparison of sequence-to-sequence models for speech recognition”, 2017.
First Attention in Speech
● Same structure as Bahdanau’s neural translation model (2014/15)
○ Encoder-decoder architecture w/ attention
○ Content-aware attention
J. Chorowski, et al., “End-to-end continuous speech recognition using attention-based recurrent NN: First results”, 2014/15.
Attention-based Recurrent Sequence Generator
● ARSG using hybrid attention (2015)
○ Addressed the limitation of content-aware attention → hybrid attention (sketched below)
○ (F: convolving matrix)
J. Chorowski, et al., “Attention-based models for speech recognition”, 2014/15.
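A hedged sketch of the hybrid scoring in the ARSG: the content term of Bahdanau-style attention plus a location term obtained by convolving the previous alignment with the convolving matrix F. All parameters are random stand-ins, not trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, D, K, width = 50, 256, 128, 8, 11   # K filters of the given width

h = rng.standard_normal((T, H))           # encoder states
s = rng.standard_normal(H)                # previous decoder state
alpha_prev = np.full(T, 1.0 / T)          # previous alignment weights

W = rng.standard_normal((D, H))           # random stand-ins for weights
U = rng.standard_normal((D, H))
F = rng.standard_normal((K, width))       # convolving matrix F
M = rng.standard_normal((D, K))
v = rng.standard_normal(D)

# Location features f_t: K values per step from convolving alpha_prev with F
f = np.stack([np.convolve(alpha_prev, F[k], mode="same")
              for k in range(K)], axis=1)             # (T, K)

# Hybrid energies: e_t = v^T tanh(W s + U h_t + M f_t)
e = np.tanh(W @ s + h @ U.T + f @ M.T) @ v
alpha = np.exp(e - e.max()); alpha /= alpha.sum()
context = alpha @ h
```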
Improved ARSG
● Two improvements for LVCSR (2016)
○ Windowing on attention during training
○ Frame pooling
■ Similar to LAS’s pyramidal encoder structure
D. Bahdanau, et al., “End-to-end attention-based large vocabulary speech recognition”, 2016.
Multi-Objective Training
● Combination w/ CTC objective (2017)
○ Joint CTC/attention decoding (sketched below)
○ Main model architecture in ESPnet (https://github.com/espnet/espnet)
S. Watanabe, et al., “Hybrid CTC/attention architecture for end-to-end speech recognition”, 2017.
Listen, Attend and Spell
● LAS (2015)
○ Pyramidal encoder structure from downsampling (sketched below)
○ Content-aware attention
W. Chan, et al., “Listen, attend and spell”, 2015.
Further Development of LAS
● Multi-head attention (2018)
○ Inspired by the Transformer (A. Vaswani, 2017)
○ Replacing single-head attention
● SpecAugment (2019)
○ Data augmentation for LAS (sketched below)
○ Achieved state-of-the-art results on LibriSpeech and SWBD
C. Chiu, et al., “State-of-the-art speech recognition with sequence-to-sequence models”, 2018.
D. Park, et al., “SpecAugment: A simple data augmentation method for automatic speech recognition”, 2019.
Performance of Seq-to-Seq w/ Attention
[Table: WER results on LibriSpeech and SWBD]
D. Park, et al., “SpecAugment: A simple data augmentation method for automatic speech recognition”, 2019.
TABLE OF CONTENTS
1. Attention
2. Attention in Speech Recognition
3. Attention in Speaker Recognition
4. Pay Attention to Challenges!
5. Conclusions / Q&A
Feedforward Sequential Memory Network
● Non-recurrent structure
○ Inspired by FIR approximation of IIR filters
○ Exploits memory blocks (sketched below)
○ Can model long-term dependency even without recurrence in its structure
[Figure: recurrent feedback in an RNN as an IIR filter; memory blocks in an FSMN as an FIR filter]
S. Zhang, et al., “Feedforward sequential memory networks without recurrent feedback”, 2015.
[Figure: FSMN, cFSMN, and Deep-FSMN architectures]
S. Zhang, et al., “Deep-FSMN for large vocabulary continuous speech recognition”, 2018.
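A minimal sketch of an FSMN memory block: an FIR-like weighted sum over past and future hidden states added to the current one, with no recurrent connection. The tap vectors a and c are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
T, H, N1, N2 = 100, 64, 10, 5            # lookback / lookahead orders

h = rng.standard_normal((T, H))          # hidden states of one layer
a = rng.standard_normal((N1 + 1, H))     # taps for t, t-1, ..., t-N1
c = rng.standard_normal((N2, H))         # taps for t+1, ..., t+N2

m = np.zeros_like(h)
for t in range(T):
    for i in range(N1 + 1):              # FIR over the past (incl. current)
        if t - i >= 0:
            m[t] += a[i] * h[t - i]
    for j in range(1, N2 + 1):           # FIR over the future
        if t + j < T:
            m[t] += c[j - 1] * h[t + j]
# m is passed on (with h) to the next layer, giving long-span context
```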
Multi-Head Self-Attention
● Speech-Transformer
○ Transformer applied to Mandarin Chinese
○ With convolution layers on inputs
● Transformer with convolutions
○ Convolutional contexts applied to inputs, similarly
● Time-restricted self-attention (sketched below)
○ Left & right contexts restricting the attention mechanism
○ Relative positional encoding
○ Encoder structure only
○ LF-MMI objective
● Self-attention network (SAN) with CTC
○ CTC objective
L. Dong, et al., “Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition”, 2018.
A. Mohamed, et al., “Transformers with convolutional context for ASR”, 2019.
D. Povey, et al., “A time-restricted self-attention layer for ASR”, 2018.
K. Han, et al., “Multi-stride self-attention for speech recognition”, 2019.
J. Salazar, et al., “Self-attention networks for connectionist temporal classification in speech recognition”, 2019.
TABLE OF CONTENTS
1. Attention
2. Attention in Speech Recognition
3. Attention in Speaker Recognition
4. Pay Attention to Challenges!
5. Conclusions / Q&A
Deep Speaker Embedding w/ Attention
● Attention in speaker verification
○ In the past, frame-level embeddings were simply averaged over an utterance to obtain a fixed-length representation
○ Attention is applied over such embeddings instead
● Feedforward networks w/ attention
G. Bhattacharya, et al., “Deep speaker embeddings for short-duration speaker verification”, 2017.
C. Raffel, et al., “Feed-forward networks with attention can solve some long-term memory problems”, 2015.
Deep Speaker Embedding w/ Attention
● Attentive statistics pooling (sketched below)
○ Appends a weighted standard deviation to the attention-weighted mean
K. Okabe, et al., “Attentive statistics pooling for deep speaker embedding”, 2018.
Deep Speaker Embedding w/ Attention
● Multimodal attention in speaker verification
○ Attention over phonetic and speaker representations for the wake word “Hey Cortana”
○ Combines keyword spotting with speaker verification
S. Zhang, et al., “End-to-end attention based text-dependent speaker verification”, 2016.
Deep Speaker Embedding w/ Attention
● D-vectors in LSTM
○ Generates embeddings through LSTMs
○ Attention applied to get normalized weights for the hidden embeddings
[Figure: cross-layer attention vs. divided-layer attention]
G. Heigold, et al., “End-to-end text dependent speaker verification”, 2016.
F. Chowdhury, et al., “Attention-based models for text-dependent speaker verification”, 2017.
Deep Speaker Embedding w/ Attention
● Self-attentive embedding (sketched below)
○ Extension of the x-vector w/ structured self-attention from sentence embedding
○ Multiple attention heads
Z. Lin, et al., “Structured self-attentive sentence embedding”, 2017.
Y. Zhu, et al., “Self-attentive speaker embeddings for text-independent speaker verification”, 2018.
TABLE OF CONTENTS
1. Attention
2. Attention in Speech Recognition
3. Attention in Speaker Recognition
4. Pay Attention to Challenges!
5. Conclusions / Q&A
Challenges: Online Attention
● Can we attend monotonically? (sketched below)
[Figure: soft attention vs. monotonic chunkwise attention]
C. Chiu, et al., “Monotonic chunkwise attention”, 2018.
Challenges: Speech Frames
● Are they ideal as basic units?
http://jalammar.github.io/illustrated-bert/
https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1
Challenges: Speech Frames
● Some efforts exist… (sketched below)
○ Multi-resolution of speech frames in multi-stream self-attention
○ But the question remains…
K. Han, et al., “State-of-the-art speech recognition using multi-stream self-attention with dilated 1D convolutions”, 2019.
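A hedged sketch of the multi-stride idea: parallel self-attention streams see the input at different frame rates and their outputs are combined. Simple strided subsampling stands in for the paper's dilated 1-D convolutions, and untrained identity projections replace learned ones.

```python
import numpy as np

rng = np.random.default_rng(8)
T, D = 60, 64
x = rng.standard_normal((T, D))          # frame-level inputs

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attn_stream(x, stride):
    xs = x[::stride]                        # one "frame rate" per stream
    q = k = v = xs                          # untrained identity projections
    y = softmax(q @ k.T / np.sqrt(D)) @ v   # plain self-attention
    return np.repeat(y, stride, axis=0)[:len(x)]  # back to the input rate

streams = [self_attn_stream(x, s) for s in (1, 2, 3)]
out = np.mean(streams, axis=0)              # combined multi-stride output
```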
TABLE OF CONTENTS
1. Attention
2. Attention in Speech Recognition
3. Attention in Speaker Recognition
4. Pay Attention to Challenges!
5. Conclusions / Q&A
Lots of Areas ATTENDED
● Example
○ Multimodal emotion recognition
J. Li, et al., “Attentive to individual: A multimodal emotion recognition network with personalized attention profile”, 2019.
Thank you!
khan@asapp.com
References
1. Alex Graves, “Generating sequences with recurrent neural networks,” arXiv:1308.0850 [cs], Aug. 2013.
2. Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” ICLR, May 2015; arXiv:1409.0473 [cs], Sep. 2014.
3. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin, “Attention is all you need,” arXiv:1706.03762 [cs], June 2017.
4. Rohit Prabhavalkar, Kanishka Rao, Tara N. Sainath, Bo Li, Leif Johnson and Navdeep Jaitly, “A comparison of sequence-to-sequence models for speech recognition,” Interspeech, Aug. 2017.
5. Amit Das, Jinyu Li, Rui Zhao and Yifan Gong, “Advancing connectionist temporal classification with attention modeling,” ICASSP, April 2018.
6. Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: First results,” Deep Learning and Representation Learning Workshop @NIPS, Dec. 2014.
7. Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho and Yoshua Bengio, “Attention-based models for speech recognition,” NIPS, Dec. 2015.
8. Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel and Yoshua Bengio, “End-to-end attention-based large vocabulary speech recognition,” ICASSP, March 2016.
9. Suyoun Kim, Takaaki Hori and Shinji Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” ICASSP, March 2017.
10. Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey and Tomoki Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, Dec. 2017.
11. Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala and Tsubasa Ochiai, “ESPnet: End-to-end speech processing toolkit,” Interspeech, Sept. 2018.
12. William Chan, Navdeep Jaitly, Quoc V. Le and Oriol Vinyals, “Listen, attend and spell,” arXiv:1508.01211 [cs], Aug. 2015.
13. Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski and Michiel Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” ICASSP, April 2018.
14. Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk and Quoc V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” Interspeech, Sept. 2019.
15. Albert Zeyer, Kazuki Irie, Ralf Schluter and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” Interspeech, Sept. 2018.
16. Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach and Patrick Nguyen, “On the choice of modeling unit for sequence-to-sequence speech recognition,” Interspeech, Sept. 2019.
17. Albert Zeyer, Andre Merboldt, Ralf Schluter and Hermann Ney, “A comprehensive analysis on attention models,” Interpretability and Robustness in Audio, Speech, and Language Workshop @NIPS, Dec. 2018.
18. Liang Lu, Xingxing Zhang and Steve Renals, “On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition,” ICASSP, March 2016.
19. Shubham Toshniwal, Hao Tang, Liang Lu and Karen Livescu, “Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition,” Interspeech, Aug. 2017.
20. Chao Weng, Jia Cui, Guangsen Wang, Jun Wang, Chengzhu Yu, Dan Su and Dong Yu, “Improving attention based sequence-to-sequence models for end-to-end English conversational speech recognition,” Interspeech, Sept. 2018.
21. Shiliang Zhang, Hui Jiang, Si Wei and Lirong Dai, “Feedforward sequential memory neural networks without recurrent feedback,” arXiv:1510.02693 [cs], Oct. 2015.
22. Shiliang Zhang, Cong Liu, Hui Jiang, Si Wei, Lirong Dai and Yu Hu, “Feedforward sequential memory networks: A new structure to learn long-term dependency,” arXiv:1512.08301 [cs], Dec. 2015.
23. Shiliang Zhang, Hui Jiang, Shifu Xiong, Si Wei and Li-Rong Dai, “Compact feedforward sequential memory networks for large vocabulary continuous speech recognition,” Interspeech, Sept. 2016.
24. Shiliang Zhang, Ming Lei, Zhijie Yan and Lirong Dai, “Deep-FSMN for large vocabulary continuous speech recognition,” arXiv:1803.05030 [cs], March 2018.
25. Xuerui Yang, Jiwei Li and Xi Zhou, “A novel pyramidal-FSMN architecture with lattice-free MMI for speech recognition,” arXiv:1810.11352 [cs], Oct. 2018.
26. Linhao Dong, Shuang Xu and Bo Xu, “Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition,” ICASSP, April 2018.
27. Shiyu Zhou, Linhao Dong, Shuang Xu and Bo Xu, “A comparison of modeling units in sequence-to-sequence speech recognition with the Transformer on Mandarin Chinese,” arXiv:1805.06239 [cs], May 2018.
28. Shiyu Zhou, Linhao Dong, Shuang Xu and Bo Xu, “Syllable-based sequence-to-sequence speech recognition with the Transformer in Mandarin Chinese,” Interspeech, Sept. 2018.
29. Abdelrahman Mohamed, Dmytro Okhonko and Luke Zettlemoyer, “Transformers with convolutional context for ASR,” arXiv:1904.11660 [cs], April 2019.
30. Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li and Sanjeev Khudanpur, “A time-restricted self-attention layer for ASR,” ICASSP, April 2018.
31. Kyu J. Han, Jing Huang, Yun Tang, Xiaodong He and Bowen Zhou, “Multi-stride self-attention for speech recognition,” Interspeech, Sept. 2019.
32. Julian Salazar, Katrin Kirchhoff and Zhiheng Huang, “Self-attention networks for connectionist temporal classification in speech recognition,” ICASSP, May 2019.
33. Shaoshi Ling, Julian Salazar and Katrin Kirchhoff, “Contextual phonetic pretraining for end-to-end utterance-level language and speaker recognition,” Interspeech, Sept. 2019.
34. Yuanyuan Zhao, Jie Li, Xiaorui Wang and Yan Li, “The SpeechTransformer for large-scale Mandarin Chinese speech recognition,” ICASSP, May 2019.
35. Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stuker and Alex Waibel, “Self-attentional acoustic models,” Interspeech, Sept. 2018.
36. Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Muller, Sebastian Stuker and Alex Waibel, “Very deep self-attention networks for end-to-end speech recognition,” Interspeech, Sept. 2019.
37. Dong Yu and Jinyu Li, “Recent progresses in deep learning based acoustic models (updated),” IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 2017.
38. Gautam Bhattacharya, Jahangir Alam and Patrick Kenny, “Deep speaker embeddings for short-duration speaker verification,” Interspeech, Aug. 2017.
39. Colin Raffel and Daniel P. W. Ellis, “Feed-forward networks with attention can solve some long-term memory problems,” ICLR Workshop, May 2016.
40. Koji Okabe, Takafumi Koshinaka and Koichi Shinoda, “Attentive statistics pooling for deep speaker embedding,” Interspeech, Sept. 2018.
41. Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li and Yifan Gong, “End-to-end attention based text-dependent speaker verification,” SLT, Dec. 2016.
42. Georg Heigold, Ignacio Moreno, Samy Bengio and Noam Shazeer, “End-to-end text-dependent speaker verification,” ICASSP, March 2016.
43. F. A. Rezaur Rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno and Li Wan, “Attention-based models for text-dependent speaker verification,” arXiv:1710.10470 [cs], Oct. 2017.
44. Yann N. Dauphin, Angela Fan, Michael Auli and David Grangier, “Language modeling with gated convolutional networks,” arXiv:1612.08083 [cs], Dec. 2016.
45. Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou and Yoshua Bengio, “A structured self-attentive sentence embedding,” ICLR, April 2017.
46. Yingke Zhu, Tom Ko, David Snyder, Brian Mak and Daniel Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” Interspeech, Sept. 2018.
47. Qiongqiong Wang, Koji Okabe, Kong Aik Lee, Hitoshi Yamamoto and Takafumi Koshinaka, “Attention mechanism in speaker recognition: What does it learn in deep speaker embedding?,” SLT, Dec. 2018.
48. Chung-Cheng Chiu and Colin Raffel, “Monotonic chunkwise attention,” ICLR, May 2018.
49. Kyu J. Han, Ramon Prieto and Tao Ma, “State-of-the-art speech recognition using multi-stream self-attention with dilated 1D convolutions,” ASRU, Dec. 2019.
50. Jeng-Lin Li and Chi-Chun Lee, “Attentive to individual: A multimodal emotion recognition network with personalized attention profile,” Interspeech, Sept. 2019.
More Related Content

What's hot

Social network-analysis-in-python
Social network-analysis-in-pythonSocial network-analysis-in-python
Social network-analysis-in-pythonJoe OntheRocks
 
Interspeech2020 paper reading workshop "Similarity-and-Independence-Aware-Bea...
Interspeech2020 paper reading workshop "Similarity-and-Independence-Aware-Bea...Interspeech2020 paper reading workshop "Similarity-and-Independence-Aware-Bea...
Interspeech2020 paper reading workshop "Similarity-and-Independence-Aware-Bea...ssuserf54db1
 
Blind source separation based on independent low-rank matrix analysis and its...
Blind source separation based on independent low-rank matrix analysis and its...Blind source separation based on independent low-rank matrix analysis and its...
Blind source separation based on independent low-rank matrix analysis and its...Daichi Kitamura
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Salah Amean
 
Saito20asj_autumn
Saito20asj_autumnSaito20asj_autumn
Saito20asj_autumnYuki Saito
 
Independent Component Analysis
Independent Component Analysis Independent Component Analysis
Independent Component Analysis Ibrahim Amer
 
Multiplicative Interaction Models in R
Multiplicative Interaction Models in RMultiplicative Interaction Models in R
Multiplicative Interaction Models in Rhtstatistics
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forestsMarc Garcia
 
An Introduction To Bioinformatics Algorithms
An Introduction To Bioinformatics AlgorithmsAn Introduction To Bioinformatics Algorithms
An Introduction To Bioinformatics AlgorithmsTracy Morgan
 
音楽の情報処理
音楽の情報処理音楽の情報処理
音楽の情報処理Akinori Ito
 
Automated Program Repair Keynote talk
Automated Program Repair Keynote talkAutomated Program Repair Keynote talk
Automated Program Repair Keynote talkAbhik Roychoudhury
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Dev Sahu
 
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価Shinnosuke Takamichi
 
From logistic regression to linear chain CRF
From logistic regression to linear chain CRFFrom logistic regression to linear chain CRF
From logistic regression to linear chain CRFDarren Yow-Bang Wang
 
スペクトログラム無矛盾性を用いた 独立低ランク行列分析の実験的評価
スペクトログラム無矛盾性を用いた独立低ランク行列分析の実験的評価スペクトログラム無矛盾性を用いた独立低ランク行列分析の実験的評価
スペクトログラム無矛盾性を用いた 独立低ランク行列分析の実験的評価Daichi Kitamura
 
深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術NU_I_TODALAB
 
PhD Defense - Example-Dependent Cost-Sensitive Classification
PhD Defense - Example-Dependent Cost-Sensitive ClassificationPhD Defense - Example-Dependent Cost-Sensitive Classification
PhD Defense - Example-Dependent Cost-Sensitive ClassificationAlejandro Correa Bahnsen, PhD
 
フィラーを含む自発音声合成モデルの品質低下原因の調査と一貫性保証による改善
フィラーを含む自発音声合成モデルの品質低下原因の調査と一貫性保証による改善フィラーを含む自発音声合成モデルの品質低下原因の調査と一貫性保証による改善
フィラーを含む自発音声合成モデルの品質低下原因の調査と一貫性保証による改善Yuta Matsunaga
 

What's hot (20)

Social network-analysis-in-python
Social network-analysis-in-pythonSocial network-analysis-in-python
Social network-analysis-in-python
 
Interspeech2020 paper reading workshop "Similarity-and-Independence-Aware-Bea...
Interspeech2020 paper reading workshop "Similarity-and-Independence-Aware-Bea...Interspeech2020 paper reading workshop "Similarity-and-Independence-Aware-Bea...
Interspeech2020 paper reading workshop "Similarity-and-Independence-Aware-Bea...
 
Blind source separation based on independent low-rank matrix analysis and its...
Blind source separation based on independent low-rank matrix analysis and its...Blind source separation based on independent low-rank matrix analysis and its...
Blind source separation based on independent low-rank matrix analysis and its...
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Saito20asj_autumn
Saito20asj_autumnSaito20asj_autumn
Saito20asj_autumn
 
Independent Component Analysis
Independent Component Analysis Independent Component Analysis
Independent Component Analysis
 
Lda
LdaLda
Lda
 
NLP
NLPNLP
NLP
 
Multiplicative Interaction Models in R
Multiplicative Interaction Models in RMultiplicative Interaction Models in R
Multiplicative Interaction Models in R
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
 
An Introduction To Bioinformatics Algorithms
An Introduction To Bioinformatics AlgorithmsAn Introduction To Bioinformatics Algorithms
An Introduction To Bioinformatics Algorithms
 
音楽の情報処理
音楽の情報処理音楽の情報処理
音楽の情報処理
 
Automated Program Repair Keynote talk
Automated Program Repair Keynote talkAutomated Program Repair Keynote talk
Automated Program Repair Keynote talk
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
 
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
サブバンドフィルタリングに基づくリアルタイム広帯域DNN声質変換の実装と評価
 
From logistic regression to linear chain CRF
From logistic regression to linear chain CRFFrom logistic regression to linear chain CRF
From logistic regression to linear chain CRF
 
スペクトログラム無矛盾性を用いた 独立低ランク行列分析の実験的評価
スペクトログラム無矛盾性を用いた独立低ランク行列分析の実験的評価スペクトログラム無矛盾性を用いた独立低ランク行列分析の実験的評価
スペクトログラム無矛盾性を用いた 独立低ランク行列分析の実験的評価
 
深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術深層生成モデルに基づく音声合成技術
深層生成モデルに基づく音声合成技術
 
PhD Defense - Example-Dependent Cost-Sensitive Classification
PhD Defense - Example-Dependent Cost-Sensitive ClassificationPhD Defense - Example-Dependent Cost-Sensitive Classification
PhD Defense - Example-Dependent Cost-Sensitive Classification
 
フィラーを含む自発音声合成モデルの品質低下原因の調査と一貫性保証による改善
フィラーを含む自発音声合成モデルの品質低下原因の調査と一貫性保証による改善フィラーを含む自発音声合成モデルの品質低下原因の調査と一貫性保証による改善
フィラーを含む自発音声合成モデルの品質低下原因の調査と一貫性保証による改善
 

Similar to Interspeech 2019 Survey Talk: When Attention Meets Speech Applications

Hearing screening for elderly people - feasible and enjoyable? - Dr. Dr. h.c....
Hearing screening for elderly people - feasible and enjoyable? - Dr. Dr. h.c....Hearing screening for elderly people - feasible and enjoyable? - Dr. Dr. h.c....
Hearing screening for elderly people - feasible and enjoyable? - Dr. Dr. h.c....Monika Lehnhardt
 
The Cocktail Party Effect. An inclusive vision of conversational interactions.
The Cocktail Party Effect. An inclusive vision of conversational interactions.The Cocktail Party Effect. An inclusive vision of conversational interactions.
The Cocktail Party Effect. An inclusive vision of conversational interactions.Isabella Loddo
 
Information security consciousness
Information security consciousnessInformation security consciousness
Information security consciousnessCiarán Mc Mahon
 
370_October 26_Presentation and TV
370_October 26_Presentation and TV 370_October 26_Presentation and TV
370_October 26_Presentation and TV Ohio University
 
Final Edited Deliverable
Final Edited DeliverableFinal Edited Deliverable
Final Edited Deliverableskylerdan
 
1. Highly Repetitive MotionIntensive keying for at least 5 hours.docx
1. Highly Repetitive MotionIntensive keying for at least 5 hours.docx1. Highly Repetitive MotionIntensive keying for at least 5 hours.docx
1. Highly Repetitive MotionIntensive keying for at least 5 hours.docxSONU61709
 
How To Deliver an Accessible Online Presentation
How To Deliver an Accessible Online PresentationHow To Deliver an Accessible Online Presentation
How To Deliver an Accessible Online Presentation3Play Media
 
Dilek Hakkani-Tur at AI Frontiers: Conversational machines: Deep Learning for...
Dilek Hakkani-Tur at AI Frontiers: Conversational machines: Deep Learning for...Dilek Hakkani-Tur at AI Frontiers: Conversational machines: Deep Learning for...
Dilek Hakkani-Tur at AI Frontiers: Conversational machines: Deep Learning for...AI Frontiers
 
How to Make the Web Easier for Users with Limited Literacy Skills - Sandy Hil...
How to Make the Web Easier for Users with Limited Literacy Skills - Sandy Hil...How to Make the Web Easier for Users with Limited Literacy Skills - Sandy Hil...
How to Make the Web Easier for Users with Limited Literacy Skills - Sandy Hil...UXPA International
 
Language And Culture Essay
Language And Culture EssayLanguage And Culture Essay
Language And Culture EssayGermaine Newman
 
How To Make Outline For Essay
How To Make Outline For EssayHow To Make Outline For Essay
How To Make Outline For EssayJulia Slater
 
3 labs open house, july 2020, prospective
3 labs open house, july 2020, prospective3 labs open house, july 2020, prospective
3 labs open house, july 2020, prospectiveDick Detzner
 
Labs open house 2020, final
Labs open house 2020, finalLabs open house 2020, final
Labs open house 2020, finalDick Detzner
 
Towards Responsible NLP: Walking the walk
Towards Responsible NLP: Walking the walkTowards Responsible NLP: Walking the walk
Towards Responsible NLP: Walking the walkMonaDiab7
 
Intro to Auto Speech Recognition -- How ML Learns Speech-to-Text
Intro to Auto Speech Recognition -- How ML Learns Speech-to-TextIntro to Auto Speech Recognition -- How ML Learns Speech-to-Text
Intro to Auto Speech Recognition -- How ML Learns Speech-to-TextYoshiyuki Igarashi
 
Conversation research: leveraging the power of social media
Conversation research: leveraging the power of social mediaConversation research: leveraging the power of social media
Conversation research: leveraging the power of social mediaSKIM
 
Oral communication.pdf
Oral communication.pdfOral communication.pdf
Oral communication.pdfAyzaFatima1
 
People-Centered Design
People-Centered DesignPeople-Centered Design
People-Centered DesignKatrina Alcorn
 

Similar to Interspeech 2019 Survey Talk: When Attention Meets Speech Applications (20)

Hearing screening for elderly people - feasible and enjoyable? - Dr. Dr. h.c....
Hearing screening for elderly people - feasible and enjoyable? - Dr. Dr. h.c....Hearing screening for elderly people - feasible and enjoyable? - Dr. Dr. h.c....
Hearing screening for elderly people - feasible and enjoyable? - Dr. Dr. h.c....
 
The Cocktail Party Effect. An inclusive vision of conversational interactions.
The Cocktail Party Effect. An inclusive vision of conversational interactions.The Cocktail Party Effect. An inclusive vision of conversational interactions.
The Cocktail Party Effect. An inclusive vision of conversational interactions.
 
Information security consciousness
Information security consciousnessInformation security consciousness
Information security consciousness
 
370_October 26_Presentation and TV
370_October 26_Presentation and TV 370_October 26_Presentation and TV
370_October 26_Presentation and TV
 
Final Edited Deliverable
Final Edited DeliverableFinal Edited Deliverable
Final Edited Deliverable
 
1. Highly Repetitive MotionIntensive keying for at least 5 hours.docx
1. Highly Repetitive MotionIntensive keying for at least 5 hours.docx1. Highly Repetitive MotionIntensive keying for at least 5 hours.docx
1. Highly Repetitive MotionIntensive keying for at least 5 hours.docx
 
iCitizen 2008: Steve Knox
iCitizen 2008: Steve KnoxiCitizen 2008: Steve Knox
iCitizen 2008: Steve Knox
 
How To Deliver an Accessible Online Presentation
How To Deliver an Accessible Online PresentationHow To Deliver an Accessible Online Presentation
How To Deliver an Accessible Online Presentation
 
Dilek Hakkani-Tur at AI Frontiers: Conversational machines: Deep Learning for...
Dilek Hakkani-Tur at AI Frontiers: Conversational machines: Deep Learning for...Dilek Hakkani-Tur at AI Frontiers: Conversational machines: Deep Learning for...
Dilek Hakkani-Tur at AI Frontiers: Conversational machines: Deep Learning for...
 
How to Make the Web Easier for Users with Limited Literacy Skills - Sandy Hil...
How to Make the Web Easier for Users with Limited Literacy Skills - Sandy Hil...How to Make the Web Easier for Users with Limited Literacy Skills - Sandy Hil...
How to Make the Web Easier for Users with Limited Literacy Skills - Sandy Hil...
 
Language And Culture Essay
Language And Culture EssayLanguage And Culture Essay
Language And Culture Essay
 
How To Make Outline For Essay
How To Make Outline For EssayHow To Make Outline For Essay
How To Make Outline For Essay
 
3 labs open house, july 2020, prospective
3 labs open house, july 2020, prospective3 labs open house, july 2020, prospective
3 labs open house, july 2020, prospective
 
Labs open house 2020, final
Labs open house 2020, finalLabs open house 2020, final
Labs open house 2020, final
 
Towards Responsible NLP: Walking the walk
Towards Responsible NLP: Walking the walkTowards Responsible NLP: Walking the walk
Towards Responsible NLP: Walking the walk
 
ROTARY AFRICA MAGAZINE
ROTARY AFRICA MAGAZINEROTARY AFRICA MAGAZINE
ROTARY AFRICA MAGAZINE
 
Intro to Auto Speech Recognition -- How ML Learns Speech-to-Text
Intro to Auto Speech Recognition -- How ML Learns Speech-to-TextIntro to Auto Speech Recognition -- How ML Learns Speech-to-Text
Intro to Auto Speech Recognition -- How ML Learns Speech-to-Text
 
Conversation research: leveraging the power of social media
Conversation research: leveraging the power of social mediaConversation research: leveraging the power of social media
Conversation research: leveraging the power of social media
 
Oral communication.pdf
Oral communication.pdfOral communication.pdf
Oral communication.pdf
 
People-Centered Design
People-Centered DesignPeople-Centered Design
People-Centered Design
 

Recently uploaded

Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Call Girls In Aerocity 🤳 Call Us +919599264170
Call Girls In Aerocity 🤳 Call Us +919599264170Escort Service
 
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRachelAnnTenibroAmaz
 
Dutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power - 26 maart 2024 - Henk Kras - Circular Plastics
Dutch Power - 26 maart 2024 - Henk Kras - Circular PlasticsDutch Power
 
PHYSICS PROJECT BY MSC - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC  - NANOTECHNOLOGYPHYSICS PROJECT BY MSC  - NANOTECHNOLOGY
PHYSICS PROJECT BY MSC - NANOTECHNOLOGYpruthirajnayak525
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSebastiano Panichella
 
Chizaram's Women Tech Makers Deck. .pptx
Chizaram's Women Tech Makers Deck.  .pptxChizaram's Women Tech Makers Deck.  .pptx
Chizaram's Women Tech Makers Deck. .pptxogubuikealex
 
SBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSBFT Tool Competition 2024 -- Python Test Case Generation Track
SBFT Tool Competition 2024 -- Python Test Case Generation TrackSebastiano Panichella
 
Event 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxEvent 4 Introduction to Open Source.pptx
Event 4 Introduction to Open Source.pptxaryanv1753
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...漢銘 謝
 
Genshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxGenshin Impact PPT Template by EaTemp.pptx
Genshin Impact PPT Template by EaTemp.pptxJohnree4
 
Early Modern Spain. All about this period
Early Modern Spain. All about this periodEarly Modern Spain. All about this period
Early Modern Spain. All about this periodSaraIsabelJimenez
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxmavinoikein
 
miladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxmiladyskindiseases-200705210221 2.!!pptx
miladyskindiseases-200705210221 2.!!pptxCarrieButtitta
 
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...
Gaps, Issues and Challenges in the Implementation of Mother Tongue Based-Mult...marjmae69
 
Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸Mathan flower ppt.pptx slide orchids ✨🌸
Mathan flower ppt.pptx slide orchids ✨🌸mathanramanathan2005
 
The Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationThe Ten Facts About People With Autism Presentation
The Ten Facts About People With Autism PresentationNathan Young
 
Anne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxAnne Frank A Beacon of Hope amidst darkness ppt.pptx
Anne Frank A Beacon of Hope amidst darkness ppt.pptxnoorehahmad
 
PAG-UNLAD NG EKONOMIYA na dapat isaalang alang sa pag-aaral.
Interspeech 2019 Survey Talk: When Attention Meets Speech Applications

  • 1. ASAPP, One World Trade Center, 80th Floor, New York, 10007 asapp.com Confidential - Not for further distribution Kyu J. Han, Ramon Prieto, Tao Ma When Attention Meets Speech Applications September 16, 2019
  • 2. Confidential - Not for further distribution Intro “ATTENTION” In Interspeech 2019 Very Deep Self-attention Networks for End-to-End Speech Recognition Detecting Mismatch Between Speech and Transcription Using Cross-Modal Attention Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile Phonetically-aware embeddings - Wide Residual Networks with Time-Delay Neural Networks and Self Attention models for the 2018 NIST Speaker Recognition Evaluation A Hierarchical Attention Network-Based Approach for Depression Detection from Transcribed Clinical Interviews RWTH ASR System for LibriSpeech: Hybrid vs Attention Speaker Adaptation for Attention-Based End-to-End Speech Recognition Large Margin Training for Attention Based End-to-End Speech Recognition Predicting Group-Level Skin Attention to Short Movies from Audio-Based LSTM-Mixture of Experts Models Attention model for articulatory features detection Attention based Hybrid I-vector BLSTM Model for Language Recognition Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS Self Attention in Variational Sequential Learning for Summarization Speech Emotion Recognition in Dyadic Dialogues with Attentive Interaction Modeling Conversational Emotion Analysis via Attention Mechanisms An analysis of local monotonic attention variants Lattice generation in attention-based speech recognition models A Time Delay Neural Network with Shared Weight Self-Attention for Small-Footprint Keyword Spotting Individual differences in implicit attention to phonetic detail in speech perception Learning how to listen: A temporal-frequential attention model for sound event detection An Online Attention-based Model for Speech Recognition Online Hybrid CTC/Attention Architecture for End-to-end Speech Recognition The influence of distraction on speech processing: How selective is selective attention?
Environment-dependent Attention-driven Recurrent Convolutional Neural Network for Robust Speech Enhancement Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation Using Attention Networks and Adversarial Augmentation for Styrian Dialect Continuous Sleepiness and Baby Sound Recognition Multi-task multi-resolution char-to-BPE cross-attention decoder for end-to-end speech recognition Multi-Stride Self-Attention for Speech Recognition Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning Attention-based word vector prediction with LSTMs and its application to the OOV problem in ASR Multi-stream Network With Temporal Attention For Environmental Sound Classification Few-Shot Audio Classification with Attentional Graph Neural Networks Vectorized Beam Search for CTC-Attention-based Speech Recognition Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition Spatio-Temporal Attention Pooling for Audio Scene Classification Multi-Scale Time-Frequency Attention for Rare Sound Event Detection A new time-frequency attention mechanism for TDNN and CNN-LSTM-TDNN - with application to language identification An Attention-Based Hybrid Network for Automatic Detection of Alzheimer’s Disease from Narrative Speech Automatic Hierarchical Attention Neural Network for Detecting Alzheimer’s Disease Neural Text Clustering with Document-level Attention based on Dynamic Soft Labels End-to-End Multi-Channel Speech Enhancement Using Inter-Channel Time-Restricted Attention on Raw Waveform Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition Cross-Attention End-to-End ASR for Two-Party Conversations Self-Attention Transducers for End-to-End Speech Recognition Variational Attention using Articulatory Priors for generating Code Mixed Speech using Monolingual Corpora
  • 3. Confidential - Not for further distribution Intro ● Around 50 papers with titles including “ATTENTION” ● Applied across diverse areas ○ Speech recognition ○ Speaker recognition ○ Language recognition ○ Emotion recognition ○ Speech synthesis ○ Audio classification ○ Event detection ○ Semantic classification “ATTENTION” In Interspeech 2019
  • 4. TABLE OF CONTENTS 1. Attention 2. Attention in Speech Recognition 3. Attention in Speaker Recognition 4. Pay Attention to Challenges! 5. Conclusions / Q&A
  • 5. TABLE OF CONTENTS 1. Attention 2. Attention in Speech Recognition 3. Attention in Speaker Recognition 4. Pay Attention to Challenges! 5. Conclusions / Q&A
  • 6. Confidential - Not for further distribution ● Understands where to pay more attention ATTENTION Attention Source: commons.wikimedia.org
  • 7. Confidential - Not for further distribution ● Understands where to pay more attention ● Common to humans ○ Visual attention ATTENTION Attention Source: commons.wikimedia.org
  • 8. Confidential - Not for further distribution ● Understands where to pay more attention ● Common to humans ○ Visual attention ATTENTION Attention Source: commons.wikimedia.org
  • 9. Confidential - Not for further distribution Attention Source: commons.wikimedia.org ● Understands where to pay more attention ● Common to humans ○ Visual attention ATTENTION
  • 10. Confidential - Not for further distribution Attention Source: giphy.com ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ATTENTION
  • 11. Confidential - Not for further distribution Attention Source: cbsnews.com ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ATTENTION
  • 12. Confidential - Not for further distribution Attention Source: giphy.com ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ATTENTION
  • 13. Confidential - Not for further distribution Attention Source: metroatlantahome.com ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ○ House price ATTENTION
  • 14. Confidential - Not for further distribution Attention ATTENTION ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ○ House price ● In neural networks, ○ “Generating sequences with RNNs”, by A. Graves (2013)
  • 15. Confidential - Not for further distribution Attention ATTENTION ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ○ House price ● In neural networks, ○ “Generating sequences with RNNs”, by A. Graves (2013) A. Graves, "Generating sequences with recurrent neural networks", 2013.
  • 16. Confidential - Not for further distribution Attention ATTENTION ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ○ House price ● In neural networks, ○ “Generating sequences with RNNs”, by A. Graves (2013) ■ Soft windowing A. Graves, "Generating sequences with recurrent neural networks", 2013.
  • 17. Confidential - Not for further distribution Attention ATTENTION ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ○ House price ● In neural networks, ○ “Generating sequences with RNNs”, by A. Graves (2013) ■ Soft windowing ■ Gaussian convolution ■ Location-aware attention A. Graves, "Generating sequences with recurrent neural networks", 2013.
  • 18. Confidential - Not for further distribution Attention A. Graves, "Generating sequences with recurrent neural networks", 2013. ATTENTION ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ○ House price ● In neural networks, ○ “Generating sequences with RNNs”, by A. Graves (2013) ■ Soft windowing ■ Gaussian convolution ■ Location-aware attention ○ “Neural machine translation by jointly learning to align and translate”, D. Bahdanau, K. Cho and Y. Bengio (2014/2015)
  • 19. Confidential - Not for further distribution Attention ATTENTION ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ○ House price ● In neural networks, ○ “Generating sequences with RNNs”, by A. Graves (2013) ■ Soft windowing ■ Gaussian convolution ■ Location-aware attention ○ “Neural machine translation by jointly learning to align and translate”, D. Bahdanau, K. Cho and Y. Bengio (2014/2015) ■ Content-aware attention D. Bahdanau, et al., ”Neural machine translation by jointly learning to align and translate", 2014/2015.
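To make the content-aware mechanism concrete, here is a minimal PyTorch sketch of Bahdanau-style additive attention; the module name, layer sizes, and tensor shapes are illustrative assumptions, not the paper's reference code.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Sketch of content-aware (additive) attention:
    e_t = v^T tanh(W s + U h_t), alpha = softmax(e), c = sum_t alpha_t h_t."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)  # projects decoder state s
        self.U = nn.Linear(enc_dim, attn_dim, bias=False)  # projects encoder states h_t
        self.v = nn.Linear(attn_dim, 1, bias=False)        # scalar scoring vector

    def forward(self, dec_state, enc_states):
        # dec_state: (B, dec_dim); enc_states: (B, T, enc_dim)
        energy = self.v(torch.tanh(self.W(dec_state).unsqueeze(1) + self.U(enc_states)))
        alpha = torch.softmax(energy.squeeze(-1), dim=-1)               # (B, T) alignment
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # (B, enc_dim)
        return context, alpha
```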
  • 20. Confidential - Not for further distribution Attention ATTENTION ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ○ House price ● In neural networks, ○ “Generating sequences with RNNs”, by A. Graves (2013) ■ Soft windowing ■ Gaussian convolution ■ Location-aware attention ○ “Neural machine translation by jointly learning to align and translate”, D. Bahdanau, K. Cho and Y. Bengio (2014/2015) ■ Content-aware attention ○ “Attention is all you need”, A. Vaswani, et al. (2017)
  • 21. Confidential - Not for further distribution Attention ATTENTION ● Understands where to pay more attention ● Common to humans ○ Visual attention ○ Auditory attention ○ Social attention ● Common to human decision making ○ Family meeting ○ House price ● In neural networks, ○ “Generating sequences with RNNs”, by A. Graves (2013) ■ Soft windowing ■ Gaussian convolution ■ Location-aware attention ○ “Neural machine translation by jointly learning to align and translate”, D. Bahdanau, K. Cho and Y. Bengio (2014/2015) ■ Content-aware attention ○ “Attention is all you need”, A. Vaswani, et al. (2017) ■ Multi-head attention ■ No-recurrence A. Vaswani, et al., ”Attention is all you need", 2017.
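The Transformer's multi-head scaled dot-product attention drops recurrence entirely. A minimal sketch follows, assuming single-matrix projections Wq/Wk/Wv/Wo and a (B, T, d_model) input; the function name and shapes are illustrative.

```python
import torch

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Sketch of Transformer-style multi-head self-attention (no recurrence).
    x: (B, T, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) projection matrices."""
    B, T, d = x.shape
    dh = d // n_heads
    # Project and split into heads: (B, n_heads, T, dh)
    q = (x @ Wq).view(B, T, n_heads, dh).transpose(1, 2)
    k = (x @ Wk).view(B, T, n_heads, dh).transpose(1, 2)
    v = (x @ Wv).view(B, T, n_heads, dh).transpose(1, 2)
    # Scaled dot-product attention per head
    scores = q @ k.transpose(-2, -1) / dh ** 0.5        # (B, n_heads, T, T)
    alpha = torch.softmax(scores, dim=-1)
    out = (alpha @ v).transpose(1, 2).reshape(B, T, d)  # concatenate heads
    return out @ Wo
```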
  • 22. TABLE OF CONTENTS 1. Attention 2. Attention in Speech Recognition 3. Attention in Speaker Recognition 4. Pay Attentions on Challenges! 5. Conclusions / Q&A
  • 23. Confidential - Not for further distribution Attention in End-to-End ASR CTC RNN Transducer Seq-to-Seq R. Prabhavalkar, et al., ”A comparison of sequence-to-sequence models for speech recognition", 2017.
  • 24. Confidential - Not for further distribution ● CTC + attention (2018) ○ Hybrid attention ○ Implicit LM ○ Component attention ○ ~20% relative WER improvement Attention in End-to-End ASR A. Das, et al., ”Advancing connectionist temporal classification with attention modeling", 2018.
  • 25. Confidential - Not for further distribution ● RNN-T + attention (2017) ○ Combines RNN-T w/ attention ○ Content-aware attention ○ Only marginal improvement obtained Attention in End-to-End ASR RNN-T RNN-T w/ Attention R. Prabhavalkar, et al., ”A comparison of sequence-to-sequence models for speech recognition", 2017.
  • 26. Confidential - Not for further distribution Attention in End-to-End ASR CTC RNN Transducer Seq-to-Seq R. Prabhavalkar, et al., ”A comparison of sequence-to-sequence models for speech recognition", 2017.
  • 27. Confidential - Not for further distribution Attention in End-to-End ASR Seq-to-Seq R. Prabhavalkar, et al., ”A comparison of sequence-to-sequence models for speech recognition", 2017.
  • 28. Confidential - Not for further distribution ● Same structure as Bahdanau’s neural machine translation model (2014/15) First Attention in Speech
  • 29. Confidential - Not for further distribution ● Same structure as Bahdanau’s neural machine translation model (2014/15) ○ Encoder-decoder architecture w/ attention ○ Content-aware attention First Attention in Speech J. Chorowski, et al., “End-to-end continuous speech recognition using attention-based recurrent NN: First results", 2014/15.
  • 30. Confidential - Not for further distribution ● ARSG using hybrid attention (2015) ○ Addressed the limitation of content-aware attention → hybrid attention Attention-based Recurrent Sequence Generator
  • 31. Confidential - Not for further distribution ● ARSG using hybrid attention (2015) ○ Addressed the limitation of content-aware attention → hybrid attention Attention-based Recurrent Sequence Generator (F: convolving matrix) J. Chorowski, et al., “Attention-based models for speech recognition", 2014/15.
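A sketch of how the hybrid (content + location) scoring can be realized: the previous alignment is convolved (the convolving matrix F above becomes a 1-D convolution) and fed into the additive score. Module and parameter names here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LocationAwareAttention(nn.Module):
    """Sketch of Chorowski-style hybrid attention: the score also sees
    *where* the model attended at the previous decoder step."""
    def __init__(self, enc_dim, dec_dim, attn_dim, n_filters=16, kernel=31):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel, padding=kernel // 2)  # F as a conv
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U = nn.Linear(enc_dim, attn_dim, bias=False)
        self.V = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states, prev_alpha):
        # prev_alpha: (B, T) alignment from the previous decoder step
        f = self.conv(prev_alpha.unsqueeze(1)).transpose(1, 2)   # (B, T, n_filters)
        energy = self.v(torch.tanh(self.W(dec_state).unsqueeze(1)
                                   + self.U(enc_states) + self.V(f)))
        alpha = torch.softmax(energy.squeeze(-1), dim=-1)
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)
        return context, alpha
```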
  • 32. Confidential - Not for further distribution ● Two improvements for LVCSR (2016) ○ Windowing on attention during training ○ Frame pooling ■ Similar to LAS’s pyramidal encoder structure Improved ARSG D. Bahdanau, et al., “End-to-end attention-based large vocabulary speech recognition", 2016.
  • 33. Confidential - Not for further distribution ● Combination w/ CTC objective (2017) ○ Joint CTC/attention decoding ○ Main model architecture in ESPnet (https://github.com/espnet/espnet) Multi-Objective Training S. Watanabe, et al., “Hybrid CTC/attention architecture for end-to-end speech recognition", 2017.
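The multi-objective training can be sketched as an interpolated loss. The function below is a simplification of what toolkits like ESPnet implement, assuming one decoder logit per target token; names, shapes, and the interpolation weight are illustrative.

```python
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, input_lens, dec_logits,
                             targets, target_lens, lam=0.3):
    """Sketch of multi-objective training: L = lam * L_CTC + (1 - lam) * L_att.
    ctc_log_probs: (T, B, vocab) log-softmax output of the shared encoder.
    dec_logits:    (B, L, vocab) output of the attention decoder.
    targets:       (B, L) token ids (assumed padded to the same L for both)."""
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens)
    att = F.cross_entropy(dec_logits.transpose(1, 2), targets)  # (B, vocab, L) vs (B, L)
    return lam * ctc + (1.0 - lam) * att
```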
  • 34. Confidential - Not for further distribution ● LAS (2015) ○ Pyramidal encoder structure from downsampling ○ Content-aware attention Listen, Attend and Spell
  • 35. Confidential - Not for further distribution ● LAS (2015) ○ Pyramidal encoder structure from downsampling ○ Content-aware attention Listen, Attend and Spell W. Chan, et al., “Listen, attend and spell", 2015.
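The pyramidal reduction in LAS's encoder can be sketched as pairwise frame concatenation between BLSTM layers, halving the time resolution at each level; the helper name is illustrative.

```python
import torch

def pyramid_step(h):
    """Sketch of one pyramidal reduction in LAS: concatenate every pair of
    consecutive frames so the next BLSTM layer sees half as many steps."""
    B, T, D = h.shape
    if T % 2:                       # drop the last frame if T is odd
        h = h[:, :-1, :]
    return h.reshape(B, T // 2, 2 * D)
```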
  • 36. Confidential - Not for further distribution ● Multi-head attention (2018) ○ Inspired by Transformer (A. Vaswani, 2017) ○ Replacing single head attention Further Development of LAS C. Chiu, et al., ”State-of-the-art speech recognition with sequence-to-sequence models", 2018.
  • 37. Confidential - Not for further distribution ● Multi-head attention (2018) ○ Inspired by Transformer (A. Vaswani, 2017) ○ Replacing single head attention ● SpecAugment (2019) ○ Data augmentation to LAS ○ Achieved state-of-the-art results on LibriSpeech and SWBD Further Development of LAS C. Chiu, et al., ”State-of-the-art speech recognition with sequence-to-sequence models", 2018. D. Park, et al., “SpecAugment: A simple data augmentation method for automatic speech recognition", 2019.
  • 38. Confidential - Not for further distribution Performance of Seq-to-Seq w/ Attention D. Park, et al., “SpecAugment: A simple data augmentation method for automatic speech recognition", 2019. LibriSpeech SWBD
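SpecAugment's core masking policies are simple enough to sketch directly (time warping omitted); the parameter defaults below are illustrative, roughly in the range the paper explores.

```python
import torch

def spec_augment(spec, max_f=27, max_t=100, n_f=2, n_t=2):
    """Sketch of SpecAugment masking: zero out random frequency bands and
    time spans of a log-mel spectrogram `spec` of shape (T, n_mels)."""
    T, F = spec.shape
    spec = spec.clone()
    for _ in range(n_f):                                  # frequency masking
        f = torch.randint(0, max_f + 1, (1,)).item()
        f0 = torch.randint(0, max(1, F - f), (1,)).item()
        spec[:, f0:f0 + f] = 0.0
    for _ in range(n_t):                                  # time masking
        t = torch.randint(0, max_t + 1, (1,)).item()
        t0 = torch.randint(0, max(1, T - t), (1,)).item()
        spec[t0:t0 + t, :] = 0.0
    return spec
```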
  • 39. TABLE OF CONTENTS 1. Attention 2. Attention in Speech Recognition 3. Attention in Speaker Recognition 4. Pay Attention to Challenges! 5. Conclusions / Q&A
  • 40. Confidential - Not for further distribution ● Non-recurrence structure ○ Inspired by FIR approximation of IIR filters ○ Exploits memory blocks ○ Can model long-term dependency, even without recurrence in its structure Feedforward Sequential Memory Network Recurrent Feedback in RNN as IIR / Memory Blocks in FSMN as FIR S. Zhang, et al., ”Feedforward sequential memory networks without recurrent feedback", 2015.
  • 41. Confidential - Not for further distribution ● Non-recurrence structure ○ Inspired by FIR approximation of IIR filters ○ Exploits memory blocks ○ Can model long-term dependency, even without recurrence in its structure Feedforward Sequential Memory Network (figure: FSMN, c-FSMN, Deep-FSMN) S. Zhang, et al., ”Deep-FSMN for large vocabulary continuous speech recognition", 2018.
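A sketch of the FIR view: the memory block is effectively a learnable FIR filter over the hidden-state sequence, which a depthwise 1-D convolution captures. The class name and tap counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class FSMNMemoryBlock(nn.Module):
    """Sketch of an FSMN memory block: per-dimension FIR taps over past and
    future frames give long-range context without any recurrence."""
    def __init__(self, dim, lookback=20, lookahead=20):
        super().__init__()
        # depthwise 1-D conv = one FIR filter per hidden dimension
        self.fir = nn.Conv1d(dim, dim, lookback + lookahead + 1,
                             groups=dim, bias=False)
        self.lookback, self.lookahead = lookback, lookahead

    def forward(self, h):
        # h: (B, T, dim) hidden states from the previous layer
        x = h.transpose(1, 2)                               # (B, dim, T)
        x = Fn.pad(x, (self.lookback, self.lookahead))      # causal + lookahead taps
        m = self.fir(x).transpose(1, 2)                     # (B, T, dim)
        return h + m                                        # memory added back
```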
  • 42. Confidential - Not for further distribution ● Speech-Transformer ○ Transformer applied to Mandarin Chinese ○ With convolution layers on inputs Multi-Head Self-Attention L. Dong, et al., “Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition", 2018.
  • 43. Confidential - Not for further distribution ● Speech-Transformer ○ Transformer applied to Mandarin Chinese ○ With convolution layers on inputs ● Transformer with convolutions ○ Convolutional contexts applied to inputs, similarly Multi-Head Self-Attention A. Mohamed, et al., “Transformers with convolutional context for ASR", 2019.
  • 44. Confidential - Not for further distribution Multi-Head Self-Attention D. Povey, et al., “A time-restricted self-attention layer for ASR", 2018. ● Speech-Transformer ○ Transformer applied to Mandarin Chinese ○ With convolution layers on inputs ● Transformer with convolutions ○ Convolutional contexts applied to inputs, similarly ● Time-restricted self-attention ○ Left & right contexts restricting the attention mechanism ○ Relative positional encoding ○ Encoder structure only ○ LF-MMI objective
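The time restriction itself is just a band mask added to the attention scores before the softmax; a sketch follows, with `left`/`right` standing in for the context sizes.

```python
import torch

def time_restricted_mask(T, left, right):
    """Sketch of the band mask for time-restricted self-attention:
    frame t may attend only to frames in [t - left, t + right]."""
    idx = torch.arange(T)
    dist = idx[None, :] - idx[:, None]            # (T, T) signed frame offsets
    allowed = (dist >= -left) & (dist <= right)
    # -inf outside the band so softmax assigns those frames zero weight
    return torch.where(allowed, torch.zeros(T, T),
                       torch.full((T, T), float('-inf')))
```

In use, the mask would simply be added to the raw scores, e.g. `scores = scores + time_restricted_mask(T, 15, 15)`, inside any self-attention layer.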
  • 45. Confidential - Not for further distribution Multi-Head Self-Attention K. Han, et al., “Multi-stride self-attention for speech recognition", 2019. ● Speech-Transformer ○ Transformer applied to Mandarin Chinese ○ With convolution layers on inputs ● Transformer with convolutions ○ Convolutional contexts applied to inputs, similarly ● Time-restricted self-attention ○ Left & right contexts restricting the attention mechanism ○ Relative positional encoding ○ Encoder structure only ○ LF-MMI objective
  • 46. Confidential - Not for further distribution ● Speech-Transformer ○ Transformer applied to Mandarin Chinese ○ With convolution layers on inputs ● Transformer with convolutions ○ Convolutional contexts applied to inputs, similarly ● Time-restricted self-attention ○ Left & right contexts restricting the attention mechanism ○ Relative positional encoding ○ Encoder structure only ○ LF-MMI objective ● Self-attention network (SAN) with CTC ○ CTC objective Multi-Head Self-Attention J. Salazar, et al., “Self-attention networks for connectionist temporal classification in speech recognition", 2019.
  • 47. TABLE OF CONTENTS 1. Attention 2. Attention in Speech Recognition 3. Attention in Speaker Recognition 4. Pay Attention to Challenges! 5. Conclusions / Q&A
  • 48. Confidential - Not for further distribution ● Attention in speaker verification ○ Previously, frame-level embeddings were simply averaged over an utterance to form a fixed-length representation ○ Attention is applied over such embeddings instead Deep Speaker Embedding w/ Attention G. Bhattacharya, et al., “Deep speaker embeddings for short-duration speaker verification", 2017.
  • 49. Confidential - Not for further distribution ● Attention in speaker verification ○ Previously, frame-level embeddings were simply averaged over an utterance to form a fixed-length representation ○ Attention is applied over such embeddings instead ● Feedforward networks w/ attention Deep Speaker Embedding w/ Attention C. Raffel, et al., “Feed-forward networks with attention can solve some long-term memory problems", 2015.
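A sketch of the feed-forward attention pooling idea: replace the plain average of frame embeddings with a learned weighted average. The module name and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeedForwardAttentionPool(nn.Module):
    """Sketch of Raffel-style feed-forward attention pooling over frames."""
    def __init__(self, dim, attn_dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, attn_dim), nn.Tanh(),
                                   nn.Linear(attn_dim, 1, bias=False))

    def forward(self, h):
        # h: (B, T, dim) per-frame embeddings
        alpha = torch.softmax(self.score(h).squeeze(-1), dim=-1)  # (B, T) weights
        return torch.einsum('bt,btd->bd', alpha, h)               # utterance embedding
```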
  • 50. Confidential - Not for further distribution ● Attentive statistics pooling ○ Appends standard deviation to weighted mean after attention Deep Speaker Embedding w/ Attention K. Okabe, et al., “Attentive statistics pooling for deep speaker embedding", 2018.
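Attentive statistics pooling then reuses such weights for second-order statistics; a sketch, assuming precomputed attention weights `alpha` from a scorer like the one above.

```python
import torch

def attentive_stats_pool(h, alpha, eps=1e-6):
    """Sketch of attentive statistics pooling: concatenate the
    attention-weighted mean and standard deviation of frame features.
    h: (B, T, dim) frame features; alpha: (B, T) attention weights."""
    mean = torch.einsum('bt,btd->bd', alpha, h)
    var = torch.einsum('bt,btd->bd', alpha, h ** 2) - mean ** 2
    std = torch.sqrt(var.clamp(min=eps))        # weighted standard deviation
    return torch.cat([mean, std], dim=-1)       # (B, 2 * dim) utterance vector
```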
  • 51. Confidential - Not for further distribution Deep Speaker Embedding w/ Attention S. Zhang, et al., “End-to-end attention based text-dependent speaker verification", 2016. ● Multimodal attention in speaker verification ○ Attention on phonetic and speaker representation for the wake word “Hey Cortana” ○ Combining keyword spotting with speaker verification
  • 52. Confidential - Not for further distribution ● D-vectors in LSTM ○ Generates embedding through LSTMs Deep Speaker Embedding w/ Attention G. Heigold, et al., “End-to-end text dependent speaker verification", 2016.
  • 53. Confidential - Not for further distribution ● D-vectors in LSTM ○ Generates embedding through LSTMs ○ Attention applied to get normalized weights for hidden embedding Deep Speaker Embedding w/ Attention F. Chowdhury, et al., “Attention-based models for text-dependent speaker verification", 2017.
  • 54. Confidential - Not for further distribution ● D-vectors in LSTM ○ Generates embedding through LSTMs ○ Attention applied to get normalized weights for hidden embedding Deep Speaker Embedding w/ Attention F. Chowdhury, et al., “Attention-based models for text-dependent speaker verification", 2017. Cross-layer Attention Divided-layer Attention
  • 55. Confidential - Not for further distribution ● Self-attentive embedding ○ Extension of x-vector w/ structured self-attention from sentence embedding Deep Speaker Embedding w/ Attention Z. Lin, et al., “Structured self-attentive sentence embedding", 2017.
  • 56. Confidential - Not for further distribution ● Self-attentive embedding ○ Extension of x-vector w/ structured self-attention ○ Multi-heads Deep Speaker Embedding w/ Attention Y. Zhu, et al., “Self-attentive speaker embeddings for text-independent speaker verification", 2018.
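A sketch of structured self-attention as in Lin et al., including the redundancy penalty ||AA^T - I||_F^2 that pushes the heads to attend to different frames; the class name, head count, and sizes are illustrative.

```python
import torch
import torch.nn as nn

class StructuredSelfAttention(nn.Module):
    """Sketch of structured self-attention: r heads produce an r x dim
    embedding matrix plus a penalty that discourages redundant heads."""
    def __init__(self, dim, attn_dim=128, heads=4):
        super().__init__()
        self.W1 = nn.Linear(dim, attn_dim, bias=False)
        self.W2 = nn.Linear(attn_dim, heads, bias=False)

    def forward(self, h):
        # h: (B, T, dim) frame-level features
        A = torch.softmax(self.W2(torch.tanh(self.W1(h))), dim=1)  # weights over time
        A = A.transpose(1, 2)                                      # (B, heads, T)
        M = A @ h                                                  # (B, heads, dim)
        I = torch.eye(A.size(1), device=h.device)
        penalty = ((A @ A.transpose(1, 2) - I) ** 2).sum(dim=(1, 2)).mean()
        return M, penalty
```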
  • 57. TABLE OF CONTENTS 1. Attention 2. Attention in Speech Recognition 3. Attention in Speaker Recognition 4. Pay Attention to Challenges! 5. Conclusions / Q&A
  • 58. Confidential - Not for further distribution Challenges: Attention in Online ASR ● Can we attend monotonically? (figure: soft attention vs. monotonic vs. chunkwise attention) C. Chiu, et al., “Monotonic chunkwise attention", 2018.
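A simplified, inference-time-only sketch of the monotonic chunkwise idea: scan forward from the previous attention position, make a hard stop decision per frame, then soft-attend within a small chunk. Here `select_energy` and `chunk_energy` are hypothetical scoring callables standing in for the learned energy functions; training instead uses an expected (soft) alignment.

```python
import torch

def mocha_infer_step(enc, prev_pos, select_energy, chunk_energy, w=4):
    """Simplified sketch of monotonic chunkwise attention at inference.
    enc: (T, dim) encoder states; prev_pos: attention position at the
    previous output step; returns a context vector and the new position."""
    T = enc.size(0)
    for j in range(prev_pos, T):
        if torch.sigmoid(select_energy(enc[j])) >= 0.5:   # hard, monotonic stop
            lo = max(0, j - w + 1)                        # chunk of width w ending at j
            scores = torch.stack([chunk_energy(enc[k]) for k in range(lo, j + 1)])
            alpha = torch.softmax(scores, dim=0)
            context = (alpha.unsqueeze(-1) * enc[lo:j + 1]).sum(0)
            return context, j
    return enc[-1], T - 1                                 # fallback: end of input
```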
  • 59. Confidential - Not for further distribution Challenges: Speech Frames ● Are they ideal as basic units? http://jalammar.github.io/illustrated-bert/
  • 60. Confidential - Not for further distribution Challenges: Speech Frames ● Are they ideal as basic units? http://jalammar.github.io/illustrated-bert/ https://towardsdatascience.com/deconstructing-bert-part-2- visualizing-the-inner-workings-of-attention-60a16d86b5c1
  • 61. Confidential - Not for further distribution Challenges: Speech Frames ● Some efforts exist… ○ Multi-resolution of speech frames in multi-stream self-attention ○ But the question remains… K. Han, et al., “State-of-the-art speech recognition using multi-stream self-attention with dilated 1D convolutions", 2019.
  • 62. TABLE OF CONTENTS 1. Attention 2. Attention in Speech Recognition 3. Attention in Speaker Recognition 4. Pay Attention to Challenges! 5. Conclusions / Q&A
  • 63. Confidential - Not for further distribution Intro “ATTENTION” In Interspeech 2019 (recap: the same list of ~50 attention paper titles shown on slide 2)
  • 64. Confidential - Not for further distribution Lots of Areas ATTENDED ● Example ○ Multimodal emotion recognition J. Li, et al., “Attentive to individual: A multimodal emotion recognition network with personalized attention profile", 2019.
  • 66. Confidential - Not for further distribution References 1. Alex Graves, “Generating sequences with recurrent neural networks,” arXiv:1308.0850 [cs], Aug. 2013. 2. Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” ICLR, May 2015, arXiv:1409.0473 [cs], Sep. 2014. 3. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin, “Attention is all you need,” arXiv:1706.03762 [cs], June 2017. 4. Rohit Prabhavalkar, Kanishka Rao, Tara N. Sainath, Bo Li, Leif Johnson and Navdeep Jaitly, “A comparison of sequence-to-sequence models for speech recognition,” Interspeech, Aug. 2017. 5. Amit Das, Jinyu Li, Rui Zhao and Yifan Gong, “Advancing connectionist temporal classification with attention modeling,” ICASSP, April 2018. 6. Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: First results,” Deep Learning and Representation Learning Workshop @NIPS, Dec. 2014. 7. Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho and Yoshua Bengio, “Attention-based models for speech recognition,” NIPS, Dec. 2015. 8. Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel and Yoshua Bengio, “End-to-end attention-based large vocabulary speech recognition,” ICASSP, March 2016. 9. Suyoun Kim, Takaaki Hori and Shinji Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” ICASSP, March 2017. 10. Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey and Tomoki Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” Journal of Selected Topics in Signal Processing, vol. 11, no. 8, Dec. 2017. 11. Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala and Tsubasa Ochiai, “ESPnet: End-to-End Speech Processing Toolkit,” Interspeech, Sept. 2018. 12. William Chan, Navdeep Jaitly, Quoc V. Le and Oriol Vinyals, “Listen, attend and spell,” arXiv:1508.01211 [cs], Aug. 2015. 13. Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski and Michiel Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” ICASSP, April 2018. 14. Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk and Quoc V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” Interspeech, Sept. 2019. REFERENCES
  • 67. Confidential - Not for further distribution References 15. Albert Zeyer, Kazuki Irie, Ralf Schluter and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” Interspeech, Sept. 2018. 16. Kazuki Irie, Rohit Prabhavalkar, Anjuli Kannan, Antoine Bruguier, David Rybach and Patrick Nguyen, “On the choice of modeling unit for sequence-to-sequence speech recognition,” Interspeech, Sept. 2019. 17. Albert Zeyer, Andre Merboldt, Ralf Schluter and Hermann Ney, “A comprehensive analysis on attention models,” Interpretability and Robustness in Audio, Speech, and Language Workshop @NIPS, Dec. 2018. 18. Liang Lu, Xingxing Zhang and Steve Renals, “On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition,” ICASSP, March 2016. 19. Shubham Toshniwal, Hao Tang, Liang Lu and Karen Livescu, “Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition,” Interspeech, Aug. 2017. 20. Chao Weng, Jia Cui, Guangsen Wang, Jun Wang, Chengzhu Yu, Dan Su and Dong Yu, “Improving attention based sequence-to-sequence models for end-to-end English conversational speech recognition,” Interspeech, Sept. 2018. 21. Shiliang Zhang, Hui Jiang, Si Wei and Lirong Dai, “Feedforward sequential memory neural networks without recurrent feedback,” arXiv:1510.02693 [cs], Oct. 2015. 22. Shiliang Zhang, Cong Liu, Hui Jiang, Si Wei, Lirong Dai and Yu Hu, “Feedforward sequential memory networks: A new structure to learn long-term dependency,” arXiv:1512.08301 [cs], Dec. 2015. 23. Shiliang Zhang, Hui Jiang, Shifu Xiong, Si Wei and Li-Rong Dai, “Compact feedforward sequential memory networks for large vocabulary continuous speech recognition,” Interspeech, Sept. 2016. 24. Shiliang Zhang, Ming Lei, Zhijie Yan and Lirong Dai, “Deep-FSMN for large vocabulary continuous speech recognition,” arXiv:1803.05030 [cs], March 2018. 25. Xuerui Yang, Jiwei Li and Xi Zhou, “A novel pyramidal-FSMN architecture with lattice-free MMI for speech recognition,” arXiv:1810.11352 [cs], Oct. 2018. 26. Linhao Dong, Shuang Xu and Bo Xu, “Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition,” ICASSP, April 2018. 27. Shiyu Zhou, Linhao Dong, Shuang Xu and Bo Xu, “A comparison of modeling units in sequence-to-sequence speech recognition with the Transformer on Mandarin Chinese,” arXiv:1805.06239 [cs], May 2018. 28. Shiyu Zhou, Linhao Dong, Shuang Xu and Bo Xu, “Syllable-based sequence-to-sequence speech recognition with the Transformer in Mandarin Chinese,” Interspeech, Sept. 2018. REFERENCES
  • 68. Confidential - Not for further distribution References 29. Abdelrahman Mohamed, Dmytro Okhonko and Luke Zettlemoyer, “Transformers with convolutional context for ASR,” arXiv:1904.11660 [cs], April 2019. 30. Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li and Sanjeev Khudanpur, “A time-restricted self-attention layer for ASR,” ICASSP, April 2018. 31. Kyu J. Han, Jing Huang, Yun Tang, Xiaodong He and Bowen Zhou, “Multi-stride self-attention for speech recognition,” Interspeech, Sept. 2019. 32. Julian Salazar, Katrin Kirchhoff and Zhiheng Huang, “Self-attention networks for connectionist temporal classification in speech recognition,” ICASSP, May 2019. 33. Shaoshi Ling, Julian Salazar and Katrin Kirchhoff, “Contextual phonetic pretraining for end-to-end utterance-level language and speaker recognition,” Interspeech, Sept. 2019. 34. Yuanyuan Zhao, Jie Li, Xiaorui Wang and Yan Li, “The Speechtransformer for large-scale Mandarin Chinese speech recognition,” ICASSP, May 2019. 35. Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stuker and Alex Waibel, “Self-attentional acoustic models,” Interspeech, Sept. 2018. 36. Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Muller, Sebastian Stuker and Alex Waibel, “Very deep self-attention networks for end-to-end speech recognition,” Interspeech, Sept. 2019. 37. Dong Yu and Jinyu Li, “Recent progress in deep learning based acoustic models (updated),” IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 2017. 38. Gautam Bhattacharya, Jahangir Alam and Patrick Kenny, “Deep speaker embeddings for short-duration speaker verification,” Interspeech, Aug. 2017. 39. Colin Raffel and Daniel P. W. Ellis, “Feed-forward networks with attention can solve some long-term memory problems,” ICLR, May 2015. 40. Koji Okabe, Takafumi Koshinaka and Koichi Shinoda, “Attentive statistics pooling for deep speaker embedding,” Interspeech, Sept. 2018. 41. Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li and Yifan Gong, “End-to-end attention based text-dependent speaker verification,” SLT, Dec. 2016. 42. Georg Heigold, Ignacio Moreno, Samy Bengio and Noam Shazeer, “End-to-end text dependent speaker verification,” ICASSP, March 2016. 43. F. A. Rezaur Rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno and Li Wan, “Attention-based models for text-dependent speaker verification,” arXiv:1710.10470 [cs], Oct. 2017. 44. Yann N. Dauphin, Angela Fan, Michael Auli and David Grangier, “Language modeling with gated convolutional networks,” arXiv:1612.08083 [cs], Dec. 2016. 45. Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou and Yoshua Bengio, “A structured self-attentive sentence embedding,” ICLR, April 2017. 46. Yingke Zhu, Tom Ko, David Snyder, Brian Mak and Daniel Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” Interspeech, Sept. 2018. REFERENCES
  • 69. Confidential - Not for further distribution References 47. Qiongqiong Wang, Koji Okabe, Kong Aik Lee, Hitoshi Yamamoto and Takafumi Koshinaka, “Attention mechanism in speaker recognition: What does it learn in deep speaker embedding?,” SLT, Dec. 2018. 48. Chung-Cheng Chiu and Colin Raffel, “Monotonic chunkwise attention,” ICLR, May 2018. 49. Kyu J. Han, Ramon Prieto and Tao Ma, “State-of-the-art speech recognition using multi-stream self-attention with dilated 1D convolutions,” ASRU, Dec. 2019. 50. Jeng-Lin Li and Chi-Chun Lee, “Attentive to individual: A multimodal emotion recognition network with personalized attention profile,” Interspeech, Sept. 2019. REFERENCES