Deep Reasoning
2016-03-16
Taehoon Kim
carpedm20@gmail.com
References
1. [Sukhbaatar, 2015] Sukhbaatar, Szlam, Weston, Fergus. “End-To-End Memory Networks” Advances in
Neural Information Processing Systems. 2015.
2. [Hill, 2015] Hill, Bordes, Chopra, Weston. “The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations” arXiv preprint arXiv:1511.02301 (2015).
3. [Kumar, 2015] Kumar, Irsoy, Ondruska, Iyyer, Bradbury, Gulrajani, Zhong, Paulus, Socher. “Ask Me
Anything: Dynamic Memory Networks for Natural Language Processing” arXiv preprint
arXiv:1506.07285 (2015).
4. [Xiong, 2016] Xiong, Merity, Socher. “Dynamic Memory Networks for Visual and Textual Question
Answering” arXiv preprint arXiv:1603.01417 (2016).
5. [Yin, 2015] Yin, Schütze, Xiang, Zhou. “ABCNN: Attention-Based Convolutional Neural Network for
Modeling Sentence Pairs” arXiv preprint arXiv:1512.05193 (2015).
6. [Yu, 2015] Yu, Zhang, Hang, Xiang, Zhou. “Empirical Study on Deep Learning Models for Question
Answering” arXiv preprint arXiv:1510.07526 (2015).
7. [Hermann, 2015] Hermann, Kočiský, Grefenstette, Espeholt, Kay, Suleyman, Blunsom. “Teaching
Machines to Read and Comprehend” arXiv preprint arXiv:1506.03340 (2015).
8. [Kadlec, 2016] Kadlec, Schmid, Bajgar, Kleindienst. “Text Understanding with the Attention Sum Reader
Network” arXiv preprint arXiv:1603.01547 (2016).
2
References
9. [Miao, 2015] Miao, Yu, Blunsom. “Neural Variational Inference for Text Processing” arXiv preprint
arXiv:1511.06038 (2015).
10. [Kingma, 2013] Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes" arXiv preprint
arXiv:1312.6114 (2013).
11. [Sohn, 2015] Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. "Learning Structured Output Representation
using Deep Conditional Generative Models." Advances in Neural Information Processing Systems. 2015.
3
Models
Answer selection (WikiQA): ABCNN / Variational / Attentive Pooling
General QA (CNN): E2E MN / Impatient Attentive Reader / Attentive (Impatient) Reader / Attention Sum Reader
Considered transitive inference (bAbI): E2E MN / DMN / ReasoningNet / NTM
4
End-to-End Memory Network [Sukhbaatar, 2015]
5
End-to-End Memory Network [Sukhbaatar, 2015]
6
[Architecture diagram, single hop: the story sentences ("I go to school.", "He gets ball.", …) are embedded twice, with embedding A into the input memory and with embedding C into the output memory; the question ("Where does he go?") is embedded with B into the state u. Attention is a softmax over the inner products of u with the input memories; the output o is the weighted sum of the output memories, and o is summed with u and passed through a final linear layer.]
End-to-End Memory Network [Sukhbaatar, 2015]
7
[Architecture diagram, multiple hops: three memory layers are stacked over the story ("I go to school.", "He gets ball.", …); at each hop the layer output is summed (Σ) with the previous controller state, and a final linear layer produces the answer to "Where does he go?".]
Sentence representation:
i-th sentence: x_i = {x_i1, x_i2, …, x_in}
BoW: m_i = Σ_j A x_ij
Position Encoding: m_i = Σ_j l_j ∘ A x_ij
Temporal Encoding: m_i = Σ_j A x_ij + T_A(i)
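A minimal numpy sketch of one memory hop using the position-encoding representation above; the toy word-id story, the dimensions, and the randomly initialized embeddings A, B, C (and the weight-tied output projection) are illustrative assumptions, not the paper's trained model.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def position_encoding(J, d):
    # l_kj = (1 - j/J) - (k/d)(1 - 2j/J), as in Sukhbaatar et al.
    j = np.arange(1, J + 1)[:, None]
    k = np.arange(1, d + 1)[None, :]
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)              # (J, d)

def encode(sentence, emb):
    E = emb[sentence]                                           # (J, d) word embeddings
    return (position_encoding(len(sentence), E.shape[1]) * E).sum(axis=0)   # m_i = sum_j l_j ∘ A x_ij

d, V = 8, 20                                                    # toy sizes
rng = np.random.default_rng(0)
A = rng.normal(size=(V, d)); B = rng.normal(size=(V, d)); C = rng.normal(size=(V, d))
story = [[1, 2, 3, 4], [5, 6, 7]]                               # "I go to school", "He gets ball" as word ids
question = [8, 9, 5, 2]                                         # "where does he go"

m = np.stack([encode(s, A) for s in story])                     # input memories
c = np.stack([encode(s, C) for s in story])                     # output memories
u = encode(question, B)                                         # question embedding
p = softmax(m @ u)                                              # attention: softmax over inner products
o = p @ c                                                       # weighted sum of output memories
answer_logits = (o + u) @ A.T                                   # final linear layer (weight-tied for the sketch)

Stacking several such hops, with the layer-wise weight tying described next, gives the multi-hop model.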
Training details
Linear Start (LS) helps avoid local minima
- First train with softmax in each memory layer removed, making the model entirely linear except
for the final softmax
- When the validation loss stopped decreasing, the softmax layers were re-inserted and training
recommenced
RNN-style layer-wise weight tying
- The input and output embeddings are the same across different layers
Learning time invariance by injecting random noise
- Jittering the time index with random empty memories
- Add “dummy” memories to regularize 𝑇4(𝑖)
8
Example of bAbI tasks
9
The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
10
The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
11
• Context sentences: S = {s_1, s_2, …, s_n}, with each s_i a BoW word representation
• Encoded memory: m_s = φ(s) for all s ∈ S
• Lexical memory
• Each word occupies a separate slot in the memory
• s is a single word and φ(s) has only one non-zero feature
• Multiple hops are only beneficial in this memory model
• Window memory (best; see the sketch below)
• s corresponds to a window of text from the context S centered on an individual mention of a candidate c in S
m_s = w_{i−(b−1)/2} … w_i … w_{i+(b−1)/2}
• where w_i ∈ C is an instance of one of the candidate words
• Sentential memory
• Same as the original implementation of the Memory Network
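A small Python sketch of how window memories could be built from a tokenized context; the window width b = 5, the toy sentence, and the candidate set are illustrative assumptions.

def window_memories(context_tokens, candidates, b=5):
    """Return one window of width b centred on each mention of a candidate word."""
    half = (b - 1) // 2
    windows = []
    for i, w in enumerate(context_tokens):
        if w in candidates:                                     # w_i ∈ C
            windows.append(context_tokens[max(0, i - half): i + half + 1])
    return windows

tokens = "the king asked the frog to fetch the golden ball".split()
print(window_memories(tokens, {"king", "frog", "ball"}, b=5))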
The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
12
Self-supervision for window memories
- Memory supervision (knowing which memories to attend to) is not provided at training time
- SGD gradient steps force the model to give a higher score to the supporting memory m~ than to any other memory from any other candidate, using:
Hard attention (training and testing): m_o1 = argmax_{i=1,…,n} c_i^T q
Soft attention (testing): m_o1 = Σ_{i=1…n} α_i m_i, with α_i = exp(c_i^T q) / Σ_j exp(c_j^T q)
- If m_o1 happens to differ from m~ (the memory containing the true answer), the model is updated
- This can be understood as a way of achieving hard attention over memories (no new label information is needed beyond the training data)
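A schematic numpy sketch of the self-supervision idea: score each window memory against the question, take the argmax as the hard attention, and update only when the chosen memory differs from the supporting one. The dot-product scorer and the simple margin-style parameter update stand in for the paper's actual gradient step and are illustrative assumptions.

import numpy as np

def self_supervision_step(memories, q, true_idx, lr=0.1):
    """memories: (n, d) encoded windows; q: (d,) encoded question; true_idx: index of m~ (supporting memory)."""
    scores = memories @ q                                       # c_i^T q
    chosen = int(np.argmax(scores))                             # hard attention: m_o1
    if chosen != true_idx:                                      # update only if m_o1 != m~
        q = q + lr * (memories[true_idx] - memories[chosen])    # raise the supporting score, lower the chosen one
    return q, chosen

rng = np.random.default_rng(0)
memories, q = rng.normal(size=(6, 8)), rng.normal(size=8)
q, chosen = self_supervision_step(memories, q, true_idx=2)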
The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations [Hill, 2016]
13
Gated Recurrent Unit (GRU)
14
[GRU cell diagram. Update equations:
z_t = σ(W_z x_t + U_z h_{t−1})                (update gate)
r_t = σ(W_r x_t + U_r h_{t−1})                (reset gate)
h̃_t = tanh(W_h x_t + U_h (r_t ∘ h_{t−1}))     (candidate state)
h_t = z_t ∘ h̃_t + (1 − z_t) ∘ h_{t−1}]
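A direct numpy translation of the equations above; the weight matrices are random placeholders and the small dimensions are only for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)                     # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)                     # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))         # candidate state
    return z * h_tilde + (1 - z) * h_prev                   # h_t

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(d_h, d_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(scale=0.1, size=(d_h, d_h)) for _ in range(3))
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), Wz, Uz, Wr, Ur, Wh, Uh)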
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
15
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
16
[Overview diagram of the DMN. Input Module: a GRU over GloVe word embeddings of the story ("I go to school. He gets ball. …") produces hidden states h_t. Question Module: a GRU over the question ("Where does he go?") produces q. Episodic Memory Module: a gated "GRU-ish" pass over the input states, controlled by gates g_t^i, produces episodes e^i. Answer Module: a GRU conditioned on q generates answer words y_t (softmax at each step) until <EOS>.]
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
17
[Module diagram annotated with the core equations:
Input Module: c_t = h_t = GRU(L[w_t^I], h_{t−1}) over GloVe embeddings of the story words
Question Module: q_t = GRU(L[w_t^Q], q_{t−1})
Gate: g_t^i = G(c_t, m^{i−1}, q)
"GRU-ish" episode pass: h_t^i = g_t^i · GRU(c_t, h_{t−1}^i) + (1 − g_t^i) · h_{t−1}^i, with episode e^i = h_{T_C}^i
The episode produces the new memory m^i
Answer Module: a GRU with a softmax output at each step, generating answer words y_t from q and the final memory until <EOS>.]
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
18
[Same module diagram as the previous slide.]
Attention Mechanism
- Gate: g_t^i = G(c_t, m^{i−1}, q)
- G: a two-layer feed-forward neural network
- Its input is a feature vector that captures similarities between c_t, m^{i−1}, and q
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
19
[Same module diagram as the previous slides.]
Episodic Memory Module
- Iterates over the input representations while computing episodes e^i
- Attention mechanism + recurrent network → updated memory m^i
Episodic memory update: e^i = h_{T_C}^i (final hidden state of the "GRU-ish" pass)
Memory update: m^i = GRU(e^i, m^{i−1})
(a sketch of one pass follows below)
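A compact numpy sketch of one episodic-memory pass as summarized above: a two-layer feed-forward gate G scores each fact against the current memory and the question, a "GRU-ish" update mixes the fact in proportionally to its gate, and the resulting episode updates the memory with a further GRU step. The feature vector inside G is abridged, and the weights, dimensions, and shared GRU parameters are illustrative placeholders rather than the paper's exact parameterization.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru(x, h, W, U):                                        # minimal GRU cell reused throughout the sketch
    z = sigmoid(W[0] @ x + U[0] @ h)
    r = sigmoid(W[1] @ x + U[1] @ h)
    return z * np.tanh(W[2] @ x + U[2] @ (r * h)) + (1 - z) * h

def gate(c, m, q, W1, b1, w2, b2):
    # g = G(c, m, q): two-layer feed-forward net over similarity features (abridged feature set)
    z = np.concatenate([c, m, q, c * q, c * m, np.abs(c - q), np.abs(c - m)])
    return float(sigmoid(w2 @ np.tanh(W1 @ z + b1) + b2))

def episode(facts, m_prev, q, gate_params, W, U):
    h = np.zeros_like(m_prev)
    for c in facts:                                         # "GRU-ish" pass over the facts
        g = gate(c, m_prev, q, *gate_params)                # g_t^i
        h = g * gru(c, h, W, U) + (1 - g) * h               # h_t^i
    return h                                                # e^i = final hidden state

d = 4
rng = np.random.default_rng(0)
facts = [rng.normal(size=d) for _ in range(3)]
q = rng.normal(size=d)
gate_params = (rng.normal(scale=0.1, size=(8, 7 * d)), np.zeros(8), rng.normal(scale=0.1, size=8), 0.0)
W, U = rng.normal(scale=0.1, size=(3, d, d)), rng.normal(scale=0.1, size=(3, d, d))

m = q.copy()                                                # m^0 = q
for _ in range(2):                                          # multiple passes / episodes
    e = episode(facts, m, q, gate_params, W, U)
    m = gru(e, m, W, U)                                     # m^i = GRU(e^i, m^{i-1})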
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
20
[Same module diagram as the previous slides.]
Multiple Episodes
- Allow the model to attend to different inputs during each pass
- Allow a type of transitive inference, since the first pass may uncover the need to retrieve additional facts
Q: Where is the football?
C1: John put down the football.
Only once the model sees C1 does it know that John is relevant, and it can then reason that the second iteration should retrieve where John was.
Criteria for Stopping
- Append a special end-of-passes representation to the input c
- Stop if this representation is chosen by the gate function
- Otherwise, stop after a maximum number of iterations
- This is why the model is called a Dynamic Memory Network
Ask Me Anything: Dynamic Memory Networks for Natural Language
Processing [Kumar, 2015]
21
[Same module diagram as the previous slides.]
Answer Module
- Triggered once at the end of the episodic memory passes, or at each time step
- Concatenates the last generated word and the question vector as the input at each time step
- Trained with a cross-entropy error
Training Details
- Adam optimization
- L2 regularization, dropout on the word embeddings (GloVe)
bAbI dataset
- Objective function: J = α·E_CE(gates) + β·E_CE(answers)
- Gate supervision aims to select one sentence per pass
- Without supervision: the "GRU-ish" pass over the c_t gives h_t^i, and e^i = h_{T_C}^i
- With supervision (simpler): e^i = Σ_{t=1}^{T} softmax(g_t^i) c_t, where softmax(g_t^i) = exp(g_t^i) / Σ_j exp(g_j^i) and g_t^i is the value before the sigmoid
- The supervised form gives better results, because the softmax encourages sparsity and is suited to picking one sentence
22
Training Details
Stanford Sentiment Treebank (Sentiment Analysis)
- Use all full sentences, subsample 50% of phrase-level labels every epoch
- Only evaluated on the full sentences
- Binary classification, neutral phrases are removed from the dataset
- Trained with GRU sequence models
23
Training Details
24
Dynamic Memory Networks for Visual and Textual Question
Answering [Xiong 2016]
25
Several design choices are motivated by intuition and by accuracy improvements
Input Module in DMN
- A single GRU embeds the story and stores the hidden states
- The GRU provides a temporal component: each sentence can know the content of the sentences that came before it
- Cons:
- The GRU only gives sentences context from sentences before them, not after them
- Supporting sentences may be too far away from each other
- Hence the input fusion layer
26
Input Module in DMN+
Replaces the single GRU with two components (a sketch follows after this slide):
1. Sentence reader: responsible only for encoding the words into a sentence embedding
• Uses the positional encoder from the E2E memory network: f_i = Σ_j l_j ∘ A x_ij
• GRUs and LSTMs were also considered, but required more computational resources and were prone to overfitting
2. Input fusion layer: models interactions between sentences, allowing content to flow between them
• A bidirectional GRU lets information come from both past and future sentences
• Gradients do not need to propagate through the words between sentences
• Distant supporting sentences can interact more directly
27
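A sketch of the two-component input module: a positional-encoding sentence reader followed by a bidirectional GRU fusion layer whose forward and backward states are summed into facts f_i. For brevity the forward and backward GRUs share the same random placeholder weights, and the toy word vectors stand in for GloVe embeddings.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru(x, h, W, U):                                        # minimal GRU cell for the sketch
    z = sigmoid(W[0] @ x + U[0] @ h)
    r = sigmoid(W[1] @ x + U[1] @ h)
    return z * np.tanh(W[2] @ x + U[2] @ (r * h)) + (1 - z) * h

def sentence_reader(word_vecs):
    # positional encoding: f_i = sum_j l_j ∘ w_j^i (same scheme as the E2E memory network)
    J, d = word_vecs.shape
    j = np.arange(1, J + 1)[:, None]
    k = np.arange(1, d + 1)[None, :]
    l = (1 - j / J) - (k / d) * (1 - 2 * j / J)
    return (l * word_vecs).sum(axis=0)

def input_fusion(sentences, W, U):
    f = [sentence_reader(s) for s in sentences]             # sentence embeddings
    d = f[0].shape[0]
    fwd, bwd, h = [], [], np.zeros(d)
    for fi in f:                                            # forward GRU over sentences
        h = gru(fi, h, W, U); fwd.append(h)
    h = np.zeros(d)
    for fi in reversed(f):                                  # backward GRU over sentences
        h = gru(fi, h, W, U); bwd.append(h)
    return [a + b for a, b in zip(fwd, reversed(bwd))]      # facts: forward + backward states

d = 4
rng = np.random.default_rng(0)
W, U = rng.normal(scale=0.1, size=(3, d, d)), rng.normal(scale=0.1, size=(3, d, d))
sentences = [rng.normal(size=(n, d)) for n in (4, 3, 5)]    # toy "GloVe" vectors for three sentences
facts = input_fusion(sentences, W, U)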
Input Module for DMN+
28
Referenced paper: A Hierarchical Neural Autoencoder for Paragraphs and Documents [Li, 2015]
Episodic Memory Module in DMN+
- F = [f_1, f_2, …, f_N]: the output of the input module
- Interactions between a fact f_i and both the question q and the episode memory state m^t
29
Attention Mechanism in DMN+
Use attention to extract a contextual vector c^t based on the current focus
1. Soft attention
• A weighted summation of the facts: c^t = Σ_{i=1}^{N} g_i^t f_i
• Can approximate hard attention by selecting a single fact f_i
• Cons: loses positional and ordering information
• Later attention passes can retrieve some of this information, but inefficiently
30
Attention Mechanism in DMN+
2. Attention-based GRU (best)
- Keeps position and ordering information: an RNN is appropriate, but a standard GRU cannot use g_i^t
- In the GRU, u_i is the update gate and r_i controls how much of h_{i−1} is retained
- Replace the update gate u_i (a vector) with the attention gate g_i^t (a scalar)
- This also lets us easily visualize how the attention gates activate
- The final hidden state is used as c^t, which then updates the episodic memory m^t
31
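A sketch of the attention-based GRU: the update gate is replaced by the scalar attention gate g_i^t, and the final hidden state serves as the contextual vector c^t. The example gate values and random weights are placeholders.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attn_gru_step(f_i, h_prev, g_i, Wr, Ur, Wh, Uh):
    r = sigmoid(Wr @ f_i + Ur @ h_prev)                     # reset gate, as in a standard GRU
    h_tilde = np.tanh(Wh @ f_i + Uh @ (r * h_prev))
    return g_i * h_tilde + (1 - g_i) * h_prev               # scalar attention gate replaces the update gate

def contextual_vector(facts, gates, d, params):
    h = np.zeros(d)
    for f_i, g_i in zip(facts, gates):
        h = attn_gru_step(f_i, h, g_i, *params)
    return h                                                # c^t = final hidden state

d = 4
rng = np.random.default_rng(0)
facts = [rng.normal(size=d) for _ in range(3)]
gates = [0.1, 0.7, 0.2]                                     # g_i^t, e.g. a softmax over attention scores
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]
c_t = contextual_vector(facts, gates, d, params)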
Episode Memory Updates in DMN+
1. Untied or tied (better) GRU:
m^t = GRU(c^t, m^{t−1})
2. Untied ReLU layer (best):
m^t = ReLU(W^t [m^{t−1}; c^t; q] + b)
32
Training Details
- Adam optimization
- Xavier initialization is used for all weights except for the word embeddings
- L2 regularization on all weights except biases
- Dropout on the word embedding (GloVe) and answer module with 𝑝 = 0.9
33
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
34
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
35
• Most prior work on answer selection models each sentence separately and neglects their mutual influence
• Humans focus on key parts of s0 by extracting the parts of s1 related to it by identity, synonymy, antonymy, etc.
• ABCNN: takes the interdependence between s0 and s1 into account
• Convolution layers: raise the level of abstraction from words to phrases
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
36
1. Input embedding with word2vec
2-1. Convolution layer with wide convolution
• so that each word v_i is detected by all weights in W
2-2. Average pooling layer
• all-ap: column-wise averaging over all columns
• w-ap: column-wise averaging over windows of width w
3. Output layer with logistic regression
• The all-ap outputs of all non-final layers, plus the final ap layer, are forwarded to the output layer
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
37
Attention on the feature map (ABCNN-1)
• The attention values of row i of A give the attention distribution of the i-th unit of s0 with respect to s1
• A_{i,j} = match-score(F_{0,r}[:, i], F_{1,r}[:, j])
• match-score(x, y) = 1 / (1 + |x − y|)
• The attention feature map F_{i,a} is generated for each s_i (a sketch of the attention matrix follows)
• Cons: needs more parameters
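A small numpy sketch of the ABCNN-1 attention matrix and attention feature maps; the sentence lengths, feature dimension, and the projection matrices W0, W1 are arbitrary placeholders.

import numpy as np

def match_score(x, y):
    return 1.0 / (1.0 + np.linalg.norm(x - y))              # match-score(x, y) = 1 / (1 + |x - y|)

d, len0, len1 = 5, 4, 6
rng = np.random.default_rng(0)
F0 = rng.normal(size=(d, len0))                             # representation feature map of s0 (columns = units)
F1 = rng.normal(size=(d, len1))                             # representation feature map of s1

A = np.array([[match_score(F0[:, i], F1[:, j])              # A[i, j] compares unit i of s0 with unit j of s1
               for j in range(len1)] for i in range(len0)])

W0 = rng.normal(scale=0.1, size=(d, len1))                  # parameters turning A into attention feature maps
W1 = rng.normal(scale=0.1, size=(d, len0))
F0_attn = W0 @ A.T                                          # attention feature map for s0, shape (d, len0)
F1_attn = W1 @ A                                            # attention feature map for s1, shape (d, len1)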
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
38
Attention after convolution (ABCNN-2)
• Attention weights are applied directly on the representation, with the aim of improving the features computed by convolution
• a_{0,j} = Σ A[j, :] and a_{1,j} = Σ A[:, j], i.e. row-wise and column-wise sums of the attention matrix
• w-ap is then applied to the attention-weighted convolution features
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
39
ABCNN-1 vs. ABCNN-2
ABCNN-1: indirect impact on convolution | needs more parameters, vulnerable to overfitting | handles smaller-granularity units (e.g. word level)
ABCNN-2: direct impact via (attention-weighted) pooling after convolution | needs no extra parameters | handles larger-granularity units (e.g. phrase level, phrase size = window size)
ABCNN-3: combination of ABCNN-1 and ABCNN-2
ABCNN: Attention-Based Convolutional Neural Network for Modeling
Sentence Pairs [Yin 2015]
40
Empirical Study on Deep Learning Models for QA [Yu 2015]
41
Empirical Study on Deep Learning Models for QA [Yu 2015]
42
The first work to examine Neural Turing Machines on QA problems
Splits QA into two steps:
1. Search for the supporting facts
2. Generate the answer from the relevant pieces of information
NTM
• Single-layer LSTM network as the controller
• Input: word embeddings
1. Supporting facts only
2. Facts highlighted: markers annotate the beginning and end of the supporting facts
• Output: softmax layer (multiclass classification) over answers
Teaching Machines to Read and Comprehend [Hermann 2015]
43
Teaching Machines to Read and Comprehend [Hermann 2015]
44
[Attentive Reader diagram] s(t): the degree to which the network attends to a particular document token y_d(t) when answering the query (soft attention); the document representation is the attention-weighted sum r = Σ_t s(t) y_d(t)
Text Understanding with the Attention Sum Reader Network [Kadlec 2016]
45
The answer should appear in the context
Inspired by Pointer Networks
In contrast to the Attentive Reader:
• The answer is selected directly from the context, by summing the attention weights over each candidate's individual occurrences rather than blending token representations into a single weighted combination
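A minimal numpy sketch of the attention-sum step: a softmax over token scores in the document, then the probability of each candidate summed over all of its occurrences, so the answer is picked directly from the context. The toy document and the scores are made up for illustration.

import numpy as np

def attention_sum(doc_tokens, token_scores, candidates):
    p = np.exp(token_scores - token_scores.max())
    p = p / p.sum()                                          # attention over document positions
    return {c: sum(p[i] for i, w in enumerate(doc_tokens) if w == c)   # pointer-sum over occurrences
            for c in candidates}

doc = "mary went to the kitchen mary picked up the ball".split()
scores = np.array([0.1, 0.0, 0.0, 0.0, 0.9, 0.8, 0.2, 0.1, 0.0, 1.5])
probs = attention_sum(doc, scores, {"mary", "kitchen", "ball"})
print(max(probs, key=probs.get))                             # predicted answer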
Stochastic Latent Variable
46
[Graphical models: z → x (generative model); x → z and (z, x) → y (conditional generative model)]
Generative model:
p(x) = Σ_z p(x, z) = Σ_z p(x|z) p(z)          (discrete z)
p(x) = ∫ p(x, z) dz = ∫ p(x|z) p(z) dz        (continuous z)
Conditional generative model:
p(y|x) = Σ_z p(y|z, x) p(z|x)                 (discrete z)
p(y|x) = ∫ p(y|z, x) p(z|x) dz                (continuous z)
Variational Inference Framework
47
p(x, z) = p(x|z) p(z) = ∫_h p(x|h) p(h|z) p(z) dh

log p(x, z) = log ∫_h q(h) [ p(x|h) p(h|z) p(z) / q(h) ] dh
  ≥ ∫_h q(h) log [ p(x|h) p(h|z) p(z) / q(h) ] dh
  = ∫_h q(h) log [ p(x|h) p(h|z) ] dh + ∫_h q(h) log [ p(z) / q(h) ] dh
  = E_{q(h)}[ log p(x|h) p(h|z) ] − D_KL( q(h) ‖ p(z) )
  = E_{q(h)}[ log p(x|h) p(h|z) p(z) − log q(h) ]
Variational Inference Framework
48
p_θ(x, z) = p_θ(x|z) p(z) = ∫_h p_θ(x|h) p_θ(h|z) p(z) dh

log p_θ(x, z) = log ∫_h q(h) [ p_θ(x|h) p_θ(h|z) p(z) / q(h) ] dh
  ≥ ∫_h q(h) log [ p_θ(x|h) p_θ(h|z) p(z) / q(h) ] dh        (Jensen's inequality)
  = ∫_h q(h) log [ p_θ(x|h) p_θ(h|z) ] dh + ∫_h q(h) log [ p(z) / q(h) ] dh
  = E_{q(h)}[ log p_θ(x|h) p_θ(h|z) ] − D_KL( q(h) ‖ p(z) )
  = E_{q(h)}[ log p_θ(x|h) p_θ(h|z) p(z) − log q(h) ],   a tight lower bound if q(h) = p(h|x, z)
Conditional Variational Inference Framework
49
p_θ(y|x) = Σ_z p_θ(y, z|x) = Σ_z p_θ(y|x, z) p(z|x)

log p(y|x) = log ∫_z q(z) [ p(y|z, x) p(z|x) / q(z) ] dz
  ≥ ∫_z q(z) log [ p(y|z, x) p(z|x) / q(z) ] dz              (Jensen's inequality)
  = ∫_z q(z) log p(y|z, x) dz − ∫_z q(z) log q(z) dz + ∫_z q(z) log p(z|x) dz
  = E_{q(z)}[ log p(y|z, x) ] − D_KL( q(z) ‖ p(z|x) )
  = E_{q(z)}[ log p(y|z, x) p(z|x) − log q(z) ],   a tight lower bound if q(z) = p(z|x)
Neural Variational Inference Framework
log p_θ(y|x) ≥ E_{q(z)}[ log p(y|z, x) ] − D_KL( q(z) ‖ p(z|x) ) = ℒ
1. Vector representations of the observed variables: u = f_z(z), v = f_x(x)
2. Joint representation (concatenation): π = g(u, v)
3. Parameterize the variational distribution: μ = l_1(π), σ = l_2(π)
50
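A numpy sketch of the three steps above for a diagonal Gaussian variational distribution, finished with a reparameterized sample; the encoders f_z, f_x, the joint map g, and the linear layers l_1, l_2 are stand-ins with random placeholder weights.

import numpy as np

d_obs, d_hid, d_lat = 6, 8, 3
rng = np.random.default_rng(0)
Wz = rng.normal(scale=0.1, size=(d_hid, d_obs))              # f_z
Wx = rng.normal(scale=0.1, size=(d_hid, d_obs))              # f_x
Wpi = rng.normal(scale=0.1, size=(d_hid, 2 * d_hid))         # g
Wmu = rng.normal(scale=0.1, size=(d_lat, d_hid))             # l_1
Wsig = rng.normal(scale=0.1, size=(d_lat, d_hid))            # l_2

def inference_network(z_obs, x_obs):
    u, v = np.tanh(Wz @ z_obs), np.tanh(Wx @ x_obs)          # 1. representations of the observed variables
    pi = np.tanh(Wpi @ np.concatenate([u, v]))               # 2. joint representation (concatenation)
    mu, log_sigma = Wmu @ pi, Wsig @ pi                      # 3. parameterize the variational distribution
    h = mu + np.exp(log_sigma) * rng.normal(size=d_lat)      # reparameterized sample from q
    return mu, log_sigma, h

mu, log_sigma, h = inference_network(rng.normal(size=d_obs), rng.normal(size=d_obs))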
Neural Variational Document Model [Miao, 2015]
51