Deep learning 1.0 and Beyond, Part 2

16/11/2020 1
A/Prof Truyen Tran
With contributions from Vuong Le, Hung Le, Thao Le, Tin Pham & Dung Nguyen
Deakin University
December 2020
Deep learning 1.0 and Beyond
A tutorial
Part II
@truyenoz
truyentran.github.io
truyen.tran@deakin.edu.au
letdataspeak.blogspot.com
goo.gl/3jJ1O0
linkedin.com/in/truyen-tran

“[By 2023] … Emergence of the generally agreed upon "next big thing" in AI beyond deep learning.”
Rodney Brooks
rodneybrooks.com
“[…] general-purpose computer programs, built on top of far richer primitives than our current differentiable layers—[…] we will get to reasoning and abstraction, the fundamental weakness of current models.”
Francois Chollet
blog.keras.io
“Software 2.0 is written in neural network weights”
Andrej Karpathy
medium.com/@karpathy

DL 1.0 has been fantastic, but has serious limitations
(but not always its fault)
DL builds glorified function approximators using gradient descent
• Great at interpolating. Think GPT-X.
• One-step input/output mapping
• Requires differentiability
Little systematic generalization
#REF: Marcus, Gary. "Deep learning: A critical appraisal." arXiv preprint arXiv:1801.00631 (2018).
Data hungry, to cover all possible patterns
• Computationally demanding to process large data
• Energy inefficient
• Prohibitive for small labs to compete
• Engineering effort is huge → technical debt
A little too much heuristic. Lack of theory.

DL 1.0 has been fantastic, but has serious limitations
(but not always its fault) (cont.)
#REF: Marcus, Gary. "Deep learning: A critical appraisal." arXiv preprint arXiv:1801.00631 (2018).
Lacks a natural mechanism to incorporate prior knowledge, e.g., common sense
Assumes stationarity
• Changes cause trouble → expensive retraining
• No causality → random correlations can be “learnt”
Sensitive to adversarial attacks
Lack of reasoning
• Pure pattern recognizer
• Little explainability → trust issue
To be fair, many of these problems are common issues of statistical learning!

DL 1.0 is great, but it struggles to solve many AI/ML problems
Learn to organize and remember ultra-long sequences
Learn to generate arbitrary objects, with zero support
Reasoning about objects, relations, causality, self and other agents
Imagine scenarios, act on the world and learn from the feedback
Continual learning, never-ending, across tasks, domains, representations
Learn by socializing
Learn just by observing and self-prediction
Organizing and reasoning about (common-sense) knowledge
Automated discovery of physical laws
Solve genetics, neuroscience and healthcare
Automate physical sciences
Automate software engineering

Agenda
Deep learning 1.0
• Classic models
• Transformers
• Graph neural networks
• Unsupervised learning
Deep learning 2.0
• Neural memories
• Theory of mind
• Neural reasoning
• A system view

1960s-1990s
• Hand-crafted rules, domain-specific, logic-based
• High in reasoning
• Can’t scale
• Fails on unseen cases
1990s-present
• Machine learning, general purpose, statistics-based
• Low in reasoning
• Needs lots of data
• Less adaptive
• Little explanation
2020s-2030s
• Learning + reasoning, general purpose, human-like
• Has contextual and common-sense reasoning
• Requires less data
• Adapts to change
• Explainable
Photo credit: DARPA

System 1:
Intuitive
• Fast
• Implicit/automatic
• Pattern recognition
• Multiple
System 2:
Analytical
• Slow
• Deliberate/rational
• Careful analysis
• Single, sequential
• Hypothetical thought
• Decoupled from data rep
Single
Memory
• Facts
• Semantics
• Events and relational
associations
• Working space –
temporal buffer
Pattern
recognition
Reasoning

Current neural network offerings
No storage of intermediate results
Little choice over what to compute and what to use
Lack of conditional computation
Little support for complex chained reasoning
Little support for rapid switching of tasks
Credit: hexahedria

What is missing? A memory
Use multiple pieces of information
Store intermediate results (RAM-like)
Episodic recall of previous tasks (tape-like)
Encode/compress & generate/decompress long sequences
Learn/store programs (e.g., fast weights)
Store and query external knowledge
Spatial memory for navigation
Rare but important events (e.g., snake bite)
Needed for complex control
Short-cuts for ease of gradient propagation = constant path length
Division of labour: program, execution and storage
Working memory is an indicator of IQ in humans

Memory enables reasoning
Expert reasoning was enabled by a large long-term memory, acquired through experience
Working memory for analytic reasoning
• WM is a system to support information binding to a coordinate system
• Reasoning as deliberative hypothesis testing → memory-retrieval based hypothesis generation
• Higher order cognition = creating & manipulating relations → representation of premises, temporarily stored in WM.
Reasoning over concepts & relations requires semantic memory
Memory is critical for episodic future thinking (mental simulation)
“[…] one cannot hope to understand reasoning without understanding the memory processes […]”
(Thompson and Feeney, 2014)


Recall: Memory networks
• Input is a set → loaded into memory, which is NOT updated.
• State is an RNN with attention reading from inputs
• Concepts: query, key and content + content addressing.
• Deep models, but constant path length from input to output.
• Equivalent to an RNN with a shared input set.
Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
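A minimal sketch of the read loop described on this slide may help: the memory is loaded once and never written, and a state vector repeatedly attends over it by content. The toy keys, contents and initial state are made-up values, and the function names are illustrative, not taken from the paper's code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def memory_hop(query, keys, contents):
    """One read: content addressing over keys, then a weighted sum of contents."""
    attention = softmax([dot(query, k) for k in keys])
    dim = len(contents[0])
    read = [sum(a * c[d] for a, c in zip(attention, contents)) for d in range(dim)]
    # The state is updated additively; the memory itself is never written.
    new_query = [q + r for q, r in zip(query, read)]
    return attention, new_query

# Hypothetical toy memory with three slots (values are assumptions).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contents = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
state = [2.0, 0.0]
for hop in range(2):  # the same memory is re-read at every hop
    attention, state = memory_hop(state, keys, contents)
```

Every hop reads the whole input set directly, which is why the path from any input item to the output stays constant in length regardless of the number of items.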

MANN: Memory-Augmented Neural Networks
(a constant path length)
Long-term dependency
E.g., outcome depends on the far past
Memory is needed (e.g., as in LSTM)
Complex programs require multiple computational steps
Each step can be selective (attentive) to certain memory cells
Operations: Encoding | Decoding | Retrieval

Learning a Turing machine
• Can we learn a (neural) program that learns to program from data?
Visual reasoning is a specific program of two inputs (visual, linguistic)

Neural Turing machine (NTM)
(simulating a differentiable Turing machine)
A controller that takes
input/output and talks to an
external memory module.
Memory has read/write
operations.
The main issue is where to write,
and how to update the memory
state.
All operations are differentiable.
Source: rylanschaeffer.github.io
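To make "where to write, and how to update the memory state" concrete, here is a rough sketch of content-based addressing with a soft read and an erase-then-add write. It covers only the content-addressing part; the location-based steps of the full NTM (interpolation, shift, sharpening) are left out, and the memory contents, key and sharpness `beta` are assumed toy values.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / (den + 1e-8)

def content_address(memory, key, beta):
    """Soft address: softmax over beta-scaled cosine similarity to each row."""
    scores = [beta * cosine(row, key) for row in memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, w):
    """Read vector = attention-weighted sum of memory rows."""
    width = len(memory[0])
    return [sum(w_i * row[d] for w_i, row in zip(w, memory)) for d in range(width)]

def write(memory, w, erase, add):
    """Erase-then-add update, applied softly to every row."""
    return [
        [m * (1.0 - w_i * e) + w_i * a for m, e, a in zip(row, erase, add)]
        for w_i, row in zip(w, memory)
    ]

# Toy memory with three rows (assumed values).
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = content_address(memory, key=[0.0, 1.0, 0.0], beta=10.0)
r = read(memory, w)
memory = write(memory, w, erase=[1.0, 1.0, 1.0], add=[0.5, 0.5, 0.0])
```

Because the address `w` is a softmax rather than a hard index, reading and writing are smooth functions of the controller's outputs, so gradients can flow through them.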

NTM operations
medium.com/@aidangomez
rylanschaeffer.github.io

NTM unrolled in time with LSTM as controller
#Ref: https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315

MANN for reasoning
Three steps:
• Store data into memory
• Read query, process sequentially, consult memory
• Output answer
Behind the scenes:
• Memory contains data & results of intermediate steps
Drawbacks of current MANNs:
• No memory of controllers → less modularity and compositionality when the query is complex
• No memory of relations → much harder to chain predicates.
Source: rylanschaeffer.github.io

Failures of item-only MANNs for reasoning
Relational representation is NOT stored → can’t be reused later in the chain
A single memory of items and relations → can’t understand how relational reasoning occurs
The memory-memory relationship is coarse since it is represented as either a dot product or a weighted sum.

Self-attentive associative memories (SAM)
Learning relations automatically over time
Hung Le, Truyen Tran, Svetha Venkatesh, “Self-
attentive associative memory”, ICML'20.

NUTM = NTM + NSM
Hung Le, Truyen Tran, Svetha Venkatesh,
“Neural stored-program memory”, ICLR'20.

Computing devices vs neural counterparts
FSM (1943) ↔ RNNs (1982)
PDA (1954) ↔ Stack RNN (1993)
TM (1936) ↔ NTM (2014)
UTM/VNA (1936/1945) ↔ NUTM (2019)


A testbed: Visual QA
What color is the thing with the same size as the blue cylinder?
blue
• Requires multi-step reasoning: find blue cylinder ➔ locate other object of the same size ➔ determine its color (green).

Reasoning
Qualitative spatial
reasoning
Relational, temporal
inference
Commonsense
Object recognition
Scene graphs
Computer Vision
Natural Language
Processing
Machine
learning
Visual QA
Parsing
Symbol binding
Systematic generalisation
Learning to classify
entailment
Unsupervised
learning
Reinforcement
learning
Program synthesis
Action graphs
Event detection
Object
discovery

Learning to reason
Learning is to improve itself by experiencing ~ acquiring knowledge & skills
Reasoning is to deduce knowledge from previously acquired knowledge in response to a query (or a cue)
Learning to reason is to improve the ability to decide if a knowledge base entails a predicate.
• E.g., given a video f, determine if the person with the hat turns before singing.
Hypotheses:
• Reasoning as just-in-time program synthesis.
• It employs conditional computation.
Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM (JACM) 44.5 (1997): 697-725.
(Dan Roth; ACM Fellow; IJCAI John McCarthy Award)
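As a toy illustration of "deciding if a knowledge base entails a predicate", the sketch below runs classical forward chaining over hand-written propositional rules. The facts and rules are hypothetical stand-ins for what would be extracted from the video. In learning to reason, this decision procedure is what the system would have to learn rather than be given.

```python
def entails(facts, rules, query):
    """Forward chaining over propositional Horn rules (premises -> conclusion)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return query in known

# Hypothetical facts about a video (assumed, for illustration only).
facts = {"wears_hat(p1)", "turns(p1, t=3)", "sings(p1, t=7)"}
rules = [
    (("turns(p1, t=3)", "sings(p1, t=7)"), "turns_before_singing(p1)"),
    (("wears_hat(p1)", "turns_before_singing(p1)"), "hat_person_turns_before_singing"),
]
answer = entails(facts, rules, "hat_person_turns_before_singing")
```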

Why neural reasoning?
Reasoning is not necessarily achieved by making
logical inferences
There is a continuity between [algebraically rich
inference] and [connecting together trainable
learning systems]
Central to reasoning is composition rules to guide
the combinations of modules to address new tasks
“When we observe a visual scene, when
we hear a complex sentence, we are
able to explain in formal terms the
relation of the objects in the scene, or
the precise meaning of the sentence
components. However, there is no
evidence that such a formal analysis
necessarily takes place: we see a scene,
we hear a sentence, and we just know
what they mean. This suggests the
existence of a middle layer, already a
form of reasoning, but not yet formal
or logical.”
Bottou, Léon. "From machine learning to machine
reasoning." Machine learning 94.2 (2014): 133-149.

The two approaches to neural reasoning
Implicit chaining of predicates through recurrence:
• Step-wise query-specific attention to relevant concepts & relations.
• Iterative concept refinement & combination, e.g., through a working memory.
• Answer is computed from the last memory state & question embedding.
Explicit program synthesis:
• There is a set of modules, each performing a predefined operation.
• The question is parsed into a symbolic program.
• The program is implemented as a computational graph constructed by chaining separate modules.
• The program is executed to compute an answer.

MACNet: Composition-Attention-Control
(reasoning by progressive refinement of selected data)
Hudson, Drew A., and Christopher D. Manning.
"Compositional attention networks for machine
reasoning." arXiv preprint arXiv:1803.03067 (2018).

LOGNet: Relational object reasoning with language binding
• Key insight: Reasoning is chaining of relational predicates to arrive at a final conclusion
→ Needs to uncover spatial relations, conditioned on query
→ Chaining is query-driven
→ Objects/language need binding
→ Object semantics is query-dependent
→ Everything is end-to-end differentiable
System 1: visual
representation
System 2: High-level
reasoning
Thao Minh Le, Vuong Le, Svetha Venkatesh, and
Truyen Tran, “Dynamic Language Binding in
Relational Visual Reasoning”, IJCAI’20.

Language-binding Object Graph Network for VQA
Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran, “Dynamic Language Binding in Relational Visual Reasoning”, IJCAI’20.

Transformer as implicit reasoning
Reasoning as (free-) energy minimisation
The classic Belief Propagation algorithm is a minimization algorithm for the Bethe free energy!
The Transformer’s relational, iterative state refinement makes it a great candidate for implicit relational reasoning.
Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free energy." Advances in neural information processing systems. 2003.
Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
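One way to see attention as energy minimisation is the modern Hopfield update from Ramsauer et al., where the new state is a softmax-weighted sum of stored patterns, i.e. an attention read with the state as the query. The sketch below iterates that update on two made-up patterns. The patterns, the noisy start state and `beta` are assumptions chosen for illustration.

```python
import math

def hopfield_update(patterns, state, beta):
    """One step: state <- sum_i softmax(beta * <x_i, state>)_i * x_i."""
    scores = [beta * sum(p * s for p, s in zip(pat, state)) for pat in patterns]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * pat[d] for w, pat in zip(weights, patterns))
            for d in range(len(state))]

# Two stored patterns and a noisy version of the first one (assumed values).
patterns = [[1.0, 1.0, -1.0, -1.0], [-1.0, 1.0, -1.0, 1.0]]
state = [0.9, 0.6, -0.2, -0.7]
for _ in range(3):  # iterative state refinement
    state = hopfield_update(patterns, state, beta=4.0)
```

After a few iterations the state settles close to the first stored pattern, which is the sense in which repeated attention refines a state toward a low-energy configuration.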

On SQuAD, Answer = start/end positions
Source: http://mccormickml.com/2020/03/10/question-answering-with-a-fine-tuned-BERT/

Anonymous, “Neural spatio-temporal reasoning with object-centric self-
supervised learning”, https://openreview.net/pdf?id=rEaz5uTcL6Q
Answer place holder

NS-CL: Neuro-Symbolic Concept Learner
Question parser
Mao, Jiayuan, et al. "The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision." International Conference on Learning Representations. 2019.

Concept learner
Extract object proposals from the image, from which a feature vector is obtained using RoI Align. Each object feature is denoted as o_i.
Object concepts of the same attribute are mapped into an embedding space. For example, sphere, cube, and cylinder are mapped into the shape embedding space. This mapping is a classification problem!
\Pr[o_i \text{ is a cube}] = \sigma\big( ( \langle \mathrm{ShapeOf}(o_i), v^{\mathrm{cube}} \rangle - \gamma ) / \tau \big)
where
• ShapeOf(·) is a neural network
• v^{cube} is the embedding of the concept cube, to be learned
• σ is the sigmoid function
• γ and τ are scaling constants.
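The concept-classification score on this slide can be sketched numerically as follows. The sketch assumes that ⟨·,·⟩ is cosine similarity (the slide does not say), and uses made-up vectors in place of the learned ShapeOf(o_i) output and concept embeddings. The values of `gamma` and `tau` are likewise arbitrary.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def concept_probability(attribute_embedding, concept_embedding, gamma, tau):
    """sigmoid((similarity - gamma) / tau): probability the object has the concept."""
    sim = cosine(attribute_embedding, concept_embedding)
    return 1.0 / (1.0 + math.exp(-(sim - gamma) / tau))

# Hypothetical values: shape_of_o_i stands in for ShapeOf(o_i).
shape_of_o_i = [0.9, 0.1, 0.0]
v_cube = [1.0, 0.0, 0.0]
v_sphere = [0.0, 1.0, 0.0]
gamma, tau = 0.2, 0.1
p_cube = concept_probability(shape_of_o_i, v_cube, gamma, tau)
p_sphere = concept_probability(shape_of_o_i, v_sphere, gamma, tau)
```

Here γ shifts the similarity threshold and τ controls how sharply the probability changes around it.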

Program execution
Works on an object-based visual representation
An intermediate set of objects is represented by a vector, an attention mask over all objects in the scene. For example, Filter(Green_cube) outputs a mask (0,1,0,0).
The output mask is fed into the next module (e.g., Relate)
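The mask-passing idea can be sketched with a tiny symbolic scene. This is a simplification under stated assumptions: the scene is a hand-written list of attribute dictionaries, masks are hard 0/1 values rather than soft attention from learned embeddings, and the `relate_same_size` and `query` helpers are hypothetical names for illustration.

```python
# Hypothetical four-object scene (assumed, for illustration only).
scene = [
    {"color": "red", "shape": "sphere", "size": "large"},
    {"color": "green", "shape": "cube", "size": "small"},
    {"color": "blue", "shape": "cylinder", "size": "small"},
    {"color": "gray", "shape": "cube", "size": "large"},
]

def filter_objects(mask, **attrs):
    """Keep mask weight only on objects matching all given attributes."""
    return [m * float(all(obj[k] == v for k, v in attrs.items()))
            for m, obj in zip(mask, scene)]

def relate_same_size(mask):
    """Move attention to other objects sharing a size with an attended object."""
    out = []
    for j, obj in enumerate(scene):
        score = max((m for i, m in enumerate(mask)
                     if i != j and scene[i]["size"] == obj["size"]), default=0.0)
        out.append(score)
    return out

def query(mask, attribute):
    """Read an attribute from the most attended object."""
    best = max(range(len(mask)), key=lambda i: mask[i])
    return scene[best][attribute]

all_objects = [1.0] * len(scene)
green_cube = filter_objects(all_objects, color="green", shape="cube")
same_size = relate_same_size(green_cube)   # mask is passed to the next module
answer = query(same_size, "color")
```

Each module takes a mask and returns a mask (or a final value), so a parsed program is just a chain of such calls.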


Contextualized recursive reasoning
Thus far, QA tasks are straightforward and
objective:
Questioner: I will ask about what I don’t know.
Answerer: I will answer what I know.
Real life can be tricky, more subjective:
Questioner: I will ask only questions I think
they can answer.
Answerer 1: This is what I think they want from
an answer.
Answerer 2: I will answer only what I think
they think I can.
Source: religious studies project
→ We need Theory of Mind to function socially.

Sally and Anne
1. Sally and Anne, with Sally’s basket and Anne’s box.
2. Sally puts her cake into her basket.
3. Sally goes out of the room.
4. Anne takes Sally’s cake out of Sally’s basket and puts it into Anne’s box.
5. Sally comes back to the room.
Photo: wikipedia

Social dilemma: Stag Hunt games
Difficult decision: individual outcomes (selfish) or group outcomes (cooperative).
• Hunt Stag together (both are cooperative): both have more meat.
• Each hunts Hare alone (both are selfish): both have less meat.
• One hunts Stag (cooperative), the other hunts Hare (selfish): only the one who hunts Hare has meat.
Human evidence: self-interested but considerate of others (cultures vary).
Idea: belief-based guilt aversion
• One experiences loss if it lets the other down.
• Necessitates Theory of Mind: reasoning about the other’s mind.

A neural theory of mind
Successor representations
next-step action probability
goal
Rabinowitz, Neil C., et al. "Machine theory of mind." arXiv preprint arXiv:1802.07740 (2018).

Theory of Mind Agent with Guilt Aversion (ToMAGA)
Update Theory of Mind
• Predict whether the other’s behaviour is cooperative or uncooperative
• Update the zero-order belief (what the other will do)
• Update the first-order belief (what the other thinks about me)
Guilt Aversion
• Compute the expected material reward of the other based on Theory of Mind
• Compute the psychological rewards, i.e. “feeling guilty”
• Reward shaping: subtract the expected loss of the other.
Nguyen, Dung, et al. "Theory of Mind with Guilt Aversion Facilitates Cooperative Reinforcement Learning." Asian Conference on Machine Learning. PMLR, 2020.
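The reward-shaping step can be illustrated with a rough sketch on a Stag Hunt payoff table. This is not the paper's exact formulation: the payoff numbers, the `guilt_sensitivity` weight and the way the first-order belief enters are all assumptions chosen to show the idea that an agent loses reward when the other gets less than it expected.

```python
STAG, HARE = 0, 1

# PAYOFF[(my_action, other_action)] = (my_reward, other_reward); hypothetical numbers.
PAYOFF = {
    (STAG, STAG): (4.0, 4.0),
    (STAG, HARE): (0.0, 2.0),
    (HARE, STAG): (2.0, 0.0),
    (HARE, HARE): (2.0, 2.0),
}

def shaped_reward(my_action, other_action, first_order_belief, guilt_sensitivity):
    """first_order_belief: probability the other assigns to me hunting Stag."""
    my_reward, other_reward = PAYOFF[(my_action, other_action)]
    # Material reward the other expected, given what it thinks I would do.
    expected_other = (
        first_order_belief * PAYOFF[(STAG, other_action)][1]
        + (1.0 - first_order_belief) * PAYOFF[(HARE, other_action)][1]
    )
    # Guilt is felt only when the other ends up with less than it expected.
    guilt = max(0.0, expected_other - other_reward)
    return my_reward - guilt_sensitivity * guilt

# The other hunts Stag and believes I will too with probability 0.9.
defect = shaped_reward(HARE, STAG, first_order_belief=0.9, guilt_sensitivity=1.0)
cooperate = shaped_reward(STAG, STAG, first_order_belief=0.9, guilt_sensitivity=1.0)
```

With these assumed numbers, letting the other down turns a material reward of 2.0 into a negative shaped reward, while cooperating keeps the full 4.0.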


Summary
Deep learning 1.0
• Classic models
• Transformers
Deep learning 2.0
• Neural memories
• Theory of mind
• Neural reasoning
• A system view

References
Anonymous, “Neural spatio-temporal reasoning with object-centric self-supervised learning”, https://openreview.net/pdf?id=rEaz5uTcL6Q
Bello, Irwan, et al. "Neural optimizer search with reinforcement learning." arXiv preprint arXiv:1709.07417 (2017).
Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new perspectives." IEEE
transactions on pattern analysis and machine intelligence 35.8 (2013): 1798-1828.
Bottou, Léon. "From machine learning to machine reasoning." Machine learning 94.2 (2014): 133-149.
Dehghani, Mostafa, et al. "Universal Transformers." International Conference on Learning Representations. 2018.
Kien Do, Truyen Tran, and Svetha Venkatesh. "Graph Transformation Policy Network for Chemical Reaction
Prediction." KDD’19.
Kien Do, Truyen Tran, Svetha Venkatesh, “Learning deep matrix representations”, arXiv preprint arXiv:1703.01454
Gilmer, Justin, et al. "Neural message passing for quantum chemistry." arXiv preprint arXiv:1704.01212 (2017).
Ha, David, Andrew Dai, and Quoc V. Le. "Hypernetworks." arXiv preprint arXiv:1609.09106 (2016).
Heskes, Tom. "Stable fixed points of loopy belief propagation are local minima of the bethe free energy." Advances in
neural information processing systems. 2003.
Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning." arXiv preprint arXiv:1803.03067 (2018).
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and
variation. arXiv preprint arXiv:1710.10196.
Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM (JACM) 44.5 (1997): 697-725.
Hung Le, Truyen Tran, Svetha Venkatesh, “Self-attentive associative memory”, ICML'20.
Hung Le, Truyen Tran, Svetha Venkatesh, “Neural stored-program memory”, ICLR'20.

Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran, “Dynamic Language Binding in Relational Visual
Reasoning”, IJCAI’20.
Le-Khac, Phuc H., Graham Healy, and Alan F. Smeaton. "Contrastive Representation Learning: A Framework and
Review." arXiv preprint arXiv:2010.05113 (2020).
Liu, Xiao, et al. "Self-supervised learning: Generative or contrastive." arXiv preprint arXiv:2006.08218 (2020).
Marcus, Gary. "Deep learning: A critical appraisal." arXiv preprint arXiv:1801.00631 (2018).
Mao, Jiayuan, et al. "The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural
Supervision." International Conference on Learning Representations. 2019.
Nguyen, Dung, et al. "Theory of Mind with Guilt Aversion Facilitates Cooperative Reinforcement Learning." Asian
Conference on Machine Learning. PMLR, 2020.
Penmatsa, Aravind, Kevin H. Wang, and Eric Gouaux. "X-ray structure of dopamine transporter elucidates antidepressant
mechanism." Nature 503.7474 (2013): 85-90.
Pham, Trang, et al. "Column Networks for Collective Classification." AAAI. 2017.
Ramsauer, Hubert, et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
Rabinowitz, Neil C., et al. "Machine theory of mind." arXiv preprint arXiv:1802.07740 (2018).
Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information
processing systems. 2015.
Tay, Yi, et al. "Efficient transformers: A survey." arXiv preprint arXiv:2009.06732 (2020).
Xie, Tian, and Jeffrey C. Grossman. "Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable
Prediction of Material Properties." Physical review letters 120.14 (2018): 145301.
You, Jiaxuan, et al. "GraphRNN: Generating realistic graphs with deep auto-regressive models." ICML (2018).
