Speaker: Kyunghyun Cho (Professor, NYU)
Kyunghyun Cho is an assistant professor of computer science and data science at New York University.
He was a postdoctoral fellow at University of Montreal until summer 2015, and received PhD and MSc degrees from Aalto University early 2014.
He tries his best to find a balance among machine learning, natural language processing, and life, but often fails to do so.
Abstract:
There are three axes along which advances in machine learning and deep learning happen. They are (1) network architectures, (2) learning algorithms and (3) spatio-temporal abstraction.
In this talk, I will describe a set of research topics I’ve pursued in each of these axes.
- For network architectures, I will describe how recurrent neural networks, which were largely forgotten during 90s and early 2000s, have evolved over time and have finally become a de facto standard in machine translation.
- I then discuss various learning paradigms, how they relate to each other, and how they can be combined to build a strong learning system. Along this line, I briefly discuss my latest research on designing a query-efficient imitation learning algorithm for autonomous driving.
- Lastly, I present my view on what it means to be a higher-level learning system. Under this view each and every end-to-end trainable neural network serves as a module, regardless of how they were trained, and interacts with each other in order to solve a higher-level task.
I will describe my latest research on trainable decoding algorithm as a first step toward building such a framework.
Talk video: https://youtu.be/soZXAH3leeQ (This talk is given in English.)
3. What we want is…
[Diagram: one system of modules — Awesome ConvNet, Awesome LM, Awesome ASR, Awesome RoboArm Controller, Awesome Q&A, Awesome Auto-Driver, Awesome Memory]
• One system with
many modules
• Modules interact with
each other to solve a task
• Knowledge sharing across tasks via
shared modules
• Some trainable, others fixed
4. Paradigm shift
• One neural network per task
• One neural network per function
• Multiple networks cooperate to
solve many higher-level tasks
• Mixture of trainable networks
and fixed modules
5. Examples
• Q&A system
1. Receives a question via
awesome LM+ASR
2. Retrieves relevant info from
awesome memory
3. Generates a response via
awesome LM
• Autonomous driving
1. Senses the environment with
awesome ConvNet+ASR
2. Plans a route with
awesome memory
3. Controls a car via awesome
robot arm controller
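The Q&A pipeline above can be sketched as plain function composition, with each "awesome" module standing in for a trained network. All module names and behaviours below are illustrative stand-ins, not real systems:

```python
def awesome_asr(audio):
    # Stand-in for a speech recognizer: audio -> text.
    # (Returns a fixed question here, since this is only a sketch.)
    return "what is the capital of france"

def awesome_memory(query):
    # Stand-in for a memory module: text query -> retrieved fact.
    facts = {"capital of france": "paris"}
    for key, value in facts.items():
        if key in query:
            return value
    return ""

def awesome_lm(evidence):
    # Stand-in for a language model: evidence -> response text.
    return f"the answer is {evidence}" if evidence else "i do not know"

def qa_system(audio):
    # Modules interact to solve the higher-level Q&A task:
    question = awesome_asr(audio)        # 1. receive a question via ASR
    evidence = awesome_memory(question)  # 2. retrieve relevant info from memory
    return awesome_lm(evidence)          # 3. generate a response via the LM

print(qa_system(b"..."))  # → "the answer is paris"
```

In this composition some modules could be trainable and others fixed, which is exactly where the difficulty discussed next arises.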
But simple composition of neural networks may not work! Why not?
6. Learning to use an NN module
• Why not?
• Target tasks are often unknown at
training time
• Input/output cannot be defined
well a priori
• The amount of learning signal
differs vastly across tasks
• Rich information captured by the NN
module must be passed along
• Internal of the NN module must allow
external manipulation
7. Good news: NNs are transparent!
Hidden activations of a recurrent language model
• NNs are not black boxes.
• We can observe every single bit
inside a neural net.
Bad news: NNs are not easy to understand!
• Humans are not good with high-dimensional
vectors
• Distributed representation
• exponential combinations of hidden units
8. Learning to use an NN module
• Neural nets are good at interpreting
high-dimensional input
• Neural nets are also good at
predicting high-dimensional output
• Internal representation learned by a
neural network is well structured
• Neural nets can be trained with an
arbitrary objective
(My Rejected NSF Proposal, 2016)
9. Learning to use an NN module
1. Query-Efficient Imitation Learning
2. Trainable Decoding
• Real-time Neural Machine Translation
• Trainable Greedy Decoding
3. Neural Query Reformulation
4. Non-Parametric Neural Machine Translation
11. Imitation Learning
• A learner directly interacts with the world
• A supervisor augments reward signal from
the world
• Advantages over supervised and reinforcement learning
• Match between training and test
• Strong learning signal
• Disadvantages
• Where do we get the supervisor???
(Ross et al., 2011; Daume III et al., 2007; and more…)
12. • Supervisors are expensive
• As the learner gets better, less
intervention from the supervisor
• Learner learns from difficult examples
• Questions:
1. Where do we get the safety net?
2. What is the impact on the
learner’s performance?
SafeDAgger: Query-Efficient Imitation Learning
(Zhang&Cho, AAAI 2017; Laskey et al., ICRA 2016)
13. SafeDAgger: Query-Efficient Imitation Learning
1. Learner observes the world
2. SafetyNet observes the learner
3. SafetyNet predicts whether the
learner will fail
4. If no, the learner continues
5. If yes,
1. the supervisor intervenes
2. The learner imitates the
supervisor’s behaviour
Reminds us of the value function from RL!
14. SafeDAgger: Learning
1. Initial labelled data sets: D (for the policy π) and D_safe (for the safety net)
2. Train the policy π using D
3. Train the safety net using D_safe
1. Target for the safety net: whether π deviates from the expert on a given state
4. Collect additional data
1. Let π drive, but the expert intervenes when the safety net predicts failure
2. Collect the expert-labelled data D′
5. Data aggregation: D ← D ∪ D′
6. Go to 2
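The loop above can be sketched on a toy 1-D control task. Everything here is illustrative: the "safety net" cheats by comparing against the expert directly, whereas in SafeDAgger it is itself a learned classifier trained from D_safe:

```python
import random

random.seed(0)

def expert_policy(state):
    # Hypothetical expert: steer proportionally back toward 0.0.
    return -0.5 * state

class Learner:
    # Linear policy a = w * s, fit by least squares on aggregated data.
    def __init__(self):
        self.w = 0.0
    def act(self, state):
        return self.w * state
    def fit(self, data):
        num = sum(s * a for s, a in data)
        den = sum(s * s for s, a in data) or 1.0
        self.w = num / den

def safety_net(learner, state, threshold=0.1):
    # Predicts whether the learner is about to fail. (Illustrative cheat:
    # compare against the expert; SafeDAgger learns this prediction.)
    return abs(learner.act(state) - expert_policy(state)) > threshold

# SafeDAgger-style loop: the expert is queried ONLY when the safety net
# predicts failure; the labelled data set is then aggregated and refit.
learner, data, queries = Learner(), [], 0
for _ in range(200):
    s = random.uniform(-1.0, 1.0)
    if safety_net(learner, s):
        queries += 1                         # expert intervenes
        data.append((s, expert_policy(s)))   # collect labelled state-action
        learner.fit(data)                    # data aggregation + retrain
    # else: the learner drives on its own, no expert query

print(queries)  # far fewer expert queries than 200 environment steps
```

As the learner improves, the safety net fires less often, which is the source of the query efficiency.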
18. Trainable Decoding of
Neural Machine Translation
Jiatao Gu, Graham Neubig, K Cho and Victor Li. Learning to Translate in Real-time
with Neural Machine Translation. EACL 2017.
Jiatao Gu, K Cho and Victor Li. Trainable Greedy Decoding for Neural Machine
Translation. EMNLP 2017.
19. Trainable Decoding
Motivation
• Many decoding objectives unknown while training
• Lack of target training examples
• Arbitrary (non-differentiable) decoding objectives
• Sample-inefficiency of RL algorithms
Our Approach
• Train NMT with supervised learning
• Train a decoding module on top
20. (1) Real-Time Translation
Decoding
1. Start with a pretrained NMT
2. A simultaneous decoder intercepts and
interprets the incoming signal
3. The simultaneous decoder forces the
pretrained model to either
1. output a target symbol, or
2. wait for a next source symbol
Learning
1. Trade-off between delay and quality
2. Stochastic policy gradient (REINFORCE)
(Gu, Neubig, Cho & Li, EACL 2017)
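The two actions above (wait for a source symbol vs. output a target symbol) can be sketched with a fixed "wait-k" rule standing in for the learned policy; the paper instead learns this policy with REINFORCE to trade delay against quality. The `translate_prefix` interface and the copy "model" are illustrative assumptions:

```python
def wait_k_decode(source, translate_prefix, k=2):
    # Fixed wait-k rule in place of the learned READ/WRITE policy:
    # first WAIT for k source symbols, then alternate OUTPUT/WAIT;
    # once the source is exhausted, keep OUTPUTTING until done.
    target, actions = [], []
    read = min(k, len(source))
    while True:
        symbol = translate_prefix(source[:read], target)
        if symbol is None:
            break
        target.append(symbol)        # OUTPUT a target symbol
        actions.append("WRITE")
        if read < len(source):
            read += 1                # WAIT for the next source symbol
            actions.append("READ")
    return target, actions

def copy_model(src_prefix, tgt_prefix):
    # Toy stand-in for the pretrained NMT model: it "translates" by
    # copying the source prefix, uppercased, one symbol at a time.
    if len(tgt_prefix) < len(src_prefix):
        return src_prefix[len(tgt_prefix)].upper()
    return None

target, actions = wait_k_decode(["a", "b", "c", "d"], copy_model, k=2)
print(target, actions)
```

The smaller k is, the lower the delay but the less source context each output decision sees, which is the trade-off the learned policy optimizes.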
22. (2) Trainable Greedy Decoding
Decoding
1. Start with a pretrained NMT
2. A trainable decoder intercepts and
interprets the incoming signal
3. The trainable decoder sends an
altering signal back to the
pretrained model
Learning
1. Deterministic policy gradient
2. Maximize any arbitrary objective
(Gu, Cho & Li, 2017)
23. (2) Trainable Greedy Decoding
Models
1. Actor
• Input: prev. hidden state z_{t-1}, prev. symbol y_{t-1}, and
context c_t from the attention model
• Output: additive bias Δz_t for the hidden state
2. Critic
• Input: a sequence of the hidden states from the decoder
• Output: a predicted return
• In our case, the critic estimates the full return rather than
Q at each time step
(Gu, Cho & Li, EMNLP 2017)
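The actor's role — reading the decoder's signals and sending an additive bias back into the frozen model — can be sketched as follows. Sizes, the toy decoder step, and the actor parameterization are all illustrative; the deterministic policy gradient training is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
H, E = 8, 4   # hidden and embedding sizes (illustrative)

# Frozen "pretrained" decoder step: z_t = tanh(W [z_{t-1}; y_{t-1}; c_t]).
W = rng.normal(scale=0.1, size=(H, H + E + E))

def decoder_step(z_prev, y_prev, ctx, actor=None):
    x = np.concatenate([z_prev, y_prev, ctx])
    z = np.tanh(W @ x)
    if actor is not None:
        # The actor reads the same signals and adds a bias to the
        # pretrained model's hidden state; its matrix is the ONLY
        # trainable part, W stays fixed.
        z = z + actor @ x
    return z

A = np.zeros((H, H + E + E))   # actor parameters, to be trained by DPG
z0, y0, c0 = rng.normal(size=H), rng.normal(size=E), rng.normal(size=E)
plain = decoder_step(z0, y0, c0)
steered = decoder_step(z0, y0, c0, actor=A + 0.01)
print(np.allclose(plain, decoder_step(z0, y0, c0, actor=A)))  # zero actor = no change
```

With the actor initialized to zero, decoding starts identical to the pretrained model and is then steered toward an arbitrary objective.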
24. (2) Trainable Greedy Decoding
(Gu, Cho & Li, EMNLP 2017)
Learning
1) Generate a translation for a source sentence using the actor perturbed with noise
2) Train the critic to minimize the error between its predicted return and the observed decoding objective
3) Generate multiple translations with noise
4) Critic-aware actor learning: a newly proposed update that weights the actor’s update by the critic’s reliability on each sample
Inference: simply throw away the critic and use the actor
25. (2) Trainable Greedy Decoding
• The trainable decoder does improve the target decoding objective
• Training is quite unstable without the critic-aware actor learning algorithm
• More work is definitely needed for further improvement
26. Toward End-to-End Q&A
Rodrigo Nogueira & K Cho. Task-Oriented Query Reformulation with Reinforcement
Learning. EMNLP 2017.
Dunn et al. SearchQA: A New Q&A Dataset Augmented with Context from a Search
Engine. arXiv 2017.
28. Neural Query Reformulator
Neural Query Reformulator
1. Reads an original query q0
2. Augments/reformulates q0
Learning
1. Hard RL problem: partial observability
due to the black box search engine
2. Policy gradient to maximize recall@K
(Nogueira & Cho, 2017)
Code and data available at https://github.com/nyu-dl/QueryReformulator
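The reward maximized by the reformulator's policy gradient, recall@K, can be sketched against a toy overlap-based "search engine" (the real system queries a black-box engine; everything below is an illustrative stand-in):

```python
def search(query, documents, k):
    # Toy black-box "search engine": rank documents by term overlap.
    scores = [(len(set(query.split()) & set(doc.split())), i)
              for i, doc in enumerate(documents)]
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scores[:k]]

def recall_at_k(query, documents, relevant, k):
    # The reward signal the reformulation policy is trained to maximize.
    retrieved = set(search(query, documents, k))
    return len(retrieved & relevant) / len(relevant)

docs = ["paris is the capital of france",
        "the eiffel tower is in paris france",
        "berlin is the capital of germany",
        "rome is the capital of italy"]
relevant = {0, 1}

q0 = "capital"                 # original query q0
q1 = q0 + " france paris"      # one candidate reformulation of q0
print(recall_at_k(q0, docs, relevant, k=2),
      recall_at_k(q1, docs, relevant, k=2))  # reformulation raises recall
```

Because the engine is a black box, recall@K is non-differentiable in the query, which is why policy gradient is used rather than backpropagation through the search.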
29. SearchQA: new dataset
for machine comprehension
(Dunn et al., 2017)
Data available at https://github.com/nyu-dl/SearchQA
[Diagram: (Q, A) pairs → Search → Retrieve → Crawl → (Q, A, { S1, S2, . . . , SN })]
SearchQA
1. Realistic, noisy context from Google
2. Multiple snippets per question
3. Large-scale data (140k q-a-c tuples)
30. And, Google did it!
• A pretrained, black-box Q&A
model
• Query reformulation with RL
• Tested on SearchQA
(Buck et al., 2017)
31. Few more relevant research directions
• Communicating neural networks
• Neural nets talk to each other to solve a problem
• Sukhbaatar & Fergus (2015), Foerster et al. (2016), Evtimova et al. (2017), Lewis et al. (2017),
…
• Multimodal processing
• Image captioning, zero-shot retrieval, …
• Cho et al. (2015, review paper)
• Planning, program synthesis
• How do the modules compose with each other to solve a task?
• Neural programmer interpreter [Reed et al., 2016; Cai et al., 2017]
• Forward modelling [Henaff et al., 2017; Sutton, 1991 Dyna; optimal control…]
• Mixture of experts [Google], progressive networks [Google DeepMind]
35. • [Allen 1987 IEEE 1st ICNN]
• 3310 En-Es pairs constructed on 31
En, 40 Es words, max 10/11 word
sentence; 33 used as test set
• Binary encoding of words – 50
inputs, 66 outputs; 1 or 3 hidden
150-unit layers. Ave WER: 1.3
words
• [Chrisman 1992 Connection Science]
• Dual-ported RAAM architecture
[Pollack 1990 Artificial Intelligence]
applied to corpus of 216 parallel pairs
of simple En-Es sentences:
• Split 50/50 as train/test, 75% of
sentences correctly translated!
37. Modern neural machine translation
[Diagram: Source Sentence → Target Sentence under three designs — a Neural Net feeding into SMT (Schwenk et al., 2006), SMT with a Neural Net component (Devlin et al., 2014), and a single end-to-end Neural Network (Neural MT)]
42. What does NMT do?
Encoder
• Project a source sentence into a
set of continuous vectors
Decoder+Attention
• Decode a target sentence from a
set of “source” continuous
vectors
43. What is this “continuous vector space”?
• Similar sentences are near each other
in this vector space
• Multiple dimensions of similarity are
encoded simultaneously
(Sutskever et al., 2014)
44. What is this “continuous vector space”?
• Similar sentences are near each other
in this vector space
• Multiple dimensions of similarity are
encoded simultaneously
• (Trainable) near-bijective mapping
between the continuous vector space
and the sentence space
• Stripped of hard linguistic symbols
45. What is this “continuous vector space”?
(Firat et al., 2016; Luong et al., 2015; Dong et al., 2015)
• Can this continuous vector space be shared across multiple languages?
46. Multi-way, multilingual machine translation (1)
Language-agnostic
Continuous Vector
Space
• One encoder per source language
• One decoder per target language
• Attention/alignment shared across
all the language pairs
• Only bilingual parallel
corpora necessary
• No multi-way parallel corpus needed
(Firat et al., 2016)
47. Multi-way, multilingual machine translation (2)
• Neural nets are like lego
• Build one encoder per source
• Build one decoder per target
• Build one attention mechanism
• Given a sentence pair (x^m, y^n):
• pick the encoder for source language m and the decoder for target language n
• train them, with the shared attention, on this pair
(Firat et al., 2016)
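The "lego" routing above — N encoders and M decoders giving N×M translation directions through one shared attention — can be sketched as follows. The module internals are toy stand-ins; only the routing structure reflects the slide:

```python
def make_encoder(lang):
    def encode(sentence):
        # Stand-in for an RNN encoder: tag tokens so we can see which
        # language-specific module produced each annotation.
        return [f"{lang}:{tok}" for tok in sentence.split()]
    return encode

def shared_attention(annotations):
    # ONE attention/alignment module shared across all language pairs.
    return " ".join(annotations)

def make_decoder(lang):
    def decode(context):
        return f"[{lang}] {context}"
    return decode

langs = ("en", "fr", "de")
encoders = {lang: make_encoder(lang) for lang in langs}   # one per source
decoders = {lang: make_decoder(lang) for lang in langs}   # one per target

def translate(sentence, src, tgt):
    # Route through encoder[src] and decoder[tgt]; the attention in the
    # middle is the same object for every pair, so only bilingual
    # corpora are needed to train all 3x3 directions.
    return decoders[tgt](shared_attention(encoders[src](sentence)))

print(translate("good morning", "en", "fr"))
```

Three encoders plus three decoders here yield nine translation directions without any multi-way parallel corpus, which is the point of sharing the attention.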
48. Multi-way, multilingual machine translation (3)
Language-
agnostic
Continuous
Vector Space
• Sentence-level positive language transfer
• Helps low-resource language pairs
• Why?
1. Better structural constraint on the
continuous vector space
2. Regularization
• Real-valued vector-based interlingua?
(Firat et al., 2016)
49. Beyond languages: multimodal translation
• Does the source have to be “sentence”?
[Diagram: attention-based captioning — a Convolutional Neural Network produces annotation vectors h_j; the attention mechanism computes attention weights a_j with Σ_j a_j = 1; the recurrent state z_i samples words u_i, e.g. f = (a, man, is, jumping, into, a, lake, .)]
(Xu et al., 2015)
51. What is a sentence?
Is a sentence a sequence of phrases, words, morphemes or characters?
52. What is a sentence to a neural net?
• Each word/symbol: one-hot vector
• Prior-less encoding
• Permutation invariant
• Sentence
• To us: a sequence of words
• To NN: a sequence of one-hot vectors
• What does it mean?
53. Why not words?
• Inefficient handling of various morphological variants
• Sub-optimal segmentation/tokenization
• “Etxaberria”, “Etxazarra”, “Etxaguren”, “Etxarren”: four independent vectors
• Lack of generalization to novel/rare morphological variants
• For instance, in Arabic => “and to his vehicle”
• One vector for compound words?
• “kolmi/vaihe/kilo/watti/tunti/mittari” => one vector?
• “kolme” => one vector?
• Spelling issues
• See Workshop on Processing Historical Language or Universal Dependencies
• Good segmentation/tokenization needed for each language
• So, no, words don’t look like the units we want to work with…
54. Then, what should we do…?
• Original: 고양이가 침대 위에 누워있습니다 (“A cat is lying on the bed”)
• Word-level modelling:
(고양이가, 침대, 위에, 누워있습니다)
• Subword-level modelling (Sennrich et al., 2015; Wu et al., 2016)
(고양이, 가, 침대, 위, 에, 누워, 있습니, 다)
• Character-level modelling with segmentation
(Wang et al., 2015; Luong & Manning, 2016; Costa-Jussa & Fonollosa, 2016)
((ㄱ,ㅗ,ㅇ,ㅑ,ㅇ,ㅣ,ㄱ,ㅏ), (ㅊ,ㅣ,ㅁ,ㄷ,ㅐ), (ㅇ,ㅟ,ㅇ,ㅔ),
(ㄴ,ㅜ,ㅇ,ㅝ,ㅇ,ㅣ,ㅆ,ㅅ,ㅡ,ㅂ,ㄴ,ㅣ,ㄷ,ㅏ))
• Fully character-level modelling (Chung et al., 2016; Lee et al., 2017)
(ㄱ,ㅗ,ㅇ,ㅑ,ㅇ,ㅣ,ㄱ,ㅏ,_,ㅊ,ㅣ,ㅁ,ㄷ,ㅐ,_,ㅇ,ㅟ,ㅇ,ㅔ,_,ㄴ,ㅜ,ㅇ,ㅝ,ㅇ,ㅣ,ㅆ,ㅅ,ㅡ,ㅂ
,ㄴ,ㅣ,ㄷ,ㅏ))
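The word-level and fully character-level representations above can be produced mechanically. One sketch: Unicode NFD normalization decomposes each Hangul syllable into jamo (note it yields conjoining jamo, which may render differently from the compatibility jamo printed on the slide, but they are the same letters):

```python
import unicodedata

sentence = "고양이가 침대 위에 누워있습니다"

# Word-level modelling: whitespace tokenization.
words = sentence.split()

# Fully character-level modelling: decompose every syllable into jamo
# via Unicode NFD, with "_" marking the word boundaries, as in the slide.
jamo = unicodedata.normalize("NFD", sentence).replace(" ", "_")

print(words)
print(len(sentence), len(jamo))  # the jamo sequence is much longer
```

The length blow-up is exactly the modelling challenge: character-level models must handle much longer sequences in exchange for an open vocabulary.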
55. Character-level translation
• Source: subword-level representation
• Target: character-level representation
• The decoder implicitly learned word-like units automatically!
(Chung et al., 2017)
56. Fully Character-level translation
• Source: character-level representation
• Target: character-level representation
• Efficient modelling with
a convolutional-recurrent encoder
• Works as well as, or better than,
subword-level translation
(Lee et al., 2017)
57. (Lee et al., 2017)
• More robust to errors
• Better handles rare tokens
• Rare tokens are not necessarily rare!
58. Character-level Multilingual Translation
• When symbols are shared across multiple languages, why not share a
single encoder/decoder for them?
1. Language transfer at all levels: letters, words, phrases, sentences, …
2. Intra-sentence code-switching without any specific data
(Lee et al., 2017; Johnson et al., 2016; Ha et al., 2016)
60. Parametric ML: Learning as Compression
• What does learning do?
• Parametric machine learning: data compression + pattern matching
[Diagram: Training Data → (learning) → Neural Network → Inference]
61. Non-Parametric NMT (1)
• Bring the whole training corpus together with a model
• Retrieve a small subset of examples using a fast search engine
• Let NMT figure out how to fuse
1. the current sentence, and
2. the retrieved translation pairs
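The retrieve step can be sketched with a token-overlap scorer standing in for the Apache Lucene search over the training corpus; the fusion step (a key-value memory the decoder attends over) is left abstract. All data below is made up for illustration:

```python
def retrieve(source, corpus, k=2):
    # Stand-in for the fast search engine over the training corpus:
    # score each (source, target) training pair by token overlap with
    # the current source sentence, and return the top k.
    def overlap(pair):
        return len(set(source.split()) & set(pair[0].split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

# Toy training corpus of (source, target) pairs.
corpus = [
    ("the contract is signed", "le contrat est signé"),
    ("the contract is void", "le contrat est nul"),
    ("cats sleep all day", "les chats dorment toute la journée"),
]

source = "the contract is valid"
neighbours = retrieve(source, corpus)
# The retrieved pairs would be loaded into a key-value memory that the
# NMT decoder attends over, fusing them with the current sentence.
print(neighbours)
```

When the nearest neighbours are close to the input, the model can copy style and vocabulary from them; when they are not, attention over the memory can simply be ignored, reverting to normal NMT.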
62. Non-Parametric NMT (2)
• Apache Lucene: search engine
• A key-value memory network
[Gulcehre et al., 2017; Miller et al., 2016]
for storing retrieved pairs
• Similar to larger-context NMT
• [Wang et al., 2017;
Jean et al., 2017]
• Similar to NMT with external
knowledge
• [Ahn et al., 2016;
Bahdanau et al., 2017]
63. Non-Parametric NMT (3)
• When retrieved pairs are similar, huge
improvement!
• Otherwise, revert back to a normal NMT
• More consistency in style and vocabulary choice
64. Other advances in neural machine translation
• Discourse-level machine translation
• [Jean et al., 2017; DCU, 2017]
• Better decoding strategies
• Learning-to-search [Wiseman & Rush, 2016]
• Reinforcement learning [MRT, 2016; Ranzato et al., 2015; Bahdanau et al., 2015]
• Trainable decoding [Gu et al., 2017]
• Alternative decoding cost [Li et al., 2016; Li et al., 2017]
• Linguistics-guided neural machine translation
• Learning to parse and translate [Eriguchi et al., 2017; Rohee & Goldberg, 2017; Luong
et al., 2016]
• Syntax-aware neural machine translation [Nadejde et al., 2017]
65. Paradigm Shift: modular, life-long learning
[Diagram: neural networks interacting with each other, a Search Engine, and a Database]
Acknowledgement
• TenCent, eBay, Google, NVIDIA,
Facebook and NYU for generously
supporting my research and lab!
• Some of the works were sponsored
through industrial projects with
Samsung and NVIDIA!