Presented at Stanford AI event on 3/1/2018
1. How to Make Deep Reinforcement Learning Less Data Hungry
Pieter Abbeel
Embodied Intelligence: AI for robotic automation
UC Berkeley: AI research
Gradescope: AI for grading homework and exams
OpenAI: AI research / advisory role
4. Deep RL Success Stories
DQN Mnih et al, NIPS 2013 / Nature 2015
MCTS Guo et al, NIPS 2014; TRPO Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015; A3C Mnih et al, ICML 2016; Dueling DQN Wang et al, ICML 2016; Double DQN van Hasselt et al, AAAI 2016; Prioritized Experience Replay Schaul et al, ICLR 2016; Bootstrapped DQN Osband et al, 2016; Q-Ensembles Chen et al, 2017; Rainbow Hessel et al, 2017; …
5. Deep RL Success Stories
AlphaGo Silver et al, Nature 2016
AlphaGoZero Silver et al, Nature 2017
Tian et al, 2016; Maddison et al, 2014; Clark et al, 2015
AlphaZero Silver et al, 2017
6. Deep RL Success Stories
- Super-human agent on Dota 2 1v1, enabled by:
  - Reinforcement learning
  - Self-play
  - Enough computation
[OpenAI, 2017]
7. Deep RL Success Stories
TRPO Schulman et al, 2015 + GAE Schulman et al, 2016
See also: DDPG Lillicrap et al, 2015; SVG Heess et al, 2015; Q-Prop Gu et al, 2016; Scaling up ES Salimans et al, 2017; PPO Schulman et al, 2017; Parkour Heess et al, 2017
10. Mastery? Yes
Deep RL (DQN) vs. Human
DQN score: 18.9 | Human score: 9.3
11. But how good is the learning?
Deep RL (DQN) vs. Human
DQN score: 18.9; experience measured in real time: 40 days ("slow")
Human score: 9.3; experience measured in real time: 2 hours ("fast")
12. Humans vs. DDQN
Black dots: human play
Blue curve: mean of human play
Blue dashed line: "expert" human play
Red dashed lines: DDQN after 10, 25, 200M frames (~46, 115, 920 hours)
[Tsividis, Pouncy, Xu, Tenenbaum, Gershman, 2017]
Humans after 15 minutes tend to outperform DDQN after 115 hours
13. How to bridge this gap?
15. On-Policy vs. Off-Policy
- On-Policy Algorithms:
  - Can only perform updates based on data collected under the current policy
  - Policy gradients, A3C, TRPO
  - Atari: most wall-clock efficient
- Off-Policy Algorithms:
  - Can re-use all past data for updates ("experience replay")
  - Q-learning, DQN, DDPG
  - Atari: most sample efficient

Off-policy Q-learning minimizes the squared Bellman error over replayed transitions:

$$\sum_{k \,\in\, \text{replay buffer}} \Big( Q_\theta(s_k, a_k) - \big( r_k + \max_a Q_{\theta_\mathrm{OLD}}(s_{k+1}, a) \big) \Big)^2$$
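To make the off-policy update concrete, here is a minimal tabular sketch in Python; the names `q`, `q_old`, and `replay_buffer` are illustrative, and deep RL replaces the table with a neural network:

```python
import random
import numpy as np

n_states, n_actions = 10, 4
q = np.zeros((n_states, n_actions))    # Q_theta: the table being updated
replay_buffer = []                     # stores (s, a, r, s_next) transitions
alpha, gamma = 0.1, 0.99               # learning rate and discount

def replay_update(batch_size=32):
    """One off-policy update: sample past transitions (from any old policy)
    and regress Q(s, a) toward the target r + gamma * max_a' Q_OLD(s', a')."""
    q_old = q.copy()                   # frozen copy plays the role of Q_theta_OLD
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    for s, a, r, s_next in batch:
        target = r + gamma * q_old[s_next].max()
        q[s, a] += alpha * (target - q[s, a])   # step on the squared error
```

Because the target uses a max over actions rather than the action the behavior policy took, any stored transition remains a valid training example, which is exactly what makes the method sample efficient.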
17. Hindsight Experience Replay
[Andrychowicz et al, 2017] [Related work: Cabi et al, 2017]
- Main idea: get a reward signal from any experience by simply assuming the goal equals whatever happened
- Representation learned: universal Q-function Q(s, g, a)
- Replay buffer stores actual experience: (s_t, g_t, a_t, r_t, s_{t+1})
- Experience replay:

$$\sum_{k \,\in\, \text{replay buffer}} \Big( Q_\theta(s_k, g_k, a_k) - \big( r_k + \max_a Q_{\theta_\mathrm{OLD}}(s_{k+1}, g_k, a) \big) \Big)^2$$

- Hindsight replay, with relabeled goal g'_k and recomputed reward r'_k:

$$\sum_{k \,\in\, \text{replay buffer}} \Big( Q_\theta(s_k, g'_k, a_k) - \big( r'_k + \max_a Q_{\theta_\mathrm{OLD}}(s_{k+1}, g'_k, a) \big) \Big)^2$$
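A minimal sketch of the relabeling that produces the hindsight term above (Python; the sparse `reward_fn` and the "final"-goal strategy are illustrative assumptions in the spirit of the paper, not its exact code):

```python
def reward_fn(state, goal):
    # Illustrative sparse goal-reaching reward: 1 only when the goal is achieved.
    return 1.0 if state == goal else 0.0

def hindsight_relabel(episode):
    """episode: list of goal-conditioned transitions (s, g, a, r, s_next).
    Adds relabeled copies whose goal g' is a state actually reached later,
    with the reward r' recomputed, so even failed episodes carry signal."""
    final_state = episode[-1][4]        # 'final' strategy: last achieved state
    relabeled = []
    for s, g, a, r, s_next in episode:
        g_new = final_state
        r_new = reward_fn(s_next, g_new)
        relabeled.append((s, g_new, a, r_new, s_next))
    return episode + relabeled          # both go into the replay buffer
```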
20. Outline
- Hindsight Experience Replay
- Meta-learning: learn to learn faster
21. Starting Observations
- TRPO, DQN, A3C, DDPG, PPO, Rainbow, … are fully general RL algorithms
  - i.e., for any environment that can be mathematically defined, these algorithms are equally applicable
- Environments encountered in the real world = a tiny, tiny subset of all environments that could be defined (e.g., they all satisfy our universe's physics)
Can we develop "fast" RL algorithms that take advantage of this?
24. Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Diagram: for each environment (A, B, …), a separate RL agent (RL algorithm + policy) exchanges actions, states, and rewards with its environment]
25. Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Diagram as on the previous slide, now with actions, observations, and rewards]
Traditional RL research:
- Human experts develop the RL algorithm
- After many years, still no RL algorithms nearly as good as humans…
Alternative:
- Could we learn a better RL algorithm?
- Or even learn a better entire agent?
26. Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
"Fastā RL
Agent
Environment A
Meta RL
Algorithm
Environment B
ā¦
Meta-training environments
Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
27. Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
[Diagram: the meta-trained "fast" RL agent, which exchanges actions, rewards, and observations with its environment, is then evaluated on held-out testing environments (F, G, H, …)]
30. Formalizing Learning to Reinforcement Learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]

$$\max_\theta \; \mathbb{E}_M \, \mathbb{E}_{\tau^{(k)}_M} \left[ \sum_{k=1}^{K} R\big(\tau^{(k)}_M\big) \;\Big|\; \mathrm{RLagent}_\theta \right]$$
31. Formalizing Learning to Reinforcement Learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]

$$\max_\theta \; \mathbb{E}_M \, \mathbb{E}_{\tau^{(k)}_M} \left[ \sum_{k=1}^{K} R\big(\tau^{(k)}_M\big) \;\Big|\; \mathrm{RLagent}_\theta \right]$$

where $M$ is a sampled MDP and $\tau^{(k)}_M$ is the $k$-th trajectory in MDP $M$.

Meta-train:

$$\max_\theta \sum_{M \in \mathcal{M}_\mathrm{train}} \mathbb{E}_{\tau^{(k)}_M} \left[ \sum_{k=1}^{K} R\big(\tau^{(k)}_M\big) \;\Big|\; \mathrm{RLagent}_\theta \right]$$
32. Representing RLagent_θ
- RLagent = RNN = generic computation architecture
  - different weights in the RNN mean a different RL algorithm and prior
  - different activations in the RNN mean a different current policy
- the meta-train objective can be optimized with an existing (slow) RL algorithm

$$\max_\theta \sum_{M \in \mathcal{M}_\mathrm{train}} \mathbb{E}_{\tau^{(k)}_M} \left[ \sum_{k=1}^{K} R\big(\tau^{(k)}_M\big) \;\Big|\; \mathrm{RLagent}_\theta \right]$$

[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
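To make this representation concrete, here is a minimal RL²-style sketch of such an agent, assuming PyTorch; the dimensions, one-hot action encoding, and `step` interface are illustrative choices, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class RL2Agent(nn.Module):
    """RNN agent whose input at each step is (obs, prev_action, prev_reward,
    done). The weights can encode an RL algorithm and prior; the activations
    encode the current policy."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.gru = nn.GRUCell(obs_dim + n_actions + 2, hidden)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.n_actions = n_actions

    def step(self, obs, prev_action, prev_reward, done, h):
        # Pass h=None at the start of a trial; GRUCell then uses zeros.
        a_onehot = torch.zeros(1, self.n_actions)
        a_onehot[0, prev_action] = 1.0
        x = torch.cat([obs.view(1, -1), a_onehot,
                       torch.tensor([[prev_reward, float(done)]])], dim=1)
        h = self.gru(x, h)              # memory carried across episodes
        logits = self.policy_head(h)
        action = torch.distributions.Categorical(logits=logits).sample()
        return action.item(), h
```

Because the hidden state is carried across episode boundaries within a trial (and only reset between MDPs), learning within a trial can happen purely in the activations, with no weight updates at test time.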
33. Evaluation: Multi-Armed Bandits
- Multi-armed bandit setting
- Each bandit has its own distribution over pay-outs
- Each episode = choose 1 bandit
- A good RL agent should explore bandits sufficiently, yet also exploit the good/best ones
- Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, …
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
34. Evaluation: Multi-Armed Bandits
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
We consider the Bayesian evaluation setting. Some of these prior works also have adversarial guarantees, which we don't consider here.
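For reference, one of the human-designed baselines named above, Thompson sampling for Bernoulli bandits, fits in a few lines (Python; the Beta(1, 1) prior is the standard textbook choice, and the function names are illustrative):

```python
import numpy as np

def thompson_sampling(arm_probs, n_steps=1000, seed=0):
    """Bernoulli bandit with a Beta(1, 1) prior per arm: sample a plausible
    mean for each arm from its posterior, pull the argmax, update counts."""
    rng = np.random.default_rng(seed)
    n_arms = len(arm_probs)
    successes = np.ones(n_arms)   # Beta posterior alpha per arm
    failures = np.ones(n_arms)    # Beta posterior beta per arm
    total_reward = 0.0
    for _ in range(n_steps):
        theta = rng.beta(successes, failures)    # posterior samples
        arm = int(np.argmax(theta))              # act on the sampled belief
        reward = float(rng.random() < arm_probs[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

# e.g. thompson_sampling([0.2, 0.5, 0.8]) concentrates pulls on the 0.8 arm
```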
35. Evaluation: Visual Navigation
[Panels: agent's view | maze map]
Agent input: current image
Agent action: straight / 2 degrees left / 2 degrees right
Map just shown for our purposes, but not available to agent
Related work: Mirowski et al, 2016; Jaderberg et al, 2016; Mnih et al, 2016; Wang et al, 2016
36. Evaluation: Visual Navigation
[Videos: before learning-to-learn | after learning-to-learn]
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
Agent input: current image
Agent action: straight / 2 degrees left / 2 degrees right
Map just shown for our purposes, but not available to agent
37. Other Meta RL Architectures
- Simple Neural Attentive Meta-Learner (SNAIL) [Mishra, Rohaninejad, Chen, Abbeel, 2017]
- Model-Agnostic Meta-Learning (MAML) [Finn, Abbeel, Levine, 2017; Finn*, Yu*, Zhang, Abbeel, Levine, 2017]
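Since MAML is gradient-based rather than RNN-based, a minimal sketch of its inner/outer loop may help (PyTorch; the `tasks`/`loss_fn` interface is an illustrative assumption, and the first-order treatment of the meta-gradient is a simplification of the full MAML update):

```python
import torch

def maml_step(model_params, tasks, loss_fn, inner_lr=0.01, outer_lr=0.001):
    """One (first-order) MAML meta-update: adapt a copy of the parameters
    with a gradient step per task, then move the shared initialization
    toward parameters that do well *after* adaptation."""
    meta_grads = [torch.zeros_like(p) for p in model_params]
    for task in tasks:
        # Inner loop: one adaptation step on the task's training batch.
        train_loss = loss_fn(model_params, task["train"])
        grads = torch.autograd.grad(train_loss, model_params)
        adapted = [p - inner_lr * g for p, g in zip(model_params, grads)]
        # Outer objective: post-adaptation loss on the task's test batch.
        test_loss = loss_fn(adapted, task["test"])
        task_grads = torch.autograd.grad(test_loss, adapted)
        for mg, tg in zip(meta_grads, task_grads):
            mg += tg / len(tasks)
    # Meta-update of the initialization (first-order approximation).
    with torch.no_grad():
        for p, mg in zip(model_params, meta_grads):
            p -= outer_lr * mg
    return model_params
```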
40. Want to learn more?
- Deep RL Bootcamp [intense long weekend]
  - 12 lectures by: Pieter Abbeel, Rocky Duan, Peter Chen, John Schulman, Vlad Mnih, Chelsea Finn, Sergey Levine
  - 4 hands-on labs
  - https://sites.google.com/view/deep-rl-bootcamp/
- Deep RL Course [full semester]
  - Originated by John Schulman, Sergey Levine, Chelsea Finn
  - Latest offering (by Sergey Levine): http://rll.berkeley.edu/deeprlcourse/
41. Many Exciting Directions
- Fast Reinforcement Learning
- Long Horizon Reasoning
- Taskability
- Lifelong Learning
- Leverage Simulation
- Maximize Signal Extracted
- Safe Learning
- Value Alignment
- Planning + Learning
- …