How to Make Deep Reinforcement Learning less Data Hungry
Pieter Abbeel
Embodied Intelligence
UC Berkeley
Gradescope
OpenAI
AI for robotic automation
AI research
AI for grading homework and exams
AI research / advisory role
Reinforcement Learning (RL)
Robot +
Environment
max_θ E[ Σ_{t=0}^H R(s_t) | π_θ ]
[Image credit: Sutton and Barto 1998] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
Reinforcement Learning (RL)
Robot +
Environment
Compared to supervised learning, additional challenges:
n Credit assignment
n Stability
n Exploration
[Image credit: Sutton and Barto 1998]
max_θ E[ Σ_{t=0}^H R(s_t) | π_θ ]
Deep RL Success Stories
DQN Mnih et al, NIPS 2013 / Nature 2015
MCTS Guo et al, NIPS 2014; TRPO Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015; A3C Mnih et al,
ICML 2016; Dueling DQN Wang et al ICML 2016; Double DQN van Hasselt et al, AAAI 2016; Prioritized
Experience Replay Schaul et al, ICLR 2016; Bootstrapped DQN Osband et al, 2016; Q-Ensembles Chen et al,
2017; Rainbow Hessel et al, 2017; …
Deep RL Success Stories
AlphaGo Silver et al, Nature 2015
AlphaGoZero Silver et al, Nature 2017
Tian et al, 2016; Maddison et al, 2014; Clark et al, 2015
AlphaZero Silver et al, 2017
n Super-human agent on Dota2 1v1, enabled by
n Reinforcement learning
n Self-play
n Enough computation
Deep RL Success Stories
[OpenAI, 2017]
Deep RL Success Stories
TRPO Schulman et al, 2015 + GAE Schulman et al, 2016
See also: DDPG Lillicrap et al 2015; SVG Heess et al, 2015; Q-Prop Gu et al, 2016; Scaling up ES Salimans et
al, 2017; PPO Schulman et al, 2017; Parkour Heess et al, 2017;
Deep RL Success Stories
Guided Policy Search Levine*, Finn*, Darrell, Abbeel, JMLR 2016
[Geng*, Zhang*, Bruce*, Caluwaerts, Vespignani, Sunspiral, Abbeel, Levine, ICRA 2017]
Mastery? Yes
Deep RL (DQN) vs. Human
Score: 18.9 Score: 9.3
But how good is the learning?
Deep RL (DQN) vs. Human
Score: 18.9 Score: 9.3
#Experience measured in real time: 40 days
#Experience measured in real time: 2 hours
“Slow” “Fast”
Humans vs. DDQN
Black dots: human play
Blue curve: mean of human play
Blue dashed line: ‘expert’ human play
Red dashed lines:
DDQN after 10, 25, 200M frames
(~46, 115, 920 hours)
[Tsividis, Pouncy, Xu, Tenenbaum, Gershman, 2017]
Humans after 15 minutes tend to outperform DDQN after 115 hours
How to bridge this gap?
Outline
n Hindsight Experience Replay
n Meta-learning – learn-to-learn faster
n On-Policy Algorithms:
n Can only perform update based on data collected under current policy
n Policy gradients, A3C, TRPO
n Atari: Most wall-clock efficient
n Off-Policy Algorithms:
n Can re-use all past data for updates (“experience replay”)
n Q learning, DQN, DDPG
n Atari: Most sample efficient
On-Policy vs. Off-Policy
Σ_{k from replay buffer} ( Q_θ(s_k, a_k) − (r_k + max_a Q_{θ_OLD}(s_{k+1}, a)) )²
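The off-policy objective above can be sketched in a few lines with a tabular Q-function. A minimal illustration, not the DQN implementation (the tabular setting, function name, and discounting are assumptions for this sketch):

```python
import random

import numpy as np

def q_replay_update(Q, Q_old, replay_buffer, batch_size, alpha=0.1, gamma=0.99):
    """One off-policy update: sample past transitions from the replay buffer
    and move Q_theta toward the target r_k + gamma * max_a Q_theta_OLD(s_{k+1}, a)."""
    for (s, a, r, s_next, done) in random.sample(replay_buffer, batch_size):
        target = r if done else r + gamma * np.max(Q_old[s_next])
        # gradient step on the squared error (Q(s_k, a_k) - target)^2
        Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Because the target uses `max_a` under a frozen `Q_old`, any past transition can be replayed, regardless of which policy collected it; that is what makes the method off-policy.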
But Signal?
n Main idea: Get reward signal from any experience by simply assuming the goal equals whatever happened
n Representation learned: Universal Q-function Q(s, g, a)
n Replay buffer:
n Actual experience:
n Experience replay:
n Hindsight replay:
Hindsight Experience Replay
[Andrychowicz et al, 2017] [Related work: Cabi et al, 2017]
(s_t, g_t, a_t, r_t, s_{t+1})
Σ_{k from replay buffer} ( Q_θ(s_k, g_k, a_k) − (r_k + max_a Q_{θ_OLD}(s_{k+1}, g_k, a)) )²
Σ_{k from replay buffer} ( Q_θ(s_k, g'_k, a_k) − (r'_k + max_a Q_{θ_OLD}(s_{k+1}, g'_k, a)) )²
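The hindsight relabeling step (the g'_k, r'_k above) can be sketched as follows. This is a minimal illustration; the function name and the goal-sampling strategy are assumptions here (Andrychowicz et al, 2017 describe several strategies, e.g. "final" and "future"):

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """Hindsight relabeling: alongside each stored transition (s, g, a, r, s'),
    store copies whose goal g' is a state actually reached later in the episode,
    with the reward recomputed for that substituted goal."""
    out = []
    for t, (s, g, a, r, s_next) in enumerate(episode):
        out.append((s, g, a, r, s_next))              # original experience
        future = episode[t:]                          # transitions reached afterwards
        for (_, _, _, _, achieved) in random.sample(future, min(k, len(future))):
            r_prime = reward_fn(s_next, achieved)     # reward w.r.t. hindsight goal g'
            out.append((s, achieved, a, r_prime, s_next))
    return out
```

Even when every original transition carries the sparse "failure" reward, the relabeled copies contain successes by construction, which is where the learning signal comes from.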
Hindsight Experience Replay: Pushing, Sliding, Pick-and-Place
[Andrychowicz, Wolski, Ray, Schneider, Fong, Welinder, McGrew, Tobin, Abbeel, Zaremba, NIPS 2017]
Quantitative Evaluation
~3 s/episode → 1.5 days for 50 epochs, 6 days for 200 epochs
Outline
n Hindsight Experience Replay
n Meta-learning – learn-to-learn faster
Starting Observations
n TRPO, DQN, A3C, DDPG, PPO, Rainbow, … are fully general RL
algorithms
n i.e., for any environment that can be mathematically defined, these
algorithms are equally applicable
n Environments encountered in real world
= tiny, tiny subset of all environments that could be defined
(e.g. they all satisfy our universe's physics)
Can we develop “fast” RL algorithms that take advantage of this?
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
Agent
Environment
(actions a, states s, rewards r)
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
RL Algorithm
Policy
Environment
(actions a, observations o, rewards r)
RL Agent
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
RL Algorithm
Policy
Environment A
…
RL Agent
RL Algorithm
Policy
Environment B
RL Agent
Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
…
RL Algorithm
Policy
Environment A
RL Algorithm
Policy
Environment B
RL Agent RL Agent
Traditional RL research:
• Human experts develop the RL algorithm
• After many years, still no RL algorithms nearly as good as humans…
Alternative:
• Could we learn a better RL algorithm?
• Or even learn a better entire agent?
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
“Fast” RL
Agent
Environment A
Meta RL
Algorithm
Environment B
…
Meta-training environments
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
“Fast” RL
Agent
Environment A
(actions a, rewards r, observations o)
Meta RL
Algorithm
Environment B
…
Environment F
Meta-training environments
Testing environments
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
“Fast” RL
Agent
Environment A
Meta RL
Algorithm
Environment B
…
Environment G
Meta-training environments
Testing environments
Meta-Reinforcement Learning
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
“Fast” RL
Agent
Environment A
Meta RL
Algorithm
Environment B
…
Environment H
Meta-training environments
Testing environments
Formalizing Learning to Reinforcement Learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
max_θ E_M E_{τ_M^(k)} [ Σ_{k=1}^K R(τ_M^(k)) | RLagent_θ ]
Formalizing Learning to Reinforcement Learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
M: sample MDP
τ_M^(k): k'th trajectory in MDP M
max_θ E_M E_{τ_M^(k)} [ Σ_{k=1}^K R(τ_M^(k)) | RLagent_θ ]
Meta-train:
max_θ Σ_{M ∈ M_train} E_{τ_M^(k)} [ Σ_{k=1}^K R(τ_M^(k)) | RLagent_θ ]
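The meta-train objective can be estimated by Monte Carlo: for each training MDP, give the agent a fresh start (same weights θ, reset memory), roll out K episodes, and sum the returns. A schematic sketch, where `agent_factory` and `run_episode` are assumed helpers, not from the paper:

```python
def meta_objective(agent_factory, run_episode, train_mdps, K):
    """Estimate sum over M in M_train of E[ sum_{k=1..K} R(tau_M^(k)) | RLagent_theta ].
    agent_factory() returns a fresh agent (same weights theta, reset memory);
    run_episode(agent, mdp) rolls out one episode and returns its total reward."""
    total = 0.0
    for mdp in train_mdps:
        agent = agent_factory()  # reset fast within-MDP state; keep slow weights
        total += sum(run_episode(agent, mdp) for _ in range(K))
    return total
```

Since the sum runs over all K episodes in each MDP, an agent that improves quickly within an MDP scores higher, which is exactly the "fast learner" behavior being optimized for.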
n RLagent = RNN = generic computation architecture
n different weights in the RNN means different RL algorithm and prior
n different activations in the RNN means different current policy
n meta-train objective can be optimized with an existing (slow) RL algorithm
Representing RLagentāœ“
max_θ Σ_{M ∈ M_train} E_{τ_M^(k)} [ Σ_{k=1}^K R(τ_M^(k)) | RLagent_θ ]
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
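A toy sketch of such an RNN agent (the architecture details here are illustrative assumptions, not the exact network from the paper): the input at each step is (observation, previous action, previous reward); the weights play the role of the RL algorithm, and the hidden state carries what has been learned so far within the current MDP:

```python
import numpy as np

class RNNAgent:
    """Recurrent agent: weights = the (slow) learning algorithm and prior,
    hidden-state activations = the (fast) within-MDP learning / current policy."""
    def __init__(self, n_obs, n_act, n_hid=16, seed=0):
        rng = np.random.default_rng(seed)
        self.n_act = n_act
        n_in = n_obs + n_act + 1  # observation + one-hot previous action + previous reward
        self.W_in = rng.normal(0.0, 0.1, (n_hid, n_in))
        self.W_h = rng.normal(0.0, 0.1, (n_hid, n_hid))
        self.W_out = rng.normal(0.0, 0.1, (n_act, n_hid))
        self.h = np.zeros(n_hid)  # reset at the start of each new MDP

    def act(self, obs, prev_action, prev_reward):
        x = np.concatenate([obs, np.eye(self.n_act)[prev_action], [prev_reward]])
        self.h = np.tanh(self.W_in @ x + self.W_h @ self.h)  # hidden-state update
        return int(np.argmax(self.W_out @ self.h))           # greedy action from logits
```

Feeding rewards back in is the key design choice: it lets the hidden state integrate experience, so the same weights can behave differently in different MDPs.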
Evaluation: Multi-Armed Bandits
n Multi-Armed Bandits setting
n Each bandit has its own distribution over pay-outs
n Each episode = choose 1 bandit
n Good RL agent should explore bandits sufficiently, yet also exploit the good/best ones
n Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, …
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
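One of the human-designed bandit algorithms mentioned above, UCB1, fits in a few lines; a minimal sketch (the `pull` interface is an assumption for illustration):

```python
import math

def ucb1(pull, n_arms, horizon):
    """UCB1: pick the arm maximizing empirical mean + exploration bonus
    sqrt(2 ln t / n_a); the bonus shrinks as an arm is pulled more often."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(horizon):
        if t < n_arms:
            arm = t  # initialization: pull each arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = pull(arm)  # observe a reward in [0, 1]
        counts[arm] += 1
        sums[arm] += r
    return counts
```

The exploration bonus handles the explore/exploit trade-off the slide describes: rarely pulled arms keep a large bonus until they have been tried enough to rule them out.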
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
Evaluation: Multi-Armed Bandits
We consider the Bayesian evaluation setting. Some of these prior works also have adversarial guarantees, which we don't consider here.
Evaluation: Visual Navigation
Agent's view / Maze
Agent input: current image
Agent action: straight / 2 degrees left / 2 degrees right
Map just shown for our purposes, but not available to agent
Related work: Mirowski, et al, 2016; Jaderberg et al, 2016; Mnih et al, 2016; Wang et al, 2016
Evaluation: Visual Navigation
Before learning-to-learn / After learning-to-learn
[Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
Agent input: current image
Agent action: straight / 2 degrees left / 2 degrees right
Map just shown for our purposes, but not available to agent
n Simple Neural Attentive Meta-Learner (SNAIL)
[Mishra, Rohaninejad, Chen, Abbeel, 2017]
n Model-Agnostic Meta-Learning (MAML)
[Finn, Abbeel, Levine, 2017; Finn*, Yu*, Zhang, Abbeel, Levine 2017]
Other Meta RL Architectures
n Hindsight Experience Replay
n Meta-learning – learn-to-learn faster
Summary
n OpenAI Gym: https://gym.openai.com
n OpenAI baselines: https://github.com/openai/baselines
n rllab: https://github.com/rll/rllab
Want to Try Things Out?
n Deep RL Bootcamp [intense long weekend]
n 12 lectures by: Pieter Abbeel, Rocky Duan, Peter Chen, John Schulman,
Vlad Mnih, Chelsea Finn, Sergey Levine
n 4 hands-on labs
n https://sites.google.com/view/deep-rl-bootcamp/
n Deep RL Course [full semester]
n Originated by John Schulman, Sergey Levine, Chelsea Finn
n Latest offering (by Sergey Levine):
http://rll.berkeley.edu/deeprlcourse/
Want to learn more?
n Fast Reinforcement Learning
n Long Horizon Reasoning
n Taskability
n Lifelong Learning
n Leverage Simulation
n Maximize Signal Extracted
Many Exciting Directions
n Safe Learning
n Value Alignment
n Planning + Learning
n ā€¦
Thank you
@pabbeel
pabbeel@cs.berkeley.edu

More Related Content

What's hot

What's hot (6)

RegressĆ£o Espacial
RegressĆ£o EspacialRegressĆ£o Espacial
RegressĆ£o Espacial
Ā 
QGIS(v2.2)初ē“šē·Øć€€ć•ć‚ć£ć¦ćæ悈恆QGIS
QGIS(v2.2)初ē“šē·Øć€€ć•ć‚ć£ć¦ćæ悈恆QGISQGIS(v2.2)初ē“šē·Øć€€ć•ć‚ć£ć¦ćæ悈恆QGIS
QGIS(v2.2)初ē“šē·Øć€€ć•ć‚ć£ć¦ćæ悈恆QGIS
Ā 
[ź³µź°„ģ •ė³“ģ‹œģŠ¤ķ…œ ź°œė” ] L04 ķ•­ź³µģ‚¬ģ§„ģ˜ ģ“ķ•“
[ź³µź°„ģ •ė³“ģ‹œģŠ¤ķ…œ ź°œė” ] L04 ķ•­ź³µģ‚¬ģ§„ģ˜ ģ“ķ•“[ź³µź°„ģ •ė³“ģ‹œģŠ¤ķ…œ ź°œė” ] L04 ķ•­ź³µģ‚¬ģ§„ģ˜ ģ“ķ•“
[ź³µź°„ģ •ė³“ģ‹œģŠ¤ķ…œ ź°œė” ] L04 ķ•­ź³µģ‚¬ģ§„ģ˜ ģ“ķ•“
Ā 
SystĆØme d'information gĆ©ographique
SystĆØme d'information gĆ©ographiqueSystĆØme d'information gĆ©ographique
SystĆØme d'information gĆ©ographique
Ā 
GNSS: positionnement relatif statique
GNSS: positionnement relatif statiqueGNSS: positionnement relatif statique
GNSS: positionnement relatif statique
Ā 
Estatistica completo revisado
Estatistica completo revisadoEstatistica completo revisado
Estatistica completo revisado
Ā 

Similar to Deep Reinforcement Learning by Pieter

ā€œFrom Inference to Action: AI Beyond Pattern Recognition,ā€ a Keynote Presenta...
ā€œFrom Inference to Action: AI Beyond Pattern Recognition,ā€ a Keynote Presenta...ā€œFrom Inference to Action: AI Beyond Pattern Recognition,ā€ a Keynote Presenta...
ā€œFrom Inference to Action: AI Beyond Pattern Recognition,ā€ a Keynote Presenta...
Edge AI and Vision Alliance
Ā 
iccv2009 tutorial: boosting and random forest - part III
iccv2009 tutorial: boosting and random forest - part IIIiccv2009 tutorial: boosting and random forest - part III
iccv2009 tutorial: boosting and random forest - part III
zukun
Ā 
2014 toronto-torbug
2014 toronto-torbug2014 toronto-torbug
2014 toronto-torbug
c.titus.brown
Ā 
Ensemble Learning Featuring the Netflix Prize Competition and ...
Ensemble Learning Featuring the Netflix Prize Competition and ...Ensemble Learning Featuring the Netflix Prize Competition and ...
Ensemble Learning Featuring the Netflix Prize Competition and ...
butest
Ā 
Explainable AI
Explainable AIExplainable AI
Explainable AI
Arithmer Inc.
Ā 
Unsupervised Learning and Image Classification in High Performance Computing ...
Unsupervised Learning and Image Classification in High Performance Computing ...Unsupervised Learning and Image Classification in High Performance Computing ...
Unsupervised Learning and Image Classification in High Performance Computing ...
HPCC Systems
Ā 

Similar to Deep Reinforcement Learning by Pieter (20)

ā€œFrom Inference to Action: AI Beyond Pattern Recognition,ā€ a Keynote Presenta...
ā€œFrom Inference to Action: AI Beyond Pattern Recognition,ā€ a Keynote Presenta...ā€œFrom Inference to Action: AI Beyond Pattern Recognition,ā€ a Keynote Presenta...
ā€œFrom Inference to Action: AI Beyond Pattern Recognition,ā€ a Keynote Presenta...
Ā 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Ā 
Paper reading best of both world
Paper reading best of both worldPaper reading best of both world
Paper reading best of both world
Ā 
iccv2009 tutorial: boosting and random forest - part III
iccv2009 tutorial: boosting and random forest - part IIIiccv2009 tutorial: boosting and random forest - part III
iccv2009 tutorial: boosting and random forest - part III
Ā 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
Ā 
Maximally Invariant Data Perturbation as Explanation
Maximally Invariant Data Perturbation as ExplanationMaximally Invariant Data Perturbation as Explanation
Maximally Invariant Data Perturbation as Explanation
Ā 
Crystallization classification semisupervised
Crystallization classification semisupervisedCrystallization classification semisupervised
Crystallization classification semisupervised
Ā 
2014 toronto-torbug
2014 toronto-torbug2014 toronto-torbug
2014 toronto-torbug
Ā 
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
[SIGIR17] Learning to Rank Using Localized Geometric Mean Metrics
Ā 
Ensemble Learning Featuring the Netflix Prize Competition and ...
Ensemble Learning Featuring the Netflix Prize Competition and ...Ensemble Learning Featuring the Netflix Prize Competition and ...
Ensemble Learning Featuring the Netflix Prize Competition and ...
Ā 
Explainable AI
Explainable AIExplainable AI
Explainable AI
Ā 
Using Simulation to Investigate Requirements Prioritization Strategies
Using Simulation to Investigate Requirements Prioritization StrategiesUsing Simulation to Investigate Requirements Prioritization Strategies
Using Simulation to Investigate Requirements Prioritization Strategies
Ā 
Star Master Cocomo07
Star Master Cocomo07Star Master Cocomo07
Star Master Cocomo07
Ā 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
Ā 
Clusterix at VDS 2016
Clusterix at VDS 2016Clusterix at VDS 2016
Clusterix at VDS 2016
Ā 
assia2015sakai
assia2015sakaiassia2015sakai
assia2015sakai
Ā 
Cross-validation estimate of the number of clusters in a network
Cross-validation estimate of the number of clusters in a networkCross-validation estimate of the number of clusters in a network
Cross-validation estimate of the number of clusters in a network
Ā 
Unsupervised Learning and Image Classification in High Performance Computing ...
Unsupervised Learning and Image Classification in High Performance Computing ...Unsupervised Learning and Image Classification in High Performance Computing ...
Unsupervised Learning and Image Classification in High Performance Computing ...
Ā 
Step zhedong
Step zhedongStep zhedong
Step zhedong
Ā 
DeepLearning
DeepLearningDeepLearning
DeepLearning
Ā 

More from Bill Liu

More from Bill Liu (20)

Walk Through a Real World ML Production Project
Walk Through a Real World ML Production ProjectWalk Through a Real World ML Production Project
Walk Through a Real World ML Production Project
Ā 
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...Redefining MLOps with Model Deployment, Management and Observability in Produ...
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Ā 
Productizing Machine Learning at the Edge
Productizing Machine Learning at the EdgeProductizing Machine Learning at the Edge
Productizing Machine Learning at the Edge
Ā 
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
Ā 
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps Workflows
Ā 
Metaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at NetflixMetaflow: The ML Infrastructure at Netflix
Metaflow: The ML Infrastructure at Netflix
Ā 
Practical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at ScalePractical Crowdsourcing for ML at Scale
Practical Crowdsourcing for ML at Scale
Ā 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Ā 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
Ā 
Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19Big Data and AI in Fighting Against COVID-19
Big Data and AI in Fighting Against COVID-19
Ā 
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world ApplicationsHighly-scalable Reinforcement Learning RLlib for Real-world Applications
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Ā 
Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...Build computer vision models to perform object detection and classification w...
Build computer vision models to perform object detection and classification w...
Ā 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
Ā 
Weekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on MobileWeekly #106: Deep Learning on Mobile
Weekly #106: Deep Learning on Mobile
Ā 
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine LearningWeekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
Ā 
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with MicroeconomicsAISF19 - On Blending Machine Learning with Microeconomics
AISF19 - On Blending Machine Learning with Microeconomics
Ā 
AISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First WorldAISF19 - Travel in the AI-First World
AISF19 - Travel in the AI-First World
Ā 
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the EdgeAISF19 - Unleash Computer Vision at the Edge
AISF19 - Unleash Computer Vision at the Edge
Ā 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
Ā 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
Ā 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(ā˜Žļø+971_581248768%)**%*]'#abortion pills for sale in dubai@
Ā 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
Ā 

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
Ā 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
Ā 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
Ā 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Ā 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
Ā 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
Ā 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Ā 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
Ā 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Ā 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
Ā 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
Ā 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
Ā 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Ā 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
Ā 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
Ā 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
Ā 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Ā 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
Ā 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Ā 
Scaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organizationScaling API-first ā€“ The story of a global engineering organization
Scaling API-first ā€“ The story of a global engineering organization
Ā 

Deep Reinforcement Learning by Pieter

  • 1. How to Make Deep Reinforcement Learning less Data Hungry Pieter Abbeel Embodied Intelligence UC Berkeley Gradescope OpenAI AI for robo<c automa<on AI research AI for grading homework and exams AI research / advisory role
  • 2. Reinforcement Learning (RL) Robot + Environment max āœ“ E[ HX t=0 R(st)|ā‡”āœ“] [Image credit: Sutton and Barto 1998] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 3. Reinforcement Learning (RL) Robot + Environment Compared ton supervised learning, addi3onal challenges: Credit assignmentn Stabilityn Explora3onn [Image credit: Sutton and Barto 1998] max āœ“ E[ HX t=0 R(st)|ā‡”āœ“] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 4. Deep RL Success Stories DQN Mnih et al, NIPS 2013 / Nature 2015 MCTS Guo et al, NIPS 2014; TRPO Schulman, Levine, Moritz, Jordan, Abbeel, ICML 2015; A3C Mnih et al, ICML 2016; Dueling DQN Wang et al ICML 2016; Double DQN van Hasselt et al, AAAI 2016; Prioritized Experience Replay Schaul et al, ICLR 2016; Bootstrapped DQN Osband et al, 2016; Q-Ensembles Chen et al, 2017; Rainbow Hessel et al, 2017; ā€¦ Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 5. Deep RL Success Stories AlphaGo Silver et al, Nature 2015 AlphaGoZero Silver et al, Nature 2017 Tian et al, 2016; Maddison et al, 2014; Clark et al, 2015 AlphaZero Silver et al, 2017
  • 6. n Super-human agent on Dota2 1v1, enabled by n Reinforcement learning n Self-play n Enough computation Deep RL Success Stories [OpenAI, 2017] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 7. Deep RL Success Stories ^ TRPO Schulman et al, 2015 + GAE Schulman et al, 2016 See also: DDPG Lillicrap et al 2015; SVG Heess et al, 2015; Q-Prop Gu et al, 2016; Scaling up ES Salimans et al, 2017; PPO Schulman et al, 2017; Parkour Heess et al, 2017;
  • 8. Deep RL Success Stories Guided Policy Search Levine*, Finn*, Darrell, Abbeel, JMLR 2016 Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 9. [Geng*, Zhang*, Bruce*, Caluwaerts, Vespignani, Sunspiral, Abbeel, Levine, ICRA 2017] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 10. Mastery? Yes Deep RL (DQN) vs. Human Score: 18.9 Score: 9.3 Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 11. But how good is the learning? Deep RL (DQN) vs. Human Score: 18.9 Score: 9.3 #Experience measured in real time: 40 days #Experience measured in real time: 2 hours ā€œSlowā€ ā€œFastā€ Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 12. Humans vs. DDQN Black dots: human play Blue curve: mean of human play Blue dashed line: ā€˜expertā€™ human play Red dashed lines: DDQN a@er 10, 25, 200M frames (~ 46, 115, 920 hours) [Tsividis, Pouncy, Xu, Tenenbaum, Gershman, 2017] Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope Humans after 15 minutes tend to outperform DDQN after 115 hours
  • 13. How to bridge this gap? Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 14. Outline Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope Hindsight Experience Replayn Metan -learning ā€“ learn-to-learn faster
  • 15. n On-Policy Algorithms: n Can only perform update based on data collected under current policy n Policy gradients, A3C, TRPO n Atari: Most wall-clock efficient n Off-Policy Algorithms: n Can re-use all past data for updates (ā€œexperience replayā€) n Q learning, DQN, DDPG n Atari: Most sample efficient On-Policy vs. Off-Policy X k from replay buā†µer (Qāœ“(sk, ak) (rk + max a Qāœ“OLD (sk+1, a))2 Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 16. But Signal? Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 17. Main idea: Get reward signal from any experience by simply assuming then goal equals whatever happened Representa=on learned: Universal Qn -func=on Q(s, g, a) Replay buļ¬€er:n Actual experience:n Experience replay:n Hindsight replay:n Hindsight Experience Replay [Andrychowicz et al, 2017] [Related work: Cabi et al, 2017] (st, gt, at, rt, st+1) X k from replay buā†µer (Qāœ“(sk, gk, ak) (rk + max a Qāœ“OLD (sk+1, gk, a))2 X k from replay buā†µer (Qāœ“(sk, g0 k,ak) (r0 k + max a Qāœ“OLD (sk+1, g0 k, a))2 Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 18. Hindsight Experience Replay: Pushing, Sliding, Pick-and-Place Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope[Andrychowicz, Wolski, Ray, Schneider, Fong, Welinder, McGrew, Tobin, Abbeel, Zaremba, NIPS2017]
  • 19. Quan%ta%ve Evalua%on ~3s/episode Ć  1.5 days for 50 epochs, 6 days for 200 epochs Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 20. Outline Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope n Hindsight Experience Replay n Meta-learning ā€“ learn-to-learn faster
  • 21. Starting Observations n TRPO, DQN, A3C, DDPG, PPO, Rainbow, ā€¦ are fully general RL algorithms n i.e., for any environment that can be mathematically defined, these algorithms are equally applicable n Environments encountered in real world = tiny, tiny subset of all environments that could be defined (e.g. they all satisfy our universeā€™s physics) Can we develop ā€œfastā€ RL algorithms that take advantage of this? Pieter Abbeel -- embody.ai / UC Berkeley / Gradescope
  • 22. Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
  (diagram: agent-environment loop with actions a, states s, rewards r)
  • 23. Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
  (diagram: the RL agent = RL algorithm + policy, interacting with the environment via actions a, observations o, rewards r)
  • 24. Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
  (diagram: the same RL algorithm deployed as a separate RL agent in Environment A, Environment B, …)
  • 25. Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
  Traditional RL research:
  - Human experts develop the RL algorithm
  - After many years, still no RL algorithms nearly as good as humans…
  Alternative:
  - Could we learn a better RL algorithm?
  - Or even learn a better entire agent?
  • 26. Meta-Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
  (diagram: a meta RL algorithm trains a "fast" RL agent across meta-training environments A, B, …)
  • 27.-29. Meta-Reinforcement Learning [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
  (diagram, animated across three slides: the learned "fast" RL agent, interacting via actions a, rewards r, observations o, is then evaluated on held-out testing environments F, G, H)
  • 30. Formalizing Learning to Reinforcement Learn [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
  max_θ E_M E_{τ_M^(k)} [ Σ_{k=1}^K R(τ_M^(k)) | RLagent_θ ]
  • 31. Formalizing Learning to Reinforcement Learn [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
  M: sampled MDP; τ_M^(k): k'th trajectory in MDP M
  Objective: max_θ E_M E_{τ_M^(k)} [ Σ_{k=1}^K R(τ_M^(k)) | RLagent_θ ]
  Meta-train: max_θ Σ_{M ∈ M_train} E_{τ_M^(k)} [ Σ_{k=1}^K R(τ_M^(k)) | RLagent_θ ]
  • 32. Representing RLagent_θ [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] also: [Wang et al, 2016]
  - RLagent = RNN = generic computation architecture
  - Different weights in the RNN mean a different RL algorithm and prior
  - Different activations in the RNN mean a different current policy
  - The meta-train objective max_θ Σ_{M ∈ M_train} E_{τ_M^(k)} [ Σ_{k=1}^K R(τ_M^(k)) | RLagent_θ ] can be optimized with an existing (slow) RL algorithm
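The RNN-as-agent interface can be sketched as follows (our own toy weights and sizes, not the paper's architecture): the input at each step concatenates the observation with the previous action and reward, and the hidden state is deliberately not reset between episodes of the same MDP, so the recurrent dynamics can implement a learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
obs_dim, n_actions, hid = 4, 2, 8
in_dim = obs_dim + n_actions + 2           # obs, one-hot prev action, reward, done

W_in = rng.normal(scale=0.1, size=(hid, in_dim))
W_h = rng.normal(scale=0.1, size=(hid, hid))
W_out = rng.normal(scale=0.1, size=(n_actions, hid))

def rnn_agent_step(h, obs, prev_a, prev_r, done):
    """One step: update the hidden state, return action probabilities."""
    a_onehot = np.eye(n_actions)[prev_a]
    x = np.concatenate([obs, a_onehot, [prev_r], [float(done)]])
    h = np.tanh(W_in @ x + W_h @ h)
    logits = W_out @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()

# The hidden state persists across episodes of the same MDP: the activations
# encode what the agent has inferred about this environment so far.
h = np.zeros(hid)
for episode in range(3):
    obs, prev_a, prev_r, done = rng.normal(size=obs_dim), 0, 0.0, False
    for t in range(5):
        h, probs = rnn_agent_step(h, obs, prev_a, prev_r, done)
        prev_a = int(np.argmax(probs))
        obs, prev_r = rng.normal(size=obs_dim), 0.0
```

Meta-training would adjust W_in, W_h, W_out with a slow RL algorithm (e.g. TRPO) on the summed return over the K episodes, as in the meta-train objective on the slide.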
  • 33. Evaluation: Multi-Armed Bandits [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
  - Multi-armed bandit setting: each bandit has its own distribution over pay-outs
  - Each episode = choose 1 bandit
  - A good RL agent should explore bandits sufficiently, yet also exploit the good/best ones
  - Provably (asymptotically) optimal RL algorithms have been invented by humans: Gittins index, UCB1, Thompson sampling, …
  • 34. Evaluation: Multi-Armed Bandits [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016] We consider the Bayesian evaluation setting. Some of these prior works also have adversarial guarantees, which we don't consider here.
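For reference, one of the human-designed baselines named on the slides, UCB1, fits in a few lines (a sketch with our own Bernoulli-bandit setup, not the paper's evaluation code): pull the arm maximizing mean reward plus the bonus sqrt(2 ln t / n_pulls), which shrinks as an arm is sampled more.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with the given success probabilities."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                    # pull each arm once to initialize
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = 1.0 if rng.random() < means[arm] else 0.0   # Bernoulli payout
        counts[arm] += 1
        sums[arm] += r
        total_reward += r
    return total_reward, counts

reward, counts = ucb1([0.2, 0.8], horizon=1000)
```

The learned RNN agent is evaluated against exactly this kind of hand-designed index policy: the question is whether meta-training recovers comparably efficient exploration.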
  • 35. Evaluation: Visual Navigation
  (agent's view and maze map; map shown only for illustration, not available to the agent)
  - Agent input: current image
  - Agent action: straight / 2 degrees left / 2 degrees right
  Related work: Mirowski et al, 2016; Jaderberg et al, 2016; Mnih et al, 2016; Wang et al, 2016
  • 36. Evaluation: Visual Navigation (before vs. after learning-to-learn) [Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel, 2016]
  - Agent input: current image
  - Agent action: straight / 2 degrees left / 2 degrees right
  - Map shown only for illustration, not available to the agent
  • 37. Other Meta RL Architectures
  - Simple Neural Attentive Meta-Learner (SNAIL) [Mishra, Rohaninejad, Chen, Abbeel, 2017]
  - Model-Agnostic Meta-Learning (MAML) [Finn, Abbeel, Levine, 2017; Finn*, Yu*, Zhang, Abbeel, Levine, 2017]
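MAML's core idea, differentiating through an inner gradient step so the initialization adapts fast to any task, can be shown on a 1-D toy problem (our own example, not Finn et al.'s code): each hypothetical task i has loss L_i(θ) = (θ - c_i)², and the meta-gradient applies the chain rule through the adapted parameter θ' = θ - α ∇L_i(θ).

```python
alpha, beta = 0.1, 0.05                    # inner / outer step sizes (our choice)

def inner_loss_grad(theta, c):
    # Task loss L_i(theta) = (theta - c)^2, gradient 2*(theta - c).
    return 2.0 * (theta - c)

def maml_meta_grad(theta, tasks):
    """Gradient of sum_i L_i(theta - alpha * grad L_i(theta)) w.r.t. theta."""
    g = 0.0
    for c in tasks:
        theta_adapted = theta - alpha * inner_loss_grad(theta, c)
        # Chain rule through the inner step: d(theta_adapted)/d(theta) = 1 - 2*alpha
        g += inner_loss_grad(theta_adapted, c) * (1.0 - 2.0 * alpha)
    return g

tasks = [-1.0, 0.5, 2.0]
theta = 5.0
for _ in range(200):
    theta -= beta * maml_meta_grad(theta, tasks)
```

For these quadratic tasks the meta-optimum is the task mean (here 0.5): a single inner step from there makes good progress on every task. In the RL setting the inner losses are policy-gradient objectives and θ is a policy network's weights, but the differentiate-through-adaptation structure is the same.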
  • 38. Summary
  - Hindsight Experience Replay
  - Meta-learning: learn-to-learn faster
  • 39. Want to Try Things Out?
  - OpenAI Gym: https://gym.openai.com
  - OpenAI baselines: https://github.com/openai/baselines
  - rllab: https://github.com/rll/rllab
  • 40. Want to learn more?
  - Deep RL Bootcamp [intense long weekend]: 12 lectures by Pieter Abbeel, Rocky Duan, Peter Chen, John Schulman, Vlad Mnih, Chelsea Finn, Sergey Levine; 4 hands-on labs; https://sites.google.com/view/deep-rl-bootcamp/
  - Deep RL Course [full semester]: originated by John Schulman, Sergey Levine, Chelsea Finn; latest offering (by Sergey Levine): http://rll.berkeley.edu/deeprlcourse/
  • 41. Many Exciting Directions
  - Fast Reinforcement Learning
  - Long Horizon Reasoning
  - Taskability
  - Lifelong Learning
  - Leverage Simulation
  - Maximize Signal Extracted
  - Safe Learning
  - Value Alignment
  - Planning + Learning
  - …
  • 42. Thank you! @pabbeel, pabbeel@cs.berkeley.edu