Continuous control with deep
reinforcement learning
2016-06-28
Taehoon Kim
Motivation
• DQN can only handle
• discrete (not continuous)
• low-dimensional action spaces
• A simple approach to adapt DQN to the continuous domain is to discretize the action space
• e.g. a 7-degree-of-freedom system with each dimension discretized as 𝑎_i ∈ {−𝑘, 0, 𝑘}
• the action space dimensionality becomes 3^7 = 2187
• an explosion of the number of discrete actions
2
Contribution
• Presents a model-free, off-policy actor-critic algorithm
• learns policies in high-dimensional, continuous action spaces
• Work is based on DPG (Deterministic Policy Gradient)
3
Background
• actions 𝑎_t ∈ ℝ^N, action space 𝒜 = ℝ^N
• history of observation, action pairs 𝑠_t = (𝑥_1, 𝑎_1, … , 𝑎_{t−1}, 𝑥_t)
• assume fully-observable so 𝑠_t = 𝑥_t
• policy 𝜋: 𝒮 → 𝒫(𝒜)
• Model environment as Markov decision process
• initial state distribution 𝑝(𝑠_1)
• transition dynamics 𝑝(𝑠_{t+1}|𝑠_t, 𝑎_t)
4
Background
• Discounted future reward 𝑅_t = ∑_{i=t}^{T} 𝛾^{i−t} 𝑟(𝑠_i, 𝑎_i) (sketched below)
• Goal of RL is to learn a policy 𝜋 which maximizes the expected return
• from the start distribution: 𝐽 = 𝔼_{𝑟_i, 𝑠_i∼E, 𝑎_i∼𝜋}[𝑅_1]
• Discounted state visitation distribution for a policy 𝜋: ρ^𝜋
5
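As a concrete illustration of the return defined above, here is a minimal Python sketch (the reward list and the γ value are hypothetical inputs, not from the slides) that accumulates the discounted sum by folding backwards over the rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """R_1 = sum_i gamma^(i-1) * r_i, computed from the last reward back to the first."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# e.g. three steps of reward 1.0 each:
# discounted_return([1.0, 1.0, 1.0])  ->  1 + 0.99 + 0.99**2 = 2.9701
```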
Background
• action-value function 𝑄^𝜋(𝑠_t, 𝑎_t) = 𝔼_{𝑟_{i≥t}, 𝑠_{i>t}∼E, 𝑎_{i>t}∼𝜋}[𝑅_t | 𝑠_t, 𝑎_t]
• expected return after taking an action 𝑎_t in state 𝑠_t and following policy 𝜋
• Bellman equation
• 𝑄^𝜋(𝑠_t, 𝑎_t) = 𝔼_{𝑟_t, 𝑠_{t+1}∼E}[𝑟(𝑠_t, 𝑎_t) + 𝛾 𝔼_{𝑎_{t+1}∼𝜋}[𝑄^𝜋(𝑠_{t+1}, 𝑎_{t+1})]]
• With deterministic policy 𝜇: 𝒮 → 𝒜
• 𝑄^𝜇(𝑠_t, 𝑎_t) = 𝔼_{𝑟_t, 𝑠_{t+1}∼E}[𝑟(𝑠_t, 𝑎_t) + 𝛾 𝑄^𝜇(𝑠_{t+1}, 𝜇(𝑠_{t+1}))]
6
Background
• The expectation depends only on the environment
• so it is possible to learn 𝑄^𝝁 off-policy, using transitions generated from a different stochastic behavior policy 𝜷
• Q-learning (a commonly used off-policy algorithm) uses the greedy policy 𝜇(𝑠) = argmax_𝑎 𝑄(𝑠, 𝑎)
• 𝐿(𝜃^𝑄) = 𝔼_{𝑠_t∼ρ^𝛽, 𝑎_t∼𝛽, 𝑟_t∼E}[(𝑄(𝑠_t, 𝑎_t|𝜃^𝑄) − 𝑦_t)^2]
• where 𝑦_t = 𝑟(𝑠_t, 𝑎_t) + 𝛾 𝑄(𝑠_{t+1}, 𝜇(𝑠_{t+1})|𝜃^𝑄)
• To scale Q-learning to large non-linear function approximators, DQN introduced:
• a replay buffer and a separate target network (see the critic-update sketch below)
7
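A minimal PyTorch sketch of the loss above, assuming hypothetical `critic`, `critic_target`, and `actor_target` modules and a minibatch sampled from the replay buffer (the `done` mask and the target networks anticipate the modifications discussed in the later slides):

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, critic, critic_target, actor_target, gamma=0.99):
    """TD loss L(theta^Q) on a sampled minibatch (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # y_t = r(s_t, a_t) + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
        y = r + gamma * (1.0 - done) * critic_target(s_next, actor_target(s_next))
    q = critic(s, a)               # Q(s_t, a_t | theta^Q)
    return F.mse_loss(q, y)        # E[(Q(s_t, a_t | theta^Q) - y_t)^2]
```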
Deterministic Policy Gradient (DPG)
• In continuous spaces, finding the greedy policy requires an optimization over 𝑎_t at every timestep
• too slow for large, unconstrained function approximators and nontrivial action spaces
• Instead, used an actor-critic approach based on the DPG algorithm
• actor: 𝜇(𝑠|𝜃^𝜇): 𝒮 → 𝒜
• critic: 𝑄(𝑠, 𝑎|𝜃^𝑄)
8
Learning algorithm
• The actor is updated by applying the chain rule to the expected return from the start distribution 𝐽 w.r.t. the actor parameters 𝜃^𝜇 (sketched in code below)
• ∇_{𝜃^𝜇} 𝐽 ≈ 𝔼_{𝑠∼ρ^𝛽}[∇_{𝜃^𝜇} 𝑄(𝑠, 𝑎|𝜃^𝑄)|_{𝑠=𝑠_t, 𝑎=𝜇(𝑠_t|𝜃^𝜇)}] = 𝔼_{𝑠∼ρ^𝛽}[∇_𝑎 𝑄(𝑠, 𝑎|𝜃^𝑄)|_{𝑠=𝑠_t, 𝑎=𝜇(𝑠_t)} ∇_{𝜃^𝜇} 𝜇(𝑠|𝜃^𝜇)|_{𝑠=𝑠_t}]
• Silver et al. (2014) proved this is the policy gradient
• i.e. the gradient of the policy’s performance
9
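In practice the chain rule above is left to automatic differentiation: minimizing −𝑄(𝑠, 𝜇(𝑠|𝜃^𝜇)) with respect to 𝜃^𝜇 yields the same gradient as the product of gradients written above. A minimal PyTorch sketch (the `actor`, `critic`, and `actor_optim` objects are assumptions):

```python
def actor_update(states, actor, critic, actor_optim):
    """One deterministic-policy-gradient step on theta^mu."""
    actions = actor(states)                 # a = mu(s | theta^mu)
    loss = -critic(states, actions).mean()  # ascend E[Q(s, mu(s))]
    actor_optim.zero_grad()
    loss.backward()                         # autograd applies the chain rule
    actor_optim.step()
```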
Contributions
• Introducing non-linear function approximators means that convergence is no longer guaranteed
• but they are essential in order to learn and generalize on large state spaces
• Contribution
• Provide modifications to DPG, inspired by the success of DQN
• These allow neural network function approximators to learn online in large state and action spaces
10
Challenges 1
• NN for RL usually assume that the samples are i.i.d.
• but when the samples are generated from exploring sequentially in an environment,
this assumption no longer holds.
• As in DQN, a replay buffer is used to address this issue
• As in DQN, target networks are used for stable learning, but with “soft” target updates (sketched below)
• 𝜃′ ← 𝜏𝜃 + (1 − 𝜏)𝜃′, with 𝜏 ≪ 1
• The target networks change slowly, which greatly improves the stability of learning
11
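A sketch of the soft update in PyTorch (the network objects are assumed; the default `tau` matches the value given later in the experiment details):

```python
import torch

@torch.no_grad()
def soft_update(target, source, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', applied per parameter."""
    for p_t, p in zip(target.parameters(), source.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)
```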
Challenges 2
• When learning from a low-dimensional feature vector, observations may have different physical units (e.g. positions and velocities)
• this makes it difficult to learn effectively and to find hyper-parameters which generalize across environments
• Use batch normalization [Ioffe & Szegedy, 2015] to normalize each dimension across the samples in a minibatch to have unit mean and variance (see the sketch below)
• It also maintains a running average of the mean and variance for normalization during testing (exploration or evaluation)
• Used on all layers of 𝜇 and on all layers of 𝑄 prior to the action input
• Can train on tasks with differing units without needing to manually ensure the inputs fall within a set range
12
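A sketch of where batch normalization sits in the actor, following the description above (layer widths are taken from the experiment slide; the exact placement relative to the ReLU is an assumption of this sketch). `nn.BatchNorm1d` also tracks the running mean/variance used at evaluation time:

```python
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta^mu): batch norm on the observation and on every hidden layer,
    so dimensions with different physical units are rescaled per minibatch."""
    def __init__(self, state_dim, action_dim, h1=400, h2=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(state_dim),                          # normalize raw observations
            nn.Linear(state_dim, h1), nn.BatchNorm1d(h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.BatchNorm1d(h2), nn.ReLU(),
            nn.Linear(h2, action_dim), nn.Tanh(),               # bound actions to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)
```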
Challenges 3
• An advantage of off-policy algorithms (such as DDPG) is that the problem of exploration can be treated independently from the learning algorithm
• Construct an exploration policy 𝜇′ by adding noise sampled from a noise process 𝒩
• 𝜇′(𝑠_t) = 𝜇(𝑠_t|𝜃_t^𝜇) + 𝒩
• Use an Ornstein-Uhlenbeck process to generate temporally correlated exploration, for exploration efficiency in physical environments with inertia (sketched below)
13
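A sketch of the exploration noise process (the θ and σ values match the experiment slide; treating the time step as 1 is an assumption of this sketch):

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: x <- x + theta * (mu - x) + sigma * N(0, 1)."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(action_dim, mu)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        self.x += self.theta * (self.mu - self.x) \
                  + self.sigma * np.random.randn(len(self.x))
        return self.x.copy()

# exploration action (actor and state objects are assumed):
# a_t = actor(s_t) + noise.sample()
```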
14
Experiment details
• Adam optimizer: 𝑙𝑟_𝜇 = 10^−4, 𝑙𝑟_𝑄 = 10^−3
• 𝑄 includes 𝐿_2 weight decay of 10^−2, and 𝛾 = 0.99
• 𝜏 = 0.001
• ReLU for hidden layers, tanh for output layer of the actor to bound the actions
• NN: 2 hidden layers with 400 and 300 units
• Action is not included until the 2nd hidden layer of 𝑄
• The final layer weights and biases are initialized from a uniform distribution [−3×10^−3, 3×10^−3]
• to ensure the initial outputs for the policy and value estimates were near zero
• The other layers are initialized from uniform distributions [−1/√𝑓, 1/√𝑓], where 𝑓 is the fan-in of the layer (sketched below)
• Replay buffer size ℛ = 10^6, Ornstein-Uhlenbeck process: 𝜃 = 0.15, 𝜎 = 0.2
15
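A sketch of the initialization and optimizer settings listed above (PyTorch assumed; `actor` and `critic` are hypothetical modules built as on the previous slides):

```python
import math
import torch
import torch.nn as nn

def fan_in_init(layer: nn.Linear):
    """Uniform(-1/sqrt(f), 1/sqrt(f)), where f is the layer's fan-in."""
    bound = 1.0 / math.sqrt(layer.weight.size(1))
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

def final_layer_init(layer: nn.Linear, bound=3e-3):
    """Uniform(-3e-3, 3e-3), so initial policy and value outputs are near zero."""
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)

# optimizers with the learning rates and weight decay from this slide:
# actor_optim  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
# critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3, weight_decay=1e-2)
```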
References
1. [Wang, 2015] Wang, Z., de Freitas, N., & Lanctot, M. (2015). Dueling network architectures for
deep reinforcement learning. arXiv preprint arXiv:1511.06581.
2. [Van, 2015] Van Hasselt, H., Guez, A., & Silver, D. (2015). Deep reinforcement learning with
double Q-learning. CoRR, abs/1509.06461.
3. [Schaul, 2015] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience
replay. arXiv preprint arXiv:1511.05952.
4. [Sutton, 1998] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT Press.
16
