Deep Reinforcement Learning
Using deep learning to play self-driving car games
Ben Lau
MLConf 2017, New York City
What is Reinforcement Learning?
Three classes of learning
Supervised Learning
 Labeled data
 Direct feedback
Unsupervised Learning
 No labeled data
 No feedback
 Find hidden structure
Reinforcement Learning
 Uses reward as feedback
 Learns a series of actions
 Trial and error
RL: Agent and Environment
(Diagram: the Agent sends action 𝐴𝑡 to the Environment; the Environment returns observation 𝑂𝑡 and reward 𝑅𝑡)
At each step t, the Agent
• Receives observation 𝑂𝑡
• Executes action 𝐴𝑡
• Receives reward 𝑅𝑡
and the Environment
• Receives action 𝐴𝑡
• Sends observation 𝑂𝑡+1
• Sends reward 𝑅𝑡+1
RL: State
Experience is a sequence of observations, actions, and rewards:
𝑜1, 𝑟1, 𝑎1, … , 𝑜𝑡−1, 𝑟𝑡−1, 𝑎𝑡−1, 𝑜𝑡, 𝑟𝑡, 𝑎𝑡
The state is a summary of experience:
𝑠𝑡 = 𝑓(𝑜1, 𝑟1, 𝑎1, … , 𝑜𝑡−1, 𝑟𝑡−1, 𝑎𝑡−1, 𝑜𝑡, 𝑟𝑡, 𝑎𝑡)
Note: not all states are fully observable
(Example images: Fully Observable vs. Not Fully Observable)
Approach to Reinforcement Learning
Value-Based RL
 Estimate the optimal value function 𝑄∗(𝑠, 𝑎)
 This is the maximum value achievable under any policy
Policy-Based RL
 Search directly for the optimal policy 𝜋∗
 This is the policy achieving maximum future reward
Model-based RL
 Build a model of the environment
 Plan (e.g. by lookahead) using the model
Deep Learning + RL → AI
(Diagram: game input → deep convolutional network → actions: steer, gas pedal, brake; reward feeds back to the network)
Policies
A deterministic policy is the agent’s behavior
 It is a map from state to action: 𝑎𝑡 = 𝜋(𝑠𝑡)
In reinforcement learning, the agent’s goal is to choose each action so that it maximizes the sum of future rewards
Choose 𝑎𝑡 to maximize 𝑅𝑡 = 𝑟𝑡+1 + 𝛾𝑟𝑡+2 + 𝛾²𝑟𝑡+3 + ⋯
𝛾 is a discount factor in [0, 1], as the reward is less certain the further away it is
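As a toy illustration (the reward values below are made up, not from the talk), the discounted return can be evaluated in a couple of lines of Python:

```python
# Discounted return R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
gamma = 0.9                                   # discount factor in [0, 1]
future_rewards = [1.0, 0.5, 2.0, 0.0, 1.5]    # hypothetical r_{t+1}, r_{t+2}, ...

R_t = sum(gamma**k * r for k, r in enumerate(future_rewards))
print(R_t)  # rewards further in the future contribute less
```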
State (s) → Action (a)
Obstacle → Brake
Corner → Left/Right
Straight line → Acceleration
Approach to Reinforcement Learning
Value-Based RL
 Estimate the optimal value function 𝑄∗(𝑠, 𝑎)
 This is the maximum value achievable under any policy
Value Function
 A value function is a prediction of future reward
 How much reward will I get from action a in state s?
 A Q-value function gives expected total reward
 From state-action pair (s, a)
 Under policy 𝜋
 With discount factor 𝛾
𝑄^𝜋(𝑠, 𝑎) = 𝐸[𝑟𝑡+1 + 𝛾𝑟𝑡+2 + 𝛾²𝑟𝑡+3 + ⋯ | 𝑠, 𝑎]
 An optimal value function is the maximum achievable value:
𝑄∗(𝑠, 𝑎) = max_𝜋 𝑄^𝜋(𝑠, 𝑎)
 Once we have 𝑄∗ we can act optimally:
𝜋∗(𝑠) = argmax_𝑎 𝑄∗(𝑠, 𝑎)
Understanding Q Function
 The best way to understand the Q-function is to think of it as a “strategy guide”
 Suppose you are playing a difficult game (DOOM)
 If you have a strategy guide, it’s pretty easy → just follow the guide
 Suppose you are in state s and need to make a decision: if you have this Q-function (strategy guide), it is easy, just pick the action with the highest Q value
Doom Strategy Guide
How to find Q-function
 Discounted future reward: 𝑅𝑡 = 𝑟𝑡 + 𝛾𝑟𝑡+1 + 𝛾²𝑟𝑡+2 + ⋯ + 𝛾^(𝑛−𝑡)𝑟𝑛, which can be written as:
 𝑅𝑡 = 𝑟𝑡 + 𝛾𝑅𝑡+1
Recall the definition of the Q-function (the maximum reward if we choose action a in state s):
 𝑄(𝑠𝑡, 𝑎𝑡) = max 𝑅𝑡+1
Therefore, we can rewrite the Q-function as below (the Bellman equation):
 𝑄(𝑠, 𝑎) = 𝑟 + 𝛾 · max_𝑎′ 𝑄(𝑠′, 𝑎′)
In plain English: the maximum future reward for (s, a) is the immediate reward r plus the maximum future reward in the next state s′ over actions a′
It can be solved by dynamic programming or by an iterative method
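A minimal tabular sketch of that iterative solution (plain Q-learning, not the talk's DQN code; the state/action sizes and the example transition are made up):

```python
import numpy as np

n_states, n_actions = 6, 2        # illustrative sizes
Q = np.zeros((n_states, n_actions))
gamma, alpha = 0.9, 0.1           # discount factor and learning rate

def q_update(s, a, r, s_next):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Example: one observed transition (s=0, a=1, r=1.0, s'=3)
q_update(0, 1, 1.0, 3)
```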
Deep Q-Network (DQN)
 The action-value function (Q-function) is often very big
 DQN idea: use a neural network to compress the Q-table, using the weights w of the network
 𝑄(𝑠, 𝑎) ≈ 𝑄(𝑠, 𝑎, 𝑤)
 Training becomes finding the set of optimal weights w instead
 In the literature this is often called “non-linear function approximation”
State  Action  Value
A      1       140.11
A      2       139.22
B      1       145.89
B      2       140.23
C      1       123.67
C      2       135.27
(The Q-table above is approximated, ≈, by a neural network)
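A minimal Keras sketch of this idea for a small discrete-action problem (dimensions and layer sizes are illustrative, not the network from the talk):

```python
import numpy as np
import tensorflow as tf

STATE_DIM, N_ACTIONS = 4, 2   # illustrative dimensions

# Q(s, a; w): the network takes a state and outputs one Q-value per action,
# replacing a huge lookup table with the weights w.
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(N_ACTIONS),
])

def greedy_action(state):
    """Act greedily under the current approximation: argmax_a Q(s, a; w)."""
    q_values = q_net(state[None, :])          # add a batch dimension
    return int(tf.argmax(q_values[0]).numpy())

print(greedy_action(np.zeros(STATE_DIM, dtype=np.float32)))
```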
DQN Demo: using a Deep Q-Network to play Doom
Approach to Reinforcement Learning
Policy-Based RL
 Search directly for the optimal policy 𝜋∗
 This is the policy achieving maximum future reward
Deep Policy Network
Review: a policy is the agent’s behavior
 It is a map from state to action: 𝑎𝑡 = 𝜋(𝑠𝑡)
 We can search for the policy directly
 Let’s parameterize the policy by some model parameters 𝜃: 𝑎 = 𝜋(𝑠, 𝜃)
 We call it policy-based reinforcement learning because we adjust the model parameters 𝜃 directly
 The goal is to maximize the total discounted reward from the beginning:
maximize 𝑅 = 𝑟1 + 𝛾𝑟2 + 𝛾²𝑟3 + ⋯
Policy Gradient
How do we make good actions more likely?
 Define the objective function as the total discounted reward:
𝐿(𝜃) = 𝐸[𝑟1 + 𝛾𝑟2 + 𝛾²𝑟3 + ⋯ | 𝜋_𝜃(𝑠, 𝑎)]
or
𝐿(𝜃) = 𝐸[𝑅 | 𝜋_𝜃(𝑠, 𝑎)]
where the expectation of the total reward R is taken under some probability distribution 𝑝(𝑎|𝜃) parameterized by 𝜃
 The goal becomes maximizing the total reward by computing the gradient 𝜕𝐿(𝜃)/𝜕𝜃
Policy Gradient (II)
Recall: the Q-function is the maximum discounted future reward in state s, taking action a:
𝑄(𝑠𝑡, 𝑎𝑡) = max 𝑅𝑡+1
 In the continuous case we can write it as 𝑄(𝑠𝑡, 𝑎𝑡) = 𝑅𝑡+1
Therefore, we can compute the gradient as
𝜕𝐿(𝜃)/𝜕𝜃 = 𝐸_{𝑝(𝑎|𝜃)}[ 𝜕𝑄/𝜕𝜃 ]
 Using the chain rule, we can rewrite this as
𝜕𝐿(𝜃)/𝜕𝜃 = 𝐸_{𝑝(𝑎|𝜃)}[ (𝜕𝑄_𝜃(𝑠, 𝑎)/𝜕𝑎) · (𝜕𝑎/𝜕𝜃) ]
No dynamics model required!
1. Only requires that Q is differentiable w.r.t. a
2. As long as a can be parameterized as a function of 𝜃
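Below is a minimal TensorFlow 2 / Keras sketch of this deterministic policy-gradient step (the actor update in DDPG); the layer sizes, names and dimensions are illustrative and are not taken from the talk's DDPG Keras code:

```python
import tensorflow as tf

STATE_DIM, ACTION_DIM = 29, 3   # illustrative TORCS-like sensor/action sizes

# Actor a = pi(s, theta): maps a state to an action; theta are the network weights.
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(ACTION_DIM, activation="tanh"),
])

# Critic Q(s, a): maps a state-action pair to a scalar value.
s_in = tf.keras.Input(shape=(STATE_DIM,))
a_in = tf.keras.Input(shape=(ACTION_DIM,))
h = tf.keras.layers.Dense(64, activation="relu")(tf.keras.layers.Concatenate()([s_in, a_in]))
critic = tf.keras.Model([s_in, a_in], tf.keras.layers.Dense(1)(h))

optimizer = tf.keras.optimizers.Adam(1e-4)

def actor_step(states):
    """One policy-gradient step: dL/dtheta = E[ dQ/da * da/dtheta ],
    obtained by back-propagating Q(s, pi(s)) through the actor."""
    with tf.GradientTape() as tape:
        actions = actor(states)               # a = pi(s, theta)
        q = critic([states, actions])         # Q(s, a), differentiable w.r.t. a
        loss = -tf.reduce_mean(q)             # maximize Q -> minimize -Q
    grads = tape.gradient(loss, actor.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```

(Training the critic itself is omitted here; see the Q-learning loss sketched in the appendix.)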
The power of Policy Gradient
Because the policy gradient does not require a dynamics model, no prior domain knowledge is required
AlphaGo was not pre-programmed with any domain knowledge
It keeps playing many games (via self-play) and adjusts the policy parameters 𝜃 to maximize the reward (the winning probability)
Intuition: Value vs Policy RL
 Value-based RL is like a driving instructor: a score is given for every action the student takes
 Policy-based RL is like the driver: it is the actual policy for how to drive the car
The car racing game TORCS
 TORCS is a state-of-the-art open-source racing simulator written in C++
 Main features
 Sophisticated dynamics
 Provided with several tracks and controllers
 Sensors
 Rangefinder
 Speed
 Position on track
 Rotation speed of wheels
 RPM
 Angle with the track
Quite close to a real self-driving-car setting… (Image: track sensors)
Deep Learning Recipe
(Diagram: game input state s → deep neural network → actions: steer, gas pedal, brake; reward feeds back)
The state s consists of:
 Rangefinder
 Speed
 Position on track
 Rotation speed of wheels
 RPM
 Angle with the track
Compute the optimal policy 𝜋 via policy gradient
Design of the reward function
 Obvious choice: the highest velocity of the car, 𝑅 = 𝑉car cos 𝜃
 However, in practice learning with this reward is not very stable
 Use a modified reward function: 𝑅 = 𝑉𝑥 cos 𝜃 − 𝑉𝑥 sin 𝜃 − 𝑉𝑥 |trackPos|
This encourages the car to stay in the center of the track
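A direct transcription of this modified reward into Python (θ is the angle between the car and the track axis, track_pos the normalised distance from the track centre; the function name and the example values are illustrative):

```python
import math

def modified_reward(v_x, theta, track_pos):
    """Reward forward speed, penalise lateral speed and distance
    from the track centre (track_pos = 0 means dead centre)."""
    return v_x * math.cos(theta) - v_x * math.sin(theta) - v_x * abs(track_pos)

# Example: fast, slightly misaligned, slightly off-centre
print(modified_reward(v_x=50.0, theta=0.05, track_pos=0.1))
```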
Source code available here:
Google: DDPG Keras
Training set: Aalborg track
Validation set: Alpine tracks
Recall basic machine learning: make sure you test the model on the validation set, not the training set
Learning how to brake
Since we try to maximize the velocity of the car,
the AI agent doesn’t want to hit the brake at all! (Braking goes against the reward function)
Solution: use the stochastic brake idea
Final Demo – Car does not stay in the center of the track
Future Application
Self-driving cars:
Future Application
Thank you!
Twitter: @yanpanlau
Appendix
How to find Q-function (II)
 𝑄(𝑠, 𝑎) = 𝑟 + 𝛾 · max_𝑎′ 𝑄(𝑠′, 𝑎′)
We can use an iterative method to solve for the Q-function, given a transition (s, a, r, s′)
 We want 𝑟 + 𝛾 · max_𝑎′ 𝑄(𝑠′, 𝑎′) (the target) to be the same as 𝑄(𝑠, 𝑎) (the prediction)
 Considering finding the Q-function as a regression task, we can define a loss function:
 Loss = ½ [ 𝑟 + 𝛾 · max_𝑎′ 𝑄(𝑠′, 𝑎′) − 𝑄(𝑠, 𝑎) ]²
 Q is optimal when the loss function is at its minimum
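As a sketch, this loss can be written directly against a Keras Q-network for a discrete action set (q_net, the batching convention and the default γ are assumptions, not the talk's code; the target term is held fixed during differentiation):

```python
import tensorflow as tf

def q_loss(q_net, s, a, r, s_next, gamma=0.99):
    """0.5 * (target - prediction)^2, averaged over a batch of transitions (s, a, r, s')."""
    target = r + gamma * tf.reduce_max(q_net(s_next), axis=1)     # r + gamma * max_a' Q(s', a')
    prediction = tf.gather(q_net(s), a, axis=1, batch_dims=1)     # Q(s, a) for the actions taken
    return 0.5 * tf.reduce_mean(tf.square(tf.stop_gradient(target) - prediction))
```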