Ben Lau is a quantitative researcher at a macro hedge fund in Hong Kong, where he applies mathematical models and signal processing techniques to study financial markets. Before joining the financial industry, he used his mathematical modelling skills to probe the mysteries of the universe at the Stanford Linear Accelerator Center, a national accelerator laboratory, where he studied the asymmetry between matter and antimatter by analysing tens of billions of collision events created by particle accelerators. Ben received his Ph.D. in Particle Physics from Princeton University and his undergraduate degree (with First Class Honours) from the Chinese University of Hong Kong.
Abstract Summary:
Deep Reinforcement Learning: Developing a robotic car with the ability to form long-term driving strategies is key to enabling fully autonomous driving in the future. Reinforcement learning is considered a strong AI paradigm that can be used to teach machines through interaction with the environment and learning from their mistakes. In this talk, we will discuss how to apply deep reinforcement learning techniques to train a self-driving car in TORCS, an open-source racing car simulator. I will share how this is implemented and discuss various challenges in this project.
Ben Lau, Quantitative Researcher, Hobbyist, at MLconf NYC 2017
1. Deep Reinforcement Learning: using deep learning to play self-driving car games
Ben Lau
MLConf 2017, New York City
2. What is Reinforcement Learning?
Three classes of learning:
Supervised Learning
• Labeled data
• Direct feedback
Unsupervised Learning
• No labels
• No feedback
• Find hidden structure
Reinforcement Learning
• Uses reward as feedback
• Learns a series of actions
• Trial and error
3. RL: Agent and Environment
[Diagram: the Agent sends action $A_t$ to the Environment; the Environment returns observation $O_t$ and reward $R_t$.]
At each step t, the Agent:
• receives observation $O_t$
• executes action $A_t$
• receives reward $R_t$
and the Environment:
• receives action $A_t$
• sends observation $O_{t+1}$
• sends reward $R_{t+1}$
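To make the loop concrete, here is a minimal sketch in Python, using a made-up toy environment rather than any particular simulator's API:

```python
import random

class ToyEnvironment:
    """A made-up 1-D track: the agent is rewarded for staying near position 0."""
    def reset(self):
        self.pos = random.uniform(-1.0, 1.0)
        self.t = 0
        return self.pos                       # initial observation O_t

    def step(self, action):
        self.pos += action                    # environment receives action A_t
        self.t += 1
        reward = -abs(self.pos)               # environment sends reward R_{t+1}
        done = self.t >= 100
        return self.pos, reward, done         # environment sends observation O_{t+1}

env = ToyEnvironment()
obs, done, total = env.reset(), False, 0.0
while not done:
    action = -0.1 if obs > 0 else 0.1         # a trivial hand-coded "agent"
    obs, reward, done = env.step(action)      # agent executes A_t, receives O_{t+1}, R_{t+1}
    total += reward
print("episode reward:", total)
```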
4. RL: State
Experience is a sequence of observations, rewards, and actions:
$o_1, r_1, a_1, \ldots, o_{t-1}, r_{t-1}, a_{t-1}, o_t, r_t, a_t$
The state is a summary of experience:
$s_t = f(o_1, r_1, a_1, \ldots, o_{t-1}, r_{t-1}, a_{t-1}, o_t, r_t, a_t)$
Note: not all states are fully observable.
[Images: a fully observable environment vs. a not fully observable one.]
5. Approaches to Reinforcement Learning
Value-Based RL
• Estimate the optimal value function $Q^*(s, a)$
• This is the maximum value achievable under any policy
Policy-Based RL
• Search directly for the optimal policy $\pi^*$
• This is the policy achieving maximum future reward
Model-Based RL
• Build a model of the environment
• Plan (e.g. by lookahead) using the model
6. Deep Learning + RL = AI
[Diagram: game input feeds a deep convolutional network, which outputs steer, gas pedal, and brake; the reward signal drives learning.]
7. Policies
A deterministic policy is the agent's behaviour.
It is a map from state to action:
$a_t = \pi(s_t)$
In reinforcement learning, the agent's goal is to choose each action so as to maximize the sum of future rewards:
Choose $a_t$ to maximize $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
$\gamma \in [0, 1]$ is a discount factor: rewards further in the future are less certain.
Example policy:
State (s) → Action (a)
Obstacle → Brake
Corner → Left/Right
Straight line → Accelerate
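As a quick illustration, the discounted sum above is a few lines of Python (the reward values here are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum future rewards with exponentially decaying weights gamma^i."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# made-up rewards: 1.0 + 0.9 * 0.5 + 0.9**2 * 2.0 = 3.07
print(discounted_return([1.0, 0.5, 2.0], gamma=0.9))
```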
8. Approaches to Reinforcement Learning
Value-Based RL
• Estimate the optimal value function $Q^*(s, a)$
• This is the maximum value achievable under any policy
9. Value Function
A value function is a prediction of future reward:
How much reward will I get from action a in state s?
A Q-value function gives the expected total reward
• from state-action pair (s, a)
• under policy $\pi$
• with discount factor $\gamma$:
$Q^\pi(s, a) = \mathbb{E}[\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s, a \,]$
An optimal value function is the maximum achievable value:
$Q^*(s, a) = \max_\pi Q^\pi(s, a)$
Once we have $Q^*$, we can act optimally:
$\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$
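Acting optimally is then just an argmax over Q-values; a toy sketch, with made-up Q*-values for four driving actions:

```python
# Made-up Q*-values for the current state s, one per driving action
q_values = {"brake": 123.4, "left": 140.2, "right": 139.8, "accelerate": 151.0}

# pi*(s) = argmax_a Q*(s, a)
best_action = max(q_values, key=q_values.get)
print(best_action)  # accelerate
```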
10. Understanding Q Function
The best way to understand the Q-function is to think of it as a "strategy guide".
Suppose you are playing a difficult game (e.g. Doom). If you have a strategy guide, the game is easy: just follow the guide.
Suppose you are in state s and need to make a decision. If you have the Q-function (the strategy guide), it is easy: just pick the action with the highest Q-value.
[Image: Doom strategy guide]
11. How to find Q-function
Discounted future reward: $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_n$
which can be written as:
$R_t = r_t + \gamma R_{t+1}$
Recall the definition of the Q-function (the maximum reward achievable by choosing action a in state s):
$Q(s_t, a_t) = \max R_{t+1}$
Therefore, we can rewrite the Q-function as:
$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
In plain English: the maximum future reward for (s, a) is the immediate reward r plus the maximum future reward in the next state s'.
This equation can be solved by dynamic programming or by iteration.
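A sketch of the iterative solution in Python (tabular Q-learning; the learning rate, discount factor, and two-action set are illustrative assumptions):

```python
import collections

Q = collections.defaultdict(float)   # tabular Q-function, keyed by (state, action)
actions = [0, 1]                     # hypothetical two-action set
alpha, gamma = 0.1, 0.95             # learning rate and discount factor (assumed)

def q_update(s, a, r, s_next):
    """Move Q(s, a) one step toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```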
12. Deep Q-Network (DQN)
The action-value function (Q-function) is often very big.
The DQN idea: use a neural network, with weights w, to compress the Q-table:
$Q(s, a) \approx Q(s, a, w)$
Training then becomes finding an optimal set of weights w instead.
In the literature this is often called "non-linear function approximation".
Example Q-table:
State  Action  Value
A      1       140.11
A      2       139.22
B      1       145.89
B      2       140.23
C      1       123.67
C      2       135.27
[Diagram: the Q-table is approximated (≈) by a neural network.]
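A minimal sketch of such a Q-network in Keras; the layer sizes and the input/output dimensions are assumptions for illustration, not the talk's exact architecture:

```python
from tensorflow.keras import layers, models

def build_q_network(state_dim=4, n_actions=2):
    """State in, one Q(s, a, w) per action out; w are the network weights."""
    model = models.Sequential([
        layers.Input(shape=(state_dim,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_actions),             # linear output: one Q-value per action
    ])
    model.compile(optimizer="adam", loss="mse")  # regression toward Bellman targets
    return model
```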
14. Approaches to Reinforcement Learning
Policy-Based RL
• Search directly for the optimal policy $\pi^*$
• This is the policy achieving maximum future reward
15. Deep Policy Network
Review: a policy is the agent's behaviour.
It is a map from state to action:
$a_t = \pi(s_t)$
We can search for the policy directly.
Let's parameterize the policy by some model parameters $\theta$:
$a = \pi(s, \theta)$
We call this policy-based reinforcement learning because we adjust the model parameters $\theta$ directly.
The goal is to maximize the total discounted reward from the beginning:
maximize total $R = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$
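A sketch of a parameterized policy network in Keras, whose weights play the role of $\theta$; the layer sizes, input dimension, and the three output heads (steer, gas, brake) are assumptions modelled on the driving task:

```python
from tensorflow.keras import layers, models

def build_policy_network(state_dim=29):
    """a = pi(s, theta): the network weights are the policy parameters theta."""
    state = layers.Input(shape=(state_dim,))
    h = layers.Dense(300, activation="relu")(state)
    h = layers.Dense(600, activation="relu")(h)
    steer = layers.Dense(1, activation="tanh")(h)     # steering in [-1, 1]
    gas = layers.Dense(1, activation="sigmoid")(h)    # gas pedal in [0, 1]
    brake = layers.Dense(1, activation="sigmoid")(h)  # brake in [0, 1]
    return models.Model(state, layers.Concatenate()([steer, gas, brake]))
```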
16. Policy Gradient
How do we make good actions more likely?
Define the objective function as the total discounted reward:
$L(\theta) = \mathbb{E}[\, r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid \pi_\theta(s, a) \,]$
or
$L(\theta) = \mathbb{E}[\, R \mid \pi_\theta(s, a) \,]$
where the expectation of the total reward R is taken under some probability distribution $p(a \mid \theta)$ parameterized by $\theta$.
The goal becomes maximizing the total reward by computing the gradient $\frac{\partial L(\theta)}{\partial \theta}$.
17. Policy Gradient (II)
Recall: the Q-function is the maximum discounted future reward in state s for action a:
$Q(s_t, a_t) = \max R_{t+1}$
In the continuous case we can write:
$Q(s_t, a_t) = R_{t+1}$
Therefore, we can compute the gradient as:
$\frac{\partial L(\theta)}{\partial \theta} = \mathbb{E}_{p(a \mid \theta)}\left[ \frac{\partial Q}{\partial \theta} \right]$
Using the chain rule, we can rewrite this as:
$\frac{\partial L(\theta)}{\partial \theta} = \mathbb{E}_{p(a \mid \theta)}\left[ \frac{\partial Q_\theta(s, a)}{\partial a} \frac{\partial a}{\partial \theta} \right]$
No dynamics model required!
1. Only requires that Q is differentiable with respect to a
2. As long as a can be parameterized as a function of $\theta$
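A sketch of this update with automatic differentiation; `actor` and `critic` stand for hypothetical policy and Q-networks like the ones sketched earlier, and the rest of the training loop is omitted:

```python
import tensorflow as tf

def policy_gradient_step(actor, critic, states, optimizer):
    """One gradient step on L(theta) = E[Q(s, pi(s, theta))]."""
    with tf.GradientTape() as tape:
        actions = actor(states)                # a = pi(s, theta)
        q_values = critic([states, actions])   # Q(s, a)
        loss = -tf.reduce_mean(q_values)       # maximizing Q == minimizing -Q
    # autodiff applies the chain rule dQ/da * da/dtheta for us
    grads = tape.gradient(loss, actor.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```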
18. The power of Policy Gradient
Because the policy gradient does not require a dynamics model, no prior domain knowledge is required.
AlphaGo was not pre-programmed with domain knowledge: it keeps playing many games (via self-play) and adjusts the policy parameters $\theta$ to maximize the reward (the winning probability).
19. Intuition: Value vs Policy RL
Value-based RL is like a driving instructor: a score is given for every action the student takes.
Policy-based RL is like the driver: it is the actual policy for how to drive the car.
20. The car racing game TORCS
TORCS is a state-of-the-art open-source simulator written in C++.
Main features:
• Sophisticated dynamics
• Comes with several tracks and controllers
Sensors:
• Rangefinder (track sensors)
• Speed
• Position on track
• Rotation speed of the wheels
• RPM
• Angle with the track
The sensors are quite realistic for self-driving cars.
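A hedged sketch of packing such sensor readings into a single state vector; the field names and dimensions here are illustrative, not the simulator's exact API:

```python
import numpy as np

def make_state(obs):
    """Pack sensor readings (a dict with made-up field names) into one state vector."""
    return np.concatenate([
        np.asarray(obs["track"]),          # rangefinder: distances to the track edge
        [obs["speedX"], obs["speedY"]],    # speed along and across the car's axis
        [obs["trackPos"]],                 # lateral position on the track
        np.asarray(obs["wheelSpinVel"]),   # rotation speed of the four wheels
        [obs["rpm"]],                      # engine RPM
        [obs["angle"]],                    # angle between the car and the track axis
    ])
```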
21. Deep Learning Recipe
[Diagram: the game input (state s: rangefinder, speed, position on track, rotation speed of wheels, RPM, angle with track) feeds a deep neural network, which outputs steer, gas pedal, and brake; the reward signal drives learning.]
Compute the optimal policy $\pi$ via policy gradient.
22. Design of the reward function
Obvious choice: the highest velocity of the car, $R = V_{car} \cos\theta$.
However, experience showed that learning is not very stable with this reward.
Use a modified reward function instead: $R = V_x \cos\theta - V_x \sin\theta - V_x \,|\text{trackPos}|$
This encourages the car to stay in the center of the track.
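The modified reward is straightforward to express in code; a sketch, with $V_x$, $\theta$, and trackPos taken from the sensors above:

```python
import math

def reward(vx, theta, track_pos):
    """R = Vx cos(theta) - Vx sin(theta) - Vx |trackPos|, as on the slide."""
    return vx * math.cos(theta) - vx * math.sin(theta) - vx * abs(track_pos)
```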
23. Source code available here:
Google: DDPG Keras
25. Validation Set: Alpine Tracks
Recall basic machine learning: make sure you test the model on the validation set, not the training set.
26. Learning how to brake
Since we are trying to maximize the velocity of the car, the AI agent doesn't want to hit the brake at all (braking goes against the reward function)!
Solution: use the "stochastic brake" idea, sketched below.
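One possible reading of the stochastic-brake trick, as a sketch: during exploration, let the proposed brake value through only with some probability, so the agent still experiences braking even though the reward discourages it. The 10% rate is an illustrative assumption:

```python
import random

def stochastic_brake(brake_action, p=0.1):
    """During exploration, let the proposed brake value through only p of the time."""
    return brake_action if random.random() < p else 0.0
```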
27. Final Demo – the car does not stay in the center of the track
32. How to find Q-function (II)
$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$
We can use an iterative method to solve for the Q-function. Given a transition (s, a, r, s'), we want the target $r + \gamma \max_{a'} Q(s', a')$ to be the same as the prediction $Q(s, a)$.
Treating the search for the Q-function as a regression task, we can define a loss function:
$\text{Loss} = \frac{1}{2}\left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]^2$
where the first term is the target and the second is the prediction. Q is optimal when the loss function is at its minimum.
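A sketch of this loss for a single transition; `q_net` stands for a hypothetical Keras-style Q-network like the one sketched earlier, and s, s_next are numpy state vectors:

```python
import numpy as np

def bellman_loss(q_net, s, a, r, s_next, gamma=0.95):
    """0.5 * (target - prediction)^2 for one transition (s, a, r, s')."""
    target = r + gamma * np.max(q_net.predict(s_next[None], verbose=0))  # held fixed
    prediction = q_net.predict(s[None], verbose=0)[0][a]                 # Q(s, a)
    return 0.5 * (target - prediction) ** 2
```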