3. Problem Statement
• Train a self-driving agent using RL algorithms in simulation.
• Develop an algorithm that can be run on Argo's driving logs.
• Aim for sample-efficient algorithms.
An agent exploring the CARLA environment
4. Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance through simulation.
• Behavior cloning is capped by the expert's performance, while RL isn't.
• If we can run it on one experience replay, we can run it on any experience replay!
6. Problem Setting
• State space – either an encoded image, waypoints, or manual features.

Example manual features:
Feature        | Value
WP             | 0.4
Obstacle       | 1
Traffic Light  | 0
…              | …

Front View (RGB / SS)
The input image can be either front/top view and RGB/semantic. Waypoints describe the route. Manual features are usually privileged information, or vehicle speed/steer, which are known and given, as sketched below.
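As an illustration, a minimal sketch of how such a manual feature vector might be assembled; every name and value here is a hypothetical placeholder, not the lab's actual interface:

```python
import numpy as np

def build_manual_state(waypoint_angle, obstacle_dist, traffic_light_red,
                       speed, steer):
    """Assemble privileged scalar features into a fixed-size state vector.

    All arguments are hypothetical placeholders for quantities the
    simulator exposes directly (no perception needed).
    """
    return np.array([
        waypoint_angle,           # e.g. 0.4 rad to the next waypoint
        obstacle_dist,            # e.g. 1.0 (normalized distance)
        float(traffic_light_red), # 0 = green/none, 1 = red
        speed,                    # current vehicle speed
        steer,                    # current steering command
    ], dtype=np.float32)
```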
7. Problem Setting
• State space – either an encoded image, waypoints, or manual features.
• Action space – speed and steer (bounded and continuous).
• PID controller – for low-level control (a minimal sketch follows below).
• Test scenario: NoCrash (regular & dense) benchmark – 25 routes to drive along (~15-20k frames).
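The slides don't show the controller itself; below is a minimal textbook PID sketch of the kind typically used for this low-level control step. The gains, interface, and units are assumptions, not the lab's implementation:

```python
class PID:
    """Simple discrete PID controller for low-level speed tracking."""

    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Usage: map the agent's speed action to a throttle command (values hypothetical).
speed_pid = PID(kp=0.5, ki=0.1, kd=0.05)
throttle = speed_pid.step(target=6.0, measured=4.2)  # m/s
```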
9. Decoupling the problem
• In the lab's initial attempts, the image-based agent didn't learn to stop.
• Q: Is the issue in the representation or in the RL?
The old PPO-based image agent
10. Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Which components need to be improved?
12. Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Focusing on representation – Imitation Learning
14. Focus on RL : N-step Soft Actor Critic
• Used Soft Actor-Critic (SAC).
• 8-dimensional state space:
  • Nearest obstacle distance and speed (2)
  • Mean angle to next 5 waypoints (1)
  • Distance from red light (1)
  • Vehicle speed/steer, deviation from trajectory, and distance to goal (4)
• Reward (a sketch follows below):
  • Speed-based reward
  • Deviation-from-trajectory penalty
  • Collision penalty (∝ collision speed)
Network Diagram for privileged RL expert
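A minimal sketch of a reward with the shape described above (speed-based reward, deviation penalty, collision penalty proportional to collision speed); the weights and normalization are hypothetical:

```python
def reward(speed, target_speed, deviation, collided, collision_speed,
           w_speed=1.0, w_dev=0.5, w_coll=10.0):
    """Shaped driving reward matching the structure on this slide.

    All weights are illustrative, not the lab's tuned values.
    """
    r = w_speed * min(speed, target_speed) / target_speed  # speed-based reward
    r -= w_dev * deviation                                 # deviation penalty
    if collided:
        r -= w_coll * collision_speed                      # ∝ collision speed
    return r
```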
15. Results : N-step SAC outperforms PPO
• The final agent learned to drive well, with very few collisions and traffic-light violations.
• Such a privileged agent was also trained by other lab members using PPO.
• Although naïve SAC is not as good as PPO, n-step SAC outperforms it in terms of total reward (the n-step target computation is sketched below).
Final trained agent using SAC
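The n-step variant replaces SAC's one-step bootstrap with an n-step return. A minimal sketch of the target computation, assuming the soft state values from the SAC critics have been precomputed; this is an illustration, not the lab's code:

```python
import numpy as np

def n_step_targets(rewards, values, gamma=0.99, n=25):
    """Compute n-step return targets y_t for a length-T trajectory.

    rewards: r_0 .. r_{T-1} (length T)
    values:  soft state values V(s_0) .. V(s_T) (length T+1); in SAC,
             V(s) = min_i Q_i(s, a') - alpha * log pi(a'|s) with a' ~ pi,
             assumed precomputed by the critics here.
    y_t = sum_{k=0}^{h-1} gamma^k r_{t+k} + gamma^h V(s_{t+h}), h = min(n, T-t).
    """
    T = len(rewards)
    targets = np.empty(T)
    for t in range(T):
        h = min(n, T - t)                       # shorter horizon near episode end
        ret = sum((gamma ** k) * rewards[t + k] for k in range(h))
        targets[t] = ret + (gamma ** h) * values[t + h]
    return targets
```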
17. Focus on Vision: Approaches so far
Input images and data from simulator → Pretrained visual model (via behavior cloning (LBC) or an autoencoder)
18. Focus on Vision: Approaches so far
Input images and data from simulator → Pretrained visual model (via behavior cloning (LBC) or an autoencoder) → Policy network → Reward optimization
19. Proposed Method
• Leverage the trained policy network from stage 1 (the privileged agent).

Stage 1: Input images and data from simulator → State space construction → Privileged policy network → Reward optimization
Stage 2: Input images and data from simulator → Visual policy network, trained by behavior cloning from the privileged policy network
20. Behavior Cloning (DAgger + Auxiliary Task)
• Visual policy.
• End-to-end training.
• Helps train the conv layers.
• DAgger: train, deploy, label with the expert, train again (see the sketch below).
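A schematic sketch of the DAgger loop described above; the expert, learner, environment, and training routine are all assumed interfaces:

```python
def dagger(expert, learner, env, train, n_iters=5, episodes_per_iter=10):
    """Schematic DAgger loop: roll out the learner, relabel with the expert.

    expert, learner: callables state -> action (hypothetical interfaces)
    env:             gym-style environment with reset()/step()
    train:           routine that fits the learner on (state, action) pairs
    """
    dataset = []
    for it in range(n_iters):
        for _ in range(episodes_per_iter):
            state, done = env.reset(), False
            while not done:
                action = learner(state)                  # deploy current policy
                dataset.append((state, expert(state)))   # label with the expert
                state, _, done, _ = env.step(action)
        train(learner, dataset)                          # retrain on aggregated data
    return learner
```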
21. Qualitative results : DAgger
Our agent follows traffic lights, with few collisions in dense traffic.
Front-view RGB agent after imitation learning
22. Advantages of Proposed approach
• Fully RL: unlike LBC, no expert demonstrations are required.
• Faster: the time to get a visual policy is less than direct RL on images.
• Practical: privileged information can be engineered using sensors.
  • It can also incorporate traffic laws, e.g. lane speeds and speed limits.
• Transferable: policies trained on engineered features are easy to transfer.
  • Obstacle distance remains the same irrespective of rainy or sunny weather.
23. Disadvantages of Proposed approach
• Problem: the "expert" can sometimes fail, so we can't do behavior cloning all the time.
• Solution: use the expert as the behavior policy and the visual policy as the target policy, and run SAC (sketched below).
Example of failure case of RL expert
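A schematic sketch of this expert-as-behavior-policy setup; all components (replay buffer, SAC update routine, environment interface) are assumed:

```python
def improve_with_sac(expert, visual_policy, env, replay, sac_update,
                     total_steps=100_000):
    """Schematic off-policy improvement: the privileged expert acts in the
    environment (behavior policy) while SAC updates the visual policy
    (target policy). All components are hypothetical interfaces."""
    state = env.reset()
    for step in range(total_steps):
        action = expert(state)                    # behavior policy: the expert
        next_state, reward, done, _ = env.step(action)
        replay.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state
        sac_update(visual_policy, replay)         # target policy: visual agent
```

Because SAC is off-policy, the update is valid even though the transitions come from the expert rather than from the visual policy itself.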
24. Improving using off-policy RL (SAC)
Previously…
• Visual policy.
• End-to-end training.
• Helps train the conv layers.
• Agent follows traffic lights, with few collisions in dense traffic.
25. Improving using off-policy RL (SAC)
• Improves the visual policy further.
• The ResNet is frozen and used as a feature extractor.
• Soft Actor-Critic is used to train the final two layers (a sketch follows below).
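A minimal PyTorch sketch of this arrangement, using a stock torchvision resnet18 as a stand-in for the lab's imitation-pretrained network; the head architecture and action parameterization are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Frozen backbone: the imitation-pretrained network is used purely as a
# feature extractor (a stock resnet18 stands in for the lab's model here).
backbone = resnet18()
backbone.fc = nn.Identity()           # expose the 512-d feature vector
for p in backbone.parameters():
    p.requires_grad_(False)
backbone.eval()

# Trainable head: only these final layers are updated by SAC.
head = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 2 * 2),            # mean and log-std for (speed, steer)
)

def act(image):
    """image: (B, 3, H, W) tensor -> sampled bounded (speed, steer) action."""
    with torch.no_grad():             # no gradients flow into the backbone
        feats = backbone(image)
    mean, log_std = head(feats).chunk(2, dim=-1)
    action = mean + log_std.exp() * torch.randn_like(mean)
    return torch.tanh(action)         # squash to bounded continuous actions
```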
26. Improving using off-policy RL (SAC)
The training procedure
Smoother and better driving after SAC training
27. Results : Comparison against baselines
Algorithm                              | Expert | Oracle information | No crash (Regular) | No crash (Dense) | Training time (millions of interactions)
Learning by Cheating (LBC) [1]         |        |                    | 94                 | 51               | 0.174
Implicit Affordances (IA) [9]          |        |                    | 87                 | 42               | 10-20
Auton Lab Old Agent (PPO) [8]          |        |                    | 96                 | 89               | 16
Ours (DAgger)                          |        |                    | 88                 | 60               | 0.070
Ours (DAgger + Aux Task) (train only)  |        |                    | 92                 | 70               | 0.085
Ours (DAgger + SAC, n=25)              |        |                    | 94                 | 84               | 1

No-crash scores are the % of successful episodes out of 25 on the testing town.
28. Summary
• Learning from raw image pixels using RL is hard.
• Learning from generic pretrained networks using RL is hard (this works for behavior cloning but not for RL).
• Learning from task-specific (reward-specific) pretrained networks is highly recommended (a visual policy vs. a "policy visual"?).
• It is possible to take humans/autopilots out of the loop entirely and still learn a good driving behavior.
31. References
[1] Chen et al., "Learning by Cheating", CoRL 2019.
[2] Prof. Jeff Schneider's RI Seminar Talk.
[3] Liang, Xiaodan, et al., "CIRL: Controllable Imitative Reinforcement Learning for Vision-Based Self-Driving", ECCV 2018.
[4] Kendall, Alex, et al., "Learning to Drive in a Day", ICRA, IEEE, 2019.
[5] Agarwal et al., "Learning to Drive using Waypoints", NeurIPS 2019 Workshop – ML4AD.
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton, "Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target".
[7] Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning", AAAI 2018.
[8] Tanmay Agarwal, Master's Thesis, Tech. Report CMU-RI-TR-20-32, August 2020.
[9] Toromanoff, Marin, Emilie Wirbel, and Fabien Moutarde, "End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.