4. Problem Statement
• Train a self-driving agent using RL algorithms in simulation.
• Develop an algorithm that can be run on Argo’s driving logs.
• Aim for sample-efficient algorithms.
An agent exploring the CARLA environment
5. Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance through simulation.
• Behavior cloning is capped by the expert’s performance, while RL isn’t.
If we can run it on one video log, we can run it on any video log!
7. Problem Setting
• State space – either an encoded image, waypoints, or a manual feature vector, e.g.:
Feature          Value
WP               0.4
Obstacle         1
Traffic Light    0
…                …
8. Problem Setting
• State space – either an encoded image, waypoints, or a manual feature vector
• Action space – Speed and Steer (bounded and continuous)
• PID Controller – For low-level control (see the sketch below)
• Test Scenario – Navigation with dynamic actors
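A minimal sketch of how the bounded continuous (speed, steer) action could map to vehicle control through a PID loop. The gains, the maximum speed, and the function names here are illustrative assumptions, not the project's actual values:

```python
# Sketch: agent outputs (target_speed, steer) squashed to [-1, 1];
# a PID loop converts the speed target into throttle/brake commands.
import numpy as np

class PID:
    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def apply_action(action, current_speed, pid):
    """action = (target_speed, steer), both in [-1, 1]."""
    target_speed = (action[0] + 1.0) / 2.0 * 10.0  # assumed 10 m/s max speed
    control = pid.step(target_speed - current_speed)
    throttle = float(np.clip(control, 0.0, 1.0))   # positive control -> throttle
    brake = float(np.clip(-control, 0.0, 1.0))     # negative control -> brake
    steer = float(np.clip(action[1], -1.0, 1.0))
    return throttle, steer, brake

pid = PID(kp=1.0, ki=0.1, kd=0.05)
print(apply_action((0.5, -0.2), current_speed=3.0, pid=pid))
```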
10. Decoupling the problem
Pipeline: Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Which components need to be improved?
12. Previous Semester
• Used Soft Actor-Critic (SAC)
• 8-dimensional state space
• Mean angle to next 5 waypoints
• Nearest obstacle distance and speed
• Vehicle speed and steer
• Distance from trajectory, goal and red light
• Reward (see the sketch below)
• Speed-based reward
• Distance to trajectory
• Collision reward
Final trained agent using SAC
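A hedged sketch of the reward terms listed above. The coefficients and exact shaping are assumptions for illustration, not the values used in the experiments:

```python
# Sketch of the three reward terms: encourage the target speed,
# penalize drift from the reference trajectory, penalize collisions.
def compute_reward(speed, target_speed, dist_to_trajectory, collided):
    r_speed = 1.0 - abs(speed - target_speed) / max(target_speed, 1e-6)
    r_dist = -0.5 * dist_to_trajectory          # assumed coefficient
    r_collision = -100.0 if collided else 0.0   # assumed penalty magnitude
    return r_speed + r_dist + r_collision

print(compute_reward(speed=4.0, target_speed=5.0,
                     dist_to_trajectory=0.3, collided=False))
```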
13. N-step SAC outperforms PPO
• Naïve SAC is not as good as PPO
• N-step SAC performs slightly better than PPO (n-step target sketched below)
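For reference, a sketch of the n-step target that "N-step SAC" implies: the bootstrap value is pushed n steps out, which propagates reward faster at the cost of some off-policy bias. Names and hyperparameter values are illustrative:

```python
# n-step SAC target:
#   y_t = sum_{k=0}^{n-1} gamma^k * r_{t+k}
#         + gamma^n * (min_Q(s_{t+n}, a_{t+n}) - alpha * log pi(a_{t+n}|s_{t+n}))
def n_step_target(rewards, bootstrap_q, log_prob, gamma=0.99, alpha=0.2):
    """rewards: the n rewards r_t ... r_{t+n-1};
    bootstrap_q: min of the target Q-values at s_{t+n};
    log_prob: log pi(a_{t+n} | s_{t+n})."""
    n = len(rewards)
    discounted = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted + gamma**n * (bootstrap_q - alpha * log_prob)

print(n_step_target([1.0, 0.8, 0.5], bootstrap_q=12.0, log_prob=-1.3))
```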
14. Decoupling the problem
Pipeline: Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Focusing on representation – Imitation learning
16. Learning By Cheating – Pretrained Model
• Move from a manual state space to an image state space + waypoints
• We want to leverage an imitation-learning-based pretrained model to accelerate training.
• Essentially treating LBC as the “ImageNet model” for this task (see the loading sketch below).
Channel visualization from the output of the conv layers of LBC’s ResNet-34 [1], trained to drive
[1] Chen et al.,“Learning by cheating”, Conference on Robot Learning, 2019
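A minimal sketch of the “ImageNet model” idea. The checkpoint path is hypothetical and this assumes the LBC backbone matches torchvision's ResNet-34 layout; the actual LBC release defines the real keys:

```python
# Reuse LBC's driving-trained ResNet-34 as a frozen feature extractor,
# analogous to reusing an ImageNet-pretrained backbone.
import torch
import torchvision

backbone = torchvision.models.resnet34()
# "lbc_resnet34.pth" is a placeholder name, not the actual release file.
checkpoint = torch.load("lbc_resnet34.pth", map_location="cpu")
backbone.load_state_dict(checkpoint, strict=False)

features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool+fc
for p in features.parameters():
    p.requires_grad = False  # frozen: only the policy head is trained

x = torch.randn(1, 3, 160, 384)  # 384 x 160 RGB input (N, C, H, W)
print(features(x).shape)         # -> torch.Size([1, 512, 5, 12])
```

Freezing the backbone means only the policy head trains, which is what the following slides experiment with.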
17. Experiments with Policy Networks
• Instead of a 2-layer MLP to get actions as in previous experiments, we have the following network for the policy and value function.
• The initial channel mixing helps keep the number of parameters small.
• Result: passes only 1 of 25 test cases (straight navigation).
Architecture (figure): 384 x 160 RGB → ResNet-34 (pretrained) → bs x 512 x 5 x 12 → Conv1x1 + ReLU (bs x 64 x 5 x 12) → Conv3x3 + ReLU (bs x 32 x 3 x 10) → Conv3x3 + ReLU (bs x 16 x 1 x 8) → flatten + concat(speed, steer, wp) → FC layers → means and variances for actions
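A hedged PyTorch sketch of this policy head. The FC width and the size of the concatenated (speed, steer, wp) vector are assumptions; the conv shapes follow the annotated figure:

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    def __init__(self, extra_dim=3, action_dim=2, hidden=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(512, 64, kernel_size=1), nn.ReLU(),  # bs x 64 x 5 x 12
            nn.Conv2d(64, 32, kernel_size=3), nn.ReLU(),   # bs x 32 x 3 x 10
            nn.Conv2d(32, 16, kernel_size=3), nn.ReLU(),   # bs x 16 x 1 x 8
        )
        self.fc = nn.Sequential(
            nn.Linear(16 * 1 * 8 + extra_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),  # means and log-variances
        )

    def forward(self, feats, extra):
        h = self.convs(feats).flatten(1)              # bs x 128
        out = self.fc(torch.cat([h, extra], dim=1))
        mean, log_var = out.chunk(2, dim=1)
        return mean, log_var

head = PolicyHead()
mean, log_var = head(torch.randn(4, 512, 5, 12), torch.randn(4, 3))
```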
18. Experiments with Policy Networks
• Spatial Softmax to reduce dimensions (sketched below).
• The idea was that RL performs best in low-dimensional state spaces.
• Result: passes only 1 of 25 test cases (~20,000 reward).
• We will revisit this once we have end-to-end training instead of frozen nets.
Architecture (figure): 384 x 160 RGB → ResNet-34 (pretrained) → bs x 512 x 5 x 12 → Spatial Softmax (bs x 1024) → ReLU + concat(speed, steer, wp) → FC layers → means and variances for actions
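A sketch of the spatial softmax used above: each channel's activation map becomes a softmax attention map, and its expected (x, y) coordinate is the output, so 512 channels yield 512 x 2 = 1024 features regardless of spatial resolution:

```python
import torch
import torch.nn.functional as F

def spatial_softmax(feats):
    """feats: bs x C x H x W -> bs x 2C expected (x, y) coordinates."""
    bs, c, h, w = feats.shape
    attn = F.softmax(feats.view(bs, c, h * w), dim=-1).view(bs, c, h, w)
    ys = torch.linspace(-1, 1, h, device=feats.device)
    xs = torch.linspace(-1, 1, w, device=feats.device)
    ex = (attn.sum(dim=2) * xs).sum(dim=-1)  # expected x per channel
    ey = (attn.sum(dim=3) * ys).sum(dim=-1)  # expected y per channel
    return torch.cat([ex, ey], dim=1)        # bs x 2C

print(spatial_softmax(torch.randn(4, 512, 5, 12)).shape)  # torch.Size([4, 1024])
```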
19. Experiments with Policy Networks
• Regular conv layers into FC layers.
• Result: passes only 3 of 25 test cases.
• ~10,000 reward higher; not much.
• Several other architectures were also tried, yielding similar results.
Architecture (figure): 384 x 160 RGB → ResNet-34 (pretrained) → bs x 512 x 5 x 12 → Conv3x3 + ReLU (bs x 64 x 3 x 10) → Conv3x3 + ReLU (bs x 8 x 1 x 8) → flatten + concat(speed, steer, wp) → FC layers → means and variances for actions
20. Approaches so far
Diagram: Input images and data from simulator → Pretrained visual model (via Behavior cloning (LBC) or Auto-encoder) → Policy network → Reward optimization
21. Proposed Approach
• Leverage the trained policy network (privileged agent)
Diagram: Input images and data from simulator → State space construction → Policy network → Reward optimization (privileged agent); Input images and data from simulator → Visual policy network, trained via behavior cloning from the privileged policy network
24. Initial experiment settings
• Semantically segmented images
• Removed manual obstacle information
• Simple conv network to ensure a fair comparison
• The expert can be trained using off-policy or on-policy RL
• Used an expert trained with PPO for easy comparison
25. Comparison of RL+BC against pure RL
• The “expert” RL agent used in the previous slides is shown in red. It is the best-performing and fastest to train.
• The BC agent is comparable to the yellow curve (I).
• Our BEV agent trains much faster and achieves a much higher success rate of 96% (24/25).
• Our front-view agent heavily outperforms the pre-existing one, with 92% success against 60-70%.
[5] T. Agarwal, “On-Policy Reinforcement Learning for Learning to Drive in Urban Settings”, Master’s Thesis, Tech. Report CMU-RI-TR-20-32, August 2020
Performance of RL using 8-dim (A), image + 8-dim (A+I), and image + 6-dim (I) networks from a former lab member’s work [5].
27. Advantages of Proposed approach
• Unlike LBC, there is no requirement for an expert (the privileged agent is trained fully by RL)
• Training of the initial visual policy is very fast
• The visual policy can be fine-tuned using RL
• The privileged information can be easily obtained using sensors
• Easy to train
• Can incorporate priors
28. Advantages of Proposed approach
• Easy to transfer the policies obtained from the manual state space
• But this can sometimes fail:
• So we can’t rely on behavior cloning all the time
• Instead, use the privileged policy as a behavior policy, with the visual policy as the target, and do RL (see the sketch below)
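A minimal sketch of that split, with illustrative interfaces (the env, policy, and buffer APIs here are assumptions): the privileged policy generates experience, and the visual policy is updated off-policy from it:

```python
# Sketch: privileged (behavior) policy fills the replay buffer;
# the visual (target) policy trains off-policy from those transitions.
def collect(env, privileged_policy, buffer, steps):
    obs = env.reset()
    for _ in range(steps):
        action = privileged_policy.act(obs["manual_state"])  # behavior policy
        next_obs, reward, done, _ = env.step(action)
        buffer.add(obs["image"], action, reward, next_obs["image"], done)
        obs = env.reset() if done else next_obs

def train(visual_policy, buffer, updates, batch_size=256):
    for _ in range(updates):
        batch = buffer.sample(batch_size)
        visual_policy.update(batch)  # e.g., a SAC-style off-policy update
```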
29. Improvements of Proposed approach
• Naïve behavior cloning can miss important tail-of-distribution events:
• Traffic-light scenarios, which are relatively rare in a video
• Heuristics can be applied to handle those
• E.g., a custom prioritized experience replay (sketched below)
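One possible heuristic, sketched with assumed names: upweight transitions flagged as near a traffic light so sampling sees the tail of the distribution more often:

```python
import random

class HeuristicPrioritizedReplay:
    """Toy replay buffer that oversamples rare traffic-light transitions."""
    def __init__(self, rare_weight=10.0):
        self.data, self.weights = [], []
        self.rare_weight = rare_weight

    def add(self, transition, near_traffic_light):
        self.data.append(transition)
        self.weights.append(self.rare_weight if near_traffic_light else 1.0)

    def sample(self, batch_size):
        # weighted sampling with replacement, biased toward rare events
        return random.choices(self.data, weights=self.weights, k=batch_size)

buf = HeuristicPrioritizedReplay()
buf.add(("obs", "act", 1.0), near_traffic_light=True)
buf.add(("obs", "act", 0.0), near_traffic_light=False)
print(buf.sample(batch_size=2))
```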
30. Future Work
• Train on RGB images
• Remove the traffic light information from manual state space
• Test under different weather conditions
• Finetune Visual Policy using CURL [8] (RL + Auxiliary Task)
• Dense traffic scenarios
• Lane change scenarios
[8] M. Laskin, A. Srinivas, P. Abbeel, “CURL: Contrastive Unsupervised Representations for Reinforcement Learning”, ICML, July 2020
31. References
[1] Chen et al., “Learning by Cheating”, CoRL, 2019.
[2] Prof. Jeff Schneider’s RI Seminar Talk.
[3] Liang, Xiaodan, et al., “CIRL: Controllable Imitative Reinforcement Learning for Vision-Based Self-Driving”, ECCV, 2018.
[4] Kendall, Alex, et al., “Learning to Drive in a Day”, ICRA, IEEE, 2019.
[5] Agarwal et al., “Learning to Drive using Waypoints”, NeurIPS 2019 Workshop on ML4AD.
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton, “Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target”.
[7] Hessel et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning”, AAAI, 2018.
[8] Laskin, M., Srinivas, A., Abbeel, P., “CURL: Contrastive Unsupervised Representations for Reinforcement Learning”, ICML, 2020.