4. Problem Statement
• Train a self-driving agent using RL algorithms in simulation.
• Develop an algorithm that can be run on Argo’s driving logs.
• Aim for sample-efficient algorithms.
An agent exploring the CARLA environment
5. Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance through simulation.
• Behavior cloning is capped by the expert’s performance, while RL isn’t.
If we can run it on one video log, we can run it on any video log!
7. Problem Setting
• State space – either an encoded image, waypoints, or a manual feature vector, e.g.:
Feature          Value
WP               0.4
Obstacle         1
Traffic Light    0
…                …
8. Problem Setting
• State space – either an encoded image, waypoints, or a manual feature vector
• Action space – Speed and Steer (bounded and continuous)
• PID Controller – For low-level control (see the sketch below)
• Test Scenario – Navigation with dynamic actors
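A minimal sketch of how the bounded continuous (speed, steer) action could map to vehicle control through a PID loop. The gains, the maximum speed, and the function names here are illustrative assumptions, not the project's actual values:

```python
# Sketch: agent outputs (target_speed, steer) squashed to [-1, 1];
# a PID loop converts the speed target into throttle/brake commands.
import numpy as np

class PID:
    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def apply_action(action, current_speed, pid):
    """action = (target_speed, steer), both in [-1, 1]."""
    target_speed = (action[0] + 1.0) / 2.0 * 10.0  # assumed 10 m/s max speed
    control = pid.step(target_speed - current_speed)
    throttle = float(np.clip(control, 0.0, 1.0))   # positive control -> throttle
    brake = float(np.clip(-control, 0.0, 1.0))     # negative control -> brake
    steer = float(np.clip(action[1], -1.0, 1.0))
    return throttle, steer, brake

pid = PID(kp=1.0, ki=0.1, kd=0.05)
print(apply_action((0.5, -0.2), current_speed=3.0, pid=pid))
```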
10. Decoupling the problem
Pipeline: Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Which components need to be improved?
12. Previous Semester
• Used Soft Actor-Critic (SAC)
• 8-dimensional state space
• Mean angle to next 5 waypoints
• Nearest obstacle distance and speed
• Vehicle speed and steer
• Distance from trajectory, goal and red light
• Reward (see the sketch below)
• Speed-based reward
• Distance to trajectory
• Collision reward
Final trained agent using SAC
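A hedged sketch of the reward terms listed above. The coefficients and exact shaping are assumptions for illustration, not the values used in the experiments:

```python
# Sketch of the three reward terms: encourage the target speed,
# penalize drift from the reference trajectory, penalize collisions.
def compute_reward(speed, target_speed, dist_to_trajectory, collided):
    r_speed = 1.0 - abs(speed - target_speed) / max(target_speed, 1e-6)
    r_dist = -0.5 * dist_to_trajectory          # assumed coefficient
    r_collision = -100.0 if collided else 0.0   # assumed penalty magnitude
    return r_speed + r_dist + r_collision

print(compute_reward(speed=4.0, target_speed=5.0,
                     dist_to_trajectory=0.3, collided=False))
```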
13. N-step SAC outperforms PPO
• Naïve SAC is not as good as PPO
• N-step SAC performs slightly better than PPO (n-step target sketched below)
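For reference, a sketch of the n-step target that "N-step SAC" implies: the bootstrap value is pushed n steps out, which propagates reward faster at the cost of some off-policy bias. Names and hyperparameter values are illustrative:

```python
# n-step SAC target:
#   y_t = sum_{k=0}^{n-1} gamma^k * r_{t+k}
#         + gamma^n * (min_Q(s_{t+n}, a_{t+n}) - alpha * log pi(a_{t+n}|s_{t+n}))
def n_step_target(rewards, bootstrap_q, log_prob, gamma=0.99, alpha=0.2):
    """rewards: the n rewards r_t ... r_{t+n-1};
    bootstrap_q: min of the target Q-values at s_{t+n};
    log_prob: log pi(a_{t+n} | s_{t+n})."""
    n = len(rewards)
    discounted = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted + gamma**n * (bootstrap_q - alpha * log_prob)

print(n_step_target([1.0, 0.8, 0.5], bootstrap_q=12.0, log_prob=-1.3))
```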
14. Decoupling the problem
Pipeline: Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Focusing on representation – Imitation learning
16. Learning By Cheating – Pretrained Model
• Move from a manual state space to an image state space + waypoints
• We want to leverage an imitation-learning-based pretrained model to accelerate training.
• Essentially treating LBC as the “ImageNet model” for this task (see the loading sketch below).
Channel visualization from the output of the conv layers of LBC’s ResNet-34 [1], trained to drive
[1] Chen et al.,“Learning by cheating”, Conference on Robot Learning, 2019
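A minimal sketch of the “ImageNet model” idea. The checkpoint path is hypothetical and this assumes the LBC backbone matches torchvision's ResNet-34 layout; the actual LBC release defines the real keys:

```python
# Reuse LBC's driving-trained ResNet-34 as a frozen feature extractor,
# analogous to reusing an ImageNet-pretrained backbone.
import torch
import torchvision

backbone = torchvision.models.resnet34()
# "lbc_resnet34.pth" is a placeholder name, not the actual release file.
checkpoint = torch.load("lbc_resnet34.pth", map_location="cpu")
backbone.load_state_dict(checkpoint, strict=False)

features = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool+fc
for p in features.parameters():
    p.requires_grad = False  # frozen: only the policy head is trained

x = torch.randn(1, 3, 160, 384)  # 384 x 160 RGB input (N, C, H, W)
print(features(x).shape)         # -> torch.Size([1, 512, 5, 12])
```

Freezing the backbone means only the policy head trains, which is what the following slides experiment with.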
17. Experiments with Policy Networks
• Instead of a 2-layer MLP to get actions as in previous experiments, we have the following network for the policy and value function.
• The initial channel mixing helps keep the number of parameters small.
• Result: passes only 1 of 25 test cases (straight navigation).
Architecture (figure): 384 x 160 RGB → ResNet-34 (pretrained) → bs x 512 x 5 x 12 → Conv1x1 + ReLU (bs x 64 x 5 x 12) → Conv3x3 + ReLU (bs x 32 x 3 x 10) → Conv3x3 + ReLU (bs x 16 x 1 x 8) → flatten + concat(speed, steer, wp) → FC layers → means and variances for actions
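A hedged PyTorch sketch of this policy head. The FC width and the size of the concatenated (speed, steer, wp) vector are assumptions; the conv shapes follow the annotated figure:

```python
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    def __init__(self, extra_dim=3, action_dim=2, hidden=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(512, 64, kernel_size=1), nn.ReLU(),  # bs x 64 x 5 x 12
            nn.Conv2d(64, 32, kernel_size=3), nn.ReLU(),   # bs x 32 x 3 x 10
            nn.Conv2d(32, 16, kernel_size=3), nn.ReLU(),   # bs x 16 x 1 x 8
        )
        self.fc = nn.Sequential(
            nn.Linear(16 * 1 * 8 + extra_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),  # means and log-variances
        )

    def forward(self, feats, extra):
        h = self.convs(feats).flatten(1)              # bs x 128
        out = self.fc(torch.cat([h, extra], dim=1))
        mean, log_var = out.chunk(2, dim=1)
        return mean, log_var

head = PolicyHead()
mean, log_var = head(torch.randn(4, 512, 5, 12), torch.randn(4, 3))
```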
18. Experiments with Policy Networks
• Spatial Softmax to reduce dimensions (sketched below).
• The idea was that RL performs best in low-dimensional state spaces.
• Result: passes only 1 of 25 test cases (~20,000 reward).
• We will revisit this once we have end-to-end training instead of frozen nets.
Architecture (figure): 384 x 160 RGB → ResNet-34 (pretrained) → bs x 512 x 5 x 12 → Spatial Softmax (bs x 1024) → ReLU + concat(speed, steer, wp) → FC layers → means and variances for actions
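A sketch of the spatial softmax used above: each channel's activation map becomes a softmax attention map, and its expected (x, y) coordinate is the output, so 512 channels yield 512 x 2 = 1024 features regardless of spatial resolution:

```python
import torch
import torch.nn.functional as F

def spatial_softmax(feats):
    """feats: bs x C x H x W -> bs x 2C expected (x, y) coordinates."""
    bs, c, h, w = feats.shape
    attn = F.softmax(feats.view(bs, c, h * w), dim=-1).view(bs, c, h, w)
    ys = torch.linspace(-1, 1, h, device=feats.device)
    xs = torch.linspace(-1, 1, w, device=feats.device)
    ex = (attn.sum(dim=2) * xs).sum(dim=-1)  # expected x per channel
    ey = (attn.sum(dim=3) * ys).sum(dim=-1)  # expected y per channel
    return torch.cat([ex, ey], dim=1)        # bs x 2C

print(spatial_softmax(torch.randn(4, 512, 5, 12)).shape)  # torch.Size([4, 1024])
```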
19. Experiments with Policy Networks
• Regular conv layers into FC layers.
• Result: passes only 3 of 25 test cases.
• ~10,000 reward higher; not much.
• Several other architectures were also tried, yielding similar results.
Architecture (figure): 384 x 160 RGB → ResNet-34 (pretrained) → bs x 512 x 5 x 12 → Conv3x3 + ReLU (bs x 64 x 3 x 10) → Conv3x3 + ReLU (bs x 8 x 1 x 8) → flatten + concat(speed, steer, wp) → FC layers → means and variances for actions
20. Approaches so far
Diagram: Input images and data from simulator → Pretrained visual model (via Behavior cloning (LBC) or Auto-encoder) → Policy network → Reward optimization
21. Proposed Approach
• Leverage the trained policy network (privileged agent)
Diagram: Input images and data from simulator → State space construction → Policy network → Reward optimization (privileged agent); Input images and data from simulator → Visual policy network, trained via behavior cloning from the privileged policy network
24. Initial experiment settings
• Semantically segmented images
• Removed manual obstacle information
• Simple conv network to ensure a fair comparison
• The expert can be trained using off-policy or on-policy RL
• Used an expert trained with PPO for easy comparison
25. Comparison of RL+BC against pure RL
• The “expert” RL agent used in the previous slides is shown in red. It is the best-performing and fastest to train.
• The BC agent is comparable to the yellow curve (I).
• Our BEV agent trains much faster and achieves a much higher success rate of 96% (24/25).
• Our front-view agent heavily outperforms the pre-existing one, with 92% success against 60-70%.
[5] T. Agarwal, “On-Policy Reinforcement Learning for Learning to Drive in Urban Settings”, Master’s Thesis, Tech. Report CMU-RI-TR-20-32, August 2020
Performance of RL using 8-dim (A), image + 8-dim (A+I), and image + 6-dim (I) networks from a former lab member’s work [5].
27. Advantages of Proposed approach
• Unlike LBC, there is no requirement for an expert (the privileged agent is trained fully by RL)
• Training of the initial visual policy is very fast
• The visual policy can be fine-tuned using RL
• The privileged information can be easily obtained using sensors
• Easy to train
• Can incorporate priors
28. Advantages of Proposed approach
• Easy to transfer the policies obtained from the manual state space
• But this can sometimes fail:
• So we can’t rely on behavior cloning all the time
• Instead, use the privileged policy as a behavior policy, with the visual policy as the target, and do RL (see the sketch below)
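A minimal sketch of that split, with illustrative interfaces (the env, policy, and buffer APIs here are assumptions): the privileged policy generates experience, and the visual policy is updated off-policy from it:

```python
# Sketch: privileged (behavior) policy fills the replay buffer;
# the visual (target) policy trains off-policy from those transitions.
def collect(env, privileged_policy, buffer, steps):
    obs = env.reset()
    for _ in range(steps):
        action = privileged_policy.act(obs["manual_state"])  # behavior policy
        next_obs, reward, done, _ = env.step(action)
        buffer.add(obs["image"], action, reward, next_obs["image"], done)
        obs = env.reset() if done else next_obs

def train(visual_policy, buffer, updates, batch_size=256):
    for _ in range(updates):
        batch = buffer.sample(batch_size)
        visual_policy.update(batch)  # e.g., a SAC-style off-policy update
```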
29. Improvements of Proposed approach
• Naïve behavior cloning can miss important tail-of-distribution events:
• Traffic-light scenarios, which are relatively rare in a video
• Heuristics can be applied to handle those
• E.g., a custom prioritized experience replay (sketched below)
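One possible heuristic, sketched with assumed names: upweight transitions flagged as near a traffic light so sampling sees the tail of the distribution more often:

```python
import random

class HeuristicPrioritizedReplay:
    """Toy replay buffer that oversamples rare traffic-light transitions."""
    def __init__(self, rare_weight=10.0):
        self.data, self.weights = [], []
        self.rare_weight = rare_weight

    def add(self, transition, near_traffic_light):
        self.data.append(transition)
        self.weights.append(self.rare_weight if near_traffic_light else 1.0)

    def sample(self, batch_size):
        # weighted sampling with replacement, biased toward rare events
        return random.choices(self.data, weights=self.weights, k=batch_size)

buf = HeuristicPrioritizedReplay()
buf.add(("obs", "act", 1.0), near_traffic_light=True)
buf.add(("obs", "act", 0.0), near_traffic_light=False)
print(buf.sample(batch_size=2))
```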
30. Future Work
• Train on RGB images
• Remove the traffic light information from manual state space
• Test under different weather conditions
• Finetune Visual Policy using CURL [8] (RL + Auxiliary Task)
• Dense traffic scenarios
• Lane change scenarios
[8] M. Laskin, A. Srinivas, P. Abbeel, “CURL: Contrastive Unsupervised Representations for Reinforcement Learning”, ICML, July 2020
31. References
[1] Chen et al., “Learning by Cheating”, CoRL, 2019.
[2] Prof. Jeff Schneider’s RI Seminar Talk.
[3] Liang, Xiaodan, et al., “CIRL: Controllable Imitative Reinforcement Learning for Vision-Based Self-Driving”, ECCV, 2018.
[4] Kendall, Alex, et al., “Learning to Drive in a Day”, ICRA, IEEE, 2019.
[5] Agarwal et al., “Learning to Drive using Waypoints”, NeurIPS 2019 Workshop on ML4AD.
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton, “Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target”.
[7] Hessel et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning”, AAAI, 2018.
[8] Laskin, M., Srinivas, A., Abbeel, P., “CURL: Contrastive Unsupervised Representations for Reinforcement Learning”, ICML, 2020.