3. Problem Statement
• Train a self-driving agent using RL algorithms in simulation.
• Develop an algorithm that can be run on Argo's driving logs.
• Aim for sample-efficient algorithms.
An agent exploring the CARLA environment
4. Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance through simulation.
• Behavior cloning is capped by the expert's performance, while RL isn't.
• If we can run it on one experience replay, we can run it on any experience replay!
6. Problem Setting
• State space – either an encoded image, waypoints, or manual features.

Example manual features:
Feature        | Value
WP             | 0.4
Obstacle       | 1
Traffic Light  | 0
…              | …

Front View (RGB / SS)
The input image can be either front/top view and RGB/semantic. Waypoints describe the route. Manual features are usually privileged information, or vehicle speed/steer, which are known and given, as sketched below.
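As an illustration, a minimal sketch of how such a manual feature vector might be assembled; every name and value here is a hypothetical placeholder, not the lab's actual interface:

```python
import numpy as np

def build_manual_state(waypoint_angle, obstacle_dist, traffic_light_red,
                       speed, steer):
    """Assemble privileged scalar features into a fixed-size state vector.

    All arguments are hypothetical placeholders for quantities the
    simulator exposes directly (no perception needed).
    """
    return np.array([
        waypoint_angle,           # e.g. 0.4 rad to the next waypoint
        obstacle_dist,            # e.g. 1.0 (normalized distance)
        float(traffic_light_red), # 0 = green/none, 1 = red
        speed,                    # current vehicle speed
        steer,                    # current steering command
    ], dtype=np.float32)
```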
7. Problem Setting
• State space – either an encoded image, waypoints, or manual features.
• Action space – speed and steer (bounded and continuous).
• PID controller – for low-level control (a minimal sketch follows below).
• Test scenario: NoCrash (regular & dense) benchmark – 25 routes to drive along (~15-20k frames).
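The slides don't show the controller itself; below is a minimal textbook PID sketch of the kind typically used for this low-level control step. The gains, interface, and units are assumptions, not the lab's implementation:

```python
class PID:
    """Simple discrete PID controller for low-level speed tracking."""

    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Usage: map the agent's speed action to a throttle command (values hypothetical).
speed_pid = PID(kp=0.5, ki=0.1, kd=0.05)
throttle = speed_pid.step(target=6.0, measured=4.2)  # m/s
```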
9. Decoupling the problem
• In the lab's initial attempts, the image-based agent didn't learn to stop.
• Q: Is the issue in the representation or in the RL?
The old PPO-based image agent
10. Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Which components need to be improved?
12. Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization
Focusing on representation – Imitation Learning
14. Focus on RL : N-step Soft Actor Critic
• Used Soft Actor-Critic (SAC).
• 8-dimensional state space:
  • Nearest obstacle distance and speed (2)
  • Mean angle to next 5 waypoints (1)
  • Distance from red light (1)
  • Vehicle speed/steer, deviation from trajectory, and distance to goal (4)
• Reward (a sketch follows below):
  • Speed-based reward
  • Deviation-from-trajectory penalty
  • Collision penalty (∝ collision speed)
Network Diagram for privileged RL expert
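A minimal sketch of a reward with the shape described above (speed-based reward, deviation penalty, collision penalty proportional to collision speed); the weights and normalization are hypothetical:

```python
def reward(speed, target_speed, deviation, collided, collision_speed,
           w_speed=1.0, w_dev=0.5, w_coll=10.0):
    """Shaped driving reward matching the structure on this slide.

    All weights are illustrative, not the lab's tuned values.
    """
    r = w_speed * min(speed, target_speed) / target_speed  # speed-based reward
    r -= w_dev * deviation                                 # deviation penalty
    if collided:
        r -= w_coll * collision_speed                      # ∝ collision speed
    return r
```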
15. Results : N-step SAC outperforms PPO
• The final agent learned to drive well, with very few collisions and traffic-light violations.
• Such a privileged agent was also trained by other lab members using PPO.
• Although naïve SAC is not as good as PPO, n-step SAC outperforms it in terms of total reward (the n-step target computation is sketched below).
Final trained agent using SAC
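The n-step variant replaces SAC's one-step bootstrap with an n-step return. A minimal sketch of the target computation, assuming the soft state values from the SAC critics have been precomputed; this is an illustration, not the lab's code:

```python
import numpy as np

def n_step_targets(rewards, values, gamma=0.99, n=25):
    """Compute n-step return targets y_t for a length-T trajectory.

    rewards: r_0 .. r_{T-1} (length T)
    values:  soft state values V(s_0) .. V(s_T) (length T+1); in SAC,
             V(s) = min_i Q_i(s, a') - alpha * log pi(a'|s) with a' ~ pi,
             assumed precomputed by the critics here.
    y_t = sum_{k=0}^{h-1} gamma^k r_{t+k} + gamma^h V(s_{t+h}), h = min(n, T-t).
    """
    T = len(rewards)
    targets = np.empty(T)
    for t in range(T):
        h = min(n, T - t)                       # shorter horizon near episode end
        ret = sum((gamma ** k) * rewards[t + k] for k in range(h))
        targets[t] = ret + (gamma ** h) * values[t + h]
    return targets
```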
17. Focus on Vision: Approaches so far
Input images and data from simulator → Pretrained visual model (via behavior cloning (LBC) or an autoencoder)
18. Focus on Vision: Approaches so far
Input images and data from simulator → Pretrained visual model (via behavior cloning (LBC) or an autoencoder) → Policy network → Reward optimization
19. Proposed Method
• Leverage the trained policy network from stage 1 (the privileged agent).

Stage 1: Input images and data from simulator → State space construction → Privileged policy network → Reward optimization
Stage 2: Input images and data from simulator → Visual policy network, trained by behavior cloning from the privileged policy network
20. Behavior Cloning (DAgger + Auxiliary Task)
• Visual policy.
• End-to-end training.
• Helps train the conv layers.
• DAgger: train, deploy, label with the expert, train again (see the sketch below).
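A schematic sketch of the DAgger loop described above; the expert, learner, environment, and training routine are all assumed interfaces:

```python
def dagger(expert, learner, env, train, n_iters=5, episodes_per_iter=10):
    """Schematic DAgger loop: roll out the learner, relabel with the expert.

    expert, learner: callables state -> action (hypothetical interfaces)
    env:             gym-style environment with reset()/step()
    train:           routine that fits the learner on (state, action) pairs
    """
    dataset = []
    for it in range(n_iters):
        for _ in range(episodes_per_iter):
            state, done = env.reset(), False
            while not done:
                action = learner(state)                  # deploy current policy
                dataset.append((state, expert(state)))   # label with the expert
                state, _, done, _ = env.step(action)
        train(learner, dataset)                          # retrain on aggregated data
    return learner
```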
21. Qualitative results : DAgger
Our agent follows traffic lights, with few collisions in dense traffic.
Front-view RGB agent after imitation learning
22. Advantages of Proposed approach
• Fully RL: unlike LBC, no expert demonstrations are required.
• Faster: the time to get a visual policy is less than direct RL on images.
• Practical: privileged information can be engineered using sensors.
  • It can also incorporate traffic laws, e.g. lane speeds and speed limits.
• Transferable: policies trained on engineered features are easy to transfer.
  • Obstacle distance remains the same irrespective of rainy or sunny weather.
23. Disadvantages of Proposed approach
• Problem: the "expert" can sometimes fail, so we can't do behavior cloning all the time.
• Solution: use the expert as the behavior policy and the visual policy as the target policy, and run SAC (sketched below).
Example of failure case of RL expert
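A schematic sketch of this expert-as-behavior-policy setup; all components (replay buffer, SAC update routine, environment interface) are assumed:

```python
def improve_with_sac(expert, visual_policy, env, replay, sac_update,
                     total_steps=100_000):
    """Schematic off-policy improvement: the privileged expert acts in the
    environment (behavior policy) while SAC updates the visual policy
    (target policy). All components are hypothetical interfaces."""
    state = env.reset()
    for step in range(total_steps):
        action = expert(state)                    # behavior policy: the expert
        next_state, reward, done, _ = env.step(action)
        replay.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state
        sac_update(visual_policy, replay)         # target policy: visual agent
```

Because SAC is off-policy, the update is valid even though the transitions come from the expert rather than from the visual policy itself.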
24. Improving using off-policy RL (SAC)
Previously…
• Visual policy.
• End-to-end training.
• Helps train the conv layers.
• Agent follows traffic lights, with few collisions in dense traffic.
25. Improving using off-policy RL (SAC)
• Improves the visual policy further.
• The ResNet is frozen and used as a feature extractor.
• Soft Actor-Critic is used to train the final two layers (a sketch follows below).
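A minimal PyTorch sketch of this arrangement, using a stock torchvision resnet18 as a stand-in for the lab's imitation-pretrained network; the head architecture and action parameterization are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Frozen backbone: the imitation-pretrained network is used purely as a
# feature extractor (a stock resnet18 stands in for the lab's model here).
backbone = resnet18()
backbone.fc = nn.Identity()           # expose the 512-d feature vector
for p in backbone.parameters():
    p.requires_grad_(False)
backbone.eval()

# Trainable head: only these final layers are updated by SAC.
head = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 2 * 2),            # mean and log-std for (speed, steer)
)

def act(image):
    """image: (B, 3, H, W) tensor -> sampled bounded (speed, steer) action."""
    with torch.no_grad():             # no gradients flow into the backbone
        feats = backbone(image)
    mean, log_std = head(feats).chunk(2, dim=-1)
    action = mean + log_std.exp() * torch.randn_like(mean)
    return torch.tanh(action)         # squash to bounded continuous actions
```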
26. Improving using off-policy RL (SAC)
The training procedure
Smoother and better driving after SAC training
27. Results : Comparison against baselines
Algorithm                              | Expert | Oracle information | No crash (Regular) | No crash (Dense) | Training time (millions of interactions)
Learning by Cheating (LBC) [1]         |        |                    | 94                 | 51               | 0.174
Implicit Affordances (IA) [9]          |        |                    | 87                 | 42               | 10-20
Auton Lab Old Agent (PPO) [8]          |        |                    | 96                 | 89               | 16
Ours (DAgger)                          |        |                    | 88                 | 60               | 0.070
Ours (DAgger + Aux Task) (train only)  |        |                    | 92                 | 70               | 0.085
Ours (DAgger + SAC, n=25)              |        |                    | 94                 | 84               | 1

No-crash scores are the % of successful episodes out of 25 on the testing town.
28. Summary
• Learning from raw image pixels using RL is hard.
• Learning from generic pretrained networks using RL is hard (this works for behavior cloning but not for RL).
• Learning from task-specific (reward-specific) pretrained networks is highly recommended (a visual policy vs. a "policy visual"?).
• It is possible to take humans/autopilots out of the loop entirely and still learn a good driving behavior.
31. References
[1] Chen et al., "Learning by Cheating", CoRL 2019.
[2] Prof. Jeff Schneider's RI Seminar Talk.
[3] Liang, Xiaodan, et al., "CIRL: Controllable Imitative Reinforcement Learning for Vision-Based Self-Driving", ECCV 2018.
[4] Kendall, Alex, et al., "Learning to Drive in a Day", ICRA, IEEE, 2019.
[5] Agarwal et al., "Learning to Drive using Waypoints", NeurIPS 2019 Workshop – ML4AD.
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton, "Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target".
[7] Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning", AAAI 2018.
[8] Tanmay Agarwal, Master's Thesis, Tech. Report CMU-RI-TR-20-32, August 2020.
[9] Toromanoff, Marin, Emilie Wirbel, and Fabien Moutarde, "End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.