Reinforcement Learning
for Self-Driving Cars
Vinay Sameer Kadi and Mayank Gupta, with Prof. Jeff Schneider
Sponsored by Argo AI 1
Introduction
Setting up the problem
2
Problem Statement
• Train a self-driving agent using RL algorithms in simulation.
• Develop an algorithm that can be run on Argo's driving logs.
• Aim for sample-efficient algorithms.
An agent exploring the CARLA environment
3
Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance
through simulation.
• Behavior cloning is capped by the expert's performance, while RL isn't.
If we can run it on one experience replay, we can run it on any experience replay!
Problem Setting
A short description of the set up
5
Problem Setting
• State space – Either an encoded image, waypoints, or manual features
Example manual features: WP = 0.4, Obstacle = 1, Traffic Light = 0, …
Front View (RGB / SS)
The input image can be either a front or top view, in RGB or semantic segmentation.
Waypoints describe the route.
Manual features are usually privileged info, or vehicle speed/steer, which are known and given.
6
Problem Setting
• State space – Either encoded image, waypoints or manual
• Action space – Speed and Steer (bounded and continuous)
• PID Controller – For low-level control (see the sketch below)
• Test Scenario: No Crash (regular & dense) benchmark – 25 routes to drive along (~15-20k frames)
7
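The PID controller is only mentioned at a high level on this slide; below is a minimal, illustrative sketch (not the project's actual controller) of a longitudinal PID loop that turns the policy's target speed into a throttle/brake command. The gains and control period are hypothetical values.

```python
# Minimal longitudinal PID sketch (illustrative only; not the project's controller).
# The policy outputs a target speed; the PID loop converts the speed error into
# a throttle (positive) or brake (negative) command at each control step.

class SpeedPID:
    def __init__(self, kp=0.5, ki=0.05, kd=0.1, dt=0.05):
        # Gains and control period (dt) are hypothetical, for illustration only.
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target_speed, current_speed):
        error = target_speed - current_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp to [-1, 1]: positive -> throttle, negative -> brake.
        return max(-1.0, min(1.0, u))

# Usage: each simulator tick, feed the RL action (target speed) and measured speed.
pid = SpeedPID()
throttle_or_brake = pid.step(target_speed=8.0, current_speed=5.2)
```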
Experiments
8
Decoupling the problem
• In initial attempts by the lab, the image-based agent didn't learn to stop.
• Q: Issue in representation or RL?
9
The old PPO-based image agent
Decoupling the problem
Input Images
and data
from
simulator
State Space
construction
RL
algorithm
Reward Optimization
Which components need to be improved?
10
Decoupling the problem
Policy
Network
Reward Optimization
Input Images
and data
from
simulator
State Space
construction
Focusing solely on RL – Handcrafted Input
11
Decoupling the problem
State Space
construction
RL
algorithm
Reward Optimization
Input Images
and data
from
simulator
Focusing on representation – Imitation Learning
12
Decoupling the problem
Pretrained
model
Policy
Network
Reward Optimization
Input Images
and data
from
simulator
13
Finally: Combining progress in both
Focus on RL : N-step Soft Actor Critic
• Used Soft Actor Critic (SAC)
• 8-dimensional state space
• Nearest obstacle distance and speed (2)
• Mean angle to next 5 waypoints (1)
• Distance from red light (1)
• Vehicle speed/steer, deviation from trajectory, and distance to goal (4)
• Reward (structure sketched below)
• Speed-based reward
• Deviation-from-trajectory penalty
• Collision penalty (∝ collision speed)
14
Network Diagram for privileged RL expert
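The slide lists only the structure of the reward; the weights and exact shaping are not given. The following sketch just mirrors the stated structure (speed reward, deviation penalty, collision penalty proportional to collision speed) with hypothetical coefficients.

```python
# Illustrative reward sketch matching the bullets above; the weights are
# hypothetical, not the values used in the project.

def reward(speed, deviation_m, collided, collision_speed,
           w_speed=0.1, w_dev=0.5, w_col=2.0):
    r = w_speed * speed               # encourage making progress
    r -= w_dev * deviation_m          # penalize lateral deviation from the route
    if collided:
        r -= w_col * collision_speed  # penalty proportional to collision speed
    return r

print(reward(speed=6.0, deviation_m=0.3, collided=False, collision_speed=0.0))
```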
Results : N-step SAC outperforms PPO
• The final agent learned to drive well with very
few collisions and traffic light violations
• Such a privileged agent was also trained by
other lab members using PPO
• However, even though naïve SAC is not as good as PPO, n-step SAC outperforms it in terms of total reward (the n-step target is sketched below).
15
Final trained agent using SAC
Results : N-step SAC outperforms PPO
16
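The slides do not spell out the n-step variant; a common formulation of an n-step bootstrapped target for the SAC critic is sketched below, and it is an assumption that the agent here uses something of this form.

```python
# Sketch of an n-step bootstrapped SAC critic target (a likely form of
# "n-step SAC"; the exact formulation used in this work is an assumption).
# target = sum_{k=0}^{n-1} gamma^k * r_{t+k}
#          + gamma^n * (min_i Q_i(s_{t+n}, a') - alpha * log pi(a'|s_{t+n})),
# with a' sampled from the current policy at s_{t+n}.

def n_step_target(rewards, bootstrap_q, bootstrap_logp, gamma=0.99, alpha=0.2):
    """rewards: the n rewards r_t..r_{t+n-1};
    bootstrap_q: min over target critics at s_{t+n};
    bootstrap_logp: log pi(a'|s_{t+n}) for the sampled action a'."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    g += (gamma ** len(rewards)) * (bootstrap_q - alpha * bootstrap_logp)
    return g

print(n_step_target([0.5, 0.4, -0.1], bootstrap_q=3.2, bootstrap_logp=-1.1))
```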
Focus on Vision: Approaches so far
Pretrained
visual model
Input Images
and data
from
simulator
Behavior
cloning
(LBC)
Auto
Encoder
Focus on Vision: Approaches so far
Pretrained
visual model
Policy
Network
Reward Optimization
Input Images
and data
from
simulator
Behavior
cloning
(LBC)
Auto
Encoder
Proposed Method
• Leverage the trained policy network from stage 1 (privileged agent)
19
Privileged
Policy
Network
Reward Optimization
Visual Policy
Network
Input Images
and data
from
simulator
Behavior Cloning
State Space
construction
Input Images
and data
from
simulator
Behavior Cloning (DAgger + Auxiliary Task)
• Visual policy
• End-to-end training
• Helps train the conv layers.
• DAgger: train, deploy, label with expert, train again (loop sketched below).
20
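A schematic version of the "train, deploy, label with expert, train again" loop used to distill the privileged agent into the visual policy. The interfaces (env, expert, visual_policy) are placeholders, not the project's code, and the auxiliary task named in the slide title is omitted for brevity.

```python
# Schematic DAgger loop (placeholder interfaces, for illustration only).

def dagger(visual_policy, expert, env, n_iters=5, episodes_per_iter=10):
    """Placeholder interfaces: env.reset() -> obs, env.step(a) -> (obs, done);
    obs carries both the camera image and the privileged state the expert needs."""
    dataset = []  # aggregated (image, expert_action) pairs
    for it in range(n_iters):
        for _ in range(episodes_per_iter):
            obs, done = env.reset(), False
            while not done:
                expert_action = expert.act(obs)            # label every visited state
                dataset.append((obs.image, expert_action))
                # Iteration 0 rolls out the expert; later iterations roll out the
                # current visual policy so the dataset covers its own mistakes.
                action = expert_action if it == 0 else visual_policy.act(obs.image)
                obs, done = env.step(action)
        visual_policy.fit(dataset)   # supervised (behavior-cloning) update
    return visual_policy
```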
Qualitative results : DAgger
21
Our agent follows traffic lights with
few collisions in dense traffic.
Front View RGB agent after Imitation Learning
Advantages of Proposed approach
22
• Fully RL: Unlike LBC, there is no requirement for an expert.
• Faster: The time to get a visual policy is less than direct RL on images.
• Practical: Privileged information can be engineered using sensors.
• It can also incorporate traffic laws – e.g., lane speeds, speed limits.
• Transferable: Policies trained on engineered features are easy to transfer.
• Obstacle distance remains the same irrespective of rainy or sunny weather.
Disadvantages of Proposed approach
23
• Problem: The “expert” can sometimes fail, so we can't do behavior cloning all the time.
• Solution: Use the expert as the behavior policy and the visual policy as the target, and run SAC (sketched below).
Example of a failure case of the RL expert
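A sketch of the off-policy arrangement described above: the privileged expert drives (behavior policy) while SAC updates the visual policy (the policy being learned) from the replay buffer. Function and object names are placeholders, not the project's code.

```python
# Sketch only: the expert acts in the environment, and the visual policy is
# trained off-policy from the stored transitions. Interfaces are placeholders.

def collect_with_expert(env, expert, replay_buffer, steps=1000):
    obs = env.reset()
    for _ in range(steps):
        action = expert.act(obs)                      # behavior policy = privileged expert
        next_obs, reward, done = env.step(action)
        replay_buffer.add(obs.image, action, reward, next_obs.image, done)
        obs = env.reset() if done else next_obs

def improve_visual_policy(sac_agent, replay_buffer, updates=1000, batch_size=256):
    for _ in range(updates):
        batch = replay_buffer.sample(batch_size)
        sac_agent.update(batch)                       # off-policy SAC step on image inputs
```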
Improving using off-policy RL (SAC)
Previously…
• Visual policy
• End-to-end training
• Helps train the conv
layers.
• Agent follows traffic
lights with few collisions
in dense traffic.
24
Improving using off-policy RL (SAC)
25
• Improves the visual policy further.
• The ResNet is frozen and used as a feature extractor.
• Soft Actor Critic is used to train the final two layers (sketch below).
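A minimal PyTorch sketch of the "freeze the ResNet, train only the last layers" setup. The ResNet-18 backbone, head sizes, and two-layer actor are illustrative assumptions; in the project the backbone would be the network obtained from the DAgger stage, and SAC would also maintain critics.

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone: here a plain ResNet-18; in the project it would be the network
# obtained from the DAgger/behavior-cloning stage.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()              # expose the 512-d feature vector
for p in backbone.parameters():
    p.requires_grad = False              # frozen feature extractor
backbone.eval()

# Small actor head: the only part updated during SAC (outputs speed and steer).
actor_head = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 2), nn.Tanh(),        # bounded continuous actions
)
optimizer = torch.optim.Adam(actor_head.parameters(), lr=3e-4)

# Forward pass for a batch of camera images:
images = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    features = backbone(images)          # (4, 512), no gradients into the backbone
actions = actor_head(features)           # gradients flow only through the head
```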
Improving using off-policy RL (SAC)
26
The training procedure
Smoother and better driving after SAC training
Results : Comparison against baselines
27
Algorithm                              | No crash (Regular) | No crash (Dense) | Training time (millions of interactions)
Learning by Cheating (LBC) [1]         | 94                 | 51               | 0.174
Implicit Affordances (IA) [9]          | 87                 | 42               | 10-20
Auton Lab Old Agent (PPO) [8]          | 96                 | 89               | 16
Ours (DAgger)                          | 88                 | 60               | 0.070
Ours (DAgger + Aux Task) (train only)  | 92                 | 70               | 0.085
Ours (DAgger + SAC, n=25)              | 94                 | 84               | 1
No-crash numbers are the % of successful episodes out of 25 on the testing town. (The original table also had "Expert" and "Oracle information" columns whose entries are not recoverable from this export.)
Summary
• Learning from raw image pixels using RL is hard.
• Learning from generic pretrained networks using RL is hard (this works for behavior cloning but not for RL).
• Learning from task-specific (reward-specific) pretrained networks is highly recommended (visual policy vs. policy visual?).
• It is possible to take humans/autopilots completely out of the loop and still learn good driving behavior.
28
Thank you!
29
Our vision based agent driving after RL + DAgger + RL
Thank you!
30
Some more videos!
References
[1] Chen et al., "Learning by Cheating", CoRL 2019.
[2] Prof. Jeff Schneider's RI Seminar Talk.
[3] Liang, Xiaodan, et al., "CIRL: Controllable Imitative Reinforcement Learning for Vision-Based Self-Driving", ECCV 2018.
[4] Kendall, Alex, et al., "Learning to Drive in a Day", ICRA, IEEE, 2019.
[5] Agarwal et al., "Learning to Drive using Waypoints", NeurIPS 2019 Workshop on ML4AD.
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton, "Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target".
[7] Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning", AAAI 2018.
[8] Tanmay Agarwal, Master's Thesis, Tech. Report CMU-RI-TR-20-32, August 2020.
[9] Toromanoff, Marin, Emilie Wirbel, and Fabien Moutarde, "End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
31
