Reinforcement Learning
for Self-Driving Cars
Vinay Sameer Kadi and Mayank Gupta, with Prof. Jeff Schneider
Sponsored by Argo AI
Outline
• Introduction
• Experiments
• Results
• Future Work
Introduction
Setting up the problem
Problem Statement
• Train a self-driving agent using RL algorithms in simulation.
• Develop an algorithm that can be run on Argo’s driving logs.
• Aim for sample-efficient algorithms.

An agent exploring the CARLA environment
Motivation – Why Reinforcement Learning?
• End-to-end system.
• Verifiable performance through simulation.
• Behavior cloning is capped by the expert’s performance, while RL isn’t.

If we can run it on one video log, we can run it on any video log!
Problem Setting
A short description of the setup
Problem Setting
• State space – Either encoded image, waypoints or manual
Example manual state:
  WP             0.4
  Obstacle       1
  Traffic Light  0
  …              …
Problem Setting
• State space – Either encoded image, waypoints or manual
• Action space – Speed and Steer (bounded and continuous)
• PID Controller – for low-level control
• Test Scenario: navigation with dynamic actors
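The low-level PID control mentioned above can be sketched as follows. This is a minimal illustration of mapping a target speed to a bounded throttle command; the gains and time step are made-up values, not the ones used in the project:

```python
class PID:
    """Minimal PID controller mapping a speed error to a bounded throttle.

    Gains (kp, ki, kd) and dt are illustrative, not the project's values.
    """

    def __init__(self, kp=0.5, ki=0.05, kd=0.1, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp to the bounded, continuous action range.
        return max(-1.0, min(1.0, u))
```

The RL policy would output a target speed and steer, and a controller like this would track them at a higher rate.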
Experiments
Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization

Which components need to be improved?
Decoupling the problem
Input images and data from simulator → State space construction → Policy network → Reward optimization

Focusing solely on RL – handcrafted input
Previous Semester
• Used Soft Actor Critic (SAC)
• 8-dimensional state space
• Mean angle to next 5 waypoints
• Nearest obstacle distance and speed
• Vehicle speed and steer
• Distance from trajectory, goal and red light
• Reward
• Speed-based reward
• Distance-to-trajectory reward
• Collision penalty
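The reward terms above can be sketched as a single shaping function. The weights and the exact form are illustrative assumptions; the slides only name the components:

```python
def reward(speed, target_speed, dist_to_traj, collided):
    """Illustrative reward shaping: speed tracking, trajectory
    distance, and a collision penalty. Weights are made up."""
    r_speed = 1.0 - abs(speed - target_speed) / max(target_speed, 1e-6)
    r_track = -0.5 * dist_to_traj          # penalize drifting off the path
    r_collision = -100.0 if collided else 0.0
    return r_speed + r_track + r_collision
```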
Final trained agent using SAC

N-step SAC outperforms PPO
• Naïve SAC is not as good as PPO
• N-step SAC performs slightly better than PPO
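The difference between naïve and N-step SAC is the critic's target. A minimal sketch of the N-step TD target (the standard formulation, not the project's exact code):

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """N-step TD target: sum_k gamma^k * r_k + gamma^n * V(s_n).

    `rewards` are the n observed rewards; `bootstrap_value` is the
    critic's value estimate at the state n steps ahead.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_value
```

With n = 1 this reduces to the usual one-step SAC target; larger n propagates reward information faster at the cost of more bias from off-policy data.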
Decoupling the problem
Input images and data from simulator → State space construction → RL algorithm → Reward optimization

Focusing on representation – Imitation Learning
Decoupling the problem
Input images and data from simulator → Pretrained model → Policy network → Reward optimization

Combining progress in both
Learning By Cheating – Pretrained Model
• Move from manual state space to image state space + waypoints
• We want to leverage an imitation-learning-based pretrained model to accelerate training.
• Essentially treating LBC as the “ImageNet model” for this task.
Channel visualization from the output of the conv layers of LBC’s ResNet-34 [1], trained to drive
[1] Chen et al., “Learning by Cheating”, Conference on Robot Learning, 2019
Experiments with Policy Networks
• Instead of a 2-layer MLP to get actions as in previous experiments, we have the following network for the policy and value function.
• The initial channel mixing helps keep the number of parameters small.
• Result: only passes 1 test case of 25 (straight navigation).

Network: 384 x 160 RGB → ResNet-34 (pretrained) → bs x 512 x 5 x 12 → Conv1x1 (bs x 64 x 5 x 12) → ReLU → Conv3x3 (bs x 32 x 3 x 10) → ReLU → Conv3x3 (bs x 16 x 1 x 8) → ReLU → flatten + concat(speed, steer, wp) → FC layers → means and variances for actions
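The listed feature-map sizes follow from stride-1, unpadded convolutions; a quick shape check (assuming valid padding, which matches the slide's numbers):

```python
def conv2d_shape(chw, c_out, k):
    """Output shape of a stride-1, unpadded (valid) conv on a (C, H, W) map."""
    c, h, w = chw
    return (c_out, h - k + 1, w - k + 1)

# Feature map from the (assumed frozen) ResNet-34 backbone: 512 x 5 x 12.
shape = (512, 5, 12)
shape = conv2d_shape(shape, 64, 1)   # 1x1 channel mixing -> (64, 5, 12)
shape = conv2d_shape(shape, 32, 3)   # -> (32, 3, 10)
shape = conv2d_shape(shape, 16, 3)   # -> (16, 1, 8)
```

So the flattened vector fed to the FC layers is only 16 * 1 * 8 = 128 numbers before concatenating speed, steer, and waypoints.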
Experiments with Policy Networks
• Spatial softmax to reduce dimensions.
• The idea was that RL performs best in low-dimensional state spaces.
• Result: only passes 1 of 25 test cases (~20000 reward).
• We will revisit this once we have end-to-end training instead of frozen nets.

Network: 384 x 160 RGB → ResNet-34 (pretrained) → bs x 512 x 5 x 12 → Spatial softmax (bs x 1024) → ReLU → flatten + concat(speed, steer, wp) → FC layers → means and variances for actions
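Spatial softmax turns each channel into an expected (x, y) image coordinate, which is why 512 channels become a 1024-dimensional vector. A minimal NumPy sketch of the standard operation (not the project's exact implementation):

```python
import numpy as np

def spatial_softmax(features):
    """Per-channel softmax over H*W, returning expected (x, y) coordinates.

    A (C, H, W) map becomes 2*C numbers -- hence 512 -> 1024 on the slide.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    flat = flat - flat.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    # Pixel coordinate grids, normalized to [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    ex = probs @ xs.ravel()   # expected x per channel
    ey = probs @ ys.ravel()   # expected y per channel
    return np.concatenate([ex, ey])
```

Each output pair is a soft argmax: the attention-weighted location of whatever that channel responds to.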
Experiments with Policy Networks
• Regular conv layers into FC.
• Result: only passes 3 test cases of 25.
• ~10000 reward higher; not much.
• Several other architectures were also tried, yielding similar results.

Network: 384 x 160 RGB → ResNet-34 (pretrained) → bs x 512 x 5 x 12 → Conv3x3 (bs x 64 x 3 x 10) → ReLU → Conv3x3 (bs x 8 x 1 x 8) → ReLU → flatten + concat(speed, steer, wp) → FC layers → means and variances for actions
Approaches so far
Input images and data from simulator → Pretrained visual model (behavior cloning (LBC) or autoencoder) → Policy network → Reward optimization
Proposed
• Leverage the trained policy network (privileged agent)

Privileged agent: Input images and data from simulator → State space construction → Policy network → Reward optimization
Visual policy: Input images and data from simulator → Visual policy network, trained by behavior cloning on the privileged agent
RL Network – 8-dim input
• Privileged agent

BC Network – image + 5-dim
• Visual policy
Initial experiment settings
• Semantically segmented images
• Removed manual obstacle information
• Simple conv network to ensure a fair comparison
• The expert can be trained using off-policy or on-policy methods
• Took an expert trained using PPO for easy comparison
Comparison of RL+BC against pure RL
• The “expert” RL agent used in the previous slides is shown in red. It is the best-performing and fastest to train.
• The BC agent is comparable to the yellow curve (I).
• Our BEV agent trains much faster and achieves a much higher success rate of 96% (24/25).
• Our front-view agent heavily outperforms the pre-existing one, with 92% success against 60-70%.

Performance of RL using 8-dim (A), image + 8-dim (A+I), and image + 6-dim (I) networks from a former lab member’s work [5].
[5] T. Agarwal, “On-Policy Reinforcement Learning for Learning to Drive in Urban Settings”, Master’s Thesis, Tech. Report CMU-RI-TR-20-32, August 2020
Qualitative results
Advantages of Proposed approach
• Unlike LBC, there is no requirement for an expert (fully RL)
• Training of the initial visual policy is very fast
• The visual policy can be finetuned using RL
• The privileged information can be easily obtained using sensors
• Easy to train
• Can incorporate priors
Advantages of Proposed approach
• Easy to transfer the policies obtained from the manual state space
• But it can sometimes fail:
• So we can’t do behavior cloning all the time
• Instead, use it as a behavior policy with the visual policy as the target, and do RL
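The distillation step (fitting the visual policy to the privileged agent's actions) can be sketched with a toy linear policy, where closed-form least squares stands in for SGD on an MSE behavior-cloning loss. The feature/action shapes here are hypothetical:

```python
import numpy as np

def behavior_clone(features, expert_actions):
    """Fit a linear visual policy to the privileged expert's actions.

    Least squares stands in for gradient descent on the MSE
    behavior-cloning loss; a real visual policy would be a conv net.
    """
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias term
    W, *_ = np.linalg.lstsq(X, expert_actions, rcond=None)
    return W

def act(W, feature):
    """Query the cloned policy on one feature vector."""
    return np.append(feature, 1.0) @ W
```

In the proposed setup, the privileged agent generates the (observation, action) pairs; the visual policy trained this way then becomes the target policy for further off-policy RL.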
Improvements of Proposed approach
• Naïve behavior cloning can miss important tails of the distribution:
• e.g., traffic-light scenarios, which are relatively rare in a video
• Heuristics can be applied to handle those
• e.g., custom prioritized experience replay
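One such heuristic can be sketched by weighting rare transitions more heavily when sampling BC batches. The 10x priority for traffic-light frames is an assumed value, purely for illustration:

```python
import random

rng = random.Random(0)

# 99 ordinary transitions and 1 rare traffic-light transition.
is_light = [False] * 99 + [True]

# Heuristic: give rare traffic-light frames 10x sampling priority
# (the multiplier is an assumption, not the project's value).
priorities = [10.0 if light else 1.0 for light in is_light]

# Sample a batch with probability proportional to priority.
batch = rng.choices(range(len(priorities)), weights=priorities, k=1000)
rare_fraction = batch.count(99) / 1000
```

Under uniform sampling the rare transition would appear ~1% of the time; with this weighting it appears roughly 9x more often, so the cloned policy actually sees traffic-light behavior.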
Future Work
• Train on RGB images
• Remove the traffic light information from manual state space
• Test in different weather conditions
• Finetune Visual Policy using CURL [8] (RL + Auxiliary Task)
• Dense traffic scenarios
• Lane change scenarios
[8] M. Laskin, A. Srinivas, P. Abbeel, “CURL: Contrastive Unsupervised Representations for Reinforcement Learning”, ICML, July 2020
References
[1] Chen et al., “Learning by Cheating”, CoRL 2019.
[2] Prof. Jeff Schneider’s RI Seminar talk.
[3] Liang, Xiaodan, et al., “CIRL: Controllable Imitative Reinforcement Learning for Vision-Based Self-Driving”, ECCV 2018.
[4] Kendall, Alex, et al., “Learning to Drive in a Day”, ICRA, IEEE, 2019.
[5] Agarwal, et al., “Learning to Drive using Waypoints”, NeurIPS 2019 Workshop on ML4AD.
[6] Hernandez-Garcia, J. Fernando, and Richard S. Sutton, “Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target”.
[7] Hessel, et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning”, AAAI 2018.
[8] M. Laskin, A. Srinivas, P. Abbeel, “CURL: Contrastive Unsupervised Representations for Reinforcement Learning”, ICML 2020.
RL for Self-Driving – Fall 2020 MidSem presentation