SlideShare a Scribd company logo
1 of 22
ETA
Prediction
Challenge
@DataHack
A taxi goes from
Chinatown to Times
Square. How long will
it take to arrive?
Taxi challenge by @Final
In this challenge, you are given data
on taxi rides in New York, containing
information on each ride such as the
start and end points, date, time of
day, distance, etc...
Our purpose is to predict the travel
time (in logarithmic scale) of a ride.
The data is split to train and test sets,
and we can use both general data of
the ride with local data on similar
rides from the train set.
Data : Goal :
Ride Information - Given Dataset
● From / To coordinates (lon, lat)
● Departure timestamp
● Trip distance (road distance)
● Vendor - Taxi company (Found to be not important)
● Passenger count (Found to be not important)
Data Wizard, expert in big
data processing and
production ready ML.
Googler, Wazer and traffic
analytics expert.
Data Ninja, a pure
professional in every data
spect from gathering and
exploring to modeling.
Kaggle Master
An innovative ML expert
and programmer, expert in
feature engineering and
selection techniques.
Kaggle Master
The Team
A group of talented and creative world class professionals in ML and Traffic Analytics
Nir Malbin Gad Benram Seffi Cohen
CDS (Chief Data Scientist)
for the Israeli Defense
Forces, and pioneer in ML
ensemble techniques.
Kaggle Master
Daniel Marcous
How it works
Dataset
Train
(Train model on)
Test
(Make prediction on)
Public score
(30%)
Private score
(70%)
Reminders
The Metric
Mean Square Error (log values) / Variance (constant)
sum((y-y')^2) / sum((y-avg(y))^2)
Notes :
● Interesting part :
𝚺((real-pred)^2) ~ Least Squares
● Log values
○ Mistake weight goes down for longer rides
○ Mistake is determined by error percentile -
10 minutes mistake on a 20 minutes ride
matters the same as
20 minutes mistake on a 40 minutes ride.
{log(a)-log(b)=log(a/b)}
Predicting ETA Using :
1. Ride Information
2. Environment
3. Geography
4. Inferred States
Machine
Learning
Data Shortage
● We Don’t Have
○ Historical speeds
○ Real Time speeds
● Box coordinates to NYC (remove 0.0 etc.)
● Remove very long / far rides (>2h/65km)
● Remove unreasonable speed / time-distance ratio
○ Remove 5% anomalies (Top & Bottom)
Data Cleaning
New feature - Abnormal Ride
Feature Engineering
Datetime based features
● Month start / end
● Day / Day of week / Hour / 15 Minute interval
● Is weekend / business day
● Is work hour (09:00-17:00)
● Is rush hour (morning / afternoon)
● Is holiday
● Coordinate Transformations (Coding directionality)
○ PCA 2 2
○ RBF
● Spheric (geo) distance
● Distance percentile
● Spheric-Trip distance ratio
Location based features
City based features
● NYC Neighbourhood (pair crossing)
● Distance to points of interest (100X2)
○ Schools / Hospitals / Parks etc.
PCA 2
Weather based features
● Temperature
● Events - Rain / Snow etc.
● Humidity
● Wind
● Visibility
● Min / Max / Avg / std etc.
PCA 2
Inferred Traffic based
features
PCA 1
● Assumption :
our data is a representative sample
of the NYC’s - “driving population”
● Crowdedness
○ #rides in X radius
■ 100 / 500 / 1500 / 5000
■ Euclidean / Manhattan
News based
features
1. Crawling NYTimes
2. Topic Modeling
3. Finding topics correlated with ETA
4. Using top10 correlated topics as
features
a. Number of articles on a day for
every topic
Results
Caveats
● Timeseries future mixing
● Not exactly a metric that might be the most important
○ Same weight for positive / negative error
● Crowdedness - assumes that data is a representative sample of the total
car population
○ E.g. 2 times the samples equals 2 times the traffic
● Variance - taken from original validation dataset (constant)
Public Leader Board
Team Score
Team DCountdown 0.150041
Squanchers 0.160468
Noa's stars 0.165869
Aperture Science 0.167958
R-North 0.175308
TAU Deep Learning Lab 0.182602
Mr Terminal 0.193593
MTG 0.282009
SuperFish 0.302637
Summary
Machine Learning Approach
● Black Box (~ish)
● Harder to deploy
● Retrain when system changes
Advantages : Disadvantages :
● No manual tuning
● No complex heuristics
● Optimise different metrics
● Personalisation
● Taking different ideas and their
interactions into account as one

More Related Content

Similar to Prediction of taxi rides ETA

Kharita: Robust Road Map Inference Through Network Alignment of Trajectories
Kharita: Robust Road Map Inference Through Network Alignment of TrajectoriesKharita: Robust Road Map Inference Through Network Alignment of Trajectories
Kharita: Robust Road Map Inference Through Network Alignment of Trajectoriesvipyoung
 
Webinar: Using smart card and GPS data for policy and planning: the case of T...
Webinar: Using smart card and GPS data for policy and planning: the case of T...Webinar: Using smart card and GPS data for policy and planning: the case of T...
Webinar: Using smart card and GPS data for policy and planning: the case of T...BRTCoE
 
Case Studies in Managing Traffic in a Developing Country with Privacy-Preserv...
Case Studies in Managing Traffic in a Developing Country with Privacy-Preserv...Case Studies in Managing Traffic in a Developing Country with Privacy-Preserv...
Case Studies in Managing Traffic in a Developing Country with Privacy-Preserv...Biplav Srivastava
 
Theme 3b Users perspective of integrated transit systems
Theme 3b Users perspective of integrated transit systemsTheme 3b Users perspective of integrated transit systems
Theme 3b Users perspective of integrated transit systemsBRTCoE
 
Hybrid Ant Colony Optimization for Real-World Delivery Problems Based on Real...
Hybrid Ant Colony Optimization for Real-World Delivery Problems Based on Real...Hybrid Ant Colony Optimization for Real-World Delivery Problems Based on Real...
Hybrid Ant Colony Optimization for Real-World Delivery Problems Based on Real...csandit
 
Crunching Gigabytes Locally
Crunching Gigabytes LocallyCrunching Gigabytes Locally
Crunching Gigabytes LocallyDima Korolev
 
Machine Learning Approach to Report Prioritization with an ...
Machine Learning Approach to Report Prioritization with an ...Machine Learning Approach to Report Prioritization with an ...
Machine Learning Approach to Report Prioritization with an ...butest
 
Cab travel time prediction using ensemble models
Cab travel time prediction using ensemble modelsCab travel time prediction using ensemble models
Cab travel time prediction using ensemble modelsAyan Sengupta
 
Insight_Project_Presentation
Insight_Project_PresentationInsight_Project_Presentation
Insight_Project_Presentationdforthomme
 
3rd Conference on Sustainable Urban Mobility
3rd Conference on Sustainable Urban Mobility3rd Conference on Sustainable Urban Mobility
3rd Conference on Sustainable Urban MobilityLIFE GreenYourMove
 
Chapter 3&4
Chapter 3&4Chapter 3&4
Chapter 3&4EWIT
 
Supply chain logistics : vehicle routing and scheduling
Supply chain logistics : vehicle  routing and  schedulingSupply chain logistics : vehicle  routing and  scheduling
Supply chain logistics : vehicle routing and schedulingRetigence Technologies
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit
 
IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light...
IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light...IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light...
IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light...Yamato OKAMOTO
 
Replacing Manhattan Subway Service with On-demand transportation
Replacing Manhattan Subway Service with On-demand transportationReplacing Manhattan Subway Service with On-demand transportation
Replacing Manhattan Subway Service with On-demand transportationChristian Moscardi
 
Christian Moscardi Presentation
Christian Moscardi PresentationChristian Moscardi Presentation
Christian Moscardi PresentationJoseph Chow
 
KTH-Texxi Project 2010
KTH-Texxi Project 2010KTH-Texxi Project 2010
KTH-Texxi Project 2010Texxi Global
 
Participatory Project
Participatory ProjectParticipatory Project
Participatory Project#Xiao Zhe#
 

Similar to Prediction of taxi rides ETA (20)

Kharita: Robust Road Map Inference Through Network Alignment of Trajectories
Kharita: Robust Road Map Inference Through Network Alignment of TrajectoriesKharita: Robust Road Map Inference Through Network Alignment of Trajectories
Kharita: Robust Road Map Inference Through Network Alignment of Trajectories
 
Webinar: Using smart card and GPS data for policy and planning: the case of T...
Webinar: Using smart card and GPS data for policy and planning: the case of T...Webinar: Using smart card and GPS data for policy and planning: the case of T...
Webinar: Using smart card and GPS data for policy and planning: the case of T...
 
Case Studies in Managing Traffic in a Developing Country with Privacy-Preserv...
Case Studies in Managing Traffic in a Developing Country with Privacy-Preserv...Case Studies in Managing Traffic in a Developing Country with Privacy-Preserv...
Case Studies in Managing Traffic in a Developing Country with Privacy-Preserv...
 
Analysing road traffic
Analysing road trafficAnalysing road traffic
Analysing road traffic
 
Theme 3b Users perspective of integrated transit systems
Theme 3b Users perspective of integrated transit systemsTheme 3b Users perspective of integrated transit systems
Theme 3b Users perspective of integrated transit systems
 
Hybrid Ant Colony Optimization for Real-World Delivery Problems Based on Real...
Hybrid Ant Colony Optimization for Real-World Delivery Problems Based on Real...Hybrid Ant Colony Optimization for Real-World Delivery Problems Based on Real...
Hybrid Ant Colony Optimization for Real-World Delivery Problems Based on Real...
 
Crunching Gigabytes Locally
Crunching Gigabytes LocallyCrunching Gigabytes Locally
Crunching Gigabytes Locally
 
Machine Learning Approach to Report Prioritization with an ...
Machine Learning Approach to Report Prioritization with an ...Machine Learning Approach to Report Prioritization with an ...
Machine Learning Approach to Report Prioritization with an ...
 
Cab travel time prediction using ensemble models
Cab travel time prediction using ensemble modelsCab travel time prediction using ensemble models
Cab travel time prediction using ensemble models
 
Insight_Project_Presentation
Insight_Project_PresentationInsight_Project_Presentation
Insight_Project_Presentation
 
3rd Conference on Sustainable Urban Mobility
3rd Conference on Sustainable Urban Mobility3rd Conference on Sustainable Urban Mobility
3rd Conference on Sustainable Urban Mobility
 
Chapter 3&4
Chapter 3&4Chapter 3&4
Chapter 3&4
 
Supply chain logistics : vehicle routing and scheduling
Supply chain logistics : vehicle  routing and  schedulingSupply chain logistics : vehicle  routing and  scheduling
Supply chain logistics : vehicle routing and scheduling
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier Aguedes
 
IV2021-431-slides.pdf
IV2021-431-slides.pdfIV2021-431-slides.pdf
IV2021-431-slides.pdf
 
IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light...
IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light...IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light...
IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light...
 
Replacing Manhattan Subway Service with On-demand transportation
Replacing Manhattan Subway Service with On-demand transportationReplacing Manhattan Subway Service with On-demand transportation
Replacing Manhattan Subway Service with On-demand transportation
 
Christian Moscardi Presentation
Christian Moscardi PresentationChristian Moscardi Presentation
Christian Moscardi Presentation
 
KTH-Texxi Project 2010
KTH-Texxi Project 2010KTH-Texxi Project 2010
KTH-Texxi Project 2010
 
Participatory Project
Participatory ProjectParticipatory Project
Participatory Project
 

More from Daniel Marcous

Cloud AI Platform Notebooks - Kaggle IL
Cloud AI Platform Notebooks - Kaggle ILCloud AI Platform Notebooks - Kaggle IL
Cloud AI Platform Notebooks - Kaggle ILDaniel Marcous
 
Towards Smart Transportation DSS 2018
Towards Smart Transportation DSS 2018Towards Smart Transportation DSS 2018
Towards Smart Transportation DSS 2018Daniel Marcous
 
Distributed Databases - Concepts & Architectures
Distributed Databases - Concepts & ArchitecturesDistributed Databases - Concepts & Architectures
Distributed Databases - Concepts & ArchitecturesDaniel Marcous
 
Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)Daniel Marcous
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
Big Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @GoogleBig Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @GoogleDaniel Marcous
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 

More from Daniel Marcous (10)

Cloud AI Platform Notebooks - Kaggle IL
Cloud AI Platform Notebooks - Kaggle ILCloud AI Platform Notebooks - Kaggle IL
Cloud AI Platform Notebooks - Kaggle IL
 
S2
S2S2
S2
 
Towards Smart Transportation DSS 2018
Towards Smart Transportation DSS 2018Towards Smart Transportation DSS 2018
Towards Smart Transportation DSS 2018
 
Distributed Databases - Concepts & Architectures
Distributed Databases - Concepts & ArchitecturesDistributed Databases - Concepts & Architectures
Distributed Databases - Concepts & Architectures
 
Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)Distributed K-Betweenness (Spark)
Distributed K-Betweenness (Spark)
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
Big Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @GoogleBig Data - Big Insights - Waze @Google
Big Data - Big Insights - Waze @Google
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Data Visualisation
Data VisualisationData Visualisation
Data Visualisation
 
Geo data analytics
Geo data analyticsGeo data analytics
Geo data analytics
 

Recently uploaded

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 

Prediction of taxi rides ETA

  • 2. A taxi goes from Chinatown to Times Square. How long will it take to arrive?
  • 3. Taxi challenge by @Final In this challenge, you are given data on taxi rides in New York, containing information on each ride such as the start and end points, date, time of day, distance, etc... Our purpose is to predict the travel time (in logarithmic scale) of a ride. The data is split to train and test sets, and we can use both general data of the ride with local data on similar rides from the train set. Data : Goal :
  • 4. Ride Information - Given Dataset ● From / To coordinates (lon, lat) ● Departure timestamp ● Trip distance (road distance) ● Vendor - Taxi company (Found to be not important) ● Passenger count (Found to be not important)
  • 5. Data Wizard, expert in big data processing and production ready ML. Googler, Wazer and traffic analytics expert. Data Ninja, a pure professional in every data spect from gathering and exploring to modeling. Kaggle Master An innovative ML expert and programmer, expert in feature engineering and selection techniques. Kaggle Master The Team A group of talented and creative world class professionals in ML and Traffic Analytics Nir Malbin Gad Benram Seffi Cohen CDS (Chief Data Scientist) for the Israeli Defense Forces, and pioneer in ML ensemble techniques. Kaggle Master Daniel Marcous
  • 6. How it works Dataset Train (Train model on) Test (Make prediction on) Public score (30%) Private score (70%)
  • 7. Reminders The Metric Mean Square Error (log values) / Variance (constant) sum((y-y')^2) / sum((y-avg(y))^2) Notes : ● Interesting part : 𝚺((real-pred)^2) ~ Least Squares ● Log values ○ Mistake weight goes down for longer rides ○ Mistake is determined by error percentile - 10 minutes mistake on a 20 minutes ride matters the same as 20 minutes mistake on a 40 minutes ride. {log(a)-log(b)=log(a/b)}
  • 8. Predicting ETA Using : 1. Ride Information 2. Environment 3. Geography 4. Inferred States Machine Learning
  • 9. Data Shortage ● We Don’t Have ○ Historical speeds ○ Real Time speeds
  • 10. ● Box coordinates to NYC (remove 0.0 etc.) ● Remove very long / far rides (>2h/65km) ● Remove unreasonable speed / time-distance ratio ○ Remove 5% anomalies (Top & Bottom) Data Cleaning New feature - Abnormal Ride
  • 12. Datetime based features ● Month start / end ● Day / Day of week / Hour / 15 Minute interval ● Is weekend / business day ● Is work hour (09:00-17:00) ● Is rush hour (morning / afternoon) ● Is holiday
  • 13. ● Coordinate Transformations (Coding directionality) ○ PCA 2 2 ○ RBF ● Spheric (geo) distance ● Distance percentile ● Spheric-Trip distance ratio Location based features
  • 14. City based features ● NYC Neighbourhood (pair crossing) ● Distance to points of interest (100X2) ○ Schools / Hospitals / Parks etc. PCA 2
  • 15. Weather based features ● Temperature ● Events - Rain / Snow etc. ● Humidity ● Wind ● Visibility ● Min / Max / Avg / std etc. PCA 2
  • 16. Inferred Traffic based features PCA 1 ● Assumption : our data is a representative sample of the NYC’s - “driving population” ● Crowdedness ○ #rides in X radius ■ 100 / 500 / 1500 / 5000 ■ Euclidean / Manhattan
  • 17. News based features 1. Crawling NYTimes 2. Topic Modeling 3. Finding topics correlated with ETA 4. Using top10 correlated topics as features a. Number of articles on a day for every topic
  • 19. Caveats ● Timeseries future mixing ● Not exactly a metric that might be the most important ○ Same weight for positive / negative error ● Crowdedness - assumes that data is a representative sample of the total car population ○ E.g. 2 times the samples equals 2 times the traffic ● Variance - taken from original validation dataset (constant)
  • 20. Public Leader Board Team Score Team DCountdown 0.150041 Squanchers 0.160468 Noa's stars 0.165869 Aperture Science 0.167958 R-North 0.175308 TAU Deep Learning Lab 0.182602 Mr Terminal 0.193593 MTG 0.282009 SuperFish 0.302637
  • 22. Machine Learning Approach ● Black Box (~ish) ● Harder to deploy ● Retrain when system changes Advantages : Disadvantages : ● No manual tuning ● No complex heuristics ● Optimise different metrics ● Personalisation ● Taking different ideas and their interactions into account as one