As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly change how drivers plan their day by alerting users before they travel, helping them find the best times to travel, and, over time, learning from new IoT data such as road conditions and incidents. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As part of this work, we conducted a case study over five large US metropolitan areas, covering 2.58 billion traffic records and 262 million weather records, to quantify the boost in traffic prediction accuracy from using weather data. We will provide an overview of our lambda architecture, in which Apache Spark is used to build prediction models from weather and traffic data and Spark Streaming is used to score the models and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark for analyzing geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and work is under way to scale the system to provide predictions in over 100 cities. Attendees will learn about our experience scaling with Spark in offline and streaming modes, building statistical and deep-learning pipelines with Spark, and techniques for working with geospatial and time-series data.
2. #EUent7
Pranita Dewan Joshua Rosenkranz
Ramya Raghavendra Mudhakar Srivatsa
About me
• PhD, CS from UC Santa Barbara
• Researcher at IBM TJ Watson
3. Machine Learning Process
Business Understanding
• Challenge
• Why it is important
• Why it is hard
Data Collection
• Traffic
• Weather
• Archival
• Real-time
Data preprocessing
• Cleaning
• Joins
• Spark time series library
Traffic modeling
• ARIMA
• Random forest
• LSTM
5. Driver behavior data is only valid in the context of what is also happening on the road
UBI: Usage Based Insurance
[Figure: sample readings for driver speed, speed limit, reference speed, weather condition, temperature reading, and congestion index]
Limited analysis can lead to inaccurate assessments and impact retention.
More data, and driver-relevant data, will lead to greater understanding of behavior and associated risk.
With 36.2 Billion wasted trucking hours caused by traffic congestion,
and the average citizen losing nearly $800 per year in wasted fuel and
time, we need to PREDICT traffic to increase efficiency.
The Challenge
What time should I leave tomorrow to get
to Newark the quickest?
With snow expected in the morning, what
time do I need to leave to get to work by 8:00?
What should I tell my morning viewers
about their evening commute today?
Predictive Traffic Demo
7. We historically know general traffic patterns, but many variables
can significantly change expectations. Weather is one of the
primary variables. So what did we do?
The Challenge – No Easy Task
• 2.58 billion traffic records in the five cities studied
• 262 million weather records in the 1-year study
• Weekday vs. weekend, morning commute vs. evening commute
• Results tabulated on bad-weather days, where impacts matter the most
Selected 5 unique cities in different US geographies
Analyzed 1 year of both traffic and weather data
Built a cognitive model that predicts future traffic flows from 15 minutes to 24 hours into the future
9. Weather Data
• History on Demand
– Weather features accessed via lat/lon or bounding box
– Hourly historical information from July 2011
• Enhanced Forecast
– Forecasts at 4 km resolution every 15 minutes
https://business.weather.com/products/weather-data-packages
10. Traffic Data
• Traffic, road and incident data
– 300M sources
– 8M kilometers of road
• Real-time traffic flow information for all functional road classifications
• eXtreme Definition (XD) segments
– 100-350 m long
– Traffic updated every 5 minutes
11. Setup
[Diagram: data sources feed machine learning models that run on the analytics platform]
Data Sources
• Traffic (historical)
• Weather (historical + predicted)
• Incident Reports (Police, Construction, Traffic Cam, Tweets)
Machine Learning Models
• First Order Models: ARIMA/BATS
• Second Order Models: Spatial Correlation, Causality
• Higher Order Models: Random forest, LSTM
Analytics Platform
• Training: Apache Spark (1)
• Scoring: Spark Streaming
• HDFS/Cassandra
(1) Apache Spark extensions to handle time series and geospatial data
12. Spark-TimeSeries: Library for Distributed Time Series Analytics on Apache Spark
Scale out
• Single JVM: Streams
• Horizontal: ShortTSRDD
• Longitudinal: LongTSRDD
Data types
• Fully templated
• Integers, Doubles, Strings etc
• Fully supporting geo locations
Windowing
• Record based
• Time based
• Activity based
Runtime support
• Periodic, Aperiodic, Hybrid
• Aligned/ Unaligned timeseries
Multivariate analysis
• Temporal joins
• Record-based Join
Languages
• Scala
• Java
• Python*
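The time-based windowing and temporal-join ideas listed on this slide can be illustrated without the library itself. The sketch below is plain Python, not the Spark-TimeSeries API (which these slides do not show); the function names and the sample readings are hypothetical. It groups 5-minute traffic speeds into hourly windows and joins each window to the hourly weather record with the same window start.

```python
from collections import defaultdict
from statistics import mean

HOUR = 3600  # seconds


def hourly_windows(observations):
    """Group (epoch_seconds, speed) pairs into time-based windows of one hour.

    Returns {window_start: mean speed in that window}.
    """
    buckets = defaultdict(list)
    for ts, speed in observations:
        buckets[ts - ts % HOUR].append(speed)
    return {start: mean(speeds) for start, speeds in sorted(buckets.items())}


def temporal_join(traffic_by_hour, weather_by_hour):
    """Inner-join two hourly series on their shared window start."""
    return [
        (start, speed, weather_by_hour[start])
        for start, speed in traffic_by_hour.items()
        if start in weather_by_hour
    ]


# Hypothetical sample data: speeds every 5 minutes for two hours, hourly weather.
traffic = [(t, 60.0 if t < HOUR else 30.0) for t in range(0, 2 * HOUR, 300)]
weather = {0: "clear", HOUR: "snow"}

joined = temporal_join(hourly_windows(traffic), weather)
print(joined)
```

A distributed version would partition the same logic by road segment or by time range, which is roughly the split the slide describes between horizontal and longitudinal scale-out.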
15. ARIMA Based Model
• ARIMA (autoregressive integrated moving average): used for time-series forecasting
• Use ARIMA to predict per-road-segment future speeds based on previously observed values
• Can model hour-of-day and day-of-week patterns
• Cannot handle non-periodic “incidents”
ARIMA parameters:
• p: number of autoregressive terms
• d: number of non-seasonal differences needed for stationarity
• q: number of lagged forecast errors in the prediction equation
[Figure: ARIMA prediction example. Panels: 24-hour window prediction errors; prediction errors tail (log scale)]
75% accuracy
Time: ~3 mins (linear scaleout with TSRDD)
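To make the roles of p and d concrete, here is a minimal sketch of the simplest non-trivial case, ARIMA(1,1,0): difference the series once (d = 1), fit a single autoregressive coefficient by least squares (p = 1), and use no moving-average terms (q = 0). It is plain Python for illustration only, not the model used in the study; the function names and the speed values are hypothetical, and a production model would also need seasonal terms to capture the hour-of-day and day-of-week patterns mentioned above.

```python
def difference(series):
    """First-order differencing (the d = 1 step)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(diffs):
    """Least-squares estimate of phi in diff[t] = phi * diff[t-1] + noise."""
    num = sum(x * y for x, y in zip(diffs, diffs[1:]))
    den = sum(x * x for x in diffs[:-1])
    return num / den if den else 0.0


def forecast(series, steps):
    """Forecast `steps` future values with ARIMA(1,1,0), no intercept."""
    diffs = difference(series)
    phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    out = []
    for _ in range(steps):
        last_diff = phi * last_diff   # predicted next change in speed
        level += last_diff            # undo the differencing
        out.append(level)
    return out


# Hypothetical speeds (mph) on one road segment, slowing at a decaying rate.
speeds = [60.0, 52.0, 48.0, 46.0, 45.0]
print(forecast(speeds, 3))  # [44.5, 44.25, 44.125]
```

Because each road segment is fitted independently, this kind of model parallelizes naturally across segments.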
16. Random Forest Based Model
• Per-road-segment regression tree for prediction
• Regression tree features:
– Current speeds on the road segment
– Current speeds on “connected” road segments
– Predicted weather on the road segment
• Connected road segment extraction methodologies:
→ Spatial Radius → Correlation → Causality
Congestion on a road segment affects connected road segments
Accuracy:
• 89% with weather
• 82% without weather
Time: 6-8 mins (linear scaleout with TSRDD)
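One of the extraction methodologies named on this slide is correlation. As a rough sketch of how that could work, the plain-Python example below treats a segment as "connected" when its speed history has a Pearson correlation with the target segment above a threshold, then assembles the three feature groups from the slide into one row. The threshold, segment ids, speed histories, and the numeric weather value are all hypothetical, and the regression tree that would consume the row is not shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length speed series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def connected_segments(target, history, threshold=0.8):
    """Segments whose speed history correlates with the target's above threshold."""
    return sorted(
        seg for seg, series in history.items()
        if seg != target and pearson(history[target], series) >= threshold
    )


def feature_row(target, history, weather_forecast, threshold=0.8):
    """Current target speed, current speeds on connected segments, then weather."""
    neighbors = connected_segments(target, history, threshold)
    return [history[target][-1]] + [history[s][-1] for s in neighbors] + [weather_forecast]


# Hypothetical segment ids and speed histories (mph).
history = {
    "A": [60.0, 55.0, 40.0, 30.0],
    "B": [58.0, 54.0, 42.0, 33.0],  # slows down together with A
    "C": [45.0, 61.0, 44.0, 62.0],  # does not follow A
}
print(connected_segments("A", history))
print(feature_row("A", history, weather_forecast=1.0))
```

A spatial-radius variant would replace the correlation test with a distance check between segment coordinates.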
17. LSTM + Node Embedding as Feature Vector
• Create node embedding
• Concatenate node embedding with time series data
• Node embeddings allow the model to learn spatial components of the graph, while the time series data incorporates the temporal components
[Diagram: Vu + …; training per node]
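The slides do not say how the embedding and the time series are combined, so the following is only one plausible arrangement, sketched in plain Python: the same node embedding is appended to the speed reading at every time step, giving the sequence model one fixed-width feature vector per step. The embedding values, the window of speeds, and the function name are hypothetical.

```python
def build_inputs(node_embedding, speed_window):
    """Attach the same node embedding to every time step of a speed window.

    Returns one feature vector per time step: [speed_t] + node_embedding.
    """
    return [[speed] + list(node_embedding) for speed in speed_window]


# Hypothetical 3-dimensional embedding for one road-graph node and 4 recent speeds.
embedding = (0.12, -0.40, 0.88)
window = [61.0, 58.5, 47.0, 39.5]

sequence = build_inputs(embedding, window)
for step in sequence:
    print(step)
```

The resulting list has shape (time steps, 1 + embedding size), which is the layout a recurrent layer typically expects for a single training example.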
18. Architecture
[Diagram: offline training pipeline and online scoring pipeline]
Offline: traffic and weather files (CSV, Parquet, JSON) in HDFS → temporal & spatial joins → Spark trains models
• One model per city and per prediction time horizon
• Updated every three months
• No raw data is stored
Online: traffic and weather updates (CSV, JSON; 15-minute per-city updates) → Kafka → Spark Streaming → REDIS → REST API
• One Kafka and one Spark Streaming job per city
• Predictions over multiple time horizons are stored against the edge id key in REDIS
• REST API only accesses REDIS
Model updates flow from offline training to the streaming jobs
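The online path described on this slide can be sketched in a few lines. In the plain-Python example below, a dictionary stands in for REDIS, a function call stands in for one streaming micro-batch, and a toy function stands in for the trained model; the key format, horizon values, and all names are assumptions made for illustration. The point it shows is the separation of concerns: the scoring job writes predictions for several horizons under an edge id key, and the REST layer only reads from the store.

```python
import json

# In-memory stand-in for REDIS: key -> JSON string (an assumption for this sketch).
store = {}

HORIZONS_MIN = (15, 60, 240)  # hypothetical prediction horizons, in minutes


def score_batch(city_model, batch):
    """Mimic one streaming micro-batch: score each edge and upsert by edge id."""
    for edge_id, current_speed in batch:
        predictions = {str(h): city_model(current_speed, h) for h in HORIZONS_MIN}
        store[f"edge:{edge_id}"] = json.dumps(predictions)


def rest_lookup(edge_id, horizon_min):
    """What a REST handler would do: read the store only, never the model."""
    raw = store.get(f"edge:{edge_id}")
    if raw is None:
        return None
    return json.loads(raw).get(str(horizon_min))


def toy_model(speed, horizon_min):
    """Placeholder model: speed drops 1 mph per 15 minutes of horizon."""
    return speed - horizon_min / 15


score_batch(toy_model, [("e42", 50.0), ("e43", 30.0)])
print(rest_lookup("e42", 60))  # 46.0
print(rest_lookup("e99", 15))  # None
```

Keeping the REST API read-only against the store means request latency does not depend on model scoring time, and a model refresh only has to touch the scoring job.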
19. The Results
Percentage reduction in prediction error:
City           Total     Morning rush hour   Evening rush hour
Chicago        34.4%     16.9%               41.5%
Houston        30.6%     19.3%               17.9%
Philadelphia   24.7%     9.5%                19.5%
Atlanta        15.1%     3.3%                2.19%
Portland       23.0%     15.3%               23.8%
[Chart: results by city for Chicago, Houston, Philadelphia, Atlanta, Portland]
Significant Improvements in Accuracy in All Geographies Modeled
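The slides do not define how "percentage reduction in prediction error" was computed. A common definition, assumed here, is the relative drop in error when weather features are added. The plain-Python sketch below applies that definition to the random forest accuracies quoted earlier in the deck (82% without weather, 89% with weather), treating error as 100 minus accuracy; this is an illustration of the arithmetic, not a reproduction of any figure in the table above.

```python
def error_reduction_pct(error_without, error_with):
    """Relative reduction in error, as a percentage of the baseline error."""
    if error_without <= 0:
        raise ValueError("baseline error must be positive")
    return (error_without - error_with) / error_without * 100


# Assumption: error (%) = 100 - accuracy (%), using the accuracies from the
# Random Forest slide (82% without weather, 89% with weather).
baseline_error = 100 - 82   # 18
weather_error = 100 - 89    # 11

reduction = error_reduction_pct(baseline_error, weather_error)
print(round(reduction, 1))  # 38.9
```

Under this definition a seven-point gain in accuracy corresponds to a much larger relative reduction in error, which is why the two kinds of figure should not be compared directly.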
20. Commuting gets better with Predictive Traffic
Predictive Traffic will significantly impact how drivers plan their day. We will…
• Alert users, before they travel, that their journey may take longer than normal.
• Deliver intelligent mobile tools to find the best times to travel, if at all.
• Over time, Predictive Traffic gets smarter by learning from new IoT data: road conditions, local traffic behavior, weather sensors, incidents, user-generated feedback, traffic cameras, etc.