The document discusses how data science can be used to make cars smarter and more autonomous. It provides examples of using predictive maintenance models to predict needed repairs from diagnostic trouble codes. It also discusses using unsupervised learning techniques like hidden Markov models and topic modeling to analyze driving behavior patterns from sensor data and provide more personalized driving experiences. The presentation concludes by emphasizing that data science can help improve safety, utility and driving experiences to progress towards smarter augmented vehicles.
1. How to Make Cars Smarter: A Step Towards Self-Driving Cars
Kaushik K. Das
Esther Vasiete
Pivotal Data Science
October 2016
2. Today’s presenters
Pivotal Data Science Perspectives
Kaushik K. Das
Head of Data Science, Pivotal
Esther Vasiete
Data Scientist, Pivotal
3. Agenda
• What do we mean by “smarter cars”?
• How do we apply data science to build smarter cars?
Example 1: Predictive Maintenance
Example 2: Understanding Driver Behavior Patterns
• Demo
• Next Steps
4. Autonomous Cars will offer many advantages
Call a car whenever you want to go somewhere – sit and relax – and you are there!
● No stress for you – don’t have to drive in traffic or maintain a car
● Better utilization of cars leading to lower impact on environment
● Fewer accidents and injuries
BUT
some issues still need to be solved, e.g. California law requires a driver who is ready to take over in an emergency
6. Why not - Smart “Augmented” Cars*
Autonomous Cars
Manually Driven Cars
* Some people refer to smart augmented cars as semi-autonomous vehicles
7. Augmentation – a situation in which humans and computers combine to create effective and efficient outcomes*
● You get reduced stress and fewer accidents
● Fewer regulatory / legal barriers
● Easier to implement
* Thomas H. Davenport, Augmentation or Automation?, WSJ, Feb 25, 2015.
Smart Cars offer many of the advantages of automation
8. Smart System = Sensors + Digital Brain + Actuators
Problem Formulation
Data Step
Modeling Step
Application Step
Data Science For Building Models
Sensors & Data
Data Lake
Big Data Platform
9. Phase 1: Problem Formulation
Make sure you formulate a problem that is relevant to the goals and pain points of the stakeholders
Phase 2: Data Step
Build the right feature set making full use of the volume, variety and velocity of all available data
Phase 3: Modeling Step
This is where you move from answering what, where and when to answering why and what if?
Phase 4: Application
Create a framework for integrating the model with decision making processes and taking action using the Internet of Things
Technology Selection
Select the right platform and the right set of tools for solving the problem at hand
Iterative Approach
Perform each phase in an agile manner, team up with domain experts and SMEs, and iterate as required
Creativity
Take the opportunity to innovate at every phase
Building a Narrative
Create a fact-based narrative that clearly communicates insights to stakeholders
The Eightfold Path of Data Science – four phases and four differentiating factors
10. KEY LANGUAGES
PLATFORM
KEY TOOLS
MLlib
PL/X
Modeling Tools
Visualization Tools
Platform
Pivotal HDB
Pivotal Greenplum
Spring Cloud Data Flow
Apache Spark
Pivotal HDP
Data Science Toolkit
11. Scalable, In-Database Machine Learning
• Open source https://github.com/apache/incubator-madlib
• Downloads and docs http://madlib.incubator.apache.org/
• Wiki https://cwiki.apache.org/confluence/display/MADLIB/
12. Functions
Linear Systems
• Sparse and Dense Solvers
• Linear Algebra
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Ordinal Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Robust Variance (Huber-White), Clustered Variance, Marginal Effects
Other Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Apriori)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Random Forest
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
• Naïve Bayes
• Support Vector Machines (SVM)
• Prediction Metrics
Descriptive Statistics
Sketch-Based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation and Covariance
Summary
Utility Modules
Array and Matrix Operations
Sparse Vectors
Random Sampling
Probability Functions
Data Preparation
PMML Export
Conjugate Gradient
Stemming
Sessionization
Pivot
Inferential Statistics
Hypothesis Tests
Time Series
• ARIMA
Sept 2016
Path Functions
• Operations on Pattern Matches
13. Data Science Use-Cases
● Smarter Car
‒ Is the car functioning well?
‒ Do any of the parts need servicing or replacement?
‒ How are the new parts functioning? Are they better than the old parts? How is their performance relative to tests?
● Smarter Driver Response
‒ Understand drivers’ driving patterns and typical routes, and customize for a better driving experience (Advanced Driver Assistance Systems)
● Smarter Response to Surroundings
‒ How do we improve congestion forecasting and optimize routes better?
‒ How do we improve traffic management?
‒ How can city planning be improved by using very granular driving and traffic information?
17. Data Sources for Predictive Maintenance
Vehicle Data: VIN, Timestamp, DTC Code, Odometer, Speed, Acceleration, Engine Temperature, Engine Torque, GPS Coordinates, etc.
Car Repairs Data: VIN, Date vehicle in, Date vehicle out, Repair code, Parts replaced, Warranty claims, Repair Comments, etc.
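One way to combine the two sources is to join them on VIN and count the DTCs seen in a window before each repair visit. The sketch below is a minimal illustration under assumptions: the sample rows, the 30-day window, and the function name `dtc_features` are hypothetical and do not come from the presentation.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical vehicle-data rows: (VIN, timestamp, DTC code).
dtc_events = [
    ("VIN001", datetime(2016, 3, 1), "P0700"),
    ("VIN001", datetime(2016, 3, 9), "P0700"),
    ("VIN001", datetime(2016, 3, 10), "B1342"),
    ("VIN001", datetime(2016, 1, 2), "U0100"),  # too old for the March visit
    ("VIN002", datetime(2016, 3, 8), "C0035"),
]

# Hypothetical repair rows: (VIN, date vehicle in, repair code).
repairs = [
    ("VIN001", datetime(2016, 3, 12), "TRANSMISSION"),
    ("VIN002", datetime(2016, 3, 15), "REGULAR_CHECK"),
]


def dtc_features(dtc_events, repairs, window_days=30):
    """For each repair visit, count the DTCs recorded for the same VIN
    during the window_days before the vehicle came in."""
    rows = []
    for vin, date_in, repair_code in repairs:
        start = date_in - timedelta(days=window_days)
        counts = Counter(
            code
            for event_vin, ts, code in dtc_events
            if event_vin == vin and start <= ts < date_in
        )
        rows.append({"vin": vin, "label": repair_code, "dtc_counts": dict(counts)})
    return rows


features = dtc_features(dtc_events, repairs)
```

Each resulting row pairs a repair label with the DTC counts that preceded it, which is the shape a supervised model needs. On real data volumes the same join would more likely run inside the database than in a Python loop.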
18. Predicting Job Type from Diagnostic Trouble Codes (DTCs)
[Figure: timelines of DTC observations (categories B, P, C, U) recorded over time before repair visits labeled with job types (Transmission, Engine, Regular check). For each visit the slide asks: can the DTCs observed in the preceding window predict this job type?]
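The question this slide poses can be framed as classification: given the DTC categories seen before a visit, predict the job type. The presentation does not say which model was used, so the sketch below substitutes a small multinomial Naive Bayes classifier written from scratch. The training samples and the names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples, alpha=1.0):
    """samples: list of (dtc_categories, job_type) pairs."""
    vocab = sorted({d for dtcs, _ in samples for d in dtcs})
    class_counts = Counter(job for _, job in samples)
    token_counts = defaultdict(Counter)
    for dtcs, job in samples:
        token_counts[job].update(dtcs)
    return {
        "vocab": vocab,
        "class_counts": class_counts,
        "token_counts": token_counts,
        "alpha": alpha,
        "n": len(samples),
    }


def predict_nb(model, dtcs):
    """Return the job type with the highest smoothed log-posterior."""
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_job, best_score = None, -math.inf
    for job, n_job in model["class_counts"].items():
        counts = model["token_counts"][job]
        total = sum(counts.values())
        score = math.log(n_job / model["n"])  # class prior
        for d in dtcs:
            if d not in model["vocab"]:
                continue  # ignore categories never seen in training
            score += math.log((counts[d] + alpha) / (total + alpha * vocab_size))
        if score > best_score:
            best_job, best_score = job, score
    return best_job


# Hypothetical history: DTC categories observed before each repair visit.
history = [
    (["P", "C", "P"], "Transmission"),
    (["P", "C"], "Transmission"),
    (["B"], "Regular check"),
    (["B", "B"], "Regular check"),
    (["U", "P", "U"], "Engine"),
    (["U", "U"], "Engine"),
]

model = train_nb(history)
prediction = predict_nb(model, ["P", "C"])  # "Transmission" on this toy data
```

Naive Bayes is used here only because it is short to write out. Any of the classifiers on the MADlib function list, such as decision trees or random forests, could take the same DTC features.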
22. Unsupervised driving behavior analysis
Segmentation: From raw sensor data to driving scenes using HMM.
Feature Distribution: Quantization of physical features observed in each scene.
Driving topics: Scenes are represented as a combination of driving topics, which explain driving patterns.
Parallelism using: PL/Python*
* HMM inference from pre-trained model
[T. Bando, K. Takenaka, S. Nagasaka, T. Taniguchi, Unsupervised drive topic finding from driving behavioral data, IEEE Intelligent Vehicles Symposium, 2013]
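The segmentation and quantization steps can be illustrated with a toy example. This sketch assumes an HMM state sequence is already available, treats each run of identical states as one scene, and turns sensor values into discrete "words" by binning. The sample values, the bin edges, and the helper names are hypothetical.

```python
from itertools import groupby


def split_into_scenes(states, values):
    """Group consecutive samples that share an HMM state into one scene."""
    scenes, start = [], 0
    for state, run in groupby(states):
        length = len(list(run))
        scenes.append((state, values[start:start + length]))
        start += length
    return scenes


def quantize(value, edges, prefix):
    """Map a value to a word such as 'speed_2'; the bin index is the
    number of bin edges the value has reached."""
    bin_index = sum(value >= edge for edge in edges)
    return f"{prefix}_{bin_index}"


# Hypothetical decoded HMM states and the matching speed readings (km/h).
states = [0, 0, 0, 1, 1, 1, 0]
speeds = [98, 105, 95, 25, 18, 22, 101]
speed_edges = [30, 60, 90]  # hypothetical bin edges

scenes = split_into_scenes(states, speeds)
documents = [
    [quantize(v, speed_edges, "speed") for v in scene_values]
    for _, scene_values in scenes
]
# documents[0] holds the words of the first scene, and so on.
```

Each scene becomes a short list of words, which is the bag-of-words input a topic model expects. A real pipeline would quantize several physical features per scene rather than speed alone.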
23. HMM inference using PL/Python
Note: HMM parameters were provided to us and loaded into the database.
The hmmlearn library is installed on every segment!
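To show what HMM inference from a pre-trained model involves, here is a dependency-free Viterbi decoder for an HMM with one-dimensional Gaussian emissions. It is a stand-in for illustration, not the hmmlearn-based PL/Python function from the presentation. The two states, their parameters, and the speed readings are made up.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi(obs, start_prob, trans_prob, means, stds):
    """Most likely state sequence for a non-empty observation list."""
    n_states = len(start_prob)
    # log_delta[s]: best log-probability of any path that ends in state s.
    log_delta = [
        math.log(start_prob[s]) + gaussian_logpdf(obs[0], means[s], stds[s])
        for s in range(n_states)
    ]
    backpointers = []
    for x in obs[1:]:
        new_delta, pointers = [], []
        for s in range(n_states):
            prev = max(
                range(n_states),
                key=lambda p: log_delta[p] + math.log(trans_prob[p][s]),
            )
            pointers.append(prev)
            new_delta.append(
                log_delta[prev]
                + math.log(trans_prob[prev][s])
                + gaussian_logpdf(x, means[s], stds[s])
            )
        log_delta = new_delta
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: log_delta[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical pre-trained parameters: state 0 = cruising, state 1 = stop-and-go.
start_prob = [0.5, 0.5]
trans_prob = [[0.9, 0.1], [0.1, 0.9]]
means, stds = [100.0, 20.0], [10.0, 10.0]

speeds = [98, 105, 95, 25, 18, 22, 101]
path = viterbi(speeds, start_prob, trans_prob, means, stds)
# path == [0, 0, 0, 1, 1, 1, 0]
```

Because the parameters are fixed, decoding one vehicle's time series is independent of every other vehicle's. That independence is what lets the work be spread across database segments.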
24. From time-series driving behavior into natural language
Latent Dirichlet Allocation (LDA)
Document ↔ Scene
Word ↔ Quantized sensor value
[D. Blei, Probabilistic topic models, Communications of the ACM, 2012]
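The analogy can be made concrete with a tiny topic model in which each scene is a document and each quantized sensor value is a word. The sketch below is a minimal collapsed Gibbs sampler for LDA, written for illustration only. It is not the parallel LDA implementation in MADlib, and the scene data, hyperparameters, and function name `lda_gibbs` are hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. Returns per-document topic
    proportions, the vocabulary, and the topic-word count matrix."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wid = assignments[d][i], word_id[w]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    theta = [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, vocab, topic_word


# Hypothetical scenes, each a bag of quantized sensor words.
scene_docs = [
    ["speed_3", "speed_3", "brake_0", "speed_3"],
    ["speed_0", "brake_2", "speed_0", "brake_2"],
    ["speed_3", "brake_0", "brake_0"],
    ["speed_0", "speed_0", "brake_2"],
]
theta, vocab, topic_word = lda_gibbs(scene_docs, n_topics=2)
```

Each row of `theta` gives one scene's mixture over driving topics. Each row of `topic_word` shows which quantized sensor values a topic favors.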
26. Data Lake
Business Levers
Apps
MLlib
PL/X
Model Building
Model Tuning
Continuous Model Improvement
Data Feeds
Ingest → Filter → Enrich → Sink
Spring Cloud Data Flow
Greenplum
Operationalization - Pipeline of a Data Science Driven App
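The ingest, filter, enrich, and sink stages can be pictured as a chain of small functions that each consume and emit a stream of records. The sketch below uses Python generators purely as an illustration and is not Spring Cloud Data Flow code. The record format, the validity range, and the scoring function are hypothetical.

```python
def ingest(raw_lines):
    """Parse raw 'VIN,speed' lines into records."""
    for line in raw_lines:
        vin, speed = line.split(",")
        yield {"vin": vin, "speed_kmh": float(speed)}


def filter_valid(records):
    """Drop records with physically implausible speeds."""
    for record in records:
        if 0 <= record["speed_kmh"] <= 300:
            yield record


def enrich(records, score):
    """Attach a model score to every record."""
    for record in records:
        yield {**record, "score": score(record)}


def sink(records, store):
    """Write records to a store (a plain list in this sketch)."""
    for record in records:
        store.append(record)
    return store


def speeding_score(record):
    # Stand-in for a trained model served by the data science platform.
    return 1.0 if record["speed_kmh"] > 120 else 0.0


raw = ["VIN001,42.0", "VIN001,-5.0", "VIN002,130.0"]
stored = sink(enrich(filter_valid(ingest(raw)), speeding_score), [])
```

Keeping the model call inside the enrich stage means a retrained model can be swapped in without touching the other stages, which fits the continuous model improvement loop on the slide.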
27. We will be able to improve your driving experience by preparing your car for the exact conditions you are about to encounter.
28. It’s easy to make cars smarter - let’s make it happen!
30. Additional resources & next steps
Read: Pivotal Data Science Blog
https://blog.pivotal.io/channels/data-science-pivotal
Strategic: Pivotal Data Science Analytics Roadmapping Engagement
https://pivotal.io/contact
Tune in: Next data science webinar “How Data Science can help with Fraud Detection and Cybersecurity” - Q1 2017 (Date TBD)
https://pivotal.io/resources/1/webinars
Hands on:
HDB Sandbox on HDP VM https://network.pivotal.io/products/pivotal-hdb
Greenplum Sandbox https://network.pivotal.io/products/pivotal-gpdb
Apache MADlib (incubating) http://madlib.incubator.apache.org/