Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase

© 2017 MapR Technologies
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase
Carol McDonald
@caroljmcdonald

Uber trip cluster dashboard

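[Diagram: pipeline stages Collect, Process, Store, Analyze. An input stream is processed by Spark Streaming with an ML model into an enriched stream; Spark Streaming stores the data in HBase, which is analyzed with SQL.]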
Use Case: Real-Time Analysis of Geographically Clustered Vehicles

Fast Data Pipeline for Uber Data Using Apache APIs: Kafka, Spark, HBase
•  Why combine Machine Learning with Streaming Events?
•  Introduction to Machine Learning and Spark
•  Introduction to Kafka & Spark Streaming
•  Spark Streaming and NoSQL HBase
Note: the example code is my own; only the data is from Uber

Why combine Streaming Events with Machine Learning?

What’s a Stream?
•  An unbounded sequence of events
[Diagram: Producers → Events (stream) → Consumers]

Why Stream Processing?
[Diagram: a Temperature event (6:05 P.M.: 90°) arrives on a Stream Topic and triggers "Turn on the air conditioning!"]
It’s becoming important to process events as they arrive

Why combine Streaming Events with Machine Learning?
•  Fraud detection
•  Smart Machinery
•  Utility Smart Meters
•  Home Automation
•  Networks
•  Manufacturing
•  Security Systems
•  Patient Monitoring

Why combine IOT/Streaming Events with Machine Learning?
•  Audi and Daimler deep learning for autonomous vehicles
–  Using MapR platform to scale deep learning efforts
https://mapr.com/company/press-releases/norcom-selects-mapr-deep-learning/

Why combine IOT with Machine Learning?
•  Monitoring devices combined with ML can provide alerts for sepsis, which is one of the leading causes of death in hospitals
–  http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic-patient-records

•  A Stanford team has shown that a machine-learning model can identify heart arrhythmias from an electrocardiogram (ECG) better than an expert
–  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/

Applying Machine Learning to Live Patient Data
•  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-patient-data

What if BP had detected problems before the oil hit the water?
•  1M samples/sec
•  High performance at scale is necessary!

Why combine IOT/Streaming Events with Machine Learning?
•  ML & location data:
–  identify behavior patterns and trends
–  telecom, travel, marketing...
•  http://www.cisco.com/c/en/us/solutions/industries/smart-connected-communities.html

•  Uber Near Realtime Price Surging
–  https://www.slideshare.net/ConfluentInc/kafka-uber-the-worlds-realtime-transit-infrastructure-aaron-schildkrout
•  (however this example code is mine)

What has changed in the past 10 years?
•  Distributed computing
•  Streaming analytics
•  Improved machine learning

Intro to Machine Learning

What is Machine Learning?
[Diagram: Data (contains patterns) → Train Algorithm (finds patterns) → Build Model (recognizes patterns). New Data → Use Model (prediction function) → Predictions]

ML Discovery Model Building
[Diagram: Data discovery and model creation: Historical Data → Feature Extraction → Training Set and Test Set → Model Training/Building → Test Model Predictions → Evaluate Results. Production: New Data (Uber trips from a Stream Topic) → Feature Extraction → Deployed Model → Insights]

End to End Application Architecture

What is Supervised Machine Learning?
Machine Learning
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision Forests
•  Regression
–  Linear
–  Logistic
Unsupervised
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component Analysis
–  SVD

Supervised Algorithms use labeled data
[Diagram: Build Model: data with features X1, X2 and label Y gives f(X1, X2) = Y. Use Model: new data with features X1, X2 → predict Y]

Supervised Machine Learning: Classification & Regression
Classification: identifies the category for an item

Classification: Definition
Form of ML that:
•  Identifies which category an item belongs to
•  Uses supervised learning algorithms
–  Data is labeled
Sentiment

If it Walks/Swims/Quacks Like a Duck …… Then It Must Be a Duck
Features: walks, swims, quacks

Debit Card Fraud Example
•  What are we trying to predict?
–  This is the Label or Target outcome:
–  Fraud or Not Fraud
•  What are the “if questions” or properties we can use to predict?
–  These are the Features:
–  Is the amount spent today > historical average?
–  Unusual region for card history?
–  Known merchant or not?

Decision Tree For Classification
•  Tree of decisions about features
•  IF THEN ELSE questions using features at each tree node
•  Answers branch to child nodes
[Diagram: example tree with the questions "Is the amount spent in 24 hours > average", "Is the number of states used from > 2" and "Are there multiple Purchases today from risky merchants?"; YES/NO branches end in leaves labeled Fraud 90%, Not Fraud 50%, Fraud 90% and Not Fraud 30%]
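
A decision tree of this kind is just nested IF THEN ELSE questions over the features. The sketch below is illustrative only: the `Txn` fields, the thresholds and the exact shape of the tree are assumptions made for the example, not values from a trained model or from the slide.

```scala
// Minimal sketch of a hand-written decision tree for the fraud example.
// All names, thresholds and the tree layout are hypothetical.
object FraudTreeSketch {
  // Features of one card transaction history (assumed fields)
  final case class Txn(amountLast24h: Double, historicalAvg: Double,
                       statesUsed: Int, riskyMerchantPurchases: Int)

  sealed trait Node
  final case class Leaf(label: String) extends Node
  final case class Decision(question: Txn => Boolean, yes: Node, no: Node) extends Node

  // Each Decision asks one question about the features and branches to a child node
  val tree: Node =
    Decision(t => t.amountLast24h > t.historicalAvg,
      yes = Decision(t => t.statesUsed > 2,
        yes = Leaf("Fraud"),
        no = Decision(t => t.riskyMerchantPurchases > 1,
          yes = Leaf("Fraud"),
          no = Leaf("Not Fraud"))),
      no = Leaf("Not Fraud"))

  // Walk from the root to a leaf, following the answer at each node
  @annotation.tailrec
  def classify(node: Node, t: Txn): String = node match {
    case Leaf(label)          => label
    case Decision(q, yes, no) => classify(if (q(t)) yes else no, t)
  }
}
```

A learning algorithm would choose the questions and thresholds from labeled data; here they are fixed by hand only to show how answers branch to child nodes.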

What is Unsupervised Machine Learning?
Machine Learning
Unsupervised
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component Analysis
–  SVD
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision Forests
•  Regression
–  Linear
–  Logistic

Unsupervised Algorithms use Unlabeled data
[Diagram: Customer purchase data (contains patterns) → Train Algorithm (finds patterns) → Build Model (recognizes patterns) → Customer Groups. New Customer Purchase Data → Use Model → Similar Customer Group]

Unsupervised Machine Learning: Clustering
Clustering
group news articles into different categories

Clustering: Definition
•  Unsupervised learning task
•  Groups objects into clusters of high similarity
–  Search results grouping
–  Grouping of customers, patients
–  Text categorization
–  recommendations
•  Anomaly detection: find what’s not similar

Clustering: Example
•  Group similar objects

Clustering: Example
•  Use MLlib K-means algorithm
1.  Initialize coordinates to center of clusters (centroid)
2.  Assign all points to nearest centroid
3.  Update centroids to center of points
4.  Repeat until conditions met
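
The four K-means steps above can be sketched in plain Scala without Spark. This is a minimal illustration, not the MLlib implementation: the object name, the 2-D `Point` type, the caller-supplied initial centroids and the stopping rule (stop when centroids no longer move or after `maxIter` rounds) are all assumptions for the example.

```scala
// Minimal K-means sketch on 2-D points (illustrative, not MLlib).
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two points
  private def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the centroid nearest to p
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  // Center (mean) of a non-empty group of points
  private def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Step 1 is the caller-supplied `initial`; steps 2 and 3 repeat (step 4)
  def fit(points: Seq[Point], initial: Vector[Point], maxIter: Int): Vector[Point] = {
    var centroids = initial
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centroids))
      // Step 3: move each centroid to the center of its points (keep it if it has none)
      val updated = centroids.indices
        .map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
        .toVector
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

For example, `KMeansSketch.fit(points, Vector((0.0, 0.0), (10.0, 10.0)), maxIter = 5)` refines the two starting centroids toward the centers of the two groups of points.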

Spark intro

Apache Spark Streaming
•  Task scheduling
•  Memory Management
•  Fault recovery
•  Interacting with storage systems
•  DataFrame API
•  Catalyst Optimizer
•  Processing of live streams
•  Micro-batching
•  Machine Learning
•  Multiple types of ML algorithms
•  Graph processing
•  Graph parallel computations
Distributed Parallel Cluster computing Programming Framework

Spark Distributed Datasets
•  Read only collection of typed objects Dataset[T]
•  Partitioned across a cluster
•  Operated on in parallel
•  Can be cached in memory
[Diagram: a partitioned Dataset; partitions P1 to P4 are held by Executors on three Worker (W) nodes. Sample rows:
Partition 1: 8213034705, 95, 2.927373, jake7870, 0……
Partition 2: 8213034705, 115, 2.943484, Davidbresler2, 1….
Partition 3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
Partition 4: 8213034705, 117, 2.998947, daysrus, 95….]

Example loading a Dataset
val df: Dataset[Uber] = spark.read.option("inferSchema", "false")
  .schema(schema).csv("data/uber.csv").as[Uber]
df.cache
df.count
[Diagram: a Driver and three Workers, each Worker holding one file block (Block 1, Block 2, Block 3)]

Example:
df.cache
df.count
1.  The Driver sends tasks to the Workers
2.  Each Worker reads its HDFS block (Block 1, Block 2, Block 3)
3.  Each Worker processes and caches its data (Cache 1, Cache 2, Cache 3)
4.  The Workers send results back to the Driver
res9: Long = 829275

Example:
df.cache
df.count
df.show
1.  The Driver sends tasks to the Workers
2.  Each Worker processes from cache: the data is cached, so it does not have to be read from the file again
3.  The Workers send results back to the Driver
Cache your data → Faster Results

Spark Use Cases
Iterative Algorithms on large amounts of data
Some Algorithms that need iterations
•  Clustering (K-Means)
•  Linear Regression
•  Graph Algorithms (e.g., PageRank)
•  Alternating Least Squares (ALS)
Some Example Use Cases:
•  Anomaly detection
•  Recommendations

Cluster Uber Trip Locations

Part 1: Spark Machine Learning
•  End to End Application for Monitoring Uber Data using Spark ML
•  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/

Zeppelin Notebook with Spark
Data
Engineer
Data
Scientist

Spark ML workflow

Uber Data
•  Date/Time: The date and time of the Uber pickup
•  Lat: The latitude of the Uber pickup
•  Lon: The longitude of the Uber pickup
•  Base: The TLC base company affiliated with the Uber pickup
The Data Records are in CSV format. An example line is shown below:
•  2014-08-01 00:00:00,40.729,-73.9422,B02598

Load the data into a Dataframe: Define the Schema
import org.apache.spark.sql.types._

// Typed view of one CSV record
case class Uber(dt: String, lat: Double, lon: Double, base: String)
// Explicit schema for the CSV columns, so no schema inference is needed
val schema = StructType(Array(
  StructField("dt", TimestampType, true),
  StructField("lat", DoubleType, true),
  StructField("lon", DoubleType, true),
  StructField("base", StringType, true)
))
Input Comma Separated Values:
datetime, latitude, longitude, base
2014-08-01 00:00:00,40.729,-73.9422,B02598

Dataset merged with Dataframe
•  In Spark 2.0, the DataFrame API merged with the Dataset API
•  A Dataset is a collection of typed objects (SQL and functions)
•  A DataFrame is a Dataset of generic Row objects (SQL)

Load the data into a Dataframe
val df = spark.read.option("inferSchema", "false").schema(schema)
  .csv("/user/user01/data/uber.csv")
df.show
[Diagram: Load data → DataFrame]

Load the data into a Dataframe
[Figure: the DataFrame displayed as rows and columns]

Load the data into a Dataset
import spark.implicits._  // encoders needed by .as[Uber]

case class Uber(dt: String, lat: Double, lon: Double, base: String) extends Serializable
val df = spark.read.option("inferSchema", "false").schema(schema)
  .csv("/user/user01/data/uber.csv").as[Uber]
df.show
[Diagram: Load data → Dataset]

Load the data into a Dataset
[Figure: the Dataset, a collection of Uber objects, displayed as rows and columns]

Uber Example
•  What are the “if questions” or properties we can use to group?
–  These are the Features:
–  We will group by latitude, longitude
•  Use Spark SQL to analyze: Day of the week, time, rush hour …

Extract the Features
[Diagram: Training Data → Featurization → Feature Vectors → Training → Model → Model Evaluation → Best Model. Image reference: O’Reilly Learning Spark]
Feature Vectors are vectors of numbers representing the value for each feature

Use VectorAssembler to put features in vector column
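import org.apache.spark.ml.feature.VectorAssembler

// Combine the lat and lon columns into a single "features" vector column
val featureCols = Array("lat", "lon")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
val df2 = assembler.transform(df)
[Diagram: Load data → DataFrame → transform → DataFrame + Features]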

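Create Kmeans Estimator, Set Features
import org.apache.spark.ml.clustering.KMeans

// K-means estimator: 8 clusters, at most 5 iterations
val kmeans = new KMeans()
  .setK(8)
  .setFeaturesCol("features")
  .setMaxIter(5)
[Diagram: Load data → DataFrame → transform → DataFrame + Features → Estimator]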

Fit the Model on the Training Data Features
val model = kmeans.fit(df2)
[Diagram: DataFrame + Features → input → Estimator → fit → fitted model]

Clusters from fitted model
model.clusterCenters.foreach(println)
[40.76930621976264,-73.96034885367698]
[40.67562793272868,-73.79810579052476]
[40.68848772848041,-73.9634449047477]
[40.78957777777776,-73.14270740740741]
[40.32418330308531,-74.18665245009073]
[40.732808848486286,-74.00150153727878]
[40.75396549974632,-73.57692359208531]
[40.901700842900674,-73.868760398198]

Analyze Clusters
val clusters = model.summary.predictions
clusters.show()
[Diagram: fitted model → summary → DataFrame + Features + prediction]

Transform new data, adds column with Clusters
val clusters = model.transform(newdata)
[Diagram: DataFrame + Features → fitted model → transform → DataFrame + Features + prediction]

Save the model to distributed file system
import org.apache.spark.ml.clustering.KMeansModel

model.write.overwrite().save("/path/savemodel")
Use later
val sameModel = KMeansModel.load("/user/user01/data/savemodel")
[Diagram: fitted model → save]

Kafka API and Streaming Data

Part 2: MapR Event Streams with Kafka API and Spark Streaming
•  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/

What Do We Need to Do?
[Diagram: Data Sources → Collect Data → Process Data → Store Data → Serve Data, with a "?" under each of the four steps]

Collect the Data
•  Data Ingest:
–  Network Based: MapR Event Streams using the Kafka API
[Diagram: Source → Data Ingest → Stream Topic]

Organize Data into Topics with MapR Streams
Topics Organize Events into Categories and Decouple Producers from Consumers
[Diagram: topics on a MapR Cluster (Topic: Pressure, Topic: Temperature, Topic: Warnings), written and read through the Kafka API by separate groups of Consumers]

Scalable Messaging with MapR Streams
•  Topics are partitioned for throughput and scalability
•  Producers are load balanced between partitions
•  Consumer groups can read in parallel
[Diagram: Server 1, Server 2 and Server 3; Server 1 holds Partition1 of the Pressure, Temperature and Warning topics; Consumers read through the Kafka API]

Partition is like a Queue
•  New Messages are appended to the end
[Diagram: Producers append to Partition 1 (messages 6 5 4 3 2 1), Partition 2 (3 2 1) and Partition 3 (5 4 3 2 1) of Topic: Admission on Server 1 of a MapR Cluster; Consumers read from the partitions. The new message is at one end of the partition and the old message at the other.]

Events are delivered in the order they are received, like a queue
[Diagram: Producers write messages 6 5 4 3 2 1 to a partition on the MapR Cluster; two Consumer groups read it, each with its own read cursor]

Unlike a queue, events are persisted even after they’re delivered
Messages remain on the partition, available to other consumers
[Diagram: a Consumer polls through the Client Library to get the unread events (3 2 1) from Partition 1 of Topic: Warning on a MapR Cluster (1 Server)]
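
The queue-like ordering and the persistence after delivery can be illustrated with a tiny in-memory model. This is only a sketch of the idea and not the Kafka or MapR Streams API: `Partition`, `append` and `poll` are hypothetical names, and a real partition is stored durably on the cluster rather than in a buffer.

```scala
import scala.collection.mutable

// Minimal in-memory model of one topic partition (illustrative only).
object PartitionLogSketch {
  final class Partition {
    // Messages are only ever appended, so order of arrival is preserved
    private val log = mutable.ArrayBuffer.empty[String]
    // One read cursor (next position to read) per consumer group
    private val cursors = mutable.Map.empty[String, Int]

    // Append a new message to the end and return its offset
    def append(message: String): Long = {
      log += message
      log.size - 1L
    }

    // Return the messages this group has not read yet and advance its cursor.
    // Messages stay in the log, so other groups can still read them.
    def poll(group: String): Seq[String] = {
      val from = cursors.getOrElse(group, 0)
      val unread = log.slice(from, log.size).toList
      cursors(group) = log.size
      unread
    }

    def size: Int = log.size
  }
}
```

Because each consumer group keeps its own cursor, a dashboard consumer and an archiving consumer can read the same messages independently, which is what lets the same message be processed for different purposes.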

How do we do this with High Performance at Scale?
•  Parallel operations
•  Minimizes disk reads/writes

Processing Same Message for Different Purposes
[Diagram: several Producers write through the Kafka API to streams on MapR-FS; several independent Consumers read the same messages through the Kafka API]

Machine Learning Logistics
[Diagram labels: Real time Data → Stream Input; Consumers with ML Model 1, ML Model 2 and ML Model 3; a Decoy Consumer writing results to Stream Archive; Stream Scores (Input Data + Predictions); Delayed data on a second Stream Input (Input Data + Actual Delay); Consumers combining them into Input Data + Predictions + Actual Delay; SQL; Real Time dashboard + Historical Analysis]
Use the Model with Streaming Data

Process the Data with Spark Streaming and Spark Machine Learning
•  Extension of the core Spark API
•  Scalable, high-throughput, fault-tolerant stream processing
[Diagram: Collect Data (Stream Topic) → Process Data]

ML Discovery Model Building
[Diagram: Data discovery and model creation: Historical Data → Feature Extraction → Training Set and Test Set → Model Training/Building → Test Model Predictions → Evaluate Results. Production: New Data (Uber trips from a Stream Topic) → Feature Extraction → Deployed Model → Insights]

Use Case: Real-Time Analysis of Geographically Clustered Vehicles
[Diagram: Uber trip data → Stream Topic → Spark Streaming (enrich with K-means cluster location) → Stream Topic → Spark Streaming → write to MapR-DB → SQL]

Use Case: Time Series Data
[Diagram: Uber trip data → Stream Topic → Spark Streaming reads each message and enriches it with the K-means cluster id → Stream Topic]
Input message:
2014-08-01 00:00:00,40.729,-73.9422,B02598
Enriched message:
{"dt":"2014-08-01 00:00:00.0","lat":40.3495,"lon":-74.0667,"base":"B02682","cluster":5}

Processing Spark DStreams
Data stream divided into batches of X milliseconds = DStreams
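
As a rough illustration of micro-batching, the sketch below groups timestamped events into consecutive intervals of `batchMs` milliseconds. It is plain Scala, not the Spark Streaming API; the `Event` type and the `toBatches` helper are hypothetical names introduced for the example.

```scala
// Minimal micro-batching sketch (illustrative, not Spark Streaming).
object MicroBatchSketch {
  final case class Event(timestampMs: Long, value: String)

  // Split events into batches of `batchMs` milliseconds.
  // Returns (batch start time, values in arrival order), sorted by start time.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchMs) // index of the interval the event falls in
      .toSeq
      .sortBy(_._1)
      .map { case (idx, es) => (idx * batchMs, es.sortBy(_.timestampMs).map(_.value)) }
}
```

One simplification to keep in mind: this sketch produces no entry for an interval without events, whereas a streaming engine still runs a (possibly empty) batch for every interval.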

Load the saved model
// load model for getting clusters
val model = KMeansModel.load(modelpath)

Create a DStream
DStream: a sequence of RDDs representing a stream of data
val messagesDStream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent, consumerStrategy)
// get the message values from the key,value records
val uDStream = messagesDStream.map(_.value())
[Diagram: the DStream as a sequence of batches (time 0 to 1, time 1 to 2, time 2 to 3), each stored in memory as an RDD]

Parse message txt to Uber Object and convert to DataFrame
uDStream.foreachRDD{ rdd =>
// get cluster centers and add to df
// send to Topic
}
ssc.start()
ssc.awaitTermination()

Enrich Data with Cluster

Convert to JSON send to Topic, Send the Enriched Message
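
The enrichment step can be sketched without Spark: parse the CSV message, find the nearest cluster center, and emit JSON in the shape of the enriched message shown earlier. This is a minimal sketch under assumptions: the helper names are hypothetical, the nearest center is chosen by plain squared Euclidean distance on latitude and longitude (in the pipeline itself the fitted K-means model assigns the cluster), and the JSON is built by hand rather than with a JSON library.

```scala
// Minimal enrichment sketch (illustrative; helper names are hypothetical).
object EnrichSketch {
  final case class Uber(dt: String, lat: Double, lon: Double, base: String)

  // Parse one CSV message: datetime, latitude, longitude, base
  def parseUber(line: String): Uber = {
    val p = line.split(",")
    Uber(p(0), p(1).toDouble, p(2).toDouble, p(3))
  }

  // Index of the cluster center closest to the pickup location
  def nearestCluster(u: Uber, centers: Vector[(Double, Double)]): Int =
    centers.indices.minBy { i =>
      val (clat, clon) = centers(i)
      val dLat = u.lat - clat
      val dLon = u.lon - clon
      dLat * dLat + dLon * dLon
    }

  // Hand-built JSON for the enriched message (no escaping; fine for this data)
  def toJson(u: Uber, cluster: Int): String =
    s"""{"dt":"${u.dt}","lat":${u.lat},"lon":${u.lon},"base":"${u.base}","cluster":$cluster}"""
}
```

A typical use is `val u = EnrichSketch.parseUber(line)` followed by `EnrichSketch.toJson(u, EnrichSketch.nearestCluster(u, centers))`, where `centers` would come from the fitted model's cluster centers.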

Process DStream: Streaming Application Output
[Diagram: each batch of the DStream (time 0 to 1, time 1 to 2, time 2 to 3) is transformed with map into the transformed RDDs of the result DStream, which is written to a Stream Topic]

Real Time Dashboard

Part 3: Realtime Dashboard using Vert.x
•  https://mapr.com/blog/monitoring-uber-with-spark-streaming-kafka-and-vertx/

Serving the Data
[Diagram: Collect Data (Stream Topic on MapR-FS) → Serve Data]

Use Case Dashboard

The Vert.x toolkit and Web Application Architecture
•  Event-driven
•  Event Bus
•  Verticles single threaded

Dashboard Architecture

The Dashboard Vert.x HTML5 Javascript Client

Initializing the Heatmap

Creating the Vertx EventBus
•  Create an instance of the vertx.EventBus object
•  Add an onopen listener, which registers an event bus handler for the address “dashboard”
•  The handler will receive all messages published to the “dashboard” address
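
The publish/subscribe behaviour described above can be shown with a few lines of Scala. This is a conceptual sketch only and not the Vert.x API (the dashboard client itself is JavaScript): `EventBus`, `registerHandler` and `publish` here are hypothetical names for an in-memory stand-in.

```scala
import scala.collection.mutable

// Minimal in-memory publish/subscribe sketch (illustrative, not Vert.x).
object EventBusSketch {
  final class EventBus {
    // Handlers registered per address
    private val handlers = mutable.Map.empty[String, Vector[String => Unit]]

    // Register a handler that is called for every message published to `address`
    def registerHandler(address: String)(handler: String => Unit): Unit =
      handlers(address) = handlers.getOrElse(address, Vector.empty) :+ handler

    // Deliver the message to all handlers registered for `address`
    def publish(address: String, message: String): Unit =
      handlers.getOrElse(address, Vector.empty).foreach(h => h(message))
  }
}
```

A handler registered for "dashboard" sees every message published to that address and nothing published elsewhere, which is the decoupling the dashboard relies on.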

Add Event Trip location points to Map
Parse JSON message

Add latitude and longitude points to heatmap

If cluster center is new then add marker

Spark and HBase

Part 4: using MapR-DB with HBase API
•  https://mapr.com/blog/monitoring-uber-pt4/

What Do We Need to Do?
[Diagram: Collect Data (Stream Topic on MapR-FS) → Store Data (MapR-FS) → Serve Data]

MapR-DB (HBase API) is Designed to Scale
Fast Reads and Writes by Key! Data is automatically partitioned by Key Range!
[Diagram: three key ranges, each holding a table of rows with columns Key, colB, colC]

Store Lots of Data with NoSQL MapR-DB
Storage Model:
•  RDBMS: Normalized schema → Joins for queries can cause bottleneck
•  MapR-DB: De-Normalized schema → Data that is read together is stored together
[Diagram: three tables of rows with columns Key, colB, colC]

HBase Schema
With HBase/MapR-DB, data is automatically partitioned by Key Range

Spark Streaming writing to MapR-DB (HBase API)

Spark HBase and MapR-DB Binary Connector
•  HConnection object in every Spark Executor:
•  allowing for distributed parallel writes, reads, or scans

Spark HBase streamBulkPut
•  HBaseContext streamBulkPut method parameters:
•  message value DStream, the TableName to write to, function to convert the DStream values to HBase put records.

Massively Parallel writes to HBase
The Spark Streaming bulk put enables massively parallel sending of puts to HBase

HBase Schema
To use the Spark HBase Connector for reads, you need to define the Catalog for the schema mapping between HBase and Spark

SparkSQL and DataFrames: Define the Schema
Define the Catalog for the schema mapping between HBase and Spark

Loading data from MapR-DB into a Spark DataFrame
Use Catalog defining schema

Spark Dataframes combine filters and select
The filter keeps rows for cluster ids (the beginning of the row key) >= 9. The select returns a set of columns: key, lat, and lon.
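
The filter-then-select idea can be sketched on an in-memory collection. Everything here is an assumption made for illustration: the row key layout `<cluster id>_<datetime>`, the `Trip` type and the helper names are hypothetical, and the cluster id is parsed as a number, whereas a comparison on raw row keys would be lexicographic.

```scala
// Minimal filter-and-select sketch over rows keyed by cluster id (illustrative only).
object RowKeySketch {
  final case class Trip(key: String, lat: Double, lon: Double, base: String)

  // Hypothetical composite row key: the cluster id comes first
  def rowKey(cluster: Int, dt: String): String = s"${cluster}_$dt"

  // Recover the cluster id from the beginning of the row key
  def clusterOf(key: String): Int = key.takeWhile(_ != '_').toInt

  // Keep rows whose cluster id is >= minCluster, then project key, lat and lon
  def select(trips: Seq[Trip], minCluster: Int): Seq[(String, Double, Double)] =
    trips
      .filter(t => clusterOf(t.key) >= minCluster)
      .map(t => (t.key, t.lat, t.lon))
}
```

Putting the cluster id at the start of the key means rows for one cluster sort next to each other, which is what makes a key-range scan for a cluster cheap in a store partitioned by key range.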

Use Case: Real-Time Data Pipelines
[Diagram labels: Real time Flight Data → Stream Input; Consumers with ML Model 1, ML Model 2 and ML Model 3; a Decoy Consumer writing results to Stream Archive; Stream Scores (Input Data + Predictions); Actual Delay on a second Stream Input (Input Data + Actual Delay); Consumers combining them into Input Data + Predictions + Actual Delay; SQL; Real Time dashboard + Historical Analysis]

To Learn More:
•  MapR Free ODT http://learn.mapr.com/

MapR Blog
• https://www.mapr.com/blog/

MapR Container for Developers
• https://maprdocs.mapr.com/home/MapRContainerDevelopers/MapRContainerDevelopersOverview.html

…helping you put data technology to work
●  Find answers
●  Ask technical questions
●  Join on-demand training course
discussions
●  Follow release announcements
●  Share and vote on product ideas
●  Find Meetup and event listings
Connect with fellow Apache
Hadoop and Spark professionals
community.mapr.com

Stream Processing
Building a Complete Data Architecture
[Diagram: the MapR Converged Data Platform, made up of MapR File System (MapR-XD), MapR Database (MapR-DB) and MapR Event Streams, between Sources/Apps and Bulk Processing / Stream Processing]

Q&A
ENGAGE WITH US
