© 2017 MapR Technologies
Applying Machine Learning to IoT: End-to-End Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase
Carol McDonald
@caroljmcdonald
Uber trip cluster dashboard
Use Case: Real-Time Analysis of Geographically Clustered Vehicles
[Diagram: pipeline stages Collect, Process, Store, Analyze. Components: input stream, Spark Streaming with ML model, enriched stream, Spark Streaming, HBase, SQL]
Fast Data Pipeline for Uber Data Using Apache APIs: Kafka, Spark, HBase
•  Why combine Machine Learning with Streaming Events?
•  Introduction to Machine Learning and Spark
•  Introduction to Kafka & Spark Streaming
•  Spark Streaming and NoSQL HBase
Note: the example code is my own; only the data is from Uber
Why combine Streaming Events with Machine Learning?
What’s a Stream?
An unbounded sequence of events
[Diagram: Producers, Events_Stream carrying Events, Consumers]
Why Stream Processing?
It’s becoming important to process events as they arrive
[Diagram: a Temperature event, "6:05 P.M.: 90°", arrives on a Stream Topic and leads to "Turn on the air conditioning!"]
Why combine Streaming Events with Machine Learning?
•  Fraud detection
•  Smart Machinery
•  Utility Smart Meters
•  Home Automation
•  Networks
•  Manufacturing
•  Security Systems
•  Patient Monitoring
Why combine IOT/Streaming Events with Machine Learning?
•  Audi and Daimler deep learning for autonomous vehicles
–  Using MapR platform to scale deep learning efforts
https://mapr.com/company/press-releases/norcom-selects-mapr-deep-learning/
Why combine IOT with Machine Learning?
•  Monitoring devices combined with ML can provide alerts for sepsis, which is one of the leading causes of death in hospitals
–  http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic-patient-records
Why combine IOT with Machine Learning?
•  A Stanford team has shown that a machine-learning model can identify heart arrhythmias from an electrocardiogram (ECG) better than an expert
–  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/
Applying Machine Learning to Live Patient Data
•  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-patient-data
What if BP had detected problems before the oil hit the water?
•  1M samples/sec
•  High performance at scale is necessary!
Why combine IOT/Streaming Events with Machine Learning?
•  ML & location data:
–  identify behavior patterns and trends
–  telecom, travel, marketing...
•  http://www.cisco.com/c/en/us/solutions/industries/smart-connected-communities.html
Why combine IOT with Machine Learning?
•  Uber Near Realtime Price Surging
–  https://www.slideshare.net/ConfluentInc/kafka-uber-the-worlds-realtime-transit-infrastructure-aaron-schildkrout
•  (However, the example code here is my own)
What has changed in the past 10 years?
•  Distributed computing
•  Streaming analytics
•  Improved machine learning
Intro to Machine Learning
What is Machine Learning?
[Diagram: Data (contains patterns) → Train Algorithm (finds patterns) → Build Model (recognizes patterns); New Data → Use Model (prediction function) → Predictions]
ML Discovery Model Building
[Diagram: Data Discovery, Model Creation: Historical Data (Uber trips) → Feature Extraction → Training Set → Model Training/Building; Test Set → Test Model Predictions → Evaluate Results. Production: New Data (Uber trips from a Stream Topic) → Feature Extraction → Deployed Model → Insights]
End to End Application Architecture
What is Supervised Machine Learning?
Machine Learning
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision Forests
•  Regression
–  Linear
–  Logistic
Unsupervised
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component Analysis
–  SVD
Supervised Algorithms use labeled data
[Diagram: Build Model: data with features X1, X2 and label Y gives f(X1, X2) = Y. Use Model: new data with features X1, X2 → Predict Y]
Supervised Machine Learning: Classification & Regression
Classification: identifies category for item
Classification: Definition
Form of ML that:
•  Identifies which category an item belongs to
•  Uses supervised learning algorithms
–  Data is labeled
Sentiment
If it Walks/Swims/Quacks Like a Duck … Then It Must Be a Duck
Features: walks, swims, quacks
Debit Card Fraud Example
•  What are we trying to predict?
–  This is the Label or Target outcome:
–  Fraud or Not Fraud
•  What are the “if questions” or properties we can use to predict?
–  These are the Features:
–  Is the amount spent today > historical average?
–  Unusual region for card history ?
–  Known merchant or not ?
Decision Tree For Classification
•  Tree of decisions about features
•  IF THEN ELSE questions using features at each tree node
•  Answers branch to child nodes
[Diagram: decision tree with YES/NO branches. Questions: "Is the amount spent in 24 hours > average", "Is the number of states used from > 2", "Are there multiple purchases today from risky merchants?". Leaves: Fraud 90%, Not Fraud 50%, Fraud 90%, Not Fraud 30%]
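A decision tree like the one on this slide can be read as nested IF THEN ELSE tests. The sketch below is only an illustration: the `CardActivity` record, its field names, the branch order, and the leaf labels are all assumptions, since the slide's exact tree layout is not recoverable from the text, and a real tree would be learned from labeled data.

```scala
// Hypothetical feature record; the field names are illustrative, not from the slides.
case class CardActivity(
  spentLast24h: Double,      // amount spent in the last 24 hours
  historicalAverage: Double, // average spend for this card
  statesUsedFrom: Int,       // number of states the card was used from
  riskyPurchasesToday: Int   // purchases today from risky merchants
)

// One possible arrangement of the three questions as nested if/else tests.
// Branch order and leaf labels are assumptions made for this sketch.
def classifyActivity(a: CardActivity): String =
  if (a.spentLast24h > a.historicalAverage) {
    if (a.statesUsedFrom > 2) "Fraud" else "Not Fraud"
  } else {
    if (a.riskyPurchasesToday > 1) "Fraud" else "Not Fraud"
  }

val suspicious = CardActivity(900.0, 120.0, 3, 0)
val ordinary = CardActivity(40.0, 120.0, 1, 0)
```

Each answer branches to a child node, and the label is read off at the leaf that is reached.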
What is Unsupervised Machine Learning?
Machine Learning
Unsupervised
•  Clustering
–  K-means
•  Dimensionality reduction
–  Principal Component Analysis
–  SVD
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision Forests
•  Regression
–  Linear
–  Logistic
Unsupervised Algorithms use Unlabeled data
[Diagram: Customer purchase data (contains patterns) → Train Algorithm (finds patterns) → Build Model (recognizes patterns) → Customer Groups; New Customer Purchase Data → Use Model → Similar Customer Group]
Unsupervised Machine Learning: Clustering
Clustering
group news articles into different categories
Clustering: Definition
•  Unsupervised learning task
•  Groups objects into clusters of high similarity
–  Search results grouping
–  Grouping of customers, patients
–  Text categorization
–  recommendations
•  Anomaly detection: find what’s not similar
Clustering: Example
•  Group similar objects
•  Use MLlib K-means algorithm
1.  Initialize the coordinates of the cluster centers (centroids)
2.  Assign all points to the nearest centroid
3.  Update each centroid to the center of its assigned points
4.  Repeat until the stopping conditions are met
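The four steps can be sketched in plain Scala without Spark. This is a minimal illustration and not MLlib's implementation: it takes the initial centroids as an argument (a simplification of step 1), works on 2-D points only, and stops when the centroids no longer change or after `maxIter` rounds. All names here are made up for the sketch.

```scala
type Pt = (Double, Double)

def sqDist(a: Pt, b: Pt): Double = {
  val dx = a._1 - b._1
  val dy = a._2 - b._2
  dx * dx + dy * dy
}

// Step 2: index of the centroid nearest to a point.
def nearestIndex(p: Pt, centroids: Vector[Pt]): Int =
  centroids.indices.minBy(i => sqDist(p, centroids(i)))

// Step 3: center (mean) of a non-empty group of points.
def mean(ps: Vector[Pt]): Pt =
  (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

// Steps 2 to 4, starting from the given initial centroids (step 1).
def kMeans(points: Vector[Pt], initial: Vector[Pt], maxIter: Int): Vector[Pt] = {
  var centroids = initial
  var iter = 0
  var changed = true
  while (changed && iter < maxIter) {
    val groups = points.groupBy(p => nearestIndex(p, centroids))
    val updated = centroids.indices.toVector.map { i =>
      // A centroid with no assigned points keeps its position.
      groups.get(i).map(mean).getOrElse(centroids(i))
    }
    changed = updated != centroids
    centroids = updated
    iter += 1
  }
  centroids
}

val toyPoints: Vector[Pt] = Vector((0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0))
val toyCentroids = kMeans(toyPoints, Vector((0.0, 0.0), (10.0, 0.0)), maxIter = 10)
```

On the toy data the two centroids move to the middle of the left and right pairs of points and then stop changing.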
Spark intro
Apache Spark Streaming
•  Task scheduling
•  Memory Management
•  Fault recovery
•  Interacting with storage systems
•  DataFrame API
•  Catalyst Optimizer
•  Processing of live streams
•  Micro-batching
•  Machine Learning
•  Multiple types of ML algorithms
•  Graph processing
•  Graph parallel computations
Distributed Parallel Cluster computing Programming Framework
Spark Distributed Datasets
•  Read only collection of typed objects: Dataset[T]
•  Partitioned across a cluster
•  Operated on in parallel
•  Can be cached in memory
[Diagram: a Dataset is partitioned into Partitions 1 to 4 (P1 to P4), held by Executors on three Workers (W); each partition holds rows such as "8213034705, 95, 2.927373, jake7870, 0…"]
Example loading a Dataset
val df: Dataset[Uber] = spark.read.option("inferSchema", "false")
  .schema(schema).csv("data/uber.csv").as[Uber]
df.cache
df.count
res9: Long = 829275
df.show
[Diagram sequence: for df.count, the Driver sends tasks to three Workers; each Worker reads its HDFS block (Block 1 to 3), processes and caches the data (Cache 1 to 3), and returns results to the Driver. For df.show, the Driver sends tasks again and the Workers process from cache: the data is cached and does not have to be read from the file again]
Cache your data → Faster Results
Spark Use Cases
Iterative Algorithms on large amounts of data
Some Algorithms that need iterations
•  Clustering (K-Means)
•  Linear Regression
•  Graph Algorithms (e.g., PageRank)
•  Alternating Least Squares ALS
Some Example Use Cases:
•  Anomaly detection
•  Classification
•  Recommendations
Cluster Uber Trip Locations
Part 1: Spark Machine Learning
•  End to End Application for Monitoring Uber Data using Spark ML
•  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/
Zeppelin Notebook with Spark
[Figure: Zeppelin notebook; roles shown: Data Engineer, Data Scientist]
Spark ML workflow
Uber Data
•  Date/Time: The date and time of the Uber pickup
•  Lat: The latitude of the Uber pickup
•  Lon: The longitude of the Uber pickup
•  Base: The TLC base company affiliated with the Uber pickup
The Data Records are in CSV format. An example line is shown below:
•  2014-08-01 00:00:00,40.729,-73.9422,B02598
Load the data into a Dataframe: Define the Schema
import org.apache.spark.sql.types._
case class Uber(dt: String, lat: Double, lon: Double, base: String)
val schema = StructType(Array(
  StructField("dt", TimestampType, true),
  StructField("lat", DoubleType, true),
  StructField("lon", DoubleType, true),
  StructField("base", StringType, true)
))
Input Comma Separated Values:
datetime, latitude, longitude, base
2014-08-01 00:00:00,40.729,-73.9422,B02598
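To make the mapping from a CSV line to a typed record concrete, here is a small Spark-free sketch. `UberTrip` mirrors the fields of the `Uber` case class above, and `parseUberLine` is a hypothetical helper written for this illustration; Spark's CSV reader does this work through the schema, not through code like this.

```scala
// Same fields as the Uber case class on the slide; renamed to keep the sketch standalone.
case class UberTrip(dt: String, lat: Double, lon: Double, base: String)

// Returns None when the line does not have four fields
// or when lat/lon are not numeric.
def parseUberLine(line: String): Option[UberTrip] =
  line.split(",").map(_.trim) match {
    case Array(dt, lat, lon, base) =>
      scala.util.Try(UberTrip(dt, lat.toDouble, lon.toDouble, base)).toOption
    case _ => None
  }

val parsedTrip = parseUberLine("2014-08-01 00:00:00,40.729,-73.9422,B02598")
```

Returning an `Option` keeps malformed lines from failing the whole load, which is one reasonable design choice among several.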
Dataset merged with Dataframe
•  in Spark 2.0, DataFrame APIs merged with Datasets APIs
•  A Dataset is a collection of typed objects (SQL and functions)
•  A DataFrame is a Dataset of generic Row objects (SQL)
Load the data into a Dataframe
val schema = StructType(Array(
  StructField("dt", TimestampType, true),
  StructField("lat", DoubleType, true),
  StructField("lon", DoubleType, true),
  StructField("base", StringType, true)
))
val df = spark.read.option("inferSchema", "false").schema(schema)
  .csv("/user/user01/data/uber.csv")
df.show
[Figure: the loaded DataFrame shown as rows and columns]
Load the data into a Dataset
import spark.implicits._ // provides the encoder needed by .as[Uber]
case class Uber(dt: String, lat: Double, lon: Double, base: String) extends Serializable
val schema = StructType(Array(
  StructField("dt", TimestampType, true),
  StructField("lat", DoubleType, true),
  StructField("lon", DoubleType, true),
  StructField("base", StringType, true)
))
val df = spark.read.option("inferSchema", "false").schema(schema)
  .csv("/user/user01/data/uber.csv").as[Uber]
df.show
[Figure: the loaded Dataset, a collection of Uber objects, shown as rows and columns]
Uber Example
•  What are the “if questions” or properties we can use to group?
–  These are the Features:
–  We will group by latitude and longitude
•  Use Spark SQL to analyze: Day of the week, time, rush hour …
Extract the Features
Image reference: O’Reilly Learning Spark
[Diagram: Training Data → Featurization → Feature Vectors → Training → Model → Model Evaluation → Best Model]
Feature Vectors are vectors of numbers representing the value for each feature
Use VectorAssembler to put features in vector column
import org.apache.spark.ml.feature.VectorAssembler
val featureCols = Array("lat", "lon")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
val df2 = assembler.transform(df)
[Diagram: Load data → DataFrame → transform → DataFrame + Features]
Create Kmeans Estimator, Set Features
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans()
  .setK(8)
  .setFeaturesCol("features")
  .setMaxIter(5)
[Diagram: Load data → DataFrame → transform → DataFrame + Features → Estimator]
Fit the Model on the Training Data Features
val model = kmeans.fit(df2)
[Diagram: DataFrame + Features → input → Estimator → fit → fitted model]
Clusters from fitted model
model.clusterCenters.foreach(println)
[40.76930621976264,-73.96034885367698]
[40.67562793272868,-73.79810579052476]
[40.68848772848041,-73.9634449047477]
[40.78957777777776,-73.14270740740741]
[40.32418330308531,-74.18665245009073]
[40.732808848486286,-74.00150153727878]
[40.75396549974632,-73.57692359208531]
[40.901700842900674,-73.868760398198]
Analyze Clusters
val clusters = model.summary.predictions
clusters.show()
[Diagram: fitted model → summary → DataFrame + Features + prediction]
Transform new data, adds column with Clusters
val clusters = model.transform(newdata)
[Diagram: DataFrame + Features → fitted model → transform → DataFrame + Features + prediction]
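For a K-means model, the prediction added by transform is the index of the cluster center closest to the feature vector. The sketch below reproduces that idea in plain Scala using the eight cluster centers printed earlier in the deck; `predictCluster` is a hypothetical helper for illustration and uses squared Euclidean distance on (lat, lon).

```scala
// Cluster centers copied from the model.clusterCenters output in the slides (lat, lon).
val clusterCenters: Vector[(Double, Double)] = Vector(
  (40.76930621976264, -73.96034885367698),
  (40.67562793272868, -73.79810579052476),
  (40.68848772848041, -73.9634449047477),
  (40.78957777777776, -73.14270740740741),
  (40.32418330308531, -74.18665245009073),
  (40.732808848486286, -74.00150153727878),
  (40.75396549974632, -73.57692359208531),
  (40.901700842900674, -73.868760398198)
)

// Index of the nearest center by squared Euclidean distance.
def predictCluster(lat: Double, lon: Double): Int =
  clusterCenters.indices.minBy { i =>
    val (cLat, cLon) = clusterCenters(i)
    val dLat = lat - cLat
    val dLon = lon - cLon
    dLat * dLat + dLon * dLon
  }

// The sample pickup from the Uber Data slide.
val exampleCluster = predictCluster(40.729, -73.9422)
```

The indices are positions in the printed list, so they only line up with a model trained in the same run that produced those centers.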
Save the model to distributed file system
import org.apache.spark.ml.clustering.KMeansModel
model.write.overwrite().save("/path/savemodel")
Use later
val sameModel = KMeansModel.load("/user/user01/data/savemodel")
Kafka API and Streaming Data
Part 2: MapR Event Streams with Kafka API and Spark Streaming
•  End to End Application for Monitoring Uber Data using Spark ML
•  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/
What Do We Need to Do?
[Diagram: pipeline stages Data Sources, Collect Data, Process Data, Store Data, Serve Data, with question marks under the stages]
Collect the Data
•  Data Ingest:
–  Network Based: MapR Event Streams using the Kafka API
[Diagram: Source → Data Ingest → Stream Topic]
Organize Data into Topics with MapR Streams
Topics Organize Events into Categories and Decouple Producers from Consumers
[Diagram: a MapR Cluster holds the topics Pressure, Temperature, and Warnings; Consumers read them; the Kafka API is shown on both sides of the cluster]
Scalable Messaging with MapR Streams
•  Topics are partitioned for throughput and scalability
•  Producers are load balanced between partitions
•  Consumer groups can read in parallel
[Diagram: the topics Pressure, Temperature, and Warning each have Partitions 1 to 3, spread across Servers 1 to 3; access is through the Kafka API]
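How messages get spread over partitions can be illustrated with two simple strategies. This is a sketch only: the function names are made up, and real Kafka API clients use their own hashing for keyed messages, so the partition numbers here will not match a real client's.

```scala
// Keyed messages: the same key always maps to the same partition,
// which preserves per-key ordering. floorMod keeps the result non-negative.
def partitionForKey(key: String, numPartitions: Int): Int =
  Math.floorMod(key.hashCode, numPartitions)

// Messages without a key can simply be spread round-robin.
def roundRobinPartition(messageIndex: Int, numPartitions: Int): Int =
  messageIndex % numPartitions

val roundRobinOrder = (0 until 6).map(i => roundRobinPartition(i, 3))
```

Either way, more partitions let more producers and consumers work in parallel on the same topic.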
Partition is like a Queue
New Messages are appended to the end
[Diagram: in a MapR Cluster, the topic Admission has Partitions 1 to 3 on Servers 1 to 3; Producers append new messages to the end of a partition (6 is the new message, 1 the old message) and Consumers read from it]
Events are delivered in the order they are received, like a queue
[Diagram: Producers write messages 1 to 6 to a partition in the MapR Cluster; each Consumer group has its own read cursor]
Unlike a queue, events are persisted even after they’re delivered
Messages remain on the partition, available to other consumers
[Diagram: MapR Cluster (1 Server), Topic: Warning, Partition 1 holds events 3 2 1; a Consumer polls through the client library to get the unread events]
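The combination of ordered delivery, persistence, and per-group read cursors can be modeled with a tiny in-memory log. This is a toy stand-in for one partition, with made-up class and method names; it is not the Kafka API and ignores durability, retention, and concurrency.

```scala
// A minimal in-memory stand-in for one topic partition: an append-only log
// plus one read cursor per consumer group.
class PartitionLog[A] {
  private val events = scala.collection.mutable.ArrayBuffer.empty[A]
  private val cursors = scala.collection.mutable.Map.empty[String, Int]

  // New messages are appended to the end.
  def append(event: A): Unit = events += event

  // Returns the events this group has not read yet, in order,
  // and advances only this group's cursor. Nothing is removed.
  def poll(group: String): Vector[A] = {
    val from = cursors.getOrElse(group, 0)
    cursors(group) = events.size
    events.slice(from, events.size).toVector
  }
}

val warningLog = {
  val log = new PartitionLog[String]
  Seq("event-1", "event-2", "event-3").foreach(log.append)
  log
}
val dashboardFirstRead = warningLog.poll("dashboard")
val archiverRead = warningLog.poll("archiver")
val dashboardSecondRead = warningLog.poll("dashboard")
```

Because reading only moves a cursor, a second group still sees every event after the first group has consumed them.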
How do we do this with High Performance at Scale?
•  Parallel operations
•  Minimizes disk reads/writes
Processing Same Message for Different Purposes
[Diagram: Producers and Consumers connect through the Kafka API; messages are stored on MapR-FS and read by multiple Consumers]
Machine Learning Logistics
[Diagram: components: real-time data input stream; consumers with ML Models 1, 2, and 3; a decoy consumer writing to an archive stream; a scores stream carrying input data + predictions and results; a delayed-data input stream carrying input data + actual delay; SQL stores holding input data + predictions + actual delay; real-time dashboard + historical analysis]
Use the Model with Streaming Data
Process the Data with Spark Streaming and Spark Machine Learning
•  Extension of the core Spark API
•  Scalable, high-throughput, fault-tolerant stream processing
[Diagram: Collect Data (Stream Topic) → Process Data]
ML Discovery Model Building
[Diagram: Data Discovery, Model Creation: Historical Data (Uber trips) → Feature Extraction → Training Set → Model Training/Building; Test Set → Test Model Predictions → Evaluate Results. Production: New Data (Uber trips from a Stream Topic) → Feature Extraction → Deployed Model → Insights]
Use Case: Real-Time Analysis of Geographically Clustered Vehicles
[Diagram: Uber trip data → Stream Topic → Spark Streaming (enrich with K-means cluster location) → Stream Topic → Spark Streaming → write to MapR-DB → SQL]
Use Case: Time Series Data
Uber trip data read from the Stream Topic by Spark Streaming:
2014-08-01 00:00:00,40.729,-73.9422,B02598
Enriched with K-means cluster id and written to a Stream Topic:
{"dt":"2014-08-01 00:00:00.0","lat":40.3495,"lon":-74.0667,"base":"B02682","cluster":5}
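The enriched message is the trip record plus a cluster id, serialized as JSON. A minimal sketch of that serialization follows; `EnrichedTrip` and `toJson` are names invented here, and the JSON is built by hand for illustration, which assumes the string fields contain nothing that needs escaping. A real application would more likely use a JSON library or Spark's own JSON support.

```scala
// Trip fields plus the cluster id assigned by the model.
case class EnrichedTrip(dt: String, lat: Double, lon: Double, base: String, cluster: Int)

// Hand-rolled JSON for illustration only; no escaping of string values is done.
def toJson(t: EnrichedTrip): String =
  s"""{"dt":"${t.dt}","lat":${t.lat},"lon":${t.lon},"base":"${t.base}","cluster":${t.cluster}}"""

val enrichedJson =
  toJson(EnrichedTrip("2014-08-01 00:00:00.0", 40.3495, -74.0667, "B02682", 5))
```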
Processing Spark DStreams
Data stream divided into batches of X milliseconds = DStreams
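Micro-batching can be pictured as slicing a stream into fixed-width time windows. The sketch below groups timestamped events into batches in plain Scala. It is a simplification with invented names: Spark Streaming forms batches by arrival time at the receiver, whereas this example uses a timestamp carried by each event and assumes non-negative timestamps.

```scala
case class TimedEvent(timestampMs: Long, payload: String)

// Groups events into consecutive batches of batchMs milliseconds,
// keyed by the start time of each batch.
def toBatches(events: Seq[TimedEvent], batchMs: Long): Seq[(Long, Seq[String])] =
  events
    .groupBy(e => e.timestampMs / batchMs * batchMs)
    .toSeq
    .sortBy(_._1)
    .map { case (start, es) => (start, es.sortBy(_.timestampMs).map(_.payload)) }

val sampleBatches = toBatches(
  Seq(TimedEvent(100, "a"), TimedEvent(900, "b"), TimedEvent(1200, "c"), TimedEvent(2500, "d")),
  batchMs = 1000
)
```

Each resulting batch plays the role of one RDD in the DStream.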
Load the saved model
// load model for getting clusters
val model = KMeansModel.load(modelpath)
Create a DStream
DStream: a sequence of RDDs representing a stream of data
val messagesDStream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent, consumerStrategy)
// get the message values from the key,value records
val uDStream = messagesDStream.map(_.value())
[Diagram: the DStream is a series of batches (time 0 to 1, time 1 to 2, time 2 to 3), each stored in memory as an RDD]
Parse message txt to Uber Object and convert to DataFrame
uDStream.foreachRDD{ rdd =>
// get cluster centers and add to df
// send to Topic
}
ssc.start()
ssc.awaitTermination()
Enrich Data with Cluster
Convert to JSON send to Topic, Send the Enriched Message
Process DStream: Streaming Application Output
[Diagram: each batch of the DStream (time 0 to 1, time 1 to 2, time 2 to 3) is transformed with map into the RDDs of a result DStream, which is sent to a Stream Topic]
Real Time Dashboard
Part 3: Realtime Dashboard using Vert.x
•  End to End Application for Monitoring Uber Data using Spark ML
•  https://mapr.com/blog/monitoring-uber-with-spark-streaming-kafka-and-vertx/
Serving the Data
[Diagram: pipeline stages Data Sources, Collect Data, Process Data, Serve Data; components: Stream Topic, MapR-FS]
Use Case: Real-Time Analysis of Geographically Clustered Vehicles
[Diagram: Uber trip data → Stream Topic → Spark Streaming (enrich with K-means cluster location) → Stream Topic → Spark Streaming → write to MapR-DB → SQL]
Use Case Dashboard
The Vert.x toolkit and Web Application Architecture
•  Event-driven
•  Event Bus
•  Verticles single threaded
Dashboard Architecture
The Dashboard Vert.x HTML5 Javascript Client
Initializing the Heatmap
Dashboard Architecture
Creating the Vertx EventBus
•  create an instance of the vertx.EventBus object
•  add an onopen listener, which registers an event bus handler for the
address “dashboard.”
•  handler will receive all messages published to the “dashboard” address
Add Event Trip location points to Map
Parse JSON message
Add Event Trip location points to Map
Add latitude and longitude points to heatmap
Add Event Trip location points to Map
If cluster center is new then add marker
Spark and HBase
Part 4: using MapR-DB with HBase API
•  https://mapr.com/blog/monitoring-uber-pt4/
What Do We Need to Do?
[Diagram: pipeline stages Data Sources, Collect Data, Process Data, Store Data, Serve Data; components: Stream Topic, MapR-FS]
Use Case: Real-Time Analysis of Geographically Clustered Vehicles
[Diagram: Uber trip data → Stream Topic → Spark Streaming (enrich with K-means cluster location) → Stream Topic → Spark Streaming → write to MapR-DB → SQL]
MapR-DB (HBase API) is Designed to Scale
Fast Reads and Writes by Key! Data is automatically partitioned by Key Range!
[Diagram: three key ranges, each holding a table of rows with columns Key, colB, colC]
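Partitioning by key range means each partition owns a contiguous, sorted slice of the row keys, so finding the partition for a key is a lookup against the sorted start keys. A minimal sketch, with hypothetical start keys and names chosen for illustration; it compares strings lexicographically, which matches byte-wise ordering only for plain ASCII keys.

```scala
// Hypothetical start keys of three key ranges, sorted ascending.
// The empty string makes the first range cover the lowest keys.
val regionStartKeys: Vector[String] = Vector("", "h", "q")

// A row key belongs to the last range whose start key is <= the row key.
def regionFor(rowKey: String): Int =
  regionStartKeys.lastIndexWhere(start => start.compareTo(rowKey) <= 0)
```

Because keys are kept sorted, a read or write by key touches exactly one range, and a scan over neighboring keys stays within one or a few ranges.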
Store Lots of Data with NoSQL MapR-DB
Storage Model
•  RDBMS: Normalized schema → Joins for queries can cause bottleneck
•  MapR-DB: De-Normalized schema → Data that is read together is stored together
[Diagram: MapR-DB tables of rows with columns Key, colB, colC, split across three key ranges]
HBase Schema
With HBase/MapR-DB, data is automatically partitioned by Key Range
Spark Streaming writing to MapR-DB (HBase API)
Spark HBase and MapR-DB Binary Connector
•  HConnection object in every Spark Executor:
•  allowing for distributed parallel writes, reads, or scans
Spark HBase streamBulkPut
•  HBaseContext streamBulkPut method parameters:
•  the message value DStream, the TableName to write to, and a function to convert the DStream values to HBase put records
Massively Parallel writes to HBase
The Spark Streaming bulk put enables massively parallel sending of puts to HBase
HBase Schema
To use the Spark HBase Connector for reads, you need to define the Catalog for the schema mapping between HBase and Spark
SparkSQL and DataFrames: Define the Schema
Define the Catalog for the schema mapping between HBase and Spark
Loading data from MapR-DB into a Spark DataFrame
Use Catalog defining schema
Spark Dataframes combine filters and select
The filter keeps rows whose cluster id (the beginning of the row key) is >= 9. The select picks a set of columns: key, lat, and lon.
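The same filter-then-select shape can be shown on an ordinary Scala collection. Everything specific here is an assumption for illustration: the `StoredTrip` record, the sample rows, and the row key layout "<clusterId>_<timestamp>". The sketch also makes one general caveat visible: comparing string keys is lexicographic, so a key starting with "10" sorts before "9".

```scala
case class StoredTrip(key: String, lat: Double, lon: Double, base: String)

// Hypothetical rows; the key layout "<clusterId>_<timestamp>" is assumed.
val storedTrips = Seq(
  StoredTrip("3_2014-08-01 00:00:00", 40.729, -73.9422, "B02598"),
  StoredTrip("9_2014-08-01 00:05:00", 40.3495, -74.0667, "B02682")
)

// Filter on the row key, then keep only the key, lat, and lon columns.
val selectedTrips = storedTrips
  .filter(t => t.key.compareTo("9") >= 0)
  .map(t => (t.key, t.lat, t.lon))

// Lexicographic comparison: "10_..." is less than "9".
val tenSortsBeforeNine = "10_2014-08-01 00:00:00".compareTo("9") < 0
```

Putting the cluster id at the front of the row key is what lets a key-range filter like this select whole clusters.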
Use Case: Real-Time Data Pipelines
[Diagram: components: real-time flight data input stream; consumers with ML Models 1, 2, and 3; a decoy consumer writing to an archive stream; a scores stream carrying input data + predictions and results; an actual-delay input stream carrying input data + actual delay; SQL stores holding input data + predictions + actual delay; real-time dashboard + historical analysis]
To Learn More:
•  MapR Free ODT http://learn.mapr.com/
MapR Blog
• https://www.mapr.com/blog/
MapR Container for Developers
• https://maprdocs.mapr.com/home/MapRContainerDevelopers/MapRContainerDevelopersOverview.html
…helping you put data technology to work
●  Find answers
●  Ask technical questions
●  Join on-demand training course discussions
●  Follow release announcements
●  Share and vote on product ideas
●  Find Meetup and event listings
Connect with fellow Apache Hadoop and Spark professionals
community.mapr.com
Building a Complete Data Architecture
[Diagram: MapR Converged Data Platform: MapR File System (MapR-XD), MapR Database (MapR-DB), MapR Event Streams; surrounding labels: Sources/Apps, Bulk Processing, Stream Processing]
Q&A
ENGAGE WITH US

More Related Content

What's hot

Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Carol McDonald
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBCarol McDonald
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Carol McDonald
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBCarol McDonald
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Carol McDonald
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUsCarol McDonald
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...MapR Technologies
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataMathieu Dumoulin
 
Real time big data applications with hadoop ecosystem
Real time big data applications with hadoop ecosystemReal time big data applications with hadoop ecosystem
Real time big data applications with hadoop ecosystemChris Huang
 
Scaling big-data-mining-infra2
Scaling big-data-mining-infra2Scaling big-data-mining-infra2
Scaling big-data-mining-infra2Chris Huang
 
Approaching real-time-hadoop
Approaching real-time-hadoopApproaching real-time-hadoop
Approaching real-time-hadoopChris Huang
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Mathieu Dumoulin
 
When Streaming Becomes Strategic
When Streaming Becomes StrategicWhen Streaming Becomes Strategic
When Streaming Becomes StrategicMapR Technologies
 
MapR and Machine Learning Primer
MapR and Machine Learning PrimerMapR and Machine Learning Primer
MapR and Machine Learning PrimerMathieu Dumoulin
 

Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase

  • 1. © 2017 MapR Technologies. Applying Machine Learning to IoT: End-to-End Distributed Pipeline for Real-Time Uber Data Using Apache APIs: Kafka, Spark, HBase. Carol McDonald, @caroljmcdonald
  • 2. Uber trip cluster dashboard
  • 3. Use Case: Real-Time Analysis of Geographically Clustered Vehicles. [Pipeline diagram. Stages: Data, Collect, Process, Store, Analyze. Components: input stream, Spark Streaming with ML model, enriched stream, Spark Streaming, HBase, SQL]
  • 4. Fast Data Pipeline for Uber Data Using Apache APIs: Kafka, Spark, HBase • Why combine Machine Learning with Streaming Events? • Introduction to Machine Learning and Spark • Introduction to Kafka & Spark Streaming • Spark Streaming and NoSQL HBase. Note: this code example is from me; only the data is from Uber
  • 5. Why combine Streaming Events with Machine Learning?
  • 6. What’s a Stream? An unbounded sequence of events. [Diagram labels: Producers, Events, Events_Stream, Consumers]
  • 7. Why Stream Processing? It’s becoming important to process events as they arrive. [Diagram: a temperature event (6:05 P.M.: 90°) on a stream topic leads to “Turn on the air conditioning!”]
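The air-conditioning example on slide 7 can be sketched as a rule applied to each event as it arrives. This is a minimal illustration in plain Python, not the deck's Kafka/Spark code: `TemperatureEvent`, `react`, and the 85° threshold are hypothetical names and values chosen for the sketch, and the first reading is made up.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class TemperatureEvent:
    """One reading from the temperature stream (hypothetical shape)."""
    time: str
    degrees_f: float


# Assumed threshold; the slide only shows that 90° triggers the action.
THRESHOLD_F = 85.0


def react(events: Iterable[TemperatureEvent],
          threshold: float = THRESHOLD_F) -> Iterator[str]:
    """Yield an action for each hot reading, one event at a time.

    Because this is a generator over an iterable, it also works on an
    unbounded source: nothing waits for the stream to end.
    """
    for event in events:
        if event.degrees_f >= threshold:
            yield f"{event.time}: {event.degrees_f:.0f}° -> turn on the air conditioning"


# The 5:55 reading is invented for illustration; the 6:05 one is from the slide.
readings = [TemperatureEvent("5:55 P.M.", 78), TemperatureEvent("6:05 P.M.", 90)]
actions = list(react(readings))
```

In a real pipeline the `readings` list would be replaced by a consumer reading from a stream topic; the per-event rule stays the same.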
  • 8. Why combine Streaming Events with Machine Learning? Fraud detection, Smart Machinery, Utility Smart Meters, Home Automation, Networks, Manufacturing, Security Systems, Patient Monitoring
  • 9. Why combine IoT/Streaming Events with Machine Learning? • Audi and Daimler deep learning for autonomous vehicles – Using MapR platform to scale deep learning efforts https://mapr.com/company/press-releases/norcom-selects-mapr-deep-learning/
  • 10. Why combine IoT with Machine Learning? • Monitoring devices combined with ML can provide alerts for sepsis, which is one of the leading causes of death in hospitals – http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic-patient-records
  • 11. Why combine IoT with Machine Learning? • A Stanford team has shown that a machine-learning model can identify heart arrhythmias from an electrocardiogram (ECG) better than an expert – https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/
  • 12. Applying Machine Learning to Live Patient Data • https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-patient-data
  • 13. What if BP had detected problems before the oil hit the water? • 1M samples/sec • High performance at scale is necessary!
  • 14. Why combine IoT/Streaming Events with Machine Learning? • ML & location data: – identify behavior patterns and trends – telecom, travel, marketing... • http://www.cisco.com/c/en/us/solutions/industries/smart-connected-communities.html
  • 15. Why combine IoT with Machine Learning? • Uber Near Realtime Price Surging – https://www.slideshare.net/ConfluentInc/kafka-uber-the-worlds-realtime-transit-infrastructure-aaron-schildkrout • (however, this example code is mine)
  • 16. What has changed in the past 10 years? • Distributed computing • Streaming analytics • Improved machine learning
  • 17. Intro to Machine Learning
  • 18. What is Machine Learning? [Diagram: Data (contains patterns) → Train Algorithm (finds patterns) → Build Model; New Data → Use Model (prediction function, recognizes patterns) → Predictions]
  • 19. ML Discovery and Model Building. [Diagram labels. Data Discovery, Model Creation: Historical Data, Feature Extraction, Training Set, Model Training/Building, Test Set, Test Model Predictions, Evaluate Results. Production: New Data (Uber trips, Stream Topic), Feature Extraction, Deployed Model, Insights. Stray label: Churn Modelling]
  • 20. End to End Application Architecture
  • 21. What is Supervised Machine Learning? Machine Learning: Supervised • Classification – Naïve Bayes – SVM – Random Decision Forests • Regression – Linear – Logistic; Unsupervised • Clustering – K-means • Dimensionality reduction – Principal Component Analysis – SVD
  • 22. Supervised Algorithms Use Labeled Data. [Diagram: Data (features X1, X2 with label Y) → Build Model, f(X1, X2) = Y; New Data (features X1, X2) → Use Model → Predict Y]
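The build-model / use-model split on slide 22 can be sketched with a toy classifier. This is a minimal illustration, assuming a 1-nearest-neighbor rule as a stand-in for whatever algorithm is trained (it is not an algorithm named on the slides); `build_model` and the training values are hypothetical and made up for the example.

```python
import math
from typing import Callable, List, Tuple

Features = Tuple[float, float]           # (X1, X2)
LabeledRow = Tuple[Features, str]        # ((X1, X2), Y)


def build_model(rows: List[LabeledRow]) -> Callable[[Features], str]:
    """'Build Model': turn labeled data into a prediction function f(X1, X2) = Y.

    The 'model' here simply remembers the labeled rows and answers with the
    label of the closest one (1-nearest-neighbor, Euclidean distance).
    """
    if not rows:
        raise ValueError("labeled training data is required")

    def predict(x: Features) -> str:
        nearest = min(rows, key=lambda row: math.dist(row[0], x))
        return nearest[1]

    return predict


# Made-up labeled data: features (X1, X2) with a known label Y.
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((4.8, 5.3), "B"),
]

f = build_model(training)        # build the model from labeled data
prediction = f((4.5, 4.9))       # 'Use Model': predict Y for new, unlabeled features
```

The point of the sketch is the shape of the workflow: labels are needed only when building `f`; new data supplies features alone.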
  • 23. © 2017 MapR Technologies Supervised Machine Learning: Classification & Regression Classification Identifies category for item
  • 24. © 2017 MapR Technologies Classification: Definition Form of ML that: •  Identifies which category an item belongs to •  Uses supervised learning algorithms –  Data is labeled Sentiment
  • 25. © 2017 MapR Technologies If it Walks/Swims/Quacks Like a Duck …… Then It Must Be a Duck swims walks quacks Features: walks quacks swims Features:
  • 26. © 2017 MapR Technologies Debit Card Fraud Example •  What are we trying to predict? –  This is the Label or Target outcome: –  Fraud or Not Fraud •  What are the “if questions” or properties we can use to predict? –  These are the Features: –  Is the amount spent today > historical average? –  Unusual region for card history ? –  Known merchant or not ?
  • 27. © 2017 MapR Technologies Decision Tree For Classification •  Tree of decisions about features •  IF THEN ELSE questions using features at each tree node •  Answers branch to child nodes Is the amount spent in 24 hours > average Is the number of states used from > 2 Are there multiple Purchases today from risky merchants? YES NO NoYES Fraud 90% Not Fraud 50% Fraud 90% Not Fraud 30% YES No
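A decision tree is a chain of IF/THEN/ELSE tests on features, which can be sketched in a few lines of plain Scala. The `Txn` fields, the tree shape, and the thresholds below are assumptions made for illustration; they are loosely based on the three questions on the slide and are not a trained model.

```scala
// Hypothetical transaction features, loosely based on the questions on the slide.
final case class Txn(amount24h: Double, avgAmount: Double, statesUsed: Int, riskyPurchasesToday: Int)

object FraudTree {
  // Each `if` is one tree node; each returned string is a leaf label.
  // The shape of the tree is assumed, not learned from data.
  def classify(t: Txn): String =
    if (t.amount24h > t.avgAmount) {
      // Spent more than usual: check how many states the card was used in.
      if (t.statesUsed > 2) "fraud" else "not fraud"
    } else {
      // Normal spend: check for several purchases from risky merchants today.
      if (t.riskyPurchasesToday > 1) "fraud" else "not fraud"
    }
}
```

A learning algorithm such as a random forest would choose the questions and thresholds from labeled data instead of having them written by hand.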
  • 28. © 2017 MapR Technologies What is Unsupervised Machine Learning? Machine Learning Unsupervised •  Clustering –  K-means •  Dimensionality reduction –  Principal Component Analysis –  SVD Supervised •  Classification –  Naïve Bayes –  SVM –  Random Decision Forests •  Regression –  Linear –  Logistic
  • 29. © 2017 MapR Technologies Unsupervised Algorithms use Unlabeled data Customer GroupsBuild ModelTrain Algorithm Finds patterns New Customer Purchase Data Use Model Similar Customer Group Contains patterns Recognizes patterns Customer purchase data
  • 30. © 2017 MapR Technologies Unsupervised Machine Learning: Clustering Clustering group news articles into different categories
  • 31. © 2017 MapR Technologies Clustering: Definition •  Unsupervised learning task •  Groups objects into clusters of high similarity
  • 32. © 2017 MapR Technologies Clustering: Definition •  Unsupervised learning task •  Groups objects into clusters of high similarity –  Search results grouping –  Grouping of customers, patients –  Text categorization –  recommendations •  Anomaly detection: find what’s not similar
  • 33. © 2017 MapR Technologies Clustering: Example •  Group similar objects
  • 34. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) x x x x x
  • 35. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid x x x x x
  • 36. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid 3.  Update centroids to center of points x x x x x
  • 37. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid 3.  Update centroids to center of points 4.  Repeat until conditions met x x x x x
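The four steps above can be sketched in plain Scala without Spark. This is a minimal illustration of the assign/update loop, not MLlib's implementation; the names `KMeansSketch`, `nearest`, `update`, and `fit` are hypothetical, and step 1 (choosing the initial centroids) is left to the caller.

```scala
import scala.annotation.tailrec

object KMeansSketch {
  type Point = (Double, Double)

  private def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the centroid nearest to p (squared Euclidean distance).
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  // Step 3: move each centroid to the mean of the points assigned to it.
  // A centroid with no assigned points stays where it is.
  def update(points: Seq[Point], centroids: Vector[Point]): Vector[Point] = {
    val groups = points.groupBy(p => nearest(p, centroids))
    centroids.indices.toVector.map { i =>
      groups.get(i) match {
        case Some(ps) => (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)
        case None     => centroids(i)
      }
    }
  }

  // Step 4: repeat until the centroids stop moving or maxIter is reached.
  @tailrec
  def fit(points: Seq[Point], centroids: Vector[Point], maxIter: Int): Vector[Point] = {
    val next = update(points, centroids)
    if (maxIter <= 1 || next == centroids) next else fit(points, next, maxIter - 1)
  }
}
```

For example, `KMeansSketch.fit(points, initialCentroids, 10)` returns the final centroids; a production run would also use a convergence tolerance rather than exact equality.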
  • 38. © 2017 MapR Technologies Spark intro
  • 39. © 2017 MapR Technologies Apache Spark Streaming •  Task scheduling •  Memory Management •  Fault recovery •  Interacting with storage systems •  DataFrame API •  Catalyst Optimizer •  Processing of live streams •  Micro-batching •  Machine Learning •  Multiple types of ML algorithms •  Graph processing •  Graph parallel computations Distributed Parallel Cluster computing Programming Framework
  • 40. © 2017 MapR Technologies Spark Distributed Datasets Dataset W Executor P4 W Executor P1 P3 W Executor P2 partitioned Partition 1 8213034705, 95, 2.927373, jake7870, 0…… Partition 2 8213034705, 115, 2.943484, Davidbresler2, 1…. Partition 3 8213034705, 100, 2.951285, gladimacowgirl, 58… Partition 4 8213034705, 117, 2.998947, daysrus, 95…. •  Read-only collection of typed objects: Dataset[T] •  Partitioned across a cluster •  Operated on in parallel •  Can be cached in memory
  • 41. © 2017 MapR Technologies Example loading a Dataset val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv("data/uber.csv").as[Uber] df.cache df.count Worker Worker Worker Driver Block 1 Block 2 Block 3
  • 42. © 2017 MapR Technologies Example: Worker Worker Worker Block 1 Block 2 Block 3 Driver tasks tasks tasks
  • 43. © 2017 MapR Technologies Example Worker Worker Worker Block 1 Block 2 Block 3 Driver Read HDFS Block Read HDFS Block Read HDFS Block
  • 44. © 2017 MapR Technologies Example Worker Worker Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data
  • 45. © 2017 MapR Technologies Example: Worker Worker Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 results results results val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv("data/uber.csv").as[Uber] df.cache df.count res9: Long = 829275
  • 46. © 2017 MapR Technologies Example Worker Worker Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv("data/uber.csv").as[Uber] df.cache df.count df.show
  • 47. © 2017 MapR Technologies Example Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 tasks tasks tasks Driver
  • 48. © 2017 MapR Technologies Example: Log Mining Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Driver Process from Cache Process from Cache Process from Cache Cached, does not have to read from file again val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv("data/uber.csv").as[Uber] df.cache df.count df.show
  • 49. © 2017 MapR Technologies Example: Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Driver results results results val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv("data/uber.csv").as[Uber] df.cache df.count df.show
  • 50. © 2017 MapR Technologies Example: Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Driver Cache your data → Faster Results val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv("data/uber.csv").as[Uber] df.cache df.count df.show
  • 51. © 2017 MapR Technologies Spark Use Cases Iterative Algorithms on large amounts of data Some Algorithms that need iterations •  Clustering (K-Means) •  Linear Regression •  Graph Algorithms (e.g., PageRank) •  Alternating Least Squares ALS Some Example Use Cases: •  Anomaly detection •  Classification •  Recommendations
  • 52. © 2017 MapR Technologies Cluster Uber Trip Locations
  • 53. © 2017 MapR Technologies Part 1: Spark Machine Learning •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/
  • 54. © 2017 MapR Technologies Zeppelin Notebook with Spark Data Engineer Data Scientist
  • 55. © 2017 MapR Technologies Spark ML workflow
  • 56. © 2017 MapR Technologies Uber Data •  Date/Time: The date and time of the Uber pickup •  Lat: The latitude of the Uber pickup •  Lon: The longitude of the Uber pickup •  Base: The TLC base company affiliated with the Uber pickup The Data Records are in CSV format. An example line is shown below: •  2014-08-01 00:00:00,40.729,-73.9422,B02598
  • 57. © 2017 MapR Technologies Load the data into a Dataframe: Define the Schema case class Uber(dt: String, lat: Double, lon: Double, base: String) val schema = StructType(Array( StructField("dt", TimestampType, true), StructField("lat", DoubleType, true), StructField("lon", DoubleType, true), StructField("base", StringType, true) )) Input Comma Separated Values: datetime, latitude, longitude, base 2014-08-01 00:00:00,40.729,-73.9422,B02598
  • 58. © 2017 MapR Technologies Dataset merged with Dataframe •  In Spark 2.0, the DataFrame API was merged with the Dataset API •  A Dataset is a collection of typed objects (SQL and functions) •  A DataFrame is a Dataset of generic Row objects (SQL)
  • 59. © 2017 MapR Technologies Data Frame Load data Load the data into a Dataframe val schema = StructType(Array( StructField("dt", TimestampType, true), StructField("lat", DoubleType, true), StructField("lon", DoubleType, true), StructField("base", StringType, true) )) val df = spark.read.option("inferSchema", "false").schema(schema) .csv("/user/user01/data/uber.csv") df.show
  • 60. © 2017 MapR Technologies Load the data into a Dataframe Dataframe row columns
  • 61. © 2017 MapR Technologies Dataset Load data Load the data into a Dataset case class Uber(dt: String, lat: Double, lon: Double, base: String) extends Serializable val schema = StructType(Array( StructField("dt", TimestampType, true), StructField("lat", DoubleType, true), StructField("lon", DoubleType, true), StructField("base", StringType, true) )) val df = spark.read.option("inferSchema", "false").schema(schema) .csv("/user/user01/data/uber.csv") .as[Uber] df.show
  • 62. © 2017 MapR Technologies Load the data into a Dataset Dataset Collection of Uber objects columns row
  • 63. © 2017 MapR Technologies Uber Example •  What are the “if questions” or properties we can use to group? –  These are the Features: –  We will group by latitude and longitude •  Use Spark SQL to analyze: Day of the week, time, rush hour … NEAR REALTIME PRICE SURGING
  • 64. © 2017 MapR Technologies Extract the Features Image reference O’Reilly Learning Spark + + ̶+ ̶ ̶ Feature Vectors Model Featurization Training Model Evaluation Best Model Training Data + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ Feature Vectors are vectors of numbers representing the value for each feature
  • 65. © 2017 MapR Technologies Use VectorAssembler to put features in vector column val featureCols = Array("lat", "lon") val assembler = new VectorAssembler() .setInputCols(featureCols) .setOutputCol("features") val df2 = assembler.transform(df) Data Frame Load data transform DataFrame + Features
  • 66. © 2017 MapR Technologies Data Frame Load data transform Estimator val kmeans = new KMeans() .setK(8) .setFeaturesCol("features") .setMaxIter(5) Create Kmeans Estimator, Set Features DataFrame + Features
  • 67. © 2017 MapR Technologies Data Frame Load data transform Estimator val model = kmeans.fit(df2) Fit the Model on the Training Data Features DataFrame + Features fit fitted model input
  • 68. © 2017 MapR Technologies Data Frame Load data transform Estimator model.clusterCenters.foreach(println) [40.76930621976264,-73.96034885367698] [40.67562793272868,-73.79810579052476] [40.68848772848041,-73.9634449047477] [40.78957777777776,-73.14270740740741] [40.32418330308531,-74.18665245009073] [40.732808848486286,-74.00150153727878] [40.75396549974632,-73.57692359208531] [40.901700842900674,-73.868760398198] Clusters from fitted model DataFrame + Features fit fitted model input
  • 69. © 2017 MapR Technologies fitted model Analyze Clusters summary val clusters = model.summary.predictions clusters.show() prediction DataFrame + Features + prediction
  • 70. © 2017 MapR Technologies fitted model Transform new data, adds column with Clusters transform features val clusters = model.transform(newdata) prediction DataFrame + Features DataFrame + Features + prediction
  • 71. © 2017 MapR Technologies fitted model Save the model to distributed file system save model.write.overwrite().save("/path/savemodel") Use later val sameModel = KMeansModel.load("/user/user01/data/savemodel") DataFrame + Features
  • 72. © 2017 MapR Technologies Kafka API and Streaming Data
  • 73. © 2017 MapR Technologies Part 2: MapR Event Streams with Kafka API and Spark Streaming •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/
  • 74. © 2017 MapR Technologies Serve DataStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? ?
  • 75. © 2017 MapR Technologies Collect the Data Data IngestSource Stream Topic •  Data Ingest: –  Network Based: MapR Event Streams using the Kafka API
  • 76. © 2017 MapR Technologies Organize Data into Topics with MapR Streams Topics Organize Events into Categories and Decouple Producers from Consumers Consumers MapR Cluster Topic: Pressure Topic: Temperature Topic: Warnings Consumers Consumers Kafka API Kafka API
  • 77. © 2017 MapR Technologies Scalable Messaging with MapR Streams Server 1 Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Server 2 Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Server 3 Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Topics are partitioned for throughput and scalability
  • 78. © 2017 MapR Technologies Scalable Messaging with MapR Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Producers are load balanced between partitions Kafka API
  • 79. © 2017 MapR Technologies Scalable Messaging with MapR Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Consumers Consumers Consumers Consumer groups can read in parallel Kafka API
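Spreading messages across partitions can be illustrated with a simple key-hash rule. This is only a sketch: `KeyPartitioner` and `partitionFor` are hypothetical names, and `hashCode` is a stand-in for whatever hash function a real Kafka API client uses.

```scala
object KeyPartitioner {
  // Map a message key to one of numPartitions partitions.
  // floorMod keeps the result non-negative even when hashCode is negative.
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)
}
```

Because the same key always maps to the same partition, messages that share a key keep their relative order, while different keys spread the load across servers.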
  • 80. © 2017 MapR Technologies Partition is like a Queue Consumers MapR Cluster Topic: Admission / Server 1 Topic: Admission / Server 2 Topic: Admission / Server 3 Consumers Consumers Partition 1 New Messages are appended to the end Partition 2 Partition 3 6 5 4 3 2 1 3 2 1 5 4 3 2 1 Producers Producers Producers New Message 6 5 4 3 2 1 Old Message
  • 81. © 2017 MapR Technologies Events are delivered in the order they are received, like a queue messages are delivered in the order they are received MapR Cluster 6 5 4 3 2 1 Consumer groupProducers Read cursors Consumer group
  • 82. © 2017 MapR Technologies Unlike a queue, events are persisted even after they’re delivered Messages remain on the partition, available to other consumers MapR Cluster (1 Server) Topic: Warning Partition 1 3 2 1 Unread Events Get Unread 3 2 1 Client Library ConsumerPoll
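The queue-like ordering and the persistence described on the last three slides can be modelled as an append-only log with one read cursor per consumer group. The class below is a toy in-memory sketch with hypothetical names (`PartitionLog`, `append`, `poll`); it is not the MapR Event Streams or Kafka client API.

```scala
import scala.collection.mutable

// One partition: an append-only log plus one read cursor (offset) per consumer group.
final class PartitionLog[A] {
  private val log = mutable.ArrayBuffer.empty[A]
  private val cursors = mutable.Map.empty[String, Int].withDefaultValue(0)

  // New messages are appended to the end; the return value is the message's offset.
  def append(msg: A): Int = { log += msg; log.size - 1 }

  // Return the messages this group has not read yet, in arrival order,
  // and advance only that group's cursor. Nothing is removed from the log.
  def poll(group: String): Vector[A] = {
    val unread = log.slice(cursors(group), log.size).toVector
    cursors(group) = log.size
    unread
  }
}
```

A second consumer group that polls later still sees every message from offset 0, which is what lets the same stream feed both a dashboard and an archive.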
  • 83. © 2017 MapR Technologies How do we do this with High Performance at Scale? •  Parallel operations •  minimizes disk read/writes
  • 84. © 2017 MapR Technologies Processing Same Message for Different Purposes Consumers Consumers Consumers Producers Producers Producers MapR-FS Kafka API Kafka API
  • 85. © 2017 MapR Technologies Machine Learning Logistics Input Data + Actual Delay Input Data + Predictions Consumer withML Model 2 Consumer withML Model 1 Decoy results Consumer Consumer withML Model 3 Consumer Stream Archive Stream Scores Stream Input SQL SQL Real time Data Stream Input Delayed data Input Data + Predictions + Actual Delay Real Time dashboard + Historical Analysis
  • 86. © 2017 MapR Technologies Use the Model with Streaming Data
  • 87. © 2017 MapR Technologies Collect Data Process the Data with Spark Streaming and Spark Machine Learning Process Data Stream Topic •  Extension of the core Spark API •  Scalable, high-throughput, fault-tolerant stream processing
  • 88. © 2017 MapR Technologies ML Discovery Model Building Model Training/ Building Training Set Test Model Predictions Test Set Evaluate Results Historical Data Deployed Model Insights Data Discovery, Model Creation Production Feature Extraction Feature Extraction Uber trips Stream TopicUber trips New Data
  • 89. © 2017 MapR Technologies Use Case: Real-Time Analysis of Geographically Clustered Vehicles Uber trip data enrich with K-means Cluster location Stream Topic Stream Topic Spark Streaming Spark Streaming Write to MapR-DB SQL
  • 90. © 2017 MapR Technologies Use Case: Time Series Data Uber trip data Stream Topic 2014-08-01 00:00:00, 40.729,-73.9422,B02598 {"dt":"2014-08-01 00:00:00.0", "lat":40.3495,"lon":-74.0667, "base":"B02682","cluster":5} Enrich with K-means cluster id Spark Streaming read Stream Topic
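The enrichment step (CSV in, JSON with a cluster id out) can be sketched without Spark as follows. The object name `Enrich` and its helpers are hypothetical, the centers are rounded copies of the cluster centers printed earlier in the deck, and the nearest-center rule is an assumption about how a fitted K-means model assigns a point; the real pipeline calls `model.transform` instead.

```scala
object Enrich {
  final case class Uber(dt: String, lat: Double, lon: Double, base: String)

  // Cluster centers rounded from the values printed by model.clusterCenters.
  val centers: Vector[(Double, Double)] = Vector(
    (40.7693, -73.9603), (40.6756, -73.7981), (40.6885, -73.9634), (40.7896, -73.1427),
    (40.3242, -74.1867), (40.7328, -74.0015), (40.7540, -73.5769), (40.9017, -73.8688))

  // Parse one CSV line: datetime,latitude,longitude,base
  def parse(line: String): Uber = {
    val f = line.split(",").map(_.trim)
    Uber(f(0), f(1).toDouble, f(2).toDouble, f(3))
  }

  // Index of the nearest center by squared Euclidean distance on (lat, lon).
  def cluster(u: Uber): Int =
    centers.indices.minBy { i =>
      val (clat, clon) = centers(i)
      val dLat = u.lat - clat
      val dLon = u.lon - clon
      dLat * dLat + dLon * dLon
    }

  // Hand-built JSON for brevity; a real job would use a JSON library.
  def toJson(u: Uber): String =
    s"""{"dt":"${u.dt}","lat":${u.lat},"lon":${u.lon},"base":"${u.base}","cluster":${cluster(u)}}"""
}
```

Calling `Enrich.toJson(Enrich.parse(line))` on each incoming line yields the enriched message that would be published to the second topic. Cluster ids depend on the fitted model, so they need not match the ids shown on the slide.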
  • 91. © 2017 MapR Technologies Processing Spark DStreams Data stream divided into batches of X milliseconds = DStreams
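Micro-batching can be pictured as bucketing events by time interval. The sketch below groups timestamped events into fixed-width batches in plain Scala; `MicroBatch.batches` is a hypothetical helper, and the per-event timestamp is an assumed stand-in for the arrival time that a streaming engine would use.

```scala
object MicroBatch {
  // Group (timestampMillis, payload) events into consecutive batches.
  // Key = batch index; batch i covers [i * batchMillis, (i + 1) * batchMillis).
  def batches[A](events: Seq[(Long, A)], batchMillis: Long): Map[Long, Seq[A]] =
    events
      .groupBy { case (ts, _) => ts / batchMillis }
      .map { case (i, es) => i -> es.map(_._2) }
}
```

Each batch is then processed as one small, bounded dataset, which is why the same DataFrame and ML code can be reused on streaming data.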
  • 92. © 2017 MapR Technologies Load the saved model
      // load the saved model for getting clusters
      val model = KMeansModel.load(modelpath)
  • 93. © 2017 MapR Technologies Create a DStream DStream: a sequence of RDDs representing a stream of data
      val messagesDStream = KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent, consumerStrategy)
      // get the message values from the (key, value) records
      val uDStream = messagesDStream.map(_.value())
    batch time 0 to 1 batch time 1 to 2 batch time 2 to 3 dStream Stored in memory as an RDD
  • 94. © 2017 MapR Technologies Parse message txt to Uber Object and convert to DataFrame
      uDStream.foreachRDD { rdd =>
        // get cluster centers and add to df
        // send to Topic
      }
      ssc.start()
      ssc.awaitTermination()
  • 95. © 2017 MapR Technologies Enrich Data with Cluster
  • 96. © 2017 MapR Technologies Convert to JSON send to Topic, Send the Enriched Message
  • 97. © 2017 MapR Technologies Process DStream Streaming Application Output dStream batch time 2 to 3 batch time 1 to 2 batch time 0 to 1 Result DStream Transformed RDDs map map map Stream Topic
  • 98. © 2017 MapR Technologies Real Time Dashboard
  • 99. © 2017 MapR Technologies Part 3: Realtime Dashboard using Vert.x •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-uber-with-spark-streaming-kafka-and-vertx/
  • 100. © 2017 MapR Technologies Serve DataCollect Data Serving the Data MapR-FS Process DataData Sources Stream Topic
  • 101. © 2017 MapR Technologies Use Case: Real-Time Analysis of Geographically Clustered Vehicles Uber trip data enrich with K-means Cluster location Stream Topic Stream Topic Spark Streaming Spark Streaming Write to MapR-DB SQL
  • 102. © 2017 MapR Technologies Use Case Dashboard
  • 103. © 2017 MapR Technologies The Vert.x toolkit and Web Application Architecture •  Event-driven •  Event Bus •  Verticles single threaded
  • 104. © 2017 MapR Technologies Dashboard Architecture
  • 105. © 2017 MapR Technologies The Dashboard Vert.x HTML5 Javascript Client
  • 106. © 2017 MapR Technologies Initializing the Heatmap
  • 107. © 2017 MapR Technologies Dashboard Architecture
  • 108. © 2017 MapR Technologies Creating the Vertx EventBus •  create an instance of the vertx.EventBus object •  add an onopen listener, which registers an event bus handler for the address “dashboard.” •  handler will receive all messages published to the “dashboard” address
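The register-then-publish pattern described above can be sketched as a tiny in-memory bus. `MiniEventBus` is a hypothetical stand-in written in Scala for illustration; it is not the Vert.x EventBus API, which the dashboard uses from JavaScript.

```scala
import scala.collection.mutable

// Address-based publish/subscribe: every handler registered on an address
// receives every message published to that address.
final class MiniEventBus {
  private val handlers =
    mutable.Map.empty[String, Vector[String => Unit]].withDefaultValue(Vector.empty)

  def register(address: String)(handler: String => Unit): Unit =
    handlers(address) = handlers(address) :+ handler

  def publish(address: String, message: String): Unit =
    handlers(address).foreach(h => h(message))
}
```

In the dashboard, the handler registered on the "dashboard" address would parse each JSON message and add the point to the heatmap.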
  • 109. © 2017 MapR Technologies Add Event Trip location points to Map Parse JSON message
  • 110. © 2017 MapR Technologies Add Event Trip location points to Map Add latitude and longitude points to heatmap
  • 111. © 2017 MapR Technologies Add Event Trip location points to Map If cluster center is new then add marker
  • 112. © 2017 MapR Technologies Spark and HBase
  • 113. © 2017 MapR Technologies Part 4: using MapR-DB with HBase API •  https://mapr.com/blog/monitoring-uber-pt4/
  • 114. © 2017 MapR Technologies Serve DataStore DataCollect Data What Do We Need to Do ? MapR-FS Process DataData Sources MapR-FS Stream Topic
  • 115. © 2017 MapR Technologies Use Case: Real-Time Analysis of Geographically Clustered Vehicles Uber trip data enrich with K-means Cluster location Stream Topic Stream Topic Spark Streaming Spark Streaming Write to MapR-DB SQL
  • 116. © 2017 MapR Technologies MapR-DB (HBase API) is Designed to Scale Key Range xxxx xxxx Key Range xxxx xxxx Key Range xxxx xxxx Fast Reads and Writes by Key! Data is automatically partitioned by Key Range! Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  • 117. © 2017 MapR Technologies Store Lots of Data with NoSQL MapR-DB bottleneck Storage Model RDBMS MapR-DB Normalized schema → Joins for queries can cause bottleneck De-Normalized schema → Data that is read together is stored together Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  • 118. © 2017 MapR Technologies HBase Schema With HBase/MapR-DB data is automatically partitioned by Key Range
  • 119. © 2017 MapR Technologies Spark Streaming writing to MapR-DB (HBase API)
  • 120. © 2017 MapR Technologies Spark HBase and MapR-DB Binary Connector •  HConnection object in every Spark Executor: •  allowing for distributed parallel writes, reads, or scans
  • 121. © 2017 MapR Technologies Spark HBase streamBulkPut •  HBaseContext streamBulkPut method parameters: •  the message value DStream, the TableName to write to, and a function that converts the DStream values to HBase Put records.
  • 122. © 2017 MapR Technologies Massively Parallel writes to HBase The Spark Streaming bulk put enables massively parallel sending of puts to HBase
  • 123. © 2017 MapR Technologies HBase Schema To use the Spark HBase Connector for reads, you need to define the Catalog for the schema mapping between HBase and Spark
  • 124. © 2017 MapR Technologies SparkSQL and DataFrames: Define the Schema Define the Catalog for the schema mapping between HBase and Spark
  • 125. © 2017 MapR Technologies Loading data from MapR-DB into a Spark DataFrame Use Catalog defining schema
  • 126. © 2017 MapR Technologies Spark DataFrames: combine filter and select The filter keeps rows whose cluster id (the beginning of the row key) is >= 9. The select picks a set of columns: key, lat, and lon.
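Filtering on the start of the row key amounts to a range scan over sorted keys. The sketch below models the table as a sorted map; the key layout `cluster_datetime` and the names `RowKeyScan`, `rowKey`, and `scanFrom` are assumptions for illustration, not the connector's API.

```scala
import scala.collection.immutable.TreeMap

object RowKeyScan {
  final case class Trip(lat: Double, lon: Double, base: String)

  // Hypothetical row key: cluster id first, then the pickup time.
  def rowKey(cluster: Int, dt: String): String = s"${cluster}_$dt"

  // Range scan: all rows whose key sorts at or after startKey,
  // projected to (key, lat, lon).
  def scanFrom(table: TreeMap[String, Trip], startKey: String): Vector[(String, Double, Double)] =
    table.iteratorFrom(startKey).map { case (k, t) => (k, t.lat, t.lon) }.toVector
}
```

Note that string keys sort lexicographically, so in this sketch a key starting with "10" sorts before one starting with "9"; zero-padding the cluster id would be one way to keep numeric and key order aligned.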
  • 127. © 2017 MapR Technologies Use Case: Real-Time Data Pipelines Input Data + Actual Delay Input Data + Predictions Consumer withML Model 2 Consumer withML Model 1 Decoy results Consumer Consumer withML Model 3 Consumer Stream Archive Stream Scores Stream Input SQL SQL Real time Flight Data Stream Input Actual Delay Input Data + Predictions + Actual Delay Real Time dashboard + Historical Analysis
  • 129. © 2017 MapR Technologies To Learn More: •  MapR Free ODT http://learn.mapr.com/
  • 130. © 2017 MapR Technologies MapR Blog • https://www.mapr.com/blog/
  • 131. © 2017 MapR Technologies MapR Container for Developers • https://maprdocs.mapr.com/home/MapRContainerDevelopers/MapRContainerDevelopersOverview.html
  • 132. © 2017 MapR Technologies …helping you put data technology to work ●  Find answers ●  Ask technical questions ●  Join on-demand training course discussions ●  Follow release announcements ●  Share and vote on product ideas ●  Find Meetup and event listings Connect with fellow Apache Hadoop and Spark professionals community.mapr.com
  • 133. © 2017 MapR Technologies Stream Processing Building a Complete Data Architecture MapR File System (MapR-XD) MapR Converged Data Platform MapR Database (MapR-DB) MapR Event Streams Sources/Apps Bulk Processing
  • 134. © 2017 MapR Technologies Q&A ENGAGE WITH US