Real-Time	Analysis	of	Popular	Uber	Locations	
using	Apache	APIs:		
•  Spark	Machine	Learning,		
•  Spark	Structured	Streaming,		
•  Kafka																		
with	MapR-ES	and	MapR-DB
2 © 2018 MapR Technologies, Inc
Agenda
•  Overview of Unsupervised Machine Learning Clustering
•  Use K-Means to Cluster Uber locations and save ML model
•  Overview of Kafka API
•  Use Spark Structured Streaming:
–  To Read from Kafka topic
–  Enrich with ML model
–  Write to MapR-DB JSON document database
•  Use Spark SQL to query MapR-DB database
Use	Case:	Real-Time	Analysis	of	Geographically	Clustered	Vehicles
Intro	to	Machine	Learning
What is Machine Learning?
[Diagram: Data (contains patterns) → Train Algorithm (finds patterns) → Build Model (recognizes patterns); New Data → Use Model (prediction function) → Predictions]
ML Discovery Model Building
[Diagram: Data Discovery, Model Creation: Historical Data (Uber trips) → Feature Extraction → Training Set → Model Training/Building, and Test Set → Test Model Predictions → Evaluate Results. Production: New Data (Uber trips from Stream Topic) → Feature Extraction → Deployed Model → Insights]
Supervised	and	Unsupervised	Machine	Learning	
Machine Learning
Unsupervised
•  Clustering
•  Collaborative Filtering
•  Frequent Pattern Mining
Supervised
•  Classification
•  Regression
Label
Supervised Algorithms use labeled data
[Diagram: Data with features X1, X2 and label Y → Build Model f(X1, X2) = Y; New Data with features X1, X2 → Use Model → Predict Y]
Unsupervised Algorithms use Unlabeled data
[Diagram: Customer purchase data (contains patterns) → Train Algorithm (finds patterns) → Build Model of Customer Groups (recognizes patterns); New Customer Purchase Data → Use Model → Similar Customer Group]
Unsupervised	Machine	Learning:	Clustering	
Clustering
group news articles into different categories
Clustering:	Definition	
Unsupervised	learning	task	
Groups	objects	into	clusters	of	high	similarity	
–  Search	results	grouping	
–  Grouping	of	customers,	patients	
–  Text	categorization	
–  Recommendations
•  Anomaly	detection:	find	what’s	not	similar
Clustering:	Example	
Group	similar	objects	
Use	MLlib	K-means	algorithm	
1.  Initialize the coordinates of the K cluster centers (centroids)
2.  Assign all points to the nearest centroid
3.  Update each centroid to the center of its assigned points
4.  Repeat until the stopping conditions are met
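The four steps above can be sketched in plain Scala for 2-D points. This is only an illustration of the loop, not how Spark MLlib implements it: MLlib runs distributed and uses its own initialization. The names KMeansSketch, nearest, mean, and fit are hypothetical, and the starting centroids are passed in by the caller rather than chosen by the algorithm.

```scala
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two 2-D points.
  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the centroid nearest to p.
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  // Step 3: center (mean) of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Steps 2 to 4: repeat assignment and update until the centroids
  // stop moving or maxIter iterations have run.
  def fit(points: Seq[Point], init: Vector[Point], maxIter: Int = 20): Vector[Point] = {
    var centroids = init // step 1: caller-supplied starting centroids
    var iter = 0
    var changed = true
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centroids))
      val updated = centroids.indices.map { i =>
        // A centroid with no assigned points keeps its position.
        groups.get(i).map(mean).getOrElse(centroids(i))
      }.toVector
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

For example, fitting the points (0,0), (0,2), (10,10), (10,12) with starting centroids (0,0) and (10,10) moves the centroids to the middle of each pair and then stops, because a second pass changes nothing.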
Cluster	Uber	Trip	Locations
How	a	Spark	Application	Runs	on	a	Cluster
Spark Distributed Datasets
•  Read-only collection of typed objects: Dataset[T]
•  Partitioned across a cluster
•  Operated on in parallel
•  Can be cached in memory
Loading a Dataset
Dataset Read From a File
[Diagram: the Driver sends tasks to three Workers; each Worker reads one block of the file (Block 1 to Block 3), then processes and caches the data (Cache 1 to Cache 3)]
Uber Data
Date/Time: The date and time of the Uber pickup
Lat: The latitude of the Uber pickup
Lon: The longitude of the Uber pickup
Base: The TLC base company affiliated with the Uber pickup

The data records are in CSV format. An example line is shown below:
2014-08-01 00:00:00,40.729,-73.9422,B02598
Load the data into a DataFrame: Define the Schema
import org.apache.spark.sql.types._

case class Uber(dt: String, lat: Double, lon: Double, base: String)

// Schema for the CSV columns, in file order
val schema = StructType(Array(
  StructField("dt", TimestampType, true),
  StructField("lat", DoubleType, true),
  StructField("lon", DoubleType, true),
  StructField("base", StringType, true)
))
Load the data into a DataFrame
// file: path to the Uber CSV data
val df = spark.read.format("csv").option("inferSchema", "false")
  .schema(schema).option("header", "false")
  .load(file)
[Figure: the DataFrame, organized into rows and columns]
Load the data into a Dataset
import spark.implicits._ // provides the encoder needed by as[Uber]

val df = spark.read.format("csv").option("inferSchema", "false")
  .schema(schema).option("header", "false")
  .load(file).as[Uber]
[Figure: the Dataset, a collection of Uber objects organized into rows and columns]
Dataset merged with DataFrame
•  In Spark 2.0, the DataFrame API merged with the Dataset API
•  A Dataset is a collection of typed objects (SQL and functions): Dataset[T]
•  A DataFrame is a Dataset of generic Row objects (SQL): Dataset[Row]
Spark Distributed Datasets
•  Transformations create a new Dataset from the current one and are lazily evaluated
•  Actions return a value to the driver
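Lazy evaluation can be illustrated with a loose analogy in plain Scala collections. This is not Spark code: a collection view plays the role of a transformation pipeline, and forcing it with sum plays the role of an action. The names LazySketch, slowDouble, and pipeline are hypothetical.

```scala
object LazySketch {
  // Counts how many times the mapping function has actually run.
  var evaluated = 0

  def slowDouble(x: Int): Int = { evaluated += 1; x * 2 }

  // "Transformations": building the view only describes the work;
  // slowDouble is not called here.
  def pipeline(xs: Seq[Int]) = xs.view.map(slowDouble).filter(_ > 4)
}
```

Calling LazySketch.pipeline(Seq(1, 2, 3, 4)) leaves the counter at zero. The mapping function runs only when a result is requested, for example with sum, in the same way that Spark runs a job only when an action is called.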
Spark ML workflow
Extract the Features
Feature vectors are vectors of numbers representing the value for each feature.
[Diagram: Training Data → Featurization → Feature Vectors → Training → Model → Model Evaluation → Best Model. Image reference: O’Reilly, Learning Spark]
Uber Example
•  What are the “if questions” or properties we can use to group?
–  These are the features
–  We will group by latitude and longitude
•  Use Spark SQL to analyze: day of the week, time, rush hour for groups …
•  NOTE: this example uses real Uber data, but the code is from me, not Uber
[Image: near real-time price surging]
Use VectorAssembler to put features in vector column
import org.apache.spark.ml.feature.VectorAssembler

val featureCols = Array("lat", "lon")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
// df2 is df plus a "features" vector column built from lat and lon
val df2 = assembler.transform(df)
Create KMeans Estimator, Set Features
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(20) // the cluster ids and saved model metadata shown later imply k = 20
  .setFeaturesCol("features")
  .setPredictionCol("cid")
  .setMaxIter(20)
Fit the Model on the Training Data Features
val model = kmeans.fit(df2)
Cluster Centers from fitted model
model.clusterCenters.foreach(println)

[40.76930621976264,-73.96034885367698]
[40.67562793272868,-73.79810579052476]
[40.68848772848041,-73.9634449047477]
[40.78957777777776,-73.14270740740741]
[40.32418330308531,-74.18665245009073]
[40.732808848486286,-74.00150153727878]
[40.75396549974632,-73.57692359208531]
[40.901700842900674,-73.868760398198]
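To make concrete what the fitted model does with a new location, here is a minimal plain-Scala sketch that assigns a (lat, lon) pair to the nearest of the cluster centers printed above. It assumes squared Euclidean distance on raw coordinates and uses only the eight centers shown on the slide, so the returned index is a position in that list. NearestCenterSketch and clusterId are hypothetical names, not part of the Spark API.

```scala
object NearestCenterSketch {
  // Cluster centers as (lat, lon), copied from the fitted-model output.
  val centers: Vector[(Double, Double)] = Vector(
    (40.76930621976264, -73.96034885367698),
    (40.67562793272868, -73.79810579052476),
    (40.68848772848041, -73.9634449047477),
    (40.78957777777776, -73.14270740740741),
    (40.32418330308531, -74.18665245009073),
    (40.732808848486286, -74.00150153727878),
    (40.75396549974632, -73.57692359208531),
    (40.901700842900674, -73.868760398198)
  )

  // Index of the center with the smallest squared distance to (lat, lon).
  def clusterId(lat: Double, lon: Double): Int =
    centers.indices.minBy { i =>
      val (clat, clon) = centers(i)
      val dLat = lat - clat
      val dLon = lon - clon
      dLat * dLat + dLon * dLon
    }
}
```

A pickup at roughly (40.7328, -74.0015) lands on index 5, whose center is almost at the same coordinates.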
Clusters from fitted model
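Analyze Clusters
// Option 1: predictions on the training data, taken from the model summary
val clusters = model.summary.predictions
// Option 2: apply the model to a DataFrame that has a features column
// val clusters = model.transform(df2)
clusters.createOrReplaceTempView("uber")
clusters.show()
[Diagram: K-means model → summary DataFrame with features and cluster id]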
Which clusters had the highest number of pickups?
import org.apache.spark.sql.functions.desc

clusters.groupBy("cid").count().orderBy(desc("count")).show(5)
+---+-----+
|cid|count|
+---+-----+
|  6|83505|
|  5|79472|
|  0|56241|
| 16|26933|
| 13|23581|
+---+-----+
Which	clusters	had	the	highest	number	of	pickups?	
%sql
SELECT COUNT(cid), cid
FROM uber
GROUP BY cid
ORDER BY COUNT(cid) DESC
How	many	pickups	occurred	in	the	busiest	5	clusters	by	hour?	
select hour(uber.dt) as hr,cid, count(cid) as ct
from uber where cid in (0,8,9,13,17)
group By hour(uber.dt), cid
Which	hours		had	the	highest	number	of	pickups?	
SELECT hour(uber.dt) as hr,count(cid) as ct
FROM uber
GROUP BY hour(uber.dt)
Save the model to distributed file system
import org.apache.spark.ml.clustering.KMeansModel

// Save the fitted model
model.write.overwrite().save("/user/user01/data/savemodel")

// Use later: load the model back from the same path
val sameModel = KMeansModel.load("/user/user01/data/savemodel")
The model on the distributed file system
hadoop fs -ls /user/mapr/ubermodel/metadata
/user/mapr/ubermodel/metadata/_SUCCESS
/user/mapr/ubermodel/metadata/part-00000
hadoop fs -ls /user/mapr/ubermodel/data
/user/mapr/ubermodel/data/_SUCCESS
/user/mapr/ubermodel/data/part-00000-4d20b313-ddc1-43cb-a863-434a36330639-c000.snappy.parquet
hadoop fs -cat /user/mapr/ubermodel/metadata/part-00000
{"class":"org.apache.spark.ml.clustering.KMeansModel","timestamp":1540826934502,"sparkVersion":"2.3.1-mapr-1808","uid":"kmeans_4ad427355253","paramMap":{"predictionCol":"cid","seed":1,"initMode":"k-means||","featuresCol":"features","initSteps":2,"maxIter":100,"tol":1.0E-4,"k":20}}
Kafka	API	and	Streaming	Data
Use	Case:	Real-Time	Analysis	of	Geographically	Clustered	Vehicles
What is a Stream?
•  A stream is a continuous sequence of events or records
•  Records are key-value pairs
Examples of Streaming Data
•  Fraud detection
•  Smart machinery
•  Smart meters
•  Home automation
•  Networks
•  Manufacturing
•  Security systems
•  Patient monitoring
Example of Streaming Data combined with Machine Learning
A Stanford team has shown that a machine-learning model can identify arrhythmias from an EKG better than an expert
•  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/
Applying Machine Learning to Live Patient Data
https://mapr.com/blog/ml-iot-connected-medical-devices/
Collect the Data
•  Data Ingest:
–  Using the Kafka API
[Diagram: Source → Data Ingest → Stream Topic]
Organize Data into Topics with the MapR Event Store for Kafka
Topics: a logical collection of events. Topics organize events into categories.
[Diagram: a MapR Cluster holding the topics Pressure, Temperature, and Warnings, accessed through the Kafka API, with three groups of Consumers]
Scalable Messaging with MapR Event Streams
Topics are partitioned for throughput and scalability
[Diagram: Servers 1, 2, and 3 hold Partitions 1, 2, and 3 respectively of each topic (Pressure, Temperature, Warning)]
Producers are load balanced between partitions
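One common way to spread records over partitions is to hash the record key, so that records with the same key always go to the same partition. The sketch below shows that idea in plain Scala. It is an assumption for illustration only: real Kafka API clients use their own partitioning logic, and PartitionerSketch, partitionFor, and distribute are hypothetical names.

```scala
object PartitionerSketch {
  // Map a record key to one of numPartitions partitions.
  // floorMod keeps the result non-negative even when hashCode is negative.
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Group (key, value) records by the partition they would be sent to.
  def distribute(records: Seq[(String, String)],
                 numPartitions: Int): Map[Int, Seq[(String, String)]] =
    records.groupBy { case (key, _) => partitionFor(key, numPartitions) }
}
```

Because the partition depends only on the key, ordering is preserved per key while different keys can be written in parallel to different partitions.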
Consumer groups can read in parallel
Partition is like an Event Log
New messages are added to the end
[Diagram: a partition holding messages 6 5 4 3 2 1, from new message to old message]
Partition is like a Queue
Messages are delivered in the order they are received
Unlike a queue, events are still persisted after they’re delivered
Messages remain on the partition, available to other consumers
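The difference from a queue can be sketched in plain Scala: the partition is an append-only log, and each consumer only tracks its own offset, so reading never removes a message. This is an in-memory illustration under those assumptions, not the Kafka API; PartitionLog and LogConsumer are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer

// One partition: an append-only log. Reading never removes a message.
class PartitionLog {
  private val messages = ArrayBuffer.empty[String]

  // Append at the end and return the new message's offset.
  def append(msg: String): Long = {
    messages += msg
    messages.size - 1L
  }

  // Read up to max messages starting at fromOffset, in arrival order.
  def read(fromOffset: Long, max: Int): Seq[String] =
    messages.slice(fromOffset.toInt, fromOffset.toInt + max).toList
}

// A consumer only remembers its own position (offset) in the log.
class LogConsumer(log: PartitionLog) {
  private var offset = 0L

  // Return the next batch and advance this consumer's offset.
  def poll(max: Int): Seq[String] = {
    val batch = log.read(offset, max)
    offset += batch.size
    batch
  }
}
```

Two LogConsumer instances reading the same PartitionLog each see every message from the start, in order, regardless of what the other has already read.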
When Are Messages Deleted?
Messages can be persisted forever, or older messages can be deleted automatically based on time to live
[Diagram: Partition 1 on the MapR Cluster holding messages 6 5 4 3 2 1; message 1 is the older message]
How do we do this with High Performance at Scale?
•  Parallel operations
•  Minimize disk reads/writes
Processing	Same	Message	for	Different	Purposes
Spark	Structured	Streaming
Process	the	Data	with	Spark	Structured	Streaming
Datasets Read from Stream
[Diagram: the Driver schedules Tasks that read offsets from stream partitions, then process and cache the data. Data is cached for aggregations and windowed functions.]
Treat Stream as Unbounded Tables
Data stream as an unbounded table:
new data in the data stream = new rows appended to an unbounded table
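The unbounded-table idea can be sketched in plain Scala: each micro-batch of new rows is folded into a running result, and the result after any batch equals what one query over all rows seen so far would return. This illustrates the model only, not Spark's implementation; UnboundedTableSketch, update, and run are hypothetical names.

```scala
object UnboundedTableSketch {
  // Result table: number of rows seen so far per key.
  type Counts = Map[String, Long]

  // Fold one micro-batch of newly arrived keys into the running result.
  def update(counts: Counts, newRows: Seq[String]): Counts =
    newRows.foldLeft(counts) { (acc, key) =>
      acc.updated(key, acc.getOrElse(key, 0L) + 1L)
    }

  // Process the batches in arrival order, starting from an empty result.
  def run(batches: Seq[Seq[String]]): Counts =
    batches.foldLeft(Map.empty[String, Long])(update)
}
```

Running the batches Seq("a", "b") and then Seq("a") gives the same counts as a single pass over "a", "b", "a".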
The Stream is continuously processed
Spark automatically streamifies SQL plans
Image reference: Databricks
Stream Processing
Use	Case:	Real-Time	Analysis	of	Geographically	Clustered	Vehicles
Load the saved model
// load the saved model from the distributed file system
val model = KMeansModel.load(modelpath)
Streaming pipeline Kafka Data source
val df1 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "maprdemo:9092")
  .option("subscribe", "/apps/uberstream:ubers")
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", false)
  .option("maxOffsetsPerTrigger", 1000)
  .load()
Kafka DataFrame schema
df1.printSchema()
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
Function to Parse CSV data to Uber Object
case class Uber(dt: String, lat: Double, lon: Double, base: String, rdt: String)

// Parse a comma-separated string into an Uber case class
def parseUber(str: String): Uber = {
  val p = str.split(",")
  Uber(p(0), p(1).toDouble, p(2).toDouble, p(3), p(4))
}
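The parser above throws on a short or malformed line. In a long-running stream it may be preferable to skip bad records instead, so here is a hedged variant that returns an Option. It is an assumption about how one might harden the parser, not part of the original code: SafeParseSketch and parseUberOpt are hypothetical names, and the rdt value used in any example is made up.

```scala
import scala.util.Try

object SafeParseSketch {
  case class Uber(dt: String, lat: Double, lon: Double, base: String, rdt: String)

  // Returns None instead of throwing when the line does not have exactly
  // five fields or when a coordinate is not numeric.
  def parseUberOpt(str: String): Option[Uber] = {
    val p = str.split(",", -1) // -1 keeps trailing empty fields
    if (p.length != 5) None
    else Try(Uber(p(0), p(1).toDouble, p(2).toDouble, p(3), p(4))).toOption
  }
}
```

A caller can then drop the None results, for example with flatMap, and keep processing the remaining records.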
Parse message text to Uber Object
// register a user-defined function (UDF) to deserialize the message
spark.udf.register("deserialize",
  (message: String) => parseUber(message))
// use the UDF in a select expression
val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS message""")
  .select($"message".as[Uber])
Use VectorAssembler to put Features in a column
val featureCols = Array("lat", "lon")
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val df3 = assembler.transform(df2)
Use Model to get Cluster Ids from the features
// use model to get the cluster ids from the features
val clusters1 = model.transform(df3)
Create Unique Id for MapR-DB row key
// select columns we want to keep
val clusters = clusters1.select($"dt".cast(TimestampType),
  $"lat", $"lon", $"base", $"rdt", $"cid")
// Create object with unique Id for mapr-db
case class UberwId(_id: String, dt: java.sql.Timestamp, base: String, cid: Integer,
  clat: Double, clon: Double)
// Note: clat and clon appear in the output below but not in the select above;
// the step that adds them is not shown on this slide
val cdf = clusters.withColumn("_id", concat($"cid", lit("_"), $"rdt")).as[UberwId]
// cdf is like this:
+--------------------+-------------------+-------+--------+------+---+----------------+------------------+
|                 _id|                 dt|    lat|     lon|  base|cid|            clat|              clon|
+--------------------+-------------------+-------+--------+------+---+----------------+------------------+
|0_922337049642672...|2014-08-18 08:36:00| 40.723|-74.0021|B02598|  0|40.7173662333218|-74.00933866774037|
|0_922337049642672...|2014-08-18 08:36:00|40.7288|-74.0113|B02598|  0|40.7173662333218|-74.00933866774037|
|0_922337049642672...|2014-08-18 08:35:00|40.7417|-74.0488|B02617|  0|40.7173662333218|-74.00933866774037|
Writing to a MapR-DB Sink
Write the results to MapR-DB and start running the query
val query = cdf.writeStream
  .format(MapRDBSourceConfig.Format)
  .option(MapRDBSourceConfig.TablePathOption, tableName)
  .option(MapRDBSourceConfig.IdFieldPathOption, "_id")
  .option(MapRDBSourceConfig.CreateTableOption, false)
  .option("checkpointLocation", "/user/mapr/ubercheck")
  .option(MapRDBSourceConfig.BulkModeOption, true)
  .option(MapRDBSourceConfig.SampleSizeOption, 1000)

query.start().awaitTermination()
Streaming Application
%sql select * from uber limit 3

Streaming Application
SELECT hour(uber.dt) as hr, cid, count(cid) as ct FROM uber GROUP BY hour(uber.dt), cid
Spark & MapR-DB
Stream Processing Pipeline
Spark Streaming writing to MapR-DB JSON
MapR-DB Connector for Apache Spark
Spark MapR-DB Connector
Relational Database vs. MapR-DB
Storage model:
•  RDBMS: normalized schema → joins for queries can cause a bottleneck
•  MapR-DB: de-normalized schema → data that is read together is stored together
[Diagram: tables with the columns Key, colB, colC and rows of values]
Designed for Partitioning and Scaling
MapR-DB JSON Document Store
Data is automatically partitioned and sorted by _id row key!
Explore	the	Data	With	Spark	SQL
Spark SQL Querying MapR-DB JSON
•  Spark SQL queries and updates to MapR-DB
•  With projection and filter pushdown, custom partitioning, and data locality
Spark Distributed Datasets read from MapR-DB Partitions
val df: Dataset[UberwId] = spark
  .loadFromMapRDB[UberwId](tableName, schema)
  .as[UberwId]
[Diagram: the Driver sends tasks to three Workers; each Task reads a MapR-DB partition, then processes and caches the data (Cache 1 to Cache 3)]
Load the data into a DataFrame
df.createOrReplaceTempView("uber")
df.show
Data is automatically partitioned and sorted by _id row key!
Top 5 cluster trip counts?
df.groupBy("cid")
  .count()
  .orderBy(desc("count"))
  .show(5)
+---+------+
|cid| count|
+---+------+
|  6|197225|
|  5|192073|
|  0|131296|
| 16| 62465|
| 13| 52408|
+---+------+
Display latest locations and Cluster centers on a Google Map
val points = df.select("lat", "lon", "cid").orderBy(desc("dt"))
Which hours have the highest pickups for cluster id 0?
df.filter($"_id" <= "1").select(hour($"dt").alias("hour"), $"cid")
  .groupBy("hour", "cid").agg(count("cid")
  .alias("count"))
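The filter on _id selects cluster 0 because the row key starts with the cluster id and keys are compared as strings. A minimal plain-Scala sketch of that comparison follows. It assumes the _id format built earlier in the deck (cid, underscore, rdt) and ordinary lexicographic string ordering; RowKeySketch, rowKey, and matchesFilter are hypothetical names, and the rdt values used with them are made up.

```scala
object RowKeySketch {
  // Row key in the same shape as the _id column: cluster id, underscore, rdt.
  def rowKey(cid: Int, rdt: String): String = s"${cid}_$rdt"

  // Same comparison as the filter _id <= "1", on plain Scala strings.
  def matchesFilter(id: String): Boolean = id <= "1"
}
```

Keys such as "0_..." sort before "1", so they pass the filter. Keys such as "1_...", "10_...", or "2_..." sort after "1" and are excluded, which is why the query returns only cluster 0.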
MapR-DB Projection and Filter push down
df.filter($"_id" <= "1").select(hour($"dt").alias("hour"), $"cid")
  .groupBy("hour", "cid").agg(count("cid").alias("count"))
  .orderBy(desc("count")).explain
== Physical Plan ==
*(3) Sort [count#120L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#120L DESC NULLS LAST, 200)
   +- *(2) HashAggregate(keys=[hour#113, cid#5], functions=[count(cid#5)])
      +- Exchange hashpartitioning(hour#113, cid#5, 200)
         +- *(1) HashAggregate(keys=[hour#113, cid#5], functions=[partial_count(cid#5)])
            +- *(1) Project [hour(dt#1, Some(Etc/UTC)) AS hour#113, cid#5]
               +- *(1) Filter (isnotnull(_id#0) && (_id#0 <= 1))
                  +- *(1) Scan MapRDBRelation(/user/mapr/ubertable [dt#1,cid#5,_id#0] PushedFilters: [IsNotNull(_id), LessThanOrEqual(_id,1)]
Spark MapR-DB Projection Filter push down
Projection and filter pushdown reduces the amount of data passed between MapR-DB and the Spark engine when selecting and filtering data.
Data is selected and filtered in MapR-DB!
Which hours and Clusters have the highest pickups?
SELECT hour(uber.dt) as hr, cid, count(cid) as ct FROM uber
GROUP BY hour(uber.dt), cid
MapR Data Platform
New Spark Ebook
A link to the code for this webinar is in the appendix of this book:
https://mapr.com/ebook/getting-started-with-apache-spark-v2/
To Learn More: New Spark 2.0 training
MapR Free ODT: http://learn.mapr.com/
MapR Blog
https://mapr.com/blog/

Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Amazon Web Services
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Big Data Pipelines and Machine Learning at Uber
Big Data Pipelines and Machine Learning at UberBig Data Pipelines and Machine Learning at Uber
Big Data Pipelines and Machine Learning at UberSudhir Tonse
 
ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017ML and Data Science at Uber - GITPro talk 2017
ML and Data Science at Uber - GITPro talk 2017Sudhir Tonse
 
Instrumenting Applications for Observability Using AWS X-Ray (DEV402-R2) - AW...
Instrumenting Applications for Observability Using AWS X-Ray (DEV402-R2) - AW...Instrumenting Applications for Observability Using AWS X-Ray (DEV402-R2) - AW...
Instrumenting Applications for Observability Using AWS X-Ray (DEV402-R2) - AW...Amazon Web Services
 

Real-Time Analysis of Popular Uber Locations using Apache APIs: Machine Learning, Spark Streaming, Kafka with MapR-ES and MapR-DB

  • 2. Agenda (© 2018 MapR Technologies, Inc)
      •  Overview of Unsupervised Machine Learning Clustering
      •  Use K-Means to cluster Uber locations and save the ML model
      •  Overview of the Kafka API
      •  Use Spark Structured Streaming:
          •  To read from a Kafka topic
          •  To enrich with the ML model
          •  To write to the MapR-DB JSON document database
      •  Use Spark SQL to query the MapR-DB database
  • 3. Use Case: Real-Time Analysis of Geographically Clustered Vehicles
  • 5. What is Machine Learning?
      [Diagram: data (contains patterns) is used to train an algorithm (finds patterns) and build a model; the model (a prediction function that recognizes patterns) is then used on new data to produce predictions.]
  • 6. ML Discovery Model Building
      [Diagram, data discovery and model creation: features are extracted from historical Uber trip data; a training set is used for model training/building and a test set is used to test the model's predictions and evaluate the results.]
      [Diagram, production: features are extracted from new Uber trip data arriving on a stream topic; the deployed model turns them into insights.]
  • 7. Supervised and Unsupervised Machine Learning
      Supervised (uses a label):
      •  Classification
      •  Regression
      Unsupervised:
      •  Clustering
      •  Collaborative Filtering
      •  Frequent Pattern Mining
  • 8. Supervised Algorithms use labeled data
      [Diagram: data features X1, X2 with label Y are used to build a model f(X1, X2) = Y; the model is then used on new data features X1, X2 to predict Y.]
  • 9. Unsupervised Algorithms use Unlabeled data
      [Diagram: customer purchase data (contains patterns) is used to train an algorithm (finds patterns) and build a model of customer groups; the model (recognizes patterns) is then used on new customer purchase data to find the similar customer group.]
  • 10. Unsupervised Machine Learning: Clustering
      Clustering example: group news articles into different categories
  • 11. Clustering: Definition
      •  Unsupervised learning task
      •  Groups objects into clusters of high similarity
  • 12. Clustering: Definition
      •  Unsupervised learning task
      •  Groups objects into clusters of high similarity
          –  Search results grouping
          –  Grouping of customers, patients
          –  Text categorization
          –  Recommendations
      •  Anomaly detection: find what’s not similar
  • 13. Clustering: Example
      Group similar objects
  • 14. Clustering: Example
      Group similar objects. Use the MLlib K-means algorithm:
      1.  Initialize the coordinates of K cluster centers (centroids)
  • 15. Clustering: Example
      Group similar objects. Use the MLlib K-means algorithm:
      1.  Initialize the coordinates of K cluster centers (centroids)
      2.  Assign all points to the nearest centroid
  • 16. Clustering: Example
      Group similar objects. Use the MLlib K-means algorithm:
      1.  Initialize the coordinates of K cluster centers (centroids)
      2.  Assign all points to the nearest centroid
      3.  Update each centroid to the center of its assigned points
  • 17. Clustering: Example
      Group similar objects. Use the MLlib K-means algorithm:
      1.  Initialize the coordinates of K cluster centers (centroids)
      2.  Assign all points to the nearest centroid
      3.  Update each centroid to the center of its assigned points
      4.  Repeat steps 2 and 3 until the stopping conditions are met
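The four steps above can be sketched in plain Scala on 2-D points. This is only an illustration of the loop, not how Spark MLlib implements it: the object name `KMeansSketch`, the "first k points" initialization, and the "centroids stopped moving" stopping rule are all assumptions made for the example.

```scala
// Plain-Scala sketch of the four K-means steps on 2-D points.
// Illustrative only: Spark MLlib's KMeans is distributed and initializes differently.
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two points.
  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the centroid nearest to p.
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  // Step 3: mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  def fit(points: Seq[Point], k: Int, maxIter: Int): Vector[Point] = {
    // Step 1 (assumption): use the first k points as the initial centroids.
    var centroids = points.take(k).toVector
    var changed = true
    var iter = 0
    // Step 4: repeat until the centroids stop moving or maxIter is reached.
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centroids))
      val updated = centroids.indices.map { i =>
        // A centroid with no assigned points stays where it is.
        groups.get(i).map(mean).getOrElse(centroids(i))
      }.toVector
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

A call such as `KMeansSketch.fit(points, 2, 20)` returns the final centroids; `maxIter` plays the same role as the iteration cap set on the Spark estimator later in the deck.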
  • 19. How a Spark Application Runs on a Cluster
  • 20. Spark Distributed Datasets
      •  Read-only collection of typed objects: Dataset[T]
      •  Partitioned across a cluster
      •  Operated on in parallel
      •  Can be cached in memory
  • 21. Loading a Dataset
  • 22. Dataset Read From a File
      [Diagram: the driver sends tasks to three workers; each worker reads one block of the file (Block 1, Block 2, Block 3).]
  • 23. Dataset Read From a File
      [Diagram: each of the three workers processes its block and caches the data (Cache 1, Cache 2, Cache 3).]
  • 24. Uber Data
      The data records are in CSV format with the following fields:
      •  Date/Time: the date and time of the Uber pickup
      •  Lat: the latitude of the Uber pickup
      •  Lon: the longitude of the Uber pickup
      •  Base: the TLC base company affiliated with the Uber pickup
      An example line is shown below:
      2014-08-01 00:00:00,40.729,-73.9422,B02598
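To make the record layout concrete, here is a minimal plain-Scala sketch that parses one such CSV line. The names `UberCsv`, `UberRecord`, and `parse` are hypothetical helpers introduced for illustration; the deck itself loads the file with Spark's CSV reader instead.

```scala
// Hypothetical helper (not from the deck): parse one line of the Uber CSV.
object UberCsv {
  final case class UberRecord(dt: String, lat: Double, lon: Double, base: String)

  // Returns None when the line does not have exactly four fields
  // or when the coordinates are not numeric.
  def parse(line: String): Option[UberRecord] =
    line.split(",", -1).map(_.trim) match {
      case Array(dt, lat, lon, base) =>
        for {
          la <- lat.toDoubleOption
          lo <- lon.toDoubleOption
        } yield UberRecord(dt, la, lo, base)
      case _ => None
    }
}
```

Returning an `Option` keeps malformed lines from throwing, so a caller can choose to drop or log them.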
  • 25. Load the data into a DataFrame: Define the Schema
      import org.apache.spark.sql.types._
      // Typed view of one record, used later with .as[Uber]
      case class Uber(dt: String, lat: Double, lon: Double, base: String)
      // Explicit schema, so Spark does not have to infer column types
      val schema = StructType(Array(
        StructField("dt", TimestampType, true),
        StructField("lat", DoubleType, true),
        StructField("lon", DoubleType, true),
        StructField("base", StringType, true)
      ))
  • 26. Load the data into a DataFrame
      val df = spark.read.format("csv")
        .option("inferSchema", "false")
        .schema(schema)
        .option("header", "false")
        .load(file)
  • 27. Load the data into a DataFrame
      [Figure: the DataFrame shown as a table of rows and columns.]
  • 28. Load the data into a Dataset
      import spark.implicits._  // provides the encoder needed by .as[Uber]
      val df = spark.read.format("csv")
        .option("inferSchema", "false")
        .schema(schema)
        .option("header", "false")
        .load(file)
        .as[Uber]
  • 29. Load the data into a Dataset
      [Figure: the Dataset shown as a collection of Uber objects, laid out as rows and columns.]
  • 30. Dataset merged with DataFrame
      •  In Spark 2.0, the DataFrame API was merged with the Dataset API
      •  A Dataset is a collection of typed objects (SQL and functions): Dataset[T]
      •  A DataFrame is a Dataset of generic Row objects (SQL): Dataset[Row]
  • 31. Spark Distributed Datasets
      •  Transformations create a new Dataset from the current one and are lazily evaluated
      •  Actions return a value to the driver
  • 32. Spark ML workflow
  • 33. Extract the Features
      Feature vectors are vectors of numbers representing the value for each feature.
      [Diagram (image reference: O’Reilly Learning Spark): training data, featurization into feature vectors, model training, model evaluation, best model.]
  • 34. Uber Example
      •  What are the “if questions” or properties we can use to group?
          –  These are the features
          –  We will group by latitude and longitude
      •  Use Spark SQL to analyze: day of the week, time, rush hour for groups …
      •  NOTE: this example uses real Uber data, but the code is from me, not Uber
      [Image caption: NEAR REALTIME PRICE SURGING]
  • 35. Use VectorAssembler to put features in vector column
      import org.apache.spark.ml.feature.VectorAssembler
      val featureCols = Array("lat", "lon")
      // Combine the lat and lon columns into a single "features" vector column
      val assembler = new VectorAssembler()
        .setInputCols(featureCols)
        .setOutputCol("features")
      val df2 = assembler.transform(df)
  • 36. Create KMeans Estimator, Set Features
      import org.apache.spark.ml.clustering.KMeans
      val kmeans = new KMeans()
        .setK(10)
        .setFeaturesCol("features")
        .setPredictionCol("cid")
        .setMaxIter(20)
  • 37. Fit the Model on the Training Data Features
      val model = kmeans.fit(df2)
  • 38. Cluster Centers from fitted model
      model.clusterCenters.foreach(println)
      [40.76930621976264,-73.96034885367698]
      [40.67562793272868,-73.79810579052476]
      [40.68848772848041,-73.9634449047477]
      [40.78957777777776,-73.14270740740741]
      [40.32418330308531,-74.18665245009073]
      [40.732808848486286,-74.00150153727878]
      [40.75396549974632,-73.57692359208531]
      [40.901700842900674,-73.868760398198]
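A K-means model assigns a point to the cluster whose center is closest. The plain-Scala sketch below shows that lookup using the eight centers printed above, rounded to four decimals. The object name `NearestCenter` is hypothetical, and the returned value is only the position in this printed list; it is an assumption that this matches the `cid` values the Spark model produces.

```scala
// Plain-Scala illustration of nearest-center assignment (not the Spark API).
object NearestCenter {
  // Centers copied from the printed output above, rounded to four decimals.
  val centers: Vector[(Double, Double)] = Vector(
    (40.7693, -73.9603),
    (40.6756, -73.7981),
    (40.6885, -73.9634),
    (40.7896, -73.1427),
    (40.3242, -74.1867),
    (40.7328, -74.0015),
    (40.7540, -73.5769),
    (40.9017, -73.8688)
  )

  // Position (in the list above) of the center with the smallest
  // squared Euclidean distance to the given latitude and longitude.
  def predict(lat: Double, lon: Double): Int =
    centers.indices.minBy { i =>
      val (cLat, cLon) = centers(i)
      val dLat = lat - cLat
      val dLon = lon - cLon
      dLat * dLat + dLon * dLon
    }
}
```

Squared distance is enough here because only the ordering of distances matters, not their absolute values.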
  • 39. Clusters from fitted model
  • 40. Analyze Clusters
      [Diagram labels: K-means model; summary; DataFrame + features + cluster]
      val clusters = model.summary.predictions
      or
      // df3 is not defined on these slides; it must be a DataFrame with a "features" column
      val clusters = model.transform(df3)
      clusters.createOrReplaceTempView("uber")
      clusters.show()
  • 41. Which clusters had the highest number of pickups?
      import org.apache.spark.sql.functions.desc
      clusters.groupBy("cid").count().orderBy(desc("count")).show(5)
      +---+-----+
      |cid|count|
      +---+-----+
      |  6|83505|
      |  5|79472|
      |  0|56241|
      | 16|26933|
      | 13|23581|
      +---+-----+
  • 42. Which clusters had the highest number of pickups?
      %sql
      SELECT COUNT(cid), cid FROM uber GROUP BY cid ORDER BY COUNT(cid) DESC
  • 43. How many pickups occurred in the busiest 5 clusters by hour?
      SELECT hour(uber.dt) AS hr, cid, COUNT(cid) AS ct
      FROM uber
      WHERE cid IN (0,8,9,13,17)
      GROUP BY hour(uber.dt), cid
  • 44. Which hours had the highest number of pickups?
      SELECT hour(uber.dt) AS hr, COUNT(cid) AS ct
      FROM uber
      GROUP BY hour(uber.dt)
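For readers less familiar with SQL aggregation, the hour-of-day grouping in the last query can be mimicked on an in-memory list with plain Scala collections. This is a swapped-in illustration, not Spark SQL: the object name `HourlyCounts` is hypothetical, the timestamp pattern is assumed from the example CSV line, and the sort (busiest hour first) is an addition that the query above does not perform.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Plain-Scala analogue of: SELECT hour(dt) AS hr, COUNT(*) AS ct ... GROUP BY hour(dt)
object HourlyCounts {
  // Pattern assumed from the example record "2014-08-01 00:00:00".
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Count records per hour of day; busiest hour first, ties broken by hour.
  def byHour(timestamps: Seq[String]): Seq[(Int, Int)] =
    timestamps
      .map(ts => LocalDateTime.parse(ts, fmt).getHour)
      .groupBy(identity)
      .map { case (hr, hits) => (hr, hits.size) }
      .toSeq
      .sortBy { case (hr, ct) => (-ct, hr) }
}
```

Each result pair is `(hour, count)`, matching the `hr` and `ct` columns of the query.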
  • 45. Save the model to distributed file system
      [Diagram labels: fitted model; save; DataFrame + features]
      import org.apache.spark.ml.clustering.KMeansModel
      // Save the fitted model
      model.write.overwrite().save("/path/savemodel")
      // Use later
      val sameModel = KMeansModel.load("/user/user01/data/savemodel")
  • 46. The model on the distributed file system
      hadoop fs -ls /user/mapr/ubermodel/metadata
      /user/mapr/ubermodel/metadata/_SUCCESS
      /user/mapr/ubermodel/metadata/part-00000
      hadoop fs -ls /user/mapr/ubermodel/data
      /user/mapr/ubermodel/data/_SUCCESS
      /user/mapr/ubermodel/data/part-00000-4d20b313-ddc1-43cb-a863-434a36330639-c000.snappy.parquet
      hadoop fs -cat /user/mapr/ubermodel/metadata/part-00000
      {"class":"org.apache.spark.ml.clustering.KMeansModel","timestamp":1540826934502,"sparkVersion":"2.3.1-mapr-1808","uid":"kmeans_4ad427355253","paramMap":{"predictionCol":"cid","seed":1,"initMode":"k-means||","featuresCol":"features","initSteps":2,"maxIter":100,"tol":1.0E-4,"k":20}}
  • 48. 48 © 2018 MapR Technologies, Inc Use Case: Real-Time Analysis of Geographically Clustered Vehicles
  • 49. 49 © 2018 MapR Technologies, Inc What is a Stream ? •  A stream is an continuous sequence of events or records •  Records are key-value pairs
  • 50. 50 © 2018 MapR Technologies, Inc Examples of Streaming Data Fraud detection Smart Machinery Smart Meters Home Automation Networks Manufacturing Security Systems Patient Monitoring
  • 51. 51 © 2018 MapR Technologies, Inc A Stanford team has shown that a machine-learning model can identify arrhythmias from an EKG better than an expert •  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/ Example of Streaming Data combined with Machine Learning
  • 52. 52 © 2018 MapR Technologies, Inc https://mapr.com/blog/ml-iot-connected-medical-devices/ Applying Machine Learning to Live Patient Data
  • 53. 53 © 2018 MapR Technologies, Inc Collect the Data Data IngestSource Stream Topic •  Data Ingest: –  Using the Kafka API
  • 54. 54 © 2018 MapR Technologies, Inc Topics: Logical collection of events Organize Events into Categories Organize Data into Topics with the MapR Event Store for Kafka Consumers MapR Cluster Topic: Pressure Topic: Temperature Topic: Warnings Consumers Consumers Kafka API Kafka API
  • 55. 55 © 2018 MapR Technologies, Inc Topics are partitioned for throughput and scalability Scalable Messaging with MapR Event Streams Server 1 Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Server 2 Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Server 3 Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning
  • 56. 56 © 2018 MapR Technologies, Inc Scalable Messaging with MapR Event Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Producers are load balanced between partitions Kafka API
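How a keyed record might be routed to one of a topic's partitions can be sketched in plain Scala. This is a simplified stand-in, not the Kafka or MapR-ES partitioner: it uses `String.hashCode`, whereas Kafka's default partitioner hashes the serialized key with murmur2. `PartitionSketch`, `partitionFor`, and the sensor keys are hypothetical names chosen for illustration.

```scala
object PartitionSketch {
  // Map a record key to one of numPartitions partitions.
  // floorMod keeps the result non-negative even when hashCode is negative.
  // Assumption: hashCode stands in for the real partitioner's hash function.
  def partitionFor(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val keys = Seq("sensor-1", "sensor-2", "sensor-3", "sensor-1")
    // The same key always lands on the same partition, which preserves per-key order.
    keys.foreach(k => println(s"$k -> partition ${partitionFor(k, 3)}"))
  }
}
```

Because the mapping depends only on the key and the partition count, records with the same key stay in order within one partition while different keys spread across servers.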
  • 57. 57 © 2018 MapR Technologies, Inc Scalable Messaging with MapR Event Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Consumers Consumers Consumers Consumer groups can read in parallel Kafka API
  • 58. 58 © 2018 MapR Technologies, Inc New Messages are Added to the end Partition is like an Event Log New Message 6 5 4 3 2 1 Old Message
  • 59. 59 © 2018 MapR Technologies, Inc Messages are delivered in the order they are received Partition is like a Queue
  • 60. 60 © 2018 MapR Technologies, Inc Messages remain on the partition, available to other consumers Unlike a queue, events are still persisted after they’re delivered
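The difference between a queue and a partition log can be illustrated with a small in-memory sketch: reading does not remove messages, and each consumer keeps its own offset. This is a toy model under stated assumptions, not the Kafka API; `EventLogSketch`, `Partition`, and `Consumer` are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer

object EventLogSketch {
  // An append-only partition: reading never removes a message.
  final class Partition {
    private val messages = ArrayBuffer.empty[String]

    // Append at the end of the log and return the offset of the new message.
    def append(msg: String): Long = {
      messages += msg
      messages.size - 1L
    }

    // Return every message at or after the given offset, oldest first.
    def readFrom(offset: Long): Seq[String] = messages.drop(offset.toInt).toList
  }

  // Each consumer tracks its own read position (offset) independently.
  final class Consumer(partition: Partition) {
    private var offset = 0L

    // Fetch the messages not yet seen by this consumer and advance its offset.
    def poll(): Seq[String] = {
      val batch = partition.readFrom(offset)
      offset += batch.size
      batch
    }
  }
}
```

Two `Consumer` instances over the same `Partition` each receive every message in arrival order, which is the behavior the slides describe as "available to other consumers".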
  • 61. 61 © 2018 MapR Technologies, Inc Messages can be persisted forever Or Older messages can be deleted automatically based on time to live When Are Messages Deleted? MapR Cluster 6 5 4 3 2 1Partition 1 Older message
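Time-to-live retention can be sketched as a filter over message timestamps. This is only an illustration of the idea; how MapR-ES actually schedules and applies deletion is not described in the slides. `RetentionSketch`, `Message`, and `retain` are hypothetical names.

```scala
object RetentionSketch {
  final case class Message(offset: Long, timestampMs: Long, value: String)

  // Keep only messages younger than ttlMs.
  // A ttl of None models "persist forever".
  def retain(log: Seq[Message], nowMs: Long, ttlMs: Option[Long]): Seq[Message] =
    ttlMs match {
      case None      => log
      case Some(ttl) => log.filter(m => nowMs - m.timestampMs < ttl)
    }
}
```

Offsets of surviving messages are unchanged by the purge, so a consumer's saved position remains meaningful after older messages expire.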
  • 62. 62 © 2018 MapR Technologies, Inc How do we do this with High Performance at Scale? •  Parallel operations •  Minimizes disk reads/writes
  • 63. 63 © 2018 MapR Technologies, Inc Processing Same Message for Different Purposes
  • 65. 65 © 2018 MapR Technologies, Inc Process the Data with Spark Structured Streaming
  • 66. 66 © 2018 MapR Technologies, Inc Datasets Read from Stream Task Cache Process & Cache Data offsets Stream partition Task Cache Process & Cache Data Task Cache Process & Cache Data Driver Stream partition Stream partition Data is cached for aggregations And windowed functions
  • 67. 67 © 2018 MapR Technologies, Inc new data in the data stream = new rows appended to an unbounded table Data stream as an unbounded table Treat Stream as Unbounded Tables
  • 68. 68 © 2018 MapR Technologies, Inc The Stream is continuously processed
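The "unbounded table" idea can be sketched without Spark: each micro-batch of new rows is folded into a running result instead of recomputing over all data seen so far. This is a conceptual sketch only, not how Spark implements its incremental execution; `UnboundedTableSketch` and `update` are hypothetical names, and the cluster ids in the usage note are illustrative.

```scala
object UnboundedTableSketch {
  // Running result table: number of rows seen so far per cluster id.
  type Counts = Map[Int, Long]

  // Fold one micro-batch of newly appended rows (cluster ids) into the running result.
  def update(counts: Counts, newRows: Seq[Int]): Counts =
    newRows.foldLeft(counts) { (acc, cid) =>
      acc.updated(cid, acc.getOrElse(cid, 0L) + 1L)
    }

  // Process a sequence of micro-batches, starting from an empty result.
  def run(batches: Seq[Seq[Int]]): Counts =
    batches.foldLeft(Map.empty[Int, Long])(update)
}
```

Calling `run(Seq(Seq(6, 5, 6), Seq(0, 6)))` gives the same counts as a single group-by over all five rows, which is the equivalence the unbounded-table model relies on.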
  • 69. 69 © 2018 MapR Technologies, Inc Spark automatically streamifies SQL plans Image reference Databricks
  • 70. 70 © 2018 MapR Technologies, Inc Stream Processing
  • 71. 71 © 2018 MapR Technologies, Inc ML Discovery Model Building Model Training/ Building Training Set Test Model Predictions Test Set Evaluate Results Historical Data Deployed Model Insights Data Discovery, Model Creation Production Feature Extraction Feature Extraction Uber trips Stream TopicUber trips New Data
  • 72. 72 © 2018 MapR Technologies, Inc Use Case: Real-Time Analysis of Geographically Clustered Vehicles
  • 73. 73 © 2018 MapR Technologies, Inc // load the saved model from the distributed file system val model = KMeansModel.load(modelpath) Load the saved model
  • 74. 74 © 2018 MapR Technologies, Inc val df1 = spark.readStream.format("kafka") .option("kafka.bootstrap.servers", "maprdemo:9092") .option("subscribe", "/apps/uberstream:ubers") .option("startingOffsets", "earliest") .option("failOnDataLoss", false) .option("maxOffsetsPerTrigger", 1000) .load() Streaming pipeline Kafka Data source
  • 75. 75 © 2018 MapR Technologies, Inc df1.printSchema() root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true) Kafka DataFrame schema
  • 76. 76 © 2018 MapR Technologies, Inc case class Uber(dt: String, lat: Double, lon: Double, base: String, rdt: String)   // Parse string into Uber case class def parseUber(str: String): Uber = { val p = str.split(",") Uber(p(0), p(1).toDouble, p(2).toDouble, p(3), p(4)) } Function to Parse CSV data to Uber Object
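The parsing function above can be tried outside Spark. The sketch below repeats the case class and `parseUber` from the slide and adds a hypothetical `parseUberSafe` helper that returns `None` for malformed input instead of throwing. The sample line is made up for illustration (its last field stands in for the reverse timestamp `rdt`); it is not taken from the dataset.

```scala
import scala.util.Try

object ParseSketch {
  case class Uber(dt: String, lat: Double, lon: Double, base: String, rdt: String)

  // Same parsing logic as the slide: split on commas and convert lat/lon to Double.
  def parseUber(str: String): Uber = {
    val p = str.split(",")
    Uber(p(0), p(1).toDouble, p(2).toDouble, p(3), p(4))
  }

  // Hypothetical helper: None for rows with missing fields or non-numeric coordinates.
  def parseUberSafe(str: String): Option[Uber] = Try(parseUber(str)).toOption

  def main(args: Array[String]): Unit = {
    // Made-up sample record in the dt,lat,lon,base,rdt layout.
    val line = "2014-08-01 00:00:00,40.729,-73.9422,B02598,9223370505593280605"
    println(parseUber(line))
    println(parseUberSafe("not,a,valid,row"))
  }
}
```

In a streaming job a single bad record would otherwise fail the UDF, so some form of defensive parsing is worth considering.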
  • 77. 77 © 2018 MapR Technologies, Inc //register a user-defined function (UDF) to deserialize the message spark.udf.register("deserialize", (message: String) => parseUber(message)) //use the UDF in a select expression val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS message""").select($"message".as[Uber]) Parse message txt to Uber Object
  • 78. 78 © 2018 MapR Technologies, Inc val featureCols = Array("lat", "lon") val assembler = new VectorAssembler() .setInputCols(featureCols) .setOutputCol("features")   val df3 = assembler.transform(df2) Use VectorAssembler to put Features in a column
  • 79. 79 © 2018 MapR Technologies, Inc //use model to get the cluster ids from the features val clusters1 = model.transform(df3) Use Model to get Cluster Ids from the features
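What assigning a cluster id means for a K-means model can be sketched in plain Scala: each point gets the index of its nearest cluster center. This is an illustration of the idea, not Spark's `KMeansModel` implementation; `NearestCenterSketch` and `predict` are hypothetical names, and it assumes plain Euclidean distance on (lat, lon) with a non-empty list of centers.

```scala
object NearestCenterSketch {
  type Point = (Double, Double) // (lat, lon)

  // Squared Euclidean distance; the square root is not needed to find the minimum.
  private def squaredDistance(a: Point, b: Point): Double = {
    val dLat = a._1 - b._1
    val dLon = a._2 - b._2
    dLat * dLat + dLon * dLon
  }

  // Index of the center closest to the point; this index plays the role of cid.
  // Assumes centers is non-empty.
  def predict(centers: IndexedSeq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))
}
```

For example, with a first center at (40.7173662333218, -74.00933866774037) and a made-up second center at (40.77, -73.87), the point (40.723, -74.0021) is assigned index 0.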
  • 80. 80 © 2018 MapR Technologies, Inc //select columns we want to keep val clusters = clusters1.select($"dt".cast(TimestampType), $"lat", $"lon", $"base",$"rdt", $"cid") // Create object with unique Id for mapr-db case class UberwId(_id: String, dt: java.sql.Timestamp, base: String, cid: Integer, clat: Double, clon: Double) val cdf = clusters.withColumn("_id", concat($"cid", lit("_"), $"rdt")).as[UberwId] // cdf is like this: +--------------------+-------------------+-------+--------+------+---+----------------+------------------+ | _id| dt| lat| lon| base|cid| clat| clon| +--------------------+-------------------+-------+--------+------+---+----------------+------------------+ |0_922337049642672...|2014-08-18 08:36:00| 40.723|-74.0021|B02598| 0|40.7173662333218|-74.00933866774037| |0_922337049642672...|2014-08-18 08:36:00|40.7288|-74.0113|B02598| 0|40.7173662333218|-74.00933866774037| |0_922337049642672...|2014-08-18 08:35:00|40.7417|-74.0488|B02617| 0|40.7173662333218|-74.00933866774037| Create Unique Id for MapR-DB row key
  • 81. 81 © 2018 MapR Technologies, Inc Writing to a MapR-DB Sink Write results to MapR-DB Start running the query val query = cdf.writeStream .format(MapRDBSourceConfig.Format) .option(MapRDBSourceConfig.TablePathOption, tableName) .option(MapRDBSourceConfig.IdFieldPathOption, "_id") .option(MapRDBSourceConfig.CreateTableOption, false) .option("checkpointLocation", "/user/mapr/ubercheck") .option(MapRDBSourceConfig.BulkModeOption, true) .option(MapRDBSourceConfig.SampleSizeOption, 1000) query.start().awaitTermination()
  • 82. 82 © 2018 MapR Technologies, Inc %sql select * from uber limit 3: Streaming Application
  • 83. 83 © 2018 MapR Technologies, Inc SELECT hour(uber.dt) as hr,cid, count(cid) as ct FROM uber group By hour(uber.dt), cid Streaming Application
  • 85. 85 © 2018 MapR Technologies, Inc Stream Processing Pipeline
  • 86. 86 © 2018 MapR Technologies, Inc MapR-DB Connector for Apache Spark Spark Streaming writing to MapR-DB JSON
  • 87. 87 © 2018 MapR Technologies, Inc Spark MapR-DB Connector
  • 88. 88 © 2018 MapR Technologies, Inc Relational Database vs. MapR-DB Storage Model RDBMS: Normalized schema → Joins for queries can cause a bottleneck. MapR-DB: De-normalized schema → Data that is read together is stored together.
  • 89. 89 © 2018 MapR Technologies, Inc Designed for Partitioning and Scaling
  • 90. 90 © 2018 MapR Technologies, Inc MapR-DB JSON Document Store Data is automatically partitioned and sorted by _id row key!
  • 91. 91 © 2018 MapR Technologies, Inc Writing to a MapR-DB Sink Write Streaming DataFrame Query Results to MapR-DB Start running the query val query = cdf.writeStream .format(MapRDBSourceConfig.Format) .option(MapRDBSourceConfig.TablePathOption, tableName) .option(MapRDBSourceConfig.IdFieldPathOption, "_id") .option(MapRDBSourceConfig.CreateTableOption, false) .option("checkpointLocation", "/user/mapr/ubercheck") .option(MapRDBSourceConfig.BulkModeOption, true) .option(MapRDBSourceConfig.SampleSizeOption, 1000) query.start().awaitTermination()
  • 92. 92 © 2018 MapR Technologies, Inc Streaming Application
  • 94. 94 © 2018 MapR Technologies, Inc •  Spark SQL queries and updates to MapR-DB •  With projection and filter pushdown, custom partitioning, and data locality Spark SQL Querying MapR-DB JSON
  • 95. 95 © 2018 MapR Technologies, Inc val df: Dataset[UberwId] = spark .loadFromMapRDB[UberwId](tableName, schema) .as[UberwId] Spark Distributed Datasets read from MapR-DB Partitions Worker Task Worker Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data Task Task Driver tasks tasks tasks
  • 96. 96 © 2018 MapR Technologies, Inc Data Frame Load data df.createOrReplaceTempView("uber") df.show Load the data into a Dataframe Data is automatically partitioned and sorted by _id row key!
  • 97. 97 © 2018 MapR Technologies, Inc val res = df.groupBy("cid") .count() .orderBy(desc("count")) .show(5) +---+------+ |cid| count| +---+------+ | 6|197225| | 5|192073| | 0|131296| | 16| 62465| | 13| 52408| +---+------+ Top 5 Cluster trip counts ?
  • 98. 98 © 2018 MapR Technologies, Inc val points = df.select("lat","lon","cid").orderBy(desc("dt")) Display latest locations and Cluster centers on a Google Map
  • 99. 99 © 2018 MapR Technologies, Inc df.filter($"_id" <= "1").select(hour($"dt").alias("hour"), $"cid") .groupBy("hour","cid").agg(count("cid") .alias("count")) Which hours have the highest pickups for cluster id 0 ?
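Why the filter `_id <= "1"` selects only cluster 0 follows from how the row key is built (cluster id, underscore, reverse timestamp) and from lexicographic string ordering. The sketch below checks this in plain Scala. `RowKeySketch`, `rowKey`, and `matchesFilter` are hypothetical names, the `rdt` value used in the usage note is made up, and the sketch assumes ASCII keys so that Scala's string comparison agrees with a byte-wise comparison.

```scala
object RowKeySketch {
  // Build the row key the same way as the slide: cid + "_" + rdt.
  def rowKey(cid: Int, rdt: String): String = s"${cid}_$rdt"

  // Lexicographic comparison, mirroring the filter _id <= "1".
  def matchesFilter(id: String): Boolean = id <= "1"

  def main(args: Array[String]): Unit = {
    val rdt = "9223370496426720000" // made-up reverse timestamp
    Seq(0, 1, 6, 13, 16).foreach { cid =>
      val id = rowKey(cid, rdt)
      println(s"$id matches: ${matchesFilter(id)}")
    }
  }
}
```

Keys beginning with `0_` sort before `"1"`, while keys such as `1_...`, `13_...`, and `16_...` share the prefix `1` but are longer, so they sort after it. Because the table is sorted by `_id`, this range predicate can be served from one contiguous range of row keys.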
  • 100. 100 © 2018 MapR Technologies, Inc df.filter($"_id" <= "1").select(hour($"dt").alias("hour"), $"cid") .groupBy("hour","cid").agg(count("cid").alias("count")) .orderBy(desc( "count")).explain == Physical Plan == *(3) Sort [count#120L DESC NULLS LAST], true, 0 +- Exchange rangepartitioning(count#120L DESC NULLS LAST, 200) +- *(2) HashAggregate(keys=[hour#113, cid#5], functions=[count(cid#5)]) +- Exchange hashpartitioning(hour#113, cid#5, 200) +- *(1) HashAggregate(keys=[hour#113, cid#5], functions=[partial_count(cid#5)]) +- *(1) Project [hour(dt#1, Some(Etc/UTC)) AS hour#113, cid#5] +- *(1) Filter (isnotnull(_id#0) && (_id#0 <= 1)) +- *(1) Scan MapRDBRelation(/user/mapr/ubertable [dt#1,cid#5,_id#0] PushedFilters: [IsNotNull(_id), LessThanOrEqual(_id,1)] MapR-DB Projection and Filter push down
  • 101. 101 © 2018 MapR Technologies, Inc Spark MapR-DB Projection Filter push down Projection and Filter pushdown reduces the amount of data passed between MapR-DB and the Spark engine when selecting and filtering data. Data is selected and filtered in MapR-DB!
  • 102. 102 © 2018 MapR Technologies, Inc SELECT hour(uber.dt) as hr,cid, count(cid) as ct FROM uber GROUP BY hour(uber.dt), cid Which hours and Clusters have the highest pick ups?
  • 103. 103 © 2018 MapR Technologies, Inc MapR Data Platform
  • 104. 104 © 2018 MapR Technologies, Inc Link to Code for this webinar is in appendix of this book. https://mapr.com/ebook/getting-started-with-apache-spark-v2/ New Spark Ebook
  • 106. 106 © 2018 MapR Technologies, Inc MapR Free ODT http://learn.mapr.com/ To Learn More: New Spark 2.0 training
  • 107. 107 © 2018 MapR Technologies, Inc https://mapr.com/blog/ MapR Blog