SlideShare a Scribd company logo
1 of 100
Download to read offline
Getting	Started	with	Spark	2.x	Machine	
Learning	Pipelines	to	Predict	Flight	Delays	
CAROL	MCDONALD	
Solutions	Architect
2 © 2018 MapR Technologies, Inc.
Agenda	
•  Introduction	to	Machine	Learning	Classification	
•  Explore	the	Flight	Dataset	with	Spark	DataFrames	
•  Use	Random	Forest	to	Predict	Flight	Delays	
2
Intro	to	Machine	Learning
4 © 2018 MapR Technologies, Inc.
What	is	AI?	
		
Image	reference	NVIDIA	
The	definition	of	AI	is	continuously	redefined
5 © 2018 MapR Technologies, Inc.
AI	Late	80s:	Expert	System	rules	engine	
The	definition	of	AI	is	continuously	redefined	:	1980s:
6 © 2018 MapR Technologies, Inc.
Problems	with	hard	coded	Rules	
Rules	are	manual,		uses	a	human	expert	
–  difficult	to	maintain	
–  give	a	one	size	fits	all	decision!		(2	times	overdose	same	as	38	times)	
Machine	learning	uses	data	and	statistics		
–  can	give	finer	grained	predictions
7 © 2018 MapR Technologies, Inc.
What	is	Machine	Learning?	
Data Build ModelTrain Algorithm
Finds patterns
New Data Use Model
(prediction function)
Predictions
Contains patterns Recognizes patterns
8 © 2018 MapR Technologies, Inc.
What	is	Supervised	Machine	Learning?	
Machine Learning
Unsupervised
•  Clustering
•  Collaborative Filtering
•  Frequent Pattern Mining
Supervised
•  Classification
•  Regression
9 © 2018 MapR Technologies, Inc.
Supervised	Algorithms	use	labeled	data	
Data
features	
Build Model
New Data
features	
Predict
Use Model
X1, X2
Y
f(X1, X2) =Y
X1, X2
Y
10 © 2018 MapR Technologies, Inc.
Supervised	Machine	Learning:	Classification	&	Regression	
Classification
Identifies
category for item
11 © 2018 MapR Technologies, Inc.
Classification:	Definition	
Form	of	ML	that:	
•  Identifies	which	category	an	item	belongs	to		
•  Uses	supervised	learning	algorithms	
–  Data	is	labeled		
	
Sentiment
12 © 2018 MapR Technologies, Inc.
	If	it	Walks/Swims/Quacks	Like	a	Duck	……	Then	It	Must	Be	a	Duck	
swims
walks
quacks
Features:
walks
quacks
swims
Features:
13 © 2018 MapR Technologies, Inc.
Credit	Card	Fraud	Example	
•  What	are	we	trying	to	predict?		
–  This	is	the	Label:		
–  The	probability	of	Fraud	
•  What	are	the	“if	questions”	or	properties	we	can	use	to	predict?		
–  These	are	the	Features:		
–  Amount	$$	24	hours	>	historical	use	?		
Credit	Card	Transaction	Features	
Number	of	Transactions	last	24	hours	
Total	$	Amount	last	24	hours	
Average	Amount	last	24	hours	
Average	Amount	last	24	hours	compared	to	historical	use	
Location	and	Time	difference	since	Last	Transaction	
Average	transaction		
fraud	risk	of	merchant	type	
Merchant	types	for	day	compared	to	historical	use	
Features derived
From Transaction
History
14 © 2018 MapR Technologies, Inc.
Label
Probabilty
of Fraud 1
X
Features: trans amount, Time Location
difference last trans.
Fraud
0
Not Fraud
.5
Credit	Card	Fraud	Logistic	Regression	Example
15 © 2018 MapR Technologies, Inc.
Supervised	Algorithms	use	labeled	data	
Data
features	
Build Model
New Data
features	
Predict
Use Model (logistic regression function)
X trans amount…..
Y
y = 1 / 1+ exp -(a + bx)
X trans amount…..
Y
Model=	logistic	regression	function
16 © 2018 MapR Technologies, Inc.
House	Price	Prediction	Example	
•  What	are	we	trying	to	predict?		
–  This	is	the	Label	or	Target	outcome:		
–  The	house	price	$	
•  What	are	the	“if	questions”	or	properties	we	can	use	to	predict?		
–  These	are	the	Features:		
–  The	size	of	the	house	(square	meters)
17 © 2018 MapR Technologies, Inc.
Label:
House Price
Y
X
Feature: house size
(square meters)
Data point: price, size
House Price = intercept + coefficient * house size
y = a + bx
House	Price	Regression	Example
18 © 2018 MapR Technologies, Inc.
Supervised	Algorithms	use	labeled	data	
Data
features	
Build Model
New Data
features	
Predict
Use Model: slope a, coefficient b
X size
Y
y = a + bx
X
Y
19 © 2018 MapR Technologies, Inc.
Steps	to	Build	a	Machine	Learning	Solution	
1	
Identify	
Problem		
2	Get/
Prepare	
Data	
3	
Model	
Data	
4	
Get	Insight	
5	
Deploy	
Model	
6	
Evaluate	/	
Track	
Performance	
Machine Learning
20 © 2018 MapR Technologies, Inc.
Supervised	Learning:	Classification	&	Regression	
•  Classification:	
–  identifies	which	category	(eg	fraud	or	not	fraud)	
•  Linear	Regression:		
–  predicts	a	value	(eg	amount	of	fraud)
21 © 2018 MapR Technologies, Inc.
Examples	
•  Retail	Example:	
–  Predict	price,	sales		
•  Telecom:	
–  Predict	customer	will	churn	
•  Healthcare	Example:			
–  Probability	of	readmission	
•  Marketing	
–  Predict	probability	customer	will	click	on	add
Exploring	the	Flight	Delay	Dataset
23 © 2018 MapR Technologies, Inc.
How	a	Spark	Application	Runs	on	a	Cluster
24 © 2018 MapR Technologies, Inc.
Spark	Distributed	Datasets	
partitioned
•  Read only collection of typed objects
Dataset[T]
•  Partitioned across a cluster
•  Operated on in parallel
•  in memory can be Cached
25 © 2018 MapR Technologies, Inc.
Dataset	merged	with	Dataframe	
•  in	Spark	2.0,	DataFrame	APIs	merged	with	Datasets	APIs	
•  A	Dataset	is	a	collection	of	typed	objects	(SQL	and	functions)	
•  Dataset[T]		
•  A	DataFrame	is	a	Dataset	of	generic	Row	objects		(SQL)	
•  Dataset[Row]
26 © 2018 MapR Technologies, Inc.
{
"_id": ”ATL_LGA_2017-01-01_AA_1678",
"dofW": 7,
"carrier": "AA",
"origin": "ATL",
"dest": "LGA",
"crsdephour": 17,
"crsdeptime": 1700,
"depdelay": 0.0,
"crsarrtime": 1912,
"arrdelay": 0.0,
"crselapsedtime": 132.0,
"dist": 762.0
}
Flight Dataset
27 © 2018 MapR Technologies, Inc.
Loading	a	Dataset
28 © 2018 MapR Technologies, Inc.
Load	the	data	into	a	Dataset:	Define	the	Schema
29 © 2018 MapR Technologies, Inc.
Dataset
Load data
Load	the	data	into	a	Dataset	
val df: Dataset[Flight] = spark.read.option("inferSchema", "false")
.schema(schema).json(”/p/file.json").as[Flight]
columns row
30 © 2018 MapR Technologies, Inc.
Spark	Distributed	Datasets	
Transformations create a new Dataset
from the current one,
Lazily evaluated
Actions return a value to the driver
31 © 2018 MapR Technologies, Inc.
Dataset
Load data
Action	on	a	Dataset	
df.take(1)
result:
Array[Flight] = Array(Flight(ATL_LGA_2017-01-01_17_AA_1678,
7,AA,ATL,LGA,17,1700.0,0.0,1912.0,0.0,132.0,762.0))
action
32 © 2018 MapR Technologies, Inc.
DataFrame	Transformations	
L	I	Query	 Description	
agg(expr, exprs) Aggregates	on	entire	DataFrame	
distinct Returns	new	DataFrame	with	unique	rows	
except(other) Returns	new	DataFrame	with	rows	from	this	DataFrame	not	in	other	
DataFrame	
filter(expr);
where(condition)
Filter	based	on	the	SQL	expression	or	condition	
groupBy(cols:
Columns)
Groups	DataFrame	using	specified	columns	
join (DataFrame,
joinExpr)
Joins	with	another	DataFrame	using	given	join	expression	
sort(sortcol) Returns	new	DataFrame	sorted	by	specified	column	
select(col) Selects	set	of	columns
33 © 2018 MapR Technologies, Inc.
DataFrame	Actions	
Action	 Description	
collect() Returns	array	containing	all	rows	in	DataFrame	
count() Returns	number	of	rows	in	DataFrame	
describe(cols:String*) Computes	statistics	for	numeric	columns	(count,	mean,	
stddev,	min	and	max)	
first() Returns	the	first	row	
head() Returns	the	first	row	
show() Displays	first	20	rows	
take(n:int) Returns	the	first	n	rows
34 © 2018 MapR Technologies, Inc.
Example	loading	a	Dataset	
val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight]
df.cache
df.count
Worker	
Worker	
Worker	
Driver	
Block	1	
Block	2	
Block	3
35 © 2018 MapR Technologies, Inc.
Example:	
Worker	
Worker	
Worker	
Block	1	
Block	2	
Block	3	
Driver	
tasks
tasks
tasks
36 © 2018 MapR Technologies, Inc.
Example	
Worker	
Worker	
Worker	
Block	1	
Block	2	
Block	3	
Driver	
Read
HDFS
Block
Read
HDFS
Block
Read
HDFS
Block
37 © 2018 MapR Technologies, Inc.
Example	
Worker	
Worker	
Worker	
Block	1	
Block	2	
Block	3	
Driver	
Cache	1	
Cache	2	
Cache	3	
Process
& Cache
Data
Process
& Cache
Data
Process
& Cache
Data
38 © 2018 MapR Technologies, Inc.
Example:	
Worker	
Worker	
Worker	
Block	1	
Block	2	
Block	3	
Driver	
Cache	1	
Cache	2	
Cache	3	
results
results
results
val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight]
df.cache
df.count
res9: Long = 41348
39 © 2018 MapR Technologies, Inc.
Example	
Worker	
Worker	
Worker	
Block	1	
Block	2	
Block	3	
Driver	
Cache	1	
Cache	2	
Cache	3	
val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight]
df.cache
df.count
df.show
40 © 2018 MapR Technologies, Inc.
Example	
Worker	
Worker	
Worker	
Block	1	
Block	2	
Block	3	
Cache	1	
Cache	2	
Cache	3	
tasks
tasks
tasks
Driver
41 © 2018 MapR Technologies, Inc.
Example:	Log	Mining	
Worker	
Worker	
Worker	
Block	1	
Block	2	
Block	3	
Cache	1	
Cache	2	
Cache	3	
Driver	
Process
from
Cache
Process
from
Cache
Process
from
Cache
Cached,	does	not	
have	to	read	from	
file	again	
val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight]
df.cache
df.count
df.show
42 © 2018 MapR Technologies, Inc.
Example:	
Worker	
Worker	
Worker	
Block	1	
Block	2	
Block	3	
Cache	1	
Cache	2	
Cache	3	
Driver	
results
results
results
val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight]
df.cache
df.count
df.show
43 © 2018 MapR Technologies, Inc.
Example:	
Worker	
Worker	
Worker	
Block	1	
Block	2	
Block	3	
Cache	1	
Cache	2	
Cache	3	
Driver	
Cache	your	data	è	Faster	Results	
val df: Dataset[Uber] = spark.read.option("inferSchema",
"false").schema(schema).csv(“data/uber.csv").as[Uber]
df.cache
df.count
df.show
44 © 2018 MapR Technologies, Inc.
Spark	Use	Cases	
Iterative	Algorithms	on	large	amounts	of	data	
	
Some	Algorithms	that	need	iterations	
•  Clustering	(K-Means)	
•  Linear	Regression		
•  Graph	Algorithms	(e.g.,	PageRank)	
•  Alternating	Least	Squares	ALS	
	
Some	Example	Use	Cases:	
•  Anomaly	detection	
•  Classification	
•  Recommendations
45 © 2018 MapR Technologies, Inc.
Destinations	with	highest	number	of	Departure	Delays	
df.filter($"depdelay" > 40).groupBy(”dest”)
.count().orderBy(desc(count)).show(10)
46 © 2018 MapR Technologies, Inc.
Top	5	Longest	Departure	Delays	
df.select($"carrier",$"origin",$"dest”,$"depdelay”,
$"crsdephour”).filter($"depdelay" > 40)
.orderBy(desc( "depdelay" )).show(5)
	
+-------+------+----+--------+----------+
|carrier|origin|dest|depdelay|crsdephour|
+-------+------+----+--------+----------+
| AA| SFO| ORD| 1440.0| 8|
| DL| BOS| ATL| 1185.0| 17|
| UA| DEN| EWR| 1138.0| 12|
| DL| ORD| ATL| 1087.0| 19|
| UA| MIA| EWR| 1072.0| 20|
+-------+------+----+--------+----------+
47 © 2018 MapR Technologies, Inc.
Register	Dataframe	as	a	Temporary	View
48 © 2018 MapR Technologies, Inc.
Average	Departure	Delay	by	Carrier	
%sql select carrier, avg(depdelay) from flights
group by carrier
49 © 2018 MapR Technologies, Inc.
Count	of		Departure	Delays	by	Carrier	
%sql select carrier, count(depdelay) from
flights where depdelay > 40 group by carrier
50 © 2018 MapR Technologies, Inc.
Count	of		Departure	Delays	by	Day	of	the	Week	
%sql select dofw, count(depdelay) from flights
where depdelay > 40 group by dofw
Monday= 1 Sunday = 7
51 © 2018 MapR Technologies, Inc.
Count	of		Departure	Delays	by	Hour	of	the	Day	
%sql select crsdephour, count(depdelay) from
flights where depdelay > 40 group by crsdephour
52 © 2018 MapR Technologies, Inc.
Count	of		Departure	Delays	by	Origin	
%sql select origin, count(depdelay) from flights
where depdelay > 40 group by origin
53 © 2018 MapR Technologies, Inc.
Count	of		Departure	Delays	by	Origin,	Destination	
%sql select origin,dest count(depdelay) from
flights where depdelay > 40 group by origin,dest
54 © 2018 MapR Technologies, Inc.
Bucket	Dataset	for	Delayed	not	Delayed		
val delaybucketizer = new Bucketizer()
.setInputCol("depdelay”)
.setOutputCol("delayed")
.setSplits(Array( 0.0, 40.0))
 
val df2 = delaybucketizer.transform(df1)
  
55 © 2018 MapR Technologies, Inc.
Count	of		Departure	Delays	by	Origin,	Destination	
%sql select origin,dest count(depdelay) from
flights where depdelay > 40 group by origin,dest
56 © 2018 MapR Technologies, Inc.
Bucket	Dataset	for	Delayed	not	Delayed		
 df2.groupBy("delayed").count.show
result:
+-------+-----+
|delayed|count|
+-------+-----+
| 0.0|36790|
| 1.0| 4558|
+-------+-----+
57 © 2018 MapR Technologies, Inc.
Stratify		
val fractions = Map(0.0 -> .13, 1.0 -> 1.0)
Val train = df.stat.sampleBy("delayed", fractions, 36L)
 
train.groupBy("delayed").count.show
 
result:
 
+-------+-----+
|delayed|count|
+-------+-----+
| 0.0| 4766|
| 1.0| 4558|
+-------+-----+
  
Predict	Flight	Delays
59 © 2018 MapR Technologies, Inc.
ML	Discovery	Model	Building	
Model
Training/
Building
Training
Set
Test Model
Predictions
Test
Set
Evaluate Results
Archived
Data
Deployed
Model
Insights
Data
Discovery,
Model
Creation
Production
Feature Extraction
Feature
Extraction
Flight
data
Stream
InputFlight
Data
New Data
60 © 2018 MapR Technologies, Inc.
Flight	Delay	Example	
•  What	are	we	trying	to	predict?		
–  This	is	the	Label	or	Target	outcome:		
–  Delayed	=	1		if	delay	>	40	minutes	
•  What	are	the	“if	questions”	or	properties	we	can	use	to	predict?		
•  These	are	the	Features:		
–  Day	of	the	week,	scheduled	departure	time,	scheduled	arrival	time,	Airline	carrier,	scheduled	travel	
time,	Origin,	Destination,	Distance	
featureslabel
61 © 2018 MapR Technologies, Inc.
Decision	Tree	for	Classification	
•  Tree of decisions about features
•  IF THEN ELSE questions using
features at each tree node
•  Answers branch to child nodes
62 © 2018 MapR Technologies, Inc.
Decision	Tree	
•  Q1:	If	the	scheduled	departure	time	<	10:15	AM	
–  T:Q2:	If	the	originating	airport	is	in	the	set	{ORD,	ATL,SFO}	
§  T:Q3:	If	the	day	of	the	week	is	in	{Monday,	Sunday}	
•  T:Delayed=1	
•  F:	Delayed=0	
§  F:	Q3:	If	the	destination	airport	is	in	the	set	{SFO,ORD,EWR}	
•  T:	Delayed=1	
•  F:	Delayed=0	
–  F	:Q2:	If	the	originating	airport	is	not	in	the	set	{BOS	.	MIA}	
§  T:Q3:	If	the	day	of	the	week	is	in	the	set	{Monday	,	Sunday}	
•  T:	Delayed=1	
•  F:	Delayed=0	
§  F:	Q3:	If	the	destination	airport	is	not	in	the	set	{BOS	.	MIA}	
•  T:	Delayed=1	
•  F:	Delayed=0
63 © 2018 MapR Technologies, Inc.
Random	Forest
64 © 2018 MapR Technologies, Inc.
Featurization,	Training	and	Model	Selection	
Image reference O’Reilly Learning Spark
+
+ ̶+
̶ ̶
Feature Vectors Model
Featurization Training
Model
Evaluation
Best Model
Training Data
+
+
̶+
̶ ̶
+
+
̶+
̶ ̶
+
+
̶+
̶ ̶
+
+ ̶+
̶ ̶
Feature Vectors are vectors of numbers representing the value for each feature
65 © 2018 MapR Technologies, Inc.
ML	Discovery	Model	Building	
Model
Training/
Building
Training
Set
Test Model
Predictions
Test
Set
Evaluate Results
Archived
Data
Deployed
Model
Insights
Data
Discovery,
Model
Creation
Production
Feature Extraction
Feature
Extraction
Flight
data
Stream
InputFlight
Data
New Data
66 © 2018 MapR Technologies, Inc.
Spark	ML	workflow	
fit
transform
transform
67 © 2018 MapR Technologies, Inc.
Spark	ML	workflow	with	a	Pipeline	
A pipeline estimator
fit returns a pipeline
model 	
fit
transform
68 © 2018 MapR Technologies, Inc.
Extract	the	Features	
Image reference O’Reilly Learning Spark
+
+ ̶+
̶ ̶
Feature Vectors Model
Featurization Training
Model
Evaluation
Best Model
Training Data
+
+
̶+
̶ ̶
+
+
̶+
̶ ̶
+
+
̶+
̶ ̶
+
+ ̶+
̶ ̶
Feature Vectors are vectors of numbers representing the value for each feature
69 © 2018 MapR Technologies, Inc.
Set	up	Transformer	Estimator	Pipeline	
A Pipeline chains multiple Transformers and Estimators together in a ML workflow
70 © 2018 MapR Technologies, Inc.
StringIndexer	Transformer	maps	Strings	to	Numbers	
val indexer = new StringIndexer()
.setInputCol(”carrier")
.setOutputCol(”carrierIndex”)
Val df2=indexer.transform(df)
Transformer:
•  transform() method converts one
DataFrame into another, by
appending columns
71 © 2018 MapR Technologies, Inc.
Create	StringIndexers	for	all	Categorical	features	
// column names for categorical feature columns
val categoricalColumns = Array("carrier", "origin", "dest",
"dofW", "orig_dest")
 
// create stringindexers for all string columns
val stringIndexers = categoricalColumns.map{ colName =>
new StringIndexer()
.setInputCol(colName)
.setOutputCol(colName + "Indexed")
.fit(strain)
}
72 © 2018 MapR Technologies, Inc.
Use	Bucketizer	Transformer	to	add	a	label		0	or	1		
// add a label column based on departure delay
val labeler = new Bucketizer().setInputCol("depdelay")
.setOutputCol("label")
.setSplits(Array(0.0, 40.0,
Double.PositiveInfinity)
73 © 2018 MapR Technologies, Inc.
Use	Vector	Assembler	to	add	features	
val featureCols = Array("carrierIndexed", "destIndexed”,"originIndexed",
"dofWIndexed", "orig_destIndexed”,"crsdephour", "crsdeptime”,"crsarrtime”,
"crselapsedtime", "dist")
// combines a list of feature columns into a vector column
val assembler = new VectorAssembler()
.setInputCols(featureCols)
.setOutputCol("features")
74 © 2018 MapR Technologies, Inc.
Create	RandomForest	Estimator	to	train	our	model	
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
75 © 2018 MapR Technologies, Inc.
Put	Feature	Transformers	and	Estimator	in	Pipeline	
val steps = stringIndexers ++ Array(labeler, assembler, rf)
 
val pipeline = new Pipeline().setStages(steps)
Estimator	pipeline	fit()	method		trains	on	a	dataset	and	produces	a	model	
fit
76 © 2018 MapR Technologies, Inc.
Evaluate	the	Predictions	from	the	model	
val areaUnderROC = evaluator.evaluate(predictions)
println("areaUnderROC", areaUnderROC)
77 © 2018 MapR Technologies, Inc.
K-fold	Cross-Validation	Process	
Data
Model
Training/
Building
Training
Set
Test Model
Predictions
Test
Set
data is randomly split into K partition training and test dataset pairs
78 © 2018 MapR Technologies, Inc.
K-fold		Cross-Validation	Process	
Data
Model
Training
Training
Set
Test Model
Predictions
Test
Set
Train algorithm with training dataset
79 © 2018 MapR Technologies, Inc.
ML	Cross-Validation	Process	
Data Model
Training
Set
Test Model
Predictions
Test
Set
Evaluate the model with the Test Set
80 © 2018 MapR Technologies, Inc.
K-fold		Cross-Validation	Process	
Data
Model
Training/
Building
Training
Set
Test Model
Predictions
Test
Set
Train/Test loop K times
Repeat K times
select the Model produced by the best-performing set of parameters
81 © 2018 MapR Technologies, Inc.
Cross	Validation	transformation	estimation	pipeline	
Set up a CrossValidator with:
•  Parameter grid, Estimator (pipeline), Evaluator
Perform grid search based model selection
82 © 2018 MapR Technologies, Inc.
Parameter	Tuning	with	CrossValidator	with	a	Paramgrid	
CrossValidator	
•  Given:	
–  Estimator	
–  Parameter	grid	
–  Evaluator	
•  Find	best	parameters	and	
model	
val paramGrid = new ParamGridBuilder()
.addGrid(rf.maxBins, Array(100, 200))
.addGrid(rf.maxDepth, Array(2, 4, 10))
.addGrid(rf.numTrees, Array(5, 20))
.addGrid(rf.impurity, Array("entropy", "gini"))
.build()
val evaluator= new BinaryClassificationEvaluator()
.setLabelCol("label")
.setRawPredictionCol("prediction")
val crossval = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)
83 © 2018 MapR Technologies, Inc.
Crossvalidator	fit	a	Model		
val pipelineModel = crossvalidator.fit(train)
84 © 2018 MapR Technologies, Inc.
Random	Forest	Feature	Importances	
val featureImportances = pipelineModel.bestModel.asInstanceOf[PipelineModel]
.stages(stringIndexers.size + 2)
.asInstanceOf[RandomForestClassificationModel].featureImportances
assembler.getInputCols.zip(featureImportances.toArray).sortBy(-_._2).foreach { case (feat, imp) =>
println(s"feature: $feat, importance: $imp") }
result:
feature: crsdeptime, importance: 0.2954321019748874
feature: orig_destIndexed, importance: 0.21540676913162476
feature: crsarrtime, importance: 0.1594826730807351
feature: crsdephour, importance: 0.11232750835024508
feature: destIndexed, importance: 0.07068851952515658
feature: carrierIndexed, importance: 0.03737067561393635
feature: dist, importance: 0.03675114205144413
feature: dofWIndexed, importance: 0.030118527912782744
feature: originIndexed, importance: 0.022401521272697823
feature: crselapsedtime, importance: 0.020020561086490113
85 © 2018 MapR Technologies, Inc.
Get	Test	Predictions
86 © 2018 MapR Technologies, Inc.
Get	Predictions	on	Test	Data	
val predictions = pipelineModel.transform(test)
87 © 2018 MapR Technologies, Inc.
Evaluate	the	Predictions	from	the	model	
val areaUnderROC = evaluator.evaluate(predictions)
88 © 2018 MapR Technologies, Inc.
Area	under	the	ROC	curve	
Accuracy	is	measured	by	the	area	under	the	
ROC	curve.	
The	area	measures	correct	classifications	
	
•  An	area	of	1	represents	a	perfect	test		
•  an	area	of	.5	represents	a	worthless	test
89 © 2018 MapR Technologies, Inc.
Calculate	Metrics	
val	lp	=	predictions.select(	"label",	"prediction")	
val	counttotal	=	predictions.count()	
val	correct	=	lp.filter($"label"	===	$"prediction").count()	
val	wrong	=	lp.filter(not($"label"	===	$"prediction")).count()	
val	truep	=	lp.filter($"prediction"	===	0.0).filter($"label"	===	$"prediction").count()	/
counttotal.toDouble	
val	truen	=	lp.filter($"prediction"	===	1.0).filter($"label"	===	$"prediction").count()/
counttotal.toDouble	
val	falsen	=	lp.filter($"prediction"	===	0.0).filter(not($"label"	===	$"prediction")).count()/
counttotal.toDouble	
val	falsep	=	lp.filter($"prediction"	===	1.0).filter(not($"label"	===	$"prediction")).count()/
counttotal.toDouble	
val	ratioWrong=wrong.toDouble/counttotal.toDouble	
val	ratioCorrect=correct.toDouble/counttotal.toDouble
90 © 2018 MapR Technologies, Inc.
Calculate	Metrics	
ratio	Correct0.6129679791041889	
ratio	wrong	0.3870320208958112	
true	positive	0.537994582567476	
true	negative	0.07497339653671278	
false	negative	0.035261681338879754	
false	positive	0.35177033955693143
91 © 2018 MapR Technologies, Inc.
Precision	Grid	
	prediction/label	 Actual	Not	Delayed	 	Actual	Delayed	
Predicted	Not	Delayed	 		.53	(TP)	 	.035	(FN)	
	Predicted	Delayed	 		.35	(FP)	 	.074	(TN)
92 © 2018 MapR Technologies, Inc.
Save	the	Model	
pipelineModel.write.overwrite().save(”/pathl")	
	
	
Later	
	
val	sameModel	=	CrossValidatorModel.load(“/path")
93 © 2018 MapR Technologies, Inc.
Serve DataCollect DataData Sources
Flight
Data
Stream Processing
process	
Batch Processing
	
Model	
	
build model
update model
Machine-learning
Models
Stream
Topic
Streaming
Flight
Data
Kafka
API
Stream
Topic
Stream
Topic
End	to	End	Application	Architecture	
MapR-XD
94 © 2018 MapR Technologies, Inc.
Simplified	Pipeline:	Real-Time	Flight	Delay	Predictions	
Flight data
Stream
Input
Stream
Scores
Spark	
Streaming	
Spark	
Streaming	
SQL
Machine
Lerning
Model
Flight Delay
predictions
Results
Data Collect Process Store Analyze
95 © 2018 MapR Technologies, Inc.
Machine	Learning	Pipeline	for	Flight	Delays	
Input Data +
Actual Delay
Input Data +
Predictions
Consumer
withML
Model 2
Consumer
withML
Model 1
Decoy
results
Consumer
Consumer
withML
Model 3
Consumer
Stream
Archive
Stream
Scores
Stream
Input
SQL
SQL
Real time
Flight Data
Stream
Input
Actual Delay
Input Data +
Predictions +
Actual Delay
Real Time
dashboard +
Historical
Analysis
96 © 2018 MapR Technologies, Inc.
New	Spark	Ebook	
Link	to	Code	for	this	webinar	is	
in	appendix	of	this		
Book.			
https://mapr.com/ebooks/
97 © 2018 MapR Technologies, Inc.
98 © 2018 MapR Technologies, Inc.
To	Learn	More:	New	Spark	2.0	training		
•  MapR	Free	ODT	http://learn.mapr.com/
99 © 2018 MapR Technologies, Inc.
MapR	Blog		
HTTPS://MAPR.COM/BLOG/
100 © 2018 MapR Technologies, Inc.
Q&A
ENGAGE WITH US

More Related Content

What's hot

GSK: How Knowledge Graphs Improve Clinical Reporting Workflows
GSK: How Knowledge Graphs Improve Clinical Reporting WorkflowsGSK: How Knowledge Graphs Improve Clinical Reporting Workflows
GSK: How Knowledge Graphs Improve Clinical Reporting WorkflowsNeo4j
 
Big Data Stockholm v 7 | "Federated Machine Learning for Collaborative and Se...
Big Data Stockholm v 7 | "Federated Machine Learning for Collaborative and Se...Big Data Stockholm v 7 | "Federated Machine Learning for Collaborative and Se...
Big Data Stockholm v 7 | "Federated Machine Learning for Collaborative and Se...Dataconomy Media
 
How Graph Algorithms Answer your Business Questions in Banking and Beyond
How Graph Algorithms Answer your Business Questions in Banking and BeyondHow Graph Algorithms Answer your Business Questions in Banking and Beyond
How Graph Algorithms Answer your Business Questions in Banking and BeyondNeo4j
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Trey Grainger
 
Automatic Model Documentation with H2O
Automatic Model Documentation with H2OAutomatic Model Documentation with H2O
Automatic Model Documentation with H2OSri Ambati
 
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...Neo4j
 
Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learningSri Ambati
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
Demystifying Graph Neural Networks
Demystifying Graph Neural NetworksDemystifying Graph Neural Networks
Demystifying Graph Neural NetworksNeo4j
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
How Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global TravelHow Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global TravelNeo4j
 
Applying Network Analytics in KYC
Applying Network Analytics in KYCApplying Network Analytics in KYC
Applying Network Analytics in KYCNeo4j
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureDatabricks
 
Smarter Fraud Detection With Graph Data Science
Smarter Fraud Detection With Graph Data ScienceSmarter Fraud Detection With Graph Data Science
Smarter Fraud Detection With Graph Data ScienceNeo4j
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflowDatabricks
 

What's hot (20)

GSK: How Knowledge Graphs Improve Clinical Reporting Workflows
GSK: How Knowledge Graphs Improve Clinical Reporting WorkflowsGSK: How Knowledge Graphs Improve Clinical Reporting Workflows
GSK: How Knowledge Graphs Improve Clinical Reporting Workflows
 
Big Data Stockholm v 7 | "Federated Machine Learning for Collaborative and Se...
Big Data Stockholm v 7 | "Federated Machine Learning for Collaborative and Se...Big Data Stockholm v 7 | "Federated Machine Learning for Collaborative and Se...
Big Data Stockholm v 7 | "Federated Machine Learning for Collaborative and Se...
 
How Graph Algorithms Answer your Business Questions in Banking and Beyond
How Graph Algorithms Answer your Business Questions in Banking and BeyondHow Graph Algorithms Answer your Business Questions in Banking and Beyond
How Graph Algorithms Answer your Business Questions in Banking and Beyond
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
Automatic Model Documentation with H2O
Automatic Model Documentation with H2OAutomatic Model Documentation with H2O
Automatic Model Documentation with H2O
 
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
EY + Neo4j: Why graph technology makes sense for fraud detection and customer...
 
Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learning
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Demystifying Graph Neural Networks
Demystifying Graph Neural NetworksDemystifying Graph Neural Networks
Demystifying Graph Neural Networks
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Apache flink
Apache flinkApache flink
Apache flink
 
How Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global TravelHow Expedia’s Entity Graph Powers Global Travel
How Expedia’s Entity Graph Powers Global Travel
 
Applying Network Analytics in KYC
Applying Network Analytics in KYCApplying Network Analytics in KYC
Applying Network Analytics in KYC
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
 
Smarter Fraud Detection With Graph Data Science
Smarter Fraud Detection With Graph Data ScienceSmarter Fraud Detection With Graph Data Science
Smarter Fraud Detection With Graph Data Science
 
RDD
RDDRDD
RDD
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflow
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 

Similar to Predicting Flight Delays with Spark Machine Learning

Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...Carol McDonald
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...DataWorks Summit
 
Predictive Maintenance - Portland Machine Learning Meetup
Predictive Maintenance - Portland Machine Learning MeetupPredictive Maintenance - Portland Machine Learning Meetup
Predictive Maintenance - Portland Machine Learning MeetupIan Downard
 
Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018TigerGraph
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Carol McDonald
 
Gain Insights with Graph Analytics
Gain Insights with Graph Analytics Gain Insights with Graph Analytics
Gain Insights with Graph Analytics Jean Ihm
 
Improving Apache Spark™ In-Memory Computing with Apache Ignite™
 Improving Apache Spark™ In-Memory Computing with Apache Ignite™ Improving Apache Spark™ In-Memory Computing with Apache Ignite™
Improving Apache Spark™ In-Memory Computing with Apache Ignite™Tom Diederich
 
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Taro L. Saito
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Carol McDonald
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Makoto Yui
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
 
Azure machine learning
Azure machine learningAzure machine learning
Azure machine learningMark Reynolds
 
Airline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseAirline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseDataWorks Summit
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
MySQL 8.0 GIS Overview
MySQL 8.0 GIS OverviewMySQL 8.0 GIS Overview
MySQL 8.0 GIS OverviewNorvald Ryeng
 
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...LEGATO project
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 

Similar to Predicting Flight Delays with Spark Machine Learning (20)

Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...
 
Predictive Maintenance - Portland Machine Learning Meetup
Predictive Maintenance - Portland Machine Learning MeetupPredictive Maintenance - Portland Machine Learning Meetup
Predictive Maintenance - Portland Machine Learning Meetup
 
Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1
 
Gain Insights with Graph Analytics
Gain Insights with Graph Analytics Gain Insights with Graph Analytics
Gain Insights with Graph Analytics
 
Improving Apache Spark™ In-Memory Computing with Apache Ignite™
 Improving Apache Spark™ In-Memory Computing with Apache Ignite™ Improving Apache Spark™ In-Memory Computing with Apache Ignite™
Improving Apache Spark™ In-Memory Computing with Apache Ignite™
 
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
Airframe: Lightweight Building Blocks for Scala @ TD Tech Talk 2018-10-17
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
 
MATLAB and HDF-EOS
MATLAB and HDF-EOSMATLAB and HDF-EOS
MATLAB and HDF-EOS
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Azure machine learning
Azure machine learningAzure machine learning
Azure machine learning
 
Airline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseAirline reservations and routing: a graph use case
Airline reservations and routing: a graph use case
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
MySQL 8.0 GIS Overview
MySQL 8.0 GIS OverviewMySQL 8.0 GIS Overview
MySQL 8.0 GIS Overview
 
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
 
Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 

More from Carol McDonald

Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUsCarol McDonald
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBCarol McDonald
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Carol McDonald
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Carol McDonald
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Carol McDonald
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningCarol McDonald
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Carol McDonald
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Carol McDonald
 
Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churnCarol McDonald
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient DataCarol McDonald
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Machine Learning Recommendations with Spark
Machine Learning Recommendations with SparkMachine Learning Recommendations with Spark
Machine Learning Recommendations with SparkCarol McDonald
 

More from Carol McDonald (20)

Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep Learning
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures
 
Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churn
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient Data
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Machine Learning Recommendations with Spark
Machine Learning Recommendations with SparkMachine Learning Recommendations with Spark
Machine Learning Recommendations with Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 

Recently uploaded

UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 

Recently uploaded (20)

UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 

Predicting Flight Delays with Spark Machine Learning

  • 2. 2 © 2018 MapR Technologies, Inc. Agenda •  Introduction to Machine Learning Classification •  Explore the Flight Dataset with Spark DataFrames •  Use Random Forest to Predict Flight Delays 2
  • 4. 4 © 2018 MapR Technologies, Inc. What is AI? Image reference NVIDIA The definition of AI is continuously redefined
  • 5. 5 © 2018 MapR Technologies, Inc. AI Late 80s: Expert System rules engine The definition of AI is continuously redefined : 1980s:
  • 6. 6 © 2018 MapR Technologies, Inc. Problems with hard coded Rules Rules are manual, uses a human expert –  difficult to maintain –  give a one size fits all decision! (2 times overdose same as 38 times) Machine learning uses data and statistics –  can give finer grained predictions
  • 7. 7 © 2018 MapR Technologies, Inc. What is Machine Learning? Data Build ModelTrain Algorithm Finds patterns New Data Use Model (prediction function) Predictions Contains patterns Recognizes patterns
  • 8. 8 © 2018 MapR Technologies, Inc. What is Supervised Machine Learning? Machine Learning Unsupervised •  Clustering •  Collaborative Filtering •  Frequent Pattern Mining Supervised •  Classification •  Regression
  • 9. 9 © 2018 MapR Technologies, Inc. Supervised Algorithms use labeled data Data features Build Model New Data features Predict Use Model X1, X2 Y f(X1, X2) =Y X1, X2 Y
  • 10. 10 © 2018 MapR Technologies, Inc. Supervised Machine Learning: Classification & Regression Classification Identifies category for item
  • 11. 11 © 2018 MapR Technologies, Inc. Classification: Definition Form of ML that: •  Identifies which category an item belongs to •  Uses supervised learning algorithms –  Data is labeled Sentiment
  • 12. 12 © 2018 MapR Technologies, Inc. If it Walks/Swims/Quacks Like a Duck …… Then It Must Be a Duck swims walks quacks Features: walks quacks swims Features:
  • 13. 13 © 2018 MapR Technologies, Inc. Credit Card Fraud Example •  What are we trying to predict? –  This is the Label: –  The probability of Fraud •  What are the “if questions” or properties we can use to predict? –  These are the Features: –  Amount $$ 24 hours > historical use ? Credit Card Transaction Features Number of Transactions last 24 hours Total $ Amount last 24 hours Average Amount last 24 hours Average Amount last 24 hours compared to historical use Location and Time difference since Last Transaction Average transaction fraud risk of merchant type Merchant types for day compared to historical use Features derived From Transaction History
  • 14. 14 © 2018 MapR Technologies, Inc. Label Probabilty of Fraud 1 X Features: trans amount, Time Location difference last trans. Fraud 0 Not Fraud .5 Credit Card Fraud Logistic Regression Example
  • 15. 15 © 2018 MapR Technologies, Inc. Supervised Algorithms use labeled data Data features Build Model New Data features Predict Use Model (logistic regression function) X trans amount….. Y y = 1 / 1+ exp -(a + bx) X trans amount….. Y Model= logistic regression function
  • 16. 16 © 2018 MapR Technologies, Inc. House Price Prediction Example •  What are we trying to predict? –  This is the Label or Target outcome: –  The house price $ •  What are the “if questions” or properties we can use to predict? –  These are the Features: –  The size of the house (square meters)
  • 17. 17 © 2018 MapR Technologies, Inc. Label: House Price Y X Feature: house size (square meters) Data point: price, size House Price = intercept + coefficient * house size y = a + bx House Price Regression Example
  • 18. 18 © 2018 MapR Technologies, Inc. Supervised Algorithms use labeled data Data features Build Model New Data features Predict Use Model: slope a, coefficient b X size Y y = a + bx X Y
  • 19. 19 © 2018 MapR Technologies, Inc. Steps to Build a Machine Learning Solution 1 Identify Problem 2 Get/ Prepare Data 3 Model Data 4 Get Insight 5 Deploy Model 6 Evaluate / Track Performance Machine Learning
  • 20. 20 © 2018 MapR Technologies, Inc. Supervised Learning: Classification & Regression •  Classification: –  identifies which category (eg fraud or not fraud) •  Linear Regression: –  predicts a value (eg amount of fraud)
  • 21. 21 © 2018 MapR Technologies, Inc. Examples •  Retail Example: –  Predict price, sales •  Telecom: –  Predict customer will churn •  Healthcare Example: –  Probability of readmission •  Marketing –  Predict probability customer will click on add
  • 23. 23 © 2018 MapR Technologies, Inc. How a Spark Application Runs on a Cluster
  • 24. 24 © 2018 MapR Technologies, Inc. Spark Distributed Datasets partitioned •  Read only collection of typed objects Dataset[T] •  Partitioned across a cluster •  Operated on in parallel •  in memory can be Cached
  • 25. 25 © 2018 MapR Technologies, Inc. Dataset merged with Dataframe •  in Spark 2.0, DataFrame APIs merged with Datasets APIs •  A Dataset is a collection of typed objects (SQL and functions) •  Dataset[T] •  A DataFrame is a Dataset of generic Row objects (SQL) •  Dataset[Row]
  • 26. 26 © 2018 MapR Technologies, Inc. { "_id": ”ATL_LGA_2017-01-01_AA_1678", "dofW": 7, "carrier": "AA", "origin": "ATL", "dest": "LGA", "crsdephour": 17, "crsdeptime": 1700, "depdelay": 0.0, "crsarrtime": 1912, "arrdelay": 0.0, "crselapsedtime": 132.0, "dist": 762.0 } Flight Dataset
  • 27. 27 © 2018 MapR Technologies, Inc. Loading a Dataset
  • 28. 28 © 2018 MapR Technologies, Inc. Load the data into a Dataset: Define the Schema
  • 29. 29 © 2018 MapR Technologies, Inc. Dataset Load data Load the data into a Dataset val df: Dataset[Flight] = spark.read.option("inferSchema", "false") .schema(schema).json(”/p/file.json").as[Flight] columns row
  • 30. 30 © 2018 MapR Technologies, Inc. Spark Distributed Datasets Transformations create a new Dataset from the current one, Lazily evaluated Actions return a value to the driver
  • 31. 31 © 2018 MapR Technologies, Inc. Dataset Load data Action on a Dataset df.take(1) result: Array[Flight] = Array(Flight(ATL_LGA_2017-01-01_17_AA_1678, 7,AA,ATL,LGA,17,1700.0,0.0,1912.0,0.0,132.0,762.0)) action
  • 32. 32 © 2018 MapR Technologies, Inc. DataFrame Transformations L I Query Description agg(expr, exprs) Aggregates on entire DataFrame distinct Returns new DataFrame with unique rows except(other) Returns new DataFrame with rows from this DataFrame not in other DataFrame filter(expr); where(condition) Filter based on the SQL expression or condition groupBy(cols: Columns) Groups DataFrame using specified columns join (DataFrame, joinExpr) Joins with another DataFrame using given join expression sort(sortcol) Returns new DataFrame sorted by specified column select(col) Selects set of columns
  • 33. 33 © 2018 MapR Technologies, Inc. DataFrame Actions Action Description collect() Returns array containing all rows in DataFrame count() Returns number of rows in DataFrame describe(cols:String*) Computes statistics for numeric columns (count, mean, stddev, min and max) first() Returns the first row head() Returns the first row show() Displays first 20 rows take(n:int) Returns the first n rows
  • 34. 34 © 2018 MapR Technologies, Inc. Example loading a Dataset val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight] df.cache df.count Worker Worker Worker Driver Block 1 Block 2 Block 3
  • 35. 35 © 2018 MapR Technologies, Inc. Example: Worker Worker Worker Block 1 Block 2 Block 3 Driver tasks tasks tasks
  • 36. 36 © 2018 MapR Technologies, Inc. Example Worker Worker Worker Block 1 Block 2 Block 3 Driver Read HDFS Block Read HDFS Block Read HDFS Block
  • 37. 37 © 2018 MapR Technologies, Inc. Example Worker Worker Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data
  • 38. 38 © 2018 MapR Technologies, Inc. Example: Worker Worker Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 results results results val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight] df.cache df.count res9: Long = 41348
  • 39. 39 © 2018 MapR Technologies, Inc. Example Worker Worker Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight] df.cache df.count df.show
  • 40. 40 © 2018 MapR Technologies, Inc. Example Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 tasks tasks tasks Driver
  • 41. 41 © 2018 MapR Technologies, Inc. Example: Log Mining Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Driver Process from Cache Process from Cache Process from Cache Cached, does not have to read from file again val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight] df.cache df.count df.show
  • 42. 42 © 2018 MapR Technologies, Inc. Example: Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Driver results results results val df: Dataset[Flight] = spark.read.json(”/p/file.json").as[Flight] df.cache df.count df.show
  • 43. 43 © 2018 MapR Technologies, Inc. Example: Worker Worker Worker Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 Driver Cache your data è Faster Results val df: Dataset[Uber] = spark.read.option("inferSchema", "false").schema(schema).csv(“data/uber.csv").as[Uber] df.cache df.count df.show
  • 44. 44 © 2018 MapR Technologies, Inc. Spark Use Cases Iterative Algorithms on large amounts of data Some Algorithms that need iterations •  Clustering (K-Means) •  Linear Regression •  Graph Algorithms (e.g., PageRank) •  Alternating Least Squares ALS Some Example Use Cases: •  Anomaly detection •  Classification •  Recommendations
  • 45. 45 © 2018 MapR Technologies, Inc. Destinations with highest number of Departure Delays df.filter($"depdelay" > 40).groupBy(”dest”) .count().orderBy(desc(count)).show(10)
  • 46. 46 © 2018 MapR Technologies, Inc. Top 5 Longest Departure Delays df.select($"carrier",$"origin",$"dest”,$"depdelay”, $"crsdephour”).filter($"depdelay" > 40) .orderBy(desc( "depdelay" )).show(5) +-------+------+----+--------+----------+ |carrier|origin|dest|depdelay|crsdephour| +-------+------+----+--------+----------+ | AA| SFO| ORD| 1440.0| 8| | DL| BOS| ATL| 1185.0| 17| | UA| DEN| EWR| 1138.0| 12| | DL| ORD| ATL| 1087.0| 19| | UA| MIA| EWR| 1072.0| 20| +-------+------+----+--------+----------+
  • 47. 47 © 2018 MapR Technologies, Inc. Register Dataframe as a Temporary View
  • 48. 48 © 2018 MapR Technologies, Inc. Average Departure Delay by Carrier %sql select carrier, avg(depdelay) from flights group by carrier
  • 49. 49 © 2018 MapR Technologies, Inc. Count of Departure Delays by Carrier %sql select carrier, count(depdelay) from flights where depdelay > 40 group by carrier
  • 50. 50 © 2018 MapR Technologies, Inc. Count of Departure Delays by Day of the Week %sql select dofw, count(depdelay) from flights where depdelay > 40 group by dofw Monday= 1 Sunday = 7
  • 51. 51 © 2018 MapR Technologies, Inc. Count of Departure Delays by Hour of the Day %sql select crsdephour, count(depdelay) from flights where depdelay > 40 group by crsdephour
  • 52. 52 © 2018 MapR Technologies, Inc. Count of Departure Delays by Origin %sql select origin, count(depdelay) from flights where depdelay > 40 group by origin
  • 53. 53 © 2018 MapR Technologies, Inc. Count of Departure Delays by Origin, Destination %sql select origin,dest count(depdelay) from flights where depdelay > 40 group by origin,dest
  • 54. 54 © 2018 MapR Technologies, Inc. Bucket Dataset for Delayed not Delayed val delaybucketizer = new Bucketizer() .setInputCol("depdelay”) .setOutputCol("delayed") .setSplits(Array( 0.0, 40.0))   val df2 = delaybucketizer.transform(df1)   
  • 55. 55 © 2018 MapR Technologies, Inc. Count of Departure Delays by Origin, Destination %sql select origin,dest count(depdelay) from flights where depdelay > 40 group by origin,dest
  • 56. 56 © 2018 MapR Technologies, Inc. Bucket Dataset for Delayed not Delayed  df2.groupBy("delayed").count.show result: +-------+-----+ |delayed|count| +-------+-----+ | 0.0|36790| | 1.0| 4558| +-------+-----+
  • 57. 57 © 2018 MapR Technologies, Inc. Stratify val fractions = Map(0.0 -> .13, 1.0 -> 1.0) Val train = df.stat.sampleBy("delayed", fractions, 36L)   train.groupBy("delayed").count.show   result:   +-------+-----+ |delayed|count| +-------+-----+ | 0.0| 4766| | 1.0| 4558| +-------+-----+   
  • 59. 59 © 2018 MapR Technologies, Inc. ML Discovery Model Building Model Training/ Building Training Set Test Model Predictions Test Set Evaluate Results Archived Data Deployed Model Insights Data Discovery, Model Creation Production Feature Extraction Feature Extraction Flight data Stream InputFlight Data New Data
  • 60. 60 © 2018 MapR Technologies, Inc. Flight Delay Example •  What are we trying to predict? –  This is the Label or Target outcome: –  Delayed = 1 if delay > 40 minutes •  What are the “if questions” or properties we can use to predict? •  These are the Features: –  Day of the week, scheduled departure time, scheduled arrival time, Airline carrier, scheduled travel time, Origin, Destination, Distance featureslabel
  • 61. 61 © 2018 MapR Technologies, Inc. Decision Tree for Classification •  Tree of decisions about features •  IF THEN ELSE questions using features at each tree node •  Answers branch to child nodes
  • 62. 62 © 2018 MapR Technologies, Inc. Decision Tree •  Q1: If the scheduled departure time < 10:15 AM –  T:Q2: If the originating airport is in the set {ORD, ATL,SFO} §  T:Q3: If the day of the week is in {Monday, Sunday} •  T:Delayed=1 •  F: Delayed=0 §  F: Q3: If the destination airport is in the set {SFO,ORD,EWR} •  T: Delayed=1 •  F: Delayed=0 –  F :Q2: If the originating airport is not in the set {BOS . MIA} §  T:Q3: If the day of the week is in the set {Monday , Sunday} •  T: Delayed=1 •  F: Delayed=0 §  F: Q3: If the destination airport is not in the set {BOS . MIA} •  T: Delayed=1 •  F: Delayed=0
  • 63. 63 © 2018 MapR Technologies, Inc. Random Forest
  • 64. 64 © 2018 MapR Technologies, Inc. Featurization, Training and Model Selection Image reference O’Reilly Learning Spark + + ̶+ ̶ ̶ Feature Vectors Model Featurization Training Model Evaluation Best Model Training Data + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ Feature Vectors are vectors of numbers representing the value for each feature
  • 65. 65 © 2018 MapR Technologies, Inc. ML Discovery Model Building Model Training/ Building Training Set Test Model Predictions Test Set Evaluate Results Archived Data Deployed Model Insights Data Discovery, Model Creation Production Feature Extraction Feature Extraction Flight data Stream InputFlight Data New Data
  • 66. 66 © 2018 MapR Technologies, Inc. Spark ML workflow fit transform transform
  • 67. 67 © 2018 MapR Technologies, Inc. Spark ML workflow with a Pipeline A pipeline estimator fit returns a pipeline model fit transform
  • 68. 68 © 2018 MapR Technologies, Inc. Extract the Features Image reference O’Reilly Learning Spark + + ̶+ ̶ ̶ Feature Vectors Model Featurization Training Model Evaluation Best Model Training Data + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ Feature Vectors are vectors of numbers representing the value for each feature
  • 69. 69 © 2018 MapR Technologies, Inc. Set up Transformer Estimator Pipeline A Pipeline chains multiple Transformers and Estimators together in a ML workflow
  • 70. 70 © 2018 MapR Technologies, Inc. StringIndexer Transformer maps Strings to Numbers val indexer = new StringIndexer() .setInputCol(”carrier") .setOutputCol(”carrierIndex”) Val df2=indexer.transform(df) Transformer: •  transform() method converts one DataFrame into another, by appending columns
  • 71. 71 © 2018 MapR Technologies, Inc. Create StringIndexers for all Categorical features // column names for categorical feature columns val categoricalColumns = Array("carrier", "origin", "dest", "dofW", "orig_dest")   // create stringindexers for all string columns val stringIndexers = categoricalColumns.map{ colName => new StringIndexer() .setInputCol(colName) .setOutputCol(colName + "Indexed") .fit(strain) }
  • 72. 72 © 2018 MapR Technologies, Inc. Use Bucketizer Transformer to add a label 0 or 1 // add a label column based on departure delay val labeler = new Bucketizer().setInputCol("depdelay") .setOutputCol("label") .setSplits(Array(0.0, 40.0, Double.PositiveInfinity)
  • 73. 73 © 2018 MapR Technologies, Inc. Use Vector Assembler to add features val featureCols = Array("carrierIndexed", "destIndexed”,"originIndexed", "dofWIndexed", "orig_destIndexed”,"crsdephour", "crsdeptime”,"crsarrtime”, "crselapsedtime", "dist") // combines a list of feature columns into a vector column val assembler = new VectorAssembler() .setInputCols(featureCols) .setOutputCol("features")
  • 74. 74 © 2018 MapR Technologies, Inc. Create RandomForest Estimator to train our model val rf = new RandomForestClassifier() .setLabelCol("label") .setFeaturesCol("features")
  • 75. 75 © 2018 MapR Technologies, Inc. Put Feature Transformers and Estimator in Pipeline val steps = stringIndexers ++ Array(labeler, assembler, rf)   val pipeline = new Pipeline().setStages(steps) Estimator pipeline fit() method trains on a dataset and produces a model fit
  • 76. 76 © 2018 MapR Technologies, Inc. Evaluate the Predictions from the model val areaUnderROC = evaluator.evaluate(predictions) println("areaUnderROC", areaUnderROC)
  • 77. 77 © 2018 MapR Technologies, Inc. K-fold Cross-Validation Process Data Model Training/ Building Training Set Test Model Predictions Test Set data is randomly split into K partition training and test dataset pairs
  • 78. 78 © 2018 MapR Technologies, Inc. K-fold Cross-Validation Process Data Model Training Training Set Test Model Predictions Test Set Train algorithm with training dataset
  • 79. 79 © 2018 MapR Technologies, Inc. ML Cross-Validation Process Data Model Training Set Test Model Predictions Test Set Evaluate the model with the Test Set
  • 80. 80 © 2018 MapR Technologies, Inc. K-fold Cross-Validation Process Data Model Training/ Building Training Set Test Model Predictions Test Set Train/Test loop K times Repeat K times select the Model produced by the best-performing set of parameters
  • 81. 81 © 2018 MapR Technologies, Inc. Cross Validation transformation estimation pipeline Set up a CrossValidator with: •  Parameter grid, Estimator (pipeline), Evaluator Perform grid search based model selection
  • 82. 82 © 2018 MapR Technologies, Inc. Parameter Tuning with CrossValidator with a Paramgrid CrossValidator •  Given: –  Estimator –  Parameter grid –  Evaluator •  Find best parameters and model val paramGrid = new ParamGridBuilder() .addGrid(rf.maxBins, Array(100, 200)) .addGrid(rf.maxDepth, Array(2, 4, 10)) .addGrid(rf.numTrees, Array(5, 20)) .addGrid(rf.impurity, Array("entropy", "gini")) .build() val evaluator= new BinaryClassificationEvaluator() .setLabelCol("label") .setRawPredictionCol("prediction") val crossval = new CrossValidator() .setEstimator(pipeline) .setEvaluator(evaluator) .setEstimatorParamMaps(paramGrid) .setNumFolds(3)
  • 83. 83 © 2018 MapR Technologies, Inc. Crossvalidator fit a Model val pipelineModel = crossvalidator.fit(train)
  • 84. 84 © 2018 MapR Technologies, Inc. Random Forest Feature Importances val featureImportances = pipelineModel.bestModel.asInstanceOf[PipelineModel] .stages(stringIndexers.size + 2) .asInstanceOf[RandomForestClassificationModel].featureImportances assembler.getInputCols.zip(featureImportances.toArray).sortBy(-_._2).foreach { case (feat, imp) => println(s"feature: $feat, importance: $imp") } result: feature: crsdeptime, importance: 0.2954321019748874 feature: orig_destIndexed, importance: 0.21540676913162476 feature: crsarrtime, importance: 0.1594826730807351 feature: crsdephour, importance: 0.11232750835024508 feature: destIndexed, importance: 0.07068851952515658 feature: carrierIndexed, importance: 0.03737067561393635 feature: dist, importance: 0.03675114205144413 feature: dofWIndexed, importance: 0.030118527912782744 feature: originIndexed, importance: 0.022401521272697823 feature: crselapsedtime, importance: 0.020020561086490113
  • 85. 85 © 2018 MapR Technologies, Inc. Get Test Predictions
  • 86. 86 © 2018 MapR Technologies, Inc. Get Predictions on Test Data val predictions = pipelineModel.transform(test)
  • 87. 87 © 2018 MapR Technologies, Inc. Evaluate the Predictions from the model val areaUnderROC = evaluator.evaluate(predictions)
  • 88. 88 © 2018 MapR Technologies, Inc. Area under the ROC curve Accuracy is measured by the area under the ROC curve. The area measures correct classifications •  An area of 1 represents a perfect test •  an area of .5 represents a worthless test
  • 89. 89 © 2018 MapR Technologies, Inc. Calculate Metrics val lp = predictions.select( "label", "prediction") val counttotal = predictions.count() val correct = lp.filter($"label" === $"prediction").count() val wrong = lp.filter(not($"label" === $"prediction")).count() val truep = lp.filter($"prediction" === 0.0).filter($"label" === $"prediction").count() / counttotal.toDouble val truen = lp.filter($"prediction" === 1.0).filter($"label" === $"prediction").count()/ counttotal.toDouble val falsen = lp.filter($"prediction" === 0.0).filter(not($"label" === $"prediction")).count()/ counttotal.toDouble val falsep = lp.filter($"prediction" === 1.0).filter(not($"label" === $"prediction")).count()/ counttotal.toDouble val ratioWrong=wrong.toDouble/counttotal.toDouble val ratioCorrect=correct.toDouble/counttotal.toDouble
  • 90. 90 © 2018 MapR Technologies, Inc. Calculate Metrics ratio Correct0.6129679791041889 ratio wrong 0.3870320208958112 true positive 0.537994582567476 true negative 0.07497339653671278 false negative 0.035261681338879754 false positive 0.35177033955693143
  • 91. 91 © 2018 MapR Technologies, Inc. Precision Grid prediction/label Actual Not Delayed Actual Delayed Predicted Not Delayed .53 (TP) .035 (FN) Predicted Delayed .35 (FP) .074 (TN)
  • 92. 92 © 2018 MapR Technologies, Inc. Save the Model pipelineModel.write.overwrite().save(”/pathl") Later val sameModel = CrossValidatorModel.load(“/path")
  • 93. 93 © 2018 MapR Technologies, Inc. Serve DataCollect DataData Sources Flight Data Stream Processing process Batch Processing Model build model update model Machine-learning Models Stream Topic Streaming Flight Data Kafka API Stream Topic Stream Topic End to End Application Architecture MapR-XD
  • 94. 94 © 2018 MapR Technologies, Inc. Simplified Pipeline: Real-Time Flight Delay Predictions Flight data Stream Input Stream Scores Spark Streaming Spark Streaming SQL Machine Lerning Model Flight Delay predictions Results Data Collect Process Store Analyze
  • 95. 95 © 2018 MapR Technologies, Inc. Machine Learning Pipeline for Flight Delays Input Data + Actual Delay Input Data + Predictions Consumer withML Model 2 Consumer withML Model 1 Decoy results Consumer Consumer withML Model 3 Consumer Stream Archive Stream Scores Stream Input SQL SQL Real time Flight Data Stream Input Actual Delay Input Data + Predictions + Actual Delay Real Time dashboard + Historical Analysis
  • 96. 96 © 2018 MapR Technologies, Inc. New Spark Ebook Link to Code for this webinar is in appendix of this Book. https://mapr.com/ebooks/
  • 97. 97 © 2018 MapR Technologies, Inc.
  • 98. 98 © 2018 MapR Technologies, Inc. To Learn More: New Spark 2.0 training •  MapR Free ODT http://learn.mapr.com/
  • 99. 99 © 2018 MapR Technologies, Inc. MapR Blog HTTPS://MAPR.COM/BLOG/
  • 100. 100 © 2018 MapR Technologies, Inc. Q&A ENGAGE WITH US