SlideShare a Scribd company logo
1 of 94
Download to read offline
Analyzing	Flight	Delays	with	Apache	Spark	
GraphFrames	and	MapR-DB
2 © 2018 MapR Technologies, Inc. // MapR Confidential
Agenda	
•  Introduction	to	Graphs		
•  Introduction	to	GraphFrames	with	a	simple	Flight	Dataset	
•  Use	GraphFrames	with	Flight	Dataset	for	2018	
2
Intro	to	Graphs
4 © 2018 MapR Technologies, Inc. // MapR Confidential
•  Graph:	Models	Relations	between	Objects	
•  Graph:	Vertices	connected	by	Edges	
•  Vertices:	the	objects	
•  Edges:	the	relationships	between	Vertices	
What	is	a	Graph?
5 © 2018 MapR Technologies, Inc. // MapR Confidential
Regular	graph:	each	vertex	has	the	same	
number	of	edges	
Example:	Facebook	friends	
– Ted	is	a	friend	of	Carol	
– Carol	is	a	friend	of	Ted	
Regular	Graphs	vs	Directed	Graphs
6 © 2018 MapR Technologies, Inc. // MapR Confidential
Directed	graph:	edges	have	a	direction	
Example:	Twitter	followers	
– Carol	follows	Oprah	
– Oprah	does	not	follow	Carol	
Regular	Graphs	vs	Directed	Graphs
7 © 2018 MapR Technologies, Inc. // MapR Confidential
Property	Graph:		
•  Edges	and	Vertexes	have	properties	
•  Vertex	can	have	multiple	directed	
edges	in	parallel	
•  Allows	multiple	relationships	
Spark	GraphX	supports	a	distributed	
property	graph.		
Property	Graph	 Properties:	
City,State	
Properties:	
Flight	number,	
Distance,	
Delay
8 © 2018 MapR Technologies, Inc. // MapR Confidential
What	is	GraphX?	
Spark SQL
•  Structured Data
•  Querying with
SQL/HQL
•  DataFrames
Spark Streaming
•  Processing of live
streams
•  Micro-batching
MLlib
•  Machine Learning
•  Multiple types of
ML algorithms
GraphX
•  Graph processing
•  Graph parallel
computations
•  Task scheduling
•  Memory management
•  Fault recovery
•  Interacting with storage systems
Spark Core
Graph	Algorithms	and		Graph	
Queries	with	GraphFrames
10 © 2018 MapR Technologies, Inc. // MapR Confidential
Web	Sites	
•  Vertices	=	Web	Pages	
•  Edges	=	Links	between	Pages	
•  PageRank	Importance	=		
•  Iterative	Number	of	Links	to	a	
page	and	it’s	linking	pages	
•  Twitter	Example:		who	has	the	most	
twitter	followers	
	
Graph	Algorithms:	PageRank	
Vertex=	
Web	Page	
Edge=	
Link	
Importance	depends	
on	Number	and	Rank	
of	linking	pages
11 © 2018 MapR Technologies, Inc. // MapR Confidential
Visualize	PageRank		
1.  Each	page	sends	message	function	
with	it’s	“rank”	to	neighbors	
Graph	Algorithms:	PageRank	
0.20	 0.20	
0.20	
0.20	 0.20	
Message	function	
Sent	from	each	
vertex
12 © 2018 MapR Technologies, Inc. // MapR Confidential
Visualize	PageRank		
1.  Each	page	sends	message	function	
with	it’s	“rank”	to	neighbors	
2.  Messages	are	Aggregated	and	
Calculated	at	each	destination	vertex	
3.  Sum	of	messages	becomes	new	
vertex	Page	rank	
4.  Repeat		
Graph	Algorithms:	PageRank	 Messages	Aggregated		
and		
Calculated	at	each	Vertex
13 © 2018 MapR Technologies, Inc. // MapR Confidential
•  Many	Graph	Algorithms	Aggregate	properties	of	neighbors:	
•  PageRank	
•  Connected	Components	
•  Shortest	Path	
Graph	Algorithms	
Connected	Components	
Reference	https://en.wikipedia.org/	
Shortest	Path	A	to	F
14 © 2018 MapR Technologies, Inc. // MapR Confidential
Graph	Motif:	recurrent	patterns	in	a	graph	
Graph	Motif	Query:	Search	a	graph	for	
occurrences	of	a	given	a	pattern	
Twitter	Example:		
Who	should	we	recommend	for	Carol	to	Follow?	
•  Carol	follows	Oprah		
•  Oprah	follows	Reese	Witherspoon	
•  Recommend	Carol	to	follow	Reese	
Graph	Motif	Queries	
Reese
WitherspoonCarol
follows
Oprah
follows
recommend?
15 © 2018 MapR Technologies, Inc. // MapR Confidential
Graph	Motif:	recurrent	patterns	in	a	graph	
Graph	Motif	Query:	Search	a	graph	for	
occurrences	of	a	given	a	pattern	
Twitter	Example:		
Recommend	who	to	Follow?	Search	for	patterns	
•  A	follows	B	
•  B	follows	C	
•  A	does	not	follow	C	
Graph	Motif	Queries	
A
follows
B
follows
recommend
C
16 © 2018 MapR Technologies, Inc. // MapR Confidential
Twitter:	A	follows	B;	B	follows	C;	A	doesn’t	follow	C
graph.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
Graph	Query:	Motif	Find	Structural	Pattern	
Edge	[	]	
(c)
Vertex	(	)	
(a)
!(a)-[]->(c)
a doesn’t follow c
(b)-[]->(c)
b follows c
(a)-[]->(b)
a follows b
(b)
Search for a
pattern
17 © 2018 MapR Technologies, Inc. // MapR Confidential
Separate	Systems	
Image	reference	Spark	Summit
18 © 2018 MapR Technologies, Inc. // MapR Confidential
GraphFrames:	Graph	Algorithms	+	Graph	Queries	
Image	reference	Spark	Summit
Graph	Examples
20 © 2018 MapR Technologies, Inc. // MapR Confidential
Twitter	Tweets:	
morally	outraged	tweets	retweeted	within	
political	sphere	
But	rarely	outside	sphere	
	
	
Real	World	Graphs:	Twitter	
Reference	National	Academy	of	Sciences
21 © 2018 MapR Technologies, Inc. // MapR Confidential
Recommendation	Engine:	
•  Vertices	=	Users,	Products	
•  Edges	=	Ratings	or	Purchases	
•  Calculate	how	similar	users	rated	
similar	products	
	
Graph:	Recommendation	Engines
22 © 2018 MapR Technologies, Inc. // MapR Confidential
Healthcare	Fraud:	
•  Vertices	=	Doctors,	Patients,	
Prescriptions	
•  Edges	=	prescribed	
•  Calculate	Narcotic	Abuse,	Patient	
Similarity,	Over	prescribing	
	
Real	World	Graphs:	Fraud	
Prescribed	
Prescribed	
Prescribed
23 © 2018 MapR Technologies, Inc. // MapR Confidential
Credit	Card	Aplication	Fraud:	
•  Vertices	=	Credit	Card	Applicant,	
Phone,	email,	address,	ssn	
•  Edges	=	Identifier	
•  Detect	People	sharing	identifiers	such	
as	telephone	number	
	
Real	World	Graphs:	Fraud	
Shared	
Identifier	
Phone	number	
	
Image	reference	Capitol	One	at	Spark	Summit
A	Simple	Flight	Example	with	GraphFrames
25 © 2018 MapR Technologies, Inc. // MapR Confidential
Simple	Flight	Example	with	GraphFrames 		
Originating	
Airport	
Destination	
Airport	
Distance	 Delay		
SFO	 ORD	 1800	miles	 40	
ORD	 DFW	 800	miles	 0	
DFW	 SFO	 1400	miles	 10
26 © 2018 MapR Technologies, Inc. // MapR Confidential
Vertex	Table
27 © 2018 MapR Technologies, Inc. // MapR Confidential
Edges	Table
28 © 2018 MapR Technologies, Inc. // MapR Confidential
case class Airport(id: String, city: String)  
val airports=Array(Airport("SFO","San Francisco"),
Airport("ORD","Chicago"), Airport("DFW","Dallas Fort Worth"))
 
val vertices = spark.createDataset(airports).toDF
vertices.show
+---+-----------------+
| id| city|
+---+-----------------+
|SFO| San Francisco|
|ORD| Chicago|
|DFW|Dallas Fort Worth|
+---+-----------------+
Create	a	Vertices	DataFrame	
Id	 City	
SFO	 San	Francisco	
ORD	 Chicago	
DFW	 Dallas
29 © 2018 MapR Technologies, Inc. // MapR Confidential
case class Flight(id: String, src: String, dst: String,
dist: Double, delay: Double)
val flights=Array(
Flight("SFO_ORD_2017-01-01_AA”,"SFO”,"ORD”,1800, 40),
Flight("ORD_DFW_2017-01-01_UA","ORD","DFW",800, 0),
Flight("DFW_SFO_2017-01-01_DL","DFW","SFO",1400, 10))
val edges = spark.createDataset(flights).toDF
edges.show
+--------------------+---+---+------+-----+
| id|src|dst| dist|delay|
+--------------------+---+---+------+-----+
|SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
|ORD_DFW_2017-01-0...|ORD|DFW| 800.0| 0.0|
|DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
+--------------------+---+---+------+-----+
Create	an	Edges	DataFrame
30 © 2018 MapR Technologies, Inc. // MapR Confidential
val graph = GraphFrame(vertices, edges)
graph.vertices.show
 
+---+-----------------+
| id| name|
+---+-----------------+
|SFO| San Francisco|
|ORD| Chicago|
|DFW|Dallas Fort Worth|
+---+-----------------+
Create	the	GraphFrame
31 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.edges.show
 
result:
+--------------------+---+---+------+-----+
| id|src|dst| dist|delay|
+--------------------+---+---+------+-----+
|SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
|ORD_DFW_2017-01-0...|ORD|DFW| 800.0| 0.0|
|DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
+--------------------+---+---+------+-----+
GraphFrame	Edges
32 © 2018 MapR Technologies, Inc. // MapR Confidential
To	answer	questions	such	as:	
How	many	airports	are	there?	
How	many	flight	routes	are	there?	
What	are	the	longest	distance	routes?	
Which	airport	has	the	most	incoming	flights?	
What	are	the	top	10	flights?	
	
	
Graph	Operators
33 © 2018 MapR Technologies, Inc. // MapR Confidential
// How many airports?
graph.vertices.count
 
result: = 3
// How many flights?
graph.edges.count
 
result: = 3
Query	the	GraphFrame
34 © 2018 MapR Technologies, Inc. // MapR Confidential
// flight routes > 800 miles distance?
graph.edges.filter("dist > 800").show
+--------------------+---+---+------+-----+
| id|src|dst| dist|delay|
+--------------------+---+---+------+-----+
|SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0|
|DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0|
+--------------------+---+---+------+-----+
 
Query	the	GraphFrame
Loading	and	Exploring	the	MapR-DB	Flight	
Table	with	DataFrames
36 © 2018 MapR Technologies, Inc. // MapR Confidential
How	a	Spark	Application	Runs	on	a	Cluster
37 © 2018 MapR Technologies, Inc. // MapR Confidential
•  A	Dataset	is	a	collection	of	Typed	Objects	
•  Dataset[T]		
•  (can	use	SQL	and	functions)	
•  A	DataFrame	is	a	Dataset	of	Row	objects			
•  Dataset[Row]	
•  (can	use	SQL)	
•  Partitioned	across	a	cluster	
•  Operated	on	in	parallel	
•  can	be	Cached	
	
Spark	Distributed	Datasets	
partitioned
38 © 2018 MapR Technologies, Inc. // MapR Confidential
•  Spark SQL queries and updates to MapR-DB
•  With projection and filter pushdown, custom partitioning, and data locality
	
Spark	SQL	Querying	MapR-DB	JSON
39 © 2018 MapR Technologies, Inc. // MapR Confidential
Designed	for	Partitioning	and	Scaling	
Data is automatically partitioned
and sorted by id row key!
40 © 2018 MapR Technologies, Inc. // MapR Confidential
Spark	MapR-DB	Connector
41 © 2018 MapR Technologies, Inc. // MapR Confidential
{
“id": ”ATL_LGA_2017-01-01_AA_1678",
"dofW": 7,
"carrier": "AA",
”src": "ATL",
”dst": "LGA",
"crsdephour": 17,
"crsdeptime": 1700,
"depdelay": 0.0,
"crsarrtime": 1912,
"arrdelay": 0.0,
"crselapsedtime": 132.0,
"dist": 762.0
}
Flight Dataset
Table is automatically partitioned
and sorted by id row key!
42 © 2018 MapR Technologies, Inc. // MapR Confidential
MapR-DB	JSON	Document	Store	
Data is automatically partitioned
and sorted by id row key!
{
“id": ”ATL_LGA_2017-01-01_AA_1678",
"dofW": 7,
"carrier": "AA",
”src": "ATL",
”dst": "LGA",
"crsdephour": 17,
"crsdeptime": 1700,
"depdelay": 0.0,
"crsarrtime": 1912,
"arrdelay": 0.0,
"crselapsedtime": 132.0,
"dist": 762.0
}
43 © 2018 MapR Technologies, Inc. // MapR Confidential
Row	key	=	Table	is	Partitioned	by	src,dst	vertexes	
Data is automatically partitioned by key
range and sorted = src_dst
ATL_LGA_2017-01-01_AA_1678!
44 © 2018 MapR Technologies, Inc. // MapR Confidential
SFO
DEN
IAH
ATL
ORD
BOS
LGA
EWR
MIA
SEA	
LAX	
DFW	
Airports
45 © 2018 MapR Technologies, Inc. // MapR Confidential
Load	the	data	into	a	Dataset:	Define	the	Schema
46 © 2018 MapR Technologies, Inc. // MapR Confidential
var tableName = "/user/mapr/flighttable”
val df = spark.sparkSession
.loadFromMapRDB[Flight](tableName, schema)
Read	Dataset	from	MapR-DB	
Worker	
Task	
Worker	
Driver	
Cache	1	
Cache	2	
Cache	3	
Process
& Cache
Data
Process
& Cache
Data
Process
& Cache
Data
Task	
Task	
Driver	
tasks
tasks
tasks
47 © 2018 MapR Technologies, Inc. // MapR Confidential
df.show(5)
Show	the	first	rows	of	the	DataFrame	
columns
row
Data is automatically partitioned and
sorted by row key = src dst
ATL_BOS_2018-01-01_AA_1678!
48 © 2018 MapR Technologies, Inc. // MapR Confidential
df.filter($"depdelay" > 40).groupBy(”src”)
.count().orderBy(desc(“count”)).show(5)
+---+-----+
|src|count|
+---+-----+
|ORD| 4033|
|ATL| 3106|
|DFW| 2782|
|EWR| 2328|
|DEN| 2304|
+---+-----+
Originating	airports	with	highest	number	of	Departure	Delays
49 © 2018 MapR Technologies, Inc. // MapR Confidential
df.filter($"depdelay" > 40).groupBy("src")
.count.orderBy(desc("count" )).explain
== Physical Plan ==
*(3) Sort [count#549L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#549L DESC NULLS LAST, 200)
+- *(2) HashAggregate(keys=[src#5], functions=[count(1)])
+- Exchange hashpartitioning(src#5, 200)
+- *(1) HashAggregate(keys=[src#5], functions=[partial_count(1)])
+- *(1) Project [src#5]
+- *(1) Filter (isnotnull(depdelay#9) && (depdelay#9 > 40.0))
+- *(1) Scan MapRDBRelation(/user/mapr/flighttable
[src#5,depdelay#9] PushedFilters: [IsNotNull(depdelay), GreaterThan(depdelay,
40.0)], ReadSchema: struct<src:string,depdelay:double>
MapR-DB	Projection	and	Filter	push	down	
Project and Filter pushed into
MapR-DB!
50 © 2018 MapR Technologies, Inc. // MapR Confidential
Spark	MapR-DB		Projection	Filter	push	down	
Projection and Filter pushdown reduces the
amount of data passed between MapR-DB
and the Spark engine when selecting and
filtering data.
	
Data is selected and filtered in
MapR-DB!
51 © 2018 MapR Technologies, Inc. // MapR Confidential
df.cache	
df.count()	
df.createOrReplaceTempView("flights")	
	
Long	=	282628	
Register	Dataframe	as	a	Temporary	View
52 © 2018 MapR Technologies, Inc. // MapR Confidential
%sql select carrier, avg(depdelay) from flights
group by carrier
Average	Departure	Delay	by	Carrier
53 © 2018 MapR Technologies, Inc. // MapR Confidential
%sql select src, count(depdelay) from flights
where depdelay > 40 group by src
Count	of		Departure	Delays	by	Origin
54 © 2018 MapR Technologies, Inc. // MapR Confidential
%sql select src,dst count(depdelay) from flights
where depdelay > 40 group by src,dst
Count	of		Departure	Delays	by	Origin,	Destination
Explore	MapR-DB	Flight	Table	with	
GraphFrames
56 © 2018 MapR Technologies, Inc. // MapR Confidential
To	answer	questions	such	as:	
How	many	flight	routes	are	there?	
What	are	the	longest	distance	routes?	
Which	airport	has	the	most	incoming	flights?	
What	are	the	top	10	flight	routes?	
	
	
GraphFrame	and	DataFrame
57 © 2018 MapR Technologies, Inc. // MapR Confidential
val airports = spark.read.json(file)
airports.show
+-------------+-------+-----+---+
| City|Country|State| id|
+-------------+-------+-----+---+
| Chicago| USA| IL|ORD|
| New York| USA| NY|JFK|
| New York| USA| NY|LGA|
| Boston| USA| MA|BOS|
| Houston| USA| TX|IAH|
| Newark| USA| NJ|EWR|
| Denver| USA| CO|DEN|
| Miami| USA| FL|MIA|
|San Francisco| USA| CA|SFO|
| Atlanta| USA| GA|ATL|
| Dallas| USA| TX|DFW|
| Charlotte| USA| NC|CLT|
| Los Angeles| USA| CA|LAX|
| Seattle| USA| WA|SEA|
+-------------+-------+-----+---+
Read	Vertices	DataFrame	from	a	JSON	File
58 © 2018 MapR Technologies, Inc. // MapR Confidential
val graph = GraphFrame(airports, df)
// graph.edges is a DataFrame
graph.edges.show
 
Create	the	GraphFrame
59 © 2018 MapR Technologies, Inc. // MapR Confidential
GraphFrame	API	
Category	 Methods		
Graph	Topology	 vertices,	edges,	triplets	
Graph	Structure	 inDegrees,	outDegrees,	degrees	
Graph	Algorithms	 pageRank,	bfs,	aggregatedMessages,	shortestPaths,	
connectedComponents,	triangleCount	
Graph	Queries	 Motif	find
60 © 2018 MapR Technologies, Inc. // MapR Confidential
DataFrame	Queries	
Operation	 Description	
select(col) Selects	set	of	columns	
sort(sortcol) Returns	new	DataFrame	sorted	by	specified	column	
filter(expr);
where(condition)
Filter	based	on	the	SQL	expression	or	condition	
groupBy(cols:
Columns)
Groups	DataFrame	using	specified	columns	
join (DataFrame,
joinExpr)
Joins	with	another	DataFrame	using	given	join	expression	
count Count	of	rows		
avg, count, min,
max, sum (col)
Average	,	count	,	min	,	max	on	values	in	a	group
61 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.vertices.filter("State='TX'").show
+-------+-------+-----+---+
| City|Country|State| id|
+-------+-------+-----+---+
|Houston| USA| TX|IAH|
| Dallas| USA| TX|DFW|
+-------+-------+-----+---+
Graph	Vertices	and	Edges	are	DataFrames
62 © 2018 MapR Technologies, Inc. // MapR Confidential
// How many airports?
graph.vertices.count
 
result: = 13
// How many flights?
graph.edges.count
 
result: = 282628
GraphFrame	DataFrame	Queries
63 © 2018 MapR Technologies, Inc. // MapR Confidential
// Show the longest distance flight routes
graph.edges.groupBy("src", "dst")
.max("dist").sort(desc("max(dist)")).show(4)
+---+---+---------+
|src|dst|max(dist)|
+---+---+---------+
|MIA|SEA| 2724.0|
|SEA|MIA| 2724.0|
|BOS|SFO| 2704.0|
|SFO|BOS| 2704.0|
+---+---+---------+ 
What	are	the	4	Longest	Distance	Flights?
64 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.edges.filter("src = 'ATL' and depdelay > 1")
.groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).show
+---+---+------------------+
|src|dst| avg(depdelay)|
+---+---+------------------+
|ATL|EWR| 58.1085801063022|
|ATL|ORD| 46.42393736017897|
|ATL|DFW|39.454460966542754|
|ATL|LGA| 39.25498489425982|
|ATL|CLT| 37.56777108433735|
|ATL|SFO| 36.83008356545961|
+---+---+------------------+
What	is	the	average	delay	for	delayed	flights	from	Atlanta?
65 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.edges.filter("src = 'ATL' and depdelay > 1")
.groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).explain
== Physical Plan ==
*(3) Sort [avg(depdelay)#273 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(avg(depdelay)#273 DESC NULLS LAST, 200)
+- *(2) HashAggregate(keys=[src#5, dst#6], functions=[avg(depdelay#9)])
+- Exchange hashpartitioning(src#5, dst#6, 200)
+- *(1) HashAggregate(keys=[src#5, dst#6], functions=[partial_avg(depdelay#9)])
+- *(1) Filter (((isnotnull(src#5) && isnotnull(depdelay#9)) &&
(src#5 = ATL)) && (depdelay#9 > 1.0))
+- *(1) Scan MapRDBRelation(/user/mapr/flighttable
[src#5,dst#6,depdelay#9] PushedFilters: [IsNotNull(src), IsNotNull(depdelay),
EqualTo(src,ATL), GreaterThan(depdelay,1.0)], ReadSchema:
struct<src:string,dst:string,depdelay:double>
MapR-DB	Projection	and	Filter	push	down
66 © 2018 MapR Technologies, Inc. // MapR Confidential
z.show( graph.edges
.filter("src = 'ATL' and depdelay > 1”)
.groupBy("crsdephour")
.avg("depdelay”) )
What	is	the	Average	Delay	for	delayed	flights	from	Atlanta	by	
Hour?
67 © 2018 MapR Technologies, Inc. // MapR Confidential
GraphFrame	API	
Category	 Methods		
Graph	Topology	 vertices,	edges,	triplets	
Graph	Structure	 inDegrees,	outDegrees,	degrees	
Graph	Algorithms	 pageRank,	bfs,	aggregatedMessages,	shortestPaths,	
connectedComponents	
Graph	Queries	 Motif	find
68 © 2018 MapR Technologies, Inc. // MapR Confidential
WHAT	ARE	THE	HIGHEST	DEGREE	VERTEXES?	
z.show( graph.degrees.orderBy(desc("degree")) )
Which	Airports	have	the	most	incoming	and	outgoing	flights?
69 © 2018 MapR Technologies, Inc. // MapR Confidential
GraphFrame	API	
Category	 Methods		
Graph	Topology	 vertices,	edges,	triplets	
Graph	Structure	 inDegrees,	outDegrees,	degrees	
Graph	Algorithms	 pageRank,	bfs,	aggregatedMessages,	shortestPaths,	
connectedComponents	
Graph	Queries	 Motif	find
70 © 2018 MapR Technologies, Inc. // MapR Confidential
val ranks = graph.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.orderBy($"pagerank".desc).show(5)
+-------------+-------+-----+---+-------------------+
| City|Country|State| id| pagerank|
+-------------+-------+-----+---+-------------------+
| Chicago| USA| IL|ORD| 1.5129929839358685|
| Atlanta| USA| GA|ATL| 1.4255481544216664|
| Los Angeles| USA| CA|LAX| 1.2787001001758738|
| Dallas| USA| TX|DFW| 1.1999252171688064|
| Denver| USA| CO|DEN| 1.1275194324360767|
+-------------+-------+-----+---+-------------------+
Use	Pagerank	to	find	most	important	airports
71 © 2018 MapR Technologies, Inc. // MapR Confidential
GraphFrame	API	
Category	 Methods		
Graph	Topology	 vertices,	edges,	triplets	
Graph	Structure	 inDegrees,	outDegrees,	degrees	
Graph	Algorithms	 pageRank,	bfs,	aggregatedMessages,	shortestPaths,	
connectedComponents	
Graph	Queries	 Motif	find
72 © 2018 MapR Technologies, Inc. // MapR Confidential
val AM = AggregateMessages
val msgToSrc = AM.edge("depdelay")
val agg = { graph
.aggregateMessages
.sendToSrc(msgToSrc)
.agg(avg(AM.msg).as("avgdelay"))}
agg.show()
+---+------------------+
| id| avgdelay|
+---+------------------+
|EWR|17.818079459546404|
|MIA|17.768691978431264|
|ORD| 16.5199551010227|
+---+------------------+
Aggregate	Messages	to	calculate	avg	delay
73 © 2018 MapR Technologies, Inc. // MapR Confidential
// count of flight routes
val flightroutecount=graph.edges
.groupBy("src", "dst”)
.count().orderBy(desc("count"))
flightroutecount.show(5)
+---+---+-----+
|src|dst|count|
+---+---+-----+
|LGA|ORD| 4442|
|ORD|LGA| 4426|
|LAX|SFO| 4406|
|SFO|LAX| 4354|
|ATL|LGA| 3884|
+---+---+-----+
// how many routes?
flightroutecount.count
Long = 148
What	are	the	most	Frequent	Flight	Routes?
74 © 2018 MapR Technologies, Inc. // MapR Confidential
(HIGHEST	COUNT	OF	FLIGHTS)	
z.show (flightroutecount )
What	are	the	most	Frequent	Flight	Routes?
75 © 2018 MapR Technologies, Inc. // MapR Confidential
GraphFrame	API	
Category	 Methods		
Graph	Topology	 vertices,	edges,	triplets	
Graph	Structure	 inDegrees,	outDegrees,	degrees	
Graph	Algorithms	 pageRank,	bfs,	aggregatedMessages,	shortestPaths,	
connectedComponents	
Graph	Queries	 Motif	find
76 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.triplets
.show(3)
+--------------------+--------------------+--------------------+
| src| edge| dst|
+--------------------+--------------------+--------------------+
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
+--------------------+--------------------+--------------------+
Triplets	=	2	Vertices	and	1	Connecting	Edge	DataFrames			
dstsrc
edge
77 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.triplets
.filter("src.State='TX'”)
.show
+----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
|src |edge |dst |
+----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1473, 2018-01-01, 1, 1, AA, DFW, ATL, 10, 1026, 26.0, 1327, 21.0, 121.0, 731.0]|[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1675, 2018-01-01, 1, 1, AA, DFW, ATL, 13, 1255, 32.0, 1557, 16.0, 122.0, 731.0]|[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2408, 2018-01-01, 1, 1, AA, DFW, ATL, 18, 1835, 4.0, 2141, 0.0, 126.0, 731.0] |[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2479, 2018-01-01, 1, 1, AA, DFW, ATL, 9, 855, 0.0, 1200, 0.0, 125.0, 731.0] |[Atlanta, USA, GA, ATL]|
|[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2497, 2018-01-01, 1, 1, AA, DFW, ATL, 21, 2055, 0.0, 2359, 0.0, 124.0, 731.0] |[Atlanta, USA, GA, ATL]|
+----------------------+------------------------------------------------------------------------------------------------------+-----------------------+
Triplets	=	2	Vertices	and	1	Connecting	Edge	DataFrames			
dstsrc
edge
DataFrames
Refine the result
78 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.find("(src)-[edge]->(dst)")
.show(3)
+--------------------+--------------------+--------------------+
| src| edge| dst|
+--------------------+--------------------+--------------------+
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
|[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...|
+--------------------+--------------------+--------------------+
Motif	find	
dstsrc
edge
Search for a
pattern
79 © 2018 MapR Technologies, Inc. // MapR Confidential
// count of flight routes
val flightroutecount=graph.edges
.groupBy("src", "dst”)
.count().orderBy(desc("count"))
flightroutecount.show(5)
+---+---+-----+
|src|dst|count|
+---+---+-----+
|LGA|ORD| 4442|
|ORD|LGA| 4426|
|LAX|SFO| 4406|
|SFO|LAX| 4354|
|ATL|LGA| 3884|
+---+---+-----+
Next:	use	flightroutecount	with	Motif	find
80 © 2018 MapR Technologies, Inc. // MapR Confidential
val subGraph = GraphFrame(graph.vertices, flightroutecount)
val res = subGraph
.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
.filter("c.id !=a.id”)
Motif	Find	Flights	with	No	Direct	Connection	
Edge	[	]	
(c)
Vertex	(	)	
(a)
!(a)-[]->(c)
(b)-[]->(c)(a)-[]->(b)
(b)
Search for a
pattern
DataFrames
Refine the result:
Remove duplicates
81 © 2018 MapR Technologies, Inc. // MapR Confidential
val subGraph = GraphFrame(graph.vertices, flightroutecount)
val res = subGraph
.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)")
.filter("c.id !=a.id”)
Motif	Find	Flights	with	No	Direct	Connection
82 © 2018 MapR Technologies, Inc. // MapR Confidential
GraphFrame	API	
Category	 Methods		
Graph	Topology	 vertices,	edges,	triplets	
Graph	Structure	 inDegrees,	outDegrees,	degrees	
Graph	Algorithms	 pageRank,	bfs,	aggregatedMessages,	shortestPaths,	
connectedComponents	
Graph	Queries	 Motif	find
83 © 2018 MapR Technologies, Inc. // MapR Confidential
val results = graph.shortestPaths.landmarks(Seq("LGA")).run()
+---+----------+
| id| distances|
+---+----------+
|IAH|[LGA -> 1]|
|CLT|[LGA -> 1]|
|LAX|[LGA -> 2]|
|DEN|[LGA -> 1]|
|DFW|[LGA -> 1]|
|SFO|[LGA -> 2]|
|LGA|[LGA -> 0]|
|ORD|[LGA -> 1]|
|MIA|[LGA -> 1]|
|SEA|[LGA -> 2]|
|ATL|[LGA -> 1]|
|BOS|[LGA -> 1]|
|EWR|[LGA -> 2]|
+---+----------+
Compute	shortest	paths		from	each	Airport	to	LGA
84 © 2018 MapR Technologies, Inc. // MapR Confidential
GraphFrame	API	
Category	 Methods		
Graph	Topology	 vertices,	edges,	triplets	
Graph	Structure	 inDegrees,	outDegrees,	degrees	
Graph	Algorithms	 pageRank,	bfs,	aggregatedMessages,	shortestPaths,	
connectedComponents	
Graph	Queries	 Motif	find
85 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.bfs.fromExpr("id = 'LAX'")
.toExpr("id = 'LGA'").maxPathLength(1).run().show()
+----+-------+-----+---+
|City|Country|State| id|
+----+-------+-----+---+
+----+-------+-----+---+
Breadth	First	Search	for	Direct	Flights	between	LAX	and	LGA
86 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.bfs.fromExpr("id = 'LAX'")
.toExpr("id = 'LGA'").maxPathLength(2).run().show(5)
+--------------------+--------------------+--------------------+--------------------+--------------------+
| from| e0| v1| e1| to|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
|[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
Breadth	First	Search	for	Flights	between	LAX	and	LGA
87 © 2018 MapR Technologies, Inc. // MapR Confidential
graph.find("(a)-[ab]->(b); (b)-[bc]->(c)")
.filter("a.id = 'LAX'")
.filter("c.id = 'LGA'").show(4)
Motif	Search	for	Flights	between	LAX	and	LGA		
Search for a
pattern
DataFrames
Refine the result
88 © 2018 MapR Technologies, Inc. // MapR Confidential
val paths = graph.bfs.fromExpr("id = 'LAX'”).toExpr("id = 'LGA'”)
.maxPathLength(3).edgeFilter("carrier = 'AA'").run()
paths.filter("e0.crsarrtime<e1.crsdeptime-60 and e0.fldate=e1.fldate")
.select("e0.id","e1.id").show(5)
+--------------------------+--------------------------+
|id |id |
+--------------------------+--------------------------+
|LAX_BOS_2018-02-03_AA_1098|BOS_LGA_2018-02-03_AA_2126|
|LAX_BOS_2018-02-03_AA_1379|BOS_LGA_2018-02-03_AA_2126|
|LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1740|
|LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1910|
|LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1954|
+--------------------------+--------------------------+
Breadth	First	Search	for	Flights	between	LAX	and	LGA	with	AA	
BFS
DataFrames
Refine the result
Resources
90 © 2018 MapR Technologies, Inc. // MapR Confidential
Link	to	Code	for	this	webinar	is	in	
appendix	of	this		book.			
https://mapr.com/ebook/getting-started-
with-apache-spark-v2/	
New	Spark	Ebook
91 © 2018 MapR Technologies, Inc. // MapR Confidential
92 © 2018 MapR Technologies, Inc. // MapR Confidential
•  MapR	Free	ODT	http://learn.mapr.com/	
To	Learn	More:	New	Spark	2.0	training
93 © 2018 MapR Technologies, Inc. // MapR Confidential
https://mapr.com/blog/	
MapR	Blog
94 © 2018 MapR Technologies, Inc. // MapR Confidential
MapR	Data	Platform	
Link to Code for this webinar is in
appendix of the book.
https://mapr.com/ebook/getting-
started-with-apache-spark-v2/

More Related Content

What's hot

Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterDatabricks
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is FailingDataWorks Summit
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 
Hyperspace: An Indexing Subsystem for Apache Spark
Hyperspace: An Indexing Subsystem for Apache SparkHyperspace: An Indexing Subsystem for Apache Spark
Hyperspace: An Indexing Subsystem for Apache SparkDatabricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseMo Patel
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkDatabricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)Emil Eifrem
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Spark streaming
Spark streamingSpark streaming
Spark streamingWhiteklay
 
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDeploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDatabricks
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 

What's hot (20)

Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
Hyperspace: An Indexing Subsystem for Apache Spark
Hyperspace: An Indexing Subsystem for Apache SparkHyperspace: An Indexing Subsystem for Apache Spark
Hyperspace: An Indexing Subsystem for Apache Spark
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache Spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
A NOSQL Overview And The Benefits Of Graph Databases (nosql east 2009)
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine LearningDeploy and Serve Model from Azure Databricks onto Azure Machine Learning
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 

Similar to Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB

Predicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningPredicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningCarol McDonald
 
Migrating to Amazon Neptune (DAT338) - AWS re:Invent 2018
Migrating to Amazon Neptune (DAT338) - AWS re:Invent 2018Migrating to Amazon Neptune (DAT338) - AWS re:Invent 2018
Migrating to Amazon Neptune (DAT338) - AWS re:Invent 2018Amazon Web Services
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
2018 GIS Colorado: Your Geospatial Connection: ZDV 3D A Modern 3D Visualizati...
2018 GIS Colorado: Your Geospatial Connection: ZDV 3D A Modern 3D Visualizati...2018 GIS Colorado: Your Geospatial Connection: ZDV 3D A Modern 3D Visualizati...
2018 GIS Colorado: Your Geospatial Connection: ZDV 3D A Modern 3D Visualizati...GIS in the Rockies
 
Property Graphs in APEX.pptx
Property Graphs in APEX.pptxProperty Graphs in APEX.pptx
Property Graphs in APEX.pptxssuser923120
 
Optimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4jOptimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4jNeo4j
 
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...Carol McDonald
 
Workshop - Build a Graph Solution
Workshop - Build a Graph SolutionWorkshop - Build a Graph Solution
Workshop - Build a Graph SolutionNeo4j
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between  CAD & GIS: 6 Ways to Automate Your  Data IntegrationBridging Between  CAD & GIS: 6 Ways to Automate Your  Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data IntegrationSafe Software
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Roadmap y Novedades de producto
Roadmap y Novedades de productoRoadmap y Novedades de producto
Roadmap y Novedades de productoNeo4j
 
Free Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache SparkFree Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache SparkMapR Technologies
 
Surprising Advantages of Streaming - ACM March 2018
Surprising Advantages of Streaming - ACM March 2018Surprising Advantages of Streaming - ACM March 2018
Surprising Advantages of Streaming - ACM March 2018Ellen Friedman
 
Container and Kubernetes without limits
Container and Kubernetes without limitsContainer and Kubernetes without limits
Container and Kubernetes without limitsAntje Barth
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
MicroStation DGN: How to Integrate CAD and GIS
MicroStation DGN: How to Integrate CAD and GISMicroStation DGN: How to Integrate CAD and GIS
MicroStation DGN: How to Integrate CAD and GISSafe Software
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosSpark Summit
 

Similar to Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB (20)

Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Predicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningPredicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine Learning
 
Migrating to Amazon Neptune (DAT338) - AWS re:Invent 2018
Migrating to Amazon Neptune (DAT338) - AWS re:Invent 2018Migrating to Amazon Neptune (DAT338) - AWS re:Invent 2018
Migrating to Amazon Neptune (DAT338) - AWS re:Invent 2018
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
2018 GIS Colorado: Your Geospatial Connection: ZDV 3D A Modern 3D Visualizati...
2018 GIS Colorado: Your Geospatial Connection: ZDV 3D A Modern 3D Visualizati...2018 GIS Colorado: Your Geospatial Connection: ZDV 3D A Modern 3D Visualizati...
2018 GIS Colorado: Your Geospatial Connection: ZDV 3D A Modern 3D Visualizati...
 
Property Graphs in APEX.pptx
Property Graphs in APEX.pptxProperty Graphs in APEX.pptx
Property Graphs in APEX.pptx
 
Optimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4jOptimizing Your Supply Chain with Neo4j
Optimizing Your Supply Chain with Neo4j
 
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
 
Workshop - Build a Graph Solution
Workshop - Build a Graph SolutionWorkshop - Build a Graph Solution
Workshop - Build a Graph Solution
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between  CAD & GIS: 6 Ways to Automate Your  Data IntegrationBridging Between  CAD & GIS: 6 Ways to Automate Your  Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Roadmap y Novedades de producto
Roadmap y Novedades de productoRoadmap y Novedades de producto
Roadmap y Novedades de producto
 
Free Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache SparkFree Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache Spark
 
Surprising Advantages of Streaming - ACM March 2018
Surprising Advantages of Streaming - ACM March 2018Surprising Advantages of Streaming - ACM March 2018
Surprising Advantages of Streaming - ACM March 2018
 
A practical guide to GIS in Civil 3D
A practical guide to GIS in Civil 3DA practical guide to GIS in Civil 3D
A practical guide to GIS in Civil 3D
 
Container and Kubernetes without limits
Container and Kubernetes without limitsContainer and Kubernetes without limits
Container and Kubernetes without limits
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
MicroStation DGN: How to Integrate CAD and GIS
MicroStation DGN: How to Integrate CAD and GISMicroStation DGN: How to Integrate CAD and GIS
MicroStation DGN: How to Integrate CAD and GIS
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
 

More from Carol McDonald

Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUsCarol McDonald
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Carol McDonald
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBCarol McDonald
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Carol McDonald
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Carol McDonald
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Carol McDonald
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningCarol McDonald
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Carol McDonald
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Carol McDonald
 
Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churnCarol McDonald
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Carol McDonald
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient DataCarol McDonald
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Machine Learning Recommendations with Spark
Machine Learning Recommendations with SparkMachine Learning Recommendations with Spark
Machine Learning Recommendations with SparkCarol McDonald
 

More from Carol McDonald (20)

Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep Learning
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures
 
Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churn
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient Data
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Machine Learning Recommendations with Spark
Machine Learning Recommendations with SparkMachine Learning Recommendations with Spark
Machine Learning Recommendations with Spark
 

Recently uploaded

PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 

Recently uploaded (20)

PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 

Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB

  • 2. 2 © 2018 MapR Technologies, Inc. // MapR Confidential Agenda •  Introduction to Graphs •  Introduction to GraphFrames with a simple Flight Dataset •  Use GraphFrames with Flight Dataset for 2018 2
  • 4. 4 © 2018 MapR Technologies, Inc. // MapR Confidential •  Graph: Models Relations between Objects •  Graph: Vertices connected by Edges •  Vertices: the objects •  Edges: the relationships between Vertices What is a Graph?
  • 5. 5 © 2018 MapR Technologies, Inc. // MapR Confidential Regular graph: each vertex has the same number of edges Example: Facebook friends – Ted is a friend of Carol – Carol is a friend of Ted Regular Graphs vs Directed Graphs
  • 6. 6 © 2018 MapR Technologies, Inc. // MapR Confidential Directed graph: edges have a direction Example: Twitter followers – Carol follows Oprah – Oprah does not follow Carol Regular Graphs vs Directed Graphs
  • 7. 7 © 2018 MapR Technologies, Inc. // MapR Confidential Property Graph: •  Edges and Vertexes have properties •  Vertex can have multiple directed edges in parallel •  Allows multiple relationships Spark GraphX supports a distributed property graph. Property Graph Properties: City,State Properties: Flight number, Distance, Delay
  • 8. 8 © 2018 MapR Technologies, Inc. // MapR Confidential What is GraphX? Spark SQL •  Structured Data •  Querying with SQL/HQL •  DataFrames Spark Streaming •  Processing of live streams •  Micro-batching MLlib •  Machine Learning •  Multiple types of ML algorithms GraphX •  Graph processing •  Graph parallel computations •  Task scheduling •  Memory management •  Fault recovery •  Interacting with storage systems Spark Core
  • 10. 10 © 2018 MapR Technologies, Inc. // MapR Confidential Web Sites •  Vertices = Web Pages •  Edges = Links between Pages •  PageRank Importance = •  Iterative Number of Links to a page and it’s linking pages •  Twitter Example: who has the most twitter followers Graph Algorithms: PageRank Vertex= Web Page Edge= Link Importance depends on Number and Rank of linking pages
  • 11. 11 © 2018 MapR Technologies, Inc. // MapR Confidential Visualize PageRank 1.  Each page sends message function with it’s “rank” to neighbors Graph Algorithms: PageRank 0.20 0.20 0.20 0.20 0.20 Message function Sent from each vertex
  • 12. 12 © 2018 MapR Technologies, Inc. // MapR Confidential Visualize PageRank 1.  Each page sends message function with it’s “rank” to neighbors 2.  Messages are Aggregated and Calculated at each destination vertex 3.  Sum of messages becomes new vertex Page rank 4.  Repeat Graph Algorithms: PageRank Messages Aggregated and Calculated at each Vertex
  • 13. 13 © 2018 MapR Technologies, Inc. // MapR Confidential •  Many Graph Algorithms Aggregate properties of neighbors: •  PageRank •  Connected Components •  Shortest Path Graph Algorithms Connected Components Reference https://en.wikipedia.org/ Shortest Path A to F
  • 14. 14 © 2018 MapR Technologies, Inc. // MapR Confidential Graph Motif: recurrent patterns in a graph Graph Motif Query: Search a graph for occurrences of a given a pattern Twitter Example: Who should we recommend for Carol to Follow? •  Carol follows Oprah •  Oprah follows Reese Witherspoon •  Recommend Carol to follow Reese Graph Motif Queries Reese WitherspoonCarol follows Oprah follows recommend?
  • 15. 15 © 2018 MapR Technologies, Inc. // MapR Confidential Graph Motif: recurrent patterns in a graph Graph Motif Query: Search a graph for occurrences of a given a pattern Twitter Example: Recommend who to Follow? Search for patterns •  A follows B •  B follows C •  A does not follow C Graph Motif Queries A follows B follows recommend C
  • 16. 16 © 2018 MapR Technologies, Inc. // MapR Confidential Twitter: A follows B; B follows C; A doesn’t follow C graph.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)") Graph Query: Motif Find Structural Pattern Edge [ ] (c) Vertex ( ) (a) !(a)-[]->(c) a doesn’t follow c (b)-[]->(c) b follows c (a)-[]->(b) a follows b (b) Search for a pattern
  • 17. 17 © 2018 MapR Technologies, Inc. // MapR Confidential Separate Systems Image reference Spark Summit
  • 18. 18 © 2018 MapR Technologies, Inc. // MapR Confidential GraphFrames: Graph Algorithms + Graph Queries Image reference Spark Summit
  • 20. 20 © 2018 MapR Technologies, Inc. // MapR Confidential Twitter Tweets: morally outraged tweets retweeted within political sphere But rarely outside sphere Real World Graphs: Twitter Reference National Academy of Sciences
  • 21. 21 © 2018 MapR Technologies, Inc. // MapR Confidential Recommendation Engine: •  Vertices = Users, Products •  Edges = Ratings or Purchases •  Calculate how similar users rated similar products Graph: Recommendation Engines
  • 22. 22 © 2018 MapR Technologies, Inc. // MapR Confidential Healthcare Fraud: •  Vertices = Doctors, Patients, Prescriptions •  Edges = prescribed •  Calculate Narcotic Abuse, Patient Similarity, Over prescribing Real World Graphs: Fraud Prescribed Prescribed Prescribed
  • 23. 23 © 2018 MapR Technologies, Inc. // MapR Confidential Credit Card Aplication Fraud: •  Vertices = Credit Card Applicant, Phone, email, address, ssn •  Edges = Identifier •  Detect People sharing identifiers such as telephone number Real World Graphs: Fraud Shared Identifier Phone number Image reference Capitol One at Spark Summit
  • 25. 25 © 2018 MapR Technologies, Inc. // MapR Confidential Simple Flight Example with GraphFrames Originating Airport Destination Airport Distance Delay SFO ORD 1800 miles 40 ORD DFW 800 miles 0 DFW SFO 1400 miles 10
  • 26. 26 © 2018 MapR Technologies, Inc. // MapR Confidential Vertex Table
  • 27. 27 © 2018 MapR Technologies, Inc. // MapR Confidential Edges Table
  • 28. 28 © 2018 MapR Technologies, Inc. // MapR Confidential case class Airport(id: String, city: String)   val airports=Array(Airport("SFO","San Francisco"), Airport("ORD","Chicago"), Airport("DFW","Dallas Fort Worth"))   val vertices = spark.createDataset(airports).toDF vertices.show +---+-----------------+ | id| city| +---+-----------------+ |SFO| San Francisco| |ORD| Chicago| |DFW|Dallas Fort Worth| +---+-----------------+ Create a Vertices DataFrame Id City SFO San Francisco ORD Chicago DFW Dallas
  • 29. 29 © 2018 MapR Technologies, Inc. // MapR Confidential case class Flight(id: String, src: String, dst: String, dist: Double, delay: Double) val flights=Array( Flight("SFO_ORD_2017-01-01_AA”,"SFO”,"ORD”,1800, 40), Flight("ORD_DFW_2017-01-01_UA","ORD","DFW",800, 0), Flight("DFW_SFO_2017-01-01_DL","DFW","SFO",1400, 10)) val edges = spark.createDataset(flights).toDF edges.show +--------------------+---+---+------+-----+ | id|src|dst| dist|delay| +--------------------+---+---+------+-----+ |SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0| |ORD_DFW_2017-01-0...|ORD|DFW| 800.0| 0.0| |DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0| +--------------------+---+---+------+-----+ Create an Edges DataFrame
  • 30. 30 © 2018 MapR Technologies, Inc. // MapR Confidential val graph = GraphFrame(vertices, edges) graph.vertices.show   +---+-----------------+ | id| name| +---+-----------------+ |SFO| San Francisco| |ORD| Chicago| |DFW|Dallas Fort Worth| +---+-----------------+ Create the GraphFrame
  • 31. 31 © 2018 MapR Technologies, Inc. // MapR Confidential graph.edges.show   result: +--------------------+---+---+------+-----+ | id|src|dst| dist|delay| +--------------------+---+---+------+-----+ |SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0| |ORD_DFW_2017-01-0...|ORD|DFW| 800.0| 0.0| |DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0| +--------------------+---+---+------+-----+ GraphFrame Edges
  • 32. 32 © 2018 MapR Technologies, Inc. // MapR Confidential To answer questions such as: How many airports are there? How many flight routes are there? What are the longest distance routes? Which airport has the most incoming flights? What are the top 10 flights? Graph Operators
  • 33. 33 © 2018 MapR Technologies, Inc. // MapR Confidential // How many airports? graph.vertices.count   result: = 3 // How many flights? graph.edges.count   result: = 3 Query the GraphFrame
  • 34. 34 © 2018 MapR Technologies, Inc. // MapR Confidential // flight routes > 800 miles distance? graph.edges.filter("dist > 800").show +--------------------+---+---+------+-----+ | id|src|dst| dist|delay| +--------------------+---+---+------+-----+ |SFO_ORD_2017-01-0...|SFO|ORD|1800.0| 40.0| |DFW_SFO_2017-01-0...|DFW|SFO|1400.0| 10.0| +--------------------+---+---+------+-----+   Query the GraphFrame
  • 36. 36 © 2018 MapR Technologies, Inc. // MapR Confidential How a Spark Application Runs on a Cluster
  • 37. 37 © 2018 MapR Technologies, Inc. // MapR Confidential •  A Dataset is a collection of Typed Objects •  Dataset[T] •  (can use SQL and functions) •  A DataFrame is a Dataset of Row objects •  Dataset[Row] •  (can use SQL) •  Partitioned across a cluster •  Operated on in parallel •  can be Cached Spark Distributed Datasets partitioned
  • 38. 38 © 2018 MapR Technologies, Inc. // MapR Confidential •  Spark SQL queries and updates to MapR-DB •  With projection and filter pushdown, custom partitioning, and data locality Spark SQL Querying MapR-DB JSON
  • 39. 39 © 2018 MapR Technologies, Inc. // MapR Confidential Designed for Partitioning and Scaling Data is automatically partitioned and sorted by id row key!
  • 40. 40 © 2018 MapR Technologies, Inc. // MapR Confidential Spark MapR-DB Connector
  • 41. 41 © 2018 MapR Technologies, Inc. // MapR Confidential { “id": ”ATL_LGA_2017-01-01_AA_1678", "dofW": 7, "carrier": "AA", ”src": "ATL", ”dst": "LGA", "crsdephour": 17, "crsdeptime": 1700, "depdelay": 0.0, "crsarrtime": 1912, "arrdelay": 0.0, "crselapsedtime": 132.0, "dist": 762.0 } Flight Dataset Table is automatically partitioned and sorted by id row key!
  • 42. 42 © 2018 MapR Technologies, Inc. // MapR Confidential MapR-DB JSON Document Store Data is automatically partitioned and sorted by id row key! { “id": ”ATL_LGA_2017-01-01_AA_1678", "dofW": 7, "carrier": "AA", ”src": "ATL", ”dst": "LGA", "crsdephour": 17, "crsdeptime": 1700, "depdelay": 0.0, "crsarrtime": 1912, "arrdelay": 0.0, "crselapsedtime": 132.0, "dist": 762.0 }
  • 43. 43 © 2018 MapR Technologies, Inc. // MapR Confidential Row key = Table is Partitioned by src,dst vertexes Data is automatically partitioned by key range and sorted = src_dst ATL_LGA_2017-01-01_AA_1678!
  • 44. 44 © 2018 MapR Technologies, Inc. // MapR Confidential SFO DEN IAH ATL ORD BOS LGA EWR MIA SEA LAX DFW Airports
  • 45. 45 © 2018 MapR Technologies, Inc. // MapR Confidential Load the data into a Dataset: Define the Schema
  • 46. 46 © 2018 MapR Technologies, Inc. // MapR Confidential var tableName = "/user/mapr/flighttable” val df = spark.sparkSession .loadFromMapRDB[Flight](tableName, schema) Read Dataset from MapR-DB Worker Task Worker Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data Task Task Driver tasks tasks tasks
  • 47. 47 © 2018 MapR Technologies, Inc. // MapR Confidential df.show(5) Show the first rows of the DataFrame columns row Data is automatically partitioned and sorted by row key = src dst ATL_BOS_2018-01-01_AA_1678!
  • 48. 48 © 2018 MapR Technologies, Inc. // MapR Confidential df.filter($"depdelay" > 40).groupBy(”src”) .count().orderBy(desc(“count”)).show(5) +---+-----+ |src|count| +---+-----+ |ORD| 4033| |ATL| 3106| |DFW| 2782| |EWR| 2328| |DEN| 2304| +---+-----+ Originating airports with highest number of Departure Delays
  • 49. 49 © 2018 MapR Technologies, Inc. // MapR Confidential df.filter($"depdelay" > 40).groupBy("src") .count.orderBy(desc("count" )).explain == Physical Plan == *(3) Sort [count#549L DESC NULLS LAST], true, 0 +- Exchange rangepartitioning(count#549L DESC NULLS LAST, 200) +- *(2) HashAggregate(keys=[src#5], functions=[count(1)]) +- Exchange hashpartitioning(src#5, 200) +- *(1) HashAggregate(keys=[src#5], functions=[partial_count(1)]) +- *(1) Project [src#5] +- *(1) Filter (isnotnull(depdelay#9) && (depdelay#9 > 40.0)) +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,depdelay#9] PushedFilters: [IsNotNull(depdelay), GreaterThan(depdelay, 40.0)], ReadSchema: struct<src:string,depdelay:double> MapR-DB Projection and Filter push down Project and Filter pushed into MapR-DB!
  • 50. 50 © 2018 MapR Technologies, Inc. // MapR Confidential Spark MapR-DB Projection Filter push down Projection and Filter pushdown reduces the amount of data passed between MapR-DB and the Spark engine when selecting and filtering data. Data is selected and filtered in MapR-DB!
  • 51. 51 © 2018 MapR Technologies, Inc. // MapR Confidential df.cache df.count() df.createOrReplaceTempView("flights") Long = 282628 Register Dataframe as a Temporary View
  • 52. 52 © 2018 MapR Technologies, Inc. // MapR Confidential %sql select carrier, avg(depdelay) from flights group by carrier Average Departure Delay by Carrier
  • 53. 53 © 2018 MapR Technologies, Inc. // MapR Confidential %sql select src, count(depdelay) from flights where depdelay > 40 group by src Count of Departure Delays by Origin
  • 54. 54 © 2018 MapR Technologies, Inc. // MapR Confidential %sql select src,dst count(depdelay) from flights where depdelay > 40 group by src,dst Count of Departure Delays by Origin, Destination
  • 56. 56 © 2018 MapR Technologies, Inc. // MapR Confidential To answer questions such as: How many flight routes are there? What are the longest distance routes? Which airport has the most incoming flights? What are the top 10 flight routes? GraphFrame and DataFrame
  • 57. 57 © 2018 MapR Technologies, Inc. // MapR Confidential val airports = spark.read.json(file) airports.show +-------------+-------+-----+---+ | City|Country|State| id| +-------------+-------+-----+---+ | Chicago| USA| IL|ORD| | New York| USA| NY|JFK| | New York| USA| NY|LGA| | Boston| USA| MA|BOS| | Houston| USA| TX|IAH| | Newark| USA| NJ|EWR| | Denver| USA| CO|DEN| | Miami| USA| FL|MIA| |San Francisco| USA| CA|SFO| | Atlanta| USA| GA|ATL| | Dallas| USA| TX|DFW| | Charlotte| USA| NC|CLT| | Los Angeles| USA| CA|LAX| | Seattle| USA| WA|SEA| +-------------+-------+-----+---+ Read Vertices DataFrame from a JSON File
  • 58. 58 © 2018 MapR Technologies, Inc. // MapR Confidential val graph = GraphFrame(airports, df) // graph.edges is a DataFrame graph.edges.show   Create the GraphFrame
  • 59. 59 © 2018 MapR Technologies, Inc. // MapR Confidential GraphFrame API Category Methods Graph Topology vertices, edges, triplets Graph Structure inDegrees, outDegrees, degrees Graph Algorithms pageRank, bfs, aggregatedMessages, shortestPaths, connectedComponents, triangleCount Graph Queries Motif find
  • 60. 60 © 2018 MapR Technologies, Inc. // MapR Confidential DataFrame Queries Operation Description select(col) Selects set of columns sort(sortcol) Returns new DataFrame sorted by specified column filter(expr); where(condition) Filter based on the SQL expression or condition groupBy(cols: Columns) Groups DataFrame using specified columns join (DataFrame, joinExpr) Joins with another DataFrame using given join expression count Count of rows avg, count, min, max, sum (col) Average , count , min , max on values in a group
  • 61. 61 © 2018 MapR Technologies, Inc. // MapR Confidential graph.vertices.filter("State='TX'").show +-------+-------+-----+---+ | City|Country|State| id| +-------+-------+-----+---+ |Houston| USA| TX|IAH| | Dallas| USA| TX|DFW| +-------+-------+-----+---+ Graph Vertices and Edges are DataFrames
  • 62. 62 © 2018 MapR Technologies, Inc. // MapR Confidential // How many airports? graph.vertices.count   result: = 13 // How many flights? graph.edges.count   result: = 282628 GraphFrame DataFrame Queries
  • 63. 63 © 2018 MapR Technologies, Inc. // MapR Confidential // Show the longest distance flight routes graph.edges.groupBy("src", "dst") .max("dist").sort(desc("max(dist)")).show(4) +---+---+---------+ |src|dst|max(dist)| +---+---+---------+ |MIA|SEA| 2724.0| |SEA|MIA| 2724.0| |BOS|SFO| 2704.0| |SFO|BOS| 2704.0| +---+---+---------+  What are the 4 Longest Distance Flights?
  • 64. 64 © 2018 MapR Technologies, Inc. // MapR Confidential graph.edges.filter("src = 'ATL' and depdelay > 1") .groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).show +---+---+------------------+ |src|dst| avg(depdelay)| +---+---+------------------+ |ATL|EWR| 58.1085801063022| |ATL|ORD| 46.42393736017897| |ATL|DFW|39.454460966542754| |ATL|LGA| 39.25498489425982| |ATL|CLT| 37.56777108433735| |ATL|SFO| 36.83008356545961| +---+---+------------------+ What is the average delay for delayed flights from Atlanta?
  • 65. 65 © 2018 MapR Technologies, Inc. // MapR Confidential graph.edges.filter("src = 'ATL' and depdelay > 1") .groupBy("src", "dst").avg("depdelay").sort(desc("avg(depdelay)")).explain == Physical Plan == *(3) Sort [avg(depdelay)#273 DESC NULLS LAST], true, 0 +- Exchange rangepartitioning(avg(depdelay)#273 DESC NULLS LAST, 200) +- *(2) HashAggregate(keys=[src#5, dst#6], functions=[avg(depdelay#9)]) +- Exchange hashpartitioning(src#5, dst#6, 200) +- *(1) HashAggregate(keys=[src#5, dst#6], functions=[partial_avg(depdelay#9)]) +- *(1) Filter (((isnotnull(src#5) && isnotnull(depdelay#9)) && (src#5 = ATL)) && (depdelay#9 > 1.0)) +- *(1) Scan MapRDBRelation(/user/mapr/flighttable [src#5,dst#6,depdelay#9] PushedFilters: [IsNotNull(src), IsNotNull(depdelay), EqualTo(src,ATL), GreaterThan(depdelay,1.0)], ReadSchema: struct<src:string,dst:string,depdelay:double> MapR-DB Projection and Filter push down
  • 66. 66 © 2018 MapR Technologies, Inc. // MapR Confidential z.show( graph.edges .filter("src = 'ATL' and depdelay > 1”) .groupBy("crsdephour") .avg("depdelay”) ) What is the Average Delay for delayed flights from Atlanta by Hour?
  • 67. 67 © 2018 MapR Technologies, Inc. // MapR Confidential GraphFrame API Category Methods Graph Topology vertices, edges, triplets Graph Structure inDegrees, outDegrees, degrees Graph Algorithms pageRank, bfs, aggregatedMessages, shortestPaths, connectedComponents Graph Queries Motif find
  • 68. 68 © 2018 MapR Technologies, Inc. // MapR Confidential WHAT ARE THE HIGHEST DEGREE VERTEXES? z.show( graph.degrees.orderBy(desc("degree")) ) Which Airports have the most incoming and outgoing flights?
  • 69. 69 © 2018 MapR Technologies, Inc. // MapR Confidential GraphFrame API Category Methods Graph Topology vertices, edges, triplets Graph Structure inDegrees, outDegrees, degrees Graph Algorithms pageRank, bfs, aggregatedMessages, shortestPaths, connectedComponents Graph Queries Motif find
  • 70. 70 © 2018 MapR Technologies, Inc. // MapR Confidential val ranks = graph.pageRank.resetProbability(0.15).maxIter(10).run() ranks.vertices.orderBy($"pagerank".desc).show(5) +-------------+-------+-----+---+-------------------+ | City|Country|State| id| pagerank| +-------------+-------+-----+---+-------------------+ | Chicago| USA| IL|ORD| 1.5129929839358685| | Atlanta| USA| GA|ATL| 1.4255481544216664| | Los Angeles| USA| CA|LAX| 1.2787001001758738| | Dallas| USA| TX|DFW| 1.1999252171688064| | Denver| USA| CO|DEN| 1.1275194324360767| +-------------+-------+-----+---+-------------------+ Use Pagerank to find most important airports
  • 71. 71 © 2018 MapR Technologies, Inc. // MapR Confidential GraphFrame API Category Methods Graph Topology vertices, edges, triplets Graph Structure inDegrees, outDegrees, degrees Graph Algorithms pageRank, bfs, aggregatedMessages, shortestPaths, connectedComponents Graph Queries Motif find
  • 72. 72 © 2018 MapR Technologies, Inc. // MapR Confidential val AM = AggregateMessages val msgToSrc = AM.edge("depdelay") val agg = { graph .aggregateMessages .sendToSrc(msgToSrc) .agg(avg(AM.msg).as("avgdelay"))} agg.show() +---+------------------+ | id| avgdelay| +---+------------------+ |EWR|17.818079459546404| |MIA|17.768691978431264| |ORD| 16.5199551010227| +---+------------------+ Aggregate Messages to calculate avg delay
  • 73. 73 © 2018 MapR Technologies, Inc. // MapR Confidential // count of flight routes val flightroutecount=graph.edges .groupBy("src", "dst”) .count().orderBy(desc("count")) flightroutecount.show(5) +---+---+-----+ |src|dst|count| +---+---+-----+ |LGA|ORD| 4442| |ORD|LGA| 4426| |LAX|SFO| 4406| |SFO|LAX| 4354| |ATL|LGA| 3884| +---+---+-----+ // how many routes? flightroutecount.count Long = 148 What are the most Frequent Flight Routes?
  • 74. 74 © 2018 MapR Technologies, Inc. // MapR Confidential (HIGHEST COUNT OF FLIGHTS) z.show (flightroutecount ) What are the most Frequent Flight Routes?
  • 75. 75 © 2018 MapR Technologies, Inc. // MapR Confidential GraphFrame API Category Methods Graph Topology vertices, edges, triplets Graph Structure inDegrees, outDegrees, degrees Graph Algorithms pageRank, bfs, aggregatedMessages, shortestPaths, connectedComponents Graph Queries Motif find
  • 76. 76 © 2018 MapR Technologies, Inc. // MapR Confidential graph.triplets .show(3) +--------------------+--------------------+--------------------+ | src| edge| dst| +--------------------+--------------------+--------------------+ |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...| |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...| |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...| +--------------------+--------------------+--------------------+ Triplets = 2 Vertices and 1 Connecting Edge DataFrames dstsrc edge
  • 77. 77 © 2018 MapR Technologies, Inc. // MapR Confidential graph.triplets .filter("src.State='TX'”) .show +----------------------+------------------------------------------------------------------------------------------------------+-----------------------+ |src |edge |dst | +----------------------+------------------------------------------------------------------------------------------------------+-----------------------+ |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1473, 2018-01-01, 1, 1, AA, DFW, ATL, 10, 1026, 26.0, 1327, 21.0, 121.0, 731.0]|[Atlanta, USA, GA, ATL]| |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_1675, 2018-01-01, 1, 1, AA, DFW, ATL, 13, 1255, 32.0, 1557, 16.0, 122.0, 731.0]|[Atlanta, USA, GA, ATL]| |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2408, 2018-01-01, 1, 1, AA, DFW, ATL, 18, 1835, 4.0, 2141, 0.0, 126.0, 731.0] |[Atlanta, USA, GA, ATL]| |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2479, 2018-01-01, 1, 1, AA, DFW, ATL, 9, 855, 0.0, 1200, 0.0, 125.0, 731.0] |[Atlanta, USA, GA, ATL]| |[Dallas, USA, TX, DFW]|[DFW_ATL_2018-01-01_AA_2497, 2018-01-01, 1, 1, AA, DFW, ATL, 21, 2055, 0.0, 2359, 0.0, 124.0, 731.0] |[Atlanta, USA, GA, ATL]| +----------------------+------------------------------------------------------------------------------------------------------+-----------------------+ Triplets = 2 Vertices and 1 Connecting Edge DataFrames dstsrc edge DataFrames Refine the result
  • 78. 78 © 2018 MapR Technologies, Inc. // MapR Confidential graph.find("(src)-[edge]->(dst)") .show(3) +--------------------+--------------------+--------------------+ | src| edge| dst| +--------------------+--------------------+--------------------+ |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...| |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...| |[Atlanta, USA, GA...|[ATL_BOS_2018-01-...|[Boston, USA, MA,...| +--------------------+--------------------+--------------------+ Motif find dstsrc edge Search for a pattern
  • 79. 79 © 2018 MapR Technologies, Inc. // MapR Confidential // count of flight routes val flightroutecount=graph.edges .groupBy("src", "dst”) .count().orderBy(desc("count")) flightroutecount.show(5) +---+---+-----+ |src|dst|count| +---+---+-----+ |LGA|ORD| 4442| |ORD|LGA| 4426| |LAX|SFO| 4406| |SFO|LAX| 4354| |ATL|LGA| 3884| +---+---+-----+ Next: use flightroutecount with Motif find
  • 80. 80 © 2018 MapR Technologies, Inc. // MapR Confidential val subGraph = GraphFrame(graph.vertices, flightroutecount) val res = subGraph .find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)") .filter("c.id !=a.id”) Motif Find Flights with No Direct Connection Edge [ ] (c) Vertex ( ) (a) !(a)-[]->(c) (b)-[]->(c)(a)-[]->(b) (b) Search for a pattern DataFrames Refine the result: Remove duplicates
  • 81. 81 © 2018 MapR Technologies, Inc. // MapR Confidential val subGraph = GraphFrame(graph.vertices, flightroutecount) val res = subGraph .find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)") .filter("c.id !=a.id”) Motif Find Flights with No Direct Connection
  • 82. 82 © 2018 MapR Technologies, Inc. // MapR Confidential GraphFrame API Category Methods Graph Topology vertices, edges, triplets Graph Structure inDegrees, outDegrees, degrees Graph Algorithms pageRank, bfs, aggregatedMessages, shortestPaths, connectedComponents Graph Queries Motif find
  • 83. 83 © 2018 MapR Technologies, Inc. // MapR Confidential val results = graph.shortestPaths.landmarks(Seq("LGA")).run() +---+----------+ | id| distances| +---+----------+ |IAH|[LGA -> 1]| |CLT|[LGA -> 1]| |LAX|[LGA -> 2]| |DEN|[LGA -> 1]| |DFW|[LGA -> 1]| |SFO|[LGA -> 2]| |LGA|[LGA -> 0]| |ORD|[LGA -> 1]| |MIA|[LGA -> 1]| |SEA|[LGA -> 2]| |ATL|[LGA -> 1]| |BOS|[LGA -> 1]| |EWR|[LGA -> 2]| +---+----------+ Compute shortest paths from each Airport to LGA
  • 84. 84 © 2018 MapR Technologies, Inc. // MapR Confidential GraphFrame API Category Methods Graph Topology vertices, edges, triplets Graph Structure inDegrees, outDegrees, degrees Graph Algorithms pageRank, bfs, aggregatedMessages, shortestPaths, connectedComponents Graph Queries Motif find
  • 85. 85 © 2018 MapR Technologies, Inc. // MapR Confidential graph.bfs.fromExpr("id = 'LAX'") .toExpr("id = 'LGA'").maxPathLength(1).run().show() +----+-------+-----+---+ |City|Country|State| id| +----+-------+-----+---+ +----+-------+-----+---+ Breadth First Search for Direct Flights between LAX and LGA
  • 86. 86 © 2018 MapR Technologies, Inc. // MapR Confidential graph.bfs.fromExpr("id = 'LAX'") .toExpr("id = 'LGA'").maxPathLength(2).run().show(5) +--------------------+--------------------+--------------------+--------------------+--------------------+ | from| e0| v1| e1| to| +--------------------+--------------------+--------------------+--------------------+--------------------+ |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...| |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...| |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...| |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...| |[Los Angeles, USA...|[LAX_IAH_2018-01-...|[Houston, USA, TX...|[IAH_LGA_2018-01-...|[New York, USA, N...| +--------------------+--------------------+--------------------+--------------------+--------------------+ Breadth First Search for Flights between LAX and LGA
  • 87. 87 © 2018 MapR Technologies, Inc. // MapR Confidential graph.find("(a)-[ab]->(b); (b)-[bc]->(c)") .filter("a.id = 'LAX'") .filter("c.id = 'LGA'").show(4) Motif Search for Flights between LAX and LGA Search for a pattern DataFrames Refine the result
  • 88. 88 © 2018 MapR Technologies, Inc. // MapR Confidential val paths = graph.bfs.fromExpr("id = 'LAX'”).toExpr("id = 'LGA'”) .maxPathLength(3).edgeFilter("carrier = 'AA'").run() paths.filter("e0.crsarrtime<e1.crsdeptime-60 and e0.fldate=e1.fldate") .select("e0.id","e1.id").show(5) +--------------------------+--------------------------+ |id |id | +--------------------------+--------------------------+ |LAX_BOS_2018-02-03_AA_1098|BOS_LGA_2018-02-03_AA_2126| |LAX_BOS_2018-02-03_AA_1379|BOS_LGA_2018-02-03_AA_2126| |LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1740| |LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1910| |LAX_CLT_2018-02-14_AA_1905|CLT_LGA_2018-02-14_AA_1954| +--------------------------+--------------------------+ Breadth First Search for Flights between LAX and LGA with AA BFS DataFrames Refine the result
  • 90. 90 © 2018 MapR Technologies, Inc. // MapR Confidential Link to Code for this webinar is in appendix of this book. https://mapr.com/ebook/getting-started- with-apache-spark-v2/ New Spark Ebook
  • 91. 91 © 2018 MapR Technologies, Inc. // MapR Confidential
  • 92. 92 © 2018 MapR Technologies, Inc. // MapR Confidential •  MapR Free ODT http://learn.mapr.com/ To Learn More: New Spark 2.0 training
  • 93. 93 © 2018 MapR Technologies, Inc. // MapR Confidential https://mapr.com/blog/ MapR Blog
  • 94. 94 © 2018 MapR Technologies, Inc. // MapR Confidential MapR Data Platform Link to Code for this webinar is in appendix of the book. https://mapr.com/ebook/getting- started-with-apache-spark-v2/