GraphFrames: Graph Queries in Spark SQL by Ankur Dave
1. GraphFrames: Graph Queries in Spark SQL
Ankur Dave
UC Berkeley AMPLab
Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Joseph Gonzalez (UC Berkeley), Matei Zaharia (MIT and Databricks)
2. Support for Graph Analysis in Spark
2009: Spark (relational queries)
2013: Spark + GraphX (+ graph algorithms)
2016: Spark + GraphFrames (+ graph queries)
Neo4j, etc.: graph updates + anchored traversals
3. Graph Algorithms vs. Graph Queries
Graph Algorithms: PageRank, Alternating Least Squares
Graph Queries: [figure not recoverable]
4. Graph Algorithms vs. Graph Queries
Graph Algorithm: PageRank
Graph Query: Wikipedia Collaborators
[Figure: query pattern in which Editor 1 and Editor 2 both edit Article 1 and Article 2, with the edits to each article made on the same day]
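The collaborator pattern can be sketched as a self-join over an edit table. This is a minimal illustration in plain Python, not the GraphFrames API. It assumes the pattern means "two editors who edited the same article on the same day, for at least two distinct articles". The sample rows and the name `same_day_collaborators` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical edit table: (editor, article, day). Invented sample data.
edits = [
    ("alice", "Spark", "2016-02-01"),
    ("bob", "Spark", "2016-02-01"),
    ("alice", "Graph", "2016-02-03"),
    ("bob", "Graph", "2016-02-03"),
    ("carol", "Spark", "2016-02-02"),
    ("carol", "Graph", "2016-02-03"),
]

def same_day_collaborators(edits):
    """Map each editor pair to the articles both edited on the same day,
    keeping only pairs that match on at least two distinct articles."""
    # Group editors by (article, day): this is the self-join key.
    by_article_day = defaultdict(set)
    for editor, article, day in edits:
        by_article_day[(article, day)].add(editor)

    # Every pair of editors in a group collaborated on that article.
    shared = defaultdict(set)
    for (article, _day), editors in by_article_day.items():
        for pair in combinations(sorted(editors), 2):
            shared[pair].add(article)

    return {pair: arts for pair, arts in shared.items() if len(arts) >= 2}

collaborators = same_day_collaborators(edits)
```

No editor is fixed in advance, so every edit has to be considered. That is what makes this an unanchored query rather than a traversal from a known starting vertex.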
7. Problem: Mixed Graph Analysis
[Figure: analysis pipeline starting from raw Wikipedia XML. Stages shown: Text Table, Edit Table, Edit Graph, Hyperlinks. Results shown: PageRank, Frequent Collaborators, Vandalism Suspects. Graphs shown: User–Product, User–Article, User–User.]
12. Materialized View Selection
GraphX: Triplet view enables efficient message-passing algorithms
Vertices: A, B, C, D
Edges: A → B, A → C, B → C, C → D
Triplet View: A → B, A → C, B → C, C → D
[Figure: Graph + Triplet View → Updated PageRanks]
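To make the triplet view concrete, here is a minimal sketch in plain Python (not GraphX code) using the slide's four vertices and four edges. It builds the triplet view by attaching both endpoint attributes to each edge, then runs one PageRank-style message-passing step over it. The initial ranks of 1.0, the damping factor of 0.85, and the unnormalized update rule are assumptions made for illustration.

```python
# Vertices and edges from the slide. Initial ranks are an assumption.
vertices = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

def triplet_view(vertices, edges):
    """Join each edge with the attributes of its source and destination."""
    return [(src, vertices[src], dst, vertices[dst]) for src, dst in edges]

def pagerank_step(vertices, edges, damping=0.85):
    """One message-passing iteration computed by scanning the triplet view."""
    out_degree = {v: 0 for v in vertices}
    for src, _dst in edges:
        out_degree[src] += 1

    # Each triplet sends rank / out_degree from its source to its destination.
    incoming = {v: 0.0 for v in vertices}
    for src, src_rank, dst, _dst_rank in triplet_view(vertices, edges):
        incoming[dst] += src_rank / out_degree[src]

    # Unnormalized update (assumed): reset term plus damped incoming rank.
    return {v: (1 - damping) + damping * incoming[v] for v in vertices}

updated = pagerank_step(vertices, edges)
```

Because each triplet already carries the source rank, the step needs only a single scan of the view, with no per-edge vertex lookups.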
13. Materialized View Selection
GraphFrames: User-defined views enable efficient graph queries
Vertices: A, B, C, D
Edges: A → B, A → C, B → C, C → D
Triplet View: A → B, A → C, B → C, C → D
[Figure: Graph → Triplet View → PageRank, Community Detection, …; Graph → User-Defined Views → Graph Queries]
14. Join Elimination
Edges:
Src  Dst
1    2
1    3
2    3
2    5
Vertices:
ID  Attr
1   A
2   B
3   C
4   D
Standard vertex-edge join:
SELECT src, dst, attr AS src_attr
FROM edges INNER JOIN vertices ON src = id;
Unnecessary join:
SELECT src, dst
FROM edges INNER JOIN vertices ON src = id;
The unnecessary join can be eliminated if the tables satisfy referential integrity, simplifying graph–relational translation.
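The equivalence can be checked on the slide's own tables. This is a small sketch using Python's built-in `sqlite3` module, chosen only for convenience. It does not show Spark SQL's optimizer, and SQLite is not claimed to perform this rewrite itself. The schema (including the primary key on `id`) is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, attr TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO vertices VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO edges VALUES (1, 2), (1, 3), (2, 3), (2, 5);
""")

# The "unnecessary join" query: no vertex attribute is selected.
with_join = conn.execute(
    "SELECT src, dst FROM edges INNER JOIN vertices ON src = id "
    "ORDER BY src, dst"
).fetchall()

# The same query with the join eliminated.
without_join = conn.execute(
    "SELECT src, dst FROM edges ORDER BY src, dst"
).fetchall()

# Referential integrity on the join column: every src must be a vertex id.
dangling_src = conn.execute(
    "SELECT COUNT(*) FROM edges WHERE src NOT IN (SELECT id FROM vertices)"
).fetchone()[0]
```

Both queries return the same four rows because every `src` matches exactly one vertex `id`. The edge (2, 5) has a `dst` with no matching vertex, so the same elimination would not be valid for a join on `dst`.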
15. Join Reordering
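[Figure: an example query and two join plans for it; the plan trees are only partly recoverable]
Example Query: [figure not recoverable]
Left-Deep Plan: edge relations A → B, B → A, B → D, C → B, B → E, C → D, C → E joined one at a time, with join keys (A, B), B, B, B, (C, D), (C, E)
Bushy Plan: edge relations A → B, B → A, B → D, C → B, B → E, with join keys (A, B), B, B, B, followed by a join on (B, C) with a User-Defined View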
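The difference between the two plan shapes can be illustrated with a toy pattern matcher in plain Python. This is a sketch under assumptions, not the GraphFrames planner. It uses only a four-edge subset of the slide's pattern (A → B, B → A, B → D, C → B), and the edge list and the helpers `scan`, `join`, and `canon` are hypothetical.

```python
def scan(edges, src_var, dst_var):
    """Bind every edge to a pair of pattern variables."""
    return [{src_var: s, dst_var: d} for s, d in edges]

def join(left, right):
    """Natural join of two lists of variable bindings on their shared variables."""
    out = []
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in l.keys() & r.keys()):
                out.append({**l, **r})
    return out

def canon(rows):
    """Order-independent form of a result, for comparing plans."""
    return sorted(tuple(sorted(r.items())) for r in rows)

# Invented edge list for illustration.
edges = [(1, 2), (2, 1), (2, 3), (4, 2), (2, 5)]

ab, ba = scan(edges, "A", "B"), scan(edges, "B", "A")
bd, cb = scan(edges, "B", "D"), scan(edges, "C", "B")

# Left-deep: each join adds one edge relation to the running result.
left_deep = join(join(join(ab, ba), bd), cb)

# Bushy: two independent sub-joins are combined at the end.
bushy = join(join(ab, ba), join(bd, cb))
```

Both plans return the same matches. They differ only in the intermediate results they build, which is why an optimizer is free to choose between them and why a precomputed sub-join (a view) can replace one branch of a bushy plan.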
16. Evaluation
Faster than Neo4j for unanchored pattern queries
[Figure: two bar charts of query latency (s), GraphFrames vs. Neo4j. Left panel: Anchored Pattern Query (axis 0 to 2.5 s). Right panel: Unanchored Pattern Query (axis 0 to 80 s).]
Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.
17. Evaluation
Approaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation
[Figure: bar chart of PageRank performance, per-iteration runtime (s, axis 0 to 7), for GraphFrames, GraphX, and Naïve Spark.]
Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.
19. Future Work
• Suggest views automatically
• Exploit attribute-based partitioning in optimizer
• Code generation for single node
20. Try It Out!
Preview available for Spark 1.4+ at:
https://github.com/graphframes/graphframes
Thanks to Databricks contributors Joseph Bradley, Xiangrui Meng, and Timothy Hunter.
Watch for the release on Spark Packages in the coming weeks.
ankurd@eecs.berkeley.edu