SlideShare a Scribd company logo
1 of 49
Download to read offline
GraphFrames
DataFrame-based graphs for Apache® Spark™
Joseph K. Bradley
4/14/2016
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineerand Apache
Spark PMC member working on MLlib at
Databricks. Previously,he was a postdoc at UC
Berkeley after receiving hisPh.D. in Machine
Learning from Carnegie Mellon U.in 2013.His
research included probabilistic graphical models,
parallel sparse regression, and aggregation
mechanismsfor peergrading in MOOCs.
2
About the moderator: Denny Lee
Denny Lee is a Technology Evangelistwith
Databricks; he is a hands-on data sciencesengineer
with more than 15 years of experience developing
internet-scale infrastructure, data platforms, and
distributed systems for both on-premisesand cloud.
Prior to joining Databricks, Denny worked as a
SeniorDirector of Data SciencesEngineering at
Concur and was part of the incubation teamthat
builtHadoop on Windowsand Azure (currently
known as HDInsight).
3
We are Databricks, the company behind Apache Spark
Founded by the creators of
Apache Spark in 2013
Share of Spark code
contributed by Databricks
in 2014
75%
4
Data Value
Created Databricks on top of Spark to make big data simple.
…
Apache Spark Engine
Spark Core
Spark
Streaming
Spark SQL MLlib GraphX
Unified engineacross diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, and R
APIs
Standard libraries
NOTABL E USERS THAT PRESENTED AT SPARK SUMMIT
2 0 1 5 SAN F RANCISCO
Source: Slide5ofSparkCommunityUpdate
Outline
GraphFrames overview
GraphFrames vs. GraphXand other libraries
Details for power users
Roadmap and resources
8
Outline
GraphFrames overview
GraphFrames vs. GraphXand other libraries
Details for power users
Roadmap and resources
9
Graphs
10
vertex
edge
id City State
“JFK” “New York” NY
Example: airports & flights between them
JFK
IAD
LAX
SFO
SEA
DFW
src dst delay tripID
“JFK
”
“SEA” 45 1058923
Apache Spark’s GraphX library
Overview
• General-purpose graph
processinglibrary
• Optimized for fast
distributedcomputing
• Library of algorithms:
PageRank, Connected
Components,etc.
11
Challenges
• No Java, PythonAPIs
• Lower-levelRDD-based
API (vs.DataFrames)
• Cannot use recent Spark
optimizations:Catalyst
query optimizer,Tungsten
memory management
Enter GraphFrames
Goal: DataFrame-based graphson ApacheSpark
• Simplify interactive queries
• Support motif-findingforstructural pattern search
• Benefitfrom DataFrame optimizations
Collaboration between Databricks, UC Berkeley& MIT
+ Now with community contributors!
12
Graphs
13
vertex
edge
id City State
“JFK” “New York” NY
Example: airports & flights between them
JFK
IAD
LAX
SFO
SEA
DFW
src dst delay tripID
“JFK
”
“SEA” 45 1058923
GraphFrames
“vertices” DataFrame
• 1 vertexper Row
• id: column with unique ID
“edges” DataFrame
• 1 edge per Row
• src, dst: columns using IDs from vertices.id
14
Extra columns store vertexor edge data
(a.k.a. attributes or properties).
id City State
“JFK” “New York” NY
“SEA” “Seattle” WA
src dst delay tripID
“JFK” “SEA” 45 1058923
“DFW” “SFO” -7 4100224
Demo:
Building a GraphFrame
15
16
Queries
Simple queries
Motif finding
Graph algorithms
19
Simple queries
SQL queries on vertices & edges
E.g., what trips are most likely to have significantdelays?
20
Graph queries
• Vertex degrees
• # edgesper vertex(incoming,outgoing,total)
• Triplets
• Join vertices and edgesto get (src, edge,dst)
21
Motif finding
24
IAD
JFK
LAX
SFO
SEA
DFW
Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
Motif finding
25
IAD
JFK
LAX
SFO
SEA
DFW
(b)
(a)Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
Motif finding
26
IAD
JFK
LAX
SFO
SEA
DFW
(b)
(a)
(c)
Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
Motif finding
27
IAD
JFK
LAX
SFO
SEA
DFW
(b)
(a)
(c)
Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
Motif finding
28
IAD
JFK
LAX
SFO
SEA
DFW
Search for structural
patterns within a graph.
val paths: DataFrame =
g.find(“(a)-[e1]->(b);
(b)-[e2]->(c);
!(c)-[]->(a)”)
(b)
(a)
(c)
Then filter using vertex &
edge data.
paths.filter(“e1.delay > 20”)
29
Graph algorithms
Find importantvertices
• PageRank
31
Find pathsbetweensets of vertices
• Breadth-first search (BFS)
• Shortest paths
Find groupsof vertices(components,
communities)
• Connected components
• Strongly connected components
• Label Propagation Algorithm(LPA)
Other
• Triangle counting
• SVDPlusPlus
32
Algorithm implementations
Mostly wrappers for GraphX
• PageRank
• Shortest paths
• Connected components
• Strongly connected components
• Label Propagation Algorithm (LPA)
• SVDPlusPlus
33
Some algorithms implemented
usingDataFrames
• Breadth-first search
• Triangle counting
Saving & loading graphs
Save & load the DataFrames.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)
g.vertices.write.parquet(...)
g.edges.write.parquet(...)
In the future...
• SQL data sources for graph formats
34
APIs: Scala, Java, Python
API available from all 3 languages
à First time GraphX functionality hasbeen available to
Java & Python users
2 missing items (WIP)
• Java-friendliness is currently in alpha.
• Python does not have aggregateMessages
(for implementing your own graph algorithms).
35
Outline
GraphFrames overview
GraphFrames vs. GraphX and other libraries
Details for power users
Roadmap and resources
36
2 types of graph libraries
37
Graph algorithms Graph queries
Standard & custom algorithms
Optimized for batch processing
Motif finding
Point queries &updates
GraphFrames: Both algorithms &queries (but notpoint updates)
GraphFrames vs. GraphX
38
GraphFrames GraphX
Builton DataFrames RDDs
Languages Scala, Java, Python Scala
Use cases Queries & algorithms Algorithms
Vertex IDs Any type (in Catalyst) Long
Vertex/edg
e attributes
Any number of
DataFrame columns
Any type (VD, ED)
Return
types
GraphFrame or
DataFrame
Graph[VD, ED], or
RDD[Long, VD]
GraphX compatibility
Simple conversionsbetweenGraphFrames& GraphX.
val g: GraphFrame = ...
// Convert GraphFrame à GraphX
val gx: Graph[Row, Row] = g.toGraphX
// Convert GraphX à GraphFrame
val g2: GraphFrame = GraphFrame.fromGraphX(gx)
39
Vertex & edgeattributes
are Rows in order to
handlenon-LongIDs
Wrapping existing GraphX code: See Belief Propagation example:
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/examples/BeliefPropagation.scala
Outline
GraphFrames overview
GraphFrames vs. GraphXand other libraries
Details for power users
Roadmap and resources
40
Scalability
Currentstatus
• DataFrame-based parts benefitfrom DataFrame scalability +
performance optimizations(Catalyst, Tungsten).
• GraphX wrappers are as fast as GraphX (+ conversion overhead).
WIP
• GraphX hasoptimizationswhich are not yet ported to GraphFrames.
• See nextslide…
41
WIP optimizations
Join elimination
• GraphFrame algorithms require lots
of joins.
• Not all joins are necessary
Solution:
• Vertex IDs serve as unique keys.
• Tracking keys allows Catalyst to
eliminate some joins.
42
For more info & benchmark results, see AnkurDave’s SSE 2016 talk.
https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/
Materializedviews
• Data locality for common usecases
• Message-passing algorithms often
need “triplet view” (src,edge, dst)
Solution:
• Materialize specific views
• Analogous to GraphX’s “replicated
vertex view”
Implementing new algorithms
43
Method 2: Messagepassing
aggregateMessages
• Same primitive as GraphX
• Specify messages & aggregation
using DataFrame expressions
Belief propagation example code
Method 1: DataFrame &
GraphFrame operations
Motif finding
• Series of DataFrame joins
Triangle count
• DataFrame ops + motif finding
BFS
• DataFrame joins & filters
Outline
GraphFrames overview
GraphFrames vs. GraphXand other libraries
Details for power users
Roadmap and resources
44
Current status
Published
• Open source (Apache 2.0) on Github
https://github.com/graphframes/graphframes
• Spark package http://spark-
packages.org/package/graphframes/graphframes
Compatible
• Spark 1.4, 1.5, 1.6
• Databricks Community Edition
Documented
• http://graphframes.github.io/
45
Roadmap
• MergeWIP speed optimizations
• Java API tests & examples
• Migrate more algorithms to DataFrame-based
implementations for greater scalability
• Getcommunity feedback!
46
Contribute
• Tracking issueson Github
• Thanks to those who have
already sent pull requests!
Resources for learning more
User guide + API docs http://graphframes.github.io/
• Quick-start
• Overview & examples for all algorithms
• Alsoavailable as executablenotebooks:
• Scala: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-scala.html
• Python: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-python.html
Blog posts
• Intro: https://databricks.com/blog/2016/03/03/introducing-graphframes.html
• Flight delay analysis: https://databricks.com/blog/2016/03/16/on-time-flight-performance-
with-spark-graphframes.html
47
48
Thank you!
Thanks to
• Denny Lee & Bill Chambers (demo)
• Tim Hunter, Xiangrui Meng, Ankur Dave &others (GraphFrames development)

More Related Content

What's hot

Inside open metadata—the deep dive
Inside open metadata—the deep diveInside open metadata—the deep dive
Inside open metadata—the deep dive
DataWorks Summit
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 

What's hot (20)

Finding Graph Isomorphisms In GraphX And GraphFrames
Finding Graph Isomorphisms In GraphX And GraphFramesFinding Graph Isomorphisms In GraphX And GraphFrames
Finding Graph Isomorphisms In GraphX And GraphFrames
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Graph based data models
Graph based data modelsGraph based data models
Graph based data models
 
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |...
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Fuzzy Matching on Apache Spark with Jennifer Shin
Fuzzy Matching on Apache Spark with Jennifer ShinFuzzy Matching on Apache Spark with Jennifer Shin
Fuzzy Matching on Apache Spark with Jennifer Shin
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
 
Inside open metadata—the deep dive
Inside open metadata—the deep diveInside open metadata—the deep dive
Inside open metadata—the deep dive
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 

Similar to GraphFrames: DataFrame-based graphs for Apache® Spark™

Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 

Similar to GraphFrames: DataFrame-based graphs for Apache® Spark™ (20)

Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
Joseph Bradley, Software Engineer, Databricks Inc. at MLconf SEA - 5/01/15
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Morpheus SQL and Cypher® in Apache® Spark - Big Data Meetup Munich
Morpheus SQL and Cypher® in Apache® Spark - Big Data Meetup MunichMorpheus SQL and Cypher® in Apache® Spark - Big Data Meetup Munich
Morpheus SQL and Cypher® in Apache® Spark - Big Data Meetup Munich
 
Morpheus - SQL and Cypher in Apache Spark
Morpheus - SQL and Cypher in Apache SparkMorpheus - SQL and Cypher in Apache Spark
Morpheus - SQL and Cypher in Apache Spark
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 

GraphFrames: DataFrame-based graphs for Apache® Spark™

  • 1. GraphFrames DataFrame-based graphs for Apache® Spark™ Joseph K. Bradley 4/14/2016
  • 2. About the speaker: Joseph Bradley Joseph Bradley is a Software Engineerand Apache Spark PMC member working on MLlib at Databricks. Previously,he was a postdoc at UC Berkeley after receiving hisPh.D. in Machine Learning from Carnegie Mellon U.in 2013.His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanismsfor peergrading in MOOCs. 2
  • 3. About the moderator: Denny Lee Denny Lee is a Technology Evangelistwith Databricks; he is a hands-on data sciencesengineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premisesand cloud. Prior to joining Databricks, Denny worked as a SeniorDirector of Data SciencesEngineering at Concur and was part of the incubation teamthat builtHadoop on Windowsand Azure (currently known as HDInsight). 3
  • 4. We are Databricks, the company behind Apache Spark Founded by the creators of Apache Spark in 2013 Share of Spark code contributed by Databricks in 2014 75% 4 Data Value Created Databricks on top of Spark to make big data simple.
  • 5. … Apache Spark Engine Spark Core Spark Streaming Spark SQL MLlib GraphX Unified engineacross diverse workloads & environments Scale out, fault tolerant Python, Java, Scala, and R APIs Standard libraries
  • 6.
  • 7. NOTABL E USERS THAT PRESENTED AT SPARK SUMMIT 2 0 1 5 SAN F RANCISCO Source: Slide5ofSparkCommunityUpdate
  • 8. Outline GraphFrames overview GraphFrames vs. GraphXand other libraries Details for power users Roadmap and resources 8
  • 9. Outline GraphFrames overview GraphFrames vs. GraphXand other libraries Details for power users Roadmap and resources 9
  • 10. Graphs 10 vertex edge id City State “JFK” “New York” NY Example: airports & flights between them JFK IAD LAX SFO SEA DFW src dst delay tripID “JFK ” “SEA” 45 1058923
  • 11. Apache Spark’s GraphX library Overview • General-purpose graph processinglibrary • Optimized for fast distributedcomputing • Library of algorithms: PageRank, Connected Components,etc. 11 Challenges • No Java, PythonAPIs • Lower-levelRDD-based API (vs.DataFrames) • Cannot use recent Spark optimizations:Catalyst query optimizer,Tungsten memory management
  • 12. Enter GraphFrames Goal: DataFrame-based graphson ApacheSpark • Simplify interactive queries • Support motif-findingforstructural pattern search • Benefitfrom DataFrame optimizations Collaboration between Databricks, UC Berkeley& MIT + Now with community contributors! 12
  • 13. Graphs 13 vertex edge id City State “JFK” “New York” NY Example: airports & flights between them JFK IAD LAX SFO SEA DFW src dst delay tripID “JFK ” “SEA” 45 1058923
  • 14. GraphFrames “vertices” DataFrame • 1 vertexper Row • id: column with unique ID “edges” DataFrame • 1 edge per Row • src, dst: columns using IDs from vertices.id 14 Extra columns store vertexor edge data (a.k.a. attributes or properties). id City State “JFK” “New York” NY “SEA” “Seattle” WA src dst delay tripID “JFK” “SEA” 45 1058923 “DFW” “SFO” -7 4100224
  • 16. 16
  • 17.
  • 18.
  • 20. Simple queries SQL queries on vertices & edges E.g., what trips are most likely to have significantdelays? 20 Graph queries • Vertex degrees • # edgesper vertex(incoming,outgoing,total) • Triplets • Join vertices and edgesto get (src, edge,dst)
  • 21. 21
  • 22.
  • 23.
  • 24. Motif finding 24 IAD JFK LAX SFO SEA DFW Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  • 25. Motif finding 25 IAD JFK LAX SFO SEA DFW (b) (a)Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  • 26. Motif finding 26 IAD JFK LAX SFO SEA DFW (b) (a) (c) Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  • 27. Motif finding 27 IAD JFK LAX SFO SEA DFW (b) (a) (c) Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”)
  • 28. Motif finding 28 IAD JFK LAX SFO SEA DFW Search for structural patterns within a graph. val paths: DataFrame = g.find(“(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)”) (b) (a) (c) Then filter using vertex & edge data. paths.filter(“e1.delay > 20”)
  • 29. 29
  • 30.
  • 31. Graph algorithms Find importantvertices • PageRank 31 Find pathsbetweensets of vertices • Breadth-first search (BFS) • Shortest paths Find groupsof vertices(components, communities) • Connected components • Strongly connected components • Label Propagation Algorithm(LPA) Other • Triangle counting • SVDPlusPlus
  • 32. 32
  • 33. Algorithm implementations Mostly wrappers for GraphX • PageRank • Shortest paths • Connected components • Strongly connected components • Label Propagation Algorithm (LPA) • SVDPlusPlus 33 Some algorithms implemented usingDataFrames • Breadth-first search • Triangle counting
  • 34. Saving & loading graphs Save & load the DataFrames. vertices = sqlContext.read.parquet(...) edges = sqlContext.read.parquet(...) g = GraphFrame(vertices, edges) g.vertices.write.parquet(...) g.edges.write.parquet(...) In the future... • SQL data sources for graph formats 34
  • 35. APIs: Scala, Java, Python API available from all 3 languages à First time GraphX functionality hasbeen available to Java & Python users 2 missing items (WIP) • Java-friendliness is currently in alpha. • Python does not have aggregateMessages (for implementing your own graph algorithms). 35
  • 36. Outline GraphFrames overview GraphFrames vs. GraphX and other libraries Details for power users Roadmap and resources 36
  • 37. 2 types of graph libraries 37 Graph algorithms Graph queries Standard & custom algorithms Optimized for batch processing Motif finding Point queries &updates GraphFrames: Both algorithms &queries (but notpoint updates)
  • 38. GraphFrames vs. GraphX 38 GraphFrames GraphX Builton DataFrames RDDs Languages Scala, Java, Python Scala Use cases Queries & algorithms Algorithms Vertex IDs Any type (in Catalyst) Long Vertex/edg e attributes Any number of DataFrame columns Any type (VD, ED) Return types GraphFrame or DataFrame Graph[VD, ED], or RDD[Long, VD]
  • 39. GraphX compatibility Simple conversionsbetweenGraphFrames& GraphX. val g: GraphFrame = ... // Convert GraphFrame à GraphX val gx: Graph[Row, Row] = g.toGraphX // Convert GraphX à GraphFrame val g2: GraphFrame = GraphFrame.fromGraphX(gx) 39 Vertex & edgeattributes are Rows in order to handlenon-LongIDs Wrapping existing GraphX code: See Belief Propagation example: https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/examples/BeliefPropagation.scala
  • 40. Outline GraphFrames overview GraphFrames vs. GraphXand other libraries Details for power users Roadmap and resources 40
  • 41. Scalability Currentstatus • DataFrame-based parts benefitfrom DataFrame scalability + performance optimizations(Catalyst, Tungsten). • GraphX wrappers are as fast as GraphX (+ conversion overhead). WIP • GraphX hasoptimizationswhich are not yet ported to GraphFrames. • See nextslide… 41
  • 42. WIP optimizations Join elimination • GraphFrame algorithms require lots of joins. • Not all joins are necessary Solution: • Vertex IDs serve as unique keys. • Tracking keys allows Catalyst to eliminate some joins. 42 For more info & benchmark results, see AnkurDave’s SSE 2016 talk. https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/ Materializedviews • Data locality for common usecases • Message-passing algorithms often need “triplet view” (src,edge, dst) Solution: • Materialize specific views • Analogous to GraphX’s “replicated vertex view”
  • 43. Implementing new algorithms 43 Method 2: Messagepassing aggregateMessages • Same primitive as GraphX • Specify messages & aggregation using DataFrame expressions Belief propagation example code Method 1: DataFrame & GraphFrame operations Motif finding • Series of DataFrame joins Triangle count • DataFrame ops + motif finding BFS • DataFrame joins & filters
  • 44. Outline GraphFrames overview GraphFrames vs. GraphXand other libraries Details for power users Roadmap and resources 44
  • 45. Current status Published • Open source (Apache 2.0) on Github https://github.com/graphframes/graphframes • Spark package http://spark- packages.org/package/graphframes/graphframes Compatible • Spark 1.4, 1.5, 1.6 • Databricks Community Edition Documented • http://graphframes.github.io/ 45
  • 46. Roadmap • MergeWIP speed optimizations • Java API tests & examples • Migrate more algorithms to DataFrame-based implementations for greater scalability • Getcommunity feedback! 46 Contribute • Tracking issueson Github • Thanks to those who have already sent pull requests!
  • 47. Resources for learning more User guide + API docs http://graphframes.github.io/ • Quick-start • Overview & examples for all algorithms • Alsoavailable as executablenotebooks: • Scala: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-scala.html • Python: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-python.html Blog posts • Intro: https://databricks.com/blog/2016/03/03/introducing-graphframes.html • Flight delay analysis: https://databricks.com/blog/2016/03/16/on-time-flight-performance- with-spark-graphframes.html 47
  • 48. 48
  • 49. Thank you! Thanks to • Denny Lee & Bill Chambers (demo) • Tim Hunter, Xiangrui Meng, Ankur Dave &others (GraphFrames development)