WIFI SSID and Password for Spark Summit

WIFI SSID:SparkAISummit | Password: UnifiedAnalytics

https://db-engines.com/en/ranking_categories

Node
● Represents an entity within the graph
● Can have labels
Relationship
● Connects a start node with an end node
● Has one type
Property
● Describes a node/relationship: e.g. name, age, weight etc
● Key-value pair: String key; typed value (string, number, bool, list, ...)
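
A minimal sketch of this model in plain Scala, assuming nothing beyond the three concepts above. All type names and the sample data are hypothetical and do not come from Morpheus or Neo4j.

```scala
// Hypothetical minimal model of a property graph; names are illustrative only.
object PropertyGraphModel {
  // A property is a key-value pair: String key, typed value.
  type Properties = Map[String, Any]

  // A node represents an entity and can carry any number of labels.
  final case class Node(id: Long, labels: Set[String], properties: Properties = Map.empty)

  // A relationship connects a start node with an end node and has exactly one type.
  final case class Relationship(
      id: Long,
      startId: Long,
      endId: Long,
      relType: String,
      properties: Properties = Map.empty
  )

  final case class Graph(nodes: Seq[Node], relationships: Seq[Relationship]) {
    // All nodes carrying the given label.
    def nodesWithLabel(label: String): Seq[Node] = nodes.filter(_.labels.contains(label))
  }

  // Made-up sample data.
  val alice  = Node(0L, Set("Person", "Customer"), Map("name" -> "Alice", "age" -> 42))
  val book   = Node(1L, Set("Product"), Map("name" -> "Graph Databases"))
  val bought = Relationship(0L, alice.id, book.id, "BOUGHT")
  val graph  = Graph(Seq(alice, book), Seq(bought))
}
```

Labels are a set on the node, while the relationship type is a single string, which mirrors "can have labels" versus "has one type".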

Property graph view of data mirrors conceptual view
○ Entities and relationships, with attributes
○ Nodes and relationships, with properties
Graph queries are concise and visual (ASCII Art)
MATCH (c:Customer)-[:BOUGHT]-(p:Product)
RETURN c.id, p.id
Network algorithms run over graphs
→ Graphs enhance data engineering and science

                               Tables                           Graphs
Transactional                  PostgreSQL, Oracle, SQL Server   Neo4j
Data Integration & Analytics   Spark SQL                        Morpheus

Spark is an immutable data processing engine
○ Spark graphs are compositions of tables (DFs)
○ Spark graphs can be transformed and combined
○ Functions (including queries) over multiple graphs
○ Cypher query plans mapped to Catalyst
Neo4j is a native transactional CRUD database
○ Neo4j graphs use a native graph data representation
○ Neo4j has optimized in-process MT graph algos
○ Morpheus helps move data in and out of Neo4j

Graphs and tables are both useful data models
○ Finding paths and subgraphs, and transforming graphs
○ Viewing, aggregating and ordering values
The Morpheus project parallels Spark SQL
○ PropertyGraph type (composed of DataFrames)
○ Catalog of graph data sources, named graphs, views,
○ Cypher query language
A CypherSession adds graphs to a SparkSession

● Data integration
○ Integrate (non-)graphy data from multiple, heterogeneous
data sources into one or more property graphs
● Distributed Cypher execution
○ OLAP-style graph analytics
● Data science
○ Integration with other Spark libraries
○ Feature extraction using Neo4j Graph Algorithms

● Pathfinding & Search
○ Finds optimal paths or evaluates route availability and quality
● Centrality / Importance
○ Determines the importance of distinct nodes in the network
● Community Detection
○ Detects group clustering or partition options
● Similarity
○ Evaluates how alike nodes are
● Link Prediction
○ Estimates the likelihood of nodes forming a future relationship

[Figure: Morpheus sources. Tables (Hive, DF, JDBC) are composed as DataFrames into a property graph; a subgraph can also come from an FS snapshot.]

[Figure: Cypher queries in Morpheus. A Cypher query over a property graph returns either a table result (DataFrame) or a property graph result; a DataFrame can also be supplied as a driving table.]

[Figure: graph algorithms and analysis toolsets operate on DataFrames and property graphs.]

[Figure: Morpheus store. A property graph or subgraph is stored as an FS snapshot.]

Cypher 9 is the latest full version of openCypher
○ Implemented in Neo4j 3.5
○ Includes date/time types and functions
○ Implemented in whole/part by six other vendors
○ Several other partial and research implementations
○ Cypher for Gremlin is another openCypher project

Cypher is a full CRUD language ← OLTP database
○ RETURNs only tabular results: not composable
○ Results can include graph elements (paths,
relationships, nodes) or property values
Morpheus implements most of read-only Cypher
○ No MERGE or DELETE
○ Spark immutable data + transformations

Cypher 10 proposes Multiple Graph features
○ Multiple Graph CIP: https://git.io/fjmrx
Allows for Cypher Query composition
○ Similar to chaining transformations on DataFrames
Support Graph Catalog for managing Graphs
○ Analogous to Spark SQL catalog
Query support for Graph Construction

Input: a property graph
Output: a table
FROM GRAPH socialNetwork
MATCH ({name: 'Dan'})-[:FRIEND*2]->(foaf)
RETURN toUpper(foaf.name) AS name
ORDER BY name DESC
Language features available in Morpheus

Input: a property graph
Output: a property graph
MATCH (p:Person)-[:FRIEND*2]->(foaf)
WHERE NOT (p)-[:FRIEND]->(foaf)
CONSTRUCT
CREATE (p)-[:POSSIBLE_FRIEND]->(foaf)
RETURN GRAPH
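
The construction above can be approximated over plain Scala sets. This is a sketch under stated assumptions: FRIEND edges are modelled as directed (from, to) pairs with no self-loops or parallel edges, the names and data are made up, and only the new POSSIBLE_FRIEND pairs are returned rather than a full graph.

```scala
object PossibleFriends {
  // Directed FRIEND edges as (from, to) pairs. Hypothetical sample data.
  type Edge = (String, String)

  // Pairs (p, foaf) connected by a two-hop FRIEND path but by no direct FRIEND edge.
  def possibleFriends(friends: Set[Edge]): Set[Edge] = {
    // Self-join of the edge set on the middle node: (p)-[:FRIEND]->(m)-[:FRIEND]->(foaf)
    val twoHop = for {
      (p, mid1)    <- friends
      (mid2, foaf) <- friends
      if mid1 == mid2
    } yield (p, foaf)
    // WHERE NOT (p)-[:FRIEND]->(foaf)
    twoHop -- friends
  }

  val friends: Set[Edge] =
    Set("Dan" -> "Ann", "Ann" -> "Bob", "Dan" -> "Bob", "Bob" -> "Cy")

  val suggested: Set[Edge] = possibleFriends(friends)
}
```

Here Dan already has Bob as a friend, so that two-hop pair is filtered out and only the pairs ending at Cy remain.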

Input: property graphs
MATCH (p:Person)
FROM GRAPH products
MATCH (c:Customer)
WHERE p.email = c.email
CONSTRUCT ON socialNetwork, products
CREATE (p)-[:IS]->(c)
RETURN GRAPH

CATALOG CREATE VIEW youngFriends($inGraph){
FROM GRAPH $inGraph
MATCH (p1:Person)-[r]->(p2:Person)
WHERE p1.age < 25 AND p2.age < 25
CONSTRUCT
CREATE (p1)-[COPY OF r]->(p2)
RETURN GRAPH
}

Output: table or graph
FROM youngFriends(socialNetwork)
MATCH (p:Person)-[r]->(o)
RETURN p, r, o
// and views over views
FROM youngFriends(europe(socialNetwork))
MATCH ...

Morpheus
Query Engine
Property Graph Data Sources
Property Graph Catalog
Scala API
SQL JDBC

MATCH (c:Captain)-[:COMMANDS]->(s:Ship)
WHERE c.name = 'Morpheus'
RETURN c.name, s.name
openCypher Frontend
● Parsing, Rewriting, Normalization
● Semantic Analysis (Scoping, Typing, etc.)
Morpheus
● Data Import and Export
● Schema and Type handling
● Query translation to Spark operations
Intermediate Language
● Backend Agnostic Query Representation
● Conversion and typing of Frontend expressions
Logical Planning
● Translation into Logical Operators
● Basic Logical Optimization
Relational Planning
● Translation into Relational Operations on abstract tables
● Column layout computation
Spark Backend
● Spark-specific table implementation
Spark SQL
● Rule- and Cost-based query optimization via Catalyst
Spark Core
● Distributed execution

● In Morpheus, PropertyGraphs are represented by
○ Node Tables and Relationship Tables
● Tables are represented by DataFrames
○ Require a fixed schema
● Property Graphs have a Graph Type
○ Node and relationship types that occur in the graph
○ Node and relationship properties and their data type
Property Graph
Node Tables
Rel. Tables
Graph Type

Property Graph
(:Captain:Person {name: Morpheus})-[:COMMANDS]->(:Ship {name: Nebuchadnezzar})
Node Table :Captain:Person
id  name
0   Morpheus
Node Table :Ship
id  name
1   Nebuchadnezzar
Relationship Table :COMMANDS
id  source  target
0   0       1
Graph Type {
  :Captain:Person (
    name: STRING
  ),
  :Ship (
    name: STRING
  ),
  :COMMANDS
}

MATCH (c:Captain)-[:COMMANDS]->(s:Ship)
WHERE c.name = 'Morpheus'
RETURN c.name, s.name
[Figure: Morpheus relational planning. The query becomes a tree of projections (π) and joins (⋈) over the tables of the Property Graph.]
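
To make the idea of relational planning concrete, the Captain/Ship query can be evaluated by hand as a selection, two joins and a projection over node and relationship tables. This sketch uses plain Scala collections as stand-ins for DataFrames; the row types are hypothetical and this is not how Morpheus itself is implemented.

```scala
object PatternAsJoins {
  // Node and relationship tables as plain rows (stand-ins for DataFrames).
  final case class NodeRow(id: Long, name: String)
  final case class RelRow(id: Long, source: Long, target: Long)

  val captains = Seq(NodeRow(0L, "Morpheus"))       // node table :Captain:Person
  val ships    = Seq(NodeRow(1L, "Nebuchadnezzar")) // node table :Ship
  val commands = Seq(RelRow(0L, 0L, 1L))            // relationship table :COMMANDS

  // MATCH (c:Captain)-[:COMMANDS]->(s:Ship)
  // WHERE c.name = 'Morpheus'
  // RETURN c.name, s.name
  val result: Seq[(String, String)] = for {
    c <- captains if c.name == "Morpheus" // selection on the node table
    r <- commands if r.source == c.id     // join: captain.id = rel.source
    s <- ships if s.id == r.target        // join: rel.target = ship.id
  } yield (c.name, s.name)                // projection of the returned columns
}
```

Each pattern element maps to one table, and each hop in the pattern maps to one join on the id, source and target columns.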

Part 1: From JSON to Graph
Create persistent Property Graph from raw Yelp dataset
● Read Yelp Data from JSON into DataFrames
● Create Property Graph from DataFrames
● Store Property Graph using Parquet
Part 2: A library of Graphs
Create a library of graph projections
● Read Property Graph from Parquet
● Create subgraph for a specific city
● Project and persist city subgraph
Part 3: Federated queries
Integrate reviews with social network data
● Define Graph Type and Mapping with Graph DDL
● Load data from Hive and H2
● Run analytical query on the integrated graph
Part 4: Neo4j Integration I
Find trending businesses
● Load graph projections from library
● Write graphs to Neo4j and run PageRank
● Combine graphs in Morpheus and select trending businesses
Part 5: Neo4j Integration II
Recommend businesses to users
● Load graph projections from library
● Write graphs to Neo4j, run Louvain + Jaccard
● Run analytical query in Morpheus to find recommendations
https://git.io/fjZ2b

● Yelp is a search service based on crowd-sourced
reviews about local businesses
● The Yelp Open Dataset is part of the Yelp Dataset
Challenge
○ Yelp's effort to encourage researchers to explore the
dataset
○ ~150K businesses, 10M users, 5M reviews, 35M
friendships
https://www.yelp.com
https://www.yelp.com/dataset
https://www.yelp.com/dataset/challenge

:Business
  name : ACME
  address : 123 ACME Rd.
  city : San Jose
  state : CA
:User
  name : Alice
  since : 2013
  elite : [2014, 2016]
:User
  name : Bob
  since : 2014
  elite : null
:REVIEWS
  stars : 5
  date : 2014-02-03
:REVIEWS
  stars : 4
  date : 2014-08-03
business.json, user.json, review.json
→ Create Node and Relationship Tables
→ Create Property Graph
→ Store Property Graph
https://git.io/fjZ2N

// (:User)
val userDataFrame = spark.read.json(...).select(...)
// Map DataFrame columns to a node table: "id" identifies the node,
// every row gets the label User, the listed columns become properties.
val userNodeTable = CAPSEntityTable.create(
  NodeMappingBuilder.on("id")
    .withImpliedLabel("User")
    .withPropertyKey("name")
    .withPropertyKey("yelping_since")
    .withPropertyKey("elite")
    .build,
  userDataFrame)
id  name   yelping_since  elite
0   Alice  2013           [2014, 2016]
1   Bob    2014           null

● Property Graphs are managed within a catalog
Cypher Session
Property Graph Data Source <namespace>
Property Graph <name>
QualifiedGraphName = <namespace>.<name>

● API to operate with the query engine and the catalog
trait CypherSession {
  def cypher(
    query: String,
    parameters: CypherMap = CypherMap.empty,
    drivingTable: Option[CypherRecords] = None
  ): Result
  def catalog: PropertyGraphCatalog
}

● API to manage multiple Property Graphs
● Catalog functions can be executed via Cypher or Scala API
trait PropertyGraphCatalog {
  def register(namespace: Namespace, dataSource: PropertyGraphDataSource): Unit
  def store(qualifiedGraphName: QualifiedGraphName, graph: PropertyGraph): Unit
  def graph(qualifiedGraphName: QualifiedGraphName): PropertyGraph
  def drop(qualifiedGraphName: QualifiedGraphName): Unit
  // additional methods for managing views, listing namespaces and graphs
}

● API for loading and saving property graphs
trait PropertyGraphDataSource {
  def hasGraph(name: GraphName): Boolean
  def graph(name: GraphName): PropertyGraph
  def schema(name: GraphName): Option[Schema]
  // additional methods for storing, deleting, listing graphs
}
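
How a catalog might delegate to data sources can be sketched with an in-memory toy. This is an illustration only, assuming simplified String names instead of the Namespace, GraphName and QualifiedGraphName types; the `DataSource`, `InMemoryDataSource` and `Catalog` types below are hypothetical and are not the Morpheus classes.

```scala
import scala.collection.mutable

object CatalogSketch {
  // Placeholder for a real property graph.
  final case class PropertyGraph(description: String)

  // Simplified data source: loads and stores graphs by name.
  trait DataSource {
    def hasGraph(name: String): Boolean
    def graph(name: String): PropertyGraph
    def store(name: String, graph: PropertyGraph): Unit
    def delete(name: String): Unit
  }

  final class InMemoryDataSource extends DataSource {
    private val graphs = mutable.Map.empty[String, PropertyGraph]
    def hasGraph(name: String): Boolean = graphs.contains(name)
    def graph(name: String): PropertyGraph = graphs(name)
    def store(name: String, graph: PropertyGraph): Unit = graphs.update(name, graph)
    def delete(name: String): Unit = { graphs.remove(name); () }
  }

  // Catalog: resolves "<namespace>.<name>" to a data source and a graph name.
  final class Catalog {
    private val sources = mutable.Map.empty[String, DataSource]

    def register(namespace: String, source: DataSource): Unit =
      sources.update(namespace, source)

    // Split at the first dot; fails if the name is unqualified or the namespace is unknown.
    private def resolve(qualified: String): (DataSource, String) = {
      val Array(namespace, name) = qualified.split("\\.", 2)
      (sources(namespace), name)
    }

    def store(qualified: String, g: PropertyGraph): Unit = {
      val (ds, n) = resolve(qualified); ds.store(n, g)
    }
    def graph(qualified: String): PropertyGraph = {
      val (ds, n) = resolve(qualified); ds.graph(n)
    }
    def drop(qualified: String): Unit = {
      val (ds, n) = resolve(qualified); ds.delete(n)
    }
    def exists(qualified: String): Boolean = {
      val (ds, n) = resolve(qualified); ds.hasGraph(n)
    }
  }
}
```

After `catalog.register("social-net", new CatalogSketch.InMemoryDataSource)`, a call such as `catalog.store("social-net.US", g)` is routed to that data source under the graph name `US`.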

PGDS                                              Multiple graphs  Read graphs  Write graphs
File-based (Parquet, ORC, CSV; HDFS, local, S3)   Yes              Yes          Yes
SQL (Hive, JDBC)                                  Yes              Yes          No
Neo4j Bolt                                        Yes              Yes          Yes
Neo4j Bulk Import                                 No               No           Yes

Cypher Session
“social-net” (Neo4j PGDS)
“US” (Property Graph)
FROM social-net.US
MATCH (p:Person)
RETURN p

Cypher Session
“US”
“EU”
“products” (SQL PGDS)
“2018”
“2017”
FROM social-net.US
MATCH (p:Person)
FROM products.2018
MATCH (c:Customer)
RETURN p, c

Cypher Session
“US”
“EU”
“2018”
“2017”
CATALOG CREATE GRAPH social-net.US_new {
FROM social-net.US
MATCH (p:Person)
FROM products.2018
MATCH (c:Customer)
CONSTRUCT ON social-net.US
CREATE (p)-[:SAME_AS]->(c)
RETURN GRAPH
}

Cypher Session
“US”
“EU”
“2018”
“2017”
“US_new”

Cypher Session
“US”
“EU”
...
CATALOG CREATE VIEW youngPeople($sn) {
FROM $sn
MATCH (p:Person)-[r]->(n)
WHERE p.age < 21
CONSTRUCT
CREATE (p)-[COPY OF r]->(n)
RETURN GRAPH
}
FROM youngPeople(social-net.US)
MATCH (p:Person)
RETURN p
“youngPeople”
Views

2015 - 2018
https://git.io/fjZ25
Boulder City
(:User)-[:CO_REVIEWS]->(:User)
(:User)-[:REVIEWS]->(:Business)
Construct graphs for each year
Extract Yelp subgraph for specific city
(:Business)-[:CO_REVIEWED]->(:Business)

SQL Tables → Graph DDL → Property Graphs
● Spark SQL Data Sources: JDBC, Hive, Oracle, SQL Server, ORC, Parquet, ... (Table/View)
● Graph DDL
○ Graph Type: element types, node types, relationship types
○ Graph Instance: table mappings
● SQL Property Graph Data Source
○ Property Graph: Node Tables, Rel. Tables, Graph Type

:Business
  name : ACME
  address : 123 ACME Rd.
  city : San Jose
  state : CA
:User
  name : Alice
  since : 2013
  elite : [2014, 2016]
  email : alice@yelp.com
:User
  name : Bob
  since : 2014
  elite : null
  email : bob@yelp.com
:REVIEWS
  stars : 5
  date : 2014-02-03
:REVIEWS
  stars : 4
  date : 2014-08-03

:User
  email : alice@yelp.com
:User
  email : bob@yelp.com
:FRIEND

Yelp Reviews
Yelp Book
Graph DDL
+
SQL PGDS
(:User)-[:FRIEND]->(:User)
https://git.io/fjZ2p

CREATE GRAPH TYPE yelp (
-- Element types (concepts used to describe a graph)
User ( name STRING, since DATE ),
Business ( name STRING, city STRING ),
REVIEWS ( stars INTEGER, date LOCALDATETIME ),
FRIEND,
-- Node types
(User),
(Business),
-- Relationship types
(User)-[REVIEWS]->(Business),
(User)-[FRIEND]->(User)
)

CREATE GRAPH yelp_and_yelpBook OF yelp (
-- Node type mappings
(User) FROM HIVE.yelp.user,
(Business) FROM HIVE.yelp.business,
-- Relationship type mappings
(User)-[REVIEWS]->(Business) FROM HIVE.yelp.review e
START NODES (User) FROM HIVE.yelp.user n JOIN e.user_email = n.email
END NODES (Business) FROM HIVE.yelp.business n JOIN e.business_id = n.business_id,
(User)-[FRIEND]->(User) FROM H2.yelpbook.friend e
START NODES (User) FROM HIVE.yelp.user n JOIN e.user1_email = n.email
END NODES (User) FROM HIVE.yelp.user n JOIN e.user2_email = n.email
)
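
The START NODES and END NODES clauses of the FRIEND mapping amount to two joins on email. A rough sketch of that resolution in plain Scala follows, assuming made-up row types and sample rows as stand-ins for the Hive and H2 tables, and assuming that friend rows without a matching user are dropped (inner-join behaviour, which the slides do not state).

```scala
object FriendMapping {
  final case class UserRow(id: Long, name: String, email: String)    // stand-in for HIVE.yelp.user
  final case class FriendRow(user1Email: String, user2Email: String) // stand-in for H2.yelpbook.friend
  final case class FriendRel(source: Long, target: Long)

  // Resolve both endpoints of each friend row by joining on email.
  def friendRels(users: Seq[UserRow], friends: Seq[FriendRow]): Seq[FriendRel] = {
    val idByEmail = users.map(u => u.email -> u.id).toMap
    friends.flatMap { f =>
      for {
        start <- idByEmail.get(f.user1Email) // START NODES: e.user1_email = n.email
        end   <- idByEmail.get(f.user2Email) // END NODES:   e.user2_email = n.email
      } yield FriendRel(start, end)
    }
  }

  // Made-up sample rows; carol has no user row, so her friendship is dropped.
  val users = Seq(
    UserRow(0L, "Alice", "alice@yelp.com"),
    UserRow(1L, "Bob", "bob@yelp.com")
  )
  val friends = Seq(
    FriendRow("alice@yelp.com", "bob@yelp.com"),
    FriendRow("bob@yelp.com", "carol@example.com")
  )
  val rels: Seq[FriendRel] = friendRels(users, friends)
}
```

The same user table serves as both the start and the end side, which is why the DDL joins it twice under different conditions.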

● Morpheus and Neo4j
Graph Algorithms
● Spark Graph SPIP
sneak peek
● SQL/Cypher/GQL
https://theoatmeal.com/comics/sneak_peek

Neo4j
Native Graph
Database
Analytics
Integrations
Cypher Query
Language
Wide Range of
APOC Procedures
Native
Graph Algorithms

Pathfinding & Search
• Parallel Breadth First Search*
• Parallel Depth First Search
• Shortest Path*
• Single-Source Shortest Path
• All Pairs Shortest Path
• Minimum Spanning Tree
• A* Shortest Path
• Yen’s K Shortest Path
• K-Spanning Tree (MST)
• Random Walk
Centrality / Importance
• Degree Centrality
• Closeness Centrality
• CC Variations: Harmonic, Dangalchev, Wasserman & Faust
• Betweenness Centrality
• Approximate Betweenness Centrality
• PageRank*
• Personalized PageRank
• ArticleRank
• Eigenvector Centrality
Community Detection
• Triangle Count*
• Clustering Coefficients
• Connected Components (Union Find)*
• Strongly Connected Components*
• Label Propagation*
• Louvain Modularity – 1 Step & Multi-Step
• Balanced Triad (identification)
Similarity
• Euclidean Distance
• Cosine Similarity
• Jaccard Similarity
• Overlap Similarity
• Pearson Similarity
Link Prediction
• Adamic Adar
• Common Neighbors
• Preferential Attachment
• Resource Allocations
• Same Community
• Total Neighbors
* Available in GraphFrames
neo4j.com/docs/graph-algorithms/current/

Free O’Reilly Book
neo4j.com/graph-algorithms-book
• Spark & Neo4j Examples
• Machine Learning Chapter

● Use when
○ You are looking for broad influence over a network
○ Many domain-specific variations exist for differing analyses, e.g.
Personalized PageRank for personalized recommendations
● Examples:
○ Twitter Recommendations
○ Fraud Detection

(:Business)-[:CO_REVIEWED]->(:Business) graphs for 2017 and 2018
→ call algo.pagerank on each graph
→ ⋈ join the 2017 and 2018 results
→ trendRank = pageRank_2018 - pageRank_2017
https://git.io/fjZ2j
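
The trendRank step can be illustrated with plain Scala maps. This is a sketch with made-up PageRank values and business ids; it assumes the scores were already computed (for example by the PageRank call in Neo4j) and only shows the join and the subtraction.

```scala
object TrendRank {
  // PageRank per business id for the two yearly graphs (made-up values).
  val pageRank2017: Map[String, Double] = Map("b1" -> 0.50, "b2" -> 1.25, "b3" -> 0.75)
  val pageRank2018: Map[String, Double] = Map("b1" -> 1.50, "b2" -> 1.00, "b3" -> 0.75)

  // Inner join on business id, then trendRank = pageRank_2018 - pageRank_2017.
  def trendRank(prev: Map[String, Double], curr: Map[String, Double]): Map[String, Double] =
    curr.collect {
      case (id, currRank) if prev.contains(id) => id -> (currRank - prev(id))
    }

  // Businesses ordered by descending trend.
  val trending: Seq[(String, Double)] =
    trendRank(pageRank2017, pageRank2018).toSeq.sortBy { case (_, trend) => -trend }
}
```

A business present in only one of the two years has no trendRank in this sketch; whether the workshop code treats such businesses the same way is not stated in the slides.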

● Use when
○ Community Detection in large networks
○ Uncover hierarchical structures in data
● Examples
○ Money Laundering
○ Protein-Protein-Interactions

● Use when
○ Computing pair-wise similarities
○ Accommodates vectors of different lengths
● Examples
○ Recommendations
○ Disambiguation

2017
→ call algo.louvain: compute communities based on co-reviews
→ for each community, call algo.jaccard: compute similarity based on overlapping reviewed businesses (:IS_SIMILAR)
→ Recommend businesses similar users have reviewed
https://git.io/fjZaU
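
For reference, Jaccard similarity between two users is the size of the intersection of their reviewed businesses divided by the size of the union. The sketch below is a plain Scala version with made-up data; it is not the Neo4j procedure, and returning 0.0 for two empty sets is a convention chosen here.

```scala
object JaccardSketch {
  // |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty.
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else a.intersect(b).size.toDouble / a.union(b).size

  // Businesses reviewed per user (made-up sample data).
  val reviewed: Map[String, Set[String]] = Map(
    "Alice" -> Set("b1", "b2", "b3"),
    "Bob"   -> Set("b2", "b3", "b4"),
    "Carol" -> Set("b5")
  )

  // Alice and Bob share 2 of 4 distinct businesses.
  val aliceBob: Double = jaccard(reviewed("Alice"), reviewed("Bob"))
  // Alice and Carol share nothing.
  val aliceCarol: Double = jaccard(reviewed("Alice"), reviewed("Carol"))
}
```

Because the measure works on sets, users with different numbers of reviews can be compared directly.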

● SPARK-25994 Spark Graph for Apache Spark 3.0
○ Property Graphs, Cypher Queries, and Algorithms
● Defines a Cypher-compatible Property Graph
type based on DataFrames
● Replaces GraphFrames querying with Cypher
● Reimplements GraphFrames/GraphX algos on
the Property Graph type

● “Spark Cypher”
○ Run a Cypher 9 query on a Property Graph returning a
tabular result
● Migrate GraphFrames to Spark Graph
● Implementation is based on Spark SQL
○ Property Graphs are composed of one or more DFs
● Provide Scala, Python and Java APIs

● Addresses the Cypher Property Graph Model
○ Does not deal with variants of that model (e.g. RDF)
● No Cypher 10 multiple graph features
○ API is flexible to support this in future iterations
● No Property Graph Catalog
○ Also no Property Graph Data Sources

[SPARK-27299][GRAPH][WIP] Spark Graph API
design proposal (GraphExamplesSuite.scala)
test("create PropertyGraph from Node- and RelationshipFrames") {
  val nodeData: DataFrame = spark.createDataFrame(Seq(0 -> "Alice", 1 -> "Bob")).toDF("id", "name")
  val relationshipData: DataFrame = spark.createDataFrame(Seq((0, 0, 1))).toDF("id", "source", "target")
  val nodeFrame: NodeFrame = NodeFrame(nodeData, "id", Set("Person"))
  val relationshipFrame: RelationshipFrame = RelationshipFrame(relationshipData, "id", "source", "target", "KNOWS")
  val graph: PropertyGraph = cypherSession.createGraph(Seq(nodeFrame), Seq(relationshipFrame))
  val result: CypherResult = graph.cypher(
    """
      |MATCH (a:Person)-[r:KNOWS]->(:Person)
      |RETURN a, r""".stripMargin)
  result.df.show()
}
https://git.io/fjqp6

spark-graph-api
spark-cypher
spark-sql
okapi morpheus
spark-sql
openCypher
SPIP
Cypher to relational
operators compiler
openCypher

Spark SQL and “Spark GQL”
Two models, two languages
A common core of datatypes and expressions
GQL as the focal point of graph programming
Graph languages with a shared graph type system
