Cypher and apache spark multiple graphs and more in open cypher

Cypher and Apache Spark: Multiple graphs and more in openCypher
Stefan Plantikow, Martin Junghanns,
Max Kießling, Petra Selmer

(:openCypher)-[:IS_GOING]->(:Places)

openCypher in 2017
openCypher
is a community effort to evolve the standard graph query language Cypher
openCypher implementers: SAP, Redis, Agens Graph, Cypher.PL, Neo4j, ...
openCypher events: Implementers meeting, Summer of Syntax
openCypher process: Cypher Improvement Requests and Proposals (CIPs/CIRs)
openCypher releases: Frontend, Grammar, TCK
openCypher research: Formal semantics (U Edinburgh), Stream processing, …
openCypher & standards: LDBC, ISO SQL PG Ad-Hoc …
openCypher features: Multiple graphs, subqueries, path patterns, ...

Cypher was originally conceived in the context of OLTP workloads at Neo4j.
Beyond OLTP, many Neo4j customers have a data lake and use Apache Spark for
- Big Data analytical processing
- Data integration (wrangling)
- Today's big data applications
○ Collect data from user interactions at website
○ Combine with other data from various departments (billing, marketing, ...)
○ Combine with ontological data
○ Analyze to better target customers, optimize supply chains, detect fraud, ...
Is Cypher ready to be used in a big data lake context?
Cypher for Big Data

- Data integration
○ Use multiple, large-scale data sets
○ Retain and reuse intermediary results
○ Integrate multiple data sources
○ Shape and handle heterogeneous data
- Complex Execution
○ Compose complex workflows from building blocks
○ Use machine learning, AI, graph algorithms, domain specific business logic
○ Distributed query execution in a cluster
- Common framework: Apache Spark over Hadoop (+ Neo4j)
Distilling Graphs from the Data Lake

(:Cypher)-[:FOR]->(:Apache:Spark™)

Spark Package for the execution of Cypher on Apache Spark
- Execute Cypher queries on multiple large, distributed graphs
- Integrate Cypher into your Spark analytical pipeline
- Integrate multiple data sources (Neo4j, HDFS, Local FS, ...)
- Handle heterogeneous data
- Compose Cypher queries

- Made by Neo4j, donated to the openCypher community
- Alpha release of source code under the Apache License 2.0 (APL2) on GitHub. Available now:
github.com/openCypher/cypher-for-apache-spark
- Release 1.0: Targeted for first half 2018
- Commercial extension for integrating more sophisticated data sources
- Innovations:
○ Technical Architecture for executing Cypher on a big-data analytics system
○ Composable queries for working with multiple graphs
CAPS

MATCH (n:Person)-[:LOVES]->(s1:System {title: "Neo4j"})
OPTIONAL MATCH (n)-[:LOVES]->(s2:System {title: "Spark"})
RETURN n, s1, s2
openCypher Frontend
➢ Based on proven Neo4j Cypher parser
➢ Parsing, Rewriting, Optimization
CAPS
➢ Data Import and Export
➢ Schema and Type handling
➢ Query translation to DataFrame operations
Spark Catalyst Optimizer
➢ Rule based query optimization
Spark Runtime
➢ Distributed execution

Data in Spark
Spark's core: Transform tabular data
SparkSQL Table => Table => Table
Cypher 9 (Single) Graph => Table
How to handle multiple graphs?

(:Cypher)-[:WITH]->(:Multiple:Graphs)

Why Multiple Graphs?
- Combining and transforming graphs from multiple sources
- Versioning, snapshotting, computing difference graphs
- Graph views for access control
- Provisioning applications with tailored graph views
- Shaping and integrating heterogeneous graph data
- Roll-up and drill-down at different levels of detail
Graph Management
Graph Modeling

Cypher today: Single graph model
[Diagram: Client 1, Client 2 and Client 3; Application Server; Graph Database System (e.g. a cluster) holding the (single) Graph]

Cypher: Multiple graphs model
[Diagram: Client 1, Client 2 and Client 3; Application Server; Graph Database System (e.g. a cluster) holding a Graph Space]

Tables from graphs...
It's easy to construct tables from a graph... but what's the inverse?
MATCH (a)-->(b) WITH a, b ...

...graphs from tables
...a graph is a set of pattern matches!
WITH a, r, b RETURN GRAPH OF (a)-[r]->(b) AS foo
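
The idea that "a graph is a set of pattern matches" can be sketched in plain Scala. This is only an illustration of the concept, not CAPS code: the types Rel, Row and Graph and the function graphOf are hypothetical names, and ids stand in for full nodes and relationships.

```scala
object GraphOfSketch {
  // Hypothetical minimal model: ids stand in for full nodes and relationships.
  final case class Rel(id: Long, source: Long, target: Long, relType: String)
  final case class Graph(nodes: Set[Long], rels: Set[Rel])

  // One row per match of (a)-[r]->(b): the node ids bound to a and b, plus r.
  final case class Row(a: Long, r: Rel, b: Long)

  // The graph "of" the pattern is the union of everything the rows bind.
  // Using sets means that repeated matches of the same entity collapse.
  def graphOf(rows: Seq[Row]): Graph =
    Graph(
      nodes = rows.flatMap(row => Seq(row.a, row.b)).toSet,
      rels = rows.map(_.r).toSet
    )

  def main(args: Array[String]): Unit = {
    val knows = Rel(10L, 0L, 1L, "KNOWS")
    val loves = Rel(11L, 0L, 2L, "LOVES")
    val rows = Seq(Row(0L, knows, 1L), Row(0L, loves, 2L), Row(0L, loves, 2L))
    println(graphOf(rows))
  }
}
```

Going from a table back to a graph is then just a projection of the bound columns followed by de-duplication, which is why the tabular intermediate results of a query are enough to construct a new graph.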

Cypher queries with multiple graphs

Cypher query pipeline composition

Current CAPS Multiple Graphs Syntax
FROM GRAPH graph_A AT "bolt://.../people"
MATCH (a:Person)-[:KNOWS]-(b:Person)
FROM GRAPH graph_B AT "hdfs://.../products"
MATCH (:Customer {name: a.name})-[:BOUGHT]->(p:Product)
RETURN GRAPH OF (b)-[:SHOULD_BUY]->(p)
(Ongoing work in CIP2017-06-18: Multiple Graphs)

Cypher support for multiple graphs
- Graphs are addressed using URIs
- Graphs and tabular data are passed into and returned from a query
Extensions
- Set operations and subqueries over multiple graphs
- Updating graphs (DML)
- Managing graph persistence (Move, Snapshot, Version, ...)
- Creating views
- Schema and constraint definitions for multiple graphs
...
=> Join the openCypher MG Task Force

(:Cypher)-[:ON]->(:Relational:Engine)

Challenge: Graph engine vs. Relational engine
|                 | Neo4j                                 | Apache Spark            |
| Graph Format    | Native (i.e. optimized for graph ops) | DataFrame (i.e. tables) |
| Query operators | Native (e.g. Expand, VarExpand)       | Relational operators    |
| Schema          | Schema optional                       | Fixed Schema            |
| Data types      | Cypher type system                    | Spark SQL type system   |

Node labels:
  :Employee
    name: STRING
  :Person
    name: STRING
    yob: INTEGER (nullable)
  :System
    title: STRING
Relationship types:
  :KNOWS
    since: INTEGER
  :LOVES
Implied labels:
  :Employee -> :Person
Example graph:
  (alice:Employee:Person {name: 'Alice'})-[:KNOWS {since: 2017}]->(bob:Person {name: 'Bob', yob: 1984})
  (alice)-[:LOVES]->(:System {title: 'Spark'})
  (alice)-[:LOVES]->(neo4j:System {title: 'Neo4j'})
  (bob)-[:LOVES]->(neo4j)
● Required for Spark DataFrame
  • Explicitly defined (e.g. for HDFS data source)
  • Implicitly inferred (e.g. for Neo4j data source)
● Requires type mapping from Cypher types to Spark types
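
A type mapping of this kind could look roughly like the sketch below. It is not the CAPS implementation: CypherType, Column and toSparkType are hypothetical names, only four property types are covered, and Spark SQL types are referred to by name so that the snippet runs without Spark on the classpath. It assumes 64-bit Cypher integers and floats, as in Neo4j.

```scala
object TypeMappingSketch {
  // Hypothetical, simplified stand-ins for Cypher property types.
  sealed trait CypherType
  case object CTString extends CypherType
  case object CTInteger extends CypherType
  case object CTFloat extends CypherType
  case object CTBoolean extends CypherType

  // One DataFrame column: name, Spark SQL type name, nullability.
  final case class Column(name: String, sparkType: String, nullable: Boolean)

  // Assumes 64-bit Cypher INTEGER and FLOAT, hence LongType and DoubleType.
  def toSparkType(t: CypherType): String = t match {
    case CTString  => "StringType"
    case CTInteger => "LongType"
    case CTFloat   => "DoubleType"
    case CTBoolean => "BooleanType"
  }

  // A property that only some nodes carry becomes a nullable column.
  def column(variable: String, key: String, t: CypherType, nullable: Boolean): Column =
    Column(s"$variable.$key", toSparkType(t), nullable)

  def main(args: Array[String]): Unit = {
    println(column("n", "name", CTString, nullable = false))
    println(column("n", "yob", CTInteger, nullable = true))
  }
}
```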

Logical view
  (alice:Employee:Person {name: 'Alice'})-[:KNOWS {since: 2017}]->(bob:Person {name: 'Bob', yob: 1984})
  (alice)-[:LOVES]->(:System {title: 'Spark'})
  (alice)-[:LOVES]->(neo4j:System {title: 'Neo4j'})
  (bob)-[:LOVES]->(neo4j)
Physical view (DataFrame)
NodeScan(Person)
| n | n:Employee | n.name | n.yob |
| 0 | true       | Alice  | null  |
| 1 | false      | Bob    | 1984  |
NodeScan(System)
| n | n.title |
| 2 | Spark   |
| 3 | Neo4j   |
RelScan(KNOWS)
| src(r) | r | trgt(r) | r.since |
| 0      | 0 | 1       | 2017    |
RelScan(LOVES)
| src(r) | r | trgt(r) |
| 0      | 1 | 2       |
| 0      | 2 | 3       |
| 1      | 3 | 3       |
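
The NodeScan(Person) layout above (one boolean column per additional label, one column per property key, null for missing properties) can be reproduced with a few lines of plain Scala. This is a sketch of the table layout only, under the assumption that columns are derived from the data; Node and nodeScan are hypothetical names, not CAPS API.

```scala
object NodeScanSketch {
  final case class Node(id: Long, labels: Set[String], props: Map[String, Any])

  // Builds the table for NodeScan(label): one row per node carrying `label`,
  // one boolean column per other label, one column per property key.
  def nodeScan(label: String, nodes: Seq[Node]): (Seq[String], Seq[Seq[Any]]) = {
    val matching = nodes.filter(_.labels.contains(label))
    val otherLabels = matching.flatMap(_.labels).distinct.filter(_ != label).sorted
    val keys = matching.flatMap(_.props.keys).distinct.sorted
    val header = Seq("n") ++ otherLabels.map(l => s"n:$l") ++ keys.map(k => s"n.$k")
    val rows = matching.map { node =>
      Seq[Any](node.id) ++
        otherLabels.map(l => node.labels.contains(l)) ++
        keys.map(k => node.props.getOrElse(k, null)) // missing property => null
    }
    (header, rows)
  }

  def main(args: Array[String]): Unit = {
    val nodes = Seq(
      Node(0L, Set("Employee", "Person"), Map("name" -> "Alice")),
      Node(1L, Set("Person"), Map("name" -> "Bob", "yob" -> 1984)),
      Node(2L, Set("System"), Map("title" -> "Spark"))
    )
    val (header, rows) = nodeScan("Person", nodes)
    println(header.mkString(" | "))
    rows.foreach(row => println(row.mkString(" | ")))
  }
}
```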

Logical view
MATCH (n:Person)-[:LOVES]->(s1:System {title: "Neo4j"})
OPTIONAL MATCH (n)-[:LOVES]->(s2:System {title: "Spark"})
RETURN n, s1, s2
Physical view (DataFrame operations)
NodeScan(System)
RelScan(LOVES)
NodeScan(Person)
APPLY MAGIC HERE
Result

Result
| n | n.name | n.yob | n:Person | s1 | s1.title | s1:System | s2   | s2.title | s2:System |
| 0 | Alice  | null  | true     | 3  | Neo4j    | true      | 2    | Spark    | true      |
| 1 | Bob    | 1984  | true     | 3  | Neo4j    | true      | null | null     | null      |
OPTIONAL MATCH (n)-[:LOVES]->(s2:System {title: "Spark"})
RETURN n, s1, s2
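
One way to read the "magic" between the scans and the result table is as relational joins: the mandatory MATCH behaves like an inner join across the node and relationship scans, and OPTIONAL MATCH like a left outer join that pads unmatched rows with null. The sketch below shows this on plain Scala collections using the ids from the scan tables. It is an assumption about the general approach, not the actual CAPS query plan; expand and optionalExpand are hypothetical names.

```scala
object OptionalMatchSketch {
  final case class Rel(source: Long, target: Long)

  // Inner join: every (person, system) pair connected by a LOVES relationship
  // whose target is one of `systems`. Mirrors the mandatory MATCH.
  def expand(people: Seq[Long], loves: Seq[Rel], systems: Set[Long]): Seq[(Long, Long)] =
    for {
      p <- people
      r <- loves if r.source == p && systems.contains(r.target)
    } yield (p, r.target)

  // Left outer join: keep every incoming row and pad with None when nothing
  // matches. Mirrors OPTIONAL MATCH.
  def optionalExpand(
      rows: Seq[(Long, Long)],
      loves: Seq[Rel],
      systems: Set[Long]
  ): Seq[(Long, Long, Option[Long])] =
    rows.flatMap { case (p, s1) =>
      val matches = loves.filter(r => r.source == p && systems.contains(r.target)).map(_.target)
      if (matches.isEmpty) Seq((p, s1, Option.empty[Long]))
      else matches.map(s2 => (p, s1, Option(s2)))
    }

  def main(args: Array[String]): Unit = {
    val loves = Seq(Rel(0L, 2L), Rel(0L, 3L), Rel(1L, 3L)) // RelScan(LOVES)
    val matched = expand(Seq(0L, 1L), loves, Set(3L))      // s1: title = "Neo4j"
    val result = optionalExpand(matched, loves, Set(2L))   // s2: title = "Spark"
    result.foreach(println)
  }
}
```

With the example data, Alice (0) keeps both bindings while Bob (1) gets an empty s2, which corresponds to the null columns in the result table.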

• Programmatic, high-level API (similar to Spark's DataFrame API)
• Central entry point: CAPSSession
import org.apache.spark.sql.SparkSession
// CAPSSession is provided by the CAPS package

// Start a local Spark session and wrap it in a CAPS session
val sparkSession = SparkSession.builder().master("local[*]").appName("caps-example").getOrCreate()
val capsSession = CAPSSession.create(sparkSession)
// Load a graph from HDFS and run a Cypher query against it
val graph = capsSession.graphAt("hdfs://localhost:9000/path/to/graph")
val result = graph.cypher("MATCH (n:Person)-[:LOVES]->(s:System) RETURN n.name, s.title")
result.print
+-------------------+
| n.name  | s.title |
+-------------------+
| 'Alice' | 'Neo4j' |
| 'Alice' | 'Spark' |
| 'Bob'   | 'Neo4j' |
+-------------------+
(3 rows)

• Mount graphs from multiple sources
• Store graphs in session-local graph storage
// Mount a CSV graph stored in HDFS
capsSession.mountGraphAt("hdfs+csv://localhost:9000/path/to/graph", "/my-hdfs-graph")
// Mount a Neo4j graph, defined by one node query and one relationship query
capsSession.mountGraphAt(
  "bolt://localhost:7687&MATCH (n) RETURN n;MATCH ()-[r]->() RETURN r",
  "/my-neo-graph"
)
// Match across both graphs and return a new graph named "result"
val result = capsSession.cypher("""
  FROM GRAPH AT 'session://my-hdfs-graph'
  MATCH (e:Employee)
  FROM GRAPH AT 'session://my-neo-graph'
  MATCH (p:Person)
  WHERE e.email = p.email
  RETURN GRAPH result OF (e)-[:SAME_AS]->(p)
""").graphs("result")
// Query the newly constructed graph
result.cypher("MATCH ()-[e]->() RETURN COUNT(e)")

(:Cypher)-[:FOR]->(:Apache:Spark™)
Demo

• Target specific customers in selected metropolitan areas as part of a marketing campaign
• Combine multi-region social network data with product data to derive recommendations
• Social network is partitioned by region (SN_NA, SN_EU) and stored in separate Neo4j instances
• Product data is stored in HDFS using a CAPS-specific CSV format
Social Network (SN)
  :Person { name : Bob, email : bob@gmail.com }
  :Person { name : Alice, email : alice@gmail.com }
  :Interest { name : Graphs }
  :City { name : New York }
  Relationships: :KNOWS, :LIKES, :LIVES_IN, :LIVES_IN
Products (PROD)
  :Customer { email : alice@gmail.com }
  :Product { name : Graph Databases }
  :Category { name : Books }
  Relationships: :BOUGHT { rating : 5, votes : 10, helpful : 6 }, :BELONGS_TO

1. Load data from the corresponding data sources (i.e. Neo4j and HDFS)
2. Extract metropolitan subgraphs from Social Networks (e.g. people from NY / SFO for SN_NA)
3. Merge Social Network data with Product data using identifying properties (i.e. Email)
4. Compute recommendations based on friends’ interests and bought products
[Diagram: the Social Network (SN) and Products (PROD) example graphs from the previous slide, with an additional :IS relationship]

(:openCypher)-[:EVOLVES]->(:Cypher)

How does a feature make it into Cypher?
CIR = Cypher Improvement Request
- Ideas & suggestions, topics for discussion, …
- Raise a GitHub issue at https://github.com/opencypher/openCypher
CIP = Cypher Improvement Proposal
- Response to a CIR
- Full description of behaviour and syntax
- Create a Pull Request at https://github.com/opencypher/openCypher

openCypher
openCypher Implementers Group (oCIG)
- Evolve Cypher through an open process
- Comprises vendors, researchers, implementers, interested parties
Face-to-face and virtual meetings to present, discuss and agree upon new features
- Germany (February)
- UK (May)
- France (November)

(:Cypher)-[:WITH]->(:Subqueries)

Why?
Queries are easier to
- construct
- maintain
- read
Subqueries enable
- composition of query pipelines
- post-processing of results
- multiple write actions for each record

Example: Post-UNION processing
MATCH {
  // authored tweets
  MATCH (me:User {name: 'Alice'})-[:FOLLOWS]->(user:User),
        (user)<-[:AUTHORED]-(tweet:Tweet)
  RETURN tweet, tweet.time AS time, user.country AS country
  UNION
  // favorited tweets
  MATCH (me:User {name: 'Alice'})-[:FOLLOWS]->(user:User),
        (user)<-[:HAS_FAVOURITE]-(favorite:Favorite)-[:TARGETS]->(tweet:Tweet)
  RETURN tweet, favorite.time AS time, user.country AS country
}
WHERE country = 'se'
RETURN DISTINCT tweet
ORDER BY time DESC
LIMIT 10

Types of subqueries
Nested
- Run any complete read-only Cypher query
- Incoming variables remain in scope: correlated subquery
- Arbitrary depth
Existential: returns true if at least one match is found; false otherwise
Scalar: result is a single value in a single row
List: result is the list formed by collecting the values of all rows (single value per row)
Updating: simple and conditional updates, executed once per incoming row

(:Cypher)-[:WITH]->(:Path:Pattern:Queries)

Why?
Find complex connections
Repetitions of patterns:
( likes.hates )+
Alternatives between patterns rather than just a single relationship type:
( drinks | eats )*
Express patterns directly, rather than resorting to using UNION

Example: a sad state of affairs...
Find a chain of unreciprocated lovers:
PATH PATTERN unreciprocated_love = (a)-[:LOVES]->(b)
WHERE NOT EXISTS { (b)-[:LOVES]->(a) }
MATCH (you)-/~unreciprocated_love*/->(someone)
Named Path Predicate

| Relationship Type Predicate | ()-/:FOO/-()                                     |
| Node Predicates             | ()-/(:Alpha {beta:'gamma'})/-()                  |
| Alternation                 | ()-/:FOO | :BAR | :BAZ/-()                       |
| Sequence                    | ()-/:FOO :BAR :BAZ/-()                           |
| Grouping                    | ()-/:FOO | [:BAR :BAZ]/-()                       |
| Direction                   | ()-/<:FOO :BAR <:BAZ>/->()                       |
| Repetition                  | ()-/:FOO? :BAR+ :BAZ* :FOO*3.. :BAR*1..5/-()     |
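
The sequence, alternation and repetition operators can be read as operations on sets of (start, end) node pairs. The sketch below evaluates a pattern such as ( likes.hates )+ that way in plain Scala. It is a simplified illustration with hypothetical names (Edges, seq, alt, plus): it tracks only the endpoints of a path and ignores direction markers, node predicates and any rule about not reusing relationships within a path.

```scala
object PathPatternSketch {
  // A relationship type is modelled as a set of (start, end) node pairs.
  type Edges = Set[(String, String)]

  // Sequence "x y": follow an x edge, then a y edge.
  def seq(x: Edges, y: Edges): Edges =
    x.flatMap { case (a, b) => y.collect { case (c, d) if c == b => (a, d) } }

  // Alternation "x | y": follow either kind of edge.
  def alt(x: Edges, y: Edges): Edges = x ++ y

  // Repetition "p+": one or more applications of p, computed as a fixpoint.
  def plus(p: Edges): Edges = {
    @scala.annotation.tailrec
    def loop(acc: Edges): Edges = {
      val next = acc ++ seq(acc, p)
      if (next == acc) acc else loop(next)
    }
    loop(p)
  }

  def main(args: Array[String]): Unit = {
    val likes: Edges = Set(("a", "b"), ("c", "d"))
    val hates: Edges = Set(("b", "c"), ("d", "e"))
    // ( likes.hates )+
    println(plus(seq(likes, hates)))
  }
}
```

Expressing the same query with UNION would need one branch per repetition count, which is the motivation for expressing the pattern directly.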

(:Cypher)-[:IS]->(:Everywhere)

openCypher: Summer of Syntax
Multiple graphs
Subqueries
Path pattern queries (complex pattern matching)
Aggregation and grouping
MANDATORY MATCH
Configurable pattern matching
Cypher versioning & Cypher 9

Want to find out more?
Join us at the openCypher Meetup!
Wednesday, 25 October, 5:30pm - 8pm
WeWork Park South at 110 East 28th Street, NY
Agenda
- Multiple graphs, subqueries, path pattern queries
- Connecting research in graph processing to industrial technologies
- Property graphs with time

(:Cypher)-[:IS]->(:Everywhere)
CAPS core alpha source release with multiple graphs out now,
production-ready release next year
Plus commercial release (from Neo4j):
Data lake integration and other sophisticated graph data sources
Also: Cypher over Gremlin is in the works! => Cypher everywhere
openCypher continues to evolve: Get involved! openCypher.org
Upcoming
- openCypher booth here at GraphConnect NYC
- openCypher meetup tomorrow: opencypher.org/event/2017/10/25/event-oc-meetup/
- Third openCypher implementers meeting: opencypher.org/event/2017/11/13/ocim3/

(:Thank)-[:-]->(:You)
stefan.plantikow@neo4j.com, martin.junghanns@neo4j.com, max.kiessling@neo4j.com, petra.selmer@neo4j.com
