NDC Oslo 2018 - A Practical Guide to Graph Databases

A Practical Guide to Graph Databases

About Me
Architect and Full Stack Developer
● 20 years of full stack experience
● Distributed, high-performance, low-latency big data platforms
● Graph Databases are kinda my thing
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger

What is a Graph Datastore?
● Type of NoSQL datastore
● Uses graph structures (nodes, edges) to store data
● Efficiently represents and traverses relationships
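
To make the node/edge idea concrete, here is a minimal sketch in plain Python. It is not the API of any graph datastore; the names `nodes`, `edges`, and `neighbors` and the sample data are assumptions made up for illustration.

```python
# Toy in-memory graph: illustrative only, not a real graph datastore.
nodes = {
    "alice": {"label": "Person", "name": "Alice"},
    "bob": {"label": "Person", "name": "Bob"},
    "acme": {"label": "Company", "name": "Acme"},
}

# Each edge is (source node id, relationship type, target node id).
edges = [
    ("alice", "FRIEND", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
]


def neighbors(node_id, rel_type):
    """Return ids of nodes reached from node_id over outgoing rel_type edges."""
    return [dst for src, rel, dst in edges if src == node_id and rel == rel_type]


print(neighbors("alice", "FRIEND"))
```

A real graph datastore indexes these lookups rather than scanning a list, but the shape of the data is the same.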

Why use a graph database?
Network Analysis
Master Data Management
Recommendation Engines
Fraud Detection

The ecosystem is large and growing

The ecosystem is complex
● Frameworks
● Databases: RDF Triple Stores, Labeled Property Model

Databases vs. Frameworks
Frameworks
● Data is processed not persisted
● Works on enormous datasets
● OLAP workloads
Databases
● Data is persisted and processed
● Real time querying
● OLTP and OLAP workloads

RDF/Triple Stores vs. Labeled Property Graphs
RDF Triple Stores
● Each entity is a triple
● Works with subject - predicate - object
● Comes from the semantic web
● Great for inferring relationships
Labeled Property Graphs
● Entities are a node or an edge
● Works with nodes - edges - properties - labels
● Both nodes and edges contain properties
● Great for efficiently traversing relationships
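
A rough sketch of how the same facts look in each model, using plain Python structures rather than a real triple store or property graph engine. The `ex:` identifiers, the `since` property, and the helper `objects` are assumptions for illustration only.

```python
# RDF style: every statement is a (subject, predicate, object) triple.
triples = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:alice", "foaf:knows", "ex:bob"),
]

# Labeled property graph style: nodes and edges both carry labels and properties.
lpg_nodes = {
    1: {"labels": ["Person"], "properties": {"name": "Alice"}},
    2: {"labels": ["Person"], "properties": {"name": "Bob"}},
}
lpg_edges = [
    {"from": 1, "to": 2, "label": "KNOWS", "properties": {"since": 2015}},
]


def objects(subject, predicate):
    """Return every object of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects("ex:alice", "foaf:knows"))
print(lpg_edges[0]["properties"]["since"])
```

Note that the `since` value sits directly on the edge in the property graph; with plain triples, describing a relationship itself typically takes extra statements.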

RDF/Triple Stores vs. Labeled Property Graphs
[Figure panels: RDF Triple Stores, Labeled Property Graphs]

Graph Query Languages
Gremlin
● Imperative + Declarative
● Powerful
● Steep Learning Curve
GraphQL
● Most useful for REST endpoints
● Query Language for APIs
SPARQL
● W3C Standard for RDF
● Based on the Semantic Web
Cypher
● Declarative
● Easy to Use
● Most Popular Language
Others
● Most are extensions of SQL
● Usually specific to one system

Queries - Find a Friend of a Friend
SPARQL
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
  ?me foaf:knows ?friend .
  ?friend foaf:knows ?fof .
  ?fof foaf:name ?name . }
Cypher
MATCH (me:Person)-[:FRIEND*2]->(fof:Person)
RETURN fof.name
Gremlin
g.V().hasLabel('person')
.repeat(out('friend')).times(2)
.dedup().values('name').next()
GraphQL
{
friend {
friend {
name
}
}
}
SQL Variants
SELECT name FROM expand(
bothE('is_friend_with').bothV()
.bothE('is_friend_with').bothV()
)
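
For comparison, the traversal that all of these queries express can be sketched in plain Python over an adjacency dictionary. The `friends` data and the function name `friends_of_friends` are hypothetical, and this is not how any of the engines above execute the query.

```python
# Hypothetical friendship data: person -> list of outgoing 'friend' edges.
friends = {
    "me": ["ann", "bob"],
    "ann": ["cat", "bob"],
    "bob": ["dan", "cat"],
    "cat": [],
    "dan": [],
}


def friends_of_friends(graph, start):
    """People exactly two outgoing 'friend' hops from start, de-duplicated."""
    result = []
    for friend in graph.get(start, []):
        for fof in graph.get(friend, []):
            if fof not in result:
                result.append(fof)
    return result


print(friends_of_friends(friends, "me"))
```

Like the Gremlin version, this sketch does not exclude people who are also direct friends of the starting person.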

Visualization
● Desktop Tool
● Web
● Both

To use or not to use, that is the question

Everything is a Graph
But that doesn’t mean you should solve it with a graph

Search and Selection
● Get me everyone who works at X?
● Find me everyone with a first name like “John”?
● Find me all stores within X miles?
Answer: Use an RDBMS or a Search Server

Related Data
● What is the easiest way for me to be introduced to an executive at X?
● How do “John” and “Paula” know each other?
● How is company X related to company Y?
Answer: Use a Graph
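
A question like "How do John and Paula know each other?" is a shortest-path search. A minimal sketch using breadth-first search in plain Python follows; the `knows` data and the function name `shortest_path` are assumptions for illustration, not part of any graph database API.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest list of nodes from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was first reached from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, []):
            if neighbor not in previous:
                previous[neighbor] = current
                if neighbor == goal:
                    # Walk back through 'previous' to rebuild the path.
                    path = [goal]
                    while previous[path[-1]] is not None:
                        path.append(previous[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None


# Hypothetical acquaintance graph.
knows = {
    "John": ["Maria"],
    "Maria": ["John", "Paula", "Lee"],
    "Lee": ["Maria"],
    "Paula": ["Maria"],
}

print(shortest_path(knows, "John", "Paula"))
```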

Aggregation
● How many companies are in my system?
● What are my average sales for each day over the past month?
● What is the number of transactions processed by my system each day?
Answer: Use an RDBMS

Pattern Matching
● Who in my system has a similar profile to me?
● Does this transaction look like other known fraudulent transactions?
● Is the user “J. Smith” the same as “Johan S.”?
Answer: It depends, you might use a search server or a graph

Clustering, Centrality, and Influence
● Who is the most influential person I am connected with on LinkedIn?
● What equipment in my network will have the largest impact if it breaks?
● What parts tend to fail at the same time?
Answer: Use a graph
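
As a very rough illustration of the "largest impact if it breaks" question, the sketch below ranks nodes by degree (number of connections). Degree is only the simplest centrality measure; real analyses often use betweenness or PageRank-style measures instead. The `links` data and the function name `degree_centrality` are made up for this example.

```python
from collections import Counter

# Hypothetical network links between pieces of equipment (undirected).
links = [
    ("router1", "switch1"),
    ("router1", "switch2"),
    ("router1", "firewall"),
    ("switch1", "server1"),
    ("switch2", "server2"),
]


def degree_centrality(edge_list):
    """Count how many links touch each node."""
    degree = Counter()
    for a, b in edge_list:
        degree[a] += 1
        degree[b] += 1
    return degree


most_connected, link_count = degree_centrality(links).most_common(1)[0]
print(most_connected, link_count)
```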

Should I use Graph?
I sold this to Management as a Graph project so we are using a graph
Based on work by Dr. Denise Gosnell: https://bit.ly/2s0qBC2

I’m still confused
● Do we care about the relationships between entities as much as or more than the entities themselves?
● If I were to model this in an RDBMS, would I be writing queries with multiple (5+) joins or recursive CTEs to retrieve my data?
● Is the structure of my data continuously evolving?
● Is my domain a natural fit for a graph?

Can’t I just do this in SQL?

Give me all products in a category (Search/Selection)
SQL
SELECT c.categoryName, p.productName
FROM product AS p
INNER JOIN category AS c ON c.categoryId = p.categoryId
WHERE c.categoryName = 'Beverages'
Gremlin
g.V().has('category', 'categoryName', 'Beverages').as('c').
  in('part_of').as('p').
  select('c', 'p').by('categoryName').by('productName')
Cypher
MATCH (p:Product)-[:PART_OF]->(c:Category {categoryName: 'Beverages'})
RETURN c.categoryName, p.productName

Give me the top 5 products ordered (Aggregation)
SQL
SELECT TOP(5) c.categoryName, p.productName, COUNT(*) AS orderCount
FROM [order] AS o
INNER JOIN product AS p ON p.productId = o.productId
INNER JOIN category AS c ON c.categoryId = p.categoryId
GROUP BY c.categoryName, p.productName
ORDER BY orderCount DESC
Gremlin
g.V().hasLabel('order').
  out('orders').as('p').out('part_of').as('c').
  select('c', 'p').by('categoryName').by('productName').
  groupCount().order(local).by(values, decr).
  limit(local, 5)
Cypher
MATCH (o:Order)-[:ORDERS]->(p:Product)-[:PART_OF]->(c:Category)
RETURN c.categoryName, p.productName, count(o)
ORDER BY count(o) DESC LIMIT 5

Find Products Purchased by others that I haven’t purchased
(Related Data/Pattern Matching)
SQL
SELECT TOP(5) product.product_name AS Recommendation,
  count(1) AS Frequency
FROM product, customer_product_mapping,
  (SELECT cpm3.product_id, cpm3.customer_id
   FROM customer_product_mapping cpm,
     customer_product_mapping cpm2, customer_product_mapping cpm3
   WHERE cpm.customer_id = '123'
     AND cpm.product_id = cpm2.product_id
     AND cpm2.customer_id != '123'
     AND cpm3.customer_id = cpm2.customer_id
     AND cpm3.product_id NOT IN (SELECT DISTINCT product_id
       FROM customer_product_mapping cpm
       WHERE cpm.customer_id = '123')
  ) recommended_products
WHERE customer_product_mapping.product_id = product.product_id
  AND customer_product_mapping.product_id = recommended_products.product_id
  AND customer_product_mapping.customer_id = recommended_products.customer_id
GROUP BY product.product_name
ORDER BY Frequency DESC
Gremlin
g.V().has("customer", "customerId", "123").as("c").
out("ordered").out("contains").out("is").aggregate("p").
in("is").in("contains").in("ordered").where(neq("c")).
out("ordered").out("contains").out("is").where(without("p")).
groupCount().order(local).by(values,
decr).select(keys).limit(local, 5).
unfold().values("name")
Cypher
MATCH (u:Customer {customer_id:'123'})-[:BOUGHT]->(p:Product)<-
[:BOUGHT]-(peer:Customer)-[:BOUGHT]->(r:Product)
WHERE NOT (u)-[:BOUGHT]->(r)
RETURN r AS Recommendation, count(*) AS Frequency
ORDER BY Frequency DESC LIMIT 5;
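
The logic shared by these three queries can be sketched in plain Python. The `purchases` data and the function name `recommend` are hypothetical. As a simplification, this sketch counts each peer customer once per product, whereas the Cypher query counts every matching path.

```python
from collections import Counter

# Hypothetical purchase history: customer id -> set of products bought.
purchases = {
    "123": {"tea", "coffee"},
    "456": {"tea", "mugs", "kettle"},
    "789": {"coffee", "mugs"},
    "999": {"juice"},
}


def recommend(purchases, customer_id, limit=5):
    """Products bought by customers who share a purchase with customer_id,
    excluding what customer_id already owns, most frequent first."""
    mine = purchases[customer_id]
    counts = Counter()
    for other, products in purchases.items():
        # Skip the customer themselves and anyone with no product in common.
        if other == customer_id or not (products & mine):
            continue
        for product in products - mine:
            counts[product] += 1
    return counts.most_common(limit)


print(recommend(purchases, "123"))
```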

Give me all employees, their supervisor and level (Recursive CTE)
SQL
WITH EmployeeHierarchy (EmployeeID,
LastName,
FirstName,
ReportsTo,
HierarchyLevel) AS
( SELECT EmployeeID
, LastName
, FirstName
, ReportsTo
, 1 as HierarchyLevel
FROM Employees
WHERE ReportsTo IS NULL
UNION ALL
SELECT e.EmployeeID
, e.LastName
, e.FirstName
, e.ReportsTo
, eh.HierarchyLevel + 1 AS HierarchyLevel
FROM Employees e
INNER JOIN EmployeeHierarchy eh
ON e.ReportsTo = eh.EmployeeID)
SELECT *
FROM EmployeeHierarchy
ORDER BY HierarchyLevel, LastName, FirstName
Gremlin
g.V().hasLabel("employee").where(__.not(out("reportsTo"))).
  repeat(__.in("reportsTo")).emit().tree().by(map {
    def employee = it.get()
    employee.value("firstName") + " " + employee.value("lastName")
  }).next()
Cypher
MATCH p = (u:Employee)-[:ReportsTo*0..]->(top:Employee)
WHERE NOT (top)-[:ReportsTo]->()
OPTIONAL MATCH (u)-[:ReportsTo]->(s:Employee)
RETURN u.firstName AS FirstName, u.lastName AS LastName,
  (s.firstName + " " + s.lastName) AS ReportsTo,
  length(p) + 1 AS HierarchyLevel
ORDER BY HierarchyLevel, LastName, FirstName
Based on work by http://sql2gremlin.com/
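
The hierarchy walk behind these queries can be sketched in plain Python as a recursive lookup. The `reports_to` data and the function name `hierarchy_levels` are made up for illustration, and the sketch assumes the reporting structure has no cycles.

```python
# Hypothetical reporting structure: employee -> supervisor (None for the top).
reports_to = {
    "Fuller": None,
    "Davolio": "Fuller",
    "Buchanan": "Fuller",
    "Suyama": "Buchanan",
}


def hierarchy_levels(reports_to):
    """Return employee -> level, where the top of the hierarchy is level 1."""
    levels = {}

    def level_of(employee):
        # Memoize so each employee's level is computed only once.
        if employee not in levels:
            manager = reports_to[employee]
            levels[employee] = 1 if manager is None else level_of(manager) + 1
        return levels[employee]

    for employee in reports_to:
        level_of(employee)
    return levels


# Order by level, then name, mirroring the ORDER BY in the queries above.
rows = sorted(hierarchy_levels(reports_to).items(), key=lambda item: (item[1], item[0]))
print(rows)
```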

Choosing a Datastore
● Framework vs. RDF vs. Property Model
● HA/Transaction Volume/Data Size
● Hosted vs On Premise

Datastore Concerns
● Data Consistency - ACID or BASE
● Explore your choices
● Beware the Operational Overhead

Data Modelling
● Whiteboard friendly: stay close to the conceptual model, but be pragmatic
● Take into account how you are traversing data
● Use your Relational model to start
● Iterate, Iterate, Iterate

Data Modelling Concerns
● Don’t use Symmetric Relationships
● Look out for Hidden/Anemic Relationships
● Look for Supernodes
● Schema - Use it and make it general

The Good
● Graphs are flexible
● Great at finding and traversing relationships
● Natural fit in many complex domains
● Query times are proportional to the amount of the graph you traverse

The Bad
● Different options scale very differently
● Team needs to learn a new mindset
● Still an immature space

The Ugly
● Lack of documentation
● Large, splintered and rapidly evolving ecosystem
● Hard for new users to tell good versus bad use cases

Advice from the trenches...
● Graph datastores may solve your problem, but understand your problem first
● Expect some trial and error
● Your data model will evolve, plan for it
● Don’t underestimate the time it takes to bring your team up to speed
● Graph databases are not a silver bullet

www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger
Questions?
