With the emergence of offerings on both AWS (Neptune) and Azure (CosmosDB) within the past year, it is fair to say that graph databases are one of the hottest trends and that they are here to stay. So what are graph databases all about? You can read article after article about how great they are and how they will solve all your problems better than your relational database, but it's difficult to find any practical information about them.
This talk will start with a short primer on graph databases and the ecosystem but will then quickly transition to discussing the practical aspects of how to apply them to solve real world business problems. We will dive into what makes a good use case and what does not. We will then follow this up with some real world examples of some of the common patterns and anti-patterns of using graph databases. If you haven't been scared away by this point we will end by showing you some of the powerful insights that graph databases can provide you.
2. About Me
Architect and Full Stack Developer
● 20 years of full stack experience
● Distributed, high-performance, low-latency big data platforms
● Graph Databases are kinda my thing
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger
6. What is a Graph Datastore?
● Type of NoSQL datastore
● Uses graph structures (nodes, edges) to store data
● Efficiently represents and traverses relationships
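To make the bullets above concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not how any particular graph datastore is implemented; the `TinyGraph` class, its method names, and the sample data are all hypothetical.

```python
from collections import defaultdict


class TinyGraph:
    """Hypothetical in-memory graph: nodes with properties, labeled edges."""

    def __init__(self):
        self.nodes = {}                     # node id -> property dict
        self.out_edges = defaultdict(list)  # node id -> [(edge label, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))

    def out(self, node_id, label):
        # Only the edges attached to this node are inspected, so the cost
        # of one hop does not depend on the total size of the graph.
        return [t for (l, t) in self.out_edges[node_id] if l == label]


g = TinyGraph()
g.add_node("dave", name="Dave")      # invented sample data
g.add_node("acme", name="Acme")
g.add_edge("dave", "works_at", "acme")

employers = [g.nodes[n]["name"] for n in g.out("dave", "works_at")]
print(employers)
```

Storing each node's edges next to the node is what makes following a relationship a lookup rather than a join.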
11. The ecosystem is complex
Frameworks
RDF Triple Stores Labeled Property Model
Databases
12. Databases vs. Frameworks
Frameworks
● Data is processed not persisted
● Works on enormous datasets
● OLAP workloads
Databases
● Data is persisted and processed
● Real time querying
● OLTP and OLAP workloads
13. RDF/Triple Stores vs. Labeled Property Graphs
RDF Triple Stores
● Each entity is a triple
● Works with subject - predicate - object
● Comes from semantic web
● Great for inferring relationships
Labeled Property Graphs
● Entities are a node or an edge
● Works with nodes - edges - properties - labels
● Both nodes and edges contain properties
● Great for efficiently traversing relationships
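As a rough sketch of the difference, the same small fact set can be written both ways in plain Python. The identifiers (`ex:dave`, the `since` property, the helper names) are invented for illustration and do not come from any real dataset or library.

```python
# RDF style: every fact is one (subject, predicate, object) triple.
triples = [
    ("ex:dave", "foaf:name", "Dave"),
    ("ex:paula", "foaf:name", "Paula"),
    ("ex:dave", "foaf:knows", "ex:paula"),
]

# Labeled property graph style: nodes and edges each carry a label
# and their own properties.
nodes = {
    1: {"label": "person", "properties": {"name": "Dave"}},
    2: {"label": "person", "properties": {"name": "Paula"}},
}
edges = [
    # "since" is an invented edge property, shown only to illustrate
    # that edges can hold data of their own.
    {"label": "knows", "out": 1, "in": 2, "properties": {"since": 2015}},
]


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def neighbors(node_id, label):
    """Ids of nodes reached by outgoing edges with the given label."""
    return [e["in"] for e in edges if e["out"] == node_id and e["label"] == label]


known_rdf = objects("ex:dave", "foaf:knows")
known_lpg = [nodes[n]["properties"]["name"] for n in neighbors(1, "knows")]
print(known_rdf, known_lpg)
```

In the triple form, the name is just another triple about the subject; in the property graph form, it lives on the node, and the edge can carry properties directly.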
15. Graph Query Languages
Gremlin
● Imperative + Declarative
● Powerful
● Steep Learning Curve
GraphQL
● Most useful for REST endpoints
● Query Language for APIs
SPARQL
● W3C Standard for RDF
● Based on Semantic Web
Cypher
● Declarative
● Easy to Use
● Most Popular Language
Others
● Most are extensions of SQL
● Usually specific to one system
16. Queries - Find a Friend of a Friend
SPARQL
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
?me foaf:knows ?friend .
?friend foaf:knows ?fof .
?fof foaf:name ?name .}
Cypher
MATCH (me:Person)-[:FRIEND*2]->
(fof:Person) RETURN DISTINCT fof.name
Gremlin
g.V().hasLabel('person')
.repeat(out('friend')).times(2)
.dedup().values('name').toList()
GraphQL
{
friend {
friend {
name
}
}
}
SQL Variants
SELECT name FROM expand(
bothE('is_friend_with').bothV()
.bothE('is_friend_with').bothV()
)
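Whatever the syntax, each query above performs the same two-hop traversal. A minimal Python sketch of that traversal over an adjacency list follows; the `friends` data and the function name are invented for illustration.

```python
# Hypothetical friendship graph: person -> people they list as friends.
friends = {
    "me": ["ann", "bob"],
    "ann": ["cat", "bob"],
    "bob": ["cat", "dan"],
    "cat": [],
    "dan": [],
}


def friends_of_friends(graph, start):
    """Follow the friend edge exactly twice and de-duplicate the results."""
    result = []
    for friend in graph.get(start, []):        # first hop
        for fof in graph.get(friend, []):      # second hop
            if fof not in result:              # same role as dedup()
                result.append(fof)
    return result


fof = friends_of_friends(friends, "me")
print(fof)
```

Like the queries above, this sketch does not exclude people who are also direct friends; filtering those out would be an extra step.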
22. Search and Selection
● Get me everyone who works at X?
● Find me everyone with a first name like “John”?
● Find me all stores within X miles?
Answer: Use an RDBMS or a Search Server
23. Related Data
● What is the easiest way for me to be introduced to an executive at X?
● How do “John” and “Paula” know each other?
● How is company X related to company Y?
Answer: Use a Graph
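A question like "How do John and Paula know each other?" is a path-finding problem. The sketch below uses a breadth-first search over an invented `knows` graph to find one shortest chain of acquaintances. It illustrates the idea only; graph databases expose this through their own query languages rather than through a function like the hypothetical `shortest_path` here.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    previous = {start: None}   # node -> node we reached it from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:   # walk back to the start
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in graph.get(current, []):
            if neighbor not in previous:
                previous[neighbor] = current
                queue.append(neighbor)
    return None


# Invented sample data.
knows = {
    "John": ["Maria", "Ken"],
    "Maria": ["John", "Paula"],
    "Ken": ["John", "Lee"],
    "Lee": ["Ken", "Paula"],
    "Paula": ["Maria", "Lee"],
}

path = shortest_path(knows, "John", "Paula")
print(" -> ".join(path))
```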
24. Aggregation
● How many companies are in my system?
● What are my average sales for each day over the past month?
● What is the number of transactions processed by my system each day?
Answer: Use an RDBMS
25. Pattern Matching
● Who in my system has a similar profile to me?
● Does this transaction look like other known fraudulent transactions?
● Is the user “J. Smith” the same as “Johan S.”?
Answer: It depends, you might use a search server or a graph
26. Clustering, Centrality, and Influence
● Who is the most influential person I am connected with on LinkedIn?
● What equipment in my network will have the largest impact if it breaks?
● What parts tend to fail at the same time?
Answer: Use a graph
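As a simple illustration of centrality, the sketch below computes degree centrality (the number of connections per node) for an invented equipment network. Degree is only the crudest measure of influence, and real graph tooling offers richer ones; the `links` data and function name are assumptions made for this example.

```python
from collections import Counter

# Invented network topology: each pair is one physical link.
links = [
    ("router-1", "switch-1"),
    ("router-1", "switch-2"),
    ("router-1", "firewall"),
    ("switch-1", "server-a"),
    ("switch-2", "server-b"),
]


def degree_centrality(edges):
    """Count how many links touch each node."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


degree = degree_centrality(links)
most_connected, connections = degree.most_common(1)[0]
print(most_connected, connections)
```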
28. Should I use Graph?
I sold this to Management as a Graph
project so we are using a graph
Based on work by Dr. Denise Gosnell: https://bit.ly/2s0qBC2
29. I’m still confused
● Do we care about the relationships between entities as
much or more than the entities themselves?
● If I were to model this in an RDBMS, would I be writing queries with multiple (5+) joins or recursive CTEs to retrieve my data?
● Is the structure of my data continuously evolving?
● Is my domain a natural fit for a graph?
32. Give me all products in a category (Search/Selection)
SQL
SELECT c.categoryName, p.productName
FROM product AS p
INNER JOIN category AS c ON
c.categoryId=p.categoryId
WHERE c.categoryName='Beverages'
Gremlin
g.V().has('category', 'categoryName',
'Beverages').as('c').in('part_of')
.as('p').select('c', 'p')
.by('categoryName').by('productName')
Cypher
MATCH (p:Product)-[:PART_OF]->(c:Category {categoryName: 'Beverages'})
RETURN c.categoryName, p.productName
33. Give me the top 5 products ordered (Aggregation)
SQL
SELECT TOP(5) c.categoryName,
p.productName, COUNT(*) AS orderCount
FROM [order] AS o
INNER JOIN product AS p ON
p.productId=o.productId
INNER JOIN category AS c ON
c.categoryId=p.categoryId
GROUP BY c.categoryName, p.productName
ORDER BY orderCount DESC
Gremlin
g.V().hasLabel('order').out('orders').
groupCount().by(project('c', 'p').
by(out('part_of').values('categoryName')).by('productName')).
order(local).by(values, desc).limit(local, 5)
Cypher
MATCH (o:Order)-[:ORDERS]->(p:Product) -
[:PART_OF]->(c:Category)
RETURN c.categoryName, p.productName,
count(o)
ORDER BY count(o)
DESC LIMIT 5
34. Find Products Purchased by others that I haven’t purchased
(Related Data/Pattern Matching)
SQL
SELECT TOP(5) product.product_name AS Recommendation,
count(1) AS Frequency
FROM product, customer_product_mapping,
(SELECT cpm3.product_id, cpm3.customer_id
FROM customer_product_mapping cpm,
customer_product_mapping cpm2, customer_product_mapping cpm3
WHERE cpm.customer_id = '123'
and cpm.product_id = cpm2.product_id
and cpm2.customer_id != '123'
and cpm3.customer_id = cpm2.customer_id
and cpm3.product_id not in (select distinct product_id
FROM customer_product_mapping cpm4
WHERE cpm4.customer_id = '123')
) recommended_products
WHERE customer_product_mapping.product_id = product.product_id
and customer_product_mapping.product_id =
recommended_products.product_id
and customer_product_mapping.customer_id =
recommended_products.customer_id
GROUP BY product.product_name
ORDER BY Frequency desc
Gremlin
g.V().has("customer", "customerId", "123").as("c").
out("ordered").out("contains").out("is").aggregate("p").
in("is").in("contains").in("ordered").where(neq("c")).
out("ordered").out("contains").out("is").where(without("p")).
groupCount().order(local).by(values,
desc).select(keys).limit(local, 5).
unfold().values("name")
Cypher
MATCH (u:Customer {customer_id:'123'})-[:BOUGHT]->(p:Product)<-
[:BOUGHT]-(peer:Customer)-[:BOUGHT]->(r:Product)
WHERE not (u)-[:BOUGHT]->(r)
RETURN r as Recommendation, count(*) as Frequency
ORDER BY Frequency DESC LIMIT 5;
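The traversal behind these queries is: start at my purchases, hop to other customers who bought the same products, hop to what else they bought, drop what I already own, then count. A small Python sketch of that logic is below. The `purchases` data and the `recommend` function are invented for illustration, and the sketch counts each peer once per product, whereas the Cypher query above counts every matching path.

```python
from collections import Counter

# Invented sample data: customer id -> set of products bought.
purchases = {
    "123": {"tea", "coffee"},
    "456": {"tea", "milk", "sugar"},
    "789": {"coffee", "milk"},
    "999": {"bread"},
}


def recommend(purchases, customer_id, limit=5):
    """Products bought by customers who share a purchase with customer_id."""
    mine = purchases[customer_id]
    counts = Counter()
    for other, theirs in purchases.items():
        # Skip myself and anyone with no product in common with me.
        if other == customer_id or not (mine & theirs):
            continue
        for product in theirs - mine:   # only products I have not bought
            counts[product] += 1
    return counts.most_common(limit)


recommendations = recommend(purchases, "123")
print(recommendations)
```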
35. Give me all employees, their supervisor and level (Recursive CTE)
SQL
WITH EmployeeHierarchy (EmployeeID,
LastName,
FirstName,
ReportsTo,
HierarchyLevel) AS
( SELECT EmployeeID
, LastName
, FirstName
, ReportsTo
, 1 as HierarchyLevel
FROM Employees
WHERE ReportsTo IS NULL
UNION ALL
SELECT e.EmployeeID
, e.LastName
, e.FirstName
, e.ReportsTo
, eh.HierarchyLevel + 1 AS HierarchyLevel
FROM Employees e
INNER JOIN EmployeeHierarchy eh
ON e.ReportsTo = eh.EmployeeID)
SELECT *
FROM EmployeeHierarchy
ORDER BY HierarchyLevel, LastName, FirstName
Gremlin
g.V().hasLabel("employee").where(__.not(out("reportsTo"))).
repeat(__.in("reportsTo")).emit().tree().by(map
{def employee = it.get(); employee.value("firstName") + " " +
employee.value("lastName")}).next()
Cypher
MATCH p = (u:Employee)-[:ReportsTo*0..]->(top:Employee)
WHERE NOT (top)-[:ReportsTo]->()
OPTIONAL MATCH (u)-[:ReportsTo]->(s:Employee)
RETURN u.firstName AS FirstName, u.lastName AS LastName,
(s.firstName + " " + s.lastName) AS ReportsTo, length(p) + 1 AS
HierarchyLevel ORDER BY HierarchyLevel, LastName, FirstName
Based on work by http://sql2gremlin.com/
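For comparison, the same hierarchy walk can be sketched in a few lines of Python: start from employees with no supervisor and repeatedly follow the reports-to relationship downward, tracking the level. The employee records and the `hierarchy_levels` function are invented for illustration, and the sketch assumes each employee has at most one supervisor.

```python
from collections import defaultdict

# Invented sample data.
employees = [
    {"id": 1, "first": "Ana", "last": "Diaz", "reports_to": None},
    {"id": 2, "first": "Ben", "last": "Cole", "reports_to": 1},
    {"id": 3, "first": "Cy", "last": "Park", "reports_to": 1},
    {"id": 4, "first": "Di", "last": "Abel", "reports_to": 2},
]


def hierarchy_levels(employees):
    """Return (level, last name, first name) rows, sorted like the SQL."""
    reports = defaultdict(list)   # supervisor id -> direct reports
    for employee in employees:
        reports[employee["reports_to"]].append(employee)

    rows = []
    frontier = [(boss, 1) for boss in reports[None]]  # top of the hierarchy
    while frontier:
        employee, level = frontier.pop()
        rows.append((level, employee["last"], employee["first"]))
        frontier.extend((r, level + 1) for r in reports[employee["id"]])
    return sorted(rows)


rows = hierarchy_levels(employees)
for level, last, first in rows:
    print(level, last, first)
```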
37. Choosing a Datastore
● Framework vs. RDF vs. Property Model
● HA/Transaction Volume/Data Size
● Hosted vs On Premise
38. Datastore Concerns
● Data Consistency - ACID or BASE
● Explore your choices
● Beware the Operational Overhead
39. Data Modelling
● Whiteboard friendly - stay close to the conceptual model, but be pragmatic
● Take into account how you are traversing data
● Use your Relational model to start
● Iterate, Iterate, Iterate
40. Data Modelling Concerns
● Don’t use Symmetric Relationships
● Look out for Hidden/Anemic Relationships
● Look for Supernodes
● Schema - Use it and make it general
43. The Good
● Graphs are flexible
● Great at finding and traversing relationships
● Natural fit in many complex domains
● Query times are proportional to the amount of the graph you traverse
44. The Bad
● Different options scale very differently
● Team needs to learn a new mindset
● Still immature space
45. The Ugly
● Lack of documentation
● Large, splintered and rapidly evolving ecosystem
● Hard for new users to tell good versus bad use cases
46. Advice from the trenches...
● Graph datastores may solve your problem, but understand your problem first
● Expect some trial and error
● Your data model will evolve, plan for it
● Don’t underestimate the time it takes to bring your team up to speed
● Graph databases are not a silver bullet
Not an architect that just draws boxes and lines, I get my hands dirty by actually helping to build these things
Graph database popularity is up almost 800% since January of 2013
Leonhard Euler - 1735 - Seven Bridges of Königsberg
2 islands in the Pregel River with 7 bridges
Can you cross every bridge and return to the start without crossing any bridge twice?
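Euler's answer was no: in a connected graph, such a closed walk exists only if every land mass touches an even number of bridges. The Python sketch below checks that condition. The land-mass names (the two river banks and the islands Kneiphof and Lomse) and the bridge layout are filled in from the well-known historical puzzle rather than from the slides, and the function assumes the graph is connected.

```python
from collections import Counter

# The seven bridges, each as a pair of land masses it connects.
bridges = [
    ("north bank", "Kneiphof"), ("north bank", "Kneiphof"),
    ("south bank", "Kneiphof"), ("south bank", "Kneiphof"),
    ("Kneiphof", "Lomse"),
    ("north bank", "Lomse"),
    ("south bank", "Lomse"),
]


def bridge_counts(edges):
    """Number of bridges touching each land mass (its degree)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def has_euler_circuit(edges):
    """True if every node has even degree (assumes a connected graph)."""
    return all(d % 2 == 0 for d in bridge_counts(edges).values())


counts = bridge_counts(bridges)
walk_possible = has_euler_circuit(bridges)
print(dict(counts), walk_possible)
```

Every land mass here has an odd number of bridges, so the walk is impossible.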
A knowledge of Graph Theory may help but is not required
Lots of examples out there as to why use a graph database but these are just a few
The ecosystem is large and Growing
This slide currently shows 43. I originally put this out on Twitter and immediately had ~ 10 more additions of datastores I had never heard of
Lots of options out there
SPARQL is a Standard for RDF graphs, there is not one for Property Model Graphs
There is a movement out there called GQL to attempt to create a standard property model graph language
There are lots of tools to help you visualize your data
Don’t fall into the trap that the only way to view your data is as a node chart
Graphs are flexible. In general it is easy to extend your model with additional attributes and objects allowing data evolution at a rapid pace
Graphs are great for searching relationships between items, but make sure that's what you want to search
Graphs are a more natural data model in many domains
Graph processing times are proportional to the number of nodes and edges you choose to traverse, not the data size
Depending on the graph datastore, they scale differently in terms of transactions and data size, many are single server only
It is a different mindset your team has to learn, and learning is not a cheap process
Graph databases are still not as mature as RDBMS systems
There is a lot of documentation for neophyte and expert users, not much in between
The ecosystem is vast, splintered and constantly evolving.
Graph databases are great for some use cases, horrible for others and it's not always easy to tell which you are in