Graph Databases for Highly Connected Data

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Graph and Amazon Neptune
Bill Baldwin
bbaldwin@amazon.com
Global Enterprise Support Leader

HIGHLY CONNECTED DATA
Retail Fraud Detection
Restaurant Recommendations
Social Networks

USE CASES FOR HIGHLY CONNECTED DATA
Social Networking
Recommendations
Knowledge Graphs
Fraud Detection
Life Sciences
Network & IT Operations

RECOMMENDATIONS BASED ON RELATIONSHIPS

KNOWLEDGE GRAPH APPLICATIONS
What museums should Alice
visit while in Paris?
Who painted the Mona Lisa?
What artists have paintings
in The Louvre?

NAVIGATE A WEB OF GLOBAL TAX POLICIES
“Our customers are increasingly required to navigate a complex web of global tax policies and
regulations. We need an approach to model the sophisticated corporate structures of our
largest clients and deliver an end-to-end tax solution. We use a microservices architecture
approach for our platforms and are beginning to leverage Amazon Neptune as a graph-based
system to quickly create links within the data.”
said Tim Vanderham, chief technology officer, Thomson Reuters Tax & Accounting

RELATIONAL DATABASE CHALLENGES BUILDING APPS WITH HIGHLY CONNECTED DATA
Unnatural for querying graph
Inefficient graph processing
Rigid schema inflexible for changing data

DIFFERENT APPROACHES FOR HIGHLY CONNECTED DATA
Purpose-built for a business process
Purpose-built to answer questions about relationships

A GRAPH DATABASE IS OPTIMIZED FOR EFFICIENT STORAGE AND RETRIEVAL OF HIGHLY CONNECTED DATA

LEADING GRAPH MODELS AND FRAMEWORKS
PROPERTY GRAPH
Open Source Apache TinkerPop
Gremlin Traversal Language
RESOURCE DESCRIPTION FRAMEWORK (RDF)
W3C Standard
SPARQL Query Language

CHALLENGES OF EXISTING GRAPH DATABASES
Difficult to maintain high availability
Difficult to scale
Limited support for open standards
Too expensive

AMAZON NEPTUNE
Fully managed graph database
FAST
Query billions of relationships with millisecond latency
RELIABLE
6 replicas of your data across 3 AZs with full backup and restore
EASY
Build powerful queries easily with Gremlin and SPARQL
OPEN
Supports Apache TinkerPop & W3C RDF graph models

AMAZON NEPTUNE HIGH LEVEL ARCHITECTURE
Bulk load from Amazon S3
Database Mgmt.

PROPERTY GRAPH
A property graph is a set of vertices and edges, each with its own properties (i.e. key/value pairs)
• A vertex represents an entity or domain
• An edge represents a directional relationship between vertices
• Each edge has a label that denotes the type of relationship
• Each vertex and edge has a unique identifier
• Vertices and edges can have properties
• Properties express non-relational information about the vertices and edges
[Figure: User (name: Bill) and User (name: Sarah) joined by a FRIEND edge with the property "Since 11/29/16"]
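
As a rough illustration of the model described above, a property graph can be sketched in plain Python. The names `Vertex`, `Edge`, `PropertyGraph`, and `out_neighbors` are hypothetical and are not part of TinkerPop or Neptune; the data mirrors the Bill and Sarah example, and the FRIEND edge direction (Bill to Sarah) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int                      # unique identifier
    label: str                   # kind of entity, e.g. "User"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: int                      # unique identifier
    label: str                   # type of relationship, e.g. "FRIEND"
    out_v: int                   # id of the vertex the edge leaves
    in_v: int                    # id of the vertex the edge enters
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory property graph, for illustration only."""

    def __init__(self):
        self.vertices = {}
        self.edges = {}

    def add_vertex(self, id, label, **properties):
        if id in self.vertices:
            raise ValueError(f"duplicate vertex id {id}")
        self.vertices[id] = Vertex(id, label, properties)
        return self.vertices[id]

    def add_edge(self, id, label, out_v, in_v, **properties):
        if id in self.edges:
            raise ValueError(f"duplicate edge id {id}")
        if out_v not in self.vertices or in_v not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges[id] = Edge(id, label, out_v, in_v, properties)
        return self.edges[id]

    def out_neighbors(self, vertex_id, label):
        # Follow edges with the given label in their stored direction.
        return [
            self.vertices[e.in_v]
            for e in self.edges.values()
            if e.out_v == vertex_id and e.label == label
        ]


graph = PropertyGraph()
graph.add_vertex(1, "User", name="Bill")
graph.add_vertex(2, "User", name="Sarah")
graph.add_edge(21, "FRIEND", 1, 2, since="11/29/16")

friend_names = [v.properties["name"] for v in graph.out_neighbors(1, "FRIEND")]
print(friend_names)
```

The label sits on the edge itself, while "since" is an ordinary key/value property, which is the distinction the bullets draw between relationship type and non-relational information.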

PROPERTY GRAPH & APACHE TINKERPOP
• Apache TinkerPop
Open source graph computing framework for property graphs
• Gremlin
Graph traversal language used to analyze the graph
Amazon Neptune is fully compatible with TinkerPop Gremlin 3.3.0 (latest version released August 2018) and provides an optimized query execution engine for the Gremlin query language.

CREATING A TINKERPOP GRAPH
// Connect to Neptune and receive a remote graph, g.
user1 = g.addVertex(id, 1, label, "User", "name", "Bill");
user2 = g.addVertex(id, 2, label, "User", "name", "Sarah");
...
user1.addEdge("FRIEND", user2, id, 21);
Gremlin (Apache TinkerPop 3.3)
[Figure: User (name: Bill) -FRIEND-> User (name: Sarah)]

RDF GRAPHS
• RDF Graphs are described as a collection of triples: subject, predicate, and object.
• Internationalized Resource Identifiers (IRIs) uniquely identify subjects.
• The Object can be an IRI or Literal.
• A Literal in RDF is like a property and RDF supports the XML data types.
• When the Object is an IRI, it forms an “Edge” in the graph.
<http://www.socialnetwork.com/person#1>          # subject (IRI)
    rdf:type contacts:User ;
    contacts:name "Bill" .                       # predicate, object (literal)
[Figure: User vertex with the property name: Bill]
<http://www.socialnetwork.com/person#1>          # subject (IRI)
    contacts:friend                              # predicate
    <http://www.socialnetwork.com/person#2> .    # object (IRI)
[Figure: #1 -FRIEND-> #2]

“THERE’S NO TROUBLE WITH TRIPLES”: RDF
EXAMPLE
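@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix contacts: <http://www.socialnetwork.com/people#> .
<http://www.socialnetwork.com/person#1> rdf:type contacts:User ;
    contacts:name "Bill" .
<http://www.socialnetwork.com/person#1> contacts:friend <http://www.socialnetwork.com/person#2> .
<http://www.socialnetwork.com/person#2> rdf:type contacts:User ;
    contacts:name "Sarah" .
RDF (Turtle Serialization)
[Figure: User (name: Bill) -FRIEND-> User (name: Sarah)]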
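
To make the triple model concrete, here is a minimal Python sketch that stores the Bill and Sarah statements as subject, predicate, object triples and separates "edges" (object is an IRI) from "properties" (object is a literal). The `IRI`, `Literal`, and `Triple` classes are hypothetical helpers, not an RDF library API, and no datatypes or language tags are modeled.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    subject: IRI
    predicate: IRI
    object: Union[IRI, Literal]


CONTACTS = "http://www.socialnetwork.com/people#"
RDF_TYPE = IRI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
bill = IRI("http://www.socialnetwork.com/person#1")
sarah = IRI("http://www.socialnetwork.com/person#2")

triples = [
    Triple(bill, RDF_TYPE, IRI(CONTACTS + "User")),
    Triple(bill, IRI(CONTACTS + "name"), Literal("Bill")),
    Triple(bill, IRI(CONTACTS + "friend"), sarah),
    Triple(sarah, IRI(CONTACTS + "name"), Literal("Sarah")),
]

# An IRI object links two resources (an edge); a literal object is a property value.
edge_triples = [t for t in triples if isinstance(t.object, IRI)]
property_triples = [t for t in triples if isinstance(t.object, Literal)]
friend_edges = [t for t in edge_triples if t.predicate == IRI(CONTACTS + "friend")]

print(len(edge_triples), len(property_triples))
```

Note that the `rdf:type` statement also counts as an edge under this rule, because its object (`contacts:User`) is an IRI.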

GRAPH VS. RELATIONAL DATABASE MODELING.
* Source : http://www.playnexacro.com/index.html#show:article
Relational model
[Figure: relational schema diagram]
Graph model subset
[Figure: Customers (CompanyName: Acme, …) -PURCHASED-> Order (OrderDate: 8/1/2018, …) -HAS_DETAILS-> Order Details (UnitPrice: $179.99, …) -HAS_PRODUCT-> Product (ProductName: "Echo", …); Supplier (CompanyName: "Amazon", …) -SUPPLIES-> Product]

SQL RELATIONAL DATABASE QUERY
Find the names of companies that purchased the 'Echo'.
SELECT DISTINCT c.CompanyName
FROM customers AS c
JOIN orders AS o              /* Join the customer from the order */
  ON (c.CustomerID = o.CustomerID)
JOIN order_details AS od      /* Join the order details from the order */
  ON (o.OrderID = od.OrderID)
JOIN products AS p            /* Join the products from the order details */
  ON (od.ProductID = p.ProductID)
WHERE p.ProductName = 'Echo'; /* Find the product named 'Echo' */
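
The relational query above can be tried against a throwaway in-memory database. This is a sketch using Python's standard `sqlite3` module; the table layouts are reduced to the join columns, and the rows ('Globex', 'Kindle', and all ID values) are made-up sample data, not from the original deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (CustomerID INTEGER PRIMARY KEY, CompanyName TEXT);
    CREATE TABLE orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    CREATE TABLE order_details (OrderID INTEGER, ProductID INTEGER);
    CREATE TABLE products (ProductID INTEGER PRIMARY KEY, ProductName TEXT);

    -- Hypothetical sample rows: Acme buys 'Echo' twice, Globex buys 'Kindle'.
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO order_details VALUES (10, 100), (11, 100), (12, 101);
    INSERT INTO products VALUES (100, 'Echo'), (101, 'Kindle');
    """
)

rows = conn.execute(
    """
    SELECT DISTINCT c.CompanyName
    FROM customers AS c
    JOIN orders AS o ON (c.CustomerID = o.CustomerID)
    JOIN order_details AS od ON (o.OrderID = od.OrderID)
    JOIN products AS p ON (od.ProductID = p.ProductID)
    WHERE p.ProductName = 'Echo'
    """
).fetchall()

companies = [row[0] for row in rows]
print(companies)
conn.close()
```

Three joins are needed to walk from customer to product, one per relationship; this is the hop-per-join cost that the graph queries on the following slides express as edge traversals.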

SPARQL DECLARATIVE GRAPH QUERY
PREFIX sales_db: <http://sales.widget.com/>
SELECT DISTINCT ?comp_name WHERE {
  ?customer sales_db:HAS_ORDER ?order ;         # customer graph pattern
            sales_db:CompanyName ?comp_name .   # orders graph pattern
  ?order sales_db:HAS_DETAILS ?order_d .        # order details graph pattern
  ?order_d sales_db:HAS_PRODUCT ?product .      # products graph pattern
  ?product sales_db:ProductName "Echo" .
}
* Source : http://www.playnexacro.com/index.html#show:article

GREMLIN IMPERATIVE GRAPH TRAVERSAL
/* All products named "Echo" */
g.V().hasLabel('Product').has('ProductName', 'Echo')
.in('HAS_PRODUCT') /* Traverse to order details */
.in('HAS_DETAILS') /* Traverse to order */
.in('HAS_ORDER') /* Traverse to Customer */
.values('CompanyName').dedup() /* Unique Company Name */
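
To show what each step of the traversal above does, here is a small Python sketch that walks the same path over an in-memory edge list. The helper `in_step` only imitates the behaviour of Gremlin's `in()` step (follow edges backwards, from head to tail); it is not the TinkerPop API, and the vertex ids and the 'Globex' and 'Kindle' rows are hypothetical sample data.

```python
vertices = {
    "c1": {"label": "Customer", "CompanyName": "Acme"},
    "c2": {"label": "Customer", "CompanyName": "Globex"},
    "o1": {"label": "Order"},
    "o2": {"label": "Order"},
    "o3": {"label": "Order"},
    "d1": {"label": "OrderDetails"},
    "d2": {"label": "OrderDetails"},
    "d3": {"label": "OrderDetails"},
    "p1": {"label": "Product", "ProductName": "Echo"},
    "p2": {"label": "Product", "ProductName": "Kindle"},
}

# Directed edges as (tail, label, head).
edges = [
    ("c1", "HAS_ORDER", "o1"),
    ("c1", "HAS_ORDER", "o2"),
    ("c2", "HAS_ORDER", "o3"),
    ("o1", "HAS_DETAILS", "d1"),
    ("o2", "HAS_DETAILS", "d2"),
    ("o3", "HAS_DETAILS", "d3"),
    ("d1", "HAS_PRODUCT", "p1"),
    ("d2", "HAS_PRODUCT", "p1"),
    ("d3", "HAS_PRODUCT", "p2"),
]


def in_step(current, edge_label):
    """Move from each current vertex to the tails of its incoming edges."""
    return [
        tail
        for vertex in current
        for (tail, label, head) in edges
        if label == edge_label and head == vertex
    ]


def dedup(values):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(values))


products = [
    vid
    for vid, props in vertices.items()
    if props["label"] == "Product" and props.get("ProductName") == "Echo"
]
details = in_step(products, "HAS_PRODUCT")   # order details containing Echo
orders = in_step(details, "HAS_DETAILS")     # orders owning those details
customers = in_step(orders, "HAS_ORDER")     # customers who placed the orders
company_names = dedup(vertices[c]["CompanyName"] for c in customers)
print(company_names)
```

Acme is reached twice (once per order), which is why the final deduplication step is needed.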

TRIADIC CLOSURE – CLOSING TRIANGLES
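[Figure: Terry, Bill, and Sarah, with FRIEND edges closing the triangle between them]

RECOMMENDING NEW CONNECTIONS
[Figure: Terry]

IMMEDIATE FRIENDSHIPS
[Figure: Terry and Bill joined by a FRIEND edge]

MEANS AND MOTIVE
[Figure: Terry, Bill, and Sarah joined by two FRIEND edges]

RECOMMENDATION
[Figure: Terry, Bill, and Sarah joined by two FRIEND edges]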

Recommend New Connections
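g = graph.traversal()
g.V().has('name','Terry').as('user').           // find Terry
  both('FRIEND').aggregate('friends').          // collect Terry's friends
  both('FRIEND').                               // friends of those friends
  where(neq('user')).where(without('friends')). // exclude Terry and existing friends
  groupCount().by('name').                      // count paths to each candidate
  order(local).by(values, decr)                 // most shared friends first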

FIND TERRY
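g.V().has('name','Terry').as('user').

FIND TERRY’S FRIENDS
both('FRIEND').aggregate('friends').

AND THE FRIENDS OF THOSE FRIENDS
both('FRIEND').
[Figure: user -FRIEND- friend -FRIEND- fof]

...WHO AREN’T TERRY AND AREN’T FRIENDS WITH TERRY
where(neq('user')).where(without('friends')).
[Figure: user -FRIEND- friend -FRIEND- fof, with an X over the excluded candidates]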
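
The walk-through above (find the user, collect friends, step out to friends of friends, exclude the user and existing friends, then count and rank) can be sketched in plain Python. This illustrates the logic only and is not how Neptune executes the traversal; `recommend` is a hypothetical helper, and 'Ann' and 'Raj' are made-up people added so the ranking has more than one candidate.

```python
from collections import Counter, defaultdict

# Undirected FRIEND relationships. Terry, Bill, and Sarah come from the slides;
# Ann and Raj are hypothetical.
friendships = [
    ("Terry", "Bill"),
    ("Bill", "Sarah"),
    ("Terry", "Ann"),
    ("Ann", "Sarah"),
    ("Ann", "Raj"),
]

adjacency = defaultdict(set)
for a, b in friendships:
    adjacency[a].add(b)   # both('FRIEND') ignores edge direction,
    adjacency[b].add(a)   # so store each friendship both ways


def recommend(user):
    """Rank friends-of-friends by how many friends they share with `user`."""
    friends = adjacency[user]
    counts = Counter(
        fof
        for friend in friends
        for fof in adjacency[friend]
        if fof != user and fof not in friends
    )
    # Highest count first, like order(local).by(values, decr).
    return counts.most_common()


recommendations = recommend("Terry")
print(recommendations)
```

Sarah ranks first because she is reachable through two of Terry's friends, which is the triadic-closure intuition: the more triangles a new edge would close, the stronger the recommendation.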

Fully Managed Service
BENEFITS
Easily configurable via the console
Multi-AZ high availability
Support for up to 15 read replicas
Supports encryption at rest
Supports encryption in transit (TLS)
Backup and restore, point-in-time recovery

AMAZON NEPTUNE: VPC DEPLOYMENT
• Secure deployment in a VPC
• Increased availability through deployment in two subnets in two different Availability Zones (AZs)
• Cluster volume always spans three AZs to provide durable storage
• See the Amazon Neptune Documentation for VPC setup details

BATTLE-TESTED CLOUD-NATIVE STORAGE ENGINE OVERVIEW
Data is replicated 6 times across 3 Availability Zones
Continuous backup to Amazon S3 (built for 11 9s durability)
Continuous monitoring of nodes and disks for repair
10 GB segments as unit of repair or hotspot rebalance
Quorum system for read/write; latency tolerant
Quorum membership changes do not stall writes
Storage volume automatically grows up to 64 TB
[Figure: Amazon Neptune attached to six storage nodes spread across AZ 1, AZ 2, and AZ 3, with storage monitoring and backup to Amazon S3]

AMAZON NEPTUNE HIGH AVAILABILITY AND FAULT TOLERANCE (CLOUD-NATIVE STORAGE)
What can fail?
Segment failures (disks)
Node failures (machines)
AZ failures (network or datacenter)
Optimizations
4 out of 6 write quorum
3 out of 6 read quorum
Peer-to-peer replication for repairs
[Figure: two diagrams of Amazon Neptune with caching over storage in AZ 1, AZ 2, and AZ 3]
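
The quorum sizes above can be checked with a little arithmetic. The sketch below assumes the six copies are spread evenly, two per Availability Zone (the slides say only "6 replicas across 3 AZs"), and the function names are illustrative.

```python
from itertools import combinations

COPIES = 6
AZS = 3
COPIES_PER_AZ = COPIES // AZS   # assumption: copies are spread evenly, 2 per AZ
WRITE_QUORUM = 4
READ_QUORUM = 3


def can_write(failed_copies):
    return COPIES - failed_copies >= WRITE_QUORUM


def can_read(failed_copies):
    return COPIES - failed_copies >= READ_QUORUM


# Every read quorum shares at least one copy with every write quorum
# (3 + 4 > 6), so a read always sees the latest acknowledged write.
read_overlaps_write = all(
    set(w) & set(r)
    for w in combinations(range(COPIES), WRITE_QUORUM)
    for r in combinations(range(COPIES), READ_QUORUM)
)

# Any two write quorums also overlap (4 + 4 > 6), so writes cannot diverge.
writes_overlap = all(
    set(a) & set(b)
    for a in combinations(range(COPIES), WRITE_QUORUM)
    for b in combinations(range(COPIES), WRITE_QUORUM)
)

survives_az_loss_for_writes = can_write(COPIES_PER_AZ)        # 4 copies left
survives_az_plus_one_for_reads = can_read(COPIES_PER_AZ + 1)  # 3 copies left
az_plus_one_blocks_writes = not can_write(COPIES_PER_AZ + 1)  # 3 < 4

print(read_overlaps_write, writes_overlap,
      survives_az_loss_for_writes, survives_az_plus_one_for_reads,
      az_plus_one_blocks_writes)
```

Under the even-spread assumption, losing a whole AZ still leaves enough copies to write, and losing an AZ plus one more copy still leaves enough to read, from which the missing copies can be repaired.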

AMAZON NEPTUNE READ REPLICAS
Availability
• Failing database nodes are automatically detected and replaced
• Failing database processes are automatically detected and recycled
• Replicas are automatically promoted to primary if needed (failover)
• Customer specifiable fail-over order
Performance
• Customer applications can scale out read traffic across read replicas
• Read balancing across read replicas
[Figure: primary node and read replicas across AZ 1, AZ 2, and AZ 3, with cluster and instance monitoring]

AMAZON NEPTUNE FAILOVER TIMES ARE TYPICALLY < 30 SECONDS
[Figure: failover timeline for a replica-aware app: app running, database failure, failure detection, DNS propagation, recovery, app running; intervals labeled 15-20 sec and 3-10 sec]

AMAZON NEPTUNE CONTINUOUS BACKUP (CLOUD-NATIVE STORAGE)
• Take a periodic snapshot of each segment in parallel; stream the logs to Amazon S3
• Backup happens continuously without performance or availability impact
• At restore, retrieve the appropriate segment snapshots and log streams to storage nodes
• Apply log streams to segment snapshots in parallel and asynchronously
[Figure: timeline for Segment 1, Segment 2, and Segment 3 showing segment snapshots, log records, and the recovery point]
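
The restore steps above (pick the appropriate snapshot per segment, then replay log records up to the recovery point, segment by segment in parallel) can be sketched as follows. This is a toy model under stated assumptions: a segment's state is a plain dict, a log record is a `(time, key, value)` tuple, and `restore_segment` and `restore` are hypothetical names, not Neptune internals.

```python
from concurrent.futures import ThreadPoolExecutor


def restore_segment(snapshots, log_records, recovery_point):
    """Rebuild one segment's state as of `recovery_point`.

    snapshots: list of (snapshot_time, state_dict)
    log_records: list of (time, key, value)
    """
    usable = [s for s in snapshots if s[0] <= recovery_point]
    if not usable:
        raise ValueError("no snapshot at or before the recovery point")
    # The appropriate snapshot is the latest one not after the recovery point.
    snapshot_time, state = max(usable, key=lambda s: s[0])
    restored = dict(state)
    # Replay only the log records written after the snapshot, up to the target.
    for time, key, value in sorted(log_records, key=lambda r: r[0]):
        if snapshot_time < time <= recovery_point:
            restored[key] = value
    return restored


def restore(segments, recovery_point):
    """Restore every segment independently, in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(
                restore_segment, seg["snapshots"], seg["log"], recovery_point
            )
            for name, seg in segments.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Hypothetical sample data.
segments = {
    "segment-1": {
        "snapshots": [(0, {}), (10, {"a": 1})],
        "log": [(5, "a", 1), (12, "a", 2), (20, "b", 3)],
    },
    "segment-2": {
        "snapshots": [(0, {}), (8, {"x": 9})],
        "log": [(3, "x", 9), (15, "y", 7)],
    },
}

state_at_14 = restore(segments, recovery_point=14)
state_at_7 = restore(segments, recovery_point=7)
print(state_at_14, state_at_7)
```

Because each segment needs only its own snapshot and log stream, segments can be restored independently, which is what makes the parallel apply step possible.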

AMAZON NEPTUNE ONLINE POINT-IN-TIME RESTORE (CLOUD-NATIVE STORAGE)
Online point-in-time restore is a quick way to bring the database to a particular point in time without having to restore from backups
• Rewinding the database to quickly
• Rewind multiple times to determine the desired point-in-time in the database state
[Figure: timeline t0 to t4 showing a rewind to t1 followed by a rewind to t3; the rewound intervals are marked invisible]
