Ready to leverage the power of a graph database to bring your application to the next level, but all the data is still stuck in a legacy relational database?
Fortunately, Neo4j offers several ways to import relational data into a suitable graph model quickly and efficiently. Export the subset of the data you want to import, then either ingest it with an initial bulk loader in seconds or minutes, or apply the power of Cypher to place your relational data transactionally into the right parts of your graph model.
In this webinar, Michael will also demonstrate a simple tool that can load relational data directly into Neo4j, automatically transforming it into a graph representation of your normalized entity-relationship model.
4. Webinar Review – Relational to Graph
• Introduction and Overview
• Introduction of Neo4j, Solving RDBMS Issues, Northwind Demo
• Modeling Concerns
• Modeling in Graphs and RDBMS, Good Modeling Practices
• Model First, Incremental Modeling, Model Transformation (Rules)
• Import
• Importing into Neo4j, Getting Data from RDBMS, Concrete Examples
• NEXT: Querying
• SQL to Cypher, Comparison, Example Queries, Hard in SQL -> Easy and Fast in Cypher
6. Relational DBs Can’t Handle Relationships Well
• Cannot model or store data and relationships without complexity
• Performance degrades with the number and levels of relationships, and with database size
• Query complexity grows with the need for JOINs
• Adding new types of data and relationships requires schema redesign, increasing time to market
… making traditional databases inappropriate when data relationships are valuable in real-time:
slow development, poor performance, low scalability, hard to maintain.
7. Unlocking Value from Your Data Relationships
• Model your data naturally as a graph of data and relationships
• Drive the graph model from domain and use-cases
• Use relationship information in real-time to transform your business
• Add new relationships on the fly to adapt to your changing requirements
8. High Query Performance with a Native Graph DB
• Relationships are first-class citizens
• No need for joins: just follow pre-materialized relationships of nodes
• Query & data-locality: navigate out from your starting points
• Only load what's needed
• Aggregate and project results as you go
• Optimized disk and memory model for graphs
10. Getting Data into Neo4j: CSV
Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
• From HTTP and files
• Power of Cypher
• Create and Update Graph Structures
• Data conversion, filtering, aggregation
• Destructuring of Input Data
• Transaction Size Control
• Also via Neo4j-Shell
11. Getting Data into Neo4j: CSV
Command-Line Bulk Loader neo4j-import
• For initial database population
• Scale across CPUs and disk performance
• Efficient RAM usage
• Split- and compressed file support
• For loads up to 10B+ records
• Up to 1M records per second
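As a sketch only (file names and the target database path are placeholders), a neo4j-import invocation of this era could look like:

```shell
# initial bulk load into a fresh store; the target directory must be empty
neo4j-import --into /data/graph.db \
  --nodes nodes.csv \
  --relationships rels.csv
```

The tool refuses to run against an existing database, which is why it is only suitable for initial population.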
12. Getting Data into Neo4j: APIs
Custom Cypher-Based Loader
• Uses transactional Cypher http endpoint
• Parameterized, batched, concurrent Cypher statements
• Any programming/script language with a driver, or plain HTTP requests
• Also for JSON and other formats
• Also available as JDBC Driver
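A minimal sketch of the parameterized, batched pattern over the transactional endpoint: the client posts one Cypher statement with a {batch} parameter holding many rows (the label and property names here are hypothetical):

```cypher
// one transactional request creates or updates many nodes at once
UNWIND {batch} AS row
MERGE (p:Person {id: row.id})
SET p.name = row.name
```

Each HTTP request can carry thousands of rows in the parameter, and several such requests can run concurrently from any language with a driver.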
13. Getting Data into Neo4j: APIs
JVM Transactional Loader
• Use Neo4j’s Java-API
• From any JVM language, concurrent
• Fine grained TX Management
• Create Nodes and Relationships directly
• Also possible as Server extension
• Arbitrary data loading
14. Getting Data into Neo4j: API
Bulk Loader API
• Used by neo4j-import tool
• Create streams of node and relationship data
• Id-groups, id-handling & generation, conversions
• Highly concurrent and memory efficient
• High performance CSV Parser, Decorators
15. Import Performance: Some Numbers
• Cypher import: 10k-10M records
• Import 100K-100M records per second transactionally
• Bulk import tens of billions of records in a few hours
16. Import Performance: Hardware Requirements
• Fast disk: SSD or SSD RAID
• Many cores
• Medium amount of RAM (8-64 GB)
• Local data files; compress to save space
• High-performance concurrent connection to the relational DB
• Linux, OS X work better than Windows (FS handling)
• Disable virus scanners, check the disk scheduler
18. Accessing Relational Data
• Dump to CSV: all relational databases can dump query results and tables to CSV
• Access with a DB driver: pull out selected datasets via JDBC/ODBC or another driver
• Use built-in or external endpoints: some databases expose HTTP APIs or can be integrated (DataClips)
• Use ETL tools: existing ETL tools can read from relational databases and write to Neo4j, e.g. via JDBC
22. Data Quality – Beware of Real World Data !
• Messy! Don't trust the data:
• Byte order mark (BOM)
• Binary zeros, non-text characters
• Inconsistent line breaks
• Header inconsistent with data
• Special characters in non-quoted text
• Unexpected newlines in quoted and unquoted text fields
• Stray quotes
23. CSV – Workhorse of Data Exchange
• Most databases, ETL and office tools can read and write CSV
• The format is only loosely specified
• Problems with quotes, newlines, charsets
• Some good checking tools (CSVKit)
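For example, CSVKit (the file name here is assumed) can sanity-check an export before loading:

```shell
csvclean -n address.csv   # dry run: report malformed rows without writing output files
csvstat address.csv       # per-column summary: inferred types, nulls, unique values
```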
24. Address Dataset
• Exported as a large JOIN between
• City
• Zip
• Street
• Number
• Enterprise
• address.csv:
EntityNumber TypeOfAddress Zipcode MunicipalityNL StreetNL StreetFR HouseNr
200.065.765 REGO 9070 Destelbergen Dendermondesteenweg Dendermondesteenweg 430
200.068.636 REGO 9000 Gent Stropstraat Stropstraat 1
25. LOAD CSV
// create constraints
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (z:Zip) ASSERT z.name IS UNIQUE;
// manage tx
USING PERIODIC COMMIT 50000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.MunicipalityNL) AS city, toUpper(csv.Zipcode) AS zip
// create nodes
MERGE (:City {name: city})
MERGE (:Zip {name: zip});
26. LOAD CSV
// manage tx
USING PERIODIC COMMIT 100000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.MunicipalityNL) AS city, toUpper(csv.Zipcode) AS zip
// find nodes
MATCH (c:City {name: city}), (z:Zip {name: zip})
// create relationships
MERGE (c)-[:HAS_ZIP_CODE]->(z);
27. LOAD CSV Considerations
• Provide enough memory (heap & page-cache)
• Make sure your data is clean
• Create indexes and constraints upfront
• Use Labels for Matching
• DISTINCT, SKIP, LIMIT to control data volume
• Test with small batch
• Use PERIODIC COMMIT for larger volumes (> 20k)
• Beware of the EAGER Operation
• Will pull in all your CSV data
• Use EXPLAIN to detect it
Simplest LOAD CSV Example | Guide Import CSV | RDBMS ETL Guide
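To check for the Eager operator, prefix the import statement with EXPLAIN, which compiles the query plan without touching any data:

```cypher
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
MERGE (:City {name: csv.MunicipalityNL});
```

If an Eager operator shows up in the plan, the statement will pull in all CSV rows before writing; splitting it into several simpler statements usually removes it.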
30. Neo4j Bulk Import Tool
• Memory efficient and scalable Bulk-Inserter
• Proven to work well for billions of records
• Easy to use, no memory configuration needed
CSV
Reference Manual: Import Tool
31. Chicago Crime Dataset
• City of Chicago, Crime Data since 2001
• Go to Website, download dataset
• Prepare Dataset, Cleanup
• Specify Headers (direct or separate file)
• ID-definition, data-types, labels, rel-types
• Import (30-50s)
• Use!
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
http://markhneedham.com/blog?s=Chicago+Crime
32. Chicago Crime Dataset
• crimeTypes.csv
• Types of crimes
• beats.csv
• Police areas
• crimes.csv
• Crime description
• crimesBeats.csv
• In which beat did a crime happen
• crimesPrimaryTypes.csv
• Primary Type assignment
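Headers for the import tool declare ids, labels and relationship endpoints. Hypothetical header lines for these files (the column names are assumptions, only the :ID/:LABEL/:START_ID/:END_ID/:TYPE syntax is the tool's) could be:

```
crimes.csv:        id:ID(Crime),description,date
crimeTypes.csv:    id:ID(CrimeType),:LABEL,name
crimesBeats.csv:   :START_ID(Crime),:END_ID(Beat),:TYPE
```

The id-groups in parentheses (Crime, Beat, …) keep ids from different files in separate namespaces.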
38. Normalized ER-Models: Transformation Rules
• Tables become nodes
• Table name as node-label
• Columns turn into properties
• Convert values if needed
• Foreign keys (1:1, 1:n, n:1) become relationships; the column name becomes the relationship-type (or better, a verb)
• JOIN-Tables represent relationships
• Also other tables without domain identity (w/o PK) and two FKs
• Columns turn into relationship properties
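Applying the JOIN-table rule with LOAD CSV might look like this sketch (the table and column names are made up for illustration):

```cypher
// hypothetical order_items join table: order_id, product_id, quantity
LOAD CSV WITH HEADERS FROM "file:order_items.csv" AS row
MATCH (o:Order {id: row.order_id})
MATCH (p:Product {id: row.product_id})
MERGE (o)-[r:CONTAINS]->(p)
SET r.quantity = toInt(row.quantity);
```

The join table disappears as an entity: its two foreign keys become the relationship's endpoints, and its remaining columns become relationship properties.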
39. Normalized ER-Models: Cleanup Rules
• Remove technical IDs (auto-incrementing PKs)
• Keep domain IDs (e.g. ISBN)
• Add constraints for those
• Add indexes for lookup fields
• Adjust names for Label, REL_TYPE and propertyName
Note: currently no composite constraints and indexes
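The cleanup rules translate into schema statements like these (Book/isbn is an assumed example of a domain id):

```cypher
// keep the domain id (ISBN) and enforce it; drop the auto-increment PK on import
CREATE CONSTRAINT ON (b:Book) ASSERT b.isbn IS UNIQUE;
// plain index for a lookup field
CREATE INDEX ON :Book(title);
```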
40. RDBMS Import Tool Demo – Proof of Concept
• JDBC for vendor-independent database connection
• SchemaCrawler to extract DB-Meta-Data
• Use Rules to drive graph model import
• Optional means to override default behavior
• Scales writes with Parallel Batch Importer API
• Reads tables concurrently for nodes & relationships
Demo: MySQL - Employee Demo Database
Source: github.com/jexp/neo4j-rdbms-import
Blog Post
Postgres, MySQL, Oracle
43. Three Ways to Migrate Data to Neo4j
• Migrate all data: all data moves from the relational database to the graph database
• Migrate graph data: graph-shaped data moves to the graph database; non-graph data stays in the relational database
• Duplicate graph data: all data stays in the relational database; the graph-shaped subset is copied to the graph database
44. Neo4j Fits into Your Enterprise Environment
• Application: data storage and business-rules execution on a graph database cluster (Neo4j x3)
• Bulk analytic infrastructure: data mining and aggregation via a graph compute engine, EDW, …
• Data scientists run ad-hoc analysis; end users work through the application
• Source databases: relational, NoSQL, Hadoop
Presenter Notes - Challenges with current technologies?
Database options are not suited to model or store data as a network of relationships
Performance degrades with the number and levels of relationships, making it harder to use for real-time applications
Not flexible to add or change relationships in real-time
Presenter Notes - How does one take advantage of data relationships for real-time applications?
To take advantage of relationships
Data needs to be available as a network of connections (or as a graph)
Real-time access to relationship information should be available regardless of the size of data set or number and complexity of relationships
The graph should be able to accommodate new relationships or modify existing ones
In the near future, many of your apps will be driven by data relationships and not transactions
You can unlock value from business relationships with Neo4j