Ready to leverage the power of a graph database to bring your application to the next level, but all the data is still stuck in a legacy relational database?
Fortunately, Neo4j offers several ways to import relational data into a suitable graph model quickly and efficiently. Export the subset of the data you want to import, then either ingest it with an initial bulk loader in seconds or minutes, or apply the power of Cypher to place your relational data transactionally into the right parts of your graph model.
In this webinar, Michael will also demonstrate a simple tool that can load relational data directly into Neo4j, automatically transforming it into a graph representation of your normalized entity-relationship model.
4. Webinar Review – Relational to Graph
• Introduction and Overview
• Introduction of Neo4j, Solving RDBMS Issues, Northwind Demo
• Modeling Concerns
• Modeling in Graphs and RDBMS, Good Modeling Practices
• Model First, Incremental Modeling, Model Transformation (Rules)
• Import
• Importing into Neo4j, Getting Data from RDBMS, Concrete Examples
• NEXT: Querying
• SQL to Cypher, Comparison, Example Queries, Hard in SQL -> Easy and Fast in Cypher
6. Relational DBs Can’t Handle Relationships Well
• Cannot model or store data and relationships without complexity
• Performance degrades with the number and levels of relationships, and with database size
• Query complexity grows with the need for JOINs
• Adding new types of data and relationships requires schema redesign, increasing time to market
… making traditional databases inappropriate when data relationships are valuable in real-time:
slow development, poor performance, low scalability, hard to maintain.
7. Unlocking Value from Your Data Relationships
• Model your data naturally as a graph of data and relationships
• Drive the graph model from domain and use-cases
• Use relationship information in real-time to transform your business
• Add new relationships on the fly to adapt to your changing requirements
8. High Query Performance with a Native Graph DB
• Relationships are first-class citizens
• No need for joins: just follow pre-materialized relationships of nodes
• Query & data-locality: navigate out from your starting points
• Only load what's needed
• Aggregate and project results as you go
• Optimized disk and memory model for graphs
10. Getting Data into Neo4j: CSV
Cypher-Based “LOAD CSV” Capability
• Transactional (ACID) writes
• Initial and incremental loads of up to 10 million nodes and relationships
• From HTTP and files
• Power of Cypher
• Create and Update Graph Structures
• Data conversion, filtering, aggregation
• Destructuring of Input Data
• Transaction Size Control
• Also via Neo4j-Shell
11. Getting Data into Neo4j: CSV
Command-Line Bulk Loader neo4j-import
• For initial database population
• Scale across CPUs and disk performance
• Efficient RAM usage
• Split- and compressed file support
• For loads up to 10B+ records
• Up to 1M records per second
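As a sketch only (file names and the target database path are placeholders), a neo4j-import invocation of this era could look like:

```shell
# initial bulk load into a fresh store; the target directory must be empty
neo4j-import --into /data/graph.db \
  --nodes nodes.csv \
  --relationships rels.csv
```

The tool refuses to run against an existing database, which is why it is only suitable for initial population.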
12. Getting Data into Neo4j: APIs
Custom Cypher-Based Loader
• Uses transactional Cypher http endpoint
• Parameterized, batched, concurrent Cypher statements
• Any programming/script language with a driver, or plain HTTP requests
• Also for JSON and other formats
• Also available as JDBC Driver
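A minimal sketch of the parameterized, batched pattern over the transactional endpoint: the client posts one Cypher statement with a {batch} parameter holding many rows (the label and property names here are hypothetical):

```cypher
// one transactional request creates or updates many nodes at once
UNWIND {batch} AS row
MERGE (p:Person {id: row.id})
SET p.name = row.name
```

Each HTTP request can carry thousands of rows in the parameter, and several such requests can run concurrently from any language with a driver.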
13. Getting Data into Neo4j: APIs
JVM Transactional Loader
• Use Neo4j’s Java-API
• From any JVM language, concurrent
• Fine grained TX Management
• Create Nodes and Relationships directly
• Also possible as Server extension
• Arbitrary data loading
14. Getting Data into Neo4j: API
Bulk Loader API
• Used by neo4j-import tool
• Create streams of node and relationship data
• Id-groups, id-handling & generation, conversions
• Highly concurrent and memory efficient
• High performance CSV Parser, Decorators
15. Import Performance: Some Numbers
• Cypher import: 10k-10M records
• Import 100K-100M records per second transactionally
• Bulk import tens of billions of records in a few hours
16. Import Performance: Hardware Requirements
• Fast disk: SSD or SSD RAID
• Many cores
• Medium amount of RAM (8-64 GB)
• Local data files; compress to save space
• High-performance concurrent connection to the relational DB
• Linux, OS X work better than Windows (FS handling)
• Disable virus scanners, check the disk scheduler
18. Accessing Relational Data
• Dump to CSV: all relational databases can dump query results and tables to CSV
• Access with a DB driver: pull out selected datasets via JDBC/ODBC or another driver
• Use built-in or external endpoints: some databases expose HTTP APIs or can be integrated (DataClips)
• Use ETL tools: existing ETL tools can read from relational databases and write to Neo4j, e.g. via JDBC
22. Data Quality – Beware of Real World Data !
• Messy! Don't trust the data:
• Byte order mark (BOM)
• Binary zeros, non-text characters
• Inconsistent line breaks
• Header inconsistent with data
• Special characters in non-quoted text
• Unexpected newlines in quoted and unquoted text fields
• Stray quotes
23. CSV – Workhorse of Data Exchange
• Most databases, ETL and office tools can read and write CSV
• The format is only loosely specified
• Problems with quotes, newlines, charsets
• Some good checking tools (CSVKit)
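For example, CSVKit (the file name here is assumed) can sanity-check an export before loading:

```shell
csvclean -n address.csv   # dry run: report malformed rows without writing output files
csvstat address.csv       # per-column summary: inferred types, nulls, unique values
```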
24. Address Dataset
• Exported as a large JOIN between
• City
• Zip
• Street
• Number
• Enterprise
• address.csv:
EntityNumber TypeOfAddress Zipcode MunicipalityNL StreetNL StreetFR HouseNr
200.065.765 REGO 9070 Destelbergen Dendermondesteenweg Dendermondesteenweg 430
200.068.636 REGO 9000 Gent Stropstraat Stropstraat 1
25. LOAD CSV
// create constraints
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (z:Zip) ASSERT z.name IS UNIQUE;
// manage tx
USING PERIODIC COMMIT 50000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.MunicipalityNL) AS city, toUpper(csv.Zipcode) AS zip
// create nodes
MERGE (:City {name: city})
MERGE (:Zip {name: zip});
26. LOAD CSV
// manage tx
USING PERIODIC COMMIT 100000
// load csv row by row
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
// transform values
WITH DISTINCT toUpper(csv.MunicipalityNL) AS city, toUpper(csv.Zipcode) AS zip
// find nodes
MATCH (c:City {name: city}), (z:Zip {name: zip})
// create relationships
MERGE (c)-[:HAS_ZIP_CODE]->(z);
27. LOAD CSV Considerations
• Provide enough memory (heap & page-cache)
• Make sure your data is clean
• Create indexes and constraints upfront
• Use Labels for Matching
• DISTINCT, SKIP, LIMIT to control data volume
• Test with small batch
• Use PERIODIC COMMIT for larger volumes (> 20k)
• Beware of the EAGER Operation
• Will pull in all your CSV data
• Use EXPLAIN to detect it
Simplest LOAD CSV Example | Guide Import CSV | RDBMS ETL Guide
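To check for the Eager operator, prefix the import statement with EXPLAIN, which compiles the query plan without touching any data:

```cypher
EXPLAIN
LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv
MERGE (:City {name: csv.MunicipalityNL});
```

If an Eager operator shows up in the plan, the statement will pull in all CSV rows before writing; splitting it into several simpler statements usually removes it.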
30. Neo4j Bulk Import Tool
• Memory efficient and scalable Bulk-Inserter
• Proven to work well for billions of records
• Easy to use, no memory configuration needed
CSV
Reference Manual: Import Tool
31. Chicago Crime Dataset
• City of Chicago, Crime Data since 2001
• Go to Website, download dataset
• Prepare Dataset, Cleanup
• Specify Headers (direct or separate file)
• ID-definition, data-types, labels, rel-types
• Import (30-50s)
• Use!
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
http://markhneedham.com/blog?s=Chicago+Crime
32. Chicago Crime Dataset
• crimeTypes.csv
• Types of crimes
• beats.csv
• Police areas
• crimes.csv
• Crime description
• crimesBeats.csv
• In which beat did a crime happen
• crimesPrimaryTypes.csv
• Primary Type assignment
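Headers for the import tool declare ids, labels and relationship endpoints. Hypothetical header lines for these files (the column names are assumptions, only the :ID/:LABEL/:START_ID/:END_ID/:TYPE syntax is the tool's) could be:

```
crimes.csv:        id:ID(Crime),description,date
crimeTypes.csv:    id:ID(CrimeType),:LABEL,name
crimesBeats.csv:   :START_ID(Crime),:END_ID(Beat),:TYPE
```

The id-groups in parentheses (Crime, Beat, …) keep ids from different files in separate namespaces.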
38. Normalized ER-Models: Transformation Rules
• Tables become nodes
• Table name as node-label
• Columns turn into properties
• Convert values if needed
• Foreign keys (1:1, 1:n, n:1) become relationships; the column name becomes the relationship-type (or better, a verb)
• JOIN-Tables represent relationships
• Also other tables without domain identity (w/o PK) and two FKs
• Columns turn into relationship properties
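Applying the JOIN-table rule with LOAD CSV might look like this sketch (the table and column names are made up for illustration):

```cypher
// hypothetical order_items join table: order_id, product_id, quantity
LOAD CSV WITH HEADERS FROM "file:order_items.csv" AS row
MATCH (o:Order {id: row.order_id})
MATCH (p:Product {id: row.product_id})
MERGE (o)-[r:CONTAINS]->(p)
SET r.quantity = toInt(row.quantity);
```

The join table disappears as an entity: its two foreign keys become the relationship's endpoints, and its remaining columns become relationship properties.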
39. Normalized ER-Models: Cleanup Rules
• Remove technical IDs (auto-incrementing PKs)
• Keep domain IDs (e.g. ISBN)
• Add constraints for those
• Add indexes for lookup fields
• Adjust names for Label, REL_TYPE and propertyName
Note: currently no composite constraints and indexes
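The cleanup rules translate into schema statements like these (Book/isbn is an assumed example of a domain id):

```cypher
// keep the domain id (ISBN) and enforce it; drop the auto-increment PK on import
CREATE CONSTRAINT ON (b:Book) ASSERT b.isbn IS UNIQUE;
// plain index for a lookup field
CREATE INDEX ON :Book(title);
```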
40. RDBMS Import Tool Demo – Proof of Concept
• JDBC for vendor-independent database connection
• SchemaCrawler to extract DB-Meta-Data
• Use Rules to drive graph model import
• Optional means to override default behavior
• Scales writes with Parallel Batch Importer API
• Reads tables concurrently for nodes & relationships
Demo: MySQL - Employee Demo Database
Source: github.com/jexp/neo4j-rdbms-import
Blog Post
Postgres, MySQL, Oracle
43. Three Ways to Migrate Data to Neo4j
• Migrate all data: all data moves from the relational database to the graph database
• Migrate graph data: graph-shaped data moves to the graph database; non-graph data stays in the relational database
• Duplicate graph data: all data stays in the relational database; the graph-shaped subset is copied to the graph database
44. Neo4j Fits into Your Enterprise Environment
• Application: data storage and business-rules execution on a graph database cluster (Neo4j x3)
• Bulk analytic infrastructure: data mining and aggregation via a graph compute engine, EDW, …
• Data scientists run ad-hoc analysis; end users work through the application
• Source databases: relational, NoSQL, Hadoop
Presenter Notes - Challenges with current technologies?
Database options are not suited to model or store data as a network of relationships
Performance degrades with the number and levels of relationships, making it harder to use for real-time applications
Not flexible to add or change relationships in real-time
Presenter Notes - How does one take advantage of data relationships for real-time applications?
To take advantage of relationships
Data needs to be available as a network of connections (or as a graph)
Real-time access to relationship information should be available regardless of the size of data set or number and complexity of relationships
The graph should be able to accommodate new relationships or modify existing ones
In the near future, many of your apps will be driven by data relationships and not transactions
You can unlock value from business relationships with Neo4j