Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Relational to Graph - Import

Ready to leverage the power of a graph database to bring your application to the next level, but all the data is still stuck in a legacy relational database?

Fortunately, Neo4j offers several ways to quickly and efficiently import relational data into a suitable graph model. It's as simple as exporting the subset of the data you want to import and ingest it either with an initial loader in seconds or minutes or apply Cypher's power to put your relational data transactionally in the right places of your graph model.

In this webinar, Michael will also demonstrate a simple tool that can load relational data directly into Neo4j, automatically transforming it into a graph representation of your normalized entity-relationship model.

  • Login to see the comments

Relational to Graph - Import

  1. 1. Relational to Graph Importing Data into Neo4j June 2015 Michael Hunger |@mesirii
  2. 2. Agenda • Review Webinar Series • Importing Data into Neo4j • Getting Data from RDBMS • Concrete Examples • Demo • Q&A
  3. 3. Webinar Review Relational to Graph
  4. 4. Webinar Review – Relational to Graph • Introduction and Overview • Introduction of Neo4j, Solving RDBMS Issues, Northwind Demo • Modeling Concerns • Modeling in Graphs and RDBMS, Good Modeling Practices, • Model first, incremental Modeling, Model Transformation (Rules) • Import • Importing into Neo4j, Getting Data from RDBMS, Concrete Examples • NEXT: Querying • SQL to Cypher, Comparison, Example Queries, Hard in SQL -> Easy and Fast in Cypher
  5. 5. Why are we doing this? The Graph Advantage
  6. 6. Relational DBs Can’t Handle Relationships Well • Cannot model or store data and relationships without complexity • Performance degrades with number and levels of relationships, and database size • Query complexity grows with need for JOINs • Adding new types of data and relationships requires schema redesign, increasing time to market … making traditional databases inappropriate when data relationships are valuable in real-time Slow development Poor performance Low scalability Hard to maintain
  7. 7. Unlocking Value from Your Data Relationships • Model your data naturally as a graph of data and relationships • Drive graph model from domain and use-cases • Use relationship information in real- time to transform your business • Add new relationships on the fly to adapt to your changing requirements
  8. 8. High Query Performance with a Native Graph DB • Relationships are first class citizen • No need for joins, just follow pre- materialized relationships of nodes • Query & Data-locality – navigate out from your starting points • Only load what’s needed • Aggregate and project results as you go • Optimized disk and memory model for graphs
  9. 9. Importing into Neo4j APIs, Tools, Tricks
  10. 10. Getting Data into Neo4j: CSV Cypher-Based “LOAD CSV” Capability • Transactional (ACID) writes • Initial and incremental loads of up to 10 million nodes and relationships • From HTTP and Files • Power of Cypher • Create and Update Graph Structures • Data conversion, filtering, aggregation • Destructuring of Input Data • Transaction Size Control • Also via Neo4j-Shell CSV 10 M
  11. 11. Getting Data into Neo4j: CSV Command-Line Bulk Loader neo4j-import • For initial database population • Scale across CPUs and disk performance • Efficient RAM usage • Split- and compressed file support • For loads up to 10B+ records • Up to 1M records per second CSV 100 B
  12. 12. Getting Data into Neo4j: APIs Custom Cypher-Based Loader • Uses transactional Cypher http endpoint • Parameterized, batched, concurrent Cypher statements • Any programming/script language with driver or plain http requests • Also for JSON and other formats • Also available as JDBC Driver Any Data Program Program Program 10 M
  13. 13. Getting Data into Neo4j: APIs JVM Transactional Loader • Use Neo4j’s Java-API • From any JVM language, concurrent • Fine grained TX Management • Create Nodes and Relationships directly • Also possible as Server extension • Arbitrary data loading Any Data Program Program Program 1B
  14. 14. Getting Data into Neo4j: API Bulk Loader API • Used by neo4j-import tool • Create Streams of node and relationship data • Id-groups, id-handling & generation, conversions • Highly concurrent and memory efficient • High performance CSV Parser, Decorators CSV 100 B
  15. 15. Import Performance: Some Numbers • Cypher Import 10k-10M records • Import 100K-100M records per second transactionally • Bulk import tens of billions of records in a few hours
  16. 16. Import Performance: Hardware Requirements • Fast disk: SSD or SSD RAID • Many Cores • Medium amount of RAM (8-64G) • Local Data Files, compress to save space • High performance concurrent connection to relational DB • Linux, OSX works better than Windows (FS-Handling) • Disable Virus Scanners, Check Disk Scheduler
  17. 17. Accessing Relational Data Dump, Connect, Extract
  18. 18. Accessing Relational Data • Dump to CSV all relational database have the option to dump query results and tables to CSV • Access with DB-Driver access DB with JDBC/ODBC or other driver to pull out selected datasets • Use built-in or external endpoints some databases expose HTTP-APIs or can be integrated (DataClips) • Use ETL-Tools existing ETL Tools can read from relational and write to Neo4j e.g. via JDBC
  19. 19. Importing Your Data Examples
  20. 20. Import Demo Cypher-Based “LOAD CSV” Capability • Use to import address data Command-Line Bulk Loader neo4j-import • Chicago Crime Dataset Relational Import Tool neo4j-rdbms-import • Proof of Concept JDBC + API CSV
  21. 21. LOAD CSV Powerhorse of Graph ETL
  22. 22. Data Quality – Beware of Real World Data ! • Messy ! Don‘t trust the data • Byte Order Mark • Binary Zeros, non-text characters • Inconsisent line breaks • Header inconsistent with data • Special character in non-quoted text • Unexpected newlines in quoted and unquoted text-fields • stray quotes
  23. 23. CSV – Power-Horse of Data Exchange • Most Databases, ETL and Office-Tools can read and write CSV • Format only loosely specified • Problems with quotes, newlines, charsets • Some good checking tools (CSVKit)
  24. 24. Address Dataset • Exported as large JOIN between • City • Zip • Street • Number • Enterprise • address.csv EntityNumber TypeOfAddress Zipcode MunicipalityNL StreetNL StreetFR HouseNr 200.065.765 REGO 9070 Destelbergen Dendermon desteenwe g Dendermonde steenweg 430 200.068.636 REGO 9000 Gent Stropstraat Stropstraat 1
  25. 25. LOAD CSV // create constraints CREATE CONSTRAINT ON (c:City) ASSERT IS UNIQUE; CREATE CONSTRAINT ON (z:Zip) ASSERT IS UNIQUE; // manage tx USING PERIODIC COMMIT 50000 // load csv row by row LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv // transform values WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip // create nodes MERGE (:City {name: city}) MERGE (:Zip {name: zip});
  26. 26. LOAD CSV // manage tx USING PERIODIC COMMIT 100000 // load csv row by row LOAD CSV WITH HEADERS FROM "file:address.csv" AS csv // transform values WITH DISTINCT toUpper(csv.City) AS city, toUpper(csv.Zip) AS zip // find nodes MATCH (c:City {name: city}), (z:Zip {name: zip}) // create relationships MERGE (c)-[:HAS_ZIP_CODE]->(z);
  27. 27. LOAD CSV Considerations • Provide enough memory (heap & page-cache) • Make sure your data is clean • Create indexes and constraints upfront • Use Labels for Matching • DISTINCT, SKIP, LIMIT to control data volume • Test with small batch • Use PERIODIC COMMIT for larger volumes (> 20k) • Beware of the EAGER Operation • Will pull in all your CSV data • Use EXPLAIN to detect it Simplest LOAD CSV Example | Guide Import CSV | RDBMS ETL Guide
  28. 28. s Demo
  29. 29. Mass Data Bulk Importer neo4j-import --into graph.db
  30. 30. Neo4j Bulk Import Tool • Memory efficient and scalable Bulk-Inserter • Proven to work well for billions of records • Easy to use, no memory configuration needed CSV Reference Manual: Import Tool
  31. 31. Chicago Crime Dataset • City of Chicago, Crime Data since 2001 • Go to Website, download dataset • Prepare Dataset, Cleanup • Specify Headers (direct or separate file) • ID-definition, data-types, labels, rel-types • Import (30-50s) • Use!
  32. 32. Chicago Crime Dataset • crimeTypes.csv • Types of crimes • beats.csv • Police areas • crimes.csv • Crime description • crimesBeats.csv • In which beat did a crime happen • crimesPrimaryTypes.csv • Primary Type assignment
  33. 33. Chicago Crime Dataset crimes.csv :ID(Crime),id,:LABEL,date,description 8920441,8920441,Crime,12/07/2012 07:50:00 AM,AUTOMOBILE primaryTypes.csv :ID(PrimaryType),crimeType ARSON,ARSON crimesPrimaryTypes.csv :START_ID(Crime),:END_ID(PrimaryType) 5221115,NARCOTICS
  34. 34. Chicago Crime Dataset ./neo/bin/neo4j-import --into crimes.db --nodes:CrimeType primaryTypes.csv --nodes beats.csv --nodes crimes_header.csv,crimes.csv --relationships:CRIME_TYPE crimesPrimaryTypes.csv --relationships crimesBeats.csv
  35. 35. s Demo
  36. 36. Neo4j-RDBMS-Importer Proof of Concept
  37. 37. s Recap – Transformation Rules
  38. 38. Normalized ER-Models: Transformation Rules • Tables become nodes • Table name as node-label • Columns turn into properties • Convert values if needed • Foreign Keys (1:1, 1:n, n:1) into relationships, column name into relationship-type (or better verb) • JOIN-Tables represent relationships • Also other tables without domain identity (w/o PK) and two FKs • Columns turn into relationship properties
  39. 39. Normalized ER-Models: Cleanup Rules • Remove technical IDs (auto-incrementing PKs) • Keep domain IDs (e.g. ISBN) • Add constraints for those • Add indexes for lookup fields • Adjust names for Label, REL_TYPE and propertyName Note: currently no composite constraints and indexes
  40. 40. RDBMS Import Tool Demo – Proof of Concept • JDBC for vendor-independent database connection • SchemaCrawler to extract DB-Meta-Data • Use Rules to drive graph model import • Optional means to override default behavior • Scales writes with Parallel Batch Importer API • Reads tables concurrently for nodes & relationships Demo: MySQL - Employee Demo Database Source: Blog Post Post gres MySQ L Oracle
  41. 41. s Demo
  42. 42. Architecture & Integration “Polyglot Persistence”
  43. 43. MIGRATE ALL DATA MIGRATE GRAPH DATA DUPLICATE GRAPH DATA Non-graph data Graph data Graph dataAll data All data Relational Database Graph Database Application Application Application Three Ways to Migrate Data to Neo4j
  44. 44. Data Storage and Business Rules Execution Data Mining and Aggregation Neo4j Fits into Your Enterprise Environment Application Graph Database Cluster Neo4j Neo4j Neo4j Ad Hoc Analysis Bulk Analytic Infrastructure Graph Compute Engine EDW … Data Scientist End User Databases Relational NoSQL Hadoop
  45. 45. Next Steps Community. Training. Support.
  46. 46. There Are Lots of Ways to Easily Learn Neo4j
  47. 47. Resources Online • Developer Site • RDBMS to Graph • Guide: ETL from RDBMS • Guide: CSV Import • LOAD CSV Webinar • Reference Manual • StackOverflow Offline • In Browser Guide „Northwind“ • Import Training Classes • Office Hours • Professional Services Workshop • Free Books: • Graph Databases 2nd Edition • Learning Neo4j
  48. 48. Register today at Early Bird only $99
  49. 49. Relational to Graph Data Import Thank you ! Questions ? | @neo4j