We will describe and demonstrate all the options for loading data into Neo4j and for getting it back out, all using Kettle (Pentaho Data Integration).
Among the topics covered:
➢ high-performance data loading
➢ streaming data integration into Neo4j
➢ metadata-driven data extraction
➢ automatic Kettle execution lineage and path finding using Neo4j
➢ roadmap update
➢ Q&A
Neo4j Data Loading with Kettle
1. Neo4j Data Loading
with Kettle
Matt Casters
Chief Solutions Architect / Kettle Project Founder
2. Agenda
➢ What is Kettle?
➢ The Neo4j plugins
➢ Data loading performance tips
➢ Streaming data integration
➢ Metadata driven data possibilities
➢ Kettle Execution lineage in a graph
➢ Roadmap update
➢ Q&A
4. Kettle: Introduction
➢ Pentaho Data Integration from Hitachi Vantara
➢ One of the most widely used ETL tools
➢ Ready for the most demanding tasks
➢ Open source: Apache License 2.0
➢ Well maintained
➢ Large community, marketplace, ...
➢ Easy to embed, install, package, rebrand
➢ Download: SourceForge / Pentaho / 8.2 / PDI-CE
5. Kettle: where is it used?
➢ On tiny and enormous systems, real or virtual
➢ Very small computers, Raspberry Pi-sized
➢ Your laptop or browser
➢ Locally or in the cloud
➢ On Hadoop clusters, VMs, Docker, serverless, ...
➢ At large and small companies
➢ In government
➢ In education
➢ In the Neo4j Solutions Reference Architecture
6. Kettle: Why is it used?
➢ Reduce costs!
➢ Answers the “build or buy?” question
[Chart: accumulated cost over time for "build", "buy", and Kettle]
7. Kettle: Architecture
➢ Metadata driven, engine based:
○ No code generation
○ Define what you need to happen
→ GUI, Web, code, rules, …
○ Clear and transparent, self documenting
➢ Types of work:
○ Jobs for workflows
○ Transformations for parallel data streaming
8. Kettle: Design
➢ 100% Exposure of our engine through UI elements
➢ Everyone should be able to play along: plugins!
➢ We built integration points for others: run everywhere!
➢ Allow the user to avoid programming anything
➢ Allow the user to program anything: JavaScript, Java,
Groovy, RegEx, Rules, Python, Ruby, R, …
➢ Transparency wins: best in class logging, data lineage,
execution lineage, debugging, data previewing, row
sniff testing, …
9. Kettle: things of note
➢ SpoonGit: UI integration with git
➢ WebSpoon: web interface to the full Spoon UI
➢ Data Sets: build transformation unit tests
➢ Huge list of other plugins available, including from
Neo4j, on a marketplace, …
➢ Support for the latest technology stacks
➢ Project on github has over 1,000 forks
https://github.com/pentaho/pentaho-kettle
10. Kettle: The Toolset
➢ Spoon: GUI
➢ Scripts
➢ Server(s)
➢ Java API & SDK
➢ Standard file format
➢ Plugin ecosystem
➢ Docker image(s)
➢ Documentation, books, ...
12. Neo4j Plugins: where to find?
➢ Started by the community, extended by Neo4j
➢ Releases/Download shortcut:
○ http://neo4j.kettle.be
➢ Project:
○ https://github.com/knowbi/knowbi-pentaho-pdi-neo4j-output
Give us feedback!
13. Neo4j Cypher
➢ For reading and writing
➢ Dynamic Cypher
➢ Batching and UNWIND
➢ Parameters
➢ Return values
➢ Helpers
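The batching-and-UNWIND idea behind the Cypher step can be sketched outside Kettle: rows are collected into batches and each batch is bound as a single parameter to one UNWIND statement. The label, property names, and function name below are illustrative, not the plugin's actual code.

```python
# Sketch: batch rows and bind each batch to one parameterized UNWIND
# statement, as the Neo4j Cypher step does internally (illustrative).

def make_unwind_batches(rows, batch_size=1000):
    """Yield (cypher, parameters) pairs, one per batch of rows."""
    cypher = (
        "UNWIND $rows AS row "
        "MERGE (p:Person {id: row.id}) "
        "SET p.name = row.name"
    )
    for start in range(0, len(rows), batch_size):
        yield cypher, {"rows": rows[start:start + batch_size]}

rows = [{"id": i, "name": f"person-{i}"} for i in range(2500)]
batches = list(make_unwind_batches(rows, batch_size=1000))
# 2500 rows with batch size 1000 -> 3 statements instead of 2500
```

One round trip per batch instead of one per row is where most of the Cypher-step throughput comes from.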
14. Neo4j Output
➢ Easy node creation
➢ Create/Merge of ()-[]-()
➢ Batching and UNWIND
➢ Dynamic labels
15. Neo4j Graph Output
➢ Update (parts of) a graph
➢ Using a logical model
➢ Using field mapping
➢ Auto-generate Cypher
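The "auto-generate Cypher" idea can be sketched as deriving a parameterized MERGE from a logical node model plus a field mapping; the function, labels, and field names here are illustrative, not the Graph Output step's actual generator.

```python
# Sketch: derive a MERGE statement from a logical node model plus a
# field mapping (Kettle field name -> node property name). Illustrative.

def cypher_for_node(label, key_props, field_mapping):
    """Build a parameterized MERGE for one node of the logical model."""
    keys = ", ".join(
        f"{prop}: ${field}" for field, prop in field_mapping.items()
        if prop in key_props
    )
    sets = ", ".join(
        f"n.{prop} = ${field}" for field, prop in field_mapping.items()
        if prop not in key_props
    )
    cypher = f"MERGE (n:{label} {{{keys}}})"
    if sets:
        cypher += f" SET {sets}"
    return cypher

generated = cypher_for_node("Customer", {"id"}, {"cust_id": "id", "cust_name": "name"})
# MERGE (n:Customer {id: $cust_id}) SET n.name = $cust_name
```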
16. Check Neo4j Connection
➢ Job Entry (workflow)
➢ Validate DBs are up
➢ Used in error diagnostics
➢ Defensive setup
➢ Pessimistic approach
24. Pre-processing in Kettle
➢ Do work in Kettle that can be avoided in Neo4j
➢ Calculate unique nodes
➢ Do required data conversions
➢ Data cleaning
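"Calculate unique nodes" can be sketched as deduplicating rows in the ETL layer so each node is sent to Neo4j once, rather than relying on MERGE to absorb duplicates. The row shape below is illustrative.

```python
# Sketch: compute the distinct set of nodes on the Kettle side so each
# node reaches Neo4j once, instead of letting MERGE deduplicate.

def unique_nodes(rows, key):
    """Return rows deduplicated on `key`, keeping the first occurrence."""
    seen = set()
    out = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out

orders = [
    {"customer": "acme", "order": 1},
    {"customer": "acme", "order": 2},
    {"customer": "globex", "order": 3},
]
customers = unique_nodes(orders, "customer")
# two distinct customer nodes instead of three MERGE round trips
```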
25. Parallel loading & batching
➢ Parallel node creation
➢ Limit high parallelism in the general case
➢ UNWIND in Neo4j Cypher step
➢ Create option in Neo4j Output step
➢ Use larger batch sizes (>1000)
➢ Create indexes up-front or with the options
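Parallel, batched loading can be sketched as partitioning rows into large batches and spreading those batches over a small pool of workers; `load_batch` is a stand-in for the actual write to Neo4j, and all names are illustrative.

```python
# Sketch: parallelize node creation across a few workers while keeping
# per-worker batches large (>1000 rows). Illustrative stand-in code.
from concurrent.futures import ThreadPoolExecutor

def load_batch(batch):
    # placeholder for the real write, e.g. an UNWIND statement per batch
    return len(batch)

def parallel_load(rows, batch_size=1000, workers=4):
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(load_batch, batches))

total = parallel_load([{"id": i} for i in range(5000)])
# total == 5000: every row loaded exactly once, in 5 batches
```

Keeping parallelism modest, as the slide advises, avoids lock contention between concurrent writes on the Neo4j side.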
26. Importing data
➢ Bulk loading with import is much faster
➢ A few orders of magnitude faster
➢ Collect all the data in CSV files
➢ Use the new steps to load
➢ Seamless path to incremental loads
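Collecting data into CSV files for bulk import can be sketched as writing a node file whose header follows the Neo4j import-tool conventions (`:ID`, `:LABEL` columns); the fields and label here are illustrative.

```python
# Sketch: write node data to a CSV whose header uses the Neo4j
# import-tool conventions (:ID, :LABEL). Fields are illustrative.
import csv
import io

def write_nodes_csv(out, rows):
    writer = csv.writer(out)
    writer.writerow(["id:ID", "name", ":LABEL"])
    for row in rows:
        writer.writerow([row["id"], row["name"], "Person"])

buf = io.StringIO()
write_nodes_csv(buf, [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])
lines = buf.getvalue().splitlines()
# header line: id:ID,name,:LABEL
```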
29. Streaming options
➢ Transformations can be never ending
➢ Any operation is possible
➢ Can collect data in other data platforms
➢ Transactionally safe where the platform supports it (Kafka, …)
➢ Can be parallelized & scaled out
31. Metadata FTW
➢ Kettle transformations & jobs are metadata
➢ ETL Metadata Injection: transformation templates
➢ Neo4j is a great metadata database
➢ Kettle can make use of this
32. Metadata driven loads
➢ Loading hundreds of types of files
➢ Processing data from hundreds of databases
➢ Automatic data standardization and normalization
→ Massive time gains!
33. Metadata driven extracts
➢ Without hardcoded sources, selections and targets
➢ Sourcing selections from users, processes, ...
➢ Using the possibilities of the Kettle engine
→ Flexibility, performance, without coding
35. Kettle Logging Architecture
➢ Unique ID per execution
➢ Precise sourcing of logging records
➢ Very “graphy” data
[Diagram: execution metadata graph with Execution, Metadata, and Impact nodes linked by parent/child relations]
36. The Kettle Neo4j Logging plugin
➢ Stores operational metadata in a graph
➢ https://github.com/mattcasters/kettle-neo4j-logging
➢ Tools
○ View execution information: log, duration, errors
○ Find error paths
○ Jump to error location
○ Find execution path of a step
○ Get time window: “since last successful execution”
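The "since last successful execution" time window can be sketched as a query over stored execution records: the window starts at the end of the most recent successful run. The record shape and function name are illustrative, not the logging plugin's actual model.

```python
# Sketch: derive an incremental-load time window from execution log
# records, as the logging tools do (record shape is illustrative).

def incremental_window(executions, now):
    """Return (start, end): start is the last successful run's end time."""
    successes = [e for e in executions if e["status"] == "success"]
    start = max((e["ended"] for e in successes), default=None)
    return start, now

runs = [
    {"ended": "2018-11-01T10:00", "status": "success"},
    {"ended": "2018-11-02T10:00", "status": "error"},
]
window = incremental_window(runs, "2018-11-03T10:00")
# window == ("2018-11-01T10:00", "2018-11-03T10:00")
```

In the plugin this information lives in the Neo4j logging graph, so the same window could be computed with a Cypher query over execution nodes.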
37. Execution lineage in a graph
➢ Documents the execution process
○ Log text, metadata, times, ...
39. Roadmap Neo4j plugin
➢ 25 releases in 2018
➢ Major 4.0 release next week
➢ Then:
○ New Neo4j Output step
○ More graph data type operations
○ <Insert YOUR suggestion!>
➢ Tuning options for Neo4j steps running on the initial
Kettle Apache Beam implementation:
→ Dataflow, Spark, Flink, …
40. Roadmap Neo4j Logging plugin
➢ Generic impact information logging
➢ Store data lineage in Neo4j
➢ Git revision graph loading (new step)
➢ Storing and viewing unit testing results
➢ Operational “dashboard”