We will describe and demonstrate all the options for loading data into Neo4j and for getting it back out, all using Kettle (Pentaho Data Integration).
Among the topics covered:
➢ high-performance data loading
➢ streaming data integration into Neo4j
➢ metadata-driven data extraction
➢ automatic Kettle execution lineage and path finding using Neo4j
➢ roadmap update
➢ Q&A
Neo4j Data Loading with Kettle
1. Neo4j Data Loading
with Kettle
Matt Casters
Chief Solutions Architect / Kettle Project Founder
2. Agenda
➢ What is Kettle?
➢ The Neo4j plugins
➢ Data loading performance tips
➢ Streaming data integration
➢ Metadata driven data possibilities
➢ Kettle Execution lineage in a graph
➢ Roadmap update
➢ Q&A
4. Kettle: Introduction
➢ Pentaho Data Integration from Hitachi Vantara
➢ One of the most widely used ETL tools
➢ Ready for the most demanding tasks
➢ Open source: Apache License 2.0
➢ Well maintained
➢ Large community, marketplace, ...
➢ Easy to embed, install, package, rebrand
➢ Download: SourceForge / Pentaho / 8.2 / PDI-CE
5. Kettle: where is it used?
➢ On tiny and enormous systems, real or virtual
➢ Very small computers, Raspberry Pi-sized
➢ Your laptop or browser
➢ Locally or in the cloud
➢ On Hadoop clusters, VMs, Docker, serverless, ...
➢ At large and small companies
➢ In government
➢ In education
➢ In the Neo4j Solutions Reference Architecture
6. Kettle: Why is it used?
➢ Reduce costs!
➢ Answers the “build or buy?” question
[Chart: accumulated cost over time for "build", "buy", and Kettle]
7. Kettle: Architecture
➢ Metadata driven, engine based:
○ No code generation
○ Define what you need to happen
→ GUI, Web, code, rules, …
○ Clear and transparent, self documenting
➢ Types of work:
○ Jobs for workflows
○ Transformations for parallel data streaming
8. Kettle: Design
➢ 100% Exposure of our engine through UI elements
➢ Everyone should be able to play along: plugins!
➢ We built integration points for others: run everywhere!
➢ Allow the user to avoid programming anything
➢ Allow the user to program anything: JavaScript, Java,
Groovy, RegEx, Rules, Python, Ruby, R, …
➢ Transparency wins: best in class logging, data lineage,
execution lineage, debugging, data previewing, row
sniff testing, …
9. Kettle: things of note
➢ SpoonGit: UI integration with git
➢ WebSpoon: web interface to the full Spoon UI
➢ Data Sets: build transformation unit tests
➢ Huge list of other plugins available, including from
Neo4j, on a marketplace, …
➢ Support for the latest technology stacks
➢ Project on github has over 1,000 forks
https://github.com/pentaho/pentaho-kettle
10. Kettle: The Toolset
➢ Spoon: GUI
➢ Scripts
➢ Server(s)
➢ Java API & SDK
➢ Standard file format
➢ Plugin ecosystem
➢ Docker image(s)
➢ Documentation, books, ...
12. Neo4j Plugins: where to find?
➢ Started by the community, extended by Neo4j
➢ Releases/Download shortcut:
○ http://neo4j.kettle.be
➢ Project:
○ https://github.com/knowbi/knowbi-pentaho-pdi-neo4j-output
Give us feedback!
13. Neo4j Cypher
➢ For reading and writing
➢ Dynamic Cypher
➢ Batching and UNWIND
➢ Parameters
➢ Return values
➢ Helpers
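The batching-and-UNWIND idea behind the Cypher step can be sketched outside Kettle: rows are collected into batches and each batch is bound as a single parameter to one UNWIND statement. The label, property names, and function name below are illustrative, not the plugin's actual code.

```python
# Sketch: batch rows and bind each batch to one parameterized UNWIND
# statement, as the Neo4j Cypher step does internally (illustrative).

def make_unwind_batches(rows, batch_size=1000):
    """Yield (cypher, parameters) pairs, one per batch of rows."""
    cypher = (
        "UNWIND $rows AS row "
        "MERGE (p:Person {id: row.id}) "
        "SET p.name = row.name"
    )
    for start in range(0, len(rows), batch_size):
        yield cypher, {"rows": rows[start:start + batch_size]}

rows = [{"id": i, "name": f"person-{i}"} for i in range(2500)]
batches = list(make_unwind_batches(rows, batch_size=1000))
# 2500 rows with batch size 1000 -> 3 statements instead of 2500
```

One round trip per batch instead of one per row is where most of the Cypher-step throughput comes from.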
14. Neo4j Output
➢ Easy node creation
➢ Create/Merge of ()-[]-()
➢ Batching and UNWIND
➢ Dynamic labels
15. Neo4j Graph Output
➢ Update (parts of) a graph
➢ Using a logical model
➢ Using field mapping
➢ Auto-generate Cypher
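The "auto-generate Cypher" idea can be sketched as deriving a parameterized MERGE from a logical node model plus a field mapping; the function, labels, and field names here are illustrative, not the Graph Output step's actual generator.

```python
# Sketch: derive a MERGE statement from a logical node model plus a
# field mapping (Kettle field name -> node property name). Illustrative.

def cypher_for_node(label, key_props, field_mapping):
    """Build a parameterized MERGE for one node of the logical model."""
    keys = ", ".join(
        f"{prop}: ${field}" for field, prop in field_mapping.items()
        if prop in key_props
    )
    sets = ", ".join(
        f"n.{prop} = ${field}" for field, prop in field_mapping.items()
        if prop not in key_props
    )
    cypher = f"MERGE (n:{label} {{{keys}}})"
    if sets:
        cypher += f" SET {sets}"
    return cypher

generated = cypher_for_node("Customer", {"id"}, {"cust_id": "id", "cust_name": "name"})
# MERGE (n:Customer {id: $cust_id}) SET n.name = $cust_name
```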
16. Check Neo4j Connection
➢ Job Entry (workflow)
➢ Validate DBs are up
➢ Used in error diagnostics
➢ Defensive setup
➢ Pessimistic approach
24. Pre-processing in Kettle
➢ Do work in Kettle that can be avoided in Neo4j
➢ Calculate unique nodes
➢ Do required data conversions
➢ Data cleaning
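"Calculate unique nodes" can be sketched as deduplicating rows in the ETL layer so each node is sent to Neo4j once, rather than relying on MERGE to absorb duplicates. The row shape below is illustrative.

```python
# Sketch: compute the distinct set of nodes on the Kettle side so each
# node reaches Neo4j once, instead of letting MERGE deduplicate.

def unique_nodes(rows, key):
    """Return rows deduplicated on `key`, keeping the first occurrence."""
    seen = set()
    out = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out

orders = [
    {"customer": "acme", "order": 1},
    {"customer": "acme", "order": 2},
    {"customer": "globex", "order": 3},
]
customers = unique_nodes(orders, "customer")
# two distinct customer nodes instead of three MERGE round trips
```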
25. Parallel loading & batching
➢ Parallel node creation
➢ Limit high parallelism in the general case
➢ UNWIND in Neo4j Cypher step
➢ Create option in Neo4j Output step
➢ Use larger batch sizes (>1000)
➢ Create indexes up-front or with the options
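Parallel, batched loading can be sketched as partitioning rows into large batches and spreading those batches over a small pool of workers; `load_batch` is a stand-in for the actual write to Neo4j, and all names are illustrative.

```python
# Sketch: parallelize node creation across a few workers while keeping
# per-worker batches large (>1000 rows). Illustrative stand-in code.
from concurrent.futures import ThreadPoolExecutor

def load_batch(batch):
    # placeholder for the real write, e.g. an UNWIND statement per batch
    return len(batch)

def parallel_load(rows, batch_size=1000, workers=4):
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(load_batch, batches))

total = parallel_load([{"id": i} for i in range(5000)])
# total == 5000: every row loaded exactly once, in 5 batches
```

Keeping parallelism modest, as the slide advises, avoids lock contention between concurrent writes on the Neo4j side.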
26. Importing data
➢ Bulk loading with import is much faster
➢ A few orders of magnitude faster
➢ Collect all the data in CSV files
➢ Use the new steps to load
➢ Seamless path to incremental loads
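Collecting data into CSV files for bulk import can be sketched as writing a node file whose header follows the Neo4j import-tool conventions (`:ID`, `:LABEL` columns); the fields and label here are illustrative.

```python
# Sketch: write node data to a CSV whose header uses the Neo4j
# import-tool conventions (:ID, :LABEL). Fields are illustrative.
import csv
import io

def write_nodes_csv(out, rows):
    writer = csv.writer(out)
    writer.writerow(["id:ID", "name", ":LABEL"])
    for row in rows:
        writer.writerow([row["id"], row["name"], "Person"])

buf = io.StringIO()
write_nodes_csv(buf, [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])
lines = buf.getvalue().splitlines()
# header line: id:ID,name,:LABEL
```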
29. Streaming options
➢ Transformations can be never ending
➢ Any operation is possible
➢ Can collect data in other data platforms
➢ Transactionally safe where the platform supports it (Kafka, …)
➢ Can be parallelized & scaled out
31. Metadata FTW
➢ Kettle transformations & jobs are metadata
➢ ETL Metadata Injection: transformation templates
➢ Neo4j is a great metadata database
➢ Kettle can make use of this
32. Metadata driven loads
➢ Loading hundreds of types of files
➢ Processing data from hundreds of databases
➢ Automatic data standardization and normalization
→ Massive time gains!
33. Metadata driven extracts
➢ Without hardcoded sources, selections and targets
➢ Sourcing selections from users, processes, ...
➢ Using the possibilities of the Kettle engine
→ Flexibility, performance, without coding
35. Kettle Logging Architecture
➢ Unique ID per execution
➢ Precise sourcing of logging records
➢ Very “graphy” data
[Diagram: execution metadata graph with Execution, Metadata, and Impact nodes linked by parent/child relations]
36. The Kettle Neo4j Logging plugin
➢ Stores operational metadata in a graph
➢ https://github.com/mattcasters/kettle-neo4j-logging
➢ Tools
○ View execution information: log, duration, errors
○ Find error paths
○ Jump to error location
○ Find execution path of a step
○ Get time window: “since last successful execution”
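The "since last successful execution" time window can be sketched as a query over stored execution records: the window starts at the end of the most recent successful run. The record shape and function name are illustrative, not the logging plugin's actual model.

```python
# Sketch: derive an incremental-load time window from execution log
# records, as the logging tools do (record shape is illustrative).

def incremental_window(executions, now):
    """Return (start, end): start is the last successful run's end time."""
    successes = [e for e in executions if e["status"] == "success"]
    start = max((e["ended"] for e in successes), default=None)
    return start, now

runs = [
    {"ended": "2018-11-01T10:00", "status": "success"},
    {"ended": "2018-11-02T10:00", "status": "error"},
]
window = incremental_window(runs, "2018-11-03T10:00")
# window == ("2018-11-01T10:00", "2018-11-03T10:00")
```

In the plugin this information lives in the Neo4j logging graph, so the same window could be computed with a Cypher query over execution nodes.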
37. Execution lineage in a graph
➢ Documents the execution process
○ Log text, metadata, times, ...
39. Roadmap Neo4j plugin
➢ 25 releases in 2018
➢ Major 4.0 release next week
➢ Then:
○ New Neo4j Output step
○ More graph data type operations
○ <Insert YOUR suggestion!>
➢ Tuning options for Neo4j steps running on the initial
Kettle Apache Beam implementation:
→ Dataflow, Spark, Flink, …
40. Roadmap Neo4j Logging plugin
➢ Generic impact information logging
➢ Store data lineage in Neo4j
➢ Git revision graph loading (new step)
➢ Storing and viewing unit testing results
➢ Operational “dashboard”