4. Kettle: Introduction
• a.k.a. Pentaho Data Integration
• One of the most widely used ETL tools
• Ready for the most demanding tasks
• Open source: Apache License 2.0
• Well maintained
• Large community, marketplace, ...
• Easy to embed, install, package, rebrand
• Download from Sourceforge / Pentaho / PDI-CE
6. Kettle: Architecture
• Metadata driven, engine based:
• No code generation
• Define what you need to happen
-> GUI, Web, code, rules, …
• Execute wherever you need to
-> From Raspberry Pi to Hadoop
• Types of work:
● Jobs for workflows
● Transformations for parallel streaming
7. Kettle: Design
• 100% exposure of our engine through UI elements
• Everyone should be able to play along: plugins!
• We built integration points for others: run everywhere!
• Allow the user to avoid programming anything
• Allow the user to program anything: JavaScript, Java,
SQL, RegEx, Rules, Python, Ruby, R, OO Formula, Pig, …
• Transparency wins: top-class logging, data lineage,
execution lineage, debugging, data previewing, row
sniff testing, …
8. Kettle: Cool things
• SpoonGit: UI integration with git
• WebSpoon: web interface to the full Spoon UI
• Data Sets: build transformation unit tests
• Large marketplace:
http://www.pentaho.com/marketplace/
• Project on GitHub has over 1,000 forks
https://github.com/pentaho/pentaho-kettle
17. Loading Neo4j: loading nodes
• Demonstrates the Neo4j Output step
• Reads a CSV file in parallel
• Loads the data into nodes in parallel
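The kind of statement the Neo4j Output step issues for each batch of rows can be sketched in Cypher (the `Customer` label and property names here are hypothetical, not from the demo):

```cypher
// One call per batch: $rows is a parameter list built from the CSV rows
UNWIND $rows AS row
CREATE (c:Customer { id: row.id, name: row.name })
```

Passing the batch as a single `$rows` parameter keeps each transaction small while avoiding one round trip per row.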
18. Loading Neo4j: remove all data
• Demonstrates the Neo4j Cypher step
• Calls procedures
• Uses dynamic Cypher statements
• Reads and updates Neo4j
• Removes all nodes and relationships in batches
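A batched wipe-everything statement of this kind might look as follows; this sketch assumes the APOC procedure library is installed, which is one common way to combine "calls procedures" with "batches":

```cypher
// Delete all nodes and relationships in batches of 10,000
// to keep each transaction small
CALL apoc.periodic.iterate(
  'MATCH (n) RETURN n',
  'DETACH DELETE n',
  { batchSize: 10000 }
);
```

Batching matters here: a single `MATCH (n) DETACH DELETE n` over a large graph runs in one huge transaction and can exhaust memory.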
19. Loading Neo4j: update graphs
• Demonstrates the Neo4j Graph Output step
• Updates multiple nodes and relationships at once
• Uses key values to determine which nodes to skip
• Automatically generates MERGE statements
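The generated statements follow the standard Cypher MERGE pattern, roughly like the sketch below (labels, key properties, and the relationship type are hypothetical examples):

```cypher
// Key properties (id) identify nodes: MERGE matches an existing
// node on the key or creates it, so the load is idempotent
MERGE (p:Person  { id: $personId })
  SET p.name = $name
MERGE (c:Company { id: $companyId })
MERGE (p)-[:WORKS_FOR]->(c)
```

Because MERGE matches on the key properties first, re-running the same transformation updates nodes in place instead of creating duplicates.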
20. Loading Neo4j: Kafka updating Neo4j
• Demonstrates Kafka integration
• Stream data using a Kafka consumer
• Continuously update Neo4j
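Each message consumed from Kafka could drive a small idempotent update; a minimal sketch, assuming the message payload carries a sensor reading (the `Sensor` label and parameter names are illustrative only):

```cypher
// Applied once per consumed Kafka message;
// parameters are taken from the message payload
MERGE (s:Sensor { id: $sensorId })
SET s.lastReading = $value,
    s.updatedAt   = datetime($eventTime)
```

Using MERGE per message makes the stream safe to replay: reprocessing an offset simply overwrites the same node.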