Modern engineering requires machine learning engineers to implement and monitor ETL and machine learning models in production. Natalino Busa shares technologies, techniques, and blueprints for robustly and reliably managing data science and ETL flows from inception to production.
In particular, Natalino explains how to solve one of the most annoying problems in modern data pipelines, migrating and managing legacy ETL, by generating Spark jobs from a textual representation (NLP and SQL). Natalino also demonstrates an open source web UI, implemented in React, that transforms high-level representations into Spark code, and shows how users can capture and discover data across the organization by accessing a metadata service. Finally, Natalino introduces the datalabframework, a Jupyter-powered lightweight framework that allows machine learning scientists and engineers to build a robust production ML system using notebooks alone.
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/64481
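As a flavour of what generating a Spark job from such a textual representation can look like, here is a minimal PySpark sketch; the metadata layout and function names are hypothetical illustrations, not Sparkola's actual API:

```python
# Hypothetical sketch: a legacy ETL step captured as text (SQL plus a little
# metadata) and executed as a Spark job. Not Sparkola's actual API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generated-etl").getOrCreate()

step = {
    "source": {"path": "/data/raw/orders.parquet", "view": "orders"},
    "sql": "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
    "target": {"path": "/data/curated/customer_totals.parquet"},
}

def run_step(step):
    # register the source as a temp view, run the SQL, write the target
    spark.read.parquet(step["source"]["path"]).createOrReplaceTempView(step["source"]["view"])
    spark.sql(step["sql"]).write.mode("overwrite").parquet(step["target"]["path"])

run_step(step)
```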
Today's talk
ETL: why is it still important?
Reporting, Analytics, Enterprise Systems
Unified Data Architecture
Streaming, Queries, ML and AI, APIs and DevOps.
ETL Pipeline Building as Software Engineering
Current solutions and our approach: Sparkola demo
ETL: 4 basic ingredients
Data Sourcing, Data Staging, Data Modeling, Data Load
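A minimal PySpark sketch of the four ingredients as explicit pipeline stages; paths, columns, and transformations are hypothetical:

```python
# The four ingredients as explicit stages; paths and columns are hypothetical.
from pyspark.sql import SparkSession, DataFrame, functions as F

spark = SparkSession.builder.appName("four-ingredients").getOrCreate()

def source(path: str) -> DataFrame:           # Data Sourcing
    return spark.read.json(path)

def stage(df: DataFrame) -> DataFrame:        # Data Staging: clean, deduplicate
    return df.dropDuplicates(["id"]).na.drop(subset=["id"])

def model(df: DataFrame) -> DataFrame:        # Data Modeling: shape to the target schema
    return df.groupBy("country").agg(F.count("id").alias("users"))

def load(df: DataFrame, path: str) -> None:   # Data Load
    df.write.mode("overwrite").parquet(path)

load(model(stage(source("/data/raw/users.json"))), "/data/marts/users_by_country")
```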
ETL : how to hold it together?
Metadata
Capture and Version:
Scripts, Sources, Targets, SLA (Retries, Max Duration, Typical Records),
User Permission and Access, Scheduling, Data Quality Constraints,
Behaviour on Error, Mappings (Source to Target), etc ...
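For illustration, this is the kind of metadata one might capture and version for a single job; the field names are hypothetical, not a specific tool's schema:

```python
# Hypothetical metadata for a single ETL job, captured and versioned
# next to the script itself (fields mirror the list above).
job_metadata = {
    "script": "etl/customer_totals.py",
    "sources": ["/data/raw/orders.parquet"],
    "targets": ["/data/curated/customer_totals.parquet"],
    "sla": {"retries": 3, "max_duration_min": 30, "typical_records": 1_000_000},
    "schedule": "0 2 * * *",              # daily at 02:00
    "access": {"owner": "analytics", "readers": ["reporting"]},
    "quality": ["customer_id IS NOT NULL", "total >= 0"],
    "on_error": "alert-and-halt",
    "mappings": {"orders.amount": "customer_totals.total"},
}
```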
ETL: how to hold it together?
Workflow Scheduler
Manages:
Dependencies between Jobs, Data Lineage, Job Re-use, Retries and Alerting on Failure, Fail-over Strategies, Resource Management, etc.
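As a sketch, here is a minimal Airflow DAG (one of the schedulers listed a few slides later) covering dependencies, retries, and alerting; task names and commands are hypothetical, and the imports assume Airflow 2.x:

```python
# A minimal Airflow DAG: explicit job dependencies, retries, and alerting.
# Task names and spark-submit commands are hypothetical; imports are Airflow 2.x.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                          # retries on failure
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,              # alerting on failure
    "email": ["data-team@example.com"],
}

with DAG(
    dag_id="customer_totals",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="spark-submit extract.py")
    transform = BashOperator(task_id="transform", bash_command="spark-submit transform.py")
    load = BashOperator(task_id="load", bash_command="spark-submit load.py")

    extract >> transform >> load           # dependencies between jobs
```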
ETL: how to hold it together?
Glossary
Keeps the semantics and meaning of the data:
Name mappings between domains, Business Taxonomies, Technical Column Names, Naming Hierarchies, Documentation on data columns and fields.
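A hypothetical glossary entry, sketched as plain data, mapping a business term to its technical column, taxonomy, and documentation:

```python
# A hypothetical glossary entry: business term -> technical name, taxonomy,
# and documentation, kept as versioned data alongside the pipelines.
glossary = {
    "Customer Lifetime Value": {
        "domain": "marketing",
        "table": "curated.customer_metrics",
        "column": "clv_eur",
        "taxonomy": ["finance", "customer", "kpi"],
        "doc": "Expected discounted gross margin over the customer relationship, in EUR.",
    },
}
```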
ETL: how to hold it together?
Data Security
Provides controlled access to the data universe:
Access Control, Data Encryption, Data Tokenization, Role and Policy Management, Data Filtering, Query Rewriting, etc.
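A small PySpark sketch of data filtering and tokenization applied at read time; columns and paths are hypothetical, and in practice policy enforcement would live in a tool like Ranger, Sentry, or Knox:

```python
# Data filtering and tokenization at read time; columns and paths are
# hypothetical. Real policy enforcement would sit in Ranger/Sentry/Knox.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("secured-read").getOrCreate()

secured = (
    spark.read.parquet("/data/raw/customers.parquet")
    .where(F.col("region") == "EU")                    # row-level data filtering
    .withColumn("email", F.sha2(F.col("email"), 256))  # tokenize a PII column
    .drop("credit_card_number")                        # column-level access control
)
```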
ETL tooling: open source projects
Task | Synopsis | Tools
Scheduling and Workflow | Manage job dependencies | Airflow, Azkaban, Oozie
Dataflow Processors | Concatenate transformations | Nifi, Seahorse, Streamsets
Dataflow UIs | Edit and create data flows | Kylo, Seahorse
Metadata | Capture and edit workflow info | Atlas, Falcon, Protégé
Security | Manage access, roles, policies | Sentry, Ranger, Knox
How is the Open Source Community doing?
● Tooling is still quite “green”
● Most of these tools are not sexy …
● Proprietary solutions still dominate the market
● User experience and usability are not great yet
● Limited integration with the various engines
Unified Data Architecture
• Streaming Analytics
• Big Data / Big Queries
• ML and AI
• APIs and DS Automation
• DS Exploration
https://eng.uber.com/michelangelo/
Data People: 8 profiles
(figure: eight profiles mapped across three skill axes: Maths, Domain Expertise, Technology)
DevOps: Expose models
ML Engineer: CI/CD for models
Data Engineer: Admin cluster services
Data Scientist: Looks for patterns, predictions
Business Analyst: Reporting and Biz Ops
BizDev: New business features
Statistician: Advanced modeling
AI Researcher: ML at scale, new algorithms
… you actually only need 4 profiles …
Cloud and Virtualization
No need for infra; DevOps take over provisioning.
Researchers and Statisticians
Who are we kidding? Just use the algos from the NIPS people.
ML Engineers and DevOps
CI/CD pipelines, both for code *and* models
Business Analysts
They are all data scientists. End of the story.
(figure: the four profiles, BizDev, ML-DevOps, Data Engineer, and Data Scientist, overlaid on the original Kiss lineup)
https://commons.wikimedia.org/wiki/Category:Kiss_(band)#/media/File:Kiss_original_lineup_(1976).jpg
ETL Pipelines as Software Engineering
Designing, implementing, and deploying scalable ETL pipelines requires proper software engineering practices.
With Sparkola we address ETL design as a proper software engineering project. How?
Modularity
• Encapsulation •
Pipelines must be broken up into basic blocks (separation of concerns) that can be glued together using `scripting languages`
• Extensibility •
It should be extremely easy to create, install, test, publish and deploy new components
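A minimal sketch of the encapsulation idea in PySpark: each block is a self-contained DataFrame-to-DataFrame function, and a tiny script glues any list of blocks together (the block names are hypothetical):

```python
# Each block is a self-contained DataFrame -> DataFrame function; a small
# script composes any list of blocks into one pipeline (blocks hypothetical).
from functools import reduce
from pyspark.sql import DataFrame, functions as F

def drop_nulls(df: DataFrame) -> DataFrame:
    return df.na.drop()

def add_ingest_date(df: DataFrame) -> DataFrame:
    return df.withColumn("ingest_date", F.current_date())

def pipeline(df: DataFrame, blocks) -> DataFrame:
    # the 'glue': fold the blocks over the input DataFrame, left to right
    return reduce(lambda acc, block: block(acc), blocks, df)

# out = pipeline(spark.read.parquet("/data/raw/events"), [drop_nulls, add_ingest_date])
```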
Usability
• Multi-language •
Multiple ways of `gluing` components together should be provided: SQL, rule-based, an interactive Excel-like interface, scripting
• IDE •
A proper development environment should be provided
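For example, SQL as one of the glue languages: a block exposes its output as a temporary view, so the next step can be written in plain SQL rather than Python (table and column names are hypothetical):

```python
# SQL as glue: expose a block's output as a temp view, then write the next
# step in plain SQL. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-glue").getOrCreate()

spark.read.parquet("/data/staged/events").createOrReplaceTempView("events")

daily = spark.sql("""
    SELECT event_date, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date
""")
```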
Testability
• Debugging •
It should be possible to interactively debug ETL pipelines and analyze problems
• Testing Framework •
Data validation rules should be part of the pipeline definition, and `unit tests` should be bundled
with the ETL pipeline
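A sketch of validation rules bundled with the pipeline definition: each rule is a SQL predicate that must hold for every row, and a unit test can run the same check on a tiny in-memory fixture (rules and names are hypothetical):

```python
# Validation rules as part of the pipeline definition: each rule is a SQL
# predicate that must hold for every row of the output (rules are hypothetical).
from pyspark.sql import SparkSession, DataFrame, functions as F

RULES = ["customer_id IS NOT NULL", "total >= 0"]

def validate(df: DataFrame, rules=RULES) -> None:
    for rule in rules:
        violations = df.where(~F.expr(rule)).count()
        assert violations == 0, f"{violations} rows violate rule: {rule}"

# Unit-test style usage: run the check on a tiny in-memory fixture
spark = SparkSession.builder.appName("etl-tests").getOrCreate()
fixture = spark.createDataFrame([(1, 10.0), (2, 0.0)], ["customer_id", "total"])
validate(fixture)
```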
Continuous Integration
• Building and packaging •
It should be possible to package and deploy ETL pipelines as stand-alone components
• Automated testing •
Before deployment, data validation tests should be executed
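As a sketch, packaging an ETL pipeline as a stand-alone Python component with setuptools; package and module names are hypothetical:

```python
# setup.py sketch: bundle the pipeline code, its dependencies, and an entry
# point so the ETL job deploys as one stand-alone component (names hypothetical).
from setuptools import setup, find_packages

setup(
    name="customer-totals-etl",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["pyspark>=2.3"],
    entry_points={"console_scripts": ["run-etl = customer_totals.job:main"]},
)
```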
Traceability
• Readability •
ETL pipelines should be metadata-driven and human-readable
• Version Control •
Any change to the ETL pipeline should be versioned and tracked
• Static Analysis •
ETL code analysis should be performed and reported in the form of lineage
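A sketch of how lineage can fall out of a metadata-driven pipeline: since every step declares its sources and targets, lineage records are a simple walk over the step definitions (step names are hypothetical):

```python
# Lineage from a metadata-driven pipeline: every step declares its sources
# and targets, so lineage records are a simple walk over the step definitions.
steps = [
    {"name": "stage_orders", "sources": ["raw.orders"], "targets": ["staged.orders"]},
    {"name": "customer_totals", "sources": ["staged.orders"], "targets": ["curated.customer_totals"]},
]

lineage = [
    (src, step["name"], tgt)
    for step in steps
    for src in step["sources"]
    for tgt in step["targets"]
]
# -> [('raw.orders', 'stage_orders', 'staged.orders'),
#     ('staged.orders', 'customer_totals', 'curated.customer_totals')]
```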
Here is Sparkola
• Development •
Interactive development of the ETL pipeline using a web-based IDE
• Testing •
Automated validation tests are run in a CI/CD environment
• Deployment •
Pipelines are packaged and deployed, and lineage metadata is automatically generated