Modern engineering requires machine learning engineers to implement and monitor ETL and machine learning models in production. Natalino Busa shares technologies, techniques, and blueprints for robustly and reliably managing data science and ETL flows from inception to production.
In particular, Natalino explains how to solve one of the most annoying problems in modern data pipelines, migrating and managing legacy ETL, by generating Spark jobs from a textual representation (NLP and SQL). Natalino also demonstrates an open source web UI, implemented in React, that transforms high-level representations into Spark code, and shows how users can capture and discover data across the organization by accessing a metadata service. Finally, Natalino introduces the datalabframework, a Jupyter-powered lightweight framework that allows machine learning scientists and engineers to build a robust production ML system using notebooks alone.
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/64481
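As a flavour of what generating a Spark job from such a textual representation can look like, here is a minimal PySpark sketch; the metadata layout and function names are hypothetical illustrations, not Sparkola's actual API:

```python
# Hypothetical sketch: a legacy ETL step captured as text (SQL plus a little
# metadata) and executed as a Spark job. Not Sparkola's actual API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generated-etl").getOrCreate()

step = {
    "source": {"path": "/data/raw/orders.parquet", "view": "orders"},
    "sql": "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id",
    "target": {"path": "/data/curated/customer_totals.parquet"},
}

def run_step(step):
    # register the source as a temp view, run the SQL, write the target
    spark.read.parquet(step["source"]["path"]).createOrReplaceTempView(step["source"]["view"])
    spark.sql(step["sql"]).write.mode("overwrite").parquet(step["target"]["path"])

run_step(step)
```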
Today's talk
ETL: why is it still important?
Reporting, Analytics, Enterprise Systems
Unified Data Architecture
Streaming, Queries, ML and AI, APIs and DevOps.
ETL Pipeline Building as Software Engineering
Current solutions and our approach: Sparkola demo
ETL: 4 basic ingredients
Data Sourcing, Data Staging, Data Modeling, Data Load
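A minimal PySpark sketch of the four ingredients as explicit pipeline stages; paths, columns, and transformations are hypothetical:

```python
# The four ingredients as explicit stages; paths and columns are hypothetical.
from pyspark.sql import SparkSession, DataFrame, functions as F

spark = SparkSession.builder.appName("four-ingredients").getOrCreate()

def source(path: str) -> DataFrame:           # Data Sourcing
    return spark.read.json(path)

def stage(df: DataFrame) -> DataFrame:        # Data Staging: clean, deduplicate
    return df.dropDuplicates(["id"]).na.drop(subset=["id"])

def model(df: DataFrame) -> DataFrame:        # Data Modeling: shape to the target schema
    return df.groupBy("country").agg(F.count("id").alias("users"))

def load(df: DataFrame, path: str) -> None:   # Data Load
    df.write.mode("overwrite").parquet(path)

load(model(stage(source("/data/raw/users.json"))), "/data/marts/users_by_country")
```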
ETL : how to hold it together?
Metadata
Capture and Version:
Scripts, Sources, Targets, SLA (Retries, Max Duration, Typical Records),
User Permission and Access, Scheduling, Data Quality Constraints,
Behaviour on Error, Mappings (Source to Target), etc ...
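For illustration, this is the kind of metadata one might capture and version for a single job; the field names are hypothetical, not a specific tool's schema:

```python
# Hypothetical metadata for a single ETL job, captured and versioned
# next to the script itself (fields mirror the list above).
job_metadata = {
    "script": "etl/customer_totals.py",
    "sources": ["/data/raw/orders.parquet"],
    "targets": ["/data/curated/customer_totals.parquet"],
    "sla": {"retries": 3, "max_duration_min": 30, "typical_records": 1_000_000},
    "schedule": "0 2 * * *",              # daily at 02:00
    "access": {"owner": "analytics", "readers": ["reporting"]},
    "quality": ["customer_id IS NOT NULL", "total >= 0"],
    "on_error": "alert-and-halt",
    "mappings": {"orders.amount": "customer_totals.total"},
}
```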
ETL: how to hold it together?
Workflow Scheduler
Manages:
Dependencies between Jobs, Data Lineage, Job Re-use, Retries and Alerting on Failure, Fail-over Strategies, Resource Management, etc.
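As a sketch, here is a minimal Airflow DAG (one of the schedulers listed a few slides later) covering dependencies, retries, and alerting; task names and commands are hypothetical, and the imports assume Airflow 2.x:

```python
# A minimal Airflow DAG: explicit job dependencies, retries, and alerting.
# Task names and spark-submit commands are hypothetical; imports are Airflow 2.x.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                          # retries on failure
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,              # alerting on failure
    "email": ["data-team@example.com"],
}

with DAG(
    dag_id="customer_totals",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="spark-submit extract.py")
    transform = BashOperator(task_id="transform", bash_command="spark-submit transform.py")
    load = BashOperator(task_id="load", bash_command="spark-submit load.py")

    extract >> transform >> load           # dependencies between jobs
```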
ETL: how to hold it together?
Glossary
Keeps the semantics and meaning of the data:
Name mappings between domains, Business Taxonomies, Technical Column Names, Naming Hierarchies, Documentation on data columns and fields.
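A hypothetical glossary entry, sketched as plain data, mapping a business term to its technical column, taxonomy, and documentation:

```python
# A hypothetical glossary entry: business term -> technical name, taxonomy,
# and documentation, kept as versioned data alongside the pipelines.
glossary = {
    "Customer Lifetime Value": {
        "domain": "marketing",
        "table": "curated.customer_metrics",
        "column": "clv_eur",
        "taxonomy": ["finance", "customer", "kpi"],
        "doc": "Expected discounted gross margin over the customer relationship, in EUR.",
    },
}
```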
ETL: how to hold it together?
Data Security
Provides controlled access to the data universe:
Access Control, Data Encryption, Data Tokenization, Role and Policy Management, Data Filtering, Query Rewriting, etc.
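A small PySpark sketch of data filtering and tokenization applied at read time; columns and paths are hypothetical, and in practice policy enforcement would live in a tool like Ranger, Sentry, or Knox:

```python
# Data filtering and tokenization at read time; columns and paths are
# hypothetical. Real policy enforcement would sit in Ranger/Sentry/Knox.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("secured-read").getOrCreate()

secured = (
    spark.read.parquet("/data/raw/customers.parquet")
    .where(F.col("region") == "EU")                    # row-level data filtering
    .withColumn("email", F.sha2(F.col("email"), 256))  # tokenize a PII column
    .drop("credit_card_number")                        # column-level access control
)
```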
ETL tooling: open source projects
Task | Synopsis | Tools
Scheduling and Workflow | Manage job dependencies | Airflow, Azkaban, Oozie
Dataflow Processors | Concatenate transformations | Nifi, Seahorse, Streamsets
Dataflow UIs | Edit and create data flows | Kylo, Seahorse
Metadata | Capture and edit workflow info | Atlas, Falcon, Protégé
Security | Manage access, roles, policies | Sentry, Ranger, Knox
How is the Open Source Community doing?
● Tooling is still quite “green”
● Most of these tools are not sexy …
● Proprietary solutions still dominate the market
● User experience and usability are not great yet
● Limited integration with the various engines
Unified Data Architecture
• Streaming Analytics
• Big Data / Big Queries
• ML and AI
• APIs and DS Automation
• DS Exploration
https://eng.uber.com/michelangelo/
Data People: 8 profiles
(figure: eight profiles mapped across three skill axes: Maths, Domain Expertise, Technology)
DevOps: Expose models
ML Engineer: CI/CD for models
Data Engineer: Admin cluster services
Data Scientist: Looks for patterns, predictions
Business Analyst: Reporting and Biz Ops
BizDev: New business features
Statistician: Advanced modeling
AI Researcher: ML at scale, new algorithms
… you actually only need 4 profiles …
Cloud and Virtualization
No need for infra; DevOps take over provisioning.
Researchers and Statisticians
Who are we kidding? Just use the algos from the NIPS people.
ML Engineers and DevOps
CI/CD pipelines, both for code *and* models
Business Analysts
They are all data scientists. End of the story.
(figure: the four profiles, BizDev, ML-DevOps, Data Engineer, and Data Scientist, overlaid on the original Kiss lineup)
https://commons.wikimedia.org/wiki/Category:Kiss_(band)#/media/File:Kiss_original_lineup_(1976).jpg
ETL Pipelines as Software Engineering
Designing, implementing, and deploying scalable ETL pipelines requires proper software engineering practices.
With Sparkola we address ETL design as a proper software engineering project. How?
Modularity
• Encapsulation •
Pipelines must be broken up into basic blocks (separation of concerns) that can be glued together using `scripting languages`
• Extensibility •
It should be extremely easy to create, install, test, publish and deploy new components
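A minimal sketch of the encapsulation idea in PySpark: each block is a self-contained DataFrame-to-DataFrame function, and a tiny script glues any list of blocks together (the block names are hypothetical):

```python
# Each block is a self-contained DataFrame -> DataFrame function; a small
# script composes any list of blocks into one pipeline (blocks hypothetical).
from functools import reduce
from pyspark.sql import DataFrame, functions as F

def drop_nulls(df: DataFrame) -> DataFrame:
    return df.na.drop()

def add_ingest_date(df: DataFrame) -> DataFrame:
    return df.withColumn("ingest_date", F.current_date())

def pipeline(df: DataFrame, blocks) -> DataFrame:
    # the 'glue': fold the blocks over the input DataFrame, left to right
    return reduce(lambda acc, block: block(acc), blocks, df)

# out = pipeline(spark.read.parquet("/data/raw/events"), [drop_nulls, add_ingest_date])
```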
Usability
• Multi-language •
Multiple ways of `gluing` components together should be provided: SQL, rule-based, an interactive Excel-like interface, scripting
• IDE •
A proper development environment should be provided
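For example, SQL as one of the glue languages: a block exposes its output as a temporary view, so the next step can be written in plain SQL rather than Python (table and column names are hypothetical):

```python
# SQL as glue: expose a block's output as a temp view, then write the next
# step in plain SQL. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-glue").getOrCreate()

spark.read.parquet("/data/staged/events").createOrReplaceTempView("events")

daily = spark.sql("""
    SELECT event_date, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date
""")
```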
Testability
• Debugging •
It should be possible to interactively debug ETL pipelines and analyze problems
• Testing Framework •
Data validation rules should be part of the pipeline definition, and `unit tests` should be bundled
with the ETL pipeline
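A sketch of validation rules bundled with the pipeline definition: each rule is a SQL predicate that must hold for every row, and a unit test can run the same check on a tiny in-memory fixture (rules and names are hypothetical):

```python
# Validation rules as part of the pipeline definition: each rule is a SQL
# predicate that must hold for every row of the output (rules are hypothetical).
from pyspark.sql import SparkSession, DataFrame, functions as F

RULES = ["customer_id IS NOT NULL", "total >= 0"]

def validate(df: DataFrame, rules=RULES) -> None:
    for rule in rules:
        violations = df.where(~F.expr(rule)).count()
        assert violations == 0, f"{violations} rows violate rule: {rule}"

# Unit-test style usage: run the check on a tiny in-memory fixture
spark = SparkSession.builder.appName("etl-tests").getOrCreate()
fixture = spark.createDataFrame([(1, 10.0), (2, 0.0)], ["customer_id", "total"])
validate(fixture)
```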
Continuous Integration
• Building and packaging •
It should be possible to package and deploy ETL pipelines as stand-alone components
• Automated testing •
Before deployment, data validation tests should be executed
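As a sketch, packaging an ETL pipeline as a stand-alone Python component with setuptools; package and module names are hypothetical:

```python
# setup.py sketch: bundle the pipeline code, its dependencies, and an entry
# point so the ETL job deploys as one stand-alone component (names hypothetical).
from setuptools import setup, find_packages

setup(
    name="customer-totals-etl",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["pyspark>=2.3"],
    entry_points={"console_scripts": ["run-etl = customer_totals.job:main"]},
)
```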
Traceability
• Readability •
ETL pipelines should be metadata-driven and human-readable
• Version Control •
Any change to the ETL pipeline should be versioned and tracked
• Static Analysis •
ETL code analysis should be performed and reported in the form of lineage
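A sketch of how lineage can fall out of a metadata-driven pipeline: since every step declares its sources and targets, lineage records are a simple walk over the step definitions (step names are hypothetical):

```python
# Lineage from a metadata-driven pipeline: every step declares its sources
# and targets, so lineage records are a simple walk over the step definitions.
steps = [
    {"name": "stage_orders", "sources": ["raw.orders"], "targets": ["staged.orders"]},
    {"name": "customer_totals", "sources": ["staged.orders"], "targets": ["curated.customer_totals"]},
]

lineage = [
    (src, step["name"], tgt)
    for step in steps
    for src in step["sources"]
    for tgt in step["targets"]
]
# -> [('raw.orders', 'stage_orders', 'staged.orders'),
#     ('staged.orders', 'customer_totals', 'curated.customer_totals')]
```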
Here is Sparkola
• Development •
Interactive development of the ETL pipeline using a web-based IDE
• Testing •
Automated validation tests are run in a CI/CD environment
• Deployment •
Pipelines are packaged and deployed, and lineage metadata is automatically generated