1
Data production pipelines:
Legacy, practices, and innovation
Natalino Busa
Matteo Pelati
2
Today's talk
ETL: why is it still important?
Reporting, Analytics, Enterprise Systems
Unified Data Architecture
Streaming, Queries, ML and AI, APIs and DevOps.
ETL Pipeline Building as Software Engineering
Current Solutions, our approach: Sparkola demo
3
ETL :
Why is it still important?
4
ETL : 4 basic ingredients
• Data Sourcing
• Data Staging
• Data Modeling
• Data Load
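As a rough illustration of how the four ingredients chain together, here is a minimal Python sketch. The in-memory source, the column names and the function names are all hypothetical and only stand in for real systems.

```python
def source():
    # Data Sourcing: hypothetical in-memory source standing in for a real system.
    return [{"id": "1", "amount": "10.5"}, {"id": "2", "amount": "n/a"}]


def stage(rows):
    # Data Staging: parse raw strings and skip rows that cannot be parsed.
    staged = []
    for row in rows:
        try:
            staged.append({"id": int(row["id"]), "amount": float(row["amount"])})
        except ValueError:
            continue
    return staged


def model(rows):
    # Data Modeling: reshape staged rows into the (assumed) target schema.
    return [{"order_id": r["id"], "amount_cents": round(r["amount"] * 100)} for r in rows]


def load(rows, target):
    # Data Load: append to the target (a plain list here) and report the row count.
    target.extend(rows)
    return len(rows)


def run_pipeline(target):
    return load(model(stage(source())), target)
```

Each stage only depends on the output of the previous one, so any of them can be swapped or tested in isolation.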
5
ETL : how to hold it together?
Metadata
Capture and Version:
Scripts, Sources, Targets, SLA (Retries, Max Duration, Typical Records),
User Permission and Access, Scheduling, Data Quality Constraints,
Behaviour on Error, Mappings (Source to Target), etc ...
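One possible way to capture and version this kind of metadata is a small immutable record whose version is derived from its content. The sketch below is an assumption for illustration only: the field names, the cron-style schedule string and the hashing scheme are not taken from any particular tool.

```python
from dataclasses import asdict, dataclass, field
import hashlib
import json


@dataclass(frozen=True)
class SLA:
    retries: int = 3
    max_duration_s: int = 3600
    typical_records: int = 0


@dataclass(frozen=True)
class JobMetadata:
    name: str
    script: str
    sources: tuple
    targets: tuple
    schedule: str                 # hypothetical cron-style expression
    on_error: str = "fail"        # behaviour on error
    sla: SLA = field(default_factory=SLA)
    mappings: tuple = ()          # (source_column, target_column) pairs

    def version(self) -> str:
        # Content-derived version: any change to the metadata changes the hash.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


orders = JobMetadata(
    name="orders_daily",
    script="jobs/orders.sql",
    sources=("raw.orders",),
    targets=("mart.orders",),
    schedule="0 2 * * *",
    mappings=(("ord_id", "order_id"),),
)
```

Because the version is computed from the content, two identical definitions share a version and any edit (for example to the SLA) yields a new one.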
6
ETL : how to hold it together?
Workflow Scheduler
Manages:
Dependencies between Jobs, Data Lineage, Job re-use,
Retries and Alerting on failure, Fail-over strategies, Resource
Management, etc ...
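To make two of these responsibilities concrete (job dependencies and retries), here is a toy scheduler sketch in Python. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+); the `run_jobs` function, its arguments and the log format are hypothetical, and real schedulers such as Airflow do far more.

```python
from graphlib import TopologicalSorter


def run_jobs(jobs, deps, max_retries=2):
    """Run jobs in dependency order.

    jobs: name -> callable; deps: name -> set of upstream job names.
    Returns a log of (job, status, attempts) tuples.
    """
    graph = {name: deps.get(name, set()) for name in jobs}
    log = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                jobs[name]()
                log.append((name, "ok", attempt))
                break
            except Exception:
                if attempt == max_retries + 1:
                    # Retries exhausted: record the failure and stop,
                    # so that downstream jobs never run on missing data.
                    log.append((name, "failed", attempt))
                    return log
    return log
```

Alerting and fail-over would hook in at the point where the failure is logged.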
7
ETL : how to hold it together?
Glossary
Keeps the semantics and the meaning of data :
Name mappings between domains, Business taxonomies,
Technical column names, naming hierarchies, Documentation on
data columns and data fields.
8
ETL : how to hold it together?
Data Security
Provides controlled access to the data universe:
Access Control, Data Encryption, Data Tokenization, Roles and
Policies management, Data Filtering, Queries Rewrite, etc ...
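Two of these controls, tokenization and role-based filtering, can be sketched in a few lines of Python. The roles, column names and policy table below are hypothetical, and a production setup would rely on a dedicated tool (such as the ones listed on the next slide) rather than application code.

```python
import hashlib
import hmac

# Hypothetical policy table: role -> columns that role may see.
POLICIES = {
    "analyst": {"customer_token", "country", "amount"},
    "admin": {"customer_token", "country", "amount", "email"},
}


def tokenize(value: str, key: bytes) -> str:
    # Deterministic keyed token: same input and key give the same token,
    # so joins still work without exposing the raw identifier.
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def view_for(role: str, row: dict, key: bytes) -> dict:
    protected = dict(row)
    protected["customer_token"] = tokenize(protected.pop("customer_id"), key)
    allowed = POLICIES.get(role, set())  # unknown roles see nothing
    return {col: val for col, val in protected.items() if col in allowed}
```

For example, `view_for("analyst", row, key)` drops the `email` column and replaces `customer_id` with a token.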
9
ETL tooling: open source projects
Task | Synopsis | Tool
Scheduling and Workflow | Manage Job Dependencies | Airflow, Azkaban, Oozie
Dataflow Processors | Concatenate Transformations | NiFi, Seahorse, StreamSets
Dataflow UIs | Edit and Create Data Flows | Kylo, Seahorse
Metadata | Capture and Edit Workflow Info | Atlas, Falcon, Protégé
Security | Managed Access, Roles, Policies | Sentry, Ranger, Knox
10
How is the Open Source Community doing?
● Still quite “green” tooling
● Most of these tools are not sexy …
● Proprietary solutions still dominate the market
● User Experience and Usability not great yet
● Low Integration with various engines
11
Unified Data
Architectures
12
Unified Data Architecture
• Streaming Analytics
• Big Data / Big Queries
• ML and AI
• APIs and DS Automation
• DS Exploration
https://eng.uber.com/michelangelo/
13
Data People: 8 profiles
DevOps: Expose models
ML Engineer: CI-CD models
Data Engineer: Admin Cluster Services
Data Scientist: Looks for patterns, predictions
Business Analyst: Reporting and Biz Ops
BizDev: New Business Features
Statistician: Advanced Modeling
AI Researcher: ML at scale, New Algorithms
Skill axes: Maths, Domain Expertise, Technology
17
… you actually only need 4 profiles ...
Cloud and Virtualization
No Need for Infra. DevOps take over provisioning.
Researcher and Statisticians
Who are we kidding? Just use the algos from NIPS people.
ML engineers and DevOps
CI/CD Pipelines both for Code *and* Models
Business Analysts
They are all data scientists. End of the story.
18
BizDev
ML DevOps
Data Engineer
Data Scientist
https://commons.wikimedia.org/wiki/Category:Kiss_(band)#/media/File:Kiss_original_lineup_(1976).jpg
19
20
ML DevOps
Data Flows
• Manage data
• Collect Metrics
• Provision Resources
• Setup ETL flows
ML CI/CD
• Train models
• Evaluate models
• Package models
• Deploy models
APIs
• Expose and monitor APIs
• A/B Testing Strategies
• Monitor predictions Quality
• Tune API performance
21
ML DevOps: It’s all about automation
22
ETL Engineering:
Sparkola
23
ETL Pipelines as Software Engineering
Designing, Implementing and deploying scalable ETL pipelines
requires proper Software Engineering practices
• • •
With Sparkola we treat ETL design as a proper software engineering
project. How?
24
Modularity
• Encapsulation •
Pipelines must be broken up into basic blocks (separation of concerns) that can be glued together
using `scripting languages`
• Extensibility •
It should be extremely easy to create, install, test, publish and deploy new components
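A minimal sketch of what encapsulation and extensibility could look like, assuming components are plain functions over rows and a registry maps names to components. This is not Sparkola's API; `REGISTRY`, `component` and `build_pipeline` are hypothetical names for illustration.

```python
from functools import reduce

# Hypothetical component registry: name -> transformation over a list of rows.
REGISTRY = {}


def component(name):
    """Decorator that publishes a function as a reusable pipeline block."""
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register


@component("drop_nulls")
def drop_nulls(rows):
    return [r for r in rows if all(v is not None for v in r.values())]


@component("uppercase_country")
def uppercase_country(rows):
    return [{**r, "country": r["country"].upper()} for r in rows]


def build_pipeline(step_names):
    # Glue registered blocks together by name, in order.
    steps = [REGISTRY[name] for name in step_names]
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)
```

Adding a new component is then a matter of writing one decorated function; the pipeline definition itself stays a list of names.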
25
Usability
• Multi-language •
Multiple ways of `gluing` components together should be provided: SQL, rule-based, interactive
Excel-like interface, scripting
• IDE •
A proper development environment should be provided
26
Testability
• Debugging •
It should be possible to interactively debug ETL pipelines and analyze problems
• Testing Framework •
Data validation rules should be part of the pipeline definition, and `unit tests` should be bundled
with the ETL pipeline
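The idea of keeping validation rules and unit tests next to the pipeline can be sketched as follows. The dictionary layout, rule names and the `run_with_validation` helper are assumptions made for this example, not a real framework.

```python
# Hypothetical pipeline definition: the transformation and its data
# validation rules live in the same object.
PIPELINE = {
    "name": "orders",
    "transform": lambda rows: [{**r, "total": r["qty"] * r["price"]} for r in rows],
    "rules": [
        ("qty is positive", lambda r: r["qty"] > 0),
        ("total is non-negative", lambda r: r["total"] >= 0),
    ],
}


def run_with_validation(pipeline, rows):
    """Run the transformation, then return (output, list of rule violations)."""
    out = pipeline["transform"](rows)
    violations = [
        (name, row)
        for row in out
        for name, check in pipeline["rules"]
        if not check(row)
    ]
    return out, violations


def test_orders_pipeline():
    # Unit test bundled with the pipeline: small fixture, exact expected output.
    out, violations = run_with_validation(PIPELINE, [{"qty": 2, "price": 5}])
    assert out == [{"qty": 2, "price": 5, "total": 10}]
    assert violations == []
```

The same `run_with_validation` call can be used interactively while debugging and automatically in a CI job before deployment.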
27
Continuous Integration
• Building and packaging •
It should be possible to package and deploy ETL pipelines as stand-alone components
• Automated testing •
Before deployment, data validation tests should be executed
28
Traceability
• Readability •
ETL pipelines should be metadata-driven and human-readable
• Version Control •
Any change to the ETL pipeline should be versioned and tracked
• Static Analysis •
ETL code analysis should be performed and reported in the form of lineage
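When a pipeline is described by metadata, lineage can be derived without running it. The sketch below assumes a hypothetical step format with `reads` and `writes` fields; the dataset names are made up.

```python
# Hypothetical metadata-driven pipeline definition.
PIPELINE = [
    {"step": "stage_orders", "reads": ["raw.orders"], "writes": "staging.orders"},
    {"step": "stage_customers", "reads": ["raw.customers"], "writes": "staging.customers"},
    {"step": "build_report",
     "reads": ["staging.orders", "staging.customers"],
     "writes": "mart.report"},
]


def lineage(pipeline):
    """Map each written dataset to the datasets it is directly derived from."""
    return {step["writes"]: set(step["reads"]) for step in pipeline}


def upstream(dataset, edges):
    """All datasets that `dataset` transitively depends on."""
    seen = set()
    stack = [dataset]
    while stack:
        for parent in edges.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Since the definition is plain data, it can be kept in version control and diffed like any other source file.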
29
Here is Sparkola
• Development •
Interactive development of the ETL pipeline using a web-based IDE
• Testing •
Automated validation tests are run in a CI/CD environment
• Deployment •
Pipelines are packaged and deployed and lineage metadata is automatically generated
30
• Demo •
