SlideShare a Scribd company logo
1 of 52
Building Data
Pipelines in Python
Marco Bonzanini
QCon London 2017
InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations
/data-pipelines-python
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
Nice to meet you
R&D ≠ Engineering
R&D ≠ Engineering
R&D results in production = high value
Big Data Problems
vs
Big Data Problems
Data Pipelines (from 30,000ft)
Data ETL Analytics
Data Pipelines (zooming in)
ETL
{Extract
Transform
Load
{
Clean
Augment
Join
Good Data Pipelines
Easy to
Reproduce
Productise{
Towards Good Data Pipelines
Towards Good Data Pipelines (a)
Your Data is Dirty
unless proven otherwise
“It’s in the database, so it’s already good”
Towards Good Data Pipelines (b)
All Your Data is Important
unless proven otherwise
Towards Good Data Pipelines (b)
All Your Data is Important
unless proven otherwise
Keep it. Transform it. Don’t overwrite it.
Towards Good Data Pipelines (c)
Pipelines vs Script Soups
Tasty, but not a pipeline
Pic: Romanian potato soup from Wikipedia
$ ./do_something.sh
$ ./do_something_else.sh
$ ./extract_some_data.sh
$ ./join_some_other_data.sh
...
Anti-pattern: the script soup
Script soups kill replicability
$ cat ./run_everything.sh
./do_something.sh
./do_something_else.sh
./extract_some_data.sh
./join_some_other_data.sh
$ ./run_everything.sh
Anti-pattern: the master script
Towards Good Data Pipelines (d)
Break it Down
setup.py and conda
Towards Good Data Pipelines (e)
Automated Testing
i.e. why scientists don’t write unit tests
Intermezzo
Let me rant about testing
Icon by Freepik from flaticon.com
(Unit) Testing
Unit tests in three easy steps:
• import unittest
• Write your tests
• Quit complaining about lack of time to write tests
Benefits of (unit) testing
• Safety net for refactoring
• Safety net for lib upgrades
• Validate your assumptions
• Document code / communicate your intentions
• You’re forced to think
Testing: not convinced yet?
Testing: not convinced yet?
Testing: not convinced yet?


f1 = fscore(p, r)
min_bound, max_bound = sorted([p, r])
assert min_bound <= f1 <= max_bound
Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• The Python ecosystem is rich: 

py.test, nosetests, hypothesis, coverage.py, …
</rant>
Towards Good Data Pipelines (f)
Orchestration
Don’t re-invent the wheel
You need a workflow manager
Think: 

GNU Make + Unix pipes + Steroids
Intro to Luigi
• Task dependency management
• Error control, checkpoints, failure recovery
• Minimal boilerplate
• Dependency graph visualisation
$ pip install luigi
Luigi Task: unit of execution
class MyTask(luigi.Task):
def requires(self):
return [SomeTask()]
def output(self):
return luigi.LocalTarget(…)
def run(self):
mylib.run()
Luigi Target: output of a task
class MyTarget(luigi.Target):
def exists(self):
... # return bool
Great off the shelf support 

local file system, S3, Elasticsearch, RDBMS
(also via luigi.contrib)
Intro to Airflow
• Like Luigi, just younger
• Nicer (?) GUI
• Scheduling
• Apache Project
Towards Good Data Pipelines (g)
When things go wrong
The Joy of debugging
import logging
Who reads the logs?
You’re not going to read the logs, unless…
• E-mail notifications (built-in in Luigi)
• Slack notifications
$ pip install luigi_slack # WIP
Towards Good Data Pipelines (h)
Static Analysis
The Joy of Duck Typing
If it looks like a duck,
swims like a duck,
and quacks like a duck,
then it probably is a duck.
— somebody on the Web
>>> 1.0 == 1 == True
True
>>> 1 + True
2
>>> '1' * 2
'11'
>>> '1' + 2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object
to str implicitly
def do_stuff(a: int,
b: int) -> str:
...
return something
PEP 3107 — Function Annotations

(since Python 3.0)
(annotations are ignored by the interpreter)
typing module: semantically coherent
PEP 484 — Type Hints

(since Python 3.5)
(still ignored by the interpreter)
pip install mypy
• Add optional types
• Run:
mypy --follow-imports silent mylib
• Refine gradual typing (e.g. Any)
Summary
Basic engineering principles help

(packaging, testing, orchestration, logging, static analysis, ...)
Summary
R&D is not Engineering:

can we meet halfway?
Vanity Slide
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini
Watch the video with slide synchronization on
InfoQ.com!
https://www.infoq.com/presentations/data-
pipelines-python

More Related Content

What's hot

Introducing Saga Pattern in Microservices with Spring Statemachine
Introducing Saga Pattern in Microservices with Spring StatemachineIntroducing Saga Pattern in Microservices with Spring Statemachine
Introducing Saga Pattern in Microservices with Spring StatemachineVMware Tanzu
 
Vectors are the new JSON in PostgreSQL
Vectors are the new JSON in PostgreSQLVectors are the new JSON in PostgreSQL
Vectors are the new JSON in PostgreSQLJonathan Katz
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Event Driven Systems with Spring Boot, Spring Cloud Streams and Kafka
Event Driven Systems with Spring Boot, Spring Cloud Streams and KafkaEvent Driven Systems with Spring Boot, Spring Cloud Streams and Kafka
Event Driven Systems with Spring Boot, Spring Cloud Streams and KafkaVMware Tanzu
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introductionchrislusf
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database Systemconfluent
 
Effectively-once semantics in Apache Pulsar
Effectively-once semantics in Apache PulsarEffectively-once semantics in Apache Pulsar
Effectively-once semantics in Apache PulsarMatteo Merli
 
Stephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large StateStephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large StateFlink Forward
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...Databricks
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAltinity Ltd
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheusCeline George
 
Vert.x for Microservices Architecture
Vert.x for Microservices ArchitectureVert.x for Microservices Architecture
Vert.x for Microservices ArchitectureIdan Fridman
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022HostedbyConfluent
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...HostedbyConfluent
 
Distributed Transactions: Saga Patterns
Distributed Transactions: Saga PatternsDistributed Transactions: Saga Patterns
Distributed Transactions: Saga PatternsKnoldus Inc.
 

What's hot (20)

Introducing Saga Pattern in Microservices with Spring Statemachine
Introducing Saga Pattern in Microservices with Spring StatemachineIntroducing Saga Pattern in Microservices with Spring Statemachine
Introducing Saga Pattern in Microservices with Spring Statemachine
 
Vectors are the new JSON in PostgreSQL
Vectors are the new JSON in PostgreSQLVectors are the new JSON in PostgreSQL
Vectors are the new JSON in PostgreSQL
 
Prometheus Storage
Prometheus StoragePrometheus Storage
Prometheus Storage
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Event Driven Systems with Spring Boot, Spring Cloud Streams and Kafka
Event Driven Systems with Spring Boot, Spring Cloud Streams and KafkaEvent Driven Systems with Spring Boot, Spring Cloud Streams and Kafka
Event Driven Systems with Spring Boot, Spring Cloud Streams and Kafka
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Effectively-once semantics in Apache Pulsar
Effectively-once semantics in Apache PulsarEffectively-once semantics in Apache Pulsar
Effectively-once semantics in Apache Pulsar
 
Stephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large StateStephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large State
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Vert.x for Microservices Architecture
Vert.x for Microservices ArchitectureVert.x for Microservices Architecture
Vert.x for Microservices Architecture
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
 
Distributed Transactions: Saga Patterns
Distributed Transactions: Saga PatternsDistributed Transactions: Saga Patterns
Distributed Transactions: Saga Patterns
 

Similar to Building Data Pipelines in Python

Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
Reveal's Advanced Analytics: Using R & Python
Reveal's Advanced Analytics: Using R & PythonReveal's Advanced Analytics: Using R & Python
Reveal's Advanced Analytics: Using R & PythonPoojitha B
 
A Practical Road to SaaS in Python
A Practical Road to SaaS in PythonA Practical Road to SaaS in Python
A Practical Road to SaaS in PythonC4Media
 
Microservices Chaos Testing at Jet
Microservices Chaos Testing at JetMicroservices Chaos Testing at Jet
Microservices Chaos Testing at JetC4Media
 
Continuous Integration with Cloud Foundry Concourse and Docker on OpenPOWER
Continuous Integration with Cloud Foundry Concourse and Docker on OpenPOWERContinuous Integration with Cloud Foundry Concourse and Docker on OpenPOWER
Continuous Integration with Cloud Foundry Concourse and Docker on OpenPOWERIndrajit Poddar
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
Untangling - fall2017 - week 9
Untangling - fall2017 - week 9Untangling - fall2017 - week 9
Untangling - fall2017 - week 9Derek Jacoby
 
Continuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon OttoContinuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon OttoPeter Bittner
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Henry S
 
Open frameworks 101_fitc
Open frameworks 101_fitcOpen frameworks 101_fitc
Open frameworks 101_fitcbenDesigning
 
Creating a Smooth Development Workflow for High-Quality Modular Open-Source P...
Creating a Smooth Development Workflow for High-Quality Modular Open-Source P...Creating a Smooth Development Workflow for High-Quality Modular Open-Source P...
Creating a Smooth Development Workflow for High-Quality Modular Open-Source P...Pantheon
 
SwissJUG_Bringing the cloud back down to earth.pptx
SwissJUG_Bringing the cloud back down to earth.pptxSwissJUG_Bringing the cloud back down to earth.pptx
SwissJUG_Bringing the cloud back down to earth.pptxGrace Jansen
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesTao Xie
 
Going open source with small teams
Going open source with small teamsGoing open source with small teams
Going open source with small teamsJamie Thomas
 
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...Jazkarta, Inc.
 
Cytoscape CI Chapter 2
Cytoscape CI Chapter 2Cytoscape CI Chapter 2
Cytoscape CI Chapter 2bdemchak
 
The Power of Azure DevOps - Global Azure Day 2020
The Power of Azure DevOps - Global Azure Day 2020The Power of Azure DevOps - Global Azure Day 2020
The Power of Azure DevOps - Global Azure Day 2020Jeff Bramwell
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonJo-fai Chow
 

Similar to Building Data Pipelines in Python (20)

Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Reveal's Advanced Analytics: Using R & Python
Reveal's Advanced Analytics: Using R & PythonReveal's Advanced Analytics: Using R & Python
Reveal's Advanced Analytics: Using R & Python
 
A Practical Road to SaaS in Python
A Practical Road to SaaS in PythonA Practical Road to SaaS in Python
A Practical Road to SaaS in Python
 
Microservices Chaos Testing at Jet
Microservices Chaos Testing at JetMicroservices Chaos Testing at Jet
Microservices Chaos Testing at Jet
 
Continuous Integration with Cloud Foundry Concourse and Docker on OpenPOWER
Continuous Integration with Cloud Foundry Concourse and Docker on OpenPOWERContinuous Integration with Cloud Foundry Concourse and Docker on OpenPOWER
Continuous Integration with Cloud Foundry Concourse and Docker on OpenPOWER
 
Raising the Bar
Raising the BarRaising the Bar
Raising the Bar
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Untangling - fall2017 - week 9
Untangling - fall2017 - week 9Untangling - fall2017 - week 9
Untangling - fall2017 - week 9
 
Continuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon OttoContinuous Delivery for Python Developers – PyCon Otto
Continuous Delivery for Python Developers – PyCon Otto
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1
 
Write tests, please
Write tests, pleaseWrite tests, please
Write tests, please
 
Open frameworks 101_fitc
Open frameworks 101_fitcOpen frameworks 101_fitc
Open frameworks 101_fitc
 
Creating a Smooth Development Workflow for High-Quality Modular Open-Source P...
Creating a Smooth Development Workflow for High-Quality Modular Open-Source P...Creating a Smooth Development Workflow for High-Quality Modular Open-Source P...
Creating a Smooth Development Workflow for High-Quality Modular Open-Source P...
 
SwissJUG_Bringing the cloud back down to earth.pptx
SwissJUG_Bringing the cloud back down to earth.pptxSwissJUG_Bringing the cloud back down to earth.pptx
SwissJUG_Bringing the cloud back down to earth.pptx
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and Challenges
 
Going open source with small teams
Going open source with small teamsGoing open source with small teams
Going open source with small teams
 
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
 
Cytoscape CI Chapter 2
Cytoscape CI Chapter 2Cytoscape CI Chapter 2
Cytoscape CI Chapter 2
 
The Power of Azure DevOps - Global Azure Day 2020
The Power of Azure DevOps - Global Azure Day 2020The Power of Azure DevOps - Global Azure Day 2020
The Power of Azure DevOps - Global Azure Day 2020
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
 

More from C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

More from C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Recently uploaded

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Recently uploaded (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Building Data Pipelines in Python

  • 1. Building Data Pipelines in Python Marco Bonzanini QCon London 2017
  • 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations /data-pipelines-python • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  • 6. R&D ≠ Engineering R&D results in production = high value
  • 7.
  • 8. Big Data Problems vs Big Data Problems
  • 9. Data Pipelines (from 30,000ft) Data ETL Analytics
  • 10. Data Pipelines (zooming in) ETL {Extract Transform Load { Clean Augment Join
  • 11. Good Data Pipelines Easy to Reproduce Productise{
  • 12. Towards Good Data Pipelines
  • 13. Towards Good Data Pipelines (a) Your Data is Dirty unless proven otherwise “It’s in the database, so it’s already good”
  • 14. Towards Good Data Pipelines (b) All Your Data is Important unless proven otherwise
  • 15. Towards Good Data Pipelines (b) All Your Data is Important unless proven otherwise Keep it. Transform it. Don’t overwrite it.
  • 16. Towards Good Data Pipelines (c) Pipelines vs Script Soups
  • 17. Tasty, but not a pipeline Pic: Romanian potato soup from Wikipedia
  • 18. $ ./do_something.sh $ ./do_something_else.sh $ ./extract_some_data.sh $ ./join_some_other_data.sh ... Anti-pattern: the script soup
  • 19. Script soups kill replicability
  • 21. Towards Good Data Pipelines (d) Break it Down setup.py and conda
  • 22. Towards Good Data Pipelines (e) Automated Testing i.e. why scientists don’t write unit tests
  • 23. Intermezzo Let me rant about testing Icon by Freepik from flaticon.com
  • 24. (Unit) Testing Unit tests in three easy steps: • import unittest • Write your tests • Quit complaining about lack of time to write tests
  • 25. Benefits of (unit) testing • Safety net for refactoring • Safety net for lib upgrades • Validate your assumptions • Document code / communicate your intentions • You’re forced to think
  • 28. Testing: not convinced yet? 
 f1 = fscore(p, r) min_bound, max_bound = sorted([p, r]) assert min_bound <= f1 <= max_bound
  • 29. Testing: I’m almost done • Unit tests vs Defensive Programming • Say no to tautologies • Say no to vanity tests • The Python ecosystem is rich: 
 py.test, nosetests, hypothesis, coverage.py, …
  • 31. Towards Good Data Pipelines (f) Orchestration Don’t re-invent the wheel
  • 32. You need a workflow manager Think: 
 GNU Make + Unix pipes + Steroids
  • 33. Intro to Luigi • Task dependency management • Error control, checkpoints, failure recovery • Minimal boilerplate • Dependency graph visualisation $ pip install luigi
  • 34. Luigi Task: unit of execution class MyTask(luigi.Task): def requires(self): return [SomeTask()] def output(self): return luigi.LocalTarget(…) def run(self): mylib.run()
  • 35. Luigi Target: output of a task class MyTarget(luigi.Target): def exists(self): ... # return bool Great off the shelf support 
 local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
  • 36.
  • 37. Intro to Airflow • Like Luigi, just younger • Nicer (?) GUI • Scheduling • Apache Project
  • 38. Towards Good Data Pipelines (g) When things go wrong The Joy of debugging
  • 40. Who reads the logs? You’re not going to read the logs, unless… • E-mail notifications (built-in in Luigi) • Slack notifications $ pip install luigi_slack # WIP
  • 41. Towards Good Data Pipelines (h) Static Analysis The Joy of Duck Typing
  • 42. If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. — somebody on the Web
  • 43. >>> 1.0 == 1 == True True >>> 1 + True 2
  • 44. >>> '1' * 2 '11' >>> '1' + 2 Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: Can't convert 'int' object to str implicitly
  • 45. def do_stuff(a: int, b: int) -> str: ... return something PEP 3107 — Function Annotations
 (since Python 3.0) (annotations are ignored by the interpreter)
  • 46. typing module: semantically coherent PEP 484 — Type Hints
 (since Python 3.5) (still ignored by the interpreter)
  • 48. • Add optional types • Run: mypy --follow-imports silent mylib • Refine gradual typing (e.g. Any)
  • 49. Summary Basic engineering principles help
 (packaging, testing, orchestration, logging, static analysis, ...)
  • 50. Summary R&D is not Engineering:
 can we meet halfway?
  • 51. Vanity Slide • speakerdeck.com/marcobonzanini • github.com/bonzanini • marcobonzanini.com • @MarcoBonzanini
  • 52. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/data- pipelines-python