Airflow
Production tales
Eran Shemesh - Senior Big Data Developer
Pipeline
Airflow’s Architecture
Why?
The cron way
[Diagram: flows built from Http, Spark, Update DB and Send emails steps. Step durations shown: 30m-50m, 5m-10m, 1h-1.5h, 10 sec, 1m-3m, 20m-40m, 10 sec. Each step has its own cron entry, staggered within the hour: 0 * * * *, 5 * * * *, 15 * * * *, 50 * * * *, 55 * * * *]
■ Each valid flow takes more time than it should
■ Each job has to be aware of the buffer between its execution time and its working time
■ If a task in the flow is retried, the whole flow can fail
■ What if the time buffer is sometimes not enough?
■ What if one of the systems that runs a cron job was down for one or more runs?
■ What if the input data to a flow was incorrect?
■ What if a product requirement change means I need to re-run the past X runs?
■ Visibility
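The first few points can be illustrated with a little arithmetic. The sketch below is only an illustration: the two-step flow, its cron offsets and its run times are hypothetical numbers loosely based on the ranges in the diagram, and the function names are made up for this example.

```python
from datetime import timedelta


def minutes(n):
    return timedelta(minutes=n)


def cron_finish(offsets, durations):
    """Finish time of the last step when every step starts at a fixed cron offset."""
    return offsets[-1] + durations[-1]


def dependency_finish(durations):
    """Finish time when each step starts as soon as the previous one ends."""
    return sum(durations, timedelta())


def buffer_violations(offsets, durations):
    """Indexes of steps that were still running when the next step's cron fired."""
    return [
        i
        for i in range(len(durations) - 1)
        if offsets[i] + durations[i] > offsets[i + 1]
    ]


# Hypothetical flow: a Spark job at minute 0 and a DB update at minute 50.
offsets = [minutes(0), minutes(50)]
fast_run = [minutes(30), minutes(5)]   # Spark finished early
slow_run = [minutes(55), minutes(5)]   # Spark overran the buffer

print(cron_finish(offsets, fast_run))        # 0:55:00
print(dependency_finish(fast_run))           # 0:35:00
print(buffer_violations(offsets, slow_run))  # [0]
```

In the fast run the cron flow still finishes at minute 55 because the second step waits for its slot; in the slow run the second step starts before the first has finished.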
The airflow way
■ Tasks are truly dependent on each other
■ Easily Scalable
■ Web UI
■ Can recover from downtime
Hello Airflow
■ An HTTP request to invoke a job on Databricks (SimpleHttpOperator)
■ Extract the Databricks task_id from the response (PythonOperator)
■ Monitor task progress (HttpSensor) by task id
■ In case of success, get the result (SimpleHttpOperator)
■ Extract the result from the HttpResponse (PythonOperator)
Task chain: SimpleHttpOperator → PythonOperator → HttpSensor → SimpleHttpOperator → PythonOperator
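The Python callables in such a chain might look roughly like the sketch below. It is not the code from the talk: the field names (run_id, state, life_cycle_state, result_state) are assumptions based on the Databricks Jobs API, whereas the slides call the identifier a task id, and FakeResponse is a hypothetical stand-in for the HTTP response object so the example runs without Airflow. In a DAG, extract_run_id would be wrapped by a PythonOperator and run_succeeded passed to HttpSensor as its response_check.

```python
import json


def extract_run_id(response_text):
    """PythonOperator step: pull the run id out of the trigger response body."""
    return json.loads(response_text)["run_id"]


def run_succeeded(response):
    """HttpSensor response_check: False keeps poking, True lets the flow continue."""
    state = response.json()["state"]
    if state["life_cycle_state"] in ("PENDING", "RUNNING", "TERMINATING"):
        return False
    if state.get("result_state") == "SUCCESS":
        return True
    # Any other terminal state fails the sensor instead of fetching a result.
    raise RuntimeError("Databricks run did not succeed: %s" % state)


class FakeResponse:
    """Hypothetical stand-in for an HTTP response, for illustration only."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload

    @property
    def text(self):
        return json.dumps(self._payload)


trigger = FakeResponse({"run_id": 42})
running = FakeResponse({"state": {"life_cycle_state": "RUNNING"}})
done = FakeResponse(
    {"state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}}
)

print(extract_run_id(trigger.text))  # 42
print(run_succeeded(running))        # False
print(run_succeeded(done))           # True
```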
Subdags
Use with caution!
■ An operator like any other, which runs a group of tasks on its own
■ Better visualisation
■ Reusable Components
■ Encapsulation
Sub - DAGs
■ There is no retry mechanism at the DAG level, only at the task level
■ Out of the box, a sub DAG does not retry well
■ We used the sub DAG operator's on_retry_callback to implement its retry mechanism when needed
Retryable Sub Dags
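The slides do not show what the callback did. One plausible approach, sketched here as an assumption rather than the authors' implementation, is to clear the sub DAG's task instances for the current execution date so that the retried operator runs them again. The sketch relies on the operator exposing its child DAG as subdag and on the DAG having a clear(start_date, end_date) method; RecordingDag is a hypothetical stub that lets the example run without Airflow.

```python
from datetime import datetime
from types import SimpleNamespace


def clear_subdag_on_retry(context):
    """on_retry_callback: reset the child DAG's task instances for this run."""
    subdag = context["task"].subdag
    execution_date = context["execution_date"]
    subdag.clear(start_date=execution_date, end_date=execution_date)


class RecordingDag:
    """Hypothetical stub that records clear() calls instead of touching a database."""

    def __init__(self):
        self.cleared = []

    def clear(self, start_date=None, end_date=None):
        self.cleared.append((start_date, end_date))


# Simulate the context Airflow would hand to the callback.
child = RecordingDag()
context = {
    "task": SimpleNamespace(subdag=child),
    "execution_date": datetime(2020, 1, 1, 10),
}
clear_subdag_on_retry(context)
print(child.cleared)
```

The callback would be attached to the sub DAG operator through its on_retry_callback argument, together with a retries value greater than zero.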
Airflow’s Architecture
Sub dags - use with caution!
[Diagrams: a worker with a fixed concurrency level runs both subdag operators and ordinary tasks. Across the slides its slots fill up with subdag operators (two, then four, then all six), leaving no slot for the tasks those sub DAGs are waiting on.]
■ Airflow 1.10's default solution: SequentialExecutor (one process to run them all)
■ Second option: add more workers (Worker 1, Worker 2, Worker 3, each with its own concurrency level)
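The slot starvation in the diagrams can be modelled in a few lines. This is a simplified sketch with hypothetical function names; it assumes that a running subdag operator holds one worker slot and that its child tasks compete for the same slots, which is not the case when the sub DAG runs its tasks with SequentialExecutor.

```python
def free_task_slots(worker_concurrency, running_subdag_operators):
    """Slots left for ordinary tasks once subdag operators hold theirs."""
    return max(worker_concurrency - running_subdag_operators, 0)


def is_deadlocked(worker_concurrency, running_subdag_operators):
    """True when every slot is held by a subdag operator waiting on child tasks."""
    return (
        running_subdag_operators > 0
        and free_task_slots(worker_concurrency, running_subdag_operators) == 0
    )


# A worker with six slots, as in the diagrams.
for subdags in (2, 4, 6):
    print(subdags, free_task_slots(6, subdags), is_deadlocked(6, subdags))
# 2 4 False
# 4 2 False
# 6 0 True
```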
Monitoring
And auto fixing...
Monitoring pipeline
■ A typical flow
■ Each task (or group of tasks) is followed by a monitoring task
■ Each monitoring task is itself a group of tasks for monitoring and auto-fixing
Building modules
■ A template of tasks and dependencies between them
■ Using the template method design pattern, the module dictates the general flow, and business-logic subclasses implement the individual steps
■ Most commonly used inside a sub DAG, as in the monitoring example
DAG extensions
■ Creating a template for a set of tasks
■ Further extending this template when needed
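The code from these slides is not part of the transcript. As a minimal sketch of the idea, the classes below use the template method pattern in plain Python: the base class fixes the order of tasks, and subclasses supply or extend the logic. All class, method and task names are hypothetical; in a real DAG each (task_id, callable) pair would become an operator, chained in the order returned by build().

```python
from abc import ABC, abstractmethod


class MonitoredStep(ABC):
    """Template: the base class dictates the flow run -> check -> fix."""

    def __init__(self, name):
        self.name = name

    def build(self):
        """Return (task_id, callable) pairs in execution order."""
        return [
            (self.name + "_run", self.run),
            (self.name + "_check", self.check),
            (self.name + "_fix", self.fix),
        ]

    @abstractmethod
    def run(self):
        """Business logic of the step."""

    @abstractmethod
    def check(self):
        """Monitoring logic: return True when the output looks valid."""

    def fix(self):
        """Default auto-fix does nothing; subclasses may override."""
        return None


class HourlyAggregation(MonitoredStep):
    def run(self):
        return "aggregated"

    def check(self):
        return True


class HourlyAggregationWithNotify(HourlyAggregation):
    """Extends the template with one extra task at the end."""

    def build(self):
        return super().build() + [(self.name + "_notify", self.notify)]

    def notify(self):
        return "notified"


print([task_id for task_id, _ in HourlyAggregation("hourly").build()])
print([task_id for task_id, _ in HourlyAggregationWithNotify("hourly").build()])
```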
Some dev paradigms
Use case 1: Skipping daily tasks
■ Every hour, the flow calculates an hourly aggregation and then a daily aggregation
■ When fixing data, or when task runs are delayed, it's unnecessary to calculate partial daily aggregations
■ Using the ShortCircuitOperator, we check whether the next execution should have happened already
■ If it should have, we skip all following tasks in the same DAG run
Hourly and daily flow
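The check behind the ShortCircuitOperator could be written as a plain function like the one below. This is a sketch under one interpretation of the bullets: the next execution "should have happened already" once the current time has passed the moment the following run is due, which is taken here to be next_execution_date plus one schedule interval. The function and parameter names are hypothetical; in a DAG the function would be called from the operator's python_callable with values taken from the task context.

```python
from datetime import datetime, timedelta


def is_latest_run(next_execution_date, schedule_interval, now):
    """Return False (skip downstream tasks) when the following run is already due."""
    following_run_due = next_execution_date + schedule_interval
    return now < following_run_due


hourly = timedelta(hours=1)
next_execution = datetime(2020, 1, 1, 11, 0)

# Running on time, shortly after the hour closed: continue to the daily tasks.
print(is_latest_run(next_execution, hourly, now=datetime(2020, 1, 1, 11, 5)))  # True

# Re-running an old hour later in the day: skip the daily tasks.
print(is_latest_run(next_execution, hourly, now=datetime(2020, 1, 1, 14, 0)))  # False
```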
Use case 2: Programmatically clearing a DAG
S3/{bucket_name}/day=23
S3/{bucket_name}/day=22
S3/{bucket_name}/day=21
S3/{bucket_name}/day=10
■ Create a DAG that executes a single day's flow
■ Runs of that DAG are scheduled by another DAG rather than by Airflow's scheduler
■ The scheduling DAG would:
○ Create a new run for each day in the target DAG
○ Clear the target DAG runs for the previous 14 days
Using another DAG to clear the above DAG for the last 14 days:
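The code from this slide is not part of the transcript. The sketch below only covers the date arithmetic such a scheduling DAG would need: given the day being scheduled, it computes the 14-day window to clear. The function names are hypothetical, and how the window is then applied (for example through the target DAG's clear method with a start and end date) is left as an assumption.

```python
from datetime import date, timedelta


def clear_window(run_day, lookback_days=14):
    """First and last day to clear: the lookback_days days before run_day."""
    return run_day - timedelta(days=lookback_days), run_day - timedelta(days=1)


def days_in_window(start, end):
    """Every day from start to end, inclusive, oldest first."""
    return [start + timedelta(days=n) for n in range((end - start).days + 1)]


start, end = clear_window(date(2020, 1, 24))
print(start, end)                       # 2020-01-10 2020-01-23
print(len(days_in_window(start, end)))  # 14
```

With a run on the 24th, the window covers days 10 through 23, matching the day=23 ... day=10 partitions listed earlier.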
Tips and best practices
■ Create only idempotent tasks
■ Note that the worker only creates an OS process for each task
■ Always configure retries on a task; the workers can fail!
■ Use connections to store passwords and secret keys (for encryption)
■ Note that your Python files get executed constantly by the scheduler
■ Use a Docker Compose environment on your dev machine
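To make the first tip concrete, here is a minimal sketch of an idempotent task body. The partition layout and the function name are hypothetical: the task writes its output to a location derived only from the execution hour and overwrites it, so a retry or a re-run leaves the same result instead of appending duplicates.

```python
import json
import tempfile
from pathlib import Path


def write_hourly_output(base_dir, execution_hour, rows):
    """Overwrite the partition for this execution hour; safe to run repeatedly."""
    target = Path(base_dir) / ("hour=" + execution_hour) / "part.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_suffix(".tmp")
    tmp.write_text(json.dumps(rows))
    tmp.replace(target)  # swap the finished file into place
    return target


with tempfile.TemporaryDirectory() as base_dir:
    rows = [{"user": "a", "clicks": 3}]
    first = write_hourly_output(base_dir, "2020-01-01T10", rows)
    second = write_hourly_output(base_dir, "2020-01-01T10", rows)  # simulated retry
    files = sorted(p.name for p in first.parent.iterdir())
    content = json.loads(second.read_text())

print(files)    # ['part.json']
print(content)  # [{'user': 'a', 'clicks': 3}]
```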
Thanks!

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 

Fyber - airflow best practices in production

  • 1. Airflow Production tales Eran Shemesh - Senior Big Data Developer
  • 4. Why? The cron way [Diagram: cron-scheduled flows built from Http, Spark, Update DB and Send emails steps. Step durations shown: 30m-50m, 5m-10m, 1h-1.5h, 10 sec, 1m-3m, 20m-40m, 10 sec. Cron schedules shown: 0 * * * *, 0 * * * *, 15 * * * *, 50 * * * *, 0 * * * *, 5 * * * *, 55 * * * *]
  • 5. Why? The cron way ■ Each valid flow takes more time than it should ■ Each job must be aware of the buffer between its execution time and its working time ■ If a task in the flow is retried, the whole flow can fail ■ What if the time buffer is sometimes not enough? ■ What if one of the systems that runs a cron job was down for a run or more? ■ What if the input data to a flow was incorrect? ■ What if, because of a product requirement change, I need to re-run the past X runs? ■ Visibility
  • 6. Why? The Airflow way ■ Tasks truly depend on each other ■ Easily scalable ■ Web UI ■ Can recover from downtime
  • 7. Why? The Airflow way ■ Each valid flow takes more time than it should ■ Each job must be aware of the buffer between its execution time and its working time ■ If a task in the flow is retried, the whole flow can fail ■ What if the buffer is sometimes not enough? ■ What if one of the systems that runs a cron job was down for a run or more? ■ What if the input data to a flow was incorrect? ■ What if, because of a product requirement change, I need to re-run the past X runs?
  • 8. Hello Airflow ■ An HTTP request to invoke a job on Databricks (SimpleHttpOperator) ■ Extract the Databricks task_id from the response (PythonOperator) ■ Monitor the task's progress by task id (HttpSensor) ■ On success, get the result (SimpleHttpOperator) ■ Extract the result from the HTTP response (PythonOperator) Pipeline: SimpleHttpOperator -> PythonOperator -> HttpSensor -> SimpleHttpOperator -> PythonOperator
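The two PythonOperator steps and the HttpSensor check in the pipeline above reduce to small functions over the HTTP response body. A minimal sketch, assuming the service answers with JSON; the field names `task_id`, `state` and `result` and the state values are hypothetical placeholders, not taken from the deck or from the Databricks API:

```python
import json

# Hypothetical terminal states; the real service's values may differ.
SUCCESS_STATES = {"SUCCESS"}
FAILURE_STATES = {"FAILED", "CANCELED"}


def extract_task_id(response_text):
    """PythonOperator step: pull the job identifier out of the invoke response."""
    return json.loads(response_text)["task_id"]


def is_finished(response_text):
    """Sensor-style check: True when the job succeeded, False to keep polling.

    Raises on a failed job so the sensor task fails instead of waiting forever.
    """
    state = json.loads(response_text)["state"]
    if state in FAILURE_STATES:
        raise RuntimeError(f"job ended in state {state}")
    return state in SUCCESS_STATES


def extract_result(response_text):
    """PythonOperator step: pull the result out of the final response."""
    return json.loads(response_text)["result"]
```

In a DAG, the value returned by the first function would typically travel to the sensor through XCom, and `is_finished` would play the role of the sensor's response check.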
  • 11. Sub-DAGs ■ An operator like any other, which runs a group of tasks on its own ■ Better visualisation ■ Reusable components ■ Encapsulation
  • 13. Retryable Sub-DAGs ■ There is no retry mechanism at the DAG level, only at the task level ■ Out of the box, a sub-DAG does not retry well ■ We used the sub-DAG's on_retry_callback to implement its retry mechanism when needed
  • 15. Sub dags - use with caution! [Diagram. Worker concurrency level: subdag, task, task, subdag, task, task. Other items shown: task, subdag, task, task.]
  • 16. Sub dags - use with caution! [Diagram. Worker concurrency level: subdag, subdag, subdag, subdag, task, task. Other items shown: task, subdag, task, task.]
  • 17. Sub dags - use with caution! [Diagram. Worker concurrency level: subdag, subdag, subdag, subdag, subdag, subdag. Other items shown: task, subdag, task, task.]
  • 18. Sub dags - use with caution! [Diagram. Worker thread pool: subdag, subdag, subdag, subdag, task, task. Other items shown: task, subdag, task, task, task, task, task, task.] Airflow 1.10's default solution: SequentialExecutor (one process to run them all)
  • 19. Sub dags - use with caution! Second option: add more workers! [Diagram. Worker 1 concurrency level: subdag, subdag, subdag, subdag, subdag, subdag. Worker 2 concurrency level: task, subdag, task, task, task, subdag, task. Worker 3 concurrency level: task, task.]
  • 23. Monitoring pipeline: Each task (or group of tasks) is followed by a monitoring task
  • 24. Monitoring pipeline: Each monitoring task is a group of tasks for monitoring and auto-fixing
  • 26. Building modules: DAG extensions ■ A template of tasks and the dependencies between them ■ Using the template method design pattern, the module dictates the general flow, which is implemented by different business-logic subclasses ■ Most commonly used inside a sub-DAG, as in the monitoring example
  • 27. Building modules: Creating a template for a set of tasks
  • 28. Building modules: Further extending this template when needed
  • 29. Building modules: Further extending this template when needed
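The template method idea from the "Building modules" slides can be sketched without Airflow: a base class fixes the order of the tasks and their dependencies, and subclasses supply the business-specific pieces. All class and task names below are hypothetical, and plain strings stand in for operators:

```python
from abc import ABC, abstractmethod


class MonitoredModule(ABC):
    """Template: the base class fixes the flow, subclasses name the tasks."""

    def build(self):
        """Return (tasks, dependencies) as plain names, in template order."""
        tasks = [self.work_task(), self.monitor_task(), self.fix_task()]
        # Chain the tasks: each one runs after the previous one.
        dependencies = list(zip(tasks, tasks[1:]))
        return tasks, dependencies

    @abstractmethod
    def work_task(self):
        """Name of the task that does the business work."""

    @abstractmethod
    def monitor_task(self):
        """Name of the task that validates the work."""

    @abstractmethod
    def fix_task(self):
        """Name of the task that attempts an automatic fix."""


class HourlyAggregation(MonitoredModule):
    """Hypothetical business-logic subclass."""

    def work_task(self):
        return "aggregate_hourly"

    def monitor_task(self):
        return "check_hourly_counts"

    def fix_task(self):
        return "repair_hourly"
```

In a real DAG, `build()` would presumably create operators and wire them together inside a sub-DAG rather than return names; the strings only keep the sketch self-contained.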
  • 31. Use case 1: Skipping daily tasks (hourly and daily flow) ■ Each hourly run calculates an hourly aggregation and then a daily aggregation ■ When fixing data, or when task runs are delayed, it is unnecessary to calculate partial daily aggregations ■ Using the ShortCircuitOperator, we check whether the next execution should already have happened ■ If it should have, we skip all following tasks in the same DAG run
  • 32. Use case 1: Skipping daily tasks (hourly and daily flow)
  • 33. Use case 1: Skipping daily tasks (hourly and daily flow)
  • 34. Use case 1: Skipping daily tasks (hourly and daily flow)
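The check behind the ShortCircuitOperator in use case 1 can be written as a plain function that returns a false value when the downstream tasks should be skipped. This is one possible reading of the slide, not the authors' code; the function name and the way "now" is passed in are assumptions made for testability:

```python
from datetime import datetime, timedelta


def should_run_daily(execution_date, now, interval=timedelta(hours=1)):
    """Return True to continue to the daily tasks, False to skip them.

    Assumes Airflow's convention that the run for `execution_date` starts once
    its interval has ended, so the next run is due at
    execution_date + 2 * interval. If that moment has passed, this run is
    late (or a re-run) and a later run will compute the daily aggregation.
    """
    next_run_due = execution_date + 2 * interval
    return now < next_run_due
```

Used as the operator's callable, a `False` result would short-circuit the rest of the DAG run, which matches the "skip all following tasks" behaviour described above.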
  • 35. Use case 2: Programmatically clearing a DAG: S3/{bucket_name}/day=23 S3/{bucket_name}/day=22 S3/{bucket_name}/day=21 S3/{bucket_name}/day=10
  • 36. Use case 2: Programmatically clearing a DAG ■ Create a DAG that executes a single day's flow ■ This DAG is scheduled by another DAG, not by Airflow's scheduler ■ The scheduling DAG: ○ Creates a new run for each day in the target DAG ○ Clears the target DAG's runs for the previous 14 days
  • 37. Use case 2: Programmatically clearing a DAG: Using another DAG to clear the above DAG for the last 14 days:
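The date arithmetic of the scheduling DAG in use case 2 can be sketched on its own. The function name is hypothetical, and the actual calls that create and clear DAG runs are left out because the deck does not show them:

```python
from datetime import date, timedelta


def plan_scheduling_run(today, lookback_days=14):
    """Return (day_to_create, days_to_clear) for one run of the scheduling DAG.

    `day_to_create` gets a new run in the target DAG; each day in
    `days_to_clear` (most recent first) has its existing run cleared so that
    it is recomputed.
    """
    days_to_clear = [today - timedelta(days=n) for n in range(1, lookback_days + 1)]
    return today, days_to_clear
```

For example, `plan_scheduling_run(date(2020, 3, 1))` plans a new run for 1 March and clears the 14 days from 29 February back to 16 February.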
  • 39. Tips and best practices ■ Create only idempotent tasks ■ Note that the worker only creates an OS process for each task ■ Always configure retries on a task; the workers can fail! ■ Use connections to store passwords and secret keys (for encryption) ■ Note that your Python files get executed constantly by the scheduler ■ Use a Docker Compose environment on your dev machine
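To illustrate the first tip, here is a minimal sketch of an idempotent task body. It assumes the task writes one file per day partition; the function name, the `day=` directory layout on a local disk and the JSON format are illustrative choices, not the deck's implementation:

```python
import json
import os
import tempfile


def write_partition(base_dir, day, rows):
    """Idempotent write: the output path depends only on `day`, and the file
    is replaced atomically, so a retry or re-run leaves exactly one copy."""
    part_dir = os.path.join(base_dir, f"day={day}")
    os.makedirs(part_dir, exist_ok=True)
    target = os.path.join(part_dir, "data.json")
    # Write to a temporary file first so a crash never leaves a partial output.
    fd, tmp_path = tempfile.mkstemp(dir=part_dir)
    with os.fdopen(fd, "w") as tmp:
        json.dump(rows, tmp)
    os.replace(tmp_path, target)  # atomic rename over any previous output
    return target
```

Running the function twice with the same arguments leaves the same single file in place, which is what makes retries and programmatic clearing safe.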