Eran Shemesh @ Fyber:
Fyber uses Airflow to manage all of its big data pipelines, including monitoring and auto-fix. The session describes best practices that we implemented in production.
5. Why?
The cron way
■ Each valid flow takes more time than it should
■ Each job should be aware of the buffer between its execution time and its working time
■ If a single task in the flow needs a retry, the whole flow can fail
■ What if the time buffer is sometimes not enough?
■ What if one of the systems that runs a cron job was down for a run or more?
■ What if the input data to a flow was incorrect?
■ What if, because of a product requirement change, I need to re-run the past X runs?
■ Visibility
6. Why?
The airflow way
■ Tasks are really dependent on each other
■ Easily Scalable
■ Web UI
■ Can recover from downtime
7. Why?
The airflow way
■ Each valid flow takes more time than it should
■ Each job should be aware of the buffer between its execution time and its working time
■ If a single task in the flow needs a retry, the whole flow can fail
■ What if the time buffer is sometimes not enough?
■ What if one of the systems that runs a cron job was down for a run or more?
■ What if the input data to a flow was incorrect?
■ What if, because of a product requirement change, I need to re-run the past X runs?
8. Hello Airflow
■ An HTTP request to invoke a job on Databricks (SimpleHttpOperator)
■ Extract the Databricks task_id from the response (PythonOperator)
■ Monitor task progress by task id (HttpSensor)
■ In case of success, get the result (SimpleHttpOperator)
■ Extract the result from the HTTP response (PythonOperator)
SimpleHttpOperator -> PythonOperator -> HttpSensor -> SimpleHttpOperator -> PythonOperator
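The PythonOperator steps and the HttpSensor check in the flow above come down to small functions over the HTTP response body. A minimal sketch, assuming the responses are JSON. The function names are hypothetical, and the field names `run_id`, `state`, `life_cycle_state` and `result_state` are assumptions modeled on the Databricks Jobs REST API; check them against the API version you call.

```python
import json

# Hypothetical helpers; the JSON field names are assumptions, not taken from the talk.


def extract_run_id(response_text):
    """Pull the run identifier out of the 'invoke job' response body."""
    return json.loads(response_text)["run_id"]


def run_finished(response_text):
    """Sensor-style check: True once the run has ended successfully.

    Returns False while the run is still in progress, and raises on any other
    terminal state so the caller fails fast instead of polling forever.
    """
    state = json.loads(response_text)["state"]
    if state["life_cycle_state"] in ("PENDING", "RUNNING", "TERMINATING"):
        return False
    if state.get("result_state") == "SUCCESS":
        return True
    raise RuntimeError("run ended without success: %s" % state)
```

In a DAG, `extract_run_id` would be the callable of the PythonOperator, and `run_finished` would sit behind the HttpSensor's response check, fed with the text of the response it receives.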
13. Retryable Sub Dags
■ There is no retry mechanism at the DAG level, only at the task level
■ Out of the box, a sub DAG does not retry well
■ We utilized the sub DAG's on_retry_callback for its retry mechanism when needed
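The talk does not show the callback itself. One plausible shape, assuming the goal is to reset the sub DAG's task instances for the current run before the operator is retried: the function name is hypothetical, and the sketch assumes `context["task"]` is the sub DAG operator exposing a `subdag` attribute whose `clear` method accepts `start_date` and `end_date`.

```python
def clear_subdag_on_retry(context):
    """on_retry_callback sketch: reset the sub DAG's task instances for this run.

    Assumptions (not from the talk): context["task"] is the sub DAG operator,
    it exposes the wrapped DAG as `.subdag`, and that DAG offers
    clear(start_date=..., end_date=...).
    """
    execution_date = context["execution_date"]
    subdag = context["task"].subdag
    # Clear only the run being retried, not the sub DAG's whole history.
    subdag.clear(start_date=execution_date, end_date=execution_date)
```

The function would be passed as `on_retry_callback` when the sub DAG operator is created, together with a non-zero `retries` value.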
15. Sub dags - use with caution!
[Diagrams, slides 15-17: one worker with a fixed concurrency level. Its slots hold a mix of tasks and subdags while more tasks and subdags wait; slide by slide, subdags take over more of the slots, until they hold all of them.]
[Diagram, slide 18: worker slots, a thread pool, and the tasks that belong to the subdags.]
Airflow 1.10's default solution:
SequentialExecutor (one process to run them all)
[Diagram, slide 19: Worker 1, Worker 2 and Worker 3, each with its own concurrency level.]
Second option -
Add more workers!
26. Building modules
■ A template of tasks and dependencies between them
■ Using the template method design pattern, the module dictates the general flow, to be implemented by different business logic subclasses
■ Most commonly used inside a sub dag, like in the monitoring example
DAG extensions
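The template method idea can be sketched in plain Python, without Airflow. The class names and the three step names below are hypothetical; the point is only that the base class fixes the order and dependencies of the steps, while subclasses fill in the business logic.

```python
from abc import ABC, abstractmethod


class JobModule(ABC):
    """Template: fixes the order of steps; subclasses supply the business logic."""

    def build(self):
        # The general flow is decided here, once, for every subclass.
        steps = [
            ("submit", self.submit),
            ("monitor", self.monitor),
            ("collect", self.collect),
        ]
        # Each step depends on the one before it.
        dependencies = [(a[0], b[0]) for a, b in zip(steps, steps[1:])]
        return steps, dependencies

    @abstractmethod
    def submit(self):
        ...

    @abstractmethod
    def monitor(self):
        ...

    @abstractmethod
    def collect(self):
        ...


class DailyReportModule(JobModule):
    """One hypothetical business-logic subclass."""

    def submit(self):
        return "submit daily report job"

    def monitor(self):
        return "poll daily report job"

    def collect(self):
        return "fetch daily report result"
```

In a real module, `build` would create one operator per step inside a sub DAG and wire them up in the same order.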
31. Use case 1: Skipping daily tasks
■ Each hourly run calculates an hourly aggregation and then a daily aggregation
■ When fixing data or when the task runs are delayed, it's unnecessary to calculate partial daily aggregations
■ Using the ShortCircuitOperator, we check whether the next execution should have happened already
■ If it has, we skip all following tasks in the same DAG run
Hourly and daily flow
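The check handed to the ShortCircuitOperator can be a small pure function. A sketch under stated assumptions: the function name is hypothetical, the schedule is hourly, and a run for a given `execution_date` starts once that hour has ended, so the next run is due two intervals after `execution_date`.

```python
from datetime import datetime, timedelta


def should_run_daily_tasks(execution_date, now, interval=timedelta(hours=1)):
    """Return True only if the next run is not yet due.

    A run covering `execution_date` starts when its interval ends, so the
    following run is due at execution_date + 2 * interval. If that moment has
    passed, this run is a delayed or re-run one and the daily part is skipped.
    """
    next_run_due = execution_date + 2 * interval
    return now < next_run_due
```

Returning False from the operator's callable short-circuits the DAG run, so every downstream task is skipped.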
35. Use case 2: Programmatically clearing DAG
S3/{bucket_name}/day=23
S3/{bucket_name}/day=22
S3/{bucket_name}/day=21
S3/{bucket_name}/day=10
36. Use case 2: Programmatically clearing DAG
■ Create a DAG that executes a single day's flow
■ That DAG is scheduled by another DAG (and not by Airflow's scheduler)
■ The scheduling DAG would:
○ Create a new run for each day in the target DAG
○ Clear the target DAG runs for the previous 14 days
37. Use case 2: Programmatically clearing DAG
Using another DAG to clear the above DAG for the last 14 days:
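The code from this slide is not in the transcript. As a stand-in, here is a sketch of the date arithmetic the scheduling DAG needs. The function name is hypothetical, and treating the window as 14 days ending on the current day (inclusive) is an assumption, chosen because it matches the day=23 down to day=10 partition listing.

```python
from datetime import date, timedelta


def days_to_clear(run_day, window=14):
    """Days whose target-DAG runs get cleared, newest first, ending at run_day."""
    return [run_day - timedelta(days=offset) for offset in range(window)]
```

Each returned day would then be handed to whatever clearing mechanism the scheduling DAG uses; the talk does not say which one.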
39. Tips and best practices
■ Create only idempotent tasks
■ Note that the worker only creates an OS process for each task
■ Always set retries on a task; the workers can fail!
■ Use connections to store passwords and secret keys (for encryption)
■ Note that your Python files get executed constantly by the scheduler
■ Use a docker compose environment on your dev machine
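To illustrate the first tip, here is a minimal sketch of an idempotent write, assuming a local directory stands in for the real storage. The function name and file layout are hypothetical. The output path depends only on the day being processed, and each run replaces the file rather than appending to it, so a retry or a re-run leaves the same result.

```python
import json
import os
import tempfile


def write_partition(base_dir, day, rows):
    """Write one day's output to a path derived only from `day`, replacing any earlier output."""
    partition = os.path.join(base_dir, "day=%s" % day)
    os.makedirs(partition, exist_ok=True)
    target = os.path.join(partition, "part-0.json")
    # Write to a temp file in the same directory, then swap it in atomically,
    # so a task killed halfway never leaves a partial file at the target path.
    fd, tmp_path = tempfile.mkstemp(dir=partition)
    with os.fdopen(fd, "w") as handle:
        json.dump(rows, handle)
    os.replace(tmp_path, target)
    return target
```

Running the function twice for the same day leaves exactly one file with the same contents, which is what makes retries and programmatic clearing safe.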