April 10th 2019
Tao Feng | @feng-tao | Software Engineer, Lyft
Airflow @ Lyft
Who
● Tao Feng
● Engineer at Lyft Data Platform
● Apache Airflow PMC and Committer
● Working on different data products (Airflow, Amundsen, etc.)
Agenda
• Airflow in general
• Airflow @ Lyft
• Upstream @ Lyft
• Next Step
• Summary
Airflow in general
• Airflow recently became an Apache top-level project (TLP).
‒ 20 PMC members / committers in total
• Most recent releases: 1.10, 1.10.1, 1.10.2 (1.10.3 is coming).
‒ New features: Airflow RBAC, Airflow K8S integration, etc.
• New process in Airflow for proposing architecture changes.
‒ Airflow Improvement Proposals (currently 19+ proposals)
• The community recently conducted an Airflow user survey (link).
• 11k+ GitHub stars
• 740+ contributors
• 250+ companies using Airflow
Airflow @ Lyft
Core Infra high level architecture @ Lyft
Airflow Architecture @ Lyft
• WebUI: the portal where users view the status of their DAGs.
• Metadata DB: the Airflow metastore, which stores the status of jobs.
• Scheduler: a multi-process service that parses the DAG bag, creates DAG objects, and triggers the executor to run tasks whose dependencies are met.
• Executor: a message-queuing process that orchestrates worker processes to execute tasks. We use CeleryExecutor at Lyft.
• TARS: the Airflow development / backfill environment, which provides access to production data.
• Main Cluster Config: Apache Airflow 1.8.2 with cherry-picks and numerous in-house Lyft patches.
• Scale: Three sets of ASGs for workers.
‒ ASG #1: 15 worker nodes, each of the r5.4xlarge type (16 vCPUs, 128 GB memory). This fleet of workers processes low-priority, memory-intensive tasks.
‒ ASG #2: 3 worker nodes, each of the m4.4xlarge type (16 vCPUs, 64 GB memory). This fleet of workers is dedicated to DAGs with a strict SLA.
‒ ASG #3: 1 worker node of the m4.10xlarge type (40 vCPUs, 160 GB memory). This single node processes the compute-intensive workloads from a critical team’s DAGs.
‒ Backfill box (TARS): 1 node of the m4.16xlarge type (64 vCPUs, 256 GB memory). This box is used for fast DAG prototyping and backfill.
Airflow daily stats @ Lyft
• 600+ DAGs
• 800+ DagRuns
• 25k+ TIs (task instances)
Airflow Monitoring @ Lyft
Airflow Availability
• Scheduler and worker health check
‒ Use a canary monitoring DAG.
‒ If no task has been scheduled for 10 minutes, it is considered downtime.
• UI health check
‒ Leverage Envoy membership health check.
• Total system uptime percentage
‒ Airflow is considered down if any of the scheduler, the workers, or the web server is down.
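The 10-minute rule above can be sketched in plain Python. This is a minimal illustration, not Lyft's monitoring code: the function names are hypothetical, and it assumes that only the part of a gap beyond the 10-minute threshold is counted as downtime.

```python
from datetime import datetime, timedelta

# From the slide: no task scheduled for 10 minutes counts as downtime.
DOWNTIME_THRESHOLD = timedelta(minutes=10)


def downtime_seconds(canary_starts, window_start, window_end,
                     threshold=DOWNTIME_THRESHOLD):
    """Sum the downtime inside [window_start, window_end].

    Assumption: a gap between consecutive canary task starts that is
    longer than `threshold` contributes the part beyond the threshold.
    """
    inside = sorted(t for t in canary_starts if window_start <= t <= window_end)
    points = [window_start] + inside + [window_end]
    total = timedelta(0)
    for earlier, later in zip(points, points[1:]):
        gap = later - earlier
        if gap > threshold:
            total += gap - threshold
    return total.total_seconds()


def uptime_pct(canary_starts, window_start, window_end):
    """Uptime percentage over the window, derived from canary gaps."""
    window = (window_end - window_start).total_seconds()
    down = downtime_seconds(canary_starts, window_start, window_end)
    return 100.0 * (window - down) / window
```

For a one-hour window with canary starts at minutes 5, 10, 40, 45, 50 and 55, the only long gap is the 30 minutes between 10 and 40, so 20 minutes count as downtime.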
Schedule Delay
• Schedule delay = TI.start_time - TI.execution_date
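A small sketch of the formula, using plain datetime values rather than Airflow objects. The second helper is an addition for illustration: Airflow starts a run only after its schedule interval has elapsed, so the raw difference includes one full interval, and subtracting it isolates the extra wait.

```python
from datetime import datetime, timedelta


def schedule_delay(start_time, execution_date):
    """The slide's metric: task start time minus execution date."""
    return start_time - execution_date


def delay_past_interval(start_time, execution_date, schedule_interval):
    """Hypothetical variant: delay measured from the end of the interval."""
    return start_time - (execution_date + schedule_interval)
```

For a daily DAG whose 2019-04-09 run starts at 00:03 on 2019-04-10, the first function returns one day and three minutes, and the second returns three minutes.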
DAG last run time
• The time that has elapsed since the DAG file was last processed.
• If the time becomes too long, it means the scheduler has issues processing the DAG files.
‒ E.g., this could be due to parser threads being occupied by malicious DAG files.
Executor parallelism
• Parallelism: controls the number of concurrently running tasks.
‒ Please monitor your worker nodes’ CPU utilization before increasing the value.
Airflow monitoring @ Lyft
Stats Name                                        Meaning
dag_processing.last_run.seconds_ago.<dag_file>    Seconds since <dag_file> was last processed
executor.open_slots                               Number of open slots on executor (parallelism - # running tasks)
executor.queued_tasks                             Number of queued tasks on executor
executor.running_tasks                            Number of running tasks on executor
pool.starving_tasks.<pool_name>                   Number of starving tasks in the pool. Check how many tasks are starving due to pool count limitation.
...
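One way to act on these stats is a simple threshold check over a snapshot of metric values. The sketch below is illustrative only: the metric names come from the table, while the function name, the input dictionary layout and the 600-second threshold are assumptions.

```python
DAG_PARSE_PREFIX = "dag_processing.last_run.seconds_ago."
POOL_STARVING_PREFIX = "pool.starving_tasks."


def find_alerts(stats, max_parse_age_seconds=600):
    """Return alert messages for a snapshot of metric values.

    `stats` maps a metric name to its latest numeric value.
    The threshold is an illustrative assumption, not Lyft's setting.
    """
    alerts = []
    for name, value in sorted(stats.items()):
        if name.startswith(DAG_PARSE_PREFIX) and value > max_parse_age_seconds:
            dag_file = name[len(DAG_PARSE_PREFIX):]
            alerts.append(f"{dag_file} not processed for {value:.0f}s")
        elif name.startswith(POOL_STARVING_PREFIX) and value > 0:
            pool = name[len(POOL_STARVING_PREFIX):]
            alerts.append(f"pool {pool} has {value:.0f} starving tasks")
    # No open slots while tasks wait in the queue suggests a saturated executor.
    if (stats.get("executor.open_slots", 1) == 0
            and stats.get("executor.queued_tasks", 0) > 0):
        alerts.append("executor saturated: no open slots while tasks are queued")
    return alerts
```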
Airflow Customization @ Lyft
• UI auditing
• Extra link for task instance UI panel (AIRFLOW-161)
• DAG dependency graph
Improve Airflow Reliability @ Lyft
Improving Airflow Performance @ Lyft
• Reduce Airflow UI page load time
‒ Change default_dag_run_display_number to 5.
• Tunables that impact task execution parallelism
‒ Parallelism
‒ Concurrency
‒ max_active_runs
‒ Pool
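How these tunables interact can be approximated with a simplified model: a ready task can start only if the installation-wide parallelism, the DAG's concurrency and the task's pool all have headroom. The function below is a hypothetical sketch of that idea, not the scheduler's actual logic, and it leaves out max_active_runs, which limits DagRuns rather than individual tasks.

```python
def startable_tasks(ready, parallelism, running_total,
                    dag_concurrency, running_in_dag,
                    pool_slots, running_in_pool):
    """Upper bound on how many ready tasks of one DAG and pool can start now.

    Simplified model: the tightest of the three limits wins.
    """
    headroom = min(
        parallelism - running_total,       # installation-wide limit
        dag_concurrency - running_in_dag,  # per-DAG limit
        pool_slots - running_in_pool,      # per-pool limit
    )
    return max(0, min(ready, headroom))
```

With 10 ready tasks, 30 of 32 global slots in use, 4 of 16 DAG slots in use and 1 of 5 pool slots in use, only 2 tasks can start, because global parallelism is the tightest limit.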
Improving Airflow Reliability at Lyft
• Source control for pools
‒ All Airflow pools are defined in a source-controlled file on GitHub.
‒ Airflow pools are configured from that file at runtime.
• Integration tests for DAGs to enforce best practices and improve reliability
‒ All DAGs should be loadable within a time threshold.
‒ All DAGs should have valid pools associated.
‒ External task sensors should be valid (the referenced (dag_id, task_id) exists).
‒ Each pool is used by at least one DAG.
‒ Each sensor has a reasonable timeout.
‒ Each DAG has a non-dynamic start date.
• Secure UI access
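Several of the integration checks above can be expressed as one validation pass. The sketch below works on a hypothetical plain-dictionary description of DAGs rather than on Airflow objects, and the 6-hour sensor timeout limit is an assumed value. It covers the pool, external task and sensor timeout rules only.

```python
def validate_dags(dags, pools, max_sensor_timeout=6 * 60 * 60):
    """Return a sorted list of rule violations.

    Hypothetical layout: `dags` maps dag_id -> {"tasks": {task_id: {...}}}.
    A task dict may hold "pool", "external" (a (dag_id, task_id) pair for
    external task sensors) and "sensor_timeout" (seconds).
    """
    problems = []
    used_pools = set()
    for dag_id, dag in dags.items():
        for task_id, task in dag["tasks"].items():
            where = f"{dag_id}.{task_id}"
            pool = task.get("pool")
            if pool is not None:
                used_pools.add(pool)
                if pool not in pools:
                    problems.append(f"{where}: unknown pool {pool!r}")
            external = task.get("external")
            if external is not None:
                ext_dag, ext_task = external
                if ext_task not in dags.get(ext_dag, {}).get("tasks", {}):
                    problems.append(
                        f"{where}: missing external task {ext_dag}.{ext_task}")
            timeout = task.get("sensor_timeout")
            if timeout is not None and timeout > max_sensor_timeout:
                problems.append(f"{where}: sensor timeout {timeout}s too long")
    # Every defined pool should be referenced by at least one task.
    for pool in sorted(set(pools) - used_pools):
        problems.append(f"pool {pool!r} is not used by any DAG")
    return sorted(problems)
```

Running such a check in CI turns each rule into a failing test before a DAG reaches production.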
Production Debug @ Lyft
• We document every production issue investigation in a doc.
• A couple of methodologies:
‒ View the centralized Airflow dashboard.
‒ Identify whether it is a UI issue or an Airflow scheduler (backend) issue.
‒ View the webserver log or scheduler log.
∙ If the log is not available on the machine, check the log in Kibana.
∙ To further identify issues, we sometimes even look at logs in S3.
‒ Use different tools for further investigation.
∙ If an exception is thrown, understand which part of the Airflow code throws it.
∙ If a CPU / memory alarm fires, use top to identify which DAG causes the issue.
∙ If the failure is related to Celery, log in to the Celery Flower UI to investigate further.
∙ ...
Airflow Gotchas @ Lyft
• DST (daylight saving time)
‒ The UI doesn’t have timezone support, even upstream.
‒ The scheduler in our internal version has no timezone support.
• DAGs with a dynamic start date.
‒ Hard to predict when the DAG is scheduled.
• Long-running external task sensors that don’t have valid external tasks.
• HivePartitionSensor doesn’t work for partial partitions.
‒ It only checks whether data exists, not whether the data is fully loaded.
• Backfill experience
‒ We use the local executor to backfill.
• Long-running sensors occupy task slots in the pool.
• Users confuse DAG-level arguments with task-level arguments.
‒ E.g., putting max_active_runs in the default task arguments.
• Legacy high-level abstraction framework over Airflow
‒ Hard to debug for the user and for us.
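The DAG-level versus task-level confusion lends itself to a small lint check. The sketch below flags DAG constructor arguments that were placed in the default task arguments, where they are only handed to operators and do not configure the DAG. The function name and the list of arguments are illustrative assumptions, not an exhaustive set.

```python
# DAG-level arguments that do not configure the DAG when placed in
# default_args, since default_args are only passed on to operators.
# Illustrative list, not exhaustive.
DAG_LEVEL_ARGS = {"max_active_runs", "concurrency", "schedule_interval", "catchup"}


def misplaced_dag_args(default_args):
    """Return DAG-level argument names found in a default_args dict."""
    return sorted(DAG_LEVEL_ARGS & set(default_args))
```

A check like this could run alongside the other DAG integration tests and fail when the returned list is not empty.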
Upstream @ Lyft
Improve backfill experience
• New options for backfill
‒ --reset_dagruns: if used, Airflow first checks whether any existing dag_runs / task_instances are associated with the backfill date range. If so, it asks the user whether to clear those task_instances first. (AIRFLOW-2718)
‒ --rerun_failed_tasks: if used, Airflow automatically reruns failed tasks without requiring any user intervention. (AIRFLOW-2566)
• Backfill respects pools for isolation (AIRFLOW-1557)
Improve backfill experience
Support batch backfill
• Use {{ prev_ds }} and {{ ds }} in SQL
‒ prev_ds equals ds - schedule_interval.
‒ Users can change the schedule_interval in the DAG file during backfill.
• Users can override DAG params with the -c option during backfill.
INSERT OVERWRITE TABLE {{ dest_db(default.superhero_data) }}
SELECT supe.superhero_name AS superhero_name,
       pop.popularity AS popularity
FROM {{ source_table(events.superheroes) }} supe
-- Read the window [prev_ds, ds): from the previous run date up to, but excluding, this one.
WHERE ds >= '{{ prev_ds }}' AND ds < '{{ ds }}'

airflow backfill superheroes -s 2018-05-01 -e 2018-05-08 -c '{"hive_cluster": "backfill_cluster"}'
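To see why changing the schedule_interval enables batch backfill, the sketch below lists the (prev_ds, ds) pair each run would render, using the rule that prev_ds equals ds - schedule_interval. It is plain Python with a hypothetical helper name, not Airflow's scheduling code.

```python
from datetime import date, timedelta


def backfill_windows(start, end, schedule_interval):
    """Yield (prev_ds, ds) date strings for each run date in [start, end]."""
    if schedule_interval <= timedelta(0):
        raise ValueError("schedule_interval must be positive")
    ds = start
    while ds <= end:
        prev_ds = ds - schedule_interval
        yield prev_ds.isoformat(), ds.isoformat()
        ds += schedule_interval
```

With a daily interval, every day in the range gets its own run. With a 7-day interval, a single run dated 2018-05-08 covers 2018-05-01 up to, but excluding, 2018-05-08 in one query.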
Airflow DAG level access @ Lyft
• DAG access control has always been a real need at Lyft.
‒ HR data, financial data, etc.
‒ The workaround is to build an isolated, dedicated cluster for each use case.
• Airflow introduced the RBAC feature in 1.10.
‒ The new Airflow webserver is based on Flask-AppBuilder.
‒ Ships with 5 static roles (Admin, User, Op, Viewer, Public).
‒ ...
• Airflow DAG level access (AIRFLOW-2267)
‒ Provides more granular access control at the DAG level.
• The new Airflow UI migrates from Flask-Admin to Flask-AppBuilder (FAB).
• FAB’s security model.
• Which Airflow version includes the change?
‒ 1.10.2 includes the initial implementation.
‒ 1.10.3 (upcoming) includes the enhancements.
• How it works
‒ Two new perms: can_dag_read (read), can_dag_edit (write).
‒ A DAG-level role can be created through the CLI / UI by an Admin (doc).
‒ A user with a DAG-level role can only see the DAGs that role is allowed to view.
‒ Users can declare permissions in the DAG file (AIRFLOW-2694).
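The effect of the two permissions can be modeled with a per-DAG mapping from role name to granted permissions. This is a simplified sketch, not Airflow's or FAB's implementation: the dictionary layout, the role names and the function names are hypothetical, and only the permission names come from the slide.

```python
def viewable_dags(access_control_by_dag, role):
    """Return the DAG ids on which the role holds can_dag_read."""
    return sorted(
        dag_id
        for dag_id, access_control in access_control_by_dag.items()
        if "can_dag_read" in access_control.get(role, set())
    )


def can_edit(access_control_by_dag, role, dag_id):
    """Return True if the role holds can_dag_edit on the given DAG."""
    granted = access_control_by_dag.get(dag_id, {}).get(role, set())
    return "can_dag_edit" in granted
```

A DAG that declares no roles stays invisible to every DAG-level role in this model, which matches the intent of isolating sensitive use cases.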
• We built a new cluster based on the Airflow master branch and onboarded a couple of new sensitive-data use cases.
‒ Each use case has its own repo.
‒ User-role relationships are source controlled in a YAML file.
• DAG owners specify the access control info in the DAG files.
• Gotchas
‒ New user onboarding
‒ Integration between FAB and Google authentication (OAuth)
‒ Integration with internal ACL service
‒ ...
User registration flow
Next Step
• Support the Airflow DAG level access feature in beta internally.
• Integrate the Airflow RBAC / DAG level feature with the internal ACL service (FAB issue).
• Migrate all the existing DAGs to this new cluster.
• Explore running Airflow with the k8s executor internally.
Summary
• The Airflow community has been growing a lot!
• We shared our experience operating Airflow at Lyft.
• We shared some of our upstream work:
‒ Improve Airflow backfill experience
‒ Support Airflow DAG level access
Acknowledgement
• Members who maintain Airflow at Lyft
‒ Alagappan Sethuraman
‒ Andrew Stahlman
‒ Chao-han Tsai
‒ Jinhyuk Chang
‒ Junda Yang
‒ Max Payton
‒ Tao Feng
• Special thanks to Maxime Beauchemin, who provided numerous suggestions for us.
Tao Feng | @feng-tao
Slides at TBD
Blog at go.lyft.com/airflowblog
Icons under Creative Commons License from https://thenounproject.com/

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 

Airflow at Lyft

  • 1. April 10th 2019 Tao Feng | @feng-tao | Software Engineer, Lyft Airflow @ Lyft
  • 2. Who ● Tao Feng ● Engineer at Lyft Data Platform ● Apache Airflow PMC and Committer ● Working on different data products (Airflow, Amundsen, etc.)
  • 3. Agenda • Airflow in general • Airflow @ Lyft • Upstream @ Lyft • Next Step • Summary
  • 5. Airflow in general • Airflow recently became an Apache top-level project (TLP). ‒ 20 PMC members / committers in total • Most recent releases: 1.10, 1.10.1, 1.10.2 (1.10.3 is coming). ‒ New features: Airflow RBAC, Airflow K8S integration, etc. • New process in Airflow for proposing architecture changes. ‒ Airflow Improvement Proposals (currently 19+ proposals) • The community recently conducted an Airflow user survey (link). • Community stats: 11k+ GitHub stars, 740+ contributors, 250+ companies using Airflow.
  • 7. Core Infra high-level architecture @ Lyft
  • 9. Airflow Architecture @ Lyft • WebUI: the portal where users view the status of their DAGs. • Metadata DB: the Airflow metastore, which stores the status of jobs. • Scheduler: a multi-process component that parses the DAG bag, creates DAG objects, and triggers the executor to run tasks whose dependencies are met. • Executor: a message-queuing process that orchestrates worker processes to execute tasks. We use CeleryExecutor at Lyft. • TARS: the Airflow development / backfill environment, which provides access to production data.
  • 10. Airflow Architecture @ Lyft • Main cluster config: Apache Airflow 1.8.2 with cherry-picks and numerous in-house Lyft patches. • Scale: three sets of ASGs for workers. ‒ ASG #1: 15 worker nodes, each of the r5.4xlarge (16 vCPU, 128 GB mem) type. This fleet of workers processes low-priority, memory-intensive tasks. ‒ ASG #2: 3 worker nodes, each of the m4.4xlarge (16 vCPU, 64 GB mem) type. This fleet of workers is dedicated to DAGs with a strict SLA. ‒ ASG #3: 1 worker node of the m4.10xlarge (40 vCPU, 160 GB mem) type. This single node processes the compute-intensive workloads from a critical team's DAGs. ‒ Backfill box (TARS): 1 node of the m4.16xlarge (64 vCPU, 256 GB mem) type. This box is used for fast DAG prototyping and backfill.
  • 11. Airflow daily stats @ Lyft: 600+ DAGs, 800+ DagRuns, 25k+ TIs (task instances)
  • 13. Airflow Availability • Scheduler and worker health check ‒ Use a canary monitoring DAG. ‒ If no task has been scheduled for 10 minutes, that is considered downtime. • UI health check ‒ Leverage the Envoy membership health check. • Total system uptime percentage ‒ Airflow is down if any of the scheduler, the workers, or the web server is down.
  • 14. Scheduler Delay • scheduler delay = TI.start_time - TI.execution_date
  • 15. DAG last run time • The time that has elapsed since the DAG file was last processed. • If this time becomes too long, the scheduler is having trouble processing the DAG files. ‒ E.g. it could be due to parser threads being occupied by malicious DAG files.
  • 16. Executor parallelism • Parallelism: controls the number of concurrently running tasks. ‒ Monitor your worker nodes' CPU utilization before increasing the value.
  • 17. Airflow monitoring @ Lyft (stats name: meaning) ‒ dag_processing.last_run.seconds_ago.<dag_file>: seconds since <dag_file> was last processed. ‒ executor.open_slots: number of open slots on the executor (parallelism - number of running tasks). ‒ executor.queued_tasks: number of queued tasks on the executor. ‒ executor.running_tasks: number of running tasks on the executor. ‒ pool.starving_tasks.<pool_name>: number of starving tasks in the pool, i.e. how many tasks are starving because of the pool's slot limit. ‒ ...
  • 19. Airflow customization @ Lyft • UI auditing • Extra link for task instance UI panel (AIRFLOW-161)
  • 20. Airflow customization @ Lyft • DAG dependency graph
  • 22. Improving Airflow Performance @ Lyft • Reduce Airflow UI page load time ‒ Change default_dag_run_display_number to 5. • Tunables that impact task execution parallelism ‒ parallelism ‒ concurrency ‒ max_active_runs ‒ pool
  • 23. Improving Airflow Reliability at Lyft • Source control for pools ‒ All Airflow pools are defined in a source-controlled file on GitHub. ‒ Airflow pools are configured from that file at runtime. • Integration tests for DAGs to enforce best practices and improve reliability ‒ All DAGs should be loadable within a time threshold. ‒ All DAGs should have valid pools associated. ‒ External task sensors should be valid (the (dag_id, task_id) exists). ‒ Each pool is used by at least one DAG. ‒ Each sensor has a reasonable timeout. ‒ Each DAG has a non-dynamic start date. • Secure UI access
  • 25. Production Debug @ Lyft • We document every production issue investigation in a doc. • A few methodologies: ‒ View the centralized Airflow dashboard. ‒ Identify whether it is a UI issue or an Airflow scheduler (backend) issue. ‒ View the webserver log or the scheduler log. ∙ If the log is not available on the machine, check the log in Kibana. ∙ To further identify issues, we sometimes even look at the logs in S3. ‒ Use different tools for further investigation. ∙ If an exception is thrown, understand which part of the Airflow code throws it. ∙ If there is a CPU / memory alarm, use top to identify which DAG causes the issue. ∙ If the failure is related to Celery, log in to the Celery Flower UI to investigate further. ∙ ...
  • 26. Airflow Gotchas @ Lyft
  • 27. Airflow Gotchas at Lyft • DST ‒ The UI doesn't have timezone support, even upstream. ‒ The internal version of the scheduler has no timezone support. • DAGs with a dynamic start date ‒ Hard to predict when the DAG will be scheduled. • Long-running external task sensors that don't have valid external tasks. • HivePartitionSensor doesn't work for partial partitions ‒ It only checks whether data exists, not whether the data is fully loaded. • Backfill experience ‒ We use the local executor to backfill. • A long-running sensor occupies a task slot of its pool. • Users confuse DAG-level arguments with task-level arguments ‒ E.g. putting max_active_runs in the default task arguments. • Legacy high-abstraction framework over Airflow ‒ Hard to debug for both the user and us.
  • 30. Improve backfill experience • New options for backfill ‒ --reset_dagruns: if used, Airflow first checks whether any existing dag_runs / task_instances are associated with the backfill date range. If so, it asks the user whether to clear those task_instances first. (AIRFLOW-2718) ‒ --rerun_failed_tasks: if used, Airflow automatically reruns failed tasks without requiring any user intervention. (AIRFLOW-2566) • Backfill respects pools for isolation (AIRFLOW-1557)
  • 31. Improve backfill experience: support batch backfill • Use {{ prev_ds }} and {{ ds }} in SQL ‒ prev_ds equals ds - schedule_interval. ‒ Users can change the schedule_interval in the DAG file during backfill. • Users can override DAG params with the -c option during backfill. Example query: INSERT OVERWRITE TABLE {{ dest_db(default.superhero_data) }} SELECT supe.superhero_name AS superhero_name, pop.popularity AS popularity FROM {{ source_table(events.superheroes) }} supe WHERE ds >= '{{ prev_ds }}' AND ds < '{{ ds }}' Example command: airflow backfill superheroes -s 2018-05-01 -e 2018-05-08 -c '{"hive_cluster": "backfill_cluster"}'
  • 33. Airflow DAG level access @ Lyft • DAG access control has always been a real need at Lyft ‒ HR data, financial data, etc. ‒ The workaround is to build an isolated, dedicated cluster for each use case. • Airflow introduced the RBAC feature in 1.10 ‒ The new Airflow webserver is based on Flask-AppBuilder. ‒ Ships with 5 static roles (Admin, User, Op, Viewer, Public). ‒ ... • Airflow DAG level access (AIRFLOW-2267) ‒ Provides additional granular access control at the DAG level.
  • 34. Airflow DAG level access @ Lyft • The new Airflow UI migrates from Flask-Admin to Flask-AppBuilder (FAB). • FAB's security model.
  • 35. Airflow DAG level access @ Lyft • Which Airflow release includes the change? ‒ 1.10.2 includes the initial implementation. ‒ 1.10.3 (upcoming) includes the enhancements. • How it works ‒ Two new permissions: can_dag_read (read), can_dag_edit (write). ‒ A DAG-level role can be created through the CLI / UI by an Admin (doc). ‒ A DAG-level role can only see the DAGs it is allowed to view. ‒ Users can declare permissions in the DAG file (AIRFLOW-2694).
  • 36. Airflow DAG level access @ Lyft • We built a new cluster based on the Airflow master branch and onboarded a couple of new sensitive-data use cases. ‒ Each use case has its own repo. ‒ The user-role relationship is source controlled in a YAML file. • DAG owners specify the access control info in the DAG files. • Gotchas ‒ New user onboarding ‒ Integration between FAB and Google authentication (OAuth) ‒ Integration with the internal ACL service ‒ ... (Figure: user registration flow)
  • 38. Next Step • Support the Airflow DAG level access feature in beta internally. • Integrate the Airflow RBAC / DAG level access feature with the internal ACL service (FAB issue). • Migrate all the existing DAGs to this new cluster. • Explore running Airflow with the k8s executor internally.
  • 40. Summary • The Airflow community has been growing a lot! • We shared our experience of operating Airflow at Lyft. • We shared some of our upstream work ‒ Improve the Airflow backfill experience ‒ Support Airflow DAG level access
  • 41. Acknowledgement • Members who maintain Airflow at Lyft ‒ Alagappan Sethuraman ‒ Andrew Stahlman ‒ Chao-han Tsai ‒ Jinhyuk Chang ‒ Junda Yang ‒ Max Payton ‒ Tao Feng • Special thanks to Maxime Beauchemin, who provided numerous suggestions to us.
  • 42. Tao Feng | @feng-tao Slides at TBD Blog at go.lyft.com/airflowblog Icons under Creative Commons License from https://thenounproject.com/
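
Slides 13, 14, and 17 describe three monitoring signals: downtime when no task has been scheduled for 10 minutes, scheduler delay as TI.start_time - TI.execution_date, and open executor slots as parallelism minus running tasks. The sketch below restates those definitions in plain Python. It assumes ordinary datetime values instead of Airflow's own task instance objects, and the function names are hypothetical, not part of Airflow.

```python
from datetime import datetime, timedelta

# Threshold taken from slide 13: 10 minutes without a scheduled task is downtime.
DOWNTIME_THRESHOLD = timedelta(minutes=10)


def scheduler_delay(start_time, execution_date):
    """Scheduler delay as defined on slide 14: TI.start_time - TI.execution_date."""
    return start_time - execution_date


def downtime_windows(scheduled_at, threshold=DOWNTIME_THRESHOLD):
    """Return (gap_start, gap_end) pairs where consecutive canary tasks
    were scheduled more than `threshold` apart."""
    ordered = sorted(scheduled_at)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]


def open_slots(parallelism, running_tasks):
    """Open executor slots as described on slide 17."""
    return parallelism - running_tasks
```

With the parallelism of 200 mentioned in the editor's notes, open_slots(200, 150) leaves 50 slots free. Feeding downtime_windows the schedule times of a canary DAG's tasks gives the periods that would count against uptime under the 10-minute rule.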

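Slide 23 lists integration checks that every DAG should pass: valid pools, valid external task sensor targets, no unused pools, and reasonable sensor timeouts. The following is a minimal sketch of how such checks could look, assuming DAGs are described by simple records rather than loaded through Airflow. DagSpec, SensorSpec, check_dags, and the 12-hour timeout limit are all hypothetical choices for illustration, not Lyft's actual test suite.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class SensorSpec:
    task_id: str
    timeout: timedelta
    # (dag_id, task_id) that an external task sensor waits on, if any.
    external: Optional[Tuple[str, str]] = None


@dataclass
class DagSpec:
    dag_id: str
    task_pools: Dict[str, str]  # task_id -> pool name
    sensors: List[SensorSpec] = field(default_factory=list)


def check_dags(dags: List[DagSpec], defined_pools: Set[str],
               max_sensor_timeout: timedelta = timedelta(hours=12)) -> List[str]:
    """Return a list of human-readable violations; an empty list means all checks pass."""
    errors = []
    known_tasks = {(d.dag_id, t) for d in dags for t in d.task_pools}
    used_pools = set()
    for dag in dags:
        for task_id, pool in dag.task_pools.items():
            used_pools.add(pool)
            if pool not in defined_pools:
                errors.append(f"{dag.dag_id}.{task_id}: unknown pool {pool!r}")
        for sensor in dag.sensors:
            if sensor.timeout > max_sensor_timeout:
                errors.append(f"{dag.dag_id}.{sensor.task_id}: timeout too long")
            if sensor.external is not None and sensor.external not in known_tasks:
                errors.append(f"{dag.dag_id}.{sensor.task_id}: missing external task {sensor.external}")
    # Every defined pool should be used by at least one DAG.
    for pool in sorted(defined_pools - used_pools):
        errors.append(f"pool {pool!r} is not used by any DAG")
    return errors
```

Returning a list of violations instead of raising on the first one lets a CI job report every broken DAG in a single run.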
Editor's Notes

  1. Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. Workflows are defined as code. Growing community. Todo: first mention the stats, then the facts.
  2. What does the architecture for our core infra look like? Mobile application primarily… Raw events can come either from the client… or from the back-end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive for long-running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto… Airflow is used for scheduling (executive dashboard, metric aggregation, derived data generation, machine learning feature computation)
  3. https://github.com/dpgaspar/Flask-AppBuilder/issues/518 We are not the only team that manages Airflow, but we are the biggest team that manages Airflow at Lyft. Previously, some other teams with security requirements ran a separate cluster for their own use case.
  4. Parallelism is set to 200. Instance types: r5.4xlarge (16 vCPU, 128 GB mem), m4.4xlarge (16 vCPU, 64 GB mem), m4.10xlarge (40 vCPU, 160 GB mem), m4.16xlarge (64 vCPU, 256 GB mem).
  5. Canary monitoring DAG: when we do Airflow maintenance, we check whether the canary DAG is running as the signal for whether there are any issues.
  6. Scheduler delay roughly equals the time it takes the scheduler to pick up the task (depends on the scheduling loop and task priority) plus the time it takes a Celery worker to pick up the task from the Celery broker. Measured with the canary monitoring DAG.
  7. Open slots, running tasks, queued tasks
  8. At Lyft we mostly use ExternalTaskSensor and HivePartitionSensor. One of our interns' summer projects built a DAG dependency graph based on ExternalTaskSensor and HivePartitionSensor. The info is generated by a daily Airflow DAG.
  9. Parallelism: this variable controls the number of task instances that the Airflow worker can run simultaneously. Users can increase the parallelism variable in airflow.cfg. We normally suggest that users increase this value when doing a backfill. Concurrency: the Airflow scheduler will run no more than concurrency task instances for your DAG at any given time. Concurrency is defined in your Airflow DAG as a DAG input argument. If you do not set the concurrency on your DAG, the scheduler uses the default value from the dag_concurrency entry in your airflow.cfg. max_active_runs: Airflow will run no more than max_active_runs DagRuns of your DAG at a given time. If you do not set max_active_runs on your DAG, Airflow uses the default value from the max_active_runs_per_dag entry in your airflow.cfg. During backfill, we suggest that users not set depends_on_past to true and that they increase max_active_runs. Pool: an Airflow pool is used to limit execution parallelism. Users can increase the priority_weight of a task if it is a critical one.
  10. Todo: mention pool source control. Todo: need to have some examples for reliability.
  11. Todo: mention pool source control. Todo: need to have some examples for reliability.
  12. Data engineering handbook
  13. Provide a utility that lets users easily promote a partition to the table in the destination schema.
  14. Talk about backfill improvement
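
The batch backfill idea on slide 31 rests on prev_ds being ds - schedule_interval, so enlarging schedule_interval during a backfill lets each run cover a wider date window. The sketch below only models that date arithmetic, under the assumption that runs are evenly spaced from the start date; it does not reproduce how Airflow itself generates DagRuns, and backfill_windows is a hypothetical helper name.

```python
from datetime import date, timedelta


def backfill_windows(start, end, schedule_interval):
    """Return (prev_ds, ds) pairs as ISO date strings for each run between
    start and end (inclusive), where prev_ds = ds - schedule_interval."""
    windows = []
    ds = start
    while ds <= end:
        prev_ds = ds - schedule_interval
        windows.append((prev_ds.isoformat(), ds.isoformat()))
        ds += schedule_interval
    return windows
```

For the slide's date range of 2018-05-01 to 2018-05-08, a daily interval yields eight one-day windows, while a seven-day interval yields two windows that each span a week, which is what makes the batch variant cheaper to run.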