Airflow at Lyft

These are the slides for my talk about Airflow at Lyft at the SF Big Analytics meetup in April 2019.

Published in: Engineering

  1. 1. Airflow @ Lyft. April 10th, 2019. Tao Feng | @feng-tao | Software Engineer, Lyft
  2. 2. Who ● Tao Feng ● Engineer at Lyft Data Platform ● Apache Airflow PMC member and committer ● Working on different data products (Airflow, Amundsen, etc.)
  3. 3. Agenda • Airflow in general • Airflow @ Lyft • Upstream @ Lyft • Next Step • Summary
  4. 4. Airflow in general
  5. 5. Airflow in general • Airflow recently became an Apache top-level project (TLP). ‒ 20 PMC members / committers in total. • Most recent releases: 1.10, 1.10.1, 1.10.2 (1.10.3 is coming). ‒ New features: Airflow RBAC, Airflow K8s integration, etc. • New process in Airflow for proposing architecture changes. ‒ Airflow Improvement Proposals (currently 19+ proposals). • The community recently conducted an Airflow user survey (link). • 11k+ GitHub stars, 740+ contributors, 250+ companies using Airflow.
  6. 6. Airflow @ Lyft
  7. 7. Core Infra high-level architecture @ Lyft
  8. 8. Airflow Architecture @ Lyft
  9. 9. Airflow Architecture @ Lyft • WebUI: the portal where users view the status of their DAGs. • Metadata DB: the Airflow metastore, which stores job status. • Scheduler: a multi-process service that parses the DAG bag, creates DAG objects, and triggers the executor to run tasks whose dependencies are met. • Executor: a message-queuing process that orchestrates worker processes to execute tasks. We use CeleryExecutor at Lyft. • TARS: the Airflow development / backfill environment, which provides access to production data.
  10. 10. Airflow Architecture @ Lyft • Main cluster config: Apache Airflow 1.8.2 with cherry-picks and numerous in-house Lyft patches. • Scale: three sets of ASGs for workers. ‒ ASG #1: 15 worker nodes of type r5.4xlarge (16 vCPU, 128 GB memory). This fleet processes low-priority, memory-intensive tasks. ‒ ASG #2: 3 worker nodes of type m4.4xlarge (16 vCPU, 64 GB memory). This fleet is dedicated to DAGs with a strict SLA. ‒ ASG #3: 1 worker node of type m4.10xlarge (40 vCPU, 160 GB memory). This single node processes the compute-intensive workloads from a critical team's DAGs. ‒ Backfill box (TARS): 1 node of type m4.16xlarge (64 vCPU, 256 GB memory). This box is used for fast DAG prototyping and backfill.
  11. 11. Airflow daily stats @ Lyft • 600+ DAGs • 800+ DagRuns • 25k+ TIs (task instances)
  12. 12. Airflow Monitoring @ Lyft
  13. 13. Airflow Availability • Scheduler and worker health check ‒ Use a canary monitoring DAG. ‒ If no task has been scheduled for 10 minutes, it is considered downtime. • UI health check ‒ Leverage the Envoy membership health check. • Total system uptime percentage ‒ Airflow is considered down if any of the scheduler, workers, or web server is down.
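The availability rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Lyft's monitoring code: the function names are hypothetical, and it assumes the canary DAG exposes the timestamp at which a task was last scheduled.

```python
from datetime import datetime, timedelta

# Threshold taken from the slide: 10 minutes without a scheduled task.
DOWNTIME_THRESHOLD = timedelta(minutes=10)


def is_scheduler_down(last_scheduled_at, now, threshold=DOWNTIME_THRESHOLD):
    """Return True when no canary task was scheduled within the threshold.

    Assumption: `last_scheduled_at` is the time the canary DAG last had a
    task scheduled; how that timestamp is collected is not shown here.
    """
    return now - last_scheduled_at > threshold


def is_airflow_up(scheduler_ok, workers_ok, webserver_ok):
    """Airflow counts as up only when every component is healthy."""
    return scheduler_ok and workers_ok and webserver_ok
```

The total uptime percentage would then be the share of check intervals in which `is_airflow_up` returned True.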
  14. 14. Schedule Delay • schedule delay = TI.start_time - TI.execution_date
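As a worked example of the formula on this slide, the sketch below computes the delay for a batch of task instances and reports the worst one. The `(start_time, execution_date)` tuples are a hypothetical stand-in for real task instance records.

```python
from datetime import datetime, timedelta


def schedule_delay(start_time, execution_date):
    """Delay as defined on the slide: start_time - execution_date."""
    return start_time - execution_date


def max_schedule_delay(task_instances):
    """Largest delay over an iterable of (start_time, execution_date) pairs."""
    return max(schedule_delay(start, execution) for start, execution in task_instances)
```

For a task with execution date 2019-04-10 00:00 that starts at 2019-04-10 00:03, the delay is three minutes.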
  15. 15. DAG last run time • The time that has elapsed since the DAG file was last processed. • If this time becomes too long, the scheduler has issues processing the DAG files. ‒ E.g. this could be due to parser threads being occupied by malicious DAG files.
  16. 16. Executor parallelism • Parallelism: controls the number of concurrently running tasks. ‒ Please monitor your worker nodes' CPU utilization before increasing the value.
  17. 17. Airflow monitoring @ Lyft. Stats name: meaning
      • dag_processing.last_run.seconds_ago.<dag_file>: seconds since <dag_file> was last processed.
      • executor.open_slots: number of open slots on the executor (parallelism - number of running tasks).
      • executor.queued_tasks: number of queued tasks on the executor.
      • executor.running_tasks: number of running tasks on the executor.
      • pool.starving_tasks.<pool_name>: number of starving tasks in the pool. Use it to check how many tasks are starving due to the pool's slot limit.
      • ...
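Two of these stats lend themselves to a small worked example: the open-slot arithmetic and an alert on stale DAG files. The sketch below is illustrative only. The function names and the staleness threshold are assumptions, and the stats are passed in as a plain dict rather than read from a metrics backend.

```python
def executor_open_slots(parallelism, running_tasks):
    """Open slots as described on the slide: parallelism - running tasks."""
    return parallelism - running_tasks


def stale_dag_files(seconds_ago_by_file, threshold_seconds):
    """Return DAG files whose last processing time exceeds the threshold.

    `seconds_ago_by_file` maps a DAG file name to the value of its
    dag_processing.last_run.seconds_ago stat. The threshold is a
    hypothetical alerting value, not one taken from the talk.
    """
    return sorted(
        dag_file
        for dag_file, seconds_ago in seconds_ago_by_file.items()
        if seconds_ago > threshold_seconds
    )
```

With parallelism set to 32 and 20 tasks running, 12 slots remain open. A value that stays near zero suggests checking worker CPU utilization before raising parallelism.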
  18. 18. Airflow Customization @ Lyft
  19. 19. Airflow customization @ Lyft • UI auditing • Extra link for task instance UI panel (AIRFLOW-161)
  20. 20. Airflow customization @ Lyft • DAG dependency graph
  21. 21. Improve Airflow Reliability @ Lyft
  22. 22. Improving Airflow Performance @ Lyft • Reduce Airflow UI page load time ‒ Change default_dag_run_display_number to 5. • Tunables that impact task execution parallelism ‒ Parallelism ‒ Concurrency ‒ max_active_runs ‒ Pool
  23. 23. Improving Airflow Reliability at Lyft • Source control for pools ‒ All Airflow pools are defined in a source-controlled file on GitHub. ‒ Airflow pools are configured from that file at runtime. • Integration tests for DAGs to enforce best practices and improve reliability ‒ All DAGs should be loadable within a time threshold. ‒ All DAGs should have valid pools associated. ‒ External task sensors should be valid (the (dag_id, task_id) pair exists). ‒ Each pool is used by at least one DAG. ‒ Each sensor has a reasonable timeout. ‒ Each DAG has a non-dynamic start date. • Secure UI access
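A few of the integration checks listed on this slide can be sketched without Airflow itself. The example below is a simplified model, not Lyft's test suite: `DagInfo`, `check_dags`, and the 30-second default are hypothetical, and a real test would gather this information from the loaded DAG bag and the source-controlled pool file.

```python
from dataclasses import dataclass, field


@dataclass
class DagInfo:
    """Hypothetical summary of one DAG, gathered by a test harness."""
    dag_id: str
    load_seconds: float
    pools: set = field(default_factory=set)


def check_dags(dags, defined_pools, max_load_seconds=30.0):
    """Return human-readable violations of three checks from the slide."""
    errors = []
    used_pools = set()
    for dag in dags:
        # Check: every DAG loads within the time threshold.
        if dag.load_seconds > max_load_seconds:
            errors.append(f"{dag.dag_id}: load time exceeds {max_load_seconds}s")
        # Check: every pool a DAG references is defined.
        for pool in sorted(dag.pools - defined_pools):
            errors.append(f"{dag.dag_id}: unknown pool {pool!r}")
        used_pools |= dag.pools
    # Check: every defined pool is used by at least one DAG.
    for pool in sorted(defined_pools - used_pools):
        errors.append(f"pool {pool!r} is not used by any DAG")
    return errors
```

Returning a list of messages rather than failing on the first violation lets a single test run report every problem at once.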
  24. 24. Production Debug @ Lyft
  25. 25. Production Debug @ Lyft • We document every production issue investigation in a doc. • A couple of methodologies: ‒ View the centralized Airflow dashboard. ‒ Identify whether it is a UI issue or an Airflow scheduler (backend) issue. ‒ View the webserver log or the scheduler log. ∙ If the log is not available on the machine, check the log in Kibana. ∙ To further identify issues, we sometimes even look at logs in S3. ‒ Use different tools for further investigation. ∙ If an exception is thrown, understand which part of the Airflow code throws it. ∙ If a CPU / memory alarm fires, use top to identify which DAG causes the issue. ∙ If the failure is related to Celery, log in to the Celery Flower UI to investigate further. ∙ ...
  26. 26. Airflow Gotchas @ Lyft
  27. 27. Airflow Gotchas at Lyft • DST ‒ The UI doesn't have timezone support, even upstream. ‒ The scheduler in our internal version has no timezone support. • DAGs with a dynamic start date ‒ Hard to predict when the DAG is scheduled. • Long-running external task sensors that don't have valid external tasks. • HivePartitionSensor doesn't work for partial partitions ‒ It only checks whether data exists, not whether the data is fully loaded. • Backfill experience ‒ We use the local executor to backfill. • A long-running sensor occupies a task slot of the pool. • Users confuse DAG-level arguments with task-level arguments ‒ E.g. putting max_active_runs in the default task arguments. • Legacy high-abstraction framework over Airflow ‒ Hard to debug for the user and for us.
  28. 28. Upstream @ Lyft
  29. 29. Improve backfill experience
  30. 30. Improve backfill experience • New options for backfill ‒ --reset_dagruns: if used, Airflow first checks whether any existing dag_runs / task_instances are associated with the backfill date range. If so, it asks the user whether to clear those task_instances first. (AIRFLOW-2718) ‒ --rerun_failed_tasks: if used, Airflow automatically reruns failed tasks without requiring any user intervention. (AIRFLOW-2566) • Backfill respects pools for isolation (AIRFLOW-1557)
  31. 31. Improve backfill experience: support batch backfill • Use {{ prev_ds }} and {{ ds }} in SQL ‒ prev_ds equals ds - schedule_interval. ‒ Users can change the schedule_interval in the DAG file during backfill. • Users can override DAG params with the -c option during backfill. • Example query: INSERT OVERWRITE TABLE {{ dest_db(default.superhero_data) }} SELECT supe.superhero_name AS superhero_name, pop.popularity AS popularity FROM {{ source_table(events.superheroes) }} supe WHERE ds >= '{{ prev_ds }}' AND ds < '{{ ds }}' • Example command: airflow backfill superheroes -s 2018-05-01 -e 2018-05-08 -c '{"hive_cluster": "backfill_cluster"}'
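The relationship prev_ds = ds - schedule_interval is what makes batch backfill work: widening the schedule interval widens the date range each run covers. The sketch below shows the arithmetic with plain dates. The function names are hypothetical, and it assumes the interval is a fixed timedelta rather than a cron expression.

```python
from datetime import date, timedelta


def backfill_window(ds, schedule_interval):
    """Return (prev_ds, ds) as ISO strings, with prev_ds = ds - schedule_interval."""
    prev_ds = ds - schedule_interval
    return prev_ds.isoformat(), ds.isoformat()


def partition_filter(ds, schedule_interval):
    """Render a half-open [prev_ds, ds) filter on a `ds` partition column."""
    prev_ds, current_ds = backfill_window(ds, schedule_interval)
    return f"ds >= '{prev_ds}' AND ds < '{current_ds}'"
```

With a daily interval, a run for 2018-05-08 covers one partition. Changing the interval to seven days makes the same run cover 2018-05-01 up to, but not including, 2018-05-08.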
  32. 32. Airflow DAG level access
  33. 33. Airflow DAG level access @ Lyft • DAG access control has always been a real need at Lyft ‒ HR data, financial data, etc. ‒ The workaround is to build an isolated, dedicated cluster for each use case. • Airflow introduced the RBAC feature in 1.10 ‒ The new Airflow webserver is based on Flask-AppBuilder. ‒ Ships with 5 static roles (Admin, User, Op, Viewer, Public). ‒ ... • Airflow DAG level access (AIRFLOW-2267) ‒ Provides additional granular access control at the DAG level.
  34. 34. Airflow DAG level access @ Lyft • The new Airflow UI migrates from Flask-Admin to Flask-AppBuilder (FAB). • FAB's security model.
  35. 35. Airflow DAG level access @ Lyft • Which Airflow version includes the change? ‒ 1.10.2 includes the initial implementation. ‒ 1.10.3 (upcoming) includes the enhancements. • How it works ‒ Two new permissions: can_dag_read (read), can_dag_edit (write). ‒ A DAG-level role can be created through the CLI / UI by an Admin (doc). ‒ A DAG-level role can only see the DAGs it is allowed to view. ‒ Users can declare permissions in the DAG file (AIRFLOW-2694).
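To make the two permissions concrete, here is a small model of how a DAG-level role restricts what a user can see and edit. It is a conceptual sketch only and does not use Airflow's or Flask-AppBuilder's APIs: the role names, DAG ids, and helper functions are all hypothetical.

```python
# Hypothetical role -> DAG -> permissions mapping; names are illustrative only.
ROLE_DAG_PERMS = {
    "finance_viewer": {"finance_daily": {"can_dag_read"}},
    "finance_editor": {"finance_daily": {"can_dag_read", "can_dag_edit"}},
}


def has_dag_perm(role, dag_id, perm, role_dag_perms=ROLE_DAG_PERMS):
    """Return True if the role holds the given permission on the DAG."""
    return perm in role_dag_perms.get(role, {}).get(dag_id, set())


def viewable_dags(role, role_dag_perms=ROLE_DAG_PERMS):
    """List the DAGs a role may see, i.e. those with can_dag_read."""
    dag_perms = role_dag_perms.get(role, {})
    return sorted(dag_id for dag_id, perms in dag_perms.items() if "can_dag_read" in perms)
```

A role with no entry for a DAG sees nothing, which matches the slide's point that a DAG-level role only sees its viewable DAGs.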
  36. 36. Airflow DAG level access @ Lyft • We built a new cluster based on the Airflow master branch and onboarded a couple of new sensitive-data use cases. ‒ Each use case has its own repo. ‒ The user-role relationship is source-controlled in a YAML file. • DAG owners specify the access control info in the DAG files. • Gotchas ‒ New user onboarding ‒ Integration between FAB and Google authentication (OAuth) ‒ Integration with the internal ACL service ‒ ... (Diagram: user registration flow)
  37. 37. Next Step
  38. 38. Next Step • Support the Airflow DAG level access feature in beta internally. • Integrate the Airflow RBAC / DAG level feature with the internal ACL service (FAB issue). • Migrate all the existing DAGs to this new cluster. • Explore running Airflow with the k8s executor internally.
  39. 39. Summary
  40. 40. Summary • The Airflow community has been growing a lot! • We shared our experience operating Airflow at Lyft. • We shared some of our upstream work ‒ Improve the Airflow backfill experience ‒ Support Airflow DAG level access
  41. 41. Acknowledgement • Members who maintain Airflow at Lyft ‒ Alagappan Sethuraman ‒ Andrew Stahlman ‒ Chao-han Tsai ‒ Jinhyuk Chang ‒ Junda Yang ‒ Max Payton ‒ Tao Feng • Special thanks to Maxime Beauchemin, who provided us with numerous suggestions.
  42. 42. Tao Feng | @feng-tao Slides at TBD Blog at go.lyft.com/airflowblog Icons under Creative Commons License from https://thenounproject.com/
  43. 43. Backup
