
Apache Airflow in Production

We will introduce Airflow, an Apache project for scheduling and workflow orchestration. We will discuss use cases, applicability, and how best to use Airflow, mainly in the context of building data engineering pipelines. We have been running Airflow in production for about 2 years, so we will also go over some lessons learned, best practices, and some tools we have built around it.

Speakers: Robert Sanders, Shekhar Vemuri

Published in: Technology

  1. Airflow: A CLAIRVOYANT Story (www.clairvoyantsoft.com)
  2. Quick Poll
  3. Speakers
      ● Robert Sanders, Big Data Manager and Engineer: One of the early employees of Clairvoyant, Robert primarily works with clients to enable them along their big data journey. Robert has a deep background in web and enterprise systems, working on full-stack implementations and then focusing on data management platforms.
      ● Shekhar Vemuri, CTO: Shekhar works with clients across various industries and helps define data strategy and lead the implementation of data engineering and data science efforts. He was co-founder and CTO of Blue Canary, a predictive analytics solution to help with student retention; Blue Canary was later acquired by Blackboard in 2015.
  4. About / Background / Awards & Recognition
      ● Boutique consulting firm centered on building data solutions and products
      ● All things Web and Data Engineering, Analytics, ML and User Experience to bring it all together
      ● Support core Hadoop platform, data engineering pipelines and provide administrative and devops expertise focused on Hadoop
  5. Currently working on building a data security solution to help enterprises discover, secure and monitor sensitive data in their environment.
  6. Agenda
      ● What is Apache Airflow?
        ○ Features
        ○ Architecture
      ● Use Cases
      ● Lessons Learned
      ● Best Practices
      ● Scaling & High Availability
      ● Deployment, Management & More
      ● Questions
  7. Genesis
      “Hey Robert, I heard about this new hotness that will solve all of our workflow scheduling and orchestration problems. I played with it for 2 hours and I am in love! Can you try it out?”
      “Must be pretty cool. I wonder how it compares to what we’re using. I’ll check it out!”
  9. Building Expertise vs Yak Shaving
  10. Why?
      ● Mostly used Cron and Oozie
      ● Did some crazy things with Java and Quartz in a past life
      ● A lot of operational support was going into debugging Oozie workloads and the issues we ran into with them
        ○ 4+ years of working with Oozie “built expertise??”
      ● Needed a scalable, open source, user-friendly engine for
        ○ Internal product needs
        ○ Client engagements
        ○ Making our Ops and Support teams’ lives easier
  11. Scheduler Landscape
  12. What is Apache Airflow?
      ● “Apache Airflow is an Open Source platform to programmatically Author, Schedule and Monitor workflows”
        ○ Workflows as Python code (this is huge!)
        ○ Provides monitoring tools like alerts and a web interface
      ● Written in Python
      ● Apache Incubator Project
        ○ Joined Apache Foundation in early 2016
        ○ https://github.com/apache/incubator-airflow/
  13. Why use Apache Airflow?
      ● Lightweight Workflow Platform
      ● Full-blown Python scripts as DSL
      ● More flexible execution and workflow generation
      ● Feature Rich Web Interface
      ● Worker Processes can Scale Horizontally and Vertically
      ● Extensible
  14. Building Blocks
  15. Executors
      What are Executors? Executors are the mechanism by which task instances get run.
      Different Executors
      ● SequentialExecutor
      ● LocalExecutor
      ● CeleryExecutor
      ● MesosExecutor
      ● KubernetesExecutor (Coming Soon)
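To make the executor idea concrete, here is a toy sketch in plain Python. It is not Airflow's implementation: `run_sequentially`, `run_in_pool`, and the sample tasks are hypothetical names, and a thread pool stands in for the local worker processes that a LocalExecutor-style setup would use. It only illustrates the difference between running task instances one at a time and running several at once on a single machine.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequentially(tasks):
    """Run callables one at a time, in order (the idea behind SequentialExecutor)."""
    return [task() for task in tasks]


def run_in_pool(tasks, parallelism=4):
    """Run callables concurrently on one machine.

    Assumption: a thread pool is used here only for illustration; it stands in
    for the separate worker processes a real local executor would manage.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Collect results in submission order, waiting for each to finish.
        return [future.result() for future in futures]


# Hypothetical stand-ins for task instances.
tasks = [lambda n=n: "Task %d done" % n for n in range(1, 4)]

sequential_results = run_sequentially(tasks)
pooled_results = run_in_pool(tasks)
```

Both calls return the same results; only the degree of parallelism differs. A distributed executor such as CeleryExecutor takes the same idea further by handing tasks to workers on other nodes.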
  16. Single Node Deployment
  17. Multi-Node Deployment
  18. Defining a Workflow
          # Library imports
          from datetime import datetime, timedelta

          from airflow.models import DAG
          from airflow.operators.bash_operator import BashOperator

          # Define global variables and default arguments
          START_DATE = datetime.now() - timedelta(minutes=1)
          default_args = dict(
              owner='Airflow',
              retries=1,
              retry_delay=timedelta(minutes=5),
          )

          # Define the DAG (the cron expression schedules it daily at midnight)
          dag = DAG('dag_id', default_args=default_args,
                    schedule_interval='0 0 * * *', start_date=START_DATE)

          # Define the tasks
          task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag)
          task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag)
          task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag)

          # Define the task relationships: task1 -> task2 -> task3
          task1.set_downstream(task2)
          task2.set_downstream(task3)
  19. Defining a Dynamic Workflow
          dag = DAG('dag_id', …)
          last_task = None
          for i in range(1, 3):
              task = BashOperator(
                  task_id='task' + str(i),
                  bash_command="echo 'Task" + str(i) + "'",
                  dag=dag)
              # Chain each new task after the previously created one
              if last_task is not None:
                  last_task.set_downstream(task)
              last_task = task
  20. Many More Operators
      ● Action Operators
        ○ BashOperator(bash_command)
        ○ SSHOperator(ssh_hook, ssh_conn_id, remote_host, command)
        ○ PythonOperator(python_callable=python_function)
      ● Transfer Operators
        ○ HiveToMySqlTransfer(sql, mysql_table, hiveserver2_conn_id, mysql_conn_id, mysql_preoperator, mysql_postoperator, bulk_load)
        ○ MySqlToHiveTransfer(sql, hive_table, create, recreate, partition, delimiter, mysql_conn_id, hive_cli_conn_id, tblproperties)
      ● Sensor Operators
        ○ HdfsSensor(filepath, hdfs_conn_id, ignored_ext, ignore_copying, file_size, hook)
        ○ HttpSensor(endpoint, http_conn_id, method, request_params, headers, response_check, extra_options)
  21. First Use Case (Description)
      ● Kogni discovers sensitive data across all enterprise data sources
      ● Need to configure scans with various schedules that work standalone or with a Spark cluster
      ● Orchestrate, execute and manage dozens of pipelines that scan and ingest data in a secure fashion
      ● Needed a tool to manage this outside of the core platform
      ● Started with exporting Oozie configuration from the core app, but conditional aspects and visibility became an issue
      ● Needed something that supported deep DAGs and broad DAGs
  22. Second Use Case (Description)
      ● Daily ETL Batch Process to Ingest data into Hadoop
        ○ Extract
          ■ 1226 tables from 23 databases
        ○ Transform
          ■ Impala scripts to join and transform data
        ○ Load
          ■ Impala scripts to load data into common final tables
      ● Other requirements
        ○ Make it extensible to allow the client to import more databases and tables in the future
        ○ Status emails to be sent out after daily job to report on success and failures
      ● Solution
        ○ Create a DAG that dynamically generates the workflow based off data in a Metastore
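The solution above, generating the workflow from rows in a metastore, can be sketched without Airflow as a plain dependency map. Everything here is an illustrative assumption: the `(database, table)` row shape, the task-id naming, and the single `transform` and `load` steps are hypothetical and not taken from the actual client DAG. In a real DAG file, each key would become an operator and each upstream set a series of `set_downstream` calls.

```python
def build_workflow(metastore_rows):
    """Return {task_id: set of upstream task_ids} for one extract/transform/load run.

    metastore_rows is an iterable of (database, table) pairs; this schema is a
    hypothetical stand-in for whatever the metastore actually holds.
    """
    deps = {"transform": set(), "load": {"transform"}}
    for database, table in metastore_rows:
        extract_id = "extract_%s_%s" % (database, table)
        deps[extract_id] = set()            # extract tasks have no upstream tasks
        deps["transform"].add(extract_id)   # transform waits for every extract
    return deps


# Adding a table to the metastore adds a task without touching the DAG code.
rows = [("sales", "orders"), ("sales", "customers"), ("hr", "employees")]
workflow = build_workflow(rows)
```

Because the task list is derived from data, the extensibility requirement (more databases and tables later) is met by inserting rows rather than editing the workflow definition.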
  23. Second Use Case (Architecture)
  24. Second Use Case (DAG): 1,000 ft view, 100,000 ft view
  25. Lessons Learned
      ● Support
      ● Documentation
      ● Bugs and Odd Behavior
      ● Monitor Performance with Charts
      ● Tune Retries
      ● Tune Parallelism
  26. Best Practices
      ● Load Data Incrementally
      ● Process Historic Data with Backfill operations
      ● Enforce Idempotency (retry safe)
      ● Execute Conditionally (BranchPythonOperator, ShortCircuitOperator)
      ● Alert if there are failures (task failures and SLA misses) (Email/Slack)
      ● Use Sensor Operators to determine when to Start a Task (if applicable)
      ● Build Validation into your Workflows
      ● Test as much as possible, but this needs some thought
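One way to combine "load data incrementally" with "enforce idempotency" is to have each run replace only the partition for its own execution date. The sketch below uses an in-memory SQLite database purely for illustration; the `events` table, its `ds` column, and `load_partition` are hypothetical names, and a real pipeline would apply the same delete-then-insert (or partition overwrite) pattern to its own store.

```python
import sqlite3


def load_partition(conn, execution_date, values):
    """Replace all rows for one execution date so a retry cannot duplicate data."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM events WHERE ds = ?", (execution_date,))
        conn.executemany(
            "INSERT INTO events (ds, value) VALUES (?, ?)",
            [(execution_date, value) for value in values],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ds TEXT, value INTEGER)")

load_partition(conn, "2018-01-01", [1, 2, 3])
load_partition(conn, "2018-01-01", [1, 2, 3])  # simulated retry of the same run
load_partition(conn, "2018-01-02", [4])        # the next day's increment

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The retry leaves three rows for the first date instead of six, which is what makes retries and backfills over the same dates safe.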
  27. Scaling & High Availability
  28. High Availability for the Scheduler
      ● Scheduler Failover Controller: https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
  29. Deployment & Management
      ● pip install Airflow site packages on all nodes
      ● Set the AIRFLOW_HOME env variable before setup
      ● Utilize MySQL or PostgreSQL as a Metastore
      ● Update Web App Port
      ● Utilize systemd or Upstart scripts (https://github.com/apache/incubator-airflow/tree/master/scripts)
      ● Set Log Location
        ○ Local File System, S3 Bucket, Google Cloud Storage
      ● Daemon Monitoring (Nagios)
      ● Cloudera Manager CSD (Coming Soon)
  30. Security
      ● Web App Authentication
        ○ Password
        ○ LDAP
        ○ OAuth: Google, GitHub
      ● Role Based Access Control (RBAC) (Coming Soon)
      ● Protect airflow.cfg (expose_config, read access to airflow.cfg)
      ● Web App SSL
      ● Kerberos Ticket Renewer
  31. Testing
      ● PyUnit - Unit Testing
      ● Test DAG Tasks Individually
          airflow test [--subdir SUBDIR] [--dry_run] [--task_params TASK_PARAMS_JSON] dag_id task_id execution_date
      ● Airflow Unit Test Mode - Loads configurations from the unittests.cfg file
          [tests]
          unit_test_mode = true
      ● Always, at the very least, ensure that the DAG is valid (can be done as part of CI)
      ● Take it a step further with mock pipeline testing (with inputs and outputs), especially if your DAGs are broad
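As a rough illustration of the "ensure that the DAG is valid" check, the sketch below validates a dependency map with the standard library's `graphlib` (Python 3.9+). It is a stand-in, not Airflow's own validation: `validate_dependencies` and the sample graphs are hypothetical. In an Airflow install, the comparable CI check is usually to load the DAG folder with `airflow.models.DagBag` and assert that its `import_errors` is empty.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dependencies(deps):
    """Return a valid execution order, or raise ValueError if there is a cycle.

    deps maps each task id to the set of task ids it depends on.
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # The second element of args holds the list of nodes forming the cycle.
        raise ValueError("dependency cycle: %s" % (exc.args[1],)) from exc


# A valid chain: task1 -> task2 -> task3
valid = {"task1": set(), "task2": {"task1"}, "task3": {"task2"}}
order = validate_dependencies(valid)

# A broken graph: task1 also waits on task3, which closes a loop.
broken = {"task1": {"task3"}, "task2": {"task1"}, "task3": {"task2"}}
try:
    validate_dependencies(broken)
    cycle_detected = False
except ValueError:
    cycle_detected = True
```

Running a check like this on every commit catches structural mistakes before the scheduler ever sees the DAG.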
  32. Questions?
  33. We are hiring!
      ● Shekhar Vemuri: @shekharv, shekhar@clairvoyant.ai, linkedin.com/in/shekharvemuri
      ● Robert Sanders: @rssanders3, robert@clairvoyant.ai, linkedin.com/in/robert-sanders-cs
