www.clairvoyantsoft.com
Airflow
A CLAIRVOYANT Story
Quick Poll
| 2
| 3
Robert Sanders
Big Data Manager and Engineer
One of the early employees of Clairvoyant, Robert
primarily works with clients to enable them along their
big data journey. Robert has a deep background in web
and enterprise systems, working on full-stack
implementations and then focusing on data
management platforms.
Shekhar Vemuri
CTO
Shekhar works with clients across various industries and
helps define data strategy and lead the implementation of
data engineering and data science efforts.
He was co-founder and CTO of Blue Canary, a predictive
analytics solution to help with student retention. Blue
Canary was acquired by Blackboard in 2015.
| 4
About
Background Awards & Recognition
Boutique consulting firm centered on building data solutions and
products
All things Web and Data Engineering, Analytics, ML and User
Experience to bring it all together
Support core Hadoop platform, data engineering pipelines and
provide administrative and devops expertise focused on Hadoop
| 5
Currently working on building a data security solution to help enterprises
discover, secure and monitor sensitive data in their environment.
| 6
● What is Apache Airflow?
○ Features
○ Architecture
● Use Cases
● Lessons Learned
● Best Practices
● Scaling & High Availability
● Deployment, Management & More
● Questions
Agenda
| 7
Hey Robert, I heard about this new
hotness that will solve all of our
workflow scheduling and
orchestration problems. I played
with it for 2 hours and I am in love!
Can you try it out?
Must be pretty cool. I
wonder how it compares
to what we’re using. I’ll
check it out!
Genesis
| 8
Building Expertise vs Yak
Shaving
| 9
| 10
● Mostly used Cron and Oozie
● Did some crazy things with Java and Quartz in a past life
● A lot of operational support went into debugging Oozie workloads and the issues we ran into with them
○ 4+ Years of working with Oozie “built expertise??”
● Needed a scalable, open source, user friendly engine for
○ Internal product needs
○ Client engagements
○ Making our Ops and Support teams' lives easier
Why?
| 11
Scheduler Landscape
| 12
● “Apache Airflow is an Open Source platform to programmatically Author, Schedule and Monitor workflows”
○ Workflows as Python Code (this is huge!!!!!)
○ Provides monitoring tools like alerts and a web interface
● Written in Python
● Apache Incubator Project
○ Joined Apache Foundation in early 2016
○ https://github.com/apache/incubator-airflow/
What is Apache Airflow?
| 13
● Lightweight Workflow Platform
● Full blown Python scripts as DSL
● More flexible execution and workflow generation
● Feature Rich Web Interface
● Worker Processes can Scale Horizontally and Vertically
● Extensible
Why use Apache Airflow?
| 14
Building Blocks
| 15
Different Executors
● SequentialExecutor
● LocalExecutor
● CeleryExecutor
● MesosExecutor
● KubernetesExecutor (Coming Soon)
Executors
What are Executors?
Executors are the mechanism by which task
instances get run.
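As a rough illustration of the idea (not Airflow's actual implementation), an executor can be thought of as the component that takes queued task instances and decides how they run. The sketch below uses only the Python standard library. The names `run_sequentially` and `run_in_pool` are hypothetical and only stand in for the behaviour of a sequential executor and a local, parallel one.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequentially(tasks):
    """Run each callable one after another, like a single-process executor."""
    return {name: fn() for name, fn in tasks.items()}


def run_in_pool(tasks, max_workers=4):
    """Run callables concurrently on one machine using a worker pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical task instances: plain callables keyed by task id.
tasks = {
    "task1": lambda: "Task 1 done",
    "task2": lambda: "Task 2 done",
}

sequential_results = run_sequentially(tasks)
pooled_results = run_in_pool(tasks)
```

Both functions produce the same results. What changes is where and how concurrently the work happens, which is the choice an executor setting makes.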
| 16
Single Node Deployment
| 17
Multi-Node Deployment
| 18
# Library Imports
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path
from datetime import datetime, timedelta

# Define global variables and default arguments
START_DATE = datetime.now() - timedelta(minutes=1)
default_args = dict(
    owner='Airflow',
    retries=1,
    retry_delay=timedelta(minutes=5),
)

# Define the DAG (runs daily at midnight)
dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *', start_date=START_DATE)

# Define the Tasks
task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag)
task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag)
task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag)

# Define the Task Relationships (task1 -> task2 -> task3)
task1.set_downstream(task2)
task2.set_downstream(task3)
Defining a Workflow
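To make the effect of the task1 -> task2 -> task3 relationships concrete, here is a minimal sketch of how downstream links translate into a run order. It is a plain topological sort (Kahn's algorithm) written with the standard library, not Airflow's scheduler. The function name `execution_order` and the dictionary layout are assumptions made for illustration.

```python
from collections import deque


def execution_order(downstream):
    """Return task ids in an order that respects every upstream dependency."""
    # Count how many upstream tasks each task is still waiting on.
    waiting_on = {task: 0 for task in downstream}
    for children in downstream.values():
        for child in children:
            waiting_on[child] = waiting_on.get(child, 0) + 1

    ready = deque(sorted(t for t, n in waiting_on.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in downstream.get(task, []):
            waiting_on[child] -= 1
            if waiting_on[child] == 0:
                ready.append(child)

    # Tasks left over can only be waiting on each other, which means a cycle.
    if len(order) != len(waiting_on):
        raise ValueError("cycle detected: not a valid DAG")
    return order


# Mirrors the three-task chain defined on the slide.
downstream = {"task1": ["task2"], "task2": ["task3"], "task3": []}
order = execution_order(downstream)
```

The same function rejects a graph with a cycle, which is the property that makes a workflow a DAG in the first place.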
| 19
dag = DAG('dag_id', …)
last_task = None
for i in range(1, 3):
    task = BashOperator(
        task_id='task' + str(i),
        bash_command="echo 'Task" + str(i) + "'",
        dag=dag)
    if last_task is None:
        last_task = task
    else:
        last_task.set_downstream(task)
        last_task = task
Defining a Dynamic Workflow
| 20
● Action Operators
○ BashOperator(bash_command)
○ SSHOperator(ssh_hook, ssh_conn_id, remote_host, command)
○ PythonOperator(python_callable=python_function)
● Transfer Operators
○ HiveToMySqlTransfer(sql, mysql_table, hiveserver2_conn_id, mysql_conn_id, mysql_preoperator, mysql_postoperator, bulk_load)
○ MySqlToHiveTransfer(sql, hive_table, create, recreate, partition, delimiter, mysql_conn_id, hive_cli_conn_id, tblproperties)
● Sensor Operators
○ HdfsSensor(filepath, hdfs_conn_id, ignored_ext, ignore_copying, file_size, hook)
○ HttpSensor(endpoint, http_conn_id, method, request_params, headers, response_check, extra_options)
Many More
Operators
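Sensor operators differ from the others in that they wait for a condition rather than perform work. A minimal sketch of that polling pattern, using only the standard library, might look like the following. The helper `wait_for` and the condition `file_has_arrived` are hypothetical. The parameter names `poke_interval` and `timeout` are borrowed from Airflow's sensor vocabulary, but this is not the library's implementation.

```python
import time


def wait_for(poke, poke_interval=0.01, timeout=1.0):
    """Call poke() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met before the timeout")
        time.sleep(poke_interval)


# Hypothetical condition that becomes true on the third check,
# standing in for "the file has arrived" or "the endpoint responds".
attempts = {"count": 0}


def file_has_arrived():
    attempts["count"] += 1
    return attempts["count"] >= 3


sensor_succeeded = wait_for(file_has_arrived)
```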
| 21
● Kogni discovers sensitive data across all enterprise data sources
● Need to configure scans with various schedules that work standalone or with a Spark cluster
● Orchestrate, execute and manage dozens of pipelines that scan and ingest data in a secure
fashion
● Needed a tool to manage this outside of the core platform
● Started with exporting Oozie configuration from the core app - but conditional aspects and
visibility became an issue
● Needed something that supported deep DAGs and Broad DAGs
First Use Case (Description)
| 22
● Daily ETL Batch Process to Ingest data into Hadoop
○ Extract
■ 1226 tables from 23 databases
○ Transform
■ Impala scripts to join and transform data
○ Load
■ Impala scripts to load data into common final tables
● Other requirements
○ Make it extensible to allow the client to import more databases and tables in the future
○ Status emails to be sent out after daily job to report on success and failures
● Solution
○ Create a DAG that dynamically generates the workflow based off data in a Metastore
Second Use Case (Description)
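The solution above, a DAG generated from metastore data, can be sketched roughly as follows. The transcript does not show the real DAG, so the shape here (one extract task per table feeding a shared transform, load, and status email step) is an assumption, as are the sample rows and the `build_workflow` helper. The sketch only builds task ids and edges and does not call Airflow.

```python
# Hypothetical metastore rows: (database, table) pairs to ingest.
metastore_rows = [
    ("sales", "orders"),
    ("sales", "customers"),
    ("hr", "employees"),
]


def build_workflow(rows):
    """Return (task_ids, edges) for an extract -> transform -> load pipeline."""
    task_ids = ["transform", "load", "send_status_email"]
    edges = [("transform", "load"), ("load", "send_status_email")]
    for database, table in rows:
        extract_id = "extract_{}_{}".format(database, table)
        task_ids.append(extract_id)
        # Every extract must finish before the shared transform step starts.
        edges.append((extract_id, "transform"))
    return task_ids, edges


task_ids, edges = build_workflow(metastore_rows)
```

Adding a database or table then means adding a row to the metastore rather than editing the workflow code, which is what makes the approach extensible.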
| 23
Second Use Case (Architecture)
| 24
Second Use Case (DAG)
1,000 ft view 100,000 ft view
| 25
● Support
● Documentation
● Bugs and Odd Behavior
● Monitor Performance with Charts
● Tune Retries
● Tune Parallelism
Lessons Learned
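To show what tuning retries means in practice, here is a small standard-library sketch of retry-with-delay behaviour. The names `retries` and `retry_delay` echo the default arguments used earlier in the deck. The helper `run_with_retries` and the `flaky_task` function are hypothetical, and the logic is a simplification rather than Airflow's own retry handling.

```python
import time


def run_with_retries(task, retries=1, retry_delay=0.01):
    """Run task(); on failure wait retry_delay seconds and try again."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return task(), attempt
        except Exception:
            # Give up once the initial try plus all retries are used.
            if attempt > retries:
                raise
            time.sleep(retry_delay)


calls = {"count": 0}


def flaky_task():
    """Fail on the first call only, imitating a transient error."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient failure")
    return "ok"


result, attempts_used = run_with_retries(flaky_task, retries=1)
```

Too few retries turns transient errors into failed runs, while too many (or too short a delay) can hide real problems and pile load onto a struggling system.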
| 26
● Load Data Incrementally
● Process Historic Data with Backfill operations
● Enforce Idempotency (retry safe)
● Execute Conditionally (BranchPythonOperator, ShortCircuitOperator)
● Alert if there are failures (task failures and SLA misses) (Email/Slack)
● Use Sensor Operators to determine when to Start a Task (if applicable)
● Build Validation into your Workflows
● Test as much as you can, though this takes some thought
Best Practices
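Idempotency and incremental loading work together: each run handles only its own slice of data and overwrites that slice, so a retry or backfill does not duplicate rows. A minimal sketch, assuming a partition keyed by execution date and using a plain dictionary as a stand-in for a partitioned table (both assumptions for illustration):

```python
def load_partition(target, execution_date, rows):
    """Overwrite the partition for execution_date instead of appending to it."""
    target[execution_date] = list(rows)
    return target


warehouse = {}  # Hypothetical stand-in for a partitioned table.
rows_for_day = [{"id": 1}, {"id": 2}]

load_partition(warehouse, "2018-01-01", rows_for_day)
# A retry of the same task instance writes the same partition again.
load_partition(warehouse, "2018-01-01", rows_for_day)

row_count = len(warehouse["2018-01-01"])
```

Had the load appended instead of overwriting, the retry would have left four rows in the partition.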
| 27
Scaling & High Availability
| 28
High Availability for the Scheduler
Scheduler Failover Controller: https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
| 29
● PIP Install airflow site packages on all Nodes
● Set AIRFLOW_HOME env variable before setup
● Utilize MySQL or PostgreSQL as a Metastore
● Update Web App Port
● Utilize SystemD or Upstart Scripts (https://github.com/apache/incubator-airflow/tree/master/scripts)
● Set Log Location
○ Local File System, S3 Bucket, Google Cloud Storage
● Daemon Monitoring (Nagios)
● Cloudera Manager CSD (Coming Soon)
Deployment & Management
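Several of the items above come down to settings in airflow.cfg. The sketch below parses an example fragment with Python's `configparser` to show where they live. The section and key names follow Airflow 1.x conventions as far as we know and should be checked against the installed version. All values, including the host name and credentials, are placeholders.

```python
import configparser

# Placeholder values; section and key names follow Airflow 1.x conventions
# and should be checked against the airflow.cfg of the installed version.
EXAMPLE_CFG = """
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:CHANGE_ME@db-host:5432/airflow
base_log_folder = /var/log/airflow

[webserver]
web_server_port = 8080
"""

config = configparser.RawConfigParser()
config.read_string(EXAMPLE_CFG)

executor = config.get("core", "executor")
web_port = config.getint("webserver", "web_server_port")
# A quick check that the metastore is not the default SQLite database.
uses_sqlite = config.get("core", "sql_alchemy_conn").startswith("sqlite")
```

A check like `uses_sqlite` could run as part of a deployment script to catch a node that was set up without pointing at the shared MySQL or PostgreSQL metastore.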
| 30
● Web App Authentication
○ Password
○ LDAP
○ OAuth: Google, GitHub
● Role Based Access Control (RBAC) (Coming Soon)
● Protect airflow.cfg (expose_config, read access to airflow.cfg)
● Web App SSL
● Kerberos Ticket Renewer
Security
| 31
● PyUnit - Unit Testing
● Test DAG Tasks Individually
airflow test [--subdir SUBDIR] [--dry_run] [--task_params TASK_PARAMS_JSON] dag_id task_id execution_date
● Airflow Unit Test Mode - Loads configurations from the unittests.cfg file
[tests]
unit_test_mode = true
● Always at the very least ensure that the DAG is valid (can be done as part of CI)
● Take it a step further with mock pipeline testing (with inputs and outputs), especially if your DAGs are broad
Testing
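Mock pipeline testing with known inputs and outputs can be as simple as a PyUnit (`unittest`) case around the logic a task runs. In the sketch below, `transform` is a hypothetical pipeline step invented for illustration. The point is the pattern of feeding fixed input rows and asserting on the exact output, which can run in CI without a scheduler.

```python
import unittest


def transform(rows):
    """Hypothetical pipeline step: keep active rows and upper-case names."""
    return [
        {"id": row["id"], "name": row["name"].upper()}
        for row in rows
        if row["active"]
    ]


class TransformTest(unittest.TestCase):
    def test_known_input_produces_expected_output(self):
        rows = [
            {"id": 1, "name": "alice", "active": True},
            {"id": 2, "name": "bob", "active": False},
        ]
        self.assertEqual(transform(rows), [{"id": 1, "name": "ALICE"}])


# Run the test case programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TransformTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```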
Questions?
| 32
We are hiring!
| 33
@shekharv
shekhar@clairvoyant.ai
linkedin.com/in/shekharvemuri
@rssanders3
robert@clairvoyant.ai
linkedin.com/in/robert-sanders-cs

Apache Airflow in Production

  • 3. Robert Sanders, Big Data Manager and Engineer: One of the early employees of Clairvoyant, Robert primarily works with clients to enable them along their big data journey. Robert has a deep background in web and enterprise systems, working on full-stack implementations and then focusing on data management platforms.
      Shekhar Vemuri, CTO: Shekhar works with clients across various industries and helps define data strategy and lead the implementation of data engineering and data science efforts. He was co-founder and CTO of Blue Canary, a predictive analytics solution to help with student retention; Blue Canary was acquired by Blackboard in 2015.
  • 4. About
      Background
      Boutique consulting firm centered on building data solutions and products
      All things Web and Data Engineering, Analytics, ML and User Experience to bring it all together
      Support core Hadoop platform, data engineering pipelines and provide administrative and devops expertise focused on Hadoop
      Awards & Recognition
  • 5. Currently working on building a data security solution to help enterprises discover, secure and monitor sensitive data in their environment.
  • 6. Agenda
      ● What is Apache Airflow?
          ○ Features
          ○ Architecture
      ● Use Cases
      ● Lessons Learned
      ● Best Practices
      ● Scaling & High Availability
      ● Deployment, Management & More
      ● Questions
  • 7. Genesis
      "Hey Robert, I heard about this new hotness that will solve all of our workflow scheduling and orchestration problems. I played with it for 2 hours and I am in love! Can you try it out?"
      "Must be pretty cool. I wonder how it compares to what we’re using. I’ll check it out!"
  • 9. Building Expertise vs Yak Shaving
  • 10. Why?
      ● Mostly used Cron and Oozie
      ● Did some crazy things with Java and Quartz in a past life
      ● A lot of operational support was going into debugging Oozie workloads and the issues we ran into with them
          ○ 4+ years of working with Oozie “built expertise??”
      ● Needed a scalable, open source, user friendly engine for
          ○ Internal product needs
          ○ Client engagements
          ○ Making our Ops and Support teams' lives easier
  • 12. What is Apache Airflow?
      ● “Apache Airflow is an Open Source platform to programmatically Author, Schedule and Monitor workflows”
          ○ Workflows as Python Code (this is huge!!!!!)
          ○ Provides monitoring tools like alerts and a web interface
      ● Written in Python
      ● Apache Incubator Project
          ○ Joined Apache Foundation in early 2016
          ○ https://github.com/apache/incubator-airflow/
  • 13. Why use Apache Airflow?
      ● Lightweight Workflow Platform
      ● Full blown Python scripts as DSL
      ● More flexible execution and workflow generation
      ● Feature Rich Web Interface
      ● Worker Processes can Scale Horizontally and Vertically
      ● Extensible
  • 15. Executors
      What are Executors? Executors are the mechanism by which task instances get run.
      Different Executors
      ● SequentialExecutor
      ● LocalExecutor
      ● CeleryExecutor
      ● MesosExecutor
      ● KubernetesExecutor (Coming Soon)
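As a rough illustration of what separates the first two executors, the toy below runs the same independent task callables one at a time (the idea behind SequentialExecutor) and then through a small worker pool on one machine (the idea behind LocalExecutor). It does not use Airflow at all: `make_task`, `run_sequential` and `run_local` are hypothetical names, and a thread pool stands in for the worker processes a real executor would manage.

```python
from concurrent.futures import ThreadPoolExecutor


def make_task(name):
    """Return a callable standing in for one task instance (hypothetical)."""
    def task():
        return "ran " + name
    return task


def run_sequential(tasks):
    """Run one task at a time, in order: the idea behind SequentialExecutor."""
    return {name: fn() for name, fn in tasks.items()}


def run_local(tasks, parallelism=2):
    """Run up to `parallelism` tasks at once on a single machine:
    the idea behind LocalExecutor (threads used here for simplicity)."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


tasks = {name: make_task(name) for name in ("task1", "task2", "task3")}
sequential_results = run_sequential(tasks)
local_results = run_local(tasks)
```

Both calls produce the same results; only the degree of parallelism differs. The Celery and Mesos executors push the same idea further by handing tasks to workers on other machines.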
  • 16. Single Node Deployment
  • 18. Defining a Workflow
      # Library Imports
      from airflow.models import DAG
      from airflow.operators.bash_operator import BashOperator
      from datetime import datetime, timedelta

      # Define global variables and default arguments
      START_DATE = datetime.now() - timedelta(minutes=1)
      default_args = dict(
          owner='Airflow',
          retries=1,
          retry_delay=timedelta(minutes=5),
      )

      # Define the DAG (cron expression: run daily at midnight)
      dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *', start_date=START_DATE)

      # Define the Tasks
      task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag)
      task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag)
      task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag)

      # Define the Task Relationships: task1 -> task2 -> task3
      task1.set_downstream(task2)
      task2.set_downstream(task3)
  • 19. Defining a Dynamic Workflow
      dag = DAG('dag_id', …)

      # Chain the generated tasks one after another
      last_task = None
      for i in range(1, 3):  # creates task1 and task2
          task = BashOperator(
              task_id='task' + str(i),
              bash_command="echo 'Task " + str(i) + "'",
              dag=dag)
          if last_task is not None:
              last_task.set_downstream(task)
          last_task = task
  • 20. Many More Operators
      ● Action Operators
          ○ BashOperator(bash_command)
          ○ SSHOperator(ssh_hook, ssh_conn_id, remote_host, command)
          ○ PythonOperator(python_callable=python_function)
      ● Transfer Operators
          ○ HiveToMySqlTransfer(sql, mysql_table, hiveserver2_conn_id, mysql_conn_id, mysql_preoperator, mysql_postoperator, bulk_load)
          ○ MySqlToHiveTransfer(sql, hive_table, create, recreate, partition, delimiter, mysql_conn_id, hive_cli_conn_id, tblproperties)
      ● Sensor Operators
          ○ HdfsSensor(filepath, hdfs_conn_id, ignored_ext, ignore_copying, file_size, hook)
          ○ HttpSensor(endpoint, http_conn_id, method, request_params, headers, response_check, extra_options)
  • 21. First Use Case (Description)
      ● Kogni discovers sensitive data across all enterprise data sources
      ● Need to configure scans with various schedules that work standalone or with a Spark cluster
      ● Orchestrate, execute and manage dozens of pipelines that scan and ingest data in a secure fashion
      ● Needed a tool to manage this outside of the core platform
      ● Started with exporting Oozie configuration from the core app, but conditional aspects and visibility became an issue
      ● Needed something that supported both deep DAGs and broad DAGs
  • 22. Second Use Case (Description)
      ● Daily ETL Batch Process to Ingest data into Hadoop
          ○ Extract
              ■ 1226 tables from 23 databases
          ○ Transform
              ■ Impala scripts to join and transform data
          ○ Load
              ■ Impala scripts to load data into common final tables
      ● Other requirements
          ○ Make it extensible to allow the client to import more databases and tables in the future
          ○ Status emails to be sent out after daily job to report on success and failures
      ● Solution
          ○ Create a DAG that dynamically generates the workflow based off data in a Metastore
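The "dynamically generates the workflow" idea can be sketched without Airflow as a function that turns metastore rows into task definitions and dependencies. Everything here is an assumption for illustration: the rows are hard-coded rather than queried, the task names are invented, and the extract → transform → load → email shape is only one plausible reading of the slide, not the presenters' actual DAG.

```python
# Hypothetical metastore rows: (database, table). In a real pipeline these
# would come from a query against the metastore, not a hard-coded list.
METASTORE_ROWS = [
    ("sales", "orders"),
    ("sales", "customers"),
    ("hr", "employees"),
]


def build_workflow(rows):
    """Return (tasks, edges) for an extract -> transform -> load -> notify workflow.

    tasks maps a task id to a description of its work;
    edges is a list of (upstream_id, downstream_id) pairs.
    """
    tasks = {
        "transform": "run Impala transform scripts",
        "load": "run Impala load scripts",
        "send_status_email": "email success/failure report",
    }
    edges = []
    # One extract task per table; all of them must finish before the transform.
    for database, table in rows:
        task_id = "extract_{}_{}".format(database, table)
        tasks[task_id] = "extract {}.{}".format(database, table)
        edges.append((task_id, "transform"))
    edges.append(("transform", "load"))
    edges.append(("load", "send_status_email"))
    return tasks, edges


tasks, edges = build_workflow(METASTORE_ROWS)
```

In a DAG file, each entry in `tasks` would become an operator and each pair in `edges` a `set_downstream` call, so adding a database or table to the metastore extends the workflow without touching the code.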
  • 23. Second Use Case (Architecture)
  • 24. Second Use Case (DAG): 1,000 ft view and 100,000 ft view
  • 25. Lessons Learned
      ● Support
      ● Documentation
      ● Bugs and Odd Behavior
      ● Monitor Performance with Charts
      ● Tune Retries
      ● Tune Parallelism
  • 26. Best Practices
      ● Load Data Incrementally
      ● Process Historic Data with Backfill operations
      ● Enforce Idempotency (retry safe)
      ● Execute Conditionally (BranchPythonOperator, ShortCircuitOperator)
      ● Alert if there are failures (task failures and SLA misses) (Email/Slack)
      ● Use Sensor Operators to determine when to Start a Task (if applicable)
      ● Build Validation into your Workflows
      ● Test as much as possible, though this needs some thought
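Two of these practices lend themselves to a small sketch. The first function shows one way to make an incremental load retry safe: each run writes a partition keyed by its execution date and overwrites it instead of appending. The second is the kind of callable a BranchPythonOperator expects, returning the task_id of the branch to follow. The file layout, function names and task ids are hypothetical.

```python
import os
import tempfile


def load_partition(base_dir, execution_date, rows):
    """Write one day's rows to a partition keyed by execution date.

    The file is overwritten rather than appended to, so a retry or a
    backfill of the same date leaves exactly one copy of the data.
    """
    partition_dir = os.path.join(base_dir, "ds=" + execution_date)
    os.makedirs(partition_dir, exist_ok=True)
    path = os.path.join(partition_dir, "data.csv")
    with open(path, "w") as handle:  # mode "w" truncates, which makes reruns safe
        for row in rows:
            handle.write(",".join(str(value) for value in row) + "\n")
    return path


def choose_branch(row_count):
    """Return the task_id to follow, as a branching callable would."""
    return "process_data" if row_count > 0 else "skip_processing"


base_dir = tempfile.mkdtemp()
rows = [(1, "a"), (2, "b")]
first = load_partition(base_dir, "2018-01-01", rows)
second = load_partition(base_dir, "2018-01-01", rows)  # simulated retry
with open(second) as handle:
    line_count = len(handle.readlines())
```

After the simulated retry the partition still holds two lines, not four, which is the property that makes backfills and automatic retries safe to run.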
  • 27. Scaling & High Availability
  • 28. High Availability for the Scheduler
      Scheduler Failover Controller: https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
  • 29. Deployment & Management
      ● Pip install the airflow site packages on all nodes
      ● Set the AIRFLOW_HOME env variable before setup
      ● Utilize MySQL or PostgreSQL as a Metastore
      ● Update Web App Port
      ● Utilize SystemD or Upstart Scripts (https://github.com/apache/incubator-airflow/tree/master/scripts)
      ● Set Log Location
          ○ Local File System, S3 Bucket, Google Cloud Storage
      ● Daemon Monitoring (Nagios)
      ● Cloudera Manager CSD (Coming Soon)
  • 30. Security
      ● Web App Authentication
          ○ Password
          ○ LDAP
          ○ OAuth: Google, GitHub
      ● Role Based Access Control (RBAC) (Coming Soon)
      ● Protect airflow.cfg (expose_config, read access to airflow.cfg)
      ● Web App SSL
      ● Kerberos Ticket Renewer
  • 31. Testing
      ● PyUnit - Unit Testing
      ● Test DAG Tasks Individually
          airflow test [--subdir SUBDIR] [--dry_run] [--task_params TASK_PARAMS_JSON] dag_id task_id execution_date
      ● Airflow Unit Test Mode - Loads configurations from the unittests.cfg file
          [tests]
          unit_test_mode = true
      ● At the very least, always ensure that the DAG is valid (can be done as part of CI)
      ● Take it a step further with mock pipeline testing (with inputs and outputs), especially if your DAGs are broad
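What "the DAG is valid" means at its core is that every dependency points at a known task and that the dependencies contain no cycle. In a real project the usual approach is to load the DAG files through Airflow itself (for example with `DagBag` and its `import_errors`) inside a CI test; the standalone sketch below only illustrates the underlying check, using a topological sort. The function name and the task ids are hypothetical.

```python
from collections import defaultdict, deque


def find_cycle_free_order(task_ids, edges):
    """Return a topological order of the tasks, or raise ValueError.

    edges is a list of (upstream_id, downstream_id) pairs.
    """
    downstream = defaultdict(list)
    upstream_count = {task_id: 0 for task_id in task_ids}
    for upstream_id, downstream_id in edges:
        if upstream_id not in upstream_count or downstream_id not in upstream_count:
            raise ValueError(
                "edge references unknown task: {} -> {}".format(upstream_id, downstream_id))
        downstream[upstream_id].append(downstream_id)
        upstream_count[downstream_id] += 1

    # Start from tasks with no upstream dependencies and peel them off in turn.
    ready = deque(sorted(t for t, count in upstream_count.items() if count == 0))
    order = []
    while ready:
        task_id = ready.popleft()
        order.append(task_id)
        for child in downstream[task_id]:
            upstream_count[child] -= 1
            if upstream_count[child] == 0:
                ready.append(child)

    # Any task never reached still has an unresolved upstream: that is a cycle.
    if len(order) != len(upstream_count):
        raise ValueError("workflow contains a cycle")
    return order


valid_order = find_cycle_free_order(
    ["task1", "task2", "task3"],
    [("task1", "task2"), ("task2", "task3")],
)
```

A CI job can run a check like this (or the Airflow-based equivalent) on every commit, so a broken dependency is caught before the scheduler ever sees it.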
  • 33. We are hiring!
      Shekhar Vemuri: @shekharv, shekhar@clairvoyant.ai, linkedin.com/in/shekharvemuri
      Robert Sanders: @rssanders3, robert@clairvoyant.ai, linkedin.com/in/robert-sanders-cs

Editor's Notes

  1. Workflow systems used; Big data and Hadoop; Data engineering; Cloud