Running Apache Airflow Workflows as ETL Processes on Hadoop
By: Robert Sanders
Agenda
• What is Apache Airflow?
• Features
• Architecture
• Terminology
• Operator Types
• ETL Best Practices
• How they’re supported in Apache Airflow
• Executing Airflow Workflows on Hadoop
• Use Cases
• Q&A
Robert Sanders
• Big Data Manager, Engineer, Architect, etc.
• Work for Clairvoyant LLC
• 5+ Years of Big Data Experience
• Email: robert.sanders@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/robert-sanders-61446732
• SlideShare: http://www.slideshare.net/RobertSanders49
What’s the problem?
• As a Big Data Engineer you work to create jobs that perform various operations
• Ingest data from external data sources
• Transformation of Data
• Run Predictions
• Export data
• Etc.
• You need some mechanism to schedule and run these jobs
• Cron
• Oozie
• Existing scheduling services have a number of limitations that make them difficult to work with and unusable in some situations
What is Apache Airflow?
• Airflow is an Open Source platform to programmatically author, schedule, and monitor workflows
• Workflows as Code
• Schedules Jobs through Cron Expressions
• Provides monitoring tools like alerts and a web interface
• Written in Python
• As are user-defined Workflows and Plugins
• Started in the fall of 2014 by Maxime Beauchemin at Airbnb
• Apache Incubator Project
• Joined the Apache Incubator in early 2016
• https://github.com/apache/incubator-airflow/
Why use Apache Airflow?
• Define Workflows as Code
• Makes workflows more maintainable, versionable, and testable
• More flexible execution and workflow generation
• Lots of Features
• Feature Rich Web Interface
• Worker Processes Scale Horizontally and Vertically
• Can be a cluster or single node setup
• Lightweight Workflow Platform
Apache Airflow Features (Some of them)
• Automatic Retries
• SLA monitoring/alerting
• Complex dependency rules: branching, joining, sub-workflows
• Defining ownership and versioning
• Resource Pools: limit concurrency + prioritization (retries, SLAs, and pools are all set per task; see the sketch after this list)
• Plugins
• Operators
• Executors
• New Views
• Built-in integration with other services
• Many more…
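Several of the features above are plain task arguments. A minimal sketch, assuming the Airflow 1.x-era import paths and a pool named 'hive_pool' that has already been created under Admin -> Pools:

from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('feature_demo', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# Retries, SLA alerting, and pool-based concurrency limits are all
# configured per task ('hive_pool' is an assumed pool name).
load_task = BashOperator(
    task_id='load_partition',
    bash_command="echo 'load'",
    retries=3,                        # retry automatically on failure
    retry_delay=timedelta(minutes=5),
    sla=timedelta(hours=2),           # alert if unfinished 2 hours past the schedule
    pool='hive_pool',                 # cap how many tasks draw from this pool at once
    priority_weight=10,               # scheduled ahead of lower-weight tasks in the pool
    dag=dag)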
[Slides 8-9: Apache Airflow architecture diagrams (images only)]
What is a DAG?
• Directed Acyclic Graph
• A finite directed graph that doesn’t have any cycles
• A collection of tasks to run, organized in a way that reflects their relationships and dependencies
• Defines your Workflow
What is an Operator?
• An operator describes a single task in a workflow
• Operators allow for generation of certain types of tasks that become nodes in the DAG when instantiated
• All operators derive from BaseOperator and inherit many attributes and methods that way
Workflow Operators (Sensors)
• A type of operator that keeps running until a certain criterion is met
• Periodically pokes
• Parameterized poke interval and timeout (see the sketch below)
• Examples:
• HdfsSensor
• HivePartitionSensor
• NamedHivePartitionSensor
• S3KeySensor
• WebHdfsSensor
• Many More…
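A minimal sketch of a sensor in use, assuming the Airflow 1.x-era import paths; the HDFS path is a made-up example:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.sensors import HdfsSensor

dag = DAG('sensor_demo', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# Pokes HDFS every 5 minutes until the marker file appears, and fails
# the task if it has not shown up within 6 hours.
wait_for_data = HdfsSensor(
    task_id='wait_for_data',
    filepath='/data/incoming/events/_SUCCESS',  # placeholder path
    poke_interval=60 * 5,     # seconds between pokes
    timeout=60 * 60 * 6,      # give up after 6 hours
    dag=dag)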
Workflow Operators (Transfer)
• An operator that moves data from one system to another
• Data is pulled from the source system, staged on the machine where the executor is running, and then transferred to the target system (see the sketch below)
• Examples:
• HiveToMySqlTransfer
• MySqlToHiveTransfer
• HiveToMsSqlTransfer
• MsSqlToHiveTransfer
• S3ToHiveTransfer
• Many More…
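A sketch of a transfer operator, again assuming 1.x-era import paths; the connection IDs and table names are placeholders:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.mysql_to_hive import MySqlToHiveTransfer

dag = DAG('transfer_demo', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# Runs the query against MySQL, stages the results on the worker,
# then loads them into the Hive table.
orders_to_hive = MySqlToHiveTransfer(
    task_id='orders_to_hive',
    sql='SELECT * FROM orders',
    hive_table='staging.orders',
    mysql_conn_id='mysql_default',        # assumed connection IDs
    hive_cli_conn_id='hive_cli_default',
    recreate=True,                        # drop and recreate the table each run
    dag=dag)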
Defining a DAG
from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2016, 1, 1),  # a start_date is required; value is illustrative
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *')

# Define the Tasks
task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag)
task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag)
task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag)

# Define the task relationships
task1.set_downstream(task2)
task2.set_downstream(task3)

Resulting graph: task1 -> task2 -> task3
Defining a DAG (Dynamically)
dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *')

# Generate the tasks in a loop, chaining each one to the previous
last_task = None
for i in range(1, 4):  # tasks 1 through 3, matching the pictured graph
    task = BashOperator(
        task_id='task' + str(i),
        bash_command="echo 'Task " + str(i) + "'",
        dag=dag)
    if last_task is not None:
        last_task.set_downstream(task)
    last_task = task

Resulting graph: task1 -> task2 -> task3
ETL Best Practices (Some of Them)
• Load Data Incrementally
• Operators receive an execution_date entry which you can use to pull in only the data for that period (see the sketch after this list)
• Process Historic Data
• Backfill operations are supported
• Enforce Idempotency (retry safe)
• Execute Conditionally
• Branching, Joining
• Understand SLAs and Alerts
• Alert on Failures
• Sense when to start a task
• Sensor Operators
• Build Validation into your Workflows
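A sketch of an incremental, idempotent load using the templated execution date; the sqoop command, connection string, and paths are made-up examples:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('incremental_demo', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# {{ ds }} renders to the run's execution date (YYYY-MM-DD), so each
# run pulls only its own slice of data. Re-running the task for the
# same date pulls the same slice again, which keeps the task
# idempotent and makes backfills safe.
extract = BashOperator(
    task_id='extract_orders',
    bash_command=(
        'sqoop import --connect jdbc:mysql://db.example.com/shop '
        '--table orders --where "updated_at >= \'{{ ds }}\'" '
        '--target-dir /data/raw/orders/{{ ds }} --delete-target-dir'),
    dag=dag)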
Executing Airflow Workflows on Hadoop
• Airflow Workers should be installed on edge/gateway nodes
• Allows Airflow to interact with Hadoop related commands
• Utilize the BashOperator to run command line functions and interact with Hadoop services
• Put all necessary scripts and Jars in HDFS and pull the files down from HDFS during the execution of the script (see the sketch after this list)
• Avoids requiring you to keep copies of the scripts on every machine where the executors are running
• Support for Kerberized Clusters
• Airflow can renew Kerberos tickets for itself (the airflow kerberos renewer process) and store them in the ticket cache
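A sketch of the pull-scripts-from-HDFS pattern using the BashOperator; all paths and script names below are placeholders:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('hadoop_demo', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# Fetch the canonical copy of the script from HDFS at run time, so no
# worker needs a locally deployed copy.
run_job = BashOperator(
    task_id='run_transform',
    bash_command=(
        'TMP_DIR=$(mktemp -d) && cd $TMP_DIR && '
        'hdfs dfs -get /apps/etl/scripts/transform.sh . && '
        'bash transform.sh && '
        'cd / && rm -rf $TMP_DIR'),
    dag=dag)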
Use Case (BPV)
• Daily ETL Batch Process to Ingest data into Hadoop
• Extract
• 23 databases total
• 1226 tables total
• Transform
• Impala scripts to join and transform data
• Load
• Impala scripts to load data into common final tables
• Other requirements
• Make it extensible so the client can import more databases and tables in the future
• Status emails sent out after the daily job to report successes and failures
• Solution
• Create a DAG that dynamically generates the workflow based on data in a Metastore (rough sketch below)
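One way such a metastore-driven DAG can be structured; the metadata table, columns, and connection IDs below are hypothetical, not the actual BPV implementation:

from datetime import datetime

from airflow.hooks.mysql_hook import MySqlHook
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('bpv_ingest', start_date=datetime(2016, 1, 1),
          schedule_interval='0 0 * * *')

# Hypothetical metadata table listing the source tables to ingest.
# This query runs when the scheduler parses the DAG file, so adding a
# row to the table adds a task to the workflow with no code change.
records = MySqlHook(mysql_conn_id='metastore_db').get_records(
    'SELECT db_name, table_name FROM ingest_config WHERE enabled = 1')

for db_name, table_name in records:
    BashOperator(
        task_id='ingest_%s_%s' % (db_name, table_name),
        bash_command=('sqoop import --connect jdbc:mysql://db.example.com/%s '
                      '--table %s' % (db_name, table_name)),
        dag=dag)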
Use Case (BPV) (Architecture)
[Architecture diagram (image only)]
Use Case (BPV) (DAG)
[DAG screenshots: 100-foot view and 10,000-foot view]
Use Case (Kogni)
• New Product being built by Clairvoyant, consisting of:
• kogni-inspector – Sensitive Data Analyzer
• kogni-ingestor – Ingests Data
• kogni-guardian – Sensitive Data Masking (Encrypt and Tokenize)
• Other components coming soon
• Utilizes Airflow for Data Ingestion and Masking
• Dynamically creates a workflow based on what is in the Metastore
• Learn More: http://kogni.io/
Use Case (Kogni) (Architecture)
[Architecture diagram (image only)]
References
• https://pythonhosted.org/airflow/
• https://gtoonstra.github.io/etl-with-airflow/principles.html
• https://github.com/apache/incubator-airflow
• https://media.readthedocs.org/pdf/airflow/latest/airflow.pdf
Q&A