SlideShare a Scribd company logo
1 of 24
Running
Apache Airflow
Workflows as ETL
Processes on Hadoop
By: Robert Sanders
2Page:
Agenda
• What is Apache Airflow?
• Features
• Architecture
• Terminology
• Operator Types
• ETL Best Practices
• How they’re supported in Apache Airflow
• Executing Airflow Workflows on Hadoop
• Use Cases
• Q&A
3Page:
Robert Sanders
• Big Data Manager, Engineer, Architect, etc.
• Work for Clairvoyant LLC
• 5+ Years of Big Data Experience
• Email: robert.sanders@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/robert-sanders-
61446732
• Slide Share: http://www.slideshare.net/RobertSanders49
4Page:
What’s the problem?
• As a Big Data Engineer you work to create jobs that will
perform various operations
• Ingest data from external data sources
• Transformation of Data
• Run Predictions
• Export data
• Etc.
• You need to have some mechanism to schedule and run
these jobs
• Cron
• Oozie
• Existing Scheduling Services have a number of limitations
that make them difficult to work with and not usable in all
instances
5Page:
What is Apache Airflow?
• Airflow is an Open Source platform to programmatically
author, schedule and monitor workflows
• Workflows as Code
• Schedules Jobs through Cron Expressions
• Provides monitoring tools like alerts and a web interface
• Written in Python
• As well as user defined Workflows and Plugins
• Was started in the fall of 2014 by Maxime Beauchemin at
Airbnb
• Apache Incubator Project
• Joined Apache Foundation in early 2016
• https://github.com/apache/incubator-airflow/
6Page:
Why use Apache Airflow?
• Define Workflows as Code
• Makes workflows more maintainable, versionable, and
testable
• More flexible execution and workflow generation
• Lots of Features
• Feature Rich Web Interface
• Worker Processes Scale Horizontally and Vertically
• Can be a cluster or single node setup
• Lightweight Workflow Platform
7Page:
Apache Airflow Features (Some of them)
• Automatic Retries
• SLA monitoring/alerting
• Complex dependency rules: branching, joining, sub-
workflows
• Defining ownership and versioning
• Resource Pools: limit concurrency + prioritization
• Plugins
• Operators
• Executors
• New Views
• Built-in integration with other services
• Many more…
8Page:
9Page:
10Page:
What is a DAG?
• Directed Acyclic Graph
• A finite directed graph that doesn’t have any cycles
• A collection of tasks to run, organized in a way that reflects
their relationships and dependencies
• Defines your Workflow
11Page:
What is an Operator?
• An operator describes a single task in a workflow
• Operators allow for generation of certain types of tasks that
become nodes in the DAG when instantiated
• All operators derive from BaseOperator and inherit many
attributes and methods that way
12Page:
Workflow Operators (Sensors)
• A type of operator that keeps running until a certain criteria
is met
• Periodically pokes
• Parameterized poke interval and timeout
• Example
• HdfsSensor
• HivePartitionSensor
• NamedHivePartitionSensor
• S3KeyPartition
• WebHdfsSensor
• Many More…
13Page:
Workflow Operators (Transfer)
• Operator that moves data from one system to another
• Data will be pulled from the source system, staged on the
machine where the executor is running and then transferred
to the target system
• Example:
• HiveToMySqlTransfer
• MySqlToHiveTransfer
• HiveToMsSqlTransfer
• MsSqlToHiveTransfer
• S3ToHiveTransfer
• Many More…
14Page:
Defining a DAG
from airflow.models import DAG
from airflow.operators import …
from datetime import datetime, timedelta
default_args = dict(
'owner'='Airflow’,
'retries': 1,
'retry_delay': timedelta(minutes=5),
)
# Define the DAG
dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *')
# Define the Tasks
task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag)
task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag)
task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag)
# Define the task relationships
task1.set_downstream(task2)
task2.set_downstream(task3)
task1 task2 task3
15Page:
Defining a DAG (Dynamically)
dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *')
last_task = None
for i in range(1, 3):
task = BashOperator(
task_id='task' + str(i),
bash_command="echo 'Task" + str(i) + "'",
dag=dag)
if last_task is None:
last_task = task
else:
last_task.set_downstream(task)
last_task = task
task1 task2 task3
16Page:
ETL Best Practices (Some of Them)
• Load Data Incrementally
• Operators will receive an execution_date entry which you can
use to pull in data since that date
• Process historic Data
• Backfill operations are supported
• Enforce Idempotency (retry safe)
• Execute Conditionally
• Branching, Joining
• Understand SLA’s and Alerts
• Alert if Failures
• Sense when to start a task
• Sensor Operators
• Build Validation into your Workflows
17Page:
Executing Airflow Workflows on Hadoop
• Airflow Workers should be installed on a edge/gateway
nodes
• Allows Airflow to interact with Hadoop related commands
• Utilize the BashOperator to run command line functions
and interact with Hadoop services
• Put all necessary scripts and Jars in HDFS and pull the files
down from HDFS during the execution of the script
• Avoids requiring you to keep copies of the scripts on
every machine where the executors are running
• Support for Kerborized Clusters
• Airflow can renew Kerberos tickets for itself and store it
in the ticket cache
18Page:
Use Case (BPV)
• Daily ETL Batch Process to Ingest data into Hadoop
• Extract
• 23 databases total
• 1226 tables total
• Transform
• Impala scripts to join and transform data
• Load
• Impala scripts to load data into common final tables
• Other requirements
• Make it extensible to allow the client to import more databases and
tables in the future
• Status emails to be sent out after daily job to report on success and
failures
• Solution
• Create a DAG that dynamically generates the workflow based off data
in a Metastore
19Page:
Use Case (BPV) (Architecture)
20Page:
Use Case (BPV) (DAG)
100 foot view 10,000 foot view
21Page:
Use Case (Kogni)
• New Product being built by Clairvoyant to facilitate:
• kogni-inspector – Sensitive Data Analyzer
• kogni-ingestor – Ingests Data
• kogni-guardian – Sensitive Data Masking (Encrypt and
Tokenize)
• Others components coming soon
• Utilizes Airflow for Data Ingestion and Masking
• Dynamically creates a workflow based off what is in the
Metastore
• Learn More: http://kogni.io/
22Page:
Use Case (Kogni) (Architecture)
23Page:
References
• https://pythonhosted.org/airflow/
• https://gtoonstra.github.io/etl-with-airflow/principles.html
• https://github.com/apache/incubator-airflow
• https://media.readthedocs.org/pdf/airflow/latest/airflow.pdf
Q&A

More Related Content

What's hot

Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow ArchitectureGerard Toonstra
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow IntroductionLiangjun Jiang
 
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and IstioAdvanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and IstioAnimesh Singh
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Kaxil Naik
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itBruno Faria
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Jacob Marble [InfluxData] | Observability with InfluxDB IOx and OpenTelemetry...
Jacob Marble [InfluxData] | Observability with InfluxDB IOx and OpenTelemetry...Jacob Marble [InfluxData] | Observability with InfluxDB IOx and OpenTelemetry...
Jacob Marble [InfluxData] | Observability with InfluxDB IOx and OpenTelemetry...InfluxData
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebookAniket Mokashi
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowYohei Onishi
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High AvailabilityRobert Sanders
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_onpko89403
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSDerrick Qin
 
E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyRikin Tanna
 
Monitoring Flink with Prometheus
Monitoring Flink with PrometheusMonitoring Flink with Prometheus
Monitoring Flink with PrometheusMaximilian Bode
 

What's hot (20)

Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
 
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and IstioAdvanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
Advanced Model Inferencing leveraging Kubeflow Serving, KNative and Istio
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Jacob Marble [InfluxData] | Observability with InfluxDB IOx and OpenTelemetry...
Jacob Marble [InfluxData] | Observability with InfluxDB IOx and OpenTelemetry...Jacob Marble [InfluxData] | Observability with InfluxDB IOx and OpenTelemetry...
Jacob Marble [InfluxData] | Observability with InfluxDB IOx and OpenTelemetry...
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High Availability
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_on
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWS
 
E2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/LivyE2E Data Pipeline - Apache Spark/Airflow/Livy
E2E Data Pipeline - Apache Spark/Airflow/Livy
 
Monitoring Flink with Prometheus
Monitoring Flink with PrometheusMonitoring Flink with Prometheus
Monitoring Flink with Prometheus
 

Viewers also liked

How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowPyData
 
Fernando Gandia - Airflow - PyData Mallorca 18-10-2016
Fernando Gandia - Airflow - PyData Mallorca 18-10-2016Fernando Gandia - Airflow - PyData Mallorca 18-10-2016
Fernando Gandia - Airflow - PyData Mallorca 18-10-2016Fernando Gandia
 
Building Robust Pipelines with Airflow | Wrangle Conference 2017
Building Robust Pipelines with Airflow | Wrangle Conference 2017Building Robust Pipelines with Airflow | Wrangle Conference 2017
Building Robust Pipelines with Airflow | Wrangle Conference 2017Cloudera, Inc.
 
Building Robust Pipelines with Airflow
Building Robust Pipelines with AirflowBuilding Robust Pipelines with Airflow
Building Robust Pipelines with AirflowErin Shellman
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Sid Anand
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesDataWorks Summit
 

Viewers also liked (6)

How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Fernando Gandia - Airflow - PyData Mallorca 18-10-2016
Fernando Gandia - Airflow - PyData Mallorca 18-10-2016Fernando Gandia - Airflow - PyData Mallorca 18-10-2016
Fernando Gandia - Airflow - PyData Mallorca 18-10-2016
 
Building Robust Pipelines with Airflow | Wrangle Conference 2017
Building Robust Pipelines with Airflow | Wrangle Conference 2017Building Robust Pipelines with Airflow | Wrangle Conference 2017
Building Robust Pipelines with Airflow | Wrangle Conference 2017
 
Building Robust Pipelines with Airflow
Building Robust Pipelines with AirflowBuilding Robust Pipelines with Airflow
Building Robust Pipelines with Airflow
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
 

Similar to Running Airflow Workflows as ETL Processes on Hadoop

Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructureharendra_pathak
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesDataWorks Summit
 
Tokyo Azure Meetup #7 - Introduction to Serverless Architectures with Azure F...
Tokyo Azure Meetup #7 - Introduction to Serverless Architectures with Azure F...Tokyo Azure Meetup #7 - Introduction to Serverless Architectures with Azure F...
Tokyo Azure Meetup #7 - Introduction to Serverless Architectures with Azure F...Tokyo Azure Meetup
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Bolke de Bruin
 
Going Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS GlueGoing Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS GlueMichael Rainey
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Cask Data
 
Adding Support for Networking and Web Technologies to an Embedded System
Adding Support for Networking and Web Technologies to an Embedded SystemAdding Support for Networking and Web Technologies to an Embedded System
Adding Support for Networking and Web Technologies to an Embedded SystemJohn Efstathiades
 
Deep Dive into Apache Apex App Development
Deep Dive into Apache Apex App DevelopmentDeep Dive into Apache Apex App Development
Deep Dive into Apache Apex App DevelopmentApache Apex
 
Orchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failuresOrchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failuresDocker, Inc.
 
Play Framework and Activator
Play Framework and ActivatorPlay Framework and Activator
Play Framework and ActivatorKevin Webber
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1sqlserver.co.il
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Cloudera, Inc.
 
SQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The MoveSQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The MoveIBM Cloud Data Services
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?DataWorks Summit
 
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiaoadaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiaolyvanlinh519
 
OrigoDB - take the red pill
OrigoDB - take the red pillOrigoDB - take the red pill
OrigoDB - take the red pillRobert Friberg
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
Utilizing the open ntf domino api
Utilizing the open ntf domino apiUtilizing the open ntf domino api
Utilizing the open ntf domino apiOliver Busse
 

Similar to Running Airflow Workflows as ETL Processes on Hadoop (20)

Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
Tokyo Azure Meetup #7 - Introduction to Serverless Architectures with Azure F...
Tokyo Azure Meetup #7 - Introduction to Serverless Architectures with Azure F...Tokyo Azure Meetup #7 - Introduction to Serverless Architectures with Azure F...
Tokyo Azure Meetup #7 - Introduction to Serverless Architectures with Azure F...
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
Going Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS GlueGoing Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS Glue
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
GoDocker presentation
GoDocker presentationGoDocker presentation
GoDocker presentation
 
Adding Support for Networking and Web Technologies to an Embedded System
Adding Support for Networking and Web Technologies to an Embedded SystemAdding Support for Networking and Web Technologies to an Embedded System
Adding Support for Networking and Web Technologies to an Embedded System
 
Deep Dive into Apache Apex App Development
Deep Dive into Apache Apex App DevelopmentDeep Dive into Apache Apex App Development
Deep Dive into Apache Apex App Development
 
Orchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failuresOrchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failures
 
Play Framework and Activator
Play Framework and ActivatorPlay Framework and Activator
Play Framework and Activator
 
SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1SQL Explore 2012: P&T Part 1
SQL Explore 2012: P&T Part 1
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
SQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The MoveSQL To NoSQL - Top 6 Questions Before Making The Move
SQL To NoSQL - Top 6 Questions Before Making The Move
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?
 
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiaoadaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
adaidoadaoap9dapdadadjoadjoajdoiajodiaoiao
 
OrigoDB - take the red pill
OrigoDB - take the red pillOrigoDB - take the red pill
OrigoDB - take the red pill
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Utilizing the open ntf domino api
Utilizing the open ntf domino apiUtilizing the open ntf domino api
Utilizing the open ntf domino api
 

More from clairvoyantllc

Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016clairvoyantllc
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014clairvoyantllc
 
Architecture - December 2013 - Avinash Ramineni, Shekhar Veumuri
Architecture   - December 2013 - Avinash Ramineni, Shekhar VeumuriArchitecture   - December 2013 - Avinash Ramineni, Shekhar Veumuri
Architecture - December 2013 - Avinash Ramineni, Shekhar Veumuriclairvoyantllc
 
Big data in the cloud - Shekhar Vemuri
Big data in the cloud - Shekhar VemuriBig data in the cloud - Shekhar Vemuri
Big data in the cloud - Shekhar Vemuriclairvoyantllc
 
Webservices Workshop - september 2014
Webservices Workshop -  september 2014Webservices Workshop -  september 2014
Webservices Workshop - september 2014clairvoyantllc
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 
Databricks Community Cloud
Databricks Community CloudDatabricks Community Cloud
Databricks Community Cloudclairvoyantllc
 
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014clairvoyantllc
 
Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013clairvoyantllc
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineniclairvoyantllc
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015clairvoyantllc
 

More from clairvoyantllc (12)

Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
 
MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014MongoDB Replication fundamentals - Desert Code Camp - October 2014
MongoDB Replication fundamentals - Desert Code Camp - October 2014
 
Architecture - December 2013 - Avinash Ramineni, Shekhar Veumuri
Architecture   - December 2013 - Avinash Ramineni, Shekhar VeumuriArchitecture   - December 2013 - Avinash Ramineni, Shekhar Veumuri
Architecture - December 2013 - Avinash Ramineni, Shekhar Veumuri
 
Big data in the cloud - Shekhar Vemuri
Big data in the cloud - Shekhar VemuriBig data in the cloud - Shekhar Vemuri
Big data in the cloud - Shekhar Vemuri
 
Webservices Workshop - september 2014
Webservices Workshop -  september 2014Webservices Workshop -  september 2014
Webservices Workshop - september 2014
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Databricks Community Cloud
Databricks Community CloudDatabricks Community Cloud
Databricks Community Cloud
 
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
Log analysis using Logstash,ElasticSearch and Kibana - Desert Code Camp 2014
 
Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013Event Driven Architectures - Phoenix Java Users Group 2013
Event Driven Architectures - Phoenix Java Users Group 2013
 
Strata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash RamineniStrata+Hadoop World NY 2016 - Avinash Ramineni
Strata+Hadoop World NY 2016 - Avinash Ramineni
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
 

Recently uploaded

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Recently uploaded (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Running Airflow Workflows as ETL Processes on Hadoop

  • 1. Running Apache Airflow Workflows as ETL Processes on Hadoop By: Robert Sanders
  • 2. 2Page: Agenda • What is Apache Airflow? • Features • Architecture • Terminology • Operator Types • ETL Best Practices • How they’re supported in Apache Airflow • Executing Airflow Workflows on Hadoop • Use Cases • Q&A
  • 3. 3Page: Robert Sanders • Big Data Manager, Engineer, Architect, etc. • Work for Clairvoyant LLC • 5+ Years of Big Data Experience • Email: robert.sanders@clairvoyantsoft.com • LinkedIn: https://www.linkedin.com/in/robert-sanders- 61446732 • Slide Share: http://www.slideshare.net/RobertSanders49
  • 4. 4Page: What’s the problem? • As a Big Data Engineer you work to create jobs that will perform various operations • Ingest data from external data sources • Transformation of Data • Run Predictions • Export data • Etc. • You need to have some mechanism to schedule and run these jobs • Cron • Oozie • Existing Scheduling Services have a number of limitations that make them difficult to work with and not usable in all instances
  • 5. 5Page: What is Apache Airflow? • Airflow is an Open Source platform to programmatically author, schedule and monitor workflows • Workflows as Code • Schedules Jobs through Cron Expressions • Provides monitoring tools like alerts and a web interface • Written in Python • As well as user defined Workflows and Plugins • Was started in the fall of 2014 by Maxime Beauchemin at Airbnb • Apache Incubator Project • Joined Apache Foundation in early 2016 • https://github.com/apache/incubator-airflow/
  • 6. 6Page: Why use Apache Airflow? • Define Workflows as Code • Makes workflows more maintainable, versionable, and testable • More flexible execution and workflow generation • Lots of Features • Feature Rich Web Interface • Worker Processes Scale Horizontally and Vertically • Can be a cluster or single node setup • Lightweight Workflow Platform
  • 7. 7Page: Apache Airflow Features (Some of them) • Automatic Retries • SLA monitoring/alerting • Complex dependency rules: branching, joining, sub- workflows • Defining ownership and versioning • Resource Pools: limit concurrency + prioritization • Plugins • Operators • Executors • New Views • Built-in integration with other services • Many more…
  • 10. 10Page: What is a DAG? • Directed Acyclic Graph • A finite directed graph that doesn’t have any cycles • A collection of tasks to run, organized in a way that reflects their relationships and dependencies • Defines your Workflow
  • 11. 11Page: What is an Operator? • An operator describes a single task in a workflow • Operators allow for generation of certain types of tasks that become nodes in the DAG when instantiated • All operators derive from BaseOperator and inherit many attributes and methods that way
  • 12. 12Page: Workflow Operators (Sensors) • A type of operator that keeps running until a certain criteria is met • Periodically pokes • Parameterized poke interval and timeout • Example • HdfsSensor • HivePartitionSensor • NamedHivePartitionSensor • S3KeyPartition • WebHdfsSensor • Many More…
  • 13. 13Page: Workflow Operators (Transfer) • Operator that moves data from one system to another • Data will be pulled from the source system, staged on the machine where the executor is running and then transferred to the target system • Example: • HiveToMySqlTransfer • MySqlToHiveTransfer • HiveToMsSqlTransfer • MsSqlToHiveTransfer • S3ToHiveTransfer • Many More…
  • 14. 14Page: Defining a DAG from airflow.models import DAG from airflow.operators import … from datetime import datetime, timedelta default_args = dict( 'owner'='Airflow’, 'retries': 1, 'retry_delay': timedelta(minutes=5), ) # Define the DAG dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *') # Define the Tasks task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag) task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag) task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag) # Define the task relationships task1.set_downstream(task2) task2.set_downstream(task3) task1 task2 task3
  • 15. 15Page: Defining a DAG (Dynamically) dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *') last_task = None for i in range(1, 3): task = BashOperator( task_id='task' + str(i), bash_command="echo 'Task" + str(i) + "'", dag=dag) if last_task is None: last_task = task else: last_task.set_downstream(task) last_task = task task1 task2 task3
  • 16. 16Page: ETL Best Practices (Some of Them) • Load Data Incrementally • Operators will receive an execution_date entry which you can use to pull in data since that date • Process historic Data • Backfill operations are supported • Enforce Idempotency (retry safe) • Execute Conditionally • Branching, Joining • Understand SLA’s and Alerts • Alert if Failures • Sense when to start a task • Sensor Operators • Build Validation into your Workflows
  • 17. 17Page: Executing Airflow Workflows on Hadoop • Airflow Workers should be installed on a edge/gateway nodes • Allows Airflow to interact with Hadoop related commands • Utilize the BashOperator to run command line functions and interact with Hadoop services • Put all necessary scripts and Jars in HDFS and pull the files down from HDFS during the execution of the script • Avoids requiring you to keep copies of the scripts on every machine where the executors are running • Support for Kerborized Clusters • Airflow can renew Kerberos tickets for itself and store it in the ticket cache
  • 18. 18Page: Use Case (BPV) • Daily ETL Batch Process to Ingest data into Hadoop • Extract • 23 databases total • 1226 tables total • Transform • Impala scripts to join and transform data • Load • Impala scripts to load data into common final tables • Other requirements • Make it extensible to allow the client to import more databases and tables in the future • Status emails to be sent out after daily job to report on success and failures • Solution • Create a DAG that dynamically generates the workflow based off data in a Metastore
  • 19. 19Page: Use Case (BPV) (Architecture)
  • 20. 20Page: Use Case (BPV) (DAG) 100 foot view 10,000 foot view
  • 21. 21Page: Use Case (Kogni) • New Product being built by Clairvoyant to facilitate: • kogni-inspector – Sensitive Data Analyzer • kogni-ingestor – Ingests Data • kogni-guardian – Sensitive Data Masking (Encrypt and Tokenize) • Others components coming soon • Utilizes Airflow for Data Ingestion and Masking • Dynamically creates a workflow based off what is in the Metastore • Learn More: http://kogni.io/
  • 22. 22Page: Use Case (Kogni) (Architecture)
  • 23. 23Page: References • https://pythonhosted.org/airflow/ • https://gtoonstra.github.io/etl-with-airflow/principles.html • https://github.com/apache/incubator-airflow • https://media.readthedocs.org/pdf/airflow/latest/airflow.pdf
  • 24. Q&A