Presented at the 2017 Pune Data Conference (punedataconference.com)
Workflows, and the scheduling and reliable execution of those workflows, are very important in the world of data. As engineers we need to make sure that data is available when it needs to be, so that our customers and staff can gain the insights they need from it. Not having the data around can mean missed opportunities to capitalize on trends. In the Hadoop world there are many tools that help you accomplish this, including the popular Apache Oozie service and others such as Azkaban and Talend. Over the course of using these tools we've noticed aspects of them that make them difficult to work with, such as a lack of features and flexibility. While exploring alternatives we found Apache Airflow.
Apache Airflow is a platform to programmatically author, schedule and monitor workflows. We at Clairvoyant have used this tool on a number of projects to dynamically and reliably build workflows, which utilize many Hadoop services. This includes running Sqoop, Spark, Hive, Impala and many other jobs.
In this talk, I'll cover how we've used the tool in various use cases across a number of different clients. In addition, we'll go over the feature set and talk about why such a tool is superior to some of the more traditional workflow services like Oozie. Some of these reasons include its flexibility and how well it integrates with Hadoop services.
2. Agenda
• What is Apache Airflow?
• Features
• Architecture
• Terminology
• Operators
• ETL Best Practices
• How they’re supported in Apache Airflow
• Executing Airflow Workflows on Hadoop
• Examples
• Kerberized Cluster
• Use Cases
• Q&A
3. Robert Sanders
• Big Data Manager, Engineer, Architect, etc.
• Work for Clairvoyant LLC
• 5+ Years of Big Data Experience
• Email: robert.sanders@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/robert-sanders-61446732
• Slide Share: http://www.slideshare.net/RobertSanders49
6. What’s the problem?
• As a Big Data Engineer you work to create jobs that will
perform various operations
• Ingest data from external data sources
• Transformation of Data
• Run Predictions
• Export data
• etc.
• You need to have some mechanism to schedule and run
these jobs
• Cron
• Oozie
• Existing Scheduling Services have a number of limitations
that make them difficult to work with and not usable in all
instances
7. What is Apache Airflow?
• Airflow is an Open Source platform to programmatically
author, schedule and monitor workflows
• Workflows as Code
• Schedules Jobs through Cron Expressions
• Provides monitoring tools like alerts and a web interface
• Written in Python
• As well as user defined Workflows and Plugins
• Was started in the fall of 2014 by Maxime Beauchemin at
Airbnb
• Apache Incubator Project
• Joined Apache Foundation in early 2016
• https://github.com/apache/incubator-airflow/
• Latest Version of Airflow: v1.8.0
8. Why use Apache Airflow?
• Lightweight Workflow Platform
• Define Workflows as Code
• Makes workflows more maintainable, versionable, and
testable
• More flexible execution and workflow generation
• Lots of Features
• Automatic Retries
• SLA monitoring/alerting
• Complex dependency rules: branching, joining, sub-
workflows
• Plugins
• Built-in integration with other services
• Many more…
• Feature Rich Web Interface
• Worker Processes can Scale Horizontally and Vertically
• Can be a cluster or single node setup
11. What is a DAG?
• Directed Acyclic Graph
• A finite directed graph that doesn’t have any cycles
• A collection of tasks to run, organized in a way that reflects
their relationships and dependencies
• Defines your Workflow
12. What is an Operator?
• An operator describes a single task in a workflow
• Operators allow for generation of certain types of tasks that
become nodes in the DAG when instantiated
• All operators are derived from airflow.models.BaseOperator
and inherit all its attributes and methods
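For example (a minimal sketch, not from the original slides; the operator name and message are hypothetical), a custom operator only needs to subclass BaseOperator and implement an execute() method:
# Hypothetical custom operator that logs a message when its task runs
import logging

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class PrintMessageOperator(BaseOperator):

    @apply_defaults
    def __init__(self, message, *args, **kwargs):
        super(PrintMessageOperator, self).__init__(*args, **kwargs)
        self.message = message

    def execute(self, context):
        # Called by the worker when the task instance is executed
        logging.info(self.message)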
13. Workflow Operators (Sensors)
• A type of operator that keeps running until a certain
condition is met or it times out
• Parameterized poke interval and timeout (see the sketch after this list)
• Example
• HdfsSensor
• HivePartitionSensor
• NamedHivePartitionSensor
• S3KeySensor
• WebHdfsSensor
• Many More…
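For example (a minimal sketch, not from the slides; the HDFS path is a placeholder and a dag object is assumed to already exist), an HdfsSensor can hold downstream tasks until a file lands:
# Wait for a file to appear in HDFS, checking every 5 minutes for up to 2 hours
from airflow.operators.sensors import HdfsSensor

wait_for_input = HdfsSensor(
    task_id='wait_for_input_data',
    filepath='/data/incoming/_SUCCESS',
    poke_interval=5 * 60,   # seconds between checks
    timeout=2 * 60 * 60,    # fail the task after this many seconds
    dag=dag)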
14. Workflow Operators (Transfer)
• Operator that moves data from one system to another
• Data will be pulled from the source system, staged on the
machine where the executor is running and then transferred
to the target system
• Example:
• HiveToMySqlTransfer
• MySqlToHiveTransfer
• S3ToHiveTransfer
• Many More…
• WARNING: Avoid using these if you’re dealing with large
volumes of data
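For example (a minimal sketch, not from the slides; the query, table names and connection IDs are placeholders and a dag object is assumed), a MySqlToHiveTransfer can copy the result of a MySQL query into a Hive table:
# Copy the result of a MySQL query into a Hive table (small data volumes only)
from airflow.operators import MySqlToHiveTransfer

orders_to_hive = MySqlToHiveTransfer(
    task_id='orders_to_hive',
    sql='SELECT * FROM orders',
    hive_table='staging.orders',
    mysql_conn_id='mysql_default',
    hive_cli_conn_id='hive_cli_default',
    dag=dag)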
15. Defining a DAG
# Library Imports
from airflow.models import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta

# Define global variables and default arguments
START_DATE = datetime.now() - timedelta(minutes=1)
default_args = {
    'owner': 'Airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG('dag_id', default_args=default_args, schedule_interval='0 0 * * *', start_date=START_DATE)

# Define the Tasks
task1 = BashOperator(task_id='task1', bash_command="echo 'Task 1'", dag=dag)
task2 = BashOperator(task_id='task2', bash_command="echo 'Task 2'", dag=dag)
task3 = BashOperator(task_id='task3', bash_command="echo 'Task 3'", dag=dag)

# Define the Task Relationships
task1.set_downstream(task2)
task2.set_downstream(task3)
[Resulting DAG: task1 → task2 → task3]
16. Defining a DAG (Dynamically)
dag = DAG('dag_id', …)

last_task = None
for i in range(1, 4):
    task = BashOperator(
        task_id='task' + str(i),
        bash_command="echo 'Task " + str(i) + "'",
        dag=dag)
    # Chain each generated task after the previous one
    if last_task is None:
        last_task = task
    else:
        last_task.set_downstream(task)
        last_task = task
[Resulting DAG: task1 → task2 → task3]
17. ETL Best Practices (Some of Them)
• Load Data Incrementally
• Operators receive an execution_date entry in the context, which you can
use to pull in only the data from that date forward (see the sketch after this list)
• Process Historic Data
• Backfill operations are supported
• Enforce Idempotency (retry safe)
• Execute Conditionally
• Branching, Joining
• Understand SLAs and Alerts
• Alert if there are failures (task failures and SLA misses)
• Sense when to Start a Task
• Sensor Operators
• Build Validation into your Workflows
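For instance (a minimal sketch, not from the slides; the ingest script and its flags are placeholder assumptions, and a dag object and BashOperator import are assumed), the execution date can be passed to a job through Airflow's built-in Jinja macros so each run only loads its own window of data:
# Load only the data for this run's execution window using the {{ ds }} / {{ tomorrow_ds }} macros
incremental_load = BashOperator(
    task_id='incremental_load',
    bash_command='/path/to/ingest.sh --start-date {{ ds }} --end-date {{ tomorrow_ds }}',
    dag=dag)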
18. Executing Airflow Workflows on Hadoop
• Airflow Workers should be installed on edge/gateway nodes
• Allows Airflow to interact with Hadoop related commands
• Utilize the airflow.operators.BashOperator to run
command line functions and interact with Hadoop
services (see the sketch after this list)
• Put all necessary scripts and Jars in HDFS and pull the files
down from HDFS during the execution of the script
• Avoids requiring you to keep copies of the scripts on
every machine where the executors are running
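For instance (a minimal sketch, not from the slides; the connection string, credentials, table and paths are placeholders, and a dag object and BashOperator import are assumed), a Sqoop import can be run from an edge node this way:
# Run a Sqoop import from the edge node where the Airflow worker is installed
sqoop_ingest = BashOperator(
    task_id='sqoop_ingest_example_table',
    bash_command='sqoop import '
                 '--connect jdbc:mysql://<HOST>:3306/<DATABASE> '
                 '--username <USERNAME> --password <PASSWORD> '
                 '--table <TABLE> '
                 '--target-dir /data/raw/<TABLE> -m 1',
    dag=dag)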
20. Executing Airflow Workflows on Hadoop – Example 2
# hive transform with file
hive_transform_file = BashOperator(
    task_id='hive_transform_file',
    bash_command="""
hadoop fs -get hdfs:///path/to/hive.hql .
if [ -f "hive.hql" ]
then
    beeline -u jdbc:hive2://<HOST>:10000/default -n <USERNAME> -p <PASSWORD> -f hive.hql
    exit ${?}
else
    echo "hive.hql not found."
    exit 1
fi
""",
    dag=dag)

# hive transform with an inline query
hive_transform_exec = BashOperator(
    task_id='hive_transform_exec',
    bash_command="beeline -u jdbc:hive2://<HOST>:10000/default -n <USERNAME> -p <PASSWORD> -e 'INSERT INTO TABLE <TARGET_TABLE> SELECT * FROM <SOURCE_TABLE>'",
    dag=dag)
21. Running on a Kerberized Cluster
• Airflow provides another process (apart from the
webserver, worker and scheduler) which can renew Kerberos
tickets for the user it is running as and store them in the ticket
cache.
• The hooks and DAGs can then make use of this ticket to authenticate
against Kerberized services.
• Update airflow.cfg:
[core]
security = kerberos
[kerberos]
keytab = /etc/airflow/airflow.keytab
reinit_frequency = 3600
principal = airflow
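With that configuration in place, the ticket renewer runs as its own daemon alongside the other Airflow processes (how you daemonize it depends on your installation):
# Start the Kerberos ticket renewer (in addition to the webserver, scheduler and workers)
airflow kerberos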
22. Use Case
• Daily ETL Batch Process to Ingest data into Hadoop
• Extract
• 23 databases total
• 1226 tables total
• Transform
• Impala scripts to join and transform data
• Load
• Impala scripts to load data into common final tables
• Other requirements
• Make it extensible to allow the client to import more databases and
tables in the future
• Status emails to be sent out after the daily job to report on successes and
failures
• Solution
• Create a DAG that dynamically generates the workflow based off data
in a Metastore
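As a rough illustration of that approach (a hypothetical sketch, not the client's actual code; get_tables_to_ingest(), the table list and the Sqoop command are placeholder assumptions, reusing default_args and START_DATE from the earlier DAG example), the DAG file can query the Metastore and emit one ingest task per table:
# Generate one ingest task per table recorded in the metastore
dag = DAG('daily_etl', default_args=default_args, schedule_interval='0 0 * * *', start_date=START_DATE)

def get_tables_to_ingest():
    # Placeholder: in practice this queries the metadata database for (database, table) pairs
    return [('sales_db', 'orders'), ('sales_db', 'customers')]

for database, table in get_tables_to_ingest():
    BashOperator(
        task_id='ingest_%s_%s' % (database, table),
        bash_command='sqoop import --connect jdbc:mysql://<HOST>:3306/%s '
                     '--username <USERNAME> --password <PASSWORD> '
                     '--table %s --target-dir /data/raw/%s/%s -m 1' % (database, table, database, table),
        dag=dag)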
25. Use Case (Kogni)
• New Product being built by Clairvoyant to facilitate:
• kogni-inspector – Sensitive Data Analyzer
• kogni-ingestor – Ingests Data
• kogni-guardian – Sensitive Data Masking (Encrypt and
Tokenize)
• Other components coming soon
• Utilizes Airflow for Data Ingestion and Masking
• Dynamically creates a workflow based off what is in the
Metastore
• Learn More: http://kogni.io/