Workflow Engines for
Hadoop
Joe Crobak
@joecrobak
NYC Data Engineering Meetup
September 5, 2013
1
Intro
2
Background
• Devops/Infra for Hadoop
• ~4 years with Hadoop
• Have done two migrations from EMR to the colo.
• Formerly Data/Analytics Infrastructure @
• worked with Apache Oozie and Luigi
• Before that, Hadoop @
• worked with Azkaban 1.0
Disclosure: I’ve contributed to Luigi and Azkaban 1.0
3
What is Apache Hadoop?
4
What is a workflow?
5
What is a workflow engine?
6
Two Example Use-Cases
7
Analytics / Data Warehousing
• logs -> fact table(s).
• database backups -> dimension tables.
• Compute rollups/cubes.
• Load data into a low-latency store (e.g.
Redshift, Vertica, HBase).
• Dashboarding & BI tools hit database.
8
Analytics / Data Warehousing
9
Analytics / Data Warehousing
• What happens if there’s a failure?
• rebuild the failed day
• ... and any downstream datasets
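A minimal sketch of "rebuild the failed day and any downstream datasets", assuming the workflow is available as a mapping from each dataset to the datasets derived from it. The dataset names and the `downstream_of` helper are hypothetical and do not come from any of the engines discussed in this talk.

```python
from collections import deque

# Hypothetical dependency map: dataset -> datasets built directly from it.
DERIVED_FROM = {
    "logs": ["fact_table"],
    "db_backup": ["dimension_tables"],
    "fact_table": ["rollups"],
    "dimension_tables": ["rollups"],
    "rollups": ["dashboard_store"],
    "dashboard_store": [],
}


def downstream_of(failed, graph):
    """Return every dataset derived from `failed`, in breadth-first order."""
    to_rebuild, seen = [], {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                to_rebuild.append(child)
                queue.append(child)
    return to_rebuild


# After rebuilding the failed day of fact_table, these also need a rebuild.
rebuild = downstream_of("fact_table", DERIVED_FROM)
print(rebuild)
```

For this small graph the breadth-first order is also a valid rebuild order; a real engine would typically use a topological order of the dependency graph.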
10
Hadoop-Driven Features
• People You May Know
• Amazon-style “People
that buy this often buy
that”
• SPAM detection
• logs, databases ->
machine learning /
collaborative filtering
• derivative datasets ->
production database
(often k/v store)
11
Hadoop-Driven Features
12
Hadoop-Driven Features
• What happens if there’s a failure?
• possibly OK to skip a day.
• Workflow tends to be self-contained, so
you don’t need to rerun downstream.
• Sanity check your data before pushing to
production.
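One way the sanity check could look, as a sketch only. The specific checks (non-empty batch, no sharp drop in row count, no missing `user_id`), the `max_drop` threshold, and the row layout are assumptions made for illustration; the right checks depend on the dataset.

```python
def sanity_check(rows, previous_row_count, max_drop=0.5):
    """Return a list of problems; an empty list means the batch looks safe to push."""
    problems = []
    if not rows:
        problems.append("batch is empty")
        return problems
    # Compare against the previous run to catch a partially built dataset.
    if previous_row_count and len(rows) < previous_row_count * (1 - max_drop):
        problems.append("row count dropped by more than %d%%" % int(max_drop * 100))
    # Hypothetical required field; replace with whatever the feature needs.
    if any(row.get("user_id") is None for row in rows):
        problems.append("found rows with a missing user_id")
    return problems


batch = [{"user_id": 1}, {"user_id": 2}]
if not sanity_check(batch, previous_row_count=2):
    print("ok to push to production")
```

If the check fails, skipping the push (and keeping yesterday's data in production) matches the "possibly OK to skip a day" point above.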
13
Workflow Engine Evolution
• Usually start with cron
• at 01:00 import data
• at 02:00 run really expensive query A
• at 03:00 run query B, C, D
• ...
• This goes on until you have ~10 jobs or so.
• It’s hard to debug and rerun.
• Doesn’t scale to many people.
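The cron schedule above encodes dependencies only implicitly, through start times. A sketch of the alternative, where each job declares what it depends on and runs as soon as those jobs have finished. The job names mirror the cron example; the assumption that queries B, C and D depend on query A is mine, and `runnable_jobs` is a hypothetical helper.

```python
# Hypothetical jobs from the cron example: job -> set of jobs it depends on.
DEPENDS_ON = {
    "import_data": set(),
    "query_a": {"import_data"},
    "query_b": {"query_a"},
    "query_c": {"query_a"},
    "query_d": {"query_a"},
}


def runnable_jobs(depends_on, finished):
    """Jobs that have not run yet and whose dependencies have all finished."""
    return sorted(
        job
        for job, deps in depends_on.items()
        if job not in finished and deps <= finished
    )


print(runnable_jobs(DEPENDS_ON, set()))
print(runnable_jobs(DEPENDS_ON, {"import_data", "query_a"}))
```

With explicit dependencies, a slow import delays query A instead of letting it run on incomplete data, and a rerun only needs to know which jobs have finished.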
14
Workflow Engine Evolution
• Two possibilities:
1. “a workflow engine can’t be too hard,
let’s write our own”
2. spend weeks evaluating all the options
out there. Try to shoehorn your
workflow into each one.
15
Workflow Engine
Considerations
How do I...
• Deploy and Upgrade
• workflows and the workflow engine
• Test
• Detect Failure
• Debug/find logs
• Rebuild/backfill datasets
• Load data to/from a RDBMS
• Manage a set of similar tasks
16
Apache Oozie
http://oozie.apache.org/
17
Oozie - architecture
18
Oozie - the good
• Great community support
• Integrated with HUE, Cloudera Manager, Apache
Ambari
• HCatalog integration
• SLA alerts (new in Oozie 4)
• Ecosystem support: Pig, Hive, Sqoop, etc.
• Very detailed documentation
• Launcher jobs as map tasks
19
Oozie - the bad
• Launcher jobs as map tasks.
• UI - but HUE, oozie-web (and
good API)
• Confusing object model (bundles,
coordinators, workflows) - high
barrier to entry.
• Setup - extjs, hadoop proxy user,
RDBMS.
• XML!
20
Oozie - the bad
• Hello World in Oozie
21
http://azkaban.github.io/azkaban2/
22
Azkaban - architecture
Source: http://azkaban.github.io/azkaban2/overview.html
23
Azkaban - the good
• Great UI
• DAG visualization
• Task history
• Easy access to log files
• Plugin architecture
• Pig, Hive, etc. Also, voldemort “build and push” integration
• SLA Alerting
• HDFS Browser
• User Authentication/Authorization and auditing.
• Reportal: https://github.com/azkaban/azkaban-plugins/pull/6
24
25
Azkaban - the bad
• Representing data dependencies
• i.e. run job X when dataset Y is available.
• Executors run on separate workers, can be
under-utilized (YARN anyone?).
• Community - mostly just LinkedIn, and they
rewrote it in isolation.
• mailing list responsiveness is good.
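The data-dependency point above (run job X when dataset Y is available) can be sketched as a check for a completion marker. This assumes the producer writes a `_SUCCESS` file when it finishes, as Hadoop MapReduce output committers commonly do, and uses a local temp directory as a stand-in for the dataset location. Both helpers are hypothetical and are not part of Azkaban.

```python
import tempfile
from pathlib import Path


def dataset_available(dataset_dir):
    """Treat a dataset as available once its _SUCCESS marker file exists."""
    return (Path(dataset_dir) / "_SUCCESS").is_file()


def run_when_available(dataset_dir, job):
    """Run `job` only if the input dataset is complete; return whether it ran."""
    if not dataset_available(dataset_dir):
        return False
    job()
    return True


# Usage: a local temp directory stands in for the location of dataset Y.
dataset_y = Path(tempfile.mkdtemp())
ran = []
before = run_when_available(dataset_y, lambda: ran.append("job X"))
(dataset_y / "_SUCCESS").touch()  # the upstream job finished writing dataset Y
after = run_when_available(dataset_y, lambda: ran.append("job X"))
print(before, after, ran)
```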
26
Azkaban - good and
bad
• Job definitions as java properties
• Web uploads/deploy
• Running jobs, scheduling jobs.
• nearly impossible to integrate with
configuration management
27
https://github.com/spotify/luigi
28
Luigi - architecture
29
Luigi - the good
• Task definitions are code.
• Tasks are idempotent.
• Workflow defines data (and task) dependencies.
• Growing community.
• Easy to hack on the codebase (<6k LoC).
• Postgres integration
• Foursquare got this working with Redshift and
Vertica.
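A toy illustration of the "tasks are code" and "tasks are idempotent" points: a task is complete when its output exists, so building twice does no extra work, and removing an output forces a rerun. This is plain Python written for this note, not Luigi's actual API; `FileTask` and `build` are hypothetical names.

```python
import tempfile
from pathlib import Path


class FileTask:
    """Toy task: complete when its output file exists (not Luigi's real API)."""

    def __init__(self, output_path, requires=()):
        self.output_path = Path(output_path)
        self.requires = list(requires)

    def complete(self):
        return self.output_path.exists()

    def run(self):
        # Read every dependency's output, then write this task's output.
        inputs = [dep.output_path.read_text() for dep in self.requires]
        self.output_path.write_text("built from %d input(s)" % len(inputs))


def build(task, log):
    """Run dependencies first, then the task; skip anything already complete."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, log)
    task.run()
    log.append(task.output_path.name)


workdir = Path(tempfile.mkdtemp())
logs = FileTask(workdir / "logs.txt")
facts = FileTask(workdir / "facts.txt", requires=[logs])

first, second, third = [], [], []
build(facts, first)         # runs logs, then facts
build(facts, second)        # nothing to do: outputs already exist
facts.output_path.unlink()  # "remove output" to force a rerun
build(facts, third)         # only facts is rebuilt
print(first, second, third)
```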
30
Luigi - the bad
• Missing some key features, e.g. Pig support
• but this is easy to add
• Deploy situation is confusing (but easy to
automate)
• visualizer scaling
• no persistent backing
• JVM overhead
31
Comparison matrix - part 1
        | Lang   | Code Complexity | Frameworks                                 | Logs                     | Community                              | Docs
--------|--------|-----------------|--------------------------------------------|--------------------------|----------------------------------------|----------
oozie   | java   | high - 105k     | pig, hive, sqoop, mapreduce                | decentralized, map tasks | Good - ASF in many distros             | excellent
azkaban | java   | moderate - 26k  | pig, hive, mapreduce                       | UI-accessible            | few users, responsive on MLs           | good
luigi   | python | simple - 5.9k   | hive, postgres, scalding, python streaming | decentralized on workers | few users, responsive on github and MLs | good
32
Comparison matrix - part 2
        | Property configuration                       | Reruns                     | Customization (new job type) | Testing          | User Auth
--------|----------------------------------------------|----------------------------|------------------------------|------------------|--------------------------
oozie   | command-line, properties file, xml defaults  | oozie job -rerun           | difficult                    | MiniOozie        | Kerberos, simple, custom
azkaban | bundled inside workflow zip, system defaults | partial reruns in UI       | plugin architecture          | ?                | xml-based, custom
luigi   | command-line, python ini file                | remove output, idempotency | subclass luigi.Task          | python unittests | linux-based
33
Other workflow
engines
• Chronos
• EMR
• Mortar
• Qubole
• general purpose:
• kettle, spring batch
34
Qualities I like in a
workflow engine
• scripting language
• you end up writing scripts to run your job anyway
• custom logic, e.g. representing a dep on 7-days of data or run
only every week
• Less property propagation
• Idempotency
• WYSIWYG
• It shouldn't be hard to take my existing job and move it to the
workflow engine (it should just work).
• Easy to hack on
35
Less important
• High availability (cold failover with manual
intervention is OK)
• Multiple cluster support
• Security
36
Best Practices
• Version datasets
• Backfilling datasets
• Monitor the absence of a job running
• Continuous deploy?
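"Monitor the absence of a job running" can be sketched as a staleness check on the last successful run, so that a job that silently stopped being scheduled still raises an alert. The 26-hour threshold for a daily job and the `is_stale` helper are assumptions; where the last-success timestamp comes from depends on the engine.

```python
from datetime import datetime, timedelta


def is_stale(last_success, now, max_age=timedelta(hours=26)):
    """True if the job has not succeeded within `max_age`, or has never succeeded."""
    return last_success is None or now - last_success > max_age


now = datetime(2013, 9, 5, 12, 0)
print(is_stale(datetime(2013, 9, 5, 1, 0), now))  # succeeded this morning
print(is_stale(datetime(2013, 9, 3, 1, 0), now))  # last success two days ago
```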
37
Resources
• Azkaban talk at Hadoop User Group:
http://www.youtube.com/watch?v=rIUlh33uKMU
• PyData talk on Luigi:
http://vimeo.com/63435580
• Oozie talk at Hadoop User Group:
http://www.slideshare.net/mislam77/oozie-hug-may12
38
Thanks!
• Questions?
• shameless plug: Subscribe to my
newsletter: http://hadoopweekly.com
39
