SlideShare a Scribd company logo
1 of 35
AIRFLOW - DATA FLOW
ENGINE FROM AIRBNB
By Walter Liu 2016/01/28
Who am I?
 Archer Architect of TrendMicro Coretech
backend
 A new user in Airflow
Why Data Flow Engine?
Cron Job
Direct Acyclic Graph (DAG)
Data relationships
 Data availability
 if the data is not there, trigger the process to
generate the data.
 Data dependency
 Some data relies on other data to generate.
Operability
 Job failed and resume
 Job monitor
 Backfill
Airflow
DAG structure as code
Demo
 python tutorial.py
 airflow list_dags
 airflow list_tasks tutorial
 airflow list_tasks tutorial --tree
 airflow test tutorial print_date 2015-06-01
 airflow test tutorial sleep 2015-06-01
 airflow run tutorial templated 2015-06-01
 Backfill: airflow backfill tutorial -s 2015-06-07 -e 2015-06-10
 Run again
 Run another date range
Site: http://localhost:8080/admin/
UI – DAG view
UI – Tree View
UI – Graph View
UI - Gantt
UI – Task duration
UI – Code
Airflow - Pros
 Dynamic generating path
 Have both Time scheduler and Command line trigger
 Has Master/Worker model (automatically distribute tasks)
 Scale if you have many tasks in a chain.
 But not be so useful to most of our tasks. Maybe useful for full dump.
 Fancy UI
 Dependencies of tasks
 Task success/failure
 Scheduled tasks status
 Has utility lib to wait_data for S3, Hadoop …
Airflow - Cons
 Additional DB/Redis or Rabbitmq for Celery
 HA design: Use RDBMS/redis-cache in AWS
 Require python 2.7 and many other libraries.
 Not dependent on data. Just task dependency.
(Not big cons)
 Write check data file code.
A snippet of the task of Luigi
Recap
 Solving DAG job management problem
 Features to improve daily operation
 Monitor/Visualization
 Job failure and resume
 Backfill
Backup slides
Other Selections?
 Oozie (on Hadoop)
 Luigi (by Spotify)
 Mario (Luigi in Scala)
 Ruigi (Luigi in R)
 Airflow (by Airbnb)
 Mistral (by OpenStack)
 Pinball (by Pinterest)
 …
Github statistics
Start Date Star Commits in
recent 2
months
Airflow 2014/10/05 1516 362
Luigi 2011/11/13 3777 105
Pinball 2015/05/01 476 0
Oozie 2011/08/28 167 14
Oozie
 Pros
 Mature
 Native support of Hadoop ECO system
 Python is not required.
 Cons
 Big and complex XML to define the flow and config.
 Control flow is somehow restrictive
 Hadoop ECO system is required.
 Unix box, Java, Hadoop, Pig, ExtJS library
Oozie XML Example
Luigi - Pros
 Makefile like, dependencies on data (Upward instead of downward)
 Command line trigger
 Not only Hadoop
 Support target: HDFS/S3/RDBMS/…. (not limited)
 Write your own code to support Casandra, etc.
 Dependencies are decentralized, no big XML.
 UI
 Dynamic dependency/task is supported. (Don’t know it is better than
airflow or not)
 [util] Date algebra
Luigi - Cons
 No built-in triggering
 No crontab like things (could borrow Chronos)
 Use cron to run tasks on specific nodes.
(normally)
Luigi - Notes
 Local execution
 Pros: Easy to debug
 Cons:
 need access to other system. (Well, other system didn’t achieve this either.)
 No scalability if you have many tasks in a chain.
 Centralized scheduler servers (optional, recommended for
production)
 Make sure two instances of the same task are not running
simultaneously
 Provide visualization of everything that’s going on.
 HA option
 Run 2+ tasks on 2+ machines at the same time
References
 http://erikbern.com/2015/07/02/more-luigi-
alternatives/

More Related Content

What's hot

Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow IntroductionLiangjun Jiang
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow ArchitectureGerard Toonstra
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_onpko89403
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowYohei Onishi
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowPyData
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Sid Anand
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSDerrick Qin
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheusCeline George
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Kaxil Naik
 

What's hot (20)

Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_on
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
Orchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWSOrchestrating workflows Apache Airflow on GCP & AWS
Orchestrating workflows Apache Airflow on GCP & AWS
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Airflow and supervisor
Airflow and supervisorAirflow and supervisor
Airflow and supervisor
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 

Similar to Schedule and orchestrate data workflows with Airflow

Rounds analytics pipeline
Rounds analytics pipelineRounds analytics pipeline
Rounds analytics pipelineAviv Laufer
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintroDoug Chang
 
Building an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleBuilding an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleMerelda
 
Spring boot 3g
Spring boot 3gSpring boot 3g
Spring boot 3gvasya10
 
Google Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsGoogle Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsBarton Rhodes
 
The 90-Day Startup with Google AppEngine for Java
The 90-Day Startup with Google AppEngine for JavaThe 90-Day Startup with Google AppEngine for Java
The 90-Day Startup with Google AppEngine for JavaDavid Chandler
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overviewprevota
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Zeppelin at Twitter
Zeppelin at TwitterZeppelin at Twitter
Zeppelin at TwitterPrasad Wagle
 
Application Architecture For The Cloud
Application Architecture For The CloudApplication Architecture For The Cloud
Application Architecture For The CloudSteve Loughran
 
Using and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersUsing and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersVladimir Pavlov
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
Zeppelin at twitter (sf data science meetup, july 2016)
Zeppelin at twitter (sf data science meetup, july 2016)Zeppelin at twitter (sf data science meetup, july 2016)
Zeppelin at twitter (sf data science meetup, july 2016)Prasad Wagle
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterKamran Munshi
 
Endofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible ComputationEndofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible ComputationEnis Afgan
 
All the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
All the Ops: DataOps with GitOps for Streaming data on Kafka and KubernetesAll the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
All the Ops: DataOps with GitOps for Streaming data on Kafka and KubernetesDevOps.com
 

Similar to Schedule and orchestrate data workflows with Airflow (20)

Rounds analytics pipeline
Rounds analytics pipelineRounds analytics pipeline
Rounds analytics pipeline
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintro
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Building an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleBuilding an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable Scale
 
Spring boot 3g
Spring boot 3gSpring boot 3g
Spring boot 3g
 
Google Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsGoogle Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teams
 
The 90-Day Startup with Google AppEngine for Java
The 90-Day Startup with Google AppEngine for JavaThe 90-Day Startup with Google AppEngine for Java
The 90-Day Startup with Google AppEngine for Java
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Zeppelin at Twitter
Zeppelin at TwitterZeppelin at Twitter
Zeppelin at Twitter
 
Application Architecture For The Cloud
Application Architecture For The CloudApplication Architecture For The Cloud
Application Architecture For The Cloud
 
Using and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersUsing and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted Waters
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Zeppelin at twitter (sf data science meetup, july 2016)
Zeppelin at twitter (sf data science meetup, july 2016)Zeppelin at twitter (sf data science meetup, july 2016)
Zeppelin at twitter (sf data science meetup, july 2016)
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
 
Endofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible ComputationEndofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible Computation
 
All the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
All the Ops: DataOps with GitOps for Streaming data on Kafka and KubernetesAll the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
All the Ops: DataOps with GitOps for Streaming data on Kafka and Kubernetes
 

More from Walter Liu

Generative AI 在手機遊戲發行上的應用介紹.pdf
Generative AI 在手機遊戲發行上的應用介紹.pdfGenerative AI 在手機遊戲發行上的應用介紹.pdf
Generative AI 在手機遊戲發行上的應用介紹.pdfWalter Liu
 
Infrastructure as code using Kubernetes
Infrastructure as code using KubernetesInfrastructure as code using Kubernetes
Infrastructure as code using KubernetesWalter Liu
 
手機遊戲數位行銷工具淺談
手機遊戲數位行銷工具淺談手機遊戲數位行銷工具淺談
手機遊戲數位行銷工具淺談Walter Liu
 
關於第三方追蹤廣告再行銷的那些事兒
關於第三方追蹤廣告再行銷的那些事兒關於第三方追蹤廣告再行銷的那些事兒
關於第三方追蹤廣告再行銷的那些事兒Walter Liu
 
Kubernetes Workshop
Kubernetes WorkshopKubernetes Workshop
Kubernetes WorkshopWalter Liu
 
Using Kubernetes to deploy Django in GCP
Using Kubernetes to deploy Django in GCPUsing Kubernetes to deploy Django in GCP
Using Kubernetes to deploy Django in GCPWalter Liu
 
Game DDOS Prevention
Game DDOS PreventionGame DDOS Prevention
Game DDOS PreventionWalter Liu
 
Super Fast Gevent Introduction
Super Fast Gevent IntroductionSuper Fast Gevent Introduction
Super Fast Gevent IntroductionWalter Liu
 
HTTP/2 to web dev
HTTP/2 to web devHTTP/2 to web dev
HTTP/2 to web devWalter Liu
 
HTTP/2 Introduction
HTTP/2 IntroductionHTTP/2 Introduction
HTTP/2 IntroductionWalter Liu
 
Consul - service discovery and others
Consul - service discovery and othersConsul - service discovery and others
Consul - service discovery and othersWalter Liu
 
Docker introduction
Docker introductionDocker introduction
Docker introductionWalter Liu
 
WRS GIT Branching Model - draft
WRS GIT Branching Model - draftWRS GIT Branching Model - draft
WRS GIT Branching Model - draftWalter Liu
 
Salty OPS – Saltstack Introduction
Salty OPS – Saltstack IntroductionSalty OPS – Saltstack Introduction
Salty OPS – Saltstack IntroductionWalter Liu
 
Django deployment and rpm+yum
Django deployment and rpm+yumDjango deployment and rpm+yum
Django deployment and rpm+yumWalter Liu
 
Game Localization in Python
Game Localization in PythonGame Localization in Python
Game Localization in PythonWalter Liu
 
Service production from d3 pitfall viewpoint
Service production from d3 pitfall viewpointService production from d3 pitfall viewpoint
Service production from d3 pitfall viewpointWalter Liu
 
Celery in the Django
Celery in the DjangoCelery in the Django
Celery in the DjangoWalter Liu
 

More from Walter Liu (18)

Generative AI 在手機遊戲發行上的應用介紹.pdf
Generative AI 在手機遊戲發行上的應用介紹.pdfGenerative AI 在手機遊戲發行上的應用介紹.pdf
Generative AI 在手機遊戲發行上的應用介紹.pdf
 
Infrastructure as code using Kubernetes
Infrastructure as code using KubernetesInfrastructure as code using Kubernetes
Infrastructure as code using Kubernetes
 
手機遊戲數位行銷工具淺談
手機遊戲數位行銷工具淺談手機遊戲數位行銷工具淺談
手機遊戲數位行銷工具淺談
 
關於第三方追蹤廣告再行銷的那些事兒
關於第三方追蹤廣告再行銷的那些事兒關於第三方追蹤廣告再行銷的那些事兒
關於第三方追蹤廣告再行銷的那些事兒
 
Kubernetes Workshop
Kubernetes WorkshopKubernetes Workshop
Kubernetes Workshop
 
Using Kubernetes to deploy Django in GCP
Using Kubernetes to deploy Django in GCPUsing Kubernetes to deploy Django in GCP
Using Kubernetes to deploy Django in GCP
 
Game DDOS Prevention
Game DDOS PreventionGame DDOS Prevention
Game DDOS Prevention
 
Super Fast Gevent Introduction
Super Fast Gevent IntroductionSuper Fast Gevent Introduction
Super Fast Gevent Introduction
 
HTTP/2 to web dev
HTTP/2 to web devHTTP/2 to web dev
HTTP/2 to web dev
 
HTTP/2 Introduction
HTTP/2 IntroductionHTTP/2 Introduction
HTTP/2 Introduction
 
Consul - service discovery and others
Consul - service discovery and othersConsul - service discovery and others
Consul - service discovery and others
 
Docker introduction
Docker introductionDocker introduction
Docker introduction
 
WRS GIT Branching Model - draft
WRS GIT Branching Model - draftWRS GIT Branching Model - draft
WRS GIT Branching Model - draft
 
Salty OPS – Saltstack Introduction
Salty OPS – Saltstack IntroductionSalty OPS – Saltstack Introduction
Salty OPS – Saltstack Introduction
 
Django deployment and rpm+yum
Django deployment and rpm+yumDjango deployment and rpm+yum
Django deployment and rpm+yum
 
Game Localization in Python
Game Localization in PythonGame Localization in Python
Game Localization in Python
 
Service production from d3 pitfall viewpoint
Service production from d3 pitfall viewpointService production from d3 pitfall viewpoint
Service production from d3 pitfall viewpoint
 
Celery in the Django
Celery in the DjangoCelery in the Django
Celery in the Django
 

Recently uploaded

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Schedule and orchestrate data workflows with Airflow

  • 1. AIRFLOW - DATA FLOW ENGINE FROM AIRBNB By Walter Liu 2016/01/28
  • 2. Who am I?  Archer Architect of TrendMicro Coretech backend  A new user in Airflow
  • 3. Why Data Flow Engine?
  • 6.
  • 7.
  • 8. Data relationships  Data availability  if the data is not there, trigger the process to generate the data.  Data dependency  Some data relies on other data to generate.
  • 9. Operability  Job failed and resume  Job monitor  Backfill
  • 12.
  • 13.
  • 14.
  • 15.
  • 16. Demo  python tutorial.py  airflow list_dags  airflow list_tasks tutorial  airflow list_tasks tutorial --tree  airflow test tutorial print_date 2015-06-01  airflow test tutorial sleep 2015-06-01  airflow run tutorial templated 2015-06-01  Backfill: airflow backfill tutorial -s 2015-06-07 -e 2015-06-10  Run again  Run another date range Site: http://localhost:8080/admin/
  • 17. UI – DAG view
  • 18. UI – Tree View
  • 19. UI – Graph View
  • 21. UI – Task duration
  • 23. Airflow - Pros  Dynamic generating path  Have both Time scheduler and Command line trigger  Has Master/Worker model (automatically distribute tasks)  Scale if you have many tasks in a chain.  But not be so useful to most of our tasks. Maybe useful for full dump.  Fancy UI  Dependencies of tasks  Task success/failure  Scheduled tasks status  Has utility lib to wait_data for S3, Hadoop …
  • 24. Airflow - Cons  Additional DB/Redis or Rabbitmq for Celery  HA design: Use RDBMS/redis-cache in AWS  Require python 2.7 and many other libraries.  Not dependent on data. Just task dependency. (Not big cons)  Write check data file code.
  • 25. A snippet of the task of Luigi
  • 26. Recap  Solving DAG job management problem  Features to improve daily operation  Monitor/Visualization  Job failure and resume  Backfill
  • 28. Other Selections?  Oozie (on Hadoop)  Luigi (by Spotify)  Mario (Luigi in Scala)  Ruigi (Luigi in R)  Airflow (by Airbnb)  Mistral (by OpenStack)  Pinball (by Pinterest)  …
  • 29. Github statistics Start Date Star Commits in recent 2 months Airflow 2014/10/05 1516 362 Luigi 2011/11/13 3777 105 Pinball 2015/05/01 476 0 Oozie 2011/08/28 167 14
  • 30. Oozie  Pros  Mature  Native support of Hadoop ECO system  Python is not required.  Cons  Big and complex XML to define the flow and config.  Control flow is somehow restrictive  Hadoop ECO system is required.  Unix box, Java, Hadoop, Pig, ExtJS library
  • 32. Luigi - Pros  Makefile like, dependencies on data (Upward instead of downward)  Command line trigger  Not only Hadoop  Support target: HDFS/S3/RDBMS/…. (not limited)  Write your own code to support Casandra, etc.  Dependencies are decentralized, no big XML.  UI  Dynamic dependency/task is supported. (Don’t know it is better than airflow or not)  [util] Date algebra
  • 33. Luigi - Cons  No built-in triggering  No crontab like things (could borrow Chronos)  Use cron to run tasks on specific nodes. (normally)
  • 34. Luigi - Notes  Local execution  Pros: Easy to debug  Cons:  need access to other system. (Well, other system didn’t achieve this either.)  No scalability if you have many tasks in a chain.  Centralized scheduler servers (optional, recommended for production)  Make sure two instances of the same task are not running simultaneously  Provide visualization of everything that’s going on.  HA option  Run 2+ tasks on 2+ machines at the same time

Editor's Notes

  1. Cron is not operational scalable.
  2. Bottleneck and where the time spent.
  3. Reference: http://www.infoq.com/cn/articles/introductionOozie