SlideShare a Scribd company logo
1 of 18
Download to read offline
Data Workflows
at Foursquare
using Luigi
Foursquare
•  35 million users
•  Nearly 4 billion check-ins
•  More than 5 million check-ins per day
•  50 million point-of-interest database
•  100's of GB of log data per day
Tools We Use
•  Hive
o  Ad hoc analytics, data dumping ground
•  Raw MapReduce
o  100's of MapReduce jobs in our codebase
•  Pig
o  Fits between structure Hive and free-form
MapReduce
•  Vertica
o  Low latency analytics
Cron
E.g.
0 0 * * * ./hadoop-script-1.sh
# Wait two hours for that job to finish...
0 2 * * * ./hadoop-script-2.sh
# And on and on and on
Cron - Problems
•  Brittle
•  Hard to reason about / visualize
•  Spend a lot of time waiting
•  Difficult to tell what succeeded or failed
•  No one likes writing Bash scripts
Oozie
XML-based Workflow Engine, with support for
Hadoop, Hive, and Pig
Workflows specify computations in a DAG, e.g
"Run this Hive query, then run these two
MapReduce jobs in parallel"
Coordinators launch recurring workflows at a
given frequency, when dependent data is
available
Oozie - Example
Oozie - Problems
•  Workflows are all-or-nothing
o  Cannot just run step that failed
o  Very little code reuse
•  Little to no extensibility
•  Limited control flow
•  Extremely verbose
•  Difficult to test
•  No one likes writing XML
Luigi
•  Python framework for batch processing jobs
•  Created by Spotify, open-sourced Sept. 2012
•  Tasks are units of work that produce Targets
•  Tasks can depend on one or more other Tasks
•  A Task is only run if all of its dependent Tasks are done
•  Tasks are idempotent
Luigi - Example Task
Luigi - Running the Task
$ python word-count.py WordCount --date 2013-06-01
Luigi - Scheduler
Central scheduler ensures each Task is only
run by a single worker.
A task is uniquely identified by its class name
and its Parameters, e.g.
WordCount(date=2013-06-01)
Will retry failed Tasks after a configured timeout
Emails someone when a Task fails
Luigi - Visualizer
Luigi - Visualizer
Luigi - Visualizer
Luigi - Advantages over Cron
•  Explicit dependencies
•  No wasted time waiting
•  Easy to tell what has failed
•  Avoid duplicate work / partial failures
Luigi - Advantages over Oozie
•  Explicit dependencies between workflows
•  Easier to write
•  Vastly more extensible
•  Code reuse
•  Can easily re-run individual steps
Thank you!
Check out Luigi:
https://github.com/spotify/luigi
Drop me a line:
Joe Ennever
jennever@foursquare.com

More Related Content

What's hot

Solr Metrics - Andrzej Białecki, Lucidworks
Solr Metrics - Andrzej Białecki, LucidworksSolr Metrics - Andrzej Białecki, Lucidworks
Solr Metrics - Andrzej Białecki, LucidworksLucidworks
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Lucidworks
 
Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowDatabricks
 
"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"Databricks
 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSergey Karayev
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowYohei Onishi
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model ServingDatabricks
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101Data Con LA
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesAltinity Ltd
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyKishore Gopalakrishna
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflowDatabricks
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOAltinity Ltd
 
Embracing Observability in CI/CD with OpenTelemetry
Embracing Observability in CI/CD with OpenTelemetryEmbracing Observability in CI/CD with OpenTelemetry
Embracing Observability in CI/CD with OpenTelemetryCyrille Le Clerc
 
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15MLconf
 
Revisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingRevisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingLionel Briand
 

What's hot (20)

Solr Metrics - Andrzej Białecki, Lucidworks
Solr Metrics - Andrzej Białecki, LucidworksSolr Metrics - Andrzej Białecki, Lucidworks
Solr Metrics - Andrzej Białecki, Lucidworks
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
 
Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflow
 
"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"
 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep Learning
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflow
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
 
Embracing Observability in CI/CD with OpenTelemetry
Embracing Observability in CI/CD with OpenTelemetryEmbracing Observability in CI/CD with OpenTelemetry
Embracing Observability in CI/CD with OpenTelemetry
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
InfluxDB & Grafana
InfluxDB & GrafanaInfluxDB & Grafana
InfluxDB & Grafana
 
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
 
Revisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingRevisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software Testing
 

Similar to Luigi presentation OA Summit

Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for HadoopJoe Crobak
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and AbstractionsMetosin Oy
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Paolo Negri
 
Erlang, the big switch in social games
Erlang, the big switch in social gamesErlang, the big switch in social games
Erlang, the big switch in social gamesWooga
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0Anshum Gupta
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudAnshum Gupta
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systemsBill Buchan
 
ProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.xProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.xRussell Searle
 
Profiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty DetailsProfiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty DetailsAchievers Tech
 
Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)Ford Prior
 
APIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidadAPIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidadSoftware Guru
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning processDenis Dus
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkTomas Doran
 
Greenfields tech decisions
Greenfields tech decisionsGreenfields tech decisions
Greenfields tech decisionsTrent Hornibrook
 
Julia Computing - an alternative to Hadoop
Julia Computing - an alternative to HadoopJulia Computing - an alternative to Hadoop
Julia Computing - an alternative to HadoopShaurya Shekhar
 
Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Andrei KUCHARAVY
 
CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development LetsConnect
 
Serverless for High Performance Computing
Serverless for High Performance ComputingServerless for High Performance Computing
Serverless for High Performance ComputingLuciano Mammino
 
Reactive Development: Commands, Actors and Events. Oh My!!
Reactive Development: Commands, Actors and Events.  Oh My!!Reactive Development: Commands, Actors and Events.  Oh My!!
Reactive Development: Commands, Actors and Events. Oh My!!David Hoerster
 

Similar to Luigi presentation OA Summit (20)

Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Performance and Abstractions
Performance and AbstractionsPerformance and Abstractions
Performance and Abstractions
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"
 
Erlang, the big switch in social games
Erlang, the big switch in social gamesErlang, the big switch in social games
Erlang, the big switch in social games
 
What's new in Solr 5.0
What's new in Solr 5.0What's new in Solr 5.0
What's new in Solr 5.0
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Lotuscript for large systems
Lotuscript for large systemsLotuscript for large systems
Lotuscript for large systems
 
ProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.xProjectFork 4.1 in Joomla! 3.x
ProjectFork 4.1 in Joomla! 3.x
 
Profiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty DetailsProfiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty Details
 
Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)Testing API's: Tools & Tips & Tricks (Oh My!)
Testing API's: Tools & Tips & Tricks (Oh My!)
 
APIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidadAPIs distribuidos con alta escalabilidad
APIs distribuidos con alta escalabilidad
 
SGCE 2015 REST APIs
SGCE 2015 REST APIsSGCE 2015 REST APIs
SGCE 2015 REST APIs
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning process
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
 
Greenfields tech decisions
Greenfields tech decisionsGreenfields tech decisions
Greenfields tech decisions
 
Julia Computing - an alternative to Hadoop
Julia Computing - an alternative to HadoopJulia Computing - an alternative to Hadoop
Julia Computing - an alternative to Hadoop
 
Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1Introduction to the intermediate Python - v1.1
Introduction to the intermediate Python - v1.1
 
CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development CrossWorlds: Unleash the Power of Domino for Connections Development
CrossWorlds: Unleash the Power of Domino for Connections Development
 
Serverless for High Performance Computing
Serverless for High Performance ComputingServerless for High Performance Computing
Serverless for High Performance Computing
 
Reactive Development: Commands, Actors and Events. Oh My!!
Reactive Development: Commands, Actors and Events.  Oh My!!Reactive Development: Commands, Actors and Events.  Oh My!!
Reactive Development: Commands, Actors and Events. Oh My!!
 

More from Open Analytics

Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)Open Analytics
 
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)Open Analytics
 
CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)Open Analytics
 
An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)Open Analytics
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)Open Analytics
 
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Open Analytics
 
Using Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationUsing Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationOpen Analytics
 
M&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsM&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsOpen Analytics
 
Competing in the Digital Economy
Competing in the Digital EconomyCompeting in the Digital Economy
Competing in the Digital EconomyOpen Analytics
 
Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Open Analytics
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Open Analytics
 
Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Open Analytics
 
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...Open Analytics
 
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Open Analytics
 
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Open Analytics
 
From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)Open Analytics
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYCOpen Analytics
 
MarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupMarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupOpen Analytics
 
The caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupThe caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupOpen Analytics
 
Verifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalVerifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalOpen Analytics
 

More from Open Analytics (20)

Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)Cyber after Snowden (OA Cyber Summit)
Cyber after Snowden (OA Cyber Summit)
 
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
Utilizing cyber intelligence to combat cyber adversaries (OA Cyber Summit)
 
CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)CDM….Where do you start? (OA Cyber Summit)
CDM….Where do you start? (OA Cyber Summit)
 
An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)An Immigrant’s view of Cyberspace (OA Cyber Summit)
An Immigrant’s view of Cyberspace (OA Cyber Summit)
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
 
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
Observations on CFR.org Website Traffic Surge Due to Chechnya Terrorism Scare...
 
Using Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & PersonalizationUsing Real-Time Data to Drive Optimization & Personalization
Using Real-Time Data to Drive Optimization & Personalization
 
M&A Trends in Telco Analytics
M&A Trends in Telco AnalyticsM&A Trends in Telco Analytics
M&A Trends in Telco Analytics
 
Competing in the Digital Economy
Competing in the Digital EconomyCompeting in the Digital Economy
Competing in the Digital Economy
 
Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)Piwik: An Analytics Alternative (Chicago Summit)
Piwik: An Analytics Alternative (Chicago Summit)
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)Crossing the Chasm (Ikanow - Chicago Summit)
Crossing the Chasm (Ikanow - Chicago Summit)
 
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
On the “Moneyball” – Building the Team, Product, and Service to Rival (Pegged...
 
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
Data evolutions in media, marketing, and retail (Business Adv Group - Chicago...
 
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
Characterizing Risk in your Supply Chain (nContext - Chicago Summit)
 
From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)From Insight to Impact (Chicago Summit - Keynote)
From Insight to Impact (Chicago Summit - Keynote)
 
Easybib Open Analytics NYC
Easybib Open Analytics NYCEasybib Open Analytics NYC
Easybib Open Analytics NYC
 
MarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics MeetupMarkLogic - Open Analytics Meetup
MarkLogic - Open Analytics Meetup
 
The caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetupThe caprate presentation_july2013_open analytics dc meetup
The caprate presentation_july2013_open analytics dc meetup
 
Verifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_finalVerifeed open analytics_3min deck_071713_final
Verifeed open analytics_3min deck_071713_final
 

Luigi presentation OA Summit

  • 2. Foursquare •  35 million users •  Nearly 4 billion check-ins •  More than 5 million check-ins per day •  50 million point-of-interest database •  100's of GB of log data per day
  • 3. Tools We Use •  Hive o  Ad hoc analytics, data dumping ground •  Raw MapReduce o  100's of MapReduce jobs in our codebase •  Pig o  Fits between structure Hive and free-form MapReduce •  Vertica o  Low latency analytics
  • 4. Cron E.g. 0 0 * * * ./hadoop-script-1.sh # Wait two hours for that job to finish... 0 2 * * * ./hadoop-script-2.sh # And on and on and on
  • 5. Cron - Problems •  Brittle •  Hard to reason about / visualize •  Spend a lot of time waiting •  Difficult to tell what succeeded or failed •  No one likes writing Bash scripts
  • 6. Oozie XML-based Workflow Engine, with support for Hadoop, Hive, and Pig Workflows specify computations in a DAG, e.g "Run this Hive query, then run these two MapReduce jobs in parallel" Coordinators launch recurring workflows at a given frequency, when dependent data is available
  • 8. Oozie - Problems •  Workflows are all-or-nothing o  Cannot just run step that failed o  Very little code reuse •  Little to no extensibility •  Limited control flow •  Extremely verbose •  Difficult to test •  No one likes writing XML
  • 9. Luigi •  Python framework for batch processing jobs •  Created by Spotify, open-sourced Sept. 2012 •  Tasks are units of work that produce Targets •  Tasks can depend on one or more other Tasks •  A Task is only run if all of its dependent Tasks are done •  Tasks are idempotent
  • 11. Luigi - Running the Task $ python word-count.py WordCount --date 2013-06-01
  • 12. Luigi - Scheduler Central scheduler ensures each Task is only run by a single worker. A task is uniquely identified by its class name and its Parameters, e.g. WordCount(date=2013-06-01) Will retry failed Tasks after a configured timeout Emails someone when a Task fails
  • 16. Luigi - Advantages over Cron •  Explicit dependencies •  No wasted time waiting •  Easy to tell what has failed •  Avoid duplicate work / partial failures
  • 17. Luigi - Advantages over Oozie •  Explicit dependencies between workflows •  Easier to write •  Vastly more extensible •  Code reuse •  Can easily re-run individual steps
  • 18. Thank you! Check out Luigi: https://github.com/spotify/luigi Drop me a line: Joe Ennever jennever@foursquare.com