How I learned to time travel
or, data pipelining and scheduling with Airflow
Laura Lorenz | @lalorenz6 | github.com/lauralorenz | llorenz@industrydive.com
We’re
hiring!
Some truths first
Data is weird & breaks stuff
User data is particularly untrustworthy. Don’t trust it.
Computers/the Internet/third party services/everything will fail
“In the beginning, there was Cron. We had one job, it ran at 1AM, and it was good.”
- Pete Owlett, PyData London 2016
from the outline of his talk:
“Lessons from 6 months of using Luigi in production”
Our version: we have ~100 jobs, when they run depends, and it is chaos.
Plumbing problems suck
We had thoughts about how this should go
● Prefer something in open source Python so we know what’s going on and can
easily extend or customize
● Resilient
○ Handles failure well; e.g. retry logic, failure callbacks, alerting
● Deals with Complexity Intelligently
○ Can handle complicated dependencies and only runs what it has to
● Flexibility
○ Can run anything we want
● We knew we had batch tasks on daily and hourly schedules
We travelled the land
File-dependency tools (Make, Drake, Pydoit, Luigi):
● File based dependencies
● Dependency framework only
● Lightweight, protocols minimally specified
Abstract orchestration tools (Pinball, Airflow, AWS Data Pipeline):
● Abstract dependencies
● Ships with scheduling & monitoring
● Heavyweight, batteries included
(Also weighed: active docs & community)
File dependencies/target systems
File dependencies
Recipe/action
Target(s)
File dependencies/target systems
File dependencies
Recipe/action
Target(s)
#Makefile
wrangled.csv : source1.csv source2.csv
cat source1.csv source2.csv > wrangled.csv
File dependencies/target systems
File dependencies
Recipe/action
Target(s)
#Drakefile
wrangled.csv <- source1.csv, source2.csv [shell]
cat $INPUT0 $INPUT1 > $OUTPUT
File dependencies/target systems
File dependencies
Recipe/action
Target(s)
#Pydoit
def task_example():
    return {"targets": ['wrangled.csv'],
            "file_dep": ['source1.csv', 'source2.csv'],
            "actions": [concatenate_files_func]}
File dependencies/target systems
File dependencies
Recipe/action
Target(s)
# Luigi (pseudocode)
class TaskC(luigi.Task):
    def requires(self):
        return output_from_a()
    def output(self):
        return input_for_e()
    def run(self):
        do_the_thing(
            self.requires, self.output)
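For comparison, here is a minimal runnable Luigi sketch of the same idea (our own illustration, not from the talk), reusing the source1.csv/source2.csv example from the Makefile slide; requires() declares upstream tasks and output() declares the file target that marks the task complete:

# Luigi (runnable sketch)
import luigi

class SourceFile(luigi.ExternalTask):
    # Pre-existing input; Luigi only checks that this target exists
    path = luigi.Parameter()
    def output(self):
        return luigi.LocalTarget(self.path)

class Wrangle(luigi.Task):
    def requires(self):
        return [SourceFile(path='source1.csv'), SourceFile(path='source2.csv')]
    def output(self):
        return luigi.LocalTarget('wrangled.csv')
    def run(self):
        # Concatenate the upstream targets into this task's target
        with self.output().open('w') as out:
            for target in self.input():
                with target.open('r') as f:
                    out.write(f.read())

if __name__ == '__main__':
    luigi.build([Wrangle()], local_scheduler=True)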
File dependencies/target systems
Pros:
● Work is cached in files
○ Smart rebuilding
● Simple and intuitive configuration, especially for data transformations
Cons:
● No native concept of schedule
○ Luigi is the first to introduce this, but lacks a built-in polling process
● Alerting systems too basic
● Design paradigm not broadly applicable to non-target operations
Abstract orchestration systems
C
A B
ED
Abstract orchestration systems
C
A B
ED
# Pinball
WORKFLOW = {"ex": WorkflowConfig(
    jobs={
        "A": JobConfig(
            JobTemplate(A),
            []),
        "C": JobConfig(
            JobTemplate(C),
            ["A"]),
        …,
    },
    schedule=ScheduleConfig(
        recurrence=timedelta(days=1),
        reference_timestamp=
            datetime(
                year=2016, day=8,
                month=10))
    …,
Abstract orchestration systems
C
A B
ED
# Airflow
dag = DAG(schedule_interval=
              timedelta(days=1),
          start_date=
              datetime(2015, 10, 6))
a = PythonOperator(
    task_id="A",
    python_callable=ClassA,
    dag=dag)
c = MySqlOperator(
    task_id="C",
    sql="DROP TABLE hello",
    dag=dag)
c.set_upstream(a)
Abstract orchestration systems
● Support many more types of
operations out of the box
● Handles more complicated
dependency logic
● Scheduling, monitoring, and
alerting services built-in and
sophisticated
● Caching is per service; loses
focus on individual data
transformations
● Configuration is more complex
● More infrastructure
dependencies
○ Database for state
○ Queue for distribution
Pros Cons
Armed with knowledge, we had more opinions
● We like the sophistication of the abstract orchestration systems
● But we also like Drake/Luigi-esque file targeting for transparency and data
bug tracking
○ “Intermediate artifacts”
● We (I/devops) don’t want to maintain a separate scheduling service
● We like a strong community and good docs
● We don’t want to be stuck in one ecosystem
Airflow + “smart-airflow” =
Airflow
● Scheduler process handles triggering and executing work specified in DAGs on a given schedule
● Built-in alerting based on service level agreements (SLAs) or task state
● Lots of sexy profiling visualizations
● test, backfill, and clear operations are convenient from the CLI
● Operators can come from a number of prebuilt classes like PythonOperator, S3KeySensor, or BaseTransfer, or can be extended via inheritance
● Supports local or distributed work execution; distributed execution needs a Celery backend with a broker service (RabbitMQ, Redis)
Airflow architecture: webserver, scheduler, executor, worker, queue, metadata database
Let’s talk about DAGs and Tasks
DAG properties:
● schedule_interval
● start_date
● max_active_runs
Task properties (tasks are Operators or Sensors):
● poke_interval, timeout (Sensors)
● owner
● retries
● on_failure_callback
● data_dependencies
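A minimal sketch of where those properties get set, using the 1.x-era import paths from the time of this talk; the callable, callback, and connection names are placeholders:

# DAG and task properties (sketch)
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import HttpSensor

def extract():                 # placeholder task logic
    pass

def notify_failure(context):   # placeholder failure callback
    print('task failed:', context['task_instance'])

dag = DAG(
    dag_id='example',
    schedule_interval=timedelta(days=1),  # DAG property
    start_date=datetime(2016, 10, 1),     # DAG property
    max_active_runs=1)                    # DAG property

wait_for_api = HttpSensor(
    task_id='wait_for_api',
    http_conn_id='my_api',     # assumed connection id
    endpoint='health',
    poke_interval=60,          # Sensor property: seconds between pokes
    timeout=60 * 60,           # Sensor property: give up after an hour
    dag=dag)

extract_task = PythonOperator(
    task_id='extract',
    python_callable=extract,
    owner='data-team',                   # Task property
    retries=2,                           # Task property
    on_failure_callback=notify_failure,  # Task property
    dag=dag)

extract_task.set_upstream(wait_for_api)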
Let’s talk about DagRuns and TaskInstances
DagRuns:
DAG by “time”
TaskInstances:
Task by DagRun
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
What to
do what to
do
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
Hey do
the
thing!!!
Do the
thing
queued
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
Okok I
told rabbit
Do the
thing
queued
DagRun
TaskInstance
running
queued
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
Do the
thing
queued
What to
do what to
do
DagRun
TaskInstance
running
queued
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
Do the
thing
Oo thing!
I’M ON IT
running
DagRun
TaskInstance
running
running
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
Do the
thing
Ack!!(knowledge):
success!
success
DagRun
TaskInstance
running
success
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
Well I did my
job
DagRun
TaskInstance
running
success
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
DagRun
TaskInstance
running
success
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
DagRun
TaskInstance
running
success
Whaaaats
goin’ on
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
DagRun
TaskInstance
success
success
Ok we’re
done with
that one
Let’s talk about airflow services
webserver
queue (via rabbitmq) metadata (via mysql)
DAGs
Airflow
worker
webserver
scheduler
executor
The people love
UIs, I gotta put
some data on it
DagRun
TaskInstance
success
success
Alerting is fun
SLAs
Callbacks
Email on retry/failure/success
Timeouts
SlackOperator
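A sketch of where those alerting hooks hang off a task, using 1.x-era parameter names; the email address and callback body are placeholders (a Slack or pager call would go inside the callback):

# Alerting hooks (sketch)
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def alert_someone(context):
    # Placeholder: push the failure/SLA-miss context to Slack, email, etc.
    print('problem with', context['task_instance'])

default_args = {
    'owner': 'data-team',
    'email': ['oncall@example.com'],             # assumed address
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'sla': timedelta(hours=2),                   # flag the task if it hasn't finished 2h in
    'execution_timeout': timedelta(minutes=30),  # kill the task if it runs too long
    'on_failure_callback': alert_someone,
}

dag = DAG(
    dag_id='alerting_example',
    default_args=default_args,
    schedule_interval=timedelta(days=1),
    start_date=datetime(2016, 10, 1))

do_the_thing = PythonOperator(
    task_id='do_the_thing',
    python_callable=lambda: None,
    dag=dag)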
Configuration abounds
Pools
Queues
Max_active_dag_runs
Max_concurrency
DAGs on/off
Retries
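A sketch of how a few of those knobs attach to DAGs and tasks (1.x-era names; the pool and queue here are assumptions, created separately via the UI/CLI and Celery worker routing):

# Concurrency configuration (sketch)
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='config_example',
    schedule_interval='@hourly',
    start_date=datetime(2016, 10, 1),
    max_active_runs=2,   # at most 2 DagRuns of this DAG in flight
    concurrency=8)       # at most 8 of this DAG's tasks running at once

heavy_query = PythonOperator(
    task_id='heavy_query',
    python_callable=lambda: None,
    pool='database_pool',    # assumed pool; caps concurrent DB-heavy tasks
    queue='python_workers',  # assumed Celery queue routed to specific workers
    retries=3,
    retry_delay=timedelta(minutes=5),
    dag=dag)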
Flexibility
● Operators
○ PythonOperator, BashOperator
○ TriggerDagRunOperator, BranchPythonOperator
○ EmailOperator, MySqlOperator, S3ToHiveTransfer
● Sensors
○ ExternalTaskSensor
○ HttpSensor, S3KeySensor
● Extending with your own operators and sensors
○ smart_airflow.DivePythonOperator
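smart_airflow.DivePythonOperator is Industry Dive’s own extension; as a generic illustration (not their code), a custom operator is just a BaseOperator subclass with an execute() method:

# Custom operator (illustrative sketch)
import logging
from airflow.models import BaseOperator

class GreetingOperator(BaseOperator):
    # Toy custom operator: logs a greeting with the run's execution_date
    def __init__(self, name, *args, **kwargs):
        super(GreetingOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        # context carries the execution_date, the TaskInstance, etc.
        logging.info('Hello %s, running for %s', self.name, context['execution_date'])

# usage: GreetingOperator(task_id='greet', name='PyData', dag=dag)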
smart-airflow
● Airflow doesn’t support much data transfer between tasks out of the box
○ only small pieces of data via XCom
● But we liked the file dependency/target concept of checkpoints to cache data
transformations to both save time and provide transparency
● smart-airflow is a plugin to Airflow that supports local file system or
S3-backed intermediate artifact storage
● It leverages Airflow concepts to make file location predictable
○ dag_id/task_id/execution_date
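We don't reproduce smart-airflow's actual API here, but the predictable-location idea can be sketched as a hypothetical helper that derives each checkpoint path from the task's identity and execution_date:

# Predictable intermediate artifacts (hypothetical sketch)
import os

def artifact_path(root, dag_id, task_id, execution_date):
    # One artifact location per (dag, task, execution_date)
    return os.path.join(root, dag_id, task_id, execution_date.isoformat())

def run_with_checkpoint(root, context, upstream_task_id, transform):
    # Read the upstream task's artifact, transform it, write this task's artifact
    ti = context['task_instance']
    in_path = artifact_path(root, ti.dag_id, upstream_task_id, context['execution_date'])
    out_path = artifact_path(root, ti.dag_id, ti.task_id, context['execution_date'])
    if not os.path.isdir(os.path.dirname(out_path)):
        os.makedirs(os.path.dirname(out_path))
    with open(in_path) as f_in, open(out_path, 'w') as f_out:
        f_out.write(transform(f_in.read()))
    return out_path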
Our smart-airflow backed ETL paradigm
1. Make each task as small as possible while maintaining readability
2. Preserve output for each task as a file-based intermediate artifact in a format
that is consumable by its dependent task
3. Avoid finalizing artifacts for as long as possible (e.g. update a database table
as the last and simplest step in a DAG)
Setting up Airflow at your organization
● pip install airflow to get started instantly with
○ sqlite metadata database
○ SequentialExecutor
○ Included example DAGs
● Use puckel/docker-airflow to get started quickly with
○ MySQL metadata database
○ CeleryExecutor
○ Celery Flower
○ RabbitMQ messaging backend with Management plugin
● upstart and systemd templates available through the
apache/incubator-airflow repository
Tips, tricks, and gotchas for Airflow
● Minimize your dev environment with SequentialExecutor and use
airflow test {dag_id} {task_id} {execution_date} in early
development to test tasks
● To test your DAG with the scheduler, utilize the @once schedule_interval and
clear the DagRuns and TaskInstances between tests with airflow
clear or the fancy schmancy UI
● Don’t bother with the nascent plugin system, just package your custom
operators with your DAGs for deployment
● No built-in log rotation: default logging is pretty verbose, and if you add your own, the logs can get surprisingly large. As of Airflow 1.7 you can back them up to S3 with simple configuration
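For example, the @once plus airflow test workflow above might look like this, with placeholder dag/task ids and the CLI commands shown as comments:

# Development DAG pinned to @once (sketch)
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def say_hello():
    print('hello')

dag = DAG(
    dag_id='dev_dag',
    schedule_interval='@once',   # the scheduler triggers exactly one DagRun
    start_date=datetime(2016, 10, 1))

hello = PythonOperator(task_id='hello', python_callable=say_hello, dag=dag)

# Run one task in isolation without recording state:
#   airflow test dev_dag hello 2016-10-01
# Reset DagRun/TaskInstance state between scheduler tests:
#   airflow clear dev_dag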
Tips, tricks, and gotchas for Airflow
● We have ~1,300 tasks across 8 active DAGs and 27 worker processes on an m4.xlarge AWS EC2 instance.
○ We utilize pools (which have come a long way since their buggy inception) to manage
resources
● Consider using queues if you have disparate types of work to do
○ our tasks are currently 100% Python but you could support scripts in any other language by
redirecting those messages to worker servers with the proper executables and dependencies
installed
● Your tasks must be idempotent; we’re using retries here
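Idempotent here means a task can be retried or re-run for the same execution_date without double-counting. One common pattern, sketched with a DB-API connection (psycopg2-style placeholders; the table and columns are made up), is delete-then-insert keyed on the execution_date:

# Idempotent load pattern (sketch)
def load_daily_metrics(conn, execution_date, rows):
    # Re-runnable: wipe this execution_date's slice, then insert it, in one transaction
    with conn:
        cur = conn.cursor()
        cur.execute('DELETE FROM daily_metrics WHERE ds = %s', (execution_date,))
        cur.executemany(
            'INSERT INTO daily_metrics (ds, metric, value) VALUES (%s, %s, %s)',
            [(execution_date, metric, value) for metric, value in rows])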
Let’s talk about TIME TRAVEL
schedule_interval = timedelta(days=1)
[Timeline diagrams: daily execution_dates 2016-10-01 through 2016-10-05, with data accumulating in each interval as “now” advances]
“allowable” = execution_date + schedule_interval
execution_date != start_date
DagRun for execution_date 2016-10-01:
start_date: 2016-10-02 00:00:05
end_date: 2016-10-02 00:00:10
It’s the same for TaskInstances
DagRuns:
DAG by “time”
TaskInstances:
Task by DagRun
execution_date != start_date
but…
start_date != start_date
...sorta
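In other words, the DagRun labeled with an execution_date covers the interval that starts at that date, and it only becomes runnable once that interval has closed; a little arithmetic to make the slides concrete:

# execution_date vs. actual start (sketch)
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)
execution_date = datetime(2016, 10, 1)       # the label on the DagRun

data_interval = (execution_date, execution_date + schedule_interval)
earliest_actual_start = execution_date + schedule_interval   # the "allowable" time

# The 2016-10-01 run covers data for [2016-10-01, 2016-10-02) and is first
# scheduled at 2016-10-02 00:00:00 (or shortly after), which is why its DagRun
# start_date (2016-10-02 00:00:05) != its execution_date.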
Still don’t get it?
● Time travel yourself!!!
○ Rewind
○ Ask me I’m friendly
○ Google Groups, Gitter, Airflow docs, dev mailing list archives
Thank you!
Questions?
PS there’s an appendix
Appendix
More details on the pipeline tools not covered in depth
from an earlier draft of this talk
Make
● Originally/often used to compile source code
● Defines
○ targets and any prerequisites, which are potential file paths
○ recipes that can be executed by your shell environment
● Specify batch workflows in stages with file storage as atomic intermediates
(local FS only).
● Rebuilding logic based on the existence and metadata of targets and their file dependencies (“prerequisites”)
● Supports basic conditionals and parallelism
From http://blog.kaggle.com/2012/10/15/make-for-data-scientists/
Drake
● “Make for data”
● Specify batch workflows in stages with file storage as atomic intermediates
(backend support for local FS, S3, HDFS, Hive)
● Early support for alternate branches and branch merging
● Smart rebuilding against targets and target metadata, or a quite sophisticated
command line specification system
● Parallel execution
● Workflow graph generation
● Expanding protocol support to facilitate common tasks like Python, HTTP GET, etc.
drake --graph +workflow/02.model.complete
Pydoit
● “doit comes from the idea of bringing the power of build-tools to execute any
kind of task”
● flexible build tool used to glue together pipelines
● Similar to Drake, but with much more support for Python as opposed to bash
● Specify actions, file_dep, and targets for tasks
● Smart rebuilding based on target/file_dep metadata, or your own custom logic
● Parallel execution
● Watcher process that triggers based on target/file_dep file changes
● Can define failure/success callbacks against the project
Luigi
● Specify batch workflows with different job type classes including
postgres.CopyToTable, hadoop.JobTask, PySparkTask
● Specify dependencies with the class requires() method and record via
output() method targets against supported backends such as S3, HDFS,
local or remote FS, MySQL, Redshift, etc.
● Event system provided to add callbacks to task returns, basic email alerting
● A central task scheduler (luigid) that provides a web frontend for task
reporting, prevents duplicate task execution, and basic task history browsing
● Requires a separate triggering mechanism to submit tasks to the central
scheduler
● RangeDaily and RangeHourly parameters as a dependency for backfill or
recovery from extended downtime
AWS Data Pipeline
● Cloud scheduler and resource instigator on hosted AWS hardware
● Can define data pipeline jobs, some of which come built-in (particularly
AWS-to-AWS data transfer, complete with blueprints), but you can run custom
scripts by hosting them on AWS
● Get all the AWS goodies: CloudWatch, IAM roles/policies, Security Groups
● Spins up target computing instances to run your pipeline activities per your
configuration
● No explicit file based dependencies
Pinball
● Central scheduler server with monitoring UI built in
● Parses a file based configuration system into JobToken or EventToken
instances
● Tokens are checked against Events associated with upstream tokens to
determine whether or not they are in a runnable state.
● Tokens specify their dependencies to each other - not file based.
● More of a task manager abstraction than the other tools, as it doesn't have a lot of job templates built in.
● Overrun policies deal with dependencies on past successes