Automating Workflows
for Analytics Pipelines
Sadayuki Furuhashi
Open Source Summit 2017
Sadayuki Furuhashi
A founder of Treasure Data, Inc. located in Silicon Valley.
OSS projects I founded:
An open-source hacker.
GitHub: @frsyuki
What's a Workflow Engine?
• Automates your manual operations.
• Load data → Clean up → Analyze → Build reports
• Get customer list → Generate HTML → Send email
• Monitor server status → Restart on abnormality
• Back up database → Alert on failure
• Run test → Package it → Deploy (Continuous Delivery)
Challenge: Multiple Cloud & Regions
On-Premises
Different API,
Different tools,
Many scripts.
Challenge: Multiple DB technologies
Amazon S3
Amazon Redshift
Amazon EMR
> Hi!
> I'm a new technology!
Challenge: Modern complex data analytics
Ingest
• Application logs
• User attribute data
• Ad impressions
• 3rd-party cookie data
Enrich
• Removing bot access
• Geo location from IP address
• Parsing User-Agent
• JOIN user attributes to event logs
Model
• A/B Testing
• Funnel analysis
• Segmentation analysis
• Machine learning
Load
• Creating indexes
• Data partitioning
• Data compression
• Statistics collection
Utilize
• Recommendation API
• Realtime ad bidding
• Visualize using BI applications
Ingest → Enrich → Model → Load → Utilize
Traditional "false" solution
#!/bin/bash
./run_mysql_query.sh
./load_facebook_data.sh
./rsync_apache_logs.sh
./start_emr_cluster.sh
for query in emr/*.sql; do
  ./run_emr_hive "$query"
done
./shutdown_emr_cluster.sh
./run_redshift_queries.sh
./call_finish_notification.sh
> Poor error handling
> Write once, Nobody reads
> No alerts on failure
> No alerts on overly long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
Solution: Multi-Cloud Workflow Engine
Solves
> Poor error handling
> Write once, Nobody reads
> No alerts on failure
> No alerts on overly long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
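As a rough sketch, the bash script from the previous slide could be expressed as a Digdag workflow along these lines. The file name pipeline.dig, the failure-notification script, the two query file names, and the assumption that the three ingest scripts are independent of each other are all illustrative and not from the talk. `sh>`, `for_each>`, `_retry`, `_parallel`, and `_error` are standard Digdag constructs.

```yaml
# pipeline.dig (hypothetical file name)
timezone: UTC

# Runs when any task in this workflow fails (alert on failure).
_error:
  sh>: ./call_failure_notification.sh   # hypothetical script

+ingest:
  _parallel: true   # assumption: the three loads do not depend on each other
  _retry: 3         # retry the group up to 3 times if a child task fails
  +mysql:
    sh>: ./run_mysql_query.sh
  +facebook:
    sh>: ./load_facebook_data.sh
  +apache_logs:
    sh>: ./rsync_apache_logs.sh

+emr:
  +start:
    sh>: ./start_emr_cluster.sh
  +queries:
    for_each>:
      query: [emr/a.sql, emr/b.sql]   # hypothetical query files
    _do:
      sh>: ./run_emr_hive ${query}
  +shutdown:
    sh>: ./shutdown_emr_cluster.sh

+redshift:
  sh>: ./run_redshift_queries.sh

+notify:
  sh>: ./call_finish_notification.sh
```

Each top-level task starts only after the previous one has finished, so the ordering of the original script is preserved while the ingest group gains retries and parallelism.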
Example in our case
1. Dump data to BigQuery
2. Load all tables to Treasure Data
3. Run queries
4. Create reports on Tableau Server (on-premises)
5. Notify on Slack
Workflow constructs
Unite Engineering & Analytic Teams
+wait_for_arrival:
  s3_wait>: |
    bucket/www_${session_date}.csv
+load_table:
  redshift>: scripts/copy.sql
Powerful for Engineers
> Comfortable for advanced users
Friendly for Analysts
> Still straightforward for analysts to understand & leverage workflows
+ is a task
> is an operator
${...} is a variable
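A minimal sketch of the three constructs together, assuming a user-defined variable named bucket (hypothetical, not from the talk). `${session_date}` is one of Digdag's built-in variables and expands to the session date in YYYY-MM-DD form; `echo>` is a built-in operator that prints a message.

```yaml
timezone: UTC

_export:
  bucket: my-bucket   # hypothetical variable, visible to every task below

# "+show_path" is a task, "echo>" is its operator,
# and ${bucket} / ${session_date} are variables expanded at run time.
+show_path:
  echo>: "waiting for ${bucket}/www_${session_date}.csv"
```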
Operator library
_export:
  td:
    database: workflow_temp
+task1:
  td>: queries/open.sql
  create_table: daily_open
+task2:
  td>: queries/close.sql
  create_table: daily_close
Standard libraries
redshift>: runs Amazon Redshift queries
emr>: creates/shuts down a cluster & runs steps
s3_wait>: waits until a file is put on S3
pg>: runs PostgreSQL queries
td>: runs Treasure Data queries
td_for_each>: repeats task for result rows
mail>: sends an email
Open-source libraries
You can release & use open-source
operator libraries.
Parallel execution
+load_data:
  _parallel: true

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql
Parallel execution
Tasks under the same group run in parallel if the _parallel option is set to true.
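To illustrate how a parallel group combines with sequential steps, here is a small sketch. The +join_tables task and its SQL file are hypothetical additions; the point is that a task placed after the group starts only once every child of the group has finished.

```yaml
+load_data:
  _parallel: true          # children of this group run at the same time

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql

# Hypothetical follow-up task: runs after both loads above have completed.
+join_tables:
  redshift>: queries/join_users_items.sql
```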
Loops & Parameters
+send_email_to_active_users:
  td_for_each>: list_active.sql
  _do:
    +send:
      mail>: template.txt
      to: ${td.for_each.addr}
Parameter
A task can propagate parameters to following tasks.
Loop
Generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
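The same looping idea can be sketched without a database using Digdag's generic `for_each>` operator. The region names are made up for illustration; one subtask is generated per list element, and the loop variable is available inside `_do`.

```yaml
+per_region:
  for_each>:
    region: [us, eu, jp]   # hypothetical values; one subtask per element
  _parallel: true          # optional: run the generated subtasks concurrently
  _do:
    echo>: "processing region ${region} for ${session_date}"
```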
Grouping workflows...
Ingest → Enrich → Model → Load → Utilize
(Diagram: twelve ungrouped +task nodes laid out under the stages.)
Grouping workflows
Ingest → Enrich → Model → Load → Utilize
(Diagram: the same tasks organized into the groups +ingest, +enrich, +model, +basket_analysis, +learn, and +load.)
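Grouping is expressed by nesting tasks. The sketch below is one possible layout under assumptions: the leaf task names, the `echo>` placeholders, and the choice to nest +basket_analysis and +learn under +model are illustrative and not taken from the slide.

```yaml
+ingest:
  +app_logs:
    echo>: "ingest application logs"
  +user_attributes:
    echo>: "ingest user attribute data"

+enrich:
  +remove_bots:
    echo>: "remove bot access"

+model:
  +basket_analysis:
    +prepare:
      echo>: "prepare baskets"
    +analyze:
      echo>: "run basket analysis"
  +learn:
    echo>: "train model"

+load:
  +create_indexes:
    echo>: "create indexes"
```

A group counts as finished only when all of its children have finished, so +enrich does not start until everything under +ingest is done.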
Pushing workflows to a server with Docker image
schedule:
  daily>: 01:30:00
timezone: Asia/Tokyo
_export:
  docker:
    image: my_image:latest
+task:
  sh>: ./run_in_docker
Digdag server
> Develop on laptop, push it to a server.
> Workflows run periodically on a server.
> Backfill
> Web editor & monitor
Docker
> Install scripts & dependencies in a Docker image, not on a server.
> Workflows can run anywhere, including a developer's laptop.
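A scheduled workflow can also address the "no alerts on overly long runs" problem through Digdag's `sla:` block. This is a sketch under assumptions: the 06:00 deadline and the notify_late.sh script are hypothetical.

```yaml
timezone: Asia/Tokyo

schedule:
  daily>: 01:30:00

# If the session has not finished by 06:00, run the +notice task.
sla:
  time: 06:00
  +notice:
    sh>: ./notify_late.sh   # hypothetical alert script

+task:
  sh>: ./run_in_docker
```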
Demo
Digdag is production-ready
It's just like a web application.
Digdag client → Digdag server (API & scheduler & executor, Visual UI) → PostgreSQL (all task state)
Digdag is production-ready
Stateless servers + Replicated DB
Digdag client → HTTP Load Balancer → Digdag servers (API & scheduler & executor, Visual UI) → PostgreSQL (replicated, HA; all task state)
Digdag is production-ready
Isolating API and execution for reliability
Digdag client → HTTP Load Balancer → Digdag servers (API) → PostgreSQL (HA; all task state)
Digdag servers (scheduler & executor) → PostgreSQL (HA; all task state)
Digdag at Treasure Data
3,600 workflows run every day
28,000 tasks run every day
850 active workflows
400,000 workflow executions in total
Digdag & Open Source
Learning from my OSS projects
• Make it pluggable!
700+ plugins in 6 years
200+ plugins in 3 years
input/output, parser/formatter, decoder/encoder, filter, and executor
input/output, and filter
70+ implementations in 8 years
Digdag also has plugin architecture
32 operators
7 schedulers
2 command executors
1 error notification module
Sadayuki Furuhashi
https://digdag.io
Visit my website!
