Workflow Hacks! #1
Taro L. Saito

leo@treasure-data.com
Dec. 14, 2015
dots. Tokyo, Japan
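Survey
• After the event, we will send a survey by email
• Questions
• What kind of system are you currently using?
• What problems do you want to solve with workflows?
• Respondents will be entered in a drawing for a Treasure Data hoodie!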
About Me: Taro L. Saito
2007 University of Tokyo. Ph.D.
XML DBMS, Transaction Processing
Relational-Style XML Query [SIGMOD 2008]
~ 2014 Assistant Professor at University of Tokyo
Genome Science Research
- Big Data Processing
- Distributed Computing
2014.03~ Treasure Data, Inc. Tokyo
2015.07~ Treasure Data, Inc. Mountain View, CA
Cloud Platform for Data Analytics
• Importing 1,000,000~ records / sec.
• Presto (Distributed SQL engine)
• 50,000~ queries / day
• Processing 10 trillion records / day
• http://qiita.com/xerial/items/a9093b60062f2c613fda
[Figure: platform overview with the labels Import, Store, Analyze with Presto/Hive (Distributed SQL Engine), Export, Enterprise Data, and BI]
Workflow Fundamental Features
• Dependency management
• task1 -> task2 -> task3 …
• Scheduling
• Execution monitoring
• State management
• Error handling
• Easy access to logs
• Notification
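
The features listed above can be sketched in a few lines of Python. This is a minimal illustration, not how any of the tools discussed here is implemented: `run_workflow`, the state names, and the `[notify]` message are hypothetical, and only `graphlib.TopologicalSorter` (Python 3.9+) is a real library call. It resolves dependencies, tracks per-task state, skips downstream tasks on failure, and emits a notification.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, deps):
    """Run tasks in dependency order and record each task's state.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> set of task names it depends on
    """
    state = {}
    # static_order() yields every task after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        # State management: skip a task if any upstream task did not succeed.
        if any(state.get(d) != "success" for d in deps.get(name, ())):
            state[name] = "skipped"
            continue
        try:
            tasks[name]()
            state[name] = "success"
        except Exception as e:
            # Error handling and notification (here just a log line).
            state[name] = "failed"
            print(f"[notify] {name} failed: {e}")
    return state


# task1 -> task2 -> task3
log = []
tasks = {
    "task1": lambda: log.append("task1"),
    "task2": lambda: log.append("task2"),
    "task3": lambda: log.append("task3"),
}
deps = {"task1": set(), "task2": {"task1"}, "task3": {"task2"}}
state = run_workflow(tasks, deps)
```

Scheduling and log access are left out; a real workflow manager would persist `state` so that a rerun can resume from the failed task.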
9
Workflow Tools
• Workflow Management Tools
• Python: Luigi, Airflow, pinball
• For Hadoop: Oozie (XML)
• Script-based: Makefile, Azkaban
• Biological Science: Galaxy (Web UI), nextflow
• Domestic: JP1, Hinemos
• Dataflow DSL
• Spark, Flink, DryadLINQ, TensorFlow
• Cascading (Java -> MR), Scalding (Scala -> MR)
Dataflow DSL
• Translate this data processing program
• into a cluster computing program
[Figure: dataset A (partitions A0, A1, A2) is mapped by f to B (partitions B0, B1, B2), which is reduced by g to C]
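
As a rough sketch of the idea on this slide, the user writes only the functions f and g, and an engine decides how to apply them to partitions. The `run_dataflow` helper below is hypothetical and runs in a single process; a real engine would ship each partition to a different node.

```python
from functools import reduce


def run_dataflow(partitions, f, g):
    """Apply f as a map stage per partition, then g as a reduce stage."""
    # Map stage: each partition is processed independently (A_i -> B_i).
    mapped = [[f(x) for x in part] for part in partitions]
    # Reduce stage: combine all mapped values with g (B -> C).
    flat = [y for part in mapped for y in part]
    return mapped, reduce(g, flat)


A = [[1, 2], [3, 4], [5]]  # A0, A1, A2
B, C = run_dataflow(A, f=lambda x: x * 10, g=lambda a, b: a + b)
```

Because the map stage never looks across partitions, it can be parallelized without changing the program the user wrote.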
Redbook: Dataflow Engines
• Chapter 5: Large-Scale Dataflow Engine, by Peter Bailis
• http://www.redbook.io/ch5-dataflow.html
• DryadLINQ
• Most influential interface for dataflow DSL
• SQL-like operation
• Functional style
• Spark
• SparkSQL
• 70% of Spark accesses
• Dataset API
• Shift to the dataframe based API
Dataflow -> Execution Plan
• Example - Hive: SQL to MapReduce
• Mapping SQL stages into a MapReduce program
• SELECT page, count(*) FROM weblog GROUP BY page
[Figure: execution plan reading from and writing to HDFS, with stages labeled split, map, reduce, and merge over partitions A0-A2 and B0-B3. Plan nodes: TableScan(weblog), GroupBy(hash(page)), count(weblog of a page), result]
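
The mapping from the GROUP BY query to map and reduce stages can be sketched as follows. This is a single-process illustration, not Hive's actual planner; the function names, the sample records, and the choice of two reducers are assumptions made for the example.

```python
from collections import defaultdict


def map_stage(split):
    # TableScan(weblog): emit one (page, 1) pair per record in the split.
    return [(rec["page"], 1) for rec in split]


def shuffle(pairs, num_reducers):
    # GroupBy(hash(page)): route every pair with the same page to one reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for page, one in pairs:
        buckets[hash(page) % num_reducers][page].append(one)
    return buckets


def reduce_stage(bucket):
    # count(weblog of a page): sum the ones collected for each page.
    return {page: sum(ones) for page, ones in bucket.items()}


# Three input splits of a hypothetical weblog table.
splits = [
    [{"page": "/top"}, {"page": "/about"}],
    [{"page": "/top"}],
    [{"page": "/top"}, {"page": "/about"}],
]
pairs = [kv for split in splits for kv in map_stage(split)]
result = {}
for bucket in shuffle(pairs, num_reducers=2):
    result.update(reduce_stage(bucket))
```

Hashing on `page` guarantees that all records for one page reach the same reducer, so each reducer can count its pages without coordinating with the others.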
Workflows
[Figure: workflow graph with nodes A, B, C, D, E, F, G and functions f, g]
Hadoop is not enough
• C. Olston et al. [SIGMOD 2011]
• continuous processing
• independent scheduling
• Incremental processing
• Google Percolator [OSDI 2010]
• Naiad - Differential Dataflow, Microsoft [SOSP 2013]
Continuous Processing
• The Dataflow Model
• Akidau et al., Google [VLDB 2015]
• Unbounded data processing
• late-coming data
• Integration of
• batch processing
• accumulation
Cluster Computing with Dryad, M. Budiu, 2008
Workflow Hacks!
Airflow
• Best practices with Airflow - An open source platform for workflows & schedules (Nov 2015)
• At Silicon Valley Data Engineering Meetup
• https://youtu.be/dgaoqOZlvEA
Workflow Development
• Programmatic
• Generate workflows by code
• Configuration as Code
• Workflow reuse/overwrite
• object oriented
• Parameterization
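
A small sketch of what these points can look like in practice, assuming a workflow is modeled as a class whose methods are tasks. The class names, table names, and query strings are invented for illustration and do not come from any particular tool.

```python
class DailyReport:
    """A workflow defined as a class; each method is one task."""

    def __init__(self, date, table="weblog"):
        # Parameterization: the same definition serves any date or table.
        self.date = date
        self.table = table

    def extract(self):
        return f"SELECT * FROM {self.table} WHERE dt = '{self.date}'"

    def summarize(self):
        return f"summary of {self.table} for {self.date}"

    def tasks(self):
        return [self.extract(), self.summarize()]


class DailyErrorReport(DailyReport):
    """Workflow reuse: inherit everything, override a single task."""

    def extract(self):
        return super().extract() + " AND status >= 500"


# Generate workflows by code: one instance per date.
workflows = [DailyReport(d) for d in ("2015-12-13", "2015-12-14")]
error_wf = DailyErrorReport("2015-12-14", table="access_log")
```

The subclass changes one task and leaves the rest of the workflow untouched, which is the reuse/overwrite pattern named above.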
Luigi
• Workflow management with Luigi
• http://qiita.com/k24d/items/fb9bed08423e6249d376
Nextflow
• http://www.nextflow.io/
Dataflow DSL vs Workflow DSL
• Dataflow
• A -> B -> C -> …
• Data dependencies
• Workflow
• Task A -> Task B -> Task C -> …
• Task dependencies
• Data transfer is optional (through file or DB)
• + Scheduling
• + Task names
• For monitoring, redo, etc.
Weavelet (wvlet)
• Object-oriented workflow DSL for Scala
• Workflow reuse, extension, override
• Parameterization
• Function := Task, Workflow := Class
Isolating DAG generation and its execution
• Alternatives to MR
• Tez
• Pig on Spark https://issues.apache.org/jira/browse/PIG-4059
• Asakusa on Hadoop, Spark
[Figure: DSL generates DAG; the DAG runs on Local, Hadoop, or Spark and produces Result]
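
One way to picture this separation, as a sketch under simplifying assumptions: the DAG is plain data produced by `build_dag`, and separate functions consume it. All names here are hypothetical, and the sketch only handles a linear chain in which each operator receives the previous operator's output.

```python
from graphlib import TopologicalSorter


def build_dag():
    """DAG generation only: returns a description and runs nothing."""
    ops = {
        "scan": lambda _: [3, 1, 2],
        "sort": lambda xs: sorted(xs),
        "head": lambda xs: xs[:2],
    }
    deps = {"scan": set(), "sort": {"scan"}, "head": {"sort"}}
    return ops, deps


def explain(dag):
    """A backend that only inspects the DAG, without executing it."""
    _, deps = dag
    return " -> ".join(TopologicalSorter(deps).static_order())


def run_local(dag):
    """A local backend: evaluates the operators in dependency order."""
    ops, deps = dag
    value = None
    for name in TopologicalSorter(deps).static_order():
        value = ops[name](value)
    return value


dag = build_dag()
plan = explain(dag)
result = run_local(dag)
```

Because `build_dag` has no knowledge of how the DAG will be run, a Hadoop or Spark backend could in principle be added as another function next to `run_local` without touching the workflow definition.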
Stream DSL
• Add "moving stream" support to Dataflow DSL
• "moving" streams and "resting" datasets
• Example
• Spark Streaming
• Spark DSL + Micro-batch for stream
• Microsoft Azure Stream SQL
• Windowing support for moving data
• Norikra
• Stream processing with SQL
• Reactive programming
• ReactiveX (Netflix), Akka Streams (beta) <- Stream DSL (DAG)
• Back-pressure support
• Controlling data transfer speed from receiver side
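
The micro-batch idea can be sketched with a plain Python generator. This is an illustration only: `micro_batches` and `events` are hypothetical names, and batches are cut by record count here for determinism, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a moving stream into small resting datasets (lists)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def events():
    # Stand-in for an unbounded source; finite here so the example ends.
    for i in range(7):
        yield i


# Each batch is processed with an ordinary batch operation (sum).
sums = [sum(batch) for batch in micro_batches(events(), batch_size=3)]
```

Generators are pull-based: `events()` produces a record only when the consumer asks for the next one, which is a very simple form of the receiver-side speed control mentioned above.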
Task Execution Retry
• Design patterns for retries and idempotency
• http://frsyuki.hatenablog.com/entry/2014/06/09/164559
• System failures
• Process is not responding
• network, hardware failures
• Middleware failures
• provisioning failures, missing components
• User failures
• Wrong configuration
• Programming error
Retry Example
• Example: Task calling a REST API /create/xxx
• Client: First attempt
• Server returns 200 Success
• But the client failed to receive the status code
• Client retries the task
• Gets a 409 Conflict error (entry xxx has already been created)
• Solution (Application side)
• Handle 409 error as success in the client (idempotent execution)
• Stricter approach
• Making xxx unique for each request
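
The scenario above can be reproduced with a small simulation. `FakeServer` and `create_idempotent` are hypothetical stand-ins for the REST API and its client; no real HTTP library is involved. The server creates the entry on the first attempt but the response is lost, so the retry sees 409.

```python
class FakeServer:
    """Simulates /create/xxx: the first response is lost in transit."""

    def __init__(self):
        self.entries = set()
        self.drop_next_response = True

    def create(self, name):
        if name in self.entries:
            return 409  # Conflict: the entry already exists
        self.entries.add(name)
        if self.drop_next_response:
            # The entry was created, but the client never sees the 200.
            self.drop_next_response = False
            raise ConnectionError("response lost")
        return 200


def create_idempotent(server, name, max_attempts=3):
    """Retry the create call, treating 409 as success."""
    for attempt in range(1, max_attempts + 1):
        try:
            status = server.create(name)
        except ConnectionError:
            continue  # outcome unknown: retry
        if status in (200, 409):
            return attempt  # created now, or by an earlier attempt
        raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError("gave up after retries")


server = FakeServer()
attempts = create_idempotent(server, "xxx")
```

Treating 409 as success is only safe when no other client can create the same entry. Making xxx unique per request removes that ambiguity, because a conflict can then only come from the client's own earlier attempt.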
Fault Tolerance
• Presto: Distributed query engine developed by Facebook
• Uses HTTP data transfer
• No fault-tolerance
• 99.5% of queries finish without any failure
• For queries processing 10 billion or more rows => drops to 85%
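[Figure: the execution plan from the Hive example, with stages labeled split, map, reduce, and merge over partitions A0-A2 and B0-B3. Plan nodes: TableScan(weblog), GroupBy(hash(page)), count(weblog of a page), result]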
Summary
• Recent workflow tools
• Driven by Python community
• Because of this book! (=>)
• Airflow, Luigi, etc.
• Workflow manager
• Handle system failures, monitoring
• Workflow development
• DAG based DSL (dataflow, workflow, stream processing) -> Execution
• Does not cover application logic errors
• Idempotent execution
• Requires splitting large tasks into smaller ones