Automating Large-Scale Data Processing and Error Handling with Digdag
Sadayuki Furuhashi
Workflow Engines Night
Sadayuki Furuhashi
A founder of Treasure Data, Inc. located in Silicon Valley.
OSS projects I founded:
An open-source hacker.
GitHub: @frsyuki
What’s workload automation?
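• Automating any kind of manual work
> Automating batch data analysis:
• Data loading - ETL - JOIN - aggregation - report generation - notification
> Automating email delivery
• Fetch the address list - narrow down the recipients - generate the body from a template - send the emails - notify on completion
> Automating data integration between systems
> Automating management and provisioning of servers, databases, and network devices
> Automating testing and deployment (CI)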
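Required features
• Basic features
> Run tasks in dependency order
> Run periodically
> Run when a file is created
> Run past sessions in bulk (backfill)
> Run with variables such as the time
• Error handling
> Notify on failure
> Resume from the point of failure
• Monitoring
> Notify when a run takes too long
> Visualize task execution time
> Collect and store execution logs
• Speed-up
> Run tasks in parallel
> Limit the number of concurrent executions
• Development support
> Version control of workflows
> Workflow development with a GUI
> Libraries that make routine processing easy to run
> Reproducibility: runs the same way locally and on a server (if it works locally, it works on the server)
> Run tasks using Docker images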
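As a small illustration of the first basic feature (running tasks in dependency order), here is a minimal Python sketch. It is not how Digdag itself is implemented; the task names and the `run_in_dependency_order` helper are hypothetical, chosen to mirror the batch-analysis chain listed earlier.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "load": set(),
    "etl": {"load"},
    "join": {"etl"},
    "aggregate": {"join"},
    "report": {"aggregate"},
    "notify": {"report"},
}


def run_in_dependency_order(deps, run):
    """Call run(task) for every task, never before its dependencies."""
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        run(task)
    return order


executed = []
order = run_in_dependency_order(dependencies, executed.append)
print(order)
```

Because this example graph is a single chain, there is only one valid order; with branching graphs, any order that respects the dependencies would be acceptable.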
Products
OSS
• Makefile
• Jenkins
• Luigi
• Airflow
• Rundeck
• Azkaban
• Grid Engine
• OpenLava
• Obsidian Scheduler
• Hinemos
• StackStorm
• Platform LSF
Proprietary
• Tivoli Workload Scheduler (IBM)
• CA Workload Automation (CA Technologies)
• JP1/AJS3 (Hitachi)
• Systemwalker Job Workload Server (Fujitsu)
• Workload Automation (Automic)
• BatchMan (Honico)
• Control-M (BMC)
• Schedulix
• ServiceNow Workflow
Challenge: Multiple Cloud & Regions
On-Premises
Different API,
Different tools,
Many scripts.
Challenge: Multiple DB technologies
Amazon S3
Amazon Redshift
Amazon EMR
> Hi!
> I'm a new technology!
Challenge: Modern complex data analytics
Ingest
> Application logs
> User attribute data
> Ad impressions
> 3rd-party cookie data
Enrich
> Removing bot access
> Geo location from IP address
> Parsing User-Agent
> JOIN user attributes to event logs
Model
> A/B Testing
> Funnel analysis
> Segmentation analysis
> Machine learning
Load
> Creating indexes
> Data partitioning
> Data compression
> Statistics collection
Utilize
> Recommendation API
> Realtime ad bidding
> Visualize using BI applications
Ingest - Enrich - Model - Load - Utilize
Traditional "false" solution
#!/bin/bash
./run_mysql_query.sh
./load_facebook_data.sh
./rsync_apache_logs.sh
./start_emr_cluster.sh
for query in emr/*.sql; do
  ./run_emr_hive "$query"
done
./shutdown_emr_cluster.sh
./run_redshift_queries.sh
./call_finish_notification.sh
> Poor error handling
> Write once, Nobody reads
> No alerts on failure
> No alerts on too long run
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
Solution: Multi-Cloud Workflow Engine
Solves
> Poor error handling
> Write once, Nobody reads
> No alerts on failure
> No alerts on too long run
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
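To make three of these items concrete (retrying on errors, alerting on failure, and resuming), here is a minimal Python sketch. It is only an illustration of the ideas, not Digdag's implementation; `run_with_retry`, `run_pipeline`, and the step names are hypothetical.

```python
import time


def run_with_retry(task, retries=3, delay=0.0, on_failure=print):
    """Run task(); retry up to `retries` extra times, then alert and re-raise."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == retries:
                on_failure(f"task failed after {attempt + 1} attempts: {exc}")
                raise
            time.sleep(delay)


def run_pipeline(steps, done, **retry_options):
    """Run (name, callable) steps in order, skipping names already in `done`."""
    for name, task in steps:
        if name in done:
            continue  # resume: this step succeeded in an earlier run
        run_with_retry(task, **retry_options)
        done.add(name)


# Usage: "extract" finished in a previous run; "load" fails once, then succeeds.
calls = {"n": 0}


def flaky_load():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("temporary error")


ran = []
done = {"extract"}
steps = [("extract", lambda: ran.append("extract")), ("load", flaky_load)]
run_pipeline(steps, done, retries=2)
print(sorted(done))
```

Persisting the `done` set somewhere durable (a file or a database) is what would let a later run pick up where the failed one stopped.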
Example in our case
1. Dump data to BigQuery
2. Load all tables to Treasure Data
3. Run queries
4. Create reports on Tableau Server (on-premises)
5. Notify on Slack
Workflow constructs
Key constructs
Operators
> Packaged knowledge to run tasks.
> e.g. pg>, s3>, gcs>, emr>, td>, py>, rb>
Parameters
> Programmable variables for operators.
> e.g. ${session_time}, ${workflow_name}, ${JSON.parse(http.last_content)}
Task groups
> Sequence of tasks to organize & modularize workflows.
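The idea behind parameters can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it only substitutes plain and dotted names, whereas the `${JSON.parse(...)}` example above shows that real parameters can hold full expressions. The `render` helper and the sample values are hypothetical.

```python
import re


def render(template, params):
    """Replace each ${name} with params[name]; dotted names walk nested dicts."""

    def lookup(match):
        value = params
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\$\{([A-Za-z_][\w.]*)\}", lookup, template)


# Hypothetical parameter values for one session.
params = {"session_date": "2017-06-01", "td": {"database": "workflow_temp"}}

print(render("bucket/www_${session_date}.csv", params))
print(render("${td.database}.daily_open", params))
```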
Operator library
_export:
  td:
    database: workflow_temp

+task1:
  td>: queries/open.sql
  create_table: daily_open

+task2:
  td>: queries/close.sql
  create_table: daily_close
Standard libraries
redshift>: runs Amazon Redshift queries
emr>: creates/shuts down a cluster & runs steps
s3_wait>: waits until a file is put on S3
pg>: runs PostgreSQL queries
td>: runs Treasure Data queries
td_for_each>: repeats task for result rows
mail>: sends an email
Open-source libraries
You can release & use open-source operator libraries.
Task grouping & parallel execution
+load_data:
  _parallel: true

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql
Parallel execution
Tasks in the same group run in parallel if the _parallel option is set to true.
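The effect of a parallel flag on a task group can be sketched in Python with a thread pool. This is an illustration under assumptions, not Digdag's executor; `run_group` and its `max_workers` limit are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_group(tasks, parallel=False, max_workers=None):
    """Run a dict of name -> callable and return a dict of name -> result.

    With parallel=False the children run one after another, in order.
    With parallel=True they are all submitted to a thread pool, and
    max_workers caps how many run at the same time.
    """
    if not parallel:
        return {name: task() for name, task in tasks.items()}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(task) for name, task in tasks.items()}
        # result() re-raises any exception raised inside the child task.
        return {name: future.result() for name, future in futures.items()}


# Usage: two independent load tasks, as in the group above.
tasks = {
    "load_users": lambda: "users loaded",
    "load_items": lambda: "items loaded",
}
print(run_group(tasks, parallel=True, max_workers=2))
```

The group as a whole finishes only when every child has finished, which is what lets a following task depend on the group rather than on each child.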
Grouping workflows...
Ingest - Enrich - Model - Load - Utilize
[Diagram: a flat set of ungrouped +task nodes spread across the five stages]
Grouping workflows
Ingest - Enrich - Model - Load - Utilize
[Diagram: the same +task nodes organized into the groups +ingest, +enrich, +model, +basket_analysis, +learn, and +load]
Parameters & Loops
+send_email_to_active_users:
  td_for_each>: list_active.sql
  _do:
    +send:
      mail>: template.txt
      to: ${td.each.addr}
Parameter
A task can propagate parameters to the following tasks.
Loop
Generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
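The loop idea, one generated subtask per result row, can be sketched as follows. This Python snippet is only an analogy: it uses Python's own `{addr}` placeholders, and `expand_for_each`, the field names, and the sample rows are hypothetical.

```python
def expand_for_each(rows, subtask_template):
    """Build one concrete subtask per row by filling in the template's fields."""
    subtasks = []
    for index, row in enumerate(rows):
        subtask = {
            key: value.format(**row) if isinstance(value, str) else value
            for key, value in subtask_template.items()
        }
        subtask["name"] = f"+send_{index}"
        subtasks.append(subtask)
    return subtasks


# Hypothetical query result: one row per active user.
rows = [{"addr": "a@example.com"}, {"addr": "b@example.com"}]
template = {"operator": "mail", "body": "template.txt", "to": "{addr}"}

for subtask in expand_for_each(rows, template):
    print(subtask["name"], subtask["to"])
```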
Unite Engineering & Analytic Teams
+wait_for_arrival:
  s3_wait>: |
    bucket/www_${session_date}.csv

+load_table:
  redshift>: scripts/copy.sql
Powerful for Engineers
> Comfortable for advanced users
Friendly for Analysts
> Still straightforward for analysts to understand & leverage workflows
Pushing workflows to a server with Docker image
schedule:
  daily>: 01:30:00

timezone: Asia/Tokyo

_export:
  docker:
    image: my_image:latest

+task:
  sh>: ./run_in_docker
Digdag server
> Develop on laptop, push it to a server.
> Workflows run periodically on a server.
> Backfill
> Web editor & monitor
Docker
> Install scripts & dependencies in a Docker image, not on a server.
> Workflows can run anywhere, including a developer's laptop.
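Backfill, mentioned above, amounts to running the workflow once for each past session. A minimal Python sketch of enumerating daily sessions follows; the `backfill_sessions` helper and the dates are hypothetical, and the snippet says nothing about how a Digdag server schedules them.

```python
from datetime import date, timedelta


def backfill_sessions(start, end):
    """Return every daily session date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


# Usage: re-run a daily workflow for four past days across a month boundary.
for session in backfill_sessions(date(2017, 1, 30), date(2017, 2, 2)):
    print(session.isoformat())
```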
Amazon ECR Dockerfile & Operator plugin template
• https://github.com/myui/dockernized-digdag-server
• https://github.com/myui/digdag-plugin-example
$ docker pull myui/digdag-server:latest
$ docker run -p 65432:65432 myui/digdag-server
open http://localhost:65432/
Demo
Real-world workflows
Digdag at Treasure Data
3,600 workflows run every day
28,000 tasks run every day
850 active workflows
400,000 workflow executions in total
Example: Customer analysis & alerting
timezone: UTC

schedule:
  daily>: 09:00:00

_export:
  mail:
    from: 'bizops@example.com'
  td:
    database: summary

+reports:
  td_run>: prepare_users_data

+for_each_users:
  td_for_each>: inactive_users.sql
  _do:
    +alert_email:
      mail>: mail.txt
      subject: 'Inactive Alert: ${td.each.account_name}'
      to: ['${td.each.owner_email}']
Example: Customer analysis & alerting (mail.txt)
Usage: ${td.each.percentage}%
Account Name: ${td.each.account_name}
Type: Purchase
${td.each.salesforce_link}
Region: ${td.each.region}
Owner: ${td.each.owner_name} (${td.each.owner_email})
Account: ${td.each.account_name}
Status: ${td.each.activity_status}
Actual: ${td.each.total_purchase}
Limit: ${td.each.monthly_purchase_limit}
Example: Backend of a BI app
timezone: <%= ev @timezone %>
<% if @schedule then %>
schedule: <%= ev @schedule %>
<% end %>

_export:
  td:
    database: <%= ev @database %>
  all_mode: ${(moment(session_time).dayOfYear() - 1) % 3 == 0}

+all_load:
  if>: ${all_mode == "true"}
  _do:
    +create_all_records:
      td>: segment_web_access.sql
      create_table: "cdp_tmp_web_access"
      _retry: 5

    +rename_tmp_table:
      td_ddl>:
      rename_tables:
        - from: "cdp_tmp_web_access"
          to: "cdp_web_access"
      _retry: 5

+get_all_count:
  td>: incremental_count.sql
  table_name: "cdp_web_access"
  store_last_results: true
  _retry: 5

+syndicate_loop:
  loop>: ${Math.ceil(td.last_results.total_count / 20000)}
  _do:
    td>: incremental_select.sql
    table_name: "cdp_web_access"
    result_connection: cdp_web_access
    result_settings:
      id: 1
    _retry: 5
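The two expressions in this workflow are plain arithmetic, and it can help to work them through. The Python sketch below mirrors them under the assumption that `dayOfYear()` is 1-based; `is_all_mode` and `loop_count` are hypothetical names, and the dates and row counts are made up for illustration.

```python
import math
from datetime import date


def is_all_mode(session_date):
    """True on every third day of the year: day 1, 4, 7, ..."""
    day_of_year = session_date.timetuple().tm_yday  # 1-based day of the year
    return (day_of_year - 1) % 3 == 0


def loop_count(total_count, batch_size=20000):
    """Number of loop iterations needed to cover total_count rows in batches."""
    return math.ceil(total_count / batch_size)


print(is_all_mode(date(2017, 1, 1)))  # day 1 -> True
print(is_all_mode(date(2017, 1, 2)))  # day 2 -> False
print(is_all_mode(date(2017, 1, 4)))  # day 4 -> True
print(loop_count(50001))              # 3 batches of up to 20000 rows
```

So the full reload branch runs once every three days, and the syndication loop runs just enough iterations to cover every counted row.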
Example: Moving Spark app to production
_export:
  td:
    database: digdag_demo_${session_date_compact}

+setup:
  td_ddl>:
  create_databases: ["${td.database}"]

+ingestion:
  _parallel: true

  +items_from_access_logs:
    +wait_for_arrival:
      s3_wait>: digdag-demo-bucket/www_login_${session_date_compact}.csv
    +load_logs:
      td_load>: s3_import_1479918530

  +facebook_ads:
    td_load>: facebook_ads_reporting_import_1479843958

  +items_from_aurora:
    td_load>: mysql_import_1479918544

+enrichment:
  _parallel: 5

  +ip_location_to_user:
    # ip_location, user
    td>: queries/ip_location_to_user.sql
    create_table: ip_location_to_user

  +item_to_click_count:
    # item, click_count
    td>: queries/item_to_click_count.sql
    create_table: item_to_click_count

  +item_to_item_count:
    # item_1, item_2, count
    td>: queries/item_to_item_count.sql
    create_table: item_to_item_count

+modeling:
  emr>:
  cluster: j-OD82XANWFYQ8
  staging: s3://digdag-demo-data/emr/staging/
  steps:
    - type: spark
      application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
      submit_options: ["--class", "ItemRecommends"]
      jars: [td-spark-assembly-0.1.jar]
    - type: spark
      application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
      submit_options: ["--class", "LocationRecommends"]
      jars: [td-spark-assembly-0.1.jar]

+loading:
  _parallel: true

  +load_location_recommends:
    redshift>: copy/copy_location_recommends.sql

  +load_item_recommends:
    redshift>: copy/copy_item_recommends.sql
Deployment & Fault tolerance
HA deployment of Digdag
It's just like a web application.
[Diagram: Digdag client -> Digdag server (API & scheduler & executor, visual UI) -> PostgreSQL (all task state)]
HA deployment of Digdag
Stateless servers + Replicated DB
[Diagram: Digdag client -> HTTP load balancer -> multiple Digdag servers (API & scheduler & executor, visual UI) -> PostgreSQL with HA (all task state)]
HA deployment of Digdag
Isolating API and execution for reliability
[Diagram: Digdag client -> HTTP load balancer -> Digdag servers (API); separate Digdag servers (scheduler & executor); all servers share PostgreSQL with HA (all task state)]
$ digdag server --disable-local-agent --disable-executor-loop
$ digdag server --max-task-threads 100
Single-server task logs
[Diagram: Digdag client -> HTTP load balancer -> Digdag server -> PostgreSQL; the server keeps task logs on its local disk]
> A server writes logs to a local disk.
> The same server serves the logs.
$ digdag server --task-log <dir>
$ digdag log <attempt-id> -f
Centralized task log storage
[Diagram: Digdag client -> HTTP load balancer -> Digdag servers -> PostgreSQL; task logs are stored on AWS S3]
> A server uploads logs.
> A server pre-signs the download URL.
> Client downloads logs directly from S3.
log-server.type = s3
log-server.s3.bucket = my-digdag-log-bucket
log-server.s3.path = logs/
$ digdag log <attempt-id> -f
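The pre-signing step can be illustrated with a generic HMAC scheme in Python. This is a simplified sketch of the concept only; it is not the AWS signing algorithm, and `SECRET`, `presign`, `verify`, and the log path are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"hypothetical-signing-key"  # known only to the server and the storage


def presign(path, expires_at, secret=SECRET):
    """Return the path with an expiry time and an HMAC signature over both."""
    message = f"{path}\n{expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires_at, 'signature': signature})}"


def verify(url, now, secret=SECRET):
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    expires_at = int(query["expires"][0])
    message = f"{parsed.path}\n{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, query["signature"][0]) and now <= expires_at


# Usage: the server signs a log path; the client presents the URL to the storage.
url = presign("/logs/attempt-1.log", expires_at=1000)
print(verify(url, now=999))
print(verify(url, now=1001))
```

The design point is that the client never needs storage credentials: it only holds a short-lived URL, so log downloads do not pass through the workflow servers.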
Sadayuki Furuhashi
https://digdag.io
Visit my website!

Prestogres, ODBC & JDBC connectivity for Presto
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure Data
 
Fluentd meetup at Slideshare
Fluentd meetup at SlideshareFluentd meetup at Slideshare
Fluentd meetup at Slideshare
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into Hadoop
 
Fluentd meetup
Fluentd meetupFluentd meetup
Fluentd meetup
 
upload test 1
upload test 1upload test 1
upload test 1
 
Programming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack ProjectProgramming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack Project
 

Recently uploaded

Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 

Recently uploaded (20)

Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 

Automation and Error Handling for Large-Scale Data Processing with Digdag

  • 2. Sadayuki Furuhashi. A founder of Treasure Data, Inc., located in Silicon Valley. An open-source hacker. OSS projects I founded: [project logos]. GitHub: @frsyuki
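  • 3. What's workload automation?
      • Automating any kind of manual work
        > Automating batch data analysis:
          • Data loading - ETL - JOIN - aggregation - report generation - notification
        > Automating email delivery:
          • Fetching the address list - filtering recipients - generating the body from a template - sending the email - completion notification
        > Automating data integration between systems
        > Automating management and provisioning of servers, databases, and network devices
        > Automating testing and deployment (CI)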
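  • 4. Required features
      • Basic features
        > Run tasks in dependency order
        > Run periodically
        > Run when a file is created
        > Run past periods in bulk (backfill)
        > Run with variables such as the time
      • Error handling
        > Notify on failure
        > Resume from the point of failure
      • Monitoring
        > Notify when a run takes too long
        > Visualize task execution time
        > Collect and store execution logs
      • Speed
        > Run tasks in parallel
        > Limit the number of concurrent tasks
      • Development support
        > Version control for workflows
        > GUI-based workflow development
        > Libraries that make routine processing easy to run
        > Reproducibility: a workflow behaves the same locally and on the server (if it runs locally, it runs on the server)
        > Run tasks using Docker images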
  • 5. Products
      OSS: Makefile, Jenkins, Luigi, Airflow, Rundeck, Azkaban, Grid Engine, OpenLava, Obsidian Scheduler, Hinemos, StackStorm, Platform LSF
      Proprietary: Tivoli Workload Scheduler (IBM), CA Workload Automation (CA Technologies), JP1/AJS3 (Hitachi), Systemwalker Job Workload Server (Fujitsu), Workload Automation (Automic), BatchMan (Honico), Control-M (BMC), Schedulix, ServiceNow Workflow
  • 6. Challenge: Multiple Cloud & Regions On-Premises Different API, Different tools, Many scripts.
  • 7. Challenge: Multiple DB technologies: Amazon S3, Amazon Redshift, Amazon EMR
  • 8. Challenge: Multiple DB technologies: Amazon S3, Amazon Redshift, Amazon EMR, plus a newcomer: "Hi! I'm a new technology!"
  • 9. Challenge: Modern complex data analytics
      Pipeline: Ingest, Enrich, Model, Load, Utilize
      Ingest: Application logs; User attribute data; Ad impressions; 3rd-party cookie data
      Enrich: Removing bot access; Geo location from IP address; Parsing User-Agent; JOIN user attributes to event logs
      Model: A/B Testing; Funnel analysis; Segmentation analysis; Machine learning
      Load: Creating indexes; Data partitioning; Data compression; Statistics collection
      Utilize: Recommendation API; Realtime ad bidding; Visualize using BI applications
  • 10. Traditional "false" solution
      #!/bin/bash
      ./run_mysql_query.sh
      ./load_facebook_data.sh
      ./rsync_apache_logs.sh
      ./start_emr_cluster.sh
      for query in emr/*.sql; do
        ./run_emr_hive $query
      done
      ./shutdown_emr_cluster.sh
      ./run_redshift_queries.sh
      ./call_finish_notification.sh
      > Poor error handling
      > Write once, nobody reads
      > No alerts on failure
      > No alerts on too long run
      > No retrying on errors
      > No resuming
      > No parallel execution
      > No distributed execution
      > No log collection
      > No visualized monitoring
      > No modularization
      > No parameterization
  • 11. Solution: Multi-Cloud Workflow Engine
      Solves
      > Poor error handling
      > Write once, nobody reads
      > No alerts on failure
      > No alerts on too long run
      > No retrying on errors
      > No resuming
      > No parallel execution
      > No distributed execution
      > No log collection
      > No visualized monitoring
      > No modularization
      > No parameterization
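To make the contrast concrete, here is a rough sketch of how the shell script from slide 10 might be expressed as a Digdag workflow. It is an illustration under assumptions, not the speaker's own workflow: the file name, the schedule time, the three-hour limit, the retry counts, and the two notification scripts (`call_failure_notification.sh`, `notify_slow_run.sh`) are all hypothetical, and it assumes the three ingest scripts are independent of each other.

```yaml
# pipeline.dig (hypothetical file name); script names come from slide 10
timezone: UTC

schedule:
  daily>: 02:00:00                      # assumed start time

# Alert on too long run: fires if the session has not finished in time
sla:
  duration: 03:00:00                    # assumed limit
  +notice:
    sh>: ./notify_slow_run.sh           # hypothetical script

# Alert on failure: runs when a task in this workflow fails
_error:
  sh>: ./call_failure_notification.sh   # hypothetical script

+ingest:
  _parallel: true                       # assumes the three loads are independent
  +mysql:
    sh>: ./run_mysql_query.sh
    _retry: 3
  +facebook:
    sh>: ./load_facebook_data.sh
    _retry: 3
  +apache_logs:
    sh>: ./rsync_apache_logs.sh
    _retry: 3

+emr:
  +start:
    sh>: ./start_emr_cluster.sh
  +queries:
    sh>: for query in emr/*.sql; do ./run_emr_hive "$query"; done
  +shutdown:
    sh>: ./shutdown_emr_cluster.sh

+redshift:
  sh>: ./run_redshift_queries.sh

+notify:
  sh>: ./call_finish_notification.sh
```

Grouping the steps under `+ingest` and `+emr` addresses the modularization point, while `_retry`, `_error`, and `sla` cover retrying, failure alerts, and long-run alerts. In this sketch a failure in `+queries` skips `+shutdown`, so a real workflow would need an extra cleanup step for the cluster.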
  • 12. Example in our case
      1. Dump data to BigQuery
      2. Load all tables to Treasure Data
      3. Run queries
      4. Create reports on Tableau Server (on-premises)
      5. Notify on Slack
  • 14. Key constructs
      Operators
        > Packaged knowledge to run tasks.
        > e.g. pg>, s3>, gcs>, emr>, td>, py>, rb>
      Parameters
        > Programmable variables for operators.
        > e.g. ${session_time}, ${workflow_name}, ${JSON.parse(http.last_content)}
      Task groups
        > Sequence of tasks to organize & modularize workflows.
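The three constructs can be seen together in a very small workflow. The following is a minimal sketch, assuming a local Digdag installation; the file name and the `greeting` parameter are made up for illustration, while `echo>`, `sh>`, `${session_date}`, `${session_time}`, and `${task_name}` are standard Digdag operators and built-in variables.

```yaml
# constructs.dig (hypothetical file name)
_export:
  greeting: hello            # hypothetical user-defined parameter

+report:                     # a task group: its child tasks run in order
  +show_session:
    # echo> is an operator; ${...} expands parameters before the task runs
    echo>: "${greeting}: session ${session_date} at ${session_time}"
  +show_task:
    echo>: "now running ${task_name}"
  +shell_step:
    # sh> is another operator; it runs a shell command
    sh>: date
```

Running `digdag run constructs.dig` in the same directory should execute the three child tasks in order.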
  • 15. Operator library
      _export:
        td:
          database: workflow_temp
      +task1:
        td>: queries/open.sql
        create_table: daily_open
      +task2:
        td>: queries/close.sql
        create_table: daily_close
      Standard libraries
        redshift>: runs Amazon Redshift queries
        emr>: creates/shuts down a cluster & runs steps
        s3_wait>: waits until a file is put on S3
        pg>: runs PostgreSQL queries
        td>: runs Treasure Data queries
        td_for_each>: repeats task for result rows
        mail>: sends an email
      Open-source libraries
        You can release & use open-source operator libraries.
  • 16. Task grouping & parallel execution
      +load_data:
        _parallel: true
        +load_users:
          redshift>: copy/users.sql
        +load_items:
          redshift>: copy/items.sql
      Parallel execution
        Tasks under the same group run in parallel if the _parallel option is set to true.
  • 17. Grouping workflows... [Diagram: the Ingest, Enrich, Model, Load, Utilize pipeline as a flat list of +task entries]
  • 18. Grouping workflows [Diagram: the Ingest, Enrich, Model, Load, Utilize pipeline with its tasks organized into groups: +ingest, +enrich, +model, +basket_analysis, +learn, +load]
  • 19. Parameters & Loops
      +send_email_to_active_users:
        td_for_each>: list_active.sql
        _do:
          +send:
            mail>: template.txt
            to: ${td.each.addr}
      Parameter
        A task can propagate parameters to following tasks.
      Loop
        Generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
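For readers without a Treasure Data account, the same looping idea can be tried locally with the standard `for_each>` operator. This is only a sketch: the file name and the region values are invented, and `echo>` stands in for whatever operator would actually process each data set.

```yaml
# loop_demo.dig (hypothetical file name)
+per_region:
  for_each>:
    region: [us, eu, jp]     # hypothetical values
  _do:
    # one subtask is generated per value of region
    +process:
      echo>: "processing region ${region} for ${session_date}"
```

Each generated subtask sees its own value of `${region}`, which is the mechanism the slide describes: the same operators applied to different data sets.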
  • 20. Unite Engineering & Analytic Teams
      +wait_for_arrival:
        s3_wait>: |
          bucket/www_${session_date}.csv
      +load_table:
        redshift>: scripts/copy.sql
      Powerful for Engineers
        > Comfortable for advanced users
      Friendly for Analysts
        > Still straightforward for analysts to understand & leverage workflows
  • 21. Pushing workflows to a server with Docker image
      schedule:
        daily>: 01:30:00
      timezone: Asia/Tokyo
      _export:
        docker:
          image: my_image:latest
      +task:
        sh>: ./run_in_docker
      Digdag server
        > Develop on laptop, push it to a server.
        > Workflows run periodically on a server.
        > Backfill
        > Web editor & monitor
      Docker
        > Install scripts & dependencies in a Docker image, not on a server.
        > Workflows can run anywhere, including a developer's laptop.
  • 22. Amazon ECR Dockerfile & Operator plugin template
      • https://github.com/myui/dockernized-digdag-server
      • https://github.com/myui/digdag-plugin-example
      $ docker pull myui/digdag-server:latest
      $ docker run -p 65432:65432 myui/digdag-server
      open http://localhost:65432/
  • 23. Demo
  • 25. Digdag at Treasure Data
      > 3,600 workflows run every day
      > 28,000 tasks run every day
      > 850 active workflows
      > 400,000 workflow executions in total
  • 26. Example: Customer analysis & alerting
      timezone: UTC
      schedule:
        daily>: 09:00
      _export:
        mail:
          from: 'bizops@example.com'
        td:
          database: summary
      +reports:
        td_run>: prepare_users_data
      +for_each_users:
        td_for_each>: inactive_users.sql
        _do:
          +alert_email:
            mail>: mail.txt
            subject: 'Inactive Alert: ${td.each.account_name}'
            to: ['${td.each.owner_email}']
  • 27. Example: Customer analysis & alerting (same workflow as slide 26, shown with its mail template)
      mail.txt:
        Usage: ${td.each.percentage}%
        Account Name: ${td.each.account_name}
        Type: Purchase
        ${td.each.salesforce_link}
        Region: ${td.each.region}
        Owner: ${td.each.owner_name} (${td.each.owner_email})
        Account: ${td.each.account_name}
        Status: ${td.each.activity_status}
        Actual: ${td.each.total_purchase}
        Limit: ${td.each.monthly_purchase_limit}
  • 28. Example: Backend of a BI app
      timezone: <%= ev @timezone %>
      <% if @schedule then %>
      schedule: <%= ev @schedule %>
      <% end %>
      _export:
        td:
          database: <%= ev @database %>
        all_mode: ${(moment(session_time).dayOfYear() - 1) % 3 == 0}
      +all_load:
        if>: ${all_mode == "true"}
        _do:
          +create_all_records:
            td>: segment_web_access.sql
            create_table: "cdp_tmp_web_access"
            _retry: 5
          +rename_tmp_table:
            td_ddl>:
            rename_tables:
              - from: "cdp_tmp_web_access"
                to: "cdp_web_access"
            _retry: 5
      +get_all_count:
        td>: incremental_count.sql
        table_name: "cdp_web_access"
        store_last_results: true
        _retry: 5
      +syndicate_loop:
        loop>: ${Math.ceil(td.last_results.total_count / 20000)}
        _do:
          td>: incremental_select.sql
          table_name: "cdp_web_access"
          result_connection: cdp_web_access
          result_settings:
            id: 1
          _retry: 5
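The two expressions in this workflow are easier to follow with concrete numbers. The sketch below is not part of the talk: it replaces `td.last_results.total_count` with an assumed row count of 45000 so that it can run locally without Treasure Data, and the file name and parameter names are hypothetical.

```yaml
# expr_demo.dig (hypothetical file name)
_export:
  total_count: 45000         # assumed value standing in for td.last_results.total_count
  batch_size: 20000          # the divisor used on the slide

+show_all_mode:
  # true when (day of year - 1) is divisible by 3: Jan 1, Jan 4, Jan 7, ...
  echo>: "all_mode = ${(moment(session_time).dayOfYear() - 1) % 3 == 0}"

+show_batches:
  # Math.ceil(45000 / 20000) = Math.ceil(2.25) = 3
  echo>: "batches = ${Math.ceil(total_count / batch_size)}"

+batches:
  loop>: ${Math.ceil(total_count / batch_size)}
  _do:
    # loop> exposes the zero-based iteration index as ${i}
    echo>: "batch index ${i}"
```

Read this way, the slide's workflow rebuilds the full table roughly every third day and otherwise works in batches of 20,000 rows, with the loop count rounded up so that a final partial batch is still processed.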
  • 29. Example: Moving Spark app to production
      _export:
        td:
          database: digdag_demo_${session_date_compact}
      +setup:
        td_ddl>:
        create_databases: ["${td.database}"]
      +ingestion:
        _parallel: true
        +items_from_access_logs:
          +wait_for_arrival:
            s3_wait>: digdag-demo-bucket/www_login_${session_date_compact}.csv
          +load_logs:
            td_load>: s3_import_1479918530
        +facebook_ads:
          td_load>: facebook_ads_reporting_import_1479843958
        +items_from_aurora:
          td_load>: mysql_import_1479918544
      +enrichment:
        _parallel: 5
        +ip_location_to_user:  # ip_location, user
          td>: queries/ip_location_to_user.sql
          create_table: ip_location_to_user
        +item_to_click_count:  # item, click_count
          td>: queries/item_to_click_count.sql
          create_table: item_to_click_count
        +item_to_item_count:  # item_1, item_2, count
          td>: queries/item_to_item_count.sql
          create_table: item_to_item_count
      +modeling:
        emr>:
        cluster: j-OD82XANWFYQ8
        staging: s3://digdag-demo-data/emr/staging/
        steps:
          - type: spark
            application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
            submit_options: ["--class", "ItemRecommends"]
            jars: [td-spark-assembly-0.1.jar]
          - type: spark
            application: spark/target/scala-2.11/simple-td-spark-project_2.11-1.0.jar
            submit_options: ["--class", "LocationRecommends"]
            jars: [td-spark-assembly-0.1.jar]
      +loading:
        _parallel: true
        +load_location_recommends:
          redshift>: copy/copy_location_recommends.sql
        +load_item_recommends:
          redshift>: copy/copy_item_recommends.sql
  • 30. Deployment & Fault tolerance
  • 31. HA deployment of Digdag
      It's just like a web application.
      Components: Digdag client and visual UI; Digdag server (API & scheduler & executor); PostgreSQL (all task state)
  • 32. HA deployment of Digdag
      Stateless servers + Replicated DB
      Components: Digdag client and visual UI; HTTP Load Balancer; multiple Digdag servers (API & scheduler & executor); PostgreSQL HA (all task state)
  • 33. HA deployment of Digdag
      Isolating API and execution for reliability
      Components: Digdag client; HTTP Load Balancer; Digdag servers (API); Digdag servers (scheduler & executor); PostgreSQL HA (all task state)
      $ digdag server --disable-local-agent --disable-executor-loop
      $ digdag server --max-task-threads 100
  • 34. Single-server task logs
      Components: Digdag client; HTTP Load Balancer; Digdag server; PostgreSQL; local disks
      > A server writes logs to a local disk.
      > The same server serves the logs.
      $ digdag --task-log <dir>
      $ digdag log <attempt-id> -f
  • 35. Centralized task log storage
      Components: Digdag client; HTTP Load Balancer; Digdag servers; PostgreSQL; AWS S3
      > A server uploads logs.
      > A server pre-signs the download URL.
      > Client downloads logs directly from S3.
      log-server.type = s3
      log-server.s3.bucket = my-digdag-log-bucket
      log-server.s3.path = logs/
      $ digdag log <attempt-id> -f