SlideShare a Scribd company logo
1 of 45
Download to read offline
Fighting Against
Chaotically Separated Values
with Embulk
Sadayuki Furuhashi

Founder & Software Architect
csv,conf,v2
A little about me…
Sadayuki Furuhashi
An open-source hacker.
github: @frsyuki
A founder of Treasure Data, Inc. located in Silicon Valley.
Fluentd - Unifid log collection infrastracture Embulk - Plugin-based ETL tool
OSS projects I founded:
It's like JSON.
but fast and small.
A little about me…
What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data loading easy.
> which was very painful…
Storage, RDBMS,
NoSQL, Cloud Service,
etc.
broken records,

transactions (idempotency),

performance, …
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. Run a script → fails!
> 2. Improve the script to normalize records
• Convert ”2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
• Convert “N" → “”
• many more normalization…
> 3. Second attempt → another error!
• Convert “Inf” → “Infinity”
> 4. Improve the script, fix, retry, fix, retry…
> 5. Oh, some data are loaded twice!?
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked well today.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequence to U+FFFD
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most of scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files takes 1 month (!?)
A lot of efforts for each formats & storages:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
The problems:
> Difficult to parse files correctly
> How is the CSV file formatted?
> Complex error handling
> How to detect and remove broken records robustly?
> Transactional load, or idempotent retrying
> How to retry without duplicated loading?
> Hard to optimize performance
> How to parallelize the bulk data loading?
> Many formats & storage in the world
> How to save my time?
The problems at Treasure Data
What’s “Treasure Data”?
> “Fast, powerful SQL access to big data from connected
applications and products, with no new infrastructure or
special skills required.”
> Customers want to try Treasure Data, but
> SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
> It’s hard work :(
> Fluentd solved streaming data collection, but
> bulk data loading is another problem.
Embulk is an open-source, plugin-based
parallel bulk data loader 

that makes data loading easy and fast.
Solution:
IMPORTANT!
Amazon S3
MySQL
FTP
CSV Files
Access Logs
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
Reliable
framework :-)
Parallel execution,
transaction, auto guess,
…and many by plugins.
Demo
$ embulk selfupdate
$ embulk example demo
$ vi demo/csv/sample_01.csv.gz
$ embulk guess demo/seed.yml -o config.yml
$ embulk run config.yml
$ vi config.yml
$ embulk run config.yml
out:
type: postgresql
host: localhost
user: pg
password: ''
database: embulk_demo
table: sample1
mode: replace
:%s/,/t/g
:%s/""/"/g
# Created by Sada
# This is a comment
N
Input Output
Embulk’s Plugin Architecture
Embulk Core
Executor Plugin
Filter Filter
Guess
Output
Embulk’s Plugin Architecture
Embulk Core
Executor Plugin
Filter Filter
GuessFileInput
Parser
Decoder
Guess
Embulk’s Plugin Architecture
Embulk Core
FileInput
Executor Plugin
Parser
Decoder
FileOutput
Formatter
Encoder
Filter Filter
Examples of Plugins (input)
File Input
Amazon S3
Google Cloud Storage
HDFS
Riak CS
SCP
FTP
…
CSV
JSON
MessagePack
Excel
Apache common logs
pcap format
XML / XPath
regexp
grok
…
File ParserInput
PostgreSQL
MySQL
Oracle
Vertica
Redis
Amazon Redshift
Amazon DynamoDB
Salesforce.com
JIRA
Mixpanel
…
Examples of Plugins (output)
File Output
Amazon S3
Google Cloud Storage
HDFS
SFTP
SCP
FTP
…
CSV
JSON
MessagePack
Excel
…
File FormatterOutput
PostgreSQL
MySQL
Oracle
Vertica
Redis
Amazon Redshift
Elasticsearch
Salesforce.com
Treasure Data
BigQuery
…
Examples of Plugins (filters)
> Filtering columns out by conditions
> Extracting values from a JSON column to columns (JSON flattening)
> Convert User-Agent strings to browser name, OS name, etc.
> Parse query string (“?k1=v1&k2=v2…”) to columns
> Applying SHA1 hash to a column
…
Use case 1: Sync PostgreSQL to Elasticsearch
embulk-input-postgresql
embulk-filter-column embulk-output-elasticsearch
PostgreSQL
column
filter
Elasticsearch
encrypt
filter
embulk-filter-encrypt
remove unnecessary
columns
encrypt password
columns
Use case 2: Load CSV on S3 to Analytics
embulk-parser-csv
embulk-decoder-gzip
embulk-input-s3
csv.gz
on S3
Treasure Data
BigQuery
Redshift
+
+
embulk-output-td
embulk-output-bigquery
embulk-output-redshift
Distributed execution on Hadoop
embulk-executor-mapreduce
Use case 3: Embulk as a Service at Treasure Data
REST API call
MySQL
Internal Architecture
Plugin API
> A plugin is written in Java or Ruby (JRuby).
> A plugin implements “transaction” and “task”.
> transaction controls the entire bulk loading session.
> create a destination table, create a directory,

commit the transaction, etc.
> transaction creates multiple tasks.
> tasks load load data.
> Embulk runs tasks in parallel.
> Embulk retries tasks if necessary.
Transaction stage & Task stage
Task
Transaction Task
Task
taskCount
{
taskIndex: 0,
task: {…}
}
{
taskIndex: 2,
task: {…}
}
runs on a single thread runs on multiple threads

(or machines)
Transaction control
fileInput.transaction {
parser.transaction {
filters.transaction {
formatter.transaction {
fileOutput.transaction {
executor.transaction {
…
}
}
}
}
}
}
file input plugin
parser plugin
filter plugins
formatter plugin
file output plugin
executor plugin
Task Task
Task execution
parser.run(fileInput, pageOutput)
fileInput.open() formatter.open(fileOutput)
fileOutput.open()
parser plugin
file input plugin filter plugins
file output plugin
formatter plugin …Task Task …
Parallel execution of tasks
Task
Task
Task
Task
Threads
Task queue
run tasks in parallel
(embulk-executor-local-thread)
Distributed execution of tasks
Task
Task
Task
Task
Map tasks
Task queue
run tasks on Hadoop
(embulk-executor-mapreduce)
Distributed execution (w/ partitioning)
Task
Task
Task
Task
Map - Shuffle - Reduce
Task queue
run tasks on Hadoop
(embulk-executor-mapreduce)
Useful to partition data by hour or day

before loading data to a storage.
Past & Future
What’s added since the first release?
• v0.3 (Feb, 2015)
• Resuming
• Filter plugin type
• v0.4 (Feb, 2015)
• Plugin template generator
• Incremental load (ConfigDiff)
• Isolated ClassLoaders for Java plugins
• Polyglot command launcher
What’s added since the first release?
• v0.6 (Apr, 2015)
• Executor plugin type
• Liquid template engine
• v0.7 (Aug, 2015)
• EmbulkEmbed & Embulk::Runner
• Plugin bundle (embulk-mkbundle)
• JRuby 9000
• Gradle v2.6
What’s added since the first release?
• v0.8 (Jan, 2016)
• JSON column type
• Page scattaring for more parallel execution
Future plan
• v0.9
• Error plugin type (#27)
• Stats & metrics (#199)
• v0.10
• More Guess (#242, #235)
• Multiple jobs using a single config file (#167)
Hacks
(if time allows)
Plugin Version Conflicts
Embulk Core
Java Runtime
aws-sdk.jar v1.9
embulk-input-s3.jar
Version conflicts!
aws-sdk.jar v1.10
embulk-output-redshift.jar
Avoiding Conflicts in JVM
Embulk Core
Java Runtime
aws-sdk.jar v1.9
embulk-input-s3.jar
Isolated
environments
aws-sdk.jar v1.10
embulk-output-redshift.jar
Class Loader 1
Class Loader 2
Liquid template engine
• A config file can include variables.
./embulk.jar
$ ./embulk.jar guess example.yml
executable jar!
Header of embulk.jar
: <<BAT
@echo off
setlocal
set this=%~f0
set java_args=
rem ...
java %java_args% -jar %this% %args%
exit /b %ERRORLEVEL%
BAT
# ...
exec java $java_args -jar "$0" "$@"
exit 127
PK...
embulk.jar is a shell script
: <<BAT
@echo off
setlocal
set this=%~f0
set java_args=
rem ...
java %java_args% -jar %this% %args%
exit /b %ERRORLEVEL%
BAT
# ...
exec java $java_args -jar "$0" "$@"
exit 127
PK...
argument of “:” command (heredoc).
“:” is a command that does nothing.
#!/bin/sh is optional.
Empty first line means a shell script.
java -jar $0
shell script exits here
(following data is ignored)
embulk.jar is a bat file
: <<BAT
@echo off
setlocal
set this=%~f0
set java_args=
rem ...
java %java_args% -jar %this% %args%
exit /b %ERRORLEVEL%
BAT
# ...
exec java $java_args -jar "$0" "$@"
exit 127
PK...
.bat exits here
(following lines are ignored)
“:” means a comment-line
embulk.jar is a jar file
: <<BAT
@echo off
setlocal
set this=%~f0
set java_args=
rem ...
java %java_args% -jar %this% %args%
exit /b %ERRORLEVEL%
BAT
# ...
exec java $java_args -jar "$0" "$@"
exit 127
PK...
jar (zip) format ignores headers
(file entries are in footer)
Type conversion
Embulk type systemInput type system Output type system
boolean
long
double
string
timestamp
boolean
integer
bigint
double precision
text
varchar
date
timestamp
timestamp with zone
…
(e.g. PostgreSQL)
boolean
integer
long
float
double
string
array
geo point
geo shape
… (e.g. Elasticsearch)

More Related Content

What's hot

Karpenterで君だけの最強のオートスケーリングを実装しよう
Karpenterで君だけの最強のオートスケーリングを実装しようKarpenterで君だけの最強のオートスケーリングを実装しよう
Karpenterで君だけの最強のオートスケーリングを実装しよう
Kohei Nagase
 

What's hot (20)

SolrCloud on Amazon ECS
SolrCloud on Amazon ECSSolrCloud on Amazon ECS
SolrCloud on Amazon ECS
 
Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015Sphinx autodoc - automated api documentation - PyCon.KR 2015
Sphinx autodoc - automated api documentation - PyCon.KR 2015
 
신입 개발자 생활백서 [개정판]
신입 개발자 생활백서 [개정판]신입 개발자 생활백서 [개정판]
신입 개발자 생활백서 [개정판]
 
BuildKitによる高速でセキュアなイメージビルド
BuildKitによる高速でセキュアなイメージビルドBuildKitによる高速でセキュアなイメージビルド
BuildKitによる高速でセキュアなイメージビルド
 
MySQLerの7つ道具
MySQLerの7つ道具MySQLerの7つ道具
MySQLerの7つ道具
 
クラウド環境下におけるAPIリトライ設計
クラウド環境下におけるAPIリトライ設計クラウド環境下におけるAPIリトライ設計
クラウド環境下におけるAPIリトライ設計
 
初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!初心者向けMongoDBのキホン!
初心者向けMongoDBのキホン!
 
인프런 - 스타트업 인프랩 시작 사례
인프런 - 스타트업 인프랩 시작 사례인프런 - 스타트업 인프랩 시작 사례
인프런 - 스타트업 인프랩 시작 사례
 
MQTTとAMQPと.NET
MQTTとAMQPと.NETMQTTとAMQPと.NET
MQTTとAMQPと.NET
 
実運用して分かったRabbit MQの良いところ・気をつけること #jjug
実運用して分かったRabbit MQの良いところ・気をつけること #jjug実運用して分かったRabbit MQの良いところ・気をつけること #jjug
実運用して分かったRabbit MQの良いところ・気をつけること #jjug
 
料理を楽しくする画像配信システム
料理を楽しくする画像配信システム料理を楽しくする画像配信システム
料理を楽しくする画像配信システム
 
Karpenterで君だけの最強のオートスケーリングを実装しよう
Karpenterで君だけの最強のオートスケーリングを実装しようKarpenterで君だけの最強のオートスケーリングを実装しよう
Karpenterで君だけの最強のオートスケーリングを実装しよう
 
Dockerfile を書くためのベストプラクティス解説編
Dockerfile を書くためのベストプラクティス解説編Dockerfile を書くためのベストプラクティス解説編
Dockerfile を書くためのベストプラクティス解説編
 
Getting Started GraalVM / GraalVM超入門 #jjug_ccc #ccc_c2
Getting Started GraalVM / GraalVM超入門 #jjug_ccc #ccc_c2Getting Started GraalVM / GraalVM超入門 #jjug_ccc #ccc_c2
Getting Started GraalVM / GraalVM超入門 #jjug_ccc #ccc_c2
 
PlaySQLAlchemy: SQLAlchemy入門
PlaySQLAlchemy: SQLAlchemy入門PlaySQLAlchemy: SQLAlchemy入門
PlaySQLAlchemy: SQLAlchemy入門
 
Ruby on Rails のキャッシュ機構について
Ruby on Rails のキャッシュ機構についてRuby on Rails のキャッシュ機構について
Ruby on Rails のキャッシュ機構について
 
なぜあなたのプロジェクトのDevSecOpsは形骸化するのか(CloudNative Security Conference 2022)
なぜあなたのプロジェクトのDevSecOpsは形骸化するのか(CloudNative Security Conference 2022)なぜあなたのプロジェクトのDevSecOpsは形骸化するのか(CloudNative Security Conference 2022)
なぜあなたのプロジェクトのDevSecOpsは形骸化するのか(CloudNative Security Conference 2022)
 
Javaのログ出力: 道具と考え方
Javaのログ出力: 道具と考え方Javaのログ出力: 道具と考え方
Javaのログ出力: 道具と考え方
 
AWSで作る分析基盤
AWSで作る分析基盤AWSで作る分析基盤
AWSで作る分析基盤
 
Prometheus入門から運用まで徹底解説
Prometheus入門から運用まで徹底解説Prometheus入門から運用まで徹底解説
Prometheus入門から運用まで徹底解説
 

Viewers also liked

Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
Sadayuki Furuhashi
 
Open Source Monitoring Tools
Open Source Monitoring ToolsOpen Source Monitoring Tools
Open Source Monitoring Tools
m_richardson
 

Viewers also liked (20)

Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
 
Jenkins 2.0 Pipeline & Blue Ocean
Jenkins 2.0 Pipeline & Blue OceanJenkins 2.0 Pipeline & Blue Ocean
Jenkins 2.0 Pipeline & Blue Ocean
 
Embulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダEmbulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダ
 
M&L Webinar: “Open Source ILIAS Plugin: Interactive Videos"
M&L Webinar: “Open Source ILIAS Plugin: Interactive Videos"M&L Webinar: “Open Source ILIAS Plugin: Interactive Videos"
M&L Webinar: “Open Source ILIAS Plugin: Interactive Videos"
 
Writing Your First Plugin
Writing Your First PluginWriting Your First Plugin
Writing Your First Plugin
 
Tajo_Meetup_20141120
Tajo_Meetup_20141120Tajo_Meetup_20141120
Tajo_Meetup_20141120
 
Connections Plugins - Engage 2016
Connections Plugins - Engage 2016Connections Plugins - Engage 2016
Connections Plugins - Engage 2016
 
Why should you publish your plugin as open source and contribute to WordPress?
Why should you publish your plugin as open source and contribute to WordPress?Why should you publish your plugin as open source and contribute to WordPress?
Why should you publish your plugin as open source and contribute to WordPress?
 
Fluentd - road to v1 -
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -
 
EmbulkのGCS/BigQuery周りのプラグインについて
EmbulkのGCS/BigQuery周りのプラグインについてEmbulkのGCS/BigQuery周りのプラグインについて
EmbulkのGCS/BigQuery周りのプラグインについて
 
Creating modern java web applications based on struts2 and angularjs
Creating modern java web applications based on struts2 and angularjsCreating modern java web applications based on struts2 and angularjs
Creating modern java web applications based on struts2 and angularjs
 
오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개
 
Find WordPress performance bottlenecks with XDebug PHP profiling
Find WordPress performance bottlenecks with XDebug PHP profilingFind WordPress performance bottlenecks with XDebug PHP profiling
Find WordPress performance bottlenecks with XDebug PHP profiling
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Open Source Monitoring Tools
Open Source Monitoring ToolsOpen Source Monitoring Tools
Open Source Monitoring Tools
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
 

Similar to Fighting Against Chaotically Separated Values with Embulk

NoSQL: Why, When, and How
NoSQL: Why, When, and HowNoSQL: Why, When, and How
NoSQL: Why, When, and How
BigBlueHat
 
Intro to node.js - Ran Mizrahi (28/8/14)
Intro to node.js - Ran Mizrahi (28/8/14)Intro to node.js - Ran Mizrahi (28/8/14)
Intro to node.js - Ran Mizrahi (28/8/14)
Ran Mizrahi
 
Beginning MEAN Stack
Beginning MEAN StackBeginning MEAN Stack
Beginning MEAN Stack
Rob Davarnia
 

Similar to Fighting Against Chaotically Separated Values with Embulk (20)

Scaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersScaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customers
 
The Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with RubyThe Future is Now: Leveraging the Cloud with Ruby
The Future is Now: Leveraging the Cloud with Ruby
 
ECS & ECR Deep Dive - 김기완 솔루션즈 아키텍트 :: AWS Container Day
ECS & ECR Deep Dive - 김기완 솔루션즈 아키텍트 :: AWS Container DayECS & ECR Deep Dive - 김기완 솔루션즈 아키텍트 :: AWS Container Day
ECS & ECR Deep Dive - 김기완 솔루션즈 아키텍트 :: AWS Container Day
 
Cloud State of the Union for Java Developers
Cloud State of the Union for Java DevelopersCloud State of the Union for Java Developers
Cloud State of the Union for Java Developers
 
NoSQL: Why, When, and How
NoSQL: Why, When, and HowNoSQL: Why, When, and How
NoSQL: Why, When, and How
 
Auto scaling with Ruby, AWS, Jenkins and Redis
Auto scaling with Ruby, AWS, Jenkins and RedisAuto scaling with Ruby, AWS, Jenkins and Redis
Auto scaling with Ruby, AWS, Jenkins and Redis
 
SQL Server 2008 Integration Services
SQL Server 2008 Integration ServicesSQL Server 2008 Integration Services
SQL Server 2008 Integration Services
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Distribua, gerencie e escale suas aplicações com o aws elastic beanstalk
Distribua, gerencie e escale suas aplicações com o aws elastic beanstalkDistribua, gerencie e escale suas aplicações com o aws elastic beanstalk
Distribua, gerencie e escale suas aplicações com o aws elastic beanstalk
 
Using Embulk at Treasure Data
Using Embulk at Treasure DataUsing Embulk at Treasure Data
Using Embulk at Treasure Data
 
AWS Elastic Beanstalk - Running Microservices and Docker
AWS Elastic Beanstalk - Running Microservices and DockerAWS Elastic Beanstalk - Running Microservices and Docker
AWS Elastic Beanstalk - Running Microservices and Docker
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
 
Intro to node.js - Ran Mizrahi (27/8/2014)
Intro to node.js - Ran Mizrahi (27/8/2014)Intro to node.js - Ran Mizrahi (27/8/2014)
Intro to node.js - Ran Mizrahi (27/8/2014)
 
Intro to node.js - Ran Mizrahi (28/8/14)
Intro to node.js - Ran Mizrahi (28/8/14)Intro to node.js - Ran Mizrahi (28/8/14)
Intro to node.js - Ran Mizrahi (28/8/14)
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...
 
(DVO308) Docker & ECS in Production: How We Migrated Our Infrastructure from ...
(DVO308) Docker & ECS in Production: How We Migrated Our Infrastructure from ...(DVO308) Docker & ECS in Production: How We Migrated Our Infrastructure from ...
(DVO308) Docker & ECS in Production: How We Migrated Our Infrastructure from ...
 
Beginning MEAN Stack
Beginning MEAN StackBeginning MEAN Stack
Beginning MEAN Stack
 
Why Wordnik went non-relational
Why Wordnik went non-relationalWhy Wordnik went non-relational
Why Wordnik went non-relational
 
Why Scale Matters and How the Cloud is Really Different (at scale)
Why Scale Matters and How the Cloud is Really Different (at scale)Why Scale Matters and How the Cloud is Really Different (at scale)
Why Scale Matters and How the Cloud is Really Different (at scale)
 

More from Sadayuki Furuhashi

Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
Sadayuki Furuhashi
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
Sadayuki Furuhashi
 
Programming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack ProjectProgramming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack Project
Sadayuki Furuhashi
 
gumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack ProjectgumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack Project
Sadayuki Furuhashi
 

More from Sadayuki Furuhashi (20)

Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
 
Making KVS 10x Scalable
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x Scalable
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
 
Embuk internals
Embuk internalsEmbuk internals
Embuk internals
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect More
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure Data
 
Fluentd meetup at Slideshare
Fluentd meetup at SlideshareFluentd meetup at Slideshare
Fluentd meetup at Slideshare
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into Hadoop
 
Fluentd meetup
Fluentd meetupFluentd meetup
Fluentd meetup
 
upload test 1
upload test 1upload test 1
upload test 1
 
Programming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack ProjectProgramming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack Project
 
Gumi study7 messagepack
Gumi study7 messagepackGumi study7 messagepack
Gumi study7 messagepack
 
gumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack ProjectgumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack Project
 

Recently uploaded

Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
raffaeleoman
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
Kayode Fayemi
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
Kayode Fayemi
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
amilabibi1
 

Recently uploaded (18)

Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
Aesthetic Colaba Mumbai Cst Call girls 📞 7738631006 Grant road Call Girls ❤️-...
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
Causes of poverty in France presentation.pptx
Causes of poverty in France presentation.pptxCauses of poverty in France presentation.pptx
Causes of poverty in France presentation.pptx
 
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verifiedSector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
Sector 62, Noida Call girls :8448380779 Noida Escorts | 100% verified
 
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdfThe workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
The workplace ecosystem of the future 24.4.2024 Fabritius_share ii.pdf
 
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
Busty Desi⚡Call Girls in Sector 51 Noida Escorts >༒8448380779 Escort Service-...
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatment
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of Drupal
 

Fighting Against Chaotically Separated Values with Embulk

  • 1. Fighting Against Chaotically Separated Values with Embulk Sadayuki Furuhashi
 Founder & Software Architect csv,conf,v2
  • 2. A little about me… Sadayuki Furuhashi An open-source hacker. github: @frsyuki A founder of Treasure Data, Inc. located in Silicon Valley. Fluentd - Unifid log collection infrastracture Embulk - Plugin-based ETL tool OSS projects I founded:
  • 3. It's like JSON. but fast and small. A little about me…
  • 4. What’s Embulk? > An open-source parallel bulk data loader > loads records from “A” to “B” > using plugins > for various kinds of “A” and “B” > to make data loading easy. > which was very painful… Storage, RDBMS, NoSQL, Cloud Service, etc. broken records,
 transactions (idempotency),
 performance, …
  • 5. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 1. Run a script → fails! > 2. Improve the script to normalize records • Convert ”2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC” • Convert “N" → “” • many more normalization… > 3. Second attempt → another error! • Convert “Inf” → “Infinity” > 4. Improve the script, fix, retry, fix, retry… > 5. Oh, some data are loaded twice!?
  • 6. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 6. Ok, the script worked well today. > 7. Register it to cron to sync data every day. > 8. One day… it fails with another error • Convert invalid UTF-8 byte sequence to U+FFFD
  • 7. The pains of bulk data loading Example: load 10GB CSV × 720 files > Most of scripts are slow. • People have little time to optimize bulk load scripts > One file takes 1 hour → 720 files takes 1 month (!?) A lot of efforts for each formats & storages: > XML, JSON, Apache log format (+some custom), … > SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile… > MongoDB, Elasticsearch, Redshift, Salesforce, …
  • 8. The problems: > Difficult to parse files correctly > How is the CSV file formatted? > Complex error handling > How to detect and remove broken records robustly? > Transactional load, or idempotent retrying > How to retry without duplicated loading? > Hard to optimize performance > How to parallelize the bulk data loading? > Many formats & storage in the world > How to save my time?
  • 9. The problems at Treasure Data What’s “Treasure Data”? > “Fast, powerful SQL access to big data from connected applications and products, with no new infrastructure or special skills required.” > Customers want to try Treasure Data, but > SEs write scripts to bulk load their data. Hard work :( > Customers want to migrate their big data, but > It’s hard work :( > Fluentd solved streaming data collection, but > bulk data loading is another problem.
  • 10. Embulk is an open-source, plugin-based parallel bulk data loader 
 that makes data loading easy and fast. Solution: IMPORTANT!
  • 11. Amazon S3 MySQL FTP CSV Files Access Logs Salesforce.com Elasticsearch Cassandra Hive Redis Reliable framework :-) Parallel execution, transaction, auto guess, …and many by plugins.
  • 12. Demo
  • 13. $ embulk selfupdate $ embulk example demo $ vi demo/csv/sample_01.csv.gz $ embulk guess demo/seed.yml -o config.yml $ embulk run config.yml $ vi config.yml $ embulk run config.yml out: type: postgresql host: localhost user: pg password: '' database: embulk_demo table: sample1 mode: replace :%s/,/t/g :%s/""/"/g # Created by Sada # This is a comment N
  • 14. Input Output Embulk’s Plugin Architecture Embulk Core Executor Plugin Filter Filter Guess
  • 15. Output Embulk’s Plugin Architecture Embulk Core Executor Plugin Filter Filter GuessFileInput Parser Decoder
  • 16. Guess Embulk’s Plugin Architecture Embulk Core FileInput Executor Plugin Parser Decoder FileOutput Formatter Encoder Filter Filter
  • 17. Examples of Plugins (input) File Input Amazon S3 Google Cloud Storage HDFS Riak CS SCP FTP … CSV JSON MessagePack Excel Apache common logs pcap format XML / XPath regexp grok … File ParserInput PostgreSQL MySQL Oracle Vertica Redis Amazon Redshift Amazon DynamoDB Salesforce.com JIRA Mixpanel …
  • 18. Examples of Plugins (output) File Output Amazon S3 Google Cloud Storage HDFS SFTP SCP FTP … CSV JSON MessagePack Excel … File FormatterOutput PostgreSQL MySQL Oracle Vertica Redis Amazon Redshift Elasticsearch Salesforce.com Treasure Data BigQuery …
  • 19. Examples of Plugins (filters) > Filtering columns out by conditions > Extracting values from a JSON column to columns (JSON flattening) > Convert User-Agent strings to browser name, OS name, etc. > Parse query string (“?k1=v1&k2=v2…”) to columns > Applying SHA1 hash to a column …
  • 20. Use case 1: Sync PostgreSQL to Elasticsearch embulk-input-postgresql embulk-filter-column embulk-output-elasticsearch PostgreSQL column filter Elasticsearch encrypt filter embulk-filter-encrypt remove unnecessary columns encrypt password columns
  • 21. Use case 2: Load CSV on S3 to Analytics embulk-parser-csv embulk-decoder-gzip embulk-input-s3 csv.gz on S3 Treasure Data BigQuery Redshift + + embulk-output-td embulk-output-bigquery embulk-output-redshift Distributed execution on Hadoop embulk-executor-mapreduce
  • 22. Use case 3: Embulk as a Service at Treasure Data REST API call MySQL
  • 24. Plugin API > A plugin is written in Java or Ruby (JRuby). > A plugin implements “transaction” and “task”. > transaction controls the entire bulk loading session. > create a destination table, create a directory,
 commit the transaction, etc. > transaction creates multiple tasks. > tasks load load data. > Embulk runs tasks in parallel. > Embulk retries tasks if necessary.
  • 25. Transaction stage & Task stage Task Transaction Task Task taskCount { taskIndex: 0, task: {…} } { taskIndex: 2, task: {…} } runs on a single thread runs on multiple threads
 (or machines)
  • 26. Transaction control fileInput.transaction { parser.transaction { filters.transaction { formatter.transaction { fileOutput.transaction { executor.transaction { … } } } } } } file input plugin parser plugin filter plugins formatter plugin file output plugin executor plugin Task Task
  • 27. Task execution parser.run(fileInput, pageOutput) fileInput.open() formatter.open(fileOutput) fileOutput.open() parser plugin file input plugin filter plugins file output plugin formatter plugin …Task Task …
  • 28. Parallel execution of tasks Task Task Task Task Threads Task queue run tasks in parallel (embulk-executor-local-thread)
  • 29. Distributed execution of tasks Task Task Task Task Map tasks Task queue run tasks on Hadoop (embulk-executor-mapreduce)
  • 30. Distributed execution (w/ partitioning) Task Task Task Task Map - Shuffle - Reduce Task queue run tasks on Hadoop (embulk-executor-mapreduce) Useful to partition data by hour or day
 before loading data to a storage.
  • 32. What’s added since the first release? • v0.3 (Feb, 2015) • Resuming • Filter plugin type • v0.4 (Feb, 2015) • Plugin template generator • Incremental load (ConfigDiff) • Isolated ClassLoaders for Java plugins • Polyglot command launcher
  • 33. What’s added since the first release? • v0.6 (Apr, 2015) • Executor plugin type • Liquid template engine • v0.7 (Aug, 2015) • EmbulkEmbed & Embulk::Runner • Plugin bundle (embulk-mkbundle) • JRuby 9000 • Gradle v2.6
  • 34. What’s added since the first release? • v0.8 (Jan, 2016) • JSON column type • Page scattaring for more parallel execution
  • 35. Future plan • v0.9 • Error plugin type (#27) • Stats & metrics (#199) • v0.10 • More Guess (#242, #235) • Multiple jobs using a single config file (#167)
  • 37. Plugin Version Conflicts Embulk Core Java Runtime aws-sdk.jar v1.9 embulk-input-s3.jar Version conflicts! aws-sdk.jar v1.10 embulk-output-redshift.jar
  • 38. Avoiding Conflicts in JVM Embulk Core Java Runtime aws-sdk.jar v1.9 embulk-input-s3.jar Isolated environments aws-sdk.jar v1.10 embulk-output-redshift.jar Class Loader 1 Class Loader 2
  • 39. Liquid template engine • A config file can include variables.
  • 40. ./embulk.jar $ ./embulk.jar guess example.yml executable jar!
  • 41. Header of embulk.jar : <<BAT @echo off setlocal set this=%~f0 set java_args= rem ... java %java_args% -jar %this% %args% exit /b %ERRORLEVEL% BAT # ... exec java $java_args -jar "$0" "$@" exit 127 PK...
  • 42. embulk.jar is a shell script : <<BAT @echo off setlocal set this=%~f0 set java_args= rem ... java %java_args% -jar %this% %args% exit /b %ERRORLEVEL% BAT # ... exec java $java_args -jar "$0" "$@" exit 127 PK... argument of “:” command (heredoc). “:” is a command that does nothing. #!/bin/sh is optional. Empty first line means a shell script. java -jar $0 shell script exits here (following data is ignored)
  • 43. embulk.jar is a bat file : <<BAT @echo off setlocal set this=%~f0 set java_args= rem ... java %java_args% -jar %this% %args% exit /b %ERRORLEVEL% BAT # ... exec java $java_args -jar "$0" "$@" exit 127 PK... .bat exits here (following lines are ignored) “:” means a comment-line
  • 44. embulk.jar is a jar file : <<BAT @echo off setlocal set this=%~f0 set java_args= rem ... java %java_args% -jar %this% %args% exit /b %ERRORLEVEL% BAT # ... exec java $java_args -jar "$0" "$@" exit 127 PK... jar (zip) format ignores headers (file entries are in footer)
  • 45. Type conversion Embulk type systemInput type system Output type system boolean long double string timestamp boolean integer bigint double precision text varchar date timestamp timestamp with zone … (e.g. PostgreSQL) boolean integer long float double string array geo point geo shape … (e.g. Elasticsearch)