Open source data ingestion

Treasure Data, Inc.

Table of Contents
1. Why you should care
2. Data Collection v. Data Ingestion
3. Examples: Data Collection Tools
4. Examples: Data Ingestion Tools
5. Case Study: Async App Logging
Links to be added after the talk.

(Big) Data Pipeline
[Diagram: Data Sources, Raw Data Storage, Processed Data, Analysis Environment; connected by Data Collection and Ingestion, Data Pre-processing, Data Fetching; Data Engineers]

If Data Collection Goes Awry...
[Diagram: Data Sources, Raw Data Storage, Processed Data, Analysis Environment; connected by Data Collection and Ingestion, Data Pre-processing, Data Fetching; Data Engineers]

Data Collection
- Happens where data originates
- “logging code”
- Batch v. Streaming
- Pull v. Push
log.error("FUUUUU....WHY!?")
cln.send({"uid": 1, "action": "died"})
200 GET a.com/?utm=big%20data
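
The batch v. streaming distinction above can be sketched in a few lines. This is a minimal illustration, not code from any of the tools discussed in this talk: the class names, the `send` callback, and the `batch_size` parameter are all assumptions made for the example. Both collectors are push-style (the application calls `emit`); a pull-style collector would instead poll the source, for example by tailing a log file.

```python
import json


class StreamingCollector:
    """Push + streaming: forward every event the moment it is emitted."""

    def __init__(self, send):
        # `send` is a hypothetical transport callback (e.g. a socket write).
        self.send = send

    def emit(self, event):
        self.send([json.dumps(event)])


class BatchCollector:
    """Push + batch: hold events locally and ship them in groups."""

    def __init__(self, send, batch_size=3):
        self.send = send
        self.batch_size = batch_size
        self.pending = []

    def emit(self, event):
        self.pending.append(json.dumps(event))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Ship whatever is pending, then start a fresh list.
        if self.pending:
            self.send(self.pending)
            self.pending = []


if __name__ == "__main__":
    shipments = []
    collector = BatchCollector(shipments.append, batch_size=2)
    for uid in (1, 2, 3):
        collector.emit({"uid": uid, "action": "died"})
    collector.flush()  # ship the leftover partial batch
    print([len(batch) for batch in shipments])
```

Streaming keeps latency low at the cost of one delivery per event; batching amortizes delivery cost but delays data until a batch fills or is flushed.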

Data Ingestion
- Receives data
- Sometimes coupled with storage
- Routing data
[Diagram label: Data Ingestion Layer]

rsyslog
- The grandfather of data collectors
- Streaming
- Installed by default, widely understood
- Not as easy to extend/configure

rsyslog
https://github.com/rsyslog/rsyslog/blob/master/ChangeLog

Scribe
- Originally written at Facebook
- Streaming
- Fast (C++)
- Nightmare to build, largely abandoned

Flume-ng
- Written and maintained by Cloudera (successor to Flume)
- Commercial support by Cloudera. Track record for Hadoop
- Java can be heavy-handed for some orgs/cases

Logstash
- Pluggable architecture, rich ecosystem
- The “L” of the ELK stack by Elastic
- JRuby
- HA uses Redis as a queue
http://apuntesdetrabajo.es/?p=263

Heka
- Developed at Mozilla
- Written in Go, extensible w/ Lua
- Plugin system, but compilation needed (Go’s limitation, may change)

Fluentd
- Plugin architecture
- Built-in HA
- CRuby (JRuby on the roadmap)
- google-fluentd, td-agent
- Lightweight multi-source, multi-destination log routing

Embulk
- Plugin architecture
- Focuses on Batch workloads
- Java/JRuby
- Very new! (looking for contributors!)

RabbitMQ
- Written in Erlang, supported by Pivotal
- Implements AMQP

Kafka
- Begun at LinkedIn, now Confluent
- Topic-based Message Broker: Producer/Broker/Consumer
- Distributed design
- Provides at-least-once or at-most-once delivery, depending on the consumer
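
Why the delivery guarantee is decided "by consumers" can be shown with a toy in-memory log. This sketch does not use any Kafka client API; `InMemoryLog`, `consume`, and `run_with_one_crash` are hypothetical names invented for illustration. The only idea borrowed from the slide is that the consumer tracks its own position: committing the offset before processing gives at-most-once, committing after processing gives at-least-once.

```python
class Crash(Exception):
    """Simulated consumer failure."""


class InMemoryLog:
    """Stand-in for one topic partition plus the consumer's committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset the consumer resumes from after a restart


def consume(log, processed, mode, crash_at=None):
    """Consume from the committed offset; crash between the two steps at crash_at."""
    offset = log.committed
    while offset < len(log.messages):
        message = log.messages[offset]
        if mode == "at-most-once":
            log.committed = offset + 1  # commit first ...
            if offset == crash_at:
                raise Crash()           # ... so a crash here loses the message
            processed.append(message)
        else:  # "at-least-once"
            processed.append(message)   # process first ...
            if offset == crash_at:
                raise Crash()           # ... so a crash here causes a re-read
            log.committed = offset + 1
        offset += 1


def run_with_one_crash(mode):
    """Run a consumer that crashes once at offset 1, then restarts."""
    log = InMemoryLog(["a", "b", "c"])
    processed = []
    try:
        consume(log, processed, mode, crash_at=1)
    except Crash:
        consume(log, processed, mode)  # restart from the committed offset
    return processed


if __name__ == "__main__":
    print(run_with_one_crash("at-most-once"))   # "b" is lost
    print(run_with_one_crash("at-least-once"))  # "b" is processed twice
```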

Fluentd!?
- Used (abused?) as a bus/MQ
- tag-based event routing
- Can be combined with
RabbitMQ/Kafka, etc.
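
Tag-based event routing can be sketched as an ordered list of (pattern, output) rules where the first matching rule wins. This is a toy router, not Fluentd's implementation: `TagRouter` and its methods are hypothetical, and the glob matching comes from Python's `fnmatch`, whose `*` also matches dots, so the pattern semantics are an assumption and differ from Fluentd's own match rules.

```python
from fnmatch import fnmatchcase


class TagRouter:
    """Route (tag, record) events to the first output whose pattern matches."""

    def __init__(self):
        self.rules = []  # (pattern, output) pairs in declaration order

    def match(self, pattern, output):
        self.rules.append((pattern, output))

    def route(self, tag, record):
        for pattern, output in self.rules:
            if fnmatchcase(tag, pattern):
                output(tag, record)
                return True
        return False  # no rule matched; the event is not delivered


if __name__ == "__main__":
    app_events, other_events = [], []
    router = TagRouter()
    router.match("app.*", lambda tag, rec: app_events.append((tag, rec)))
    router.match("*", lambda tag, rec: other_events.append((tag, rec)))

    router.route("app.access", {"path": "/foobar", "status": 200})
    router.route("sys.cpu", {"load": 0.7})
    print(len(app_events), len(other_events))
```

Because rules are checked in order, a catch-all pattern belongs last; placed first, it would shadow every later rule.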

Application Logging
- Common ask: “How’s our new feature doing?”
[Diagram: a client sends GET /foobar to the API Server, which responds 200 {...}]

Application Logging
- What NOT to do: synchronous logging
[Diagram: GET /foobar reaches the API Server; the API Server writes to the Data Backend and receives an ack; the response is 200 {...}]

Application Logging
- What NOT to do: synchronous logging
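[Diagram: GET /foobar reaches the API Server; the API Server writes to a Local Data Collector with a Buffer; the collector flushes to the Data Backend and receives an ack; the response is 200 {...}]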
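
The buffered hand-off in the diagram above can be sketched inside a single process: the request path only enqueues, and a background thread flushes batches to the backend. This is a minimal sketch under stated assumptions, not how any particular collector works. `BufferedLogger`, the `flush` callback, and the choice to drop events when the buffer is full are all made up for illustration; a real local collector runs as a separate process and usually persists its buffer.

```python
import queue
import threading

_STOP = object()  # sentinel telling the worker to exit


class BufferedLogger:
    """Accept events without blocking; flush them in batches on a worker thread."""

    def __init__(self, flush, max_buffered=1000, batch_size=100):
        self._flush = flush  # hypothetical callback that writes a list of events
        self._batch_size = batch_size
        self._queue = queue.Queue(maxsize=max_buffered)
        self.dropped = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event):
        """Called on the request path: never waits on the backend."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # assumption: shed load rather than stall requests

    def close(self):
        """Flush everything still buffered and stop the worker."""
        self._queue.put(_STOP)
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            batch.append(item)
            # Flush when the batch is full or nothing else is waiting.
            if len(batch) >= self._batch_size or self._queue.empty():
                self._flush(batch)
                batch = []
        if batch:
            self._flush(batch)


if __name__ == "__main__":
    backend = []
    logger = BufferedLogger(backend.extend, batch_size=10)
    for uid in range(25):
        logger.log({"uid": uid, "action": "view"})
    logger.close()
    print(len(backend), logger.dropped)
```

The trade-off is explicit here: the request never waits for the backend, but events still in the buffer are lost if the process dies before they are flushed.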

But wait...
- Is writing to a local log collector safe?
- What if the log collector retries by error?
- A lot of problems to think about!
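
One common way to make retries harmless, offered here as an illustration rather than as the answer given in this talk, is to stamp each event with a unique ID at the source and have the receiving side ignore IDs it has already stored. The names `new_event` and `DedupingBackend` are hypothetical, and the in-memory set stands in for whatever deduplication store a real backend would use.

```python
import uuid


def new_event(payload):
    """Attach a unique ID where the event originates."""
    return {"id": str(uuid.uuid4()), **payload}


class DedupingBackend:
    """Store each event at most once, however many times it is delivered."""

    def __init__(self):
        self.seen_ids = set()
        self.rows = []

    def write(self, event):
        if event["id"] in self.seen_ids:
            return False  # duplicate delivery, ignored
        self.seen_ids.add(event["id"])
        self.rows.append(event)
        return True


if __name__ == "__main__":
    backend = DedupingBackend()
    event = new_event({"uid": 1, "action": "died"})
    backend.write(event)
    backend.write(event)  # the collector retried the same event
    print(len(backend.rows))
```

With this in place, an at-least-once collector can retry freely, because a repeated delivery changes nothing in the backend.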

“Much of the blame, little of the glory”
(Just kidding. The entire data team relies on YOU!)

Thank you!
(...and we are hiring!)
www.treasuredata.com/careers

Bibliography
- Software
  - www.fluentd.org
  - hekad.readthedocs.org
  - logstash.org
  - kafka.apache.org
  - Embulk.org
  - www.rabbitmq.com
- Ideas
  - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  - http://radar.oreilly.com/2015/04/the-log-the-lifeblood-of-your-data-pipeline.html


1. Open Source Data Collection/Ingestion Treasure Data, Inc. www.treasuredata.com

2. Hello! - “Committer” of Fluentd - Treasure Data, Inc. - Former Algorithmic Trader - Stanford Math and CS

4. Data Collection/Ingestion is HARD

7. Collection v. Ingestion

10. ex. Data Collection Tools

19. ex. Data Ingestion Tools

23. case study: Async App Logging
