Apache Flume

Arinto Murdopo

Brief description of Apache Flume 0.9.x for EEDC assignment

Arinto Murdopo
Josep Subirats
Group 4
EEDC 2012

Outline
● Current problem
● What is Apache Flume?
● The Flume Model
○ Flows and Nodes
○ Agent, Processor and Collector Nodes
○ Data and Control Path
● Flume goals
○ Reliability
○ Scalability
○ Extensibility
○ Manageability
● Use case: Near Realtime Aggregator

Current Problem
● Situation:
You have hundreds of services running on different servers
that produce lots of large logs, which should be analyzed
together. You have Hadoop to process them.

● Problem:
How do I send all my logs to a place that has Hadoop? I
need a reliable, scalable, extensible and manageable way
to do it!

What is Apache Flume?
● It is a distributed data collection service that takes
flows of data (like logs) from their sources and
aggregates them at the place where they are processed.
● Goals: reliability, scalability, extensibility,
manageability.

Exactly what I needed!

The Flume Model: Flows and Nodes

● A flow corresponds to a type of data source (server
logs, machine monitoring metrics...).
● Flows are composed of nodes chained together (see the
Agent, Processor and Collector Nodes slide).

The Flume Model: Flows and Nodes
● In a Node, data come in through a source...
Source examples: Console, Exec, Syslog, IRC,
Twitter, other nodes...

...are optionally processed by one or more decorators...
Decorator examples: wire batching, compression,
sampling, projection, extraction...

...and then are transmitted out via a sink.
Sink examples: Console, local files, HDFS, S3,
other nodes...
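The source, decorator and sink chain described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Flume's API: `list_source`, `sample_every`, `uppercase` and `run_node` are hypothetical names, and events are modeled as plain strings.

```python
from typing import Callable, Iterable, Iterator, List

Event = str
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def list_source(lines: Iterable[Event]) -> Iterator[Event]:
    """Hypothetical source: yields events from an in-memory list."""
    for line in lines:
        yield line


def sample_every(n: int) -> Decorator:
    """Hypothetical sampling decorator: keeps every n-th event."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for index, event in enumerate(events):
            if index % n == 0:
                yield event
    return decorate


def uppercase(events: Iterator[Event]) -> Iterator[Event]:
    """Hypothetical transforming decorator: upper-cases each event."""
    for event in events:
        yield event.upper()


def run_node(source: Iterator[Event],
             decorators: List[Decorator],
             sink: Callable[[Event], None]) -> None:
    """Pull events from the source, pass them through each decorator
    in order, and hand the result to the sink."""
    stream = source
    for decorate in decorators:
        stream = decorate(stream)
    for event in stream:
        sink(event)


collected: List[Event] = []  # stand-in for a real sink such as a file
run_node(list_source(["a", "b", "c", "d"]),
         [sample_every(2), uppercase],
         collected.append)
```

Because the decorators are optional, passing an empty list to `run_node` sends the source events straight to the sink.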

The Flume Model: Agent, Processor and
Collector Nodes

● Agent:
receives data from an
application.

● Processor (optional):
intermediate processing.

● Collector:
writes data to permanent
storage.
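The three tiers can be pictured as functions that each forward events to the next tier. The sketch below is a toy model under that assumption: `make_agent`, `make_processor`, `collector` and the `storage` list are hypothetical stand-ins, not Flume components.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]
Forward = Callable[[Event], None]

storage: List[Event] = []  # stand-in for permanent storage such as HDFS


def collector(event: Event) -> None:
    """Collector tier: writes the event to permanent storage."""
    storage.append(event)


def make_processor(downstream: Forward) -> Forward:
    """Optional processor tier: tags the event, then forwards it."""
    def process(event: Event) -> None:
        downstream(dict(event, tier="processed"))
    return process


def make_agent(host: str, downstream: Forward) -> Callable[[str], None]:
    """Agent tier: receives raw data from an application on `host`."""
    def receive(body: str) -> None:
        downstream({"host": host, "body": body})
    return receive


# web1 goes through the optional processor; web2 talks to the collector directly.
web1 = make_agent("web1", make_processor(collector))
web2 = make_agent("web2", collector)
web1("GET /index.html")
web2("GET /about.html")
```

Both agents end up in the same `storage`, which mirrors how several agents fan in to one collector.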

The Flume Model: Data and Control
Path (1/2)
Nodes are in the data path.

The Flume Model: Data and Control
Path (2/2)
Masters are in the control path.
● Centralized point of configuration. Multiple masters are coordinated through ZooKeeper (ZK).
● Specify sources, sinks and control data flows.

Flume Goals: Reliability
Tunable Failure Recovery Modes

● Best Effort

● Store on Failure and Retry

● End to End Reliability
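The slide only names the three modes. As a rough illustration of the middle one, the sketch below buffers events when the downstream hop fails and retries them later, in order. It is a toy under stated assumptions: `FailoverSink` and `flaky_send` are hypothetical, the buffer is an in-memory queue rather than local disk, and none of this is Flume's own code. A best-effort sink would simply drop the event in the `except` branch instead of keeping it.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class FailoverSink:
    """Hypothetical sink that stores events on failure and retries later."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._pending: Deque[str] = deque()

    def append(self, event: str) -> None:
        """Queue the event, then try to flush everything queued so far."""
        self._pending.append(event)
        self.retry()

    def retry(self) -> int:
        """Flush buffered events in order; return how many are still waiting."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # downstream still unavailable, keep the backlog
            self._pending.popleft()
        return len(self._pending)


delivered: List[str] = []
state: Dict[str, bool] = {"up": False}  # simulated downstream availability


def flaky_send(event: str) -> None:
    if not state["up"]:
        raise ConnectionError("collector unreachable")
    delivered.append(event)


sink = FailoverSink(flaky_send)
sink.append("e1")
sink.append("e2")
remaining_while_down = sink.retry()

state["up"] = True
remaining_after_recovery = sink.retry()
```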

Flume Goals: Scalability
Horizontally Scalable Data Path

Load Balancing

Flume Goals: Scalability
Horizontally Scalable Control Path

Flume Goals: Extensibility
● Simple Source and Sink API
○ Event streaming and composition of simple
operations

● Plug-in Architecture
○ Add your own sources, sinks, decorators
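The plug-in idea can be illustrated with a small registry that maps names to user-supplied components. Flume itself is written in Java, so this Python sketch is only an analogy; `REGISTRY`, `register` and `strip_blank` are hypothetical names.

```python
from typing import Callable, Dict, Iterator

Decorator = Callable[[Iterator[str]], Iterator[str]]

# Hypothetical registry: name -> decorator implementation.
REGISTRY: Dict[str, Decorator] = {}


def register(name: str) -> Callable[[Decorator], Decorator]:
    """Make a user-defined decorator available under `name`."""
    def wrap(fn: Decorator) -> Decorator:
        REGISTRY[name] = fn
        return fn
    return wrap


@register("strip_blank")
def strip_blank(events: Iterator[str]) -> Iterator[str]:
    """Example plug-in: drops events that contain only whitespace."""
    return (event for event in events if event.strip())


# A configuration could then refer to the plug-in by name.
result = list(REGISTRY["strip_blank"](iter(["a", "  ", "b"])))
```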

Flume Goals: Manageability
Centralized Data Flow Management Interface

Flume Goals: Manageability
Configuring Flume

Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
Output Bucketing
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
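The bucketed paths above appear to be derived from each event's timestamp (year, then month and day, then hour). A minimal sketch assuming that layout follows; `bucket_path` is a hypothetical helper, not Flume's own path-escape syntax.

```python
from datetime import datetime


def bucket_path(root: str, ts: datetime, filename: str) -> str:
    """Build <root>/<YYYY>/<MMDD>/<HH>00/<filename> from a timestamp."""
    return f"{root}/{ts:%Y}/{ts:%m%d}/{ts:%H}00/{filename}"


path = bucket_path("/logs/web", datetime(2010, 7, 15, 12, 34), "data-xxx.txt")
```

Events from the same hour land in the same directory, so a downstream job can pick up one hour of logs by reading a single path.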

Conclusion
Flume is
● A distributed data collection service

● Suitable for enterprise settings

● Suited to processing large amounts of log data

References
● http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing/
● http://www.slideshare.net/cloudera/inside-flume
● http://www.slideshare.net/cloudera/flume-intro100715
● http://www.slideshare.net/cloudera/flume-austin-hug-21711




