SlideShare a Scribd company logo
1 of 19
Download to read offline
Arinto Murdopo
 Josep Subirats
       Group 4
     EEDC 2012
Outline
● Current problem
● What is Apache Flume?
● The Flume Model
  ○ Flows and Nodes
  ○ Agent, Processor and Collector Nodes
  ○ Data and Control Path
● Flume goals
  ○ Reliability
  ○ Scalability
  ○ Extensibility
  ○ Manageability
● Use case: Near Realtime Aggregator
 
Current Problem
● Situation:
You have hundreds of services running in different servers
that produce lots of large logs which should be analyzed
altogether. You have Hadoop to process them.
 
● Problem:
How do I send all my logs to a place that has Hadoop? I
need a reliable, scalable, extensible and manageable way
to do it!
What is Apache Flume?
● It is a distributed data collection service that gets
    flows of data (like logs) from their source and
    aggregates them to where they have to be processed.
●   Goals: reliability, scalability, extensibility,
    manageability.




                   Exactly what I needed!
The Flume Model: Flows and Nodes

● A flow corresponds to a type of data source (server
    logs, machine monitoring metrics...).
●   Flows are comprised of nodes chained together (see
    slide 7).
The Flume Model: Flows and Nodes
● In a Node, data come in through a source...
   ...are optionally processed by one or more decorators...
   ...and then are transmitted out via a sink.
    
                 Examples: Console, Exec, Syslog, IRC,
                 Twitter, other nodes...
                  
                 Examples: Console, local files, HDFS, S3,
                 other nodes...
                  
                 Examples: wire batching, compression,
                 sampling, projection, extraction...
The Flume Model: Agent, Processor and
Collector Nodes

● Agent:
    receives data from an
    application.
 
● Processor (optional):
    intermediate processing.
 
● Collector:
    write data to permanent
    storage.
The Flume Model: Data and Control
Path (1/2)
Nodes are in the data path.
The Flume Model: Data and Control
Path (2/2)
Masters are in the control path.
● Centralized point of configuration. Multiple: ZK.
● Specify sources, sinks and control data flows.
Flume Goals: Reliability
Tunable Failure Recovery Modes
 
● Best Effort
 
● Store on Failure and Retry
 
● End to End Reliability
Flume Goals: Scalability
Horizontally Scalable Data Path




Load Balancing
Flume Goals: Scalability
Horizontally Scalable Control Path
Flume Goals: Extensibility
● Simple Source and Sink API
  ○ Event streaming and composition of simple
       operation
   
● Plug in Architecture
   ○ Add your own sources, sinks, decorators
    
    
Flume Goals: Manageability
Centralized Data Flow Management Interface
 
Flume Goals: Manageability
Configuring Flume
 
 
   Node: tail(“file”) | filter [ console, roll
   (1000) { dfs(“hdfs://namenode/user/flume”) } ]
   ;
Output Bucketing
                              /logs/web/2010/0715/1200/data-xxx.txt
                              /logs/web/2010/0715/1200/data-xxy.txt
                              /logs/web/2010/0715/1300/data-xxx.txt
                              /logs/web/2010/0715/1300/data-xxy.txt
                              /logs/web/2010/0715/1400/data-xxx.txt
Use Case: Near Realtime Aggregator
Conclusion
Flume is
● Distributed data collection service
 
● Suitable for enterprise setting
 
● Large amount of log data to process
Q&A
Questions to be unveiled?
 
 
References
●   http://www.cloudera.
    com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsie
    h_hadoop_log_processing/
●   http://www.slideshare.net/cloudera/inside-flume
●   http://www.slideshare.net/cloudera/flume-intro100715
●   http://www.slideshare.net/cloudera/flume-austin-hug-21711

More Related Content

What's hot

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 

What's hot (20)

SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Spark
SparkSpark
Spark
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Apache spark
Apache sparkApache spark
Apache spark
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Hive(ppt)
Hive(ppt)Hive(ppt)
Hive(ppt)
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 

Similar to Apache Flume

Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopJeyamariappan Guru
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with FlumeRatnakar Pawar
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Monitoring.pptx
Monitoring.pptxMonitoring.pptx
Monitoring.pptxShadi Akil
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News AggregatorMário Almeida
 
FluentD for end to end monitoring
FluentD for end to end monitoringFluentD for end to end monitoring
FluentD for end to end monitoringPhil Wilkins
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Brian Brazil
 
Do you know what your Drupal is doing_ Observe it!
Do you know what your Drupal is doing_ Observe it!Do you know what your Drupal is doing_ Observe it!
Do you know what your Drupal is doing_ Observe it!sparkfabrik
 
OpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectOpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectBYOUNG GON KIM
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Guglielmo Iozzia
 
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018VMware Tanzu
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeArvind Prabhakar
 
Graphs, parallelism and business cases
 Graphs, parallelism and business cases Graphs, parallelism and business cases
Graphs, parallelism and business casesDaniel Toader
 
Graphs, parallelism and business cases
Graphs, parallelism and business casesGraphs, parallelism and business cases
Graphs, parallelism and business casesDanBelibov1
 
The Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInThe Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInrajappaiyer
 
Empowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data StreamingEmpowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data StreamingSafe Software
 
Setting up a big data platform at kelkoo
Setting up a big data platform at kelkooSetting up a big data platform at kelkoo
Setting up a big data platform at kelkooFabrice dos Santos
 

Similar to Apache Flume (20)

Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Monitoring.pptx
Monitoring.pptxMonitoring.pptx
Monitoring.pptx
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
 
FluentD for end to end monitoring
FluentD for end to end monitoringFluentD for end to end monitoring
FluentD for end to end monitoring
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
Do you know what your Drupal is doing_ Observe it!
Do you know what your Drupal is doing_ Observe it!Do you know what your Drupal is doing_ Observe it!
Do you know what your Drupal is doing_ Observe it!
 
OpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectOpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo Project
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
 
NodeJS
NodeJSNodeJS
NodeJS
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
 
Graphs, parallelism and business cases
 Graphs, parallelism and business cases Graphs, parallelism and business cases
Graphs, parallelism and business cases
 
Graphs, parallelism and business cases
Graphs, parallelism and business casesGraphs, parallelism and business cases
Graphs, parallelism and business cases
 
The Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInThe Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedIn
 
Empowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data StreamingEmpowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data Streaming
 
Setting up a big data platform at kelkoo
Setting up a big data platform at kelkooSetting up a big data platform at kelkoo
Setting up a big data platform at kelkoo
 

More from Arinto Murdopo

Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsArinto Murdopo
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsArinto Murdopo
 
Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN Arinto Murdopo
 
High Availability in YARN
High Availability in YARNHigh Availability in YARN
High Availability in YARNArinto Murdopo
 
Distributed Computing - What, why, how..
Distributed Computing - What, why, how..Distributed Computing - What, why, how..
Distributed Computing - What, why, how..Arinto Murdopo
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...Arinto Murdopo
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...Arinto Murdopo
 
Quantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideQuantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideArinto Murdopo
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksArinto Murdopo
 
Parallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIParallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIArinto Murdopo
 
Megastore - ID2220 Presentation
Megastore - ID2220 PresentationMegastore - ID2220 Presentation
Megastore - ID2220 PresentationArinto Murdopo
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event ScalabilityArinto Murdopo
 
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideLarge Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideArinto Murdopo
 
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsLarge-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsArinto Murdopo
 
Rise of Network Virtualization
Rise of Network VirtualizationRise of Network Virtualization
Rise of Network VirtualizationArinto Murdopo
 
Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Arinto Murdopo
 
Architecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArchitecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArinto Murdopo
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignArinto Murdopo
 
Distributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingDistributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingArinto Murdopo
 

More from Arinto Murdopo (20)

Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN
 
High Availability in YARN
High Availability in YARNHigh Availability in YARN
High Availability in YARN
 
Distributed Computing - What, why, how..
Distributed Computing - What, why, how..Distributed Computing - What, why, how..
Distributed Computing - What, why, how..
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
Quantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideQuantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slide
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible Attacks
 
Parallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIParallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPI
 
Dremel Paper Review
Dremel Paper ReviewDremel Paper Review
Dremel Paper Review
 
Megastore - ID2220 Presentation
Megastore - ID2220 PresentationMegastore - ID2220 Presentation
Megastore - ID2220 Presentation
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event Scalability
 
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideLarge Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
 
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsLarge-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
 
Rise of Network Virtualization
Rise of Network VirtualizationRise of Network Virtualization
Rise of Network Virtualization
 
Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services
 
Architecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArchitecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity Fabric
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System Design
 
Distributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingDistributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer Computing
 

Recently uploaded

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 

Recently uploaded (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Apache Flume

  • 1. Arinto Murdopo Josep Subirats Group 4 EEDC 2012
  • 2. Outline ● Current problem ● What is Apache Flume? ● The Flume Model ○ Flows and Nodes ○ Agent, Processor and Collector Nodes ○ Data and Control Path ● Flume goals ○ Reliability ○ Scalability ○ Extensibility ○ Manageability ● Use case: Near Realtime Aggregator  
  • 3. Current Problem ● Situation: You have hundreds of services running in different servers that produce lots of large logs which should be analyzed altogether. You have Hadoop to process them.   ● Problem: How do I send all my logs to a place that has Hadoop? I need a reliable, scalable, extensible and manageable way to do it!
  • 4. What is Apache Flume? ● It is a distributed data collection service that gets flows of data (like logs) from their source and aggregates them to where they have to be processed. ● Goals: reliability, scalability, extensibility, manageability. Exactly what I needed!
  • 5. The Flume Model: Flows and Nodes ● A flow corresponds to a type of data source (server logs, machine monitoring metrics...). ● Flows are comprised of nodes chained together (see slide 7).
  • 6. The Flume Model: Flows and Nodes ● In a Node, data come in through a source... ...are optionally processed by one or more decorators... ...and then are transmitted out via a sink.   Examples: Console, Exec, Syslog, IRC, Twitter, other nodes...   Examples: Console, local files, HDFS, S3, other nodes...   Examples: wire batching, compression, sampling, projection, extraction...
  • 7. The Flume Model: Agent, Processor and Collector Nodes ● Agent: receives data from an application.   ● Processor (optional): intermediate processing.   ● Collector: write data to permanent storage.
  • 8. The Flume Model: Data and Control Path (1/2) Nodes are in the data path.
  • 9. The Flume Model: Data and Control Path (2/2) Masters are in the control path. ● Centralized point of configuration. Multiple: ZK. ● Specify sources, sinks and control data flows.
  • 10. Flume Goals: Reliability Tunable Failure Recovery Modes   ● Best Effort   ● Store on Failure and Retry   ● End to End Reliability
  • 11. Flume Goals: Scalability Horizontally Scalable Data Path Load Balancing
  • 12. Flume Goals: Scalability Horizontally Scalable Control Path
  • 13. Flume Goals: Extensibility ● Simple Source and Sink API ○ Event streaming and composition of simple operation   ● Plug in Architecture ○ Add your own sources, sinks, decorators    
  • 14. Flume Goals: Manageability Centralized Data Flow Management Interface  
  • 15. Flume Goals: Manageability Configuring Flume     Node: tail(“file”) | filter [ console, roll (1000) { dfs(“hdfs://namenode/user/flume”) } ] ; Output Bucketing   /logs/web/2010/0715/1200/data-xxx.txt /logs/web/2010/0715/1200/data-xxy.txt /logs/web/2010/0715/1300/data-xxx.txt   /logs/web/2010/0715/1300/data-xxy.txt /logs/web/2010/0715/1400/data-xxx.txt
  • 16. Use Case: Near Realtime Aggregator
  • 17. Conclusion Flume is ● Distributed data collection service   ● Suitable for enterprise setting   ● Large amount of log data to process
  • 18. Q&A Questions to be unveiled?    
  • 19. References ● http://www.cloudera. com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsie h_hadoop_log_processing/ ● http://www.slideshare.net/cloudera/inside-flume ● http://www.slideshare.net/cloudera/flume-intro100715 ● http://www.slideshare.net/cloudera/flume-austin-hug-21711