© Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent.
Apache Flume
Motivation
You have a lot of servers and systems
■ network devices
■ operating systems
■ web servers
■ applications
They generate a lot of
logs and other data
I have a business idea
for how to use this data!
You have a Hadoop
cluster running
You want to move
the logs to Hadoop
Traditional Solutions
■ Own scripts
● Probably a combination of
■
■
■ and/or /
● Run from cron or started/stopped manually
● Hardcoded or missing configuration
● Tightly coupled to the data being transferred
Complications
■ High delays
■ Limited manageability
● Compression, encryption, various file formats
● Throughput
● Configuration and monitoring
■ Limited scalability
● Data explosion, failover, load balancing
Apache Flume
■ Aims to solve this problem!
■ It can move large amounts of streaming event data from
one place to another
● e.g. from web servers to a Hadoop cluster
Overview
Various systems that
constantly generate data
in form of events
Flume Agent
■ Installed on each node
■ Collects events
Interceptor
■ Filters useless events
■ Decorates events by adding metadata
● e.g. timestamp, hostname, UUID, static
markers
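A minimal configuration sketch of the interceptors described above, assuming a Flume 1.x properties file. The agent name `a1`, the source name `r1`, the header key/value and the regular expression are hypothetical; the interceptor types (`timestamp`, `host`, `static`, `regex_filter`) are the ones listed in the Flume User Guide.

```properties
# Interceptors run in the order listed, on the source named r1 (hypothetical name)
a1.sources.r1.interceptors = i1 i2 i3 i4

# Decorate: add a "timestamp" header with the processing time
a1.sources.r1.interceptors.i1.type = timestamp

# Decorate: add the agent's host to a header called "hostname"
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname

# Decorate: add a static marker (key and value are made up for illustration)
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = dc1

# Filter: drop events whose body matches the regex
a1.sources.r1.interceptors.i4.type = regex_filter
a1.sources.r1.interceptors.i4.regex = .*DEBUG.*
a1.sources.r1.interceptors.i4.excludeEvents = true
```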
Encrypt
Encrypts events in
a file on disk
Sends events to the next-hop
Flume agent
Compression is
supported
Events can be multiplexed to
multiple agents (to spread the
load) …
… or replicated for
redundancy
to survive a permanent
failure of an agent, disk
or node.
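Multiplexing and replication can be sketched with channel selectors, assuming a Flume 1.x properties file. Two hypothetical agents (`a1`, `a2`) are shown so that both selector types fit in one fragment; the header name `logtype` and its values are made up. Each channel would then be drained by its own sink towards a different next-hop agent.

```properties
# Agent a1: replicate every event into both channels (redundancy)
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Agent a2: multiplex, i.e. route each event by the value of a header
a2.sources.r1.channels = c1 c2
a2.sources.r1.selector.type = multiplexing
a2.sources.r1.selector.header = logtype
a2.sources.r1.selector.mapping.access = c1
a2.sources.r1.selector.mapping.error = c2
# Events without a matching header value go to the default channel
a2.sources.r1.selector.default = c1
```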
Events can also be
delivered in “failover”
mode where …
… in case of a
failure of the
next-hop agent …
… we try the next Agent(s)
on a prioritized list.
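The failover behaviour described above could be configured with a sink group, sketched here under the assumption of a Flume 1.x properties file. The agent, group and sink names (`a1`, `g1`, `k1`, `k2`) and the priority values are hypothetical; each sink would point at a different next-hop agent.

```properties
# Group two sinks; each one targets a different next-hop agent
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Failover processor: always use the available sink with the highest priority
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5

# Upper bound (in milliseconds) on the back-off applied to a failed sink
a1.sinkgroups.g1.processor.maxpenalty = 10000
```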
Events can also be load-balanced
(round robin, random
or custom) …
… and go to different
next-hop Agents to spread the
load
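A load-balancing sink group could look like the following sketch, assuming a Flume 1.x properties file. The names `a1`, `g1`, `k1` and `k2` are hypothetical. According to the Flume User Guide, the selector is `round_robin`, `random`, or the class name of a custom selector.

```properties
# Group two sinks; each one targets a different next-hop agent
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Spread batches over the sinks in the group
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin

# Temporarily skip sinks that have failed
a1.sinkgroups.g1.processor.backoff = true
```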
Events can be stored
■ in memory (for performance)
■ on disk (for durability)
Events can be finally transferred to HDFS
■ Multiple file formats e.g. Text, JSON, Avro
■ Compression supported
■ Flexible naming of HDFS paths
However,
■ Many destinations are supported
■ You can implement custom ones
Flume
■ Distributed
● Agents installed on many machines
■ Scalable
● Add more machines to transfer more events
■ Reliable
● Durable storage, failover and/or replication
■ Manageable
● Easy to install, configure, reconfigure and run
■ Nicely integrated with the Hadoop Ecosystem
● Various destinations e.g. HDFS, HBase
● Various file formats e.g. Avro, SequenceFile
■ Extensible
● You can add new functionality, e.g. custom sources and
destinations for events
Architecture
Event
■ Unit of data transported by Flume
■ Consists of Headers and a Payload
■ Generally small
■ You can add your own headers, e.g. hostname, timestamp
Flume Agent
■ Responsible for transferring events
■ Runs in a JVM
■ Consists of Source(s), Channel(s) and Sink(s)
Source → Channel → Sink
Source
■ Collects events and forwards them to channels
● HTTP, JMS, RPC, NetCat
● Exec
● Spooling directory
Exec Source
■ Runs a given Unix command on startup
● The command should run continuously and produce data on standard output
■ If the process exits, the source also exits and will NOT
produce any further data
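A minimal Exec Source sketch, assuming a Flume 1.x properties file. The names `a1`, `r1`, `c1` and the log path are hypothetical. The `restart` and `restartThrottle` properties from the Flume User Guide are one way to soften the "process exits, source exits" behaviour, but they do not make the source reliable.

```properties
# Run a long-lived command and turn each line of its output into an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Optionally restart the command if it dies (off by default),
# waiting this many milliseconds before each restart attempt
a1.sources.r1.restart = true
a1.sources.r1.restartThrottle = 10000
```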
Spooling Directory Source
■ Watches a specified directory for new files
■ Parses events out of new files as they appear
■ After a file has been fully processed, it is renamed to
indicate completion (or optionally deleted)
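A Spooling Directory Source could be configured as in this sketch, assuming a Flume 1.x properties file. The names `a1`, `r1`, `c1` and the directory are hypothetical.

```properties
# Watch a directory; files dropped there must be complete and immutable
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/spool/flume/incoming
a1.sources.r1.channels = c1

# Rename fully processed files with this suffix ...
a1.sources.r1.fileSuffix = .COMPLETED
# ... or set deletePolicy to "immediate" to delete them instead
a1.sources.r1.deletePolicy = never

# Store the absolute path of the originating file in an event header
a1.sources.r1.fileHeader = true
```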
Channel
■ Buffers incoming events until they are extracted by Sinks
■ Tradeoff between durability and throughput
● Memory
● File
● JDBC
Memory Channel
■ Events stored in an in-memory queue
■ Configurable capacity
● The maximum number of events and/or bytes in memory
■ Nondurable, but faster
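The capacity limits mentioned above map to Memory Channel properties, sketched here for a Flume 1.x properties file. The names `a1` and `c1` and all values are illustrative only.

```properties
a1.channels.c1.type = memory

# Maximum number of events held in the channel
a1.channels.c1.capacity = 10000
# Maximum number of events per transaction (put or take)
a1.channels.c1.transactionCapacity = 1000

# Maximum total size, in bytes, of the event bodies held in memory
a1.channels.c1.byteCapacity = 100000000
# Percentage of byteCapacity reserved as a buffer for event headers
a1.channels.c1.byteCapacityBufferPercentage = 20
```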
File Channel
■ Events stored in files on disk
● Durable
■ Flushes to disk at the end of each transaction
● Supports encryption
■ Configurable capacity
■ The more disks
● The better the performance
● The higher the capacity
■ Capacity can be limited by the amount of memory available for the in-memory queue that
keeps pointers to all events stored in the log files
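A File Channel spread over two disks could be sketched as follows, assuming a Flume 1.x properties file. The names `a1` and `c1`, the mount points and the numbers are hypothetical.

```properties
a1.channels.c1.type = file

# Checkpoint of the in-memory queue of event pointers
a1.channels.c1.checkpointDir = /disk1/flume/checkpoint

# Comma-separated data directories; placing them on separate disks
# is the "more disks" setup described above
a1.channels.c1.dataDirs = /disk1/flume/data,/disk2/flume/data

# Maximum number of events in the channel and per transaction
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000
```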
Sink
■ Removes events from a Channel and forwards them to their
next destination
● HDFS, HBase, Solr, ElasticSearch
● File, Logger
● Flume Agent
HDFS Sink
■ Writes events to HDFS
● Flexible naming of HDFS paths
■ Multiple file formats are supported e.g. Text, Avro
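Flexible path naming relies on escape sequences in `hdfs.path`. The sketch below assumes a Flume 1.x properties file; the names `a1`, `k1`, `c1`, the directory layout and the file prefix are hypothetical.

```properties
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1

# %Y-%m-%d and %H are expanded from the event timestamp,
# giving one directory per day and hour
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.filePrefix = events

# Use the agent's clock instead of requiring a "timestamp" header
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Plain text output; the default fileType is SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
```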
■ Rollover properties for files generated by HDFS Sink
● Number of seconds to wait before rolling a file (0 deactivates this feature)
● File size, in bytes, to trigger roll of a file (0 deactivates this feature)
● Number of events written to file before rolling a file (0 deactivates this feature)
● Timeout after which inactive files get closed (0 deactivates this feature)
● Number of events written to file before they are flushed to HDFS
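In the Flume User Guide, the five settings described above correspond to the HDFS Sink properties shown in this sketch, in the same order. The names `a1` and `k1` and all values are illustrative, not recommendations.

```properties
# Roll after this many seconds (0 = never roll based on time)
a1.sinks.k1.hdfs.rollInterval = 300
# Roll when the file reaches this size in bytes (0 = never roll based on size)
a1.sinks.k1.hdfs.rollSize = 134217728
# Roll after this many events (0 = never roll based on event count)
a1.sinks.k1.hdfs.rollCount = 0
# Close files that have been inactive for this many seconds (0 = disabled)
a1.sinks.k1.hdfs.idleTimeout = 60
# Number of events written to a file before it is flushed to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
```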
■ Frequent rolling generates many small files
● They need to be compacted to avoid an explosion of HDFS
metadata
■ Often, you also want to deduplicate, filter and split events
Avro Source And Sink
Agent 1: Source → Channel → Avro Sink
Agent 2: Avro Source → Channel → Sink
■ Avro Sink sends a batch of Avro events to a configured hostname:port
■ Avro Source listens for events on a given port
■ Optionally, events can be compressed and encrypted in transit
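A two-agent hop over Avro could be sketched as below, assuming a Flume 1.x properties file. The agent names `a1` and `a2`, the component names, the host name and the port are hypothetical. Compression must be enabled on both ends; encryption (SSL) is left out of the sketch because it needs keystore settings.

```properties
# Agent a1 (sender): Avro sink pointing at the next-hop agent
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
# Number of events sent per batch
a1.sinks.k1.batch-size = 100
a1.sinks.k1.compression-type = deflate

# Agent a2 (receiver): Avro source listening on the same port
a2.sources.r1.type = avro
a2.sources.r1.channels = c1
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
a2.sources.r1.compression-type = deflate
```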
Reliability
■ Durable channels
● Survive agent failures, machine restarts and other failures not related to disks
■ Redundant path in a workflow topology
● Survive the failure of a node
● Achieved via replication or failover
■ Sufficient capacity of the channels
● Minimize the back pressure on earlier points in the flow
● Some sources might not be able to resend the data e.g.
■ Exec Source does not handle failures and might lose the data
■ Spooling Directory Source offers reliability guarantees
Event hand-off between two agents (Source → Channel → Avro Sink, then Avro Source → Channel → Sink); the first channel initially holds events D, C, B, A:
1. Avro Sink: start the transaction
2. Avro Sink: take a batch of events (B, A) from the channel; D, C stay behind
3. Avro Sink: send the batch of events to the Avro Source
4. Avro Source: start the transaction
5. Avro Source: put the events (B, A) into its channel
6. Avro Source: stop the transaction
7. Avro Sink: stop the transaction; B, A are now in the second channel, D, C remain in the first
When Flume Is Not A Good Fit
■ Very large events
● An event cannot be larger than memory or a disk on an
agent’s machine
■ Infrequent bulk loads
● Other tools might be better e.g. HDFS File Slurper
Configuration
■ Simple format
■ A configuration file can contain configuration settings for
many Agents
● Only settings needed by the Agent will be loaded
■ Agent automatically reloads configuration if it changes
Configuration Example
■ We configure Flume to run a single agent that
1. listens for data on a given port
2. turns each line of incoming text into an event
3. and sends it to HDFS via an in-memory channel.
Agent
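A sketch of the agent-level part of such a configuration, assuming a Flume 1.x properties file. The agent name `a1` and the component names `r1`, `c1`, `k1` are hypothetical labels, not names taken from the slides.

```properties
# Name the components of the agent called "a1"
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Wire them together: a source can write to several channels,
# a sink reads from exactly one
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```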
Source
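One source that listens on a port and turns each line of text into an event is the NetCat source. The sketch below assumes a Flume 1.x properties file; `a1`, `r1` and the port number are hypothetical.

```properties
# Listen on localhost:44444; every received line becomes one event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
```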
Channel
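The in-memory channel of the example could be declared as in this sketch (Flume 1.x properties file assumed; `a1`, `c1` and the sizes are hypothetical).

```properties
# Buffer up to 1000 events in memory, at most 100 per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
```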
Sink
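The HDFS destination of the example could be declared as in this sketch (Flume 1.x properties file assumed; `a1`, `k1` and the target directory are hypothetical).

```properties
# Write events as plain text files under /flume/events
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.fileType = DataStream
```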
Starting Agent
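A sketch of how such an agent is typically launched with the `flume-ng` script that ships with Flume 1.x. The launch command appears as a comment so that the fragment stays a valid properties file; the file name `example.conf` and the agent name `a1` are hypothetical.

```properties
# Launch command, run from the Flume installation directory:
#
#   bin/flume-ng agent --conf conf --conf-file example.conf --name a1
#
# --name selects which agent's settings are loaded from the file,
# so it must match the prefix used in the configuration:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
```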
There Is More!
GetInData
■ Data consulting company
■ We help you benefit from data
● Look at our portfolio: http://getindata.com/portfolio
● Find our trainings: http://getindata.com/trainings
● Learn more about our team: http://getindata.com/team

Streaming analytics better than batch   when and why - (Big Data Tech 2017)Streaming analytics better than batch   when and why - (Big Data Tech 2017)
Streaming analytics better than batch when and why - (Big Data Tech 2017)
 
Read-only rootfs: theory and practice
Read-only rootfs: theory and practiceRead-only rootfs: theory and practice
Read-only rootfs: theory and practice
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
 
Apache Maven
Apache MavenApache Maven
Apache Maven
 
CloudOpen North America 2013: Vagrant & CFEngine
CloudOpen North America 2013: Vagrant & CFEngineCloudOpen North America 2013: Vagrant & CFEngine
CloudOpen North America 2013: Vagrant & CFEngine
 
KScope14 Oracle EPM Troubleshooting
KScope14 Oracle EPM TroubleshootingKScope14 Oracle EPM Troubleshooting
KScope14 Oracle EPM Troubleshooting
 
Introduction to Apache Flink, Vienna 07.11.2018
Introduction to Apache Flink, Vienna 07.11.2018Introduction to Apache Flink, Vienna 07.11.2018
Introduction to Apache Flink, Vienna 07.11.2018
 
Chef on SmartOS
Chef on SmartOSChef on SmartOS
Chef on SmartOS
 
TripleO Lightning Talk
TripleO Lightning TalkTripleO Lightning Talk
TripleO Lightning Talk
 
The Kitchen Cloud How To: Automating Joyent SmartMachines with Chef
The Kitchen Cloud How To: Automating Joyent SmartMachines with ChefThe Kitchen Cloud How To: Automating Joyent SmartMachines with Chef
The Kitchen Cloud How To: Automating Joyent SmartMachines with Chef
 
Streaming analytics better than batch – when and why by Dawid Wysakowicz and ...
Streaming analytics better than batch – when and why by Dawid Wysakowicz and ...Streaming analytics better than batch – when and why by Dawid Wysakowicz and ...
Streaming analytics better than batch – when and why by Dawid Wysakowicz and ...
 
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
Creating Real-Time Data Streaming powered by SQL on Kubernetes - Albert Lewan...
 
PHP Installed on IBM i - the Nickel Tour
PHP Installed on IBM i - the Nickel TourPHP Installed on IBM i - the Nickel Tour
PHP Installed on IBM i - the Nickel Tour
 

More from GetInData

How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...GetInData
 
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczData-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczGetInData
 
How NOT to win a Kaggle competition
How NOT to win a Kaggle competitionHow NOT to win a Kaggle competition
How NOT to win a Kaggle competitionGetInData
 
How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? GetInData
 
OpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierOpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierGetInData
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformGetInData
 
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataModel serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataGetInData
 
MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...GetInData
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...GetInData
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataGetInData
 
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...GetInData
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataGetInData
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...GetInData
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataGetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...GetInData
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...GetInData
 
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...GetInData
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...GetInData
 
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataStrategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataGetInData
 

More from GetInData (20)

How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...How do we work with customers on Big Data / ML / Analytics Projects using Scr...
How do we work with customers on Big Data / ML / Analytics Projects using Scr...
 
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr MenclewiczData-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
Data-Driven Fast Track: Introduction to data-drivenness with Piotr Menclewicz
 
How NOT to win a Kaggle competition
How NOT to win a Kaggle competitionHow NOT to win a Kaggle competition
How NOT to win a Kaggle competition
 
How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team? How to become good Developer in Scrum Team?
How to become good Developer in Scrum Team?
 
OpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easierOpenLineage & Airflow - data lineage has never been easier
OpenLineage & Airflow - data lineage has never been easier
 
Benefits of a Homemade ML Platform
Benefits of a Homemade ML PlatformBenefits of a Homemade ML Platform
Benefits of a Homemade ML Platform
 
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataModel serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
 
MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...MLOps implemented - how we combine the cloud & open-source to boost data scie...
MLOps implemented - how we combine the cloud & open-source to boost data scie...
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInDataFeast + Amundsen Integration - Mariusz Strzelecki, GetInData
Feast + Amundsen Integration - Mariusz Strzelecki, GetInData
 
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...Kubernetes and real-time analytics - how to connect these two worlds with Apa...
Kubernetes and real-time analytics - how to connect these two worlds with Apa...
 
Big data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInDataBig data trends - Krzysztof Zarzycki, GetInData
Big data trends - Krzysztof Zarzycki, GetInData
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
Analytics 101 - How to build a data-driven organisation? - Rafał Małanij, Get...
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...Complex event processing platform handling millions of users - Krzysztof Zarz...
Complex event processing platform handling millions of users - Krzysztof Zarz...
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...
 
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...Managing Big Data projects in a constantly changing environment - Rafał Zalew...
Managing Big Data projects in a constantly changing environment - Rafał Zalew...
 
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
NLP for videos: Understanding customers' feelings in videos - Albert Lewandow...
 
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInDataStrategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
Strategies for on premise to Google Cloud migration - Mateusz Pytel, GetInData
 

Recently uploaded

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 

Recently uploaded (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 

Apache Flume

  • 1. © Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent. Apache Flume
  • 2. Motivation
  • 3. Motivation: You have a lot of servers and systems ■ network devices ■ operating systems ■ web servers ■ applications
  • 4. Motivation: They generate a lot of logs and other data
  • 5. Motivation: I have a business idea for how to use this data!
  • 6. Motivation: You have a Hadoop cluster running
  • 7. Motivation: You want to move the logs to Hadoop
  • 8. Traditional Solutions ■ Own scripts ● Probably a combination of several tools [names not captured in the transcript] ● Cron or start/stop manually ● Hardcoded or missing configuration ● Tightly coupled with the data that is transferred
  • 9. Complications ■ High delays ■ Limited manageability ● Compression, encryption, various file formats ● Throughput ● Configuration and monitoring ■ Limited scalability ● Data explosion, failover, load balancing
  • 10. Apache Flume ■ Aims to solve this problem! ■ It can move large amounts of streaming event data from one place to another ● e.g. from web servers to a Hadoop cluster
  • 11. Overview
  • 12. Various systems that constantly generate data in the form of events
  • 13. Flume Agent ■ Installed on each node ■ Collects events
  • 14. Interceptor ■ Filters useless events ■ Decorates events by adding metadata ● e.g. timestamp, hostname, UUID, static markers
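The two interceptor roles on this slide, filtering and decorating, can be sketched as a chain of small functions. This is an illustrative model only, not Flume's Java interceptor API; the dict-based event, the function names and the "DEBUG" filter rule are all assumptions made for the example.

```python
import socket
import time

def drop_debug_interceptor(event):
    """Filter: returning None drops the event (hypothetical rule)."""
    return None if b"DEBUG" in event["body"] else event

def timestamp_interceptor(event):
    """Decorate: add the time the event was seen, in milliseconds."""
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event

def host_interceptor(event):
    """Decorate: add the name of the host that collected the event."""
    event["headers"]["host"] = socket.gethostname()
    return event

def apply_interceptors(event, interceptors):
    """Run the event through each interceptor in order; stop if it is dropped."""
    for intercept in interceptors:
        event = intercept(event)
        if event is None:
            return None
    return event

chain = [drop_debug_interceptor, timestamp_interceptor, host_interceptor]
kept = apply_interceptors({"headers": {}, "body": b"GET /index.html 200"}, chain)
dropped = apply_interceptors({"headers": {}, "body": b"DEBUG cache warm"}, chain)
```

Putting the filter first means dropped events never pay the cost of decoration.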
  • 15. Encrypt: Encrypts events in a file on disk
  • 16. Sends events to the next-hop Flume agent
  • 17. Compress: Compression is supported
  • 18. Events can be multiplexed to multiple agents (to spread the load) …
  • 19. … or replicated for redundancy
  • 20. … to survive a permanent failure of an agent, disk or node.
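The difference between multiplexing and replication can be sketched as two routing functions. This is a simplified model, not Flume code: it assumes multiplexing is driven by a header value (the slides do not say how events are split), and the channel names and the "type" header are hypothetical.

```python
def replicate(event, channels):
    """Replicating: every channel receives a copy of the event."""
    for buffered in channels.values():
        buffered.append(event)

def multiplex(event, channels, header, mapping, default):
    """Multiplexing: a header value selects the single channel for the event."""
    target = mapping.get(event["headers"].get(header), default)
    channels[target].append(event)

events = [
    {"headers": {"type": "A"}, "body": b"a"},
    {"headers": {"type": "B"}, "body": b"b"},
]

replicated = {"c1": [], "c2": []}
for event in events:
    replicate(event, replicated)

multiplexed = {"c1": [], "c2": []}
for event in events:
    multiplex(event, multiplexed, "type", {"A": "c1", "B": "c2"}, default="c1")
```

After this runs, both replicated channels hold A and B, while each multiplexed channel holds only the event routed to it.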
  • 21. Events can also be delivered in “failover” mode where …
  • 22. … in case of a failure of the next-hop agent …
  • 23. … we try the next agent(s) on a prioritized list.
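Failover delivery over a prioritized list can be sketched as follows. This is a minimal model under stated assumptions: each next-hop agent is represented by a hypothetical `(name, priority, send)` tuple, a higher priority number is tried first, and a failed agent is signalled by `ConnectionError`.

```python
def deliver_with_failover(event, sinks):
    """Try next-hop agents from highest to lowest priority.

    Returns the name of the agent that accepted the event.
    """
    for name, _priority, send in sorted(sinks, key=lambda s: s[1], reverse=True):
        try:
            send(event)
            return name
        except ConnectionError:
            continue  # this agent is down, fall through to the next one
    raise RuntimeError("all next-hop agents failed")

received = []

def primary_down(event):
    # Simulates the failure of the preferred next-hop agent.
    raise ConnectionError("primary agent unreachable")

sinks = [("backup", 5, received.append), ("primary", 10, primary_down)]
used = deliver_with_failover({"body": b"C"}, sinks)
```

A production implementation would normally also penalize a failed agent for a while instead of retrying it on every event; that is left out here for brevity.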
  • 24. Events can also be load-balanced (round robin, random or custom) …
  • 25. … and go to different next-hop agents to spread the load
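The round-robin and random strategies named on the slide can be sketched as two generators that pick the next-hop agent for each event. The agent names are hypothetical, and this models only the selection step, not delivery or back-off.

```python
import itertools
import random

def round_robin(agents):
    """Cycle through the agents in a fixed order."""
    return itertools.cycle(agents)

def random_choice(agents, seed=None):
    """Pick an agent at random for every event."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(agents)

agents = ["agent-1", "agent-2"]
picker = round_robin(agents)
assignments = [(event, next(picker)) for event in ["E", "F", "G", "H"]]
# Round robin alternates: E and G go to agent-1, F and H go to agent-2.
```

A custom strategy would be any other generator with the same shape, for example one weighted by agent capacity.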
  • 26. Events can be stored ■ in memory (for performance) ■ on disk (for durability)
  • 27. Events can finally be transferred to HDFS ■ Multiple file formats, e.g. Text, JSON, Avro ■ Compression supported ■ Flexible HDFS path names
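One way to read "flexible HDFS path names" is that the target directory is derived from event metadata, such as a timestamp header. The sketch below illustrates that idea with Python's `strftime`; the template, the `timestamp` header in milliseconds and the `/flume/events` prefix are assumptions for the example, not Flume's own path syntax.

```python
from datetime import datetime, timezone

def bucket_path(template, headers):
    """Expand date/time placeholders in the template from the event timestamp."""
    seconds = int(headers["timestamp"]) / 1000
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime(template)

path = bucket_path("/flume/events/%Y-%m-%d/%H", {"timestamp": "1400000000000"})
# 1400000000 seconds after the epoch is 2014-05-13 16:53:20 UTC,
# so the event lands in /flume/events/2014-05-13/16
```

Bucketing by the event's own timestamp, rather than the arrival time, keeps late events in the directory they logically belong to.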
  • 28. © Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent. Flume Agent Flume Agent Flume Agent Flume Agent Flume Agent Interceptor Encrypt Flume Agent Flume Agent Flume Agent However, ■ Many destinations are supported ■ One can implement a custom one ones Compress
  • 29. Flume ■ Distributed ● Agents installed on many machines ■ Scalable ● Add more machines to transfer more events ■ Reliable ● Durable storage, failover and/or replication ■ Manageable ● Easy to install, configure, reconfigure and run
  • 30. Flume ■ Nicely integrated with the Hadoop Ecosystem ● Various destinations e.g. HDFS, HBase ● Various file formats e.g. Avro, SequenceFile ■ Extensible ● Possibility to add new functionality e.g. source and destination for events
  • 31. Architecture
  • 32. Event ■ Unit of data transported by Flume ■ Consists of Headers and a Payload ■ Generally small ■ You can add your own headers e.g. hostname, timestamp
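One common way to add headers such as hostname and timestamp is to attach interceptors to a source. The fragment below is a sketch; the agent name `a1`, the source name `r1` and the interceptor names `i1` and `i2` are assumptions.

```properties
# Hypothetical source "r1" on agent "a1" with two interceptors.
a1.sources.r1.interceptors = i1 i2

# Adds a "host" header; useIP = false stores the hostname instead of the IP.
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.useIP = false

# Adds a "timestamp" header with the processing time in milliseconds.
a1.sources.r1.interceptors.i2.type = timestamp
```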
  • 33. Flume Agent ■ Responsible for transferring events ■ Runs in a JVM ■ Consists of Source(s), Channel(s) and Sink(s)
  • 34. Source ■ Collects events and forwards them to Channels ● HTTP, JMS, RPC, NetCat ● Exec ● Spooling directory
  • 35. Exec Source ■ Runs a given Unix command on startup ● The command should run continuously and produce data on standard output ■ If the process exits, the source also exits and will NOT produce any further data
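A minimal Exec Source sketch follows. The agent, source and channel names and the log path are assumptions. The `restart` settings are an optional mitigation for the exit behaviour described above and do not make the source reliable.

```properties
# Hypothetical Exec Source tailing an application log.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/access.log
a1.sources.r1.channels = c1

# Optionally restart the command if it dies, after waiting 10 seconds.
a1.sources.r1.restart = true
a1.sources.r1.restartThrottle = 10000
```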
  • 36. Spooling Directory Source ■ Watches a specified directory for new files ■ Parses events out of new files as they appear ■ After a file has been fully processed, it is renamed to indicate completion (or optionally deleted)
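The rename-or-delete behaviour maps onto two settings of the Spooling Directory Source. This is a sketch with an assumed directory and assumed agent, source and channel names:

```properties
# Hypothetical Spooling Directory Source.
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/spool/flume/incoming
a1.sources.r1.channels = c1

# Suffix appended to a file once it has been fully processed.
a1.sources.r1.fileSuffix = .COMPLETED
# "never" keeps processed files; "immediate" deletes them instead.
a1.sources.r1.deletePolicy = never
```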
  • 37. Channel ■ Buffers incoming events until they are extracted by Sinks ■ Tradeoff between durability and throughput ● Memory ● File ● JDBC
  • 38. Memory Channel ■ Events stored in an in-memory queue ■ Configurable capacity ● The maximum number of events and/or bytes in memory ■ Nondurable, but faster
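The capacity limits of a Memory Channel could be set as in the sketch below. The names and numbers are illustrative assumptions, not recommendations.

```properties
# Hypothetical Memory Channel "c1" on agent "a1".
a1.channels.c1.type = memory
# Maximum number of events held in the channel.
a1.channels.c1.capacity = 10000
# Maximum number of events per transaction (put or take).
a1.channels.c1.transactionCapacity = 1000
# Maximum total size, in bytes, of the event bodies held in memory.
a1.channels.c1.byteCapacity = 100000000
```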
  • 39. File Channel ■ Events stored in files on disk ● Durable ■ Flushes to disk at the end of each transaction ● Supports encryption ■ Configurable capacity
  • 40. File Channel ■ The more disks ● The better the performance ● The higher the capacity ■ Can be limited by the amount of memory for the in-memory queue that keeps pointers to all events stored in log files
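A File Channel spread over several disks might look like the sketch below. The mount points and names are assumptions; the point is that `dataDirs` accepts a comma-separated list, so data directories can sit on separate disks.

```properties
# Hypothetical File Channel "c1" on agent "a1".
a1.channels.c1.type = file
# Checkpoint on one disk, data directories on two other disks.
a1.channels.c1.checkpointDir = /disk1/flume/checkpoint
a1.channels.c1.dataDirs = /disk2/flume/data,/disk3/flume/data

# Maximum number of events in the channel and per transaction.
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000
```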
  • 41. Sink ■ Removes events from a Channel and forwards them to their next destination ● HDFS, HBase, Solr, ElasticSearch ● File, Logger ● Flume Agent
  • 42. HDFS Sink ■ Writes events to HDFS ● Flexible naming of HDFS paths ■ Multiple file formats are supported e.g. Text, Avro
  • 43. HDFS Sink ■ Rollover properties for files generated by HDFS Sink ● Number of seconds to wait before rolling a file (0 deactivates this feature) ● File size, in bytes, to trigger roll of a file (0 deactivates this feature) ● Number of events written to file before rolling a file (0 deactivates this feature) ● Timeout after which inactive files get closed (0 deactivates this feature) ● Number of events written to file before they are flushed to HDFS
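The property names for these rollover settings are not present in the slide text. The sketch below pairs each description with the HDFS Sink property documented under that meaning; the agent, sink and channel names, the path and all values are assumptions.

```properties
# Hypothetical HDFS Sink "k1" on agent "a1".
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Date-based path; use the agent's local time for the escape sequences.
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Seconds to wait before rolling a file (0 = never roll on time).
a1.sinks.k1.hdfs.rollInterval = 300
# File size in bytes that triggers a roll (0 = never roll on size); 128 MiB here.
a1.sinks.k1.hdfs.rollSize = 134217728
# Number of events that triggers a roll (0 = never roll on event count).
a1.sinks.k1.hdfs.rollCount = 0
# Seconds after which an inactive file is closed (0 = disabled).
a1.sinks.k1.hdfs.idleTimeout = 60
# Number of events written before they are flushed to HDFS.
a1.sinks.k1.hdfs.batchSize = 1000
```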
  • 44. HDFS Sink ■ Rolling files will generate many small files ● Need to compact them to avoid an explosion of HDFS metadata ■ Often, you also want to deduplicate, filter and split events
  • 45-48. Avro Source And Sink ■ Avro Sink sends a batch of Avro events to a configured hostname:port ■ Avro Source listens to events on a given port ■ Optionally, events are compressed and encrypted
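A two-agent hop could be sketched as below. The agent names `agent1` and `agent2`, the host `collector.example.com` and the port are assumptions. Compression has to be enabled on both ends, and the SSL settings used for encryption are left out to keep the fragment short.

```properties
# Sending side: hypothetical Avro Sink on "agent1".
agent1.sinks.k1.type = avro
agent1.sinks.k1.channel = c1
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4141
# Number of events sent per batch.
agent1.sinks.k1.batch-size = 100
agent1.sinks.k1.compression-type = deflate

# Receiving side: hypothetical Avro Source on "agent2".
agent2.sources.r1.type = avro
agent2.sources.r1.channels = c1
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4141
# Must match the compression type used by the sending Avro Sink.
agent2.sources.r1.compression-type = deflate
```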
  • 49. Reliability ■ Durable channels ● Survive agent failures, machine restarts and other non-disk-related failures ■ Redundant path in a workflow topology ● Survive the failure of a node ● Achieved via replication or failover
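Replication can be sketched with a replicating channel selector, which copies every event from one source into several channels, each drained by its own sink. All names below are assumptions.

```properties
# Hypothetical source "r1" writing every event to both channels.
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Each channel is drained by its own sink, giving two independent paths.
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```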
  • 50. Reliability ■ Sufficient capacity of the channels ● Minimize the back pressure on earlier points in the flow ● Some sources might not be able to resend the data, e.g. Exec Source does not handle failures and might lose data, whereas Spooling Directory Source offers reliability guarantees
  • 51. Reliability: step 1, the Avro Sink of the sending Agent starts a transaction (its Channel holds D, C, B, A)
  • 52. Reliability: step 2, the Sink takes a batch of events (B, A) from the Channel
  • 53. Reliability: step 3, the Sink sends the batch (B, A) to the Avro Source of the next-hop Agent
  • 54. Reliability: step 4, the receiving Avro Source starts its own transaction
  • 55. Reliability: step 5, the Source puts the events (B, A) into its Channel
  • 56. Reliability: step 6, the receiving Agent stops its transaction
  • 57. Reliability: step 7, the sending Agent stops its transaction; B, A are now in the next-hop Channel and D, C remain in the sender's Channel
  • 58. When Flume Is Not A Good Fit ■ Very large events ● An event cannot be larger than the memory or disk on an agent’s machine ■ Infrequent bulk loads ● Other tools might be better e.g. HDFS File Slurper
  • 59. Configuration
  • 60. Configuration ■ Simple format ■ A configuration file can contain configuration settings for many Agents ● Only settings needed by the Agent will be loaded ■ Agent automatically reloads configuration if it changes
  • 61. Configuration Example ■ We configure Flume to run a single agent that 1. listens for data on a given port 2. turns each line of incoming text into an event 3. sends the events to HDFS via an in-memory channel.
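The single-agent setup described above could be written roughly as follows. This is a sketch, not the configuration shown in the original slides: the agent name `a1`, the component names, the port and the HDFS path are all assumptions. A NetCat source is used because it listens on a port and turns each line of text into an event.

```properties
# example.conf: hypothetical single-agent configuration.

# Name the components of agent "a1".
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen on a port; each incoming line of text becomes an event.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: write events to HDFS as plain text.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

# Wire the source and the sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Assuming the file is saved as `example.conf`, an agent with this name would typically be started with `flume-ng agent --conf conf --conf-file example.conf --name a1`.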
  • 62. Agent
  • 63. Source
  • 64. Channel
  • 65. Sink
  • 66. Starting Agent
  • 67. There Is More!
  • 68. GetInData ■ Data consulting company ■ We help you benefit from data ● Look at our portfolio: http://getindata.com/portfolio ● Find our trainings: http://getindata.com/trainings ● Learn more about our team: http://getindata.com/team