SlideShare a Scribd company logo
1 of 34
Download to read offline
© 2015 StreamSets Inc., All rights reserved
Apache Flume
Data Aggregation At Scale
Arvind Prabhakar
© 2015 StreamSets, Inc.
Who am I?
❏  Founder/CTO
Apache Software Foundation
❏  Flume - PMC Chair
❏  Sqoop - PMC Chair
❏  Storm - PMC, Committer
❏  MetaModel - Mentor
❏  Sentry - Mentor
❏  NiFi - Mentor
❏  ASF Member
Previously...
❏  Cloudera
❏  Informatica
Arvind Prabhakar
© 2015 StreamSets, Inc.
What is Flume?
Logs
Files
Click
Streams
Sensors
Devices
Database
Logs
Social Data
Streams
Feeds
Other
Raw Storage
(HDFS, S3)
EDW, NoSQL
(Hive, Impala,
HBase,
Cassandra)
Search
(Solr,
ElasticSearch)
Enterprise Data
Infrastructure
Apache Flume is a
continuous data
ingestion system that
is...
●  open-source,
●  reliable,
●  scalable,
●  manageable,
●  customizable,
...and designed for Big
Data ecosystem.
© 2015 StreamSets, Inc.
...for Big Data ecosystem?
“Big data is an all-encompassing term for any collection
of data sets so large and complex that it becomes difficult
to process using traditional data processing applications.”
Big Data from a Data Ingestion Perspective
●  Logical Data Sources are physically distributed
●  Data production is continuous / never ending
●  Data structure and semantics change without notice
© 2015 StreamSets, Inc.
Physically Distributed Data Sources
●  Many physical sources
that produce data
●  Number of physical
sources changes
constantly
●  Sources may exist in
different governance
zones, data centers,
continents...
© 2015 StreamSets, Inc.
“Every two days now we create as much information
as we did from the dawn of civilization up until 2003”
-  Eric Schmidt, 2010
Continuous Data Production
●  Weather
●  Traffic
●  Automobiles
●  Trains
●  Airplanes
●  Geological/Seismic
●  Oceanographic
●  Smart Phones
●  Health Accessories
●  Medical Devices
●  Home Automation
●  Digital Cameras
●  Social Media
●  Geolocation
●  Shop Floor Sensors
●  Network Activity
●  Industry Appliances
●  Security/Surveillance
●  Server Workloads
●  Digital Telephony
●  Bio-simulations...
© 2015 StreamSets, Inc.
A Notable Prediction From 2014...
“Breakthrough in
compression technology
will succeed in convert
#bigdata to regular data
#2014BigDataPredictions”
http://bit.ly/1x6cz9M
1:41 PM - 20 Dec 2013
© 2015 StreamSets, Inc.
Ever Changing Structure of Data
●  One of your data centers
upgrade to IPv6
192.168.0.4 fe80::21b:21ff:fe83:90fa
M0137: User {jonsmith} granted access to {accounts}
M0137: [jonsmith] granted access to [sys.accounts]
{
“first”:”jon”,
“last”:”smith”,
“add1”:”123 Main St.”,
“add2”:”Ste - 4”,
“city”:”Little Town”,
“state”:”AZ”,
“zip”: “12121”
}
{
“first”:”jon”,
“last”:”smith”,
“add1”:”123 Main St.”,
“add2”:”Ste - 4”,
“city”:”Little Town”,
“state”:”AZ”,
“zip”: “12121”,
“phone”: “(408) 555-1212”
}
●  Application developer
changes logs (again)
●  JSON data may contain
more attributes than
expected
© 2015 StreamSets, Inc.
So, from Data Ingestion Perspective:
Massive collection of ever changing physical sources...
Never ending data production...
Continuously evolving data structures and semantics...
© 2015 StreamSets, Inc.
© 2015 StreamSets, Inc.
Flume to the Rescue!
© 2015 StreamSets, Inc.
Apache Flume
●  Originally designed to be a log
aggregation system by
Cloudera Engineers
●  Evolved to handle any type of
streaming event data
●  Low-cost of installation,
operation and maintenance
●  Highly customizable and
extendable
© 2015 StreamSets, Inc.
A Closer Look at Flume
Agent Agent AgentAgentInput Destination
●  Distributed Pipeline Architecture
●  Optimized for commonly used data sources and destinations
●  Built in support for contextual routing
●  Fully customizable and extendable
eg Web Logs eg HDFS
© 2015 StreamSets, Inc.
Anatomy of a Flume Agent
Flume Agent
Source
Sink
Channel
Incoming
Data
Outgoing
Data
Source
●  Accepts incoming
Data
●  Scales as required
●  Writes data to
Channel
Sink
●  Removes data from
Channel
●  Sends data to
downstream Agent or
Destination
Channel
●  Stores data in the
order received
© 2015 StreamSets, Inc.
Transactional Data Exchange
Flume Agent
Source
Sink
Channel
Incoming
Data
Outgoing
Data
Source TX
Sink TX
●  Source uses transactions to write to the channel
Upstream Sink TX
●  Sink uses transactions to remove data from the channel
●  Sink transaction commits only after successful transfer of data
●  This ensures no data loss in Flume pipeline
© 2015 StreamSets, Inc.
Routing and Replicating
Flume Agent
Source
Sink 1
Channel 1
Incoming
Data
Outgoing Data
Channel 2
Sink 2 Outgoing Data
●  Source can replicate or multiplex data across many channels
●  Metadata headers can be used to do contextual selection of
channels
●  Channels can be drained by different sinks to different
destinations or pipelines
© 2015 StreamSets, Inc.
Why Channels?
●  Buffers data and insulates downstream from load spikes
●  Provides persistent store for data in case the process restarts
●  Provides flow ordering* and transactional guarantees
© 2015 StreamSets, Inc.
Use-Case: Log Aggregation
© 2015 StreamSets, Inc.
Starting Out Simple
●  You would like to move your web-
server logs to HDFS
●  Let’s assume there are only 3 web
servers at the time of launch
●  Ad-hoc solution will likely suffice!
Challenges
●  How do you manage your output paths on HDFS?
●  How do you maintain your client code in face of changing
environment as well as requirements?
© 2015 StreamSets, Inc.
Adding a Single Flume Agent
Advantages
●  Insulation from HDFS downtime
●  Quick offload of logs from Web
Server machines
●  Better Network utilization
Challenges
●  What if the Flume node goes down?
●  Can one Flume node accommodate all load from Web Servers?
© 2015 StreamSets, Inc.
Adding Two Flume Agents
Advantages
●  Redundancy and Availability
●  Better handling of downstream
failures
●  Automatic load balancing and
failover
Challenges
●  What happens when new Web Servers are added?
●  Can two Flume Agents keep up with all the load from more Web
Servers?
© 2015 StreamSets, Inc.
Handling a Server Farm
A Converging Flow
●  Traffic is aggregated by Tier-2 and
Tier-3 before being put into
destination system
●  Closer a tier is to the destination,
larger the batch size it delivers
downstream
●  Optimized handling of destination
systems
© 2015 StreamSets, Inc.
Data Volume Per Agent
Batch Size Variation per Agent
●  Event volume is least in the
outermost tier
●  Event volume increases as the
flow converges
●  Event volume is highest in the
innermost tier
© 2015 StreamSets, Inc.
Data Volume Per Tier
Batch Size Variation per Tier
●  In steady state, all tiers carry
same event volume
●  Transient variations in flow are
absorbed and ironed out by
channels
●  Load spikes are handled smoothly
without overwhelming the
infrastructure
© 2015 StreamSets, Inc.
Planning and Sizing Flume Topology
for Log-Aggregation Use-Case
© 2015 StreamSets, Inc.
Planning and Sizing Your Topology
What we know:
●  Number of Web Servers
●  Log volume per Web
Server per unit time
●  Destination System and
layout (Routing
Requirements)
●  Worst case downtime for
destination system
What we will calculate:
●  Number of tiers
●  Exit Batch Sizes
●  Channel capacity
© 2015 StreamSets, Inc.
Calculating Number of Tiers
Rule of Thumb
One Aggregating Agent (A) can be used with
anywhere from 4 to 16 client Agents
Considerations
●  Must handle projected ingest volume
●  Resulting number of tiers should provide for
routing, load-balancing and failover
requirements
Gotchas
Load test to ensure that steady state and peak
load are addressed with adequate failover capacity
© 2015 StreamSets, Inc.
Calculating Exit Batch Size
Rule of Thumb
Exit batch size is same as total exit data volume
divided by number of Agents in a tier
Considerations
Gotchas
Load test fail-over scenario to ensure near
steady-state drain
●  Having some extra room is good
●  Keep contextual routing in mind
●  Consider duplication impact when batch
sizes are large
© 2015 StreamSets, Inc.
Calculating Channel Capacity
Source
Sink
X
Source
Sink
X
X
Rule of Thumb
Equal to worst case data ingest rate sustained
over the worst case downstream outage
interval
Considerations
●  Multiple disks will yield better performance
●  Channel size impacts the back-pressure
buildup in the pipeline
You may need more disk space than the
physical footprint of the data size
Gotchas
© 2015 StreamSets, Inc.
To Recap
Calculated with upstream to downstream Agent ratio ranging from 4:1 to
16:1. Factor in routing, failover, load-balancing requirements...
Number of Tiers
Calculated for steady state data volume exiting the tier, divided by
number of Agents in that tier. Factor in contextual routing and duplication
due to transient failure impact...
Exit Batch Size
Calculated as worst case ingest rate sustained over the worst case
downstream downtime. Factor in number of disks used etc...
Channel Capacity
...and that’s all there is to it!
© 2015 StreamSets, Inc.
© 2015 StreamSets, Inc.
Some Highlights of Flume
●  Flume is suitable for large volume data collection, especially when
data is being produced in multiple locations
●  Once planned and sized appropriately, Flume will practically run
itself without any operational intervention
●  Flume provides weak ordering guarantee, i.e., in the absence of
failures the data will arrive in the order it was received in the Flume
pipeline
●  Transactional exchange ensures that Flume never loses any data in
transit between Agents. Sinks use transactions to ensure data is not
lost at point of ingest or terminal destinations.
●  Flume has rich out-of-the box features such as contextual routing,
and support for popular data sources and destination systems
© 2015 StreamSets, Inc.
Things that could be better...
●  Handling of poison events
●  Ability to tail files
●  Ability to handle preset data formats such as JSON, CSV, XML
●  Centralized configuration
●  Once-only delivery semantics
●  ...and more
Remember: patches are welcome!
© 2015 StreamSets, Inc.
Thank You!
Contact:
●  Email: arvind at streamsets dot com
●  Twitter: @aprabhakar
More on Flume:
●  http://flume.apache.org/
●  User Mailing List: user-subscribe@flume.apache.org
●  Developer Mailing List: dev-subscribe@flume.apache.org
●  JIRA: https://issues.apache.org/jira/browse/FLUME

More Related Content

What's hot

Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeYahoo Developer Network
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutesdwmclary
 
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...JinfengHuang3
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoopChristophe Marchal
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupKarthik Ramasamy
 
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiHBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiMichael Stack
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache KafkaJoe Stein
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程HBaseCon
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and  High-Demand EnvironmentHBaseCon 2015: HBase at Scale in an Online and  High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand EnvironmentHBaseCon
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaJay Kreps
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.
 
SAP OS/DB Migration using Azure Storage Account
SAP OS/DB Migration using Azure Storage AccountSAP OS/DB Migration using Azure Storage Account
SAP OS/DB Migration using Azure Storage AccountGary Jackson MBCS
 

What's hot (20)

Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
 
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
 
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
 
Kafka internals
Kafka internalsKafka internals
Kafka internals
 
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at HuaweiHBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and  High-Demand EnvironmentHBaseCon 2015: HBase at Scale in an Online and  High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
SAP OS/DB Migration using Azure Storage Account
SAP OS/DB Migration using Azure Storage AccountSAP OS/DB Migration using Azure Storage Account
SAP OS/DB Migration using Azure Storage Account
 

Similar to Data Aggregation at Scale with Apache Flume

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Data Con LA
 
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHINGBig Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHINGMatt Stubbs
 
Apache Apex Meetup at Cask
Apache Apex Meetup at CaskApache Apex Meetup at Cask
Apache Apex Meetup at CaskApache Apex
 
Apache Apex - Hadoop Users Group
Apache Apex - Hadoop Users GroupApache Apex - Hadoop Users Group
Apache Apex - Hadoop Users GroupPramod Immaneni
 
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...Yahoo Developer Network
 
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupDataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupThomas Weise
 
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise ClusterBest Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise ClusterInfluxData
 
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream ProcessingFull Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream ProcessingSafe Software
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...Amazon Web Services
 
Network characteristics of the cloud
Network characteristics of the cloudNetwork characteristics of the cloud
Network characteristics of the cloudCloud Genius
 
Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016Felix GV
 
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...CA Technologies
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...CA Technologies
 
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick ParkerDevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick ParkerR3
 
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedIn
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedInБыстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedIn
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedInCEE-SEC(R)
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
 
When Streaming Becomes Strategic
When Streaming Becomes StrategicWhen Streaming Becomes Strategic
When Streaming Becomes StrategicMapR Technologies
 

Similar to Data Aggregation at Scale with Apache Flume (20)

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
 
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHINGBig Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
 
Event Driven Architecture
Event Driven ArchitectureEvent Driven Architecture
Event Driven Architecture
 
Apache Apex Meetup at Cask
Apache Apex Meetup at CaskApache Apex Meetup at Cask
Apache Apex Meetup at Cask
 
Apache Apex - Hadoop Users Group
Apache Apex - Hadoop Users GroupApache Apex - Hadoop Users Group
Apache Apex - Hadoop Users Group
 
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
 
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application MeetupDataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application Meetup
 
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise ClusterBest Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise Cluster
 
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream ProcessingFull Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
 
Network characteristics of the cloud
Network characteristics of the cloudNetwork characteristics of the cloud
Network characteristics of the cloud
 
Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016
 
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...
 
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick ParkerDevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
 
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedIn
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedInБыстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedIn
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedIn
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
When Streaming Becomes Strategic
When Streaming Becomes StrategicWhen Streaming Becomes Strategic
When Streaming Becomes Strategic
 

Recently uploaded

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 

Data Aggregation at Scale with Apache Flume

  • 1. © 2015 StreamSets Inc., All rights reserved Apache Flume Data Aggregation At Scale Arvind Prabhakar
  • 2. © 2015 StreamSets, Inc. Who am I? ❏  Founder/CTO Apache Software Foundation ❏  Flume - PMC Chair ❏  Sqoop - PMC Chair ❏  Storm - PMC, Committer ❏  MetaModel - Mentor ❏  Sentry - Mentor ❏  NiFi - Mentor ❏  ASF Member Previously... ❏  Cloudera ❏  Informatica Arvind Prabhakar
  • 3. © 2015 StreamSets, Inc. What is Flume? Logs Files Click Streams Sensors Devices Database Logs Social Data Streams Feeds Other Raw Storage (HDFS, S3) EDW, NoSQL (Hive, Impala, HBase, Cassandra) Search (Solr, ElasticSearch) Enterprise Data Infrastructure Apache Flume is a continuous data ingestion system that is... ●  open-source, ●  reliable, ●  scalable, ●  manageable, ●  customizable, ...and designed for Big Data ecosystem.
  • 4. © 2015 StreamSets, Inc. ...for Big Data ecosystem? “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.” Big Data from a Data Ingestion Perspective ●  Logical Data Sources are physically distributed ●  Data production is continuous / never ending ●  Data structure and semantics change without notice
  • 5. © 2015 StreamSets, Inc. Physically Distributed Data Sources ●  Many physical sources that produce data ●  Number of physical sources changes constantly ●  Sources may exist in different governance zones, data centers, continents...
  • 6. © 2015 StreamSets, Inc. “Every two days now we create as much information as we did from the dawn of civilization up until 2003” -  Eric Schmidt, 2010 Continuous Data Production ●  Weather ●  Traffic ●  Automobiles ●  Trains ●  Airplanes ●  Geological/Seismic ●  Oceanographic ●  Smart Phones ●  Health Accessories ●  Medical Devices ●  Home Automation ●  Digital Cameras ●  Social Media ●  Geolocation ●  Shop Floor Sensors ●  Network Activity ●  Industry Appliances ●  Security/Surveillance ●  Server Workloads ●  Digital Telephony ●  Bio-simulations...
  • 7. © 2015 StreamSets, Inc. A Notable Prediction From 2014... “Breakthrough in compression technology will succeed in convert #bigdata to regular data #2014BigDataPredictions” http://bit.ly/1x6cz9M 1:41 PM - 20 Dec 2013
  • 8. © 2015 StreamSets, Inc. Ever Changing Structure of Data ●  One of your data centers upgrade to IPv6 192.168.0.4 fe80::21b:21ff:fe83:90fa M0137: User {jonsmith} granted access to {accounts} M0137: [jonsmith] granted access to [sys.accounts] { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121” } { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121”, “phone”: “(408) 555-1212” } ●  Application developer changes logs (again) ●  JSON data may contain more attributes than expected
  • 9. © 2015 StreamSets, Inc. So, from Data Ingestion Perspective: Massive collection of ever changing physical sources... Never ending data production... Continuously evolving data structures and semantics...
  • 11. © 2015 StreamSets, Inc. Flume to the Rescue!
  • 12. © 2015 StreamSets, Inc. Apache Flume ●  Originally designed to be a log aggregation system by Cloudera Engineers ●  Evolved to handle any type of streaming event data ●  Low-cost of installation, operation and maintenance ●  Highly customizable and extendable
  • 13. © 2015 StreamSets, Inc. A Closer Look at Flume Agent Agent AgentAgentInput Destination ●  Distributed Pipeline Architecture ●  Optimized for commonly used data sources and destinations ●  Built in support for contextual routing ●  Fully customizable and extendable eg Web Logs eg HDFS
  • 14. © 2015 StreamSets, Inc. Anatomy of a Flume Agent Flume Agent Source Sink Channel Incoming Data Outgoing Data Source ●  Accepts incoming Data ●  Scales as required ●  Writes data to Channel Sink ●  Removes data from Channel ●  Sends data to downstream Agent or Destination Channel ●  Stores data in the order received
  • 15. © 2015 StreamSets, Inc. Transactional Data Exchange Flume Agent Source Sink Channel Incoming Data Outgoing Data Source TX Sink TX ●  Source uses transactions to write to the channel Upstream Sink TX ●  Sink uses transactions to remove data from the channel ●  Sink transaction commits only after successful transfer of data ●  This ensures no data loss in Flume pipeline
  • 16. © 2015 StreamSets, Inc. Routing and Replicating Flume Agent Source Sink 1 Channel 1 Incoming Data Outgoing Data Channel 2 Sink 2 Outgoing Data ●  Source can replicate or multiplex data across many channels ●  Metadata headers can be used to do contextual selection of channels ●  Channels can be drained by different sinks to different destinations or pipelines
  • 17. © 2015 StreamSets, Inc. Why Channels? ●  Buffers data and insulates downstream from load spikes ●  Provides persistent store for data in case the process restarts ●  Provides flow ordering* and transactional guarantees
  • 18. © 2015 StreamSets, Inc. Use-Case: Log Aggregation
  • 19. © 2015 StreamSets, Inc. Starting Out Simple ●  You would like to move your web- server logs to HDFS ●  Let’s assume there are only 3 web servers at the time of launch ●  Ad-hoc solution will likely suffice! Challenges ●  How do you manage your output paths on HDFS? ●  How do you maintain your client code in face of changing environment as well as requirements?
  • 20. © 2015 StreamSets, Inc. Adding a Single Flume Agent Advantages ●  Insulation from HDFS downtime ●  Quick offload of logs from Web Server machines ●  Better Network utilization Challenges ●  What if the Flume node goes down? ●  Can one Flume node accommodate all load from Web Servers?
  • 21. © 2015 StreamSets, Inc. Adding Two Flume Agents Advantages ●  Redundancy and Availability ●  Better handling of downstream failures ●  Automatic load balancing and failover Challenges ●  What happens when new Web Servers are added? ●  Can two Flume Agents keep up with all the load from more Web Servers?
  • 22. © 2015 StreamSets, Inc. Handling a Server Farm A Converging Flow ●  Traffic is aggregated by Tier-2 and Tier-3 before being put into destination system ●  Closer a tier is to the destination, larger the batch size it delivers downstream ●  Optimized handling of destination systems
  • 23. © 2015 StreamSets, Inc. Data Volume Per Agent Batch Size Variation per Agent ●  Event volume is least in the outermost tier ●  Event volume increases as the flow converges ●  Event volume is highest in the innermost tier
  • 24. © 2015 StreamSets, Inc. Data Volume Per Tier Batch Size Variation per Tier ●  In steady state, all tiers carry same event volume ●  Transient variations in flow are absorbed and ironed out by channels ●  Load spikes are handled smoothly without overwhelming the infrastructure
  • 25. © 2015 StreamSets, Inc. Planning and Sizing Flume Topology for Log-Aggregation Use-Case
  • 26. © 2015 StreamSets, Inc. Planning and Sizing Your Topology What we know: ●  Number of Web Servers ●  Log volume per Web Server per unit time ●  Destination System and layout (Routing Requirements) ●  Worst case downtime for destination system What we will calculate: ●  Number of tiers ●  Exit Batch Sizes ●  Channel capacity
  • 27. © 2015 StreamSets, Inc. Calculating Number of Tiers Rule of Thumb One Aggregating Agent (A) can be used with anywhere from 4 to 16 client Agents Considerations ●  Must handle projected ingest volume ●  Resulting number of tiers should provide for routing, load-balancing and failover requirements Gotchas Load test to ensure that steady state and peak load are addressed with adequate failover capacity
  • 28. © 2015 StreamSets, Inc. Calculating Exit Batch Size Rule of Thumb Exit batch size is same as total exit data volume divided by number of Agents in a tier Considerations Gotchas Load test fail-over scenario to ensure near steady-state drain ●  Having some extra room is good ●  Keep contextual routing in mind ●  Consider duplication impact when batch sizes are large
  • 29. © 2015 StreamSets, Inc. Calculating Channel Capacity Source Sink X Source Sink X X Rule of Thumb Equal to worst case data ingest rate sustained over the worst case downstream outage interval Considerations ●  Multiple disks will yield better performance ●  Channel size impacts the back-pressure buildup in the pipeline You may need more disk space than the physical footprint of the data size Gotchas
  • 30. © 2015 StreamSets, Inc. To Recap Calculated with upstream to downstream Agent ratio ranging from 4:1 to 16:1. Factor in routing, failover, load-balancing requirements... Number of Tiers Calculated for steady state data volume exiting the tier, divided by number of Agents in that tier. Factor in contextual routing and duplication due to transient failure impact... Exit Batch Size Calculated as worst case ingest rate sustained over the worst case downstream downtime. Factor in number of disks used etc... Channel Capacity ...and that’s all there is to it!
  • 32. © 2015 StreamSets, Inc. Some Highlights of Flume ●  Flume is suitable for large volume data collection, especially when data is being produced in multiple locations ●  Once planned and sized appropriately, Flume will practically run itself without any operational intervention ●  Flume provides weak ordering guarantee, i.e., in the absence of failures the data will arrive in the order it was received in the Flume pipeline ●  Transactional exchange ensures that Flume never loses any data in transit between Agents. Sinks use transactions to ensure data is not lost at point of ingest or terminal destinations. ●  Flume has rich out-of-the box features such as contextual routing, and support for popular data sources and destination systems
  • 33. © 2015 StreamSets, Inc. Things that could be better... ●  Handling of poison events ●  Ability to tail files ●  Ability to handle preset data formats such as JSON, CSV, XML ●  Centralized configuration ●  Once-only delivery semantics ●  ...and more Remember: patches are welcome!
  • 34. © 2015 StreamSets, Inc. Thank You! Contact: ●  Email: arvind at streamsets dot com ●  Twitter: @aprabhakar More on Flume: ●  http://flume.apache.org/ ●  User Mailing List: user-subscribe@flume.apache.org ●  Developer Mailing List: dev-subscribe@flume.apache.org ●  JIRA: https://issues.apache.org/jira/browse/FLUME