Submit Search
Upload
Data Aggregation at Scale with Apache Flume
•
20 likes
•
6,209 views
AI-enhanced title
Arvind Prabhakar
Follow
Learn how you can easily use Apache Flume to solve your complex ingestion problems.
Read less
Read more
Software
Report
Share
Report
Share
1 of 34
Download now
Download to read offline
Recommended
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
Arvind Prabhakar
Inside Flume
Inside Flume
Cloudera, Inc.
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Omid Vahdaty
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar
Flume
Flume
Chirag Ahuja
Flume and Hadoop performance insights
Flume and Hadoop performance insights
Omid Vahdaty
Apache kafka
Apache kafka
Shravan (Sean) Pabba
Apache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
Swapnil Dubey
Recommended
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
Arvind Prabhakar
Inside Flume
Inside Flume
Cloudera, Inc.
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Omid Vahdaty
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar
Flume
Flume
Chirag Ahuja
Flume and Hadoop performance insights
Flume and Hadoop performance insights
Omid Vahdaty
Apache kafka
Apache kafka
Shravan (Sean) Pabba
Apache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
Swapnil Dubey
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Yahoo Developer Network
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
Kowndinya Mannepalli
Flume in 10minutes
Flume in 10minutes
dwmclary
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
JinfengHuang3
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
Big data: Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
Christophe Marchal
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
Karthik Ramasamy
Kafka internals
Kafka internals
David Groozman
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
Michael Stack
Current and Future of Apache Kafka
Current and Future of Apache Kafka
Joe Stein
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
DataWorks Summit/Hadoop Summit
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
HBaseCon
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon
Introduction to Spark Streaming
Introduction to Spark Streaming
datamantra
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
Jay Kreps
Apache Kafka at LinkedIn
Apache Kafka at LinkedIn
Discover Pinterest
Kafka for DBAs
Kafka for DBAs
Gwen (Chen) Shapira
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
Cloudera, Inc.
SAP OS/DB Migration using Azure Storage Account
SAP OS/DB Migration using Azure Storage Account
Gary Jackson MBCS
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
Safe Software
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Data Con LA
More Related Content
What's hot
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Yahoo Developer Network
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
Kowndinya Mannepalli
Flume in 10minutes
Flume in 10minutes
dwmclary
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
JinfengHuang3
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
Big data: Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
Christophe Marchal
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
Karthik Ramasamy
Kafka internals
Kafka internals
David Groozman
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
Michael Stack
Current and Future of Apache Kafka
Current and Future of Apache Kafka
Joe Stein
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
DataWorks Summit/Hadoop Summit
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
HBaseCon
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon
Introduction to Spark Streaming
Introduction to Spark Streaming
datamantra
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
Jay Kreps
Apache Kafka at LinkedIn
Apache Kafka at LinkedIn
Discover Pinterest
Kafka for DBAs
Kafka for DBAs
Gwen (Chen) Shapira
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
Cloudera, Inc.
SAP OS/DB Migration using Azure Storage Account
SAP OS/DB Migration using Azure Storage Account
Gary Jackson MBCS
What's hot
(20)
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
Flume in 10minutes
Flume in 10minutes
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
Big data: Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
Kafka internals
Kafka internals
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
HBaseConAsia2018 Track3-4: HBase and OpenTSDB practice at Huawei
Current and Future of Apache Kafka
Current and Future of Apache Kafka
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
Introduction to Spark Streaming
Introduction to Spark Streaming
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
Apache Kafka at LinkedIn
Apache Kafka at LinkedIn
Kafka for DBAs
Kafka for DBAs
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
SAP OS/DB Migration using Azure Storage Account
SAP OS/DB Migration using Azure Storage Account
Similar to Data Aggregation at Scale with Apache Flume
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
Safe Software
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Data Con LA
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Matt Stubbs
Event Driven Architecture
Event Driven Architecture
Benjamin Joyen-Conseil
Apache Apex Meetup at Cask
Apache Apex Meetup at Cask
Apache Apex
Apache Apex - Hadoop Users Group
Apache Apex - Hadoop Users Group
Pramod Immaneni
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
Yahoo Developer Network
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application Meetup
Thomas Weise
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise Cluster
InfluxData
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
Safe Software
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
Amazon Web Services
Network characteristics of the cloud
Network characteristics of the cloud
Cloud Genius
Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016
Felix GV
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...
CA Technologies
Zero Downtime JEE Architectures
Zero Downtime JEE Architectures
Alexander Penev
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...
CA Technologies
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
R3
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedIn
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedIn
CEE-SEC(R)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Brian Brazil
When Streaming Becomes Strategic
When Streaming Becomes Strategic
MapR Technologies
Similar to Data Aggregation at Scale with Apache Flume
(20)
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Big Data LDN 2018: STREAM PROCESSING TAKES ON EVERYTHING
Event Driven Architecture
Event Driven Architecture
Apache Apex Meetup at Cask
Apache Apex Meetup at Cask
Apache Apex - Hadoop Users Group
Apache Apex - Hadoop Users Group
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
DataTorrent Presentation @ Big Data Application Meetup
DataTorrent Presentation @ Big Data Application Meetup
Best Practices for Scaling an InfluxEnterprise Cluster
Best Practices for Scaling an InfluxEnterprise Cluster
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
Network characteristics of the cloud
Network characteristics of the cloud
Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...
Infrastructure Performance Management: Flexibility Combining Breadth, Depth ...
Zero Downtime JEE Architectures
Zero Downtime JEE Architectures
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...
Stop the Blame Game with Increased Visibility of your Mobile-to-Mainframe IT ...
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
DevDay: Corda Enterprise: Journey to 1000 TPS per node, Rick Parker
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedIn
Быстрый онлайн-доступ к огромному количеству оффлайн-данных в LinkedIn
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
When Streaming Becomes Strategic
When Streaming Becomes Strategic
Recently uploaded
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
Jhone kinadey
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
Wave PLM
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
Fatema Valibhai
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Alberto González Trastoy
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
harshavardhanraghave
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽❤️🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽❤️🧑🏻 89...
gurkirankumar98700
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
Cionsystems
Software Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
Arshad QA
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
kalichargn70th171
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
Evangelist Apps https://twitter.com/EvangelistSW/
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
AxelRicardoTrocheRiq
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
bodapatigopi8531
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
SolGuruz
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
joe51371421
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
OnePlan Solutions
Professional Resume Template for Software Developers
Professional Resume Template for Software Developers
Vinodh Ram
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
Arshad QA
Recently uploaded
(20)
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽❤️🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽❤️🧑🏻 89...
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
Software Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Professional Resume Template for Software Developers
Professional Resume Template for Software Developers
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
Data Aggregation at Scale with Apache Flume
1.
© 2015 StreamSets
Inc., All rights reserved Apache Flume Data Aggregation At Scale Arvind Prabhakar
2.
© 2015 StreamSets,
Inc. Who am I? ❏ Founder/CTO Apache Software Foundation ❏ Flume - PMC Chair ❏ Sqoop - PMC Chair ❏ Storm - PMC, Committer ❏ MetaModel - Mentor ❏ Sentry - Mentor ❏ NiFi - Mentor ❏ ASF Member Previously... ❏ Cloudera ❏ Informatica Arvind Prabhakar
3.
© 2015 StreamSets,
Inc. What is Flume? Logs Files Click Streams Sensors Devices Database Logs Social Data Streams Feeds Other Raw Storage (HDFS, S3) EDW, NoSQL (Hive, Impala, HBase, Cassandra) Search (Solr, ElasticSearch) Enterprise Data Infrastructure Apache Flume is a continuous data ingestion system that is... ● open-source, ● reliable, ● scalable, ● manageable, ● customizable, ...and designed for Big Data ecosystem.
4.
© 2015 StreamSets,
Inc. ...for Big Data ecosystem? “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.” Big Data from a Data Ingestion Perspective ● Logical Data Sources are physically distributed ● Data production is continuous / never ending ● Data structure and semantics change without notice
5.
© 2015 StreamSets,
Inc. Physically Distributed Data Sources ● Many physical sources that produce data ● Number of physical sources changes constantly ● Sources may exist in different governance zones, data centers, continents...
6.
© 2015 StreamSets,
Inc. “Every two days now we create as much information as we did from the dawn of civilization up until 2003” - Eric Schmidt, 2010 Continuous Data Production ● Weather ● Traffic ● Automobiles ● Trains ● Airplanes ● Geological/Seismic ● Oceanographic ● Smart Phones ● Health Accessories ● Medical Devices ● Home Automation ● Digital Cameras ● Social Media ● Geolocation ● Shop Floor Sensors ● Network Activity ● Industry Appliances ● Security/Surveillance ● Server Workloads ● Digital Telephony ● Bio-simulations...
7.
© 2015 StreamSets,
Inc. A Notable Prediction From 2014... “Breakthrough in compression technology will succeed in convert #bigdata to regular data #2014BigDataPredictions” http://bit.ly/1x6cz9M 1:41 PM - 20 Dec 2013
8.
© 2015 StreamSets,
Inc. Ever Changing Structure of Data ● One of your data centers upgrade to IPv6 192.168.0.4 fe80::21b:21ff:fe83:90fa M0137: User {jonsmith} granted access to {accounts} M0137: [jonsmith] granted access to [sys.accounts] { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121” } { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121”, “phone”: “(408) 555-1212” } ● Application developer changes logs (again) ● JSON data may contain more attributes than expected
9.
© 2015 StreamSets,
Inc. So, from Data Ingestion Perspective: Massive collection of ever changing physical sources... Never ending data production... Continuously evolving data structures and semantics...
10.
© 2015 StreamSets,
Inc.
11.
© 2015 StreamSets,
Inc. Flume to the Rescue!
12.
© 2015 StreamSets,
Inc. Apache Flume ● Originally designed to be a log aggregation system by Cloudera Engineers ● Evolved to handle any type of streaming event data ● Low-cost of installation, operation and maintenance ● Highly customizable and extendable
13.
© 2015 StreamSets,
Inc. A Closer Look at Flume Agent Agent AgentAgentInput Destination ● Distributed Pipeline Architecture ● Optimized for commonly used data sources and destinations ● Built in support for contextual routing ● Fully customizable and extendable eg Web Logs eg HDFS
14.
© 2015 StreamSets,
Inc. Anatomy of a Flume Agent Flume Agent Source Sink Channel Incoming Data Outgoing Data Source ● Accepts incoming Data ● Scales as required ● Writes data to Channel Sink ● Removes data from Channel ● Sends data to downstream Agent or Destination Channel ● Stores data in the order received
15.
© 2015 StreamSets,
Inc. Transactional Data Exchange Flume Agent Source Sink Channel Incoming Data Outgoing Data Source TX Sink TX ● Source uses transactions to write to the channel Upstream Sink TX ● Sink uses transactions to remove data from the channel ● Sink transaction commits only after successful transfer of data ● This ensures no data loss in Flume pipeline
16.
© 2015 StreamSets,
Inc. Routing and Replicating Flume Agent Source Sink 1 Channel 1 Incoming Data Outgoing Data Channel 2 Sink 2 Outgoing Data ● Source can replicate or multiplex data across many channels ● Metadata headers can be used to do contextual selection of channels ● Channels can be drained by different sinks to different destinations or pipelines
17.
© 2015 StreamSets,
Inc. Why Channels? ● Buffers data and insulates downstream from load spikes ● Provides persistent store for data in case the process restarts ● Provides flow ordering* and transactional guarantees
18.
© 2015 StreamSets,
Inc. Use-Case: Log Aggregation
19.
© 2015 StreamSets,
Inc. Starting Out Simple ● You would like to move your web- server logs to HDFS ● Let’s assume there are only 3 web servers at the time of launch ● Ad-hoc solution will likely suffice! Challenges ● How do you manage your output paths on HDFS? ● How do you maintain your client code in face of changing environment as well as requirements?
20.
© 2015 StreamSets,
Inc. Adding a Single Flume Agent Advantages ● Insulation from HDFS downtime ● Quick offload of logs from Web Server machines ● Better Network utilization Challenges ● What if the Flume node goes down? ● Can one Flume node accommodate all load from Web Servers?
21.
© 2015 StreamSets,
Inc. Adding Two Flume Agents Advantages ● Redundancy and Availability ● Better handling of downstream failures ● Automatic load balancing and failover Challenges ● What happens when new Web Servers are added? ● Can two Flume Agents keep up with all the load from more Web Servers?
22.
© 2015 StreamSets,
Inc. Handling a Server Farm A Converging Flow ● Traffic is aggregated by Tier-2 and Tier-3 before being put into destination system ● Closer a tier is to the destination, larger the batch size it delivers downstream ● Optimized handling of destination systems
23.
© 2015 StreamSets,
Inc. Data Volume Per Agent Batch Size Variation per Agent ● Event volume is least in the outermost tier ● Event volume increases as the flow converges ● Event volume is highest in the innermost tier
24.
© 2015 StreamSets,
Inc. Data Volume Per Tier Batch Size Variation per Tier ● In steady state, all tiers carry same event volume ● Transient variations in flow are absorbed and ironed out by channels ● Load spikes are handled smoothly without overwhelming the infrastructure
25.
© 2015 StreamSets,
Inc. Planning and Sizing Flume Topology for Log-Aggregation Use-Case
26.
© 2015 StreamSets,
Inc. Planning and Sizing Your Topology What we know: ● Number of Web Servers ● Log volume per Web Server per unit time ● Destination System and layout (Routing Requirements) ● Worst case downtime for destination system What we will calculate: ● Number of tiers ● Exit Batch Sizes ● Channel capacity
27.
© 2015 StreamSets,
Inc. Calculating Number of Tiers Rule of Thumb One Aggregating Agent (A) can be used with anywhere from 4 to 16 client Agents Considerations ● Must handle projected ingest volume ● Resulting number of tiers should provide for routing, load-balancing and failover requirements Gotchas Load test to ensure that steady state and peak load are addressed with adequate failover capacity
28.
© 2015 StreamSets,
Inc. Calculating Exit Batch Size Rule of Thumb Exit batch size is same as total exit data volume divided by number of Agents in a tier Considerations Gotchas Load test fail-over scenario to ensure near steady-state drain ● Having some extra room is good ● Keep contextual routing in mind ● Consider duplication impact when batch sizes are large
29.
© 2015 StreamSets,
Inc. Calculating Channel Capacity Source Sink X Source Sink X X Rule of Thumb Equal to worst case data ingest rate sustained over the worst case downstream outage interval Considerations ● Multiple disks will yield better performance ● Channel size impacts the back-pressure buildup in the pipeline You may need more disk space than the physical footprint of the data size Gotchas
30.
© 2015 StreamSets,
Inc. To Recap Calculated with upstream to downstream Agent ratio ranging from 4:1 to 16:1. Factor in routing, failover, load-balancing requirements... Number of Tiers Calculated for steady state data volume exiting the tier, divided by number of Agents in that tier. Factor in contextual routing and duplication due to transient failure impact... Exit Batch Size Calculated as worst case ingest rate sustained over the worst case downstream downtime. Factor in number of disks used etc... Channel Capacity ...and that’s all there is to it!
31.
© 2015 StreamSets,
Inc.
32.
© 2015 StreamSets,
Inc. Some Highlights of Flume ● Flume is suitable for large volume data collection, especially when data is being produced in multiple locations ● Once planned and sized appropriately, Flume will practically run itself without any operational intervention ● Flume provides weak ordering guarantee, i.e., in the absence of failures the data will arrive in the order it was received in the Flume pipeline ● Transactional exchange ensures that Flume never loses any data in transit between Agents. Sinks use transactions to ensure data is not lost at point of ingest or terminal destinations. ● Flume has rich out-of-the box features such as contextual routing, and support for popular data sources and destination systems
33.
© 2015 StreamSets,
Inc. Things that could be better... ● Handling of poison events ● Ability to tail files ● Ability to handle preset data formats such as JSON, CSV, XML ● Centralized configuration ● Once-only delivery semantics ● ...and more Remember: patches are welcome!
34.
© 2015 StreamSets,
Inc. Thank You! Contact: ● Email: arvind at streamsets dot com ● Twitter: @aprabhakar More on Flume: ● http://flume.apache.org/ ● User Mailing List: user-subscribe@flume.apache.org ● Developer Mailing List: dev-subscribe@flume.apache.org ● JIRA: https://issues.apache.org/jira/browse/FLUME
Download now