Introduction to Flume and Flive

              July 11, 2012
               Willis Gong
       Big Data Engineering Team
              Hanborq Inc.
Topic
• Flume
  – Definition of the solution
  – Characteristics
  – Core concepts
• Flive
  – Concepts
  – Improvements



The real world problem
• Changing requirements → Extensibility & Manageability
   – In the source
   – In the path
   – In the sink
• Growing scale → Scalability
   – Volume/nodes keep increasing
• Error prone → Reliability
   – Network failure
   – Service breakdown
Flume: the solution to these problems

• Flume is:
   – A distributed data collection system
   – A streamlined event processing pipeline
   – An extensible distributed computation
     framework
• Flume answers the preceding challenges
   – Easily extends to new data formats
   – Easily adapts to new collection strategies
   – Scales linearly as new nodes are added
   – Multiple levels of reliability
   –   Configurable from shell / web
   –   Etc.
Core Concepts: Flow and Event
•   Everything is an event: a body plus a metadata table
•   A flow is an event pipeline from a particular data source
•   Flows are composed of nodes chained together
•   Many flows may overlap on a physical cluster
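
As a rough illustration of the event model above, the sketch below represents an event as a byte body plus a metadata table and passes it through a chain of nodes. The names `Event`, `run_flow`, `tag_host`, and `tag_size`, along with the host name, are hypothetical and chosen only for this example; they are not Flume's actual classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Event:
    """Hypothetical event: an opaque body plus a metadata table."""
    body: bytes
    meta: Dict[str, str] = field(default_factory=dict)


# Assumption for this sketch: a node is a function from event to event.
Node = Callable[[Event], Event]


def tag_host(event: Event) -> Event:
    event.meta["host"] = "web-01"  # made-up host name
    return event


def tag_size(event: Event) -> Event:
    event.meta["size"] = str(len(event.body))
    return event


def run_flow(source: Iterable[bytes], nodes: List[Node]) -> List[Event]:
    """Chain nodes together: every event from the source visits each node in order."""
    out: List[Event] = []
    for body in source:
        event = Event(body)
        for node in nodes:
            event = node(event)
        out.append(event)
    return out


events = run_flow([b"GET /index.html", b"GET /about"], [tag_host, tag_size])
```

Each node only reads or extends the metadata table, so nodes can be reordered or added without changing the body.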
Core Concepts: Nodes and Plane
• Data plane:
   – Path of data flow
   – Composed of one or more nodes in a tiered
     architecture
      • Two-tier: Agent → Collector
      • Multi-tier: Agent → Processor → Collector
• Nodes:
   – Nodes have a source and a sink
   – Their roles depend on their position in data path
• Masters are in the control plane
   – Central control point
   – Lightweight, since no data plane processing is involved
Core Concepts: Agent and Collector

• Data plane nodes
  – Agent
     • Receives data from an application
  – Processor (optional)
     • Performs intermediate processing
  – Collector
     • Writes data to permanent storage
Deploy Topology
• Deploy considerations
  – Agents: depend on the application data source
  – Collectors: depend on target storage, network topology,
    load balancing, etc.
Considerations on Data Source
• Three integration modes:
  – Push: the agent acts as a data collection service
    for the data source application
  – Pull: the agent polls the data source periodically
  – Embedded: the data source application is the
    agent itself
Data Plane Reliability
• Best effort
   – Fire and forget
• Store on failure + retry
   – Local acks, local errors detectable
   – Failover when faults detected
• End-to-end reliability
   – End to end acks
   – Data survives compound failures
   – At least once
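
The at-least-once behavior of the end-to-end level can be sketched as follows: the sender keeps an event until it sees an ack and retries otherwise, so a lost ack leads to a duplicate rather than a lost event. This is a minimal simulation under stated assumptions, not Flume's implementation; `FlakyCollector`, `send_at_least_once`, the drop rate, and the retry limit are all hypothetical.

```python
import random
from typing import Dict, List


class FlakyCollector:
    """Hypothetical collector behind a link that drops some messages and acks."""

    def __init__(self, drop_rate: float, seed: int = 0) -> None:
        self.received: List[str] = []
        self._rng = random.Random(seed)
        self._drop_rate = drop_rate

    def deliver(self, event_id: str) -> bool:
        # The message may be lost before it arrives...
        if self._rng.random() < self._drop_rate:
            return False
        self.received.append(event_id)
        # ...or it may arrive while the ack is lost on the way back.
        return self._rng.random() >= self._drop_rate


def send_at_least_once(events: List[str], collector: FlakyCollector,
                       max_tries: int = 50) -> Dict[str, int]:
    """Resend each event until it is acked; return the attempts used per event."""
    tries: Dict[str, int] = {}
    for event_id in events:
        for attempt in range(1, max_tries + 1):
            if collector.deliver(event_id):
                tries[event_id] = attempt
                break
        else:
            raise RuntimeError(f"gave up on {event_id}")
    return tries


collector = FlakyCollector(drop_rate=0.3, seed=42)
tries = send_at_least_once(["e1", "e2", "e3", "e4", "e5"], collector)
```

Because `collector.received` may contain the same event more than once, a downstream consumer would need to tolerate or remove duplicates.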
Control Plane Reliability
• Master design
  – Lightweight process
     • Isolated from data plane processing
  – Lazy design
     • Simply answers a few node requests
• Service availability
  – Watchdog
  – Multiple masters as backup
  – Service availability across reboots
     • Persist configuration data to ZooKeeper
Data Plane Scalability
• Data plane is horizontally scalable
   – Add collectors to increase availability and to handle more data
      • Assumes a single agent will not dominate a collector
      • Fewer connections to HDFS.
      • Larger, more efficient writes to HDFS.
• Agents have mechanisms for machine resource tradeoffs
   – Write log locally to avoid collector disk IO bottleneck and catastrophic
     failures
   – Compression and batching (trade CPU for network)
   – Push computation into the event collection pipeline (balance IO, Mem,
     and CPU resource bottlenecks)
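
The compression-and-batching tradeoff can be illustrated with a short sketch: events are grouped into fixed-size batches and each batch is compressed before it goes on the wire, spending CPU to save network bytes. The batch size, the JSON framing, and the helper names `batches`, `encode_batch`, and `decode_batch` are assumptions made for this example; the slides do not specify Flume's wire format.

```python
import json
import zlib
from typing import Iterable, Iterator, List


def batches(events: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group events into lists of at most `size` items."""
    batch: List[str] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


def encode_batch(batch: List[str]) -> bytes:
    """Serialize a batch as JSON and compress it (assumed framing)."""
    return zlib.compress(json.dumps(batch).encode("utf-8"))


def decode_batch(payload: bytes) -> List[str]:
    """Inverse of encode_batch, as a collector would apply it."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


# Made-up, repetitive log lines of the kind that compress well.
lines = [f"127.0.0.1 - - GET /page/{i % 10} 200" for i in range(1000)]
payloads = [encode_batch(b) for b in batches(lines, size=100)]
raw_bytes = sum(len(line.encode("utf-8")) for line in lines)
wire_bytes = sum(len(p) for p in payloads)
```

Comparing `raw_bytes` with `wire_bytes` shows the network saving for this input; the actual ratio depends on how repetitive the data is.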
Data Plane Scalability
• Agents are logically partitioned and send to different
  collectors
• Use randomization to pre-specify failovers when many
  collectors exist
  – Spreads load if a collector goes down.
  – Spreads load if new collectors are added to the system.
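
One way to read "use randomization to pre-specify failovers" is that each agent gets its own randomly ordered list of collectors and uses the first one that is alive. The sketch below follows that reading; `failover_chains`, `pick_collector`, and the agent and collector names are hypothetical, and the real assignment logic in Flume may differ.

```python
import random
from typing import Dict, List, Set


def failover_chains(agents: List[str], collectors: List[str],
                    seed: int = 0) -> Dict[str, List[str]]:
    """Give every agent its own randomly ordered list of all collectors."""
    rng = random.Random(seed)
    chains: Dict[str, List[str]] = {}
    for agent in agents:
        chain = list(collectors)
        rng.shuffle(chain)
        chains[agent] = chain
    return chains


def pick_collector(chain: List[str], alive: Set[str]) -> str:
    """Use the first collector in the chain that is currently alive."""
    for collector in chain:
        if collector in alive:
            return collector
    raise RuntimeError("no collector available")


agents = [f"agent-{i}" for i in range(12)]
collectors = ["c1", "c2", "c3"]
chains = failover_chains(agents, collectors)

# Normal operation, then the same agents after collector c1 goes down.
before = {a: pick_collector(chains[a], {"c1", "c2", "c3"}) for a in agents}
after = {a: pick_collector(chains[a], {"c2", "c3"}) for a in agents}
```

Since the second entry in each chain is also random, agents that lose `c1` tend to be divided among the remaining collectors instead of all moving to the same one.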
Control Plane Scalability
• A master controls dynamic configurations of nodes
  – Uses a gossip protocol to keep state consistent
  – Scales well for configuration reads
  – Allows for adaptive repartitioning in the future
  – Nodes can talk to any master.
Extensibility
• Extensibility answers changing use cases
   – Invent new connectors
      • Simple source/sink/decorator APIs
      • Plug-in architecture
   – Dynamically wired pipeline processing logic
      • Many simple operations compose into complex behavior
• Connector
   – Sources produce data: plain text files, directory, Log4j, FTP, SQL, …
   – Sinks consume data: console, HDFS, local file system
   – Decorators modify data sent to sinks
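
The source/sink/decorator split can be sketched with plain functions: a source yields events, a sink consumes them, and a decorator wraps a sink to change what reaches it. This is an illustration of the composition idea only; `list_source`, `filter_decorator`, `map_decorator`, and `drive` are hypothetical names, not Flume's plug-in API.

```python
from typing import Callable, Iterable, Iterator, List

Sink = Callable[[str], None]


def list_source(lines: Iterable[str]) -> Iterator[str]:
    """Source: produces events, here from an in-memory list."""
    yield from lines


def filter_decorator(predicate: Callable[[str], bool], sink: Sink) -> Sink:
    """Decorator: forwards only the events that match the predicate."""
    def decorated(event: str) -> None:
        if predicate(event):
            sink(event)
    return decorated


def map_decorator(fn: Callable[[str], str], sink: Sink) -> Sink:
    """Decorator: transforms each event before it reaches the sink."""
    def decorated(event: str) -> None:
        sink(fn(event))
    return decorated


def drive(source: Iterable[str], sink: Sink) -> None:
    """Pull every event from the source and push it into the sink."""
    for event in source:
        sink(event)


collected: List[str] = []  # sink: consumes events, here by appending to a list
pipeline = filter_decorator(lambda e: "ERROR" in e,
                            map_decorator(str.upper, collected.append))
drive(list_source(["info: ok", "ERROR: disk full", "debug: x"]), pipeline)
```

Two small decorators stacked on one sink already give "filter, then transform, then store", which is the sense in which simple operations compose into complex behavior.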
Extensibility
• Example
Manageability
• Near-natural-language syntax for node configuration
   – web-log-agent : tail("/var/log/httpd.log") | agentBESink
   – web-log-collector : autoCollectorSource
     | { regex("(Firefox|Internet Explorer)", "browser") =>
     collectorSink("hdfs://namenode/flume-logs/%{browser}") }
• One place to specify node sources, sinks and data flows
   – Basic Web interface
   – Flume Shell – command line interface
   – Extended custom management through the master RPC API
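
To show what the collector line in the configuration example is meant to achieve, the sketch below extracts a `browser` attribute from an event body with the same regular expression and then expands `%{browser}` in the output path. It mimics the idea in plain Python and is not Flume's implementation; `extract`, `expand`, the sample log line, and the `"unknown"` fallback are assumptions.

```python
import re
from typing import Dict


def extract(pattern: str, attr: str, body: str, attrs: Dict[str, str]) -> Dict[str, str]:
    """Store the first capture group of `pattern` under `attr`, if it matches."""
    match = re.search(pattern, body)
    if match:
        attrs[attr] = match.group(1)
    return attrs


def expand(template: str, attrs: Dict[str, str]) -> str:
    """Replace each %{name} with the event attribute of that name."""
    # Fallback value "unknown" is an assumption for events without the attribute.
    return re.sub(r"%\{(\w+)\}",
                  lambda m: attrs.get(m.group(1), "unknown"),
                  template)


body = "GET / HTTP/1.1 Mozilla/5.0 Firefox/13.0"  # made-up log line
attrs = extract(r"(Firefox|Internet Explorer)", "browser", body, {})
path = expand("hdfs://namenode/flume-logs/%{browser}", attrs)
```

With this input, `path` ends in `/Firefox`, so events are bucketed into per-browser directories.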
Flive – HANBORQ Enhanced Flume
• Based on Flume, but oriented toward the HANBORQ product
  ecosystem
• The new HTLoad
• Enhancements:
  – Performance
  – Functionality
  – Manageability
  – Hugetable integration
• Compatible with original Flume usage

Flive – More Than Flume
• Efficiency improvement
  – Driving the pipeline
     • The native driver is a single thread doing both source-pulling and sink-pushing
        – Transient rate mismatches between source and sink affect each other
     • Flive uses two threads, one source-pulling and one sink-pushing,
       coupled by an internal event queue
        – Transient rate variations in source and sink are absorbed by the queue
        – Contributes a 10%~30% throughput improvement
  – Introduced node concurrency to maximize use of target storage
    bandwidth
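
The two-thread driver described above can be sketched with a bounded queue between a puller thread and a pusher thread. This is a minimal sketch of the pattern, not Flive's code; `run_decoupled`, the sentinel object, and the queue capacity are assumptions, and the sketch makes no claim about throughput.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel that tells the pusher the source is exhausted


def run_decoupled(source: Iterable[int], sink: Callable[[int], None],
                  capacity: int = 1024) -> None:
    """Run source-pulling and sink-pushing in separate threads joined by a queue."""
    events: queue.Queue = queue.Queue(maxsize=capacity)

    def pull() -> None:
        for event in source:
            events.put(event)  # blocks only when the queue is full
        events.put(_DONE)

    def push() -> None:
        while True:
            event = events.get()  # blocks only when the queue is empty
            if event is _DONE:
                break
            sink(event)

    threads = [threading.Thread(target=pull), threading.Thread(target=push)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


delivered: List[int] = []
run_decoupled(iter(range(10_000)), delivered.append, capacity=64)
```

A short stall on one side only fills or drains the queue; the other side keeps working until the queue is full or empty, which is how the queue absorbs rate variations.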
Flive – More Than Flume
• Functionality enhancement
  – The native Flume connector configuration syntax is flat
     • But connectors are essentially hierarchical
     • The flat syntax limits connectors to flat assembly
     • Connector hierarchies must be assembled through hard-coding or ad hoc syntax
  – Flive introduced a hierarchical syntax
     • A hierarchical connector architecture can be dynamically wired
     • For backward compatibility, only Flive connectors support the enhanced
       syntax
Flive – More Than Flume
• Ease of use
  – Zero-configure plug-in architecture
     • Native Flume requires manual configuration of plug-ins
     • Flive requires no configuration, only a few minimal conventions
  – Simpler yet powerful Flive shell
  – Introduced the translator framework
     • Node configuration specs may be too complicated to edit manually
     • A translator converts a user-domain spec into a Flive/Flume configuration
       spec
     • Extensible
         – Hugetable translator for Hugetable
         – Basic translator for native Flume, with full Flume compatibility
  – Ease of deployment and management
Flive – More Than Flume
• As a Hugetable ETL
   – Sourcing structured data from various sources
      • FS, FTP, SQL, LOG4J, …
   – Targeting all Hugetable storage engines
      • Text File, Sequence File, RCFile, HFile, HBase,…
   – Filtering unwanted/malformed records
   – Column transfer over the air
      • IUD-like single-stream column operations, based on function expressions
      • Multi-stream operations: pre-join on the fly
   – Multi table loading
      • Like fan-out but less overhead
   – Real time aggregation
      • Accurate computation: sum(x), count(*)
      • Probabilistic computation: count(distinct x), top(k), etc.
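
The difference between exact and approximate streaming aggregates can be sketched as follows. The slides do not say which algorithms Flive uses, so the Misra-Gries heavy-hitters summary below is a swapped-in, well-known technique used purely for illustration of an approximate top(k); `RunningStats` and `misra_gries` are hypothetical names.

```python
from typing import Dict, Hashable, Iterable


class RunningStats:
    """Exact aggregates that need only constant state: count(*) and sum(x)."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def add(self, x: float) -> None:
        self.count += 1
        self.total += x


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Approximate heavy hitters using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n
    is guaranteed to be among the returned keys; the counts are lower bounds.
    """
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stats = RunningStats()
for value in [1, 2, 3, 4]:
    stats.add(value)

stream = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(20)]
summary = misra_gries(stream, k=3)
```

The exact aggregates keep one number each, while the summary keeps a fixed number of counters regardless of how many distinct values pass through, at the cost of undercounting.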
Runtime Flive
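   • Diagram: the runtime data path from agent to collector.
      – Agent (Flume driver): DataSource, Tailer, A-puller, A-pusher, queues Q1 and Q2
      – Network link between agent and collector
      – Collector (Flume driver): T-server, C-puller, C-pusher, Decoder (multi-threaded decoding),
        Driver, Appender (multi-threaded append), queues Q3 to Q7
      – Targets: HBase, HDFS, others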
Thank you!

Introduction to Flume and Flive: Distributed Data Streaming Solutions

  • 1. Introduction to Flume and Flive July 11, 2012 Willis Gong Big Data Engineering Team Hanborq Inc.
  • 2. Topic • Flume – Definition of the solution – Characteristics – Core concepts • Flive – Concepts – Improvements 2
  • 3. The real world problem • Changing requirements Extensibility & Manageability – In the source – In the path – In the sink • Growing scales  Scalability – Volume/nodes keep increasing • Error prone  Reliability – Network failure – Service breakdown
  • 4. Flume: the solution to these problems • Flume is: – A distributed data collection system – A streamlined event processing pipeline – A extensible distributed computation framework • Flume answers previous challenges – Easily extends to new data formats – Easily adapts new collecting strategies – Scales linearly as new node added – Multi level of reliability – Configurable from shell / web – Etc.
  • 5. Core Concepts: Flow and Event • Everything is event – body + meta table • A flow is a event pipeline from a particular data source • Flows are comprised of nodes chained together • Many flows may overlap a physical cluster
  • 6. Core Concepts: Nodes and Plane • Data plane: – Path of data flow – Composited by one or more node in a tiered architecture • Two-tier: Agent  Collector • Multi-tier: Agent  Processor  Collector • Nodes: – Nodes have a source and a sink – Their roles depend on their position in data path • Masters are in the control plane – Central control point – Light weighted since no data plane processing involved
  • 7. Core Concepts: Agent and Collector • Data plane nodes – Agent • Receives data from an application – Processor (optional) • Intermediate processing – Collector • Writes data to permanent storage
  • 8. Deployment Topology • Deployment considerations – Agents: depend on the application data source – Collectors: depend on target storage, network topology, load balancing, etc.
  • 9. Considerations on Data Source • Three integration modes: – Push: the agent acts as a data collection service for the data source application – Pull: the agent polls the data source periodically – Embedded: the data source application is itself the agent
  • 10. Data Plane Reliability • Best effort – Fire and forget • Store on failure + retry – Local acks; local errors are detectable – Failover when faults are detected • End-to-end reliability – End-to-end acks – Data survives compound failures – At-least-once delivery
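The "store on failure + retry" level can be illustrated with a small sketch. This is not Flume's implementation: `StoreOnFailureSink` is a hypothetical name, the in-memory deque stands in for a local on-disk log, and the downstream `send` function is assumed to raise `ConnectionError` when delivery fails.

```python
from collections import deque
from typing import Callable, Deque


class StoreOnFailureSink:
    """Hypothetical sink wrapper: keep events locally until the send succeeds."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        self._send = send                      # assumed to raise ConnectionError on failure
        self._pending: Deque[bytes] = deque()  # stand-in for a local on-disk log

    def append(self, event: bytes) -> None:
        self._pending.append(event)  # store first, so a failed send loses nothing
        self.flush()

    def flush(self) -> int:
        """Deliver buffered events in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break                # keep the event for a later retry
            self._pending.popleft()  # drop only after the send returned (local ack)
            delivered += 1
        return delivered

    @property
    def backlog(self) -> int:
        return len(self._pending)
```

Because an event is dropped only after the send returns, a send that succeeds downstream but fails before returning would be retried, which is why this style of retry gives at-least-once rather than exactly-once delivery.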
  • 11. Control Plane Reliability • Master design – Lightweight process • Isolated from data plane processing – Lazy design • Simply answers a few node requests • Service availability – Watchdog – Multiple masters as backup – Service availability across reboots • Persist configuration data to ZooKeeper
  • 12. Data Plane Scalability • Data plane is horizontally scalable – Add collectors to increase availability and to handle more data • Assumes a single agent will not dominate a collector • Fewer connections to HDFS • Larger, more efficient writes to HDFS • Agents have mechanisms for machine resource tradeoffs – Write the log locally to avoid collector disk IO bottlenecks and catastrophic failures – Compression and batching (trade CPU for network) – Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
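The batching and compression tradeoff can be sketched as follows. This is an illustration only: the helper names `batch`, `pack` and `unpack` are hypothetical, gzip is chosen here as an example codec, and events are assumed to contain no newline bytes.

```python
import gzip
from typing import Iterable, Iterator, List


def batch(events: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Group events into lists of at most `size` items."""
    current: List[bytes] = []
    for event in events:
        current.append(event)
        if len(current) == size:
            yield current
            current = []
    if current:
        yield current  # final, possibly shorter batch


def pack(events: List[bytes]) -> bytes:
    """Join a batch with newlines and gzip it (assumes no newline inside an event)."""
    return gzip.compress(b"\n".join(events))


def unpack(payload: bytes) -> List[bytes]:
    """Inverse of pack, as a receiving node might apply it."""
    return gzip.decompress(payload).split(b"\n")
```

Sending one packed batch instead of many single events spends CPU on the agent to reduce the number and size of network writes.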
  • 13. Data Plane Scalability • Agents are logically partitioned and send to different collectors • Use randomization to pre-specify failovers when many collectors exist – Spreads load if a collector goes down – Spreads load if new collectors are added to the system
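One way to read "use randomization to pre-specify failovers" is that each agent receives its own randomly ordered list of collectors ahead of time. The sketch below assumes that reading; `failover_chains` and its parameters are hypothetical names, not part of Flume.

```python
import random
from typing import Dict, List


def failover_chains(agents: List[str], collectors: List[str],
                    depth: int, seed: int = 0) -> Dict[str, List[str]]:
    """Give each agent a randomly ordered list of `depth` distinct collectors.

    The first entry is the agent's primary collector; the rest are tried
    in order when earlier ones fail.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    return {agent: rng.sample(collectors, depth) for agent in agents}
```

Because the second choices are drawn independently per agent, the agents that share a failed primary collector tend to fall back to different collectors instead of all moving to the same one.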
  • 14. Control Plane Scalability • A master controls dynamic configurations of nodes – Uses a gossip protocol to keep state consistent – Scales well for configuration reads – Allows for adaptive repartitioning in the future – Nodes can talk to any master
  • 15. Extensibility • Extensibility answers changing use cases – Invent new connectors • Simple source/sink/decorator APIs • Plug-in architecture – Dynamically wired pipeline processing logic • Many simple operations compose into complex behavior • Connectors – Sources produce data: plain text files, directory, Log4j, FTP, SQL, … – Sinks consume data: console, HDFS, local file system – Decorators modify data sent to sinks
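The source/decorator/sink split can be sketched in a few lines. This is a conceptual illustration rather than Flume's Java API: the `Event` class mirrors the "body + meta table" description from the Flow and Event slide, and `text_source`, `drop_blank`, `tag` and `run` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Event:
    body: str
    meta: Dict[str, str] = field(default_factory=dict)


Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def text_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: produce one event per input line."""
    for line in lines:
        yield Event(body=line)


def drop_blank(events: Iterator[Event]) -> Iterator[Event]:
    """Decorator: filter out events whose body is empty or whitespace."""
    return (e for e in events if e.body.strip())


def tag(key: str, value: str) -> Decorator:
    """Decorator factory: attach a fixed metadata attribute to every event."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for e in events:
            e.meta[key] = value
            yield e
    return decorate


def run(source: Iterator[Event], decorators: List[Decorator],
        sink: Callable[[Event], None]) -> None:
    """Wire source -> decorators -> sink and drain the stream."""
    stream = source
    for decorate in decorators:
        stream = decorate(stream)
    for event in stream:
        sink(event)
```

For example, `run(text_source(lines), [drop_blank, tag("host", "web01")], print)` chains two simple operations in front of a console-style sink, which is the "many simple operations compose" idea in miniature.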
  • 17. Manageability • Near-natural language for node configuration – web-log-agent : tail("/var/log/httpd.log") | agentBESink – web-log-collector : autoCollectorSource | { regex("(Firefox|Internet Explorer)", "browser") => collectorSink("hdfs://namenode/flume-logs/%{browser}") } • One place to specify node sources, sinks and data flows – Basic web interface – Flume Shell – command line interface – Extended custom management through the master RPC API
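The collector spec above tags each event with a `browser` attribute and then uses it in the output path. The sketch below mimics that idea in plain Python; it is not Flume code. `extract` and `expand` are hypothetical helpers, storing the first capture group is an assumption about the regex decorator, and replacing a missing attribute with an empty string is a choice made here for illustration.

```python
import re
from typing import Dict


def extract(pattern: str, attr: str, body: str, meta: Dict[str, str]) -> None:
    """If the pattern matches the body, store the first capture group under `attr`."""
    match = re.search(pattern, body)
    if match:
        meta[attr] = match.group(1)


def expand(template: str, meta: Dict[str, str]) -> str:
    """Replace each %{name} with meta[name] (assumption: missing names become '')."""
    return re.sub(r"%\{(\w+)\}", lambda m: meta.get(m.group(1), ""), template)
```

With a log line containing `Firefox/12.0`, `extract` sets `meta["browser"]` to `Firefox`, and `expand("hdfs://namenode/flume-logs/%{browser}", meta)` yields a per-browser output directory.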
  • 18. Flive – HANBORQ Enhanced Flume • Based on Flume, but oriented toward the HANBORQ product ecosystem • The new HTLoad • Enhancements: – Performance – Functionality – Manageability – Hugetable integration • Compatible with original Flume usage
  • 19. Flive – More Than Flume • Efficiency improvement – Driving the pipeline • The native driver is a single thread doing both source pulling and sink pushing – Temporary rate mismatches between source and sink affect each other • Flive uses two threads, one pulling from the source and one pushing to the sink, coupled by an internal event queue – Temporary rate variations in source and sink are absorbed by the queue – Contributes a 10% to 30% throughput improvement – Introduced node concurrency to maximize target storage bandwidth
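The two-thread driver can be sketched with a bounded queue between a pulling thread and a pushing thread. This is a minimal illustration of the idea, not Flive's code: `drive`, `capacity` and the sentinel are hypothetical, and error handling (for example a sink that raises) is left out.

```python
import queue
import threading
from typing import Callable, Iterable

_DONE = object()  # sentinel marking the end of the stream


def drive(source: Iterable[bytes], sink: Callable[[bytes], None],
          capacity: int = 1024) -> None:
    """Run source pulling and sink pushing in separate threads."""
    buf: "queue.Queue[object]" = queue.Queue(maxsize=capacity)

    def pull() -> None:
        for event in source:
            buf.put(event)   # blocks only while the queue is full
        buf.put(_DONE)

    def push() -> None:
        while True:
            event = buf.get()  # blocks only while the queue is empty
            if event is _DONE:
                return
            sink(event)

    threads = [threading.Thread(target=pull), threading.Thread(target=push)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With a single puller and a single pusher the FIFO queue preserves event order, and a short stall on one side only delays the other once the queue is completely full or completely empty.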
  • 20. Flive – More Than Flume • Functionality enhancement – Native Flume connector configuration spec syntax is flat • But connectors are essentially hierarchical • The limited flat syntax also restricts connectors to flat assembly • Connector hierarchies must be assembled through hard-coding or ad-hoc syntax – Flive introduced hierarchical syntax • A hierarchical connector architecture can be dynamically wired • For backward compatibility, only Flive connectors support the enhanced syntax
  • 21. Flive – More Than Flume • Ease of use – Zero-configuration plug-in architecture • Native Flume requires manual configuration of plug-ins • Flive requires no configuration, only minimal conventions – Simpler yet powerful Flive shell – Introduced the translator framework • Node configuration specs may be too complicated to edit manually • A translator converts a user-domain spec into a Flive/Flume configuration spec • Extensible – Hugetable translator for Hugetable – Basic translator for native Flume – full Flume compatibility – Ease of deployment and management
  • 22. Flive – More Than Flume • As a Hugetable ETL – Sourcing structured data from various sources • FS, FTP, SQL, LOG4J, … – Targeting all Hugetable storage engines • Text File, Sequence File, RCFile, HFile, HBase, … – Filtering unwanted/malformed records – Column transfer over the air • IUD-like single-stream column ops: based on function expressions • Multi-stream ops: pre-join on the fly – Multi-table loading • Like fan-out but with less overhead – Real-time aggregation • Accurate computation: sum(x), count(*) • Probabilistic computation: count(distinct x), top(k), etc.
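The slides do not say which algorithms Flive uses for probabilistic aggregation. As one illustration of a streaming `count(distinct x)` estimate, the sketch below uses the well-known k-minimum-values technique; `estimate_distinct` and `_unit_hash` are hypothetical names, and SHA-1 is used here only as a convenient uniform hash.

```python
import hashlib
import heapq
from typing import Iterable, List, Set


def _unit_hash(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def estimate_distinct(stream: Iterable[str], k: int = 256) -> float:
    """Estimate the number of distinct items from the k smallest hash values."""
    kept: Set[float] = set()   # the distinct hash values currently kept
    heap: List[float] = []     # the same values, negated, as a max-heap
    for item in stream:
        h = _unit_hash(item)
        if h in kept:
            continue                       # repeated item, already counted
        if len(heap) < k:
            kept.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:                 # smaller than the current k-th smallest
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))            # fewer than k distinct items: exact
    return (k - 1) / -heap[0]              # k-th smallest value gives the estimate
```

Memory stays bounded by `k` no matter how long the stream runs, at the cost of a relative error that shrinks roughly as one over the square root of `k`.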
  • 23. Runtime Flive • [Diagram of the Flive runtime across agent and collector. Recoverable labels: DataSource, Tailer, A-puller, A-pusher, T-server, C-puller, Decoder (multi-threaded decoding), C-pusher, Appender (multi-threaded append), Driver, Flume Driver, queues Q1 to Q7, network; targets: HBase, HDFS, others.]