Introduction to Flume and Flive: Distributed Data Streaming Solutions

July 11, 2012
Willis Gong
Big Data Engineering Team
Hanborq Inc.

Topic
• Flume
– Definition of the solution
– Characteristics
– Core concepts
• Flive
– Concepts
– Improvements

The real world problem
• Changing requirements → Extensibility & Manageability
– In the source
– In the path
– In the sink
• Growing scale → Scalability
– Volume/nodes keep increasing
• Error-prone → Reliability
– Network failure
– Service breakdown

Flume: the solution to these problems

• Flume is:
– A distributed data collection system
– A streamlined event processing pipeline
– An extensible distributed computation framework
• Flume answers the previous challenges
– Easily extends to new data formats
– Easily adapts to new collection strategies
– Scales linearly as new nodes are added
– Multiple levels of reliability
– Configurable from shell / web
– Etc.

Core Concepts: Flow and Event
• Everything is an event: body + metadata table
• A flow is an event pipeline from a particular data source
• Flows are composed of nodes chained together
• Many flows may overlap on one physical cluster
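
As a rough illustration of the event and flow model above, the sketch below represents an event as a byte body plus a metadata table and passes it through one chained stage. The names `Event`, `tag_host`, and `run_flow` are assumptions made for this example; they are not Flume's Java API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List


@dataclass
class Event:
    """Hypothetical event: an opaque body plus a key/value metadata table."""
    body: bytes
    meta: Dict[str, str] = field(default_factory=dict)


def tag_host(events: Iterable[Event], host: str) -> Iterator[Event]:
    """One pipeline stage: annotate the metadata, leave the body untouched."""
    for ev in events:
        ev.meta["host"] = host
        yield ev


def run_flow(lines: List[str], host: str) -> List[Event]:
    """A flow: events from one source pass through chained stages to a sink.

    The sink here is just a Python list, standing in for real storage.
    """
    source = (Event(body=line.encode("utf-8")) for line in lines)
    return list(tag_host(source, host))
```

Calling `run_flow(["GET /", "GET /about"], "web01")` yields two events whose metadata carries the originating host.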

Core Concepts: Nodes and Plane
• Data plane:
– Path of data flow
– Composed of one or more nodes in a tiered architecture
• Two-tier: Agent → Collector
• Multi-tier: Agent → Processor → Collector
• Nodes:
– Nodes have a source and a sink
– Their roles depend on their position in data path
• Masters are in the control plane
– Central control point
– Lightweight, since no data plane processing is involved

Core Concepts: Agent and Collector

• Data plane nodes
– Agent
• Receives data from an application
– Processor (optional)
• Intermediate processing
– Collector
• Writes data to permanent storage

Deployment Topology
• Deployment considerations
– Agents: depend on the application data source
– Collectors: depend on target storage, network topology, load balancing, etc.

Considerations on Data Source
• Three integration modes:
– Push: the agent acts as a data collection service for the data source application
– Pull: the agent polls the data source periodically
– Embedded: the data source application is itself the agent

Data Plane Reliability
• Best effort
– Fire and forget
• Store on failure + retry
– Local acks, local errors detectable
– Failover when faults detected
• End-to-end reliability
– End to end acks
– Data survives compound failures
– At least once
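
To make the difference between the first two reliability levels concrete, here is a minimal sketch. It assumes an in-memory buffer (a real agent would persist pending events to local disk), and `FlakySink`, `send_best_effort`, and `StoreOnFailureSender` are hypothetical names, not Flume classes.

```python
from collections import deque
from typing import Deque, List


class FlakySink:
    """Hypothetical downstream sink that fails on its first `failures` calls."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.received: List[bytes] = []

    def append(self, body: bytes) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated network failure")
        self.received.append(body)


def send_best_effort(sink: FlakySink, body: bytes) -> None:
    """Fire and forget: a failed send is silently dropped."""
    try:
        sink.append(body)
    except ConnectionError:
        pass


class StoreOnFailureSender:
    """Keep events that fail to send and retry them later, oldest first."""

    def __init__(self, sink: FlakySink) -> None:
        self.sink = sink
        self.pending: Deque[bytes] = deque()

    def send(self, body: bytes) -> None:
        self.pending.append(body)
        self.flush()

    def flush(self) -> None:
        while self.pending:
            try:
                self.sink.append(self.pending[0])
            except ConnectionError:
                return  # keep the event; a later flush retries it
            self.pending.popleft()  # drop only after the sink accepted it
```

Because an event is removed only after the sink accepts it, a retry after a lost acknowledgement could deliver the same event twice, which is the usual cost of at-least-once delivery.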

Control Plane Reliability
• Master design
– Lightweight process
• Isolated from data plane processing
– Lazy design
• Simply answers a few node requests
• Service availability
– Watchdog
– Multiple masters for backup
– Service availability across reboots
• Configuration data persisted to ZooKeeper

Data Plane Scalability
• Data plane is horizontally scalable
– Add collectors to increase availability and to handle more data
• Assumes a single agent will not dominate a collector
• Fewer connections to HDFS.
• Larger more efficient writes to HDFS.
• Agents have mechanisms for machine resource tradeoffs
– Write logs locally to avoid collector disk I/O bottlenecks and catastrophic failures
– Compression and batching (trade CPU for network)
– Push computation into the event collection pipeline (balance I/O, memory, and CPU resource bottlenecks)
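
The "compression and batching" tradeoff can be sketched as follows. This is an illustration only: the newline framing and the function names `batch`, `pack`, and `unpack` are assumptions, not Flume's wire format.

```python
import gzip
from typing import Iterable, Iterator, List


def batch(bodies: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Group event bodies into lists of at most `size` items."""
    current: List[bytes] = []
    for body in bodies:
        current.append(body)
        if len(current) == size:
            yield current
            current = []
    if current:  # emit the final, possibly short, batch
        yield current


def pack(bodies: List[bytes]) -> bytes:
    """Join a batch with newlines and gzip it.

    Assumes no body contains a newline; a real format needs length prefixes.
    """
    return gzip.compress(b"\n".join(bodies))


def unpack(payload: bytes) -> List[bytes]:
    """Inverse of pack, as a collector might do on receipt."""
    return gzip.decompress(payload).split(b"\n")
```

The agent spends CPU in `pack` so that fewer, smaller payloads cross the network; repetitive log lines typically compress well.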

Data Plane Scalability
• Agents are logically partitioned and send to different collectors
• Use randomization to pre-specify failovers when many collectors exist
– Spreads load if a collector goes down
– Spreads load if new collectors are added to the system
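
One way to picture randomized, pre-specified failover is to give every agent its own shuffled ordering of the collectors. The sketch below assumes that scheme; `failover_chains` and `pick_collector` are hypothetical helpers, and the source does not describe how the master actually computes the chains.

```python
import random
from typing import Dict, List, Set


def failover_chains(agents: List[str], collectors: List[str],
                    seed: int = 0) -> Dict[str, List[str]]:
    """Give each agent its own shuffled ordering of all collectors.

    The first entry is the primary; the rest are tried in order on failure.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    chains: Dict[str, List[str]] = {}
    for agent in agents:
        chain = list(collectors)
        rng.shuffle(chain)
        chains[agent] = chain
    return chains


def pick_collector(chain: List[str], down: Set[str]) -> str:
    """Return the first collector in the chain that is not known to be down."""
    for collector in chain:
        if collector not in down:
            return collector
    raise RuntimeError("no collector available")
```

Because each chain is shuffled independently, the agents of a failed collector fall back to different second choices instead of all moving to the same one.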

Control Plane Scalability
• A master controls dynamic configurations of nodes
– Uses gossip protocol to keep state consistent
– Scales well for configuration reads
– Allows for adaptive repartitioning in the future
– Nodes can talk to any master.

Extensibility
• Extensibility answers changing use cases
– Write new connectors
• Simple source/sink/decorator APIs
• Plug-in architecture
– Dynamically wired pipeline processing logic
• Many simple operations compose into complex behavior
• Connectors
– Sources produce data: plain text files, directories, Log4j, FTP, SQL, …
– Sinks consume data: console, HDFS, local file system
– Decorators modify data sent to sinks
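
The source/sink/decorator idea can be sketched with plain functions. The type aliases and helper names below (`text_source`, `list_sink`, `upper`, `drop_blank`, `wire`) are assumptions for illustration and do not mirror Flume's Java interfaces.

```python
from typing import Callable, Iterable, List

Source = Callable[[], Iterable[str]]   # produces events
Sink = Callable[[str], None]           # consumes one event
Decorator = Callable[[Sink], Sink]     # wraps a sink and returns a new sink


def text_source(lines: List[str]) -> Source:
    """Source that replays a fixed list of lines."""
    return lambda: iter(lines)


def list_sink(out: List[str]) -> Sink:
    """Sink that appends each event to a list."""
    return out.append


def upper(sink: Sink) -> Sink:
    """Decorator: upper-case each event before passing it on."""
    def wrapped(event: str) -> None:
        sink(event.upper())
    return wrapped


def drop_blank(sink: Sink) -> Sink:
    """Decorator: discard events that are empty or whitespace only."""
    def wrapped(event: str) -> None:
        if event.strip():
            sink(event)
    return wrapped


def wire(source: Source, decorators: List[Decorator], sink: Sink) -> None:
    """Wrap the sink so events pass through decorators in list order, then run."""
    for deco in reversed(decorators):
        sink = deco(sink)
    for event in source():
        sink(event)
```

For example, `wire(text_source(["a", " ", "b"]), [drop_blank, upper], list_sink(out))` leaves `out` holding the two non-blank lines in upper case.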

Manageability
• Near-natural language for node configuration
– web-log-agent : tail("/var/log/httpd.log") | agentBESink
– web-log-collector : autoCollectorSource
| { regex("(Firefox|Internet Explorer)", "browser") =>
collectorSink("hdfs://namenode/flume-logs/%{browser}") }
• One place to specify node sources, sinks and data flows
– Basic web interface
– Flume Shell – command-line interface
– Extended custom management through the master RPC API
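
The collector configuration above tags each event with a `browser` attribute and then uses it in the output path. A minimal sketch of that idea follows; `extract` and `expand` are hypothetical helpers, and the `"unknown"` fallback for a missing attribute is an assumption, not documented Flume behavior.

```python
import re
from typing import Dict

ESCAPE = re.compile(r"%\{(\w+)\}")  # matches escapes such as %{browser}


def extract(pattern: str, attr: str, body: str, meta: Dict[str, str]) -> None:
    """Store the first capture group of `pattern` in the metadata under `attr`."""
    match = re.search(pattern, body)
    if match:
        meta[attr] = match.group(1)


def expand(template: str, meta: Dict[str, str]) -> str:
    """Replace each %{name} escape with the matching metadata value."""
    return ESCAPE.sub(lambda m: meta.get(m.group(1), "unknown"), template)
```

With the body `"Mozilla/5.0 Firefox/13.0"`, `extract("(Firefox|Internet Explorer)", "browser", body, meta)` sets `meta["browser"]`, and `expand` then routes the event to a per-browser directory.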

Flive – HANBORQ Enhanced Flume
• Based on Flume, but oriented toward the HANBORQ product ecosystem
• The new HTLoad
• Enhancements:
– Performance
– Functionality
– Manageability
– Hugetable integration
• Compatible with original Flume usage

Flive – More Than Flume
• Efficiency improvement
– Driving the pipeline
• The native driver is a single thread that does both source pulling and sink pushing
– Transient rate mismatches between source and sink stall each other
• Flive uses two threads, one pulling from the source and one pushing to the sink, coupled by an internal event queue
– Transient rate variances in source and sink are absorbed by the queue
– Contributes a 10% to 30% throughput improvement
– Introduced node concurrency to maximize target storage bandwidth
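
The two-thread driver described above can be sketched with a bounded queue between a puller and a pusher. This is an illustration under assumptions: `drive`, the sentinel `_DONE`, and the default capacity are invented for the example, and error handling is omitted.

```python
import queue
import threading
from typing import Callable, Iterable

_DONE = object()  # sentinel that tells the pusher the source is exhausted


def drive(source: Iterable[bytes], sink: Callable[[bytes], None],
          capacity: int = 1024) -> None:
    """Run source pulling and sink pushing on separate threads.

    The bounded queue absorbs short bursts on either side; when it is full
    the puller blocks, and when it is empty the pusher blocks.
    """
    events: "queue.Queue[object]" = queue.Queue(maxsize=capacity)

    def pull() -> None:
        for event in source:
            events.put(event)
        events.put(_DONE)

    def push() -> None:
        while True:
            event = events.get()
            if event is _DONE:
                return
            sink(event)

    threads = [threading.Thread(target=pull), threading.Thread(target=push)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A single queue with one producer and one consumer preserves event order, so the sink sees events in the order the source produced them.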

• Functionality enhancement
– The native Flume connector configuration syntax is flat
• But connectors are essentially hierarchical
• The flat syntax restricts connectors to flat assembly
• Connector hierarchies must be assembled through hard-coding or ad hoc syntax
– Flive introduces a hierarchical syntax
• Hierarchical connector architectures can be wired dynamically
• For backward compatibility, only Flive connectors support the enhanced syntax

• Ease of use
– Zero-configuration plug-in architecture
• Native Flume requires plug-ins to be configured by hand
• Flive requires no configuration, only minimal conventions
– Simpler yet powerful Flive shell
– Introduced the translator framework
• Node configuration specs may be too complicated to edit manually
• A translator converts a user-domain spec into a Flive/Flume configuration spec
• Extensible
– Hugetable translator for Hugetable
– Basic translator for native Flume: full Flume compatibility
– Ease of deployment and management

• As a Hugetable ETL
– Sourcing structured data from various sources
• FS, FTP, SQL, Log4j, …
– Targeting all Hugetable storage engines
• Text File, Sequence File, RCFile, HFile, HBase, …
– Filtering unwanted/malformed records
– Column transformation on the fly
• IUD-like single-stream column operations, based on function expressions
• Multi-stream operations: pre-join on the fly
– Multi-table loading
• Like fan-out, but with less overhead
– Real-time aggregation
• Exact computation: sum(x), count(*)
• Probabilistic computation: count(distinct x), top(k), etc.
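
The slides do not say which algorithm Flive uses for probabilistic `count(distinct x)`. As one possible illustration, the sketch below uses a K-minimum-values estimator, a standard technique swapped in here as an assumption; the class name `DistinctEstimator` and the default `k` are invented for the example.

```python
import hashlib
import heapq
from typing import List, Set


class DistinctEstimator:
    """Approximate count(distinct x) by keeping the k smallest hash values."""

    def __init__(self, k: int = 1024) -> None:
        self.k = k
        self._heap: List[float] = []    # negated hashes, so the largest kept hash is on top
        self._members: Set[float] = set()

    @staticmethod
    def _hash(value: str) -> float:
        """Map a value to a pseudo-uniform number in [0, 1)."""
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, value: str) -> None:
        h = self._hash(value)
        if h in self._members:
            return  # already tracked; repeated values change nothing
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes seen: exact
        return (self.k - 1) / -self._heap[0]
```

Memory stays bounded by `k` regardless of stream length, at the cost of a relative error that shrinks roughly with the square root of `k`. Exact aggregates such as `sum(x)` and `count(*)` need only a running total and are not shown.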

Runtime Flive
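[Figure: Flive runtime data path from Agent to Collector. Labeled components: DataSource, Tailer, Flume Driver (A-puller, A-pusher), network, T-server, Flume Driver (C-puller, C-pusher), Decoder (multi-threaded decoding), Appender (multi-threaded append), Driver, internal queues Q1 to Q7, and storage targets HBase, HDFS, and others.]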
