"Low-Latency Streaming Data Processing in Hadoop" was presented to Lansing Java Users Group meeting on 3/17/2015 by Vijay Mandava and Lan Jiang. It covers several frameworks that are commonly used for real-time streaming data processing such as Flume, Kafka and Storm. The demo uses all 3 frameworks to stream live twitter data to calculate TopN trending hashtags. The demo was done on Cloudera CDH 5.3
7. Flume Architecture
• Distributed system for collecting and aggregating data from multiple sources into a centralized data store
• An agent is a JVM process that hosts the Flume components
• A channel stores each message until a sink picks it up
• Different types of Flume sources are available
• Source and sink are decoupled
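The source/channel/sink decoupling described above can be illustrated with a toy in-process model. This is a plain-Python sketch, not Flume's API: the names `Channel`, `run_source` and `run_sink` are hypothetical, and a real Flume agent is wired together through a configuration file rather than written as code.

```python
from collections import deque


class Channel:
    """Toy stand-in for a Flume channel: buffers events until a sink takes them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # The source only talks to the channel, never to the sink.
        if len(self._events) >= self.capacity:
            raise OverflowError("channel is full")
        self._events.append(event)

    def take(self):
        # Returns the oldest buffered event, or None when the channel is empty.
        return self._events.popleft() if self._events else None

    def __len__(self):
        return len(self._events)


def run_source(lines, channel):
    """Hypothetical source: wraps each input line as an event and buffers it."""
    for line in lines:
        channel.put({"body": line})


def run_sink(channel, store):
    """Hypothetical sink: drains the channel into the centralized store."""
    while (event := channel.take()) is not None:
        store.append(event["body"])


channel = Channel(capacity=100)
store = []
run_source(["tweet 1", "tweet 2"], channel)  # sink not running yet: events wait
buffered = len(channel)
run_sink(channel, store)                     # sink drains at its own pace
```

Because the source and the sink only share the channel, either side can be slow or temporarily stopped without the other needing to know.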
10. Kafka Introduction
• Distributed, partitioned and replicated messaging system
• Kafka brokers run as a cluster
• Producers and Consumers can be written in any language
11. Topic
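• Each partition is an ordered, immutable sequence of messages, each identified by a sequential number
• Retains messages for a configurable period of time
• Each consumer controls its own “offset”, its position in the partition
• Each partition is replicated and has one “leader” and zero or more “followers”; reads and writes are handled only by the leader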
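The partition and offset ideas above can be sketched with a toy append-only log. This is an illustration only, not Kafka's client API: `PartitionLog` and `Consumer` are hypothetical names, and retention and replication are left out to keep the sketch short.

```python
class PartitionLog:
    """Toy model of one partition: an append-only list of messages."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # The offset is simply the position of the message in the log.
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        # Reading does not remove anything; stored messages never change.
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer tracks its own offset; the log knows nothing about it."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for text in ["a", "b", "c"]:
    log.append(text)

fast = Consumer(log)
slow = Consumer(log)
fast_batch = fast.poll()   # reads everything available
slow_batch = slow.poll(1)  # reads a single message
slow.offset = 0            # a consumer may rewind and re-read
replayed = slow.poll()
```

Since the offset lives with the consumer, two consumers can read the same partition at different speeds, and one of them can replay old messages without affecting the other.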
12. Producers and Consumers
• The producer controls which partition each message goes to
• Supports both Queuing and Pub/Sub
– Abstraction called Consumer group
• Ordering within Partition
– To preserve ordering for a subscriber, only one subscriber in a consumer group may read a given partition
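A rough sketch of the two ideas on this slide: the producer picking a partition, and consumer groups giving either queue-like or pub/sub-like delivery. Both helper functions are hypothetical and are not part of any Kafka client; the CRC32 hash is an assumption chosen for repeatability, not the hash Kafka itself uses.

```python
import zlib


def choose_partition(key, num_partitions):
    """Hypothetical keyed partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Queue-like: one group, so the partitions are split between its members.
queue_style = assign_partitions(partitions, ["c1", "c2"])

# Pub/sub-like: several groups, each with one member that receives every partition.
pubsub_style = {
    group: assign_partitions(partitions, [f"{group}-c1"])
    for group in ["analytics", "archive"]
}
```

Within a group no partition has more than one reader, which is what keeps per-partition ordering intact for that group.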
13. Storm Introduction
• Distributed real-time computation system
– Process unbounded streams of data
– Can use multiple programming languages
– Scalable, fault-tolerant and guarantees that data will be processed
• Use Cases
– Real-time analytics, online machine learning
– Continuous Computation
– Distributed RPC
– ETL
• Concepts
– Topology
– Spouts
– Bolts
14. Concepts
• Storm Cluster
– Master node (Nimbus)
• Distributes code
• Assigns tasks to machines
• Monitors for failures
– Worker nodes (Supervisor)
• Starts/stops worker processes
• Each worker process executes a subset of a topology
– ZooKeeper
• Coordinates between Nimbus and Supervisors
• Nimbus and Supervisors are completely stateless
• State is maintained in ZooKeeper or on local disk
15. Details
• Stream
– Unbounded sequence of tuples
• Spout (write logic)
– Source of stream. Emits tuples
• Bolt (write logic)
– Processes streams and emits tuples
• Topology
– DAG of spouts and bolts
– Submit a topology to a Storm cluster
– Each node runs in parallel, and the degree of parallelism for each node is configurable
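The spout, bolt and topology roles above can be sketched with plain Python generators. This is not Storm's API: the function names are hypothetical, the stream is a short finite list rather than an unbounded one, and everything runs in a single thread with no parallelism.

```python
def sentence_spout(sentences):
    """Hypothetical spout: the source of the stream, emitting one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after every update."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# The "topology" is the wiring: spout -> split bolt -> count bolt.
sentences = ["storm processes streams", "storm scales"]
topology = count_bolt(split_bolt(sentence_spout(sentences)))
emitted = list(topology)
```

In a real cluster each of these nodes would run as several parallel tasks spread over worker processes; the chain of generators only shows how tuples flow along the edges of the graph.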
16. Stream groupings
• Tells a topology how to send tuples between two components
• Because tasks execute in parallel, the grouping determines which task each tuple is sent to
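Two of Storm's built-in groupings, shuffle grouping and fields grouping, can be approximated as functions that map a tuple to a task index. This is a simplified sketch rather than Storm's implementation: the function signatures are hypothetical, plain random choice stands in for Storm's balanced shuffle, and the CRC32 hash is an assumption.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng):
    """Pick a task at random, so load is spread roughly evenly on average."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, num_tasks, field_index=0):
    """Send tuples with the same field value to the same task."""
    key = str(tup[field_index]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


num_tasks = 4
rng = random.Random(0)  # seeded only to make the sketch repeatable
tuples = [("#java",), ("#storm",), ("#java",), ("#kafka",), ("#java",)]

shuffle_targets = [shuffle_grouping(t, num_tasks, rng) for t in tuples]
fields_targets = [fields_grouping(t, num_tasks) for t in tuples]
```

A counting bolt needs something like the fields grouping: every occurrence of `#java` must reach the same task, otherwise each task would hold only a partial count.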
17. Demo - Twitter TopN Trending Topic
• Use the Flume Twitter source to ingest data and publish events to a Kafka topic
• Use Storm as a real-time event processing system to calculate the TopN trending topics
• Use Redis to store the TopN result
• Use Node.js/jQuery for visualization
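The counting step at the heart of the demo can be sketched in a few lines. The slides do not show the demo's code, so this is an assumption-laden illustration: it counts hashtags over a small in-memory list instead of a live stream, the sample tweets are made up, and the functions `extract_hashtags` and `top_n` are hypothetical. In the demo this logic would run inside Storm bolts, with the result written to Redis.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")


def extract_hashtags(text):
    """Return the lowercased hashtags found in one tweet's text."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def top_n(tweets, n):
    """Count hashtags across all tweets and return the n most frequent."""
    counts = Counter()
    for text in tweets:
        counts.update(extract_hashtags(text))
    return counts.most_common(n)


# Made-up sample input standing in for the live Twitter stream.
tweets = [
    "Learning #Storm and #Kafka today",
    "#kafka brokers run as a cluster",
    "Streaming with #storm, #Kafka and #Flume",
]
result = top_n(tweets, 2)
```

Here `result` is `[("#kafka", 3), ("#storm", 2)]`; lowercasing the tags makes `#Kafka` and `#kafka` count as the same topic.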