Low Latency Streaming Data Processing in Hadoop

Low-Latency Streaming
Data Processing in Hadoop
InSemble Inc.
http://www.insemble.com

Agenda
Reference Architecture for Low Latency Streaming1
Storm4
Kafka3
Flume2
Demo5

Hadoop Ecosystem
Source: Apache Hadoop Documentation

Hortonworks Data Platform(HDP)

Real time Stream Processing
Architecture with Hadoop

Flume Architecture
• Distributed system for
collecting and
aggregating from
multiple data stores to a
centralized data store
• Agent is a JVM that
hosts the Flume
components
• Channel will store
message until
picked by a sink
• Different types of
Flume sources
• Source and Sink are
decoupled

Kafka Introduction
• Messaging System which is distributed, partitioned and replicated
• Kafka brokers run as a cluster
• Producers and Consumers can be written in any language

Topic
• Ordered, immutable sequence numbers
• Retains messages until a period of time
• “Offset” of where they are is controlled by the consumer
• Each partition is replicated and has “leader” and 0 or more “follower”. R/W
only done on leader

Producers and Consumers
• Producer controls which partition messages goes to
• Supports both Queuing and Pub/Sub
– Abstraction called Consumer group
• Ordering within Partition
– Ordering for subscriber has to be done with only one subscriber to that
partition

Storm Introduction
• Distributed real time computational system
– Process unbounded streams of data
– Can use multiple programming languages
– Scalable, fault-tolerant and guarantees that data will be processed
• Use Cases
– Real time analytics, online machine learning
– Continuous Computation
– Distributed RPC
– ETL
• Concepts
– Topology
– Spouts
– Bolts

Concepts
• Storm Cluster
– Master node(Nimbus)
• Distributing code
• Assigns tasks to machines
• Monitors for failures
– Worker nodes(Supervisor)
• Starts/stops worker processes
• Each worker process executes subset of a topology
– Zookeeper
• Coordinates between Nimbus and Supervisors
• Nimbus and Supervisors completely stateless
• State maintained by Zookeeper or local disks

Details
• Stream
– Unbounded sequence of tuples
• Spout(write logic)
– Source of stream. Emits tuples
• Bolt(write logic)
– Processes streams and emits tuples
• Topology
– DAG of spouts and bolts
– Submit a topology to a Storm cluster
– Each node runs in parallel and parallelism is controlled

Stream groupings
• Tells a topology how to send tuples between two components
• Since tasks are executed in parallel, how do we control which tasks the
tuples are being sent to

Demo - Twitter TopN Trending Topic
• Use Flume Twitter Source to ingest data
and publish event to Kafka topic
• Use Storm as an Real-Time event
processing system to calculate TopN
trending topic
• Use Redis to store the TopN Result
• Use Node.js/JQuery for visualization

Flow Chart
Twitter
Twitter Source
Flume Agent
Mem Channel Kafka Sink
KafkaKafka SpoutParse Twitter BoltCount Bolt
TopN Ranker Bolt Report Bolt
Storm
RedisNode.js + JQuery
Twitter Source Mem Channel Kafka Sink

Low Latency Streaming Data Processing in Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (19)

Similar to Low Latency Streaming Data Processing in Hadoop

Similar to Low Latency Streaming Data Processing in Hadoop (20)

Recently uploaded

Recently uploaded (20)

Low Latency Streaming Data Processing in Hadoop