Real time and reliable
processing with Apache
Storm
The code is available on:
https://github.com/andreaiacono/StormTalk
What is Apache Storm?
Storm is a real-time distributed computing framework for
reliably processing unbounded data streams.
It was created by Nathan Marz and his team at BackType,
and released as open source in 2011 (after BackType was
acquired by Twitter).
Topology
A spout is the source of a data stream; it emits the stream to one or more bolts.
Each emitted item is called a tuple, which is an ordered list of values.
A bolt performs computation on the tuples it receives and emits the results to one
or more bolts. A bolt at the end of the topology doesn't emit anything.
Every task (either a spout or a bolt) can have multiple instances.
A topology is a directed acyclic graph of computation formed by spouts and bolts.
A simple topology
We'd like to build a system that generates random numbers and writes them
to a file.
Here is a topology that represents it:
A simple topology: the spout
// Storm classes come from org.apache.storm.* (backtype.storm.* before Storm 1.0)
import java.util.Map;
import java.util.Random;

public class RandomSpout extends BaseRichSpout {
    private SpoutOutputCollector spoutOutputCollector;
    private Random random;
    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        // every emitted tuple carries a single field named "val"
        outputFieldsDeclarer.declare(new Fields("val"));
    }
    @Override
    public void open(Map map, TopologyContext topologyContext,
                     SpoutOutputCollector spoutOutputCollector) {
        this.spoutOutputCollector = spoutOutputCollector;
        random = new Random();
    }
    @Override
    public void nextTuple() {
        // nextInt() % 100 yields values between -99 and 99
        spoutOutputCollector.emit(new Values(random.nextInt() % 100));
    }
}
A simple topology: the bolt
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;

// minimal error handling: it's a sample!
public class FileWriteBolt extends BaseBasicBolt {
    private final String filename = "output.txt";
    private BufferedWriter writer;
    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        super.prepare(stormConf, context);
        try {
            writer = new BufferedWriter(new FileWriter(filename, true)); // append mode
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        try {
            writer.write(input.getInteger(0) + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {} // terminal bolt: emits nothing
    @Override
    public void cleanup() {
        try {
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
A simple topology: the topology
public class RandomValuesTopology {
    private static final String name = RandomValuesTopology.class.getName();
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("random-spout", new RandomSpout());
        // the bolt subscribes to the spout's stream with a shuffle grouping
        builder.setBolt("writer-bolt", new FileWriteBolt())
               .shuffleGrouping("random-spout");
        Config conf = new Config();
        conf.setDebug(false);
        conf.setMaxTaskParallelism(3);
        // run in-process for five minutes, then shut everything down
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(name, conf, builder.createTopology());
        Utils.sleep(300_000);
        cluster.killTopology(name);
        cluster.shutdown();
        // to run it on a live cluster
        // StormSubmitter.submitTopology("topology", conf, builder.createTopology());
    }
}
Grouping
The path a tuple takes from one component to the next is determined by the grouping.
Since a bolt can have multiple instances, we have to decide which of them
receives each emitted tuple.
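To make the idea concrete, here is a minimal plain-Java sketch (no Storm dependency) of how the four groupings used later in this deck (shuffle, fields, global, all) could pick the target task. The class GroupingSketch and its methods are hypothetical names for illustration; the random choice and the hash-modulo rule are simplifying assumptions, not Storm's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative only: hypothetical helper, not part of Storm. */
final class GroupingSketch {

    /** Shuffle grouping: any task may receive the tuple (assumed: uniform random choice). */
    static int shuffleTarget(int numTasks) {
        return ThreadLocalRandom.current().nextInt(numTasks);
    }

    /** Fields grouping: equal field values always map to the same task (assumed: hash modulo). */
    static int fieldsTarget(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    /** Global grouping: the whole stream goes to one single task (here: index 0). */
    static int globalTarget() {
        return 0;
    }

    /** All grouping: every task receives its own copy of the tuple. */
    static List<Integer> allTargets(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            targets.add(i);
        }
        return targets;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        System.out.println("shuffle -> task " + shuffleTarget(numTasks));
        System.out.println("fields  -> task " + fieldsTarget("#storm", numTasks));
        System.out.println("global  -> task " + globalTarget());
        System.out.println("all     -> tasks " + allTargets(numTasks));
    }
}
```

The property that matters for counting is the one of fieldsTarget: the same hashtag always lands on the same task, so each task can keep a local count without coordination.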
Twitter top-n hashtags: overview
We want to create a webpage that shows the top-N hashtags and,
every time a new tweet containing one of them arrives, displays it
on a world map.
Twitter top-n hashtags: GeoTweetSpout
// TwitterStream, TwitterStreamFactory, FilterQuery and ConfigurationBuilder come from twitter4j;
// GeoTwitterListener and MiscUtils are helper classes not shown on the slides
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;

public class GeoTweetSpout extends BaseRichSpout {
    SpoutOutputCollector spoutOutputCollector;
    TwitterStream twitterStream;
    LinkedBlockingQueue<String> queue = null;
    @Override
    public void open(Map map, TopologyContext topologyContext,
                     SpoutOutputCollector spoutOutputCollector) {
        this.spoutOutputCollector = spoutOutputCollector;
        // the listener fills this bounded queue; nextTuple() drains it
        queue = new LinkedBlockingQueue<>(1000);
        // custkey, custsecret, accesstoken and accesssecret are the Twitter API credentials (not shown)
        ConfigurationBuilder config = new ConfigurationBuilder()
                .setOAuthConsumerKey(custkey)
                .setOAuthConsumerSecret(custsecret)
                .setOAuthAccessToken(accesstoken)
                .setOAuthAccessTokenSecret(accesssecret);
        TwitterStreamFactory streamFactory = new TwitterStreamFactory(config.build());
        twitterStream = streamFactory.getInstance();
        twitterStream.addListener(new GeoTwitterListener(queue));
        // a bounding box covering (almost) the whole world: only geolocated tweets pass
        double[][] boundingBox = {{-179d, -89d}, {179d, 89d}};
        FilterQuery filterQuery = new FilterQuery().locations(boundingBox);
        twitterStream.filter(filterQuery);
    }
    @Override
    public void nextTuple() {
        // non-blocking: return immediately if no tweet is waiting
        String msg = queue.poll();
        if (msg == null) {
            return;
        }
        String lat = MiscUtils.getLatFromMsg(msg);
        String lon = MiscUtils.getLonFromMsg(msg);
        String tweet = MiscUtils.getTweetFromMsg(msg);
        spoutOutputCollector.emit(new Values(tweet, lat, lon));
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("tweet", "lat", "lon"));
    }
}
Twitter top-n hashtags: NoHashtagDropperBolt
public class NoHashtagDropperBolt extends BaseBasicBolt {
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweet", "lat", "lon"));
    }
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String tweet = tuple.getString(0);
        // tweets without hashtags are dropped simply by not emitting them
        Set<String> hashtags = MiscUtils.getHashtags(tweet);
        if (hashtags.isEmpty()) {
            return;
        }
        String lat = tuple.getString(1);
        String lon = tuple.getString(2);
        collector.emit(new Values(tweet, lat, lon));
    }
}
Twitter top-n hashtags: GeoHashtagsFilterBolt
public class GeoHashtagsFilterBolt extends BaseBasicBolt {
    private Rankings rankings;
    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        // same order as the values emitted in execute()
        outputFieldsDeclarer.declare(new Fields("lat", "lon", "hashtag", "tweet"));
    }
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // tuples from "total-rankings" only refresh the current top-N rankings
        String componentId = tuple.getSourceComponent();
        if ("total-rankings".equals(componentId)) {
            rankings = (Rankings) tuple.getValue(0);
            return;
        }
        // no rankings received yet: nothing to filter against
        if (rankings == null) return;
        String tweet = tuple.getString(0);
        for (String hashtag : MiscUtils.getHashtags(tweet)) {
            for (Rankable r : rankings.getRankings()) {
                String rankedHashtag = r.getObject().toString();
                if (hashtag.equals(rankedHashtag)) {
                    String lat = tuple.getString(1);
                    String lon = tuple.getString(2);
                    // emit once, for the first hashtag found in the top-N
                    collector.emit(new Values(lat, lon, hashtag, tweet));
                    return;
                }
            }
        }
    }
}
Twitter top-n hashtags: ToRedisTweetBolt
public class ToRedisTweetBolt extends BaseBasicBolt {
    private RedisConnection<String, String> redis;
    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        super.prepare(stormConf, context);
        RedisClient client = new RedisClient("localhost", 6379);
        redis = client.connect();
    }
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // reads the values in the order emitted by GeoHashtagsFilterBolt
        String lat = tuple.getString(0);
        String lon = tuple.getString(1);
        String hashtag = tuple.getString(2);
        String tweet = tuple.getString(3);
        String message = "1|" + lat + "|" + lon + "|" + hashtag + "|" + tweet;
        // publishes on the Redis channel the web page listens to
        redis.publish("tophashtagsmap", message);
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: emits nothing
    }
}
Twitter top-n hashtags: topology
public class TopHashtagMapTopology {
    private static int n = 20;
    public static void main(String[] args) {
        GeoTweetSpout geoTweetSpout = new GeoTweetSpout();
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("geo-tweet-spout", geoTweetSpout, 4);
        builder.setBolt("no-ht-dropper", new NoHashtagDropperBolt(), 4)
               .shuffleGrouping("geo-tweet-spout");
        builder.setBolt("parse-twt", new ParseTweetBolt(), 4)
               .shuffleGrouping("no-ht-dropper");
        // fields grouping: the same hashtag always reaches the same counting task
        builder.setBolt("count-ht", new CountHashtagsBolt(), 4)
               .fieldsGrouping("parse-twt", new Fields("hashtag"));
        builder.setBolt("inter-rankings", new IntermediateRankingsBolt(n), 4)
               .fieldsGrouping("count-ht", new Fields("hashtag"));
        // global grouping: a single task merges all the intermediate rankings
        builder.setBolt("total-rankings", new TotalRankingsBolt(n), 1)
               .globalGrouping("inter-rankings");
        builder.setBolt("to-redis-ht", new ToRedisTopHashtagsBolt(), 1)
               .shuffleGrouping("total-rankings");
        // all grouping: every filter task receives a copy of the latest rankings
        builder.setBolt("geo-hashtag-filter", new GeoHashtagsFilterBolt(), 4)
               .shuffleGrouping("no-ht-dropper")
               .allGrouping("total-rankings");
        builder.setBolt("to-redis-tweets", new ToRedisTweetBolt(), 4)
               .globalGrouping("geo-hashtag-filter");
        // code to start the topology...
    }
}
Twitter top-n hashtags
Storm Cluster
Nimbus: a daemon responsible for distributing code around the cluster,
assigning jobs to nodes, and monitoring for failures.
Worker node: executes a subset of a topology (spouts and/or bolts). It
runs a supervisor daemon that listens for jobs assigned to the machine
and starts and stops worker processes as necessary.
ZooKeeper: manages all the coordination between Nimbus and the supervisors.
Worker Node
Worker process: JVM (processes a specific topology)
Executor: Thread
Task: instance of bolt/spout
Supervisor: syncing with Master Node
The number of executors can be modified at runtime;
the topology structure cannot.
Tuples transfer
Storm supports two different types of transfer:
● on the same JVM
● on different JVMs
For serialization, Storm tries to look up a Kryo serializer, which is
more efficient than standard Java serialization.
The network layer for transport is provided by Netty.
Also for performance reasons, the queues are implemented using
the LMAX Disruptor library, which enables efficient queuing.
Tuples transfer: on the same JVM
A generic task is composed of two threads and two queues.
Tasks at the start (spouts) or at the end of the topology (terminal
bolts) have only one queue.
Tuples transfer: on different JVMs
Queues failure
Since the queues follow the producer/consumer model, if the
producer supplies data at a higher rate than the consumer can handle, the
queue will overflow.
The transfer queue is more critical because it has to serve all the
tasks of the worker, so it's stressed more than the internal ones.
If an overflow happens, Storm tries, but does not guarantee, to put
the overflowing tuples into a temporary queue, with the side effect
of reducing the throughput of the topology.
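The overflow condition can be illustrated with a small sketch. It uses a bounded java.util.concurrent.ArrayBlockingQueue rather than the LMAX Disruptor, and the class QueueOverflowSketch is a hypothetical name, so it only shows the producer/consumer imbalance and not Storm's actual overflow handling.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Illustrative only: a bounded queue fed by a producer while no consumer drains it. */
final class QueueOverflowSketch {

    /**
     * Offers `produced` items to a queue of the given capacity without consuming any,
     * and returns how many items did not fit.
     */
    static int countRejected(int capacity, int produced) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int rejected = 0;
        for (int i = 0; i < produced; i++) {
            // offer() returns false instead of blocking when the queue is full
            if (!queue.offer(i)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // capacity 4, 10 items produced, nothing consumed
        System.out.println(countRejected(4, 10)); // prints 6
    }
}
```

Whatever is done with the rejected items (buffering them elsewhere, blocking the producer, dropping them) costs either memory, throughput or data.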
Reliability
Levels of delivery guarantee
● at-most-once: tuples are processed in the order in which they come from the spouts
  and, in case of failure (network, exceptions), are simply dropped
● at-least-once: in case of failure, tuples are re-emitted from the spout; a tuple
  can be processed more than once, and tuples can arrive out of order
● exactly-once: only available with Trident, a layer on top of Storm that
  lets you write topologies with different semantics
Reliability for bolts
The three main concepts needed to achieve the at-least-once guarantee level are:
● anchoring: every tuple emitted by a bolt has to be linked to the
  input tuple using the emit(tuple, values) method
● acking: when a bolt successfully finishes processing a tuple in execute(), it
  has to call the ack() method to notify Storm
● failing: when a bolt encounters a problem with the incoming tuple, it
  has to call the fail() method
The BaseBasicBolt we saw before takes care of them automatically
(to fail a tuple, the bolt must throw a FailedException).
When the topology is complex (expanding tuples, collapsing tuples,
joining streams), they must be managed explicitly by extending
BaseRichBolt.
Reliability for spouts
The ISpout interface defines, among others, these methods:
void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
void close();
void nextTuple();
void ack(Object msgId);
void fail(Object msgId);
To implement a reliable spout, we have to emit each tuple with a message ID inside the nextTuple() method:
collector.emit(values, msgId);
and we have to implement the ack() and fail() methods accordingly.
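The bookkeeping behind a reliable spout can be sketched in plain Java, without Storm on the classpath. ReliableSpoutSketch is a hypothetical class: its nextTuple(), ack() and fail() only mirror the ISpout methods listed above, and replaying a failed message by putting it back on the source queue is one possible policy, assumed here for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Illustrative only: tracks in-flight messages by msgId so that failed ones can be replayed. */
final class ReliableSpoutSketch {
    private final Queue<String> source = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    private long nextMsgId = 0;

    /** Feeds a message into the (simulated) external source. */
    void offer(String message) {
        source.add(message);
    }

    /** Mirrors nextTuple(): emits one message and remembers it; returns its msgId, or -1 if idle. */
    long nextTuple() {
        String message = source.poll();
        if (message == null) {
            return -1;
        }
        long msgId = nextMsgId++;
        pending.put(msgId, message);
        // a real spout would call collector.emit(new Values(message), msgId) here
        return msgId;
    }

    /** Mirrors ack(msgId): the tuple tree completed, so the message can be forgotten. */
    void ack(long msgId) {
        pending.remove(msgId);
    }

    /** Mirrors fail(msgId): the message goes back to the source to be emitted again. */
    void fail(long msgId) {
        String message = pending.remove(msgId);
        if (message != null) {
            source.add(message);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    int queuedCount() {
        return source.size();
    }

    public static void main(String[] args) {
        ReliableSpoutSketch spout = new ReliableSpoutSketch();
        spout.offer("tweet-1");
        long first = spout.nextTuple();   // emitted with msgId 0
        spout.fail(first);                // processing failed: queued again
        long retry = spout.nextTuple();   // re-emitted with msgId 1
        spout.ack(retry);                 // fully processed this time
        System.out.println(spout.pendingCount() + " pending, " + spout.queuedCount() + " queued");
    }
}
```

Because a failed message is emitted a second time, downstream bolts may see it twice: this is exactly the at-least-once guarantee.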
Reliability
A tuple tree is the set of all the tuples emitted by the
downstream bolts, starting from a tuple emitted by a spout.
When all the tuples of a tree are marked as processed, Storm
considers the initial spout tuple correctly processed.
If any tuple of the tree is not marked as processed within a timeout
(30 seconds by default) or is explicitly failed, Storm replays the
tuple starting from the spout (this means that the operations performed
by a task have to be idempotent).
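One way to make an operation idempotent under replay is to remember which events have already been applied. The sketch below is an assumption-laden illustration: IdempotentCounterSketch is a hypothetical class, the event ID is assumed to be stable across replays (for example a tweet ID), and the in-memory set grows without bound, which a real system would have to address.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a counter that ignores events it has already applied. */
final class IdempotentCounterSketch {
    private final Set<String> seenEventIds = new HashSet<>();
    private final Map<String, Integer> counts = new HashMap<>();

    /** Applies the event once; returns false (and changes nothing) if it is a replay. */
    boolean apply(String eventId, String hashtag) {
        if (!seenEventIds.add(eventId)) {
            return false; // already counted: the replay has no effect
        }
        counts.merge(hashtag, 1, Integer::sum);
        return true;
    }

    int count(String hashtag) {
        return counts.getOrDefault(hashtag, 0);
    }

    public static void main(String[] args) {
        IdempotentCounterSketch counter = new IdempotentCounterSketch();
        counter.apply("tweet-1", "#storm");
        counter.apply("tweet-1", "#storm"); // replayed tuple
        counter.apply("tweet-2", "#storm");
        System.out.println(counter.count("#storm")); // prints 2
    }
}
```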
Reliability – Step 1
This is the starting state, nothing has yet happened.
To better understand the following slides, it's important to review
this property of the bitwise XOR operator:
1 ^ 1 == 0
1 ^ 2 ^ 1 == 2
1 ^ 2 ^ 2 ^ 1 == 0
1 ^ 2 ^ 3 ^ 2 ^ 1 == 3
1 ^ 2 ^ 3 ^ 3 ^ 2 ^ 1 == 0
1 ^ 2 ^ 3 ^ ... ^ N ^ ... ^ 3 ^ 2 ^ 1 == N
1 ^ 2 ^ 3 ^ ... ^ N ^ N ^ ... ^ 3 ^ 2 ^ 1 == 0
(whatever the order of the operands is)
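The property is easy to check in code. XorPropertySketch and xorAll are hypothetical names used only for this small demonstration.

```java
/** Illustrative only: values that appear an even number of times cancel out under XOR. */
final class XorPropertySketch {

    /** XORs all the given values together, in the order given. */
    static long xorAll(long... values) {
        long acc = 0;
        for (long v : values) {
            acc ^= v;
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(xorAll(1, 2, 1));          // prints 2
        System.out.println(xorAll(1, 2, 3, 3, 2, 1)); // prints 0
        System.out.println(xorAll(3, 1, 2, 2, 3, 1)); // prints 0 (order doesn't matter)
    }
}
```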
Reliability – Step 2
The spout has received something, so it sends its acker
task a pair of values:
● the tuple ID of the tuple to emit to the bolt
● its own task ID
The acker stores them in a map <TupleID, [TaskID, AckVal]>,
where the AckVal is initially set to the TupleID value.
The ID values are of type long, so they are 64 bits wide.
Reliability – Step 3
After having notified the acker that a new tuple was created,
the spout sends the tuple to its attached bolt (Bolt1).
Reliability – Step 4
Bolt1 computes the outgoing tuple according to its business
logic and notifies the acker that it's going to emit a new tuple.
The acker gets the ID of the new tuple and XORs it with the
AckVal (that contained the initial tuple ID).
Reliability – Step 5
Bolt1 sends the new tuple to its attached bolt (Bolt2).
Reliability – Step 6
After emitting the new tuple, Bolt1 marks its work on
the incoming tuple (Tuple0) as finished by sending an ack to the acker.
The acker gets the tuple ID and XORs it with the AckVal.
Reliability – Step 7
Bolt2 processes the incoming tuple according to its business logic
and will probably write some data to a DB, a queue or
somewhere else. Since it's a terminal bolt, it will not emit a new
tuple. Its job is done, so it can send an ack to the acker for the
incoming tuple.
The acker gets the tuple ID and XORs it with the AckVal. The value
of AckVal is now 0, so the acker knows that the starting tuple has
been successfully processed by the topology.
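Steps 2 to 7 can be replayed with a minimal acker sketch. It follows the description on these slides (a map from the initial tuple ID to the spout task ID and the AckVal, updated with XOR); AckerSketch, track() and update() are hypothetical names, the tuple IDs are fixed example values instead of random 64-bit numbers, and Storm's real acker differs in its details.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: the XOR bookkeeping of the acker as described in the steps above. */
final class AckerSketch {
    /** initial tuple ID -> { spout task ID, AckVal } */
    private final Map<Long, long[]> entries = new HashMap<>();

    /** Step 2: the spout registers a new tuple; the AckVal starts as the tuple ID. */
    void track(long rootId, long spoutTaskId) {
        entries.put(rootId, new long[] {spoutTaskId, rootId});
    }

    /**
     * Steps 4, 6 and 7: the ID of an emitted or acked tuple is XORed into the AckVal.
     * Returns true when the AckVal reaches 0, i.e. the whole tuple tree has been processed.
     */
    boolean update(long rootId, long tupleId) {
        long[] entry = entries.get(rootId);
        if (entry == null) {
            return false; // unknown or already completed tree
        }
        entry[1] ^= tupleId;
        if (entry[1] != 0) {
            return false;
        }
        entries.remove(rootId);
        return true;
    }

    boolean isTracking(long rootId) {
        return entries.containsKey(rootId);
    }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long tuple0 = 0x1234L; // example IDs; real ones are random 64-bit values
        long tuple1 = 0x5678L;
        acker.track(tuple0, 7L);                          // step 2: spout registers Tuple0
        System.out.println(acker.update(tuple0, tuple1)); // step 4: Bolt1 emits Tuple1 -> false
        System.out.println(acker.update(tuple0, tuple0)); // step 6: Bolt1 acks Tuple0  -> false
        System.out.println(acker.update(tuple0, tuple1)); // step 7: Bolt2 acks Tuple1  -> true
    }
}
```

Each tuple ID is XORed in exactly twice (once when the tuple is created, once when it is acked), so the AckVal returns to 0 only when every tuple of the tree has been acked, and the acker needs just one 64-bit value per tree regardless of its size.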
Questions & Answers
Thanks!
The code is available on:
https://github.com/andreaiacono/TalkStorm
Real time and reliable processing with Apache Storm

More Related Content

What's hot

Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentationGabriel Eisbruch
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012Dan Lynn
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time ComputationSonal Raj
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceP. Taylor Goetz
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleDataWorks Summit/Hadoop Summit
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Robert Evans
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Stormviirya
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormMd. Shamsur Rahim
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridDataWorks Summit
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Sonal Raj
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormDavorin Vukelic
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleDung Ngua
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.DECK36
 

What's hot (20)

Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentation
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time Computation
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 
STORM
STORMSTORM
STORM
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
 
Real-time Big Data Processing with Storm
Real-time Big Data Processing with StormReal-time Big Data Processing with Storm
Real-time Big Data Processing with Storm
 
Slide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache StormSlide #1:Introduction to Apache Storm
Slide #1:Introduction to Apache Storm
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & Example
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 

Viewers also liked

Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXAndrea Iacono
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014StampedeCon
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)Ankur Dave
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm TutorialDavide Mazza
 
Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Doug Needham
 
Real-Time Analytics with Apache Storm
Real-Time Analytics with Apache StormReal-Time Analytics with Apache Storm
Real-Time Analytics with Apache StormTaewoo Kim
 
StormWars - when the data stream shrinks
StormWars - when the data stream shrinksStormWars - when the data stream shrinks
StormWars - when the data stream shrinksvishnu rao
 
Neo4j and the Panama Papers - FooCafe June 2016
Neo4j and the Panama Papers - FooCafe June 2016Neo4j and the Panama Papers - FooCafe June 2016
Neo4j and the Panama Papers - FooCafe June 2016Craig Taverner
 
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014鉄平 土佐
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Apache Storm
Apache StormApache Storm
Apache StormEdureka!
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormJohn Georgiadis
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Adrianos Dadis
 
Processing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) PregelProcessing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) PregelArangoDB Database
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseMo Patel
 
Handling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph DatabaseHandling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph DatabaseArangoDB Database
 

Viewers also liked (20)

Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights.
 
Real-Time Analytics with Apache Storm
Real-Time Analytics with Apache StormReal-Time Analytics with Apache Storm
Real-Time Analytics with Apache Storm
 
StormWars - when the data stream shrinks
StormWars - when the data stream shrinksStormWars - when the data stream shrinks
StormWars - when the data stream shrinks
 
Twitter Stream Processing
Twitter Stream ProcessingTwitter Stream Processing
Twitter Stream Processing
 
Neo4j and the Panama Papers - FooCafe June 2016
Neo4j and the Panama Papers - FooCafe June 2016Neo4j and the Panama Papers - FooCafe June 2016
Neo4j and the Panama Papers - FooCafe June 2016
 
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
 
Storm real-time processing
Storm real-time processingStorm real-time processing
Storm real-time processing
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
Big Data Streaming processing using Apache Storm - FOSSCOMM 2016
 
Structured streaming in Spark
Structured streaming in SparkStructured streaming in Spark
Structured streaming in Spark
 
Processing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) PregelProcessing large-scale graphs with Google(TM) Pregel
Processing large-scale graphs with Google(TM) Pregel
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
 
Handling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph DatabaseHandling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph Database
 

Similar to Real time Twitter analytics with Apache Storm

실시간 인벤트 처리
실시간 인벤트 처리실시간 인벤트 처리
실시간 인벤트 처리Byeongweon Moon
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on AndroidTomáš Kypta
 
Developing Java Streaming Applications with Apache Storm




Real time Twitter analytics with Apache Storm

  • 1. Real time and reliable processing with Apache Storm The code is available on: https://github.com/andreaiacono/StormTalk
  • 2. What is Apache Storm? Storm is a real-time distributed computing framework for reliably processing unbounded data streams. It was created by Nathan Marz and his team at BackType, and released as open source in 2011 (after BackType was acquired by Twitter).
  • 3. Topology A spout is the source of a data stream, which it emits to one or more bolts. Emitted data is called a tuple, which is an ordered list of values. A bolt performs computation on the tuples it receives and emits the results to one or more bolts. A bolt at the end of the topology doesn't emit anything. Every component (either a spout or a bolt) can have multiple instances, called tasks. A topology is a directed acyclic graph of computation formed by spouts and bolts.
  • 4. A simple topology We'd like to build a system that generates random numbers and writes them to a file. Here is a topology that represents it:
  • 5. A simple topology: the spout
      public class RandomSpout extends BaseRichSpout {
          private SpoutOutputCollector spoutOutputCollector;
          private Random random;

          @Override
          public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
              outputFieldsDeclarer.declare(new Fields("val"));
          }

          @Override
          public void open(Map map, TopologyContext topologyContext,
                           SpoutOutputCollector spoutOutputCollector) {
              this.spoutOutputCollector = spoutOutputCollector;
              random = new Random();
          }

          @Override
          public void nextTuple() {
              spoutOutputCollector.emit(new Values(random.nextInt() % 100));
          }
      }
  • 6. A simple topology: the bolt
      // no exception checking (IOException is left unhandled): it's a sample!
      public class FileWriteBolt extends BaseBasicBolt {
          private final String filename = "output.txt";
          private BufferedWriter writer;

          @Override
          public void prepare(Map stormConf, TopologyContext context) {
              super.prepare(stormConf, context);
              // open the file in append mode
              writer = new BufferedWriter(new FileWriter(filename, true));
          }

          @Override
          public void execute(Tuple input, BasicOutputCollector collector) {
              // write the single integer field of the incoming tuple, one value per line
              writer.write(input.getInteger(0) + "\n");
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {}

          @Override
          public void cleanup() {
              writer.close();
          }
      }
  • 7. A simple topology: the topology
      public class RandomValuesTopology {
          private static final String name = RandomValuesTopology.class.getName();

          public static void main(String[] args) {
              TopologyBuilder builder = new TopologyBuilder();
              builder.setSpout("random-spout", new RandomSpout());
              builder.setBolt("writer-bolt", new FileWriteBolt())
                      .shuffleGrouping("random-spout");

              Config conf = new Config();
              conf.setDebug(false);
              conf.setMaxTaskParallelism(3);

              LocalCluster cluster = new LocalCluster();
              cluster.submitTopology(name, conf, builder.createTopology());
              Utils.sleep(300_000);
              cluster.killTopology(name);
              cluster.shutdown();

              // to run it on a live cluster
              // StormSubmitter.submitTopology("topology", conf, builder.createTopology());
          }
      }
  • 8. Grouping The path that tuples take from one component to the next is determined by the grouping. Since a bolt can have multiple instances, we have to decide which instance receives each emitted tuple.
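To make the idea of grouping concrete, here is a minimal plain-Java sketch of how a target instance could be chosen for two of the groupings used later in the talk (shuffle and fields). The class GroupingSketch and its method names are hypothetical and this is not Storm's internal code; it only assumes that shuffle picks an instance at random and that fields grouping hashes the grouping values.

```java
import java.util.List;
import java.util.Objects;
import java.util.Random;

// Hypothetical model of two grouping strategies; not Storm's implementation.
final class GroupingSketch {

    // Shuffle grouping: pick any target task at random, spreading the load.
    static int shuffleTarget(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // Fields grouping: hash the grouping values so that equal values
    // always reach the same task.
    static int fieldsTarget(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(Objects.hashCode(groupingValues), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        Random random = new Random();
        for (String hashtag : List.of("#storm", "#java", "#storm")) {
            System.out.printf("%s -> shuffle task %d, fields task %d%n",
                    hashtag,
                    shuffleTarget(random, numTasks),
                    fieldsTarget(List.<Object>of(hashtag), numTasks));
        }
    }
}
```

With fields grouping both occurrences of "#storm" land on the same task, which is what a per-hashtag counter needs; with shuffle grouping they may not.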
  • 9. Twitter top-n hashtags: overview We want to create a web page that shows the top-N hashtags and, every time a new tweet containing one of them arrives, displays it on a world map.
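The counting at the heart of this use case can be sketched in plain Java, without Storm. The class TopHashtagsSketch, the regular expression and the lower-casing of hashtags are assumptions made for illustration only; the topology in the talk spreads this work over several bolts.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical single-process model of "count hashtags, keep the top N".
final class TopHashtagsSketch {
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    private final Map<String, Integer> counts = new HashMap<>();

    // Count every hashtag found in the tweet text (case-insensitive).
    void addTweet(String tweet) {
        Matcher m = HASHTAG.matcher(tweet);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(), 1, Integer::sum);
        }
    }

    // Return the n most frequent hashtags; ties are broken alphabetically.
    List<String> top(int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TopHashtagsSketch sketch = new TopHashtagsSketch();
        sketch.addTweet("#storm is fast #java");
        sketch.addTweet("learning #Storm");
        System.out.println(sketch.top(1));
    }
}
```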
  • 10. Twitter top-n hashtags: GeoTweetSpout
      public class GeoTweetSpout extends BaseRichSpout {
          SpoutOutputCollector spoutOutputCollector;
          TwitterStream twitterStream;
          LinkedBlockingQueue<String> queue = null;

          @Override
          public void open(Map map, TopologyContext topologyContext,
                           SpoutOutputCollector spoutOutputCollector) {
              this.spoutOutputCollector = spoutOutputCollector;
              queue = new LinkedBlockingQueue<>(1000);

              ConfigurationBuilder config = new ConfigurationBuilder()
                      .setOAuthConsumerKey(custkey)
                      .setOAuthConsumerSecret(custsecret)
                      .setOAuthAccessToken(accesstoken)
                      .setOAuthAccessTokenSecret(accesssecret);
              TwitterStreamFactory streamFactory = new TwitterStreamFactory(config.build());
              twitterStream = streamFactory.getInstance();
              twitterStream.addListener(new GeoTwitterListener(queue));

              double[][] boundingBox = {{-179d, -89d}, {179d, 89d}};
              FilterQuery filterQuery = new FilterQuery().locations(boundingBox);
              twitterStream.filter(filterQuery);
          }

          @Override
          public void nextTuple() {
              String msg = queue.poll();
              if (msg == null) {
                  return;
              }
              String lat = MiscUtils.getLatFromMsg(msg);
              String lon = MiscUtils.getLonFromMsg(msg);
              String tweet = MiscUtils.getTweetFromMsg(msg);
              spoutOutputCollector.emit(new Values(tweet, lat, lon));
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
              outputFieldsDeclarer.declare(new Fields("tweet", "lat", "lon"));
          }
      }
  • 11. Twitter top-n hashtags: NoHashtagDropperBolt
      public class NoHashtagDropperBolt extends BaseBasicBolt {

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("tweet", "lat", "lon"));
          }

          @Override
          public void execute(Tuple tuple, BasicOutputCollector collector) {
              Set<String> hashtags = MiscUtils.getHashtags(tuple.getString(0));
              if (hashtags.size() == 0) {
                  return;
              }
              String tweet = tuple.getString(0);
              String lat = tuple.getString(1);
              String lon = tuple.getString(2);
              collector.emit(new Values(tweet, lat, lon));
          }
      }
  • 12. Twitter top-n hashtags: GeoHashtagsFilterBolt
      public class GeoHashtagsFilterBolt extends BaseBasicBolt {
          private Rankings rankings;

          @Override
          public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
              // field order matches the Values emitted in execute()
              outputFieldsDeclarer.declare(new Fields("lat", "lon", "hashtag", "tweet"));
          }

          @Override
          public void execute(Tuple tuple, BasicOutputCollector collector) {
              String componentId = tuple.getSourceComponent();
              // tuples from the rankings bolt only refresh the current top-N list
              if ("total-rankings".equals(componentId)) {
                  rankings = (Rankings) tuple.getValue(0);
                  return;
              }
              if (rankings == null) return;

              // forward the tweet if it contains one of the ranked hashtags
              String tweet = tuple.getString(0);
              for (String hashtag : MiscUtils.getHashtags(tweet)) {
                  for (Rankable r : rankings.getRankings()) {
                      String rankedHashtag = r.getObject().toString();
                      if (hashtag.equals(rankedHashtag)) {
                          String lat = tuple.getString(1);
                          String lon = tuple.getString(2);
                          collector.emit(new Values(lat, lon, hashtag, tweet));
                          return;
                      }
                  }
              }
          }
      }
  • 13. Twitter top-n hashtags: ToRedisTweetBolt
      public class ToRedisTweetBolt extends BaseBasicBolt {
          private RedisConnection<String, String> redis;

          @Override
          public void prepare(Map stormConf, TopologyContext context) {
              super.prepare(stormConf, context);
              RedisClient client = new RedisClient("localhost", 6379);
              redis = client.connect();
          }

          @Override
          public void execute(Tuple tuple, BasicOutputCollector collector) {
              // gets the coordinates, the matching hashtag and the tweet
              String lat = tuple.getString(0);
              String lon = tuple.getString(1);
              String hashtag = tuple.getString(2);
              String tweet = tuple.getString(3);
              String message = "1|" + lat + "|" + lon + "|" + hashtag + "|" + tweet;
              redis.publish("tophashtagsmap", message);
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) { }
      }
  • 14. Twitter top-n hashtags: topology
      public class TopHashtagMapTopology {
          private static int n = 20;

          public static void main(String[] args) {
              GeoTweetSpout geoTweetSpout = new GeoTweetSpout();
              TopologyBuilder builder = new TopologyBuilder();

              builder.setSpout("geo-tweet-spout", geoTweetSpout, 4);
              builder.setBolt("no-ht-dropper", new NoHashtagDropperBolt(), 4)
                      .shuffleGrouping("geo-tweet-spout");
              builder.setBolt("parse-twt", new ParseTweetBolt(), 4)
                      .shuffleGrouping("no-ht-dropper");
              builder.setBolt("count-ht", new CountHashtagsBolt(), 4)
                      .fieldsGrouping("parse-twt", new Fields("hashtag"));
              builder.setBolt("inter-rankings", new IntermediateRankingsBolt(n), 4)
                      .fieldsGrouping("count-ht", new Fields("hashtag"));
              builder.setBolt("total-rankings", new TotalRankingsBolt(n), 1)
                      .globalGrouping("inter-rankings");
              builder.setBolt("to-redis-ht", new ToRedisTopHashtagsBolt(), 1)
                      .shuffleGrouping("total-rankings");
              builder.setBolt("geo-hashtag-filter", new GeoHashtagsFilterBolt(), 4)
                      .shuffleGrouping("no-ht-dropper")
                      .allGrouping("total-rankings");
              builder.setBolt("to-redis-tweets", new ToRedisTweetBolt(), 4)
                      .globalGrouping("geo-hashtag-filter");

              // code to start the topology...
          }
      }
  • 15. Twitter top-n hashtags
  • 16. Storm Cluster
      ● Nimbus: a daemon responsible for distributing code around the cluster, assigning jobs to nodes, and monitoring for failures.
      ● Worker node: executes a subset of a topology (spouts and/or bolts). It runs a supervisor daemon that listens for jobs assigned to the machine and starts and stops worker processes as necessary.
      ● ZooKeeper: manages all the coordination between Nimbus and the supervisors.
  • 17. Worker Node
      ● Worker process: JVM (processes a specific topology)
      ● Executor: thread
      ● Task: instance of bolt/spout
      ● Supervisor: syncing with the master node
      The number of executors can be modified at runtime; the topology structure cannot.
  • 18. Tuples transfer Storm supports two different types of transfer:
      ● on the same JVM
      ● on different JVMs
      For serialization, Storm tries to look up a Kryo serializer, which is more efficient than standard Java serialization. The network layer for transport is provided by Netty. Also for performance reasons, the queues are implemented using the LMAX Disruptor library, which enables efficient queuing.
  • 19. Tuples transfer: on the same JVM A generic task is composed of two threads and two queues. Tasks at the start (spouts) or at the end of the topology (ending bolts) have only one queue.
  • 20. Tuples transfer: on different JVMs
  • 21. Queues failure Since the model behind the queue is producer/consumer, if the producer supplies data at a higher rate than the consumer can handle, the queue will overflow. The transfer queue is more critical because it has to serve all the tasks of the worker, so it's stressed more than the internal one. If an overflow happens, Storm tries, without guarantee, to put the overflowing tuples into a temporary queue, with the side effect of lowering the throughput of the topology.
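The overflow condition described here can be reproduced with any bounded queue. The sketch below uses java.util.concurrent.ArrayBlockingQueue as a stand-in (Storm itself uses the LMAX Disruptor, as noted earlier); the class QueueOverflowSketch and its parameters are hypothetical and only illustrate a producer that outpaces its consumer.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical model: one item is produced per tick, one item is consumed
// every `consumeEvery` ticks. Items offered to a full queue are dropped.
final class QueueOverflowSketch {

    static int droppedItems(int capacity, int ticks, int consumeEvery) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int dropped = 0;
        for (int tick = 1; tick <= ticks; tick++) {
            if (!queue.offer(tick)) {       // offer() returns false when the queue is full
                dropped++;
            }
            if (tick % consumeEvery == 0) { // the slower consumer takes one item
                queue.poll();
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        System.out.println("consumer keeps up: " + droppedItems(2, 10, 1) + " dropped");
        System.out.println("consumer too slow: " + droppedItems(2, 10, 2) + " dropped");
    }
}
```

A consumer that keeps pace never fills the queue, while a consumer running at half the producer's rate starts losing items as soon as the capacity is reached.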
  • 22. Reliability Levels of delivery guarantee:
      ● at-most-once: tuples are processed in the order coming from the spouts and, in case of failure (network, exceptions), are simply dropped
      ● at-least-once: in case of failure, tuples are re-emitted from the spout; a tuple can be processed more than once and tuples can arrive out of order
      ● exactly-once: only available with Trident, a layer sitting on top of Storm that lets you write topologies with different semantics
  • 23. Reliability for bolts The three main concepts for achieving the at-least-once guarantee level are:
      ● anchoring: every tuple emitted by a bolt has to be linked to the input tuple using the emit(tuple, values) method
      ● acking: when a bolt successfully finishes processing a tuple in execute(), it has to call the ack() method to notify Storm
      ● failing: when a bolt encounters a problem with the incoming tuple, it has to call the fail() method
      The BaseBasicBolt we saw before takes care of them automatically (to fail a tuple, throw a FailedException). When the topology is complex (expanding tuples, collapsing tuples, joining streams), they must be managed explicitly by extending BaseRichBolt.
  • 24. Reliability for spouts The ISpout interface defines, among others, these methods:
      void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
      void close();
      void nextTuple();
      void ack(Object msgId);
      void fail(Object msgId);
      To implement a reliable spout we have to call, inside the nextTuple() method:
      collector.emit(values, msgId);
      and we have to manage the ack() and fail() methods accordingly.
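One common way to "manage ack() and fail() accordingly" is to remember every emitted message until it is acked and to re-queue it when it fails. The following plain-Java model mirrors the three ISpout methods involved without depending on Storm; the class ReliableSpoutSketch, its fields and the choice of a long counter as msgId are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical model of the bookkeeping inside a reliable spout.
final class ReliableSpoutSketch {
    private final Queue<String> source = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    private long nextMsgId = 0;

    ReliableSpoutSketch(List<String> messages) {
        source.addAll(messages);
    }

    // Mirrors nextTuple(): "emit" one message with a msgId and remember it
    // until it is acked. Returns the msgId, or -1 when nothing is available.
    long nextTuple() {
        String msg = source.poll();
        if (msg == null) {
            return -1;
        }
        long msgId = nextMsgId++;
        pending.put(msgId, msg);
        return msgId;
    }

    // Mirrors ack(msgId): the tuple tree completed, so forget the message.
    void ack(long msgId) {
        pending.remove(msgId);
    }

    // Mirrors fail(msgId): put the message back so a later nextTuple() replays it.
    void fail(long msgId) {
        String msg = pending.remove(msgId);
        if (msg != null) {
            source.add(msg);
        }
    }

    int pendingCount() { return pending.size(); }

    int queuedCount() { return source.size(); }
}
```

In a real spout the same bookkeeping would sit behind collector.emit(values, msgId), with Storm calling ack() or fail() with that msgId.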
  • 25. Reliability A tuple tree is the set of all the additional tuples emitted by the subsequent bolts, starting from the tuple emitted by a spout. When all the tuples of a tree are marked as processed, Storm considers the initial tuple from the spout correctly processed. If any tuple of the tree is not marked as processed within a timeout (30 seconds by default) or is explicitly marked as failed, Storm replays the tuple starting from the spout (this means that the operations performed by a task have to be idempotent).
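Why replay requires idempotent operations can be seen with a tiny example. The class IdempotentSinkSketch below is hypothetical: it contrasts an increment, which gives a different result when the same tuple is processed twice, with an absolute write, which does not.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sink used to compare a non-idempotent and an idempotent update.
final class IdempotentSinkSketch {
    private final Map<String, Integer> store = new HashMap<>();

    // Not idempotent: a replayed tuple is counted a second time.
    void increment(String key) {
        store.merge(key, 1, Integer::sum);
    }

    // Idempotent: writing the same (key, value) again leaves the store unchanged.
    void put(String key, int value) {
        store.put(key, value);
    }

    int get(String key) {
        return store.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        IdempotentSinkSketch sink = new IdempotentSinkSketch();
        sink.increment("#storm");
        sink.increment("#storm");   // replay of the same tuple inflates the count
        sink.put("#java", 7);
        sink.put("#java", 7);       // replay of the same tuple changes nothing
        System.out.println(sink.get("#storm") + " " + sink.get("#java"));
    }
}
```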
  • 26. Reliability – Step 1 This is the starting state: nothing has happened yet. To better understand the following slides, it's important to review this property of bitwise XOR:
      1 ^ 1 == 0
      1 ^ 2 ^ 1 == 2
      1 ^ 2 ^ 2 ^ 1 == 0
      1 ^ 2 ^ 3 ^ 2 ^ 1 == 3
      1 ^ 2 ^ 3 ^ 3 ^ 2 ^ 1 == 0
      1 ^ 2 ^ 3 ^ ... ^ N ^ ... ^ 3 ^ 2 ^ 1 == N
      1 ^ 2 ^ 3 ^ ... ^ N ^ N ^ ... ^ 3 ^ 2 ^ 1 == 0
      (whatever the order of the operands is)
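The property is easy to check in Java, whose ^ operator on long values is the same bitwise XOR. The helper xorAll and the class name below are hypothetical and exist only for this check.

```java
// Hypothetical helper to verify that equal values cancel out under XOR.
final class XorPropertySketch {

    // XOR all values together, in the order given.
    static long xorAll(long... values) {
        long acc = 0;
        for (long v : values) {
            acc ^= v;
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(xorAll(1, 2, 3, 2, 1));     // only 3 is unpaired
        System.out.println(xorAll(1, 2, 3, 3, 2, 1));  // every value is paired
        System.out.println(xorAll(3, 1, 2, 1, 3, 2));  // same pairs, different order
    }
}
```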
  • 27. Reliability – Step 2 The spout has received something, so it sends a pair of values to its acker task:
      ● the tuple ID of the tuple to emit to the bolt
      ● its task ID
      The acker puts this data in a map <TupleID, [TaskID, AckVal]>, where the AckVal is initially set to the TupleID value. The ID values are of type long, so they're 64 bits wide.
  • 28. Reliability – Step 3 After having notified the acker that a new tuple was created, the spout sends the tuple to its attached bolt (Bolt1).
  • 29. Reliability – Step 4 Bolt1 computes the outgoing tuple according to its business logic and notifies the acker that it's going to emit a new tuple. The acker gets the ID of the new tuple and XORs it with the AckVal (which contained the initial tuple ID).
  • 30. Reliability – Step 5 Bolt1 sends the new tuple to its attached bolt (Bolt2).
  • 31. Reliability – Step 6 After emitting the new tuple, Bolt1 marks its work on the incoming tuple (Tuple0) as finished and sends an ack to the acker to notify it. The acker gets the tuple ID and XORs it with the AckVal.
  • 32. Reliability – Step 7 Bolt2 processes the incoming tuple according to its business logic and will probably write some data to a DB, a queue or somewhere else. Since it's a terminal bolt, it will not emit a new tuple. Since its job is done, it can send an ack to the acker for the incoming tuple. The acker gets the tuple ID and XORs it with the AckVal. The value of AckVal is now 0, so the acker knows that the starting tuple has been successfully processed by the topology.
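Steps 2 to 7 can be replayed with a small model of the acker's bookkeeping. This is a sketch that follows the slides' description, not Storm's actual acker: the class AckerSketch is hypothetical, the TaskID part of the map entry is left out, and the tuple IDs in main are arbitrary values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the acker: one running XOR (AckVal) per root tuple.
final class AckerSketch {
    private final Map<Long, Long> ackVals = new HashMap<>();

    // Step 2: the spout registers a new root tuple; AckVal starts as its ID.
    void init(long rootId) {
        ackVals.put(rootId, rootId);
    }

    // Steps 4, 6 and 7: a bolt reports a tuple ID, either newly emitted or acked.
    // Returns true when AckVal reaches 0, i.e. the whole tuple tree is processed.
    boolean update(long rootId, long tupleId) {
        Long current = ackVals.get(rootId);
        if (current == null) {
            throw new IllegalArgumentException("unknown root tuple: " + rootId);
        }
        long updated = current ^ tupleId;
        if (updated == 0) {
            ackVals.remove(rootId);
            return true;
        }
        ackVals.put(rootId, updated);
        return false;
    }

    public static void main(String[] args) {
        long tuple0 = 0x1234L;  // emitted by the spout
        long tuple1 = 0x9abcL;  // emitted by Bolt1
        AckerSketch acker = new AckerSketch();
        acker.init(tuple0);                                 // step 2
        System.out.println(acker.update(tuple0, tuple1));   // step 4: Bolt1 announces tuple1
        System.out.println(acker.update(tuple0, tuple0));   // step 6: Bolt1 acks tuple0
        System.out.println(acker.update(tuple0, tuple1));   // step 7: Bolt2 acks tuple1
    }
}
```

Each tuple ID is XORed in exactly twice, once when announced and once when acked, so the AckVal returns to 0 only when nothing in the tree is outstanding, whatever the order in which the reports arrive.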
  • 33. Questions & Answers
  • 34. Thanks! The code is available on: https://github.com/andreaiacono/TalkStorm