Real time Twitter analytics with Apache Storm

Real time and reliable processing with Apache Storm
The code is available on:
https://github.com/andreaiacono/StormTalk

What is Apache Storm?
Storm is a real-time distributed computing framework for
reliably processing unbounded data streams.
It was created by Nathan Marz and his team at BackType,
and released as open source in 2011 (after BackType was
acquired by Twitter).

Topology
A spout is the source of a data stream; it emits data to one or more bolts.
Each emitted item is called a tuple, which is an ordered list of values.
A bolt performs computation on the tuples it receives and emits the results to
one or more bolts. A bolt at the end of the topology doesn't emit anything.
Every task (either a spout or a bolt) can have multiple instances.
A topology is a directed acyclic graph of computation formed by spouts and bolts.

A simple topology
We'd like to build a system that generates random numbers and writes them
to a file.
Here is a topology that represents it:

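// Package names are for Storm 1.0+; Storm 0.9.x uses backtype.storm.* instead.
import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class RandomSpout extends BaseRichSpout {
    private SpoutOutputCollector spoutOutputCollector;
    private Random random;

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        // every emitted tuple has a single field named "val"
        outputFieldsDeclarer.declare(new Fields("val"));
    }

    @Override
    public void open(Map map, TopologyContext topologyContext,
                     SpoutOutputCollector spoutOutputCollector) {
        this.spoutOutputCollector = spoutOutputCollector;
        random = new Random();
    }

    @Override
    public void nextTuple() {
        // nextInt() can be negative, so the emitted values range from -99 to 99
        spoutOutputCollector.emit(new Values(random.nextInt() % 100));
    }
}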
A simple topology: the spout

// minimal error handling: it's a sample!
public class FileWriteBolt extends BaseBasicBolt {
    private final String filename = "output.txt";
    private BufferedWriter writer;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        super.prepare(stormConf, context);
        try {
            writer = new BufferedWriter(new FileWriter(filename, true));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        try {
            writer.write(input.getInteger(0) + "\n");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {}

    @Override
    public void cleanup() {
        try {
            writer.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
A simple topology: the bolt

public class RandomValuesTopology {
    private static final String name = RandomValuesTopology.class.getName();

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("random-spout", new RandomSpout());
        builder.setBolt("writer-bolt", new FileWriteBolt())
               .shuffleGrouping("random-spout");

        Config conf = new Config();
        conf.setDebug(false);
        conf.setMaxTaskParallelism(3);

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology(name, conf, builder.createTopology());
        Utils.sleep(300_000);
        cluster.killTopology(name);
        cluster.shutdown();

        // to run it on a live cluster
        // StormSubmitter.submitTopology("topology", conf, builder.createTopology());
    }
}
A simple topology: the topology

Grouping
The path tuples take from one component to the next is determined by the
grouping. Since a bolt can have multiple instances, we have to decide which
instance receives each emitted tuple.
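
As a rough illustration of what a grouping decides, the sketch below picks a target task index for a tuple in two ways: at random (in the spirit of the shuffle grouping) and by hashing a field value (in the spirit of the fields grouping). The class and method names are hypothetical, and the hashing is a simplification rather than Storm's actual implementation.

```java
import java.util.Random;

class GroupingSketch {
    private static final Random RANDOM = new Random();

    // Shuffle-style: any of the target tasks may receive the tuple.
    static int shuffleTask(int numTasks) {
        return RANDOM.nextInt(numTasks);
    }

    // Fields-style: the same field value always maps to the same task.
    static int fieldsTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String hashtag : new String[] {"#storm", "#java", "#storm"}) {
            System.out.println(hashtag
                    + " shuffle -> task " + shuffleTask(numTasks)
                    + ", fields -> task " + fieldsTask(hashtag, numTasks));
        }
    }
}
```

With the fields-style choice, both occurrences of "#storm" land on the same task, which is what makes per-key counting possible; the shuffle-style choice only balances load.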

We want to create a web page that shows the top-N hashtags and, every
time a new tweet containing one of them arrives, displays it
on a world map.
Twitter top-n hashtags: overview
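
To make the goal concrete, here is a minimal single-process sketch of "top-N hashtags": count occurrences and keep the N most frequent. It uses only the Java standard library; the class name, the method, and the alphabetical tie-breaking are assumptions for illustration and are not taken from the topology's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopHashtagsSketch {
    // Returns the n most frequent hashtags, most frequent first.
    static List<String> topN(List<String> hashtags, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String hashtag : hashtags) {
            counts.merge(hashtag, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> seen = List.of("#a", "#b", "#a", "#c", "#b", "#a");
        System.out.println(topN(seen, 2)); // prints [#a, #b]
    }
}
```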

public class GeoTweetSpout extends BaseRichSpout {
    SpoutOutputCollector spoutOutputCollector;
    TwitterStream twitterStream;
    LinkedBlockingQueue<String> queue = null;

    @Override
    public void open(Map map, TopologyContext topologyContext,
                     SpoutOutputCollector spoutOutputCollector) {
        this.spoutOutputCollector = spoutOutputCollector;
        queue = new LinkedBlockingQueue<>(1000);

        // custkey, custsecret, accesstoken, accesssecret: Twitter credentials (not shown)
        ConfigurationBuilder config = new ConfigurationBuilder()
                .setOAuthConsumerKey(custkey)
                .setOAuthConsumerSecret(custsecret)
                .setOAuthAccessToken(accesstoken)
                .setOAuthAccessTokenSecret(accesssecret);
        TwitterStreamFactory streamFactory = new TwitterStreamFactory(config.build());
        twitterStream = streamFactory.getInstance();
        twitterStream.addListener(new GeoTwitterListener(queue));

        // only geotagged tweets, from (almost) the whole world
        double[][] boundingBox = {{-179d, -89d}, {179d, 89d}};
        FilterQuery filterQuery = new FilterQuery().locations(boundingBox);
        twitterStream.filter(filterQuery);
    }

    @Override
    public void nextTuple() {
        // non-blocking: returns immediately when no tweet is waiting
        String msg = queue.poll();
        if (msg == null) {
            return;
        }
        String lat = MiscUtils.getLatFromMsg(msg);
        String lon = MiscUtils.getLonFromMsg(msg);
        String tweet = MiscUtils.getTweetFromMsg(msg);
        spoutOutputCollector.emit(new Values(tweet, lat, lon));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("tweet", "lat", "lon"));
    }
}
Twitter top-n hashtags: GeoTweetSpout

public class NoHashtagDropperBolt extends BaseBasicBolt {
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweet", "lat", "lon"));
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        Set<String> hashtags = MiscUtils.getHashtags(tuple.getString(0));
        if (hashtags.isEmpty()) {
            return;
        }
        String tweet = tuple.getString(0);
        String lat = tuple.getString(1);
        String lon = tuple.getString(2);
        collector.emit(new Values(tweet, lat, lon));
    }
}
Twitter top-n hashtags: NoHashtagDropperBolt

Twitter top-n hashtags: GeoHashtagFilterBolt
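public class GeoHashtagsFilterBolt extends BaseBasicBolt {
    private Rankings rankings;

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        // field order matches the Values emitted in execute()
        outputFieldsDeclarer.declare(new Fields("lat", "lon", "hashtag", "tweet"));
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String componentId = tuple.getSourceComponent();
        // tuples from total-rankings carry the current top-N rankings
        if ("total-rankings".equals(componentId)) {
            rankings = (Rankings) tuple.getValue(0);
            return;
        }
        if (rankings == null) return;

        // all other tuples come from no-ht-dropper: (tweet, lat, lon)
        String tweet = tuple.getString(0);
        String lat = tuple.getString(1);
        String lon = tuple.getString(2);
        for (String hashtag : MiscUtils.getHashtags(tweet)) {
            for (Rankable r : rankings.getRankings()) {
                String rankedHashtag = r.getObject().toString();
                if (hashtag.equals(rankedHashtag)) {
                    collector.emit(new Values(lat, lon, hashtag, tweet));
                    return;
                }
            }
        }
    }
}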

public class ToRedisTweetBolt extends BaseBasicBolt {
    private RedisConnection<String, String> redis;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        super.prepare(stormConf, context);
        RedisClient client = new RedisClient("localhost", 6379);
        redis = client.connect();
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // input from geo-hashtag-filter: (lat, lon, hashtag, tweet)
        String lat = tuple.getString(0);
        String lon = tuple.getString(1);
        String hashtag = tuple.getString(2);
        String tweet = tuple.getString(3);
        String message = "1|" + lat + "|" + lon + "|" + hashtag + "|" + tweet;
        redis.publish("tophashtagsmap", message);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }
}
Twitter top-n hashtags: ToRedisTweetBolt

public class TopHashtagMapTopology {
    private static int n = 20;

    public static void main(String[] args) {
        GeoTweetSpout geoTweetSpout = new GeoTweetSpout();
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("geo-tweet-spout", geoTweetSpout, 4);
        builder.setBolt("no-ht-dropper", new NoHashtagDropperBolt(), 4)
               .shuffleGrouping("geo-tweet-spout");
        builder.setBolt("parse-twt", new ParseTweetBolt(), 4)
               .shuffleGrouping("no-ht-dropper");
        builder.setBolt("count-ht", new CountHashtagsBolt(), 4)
               .fieldsGrouping("parse-twt", new Fields("hashtag"));
        builder.setBolt("inter-rankings", new IntermediateRankingsBolt(n), 4)
               .fieldsGrouping("count-ht", new Fields("hashtag"));
        builder.setBolt("total-rankings", new TotalRankingsBolt(n), 1)
               .globalGrouping("inter-rankings");
        builder.setBolt("to-redis-ht", new ToRedisTopHashtagsBolt(), 1)
               .shuffleGrouping("total-rankings");
        builder.setBolt("geo-hashtag-filter", new GeoHashtagsFilterBolt(), 4)
               .shuffleGrouping("no-ht-dropper")
               .allGrouping("total-rankings");
        builder.setBolt("to-redis-tweets", new ToRedisTweetBolt(), 4)
               .globalGrouping("geo-hashtag-filter");
        // code to start the topology...
    }
}
Twitter top-n hashtags: topology

Twitter top-n hashtags

Storm Cluster
Nimbus: a daemon responsible for distributing code around the cluster,
assigning jobs to nodes, and monitoring for failures.
Worker node: executes a subset of a topology (spouts and/or bolts). It runs a
supervisor daemon that listens for jobs assigned to the machine and starts and
stops worker processes as necessary.
ZooKeeper: manages all the coordination between Nimbus and the supervisors.

Worker Node
Worker process: JVM (processes a specific topology)
Executor: Thread
Task: instance of bolt/spout
Supervisor: syncing with Master Node
The number of executors can be modified at runtime;
the topology structure cannot.

Tuples transfer
Storm supports two different types of transfer:
● on the same JVM
● on different JVMs
For serialization, Storm tries to look up a Kryo serializer, which is
more efficient than standard Java serialization.
The network layer for transport is provided by Netty.
Also for performance reasons, the queues are implemented using
the LMAX Disruptor library, which enables efficient queuing.

Tuples transfer: on the same JVM
A generic task is composed of two threads and two queues.
Tasks at the start (spouts) or at the end of the topology (ending
bolts) have only one queue.

Tuples transfer: on different JVMs

Queues failure
Since the model behind the queue is producer/consumer, if the
producer supplies data at a higher rate than the consumer can handle, the
queue will overflow.
The transfer queue is more critical because it has to serve all the
tasks of the worker, so it's stressed more than the internal one.
If an overflow happens, Storm tries, without guaranteeing it, to put
the overflowing tuples into a temporary queue, with the side effect
of lowering the throughput of the topology.
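
The overflow scenario can be reproduced in miniature with a bounded queue from the Java standard library. This sketch uses ArrayBlockingQueue rather than the LMAX Disruptor, and the capacity and rates are arbitrary; it only shows that a producer outpacing its consumer eventually finds the queue full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueOverflowSketch {
    // Offers `produced` items to a queue of the given capacity while the
    // consumer removes one item for every `consumeEvery` items produced.
    // Returns how many items were rejected because the queue was full.
    static int countRejected(int capacity, int produced, int consumeEvery) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int rejected = 0;
        for (int i = 1; i <= produced; i++) {
            if (!queue.offer(i)) {      // offer() returns false when the queue is full
                rejected++;
            }
            if (i % consumeEvery == 0) {
                queue.poll();           // the (slower) consumer takes one item
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("consumer keeps up:  " + countRejected(10, 100, 1));
        System.out.println("consumer too slow:  " + countRejected(10, 100, 2));
    }
}
```

When the consumer keeps pace nothing is rejected; when it consumes at half the producer's rate, the queue fills up and further offers start to fail.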

Reliability
Levels of delivery guarantee
● at-most-once: tuples are processed in the order in which they come from the
spouts and, in case of failure (network, exceptions), are simply dropped
● at-least-once: in case of failure, tuples are re-emitted from the spout; a
tuple can be processed more than once, and tuples can arrive out of order
● exactly-once: only available with Trident, a layer sitting on top of Storm
that lets you write topologies with different semantics

Reliability for bolts
The three main concepts for achieving the at-least-once guarantee level are:
● anchoring: every tuple emitted by a bolt has to be linked to the
input tuple using the emit(tuple, values) method
● acking: when a bolt successfully finishes executing a tuple, it
has to call the ack() method to notify Storm
● failing: when a bolt encounters a problem with the incoming tuple, it
has to call the fail() method
The BaseBasicBolt we saw before takes care of these automatically
(to fail a tuple, a FailedException must be thrown).
When the topology is complex (expanding tuples, collapsing tuples,
joining streams), they must be managed explicitly by extending
BaseRichBolt.

Reliability for spouts
The ISpout interface defines, among others, these methods:
void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
void close();
void nextTuple();
void ack(Object msgId);
void fail(Object msgId);
To implement a reliable spout, nextTuple() has to emit each tuple with a message ID:
collector.emit(values, msgId);
and the ack() and fail() methods have to be implemented accordingly.
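
One common way to handle ack() and fail() is to remember every emitted message until it is acknowledged. The sketch below shows only that bookkeeping with plain Java collections; it does not use the Storm API, and the class name, the long counter used as msgId, and the replay-on-fail policy are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class PendingMessages {
    private final Queue<String> toEmit = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>();
    private long nextId = 0;

    void enqueue(String message) {
        toEmit.add(message);
    }

    // Stands in for the point where nextTuple() would emit with a msgId.
    // Returns the msgId used, or -1 if there was nothing to emit.
    long emitNext() {
        String message = toEmit.poll();
        if (message == null) {
            return -1;
        }
        long msgId = nextId++;
        pending.put(msgId, message);   // remember it until acked or failed
        return msgId;
    }

    // ack(msgId): the message was fully processed, so forget it.
    void ack(long msgId) {
        pending.remove(msgId);
    }

    // fail(msgId): queue the message again so a later emit replays it.
    void fail(long msgId) {
        String message = pending.remove(msgId);
        if (message != null) {
            toEmit.add(message);
        }
    }

    int pendingCount() { return pending.size(); }
    int queuedCount()  { return toEmit.size(); }

    public static void main(String[] args) {
        PendingMessages spout = new PendingMessages();
        spout.enqueue("tweet-a");
        spout.enqueue("tweet-b");
        long a = spout.emitNext();
        long b = spout.emitNext();
        spout.ack(a);    // tweet-a is done
        spout.fail(b);   // tweet-b goes back to the queue
        System.out.println("pending=" + spout.pendingCount()
                + " queued=" + spout.queuedCount()); // prints pending=0 queued=1
    }
}
```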

Reliability
A tuple tree is the set of all the additional tuples emitted by the
subsequent bolts starting from the tuple emitted by a spout.
When all the tuples of a tree are marked as processed, Storm will
consider the initial tuple from a spout correctly processed.
If any tuple of the tree is not marked as processed within a timeout
(30 seconds by default) or is explicitly marked as failed, Storm will replay
the tuple starting from the spout (this means that the operations performed
by a task have to be idempotent).
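
Because a replay sends the same data through the bolts again, a non-idempotent update is applied twice. The sketch below contrasts a plain counter with one that remembers which message IDs it has already applied; the class and the idea of handing a message ID to the processing logic are assumptions for illustration, not something Storm provides.

```java
import java.util.HashSet;
import java.util.Set;

class IdempotentCounter {
    private final Set<Long> seen = new HashSet<>();
    private long naiveCount = 0;
    private long idempotentCount = 0;

    void process(long messageId) {
        naiveCount++;                  // counts a replayed message again
        if (seen.add(messageId)) {     // add() returns false for a repeated ID
            idempotentCount++;
        }
    }

    long naiveCount()      { return naiveCount; }
    long idempotentCount() { return idempotentCount; }

    public static void main(String[] args) {
        IdempotentCounter counter = new IdempotentCounter();
        counter.process(1L);
        counter.process(2L);
        counter.process(2L);           // replay of message 2
        System.out.println("naive=" + counter.naiveCount()
                + " idempotent=" + counter.idempotentCount()); // prints naive=3 idempotent=2
    }
}
```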

Reliability – Step 1
This is the starting state: nothing has happened yet.
To better understand the following slides, it's important to review
this property of binary XOR:
1 ^ 1 == 0
1 ^ 2 ^ 1 == 2
1 ^ 2 ^ 2 ^ 1 == 0
1 ^ 2 ^ 3 ^ 2 ^ 1 == 3
1 ^ 2 ^ 3 ^ 3 ^ 2 ^ 1 == 0
1 ^ 2 ^ 3 ^ ... ^ N ^ ... ^ 3 ^ 2 ^ 1 == N
1 ^ 2 ^ 3 ^ ... ^ N ^ N ^ ... ^ 3 ^ 2 ^ 1 == 0
(whatever the order of the operands is)
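
The property is easy to check in Java: XOR-ing a list in which every value appears exactly twice gives 0 regardless of order, and a single unpaired value is what remains otherwise. The class name and the sample values below are arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class XorPropertySketch {
    // XORs all values together, starting from 0.
    static long xorAll(List<Long> values) {
        long acc = 0L;
        for (long value : values) {
            acc ^= value;
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>(Arrays.asList(1L, 2L, 3L, 3L, 2L, 1L));
        Collections.shuffle(ids);            // the order does not matter
        System.out.println(xorAll(ids));     // prints 0: every value is paired
        ids.add(42L);
        System.out.println(xorAll(ids));     // prints 42: the only unpaired value
    }
}
```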

The spout has received something, so it sends a pair of values to its
acker task:
● the tuple ID of the tuple to emit to the bolt
● its task ID
The acker puts this data in a map <TupleID, [TaskID, AckVal]>,
where AckVal is initially set to the TupleID value.
The ID values are of type long, so they are 64 bits wide.

After notifying the acker that a new tuple was created,
the spout sends the tuple to its attached bolt (Bolt1).

Bolt1 computes the outgoing tuple according to its business
logic and notifies the acker that it's going to emit a new tuple.
The acker gets the ID of the new tuple and XORs it with the
AckVal (that contained the initial tuple ID).

Bolt1 sends the new tuple to its attached bolt (Bolt2).

After emitting the new tuple, Bolt1 has finished its work on
the incoming tuple (Tuple0), so it sends an ack to the acker to
notify it.
The acker gets the tuple ID and XORs it with the AckVal.

Bolt2 processes the incoming tuple according to its business logic
and will probably write some data to a DB, to a queue or
somewhere else. Since it's a terminal bolt, it will not emit a new
tuple. Since its job is done, it can send an ack to the acker for the
incoming tuple.
The acker gets the tuple ID and XORs it with the AckVal. The value
of AckVal will be 0, so the acker knows that the starting tuple has
been successfully processed by the topology.
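
The whole walk-through can be replayed with a toy acker. This sketch is not Storm's acker implementation: the class, its method names, and the fixed tuple IDs are hypothetical (Storm uses random 64-bit IDs), and it keeps only the <TupleID, [TaskID, AckVal]> map described above.

```java
import java.util.HashMap;
import java.util.Map;

class ToyAcker {
    // root tuple ID -> {spout task ID, AckVal}
    private final Map<Long, long[]> entries = new HashMap<>();

    // The spout registers a new root tuple: AckVal starts as the tuple ID.
    void init(long rootId, long spoutTaskId) {
        entries.put(rootId, new long[] {spoutTaskId, rootId});
    }

    // Called both when a bolt announces a new tuple and when it acks one:
    // in either case the tuple ID is XORed into AckVal.
    // Returns true when AckVal reaches 0, i.e. the tree is fully processed.
    boolean update(long rootId, long tupleId) {
        long[] entry = entries.get(rootId);
        if (entry == null) {
            return false;               // unknown or already completed root
        }
        entry[1] ^= tupleId;
        if (entry[1] == 0L) {
            entries.remove(rootId);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ToyAcker acker = new ToyAcker();
        long tuple0 = 0x1234L;          // emitted by the spout
        long tuple1 = 0xBEEFL;          // emitted by Bolt1
        acker.init(tuple0, 7L);
        System.out.println(acker.update(tuple0, tuple1)); // Bolt1 emits the new tuple: false
        System.out.println(acker.update(tuple0, tuple0)); // Bolt1 acks Tuple0: false
        System.out.println(acker.update(tuple0, tuple1)); // Bolt2 acks the new tuple: true
    }
}
```

Every tuple ID enters AckVal exactly twice (once when created, once when acked), which is why the value returns to 0 only when the whole tree has been processed.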

Questions & Answers

Thanks!
The code is available on:
https://github.com/andreaiacono/TalkStorm
