SlideShare a Scribd company logo
1 of 23
Farzad Nozarian
5/9/15 @AUT
Purpose
• In this tutorial, you'll learn how to create Storm topologies and deploy
them to a Storm cluster.
• Java will be the main language used.
• This tutorial uses examples from the storm-starter project. It's
recommended that you clone the project and follow along with the
examples.
2
Components of a Storm cluster (1/2)
• Hadoop "MapReduce jobs“ vs. Storm "topologies“
• One key difference: MapReduce job eventually finishes, whereas a topology
processes messages forever (or until you kill it).
• There are two kinds of nodes on a Storm cluster:
• Master node: runs a daemon called "Nimbus“. (similar to Hadoop's "JobTracker")
• Distributing code around the cluster.
• Assigning tasks to machines.
• Monitoring for failures
• Worker nodes: runs a daemon called "Supervisor“. (similar to Hadoop's “TaskTracker")
• Listens for work assigned to its machine.
• Starts and stops worker processes.
3
Components of a Storm cluster (2/2)
• All coordination between Nimbus and the Supervisors is done through
a Zookeeper cluster
4
Topologies
• A topology is a graph of computation.
• Node in a topology contains processing logic.
• Links between nodes indicate how data should be passed around between
nodes.
• Running a topology is straightforward:
• The storm jar part takes care of connecting to Nimbus and uploading the jar.
• A topology runs forever, or until you kill it!
5
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
Streams (1/2)
• A stream is an unbounded sequence of tuples.
• Storm provides the primitives for transforming a stream into a new stream
in a distributed and reliable way.
• The basic primitives Storm provides for doing stream transformations are
spouts and bolts.
• Spouts and bolts have interfaces that you implement to run your
application-specific logic.
• A spout is a source of streams.
• For example, spout may connect to the Twitter API and emit a stream of
tweets.
• A bolt consumes any number of input streams, does some processing,
and possibly emits new streams.
6
Streams (2/2)
• Bolts can do anything:
• run functions,
• filter tuples,
• do streaming aggregations,
• do streaming joins,
• talk to databases.
• Networks of spouts and bolts are packaged into a "topology“.
• Edges in the graph indicate which bolts are subscribing to which streams.
• Each node in a Storm topology executes in parallel.
• You can specify how much parallelism you want for each node.
7
Data model
• Storm uses tuples as its data model.
• A tuple is a named list of values, and a field in a tuple can be an object of any
type.
• Storm supports all the primitive types, strings, and byte arrays as tuple field
values.
• To use an object of another type, you just need to implement a serializer for
the type.
• Every node in a topology must declare the output fields for the tuples it
emits:
8
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("double", "triple"));
}
A simple topology (1/8)
• Let's look at the ExclamationTopology definition from storm-starter:
• This topology contains a spout and two bolts.
• The spout emits words, and each bolt appends the string "!!!" to its input.
9
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("words", new TestWordSpout(), 10);
builder.setBolt("exclaim1", new ExclamationBolt(), 3)
.shuffleGrouping("words");
builder.setBolt("exclaim2", new ExclamationBolt(), 2)
.shuffleGrouping("exclaim1");
builder.setSpout("words", new TestWordSpout(), 10);
A simple topology (2/8)
• This code defines the nodes using the setSpout and setBolt methods.
• These methods take as input:
• A user-specified id,
• An object containing the processing logic,
• The amount of parallelism you want for the node.
• The spout is given id "words" and the bolts are given ids "exclaim1" and
"exclaim2"
• The object containing the processing logic implements
the IRichSpout interface for spouts and the IRichBolt interface for bolts.
• The last parameter, how much parallelism you want for the node, is optional.
• How many threads should execute that component across the cluster. Default,
Storm will only allocate one thread for that node.
10
A simple topology (3/8)
• setBolt returns an InputDeclarer object that is used to define the
inputs to the Bolt.
• "shuffle grouping" means that tuples should be randomly distributed from the
input tasks to the bolt's tasks.
• There are many ways to group data between components.
• Input declarations can be chained to specify multiple sources for the Bolt:
11
builder.setBolt("exclaim2", new ExclamationBolt(), 5)
.shuffleGrouping("words")
.shuffleGrouping("exclaim1");
A simple topology (4/8)
Spouts implementation
Let's dig into the implementations of the spouts and bolts in this topology.
• Spouts are responsible for emitting new messages into the topology.
• The implementation of nextTuple() in TestWordSpout looks like this:
12
public void nextTuple() {
Utils.sleep(100);
final String[] words = new String[] {"nathan", "mike", "jackson", "golda","bertels"};
final Random rand = new Random();
final String word = words[rand.nextInt(words.length)];
_collector.emit(new Values(word));
}
A simple topology (5/8)
Bolts implementation
13
public static class ExclamationBolt extends BaseRichBolt {
OutputCollector _collector;
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
_collector = collector;
}
public void execute(Tuple tuple) {
_collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
_collector.ack(tuple);
}
public void cleanup() {
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
public Map getComponentConfiguration() {
return null;
}
}
A simple topology (6/8)
Bolts implementation
• The prepare method provides the bolt with an OutputCollector that is
used for emitting tuples from this bolt.
• Tuples can be emitted at anytime from the bolt:
• in the prepare,
• in the execute,
• or cleanup methods,
• or even asynchronously in another thread.
14
public void prepare(Map conf, TopologyContext context, OutputCollector collector){
_collector = collector;
}
A simple topology (7/8)
Bolts implementation
• The execute method receives a tuple from one of the bolt's inputs.
• The ExclamationBolt grabs the first field from the tuple and emits a
new tuple with the string "!!!" appended to it.
• In case of multiple input sources, you can find out which component
the Tuple came from by using the Tuple#getSourceComponent method.
15
public void execute(Tuple tuple) {
_collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
_collector.ack(tuple);
}
A simple topology (8/8)
Bolts implementation
• The cleanup method is called when a Bolt is being shutdown and should
cleanup any resources that were opened.
• The declareOutputFields method declares that the ExclamationBolt
emits 1-tuples with one field called "word".
• The getComponentConfiguration method allows you to configure
various aspects of how this component runs.
16
public void cleanup() {}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
public Map getComponentConfiguration() { return null; }
Running ExclamationTopology in local mode (1/3)
• Storm has two modes of operation:
• Local mode
• Storm executes completely in process by simulating worker nodes with threads.
• Local mode is useful for testing and development of topologies.
• To create an in-process cluster, simply use the LocalCluster class.
• Distributed mode:
• Storm operates as a cluster of machines.
• Running topologies on a production cluster is similar to running in Local mode:
1. Define the topology (Use TopologyBuilder if defining using Java).
2. Use StormSubmitter to submit the topology to the cluster.
3. Create a jar containing your code and all the dependencies of your code.
4. Submit the topology to the cluster using the storm client.
17
Running ExclamationTopology in local mode (2/3)
Here's the code that runs ExclamationTopology in local mode:
• It submits a topology to the LocalCluster by calling submitTopology.
• This method takes as arguments a name for the running topology, a
configuration for the topology, and then the topology itself.
18
Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(2);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();
Running ExclamationTopology in local mode (3/3)
• The name is used to identify the topology so that you can kill it later on.
• The configuration is used to tune various aspects of the running topology.
• The two configurations specified here are very common:
• TOPOLOGY_WORKERS (set with setNumWorkers):
• Specifies how many processes you want allocated around the cluster to execute the
topology.
• Each component in the topology will execute as many threads.
• Those threads exist within worker processes.
• TOPOLOGY_DEBUG (set with setDebug):
• When set to true, tells Storm to log every message every emitted by a component.
• This is useful in local mode when testing topologies
19
Stream groupings (1/3)
• A stream grouping tells a topology how to send tuples between two
components.
• A task level topology execution is looks something like this:
20
When a task for Bolt A
emits a tuple to Bolt B,
which task should it send
the tuple to?
Stream groupings (2/3)
Let's take a look at another topology from storm-starter: WordCountTopology
• It reads sentences off of a spout and streams out of WordCountBolt the total
number of times it has seen that word before:
• SplitSentence emits a tuple for each word in each sentence it receives.
• WordCount keeps a map in memory from word to count.
• Each time WordCount receives a word, it updates its state and emits the new word count.
21
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8)
.shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 12)
.fieldsGrouping("split", new Fields("word"));
Stream groupings (3/3)
• There's a few different kinds of stream groupings:
• Shuffle grouping (simplest kind of grouping):
• Sends the tuple to a random task.
• A shuffle grouping is used in the WordCountTopology to send tuples from
RandomSentenceSpout to the SplitSentence bolt.
• Distribute the work of processing the tuples across all of SplitSentence bolt's tasks.
• Fields grouping:
• Lets you group a stream by a subset of its fields.
• This causes equal values for that subset of fields to go to the same task.
• A fields grouping is used between the SplitSentence bolt and the WordCount bolt.
• It is critical for the functioning of the WordCount bolt that the same word always go to
the same task.
22
References:
• http://storm.apache.org/documentation/Tutorial.html
23

More Related Content

What's hot

Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
P. Taylor Goetz
 
Streams processing with Storm
Streams processing with StormStreams processing with Storm
Streams processing with Storm
Mariusz Gil
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
DataWorks Summit
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.
Dan Lynn
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 Basic
Flink Forward
 

What's hot (19)

Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time Computation
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Streams processing with Storm
Streams processing with StormStreams processing with Storm
Streams processing with Storm
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
 
Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.Storm - As deep into real-time data processing as you can get in 30 minutes.
Storm - As deep into real-time data processing as you can get in 30 minutes.
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
Storm and Cassandra
Storm and Cassandra Storm and Cassandra
Storm and Cassandra
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 Basic
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 

Viewers also liked

Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
Farzad Nozarian
 

Viewers also liked (11)

Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab Assignment
 
Shark - Lab Assignment
Shark - Lab AssignmentShark - Lab Assignment
Shark - Lab Assignment
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab Assignment
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Big Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsBig Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing Environments
 
Object Based Databases
Object Based DatabasesObject Based Databases
Object Based Databases
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing Platform
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 

Similar to Apache Storm Tutorial

Similar to Apache Storm Tutorial (20)

Storm
StormStorm
Storm
 
BWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation systemBWB Meetup: Storm - distributed realtime computation system
BWB Meetup: Storm - distributed realtime computation system
 
Storm 0.8.2
Storm 0.8.2Storm 0.8.2
Storm 0.8.2
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Storm
 
Distributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache StormDistributed Realtime Computation using Apache Storm
Distributed Realtime Computation using Apache Storm
 
1 storm-intro
1 storm-intro1 storm-intro
1 storm-intro
 
STORM
STORMSTORM
STORM
 
Storm
StormStorm
Storm
 
.NET Multithreading/Multitasking
.NET Multithreading/Multitasking.NET Multithreading/Multitasking
.NET Multithreading/Multitasking
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
java Statements
java Statementsjava Statements
java Statements
 
Apache Storm Basics
Apache Storm BasicsApache Storm Basics
Apache Storm Basics
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Introduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & ExampleIntroduction to Apache Storm - Concept & Example
Introduction to Apache Storm - Concept & Example
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Slide #2: How to Setup Apache STROM
Slide #2: How to Setup Apache STROMSlide #2: How to Setup Apache STROM
Slide #2: How to Setup Apache STROM
 
Slide #2: Setup Apache Storm
Slide #2: Setup Apache StormSlide #2: Setup Apache Storm
Slide #2: Setup Apache Storm
 
StormCrawler at Bristech
StormCrawler at BristechStormCrawler at Bristech
StormCrawler at Bristech
 
Blocks & GCD
Blocks & GCDBlocks & GCD
Blocks & GCD
 

Recently uploaded

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 

Recently uploaded (20)

Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 

Apache Storm Tutorial

  • 2. Purpose • In this tutorial, you'll learn how to create Storm topologies and deploy them to a Storm cluster. • Java will be the main language used. • This tutorial uses examples from the storm-starter project. It's recommended that you clone the project and follow along with the examples. 2
  • 3. Components of a Storm cluster (1/2) • Hadoop "MapReduce jobs“ vs. Storm "topologies“ • One key difference: MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it). • There are two kinds of nodes on a Storm cluster: • Master node: runs a daemon called "Nimbus“. (similar to Hadoop's "JobTracker") • Distributing code around the cluster. • Assigning tasks to machines. • Monitoring for failures • Worker nodes: runs a daemon called "Supervisor“. (similar to Hadoop's “TaskTracker") • Listens for work assigned to its machine. • Starts and stops worker processes. 3
  • 4. Components of a Storm cluster (2/2) • All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster 4
  • 5. Topologies • A topology is a graph of computation. • Node in a topology contains processing logic. • Links between nodes indicate how data should be passed around between nodes. • Running a topology is straightforward: • The storm jar part takes care of connecting to Nimbus and uploading the jar. • A topology runs forever, or until you kill it! 5 storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
  • 6. Streams (1/2) • A stream is an unbounded sequence of tuples. • Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way. • The basic primitives Storm provides for doing stream transformations are spouts and bolts. • Spouts and bolts have interfaces that you implement to run your application-specific logic. • A spout is a source of streams. • For example, spout may connect to the Twitter API and emit a stream of tweets. • A bolt consumes any number of input streams, does some processing, and possibly emits new streams. 6
  • 7. Streams (2/2) • Bolts can do anything: • run functions, • filter tuples, • do streaming aggregations, • do streaming joins, • talk to databases. • Networks of spouts and bolts are packaged into a "topology“. • Edges in the graph indicate which bolts are subscribing to which streams. • Each node in a Storm topology executes in parallel. • You can specify how much parallelism you want for each node. 7
  • 8. Data model • Storm uses tuples as its data model. • A tuple is a named list of values, and a field in a tuple can be an object of any type. • Storm supports all the primitive types, strings, and byte arrays as tuple field values. • To use an object of another type, you just need to implement a serializer for the type. • Every node in a topology must declare the output fields for the tuples it emits: 8 @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("double", "triple")); }
  • 9. A simple topology (1/8) • Let's look at the ExclamationTopology definition from storm-starter: • This topology contains a spout and two bolts. • The spout emits words, and each bolt appends the string "!!!" to its input. 9 TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("words", new TestWordSpout(), 10); builder.setBolt("exclaim1", new ExclamationBolt(), 3) .shuffleGrouping("words"); builder.setBolt("exclaim2", new ExclamationBolt(), 2) .shuffleGrouping("exclaim1");
  • 10. builder.setSpout("words", new TestWordSpout(), 10); A simple topology (2/8) • This code defines the nodes using the setSpout and setBolt methods. • These methods take as input: • A user-specified id, • An object containing the processing logic, • The amount of parallelism you want for the node. • The spout is given id "words" and the bolts are given ids "exclaim1" and "exclaim2" • The object containing the processing logic implements the IRichSpout interface for spouts and the IRichBolt interface for bolts. • The last parameter, how much parallelism you want for the node, is optional. • How many threads should execute that component across the cluster. Default, Storm will only allocate one thread for that node. 10
  • 11. A simple topology (3/8) • setBolt returns an InputDeclarer object that is used to define the inputs to the Bolt. • "shuffle grouping" means that tuples should be randomly distributed from the input tasks to the bolt's tasks. • There are many ways to group data between components. • Input declarations can be chained to specify multiple sources for the Bolt: 11 builder.setBolt("exclaim2", new ExclamationBolt(), 5) .shuffleGrouping("words") .shuffleGrouping("exclaim1");
  • 12. A simple topology (4/8) Spouts implementation Let's dig into the implementations of the spouts and bolts in this topology. • Spouts are responsible for emitting new messages into the topology. • The implementation of nextTuple() in TestWordSpout looks like this: 12 public void nextTuple() { Utils.sleep(100); final String[] words = new String[] {"nathan", "mike", "jackson", "golda","bertels"}; final Random rand = new Random(); final String word = words[rand.nextInt(words.length)]; _collector.emit(new Values(word)); }
  • 13. A simple topology (5/8) Bolts implementation 13 public static class ExclamationBolt extends BaseRichBolt { OutputCollector _collector; public void prepare(Map conf, TopologyContext context, OutputCollector collector) { _collector = collector; } public void execute(Tuple tuple) { _collector.emit(tuple, new Values(tuple.getString(0) + "!!!")); _collector.ack(tuple); } public void cleanup() { } public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); } public Map getComponentConfiguration() { return null; } }
  • 14. A simple topology (6/8) Bolts implementation • The prepare method provides the bolt with an OutputCollector that is used for emitting tuples from this bolt. • Tuples can be emitted at anytime from the bolt: • in the prepare, • in the execute, • or cleanup methods, • or even asynchronously in another thread. 14 public void prepare(Map conf, TopologyContext context, OutputCollector collector){ _collector = collector; }
  • 15. A simple topology (7/8) Bolts implementation • The execute method receives a tuple from one of the bolt's inputs. • The ExclamationBolt grabs the first field from the tuple and emits a new tuple with the string "!!!" appended to it. • In case of multiple input sources, you can find out which component the Tuple came from by using the Tuple#getSourceComponent method. 15 public void execute(Tuple tuple) { _collector.emit(tuple, new Values(tuple.getString(0) + "!!!")); _collector.ack(tuple); }
  • 16. A simple topology (8/8) Bolts implementation • The cleanup method is called when a Bolt is being shutdown and should cleanup any resources that were opened. • The declareOutputFields method declares that the ExclamationBolt emits 1-tuples with one field called "word". • The getComponentConfiguration method allows you to configure various aspects of how this component runs. 16 public void cleanup() {} public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); } public Map getComponentConfiguration() { return null; }
  • 17. Running ExclamationTopology in local mode (1/3) • Storm has two modes of operation: • Local mode • Storm executes completely in process by simulating worker nodes with threads. • Local mode is useful for testing and development of topologies. • To create an in-process cluster, simply use the LocalCluster class. • Distributed mode: • Storm operates as a cluster of machines. • Running topologies on a production cluster is similar to running in Local mode: 1. Define the topology (Use TopologyBuilder if defining using Java). 2. Use StormSubmitter to submit the topology to the cluster. 3. Create a jar containing your code and all the dependencies of your code. 4. Submit the topology to the cluster using the storm client. 17
  • 18. Running ExclamationTopology in local mode (2/3) Here's the code that runs ExclamationTopology in local mode: • It submits a topology to the LocalCluster by calling submitTopology. • This method takes as arguments a name for the running topology, a configuration for the topology, and then the topology itself. 18 Config conf = new Config(); conf.setDebug(true); conf.setNumWorkers(2); LocalCluster cluster = new LocalCluster(); cluster.submitTopology("test", conf, builder.createTopology()); Utils.sleep(10000); cluster.killTopology("test"); cluster.shutdown();
  • 19. Running ExclamationTopology in local mode (3/3) • The name is used to identify the topology so that you can kill it later on. • The configuration is used to tune various aspects of the running topology. • The two configurations specified here are very common: • TOPOLOGY_WORKERS (set with setNumWorkers): • Specifies how many processes you want allocated around the cluster to execute the topology. • Each component in the topology will execute as many threads. • Those threads exist within worker processes. • TOPOLOGY_DEBUG (set with setDebug): • When set to true, tells Storm to log every message every emitted by a component. • This is useful in local mode when testing topologies 19
  • 20. Stream groupings (1/3) • A stream grouping tells a topology how to send tuples between two components. • A task level topology execution is looks something like this: 20 When a task for Bolt A emits a tuple to Bolt B, which task should it send the tuple to?
  • 21. Stream groupings (2/3) Let's take a look at another topology from storm-starter: WordCountTopology • It reads sentences off of a spout and streams out of WordCountBolt the total number of times it has seen that word before: • SplitSentence emits a tuple for each word in each sentence it receives. • WordCount keeps a map in memory from word to count. • Each time WordCount receives a word, it updates its state and emits the new word count. 21 TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("sentences", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8) .shuffleGrouping("sentences"); builder.setBolt("count", new WordCount(), 12) .fieldsGrouping("split", new Fields("word"));
  • 22. Stream groupings (3/3) • There's a few different kinds of stream groupings: • Shuffle grouping (simplest kind of grouping): • Sends the tuple to a random task. • A shuffle grouping is used in the WordCountTopology to send tuples from RandomSentenceSpout to the SplitSentence bolt. • Distribute the work of processing the tuples across all of SplitSentence bolt's tasks. • Fields grouping: • Lets you group a stream by a subset of its fields. • This causes equal values for that subset of fields to go to the same task. • A fields grouping is used between the SplitSentence bolt and the WordCount bolt. • It is critical for the functioning of the WordCount bolt that the same word always go to the same task. 22