2. LIVE On-line Class
Class Recording in LMS
24/7 Post Class Support
Module Wise Quiz and Assignment
Project Work on Large Data Set
Verifiable Certificate
How it Works?
Slide 2 www.edureka.in/apache-storm
3. Course Topics
Module 1
» Introduction to Big Data and Storm
Module 2
» Storm Technology Stack and Groupings
Module 3
» Spouts and Bolts
Module 4
» Trident Topologies
Module 5
» Real Life Storm Project -1
Module 6
» Real Life Storm Project -2
4. Objectives
At the end of this module, you will be able to:
Recall Big Data and Hadoop
Understand Batch and Real-time Analytics of Big Data
Investigate the shortcomings of Hadoop
Understand Lambda Architecture
Develop a basic knowledge of Apache Storm and its components
Explain the Use Cases and Key Differentiators of Storm
5. Big Data
Storm is an open-source computing system for real-time Big Data analytics.
Let's understand Big Data first in order to learn Storm.
6. Lots of Data - Terabytes or Petabytes
Big data is the term for a collection of data sets so
large and complex that it becomes difficult to process
using on-hand database management tools or
traditional data processing applications.
The challenges include capture, curation, storage,
search, sharing, transfer, analysis, and visualization.
What is Big Data?
7. Systems and enterprises generate huge amounts of data, from terabytes up to petabytes of information.
A stock market generates about one terabyte of new trade data per day, which feeds stock trading analytics that determine trends for optimal trades.
What is Big Data?
8. 2,500 exabytes of new information in 2012, with the Internet as the primary driver.
The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year.
Un-structured Data is Exploding
9. IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
VOLUME, VELOCITY, VARIETY
Web logs, Images, Videos, Sensor Data, Audio
10. Annie’s Introduction
Hello There!!
My name is Annie.
I love quizzes and
puzzles and I am here to
make you guys think and
answer my questions.
11. Annie’s Question
Map the following to the corresponding data type:
- XML files
- Word docs, PDF files, Text files
- E-Mail body
- Data from Enterprise systems (ERP, CRM etc.)
12. Annie’s Answer
XML files -> Semi-structured Data
Word docs, PDF files, Text files -> Unstructured Data
E-Mail body -> Unstructured Data
Data from Enterprise systems (ERP, CRM etc.) -> Structured Data
13. Big Data Batch Analytics
Hadoop and its primary programming model, MapReduce, are great for batch-oriented processing of huge amounts of data.
As data grows, Hadoop lets you scale your cluster horizontally by adding commodity nodes and thus keep up with query workloads.
14. What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of
commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
15. Hadoop Eco-System
» Apache Oozie (Workflow)
» Hive (DW System)
» Pig Latin (Data Analysis)
» HBase
» MapReduce Framework
» Other YARN Frameworks (MPI, Giraph)
» YARN (Cluster Resource Management)
» HDFS (Hadoop Distributed File System)
16. Big Data Batch Analytics
This evolution has forced the addition of support for:
» Higher-level languages (Pig and Hive)
» New real-time storage engines (HBase)
» Extensions for streaming data (Hadoop Streaming)
17. Hadoop for Batch Analytics
Because it is built for batch processing, Hadoop should be deployed in situations such as:
» Index Building
» Pattern Recognition
» Creating a Recommendation Engine
» Sentiment Analysis
These are situations that generate a huge amount of data, which is stored and later queried.
18. Real-time Big Data Analytics
Social Networking:
» Pick your own Big Data database (RDBMS or NoSQL)
» Measure the immediate impact to your site traffic from
social media, whether a new blog post, a tweet, a “Like”,
or even a comment.
» Knowing this information translates to better conversion
and more effective online campaigns.
19. Real-time Big Data Analytics
SaaS:
» Measuring user behaviour and acting upon it is crucial
for improving customer satisfaction and conversion rates
– which represent immediate increases in revenue.
20. Real-time Big Data Analytics
Financial Services:
» Determining in real time whether your portfolio is losing
money, or if there is fraud in your system means that you
can prevent disasters as they occur, not after the damage
is done.
» Correlating multiple sources from the market in real-time
results in a more accurate view of the market and enables
more accurate actions to maximize your profit.
21. Real Time Big Data Analytics - Options
» Apache Storm
» Amazon Kinesis
22. Need for Real-time Analytics
Problem Statement:
Find the total number of page views of Edureka's blog over a range of time.
Google Analytics can provide this information.
Example: For a particular day, the data can be:
23. Need for Real-time Analytics
Challenge:
Querying a huge amount of historical data (all data, at petabyte scale) is slow.
25. Need for Real-time Analytics
Google Analytics might have to keep the historical data for each hour as a precompiled view.
[Figure: All Data (individual page views) and the Query that runs over it]
Precomputed View
URL                                Hour of the day   No. of page views
edureka.in/blog/aboutapachestorm   1                 250
edureka.in/blog/aboutapachestorm   2                 300
edureka.in/blog/aboutapachestorm   3                 455
edureka.in/blog/aboutapachestorm   4                 460
edureka.in/blog/aboutapachestorm   5                 320
edureka.in/blog/aboutapachestorm   6                 111
edureka.in/blog/aboutapachestorm   7                 129
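The idea of a precomputed view can be sketched in a few lines of Python. This is only an illustration: the function names (`build_hourly_view`, `page_views_in_range`) and the raw event format are assumptions, not something the slides define, and only the first three hourly counts from the table are used. It shows why a range query over an hourly view is cheap compared with rescanning every raw page-view record.

```python
from collections import Counter

# Hypothetical raw events: one (url, hour_of_day) record per page view.
URL = "edureka.in/blog/aboutapachestorm"
raw_events = [(URL, 1)] * 250 + [(URL, 2)] * 300 + [(URL, 3)] * 455


def build_hourly_view(events):
    """Batch step: aggregate raw page views into counts per (url, hour)."""
    return Counter(events)


def page_views_in_range(view, url, first_hour, last_hour):
    """Query step: sum precomputed hourly counts instead of rescanning raw data."""
    return sum(view[(url, hour)] for hour in range(first_hour, last_hour + 1))


hourly_view = build_hourly_view(raw_events)
total = page_views_in_range(hourly_view, URL, 1, 3)
print(total)  # 250 + 300 + 455 = 1005
```

The query touches one entry per hour in the range, so its cost depends on the length of the range and not on the number of raw events.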
26. Need for Real-time Analytics
[Figure: All Data -> Precomputed View (built using Hadoop) -> Query]
27. Need for Real-time Analytics
But what about the data generated after the last precompiled view?
28. Need for Real-time Analytics
Compensating for the last few hours of data
[Figure: in Storm, a spout feeds real-time data through bolts to produce a real-time view]
30. Lambda Architecture
All data entering the system is dispatched to both the batch layer and the speed layer for processing.
[Figure: Lambda Architecture, step 1: New Data, Batch Layer, Speed Layer, Serving Layer]
31. Lambda Architecture
The batch layer has two functions:
» managing the master dataset (an immutable, append-only set of raw data), and
» pre-computing the batch views.
The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
[Figure: Lambda Architecture, steps 1-3: New Data, Batch Layer (Master Dataset), Serving Layer (Batch Views), Speed Layer]
32. Lambda Architecture
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
[Figure: Lambda Architecture, steps 1-4: New Data, Batch Layer (Master Dataset), Serving Layer (Batch Views), Speed Layer (Real-time Views)]
33. Lambda Architecture
Any incoming query can be answered by merging results from batch views and real-time views.
[Figure: Lambda Architecture, steps 1-5: New Data, Batch Layer (Master Dataset), Serving Layer (Batch Views), Speed Layer (Real-time Views), Query]
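The query step of the Lambda Architecture can be sketched as a merge over two lookups. The sketch below is an assumption-laden illustration: the two dictionaries, their numbers, and the function name `query_page_views` are hypothetical, and the rule "prefer the batch view, fall back to the real-time view" is one simple merge policy, not the only possible one.

```python
# Hypothetical views keyed by hour of the day; the numbers are illustrative.
batch_view = {1: 250, 2: 300, 3: 455}   # precomputed by the batch layer
realtime_view = {4: 12, 5: 7}           # recent hours, kept by the speed layer


def query_page_views(batch, realtime, first_hour, last_hour):
    """Answer a range query by merging batch and real-time results."""
    total = 0
    for hour in range(first_hour, last_hour + 1):
        # Prefer the batch view when it already covers the hour;
        # fall back to the real-time view for hours the batch has not reached.
        if hour in batch:
            total += batch[hour]
        else:
            total += realtime.get(hour, 0)
    return total


print(query_page_views(batch_view, realtime_view, 2, 5))  # 300 + 455 + 12 + 7 = 774
```

Once the batch layer recomputes its views and covers hours 4 and 5, the corresponding real-time entries can be discarded without changing the query code.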
34. What is Storm?
Storm is a distributed, reliable, fault-tolerant system for processing streams of data.
35. What is Storm?
The work is delegated to different types of components, each responsible for a simple, specific processing task.
The input stream of a Storm cluster is handled by a component called a spout.
The spout passes the data to a component called a bolt, which transforms it in some way.
A bolt either persists the data in some sort of storage or passes it to some other bolt.
[Figure: Input Data Source -> spouts (pass data) -> bolts (transform data, pass data) -> data storage]
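The spout-to-bolt flow can be imitated with plain Python generators. This is not the Storm API: the names `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical, the input is a fixed list where a real spout would read an unbounded source, and everything runs in one process. It only shows the shape of the data flow: a spout emits tuples, one bolt transforms them, and another bolt persists the result.

```python
def sentence_spout():
    """Stand-in for a spout: emits tuples from an input source (here, a fixed list)."""
    for sentence in ["storm processes streams", "storm is fault tolerant"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt that transforms each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream, storage):
    """Bolt that persists a running count per word into some storage (a dict)."""
    for (word,) in stream:
        storage[word] = storage.get(word, 0) + 1


word_counts = {}
count_bolt(split_bolt(sentence_spout()), word_counts)
print(word_counts["storm"])  # 2
```

Each stage handles one tuple at a time, so no stage needs the whole input before it can start working.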
36. Annie’s Question
Storm can be used in:
- Real-time Processing
- Batch Processing
- Both
40. Annie’s Question
It is not possible to run Storm processes alongside MapReduce jobs inside a Hadoop cluster.
- True
- False
42. Storm Components
A Storm cluster has 3 sets of nodes:
1. Nimbus node
2. ZooKeeper nodes
3. Supervisor nodes
Nimbus node (master node, similar to the Hadoop JobTracker):
» Uploads computations for execution
» Distributes code across the cluster
» Launches workers across the cluster
» Monitors computation and reallocates workers as needed
ZooKeeper nodes:
» Coordinate the Storm cluster
Supervisor nodes:
» Communicate with Nimbus through ZooKeeper, and start and stop workers according to signals from Nimbus
[Figure: one Nimbus node, three ZooKeeper nodes, five Supervisor nodes]
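The division of labour between the three node types can be pictured with a toy model. Everything in it is an assumption made for illustration: the round-robin policy, the task and supervisor names, and the dictionary standing in for coordination state are hypothetical, and Storm's actual scheduler is more involved. The sketch only shows the idea that the master reallocates work when a supervisor disappears.

```python
def assign_workers(tasks, supervisors):
    """Toy stand-in for Nimbus: spread tasks round-robin over live supervisors."""
    if not supervisors:
        raise ValueError("no live supervisors to run tasks")
    assignment = {name: [] for name in supervisors}
    for index, task in enumerate(tasks):
        assignment[supervisors[index % len(supervisors)]].append(task)
    return assignment


tasks = ["spout-1", "bolt-1", "bolt-2", "bolt-3"]
live = ["supervisor-a", "supervisor-b", "supervisor-c"]

# The shared dict plays the role of the coordination state kept in ZooKeeper.
coordination_state = {"assignment": assign_workers(tasks, live)}

# supervisor-b stops responding, so its tasks are reallocated to the rest.
live.remove("supervisor-b")
coordination_state["assignment"] = assign_workers(tasks, live)
print(coordination_state["assignment"])
```

In this model a supervisor would read its own entry from the shared state and start or stop workers to match it, which mirrors the "signals from Nimbus" described in the text.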
43. Annie’s Question
A Nimbus node is similar to a TaskTracker node in a Hadoop cluster.
- True
- False
44. Annie’s Answer
False. A Nimbus node is more like a JobTracker node in Hadoop.
45. Storm Components
Five key abstractions help to understand how Storm processes data:
» Tuples – an ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7)
» Streams – an unbounded sequence of tuples
» Spouts – sources of streams in a computation (e.g. a Twitter API)
» Bolts – process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases
» Topologies – the overall calculation, represented visually as a network of spouts and bolts
Storm users define topologies for how to process the data when it comes streaming in from the spout.
[Figure: a topology in which two spouts feed a network of four bolts]
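The abstractions can be made concrete with ordinary Python values. This is a sketch for intuition only, not how Storm represents them internally: the component names (`number_spout`, `filter_bolt`, `sum_bolt`) and the helper functions are hypothetical. It shows a tuple, an unbounded stream, and a topology described as a network of components and their inputs.

```python
from itertools import count, islice

# A tuple is an ordered list of elements, e.g. the 4-tuple from the text.
four_tuple = (7, 1, 3, 7)


def number_stream():
    """A stream is an unbounded sequence of tuples; this one never ends."""
    for n in count(1):
        yield (n, n * n)


# Hypothetical topology: each component lists the components it reads from.
topology = {
    "number_spout": [],
    "filter_bolt": ["number_spout"],
    "sum_bolt": ["filter_bolt"],
}


def spouts_of(topo):
    """Spouts are the sources: components with no upstream inputs."""
    return [name for name, inputs in topo.items() if not inputs]


def consumers_of(topo, name):
    """Components that subscribe to the output stream of `name`."""
    return [comp for comp, inputs in topo.items() if name in inputs]


first_three = list(islice(number_stream(), 3))
print(first_three)                             # [(1, 1), (2, 4), (3, 9)]
print(spouts_of(topology))                     # ['number_spout']
print(consumers_of(topology, "number_spout"))  # ['filter_bolt']
```

Because the stream is unbounded, the example takes only the first three tuples with `islice`; a running topology would keep consuming for as long as data arrives.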
46. Annie’s Question
A Storm topology is defined in terms of
- Nimbus, Zookeeper, Supervisor nodes
- Spout, Bolt
- Spout, Bolt, Nimbus, Zookeeper, Supervisor nodes
- Spout, Bolt, Zookeeper node
48. Use Cases of Storm
Processing Streams
» Unlike other stream processing systems, with Storm there’s no need for intermediate queues.
Continuous Computation
» Send data to clients continuously so they can update and show results in real time, such as site metrics.
Distributed Remote Procedure Call
» Easily parallelize CPU-intensive operations.
49. Use Cases of Storm
Financial Services
» Securities Fraud
» Compliance Violations
» Order Routing
» Pricing
Telecom
» Security Breaches
» Network Outages
» Bandwidth Allocation
» Customer Service
Retail
» Shrinkage
» Stock outs
» Offers
» Pricing
Web
» Application Failure
» Operational Issues
» Personalized Content
Businesses in these sectors use Storm to prevent certain outcomes or to optimize their objectives.
50. Key Differentiators
Simple to Program
» It’s painful to do real-time processing from scratch. With Storm, complexity is reduced drastically.
Support for Multiple Programming Languages
» It’s easier to develop in a JVM-based language, but Storm supports any language as long as you use or implement a small intermediary library.
Fault-tolerant
» The Storm cluster takes care of workers going down, reassigning tasks when necessary.