The Future of Apache Storm
Hadoop Summit 2016 - Dublin
P. Taylor Goetz, Hortonworks
@ptgoetz
About Me
• Tech Staff @ Hortonworks
• PMC Chair, Apache Storm
• ASF Member
• PMC, Apache Incubator
• PMC, Apache Arrow
• PMC, Apache Kylin
• Mentor/PPMC, Apache Eagle (Incubating)
• Mentor/PPMC, Apache Metron (Incubating)
• Mentor/PPMC, Apache Apex (Incubating)
Apache Storm 0.9.x
Storm moves to Apache
Apache Storm 0.9.x
• First official Apache Release
• Storm becomes an Apache TLP
• 0mq to Netty for inter-worker communication
• Expanded Integration (Kafka, HDFS, HBase)
• Dependency conflict reduction (It was a start ;) )
Apache Storm 0.10.x
Enterprise Readiness
Apache Storm 0.10.x
• Security, Multi-Tenancy
• Enable Rolling Upgrades
• Flux (declarative topology wiring/configuration)
• Partial Key Groupings
Apache Storm 0.10.x
• Improved logging (Log4j 2)
• Streaming Ingest to Apache Hive
• Azure Event Hubs Integration
• Redis Integration
• JDBC Integration
Apache Storm 1.0
Maturity and Improved Performance
Release Date: April 12, 2016
Pacemaker
Heartbeat Server
Pacemaker
• Replaces ZooKeeper for Heartbeats
• In-Memory key-value store
• Allows Scaling to 2k-3k+ Nodes
• Secure: Kerberos/Digest Authentication
Pacemaker
• Compared to ZooKeeper:
• Less Memory/CPU
• No Disk
• Spared the overhead of maintaining consistency
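The idea of an in-memory heartbeat store can be illustrated with a minimal sketch. This is not Pacemaker's actual code or API: the class HeartbeatStore, its methods, and the timeout values are hypothetical and only show why a store that keeps the latest heartbeat per worker needs no disk and no consistency protocol.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only: not Pacemaker's real classes or wire protocol.
class HeartbeatStore {
    // Worker id -> timestamp (ms) of the most recent heartbeat. Held only in memory.
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();

    // Record a heartbeat; a newer write simply overwrites the older one.
    void beat(String workerId, long nowMillis) {
        lastBeat.put(workerId, nowMillis);
    }

    // Return the ids of workers whose last heartbeat is older than timeoutMillis.
    List<String> stale(long nowMillis, long timeoutMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastBeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        HeartbeatStore store = new HeartbeatStore();
        store.beat("worker-1", 1_000L);
        store.beat("worker-2", 9_000L);
        // At t = 10s with a 5s timeout, only worker-1 has missed its heartbeat.
        System.out.println(store.stale(10_000L, 5_000L)); // prints [worker-1]
    }
}
```

Losing this map on a restart is acceptable in the sketch because every worker rewrites its entry on the next heartbeat.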
Distributed Cache API
Distributed Cache API
• Topology resources:
• Dictionaries, ML Models, Geolocation Data, etc.
• Typically packaged in topology jar
• Fine for small files
• Large files negatively impact topology startup time
• Immutable: Changes require repackaging and deployment
Distributed Cache API
• Allows sharing of files (BLOBs) among topologies
• Files can be updated from the command line
• Allows for files from several KB to several GB in size
• Files can change over the lifetime of the topology
• Allows for compression (e.g. zip, tar, gzip)
Distributed Cache API
• Two implementations: LocalFsBlobStore and HdfsBlobStore
• Local implementation supports Replication Factor (not needed for HDFS-backed implementation)
• Both support ACLs
Distributed Cache API
Creating a blob:
storm blobstore create --file dict.txt --acl o::rwa --repl-fctr 2 key1
Making it available to a topology:
storm jar topo.jar my.topo.Class test_topo -c topology.blobstore.map='{"key1":{"localname":"dict.txt", "uncompress":"false"}}'
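Once the blob is mapped, topology code can read it like any local file. The following is a minimal sketch that assumes the blob is exposed to the worker under its localname (dict.txt) in the working directory. DictionaryLoader is a hypothetical helper, not part of Storm's API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Storm's API.
class DictionaryLoader {
    // Read one entry per line, trimming whitespace and skipping blank lines.
    static Set<String> load(Path file) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the blob is visible under its localname in the working directory.
        Path dict = Paths.get("dict.txt");
        if (Files.exists(dict)) {
            System.out.println("Loaded " + load(dict).size() + " entries");
        } else {
            System.out.println("dict.txt not found in working directory");
        }
    }
}
```

Because the file can change over the lifetime of the topology, a bolt using this approach would need to reload it periodically to see updates.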
High Availability Nimbus
Before HA Nimbus
[Diagram: ZooKeeper and a single Nimbus, with four Supervisors, each running Workers]
HA Nimbus
[Diagram: Pacemaker (ZooKeeper) and three Nimbus hosts, one of them the Leader, with four Supervisors, each running Workers]
HA Nimbus - Failover
[Diagram: the Leader Nimbus fails (X); the remaining Nimbus hosts hold a Leader Election]
HA Nimbus - Failover
[Diagram: one of the remaining Nimbus hosts is now the Leader; the failed Nimbus is marked X]
HA Nimbus
• Increase overall availability of Nimbus
• Nimbus hosts can join/leave at any time
• Leverages Distributed Cache API
• Topology JAR, Config, and Serialized Topology uploaded to Distributed Cache
• Replication guarantees availability of all files
Native Streaming Windows
Streaming Windows
• Specify Length - Duration or Tuple Count
• Slide Interval - How often to advance the window
Sliding Windows
Windows can overlap
[Diagram: tuples along a time axis; Window 1 and Window 2 overlap]
Tumbling Windows
Windows do not overlap
[Diagram: tuples along a time axis; Window 1, Window 2 and Window 3 do not overlap]
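The difference between the two window types comes down to how the slide interval relates to the window length. Here is a minimal sketch for count-based windows, assuming only complete windows are emitted. CountWindows and its windows method are hypothetical names for illustration, not Storm's windowing API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of count-based windows; not Storm's windowing API.
class CountWindows {
    // length = tuples per window, slide = tuples to advance between windows.
    // slide < length gives overlapping (sliding) windows;
    // slide == length gives non-overlapping (tumbling) windows.
    static <T> List<List<T>> windows(List<T> stream, int length, int slide) {
        if (length <= 0 || slide <= 0) {
            throw new IllegalArgumentException("length and slide must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        // Simplification: only complete windows are emitted.
        for (int start = 0; start + length <= stream.size(); start += slide) {
            out.add(new ArrayList<>(stream.subList(start, start + length)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> stream = Arrays.asList(1, 2, 3, 4, 5, 6);
        // Sliding: length 4, slide 2. Tuples 3 and 4 appear in both windows.
        System.out.println(windows(stream, 4, 2)); // prints [[1, 2, 3, 4], [3, 4, 5, 6]]
        // Tumbling: length 3, slide 3. Each tuple appears in exactly one window.
        System.out.println(windows(stream, 3, 3)); // prints [[1, 2, 3], [4, 5, 6]]
    }
}
```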
Streaming Windows
• Timestamps (Event Time, Ingestion Time and Processing Time)
• Out of Order Tuples
• Watermarks
• Window State Checkpointing
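How a watermark separates out-of-order tuples from late ones can be sketched as follows. This assumes a simple policy in which the watermark trails the highest event time seen so far by a fixed lag. WatermarkTracker and the lag value are hypothetical, and Storm's own watermark generation may differ in detail.

```java
// Hypothetical illustration of event-time watermarks; not Storm's implementation.
class WatermarkTracker {
    private final long maxLagMillis;
    private long maxEventTime = Long.MIN_VALUE; // no tuple seen yet

    WatermarkTracker(long maxLagMillis) {
        this.maxLagMillis = maxLagMillis;
    }

    // Watermark = highest event time seen so far minus the allowed lag.
    long watermark() {
        return maxEventTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxEventTime - maxLagMillis;
    }

    // Returns true if the tuple is accepted, false if it is older than the watermark.
    boolean accept(long eventTimeMillis) {
        if (eventTimeMillis < watermark()) {
            return false; // late tuple
        }
        maxEventTime = Math.max(maxEventTime, eventTimeMillis);
        return true;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(5_000L);
        System.out.println(tracker.accept(10_000L)); // prints true, watermark becomes 5000
        System.out.println(tracker.accept(7_000L));  // prints true, out of order but within the lag
        System.out.println(tracker.accept(4_000L));  // prints false, older than the watermark
    }
}
```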
State Management
Stateful Bolts with Automatic Checkpointing
What you see.
[Diagram: Spout, Stateful Bolt 1, Bolt, Stateful Bolt 2]
State Management
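import java.util.Map;
import org.apache.storm.state.KeyValueState;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.base.BaseStatefulBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountBolt extends BaseStatefulBolt<KeyValueState<String, Integer>> {
    // Word -> running count, saved and restored by Storm's checkpointing
    private KeyValueState<String, Integer> wordCounts;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    // Called by Storm with the bolt's previously saved state
    @Override
    public void initState(KeyValueState<String, Integer> state) {
        this.wordCounts = state;
    }

    @Override
    public void execute(Tuple tuple) {
        String word = tuple.getString(0);
        Integer count = wordCounts.get(word, 0); // 0 if the word has not been seen
        count++;
        wordCounts.put(word, count);
        collector.emit(new Values(word, count));
    }
}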
Initialize State: initState() receives the bolt's previously saved state before any tuples are processed.
Read/Update State: execute() reads the current count with get() and stores the new count with put().
State Management
Automatic Checkpointing
Checkpointing/Snapshotting
• Asynchronous Barrier Snapshotting (ABS) algorithm [1]
• Chandy-Lamport Algorithm [2]
[1] http://arxiv.org/pdf/1506.08603v1.pdf
[2] http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
State Management
Checkpointing/Snapshotting: What you see.
[Diagram: Spout, Stateful Bolt 1 (execute/update state), Bolt (execute), Stateful Bolt 2 (execute/update state)]
Checkpointing/Snapshotting: What you get.
[Diagram: the same topology with a Checkpoint Spout, ACKER and State Store added; $chkpt tuples pass through the bolts and are ACKed]
Automatic Back Pressure
Automatic Back Pressure
• In previous Storm versions, the only way to throttle topologies was to enable ACKing and set topology.spout.max.pending.
• If you don’t require at-least-once guarantees, this imposed a significant performance penalty.**
** In Storm 1.0 this penalty is drastically reduced (more on this later)
Automatic Backpressure
• High/Low Watermarks (expressed as % of buffer size)
• Back Pressure thread monitors buffers
• If High Watermark reached, slow down Spouts
• If Low Watermark reached, stop throttling
• All Spouts Supported
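The high/low watermark behaviour above amounts to a simple hysteresis rule. This minimal sketch assumes a single buffer and illustrative watermark values of 90% and 40%. BackPressureMonitor is a hypothetical class, not Storm's back pressure thread.

```java
// Hypothetical illustration of high/low watermark throttling; not Storm's implementation.
class BackPressureMonitor {
    private final double highWatermark; // fraction of buffer size, e.g. 0.9
    private final double lowWatermark;  // fraction of buffer size, e.g. 0.4
    private boolean throttled = false;

    BackPressureMonitor(double highWatermark, double lowWatermark) {
        if (!(lowWatermark < highWatermark)) {
            throw new IllegalArgumentException("low watermark must be below high watermark");
        }
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    // Inspect the buffer and return whether spouts should currently be slowed down.
    boolean update(int used, int capacity) {
        double fill = (double) used / capacity;
        if (!throttled && fill >= highWatermark) {
            throttled = true;   // high watermark reached: slow down spouts
        } else if (throttled && fill <= lowWatermark) {
            throttled = false;  // low watermark reached: stop throttling
        }
        return throttled;
    }

    public static void main(String[] args) {
        BackPressureMonitor monitor = new BackPressureMonitor(0.9, 0.4);
        System.out.println(monitor.update(50, 100)); // prints false
        System.out.println(monitor.update(95, 100)); // prints true
        System.out.println(monitor.update(60, 100)); // prints true, still above the low watermark
        System.out.println(monitor.update(30, 100)); // prints false
    }
}
```

Using two thresholds instead of one keeps the throttle from flapping on and off while the buffer hovers around a single limit.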
Resource Aware Scheduler
(RAS)
Resource Aware Scheduler
• Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts)
• Memory: On-Heap / Off-Heap (if off-heap is used)
• CPU: Point system based on number of cores
• Resource requirements are per component instance (parallelism matters)
Resource Aware Scheduler
• CPU and Memory availability described in storm.yaml on each supervisor node. E.g.:
supervisor.memory.capacity.mb: 3072.0
supervisor.cpu.capacity: 400.0
• Convention for CPU capacity is to use 100 for each CPU core
Resource Aware Scheduler
Setting component resource requirements:
SpoutDeclarer spout = builder.setSpout("sp1", new TestSpout(), 10);
// Set the CPU requirement
spout.setCPULoad(20);
// Set the on-heap and off-heap memory requirements
spout.setMemoryLoad(64, 16);

BoltDeclarer bolt1 = builder.setBolt("b1", new MyBolt(), 3).shuffleGrouping("sp1");
// Set the CPU requirement. It is not necessary to set both CPU and memory:
// for requirements not set, a default value will be used
bolt1.setCPULoad(15);

BoltDeclarer bolt2 = builder.setBolt("b2", new MyBolt(), 2).shuffleGrouping("b1");
bolt2.setMemoryLoad(100);
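Because requirements are per component instance, the total demand of a component is its requirement multiplied by its parallelism. The following sketch works through the arithmetic for the values set explicitly above and compares the result with the example supervisor capacity (400 CPU points, 3072 MB). ResourceEstimate is a hypothetical helper. It ignores the defaults applied to unset requirements (b1 memory, b2 CPU) and does not model how the scheduler places individual instances.

```java
// Hypothetical arithmetic helper; not part of the Resource Aware Scheduler.
class ResourceEstimate {
    // Total demand for one component = per-instance requirement x parallelism.
    static double total(double perInstance, int parallelism) {
        return perInstance * parallelism;
    }

    // True if the demand does not exceed the given capacities.
    static boolean fits(double cpu, double memMb, double cpuCapacity, double memCapacityMb) {
        return cpu <= cpuCapacity && memMb <= memCapacityMb;
    }

    public static void main(String[] args) {
        // sp1: 10 instances at 20 CPU points; b1: 3 instances at 15 CPU points.
        double cpu = total(20, 10) + total(15, 3);
        // sp1: 10 instances at 64 MB on-heap + 16 MB off-heap; b2: 2 instances at 100 MB.
        double memMb = total(64 + 16, 10) + total(100, 2);

        System.out.println("Explicit CPU points: " + cpu);   // prints Explicit CPU points: 245.0
        System.out.println("Explicit memory MB: " + memMb);  // prints Explicit memory MB: 1000.0
        System.out.println("Within one supervisor: " + fits(cpu, memMb, 400.0, 3072.0));
    }
}
```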
Storm Usability Improvements
Enhanced Debugging and Monitoring of Topologies
Dynamic Log Level Settings
Dynamic Log Levels
• Set log level setting for a running topology
• Via Storm UI and CLI
• Optional timeout after which changes will be reverted
• Logs searchable from Storm UI/Logviewer
Dynamic Log Levels
Via Storm UI:
Dynamic Log Levels
Via Storm CLI:
./bin/storm set_log_level [topology name] -l [logger_name]=[LEVEL]:[TIMEOUT]
Tuple Sampling
• No more debug bolts or Trident functions!
• In Storm UI: Select a Topology or component and click “Debug”
• Specify a sampling percentage (% of tuples to be sampled)
• Click on the “Events” link to view the sample log.
Distributed Log Search
• Search across all log files for a specific topology
• Search in archived (ZIP) logs
• Results include matches from all Supervisor nodes
Dynamic Worker Profiling
• Request worker profile data from Storm UI:
• Heap Dumps
• JStack Output
• JProfile Recordings
• Download generated files for off-line analysis
• Restart workers from UI
Supervisor Health Checks
• Identify Supervisor nodes that are in a bad state
• Automatically decommission bad nodes
• Simple shell script
• You define what constitutes “Unhealthy”
New Integrations
• Cassandra
• Solr
• Elasticsearch
• MQTT
Integration Improvements
• Kafka
• HDFS Spout
• Avro Integration for HDFS
• HBase
• Hive
Before I forget...
Performance
Up to 16x faster throughput.
Realistically 3x -- Highly dependent on use case
> 60% Latency Reduction
Bear in mind performance varies widely depending on the use case.
The most important benchmarks are the ones you do.
Storm 2.0
Underway @ Apache
Clojure to Java
Broadening the contributor base
Clojure to Java
Alibaba JStorm Contribution
Questions?
Thank you!
P. Taylor Goetz, Hortonworks
@ptgoetz

  • 16. Distributed Cache API
      Creating a blob:
        storm blobstore create --file dict.txt --acl o::rwa --repl-fctr 2 key1
      Making it available to a topology:
        storm jar topo.jar my.topo.Class test_topo -c topology.blobstore.map='{"key1": {"localname":"dict.txt", "uncompress":"false"}}'
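Once a blob has been mapped to a local name, topology code can read it like any other local file. The following is a minimal sketch in plain Java with no Storm dependency. It assumes the blob is visible to the worker as a file named dict.txt in its working directory and that it holds one entry per line; the class name DictionaryLoader and the file format are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: loads a one-entry-per-line dictionary file into a set. */
final class DictionaryLoader {

    /** Reads the file as UTF-8, trims each line, and skips blank lines. */
    static Set<String> load(Path file) {
        Set<String> words = new HashSet<>();
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            // Surface I/O problems without forcing callers to declare IOException.
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Assumption: "dict.txt" is the localname given in topology.blobstore.map.
        Path file = Paths.get(args.length > 0 ? args[0] : "dict.txt");
        System.out.println(load(file).size() + " entries loaded from " + file);
    }
}
```

Because blobs can change over the lifetime of the topology, a component that caches the result would need to reload it at some point; how and when to do that is left out of this sketch.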
  • 18. Before HA Nimbus [diagram: ZooKeeper; a single Nimbus; four Supervisors; four Worker* processes]
  • 19. HA Nimbus [diagram: Pacemaker (ZooKeeper); three Nimbus instances, one marked Leader; four Supervisors; four Worker* processes]
  • 20. HA Nimbus - Failover [diagram: same layout; one Nimbus marked X; labels: Leader, Leader Election]
  • 21. HA Nimbus - Failover [diagram: same layout; one Nimbus marked X; another Nimbus marked Leader]
  • 22. HA Nimbus • Increase overall availability of Nimbus • Nimbus hosts can join/leave at any time • Leverages Distributed Cache API • Topology JAR, Config, and Serialized Topology uploaded to Distributed Cache • Replication guarantees availability of all files
  • 24. Streaming Windows • Specify Length - Duration or Tuple Count • Slide Interval - How often to advance the window
  • 25. Sliding Windows Windows can overlap {…} {…} {…} {…} {…} {…} {…} {…} {…} Time Window 1 Window 2
  • 26. Tumbling Windows Windows do not overlap {…} {…} {…} {…} {…} {…} {…} {…} {…} Time Window 1 Window 2 Window 3
  • 27. Streaming Windows • Timestamps (Event Time, Ingestion Time and Processing Time) • Out of Order Tuples • Watermarks • Window State Checkpointing
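The difference between sliding and tumbling windows described above comes down to how the slide interval relates to the window length. The sketch below illustrates this with count-based windows in plain Java. It is not Storm's windowing API: the class and method names are hypothetical, and it only emits full windows over a finite list, ignoring timestamps, out-of-order tuples, and watermarks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of count-based windows over a finite list of tuples. */
final class WindowSketch {

    /**
     * Returns every full window of `length` tuples, advancing the start of the
     * window by `slide` tuples each time. slide < length gives overlapping
     * (sliding) windows; slide == length gives non-overlapping (tumbling) windows.
     */
    static <T> List<List<T>> windows(List<T> tuples, int length, int slide) {
        if (length <= 0 || slide <= 0) {
            throw new IllegalArgumentException("length and slide must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start + length <= tuples.size(); start += slide) {
            result.add(new ArrayList<>(tuples.subList(start, start + length)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> stream = Arrays.asList(1, 2, 3, 4, 5, 6);
        // Sliding: length 4, slide 2, so consecutive windows share two tuples.
        System.out.println(windows(stream, 4, 2)); // [[1, 2, 3, 4], [3, 4, 5, 6]]
        // Tumbling: length 2, slide 2, so no tuple appears in two windows.
        System.out.println(windows(stream, 2, 2)); // [[1, 2], [3, 4], [5, 6]]
    }
}
```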
  • 28. State Management Stateful Bolts with Automatic Checkpointing
  • 29. State Management: What you see. [diagram: Spout, Stateful Bolt 1, Stateful Bolt 2, Bolt]
  • 30. State Management
      public class WordCountBolt extends BaseStatefulBolt<KeyValueState> {
          private KeyValueState wordCounts;
          private OutputCollector collector;

          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              this.collector = collector;
          }

          public void initState(KeyValueState state) {
              this.wordCounts = state;
          }

          public void execute(Tuple tuple) {
              String word = tuple.getString(0);
              Integer count = (Integer) wordCounts.get(word, 0);
              count++;
              wordCounts.put(word, count);
              collector.emit(new Values(word, count));
          }
      }
  • 31. State Management: Initialize State [same WordCountBolt code as slide 30, with the callout "Initialize State"]
  • 32. State Management: Read/Update State [same WordCountBolt code as slide 30, with the callout "Read/Update State"]
  • 34. State Management: Checkpointing/Snapshotting • Asynchronous Barrier Snapshotting (ABS) algorithm [1] • Chandy-Lamport Algorithm [2]
      [1] http://arxiv.org/pdf/1506.08603v1.pdf
      [2] http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
  • 35. State Management: Checkpointing/Snapshotting: What you see. [diagram: Spout, Stateful Bolt 1, Stateful Bolt 2, Bolt; labels: execute/update state, execute, execute/update state]
  • 36. State Management: Checkpointing/Snapshotting: What you get. [diagram: Checkpoint Spout, Spout, Stateful Bolt 1, Stateful Bolt 2, Bolt, ACKER, State Store; labels: $chkpt (x3), ACK (x3)]
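From a bolt author's point of view, the visible effect of checkpointing is that state can be restored to the last snapshot after a failure. The toy below shows only that effect for a word count in plain Java. It does not implement the ABS or Chandy-Lamport algorithms and does not use Storm's state interfaces; the class CheckpointedCounts and its checkpoint() and rollback() methods are hypothetical names for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for checkpointed key-value state. */
final class CheckpointedCounts {
    // Last snapshot that was "saved"; a real system would persist this externally.
    private Map<String, Integer> committed = new HashMap<>();
    // State as modified by tuples processed since the last snapshot.
    private Map<String, Integer> working = new HashMap<>();

    /** Increments the count for a word and returns the new value. */
    int increment(String word) {
        return working.merge(word, 1, Integer::sum);
    }

    /** Returns the current count, or 0 if the word has not been seen. */
    int get(String word) {
        return working.getOrDefault(word, 0);
    }

    /** Takes a snapshot of the current state. */
    void checkpoint() {
        committed = new HashMap<>(working);
    }

    /** Discards everything since the last snapshot, as after a failure. */
    void rollback() {
        working = new HashMap<>(committed);
    }

    public static void main(String[] args) {
        CheckpointedCounts counts = new CheckpointedCounts();
        counts.increment("storm");
        counts.increment("storm");
        counts.checkpoint();                      // snapshot holds storm=2
        counts.increment("storm");                // working state is storm=3
        counts.rollback();                        // simulate failure recovery
        System.out.println(counts.get("storm"));  // 2
    }
}
```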
  • 38. Automatic Backpressure • In previous Storm versions, the only way to throttle topologies was to enable ACKing and set topology.spout.max.pending. • If you did not require at-least-once guarantees, this imposed a significant performance penalty.** (** In Storm 1.0 this penalty is drastically reduced; more on this later.)
  • 39. Automatic Backpressure • High/Low Watermarks (expressed as % of buffer size) • Back Pressure thread monitors buffers • If High Watermark reached, slow down Spouts • If Low Watermark reached, stop throttling • All Spouts Supported
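The high/low watermark scheme above is a form of hysteresis: throttling switches on at one threshold and off at a lower one, so the decision does not flap when the buffer hovers near a single limit. The following plain-Java sketch shows that decision logic only. It is not Storm's implementation; the class name, the update() method, and the 0.9 / 0.4 thresholds used in main are assumptions for illustration.

```java
/** Hypothetical model of a high/low watermark throttle decision for one buffer. */
final class BackpressureSketch {
    private final int capacity;
    private final double highWatermark; // fraction of capacity that starts throttling
    private final double lowWatermark;  // fraction of capacity that stops throttling
    private boolean throttled;

    BackpressureSketch(int capacity, double highWatermark, double lowWatermark) {
        if (capacity <= 0 || lowWatermark < 0 || highWatermark > 1
                || lowWatermark >= highWatermark) {
            throw new IllegalArgumentException("need 0 <= low < high <= 1 and capacity > 0");
        }
        this.capacity = capacity;
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    /** Records the current buffer occupancy and returns whether spouts should be slowed. */
    boolean update(int queued) {
        double fill = (double) queued / capacity;
        if (!throttled && fill >= highWatermark) {
            throttled = true;   // high watermark reached: slow down spouts
        } else if (throttled && fill <= lowWatermark) {
            throttled = false;  // low watermark reached: stop throttling
        }
        return throttled;
    }

    public static void main(String[] args) {
        BackpressureSketch monitor = new BackpressureSketch(1000, 0.9, 0.4);
        int[] occupancy = {500, 950, 600, 300};
        for (int queued : occupancy) {
            System.out.println(queued + " queued -> throttled=" + monitor.update(queued));
        }
        // 600 queued still reports throttled=true: it is below the high watermark
        // but has not yet dropped to the low watermark.
    }
}
```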
  • 41. Resource Aware Scheduler • Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts) • Memory: On-Heap / Off-Heap (if off-heap is used) • CPU: Point system based on number of cores • Resource requirements are per component instance (parallelism matters)
  • 42. Resource Aware Scheduler • CPU and Memory availability described in storm.yaml on each supervisor node. E.g.:
      supervisor.memory.capacity.mb: 3072.0
      supervisor.cpu.capacity: 400.0
    • Convention for CPU capacity is to use 100 for each CPU core
  • 43. Resource Aware Scheduler Setting component resource requirements:
      SpoutDeclarer spout = builder.setSpout("sp1", new TestSpout(), 10);
      // set CPU requirement
      spout.setCPULoad(20);
      // set on-heap and off-heap memory requirement
      spout.setMemoryLoad(64, 16);
      BoltDeclarer bolt1 = builder.setBolt("b1", new MyBolt(), 3).shuffleGrouping("sp1");
      // sets CPU requirement. Not necessary to set both CPU and memory.
      // For requirements not set, a default value will be used
      bolt1.setCPULoad(15);
      BoltDeclarer bolt2 = builder.setBolt("b2", new MyBolt(), 2).shuffleGrouping("b1");
      bolt2.setMemoryLoad(100);
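Because requirements are per component instance, the total demand of a component is its per-instance figure multiplied by its parallelism. The sketch below works through that arithmetic in plain Java for the "sp1" spout from the slide (CPU load 20, 64 MB on-heap plus 16 MB off-heap, parallelism 10) against the example supervisor capacity (400 CPU points, 3072 MB). The class and method names are hypothetical, and the "instances that fit" figure is a simple upper bound, not how the scheduler actually places work.

```java
/** Hypothetical helper for resource arithmetic; not part of Storm. */
final class ResourceDemandSketch {

    /** Total CPU points requested by a component across all of its instances. */
    static double totalCpu(double cpuPerInstance, int parallelism) {
        return cpuPerInstance * parallelism;
    }

    /** Total memory (on-heap plus off-heap) requested across all instances, in MB. */
    static double totalMemoryMb(double onHeapMb, double offHeapMb, int parallelism) {
        return (onHeapMb + offHeapMb) * parallelism;
    }

    /** Upper bound on instances one supervisor could hold, limited by CPU or memory. */
    static int instancesThatFit(double cpuPerInstance, double memPerInstanceMb,
                                double cpuCapacity, double memCapacityMb) {
        double byCpu = Math.floor(cpuCapacity / cpuPerInstance);
        double byMemory = Math.floor(memCapacityMb / memPerInstanceMb);
        return (int) Math.min(byCpu, byMemory);
    }

    public static void main(String[] args) {
        // "sp1": setCPULoad(20), setMemoryLoad(64, 16), parallelism 10.
        System.out.println(totalCpu(20, 10));          // 200.0 CPU points
        System.out.println(totalMemoryMb(64, 16, 10)); // 800.0 MB
        // Supervisor: supervisor.cpu.capacity 400.0, supervisor.memory.capacity.mb 3072.0.
        System.out.println(instancesThatFit(20, 64 + 16, 400.0, 3072.0)); // 20 (CPU-bound)
    }
}
```

With the 100-points-per-core convention, the spout's 200 points correspond to roughly two cores' worth of CPU across its ten instances.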
  • 44. Storm Usability Improvements Enhanced Debugging and Monitoring of Topologies
  • 45. Dynamic Log Level Settings
  • 46. Dynamic Log Levels • Set log level setting for a running topology • Via Storm UI and CLI • Optional timeout after which changes will be reverted • Logs searchable from Storm UI/Logviewer
  • 48. Dynamic Log Levels Via Storm CLI: ./bin/storm set_log_level [topology name] -l [logger_name]=[LEVEL]:[TIMEOUT]
  • 49. Tuple Sampling • No more debug bolts or Trident functions! • In Storm UI: Select a Topology or component and click “Debug” • Specify a sampling percentage (% of tuples to be sampled) • Click on the “Events” link to view the sample log.
  • 50. Distributed Log Search • Search across all log files for a specific topology • Search in archived (ZIP) logs • Results include matches from all Supervisor nodes
  • 51. Dynamic Worker Profiling • Request worker profile data from Storm UI: • Heap Dumps • JStack Output • JProfile Recordings • Download generated files for off-line analysis • Restart workers from UI
  • 52. Supervisor Health Checks • Identify Supervisor nodes that are in a bad state • Automatically decommission bad nodes • Simple shell script • You define what constitutes “Unhealthy”
  • 53. New Integrations • Cassandra • Solr • Elasticsearch • MQTT
  • 54. Integration Improvements • Kafka • HDFS Spout • Avro Integration for HDFS • HBase • Hive
  • 57. Up to 16x faster throughput. Realistically 3x -- Highly dependent on use case
  • 58. > 60% Latency Reduction
  • 59. Bear in mind performance varies widely depending on the use case.
  • 60. The most important benchmarks are the ones you do.
  • 62. Clojure to Java Broadening the contributor base
  • 63. Clojure to Java Alibaba JStorm Contribution
  • 65. Thank you! P. Taylor Goetz, Hortonworks @ptgoetz