The Past, Present, and Future of
Apache Storm
DataWorks Summit 2017, Munich
P. Taylor Goetz, Hortonworks
@ptgoetz
About Me
• Tech Staff @ Hortonworks
• PMC Chair, Apache Storm
• ASF Member
• PMC: Apache Incubator, Apache Arrow, Apache Kylin, Apache Apex, Apache Eagle
• Mentor/PPMC: Apache Mynewt (Incubating), Apache Metron (Incubating), Apache Gossip (Incubating)
Apache Storm 0.9.x
Storm moves to Apache
Apache Storm 0.9.x
• First official Apache Release
• Storm becomes an Apache TLP
• 0mq to Netty for inter-worker communication
• Expanded Integration (Kafka, HDFS, HBase)
• Dependency conflict reduction (It was a start ;) )
Apache Storm 0.10.x
Enterprise Readiness
Apache Storm 0.10.x
• Security, Multi-Tenancy
• Enable Rolling Upgrades
• Flux (declarative topology wiring/configuration)
• Partial Key Groupings
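To illustrate the idea behind partial key grouping, the sketch below routes each tuple to the less loaded of two candidate tasks derived from its key, so a single hot key is spread over two tasks instead of one. It is a plain-Java illustration, not Storm's own grouping class; the class name `PartialKeyGroupingSketch` and the salted second hash are assumptions made for the example.

```java
import java.util.Arrays;

// Illustrative sketch only; not Storm's implementation.
class PartialKeyGroupingSketch {
    private final long[] sent; // tuples sent to each downstream task so far

    PartialKeyGroupingSketch(int numTasks) {
        this.sent = new long[numTasks];
    }

    // Derive two candidate tasks from the key and pick the less loaded one.
    int chooseTask(String key) {
        int first = Math.floorMod(key.hashCode(), sent.length);
        // Hypothetical second hash: the same key with a salt appended.
        int second = Math.floorMod((key + "#2").hashCode(), sent.length);
        int chosen = sent[first] <= sent[second] ? first : second;
        sent[chosen]++;
        return chosen;
    }

    long[] load() {
        return sent.clone();
    }

    public static void main(String[] args) {
        PartialKeyGroupingSketch grouping = new PartialKeyGroupingSketch(4);
        for (int i = 0; i < 1000; i++) {
            grouping.chooseTask("hot-key"); // a single, heavily skewed key
        }
        System.out.println(Arrays.toString(grouping.load()));
    }
}
```

Because a key can land on either of two tasks, downstream state for that key is split and has to be merged by a later aggregation step.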
Apache Storm 0.10.x
• Improved logging (Log4j 2)
• Streaming Ingest to Apache Hive
• Azure Event Hubs Integration
• Redis Integration
• JDBC Integration
Apache Storm 1.0
Maturity and Improved Performance
Release Date: April 12, 2016
Apache Storm 1.0
Pacemaker (Heartbeat Server)
• Replaces ZooKeeper for Heartbeats
• In-Memory key-value store
• Allows Scaling to 2k-3k+ Nodes
• Secure: Kerberos/Digest Authentication
Apache Storm 1.0
Distributed Cache API
• External storage for topology resources
• Dictionaries, ML Models, Geolocation Data, etc.
• No need to redeploy topologies for resource changes
• Allows sharing of files (BLOBs) among topologies
• Files can be updated from the command line
• Files can change over the lifetime of the topology
• Compression (e.g. zip, tar, gzip)
Apache Storm 1.0
HA Nimbus
• Eliminates “soft” point of failure: multiple Nimbus nodes with leader election on failure
• Increase overall availability of Nimbus
• Nimbus hosts can join/leave at any time
• Leverages Distributed Cache API
• Replication guarantees availability of all files
Apache Storm 1.0
State Management - Stateful Bolts
• Automatic checkpoints/snapshots
• Query/Update state
• Asynchronous Barrier Snapshotting (ABS) algorithm [1]
• Chandy-Lamport Algorithm [2]
[1] http://arxiv.org/pdf/1506.08603v1.pdf
[2] http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
Apache Storm 1.0
Streaming Windows
• Sliding, Tumbling
• Specify Length - Duration or Tuple Count
• Slide Interval - How often to advance the window
• Timestamps (Event Time, Ingestion Time and Processing Time)
• Out of Order Tuples, Late Tuples
• Watermarks
• Window State Checkpointing
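The relationship between window length and slide interval can be sketched for the tuple-count case. The helper below is a plain-Java illustration rather than Storm's windowing API; `CountWindowSketch` and `windows` are hypothetical names. A tumbling window is treated as the special case where the slide interval equals the window length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not Storm's windowing implementation.
class CountWindowSketch {
    // Emits a window every `slide` tuples; each window holds the most
    // recent `length` tuples (fewer while the stream is still warming up).
    static <T> List<List<T>> windows(List<T> tuples, int length, int slide) {
        List<List<T>> result = new ArrayList<>();
        for (int end = slide; end <= tuples.size(); end += slide) {
            int start = Math.max(0, end - length);
            result.add(new ArrayList<>(tuples.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> tuples = Arrays.asList(1, 2, 3, 4, 5, 6);
        // Sliding: length 4, advance every 2 tuples.
        System.out.println(windows(tuples, 4, 2));
        // Tumbling: slide interval equals the window length.
        System.out.println(windows(tuples, 3, 3));
    }
}
```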
Apache Storm 1.0
Automatic Backpressure
• High/Low Watermarks (expressed as % of buffer size)
• Back Pressure thread monitors buffers
• If High Watermark reached, throttle Spouts
• If Low Watermark reached, stop throttling
• All Spouts Supported
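The high/low watermark behaviour described above amounts to a hysteresis switch. The following plain-Java sketch shows that logic in isolation; it is not Storm's backpressure code, and the class name, the `update` method, and the 0.9/0.4 thresholds used in `main` are assumptions chosen for illustration.

```java
// Illustrative sketch only; not Storm's backpressure implementation.
class BackpressureSketch {
    private final int capacity;         // buffer size in tuples
    private final double highWatermark; // fraction of capacity that starts throttling
    private final double lowWatermark;  // fraction of capacity that stops throttling
    private boolean throttled = false;

    BackpressureSketch(int capacity, double highWatermark, double lowWatermark) {
        this.capacity = capacity;
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    // Called by a (hypothetical) monitor thread with the current queue depth.
    // Returns true while spouts should be throttled.
    boolean update(int queued) {
        double fill = (double) queued / capacity;
        if (!throttled && fill >= highWatermark) {
            throttled = true;   // high watermark reached: throttle spouts
        } else if (throttled && fill <= lowWatermark) {
            throttled = false;  // drained to low watermark: resume
        }
        return throttled;
    }

    public static void main(String[] args) {
        BackpressureSketch monitor = new BackpressureSketch(100, 0.9, 0.4);
        int[] depths = {50, 95, 60, 30};
        for (int depth : depths) {
            System.out.println(depth + " queued, throttled = " + monitor.update(depth));
        }
    }
}
```

Keeping the two thresholds apart prevents the spouts from flapping between throttled and unthrottled when the buffer hovers around a single limit.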
Apache Storm 1.0
Resource Aware Scheduler (RAS)
• Specify the resource requirements (Memory/CPU) for individual
topology components (Spouts/Bolts)
• Memory: On-Heap / Off-Heap (if off-heap is used)
• CPU: Point system based on number of cores
• Resource requirements are per component instance (parallelism
matters)
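Since requirements are declared per component instance, the total a topology asks for scales with parallelism. The arithmetic can be sketched as follows; `ResourceRequestSketch` is a hypothetical class, not part of Storm, and the figures in `main` are made-up example values.

```java
// Illustrative sketch only; the class and field names are hypothetical.
class ResourceRequestSketch {
    final double onHeapMb;   // on-heap memory per instance
    final double offHeapMb;  // off-heap memory per instance (0 if unused)
    final double cpuPoints;  // CPU points per instance
    final int parallelism;   // number of instances of the component

    ResourceRequestSketch(double onHeapMb, double offHeapMb, double cpuPoints, int parallelism) {
        this.onHeapMb = onHeapMb;
        this.offHeapMb = offHeapMb;
        this.cpuPoints = cpuPoints;
        this.parallelism = parallelism;
    }

    // Total memory the scheduler must find for this component.
    double totalMemoryMb() {
        return (onHeapMb + offHeapMb) * parallelism;
    }

    // Total CPU points the scheduler must find for this component.
    double totalCpuPoints() {
        return cpuPoints * parallelism;
    }

    public static void main(String[] args) {
        // Example: a bolt needing 512 MB on-heap, 128 MB off-heap and
        // 50 CPU points per instance, run with a parallelism of 4.
        ResourceRequestSketch bolt = new ResourceRequestSketch(512, 128, 50, 4);
        System.out.println("memory MB: " + bolt.totalMemoryMb());
        System.out.println("cpu points: " + bolt.totalCpuPoints());
    }
}
```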
Apache Storm 1.0
Usability Improvements
Enhanced Debugging and Monitoring of Topologies
Apache Storm 1.0
Dynamic Log Levels
• Set log level setting for a running topology
• Via Storm UI and CLI
• Optional timeout after which changes will be reverted
• Logs searchable from Storm UI/Logviewer
Apache Storm 1.0
On-Demand Tuple Sampling
• No more debug bolts or Trident functions!
• In Storm UI: Select a Topology or component and click “Debug”
• Specify a sampling percentage (% of tuples to be sampled)
• Click on the “Events” link to view the sample log
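What a sampling percentage means per tuple can be sketched as an independent random draw. This is a plain-Java illustration, not the code behind the Storm UI feature; `TupleSamplerSketch` and its seeded `Random` are assumptions made so the example is reproducible.

```java
import java.util.Random;

// Illustrative sketch only; not Storm's event sampling code.
class TupleSamplerSketch {
    private final double percentage; // 0 to 100, share of tuples to sample
    private final Random random;

    TupleSamplerSketch(double percentage, long seed) {
        this.percentage = percentage;
        this.random = new Random(seed);
    }

    // Decide independently for each tuple whether it is logged.
    boolean shouldSample() {
        return random.nextDouble() * 100.0 < percentage;
    }

    public static void main(String[] args) {
        TupleSamplerSketch sampler = new TupleSamplerSketch(10.0, 42L);
        int sampled = 0;
        for (int i = 0; i < 10_000; i++) {
            if (sampler.shouldSample()) {
                sampled++;
            }
        }
        System.out.println("sampled " + sampled + " of 10000 tuples");
    }
}
```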
Apache Storm 1.0
Distributed Log Search
• Search across all log files for a specific topology
• Search in archived (ZIP) logs
• Results include matches from all Supervisor nodes
Apache Storm 1.0
Dynamic Worker Profiling
• Request worker profile data from Storm UI:
• Heap Dumps
• JStack Output
• JProfile Recordings
• Download generated files for off-line analysis
• Restart workers from UI
Apache Storm 1.0
Supervisor Health Checks
• Identify Supervisor nodes that are in a bad state
• Automatically decommission bad nodes
• Simple shell script
• You define what constitutes “Unhealthy”
Apache Storm 1.0
Performance Improvements
• Up to 16x faster throughput
• > 60% latency reduction
Apache Storm 1.0
New Integration Components
• Cassandra
• Solr
• Elasticsearch
• MQTT
Storm 1.1.0
Release Date: March 29, 2017
Streaming SQL
• Powered by Apache Calcite
• Define a topology in a SQL text file
• Use storm sql command to deploy the topology
• Storm will compile the SQL into a Trident topology
Streaming SQL
• Stream from/to external data sources
• Apache Kafka
• HDFS
• MongoDB
• Redis
• Extend to additional systems via the ISqlTridentDataSource
interface
Streaming SQL
• Tuple filtering
• Projections
• User-defined parallelism of generated components
• CSV, TSV, and Avro input/output formats
• User Defined Functions (UDFs)
Streaming SQL - External Data Sources
CREATE EXTERNAL TABLE table_name field_list
[ STORED AS
INPUTFORMAT input_format_classname
OUTPUTFORMAT output_format_classname
]
LOCATION location
[ PARALLELISM parallelism ]
[ TBLPROPERTIES tbl_properties ]
[ AS select_stmt ]
Streaming SQL - Kafka Consumer (Spout)
CREATE EXTERNAL TABLE ORDERS
(ID INT PRIMARY KEY, UNIT_PRICE INT, QUANTITY INT)
LOCATION 'kafka://localhost:2181/brokers?topic=orders'
Streaming SQL - Kafka Producer
CREATE EXTERNAL TABLE LARGE_ORDERS
(ID INT PRIMARY KEY, TOTAL INT)
LOCATION 'kafka://localhost:2181/brokers?topic=large_orders'
TBLPROPERTIES '{
"producer":{
"bootstrap.servers":"localhost:9092",
"acks":"1",
"key.serializer":"org.apache.storm.kafka.IntSerializer",
"value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"
}
}'
Streaming SQL - Example
• Consume Apache HTTPD server logs from Kafka
• Filter out everything but errors
• Write the errors back to Kafka
SQL Example: Kafka Input
CREATE EXTERNAL TABLE APACHE_LOGS
(ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR,
REQUEST_METHOD VARCHAR, STATUS VARCHAR, REQUEST_HEADER_USER_AGENT VARCHAR,
TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_US DOUBLE)
LOCATION 'kafka://localhost:2181/brokers?topic=apache-logs'
SQL Example: Kafka Output
CREATE EXTERNAL TABLE APACHE_ERROR_LOGS
(ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR,
REQUEST_METHOD VARCHAR, STATUS INT, REQUEST_HEADER_USER_AGENT VARCHAR,
TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_ELAPSED_MS INT)
LOCATION 'kafka://localhost:2181/brokers?topic=apache-error-logs'
TBLPROPERTIES
'{"producer":
{"bootstrap.servers":"localhost:9092",
"acks":"1",
"key.serializer":"org.apache.storm.kafka.IntSerializer",
"value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
SQL Example: Insert Query
INSERT INTO APACHE_ERROR_LOGS
SELECT ID, REMOTE_IP, REQUEST_URL, REQUEST_METHOD,
CAST(STATUS AS INT) AS STATUS_INT, REQUEST_HEADER_USER_AGENT,
TIME_RECEIVED_UTC_ISOFORMAT,
(TIME_US / 1000) AS TIME_ELAPSED_MS
FROM APACHE_LOGS
WHERE (CAST(STATUS AS INT) / 100) >= 4
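The filter and the derived column in the query above can be restated in plain Java to make the arithmetic explicit. This is only a sketch of the per-row logic, assuming the division in the WHERE clause is integer division; `ErrorLogFilterSketch` and its method names are hypothetical and not generated by Storm SQL.

```java
// Illustrative sketch only; restates the query's per-row logic in Java.
class ErrorLogFilterSketch {
    // WHERE (CAST(STATUS AS INT) / 100) >= 4
    // Integer division maps 4xx and 5xx status codes to 4 and 5.
    static boolean isError(String status) {
        return Integer.parseInt(status) / 100 >= 4;
    }

    // (TIME_US / 1000) AS TIME_ELAPSED_MS: microseconds to milliseconds.
    static double elapsedMs(double timeUs) {
        return timeUs / 1000;
    }

    public static void main(String[] args) {
        String[] statuses = {"200", "302", "404", "500"};
        for (String status : statuses) {
            System.out.println(status + " kept = " + isError(status));
        }
        System.out.println("2500 us = " + elapsedMs(2500.0) + " ms");
    }
}
```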
Streaming SQL
User Defined Functions (UDFs)
Streaming SQL - Scalar UDF
public static class Add {
    public static Integer evaluate(Integer x, Integer y) {
        return x + y;
    }
}
CREATE FUNCTION ADD AS
'org.apache.storm.sql.example.udf.Add'
Streaming SQL - Aggregate UDF
public static class MyConcat {
    public static String init() {
        return "";
    }
    public static String add(String accumulator, String val) {
        return accumulator + val;
    }
    public static String result(String accumulator) {
        return accumulator;
    }
}
CREATE FUNCTION MYCONCAT AS
'org.apache.storm.sql.example.udf.MyConcat'
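The three methods of the MyConcat class suggest a fold: start from `init()`, call `add` once per value, and finish with `result`. The sketch below drives the class that way in plain Java. The `aggregate` helper is hypothetical and only illustrates the presumed calling order; it is not how Storm SQL itself invokes the UDF.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; the driver loop is an assumption about call order.
class AggregateUdfSketch {
    // Same shape as the MyConcat UDF on the slide.
    static class MyConcat {
        static String init() {
            return "";
        }
        static String add(String accumulator, String val) {
            return accumulator + val;
        }
        static String result(String accumulator) {
            return accumulator;
        }
    }

    // Hypothetical driver: fold the values through init/add/result.
    static String aggregate(List<String> values) {
        String accumulator = MyConcat.init();
        for (String value : values) {
            accumulator = MyConcat.add(accumulator, value);
        }
        return MyConcat.result(accumulator);
    }

    public static void main(String[] args) {
        System.out.println(aggregate(Arrays.asList("a", "b", "c")));
    }
}
```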
Apache Kafka Integration Improvements
• Two Kafka integration components:
• storm-kafka (for Kafka 0.8/0.9)
• storm-kafka-client (for Kafka 0.10 and later)
• storm-kafka-client provides a number of improvements over previous
Kafka integration…
Apache Kafka Integration Improvements
• Enhanced configuration API
• Fine-grained offset control
• Consumer group support
• Pluggable record translators
• Wildcard topics
• Support for multiple streams
• Integrates with Kafka security
PMML Support
• Predictive Model Markup Language
• Model produced by data mining/ML algorithms
• Bolt executes model and emits scores, predicted fields
• Store model in Distributed Cache to decouple from topology
Druid Integration
• High performance, column oriented, distributed data store
• Popular for realtime analytics and OLAP
• Storm bolt and Trident state implementation
OpenTSDB Integration
• Time series DB based on Apache HBase
• Storm bolt and Trident state implementation
• User-defined class to map Tuples to OpenTSDB data point:
• Metric name
• Tags (name/value collection)
• Timestamp
• Numeric value
AWS Kinesis Support
• Spout for consuming Kinesis records
• User-defined handlers
• User-defined failure handling
HDFS Spout
• Streams data from configured directory
• Parallelism through locking/ownership mechanism
• Post-processing “Archive” directory
• Corrupt file handling
• Kerberos support
Flux Improvements
• Visualization in Storm UI
• Stateful bolts and windowing
• Named streams
• Support for lists of references
Topology Deployment Enhancements
• Alternative to “uber jar”
• Options for storm jar command
--jars Upload local files
--artifacts Resolve/upload Maven dependencies
--artifactRepository Additional Maven repositories
Resource Aware Scheduler
• Specify the resource requirements (Memory/CPU) for individual
topology components (Spouts/Bolts)
• Memory: On-Heap / Off-Heap (if off-heap is used)
• CPU: Point system based on number of cores
• Resource requirements are per component instance (parallelism
matters)
RAS Improvements
• Topology Priorities
• Per-user Resources
• Topology Eviction Strategies
• Rack Awareness
Apache Storm 2.0 and Beyond
Clojure to Java
• Community expansion
• Alibaba JStorm contribution
Performance Improvements
• Disruptor Queue to JCTools (improved concurrency model)
• Thread model overhaul
Apache Beam Integration
• Unified API for dealing with
bounded/unbounded data sources
(i.e. batch/streaming)
• One API. Multiple implementations
(execution engines). Called
“Runners” in Beamspeak.
Bounded Spouts
• Necessary for Beam model
• Bridge the gap between bounded/unbounded (batch/streaming)
models
Metrics V2
• Based on Coda Hale’s metrics library
• Simplified user-defined metrics
• Multiple configurable reporters
Worker Classloader Isolation
Alternative to shading dependencies.
Improved Backpressure
Based on JStorm implementation.
Dynamic Topology Updates
Update topologies and configs without restart.
Thank you!
Questions?
P. Taylor Goetz, Hortonworks
@ptgoetz

Past, Present, and Future of Apache Storm

  • 1. The Past, Present, and Future of Apache Storm DataWorks Summit 2017, Munich P. Taylor Goetz, Hortonworks @ptgoetz
  • 2. About Me • Tech Staff @ Hortonworks • PMC Chair, Apache Storm • ASF Member • PMC: Apache Incubator, Apache Arrow, Apache Kylin, Apache Apex, Apache Eagle • Mentor/PPMC: Apache Mynewt (Incubating), Apache Metron (Incubating), Apache Gossip (Incubating)
  • 3. Apache Storm 0.9.x Storm moves to Apache
  • 4. Apache Storm 0.9.x • First official Apache Release • Storm becomes an Apache TLP • 0mq to Netty for inter-worker communication • Expanded Integration (Kafka, HDFS, HBase) • Dependency conflict reduction (It was a start ;) )
  • 6. Apache Storm 0.10.x • Security, Multi-Tenancy • Enable Rolling Upgrades • Flux (declarative topology wiring/configuration) • Partial Key Groupings
  • 7. Apache Storm 0.10.x • Improved logging (Log4j 2) • Streaming Ingest to Apache Hive • Azure Event Hubs Integration • Redis Integration • JDBC Integration
  • 8. Apache Storm 1.0 Maturity and Improved Performance Release Date: April 12, 2016
  • 9. Apache Storm 1.0 Pacemaker (Heartbeat Server) • Replaces Zookeeper for Heartbeats • In-Memory key-value store • Allows Scaling to 2k-3k+ Nodes • Secure: Kerberos/Digest Authentication
  • 10. Apache Storm 1.0 Distributed Cache API • External storage for topology resources • Dictionaries, ML Models, Geolocation Data, etc. • No need to redeploy topologies for resource changes • Allows sharing of files (BLOBs) among topologies • Files can be updated from the command line • Files can change over the lifetime of the topology • Compression (e.g. zip, tar, gzip)
  • 11. Apache Storm 1.0 HA Nimbus • Eliminates “soft” point of failure — Multiple Nimbus nodes w/Leader Election on failure • Increase overall availability of Nimbus • Nimbus hosts can join/leave at any time • Leverages Distributed Cache API • Replication guarantees availability of all files
  • 12. Apache Storm 1.0 State Management - Stateful Bolts • Automatic checkpoints/snapshots • Query/Update state • Asynchronous Barrier Snapshotting (ABS) algorithm [1] • Chandy-Lamport Algorithm [2] [1] http://arxiv.org/pdf/1506.08603v1.pdf [2] http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
  • 13. Apache Storm 1.0 Streaming Windows • Sliding, Tumbling • Specify Length - Duration or Tuple Count • Slide Interval - How often to advance the window • Timestamps (Event Time, Ingestion Time and Processing Time) • Out of Order Tuples, Late Tuples • Watermarks • Window State Checkpointing
  • 14. Apache Storm 1.0 Automatic Backpressure • High/Low Watermarks (expressed as % of buffer size) • Back Pressure thread monitors buffers • If High Watermark reached, throttle Spouts • If Low Watermark reached, stop throttling • All Spouts Supported
  • 15. Apache Storm 1.0 Resource Aware Scheduler (RAS) • Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts) • Memory: On-Heap / Off-Heap (if off-heap is used) • CPU: Point system based on number of cores • Resource requirements are per component instance (parallelism matters)
  • 16. Apache Storm 1.0 Usability Improvements Enhanced Debugging and Monitoring of Topologies
  • 17. Apache Storm 1.0 Dynamic Log Levels • Set log level setting for a running topology • Via Storm UI and CLI • Optional timeout after which changes will be reverted • Logs searchable from Storm UI/Logviewer
  • 18. Apache Storm 1.0 On-Demand Tuple Sampling • No more debug bolts or Trident functions! • In Storm UI: Select a Topology or component and click “Debug” • Specify a sampling percentage (% of tuples to be sampled) • Click on the “Events” link to view the sample log
  • 19. Apache Storm 1.0 Distributed Log Search • Search across all log files for a specific topology • Search in archived (ZIP) logs • Results include matches from all Supervisor nodes
  • 20. Apache Storm 1.0 Dynamic Worker Profiling • Request worker profile data from Storm UI: • Heap Dumps • JStack Output • JProfile Recordings • Download generated files for off-line analysis • Restart workers from UI
  • 21. Apache Storm 1.0 Supervisor Health Checks • Identify Supervisor nodes that are in a bad state • Automatically decommission bad nodes • Simple shell script • You define what constitutes “Unhealthy”
  • 22. Apache Storm 1.0 Performance Improvements • Up to 16x faster throughput • > 60% latency reduction
  • 23. Apache Storm 1.0 New Integration Components • Cassandra • Solr • Elastic Search • MQTT
  • 24. Storm 1.1.0 Release Date: March 29, 2017
  • 26. Streaming SQL • Powered by Apache Calcite • Define a topology in a SQL text file • Use storm sql command to deploy the topology • Storm will compile the SQL into a Trident topology
  • 27. Streaming SQL • Stream from/to external data sources • Apache Kafka • HDFS • MongoDB • Redis • Extend to additional systems via the ISqlTridentDataSource interface
  • 28. Streaming SQL • Tuple filtering • Projections • User-defined parallelism of generated components • CSV, TSV, and Avro input/output formats • User Defined Functions (UDFs)
  • 29. Streaming SQL - External Data Sources CREATE EXTERNAL TABLE table_name field_list [ STORED AS INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname ] LOCATION location [ PARALLELISM parallelism ] [ TBLPROPERTIES tbl_properties ] [ AS select_stmt ]
  • 30. Streaming SQL - Kafka Consumer (Spout) CREATE EXTERNAL TABLE ORDERS (ID INT PRIMARY KEY, UNIT_PRICE INT, QUANTITY INT) LOCATION 'kafka://localhost:2181/brokers?topic=orders'
  • 31. Streaming SQL - Kafka Producer CREATE EXTERNAL TABLE LARGE_ORDERS (ID INT PRIMARY KEY, TOTAL INT) LOCATION 'kafka://localhost:2181/brokers?topic=large_orders' TBLPROPERTIES '{ "producer":{ "bootstrap.servers":"localhost:9092", "acks":"1", "key.serializer":"org.apache.storm.kafka.IntSerializer", "value.serializer":"org.apache.storm.kafka.ByteBufferSerializer" } }'
  • 32. Streaming SQL - Example • Consume Apache HTTPD server logs from Kafka • Filter out everything but errors • Write the errors back to Kafka
  • 33. SQL Example: Kafka Input CREATE EXTERNAL TABLE APACHE_LOGS (ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR, REQUEST_METHOD VARCHAR, STATUS VARCHAR, REQUEST_HEADER_USER_AGENT VARCHAR, TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_US DOUBLE) LOCATION 'kafka://localhost:2181/brokers?topic=apache-logs'
  • 34. SQL Example: Kafka Output CREATE EXTERNAL TABLE APACHE_ERROR_LOGS (ID INT PRIMARY KEY, REMOTE_IP VARCHAR, REQUEST_URL VARCHAR, REQUEST_METHOD VARCHAR, STATUS INT, REQUEST_HEADER_USER_AGENT VARCHAR, TIME_RECEIVED_UTC_ISOFORMAT VARCHAR, TIME_ELAPSED_MS INT) LOCATION 'kafka://localhost:2181/brokers?topic=apache-error-logs' TBLPROPERTIES '{"producer": {"bootstrap.servers":"localhost:9092", "acks":"1", "key.serializer":"org.apache.storm.kafka.IntSerializer", "value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
  • 35. SQL Example: Insert Query INSERT INTO APACHE_ERROR_LOGS SELECT ID, REMOTE_IP, REQUEST_URL, REQUEST_METHOD, CAST(STATUS AS INT) AS STATUS_INT, REQUEST_HEADER_USER_AGENT, TIME_RECEIVED_UTC_ISOFORMAT, (TIME_US / 1000) AS TIME_ELAPSED_MS FROM APACHE_LOGS WHERE (CAST(STATUS AS INT) / 100) >= 4
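The filter and projection in the query above can be read as two small per-row functions. The plain-Java sketch below mirrors them outside of Storm. `ErrorLogFilterSketch` and its method names are hypothetical, and it assumes that the SQL division on integers truncates and that the `DOUBLE` result is truncated when written to the `INT` column.

```java
// Hypothetical illustration class, not part of Storm SQL.
class ErrorLogFilterSketch {
    // Mirrors: WHERE (CAST(STATUS AS INT) / 100) >= 4
    // Keeps HTTP 4xx and 5xx responses, assuming integer division.
    static boolean isError(String status) {
        return Integer.parseInt(status) / 100 >= 4;
    }

    // Mirrors: (TIME_US / 1000) AS TIME_ELAPSED_MS
    // Converts microseconds to milliseconds, truncated to an int (an assumption).
    static int elapsedMs(double timeUs) {
        return (int) (timeUs / 1000);
    }

    public static void main(String[] args) {
        String[] statuses = {"200", "399", "404", "500"};
        for (String status : statuses) {
            System.out.println(status + " kept=" + isError(status));
        }
        System.out.println("2500.0 us -> " + elapsedMs(2500.0) + " ms");
    }
}
```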
  • 36. Streaming SQL User Defined Functions (UDFs)
  • 37. Streaming SQL - Scalar UDF public static class Add { public static Integer evaluate(Integer x, Integer y) { return x + y; } } CREATE FUNCTION ADD AS 'org.apache.storm.sql.example.udf.Add'
  • 38. Streaming SQL - Aggregate UDF public static class MyConcat { public static String init() { return ""; } public static String add(String accumulator, String val) { return accumulator + val; } public static String result(String accumulator) { return accumulator; } } CREATE FUNCTION MYCONCAT AS 'org.apache.storm.sql.example.udf.MyConcat'
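The three methods of the aggregate UDF form an init/add/result contract: start from an initial accumulator, fold each value into it, then convert the accumulator into the final result. The sketch below exercises that contract in plain Java. `MyConcat` is copied from the slide as a top-level class, while `AggregateUdfDriver` is a hypothetical stand-in for whatever the engine does internally and says nothing about how Storm SQL actually invokes UDFs.

```java
import java.util.Arrays;
import java.util.List;

// Aggregate UDF from the slide, written as a top-level class.
class MyConcat {
    static String init() {
        return ""; // initial accumulator
    }

    static String add(String accumulator, String val) {
        return accumulator + val; // fold one value into the accumulator
    }

    static String result(String accumulator) {
        return accumulator; // final value of the aggregate
    }
}

// Hypothetical driver, for illustration only.
class AggregateUdfDriver {
    static String aggregate(List<String> values) {
        String accumulator = MyConcat.init();
        for (String value : values) {
            accumulator = MyConcat.add(accumulator, value);
        }
        return MyConcat.result(accumulator);
    }

    public static void main(String[] args) {
        System.out.println(aggregate(Arrays.asList("a", "b", "c")));
    }
}
```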
  • 39. Apache Kafka Integration Improvements • Two Kafka integration components: • storm-kafka (for Kafka 0.8/0.9) • storm-kafka-client (for Kafka 0.10 and later) • storm-kafka-client provides a number of improvements over previous Kafka integration…
  • 40. Apache Kafka Integration Improvements • Enhanced configuration API • Fine-grained offset control • Consumer group support • Pluggable record translators • Wildcard topics • Support for multiple streams • Integrates with Kafka security
  • 41. PMML Support • Predictive Model Markup Language • Model produced by data mining/ML algorithms • Bolt executes model and emits scores, predicted fields • Store model in Distributed Cache to decouple from topology
  • 42. Druid Integration • High performance, column oriented, distributed data store • Popular for realtime analytics and OLAP • Storm bolt and Trident state implementation
  • 43. OpenTSDB Integration • Time series DB based on Apache HBase • Storm bolt and Trident state implementation • User-defined class to map Tuples to OpenTSDB data point: • Metric name • Tags (name/value collection) • Timestamp • Numeric value
  • 44. AWS Kinesis Support • Spout for consuming Kinesis records • User-defined handlers • User-defined failure handling
  • 45. HDFS Spout • Streams data from configured directory • Parallelism through locking/ownership mechanism • Post-processing “Archive” directory • Corrupt file handling • Kerberos support
  • 46. Flux Improvements • Visualization in Storm UI • Stateful bolts and windowing • Named streams • Support for lists of references
  • 47. Topology Deployment Enhancements • Alternative to “uber jar” • Options for storm jar command • --jars: Upload local files • --artifacts: Resolve/upload Maven dependencies • --artifactRepository: Additional Maven repositories
  • 48. Resource Aware Scheduler • Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts) • Memory: On-Heap / Off-Heap (if off-heap is used) • CPU: Point system based on number of cores • Resource requirements are per component instance (parallelism matters)
  • 49. RAS Improvements • Topology Priorities • Per-user Resources • Topology Eviction Strategies • Rack Awareness
  • 50. Apache Storm 2.0 and Beyond
  • 51. Clojure to Java • Community expansion • Alibaba JStorm contribution
  • 52. Performance Improvements • Disruptor Queue to JCTools (improved concurrency model) • Thread model overhaul
  • 53. Apache Beam Integration • Unified API for dealing with bounded/unbounded data sources (i.e. batch/streaming) • One API. Multiple implementations (execution engines). Called “Runners” in Beamspeak.
  • 54. Bounded Spouts • Necessary for Beam model • Bridge the gap between bounded/unbounded (batch/streaming) models
  • 55. Metrics V2 • Based on Coda Hale’s metrics library • Simplified user-defined metrics • Multiple configurable reporters
  • 56. Worker Classloader Isolation Alternative to shading dependencies.
  • 57. Improved Backpressure Based on JStorm implementation.
  • 58. Dynamic Topology Updates Update topologies and configs without restart.
  • 59. Thank you! Questions? P. Taylor Goetz, Hortonworks @ptgoetz

Editor's Notes

  1. Google's Dataflow API was recently open-sourced and contributed to Apache.