4. Apache Storm 0.9.x
• First official Apache Release
• Storm becomes an Apache TLP
• 0mq to Netty for inter-worker communication
• Expanded Integration (Kafka, HDFS, HBase)
• Dependency conflict reduction (It was a start ;) )
9. Apache Storm 1.0
Pacemaker (Heartbeat Server)
• Replaces Zookeeper for Heartbeats
• In-Memory key-value store
• Allows Scaling to 2k-3k+ Nodes
• Secure: Kerberos/Digest Authentication
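Pacemaker is switched on through a few storm.yaml settings; a minimal sketch (key names as given in the Storm 1.x Pacemaker docs — the hostname is a placeholder, and values should be verified against your release):

```yaml
# Route worker heartbeats to Pacemaker instead of ZooKeeper
storm.cluster.state.store: "org.apache.storm.pacemaker.pacemaker_state_factory"
pacemaker.host: "pacemaker.example.com"
pacemaker.port: 6699
# Optional authentication: NONE, DIGEST, or KERBEROS
pacemaker.auth.method: "NONE"
```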
10. Apache Storm 1.0
Distributed Cache API
• External storage for topology resources
• Dictionaries, ML Models, Geolocation Data, etc.
• No need to redeploy topologies for resource changes
• Allows sharing of files (BLOBs) among topologies
• Files can be updated from the command line
• Files can change over the lifetime of the topology
• Compression (e.g. zip, tar, gzip)
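The file operations above map onto the blobstore CLI; a hedged sketch (key, file, and topology names are made up; flags per the Storm 1.x distributed cache docs):

```shell
# Upload a dictionary as a shared blob (grant others read access)
storm blobstore create --file dict.txt --acl o::r key1

# Submit a topology that maps the blob to a local file name
storm jar mytopo.jar com.example.MyTopo \
  -c topology.blobstore.map='{"key1":{"localname":"dict.txt","uncompress":false}}'

# Later, update the blob in place -- no topology redeploy needed
storm blobstore update --file dict-v2.txt key1
```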
11. Apache Storm 1.0
HA Nimbus
• Eliminates “soft” point of failure — multiple Nimbus nodes with leader election on failure
• Increase overall availability of Nimbus
• Nimbus hosts can join/leave at any time
• Leverages Distributed Cache API
• Replication guarantees availability of all files
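The candidate Nimbus hosts are listed in storm.yaml; a minimal sketch (hostnames are placeholders):

```yaml
# Replaces the single nimbus.host setting from pre-1.0 releases;
# any host in the list can be elected leader
nimbus.seeds: ["nimbus1.example.com", "nimbus2.example.com"]
```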
13. Apache Storm 1.0
Streaming Windows
• Sliding, Tumbling
• Specify Length - Duration or Tuple Count
• Slide Interval - How often to advance the window
• Timestamps (Event Time, Ingestion Time and Processing Time)
• Out of Order Tuples, Late Tuples
• Watermarks
• Window State Checkpointing
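In Storm the window length and slide interval are set on a windowed bolt (via BaseWindowedBolt.withWindow); the count-based semantics can be illustrated with a small standalone sketch independent of Storm's API (class and method names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of count-based windows: the window is evaluated
// every `slide` tuples and covers the last `length` tuples.
// A tumbling window is the special case where slide == length.
public class WindowDemo {
    static List<List<Integer>> slidingWindows(List<Integer> tuples, int length, int slide) {
        List<List<Integer>> windows = new ArrayList<>();
        for (int end = slide; end <= tuples.size(); end += slide) {
            int start = Math.max(0, end - length);
            windows.add(new ArrayList<>(tuples.subList(start, end)));
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Integer> tuples = List.of(1, 2, 3, 4, 5, 6);
        // Sliding: length 4, slide 2 -> [[1, 2], [1, 2, 3, 4], [3, 4, 5, 6]]
        System.out.println(slidingWindows(tuples, 4, 2));
        // Tumbling: length 2, slide 2 -> [[1, 2], [3, 4], [5, 6]]
        System.out.println(slidingWindows(tuples, 2, 2));
    }
}
```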
14. Apache Storm 1.0
Automatic Backpressure
• High/Low Watermarks (expressed as % of buffer size)
• Back Pressure thread monitors buffers
• If High Watermark reached, throttle Spouts
• If Low Watermark reached, stop throttling
• All Spouts Supported
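The watermarks are plain configuration; a sketch of the relevant storm.yaml keys (values here are illustrative; key names per the Storm 1.0 defaults):

```yaml
topology.backpressure.enable: true
# Throttle spouts once a receive buffer passes 90% full...
backpressure.disruptor.high.watermark: 0.9
# ...and release the throttle once it drains below 40%
backpressure.disruptor.low.watermark: 0.4
```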
15. Apache Storm 1.0
Resource Aware Scheduler (RAS)
• Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts)
• Memory: On-Heap / Off-Heap (if off-heap is used)
• CPU: Point system based on number of cores
• Resource requirements are per component instance (parallelism matters)
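Requirements are declared when wiring the topology; a sketch assuming storm-core on the classpath (component names and numbers are illustrative, not runnable standalone):

```java
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new EventSpout(), 2)
       .setMemoryLoad(512.0)          // 512 MB on-heap per executor
       .setCPULoad(50.0);             // 50 points = half a core
builder.setBolt("enrich", new EnrichBolt(), 4)
       .setMemoryLoad(256.0, 128.0)   // on-heap, off-heap (MB) per executor
       .setCPULoad(100.0)             // one full core
       .shuffleGrouping("events");
```

Because the loads are per executor, raising a component's parallelism multiplies its total resource footprint — hence "parallelism matters".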
17. Apache Storm 1.0
Dynamic Log Levels
• Set the log level for a running topology
• Via Storm UI and CLI
• Optional timeout after which changes will be reverted
• Logs searchable from Storm UI/Logviewer
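From the CLI this looks like the following (topology and logger names are placeholders; syntax per the Storm 1.0 docs):

```shell
# Raise com.example to DEBUG on topology "my_topology",
# automatically reverting after 30 seconds
storm set_log_level my_topology -l com.example=DEBUG:30

# Remove the override explicitly
storm set_log_level my_topology -r com.example
```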
18. Apache Storm 1.0
On-Demand Tuple Sampling
• No more debug bolts or Trident functions!
• In Storm UI: Select a Topology or component and click “Debug”
• Specify a sampling percentage (% of tuples to be sampled)
• Click on the “Events” link to view the sample log
19. Apache Storm 1.0
Distributed Log Search
• Search across all log files for a specific topology
• Search in archived (ZIP) logs
• Results include matches from all Supervisor nodes
20. Apache Storm 1.0
Dynamic Worker Profiling
• Request worker profile data from Storm UI:
• Heap Dumps
• JStack Output
• JProfile Recordings
• Download generated files for off-line analysis
• Restart workers from UI
21. Apache Storm 1.0
Supervisor Health Checks
• Identify Supervisor nodes that are in a bad state
• Automatically decommission bad nodes
• Simple shell script
• You define what constitutes “Unhealthy”
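A health check is any script placed in the directory named by storm.health.check.dir that prints a line beginning with “ERROR” when the node is unhealthy (convention as described in the Storm health-check docs); a minimal sketch:

```shell
#!/bin/bash
# Flag this Supervisor node as unhealthy when the root filesystem
# is over 90% full; print nothing when the node is healthy.
usage=$(df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }')
if [ "$usage" -gt 90 ]; then
    echo "ERROR: root filesystem is ${usage}% full"
fi
```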
26. Streaming SQL
• Powered by Apache Calcite
• Define a topology in a SQL text file
• Use storm sql command to deploy the topology
• Storm will compile the SQL into a Trident topology
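Deployment is a one-liner (file and topology names are placeholders):

```shell
# Compile the SQL statements in the file into a Trident topology
# and submit it to the cluster under the given name
storm sql apache_log_error_filtering.sql apache_log_error_filtering
```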
27. Streaming SQL
• Stream from/to external data sources
• Apache Kafka
• HDFS
• MongoDB
• Redis
• Extend to additional systems via the ISqlTridentDataSource interface
28. Streaming SQL
• Tuple filtering
• Projections
• User-defined parallelism of generated components
• CSV, TSV, and Avro input/output formats
• User Defined Functions (UDFs)
35. SQL Example: Insert Query
INSERT INTO APACHE_ERROR_LOGS
SELECT ID, REMOTE_IP, REQUEST_URL, REQUEST_METHOD,
CAST(STATUS AS INT) AS STATUS_INT, REQUEST_HEADER_USER_AGENT,
TIME_RECEIVED_UTC_ISOFORMAT,
(TIME_US / 1000) AS TIME_ELAPSED_MS
FROM APACHE_LOGS
WHERE (CAST(STATUS AS INT) / 100) >= 4
37. Streaming SQL - Scalar UDF
public static class Add {
    public static Integer evaluate(Integer x, Integer y) {
        return x + y;
    }
}
CREATE FUNCTION ADD AS
'org.apache.storm.sql.example.udf.Add'
38. Streaming SQL - Aggregate UDF
public static class MyConcat {
    public static String init() {
        return "";
    }
    public static String add(String accumulator, String val) {
        return accumulator + val;
    }
    public static String result(String accumulator) {
        return accumulator;
    }
}
CREATE FUNCTION MYCONCAT AS
'org.apache.storm.sql.example.udf.MyConcat'
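The three-method contract is a fold: init yields the empty accumulator, add is applied once per row in the group, and result finishes the aggregate. A standalone check of those semantics, reusing the MyConcat class from the slide (the demo wrapper and the hand-rolled loop are illustrative, not Storm code):

```java
public class MyConcatDemo {
    // Aggregate UDF from the slide: init / add / result contract
    public static class MyConcat {
        public static String init() { return ""; }
        public static String add(String accumulator, String val) { return accumulator + val; }
        public static String result(String accumulator) { return accumulator; }
    }

    public static void main(String[] args) {
        // Simulate how the engine folds a group of rows through the UDF
        String acc = MyConcat.init();
        for (String row : new String[] {"a", "b", "c"}) {
            acc = MyConcat.add(acc, row);
        }
        System.out.println(MyConcat.result(acc)); // abc
    }
}
```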
39. Apache Kafka Integration Improvements
• Two Kafka integration components:
• storm-kafka (for Kafka 0.8/0.9)
• storm-kafka-client (for Kafka 0.10 and later)
• storm-kafka-client provides a number of improvements over previous Kafka integration…
40. Apache Kafka Integration Improvements
• Enhanced configuration API
• Fine-grained offset control
• Consumer group support
• Pluggable record translators
• Wildcard topics
• Support for multiple streams
• Integrates with Kafka security
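A sketch of the enhanced configuration API, assuming storm-kafka-client on the classpath (broker, topic, and group names are placeholders; method names should be checked against your storm-kafka-client version):

```java
KafkaSpoutConfig<String, String> conf =
    KafkaSpoutConfig.builder("kafka1:9092,kafka2:9092", "events")
        .setGroupId("analytics")   // consumer-group support
        // fine-grained offset control: where to start when no commit exists
        .setFirstPollOffsetStrategy(
            KafkaSpoutConfig.FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
        .build();
builder.setSpout("kafka", new KafkaSpout<>(conf), 4);
```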
41. PMML Support
• Predictive Model Markup Language
• Model produced by data mining/ML algorithms
• Bolt executes model and emits scores, predicted fields
• Store model in Distributed Cache to decouple from topology
42. Druid Integration
• High performance, column oriented, distributed data store
• Popular for realtime analytics and OLAP
• Storm bolt and Trident state implementation
43. OpenTSDB Integration
• Time series DB based on Apache HBase
• Storm bolt and Trident state implementation
• User-defined class to map Tuples to OpenTSDB data point:
• Metric name
• Tags (name/value collection)
• Timestamp
• Numeric value
44. AWS Kinesis Support
• Spout for consuming Kinesis records
• User-defined handlers
• User-defined failure handling
45. HDFS Spout
• Streams data from configured directory
• Parallelism through locking/ownership mechanism
• Post-processing “Archive” directory
• Corrupt file handling
• Kerberos support
47. Topology Deployment Enhancements
• Alternative to “uber jar”
• Options for storm jar command
--jars               Upload local files
--artifacts          Resolve/upload Maven dependencies
--artifactRepository Additional Maven repositories
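Putting the options together (jar, class, artifact coordinates, and repository URL are placeholders; the `name^url` repository format follows the Storm 1.1 docs):

```shell
storm jar my-topology.jar com.example.MyTopology \
  --jars "local-dep1.jar,local-dep2.jar" \
  --artifacts "org.apache.kafka:kafka-clients:0.10.2.1" \
  --artifactRepository "internal^http://repo.example.com/maven2"
```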
48. Resource Aware Scheduler
• Specify the resource requirements (Memory/CPU) for individual topology components (Spouts/Bolts)
• Memory: On-Heap / Off-Heap (if off-heap is used)
• CPU: Point system based on number of cores
• Resource requirements are per component instance (parallelism matters)
53. Apache Beam Integration
• Unified API for dealing with bounded/unbounded data sources (i.e. batch/streaming)
• One API, multiple implementations (execution engines), called “Runners” in Beamspeak
54. Bounded Spouts
• Necessary for Beam model
• Bridge the gap between bounded/unbounded (batch/streaming)
models
55. Metrics V2
• Based on Coda Hale’s metrics library
• Simplified user-defined metrics
• Multiple configurable reporters