Presented by Michael Noll, Product Manager, Confluent.
Why are there so many stream processing frameworks that each define their own terminology? Are the components of each comparable? Why do you need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a full framework at all.
Processing and understanding your data to create business value is the ultimate goal of a stream data platform. In this talk we will survey the stream processing landscape, the dimensions along which to evaluate stream processing technologies, and how they integrate with Apache Kafka. In particular, we will learn how Kafka Streams, the built-in stream processing engine of Apache Kafka, compares to other stream processing systems that require a separate processing infrastructure.
A Practical Guide to Selecting a Stream Processing Technology
1. A Practical Guide to Selecting a Stream Processing Technology
Michael G. Noll
Product Manager, Confluent
2. Kafka Talk Series
Date | Title
Sep 27 | Introduction to Streaming Data and Stream Processing with Apache Kafka
Oct 06 | Deep Dive into Apache Kafka
Oct 27 | Data Integration with Apache Kafka
Nov 17 | Demystifying Stream Processing with Apache Kafka
Dec 01 | A Practical Guide to Selecting a Stream Processing Technology
Dec 15 | Streaming in Practice: Putting Apache Kafka in Production
https://www.confluent.io/apache-kafka-talk-series
3. Agenda
• Recap: What is Stream Processing?
• The Three Pillars of Stream Processing in Practice
• Key Selection Criteria
• Organizational/Non-Technical Dimensions
• Technical Dimensions
• Summary
42. Streams & Databases
• A stream processing technology must have first-class
support for Streams and Tables
• With scalability, fault tolerance, …
• Why? Because most use cases require not just one, but both!
• Support – or lack thereof – strongly impacts the resulting
technical architecture and development efforts
• No support means:
• Painful Do-It-Yourself
• Increased complexity, more moving pieces to juggle
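The stream/table relationship the slide insists on can be sketched in a few lines of plain Python (an illustrative sketch only, not the Kafka Streams API): a table is the latest value per key materialized from a changelog stream, and a table can be read back out as a stream.

```python
# Illustrative sketch of the stream/table duality (not the Kafka Streams API).

def table_from_stream(changelog):
    """Materialize a table as the latest value per key in the changelog."""
    table = {}
    for key, value in changelog:
        table[key] = value  # upsert: later records overwrite earlier ones
    return table

def stream_from_table(table):
    """Read a table back out as a changelog stream of (key, value) records."""
    return list(table.items())

changelog = [("alice", 1), ("bob", 5), ("alice", 3)]
table = table_from_stream(changelog)
# Only the latest value per key survives in the table.
```

Without first-class support for this duality, applications end up hand-rolling exactly this kind of materialization logic, plus the scalability and fault tolerance around it.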
45. Organizational/Non-Tech Dimensions
• Can your org understand and leverage the technology?
• Familiarity with languages; intuitive concepts and APIs; training
• Are you permitted to use it in your organization?
• Security features, licensing, open source vs. proprietary
• Can you continue to use it in the future?
• Longevity of technology, licensing, vendor strength
46. Organizational/Non-Tech Dimensions
• Do you believe in the long-term vision?
• Switching technologies in an organization is often expensive/slow:
legacy migration, re-training, resistance to change, etc.
• What is the path and time to success?
• Can you move smoothly and quickly from proof-of-concept to
production?
• Areas and range of applicability in your organization
• General-purpose vs. niche technology
• Viable for S/M/L/XL use cases vs. for XL use cases only
• Building core business apps vs. doing backend analytics
47. Organizational/Non-Tech Dimensions
Licensing · Vision/Roadmap · ROI · Impact on Organization · Broad vs. Niche Applicability · Time to Market · Professional Services · Documentation · Examples · User Community · Learning Curve · Impact on Tools, Infrastructure, …
50. State
• Stateful processing of any kind requires…state
• Many (most?) use cases for stream processing are stateful
• Joins, aggregations, windowing, counting, ...
• Is state performant? Local vs. remote state?
• Is state fault-tolerant? How fast is recovery/failover?
• Is state interactively queryable?
• Kafka: ready for use (GA)
• Spark, Flink: under development (alpha)
• Storm, Samza, and others: not available
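The fault-tolerance question above can be made concrete with a sketch of the common design (assumed here, modeled loosely on changelog-backed state stores such as those in Kafka Streams): updates go to fast local state and are also appended to a durable changelog, so failover is a matter of replaying that log.

```python
# Sketch of fault-tolerant local state: an in-memory store whose updates are
# mirrored to a durable changelog, which is replayed on recovery.
# This is an assumed design, not any framework's actual API.

class LocalStore:
    def __init__(self):
        self.state = {}        # fast local state (in practice: memory or RocksDB)
        self.changelog = []    # every update also lands in a durable log

    def update(self, key, delta):
        self.state[key] = self.state.get(key, 0) + delta
        self.changelog.append((key, self.state[key]))  # record the new value

    @classmethod
    def recover(cls, changelog):
        """Failover: rebuild local state by replaying the changelog."""
        store = cls()
        for key, value in changelog:
            store.state[key] = value
        return store

store = LocalStore()
for user in ["alice", "bob", "alice"]:
    store.update(user, 1)    # stateful counting per user

restored = LocalStore.recover(store.changelog)
# restored.state matches store.state after replay
```

Recovery time in this design is dominated by how much changelog must be replayed, which is exactly why "how fast is recovery/failover?" is a selection criterion.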
58. Abstractions
• What are the data model and the available abstractions?
• Most common abstraction: stream of records, events
• Kafka, Spark, Storm, Samza, Flink, Apex, ...
• New, very powerful: table of records
• Currently unique to Kafka
• Represents latest state and materialized views
• State must have a first-class abstraction because, as we just saw in
the previous section, state is crucial for stream processing!
60. Time model
• Different use cases require different time semantics
• Great majority of use cases require event-time semantics
• Other use cases may require processing-time (e.g. real-time monitoring) or special variants like ingestion-time
• A stream processing technology should, at a minimum,
support event-time to cover most use cases in practice
• Examples: Kafka, Beam, Flink
64. Windowing
[Figure: input data, where colors represent different users' events (alice, bob, dave); rectangles denote different event-time windows, plotted against processing-time and event-time axes]
65. Windowing
• Windowing is an operation that groups events
• Most commonly needed: time windows, session windows
• Examples:
• Real-time monitoring: 5-minute averages
• Reader behavior on a website: user browsing sessions
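The first example above, 5-minute averages over time windows, can be sketched in plain Python (illustrative only; real frameworks express this with windowing operators):

```python
# Sketch of 5-minute tumbling-window averages over (timestamp, value) events.
from collections import defaultdict

WINDOW = 5 * 60  # 5 minutes, in seconds

def windowed_averages(events):
    """events: iterable of (timestamp_seconds, value) pairs."""
    sums = defaultdict(lambda: [0.0, 0])  # window_start -> [sum, count]
    for ts, value in events:
        start = (ts // WINDOW) * WINDOW   # align to the window boundary
        sums[start][0] += value
        sums[start][1] += 1
    return {start: s / n for start, (s, n) in sums.items()}

events = [(10, 2.0), (290, 4.0), (310, 9.0)]  # two in the first window, one in the second
averages = windowed_averages(events)
```

Session windows work differently: instead of fixed boundaries, events are grouped by gaps of inactivity per user, which is what makes them fit browsing-session analysis.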
68. Out-of-order and late-arriving data
• Is very common in practice, not a rare corner case
• Related to time model discussion
69. Out-of-order and late-arriving data
• Users with mobile phones enter an airplane, lose Internet connectivity
• Emails are being written during the 10h flight
• Internet connectivity is restored, phones will send queued emails now
70. Out-of-order and late-arriving data
• Is very common in practice, not a rare corner case
• Related to time model discussion
• We want control over how out-of-order data is handled
• Example:
• We process data in 5-minute windows, e.g. compute statistics
• When event arrives 1 minute late: update the original result!
• When event arrives 2 hours late: discard it!
• Handling must be efficient because it happens so often
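The policy in the example can be sketched as a grace period (the 10-minute value and the watermark-style progress tracking here are assumptions for illustration, not from the slide): records arriving within the grace period update the window's result, anything later is discarded.

```python
# Sketch of the late-data policy above: updates within a grace period revise
# the window's result; anything later is discarded. Values are illustrative.
WINDOW = 5 * 60   # 5-minute windows
GRACE = 10 * 60   # accept records up to 10 minutes past window end (assumed)

windows = {}      # window_start -> running count
dropped = []      # records that arrived too late

def on_record(event_ts, watermark):
    """watermark: current notion of how far event-time has progressed."""
    start = (event_ts // WINDOW) * WINDOW
    if watermark - (start + WINDOW) > GRACE:
        dropped.append(event_ts)                 # too late: discard
        return
    windows[start] = windows.get(start, 0) + 1   # in time: update the result

on_record(event_ts=10, watermark=10)     # on time
on_record(event_ts=20, watermark=360)    # 1 minute late: window 0 is updated
on_record(event_ts=30, watermark=7500)   # ~2 hours late: dropped
```

Because late records are routine rather than exceptional, this update path has to be cheap, which is the slide's last point.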
72. Reprocessing
• Re-process data by rewinding a stream back in time
• Use cases in practice include
• Correcting output data after fixing a bug
• Facilitating iterative and exploratory development
• A/B testing
• Processing historical data
• Walking through "What If?" scenarios
• Also: often used behind-the-scenes for fault tolerance
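What makes reprocessing cheap is that the input is a durable, replayable log: fixing a bug amounts to rewinding the read position and running the corrected logic over the same records. A minimal sketch (plain Python standing in for, e.g., resetting consumer offsets on a Kafka topic):

```python
# Sketch of reprocessing by rewinding a replayable log and re-running
# (corrected) processing logic over the same input records.

log = [1, 2, 3, 4]   # a durable, replayable stream (e.g. a Kafka topic)

def process(log, fn, offset=0):
    """Process the stream from `offset` onward with logic `fn`."""
    return [fn(record) for record in log[offset:]]

buggy = process(log, lambda x: x * 10 + 1)        # first run, with a pretend bug
fixed = process(log, lambda x: x * 10, offset=0)  # rewind to 0, reprocess everything
# The corrected output replaces the buggy one without re-collecting any data.
```

The same replay mechanism serves A/B testing (two processing variants over identical input) and "What If?" scenarios.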
75. Scalability, Elasticity, Fault Tolerance
• Can the technology scale according to your needs?
• Desired latency, throughput?
• Able to process millions of messages per second?
• What is the minimum footprint?
• Expand/shrink capacity dynamically during operations?
• Helps with resource utilization because most stream apps run continuously
• Resilience and fault tolerance
• Which guarantees for data delivery and for state? "At-least-once", "exactly-once", "effectively-once", etc.
• Failover behavior and recovery time? Automated or manual?
• Any negative impact of fault tolerance features on performance?
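One way the "effectively-once" guarantee is commonly achieved (a sketch of the general idempotent-consumer technique, not any specific framework's mechanism) is to tolerate at-least-once redelivery but make processing idempotent, so duplicates cannot change the result:

```python
# Sketch of "effectively-once": under at-least-once delivery, records may be
# redelivered, but idempotent handling keeps the result as if each was seen once.

def consume(records, state, seen_ids):
    """records: (record_id, amount) pairs, possibly containing redeliveries."""
    for record_id, amount in records:
        if record_id in seen_ids:
            continue                  # duplicate delivery: skip, result unchanged
        seen_ids.add(record_id)
        state["total"] = state.get("total", 0) + amount

state, seen = {}, set()
# "r2" is delivered twice (at-least-once), but only counted once:
consume([("r1", 5), ("r2", 7), ("r2", 7)], state, seen)
```

Tracking seen IDs is itself state that must survive failover, which illustrates the slide's final question: stronger guarantees usually cost something in performance or complexity.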
80. Security
• To meet internal security policies, legal compliance, etc.
• Typical base requirements for stream processing applications:
• Encrypt data-in-transit (e.g. from/to Kafka)
• Authentication: "only some applications may talk to production"
• Authorization: "access to sensitive data such as PII is restricted"
• The easier security features are to use, the more likely they are to actually be used in practice
84. API
• Choice of API is a subjective matter – skills, preference, …
• Typical options
• Declarative, expressive API: operations like map(), filter()
• Imperative, lower-level API: callbacks like process(event)
• Streaming SQL: STREAM SELECT … FROM … WHERE …
• In the best case you get not just one, but all three
• "Abstractions are great!"
• "Abstractions considered harmful!"
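The contrast between the first two API styles can be sketched on the same stream (plain Python standing in for a streaming API; the operation names mirror the slide's map()/filter() and process(event)):

```python
# Sketch: the same computation in a declarative style vs. an imperative
# per-record callback. Plain Python, not a real streaming framework API.

events = [1, 2, 3, 4, 5]

# Declarative, expressive: describe *what* to compute (filter, then map).
declarative = [x * 2 for x in events if x % 2 == 1]

# Imperative, lower-level: a process(event) callback decides *how*, per record.
imperative = []
def process(event):
    if event % 2 == 1:                 # same filter
        imperative.append(event * 2)   # same map

for event in events:
    process(event)
# Both styles produce identical output; the declarative form is shorter,
# the imperative form gives record-by-record control.
```

Streaming SQL sits above both: the same logic as a single query, at the cost of the least fine-grained control, which is why having all three options in one technology is the best case.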
90. Coming Up Next
Date | Title | Speaker
Dec 15 | Streaming in Practice: Putting Apache Kafka in Production | Roger Hoover
https://www.confluent.io/apache-kafka-talk-series