Presented by Michael Noll, Product Manager, Confluent.
Why are there so many stream processing frameworks that each define their own terminology? Are the components of each comparable? Why do you need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a full framework at all.
Processing and understanding your data to create business value is the ultimate goal of a stream data platform. In this talk we will survey the stream processing landscape, the dimensions along which to evaluate stream processing technologies, and how they integrate with Apache Kafka. In particular, we will learn how Kafka Streams, the built-in stream processing engine of Apache Kafka, compares to other stream processing systems that require a separate processing infrastructure.
A Practical Guide to Selecting a Stream Processing Technology
1. A Practical Guide to Selecting a Stream Processing Technology
Michael G. Noll
Product Manager, Confluent
2. Kafka Talk Series
Date | Title
Sep 27 | Introduction to Streaming Data and Stream Processing with Apache Kafka
Oct 06 | Deep Dive into Apache Kafka
Oct 27 | Data Integration with Apache Kafka
Nov 17 | Demystifying Stream Processing with Apache Kafka
Dec 01 | A Practical Guide to Selecting a Stream Processing Technology
Dec 15 | Streaming in Practice: Putting Apache Kafka in Production
https://www.confluent.io/apache-kafka-talk-series
3. Agenda
• Recap: What is Stream Processing?
• The Three Pillars of Stream Processing in Practice
• Key Selection Criteria
• Organizational/Non-Technical Dimensions
• Technical Dimensions
• Summary
42. Streams & Databases
• A stream processing technology must have first-class
support for Streams and Tables
• With scalability, fault tolerance, …
• Why? Because most use cases require not just one, but both!
• Support – or lack thereof – strongly impacts the resulting
technical architecture and development efforts
• No support means:
• Painful Do-It-Yourself
• Increased complexity, more moving pieces to juggle
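The stream/table relationship the slide insists on can be sketched in a few lines of plain Python (an illustrative sketch only, not the Kafka Streams API): a table is the latest value per key materialized from a changelog stream, and a table can be read back out as a stream.

```python
# Illustrative sketch of the stream/table duality (not the Kafka Streams API).

def table_from_stream(changelog):
    """Materialize a table as the latest value per key in the changelog."""
    table = {}
    for key, value in changelog:
        table[key] = value  # upsert: later records overwrite earlier ones
    return table

def stream_from_table(table):
    """Read a table back out as a changelog stream of (key, value) records."""
    return list(table.items())

changelog = [("alice", 1), ("bob", 5), ("alice", 3)]
table = table_from_stream(changelog)
# Only the latest value per key survives in the table.
```

Without first-class support for this duality, applications end up hand-rolling exactly this kind of materialization logic, plus the scalability and fault tolerance around it.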
45. Organizational/Non-Tech Dimensions
• Can your org understand and leverage the technology?
• Familiarity with languages; intuitive concepts and APIs; training
• Are you permitted to use it in your organization?
• Security features, licensing, open source vs. proprietary
• Can you continue to use it in the future?
• Longevity of technology, licensing, vendor strength
46. Organizational/Non-Tech Dimensions
• Do you believe in the long-term vision?
• Switching technologies in an organization is often expensive/slow:
legacy migration, re-training, resistance to change, etc.
• What is the path and time to success?
• Can you move smoothly and quickly from proof-of-concept to
production?
• Areas and range of applicability in your organization
• General-purpose vs. niche technology
• Viable for S/M/L/XL use cases vs. for XL use cases only
• Building core business apps vs. doing backend analytics
47. Organizational/Non-Tech Dimensions
Licensing · Vision/Roadmap · ROI · Impact on Organization · Broad vs. Niche Applicability · Time to Market · Professional Services · Documentation · Examples · User Community · Learning Curve · Impact on Tools, Infrastructure, …
50. State
• Stateful processing of any kind requires…state
• Many (most?) use cases for stream processing are stateful
• Joins, aggregations, windowing, counting, ...
• Is state performant? Local vs. remote state?
• Is state fault-tolerant? How fast is recovery/failover?
• Is state interactively queryable?
• Kafka: ready for use (GA)
• Spark, Flink: under development (alpha)
• Storm, Samza, and others: not available
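The fault-tolerance question above can be made concrete with a sketch of the common design (assumed here, modeled loosely on changelog-backed state stores such as those in Kafka Streams): updates go to fast local state and are also appended to a durable changelog, so failover is a matter of replaying that log.

```python
# Sketch of fault-tolerant local state: an in-memory store whose updates are
# mirrored to a durable changelog, which is replayed on recovery.
# This is an assumed design, not any framework's actual API.

class LocalStore:
    def __init__(self):
        self.state = {}        # fast local state (in practice: memory or RocksDB)
        self.changelog = []    # every update also lands in a durable log

    def update(self, key, delta):
        self.state[key] = self.state.get(key, 0) + delta
        self.changelog.append((key, self.state[key]))  # record the new value

    @classmethod
    def recover(cls, changelog):
        """Failover: rebuild local state by replaying the changelog."""
        store = cls()
        for key, value in changelog:
            store.state[key] = value
        return store

store = LocalStore()
for user in ["alice", "bob", "alice"]:
    store.update(user, 1)    # stateful counting per user

restored = LocalStore.recover(store.changelog)
# restored.state matches store.state after replay
```

Recovery time in this design is dominated by how much changelog must be replayed, which is exactly why "how fast is recovery/failover?" is a selection criterion.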
58. Abstractions
• What are the data model and the available abstractions?
• Most common abstraction: stream of records, events
• Kafka, Spark, Storm, Samza, Flink, Apex, ...
• New, very powerful: table of records
• Currently unique to Kafka
• Represents latest state and materialized views
• State must have a first-class abstraction because, as we just saw in
the previous section, state is crucial for stream processing!
60. Time model
• Different use cases require different time semantics
• Great majority of use cases require event-time semantics
• Other use cases may require processing-time (e.g. real-time monitoring) or special variants like ingestion-time
• A stream processing technology should, at a minimum,
support event-time to cover most use cases in practice
• Examples: Kafka, Beam, Flink
64. Windowing
[Figure: input data, where colors represent different users' events (alice, bob, dave); rectangles denote different event-time windows, plotted against processing-time and event-time axes]
65. Windowing
• Windowing is an operation that groups events
• Most commonly needed: time windows, session windows
• Examples:
• Real-time monitoring: 5-minute averages
• Reader behavior on a website: user browsing sessions
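The first example above, 5-minute averages over time windows, can be sketched in plain Python (illustrative only; real frameworks express this with windowing operators):

```python
# Sketch of 5-minute tumbling-window averages over (timestamp, value) events.
from collections import defaultdict

WINDOW = 5 * 60  # 5 minutes, in seconds

def windowed_averages(events):
    """events: iterable of (timestamp_seconds, value) pairs."""
    sums = defaultdict(lambda: [0.0, 0])  # window_start -> [sum, count]
    for ts, value in events:
        start = (ts // WINDOW) * WINDOW   # align to the window boundary
        sums[start][0] += value
        sums[start][1] += 1
    return {start: s / n for start, (s, n) in sums.items()}

events = [(10, 2.0), (290, 4.0), (310, 9.0)]  # two in the first window, one in the second
averages = windowed_averages(events)
```

Session windows work differently: instead of fixed boundaries, events are grouped by gaps of inactivity per user, which is what makes them fit browsing-session analysis.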
68. Out-of-order and late-arriving data
• Is very common in practice, not a rare corner case
• Related to time model discussion
69. Out-of-order and late-arriving data
• Users with mobile phones enter an airplane, lose Internet connectivity
• Emails are being written during the 10h flight
• Internet connectivity is restored, phones will send queued emails now
70. Out-of-order and late-arriving data
• Is very common in practice, not a rare corner case
• Related to time model discussion
• We want control over how out-of-order data is handled
• Example:
• We process data in 5-minute windows, e.g. compute statistics
• When event arrives 1 minute late: update the original result!
• When event arrives 2 hours late: discard it!
• Handling must be efficient because it happens so often
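The policy in the example can be sketched as a grace period (the 10-minute value and the watermark-style progress tracking here are assumptions for illustration, not from the slide): records arriving within the grace period update the window's result, anything later is discarded.

```python
# Sketch of the late-data policy above: updates within a grace period revise
# the window's result; anything later is discarded. Values are illustrative.
WINDOW = 5 * 60   # 5-minute windows
GRACE = 10 * 60   # accept records up to 10 minutes past window end (assumed)

windows = {}      # window_start -> running count
dropped = []      # records that arrived too late

def on_record(event_ts, watermark):
    """watermark: current notion of how far event-time has progressed."""
    start = (event_ts // WINDOW) * WINDOW
    if watermark - (start + WINDOW) > GRACE:
        dropped.append(event_ts)                 # too late: discard
        return
    windows[start] = windows.get(start, 0) + 1   # in time: update the result

on_record(event_ts=10, watermark=10)     # on time
on_record(event_ts=20, watermark=360)    # 1 minute late: window 0 is updated
on_record(event_ts=30, watermark=7500)   # ~2 hours late: dropped
```

Because late records are routine rather than exceptional, this update path has to be cheap, which is the slide's last point.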
72. Reprocessing
• Re-process data by rewinding a stream back in time
• Use cases in practice include
• Correcting output data after fixing a bug
• Facilitating iterative and exploratory development
• A/B testing
• Processing historical data
• Walking through "What If?" scenarios
• Also: often used behind-the-scenes for fault tolerance
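What makes reprocessing cheap is that the input is a durable, replayable log: fixing a bug amounts to rewinding the read position and running the corrected logic over the same records. A minimal sketch (plain Python standing in for, e.g., resetting consumer offsets on a Kafka topic):

```python
# Sketch of reprocessing by rewinding a replayable log and re-running
# (corrected) processing logic over the same input records.

log = [1, 2, 3, 4]   # a durable, replayable stream (e.g. a Kafka topic)

def process(log, fn, offset=0):
    """Process the stream from `offset` onward with logic `fn`."""
    return [fn(record) for record in log[offset:]]

buggy = process(log, lambda x: x * 10 + 1)        # first run, with a pretend bug
fixed = process(log, lambda x: x * 10, offset=0)  # rewind to 0, reprocess everything
# The corrected output replaces the buggy one without re-collecting any data.
```

The same replay mechanism serves A/B testing (two processing variants over identical input) and "What If?" scenarios.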
75. Scalability, Elasticity, Fault Tolerance
• Can the technology scale according to your needs?
• Desired latency, throughput?
• Able to process millions of messages per second?
• What is the minimum footprint?
• Expand/shrink capacity dynamically during operations?
• Helps with resource utilization because most stream apps run continuously
• Resilience and fault tolerance
• Which guarantees for data delivery and for state? "At-least-once", "exactly-once", "effectively-once", etc.
• Failover behavior and recovery time? Automated or manual?
• Any negative impact of fault tolerance features on performance?
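One way the "effectively-once" guarantee is commonly achieved (a sketch of the general idempotent-consumer technique, not any specific framework's mechanism) is to tolerate at-least-once redelivery but make processing idempotent, so duplicates cannot change the result:

```python
# Sketch of "effectively-once": under at-least-once delivery, records may be
# redelivered, but idempotent handling keeps the result as if each was seen once.

def consume(records, state, seen_ids):
    """records: (record_id, amount) pairs, possibly containing redeliveries."""
    for record_id, amount in records:
        if record_id in seen_ids:
            continue                  # duplicate delivery: skip, result unchanged
        seen_ids.add(record_id)
        state["total"] = state.get("total", 0) + amount

state, seen = {}, set()
# "r2" is delivered twice (at-least-once), but only counted once:
consume([("r1", 5), ("r2", 7), ("r2", 7)], state, seen)
```

Tracking seen IDs is itself state that must survive failover, which illustrates the slide's final question: stronger guarantees usually cost something in performance or complexity.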
80. Security
• To meet internal security policies, legal compliance, etc.
• Typical base requirements for stream processing applications:
• Encrypt data-in-transit (e.g. from/to Kafka)
• Authentication: "only some applications may talk to production"
• Authorization: "access to sensitive data such as PII is restricted"
• The easier security features are to use, the more likely they are to actually be used in practice
84. API
• Choice of API is a subjective matter – skills, preference, …
• Typical options
• Declarative, expressive API: operations like map(), filter()
• Imperative, lower-level API: callbacks like process(event)
• Streaming SQL: STREAM SELECT … FROM … WHERE …
• In the best case you get not just one, but all three
• "Abstractions are great!"
• "Abstractions considered harmful!"
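The contrast between the first two API styles can be sketched on the same stream (plain Python standing in for a streaming API; the operation names mirror the slide's map()/filter() and process(event)):

```python
# Sketch: the same computation in a declarative style vs. an imperative
# per-record callback. Plain Python, not a real streaming framework API.

events = [1, 2, 3, 4, 5]

# Declarative, expressive: describe *what* to compute (filter, then map).
declarative = [x * 2 for x in events if x % 2 == 1]

# Imperative, lower-level: a process(event) callback decides *how*, per record.
imperative = []
def process(event):
    if event % 2 == 1:                 # same filter
        imperative.append(event * 2)   # same map

for event in events:
    process(event)
# Both styles produce identical output; the declarative form is shorter,
# the imperative form gives record-by-record control.
```

Streaming SQL sits above both: the same logic as a single query, at the cost of the least fine-grained control, which is why having all three options in one technology is the best case.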
90. Coming Up Next
Date | Title | Speaker
Dec 15 | Streaming in Practice: Putting Apache Kafka in Production | Roger Hoover
https://www.confluent.io/apache-kafka-talk-series