Over 100 million subscribers in more than 190 countries enjoy the Netflix service. This generates over a trillion events per day, amounting to 3 PB of data, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real time with at-least-once semantics in the cloud. This enables users to focus on extracting insights rather than building out scalable infrastructure. In this session, I share the benefits of the platform and our experience building it.
2. What Do I Get Out Of This Talk?
Different perspectives:
● Data Engineer - Why stream processing, and what to expect from a platform?
● Data Leader - Product / vision of a Stream Processing as a Service platform
● Platform Engineer - How to build and operate a Stream Processing as a Service platform
@monaldax
3. @monaldax
● I will focus on the stream processing platform for business insights, which my team builds, mostly based on Flink
● I won't be addressing operational insights, for which we have different systems
6. Why Real Time Data?
● Low latency insights and analytics
● Processing data as it arrives helps spread workload over time, & reduce processing redundancy
● The need to process unbounded data sets is becoming increasingly common
@monaldax
7. Why Build A Stream Processing Platform?
● Enable users to focus on data and business insights, and not worry about building stream processing infrastructure and tooling
@monaldax
9. SPaaS: The Platform Needs To Offer A Robust Way To Process Streams, Allowing A Tradeoff Between Ease, Capability, & Flexibility
@monaldax
10. Stream Processing as a Service Platform Offers
● Point & Click: routing, filtering, projection
● Streaming Jobs
● Support for Streaming SQL (Future)
● Interactive exploration of streams for quick prototyping (Future)
@monaldax
20. Create Kafka Topic, And Three Separate Jobs
[Diagram: Event Producer (KCW = Kafka Client Wrapper, KSGateway) → Fronting Kafka → SPaaS Router → Consumer Kafka / Elasticsearch; Keystone Management provisions 1 topic and 3 jobs]
@monaldax
21. Event Flow: Producer Uses Kafka Client Wrapper Or Proxy
[Diagram: Event Producer sends via KCW (Kafka Client Wrapper) or KSGateway to Fronting Kafka; SPaaS Router, Consumer Kafka, Elasticsearch, and Keystone Management downstream]
@monaldax
22. Event Flow: Events Queued In Kafka
[Diagram: events land in Fronting Kafka (3 instances); SPaaS Router, Consumer Kafka, Elasticsearch, and Keystone Management downstream]
@monaldax
23. Event Flow: Each Router Reads From Source, Optionally Applies Filter & Projection
[Diagram: SPaaS Routers read from Fronting Kafka (3 instances) and apply an optional filter & projection before writing to the sinks]
@monaldax
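The router behavior described on this slide (read events, optionally filter, optionally project) can be sketched in plain Python. This is only an illustration of the concept; the real routers are Flink jobs, and the field names here are hypothetical:

```python
def make_router(filter_fn=None, projected_fields=None):
    """Build a router stage: optional filter, then optional projection."""
    def route(events):
        for event in events:
            if filter_fn is not None and not filter_fn(event):
                continue  # drop events that fail the filter
            if projected_fields is not None:
                # Projection: keep only the fields the sink needs.
                event = {k: event[k] for k in projected_fields if k in event}
            yield event
    return route

# Hypothetical play events; this router keeps US events, projected to two fields.
events = [
    {"country": "US", "title": "A", "device": "tv"},
    {"country": "FR", "title": "B", "device": "phone"},
]
us_router = make_router(lambda e: e["country"] == "US", ["country", "title"])
print(list(us_router(events)))  # → [{'country': 'US', 'title': 'A'}]
```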
24. Event Flow: Each Router Writes To Its Respective Sink
[Diagram: routers write to Consumer Kafka (non-keyed and keyed supported) and Elasticsearch]
@monaldax
50. Stateless Streaming Job Use Case: High Level Architecture - Enriching And Identifying Certain Plays
[Diagram: Play Logs → Streaming Job; the job enriches events with lookup data from the Playback History Service, Video Metadata, and a Live Service]
@monaldax
55. Streaming Job (Flink) Savepoint Tooling Support
• S3-based multi-tenant storage management
• Auto savepoint, and resume from savepoint on redeploy
• Resume from an existing savepoint
@monaldax
56. Streaming Job (Flink) High Level Features
• Stateless jobs
• Event enrichment support by accessing services using platform thick clients
• Stateful jobs with 100s of GB of state, with larger state support in the works
• Reusable blocks (in progress)
• Job development, deployment, and monitoring tooling (alpha)
@monaldax
57. Streaming Jobs - The Road Ahead
• Easy resource provisioning estimates
• Flink support for reading from and writing to the data warehouse, backfill
• Continue to evolve tooling and support for large state
• Reusable components - sources, sinks, operators, schema support, data hygiene
• Tooling support for Spark Streaming
@monaldax
59. Prod - Trending Events & Scale, With Events Flowing To Hive, Elasticsearch, Kafka
• 1.3T+ events processed per day (trending ≅ 80B to 1.3T)
• 600B to 1T unique events per day
• 2+ PB in, 4.5+ PB out per day
• Peak: 12M events in / sec & 36 GB / sec
@monaldax
62. RTDI Consists Of 4 Systems. The Keystone Pipeline Runs 24x7, & Does Not Impact Members' Ability To Play Videos
[Diagram: Keystone Stream Processing (SPaaS), Keystone Management, Keystone Messaging; runs 24x7 across Dev, Test, and Prod, with granular shadowing]
@monaldax
63. Components & Streaming Jobs
[Diagram: Event Producer (KCW, KSGateway) → Fronting Kafka → SPaaS Router / Streaming Job → Consumer Kafka, Hive, Elasticsearch; Keystone Management orchestrates]
@monaldax
67. Producer Library - Kafka Client Wrapper
• Automated Kafka producer buffer (60s) tuning based on traffic
• Best-effort delivery; prioritizes host application availability
• acks=1, does not block to send events, unclean leader election
• Non-keyed messages; retry send to an available partition
@monaldax
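As a sketch, the bullets above map roughly onto standard Kafka 0.10 producer properties (values illustrative; note that unclean leader election is a broker/topic-level setting, `unclean.leader.election.enable`, not a producer property):

```python
def producer_config(observed_bytes_per_sec):
    """Illustrative producer settings for best-effort, non-blocking delivery."""
    return {
        "acks": "1",          # leader ack only: best-effort delivery
        "max.block.ms": "0",  # never block the host application on send
        # Automated tuning sketch: size the buffer to absorb ~60s of observed traffic.
        "buffer.memory": str(observed_bytes_per_sec * 60),
    }

cfg = producer_config(1_000_000)  # hypothetical producer doing 1 MB/s
print(cfg["buffer.memory"])  # → 60000000
```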
68. KSGateway - Event Proxy For Non-Java Clients, REST & gRPC
[Diagram: Event Producer → KSGateway → Fronting Kafka; SPaaS Router / Streaming Job → Consumer Kafka, Hive, Elasticsearch; Keystone Management orchestrates]
@monaldax
69. Kafka Clusters (0.10) On Amazon EC2
[Diagram: Event Producer (KCW, KSGateway) → Fronting Kafka → SPaaS Router / Streaming Job → Consumer Kafka, Elasticsearch; Keystone Management orchestrates]
@monaldax
70. Why Kafka?
• Message sizes > 1 MB and up to 10 MB
• Large-scale Keystone ingest pipelines result in large fan-out
• Lower latency - used for ad-hoc messaging as well
• Open source - we can enhance, patch, or extend it
@monaldax
71. Scale For Large Fan-out And Isolation - Cascading Topology
[Diagram: Fronting Kafka → Consumer Kafka → Consumer]
@monaldax
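A back-of-the-envelope way to see the benefit: without cascading, every consumer reads each message from the fronting cluster; with a consumer-Kafka tier, the fronting cluster serves one copy per message regardless of fan-out, and the consumer cluster absorbs the rest. Numbers here are hypothetical:

```python
def fronting_reads_per_message(num_consumers, cascaded):
    """Reads the fronting cluster must serve per message."""
    # Cascaded: a single router copies the message to the consumer cluster,
    # which then absorbs the full consumer fan-out in isolation.
    return 1 if cascaded else num_consumers

print(fronting_reads_per_message(50, cascaded=False))  # → 50
print(fronting_reads_per_message(50, cascaded=True))   # → 1
```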
72. Alternative: Logical Stream (Topic) Spread Across Multiple Topics Across Multiple Clusters (WIP)
[Diagram: Multi-Cluster Producer → multiple clusters → Multi-Cluster Consumer]
@monaldax
73. Kafka Deployment Strategies - Version 0.10 (YMMV)
• Dedicated Zookeeper cluster per Kafka cluster
• Small clusters: < 200 brokers, <= 10K partitions
• Partitions distributed evenly across brokers
• Rack-aware replica assignment, brokers spread across 3 zones
• 2 copies & unclean leader election on
• Non-transactional
@monaldax
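The rack-aware bullet can be sketched as a toy assignment that puts a partition's two replicas in different zones. This is not Kafka's actual assignment algorithm, just an illustration of the invariant it maintains:

```python
def assign_replicas(partitions, brokers_by_zone, copies=2):
    """Assign replicas so each partition's copies land in distinct zones."""
    zones = sorted(brokers_by_zone)
    assignment = {}
    for p in range(partitions):
        replicas = []
        for r in range(copies):
            # Rotate the starting zone per partition; offset per replica.
            zone = zones[(p + r) % len(zones)]
            pool = brokers_by_zone[zone]
            replicas.append(pool[p % len(pool)])
        assignment[p] = replicas
    return assignment

# Hypothetical brokers spread across 3 availability zones.
brokers = {"us-east-1a": [1, 2], "us-east-1b": [3, 4], "us-east-1c": [5, 6]}
a = assign_replicas(6, brokers)
print(a[0])  # → [1, 3]  (one replica in zone a, one in zone b)
```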
77. Streaming Jobs (Flink 1.3.2)
• Keystone pipeline is built on Flink Routers
• Each Flink Router is a stream processing job
• Router provisioning based on incoming traffic or estimates
• Runs on containers atop EC2
• Island mode - single AWS Region
@monaldax
78. High-level Stream Processing Platform Architecture
[Diagram: 1. Create streaming job via Point & Click or Streaming Job (Keystone Management); 2. Launch job with config overrides; 3. Launch containers (Container Runtime) - immutable image, user-driven config overrides]
@monaldax
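Step 2 of the flow, launching an immutable image with user-driven config overrides, amounts to layering the user's settings over the image defaults. A minimal sketch, with made-up config keys:

```python
def launch_config(image_defaults, user_overrides):
    """Effective job config: immutable image defaults with user overrides on top."""
    merged = dict(image_defaults)   # the image itself is never mutated
    merged.update(user_overrides)   # user-driven overrides win
    return merged

defaults = {"parallelism": 1, "checkpoint.interval.ms": 30_000}  # illustrative
effective = launch_config(defaults, {"parallelism": 8})
print(effective)  # → {'parallelism': 8, 'checkpoint.interval.ms': 30000}
```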
80. Flink Job Cluster In HA Mode
[Diagram: Zookeeper; leader Job Manager (WebUI) and standby Job Manager (WebUI); three Task Managers. One dedicated Zookeeper cluster serves all streaming jobs]
@monaldax
82. Flink Job Cluster In HA Mode With Checkpoints
[Diagram: Zookeeper, leader and standby Job Managers, Task Managers; state checkpoints and checkpoint metadata are written to external storage]
@monaldax
84. Checkpoints Are Taken Often
[Diagram: a Titus job runs the Job Managers (master + standby) and Task Managers in containers with their own IPs across Titus hosts in an AWS VPC; state checkpoints and Kafka offsets are saved]
@monaldax
85. Checkpoints Are Taken Often. A Container Could Fail…
[Diagram: same Titus deployment; a Task Manager container fails (X) while checkpoints and Kafka offsets continue to be saved]
@monaldax
86. Restored To Last Checkpoint; Partial Recovery Supported
[Diagram: a replacement container is launched on another Titus host and restores state and Kafka offsets from the last checkpoint]
@monaldax
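The checkpoint/restore cycle of slides 84-86 can be modeled in a few lines: processing state and the Kafka offset are snapshotted together, and a replacement container resumes from the last snapshot, which yields the at-least-once semantics mentioned at the start. This is a toy model, not Flink's actual checkpointing mechanism:

```python
class CheckpointingConsumer:
    """Toy model of recover-from-last-checkpoint: state + Kafka offset saved together."""
    def __init__(self):
        self.state = 0            # e.g., a running sum over events
        self.offset = 0           # next Kafka offset to read
        self._checkpoint = (0, 0)

    def process(self, events, checkpoint_every=2):
        for i, e in enumerate(events[self.offset:], start=self.offset):
            self.state += e
            self.offset = i + 1
            if self.offset % checkpoint_every == 0:
                self._checkpoint = (self.state, self.offset)  # atomic snapshot

    def crash_and_restore(self):
        # A replacement container resumes from the last completed checkpoint.
        self.state, self.offset = self._checkpoint

events = [1, 2, 3, 4, 5]
c = CheckpointingConsumer()
c.process(events[:3])    # processed offsets 0..2; checkpoint taken at offset 2
c.crash_and_restore()    # post-checkpoint progress is lost
c.process(events)        # reprocess from offset 2: at-least-once delivery
print(c.state, c.offset)  # → 15 5
```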
87. Streaming Jobs Management
[Diagram: Event Producer (KCW, KSGateway) → Fronting Kafka → SPaaS Router / Streaming Job → Consumer Kafka, Hive, Elasticsearch; Keystone Management orchestrates]
@monaldax
91. Keystone Management Unique Features
• The ability to pass data along the chain of Joblets within a Job
• Locks and semaphores on resources spanning across jobs
• Customization and integration into the Netflix ecosystem - Eureka, etc.
@monaldax
93. We Run What We Build!
• No separate Ops team
• No separate QA team
• No separate Dev team
• It's all done by the developers of the Real Time Data Infrastructure
@monaldax
94. We Leverage Other Netflix Systems
• We rely on metrics, monitoring, alerting & paging, & automation
• Separate metrics system - Atlas
• Separate alert configuration and alert actions system
• Options for a separate system to run cross-system automation tasks
@monaldax
105. Launch Backup Kafka Cluster With Same Number Of Instances, But Smaller Instance Type
[Diagram: Fronting Kafka has failed (X); bring up a failover Kafka cluster and copy metadata from Zookeeper; Event Producer and Flink Router unchanged]
@monaldax
106. Change Producer Config To Produce To Failover Cluster, And Launch Routers For Failover Traffic
[Diagram: Event Producer now writes to the failover cluster; failover Flink Routers process its traffic while the original Fronting Kafka remains failed (X)]
@monaldax
107. Change Producer Config Back To Original Cluster, And Finish Draining Events From Backup Flink Router
[Diagram: Event Producer writes to the recovered Fronting Kafka again; the failover Flink Router drains remaining events]
@monaldax
108. Decommission Backup Cluster And Router Once Original Cluster Is Fixed, Or A Replacement Cluster Is Live
[Diagram: failover Kafka cluster and failover Flink Router are decommissioned (X X)]
@monaldax
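Slides 105-108 describe an ordered runbook. As a sketch, the sequence looks like the following (cluster names are placeholders):

```python
def kafka_failover_plan(failed_cluster, backup_cluster):
    """The failover slides as an ordered runbook."""
    return [
        f"launch backup cluster {backup_cluster} (same instance count, smaller type)",
        f"copy topic metadata from {failed_cluster}'s Zookeeper",
        f"point producers at {backup_cluster}; launch failover routers",
        f"point producers back at {failed_cluster} once healthy; drain failover routers",
        f"decommission {backup_cluster} and its routers",
    ]

steps = kafka_failover_plan("fronting-1", "fronting-1-failover")
for step in steps:
    print(step)
```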
110. Consumer Kafka Clusters
• Failover is currently supported for Fronting Kafka clusters only
• We are working on a multi-consumer client with support for keyed messages, to support failover of Consumer Kafka clusters
@monaldax
111. Planned & Regular Kafka Kong
This automation also serves as Kafka Kong, a tool that follows the principles of Chaos Engineering
@monaldax
112. Kafka Operation Strategies (YMMV)
• Over-provision for traffic variations and for failover
• Broker health & outlier detection, and auto-termination, based on:
  • 99th-percentile response time
  • Broker TCP timeouts, errors, retransmissions
  • Producer send latency
@monaldax
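The outlier-detection bullet hinges on the 99th-percentile response time. A nearest-rank sketch of flagging a slow broker follows; the thresholds, broker names, and latency numbers are all made up:

```python
import math

def p99(samples):
    """Naive nearest-rank 99th percentile of response-time samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))  # 1-indexed nearest rank
    return ordered[rank - 1]

def outlier_brokers(latencies_ms, threshold_ms):
    """Brokers whose p99 response time exceeds the threshold (termination candidates)."""
    return [b for b, samples in latencies_ms.items() if p99(samples) > threshold_ms]

latencies = {
    "broker-1": [5] * 99 + [8],         # healthy: one slow-ish sample
    "broker-2": [5] * 90 + [500] * 10,  # unhealthy: 10% of requests very slow
}
print(outlier_brokers(latencies, threshold_ms=100))  # → ['broker-2']
```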
113. Kafka Operation Strategies (YMMV)
• Scale up by
  • Adding partitions - to new brokers; requires non-keyed messages
  • Partition reassignment - in small batches with a custom tool
• Scale down by
  • Creating new topics / new clusters - use Kafka failover
@monaldax
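Why does adding partitions require non-keyed messages? Keyed partitioning maps a key to `hash(key) mod N`, so changing N remaps most keys to different partitions, breaking per-key ordering and any partition-local keyed state. A simplified illustration (Kafka's default partitioner actually uses murmur2 over the key bytes; integer keys stand in here for hashes):

```python
def partition_for(key_id, num_partitions):
    """Simplified keyed partitioning: key mod partition count."""
    return key_id % num_partitions

keys = range(100)
before = {k: partition_for(k, 8) for k in keys}   # original: 8 partitions
after = {k: partition_for(k, 12) for k in keys}   # partitions added to new brokers
moved = sum(before[k] != after[k] for k in keys)
print(moved)  # → 64  (most keys now land on a different partition)
```

Non-keyed messages have no per-key placement guarantee to break, so partitions can be added freely.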
115. Routers & Streaming Job Fault Tolerance By Design
• Container replacement
• Checkpoints and savepoints
• Keep retrying if the event data format is valid
• Isolation - an issue with one sink does not impact another
@monaldax
116. Router Deployment Automation
• Provision new or updated streams
• Bulk update, terminate, and re-deploy routers
• Automatic partial recovery allows zero-touch migration of the underlying container infrastructure
• Manual - KSRunbook
@monaldax
118. Router Capacity Planning And Provisioning
• Per-stream provisioning based on the past week's traffic or a bit-rate estimate
• Provision buffer capacity
• Run 1 additional container for latency-sensitive consumers
• Manual, % increase, easy to compute and deploy
• Plan capacity to handle service failover
@monaldax
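The provisioning bullets reduce to simple arithmetic: take the past week's peak, add a buffer percentage, round up, and add one extra container for latency-sensitive consumers. A sketch with made-up throughput figures:

```python
import math

def router_containers(peak_mbps_last_week, mbps_per_container,
                      buffer_pct=50, latency_sensitive=False):
    """Containers to provision: last week's peak plus a buffer, rounded up,
    plus one extra container for latency-sensitive consumers."""
    needed = peak_mbps_last_week * (1 + buffer_pct / 100) / mbps_per_container
    return math.ceil(needed) + (1 if latency_sensitive else 0)

print(router_containers(80, 10))                          # → 12
print(router_containers(80, 10, latency_sensitive=True))  # → 13
```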
119. Admin Tooling To Scale Up Manually, Or To Deploy A New Build
@monaldax
125. Flink Streaming Job
● Split between application and infrastructure
● Metrics and monitoring
● Alerts
● Paging and on-call rotations
● Platform customers follow the same "we build it, we run it" model
@monaldax
128. Operations - The Road Ahead
● True auto-scaling
● Bootstrap capacity planning for stateful streaming jobs
● Automated canary tooling & data parity
● Point & Click components: quick testing and performance profiling
  ● E.g., iterating over a Filter definition
@monaldax
129. I Want To Learn More
● http://bit.ly/mLOOP - Deep dive into Unbounded Data Processing Systems
● http://bit.ly/m17FF - Keynote - Stream Processing with Flink at Netflix
● http://bit.ly/2BoYAq0 - Multi-tenant Multi-cluster Kafka
@monaldax