The traditional lambda architecture has been a popular solution for combining offline batch processing with real-time processing. This setup incurs significant developer and operational overhead, since it involves maintaining code that must produce the same result in two potentially different distributed systems. To alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experience running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) pluggable data sources and sinks; b) a deployment model supporting different execution environments, such as YARN or VMs; c) a unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of the lambda architecture. We will use real production use cases to illustrate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
2. Agenda
● Data Processing at LinkedIn
● Data Pipelines in Batch & Stream
● Overview of Apache Samza
● Convergence of Pipelines with Apache Samza
○ Support for Batch Data
○ Unified Data Processing API
○ Flexible Deployment Model
3. Data Processing at LinkedIn
[Architecture diagram] Online data sources (Espresso DB, the NoSQL store for all user data; Oracle DB; Azure EventHub; Amazon Kinesis) feed the ingestion tier, with Brooklin performing DB change capture and the services tier handling import/export. On the processing side, Hadoop performs batch processing over HDFS while Samza performs stream processing; derived data is served from Voldemort / Venice (K-V stores for derived data).
4. Scale of Processing at LinkedIn
● Kafka: 2.3 trillion messages per day; 0.6 PB in and 2.3 PB out per day (compressed); 16 million messages per second at peak!
● Hadoop: 125 TB ingested per day; 120 PB HDFS size; 200K jobs per day
● Samza: 200+ applications; most applications require stateful processing ~ several TBs (overall)
5. Data Processing Scenarios at LinkedIn
● Site Speed: real-time site-speed profiling by facets
● Call-graph Computation: analysis of service calls
● Dashboards: real-time analytics
● Ad CTR Computation: tracking ad views and ad clicks
These scenarios operate primarily on real-time input data.
6. Data Processing Scenarios at LinkedIn
● News Classification: real-time topic tagging of articles
● Profile Standardization: standardizing titles, gender, and education
● Security: real-time DDoS protection for members
These scenarios operate on real-time data and rely on models computed offline; the offline-computed model must be accessible during real-time processing.
7. Agenda
● Data Processing at LinkedIn
● Data Pipelines in Batch & Stream
● Overview of Apache Samza
● Convergence of Pipelines with Apache Samza
○ Support for Batch Data
○ Unified Data Processing API
○ Flexible Deployment Model
9. Batch vs. Stream
Batch:
● Processing on bounded data
● Processing at regular intervals
● Latency ~ order of hours
Stream:
● Processing on unbounded data
● Processing is continuous
● Latency ~ order of sub-seconds
● Time matters!
10. Data Pipelines in Batch & Stream - Drawbacks
● Overhead of developing and managing multiple codebases
○ The same application logic is written using 2 different APIs: one using offline processing APIs and another using a near-realtime processing API
● The same application is deployed on potentially 2 different managed platforms
○ Restrictions due to firewalls, ACLs to environments, etc.
● Expensive $$
○ When the near-realtime application needs processed data from offline, the data snapshot has to be made available as a stream. This is expensive!
13. Agenda
● Data Processing at LinkedIn
● Data Pipelines in Batch & Stream
● Overview of Apache Samza
● Convergence of Pipelines with Apache Samza
○ Support for Batch Data
○ Unified Data Processing API
○ Flexible Deployment Model
14. Apache Samza
• In production at LinkedIn since 2013
• Apache TLP since 2014
• Streams as a first-class citizen
– Batch as a special case of streaming
15. Apache Samza
● Provides a distributed and scalable data processing platform with:
○ Configurable and heterogeneous data sources and sinks (e.g., Kafka, HDFS, Kinesis, EventHub)
○ Efficient state management: local state and incremental checkpoints
○ A unified processing API for batch & streaming
○ Flexible deployment models
17. Data Processing Model
• Natively supports partitioned data
• Re-partitioning may be required for an un-partitioned source
• Pluggable System and CheckpointManager implementations (a sketch of a custom System follows)
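As an illustration of the pluggable System abstraction, here is a minimal sketch of a custom connector, assuming Samza's SystemFactory interface (consumer, producer, and admin hooks); MySystemConsumer, MySystemProducer, and MySystemAdmin are hypothetical classes a connector author would implement:

import org.apache.samza.config.Config;
import org.apache.samza.metrics.MetricsRegistry;
import org.apache.samza.system.SystemAdmin;
import org.apache.samza.system.SystemConsumer;
import org.apache.samza.system.SystemFactory;
import org.apache.samza.system.SystemProducer;

// Hypothetical connector that plugs a new data source/sink into Samza.
public class MySystemFactory implements SystemFactory {
  @Override
  public SystemConsumer getConsumer(String systemName, Config config, MetricsRegistry registry) {
    // Returns a consumer that polls the external system for new messages.
    return new MySystemConsumer(config); // hypothetical implementation
  }

  @Override
  public SystemProducer getProducer(String systemName, Config config, MetricsRegistry registry) {
    // Returns a producer that writes outgoing envelopes to the external system.
    return new MySystemProducer(config); // hypothetical implementation
  }

  @Override
  public SystemAdmin getAdmin(String systemName, Config config) {
    // Returns an admin that reports partition and offset metadata.
    return new MySystemAdmin(config); // hypothetical implementation
  }
}

The factory is then wired in through configuration, in the same style as the Kafka and HDFS examples later in this deck:

systems.mysystem.samza.factory = com.example.MySystemFactory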
20. Joining Co-partitioned Data
[Diagram] A Samza application consumes an Ad View Stream and an Ad Click Stream, each with partitions 1, 2, and 3 co-partitioned by Ad-ID. Each task processes the matching partition from both streams and emits results to an Ad Click Through Rate Stream.
21. Joining Co-partitioned Data
[Diagram] The same join as on slide 20, with each task backed by a local state store (RocksDB).
22. Joining Co-partitioned Data
[Diagram] The same join, with the local state replicated to a partitioned changelog stream, which is used for recovery upon task failure. A sketch of such a join task follows.
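To make the join diagrams concrete, here is a minimal sketch using the low-level task API, assuming hypothetical input streams named "ad-views" and "ad-clicks" (string-serialized events) and a store named "ad-view-store" configured with a RocksDB backend and a changelog:

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical join task: buffers ad views in local (RocksDB-backed) state
// and joins incoming ad clicks against them, keyed by Ad-ID.
public class AdCtrJoinTask implements StreamTask, InitableTask {
  private static final SystemStream CTR_STREAM =
      new SystemStream("kafka", "ad-click-through-rate");
  private KeyValueStore<String, String> viewStore; // adId -> serialized view event

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // "ad-view-store" is assumed to be configured with a changelog for recovery.
    viewStore = (KeyValueStore<String, String>) context.getStore("ad-view-store");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    String adId = (String) envelope.getKey();
    String stream = envelope.getSystemStreamPartition().getStream();
    if ("ad-views".equals(stream)) {
      // Remember the view until a matching click (if any) arrives.
      viewStore.put(adId, (String) envelope.getMessage());
    } else if ("ad-clicks".equals(stream)) {
      String view = viewStore.get(adId);
      if (view != null) {
        // Emit the joined record, partitioned by Ad-ID.
        collector.send(new OutgoingMessageEnvelope(
            CTR_STREAM, adId, view + "|" + envelope.getMessage()));
      }
    }
  }
}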
23. Agenda
● Data Processing at LinkedIn
● Data Pipelines in Batch & Stream
● Overview of Apache Samza
● Convergence of Pipelines with Apache Samza
○ Support for Batch Data
○ Unified Data Processing API
○ Flexible Deployment Model
24. How to Converge?
❏ Support for bounded data
  ❏ Define a boundary over the stream
  ❏ Batched processing
❏ Unified data processing API
❏ Flexible deployment models: Write once, run anywhere!
25. Agenda
● Data Processing at LinkedIn
● Data Pipelines in Batch & Stream
● Overview of Apache Samza
● Convergence of Pipelines with Apache Samza
○ Support for Batch Data
○ Unified Data Processing API
○ Flexible Deployment Model
26. Support for Batch Data
• Batch as a special case of stream:
– Define a boundary on the stream
– Batched processing: reaching the end of the batch ends the job
27. Defining a Boundary on the Stream
• Introduced a notion of End-of-Stream (EoS) in the input
• The consumer in the System detects the EoS for a source
– Upon EoS, Samza may invoke the EndOfStreamListenerTask handler implemented by the application (optional); a sketch follows
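As a minimal sketch of this hook, the task below counts records and shuts itself down once its inputs are exhausted; BoundedCountTask is a hypothetical example built on Samza's EndOfStreamListenerTask callback:

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.EndOfStreamListenerTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical batch-style task: processes a bounded input and exits at EoS.
public class BoundedCountTask implements StreamTask, EndOfStreamListenerTask {
  private long count = 0;

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) {
    count++; // normal per-message processing
  }

  @Override
  public void onEndOfStream(MessageCollector collector, TaskCoordinator coordinator) {
    // All inputs for this task are exhausted: report results and request shutdown.
    // This is how "end of batch ends the job" plays out per task.
    System.out.println("Processed " + count + " records");
    coordinator.shutdown(TaskCoordinator.RequestScope.CURRENT_TASK);
  }
}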
37. Batch as a Special Case of Stream
● Support the bounded nature of data: define a boundary over the stream
● Processing at regular intervals: tasks exit upon complete consumption of the batch
39. Agenda
● Data Processing at LinkedIn
● Data Pipelines in Batch & Stream
● Overview of Apache Samza
● Convergence of Pipelines with Apache Samza
○ Support for Batch Data
○ Unified Data Processing API
○ Flexible Deployment Model
40. Example Application
Count PageViewEvent for each mobile device OS in a 5-minute window, and send the counts to PageViewCountPerDeviceOS.
[Pipeline diagram] PageViewEvent → Filter & Re-partition → Window → Map → SendTo → PageViewCountPerDeviceOS
46. Samza High-level API

public interface StreamApplication {
  // Process messages using DSL-like declarations.
  void init(StreamGraph streamGraph, Config config);
}

- Ability to express a multi-stage processing pipeline in a single user program
- Built-in library provides high-level stream transformation functions: map, filter, window, partition, join, etc.
- Automatically generates the DAG for the application
47. Application using the High-level API

public class CountByDeviceOSApplication implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    Supplier<Integer> initialValue = () -> 0;
    MessageStream<PageViewEvent> pageViewEvents =
        graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m);
    OutputStream<String, MyStreamOutput, MyStreamOutput> pageViewEventPerMemberStream =
        graph.getOutputStream("pageViewCountPerDevice", m -> m.memberId, m -> m);
    pageViewEvents
        // Built-in transforms: partitionBy, window, map, sendTo
        .partitionBy(m -> m.memberId)
        .window(Windows.keyedTumblingWindow(
            m -> m.memberId, Duration.ofMinutes(5), initialValue, (m, c) -> c + 1))
        .map(MyStreamOutput::new)
        .sendTo(pageViewEventPerMemberStream);
  }
}

[Pipeline diagram] PageViewEvent → Filter & Re-partition → Window → Map → SendTo → PageViewCountPerDeviceOS
48. Unified for Batch & Stream
(Same CountByDeviceOSApplication code as on slide 47; switching between batch and stream input is only a config change!)
Configuration for stream input (Kafka):
systems.kafka.samza.factory = org.apache.samza.system.KafkaSystemFactory
streams.PageViewEvent.samza.system = kafka
streams.PageViewEvent.samza.physical.name = PageViewEvent
Configuration for batch input (HDFS):
systems.hdfs.samza.factory = org.apache.samza.system.HdfsSystemFactory
streams.PageViewEvent.samza.system = hdfs
streams.PageViewEvent.samza.physical.name = hdfs:/user/nramesh/PageViewEvent
51. High-level API - DAG Visualization
[Screenshot: Samza Visualizer] A visualization of application samza-count-by-device-i001, which consists of 1 job, 1 input stream, and 1 output stream.
53. Agenda
● Data Processing at LinkedIn
● Data Pipelines in Batch & Stream
● Overview of Apache Samza
● Convergence of Pipelines with Apache Samza
○ Support for Batch Data
○ Unified Data Processing API
○ Flexible Deployment Model
54. Coordination Model
• The coordination layer is pluggable in Samza
• Samza master / leader:
– Distributes tasks to processor JVMs
– On processor failure, re-distributes its tasks
• Available coordination mechanisms:
– Apache YARN: the ApplicationMaster is the leader
– Apache ZooKeeper: one of the processors is the leader and coordinates via ZooKeeper
– Microsoft Azure: one of the processors is the leader and coordinates via Azure's Blob/Table Storage
55. Embedding the Processor within the Application
- An instance of the processor is embedded within the user's application
- LocalApplicationRunner helps launch the processor within the application

public static void main(String[] args) {
  CommandLine cmdLine = new CommandLine();
  OptionSet options = cmdLine.parser().parse(args);
  Config config = cmdLine.loadConfig(options);
  LocalApplicationRunner runner = new LocalApplicationRunner(config);
  CountByDeviceOSApplication app = new CountByDeviceOSApplication();
  runner.run(app);
  runner.waitForFinish();
}
56. Pluggable Coordination Config
(Same main() as on slide 55; the coordination mechanism is only a config change!)
Configs with ZK-based coordination:
job.coordinator.factory = org.apache.samza.zk.ZkJobCoordinatorFactory
job.coordinator.zk.connect = foobar:2181/samza
Configs with Azure-based coordination:
job.coordinator.factory = org.apache.samza.azure.AzureJobCoordinatorFactory
job.coordinator.azure.connect = http://foobar:29892/storage/
59. Deploying Samza in a Managed Cluster (YARN)
[Diagram] The client submits the application JAR (run-app.sh) with app.class = MyStreamApplication, invoking RemoteApplicationRunner's main(), which talks to the YARN ResourceManager (RM). One NodeManager (NM) hosts the JobCoordinator (run-jc.sh); the remaining NMs each run a LocalApplicationRunner with a StreamProcessor (run-local-app.sh). A minimal config sketch follows.
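For reference, the YARN submission path is also driven by configuration; a minimal sketch, assuming the application is packaged as a tarball (the yarn.package.path value below is a placeholder):

# Assumed settings for YARN-based deployment (sketch):
# job.factory.class selects the YARN deployment model,
# yarn.package.path points at the packaged application archive.
job.factory.class = org.apache.samza.job.yarn.YarnJobFactory
yarn.package.path = hdfs://<namenode>/path/to/my-samza-app.tar.gz
app.class = MyStreamApplication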
60. Flexible Deployment Models
Samza as a Library:
- Run embedded stream processing in the user's program
- Use ZooKeeper for partition distribution among tasks and liveness of processors
- Seamlessly scale by spinning up a new processor instance
Samza as a Service:
- Run stream processing as a managed program in a cluster (e.g., YARN)
- Works with the cluster manager (e.g., AM/RM) for partition distribution among tasks and liveness of processors
- Better for resource sharing in a multi-tenant environment
61. Conclusion
● Easily composable architecture allows consuming from varied data sources
● Write Once, Run Anywhere paradigm
○ Unified API: application logic is written only once
○ Pluggable coordination model: allows application deployment across different execution environments
62. Future Work
● Support SQL on Streams with Samza
● Table Abstraction in Samza
● Event-time processing
● Samza runner for Apache Beam
Contributions are welcome!
● Contributor’s Corner - http://samza.apache.org/contribute/contributors-corner.html
● Ask any question - dev@samza.apache.org
● Follow or tweet us @apachesamza