Effective Multi-stream Joining for Enhancing Data Quality in Apache Samza Framework
Tao Feng, Zhenyun Zhuang, Haricharan Ramachandra
{tofeng, zzhuang, hramachandra}@linkedin.com
LinkedIn Corporation, 2029 Stierlin Court Mountain View, CA 94043 United States
Abstract—Increasing adoption of Big Data in business environments has driven the need for stream joining in a realtime fashion. Multi-stream joining is an important stream processing type in today's Internet companies, and it has been used to generate higher-quality data in business pipelines.
Multi-stream joining can be performed in two models: (1)
All-In-One (AIO) Joining and (2) Step-By-Step (SBS) Joining.
Both models have advantages and disadvantages with regard to
memory footprint, joining latency, deployment complexity, etc. In
this work, we analyze the performance tradeoffs associated with
these two models using Apache Samza and share our findings.
Index Terms—Multi-stream joining; Samza; Stream process-
ing; Data quality
I. INTRODUCTION
Data quality, a long-standing critical issue, faces more challenges in the Big Data era. In particular, the large data volumes and realtime requirements of many Internet companies have necessitated effective stream processing to drive their critical business data flows.
To deal with the streaming requirements of today's Big Data processing, stream processing frameworks such as Apache Samza [1], Apache Storm [2] and Spark [3] are being increasingly adopted by Internet companies such as LinkedIn and Twitter [4]. Apache Samza, with its advantage of efficiently handling large amounts of data, has been heavily used by companies like LinkedIn for many years [5]. LinkedIn's largest Samza job processes multiple millions of messages per second during peak traffic hours.
The wide deployment of Big Data solutions demands business-value augmentation in various areas. Stream joining, which joins multiple data streams for various purposes including improved data quality, is an important stream processing type in today's Internet companies, particularly those involving multiple types of online user activities. This type of stream processing essentially aggregates "nearly aligned" streams. It expects to receive related events from multiple input streams and eventually combines them into a single output event stream. The simplest form of stream joining joins two streams, for instance, joining a stream of ad clicks to a stream of ad impressions, so that the company can link the information on when an ad was shown to the information on when it was clicked.
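As an illustrative sketch (ours, not part of the paper), such a click-to-impression join can be expressed as a minimal in-memory hash join; the ad IDs and timestamps below are hypothetical:

```python
# Illustrative sketch only: a minimal in-memory join of two event streams
# keyed on a hypothetical ad_id. Real deployments (e.g., Samza) buffer per
# stream and handle out-of-order arrival, checkpointing, etc.
def join_two_streams(impressions, clicks):
    """Join (ad_id, shown_at) with (ad_id, clicked_at) on ad_id."""
    shown = {ad_id: t for ad_id, t in impressions}   # buffer the impression stream
    joined = []
    for ad_id, clicked_at in clicks:
        if ad_id in shown:                            # a matching impression arrived
            joined.append((ad_id, shown[ad_id], clicked_at))
    return joined

impressions = [("ad1", 100), ("ad2", 105)]
clicks = [("ad1", 130), ("ad3", 140)]  # ad3 has no impression, so it is dropped
print(join_two_streams(impressions, clicks))  # [('ad1', 100, 130)]
```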
As the business logic becomes more complicated and the
online data types multiply, there is a need to join multiple
streams (i.e., >= 3) in a realtime or near-realtime fashion. As
an example, at LinkedIn, our data flow joins multiple streams
to drive the site-speed (e.g., page display latency in different countries) dashboard. Each data stream corresponds to users from a single area (e.g., a country), and multiple streams are joined to form an aggregated view that allows filtering/aggregation on the dashboard. As another example,
different types of user activities (e.g., user ads clicking, user
profile viewing, user group discussions) can also be joined in
realtime to produce more detailed activities of users on the fly.
To join multiple streams, in general two models of solutions
can be used: (1) All-In-One (AIO) Joining, which simply
joins all incoming streams in the same processing unit (e.g.,
software instance); and (2) Step-By-Step (SBS) Joining, which
joins incoming streams in different processing units.
Both AIO and SBS have advantages and disadvantages. AIO
only needs a single processing job, which enables simpler
processing logic and easier deployment. It also has lower end-
to-end latency when emitting the final output stream. On the
other hand, AIO also has several disadvantages compared to
SBS. First, if the incoming streams are far apart in arrival
time, AIO’s memory footprint can be larger due to the longer
memory buffering time when joining all incoming streams.
With a larger memory footprint (e.g., Java heap), AIO may also incur related problems such as longer GC-caused (Garbage Collection) application pauses and longer startup delay. In addition, if any incoming stream is blocked, AIO
will be stuck and unable to produce any useful data. SBS,
on the contrary, can still produce useful data (in intermediate
streams) in steps before the blocked stream.
Motivated by the importance of multi-stream joining, we feel it necessary to study the performance tradeoffs of different processing models. In this work, we consider both models
of multi-stream joining, using Samza as the streaming frame-
work. The findings however are not limited to Samza, since
many of them also apply to other streaming frameworks.
For the remainder of the paper, after providing necessary
technical background in section II, we give two motivating
examples demonstrating how Samza-based stream processing
can help improve data quality in Section III. We then elaborate
on the two multi-stream processing models in Section IV. The
major part of this work is to analyze the performance tradeoffs
of the two models, which are presented in Section V. We set
up a real Samza test bed to verify our analysis in Section
VI. We also discuss the joining sequences of SBS in Section VII. Section VIII covers related work, and finally Section IX concludes the work.
The contributions of this work are:
• We characterize the approaches of performing multi-
stream joining into two models: AIO and SBS;
• We compare the performance tradeoffs of the two models
with regard to various performance metrics including
joining latency and memory footprint;
• We also use Samza framework to evaluate the two mod-
els. The results can be used for designing appropriate
multi-stream joining systems.
II. BACKGROUND
A. Data stream and stream processing framework
Given the growing size of big data, data streaming frame-
works such as Apache Kafka [6] have been designed and
deployed to handle data streams in a realtime or near-realtime fashion. To handle the increasingly complicated
business logic, various stream processing frameworks have
been proposed. Many of today's applications use stream processing frameworks to filter and aggregate data streams. One
example use of such frameworks is to increase data quality through filtering. Such data-
filtering applications are heavily deployed to filter streaming
data triggered by online activities coming from hundreds of
millions of users. For instance, user activities coming from
user-facing pages may contain inconsistent data; hence the
data need to be filtered to remove the inconsistency before
being persisted to backend storages such as Espresso [7].
B. Multi-stream joining
When there are multiple streams that need to be correlated
with each other, multi-stream joining becomes necessary. In
such scenarios, the multiple input streams share some common
features such as user IDs. By joining them and producing
another output stream, various benefits can be obtained.
Multi-stream joining can be used to improve data quality. Data consistency and accuracy of massive data are increasingly recognized as key attributes of Big Data. Poor data quality [8] negatively impacts enterprise companies, leading to customer dissatisfaction, increased cost, and bad decision-making.
In some scenarios, low-quality data streams can be improved
by aggregating with other data streams. For instance, a data
stream may only contain user click information in tuples of (cookie id, actions), which presents itself as low-quality data. Cookie ids need to be replaced by the corresponding userIDs before these activities can be persisted to storage as high-quality data for easier retrieval later. For this use case, another data stream needs to be utilized to enhance the data quality.
In order to perform stream joining, the application typically needs to buffer events for the time window over which it wants to join, a mechanism supported by many stream processing frameworks such as Samza.
C. Apache Samza
Stream processing is high-throughput data processing applied to massive data sets with near-realtime SLA guarantees.
Figure 1. Samza-based stream filtering for data quality enhancement
Figure 2. Samza-based stream joining for data quality enhancement
Apache Samza is a lightweight stream processing framework
originally created for solving continuous data processing at
LinkedIn and later open-sourced. Samza has a set of desirable features such as a simple API, fault tolerance, high durability, managed state, etc. Samza integrates with YARN [9] and
Kafka [6]. The most basic processing element of Samza is a
stream, which is composed of immutable messages of a similar
type or category. A stream can be the data from a messaging
system such as Kafka, or a database table, or even Hadoop
HDFS [10] files.
A stream can be further split into a set of partitions using
keys. For each partition, all messages are strictly ordered.
Samza currently ensures that separate applications live in sep-
arate containers (processes). Samza jobs process the incoming
streams. A Samza job is a set of code that performs logical
transformation on a set of input streams to append messages to
a set of output streams. To join incoming streams, the Samza job opens a separate buffer for each incoming stream. The job scans all buffered messages and merges aligned messages into the output stream.
Samza uses a single thread internally to handle reading and
writing messages, flushing metrics, checkpointing, windowing,
and so on. Samza can scale by creating multiple instances if a single instance is not enough to sustain the traffic. The Samza container creates an associated task to process the messages of each input stream topic partition. The Samza container
chooses a message from input stream partitions in a round-
robin way and passes the message to the related task. If the
Samza container uses stateful management, each task creates
one partition in the store to buffer messages. If the changelog
option is enabled, each task sends the changelog stream to
Kafka for durability purposes. Each task can emit messages
to output stream topics, depending on the use case [11].
III. ENHANCING DATA QUALITY USING SAMZA
Samza can be used to enhance data quality in streaming
scenarios. Depending on the specific data set, Samza can augment the quality of a single data stream, or it can join multiple low-quality data streams and produce a high-quality data stream. Specifically, the Samza framework is heavily used at LinkedIn for application and system monitoring, data filtering, tracking user behavior, etc. We describe two different usage scenarios of data filtering with Samza.
A. Samza-based Stream Filtering
In the first usage scenario, bad-quality data are filtered (i.e., removed) based on some rules and other data streams. For instance, if some streamed data items do not contain valid values (e.g., wireless network bandwidth measurement results for a particular user), they are simply discarded with the guidance of certain rules. The process is shown in Figure 1, where an input stream with mixed data quality (both low and high quality) is filtered to emit a high-quality stream. A set of rules guides the data filtering, and based on these rules low-quality data are discarded.
B. Samza-based Stream Joining
The other data filtering scenario is to enhance the low-
quality data streams by aggregating with other data streams.
For instance, a data stream may only contain user click information in tuples of (cookie id, actions), which presents itself as low-quality data. Cookie ids need to be replaced by the corresponding userIDs before these activities can be persisted
use case, another data stream needs to be utilized to enhance
the data quality. The process is shown in Figure 2, where
two low-quality streams are aggregated (i.e., joined) to emit a
high-quality data stream.
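A minimal sketch of this enrichment join (ours, not the paper's implementation), assuming a hypothetical mapping stream of (cookie id, userID) pairs:

```python
# Illustrative sketch: enriching a low-quality click stream of
# (cookie_id, action) tuples with a (cookie_id, user_id) mapping stream.
# Names follow the paper's example; the code itself is an assumption.
def enrich(click_stream, mapping_stream):
    cookie_to_user = dict(mapping_stream)  # buffer the mapping stream
    high_quality = []
    for cookie_id, action in click_stream:
        user_id = cookie_to_user.get(cookie_id)
        if user_id is not None:            # replace cookie id with user id
            high_quality.append((user_id, action))
    return high_quality

clicks = [("c1", "view_profile"), ("c2", "click_ad")]
mapping = [("c1", "user42")]
print(enrich(clicks, mapping))  # [('user42', 'view_profile')]
```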
IV. MULTI-STREAM JOINING MODELS
When joining multiple (i.e., >= 3) streams, there are different ways to perform the joining. Overall, we characterize the different joining fashions into two models: AIO (All-In-One) and SBS (Step-By-Step). In this section, we illustrate how these two models of multi-stream joining work.
To simplify the presentation, we use only 3 input streams. It should be straightforward to apply the two models to scenarios with more than 3 input streams.
A. All-In-One (AIO)
AIO multi-stream joining executes the task in a single
container. Samza allocates individual buffers for each of the
incoming streams so that a fixed number of messages from
each stream can be fetched before performing joining. As
shown in Figure 3(a), three input streams of A, B and C all
feed into the same container indicated by Samza Job. After
joining, the resulting output stream E is produced.
As the message arrival times of different input streams may vary, a typical AIO implementation uses multiple memory buffers to hold the input streams, one per stream. After buffering
Figure 3. Two models of multi-stream joining: (a) AIO multi-stream joining; (b) SBS multi-stream joining
some messages from all input streams, AIO may scan all input
streams to identify messages that can be joined. A message can be joined only if all corresponding messages (e.g., identified by a unique ID) are present in the message buffers.
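A minimal sketch of this AIO buffering logic, in plain Python rather than Samza's API (stream contents and keys are hypothetical):

```python
from collections import defaultdict

# Illustrative AIO sketch (not Samza code): one job buffers all n input
# streams and emits a joined record once every stream has a message for key K.
def aio_join(streams):
    """streams: a list of n lists of (key, value) messages."""
    n = len(streams)
    buffers = defaultdict(dict)            # key -> {stream index: value}
    out = []
    for i, stream in enumerate(streams):
        for key, value in stream:
            buffers[key][i] = value
            if len(buffers[key]) == n:     # all corresponding messages present
                out.append((key, [buffers[key][j] for j in range(n)]))
                del buffers[key]           # free the buffer space for this key
    return out

a = [("k1", "a1"), ("k2", "a2")]
b = [("k1", "b1")]                         # k2 never arrives on this stream
c = [("k1", "c1"), ("k2", "c2")]
print(aio_join([a, b, c]))  # [('k1', ['a1', 'b1', 'c1'])]
```

Note that key k2 is never emitted: AIO's all-or-nothing behavior means an incomplete key stays buffered, illustrating the blocked-stream issue discussed above.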
B. Step-By-Step (SBS)
Contrary to AIO, SBS multi-stream joining uses more than one container to perform the task. Depending on the number of input streams and how much each container is designed to handle, SBS involves multiple cascading steps. In Figure 3(b) we show a
simple case where 3 input streams are processed in two steps,
and each step only processes two input streams. The first step
of Samza Job I joins streams of A and B and outputs an
intermediate stream D. The intermediate stream D is fed into
the second step of Samza Job II and serves as one of the input
streams. Samza Job II then joins D and another input stream
C to produce final output stream of E.
Since SBS only joins a subset of input streams at each step,
it only needs to buffer the messages from the subset of input
streams. Similar to the AIO implementation, the SBS implementation also allocates message buffers for each input stream. The difference is that each SBS step allocates fewer memory buffers (due to the smaller number of input streams), scans fewer messages, and incurs less processing latency.
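The cascading structure of Figure 3(b) can be sketched by composing a pairwise join (again plain Python, not Samza code; the stream contents are hypothetical):

```python
from functools import reduce

# Illustrative SBS sketch: cascade pairwise joins, each step producing an
# intermediate stream, as in Figure 3(b) (Job I joins A and B into D,
# Job II joins D and C into E).
def pairwise_join(left, right):
    """Join two streams of (key, values) on key; values are tuples."""
    buffered = dict(left)                 # buffer only one of the two inputs
    return [(k, buffered[k] + v) for k, v in right if k in buffered]

def sbs_join(streams):
    return reduce(pairwise_join, streams) # left-to-right cascade of steps

a = [("k1", ("a1",)), ("k2", ("a2",))]
b = [("k1", ("b1",)), ("k2", ("b2",))]
c = [("k1", ("c1",))]
d = pairwise_join(a, b)       # intermediate stream D is valid even if C blocks
print(d)                      # [('k1', ('a1', 'b1')), ('k2', ('a2', 'b2'))]
print(sbs_join([a, b, c]))    # [('k1', ('a1', 'b1', 'c1'))]
```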
Note that when joining 4 or more input streams, there could be multiple ways of doing SBS. For instance, if each joining step only takes 2 input streams, then 3 steps are needed in total. If each step can join 3 input streams, then only 2 steps are needed. For simplicity, in the following presentation, we assume each step only joins two input streams unless otherwise specified1.
V. PERFORMANCE COMPARISON OF AIO AND SBS
Both AIO and SBS can achieve the same goal of multi-stream joining; however, they exhibit very different characteristics and associated performance tradeoffs. Now we compare

1The analysis done in this work can also easily be applied to other SBS types where more than 2 input streams are joined in a step.

Theorem-I
(a) For SBS, let S be the number of joining steps; then S − 1 intermediate output streams will be produced.
(b) If the first blocked input stream is joined in step n, then n − 1 intermediate streams can be produced.
Figure 4. Theorem-I: Partial joining results
these two models and analyze the differences.
A. Comparing AIO and SBS
AIO and SBS differ in many aspects due to their design
differences. Based on our production experiences with regard
to application development, deployment, administration, de-
bugging and optimizations, we choose the following 6 per-
formance metrics to compare: (1) Development, deployment
and administration complexity; (2) Partial joining results; (3)
Joining latency; (4) Stream joining throughput; (5) Memory
footprint; and (6) CPU usage.
B. Development, deployment and administration complexity
Since AIO performs the stream joining task in one step, it only needs a single container (or instance) to be deployed, which makes it easier to deploy and administer. By contrast, SBS needs to deploy multiple containers, and hence incurs more deployment and administration complexity.
Development-wise, however, AIO and SBS may see similar
complexity. The logic of AIO handling is relatively straight-
forward, as all input streams can be treated similarly, except
the particular joining fields. Depending on specific designs,
SBS development can also be straightforward. For example, if
each step is limited to take 2 input streams, then all steps can
share the same code base, while leaving the stream selections
to configurations and deployments.
C. Partial joining results
Irrespective of the number of input streams, AIO only produces a single output stream, a service guarantee of all or nothing. In other words, if for any reason (e.g., one input stream is blocked), the entire join is stuck and no new joined results will be appended to the output stream.
On the other hand, SBS is able to produce partial results even if some input streams are blocked. Take the example scenario shown in Figure 3(b): if Stream C is blocked, the intermediate output stream D is still a valid stream.
Depending on the application, the partially joined intermediate streams can be useful. For example, if the multiple input streams correspond to site-speed information from different countries, then any partial results could be used to drive some business logic. For SBS, the relationship between the number of joining steps and the number of intermediate streams is expressed in Theorem-I, displayed in Figure 4.
Theorem-II
Let T0 be the arrival time difference across all input streams;
let TA and TB be the processing latency incurred by each individual container for AIO and SBS, respectively.
(a) For AIO, the E2E joining latency is T0 + TA;
(b) For SBS, the E2E joining latency is T0 + nTB, where n is the number of joining steps.
Figure 5. Theorem-II: E2E joining latency
D. Joining latency
When performing stream joining, the end-to-end (E2E) latency is defined as follows. For a message with an attribute K (e.g., user ID) on which the stream joining is based, we refer to all messages that share K as the K-identified message set. When joining n streams, there are at most n messages for any unique K. For a K-identified message set, the E2E latency is the difference between the time when the first message is injected into any input stream and the time when the joining is done for K.
E2E latency consists of two major parts: inherent inter-stream lag and E2E joining latency. The first part, inter-stream lag, is the arrival time difference among all input streams, denoted by T0. For example, the messages of a K-identified message set may be injected into the input streams at different times. This part of the latency cannot be removed by any joining, and hence is the same irrespective of which processing model (i.e., SBS or AIO) is chosen.
The E2E joining latency is introduced by the stream pro-
cessing work, and it is an important performance metric to
consider. Its value should be kept as low as possible for
any latency-sensitive stream joining task. However, stream joining does introduce processing delays into the joining latency, mainly due to the message buffering needed to align the arrival differences of the input streams. The more steps involved, the higher the added processing delay.
As such, AIO is expected to have lower E2E joining latency
than SBS, as the latter incurs multiple processing delays,
while the former incurs it only once. Let the processing delay introduced by the Samza container in AIO be denoted by TA, and assume the processing delay introduced by each Samza container in SBS is the same, denoted by TB. We typically have TA ≥ TB. However, since SBS adds multiple TB terms, its E2E joining latency is typically larger than that of AIO. The E2E joining latency of AIO is T0 + TA, while that of SBS is T0 + nTB, as presented in Figure 5.
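Theorem-II can be illustrated with assumed (not measured) numbers:

```python
# Illustrative arithmetic for Theorem-II with assumed (not measured) values:
# T0 = inter-stream lag, TA/TB = per-container processing delays.
T0 = 60.0   # seconds of inter-stream lag
TA = 2.0    # AIO single-container delay
TB = 1.5    # SBS per-step delay (TA >= TB)
n = 2       # SBS joining steps for 3 input streams

aio_latency = T0 + TA          # Theorem-II(a)
sbs_latency = T0 + n * TB      # Theorem-II(b)
print(aio_latency, sbs_latency)  # 62.0 63.0 -> AIO wins on E2E joining latency
```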
E. Stream joining throughput
The throughput of the two models is defined as the number of K-identified message sets that are successfully joined per time unit. For both models, the throughput is determined by the capacity of the deployment, which is further dictated by the most constraining resource capacity and the resource
Figure 6. Memory footprint of stream joining
efficiency of each model. If, for example, CPU is the most
constraining (i.e., as seen from Linux “top” command) per-
formance bottleneck of a stream joining deployment, then
the resource efficiency (i.e., as measured by CPU usage per
joined message) and the available resource capacity together
determine the throughput. As another example, if the memory
footprint is the top performance bottleneck, then for a machine
with fixed memory size, the per-message memory usage of
the deployment will determine the throughput of the stream
joining system.
The value of this performance metric, however, depends
on other performance metrics such as CPU and memory
footprint, which will be discussed later. Hence we defer further
discussions to later sections.
F. Memory usage
Both AIO and SBS need to allocate memory space to buffer
the received messages from different input streams. For the
output stream, we assume each produced message will be
immediately sent out, hence not occupying memory space.
Assume the stream into which the first message of a K-identified message set is injected is Stream 1; then all other streams have an inter-stream lag with regard to Stream 1. Denote the lag of stream n as T1n. Typically, stream joining tasks buffer the first stream (i.e., Stream 1) using some fixed memory space denoted by M2. For the other streams, assuming the message size and message arrival rate are known, the corresponding memory space needed for each stream can be determined by the latency lag. For instance, for stream n, the memory space needed is M + T1n ∗ B, where B is a scaling factor based on message arrival rate and message size.
For AIO, given n input streams, the total memory usage is nM + Σi=2..n T1i·B. For SBS, the number of steps may be less than or equal to n − 1. We first take the extreme case where n − 1 steps are needed, and each step joins only two streams. The memory space needed for each step is M + Tδ, where Tδ is the latency lag between the two input streams of that step. If the joining sequence of input streams in SBS is optimized (i.e., the joining sequence aligns well with the inter-stream lags), the aggregated memory footprint can be minimized. We will discuss more on
2In other scenarios, a fixed number of messages or a fixed length of time of messages is buffered; the used memory space can be correspondingly determined.

Theorem-III
Let M be the base memory space of the earliest stream;
n be the total number of input streams;
Tij be the latency lag between streams i and j;
B be the scaling factor.
(a) For AIO, the memory footprint is nM + Σi=2..n T1i·B;
(b) For SBS, the minimum total memory footprint is (n − 1)M + T1n·B.
Figure 7. Theorem-III: Memory footprint

this in Section VII. Adding all steps together, the minimum total memory usage for SBS is (n − 1)M + T1n·B, where T1n is the longest latency lag among all input streams.
If we ignore other memory usages of the stream joining
container and only consider the memory space used for
buffering messages from input streams, the memory footprints
of both AIO and SBS are presented in Figure 7. Comparing the two, we can see that SBS has a smaller memory footprint.
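Theorem-III can likewise be checked with assumed numbers:

```python
# Illustrative arithmetic for Theorem-III with assumed (not measured) values.
# M = base buffer, B = scaling factor, lags T1i relative to stream 1.
M = 100.0                 # MB of base buffer per stream
B = 2.0                   # MB of extra buffer per second of lag
lags = [0.0, 30.0, 60.0]  # T11, T12, T13 for n = 3 streams
n = len(lags)

aio = n * M + B * sum(lags[1:])          # nM + sum_{i=2..n} T1i * B
sbs_min = (n - 1) * M + lags[-1] * B     # (n-1)M + T1n * B, optimal sequence
print(aio, sbs_min)  # 480.0 320.0 -> SBS has the smaller footprint
```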
G. CPU usage
AIO joins all input streams in one step, while SBS uses
multiple steps to join the streams and each step only joins a
subset of the input streams. The joining process needs to check all input streams for a particular K-identified message set, hence incurring computation-heavy processing. Because of this, the AIO container is expected to have higher CPU usage than any single SBS container.
When comparing the aggregated CPU usage of all steps
in SBS to that of AIO, things become more complicated.
Since each step needs to invoke a Samza container, and each container incurs certain house-keeping processing that does not directly contribute to the stream joining, the aggregated CPU usage of SBS may or may not exceed that of AIO.
If the house-keeping processing is heavy, then the multiple
containers in SBS may incur CPU overhead larger than that
of AIO. On the other hand, if the house-keeping processing is
light enough, then SBS could use less CPU resource than its
AIO counterpart.
H. Summary
We have compared the two multi-stream joining models
of AIO and SBS. In summary, AIO has the advantages of being easier to deploy and administer and having lower E2E joining latency, while SBS has the advantages of having a smaller memory footprint and emitting intermediate streams.
One particular implication of the memory footprint size is how to scale to handle large volumes of data. If the memory footprint of AIO is too big to fit into one node, the only way to scale is by partitioning. SBS, on the other hand, can scale easily by deploying different steps on different nodes. Another implication of memory footprint size concerns the JVM. For Java-based Samza joining,
Table I
JOINING LATENCY RESULTS

Scenario / Latency       AIO (ms)   SBS (ms)   Phase-I (ms)   Phase-II (ms)
Sync, 1 partition        1734       9845       4791           5054
Sync, 10 partitions      1573       8849       4478           4371
Async, 1 partition       2009       9811       4713           5096
Async, 10 partitions     1346       8081       3984           4097
Table II
MEMORY FOOTPRINTS (VALUES ARE IN MB UNLESS SPECIFIED IN "GB")

Scenario / Memory footprint   AIO     SBS     Phase-I   Phase-II
Sync, 1 partition             512     220     128       92
Sync, 10 partitions           9 GB    9 GB    4 GB      5 GB
Async, 1 partition            640     480     96        384
Async, 10 partitions          10 GB   9 GB    4 GB      5 GB
the larger the heap size, the longer the GC-caused pauses and startup delay.
VI. EVALUATION
A. Experiment setup
We focus on a case with 3 input streams and a single output
stream. The entire setup consists of the following components:
(1) a set of 3 Kafka producers generating messages for
each stream (or topic); (2) Samza job(s); and (3) a single
Kafka consumer that consumes the joined stream. The Kafka producers reside in a data center different from the Kafka clusters. Note that with this setup, the end-to-end joining latency is not negligible since: (1) it takes longer for the Kafka producers to produce messages to the Kafka cluster; and (2) for a Samza job, it takes longer to fetch messages from the Kafka cluster, which increases the single-container latency.
1) Number of partitions: The memory footprint of a Samza job is partly determined by the number of partitions in a Kafka topic. A Kafka topic can be composed of a single partition or multiple partitions. In multi-partition scenarios, Kafka producers can produce messages to any of the partitions. Samza jobs create a separate task to handle each topic partition. For stream joining purposes, a Samza job needs to buffer each message until it gets the messages with the same key from all the topics before performing a stream join; hence the memory requirement is much higher. We study both the 1-partition and multi-partition scenarios.
2) Inter-stream lag: The multiple input streams may not be
time-aligned well; instead, there could be inter-stream lags.
The significance of the inter-stream lag is that it impacts the
memory footprint of Samza jobs. As Samza jobs need to buffer messages in order to join them, the higher the inter-stream lag, the more buffer space is used. Because of this, we study scenarios both with and without considerable inter-stream lags. We refer to the scenario with near-zero inter-stream lags as the "synchronized (or sync)" scenario, and that with non-zero inter-stream lags as the "asynchronized (or async)" scenario.
In sync scenarios, Kafka producers publish the message
with the same key to all the topics around the same time.
Table III
CPU USAGE

Scenario / CPU usage     AIO (%)   SBS (%)   Phase-I (%)   Phase-II (%)
Sync, 1 partition        84        28        14            14
Sync, 10 partitions      101       95        59            36
Async, 1 partition       86        66        53            13
Async, 10 partitions     133       29        15            14
But in reality, due to environmental dynamics (e.g., networking latency), we still encounter certain lags for the messages with the same key in the respective Kafka topics. In async scenarios, we inject a 1-minute fixed time delay between any two topics for the messages with the same key3. Such tests put more memory pressure on the Samza job as it must buffer messages with different arrival times.
3) Kafka producers and consumer: The Kafka producer
used in the work is a Kafka console producer, which is a
command line tool. For each generated message, the Kafka producer first determines which partition the message will be sent to. It then queries the Kafka broker for topic partition metadata, based upon which it figures out which broker is the leader for that partition. Finally, it sends the data to the partition's leader broker.
The Samza job instantiates a Kafka consumer which issues
fetch requests to the brokers corresponding to the partitions it
wants to consume. Typically the Samza job prefetches a certain number of messages to avoid delay.
4) Samza jobs: Samza jobs use a single thread to handle
actions such as reading and writing messages, flushing metrics,
checkpointing, windowing, and so on. The Samza container
creates an associated task to process the messages of each
input stream, and it chooses messages from input stream
partitions in a round-robin way and passes the message to the
related task. If the Samza container uses stateful management,
each task creates one store to buffer messages. If the changelog
option is enabled, each task also sends the changelog stream
to Kafka for durability purposes. Each task can emit messages
to output stream topics, depending on the use case.
B. Results
We compare the performance of AIO and SBS. From
the 6 metrics listed in Section V We choose 3 important
performance metrics (i.e., joining latency, memory footprint
and CPU usage) since other three metrics (i.e., deployment
complexity, partial results, stream joining throughput) are ei-
ther straightforward or determined by the former three metrics.
In addition, we also examined the Samza jobs’ JVM GC
pauses, as these values can shed light on the CPU usage
efficiencies and application pauses.
1) Joining latency: AIO has only a single joining latency value, while SBS has two, as SBS performs its joining task in two phases (i.e., Phase-I and Phase-II) using two separate Samza jobs. The results are shown in Table I. We can see that

3For the 3-stream case, there is a 2-minute delay between the first and the third streams.
Figure 8. GC pauses in sync scenario: (a) 1 partition; (b) 10 partitions
Figure 9. GC pauses in async scenario: (a) 1 partition; (b) 10 partitions
AIO results in a much smaller joining latency than its SBS
counterpart. Depending on the scenario, the difference is
around 5X.
2) Memory footprint: We measure the memory footprint
of a Samza container based on the actual heap size usage,
as that shows the aggregated live object size. The maximum
heap size is set to 16 GB in all cases, but during stable state
the heap commits only a portion of it. AIO has only one
container to join 3 input streams, so it only has a single
memory footprint value, while SBS has two (one per phase). For the 4
scenarios, the results are listed in Table II. We see in general
lower memory footprint in AIO for almost all scenarios, except
in the second scenario where the values are on par.
3) CPU usage: For CPU usage measurement, we simply
take the machine level CPU usage values reported by “top”
utility [12]. As the machines we use are dedicated, identical
ones running only our workloads, we consider such a measure-
ment a fair comparison. For all 4 scenarios, AIO sees much
higher CPU usage when compared to SBS, and the results are
listed in Table III. Note that some values are higher than 100%
since the machines we use are multi-core machines and each
core has a CPU capacity of 100%.
4) JVM GC pauses: The JVM version we use is Oracle
HotSpot 8. We report Young GC pauses as they happen more
frequently. For each scenario, we capture about 10 minutes of
GC stats during stable state. For each scenario, we report both
the average value and the maximum value of the GC pauses in
Figures 8 and 9. We notice that AIO mostly has higher Young
GC pauses, owing to its larger heap size. Depending on the use
case, these GC pauses may or may not impact application
performance. For applications that care more about message
throughput, individual GC pauses are less important. On the
other hand, for applications that care about the joining latency
of each message, keeping GC pauses low is critical.
C. Summary of results
Our data evaluates a set of performance metrics (i.e., joining
latency, memory footprint, CPU usage and JVM GC pauses)
for both AIO and SBS. Given the vast number of variations
in stream-joining systems (e.g., number of input streams,
inter-stream lags, message size), the data is far from
comprehensive with regard to a thorough analysis and detailed
characterization of the two models. However, the results
clearly indicate the performance tradeoff inherent in the two
models. Users should carefully choose the most appropriate
model for their particular use cases.
VII. DISCUSSIONS: JOINING SEQUENCE OF SBS
For SBS, the joining sequence of multiple input streams
can significantly impact the performance of the entire
workflow. Specifically, with regard to the intermediate streams,
if one of the input streams for a particular step is blocked,
then no intermediate stream will be emitted from
this step. For example, for the 3-stream joining illustrated
[Figure 10. Using SBS to join 3 input streams: (a) 3 input streams with latency lags T12 and T13; (b) an optimal joining sequence producing intermediate stream-4 (aggregated memory footprint: 2M + T13B); (c) a suboptimal joining sequence (memory footprints of M + T13B and M + T32B, i.e., T32B larger than the optimal scenario).]
in Figure 3(b), if Stream-B is blocked for some reason,
then intermediate stream D will not be produced and all later
joining steps are also blocked.
As another example, if the input streams have different inter-
stream lags, then the joining sequence will greatly impact the
resulting memory footprint. As shown in Figure 10(a), assume
an SBS scenario in which 3 input streams need to be joined.
Denote stream-1 as the earliest stream, and let stream-2 and
stream-3 have latency lags of T12 and T13, respectively. Figure
10(b) illustrates an optimal joining sequence where stream-1
and stream-2 are joined first, and the resulting intermediate
stream-4 is joined with stream-3 in the second step. The
memory footprint for the first step is M + T12B as directed
by Theorem-III. Correspondingly, the memory footprint of the
second step is M + T23B, where T23 (or T32) is the latency
lag between stream-2 and stream-3. Since T12 + T23 = T13, the
aggregated memory footprint of Figure 10(b) is 2M + T13B.
If the joining sequence is different, as shown in Figure
10(c), then the memory footprints of the two SBS steps will be
M+T13B and M+T32B, respectively. Compared to the optimal
joining sequence, the increased memory usage is T32B.
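The footprint arithmetic above can be checked with a small sketch. The cost model follows Theorem-III (each step costs M plus the inter-stream lag difference times B, and the intermediate stream inherits the larger lag); the concrete values of M, B, and the lags below are illustrative assumptions, not measured numbers.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Cost model for an SBS join plan: joining two streams whose lags
// (relative to the earliest stream) are l1 and l2 costs M + |l1 - l2| * B
// of memory, and the intermediate stream inherits the larger lag.
public class SbsFootprint {
    static final long M = 100; // base footprint per step (assumed units)
    static final long B = 10;  // buffered bytes per unit of lag (assumed)

    // lags[] lists each stream's lag in the order the plan joins them.
    static long aggregatedFootprint(long... lags) {
        Deque<Long> pending = new ArrayDeque<>();
        for (long l : lags) pending.addLast(l);
        long total = 0;
        long current = pending.pollFirst();
        while (!pending.isEmpty()) {
            long next = pending.pollFirst();
            total += M + Math.abs(current - next) * B; // cost of this step
            current = Math.max(current, next);         // intermediate stream's lag
        }
        return total;
    }
}
// With T12 = 2 and T13 = 5 (so T32 = 3):
// optimal order (lags 0, 2, 5):    (M + 2B) + (M + 3B) = 2M + T13*B = 250
// suboptimal order (lags 0, 5, 2): (M + 5B) + (M + 3B) = 280, i.e. T32*B = 30 more
```

The difference between the two plans is exactly T32B, matching the comparison of Figures 10(b) and 10(c).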
Considering the above two types of examples, we propose
two approaches for determining the optimal joining sequences
of SBS: (1) memory-oriented, which aims at minimizing
the aggregated memory footprint; and (2) reliability-oriented,
which aims at providing as many intermediate streams as
possible.
Memory-oriented SBS For this approach, all input streams
are sorted by their latency lags with respect to the earliest
input stream. The SBS steps then simply join them
according to this ranking. By doing so, the
aggregated memory footprint, as directed by Theorem-III,
is minimized.
Reliability-oriented SBS This approach aims at providing
partial results to users. All input streams are ranked by
how likely their messages are to be blocked, with the most
reliable input streams coming first. By joining input streams
in order of reliability, the chance of intermediate streams
being blocked is minimized.
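Both approaches amount to sorting the input streams by a different key before chaining the SBS steps. A minimal sketch in Java follows; the stream descriptor and its fields (lag, estimated block probability) are hypothetical constructs, not part of Samza.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class JoinOrdering {
    // Hypothetical descriptor: each input stream carries its latency lag
    // (vs. the earliest stream) and an estimated probability of being blocked.
    record StreamDesc(String name, long lag, double blockProb) {}

    // Memory-oriented: join in ascending lag order to minimize the
    // aggregated memory footprint.
    static List<String> memoryOriented(List<StreamDesc> streams) {
        return streams.stream()
                .sorted(Comparator.comparingLong(StreamDesc::lag))
                .map(StreamDesc::name)
                .collect(Collectors.toList());
    }

    // Reliability-oriented: join the most reliable streams first, so early
    // intermediate streams survive as partial results if a stream blocks.
    static List<String> reliabilityOriented(List<StreamDesc> streams) {
        return streams.stream()
                .sorted(Comparator.comparingDouble(StreamDesc::blockProb))
                .map(StreamDesc::name)
                .collect(Collectors.toList());
    }
}
```

The returned ordering determines which pair each SBS step joins: the first two streams form step one, and each subsequent stream joins the running intermediate stream.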
VIII. RELATED WORK
A. Streaming systems
Many systems designed for stream processing are
available in industry, each with different
attributes to serve stream processing needs. For instance,
S4 [13] is a distributed stream processing engine originally
developed at Yahoo. Apache Storm [2], [4] is another popu-
lar stream processing system, which originated at
Twitter. Storm utilizes Apache Mesos [14] to schedule the
processing jobs and provides two message delivery semantics (at
least once, at most once). MillWheel [15] is an in-house
solution developed by Google to serve its stream processing
purposes. Apache Spark [3], a streaming framework used by
Yahoo and Baidu, treats streaming as a series of deterministic
batch operations, such that developers can use the same algo-
rithms in both streaming and batch modes. Compared to Spark,
which is a batch processing framework that can approximate
stream processing, Flink [16] is primarily a stream processing
framework that can look like a batch processor.
B. Capacity model and capacity planning
There are many works in the general area of capacity
planning. The work [17] presents a four-phased model for
computer capacity planning (CCP); the model can be used
to develop a formal capacity planning function. Our work
[18] considers a practical problem of capacity planning for
database replication for a large scale website, and presents
a model to forecast future traffic and determine required
software capacity. Our recent work [19] presents a memory
capacity model for data-filtering applications with Samza.
C. Java heap sizing
To size the Java heap appropriately, various works have
sought to strike a balance among competing tradeoffs. The work
[20] analyzes memory usage on the Java heap through object
tracing, based on the observation that an inappropriate heap size
can lead to performance problems caused by excessive GC
overhead or memory paging. It then presents an automatic
heap-sizing algorithm applicable to different garbage collec-
tors with only minor changes. The paper [21] presents a heap
sizing rule for dealing with different memory sizes with
regard to heap size; specifically, it models how severe
page faults can be using a Heap-Aware Page Fault
Equation.
D. Data quality
Veracity [22], which refers to the consistency and accuracy
of massive data, has been increasingly recognized as one
of the key attributes of Big Data. Poor data quality [8] impacts
enterprise companies, leading to customer dissatisfaction,
increased cost, bad decision-making, etc. There are many
works on developing systems to improve data quality. The
work [23] presents a system based on the conditional functional
dependencies (CFDs) model for improving the quality of rela-
tional data.
IX. CONCLUSION
This work studies multi-stream joining models using the
Apache Samza framework. We characterize the two models,
SBS and AIO, and analyze their performance tradeoffs. We also set
up a Samza testbed using Kafka streams to verify our theoretical
analysis. The results can be used to design and implement systems
to more efficiently join multiple streams.
REFERENCES
[1] “Apache samza,” http://samza.apache.org/.
[2] “Apache storm,” http://storm.apache.org/.
[3] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley,
M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets:
A fault-tolerant abstraction for in-memory cluster computing,” in Pro-
ceedings of the 9th USENIX Conference on Networked Systems Design
and Implementation, ser. NSDI’12, San Jose, CA, 2012.
[4] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulka-
rni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and
D. Ryaboy, “Storm@twitter,” in Proceedings of the 2014 ACM SIGMOD
International Conference on Management of Data, ser. SIGMOD ’14,
Snowbird, Utah, USA, 2014.
[5] “Apache samza: Linkedin’s real-time stream processing framework,”
https://engineering.linkedin.com/data-streams/apache-samza-linkedins-
real-time-stream-processing-framework.
[6] “Apache kafka,” http://kafka.apache.org/.
[7] L. Qiao, K. Surlaker, S. Das, and et. al, “On brewing fresh espresso:
Linkedin’s distributed data serving platform,” in Proceedings of the 2013
ACM SIGMOD International Conference on Management of Data, ser.
SIGMOD ’13, New York, New York, USA, 2013.
[8] T. C. Redman, “The impact of poor data quality on the typical enter-
prise,” Commun. ACM, vol. 41, no. 2, pp. 79–82, Feb. 1998.
[9] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar,
R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino,
O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache
hadoop yarn: Yet another resource negotiator,” in Proceedings of the 4th
Annual Symposium on Cloud Computing, ser. SOCC ’13, Santa Clara,
California, 2013.
[10] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop dis-
tributed file system,” in Proceedings of the 2010 IEEE 26th Symposium
on Mass Storage Systems and Technologies (MSST), ser. MSST ’10,
Washington, DC, USA, 2010.
[11] “Benchmarking apache samza: 1.2 million messages per second on a
single host,” http://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node.
[12] M. G. Sobell, A Practical Guide to Linux Commands, Editors, and Shell
Programming. Upper Saddle River, NJ, USA: Prentice Hall PTR,
2005.
[13] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed
stream computing platform,” in Proceedings of the 2010 IEEE Inter-
national Conference on Data Mining Workshops, ser. ICDMW ’10,
Washington, DC, USA, 2010.
[14] “Apache mesos,” http://mesos.apache.org/.
[15] T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax,
S. McVeety, D. Mills, P. Nordstrom, and S. Whittle, “Millwheel: Fault-
tolerant stream processing at internet scale,” Proc. VLDB Endow., vol. 6,
no. 11, pp. 1033–1044, Aug. 2013.
[16] “Apache flink,” https://flink.apache.org/.
[17] I. L. Carper, S. Harvey, and J. C. Wetherbe, “Computer capacity
planning: Strategy and methodologies,” SIGMIS Database, vol. 14, no. 4,
pp. 3–13, Jul. 1983.
[18] Z. Zhuang, H. Ramachandra, C. Tran, S. Subramaniam, C. Botev,
C. Xiong, and B. Sridharan, “Capacity planning and headroom analysis
for taming database replication latency: Experiences with linkedin
internet traffic,” in Proceedings of the 6th ACM/SPEC International
Conference on Performance Engineering, ser. ICPE ’15, Austin, Texas,
USA, 2015, pp. 39–50.
[19] T. Feng, Z. Zhuang, Y. Pan, and H. Ramachandra, “A memory capacity
model for high performing data-filtering applications in samza frame-
work,” in Proceedings of the 2015 IEEE International Conference on
Big Data - Workshop on Data Quality Issues, Santa Clara, CA, USA,
2015, pp. 2600–2605.
[20] P. Lengauer, V. Bitto, and H. Mössenböck, “Accurate and efficient object
tracing for java applications,” in Proceedings of the 6th ACM/SPEC
International Conference on Performance Engineering, ser. ICPE ’15,
Austin, Texas, USA, 2015.
[21] Y. C. Tay, X. Zong, and X. He, “An equation-based heap sizing rule,”
Perform. Eval., vol. 70, no. 11, pp. 948–964, Nov. 2013.
[22] B. Saha and D. Srivastava, “Data quality: The other face of big
data,” in Proceedings of IEEE 30th International Conference on Data
Engineering, ser. ICDE ’14, Chicago, IL, USA, 2014.
[23] W. Fan, F. Geerts, and X. Jia, “Semandaq: A data quality system based
on conditional functional dependencies,” Proc. VLDB Endow., vol. 1,
no. 2, pp. 1460–1463, Aug. 2008.