Effective Multi-stream Joining for Enhancing Data
Quality in Apache Samza Framework
Tao Feng, Zhenyun Zhuang, Haricharan Ramachandra
{tofeng, zzhuang, hramachandra}@linkedin.com
LinkedIn Corporation, 2029 Stierlin Court Mountain View, CA 94043 United States
Abstract—Increasing adoption of Big Data in business environments has driven the need for realtime stream joining. Multi-stream joining is an important type of stream processing in today's Internet companies, where it is used to generate higher-quality data in business pipelines.
Multi-stream joining can be performed in two models: (1)
All-In-One (AIO) Joining and (2) Step-By-Step (SBS) Joining.
Both models have advantages and disadvantages with regard to
memory footprint, joining latency, deployment complexity, etc. In
this work, we analyze the performance tradeoffs associated with
these two models using Apache Samza and share our findings.
Index Terms—Multi-stream joining; Samza; Stream process-
ing; Data quality
I. INTRODUCTION
Data quality, a long-standing critical issue, faces even more challenges in the Big Data era. In particular, the large data volumes and realtime requirements of many Internet companies have necessitated effective stream processing to drive their critical business data flows.
To deal with the streaming requirements of today's Big Data
processing, stream processing frameworks such as Apache
Samza [1], Apache Storm [2] and Spark [3] are being increas-
ingly adopted by Internet companies such as LinkedIn and
Twitter [4]. Apache Samza, with its advantage of efficiently
handling large amount of data, has been heavily used by
companies like LinkedIn for many years [5]. LinkedIn's largest Samza job processes millions of messages per second during peak traffic hours.
The wide deployment of Big Data solutions demands business-value augmentation in various areas. Stream joining, which joins multiple data streams for various purposes including increased data quality, is an important type of stream processing in today's Internet companies, particularly those involving multiple types of online user activities. This type of stream
processing essentially aggregates “nearly aligned” streams. It
expects to receive related events from multiple input streams
and eventually combines them into a single output event
stream. The most simple stream joining is to join two streams,
for instance, joining a stream of ads clicks to a stream of ads
impressions, so that the company can link the information on
when the ad was shown to the information on when it was
clicked.
As the business logic becomes more complicated and the
online data types multiply, there is a need to join multiple
streams (i.e., >= 3) in a realtime or near-realtime fashion. As
an example, at LinkedIn, our data flow joins multiple streams
to drive the site-speed (e.g., page display latency in different
countries) dashboard. Each data stream corresponds to users from a single area (e.g., a country), and multiple streams are joined to form an aggregated view which allows
filtering/aggregation on the dashboard. As another example,
different types of user activities (e.g., user ads clicking, user
profile viewing, user group discussions) can also be joined in
realtime to produce more detailed activities of users on the fly.
To join multiple streams, in general two models of solutions
can be used: (1) All-In-One (AIO) Joining, which simply
joins all incoming streams in the same processing unit (e.g.,
software instance); and (2) Step-By-Step (SBS) Joining, which
joins incoming streams in different processing units.
Both AIO and SBS have advantages and disadvantages. AIO
only needs a single processing job, which enables simpler
processing logic and easier deployment. It also has lower end-
to-end latency when emitting the final output stream. On the
other hand, AIO also has several disadvantages compared to
SBS. First, if the incoming streams are far apart in arrival
time, AIO’s memory footprint can be larger due to the longer
memory buffering time when joining all incoming streams.
With larger memory footprint (e.g. Java heap), AIO may
also incur other related problems such as larger GC-caused
(Garbage Collection) application pauses and longer startup
delay. In addition, if any incoming stream is blocked, AIO
will be stuck and unable to produce any useful data. SBS,
on the contrary, can still produce useful data (in intermediate
streams) in steps before the blocked stream.
Motivated by the importance of multi-stream joining, we feel it necessary to study the performance tradeoffs of different processing models. In this work, we consider both models of multi-stream joining, using Samza as the streaming framework. The findings, however, are not limited to Samza, since many of them also apply to other streaming frameworks.
For the remainder of the paper, after providing necessary
technical background in section II, we give two motivating
examples demonstrating how Samza-based stream processing
can help improve data quality in Section III. We then elaborate
on the two multi-stream processing models in Section IV. The
major part of this work is to analyze the performance tradeoffs
of the two models, which are presented in Section V. We set
up a real Samza test bed to verify our analysis in Section
VI. We also discuss the joining sequences of SBS in Section
VII. Section VIII gives related works, and finally Section IX
concludes the work.
The contributions of this work are:
• We characterize the approaches of performing multi-
stream joining into two models: AIO and SBS;
• We compare the performance tradeoffs of the two models
with regard to various performance metrics including
joining latency and memory footprint;
• We also use Samza framework to evaluate the two mod-
els. The results can be used for designing appropriate
multi-stream joining systems.
II. BACKGROUND
A. Data stream and stream processing framework
Given the growing size of big data, data streaming frame-
works such as Apache Kafka [6] have been designed and
deployed to handle the data streams in realtime or near-
realtime fashions. To handle the increasingly complicated
business logic, various stream processing frameworks have
been proposed. Many of today's applications use stream processing frameworks to filter and aggregate data streams. One
example usage of such stream processing framework is to
increase the quality of the data using filtering. Such data-
filtering applications are heavily deployed to filter streaming
data triggered by online activities coming from hundreds of
millions of users. For instance, user activities coming from
user-facing pages may contain inconsistent data; hence the
data need to be filtered to remove the inconsistency before
being persisted to backend storages such as Espresso [7].
B. Multi-stream joining
When there are multiple streams that need to be correlated
with each other, multi-stream joining becomes necessary. In
such scenarios, the multiple input streams share some common
features such as user IDs. By joining them and producing
another output stream, various benefits can be obtained.
Multi-stream joining can be used to improve data quality.
Data consistency and accuracy of massive data has been increasingly recognized as one of the key attributes of Big Data. Poor data quality [8] negatively impacts enterprise companies, leading to customer dissatisfaction, increased cost, bad decision-making, etc.
In some scenarios, low-quality data streams can be improved
by aggregating with other data streams. For instance, a data
stream may only have user click information in the tuple of
(cookie id, actions) which presents itself as low-quality data.
Cookie ids need to be replaced by the corresponding user IDs before
these activities can be persisted to storage as high-quality data
for easier retrieval later. For this use case, another data stream
needs to be utilized to enhance the data quality.
In order to perform stream joining, the application typically needs to buffer events over the time window in which it wants to join; this buffering mechanism is provided by many stream processing frameworks such as Samza.
C. Apache Samza
Stream processing is high-throughput data processing applied to massive data sets with near-realtime SLA guarantees.
[Figure 1: A Samza job takes an input stream of mixed quality, applies a set of rules, and emits a high-quality output stream while discarding low-quality data.]
Figure 1. Samza-based stream filtering for data quality enhancement
[Figure 2: Two low-quality input streams (Input Data I and Input Data II) are joined by a Samza job into a single high-quality output stream.]
Figure 2. Samza-based stream joining for data quality enhancement
Apache Samza is a lightweight stream processing framework
originally created for solving continuous data processing at
LinkedIn and later open-sourced. Samza has a set of desired
features such as simple API, fault tolerance, high durability,
managed state, etc. Samza integrates with YARN [9] and
Kafka [6]. The most basic processing element of Samza is a
stream, which is composed of immutable messages of a similar
type or category. A stream can be the data from a messaging
system such as Kafka, or a database table, or even Hadoop
HDFS [10] files.
A stream can be further split into a set of partitions using
keys. For each partition, all messages are strictly ordered.
Samza currently ensures that separate applications live in separate containers (processes). Samza jobs process the incoming streams. A Samza job is a set of code that performs logical transformations on a set of input streams and appends messages to a set of output streams. To join incoming streams, a Samza job opens a separate buffer for each incoming stream. The job scans all buffered messages and merges aligned messages into the output stream.
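The per-stream buffering and key-based merge described above can be sketched as follows. This is a minimal illustrative model, not the actual Samza API; the class name `MultiStreamJoiner` and the message shapes are hypothetical:

```python
class MultiStreamJoiner:
    """Simplified model of the join logic described above: one buffer per
    input stream, keyed by the join attribute; a joined message is emitted
    only once the key has arrived on every input stream."""

    def __init__(self, stream_names):
        self.stream_names = set(stream_names)
        # buffers[stream][key] -> buffered message for that key
        self.buffers = {s: {} for s in stream_names}

    def process(self, stream, key, message):
        self.buffers[stream][key] = message
        # Join condition: the key is present in every stream's buffer.
        if all(key in self.buffers[s] for s in self.stream_names):
            joined = {s: self.buffers[s].pop(key) for s in self.stream_names}
            return joined  # emit to the output stream
        return None  # keep buffering until the remaining streams catch up

joiner = MultiStreamJoiner(["A", "B", "C"])
assert joiner.process("A", "k1", {"v": 1}) is None
assert joiner.process("B", "k1", {"v": 2}) is None
assert joiner.process("C", "k1", {"v": 3}) == \
    {"A": {"v": 1}, "B": {"v": 2}, "C": {"v": 3}}
```

Note that a buffered key is dropped from all buffers once joined, which is what keeps the memory footprint bounded by the inter-stream lag rather than the total stream length.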
Samza uses a single thread internally to handle reading and
writing messages, flushing metrics, checkpointing, windowing,
and so on. Samza can scale out by creating multiple instances if a single instance is not enough to sustain the traffic. A Samza container creates an associated task to process the messages
of each input stream topic partition. The Samza container
chooses a message from input stream partitions in a round-
robin way and passes the message to the related task. If the
Samza container uses stateful management, each task creates
one partition in the store to buffer messages. If the changelog
option is enabled, each task sends the changelog stream to
Kafka for durability purposes. Each task can emit messages
to output stream topics, depending on the use case [11].
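The container's round-robin dispatch from partitions to per-partition tasks can be sketched as below. This is a simplified illustration with hypothetical topic-partition names; Kafka polling, checkpointing, and windowing are omitted:

```python
from itertools import cycle

# Hypothetical per-partition queues: each input stream topic partition
# has pending messages and an associated task.
partitions = {
    "topicA-0": ["a1", "a2"],
    "topicB-0": ["b1"],
    "topicC-0": ["c1", "c2"],
}

def handle(task_name, message):
    # Stand-in for the per-partition task's message-processing callback.
    return f"{task_name} processed {message}"

# The container picks one message from the partitions in round-robin
# order and passes each message to the related task.
log = []
rr = cycle(sorted(partitions))
while any(partitions.values()):
    p = next(rr)
    if partitions[p]:
        log.append(handle(f"task[{p}]", partitions[p].pop(0)))
```

The round-robin order interleaves the partitions fairly: one message from each non-empty partition per round, so no single busy partition starves the others.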
III. ENHANCING DATA QUALITY USING SAMZA
Samza can be used to enhance data quality in streaming scenarios. Depending on the specific data set, Samza can augment the quality of a data stream, or it can join multiple low-quality data streams to produce a high-quality data stream. At LinkedIn, the Samza framework is heavily used for application and system monitoring, data filtering, tracking user behavior, etc. We describe two usage scenarios of data filtering with Samza.
A. Samza-based Stream Filtering
In the first usage scenario, bad-quality data are filtered (i.e., removed) based on some rules and other data streams. For
instance, if some streamed data items do not contain valid
values (e.g., wireless network bandwidth measurement results
for a particular user), they are simply discarded with the
guidance of certain rules. The process is shown in Figure 1,
where an input stream with mixed data quality (both low and
high quality) is filtered to emit a high-quality stream. A set
of rules is used to guide the data filtering, and based on the rules low-quality data are discarded.
B. Samza-based Stream Joining
The other data filtering scenario is to enhance the low-
quality data streams by aggregating with other data streams.
For instance, a data stream may only have user click infor-
mation in the tuple of (cookie id, actions) which presents
itself as low-quality data. Cookie ids need to be replaced by the corresponding user IDs before these activities can be persisted
to storage as high-quality data for easier retrieval later. For this
use case, another data stream needs to be utilized to enhance
the data quality. The process is shown in Figure 2, where
two low-quality streams are aggregated (i.e., joined) to emit a
high-quality data stream.
IV. MULTI-STREAM JOINING MODELS
When joining multiple (i.e., >= 3) streams, the joining can be performed in different ways. Overall, we characterize the different joining fashions into two models: AIO (All-In-One) and SBS (Step-By-Step). In this section, we illustrate how these two models of multi-stream joining work. To simplify the presentation, we use only 3 input streams; it is straightforward to apply the two models to scenarios with more than 3 input streams.
A. All-In-One (AIO)
AIO multi-stream joining executes the task in a single
container. Samza allocates individual buffers for each of the
incoming streams so that a fixed number of messages from
each stream can be fetched before performing joining. As
shown in Figure 3(a), three input streams of A, B and C all
feed into the same container indicated by Samza Job. After
joining, the resulting output stream E is produced.
As the message arrival time for different input streams may
vary, a typical implementation of AIO uses multiple memory
buffers to hold input streams, one stream each. After buffering
[Figure 3(a): Input streams A, B, and C all feed into a single Samza job, which produces output stream E.]
[Figure 3(b): Samza Job I joins streams A and B into intermediate stream D; Samza Job II then joins D with stream C to produce output stream E.]
Figure 3. Two models of multi-stream joining (AIO and SBS)
some messages from all input streams, AIO may scan all input
streams to identify messages that can be joined. A message can
be joined only if all corresponding messages (e.g., identified by a unique ID) are present in the message buffers.
B. Step-By-Step (SBS)
Contrary to AIO, SBS multi-stream joining uses more than one container to perform the task. Depending on the number of input streams and how many streams each container is designed to join, SBS may involve multiple cascading steps. In Figure 3(b) we show a
simple case where 3 input streams are processed in two steps,
and each step only processes two input streams. The first step
of Samza Job I joins streams of A and B and outputs an
intermediate stream D. The intermediate stream D is fed into
the second step of Samza Job II and serves as one of the input
streams. Samza Job II then joins D and another input stream
C to produce final output stream of E.
Since SBS only joins a subset of the input streams at each step, it only needs to buffer the messages from that subset. Similar to an AIO implementation, an SBS implementation also allocates message buffers for each input stream. The difference is that each SBS step allocates fewer memory buffers (due to the smaller number of input streams), scans fewer messages, and incurs lower processing latency.
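Assuming each step joins exactly two streams, the SBS cascade of Figure 3(b) can be modeled as a fold of pairwise joins over dicts keyed by the join attribute. This is a sketch for illustration, not the Samza API; note how an intermediate stream remains available even when a later stream is missing a key:

```python
def pairwise_join(left, right):
    """Join two streams (dicts keyed by the join attribute), keeping
    only keys present in both -- one SBS step."""
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}

def sbs_join(streams):
    """Cascade pairwise joins: ((A join B) join C) ..., producing one
    intermediate stream per step before the final output."""
    result = streams[0]
    outputs = []
    for s in streams[1:]:
        result = pairwise_join(result, s)
        outputs.append(result)
    return outputs[:-1], result  # (intermediate streams, final output)

A = {"k1": {"a": 1}, "k2": {"a": 2}}
B = {"k1": {"b": 10}, "k2": {"b": 20}}
C = {"k1": {"c": 100}}  # k2 has not yet arrived on stream C

intermediate, output = sbs_join([A, B, C])
# Stream D (A joined with B) still contains k2 even though stream C is
# missing it -- the partial result that AIO cannot provide.
assert intermediate[0] == {"k1": {"a": 1, "b": 10},
                           "k2": {"a": 2, "b": 20}}
assert output == {"k1": {"a": 1, "b": 10, "c": 100}}
```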
Note that when joining 4 or more input streams, there can be multiple ways of doing SBS. For instance, if each joining step only takes 2 input streams, then a total of 3 steps is needed; if each step can join 3 input streams, then only 2 steps are needed. For simplicity, in the following presentation, we assume each step only joins two input streams unless otherwise specified.1
V. PERFORMANCE COMPARISON OF AIO AND SBS
Both AIO and SBS can achieve the same goal of multi-stream joining; however, they exhibit very different characteristics and associated performance tradeoffs. We now compare these two models and analyze the differences.

1The analysis done in this work can also easily be applied to other SBS types where more than 2 input streams are joined in a step.

Theorem-I
(a) For SBS, let S be the number of joining steps; then S − 1 intermediate output streams will be produced.
(b) If the first blocked input stream is joined in step n, then n − 1 intermediate streams can be produced.
Figure 4. Theorem-I: Partial joining results
A. Comparing AIO and SBS
AIO and SBS differ in many aspects due to their design
differences. Based on our production experiences with regard
to application development, deployment, administration, de-
bugging and optimizations, we choose the following 6 per-
formance metrics to compare: (1) Development, deployment
and administration complexity; (2) Partial joining results; (3)
Joining latency; (4) Stream joining throughput; (5) Memory
footprint; and (6) CPU usage.
B. Development, deployment and administration complexity
Since AIO performs the stream joining task in one step, it only needs a single container (or instance) to be deployed, which is easier to deploy and administer. By contrast, SBS needs to deploy multiple containers, hence incurring more deployment and administration complexity.
Development-wise, however, AIO and SBS may see similar
complexity. The logic of AIO handling is relatively straight-
forward, as all input streams can be treated similarly, except
the particular joining fields. Depending on specific designs,
SBS development can also be straightforward. For example, if
each step is limited to take 2 input streams, then all steps can
share the same code base, while leaving the stream selections
to configurations and deployments.
C. Partial joining results
Irrespective of the number of input streams, AIO only produces a single output stream, with an all-or-nothing service guarantee. In other words, if for any reason an input stream is blocked, the entire joining is stuck and no new joined results will be appended to the output stream.
On the other hand, SBS is able to produce partial results,
even if some input streams are blocked. Take the example
scenario shown in Figure 3(b). If Stream C is blocked, the
intermediate output stream of D is still a valid stream.
Depending on the application, the partially joined intermediate streams can be useful. For example, if the multiple input streams correspond to site-speed information from different countries, then any partial results can be used to drive some business logic. For SBS, the relationship between the number of joining steps and the number of intermediate streams is expressed in Theorem-I, displayed in Figure 4.
Theorem-II
Let T0 be the inherent arrival-time difference among all input streams;
Let TA and TB be the processing latency incurred by each individual container for AIO and SBS, respectively.
(a) For AIO, the E2E joining latency is T0 + TA;
(b) For SBS, the E2E joining latency is T0 + nTB, where n is the number of joining steps.
Figure 5. Theorem-II: E2E joining latency
D. Joining latency
When performing stream joining, the end-to-end (E2E)
latency is defined as follows. For a message with an attribute K (e.g., user ID) based on which the stream joining is done, we refer to all messages that share K as the K-identified message set. When joining n streams, there are at most n messages for any unique K. For a K-identified message set, the E2E latency is the difference between the time when the first message is injected into any input stream and the time when the joining is done for K.
E2E latency consists of two major parts: the inherent inter-stream lag and the E2E joining latency. The inter-stream lag is the arrival time difference among all input streams, denoted by T0. For example, the messages of a K-identified message set may be injected into the input streams at different times. This part of the latency cannot be removed by any joining, and hence is the same irrespective of which processing model (i.e., SBS or AIO) is chosen.
The E2E joining latency is introduced by the stream pro-
cessing work, and it is an important performance metric to
consider. Its value should be kept as low as possible for
any latency-sensitive stream joining tasks. However, stream joining does introduce processing delays into the joining latency, mainly due to the message buffering needed to align the arrival differences of the input streams. The more steps involved, the higher the added processing delay.
As such, AIO is expected to have lower E2E joining latency than SBS, as the latter incurs multiple processing delays while the former incurs only one. Denote the processing delay introduced by the Samza container in AIO by TA, and assume the processing delay introduced by each Samza container in SBS is the same, denoted by TB. We typically have TA ≥ TB; however, since SBS adds multiple TB terms, its E2E joining latency is typically larger than that of AIO. The E2E joining latency of AIO is T0 + TA, while that of SBS is T0 + nTB, as presented in Figure 5.
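Theorem-II can be illustrated with a small numeric model. The latency values below are illustrative assumptions, not measurements:

```python
def aio_latency(t0, ta):
    # AIO E2E joining latency: inter-stream lag plus one container delay.
    return t0 + ta

def sbs_latency(t0, tb, n_steps):
    # SBS E2E joining latency: inter-stream lag plus one container delay
    # per joining step.
    return t0 + n_steps * tb

# Illustrative values (ms): TA >= TB, but SBS pays TB once per step.
t0, ta, tb, steps = 1000, 400, 300, 2
assert aio_latency(t0, ta) == 1400
assert sbs_latency(t0, tb, steps) == 1600
assert sbs_latency(t0, tb, steps) > aio_latency(t0, ta)
```

Even though each SBS container is individually faster (TB < TA), the multiplied nTB term dominates once there is more than one step, matching the roughly 5X gap observed later in the evaluation.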
E. Stream joining throughput
The throughput of the two models is defined as the number of K-identified message sets that are successfully joined per time unit. For both models, the throughput is determined by the capacity of the deployment, which is in turn dictated by the most constraining resource capacity and the resource
[Figure 6: Stream 1 uses a base buffer of size M; stream 2 lags stream 1 by T12 and needs a buffer of M + T12 · B; stream n lags by T1n and needs M + T1n · B.]
Figure 6. Memory footprint of stream joining
efficiency of each model. If, for example, CPU is the most
constraining (i.e., as seen from Linux “top” command) per-
formance bottleneck of a stream joining deployment, then
the resource efficiency (i.e., as measured by CPU usage per
joined message) and the available resource capacity together
determine the throughput. As another example, if the memory
footprint is the top performance bottleneck, then for a machine
with fixed memory size, the per-message memory usage of
the deployment will determine the throughput of the stream
joining system.
The value of this performance metric, however, depends
on other performance metrics such as CPU and memory
footprint, which will be discussed later. Hence we defer further
discussions to later sections.
F. Memory usage
Both AIO and SBS need to allocate memory space to buffer
the received messages from different input streams. For the
output stream, we assume each produced message is immediately sent out, hence occupying no memory space.
Assume the earliest stream, into which the first message of a K-identified message set is injected, is stream 1; then all other streams have an inter-stream lag with regard to stream 1. Denote the lag of stream n as T1n. Typically, stream joining tasks buffer the first stream (i.e., stream 1) using some fixed memory space denoted by M.2 For the other streams, assuming the message size and message arrival rate are known, the corresponding memory space needed for each stream is determined by the latency lag. For instance, for stream n, the memory space needed is M + T1n · B, where B is a scaling factor based on the message arrival rate and message size.
For AIO, given n input streams, the total memory usage is nM + Σ(i=2..n) T1i · B. For SBS, the number of steps may be less than or equal to n − 1. We first take the extreme case where n − 1 steps are needed and each step joins only two streams. The memory space needed for each step is M + Tδ · B, where Tδ is the latency lag between that step's two input streams. If the joining sequence of input streams in SBS is optimized (i.e., the joining sequence aligns well with the inter-stream lags), the aggregated memory footprint can be minimized. We will discuss more on
2In other scenarios, a fixed number of messages or a fixed length of time of messages is buffered; the memory space used can be correspondingly determined.
Theorem-III
Let M be the base memory space of the earliest stream;
n be the total number of input streams;
Tij be the latency lag between streams i and j;
B be the scaling factor.
(a) For AIO, the memory footprint is nM + Σ(i=2..n) T1i · B;
(b) For SBS, the minimum total memory footprint is (n − 1)M + T1n · B.
Figure 7. Theorem-III: Memory footprint
this in Section VII. Adding all steps together, the minimum total memory usage for SBS is (n − 1)M + T1n · B, where T1n is the longest latency lag among all input streams.
If we ignore other memory usage of the stream joining containers and only consider the memory space used for buffering messages from the input streams, the memory footprints of AIO and SBS are as presented in Figure 7. Comparing the two, SBS has the smaller memory footprint.
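Theorem-III can be expressed directly as code. The values for M, B, and the lags below are illustrative assumptions:

```python
def aio_memory(m, lags, b):
    """AIO footprint: nM + sum over i=2..n of T1i * B, where lags holds
    T1i for streams 2..n (stream 1 itself has lag 0)."""
    n = len(lags) + 1
    return n * m + sum(t * b for t in lags)

def sbs_memory_min(m, lags, b):
    """Minimum SBS footprint with an optimized joining sequence:
    (n - 1)M + T1n * B, with T1n the largest inter-stream lag."""
    n = len(lags) + 1
    return (n - 1) * m + max(lags) * b

# Illustrative values: M = 100 MB, B = 2 MB/s, lags T12 = 10 s, T13 = 60 s.
m, b, lags = 100, 2, [10, 60]
assert aio_memory(m, lags, b) == 3 * 100 + (10 + 60) * 2      # 440 MB
assert sbs_memory_min(m, lags, b) == 2 * 100 + 60 * 2         # 320 MB
assert sbs_memory_min(m, lags, b) < aio_memory(m, lags, b)
```

SBS saves one base buffer M and pays only for the largest lag rather than the sum of all lags, which is why the gap widens as the inter-stream lags grow.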
G. CPU usage
AIO joins all input streams in one step, while SBS uses multiple steps, each joining only a subset of the input streams. The joining process needs to check all input streams for a particular K-identified message set, hence incurring computation-heavy processing. Because of this, the AIO container is expected to have higher CPU usage than any single SBS container.
When comparing the aggregated CPU usage of all steps
in SBS to that of AIO, things become more complicated.
Since each step needs to invoke a Samza container, and each container incurs certain house-keeping processing that does not directly contribute to the stream joining, the aggregated CPU usage of SBS may or may not exceed that of AIO.
If the house-keeping processing is heavy, then the multiple
containers in SBS may incur CPU overhead larger than that
of AIO. On the other hand, if the house-keeping processing is
light enough, then SBS could use less CPU resource than its
AIO counterpart.
H. Summary
We have compared the two multi-stream joining models
of AIO and SBS. In summary, AIO has the advantages of
being easier to deploy and administrate and having lower
E2E joining latency; while SBS has the advantages of having
smaller memory footprint and emitting intermediate streams.
One particular implication of the memory footprint size concerns how to scale to handle large volumes of data. If the memory footprint of AIO is too big to fit into one node, the only way to scale is by partitioning. SBS, on the other hand, can scale easily by deploying different steps on different nodes. Another implication of the memory footprint size concerns the JVM. For Java-based Samza joining,
Table I
JOINING LATENCY RESULTS

Scenario / Latency (ms)    AIO    SBS    Phase-I  Phase-II
Sync, 1 partition          1734   9845   4791     5054
Sync, 10 partitions        1573   8849   4478     4371
Async, 1 partition         2009   9811   4713     5096
Async, 10 partitions       1346   8081   3984     4097
Table II
MEMORY FOOTPRINTS (VALUES ARE IN MB UNLESS SPECIFIED IN "GB")

Scenario / Memory footprint   AIO    SBS    Phase-I  Phase-II
Sync, 1 partition             512    220    128      92
Sync, 10 partitions           9 GB   9 GB   4 GB     5 GB
Async, 1 partition            640    480    96       384
Async, 10 partitions          10 GB  9 GB   4 GB     5 GB
the larger the heap size, the longer the GC-caused pauses and the startup delay.
VI. EVALUATION
A. Experiment setup
We focus on a case with 3 input streams and a single output stream. The entire setup consists of the following components: (1) a set of 3 Kafka producers generating messages for each stream (or topic); (2) the Samza job(s); and (3) a single Kafka consumer that consumes the joined stream. The Kafka producers reside in a data center different from the Kafka clusters. Note that with this setup, the end-to-end joining latency is not negligible, since (1) it takes longer for the Kafka producers to produce messages to the Kafka cluster, and (2) it takes longer for a Samza job to fetch messages from the Kafka cluster, which increases each container's latency.
1) Number of partitions: The memory footprint of a Samza job is partly determined by the number of partitions in a Kafka topic. A Kafka topic can be composed of a single partition or multiple partitions; in multi-partition scenarios, Kafka producers can produce messages to any of the partitions, and Samza jobs create a separate task to handle each topic partition. For stream joining, a Samza job needs to buffer each message until it has received the messages with the same key from all the topics, hence the memory requirement is much higher. We study both the 1-partition and multi-partition scenarios.
2) Inter-stream lag: The multiple input streams may not be well time-aligned; instead, there can be inter-stream lags. The significance of the inter-stream lag is that it impacts the memory footprint of the Samza jobs: since Samza jobs need to buffer messages in order to join them, the higher the inter-stream lag, the more buffer space is used. Because of this, we study scenarios both with and without considerable inter-stream lags. We refer to the scenario with near-zero inter-stream lags as the "synchronized (or sync)" scenario, and the scenario with non-zero inter-stream lags as the "asynchronized (or async)" scenario.
In sync scenarios, Kafka producers publish the message
with the same key to all the topics around the same time.
Table III
CPU USAGE

Scenario / CPU usage (%)   AIO   SBS   Phase-I  Phase-II
Sync, 1 partition          84    28    14       14
Sync, 10 partitions        101   95    59       36
Async, 1 partition         86    66    53       13
Async, 10 partitions       133   29    15       14
But in reality, due to environment dynamics (e.g., networking latency), we still observe certain lags between messages with the same key in the respective Kafka topics. In async scenarios, we inject a fixed 1-minute delay between any two adjacent topics for messages with the same key.3 Such a test puts more memory pressure on the Samza jobs, as they must buffer messages with different arrival times.
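The async setup can be modeled as follows. This is a sketch of the timing only; the topic names and the helper `inject_lag` are hypothetical, and no actual Kafka calls are made:

```python
import datetime

def inject_lag(key, base_time, topics,
               lag=datetime.timedelta(minutes=1)):
    """Model of the async test setup: publish the message with the same
    key to each topic, delaying each subsequent topic by a fixed lag."""
    return {topic: (key, base_time + i * lag)
            for i, topic in enumerate(topics)}

t0 = datetime.datetime(2016, 1, 1, 12, 0, 0)
schedule = inject_lag("k1", t0, ["topicA", "topicB", "topicC"])
# topicC trails topicA by 2 minutes for the same key, matching the
# 3-stream case: the join must buffer k1 for the full 2-minute window.
delta = schedule["topicC"][1] - schedule["topicA"][1]
assert delta == datetime.timedelta(minutes=2)
```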
3) Kafka producers and consumer: The Kafka producer used in this work is the Kafka console producer, a command-line tool. For each generated message, the producer first determines which partition the message will be sent to; it then queries a Kafka broker for the topic partition metadata, from which it figures out which broker is the leader for that partition; finally, it sends the data to the partition's leader broker.
The Samza job instantiates a Kafka consumer, which issues fetch requests to the brokers corresponding to the partitions it wants to consume. Typically, the Samza job prefetches a certain amount of messages to avoid delay.
4) Samza jobs: Samza jobs use a single thread to handle
actions such as reading and writing messages, flushing metrics,
checkpointing, windowing, and so on. The Samza container
creates an associated task to process the messages of each
input stream, and it chooses messages from input stream
partitions in a round-robin way and passes the message to the
related task. If the Samza container uses stateful management,
each task creates one store to buffer messages. If the changelog
option is enabled, each task also sends the changelog stream
to Kafka for durability purposes. Each task can emit messages
to output stream topics, depending on the use case.
B. Results
We compare the performance of AIO and SBS. From the 6 metrics listed in Section V, we choose 3 important performance metrics (i.e., joining latency, memory footprint, and CPU usage), since the other three (i.e., deployment complexity, partial results, and stream joining throughput) are either straightforward or determined by the former three.
In addition, we also examined the Samza jobs’ JVM GC
pauses, as these values can shed light on the CPU usage
efficiencies and application pauses.
1) Joining latency: AIO only has a single joining-latency value, while SBS has two, as SBS performs its joining task in two phases (i.e., Phase-I and Phase-II) using two separate Samza jobs. The results are shown in Table I. We can see that
3For the 3-stream case, there is a 2-minute delay between the first and the third streams.
[Figure 8: Average and maximum Young GC pauses (ms) for AIO, SBS Phase I, and SBS Phase II in the sync scenario; (a) 1 partition, (b) 10 partitions.]
Figure 8. GC pauses in sync scenario

[Figure 9: Average and maximum Young GC pauses (ms) for AIO, SBS Phase I, and SBS Phase II in the async scenario; (a) 1 partition, (b) 10 partitions.]
Figure 9. GC pauses in async scenario
AIO results in much smaller joining latency than its SBS
counterpart. Depending on the scenario, the difference is
around 5X.
2) Memory footprint: We measure the memory footprint
of a Samza container based on the actual heap usage,
as that shows the aggregated live object size. The maximum
heap size is set to 16 GB in all cases, but during stable state the
heap commits only a portion of the specified 16 GB. AIO has
only one container to join the 3 input streams, so it has a
single memory footprint value, while SBS has two. For the 4
scenarios, the results are listed in Table II. We see a generally
lower memory footprint for AIO in almost all scenarios, except
the second scenario, where the values are on par.
3) CPU usage: For CPU usage measurement, we simply
take the machine-level CPU usage values reported by the "top"
utility [12]. As the machines we use are dedicated, identical
ones running only our workloads, we believe such a measure-
ment is a fair comparison. For all 4 scenarios, AIO sees much
higher CPU usage than SBS; the results are listed in Table III.
Note that some values are higher than 100%, since the machines
we use are multi-core machines and each core has a CPU
capacity of 100%.
4) JVM GC pauses: The JVM version we use is Oracle
HotSpot 8. We report Young GC pauses, as they happen more
frequently. For each scenario, we capture about 10 minutes of
GC stats during stable state and report both the average and
the maximum GC pause in Figures 8 and 9. We notice that
AIO mostly has higher Young GC pauses, due to its larger
heap size. Depending on the use case, these GC pauses may
or may not impact application performance. For applications
that care more about message throughput, individual GC pauses
are less important. On the other hand, for applications that care
about the joining latency of each message, keeping GC pauses
low is critical.
C. Summary of results
Our experiments evaluate a set of performance metrics (i.e.,
joining latency, memory footprint, CPU usage and JVM GC
pauses) for both AIO and SBS. Given the vast number of
variations in stream-joining systems (e.g., number of input
streams, inter-stream lags, message size), the data is far from
comprehensive with regard to a thorough analysis and detailed
characterization of the two models. However, the results
clearly indicate the performance tradeoff inherent in the two
models; users should carefully choose the most appropriate
model for their particular use cases.
VII. DISCUSSIONS: JOINING SEQUENCE OF SBS
For SBS, the joining sequence of multiple input streams
can significantly impact the performance of the entire work-
flow. Specifically, with regard to the intermediate streams,
if one of the input streams of a particular step is blocked,
then no intermediate stream will be emitted from that step.
For example, for the 3-stream joining illustration
Figure 10. Using SBS to join 3 input streams: (a) SBS with 3 input streams; (b) an optimal joining sequence (aggregated memory footprint is 2M + T13B); (c) a suboptimal joining sequence (memory footprints are M + T13B and M + T32B, or T32B larger than the optimal scenario)
in Figure 3(b), if Stream-B is blocked for some reason,
then intermediate stream D will not be produced and all later
joining steps are also blocked.
As another example, if the input streams have different inter-
stream lags, then the joining sequence will greatly impact the
resulting memory footprint. As shown in Figure 10(a), assume
an SBS scenario where 3 input streams need to be joined.
Denote stream-1 as the earliest stream, and let stream-2 and
stream-3 have latency lags of T12 and T13, respectively. Figure
10(b) illustrates an optimal joining sequence where stream-1
and stream-2 are joined first, and the resulting intermediate
stream-4 is joined with stream-3 in the second step. The
memory footprint of the first step is M + T12B, as directed
by Theorem-III. Correspondingly, the memory footprint of the
second step is M + T23B, where T23 (or T32) is the latency
lag between stream-2 and stream-3. Since T12 + T23 = T13, the
aggregated memory footprint of Figure 10(b) is 2M + T13B.
If the joining sequence is different, as shown in Figure
10(c), then the memory footprints of the two SBS steps will be
M + T13B and M + T32B, respectively. Compared to the optimal
joining sequence, the increased memory usage is T32B.
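The footprint arithmetic above can be checked with a short calculation. This is an illustrative sketch using the paper's notation (M for the per-step base footprint, B for the per-unit-lag buffer cost, T_ij for the lag between stream-i and stream-j); the numeric values are hypothetical.

```python
def sbs_footprint(lags_between_steps, M, B):
    """Each SBS step contributes M plus B times the lag it must bridge."""
    return sum(M + B * lag for lag in lags_between_steps)

M, B = 100, 2                 # hypothetical base footprint and buffer cost
T12, T23 = 10, 5              # hypothetical lags; T13 = T12 + T23
T13, T32 = T12 + T23, T23

optimal    = sbs_footprint([T12, T23], M, B)   # join stream-1+2, then with stream-3
suboptimal = sbs_footprint([T13, T32], M, B)   # join stream-1+3, then with stream-2

assert optimal == 2 * M + B * T13              # aggregated footprint 2M + T13B
assert suboptimal - optimal == B * T32         # extra cost of the bad order: T32B
```

The identity T12 + T23 = T13 is what makes the optimal total collapse to 2M + T13B regardless of the individual lags.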
Considering the above two types of examples, we propose
two approaches for determining the joining sequence of
SBS: (1) memory-oriented, which aims at minimizing
the aggregated memory footprint; and (2) reliability-oriented,
which aims at providing as many intermediate streams as
possible.
Memory-oriented SBS For this approach, all input streams
are sorted by their latency lags with respect to the earliest
input stream. The SBS steps then simply join them
according to this ranking. By doing so, the aggregated memory
footprint is minimized, as directed by Theorem-III.
Reliability-oriented SBS This approach aims at providing
partial results to users. All input streams are ranked by
how likely their messages are to be blocked, with the most
reliable input streams coming first. By joining input streams
in order of reliability, the chance of intermediate streams being
blocked is minimized.
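Both approaches reduce to sorting the input streams by a different key before fixing the join order. The sketch below is illustrative only; the lag and blocking-probability values are hypothetical, not measurements from the paper.

```python
def memory_oriented_order(streams, lag):
    """Sort input streams by latency lag w.r.t. the earliest stream,
    so the aggregated footprint is minimized (per Theorem-III)."""
    return sorted(streams, key=lag)

def reliability_oriented_order(streams, block_probability):
    """Sort input streams by how likely their messages are blocked,
    most reliable first, to maximize available partial results."""
    return sorted(streams, key=block_probability)

# hypothetical lags (minutes) and per-stream blocking probabilities
lags = {"stream-1": 0, "stream-2": 10, "stream-3": 15}
p_block = {"stream-1": 0.05, "stream-2": 0.01, "stream-3": 0.20}

mem_order = memory_oriented_order(lags, lags.get)
rel_order = reliability_oriented_order(p_block, p_block.get)
```

Note that the two orderings can disagree, as they do for these sample values, so an operator must pick one objective (memory or partial-result availability) per pipeline.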
VIII. RELATED WORK
A. Streaming systems
There are many systems designed for stream processing
available in industry, each with different attributes to serve
stream processing needs. For instance, S4 [13] is a distributed
stream processing engine originally developed at Yahoo.
Apache Storm [2], [4] is another popular stream processing
system, originating from Twitter. Storm utilizes Apache Mesos
[14] to schedule processing jobs and provides two message
delivery semantics (at least once, at most once). MillWheel
[15] is an in-house solution developed by Google to serve its
stream processing purposes. Apache Spark [3], a streaming
framework used by Yahoo and Baidu, treats streaming as a
series of deterministic batch operations, such that developers
can use the same algorithms in both streaming and batch
modes. Compared to Spark, which is a batch processing
framework that can approximate stream processing, Flink [16]
is primarily a stream processing framework that can look like
a batch processor.
B. Capacity model and capacity planning
There are many works in the general area of capacity
planning. The work [17] presents a four-phased model for
computer capacity planning (CCP); the model can be used
to develop a formal capacity planning function. Our work
[18] considers the practical problem of capacity planning for
database replication at a large-scale website, and presents
a model to forecast future traffic and determine the required
software capacity. Our recent work [19] presents a memory
capacity model for data-filtering applications with Samza.
C. Java heap sizing
To size the Java heap appropriately, various works strike
a balance among different tradeoffs. The work [20] analyzes
memory usage on the Java heap through object tracing, based
on the observation that an inappropriate heap size can lead to
performance problems caused by excessive GC overhead or
memory paging; it then presents an automatic heap-sizing
algorithm applicable to different garbage collectors with only
minor changes. The paper [21] presents a heap sizing rule for
dealing with different memory sizes with regard to heap size.
Specifically, it models how severe page faults can be using a
Heap-Aware Page Fault Equation.
D. Data quality
Veracity [22], which refers to the consistency and accuracy
of massive data, has been increasingly recognized as one
of the key attributes of Big Data. Poor data quality [8] impacts
enterprise companies, leading to customer dissatisfaction,
increased cost, bad decision-making, etc. There are many
works on developing systems to improve data quality. The
work [23] presents a system based on a conditional functional
dependencies (CFDs) model for improving the quality of
relational data.
IX. CONCLUSION
This work studies multi-stream joining models using the
Samza framework. We characterize the two models, SBS
and AIO, and analyze their performance tradeoffs. We also set
up a Samza testbed using Kafka streams to verify our theoretical
analysis. The results can be used to design and implement
systems that join multiple streams more efficiently.
REFERENCES
[1] “Apache samza,” http://samza.apache.org/.
[2] “Apache storm,” http://storm.apache.org/.
[3] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley,
M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets:
A fault-tolerant abstraction for in-memory cluster computing,” in Pro-
ceedings of the 9th USENIX Conference on Networked Systems Design
and Implementation, ser. NSDI’12, San Jose, CA, 2012.
[4] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulka-
rni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and
D. Ryaboy, “Storm@twitter,” in Proceedings of the 2014 ACM SIGMOD
International Conference on Management of Data, ser. SIGMOD ’14,
Snowbird, Utah, USA, 2014.
[5] “Apache samza: Linkedin’s real-time stream processing framework,”
https://engineering.linkedin.com/data-streams/apache-samza-linkedins-
real-time-stream-processing-framework.
[6] “Apache kafka,” http://kafka.apache.org/.
[7] L. Qiao, K. Surlaker, S. Das, and et. al, “On brewing fresh espresso:
Linkedin’s distributed data serving platform,” in Proceedings of the 2013
ACM SIGMOD International Conference on Management of Data, ser.
SIGMOD ’13, New York, New York, USA, 2013.
[8] T. C. Redman, “The impact of poor data quality on the typical enter-
prise,” Commun. ACM, vol. 41, no. 2, pp. 79–82, Feb. 1998.
[9] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar,
R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino,
O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache
hadoop yarn: Yet another resource negotiator,” in Proceedings of the 4th
Annual Symposium on Cloud Computing, ser. SOCC ’13, Santa Clara,
California, 2013.
[10] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop dis-
tributed file system,” in Proceedings of the 2010 IEEE 26th Symposium
on Mass Storage Systems and Technologies (MSST), ser. MSST ’10,
Washington, DC, USA, 2010.
[11] “Benchmarking apache samza: 1.2 million messages per second on a
single host,” http://engineering.linkedin.com/performance/benchmarking-
apache-samza-12-million-messages-second-single-node.
[12] M. G. Sobell, A Practical Guide to Linux Commands, Editors, and Shell
Programming. Upper Saddle River, NJ, USA: Prentice Hall PTR,
2005.
[13] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed
stream computing platform,” in Proceedings of the 2010 IEEE Inter-
national Conference on Data Mining Workshops, ser. ICDMW ’10,
Washington, DC, USA, 2010.
[14] “Apache mesos,” http://mesos.apache.org/.
[15] T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax,
S. McVeety, D. Mills, P. Nordstrom, and S. Whittle, “Millwheel: Fault-
tolerant stream processing at internet scale,” Proc. VLDB Endow., vol. 6,
no. 11, pp. 1033–1044, Aug. 2013.
[16] “Apache flink,” https://flink.apache.org/.
[17] I. L. Carper, S. Harvey, and J. C. Wetherbe, “Computer capacity
planning: Strategy and methodologies,” SIGMIS Database, vol. 14, no. 4,
pp. 3–13, Jul. 1983.
[18] Z. Zhuang, H. Ramachandra, C. Tran, S. Subramaniam, C. Botev,
C. Xiong, and B. Sridharan, “Capacity planning and headroom analysis
for taming database replication latency: Experiences with linkedin
internet traffic,” in Proceedings of the 6th ACM/SPEC International
Conference on Performance Engineering, ser. ICPE ’15, Austin, Texas,
USA, 2015, pp. 39–50.
[19] T. Feng, Z. Zhuang, Y. Pan, and H. Ramachandra, “A memory capacity
model for high performing data-filtering applications in samza frame-
work,” in Proceedings of the 2015 IEEE International Conference on
Big Data - Workshop on Data Quality Issues, Santa Clara, CA, USA,
2015, pp. 2600–2605.
[20] P. Lengauer, V. Bitto, and H. Mössenböck, “Accurate and efficient object
tracing for java applications,” in Proceedings of the 6th ACM/SPEC
International Conference on Performance Engineering, ser. ICPE ’15,
Austin, Texas, USA, 2015.
[21] Y. C. Tay, X. Zong, and X. He, “An equation-based heap sizing rule,”
Perform. Eval., vol. 70, no. 11, pp. 948–964, Nov. 2013.
[22] B. Saha and D. Srivastava, “Data quality: The other face of big
data,” in Proceedings of IEEE 30th International Conference on Data
Engineering, ser. ICDE ’14, Chicago, IL, USA, 2014.
[23] W. Fan, F. Geerts, and X. Jia, “Semandaq: A data quality system based
on conditional functional dependencies,” Proc. VLDB Endow., vol. 1,
no. 2, pp. 1460–1463, Aug. 2008.

Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Recently uploaded (20)

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

Effective Multi-stream Joining in Apache Samza Framework

  • 1. Effective Multi-stream Joining for Enhancing Data Quality in Apache Samza Framework
Tao Feng, Zhenyun Zhuang, Haricharan Ramachandra
{tofeng, zzhuang, hramachandra}@linkedin.com
LinkedIn Corporation, 2029 Stierlin Court, Mountain View, CA 94043, United States

Abstract—The increasing adoption of Big Data in business environments has driven the need for realtime stream joining. Multi-stream joining is an important type of stream processing in today's Internet companies, and it is used to generate higher-quality data in business pipelines. Multi-stream joining can be performed in two models: (1) All-In-One (AIO) joining and (2) Step-By-Step (SBS) joining. Both models have advantages and disadvantages with regard to memory footprint, joining latency, deployment complexity, etc. In this work, we analyze the performance tradeoffs associated with these two models using Apache Samza and share our findings.

Index Terms—Multi-stream joining; Samza; Stream processing; Data quality

I. INTRODUCTION

Data quality, a long-standing critical issue, faces new challenges in the Big Data era. In particular, the large data volumes and realtime requirements at many Internet companies have necessitated effective stream processing to drive their critical business data flows.

To deal with the streaming requirements of today's Big Data processing, stream processing frameworks such as Apache Samza [1], Apache Storm [2] and Spark [3] are being increasingly adopted by Internet companies such as LinkedIn and Twitter [4]. Apache Samza, with its advantage of efficiently handling large amounts of data, has been heavily used by companies like LinkedIn for many years [5]. LinkedIn's largest Samza job processes multiple millions of messages per second during peak traffic hours.

The wide deployment of Big Data solutions demands business-value augmentation in various areas.
Stream joining, which joins multiple data streams for various purposes including increased data quality, is an important type of stream processing in today's Internet companies, particularly those tracking multiple types of online user activities. This type of stream processing essentially aggregates "nearly aligned" streams: it expects to receive related events from multiple input streams and eventually combines them into a single output event stream. The simplest form of stream joining joins two streams, for instance joining a stream of ad clicks to a stream of ad impressions, so that the company can link the information on when an ad was shown to the information on when it was clicked.

As business logic becomes more complicated and online data types multiply, there is a need to join multiple streams (i.e., three or more) in a realtime or near-realtime fashion. As an example, at LinkedIn, our data flow joins multiple streams to drive the site-speed dashboard (e.g., page display latency in different countries). Each data stream corresponds to the users from a single area (e.g., a country), and multiple streams are joined to form an aggregated view that allows filtering and aggregation on the dashboard. As another example, different types of user activities (e.g., ad clicks, profile views, group discussions) can be joined in realtime to produce more detailed views of user activity on the fly.

To join multiple streams, two models of solutions can in general be used: (1) All-In-One (AIO) joining, which joins all incoming streams in the same processing unit (e.g., software instance); and (2) Step-By-Step (SBS) joining, which joins incoming streams in different processing units.

Both AIO and SBS have advantages and disadvantages. AIO only needs a single processing job, which enables simpler processing logic and easier deployment. It also has lower end-to-end latency when emitting the final output stream.
On the other hand, AIO also has several disadvantages compared to SBS. First, if the incoming streams are far apart in arrival time, AIO's memory footprint can be larger due to the longer buffering time required to join all incoming streams. With a larger memory footprint (e.g., Java heap), AIO may also incur related problems such as longer GC-caused (Garbage Collection) application pauses and longer startup delay. In addition, if any incoming stream is blocked, AIO becomes stuck and unable to produce any useful data. SBS, on the contrary, can still produce useful data (in intermediate streams) in the steps before the blocked stream.

Motivated by the importance of multi-stream joining, we feel it necessary to study the performance tradeoffs of different processing models. In this work, we consider both models of multi-stream joining, using Samza as the streaming framework. The findings, however, are not limited to Samza, since many of them also apply to other streaming frameworks.

For the remainder of the paper, after providing the necessary technical background in Section II, we give two motivating examples demonstrating how Samza-based stream processing can help improve data quality in Section III. We then elaborate on the two multi-stream processing models in Section IV. The major part of this work is the analysis of the performance tradeoffs of the two models, presented in Section V. We set up a real Samza test bed to verify our analysis in Section VI. We also discuss the joining sequences of SBS in Section VII. Section VIII covers related work, and finally Section IX concludes the work.
  • 2. The contributions of this work are:
• We characterize the approaches to performing multi-stream joining into two models: AIO and SBS;
• We compare the performance tradeoffs of the two models with regard to various performance metrics, including joining latency and memory footprint;
• We use the Samza framework to evaluate the two models. The results can be used for designing appropriate multi-stream joining systems.

II. BACKGROUND

A. Data streams and stream processing frameworks

Given the growing size of big data, data streaming frameworks such as Apache Kafka [6] have been designed and deployed to handle data streams in realtime or near-realtime fashion. To handle increasingly complicated business logic, various stream processing frameworks have been proposed. Many of today's applications use stream processing frameworks to filter and aggregate data streams. One example usage of such a framework is to increase the quality of data via filtering. Such data-filtering applications are heavily deployed to filter streaming data triggered by the online activities of hundreds of millions of users. For instance, user activities coming from user-facing pages may contain inconsistent data; hence the data need to be filtered to remove the inconsistency before being persisted to backend storage such as Espresso [7].

B. Multi-stream joining

When multiple streams need to be correlated with each other, multi-stream joining becomes necessary. In such scenarios, the multiple input streams share some common features such as user IDs. By joining them and producing another output stream, various benefits can be obtained.

Multi-stream joining can be used to improve data quality. Data consistency and accuracy of massive data has been increasingly recognized as one of the key attributes of Big Data.
Poor data quality [8] negatively impacts enterprises, leading to customer dissatisfaction, increased cost, bad decision-making, etc. In some scenarios, low-quality data streams can be improved by aggregating them with other data streams. For instance, a data stream may only have user click information in the tuple (cookie id, actions), which by itself is low-quality data. Cookie ids need to be replaced by the corresponding user IDs before these activities can be persisted to storage as high-quality data for easier retrieval later. For this use case, another data stream needs to be utilized to enhance the data quality.

To perform stream joining, an application typically needs to buffer events over the time window it wants to join; this is exactly the mechanism used by many stream processing frameworks such as Samza.

C. Apache Samza

Stream processing is high-throughput data processing applied to massive data sets with near-realtime SLA guarantees.

[Figure 1. Samza-based stream filtering for data quality enhancement: an input stream of mixed quality passes through Samza, which applies rules to emit a high-quality and a low-quality output stream.]

[Figure 2. Samza-based stream joining for data quality enhancement: two low-quality input streams are joined by Samza to emit a high-quality output stream.]

Apache Samza is a lightweight stream processing framework originally created to solve continuous data processing at LinkedIn and later open-sourced. Samza has a set of desirable features such as a simple API, fault tolerance, high durability, managed state, etc. Samza integrates with YARN [9] and Kafka [6]. The most basic processing element of Samza is a stream, which is composed of immutable messages of a similar type or category.
A stream can be the data from a messaging system such as Kafka, a database table, or even Hadoop HDFS [10] files. A stream can be further split into a set of partitions using keys. Within each partition, all messages are strictly ordered. Samza currently ensures that separate applications live in separate containers (processes). Samza jobs process the incoming streams: a Samza job is a set of code that performs a logical transformation on a set of input streams to append messages to a set of output streams.

To join incoming streams, a Samza job opens a separate buffer for each incoming stream. The job scans all buffered messages and merges aligned messages into the output stream. Samza uses a single thread internally to handle reading and writing messages, flushing metrics, checkpointing, windowing, and so on. Samza can scale by creating multiple instances if a single instance is not enough to sustain the traffic. A Samza container creates an associated task to process the messages of each input stream topic partition. The container chooses a message from the input stream partitions in a round-robin way and passes the message to the related task. If the container uses stateful management, each task creates one partition in the store to buffer messages. If the changelog option is enabled, each task sends the changelog stream to Kafka for durability purposes. Each task can emit messages to output stream topics, depending on the use case [11].
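The buffer-and-merge behavior described above can be illustrated independently of the Samza API. Below is a minimal Python sketch (not Samza code; the stream names, keys, and payloads are invented for illustration) of buffering keyed messages per stream and emitting a joined record once every input stream has delivered a message for that key:

```python
from collections import defaultdict

def aio_join(events, num_streams):
    """All-in-one joining: buffer messages per key and emit a joined record
    once the key has arrived on every input stream."""
    buffers = defaultdict(dict)   # key -> {stream_id: payload}
    joined = []
    for stream_id, key, payload in events:    # events in interleaved arrival order
        buffers[key][stream_id] = payload
        if len(buffers[key]) == num_streams:  # all streams present for this key
            joined.append((key, buffers.pop(key)))
    return joined

events = [
    ("A", "u1", "click"), ("B", "u1", "impression"),
    ("A", "u2", "click"), ("C", "u1", "profile_view"),
]
print(aio_join(events, 3))
# -> [('u1', {'A': 'click', 'B': 'impression', 'C': 'profile_view'})]
# 'u2' stays buffered until its messages from streams B and C arrive.
```

Real Samza tasks additionally persist such buffers in managed state (with an optional Kafka changelog) so the join survives container restarts.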
  • 3. III. ENHANCING DATA QUALITY USING SAMZA

Samza can be used to enhance data quality in streaming scenarios. Depending on the specific data set, Samza can augment the quality of a single data stream, or it can join multiple low-quality data streams to produce a high-quality data stream. At LinkedIn, the Samza framework is heavily used for application and system monitoring, data filtering, user-behavior tracking, etc. We have two different usage scenarios for data filtering with Samza.

A. Samza-based Stream Filtering

In the first usage scenario, bad-quality data are filtered out (i.e., removed) based on a set of rules and other data streams. For instance, if some streamed data items do not contain valid values (e.g., wireless network bandwidth measurements for a particular user), they are simply discarded under the guidance of certain rules. The process is shown in Figure 1, where an input stream of mixed data quality (both low and high quality) is filtered to emit a high-quality stream. A set of rules guides the data filtering, and based on the rules low-quality data are discarded.

B. Samza-based Stream Joining

The other data filtering scenario enhances low-quality data streams by aggregating them with other data streams. For instance, a data stream may only have user click information in the tuple (cookie id, actions), which by itself is low-quality data. Cookie ids need to be replaced by the corresponding user IDs before these activities can be persisted to storage as high-quality data for easier retrieval later. For this use case, another data stream needs to be utilized to enhance the data quality. The process is shown in Figure 2, where two low-quality streams are aggregated (i.e., joined) to emit a high-quality data stream.

IV. MULTI-STREAM JOINING MODELS

When joining multiple (i.e., three or more) streams, the joining can be performed in different ways.
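Before detailing the joining models, the rule-based filtering of Section III-A (Figure 1) can be sketched as a simple split of a mixed-quality stream. This is an illustrative Python sketch, with hypothetical rules (bandwidth measurements and user IDs are made-up field names, not from the paper's production pipeline):

```python
def filter_stream(messages, rules):
    """Split a mixed-quality stream into high- and low-quality outputs:
    a message is high quality only if it passes every rule."""
    high, low = [], []
    for msg in messages:
        (high if all(rule(msg) for rule in rules) else low).append(msg)
    return high, low

# Hypothetical rules: a bandwidth measurement must be present and positive,
# and the event must be attributable to a user.
rules = [
    lambda m: m.get("bandwidth_kbps", 0) > 0,
    lambda m: m.get("user_id") is not None,
]
msgs = [
    {"user_id": 1, "bandwidth_kbps": 820},  # valid -> high-quality output
    {"user_id": 2, "bandwidth_kbps": 0},    # invalid measurement -> low
    {"bandwidth_kbps": 500},                # no user id -> low
]
high, low = filter_stream(msgs, rules)
```

In a Samza deployment, the rules would be evaluated inside the task's process loop and each output list would correspond to a separate output topic.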
Overall, we characterize the different joining fashions into two models: AIO (All-In-One) and SBS (Step-By-Step). In this section, we illustrate how these two models of multi-stream joining work. To simplify the presentation, we use only three input streams; it is straightforward to apply the two models to scenarios with more than three input streams.

A. All-In-One (AIO)

AIO multi-stream joining executes the task in a single container. Samza allocates an individual buffer for each incoming stream so that a fixed number of messages from each stream can be fetched before performing the join. As shown in Figure 3(a), three input streams A, B and C all feed into the same container, indicated by "Samza Job". After joining, the resulting output stream E is produced. As the message arrival times for different input streams may vary, a typical implementation of AIO uses multiple memory buffers to hold the input streams, one per stream.

[Figure 3. Two models of multi-stream joining: (a) AIO, where streams A, B and C feed a single Samza job that emits stream E; (b) SBS, where Samza Job I joins A and B into an intermediate stream D, and Samza Job II joins D and C into stream E.]

After buffering some messages from all input streams, AIO may scan all input streams to identify messages that can be joined. A message can be joined only if all corresponding messages (e.g., identified by a unique ID) are present in the message buffers.

B. Step-By-Step (SBS)

In contrast to AIO, SBS multi-stream joining uses more than one container to perform the task. Depending on the number of input streams and how many streams each container is designed to join, SBS involves multiple cascading steps. In Figure 3(b) we show a simple case where three input streams are processed in two steps, and each step processes only two input streams. The first step, Samza Job I, joins streams A and B and outputs an intermediate stream D. The intermediate stream D is fed into the second step, Samza Job II, and serves as one of its input streams.
Samza Job II then joins D and the remaining input stream C to produce the final output stream E. Since SBS joins only a subset of the input streams at each step, it only needs to buffer the messages from that subset. Like an AIO implementation, an SBS implementation also allocates message buffers for each input stream; the difference is that each SBS step allocates fewer memory buffers (due to the smaller number of input streams), scans fewer messages, and incurs less processing latency per step.

Note that when joining four or more input streams, there can be multiple ways of doing SBS. For instance, if each joining step takes only two input streams, then three steps are needed in total; if each step can join three input streams, then only two steps are needed. For simplicity, in the following presentation we assume each step joins only two input streams unless otherwise specified.¹

V. PERFORMANCE COMPARISON OF AIO AND SBS

Both AIO and SBS can achieve the same goal of multi-stream joining; however, they exhibit very different characteristics and associated performance tradeoffs. Now we compare these two models and analyze the differences.

¹The analysis done in this work can also easily be applied to other SBS types where more than two input streams are joined in a step.

  • 4. Theorem-I
(a) For SBS, let S be the number of joining steps; then S − 1 intermediate output streams will be produced.
(b) If the first blocked input stream is joined in step n, then n − 1 intermediate streams can be produced.
Figure 4. Theorem-I: Partial joining results

A. Comparing AIO and SBS

AIO and SBS differ in many aspects due to their design differences. Based on our production experience with application development, deployment, administration, debugging and optimization, we choose the following six performance metrics for comparison: (1) development, deployment and administration complexity; (2) partial joining results; (3) joining latency; (4) stream joining throughput; (5) memory footprint; and (6) CPU usage.

B. Development, deployment and administration complexity

Since AIO performs the stream joining task in one step, only a single container (or instance) needs to be deployed, which is easier to deploy and administrate. By contrast, SBS needs to deploy multiple containers and hence incurs more deployment and administration complexity.

Development-wise, however, AIO and SBS may see similar complexity. The logic of AIO handling is relatively straightforward, as all input streams can be treated similarly except for the particular joining fields. Depending on the specific design, SBS development can also be straightforward. For example, if each step is limited to two input streams, then all steps can share the same code base, leaving the stream selection to configuration and deployment.

C. Partial joining results

Irrespective of the number of input streams, AIO only produces a single output stream, a service guarantee of all or nothing. In other words, if for any reason (e.g., one input stream is blocked) the joining is stuck, no new joined results will be appended to the output stream.
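The SBS cascade of Figure 3(b), and the partial-result property captured in Theorem-I, can be sketched as a pair of two-way joins. This is a minimal Python illustration (the keyed records are invented; real SBS steps would be separate Samza jobs connected by a Kafka topic):

```python
def pairwise_join(left, right):
    """One SBS step: inner-join two streams of (key, record-dict) pairs,
    merging the records of matching keys."""
    idx = {key: rec for key, rec in left}
    return [(key, {**idx[key], **rec}) for key, rec in right if key in idx]

# Hypothetical keyed input streams A, B, C sharing user-ID keys.
a = [("u1", {"click": 1}), ("u2", {"click": 2})]
b = [("u1", {"impression": 10})]
c = [("u1", {"view": 7}), ("u2", {"view": 8})]

d = pairwise_join(a, b)   # Samza Job I:  A join B -> intermediate stream D
e = pairwise_join(d, c)   # Samza Job II: D join C -> final stream E
# If stream C were blocked, D = [('u1', {'click': 1, 'impression': 10})]
# would already be a useful partial result (Theorem-I(b)).
```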
On the other hand, SBS is able to produce partial results even if some input streams are blocked. Take the example scenario shown in Figure 3(b): if stream C is blocked, the intermediate output stream D is still a valid stream. Depending on the application, the partially joined intermediate streams can be useful. For example, if the multiple input streams correspond to site-speed information for different countries, then any partial results can still drive some business logic. For SBS, the relationship between the number of joining steps and the number of intermediate streams is expressed in Theorem-I, displayed in Figure 4.

Theorem-II
Let T0 be the arrival time difference across all input streams, and let TA and TB be the processing latency incurred by each individual container for AIO and SBS, respectively.
(a) For AIO, the E2E joining latency is T0 + TA;
(b) For SBS, the E2E joining latency is T0 + nTB, where n is the number of joining steps.
Figure 5. Theorem-II: E2E joining latency

D. Joining latency

When performing stream joining, the end-to-end (E2E) latency is defined as follows. For a message with an attribute K (e.g., user ID) on which the stream joining is performed, we refer to all messages that share K as the K-identified message set. When joining n streams, there are at most n messages for any unique K. For a K-identified message set, the E2E latency is the difference between the time when the first message is injected into any input stream and the time when the joining is done for K.

E2E latency consists of two major parts: the inherent inter-stream lag and the E2E joining latency. The first part, the inter-stream lag, is the arrival time difference among all input streams, denoted by T0. For example, the messages of a K-identified message set may be injected into the input streams at different times. This part of the latency cannot be removed by any joining, hence it is the same irrespective of which processing model (i.e., SBS or AIO) is chosen.
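Theorem-II above can be checked numerically. The following Python sketch uses illustrative numbers only (not measured values from this paper) to show how SBS pays the per-container delay once per step:

```python
def e2e_latency_aio(t0, ta):
    """Theorem-II(a): inter-stream lag plus a single container's delay."""
    return t0 + ta

def e2e_latency_sbs(t0, tb, steps):
    """Theorem-II(b): the per-container delay TB is paid once per step."""
    return t0 + steps * tb

# Illustrative numbers (milliseconds): inter-stream lag T0, container
# delays TA (AIO) and TB (single SBS step), with TA >= TB as is typical.
t0, ta, tb = 100.0, 30.0, 20.0
aio = e2e_latency_aio(t0, ta)     # 130.0
sbs = e2e_latency_sbs(t0, tb, 2)  # 140.0
# Even with TB <= TA, SBS accumulates n*TB and ends up slower end to end.
```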
The E2E joining latency is introduced by the stream processing work itself, and it is an important performance metric to consider. Its value should be kept as low as possible for any latency-sensitive stream joining task. Stream joining inevitably introduces processing delays, mainly due to the message buffering needed to align the arrival differences of the input streams. The more steps involved, the higher the added processing delay. As such, AIO is expected to have lower E2E joining latency than SBS, as the latter incurs multiple processing delays while the former incurs only one.

Let TA denote the processing delay introduced by the Samza container in AIO, and assume the processing delay introduced by each Samza container in SBS is the same, denoted by TB. We typically have TA ≥ TB; however, since SBS adds multiple TB terms, its E2E joining latency is typically larger than that of AIO. The E2E joining latency of AIO is T0 + TA, while that of SBS is T0 + nTB, as presented in Figure 5.

E. Stream joining throughput

The throughput of the two models is defined as the number of K-identified message sets that are successfully joined per time unit. For both models, the throughput is determined by the capacity of the deployment, which is in turn dictated by the most constraining resource capacity and the resource
  • 5. [Figure 6. Memory footprint of stream joining: stream 1 uses a base buffer of size M, while each later stream n, lagging stream 1 by T1n, needs a buffer of size M + T1n·B.]

efficiency of each model. If, for example, CPU is the most constraining performance bottleneck of a stream joining deployment (i.e., as seen from the Linux "top" command), then the resource efficiency (i.e., as measured by CPU usage per joined message) and the available resource capacity together determine the throughput. As another example, if the memory footprint is the top performance bottleneck, then for a machine with a fixed memory size, the per-message memory usage of the deployment determines the throughput of the stream joining system. The value of this performance metric thus depends on other performance metrics such as CPU usage and memory footprint; hence we defer further discussion to later sections.

F. Memory usage

Both AIO and SBS need to allocate memory space to buffer the messages received from the different input streams. For the output stream, we assume each produced message is immediately sent out, hence occupying no memory space. Assume the earliest stream, i.e., the one into which the first message from a K-identified message set is injected, is stream 1; then all other streams have an inter-stream lag with regard to stream 1. Denote the lag of stream n as T1n. Typically, stream joining tasks buffer the first stream (i.e., stream 1) using some fixed memory space denoted by M.² For the other streams, assuming the message size and message arrival rate are known, the memory space needed for each stream can be determined from the latency lag. For instance, for stream n, the memory space needed is M + T1n·B, where B is a scaling factor based on the message arrival rate and message size.

For AIO, given n input streams, the total memory usage is nM + Σ(i=2..n) T1i·B. For SBS, the number of steps may be less than or equal to n − 1.
We first consider an extreme case where n − 1 steps are needed, each step joining only two streams. The memory space needed for each step is M + Tδ·B, where Tδ is the latency lag between the two input streams of that step. If the joining sequence of input streams in SBS is optimized (i.e., the joining sequence aligns well with the inter-stream lags), the aggregated memory footprint can be minimized; we discuss this further in Section VII. Adding all steps together, the minimum total memory usage for SBS is (n − 1)M + T1n·B, where T1n is the longest latency lag among all input streams.

²In other scenarios, a fixed number of messages or a fixed length of time of messages is buffered; the memory space used can be determined correspondingly.

Theorem-III
Let M be the base memory space of the earliest stream, n the total number of input streams, Tij the latency lag between streams i and j, and B the scaling factor.
(a) For AIO, the memory footprint is nM + Σ(i=2..n) T1i·B;
(b) For SBS, the minimum total memory footprint is (n − 1)M + T1n·B.
Figure 7. Theorem-III: Memory footprint

If we ignore the other memory usage of the stream joining containers and only consider the memory space used for buffering messages from the input streams, the memory footprints of both AIO and SBS are as presented in Figure 7. Comparing the two, SBS has the smaller memory footprint.

G. CPU usage

AIO joins all input streams in one step, while SBS uses multiple steps, each joining only a subset of the input streams. The joining process needs to check all input streams for a particular K-identified message set, hence incurring computation-heavy processing. Because of this, the AIO container is expected to have higher CPU usage than any single SBS container. When comparing the aggregated CPU usage of all SBS steps to that of AIO, things become more complicated.
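The two footprints of Theorem-III can be evaluated side by side. The following Python sketch uses made-up values for M, B and the lags (the units are illustrative, not measurements from this paper):

```python
def mem_aio(m, lags, b):
    """Theorem-III(a): n*M plus a lag-proportional buffer for streams 2..n."""
    n = len(lags) + 1   # lags = [T_12, ..., T_1n] relative to stream 1
    return n * m + sum(t * b for t in lags)

def mem_sbs_min(m, lags, b):
    """Theorem-III(b): with an optimized joining sequence, only the longest
    lag T_1n is buffered, giving (n-1)*M + T_1n*B."""
    n = len(lags) + 1
    return (n - 1) * m + max(lags) * b

# Illustrative values: base buffer M in MB, scaling factor B in MB/s,
# inter-stream lags in seconds (streams sorted by lag).
m, b = 64.0, 0.5
lags = [10.0, 40.0]           # T_12, T_13 for n = 3 streams
aio = mem_aio(m, lags, b)     # 3*64 + (10+40)*0.5 = 217.0
sbs = mem_sbs_min(m, lags, b) # 2*64 + 40*0.5     = 148.0
```

The gap widens as the number of streams and the lag spread grow, which is why SBS scales more gracefully when memory is the bottleneck.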
Since each step needs to invoke a Samza container, and each container incurs certain house-keeping processing that does not directly contribute to the stream joining, the aggregated CPU usage of SBS may or may not exceed that of AIO. If the house-keeping processing is heavy, the multiple containers in SBS may incur a larger CPU overhead than AIO. On the other hand, if the house-keeping processing is light enough, SBS could use less CPU than its AIO counterpart.

H. Summary

We have compared the two multi-stream joining models of AIO and SBS. In summary, AIO has the advantages of being easier to deploy and administer and of having lower E2E joining latency, while SBS has the advantages of a smaller memory footprint and of emitting intermediate streams. One particular implication of the memory footprint size is how to scale to handle large volumes of data. If the memory footprint of AIO is too big to fit on one node, the only way to scale is by partitioning. SBS, on the other hand, can scale with ease by deploying different steps on different nodes. Another implication of the memory footprint size concerns the JVM. For Java-based Samza joining,
the larger the heap size, the higher the GC-caused pauses and the longer the startup delay.

Table I. Joining latency results
Scenario               AIO (ms)   SBS (ms)   Phase-I (ms)   Phase-II (ms)
Sync, 1 partition      1734       9845       4791           5054
Sync, 10 partitions    1573       8849       4478           4371
Async, 1 partition     2009       9811       4713           5096
Async, 10 partitions   1346       8081       3984           4097

Table II. Memory footprints (values are in MB unless specified in GB)
Scenario               AIO        SBS        Phase-I        Phase-II
Sync, 1 partition      512 MB     220 MB     128 MB         92 MB
Sync, 10 partitions    9 GB       9 GB       4 GB           5 GB
Async, 1 partition     640 MB     480 MB     96 MB          384 MB
Async, 10 partitions   10 GB      9 GB       4 GB           5 GB

VI. EVALUATION

A. Experiment setup

We focus on a case with 3 input streams and a single output stream. The setup consists of the following components: (1) a set of 3 Kafka producers generating messages for each stream (or topic); (2) the Samza job(s); and (3) a single Kafka consumer that consumes the joined stream. The Kafka producers reside in a data center different from that of the Kafka clusters. With this setup, the end-to-end joining latency is not negligible, since: (1) it takes longer for the Kafka producers to produce messages to the Kafka cluster; and (2) it takes longer for a Samza job to fetch messages from the Kafka cluster, which lengthens the single-container latency.

1) Number of partitions: The memory footprint of a Samza job is partly determined by the number of partitions in a Kafka topic. A Kafka topic can be composed of a single partition or multiple partitions. In multi-partition scenarios, Kafka producers can produce a message to any of the partitions. Samza jobs create a separate task to handle each topic partition. For stream joining, a Samza job needs to buffer each message until it has received the messages with the same key from all topics before performing the join; hence the memory requirement is much higher. We study both the 1-partition and multi-partition scenarios.
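The buffer-until-complete behavior described above (hold each message until messages with the same key have arrived from all topics, then join) can be sketched as follows. The class and method names are illustrative only, not the actual Samza API:

```python
from collections import defaultdict

# Illustrative sketch, not actual Samza code: buffer messages per key until
# one message from every input topic has arrived, then emit the joined record.
class KeyedJoinBuffer:
    def __init__(self, topics):
        self.topics = set(topics)
        self.pending = defaultdict(dict)   # key -> {topic: message}

    def on_message(self, topic, key, message):
        bucket = self.pending[key]
        bucket[topic] = message
        if set(bucket) == self.topics:     # K-identified message set complete
            del self.pending[key]          # release the buffered memory
            return dict(bucket)            # the joined record
        return None                        # keep buffering

joiner = KeyedJoinBuffer(["topic1", "topic2", "topic3"])
joiner.on_message("topic1", "k", "a")      # buffered, no output yet
joiner.on_message("topic2", "k", "b")      # still waiting for topic3
joined = joiner.on_message("topic3", "k", "c")
```

The memory pressure discussed above comes from `pending`: every key whose set is incomplete stays buffered, so higher inter-stream lag means more outstanding keys.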
2) Inter-stream lag: The multiple input streams may not be well aligned in time; instead, there can be inter-stream lags. The significance of the inter-stream lag is that it impacts the memory footprint of Samza jobs: since Samza jobs need to buffer messages in order to join them, the higher the inter-stream lag, the more buffer is used. Because of this, we study both scenarios, with and without considerable inter-stream lags. We refer to the scenario with near-zero inter-stream lags as the “synchronized (or sync)” scenario, and that with non-zero inter-stream lags as the “asynchronized (or async)” scenario. In sync scenarios, the Kafka producers publish the messages with the same key to all topics at around the same time. Even so, due to environment dynamics (e.g., networking latency), we still encounter certain lags among the messages with the same key in the respective Kafka topics. In async scenarios, we inject a fixed 1-minute time delay between any two topics for the messages with the same key (see footnote 3). Such a test puts more memory pressure on the Samza job, as it must buffer messages with different arrival times.

Table III. CPU usage
Scenario               AIO (%)   SBS (%)   Phase-I (%)   Phase-II (%)
Sync, 1 partition      84        28        14            14
Sync, 10 partitions    101       95        59            36
Async, 1 partition     86        66        53            13
Async, 10 partitions   133       29        15            14

3) Kafka producers and consumer: The Kafka producer used in this work is the Kafka console producer, a command-line tool. For each generated message, the producer first determines which partition the message will be sent to. It then queries a Kafka broker for the topic partition metadata, from which it figures out which broker is the leader for that partition. Finally, it sends the data to the partition's leader broker. The Samza job instantiates a Kafka consumer, which issues fetch requests to the brokers for the partitions it wants to consume.
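The extra buffer pressure caused by the injected inter-stream lag follows directly from the T * B term of Section F: a stream lagging by T must be buffered for T longer. A rough back-of-the-envelope sketch, where the message rate and size are made-up illustrative numbers rather than measurements from our testbed:

```python
def extra_buffer_bytes(lag_seconds, msgs_per_second, bytes_per_msg):
    # B in Section F scales with arrival rate and message size;
    # the lag-induced buffer growth is then T * B.
    return lag_seconds * msgs_per_second * bytes_per_msg

# 1-minute injected lag, assuming 1000 msg/s and 1 KB messages:
extra = extra_buffer_bytes(60, 1000, 1024)
print(extra / (1024 * 1024))  # roughly 58.6 MB of extra buffer per lagging stream
```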
Typically, the Samza job prefetches a certain amount of messages to avoid delay.

4) Samza jobs: A Samza job uses a single thread to handle actions such as reading and writing messages, flushing metrics, checkpointing, and windowing. The Samza container creates an associated task to process the messages of each input stream; it chooses messages from the input stream partitions in a round-robin way and passes each message to the related task. If the Samza container uses stateful management, each task creates one store to buffer messages. If the changelog option is enabled, each task also sends a changelog stream to Kafka for durability purposes. Each task can emit messages to output stream topics, depending on the use case.

B. Results

We compare the performance of AIO and SBS. From the 6 metrics listed in Section V, we choose 3 important performance metrics (i.e., joining latency, memory footprint, and CPU usage), since the other three (i.e., deployment complexity, partial results, and stream joining throughput) are either straightforward or determined by the former three. In addition, we also examine the Samza jobs' JVM GC pauses, as these values shed light on CPU usage efficiency and application pauses.

1) Joining latency: AIO has a single joining latency value, while SBS has two, as SBS performs its joining task in two phases (i.e., Phase-I and Phase-II) using two separate Samza jobs. The results are shown in Table I. We can see that

3 For the 3-stream case, there is a 2-minute delay between the first and the third streams.
AIO results in much smaller joining latency when compared with its SBS counterpart. Depending on the scenario, the difference is around 5X.

2) Memory footprint: We measure the memory footprint of a Samza container based on the actual heap usage, as that shows the aggregated live object size. The maximum heap size is set to 16 GB in all cases, but during stable state the heap commits only a portion of the specified 16 GB. AIO has only one container to join the 3 input streams, so it has a single memory footprint value, while SBS has two. For the 4 scenarios, the results are listed in Table II. In general we see a higher memory footprint for AIO in almost all scenarios, except the second scenario, where the values are on par.

3) CPU usage: For the CPU usage measurement, we simply take the machine-level CPU usage values reported by the “top” utility [12]. As the machines we use are identical, dedicated ones running only our workloads, we consider this a fair comparison. In all 4 scenarios, AIO sees much higher CPU usage than SBS; the results are listed in Table III. Note that some values are higher than 100%, since the machines are multi-core and each core has a CPU capacity of 100%.

4) JVM GC pauses: The JVM we use is Oracle HotSpot 8. We report Young GC pauses, as they happen more frequently. For each scenario, we capture about 10 minutes of GC statistics during stable state and report both the average and the maximum value of the GC pauses in Figures 8 and 9.

Figure 8. GC pauses in the sync scenario (average and max for AIO, SBS Phase-I, and SBS Phase-II): (a) 1 partition; (b) 10 partitions
Figure 9. GC pauses in the async scenario (average and max for AIO, SBS Phase-I, and SBS Phase-II): (a) 1 partition; (b) 10 partitions
We notice that AIO mostly has higher Young GC pauses, owing to its larger heap size. Depending on the use case, these GC pauses may or may not impact application performance. For applications that care more about message throughput, individual GC pauses are less important. On the other hand, for applications that care about the joining latency of each message, keeping GC pauses low is critical.

C. Summary of results

Our data evaluates a set of performance metrics (i.e., joining latency, memory footprint, CPU usage, and JVM GC pauses) for both AIO and SBS. Given the vast number of variations in stream-joining systems (e.g., number of input streams, inter-stream lags, message sizes), the data is far from comprehensive with regard to a thorough analysis and detailed characterization of the two models. However, the results clearly indicate the performance tradeoffs inherent in the two models. Users should carefully choose the most appropriate model for their particular use cases.

VII. DISCUSSIONS: JOINING SEQUENCE OF SBS

For SBS, the joining sequence of the multiple input streams can significantly impact the performance of the entire workflow. Specifically, with regard to the intermediate streams, if one of the input streams for a particular step is blocked, then no intermediate stream is emitted from that step. For example, for the 3-stream joining illustration
in Figure 3(b), if Stream-B is blocked for some reason, then intermediate stream D will not be produced and all later joining steps are blocked as well.

Figure 10. Using SBS to join 3 input streams: (a) SBS with 3 input streams (lags T12 and T13); (b) an optimal joining sequence (aggregated memory footprint is 2M + T13B); (c) a suboptimal joining sequence (per-step footprints M + T13B and M + T32B, i.e., T32B larger than the optimal scenario)

As another example, if the input streams have different inter-stream lags, the joining sequence greatly impacts the resulting memory footprint. As shown in Figure 10(a), assume an SBS scenario where 3 input streams need to be joined. Denote Stream-1 as the earliest stream, with Stream-2 and Stream-3 having latency lags of T12 and T13, respectively. Figure 10(b) illustrates an optimal joining sequence, where Stream-1 and Stream-2 are joined first, and the resulting intermediate Stream-4 is joined with Stream-3 in the second step. The memory footprint of the first step is M + T12B, as directed by Theorem-III. Correspondingly, the memory footprint of the second step is M + T23B, where T23 (or T32) is the latency lag between Stream-2 and Stream-3. Since T12 + T23 = T13, the aggregated memory footprint in Figure 10(b) is 2M + T13B. If the joining sequence is different, as shown in Figure 10(c), then the memory footprints of the two SBS steps are M + T13B and M + T32B, respectively. Compared to the optimal joining sequence, the increase in memory usage is T32B.

Considering the above two types of examples, we propose two approaches for determining the joining sequence of SBS: (1) memory-oriented, which aims at minimizing the aggregated memory footprint; and (2) reliability-oriented, which aims at providing as many intermediate streams as possible.
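The footprint arithmetic of Figure 10 can be checked mechanically. The helper below is a hypothetical sketch, not from the paper: it aggregates the per-step footprint M + T * B over a given joining order, where each step's lag is the gap between the running join result and the newly added stream.

```python
def sbs_sequence_footprint(lags, order, M, B):
    """lags[i] = lag of stream i relative to the earliest stream (lags[0] == 0);
    order = stream indices in the order they are joined."""
    total = 0
    carried_lag = lags[order[0]]            # lag carried by the running join result
    for s in order[1:]:
        step_lag = abs(lags[s] - carried_lag)
        total += M + step_lag * B           # per-step footprint M + T * B
        carried_lag = max(carried_lag, lags[s])
    return total

# Figure 10 example: T12 = 2, T13 = 5 (so T23 = 3), with M = 10, B = 1
lags, M, B = [0, 2, 5], 10, 1
optimal = sbs_sequence_footprint(lags, [0, 1, 2], M, B)     # 2M + T13*B = 25
suboptimal = sbs_sequence_footprint(lags, [0, 2, 1], M, B)  # larger by T32*B = 3
```

Joining in increasing order of lag reproduces the minimum of Theorem-III; any other order pays the T32B-style penalty shown in Figure 10(c).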
Memory-oriented SBS. In this approach, all input streams are sorted by their latency lags with respect to the earliest input stream, and the SBS steps simply join them according to that ranking. The aggregated memory footprint is then the one given by Theorem-III, which is the minimum.

Reliability-oriented SBS. This approach aims at providing partial results to users. All input streams are ranked by how likely their messages are to be blocked, with the most reliable input streams coming first. By joining the input streams in order of reliability, the chance of intermediate streams being blocked is minimized.

VIII. RELATED WORK

A. Streaming systems

Many systems designed for stream processing are available in industry, each with different attributes to serve stream processing needs. For instance, S4 [13] is a distributed stream processing engine originally developed at Yahoo. Apache Storm [2], [4] is another popular stream processing system, which originated at Twitter; Storm utilizes Apache Mesos [14] to schedule processing jobs and provides two message delivery semantics (at-least-once and at-most-once). MillWheel [15] is an in-house solution developed by Google for its stream processing purposes. Apache Spark [3], a streaming framework used by Yahoo and Baidu, treats streaming as a series of deterministic batch operations, so that developers can use the same algorithms in both streaming and batch modes. Compared to Spark, which is a batch processing framework that can approximate stream processing, Flink [16] is primarily a stream processing framework that can look like a batch processor.

B. Capacity model and capacity planning

There are many works in the general area of capacity planning. The work in [17] presents a four-phased model for computer capacity planning (CCP), which can be used to develop a formal capacity planning function.
Our work [18] considers the practical problem of capacity planning for database replication at a large-scale website, and presents a model to forecast future traffic and determine the required software capacity. Our recent work [19] presents a memory capacity model for data-filtering applications with Samza.

C. Java heap sizing

To size the Java heap appropriately, various works have sought to strike a balance among the tradeoffs involved. The work in [20] analyzes memory usage on the Java heap through object tracing, based on the observation that an inappropriate heap size can lead to performance problems caused by excessive GC overhead or memory paging; it then presents an automatic heap-sizing algorithm applicable to different garbage collectors with only minor changes. The paper [21] presents a heap sizing rule for dealing with different memory sizes; specifically, it models how severe page faults can be via a Heap-Aware Page Fault Equation.
D. Data quality

Veracity [22], which refers to the consistency and accuracy of massive data, has been increasingly recognized as one of the key attributes of Big Data. Poor data quality [8] impacts enterprise companies, leading to customer dissatisfaction, increased cost, bad decision-making, etc. There are many works on systems for improving data quality. The work in [23] presents a system based on the conditional functional dependencies (CFDs) model for improving the quality of relational data.

IX. CONCLUSION

This work studies multi-stream joining models using the Samza framework. We characterize the two models of SBS and AIO, and analyze their performance tradeoffs. We also set up a Samza testbed using Kafka streams to verify our theoretical analysis. The results can be used to design and implement systems that join multiple streams more efficiently.

REFERENCES

[1] “Apache Samza,” http://samza.apache.org/.
[2] “Apache Storm,” http://storm.apache.org/.
[3] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI ’12, San Jose, CA, 2012.
[4] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy, “Storm@twitter,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’14, Snowbird, Utah, USA, 2014.
[5] “Apache Samza: LinkedIn’s real-time stream processing framework,” https://engineering.linkedin.com/data-streams/apache-samza-linkedins-real-time-stream-processing-framework.
[6] “Apache Kafka,” http://kafka.apache.org/.
[7] L. Qiao, K. Surlaker, S. Das, et
al., “On brewing fresh espresso: LinkedIn’s distributed data serving platform,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’13, New York, New York, USA, 2013.
[8] T. C. Redman, “The impact of poor data quality on the typical enterprise,” Commun. ACM, vol. 41, no. 2, pp. 79–82, Feb. 1998.
[9] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache Hadoop YARN: Yet another resource negotiator,” in Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC ’13, Santa Clara, California, 2013.
[10] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), ser. MSST ’10, Washington, DC, USA, 2010.
[11] “Benchmarking Apache Samza: 1.2 million messages per second on a single host,” http://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node.
[12] M. G. Sobell, A Practical Guide to Linux Commands, Editors, and Shell Programming. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2005.
[13] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed stream computing platform,” in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, ser. ICDMW ’10, Washington, DC, USA, 2010.
[14] “Apache Mesos,” http://mesos.apache.org/.
[15] T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle, “MillWheel: Fault-tolerant stream processing at internet scale,” Proc. VLDB Endow., vol. 6, no. 11, pp. 1033–1044, Aug. 2013.
[16] “Apache Flink,” https://flink.apache.org/.
[17] I. L. Carper, S. Harvey, and J. C. Wetherbe, “Computer capacity planning: Strategy and methodologies,” SIGMIS Database, vol. 14, no. 4, pp.
3–13, Jul. 1983.
[18] Z. Zhuang, H. Ramachandra, C. Tran, S. Subramaniam, C. Botev, C. Xiong, and B. Sridharan, “Capacity planning and headroom analysis for taming database replication latency: Experiences with LinkedIn internet traffic,” in Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ser. ICPE ’15, Austin, Texas, USA, 2015, pp. 39–50.
[19] T. Feng, Z. Zhuang, Y. Pan, and H. Ramachandra, “A memory capacity model for high performing data-filtering applications in Samza framework,” in Proceedings of the 2015 IEEE International Conference on Big Data - Workshop on Data Quality Issues, Santa Clara, CA, USA, 2015, pp. 2600–2605.
[20] P. Lengauer, V. Bitto, and H. Mössenböck, “Accurate and efficient object tracing for Java applications,” in Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, ser. ICPE ’15, Austin, Texas, USA, 2015.
[21] Y. C. Tay, X. Zong, and X. He, “An equation-based heap sizing rule,” Perform. Eval., vol. 70, no. 11, pp. 948–964, Nov. 2013.
[22] B. Saha and D. Srivastava, “Data quality: The other face of Big Data,” in Proceedings of the IEEE 30th International Conference on Data Engineering, ser. ICDE ’14, Chicago, IL, USA, 2014.
[23] W. Fan, F. Geerts, and X. Jia, “Semandaq: A data quality system based on conditional functional dependencies,” Proc. VLDB Endow., vol. 1, no. 2, pp. 1460–1463, Aug. 2008.