A Memory Capacity Model for High Performing Data-filtering Applications in Samza Framework
Tao Feng, Zhenyun Zhuang, Yi Pan, Haricharan Ramachandra
LinkedIn Corp
2029 Stierlin Court Mountain View, CA 94043, USA
{tofeng, zzhuang, yipan, hramachandra}@linkedin.com
Abstract: Data quality is essential in the big data paradigm, as poor data can have serious consequences when dealing with large volumes of data. While it is trivial to spot poor data in small-scale and offline use cases, it is challenging to detect and fix data inconsistency in a large-scale and online (real-time or near-real-time) big data context. An example of such a scenario is spotting and fixing poor data using Apache Samza, a stream processing framework that has been increasingly adopted to process near-real-time data at LinkedIn.
To optimize the deployment of Samza processing and reduce business cost, in this work we propose a memory capacity model for Apache Samza that allows denser deployments of high-performing data-filtering applications built on Samza. The model can be used to provision just enough memory to applications by tightening the bounds on memory allocation. We apply our memory capacity model to LinkedIn's real production use cases, which significantly increases deployment density and saves business costs. We share the key learnings in this paper.
Keywords: Apache Samza; capacity model; data filtering; performance
I. INTRODUCTION
With the exploding growth of Big Data, the quality of data continues to be a critical factor in most computing environments, be they business related or not. Low-quality data gives rise to various kinds of challenges and damage to the applications concerned. Depending on the usage scenario, the problems caused by low-quality data can be as severe as completely distorted data that leads to incorrect results or conclusions.
Spotting and fixing bad-quality data is much needed in many data processing applications. Though such processing is trivial with small-scale data, spotting, filtering, and fixing bad-quality data in large-scale environments such as big data is more challenging due to the ever-mounting size of the data involved.
Such data filtering to ensure data quality can happen in either an offline or an online (i.e., real-time or near-real-time) fashion. Offline processing is comparatively easier, since the applications can do post-processing, while real-time or near-real-time data filtering is more difficult due to its time-sensitivity requirements.
Given the growing size of big data, the increasing deployments of data streaming frameworks such as Apache Kafka [1], and the intrinsic timeliness requirements, many of today's applications need effective data filtering techniques and high-performing deployments to ensure data quality.
To meet the above requirements, data filtering applications built on stream processing frameworks such as Apache Storm [2] and Apache Samza [3] are being increasingly adopted by Internet companies such as LinkedIn. At LinkedIn, these applications are heavily deployed to filter streaming data triggered by online activities coming from hundreds of millions of users. For instance, user activities coming from user-facing pages may contain inconsistent data; hence the data needs to be filtered to remove the inconsistency before being persisted to backend storage such as Espresso [4].
Though designed to scale well to deal with massive sizes of data, such data-filtering applications often have to be co-deployed on machines to achieve the aggregate throughput the business requires. Due to limited computing resources (e.g., memory), there is a limit on how many data-filtering instances can co-exist on a single node. For most of LinkedIn's data filtering applications, which are typically written in Java, the top resource bottleneck is memory, partly due to the practice of pre-configuring the heap size as required by Java.
To maximize the number of co-located instances, and also for capacity planning purposes, a memory capacity model is needed to properly configure the Java heap size. Without such a model, engineers responsible for deploying applications would have to rely on experience or ad-hoc values, which is not only unreliable but also error-prone. On one hand, if the configured Java heap size is too small, the application may end up crashing with an OOM (out of memory) error. On the other hand, if the configured Java heap size is too big, it wastes memory, hence under-allocating the co-located applications.
In this work, we propose a memory capacity model to deterministically calculate the needed heap size for Samza-based applications without sacrificing the necessary data quality accuracy. Based on the model, an appropriate heap size can be specified for the applications. Our experimental results suggest that the model accurately predicts the needed heap size, and allows the maximum number of applications to be co-located on machines.
The remainder of this paper is organized as follows. In Section II, we present the Samza framework, upon which the memory capacity model is based. In Section III, we describe the memory capacity model we propose. In Section IV, we evaluate the performance of the model. In Section V, we present the related work. Finally, we conclude the paper in Section VI.
II. USING THE SAMZA FRAMEWORK TO DO DATA FILTERING
A. Samza Framework
Stream processing is ongoing, high-throughput data processing applied to massive data sets with near-real-time SLA guarantees. Apache Samza is a lightweight stream processing framework originally created at LinkedIn for continuous data processing. Samza has a set of desirable features such as a simple API, fault tolerance, high durability, managed state, etc.
Samza uses a single thread internally to handle reading and writing messages, flushing metrics, checkpointing, windowing, and so on. Samza can scale by creating multiple instances if a single instance is not enough to sustain the traffic. As shown in Figure 1, the Samza container creates an associated task to process the messages of each input stream topic partition. The Samza container chooses a message from the input stream partitions in a round-robin way and passes the message to the related task. If the Samza container uses stateful management, each task creates one partition in the store to buffer messages. If the changelog option is enabled, each task sends a changelog stream to Kafka for durability purposes. Each task can emit messages to output stream topics, depending on the use case [5].
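The per-partition task layout and round-robin dispatch described above can be sketched as follows. This is a deliberately simplified Python model for illustration only, not Samza's actual (Java, single-threaded but far richer) implementation:

```python
from itertools import cycle

class Task:
    """Simplified stand-in for a Samza task: one task per input partition."""
    def __init__(self, partition):
        self.partition = partition
        self.processed = []

    def process(self, message):
        self.processed.append(message)

def run_container(partitions, messages_by_partition):
    # One task per input-stream topic partition, as in Figure 1.
    tasks = {p: Task(p) for p in partitions}
    # The container picks messages from partitions in round-robin order
    # and passes each to its partition's task.
    for p in cycle(partitions):
        queue = messages_by_partition[p]
        if not queue:
            if not any(messages_by_partition.values()):
                break  # all partitions drained
            continue
        tasks[p].process(queue.pop(0))
    return tasks

# Example: two partitions, each with its own message queue.
queues = {0: ["a", "b"], 1: ["c"]}
tasks = run_container([0, 1], queues)
```

Each task sees only its own partition's messages, which is what makes per-task local state (one store partition per task) possible.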
Figure 1: Samza Framework
B. Samza-based Data Filtering
The Samza framework is heavily used at LinkedIn for application and system monitoring, data filtering, tracking user behavior, etc. We have two different usage scenarios for doing data filtering with Samza. In the first usage scenario, bad-quality data is filtered (i.e., removed) based on rules and other data streams. For instance, if some streamed data items do not contain valid values (e.g., wireless network bandwidth measurement results for a particular user), they are simply discarded under the guidance of certain rules. The process is shown in Figure 2.
Figure 2: Data Filtering By Rules
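A rule-based filter of this kind can be illustrated with a short sketch. The rule and field names below are hypothetical (the paper does not give the actual rules), and Python is used only for brevity; LinkedIn's applications are written in Java:

```python
def is_valid(record):
    # Hypothetical rule: a bandwidth measurement must be present and positive.
    bw = record.get("bandwidth_kbps")
    return bw is not None and bw > 0

def filter_stream(records):
    # Bad-quality records are simply discarded, as in Figure 2.
    return [r for r in records if is_valid(r)]

stream = [
    {"user": "u1", "bandwidth_kbps": 1200},
    {"user": "u2", "bandwidth_kbps": None},  # invalid value: dropped
    {"user": "u3", "bandwidth_kbps": -5},    # invalid value: dropped
]
clean = filter_stream(stream)  # keeps only u1's record
```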
The other data filtering scenario enhances low-quality data streams by aggregating them with other data streams. For instance, a data stream may only have user click information in tuples of (cookie id, actions), which presents itself as low-quality data. Cookie ids need to be replaced by the corresponding userIDs before these activities can be persisted to storage as high-quality data for easier retrieval later. For this use case, another data stream needs to be utilized to enhance the data quality. The process is shown in Figure 3.
Figure 3: Data Filtering By Joining Streams
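The join-based enhancement can be sketched as below. This is a minimal Python illustration, assuming the second stream has already been materialized into a cookie-to-user lookup table; how unmatched records are handled (buffered, dropped, retried) is a design choice the paper does not specify:

```python
def enrich(click_stream, cookie_to_user):
    """Replace cookie ids with user ids by joining against a second
    stream, modeled here as a lookup table built from that stream."""
    enriched, unmatched = [], []
    for cookie_id, action in click_stream:
        user_id = cookie_to_user.get(cookie_id)
        if user_id is None:
            unmatched.append((cookie_id, action))  # cannot be enriched yet
        else:
            enriched.append((user_id, action))     # high-quality tuple
    return enriched, unmatched

clicks = [("ck1", "view"), ("ck2", "click"), ("ck9", "view")]
mapping = {"ck1": "user42", "ck2": "user7"}
enriched, unmatched = enrich(clicks, mapping)
```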
III. MEMORY CAPACITY MODEL
A. Overview of the Model
Samza has been used for stateful stream processing to do data filtering. Data from an upstream system can be stored through a Samza task on the same machine where the Samza job is running. Samza's local state allows a Samza job to access data with low latency and high performance. Samza provides two types of key-value stores: an in-memory option and RocksDB [6]. The in-memory store is an internally created Java TreeMap. RocksDB is a persistent key-value store designed and maintained by Facebook for flash storage. In this paper we propose a Samza memory capacity model for the in-memory store option.
The memory capacity model for the Samza in-memory key-value store works by calculating the live data object size based on the workload characteristics and the data accuracy requirement. It considers the internal working mechanism of Samza, as well as the typical memory footprint of each type of variable used by Samza's internal data structures.
B. In-memory Capacity Model
For ease of presentation, we use the following notations:
• T: Number of input topics in Samza. It depends on how many input topics the Samza container tries to consume. The value can be configured through task.inputs in the configuration file.
• P: Number of partitions per topic, which depends on the input topic configuration in the Kafka cluster. Users can query this info through the ZooKeeper client.
• E: Number of unique entries per partition, which depends on: 1) the number of unique key ids for the input topic and 2) the input topic retention period. For example, assume an input topic has 10 million unique key ids that take the container 5 hours to fully consume, and the Samza container only maintains 3 hours of the input topic's messages before doing a cleanup. In this case, the number of unique entries per partition will be 10 million * 3/5 = 6 million.
• B: Bytes per TreeMap entry, which is about 40 bytes based on our test.
• Bk: Bytes of the serialized key, which depends on the key object and the serialization implementation. For a String object, it is roughly 20 bytes per string from our test.
• Bm: Bytes of the serialized value message, which depends on the value object of the store and the serialization implementation. For an Integer object, it is roughly 4 bytes per Integer according to the source code and our test.
With the above notations, and using L to denote the total live data size required by a particular application, we have:
L = T * P * E * (B + Bk + Bm)
To make sure the job runs smoothly without frequent full-GC impact, we suggest that the heap space of the job be at least twice the live data set size. It is also suggested to deploy with a low-latency GC algorithm like G1 [7]. Essentially, G1 is a compaction-based garbage collector; to accommodate the effect of moving data around, denoting the heap size as H, H = 2L gives a reasonable heap size for the necessary data accuracy and performance requirements.
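The model and the heap-sizing rule can be written down directly. The sample values below are the measured ones the paper uses in Section IV (T = 1, P = 8, E = 5 million, B = 40, Bk = 24, Bm = 24 bytes):

```python
def live_data_size(T, P, E, B, Bk, Bm):
    """L = T * P * E * (B + Bk + Bm): total live data size in bytes."""
    return T * P * E * (B + Bk + Bm)

def suggested_heap(T, P, E, B, Bk, Bm):
    """H = 2L: at least twice the live data set size, leaving headroom
    for G1's compaction (moving data around) without frequent full GCs."""
    return 2 * live_data_size(T, P, E, B, Bk, Bm)

L = live_data_size(T=1, P=8, E=5_000_000, B=40, Bk=24, Bm=24)
H = suggested_heap(T=1, P=8, E=5_000_000, B=40, Bk=24, Bm=24)
# L is 3.52e9 bytes (about 3.5 GB); H is about 7 GB.
```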
C. Determining Variable Values of the Model
We deduce the values of the variables using experiments. Our upstream input topic has 8 partitions, each of which has 5 million unique key ids. The Samza container by default creates 8 tasks, each of which creates one separate partition of the in-memory store (a TreeMap) to handle one separate input partition.
Once the Samza job reaches a stable state in which each memory store has 5 million entries, we perform a live-object heap dump (listing only the live objects) with the jmap command provided by the standard JDK. The command for the live object dump is "jmap -histo:live $pid". The live object dump shows how many object references are live by first triggering a full GC.
Figure 4: Heap Dump
Figure 4 shows all the live objects listed with number of instances, total bytes, and corresponding class name. The two main types of live objects are byte arrays ([B) and TreeMap entries (java.util.TreeMap$Entry). Samza implements the in-memory store option with a TreeMap, which keeps all the entries sorted. The interesting questions are: 1) why do we have about 40 million TreeMap entries with about 1.6 GB total size; and 2) why do we have about 80 million byte arrays with 1.8 GB total size.
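As a side note, the per-class totals in such a histogram can be extracted programmatically. The sketch below uses a fabricated two-row sample that mirrors the `jmap -histo:live` output format (rank, instance count, bytes, class name); real output contains many more classes:

```python
def parse_histo(histo_text):
    """Sum instance counts and bytes per class from jmap -histo output."""
    totals = {}
    for line in histo_text.strip().splitlines():
        parts = line.split()
        # Data rows look like: "1:  80000000  1920000000  [B"
        if len(parts) == 4 and parts[0].rstrip(":").isdigit():
            _, instances, nbytes, cls = parts
            totals[cls] = (int(instances), int(nbytes))
    return totals

sample = """
 num     #instances         #bytes  class name
----------------------------------------------
   1:      80000000     1920000000  [B
   2:      40000000     1600000000  java.util.TreeMap$Entry
"""
totals = parse_histo(sample)
```

Summing the bytes column over the dominant classes gives the live data set size that the model's L is meant to predict.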
To answer the first question: we have 8 tasks, each of which has one in-memory TreeMap store handling 5 million entries. So the total number of entries is 8 * 5 million = 40 million. From the micro-benchmark test above, we know one TreeMap entry takes about 40 bytes. Thus the total object size is 40 * 40 million, which equals around 1.6 GB.
The answer to the second question is a little trickier. Samza does a lot of serialization and deserialization in the framework. A couple of typical scenarios: 1) when the container receives Kafka messages, the messages are deserialized; 2) when the container wants to store a key/message pair in a store (in-memory or RocksDB), both the key and the message are serialized and inserted into the store; 3) when the container wants to send a message to a downstream Kafka topic, it serializes the key/message of the output message.
Different object types use different serdes, which the user defines through configuration options in the Samza configuration table. Here the key of the in-memory store is of String type, and the value is of Integer type. Both the key and the value are serialized into byte arrays with different serializers (a String serde and an Integer serde). The roughly 40 million entries yield 40 million keys and 40 million values; thus the number of byte array instances is 80 million. A String object is serialized into 24 bytes, and an Integer object is also serialized into 24 bytes due to headers and alignment, according to the serializer implementation. Thus the total object size for byte arrays is 24 * 80 million, which equals around 1.9 GB. The total live data set size is therefore around 3.5 GB in this case. The Samza job would encounter data loss and fail to maintain the necessary data quality if the heap size were too small. Hence, based on the model, a reasonable heap size is around 7.0 GB to maintain both data quality and performance.
Samza supports a feature called auto-scaling, which provides a profiler that actively checks memory usage, input stream rate, etc. The Samza framework reads the profiler output and redeploys the Samza job if a certain SLA is violated. For future work, we will incorporate our memory capacity model into the profiler, which will first profile the input stream event rate and then redeploy the Samza job according to the memory capacity model.
IV. EVALUATION
To evaluate the performance of our model, we use an input data stream that represents LinkedIn production traffic patterns to drive our Samza workload.
A. Evaluation Methodology
We first deduce the heap size needed based on our capacity model; call it H. For comparison, we consider two comparable cases with double and triple the deduced heap size, i.e., 2H and 3H.
The major performance metric we measure is job throughput. We use Samza's process-envelopes metric as the key metric for Samza job throughput; it measures the Samza message-processing rate.
In addition to the main performance metric of job throughput, to understand the impact on other resources such as CPU, we also measure CPU usage and Java GC activity, such as GC counts and total GC time. Both young-GC and old-GC activities are measured.
B. Workload and Testbed
For the workload development, we follow the common practices of the Samza community. When writing a stream processor for Samza, we need to implement the StreamTask interface specified by the Samza framework.
The workload we consider is a straightforward one, for the purpose of being stable and repeatable. In the workload, the Samza stream processor listens to messages from an upstream Kafka topic, extracts the key, which indicates the identity of the message, from the message value, checks whether we have encountered this key before in the in-memory store, performs key counting, and stores the result back into the in-memory store.
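The per-message logic of this workload can be sketched as follows. This is a Python rendering of the behavior the Java StreamTask implements, with a plain dict standing in for the Samza in-memory key-value store; the actual StreamTask interface and store API are not shown:

```python
def process(envelope, store):
    """Per-message logic of the benchmark workload: extract the key,
    look up any previous count in the store, increment, write back."""
    key = envelope["key"]
    count = store.get(key, 0)   # have we encountered this key before?
    store[key] = count + 1      # key counting; result stored back

store = {}  # stand-in for the in-memory TreeMap store
for msg in [{"key": "k1"}, {"key": "k2"}, {"key": "k1"}]:
    process(msg, store)
# store now maps k1 -> 2, k2 -> 1
```

With 5 million unique keys per partition, the store eventually holds one entry per key, which is exactly the steady state the model's E parameter describes.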
At LinkedIn, we have clusters serving as Kafka brokers in production to listen for various messages on different topics. We used a dedicated box to run the Samza stream processor. Its hardware characteristics are as follows: Intel Xeon 2.67 GHz processor with 24 cores, 48 GB RAM, Gbps Ethernet, and a 1.65 TB SSD. The Kafka topic that our Samza job listens to has about 5 million records with unique keys and 8 partitions.
Our Samza stream processor runs with the G1 garbage collection algorithm. The major JVM options used are "-XX:+UseG1GC -XX:G1HeapRegionSize=4M". We run the Samza job for half an hour to make sure the result is stable and consistent. We also fix the heap sizes deduced by our capacity model and for the comparable cases.
Figure 5: Throughput
                     1H      2H      3H
Young GC: Count      88      29      32
  Total time (ms)    9850    5063    6144
Old GC: Count        24      0       0
  Total time (ms)    70166   0       0
Total: Count         112     29      31
  Total time (ms)    80117   5063    6144
Table I: Java GC Pauses
Figure 6: CPU Utilization
C. Heap Sizes Deduced for the Three Cases
Using our model, we conclude that our live data set size is about 3.5 GB: L (3.5 GB) equals T(1) * P(8) * E(5 million) * (B(40) + Bk(24) + Bm(24)). Based on the above, we derive that our Samza job's heap size should be around 7 GB. For the two comparable cases, the heap sizes are 14 GB and 21 GB, respectively.
D. Performance Results
Job throughput. The job throughput results are shown in Figure 5. From the throughput figure, we find that there is not much difference in average throughput, in terms of thousands of processed messages per second, among the three cases. The throughput difference is only 5%. This means that the JVM heap size derived from our memory capacity model can sustain the same message rate as the 2H and 3H JVM heap sizes, while bringing a memory-saving benefit to the system.
CPU usage. The CPU usage results are shown in Figure 6. The first case (i.e., the one based on our capacity model) sees up to 50% higher CPU usage than the other two cases.
GC activities. Table I lists statistics on Java GC pauses in three categories (young GC, old GC, and total GC) for the three heap size cases, captured with Java Mission Control. We find that 1H gives the worst GC performance in terms of both total GC pause time and GC counts. We observe significantly higher GC activity in the first case: specifically, the total GC count is 3X higher, and the total GC pause time is 16X higher.
E. Comparing the Three Cases
The job throughput results suggest that the heap size based on our memory capacity model does not much degrade job-level performance, even though GC activity and CPU usage are much higher. In fact, the latter two are expected and are directly caused by our memory capacity model. There is clearly a tradeoff between the heap size and JVM-level as well as OS-level activity.
In Java, if the heap size is smaller, more GC events will be observed, because the heap fills up and triggers GC sooner. Because Java objects need to be scanned and compacted during the GC process, CPU usage will, not surprisingly, also increase. The increased CPU usage might affect Samza job performance in a multi-instance environment: if we exhaust CPU resources before memory resources, the performance of 1H will be impacted.
Since the major benefit of our memory capacity model is to co-locate applications more densely on the same node through accurate memory estimation without job-level performance degradation, we are satisfied that job-level throughputs are on par while using 2X or 3X less memory. In other words, our memory capacity model allows 2X or 3X denser deployments, which saves business cost at the same scale.
F. Deep Examination of GC
Though we observe much higher GC counts in the 1H case, one major attribute for determining whether GC is effective for a workload is the number of full garbage collections that happen. A full GC is a stop-the-world event in which all application threads cannot proceed until the whole heap is scanned and garbage objects are released. This impacts applications the most, and low-latency applications like a Samza processor cannot tolerate it. In our cases, none of the three scenarios (1H, 2H, 3H) encounters a full GC. The old GC shown in the table is a mixed GC in G1, in which the collection set includes both young regions and old regions. This is different from a full GC, as G1 is a region-based garbage collection algorithm that collects only a subset of regions in a mixed GC.
We expect to encounter mixed GCs in the 1H case, as the live data size exceeds half the total heap size. Some phases of G1 belong to concurrent marking cycles, which are not stop-the-world events. In this case the application continues running while the G1 concurrent phases happen, but CPU utilization increases. That is the main reason why the per-JVM CPU utilization in Figure 6, captured with the top command, increases compared to the 2H and 3H cases.
G. Summary
Overall, the throughput of our production Samza job in the 1H case is on par with the performance of the 2H and 3H cases, despite increases in both GC counts and application CPU utilization. And we achieve great memory savings in the 1H case, which brings the benefit of co-locating multiple instances on the same machine.
V. RELATED WORK
A. Capacity model and capacity planning
There is much work in the general area of capacity planning. The work in [8] presents a four-phased model for computer capacity planning (CCP); the model can be used to develop a formal capacity planning function. The work in [9] considers the practical problem of capacity planning for database replication for a large-scale website, and presents a model to forecast future traffic and determine the required software capacity.
B. Streaming systems
Many systems designed for stream processing are available in industry, each with different attributes to serve stream processing needs. For instance, S4 [10] is a distributed stream processing engine originally developed at Yahoo. Apache Storm [11] is another popular stream processing system, originating from Twitter. Storm utilizes Apache Mesos [12] to schedule processing jobs and provides two message delivery semantics (at-least-once and at-most-once). MillWheel [13] is an in-house solution developed by Google to serve its stream processing purposes.
C. Java heap sizing
To size the Java heap appropriately, various works have been done to strike a balance among competing tradeoffs. The work in [14] analyzed memory usage on the Java heap through object tracing, based on the observation that an inappropriate heap size can lead to performance problems caused by excessive GC overhead or memory paging. It then presents an automatic heap-sizing algorithm applicable to different garbage collectors with only minor changes. The paper [15] presents a Heap Sizing Rule for dealing with different memory sizes with regard to heap size. Specifically, it models how severe page faults can be using a Heap-Aware Page Fault Equation.
D. Data quality
Veracity [16], which refers to the consistency and accuracy of massive data, has been increasingly recognized as one of the key attributes of Big Data. Poor data quality [17] impacts enterprise companies, leading to customer dissatisfaction, increased cost, bad decision-making, etc. There are many works on developing systems to improve data quality. The work in [18] presents a system based on a conditional functional dependencies (CFDs) model for improving the quality of relational data.
VI. CONCLUSION
In this work, we consider the problem of large-scale and real-time data filtering and propose a memory capacity model for Samza-based data-filtering applications. The model can be used to predict the heap usage of an application, and hence allows for denser deployments of multiple applications on the same machine.
REFERENCES
[1] Apache Kafka: http://kafka.apache.org/
[2] Apache Storm: https://storm.apache.org/
[3] Apache Samza: http://samza.apache.org/
[4] LinkedIn Espresso: http://data.linkedin.com/projects/espresso
[5] Benchmarking Apache Samza: 1.2 million messages per second on a single host: http://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
[6] RocksDB: http://rocksdb.org/
[7] G1 GC: http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html
[8] Carper, L., et al., Computer capacity planning: strategy and methodologies, SIGMIS Database, vol. 14, 1983.
[9] Zhuang, Z., et al., Capacity Planning and Headroom Analysis for Taming Database Replication Latency (Experiences with LinkedIn Internet Traffic), Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (ICPE 2015), Austin, TX, USA.
[10] Neumeyer, L., et al., S4: Distributed Stream Computing Platform, in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops (ICDMW), Sydney, Australia, Dec. 2010.
[11] Toshniwal, A., et al., Storm@twitter, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14), Snowbird, UT, USA.
[12] Apache Mesos: http://mesos.apache.org/
[13] Akidau, T., et al., MillWheel: fault-tolerant stream processing
at internet scale, in Proc. of VLDB Endow., August 2013
[14] Lengauer, P., et al., Accurate and Efficient Object Tracing for Java Applications, Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (ICPE 2015), Austin, TX, USA.
[15] Tay, Y., et al., An equation-based Heap Sizing Rule, Performance Evaluation, vol. 70, Nov. 2013.
[16] Saha, B., et al., Data quality: The other face of Big Data,
in Data Engineering (ICDE), 2014 IEEE 30th International
Conference on , vol., no., pp.1294-1297, March 31 2014-
April 4 2014
[17] Redman T., The impact of poor data quality on the typical
enterprise, Commun. ACM 1998
[18] Fan, W., et al., Semandaq: a data quality system based on conditional functional dependencies, PVLDB, 2008, pp. 1460-1463.