A Memory Capacity Model for High Performing Data-filtering Applications in
Samza Framework
Tao Feng, Zhenyun Zhuang, Yi Pan, Haricharan Ramachandra
LinkedIn Corp
2029 Stierlin Court, Mountain View, CA 94043, USA
{tofeng, zzhuang, yipan, hramachandra}@linkedin.com
Abstract—Data quality is essential in the big data paradigm, as
poor data can have serious consequences when dealing with
large volumes of data. While it is trivial to spot poor data in
small-scale and offline use cases, it is challenging to detect and
fix data inconsistency in large-scale and online (real-time or
near-real-time) big data contexts. An example of such a scenario
is spotting and fixing poor data using Apache Samza, a stream
processing framework that has been increasingly adopted to
process near-real-time data at LinkedIn.
To optimize the deployment of Samza processing and reduce
business cost, in this work we propose a memory capacity
model for Apache Samza that allows denser deployments of
high performing data-filtering applications built on Samza.
The model can be used to provision just-enough memory
for applications by tightening the bounds on memory
allocations. We apply our memory capacity model
to LinkedIn's real production use cases, which significantly
increases deployment density and saves business cost. We
share the key learnings in this paper.
Keywords-Apache Samza; capacity model; data filtering;
performance;
I. INTRODUCTION
With the exploding growth of Big Data, the quality of
data continues to be a critical factor in most computing
environments, business related or not. Low-quality data
gives rise to various challenges and damages in the
applications concerned. Depending on the usage scenario,
the problems caused by low-quality data can be as severe
as completely distorted data that leads to incorrect results or
conclusions.
Spotting bad-quality data and fixing it are much
needed in many data processing applications. Though
such processing is trivial with small-scale data, spotting,
filtering, and fixing bad-quality data in large-scale environments
such as big data is more challenging due to the ever-mounting
size of the data involved.
Such data ļ¬ltering to ensuring data quality can happen
in either ofļ¬‚ine or online (i.e., real-time or near-real time)
fashion. Ofļ¬‚ine processing is comparably easier since the
applications can do post-processing, while real-time or near-
real-time data ļ¬ltering is more difļ¬cult due to the time
sensitiveness requirement.
Given the growing size of big data, the increasing deployments
of data streaming frameworks such as Apache
Kafka [1], and the intrinsic timeliness requirements, many of
today's applications need effective data filtering techniques
and high-performing deployments to ensure data quality.
To meet these requirements, data filtering applications
that use stream processing frameworks such as
Apache Storm [2] and Apache Samza [3] are being increasingly
adopted by Internet companies such as LinkedIn. At
LinkedIn, these applications are heavily deployed to filter
streaming data triggered by the online activities of
hundreds of millions of users. For instance, user activities
coming from user-facing pages may contain inconsistent
data; hence the data need to be filtered to remove the
inconsistency before being persisted to backend stores
such as Espresso [4].
Though designed to scale well to massive volumes
of data, such data-filtering applications often have to
be co-deployed on machines to achieve the business-required
aggregated throughput. Due to limited computing resources
(e.g., memory), there is a limit on how many data-filtering
instances can co-exist on a single node. For most
of LinkedIn's data filtering applications, which are typically
written in Java, the top resource bottleneck is memory, partly
due to the practice of pre-configuring the heap size as required
by Java.
To maximize the number of co-located instances, and also
for capacity planning purposes, a memory capacity model
is needed to properly configure the Java heap size. Without
such a model, engineers responsible for deploying
applications would have to rely on experience or ad-hoc
values, which are not only unreliable but also error-prone.
On one hand, if the configured Java heap size is too small,
the application may end up crashing with an OOM (out-of-memory)
error. On the other hand, if the configured
Java heap size is too big, it wastes memory,
resulting in under-allocation for the co-located
applications.
In this work, we propose a memory capacity model to
deterministically calculate the needed heap size for Samza-
based applications without sacrificing the necessary data-quality
accuracy. Based on the model, an appropriate heap size
can be specified for each application. Our experimental results
suggest that the model accurately predicts the needed heap
size and maximizes the number of applications that
can be co-located on a machine.
The remainder of this paper is organized as follows. In
Section II, we present the Samza framework, upon which the
memory capacity model is based. In Section III, we describe
the memory capacity model we propose. In Section IV, we
evaluate the performance of the model. In Section V, we
present related work. Finally, we conclude the paper in
Section VI.
II. USING THE SAMZA FRAMEWORK FOR DATA
FILTERING
A. Samza Framework
Stream processing is continuous, high-throughput data
processing applied to massive data sets under near-real-
time SLA guarantees. Apache Samza is a lightweight stream
processing framework originally created at LinkedIn for
continuous data processing. Samza has a set of
desirable features such as a simple API, fault tolerance, high
durability, and managed state.
Samza uses a single thread internally to handle reading
and writing messages, flushing metrics, checkpointing, windowing,
and so on. Samza can scale by creating multiple
instances if a single instance is not enough to sustain the
traffic. As shown in Figure 1, the Samza container creates an
associated task to process the messages of each input stream
topic partition. The Samza container chooses a message from
the input stream partitions in a round-robin way and passes the
message to the related task. If the Samza container uses
stateful management, each task creates one partition in the
store to buffer messages. If the changelog option is enabled,
each task sends the changelog stream to Kafka for durability
purposes. Each task can emit messages to output stream
topics, depending on the use case [5].
Figure 1: Samza Framework
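As a rough illustration of this dispatch model, the following standalone Java sketch (our own simplification, not Samza's actual implementation; all class and variable names are hypothetical) round-robins messages across input partitions and hands each message to the task that owns that partition, each task holding its own store:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of one-task-per-partition dispatch (not Samza's code).
public class ContainerSketch {
    // Each task buffers its partition's state in its own in-memory store.
    static class Task {
        final TreeMap<String, Integer> store = new TreeMap<>();
        void process(String key) {
            store.merge(key, 1, Integer::sum); // count occurrences of the key
        }
    }

    public static void main(String[] args) {
        int partitions = 4;
        List<Task> tasks = new ArrayList<>();
        for (int p = 0; p < partitions; p++) tasks.add(new Task());

        // Round-robin: message i comes from partition i % partitions and is
        // passed to the task associated with that partition.
        String[] messages = {"a", "b", "a", "c", "a", "b", "d", "a"};
        for (int i = 0; i < messages.length; i++) {
            tasks.get(i % partitions).process(messages[i]);
        }
        // Each task only ever sees its own partition's messages.
        System.out.println(tasks.get(0).store);
    }
}
```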
B. Samza-based Data Filtering
The Samza framework is heavily used at LinkedIn for application
and system monitoring, data filtering, user-behavior
tracking, etc. We have two different usage scenarios for data
filtering with Samza. In the first scenario, bad-quality
data are filtered (i.e., removed) based on rules and
other data streams. For instance, if some streamed data items
do not contain valid values (e.g., wireless network bandwidth
measurement results for a particular user), they are simply
discarded under the guidance of certain rules. The process is
shown in Figure 2.
Figure 2: Data Filtering By Rules
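To make the rule-based scenario concrete, the self-contained Java sketch below (hypothetical: the Measurement record and the validity rule are our own illustrations, not LinkedIn's production rules) drops stream items whose bandwidth values are invalid:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative rule-based filter: discard measurements with invalid values.
public class RuleFilterSketch {
    record Measurement(String userId, double bandwidthMbps) {}

    // Example rule: keep only measurements with an id and a plausible value.
    static boolean isValid(Measurement m) {
        return m.userId() != null && !m.userId().isEmpty()
                && m.bandwidthMbps() > 0 && m.bandwidthMbps() < 10_000;
    }

    static List<Measurement> filter(List<Measurement> stream) {
        return stream.stream()
                .filter(RuleFilterSketch::isValid)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Measurement> in = List.of(
                new Measurement("u1", 42.0),
                new Measurement("u2", -1.0), // invalid: negative bandwidth
                new Measurement("", 10.0));  // invalid: missing user id
        System.out.println(filter(in).size());
    }
}
```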
The other data filtering scenario enhances low-quality
data streams by aggregating them with other data streams.
For instance, a data stream may only have user click information
in tuples of (cookie id, actions), which makes it
low-quality data. Cookie ids need to be replaced by the
corresponding user IDs before these activities can be persisted
to storage as high-quality data for easier retrieval later. For
this use case, another data stream is utilized to
enhance the data quality. The process is shown in Figure 3.
Figure 3: Data Filtering By Joining Streams
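This second scenario can be sketched as a stream-table join. The following self-contained Java example (hypothetical names; in a real deployment the cookie-to-user mapping stream would be buffered in Samza's local key-value store) replaces cookie ids with user ids and drops tuples that cannot be resolved:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative join-based enrichment, as in Figure 3 (not LinkedIn's code).
public class JoinFilterSketch {
    // Replace cookie ids with user ids via the lookup stream's local store;
    // tuples whose cookie cannot be resolved are dropped.
    static List<String[]> enrich(Map<String, String> cookieToUser,
                                 String[][] clicks) {
        List<String[]> enriched = new ArrayList<>();
        for (String[] click : clicks) {
            String userId = cookieToUser.get(click[0]);
            if (userId != null) enriched.add(new String[]{userId, click[1]});
        }
        return enriched;
    }

    public static void main(String[] args) {
        // Local store built from the second (cookie-to-user) stream.
        Map<String, String> cookieToUser = new HashMap<>();
        cookieToUser.put("c1", "user-100");
        cookieToUser.put("c2", "user-200");
        // Low-quality stream: (cookieId, action) tuples; c9 is unknown.
        String[][] clicks = {{"c1", "view"}, {"c2", "click"}, {"c9", "view"}};
        System.out.println(enrich(cookieToUser, clicks).size());
    }
}
```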
III. MEMORY CAPACITY MODEL
A. Overview of the Model
Samza has been used for stateful stream processing to
do data filtering. Data from an upstream system can be
stored through a Samza task on the same machine where the
Samza job is running. Samza's local state allows a Samza job
to access data with low latency and high performance. Samza
provides two types of key-value stores: an in-memory option
and RocksDB [6]. The in-memory store is an internally
created Java TreeMap. RocksDB is a persistent key-value
store designed and maintained by Facebook for flash storage.
In this paper we propose a Samza memory capacity model for
the in-memory store option.
The memory capacity model for the Samza in-memory key-
value store calculates the live data object size
based on the workload characteristics and the data accuracy
requirement. It considers the internal working mechanism of
Samza, as well as the typical memory footprint of each type
of variable used by Samza's internal data structures.
B. In-memory Capacity Model
For ease of presentation, we use the following notation:
• T: Number of input topics in Samza. It depends on how
many input topics the Samza container consumes.
The value can be configured through task.inputs in the
configuration file.
• P: Number of partitions per topic, which depends on the
input topic configuration in the Kafka cluster. Users can
query this information through the ZooKeeper client.
• E: Number of unique entries per partition, which depends
on: 1) the number of unique key ids for the input topic, and 2)
the input topic retention period. For example, assume an
input topic has 10 million unique key ids that takes
the container 5 hours to fully consume, and the Samza
container only maintains 3 hours of the input topic's
messages before doing a cleanup. In this case, the number
of unique entries per partition will be 10 million * 3/5
= 6 million.
• B: Bytes per TreeMap entry, which is about 40 bytes
based on our test.
• Bk: Bytes of the serialized key, which depends on the
key object and the serialization implementation. For String
objects, it is roughly 20 bytes per string in our test.
• Bm: Bytes of the serialized value message, which depends
on the value object of the store and the serialization
implementation. For Integer objects, it is roughly 4 bytes
per Integer according to the source code and our test.
With the above notation, using L to denote the total live
data size required by a particular application, we have:
L = TPE(B + Bk + Bm)
To make sure the job runs smoothly without frequent full-GC
impact, we suggest that the heap space of the job be at
least twice the live data set size, and that it be deployed
with a low-latency GC algorithm like G1 [7]. Essentially,
G1 is a compaction-based garbage collector; hence, to
accommodate the effect of moving data around, denoting the
heap size as H, H = 2L gives a reasonable heap
size for the necessary data accuracy and performance
requirements.
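The model is a direct multiplication, which the Java sketch below transcribes. The constants in the usage example are the measured values from our experiments (B = 40, Bk = 24, Bm = 24 bytes); they are workload-dependent inputs, not universal figures:

```java
// Transcription of the capacity model: L = T * P * E * (B + Bk + Bm), H = 2L.
public class CapacityModel {
    // Live data size in bytes for T topics, P partitions per topic,
    // and E unique entries per partition.
    static long liveDataBytes(long topics, long partitionsPerTopic,
                              long entriesPerPartition,
                              long bytesPerTreeMapEntry,
                              long keyBytes, long valueBytes) {
        return topics * partitionsPerTopic * entriesPerPartition
                * (bytesPerTreeMapEntry + keyBytes + valueBytes);
    }

    public static void main(String[] args) {
        // Section IV case: T=1, P=8, E=5 million, B=40, Bk=24, Bm=24.
        long live = liveDataBytes(1, 8, 5_000_000L, 40, 24, 24);
        long heap = 2 * live; // H = 2L: headroom for G1's copying/compaction
        System.out.printf("L = %.2f GB, H = %.2f GB%n", live / 1e9, heap / 1e9);
    }
}
```

Running this reproduces the paper's figures: a live data set of about 3.52 GB, hence a heap of about 7 GB.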
C. Determining Variable Values of the Model
We deduce the values of the variables using experiments.
Our upstream input topic has 8 partitions, each of which has
5 million unique key ids. The Samza container by default creates
8 tasks, each of which creates one separate in-memory
store (TreeMap) to handle one separate partition.
Once the Samza job reaches a stable state in which each
memory store has 5 million entries, we perform a live-object
heap dump (listing only the live objects) with the jmap command
provided by the standard JDK. The command for the live
object dump is "jmap -histo:live $pid". The live object dump
shows how many object references are live by triggering a
full GC.
Figure 4: Heap Dump
Figure 4 shows all the live objects listed with number
of instances, total bytes, and corresponding class name.
The two main types of live objects are byte arrays ([B)
and TreeMap entries (java.util.TreeMap$Entry). Samza implements
the in-memory store option with a TreeMap, which
keeps all entries sorted. The interesting questions are:
1) why do we have about 40 million TreeMap entries with
about 1.6 GB total size; and 2) why do we have about 80 million
byte arrays with 1.8 GB total size?
To answer the ļ¬rst question, we have 8 tasks each of
which has one in memory tree map store handling 5 million
entries. So the total numbers of entries is 8*5million =
40million. From the above micro-benchmark test, we know
one Treemap entry takes about 40 bytes. Thus the total object
size is 40 * 40million which equals to around 1.6 GB.
The second question is a little trickier to explain. Samza
does a lot of serialization and deserialization in the framework.
A couple of typical scenarios: 1) when the container
receives Kafka messages, the messages are deserialized;
2) when the container stores a key/message into a
store (in-memory, RocksDB), both the key and the message
are serialized and inserted into the store; and 3) when
the container sends a message to a downstream
Kafka topic, it serializes the key/message for the output
message.
Different object types use different serdes, which the
user defines through configuration options in the Samza configuration.
Here, the key of the in-memory store is a String,
and the value is an Integer. Both the key and the value are
serialized into byte arrays with different serializers (String
serde and Integer serde). We have around 40 million entries,
which yield 40 million keys and 40 million
values; thus the number of byte-array instances is 80
million. A String object is serialized into 24 bytes, and
an Integer object is also serialized into 24 bytes due to header
and alignment, according to the serializer implementation.
Thus the total object size for byte arrays is 24 * 80 million,
which equals around 1.9 GB, and the total live data set
size is around 3.5 GB in this case. The Samza job would
encounter data loss and fail to maintain the necessary
data quality if the heap size were too small. Hence, based on the
model, a reasonable heap size is around 7.0 GB to maintain
both data quality and performance.
Samza supports a feature called auto-scaling, which provides
a profiler that actively checks memory usage, input
stream rate, etc. The Samza framework reads the profiler output
and redeploys the Samza job if a certain SLA is violated. As
future work, we will incorporate our memory capacity model
into the profiler, which will first profile the input stream event
rate and then redeploy the Samza job according to the memory
capacity model.
IV. EVALUATION
To evaluate the performance of our model, we use an
input data stream that represents LinkedIn production traffic
patterns to drive our Samza workload.
A. Evaluation Methodology
We ļ¬rstly deduct the heap size needed based on our
capacity model, say it H. For comparison, we consider two
comparable cases of doubling and tripling the deducted heap
sizes, i.e., 2H and 3H.
We measure the major performance metrics being the
job throughput. We use process-envelopes of Samza as
the key metric to measure Samza job throughput perfor-
mance. Process-envelopes metric is used to measure Samza
message-processing rate.
In addition to the main performance metric of job throughput,
to understand the impact on other resources such as
CPU, we also measure CPU usage and Java GC activities
such as GC counts and total GC time. Both young-GC and old-GC
activities are measured.
B. Workload and Testbed
For the workload development, we follow the common
practices of the Samza community. When writing
a stream processor for Samza, we need to implement the
StreamTask interface specified by the Samza framework.
The workload we consider is a straightforward one, for the
sake of being stable and repeatable. In the workload, the
Samza stream processor listens to messages from an upstream
Kafka topic, extracts from the message value the key that
identifies the message, checks whether the key has been
encountered before in the in-memory store,
increments the key's count, and stores the result back to the
in-memory store.
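The per-message logic above can be emulated without the Samza runtime, as in the framework-free sketch below (the real processor implements Samza's StreamTask interface; the "key:payload" message layout and all names here are our own assumptions):

```java
import java.util.TreeMap;

// Framework-free emulation of the benchmark workload's per-message step.
public class KeyCountWorkload {
    // Samza's in-memory store option is backed by a sorted TreeMap.
    private final TreeMap<String, Integer> store = new TreeMap<>();

    // Extract the key, look up any prior count, increment, write back.
    void process(String messageValue) {
        String key = messageValue.split(":", 2)[0]; // assumed "key:payload" layout
        Integer seen = store.get(key);              // encountered before?
        store.put(key, seen == null ? 1 : seen + 1);
    }

    int countFor(String key) {
        return store.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        KeyCountWorkload task = new KeyCountWorkload();
        for (String msg : new String[]{"k1:a", "k2:b", "k1:c"}) task.process(msg);
        System.out.println(task.countFor("k1")); // prints 2
    }
}
```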
At LinkedIn, we have clusters serving as Kafka brokers in
production, listening to various messages for different topics.
We used a dedicated box to run the Samza stream processor. Its
hardware characteristics are as follows: Intel Xeon 2.67 GHz
processor with 24 cores, 48 GB RAM, Gbps Ethernet, and
1.65 TB SSD. The Kafka topic that our Samza job listens
to has about 5 million records with unique keys and 8
partitions.
Our Samza stream processor runs with the G1 garbage
collection algorithm. The major JVM options used are
"-XX:+UseG1GC -XX:G1HeapRegionSize=4M". We
run the Samza job for half an hour to make sure the results
are stable and consistent. We fix the heap sizes to those deduced
by our capacity model and the two comparable cases.
Figure 5: Throughput
                          1H      2H     3H
Young GC   Count          88      29     32
           Total time (ms)  9850    5063   6144
Old GC     Count          24       0      0
           Total time (ms) 70166       0      0
Total      Count         112      29     31
           Total time (ms) 80117    5063   6144
Table I: Java GC Pauses
Figure 6: CPU Utilization
C. Heap sizes deduced for the 3 cases
Using our model, we conclude that our live data set size
is about 3.5 GB: L (3.5 GB) equals T(1) * P(8) *
E(5 million) * (B(40) + Bk(24) + Bm(24)).
Based on the above, we derive that our Samza job's heap size
is around 7 GB. For the two comparable cases, the heap sizes
are 14 GB and 21 GB, respectively.
D. Performance results
Job throughput. The job throughput results are shown
in Figure 5. From the throughput figure, we find that
there is not much difference in average throughput (in
thousands of processed messages per second) among the three
cases; the difference is only 5%. This means that the
JVM heap size derived from our memory capacity model
can sustain the same message rate as the 2H and 3H
heap sizes while bringing a memory-saving benefit to the
system.
CPU usage. The CPU usage results are shown in Figure
6. The first case (i.e., based on our capacity model) sees up
to 50% higher CPU usage than the other two cases.
GC activities. Table I lists statistics on Java GC
pauses in three categories (young GC, old GC, and total
GC) for the three heap-size cases, captured
with Java Mission Control. We find that 1H gives the worst
GC performance in terms of both total GC pause time
and GC counts.
We observe significantly higher GC activity in the first
case. Specifically, the total GC count is about 3X higher, and
the total GC pause time about 16X higher.
E. Comparing the three cases
The job throughput results suggest that the heap size based
on our memory capacity model does not significantly degrade
job-level performance, even though GC activity and CPU
usage are much higher. In fact, the latter two are expected
and directly caused by our memory capacity model. There
is clearly a tradeoff between the heap size and JVM-level
as well as OS-level activity.
In Java, if the heap size is smaller, more GC events will
be observed because the heap space fills
and triggers GC sooner. Because Java objects need to be
scanned and compacted during the GC process, CPU usage,
not surprisingly, also increases. The increase in CPU usage
might affect Samza job performance in a multi-instance
environment: if we exhaust CPU resources before memory
resources, the performance
of 1H will be impacted.
Since the major benefit of our memory capacity model is
to co-locate applications more densely on the same node through
accurate memory estimation without job-level performance
degradation, we are happy with the fact that job-level
throughputs are on par while using 2X or 3X less memory.
In other words, our memory capacity model can allow
2X or 3X denser deployments, which saves business cost at
the same scale.
F. Deep examination of GC
Though we observe a much higher count of GC activity
in the 1H case, a major attribute in determining whether GC
is effective for the workload is the number of full
garbage collections. A full GC is a stop-the-world
event in which no application thread can
proceed until the whole heap is scanned and garbage objects
are released. This impacts applications the most, especially
low-latency applications like Samza processors, which
cannot tolerate it. In our experiments, none of the three scenarios (1H,
2H, 3H) encounters a full GC. The old GC shown in the
table is a mixed GC in G1, whose collection sets include
both young and old regions. This differs from a full
GC, since G1 is a region-based garbage collection algorithm that
only collects a subset of the regions in a mixed GC.
We expect to encounter mixed GC in the 1H case, as the live
data size exceeds half the total heap size. But some phases
of G1 belong to concurrent marking cycles, which are not stop-
the-world events. In this case, the application continues running
while the G1 concurrent phases happen, but CPU utilization
increases. That is the main reason the per-JVM
CPU utilization in Figure 6, captured with the top command,
increases compared to the 2H and 3H cases.
G. Summary
Overall, the throughput of our production Samza job in the 1H
case is on par with the performance of the 2H and 3H cases, despite
increases in both GC counts and application CPU utilization. And
we get great memory savings in the 1H case, which brings
the benefit of co-locating multiple instances on the same
machine.
V. RELATED WORK
A. Capacity model and capacity planning
There are many works in the general area of capacity
planning. The work in [8] presents a four-phased model
for computer capacity planning (CCP); the model can be
used to develop a formal capacity planning function. The work
in [9] considers a practical capacity-planning problem for
database replication in a large-scale website, and presents
a model to forecast future traffic and determine the required
software capacity.
B. Streaming systems
There are many systems designed for stream processing
available in industry, each of which has different
attributes to serve stream processing needs. For instance,
S4 [10] is a distributed stream processing engine originally
developed at Yahoo. Apache Storm [11] is another popular
stream processing system, which originated at
Twitter. Storm utilizes Apache Mesos [12] to schedule
processing jobs and provides two message delivery semantics
(at-least-once and at-most-once). MillWheel [13] is an in-house
solution developed by Google to serve its stream
processing purposes.
C. Java heap sizing
To size the Java heap appropriately, various works have
tried to strike a balance among various tradeoffs.
The work in [14] analyzed memory usage of the Java heap through
object tracing, based on the observation that an inappropriate
heap size can lead to performance problems caused by
excessive GC overhead or memory paging. It then presents
an automatic heap-sizing algorithm applicable to different
garbage collectors with only minor changes. The paper
[15] presents a Heap Sizing Rule for dealing with different
memory sizes with regard to heap size. Specifically, it
models how severe page faults can be with a
Heap-Aware Page Fault Equation.
D. Data quality
Veracity [16], which refers to the consistency and accuracy
of massive data, has been increasingly recognized as one of the
key attributes of Big Data. Poor data quality [17] impacts
enterprise companies, leading to customer dissatisfaction,
increased cost, bad decision-making, etc. There are many
works on systems to improve data quality. The work
in [18] presents a system based on the conditional functional
dependencies (CFDs) model for improving the quality of
relational data.
VI. CONCLUSION
In this work, we consider the problem of large-scale
and real-time data filtering and propose a memory capacity
model for Samza-based data-filtering applications.
The model can be used to predict the heap usage of an
application, and hence allows denser deployments of
multiple applications on the same machine.
REFERENCES
[1] Apache Kafka: http://kafka.apache.org/
[2] Apache Storm: https://storm.apache.org/
[3] Apache Samza: http://samza.apache.org/
[4] LinkedIn Espresso: http://data.linkedin.com/projects/espresso
[5] Benchmarking Apache Samza: 1.2 million
messages per second on a single host:
http://engineering.linkedin.com/performance/benchmarking-
apache-samza-12-million-messages-second-single-node
[6] RocksDB: http://rocksdb.org/
[7] G1 GC: http://www.oracle.com/technetwork/java/javase/tech/g1-
intro-jsp-135488.html
[8] Carper, L., et al., Computer capacity planning: strategy and
methodologies, SIGMIS Database, volume 14, 1983
[9] Zhuang, Z., et al., Capacity Planning and Headroom Analysis
for Taming Database Replication Latency (Experiences with
LinkedIn Internet Traffic), Proceedings of the 6th ACM/SPEC
International Conference on Performance Engineering, 2015,
Austin, TX, USA
[10] Neumeyer, L., et al., S4: Distributed Stream Computing
Platform,ā€ in Proceedings of the 2010 IEEE International
Conference on Data Mining Workshops (ICDMW), Sydney,
Australia, Dec. 2010.
[11] Toshniwal, A., et al., Storm@twitter, in Proceedings of the
2014 ACM SIGMOD International Conference on Management
of Data (SIGMOD '14), Snowbird, UT, USA
[12] Apache Mesos: http://mesos.apache.org/
[13] Akidau, T., et al., MillWheel: fault-tolerant stream processing
at internet scale, in Proc. of VLDB Endow., August 2013
[14] Lengauer, P., et al., Accurate and Efficient Object Tracing
for Java Applications, Proceedings of the 6th ACM/SPEC
International Conference on Performance Engineering (ICPE
2015), Austin, TX, USA
[15] Tay, Y., et al., An equation-based Heap Sizing Rule, Journal
of Perform. Eval., vol 70, Nov. 2013
[16] Saha, B., et al., Data quality: The other face of Big Data,
in Data Engineering (ICDE), 2014 IEEE 30th International
Conference on , vol., no., pp.1294-1297, March 31 2014-
April 4 2014
[17] Redman T., The impact of poor data quality on the typical
enterprise, Commun. ACM 1998
[18] Fan, W., et al., Semandaq: a data quality system based
on conditional functional dependencies, PVLDB 2008: 1460-
1463
Ā 
Scalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availabilityScalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availabilityPapitha Velumani
Ā 
Scalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availabilityScalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availabilityPapitha Velumani
Ā 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesSigmoid
Ā 
Oracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagridOracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagridEmiliano Pecis
Ā 
Charm a cost efficient multi cloud data hosting scheme with high availability
Charm a cost efficient multi cloud data hosting scheme with high availabilityCharm a cost efficient multi cloud data hosting scheme with high availability
Charm a cost efficient multi cloud data hosting scheme with high availabilityKamal Spring
Ā 
Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...
Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...
Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...IBM India Smarter Computing
Ā 
Efficient and scalable multitenant placement approach for in memory database ...
Efficient and scalable multitenant placement approach for in memory database ...Efficient and scalable multitenant placement approach for in memory database ...
Efficient and scalable multitenant placement approach for in memory database ...CSITiaesprime
Ā 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computingwww.pixelsolutionbd.com
Ā 

Similar to A memory capacity model for high performing data-filtering applications in Samza framework (20)

Effective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza FrameworkEffective Multi-stream Joining in Apache Samza Framework
Effective Multi-stream Joining in Apache Samza Framework
Ā 
Dynamo Amazonā€™s Highly Available Key-value Store Giuseppe D.docx
Dynamo Amazonā€™s Highly Available Key-value Store Giuseppe D.docxDynamo Amazonā€™s Highly Available Key-value Store Giuseppe D.docx
Dynamo Amazonā€™s Highly Available Key-value Store Giuseppe D.docx
Ā 
Charm a cost efficient multi cloud data hosting scheme with high availability
Charm a cost efficient multi cloud data hosting scheme with high availabilityCharm a cost efficient multi cloud data hosting scheme with high availability
Charm a cost efficient multi cloud data hosting scheme with high availability
Ā 
Steps to Modernize Your Data Ecosystem | Mindtree
Steps to Modernize Your Data Ecosystem | Mindtree									Steps to Modernize Your Data Ecosystem | Mindtree
Steps to Modernize Your Data Ecosystem | Mindtree
Ā 
Six Steps to Modernize Your Data Ecosystem - Mindtree
Six Steps to Modernize Your Data Ecosystem  - MindtreeSix Steps to Modernize Your Data Ecosystem  - Mindtree
Six Steps to Modernize Your Data Ecosystem - Mindtree
Ā 
6 Steps to Modernize Data Ecosystem with Mindtree
6 Steps to Modernize Data Ecosystem with Mindtree6 Steps to Modernize Data Ecosystem with Mindtree
6 Steps to Modernize Data Ecosystem with Mindtree
Ā 
Steps to Modernize Your Data Ecosystem with Mindtree Blog
Steps to Modernize Your Data Ecosystem with Mindtree Blog Steps to Modernize Your Data Ecosystem with Mindtree Blog
Steps to Modernize Your Data Ecosystem with Mindtree Blog
Ā 
Ethopian Database Management system as a Cloud Service: Limitations and advan...
Ethopian Database Management system as a Cloud Service: Limitations and advan...Ethopian Database Management system as a Cloud Service: Limitations and advan...
Ethopian Database Management system as a Cloud Service: Limitations and advan...
Ā 
amazon-dynamo-sosp2007
amazon-dynamo-sosp2007amazon-dynamo-sosp2007
amazon-dynamo-sosp2007
Ā 
Amazon dynamo-sosp2007
Amazon dynamo-sosp2007Amazon dynamo-sosp2007
Amazon dynamo-sosp2007
Ā 
Scalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availabilityScalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availability
Ā 
Scalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availabilityScalable analytics for iaas cloud availability
Scalable analytics for iaas cloud availability
Ā 
ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
Ā 
Oracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagridOracle Coherence: in-memory datagrid
Oracle Coherence: in-memory datagrid
Ā 
Charm a cost efficient multi cloud data hosting scheme with high availability
Charm a cost efficient multi cloud data hosting scheme with high availabilityCharm a cost efficient multi cloud data hosting scheme with high availability
Charm a cost efficient multi cloud data hosting scheme with high availability
Ā 
Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...
Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...
Positioning IBM Flex System 16 Gb Fibre Channel Fabric for Storage-Intensive ...
Ā 
Efficient and scalable multitenant placement approach for in memory database ...
Efficient and scalable multitenant placement approach for in memory database ...Efficient and scalable multitenant placement approach for in memory database ...
Efficient and scalable multitenant placement approach for in memory database ...
Ā 
Tombolo
TomboloTombolo
Tombolo
Ā 
2011 keesvan gelder
2011 keesvan gelder2011 keesvan gelder
2011 keesvan gelder
Ā 
Fault tolerance on cloud computing
Fault tolerance on cloud computingFault tolerance on cloud computing
Fault tolerance on cloud computing
Ā 

More from Tao Feng

Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceTao Feng
Ā 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
Ā 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyftTao Feng
Ā 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
Ā 
Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)Tao Feng
Ā 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopTao Feng
Ā 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeTao Feng
Ā 

More from Tao Feng (7)

Airflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conferenceAirflow at lyft for Airflow summit 2020 conference
Airflow at lyft for Airflow summit 2020 conference
Ā 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
Ā 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
Ā 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
Ā 
Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)Odp - On demand profiler (ICPE 2018)
Odp - On demand profiler (ICPE 2018)
Ā 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Ā 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per node
Ā 

Recently uploaded

Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
Ā 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdfCaalaaAbdulkerim
Ā 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
Ā 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectssuserb6619e
Ā 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
Ā 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
Ā 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
Ā 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
Ā 
Gurgaon āœ”ļø9711147426āœØCall In girls Gurgaon Sector 51 escort service
Gurgaon āœ”ļø9711147426āœØCall In girls Gurgaon Sector 51 escort serviceGurgaon āœ”ļø9711147426āœØCall In girls Gurgaon Sector 51 escort service
Gurgaon āœ”ļø9711147426āœØCall In girls Gurgaon Sector 51 escort servicejennyeacort
Ā 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
Ā 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxErbil Polytechnic University
Ā 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
Ā 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfChristianCDAM
Ā 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
Ā 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
Ā 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Coursebim.edu.pl
Ā 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
Ā 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
Ā 

Recently uploaded (20)

young call girls in Green ParkšŸ” 9953056974 šŸ” escort Service
young call girls in Green ParkšŸ” 9953056974 šŸ” escort Serviceyoung call girls in Green ParkšŸ” 9953056974 šŸ” escort Service
young call girls in Green ParkšŸ” 9953056974 šŸ” escort Service
Ā 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
Ā 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
Ā 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
Ā 
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in projectDM Pillar Training Manual.ppt will be useful in deploying TPM in project
DM Pillar Training Manual.ppt will be useful in deploying TPM in project
Ā 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
Ā 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Ā 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
Ā 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
Ā 
Gurgaon āœ”ļø9711147426āœØCall In girls Gurgaon Sector 51 escort service
Gurgaon āœ”ļø9711147426āœØCall In girls Gurgaon Sector 51 escort serviceGurgaon āœ”ļø9711147426āœØCall In girls Gurgaon Sector 51 escort service
Gurgaon āœ”ļø9711147426āœØCall In girls Gurgaon Sector 51 escort service
Ā 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
Ā 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
Ā 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
Ā 
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdfCh10-Global Supply Chain - Cadena de Suministro.pdf
Ch10-Global Supply Chain - Cadena de Suministro.pdf
Ā 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
Ā 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
Ā 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
Ā 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
Ā 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
Ā 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
Ā 

A memory capacity model for high performing data-filtering applications in Samza framework

  • 1. A Memory Capacity Model for High Performing Data-ļ¬ltering Applications in Samza Framework Tao Feng, Zhenyun Zhuang, Yi Pan, Haricharan Ramachandra LinkedIn Corp 2029 Stierlin Court Mountain View, CA 94043, USA {tofeng, zzhuang, yipan, hramachandra}@linkedin.com Abstractā€”Data quality is essential in big data paradigm as poor data can have serious consequences when dealing with large volumes of data. While it is trivial to spot poor data for small-scale and ofļ¬‚ine use cases, it is challenging to detect and ļ¬x data inconsistency in large-scale and online (real-time or near-real time) big data context. An example of such scenario is spotting and ļ¬xing poor data using Apache Samza, a stream processing framework that has been increasingly adopted to process near-real-time data at LinkedIn. To optimize the deployment of Samza processing and reduce business cost, in this work we propose a memory capacity model for Apache Samza to allow denser deployments of high performing data-ļ¬ltering applications built on Samza. The model can be used to provision just-enough memory resource to applications by tightening the bounds on the memory allocations. We apply our memory capacity model on LinkedInā€™s real use cases in production, which signiļ¬cantly increases the deployment density and saves business costs. We will share key learning in this paper. Keywords-Apache Samza; capacity model; data ļ¬ltering; performance; I. INTRODUCTION With the exploding growth of Big Data, the quality of data continues to be a critical factor in most computing environments, be they business related or not. Low quality data give rise to various types of challenges and damages to the concerned applications. Depending on usage scenarios, the problems caused by low-quality data can be as severe as completely distorted data that leads to incorrect results or conclusions. Spotting bad-quality data and ļ¬xing them are much needed for many data processing applications. 
such processing is trivial with small-scale data, spotting, filtering, and fixing bad-quality data in large-scale environments is more challenging due to the ever-mounting size of the data involved.

Such data filtering to ensure data quality can happen in either an offline or an online (i.e., real-time or near-real-time) fashion. Offline processing is comparatively easier, since the applications can do post-processing, while real-time or near-real-time data filtering is more difficult due to the time-sensitivity requirement.

Given the growing size of big data, the increasing deployments of data streaming frameworks such as Apache Kafka [1], and the intrinsic timeliness requirements, many of today's applications need effective data-filtering techniques and high-performing deployments to ensure data quality. To meet these requirements, data-filtering applications built on stream processing frameworks such as Apache Storm [2] and Apache Samza [3] are being increasingly adopted by Internet companies such as LinkedIn. At LinkedIn, these applications are heavily deployed to filter streaming data triggered by the online activities of hundreds of millions of users. For instance, user activities coming from user-facing pages may contain inconsistent data; hence the data need to be filtered to remove the inconsistency before being persisted to backend storage systems such as Espresso [4].

Though designed to scale well to massive data sizes, such data-filtering applications often have to be co-deployed on machines to achieve the business-required aggregate throughput. Due to limited computing resources (e.g., memory), there is a limit on how many data-filtering instances can co-exist on a single node. For most of LinkedIn's data-filtering applications, which are typically written in Java, the top resource bottleneck is memory, partly due to the practice of pre-configuring the heap size as required by Java.
To maximize the number of co-located instances, and also for capacity-planning purposes, a memory capacity model is needed to properly configure the Java heap size. Without such a model, engineers responsible for deploying applications would have to rely on experience or ad-hoc values, which is not only unreliable but also error-prone. On one hand, if the configured Java heap size is too small, the application may end up crashing with an OOM (out of memory) error. On the other hand, if the configured Java heap size is too big, it wastes memory and leaves less room for co-located applications.

In this work, we propose a memory capacity model to deterministically calculate the needed heap size for Samza-based applications without sacrificing the required data-quality accuracy. Based on the model, an appropriate heap size
can be specified for the applications. Our experiment results suggest that the model accurately predicts the needed heap size and allows the maximum number of applications to be co-located on machines.

The remainder of this paper is organized as follows. In Section II, we present the Samza framework, upon which the memory capacity model is based. In Section III, we describe the memory capacity model we propose. In Section IV, we evaluate the performance of the model. In Section V, we present the related works. Finally, we conclude the paper in Section VI.

II. USING THE SAMZA FRAMEWORK FOR DATA FILTERING

A. Samza Framework

Stream processing is ongoing, high-throughput data processing applied to massive data sets with near-real-time SLA guarantees. Apache Samza is a lightweight stream processing framework originally created at LinkedIn for continuous data processing. Samza has a set of desirable features such as a simple API, fault tolerance, high durability, managed state, etc.

Samza uses a single thread internally to handle reading and writing messages, flushing metrics, checkpointing, windowing, and so on. Samza can scale by creating multiple instances if a single instance is not enough to sustain the traffic. As shown in Figure 1, the Samza container creates an associated task to process the messages of each input stream topic partition. The Samza container chooses a message from the input stream partitions in a round-robin way and passes the message to the related task. If the Samza container uses stateful management, each task creates one partition in the store to buffer messages. If the changelog option is enabled, each task sends the changelog stream to Kafka for durability purposes. Each task can emit messages to output stream topics, depending on the use case [5].

Figure 1: Samza Framework

B.
Samza-based Data Filtering

The Samza framework is heavily used at LinkedIn for application and system monitoring, data filtering, tracking user behavior, etc. We have two different usage scenarios for data filtering with Samza.

In the first usage scenario, bad-quality data are filtered (i.e., removed) based on rules and other data streams. For instance, if some streamed data items do not contain valid values (e.g., wireless network bandwidth measurement results for a particular user), they are simply discarded under the guidance of certain rules. The process is shown in Figure 2.

Figure 2: Data Filtering By Rules

The other data-filtering scenario is to enhance low-quality data streams by aggregating them with other data streams. For instance, a data stream may only carry user click information in tuples of (cookie id, actions), which by itself is low-quality data. Cookie ids need to be replaced by the corresponding userID before these activities can be persisted to storage as high-quality data for easier retrieval later. For this use case, another data stream needs to be utilized to enhance the data quality. The process is shown in Figure 3.

Figure 3: Data Filtering By Joining Streams
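The two scenarios can be sketched as follows. This is a minimal illustration, not Samza's API: the record shapes, field names, and validity rule are hypothetical, and plain Java collections stand in for the streams and the materialized lookup state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FilteringExamples {

    // Scenario 1 (Figure 2): rule-based filtering -- discard records whose
    // measurement value is missing or outside a plausible range.
    static List<Double> filterByRule(List<Double> bandwidthMbps) {
        List<Double> clean = new ArrayList<>();
        for (Double mbps : bandwidthMbps) {
            if (mbps != null && mbps > 0 && mbps < 10_000) { // hypothetical validity rule
                clean.add(mbps);
            }
        }
        return clean;
    }

    // Scenario 2 (Figure 3): enhancement by joining -- replace a cookie id
    // with the corresponding userID, using a second (cookie -> user) stream
    // that has been materialized into a local lookup table.
    static List<String> joinWithUserIds(List<String[]> clickEvents, // {cookieId, action}
                                        Map<String, String> cookieToUser) {
        List<String> enriched = new ArrayList<>();
        for (String[] click : clickEvents) {
            String userId = cookieToUser.get(click[0]);
            if (userId != null) {                // drop clicks we cannot resolve
                enriched.add(userId + ":" + click[1]);
            }
        }
        return enriched;
    }
}
```

In a real deployment both the input records and the cookie-to-user mapping arrive as Kafka streams, and the lookup table lives in the Samza task's local state rather than an in-process map.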
III. MEMORY CAPACITY MODEL

A. Overview of the Model

Samza is used for stateful stream processing to do data filtering. The data from an upstream system can be stored through the Samza task on the same machine where the Samza job is running. Samza's local state allows a Samza job to access data with low latency and high performance. Samza provides two types of key-value stores: an in-memory option and RocksDB [6]. The in-memory store is an internally created Java TreeMap. RocksDB is a persistent key-value store designed and maintained by Facebook for flash storage. In this paper we propose a Samza memory capacity model for the in-memory store option.

The memory capacity model for the Samza in-memory key-value store works by calculating the live data object size based on the workload characteristics and the data-accuracy requirement. It considers the internal working mechanism of Samza, as well as the typical memory footprint of each type of variable used by Samza's internal data structures.

B. In-memory Capacity Model

For easy presentation, we use the following notations:

• T: Number of input topics in Samza. It depends on how many input topics the Samza container consumes, and can be configured through task.inputs in the configuration file.
• P: Number of partitions per topic, which depends on the input topic configuration in the Kafka cluster. Users can query this information through a ZooKeeper client.
• E: Number of unique entries per partition, which depends on: 1) the number of unique key ids in the input topic, and 2) the input topic retention period. For example, assume an input topic has 10 million unique key ids and takes the container 5 hours to fully consume, while the Samza container maintains only 3 hours of input topic messages before cleaning up. In this case, the number of unique entries per partition is 10 million * 3 / 5 = 6 million.
• B: Bytes per TreeMap entry, which is about 40 bytes based on our test.
ā€¢ Bk: bytes of key serialization which depends on the key object and serialization implementation. For string object, it is roughly 20 bytes per string from our test. ā€¢ Bm: bytes of value message serialization, which de- pends on the value object of the store and serialization implementation. For Integer object, it is roughly 4 bytes per Integer according to source code and our test. With the above notations, using L to denote the total live data size required by a particular application, then we have: L = TPE(B + Bk + Bm) To make sure the job runs smoothly without frequent full GC impact, we suggest that the heap space of our job is at least twice as live data set size. And it is suggested to deploy with low latency GC algorithm like G1 [7]. Essentially, G1 is a compaction-based garbage collection, hence to accommodate the effect of moving-data-around, denote the heap size as H, then H=2L would give a reasonable heap size for the necessity requirement of data accuracy and performance. C. Determing Variable Values of the Model We deduct the values of the variables using experiments. Our upstream input topic has 8 partitions ,each of which has 5 million unique key ids. Samza container by default creates 8 tasks each of which creates one separate part in memory store(treemap) to handle one separate partition. Once the Samza job reaches a stable state which each memory store has 5 million entries, we perform a live object heap dump(only list all the live object) with jmap command which is provided by standard JDK. The command for live object dump is ā€jmap -histo:live $pidā€. Live object dump shows how many object references are live by triggering a full GC. Figure 4: Heap Dump Figure 4 shows all the live objects listed with numbers of instances, total bytes and corresponding class name. The main two types of live objects are: byte buffer([B) and TreeMap entries(java.util.TreeMap$Entry). 
Samza implements the in-memory store option with a TreeMap, which keeps all the entries sorted. The interesting questions are: 1) why are there about 40 million TreeMap entries with about 1.6 GB total size; and 2) why are there about 80 million byte buffers with about 1.8 GB total size?

To answer the first question: we have 8 tasks, each of which has one in-memory TreeMap store handling 5 million entries, so the total number of entries is 8 * 5 million = 40 million. From the micro-benchmark test above, we know one TreeMap entry takes about 40 bytes. Thus the total object size is 40 * 40 million, which equals around 1.6 GB.

The second question is a little trickier to explain. Samza performs a lot of serialization and deserialization in the framework. A few typical scenarios: 1) when the container receives Kafka messages, the messages are deserialized; 2) when the container stores a key/message into a store (in-memory or RocksDB), both the key and the message are serialized and inserted into the store; 3) when the container sends a message to a downstream
Kafka topic, it serializes the key/message of the output message. Different object types use different serializers and deserializers, which the user defines through the configuration options in the Samza configuration table. In our job, the key of the in-memory store is of String type and the value is of Integer type; both are serialized into byte arrays with different serializers (String serde and Integer serde). We have around 40 million entries, which yield 40 million keys and 40 million values, so the number of byte-array instances is 80 million. A String object is serialized into 24 bytes, and an Integer object also occupies 24 bytes due to the array header and alignment, according to the serializer implementation. Thus the total size of the byte arrays is 24 bytes × 80 million, which is around 1.9 GB, and the total live data set size in this case is around 3.5 GB. The Samza job would encounter data loss and fail to maintain the required data quality if the heap size were too small. Hence, based on the model, a reasonable heap size is around 7.0 GB to maintain both data quality and performance.

Samza supports a feature called auto-scaling, which provides a profiler that actively checks memory usage, input stream rate, etc. The Samza framework reads the profiler output and redeploys the Samza job if a certain SLA is violated. As future work, we will incorporate our memory capacity model into this profiler, which will first profile the input stream event rate and then redeploy the Samza job according to the memory capacity model.

IV. EVALUATION

To evaluate the performance of our model, we use an input data stream that represents LinkedIn production traffic patterns to drive our Samza workload.

A. Evaluation Methodology

We first deduce the heap size needed based on our capacity model, call it H. For comparison, we consider two comparable cases that double and triple the deduced heap size, i.e., 2H and 3H.
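For reference, the 24-byte figure in the serialization accounting above follows from per-object overhead on the backing byte array rather than from the payload alone. A minimal sketch of that accounting, assuming a 64-bit JVM with compressed oops (16-byte array header, 8-byte alignment) and a helper name of our own:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerializedFootprint {

    /** In-heap size of a byte[]: header plus payload, padded to 8 bytes. */
    static long inHeapBytes(byte[] payload) {
        long raw = 16 + payload.length;   // 16-byte array header (assumed)
        return (raw + 7) / 8 * 8;         // align to an 8-byte boundary
    }

    public static void main(String[] args) {
        // An Integer serializes to a 4-byte payload...
        byte[] value = ByteBuffer.allocate(4).putInt(42).array();
        // ...but occupies 24 bytes on the heap: 16 + 4, padded to 24.
        System.out.println(inHeapBytes(value));   // 24

        // A short String key (here 8 UTF-8 bytes) also lands on 24 bytes.
        byte[] key = "12345678".getBytes(StandardCharsets.UTF_8);
        System.out.println(inHeapBytes(key));     // 24
    }
}
```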
We measure the major performance metric, job throughput, using Samza's process-envelopes metric, which measures the rate at which a Samza job processes messages. In addition to the main performance metric of job throughput, to understand the impact on other resources such as CPU, we also measure CPU usage and Java GC activities such as counts and total GC time; both young-GC and old-GC activities are measured.

B. Workload and Testbed

For the workload development, we follow the common practices of the Samza community. When writing a stream processor for Samza, we need to implement the StreamTask interface specified by the Samza framework. The workload we consider is deliberately straightforward so that it is stable and repeatable. In the workload, the Samza stream processor listens to messages from an upstream Kafka topic, extracts the key, which identifies the message, from the message value, checks whether this key has been encountered before in the in-memory store, performs a key counting, and stores the result back into the in-memory store.

At LinkedIn, we have production clusters of Kafka brokers that listen to various messages for different topics. We used a dedicated box to run the Samza stream processor, with the following hardware characteristics: an Intel Xeon 2.67 GHz processor with 24 cores, 48 GB of RAM, Gbps Ethernet, and a 1.65 TB SSD. The Kafka topic that our Samza job listens to has about 5 million records with unique keys and 8 partitions. Our Samza stream processor runs with the G1 garbage collection algorithm; the major JVM options used are "-XX:+UseG1GC -XX:G1HeapRegionSize=4M". We run the Samza job for half an hour to make sure the results are stable and consistent, fixing the heap sizes to the value deduced by our capacity model and to those of the comparable cases.
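Stripped of the Samza runtime, the per-message logic described above can be sketched with a plain TreeMap standing in for the in-memory store (class and method names here are ours, not Samza's StreamTask API):

```java
import java.util.Map;
import java.util.TreeMap;

public class KeyCountingProcessor {

    // Stand-in for the in-memory store; Samza's in-memory store
    // option is likewise TreeMap-backed, keeping entries sorted.
    private final Map<String, Integer> store = new TreeMap<>();

    /**
     * Mirrors the workload's per-message step: look up any previous
     * count for the key, increment it, and store the result back.
     */
    public void process(String key) {
        Integer seen = store.get(key);                 // encountered before?
        store.put(key, seen == null ? 1 : seen + 1);   // count and store back
    }

    public int countFor(String key) {
        Integer c = store.get(key);
        return c == null ? 0 : c;
    }

    public static void main(String[] args) {
        KeyCountingProcessor p = new KeyCountingProcessor();
        p.process("member-1");
        p.process("member-1");
        p.process("member-2");
        System.out.println(p.countFor("member-1"));   // 2
    }
}
```

In the real job, the key and count are serialized by the configured serdes on every store access, which is the source of the byte-array population seen in the heap dump.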
Figure 5: Throughput

                            1H       2H       3H
Young GC:  Count            88       29       32
           Total time (ms)  9850     5063     6144
Old GC:    Count            24       0        0
           Total time (ms)  70166    0        0
Total:     Count            112      29       31
           Total time (ms)  80117    5063     6144

Table I: Java GC Pauses
Figure 6: CPU Utilization

C. Heap Sizes Deduced for the Three Cases

Using our model, we conclude that our live data set size is about 3.5 GB: L (3.5 GB) = T (1) × P (8) × E (5 million) × (B (40) + Bk (24) + Bm (24)). Based on this, we derive that our Samza job's heap size should be around 7 GB. For the two comparable cases, the heap sizes are 14 GB and 21 GB, respectively.

D. Performance Results

Job throughput. The job throughput results are shown in Figure 5. The figure shows little difference in average throughput, in thousands of processed messages per second, among the three cases; the difference is only 5%. This means that the JVM heap size derived from our memory capacity model sustains the same message rate as the 2H and 3H heap sizes while bringing a memory-saving benefit to the system.

CPU usage. The CPU usage results are shown in Figure 6. The first case (i.e., the one based on our capacity model) sees up to 50% higher CPU usage than the other two cases.

GC activities. Table I lists the statistics on Java GC pauses, in the three categories of young GC, old GC, and total GC, for the three heap size cases, captured with Java Mission Control. We find that 1H gives the worst GC performance in terms of both total GC pause time and GC counts. We observe significantly higher GC activity in the first case: specifically, the total GC count is 3X higher, and the total GC pause time is 16X higher.

E. Comparing the Three Cases

The job throughput results suggest that the heap size based on our memory capacity model does not noticeably degrade job-level performance, even though GC activity and CPU usage are much higher. In fact, the latter two are expected and are directly caused by the tighter heap of our memory capacity model. There is clearly a tradeoff between the heap size and JVM-level as well as OS-level activity.
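We captured the GC statistics in Table I with Java Mission Control; for an in-process view, comparable counters are also exposed through the standard GarbageCollectorMXBean API. A minimal sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {

    /** Sum of collection counts across all collectors (young and old). */
    static long totalCollections() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount()); // -1 means unavailable
        }
        return count;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // With -XX:+UseG1GC the beans report as "G1 Young Generation"
            // and "G1 Old Generation" (the latter includes mixed collections).
            System.out.printf("%s: count=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```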
In Java, if the heap size is smaller, more GC events are observed because the heap fills and triggers GC sooner. And because Java objects need to be scanned and compacted during the GC process, CPU usage unsurprisingly increases as well. This increase in CPU usage could affect Samza job performance in a multi-instance environment: if we exhaust CPU resources before memory resources, the performance of 1H will be impacted. Since the major benefit of our memory capacity model is to co-locate applications more densely on the same node through accurate memory estimation without job-level performance degradation, we are satisfied that job-level throughputs are on par while using 2X or 3X less memory. In other words, our memory capacity model allows 2X or 3X denser deployments, which saves business cost at the same scale.

F. Deep Examination of GC

Though we observe much higher GC counts in the 1H case, one major attribute that determines whether GC is effective for a workload is the number of times full garbage collections happen. A full GC is a stop-the-world event in which no application thread can proceed until the whole heap has been scanned and garbage objects released. This impacts the application the most, and low-latency applications like a Samza processor cannot tolerate it. In our experiments, none of the three scenarios (1H, 2H, 3H) encounters a full GC. The old GC shown in the table is a mixed GC in G1, whose collection set includes both young regions and old regions. This is different from a full GC, as G1 is a region-based garbage collection algorithm that collects only a subset of the regions in a mixed GC. We expect to encounter mixed GC in the 1H case, as the live data size exceeds half of the total heap. But some phases of G1 belong to concurrent marking cycles, which are not stop-the-world events.
In this case the application continues running while G1's concurrent phases happen, but CPU utilization increases. That is the main reason the per-JVM CPU utilization in Figure 6, captured with the top command, increases compared to the 2H and 3H cases.

G. Summary

Overall, the throughput of our production Samza job in the 1H case is on par with the performance of the 2H and 3H cases, despite the increases in both GC counts and application CPU utilization. And we achieve great memory savings in the 1H case, which brings the benefit of co-locating multiple instances on the same machine.
V. RELATED WORK

A. Capacity model and capacity planning

There is much work in the general area of capacity planning. The work in [8] presents a four-phased model for computer capacity planning (CCP); the model can be used to develop a formal capacity planning function. The work in [9] considers the practical problem of capacity planning for database replication at a large-scale website, and presents a model to forecast future traffic and determine the required software capacity.

B. Streaming systems

Many systems designed for stream processing are available in industry, each with different attributes to serve stream-processing needs. For instance, S4 [10] is a distributed stream processing engine originally developed at Yahoo. Apache Storm [11] is another popular stream processing system, originated at Twitter; Storm utilizes Apache Mesos [12] to schedule processing jobs and provides two message-delivery semantics (at least once, at most once). MillWheel [13] is an in-house solution developed by Google to serve its stream processing purposes.

C. Java heap sizing

To size the Java heap appropriately, various works strike a balance among tradeoffs. The work in [14] analyzed memory usage on the Java heap through object tracing, based on the observation that an inappropriate heap size can lead to performance problems caused by excessive GC overhead or memory paging; it then presents an automatic heap-sizing algorithm applicable to different garbage collectors with only minor changes. The paper [15] presents a Heap Sizing Rule for dealing with different memory sizes with regard to heap size; specifically, it models how severe page faults can be with respect to a Heap-Aware Page Fault Equation.

D. Data quality

Veracity [16], which refers to the consistency and accuracy of massive data, has been increasingly recognized as one of the key attributes of Big Data.
Poor data quality [17] impacts enterprise companies, leading to customer dissatisfaction, increased cost, bad decision-making, etc. There are many works on building systems to improve data quality. The work in [18] presents a system based on the conditional functional dependencies (CFDs) model for improving the quality of relational data.

VI. CONCLUSION

In this work, we consider the problem of large-scale, real-time data filtering and propose a memory capacity model for Samza-based data-filtering applications. The model can be used to predict the heap usage of an application, and hence allows for denser deployments of multiple applications on the same machine.

REFERENCES

[1] Apache Kafka: http://kafka.apache.org/
[2] Apache Storm: https://storm.apache.org/
[3] Apache Samza: http://samza.apache.org/
[4] LinkedIn Espresso: http://data.linkedin.com/projects/espresso
[5] Benchmarking Apache Samza: 1.2 million messages per second on a single host: http://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
[6] RocksDB: http://rocksdb.org/
[7] G1 GC: http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html
[8] Carper, L., et al., "Computer capacity planning: strategy and methodologies," SIGMIS Database, vol. 14, 1983.
[9] Zhuang, Z., et al., "Capacity Planning and Headroom Analysis for Taming Database Replication Latency (Experiences with LinkedIn Internet Traffic)," in Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (ICPE 2015), Austin, TX, USA.
[10] Neumeyer, L., et al., "S4: Distributed Stream Computing Platform," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops (ICDMW), Sydney, Australia, Dec. 2010.
[11] Toshniwal, A., et al., "Storm@twitter," in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14), Snowbird, UT, USA.
[12] Apache Mesos: http://mesos.apache.org/
[13] Akidau, T., et al., "MillWheel: fault-tolerant stream processing at internet scale," in Proc. VLDB Endow., Aug. 2013.
[14] Lengauer, P., et al., "Accurate and Efficient Object Tracing for Java Applications," in Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (ICPE 2015), Austin, TX, USA.
[15] Tay, Y., et al., "An equation-based Heap Sizing Rule," Performance Evaluation, vol. 70, Nov. 2013.
[16] Saha, B., et al., "Data quality: The other face of Big Data," in Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pp. 1294-1297, March 31-April 4, 2014.
[17] Redman, T., "The impact of poor data quality on the typical enterprise," Commun. ACM, 1998.
[18] Fan, W., et al., "Semandaq: a data quality system based on conditional functional dependencies," PVLDB, 2008, pp. 1460-1463.