SlideShare a Scribd company logo
1 of 59
Download to read offline
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
How Not to Measure
Latency
An attempt to confer wisdom...
Gil Tene, CTO & co-Founder, Azul Systems
©2011 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Session # 6883
(a prime number BTW)
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
This Talk’s Purpose / Goals
This is not a “there is only one right way” talk
This is a talk about some common pitfalls people run
into when measuring latency
It will hopefully get you to critically examine both
WHY and HOW you measure latency
It will discuss some generic tools that could help
The “Azul makes the world’s best JVM for latency-
sensitive applications” stuff will only come towards the
end, I promise...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About me: Gil Tene
co-founder, CTO
@Azul Systems
Have been working on
a “think different” GC
approaches since 2002
Created Pauseless & C4
core GC algorithms
(Tene, Wolf)
A Long history building
Virtual & Physical
Machines, Operating
Systems, Enterprise
apps, etc... * working on real-world trash compaction issues, circa 2004
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
We make scalable Virtual
Machines
Have built “whatever it takes
to get job done” since 2002
3 generations of custom SMP
Multi-core HW (Vega)
Now Pure software for
commodity x86 (Zing)
“Industry firsts” in Garbage
collection, elastic memory,
Java virtualization, memory
scale
Vega
C4
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
High level agenda
Some latency behavior background
Latency “philosophy” questions
The pitfalls of using “statistics”
The Coordinated Omission Problem
Some useful tools
Demonstrate what tools can tell us about a latency-
friendly JVM...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
A classic look at response
time behavior
Response time as a function of load
source: IBM CICS server documentation, “understanding response times”
Average?
Max?
Median?
90%?
99.9%
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
Interpreting Results
From this graph it is possible to identify peaks in the response time over a period of time. Helps to
identify issues when server is continuously loaded for long duration of time.
Response Time Graph (per Transaction)
Response Time versus Elapsed Time report indicates the average response time of the transactions
over the elapsed time of the load test as shown.
“Hiccups”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
8000"
9000"
0" 20" 40" 60" 80" 100" 120" 140" 160" 180"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
What happened here?
Source: Gil running an idle program and suspending it five times in the middle
“Hiccups”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common fallacies
Computers run application code continuously
Response time can be measured as work units/time
Response time exhibits a normal distribution
“Glitches” or “Semi-random omissions” in measurement
don’t have a big effect.
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Hiccups are [typically]
strongly multi-modal
They don’t look anything like a normal distribution
They usually look like periodic freezes
A complete shift from one mode/behavior to another
Mode A: “good”.
Mode B: “Somewhat bad”
Mode C: “terrible”, ...
....
Common ways people deal with hiccups
Common ways people deal with hiccups
Averages and Standard Deviation
Better ways people deal with hiccups
Actually measuring percentiles
SLA
Response Time
Percentile
plot line
0"
20"
40"
60"
80"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&
&Elapsed&Time&(sec)&
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
Requirements
Why we measure latency and response
times to begin with...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Latency tells us how long
something took
But what do we WANT the latency to be?
What do we want the latency to BEHAVE like?
Latency requirements are usually a PASS/FAIL test
of some predefined criteria
Different applications have different needs
Requirements should reflect application needs
Measurements should provide data to evaluate
requirements
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The Olympics
aka “ring the bell first”
Goal: Get gold medals
Need to be faster than everyone else at SOME races
Ok to be slower in some, as long as fastest at some
(the average speed doesn’t matter)
Ok to not even finish or compete (the worst case and
99%‘ile don’t matter)
Different strategies can apply. E.g. compete in only 3
races to not risk burning out, or compete in 8 races in
hope of winning two
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Pacemakers
aka “hard” real time
Goal: Keep heart beating
Need to never be slower than X
“Your heart will keep beating 99.9% of the time” is
not very reassuring
Having a good average and a nice standard deviation
don’t matter or help
The worst case is all that matters
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
“Low Latency” Trading
aka “soft” real time
Goal: Be fast enough to make some good plays
Goal: Contain risk and exposure while making plays
E.g. want to “typically” react within 200 usec.
But can’t afford to hold open position for 20 msec, or
react to 30msec stale information
So we want a very good “typical” (median, 50%‘ile)
But we also need a reasonable Max, or 99.99%‘ile
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Interactive applications
aka “squishy” real time
Goal: Keep users happy enough to not complain/leave
Need to have “typically snappy” behavior
Ok to have occasional longer times, but not too high,
and not too often
Example: 90% of responses should be below 0.5sec,
99% should be below 2 seconds, 99.9 should be better
than 5 seconds. And a >10 sec. response should never
happen.
Remember: A single user may have 100 interactions
per session...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20msec
Q: Ok. Typical/average of 20msec... So what is the worst case requirement?
A: We don’t have one
Q: So it’s ok for some things to take more than 5 hours?
A: No way!
Q: So I’ll write down “5 hours worst case...”
A: No... Make that “nothing worse than 100 msec”
Q: Are you sure? Even if it’s only three times a day?
A: Ok... Make it “nothing worse than 2 seconds...”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is
it ok to have a 1 second response?
A: (Annoyed) I thought you said only a few times a day
Q: Right. For the worst case. But if half the results are better than 20msec, is it
ok for the other half to be just short of 2 seconds? What % of the time are you
willing to take a 1 second, or a half second hiccup? Or some other level?
A: Oh. Let’s see. We have to better than 50msec 90% of the time, or we’ll be
losing money even when we are fast the rest of the time. We need to be better
than 500msec 99.9% of the time, or our customers will complain and go
elsewhere
Now we have a service level expectation:
50% better than 20msec
90% better than 50msec
99.9% better than 500msec
100% better than 2 seconds
Latency does not live in a vacuum
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Remember this?
How much load can this system handle?
Where the
sysadmin
is willing
to go
What the
marketing
benchmarks
will say
Where users
complain
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Sustainable Throughput:
The throughput achieved while
safely maintaining service levels
Unsustainable
Throughout
FAIL
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Instance capacity test: “Fat Portal”
HotSpot CMS: Peaks at ~ 3GB / 45 concurrent users
* LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Instance capacity test: “Fat Portal”
C4: still smooth @ 800 concurrent users
The coordinated omission problem
An accidental conspiracy...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The coordinated omission problem
Common Example:
build/buy simple load tester to measure throughput
issue requests one by one at a certain rate
measure and log response time for each request
results log used to produce histograms, percentiles, etc.
So what’s wrong with that?
works well only when all responses fit within rate interval
technique includes implicit “automatic backoff” and coordination
But requirements interested in random, uncoordinated requests
How bad can this get, really?
Example of naïve %’ile
Avg. is 1 msec
over 1st 100 sec
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
How would you characterize this system?
~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
Avg. is 50 sec.
over next 100 sec
Overall Average response time is ~25 sec.
Example of naïve %’ile
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
Naïve Characterization
10,000 @ 1msec 1 @ 100 second
99.99%‘ile is 1 msec! Average. is 10.9msec! Std. Dev. is 0.99sec!
(should be ~100sec) (should be ~25 sec)
Proper measurement
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
10,000 results
Varying linearly
from 100 sec
to 10 msec
10,000 results
@ 1 msec each
~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
The real world
99%‘ile MUST be at least 0.29%
of total time (1.29% - 1%)
which would be 5.9 seconds
26.182 seconds
represents 1.29%
of the total time
wrong by a
factor of 1,000x
Results were
collected by a
single client
thread
The real world
The max is 762 (!!!)
standard deviations
away from the mean
305.197 seconds
represents 8.4% of
the timing run
A world record SPECjEnterprise2010 result
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
Max=7995.39*
0#
1000#
2000#
3000#
4000#
5000#
6000#
7000#
8000#
9000#
Latency*(msec)*
*
*
Percen7le*
Uncorrected*Latency*by*Percen7le*Distribu7on*
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
Max=7995.39*
0#
1000#
2000#
3000#
4000#
5000#
6000#
7000#
8000#
9000#
Latency*(msec)*
*
*
Percen7le*
Corrected*Latency*by*Percen7le*Distribu7on*
Real World Coordinated Omission effects
Before
Correction
After
Correction
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Suggestions
Lessons learned
Whatever your measurement technique is, test it.
Run your measurement method against artificial
system that creates the hypothetical pauses
scenarios. See if your reported results agree with how
you would describe that system behavior
Don’t waste time analyzing until you establish sanity
Don’t use or derive from std. deviation.
Always measure Max time. Consider what it means.
Measure %‘iles. Lots of them.
Some Tools
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
Max=137.472+
0#
20#
40#
60#
80#
100#
120#
140#
160#
Latency+(msec)+
+
+
Percen8le+
Latency+by+Percen8le+Distribu8on+
HdrHistogram
If you want to be able to produce graphs like this...
You need a good dynamic range, and good
resolution, at the same time
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
A High Dynamic Range Histogram
Covers a configurable dynamic value range
At configurable precision (expressed as number of significant digits)
For Example:
Track values between 1 microsecond and 1 hour
With 3 decimal points of resolution
Built-in compensation for Coordinated Omission
Open Source
On github, released to the public domain, creative commons CC0
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
Fixed cost in both space and time
Built with “latency sensitive” applications in mind
Recording values does not allocate or grow any data structures
Recording values use a fixed computation to determine location (no
searches, no variability in recording cost, FAST)
Even iterating through histogram can be done with no allocation
Internals work like a “floating point” data structure
“Exponent” and “Mantissa”
Exponent determines “Mantissa bucket” to use
“Mantissa buckets” provide linear value range for a given exponent.
Each have enough linear entries to support required precision
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
Provides tools for iteration
Linear, Logarithmic, Percentile
Supports percentile iterators
Practical due to high dynamic
range
Convenient percentile output
E.g. 10% intervals between 0 and
50%, 5% intervals between 50%
and 75%, 2.5% intervals between
75% and 87.5%, ...
Very useful for feeding percentile
distribution graphs...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
Max=137.472+
0#
20#
40#
60#
80#
100#
120#
140#
160#
Latency+(msec)+
+
+
Percen8le+
Latency+by+Percen8le+Distribu8on+
HdrHistogram
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
jHiccup
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Incontinuities in Java platform execution
0"
200"
400"
600"
800"
1000"
1200"
1400"
1600"
1800"
0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups"by"Time"Interval"
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=1665.024&
0"
200"
400"
600"
800"
1000"
1200"
1400"
1600"
1800"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups"by"Percen@le"Distribu@on"
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
jHiccup
A tool for capturing and displaying platform hiccups
Records any observed non-continuity of the underlying platform
Plots results in simple, consistent format
Simple, non-intrusive
As simple as adding the word “jHiccup” to your java launch line
% jHiccup java myflags myApp
(Or use as a java agent)
Adds a background thread that samples time @ 1000/sec
Open Source
Released to the public domain, creative commons CC0
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Telco App Example
0"
20"
40"
60"
80"
100"
120"
140"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
Optional SLA
plotting
Max Time per
interval
Hiccup
duration at
percentile
levels
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Examples
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Idle App on Busy System
0"
10"
20"
30"
40"
50"
60"
0" 100" 200" 300" 400" 500" 600" 700" 800" 900"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=49.728&
0"
10"
20"
30"
40"
50"
60"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Idle App on Quiet System
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600" 700" 800" 900"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.336&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Idle App on Quiet System
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600" 700" 800" 900"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.336&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Idle App on Dedicated System
0"
0.05"
0.1"
0.15"
0.2"
0.25"
0.3"
0.35"
0.4"
0.45"
0" 100" 200" 300" 400" 500" 600" 700" 800" 900"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=0.411&
0"
0.05"
0.1"
0.15"
0.2"
0.25"
0.3"
0.35"
0.4"
0.45"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
EHCache: 1GB data set under load
0"
500"
1000"
1500"
2000"
2500"
3000"
3500"
4000"
0" 500" 1000" 1500" 2000"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups"by"Time"Interval"
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
Max=3448.832&
0"
500"
1000"
1500"
2000"
2500"
3000"
3500"
4000"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups"by"Percen@le"Distribu@on"
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Fun with jHiccup
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Oracle HotSpot CMS, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=13156.352&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Zing 5, 1GB in an 8GB heap
0"
5"
10"
15"
20"
25"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
Max=20.384&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Oracle HotSpot CMS, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=13156.352&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Zing 5, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"Max=20.384&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Drawn to scale
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
What you can expect (from Zing)
in the low latency world
Assuming individual transaction work is “short” (on
the order of 1 msec), and assuming you don’t have
100s of runnable threads competing for 10 cores...
“Easily” get your application to < 10 msec worst case
With some tuning, 2-3 msec worst case
Can go to below 1 msec worst case...
May require heavy tuning/tweaking
Mileage WILL vary
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
1.6"
1.8"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=1.568&
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
1.6"
1.8"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.656&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Oracle HotSpot (pure newgen) Zing
Low latency trading application
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Low latency - Drawn to scale
Oracle HotSpot (pure newgen) Zing
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=1.568&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.656&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Takeaways
Standard Deviation and application latency should
never show up on the same page...
If you haven’t stated percentiles and a Max, you
haven’t specified your requirements
Measuring throughput without latency behavior is
[usually] meaningless
Mistakes in measurement/analysis can lead to orders-
of-magnitude errors and lead to bad business decisions
jHiccup and HdrHistogram are generically useful
The Zing JVM is cool...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Q & A
http://www.azulsystems.com
http://www.jhiccup.com
http://giltene.github.com/HdrHistogram
Session # 6883

More Related Content

Viewers also liked

KCPOS Brochure 2015 - Email
KCPOS Brochure 2015 - EmailKCPOS Brochure 2015 - Email
KCPOS Brochure 2015 - EmailRichard Wellham
 
Estudios ambientales
Estudios ambientalesEstudios ambientales
Estudios ambientalesValentinaM16
 
Observación y Análisis a la Práctica Escolar
Observación y Análisis a la Práctica EscolarObservación y Análisis a la Práctica Escolar
Observación y Análisis a la Práctica EscolarImelda Ayala
 
Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015
Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015
Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015Azul Systems, Inc.
 

Viewers also liked (6)

KCPOS Brochure 2015 - Email
KCPOS Brochure 2015 - EmailKCPOS Brochure 2015 - Email
KCPOS Brochure 2015 - Email
 
Estudios ambientales
Estudios ambientalesEstudios ambientales
Estudios ambientales
 
Observación y Análisis a la Práctica Escolar
Observación y Análisis a la Práctica EscolarObservación y Análisis a la Práctica Escolar
Observación y Análisis a la Práctica Escolar
 
Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015
Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015
Enabling Java in Latency Sensitive Environments - Dallas JUG April 2015
 
Sms-Voting
Sms-VotingSms-Voting
Sms-Voting
 
CURRICULUM_VITAE.PDF
CURRICULUM_VITAE.PDFCURRICULUM_VITAE.PDF
CURRICULUM_VITAE.PDF
 

More from Azul Systems, Inc.

Unleash the Power of Apache Cassandra
Unleash the Power of Apache CassandraUnleash the Power of Apache Cassandra
Unleash the Power of Apache CassandraAzul Systems, Inc.
 
Enabling Java in Latency-Sensitive Environments - Austin JUG April 2015
Enabling Java in Latency-Sensitive Environments - Austin JUG April 2015Enabling Java in Latency-Sensitive Environments - Austin JUG April 2015
Enabling Java in Latency-Sensitive Environments - Austin JUG April 2015Azul Systems, Inc.
 
QCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVM
QCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVMQCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVM
QCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVMAzul Systems, Inc.
 
JVM Language Summit: Object layout presentation
JVM Language Summit: Object layout presentationJVM Language Summit: Object layout presentation
JVM Language Summit: Object layout presentationAzul Systems, Inc.
 
JVM Language Summit: Object layout workshop
JVM Language Summit: Object layout workshopJVM Language Summit: Object layout workshop
JVM Language Summit: Object layout workshopAzul Systems, Inc.
 
What's New in the JVM in Java 8?
What's New in the JVM in Java 8?What's New in the JVM in Java 8?
What's New in the JVM in Java 8?Azul Systems, Inc.
 
DC JUG: Understanding Java Garbage Collection
DC JUG: Understanding Java Garbage CollectionDC JUG: Understanding Java Garbage Collection
DC JUG: Understanding Java Garbage CollectionAzul Systems, Inc.
 
Silicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM MechanicsSilicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM MechanicsAzul Systems, Inc.
 

More from Azul Systems, Inc. (9)

Unleash the Power of Apache Cassandra
Unleash the Power of Apache CassandraUnleash the Power of Apache Cassandra
Unleash the Power of Apache Cassandra
 
How NOT to Measure Latency
How NOT to Measure LatencyHow NOT to Measure Latency
How NOT to Measure Latency
 
Enabling Java in Latency-Sensitive Environments - Austin JUG April 2015
Enabling Java in Latency-Sensitive Environments - Austin JUG April 2015Enabling Java in Latency-Sensitive Environments - Austin JUG April 2015
Enabling Java in Latency-Sensitive Environments - Austin JUG April 2015
 
QCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVM
QCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVMQCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVM
QCon London: Low latency Java in the real world - LMAX Exchange and the Zing JVM
 
JVM Language Summit: Object layout presentation
JVM Language Summit: Object layout presentationJVM Language Summit: Object layout presentation
JVM Language Summit: Object layout presentation
 
JVM Language Summit: Object layout workshop
JVM Language Summit: Object layout workshopJVM Language Summit: Object layout workshop
JVM Language Summit: Object layout workshop
 
What's New in the JVM in Java 8?
What's New in the JVM in Java 8?What's New in the JVM in Java 8?
What's New in the JVM in Java 8?
 
DC JUG: Understanding Java Garbage Collection
DC JUG: Understanding Java Garbage CollectionDC JUG: Understanding Java Garbage Collection
DC JUG: Understanding Java Garbage Collection
 
Silicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM MechanicsSilicon Valley JUG: JVM Mechanics
Silicon Valley JUG: JVM Mechanics
 

Recently uploaded

Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 

Recently uploaded (20)

Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 

How NOT to Measure Java Latency

  • 1. ©2012 Azul Systems, Inc. How Not to Measure Latency An attempt to confer wisdom... Gil Tene, CTO & co-Founder, Azul Systems
  • 2. ©2011 Azul Systems, Inc. Session # 6883 (a prime number BTW)
  • 3. ©2012 Azul Systems, Inc. This Talk’s Purpose / Goals This is not a “there is only one right way” talk This is a talk about some common pitfalls people run into when measuring latency It will hopefully get you to critically examine both WHY and HOW you measure latency It will discuss some generic tools that could help The “Azul makes the world’s best JVM for latency- sensitive applications” stuff will only come towards the end, I promise...
  • 4. ©2012 Azul Systems, Inc. About me: Gil Tene co-founder, CTO @Azul Systems Have been working on a “think different” GC approaches since 2002 Created Pauseless & C4 core GC algorithms (Tene, Wolf) A Long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc... * working on real-world trash compaction issues, circa 2004
  • 5. ©2012 Azul Systems, Inc. About Azul We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002 3 generations of custom SMP Multi-core HW (Vega) Now Pure software for commodity x86 (Zing) “Industry firsts” in Garbage collection, elastic memory, Java virtualization, memory scale Vega C4
  • 6. ©2012 Azul Systems, Inc. High level agenda Some latency behavior background Latency “philosophy” questions The pitfalls of using “statistics” The Coordinated Omission Problem Some useful tools Demonstrate what tools can tell us about a latency- friendly JVM...
  • 7. ©2012 Azul Systems, Inc. A classic look at response time behavior Response time as a function of load source: IBM CICS server documentation, “understanding response times” Average? Max? Median? 90%? 99.9%
  • 8. ©2012 Azul Systems, Inc. Response time over time When we measure behavior over time, we often see: source: ZOHO QEngine White Paper: performance testing report analysis Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown. “Hiccups”
  • 9. ©2012 Azul Systems, Inc. 0" 1000" 2000" 3000" 4000" 5000" 6000" 7000" 8000" 9000" 0" 20" 40" 60" 80" 100" 120" 140" 160" 180" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" What happened here? Source: Gil running an idle program and suspending it five times in the middle “Hiccups”
  • 10. ©2012 Azul Systems, Inc. Common fallacies Computers run application code continuously Response time can be measured as work units/time Response time exhibits a normal distribution “Glitches” or “Semi-random omissions” in measurement don’t have a big effect.
  • 11. ©2012 Azul Systems, Inc. Hiccups are [typically] strongly multi-modal They don’t look anything like a normal distribution They usually look like periodic freezes A complete shift from one mode/behavior to another Mode A: “good”. Mode B: “Somewhat bad” Mode C: “terrible”, ... ....
  • 12. Common ways people deal with hiccups
  • 13. Common ways people deal with hiccups Averages and Standard Deviation
  • 14. Better ways people deal with hiccups Actually measuring percentiles SLA Response Time Percentile plot line 0" 20" 40" 60" 80" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on& &Elapsed&Time&(sec)& 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA"
  • 15. Requirements Why we measure latency and response times to begin with...
  • 16. ©2012 Azul Systems, Inc. Latency tells us how long something took But what do we WANT the latency to be? What do we want the latency to BEHAVE like? Latency requirements are usually a PASS/FAIL test of some predefined criteria Different applications have different needs Requirements should reflect application needs Measurements should provide data to evaluate requirements
  • 17. ©2012 Azul Systems, Inc. The Olympics aka “ring the bell first” Goal: Get gold medals Need to be faster than everyone else at SOME races Ok to be slower in some, as long as fastest at some (the average speed doesn’t matter) Ok to not even finish or compete (the worst case and 99%‘ile don’t matter) Different strategies can apply. E.g. compete in only 3 races to not risk burning out, or compete in 8 races in hope of winning two
  • 18. ©2012 Azul Systems, Inc. Pacemakers aka “hard” real time Goal: Keep heart beating Need to never be slower than X “Your heart will keep beating 99.9% of the time” is not very reassuring Having a good average and a nice standard deviation don’t matter or help The worst case is all that matters
  • 19. ©2012 Azul Systems, Inc. “Low Latency” Trading aka “soft” real time Goal: Be fast enough to make some good plays Goal: Contain risk and exposure while making plays E.g. want to “typically” react within 200 usec. But can’t afford to hold open position for 20 msec, or react to 30msec stale information So we want a very good “typical” (median, 50%‘ile) But we also need a reasonable Max, or 99.99%‘ile
  • 20. ©2012 Azul Systems, Inc. Interactive applications aka “squishy” real time Goal: Keep users happy enough to not complain/leave Need to have “typically snappy” behavior Ok to have occasional longer times, but not too high, and not too often Example: 90% of responses should be below 0.5sec, 99% should be below 2 seconds, 99.9 should be better than 5 seconds. And a >10 sec. response should never happen. Remember: A single user may have 100 interactions per session...
  • 21. ©2012 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20msec Q: Ok. Typical/average of 20msec... So what is the worst case requirement? A: We don’t have one Q: So it’s ok for some things to take more than 5 hours? A: No way! Q: So I’ll write down “5 hours worst case...” A: No... Make that “nothing worse than 100 msec” Q: Are you sure? Even if it’s only three times a day? A: Ok... Make it “nothing worse than 2 seconds...”
  • 22. ©2012 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response? A: (Annoyed) I thought you said only a few times a day Q: Right. For the worst case. But if half the results are better than 20msec, is it ok for the other half to be just short of 2 seconds? What % of the time are you willing to take a 1 second, or a half second hiccup? Or some other level? A: Oh. Let’s see. We have to better than 50msec 90% of the time, or we’ll be losing money even when we are fast the rest of the time. We need to be better than 500msec 99.9% of the time, or our customers will complain and go elsewhere Now we have a service level expectation: 50% better than 20msec 90% better than 50msec 99.9% better than 500msec 100% better than 2 seconds
  • 23. Latency does not live in a vacuum
  • 24. ©2012 Azul Systems, Inc. Remember this? How much load can this system handle? Where the sysadmin is willing to go What the marketing benchmarks will say Where users complain
  • 25. ©2012 Azul Systems, Inc. Sustainable Throughput: The throughput achieved while safely maintaining service levels Unsustainable Throughout FAIL
  • 26. ©2012 Azul Systems, Inc. Instance capacity test: “Fat Portal” HotSpot CMS: Peaks at ~ 3GB / 45 concurrent users * LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
  • 27. ©2012 Azul Systems, Inc. Instance capacity test: “Fat Portal” C4: still smooth @ 800 concurrent users
  • 28. The coordinated omission problem An accidental conspiracy...
  • 29. ©2012 Azul Systems, Inc. The coordinated omission problem Common Example: build/buy simple load tester to measure throughput issue requests one by one at a certain rate measure and log response time for each request results log used to produce histograms, percentiles, etc. So what’s wrong with that? works well only when all responses fit within rate interval technique includes implicit “automatic backoff” and coordination But requirements interested in random, uncoordinated requests How bad can this get, really?
  • 30. Example of naïve %’ile Avg. is 1 msec over 1st 100 sec System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec How would you characterize this system? ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec Avg. is 50 sec. over next 100 sec Overall Average response time is ~25 sec.
  • 31. Example of naïve %’ile System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec Naïve Characterization 10,000 @ 1msec 1 @ 100 second 99.99%‘ile is 1 msec! Average. is 10.9msec! Std. Dev. is 0.99sec! (should be ~100sec) (should be ~25 sec)
  • 32. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
  • 33. The real world 99%‘ile MUST be at least 0.29% of total time (1.29% - 1%) which would be 5.9 seconds 26.182 seconds represents 1.29% of the total time wrong by a factor of 1,000x Results were collected by a single client thread
  • 34. The real world The max is 762 (!!!) standard deviations away from the mean 305.197 seconds represents 8.4% of the timing run A world record SPECjEnterprise2010 result
  • 35. ©2012 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# Max=7995.39* 0# 1000# 2000# 3000# 4000# 5000# 6000# 7000# 8000# 9000# Latency*(msec)* * * Percen7le* Uncorrected*Latency*by*Percen7le*Distribu7on* 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# Max=7995.39* 0# 1000# 2000# 3000# 4000# 5000# 6000# 7000# 8000# 9000# Latency*(msec)* * * Percen7le* Corrected*Latency*by*Percen7le*Distribu7on* Real World Coordinated Omission effects Before Correction After Correction
  • 36. ©2012 Azul Systems, Inc. Suggestions Lessons learned Whatever your measurement technique is, test it. Run your measurement method against artificial system that creates the hypothetical pauses scenarios. See if your reported results agree with how you would describe that system behavior Don’t waste time analyzing until you establish sanity Don’t use or derive from std. deviation. Always measure Max time. Consider what it means. Measure %‘iles. Lots of them.
  • 38. ©2012 Azul Systems, Inc. HdrHistogram
  • 39. ©2012 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# Max=137.472+ 0# 20# 40# 60# 80# 100# 120# 140# 160# Latency+(msec)+ + + Percen8le+ Latency+by+Percen8le+Distribu8on+ HdrHistogram If you want to be able to produce graphs like this... You need a good dynamic range, and good resolution, at the same time
  • 40. ©2012 Azul Systems, Inc. HdrHistogram A High Dynamic Range Histogram Covers a configurable dynamic value range At configurable precision (expressed as number of significant digits) For Example: Track values between 1 microsecond and 1 hour With 3 decimal points of resolution Built-in compensation for Coordinated Omission Open Source On github, released to the public domain, creative commons CC0
  • 41. ©2012 Azul Systems, Inc. HdrHistogram Fixed cost in both space and time Built with “latency sensitive” applications in mind Recording values does not allocate or grow any data structures Recording values use a fixed computation to determine location (no searches, no variability in recording cost, FAST) Even iterating through histogram can be done with no allocation Internals work like a “floating point” data structure “Exponent” and “Mantissa” Exponent determines “Mantissa bucket” to use “Mantissa buckets” provide linear value range for a given exponent. Each have enough linear entries to support required precision
  • 42. ©2012 Azul Systems, Inc. HdrHistogram Provides tools for iteration Linear, Logarithmic, Percentile Supports percentile iterators Practical due to high dynamic range Convenient percentile output E.g. 10% intervals between 0 and 50%, 5% intervals between 50% and 75%, 2.5% intervals between 75% and 87.5%, ... Very useful for feeding percentile distribution graphs...
  • 43. ©2012 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# Max=137.472+ 0# 20# 40# 60# 80# 100# 120# 140# 160# Latency+(msec)+ + + Percen8le+ Latency+by+Percen8le+Distribu8on+ HdrHistogram
  • 44. ©2012 Azul Systems, Inc. jHiccup
  • 45. ©2012 Azul Systems, Inc. Incontinuities in Java platform execution 0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800" 0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups"by"Time"Interval" Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=1665.024& 0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups"by"Percen@le"Distribu@on"
  • 46. ©2012 Azul Systems, Inc. jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform Plots results in simple, consistent format Simple, non-intrusive As simple as adding the word “jHiccup” to your java launch line % jHiccup java myflags myApp (Or use as a java agent) Adds a background thread that samples time @ 1000/sec Open Source Released to the public domain, creative commons CC0
  • 47. ©2012 Azul Systems, Inc. Telco App Example 0" 20" 40" 60" 80" 100" 120" 140" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA" Optional SLA plotting Max Time per interval Hiccup duration at percentile levels
  • 48. ©2012 Azul Systems, Inc. Examples
  • 49. ©2012 Azul Systems, Inc. Idle App on Busy System 0" 10" 20" 30" 40" 50" 60" 0" 100" 200" 300" 400" 500" 600" 700" 800" 900" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=49.728& 0" 10" 20" 30" 40" 50" 60" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Idle App on Quiet System 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" 700" 800" 900" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=22.336& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on&
  • 50. ©2012 Azul Systems, Inc. Idle App on Quiet System 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" 700" 800" 900" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=22.336& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Idle App on Dedicated System 0" 0.05" 0.1" 0.15" 0.2" 0.25" 0.3" 0.35" 0.4" 0.45" 0" 100" 200" 300" 400" 500" 600" 700" 800" 900" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=0.411& 0" 0.05" 0.1" 0.15" 0.2" 0.25" 0.3" 0.35" 0.4" 0.45" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on&
  • 51. ©2012 Azul Systems, Inc. EHCache: 1GB data set under load 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" 4000" 0" 500" 1000" 1500" 2000" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups"by"Time"Interval" Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" Max=3448.832& 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" 4000" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups"by"Percen@le"Distribu@on"
  • 52. ©2012 Azul Systems, Inc. Fun with jHiccup
  • 53. ©2012 Azul Systems, Inc. Oracle HotSpot CMS, 1GB in an 8GB heap 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=13156.352& 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Zing 5, 1GB in an 8GB heap 0" 5" 10" 15" 20" 25" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" Max=20.384& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on&
  • 54. ©2012 Azul Systems, Inc. Oracle HotSpot CMS, 1GB in an 8GB heap 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=13156.352& 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Zing 5, 1GB in an 8GB heap 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"Max=20.384& 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Drawn to scale
  • 55. ©2012 Azul Systems, Inc. What you can expect (from Zing) in the low latency world Assuming individual transaction work is “short” (on the order of 1 msec), and assuming you don’t have 100s of runnable threads competing for 10 cores... “Easily” get your application to < 10 msec worst case With some tuning, 2-3 msec worst case Can go to below 1 msec worst case... May require heavy tuning/tweaking Mileage WILL vary
  • 56. ©2012 Azul Systems, Inc. 0" 0.2" 0.4" 0.6" 0.8" 1" 1.2" 1.4" 1.6" 1.8" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=1.568& 0" 0.2" 0.4" 0.6" 0.8" 1" 1.2" 1.4" 1.6" 1.8" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=22.656& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Oracle HotSpot (pure newgen) Zing Low latency trading application
  • 57. ©2012 Azul Systems, Inc. Low latency - Drawn to scale Oracle HotSpot (pure newgen) Zing 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=1.568& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=22.656& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on&
  • 58. ©2012 Azul Systems, Inc. Takeaways Standard Deviation and application latency should never show up on the same page... If you haven’t stated percentiles and a Max, you haven’t specified your requirements Measuring throughput without latency behavior is [usually] meaningless Mistakes in measurement/analysis can lead to orders- of-magnitude errors and lead to bad business decisions jHiccup and HdrHistogram are generically useful The Zing JVM is cool...
  • 59. ©2012 Azul Systems, Inc. Q & A http://www.azulsystems.com http://www.jhiccup.com http://giltene.github.com/HdrHistogram Session # 6883