Hardware fails, applications fail, and our code... well, it fails too (at least mine does). To prevent software failure we test. Hardware failures are inevitable, so we write code that tolerates them, then we test again. From tests we gather metrics and act on them by improving the parts that perform inadequately. Measuring the right things at the right places in an application is as much about good engineering practice and maintaining SLAs as it is about end-user experience, and it can make the difference between a successful product and a failed one.
To act on performance metrics such as maximum latency and consistency of response times, we need to know their accurate values. The problem with such metrics is that popular tools give us results that are not only inaccurate but also far too optimistic.
During my presentation I will simulate services that require monitoring and show how the gathered metrics differ from the real numbers. We will do all of this using what currently seems to be the most popular metrics pipeline - Graphite fed by the metrics.dropwizard.io library - and get completely false results. We will learn to tune it and get much better accuracy. We will use JMeter to measure latency and observe how falsely reassuring the results are. Finally I will show how HdrHistogram helps in gathering reliable metrics. We will also run tests measuring the performance of different metric classes.
Everybody Lies
1. EVERYBODY LIES
TOMASZ KOWALCZEWSKI
2. CARGO CULT
During the Middle Ages there were all kinds of crazy ideas, such as that a piece of rhinoceros horn would increase potency. Then a method was discovered for separating the ideas - which was to try one to see if it worked, and if it didn't work, to eliminate it. This method became organized, of course, into science. And it developed very well, so that we are now in the scientific age. It is such a scientific age, in fact, that we have difficulty in understanding how witch doctors could ever have existed, when nothing that they proposed ever really worked - or very little of it did.
Richard Feynman
From a Caltech commencement address given in 1974
3. WHY BOTHER?
• You get what you measure
- Ineffective optimisations that complicate the code
+ Numbers to convince management to approve that refactoring or the migration to Java 8!
4. WHY BOTHER?
• Predictable is better than fast
• One page display requires multiple calls (static and dynamic resources)
• Multiple microservices are called to generate a response
• During a session a user may view hundreds of your web pages
5. WHY DO THIS?
• Every 100 ms increase in load time of Amazon.com decreased sales by 1% [1]
• Increasing web search latency from 100 to 400 ms reduces the daily searches per user by 0.2% to 0.6%. Furthermore, users do fewer searches the longer they are exposed. For longer delays, the loss of searches persists for a time even after latency returns to previous levels. [2]
[1] Kohavi and Longbotham 2007
[2] Brutlag 2009
11. SURVEY
• Use Graphite?
• Feed it with Coda Hale/Dropwizard metrics?
• Modify their source? Use nonstandard options?
• Graph the average? The median?
• Percentiles?
13. WHAT METRICS CAN WE USE?
graphite.send(prefix(name, "max"), ...);
graphite.send(prefix(name, "mean"), ...);
graphite.send(prefix(name, "min"), ...);
graphite.send(prefix(name, "stddev"), ...);
graphite.send(prefix(name, "p50"), ...);
graphite.send(prefix(name, "p75"), ...);
graphite.send(prefix(name, "p95"), ...);
graphite.send(prefix(name, "p98"), ...);
graphite.send(prefix(name, "p99"), ...);
graphite.send(prefix(name, "p999"), ...);
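For context, here is a minimal sketch of the pipeline those calls come from: a MetricRegistry reporting to Graphite through GraphiteReporter. The host, port, prefix and reporting period are placeholder values:

import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

MetricRegistry registry = new MetricRegistry();
Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
        .prefixedWith("2015")                      // metric name prefix
        .convertDurationsTo(TimeUnit.MILLISECONDS) // report timer durations in ms
        .build(graphite);
reporter.start(10, TimeUnit.SECONDS);              // push a snapshot every 10s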
14. DON'T LOOK AT THE MEAN
• 1000 queries - 0ms latency, 100 queries - 5s latency
• Average is (1000 · 0ms + 100 · 5000ms) / 1100 ≈ 454.5ms
• 1000 queries - 1ms latency, 100 queries - 5s latency
• Average is ≈ 455ms - nearly identical, despite different fast-path behaviour
• The mean does not help to quantify the lags users will experience
15. – ANSCOMBE'S QUARTET, BY FRANCIS ANSCOMBE
These four data sets all have nearly identical means and variances (and the same fitted regression line), yet look completely different when plotted
16. PLOTTING MEAN IS FOR SHOWING OFF TO MANAGEMENT
17. MAYBE THE MEDIAN THEN?
• What is the probability of an end user encountering latency worse than the median?
• Remember: usually multiple requests are needed to respond to an API call (e.g. N microservices, N resource requests per page)
• The probability that all N calls complete below the median is (1/2)^N · 100%
• e.g. with N = 5 calls, only (1/2)^5 ≈ 3% of users experience uniformly better-than-median latency
18. PROBABILITY OF EXPERIENCING LATENCY BETTER THAN MEDIAN AS A FUNCTION OF THE NUMBER OF MICROSERVICES INVOLVED
[Chart: (1/2)^N · 100% plotted for N = 0..10; y-axis from 10% to 100%]
19. WHICH PERCENTILE IS RELEVANT TO YOU?
• Is the 99th percentile a demanding constraint?
• In an application serving 1000 qps, latency worse than that happens ten times per second.
• A user who navigates through several web pages will most probably experience it
• What is the probability of encountering latency better than the 99th percentile on every call?
• (99/100)^N · 100%, e.g. with N = 100 calls that is 0.99^100 ≈ 37%
20. PROBABILITY OF EXPERIENCING LATENCY BETTER THAN 99TH PERCENTILE AS A FUNCTION OF THE NUMBER OF MICROSERVICES INVOLVED
[Chart: (99/100)^N · 100% plotted for N = 0..100; y-axis from 0% to 100%]
21. DO NOT AVERAGE PERCENTILES
Example scenario:
1. Load balancer splits traffic unevenly (ELB, anyone?)
2. Server S1 handles 1 qps over the measured time with 95%'ile == 1ms
3. Server S2 handles 100 qps over the measured time with 95%'ile == 10s
4. The average of the two percentiles is ~5s.
5. What does that tell us?
6. Did we satisfy the SLA if it says "95%'ile must be below 8s"?
7. The actual 95%'ile is ~10s (see the merging sketch below)
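The robust alternative is to merge the per-server distributions and read the percentile from the combined data. A minimal sketch using HdrHistogram, which this deck introduces later; the value bounds are placeholder assumptions:

import org.HdrHistogram.Histogram;

// Latencies in microseconds, up to 1 minute, 3 significant digits.
Histogram s1 = new Histogram(60_000_000L, 3);
Histogram s2 = new Histogram(60_000_000L, 3);
// ... each server records its own request latencies ...

// Merge the raw counts, then read percentiles from the combined histogram.
Histogram merged = new Histogram(60_000_000L, 3);
merged.add(s1);
merged.add(s2);
long p95 = merged.getValueAtPercentile(95.0); // ~10s in this scenario, not ~5s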
22. – ALICE'S ADVENTURES IN WONDERLAND
"If there's no meaning in it," said the King, "that saves a world of trouble, you know, as we needn't try to find any."
23. Every time you average max values, someone in the world starts a new JavaScript framework
25. metricRegistry.timer("2015.standardTimer");
The standard Timer will over- or under-report actual percentiles at will.
The green line represents the actual MAX values.
27. TIMER'S HISTOGRAM RESERVOIR
• The backing storage for the Timer's data
• Contains a "statistically representative reservoir of a data stream"
• The default is ExponentiallyDecayingReservoir, which has many drawbacks and is the source of most inaccuracies observed throughout this presentation
• Others include UniformReservoir, SlidingTimeWindowReservoir and SlidingWindowReservoir (see the sketch below)
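To make this concrete, a small sketch of where the reservoir's numbers surface; the metric name is a placeholder:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

MetricRegistry registry = new MetricRegistry();
Timer timer = registry.timer("standardTimer"); // default: ExponentiallyDecayingReservoir

try (Timer.Context ctx = timer.time()) {
    // ... the work being measured ...
}

// Every reported p50/p95/p99/max is read from the reservoir's snapshot.
Snapshot snapshot = timer.getSnapshot();
System.out.println(snapshot.get99thPercentile());
System.out.println(snapshot.getMax());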
28. EXPONENTIALLY DECAYING RESERVOIR
• Stores 1028 random samples by default
• Assumes a normal distribution of recorded values
• Many statistical tools applied in computer systems monitoring will assume a normal distribution
• Be suspicious of such tools
• Why is that a bad idea?
29. NORMAL DISTRIBUTION - WHY SO USEFUL?
[Chart: the standard normal density curve]
• Central limit theorem
• Chebyshev's inequality
f(x, µ, σ) = 1/(σ√(2π)) · e^(−(x−µ)² / (2σ²))
30. CALCULATE THE 95%'ILE BASED ON MEAN AND STD. DEV.
[Chart: normal density centred at 10ms]
• IFF latency values were distributed normally, then we could calculate any percentile from the mean and standard deviation alone
• µ = 10ms, σ = 1ms
• Look it up in the standard normal (Z) table
• The 95%'ile is located 1.65 std. dev. from the mean
• Result: 10ms + 1.65 · 1ms = 11.65ms
35. Add spikes due to: lost TCP packet retransmissions, disk swapping, kernel bookkeeping, etc.
36. NORMAL DISTRIBUTION - WHY NOT APPLICABLE?
[Chart: the standard normal density curve]
• "The value of the normal distribution is practically zero when the value x lies more than a few standard deviations away from the mean." [1]
• "It may not be an appropriate model when one expects a significant fraction of outliers" [1]
• "[…] other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data." [1]
f(x, µ, σ) = 1/(σ√(2π)) · e^(−(x−µ)² / (2σ²))
[1] All quotes on this slide are from Wikipedia
37. The blue line represents the metric reported by the Timer class.
The green line represents the request rate.
38. TIMER, TIMER NEVER CHANGES…
• Timer values decay exponentially
• giving an artificial smoothing of values for server behaviour that may be long gone
• a Timer that is not updated does not decay
• if a Timer is not updated (e.g. a subprocess failed and we stopped sending requests to it), its values will remain constant
• Check this post for potential solutions: taint.org/2014/01/16/145944a.html
39. HDRHISTOGRAM
• Supports recording and analysis of sampled data across a configurable range with configurable accuracy (see the sketch below)
• Provides a compact representation of data while retaining high resolution
• Allows configurable tradeoffs between space and accuracy
• Very fast, allocation free, not thread safe for maximum speed (thread safe versions available)
• Created by Gil Tene of Azul Systems
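For orientation, a minimal recording sketch; the microsecond unit and value range are assumptions for illustration:

import org.HdrHistogram.Histogram;

// Track latencies from 1µs up to 1 hour, with 3 significant decimal digits.
Histogram histogram = new Histogram(1L, 3_600_000_000L, 3);

long latencyMicros = 1234; // e.g. one measured request latency
histogram.recordValue(latencyMicros);

System.out.println(histogram.getValueAtPercentile(99.9)); // p999
System.out.println(histogram.getMaxValue());              // true max, no decay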
40. RECORDER
• Uses HdrHistogram to store values
• Supports concurrent recording of values
• Recording is lock free, and even wait free on most architectures (those that support lock xadd)
• Reading is not lock free but does not stall writers (writer-reader phaser)
• Check out Marshall Pierce's library for using it as a Reservoir implementation (a usage sketch follows below)
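A sketch of the usual Recorder pattern - hot-path threads record while a reporting thread periodically swaps out a stable interval histogram (the bounds are placeholder assumptions):

import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

Recorder recorder = new Recorder(3_600_000_000L, 3); // up to 1 hour, 3 digits

// Hot path, any number of threads: lock-free recording.
long latencyMicros = 1234;
recorder.recordValue(latencyMicros);

// Reporting thread: atomically swaps a fresh histogram in for writers and
// returns the stable data recorded since the previous call.
Histogram interval = recorder.getIntervalHistogram();
System.out.println(interval.getValueAtPercentile(99.0));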
41. SOLUTIONS
• Always instantiate a Timer with a custom reservoir (see the wiring sketch below)
• new ExponentiallyDecayingReservoir(LARGE_NUMBER, 0.015) - note the constructor also takes the decay factor alpha; 0.015 is the default
• new SlidingTimeWindowReservoir(1, MINUTES)
• new HdrHistogramResetOnSnapshotReservoir()
• Only the last one is safe and accurate, and will not report stale values if no updates were made
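A sketch of that wiring; the metric name is a placeholder, and HdrHistogramResetOnSnapshotReservoir would come from Marshall Pierce's reservoir library mentioned earlier - here the sliding-window variant from metrics-core is shown:

import java.util.concurrent.TimeUnit;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.SlidingTimeWindowReservoir;
import com.codahale.metrics.Timer;

MetricRegistry registry = new MetricRegistry();

// registry.timer(name) would silently create the default decaying reservoir,
// so register an explicitly constructed Timer instead.
Timer timer = registry.register("requests.latency",
        new Timer(new SlidingTimeWindowReservoir(1, TimeUnit.MINUTES)));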
43. SMOKING BENCHMARKING IS THE LEADING CAUSE OF STATISTICS IN THE WORLD
44. COORDINATED OMISSION
• As formulated by Gil Tene of Azul Systems
• When the load driver is plotting with the system under test to deceive you
• Most tools do this
• Most benchmarks do this
• The Yahoo Cloud Serving Benchmark had this problem [1]
[1] Recently fixed by Nitsan Wakart, see psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html
45. [Diagram: a timeline of requests arriving at a fixed rate, interrupted by an application pause. Of the requests scheduled by the test plan during the pause, only the red one will be sent; the others will be missing from the test, so the latency they would have seen is never measured.]
46. – CREATED WITH GIL TENE'S HDRHISTOGRAM PLOTTING SCRIPT
The effects on benchmarks at high percentiles are spectacular
47. COORDINATED OMISSION SOLUTIONS
1. Ignore the problem!
Perfectly fine for a non-interactive system where only throughput matters
48. COORDINATED OMISSION SOLUTIONS
2. Correct it mathematically in the sampling mechanism.
HdrHistogram can correct for CO with these methods (choose one! - a usage sketch follows below):
histogram.recordValueWithExpectedInterval(
value,
expectedIntervalBetweenSamples
);
histogram.copyCorrectedForCoordinatedOmission(
expectedIntervalBetweenSamples
);
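For illustration, a sketch of the record-time variant; the 10ms interval is an assumption matching a test plan that issues one request every 10ms:

import java.util.concurrent.TimeUnit;
import org.HdrHistogram.Histogram;

Histogram histogram = new Histogram(3_600_000_000L, 3);

// The test plan issues one request every 10ms.
long expectedIntervalMicros = TimeUnit.MILLISECONDS.toMicros(10);

// If the recorded value exceeds the expected interval, HdrHistogram also
// records the synthetic values the queued-up requests would have seen.
long latencyMicros = 1234;
histogram.recordValueWithExpectedInterval(latencyMicros, expectedIntervalMicros);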
49. COORDINATED OMISSION SOLUTIONS
3. Correct it on the load driver side
by noticing pauses between sent requests: a newly issued request gets a timer that starts counting from the time it should have been sent but wasn't (see the sketch below)
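A minimal sketch of that idea, assuming a constant-rate test plan; running, targetRatePerSecond, sendRequest and histogram are hypothetical placeholders:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

long intervalNanos = TimeUnit.SECONDS.toNanos(1) / targetRatePerSecond;
long intendedSendTime = System.nanoTime();

while (running) {
    intendedSendTime += intervalNanos;
    long napNanos = intendedSendTime - System.nanoTime();
    if (napNanos > 0) {
        LockSupport.parkNanos(napNanos); // on schedule: wait for the next slot
    }
    // When we are late (napNanos <= 0) we still send immediately, and we
    // measure from the intended send time, so driver stalls surface as latency.
    sendRequest();
    histogram.recordValue(System.nanoTime() - intendedSendTime);
}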
50. COORDINATED OMISSION SOLUTIONS
4. Fail the test
for hard real-time systems where a pause causes human casualties (brakes, pacemakers, the Phalanx system)
51. COORDINATED OMISSION
• Mathematical corrections can overcorrect when the load driver itself has pauses (e.g. GC).
• They do not account for the fact that after a pause the server has no queued work, as opposed to N more requests waiting to be executed
• In the real world it might never have recovered
• Most tools ignore the problem
• Notable exception: Twitter's Iago
52. – LOAD DRIVER MOTTO
“Do not bend to the tyranny of reality”
53. SUMMARY
• Measure what is meaningful, not just what is measurable
• Set SLAs before testing and before creating dashboards
• Do not trust the Timer class: use custom reservoirs, HdrHistogram and Recorder, and never trust the EWMA for request rates
• Do not average percentiles unless you need a random number generator
• Do not plot averages unless you just want to look good on dashboards
• When load testing, be aware of coordinated omission
54. SOURCES, THANK-YOUS AND RECOMMENDED FOLLOW-UPS
• Coda Hale for the great metrics library
• Gil Tene
• latencytipoftheday.blogspot.de
• www.infoq.com/presentations/latency-pitfalls
• github.com/HdrHistogram/HdrHistogram
• Nitsan Wakart
• psy-lob-saw.blogspot.de/2015/03/fixing-ycsb-coordinated-omission.html
• and the whole blog
• Martin Thompson et al.
• groups.google.com/forum/#!forum/mechanical-sympathy
55. RECOMMENDED
A great introduction to statistics and queueing theory:
Performance Modeling and Design of Computer Systems: Queueing Theory in Action
Prof. Mor Harchol-Balter
56. FEEDBACK KINDLY REQUESTED
https://www.surveymonkey.com/s/B5KGWWN