EVERYBODY LIES
TOMASZ KOWALCZEWSKI
CARGO CULT
During the Middle Ages there were all kinds of
crazy ideas, such as that a piece of
rhinoceros horn would increase potency. Then
a method was discovered for separating the
ideas- which was to try one to see if it worked,
and if it didn't work, to eliminate it. This
method became organized, of course, into
science. And it developed very well, so that we
are now in the scientific age. It is such a
scientific age, in fact, that we have difficulty in
understanding how witch doctors could ever
have existed, when nothing that they proposed
ever really worked-or very little of it did.
Richard Feynman
From a Caltech commencement address
given in 1974
WHY BOTHER?
• You get what you measure
- Ineffective optimisations that complicate code
+ Numbers to convince management to do
refactoring or migration to Java 8!
WHY BOTHER?
• Predictable is better than fast
• One page display requires multiple calls (static and dynamic resources)
• Multiple microservices are called to generate a response
• During a session a user may display your webpages hundreds of times
WHY DO THIS?
• Every 100 ms increase in load time of Amazon.com decreased sales by 1% [1]
• Increasing web search latency 100 to 400 ms reduces the daily searches per user by 0.2% to 0.6%. Furthermore, users do fewer searches the longer they are exposed. For longer delays, the loss of searches persists for a time even after latency returns to previous levels. [2]
[1] Kohavi and Longbotham 2007
[2] Brutlag 2009
SURVEY
• Do you…
• Use graphite?
• Feed it with Coda Hale/Dropwizard metrics?
• Modify their source? Use nonstandard options?
• Graph average? Median?
• Percentiles?
[Image: comic, (c) xkcd.com]
WHAT METRICS CAN WE USE?
graphite.send(prefix(name, "max"), ...);
graphite.send(prefix(name, "mean"), ...);
graphite.send(prefix(name, "min"), ...);
graphite.send(prefix(name, "stddev"), ...);
graphite.send(prefix(name, "p50"), ...);
graphite.send(prefix(name, "p75"), ...);
graphite.send(prefix(name, "p95"), ...);
graphite.send(prefix(name, "p98"), ...);
graphite.send(prefix(name, "p99"), ...);
graphite.send(prefix(name, "p999"), ...);
DON’T LOOK AT MEAN
• 1000 queries - 0ms latency, 100 queries - 5s latency
• Average is ~454.5ms
• 1000 queries - 1ms latency, 100 queries - 5s latency
• Average is ~455ms
• Does not help to quantify lags users will experience
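The arithmetic on this slide can be checked with a few lines of Java. This is only a sketch: the class and method names (MeanLatencyDemo, sample, mean) are made up for illustration, and the samples are synthetic values matching the two scenarios above.

```java
import java.util.Arrays;

/** Illustration only: the arithmetic mean of a latency sample with a slow tail. */
final class MeanLatencyDemo {

    /** Builds fastCount values of fastMs followed by slowCount values of slowMs. */
    static double[] sample(int fastCount, double fastMs, int slowCount, double slowMs) {
        double[] values = new double[fastCount + slowCount];
        Arrays.fill(values, 0, fastCount, fastMs);
        Arrays.fill(values, fastCount, values.length, slowMs);
        return values;
    }

    /** Arithmetic mean, NaN for an empty sample. */
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] a = sample(1000, 0.0, 100, 5000.0); // 1000 x 0 ms, 100 x 5 s
        double[] b = sample(1000, 1.0, 100, 5000.0); // 1000 x 1 ms, 100 x 5 s
        System.out.printf("mean A = %.1f ms%n", mean(a));
        System.out.printf("mean B = %.1f ms%n", mean(b));
    }
}
```

Both means land near 455 ms, a latency that no request in either sample actually experienced: roughly 91% of requests took 1 ms or less and the rest took 5 s.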
– ANSCOMBE'S QUARTET BY FRANCIS ANSCOMBE
These four data sets all have nearly the same mean, variance, correlation, and regression line
PLOTTING MEAN IS FOR SHOWING OFF TO MANAGEMENT
MAYBE MEDIAN THEN?
• What is the probability of end user encountering
latency worse than median?
• Remember: usually multiple requests are needed to
respond to API call (e.g. N micro services, N
resource requests per page)
$\left(\frac{1}{2}\right)^{N} \cdot 100$
[Chart: probability (%) of experiencing latency better than the median as a function of the number of microservices involved (0 to 10)]
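The curve on this chart follows directly from the formula. A minimal sketch, assuming the latency of each call is independent of the others (an idealisation); the class and method names are hypothetical:

```java
/** Illustration only: assumes every call's latency is independent of the others. */
final class AllCallsFastDemo {

    /** Percent chance that all n independent calls beat the given quantile (0.5 = median). */
    static double percentAllBetterThan(double quantile, int n) {
        return Math.pow(quantile, n) * 100.0;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.printf("N=%2d  all better than median: %7.3f%%%n",
                    n, percentAllBetterThan(0.5, n));
        }
    }
}
```

With only three services involved, the chance that every call is faster than the median is already down to 12.5%.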
WHICH PERCENTILE IS RELEVANT TO YOU?
• Is the 99th percentile a demanding constraint?
• In an application serving 1000 qps, latency worse than that happens ten times per second.
• A user who needs to navigate through several web pages will most probably experience it
• What is the probability of encountering latency better than the 99th percentile?
$\left(\frac{99}{100}\right)^{N} \cdot 100$
[Chart: probability (%) of experiencing latency better than the 99th percentile as a function of the number of microservices involved (0 to 100)]
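To put numbers on "will most probably experience it", the sketch below counts how many independent requests it takes before a user is more likely than not to hit a response slower than the 99th percentile. Independence between requests is an assumption, and all names are made up for this example.

```java
/** Illustration only: exposure to the slow tail, assuming independent requests. */
final class TailExposureDemo {

    /** Requests per second that are slower than the given quantile (0.99 = 99th percentile). */
    static double slowRequestsPerSecond(double qps, double quantile) {
        return qps * (1.0 - quantile);
    }

    /** Chance (0..1) that at least one of n independent requests is slower than the quantile. */
    static double chanceOfAtLeastOneSlow(double quantile, int n) {
        return 1.0 - Math.pow(quantile, n);
    }

    /** Smallest n for which chanceOfAtLeastOneSlow(quantile, n) reaches the target chance. */
    static int requestsUntilLikely(double quantile, double target) {
        if (!(quantile > 0 && quantile < 1) || !(target > 0 && target < 1)) {
            throw new IllegalArgumentException("quantile and target must be in (0, 1)");
        }
        int n = 1;
        while (chanceOfAtLeastOneSlow(quantile, n) < target) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("slow requests/s at 1000 qps: " + slowRequestsPerSecond(1000, 0.99));
        System.out.println("requests until a 50% chance: " + requestsUntilLikely(0.99, 0.5));
    }
}
```

Under these assumptions 69 requests are enough, which a handful of page views (each pulling in many resources) can easily reach.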
DO NOT AVERAGE PERCENTILES
Example scenario:
1. Load balancer splits traffic unevenly (ELB anyone?)
2. Server S1 has 1 qps over measured time with 95%’ile == 1ms
3. Server S2 has 100 qps over measured time with 95%’ile == 10s
4. Average is ~5s.
5. What does that tell us?
6. Did we satisfy SLA if it says “95%’ile must be below 8s”?
7. Actual 95%’ile is ~10s
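The scenario can be reproduced with synthetic data. This sketch assumes a 60 second measurement window and constant latencies per server (1 ms on S1, 10 s on S2), and it uses the nearest-rank definition of a percentile; the class and helper names are hypothetical.

```java
import java.util.Arrays;

/** Illustration only: averaging per-server percentiles versus the percentile of all requests. */
final class AveragedPercentileDemo {

    /** Nearest-rank percentile: smallest value with at least p percent of samples at or below it. */
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** count samples that all have the same latency. */
    static long[] constant(int count, long valueMs) {
        long[] values = new long[count];
        Arrays.fill(values, valueMs);
        return values;
    }

    /** Concatenates the samples of two servers. */
    static long[] merge(long[] a, long[] b) {
        long[] all = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, all, a.length, b.length);
        return all;
    }

    public static void main(String[] args) {
        long[] s1 = constant(60, 1);         // 1 qps for 60 s, 1 ms each
        long[] s2 = constant(6000, 10_000);  // 100 qps for 60 s, 10 s each
        double averaged = (percentile(s1, 95) + percentile(s2, 95)) / 2.0;
        long actual = percentile(merge(s1, s2), 95);
        System.out.println("average of p95s: " + averaged + " ms");
        System.out.println("p95 of all requests: " + actual + " ms");
    }
}
```

The averaged figure (about 5 s) would pass an "8 s" SLA check even though the 95th percentile over all requests is 10 s.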
– ALICE'S ADVENTURES IN WONDERLAND
“If there's no meaning in it,” said the King, “that
saves a world of trouble, you know, as we
needn't try to find any.”
Every time you average max values someone in the
world starts a new JavaScript framework
Demo time
metricRegistry.timer("2015.standardTimer");
The standard timer will over- or under-report actual
percentiles at will.
Green line represents actual MAX values.
TIMER’S HISTOGRAM RESERVOIR
• Backing storage for Timer’s data
• Contains a “statistically representative reservoir of a data stream”
• Default is ExponentiallyDecayingReservoir, which has many drawbacks and is the source of most inaccuracies observed throughout this presentation
• Others include
  • UniformReservoir, SlidingTimeWindowReservoir, SlidingWindowReservoir
EXPONENTIALLY DECAYING RESERVOIR
• Stores 1028 random samples by default
• Assumes normal distribution of recorded values
• Many statistical tools applied in computer systems
monitoring will assume normal distribution
• Be suspicious of such tools
• Why is that a bad idea?
NORMAL DISTRIBUTION - WHY SO USEFUL?
• Central limit theorem
• Chebyshev's inequality
$f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
[Plot: normal distribution density curve]
CALCULATE 95%’ILE BASED ON MEAN AND STD. DEV.
• IFF latency values were distributed normally, we could calculate any percentile from the mean and standard deviation
µ = 10 ms, σ = 1 ms
• Look up the standard normal (Z) table
• 95%’ile is located 1.65 std. dev. from the mean
• Result is 11.65 ms
[Plot: normal distribution with µ = 10 ms, σ = 1 ms; x-axis from 10 to 12 ms]
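The same lookup written as code, as a minimal sketch. The z-scores are standard normal table values rounded to three decimals (the slide rounds 1.645 to 1.65), and the class and constant names are made up for this example. The result is only meaningful if latencies really are normally distributed.

```java
/** Illustration only: percentiles of a normal distribution from mean and standard deviation. */
final class NormalPercentileDemo {

    // z-scores from a standard normal table, rounded to three decimals
    static final double Z_P95 = 1.645;
    static final double Z_P99 = 2.326;

    /** Percentile of a normal distribution: the mean plus z standard deviations. */
    static double percentile(double meanMs, double stdDevMs, double z) {
        return meanMs + z * stdDevMs;
    }

    public static void main(String[] args) {
        System.out.printf("p95 = %.3f ms%n", percentile(10.0, 1.0, Z_P95));
        System.out.printf("p99 = %.3f ms%n", percentile(10.0, 1.0, Z_P99));
    }
}
```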
Latency profile resembling normal distribution…
Add spikes due to young gen GC pauses
Add spikes due to old gen GC pauses
Add spikes due to calling other services (like DB)
Add spikes due to lost TCP packet retransmission,
disk swapping, kernel bookkeeping, etc.
NORMAL DISTRIBUTION - WHY NOT APPLICABLE?
• The value of the normal distribution is practically zero when the value x lies more than a few standard deviations away from the mean.
• It may not be an appropriate model when one expects a significant fraction of outliers
• […] other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data. [1]
$f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
[Plot: normal distribution density curve]
[1] All quotes on this slide are from Wikipedia
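A small experiment shows how badly the normal model does once outliers are present. The data here is synthetic and chosen for round numbers: 900 requests at 10 ms plus 100 requests at 1000 ms, standing in for GC pauses. All names are hypothetical, and the percentile uses the nearest-rank definition.

```java
import java.util.Arrays;

/** Illustration only: a normal-model p95 estimate versus the real p95 of a sample with outliers. */
final class OutlierDemo {

    static final double Z_P95 = 1.645; // standard normal table value

    static double mean(double[] v) {
        return Arrays.stream(v).average().orElse(Double.NaN);
    }

    /** Population standard deviation. */
    static double stdDev(double[] v) {
        double m = mean(v);
        double sumSq = 0;
        for (double x : v) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / v.length);
    }

    /** What p95 would be if the sample were normally distributed. */
    static double normalEstimateP95(double[] v) {
        return mean(v) + Z_P95 * stdDev(v);
    }

    /** Nearest-rank percentile of the actual sample. */
    static double actualPercentile(double[] v, double p) {
        double[] sorted = v.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** 900 fast requests (10 ms) and 100 requests hit by a pause (1000 ms). */
    static double[] sampleWithPauses() {
        double[] v = new double[1000];
        Arrays.fill(v, 0, 900, 10.0);
        Arrays.fill(v, 900, 1000, 1000.0);
        return v;
    }

    public static void main(String[] args) {
        double[] v = sampleWithPauses();
        System.out.printf("mean = %.1f ms, std. dev. = %.1f ms%n", mean(v), stdDev(v));
        System.out.printf("normal-model p95 = %.1f ms%n", normalEstimateP95(v));
        System.out.printf("actual p95 = %.1f ms, actual median = %.1f ms%n",
                actualPercentile(v, 95), actualPercentile(v, 50));
    }
}
```

For this sample the normal model puts the 95th percentile near 598 ms, while the real value is 1000 ms, and the mean of 109 ms is more than ten times the real median of 10 ms.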
Blue line represents metric reported from Timer class
Green line represents request rate
TIMER, TIMER NEVER CHANGES…
• Timer values decay exponentially
  • giving artificial smoothing of values for server behaviour that may be long gone
• Timer that is not updated does not decay
  • If Timer is not updated (e.g. subprocess failed and we stopped sending requests to it) its values will remain constant
• Check this post for potential solutions: taint.org/2014/01/16/145944a.html
HDRHISTOGRAM
• Supports recording and analysis of sampled data across
configurable range with configurable accuracy
• Provides compact representation of data while retaining
high resolution
• Allows configurable tradeoffs between space and accuracy
• Very fast, allocation free, not thread safe for maximum
speed (thread safe versions available)
• Created by Gil Tene of Azul Systems
RECORDER
• Uses HdrHistogram to store values
• Supports concurrent recording of values
• Recording is lock free and also wait free on most architectures (those that support lock xadd)
• Reading is not lock free but does not stall writers (writer-reader phaser)
• Check out Marshall Pierce’s library for using it as a Reservoir implementation
SOLUTIONS
• Always instantiate Timer with custom reservoir
  • new ExponentiallyDecayingReservoir(LARGE_NUMBER)
  • new SlidingTimeWindowReservoir(1, MINUTES)
  • new HdrHistogramResetOnSnapshotReservoir()
• Only the last one is safe and accurate and will not report stale values if no updates were made
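The reset-on-snapshot idea can be illustrated without any library. The class below is a hypothetical, heavily simplified stand-in: it is not the Dropwizard Reservoir interface and not the HdrHistogram-backed implementation named above. It keeps every value in a list, so its memory use is unbounded, which a real implementation avoids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of reset-on-snapshot behaviour; not a real library class. */
final class ResetOnSnapshotSketch {

    private List<Long> current = new ArrayList<>();

    /** Records one value into the current reporting interval. */
    synchronized void update(long value) {
        current.add(value);
    }

    /** Returns the sorted values recorded since the previous snapshot and starts a fresh interval. */
    synchronized long[] snapshotAndReset() {
        List<Long> taken = current;
        current = new ArrayList<>();
        Collections.sort(taken);
        long[] sorted = new long[taken.size()];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = taken.get(i);
        }
        return sorted;
    }

    /** Largest value of a sorted snapshot, or 0 when nothing was recorded. */
    static long max(long[] sorted) {
        return sorted.length == 0 ? 0 : sorted[sorted.length - 1];
    }

    public static void main(String[] args) {
        ResetOnSnapshotSketch reservoir = new ResetOnSnapshotSketch();
        reservoir.update(5);
        reservoir.update(2000);
        System.out.println("interval 1 max: " + max(reservoir.snapshotAndReset()));
        // No updates in the second interval: the old 2000 ms spike is not reported again.
        System.out.println("interval 2 max: " + max(reservoir.snapshotAndReset()));
    }
}
```

Because each snapshot only covers its own interval, every reported maximum and percentile is exact for that interval, and an idle timer reports nothing instead of repeating old values.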
JMH benchmarks (from my laptop, caveat emptor!)
SMOKING BENCHMARKING IS THE LEADING CAUSE OF STATISTICS IN THE WORLD
COORDINATED OMISSION
• As formulated by Gil Tene of Azul Systems
• When the load driver is plotting with the system under test to deceive you
• Most tools do this
• Most benchmarks do this
• Yahoo Cloud Serving Benchmark had that problem [1]
[1] Recently fixed by Nitsan Wakart, see psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html
[Chart: latency versus request arrival time, with the application pause time marked. Requests according to the test plan are shown; only the red one will be sent. The others will be missing from the test.]
– CREATED WITH GIL TENE'S HDRHISTOGRAM PLOTTING SCRIPT
Effects on benchmarks at high percentiles are
spectacular
COORDINATED OMISSION SOLUTIONS
1. Ignore the problem!
Perfectly fine for non-interactive systems where
only throughput matters
COORDINATED OMISSION SOLUTIONS
2. Correct it mathematically in the sampling mechanism
HdrHistogram can correct CO with these methods
(choose one!):
histogram.recordValueWithExpectedInterval(
    value,
    expectedIntervalBetweenSamples
);
histogram.copyCorrectedForCoordinatedOmission(
    expectedIntervalBetweenSamples
);
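To make the correction less of a black box, here is a standalone sketch of the back-filling idea as I understand it from the HdrHistogram documentation: when a recorded value exceeds the expected interval, a series of decreasing values is added for the samples the stalled driver never sent. This is an illustration with made-up names, not the library's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: back-filling samples that a stalled load driver failed to send. */
final class CoordinatedOmissionSketch {

    /**
     * Returns the measured value plus one synthetic value for every expected
     * sample that was skipped while the measured request was still in flight.
     */
    static List<Long> withBackfill(long value, long expectedInterval) {
        List<Long> recorded = new ArrayList<>();
        recorded.add(value);
        if (expectedInterval <= 0) {
            return recorded; // no correction possible without a known interval
        }
        for (long missing = value - expectedInterval;
             missing >= expectedInterval;
             missing -= expectedInterval) {
            recorded.add(missing);
        }
        return recorded;
    }

    public static void main(String[] args) {
        // One 1000 ms stall while requests were planned every 100 ms.
        System.out.println(withBackfill(1000, 100));
        // A fast response needs no correction.
        System.out.println(withBackfill(50, 100));
    }
}
```

A single 1000 ms stall at a planned 100 ms interval therefore contributes ten samples (1000, 900, ..., 100) instead of one, which is what moves the high percentiles.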
COORDINATED OMISSION SOLUTIONS
3. Correct it on the load driver side
by noticing pauses between sent requests.
A newly issued request will have a timer that starts
counting from the time it should have been sent but
wasn't
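A small deterministic simulation of this idea, under simplifying assumptions: a single-threaded driver that plans one request every intervalMs but cannot send the next request until the previous response has arrived. The names and the simulated service times are made up for illustration.

```java
/** Illustration only: measuring latency from the intended send time instead of the actual one. */
final class IntendedStartTimeSketch {

    /**
     * serviceMs[i] is how long the system under test takes to answer request i.
     * Returns two arrays: [0] latencies measured from the actual send time (naive),
     * [1] latencies measured from the time the request should have been sent (corrected).
     */
    static long[][] simulate(long intervalMs, long[] serviceMs) {
        long[] naive = new long[serviceMs.length];
        long[] corrected = new long[serviceMs.length];
        long driverFreeAt = 0; // when the previous response arrived
        for (int i = 0; i < serviceMs.length; i++) {
            long intendedSend = i * intervalMs;
            long actualSend = Math.max(intendedSend, driverFreeAt);
            long done = actualSend + serviceMs[i];
            naive[i] = done - actualSend;
            corrected[i] = done - intendedSend;
            driverFreeAt = done;
        }
        return new long[][] {naive, corrected};
    }

    public static void main(String[] args) {
        // Planned every 10 ms; the third request hits a 100 ms pause.
        long[][] result = simulate(10, new long[] {1, 1, 100, 1, 1, 1});
        System.out.println("naive:     " + java.util.Arrays.toString(result[0]));
        System.out.println("corrected: " + java.util.Arrays.toString(result[1]));
    }
}
```

The naive measurement sees a single slow request (100 ms), while the corrected one also charges the three requests that were delayed behind it with 91, 82 and 73 ms.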
COORDINATED OMISSION SOLUTIONS
4. Fail the test
for hard real-time systems where a pause causes human
casualties (brakes, pacemakers, Phalanx system)
COORDINATED OMISSION
• Mathematical solutions can overcorrect when the load driver itself has pauses (e.g. GC).
• They do not account for the fact that after a pause the server has no work to do, instead of N more requests waiting to be executed
• In the real world it might never have recovered
• Most tools ignore the problem
• Notable exception: Twitter Iago
– LOAD DRIVER MOTTO
“Do not bend to the tyranny of reality”
SUMMARY
• Measure what is meaningful not just what is measurable
• Set SLA before testing and creating dashboards
• Do not trust Timer class, use custom reservoirs, HdrHistogram,
Recorder, never trust EWMA for request rate
• Do not average percentiles unless you need a random number
generator
• Do not plot averages unless you just want to look good on
dashboards
• When load testing be aware of coordinated omission
SOURCES, THANK YOUS AND RECOMMENDED FOLLOW UPS
• Coda Hale for great metrics library
• Gil Tene
• latencytipoftheday.blogspot.de
• www.infoq.com/presentations/latency-pitfalls
• github.com/HdrHistogram/HdrHistogram
• Nitsan Wakart
• psy-lob-saw.blogspot.de/2015/03/fixing-ycsb-coordinated-omission.html
• and whole blog
• Martin Thompson et al.
• groups.google.com/forum/#!forum/mechanical-sympathy
RECOMMENDED
Great introduction to
statistics and queueing
theory.
Performance Modeling and
Design of Computer
Systems: Queueing Theory in
Action
Prof. Mor Harchol-Balter
FEEDBACK KINDLY REQUESTED
https://www.surveymonkey.com/s/B5KGWWN

Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 

Everybody Lies

  • 1. EVERYBODY LIES, TOMASZ KOWALCZEWSKI
  • 2. CARGO CULT During the Middle Ages there were all kinds of crazy ideas, such as that a piece of rhinoceros horn would increase potency. Then a method was discovered for separating the ideas, which was to try one to see if it worked, and if it didn't work, to eliminate it. This method became organized, of course, into science. And it developed very well, so that we are now in the scientific age. It is such a scientific age, in fact, that we have difficulty in understanding how witch doctors could ever have existed, when nothing that they proposed ever really worked, or very little of it did. Richard Feynman, from a Caltech commencement address given in 1974
  • 3. WHY BOTHER? • You get what you measure - Ineffective optimisations that complicate code + Numbers to convince management to do refactoring or migration to Java 8!
  • 4. WHY BOTHER? • Predictable is better than fast • One page display requires multiple calls (static and dynamic resources) • Multiple microservices are called to generate a response • During a session a user may do hundreds of displays of your webpages
  • 5. WHY DO THIS? • Every 100 ms increase in load time of Amazon.com decreased sales by 1% [1] • Increasing web search latency 100 to 400 ms reduces the daily searches per user by 0.2% to 0.6%. Furthermore, users do fewer searches the longer they are exposed. For longer delays, the loss of searches persists for a time even after latency returns to previous levels. [2] [1] Kohavi and Longbotham 2007 [2] Brutlag 2009
  • 6-11. SURVEY • Do you… • Use Graphite? • Feed it with Coda Hale/Dropwizard metrics? • Modify their source? Use nonstandard options? • Graph average? Median? • Percentiles?
  • 13. WHAT METRICS CAN WE USE?
 graphite.send(prefix(name, "max"), ...);
 graphite.send(prefix(name, "mean"), ...);
 graphite.send(prefix(name, "min"), ...);
 graphite.send(prefix(name, "stddev"), ...);
 graphite.send(prefix(name, "p50"), ...);
 graphite.send(prefix(name, "p75"), ...);
 graphite.send(prefix(name, "p95"), ...);
 graphite.send(prefix(name, "p98"), ...);
 graphite.send(prefix(name, "p99"), ...);
 graphite.send(prefix(name, "p999"), ...);
  • 14. DON'T LOOK AT MEAN • 1000 queries - 0 ms latency, 100 queries - 5 s latency • Average is ~454.5 ms • 1000 queries - 1 ms latency, 100 queries - 5 s latency • Average is ~455.5 ms • Neither figure matches what any user saw (0 to 1 ms, or 5 s), so the mean does not help to quantify the lags users will experience
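To make the arithmetic on this slide concrete, here is a minimal sketch that builds the two request mixes exactly as stated (1000 fast requests plus 100 five-second requests) and compares the mean with nearest-rank percentiles. The class and method names are made up for illustration, and nearest-rank is just one common percentile definition.

```java
import java.util.Arrays;

public class MeanVsPercentile {
    /** Arithmetic mean of the samples, in milliseconds. */
    public static double mean(long[] samples) {
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    public static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Builds fastCount samples of fastMs followed by slowCount samples of slowMs. */
    public static long[] scenario(int fastCount, long fastMs, int slowCount, long slowMs) {
        long[] samples = new long[fastCount + slowCount];
        Arrays.fill(samples, 0, fastCount, fastMs);
        Arrays.fill(samples, fastCount, samples.length, slowMs);
        return samples;
    }

    public static void main(String[] args) {
        long[] a = scenario(1000, 0, 100, 5000); // 0 ms fast path
        long[] b = scenario(1000, 1, 100, 5000); // 1 ms fast path
        for (long[] s : new long[][] {a, b}) {
            System.out.println("mean=" + mean(s) + " ms, p50=" + percentile(s, 50)
                    + " ms, p95=" + percentile(s, 95) + " ms");
        }
    }
}
```

Both mixes have a mean of roughly 455 ms, a latency no request actually had, while the median sits on the fast path and the 95th percentile on the 5 s stall.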
  • 15. ANSCOMBE'S QUARTET BY FRANCIS ANSCOMBE These four data sets all have nearly identical mean, variance, correlation and regression line, yet look completely different when plotted
  • 16. PLOTTING MEAN IS FOR SHOWING OFF TO MANAGEMENT
  • 17. MAYBE MEDIAN THEN? • What is the probability of an end user encountering latency worse than the median? • Remember: usually multiple requests are needed to respond to an API call (e.g. N microservices, N resource requests per page) • Probability (in %) that all N requests are at or better than the median: $\left(\frac{1}{2}\right)^{N} \cdot 100$
  • 18. PROBABILITY OF EXPERIENCING LATENCY BETTER THAN MEDIAN IN FUNCTION OF MICROSERVICES INVOLVED [Plot: probability (%) against number of microservices, 0 to 10]
  • 19. WHICH PERCENTILE IS RELEVANT TO YOU? • Is the 99th percentile a demanding constraint? • In an application serving 1000 qps, latency worse than that happens ten times per second. • A user that needs to navigate through several web pages will most probably experience it • What is the probability of encountering latency better than the 99th percentile? $\left(\frac{99}{100}\right)^{N} \cdot 100$
  • 20. PROBABILITY OF EXPERIENCING LATENCY BETTER THAN 99TH PERCENTILE IN FUNCTION OF MICROSERVICES INVOLVED [Plot: probability (%) against number of microservices, 0 to 100]
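The two probability slides can be reproduced with a few lines of code. This is a sketch under one explicit assumption that the slides leave implicit: the N requests are treated as statistically independent. The class and method names are illustrative.

```java
public class PercentileOdds {
    /**
     * Probability (0..1) that all n requests finish at or below the given
     * latency percentile, ASSUMING the requests are independent.
     */
    public static double allWithin(double percentile, int n) {
        return Math.pow(percentile / 100.0, n);
    }

    /** Probability (0..1) that at least one of n requests is slower than the percentile. */
    public static double atLeastOneWorse(double percentile, int n) {
        return 1.0 - allWithin(percentile, n);
    }

    public static void main(String[] args) {
        int[] fanOuts = {1, 2, 5, 10, 50, 100};
        for (int n : fanOuts) {
            System.out.println("N=" + n
                    + "  all better than median: " + 100 * allWithin(50, n) + "%"
                    + "  all better than p99: " + 100 * allWithin(99, n) + "%");
        }
    }
}
```

With 100 requests per page or session, the chance of never seeing a latency worse than the 99th percentile drops to about 37%, which is why the 99th percentile is less exotic than it sounds.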
  • 21. DO NOT AVERAGE PERCENTILES Example scenario: 1. Load balancer splits traffic unevenly (ELB anyone?) 2. Server S1 has 1 qps over the measured time with 95%’ile == 1 ms 3. Server S2 has 100 qps over the measured time with 95%’ile == 10 s 4. Average of the two is ~5 s. 5. What does that tell us? 6. Did we satisfy the SLA if it says “95%’ile must be below 8 s”? 7. The actual 95%’ile is ~10 s
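The scenario can be checked numerically. The sketch below invents concrete sample sets that match the slide's description over an assumed 60-second window (S1: 60 requests at 1 ms; S2: 6000 requests, of which 10% take 10 s and the rest 200 ms, a made-up split that yields a 10 s 95th percentile). It then compares the average of the per-server percentiles with the percentile of the pooled samples.

```java
import java.util.Arrays;

public class AveragedPercentiles {
    /** Nearest-rank percentile over a copy of the samples. */
    public static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** count samples, all equal to valueMs. */
    public static long[] constant(int count, long valueMs) {
        long[] out = new long[count];
        Arrays.fill(out, valueMs);
        return out;
    }

    /** Concatenates two sample arrays. */
    public static long[] concat(long[] a, long[] b) {
        long[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    /** Hypothetical S1: 1 qps for 60 s, every request takes 1 ms. */
    public static long[] s1() {
        return constant(60, 1);
    }

    /** Hypothetical S2: 100 qps for 60 s, 90% at 200 ms and 10% at 10 s. */
    public static long[] s2() {
        return concat(constant(5400, 200), constant(600, 10_000));
    }

    public static void main(String[] args) {
        long p95s1 = percentile(s1(), 95);
        long p95s2 = percentile(s2(), 95);
        double averaged = (p95s1 + p95s2) / 2.0;
        long pooled = percentile(concat(s1(), s2()), 95);
        System.out.println("average of per-server p95: " + averaged + " ms");
        System.out.println("p95 of all requests:       " + pooled + " ms");
    }
}
```

The averaged figure lands near 5 s and would pass an 8 s SLA, while the pooled 95th percentile is 10 s. Percentiles can only be combined correctly from the underlying samples or from mergeable histograms.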
  • 22. “If there's no meaning in it,” said the King, “that saves a world of trouble, you know, as we needn't try to find any.” (Alice's Adventures in Wonderland)
  • 23. Every time you average max values, someone in the world starts a new JavaScript framework
  • 25-26. metricRegistry.timer("2015.standardTimer"); The standard timer will over- or under-report actual percentiles at will. The green line represents actual MAX values.
  • 27. TIMER’S HISTOGRAM RESERVOIR • Backing storage for Timer’s data • Contains a “statistically representative reservoir of a data stream” • Default is ExponentiallyDecayingReservoir, which has many drawbacks and is the source of most inaccuracies observed throughout this presentation • Others include • UniformReservoir, SlidingTimeWindowReservoir, SlidingWindowReservoir
  • 28. EXPONENTIALLY DECAYING RESERVOIR • Stores 1028 random samples by default • Assumes normal distribution of recorded values • Many statistical tools applied in computer systems monitoring will assume normal distribution • Be suspicious of such tools • Why is that a bad idea?
  • 29. NORMAL DISTRIBUTION - WHY SO USEFUL? • Central limit theorem • Chebyshev's inequality $f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ [Plot: normal distribution density curve]
  • 30. CALCULATE 95%’ILE BASED ON MEAN AND STD. DEV. • IFF latency values were distributed normally then we could calculate any percentile based on mean and standard deviation: $\mu = 10$ ms, $\sigma = 1$ ms • Look up the standard normal (Z) table • 95%’ile is located 1.65 std. dev. from the mean • Result is 11.65 ms [Plot: normal density centred at 10 ms]
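A short sketch of the calculation above, followed by what happens when the same formula is applied to data that is not normal. The constant 1.645 is the usual Z-table value for the 95th percentile (the slide rounds it to 1.65). The skewed sample set (900 requests at 10 ms plus 100 pauses of 1 s) is invented purely for illustration.

```java
import java.util.Arrays;

public class NormalPercentile {
    /** Standard normal quantile for the 95th percentile, from a Z table. */
    public static final double Z_95 = 1.645;

    /** Percentile estimate that is only valid if the data is normally distributed. */
    public static double percentileFromMoments(double mean, double stdDev, double z) {
        return mean + z * stdDev;
    }

    public static double mean(long[] samples) {
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /** Population standard deviation. */
    public static double stdDev(long[] samples) {
        double m = mean(samples);
        double squares = 0;
        for (long s : samples) {
            squares += (s - m) * (s - m);
        }
        return Math.sqrt(squares / samples.length);
    }

    /** Nearest-rank percentile, measured directly from the samples. */
    public static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Hypothetical skewed latencies: 900 requests at 10 ms, 100 stalled for 1000 ms. */
    public static long[] skewedSamples() {
        long[] samples = new long[1000];
        Arrays.fill(samples, 0, 900, 10);
        Arrays.fill(samples, 900, 1000, 1000);
        return samples;
    }

    public static void main(String[] args) {
        System.out.println("slide example: " + percentileFromMoments(10, 1, Z_95) + " ms");
        long[] skewed = skewedSamples();
        double estimate = percentileFromMoments(mean(skewed), stdDev(skewed), Z_95);
        System.out.println("normal-theory p95 on skewed data: " + estimate + " ms");
        System.out.println("measured p95 on skewed data:      " + percentile(skewed, 95) + " ms");
    }
}
```

On the skewed data the normal-theory estimate comes out near 600 ms while the measured 95th percentile is 1000 ms, so the shortcut under-reports by a wide margin once pauses enter the picture.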
  • 31. Latency profile resembling normal distribution…
  • 32. Add spikes due to young gen GC pauses
  • 33. Add spikes due to old gen GC pauses
  • 34. Add spikes due to calling other services (like DB)
  • 35. Add spikes due to: retransmission of lost TCP packets, disk swapping, kernel bookkeeping etc.
  • 36. NORMAL DISTRIBUTION - WHY NOT APPLICABLE? • The value of the normal distribution is practically zero when the value x lies more than a few standard deviations away from the mean. • It may not be an appropriate model when one expects a significant fraction of outliers • […] other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data. [1] $f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ [1] All quotes on this slide from Wikipedia
  • 37. The blue line represents the metric reported from the Timer class. The green line represents the request rate.
  • 38. TIMER, TIMER NEVER CHANGES… • Timer values decay exponentially • giving artificial smoothing of values for server behaviour that may be long gone • A Timer that is not updated does not decay • If a Timer is not updated (e.g. a subprocess failed and we stopped sending requests to it) its values will remain constant • Check this post for potential solutions: taint.org/2014/01/16/145944a.html
  • 39. HDRHISTOGRAM • Supports recording and analysis of sampled data across a configurable range with configurable accuracy • Provides a compact representation of data while retaining high resolution • Allows configurable tradeoffs between space and accuracy • Very fast, allocation free, not thread safe for maximum speed (thread safe versions available) • Created by Gil Tene of Azul Systems
  • 40. RECORDER • Uses HdrHistogram to store values • Supports concurrent recording of values • Recording is lock-free, and also wait-free on most architectures (those that support lock xadd) • Reading is not lock-free but does not stall writers (writer-reader phaser) • Check out Marshall Pierce’s library for using it as a Reservoir implementation
  • 41. SOLUTIONS • Always instantiate Timer with a custom reservoir • new ExponentiallyDecayingReservoir(LARGE_NUMBER) • new SlidingTimeWindowReservoir(1, MINUTES) • new HdrHistogramResetOnSnapshotReservoir() • Only the last one is safe and accurate, and will not report stale values if no updates were made
  • 42. JMH benchmarks (from my laptop, caveat emptor!)
  • 43. SMOKING BENCHMARKING IS THE LEADING CAUSE OF STATISTICS IN THE WORLD
  • 44. COORDINATED OMISSION • As formulated by Gil Tene of Azul Systems • When the load driver is plotting with the system under test to deceive you • Most tools do this • Most benchmarks do this • Yahoo Cloud Serving Benchmark had that problem [1] [1] Recently fixed by Nitsan Wakart, see psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html
  • 45. [Figure: latency against request arrival time around an application pause] Requests according to the test plan. Only the red one will be sent. The others will be missing from the test.
  • 46. Effects on benchmarks at high percentiles are spectacular (created with Gil Tene's HdrHistogram plotting script)
  • 47. COORDINATED OMISSION SOLUTIONS 1. Ignore the problem! Perfectly fine for a non-interactive system where only throughput matters
  • 48. COORDINATED OMISSION SOLUTIONS 2. Correct it mathematically in the sampling mechanism. HdrHistogram can correct CO with these methods (choose one!):
 histogram.recordValueWithExpectedInterval(value, expectedIntervalBetweenSamples);
 histogram.copyCorrectedForCoordinatedOmission(expectedIntervalBetweenSamples);
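The idea behind this kind of mathematical correction can be sketched without the library. The code below is a plain-Java illustration of back-filling, not HdrHistogram's implementation: when a recorded latency exceeds the expected interval between samples, it adds the samples the stalled load driver would have produced had it kept sending on schedule. The class name and the use of a `List` as storage are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class CoordinatedOmissionBackfill {
    /**
     * Records value and, if it is longer than expectedInterval, also records
     * synthetic samples for the requests that were never sent during the stall.
     * Each omitted request would have waited expectedInterval less than the one before it.
     */
    public static void recordWithExpectedInterval(List<Long> samples, long value, long expectedInterval) {
        samples.add(value);
        if (expectedInterval <= 0) {
            return; // no schedule known, nothing to correct
        }
        for (long missing = value - expectedInterval;
             missing >= expectedInterval;
             missing -= expectedInterval) {
            samples.add(missing);
        }
    }

    public static void main(String[] args) {
        List<Long> samples = new ArrayList<>();
        // One request every 100 ms was planned; one of them took a full second.
        recordWithExpectedInterval(samples, 1000, 100);
        // A normal request adds just itself.
        recordWithExpectedInterval(samples, 5, 100);
        System.out.println(samples);
    }
}
```

A single 1 s stall at a 100 ms cadence therefore contributes ten samples (1000, 900, ..., 100) rather than one, which is what moves the high percentiles.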
  • 49. COORDINATED OMISSION SOLUTIONS 3. Correct it on the load driver side by noticing pauses between sent requests. A newly issued request will have a timer that starts counting from the time it should have been sent but wasn't
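A small simulation shows the difference between timing from the actual send and timing from the intended send. Everything here is a simplified, hypothetical model: a single-threaded driver that waits for each response, a fixed service time, and one server pause during which any arriving request completes only after the pause ends. No real clock is used, so the run is deterministic.

```java
public class IntendedSendTime {
    /**
     * Simulates n requests planned every intervalMs. Returns two arrays:
     * [0] naive latency (completion minus actual send time),
     * [1] corrected latency (completion minus intended send time).
     */
    public static long[][] simulate(int n, long intervalMs, long serviceMs,
                                    long pauseStartMs, long pauseEndMs) {
        long[] naive = new long[n];
        long[] corrected = new long[n];
        long now = 0; // time at which the driver is free to send again
        for (int i = 0; i < n; i++) {
            long intended = i * intervalMs;
            long actualSend = Math.max(now, intended);
            boolean hitsPause = actualSend >= pauseStartMs && actualSend < pauseEndMs;
            long completion = (hitsPause ? pauseEndMs : actualSend) + serviceMs;
            naive[i] = completion - actualSend;
            corrected[i] = completion - intended;
            now = completion;
        }
        return new long[][] {naive, corrected};
    }

    /** Number of samples strictly above the threshold. */
    public static int countAbove(long[] samples, long thresholdMs) {
        int count = 0;
        for (long s : samples) {
            if (s > thresholdMs) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 50 requests, one per 100 ms, 1 ms service time, server paused from t=1000 to t=3000.
        long[][] result = simulate(50, 100, 1, 1000, 3000);
        System.out.println("slow requests seen by naive timing:     " + countAbove(result[0], 1));
        System.out.println("slow requests seen by corrected timing: " + countAbove(result[1], 1));
    }
}
```

In this model the naive timer reports a single slow request, whereas timing from the intended send time shows every request that was queued behind the pause as slow, until the driver catches up with its schedule.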
  • 50. COORDINATED OMISSION SOLUTIONS 4. Fail the test. For hard real-time systems where a pause causes human casualties (brakes, pacemakers, Phalanx system)
  • 51. COORDINATED OMISSION • Mathematical solutions can overcorrect when the load driver has pauses (e.g. GC). • They do not account for the fact that the server, after a pause, has no work to do instead of N more requests waiting to be executed • In the real world it might never have recovered • Most tools ignore the problem • Notable exception: Twitter Iago
  • 52. “Do not bend to the tyranny of reality” (load driver motto)
  • 53. SUMMARY • Measure what is meaningful, not just what is measurable • Set SLA before testing and creating dashboards • Do not trust the Timer class; use custom reservoirs, HdrHistogram, Recorder; never trust EWMA for request rate • Do not average percentiles unless you need a random number generator • Do not plot averages unless you just want to look good on dashboards • When load testing be aware of coordinated omission
  • 54. SOURCES, THANK YOUS AND RECOMMENDED FOLLOW UPS • Coda Hale for a great metrics library • Gil Tene • latencytipoftheday.blogspot.de • www.infoq.com/presentations/latency-pitfalls • github.com/HdrHistogram/HdrHistogram • Nitsan Wakart • psy-lob-saw.blogspot.de/2015/03/fixing-ycsb-coordinated-omission.html • and the whole blog • Martin Thompson et al. • groups.google.com/forum/#!forum/mechanical-sympathy
  • 55. RECOMMENDED Great introduction to statistics and queueing theory. Performance Modeling and Design of Computer Systems: Queueing Theory in Action, Prof. Mor Harchol-Balter
  • 56. FEEDBACK KINDLY REQUESTED https://www.surveymonkey.com/s/B5KGWWN