SlideShare a Scribd company logo
1 of 254
Download to read offline
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
How NOT to Measure
Latency
An attempt to confer wisdom...
Gil Tene, CTO & co-Founder, Azul Systems
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
High level agenda
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
High level agenda
Some latency behavior background
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
High level agenda
Some latency behavior background
The pitfalls of using “statistics”
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
High level agenda
Some latency behavior background
The pitfalls of using “statistics”
Latency “philosophy” questions
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
High level agenda
Some latency behavior background
The pitfalls of using “statistics”
Latency “philosophy” questions
The Coordinated Omission Problem
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
High level agenda
Some latency behavior background
The pitfalls of using “statistics”
Latency “philosophy” questions
The Coordinated Omission Problem
Some useful tools
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
High level agenda
Some latency behavior background
The pitfalls of using “statistics”
Latency “philosophy” questions
The Coordinated Omission Problem
Some useful tools
Use tools for bragging
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About me: Gil Tene
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About me: Gil Tene
co-founder, CTO @Azul
Systems
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About me: Gil Tene
co-founder, CTO @Azul
Systems
Have been working on “think
different” GC approaches
since 2002
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About me: Gil Tene
co-founder, CTO @Azul
Systems
Have been working on “think
different” GC approaches
since 2002
* working on real-world trash compaction issues, circa 2004
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About me: Gil Tene
co-founder, CTO @Azul
Systems
Have been working on “think
different” GC approaches
since 2002
Created Pauseless & C4 core
GC algorithms (Tene, Wolf)
* working on real-world trash compaction issues, circa 2004
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About me: Gil Tene
co-founder, CTO @Azul
Systems
Have been working on “think
different” GC approaches
since 2002
Created Pauseless & C4 core
GC algorithms (Tene, Wolf)
A Long history building
Virtual & Physical Machines,
Operating Systems,
Enterprise apps, etc...
* working on real-world trash compaction issues, circa 2004
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About me: Gil Tene
co-founder, CTO @Azul
Systems
Have been working on “think
different” GC approaches
since 2002
Created Pauseless & C4 core
GC algorithms (Tene, Wolf)
A Long history building
Virtual & Physical Machines,
Operating Systems,
Enterprise apps, etc...
JCP EC Member...
* working on real-world trash compaction issues, circa 2004
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
We make scalable Virtual
Machines
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
We make scalable Virtual
Machines
Have built “whatever it takes to
get job done” since 2002
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
We make scalable Virtual
Machines
Have built “whatever it takes to
get job done” since 2002
3 generations of custom SMP
Multi-core HW (Vega)
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
We make scalable Virtual
Machines
Have built “whatever it takes to
get job done” since 2002
3 generations of custom SMP
Multi-core HW (Vega)
Vega
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
We make scalable Virtual
Machines
Have built “whatever it takes to
get job done” since 2002
3 generations of custom SMP
Multi-core HW (Vega)
Vega
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
We make scalable Virtual
Machines
Have built “whatever it takes to
get job done” since 2002
3 generations of custom SMP
Multi-core HW (Vega)
Zing: Pure software for
commodity x86
Vega
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
We make scalable Virtual
Machines
Have built “whatever it takes to
get job done” since 2002
3 generations of custom SMP
Multi-core HW (Vega)
Zing: Pure software for
commodity x86
Vega
C4
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
About Azul
We make scalable Virtual
Machines
Have built “whatever it takes to
get job done” since 2002
3 generations of custom SMP
Multi-core HW (Vega)
Zing: Pure software for
commodity x86
Known for Low Latency,
Consistent execution, and Large
data set excellence
Vega
C4
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common fallacies
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common fallacies
Computers run application code continuously
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common fallacies
Computers run application code continuously
Response time can be measured as work units/time
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common fallacies
Computers run application code continuously
Response time can be measured as work units/time
Response time exhibits a normal distribution
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common fallacies
Computers run application code continuously
Response time can be measured as work units/time
Response time exhibits a normal distribution
“Glitches” or “Semi-random omissions” in
measurement don’t have a big effect.
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common fallacies
Computers run application code continuously
Response time can be measured as work units/time
Response time exhibits a normal distribution
“Glitches” or “Semi-random omissions” in
measurement don’t have a big effect.
Wrong!
Wrong!
Wrong!
Wrong!
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
A classic look at response
time behavior
Response time as a function of load
source: IBM CICS server documentation, “understanding response times”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
A classic look at response
time behavior
Response time as a function of load
source: IBM CICS server documentation, “understanding response times”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
A classic look at response
time behavior
Response time as a function of load
source: IBM CICS server documentation, “understanding response times”
Average?
Max?
Median?
90%?
99.9%
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
A classic look at response
time behavior
Response time as a function of load
source: IBM CICS server documentation, “understanding response times”
Average?
Max?
Median?
90%?
99.9%
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
Interpreting Results
From this graph it is possible to identify peaks in the response time over a period of time. Helps to
identify issues when server is continuously loaded for long duration of time.
Response Time Graph (per Transaction)
Response Time versus Elapsed Time report indicates the average response time of the transactions
over the elapsed time of the load test as shown.
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
Interpreting Results
From this graph it is possible to identify peaks in the response time over a period of time. Helps to
identify issues when server is continuously loaded for long duration of time.
Response Time Graph (per Transaction)
Response Time versus Elapsed Time report indicates the average response time of the transactions
over the elapsed time of the load test as shown.
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
Interpreting Results
From this graph it is possible to identify peaks in the response time over a period of time. Helps to
identify issues when server is continuously loaded for long duration of time.
Response Time Graph (per Transaction)
Response Time versus Elapsed Time report indicates the average response time of the transactions
over the elapsed time of the load test as shown.
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
Interpreting Results
From this graph it is possible to identify peaks in the response time over a period of time. Helps to
identify issues when server is continuously loaded for long duration of time.
Response Time Graph (per Transaction)
Response Time versus Elapsed Time report indicates the average response time of the transactions
over the elapsed time of the load test as shown.
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
Interpreting Results
From this graph it is possible to identify peaks in the response time over a period of time. Helps to
identify issues when server is continuously loaded for long duration of time.
Response Time Graph (per Transaction)
Response Time versus Elapsed Time report indicates the average response time of the transactions
over the elapsed time of the load test as shown.
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
Interpreting Results
From this graph it is possible to identify peaks in the response time over a period of time. Helps to
identify issues when server is continuously loaded for long duration of time.
Response Time Graph (per Transaction)
Response Time versus Elapsed Time report indicates the average response time of the transactions
over the elapsed time of the load test as shown.
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
Interpreting Results
From this graph it is possible to identify peaks in the response time over a period of time. Helps to
identify issues when server is continuously loaded for long duration of time.
Response Time Graph (per Transaction)
Response Time versus Elapsed Time report indicates the average response time of the transactions
over the elapsed time of the load test as shown.
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
Interpreting Results
From this graph it is possible to identify peaks in the response time over a period of time. Helps to
identify issues when server is continuously loaded for long duration of time.
Response Time Graph (per Transaction)
Response Time versus Elapsed Time report indicates the average response time of the transactions
over the elapsed time of the load test as shown.
“Hiccups”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
8000"
9000"
0" 20" 40" 60" 80" 100" 120" 140" 160" 180"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
What happened here?
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
8000"
9000"
0" 20" 40" 60" 80" 100" 120" 140" 160" 180"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
What happened here?
“Hiccups”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
8000"
9000"
0" 20" 40" 60" 80" 100" 120" 140" 160" 180"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
What happened here?
Source: Gil running an idle program and suspending it five times in the middle
“Hiccups”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
25"
Hiccups&by&Percen*le&Distribu*on&
The real world (a low latency example)
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
25"
Hiccups&by&Percen*le&Distribu*on&
The real world (a low latency example)
99%‘ile is ~60 usec
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
25"
Hiccups&by&Percen*le&Distribu*on&
The real world (a low latency example)
99%‘ile is ~60 usec
Max is ~30,000%
higher than “typical”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Hiccups are [typically]
strongly multi-modal
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Hiccups are [typically]
strongly multi-modal
They don’t look anything like a normal distribution
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Hiccups are [typically]
strongly multi-modal
They don’t look anything like a normal distribution
They usually look like periodic freezes
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Hiccups are [typically]
strongly multi-modal
They don’t look anything like a normal distribution
They usually look like periodic freezes
A complete shift from one mode/behavior to another
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Hiccups are [typically]
strongly multi-modal
They don’t look anything like a normal distribution
They usually look like periodic freezes
A complete shift from one mode/behavior to another
Mode A: “good”.
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Hiccups are [typically]
strongly multi-modal
They don’t look anything like a normal distribution
They usually look like periodic freezes
A complete shift from one mode/behavior to another
Mode A: “good”.
Mode B: “Somewhat bad”
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Hiccups are [typically]
strongly multi-modal
They don’t look anything like a normal distribution
They usually look like periodic freezes
A complete shift from one mode/behavior to another
Mode A: “good”.
Mode B: “Somewhat bad”
Mode C: “terrible”, ...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Hiccups are [typically]
strongly multi-modal
They don’t look anything like a normal distribution
They usually look like periodic freezes
A complete shift from one mode/behavior to another
Mode A: “good”.
Mode B: “Somewhat bad”
Mode C: “terrible”, ...
....
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common ways people deal with hiccups
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common ways people deal with hiccups
Averages and Standard Deviation
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common ways people deal with hiccups
Averages and Standard Deviation
Always
Wrong!
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Better ways people can deal with hiccups
Actually measuring percentiles0"
20"
40"
60"
80"
100"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(ms
&Elapsed&Time&(sec)&
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Better ways people can deal with hiccups
Actually measuring percentiles0"
20"
40"
60"
80"
100"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(ms
&Elapsed&Time&(sec)&
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
Response Time
Percentile
plot line
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Better ways people can deal with hiccups
Actually measuring percentiles0"
20"
40"
60"
80"
100"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(ms
&Elapsed&Time&(sec)&
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
Requirements
Response Time
Percentile
plot line
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Requirements
Why we measure latency and response times
to begin with...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Latency tells us how long
something took
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Latency tells us how long
something took
But what do we WANT the latency to be?
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Latency tells us how long
something took
But what do we WANT the latency to be?
What do we want the latency to BEHAVE like?
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Latency tells us how long
something took
But what do we WANT the latency to be?
What do we want the latency to BEHAVE like?
Latency requirements are usually a PASS/FAIL test of
some predefined criteria
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Latency tells us how long
something took
But what do we WANT the latency to be?
What do we want the latency to BEHAVE like?
Latency requirements are usually a PASS/FAIL test of
some predefined criteria
Different applications have different needs
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Latency tells us how long
something took
But what do we WANT the latency to be?
What do we want the latency to BEHAVE like?
Latency requirements are usually a PASS/FAIL test of
some predefined criteria
Different applications have different needs
Requirements should reflect application needs
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Latency tells us how long
something took
But what do we WANT the latency to be?
What do we want the latency to BEHAVE like?
Latency requirements are usually a PASS/FAIL test of
some predefined criteria
Different applications have different needs
Requirements should reflect application needs
Measurements should provide data to evaluate
requirements
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The Olympics
aka “ring the bell first”
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The Olympics
aka “ring the bell first”
Goal: Get gold medals
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The Olympics
aka “ring the bell first”
Goal: Get gold medals
Need to be faster than everyone else at SOME races
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The Olympics
aka “ring the bell first”
Goal: Get gold medals
Need to be faster than everyone else at SOME races
Ok to be slower in some, as long as fastest at some
(the average speed doesn’t matter)
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The Olympics
aka “ring the bell first”
Goal: Get gold medals
Need to be faster than everyone else at SOME races
Ok to be slower in some, as long as fastest at some
(the average speed doesn’t matter)
Ok to not even finish or compete (the worst case and
99%‘ile don’t matter)
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The Olympics
aka “ring the bell first”
Goal: Get gold medals
Need to be faster than everyone else at SOME races
Ok to be slower in some, as long as fastest at some
(the average speed doesn’t matter)
Ok to not even finish or compete (the worst case and
99%‘ile don’t matter)
Different strategies can apply. E.g. compete in only 3
races to not risk burning out, or compete in 8 races
in hope of winning two
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Pacemakers
aka “hard” real time
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Pacemakers
aka “hard” real time
Goal: Keep heart beating
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Pacemakers
aka “hard” real time
Goal: Keep heart beating
Need to never be slower than X
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Pacemakers
aka “hard” real time
Goal: Keep heart beating
Need to never be slower than X
“Your heart will keep beating 99.9% of the time” is
not reassuring
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Pacemakers
aka “hard” real time
Goal: Keep heart beating
Need to never be slower than X
“Your heart will keep beating 99.9% of the time” is
not reassuring
Having a good average and a nice standard deviation
don’t matter or help
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Pacemakers
aka “hard” real time
Goal: Keep heart beating
Need to never be slower than X
“Your heart will keep beating 99.9% of the time” is
not reassuring
Having a good average and a nice standard deviation
don’t matter or help
The worst case is all that matters
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
“Low Latency” Trading
aka “soft” real time
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
“Low Latency” Trading
aka “soft” real time
Goal A: Be fast enough to make some good plays
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
“Low Latency” Trading
aka “soft” real time
Goal A: Be fast enough to make some good plays
Goal B: Contain risk and exposure while making plays
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
“Low Latency” Trading
aka “soft” real time
Goal A: Be fast enough to make some good plays
Goal B: Contain risk and exposure while making plays
E.g. want to “typically” react within 200 usec.
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
“Low Latency” Trading
aka “soft” real time
Goal A: Be fast enough to make some good plays
Goal B: Contain risk and exposure while making plays
E.g. want to “typically” react within 200 usec.
But can’t afford to hold open position for 20 msec, or
react to 30 msec stale information
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
“Low Latency” Trading
aka “soft” real time
Goal A: Be fast enough to make some good plays
Goal B: Contain risk and exposure while making plays
E.g. want to “typically” react within 200 usec.
But can’t afford to hold open position for 20 msec, or
react to 30 msec stale information
So we want a very good “typical” (median, 50%‘ile)
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
“Low Latency” Trading
aka “soft” real time
Goal A: Be fast enough to make some good plays
Goal B: Contain risk and exposure while making plays
E.g. want to “typically” react within 200 usec.
But can’t afford to hold open position for 20 msec, or
react to 30 msec stale information
So we want a very good “typical” (median, 50%‘ile)
But we also need a reasonable Max, or 99.99%‘ile
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Interactive applications
aka “squishy” real time
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Interactive applications
aka “squishy” real time
Goal: Keep users happy enough to not complain/leave
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Interactive applications
aka “squishy” real time
Goal: Keep users happy enough to not complain/leave
Need to have “typically snappy” behavior
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Interactive applications
aka “squishy” real time
Goal: Keep users happy enough to not complain/leave
Need to have “typically snappy” behavior
Ok to have occasional longer times, but not too high, and
not too often
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Interactive applications
aka “squishy” real time
Goal: Keep users happy enough to not complain/leave
Need to have “typically snappy” behavior
Ok to have occasional longer times, but not too high, and
not too often
Example: 90% of responses should be below 0.2 sec, 99%
should be below 0.5 sec, 99.9 should be better than 2
seconds. And a >10 second response should never happen.
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Interactive applications
aka “squishy” real time
Goal: Keep users happy enough to not complain/leave
Need to have “typically snappy” behavior
Ok to have occasional longer times, but not too high, and
not too often
Example: 90% of responses should be below 0.2 sec, 99%
should be below 0.5 sec, 99.9 should be better than 2
seconds. And a >10 second response should never happen.
Remember: A single user may have 100s of interactions
per session...
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20 msec
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20 msec
Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20 msec
Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?
A: We don’t have one
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20 msec
Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?
A: We don’t have one
Q: So it’s ok for some things to take more than 5 hours?
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20 msec
Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?
A: We don’t have one
Q: So it’s ok for some things to take more than 5 hours?
A: No way in H%%&!
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20 msec
Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?
A: We don’t have one
Q: So it’s ok for some things to take more than 5 hours?
A: No way in H%%&!
Q: So I’ll write down “5 hours worst case...”
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20 msec
Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?
A: We don’t have one
Q: So it’s ok for some things to take more than 5 hours?
A: No way in H%%&!
Q: So I’ll write down “5 hours worst case...”
A: No. That’s not what I said. Make that “nothing worse than 100 msec”
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20 msec
Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?
A: We don’t have one
Q: So it’s ok for some things to take more than 5 hours?
A: No way in H%%&!
Q: So I’ll write down “5 hours worst case...”
A: No. That’s not what I said. Make that “nothing worse than 100 msec”
Q: Are you sure? Even if it’s only two times a day?
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Q: What are your latency requirements?
A: We need an avg. response of 20 msec
Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?
A: We don’t have one
Q: So it’s ok for some things to take more than 5 hours?
A: No way in H%%&!
Q: So I’ll write down “5 hours worst case...”
A: No. That’s not what I said. Make that “nothing worse than 100 msec”
Q: Are you sure? Even if it’s only two times a day?
A: Ok... Make it “nothing worse than 2 seconds...”
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok
to have a 1 second response?
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok
to have a 1 second response?
A: (Annoyed) I thought you said only a few times a day
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok
to have a 1 second response?
A: (Annoyed) I thought you said only a few times a day
Q: That was for the worst case. But if half the results are better than 20 msec, is it ok
for the other half to be just short of 2 seconds? What % of the time are you willing to
take a 1 second, or a half second hiccup? Or some other level?
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok
to have a 1 second response?
A: (Annoyed) I thought you said only a few times a day
Q: That was for the worst case. But if half the results are better than 20 msec, is it ok
for the other half to be just short of 2 seconds? What % of the time are you willing to
take a 1 second, or a half second hiccup? Or some other level?
A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing
money even when we are fast the rest of the time. We need to be better than 500
msec 99.9% of the time, or our customers will complain and go elsewhere
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok
to have a 1 second response?
A: (Annoyed) I thought you said only a few times a day
Q: That was for the worst case. But if half the results are better than 20 msec, is it ok
for the other half to be just short of 2 seconds? What % of the time are you willing to
take a 1 second, or a half second hiccup? Or some other level?
A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing
money even when we are fast the rest of the time. We need to be better than 500
msec 99.9% of the time, or our customers will complain and go elsewhere
Now we have a service level expectation:
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok
to have a 1 second response?
A: (Annoyed) I thought you said only a few times a day
Q: That was for the worst case. But if half the results are better than 20 msec, is it ok
for the other half to be just short of 2 seconds? What % of the time are you willing to
take a 1 second, or a half second hiccup? Or some other level?
A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing
money even when we are fast the rest of the time. We need to be better than 500
msec 99.9% of the time, or our customers will complain and go elsewhere
Now we have a service level expectation:
50% better than 20 msec
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok
to have a 1 second response?
A: (Annoyed) I thought you said only a few times a day
Q: That was for the worst case. But if half the results are better than 20 msec, is it ok
for the other half to be just short of 2 seconds? What % of the time are you willing to
take a 1 second, or a half second hiccup? Or some other level?
A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing
money even when we are fast the rest of the time. We need to be better than 500
msec 99.9% of the time, or our customers will complain and go elsewhere
Now we have a service level expectation:
50% better than 20 msec
90% better than 50 msec
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok
to have a 1 second response?
A: (Annoyed) I thought you said only a few times a day
Q: That was for the worst case. But if half the results are better than 20 msec, is it ok
for the other half to be just short of 2 seconds? What % of the time are you willing to
take a 1 second, or a half second hiccup? Or some other level?
A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing
money even when we are fast the rest of the time. We need to be better than 500
msec 99.9% of the time, or our customers will complain and go elsewhere
Now we have a service level expectation:
50% better than 20 msec
90% better than 50 msec
99.9% better than 500 msec
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Establishing Requirements
an interactive interview (or thought) process
Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok
to have a 1 second response?
A: (Annoyed) I thought you said only a few times a day
Q: That was for the worst case. But if half the results are better than 20 msec, is it ok
for the other half to be just short of 2 seconds? What % of the time are you willing to
take a 1 second, or a half second hiccup? Or some other level?
A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing
money even when we are fast the rest of the time. We need to be better than 500
msec 99.9% of the time, or our customers will complain and go elsewhere
Now we have a service level expectation:
50% better than 20 msec
90% better than 50 msec
99.9% better than 500 msec
100% better than 2 seconds
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Latency does not live in a vacuum
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Remember this?
How much load can this system handle?
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Remember this?
How much load can this system handle?
What the
marketing
benchmarks
will say
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Remember this?
How much load can this system handle?
What the
marketing
benchmarks
will say
Where users
complain
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Remember this?
How much load can this system handle?
Where the
sysadmin
is willing
to go
What the
marketing
benchmarks
will say
Where users
complain
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Remember this?
How much load can this system handle?
Where the
sysadmin
is willing
to go
What the
marketing
benchmarks
will say
Where users
complain
Sustainable
Throughput
Level
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Sustainable Throughput: The
throughput achieved while safely
maintaining service levels
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Sustainable Throughput: The
throughput achieved while safely
maintaining service levels
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Sustainable Throughput: The
throughput achieved while safely
maintaining service levels
Unsustainable
Throughout
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Comparing behavior under different throughputs
and/or configurations
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
0#
20#
40#
60#
80#
100#
120#
!Dura&on!(msec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
Setup#D#
Setup#E#
Setup#F#
Setup#A#
Setup#B#
Setup#C#
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
0#
10#
20#
30#
40#
50#
60#
70#
80#
90#
Latency((milliseconds)(
(
(
Percen3le(
Latency(by(Percen3le(Distribu3on(
Comparing latency behavior under different throughputs,
configurations
HotSpot&10K&&&
HotSpot&15K&
HotSpot&5K&
HotSpot&1K&
Zing&15K&
Zing&10K&
Zing&5K&
Zing&1K&
&
Required&
Service&Level&
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Instance capacity test: “Fat Portal”
HotSpot CMS: Peaks at ~3GB / 45 concurrent users
* LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Instance capacity test: “Fat Portal”
HotSpot CMS: Peaks at ~3GB / 45 concurrent users
* LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Instance capacity test: “Fat Portal”
C4: still smooth @ 800 concurrent users
* LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Instance capacity test: “Fat Portal”
C4: still smooth @ 800 concurrent users
* LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Instance capacity test: “Fat Portal”
C4: still smooth @ 800 concurrent users
* LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The coordinated omission problem
An accidental conspiracy...
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The coordinated omission problem
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The coordinated omission problem
Common Example A (load testing):
build/buy load tester to measure system behavior
each “client” issues requests one by one at a certain rate
measure and log response time for each request
results log used to produce histograms, percentiles, etc.
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The coordinated omission problem
Common Example A (load testing):
build/buy load tester to measure system behavior
each “client” issues requests one by one at a certain rate
measure and log response time for each request
results log used to produce histograms, percentiles, etc.
So what’s wrong with that?
works well only when all responses fit within rate interval
technique includes implicit “automatic backoff” and coordination
But requirements interested in random, uncoordinated requests
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The coordinated omission problem
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The coordinated omission problem
Common Example B (monitoring):
System monitors and records each transaction latency
Latency measured between start and end of each operation
keeps observed latency stats of some sort (log, histogram, etc.)
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The coordinated omission problem
Common Example B (monitoring):
System monitors and records each transaction latency
Latency measured between start and end of each operation
keeps observed latency stats of some sort (log, histogram, etc.)
So what’s wrong with that?
works correctly well only when no queuing occurs
Long operations only get measured once
delays outside of timing window do not get measured at all
queued operations are measured wrong
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common Example B:
Coordinated Omission in Monitoring Code
Long operations only get measured once
delays outside of timing window do not get measured at all
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Common Example B:
Coordinated Omission in Monitoring Code
Long operations only get measured once
delays outside of timing window do not get measured at all
How bad can this get?
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
How bad can this get?
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
How bad can this get?
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
How would you characterize this system?
How bad can this get?
Avg. is 1 msec
over 1st 100 sec
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
How would you characterize this system?
How bad can this get?
Avg. is 1 msec
over 1st 100 sec
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
How would you characterize this system?
Avg. is 50 sec.
over next 100 sec
How bad can this get?
Avg. is 1 msec
over 1st 100 sec
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
How would you characterize this system?
Avg. is 50 sec.
over next 100 sec
Overall Average response time is ~25 sec.
How bad can this get?
Avg. is 1 msec
over 1st 100 sec
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
How would you characterize this system?
~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
Avg. is 50 sec.
over next 100 sec
Overall Average response time is ~25 sec.
Measurement in practice
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
Measurement in practice
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
Measurement in practice
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
Naïve Characterization
Measurement in practice
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
Naïve Characterization
10,000 @ 1msec 1 @ 100 second
Measurement in practice
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
Naïve Characterization
10,000 @ 1msec 1 @ 100 second
99.99%‘ile is 1 msec! Average. is 10.9msec! Std. Dev. is 0.99sec!
(should be ~100sec) (should be ~25 sec)
Proper measurement
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
10,000 results
Varying linearly
from 100 sec
to 10 msec
10,000 results
@ 1 msec each
Proper measurement
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
10,000 results
Varying linearly
from 100 sec
to 10 msec
10,000 results
@ 1 msec each
Proper measurement
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
10,000 results
Varying linearly
from 100 sec
to 10 msec
10,000 results
@ 1 msec each
~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
Proper measurement
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
10,000 results
Varying linearly
from 100 sec
to 10 msec
10,000 results
@ 1 msec each
~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
Proper measurement
System Stalled
for 100 Sec
Elapsed Time
System easily handles
100 requests/sec
Responds to each
in 1msec
10,000 results
Varying linearly
from 100 sec
to 10 msec
10,000 results
@ 1 msec each
~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
Coordinated
Omission
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The coordinated omission problem
It is MUCH more common than you may
think...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
JMeter makes this mistake... (so do others)
Before
Correction
After
Correcting
for
Omission
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The “real” world
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The “real” world
Results were
collected by a
single client
thread
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The “real” world
26.182 seconds
represents 1.29%
of the total time
Results were
collected by a
single client
thread
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The “real” world
99%‘ile MUST be at least 0.29%
of total time (1.29% - 1%)
which would be 5.9 seconds
26.182 seconds
represents 1.29%
of the total time
Results were
collected by a
single client
thread
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The “real” world
99%‘ile MUST be at least 0.29%
of total time (1.29% - 1%)
which would be 5.9 seconds
26.182 seconds
represents 1.29%
of the total time
wrong by a factor
of 1,000x
Results were
collected by a
single client
thread
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The “real” world
A world record SPECjEnterprise2010 result
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The “real” world
A world record SPECjEnterprise2010 result
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The “real” world
305.197 seconds
represents 8.4% of
the timing run
A world record SPECjEnterprise2010 result
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
The “real” world
The max is 762 (!!!)
standard deviations
away from the mean
305.197 seconds
represents 8.4% of
the timing run
A world record SPECjEnterprise2010 result
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
Max=7995.39*
0#
1000#
2000#
3000#
4000#
5000#
6000#
7000#
8000#
9000#
Latency*(msec)*
*
*
Percen7le*
Uncorrected*Latency*by*Percen7le*Distribu7on*
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
Max=7995.39*
0#
1000#
2000#
3000#
4000#
5000#
6000#
7000#
8000#
9000#
Latency*(msec)*
*
*
Percen7le*
Corrected*Latency*by*Percen7le*Distribu7on*
Real World Coordinated Omission effects
Before
Correction
After
Correction
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
Max=7995.39*
0#
1000#
2000#
3000#
4000#
5000#
6000#
7000#
8000#
9000#
Latency*(msec)*
*
*
Percen7le*
Uncorrected*Latency*by*Percen7le*Distribu7on*
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
Max=7995.39*
0#
1000#
2000#
3000#
4000#
5000#
6000#
7000#
8000#
9000#
Latency*(msec)*
*
*
Percen7le*
Corrected*Latency*by*Percen7le*Distribu7on*
Real World Coordinated Omission effects
Before
Correction
After
Correction
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
Max=7995.39*
0#
1000#
2000#
3000#
4000#
5000#
6000#
7000#
8000#
9000#
Latency*(msec)*
*
*
Percen7le*
Uncorrected*Latency*by*Percen7le*Distribu7on*
0%# 90%# 99%# 99.9%# 99.99%# 99.999%#
Max=7995.39*
0#
1000#
2000#
3000#
4000#
5000#
6000#
7000#
8000#
9000#
Latency*(msec)*
*
*
Percen7le*
Corrected*Latency*by*Percen7le*Distribu7on*
Real World Coordinated Omission effects
Before
Correction
After
Correction
Wrong
by 7x
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Real World Coordinated Omission effects
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Real World Coordinated Omission effects
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
Uncorrected
Data
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
HotSpot#Corrected1#
HotSpot#Corrected2#
Real World Coordinated Omission effects
Uncorrected
Data
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
HotSpot#Corrected1#
HotSpot#Corrected2#
Real World Coordinated Omission effects
Uncorrected
Data
Corrected for
Coordinated
Omission
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Real World Coordinated Omission effects
(Why I care)
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
HotSpot#Corrected1#
HotSpot#Corrected2#
Zing#Base#
Zing#Corrected1#
Zing#Corrected2#
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Real World Coordinated Omission effects
(Why I care)
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
HotSpot#Corrected1#
HotSpot#Corrected2#
Zing#Base#
Zing#Corrected1#
Zing#Corrected2#
Zing
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Real World Coordinated Omission effects
(Why I care)
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
HotSpot#Corrected1#
HotSpot#Corrected2#
Zing#Base#
Zing#Corrected1#
Zing#Corrected2#
Zing
“other”
JVM
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Real World Coordinated Omission effects
(Why I care)
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
0#
100000#
200000#
300000#
400000#
500000#
600000#
!Dura&on!(usec)!
!
!
Percen&le!
Dura&on!by!Percen&le!Distribu&on!
HotSpot#Base#
HotSpot#Corrected1#
HotSpot#Corrected2#
Zing#Base#
Zing#Corrected1#
Zing#Corrected2#
A ~2500x
difference in
reported
percentile levels
for the problem
that Zing
eliminates
Zing
“other”
JVM
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Suggestions
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Suggestions
Whatever your measurement technique is, TEST IT.
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Suggestions
Whatever your measurement technique is, TEST IT.
Run your measurement method against an artificial
system that creates hypothetical pauses scenarios. See if
your reported results agree with how you would describe
that system behavior
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Suggestions
Whatever your measurement technique is, TEST IT.
Run your measurement method against an artificial
system that creates hypothetical pauses scenarios. See if
your reported results agree with how you would describe
that system behavior
Don’t waste time analyzing until you establish sanity
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Suggestions
Whatever your measurement technique is, TEST IT.
Run your measurement method against an artificial
system that creates hypothetical pauses scenarios. See if
your reported results agree with how you would describe
that system behavior
Don’t waste time analyzing until you establish sanity
Don’t EVER use or derive from std. deviation
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Suggestions
Whatever your measurement technique is, TEST IT.
Run your measurement method against an artificial
system that creates hypothetical pauses scenarios. See if
your reported results agree with how you would describe
that system behavior
Don’t waste time analyzing until you establish sanity
Don’t EVER use or derive from std. deviation
ALWAYS measure Max time. Consider what it means...
Be suspicious.
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Suggestions
Whatever your measurement technique is, TEST IT.
Run your measurement method against an artificial
system that creates hypothetical pauses scenarios. See if
your reported results agree with how you would describe
that system behavior
Don’t waste time analyzing until you establish sanity
Don’t EVER use or derive from std. deviation
ALWAYS measure Max time. Consider what it means...
Be suspicious.
Measure %‘iles. Lots of them.
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Some Tools
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
Max=137.472+
0#
20#
40#
60#
80#
100#
120#
140#
160#
Latency+(msec)+
+
+
Percen8le+
Latency+by+Percen8le+Distribu8on+
HdrHistogram
If you want to be able to produce graphs like this...
You need both good dynamic range
and good resolution
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram background
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram background
Goal: Collect data for good latency characterization...
Including acceptable precision at and between varying percentile levels
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram background
Goal: Collect data for good latency characterization...
Including acceptable precision at and between varying percentile levels
Existing alternatives
Record all data, analyze later (e.g. sort and get 99.9%‘ile).
Record in traditional histograms
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram background
Goal: Collect data for good latency characterization...
Including acceptable precision at and between varying percentile levels
Existing alternatives
Record all data, analyze later (e.g. sort and get 99.9%‘ile).
Record in traditional histograms
Traditional Histograms: Linear bins, Logarithmic bins, or
Arbitrary bins
Linear requires lots of storage to cover range with good resolution
Logarithmic covers wide range but has terrible precisions
Arbitrary is.... arbitrary. Works only when you have a good feel for the
interesting parts of the value range
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
A High Dynamic Range Histogram
Covers a configurable dynamic value range
At configurable precision (expressed as number of significant digits)
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
A High Dynamic Range Histogram
Covers a configurable dynamic value range
At configurable precision (expressed as number of significant digits)
For Example:
Track values between 1 microsecond and 1 hour
With 3 decimal points of resolution
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
A High Dynamic Range Histogram
Covers a configurable dynamic value range
At configurable precision (expressed as number of significant digits)
For Example:
Track values between 1 microsecond and 1 hour
With 3 decimal points of resolution
Built-in [optional] compensation for Coordinated
Omission
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
A High Dynamic Range Histogram
Covers a configurable dynamic value range
At configurable precision (expressed as number of significant digits)
For Example:
Track values between 1 microsecond and 1 hour
With 3 decimal points of resolution
Built-in [optional] compensation for Coordinated
Omission
Open Source
On github, released to the public domain, creative commons CC0
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
Fixed cost in both space and time
Built with “latency sensitive” applications in mind
Recording values does not allocate or grow any data structures
Recording values uses a fixed computation to determine location (no
searches, no variability in recording cost, FAST)
Even iterating through histogram can be done with no allocation
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
Fixed cost in both space and time
Built with “latency sensitive” applications in mind
Recording values does not allocate or grow any data structures
Recording values uses a fixed computation to determine location (no
searches, no variability in recording cost, FAST)
Even iterating through histogram can be done with no allocation
Internals work like a “floating point” data structure
“Exponent” and “Mantissa”
Exponent determines “Mantissa bucket” to use
“Mantissa buckets” provide linear value range for a given exponent.
Each have enough linear entries to support required precision
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
Provides tools for iteration
Linear, Logarithmic, Percentile
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
Provides tools for iteration
Linear, Logarithmic, Percentile
Supports percentile iterators
Practical due to high dynamic range
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
Provides tools for iteration
Linear, Logarithmic, Percentile
Supports percentile iterators
Practical due to high dynamic range
Convenient percentile output
10% intervals between 0 and 50%
5% intervals between 50% and 75%
2.5% intervals between 75% and 87.5%...
Very useful for feeding percentile
distribution graphs...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
HdrHistogram
Provides tools for iteration
Linear, Logarithmic, Percentile
Supports percentile iterators
Practical due to high dynamic range
Convenient percentile output
10% intervals between 0 and 50%
5% intervals between 50% and 75%
2.5% intervals between 75% and 87.5%...
Very useful for feeding percentile
distribution graphs...
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%#
Max=137.472+
0#
20#
40#
60#
80#
100#
120#
140#
160#
Latency+(msec)+
+
+
Percen8le+
Latency+by+Percen8le+Distribu8on+
HdrHistogram
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
jHiccup
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Incontinuities in Java platform execution
0"
200"
400"
600"
800"
1000"
1200"
1400"
1600"
1800"
0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups"by"Time"Interval"
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=1665.024&
0"
200"
400"
600"
800"
1000"
1200"
1400"
1600"
1800"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups"by"Percen@le"Distribu@on"
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
jHiccup
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
jHiccup
A tool for capturing and displaying platform hiccups
Records any observed non-continuity of the underlying platform
Plots results in simple, consistent format
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
jHiccup
A tool for capturing and displaying platform hiccups
Records any observed non-continuity of the underlying platform
Plots results in simple, consistent format
Simple, non-intrusive
As simple as adding jHiccup.jar as a java agent:
% java -javaagent=jHiccup.jar myApp myflags
or attaching jHiccup to a running process:
% jHiccup -p <pid>
Adds a background thread that samples time @ 1000/sec into an
HdrHistogram
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
jHiccup
A tool for capturing and displaying platform hiccups
Records any observed non-continuity of the underlying platform
Plots results in simple, consistent format
Simple, non-intrusive
As simple as adding jHiccup.jar as a java agent:
% java -javaagent=jHiccup.jar myApp myflags
or attaching jHiccup to a running process:
% jHiccup -p <pid>
Adds a background thread that samples time @ 1000/sec into an
HdrHistogram
Open Source. Released to the public domain
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Telco App Example
0"
20"
40"
60"
80"
100"
120"
140"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Telco App Example
0"
20"
40"
60"
80"
100"
120"
140"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Telco App Example
0"
20"
40"
60"
80"
100"
120"
140"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
Max Time per
interval
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Telco App Example
0"
20"
40"
60"
80"
100"
120"
140"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
Max Time per
interval
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Telco App Example
0"
20"
40"
60"
80"
100"
120"
140"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
Max Time per
interval
Hiccup
duration at
percentile
levels
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Telco App Example
0"
20"
40"
60"
80"
100"
120"
140"
0" 500" 1000" 1500" 2000" 2500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
0"
20"
40"
60"
80"
100"
120"
140"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
Optional SLA
plotting
Max Time per
interval
Hiccup
duration at
percentile
levels
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Fun with jHiccup
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Oracle HotSpot CMS, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=13156.352&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Zing 5, 1GB in an 8GB heap
0"
5"
10"
15"
20"
25"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
Max=20.384&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Oracle HotSpot CMS, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=13156.352&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Zing 5, 1GB in an 8GB heap
0"
5"
10"
15"
20"
25"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
Max=20.384&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Oracle HotSpot CMS, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=13156.352&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Zing 5, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"Max=20.384&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Drawn to scale
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Good for both
“squishy” real time
(human response times)
and
“soft” real time
(low latency software systems)
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.656&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Oracle HotSpot (pure newgen)
Low latency trading application
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
1.6"
1.8"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=1.568&
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
1.6"
1.8"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.656&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Oracle HotSpot (pure newgen) Zing
Low latency trading application
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
1.6"
1.8"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=1.568&
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
1.6"
1.8"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.656&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Oracle HotSpot (pure newgen) Zing
Low latency trading application
©2012 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Low latency - Drawn to scale
Oracle HotSpot (pure newgen) Zing
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=1.568&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600"
Hiccup&Dura*on&(msec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.656&
0"
5"
10"
15"
20"
25"
Hiccup&Dura*on&(msec)&
&
&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Shameless bragging
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Shameless bragging
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Zing
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Zing
A JVM for Linux/x86 servers
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Zing
A JVM for Linux/x86 servers
ELIMINATES Garbage Collection as a concern for
enterprise applications
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Zing
A JVM for Linux/x86 servers
ELIMINATES Garbage Collection as a concern for
enterprise applications
Very wide operating range: Used in both low latency
and large scale enterprise application spaces
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Zing
A JVM for Linux/x86 servers
ELIMINATES Garbage Collection as a concern for
enterprise applications
Very wide operating range: Used in both low latency
and large scale enterprise application spaces
Decouples scale metrics from response time concerns
Transaction rate, data set size, concurrent users,
heap size, allocation rate, mutation rate, etc.
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
What is Zing good for?
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
What is Zing good for?
If you have a server-based Java application
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
What is Zing good for?
If you have a server-based Java application
And you are running on Linux
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
What is Zing good for?
If you have a server-based Java application
And you are running on Linux
And you use using more than ~300MB of memory
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
What is Zing good for?
If you have a server-based Java application
And you are running on Linux
And you use using more than ~300MB of memory
Then Zing will likely deliver superior behavior
metrics
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Where Zing shines
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Where Zing shines
Low latency
Eliminate behavior blips down to the sub-millisecond-units level
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Where Zing shines
Low latency
Eliminate behavior blips down to the sub-millisecond-units level
Machine-to-machine “stuff”
Support higher *sustainable* throughput (the one that meets
SLAs)
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Where Zing shines
Low latency
Eliminate behavior blips down to the sub-millisecond-units level
Machine-to-machine “stuff”
Support higher *sustainable* throughput (the one that meets
SLAs)
Human response times
Eliminate user-annoying response time blips. Multi-second and
even fraction-of-a-second blips will be completely gone.
Support larger memory JVMs *if needed* (e.g. larger virtual user
counts, or larger cache, in-memory state, or consolidating multiple
instances)
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Where Zing shines
Low latency
Eliminate behavior blips down to the sub-millisecond-units level
Machine-to-machine “stuff”
Support higher *sustainable* throughput (the one that meets
SLAs)
Human response times
Eliminate user-annoying response time blips. Multi-second and
even fraction-of-a-second blips will be completely gone.
Support larger memory JVMs *if needed* (e.g. larger virtual user
counts, or larger cache, in-memory state, or consolidating multiple
instances)
“Large” data and in-memory analytics
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Where Zing shines
Low latency
Eliminate behavior blips down to the sub-millisecond-units level
Machine-to-machine “stuff”
Support higher *sustainable* throughput (the one that meets
SLAs)
Human response times
Eliminate user-annoying response time blips. Multi-second and
even fraction-of-a-second blips will be completely gone.
Support larger memory JVMs *if needed* (e.g. larger virtual user
counts, or larger cache, in-memory state, or consolidating multiple
instances)
“Large” data and in-memory analytics
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Takeaways
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Takeaways
Standard Deviation and application latency should
never show up on the same page...
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Takeaways
Standard Deviation and application latency should
never show up on the same page...
If you haven’t stated percentiles and a Max, you
haven’t specified your requirements
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Takeaways
Standard Deviation and application latency should
never show up on the same page...
If you haven’t stated percentiles and a Max, you
haven’t specified your requirements
Measuring throughput without latency behavior is
[usually] meaningless
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Takeaways
Standard Deviation and application latency should
never show up on the same page...
If you haven’t stated percentiles and a Max, you
haven’t specified your requirements
Measuring throughput without latency behavior is
[usually] meaningless
Mistakes in measurement/analysis can cause orders-
of-magnitude errors and lead to bad business
decisions
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Q & A
http://www.azulsystems.com
http://www.jhiccup.com
http://giltene.github.com/HdrHistogram
©2013 Azul Systems, Inc.	
 	
 	
 	
 	
 	
Q & A
http://www.azulsystems.com
http://www.jhiccup.com
http://giltene.github.com/HdrHistogram

More Related Content

What's hot

What's hot (20)

[수정본] 우아한 객체지향
[수정본] 우아한 객체지향[수정본] 우아한 객체지향
[수정본] 우아한 객체지향
 
Redis
RedisRedis
Redis
 
Ceph Tech Talk -- Ceph Benchmarking Tool
Ceph Tech Talk -- Ceph Benchmarking ToolCeph Tech Talk -- Ceph Benchmarking Tool
Ceph Tech Talk -- Ceph Benchmarking Tool
 
JVM: A Platform for Multiple Languages
JVM: A Platform for Multiple LanguagesJVM: A Platform for Multiple Languages
JVM: A Platform for Multiple Languages
 
카카오에서의 Trove 운영사례
카카오에서의 Trove 운영사례카카오에서의 Trove 운영사례
카카오에서의 Trove 운영사례
 
Java SE 9 modules (JPMS) - an introduction
Java SE 9 modules (JPMS) - an introductionJava SE 9 modules (JPMS) - an introduction
Java SE 9 modules (JPMS) - an introduction
 
CEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCER
CEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCERCEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCER
CEPH DAY BERLIN - MASTERING CEPH OPERATIONS: UPMAP AND THE MGR BALANCER
 
Mona cheatsheet
Mona cheatsheetMona cheatsheet
Mona cheatsheet
 
Using Performance Insights to Optimize Database Performance (DAT402) - AWS re...
Using Performance Insights to Optimize Database Performance (DAT402) - AWS re...Using Performance Insights to Optimize Database Performance (DAT402) - AWS re...
Using Performance Insights to Optimize Database Performance (DAT402) - AWS re...
 
TiDB Introduction
TiDB IntroductionTiDB Introduction
TiDB Introduction
 
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoF
 
고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들
고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들
고려대학교 컴퓨터학과 특강 - 대학생 때 알았더라면 좋았을 것들
 
Your 1st Ceph cluster
Your 1st Ceph clusterYour 1st Ceph cluster
Your 1st Ceph cluster
 
ClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei Milovidov
 
How We Test MongoDB: Evergreen
How We Test MongoDB: EvergreenHow We Test MongoDB: Evergreen
How We Test MongoDB: Evergreen
 
API Gateway를 이용한 토큰 기반 인증 아키텍처
API Gateway를 이용한 토큰 기반 인증 아키텍처API Gateway를 이용한 토큰 기반 인증 아키텍처
API Gateway를 이용한 토큰 기반 인증 아키텍처
 
[164] pinpoint
[164] pinpoint[164] pinpoint
[164] pinpoint
 
Dkos(mesos기반의 container orchestration)
Dkos(mesos기반의 container orchestration)Dkos(mesos기반의 container orchestration)
Dkos(mesos기반의 container orchestration)
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
Scalable webservice
Scalable webserviceScalable webservice
Scalable webservice
 

Similar to How NOT to Measure Latency, Gil Tene, London, Oct. 2013

Similar to How NOT to Measure Latency, Gil Tene, London, Oct. 2013 (20)

Understanding Application Hiccups - and What You Can Do About Them
Understanding Application Hiccups - and What You Can Do About ThemUnderstanding Application Hiccups - and What You Can Do About Them
Understanding Application Hiccups - and What You Can Do About Them
 
Enabling Java in Latency Sensitive Applications by Gil Tene, CTO, Azul Systems
Enabling Java in Latency Sensitive Applications by Gil Tene, CTO, Azul SystemsEnabling Java in Latency Sensitive Applications by Gil Tene, CTO, Azul Systems
Enabling Java in Latency Sensitive Applications by Gil Tene, CTO, Azul Systems
 
Don't estimate - forecast
Don't estimate -  forecastDon't estimate -  forecast
Don't estimate - forecast
 
Understanding Latency Response Time and Behavior
Understanding Latency Response Time and BehaviorUnderstanding Latency Response Time and Behavior
Understanding Latency Response Time and Behavior
 
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and ToolsIntelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
Intelligent Trading Summit NY 2014: Understanding Latency: Key Lessons and Tools
 
Enabling Java in Latency-Sensitive Applications
Enabling Java in Latency-Sensitive ApplicationsEnabling Java in Latency-Sensitive Applications
Enabling Java in Latency-Sensitive Applications
 
DotCMS Bootcamp: Enabling Java in Latency Sensitivie Environments
DotCMS Bootcamp: Enabling Java in Latency Sensitivie EnvironmentsDotCMS Bootcamp: Enabling Java in Latency Sensitivie Environments
DotCMS Bootcamp: Enabling Java in Latency Sensitivie Environments
 
How to Build a DevOps Toolchain
How to Build a DevOps ToolchainHow to Build a DevOps Toolchain
How to Build a DevOps Toolchain
 
Enabling Java in Latency Sensitive Environments
Enabling Java in Latency Sensitive EnvironmentsEnabling Java in Latency Sensitive Environments
Enabling Java in Latency Sensitive Environments
 
Losing Sight of DevOps in an Automation Forest - devopsdays Atlanta 2013
Losing Sight of DevOps in an Automation Forest - devopsdays Atlanta 2013Losing Sight of DevOps in an Automation Forest - devopsdays Atlanta 2013
Losing Sight of DevOps in an Automation Forest - devopsdays Atlanta 2013
 
Mining for gold 2.0
Mining for gold 2.0Mining for gold 2.0
Mining for gold 2.0
 
Prometheus - Open Source Forum Japan
Prometheus  - Open Source Forum JapanPrometheus  - Open Source Forum Japan
Prometheus - Open Source Forum Japan
 
From Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.auFrom Monoliths to Microservices at Realestate.com.au
From Monoliths to Microservices at Realestate.com.au
 
AgileCamp 2014 Track 1: Accelerating Agile Enterprise Adoption with Scaled Ag...
AgileCamp 2014 Track 1: Accelerating Agile Enterprise Adoption with Scaled Ag...AgileCamp 2014 Track 1: Accelerating Agile Enterprise Adoption with Scaled Ag...
AgileCamp 2014 Track 1: Accelerating Agile Enterprise Adoption with Scaled Ag...
 
Simplifying Your Infrastructure Through Containerization
Simplifying Your Infrastructure Through ContainerizationSimplifying Your Infrastructure Through Containerization
Simplifying Your Infrastructure Through Containerization
 
Is There a Return on Investment from Model-Based Systems Engineering?
Is There a Return on Investment from Model-Based Systems Engineering?Is There a Return on Investment from Model-Based Systems Engineering?
Is There a Return on Investment from Model-Based Systems Engineering?
 
DOES14: Scott Prugh, CSG - DevOps and Lean in Legacy Environments
DOES14: Scott Prugh, CSG - DevOps and Lean in Legacy EnvironmentsDOES14: Scott Prugh, CSG - DevOps and Lean in Legacy Environments
DOES14: Scott Prugh, CSG - DevOps and Lean in Legacy Environments
 
Continuous Deployment
Continuous DeploymentContinuous Deployment
Continuous Deployment
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Dev ops
Dev opsDev ops
Dev ops
 

More from Azul Systems Inc.

Push Technology's latest data distribution benchmark with Solarflare and Zing
Push Technology's latest data distribution benchmark with Solarflare and ZingPush Technology's latest data distribution benchmark with Solarflare and Zing
Push Technology's latest data distribution benchmark with Solarflare and Zing
Azul Systems Inc.
 

More from Azul Systems Inc. (20)

Advancements ingc andc4overview_linkedin_oct2017
Advancements ingc andc4overview_linkedin_oct2017Advancements ingc andc4overview_linkedin_oct2017
Advancements ingc andc4overview_linkedin_oct2017
 
Understanding GC, JavaOne 2017
Understanding GC, JavaOne 2017Understanding GC, JavaOne 2017
Understanding GC, JavaOne 2017
 
Zulu Embedded Java Introduction
Zulu Embedded Java IntroductionZulu Embedded Java Introduction
Zulu Embedded Java Introduction
 
What's New in the JVM in Java 8?
What's New in the JVM in Java 8?What's New in the JVM in Java 8?
What's New in the JVM in Java 8?
 
ObjectLayout: Closing the (last?) inherent C vs. Java speed gap
ObjectLayout: Closing the (last?) inherent C vs. Java speed gapObjectLayout: Closing the (last?) inherent C vs. Java speed gap
ObjectLayout: Closing the (last?) inherent C vs. Java speed gap
 
Priming Java for Speed at Market Open
Priming Java for Speed at Market OpenPriming Java for Speed at Market Open
Priming Java for Speed at Market Open
 
Azul Systems open source guide
Azul Systems open source guideAzul Systems open source guide
Azul Systems open source guide
 
Start Fast and Stay Fast - Priming Java for Market Open with ReadyNow!
Start Fast and Stay Fast - Priming Java for Market Open with ReadyNow!Start Fast and Stay Fast - Priming Java for Market Open with ReadyNow!
Start Fast and Stay Fast - Priming Java for Market Open with ReadyNow!
 
Understanding Java Garbage Collection
Understanding Java Garbage CollectionUnderstanding Java Garbage Collection
Understanding Java Garbage Collection
 
The evolution of OpenJDK: From Java's beginnings to 2014
The evolution of OpenJDK: From Java's beginnings to 2014The evolution of OpenJDK: From Java's beginnings to 2014
The evolution of OpenJDK: From Java's beginnings to 2014
 
Push Technology's latest data distribution benchmark with Solarflare and Zing
Push Technology's latest data distribution benchmark with Solarflare and ZingPush Technology's latest data distribution benchmark with Solarflare and Zing
Push Technology's latest data distribution benchmark with Solarflare and Zing
 
Webinar: Zing Vision: Answering your toughest production Java performance que...
Webinar: Zing Vision: Answering your toughest production Java performance que...Webinar: Zing Vision: Answering your toughest production Java performance que...
Webinar: Zing Vision: Answering your toughest production Java performance que...
 
Speculative Locking: Breaking the Scale Barrier (JAOO 2005)
Speculative Locking: Breaking the Scale Barrier (JAOO 2005)Speculative Locking: Breaking the Scale Barrier (JAOO 2005)
Speculative Locking: Breaking the Scale Barrier (JAOO 2005)
 
Java vs. C/C++
Java vs. C/C++Java vs. C/C++
Java vs. C/C++
 
What's Inside a JVM?
What's Inside a JVM?What's Inside a JVM?
What's Inside a JVM?
 
The Java Evolution Mismatch - Why You Need a Better JVM
The Java Evolution Mismatch - Why You Need a Better JVMThe Java Evolution Mismatch - Why You Need a Better JVM
The Java Evolution Mismatch - Why You Need a Better JVM
 
Towards a Scalable Non-Blocking Coding Style
Towards a Scalable Non-Blocking Coding StyleTowards a Scalable Non-Blocking Coding Style
Towards a Scalable Non-Blocking Coding Style
 
Experiences with Debugging Data Races
Experiences with Debugging Data RacesExperiences with Debugging Data Races
Experiences with Debugging Data Races
 
Lock-Free, Wait-Free Hash Table
Lock-Free, Wait-Free Hash TableLock-Free, Wait-Free Hash Table
Lock-Free, Wait-Free Hash Table
 
How NOT to Write a Microbenchmark
How NOT to Write a MicrobenchmarkHow NOT to Write a Microbenchmark
How NOT to Write a Microbenchmark
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

How NOT to Measure Latency, Gil Tene, London, Oct. 2013

  • 1. ©2013 Azul Systems, Inc. How NOT to Measure Latency An attempt to confer wisdom... Gil Tene, CTO & co-Founder, Azul Systems
  • 2. ©2013 Azul Systems, Inc. High level agenda
  • 3. ©2013 Azul Systems, Inc. High level agenda Some latency behavior background
  • 4. ©2013 Azul Systems, Inc. High level agenda Some latency behavior background The pitfalls of using “statistics”
  • 5. ©2013 Azul Systems, Inc. High level agenda Some latency behavior background The pitfalls of using “statistics” Latency “philosophy” questions
  • 6. ©2013 Azul Systems, Inc. High level agenda Some latency behavior background The pitfalls of using “statistics” Latency “philosophy” questions The Coordinated Omission Problem
  • 7. ©2013 Azul Systems, Inc. High level agenda Some latency behavior background The pitfalls of using “statistics” Latency “philosophy” questions The Coordinated Omission Problem Some useful tools
  • 8. ©2013 Azul Systems, Inc. High level agenda Some latency behavior background The pitfalls of using “statistics” Latency “philosophy” questions The Coordinated Omission Problem Some useful tools Use tools for bragging
  • 9. ©2013 Azul Systems, Inc. About me: Gil Tene
  • 10. ©2013 Azul Systems, Inc. About me: Gil Tene co-founder, CTO @Azul Systems
  • 11. ©2013 Azul Systems, Inc. About me: Gil Tene co-founder, CTO @Azul Systems Have been working on “think different” GC approaches since 2002
  • 12. ©2013 Azul Systems, Inc. About me: Gil Tene co-founder, CTO @Azul Systems Have been working on “think different” GC approaches since 2002 * working on real-world trash compaction issues, circa 2004
  • 13. ©2013 Azul Systems, Inc. About me: Gil Tene co-founder, CTO @Azul Systems Have been working on “think different” GC approaches since 2002 Created Pauseless & C4 core GC algorithms (Tene, Wolf) * working on real-world trash compaction issues, circa 2004
  • 14. ©2013 Azul Systems, Inc. About me: Gil Tene co-founder, CTO @Azul Systems Have been working on “think different” GC approaches since 2002 Created Pauseless & C4 core GC algorithms (Tene, Wolf) A Long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc... * working on real-world trash compaction issues, circa 2004
  • 15. ©2013 Azul Systems, Inc. About me: Gil Tene co-founder, CTO @Azul Systems Have been working on “think different” GC approaches since 2002 Created Pauseless & C4 core GC algorithms (Tene, Wolf) A Long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc... JCP EC Member... * working on real-world trash compaction issues, circa 2004
  • 16. ©2013 Azul Systems, Inc. About Azul
  • 17. ©2013 Azul Systems, Inc. About Azul We make scalable Virtual Machines
  • 18. ©2013 Azul Systems, Inc. About Azul We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002
  • 19. ©2013 Azul Systems, Inc. About Azul We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002 3 generations of custom SMP Multi-core HW (Vega)
  • 20. ©2013 Azul Systems, Inc. About Azul We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002 3 generations of custom SMP Multi-core HW (Vega) Vega
  • 21. ©2013 Azul Systems, Inc. About Azul We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002 3 generations of custom SMP Multi-core HW (Vega) Vega
  • 22. ©2013 Azul Systems, Inc. About Azul We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002 3 generations of custom SMP Multi-core HW (Vega) Zing: Pure software for commodity x86 Vega
  • 23. ©2013 Azul Systems, Inc. About Azul We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002 3 generations of custom SMP Multi-core HW (Vega) Zing: Pure software for commodity x86 Vega C4
  • 24. ©2013 Azul Systems, Inc. About Azul We make scalable Virtual Machines Have built “whatever it takes to get job done” since 2002 3 generations of custom SMP Multi-core HW (Vega) Zing: Pure software for commodity x86 Known for Low Latency, Consistent execution, and Large data set excellence Vega C4
  • 25. ©2013 Azul Systems, Inc. Common fallacies
  • 26. ©2013 Azul Systems, Inc. Common fallacies Computers run application code continuously
  • 27. ©2013 Azul Systems, Inc. Common fallacies Computers run application code continuously Response time can be measured as work units/time
  • 28. ©2013 Azul Systems, Inc. Common fallacies Computers run application code continuously Response time can be measured as work units/time Response time exhibits a normal distribution
  • 29. ©2013 Azul Systems, Inc. Common fallacies Computers run application code continuously Response time can be measured as work units/time Response time exhibits a normal distribution “Glitches” or “Semi-random omissions” in measurement don’t have a big effect.
  • 30. ©2013 Azul Systems, Inc. Common fallacies Computers run application code continuously Response time can be measured as work units/time Response time exhibits a normal distribution “Glitches” or “Semi-random omissions” in measurement don’t have a big effect. Wrong! Wrong! Wrong! Wrong!
  • 31. ©2012 Azul Systems, Inc. A classic look at response time behavior Response time as a function of load source: IBM CICS server documentation, “understanding response times”
  • 32. ©2012 Azul Systems, Inc. A classic look at response time behavior Response time as a function of load source: IBM CICS server documentation, “understanding response times”
  • 33. ©2012 Azul Systems, Inc. A classic look at response time behavior Response time as a function of load source: IBM CICS server documentation, “understanding response times” Average? Max? Median? 90%? 99.9%
  • 34. ©2012 Azul Systems, Inc. A classic look at response time behavior Response time as a function of load source: IBM CICS server documentation, “understanding response times” Average? Max? Median? 90%? 99.9%
  • 35. ©2012 Azul Systems, Inc. Response time over time When we measure behavior over time, we often see: source: ZOHO QEngine White Paper: performance testing report analysis Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown.
  • 36. ©2012 Azul Systems, Inc. Response time over time When we measure behavior over time, we often see: source: ZOHO QEngine White Paper: performance testing report analysis Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown.
  • 37. ©2012 Azul Systems, Inc. Response time over time When we measure behavior over time, we often see: source: ZOHO QEngine White Paper: performance testing report analysis Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown.
  • 38. ©2012 Azul Systems, Inc. Response time over time When we measure behavior over time, we often see: source: ZOHO QEngine White Paper: performance testing report analysis Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown.
  • 39. ©2012 Azul Systems, Inc. Response time over time When we measure behavior over time, we often see: source: ZOHO QEngine White Paper: performance testing report analysis Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown.
  • 40. ©2012 Azul Systems, Inc. Response time over time When we measure behavior over time, we often see: source: ZOHO QEngine White Paper: performance testing report analysis Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown.
  • 41. ©2012 Azul Systems, Inc. Response time over time When we measure behavior over time, we often see: source: ZOHO QEngine White Paper: performance testing report analysis Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown.
  • 42. ©2012 Azul Systems, Inc. Response time over time When we measure behavior over time, we often see: source: ZOHO QEngine White Paper: performance testing report analysis Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown. “Hiccups”
  • 43. ©2012 Azul Systems, Inc. 0" 1000" 2000" 3000" 4000" 5000" 6000" 7000" 8000" 9000" 0" 20" 40" 60" 80" 100" 120" 140" 160" 180" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" What happened here?
  • 44. ©2012 Azul Systems, Inc. 0" 1000" 2000" 3000" 4000" 5000" 6000" 7000" 8000" 9000" 0" 20" 40" 60" 80" 100" 120" 140" 160" 180" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" What happened here? “Hiccups”
  • 45. ©2012 Azul Systems, Inc. 0" 1000" 2000" 3000" 4000" 5000" 6000" 7000" 8000" 9000" 0" 20" 40" 60" 80" 100" 120" 140" 160" 180" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" What happened here? Source: Gil running an idle program and suspending it five times in the middle “Hiccups”
  • 46. ©2012 Azul Systems, Inc. 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 25" Hiccups&by&Percen*le&Distribu*on& The real world (a low latency example)
  • 47. ©2012 Azul Systems, Inc. 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 25" Hiccups&by&Percen*le&Distribu*on& The real world (a low latency example) 99%‘ile is ~60 usec
  • 48. ©2012 Azul Systems, Inc. 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 25" Hiccups&by&Percen*le&Distribu*on& The real world (a low latency example) 99%‘ile is ~60 usec Max is ~30,000% higher than “typical”
  • 49. ©2012 Azul Systems, Inc. Hiccups are [typically] strongly multi-modal
  • 50. ©2012 Azul Systems, Inc. Hiccups are [typically] strongly multi-modal They don’t look anything like a normal distribution
  • 51. ©2012 Azul Systems, Inc. Hiccups are [typically] strongly multi-modal They don’t look anything like a normal distribution They usually look like periodic freezes
  • 52. ©2012 Azul Systems, Inc. Hiccups are [typically] strongly multi-modal They don’t look anything like a normal distribution They usually look like periodic freezes A complete shift from one mode/behavior to another
  • 53. ©2012 Azul Systems, Inc. Hiccups are [typically] strongly multi-modal They don’t look anything like a normal distribution They usually look like periodic freezes A complete shift from one mode/behavior to another Mode A: “good”.
  • 54. ©2012 Azul Systems, Inc. Hiccups are [typically] strongly multi-modal They don’t look anything like a normal distribution They usually look like periodic freezes A complete shift from one mode/behavior to another Mode A: “good”. Mode B: “Somewhat bad”
  • 55. ©2012 Azul Systems, Inc. Hiccups are [typically] strongly multi-modal They don’t look anything like a normal distribution They usually look like periodic freezes A complete shift from one mode/behavior to another Mode A: “good”. Mode B: “Somewhat bad” Mode C: “terrible”, ...
  • 56. ©2012 Azul Systems, Inc. Hiccups are [typically] strongly multi-modal They don’t look anything like a normal distribution They usually look like periodic freezes A complete shift from one mode/behavior to another Mode A: “good”. Mode B: “Somewhat bad” Mode C: “terrible”, ... ....
  • 57. ©2012 Azul Systems, Inc. Common ways people deal with hiccups
  • 58. ©2012 Azul Systems, Inc. Common ways people deal with hiccups Averages and Standard Deviation
  • 59. ©2012 Azul Systems, Inc. Common ways people deal with hiccups Averages and Standard Deviation Always Wrong!
  • 60. ©2012 Azul Systems, Inc. Better ways people can deal with hiccups Actually measuring percentiles0" 20" 40" 60" 80" 100" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(ms &Elapsed&Time&(sec)& 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA"
  • 61. ©2012 Azul Systems, Inc. Better ways people can deal with hiccups Actually measuring percentiles0" 20" 40" 60" 80" 100" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(ms &Elapsed&Time&(sec)& 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA" Response Time Percentile plot line
  • 62. ©2012 Azul Systems, Inc. Better ways people can deal with hiccups Actually measuring percentiles0" 20" 40" 60" 80" 100" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(ms &Elapsed&Time&(sec)& 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA" Requirements Response Time Percentile plot line
  • 63. ©2013 Azul Systems, Inc. Requirements Why we measure latency and response times to begin with...
  • 64. ©2012 Azul Systems, Inc. Latency tells us how long something took
  • 65. ©2012 Azul Systems, Inc. Latency tells us how long something took But what do we WANT the latency to be?
  • 66. ©2012 Azul Systems, Inc. Latency tells us how long something took But what do we WANT the latency to be? What do we want the latency to BEHAVE like?
  • 67. ©2012 Azul Systems, Inc. Latency tells us how long something took But what do we WANT the latency to be? What do we want the latency to BEHAVE like? Latency requirements are usually a PASS/FAIL test of some predefined criteria
  • 68. ©2012 Azul Systems, Inc. Latency tells us how long something took But what do we WANT the latency to be? What do we want the latency to BEHAVE like? Latency requirements are usually a PASS/FAIL test of some predefined criteria Different applications have different needs
  • 69. ©2012 Azul Systems, Inc. Latency tells us how long something took But what do we WANT the latency to be? What do we want the latency to BEHAVE like? Latency requirements are usually a PASS/FAIL test of some predefined criteria Different applications have different needs Requirements should reflect application needs
  • 70. ©2012 Azul Systems, Inc. Latency tells us how long something took But what do we WANT the latency to be? What do we want the latency to BEHAVE like? Latency requirements are usually a PASS/FAIL test of some predefined criteria Different applications have different needs Requirements should reflect application needs Measurements should provide data to evaluate requirements
  • 71. ©2013 Azul Systems, Inc. The Olympics aka “ring the bell first”
  • 72. ©2013 Azul Systems, Inc. The Olympics aka “ring the bell first” Goal: Get gold medals
  • 73. ©2013 Azul Systems, Inc. The Olympics aka “ring the bell first” Goal: Get gold medals Need to be faster than everyone else at SOME races
  • 74. ©2013 Azul Systems, Inc. The Olympics aka “ring the bell first” Goal: Get gold medals Need to be faster than everyone else at SOME races Ok to be slower in some, as long as fastest at some (the average speed doesn’t matter)
  • 75. ©2013 Azul Systems, Inc. The Olympics aka “ring the bell first” Goal: Get gold medals Need to be faster than everyone else at SOME races Ok to be slower in some, as long as fastest at some (the average speed doesn’t matter) Ok to not even finish or compete (the worst case and 99%‘ile don’t matter)
  • 76. ©2013 Azul Systems, Inc. The Olympics aka “ring the bell first” Goal: Get gold medals Need to be faster than everyone else at SOME races Ok to be slower in some, as long as fastest at some (the average speed doesn’t matter) Ok to not even finish or compete (the worst case and 99%‘ile don’t matter) Different strategies can apply. E.g. compete in only 3 races to not risk burning out, or compete in 8 races in hope of winning two
  • 77. ©2013 Azul Systems, Inc. Pacemakers aka “hard” real time
  • 78. ©2013 Azul Systems, Inc. Pacemakers aka “hard” real time Goal: Keep heart beating
  • 79. ©2013 Azul Systems, Inc. Pacemakers aka “hard” real time Goal: Keep heart beating Need to never be slower than X
  • 80. ©2013 Azul Systems, Inc. Pacemakers aka “hard” real time Goal: Keep heart beating Need to never be slower than X “Your heart will keep beating 99.9% of the time” is not reassuring
  • 81. ©2013 Azul Systems, Inc. Pacemakers aka “hard” real time Goal: Keep heart beating Need to never be slower than X “Your heart will keep beating 99.9% of the time” is not reassuring Having a good average and a nice standard deviation don’t matter or help
  • 82. ©2013 Azul Systems, Inc. Pacemakers aka “hard” real time Goal: Keep heart beating Need to never be slower than X “Your heart will keep beating 99.9% of the time” is not reassuring Having a good average and a nice standard deviation don’t matter or help The worst case is all that matters
  • 83. ©2013 Azul Systems, Inc. “Low Latency” Trading aka “soft” real time
  • 84. ©2013 Azul Systems, Inc. “Low Latency” Trading aka “soft” real time Goal A: Be fast enough to make some good plays
  • 85. ©2013 Azul Systems, Inc. “Low Latency” Trading aka “soft” real time Goal A: Be fast enough to make some good plays Goal B: Contain risk and exposure while making plays
  • 86. ©2013 Azul Systems, Inc. “Low Latency” Trading aka “soft” real time Goal A: Be fast enough to make some good plays Goal B: Contain risk and exposure while making plays E.g. want to “typically” react within 200 usec.
  • 87. ©2013 Azul Systems, Inc. “Low Latency” Trading aka “soft” real time Goal A: Be fast enough to make some good plays Goal B: Contain risk and exposure while making plays E.g. want to “typically” react within 200 usec. But can’t afford to hold open position for 20 msec, or react to 30 msec stale information
  • 88. ©2013 Azul Systems, Inc. “Low Latency” Trading aka “soft” real time Goal A: Be fast enough to make some good plays Goal B: Contain risk and exposure while making plays E.g. want to “typically” react within 200 usec. But can’t afford to hold open position for 20 msec, or react to 30 msec stale information So we want a very good “typical” (median, 50%‘ile)
  • 89. ©2013 Azul Systems, Inc. “Low Latency” Trading aka “soft” real time Goal A: Be fast enough to make some good plays Goal B: Contain risk and exposure while making plays E.g. want to “typically” react within 200 usec. But can’t afford to hold open position for 20 msec, or react to 30 msec stale information So we want a very good “typical” (median, 50%‘ile) But we also need a reasonable Max, or 99.99%‘ile
  • 90. ©2013 Azul Systems, Inc. Interactive applications aka “squishy” real time
  • 91. ©2013 Azul Systems, Inc. Interactive applications aka “squishy” real time Goal: Keep users happy enough to not complain/leave
  • 92. ©2013 Azul Systems, Inc. Interactive applications aka “squishy” real time Goal: Keep users happy enough to not complain/leave Need to have “typically snappy” behavior
  • 93. ©2013 Azul Systems, Inc. Interactive applications aka “squishy” real time Goal: Keep users happy enough to not complain/leave Need to have “typically snappy” behavior Ok to have occasional longer times, but not too high, and not too often
  • 94. ©2013 Azul Systems, Inc. Interactive applications aka “squishy” real time Goal: Keep users happy enough to not complain/leave Need to have “typically snappy” behavior Ok to have occasional longer times, but not too high, and not too often Example: 90% of responses should be below 0.2 sec, 99% should be below 0.5 sec, 99.9 should be better than 2 seconds. And a >10 second response should never happen.
  • 95. ©2013 Azul Systems, Inc. Interactive applications aka “squishy” real time Goal: Keep users happy enough to not complain/leave Need to have “typically snappy” behavior Ok to have occasional longer times, but not too high, and not too often Example: 90% of responses should be below 0.2 sec, 99% should be below 0.5 sec, 99.9 should be better than 2 seconds. And a >10 second response should never happen. Remember: A single user may have 100s of interactions per session...
  • 96. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process
  • 97. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements?
  • 98. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20 msec
  • 99. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20 msec Q: Ok. Typical/average of 20 msec... So what is the worst case requirement?
  • 100. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20 msec Q: Ok. Typical/average of 20 msec... So what is the worst case requirement? A: We don’t have one
  • 101. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20 msec Q: Ok. Typical/average of 20 msec... So what is the worst case requirement? A: We don’t have one Q: So it’s ok for some things to take more than 5 hours?
  • 102. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20 msec Q: Ok. Typical/average of 20 msec... So what is the worst case requirement? A: We don’t have one Q: So it’s ok for some things to take more than 5 hours? A: No way in H%%&!
  • 103. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20 msec Q: Ok. Typical/average of 20 msec... So what is the worst case requirement? A: We don’t have one Q: So it’s ok for some things to take more than 5 hours? A: No way in H%%&! Q: So I’ll write down “5 hours worst case...”
  • 104. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20 msec Q: Ok. Typical/average of 20 msec... So what is the worst case requirement? A: We don’t have one Q: So it’s ok for some things to take more than 5 hours? A: No way in H%%&! Q: So I’ll write down “5 hours worst case...” A: No. That’s not what I said. Make that “nothing worse than 100 msec”
  • 105. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20 msec Q: Ok. Typical/average of 20 msec... So what is the worst case requirement? A: We don’t have one Q: So it’s ok for some things to take more than 5 hours? A: No way in H%%&! Q: So I’ll write down “5 hours worst case...” A: No. That’s not what I said. Make that “nothing worse than 100 msec” Q: Are you sure? Even if it’s only two times a day?
  • 106. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Q: What are your latency requirements? A: We need an avg. response of 20 msec Q: Ok. Typical/average of 20 msec... So what is the worst case requirement? A: We don’t have one Q: So it’s ok for some things to take more than 5 hours? A: No way in H%%&! Q: So I’ll write down “5 hours worst case...” A: No. That’s not what I said. Make that “nothing worse than 100 msec” Q: Are you sure? Even if it’s only two times a day? A: Ok... Make it “nothing worse than 2 seconds...”
  • 107. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process
  • 108. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response?
  • 109. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response? A: (Annoyed) I thought you said only a few times a day
  • 110. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response? A: (Annoyed) I thought you said only a few times a day Q: That was for the worst case. But if half the results are better than 20 msec, is it ok for the other half to be just short of 2 seconds? What % of the time are you willing to take a 1 second, or a half second hiccup? Or some other level?
  • 111. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response? A: (Annoyed) I thought you said only a few times a day Q: That was for the worst case. But if half the results are better than 20 msec, is it ok for the other half to be just short of 2 seconds? What % of the time are you willing to take a 1 second, or a half second hiccup? Or some other level? A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing money even when we are fast the rest of the time. We need to be better than 500 msec 99.9% of the time, or our customers will complain and go elsewhere
  • 112. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response? A: (Annoyed) I thought you said only a few times a day Q: That was for the worst case. But if half the results are better than 20 msec, is it ok for the other half to be just short of 2 seconds? What % of the time are you willing to take a 1 second, or a half second hiccup? Or some other level? A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing money even when we are fast the rest of the time. We need to be better than 500 msec 99.9% of the time, or our customers will complain and go elsewhere Now we have a service level expectation:
  • 113. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response? A: (Annoyed) I thought you said only a few times a day Q: That was for the worst case. But if half the results are better than 20 msec, is it ok for the other half to be just short of 2 seconds? What % of the time are you willing to take a 1 second, or a half second hiccup? Or some other level? A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing money even when we are fast the rest of the time. We need to be better than 500 msec 99.9% of the time, or our customers will complain and go elsewhere Now we have a service level expectation: 50% better than 20 msec
  • 114. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response? A: (Annoyed) I thought you said only a few times a day Q: That was for the worst case. But if half the results are better than 20 msec, is it ok for the other half to be just short of 2 seconds? What % of the time are you willing to take a 1 second, or a half second hiccup? Or some other level? A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing money even when we are fast the rest of the time. We need to be better than 500 msec 99.9% of the time, or our customers will complain and go elsewhere Now we have a service level expectation: 50% better than 20 msec 90% better than 50 msec
  • 115. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response? A: (Annoyed) I thought you said only a few times a day Q: That was for the worst case. But if half the results are better than 20 msec, is it ok for the other half to be just short of 2 seconds? What % of the time are you willing to take a 1 second, or a half second hiccup? Or some other level? A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing money even when we are fast the rest of the time. We need to be better than 500 msec 99.9% of the time, or our customers will complain and go elsewhere Now we have a service level expectation: 50% better than 20 msec 90% better than 50 msec 99.9% better than 500 msec
  • 116. ©2013 Azul Systems, Inc. Establishing Requirements an interactive interview (or thought) process Ok. So we need a typical of 20msec, and a worst case of 2 seconds. How often is it ok to have a 1 second response? A: (Annoyed) I thought you said only a few times a day Q: That was for the worst case. But if half the results are better than 20 msec, is it ok for the other half to be just short of 2 seconds? What % of the time are you willing to take a 1 second, or a half second hiccup? Or some other level? A: Oh. Let’s see. We have to better than 50 msec 90% of the time, or we’ll be losing money even when we are fast the rest of the time. We need to be better than 500 msec 99.9% of the time, or our customers will complain and go elsewhere Now we have a service level expectation: 50% better than 20 msec 90% better than 50 msec 99.9% better than 500 msec 100% better than 2 seconds
  • 117. ©2013 Azul Systems, Inc. Latency does not live in a vacuum
  • 118. ©2013 Azul Systems, Inc. Remember this? How much load can this system handle?
  • 119. ©2013 Azul Systems, Inc. Remember this? How much load can this system handle? What the marketing benchmarks will say
  • 120. ©2013 Azul Systems, Inc. Remember this? How much load can this system handle? What the marketing benchmarks will say Where users complain
  • 121. ©2013 Azul Systems, Inc. Remember this? How much load can this system handle? Where the sysadmin is willing to go What the marketing benchmarks will say Where users complain
  • 122. ©2013 Azul Systems, Inc. Remember this? How much load can this system handle? Where the sysadmin is willing to go What the marketing benchmarks will say Where users complain Sustainable Throughput Level
  • 123. ©2013 Azul Systems, Inc. Sustainable Throughput: The throughput achieved while safely maintaining service levels
  • 124. ©2013 Azul Systems, Inc. Sustainable Throughput: The throughput achieved while safely maintaining service levels
  • 125. ©2013 Azul Systems, Inc. Sustainable Throughput: The throughput achieved while safely maintaining service levels Unsustainable Throughout
  • 126. ©2013 Azul Systems, Inc. Comparing behavior under different throughputs and/or configurations 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 0# 20# 40# 60# 80# 100# 120# !Dura&on!(msec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! Setup#D# Setup#E# Setup#F# Setup#A# Setup#B# Setup#C#
  • 127. ©2013 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 0# 10# 20# 30# 40# 50# 60# 70# 80# 90# Latency((milliseconds)( ( ( Percen3le( Latency(by(Percen3le(Distribu3on( Comparing latency behavior under different throughputs, configurations HotSpot&10K&&& HotSpot&15K& HotSpot&5K& HotSpot&1K& Zing&15K& Zing&10K& Zing&5K& Zing&1K& & Required& Service&Level&
  • 128. ©2013 Azul Systems, Inc. Instance capacity test: “Fat Portal” HotSpot CMS: Peaks at ~3GB / 45 concurrent users * LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
  • 129. ©2013 Azul Systems, Inc. Instance capacity test: “Fat Portal” HotSpot CMS: Peaks at ~3GB / 45 concurrent users * LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
  • 130. ©2013 Azul Systems, Inc. Instance capacity test: “Fat Portal” C4: still smooth @ 800 concurrent users * LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
  • 131. ©2013 Azul Systems, Inc. Instance capacity test: “Fat Portal” C4: still smooth @ 800 concurrent users * LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
  • 132. ©2013 Azul Systems, Inc. Instance capacity test: “Fat Portal” C4: still smooth @ 800 concurrent users * LifeRay portal on JBoss @ 99.9% SLA of 5 second response times
  • 133. ©2013 Azul Systems, Inc. The coordinated omission problem An accidental conspiracy...
  • 134. ©2013 Azul Systems, Inc. The coordinated omission problem
  • 135. ©2013 Azul Systems, Inc. The coordinated omission problem Common Example A (load testing): build/buy load tester to measure system behavior each “client” issues requests one by one at a certain rate measure and log response time for each request results log used to produce histograms, percentiles, etc.
  • 136. ©2013 Azul Systems, Inc. The coordinated omission problem Common Example A (load testing): build/buy load tester to measure system behavior each “client” issues requests one by one at a certain rate measure and log response time for each request results log used to produce histograms, percentiles, etc. So what’s wrong with that? works well only when all responses fit within rate interval technique includes implicit “automatic backoff” and coordination But requirements interested in random, uncoordinated requests
  • 137. ©2013 Azul Systems, Inc. The coordinated omission problem
  • 138. ©2013 Azul Systems, Inc. The coordinated omission problem Common Example B (monitoring): System monitors and records each transaction latency Latency measured between start and end of each operation keeps observed latency stats of some sort (log, histogram, etc.)
  • 139. ©2013 Azul Systems, Inc. The coordinated omission problem Common Example B (monitoring): System monitors and records each transaction latency Latency measured between start and end of each operation keeps observed latency stats of some sort (log, histogram, etc.) So what’s wrong with that? works correctly well only when no queuing occurs Long operations only get measured once delays outside of timing window do not get measured at all queued operations are measured wrong
  • 140. ©2013 Azul Systems, Inc. Common Example B: Coordinated Omission in Monitoring Code Long operations only get measured once delays outside of timing window do not get measured at all
  • 141. ©2013 Azul Systems, Inc. Common Example B: Coordinated Omission in Monitoring Code Long operations only get measured once delays outside of timing window do not get measured at all
  • 142. How bad can this get? Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec
  • 143. How bad can this get? System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec
  • 144. How bad can this get? System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec How would you characterize this system?
  • 145. How bad can this get? Avg. is 1 msec over 1st 100 sec System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec How would you characterize this system?
  • 146. How bad can this get? Avg. is 1 msec over 1st 100 sec System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec How would you characterize this system? Avg. is 50 sec. over next 100 sec
  • 147. How bad can this get? Avg. is 1 msec over 1st 100 sec System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec How would you characterize this system? Avg. is 50 sec. over next 100 sec Overall Average response time is ~25 sec.
  • 148. How bad can this get? Avg. is 1 msec over 1st 100 sec System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec How would you characterize this system? ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec Avg. is 50 sec. over next 100 sec Overall Average response time is ~25 sec.
  • 149. Measurement in practice System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec
  • 150. Measurement in practice System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec
  • 151. Measurement in practice System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec Naïve Characterization
  • 152. Measurement in practice System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec Naïve Characterization 10,000 @ 1msec 1 @ 100 second
  • 153. Measurement in practice System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec Naïve Characterization 10,000 @ 1msec 1 @ 100 second 99.99%‘ile is 1 msec! Average. is 10.9msec! Std. Dev. is 0.99sec! (should be ~100sec) (should be ~25 sec)
  • 154. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each
  • 155. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each
  • 156. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
  • 157. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec
  • 158. Proper measurement System Stalled for 100 Sec Elapsed Time System easily handles 100 requests/sec Responds to each in 1msec 10,000 results Varying linearly from 100 sec to 10 msec 10,000 results @ 1 msec each ~50%‘ile is 1 msec ~75%‘ile is 50 sec 99.99%‘ile is ~100sec Coordinated Omission
  • 159. ©2013 Azul Systems, Inc. The coordinated omission problem It is MUCH more common than you may think...
  • 160. ©2012 Azul Systems, Inc. JMeter makes this mistake... (so do others) Before Correction After Correcting for Omission
  • 161. ©2012 Azul Systems, Inc. The “real” world
  • 162. ©2012 Azul Systems, Inc. The “real” world Results were collected by a single client thread
  • 163. ©2012 Azul Systems, Inc. The “real” world 26.182 seconds represents 1.29% of the total time Results were collected by a single client thread
  • 164. ©2012 Azul Systems, Inc. The “real” world 99%‘ile MUST be at least 0.29% of total time (1.29% - 1%) which would be 5.9 seconds 26.182 seconds represents 1.29% of the total time Results were collected by a single client thread
  • 165. ©2012 Azul Systems, Inc. The “real” world 99%‘ile MUST be at least 0.29% of total time (1.29% - 1%) which would be 5.9 seconds 26.182 seconds represents 1.29% of the total time wrong by a factor of 1,000x Results were collected by a single client thread
  • 166. ©2012 Azul Systems, Inc. The “real” world A world record SPECjEnterprise2010 result
  • 167. ©2012 Azul Systems, Inc. The “real” world A world record SPECjEnterprise2010 result
  • 168. ©2012 Azul Systems, Inc. The “real” world 305.197 seconds represents 8.4% of the timing run A world record SPECjEnterprise2010 result
  • 169. ©2012 Azul Systems, Inc. The “real” world The max is 762 (!!!) standard deviations away from the mean 305.197 seconds represents 8.4% of the timing run A world record SPECjEnterprise2010 result
  • 170. ©2012 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# Max=7995.39* 0# 1000# 2000# 3000# 4000# 5000# 6000# 7000# 8000# 9000# Latency*(msec)* * * Percen7le* Uncorrected*Latency*by*Percen7le*Distribu7on* 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# Max=7995.39* 0# 1000# 2000# 3000# 4000# 5000# 6000# 7000# 8000# 9000# Latency*(msec)* * * Percen7le* Corrected*Latency*by*Percen7le*Distribu7on* Real World Coordinated Omission effects Before Correction After Correction
  • 171. ©2012 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# Max=7995.39* 0# 1000# 2000# 3000# 4000# 5000# 6000# 7000# 8000# 9000# Latency*(msec)* * * Percen7le* Uncorrected*Latency*by*Percen7le*Distribu7on* 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# Max=7995.39* 0# 1000# 2000# 3000# 4000# 5000# 6000# 7000# 8000# 9000# Latency*(msec)* * * Percen7le* Corrected*Latency*by*Percen7le*Distribu7on* Real World Coordinated Omission effects Before Correction After Correction
  • 172. ©2012 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# Max=7995.39* 0# 1000# 2000# 3000# 4000# 5000# 6000# 7000# 8000# 9000# Latency*(msec)* * * Percen7le* Uncorrected*Latency*by*Percen7le*Distribu7on* 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# Max=7995.39* 0# 1000# 2000# 3000# 4000# 5000# 6000# 7000# 8000# 9000# Latency*(msec)* * * Percen7le* Corrected*Latency*by*Percen7le*Distribu7on* Real World Coordinated Omission effects Before Correction After Correction Wrong by 7x
  • 173. ©2013 Azul Systems, Inc. Real World Coordinated Omission effects 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base#
  • 174. ©2013 Azul Systems, Inc. Real World Coordinated Omission effects 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base# Uncorrected Data
  • 175. ©2013 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base# 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base# HotSpot#Corrected1# HotSpot#Corrected2# Real World Coordinated Omission effects Uncorrected Data
  • 176. ©2013 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base# 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base# HotSpot#Corrected1# HotSpot#Corrected2# Real World Coordinated Omission effects Uncorrected Data Corrected for Coordinated Omission
  • 177. ©2013 Azul Systems, Inc. Real World Coordinated Omission effects (Why I care) 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base# HotSpot#Corrected1# HotSpot#Corrected2# Zing#Base# Zing#Corrected1# Zing#Corrected2#
  • 178. ©2013 Azul Systems, Inc. Real World Coordinated Omission effects (Why I care) 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base# HotSpot#Corrected1# HotSpot#Corrected2# Zing#Base# Zing#Corrected1# Zing#Corrected2# Zing
  • 179. ©2013 Azul Systems, Inc. Real World Coordinated Omission effects (Why I care) 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base# HotSpot#Corrected1# HotSpot#Corrected2# Zing#Base# Zing#Corrected1# Zing#Corrected2# Zing “other” JVM
  • 180. ©2013 Azul Systems, Inc. Real World Coordinated Omission effects (Why I care) 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# 0# 100000# 200000# 300000# 400000# 500000# 600000# !Dura&on!(usec)! ! ! Percen&le! Dura&on!by!Percen&le!Distribu&on! HotSpot#Base# HotSpot#Corrected1# HotSpot#Corrected2# Zing#Base# Zing#Corrected1# Zing#Corrected2# A ~2500x difference in reported percentile levels for the problem that Zing eliminates Zing “other” JVM
  • 181. ©2013 Azul Systems, Inc. Suggestions
  • 182. ©2013 Azul Systems, Inc. Suggestions Whatever your measurement technique is, TEST IT.
  • 183. ©2013 Azul Systems, Inc. Suggestions Whatever your measurement technique is, TEST IT. Run your measurement method against an artificial system that creates hypothetical pauses scenarios. See if your reported results agree with how you would describe that system behavior
  • 184. ©2013 Azul Systems, Inc. Suggestions Whatever your measurement technique is, TEST IT. Run your measurement method against an artificial system that creates hypothetical pauses scenarios. See if your reported results agree with how you would describe that system behavior Don’t waste time analyzing until you establish sanity
  • 185. ©2013 Azul Systems, Inc. Suggestions Whatever your measurement technique is, TEST IT. Run your measurement method against an artificial system that creates hypothetical pauses scenarios. See if your reported results agree with how you would describe that system behavior Don’t waste time analyzing until you establish sanity Don’t EVER use or derive from std. deviation
  • 186. ©2013 Azul Systems, Inc. Suggestions Whatever your measurement technique is, TEST IT. Run your measurement method against an artificial system that creates hypothetical pauses scenarios. See if your reported results agree with how you would describe that system behavior Don’t waste time analyzing until you establish sanity Don’t EVER use or derive from std. deviation ALWAYS measure Max time. Consider what it means... Be suspicious.
  • 187. ©2013 Azul Systems, Inc. Suggestions Whatever your measurement technique is, TEST IT. Run your measurement method against an artificial system that creates hypothetical pauses scenarios. See if your reported results agree with how you would describe that system behavior Don’t waste time analyzing until you establish sanity Don’t EVER use or derive from std. deviation ALWAYS measure Max time. Consider what it means... Be suspicious. Measure %‘iles. Lots of them.
  • 188. ©2013 Azul Systems, Inc. Some Tools
  • 189. ©2013 Azul Systems, Inc. HdrHistogram
  • 190. ©2012 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# Max=137.472+ 0# 20# 40# 60# 80# 100# 120# 140# 160# Latency+(msec)+ + + Percen8le+ Latency+by+Percen8le+Distribu8on+ HdrHistogram If you want to be able to produce graphs like this... You need both good dynamic range and good resolution
  • 191. ©2013 Azul Systems, Inc. HdrHistogram background
  • 192. ©2013 Azul Systems, Inc. HdrHistogram background Goal: Collect data for good latency characterization... Including acceptable precision at and between varying percentile levels
  • 193. ©2013 Azul Systems, Inc. HdrHistogram background Goal: Collect data for good latency characterization... Including acceptable precision at and between varying percentile levels Existing alternatives Record all data, analyze later (e.g. sort and get 99.9%‘ile). Record in traditional histograms
  • 194. ©2013 Azul Systems, Inc. HdrHistogram background Goal: Collect data for good latency characterization... Including acceptable precision at and between varying percentile levels Existing alternatives Record all data, analyze later (e.g. sort and get 99.9%‘ile). Record in traditional histograms Traditional Histograms: Linear bins, Logarithmic bins, or Arbitrary bins Linear requires lots of storage to cover range with good resolution Logarithmic covers wide range but has terrible precisions Arbitrary is.... arbitrary. Works only when you have a good feel for the interesting parts of the value range
  • 195. ©2013 Azul Systems, Inc. HdrHistogram
  • 196. ©2013 Azul Systems, Inc. HdrHistogram A High Dynamic Range Histogram Covers a configurable dynamic value range At configurable precision (expressed as number of significant digits)
  • 197. ©2013 Azul Systems, Inc. HdrHistogram A High Dynamic Range Histogram Covers a configurable dynamic value range At configurable precision (expressed as number of significant digits) For Example: Track values between 1 microsecond and 1 hour With 3 decimal points of resolution
  • 198. ©2013 Azul Systems, Inc. HdrHistogram A High Dynamic Range Histogram Covers a configurable dynamic value range At configurable precision (expressed as number of significant digits) For Example: Track values between 1 microsecond and 1 hour With 3 decimal points of resolution Built-in [optional] compensation for Coordinated Omission
  • 199. ©2013 Azul Systems, Inc. HdrHistogram A High Dynamic Range Histogram Covers a configurable dynamic value range At configurable precision (expressed as number of significant digits) For Example: Track values between 1 microsecond and 1 hour With 3 decimal points of resolution Built-in [optional] compensation for Coordinated Omission Open Source On github, released to the public domain, creative commons CC0
  • 200. ©2013 Azul Systems, Inc. HdrHistogram
  • 201. ©2013 Azul Systems, Inc. HdrHistogram Fixed cost in both space and time Built with “latency sensitive” applications in mind Recording values does not allocate or grow any data structures Recording values uses a fixed computation to determine location (no searches, no variability in recording cost, FAST) Even iterating through histogram can be done with no allocation
  • 202. ©2013 Azul Systems, Inc. HdrHistogram Fixed cost in both space and time Built with “latency sensitive” applications in mind Recording values does not allocate or grow any data structures Recording values uses a fixed computation to determine location (no searches, no variability in recording cost, FAST) Even iterating through histogram can be done with no allocation Internals work like a “floating point” data structure “Exponent” and “Mantissa” Exponent determines “Mantissa bucket” to use “Mantissa buckets” provide linear value range for a given exponent. Each have enough linear entries to support required precision
  • 203. ©2012 Azul Systems, Inc. HdrHistogram
  • 204. ©2012 Azul Systems, Inc. HdrHistogram Provides tools for iteration Linear, Logarithmic, Percentile
  • 205. ©2012 Azul Systems, Inc. HdrHistogram Provides tools for iteration Linear, Logarithmic, Percentile Supports percentile iterators Practical due to high dynamic range
  • 206. ©2012 Azul Systems, Inc. HdrHistogram Provides tools for iteration Linear, Logarithmic, Percentile Supports percentile iterators Practical due to high dynamic range Convenient percentile output 10% intervals between 0 and 50% 5% intervals between 50% and 75% 2.5% intervals between 75% and 87.5%... Very useful for feeding percentile distribution graphs...
  • 207. ©2012 Azul Systems, Inc. HdrHistogram Provides tools for iteration Linear, Logarithmic, Percentile Supports percentile iterators Practical due to high dynamic range Convenient percentile output 10% intervals between 0 and 50% 5% intervals between 50% and 75% 2.5% intervals between 75% and 87.5%... Very useful for feeding percentile distribution graphs...
  • 208. ©2012 Azul Systems, Inc. 0%# 90%# 99%# 99.9%# 99.99%# 99.999%# 99.9999%# Max=137.472+ 0# 20# 40# 60# 80# 100# 120# 140# 160# Latency+(msec)+ + + Percen8le+ Latency+by+Percen8le+Distribu8on+ HdrHistogram
  • 209. ©2013 Azul Systems, Inc. jHiccup
  • 210. ©2012 Azul Systems, Inc. Incontinuities in Java platform execution 0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800" 0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups"by"Time"Interval" Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=1665.024& 0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups"by"Percen@le"Distribu@on"
  • 211. ©2013 Azul Systems, Inc. jHiccup
  • 212. ©2013 Azul Systems, Inc. jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform Plots results in simple, consistent format
  • 213. ©2013 Azul Systems, Inc. jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform Plots results in simple, consistent format Simple, non-intrusive As simple as adding jHiccup.jar as a java agent: % java -javaagent=jHiccup.jar myApp myflags or attaching jHiccup to a running process: % jHiccup -p <pid> Adds a background thread that samples time @ 1000/sec into an HdrHistogram
  • 214. ©2013 Azul Systems, Inc. jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform Plots results in simple, consistent format Simple, non-intrusive As simple as adding jHiccup.jar as a java agent: % java -javaagent=jHiccup.jar myApp myflags or attaching jHiccup to a running process: % jHiccup -p <pid> Adds a background thread that samples time @ 1000/sec into an HdrHistogram Open Source. Released to the public domain
  • 215. ©2012 Azul Systems, Inc. Telco App Example 0" 20" 40" 60" 80" 100" 120" 140" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA"
  • 216. ©2012 Azul Systems, Inc. Telco App Example 0" 20" 40" 60" 80" 100" 120" 140" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA"
  • 217. ©2012 Azul Systems, Inc. Telco App Example 0" 20" 40" 60" 80" 100" 120" 140" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA" Max Time per interval
  • 218. ©2012 Azul Systems, Inc. Telco App Example 0" 20" 40" 60" 80" 100" 120" 140" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA" Max Time per interval
  • 219. ©2012 Azul Systems, Inc. Telco App Example 0" 20" 40" 60" 80" 100" 120" 140" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA" Max Time per interval Hiccup duration at percentile levels
  • 220. ©2012 Azul Systems, Inc. Telco App Example 0" 20" 40" 60" 80" 100" 120" 140" 0" 500" 1000" 1500" 2000" 2500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" 0" 20" 40" 60" 80" 100" 120" 140" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Hiccups"by"Percen?le" SLA" Optional SLA plotting Max Time per interval Hiccup duration at percentile levels
  • 221. ©2012 Azul Systems, Inc. Fun with jHiccup
  • 222. ©2012 Azul Systems, Inc. Oracle HotSpot CMS, 1GB in an 8GB heap 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=13156.352& 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Zing 5, 1GB in an 8GB heap 0" 5" 10" 15" 20" 25" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" Max=20.384& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on&
  • 223. ©2012 Azul Systems, Inc. Oracle HotSpot CMS, 1GB in an 8GB heap 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=13156.352& 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Zing 5, 1GB in an 8GB heap 0" 5" 10" 15" 20" 25" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%" Max=20.384& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on&
  • 224. ©2012 Azul Systems, Inc. Oracle HotSpot CMS, 1GB in an 8GB heap 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=13156.352& 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Zing 5, 1GB in an 8GB heap 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" 0" 500" 1000" 1500" 2000" 2500" 3000" 3500" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"Max=20.384& 0" 2000" 4000" 6000" 8000" 10000" 12000" 14000" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Drawn to scale
  • 225. ©2012 Azul Systems, Inc. Good for both “squishy” real time (human response times) and “soft” real time (low latency software systems)
  • 226. ©2012 Azul Systems, Inc. 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=22.656& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Oracle HotSpot (pure newgen) Low latency trading application
  • 227. ©2012 Azul Systems, Inc. 0" 0.2" 0.4" 0.6" 0.8" 1" 1.2" 1.4" 1.6" 1.8" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=1.568& 0" 0.2" 0.4" 0.6" 0.8" 1" 1.2" 1.4" 1.6" 1.8" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=22.656& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Oracle HotSpot (pure newgen) Zing Low latency trading application
  • 228. ©2012 Azul Systems, Inc. 0" 0.2" 0.4" 0.6" 0.8" 1" 1.2" 1.4" 1.6" 1.8" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=1.568& 0" 0.2" 0.4" 0.6" 0.8" 1" 1.2" 1.4" 1.6" 1.8" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=22.656& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& Oracle HotSpot (pure newgen) Zing Low latency trading application
  • 229. ©2012 Azul Systems, Inc. Low latency - Drawn to scale Oracle HotSpot (pure newgen) Zing 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=1.568& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on& 0" 5" 10" 15" 20" 25" 0" 100" 200" 300" 400" 500" 600" Hiccup&Dura*on&(msec)& &Elapsed&Time&(sec)& Hiccups&by&Time&Interval& Max"per"Interval" 99%" 99.90%" 99.99%" Max" 0%" 90%" 99%" 99.9%" 99.99%" 99.999%" Max=22.656& 0" 5" 10" 15" 20" 25" Hiccup&Dura*on&(msec)& & & Percen*le& Hiccups&by&Percen*le&Distribu*on&
  • 230. ©2013 Azul Systems, Inc. Shameless bragging
  • 231. ©2013 Azul Systems, Inc. Shameless bragging
  • 232. ©2013 Azul Systems, Inc. Zing
  • 233. ©2013 Azul Systems, Inc. Zing A JVM for Linux/x86 servers
  • 234. ©2013 Azul Systems, Inc. Zing A JVM for Linux/x86 servers ELIMINATES Garbage Collection as a concern for enterprise applications
  • 235. ©2013 Azul Systems, Inc. Zing A JVM for Linux/x86 servers ELIMINATES Garbage Collection as a concern for enterprise applications Very wide operating range: Used in both low latency and large scale enterprise application spaces
  • 236. ©2013 Azul Systems, Inc. Zing A JVM for Linux/x86 servers ELIMINATES Garbage Collection as a concern for enterprise applications Very wide operating range: Used in both low latency and large scale enterprise application spaces Decouples scale metrics from response time concerns Transaction rate, data set size, concurrent users, heap size, allocation rate, mutation rate, etc.
  • 237. ©2013 Azul Systems, Inc. What is Zing good for?
  • 238. ©2013 Azul Systems, Inc. What is Zing good for? If you have a server-based Java application
  • 239. ©2013 Azul Systems, Inc. What is Zing good for? If you have a server-based Java application And you are running on Linux
  • 240. ©2013 Azul Systems, Inc. What is Zing good for? If you have a server-based Java application And you are running on Linux And you use using more than ~300MB of memory
  • 241. ©2013 Azul Systems, Inc. What is Zing good for? If you have a server-based Java application And you are running on Linux And you use using more than ~300MB of memory Then Zing will likely deliver superior behavior metrics
  • 242. ©2013 Azul Systems, Inc. Where Zing shines
  • 243. ©2013 Azul Systems, Inc. Where Zing shines Low latency Eliminate behavior blips down to the sub-millisecond-units level
  • 244. ©2013 Azul Systems, Inc. Where Zing shines Low latency Eliminate behavior blips down to the sub-millisecond-units level Machine-to-machine “stuff” Support higher *sustainable* throughput (the one that meets SLAs)
  • 245. ©2013 Azul Systems, Inc. Where Zing shines Low latency Eliminate behavior blips down to the sub-millisecond-units level Machine-to-machine “stuff” Support higher *sustainable* throughput (the one that meets SLAs) Human response times Eliminate user-annoying response time blips. Multi-second and even fraction-of-a-second blips will be completely gone. Support larger memory JVMs *if needed* (e.g. larger virtual user counts, or larger cache, in-memory state, or consolidating multiple instances)
  • 246. ©2013 Azul Systems, Inc. Where Zing shines Low latency Eliminate behavior blips down to the sub-millisecond-units level Machine-to-machine “stuff” Support higher *sustainable* throughput (the one that meets SLAs) Human response times Eliminate user-annoying response time blips. Multi-second and even fraction-of-a-second blips will be completely gone. Support larger memory JVMs *if needed* (e.g. larger virtual user counts, or larger cache, in-memory state, or consolidating multiple instances) “Large” data and in-memory analytics
  • 247. ©2013 Azul Systems, Inc. Where Zing shines Low latency Eliminate behavior blips down to the sub-millisecond-units level Machine-to-machine “stuff” Support higher *sustainable* throughput (the one that meets SLAs) Human response times Eliminate user-annoying response time blips. Multi-second and even fraction-of-a-second blips will be completely gone. Support larger memory JVMs *if needed* (e.g. larger virtual user counts, or larger cache, in-memory state, or consolidating multiple instances) “Large” data and in-memory analytics
  • 248. ©2013 Azul Systems, Inc. Takeaways
  • 249. ©2013 Azul Systems, Inc. Takeaways Standard Deviation and application latency should never show up on the same page...
  • 250. ©2013 Azul Systems, Inc. Takeaways Standard Deviation and application latency should never show up on the same page... If you haven’t stated percentiles and a Max, you haven’t specified your requirements
  • 251. ©2013 Azul Systems, Inc. Takeaways Standard Deviation and application latency should never show up on the same page... If you haven’t stated percentiles and a Max, you haven’t specified your requirements Measuring throughput without latency behavior is [usually] meaningless
  • 252. ©2013 Azul Systems, Inc. Takeaways Standard Deviation and application latency should never show up on the same page... If you haven’t stated percentiles and a Max, you haven’t specified your requirements Measuring throughput without latency behavior is [usually] meaningless Mistakes in measurement/analysis can cause orders- of-magnitude errors and lead to bad business decisions
  • 253. ©2013 Azul Systems, Inc. Q & A http://www.azulsystems.com http://www.jhiccup.com http://giltene.github.com/HdrHistogram
  • 254. ©2013 Azul Systems, Inc. Q & A http://www.azulsystems.com http://www.jhiccup.com http://giltene.github.com/HdrHistogram