Slides for my talk at Monitorama PDX 2019. Histograms have the potential to give us tools to meet SLOs/SLAs, quantile measurements, and very rich heatmap displays for debugging. However, TSDB backends have not fulfilled that promise. This talk covers the concept of histograms as first-class citizens in storage. What does accuracy mean for histograms? How can we store and compress rich histograms for evaluation and querying at massive scale? How can we fix some of the issues with histograms in Prometheus, such as proper aggregation, bucketing, and avoiding clipping?
3. What do we do with Histograms?
4. The Evolution of Histograms
• Pre-aggregated percentiles
• Histograms with buckets
• Prometheus histograms
• HDRHistogram
• T-Digests
[Timeline labels: Statsd, Graphite, OpenTSDB, Prometheus, InfluxDB, ???]
5. Overlaid Latency Quantiles
6. Now an incident happens…
7. Heatmaps: Rich Visuals
8. Grafana Heatmaps
• Buckets scale to much more input data, but need TSDB support for histogram buckets
• Time series: flexible, but Grafana needs to read ALL the raw data
9. Useful Histograms
• Should be aggregatable
• Should support quantiles, distributions, and other f(x)
• Heatmaps - histograms over time
• Should be accurate
• Should scale and be efficient
10. Buckets and Accuracy
• Max quantile error = bucket width / lowerBound (worked check after the table)
• Exponential buckets = consistent max quantile errors (Good!)
• Linear almost never makes sense
• Your custom Prom histogram buckets likely have >100% error

Example: (1000, 6E10) value range
Histogram Type | Max Error % | # Buckets
Linear         | 100%        | 60,000,000
Exponential    | 99.1%       | 26
Linear         | 10%         | 600,000,000
Exponential    | 10.0%       | 188
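A quick worked check of the table, using the bucket-count formula from the next slide (# buckets = log(max/min) / log(1 + max_error)) over the (1000, 6E10) range:
• Exponential, 10% max error: log(6E10 / 1000) / log(1.10) ≈ 188 buckets, matching the table
• Exponential, 99.1% max error: log(6E10 / 1000) / log(1.991) ≈ 26 buckets
• Linear, 10% max error: the first bucket can be at most 1000 * 0.10 = 100 wide, so covering 6E10 takes 6E10 / 100 = 600,000,000 buckets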
11. Configuring your Histograms
• Start with the range of values you need: (min, max)
• Pick the desired max quantile error %
• Think about trading off publish frequency for accuracy
• # buckets = log(max/min) / log(1 + max_error)
• Example: max error = 50%, range 1000 to 6E10:
  numBuckets = Math.log(6E10 / 1000) / Math.log(1 + 0.50)
  exponentialBuckets(1000, 1 + 0.50, numBuckets)
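The slide's example, expanded into a minimal self-contained Scala sketch. The exponentialBuckets helper below is an illustrative stand-in for the equivalent client-library helper (e.g. the Prometheus Java client's), not FiloDB or Prometheus code:

  object BucketConfig {
    // Number of exponential buckets needed to cover [min, max] when each bucket
    // grows by a factor of (1 + maxError), per the formula on this slide.
    // Rounded to the nearest integer (44 and 188 for the slide's examples).
    def numBuckets(min: Double, max: Double, maxError: Double): Int =
      math.round(math.log(max / min) / math.log(1 + maxError)).toInt

    // Illustrative stand-in: `count` bucket upper bounds starting at `start`,
    // each `factor` times the previous one.
    def exponentialBuckets(start: Double, factor: Double, count: Int): Seq[Double] =
      (0 until count).map(i => start * math.pow(factor, i))

    def main(args: Array[String]): Unit = {
      // Example from this slide: max error = 50%, value range 1000 to 6E10
      val n = numBuckets(1000, 6e10, 0.50)                  // ~44 buckets
      println(exponentialBuckets(1000, 1 + 0.50, n).mkString(", "))
    }
  }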
12. Histograms at Scale
13. Histograms as First-Class Citizens
• Modeling, transporting, and storing histograms holistically offers many benefits
• Scalability: much better storage, network, query speed
• Proper aggregations
• Better accuracy and features
• Adaptable to better histogram designs in the future
• Almost nobody is doing this yet
14. Prometheus Histogram Schema
5 buckets, sum, count per histogram = 7 series; values shown at four successive sample times:

Series  | __name__      | le  | t1 | t2 | t3 | t4
Series1 | metric_sum    |     | 44 | 35 | 50 | 60
Series2 | metric_count  |     |  5 |  6 | 10 | 11
Series3 | metric_bucket | 0.5 |  0 |  1 |  1 |  2
Series4 | metric_bucket | 2.0 |  2 |  4 |  5 |  6
Series5 | metric_bucket | 5.0 |  3 |  6 |  8 | 10
Series6 | metric_bucket | 10  |  5 |  6 |  9 | 11
Series7 | metric_bucket | 25  |  5 |  6 | 10 | 11
15. The Scale Problem with Histograms
• My app: 100 metrics, 20 histograms
• Assume a range of (1000, 6E10)
• Notice how histograms dominate the time series!

Max error % | Num buckets | Histogram Series | Other Series | Total Series
50%         | 44          | 882              | 80           | 962
10%         | 188         | 3762             | 80           | 3842
2%          | 905         | 18102            | 80           | 18182
16. Mama, we got a problem
• Actual system: hundreds of millions of metrics, each one with a histogram of 64 buckets
• Using Prometheus would lead to tens of billions of series
17. Prometheus: Raw Data
One histogram sample arrives as 7 separate records:

__name__      | le  | Zone    | value
metric_sum    |     | us-west | 44
metric_count  |     | us-west | 5
metric_bucket | 0.5 | us-west | 0
metric_bucket | 2.0 | us-west | 2
metric_bucket | 5.0 | us-west | 3
metric_bucket | 10  | us-west | 5
metric_bucket | 25  | us-west | 5
18. Atomicity Issues
• Prometheus export and scrape do not guarantee grouping of histogram buckets
• Easy to get only part of a histogram
• FiloDB is a distributed database: 7 records might end up on 7 different nodes!
• Calculating histogram_quantile: talk to 7 nodes for every query!
19. Single Histogram Schema
5 buckets, sum, count per histogram, stored as a single series (Series1); the same four sample times as before:

Sample | Sum | Count | Hist (le = 0.5, 2.0, 5.0, 10, 25)
t1     | 44  |  5    | [0, 2, 3, 5, 5]
t2     | 35  |  6    | [1, 4, 6, 6, 6]
t3     | 50  | 10    | [1, 5, 8, 9, 10]
t4     | 60  | 11    | [2, 6, 10, 11, 11]
20. Single Histogram Raw Data
__name__ = Metric, Zone = us-west | Sum = 44 | Count = 5 | Hist (0.5, 2, 5, 10, 25) = [0, 2, 3, 5, 5]
• One record, not (n + 2). No distribution problem!
• Labels only appear once
• Savings proportional to the number of histogram buckets
• 50x savings for 64 histogram buckets
21. Much smaller network and disk usage
• One time series vs 66 -> 50x network I/O reduction
• The single-histogram schema in FiloDB uses < 0.2 bytes per histogram bucket
[Chart: Network I/O, bytes per histogram (0-14,000), series-per-bucket vs series-per-histogram]
[Chart: Storage cost, bytes per bucket (0-1.6), series-per-bucket vs series-per-histogram]
22. Optimizing Histograms: Compression
• Delta encoding of increasing bucket values (sketch below):
  [0, 2, 3, 5, 5] -> [0, 2, 1, 2, 0]
  [1, 4, 6, 6, 6] -> [1, 3, 2, 0, 0]
• Compressed size about 4x-10x better than one time series per bucket (64 buckets; FiloDB)
• 0.18 bytes per histogram bucket (range: 0.16 - 0.61)

FiloDB SingleHistogram | 0.18 bytes/bucket
Prometheus             | 1.5 bytes/bucket
Raw data               | 8 bytes/bucket
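A minimal Scala sketch of just the delta step above: cumulative bucket counts become per-bucket deltas, which are far more compressible. Names are illustrative, and FiloDB's real encoder presumably adds further bit-level packing to reach 0.18 bytes/bucket; this only shows the transform.

  object BucketDelta {
    // Cumulative bucket counts -> per-bucket deltas, e.g. [0, 2, 3, 5, 5] -> [0, 2, 1, 2, 0]
    def deltaEncode(cumulative: Seq[Long]): Seq[Long] =
      cumulative.zip(0L +: cumulative).map { case (cur, prev) => cur - prev }

    // Inverse transform: per-bucket deltas back to cumulative counts
    def deltaDecode(deltas: Seq[Long]): Seq[Long] =
      deltas.scanLeft(0L)(_ + _).tail

    def main(args: Array[String]): Unit = {
      println(deltaEncode(Seq(0L, 2L, 3L, 5L, 5L)))  // List(0, 2, 1, 2, 0)
      println(deltaEncode(Seq(1L, 4L, 6L, 6L, 6L)))  // List(1, 3, 2, 0, 0)
    }
  }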
23. Optimizing Histograms: Querying (64 Buckets)
• histogram_quantile() is more than 100x faster than series-per-bucket
• No need for group-by
• Localized computation vs needing to jump across 64 bucket time series
[Chart: histogram_quantile() QPS (0-30,000), series-per-bucket vs series-per-histogram]
24. Rich Histograms: Usability and Correctness
25. Changing buckets… sum()
• sum(rate(http_req_latency{…}[5m])) by (le)
• Different buckets lead to incorrect sums: a series that has already switched to the new scheme contributes to le=25 and le=100 while older series do not, so the summed buckets no longer form a consistent cumulative distribution
[Diagram: buckets le = 2.5, 5, 10, 50, +Inf, with le = 25 and 100 added by a bucket change]
26. Holistic Histograms: Correct Sums
• Adding histograms holistically allows us to track bucket changes and correctly sum them (sketch below)
[Diagram: buckets le = 2.5, 5, 10, 50, +Inf, with le = 25 and 100 added by a bucket change]
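One way such a sum can be made correct, sketched here under the assumption that the newer scheme only adds buckets (as in the diagram): project both histograms onto the bucket boundaries they share, then add. The counts below are illustrative, and this is only the idea, not FiloDB's actual merge code.

  object HistogramSum {
    // Add two cumulative-bucket histograms with different bucket schemes by
    // projecting both onto the boundaries they have in common. Cumulative counts
    // at a shared `le` boundary mean the same thing in both schemes, so the
    // result is a correct (if coarser) sum.
    def add(aLes: Seq[Double], aCounts: Seq[Long],
            bLes: Seq[Double], bCounts: Seq[Long]): (Seq[Double], Seq[Long]) = {
      val shared = aLes.toSet.intersect(bLes.toSet).toSeq.sorted
      val a = aLes.zip(aCounts).toMap
      val b = bLes.zip(bCounts).toMap
      (shared, shared.map(le => a(le) + b(le)))
    }

    def main(args: Array[String]): Unit = {
      val inf = Double.PositiveInfinity
      // Old scheme le = 2.5, 5, 10, 50, +Inf; new scheme adds le = 25 and 100.
      // Counts are illustrative cumulative values.
      val (les, sums) = add(
        Seq(2.5, 5.0, 10.0, 50.0, inf),              Seq(1L, 3L, 4L, 7L, 8L),
        Seq(2.5, 5.0, 10.0, 25.0, 50.0, 100.0, inf), Seq(2L, 2L, 5L, 6L, 9L, 9L, 10L))
      println(les.zip(sums))  // summed over the shared boundaries 2.5, 5, 10, 50, +Inf
    }
  }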
27. histogram_quantile clipping
• At 20:00, the quantile is clipped at the second-to-last bucket boundary of 10.0
28. histogram_max_quantile
• Client sends a max value at each time interval
29. histogram_max_quantile
• Having a known max allows us to interpolate in the last bucket (sketch below)
• Cannot interpolate to +Inf
• https://github.com/filodb/FiloDB/pull/361
[Diagram: buckets le = 2.5, 5, 10, 25, +Inf; known max = 40; the 0.9 quantile interpolated within the last bucket]
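A Scala sketch of the idea, assuming cumulative bucket counts and the usual linear interpolation inside a bucket; the known max stands in for +Inf as the upper bound of the last bucket instead of clipping at the last finite le. This illustrates the approach only, not the actual code in the PR above.

  object MaxQuantile {
    // Estimate quantile `q` from cumulative bucket counts. `les` are bucket upper
    // bounds with the last one +Inf; a known `max` lets us interpolate inside the
    // +Inf bucket instead of clipping at the last finite bound.
    def quantile(q: Double, les: Array[Double], counts: Array[Long], max: Double): Double = {
      val rank  = q * counts.last                      // target cumulative count
      val b     = counts.indexWhere(_ >= rank)         // first bucket reaching that rank
      val lower = if (b == 0) 0.0 else les(b - 1)
      val upper = if (les(b).isPosInfinity) max else les(b)
      val below = if (b == 0) 0L else counts(b - 1)
      val inBkt = counts(b) - below
      if (inBkt == 0) lower
      else lower + (rank - below) / inBkt * (upper - lower)
    }

    def main(args: Array[String]): Unit = {
      // Buckets from the diagram: le = 2.5, 5, 10, 25, +Inf, with a known max of 40
      val les    = Array(2.5, 5.0, 10.0, 25.0, Double.PositiveInfinity)
      val counts = Array(2L, 4L, 6L, 8L, 10L)          // illustrative cumulative counts
      // The 0.9 quantile falls in the +Inf bucket: interpolated to 32.5 instead of clipped at 25
      println(quantile(0.9, les, counts, max = 40.0))
    }
  }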
30. Ad-Hoc Histograms
• Just the quantile, min, and max from gauges is not that useful
• Get a heat map for CPU use across k8s containers:
  histogram(2, 8, container_cpu_usage_seconds_total{…})
• Aggregate a histogram across gauges using the new histogram() function
• Yes, Grafana can do heat maps from raw series, but you can only read so many raw time series :)
31. Summary: Rich Histograms at Scale
• Treating histograms as a first-class citizen
• Massive savings in storage and network I/O
• Solve aggregation and other correctness issues
• Move towards T-Digests and future formats
32. Thank you very much!
Please reach out to help make useful histograms at scale a reality!
@evanfchan
http://github.com/filodb/FiloDB
Monitorama slack: #talk-evan-chan
33. Example 2: Write Size
34. Heatmap 2: Write Size
35. Histogram aggregation: Prometheus
• Group by is needed for summing histogram buckets due to the data model - a leaky abstraction
• What if a dev changes the histogram scheme (# of buckets, etc.)?
• Not possible to resolve scheme differences in Prometheus, since aggregation knows nothing about histograms
sum(rate(histogram_bucket{app="foo"}[5m])) by (le)
36. Histogram aggregation: FiloDB
• No need for _bucket, but you need to select the histogram column
• No need for group by: histograms are natively understood and correct aggregations happen
sum(rate(histogram{app="foo", __col__="h"}[5m]))