Chris Lohfink
Cassandra Metrics
About me
• Software developer at DataStax
• OpsCenter, Metrics & Cassandra interactions
© DataStax, All Rights Reserved. 2
What this talk is
• What the things the metrics report actually mean (da dum tis)
• How metrics evolved in C*
© DataStax, All Rights Reserved. 3
Collecting
Not how, but what and why
Cassandra Metrics
• For the most part metrics do not break backwards compatibility
• Until they do (from deprecation or bugs)
• Deprecated metrics are hard to identify without looking at source
code, so their disappearance may have surprising impacts even if
deprecated for years.
• e.g. Cassandra 2.2 removal of “Recent Latency” metrics
© DataStax, All Rights Reserved. 5
C* Metrics Pre-1.1
© DataStax, All Rights Reserved. 6
• Classes implemented MBeans and metrics were added in place
• ColumnFamilyStore -> ColumnFamilyStoreMBean
• Semi-adhoc, tightly coupled to code but had a “theme” or common
abstractions
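As a rough illustration of what consuming these looked like (a minimal sketch, not from the deck), a collector would read attributes straight off the MBeans over JMX. The host/port, ObjectName and attribute name below are assumptions that vary by Cassandra version:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: reading a pre-1.1 style metric attribute over JMX.
// The ObjectName and attribute name are assumptions and differ between C* versions.
public class ReadRecentLatency {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Hypothetical ColumnFamilyStoreMBean for keyspace "ks", table "tbl"
            ObjectName cfs = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,keyspace=ks,columnfamily=tbl");
            // Reading the "recent" attributes resets them (see the LatencyTracker slides)
            Object recent = mbs.getAttribute(cfs, "RecentReadLatencyMicros");
            System.out.println("recent read latency (micros): " + recent);
        } finally {
            connector.close();
        }
    }
}
```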
Latency Tracker
• LatencyTracker stores:
• recent histogram
• total histogram
• number of ops
• total latency
• Use latency/#ops since last time called to compute “recent” average
latency
• Every time queried it will reset the latency and histogram.
© DataStax, All Rights Reserved. 7
Describing Latencies
© DataStax, All Rights Reserved. 8
0 100 200 300 400 500 600 700 800 900 1000
• Listing the raw values:
13ms, 14ms, 2ms, 13ms, 90ms, 734ms, 8ms, 23ms, 30ms
• Doesn’t scale well
• Not easy to parse; with larger amounts it can be difficult to find the high values
Describing Latencies
© DataStax, All Rights Reserved. 9
0 100 200 300 400 500 600 700 800 900 1000
• Average:
• 103ms
Describing Latencies
© DataStax, All Rights Reserved. 10
0 100 200 300 400 500 600 700 800 900 1000
• Average:
• 103ms
Describing Latencies
© DataStax, All Rights Reserved. 11
0 100 200 300 400 500 600 700 800 900 1000
• Average:
• 103ms
• Missing outliers
Describing Latencies
© DataStax, All Rights Reserved. 12
0 100 200 300 400 500 600 700 800 900 1000
• Average:
• 103ms
• Missing outliers
• Max: 734ms
• Min: 2ms
Describing Latencies
© DataStax, All Rights Reserved. 13
0 100 200 300 400 500 600 700 800 900 1000
• Average:
• 103ms
• Missing outliers
• Max: 734ms
• Min: 2ms
Latency Tracker
• LatencyTracker stores:
• recent histogram
• total histogram
• number of ops
• total latency
• Use latency/#ops since last time called to compute “recent”
average latency
• Every time queried it will reset the latency and histogram.
© DataStax, All Rights Reserved. 14
Recent Average Latencies
© DataStax, All Rights Reserved. 15
0 100 200 300 400 500 600 700 800 900 1000
• Reported latency from
• Sum of latencies since last called
• Number of requests since last called
• Average:
• 103ms
• Outliers lost
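A minimal sketch of the reset-on-read “recent average” idea described above (my own illustration, not Cassandra's LatencyTracker): the reader only sees total latency divided by op count since the previous read, so the 734ms outlier is flattened into the ~103ms average and then discarded.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a reset-on-read "recent average" tracker, in the spirit of the old
// LatencyTracker behaviour described above (not the actual Cassandra class).
public class RecentAverage {
    private final AtomicLong totalLatencyMicros = new AtomicLong();
    private final AtomicLong ops = new AtomicLong();

    public void addMicros(long latencyMicros) {
        totalLatencyMicros.addAndGet(latencyMicros);
        ops.incrementAndGet();
    }

    /** Average latency since the last call; reading resets the counters. */
    public double recentAverageMicros() {
        long count = ops.getAndSet(0);
        long total = totalLatencyMicros.getAndSet(0);
        return count == 0 ? 0.0 : (double) total / count;
    }

    public static void main(String[] args) {
        RecentAverage tracker = new RecentAverage();
        for (long ms : new long[] {13, 14, 2, 13, 90, 734, 8, 23, 30}) {
            tracker.addMicros(ms * 1000);
        }
        System.out.println(tracker.recentAverageMicros() / 1000 + " ms"); // ~103 ms
        System.out.println(tracker.recentAverageMicros());                // 0.0, state was reset
    }
}
```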
Histograms
• Describes frequency of data
© DataStax, All Rights Reserved. 16
1, 2, 1, 1, 3, 4, 3, 1
Histograms
• Describes frequency of data
1
© DataStax, All Rights Reserved. 17
1, 2, 1, 1, 3, 4, 3, 1
Histograms
• Describes frequency of data
1
2
© DataStax, All Rights Reserved. 18
1, 2, 1, 1, 3, 4, 3, 1
Histograms
• Describes frequency of data
11
2
© DataStax, All Rights Reserved. 19
1, 2, 1, 1, 3, 4, 3, 1
Histograms
• Describes frequency of data
111
2
© DataStax, All Rights Reserved. 20
1, 2, 1, 1, 3, 4, 3, 1
Histograms
• Describes frequency of data
111
2
3
© DataStax, All Rights Reserved. 21
1, 2, 1, 1, 3, 4, 3, 1
Histograms
• Describes frequency of data
111
2
3
4
© DataStax, All Rights Reserved. 22
1, 2, 1, 1, 3, 4, 3, 1
Histograms
• Describes frequency of data
111
2
33
4
© DataStax, All Rights Reserved. 23
1, 2, 1, 1, 3, 4, 3, 1
Histograms
• Describes frequency of data
1111
2
33
4
© DataStax, All Rights Reserved. 24
1, 2, 1, 1, 3, 4, 3, 1
Histograms
• Describes frequency of data
1111
2
33
4
© DataStax, All Rights Reserved. 25
1, 2, 1, 1, 3, 4, 3, 1
(Bar chart: Count on the y-axis for each value 0-4 on the x-axis)
Histograms
• "bin" the range of values
• divide the entire range of values into a series of intervals
• Count how many values fall into each interval
© DataStax, All Rights Reserved. 26
Histograms
• "bin" the range of values—that is, divide the entire range of values
into a series of intervals—and then count how many values fall into
each interval
© DataStax, All Rights Reserved. 27
0 100 200 300 400 500 600 700 800 900 1000
13, 14, 2, 20, 13, 90, 734, 8, 53, 23, 30
Histograms
• "bin" the range of values—that is, divide the entire range of values
into a series of intervals—and then count how many values fall into
each interval
© DataStax, All Rights Reserved. 28
13, 14, 2, 20, 13, 90, 734, 8, 53, 23, 30
Histograms
• "bin" the range of values—that is, divide the entire range of values
into a series of intervals—and then count how many values fall into
each interval
© DataStax, All Rights Reserved. 29
2, 8, 13, 13, 14, 20, 23, 30, 53, 90, 734
Histograms
• "bin" the range of values—that is, divide the entire range of values
into a series of intervals—and then count how many values fall into
each interval
© DataStax, All Rights Reserved. 30
2, 8, 13, 13, 14, 20, 23, 30, 53, 90, 734
1-10 11-100 101-1000
2 8 1
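A small sketch (mine, not Cassandra code) of the binning step for the example values, using the 1-10 / 11-100 / 101-1000 decade buckets shown above:

```java
import java.util.TreeMap;

// Sketch: bin the example latencies into decade buckets (1-10, 11-100, 101-1000)
// and count how many values fall into each interval.
public class DecadeBins {
    public static void main(String[] args) {
        long[] latencies = {13, 14, 2, 20, 13, 90, 734, 8, 53, 23, 30};
        // key = inclusive upper bound of the bucket, value = count in that bucket
        TreeMap<Long, Integer> bins = new TreeMap<>();
        for (long bound : new long[] {10, 100, 1000}) bins.put(bound, 0);
        for (long v : latencies) {
            long bound = bins.ceilingKey(v);        // smallest bucket bound >= value
            bins.put(bound, bins.get(bound) + 1);
        }
        System.out.println(bins);                   // {10=2, 100=8, 1000=1}
    }
}
```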
Histograms
Approximations
Max: 1000 (actual 734)
© DataStax, All Rights Reserved. 31
1-10 11-100 101-1000
2 8 1
Histograms
Approximations
Max: 1000 (actual 734)
Min: 10 (actual 2)
© DataStax, All Rights Reserved. 32
1-10 11-100 101-1000
2 8 1
Histograms
Approximations
Max: 1000 (actual 734)
Min: 10 (actual 2)
Average: sum / count, (10*2 + 100*8 + 1000*1) / (2+8+1) = 165 (actual 103)
© DataStax, All Rights Reserved. 33
1-10 11-100 101-1000
2 8 1
Histograms
Approximations
Max: 1000 (actual 734)
Min: 10 (actual 2)
Average: sum / count, (10*2 + 100*8 + 1000*1) / (2+8+1) = 165 (actual 103)
Percentiles: 11 requests, so we know 90 percent of the latencies occurred in the 11-100 bucket or
lower.
90th Percentile: 100
© DataStax, All Rights Reserved. 34
1-10 11-100 101-1000
2 8 1
Histograms
Approximations
Max: 1000 (actual 734)
Min: 10 (actual 2)
Average: sum / count, (10*2 + 100*8 + 1000) / (2+8+1) = 165 (actual 103)
Percentiles: 11 requests, so we know 90 percent of the latencies occurred in the 11-100 bucket or
lower.
90th Percentile: 100
© DataStax, All Rights Reserved. 35
1-10 11-100 101-1000
2 8 1
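A sketch of those approximations (my own illustration): everything is derived from the bucket upper bounds and counts alone, which is why max, min and average drift from the true values while percentiles stay usefully bounded.

```java
// Sketch: approximate max, min, average and a percentile from bucket counts only,
// as in the example above (bounds 10, 100, 1000 with counts 2, 8, 1).
public class HistogramApprox {
    public static void main(String[] args) {
        long[] bounds = {10, 100, 1000};
        long[] counts = {2, 8, 1};

        long total = 0, weightedSum = 0, min = 0, max = 0;
        for (int i = 0; i < bounds.length; i++) {
            total += counts[i];
            weightedSum += bounds[i] * counts[i];
            if (counts[i] > 0) {
                if (min == 0) min = bounds[i];   // first non-empty bucket's bound
                max = bounds[i];                 // last non-empty bucket's bound
            }
        }
        System.out.println("max ~ " + max);                    // 1000 (actual 734)
        System.out.println("min ~ " + min);                    // 10   (actual 2)
        System.out.println("avg ~ " + (weightedSum / total));  // 165  (actual ~103)

        // 90th percentile: walk buckets until 90% of the total count is covered
        long target = (long) Math.ceil(0.90 * total);
        long seen = 0;
        for (int i = 0; i < bounds.length; i++) {
            seen += counts[i];
            if (seen >= target) {
                System.out.println("p90 ~ " + bounds[i]);      // 100
                break;
            }
        }
    }
}
```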
EstimatedHistogram
The series starts at 1 and grows by 1.2 each time
1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 17, 20, 24, 29,
…
12108970, 14530764, 17436917, 20924300, 25109160
© DataStax, All Rights Reserved. 36
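A sketch that reproduces the shape of this bucket series (the authoritative version is Cassandra's EstimatedHistogram source): start at 1, grow each bound by roughly 1.2x, and always advance by at least 1 so the small bounds stay distinct.

```java
// Sketch reproducing the shape of the EstimatedHistogram bucket bounds described
// above; not the actual Cassandra implementation.
public class EhBuckets {
    public static void main(String[] args) {
        int size = 90;                               // illustrative bucket count
        long[] offsets = new long[size];
        long last = 1;
        offsets[0] = last;
        for (int i = 1; i < size; i++) {
            long next = Math.round(last * 1.2);
            if (next == last) next = last + 1;       // keep small bounds distinct: 1, 2, 3, ...
            offsets[i] = next;
            last = next;
        }
        // prints 1 2 3 4 5 6 7 8 10 12 14 17 20 24 29 ...
        for (int i = 0; i < 15; i++) System.out.print(offsets[i] + " ");
    }
}
```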
LatencyTracker
Has two histograms
• Recent
• Count of times a latency occurred since last time read for each bin
• Total
• Count of times a latency occurred since Cassandra started for each bin
© DataStax, All Rights Reserved. 37
Total Histogram Deltas
If you keep track of the histogram from the last time you read it, you can find the delta to determine
how many occurred in that interval
Last
1-10 11-100 101-1000
2 8 1
Now
1-10 11-100 101-1000
4 8 2
© DataStax, All Rights Reserved. 38
Total Histogram Deltas
If you keep track of the histogram from the last time you read it, you can find the delta to determine
how many occurred in that interval
Last
1-10 11-100 101-1000
2 8 1
Now
1-10 11-100 101-1000
4 8 2
Delta
1-10 11-100 101-1000
2 0 1
© DataStax, All Rights Reserved.
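A sketch of the delta trick (not the actual OpsCenter/Cassandra code): keep the previous snapshot of the total histogram and subtract it bin by bin to get counts for just that interval.

```java
import java.util.Arrays;

// Sketch: derive per-interval bin counts by subtracting the previous snapshot of
// the "total" histogram from the current one, as in the Last/Now/Delta example.
public class HistogramDelta {
    public static long[] delta(long[] last, long[] now) {
        long[] d = new long[now.length];
        for (int i = 0; i < now.length; i++) {
            d[i] = now[i] - last[i];   // total counts only ever grow, so this is >= 0
        }
        return d;
    }

    public static void main(String[] args) {
        long[] last = {2, 8, 1};   // bins 1-10, 11-100, 101-1000
        long[] now  = {4, 8, 2};
        System.out.println(Arrays.toString(delta(last, now)));   // [2, 0, 1]
    }
}
```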
Cassandra 1.1
• Yammer/Codahale/Dropwizard Metrics introduced
• Awesome!
• Not so awesome…
© DataStax, All Rights Reserved. 40
Reservoirs
• Maintain a sample of the data that is representative of the entire set.
• Can perform operations on the limited, fixed memory set as if on entire dataset
• Vitter's Algorithm R
• Offers a 99.9% confidence level & 5% margin of error
• Simple
• Randomly include value in reservoir, less and less likely as more
values seen
© DataStax, All Rights Reserved. 41
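A compact sketch of Algorithm R-style reservoir sampling, the idea behind the Metrics uniform reservoir (my illustration, not the library's code): the i-th value replaces a random slot with probability size/i, which is exactly why rare outliers tend not to survive.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of Vitter's Algorithm R: maintain a fixed-size uniform random sample of a
// stream. The i-th element (1-based) replaces a random slot with probability size/i.
public class ReservoirR {
    private final long[] sample;
    private long count;                                   // values seen so far

    public ReservoirR(int size) { this.sample = new long[size]; }

    public void update(long value) {
        count++;
        if (count <= sample.length) {
            sample[(int) (count - 1)] = value;            // fill phase
        } else {
            long idx = ThreadLocalRandom.current().nextLong(count);  // uniform in [0, count)
            if (idx < sample.length) sample[(int) idx] = value;      // kept with prob size/count
        }
    }

    public long[] snapshot() {
        return Arrays.copyOf(sample, (int) Math.min(count, sample.length));
    }

    public static void main(String[] args) {
        ReservoirR r = new ReservoirR(8);
        for (long v = 1; v <= 10_000; v++) r.update(v);
        System.out.println(Arrays.toString(r.snapshot())); // 8 roughly uniform samples
    }
}
```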
Reservoirs
• Maintain a sample of the data that is representative of the entire set.
• Can perform operations on the limited, fixed memory set as if on entire dataset
• Vitter's Algorithm R
• Offers a 99.9% confidence level & 5% margin of error
* When the stream has a normal distribution
© DataStax, All Rights Reserved. 42
Metrics Reservoirs
• Random sampling, what can it miss?
– Min
– Max
– Everything in 99th percentile?
– The more rare, the less likely to be included
43
Metrics Reservoirs
• “Good enough” for basic ad hoc viewing but too non-deterministic for many
• Commonly resolved using replacement reservoirs (e.g. HdrHistogram)
44
Metrics Reservoirs
• “Good enough” for basic ad hoc viewing but too non-deterministic for many
• Commonly resolved using replacement reservoirs (e.g. HdrHistogram)
– org.apache.cassandra.metrics.EstimatedHistogramReservoir
45
Cassandra 2.2
• CASSANDRA-5657 – upgrade metrics library (and extend it)
– Replaced reservoir with EH
• Also exposed raw bin counts in values operation
– Deleted deprecated metrics
• Non EH latencies from LatencyTracker
46
Cassandra 2.2
• No recency in histograms
• Currently requires delta'ing the total bin counts, which is beyond
some simple tooling
• CASSANDRA-11752 (fixed 2.2.8, 3.0.9, 3.8)
47
Storage
Storing the data
• We have data, now to store it. Approaches tend to follow:
– Store all data points
• Provide aggregations either pre-computed as entered, MR, or on query
– Round Robin Database
• Only store pre-computed aggregations
• Choice depends heavily on requirements
49
Round Robin Database
• Store state required to generate the aggregations, and only store the
aggregations
– Sum & Count for Average
– Current min, max
– “One pass” or “online” algorithms
• Constant footprint
50
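A minimal sketch of one such rollup bucket (an assumed design for illustration, not OpsCenter's storage code), matching the 60/300/3600 second tables on the following slides: constant-size sum/count/min/max state that is flushed and reset when its period rolls over.

```java
// Sketch of one rollup bucket in a round-robin style store: constant-size online
// state (sum, count, min, max) that is flushed and reset when its period ends.
public class RollupBucket {
    private final long periodSeconds;                 // e.g. 60, 300, 3600
    private double sum, min, max;
    private long count;

    public RollupBucket(long periodSeconds) { this.periodSeconds = periodSeconds; }

    public void update(double value) {
        sum += value;
        min = (count == 0) ? value : Math.min(min, value);
        max = (count == 0) ? value : Math.max(max, value);
        count++;
    }

    /** Called when the period ends: emit the aggregate and reset the state. */
    public String flush() {
        String out = count == 0
                ? periodSeconds + "s: empty"
                : periodSeconds + "s: avg=" + (sum / count) + " min=" + min + " max=" + max;
        sum = 0; count = 0; min = 0; max = 0;
        return out;
    }

    public static void main(String[] args) {
        RollupBucket minute = new RollupBucket(60);
        minute.update(10); minute.update(12); minute.update(14);
        System.out.println(minute.flush());   // 60s: avg=12.0 min=10.0 max=14.0
    }
}
```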
Round Robin Database
• Store state required to generate the aggregations, and only store the aggregations
– Sum & Count for Average
– Current min, max
– “One pass” or “online” algorithms
• Constant footprint
51
60 300 3600
Sum 0 0 0
Count 0 0 0
Min 0 0 0
Max 0 0 0
Round Robin Database
> 10ms @ 00:00
52
60 300 3600
Sum 10 10 10
Count 1 1 1
Min 10 10 10
Max 10 10 10
Round Robin Database
> 10ms @ 00:00
> 12ms @ 00:30
53
60 300 3600
Sum 22 22 22
Count 2 2 2
Min 10 10 10
Max 12 12 12
Round Robin Database
> 10ms @ 00:00
> 12ms @ 00:30
> 14ms @ 00:59
54
60 300 3600
Sum 36 36 36
Count 3 3 3
Min 10 10 10
Max 14 14 14
Round Robin Database
> 10ms @ 00:00
> 12ms @ 00:30
> 14ms @ 00:59
> 13ms @ 01:10
55
60 300 3600
Sum 36 36 36
Count 3 3 3
Min 10 10 10
Max 14 14 14
Round Robin Database
> 10ms @ 00:00
> 12ms @ 00:30
> 14ms @ 00:59
> 13ms @ 01:10
56
60 300 3600
Sum 36 36 36
Count 3 3 3
Min 10 10 10
Max 14 14 14
Average 12
Min 10
Max 14
Round Robin Database
> 10ms @ 00:00
> 12ms @ 00:30
> 14ms @ 00:59
> 13ms @ 01:10
57
60 300 3600
Sum 0 36 36
Count 0 3 3
Min 0 10 10
Max 0 14 14
Round Robin Database
> 10ms @ 00:00
> 12ms @ 00:30
> 14ms @ 00:59
> 13ms @ 01:10
58
60 300 3600
Sum 13 49 49
Count 1 4 4
Min 13 10 10
Max 13 14 14
Max is a lie
• The issue with the deprecated LatencyTracker metrics is that the 1 minute interval
does not have a min/max, so we cannot compute a true min/max; the
rollup's min/max will be the minimum and maximum average
59
Histograms to the rescue (again)
• The histograms of the data do not have this issue, but storage is
more complex. Some options include:
– Store each bin of the histogram as a metric
– Store the percentiles/min/max each as own metric
– Store raw long[90] (possibly compressed)
60
Histogram Storage Size
• Some things to note:
– “Normal” clusters have over 100 tables.
– Each table has at least two histograms we want to record
• Read latency
• Write latency
• Tombstones scanned
• Cells scanned
• Partition cell size
• Partition cell count
61
Histogram Storage
Because we store the extra histograms we have 600 (minimum), with upper
bounds seen to be over 24,000, histograms per minute.
• Storing 1 per bin means [54000] metrics (expensive to store, expensive to
read)
• Storing raw histograms is [600] metrics
• Storing min, max, 50th, 90th, 99th is [3000] metrics
– Additional problems with this
• Can't compute 10th, 95th, 99.99th etc
• Aggregations
62
Aggregating Histograms
Averaging the percentiles
[ INSERT DISAPPOINTED GIL TENE PHOTO ]
© DataStax, All Rights Reserved. 63
Aggregating Histograms
• Consider averaging the maximum
If there is a node with a 10 second GC, but the maximum latency on your other 9 nodes
is 60ms, reporting a “Max 1 second” latency would be misleading.
• Poor at representing hotspots' effects on your application
One node in a 10 node raspberry pi cluster gets 1000 write reqs/sec while the others get 10
reqs/sec. The one node under heavy stress has a 90th percentile of 10 seconds; the other
nodes are basically sub-ms and writes take 1ms at the 90th percentile. Averaging would
report a 1 second 90th percentile, even though 10% of our application's writes are taking
>10 seconds
© DataStax, All Rights Reserved. 64
Aggregating Histograms
Merging histograms from different nodes more accurately can be straightforward:
Node1
1-10 11-100 101-1000
2 8 1
Node2
1-10 11-100 101-1000
2 1 5
Cluster
1-10 11-100 101-1000
4 9 6
© DataStax, All Rights Reserved. 65
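A sketch of that merge: because all nodes share the same bucket boundaries, merging is an element-wise sum of bin counts, and cluster-wide percentiles can then be read from the merged bins instead of averaging per-node percentiles.

```java
import java.util.Arrays;

// Sketch: merging histograms with identical bucket boundaries is an element-wise
// sum of bin counts, as in the Node1 + Node2 = Cluster example above.
public class HistogramMerge {
    public static long[] merge(long[]... histograms) {
        long[] merged = new long[histograms[0].length];
        for (long[] h : histograms) {
            for (int i = 0; i < h.length; i++) merged[i] += h[i];
        }
        return merged;
    }

    public static void main(String[] args) {
        long[] node1 = {2, 8, 1};   // bins 1-10, 11-100, 101-1000
        long[] node2 = {2, 1, 5};
        System.out.println(Arrays.toString(merge(node1, node2)));  // [4, 9, 6]
    }
}
```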
Histogram Storage
Because we store the extra histograms we have 600 (minimum), with upper
bounds seen to be over 24,000, histograms per minute.
• Storing 1 per bin means [54000] metrics (expensive to store, expensive to
read)
• Storing raw histograms is [600] metrics
• Storing min, max, 50th, 90th, 99th is [3000] metrics
– Additional problems with this
• Can't compute 10th, 95th, 99.99th etc
• Aggregations
66
Raw Histogram storage
• Storing a raw histogram's 160 (default) longs is a minimum of 1.2kb
per rollup and a hard sell
– 760kb per minute (600 tables)
– 7.7gb for the 7 day TTL we want to keep our 1 min rollups at
– ~77gb with 10 nodes
– ~2.3 Tb on 10 node clusters with 3k tables
– Expired data isn’t immediately purged so disk space can be much worse
67
Raw Histogram storage
• Goal: We wanted this to be comparable to other min/max/avg metric
storage (12 bytes each)
– 700mb on expected 10 node cluster
– 2gb on extreme 10 node cluster
• Enter compression
68
Compressing Histograms
• Overhead of typical compression makes it a non-starter.
– headers (e.g. 10 bytes for gzip) alone nearly exceed the length used by
existing rollup storage (~12 bytes per metric)
• Instead we opt to leverage known context to reduce the size of the
data along with some universal encoding.
69
Compressing Histograms
• Instead of storing every bin, only store the value of each bin with a value > 0,
since most bins will have no data (i.e. it is very unlikely for a read latency to be
between 1-10 microseconds, which is the first 10 bins)
• Write the count of offset/count pairs
• Use varint for the bin count
– To reduce the value of the varint as much as possible we sort the offset/count
pairs by the count and represent it as a delta sequence
70
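A sketch of the described encoding (my reconstruction, not the OpsCenter implementation): write the number of non-empty bins, then (offset, count-delta) pairs with the pairs sorted by count so the deltas, and therefore the varints, stay small; the varint here is a standard LEB128-style encoding. The following slides walk through the same steps by hand.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Sketch of the described scheme: only non-empty bins are written, as a count of
// pairs followed by (offset, delta-of-count) pairs sorted by count, all as varints.
public class HistogramCodec {
    static void writeVarint(ByteArrayOutputStream out, long v) {
        // unsigned LEB128: 7 bits per byte, high bit set while more bytes follow
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static byte[] encode(long[] bins) {
        List<long[]> pairs = new ArrayList<>();           // [offset, count]
        for (int i = 0; i < bins.length; i++) {
            if (bins[i] > 0) pairs.add(new long[] {i, bins[i]});
        }
        pairs.sort((a, b) -> Long.compare(a[1], b[1]));   // sort by count, ascending

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, pairs.size());                   // number of non-empty bins
        long previousCount = 0;
        for (long[] p : pairs) {
            writeVarint(out, p[0]);                       // bin offset
            writeVarint(out, p[1] - previousCount);       // delta vs previous count
            previousCount = p[1];
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] bins = new long[90];
        bins[4] = 1; bins[8] = 100; bins[11] = 9999999;
        bins[14] = 1; bins[15] = 127; bins[16] = 128; bins[17] = 129;
        System.out.println(encode(bins).length + " bytes");   // ~18 bytes for this example
    }
}
```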
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
71
1 byte 1 byte 1 byte 1 byte 1 byte 1 byte 1 byte 1 byte
1 byte 1 byte 1 byte 1 byte 1 byte 1 byte 1 byte 1 byte
1 byte 1 byte 1 byte 1 byte 1 byte 1 byte 1 byte 1 byte
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
72
7
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
{4:1, 8:100, 11:9999999, 14:1, 15:127, 16:128, 17:129}
73
7
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
{4:1, 14:1, 8:100, 15:127, 16:128, 17:129, 11:9999999}
74
7
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
{4:1, 14:1, 8:100, 15:127, 16:128, 17:129, 11:9999999}
75
7 4 1
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
{4:1, 14:1, 8:100, 15:127, 16:128, 17:129, 11:9999999}
76
7 4 1 14 0
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
{4:1, 14:1, 8:100, 15:127, 16:128, 17:129, 11:9999999}
77
7 4 1 14 0 8 99
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
{4:1, 14:1, 8:100, 15:127, 16:128, 17:129, 11:9999999}
78
7 4 1 14 0 8 99 15
27
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
{4:1, 14:1, 8:100, 15:127, 16:128, 17:129, 11:9999999}
79
7 4 1 14 0 8 99 15
27 16 1 17 1
Compressing Histograms
0 0 0 0 1 0 0 0 100 0 0 9999999 0 0 1 127 128 129 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
{4:1, 14:1, 8:100, 15:127, 16:128, 17:129, 11:9999999}
80
7 4 1 14 0 8 99 15
27 16 1 17 1 11
9999870
Compressing Histograms
Real Life** results of compression:
81
Size in bytes
Median 1
75th 3
95th 15
99th 45
Max** 124
Note on HdrHistogram
• Comes up every couple months
• Very awesome histogram, popular replacement for Metrics reservoir.
– More powerful and general purpose than EH
– Only slightly slower for all it offers
An issue comes up a bit with storage:
• Logged HdrHistograms are ~31kb each (30,000x more than our average use)
• Compressed version: 1kb each
• Perfect for many people when tracking one or two metrics. Gets painful when
tracking hundreds or thousands
82
Questions?