G1, CMS, Shenandoah, or Zing? Heap size at 8GB or 31GB? Compressed pointers? Region size? What is the maximum pause time? Throughput or latency... what gain? MaxGCPauseMillis, G1HeapRegionSize, MaxTenuringThreshold, UnlockExperimentalVMOptions, ParallelGCThreads, InitiatingHeapOccupancyPercent, G1RSetUpdatingPauseTimePercent: which parameters have the most impact?
2. Disclaimer
• Add an implicit “for this use case” after each sentence
• Not a definitive JVM guide. Big heaps only. Don’t copy/paste the settings!
• Baseline for a standard use case. Might change for:
• Search workload, especially with a big fq cache
• Analytic workload (lots of concurrent pagination)
• Abnormal memory usage (prepared statement cache, etc.)
• Off-heap usage
• …
10. Concurrent Mark Sweep (CMS) collector
Young collection uses ParNewGC, which behaves like the Parallel GC.
The difference: it needs to communicate with CMS for the old generation.
11. Concurrent Mark Sweep (CMS) collector
The old generation is getting too big
Limit defined by -XX:CMSInitiatingOccupancyFraction
Start concurrent marking & cleanup
Only small STW phases (initial mark and remark)
Sweep only, no compaction, which leads to fragmentation
Triggers a serial, full STW GC if contiguous memory can’t be allocated
Memory is requested by “block”: -XX:OldPLABSize=16
Each thread requests a block to copy young objects to old
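For illustration only (a hedged sketch, not a recommendation): a typical CMS setup of that era combined flags such as
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
where UseCMSInitiatingOccupancyOnly prevents the JVM from ignoring the chosen threshold.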
12. CMS profile
Hard to tune. You need to choose the young size, and it won’t adapt to a new workload
Unintuitive options. Easier when the heap remains around 8GB
Fragmentation will eventually trigger a very long full GC
13. Garbage first (G1) collector
[Diagram: 31GB heap divided into empty regions of size -XX:G1HeapRegionSize]
15. Garbage first (G1) collector
The young space is full
It is dynamically sized to reach the pause objective -XX:MaxGCPauseMillis
Trigger a STW “parallel gc” on the young generation
Scan the objects in each region
Copy survivors to survivor space, in another young region
Free the collected regions
16. Garbage first (G1) collector
The young space is full
It is dynamically sized to reach the pause objective -XX:MaxGCPauseMillis
Trigger a STW “parallel gc” on the young generation
Scan the objects in each region
Copy survivors to survivor space, in another young region. The survivor size is dynamically adjusted
Copy old-enough survivors to an old region
Free the collected regions
17. Garbage first (G1) collector
The old space is getting too big
Limit defined by -XX:InitiatingHeapOccupancyPercent=40
Start a concurrent region scan
Not blocking (2 short STW pauses with SATB: start + end)
Regions that are 100% empty are reclaimed immediately
Fast, “free” operation
Trigger STW mixed GCs:
Dramatically reduce the young size
Include a few old regions in the next young GC
Repeat mixed GCs until -XX:G1HeapWastePercent is reached
[Diagram: region occupancies before/after the mixed GCs, e.g. 10%, 30%, 78%, 80%, 100%]
Because the young size is reduced to respect the target pause time, G1 triggers several consecutive GCs
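For reference (hedged, JDK 8 defaults): the knobs behind this cycle are roughly -XX:InitiatingHeapOccupancyPercent=45, -XX:G1HeapWastePercent=5 and -XX:G1MixedGCCountTarget=8 (at most 8 mixed collections after one marking cycle); the experiments later in this deck mostly vary the first one.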
20. Test protocol
128GB RAM, 2x8 cores (32 HT threads)
2x CPU E5-2620 v4 @ 2.10GHz. SSD disks in RAID0. JDK 1.8.0_161
DSE 5.1.7, memtable size fixed to 4GB. Data stored on a ramdisk (40GB)
Gatling queries on 3 tables
JVM: Zing. Default C* schema configuration. durable_writes=false (avoids the commit log to reduce disk activity)
Row sizes: 100 bytes, 1.5KB, 3KB
50% read, 50% write. Throughput 40-120k queries/sec
33% of the reads return a range of 10 values. The main tests were also executed with small rows only.
DataStax recommended OS settings applied
24. Which Heap size
Why could a bigger heap be better?
• The heap fills more slowly (fewer GCs)
• It increases the ratio of dead objects during collections
Moving (compacting) the remaining objects is the heaviest operation
• It increases the chances of reclaiming an entirely empty region
25. Which Heap size
Why could a bigger heap be worse?
• A full GC has a bigger impact
(Full GC is now parallel with Java 10)
• It increases the chances of triggering longer pauses
• Less memory remains for the disk cache
26. Heap size
Small heaps (<16GB) have bad latencies
After a given size, no obvious difference
[Chart: client latency (ms) by percentile for heap sizes of 8GB to 60GB, target pause = 300ms, 60kq/sec. 95th percentile ≈ 160ms; a “perfect latencies” reference curve is shown]
27. Heap size & GC Pause time
[Chart: total STW GC pause time (sec) by heap size (8-58GB), target pause 300ms, JVM allocation rate 800MB/s]
28. Heap size & GC Pause time
[Chart: total STW pause time (sec) and max pause time (ms) by heap size (8-58GB), target pause 300ms, allocation rate 800MB/s]
29. Heap size & GC Pause time
[Chart: total STW pause time (sec) and max pause time (ms) by heap size (8-58GB), target pause 300ms, allocation rate 1250MB/s]
30. Heap size & GC Pause time
[Chart: total and max STW pause times by heap size (8-58GB), target pause 300ms, comparing allocation rates of 800MB/s and 1250MB/s]
31. Target pause time -XX:MaxGCPauseMillis
[Chart: total STW GC pause time (sec) by G1 pause time target (50-600ms), heap 28GB]
32. Target pause time -XX:MaxGCPauseMillis
[Chart: total STW GC pause time (sec) and max pause time (ms) by G1 pause time target (50-600ms), heap 28GB, allocation rate 900MB/s]
34. Heap size Conclusion
G1 struggles with a heap that is too small. It also increases the full GC risk
GC pause time doesn’t decrease proportionally when the heap size increases
The sweet spot seems to be around 30x the allocation rate, with at least 12GB
Keep -XX:MaxGCPauseMillis >= 200ms (for a 30GB heap)
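As a worked example of that rule of thumb (hedged): at the 800MB/s allocation rate used in these tests, 30 x 0.8GB ≈ 24GB of heap; at 1250MB/s, 30 x 1.25GB ≈ 37GB, which the compressed-oops limit discussed next caps in practice at 31GB.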
36. 31GB or 32GB?
Up to 31GB: oops compressed on 32 bits with 8-byte alignment (3-bit shift)
Address  8: uncompressed 0000 1000, compressed 0000 0001
Address 32: uncompressed 0010 0000, compressed 0000 0100
Address 40: uncompressed 0010 1000, compressed 0000 0101
2^32 => 4G addressable slots; the 3-bit shift trick gives 2^3 = 8 times more reachable memory: 2^32 * 2^3 = 32GB
At 32GB: oops are stored on 64 bits
Heaps from 32GB to 64GB can instead be aligned on 16 bytes
G1 targets 2048 regions and changes the default region size at 32GB
31GB => -XX:G1HeapRegionSize=8m = 3968 regions
32GB => -XX:G1HeapRegionSize=16m = 2048 regions
The region count can have an impact on Remembered Set update/scan times:
[Update RS (ms): Min: 1232.7, Avg: 1236.0, Max: 1237.7, Diff: 5.0, Sum: 12576.2]
[Scan RS (ms): Min: 1328.5, Avg: 1329.5, Max: 1332.7, Diff: 4.2, Sum: 21271.7]
nodetool sjk mx -mc -b "com.sun.management:type=HotSpotDiagnostic" -op getVMOption -a UseCompressedOops
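If nodetool/sjk is not at hand, the same check can be done from plain Java through the HotSpot diagnostic MXBean (a minimal sketch, to be run inside the JVM you want to inspect; 64-bit JVM assumed):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class CheckCompressedOops {
    public static void main(String[] args) {
        // Ask the running JVM for the effective (possibly ergonomically chosen) flag values
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println(diag.getVMOption("UseCompressedOops"));
        System.out.println(diag.getVMOption("ObjectAlignmentInBytes"));
    }
}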
37. 31GB or 32GB?
No major difference
The concurrent marking cycle is slower with smaller regions (8MB => +20%)
No major difference in CPU usage
31GB + RegionSize=16MB:
Total GC pause time -10%
Mean latency -8%
All other GC metrics are very similar
Not sure? Stick with 31GB + RegionSize=16MB
[Chart: client latency (ms) by percentile (92.5pt to 99.88pt), comparing 31GB and 32GB heaps with region sizes of 8MB/16MB and 16-byte alignment]
38. Zero based compressed oops?
Zero-based compressed oops
When the heap’s virtual memory starts at zero:
  native oop = (compressed oop << 3)
When it is not zero-based:
  if (compressed oop is null)
    native oop = null
  else
    native oop = base + (compressed oop << 3)
The switch happens around 26-30GB
Can be checked with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode
No noticeable difference for this workload
40. -XX:ParallelGCThreads
Defines how many threads participate in the STW GC phases.
2x8 physical cores
32 threads with hyper-threading
Threads   Total STW time
8         90 sec
16        41 sec
32        32 sec
40        35 sec
[Chart: client latency (ms) by percentile for -XX:ParallelGCThreads = 8/16/32/40, heap 31GB, target pause 300ms]
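For illustration (hedged): on this 2x8-core box the sweet spot was the hyper-threaded count, i.e. -XX:ParallelGCThreads=32; the related -XX:ConcGCThreads flag (threads used for concurrent marking) defaults to roughly a quarter of ParallelGCThreads.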
41. Minimum young size
During mixed GCs, the young size is drastically reduced (1.5GB with a 31GB heap)
The young generation then fills very quickly, which can lead to multiple consecutive GCs
We can force it to a minimum size
-XX:G1NewSizePercent=8 seems to be a better default (than 5)
The interval between GCs during mixed GC increased 3x (min 3 sec)
No noticeable changes in throughput and latencies
(Increases mixed GC time, reduces young GC time)
[GC pause timelines: 31GB vs 31GB with -XX:NewSize=4GB; without it, a GC runs every second or even several times per second]
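Note (a hedged reminder based on JDK 8 behavior): -XX:G1NewSizePercent is an experimental flag, so it must be preceded by -XX:+UnlockExperimentalVMOptions, e.g.:
-XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=8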
42. Survivor threshold
Defines how many times data is copied within the young generation before promotion to old
Dynamically resized by G1:
Desired survivor size 27262976 bytes, new threshold 3 (max 15)
- age 1: 21624512 bytes, 21624512 total
- age 2: 4510912 bytes, 26135424 total
- age 3: 5541504 bytes, 31676928 total
Default is 15, but it tends to remain <= 4 under heavy load:
objects quickly fill the survivor space defined by -XX:SurvivorRatio
Most objects should be either long-lived or instantaneous. Is it worth disabling survivors?
-XX:MaxTenuringThreshold=0 (default 15)
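A hedged pointer for reproducing this analysis: the age histogram above is what -XX:+PrintTenuringDistribution prints at each young GC on JDK 8 (on JDK 9+ the unified-logging equivalent is -Xlog:gc+age=trace); enabling it is the easiest way to decide whether lowering MaxTenuringThreshold makes sense for your workload.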
44. Survivor threshold - JVM pauses
Max tenuring = 15 (default)
GC Avg: 235 ms
GC Max: 381 ms
STW GC Total: 53 sec
Max tenuring = 0
GC Avg: 157 ms
GC Max: 290 ms
STW GC Total: 28 sec
Generated by gceasy.io
45. Survivor threshold
Generated by gceasy.io
Check your GC log first!
In this case (almost no activity), the amount of surviving data doesn’t decrease much after age 4
Try with -XX:MaxTenuringThreshold=4
46. Delaying the marking cycle
By default, G1 starts a marking cycle when the heap is 45% used
-XX:InitiatingHeapOccupancyPercent (IHOP)
By delaying the marking cycle, we:
• Reduce the number of old-region GCs
• Increase the chance of reclaiming empty regions (?)
• Increase the risk of triggering a full GC
Java 9 now dynamically resizes IHOP!
JDK-8136677. Disable the adaptive behavior with -XX:-G1UseAdaptiveIHOP
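For illustration (hedged): a delayed cycle on JDK 8 is simply -XX:InitiatingHeapOccupancyPercent=60; on JDK 9+ the same fixed threshold additionally needs -XX:-G1UseAdaptiveIHOP, otherwise the value is only used as the starting point of the adaptive algorithm.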
48. Delaying the marking cycle
Generated by gceasy.io
IHOP=80, 31GB heap: max heap after GC = 25GB, 4 “old” compactions, +10% “young” GCs, GC total: 24 sec
IHOP=60, 31GB heap: max heap after GC = 20GB, 6 “old” compactions, GC total: 21 sec
49. Delaying the marking cycle
Conclusion:
• Reduces the number of old-region GCs but increases young GCs
• No major improvement
• Increases the risk of full GC
• Avoid going above 60% (or rely on the Java 9 dynamic sizing – not tested)
50. Remembered Set updating time
G1 keeps cross-region references in a structure called the Remembered Set.
It is updated in batches to improve performance and avoid concurrency issues
-XX:G1RSetUpdatingPauseTimePercent controls how much time may be spent on it during the evacuation phase
It is a percentage of the max pause time. Defaults to 10
G1 adjusts the refinement thread zones (G1ConcRefinementGreen/Yellow/RedZone) after each GC
GC logs: [Update RS (ms): Min: 1255.1, Avg: 1256.0, Max: 1257.8, Diff: 2.8, Sum: 28889.1]
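As a worked example (hedged): with -XX:MaxGCPauseMillis=300 and the default -XX:G1RSetUpdatingPauseTimePercent=10, G1 budgets roughly 30ms of each pause for Update RS; lowering the percentage pushes more of that refinement work onto the concurrent threads between pauses.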
51. Remembered Set updating time
[Chart: client latency (ms) by percentile for -XX:G1RSetUpdatingPauseTimePercent = 0/5/10/15, heap 31GB, RegionSize=16MB, target pause 300ms]
52. Other settings
-XX:+ParallelRefProcEnabled
No noticeable difference for this workload
-XX:+UseStringDeduplication
No noticeable difference for this workload
-XX:G1MixedGCLiveThresholdPercent=45/55/65
No noticeable difference for this workload
55. Write barrier
[Diagram: object graph A -> B -> C -> D; the reference from C to D is removed while D is not yet marked]
HotSpot uses a write barrier to capture the “pointer deletion”
It prevents D from being cleaned during the next GC
What about memory compaction?
HotSpot stops all application threads, which solves the concurrency issues
C.myFieldD = null
-----------
markObjectAsPotentialAlive(D)
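A minimal Java sketch of that idea (names like markingActive, satbQueue and writeNext are illustrative, not HotSpot internals):

import java.util.ArrayDeque;
import java.util.Deque;

// Simplified SATB (snapshot-at-the-beginning) pre-write barrier: before a reference
// is overwritten, its old target is recorded so the concurrent marking cycle still
// treats it as potentially alive.
class Node {
    Node next;   // the reference field being overwritten
}

class SatbBarrier {
    static volatile boolean markingActive = false;            // true while a marking cycle runs
    static final Deque<Node> satbQueue = new ArrayDeque<>();  // objects to revisit during marking

    // Equivalent of "C.myFieldD = null", with the barrier applied before the store
    static void writeNext(Node holder, Node newValue) {
        Node previous = holder.next;
        if (markingActive && previous != null) {
            satbQueue.add(previous);   // markObjectAsPotentialAlive(previous)
        }
        holder.next = newValue;        // the actual store
    }
}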
56. Read barrier
[Diagram: object graph A -> B -> C -> D; D is accessed through a read barrier]
Read D
-----------
if (D hasn’t been marked in this cycle) {
  markObject(D)
}
return D

What about memory compaction?
Read D
-----------
if (D is in a memory page being compacted) {
  followNewAddressOrMoveObjectToNewAddressNow(D)
  updateReferenceToNewAddress(D)
}
return D
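A minimal Java sketch of a read barrier built on a forwarding pointer, in the spirit of Shenandoah’s Brooks pointers (illustrative names, not the real implementation):

// Every object carries a forwarding pointer that initially points to itself.
// When the concurrent compactor copies the object, the pointer is updated,
// so every read transparently lands on the up-to-date copy.
class ForwardedObject {
    ForwardedObject forwardee = this;
    int payload;
}

class ReadBarrier {
    static ForwardedObject read(ForwardedObject ref) {
        return (ref == null) ? null : ref.forwardee;   // follow the new address if the object moved
    }
}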
57. Read barrier
Predictable, constant pause times, no matter how big the heap is
Comes with a performance cost:
Higher CPU usage
Ex: a Gatling report takes 20% more time to complete using a low-latency JVM vs HotSpot+G1
(computation on 1 CPU running for 10 minutes)
60. Low latencies JVM (shenandoah)
[Charts: Shenandoah ramp-up vs G1 ramp-up. JVM test, not a Cassandra benchmark!]
61. Low latencies conclusion
Capable of dealing with big heaps
Can handle a higher throughput than G1 before hitting the first error
G1 pauses create bursts with potential timeouts
Zing offers good performance, including with smaller heaps
Shenandoah was stable in our tests. It offers a very good alternative to G1 to avoid pause time
Try carefully, still young (?)
63. Small heap
A typical web server or Kafka broker uses a few GB at most.
The same guidelines apply
Uncompressed pointers: stay below 4GB (32 bits, 2^32)
The same logic applies to choosing the max pause time (latency vs throughput)
With 3GB, you can aim for a pause time around 30-50ms
Usually fewer GC issues (it won’t randomly freeze for 3 seconds)
Not thoroughly tested!!
65. Conclusion
(Super) easy to get things wrong. Change 1 parameter at a time & test for > 1 day
It is easy to measure the wrong thing or test too briefly, introducing external (non-JVM) noise…
A lot of work to save a few percentiles
These tests have been running for more than 300 hours
Don’t over-tune. Keep things simple to avoid side effects
Keep the target pause > 200ms
Keep the heap size between 16GB and 31GB
Start with heap size = 30 x (allocation rate)
G1 doesn’t play well with very sudden allocation spikes
You can try reducing -XX:G1MaxNewSizePercent=60 to mitigate this issue
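Purely as a hedged starting point assembled from the conclusions above (not a copy/paste recipe, see the disclaimer):
-Xms31g -Xmx31g -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=300 -XX:ParallelGCThreads=<number of HT cores>
then change one parameter at a time and validate against your own GC logs.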
66. Conclusion
GC pauses still an issue? Upgrade to DSE 6 or add extra servers
Chasing the 99.9x percentile? Go with low-latency JVMs
Don’t trust what you’ve read! Check your GC logs!