SpringOne Platform 2019
Session Title: Performance in Geode: How Fast Is It, How Is It Measured, and How Can It Be Improved?
Speakers: Helena Bales, Software Engineer, Pivotal
Youtube: https://youtu.be/awQ4byzC2LM
4. What do those numbers mean?
● 200,000 operations per second means nothing to a person.
○ Is that good?
○ Is the performance consistent and accurate?
○ Has it improved or regressed since the last version?
○ Can it be better?
6. What do those numbers mean?
● 200,000 operations per second means nothing to a person.
○ Is that good? Pretty good, yes.
○ Is the performance consistent and accurate? Not yet.
○ Has it improved since the last version? Yes, slightly.
○ Can it be better? YES.
How do you know???
8. Creating the Geode Benchmark - Features
● On demand
● Against any revision of Geode
● On AWS cluster deployment of Geode
● On any dev machine in the office
● From Concourse CI pipeline
● With a profiler attached
● Compare two runs of benchmarks for performance changes
9. Creating the Geode Benchmark - Goals
● Run by anyone interested in Geode
● Have others create benchmarks
● Visualize benchmark results over time
● Increase benchmark coverage of Geode
13. Finding Performance Bottlenecks
● Monitor locks
● Reentrant locks (thread park/unpark)
● Allocations/GC
● Overuse of synchronization
● Getting a system property in a hot path
● Lazy initialization of objects in a hot path
● Synchronization on a container (ex. hash map)
14. Case Study – The Connection Pool
● Why were we even looking for anything?
○ Couldn’t saturate network, CPU, or memory, no matter the available resources
○ Profiler gave us no suspect hot spots
● How did we find the issue?
○ Found the secret profiler option to measure zero-time reentrant locks
○ Thread.park() became a hot spot, with reentrant lock and the connection pool as callers
○ The connection pool was holding a reentrant lock in a hot path while using a deque
28. Other Bottlenecks – Over Eager Allocations
(Code screenshot annotation: new HashSet() allocates 2 potentially unused objects per call – 1 HashSet and 1 HashMap.)
29. Other Bottlenecks – Over Eager Allocations (fixed)
● Do not allocate eagerly
● Allocate near first use
● Allocate after early returns that don’t use the allocated object
30. Other Bottlenecks – Know Your Structures
(Code screenshot annotation: these methods are called for every operation, resulting in 1 add and 1 remove per op.)
31. Other Bottlenecks – Know Your Structures (fixed)
(Code screenshot annotation: the methods are still called for every operation, but they no longer allocate or create garbage.)
Hi, my name is Helena Bales, and my pronouns are they/them. I am a Senior Software Engineer at Pivotal, working on GemFire, and have been a Geode committer for about a year and a half.
Today I want to talk to you about Geode’s performance. Specifically, what the performance is, how it is measured, and how it can be improved.
So let’s start with the most basic of those three questions: what is the performance of Geode?
So this is the performance of Geode. On the vertical axis is the throughput in operations per second, and the horizontal axis has four different benchmark tests. So we can see that PartitionedGetBenchmark had an average throughput of 200,000 operations per second. But what does that mean?
AWS machine info: type - c5.9xlarge; vCPU - 36; memory - 72 GiB; network - 10 Gbps; EBS bandwidth - 7,000 Mbps
A number on its own doesn’t tell us much about the performance of Geode; it just raises more questions. Is 200,000 good? Is the measurement consistent and accurate? Has it changed since the last version? And perhaps most importantly, can it be improved?
Well, here are my answers to those questions from when we started these new benchmarks. We had pretty good performance, but we were seeing some variance between runs and some issues with stop-the-world garbage collections. We also saw some improvement over the previous version, but also a lot of room for more.
But that brings up one more question. How do we know any of this?
To answer that, let’s start by talking about what the benchmarks test now.
When the Performance team started replacing the previous bare metal performance testing of Geode, we had several goals for the project. These are the ones that we have completed so far.
The benchmarks can be run on demand against any revision of Geode (released or in development), on an AWS cluster or on any dev machine. They can also be run from Concourse CI pipelines.
We also enabled running with a profiler attached for use in debugging performance bottlenecks. And finally, we can compare any two runs of benchmarks for changes in performance.
And these are the goals that we are still working on.
We want benchmarks to be run by members of the Geode community against their changes to Geode, or against their deployments. So far we have not received feedback that anyone outside of our office has used this project. We also would like for members of the community to create their own benchmarks to add to the existing list.
The visualization of data is also something that is in progress, as that requires many iterations to get right. And finally, we would like to increase the test coverage that Benchmarks provide over Geode.
This is our current list of tests. We’re only going to focus on the highlighted four today, but you can see that we do have some good coverage over operations so far.
The four that we are going to focus on are ReplicatedGetBenchmark, PartitionedGetBenchmark, ReplicatedPutBenchmark, and PartitionedPutBenchmark. That’s because gets and puts on replicated and partitioned regions are some of the most commonly used and basic operations that Geode supports.
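Since we’ll keep coming back to gets and puts, here is a minimal sketch of what those operations look like from a Geode client. The locator host and port and the region name are placeholders for this example, not values from the benchmarks.

    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.client.ClientCache;
    import org.apache.geode.cache.client.ClientCacheFactory;
    import org.apache.geode.cache.client.ClientRegionShortcut;

    public class GetPutExample {
      public static void main(String[] args) {
        // Connect to the cluster through a locator (host and port are placeholders).
        ClientCache cache = new ClientCacheFactory()
            .addPoolLocator("localhost", 10334)
            .create();

        // A PROXY region keeps no local state, so every get and put is a round
        // trip to the servers, which is exactly what the benchmarks measure.
        Region<String, String> region = cache
            .<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
            .create("exampleRegion");

        region.put("key-1", "value-1"); // one put operation
        String value = region.get("key-1"); // one get operation
        System.out.println(value);

        cache.close();
      }
    }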
With those tests, we also provide many different configuration options for running the benchmarks. We support running the cluster with and without SSL enabled, with JDKs 8 through 13, with or without the security manager enabled, with a variety of garbage collectors, and with an adjustable max heap size. These options all change the Geode cluster, so running with them has increased our coverage.
So now that we all know a bit about the goals and features of this new benchmarking framework, let’s pivot and talk about how we can improve performance.
And the first step to fixing performance issues is finding them. Here are some of the things to look for with the profiler: monitor and reentrant locks, and extra allocations and garbage collections. Other things to look for are overuse of synchronization, getting a system property in a hot path, lazy initialization of objects in a hot path, and synchronization on a container such as a hash map. I’ll go over some examples of these in a bit.
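To make two of those concrete, here is an illustrative sketch, not Geode’s actual code, of reading a system property in a hot path and of synchronizing on a container; all the names here are made up for the example.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class HotPathAntiPatterns {
      // Anti-pattern: every call re-reads the system property. Property lookups
      // go through the global, synchronized Properties table, so a hot path
      // pays that cost on every single operation.
      long slowScale(long value) {
        int factor = Integer.getInteger("example.factor", 2);
        return value * factor;
      }

      // Better: read the property once, at class-initialization time.
      private static final int FACTOR = Integer.getInteger("example.factor", 2);

      long fastScale(long value) {
        return value * FACTOR;
      }

      // Anti-pattern: synchronizing on a container serializes all readers.
      private final Map<String, String> lockedMap = new HashMap<>();

      String slowLookup(String key) {
        synchronized (lockedMap) {
          return lockedMap.get(key);
        }
      }

      // Better: a ConcurrentHashMap needs no external lock for reads.
      private final Map<String, String> concurrentMap = new ConcurrentHashMap<>();

      String fastLookup(String key) {
        return concurrentMap.get(key);
      }
    }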
So now let’s focus on a specific example of a performance refactor, starting with the reason we thought anything was wrong in the first place. When the benchmark was run, none of the resources were saturated, and we couldn’t figure out where the bottleneck was, since the profiler gave us no hot spots.
Eventually we found the secret profiler option that shows zero-time reentrant locks, and Thread.park() became a hot spot, with the reentrant lock and the connection pool as its callers. So eventually we found that the connection pool was holding a reentrant lock in a hot path while using a deque.
(Slide note: highlights the 36 vCPU AWS instance.)
This graph shows the average performance of the get operation with different numbers of threads on the client, using version 1.9.0 of Geode. As you can see, the performance stops scaling pretty quickly after 32 threads. This ended up being due to the connection pool. It did not support enough concurrent operations for more than 32 threads, causing decreasing performance.
And the profiler shows where the issue is occurring in the code. Every operation that is executed on the server results in one call to borrowConnection and one call to returnConnection. Both of those methods get a reentrant lock. This lock is responsible for almost half the time spent in these two methods. This is the cause of that taper in performance as the thread count increases, and contention for the lock increases as operations both borrow and return connections concurrently.
Here is the issue in code. This is a pared-down version of the ConnectionManagerImpl, which implements the ConnectionManager. With the first arrow here I have highlighted that the available connections are being stored in a deque. Because a deque is not a thread-safe structure, the second arrow highlights the reentrant lock that was appearing in the profiler.
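As a rough sketch of the shape being described, greatly simplified from the real ConnectionManagerImpl and with a placeholder Connection type:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    class ConnectionManagerImpl {
      interface Connection {} // placeholder for the real connection type

      // First arrow: the available connections live in a plain deque...
      private final Deque<Connection> availableConnections = new ArrayDeque<>();

      // Second arrow: ...which is not thread-safe, so every borrow and return
      // takes this reentrant lock, the one that showed up in the profiler.
      private final ReentrantLock lock = new ReentrantLock();

      // Signaled when a connection is returned to the pool.
      private final Condition freeConnection = lock.newCondition();
    }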
I’m going to focus on the borrow operation when talking about this issue, but it is also an issue in the returnConnection method. There are also two signatures of borrowConnection, one of which looks for a connection to a specific server. The other just takes a timeout and gets a connection to any available server. This is the one that I’ll be focusing on from here on.
So this is the borrowConnection method. The red arrows on the left highlight that the lock is held for a significant portion of the method. And note that this is a collapsed view of the method to fit on one page. Holding the lock for this long makes it difficult for multiple threads to use the connection pool at the same time. Another issue with this code is that there is an await in here. The await causes the thread to be paused until the condition has been met and a signal is received. During this time, the lock is released, which means it must be reacquired before the thread can continue. This further delays the return of a connection to the caller by, in the worst case, the duration of the timeout plus the time it takes to reacquire the lock under contention.
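Continuing the sketch above, the method had roughly this shape; this is condensed and hypothetical, not the real body:

    // Inside the ConnectionManagerImpl sketch from before
    // (also needs java.util.concurrent.TimeUnit imported).
    Connection borrowConnection(long timeoutMs) throws InterruptedException {
      lock.lock(); // the lock is held for most of the method
      try {
        long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
          Connection connection = availableConnections.pollFirst();
          if (connection != null) {
            return connection;
          }
          if (remainingNanos <= 0) {
            return null; // timed out waiting for a connection
          }
          // await releases the lock while the thread is parked and must
          // reacquire it, possibly under contention, before continuing.
          remainingNanos = freeConnection.awaitNanos(remainingNanos);
        }
      } finally {
        lock.unlock();
      }
    }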
Let’s move on and discuss the solution to this issue. The first part is to replace the deque in the connection manager with something else. To introduce some modularity into this code, all of the behavior related to the available connections was moved into another class, called the AvailableConnectionManager. Its implementation is what lets us get rid of the lock in the connection manager.
This is the signature for the AvailableConnectionManager. As you can see, the deque has been replaced with a ConcurrentLinkedDeque. The linked nature of the deque does cost something, because nodes must be allocated and garbage collected, but the structure relies on compare-and-swap for a lock-free implementation, which makes the ConcurrentLinkedDeque the right choice here.
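In sketch form, simplified from the real class:

    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.function.Predicate;

    class AvailableConnectionManager {
      interface Connection {} // placeholder for the real connection type

      // Lock-free deque: adds and removes use compare-and-swap rather than a
      // lock, at the cost of allocating (and later collecting) a node per element.
      private final ConcurrentLinkedDeque<Connection> connections =
          new ConcurrentLinkedDeque<>();

      // Take the first available connection matching the predicate, or null.
      Connection useFirst(Predicate<Connection> predicate) {
        return null; // shown in the next sketch
      }

      // Return a connection to the front of the pool.
      void addFirst(Connection connection) {
        connections.addFirst(connection);
      }
    }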
With that change in mind, this is what borrowConnection() in the ConnectionManager looks like now. There are no locks in this method. Instead, we call useFirst on the available connection manager, with a predicate to get a connection to the server that we want.
And this is the implementation of useFirst. There is still no locking in this method, and removeFirstOccurrence is thread-safe, meaning that with a sufficiently large pool of connections, scaling should continue well past 32 threads on the client.
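One plausible shape for useFirst, consistent with that description but not the exact Geode source:

    Connection useFirst(Predicate<Connection> predicate) {
      for (Connection connection : connections) {
        if (predicate.test(connection)) {
          // removeFirstOccurrence is thread-safe; if another thread already
          // claimed this connection, it returns false and we keep scanning.
          if (connections.removeFirstOccurrence(connection)) {
            return connection;
          }
        }
      }
      return null; // no matching connection is currently available
    }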
To test whether this new solution has other hot spots, we can use a profiler. And this time, you can see that the operation still results in an execution on the server, which calls both borrowConnection and returnConnection, but each of those calls now accounts for only 0-1% of the operation’s time. This gives us good confidence that this implementation does not have a performance bottleneck.
The new implementations of ConnectionManagerImpl and AvailableConnectionManager have been thoroughly tested at every level. I’m sure most of you are familiar with the concepts of unit and integration testing. But this has also been tested in three other ways. Distributed tests check how the connection manager behaves in a real Geode cluster: a cluster is spun up in several VMs and operations are run, causing connections to be created and destroyed, borrowed and returned.
The next type of test is the Concurrency Test. For concurrency testing, an executor is given multiple threads to be run in parallel, applying pressure to the connection manager to test that certain timings do not result in concurrency issues.
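Here is a hedged sketch of that idea using plain java.util.concurrent rather than Geode’s actual test framework, reusing the AvailableConnectionManager sketch from earlier:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ConnectionPoolConcurrencyTest {
      public static void main(String[] args) throws Exception {
        AvailableConnectionManager manager = new AvailableConnectionManager();
        for (int i = 0; i < 8; i++) {
          manager.addFirst(new AvailableConnectionManager.Connection() {}); // seed the pool
        }

        ExecutorService pool = Executors.newFixedThreadPool(16);
        CountDownLatch start = new CountDownLatch(1);
        List<Future<?>> results = new ArrayList<>();
        for (int i = 0; i < 16; i++) {
          results.add(pool.submit(() -> {
            start.await(); // release all threads at once to maximize interleavings
            for (int op = 0; op < 100_000; op++) {
              AvailableConnectionManager.Connection c = manager.useFirst(x -> true);
              if (c != null) {
                manager.addFirst(c); // borrow, then immediately return
              }
            }
            return null;
          }));
        }

        start.countDown();
        for (Future<?> f : results) {
          f.get(); // propagates any exception thrown on a worker thread
        }
        pool.shutdown();
      }
    }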
And finally, the testing that we’ve been talking about this whole time, performance testing.
These are the results of the performance test, comparing the commit before the refactor with the refactored code. As you can see, this one commit results in a 239% increase in PartitionedGetBenchmark throughput. And in this run of the tests, the CPU of the client was saturated.
Here is how the performance scales with the number of threads on the client in version 1.10.0, which includes the connection pool refactor as well as several other smaller refactors. As you can see, scaling continues significantly beyond 32 threads.
So now let’s quickly look at a couple of other performance bottlenecks, starting with over-eager allocations. What I mean by that is allocating objects long before they are used, resulting in excess garbage production. So once again, ignore most of the code here and focus on the highlighted areas. Note that the declaration of the attemptedServers object (the first highlighted area) occurs well before the first use of that object, the second yellow highlighted area. And since there is an early return, highlighted in green, between the declaration and the first usage, there is a chance that the object could be allocated and garbage collected without ever having been used. And in this case, a HashSet is being allocated, which results in one HashSet and one HashMap (the HashSet’s backing structure), creating a significant amount of garbage.
The best way to avoid this issue is to allocate close to the first use of the object, and to make sure you allocate after any early returns that would let you avoid the allocation entirely.
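Here is an illustrative before-and-after of that pattern; the names are made up for the example.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class AllocationExample {
      // Before: attemptedServers is allocated eagerly, so the early return can
      // leave a HashSet (plus its backing HashMap) as instant garbage.
      String findServerEagerly(List<String> servers) {
        Set<String> attemptedServers = new HashSet<>(); // 2 objects allocated here
        if (servers.isEmpty()) {
          return null; // early return: both allocations were wasted
        }
        attemptedServers.add(servers.get(0)); // first actual use
        return servers.get(0);
      }

      // After: allocate just before first use, past the early return.
      String findServerLazily(List<String> servers) {
        if (servers.isEmpty()) {
          return null; // nothing was allocated on this path
        }
        Set<String> attemptedServers = new HashSet<>();
        attemptedServers.add(servers.get(0));
        return servers.get(0);
      }
    }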
Another common performance bottleneck is caused by choosing the wrong structure for the implementation. In this case, a linked list is used. This code is a hot path, and the borrowConnection and returnConnection are each called once per operation. This means that each operation results in one allocation of a node and one dereference of a node.
In this case, a deque is a better choice: since the connection pool is of relatively constant size, the deque will not need to be resized very often. This shows how important understanding your data structures is for performance.
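A small illustration of the difference, not the Geode code itself:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.LinkedList;

    public class StructureChoice {
      public static void main(String[] args) {
        // LinkedList wraps every element in a freshly allocated node, so one add
        // and one remove per operation means one allocation and one piece of
        // garbage per operation, forever.
        Deque<String> linked = new LinkedList<>();

        // ArrayDeque stores elements in a backing array; once the pool reaches a
        // steady-state size, adds and removes allocate nothing.
        Deque<String> array = new ArrayDeque<>(64);

        for (int op = 0; op < 1_000_000; op++) {
          linked.addLast("connection");
          linked.pollFirst(); // a node becomes garbage every iteration
          array.addLast("connection");
          array.pollFirst(); // no allocation in the steady state
        }
      }
    }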
So to wrap things up, let’s talk about how much performance has improved.
This is a graph of the throughputs of our four tests in version 1.10.0 compared to version 1.9.0. Each of these tests saw a significant improvement in performance due to the connection pool and other refactors.
This is a similar graph to the previous slide but for latency. It shows that latency was also reduced by a significant amount between versions 1.9.0 and 1.10.0.
So why should you upgrade to Geode 1.10.0?
Well, I think we’d all rather see a VSD output like the red line instead of the blue.
Finally, I’d like to point you to some useful resources. We’d love to have more people use these benchmarks. There are instructions for running, and adding new benchmarks, in the benchmarks repository. We also have a great list of performance bottlenecks that we’ve found in our investigations but have not been able to prioritize. If you’re interested in working on performance issues, check out this JIRA search.