
The Garbage Collection Mysteries (Volume 1) — An Awkward Pause

Matt Fluet

Background

Around the end of 2019, an Appian alert fired after an application server health check request at a customer site failed to get a response for almost 7 minutes.

During the investigation, support engineering saw a corresponding 7-minute gap in the application server site metrics, indicating that the metrics scraping requests were also timing out without getting responses.

The timeframe also coincided with a gap in the Tomcat access log, which confirmed that no user request traffic was being processed at all during that window.

The first suspect for this kind of JVM freeze is a pause for garbage collection (GC), but a 7-minute pause would be extraordinary for a heap of this size. An inspection of the GC logs showed some activity occurring during the incident, but the heap usage looked relatively stable, and no collection cycle was taking more than a fraction of a second:

2019-10-22 20:00:45: Full GC detected. Heap used: 31.93% Seconds since last GC: 302.2 Time spent doing GC: 0.383 GC Overhead for this period: 0.126737%
2019-10-22 20:06:48: Full GC detected. Heap used: 30.57% Seconds since last GC: 362.9 Time spent doing GC: 0.422 GC Overhead for this period: 0.116285%
2019-10-22 20:17:53: Full GC detected. Heap used: 26.96% Seconds since last GC: 302.3 Time spent doing GC: 0.303 GC Overhead for this period: 0.100232%
2019-10-22 20:22:55: Full GC detected. Heap used: 27.74% Seconds since last GC: 302.3 Time spent doing GC: 0.341 GC Overhead for this period: 0.112802%
2019-10-22 20:28:58: Full GC detected. Heap used: 28.33% Seconds since last GC: 362.8 Time spent doing GC: 0.32 GC Overhead for this period: 0.0882029%
2019-10-22 20:49:09: Full GC detected. Heap used: 30.42% Seconds since last GC: 303.5 Time spent doing GC: 0.406 GC Overhead for this period: 0.133773%

Since these GC logs didn’t point to an obvious issue, we took a broader look at the server’s health. CPU and disk metrics were reporting as expected, with no spike in activity or anything else unusual around the incident. System logs didn’t report anything of note either. With no other active leads to pursue, we decided to focus on updating the JVM settings to obtain more detailed information in the GC logs when another incident like this occurred.

JVM garbage collection pausing.

To perform a full GC, the JVM must first signal all threads to pause activity at a safepoint so the heap can be safely manipulated. Once all threads have reached a safepoint and are paused (commonly referred to as a “Stop The World” or STW pause), the system will proceed with the GC work of reclaiming heap memory. During this time, the JVM will be essentially frozen.

The JVM settings we had in place for GC logging were telling us the time spent performing the actual GC work, but did not tell us how long the application was actually paused. We already had -XX:+PrintGCDetails enabled, so we added the following JVM options to print both how long the application threads were stopped and how long it took to stop them:

-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime

Stopping the world.

At this point, sporadic alerts were beginning to appear from other customer sites with the same pattern of behavior. Soon after we deployed the JVM settings change, we tripped another alert on a customer’s application server. An interesting detail emerged in the GC logs:

2020-02-06T18:12:26.202+0000: 826127.401: Total time for which application threads were stopped: 248.8020090 seconds, Stopping threads took: 248.7995990 seconds

During a GC that took a little over 4 minutes, almost all of that time was spent stopping the application threads! The actual GC work only took a handful of milliseconds after the application threads finally stopped. This brought us to two lines of questioning:

1. What does the JVM actually have to do to stop all of the application threads?
2. What could make that take over four minutes?

Focusing on the first question, we knew that in order for the threads to stop, they all need to reach a safepoint. But what are the specific rules there? We found a few good resources that discuss safepoints, which led us to a deeper line of questioning around the specific thread activity that might be involved.

What are safepoints?

A safepoint is a range of execution where a thread’s state is known in terms of its interaction with the heap, and the JVM can account for all of the references on its stack. Examples of safepoints in Java include blocked states (waiting on synchronized resources, waiting on a lock, parked, or blocking on IO) and executing native code (JNI). These are points at which the thread will not be interacting with the heap. If all threads are at safepoints, the JVM can manipulate the heap and proceed to clean up safely.

A thread does not get physically interrupted when the JVM wants it to pause at a safepoint. Rather, the JVM sets a flag that all threads poll at regular intervals. Once a thread sees the flag, it blocks until it is given the all clear to proceed.

Visit this blog for a good in-depth explanation of the low-level mechanics of safepoints.

In the case of our STW GC pause, total time “Stopping threads” is the elapsed time it takes for every thread in the application server to successfully check for the flag and pause itself.

Since we now knew most of the time was spent waiting for the appserver threads to stop themselves at a safepoint, the key questions became:

1. Which thread (or threads) is failing to reach a safepoint quickly?
2. What is that thread doing that keeps it from stopping?

To answer question 1, we turned to additional JVM settings that print information identifying specific threads that have still not reached a safepoint after a timeout interval has passed. This “timeout” does not halt the safepoint process; it just triggers a printout to the appserver log with the details we want to see:

-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=10000

Shortly after that change was deployed, another alert popped up and we were able to see these types of traces in the log, all pointing to a single thread still executing at the safepoint timeout of 10 seconds.

# SafepointSynchronize::begin: Timeout detected:
# SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
# SafepointSynchronize::begin: Threads which did not reach the safepoint:
# "exec-1" #9136737 daemon prio=5 os_prio=0 tid=0x00007fe028306800 nid=0x96c2 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

# SafepointSynchronize::begin: (End of list)

A quick top command in thread view (top -H) confirmed that just the one thread was churning away at something, and that the VM thread was the only other active one, waiting for it to stop so it could kick off the GC activity:

  PID PR NI    VIRT    RES  SHR S %CPU %MEM   TIME+ COMMAND
45584 20  0 36.122g 0.013t 7028 R 99.9  3.5 0:10.42 exec-1
   87 20  0 36.122g 0.013t 7028 R 94.7  3.5 1:34.27 VM Thread

It’s interesting that in this particular case a single thread can keep everything blocked for minutes until it reaches a safepoint. Now we needed to answer the second question: Why did it take so long? What was it doing?

Thread dumps and safepoints.

In an attempt to pinpoint the thread activity, Appian’s performance engineering team wrote a script, to be executed when an alert fired, that would take thread dumps with jstack at a regular interval. Normally this would be enough to give us the execution stack trace we needed to identify the problem area. However, thread dumps also require a safepoint to be reached. The dumps we were capturing were not actually occurring until after the runaway thread had reached a safepoint, so we were not going to be able to find out what it was up to with the out-of-the-box settings.
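
The capture script itself isn’t shown here, but the idea is simple. As a rough sketch (written in Java for consistency with the rest of this post, though a shell loop works just as well; the file naming and interval are illustrative):

import java.io.File;
import java.util.concurrent.TimeUnit;

// Rough sketch only: capture a jstack dump of the target PID every few seconds.
// Note that these dumps still wait for a safepoint, which is exactly the
// limitation described above.
public class PeriodicThreadDump {
    public static void main(String[] args) throws Exception {
        String pid = args[0];                      // application server PID
        for (int i = 0; i < 60; i++) {             // roughly 5 minutes of coverage
            File out = new File("threaddump-" + System.currentTimeMillis() + ".txt");
            new ProcessBuilder("jstack", pid)
                    .redirectErrorStream(true)
                    .redirectOutput(out)
                    .start()
                    .waitFor();
            TimeUnit.SECONDS.sleep(5);
        }
    }
}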

We can force a thread dump to occur without waiting for the threads to reach a safepoint by using the -F option of jstack like so:

jstack -F <pid>

However, even though this allows for the thread dump to occur upon invocation, it can’t print out the stack information for threads that are not at a safepoint. Instead, we get an error message that obscures the information we want to see:

Thread 25865: (state = IN_JAVA)
Error occurred during stack walking: java.lang.NullPointerException
at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:88)
at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:45)
at sun.jvm.hotspot.tools.JStack.run(JStack.java:66)
at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:92)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jstack.JStack.runJStackTool(JStack.java:140)
at sun.tools.jstack.JStack.main(JStack.java:106)

Fortunately, there are diagnostic JVM options available that allow us to get a stack trace for threads even if they’re not at a safepoint:

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

We worked with one of the more frequently impacted customers and temporarily enabled these settings on their site, hoping we could produce traces that would converge on a specific area of the system.

Voila! Soon after, we got an alert for that customer, and the forced thread dumps produced a full stack trace pointing to a specific area of the codebase.

Counted loops and safepoints.

The stack trace led us to an equals method on a set of keys (strings) for a map-like structure that was being serialized. We wrote a test that simply created these structures with a large number of keys, and we were able to consistently trigger the safepoint delay by having multiple threads each serialize one of them in parallel. Hooking up a Java profiler to a test run showed that 99% of the considerable overall processing time was spent in that equality method.

The method itself did most of its work inside of an indexed for loop:

for (int i = 0; i < length; i++) {
    // some expensive O(n²) comparison logic for large datasets
}
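
The actual Appian code isn’t reproduced here, but a hypothetical sketch of the general shape looks something like this (the class and method names are invented for illustration): each iteration of the indexed loop did a linear scan over the other structure’s keys, making the comparison O(n²) overall.

// Hypothetical illustration only -- not the actual code.
class KeySetComparison {

    // Invoked from equals() while the map-like structure is being serialized.
    static boolean keysEqual(String[] ourKeys, String[] theirKeys) {
        if (ourKeys.length != theirKeys.length) {
            return false;
        }
        // int-indexed, unit-stride loop: a counted loop (explained below)
        for (int i = 0; i < ourKeys.length; i++) {
            // linear scan per key: O(n) work per iteration, O(n²) overall
            if (!contains(theirKeys, ourKeys[i])) {
                return false;
            }
        }
        return true;
    }

    static boolean contains(String[] keys, String key) {
        for (int j = 0; j < keys.length; j++) {
            if (keys[j].equals(key)) {
                return true;
            }
        }
        return false;
    }
}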

This code had two problems:

1. It is a counted loop, so (as explained below) the compiled code does not check for the safepoint flag between iterations.
2. The comparison logic inside it was O(n²), so for large datasets the loop ran for a very long time.

In Java, a counted loop is a loop that is indexed by an int and incremented by one on each iteration. The following examples are considered counted loops:

for (int i = 0; i < reps; i++) {
    // do something
}

int i = 0;
while (i < reps) {
    // do something
    i++;
}

Code executing inside a counted loop does not check for the safepoint flag on each iteration; instead, the check happens only once the loop completes. This is a performance optimization in the OpenJDK 8 JIT compiler.
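
For contrast, a loop indexed by a long does not qualify as a counted loop, so the compiled code keeps a safepoint poll on the loop back edge. A quick sketch:

// Counted loop: int index, unit stride. In JDK 8, the compiled code may
// contain no safepoint poll until the loop finishes.
for (int i = 0; i < reps; i++) {
    // do something
}

// Not a counted loop: long index. A safepoint poll remains on the loop back
// edge, so the thread can be stopped between iterations.
for (long i = 0; i < reps; i++) {
    // do something
}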

We suspected that the combination of a large set of keys being processed on multiple threads increased the odds that one thread would be caught inside the counted loop whenever a safepoint was requested. This would explain why we were seeing this issue only on certain types of customer sites and at irregular intervals.
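
The test we wrote against the real data structures isn’t shown here, but the underlying mechanism can be demonstrated in isolation with a small, self-contained sketch (the array size, iteration count, and sleep are arbitrary). One thread spins in a cheap counted loop while the main thread requests a GC; on JDK 8, with the logging options described earlier, the wait shows up as a long “Stopping threads took” entry.

// Minimal demonstration sketch, separate from the real reproduction test.
// Run on JDK 8 with:
//   -XX:+PrintGCApplicationStoppedTime -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=1000
public class SafepointStallDemo {
    static volatile long sink; // keeps the JIT from discarding the loop's result

    public static void main(String[] args) throws Exception {
        final int[] data = new int[1024]; // power of two, so the index mask below is cheap

        // int-indexed loop with a fully inlined body: a counted loop, so the
        // compiled code may not poll for a safepoint until the loop finishes.
        Thread worker = new Thread(() -> {
            long sum = 0;
            for (int i = 0; i < Integer.MAX_VALUE; i++) {
                sum += data[i & (data.length - 1)];
            }
            sink = sum;
        }, "worker");
        worker.start();

        Thread.sleep(100); // let the loop start and get JIT-compiled

        long start = System.nanoTime();
        System.gc(); // needs a safepoint, so it must wait for the worker's loop to finish
        System.out.printf("System.gc() returned after %.1f ms%n",
                (System.nanoTime() - start) / 1_000_000.0);
        worker.join();
    }
}

In the real incident, the loop body was the expensive key comparison rather than cheap arithmetic, which is why the stall lasted minutes.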

Remediation

The safepoint check inside of counted loops can be reactivated with yet another JVM option:

-XX:+UseCountedLoopSafepoints

We reran the reproduction test with this option enabled to check the theory that the counted loop was indeed preventing the safepoint from being reached. It did clear up the problem: the extended application pauses were gone.

This JVM setting is not something we wanted to enable in production, since it carries an unquantified performance penalty (enough that OpenJDK leaves the behavior off by default). However, it is useful as a diagnostic experiment.

Understanding that we had a specific problem in a frequently hit method of the codebase, we considered two basic options to correct things:

1. Change the loop index variable from an int to a long so the loop no longer qualifies as a counted loop.
2. Keep the counted loop, but make the work inside each iteration cheap enough that the loop completes quickly.

Changing the loop index variable to a long would prevent it from qualifying as a counted loop and give us the opportunity to reach a safepoint between iterations, but it would not be the best performing option since it would slow down the loop with those safepoint checks.

Instead, the counted loop remained in place and the logic inside the loop was updated. With a small tweak, we were able to reduce the O(n²) comparison logic down to O(n). This made the loop much faster, even for larger datasets.
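
The exact change isn’t shown here, but the shape of the fix looks roughly like the following (again with invented names): build a hash-based lookup of the other key set once, so each iteration of the still-counted loop becomes O(1).

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the rework -- illustrative, not the actual change.
class KeySetComparisonFixed {

    static boolean keysEqual(String[] ourKeys, String[] theirKeys) {
        if (ourKeys.length != theirKeys.length) {
            return false;
        }
        // Build the lookup structure once: O(n)
        Set<String> theirKeySet = new HashSet<>(Arrays.asList(theirKeys));

        // Still a counted loop, but each iteration is now O(1), so the loop
        // (and therefore the time to the next safepoint) is dramatically shorter.
        for (int i = 0; i < ourKeys.length; i++) {
            if (!theirKeySet.contains(ourKeys[i])) {
                return false;
            }
        }
        return true;
    }
}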

We prepared a small hotfix with this change to deploy to customer sites. After responding to all of the active customer support cases that had been opened around this issue to let them know about the hotfix, we arranged for upgrades and monitored the sites for a few weeks afterward to make sure there were no alerts raised or other irregularities. We did not encounter any further incidents and were able to close out all of the support cases.

Lessons learned.

During the incident postmortem a short time later, we discussed what we had learned, not just from the final solution, but from the analysis process and the detailed research along the way.

We continue to actively monitor detailed GC activity on customer sites. Unusual activity around stoppages or collection times can point to issues within the codebase or with a specific customer’s usage patterns and should be examined. Even seemingly small problems can negatively impact performance in big and unexpected ways!

Written by

Matt Fluet

Matt is a Principal Software Engineer at Appian