Yet Another Reason your Docker Containers May Be Slow on EC2: clock_gettime, gettimeofday and seccomp
Thanks to Gabriel Hollies for his help in writing this post
TL;DR: on AWS EC2 M4 instances, calls to System.nanoTime() and System.currentTimeMillis() make system calls, and these system calls are subject to even more overhead when running in Docker due to Docker’s default seccomp profile filters. The new M5 instances no longer make system calls for these time methods, so you can upgrade to newer instances if available in your region and availability zone(s). If you are stuck on 4-series instances, you could switch the clocksource to tsc as Amazon recommends in this tuning FAQ.
Appian uses Docker to run hundreds of thousands of automated tests every day. As we containerized our test applications, we started noticing that some parts of the software ran more slowly in Docker than others. It seemed that extra time was being spent somewhere in native code, so we decided to explore profiling tools that could display system calls and native code in the context of our Java call stacks. One such tool is FlameGraphs.
We found some useful blog posts about generating FlameGraphs for containerized Java: Flame graphs and perf-top for JVMs inside Docker containers and Making FlameGraphs with Containerized Java
These were useful starting points, but we wanted extra information about JVM and native calls that were not visible in the setup provided by those posts. For example, you can see, in the final image by Alice Goldfuss, that Java application methods are visible, but that JVM calls are displayed as just [libjvm.so]. We noticed changes in performance depending on our choice of base image (e.g. alpine, debian, centos), so we suspected that having JVM symbols may give some insight as to why. Perhaps the JVM was spending more time in certain routines when linking against different versions of libraries provided by these distributions. We wanted to document our full procedure for generating FlameGraphs in case others were running into the same problems.
We will walk through the issues we encountered generating FlameGraphs using the following demo app here: Docker Clocksource CPU Test.
The sample app has a deep call stack that gets the current time and passes it back through the functions like:
We will demonstrate issues with deep Java call stacks and resolving symbols.
If you’d just like to see the steps required to generate good Java FlameGraphs, you can skip to the “How to Generate a FlameGraph” section below.
Challenge 1: generating FlameGraphs for Java.
Our approach to generating FlameGraphs was:
- Run the linux perf record tool on the host to collect profiling data.
- Dump our containerized app’s Java symbols inside the container.
- Use perf script and Brendan Gregg’s FlameGraph scripts to visualize the data as a FlameGraph.
Problem 1: resolving Java code symbols.
Java breaks perf’s method of walking frame pointer-based stacks (see this Netflix post).
Resolution: add JVM option -XX:+PreserveFramePointer
Problem 2: Java’s JIT compilation also breaks FlameGraphs.
- FlameGraphs depend on stable call stacks during the profiling period.
- If Java JIT-compiles a method, the call stack now changes and the flame splits at that point, making it hard to account for the true time taken by a given method and its callees.
Problem 3: broken FlameGraphs due to deep stacks.
- Perf cuts off stacks at 127 frames by default (keeping only the top of the stack)
- A Java web application with filters, Spring, etc. can have deep call stacks
Resolution: This limit can now be increased as of kernel 4.8 by option kernel.perf_event_max_stack, for example:
sysctl -w kernel.perf_event_max_stack=3000
Note that perf script does not respect kernel.perf_event_max_stack, so it requires the explicit--max-stack parameter or it will also truncate stacks:
perf script --max-stack=3000 ...
Example of a broken FlameGraph due to stack truncation. There should only be one root to all the Application stacks, but they are split:
Challenge 2: generating FlameGraphs for containerized Java
Problem 1: resolving Java code symbols from within the container
On older kernels (pre 4.14?), perf script cannot see the /tmp pid symbol files generated by jmaps if they are inside the Docker container.
Resolution: You need to upgrade to a recent kernel, we used 4.15.
You can generate symbols for java apps inside the container with:
docker exec --user=root <container name> bash -c 'export USER=root; /opt/FlameGraph/jmaps'
This script requires java, a JAVA_HOME environment variable, the FlameGraphs repo at /opt/FlameGraph, and the perf-map-agent at /usr/lib/jvm/perf-map-agent. See this Dockerfile for an example.
Problem 2: resolving JVM native symbols like [libjvm.so]
If perf cannot find the JVM debug symbols, it will just display binary names like [libjvm.so].
Resolution: install the debug symbols inside the container and use a new version of the kernel that has fixes for perf (4.14+).
- Example JVM debug package for Ubuntu 18.04: openjdk-8-dbg.
How to Generate a FlameGraph
To demonstrate the creation of FlameGraphs for a containerized Java app, we will use a sample app that has a deep call stack that simply calls System.currentTimeMillis(). This will show that we’re able to generate FlameGraphs for a containerized Java app with a deep stack and resolve all symbols, and it will also demonstrate an issue with timing calls on EC2 that was causing our app to run more slowly in Docker.
Here is the full code for the sample app:
To run the demo, we’ve provisioned an m4.large EC2 instance with Ubuntu 18.04 on the 4.15 linux kernel and installed Docker CE 18.09.04.
Step 1: install perf tools. These will be used to profile our sample app.
sudo apt update
sudo apt install -y linux-tools-common linux-tools-`uname -r`
Step 2: make sure perf record does not cut off our deep stack traces.
sudo sysctl -w kernel.perf_event_max_stack=3000
Step 3: get scripts for generating FlameGraphs from perf output.
sudo git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph/
Step 1: assume superuser permissions (will be needed later to run perf).
Step 2: run the sample app in Docker. We’ve published an image that can be run directly.
docker run --rm -d --name clocksource-cpu-test appian/docker-clocksource-cpu-test
Step 3: record 10 seconds of on-CPU code at 49Hz and save as perf.data
perf record -F 49 -o perf.data -a -g -e cycles:p -- sleep 10
Step 4: generate Java symbols inside the container for our sample app. Newer versions of perf can pick these up automatically.
docker exec --user=root clocksource-cpu-test bash -c 'export USER=root; /opt/FlameGraph/jmaps'
Step 5: convert the stored perf.data into a FlameGraph.
perf script --max-stack=3000 | \
/opt/FlameGraph/stackcollapse-perf.pl --pid | \
/opt/FlameGraph/flamegraph.pl --color=java --hash > flamegraph.svg
Note: perf script does not obey the kernel.perf_event_max_stack setting, so you have to also pass the --max-stack parameter if you have deep call stacks.
Step 6: stop the container
docker kill clocksource-cpu-test
Our resulting FlameGraph (save as flamegraph.svg): https://raw.githubusercontent.com/appian/docker-clocksource-cpu-test/master/flamegraph_sample.svg
Let’s zoom in to the very tip of the flames where the real work is being done:
In this FlameGraph, we see that javaTimeMillis is calling the vDSO function gettimeofday, and that is making an actual gettimeofday system call. (System.nanoTime makes a similar call to clock_gettime) On the right, we see a large section starting with syscall_trace_enter. We’ll come back to that shortly.
Compare this to the same Java code running natively on the host:
Here, we still see the same gettimeofday system call, but there is no __secure_computing section.
And here is the same code in Docker on an M5 instance:
Here on the M5 instance, no system call is made. What is going on?
It turns out that the reason the system call is made is due to a difference in the clocksource between the 4- and 5-series instances. M4s use the xen clocksource:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
Whereas M5s use the kvm-clock clocksource:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
Others have documented before that the xen clocksource incurs a system call. Normally Java would attempt to read the clock in userspace via vDSO, but the implementation for the xen clocksource in vDSO falls back to a system call. We found out by experimentation that the kvm-clock clocksource does not require a system call.
Calling System.currentTimeMillis() was already slower on the M4 series because it required a system call, but our benchmark here shows that this call in Docker can take almost 6 times as long as a non-containerized call due to the overhead of seccomp filters running on the system call. Docker adds a seccomp profile by default. If you’re unfamiliar with FlameGraphs, you can see the magnitude of the overhead by comparing the relative horizontal size of method at each level of the graph. The horizontal size is proportional to time spent on CPU.
We found that the throughput on our tests was much better once we switched to M5 instances. If you have experienced a similar performance problem with these calls after moving to Docker and are stuck on 4-series instances in your region or availability zones, you could switch the clocksource to tsc as Amazon recommends in this tuning FAQ, as this clocksource also does not require a system call.