Chaos Testing a Distributed System with Jepsen
Many thanks to Josh Kessler, the Data Layer team, Antonio Andrade, Jonathon Blonchek, and Debby Chang for their help in making this blog possible.
This is the first post in a series exploring how to run chaos tests against a distributed system using Jepsen. This first post gives a brief introduction to what we are testing, why Jepsen was the ideal tool, how to run Jepsen in Docker, and how to interpret Jepsen test results. In future posts, we will go into greater detail about what it takes to set up your distributed system to run inside Jepsen containers, the kinds of issues we were able to find in our distributed system, and how we resolved those issues.
This post assumes you have some familiarity with distributed systems. At a minimum, you should have:
- knowledge of basic distributed systems concepts such as leader election and the CAP theorem.
- working knowledge of Docker.
- gone through the Jepsen tutorial, in which you write a chaos test against the popular etcd key-value store.
Chaos testing for consistency.
According to the CAP theorem, a distributed system can achieve only two of three key properties: Consistency, Availability, and Partition tolerance. Most distributed systems will not give up Partition tolerance, and instead prioritize between Consistency and Availability when trade-offs are inevitable. Appian’s data store favors Consistency, and this blog series explores how Appian validates its consistency guarantees using a chaos testing tool called Jepsen.
Chaos testing is the discipline of disrupting a system in unplanned ways, usually in production, and observing how it fails. By learning from these observations, you can make your system more robust. Chaos tests are an important part of a comprehensive test plan because they can uncover race conditions that are otherwise hard to detect in development. Jepsen can inflict numerous chaos events on a distributed system, such as introducing network issues, killing components, and generating random load. This gives us confidence that Jepsen can surface consistency issues in a multi-node environment. If you’re interested in learning more about chaos testing, read on to learn how we used Jepsen to test our distributed system.
What are we testing?
We are testing for linearizability. The Jepsen documentation describes it this way:
“In plain English, under linearizability, writes should appear to be instantaneous. Imprecisely, once a write completes, all later reads (where “later” is defined by wall-clock start time) should return the value of that write or the value of a later write. Once a read returns a particular value, all later reads should return that value or the value of a later write.”
In a linearizable data store, these two rules should hold true:
- A read that starts before a write ends can read either a previous write or the current write.
- A read that starts after a write ends must read that write or a later write.
Linearizability builds on a few other guarantees that must also hold: read-your-own-writes, monotonic reads, and monotonic writes. Check out this page for more details on these consistency guarantees.
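To make the two rules concrete, here is a toy sketch (in Python, not Jepsen code) that computes which values a linearizable read may return. It assumes a simplified setting: every write completed, writes do not overlap each other, and each operation records wall-clock start and end times.

```python
def allowed_values(read_start, read_end, writes):
    """Values a linearizable read may return.

    `writes` is a list of (value, start, end) tuples for completed,
    non-overlapping writes. A read may return the value of the latest
    write that finished before the read started (rule 1's "previous
    write"), or of any write that overlaps the read in time.
    """
    # The latest write that finished before the read began is the floor:
    # rule 2 forbids returning anything older than it.
    finished = [w for w in writes if w[2] <= read_start]
    floor = {max(finished, key=lambda w: w[2])[0]} if finished else {None}
    # Writes concurrent with the read may also be observed (rule 1).
    concurrent = [w for w in writes if w[1] < read_end and w[2] > read_start]
    return floor | {w[0] for w in concurrent}
```

For example, with writes of 1, 2, and 3 spanning times (0, 1), (2, 3), and (5, 6), a read spanning (3.5, 4.5) may only return 2, while a read spanning (5.5, 7) overlaps the write of 3 and may return either 2 or 3.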
These consistency guarantees are well tested for the single-node system, but highly-available (HA) configurations introduce complex problems that must be tested in novel ways. For example, in an HA configuration, writes are routed to a single node called the leader (as opposed to replicas, which only service reads). When the leader goes down, we want the system to perform a leader election with minimal downtime. Verifying that failover happens quickly under heavy load, and while enduring chaotic failures, is critical for a robust system.
Another novel problem stems from the fact that replicas catch up on transactions asynchronously; there is no guarantee that a replica has applied the latest transaction at any given moment. If not handled correctly, this can introduce inconsistencies. Since Appian guarantees read-your-own-writes, this must also be tested under different failure scenarios.
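As a toy illustration (plain Python with hypothetical `Leader`/`Replica` classes, not Appian's actual implementation), here is how naively routing a read to a lagging replica violates read-your-own-writes, along with one common mitigation: having the client wait until the replica has applied at least the client's own transaction before reading from it.

```python
class Leader:
    def __init__(self):
        self.log = []          # committed transactions, in order

    def write(self, value):
        self.log.append(value)
        return len(self.log)   # transaction index acknowledged to the client


class Replica:
    def __init__(self, leader):
        self.leader = leader
        self.applied = 0       # how far this replica has caught up

    def replicate(self, up_to):
        self.applied = min(up_to, len(self.leader.log))

    def read_latest(self):
        applied_log = self.leader.log[:self.applied]
        return applied_log[-1] if applied_log else None


leader = Leader()
replica = Replica(leader)
txn = leader.write("v1")       # client writes v1 and gets an ack
# The replica has not applied txn yet, so routing the same client's
# read to it violates read-your-own-writes:
stale = replica.read_latest()  # None instead of "v1"
# Mitigation: wait for the replica to apply the client's own transaction.
replica.replicate(up_to=txn)
fresh = replica.read_latest()  # now "v1"
```

Real systems implement this with replication offsets or session tokens rather than a shared log, but the failure mode under test is the same.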
Jepsen is a framework designed to test whether distributed systems live up to their consistency guarantees. It is a great tool, and we encourage you to look at the talks and presentations given by its creator, Kyle Kingsbury. Jepsen has been used to test many distributed systems, such as MongoDB, Elasticsearch, and ZooKeeper.
Jepsen validates a system by generating random operations (reads, writes, etc.) against it, recording the timestamp and duration of each operation, and building a model of the system in memory. It then tries to prove whether the resulting history of events makes sense given your model. Throughout this series, we will test for linearizability, using the out-of-the-box linearizability checker (knossos) to prove whether an operation history is linearizable, but Jepsen supports other models for verifying other properties.
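To give a flavor of what such a checker does, here is a deliberately naive single-register linearizability check in Python. knossos is far more sophisticated and efficient, but the core idea is the same: search for an ordering of the operations that respects real time and explains every read.

```python
from itertools import permutations


def linearizable(history):
    """Naive brute-force check for a single register.

    Each op is (kind, value, start, end) with kind "w" or "r". We look
    for some total order of the ops that (a) respects real time: an op
    that ended before another began must come first, and (b) replays
    consistently: every read returns the most recent write's value
    (None if no write has happened yet). Exponential in history length.
    """
    for order in permutations(history):
        # (a) Real-time constraint: for a placed before b, b must not
        # have ended before a began.
        if any(b[3] < a[2] for i, a in enumerate(order) for b in order[i + 1:]):
            continue
        # (b) Replay the candidate order against a single register.
        register, valid = None, True
        for kind, value, _, _ in order:
            if kind == "w":
                register = value
            elif register != value:
                valid = False
                break
        if valid:
            return True
    return False
```

For instance, a read of 1 that overlaps a write of 2 is fine (the read can be linearized before the write), but a read of 2 that starts after a lone write of 1 finished has no valid ordering, so the history is not linearizable.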
Running Jepsen in Docker.
We’re going to run Jepsen v0.1.11 in Docker. Containerizing all the requirements simplifies and standardizes running these tests anywhere, both locally and in CI. Once the Docker scripts are set up, all anyone needs to do to run these tests is `docker-compose up`, without worrying about whether they have all the requirements installed on their host. Since the tutorial doesn’t cover this, let’s talk a little about how it’s done. The Jepsen repo already provides the basic scaffolding to run Jepsen in Docker.
It runs three kinds of containers:
- jepsen-control: the orchestrator container; it controls the other nodes, handles setup and teardown, and generates data and chaos.
- jepsen-nX: your system will run in an n-node cluster (Jepsen defaults to 5) on these containers. More on how to achieve this later!
- jepsen-node: used by Jepsen’s chaos agent, nemesis.
To run a Jepsen test in Docker, exec into the control node via `docker exec -ti jepsen-control bash`, then, from your project folder, run `lein run test --time-limit 15`, which will run your test for 15 seconds.
Analyzing test results.
After the test runs, you will get some results in addition to the system logs. Throughout this blog series, we will primarily refer to two files: jepsen.log and timeline.html. The former is a log of all operations Jepsen performed against the system, and the latter is a neat HTML document showing the timeline of operations:
The first column is for Jepsen’s chaos agent (aptly named nemesis) and the chaos events it generated. The other columns are for each worker thread (5 by default). You read the timeline from top to bottom:
- Worker 1 reads 1 at the beginning of this timeline.
- Workers 3 and 0 then successfully write 0 and 3 respectively.
- Worker 3 reads 3 while worker 2 fails to write 2, and worker 0 writes 1.
- Worker 2 then reads 0, which is valid because worker 3 wrote a 0 during that read.
This is in line with the linearizability rules listed earlier:
- A read that starts before a write ends can read either a previous write or the current write.
Reading 1 at this point would also be linearizable (workers 0 and 4 wrote 1 during this read), but a read of any other value would not be. In this way, the timeline becomes a very useful tool for understanding the order of operations in a test and deducing the causes of inconsistent test runs.
Blue indicates that the operation was successful, red indicates a failed operation (the system state is unchanged), and orange indicates an indeterminate operation. By default, if you do not handle exceptions, Jepsen will treat them as indeterminate, since it doesn’t know whether the state of the system changed or not.
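A toy way to see why indeterminate operations matter to a checker (illustrative Python, not Jepsen's actual logic): a write that timed out may or may not have taken effect, so a subsequent read of either the old or the new value must be accepted.

```python
def allowed_after_timeout(last_acknowledged, timed_out_writes):
    """Values a single subsequent read may legally return when some
    writes timed out: each timed-out write may or may not have taken
    effect, so both the last acknowledged value and any timed-out
    value are possible answers."""
    return {last_acknowledged} | set(timed_out_writes)
```

For example, if the last acknowledged write was 1 and a write of 2 timed out, `allowed_after_timeout(1, [2])` yields `{1, 2}`: the checker cannot flag either read as a violation.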
Note that every chaos event produces two orange boxes: one when the operation starts and another when it finishes.
That’s it for Part I of this series. In the next part, we will discuss some of the hurdles we faced while operationalizing Jepsen, such as streamlining the Docker setup, running Jepsen in a Jenkins pipeline, and accessing a secure corporate repository inside a Jepsen test.