Skip to main content

Chaos Testing a Distributed System with Jepsen — Part II

Serdar Serdar Benderli

Header (16)

Photo by Alyssa Ledesma on Unsplash

Many thanks to Josh Kessler, the Data Layer team, Antonio Andrade, Jonathon Blonchek, Kelley MacEwen, and Michael Crawford for their help in making this blog possible.

This is the second post in a series that explores running chaos tests against a distributed system using Jepsen. In Part I, we gave a brief introduction about the system-under-test (SUT) and why we chose Jepsen as a chaos testing tool. In this part, we’re going to talk about some hurdles we faced while running Jepsen against Appian using Docker and how we overcame them. Our hope is that the solutions we provide will inspire and help you in your own chaos testing journey with Jepsen.

Throughout the post we will be referring to Jepsen’s docker-compose and up.sh files, so it would help if you familiarized yourself with them.

1 — Running other services alongside your SUT.

Appian needs a running instance of Kafka and Zookeeper. In production, these systems are run in a highly-available (HA) configuration alongside Appian, and since we are not interested in validating Kafka or Zookeeper, we can keep them outside of Jepsen’s reach. As far as Jepsen is concerned, these “nodes,” as Jepsen refers to the containers it spins up, should not be disturbed in any way. This means we can treat them as just additional services to be run by docker-compose and Jepsen will leave them alone.

The solution is to extend Jepsen’s docker-compose file to add Kafka and Zookeeper containers. We used Confluence’s Kafka and Zookeeper images, followed their docker wiki, and ended up appending this section to Jepsen’s docker-compose file:

 

# Append to Jepsen's docker-compose file
zookeeper:
    container_name: jepsen-zookeeper
    hostname: zookeeper
    image: confluentinc/cp-zookeeper
    ports:
      - "2181:2181"
      - "2888:2888"
      - "3888:3888"
    healthcheck:
      test: echo stat | nc localhost 2181
      interval: 10s
      timeout: 10s
      retries: 3
    environment:
      - ZOOKEEPER_SERVER_ID=1
      - ZOOKEEPER_CLIENT_PORT=2181
      - ZOOKEEPER_TICK_TIME=2000
      - ZOOKEEPER_INIT_LIMIT=5
      - ZOOKEEPER_SYNC_LIMIT=2
      - ZOOKEEPER_SERVERS=zookeeper:2888:3888
kafka:
    container_name: jepsen-kafka
    hostname: kafka
    image: confluentinc/cp-Kafka:4.0.0 #kafka v1.0.0
    healthcheck:
      test: ps augwwx | egrep [S]upportedKafka
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
      - KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_BROKER_ID=1
      - BOOTSTRAP_SERVERS=kafka:9091
      - ZOOKEEPER=zookeeper:2181

That’s really all you need for running other services alongside your system. Note that it is also possible to run these services in their HA configurations, although for our purposes we decided that it is sufficient to run them as single node.

2 — Downloading binaries from a private repo.

Once all this groundwork is done, we are ready to install our SUT on the worker nodes. In Jepsen, this is done by instructing the control node to download binaries from a URL using the built-in Jepsen function install-archive. Note that this function just takes a URL and a destination along with an optional parameter to bypass the cache. This is a problem because we need to download the binaries from our corporate repo which requires authentication. So, we extended the Jepsen library in this PR to allow the user to optionally provide a username and a password to this function.

It is straightforward to make changes in the Jepsen source code and have those changes take effect for your tests. By default, Jepsen copies the entire Jepsen library into the control node and builds it locally during startup, so any changes you make will immediately be available in the control node, provided that you use this local Jepsen version (as opposed to the latest on maven central) in your dependencies. For reference, these and other dependencies are set in the project.clj file at the root of a Jepsen project.

To avoid checking in credentials to the repo, you should have a mechanism to retrieve them inside the startup script and pass them into your containers. For example, you can do this by setting credentials as environment variables similar to the existing mechanism Jepsen uses to set SSH credentials on the containers. Environment variables can be read at runtime via Clojure standard library’s System/getenv function.

3— Using custom libraries in Clojure.

In order for Jepsen to interact with your system, you have to tell it how to perform the operations you want. This is done via clients, and we have a client that interacts with the Appian store. To use this client, we need the Appian libraries, which are published to our corporate Artifactory repo. To get access to them, we need to tell Leiningen (Clojure’s build tool) where to find these dependencies, as well as how to authenticate with the corporate repo. This is done by adding your repo to project.clj. The syntax for using the environment variables can be a little tricky. Here’s an example project.clj file that adds two more repositories for Leiningen to search, and retrieves library versions as well as the credentials from environment variables:

 

(def version (System/getenv "VERSION"))
(defproject jepsen.somesystem "0.1.0-SNAPSHOT"
  :description "You know, for chaos..."
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :main jepsen.somesystem
  :jvm-opts ["-Dcom.sun.management.jmxremote"]
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [jepsen "0.1.10"]
                 [com.company/some-library ~version]
                 [com.company/some-other-library ~version]]
  :repositories [
    ["public"  {
      :url "https://some-public-repo" }]
    ["private"  {
      :url "https://some-private-repo"
      :username :env/USERNAME
      :password :env/PASSWORD }]
  ]
)

At this point, we have a five-node Jepsen cluster with Kafka & Zookeeper running on the side and parameterized credentials to a private repo to test any version of Appian. We can now run Jepsen tests against our system on our local machines, but running tests locally is not enough — we want to run them continuously.

4 — Jenkins integration.

In this section, we’ll cover how we integrated Jepsen with our Jenkins pipeline so that it’s always running against the master branch on every change and not just on our local machines.

Our CI infrastructure uses Jenkins Pipelines. Every job we have is associated with a jenkinsfile that describes how the test will run so the first step to adding a new test is to create a jenkinsfile that describes the Jepsen job. We want this job to be able to do the following:

  1. Accept parameters for Appian version and test duration
  2. Only run on master, with an option to force run on other branches
  3. Run Jepsen tests (in docker)
  4. Report test results and archive artifacts

Below you can find an example jenkinsfile that accomplishes this:

 

#!/usr/bin/env groovy

pipeline {
  options {
    timeout(time: 1, unit: 'HOURS')
  }
  parameters {
    string(name: 'VERSION', defaultValue: '', description: 'Version to test')
    string(name: 'TIME_LIMIT', defaultValue: '300', description: 'Duration of jepsen tests (seconds)')
    choice(name: 'NODES', choices: "3\n5", description: 'Number of nodes')
    booleanParam(name: 'FORCE_RUN', defaultValue: false, description: 'Forces the job to run')
  }
  stages {
    stage('Jepsen') {
      when {
        anyOf {
          expression { env.GIT_LOCAL_BRANCH ==~ /^(master)$/ }
          expression { params.FORCE_RUN }
        }
      }
      tools { jdk 'system-jdk-8' }
      steps {
        dir('your_repo/jepsen/docker') {
          script {
            version = ""
            if (params.VERSION != "") {
              version = "--version ${params.VERSION}"
            }
            sh "./up.sh --daemon ${version} --nodes ${params.NODES}"

            sh "python3 wait_for_jepsen.py"
            sh "docker exec jepsen-control bash -c 'cd /jepsen/your_project && lein run test --time-limit=${params.TIME_LIMIT}' || true"
            sh "docker cp jepsen-control:/jepsen/your_project/store/latest/ results" // copy the results over

            def jepsenLog = 'results/jepsen.log'
            def jepsenLogExists = fileExists jepsenLog
            if (jepsenLogExists) {
              def failStrings = ['nemesis crashed'].join('|')
              def hasErr = sh (
                  script: "egrep -c \"$failStrings\" $jepsenLog",
                  returnStatus: true
              ) == 0
              if (hasErr) {
                currentBuild.result = 'FAILURE'
                return
              }
            }

            def resultFile = 'results/results.edn'
            def resultExists = fileExists resultFile
            if (resultExists) {
              def result = readFile resultFile
              if (result.contains(':valid? false')) {
                currentBuild.result = 'UNSTABLE'
              } else if (result.contains(':valid? true')) {
                currentBuild.result = 'SUCCESS'
              } else {
                currentBuild.result = 'FAILURE'
              }
            } else {
              currentBuild.result = 'FAILURE'
            }
          }
        }
     }
     post {
       always {
         archiveArtifacts artifacts: 'your_repo/jepsen/docker/results/', allowEmptyArchive: true
       }
     }
    }
  }
}

  • Line 29 runs the modified up.sh script in the background with version and nodes passed in as parameters.
  • Line 30 runs a python script that waits for Jepsen to come up. This is needed to let Jenkins know when Jepsen is up and ready to run tests. Locally, you would watch the logs from the up.sh to know when Jepsen is ready. The python script automates this task by waiting on all docker containers to come up:
import shlex
import subprocess
import time
from functools import reduce


def get_lines(args, shell=False, check=False):
    return str(subprocess.run(shlex.split(args), stdout=subprocess.PIPE, shell=shell, check=check).stdout, 'utf-8').splitlines()


# get services from docker-compose
SERVICES = list(map(lambda x: "jepsen-"+x, get_lines("docker-compose config --services")))
print("Waiting for services:{}".format(", ".join(SERVICES)) )
cnt = 0
up = False
while cnt < 10:
    # get container IDs of the services
    jepsens = get_lines(args="docker ps -aq --filter name={}".format("|".join(SERVICES)),check=True)
    if len(jepsens) == len(SERVICES):
        # get the health of containers
        s = get_lines(args="docker inspect -f {{.State.Health.Status}} " + " ".join(jepsens),check=True)
        # convert string status to boolean
        statuses = list(map(lambda x: True if x == 'healthy' else False, s))
        # expect all to be `healthy`
        up = all(statuses)
        if up:
            break
    print("Jepsen is not up yet...waiting 5 seconds")
    time.sleep(5)
    cnt+=1

if up:
    print("Jepsen is up and ready for chaos ヽ(‘ー`)ノ")
else:
    print("Jepsen failed to come up within a reasonable time (ノಠ益ಠ)ノ彡┻━┻")
    print("\n".join(get_lines(args="docker inspect -f \"{{.Name}}\t{{.State.Health.Status}}\" " + " ".join(jepsens),check=False)))

  • Line 32 in the jenkinsfile starts the tests by issuing a command to the control node via `docker exec`. It is set to always return true so that we can analyze the results afterwards.
  • Line 33 copies the results from the control node to the machine running the CI job so that they can be analyzed & archived.

Not all test failures mean Jepsen detected inconsistencies. For example, one common case is when the chaos agent nemesis crashes during the test. The test may still pass, but only because the system was not under stress during the tests. Since such a test is without much value, we manually set the test result to failed which indicates that something went wrong during the test (nemesis crashed, host ran out of memory, etc.) and that the test needs to be re-run because we can’t trust the results.

  • Lines 49–62 look for the valid keyword in the Jepsen result file to know whether the results were linearizable. When the results are invalid, the job status is set to unstable which indicates the test ran all the way, and the analysis indicates that the results are not linearizable.

That’s it for integrating Jepsen tests into CI infrastructure. With this, we now have Jepsen tests run on every change that goes in which increases our confidence in the system.

Ready for chaos!

That’s it for Part II of this series.

In the next and final part, we will discuss the issues we found by running Jepsen against our distributed system and how we resolved those issues.

 

Serdar

Written by

Serdar Benderli

Serder is a Technical Mentor of Software Development at Appian.