jepsen
A framework for distributed systems verification, with fault injection
A Clojure testing framework that deliberately crashes nodes and cuts network connections in a live distributed database, then checks whether the recorded history of operations is logically consistent with the system's guarantees.
Jepsen is a testing framework for distributed systems, the kind of software that runs across multiple computers at the same time and must coordinate between them. The library's tagline is "breaking distributed systems so you don't have to," which describes its purpose: it deliberately injects faults (network partitions, crashed nodes, clock drift) while running operations against a live system, then checks whether the recorded history of those operations is consistent with the guarantees the system claims to provide. If the history contains a result that those guarantees rule out, Jepsen reports it as an anomaly.
A Jepsen test is a program written in Clojure, a programming language that runs on the Java virtual machine. The test runs on a control node and connects via SSH to a set of database nodes where the target system runs. During the test, virtual clients send reads and writes to the system while a separate component called the "nemesis" disrupts things: dropping network packets between nodes, killing processes, manipulating system clocks. Jepsen records when each operation was invoked and when it completed, then a checker analyzes whether the complete history could have legally occurred given the system's claimed consistency model.
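To make the idea of "checking a history" concrete, here is a minimal sketch in Python rather than Clojure. It is not Jepsen's API or its checking algorithm: the `Op` record and the `is_linearizable` function are hypothetical names invented for illustration, and the sketch assumes a single register, only successfully completed operations, and a brute-force search that would be far too slow for real histories. It asks whether the operations can be placed in some total order that respects real time (an operation that finished before another began must come first) and in which every read returns the most recently written value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    """One completed client operation (hypothetical record, not Jepsen's format)."""
    process: int          # which virtual client issued the operation
    f: str                # "read" or "write"
    value: Optional[int]  # value written, or value the read returned
    start: float          # time the operation was invoked
    end: float            # time the operation completed


def is_linearizable(history, initial=None):
    """Brute-force check that a single-register history is linearizable."""

    def search(remaining, state):
        if not remaining:
            return True
        # An operation may be ordered next only if no other remaining
        # operation completed strictly before it was invoked.
        earliest_end = min(op.end for op in remaining)
        for i, op in enumerate(remaining):
            if op.start > earliest_end:
                continue
            if op.f == "write":
                new_state = op.value
            elif op.value == state:
                new_state = state  # the read observed the current value
            else:
                continue           # the read could not occur at this point
            if search(remaining[:i] + remaining[i + 1:], new_state):
                return True
        return False

    return search(list(history), initial)


# A read that overlaps a second write may still see the first write.
concurrent = [
    Op(0, "write", 1, start=0, end=1),
    Op(0, "write", 2, start=2, end=5),
    Op(1, "read", 1, start=3, end=4),
]

# A read that begins after the second write completed must not see the first.
stale = [
    Op(0, "write", 1, start=0, end=1),
    Op(0, "write", 2, start=2, end=3),
    Op(1, "read", 1, start=4, end=5),
]

print(is_linearizable(concurrent))  # True
print(is_linearizable(stale))       # False
```

The second history is the kind of result a checker would flag: the stale read cannot be explained by any legal ordering. A real test also has to account for operations that failed or whose outcome is unknown, which this sketch ignores.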
Test results include correctness analysis, performance graphs, and availability charts saved to disk for review. There are also a web interface and a REPL (an interactive prompt) for examining test results in detail after a run.
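As a rough illustration of what an availability chart summarizes, the sketch below buckets completed operations into time windows and computes the fraction that succeeded in each. This is a simplified assumption about how such a figure could be derived, not a description of Jepsen's own plotting code. The function name `availability_by_window` and the `(time, outcome)` tuple format are hypothetical; the outcome labels `"ok"`, `"fail"`, and `"info"` stand for success, definite failure, and unknown outcome.

```python
from collections import defaultdict


def availability_by_window(history, window=1.0):
    """Return {window_start_seconds: fraction of operations that succeeded}.

    `history` is an iterable of (completion_time_seconds, outcome) pairs,
    where outcome is "ok", "fail", or "info" (hypothetical input format).
    """
    counts = defaultdict(lambda: [0, 0])  # window index -> [ok count, total count]
    for t, outcome in history:
        bucket = int(t // window)
        counts[bucket][1] += 1
        if outcome == "ok":
            counts[bucket][0] += 1
    return {b * window: ok / total for b, (ok, total) in sorted(counts.items())}


history = [(0.2, "ok"), (0.7, "ok"), (1.1, "fail"), (1.5, "ok"), (2.3, "info")]
print(availability_by_window(history))  # {0.0: 1.0, 1.0: 0.5, 2.0: 0.0}
```

A dip in these fractions during a window in which the nemesis was active would suggest that the fault made the system unavailable to at least some clients.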
To run tests, you need a control machine and at least five database machines, though these can be virtual machines or Linux containers rather than real hardware. AWS, LXC containers on a local machine, and ordinary VMs are all supported. The project notes that tests can aggressively modify the database nodes (killing processes, altering firewall rules, changing clocks), so running Jepsen against a production system is not recommended.
Jepsen has been used publicly to find correctness bugs in many well-known databases and coordination systems. The project website lists published analyses. The framework is primarily a tool for database developers and distributed systems researchers who want rigorous, automated correctness testing under adversarial conditions.
Where it fits
- Write automated correctness tests for a distributed database that inject network partitions and node crashes
- Detect consistency anomalies by analyzing every read and write recorded during a fault-injection run
- Measure a database's availability under adversarial conditions like clock skew and killed processes
- Reproduce published Jepsen correctness analyses against a specific database version