Cadre Solves Distributed State Challenges

Tags

Network of highways

One of the programming language Elixir’s greatest capabilities is scaling to multiple nodes. In many ways, this can be done transparently, but there is one problem affecting any system that scales horizontally: distributed state. BEAM has excellent tools for writing distributed applications, but it does not provide tools to protect us from the pitfalls of distributed state.

Enter Cadre, a new library developed by DockYard Architectural Engineer Paul Schoenfelder to help solve the challenges of building software with distributed state. Cadre offers foundational tooling for such software, including a testing and simulation framework, and abstractions for constructing higher-level stores of distributed state, such as an eventually consistent cache, or a strongly consistent key/value database. It also provides tooling for troubleshooting consistency issues, using visualizations of the system under test.

In this Big Elixir presentation, Paul explains the challenges Cadre can help address, and how the pieces provided by Cadre can be used together to construct stateful distributed systems with confidence. Watch his full presentation and read the summary below.

How Do Systems End Up with Distributed State?

As projects grow, the need for state in the system becomes inevitable, for example:

  • Service discovery
  • Process registry
  • Caching
  • Locks
  • Presence
  • Event logs
  • Etc.

Eventually it becomes desirable to scale horizontally onto multiple nodes, typically for one of the following reasons:

Reliability. By running multiple instances of a system, a loss of some subset of that system can be handled without loss of availability.

Scalability. By running multiple instances of a system, load can be spread across the cluster, improving performance by reducing load on individual instances.

Some combination of both

The Challenges of Distributed State

With multiple nodes involved, solutions that worked on a single node are no longer effective, and many assumptions made in that environment may no longer hold. The state in the system that was once easy to maintain now becomes an incredibly complex problem involving tradeoffs that are important to understand:

What consistency guarantees are important for each bit of state? Is it eventually consistent, or are stronger guarantees required?

How are those guarantees provided? If eventually consistent, do we use CRDTs, or some kind of gossip protocol? If strongly consistent, do we use consensus or leadership, and what algorithm?

How do failures in the system impact the architecture? If using consensus, the loss of too many nodes may result in the system as a whole being unavailable until those nodes return. If using eventual consistency, a network partition may result in divergence so great that resolving the conflicts ends up in data loss. If done incorrectly, one can end up with split brain scenarios, or other pathological issues.

All of the above require significant planning, and rigorous testing and verification. Unfortunately, there are no real tools native to the BEAM which assist in this effort.

Testing and Verification

Solutions to the problems outlined in the previous section require testing to ensure that they hold up to the large variety of faults the system might encounter. This is especially important when constructing a consistency protocol from scratch, or reimplementing an existing algorithm. There are three properties of a consistent system, whether the guarantees are weak or strong, that form the key trade offs of any design, and the basis upon which the system is evaluated:

  • Consistency. Determines the order in which read/write operations will appear to occur. For example, in an eventually consistent system, it is typically the case that this order is not defined; each node will see reads and writes occur in potentially different orders.

  • Availability. Determines to what degree reads and writes are required to “complete,” or successfully execute, in the face of various failure scenarios. For example, in leader-based systems, reads and writes are typically not guaranteed to complete when a new leader is being elected.

  • Convergence. Determines to what degree nodes in the system are required to observe each other’s updates. More intuitively, this property describes the ability of the system to arrive at agreement on a shared view of the data. For example, an eventually consistent system may diverge across nodes for some time, but in the absence of further updates, such protocols will typically converge on the same set of values.

Cadre

Cadre is designed to provide a foundation for implementing consistency protocols and the higher-level layers which rely on them. At its core, it provides a simulation framework that allows running a protocol and verifying that it upholds the guarantees that it is intended to provide. This framework also provides helpful tools for troubleshooting issues with a protocol, by producing data lineage graphs and Lamport diagrams, which show how events in the system occurred, and how those events resulted in the outputs produced by the protocol.

To provide this framework, Cadre also provides abstractions for a variety of sources of non-determinism in a system: disk access, time, networking, etc. By building on top of those abstractions, Cadre is able to replace them under simulation with implementations that allow injecting failures. It also provides an abstraction for key/value storage, with ordered key semantics. This makes it trivial to construct custom stores with different consistency semantics, and plug in different layers for functionality like indexes, namespacing, and more. All without needing to write the plumbing yourself.

Lineage-Driven Fault Injection

The testing framework that Cadre provides is based on a novel technique developed by Peter Alvaro, et al., called lineage-driven fault injection. In short, it is a form of restricted model checking, that can provide a degree of confidence in systems which use it, that is not otherwise attainable in Elixir today.

The best way to think about the technique is as a game between the programmer and the test framework. The programmer provides the parameters of the game and the guarantees that their protocol need to uphold, and an implementation of the protocol to be tested. The test framework then runs the protocol, and uses data lineage to find ways in which faults can result in a guarantee being violated. This allows the test framework to avoid the need for an exhaustive search of the space of failures, by testing only those faults which can impact the successful outcome.

If the test framework succeeds in producing a failure case, the programmer is given enough information to identify the issue and fix it, and start another round of the game. If the test framework cannot find a failure case, then the programmer “wins”, and the protocol can be considered fully verified within the parameters of the game.

The parameters of the game are really a few things:

  • The maximum amount of time the test framework has to inject failures before it has to let the system recover. This value also determines to what extent the verification of the protocol is valid, since the time is finite, and so not a proof.

  • The maximum number of certain classes of failures that can occur. For example, it is not useful to test a system where all of the nodes crash, because that is fundamentally a failure mode for any protocol. So you typically express guarantees in terms of the number of nodes in a cluster which can be unavailable and still recover.

The guarantees that are being evaluated, in terms of pre and post conditions. For example, in Raft:

  • Pre: If a voting member of a cluster has cast a vote for a candidate in a leader election
  • Post: Then it can not have voted for any other candidate in that election

If the test framework produces a failure case, it will generate a data lineage graph representing how the outputs of the system were arrived at (i.e. what set of reads and writes led to the final value), and a Lamport diagram which shows communication between nodes in the system throughout the simulation. When combined, this provides a comprehensive view of the system under test, as well as what led to a failure, which is invaluable for fixing issues in a given protocol.

DockYard is a digital product agency offering exceptional strategy, design, full stack engineering, web app development, custom software, Ember, Elixir, and Phoenix services, consulting, and training. With a nationwide staff, we’ve got consultants in key markets across the United States, including Seattle, Los Angeles, Denver, Chicago, Austin, New York, and Boston.