Big Idea

With the addition of a Proxmox server to my homelab, I have been planning to scale out the control plane of my k3s cluster by virtualizing two additional master nodes for an HA setup. I knew this would involve replicating state in my etcd-backed control plane, and that etcd itself relies on the Raft consensus algorithm. For that reason, I decided to take a deeper dive into the inner workings of Raft.

Raft is a consensus algorithm for managing a replicated log. It was designed as an alternative to Paxos, aiming to reduce complexity by focusing on understandability and extensibility. It achieves this by decomposing the problem into key subproblems (leader election, log replication, and safety) while simultaneously shrinking the state space generated by non-determinism (i.e. Raft enforces more coherency between servers).

To understand Raft better, I will first give an overview of the problem space that Raft aims to solve (the problem of fault tolerance in a replicated state machine); second, discuss the design considerations behind Raft; and third, dive into the specifics of leader election, log replication, and safety in Raft.

This post is not intended to be original. It exists for me (and hopefully for you!) to better understand Raft, and it will be a digest with many parallels to “the Raft paper”.

What is Consensus? Why Do I Need It? A Replicated What Now?

If you look up Raft, the one-line summary you might get is that “Raft is a consensus algorithm for replicated state machines”. There’s a lot to unpack there…

First, consensus. Consensus is a central problem that arises when designing fault-tolerant distributed systems. Being fault-tolerant means being able to handle the hardware or software failures that may arise among the set of servers that make up a distributed system. In this context, consensus means multiple servers being in agreement about certain values, such as state or cluster membership. Here are some common features of consensus in a distributed system:

  • Once consensus has been reached about a value, that decision is final.
  • Consensus algorithms are usually designed so the system can continue making progress as long as a majority of its servers are operating (e.g. a system of 5 servers can tolerate 2 servers going into a failure state).
  • If a majority of servers fail, the system stops making progress, but it never returns an incorrect result.
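
To make the majority requirement concrete, here is a tiny Go sketch (the majority helper is my own naming, not from any Raft implementation) of quorum sizes for a few cluster sizes:

```go
package main

import "fmt"

// majority returns the minimum number of servers that must agree
// before a value is decided. A cluster of 2f+1 servers tolerates
// f failures.
func majority(clusterSize int) int {
	return clusterSize/2 + 1
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("cluster of %d: quorum %d, tolerates %d failures\n",
			n, majority(n), n-majority(n))
	}
}
```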

The issue of consensus usually arises in the context of replicated state machines. A replicated state machine is a set of servers, each with a state machine and a log. The state machine is deterministic and identical on every server, so that, given the same state and inputs, the same subsequent state is reached. The log is a sequence of commands/inputs to apply to the state machine. Replication is transparent to users: they apply changes as if dealing with a single server. Consensus is leveraged to ensure that any command issued by the user, such as x <- 3, occupies the same position in the log on every machine. That is to say, if x <- 3 is the nth command on one server, then it must be the nth command on all servers, so that they all produce the same (n+1)th state. This ensures that all machines process commands and move between states in the same order, and therefore converge to the same state.
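
To illustrate, here is a toy Go sketch of a deterministic state machine fed from a log. The Command and StateMachine types are my own invention, and the sketch assumes consensus has already agreed on the command order:

```go
package main

import "fmt"

// Command is a toy input to the state machine, e.g. "x <- 3".
type Command struct {
	Var   string
	Value int
}

// StateMachine holds a log of commands and the state produced by
// applying them in order.
type StateMachine struct {
	log   []Command
	state map[string]int
}

// Apply appends cmd at the next log index and applies it. Because
// the machine is deterministic, replicas that apply the same
// commands in the same order reach the same state.
func (sm *StateMachine) Apply(cmd Command) {
	sm.log = append(sm.log, cmd)
	sm.state[cmd.Var] = cmd.Value
}

func main() {
	replicas := []*StateMachine{
		{state: map[string]int{}},
		{state: map[string]int{}},
	}
	// Consensus guarantees every replica sees this exact sequence.
	for _, cmd := range []Command{{"x", 3}, {"y", 1}, {"x", 2}} {
		for _, r := range replicas {
			r.Apply(cmd)
		}
	}
	fmt.Println(replicas[0].state, replicas[1].state) // identical states
}
```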

What makes Raft different?

Raft was designed for understandability. The main design strategy to achieve this was decomposition, in which the individual subproblems (leader election, log replication, safety) are separated out and solved individually. In addition, the design focused on minimizing the state space by reducing non-determinism and reducing the number of states the system can be in.

Raft: Implementation

Raft decomposes into three subproblems:

  • leader election
  • log replication
  • safety

Raft Properties

  • Election Safety: At most one leader can be elected in a given term
  • Leader Append-Only: A leader never overwrites or deletes entries in its log; it only appends new entries
  • Log Matching: If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index
  • Leader Completeness: If a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all higher-numbered terms
  • State Machine Safety: If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index

Strong leader

  • A leader is first elected
  • The leader accepts all entries, propagates/replicates log entries to the other servers, and tells the other servers when it is safe to apply log entries to their state machines (all carried by the AppendEntries RPC sketched below)
  • In the case of leader failure, a new election takes place
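
The paper bundles all of these leader duties into a single RPC: AppendEntries carries new entries, doubles as a heartbeat when empty, and its leaderCommit field tells followers what is safe to apply. Here is a Go sketch of its arguments, with field names following Figure 2 of the Raft paper (the LogEntry type is my own minimal stand-in):

```go
package raft

// LogEntry is a minimal stand-in for a single log entry.
type LogEntry struct {
	Term    int    // term in which the entry was received by the leader
	Command []byte // opaque command for the state machine
}

// AppendEntriesArgs mirrors the AppendEntries arguments in Figure 2
// of the Raft paper.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // entries to store (empty for a heartbeat)
	LeaderCommit int        // leader's commit index: what is safe to apply
}
```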

Basics

A server is in one of three states:

  • leader
  • follower
  • candidate

In normal operation there is one leader and the other servers are followers. Followers are passive: they issue no requests of their own and only respond to requests from leaders and candidates. Leaders, by contrast, handle all client requests; if a client contacts a follower, that client is redirected to the leader. The candidate state is used to elect a new leader.
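
As a rough Go sketch, the three states and the transitions described above might look like the following (the Server type and method names are my own, not from the paper):

```go
package raft

// State is one of the three roles a server can hold.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Server struct {
	state State
	term  int
}

// becomeCandidate models a follower whose election timeout expired:
// it increments its term and starts an election.
func (s *Server) becomeCandidate() {
	s.state = Candidate
	s.term++
}

// becomeLeader models a candidate that received a majority of votes.
func (s *Server) becomeLeader() {
	s.state = Leader
}

// stepDown models any server that observes a higher term: it adopts
// the newer term and reverts to follower.
func (s *Server) stepDown(newTerm int) {
	s.state = Follower
	s.term = newTerm
}
```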

Time is divided into “terms” of arbitrary length. Terms are numbered with consecutive integers. Each term begins with an election in which candidate servers attempt to become the leader. The elected candidate serves as leader for the remainder of the term. If an election results in a split vote, no leader is elected for that term and a new term (with a new election) begins.
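
Raft makes repeated split votes unlikely by randomizing each server's election timeout, so one candidate usually times out and starts an election before the others do. A sketch, assuming the 150-300 ms range the paper suggests:

```go
package raft

import (
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout in the 150-300 ms
// range suggested by the Raft paper. Randomization staggers the
// candidates, making repeated split votes unlikely.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}
```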

Resources