Failure detectors are an important piece in designing robust distributed systems. Components must be expected to fail, and the rest of the system should either continue functioning properly (ideal) or at the very least degrade gracefully instead of crashing or becoming corrupted. Because of the unreliable nature of communication over networks, however, detecting that a node has failed is a nontrivial task. The phi accrual failure detector is a popular choice for solving this problem, as it provides a good balance of flexibility and adaptability to different network conditions. It is used successfully in several real-world distributed systems, such as Apache Cassandra (see here) and Akka clusters (see here), and also has a Node.js implementation.

There is a formal theory about the consistency and accuracy guarantees of failure detectors which I will quickly outline before explaining phi accrual (to learn more, please see here). A process is said to "suspect" another process if it believes the other process to have failed. A failure detector is strongly complete if "every faulty process is eventually permanently suspected by every non-faulty process," which is to say that once a process fails, at some point all the processes still running will know that fact. Similarly, a failure detector is strongly accurate if "no non-faulty process is suspected after some time." In summary, non-faulty processes should eventually have the "correct" assessment for all other processes regarding whether they have failed or not, which is a pretty sensible guarantee for a failure detector in a distributed system. Naturally, there are two important metrics for determining how effective a failure detector which satisfies these properties is, namely the time it takes to detect a failure and the rate at which it makes mistakes in suspecting processes. These will guide the discussion of why the phi accrual failure detector makes sense.

Firstly, we should understand the concept of accrual failure detectors. In the formal theory, a process only makes binary decisions about other processes: each of them is either suspected of failure or not. But in practice, we can better capture the uncertainty of these judgments by having a continuous value ($\phi(t)$ in this case, a function of time) and a threshold $\Phi$, where we suspect the process if $\phi(t) \ge \Phi$. Because $\Phi$ is a parameter, accrual failure detectors provide flexibility on top of the basic model. Depending on the requirements of the system, different values can be chosen to optimize for quicker detection or reduced false positives. An example provided in the paper demonstrates how this can be useful: consider a job scheduling system with a master and a set of workers, where the master monitors the status of the workers and assigns jobs to them. Instead of having a binary determination of whether a worker is alive or not, consider having two thresholds $\Phi_1 < \Phi_2$. If $\Phi_1$ exceeded, then stop sending new jobs to the worker, and only when $\Phi_2$ is exceeded assume the worker has failed and reassign any pending jobs. Thus the system can minimize both the chance that it reassigns jobs unnecessarily as well as the chance of assigning jobs to a failed worker.

Phi accrual is one implementation of an accrual failure detector. The way it works is quite simple and relies on the processes sending each other heartbeats at a regular interval, e.g. 100 milliseconds. It keeps track of the intervals between heartbeats in a sliding window of time and measures the mean and variance of these samples, building the corresponding normal distribution with cumulative density function $P(x)$. Then define $\phi(t) = -\log_{10}(1 - P(t - t_{last}))$ where $t_{last}$ is the last time at which a heartbeat was received. The value of $\phi$ will increase the longer it has been since the last heartbeat. Furthermore, the algorithm is adaptive to network conditions because of the measurements in the sliding window. If the network becomes slow or unreliable, the resulting mean and variance will increase; as such, there will need to be a longer period for which no heartbeat is received before the process is suspected. The phi accrual model additionally allows for convenient intuition as to the choice of the threshold $\Phi$. Assuming that the sliding window is representative of the real distribution of heartbeat inter-arrival times, a choice of $\Phi = 1$ means that there is about a 10% chance of a false positive, and in general the probability of a false positive is $0.1^{\Phi}$.

One of the hardest parts about building distributed systems is that there is much less "library code" that can be leveraged. Oftentimes, good software engineering is about not reinventing the wheel, and having well-defined, reusable components is essential to scaling a codebase (or multiple). Phi accrual failure detection seems like it could be an important library that distributed systems can plug in without having to worry about the intricacies of the algorithm or the implementation.

## No comments:

## Post a Comment