Walks in the neigborhood: Correlated failure in distributed systems

Every time I start to get mad at Google for being closed and proprietary, they release something really interesting. The paper, Availability in Globally Distributed Storage Systems is loaded with interesting data on component failures and it presents a nice framework for analyzing failure data. The big takeaway is that naive reasoning about "HA" deployments can lead to inflated expectations about overall system availability.

Suppose you have a clustered pair of servers and each is 99% available. What availability can you expect from the clustered pair? "Obviously," $99.99\%.$

Unfortunately, that expectation assumes that the servers fail independently - i.e., component failures are not correlated. The Google research shows that this can be a bad assumption in practice. Referring to storage systems designed with good contemporary architecture and replication schemes, the authors say, "failing to account for correlation of node failures typically results in overestimating availability by at least two orders of magnitude." They go on to report that due to high failure correlation, things like increasing the number of replicas or decreasing individual component failure probabilities have much weaker effects when you take correlation into account.

Steve Laughran does a nice job summarizing the practical implications of the specific area covered by this research. What is interesting to me is the model proposed for quantifying correlation in component failure and estimating its impact on overall system availability.   Its easy to come up with scenarios that can lead to correlated component failures, e.g. servers on a rack served by a single power source that fails; switch failures; bad OS patches applied across a cluster. What is not as obvious is how to tell from component-level availability data what counts as a correlated failure and how to adjust expectations of overall system availability based on the extent of correlation.

The first question you have to answer to get to a precise definition of correlation in component failure is what does it mean for two components to fail "at the same time" or equivalently what does it mean for two observed component failures to be part of a single failure event? The Google authors define a "failure burst" to be a maximal sequence of node failures, all of which start within a given window-size, $w$, of one another.   They use $w=120$ seconds for their analysis, as this matches their internal polling interval and it also corresponds to an inflection point on the curve formed when they plot window size against percentage of failures that get clubbed into bursts.

We can define correlation among node failures by looking at how the nodes affected by bursts are distributed. The practically relevant thing to look at is how often nodes from system architecture domains fail together - for example, to what extent do node failures occur together in the same rack. If failures are highly rack-concentrated, for example, having system redundancy only within-rack is a bad idea.

Given a failure burst consisting of a set $N = {f_0,..., f_n}$ of failing nodes and a partition $D = {d_0, ... , d_m}$ of $N$ into domains, we will define the $D$-affinity of $N$ to be the probability that a random assignment of failing nodes across domains will look less concentrated than what we are observing. High $D$-affinity means correlation, low means dispersion or anti-correlation. If domains are racks, high rack-affinity means failures are concentrated within-rack.

To make the above definition precise, we need a measure of domain concentration. The Google paper proposes a definition equivalent to the following. For each $i = 0, ..., m$ let $k_i$ be the number of nodes in $N$ included in $d_i$. So for example if the $d_i$ are racks, then $k_0$ is the number of nodes in rack $0$ that fail, $k_1$ counts the failures in rack $1$, etc.   Then set $x = \sum_{i=0}^{m}{k_i \choose 2}$. This makes $x$ the number of "failure pairs" that can be defined by choosing pairs of failing nodes from the same domain. Clearly this is maximized when all of the failures are in the same domain (every pairing is possible) and minimized when all failing nodes are isolated in different domains. Increasing domain concentration of failures increases $x$ and disaggregating failing nodes decreases it.

Now let $X$ be a random variable whose values are the values of $x$ above. For each possible value $x$ define $Pr(X = x)$ to be the likelihood that $X$ will take this value when failing nodes are randomly distributed across domains. Then for each value $x$, define $r_x = Pr(X < x) + \frac{1}{2}Pr(X = x)$. Then $r_x$ measures the likelihood that a random assignment of failing nodes to domains will result in concentration at least as large as $x$. The $\frac{1}{2}$ is to prevent the measure from being biased, as we will see below. A value of $r$ close to $1$ means that failures are highly correlated with respect to domain, while values close to $0$ indicate dispersion. With domains equal to racks and $r$ called rack-affinity, the Google paper reports:

We find that, in general, larger failure bursts have higher rack affinity. All our failure bursts of more than 20 nodes have rack affinity greater than 0.7, and those of more than 40 nodes have affinity at least 0.9. It is worth noting that some bursts with high rack affinity do not affect an entire rack and are not caused by common network or power issues. This could be the case for a bad batch of components or new storage node binary or kernel, whose installation is only slightly correlated with these domains.

The authors point out that it can be shown that the expected value of $r$ is $.5$. To see this, let $x_0, x_1, ..., x_t$ be the values of $X$ as defined above and for each $i = 0, ..., t$ let $p_i = Pr(X = x_i)$. Then the expected value of $r$ is $$E(r) = \sum_{i=0}^{t}\left\{p_i \left(\sum_{j=0}^{i-1}p_j + \frac{1}{2}p_i\right)\right\}.$$Since $\sum p_i = 1$, we must have $(\sum p_i)^2 = 1$. Expanding this last sum and the sum for $E(r)$, it is easy to see that $E(r) = \frac{1}{2}(\sum p_i)^2$. Note that this applies to any discrete probability distribution - i.e., $r$ as above could be defined for any discrete distribution and its expectation will always be $.5$. Note also that while $r$ can take the value $0$, its maximum value is $1 - \frac{1}{2}p_t.$ For $X$ as defined above, $p_t$ is the probability that all failures are in the same domain, which is $1/B_N$ where $N$ is the total number of nodes and $B_N,$ the $Nth$ Bell number, is the number of ways that the $N$ nodes can be partitioned.

Computing the value of $r$ given counts $c_0, c_1, ..., c_m$ of failing nodes by domain is non-trivial. According to the Google authors,

It is possible to approximate the metric using simulation of random bursts. We choose to compute the metric exactly using dynamic programming because the extra precision it provides allows us to distinguish metric values very close to 1.

I have not been able to figure out a straightforward way to do this computation. Maybe the Googlers will release some code to do the computation on Google Code. The only way that I can see to do it is to fully enumerate partitions over the node set, compute $x$ for each partition and build the distribution of $X$ using frequency counts. Patches welcome :)

The Google paper stops short of developing a framework for using estimates of node failure correlation in end-to-end system availability modelling. That would be an interesting thing to do. Here are some simple observations that might be useful in this regard and that also illustrate some of the practical implications.

Correlation cuts both ways - i.e., it is possible to do better than independence if a system's deployment architecture splits over domains with high failure affinity. Consider, for example, an application that requires at least one database node to be available for it to provide service. Suppose that database node failures are perfectly rack-correlated (i.e., all database node failures are concentrated on single racks). Then if the application splits database nodes over racks (i.e. has at least one node in each of two different racks) it can deliver continuous availability (assuming the database is the only thing that can fail).

End-to-end HA requires splitting over all domains with high failure correlation. Suppose that in the example above, database node failures also show high switch affinity. Then to deliver HA at the application level, you need to ensure that in addition to having database nodes in two different racks, you also need nodes connected to at least two different switches.

As always, correlation does not imply causation. The Google paper makes this point in a couple of places. Suppose that in our simple example all database failures are in fact due to database upgrades and the operational practice is to apply these upgrades one rack at a time. That will result in high rack affinity among failures, but the failures have nothing to do with the physical characteristics or failure modes of the racks or their supporting infrastructure.

The observations above are basic and consistent with the conventional wisdom applied by operations engineers every day. In an ideal world, HA systems would be designed to split over every possible failure domain (rack, switch, power supply, OS image, data center...). This is never practical and rarely cost-effective. What is interesting is how quantitative measurements of failure correlation can be used to help estimate the benefit of splitting over failure domains. Just measuring correlation as defined above is a good start.

Walks in the neigborhood

Friday, October 28, 2011

Correlated failure in distributed systems

1 comment: