Sunday, October 11, 2020

A little dusting vs. getting buried - why variance matters in ranking risks

Some simple risk management frameworks recommend stack ranking risks by their expected loss.  The simple formula $L(r) = P(r) E(C(r))$ is sometimes used, where $P(r)$ is the probability that a risk event of type $r$ materializes and $E(C(r))$ is an estimate of the cost associated with such an event.  So for example, if the cost of a data breach at a company is estimated to be $\$1m$ and the estimated probability of a breach is $0.01$, then $L$ is in this case estimated to be $\$10k$.  Given estimates of $P(r)$ and $E(C(r))$ for the different risks faced by a company or individual, a natural way to rank them is by their $L$ values.  The problem with this approach is that it ignores the variance of $C$ and puts too much weight on a single point estimate of $L$.
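
As a rough sketch of what this ranking looks like in practice (the risk names and numbers below are hypothetical, not from any real register), it is just a sort by $P(r) E(C(r))$:

    # Hypothetical risk register: name -> (probability of the event, estimated cost in dollars)
    risks = {
        "data_breach": (0.01, 1_000_000),
        "vendor_outage": (0.05, 50_000),
        "laptop_theft": (0.10, 5_000),
    }

    # L(r) = P(r) * E(C(r)) for each risk, ranked from largest to smallest
    expected_loss = {name: p * cost for name, (p, cost) in risks.items()}
    for name, loss in sorted(expected_loss.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: expected loss ${loss:,.0f}")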

Consider the following example.  Suppose that a company has one risk $r$ with $P(r) = 0.01$ and $E(C(r)) = \$1m$ and another risk $s$ with $P(s) = 0.01$ and $E(C(s)) = \$800k$.  Suppose further that a realized loss of $\$1.2m$ or more is catastrophic for the company, meaning a loss of that size cannot be absorbed.  If $C$ has no variance for either $r$ or $s$, neither poses a catastrophic threat, and it makes sense to prioritize mitigating $r$ over $s$.

The problem is that in real-world situations $C$ always has variance, and when what you really care about is guarding against large losses, that variance changes the equation.  In the example above, suppose that $C(r)$ is normally distributed with mean $\$1m$ and standard deviation $\$100k$, but $C(s)$ is uniformly distributed between $0$ and $\$1.6m$.  This means that the values of $C(r)$ follow a bell-shaped curve clustering around the mean $E(C(r)) = \$1m$, while the values of $C(s)$ are spread evenly across the interval $[0, \$1.6m]$, with no range of values any more likely than any other.

Under these assumptions, the probability that $r$ results in a catastrophic loss is $P(r)\, P(C(r) > 1.2) = 0.01 \times P(N(1, 0.1) > 1.2)$, where $N(\mu, \sigma)$ denotes a normal random variable with mean $\mu$ and standard deviation $\sigma$, and amounts are in millions.  Now $P(N(1, 0.1) > 1.2)$ is approximately $0.023$ (the threshold is two standard deviations above the mean), so the probability that $r$, left unmitigated, results in a catastrophic loss is approximately $0.01 \times 0.023 = 0.00023$.

For $s$, $P(C(s) > 1.2) = (1.6 - 1.2) / 1.6 = 0.25$, so the probability that $s$ results in a catastrophic loss is $0.01 \times 0.25 = 0.0025$, which is more than ten times higher.  So if the company's goal is to minimize the probability of a catastrophic loss, $s$ is actually a much more important risk to mitigate.
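
These numbers are easy to check directly.  A minimal sketch, assuming scipy is available; amounts are in millions of dollars, and the $1.2$ threshold and both cost distributions are the ones from the example above:

    from scipy.stats import norm, uniform

    CATASTROPHIC = 1.2  # largest absorbable loss, in millions

    # Risk r: 1% chance of an event, cost ~ Normal(mean 1, sd 0.1)
    p_r = 0.01 * norm.sf(CATASTROPHIC, loc=1.0, scale=0.1)     # ~ 0.01 * 0.023

    # Risk s: 1% chance of an event, cost ~ Uniform(0, 1.6)
    p_s = 0.01 * uniform.sf(CATASTROPHIC, loc=0.0, scale=1.6)  # = 0.01 * 0.25

    print(f"P(catastrophic loss from r) ~ {p_r:.5f}")  # ~ 0.00023
    print(f"P(catastrophic loss from s) = {p_s:.5f}")  # = 0.00250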

The practical problem is that, in general, the distributions of the cost functions are unknown, as are their expected values.  But just asking how bad a loss could be and how likely a tail event is can lead to better risk prioritization decisions.  The example above is mathematically extreme.  The uniform distribution is just one big tail.  But it does illustrate what happens when there is a lot of probability mass in the bad part of the cost distribution.

I recently ran across this tragic but beautiful post that illustrates the main point here very well.  The author puts it simply:

    "There are three distinct sides of risk 

  • The odds you will get hit
  • The average consequences of getting hit
  • The tail-end consequences of getting hit" 
The odds of getting hit are $P(r)$, the average consequences are $E(C(r))$, and the tail-end consequences of getting hit are the upper tail of the distribution of $C(r)$.
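
In other words, when the goal is guarding against catastrophe, the quantity to watch is not $L(r) = P(r) E(C(r))$ but the tail probability

$$P(r) \, P(C(r) > t),$$

where $t$ is the largest loss that can be absorbed.  This is exactly the quantity computed for $r$ and $s$ in the example above.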