I recently got a paperback version of “The SRE Book” (because I already spend enough time in front of screens), and I re-read the section “Motivation for Error Budgets” in chapter 3.

This is, in my opinion, one of the central elements of DevOps, because it fundamentally changes the way Dev and Ops view risk and aligns their incentives.

How?

The “classical” conflict of stability (Ops) vs. change velocity (Dev) is resolved by acknowledging that 100% uptime is, in most cases, neither achievable nor desirable.

We can’t eliminate risk, only minimize it - and minimizing it comes at increasing marginal cost¹ and decreasing marginal utility².

With this in mind, we define a different (lower) availability target for our service (it does not matter yet how exactly we measure availability), which in turn gives us the accepted “non-availability” of our service (1 - availability_target). Multiplying this number by the length of a chosen timeframe (e.g. 3 months), we get the error budget for this timeframe.
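To make the arithmetic concrete, here is a minimal Python sketch (the function name and the 90-day window are my own illustration, not from the book):

```python
def error_budget_seconds(availability_target: float, window_days: int) -> float:
    """Allowed downtime in seconds for a given availability target and timeframe."""
    window_seconds = window_days * 24 * 60 * 60
    return (1 - availability_target) * window_seconds

# A 99.9% availability target over a ~3-month (90-day) window:
budget = error_budget_seconds(0.999, 90)
print(f"{budget:.0f} seconds = {budget / 3600:.1f} hours of allowed downtime")
# -> 7776 seconds = 2.2 hours
```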

So far, this is just a number we have thrown out there, without much actual meaning - the important part is this: any non-availability during this timeframe is counted against the error budget, which is shared between Dev and Ops. As long as we are within our error budget, we can deploy changes (take risks). Once our error budget is depleted (or nearing depletion), we cannot deploy anymore until the end of the chosen timeframe (3 months in this example).
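A minimal sketch of such a deploy gate, assuming we track consumed downtime elsewhere (all names here are hypothetical, not from the book):

```python
def may_deploy(downtime_spent_s: float, budget_s: float, safety_margin: float = 0.9) -> bool:
    """Allow deploys only while consumed downtime stays below a margin of the budget."""
    return downtime_spent_s < budget_s * safety_margin

# With the ~2.2 h budget from above and 1.5 h (5400 s) of downtime already spent:
print(may_deploy(downtime_spent_s=5400, budget_s=7776))  # -> True
```

In practice, what counts as “nearing depletion” (the safety margin here) is itself a policy choice Dev and Ops agree on.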

With this setup, Dev now has an incentive to choose the “correct” level of risk they are willing to take without jeopardizing the overall availability target. Dev and Ops therefore both work to “spend” as much of the error budget on product improvements / changes as possible, while maintaining the agreed-upon service availability.

I strongly recommend reading the whole chapter, “Embracing Risk” - or better yet, the whole book.


  1. However many redundancies we build - there is always a slim chance that everything fails at the same time. Reducing that risk from 0.1 to 0.01 (90% to 99% availability) is probably easier/cheaper than reducing it from 0.001 to 0.0001 (99.9% to 99.99% availability) ↩︎

  2. If a user’s internet connection only has an uptime of 99.9%, there is basically no way they can notice the difference between our service being available 99.99% or 99.999% of the time. Even if the difference were noticeable - reducing downtime from 10 hours to 1 hour per month will most likely be appreciated, while a reduction from 1 minute to 6 seconds will not make much of a difference to users. ↩︎