Error Budgets: Do You Use Them?

I read a blog post recently that introduced a term I hadn’t heard before: “Error Budgets”.

I was particularly interested by this part:

Error budgets tell the entire story of the impact of a bug. You can see how many users were impacted and how much error budget was burned. You can also see how long it takes for the incident to be resolved and the burn to stop.
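The “burn” described in that quote can be sketched with a small calculation. This is a hypothetical example (the 99.9% SLO, 30-day window, and incident durations are made up for illustration), not any particular tool’s method:

```python
# Minimal sketch of error budget "burn" against a 99.9% availability SLO
# over a 30-day window. All numbers are illustrative.

SLO = 0.999                                   # target success rate
WINDOW_MINUTES = 30 * 24 * 60                 # 30-day window, in minutes
budget_minutes = (1 - SLO) * WINDOW_MINUTES   # allowed "bad" minutes (43.2)

# Hypothetical incidents: (description, minutes of full outage)
incidents = [
    ("bad deploy", 12.0),
    ("db failover", 9.5),
]

burned = sum(minutes for _, minutes in incidents)
remaining = budget_minutes - burned

print(f"budget: {budget_minutes:.1f} min, burned: {burned:.1f} min, "
      f"remaining: {remaining:.1f} min ({remaining / budget_minutes:.0%})")
```

The nice property is exactly what the quote says: each incident’s duration maps directly onto budget consumed, so “how bad was it?” becomes a number rather than an argument.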

Do you have Error Budgets? (Known by another name perhaps) And if so, what struggles have you found while implementing them?


I’ve never heard it put into such complete terms or explained in such detail. Great article.

I definitely do Capacity testing to ensure new changes won’t breach our promises about how quickly we will deliver messages. We have what we call KPIs, but they are similar to SLAs. One might be something like: we deliver in under X seconds 95% of the time, or only so many messages within a given period will be slower than X seconds to deliver.
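A KPI of that shape is easy to check mechanically. A minimal sketch, assuming a hypothetical 2-second threshold and a 95% target (the sample latencies are invented):

```python
# Check a delivery-time KPI like "95% of messages delivered in under
# 2 seconds" against a sample of measured latencies. Numbers are made up.

def meets_kpi(latencies_s, threshold_s=2.0, target=0.95):
    """True if at least `target` fraction of deliveries beat the threshold."""
    within = sum(1 for t in latencies_s if t < threshold_s)
    return within / len(latencies_s) >= target

sample = [0.4, 0.8, 1.1, 1.9, 2.5, 0.7, 1.2, 0.9, 1.5, 0.6]
print(meets_kpi(sample))  # 9 of 10 under 2s -> 0.90 < 0.95 -> False
```

In a capacity test you would run this over the latencies measured under load, before and after a change, to see whether the change eats into the promise.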

In the past I’ve also used “Defect density” to understand if we can make changes to a component or if we need to fix bugs before adding anything new.



When I worked as an engineer, we used Error Budgets as a method to understand the operating parameters of an electronic circuit. For example, we created products that measured electrical parameters such as current (amps) and voltage (volts). The circuit used components that had specific nominal values, with a tolerance around each value. These tolerances would be combined to create the Error Budget.

Say you wanted to measure the amount of water in a glass and your measurement device measured in liters but had a tolerance or error of 5%. That is, when you measure a liter of water, the device could display anywhere from 0.95 to 1.05 liters.
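The water-glass example above is simple enough to sketch directly. The 5% figure comes from the example; the worst-case tolerance stacking shown for multiple components is my own illustrative addition:

```python
# The +/-5% measurement example: a device reading 1 liter could display
# anywhere from 0.95 to 1.05 liters.

def reading_bounds(true_value, tolerance):
    """Worst-case displayed range for a single measurement."""
    return true_value * (1 - tolerance), true_value * (1 + tolerance)

low, high = reading_bounds(1.0, 0.05)
print(low, high)  # 0.95 1.05

# For a chain of components, a simple worst-case error budget just adds
# the individual tolerances (illustrative, not a full analysis).
def stacked_tolerance(tolerances):
    return sum(tolerances)

print(stacked_tolerance([0.05, 0.01, 0.02]))  # 0.08 -> +/-8% worst case
```

This is the same idea as the software version: you decide up front how much total error is acceptable, then check whether the pieces, combined, stay inside it.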



I had the fortune of working in a company where the direct monetary impact of a bug was easy to calculate. One user produces €100/hour, so if a bug caused a service outage for one user for one hour, the direct cost was €100. This was very liberating, as we had very constructive discussions about the priority of bugs and also the efficiency of the testing process.
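That calculation is just rate × users × duration. A minimal sketch (the €100/hour figure is from the example above; the larger incident is a hypothetical):

```python
# Direct cost of an outage: each affected user produces a fixed amount
# of value per hour, so lost value is rate * users * hours.

def outage_cost(rate_per_user_hour, users_affected, outage_hours):
    return rate_per_user_hour * users_affected * outage_hours

print(outage_cost(100, 1, 1))     # the example from the post: 100 (euros)
print(outage_cost(100, 50, 0.5))  # hypothetical bigger incident: 2500
```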

One problem with all these systems, though, is that not all bugs are equal, so you cannot discuss only the number of bugs. A typo on a manual page that only 1 in 100 users visit is not the same as the interest calculation being off by 0.1%. And then you also have the goodwill/bad-will effect: 100 minor bugs will individually not register as a “big problem”, but working with a service that has 100 minor bugs decreases your trust in the product and the brand, which might cause you to cancel the service or choose a different product.

But as with the Black Swan project, the value might not necessarily lie in the correctness of the calculation but in the insights that come from the process of calculating it, and it may still be more useful to settle on imprecise monetary values than on abstract numbers.

Currently we do not have error budgets, but rather checkpoints: “no bugs beyond this point.”