Calculating Certainty

Does anyone calculate a degree of certainty on their results? For our deliveries we have extremely large data sets which need to be checked - 10s of thousands of units which can’t be checked automatically as of yet (and maybe never). I’d like to be able to work out how many ‘units’ would need checking to get a 95% level of confidence (for example) that the delivery is good. Would anyone be able to point me at some resources to explain how to go about this please?


I’m wondering whether the rule of 3 would work for this… but ideally I was wanting to know how many units we’d need to test in order to get a 95% confidence in our results.
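For anyone unfamiliar with it, the rule of three says that if you check n units and find zero failures, the failure rate is below roughly 3/n with 95% confidence. Here is a minimal sketch comparing that shortcut with the exact zero-failure bound. The function names are made up for illustration, and it assumes the checked units are a random sample from a much larger delivery and that failures are independent, which may not hold for layered geospatial data.

```python
def rule_of_three_upper_bound(n_checked: int) -> float:
    """Approximate 95% upper bound on the failure rate after n clean checks."""
    if n_checked <= 0:
        raise ValueError("n_checked must be positive")
    return 3.0 / n_checked


def exact_upper_bound(n_checked: int, confidence: float = 0.95) -> float:
    """Largest failure rate p that still satisfies (1 - p)**n >= 1 - confidence."""
    if n_checked <= 0:
        raise ValueError("n_checked must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_checked)


if __name__ == "__main__":
    # Hypothetical sample sizes, all assumed to have produced zero failures.
    for n in (30, 100, 300, 1000):
        approx = rule_of_three_upper_bound(n)
        exact = exact_upper_bound(n)
        print(f"n={n}: rule of three {approx:.4f}, exact {exact:.4f}")
```

The two bounds agree closely once n is in the hundreds. Both only apply when every sampled unit passes.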


For each of your test results, try to define why it was selected, and then match up all the production transactions that fit that definition.

Testing techniques will help you narrow down from hundreds of thousands of transactions to a handful of variations. Boundary value analysis and pairwise testing are two such techniques.

Each test result represents one or more production transactions. If time is available, you could download the entire history of production transactions and count the volume of production transactions that each test result mimics, to give a percentage covered.
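The counting idea above could be sketched roughly like this. It assumes every production transaction can be labelled with a category that a test result either does or does not represent. The category labels and the function name are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def coverage_percent(production_categories: Iterable[str],
                     tested_categories: Set[str]) -> float:
    """Percentage of production transactions whose category has a test result."""
    counts = Counter(production_categories)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for category, n in counts.items()
                  if category in tested_categories)
    return 100.0 * covered / total


if __name__ == "__main__":
    # Hypothetical history: one category label per production transaction.
    history = ["boundary_low", "typical", "typical", "boundary_high", "rare_case"]
    tested = {"boundary_low", "typical", "boundary_high"}
    print(f"{coverage_percent(history, tested):.1f}% of production volume covered")
```

This gives coverage by volume, not a statistical confidence level, so it answers a slightly different question from the original one.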

You need to be careful trying to calculate statistics from test results. Test results are not real; they are a representative selection of expected production transactions.

hope this helps… great question


Not sure my bank would agree with a ‘rule of three’ as an acceptable way to decide the error rate is low enough, but yes very good question.

For example, in a fictitious system, I know that under certain conditions around 1 in 5 first attempts to transact fail on a certain platform. However, one really wants 99.9% of transactions to complete, so every platform will keep trying for about a minute to complete. Now suppose I tell you that ‘almost’ all of the ones that failed after a minute failed because the network really was down or the resource really was not available. The remainder outside that ‘almost’ becomes the only number we are interested in. Transaction failure, by this reduction, starts to become acceptable. Inaccurate transactions are another matter: on those my bank manager would beg to differ and would be asking for 200% accuracy.

Maybe quantifying or re-classifying the data errors you suspect can exist into different cause areas will help attack each one separately. Naming is often step 1 of communication.


Thank you Daniel, that’s helped me shift my thinking slightly, which I think is going to be beneficial. There are a number of complexities with this which mean the more traditional types of testing aren’t as easy to apply. The data is layered, and it can take anywhere from 20-30 seconds to 25 minutes to test each ‘unit’. Also, the units currently differ between the layers (geospatial), making it very difficult to do a vertical slice, although that’s a WIP. I do currently use boundary value analysis as much as possible, but it doesn’t provide enough coverage on its own. Pairwise doesn’t really work for me, due to the type of data (satellite) that we’re using.
Thanks again :slight_smile:


Hi Conrad, thank you for your reply.
I can relate to your example, sometimes we have multiple failures due to a data layer not being available, but if this is the case we generate that data layer and then we still need to do the testing. I need/want to be able to say that we are 95% confident that 90% of our results are good, or something along those lines.
I have done as you suggested and have identified the variables which could fail, but due to the type of data, being confident that there are no failures for variable x in one area of data does not mean there won’t be failures for that variable in another area.
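A statement like "95% confident that 90% of our results are good" can be turned into a sample size if we assume units are drawn at random and every sampled unit passes. We need the smallest n where (good fraction)^n drops below 1 minus the confidence. This is only a sketch under those assumptions, with a made-up function name, and not a substitute for a statistician's advice.

```python
import math


def units_needed(min_good_fraction: float, confidence: float = 0.95) -> int:
    """Smallest n such that n clean random checks support the claim
    'at least min_good_fraction of units are good' at the given confidence."""
    if not 0.0 < min_good_fraction < 1.0:
        raise ValueError("min_good_fraction must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Solve min_good_fraction**n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(min_good_fraction))


if __name__ == "__main__":
    for target in (0.90, 0.95, 0.99):
        print(f"{target:.0%} good at 95% confidence: "
              f"{units_needed(target)} clean units")
```

For the 90% target this comes to 29 units with zero failures, and 299 units for a 99% target. If any sampled unit fails, the zero-failure calculation no longer applies. And since failures in one area may not predict failures in another, it may make more sense to apply the calculation per area than across the whole delivery.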


Sounds like a fascinating dataset.

I have heard that geolocation can sometimes involve complex calculations when converting from one scheme to the next.


To be fair, 95% can be a low bar. But take, for example, something like an AI-driven vehicle, where the chances of an error will always be large. A redundant system is in place to prevent the majority of the damage, and there is also redundancy to self-right, so it is common for a system to ‘slice-and-dice’ a problem.

I’m less a fan of ‘rule of three’ but it’s very valid for most problems.

I’m pointing to a statistician here, because this is a stats question more than a testing question. John knows his stuff on stats, so worth a deeper read.


That’s great, thank you. I will have a read.

Only if it is a part of the requirements or following a standard.

A blogger who is often tangential to my travels, but clever: Is the Variability Variable?
