Does anyone calculate a degree of certainty on their results? For our deliveries we have extremely large data sets which need to be checked - 10s of thousands of units which can’t be checked automatically as of yet (and maybe never). I’d like to be able to work out how many ‘units’ would need checking to get a 95% level of confidence (for example) that the delivery is good. Would anyone be able to point me at some resources to explain how to go about this please?
I’m wondering whether the rule of three would work for this… but ideally I’d like to know how many units we’d need to test in order to get 95% confidence in our results.
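For a concrete sense of what the rule of three gives you, here is a minimal sketch. It assumes units are sampled at random from the delivery and pass/fail independently; the function names are mine, not from any library:

```python
import math

def rule_of_three_upper_bound(n_checked: int) -> float:
    """If n_checked randomly sampled units all pass, the approximate
    95% upper bound on the true failure rate is 3 / n."""
    return 3.0 / n_checked

def sample_size_for_failure_rate(max_failure_rate: float) -> int:
    """Units to check (all of which must pass) to be ~95% confident
    the true failure rate is below max_failure_rate."""
    return math.ceil(3.0 / max_failure_rate)

# 300 passing samples -> 95% confident the failure rate is under 1%
print(rule_of_three_upper_bound(300))
# To claim a failure rate under 5% at 95% confidence: 60 passing samples
print(sample_size_for_failure_rate(0.05))
```

Note the rule only applies when zero failures are observed in the sample; a single observed failure calls for a proper binomial confidence interval instead.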
For each one of your test results, try to define why it was selected, and then match up all the production transactions that fit that definition.
Testing techniques will help you narrow down from hundreds of thousands of transactions to a handful of variations; boundary value analysis and pairwise testing are two examples.
Each test result represents one or more production transactions. If time is available, you could download the entire history of production transactions and count the volume of production transactions that each test result mimics, to give a percentage covered.
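That counting idea can be sketched in a few lines. This is a hypothetical illustration only: the transaction fields and the predicates describing each test case are made up, and in practice the predicates would come from your "why was this selected" definitions above.

```python
# Hypothetical data: each production transaction is a small record,
# and each test case carries a predicate describing which
# production transactions it represents.
production = [
    {"amount": 10, "channel": "web"},
    {"amount": 5000, "channel": "web"},
    {"amount": 20, "channel": "branch"},
    {"amount": 99999, "channel": "web"},
]

test_cases = {
    "small_web": lambda t: t["channel"] == "web" and t["amount"] < 100,
    "large_web": lambda t: t["channel"] == "web" and t["amount"] >= 100,
}

# Percentage of production history each test result mimics.
for name, matches in test_cases.items():
    covered = sum(1 for t in production if matches(t))
    print(f"{name}: {covered / len(production):.0%} of production covered")
```

The gaps are as informative as the coverage: any production transaction matched by no test-case predicate is a variation you never exercised.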
You need to be careful trying to calculate statistics from test results, though. Test results are not real; they are a representative selection of expected production transactions.
Hope this helps… great question.
Not sure my bank would agree with a ‘rule of three’ as an acceptable way to decide the error rate is low enough, but yes very good question.
For example, in a fictitious system, I know that under certain conditions around 1 in 5 first attempts to transact fail on a certain platform. However, one really wants 99.9% of transactions to complete, so every platform will keep retrying for about a minute. If I now tell you that ‘almost’ all of the ones that still failed after a minute did so because the network really was down or the resource really was not available, then the ‘almost’ that remains becomes the only number we are interested in. Transaction failure, by this reduction, starts to become acceptable. Inaccurate transactions are another matter: on those my bank manager would beg to differ, and would be asking for 200% accuracy.
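The arithmetic behind that reduction is worth making explicit. Assuming (a simplification) that retries are independent and each attempt fails with the same 1-in-5 probability, the chance a transaction is still failing after k attempts is 0.2^k:

```python
# Independent retries with a 1-in-5 per-attempt failure rate:
# the residual failure probability shrinks geometrically.
p_fail = 0.2
for attempts in (1, 2, 3, 4, 5):
    print(f"after {attempts} attempts: {p_fail ** attempts:.4%} still failing")
```

By the fifth attempt the residual failure rate is below 0.1%, which is why a minute of retries can reach the 99.9% completion target, and why the interesting failures are the ones that survive the retries.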
Maybe quantifying or re-classifying the data errors you suspect can exist into different cause areas will help attack each one separately. Naming is often step 1 of communication.
Thank you Daniel, that’s helped me shift my thinking slightly, which I think is going to be beneficial. There are a number of complexities with this which mean the more traditional types of testing aren’t as easy to apply. The data is layered, and each ‘unit’ can take anywhere from 20–30 seconds to 25 minutes to test. Also, currently the units between the layers are different (geospatial), making it very difficult to do a vertical slice, although that’s a WIP. I do currently use boundary value analysis as much as possible, but it doesn’t provide enough coverage on its own. Pairwise doesn’t really work for me, due to the type of data (satellite) that we’re using.
Hi Conrad, thank you for your reply.
I can relate to your example, sometimes we have multiple failures due to a data layer not being available, but if this is the case we generate that data layer and then we still need to do the testing. I need/want to be able to say that we are 95% confident that 90% of our results are good, or something along those lines.
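A statement like “95% confident that 90% of our results are good” maps directly onto the success-run theorem from reliability sampling. As a rough sketch (assuming units are sampled at random and the check is pass/fail; the function name is mine):

```python
import math

def zero_failure_sample_size(confidence: float, reliability: float) -> int:
    """Success-run theorem: number of randomly sampled units that must
    ALL pass to claim, at the given confidence, that at least
    `reliability` of the whole delivery is good.
    n = ln(1 - confidence) / ln(reliability)."""
    return math.ceil(math.log(1 - confidence) / math.log(reliability))

# "95% confident that 90% of our results are good"
print(zero_failure_sample_size(0.95, 0.90))  # 29 units, all passing
```

Note the sample size depends on the confidence and reliability targets, not on the delivery size, which is what makes this tractable for tens of thousands of units; but it only holds for zero observed failures and genuinely random sampling.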
I have done as you have suggested, and have identified the variables which could fail, but due to the type of data, being confident that there are no failures for variable x in one area of data does not mean there won’t be failures for that variable in another area.
Sounds like a fascinating dataset.
I have heard that geolocation can sometimes involve complex calculations when converting from one scheme to the next.
To be fair, 95% can be a low bar. But consider, for example, something like an AI-driven vehicle, where the chance of an error will always be large: a redundant system is in place to prevent the majority of the damage, and redundancy to self-right is present too. It is common for a system to ‘slice and dice’ a problem this way.
I’m less of a fan of the ‘rule of three’, but it’s valid for most problems.
I’m pointing to a statistician here, because this is a stats question more than a testing question. John knows his stuff on stats, so worth a deeper read.
That’s great - thank you I will have a read.
Only if it is a part of the requirements or following a standard.
An often tangential, but clever, blogger: Is the Variability Variable?