How to reduce the incidence of data-related bugs?

Hi There,

Our company has been burned quite recently by a rise in ‘data-related bugs’ being deployed to live. By ‘data-related’, I mean bugs that occur only at a specific volume and complexity of data - e.g. the same functionality may work for one user but not another due to the amount and type of linked data.

We have a Selenium automation framework that covers the high-risk areas. Best practices are followed with regard to making the tests atomic, quick, stable and repeatable. The automation environment’s data is managed either by clearing down all the database tables before a test run, or by backing up the database and restoring it to reset the data to the state just before the test run started.
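For anyone wanting to script the backup-and-restore reset described here, a minimal sketch, assuming the database can be snapshotted as a single file (a real setup would use the engine’s own dump/restore tooling; `DB_PATH` and the function names are hypothetical):

```python
import os
import shutil
import tempfile

DB_PATH = "automation.db"  # hypothetical path to the test environment database file

def snapshot_database(db_path: str) -> str:
    """Copy the database file aside so it can be restored after the run."""
    fd, backup_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    shutil.copyfile(db_path, backup_path)
    return backup_path

def restore_database(db_path: str, backup_path: str) -> None:
    """Put the pre-run snapshot back, discarding any data the tests created."""
    shutil.copyfile(backup_path, db_path)
    os.remove(backup_path)
```

A test runner would call `snapshot_database` once before the suite and `restore_database` in teardown, so every run starts from the same known state.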

This has been working well - false positives are low, and a multitude of different issues have been found while the automation has been in use. However, an increasing number of data-specific bugs have slipped through. They have also slipped through manual testing, as our test environments are often not as complex as what users ultimately attempt to do with the system.

How can we prevent more of these types of bugs going out to live?

These are some of the things I have thought of, time and resources permitting…

  • Find a client who experiences these issues more often than others and back up and restore their dataset to a test environment. Use this environment for manual testing and automation (although the automation will still create fresh data on top)

  • Evaluate where multiple layers of complexity can be added in the system, and attempt to set up equivalent data, both with the automation and manually, to test with.

  • Have a fast and reliable rollback procedure should a dodgy build go out.
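The second idea - seeding equivalent layered data - can be scripted so that manual and automated environments get the same complexity profiles. A minimal sketch, with entirely hypothetical entity names and volumes to be adapted to your own data model:

```python
import random

# Hypothetical complexity profiles: (linked addresses, linked orders, linked notes).
# "heavy" and "skewed" model the unusual real-world datasets that expose volume bugs.
PROFILES = {
    "typical": (2, 5, 3),
    "heavy": (50, 200, 100),
    "skewed": (1, 1000, 0),
}

def build_user_fixture(name: str, profile: str, seed: int = 0) -> dict:
    """Return a user record whose linked-data volumes match the named profile.

    Seeding the RNG makes the fixture deterministic, so a failing test can be
    reproduced with exactly the same data.
    """
    random.seed(seed)
    addresses, orders, notes = PROFILES[profile]
    return {
        "name": name,
        "addresses": [f"{i} Test Street" for i in range(addresses)],
        "orders": [{"id": i, "value": random.randint(1, 500)} for i in range(orders)],
        "notes": [f"note-{i}" for i in range(notes)],
    }
```

Running the same high-risk test against each profile turns “works for one user but not another” into an explicit, repeatable test matrix.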

I should mention that as a department we are overworked and under-resourced, which is perhaps the biggest factor of all that needs to be addressed.


Hello @konzy262!

Data-related bugs have a couple of root causes: one is data quality and another is data diversity. From your description, it seems the errors are rooted in data diversity. That is, for all of the great inspections made by your team, some combinations of data still cause the system to break.

I had this scenario a while back. We worked with a vendor to define a schema for the data, then worked with them for a few months to have them deliver the data according to the schema. While we started to have some confidence in the delivered data, we were unsure how the system would behave in production.

We worked with the vendor to scrub production data of sensitive information and feed the scrubbed data to a non-production version of the system. The system was designed to have this data delivered daily.
Every day, we received the scrubbed data early in the morning, and could assess the behavior of the system before the same data (that is, un-scrubbed) was presented to the production system. Some days, we found data issues, corrected them, and verified the fixes before the data reached production. Over time, we identified a handful of data-related bugs while preventing production issues. This was not a long-term solution, but it helped reduce the number of “known bugs” in production.
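A common scrubbing technique consistent with this approach is deterministic pseudonymisation: the same input always maps to the same replacement token, so relationships between records survive scrubbing while the sensitive values do not. A sketch, assuming a simple flat record and a hypothetical list of sensitive fields:

```python
import hashlib

SENSITIVE_FIELDS = {"name", "email", "phone"}  # assumption: fields that must be scrubbed

def pseudonym(value: str, salt: str = "scrub-salt") -> str:
    """Deterministic replacement token: identical inputs map to identical
    outputs, preserving cross-record links (e.g. two orders for one email)."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:10]
    return f"anon-{digest}"

def scrub_record(record: dict) -> dict:
    """Copy a record, replacing sensitive string fields while keeping its
    shape and volume - the properties that actually trigger data bugs."""
    return {
        k: pseudonym(v) if k in SENSITIVE_FIELDS and isinstance(v, str) else v
        for k, v in record.items()
    }
```

Because the scrubbed dataset keeps the same shape and volume as production, it still exercises the data-diversity paths that clean synthetic fixtures miss.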

Production data is a great source of diversity, so this may be a solution for you. I believe your first thought - finding a client where the issues occur more often - could help, especially if they can deliver scrubbed production data. I was unclear how adding complexity (your second thought) would assist.



It’s difficult to say; I think many businesses struggle with data. Historically, data lineage has not been taken seriously enough. It sounds like this could be related to complex state in the business domain. Ideally you have checks well before the UI layer, but I think there is not enough context here to give anyone the confidence to offer good advice.


What I’m also curious about is whether most software should have constraints built in to prevent the data volume and complexity spiraling out of control.

I’m not necessarily talking about obvious examples, like selecting a 25 GB file for upload and the system inevitably grinding to a halt or crashing. I’m talking about any area where there is an unrestricted ability to add new data.

Take addresses, for example. 95% of test cases will have either 1 or 2 addresses against a record. Some user somewhere may decide to add 50, which then causes the record to load slowly and may cause bugs in other parts of the system - e.g. a drop-down that needs to load all those addresses.
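One cheap way to probe this kind of boundary is a volume-parametrised check: run the same operation at 1, 2, 50 and an extreme count, and fail if any volume blows a time budget. A minimal sketch, where `render_addresses` is a hypothetical stand-in for the real page load or API call:

```python
import time

def render_addresses(addresses: list) -> str:
    """Stand-in for the real render; in practice this would be a Selenium
    page load or an API request against the record."""
    return "\n".join(sorted(addresses))

def check_volume_boundary(counts=(1, 2, 50, 500), budget_seconds=1.0):
    """Render the address list at each volume and report any count that
    exceeds the time budget, as (count, elapsed_seconds) pairs."""
    failures = []
    for n in counts:
        addresses = [f"{i} Test Street" for i in range(n)]
        start = time.perf_counter()
        render_addresses(addresses)
        elapsed = time.perf_counter() - start
        if elapsed > budget_seconds:
            failures.append((n, elapsed))
    return failures
```

The interesting part is choosing the counts: the 50-address case above is exactly the “unusual but possible” volume that ordinary test data never contains.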

Should every area of a good software system that allows a user to add data have upper boundaries for the amount of allowed data clearly defined and checked for?

Hello @konzy262!

In my opinion, well designed software provides both the ability to create AND manage its data. When we (designers or testers) discover or determine that some data scenarios can impact system behavior or performance, we should both craft a solution to manage that data and be transparent with our business team members about the data management challenges.

In the case of a large file, a typical design solution is to transfer the file in small chunks with some validation model (e.g., checksums) until the whole file has been conveyed. Enhancements to this design, as we have seen, are background execution or queuing (as used in video streaming).

In the case of addresses, I might question the business case for 50 addresses, but it depends on the domain (I imagine rental management might need 50 or more).
One might apply the large-file model and present X addresses per page (that is, a small amount at a time). I have seen this at the bottom of forum pages, where I can navigate to other records or pages (something like << < 2 3 4 5 > >>). There is also an opportunity to manage the data retrieval: while the user is scanning the first page, the application could be fetching more pages.

It sounds like 50 addresses is not typical in your application, but it is possible. Since this has been discovered, might it be an opportunity to explore alternative presentations with your business team members? For example, when there are more than X addresses, the presentation is simplified.

Should every area of a good software system that allows a user to add data have upper boundaries for the amount of allowed data clearly defined and checked for?

Yes. In my opinion, that upper boundary should balance user value against system capabilities. When the upper boundary is reached, you can add more storage, or ask the user to pay for more addresses. There may be other alternatives.
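Enforcing such a boundary is usually a one-line guard at the point of creation. A minimal sketch, with a hypothetical limit and record shape - the point is that the system refuses cleanly at the boundary rather than degrading later:

```python
MAX_ADDRESSES = 50  # assumption: the agreed upper boundary for this record type

class AddressLimitError(Exception):
    """Raised when a record already holds the maximum number of addresses."""

def add_address(record: dict, address: str) -> dict:
    """Add an address, rejecting additions beyond the agreed boundary so the
    excess volume never reaches the parts of the system it would break."""
    addresses = record.setdefault("addresses", [])
    if len(addresses) >= MAX_ADDRESSES:
        raise AddressLimitError(
            f"Record already has {MAX_ADDRESSES} addresses; the limit must be raised explicitly."
        )
    addresses.append(address)
    return record
```

A clear error at the limit also gives the business a visible signal that a user has an unusual need, which is far easier to act on than a slow-loading record.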


I think this is probably unsatisfying, but it depends on what a business is willing to pay and what its goals are. In a large enough enterprise there are going to be many systems hooked together in some manner. Developers would be responsible for building that validation into every system, and business owners will prioritize new features over data validation across systems. I helped build a home-rolled data/information catalog that illustrates the length constraints of fields in different systems; it would take many hours of effort to truly fix this. I think that at that point we have to take off our QA hats, put on a business-user or software-architect hat, and think about what risks we are trying to prevent across systems and how much we are willing to spend on it, either in monetary terms or in opportunity costs.

I’ve seen this on big enterprise projects with a lot of different systems communicating with each other. A lot of “bugs”, if not most, are caused by data-related issues, like delays in replication or miscommunication with API middleware; I’ve very rarely seen bugs in prod that were introduced by code changes. Some of these issues can’t be avoided, I guess, as the complexity of the project grows, but the approach is to prevent them by planning good architecture for the future.

The first system I ever worked on was a huge data collection tool - in fact, at the outset, it wasn’t a tool, it was a paper form, and our QA role was about ensuring the quality of the 40,000 or so data items we were requiring our clients to provide us with. (This was in the field of utility regulation.) Only as the data collection systems became more complex did software testing grow in importance. The focus moved from ensuring that collected data was soundly gathered according to detailed real-world requirements, to making certain that our systems recorded data correctly, allocated it to the correct place in the master database, and performed any operations on that data (deriving percentages, calculating completion rates) accurately and consistently.

Unlike the OP’s situation, we were not so concerned with data volume (that was a small-‘p’ political issue between our senior management and the regulated companies) as with data quality, compliance with the reporting requirements for each number, and accurate data handling, so that the results of our specialist analysis and calculations were robust, repeatable and, above all, defensible - both in court [including the ‘court of public opinion’] and ultimately in Parliament.

Where I think this is relevant to the OP’s situation is that the QA role involved a lot of work in defining the reporting requirements - what was to be collected, to what level of accuracy and under what degree of auditable scrutiny - and in maintaining the dataset, a year-round process of allowing clients to query and challenge the numbers and our analysis of them. After a few years of this, I found that I had a pretty good grasp of what each of those 40,000 data items ought to look like and how they would change over time (year-on-year), and so what a “typical” dataset ought to consist of. I was then able to build, from scratch, a test dataset that encompassed each of these 40,000 items in such a way as to make them typical enough to test the data collection and uploading tools without those numbers being readily identifiable as belonging to any one client (commercial confidentiality and all that).

That degree of knowledge meant that I reached a stage where I could be confident that any issues with the application weren’t caused by the dataset. And the extent of the work I did with clients meant that I was considered to be the ‘go-to’ person on data issues. There were only two problems with this.

First, I became aware that there was at least one client whose submitted numbers looked very similar to mine in terms of their magnitude and (in particular) the rate of change over successive years. That client turned out to have been falsifying their data returns, resulting in their being fined somewhere over £30 million and subject to a criminal investigation by the Serious Fraud Office.

Secondly, I was the only person in the organisation doing this sort of work, and I did it for fifteen years. By the end of that time, I was feeling pretty jaded, while colleagues both relied on my work and remained ignorant of it. Ultimately, I had to get out of the organisation before I burnt out completely.

So my answer to the question the OP posed is:

  • Understand your data
  • Understand your metadata
  • Understand the operations that data is subjected to; and
  • Don’t try to do it all yourself - otherwise you’ll wake up one morning in a very bad place.