Maintaining large automated API test suites long-term?

Hi everyone! :slight_smile:

While researching automated API testing workflows I recently came across an older discussion here about maintaining large Postman collections, and it really resonated with me:

https://club.ministryoftesting.com/t/what-are-your-strategies-for-working-with-and-maintaining-large-postman-collections/70926

Over the years I’ve worked closely with QA teams in several companies, and something I’ve repeatedly seen is that once automated test suites grow large enough, maintenance starts becoming the real challenge.

Sometimes it’s Postman collections, but I’ve also seen the same thing happen with large Cypress or Selenium suites:

  • tests becoming brittle

  • failures that require constant babysitting

  • CI runs becoming slow or difficult to debug

  • version control conflicts when many people touch the same tests

  • lots of effort spent maintaining the tests rather than testing the system

None of these tools are bad — they solve real problems — but it made me wonder whether the way we structure automated tests might be part of the issue once they reach a certain size.

A while back I started experimenting with a small open-source tool to explore a different approach to API testing, mainly focused on making tests easier to maintain as they grow larger (tests as simple files, version-friendly, CI-friendly, etc.).
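To make the "tests as simple files" idea a bit more concrete, here's a rough sketch of the general shape (purely illustrative: the file format, URL, and runner below are invented for this post, not skivvy's actual syntax):

```python
# Purely illustrative "test as a data file" sketch; NOT skivvy's actual format.
#
# get_user.test.json (a plain file, so it diffs and merges nicely in version control):
# {
#   "name": "fetch existing user",
#   "request": {"method": "GET", "url": "https://api.example.com/users/42"},
#   "expect": {"status": 200, "json": {"id": 42}}
# }

import json
import sys

import requests  # assumed to be available in the test environment


def run_test(path):
    with open(path) as fh:
        spec = json.load(fh)

    req = spec["request"]
    resp = requests.request(req["method"], req["url"])

    expected = spec["expect"]
    ok = resp.status_code == expected["status"]
    # Only the listed fields are checked, so unrelated response changes don't fail the test.
    for key, value in expected.get("json", {}).items():
        ok = ok and resp.json().get(key) == value

    print(("PASS" if ok else "FAIL"), spec["name"])
    return ok


if __name__ == "__main__":
    # e.g. python run_tests.py tests/*.test.json  (works the same locally and in CI)
    results = [run_test(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```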

I’m not trying to promote anything here — I’m genuinely curious how others approach this problem in practice.

What strategies or tools have worked well for you when API test suites start getting very large?
I’m particularly curious about experiences once suites pass hundreds or thousands of tests.

If anyone is curious about the tool I mentioned, it’s here:

GitHub:
https://github.com/hyrfilm/skivvy

But the main thing I’d really like to learn is how people here deal with long-term test suite maintenance.


Honestly, the maintenance problem is underrated in almost every team I’ve worked with. Everyone celebrates when the suite hits 500 tests; nobody talks about what happens six months later when half of them are flaky and nobody wants to touch them.

The biggest thing that helped us was treating test files like production code: proper code review, naming conventions, and a rule that if a test fails more than twice for non-product reasons it gets triaged immediately, not just re-run. Flaky tests are technical debt, and most teams let them pile up until the whole suite loses credibility. We also started tracking test run history more seriously (we were using Tuskr at the time), and just having visible run logs made it a lot easier to spot patterns in failures before they became a real problem.
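Even something as small as appending every outcome to a log goes a long way. Here's a rough pytest-flavoured sketch of the kind of record I mean (the file name and fields are made up, and this isn't Tuskr, just the same idea in miniature):

```python
# conftest.py - append one record per test so flaky patterns become visible over time.
import json
import time

RUN_LOG = "test-run-history.jsonl"  # hypothetical log file, one JSON record per line


def pytest_runtest_logreport(report):
    # Only record the test body itself, not setup/teardown phases.
    if report.when != "call":
        return
    with open(RUN_LOG, "a") as fh:
        fh.write(json.dumps({
            "test": report.nodeid,
            "outcome": report.outcome,        # "passed", "failed", "skipped"
            "duration": round(report.duration, 3),
            "timestamp": time.time(),
        }) + "\n")
```

A few weeks of that and the "fails twice for non-product reasons" rule becomes much easier to enforce, because the evidence is sitting in one file.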

The other shift that made a real difference was separating smoke, regression, and exploratory automation into clearly scoped suites rather than one giant run. CI doesn’t need to run everything on every commit. Once we did that, the feedback loop got faster and debugging failures stopped being a full-time job.
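For what it's worth, with pytest the split can be as simple as markers plus a `-m` filter in CI (the endpoints and markers below are only examples, and the markers would need registering in pytest.ini):

```python
import pytest
import requests

BASE_URL = "https://api.example.com"  # hypothetical service


@pytest.mark.smoke
def test_health_endpoint():
    # Fast check, safe to run on every commit.
    assert requests.get(f"{BASE_URL}/health").status_code == 200


@pytest.mark.regression
def test_order_lifecycle():
    # Slower end-to-end flow; runs nightly or pre-release rather than per commit.
    order = requests.post(f"{BASE_URL}/orders", json={"sku": "abc", "qty": 1}).json()
    assert requests.get(f"{BASE_URL}/orders/{order['id']}").status_code == 200
```

CI then runs `pytest -m smoke` on every commit and the wider scopes on a schedule, which is essentially the split described above.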


WABBING a bit here (work-avoidance behaviour). I’m supposed to be doing this https://www.youtube.com/watch?v=kCYo2gJ3Y38 and procrastinating, when this topic came up on my MOT feed. So. How do you keep at it when you have a large regression suite and it feels like every time a new API or message gets added, you go and write two tests for the new API call, and then realise you missed something, just like Jen did? It slows you down, because you have to size all work: if you have to change one API you know that should take an hour at most, maybe more if there are setup and side effects. But now you have all this fresh discovery to deal with. And in reality I find myself having to either do a bunch of refactoring or do some scenario setup work that I had managed to put off, because originally only one API needed to be tested in this specific way.

So yeah, I like the skivvy tool. I had actually heard about it a while back while googling for something to help me with a JSON API. Not all my APIs are JSON, and at the time I never looked deeper, but the skivvy pattern really makes sense to me now. Anyway, I’m wondering: do I take that pattern, apply it to all of my tests, and rewrite them in a new meta-language? Probably not. I’d love to, especially for the C++ API I’m just now embarking on testing. C++ has many drawbacks, at least until we hit C++26(?) and get reflection, which will make parameter and type discovery much easier to build meta-languages and scripts against. Most test frameworks for C/C++ use macros, which are metaprograms but have so many disadvantages that really get in the way when you are writing code that repeats itself. It just creates state leakage all over the place. For now I’m stuck with very specific kinds of guns, scalpels, falling anvils and bearskins, and even lambdas, when testing a ‘C’ API, and my get-out-of-jail may well be to use a C# wrapper, just to avoid the way C/C++ type-safety is helpful, but also unhelpful, when it comes to data-driven testing. This is an inspiring tool though Jonas, very inspiring.

Anyway, welcome to the community @hyrfilm. I fully agree: a tool-language that lets you write tests using metadata instead of being explicit in code is worth doing every time.


To maintain large API test suites over the long term, use a Page Object Model (POM) or an action-based framework to keep the code modular and reusable. Additionally, employ environment-specific configurations and mocking services to keep tests reliable and to minimise maintenance effort when data changes.
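A minimal sketch of that idea in Python, assuming a requests-based suite (the class, endpoints, and environment variable are only examples):

```python
import os
import requests


class UserActions:
    """Action-style wrapper: tests call these methods instead of raw HTTP.

    If a URL or payload shape changes, only this class needs updating,
    not every test that touches users.
    """

    def __init__(self, base_url=None):
        # Environment-specific configuration lives outside the tests.
        self.base_url = base_url or os.environ.get("API_BASE_URL", "http://localhost:8080")

    def create_user(self, name):
        resp = requests.post(f"{self.base_url}/users", json={"name": name})
        resp.raise_for_status()
        return resp.json()

    def get_user(self, user_id):
        resp = requests.get(f"{self.base_url}/users/{user_id}")
        resp.raise_for_status()
        return resp.json()


def test_created_user_is_retrievable():
    users = UserActions()
    created = users.create_user("Ada")
    assert users.get_user(created["id"])["name"] == "Ada"
```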

Hi Conrad,

thanks for the kind words. :slightly_smiling_face:

That description of a "1 hour change turning into something much bigger" really resonates.

One situation I’ve run into a few times is when a change forces you to go back and rethink not just the tests, but also fixtures or mocks — and suddenly you’re in this slightly uncomfortable position where you’re effectively rebuilding the parachute (the tests) at the same time as the implementation.

That’s usually where things start to feel risky, because the tests stop being a stable safety net and become part of the change itself.

Your point about scenario/setup complexity ties into that as well — it feels like a lot of coupling tends to accumulate there over time.

Out of curiosity, in your case, is it usually the setup/mocks that cause that cascade, or the assertions themselves? And since I take it most of your development is in C++: when it comes to APIs between processes/machines, are you mostly dealing with things like Protobuf/gRPC rather than more traditional HTTP/JSON APIs between services?

The rule about triaging flaky tests immediately is interesting; I’ve seen a lot of teams do the opposite (just re-run and hope it passes), and over time the whole suite kind of loses credibility. Actually measuring that stuff and having some process to deal with it makes a lot of sense. How big is the QA team where you are? The processes in place sound like something out of a somewhat mature org?

The test run history part is also something I haven’t seen used that often, but it makes a lot of sense. Being able to spot patterns in failures feels like it would help catch issues before they turn into noise.

I also like the separation into smoke / regression / exploratory — I’ve seen similar things make a big difference in keeping feedback loops manageable.

One thing I’ve been wondering about is where most of the flakiness tends to come from in practice.

Is it usually timing / environment issues, or do you also see cases where tests become brittle because they assert more than they really need to? I’ve seen both patterns, but I’m curious what it looks like in your experience.

That makes sense — I can see how a lot of that ties into treating testability as something you design into the system, rather than something you try to add afterwards.

At the same time, I’ve been wondering how far that gets you once the suite itself becomes large.

Even with well-structured APIs, I’ve seen situations where changes still cascade into a lot of test updates, especially when setup, mocks, or assertions are tightly coupled to specific responses.
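A small example of the kind of coupling I mean (the endpoint and fields are invented):

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical service


def test_user_brittle():
    # Couples the test to the full response shape: adding or renaming any field
    # breaks this, even when the behaviour under test hasn't changed.
    body = requests.get(f"{BASE_URL}/users/42").json()
    assert body == {"id": 42, "name": "Ada", "created_at": "2024-01-01T00:00:00Z"}


def test_user_targeted():
    # Asserts only what this test is actually about, so unrelated response
    # changes don't cascade into test updates.
    resp = requests.get(f"{BASE_URL}/users/42")
    assert resp.status_code == 200
    assert resp.json()["name"] == "Ada"
```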

I think there are two factors in play when it comes to keeping large API test suites current.

There’s a culture and workflow component to it: what do we do when a large API test suite breaks? Do we go to production with failing tests, or does the failure block the pipeline so we have to fix the tests before we can ship?

The other side is more technical: how big is the maintenance burden when the test suite does break? Is everything hard-coded in multiple places, or do we have a quick way to change things?

For these two, in my work: we try to keep a culture where pipelines must pass, and where we don’t fix the pipeline by turning off the failing tests (provided the failing tests add value by covering requirements or risks not already covered elsewhere; if the failing tests only duplicate coverage, removing them rather than turning them off would seem to be indicated). As with all cultural norms this can be something of a moving target (especially in an organization with 60+ DevOps teams), but this is what we strive for.

On the technical side we’re mostly ‘automation as code’, predominantly Java for API automation. And we’ve been slowly shifting from all custom-coded to making more and more use of things like openapi-generator libraries (including a custom in-house one) to generate the basic test clients, so there’s less work on boilerplate and less fixing of broken boilerplate. This also encourages good practice, as the generated clients expect a lot of things to be handled through properties rather than hard-coded literals. Java (or any strongly typed language) also has the advantage that it breaks pretty early if a new version of the API spec results in an incompatible generated client, so even where no explicit ‘contract testing’ is involved this gives a lot of the benefits of contract testing, because the ‘were we expecting breaking changes to the API’ conversations happen far earlier.


I like that analogy - repairing the parachute while it’s still in use. I’m really going to have to put that onto a PowerPoint slide. Although, to be fair, it’s never truly that simple, because the developers should not be able to inject bugs faster than the bit of time you take out to patch the parachute up ahead of the next big jump. But it definitely feels like the end of the world at the time. No amount of branching workflow can really take away the stress.

I have a JSON API that roughly maps onto a ‘C’ API. "Roughly": there is no actual mapping, the workflow is just similar, but the data is not the same shape. It’s never simple though: the JSON API actually has 3 variants, where the 3 servers are written in C#, and the messages and flows differ. It’s a simple single-threaded TCP socket link, nothing clever. But 90% of the bugs are in the simple everyday things like field validation. I tackled the biggest of the JSON APIs first, and now regret not generalising things, because the other 2 server apps share a lot of messages and I don’t have a good story for grouping not just the test cases, but for re-using the same messages. I’m going to have to write some kind of generator and some kind of fixture to accept it all. At least Python test code is very flexible.
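Roughly the sort of generator/fixture I have in mind, sketched with pytest parametrization (all the message names, ports, and fields below are placeholders, and it assumes newline-delimited JSON replies):

```python
import json
import socket

import pytest

# One shared pool of messages instead of three copies, one per server variant.
SHARED_MESSAGES = {
    "login":  {"type": "login", "user": "test", "token": "abc"},
    "status": {"type": "status"},
}

SERVERS = {
    "variant_a": ("localhost", 9001),
    "variant_b": ("localhost", 9002),
    "variant_c": ("localhost", 9003),
}


def send_json(host, port, message):
    # Simple single request/response over a TCP socket, as described above.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall((json.dumps(message) + "\n").encode())
        return json.loads(sock.makefile().readline())


@pytest.mark.parametrize("server", list(SERVERS.values()), ids=list(SERVERS.keys()))
@pytest.mark.parametrize("message", list(SHARED_MESSAGES.values()), ids=list(SHARED_MESSAGES.keys()))
def test_message_is_accepted(server, message):
    reply = send_json(*server, message)
    assert reply.get("ok") is True  # placeholder field; each variant could override this
```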

The ‘C’ API has only 2 variants and they are much more similar to each other. I’m only just getting started there, so I have space to do those heavy parachute re-stitching exercises. I test the JSON in Python, and the ‘C’ in C++ of course, and that’s where I have run into a lot of tight coupling problems. Luckily no Protobuf/gRPC, although RPC does come into it.