We are close to a 95% pass rate for our nightly regression test runs. However, we always get a 4%-5% failure rate due to timeout and environmental issues. The other problem is that there is no fixed pattern to which 4% will fail. For example, today a particular set of tests fails; tomorrow some other tests fail due to timeout issues.
Is anyone else feeling the same pain?
Any resolutions or suggestions are highly appreciated!
Simple as that: I DISLIKE false positives and flaky tests in my pipelines.
I’d rather pull the false positives and flaky tests out and put them in the manual regression suite.
I don’t want to spend my time debugging why something failed, only to find that the cause was an unstable environment or a timeout.
That’s why we remove such flaky tests (or rewrite them) and test those scenarios manually.
Remove them
I’m not saying you need to remove just one flaky test; if 200 of your 1,500 tests are flaky, remove them and test them manually, or fix them so they are no longer flaky.
It’s not worth the time and effort to check them daily to work out whether each failure is a false positive or not. I’d rather remove 200 flaky tests and test them manually.
Initially, you’re not testing; you’re spending your time coding a separate product (UI automation) that you expect will tell you nothing. See Cem Kaner’s definition of a good test, which goes something like: it reveals new information about the product. If you didn’t find anything new, your test has failed.
Then, instead of stopping there and doing some testing, or building automation that will aid your testing, you check the UI automation daily, going through the failures. More time is spent there again.
Then you don’t stop there either: you want to ‘fix’ the failures and iterate through potential solutions. In the meantime, no testing happens because you’re busy.
Then you want to cover more and automate more, because it’s fast. The coding time increases, maintenance increases, time spent looking at failures increases. But then, is anyone left in the team actually finding problems in the product?
And I’m assuming you’re not spending much time on the 95% of automated UI checks that pass. If no failure occurs, can you be certain you’re even looking for the appropriate things, the ones that could fail and that are important?
I believe the 95% should get more attention. As has been recommended here, delete the flaky stuff. Reduce the time you’re wasting on not doing good testing.
1 - Isolate the component
E.g. you have a scenario about a search bar, but you end up rendering the footer component too?
By rendering only the search bar, you remove the fragility of your search bar scenario with respect to the footer.
2 - Isolate the UI
Are you executing code that is not your target, e.g. business logic? By isolating the UI, you remove the fragility of your UI scenarios with respect to other pieces of code.
3 - TDD
By moving in tiny steps, each driven by a failing check, you will tend to always have a stable suite of checks in a green state. If you put something into the production code that makes the suite unstable, that change will be tiny, easy to identify and easy to revert. The whole problem happens when you make big moves and only validate their stability after putting in all that work - do the contrary: move small and check continuously. (A small sketch of points 2 and 3 follows below.)
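To make points 2 and 3 a little more concrete, here is a minimal pytest-style sketch in Python. Everything in it (`SearchBar`, `StubCatalogue`, the behaviour being checked) is invented for illustration; the point is that the check exercises only the UI-facing code, with the business logic replaced by a stub, and that the check is written before the tiny production step that makes it pass.

```python
# test_search_bar.py - written first (red), then the production code below turns it green.
class StubCatalogue:
    """Stands in for the real business logic so the check exercises only the UI layer."""
    def search(self, term):
        return ["Blue mug", "Blue teapot"]

def test_search_bar_shows_one_line_per_result():
    bar = SearchBar(catalogue=StubCatalogue())
    assert bar.render_results("blue") == "Blue mug\nBlue teapot"


# search_bar.py - the tiny production step that turns the check green.
class SearchBar:
    def __init__(self, catalogue):
        self._catalogue = catalogue  # injected, so checks never touch the real business logic

    def render_results(self, term):
        return "\n".join(self._catalogue.search(term))
```

Because the only collaborator is an in-memory stub, there is nothing in this check that can time out or depend on the environment.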
I would question whether you need that many end-to-end tests. You can check out this blog post from Google where they talk about the need for E2E tests but show how and why you should limit them. If the tests are flaky, I personally would remove them. They aren’t doing much for you if you’re having to go back in and check them constantly anyway, which ruins the point of automation.
If you do need them, then I would still remove them for now, work on isolating the issues and problems, fix them, and re-introduce them to the nightly builds.
Yes, I inherited an automated test suite with this very problem. Reducing the failure rate of these tests was not easy.
First, we rerun any tests that fail - even if a test has a history of failing for no reason. That way we know whether the failures are hiding a bug that may have been introduced.
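If the suite happens to run on pytest, one way to automate that rerun policy is the pytest-rerunfailures plugin; a minimal sketch, with an invented test name:

```python
# Requires: pip install pytest pytest-rerunfailures
import pytest

# Rerun this known-flaky check up to 2 extra times, pausing 5 seconds between attempts,
# so a one-off environment blip does not fail the whole nightly run on its own.
@pytest.mark.flaky(reruns=2, reruns_delay=5)
def test_checkout_banner_is_shown():
    ...  # drive the UI and assert on the banner here (placeholder)

# Or rerun every failing test in the run from the command line:
#   pytest --reruns 2 --reruns-delay 5
```

The run output also flags which tests only passed after a rerun, which is essentially the ‘flaky’ list described below.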
If the tests pass on the rerun, I mark them as ‘flaky’ and aim to reduce their flakiness in the future. Here are a few things I do to reduce flakiness:
Run the test multiple times, at least 20, to get an idea of how often it will fail.
See if there is a common step within the flaky tests where the failure occurs. It’s usually a step within the test that is flaky, rather than the test itself. Fix the step and you may reduce flakiness across multiple tests.
For timing issues, consider introducing waits into the test (not sleeps) - for example, wait for a button to load. A common cause of a flaky test is a page or an element on the page not loading quickly enough, so have the test wait for the element it is about to interact with to finish loading before continuing.
(Check out the section on explicit waits here: Waiting Strategies | Selenium. A minimal sketch follows below.)
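For anyone using Selenium through the Python bindings, swapping a sleep for an explicit wait looks roughly like this; the URL and locator are made up for the example:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.test/search")  # placeholder URL

# Instead of time.sleep(5), poll for up to 10 seconds and continue as soon as the
# button is actually clickable; a TimeoutException is raised only if it never is.
search_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "search-button"))  # placeholder locator
)
search_button.click()

driver.quit()
```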
Without more information about the specific issues, I can’t provide any more pointers on reducing flakiness in tests. All I can say is that reducing the failure rate takes time. Take it one test at a time: investigate, try to improve that test’s pass rate, then move on to the next test. Over time, you will eventually get 99%+ of tests passing each time.
At the moment, on my automated test runs, I get a 100% pass rate about once in every 4 test runs, which doesn’t sound brilliant, but when I first joined the company we were lucky if we had all tests pass once a month. Things are gradually getting better.
What is the functional coverage? Having some tests that are reliable is one thing; however, you have only truly arrived when you can use just those results, and no other acceptance or UX testing results, to authorise a release.
1. Automatically re-running a failed test really only tells you one thing: your product is hard to test. Address that.
2. UI testing is cool and important, but does it cover, end to end, all the little things users hit when they first ever see your platform - that initial, out-of-the-box experience? Turning off new users is not a business win. Aim for coverage, not pass rate.
3. Don’t beat yourself up over flaky tests. They are a weakness, but also a strength: they teach you how to deal with hard-to-test features and hard-to-automate environments. Most testers gradually dumb down their environments to make their flaky tests more stable, until eventually the test environment no longer resembles the real world. Be careful of bubble-wrapping flaky tests. Flaky tests are your Davy lamp (Davy lamp - Wikipedia).
Thank you all for the help. Current test coverage is about 70%, and we will move the flaky tests into a separate test run and stabilise them by redesigning them or converting them into API or unit tests.
1. Make the tests fully hermetic. Is there a network call going out over a network? Run it against a mock API instead. Is it using a database? Set the database up locally with fixed data and tear it down after each test.
In practice I think almost nobody makes end to end tests hermetic. It’s very, very hard. It is a worthwhile goal though, for more reasons than just flakiness.
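A rough sketch of what a hermetic test can look like in a pytest suite; the `app` module, its `get_exchange_rate` function and the tiny schema are all invented stand-ins for whatever your code actually calls:

```python
# Requires: pip install pytest. The `app` module and its functions are hypothetical.
import sqlite3

import pytest

import app  # the code under test (hypothetical)


@pytest.fixture
def db():
    # Local, throwaway database with fixed data: no shared environment to drift or flake.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.execute("INSERT INTO orders VALUES (1, 9.99)")
    yield conn
    conn.close()  # torn down after every test


def test_order_total_in_euros(db, monkeypatch):
    # Replace the outbound network call with a canned answer instead of hitting a real API.
    monkeypatch.setattr(app, "get_exchange_rate", lambda currency: 0.9)
    assert app.order_total_in(db, order_id=1, currency="EUR") == pytest.approx(8.99, abs=0.01)
```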
2. Remove everything resembling a sleep and replace it with an explicit wait. In practice this is fairly easy to do, but sleeps are very common.
Even after you do all of this, however, you will probably still see flakiness. This is where the third step comes in.
3. Identify sources of non-determinism in the code and fix or eliminate them.
Step 3 is really tricky because you either need to be a dev (like me) or you need support from devs to fix these things, and there are always a lot of them. They will include things deep down in the code like:
Looping through a data structure without a deterministic order like a hashmap.
SELECT queries buried deep in the code that don’t have an ORDER BY.
Usage of time (this can often be fixed by mocking time).
Deliberate use of random numbers (this can be fixed by either fixing a seed on test runs or mocking the RNG); a short sketch of the time and RNG fixes follows below.
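To illustrate those last two items, a minimal sketch; `lottery.draw` is an invented function that internally calls `time.time()` and `random.random()`:

```python
import random
from unittest import mock

import lottery  # hypothetical module whose draw() uses time.time() and random.random()


def test_draw_is_deterministic():
    # Pin the clock: every call to time.time() during the test returns the same instant.
    with mock.patch("time.time", return_value=1_700_000_000.0):
        # Pin the RNG: seeding the global generator makes random.random() repeatable.
        random.seed(1234)
        first = lottery.draw()
        random.seed(1234)
        second = lottery.draw()
    assert first == second
```

One caveat: patching "time.time" covers code that does `import time` and calls `time.time()`; if the code does `from time import time`, you have to patch the name where it is used instead.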
The 2nd worst thing is when these things are buried in an open source library. The worst thing is when they are buried in a closed source library.
I’ve managed to achieve 1, 2 and 3 on small projects. I’ve rarely achieved it all on large projects - though that was mostly due to lack of time.