I'm back again with another question that I'd love to get people's thoughts on. Specifically:
What makes automated tests/checks flakey?
I think there are many reasons why tests/checks become flakey, but I'd love to hear individual perspectives and experiences. Do you have any horror stories you could tell about facing flakey tests/checks?
When I worked for Elastic, we had automated "functional tests", though these were not really functional tests but rather e2e integration tests. We had a custom test harness that spun up Elasticsearch, spun up the application under test (Kibana), added test data and did some setup, and then started the Selenium-style tests. As you can imagine, tests could fail at any point during any of these processes.
Unexpected latency starting the server.
Server crash.
Unexpected latency setting up test data.
Using hard coded sleep values to ensure state.
Trying to automate tests that are poor candidates.
And much much more!
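As one illustration of replacing the hard-coded sleeps above: a minimal sketch of polling a health endpoint until the server is actually ready. The URL, timeout and use of the requests library are my assumptions for the example, not what our harness actually did:

```python
import time
import requests  # assumption: the harness can reach the service under test over HTTP

def wait_for_server(health_url: str, timeout: float = 60.0, interval: float = 1.0) -> None:
    """Poll a health endpoint until it answers OK, instead of sleeping a fixed time."""
    deadline = time.monotonic() + timeout
    last_error = None
    while time.monotonic() < deadline:
        try:
            if requests.get(health_url, timeout=5).status_code == 200:
                return  # server is up; safe to start loading test data
        except requests.RequestException as exc:
            last_error = exc  # not up yet, keep polling
        time.sleep(interval)
    raise TimeoutError(f"Server not ready after {timeout}s: {last_error}")

# wait_for_server("http://localhost:9200/_cluster/health")  # hypothetical URL
```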
I saw an interesting one not too long ago. Some tests failed when using date differences, even though the server was in the same time zone as the user at all times. For some "date pairs" the result was OK; for other combinations of a start date and end date it was wrong.
While the user was in the same time zone as the server at all times, it turned out that both changed the time now and then, even without moving geographically. Thanks, daylight saving time!
That meant an hour might be "missing" in the difference, and therefore a day wasn't completed. And that caused a month to be considered not completely passed.
Processing dates and times is harder than you think, even if you think you took that fact into consideration.
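A minimal sketch of the effect in Python (the language and the time zone are my choices for illustration; the post above doesn't say what the actual stack was):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

tz = ZoneInfo("Europe/Copenhagen")               # any zone that observes DST shows the effect
start = datetime(2023, 3, 25, 12, 0, tzinfo=tz)  # day before the clocks spring forward
end = datetime(2023, 3, 26, 12, 0, tzinfo=tz)    # same wall-clock time, one calendar day later

print(end - start)                       # 23:00:00 -- an hour is "missing"
print(end - start >= timedelta(days=1))  # False: a naive "has a full day passed?" check fails
```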
As our team finally got our automated test suites under control and quite stable, we merged with another team and "inherited" a bunch of tests which turned out to be flaky.
The flakiness has various causes:
Poorly written tests with static delays etc., which would always succeed on one system but not always on another
Regression test systems with wildly varying specs and performance, which exposed problems with the robustness of the tests
Issues due to running tests in parallel where one test does seem to impact another
Hard to catch and trigger software issues that occasionally generate test failures
…
The various sources of flakiness are gradually being tackled, but until we get there a lot of time is wasted trying to identify which failures are genuine and which are just flakiness.
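The parallel-run interference in particular usually comes down to shared state. A minimal sketch of the pattern and one way out of it, assuming pytest (the file path and test contents are made up):

```python
import json
from pathlib import Path

SHARED = Path("/tmp/report.json")  # hypothetical resource shared by several tests

# Flaky when run in parallel: both workers read-modify-write the same file,
# so one can overwrite the other's data mid-test.
def test_adds_alice():
    data = json.loads(SHARED.read_text()) if SHARED.exists() else {}
    data["alice"] = 1
    SHARED.write_text(json.dumps(data))
    assert "alice" in json.loads(SHARED.read_text())

# More robust: give each test its own resource, e.g. via pytest's tmp_path fixture.
def test_adds_bob(tmp_path):
    private = tmp_path / "report.json"  # unique per test, no cross-test interference
    private.write_text(json.dumps({"bob": 1}))
    assert "bob" in json.loads(private.read_text())
```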
Latency and other networking/environmental issues are the biggest problem we face when it comes to test flakiness, so I agree with you there.
We use BrowserStack's Automate mobile product, which works well in the main, but if there is an issue on their side then you can get flakey results even though the tests are robust.
Poor candidate tests, where the test data or content under test changes unexpectedly, are something we've had to deal with also.
I'd hope we were past the use of thread.sleep() in this day and age. Those are always a recipe for disaster.
For the latency issue, I know that Google Chrome has a setting where you can artificially add latency to your tests to simulate possible latency in CI. We tie that to an environment flag and add some latency when we are working with tests locally, to help simulate a bare-bones CI machine. It helps us to catch some tests that would have flaked in CI. Unfortunately, I think only Chrome has this setting.
The code base has changed quite a bit since I worked there, but I found it.
This is where we mapped it.
and also:
This is an article showing how to set the latency manually while running tests.
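From the test side, wiring that up can look roughly like this. A sketch using Selenium's Python bindings for Chrome; the SIMULATE_LATENCY flag and the numbers are made up, not what we actually used:

```python
import os
from selenium import webdriver

driver = webdriver.Chrome()

# Only slow things down when the (hypothetical) flag is set, e.g. for local runs
# that should behave more like a bare-bones CI machine.
if os.environ.get("SIMULATE_LATENCY"):
    driver.set_network_conditions(       # Chrome-only, via the DevTools protocol
        offline=False,
        latency=250,                     # extra round-trip latency in ms
        download_throughput=500 * 1024,  # bytes per second
        upload_throughput=500 * 1024,
    )

driver.get("https://example.com")
```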
Flakey tests are perhaps as fundamental a time suck as you let them be. In my experience they stem from testing code paths that our dev team did not write, which is roughly what everyone else has described above, and they are the reason I continuously warn people, for example, not to test things like captchas in an end-to-end test.
Perhaps it's unfair to say testing other people's code builds flakey tests, but generally for me it's indicated by the time.sleep(5) statements I see in our Python scripts. I love spotting these during code reviews for that reason: delays. They may have been added to allow an Android system setting to ripple through the device, or just for an element to appear after a JavaScript animation on a web page. Or even for one web service to talk to another before the API call you make will work!
I turned off implicit waits in our Selenium connection start code as well. Implicit waits have their place, but they hide the other cause of flakey tests: being temporally sensitive without being context sensitive. Never replace uncertain state with uncertain time. Get the devs to expose the system state via a secure, debug-hidden API; you will be amazed how much it speeds up your tests, and how much it stabilizes them, to switch to using system state. I still have dozens of flakes; there is no silver bullet.
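To make the "wait on state, not time" point concrete, here is a sketch of the kind of thing I mean. window.__testState is an invented hook; whatever your devs actually expose will look different:

```python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com/app")  # placeholder URL

# Instead of time.sleep(5), wait until the app itself reports that it is ready.
# window.__testState is a made-up, debug-only hook exposed for tests.
WebDriverWait(driver, timeout=30).until(
    lambda d: d.execute_script(
        "return window.__testState && window.__testState.animationsDone"
    )
)
```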
flaky = "(of a device) prone to break down or fail."
People:
by mistake - when designing or coding it;
by choice - there's a high chance of getting into unstable things, but the direction is kept;
by listening to or obeying others - someone else decides what and/or how some automation should be implemented;
by pride - wanting to keep some number high, or thinking one can fix a problem: not maintaining, cleaning or rewriting the checks that make sense, and patching or leaving flakey code in place;
by indifference - it's accepted and acceptable for the company to have this;
by selfishness - not working with others, or not being supported by others, to increase stability, testability, infrastructure, code, approach, etc.
We have flaky systems, and tests uncover that flakiness. I always wonder why we are so quick to call tests flaky, yet I never hear of a flaky environment/product/app.
The exception is poorly written tests, but again, why do we allow that code to execute? We should have the same criteria for production code and for tests; at the end of the day, if we are going to produce code we should do it to the same standards.
In my experience, it's mostly due to various conditions external to the actual software. Things like latency, variable times to display web application information, and so on - these are almost always an issue in end-to-end automation and are almost always the result of factors outside the control of the developers or testers.
Unfortunately, sometimes there's no choice about where to automate. Older software can be impossible to unit test because UI and business logic are intertwined. Web services may not expose a testable API - I've dealt with a web service where testing was a matter of dropping prepared files into the designated directory and watching what happened: automation consisted of building the files on the fly and parsing the XML dropped into the results directory.
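Roughly what that kind of file-drop automation looks like, as a sketch; the directory names, file naming and result format here are invented, not the real service's:

```python
import time
import xml.etree.ElementTree as ET
from pathlib import Path

INBOX = Path("/srv/service/inbox")      # hypothetical drop directory
RESULTS = Path("/srv/service/results")  # hypothetical results directory

def run_case(name: str, payload: str, timeout: float = 60.0) -> ET.Element:
    """Build an input file on the fly, drop it in, and parse the XML result the service emits."""
    (INBOX / f"{name}.xml").write_text(payload)
    result_file = RESULTS / f"{name}.xml"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if result_file.exists():
            return ET.parse(result_file).getroot()
        time.sleep(1)  # poll the results directory rather than guessing a fixed wait
    raise TimeoutError(f"No result for {name} after {timeout}s")
```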
In my experience, almost anything in-app can be handled. Interactions with the computer running the app, the network, the internet, updates deciding to happen at an awkward time… these are usually the cause of flakiness.
Welcome to the community @ashish_te. Very insightful way to start off and actually nail it; where have you been all this time, I wonder.
I'm not a web app tester, but element locator choices are a long-running point of debate and stem from developers writing "untestable code" for the web. On native platforms this problem is less common, but it still causes test code confusion anyway. Your point about poorly performing test environments is still nailing it, though. I was pricing up Chromebooks for testing yesterday and I'm totally choosing slightly higher-spec devices for just this very reason: testing on a slow system is good for finding bugs of all kinds, but skimping on resources is not good for automated tests at all, ever. Cheers.
Often very basically: Using mostly GUIs for automation.
Graphical user interfaces, as they are made for humans, can be very unreliable for automation.
Two technically different (under the hood) versions of a GUI can look the same to humans. No problem for a human, but a problem for a machine.
I find "End 2 End" being done through the GUI very wasteful. I know that is often an escape path for check automators without much support for testability of the product.
Welcome to automation hell if you lack the time to adapt the automation to changes in the UI… I have been there.
In my current project we check most business logic via the API. The GUI part is more about "does the UI work at all", which includes the connection to the server.
The API part takes around 1.5 hours and is quite stable in terms of the check code calling the application. The GUI part runs in around 10 minutes.
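As a sketch of how that split can look in code (the endpoint, figures and page title are placeholders, not from my actual project):

```python
import requests
from selenium import webdriver

BASE = "https://example.test"  # hypothetical test environment

# Business logic is checked at the API level, where runs are fast and stable.
def test_discount_applied_via_api():
    order = requests.post(
        f"{BASE}/api/orders", json={"items": ["book"], "voucher": "SPRING"}
    ).json()
    assert order["total"] == 8.99  # made-up endpoint and figures

# The GUI check only answers "does the UI work at all, including its server connection?"
def test_ui_loads_and_shows_orders():
    driver = webdriver.Chrome()
    try:
        driver.get(f"{BASE}/orders")
        assert "Orders" in driver.title
    finally:
        driver.quit()
```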
What makes automated checks flakey?
Often the lack of testability of the product and problematic approaches to automation.
Which finally also relates to a lack of support by the organisation.
insert here a rant about letting one program interact with another in the role of a simulated user, unsupervised, for hours
insert here a plea for "semi-automation", using it as a tool to support exploration
Visibility of the elements the test script will interact with, particularly in mobile automation. Due to all the different screen sizes, sometimes the expected elements show up from the beginning, other times they don't.
The appearance of random modals or dialogs (ads, marketing campaigns, etc.) that interrupt the script flow.
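One common workaround for the random-dialog problem is a small helper that swats known interruptions before each interaction. A sketch with Selenium, where every selector is made up and depends entirely on the app under test:

```python
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

# Selectors for dialogs known to pop up at random; all of these are invented examples
# (cookie banners, marketing campaigns, rating prompts, ...).
INTERRUPTING_DIALOGS = ["#cookie-accept", ".campaign-modal .close", "#rate-us-later"]

def dismiss_interruptions(driver) -> None:
    """Close any known interrupting dialog that happens to be on screen right now."""
    for selector in INTERRUPTING_DIALOGS:
        for element in driver.find_elements(By.CSS_SELECTOR, selector):
            try:
                if element.is_displayed():
                    element.click()
            except WebDriverException:
                pass  # the dialog disappeared on its own; carry on
```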
The moment the tester resorts to inspecting the implementation, they build flakiness into a test. So yeah, Daniel, try not to go down the rabbit hole; it only ends up making the test more flakey. Nothing is ever rock-solid, and there is no silver bullet that bypasses a human actually talking to another human. Get the developers of the app to make the app easier to test instead, via hooks or other means.
Usually flakey tests are flakey because the tester has no way of asking the devs to make the app more "testable", and as a result both the system developer and the tester suffer the time cost of not being on the same "team" to make this work better together.