What caused your automation run to fail? ❌

Hi all,

So I think this is going to be an interesting question to ask and I’m looking forward to the responses.

What caused your automation run to fail?

What experiences do you have with automation failing and why? I remember hearing a story once of an automation run failing because someone turned off the physical box in the evenings which would run the nightly build. Perhaps it’s an urban myth, but I’d love to know your stories, be they hardware or software related.


I’ve seen automation runs fail for any number of reasons. Someone turning off the box isn’t one I’ve encountered, but power outages have definitely caused issues. So have awkwardly timed automatic updates (after that incident, automatic updates were disabled and there was a regular task to check for and install updates).

Some of the other interesting failures I’ve encountered:

  • A bug in the UI that caused the displayed items to reset to the default after 60 seconds. When you’re switching menus to access different items in a point of sale system, that kind of thing wreaks havoc.
  • Someone accidentally deleting an item used in the automation - not realizing they were editing the master data and not a local copy.
  • An event setup routine that tried to create events on Feb 29 when it wasn’t a leap year.
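On the Feb 29 one, here’s a minimal sketch of the kind of guard that would have avoided it. This is a hypothetical helper in Python, not the original routine; the function name and the clamp-to-Feb-28 policy are my own illustration:

```python
import calendar
from datetime import date

def safe_event_date(year: int, month: int, day: int) -> date:
    """Hypothetical helper: clamp Feb 29 to Feb 28 in non-leap years."""
    if month == 2 and day == 29 and not calendar.isleap(year):
        return date(year, 2, 28)
    return date(year, month, day)

print(safe_event_date(2023, 2, 29))  # 2023-02-28, since 2023 is not a leap year
print(safe_event_date(2024, 2, 29))  # 2024-02-29, since 2024 is a leap year
```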

Probably the biggest and weirdest one I ever encountered, though, was when the automation runs started failing at random points with no sensible error messages. When I investigated, the application and the tooling I was using would crash, again with no sensible error message.

After much back and forth between me and one of the senior developers, we eventually worked out that the problem was a compiler bug. The automation made extensive use of the debug info compiled into the test version of the application, because without it, addressing the various parts of the software was next to impossible.

The problem we were having was that, at some point, the executable had grown to the point where the compiler tools couldn’t cope and the debug info got corrupted. So when the automation tried to access what it thought was the order module, the debug info returned the exit-application command instead. And so on. Exactly what would happen was rather random, so we’d get completely different results when running the suite without any changes.

I honestly don’t know if the problem was ever fixed. I was in the process of reworking all the locators to work without the debug info when I was laid off from that position, so I have no idea what happened after that. But it was undoubtedly the most memorable automation failure I’ve ever encountered.

Some things from my experience in companies where I’ve seen about 10-12 failed automation projects:

  • No one runs it; it’s a failed effort
  • It runs automatically but no one is looking at the results;
  • It runs and the results are in, but there’s too much flakiness and time required to identify a product issue; only a handful of failures get checked, and only briefly;
  • It’s not portable; moving it from one OS to another, from local to a server, etc…, or trying to set it up somewhere else is almost impossible;
  • Conflicting libraries; some frameworks have dozens of libraries each with lots of dependencies, and updating some will cause incompatibilities;
  • Overcommitting or underestimating the costs; the project is thrown away before it ever runs or produces a useful automation setup;
  • Short-term thinking; building a framework that is killed once the person coding it leaves or some restructuring happens;
  • Adding automation on top of an unstable, in-progress development - spending most of the quality/test department budget on maintenance of automation;
  • Changing owners and/or coders of an automation project (mainly when there’s no manager of it); old code usually stays, more bloat is added on top, refactoring is left for later; this leads to an unmaintainable product that fails randomly and is then thrown away;
  • Migrating to new tools without a long-term vision; the old framework is killed and a new one can take as long to build as the remaining life of the product it tries to check, which might be obsolete by the time any automation is usable;
  • The product that automation attempts to check is removed from the market;
  • Data changes, or the environment is refreshed by the DB or infrastructure team; relying on data you don’t control makes automation unpredictable;
  • Firewall, bot detection - it can block access/execution to the product;
  • Automation is coded in a stack that no one else can maintain or continue to develop if the owner leaves, and that is too difficult to recruit for as well;
  • Similar to the above, in some places they shut down the environments where the tests might run to cut costs; or the environment goes down due to power cuts, enters standby, gets hit by an automatic update, etc…
  • A developer, tester, or engineer decides that an automation framework is needed, and they spend a month starting it, making it expandable and flexible. Then they hope anyone cares to use or develop on it. They won’t :slight_smile:

I’ve actually got something similar happening at my current workplace 🫣

We run a cron job once a week to back up various files, clean the cache, and restart the physical box that runs our nightly builds.

When we requested this, it was specified to happen at a time that wouldn’t interfere with the daily runs (midnight Sunday). On weekdays our test builds start at 7pm; on weekends, because no one is working, we start them earlier (10am, in case the Friday night ones ran significantly longer).

Unfortunately, for some reason it was set up to restart at 4pm on Saturday, which usually ends up being while our second test build for the day is still going.
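The mismatch is easy to see if you expand the two schedules. Here’s a minimal sketch in Python, assuming the croniter package is installed; the cron expressions and base date are my own illustration of the requested vs. configured restart times:

```python
from datetime import datetime
from croniter import croniter  # assumes `pip install croniter`

# Hypothetical expressions: what we asked for vs. what was actually configured.
requested = "0 0 * * 0"    # midnight Sunday, clear of the weekend test builds
configured = "0 16 * * 6"  # 4pm Saturday, right in the middle of the second weekend run

base = datetime(2024, 6, 7, 9, 0)  # an arbitrary Friday morning
print("requested restart: ", croniter(requested, base).get_next(datetime))
print("configured restart:", croniter(configured, base).get_next(datetime))
```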

For context, our tests take a fair bit of time as they are entire e2e regression suites run against the develop branch using Detox. This is done twice, once for Android and once for iOS. We are limited to running one Bamboo plan at a time (more plans would be an additional cost), so they don’t run concurrently.

On average they take ~2.5 - 5hrs per run to completely build, run, and clean up. While I consider this to be too long, it’s down from the ~13hrs for a single run that was happening earlier in the year when I first started with the company.

As I mentioned, they can’t be run concurrently, which is also the case for any other CI processes we run using that box, such as linting, Snyk checks, or building the app and pushing to Google Play/TestFlight, etc. This is why we don’t start the tests any earlier on weekdays: it can end up blocking development checks.

But it also means that if any of those other builds are triggered late Friday afternoon (say, by a whole pile of PRs merged to develop), they may still be running at 7pm when our tests trigger, which means our tests are queued and will start once the box is available. You can see how the box restarting at 4pm on Saturday now becomes an issue.

The secondary issue is that when it restarts, it doesn’t sign back into the box, which then causes the runs after the restart to also fail, until I come in on Monday, remote into the box, and sign in with the general team user credentials.

The best part about that, though, is that the failures due to the machine being logged out look different when reading through the logs and comparing Android to iOS :melting_face::melting_face::melting_face:

For one, it appears to execute and pass each test; it’s only when attempting to find the test results, logs, screenshots, etc. that it fails. The other doesn’t appear to execute the tests at all. So it took a bit of troubleshooting to confirm what the problem even was to begin with :sweat_smile:

Oh another fun one, from two days ago…

Both our test runs for a release branch failed to execute the night before a 10am release that had a bugfix merged late the afternoon before. Both the develop test runs worked fine. I tried to manually trigger the release ones again in the morning; they wouldn’t even start…

Turned out the dev had accidentally deleted the release branch when they back merged their bugfix from release to develop.

Can’t run tests overnight against a branch that doesn’t exist :rofl::rofl::rofl:

In my experience, automation run failures were entirely caused by external factors.

Examples:

  • In one case, I was checking for a certain colour code in a messaging app’s functionality; it turned out a developer had changed the colour spectrum and added new themes without notifying me. Spent a good few days figuring out what was wrong.

  • Again, in one of my web automation test cases, I was waiting for a certain web element to render with a waiting time of 2-3 seconds. I was running the test on the Dev environment, and it turned out that DevOps had decided to cut resources from Dev, causing the whole website to load more slowly. As a result, the timeout I had previously set for the element to render was no longer enough.
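On that last one, here’s a minimal sketch of the kind of fix I mean, assuming Selenium with Python; the URL, locator, and environment variable name are made up. The idea is to read the wait timeout from per-environment configuration instead of hardcoding 2-3 seconds:

```python
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Hypothetical env var and locator; the point is that the timeout is
# configurable per environment rather than hardcoded in the test.
timeout = float(os.getenv("ELEMENT_TIMEOUT_SECONDS", "10"))

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/")
    element = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.ID, "order-summary"))
    )
    print("element rendered:", element.is_displayed())
finally:
    driver.quit()
```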

I could think of probably a dozen such cases where automation failed because of external changes, lack of transparency, issues with network and load times, etc…