Where to trigger E2E test suite in microservice land?

Should automated E2E tests be triggered by the CI/CD pipelines of individual microservices, or just run on their own schedule, regardless of what is being deployed to the platform under test?

We have about 10 cross-functional microservice dev teams working on the same application. Each owns a domain and tests their own components. Each story is done when it deploys to prod, so about 100 prod deploys per week. Each team has been doing their own testing with little to no E2E testing. I’ll define E2E tests as user journeys that touch services from multiple dev teams.

There are two E2E test suite triggers I can think of but they both have problems:

  1. Run E2E tests daily. Problem: a daily run will not cover the 20 or so microservice deploys that occur each day.
  2. Run the E2E tests with each deploy. Problem: depending on the duration of the E2E test suite, that will not scale to 20 deploys a day.

Lots to discuss but I’ll shut up for now. Would love some suggestions.

3 Likes

Assuming each team has e2e tests for their own microservices, in my ideal world I’d punt on holistic system-level tests and push for more observability than anything else.

Contract testing becomes really important: it ensures nobody breaks the input/output interfaces that all the other microservices depend on. But without a full-on test environment that gets all the code that’s in prod (or running tests in production), actual system-level e2e doesn’t seem viable when everything is a microservice?
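
To make that concrete, here’s a minimal consumer-side contract check. It’s a hand-rolled sketch rather than a real Pact (or similar) setup, and the stock endpoint, URL, and response schema are all hypothetical stand-ins for whatever interface your consumer actually relies on:

```python
# Sketch: a consumer-side contract check. The /stock/{sku} endpoint and
# response schema below are hypothetical; the point is that the consumer
# pins down exactly what it relies on, so a provider change fails fast.
import requests
from jsonschema import validate  # pip install jsonschema requests

STOCK_RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["sku", "quantity"],
    "properties": {
        "sku": {"type": "string"},
        "quantity": {"type": "integer", "minimum": 0},
    },
}

def test_stock_endpoint_honours_contract():
    # Point this at a locally deployed or stubbed provider, not all of prod.
    resp = requests.get("http://localhost:8080/stock/ABC123", timeout=5)
    assert resp.status_code == 200
    # Fails if the provider renames fields or changes their types.
    validate(instance=resp.json(), schema=STOCK_RESPONSE_SCHEMA)
```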

The environment I’m in hasn’t made the full move to microservices, but many of our monolithic applications are being rewritten with more microservice-oriented architectures as we shift towards the cloud. In that case we can do holistic e2e, because it’s a known collection of microservices that interact with each other, and we can deploy all of them into k8s. If the interactions are understood, having a few critical-path e2e tests makes sense, but again it depends on whether you have an environment where you can run such tests (both being able to run the test and controlling what’s deployed into that environment). That’s why my ideal world relies more on observability (and strong alerting) in prod than on trying to write e2e tests that exercise the whole world.

3 Likes

Thanks for the reply, @ernie . Totally agree with your plan to test around contracts and focus on observability. I’m waving that flag. The execs, with money to spend, are of course pushing for the more traditional E2E approach, and I’m attempting to not make it suck :wink:. So if you do end up supplementing your testing with a handful of critical-path E2E tests, how would you trigger them? Would you choose one of the two options I referenced?

1 Like

Assuming you have a fully functioning pre-prod environment, or can run the e2e tests in prod, then I’d do it on each deploy. You mention concerns about it not scaling at 20 deploys a day, but that sounds like you might have too many tests at the top of the test pyramid at that point? I’d expect these critical e2e tests to be very high-level smoke tests that only take a few minutes to run. I’d also expect to be able to have multiple test runs happening simultaneously, running from different test clients, so even if two deploys happened in a very tight window, it wouldn’t be an issue.
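
As a rough illustration, here’s a sketch of a per-deploy trigger that lets runs overlap; the suite location, the `critical_path` marker, and the deploy metadata are all hypothetical:

```python
# Sketch: run the critical-path smoke suite on every deploy, allowing
# concurrent runs. Each run writes results to its own directory keyed
# by deploy ID, so two deploys in a tight window don't trample each other.
import os
import subprocess
import sys

def run_smoke_suite(deploy_id: str, service: str) -> int:
    results_dir = f"results/{service}-{deploy_id}"
    os.makedirs(results_dir, exist_ok=True)
    return subprocess.call([
        "pytest", "tests/smoke",            # hypothetical suite location
        "-m", "critical_path",              # only the handful of smoke tests
        "--junitxml", f"{results_dir}/report.xml",
    ])

if __name__ == "__main__":
    # Typically invoked by the deploy pipeline, e.g.
    #   python run_smoke.py 4f3a9c orders-service
    sys.exit(run_smoke_suite(deploy_id=sys.argv[1], service=sys.argv[2]))
```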

1 Like

If your “definition of done” is a deploy to production, do you not have a Dev and a Staging environment? Once you have various gates, those gates will surface other issues. Does your system use feature flags, and where in this pipeline does stress testing occur?

Context is everything. Software is not created in a perfect vacuum, the humans on your dev teams can only context-switch so fast, and because the majority of the bugs are going to be found in the interactions between services, not within them, it is a balancing act.

  1. There is no right way, so do not let anyone tell you you are doing it wrong; change what you do often, and change how you do it often. Complex systems need you to be reactive as well as proactive, and it’s a case of making sure you are doing both. Try intentionally dropping all gatekeeping guards to see how your team behaves when it goes pear-shaped, then make process changes so you can recover quickly. This is going to happen every time a team drops large changes that have been on a branch for a long time anyway, so doing things that allow you to fail early and become versed in recovering quickly without panic will reduce stress levels in your teams.
  2. My other take on this is that automated testing can even benefit from running in a “broken” environment where services are busy refreshing live: it’s a good place not only to test a broken system, but also to test how the live system behaves while loads shift and updates occur, and to check that customers with open sessions, for example, recover.
  3. If you have feature flags, automated tests can read/write the feature flags to turn things on and off in combinations that get you coverage of unreleased features early (see the sketch after this list).
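
For point 3, here’s a sketch of what flag-driven testing could look like; the `FlagClient`, its admin endpoint, the flag names, and the health URL are all hypothetical stand-ins for your own flag service (LaunchDarkly, Unleash, home-grown, etc.):

```python
# Sketch: drive feature flags from an automated test. Everything here is
# hypothetical; real flag services expose different admin APIs.
import itertools
import requests

class FlagClient:
    def __init__(self, base_url: str):
        self.base_url = base_url

    def set_flag(self, name: str, enabled: bool) -> None:
        # Assumes a simple admin endpoint; adapt to your flag service.
        resp = requests.put(f"{self.base_url}/flags/{name}",
                            json={"enabled": enabled}, timeout=5)
        resp.raise_for_status()

def test_checkout_under_flag_combinations():
    flags = FlagClient("http://flags.internal.example")
    feature_names = ["new_checkout", "async_invoicing"]
    # Exercise every on/off combination of the unreleased features.
    for combo in itertools.product([False, True], repeat=len(feature_names)):
        for name, enabled in zip(feature_names, combo):
            flags.set_flag(name, enabled)
        resp = requests.get("http://app.internal.example/health/checkout",
                            timeout=5)
        assert resp.status_code == 200, f"checkout broken with flags {combo}"
```
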
1 Like

I’m working on a microservice-based system these days. The E2E tests get run once a day, first thing in the morning, against the master branch, so they test everything that got merged into master the previous day. However, we also have automated tests for each API, and the CI/CD pipelines run the relevant suite of API tests for a microservice when a pull request is made to merge code into master; these need to pass before the merge is allowed.
Although this can’t guarantee that bugs won’t turn up in the E2E runs, it does provide a first (and quicker) quality gate for code changes going in.
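
As a rough sketch of that per-service gate, a pipeline step could map the files a pull request touches to the matching API suite; the `services/<name>/` layout, suite paths, and service names here are all hypothetical:

```python
# Sketch: run only the API tests for the microservice a PR touches.
# Assumes a hypothetical monorepo layout of services/<name>/ with suites
# under tests/api/<name>/; adapt to however your repos are actually split.
import subprocess
import sys

SERVICE_SUITES = {
    "orders": "tests/api/orders",
    "inventory": "tests/api/inventory",
    "billing": "tests/api/billing",
}

def changed_services(base_ref: str = "origin/master") -> set[str]:
    diff = subprocess.check_output(
        ["git", "diff", "--name-only", base_ref], text=True)
    return {line.split("/")[1] for line in diff.splitlines()
            if line.startswith("services/") and line.count("/") >= 2}

if __name__ == "__main__":
    suites = [SERVICE_SUITES[s] for s in changed_services()
              if s in SERVICE_SUITES]
    # Gate the merge: a non-zero exit fails the pipeline.
    sys.exit(subprocess.call(["pytest", *suites]) if suites else 0)
```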

2 Likes

This is a tough situation, as I wouldn’t recommend e2e for microservices except for a small number of smoke tests against production. Over 5 years of doing microservices, all the value has come from contract testing the services and using semantic versioning on their interfaces. The only times our e2e smoke checks picked up something were when we didn’t follow our contract testing policy (very rare) or when some unknown event was happening (network issues etc.) that you get with distributed systems, but these were always picked up by observability first.

The problem with e2e and microservices is all about which versions of the services you run in the staging environment. With high-frequency deployments this is going to be a moving target, and the e2e will never be run with the right version combination, or combinations (you won’t catch everything), of services. It is also going to slow the teams’ commit-to-deploy time and feedback loop, so they might not be so happy about this (perhaps this could be used to change management’s mind?).

If I had to run e2e, then I would run some smoke e2e tests against production and trigger them to happen after a service deployment. It might not tell you which change broke the system, but it will tell you there is a problem. How long that hypothetical window of a broken system could exist is a risk the business needs to decide upon.

To answer the original question (because I didn’t really), on each deploy we now have about 10 minutes dedicated to a very small batch of e2e tests. But even that 10 minutes is actually too long. The deploy to “development” runs on every commit and triggers the e2e job, unless it’s already busy, and the job also runs nightly (to catch the corner case). This gives us ground to stress-test functionality before promoting to integration.
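
That “unless it’s already busy” check can be as simple as a non-blocking lock around the job. Here’s a sketch (Unix-only, since it uses `fcntl`; the lock path and pytest invocation are placeholders):

```python
# Sketch: run the e2e job on each deploy unless a run is already in
# flight; skipped deploys are swept up by the nightly run. Uses a
# simple OS-level file lock as the "already busy" signal (Unix-only).
import fcntl
import subprocess
import sys

LOCK_PATH = "/tmp/e2e-suite.lock"  # hypothetical; one lock per suite

def trigger_e2e() -> int:
    with open(LOCK_PATH, "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("e2e suite already running; this deploy relies on "
                  "the nightly run instead")
            return 0
        return subprocess.call(["pytest", "tests/e2e", "-m", "smoke"])

if __name__ == "__main__":
    sys.exit(trigger_e2e())
```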

I’m not yet in the territory @alan.parkinson is in, but I know I will be sooner or later. I’m hoping that whenever I do run my smoke E2E test suite in production, all it’s going to tell me is that customer support are starting to field phone calls, which may not be a useful warning in a huge multi-user environment. While having fewer automated gates might seem counter-intuitive, I hope in my context to use it to drive better unit and contract testing earlier in the pipeline.

We do. “Done” also means tested. If you run tests in prod, you can’t be done until then, right? I’m a student of Michael Bolton so I don’t see testing as a phase that occurs after development. It’s all development. :wink: Seems to me this is also congruent with reducing context switching. Build one story, put it in prod, you’re done, build the next story, etc…

1 Like

@professorwoozle , your approach resonates with me. It’s certainly a no-brainer to run the automated tests for a given microservice as part of that service’s CI/CD. And I like the language you are using b/c you didn’t call these E2E tests. But if your E2E tests only run in the morning, you have the problem I pointed out in my OP: those E2E tests will not cover a merge/deploy that happens at, say, noon. As @alan.parkinson said, we want to reduce our commit-to-deploy time (AKA “Lead Time for Changes”), so if a story is ready to ship at noon, why wait until first thing the next morning to run those E2E tests?

1 Like

Yeah, I agree. 10 minutes is too long to run for each microservice deploy. But I am really scatterbrained about what I think a useful E2E test should be. What E to what E? And the more I think on that, the more I think I want one complete E2E test (even if it takes 3 hours) rather than several smaller E2E tests. Maybe I’m going crazy :exploding_head:. Perhaps there is an E2E thread somewhere on MoT that can refine my definition of E2E.

1 Like

I read @professorwoozle’s description of their e2e tests as essentially being a morning smoke test/health check, not a gate to deploy nor a sanity check on merges.

The common note I’m seeing in most (all?) of the responses is: test each microservice end to end on its own and ensure it doesn’t break its contracts, and then do higher-level system testing as a sanity check/when you can.

1 Like

Eric, I get your point about how long e2e should be. I’m coming from a world where the app is a thing with a GUI or a web page that humans actually use to buy the product. If the application is, for example, a web service with fewer dependencies, I would expect a much quicker run, say if there is no purchasing stage in the system. Skipping the purchasing stage entirely is very risky to the business.

In CI/CD you want as short an E2E smoke test as possible. James Bach has written a lot about E2E; it’s about testing end to end. A bit like how buying a thing on Amazon includes creating an Amazon account, adding a credit card, adding an address, adding an item to the basket, hitting the buy button and then logging out. In the Amazon test environments they do all of these steps in mere seconds. That’s why Amazon can, and does, deploy thousands of times per day.

In my first job where we did e2e properly, I wrote a post-build smoke test that took an hour to run on average. It was for a server-based app, no microservices, but that included the time taken to deploy the app and all the needed services; back then, the setup alone took 40 minutes, leaving 20 minutes of testing time. The smoke test can be used as a gate which, if it passes, triggers regression testing and other E2E testing.

I’m actually not directly involved in the services where I now work, but I know that deploys take about an hour, into an AWS environment. During that time each component runs loads of unit tests, often in parallel, and any microservices must run their contract tests too. Even so, this leaves us very little time to run integration tests if we want to truly avoid blocking the delivery pipeline. Luckily we have “functional” testing time down to about 10 minutes. It includes the following steps (sketched in code after the list):

  1. account creation (completely new user onboarding and provisioning) - about 1 minute
  2. web app client - about 30 seconds to spin up a selenium grid and open a browser
  3. server-end app deployment - about 30 seconds
  4. login, add a resource, find the resource, edit/connect to it, close it - about 10 seconds
  5. logout - about 1 second
  6. repeat the above steps for each headline feature and in slightly different paths

There was a thread here not long ago defining E2E; I’ll try to find it in the meantime. But for me it means covering the entire user journey: account creation, login, do any 1 thing that 80% of users do, logout, and finally delete all resources and accounts (CRUD - create, read, update, delete). That last bit about deleting the account takes only 5 seconds, but uncovers loads of bugs early. I try to pack as many quick but totally vital things into the smoke test as possible. Never ever add a test for something optional into the smoke test. The smoke test must never fail due to a check that is not critical, and it must not raise false alarms for things that can easily be caught later on in a regression suite. Basically, test as much as possible as early as possible, without having noisy tests that fail and block the coders.

I have had to differ in my “definition of done” in every single place I have worked. “Done” must include testing, but if you work in a place where, for example, language translation is needed, you cannot let your team get held up because translations take 5 working days. So for some teams done means the team owns all external dependencies, but in some places that fails to hold true. Done should also include documentation being written, though user documentation is another matter. Because we have staggered deployment stages, done means deployed to the “dev” environment and then promoted from there into a 2nd environment called integration; the deploy into the “live” environment is not part of “done” where I work, because that is still far too infrequent, sadly.

The e2e thread: Where do your ends start and end in E2E testing?

1 Like

That E2E link is going to be delicious, @conrad.braam . Thanks, dude. And you had me at James Bach. Big fan. I’ll look up his E2E teachings.

Your and others’ posts have moved the needle for me. I think a useful E2E test to start with is to slide each E out to its extreme boundaries. Automating this does seem like a fool’s errand, but if it shakes out bugs in the process, so be it.

Yes, perhaps in some contexts, done <> shipped. But we can challenge that thinking as well. If documentation or translation takes several days, might that be b/c we have a documentation team servicing all requests? What would happen if our cross-functional dev team included a documentation specialist? No?

E2E has to include risks to the business as well as bugs. I love computer games but hate the game industry as an employer; still, I learned one thing from a mate who worked at one of the biggest MMO games houses.

"Cant play - won’t pay". Basically translates to if a user cannot login, they won’t play, and they won’t re-subscribe, and probably won’t be able to raise a support ticket or be bothered. It’s an ideal goal to cover the revenue stream of your product, often just testing the licensing expiry is enough if the entire flow is hard to mock out.

@conrad.braam , hey dude, can you help me find some of those writings? Google is not finding them for me.