Has anyone used synthetic monitoring for production smoke tests?

Hi all :waving_hand:

I’m looking into ways to validate critical production endpoints safely, especially when we can’t create or modify data due to downstream or financial impact.

Has anyone used synthetic monitoring as a way to run post-deployment smoke tests in production?

I’d love to hear:

  • What tools you’ve used (e.g., Grafana, Prometheus, Datadog)
  • What types of tests or flows you cover
  • How you manage safe, synthetic data
  • Any lessons learned or anything to avoid

Any examples or insights would be really helpful! Thanks in advance :folded_hands:

Very valid point!
We use Datadog synthetic monitors (€€€) to check certain endpoints that are particularly important for the company. The goal is to verify service health in production. Personally, I am against this approach: since we deploy our services on a managed cluster, we could test these “important” endpoints from within the cluster and reduce costs. Ensuring connectivity from outside the cluster should be the responsibility of the cloud provider, as defined by their SLAs. So what is the point of synthetic monitors in production in a managed environment?
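
As an illustration of the in-cluster alternative, here is a minimal sketch of a probe that could run as a scheduled job inside the cluster, assuming Node 18+ for the global `fetch`; the service DNS name and `/healthz` path are hypothetical:

```typescript
// Hypothetical in-cluster probe: hit a service via its cluster-internal DNS
// name instead of paying for an external synthetic monitor. A non-zero exit
// marks the scheduled run as failed, which alerting can pick up.
const TARGET = "http://orders-api.payments.svc.cluster.local/healthz"; // placeholder

async function probe(): Promise<void> {
  const started = Date.now();
  const res = await fetch(TARGET, { signal: AbortSignal.timeout(5_000) });
  const elapsedMs = Date.now() - started;

  if (!res.ok) {
    console.error(`Probe failed: HTTP ${res.status} in ${elapsedMs}ms`);
    process.exit(1);
  }
  console.log(`Probe OK: HTTP ${res.status} in ${elapsedMs}ms`);
}

probe().catch((err) => {
  console.error("Probe error:", err);
  process.exit(1);
});
```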

I have explored synthetic monitoring in production using a combination of Playwright, Prometheus, and Grafana; a minimal sketch follows the list.

  • Playwright: For browser-based automation to simulate user journeys.
  • Kubernetes CronJobs: To schedule and run Playwright tests at regular intervals.
  • Prometheus: For collecting custom metrics like flow success/failure, response time, and errors.
  • Grafana: For visualizing the synthetic test trends and alerting on failures.
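
To make that concrete, here is a minimal sketch of one such check, assuming the `prom-client` npm package and a Prometheus Pushgateway (short-lived CronJob runs can't be scraped directly, so they push); the target URL, selector, and Pushgateway address are placeholders:

```typescript
// Sketch of a scheduled synthetic check: drive a short journey with
// Playwright, then push outcome metrics to a Prometheus Pushgateway
// so Grafana can chart and alert on them.
import { chromium } from "playwright";
import client from "prom-client";

const registry = new client.Registry();
const success = new client.Gauge({
  name: "synthetic_flow_success",
  help: "1 if the synthetic flow passed, 0 otherwise",
  registers: [registry],
});
const durationMs = new client.Gauge({
  name: "synthetic_flow_duration_ms",
  help: "End-to-end duration of the synthetic flow",
  registers: [registry],
});

async function run(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const started = Date.now();
  try {
    // Keep the journey short and stable: load a page, assert one key element.
    await page.goto("https://example.com/", { timeout: 15_000 });
    await page.waitForSelector("text=Example Domain", { timeout: 5_000 });
    success.set(1);
  } catch {
    success.set(0);
  } finally {
    durationMs.set(Date.now() - started);
    await browser.close();
    // Pushgateway address is a placeholder for wherever yours runs.
    const gateway = new client.Pushgateway("http://pushgateway:9091", {}, registry);
    await gateway.pushAdd({ jobName: "synthetic-smoke" });
  }
}

run().catch((err) => {
  console.error("Synthetic run error:", err);
  process.exit(1);
});
```

A Kubernetes CronJob then runs this script on whatever interval makes sense, and the Grafana alert fires when `synthetic_flow_success` drops to 0.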

A few things to consider:

  • Avoid flaky tests: Keep the synthetic flows short, stable, and fast.
  • Use tagging and labels in Grafana to distinguish synthetic failures from backend spikes.

I’m late to the party here and I don’t have specific tool suggestions. At one company where I worked, our product was SaaS, and we used a service that ran a simple UI smoke test script at regular intervals during the day - no updating, just navigating around - so that the requests came from different parts of the world. It must have used a VPN service? Anyway, it was quite a surprise to learn that our website could be inaccessible from one part of the world while it was working fine for us. And as our customers were global, it was critical that the site was always available.

I have used New Relic for this. We had a couple of things running regularly, but the one that stands out was a UI test (using Playwright) for the log-in and log-out functionality. It proved a useful early indicator of issues more than once.
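
For flavour, a minimal sketch of that kind of log-in/log-out check as a standalone Playwright script; the URL, selectors, and environment variables are hypothetical placeholders:

```typescript
// Hypothetical log-in/log-out smoke test, runnable on a schedule by any
// synthetic runner. Uses a dedicated synthetic account so no real user
// data is created or modified.
import { chromium } from "playwright";

async function loginLogoutSmoke(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    await page.goto("https://app.example.com/login", { timeout: 15_000 });
    await page.fill("#username", process.env.SYNTH_USER ?? "");
    await page.fill("#password", process.env.SYNTH_PASS ?? "");
    await page.click("button[type=submit]");
    await page.waitForSelector("text=Dashboard", { timeout: 10_000 });

    await page.click("text=Log out");
    await page.waitForSelector("#username", { timeout: 10_000 });
    console.log("Log-in/log-out smoke passed");
  } finally {
    await browser.close();
  }
}

loginLogoutSmoke().catch((err) => {
  console.error("Log-in/log-out smoke failed:", err);
  process.exit(1);
});
```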

Where I am currently, we have a combination of a Postman collection run on a schedule and some checks in Datadog. These all monitor APIs.
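
A scheduled Postman run can also be scripted with the `newman` npm package (Postman's collection runner); a minimal sketch, with a placeholder collection file:

```typescript
// Sketch of running a Postman collection programmatically via newman.
// A scheduler (cron, CI, Kubernetes CronJob) would invoke this script.
import newman from "newman";

newman.run(
  {
    collection: "./production-smoke.postman_collection.json", // placeholder path
    reporters: ["cli"],
    timeoutRequest: 10_000, // fail fast on slow endpoints
  },
  (err, summary) => {
    if (err || summary.run.failures.length > 0) {
      console.error("API smoke collection failed");
      process.exit(1);
    }
    console.log("API smoke collection passed");
  }
);
```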