How Do You Monitor?

I read a user-experience-themed article today titled "If there’s a problem, customers’ll tell us" (warning: some not-safe-for-work language within). It made some interesting points, particularly, for me, around knowing what is happening in a production environment.

I’ve had varying experiences with this, most of them hands-off, if I’m honest. I’ve never really known what I wanted from monitoring or how I, as the software tester in the situation, could contribute. I knew I wanted to know what devices customers were using and, if I could, what occurred or errored when a failure happened.

What is your experience of monitoring in production environments? What do you ask for?

If you have a strong involvement with it, do you have any tool or reading recommendations you can share?


My experience of monitoring hasn’t been great so far.

In my last job we used Azure’s Application Insights. The logging implemented by the teams wasn’t great and there was a lot of noise from the logs. We also couldn’t track user flows in an easy way, meaning we genuinely had no idea what the customer was doing when they used our product.

What I wanted from the logging was:

  • Ability to track a user’s journey (including browser versions etc, like you mentioned).
  • Error/exception messages linked to a user’s journey.
  • Performance stats (CPU/RAM utilisation, response times for APIs etc).
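To make the first two points concrete, here is a minimal sketch of linking errors to a user's journey once events have been collected. It assumes every event already carries a shared journey identifier; the `Event` fields and both function names are hypothetical and not taken from Application Insights or any other tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    journey_id: str    # hypothetical id shared by all events in one user session
    timestamp: float   # seconds since some reference point
    kind: str          # e.g. "page_view", "api_call", "error"
    detail: str        # page name, endpoint, or exception message
    browser: str = ""  # e.g. a browser name and version


def group_by_journey(events):
    """Return {journey_id: [events sorted by time]}."""
    journeys = defaultdict(list)
    for event in events:
        journeys[event.journey_id].append(event)
    for steps in journeys.values():
        steps.sort(key=lambda e: e.timestamp)
    return dict(journeys)


def steps_before_errors(journeys):
    """For each error, list what the user did in that journey before it."""
    report = []
    for journey_id, steps in journeys.items():
        for index, event in enumerate(steps):
            if event.kind == "error":
                report.append({
                    "journey_id": journey_id,
                    "error": event.detail,
                    "browser": event.browser,
                    "preceding_steps": [e.detail for e in steps[:index]],
                })
    return report
```

With something like this, an exception message no longer stands alone: it comes with the browser and the steps that led up to it.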

I created a few proof of concept dashboards for the last two points but it was hard to get buy-in to put them up on TV screens around the office. I wanted to have meaningful stats displayed there so that when the number of exceptions (for example) spiked then teams would take action.

We also had some alerts set up based on number of exceptions and slow response times… unfortunately these got emailed to random people and there was no action plan put in place based on these alerts!


Hi
Logging and monitoring work well for us.
We usually catch critical bugs because product people notice spikes in the user’s journey (we use Amplitude to track it).
We also have an extended logging system (Kibana, Grafana), which helps us keep track of any changes in apps, servers, and the network.
Basically, most production issues cause spikes in the logs (even if simply because services start writing too many logs).
Additionally, Grafana has an alerting system that you can configure on certain log occurrences and link with Slack or PagerDuty (www.pagerduty.com/use-cases/).
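
The idea of spotting issues through spikes in log volume can be sketched roughly like this. It is only an illustration of the principle, not how Grafana alerting works internally; `find_spikes`, the window size, the threshold, and the minimum spread are all assumptions you would tune for your own traffic.

```python
from statistics import mean, pstdev


def find_spikes(counts_per_minute, window=10, threshold=3.0, min_spread=5.0):
    """Return indexes of minutes whose log count is far above the recent baseline.

    A minute is flagged when its count exceeds the mean of the previous
    `window` minutes by more than `threshold` times the spread, where the
    spread is the standard deviation of those minutes, floored at
    `min_spread` so that very flat traffic does not trigger on tiny changes.
    """
    spikes = []
    for i in range(window, len(counts_per_minute)):
        baseline = counts_per_minute[i - window:i]
        spread = max(pstdev(baseline), min_spread)
        if counts_per_minute[i] > mean(baseline) + threshold * spread:
            spikes.append(i)
    return spikes
```

A real setup would send the flagged minutes to a channel that someone owns, rather than just returning a list.
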
There is also rollbar.com for f2e logging.
Using these logging approaches also works during testing on non-production environments, where you can notice errors at an early development stage.
I would highly recommend looking at Grafana to monitor your apps. This mainly affects the product development team’s ecosystem rather than the QA field. But anyway, it’s not hard to make it work if you are looking for ways to shorten the time it takes to discover bugs and reduce the need for manual testing.
Grand


Hi,

When I first started in my current test role, the responsibility to manage deployments on two non-prod and, in time, one prod environment was passed on to me. Whilst I was primarily testing one API, I had to check that several other web apps, apps, APIs, databases etc. were all available and coordinate deployments across several teams globally. This was incredibly time-consuming, and it was a pain to work out which component had gone down when there were issues.

With that, I built a healthcheck monitor. The backend was Java with an HTML and JavaScript front end. This served a purpose, but I hadn’t factored in any way to quickly reconfigure the components to monitor.

During the last month or so, I’ve been building a new monitor. This one has a Node backend and a React front-end.

I recently posted about it here.

It’s not a sophisticated tool, it’s not meant to be. But if your patch is a distributed system and you want a simple dashboard to see what is running and what isn’t, you might find it useful.
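For anyone curious what the core of such a dashboard might look like, here is a minimal sketch in Python. It is not the tool described above (that one uses Node and React); the component names, the example.com URLs, and the function names are all hypothetical. The point it illustrates is keeping the list of components as data, so it can be reconfigured without touching the code.

```python
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical component list. In practice this could be loaded from a JSON
# or YAML file so it can be changed without editing the code.
COMPONENTS = [
    {"name": "orders-api", "url": "https://orders.example.com/health"},
    {"name": "web-app", "url": "https://www.example.com/health"},
]


def http_probe(url, timeout=5):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts and malformed URLs.
        return False


def check_all(components, probe=http_probe):
    """Probe every configured component and return {name: "UP" or "DOWN"}."""
    return {c["name"]: "UP" if probe(c["url"]) else "DOWN" for c in components}
```

Passing the probe in as a parameter is a deliberate choice: it lets you swap in a database or queue check later, and lets you test the dashboard logic without a network.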

Regarding logging: effective application logging, by which I mean meaningful and not overly verbose, is essential in investigating why an exception or issue occurred. Quite often, I find logs to be very noisy (as noted elsewhere in this thread). I guess there is a tendency to log absolutely everything to combat the fear of not logging something that might be useful for some unforeseen reason (unknown unknowns!).

We have also recently included decision logging. This isn’t to do with errors or exceptions, but instead focuses on why the system made a particular decision. By that I mean a separate log that explains why the system selected a product at a particular warehouse, e.g. a shorter shipping time, or combining items from the same order into a single package.
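
As a rough illustration of decision logging, the sketch below writes one record per decision to a separate logger, including the reason and the alternatives considered. The selection rule (shortest shipping time only), the field names, and the `decisions` logger name are assumptions made for the example, not the actual system.

```python
import json
import logging

# A separate logger, so decisions can be routed to their own file or index.
# Handler configuration is left out of this sketch.
decision_log = logging.getLogger("decisions")


def choose_warehouse(order_id, candidates):
    """Pick the candidate with the shortest shipping time and record why.

    `candidates` is a list of dicts with the hypothetical keys
    "warehouse" and "shipping_days".
    """
    chosen = min(candidates, key=lambda c: c["shipping_days"])
    record = {
        "order_id": order_id,
        "chosen": chosen["warehouse"],
        "reason": "shortest shipping time",
        "considered": [
            {"warehouse": c["warehouse"], "shipping_days": c["shipping_days"]}
            for c in candidates
        ],
    }
    decision_log.info(json.dumps(record))
    return chosen["warehouse"], record
```

Recording the rejected options alongside the chosen one is what makes the entry useful later, when someone asks why an order shipped from where it did.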

We tend not to worry about CPU or memory as a) the servers are managed by other groups and b) we wouldn’t be granted such low-level access.


Not a direct answer, but I like the work that Charity Majors does around observability. She gave a great talk at StrangeLoop about it. We’re not fully on the observability wagon, but we’re shifting there as we shift towards containerization - seems like we’ll likely land on Kafka for logging, though we’ve definitely got work to do to build out traceIds and things like that.

In my current position, we’re really trying to shift towards service level objectives as described in the Google SRE book, so ideally we should have a few canaries that indicate something is wrong, and not these crazy monolithic dashboards you see in so many places.
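
As a small worked example of the service level objective idea, the sketch below computes how much "error budget" is left for an availability target. The function name and the figures in the usage note are made up for illustration; the underlying arithmetic (allowed failures = (1 - target) x total requests) is the usual way an availability SLO is turned into a budget.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarise how a service is doing against an availability SLO.

    `slo_target` is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "availability": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "budget_remaining": remaining,
        "breached": remaining < 0,
    }
```

For instance, with a 99.9% target over 1,000,000 requests, roughly 1,000 failures are allowed; 400 failures leaves about 600 in the budget, while 1,500 failures means the objective is breached. A single number like this is the kind of canary that can replace a wall of charts.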


Greetings!

This is a great testing topic! I believe it is a testing topic because it is very much within a tester’s or test lead’s responsibility to advocate for monitoring. Monitoring provides transparency and introspection, both of which improve the overall testability of an application.

In my opinion, a product team collaborates to define what is monitored, what is tracked (as @alihill suggests), and the format of messages. One method of avoiding “noise” in web transactions is to link the user’s journey with a single identifier. This can be challenging if the journey crosses multiple platforms; a conversation about monitoring at the enterprise level may be necessary. Also, messages should be categorized at the time of logging (e.g., error, fatal error, information, etc.).
You might take an agile approach and start with small messages as an experiment. Work with it a while and learn what is valuable to you and your organization. Then, grow your monitoring.
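
One possible way to attach a single identifier to every message, and to categorize messages at the time of logging, is sketched below using Python's standard `logging` and `contextvars` modules. The logger name `shop`, the `journey_id` variable, and the message format are assumptions for the example; a journey that crosses platforms would also need the identifier passed along between services.

```python
import contextvars
import logging

# Holds the identifier for the journey currently being handled.
journey_id = contextvars.ContextVar("journey_id", default="-")


class JourneyFilter(logging.Filter):
    """Copy the current journey identifier onto every log record."""

    def filter(self, record):
        record.journey_id = journey_id.get()
        return True


def build_logger(stream):
    """Return a logger that writes 'LEVEL journey=<id> message' lines to stream."""
    logger = logging.getLogger("shop")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(JourneyFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s journey=%(journey_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

After calling `journey_id.set("j-123")` at the start of a request, every `log.info(...)` or `log.error(...)` call carries both the category and the identifier, so filtering one journey out of the noise becomes a simple text search.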

Both the product and the project benefit from implementing monitoring at the start. It can help to resolve errors, verify defects, and validate journeys.

A healthy definition of what you want from your monitor is a good place to start. As demonstrated above, there are plenty of HOWs available.

Thanks!
Joe
