When I first started in my current test role, I inherited responsibility for managing deployments across two non-prod environments and, in time, one prod environment. Whilst I was primarily testing one API, I had to check that several other web apps, mobile apps, APIs, databases, and so on were all available, and coordinate deployments across several teams globally. This was incredibly time-consuming, and when there were issues it was a pain to work out which component had gone down.
Over the last month or so, I’ve been building a new monitoring tool. This one has a Node backend and a React front-end.
I recently posted about it here.
It’s not a sophisticated tool, and it’s not meant to be. But if your patch is a distributed system and you want a simple dashboard to see what is running and what isn’t, you might find it useful.
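For anyone wondering what "simple" means in practice, the core of a dashboard like this can be little more than polling each component and classifying the result. A minimal sketch of that idea (the `classify` and `checkAll` names and the status thresholds are my invention for illustration, not the actual tool):

```javascript
// Classify an HTTP status code (or a failed request, passed as null)
// into a dashboard state. Kept as a pure function so it's easy to test.
function classify(statusCode) {
  if (statusCode === null) return "DOWN"; // request failed entirely
  if (statusCode >= 200 && statusCode < 300) return "UP";
  if (statusCode >= 500) return "DOWN";
  return "DEGRADED"; // 3xx/4xx: reachable, but not healthy
}

// Poll a list of components and return a name -> state map.
// Uses the global fetch available in Node 18+.
async function checkAll(components) {
  const results = {};
  await Promise.all(
    components.map(async ({ name, url }) => {
      try {
        const res = await fetch(url, { signal: AbortSignal.timeout(5000) });
        results[name] = classify(res.status);
      } catch {
        results[name] = classify(null); // timeout, DNS failure, refused, etc.
      }
    })
  );
  return results;
}
```

The front-end then only has to render that map, which is what keeps the tool simple.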
Regarding logging: effective application logging (by effective I mean meaningful and not overly verbose) is essential when investigating why an exception or issue occurred. Quite often I find logs to be very noisy (as noted elsewhere in this thread). I suspect there is a tendency to log absolutely everything to combat the fear of missing something that might later turn out to be useful (unknown unknowns!).
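To make "meaningful, not verbose" concrete: a single structured entry that carries the context an investigator actually needs beats hundreds of undifferentiated lines. A small sketch (the helper, field names, and example values are all made up for illustration):

```javascript
// Hypothetical structured log helper: one JSON entry per event,
// carrying the ids and reason needed to investigate it later.
function logEvent(level, message, context = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...context,
  });
}

// Noisy: logEvent("debug", "entered function") scattered everywhere.
// Meaningful: one entry answering "what failed, for whom, and why?"
const entry = logEvent("error", "payment declined", {
  orderId: "ORD-123", // made-up example id
  gateway: "gatewayX", // hypothetical gateway name
  reason: "insufficient_funds",
});
```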
We have also recently introduced decision logging. This isn’t to do with errors or exceptions; instead it focuses on why the system made a particular decision. By that I mean a separate log that explains why the system selected a product at a particular warehouse, e.g. shorter shipping time, or because it could combine items from the same order into a single package.
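A decision log entry of that kind might look something like this. The warehouse-selection logic and all the names here are invented for illustration (this is not the real system); the point is that the record captures the *reasons* alongside the outcome:

```javascript
// Pick a warehouse for an order and record *why* it won, not just *which*.
function chooseWarehouse(order, warehouses) {
  // Prefer warehouses that stock every item (single package),
  // then pick the shortest shipping time among those candidates.
  const complete = warehouses.filter((w) =>
    order.items.every((item) => w.stock.includes(item))
  );
  const candidates = complete.length > 0 ? complete : warehouses;
  const chosen = candidates.reduce((best, w) =>
    w.shippingDays < best.shippingDays ? w : best
  );

  // This object is what would be written to the separate decision log.
  return {
    orderId: order.id,
    warehouse: chosen.name,
    reasons: [
      complete.length > 0
        ? "combines all items from the order into a single package"
        : "no single warehouse stocks all items",
      `shortest shipping time among candidates (${chosen.shippingDays} days)`,
    ],
  };
}
```

When something ships from an unexpected location, the log answers "why?" without anyone having to re-derive the system's reasoning.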
We tend not to worry about CPU or memory, because a) the servers are managed by other groups and b) we wouldn’t be granted such low-level access anyway.