It had been some time since I myself was on call for a product, but my team was on call for ours.
When you have to respond, in the middle of the night, to an issue, this is what it does to you:
The first thing you learn is how to find out the state of the application: what is working and what is not. If something is failing, where does it actually fail? It might be that application A is reporting the issue while the real cause is that its dependency on application B is not fulfilled. But since B is down, A is the one raising the alarm.
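One way to make that distinction visible is a health check that reports per-dependency status rather than a single up/down flag. Here is a minimal sketch; the check names (`self`, `service_b`) and the checks themselves are hypothetical, not from any particular system:

```python
def aggregate_health(checks):
    """Run each named check; report per-dependency status plus an overall verdict.

    `checks` maps a name to a zero-argument callable that raises on failure.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failing: {exc}"
    # Overall status is degraded if any individual check failed.
    results["overall"] = (
        "ok" if all(v == "ok" for v in results.values()) else "degraded"
    )
    return results


def check_self():
    pass  # this service's own logic is healthy


def check_service_b():
    # Simulate the dependency being down.
    raise ConnectionError("service B unreachable")


report = aggregate_health({"self": check_self, "service_b": check_service_b})
print(report)
```

With a report like this, the person paged at 3 a.m. can see at a glance that A itself is fine and the real problem lives in B.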
The next thing you learn is how easy (or hard) it is to get information about the state at a specific point in time. Say something happened 20 minutes ago: can I still see what happened? Can I follow how the problem developed, first to find the source, and then to find out how to solve it?
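Answering "what did this look like 20 minutes ago?" requires keeping timestamped history around, whether in metrics, logs, or snapshots. A minimal sketch of the idea, assuming you periodically record key application state (the `queue_depth` metric here is just an illustration):

```python
from collections import deque


class StateHistory:
    """Keep the last `maxlen` timestamped snapshots of application state."""

    def __init__(self, maxlen=1000):
        self._snapshots = deque(maxlen=maxlen)

    def record(self, timestamp, state):
        # Snapshots are appended in time order; copy to avoid later mutation.
        self._snapshots.append((timestamp, dict(state)))

    def at(self, timestamp):
        """Return the most recent snapshot taken at or before `timestamp`."""
        best = None
        for ts, state in self._snapshots:
            if ts <= timestamp:
                best = state
            else:
                break
        return best


history = StateHistory()
history.record(100, {"queue_depth": 3})
history.record(200, {"queue_depth": 250})   # the problem starts building
history.record(300, {"queue_depth": 9000})  # now the pager goes off
print(history.at(250))  # state as it was at time 250
```

Being able to walk backwards through snapshots like this is what lets you follow the development of a problem instead of only seeing its final explosion.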
The last thing is how easy the system is to operate without causing more problems (the operability of the system).
I remember one of the sweatiest nights of my life, when I was on call for a customer's production system and my pager got an alarm of the minor kind, the one I did not need to respond to. But at the time we had two known bugs: one was that the system occasionally failed in a way that required my intervention to restore it; the other was that once the alarm system had fired an alarm, it would not fire any more until someone went and reset it. So there I was, in the middle of the night in a foreign country, knowing that parts of that country's mobile network might go down during the night and I would not get notified. Fun times.
Anyway, what I have learned from this is that some things you test and dismiss as a minor annoyance will typically scale up in production. Another lesson in this area: once you have a system that is designed to be monitored, use that monitoring to help solve the testability problem. And you also need to add testability to the monitoring system itself, by injecting errors and other incidents with something like Chaos Monkey.
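The spirit of that last point can be shown in-process (the real Chaos Monkey works at the infrastructure level, killing instances; this is just a toy illustration of fault injection, and all names are hypothetical):

```python
import random


def inject_faults(func, failure_rate, rng=None):
    """Wrap `func` so a fraction of calls raise, simulating an outage.

    Use this in a test environment to verify that alarms actually fire
    when the wrapped dependency starts failing.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RuntimeError("injected fault: simulated dependency outage")
        return func(*args, **kwargs)

    return wrapper


# Force every call to fail, then confirm the error path (and hopefully
# the alarm wired to it) is exercised.
flaky_fetch = inject_faults(lambda: "payload", failure_rate=1.0)
try:
    flaky_fetch()
except RuntimeError as exc:
    print(f"alarm should fire: {exc}")
```

Had we run something like this against our own alarm system, the fire-once-then-go-silent bug would have shown up in a test lab instead of on a sweaty night abroad.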