Log-Digging - Featured TestSphere Card

One hundred cards. One hundred Test-related concepts.
Every month here on The Club, we’ll feature a card from the TestSphere deck for people to write their stories about.

I challenge you:
Take a few minutes to think about your experiences with the featured card.

What bugs have you found that are related? Which ones have you missed?
How have you tackled testing for this concept?
What made it difficult or easier?
What have you learned? What can others learn from your experience?

Take one of those experiences and put it to prose.
Telling your stories is as valuable to yourself as it is to others.



I’m not known to have much patience with logs. Even though I can see much value in all they record, I’d rather look for patterns than try to understand exactly what they’re saying.
Thankfully, for people like me, there are wonderful tools to help. The tool I speak of was my colleague Geert (@_TestHeader).
He was the Dynatrace guy (among other roles) on my previous project.
Dynatrace does deep recording of what goes on in and around your product.

Whenever I saw something really strange, I’d ask him to investigate and ‘drill down’. He didn’t necessarily have to scroll through logs; he’d look for obvious problems around the time I saw the strange behaviour.
More often than not, he’d supplement my bug reports with important information at the lowest level possible, so the developers could better understand what exactly had gone wrong.

Separately, our information might not have meant much. But together, it gave our developers a good case for why to fix the issue, how it was triggered, and a starting point for finding a solution.

What’s your story?

Some time ago, I had just started playing with our system logs. At the time, there were two levels of logging, “info” and “debug”. In one test, my system had a catastrophic failure. (As it was hardware and software, think smoke and extensive repairs.)

Being a glutton for punishment, I pulled an “identical” system from one of the other testers to repeat the failure. I have it in my head that if a problem can be repeated, it will be easier to fix. So I ran what I thought was the exact same test, braced myself for the boom, and… nothing odd happened. The system appeared to be happy as a clam.

Now I was confused. Two systems which are supposedly identical in hardware and software react to a supposedly identical situation in two different ways. How can that happen? Since I had started playing with the logs, I set out to repeat the test again, this time checking the logs (and keeping a hand on the power-off button to avoid more smoke and repairs; it’s not fun having everyone looking at you and asking about the foul stench!). What I found was that our two systems were, in fact, different. Mine had logging set to debug and his had logging set to info. So my system was too slow to react to a necessary change, which snowballed into a huge failure.

The lesson here is that, while logs are awesome tools, sometimes the existence of a log can change the result of your test.
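
Our failure needed real hardware to show up, but the underlying effect is easy to demonstrate in plain software. Here’s a minimal Python sketch (the logger and file names are made up, not our actual system) of how switching a logger from info to debug adds overhead to a time-critical loop:

```python
import logging
import time

def control_loop(logger, iterations=200_000):
    """Hypothetical time-critical loop; the names are made up for illustration."""
    start = time.perf_counter()
    for i in range(iterations):
        # At "info" level this call returns almost immediately; at "debug"
        # it formats and writes a line every iteration, adding real latency.
        logger.debug("sensor reading %d within limits", i)
    return time.perf_counter() - start

logging.basicConfig(filename="system.log")
logger = logging.getLogger("controller")

logger.setLevel(logging.INFO)
info_seconds = control_loop(logger)

logger.setLevel(logging.DEBUG)
debug_seconds = control_loop(logger)

print(f"info:  {info_seconds:.3f}s")
print(f"debug: {debug_seconds:.3f}s")  # noticeably slower on most machines
```

On our system, that kind of overhead was the difference between reacting in time and not.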

2 Likes

Hi all,

I decided to share another TestSphere story, this time ending up with the Log-Digging Patterns card.

I really liked @brian_seg’s story. I’ve worked on applications before where there were several different logs which we could switch on or off as needed. We’d switch on different logs depending on what we were trying to investigate. There was one incident where one of my colleagues was complaining about the application running too slowly. Guess the reason why? Somehow, every single possible log had been enabled. There needs to be enough logging to get the information you need, but not so much that it slows down the application to the point that it is unusable.

I’ve written a blog post with some other stories related to log-digging.

One example is where the logs only included a code rather than a human-readable description. We had to look up what each entry actually meant. We asked the developer to start recording meaningful messages rather than codes which no one would understand.
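
For illustration, this is roughly the kind of lookup we had to do by hand; the codes, descriptions and file name below are made up, not from the real application:

```python
# Made-up codes standing in for the cryptic entries we kept having to decipher.
ERROR_CODES = {
    "E1042": "payment gateway timeout",
    "E2007": "user session expired",
    "E3311": "configuration file missing at startup",
}

def annotate(line):
    """Append a human-readable description to any known code in a log line."""
    for code, meaning in ERROR_CODES.items():
        if code in line:
            return f"{line.rstrip()}  <-- {meaning}"
    return line.rstrip()

with open("app.log") as log:
    for line in log:
        print(annotate(line))
```

Maintaining a table like that is busywork, which is why we pushed for meaningful messages at the source instead.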

Another story I have is one where I was unable to test the application, so I asked the developers for help. They suggested adding some extra logging so I could see what was happening under the hood, making it easier for me to test.

4 Likes

Yes! It all has to do with the scientific method and quantum mechanics. If a tree falls in the forest, and there’s no one there, there is no log.

Sorry, I couldn’t resist😊

4 Likes

Brian,

Wow! A story where a software error really did cause a smoke and flames computer failure, the way Star Trek always said it would!

1 Like

Here’s a short story of how I used blink testing instead of detailed queries while going through Apache error and access logs to discover patterns.

At the start of a school year, the educational application our team was working on doubled in users. During the first week it became clear that we had severe performance issues. As a quick fix we went from one server to two servers and then to four servers under a load balancer. Still, the application went down.

To find out what was wrong, we checked the Apache access and error logs. The first thing I noticed was that the developers opened the logs in a shell with PuTTY, using tail or more, or searched them with grep. They were doing very detailed searches that were not helping them find the issue.

The console window is not very useful for detecting patterns. Developers are quite used to the shell and perceive it as good enough, but for me, a visual person, viewing the logs in the shell window was not very satisfying. So I went looking for tools that could help there. The tool “Apache Log Viewer” by iannet.org was the solution. It enables you to scroll through logs quickly, giving different colors to different status codes. This is perfect for blink testing: using the free-scrolling wheel of my Logitech MX Master mouse, I could sweep through the logs at high speed. Moreover, it provides very good filters for exploring the logs.

Sweeping at high speed makes you discover recurring events quickly. You see an interesting pattern, put your index finger on the wheel to stop it, and analyse the red lines and the lines preceding them.
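
Apache Log Viewer is a GUI tool, but if you ever want to roll your own, the color-by-status idea is simple enough to approximate in a few lines of Python. This sketch assumes the common combined log format; the file to read comes in as a command-line argument:

```python
import re
import sys

# Rough approximation of the color-by-status idea: pull the status code out
# of each combined-format access log line and color the line by status class,
# so 4xx/5xx responses jump out while sweeping through the output.
STATUS_RE = re.compile(r'"\S+ \S+ \S+" (\d{3}) ')
COLORS = {"2": "\033[32m", "3": "\033[36m", "4": "\033[33m", "5": "\033[31m"}
RESET = "\033[0m"

def colorize(line):
    match = STATUS_RE.search(line)
    if not match:
        return line.rstrip()
    color = COLORS.get(match.group(1)[0], "")
    return f"{color}{line.rstrip()}{RESET}"

# Usage: python color_log.py access.log | less -R
with open(sys.argv[1]) as log:
    for line in log:
        print(colorize(line))
```

It won’t replace the smooth scrolling and filters of the real tool, but the principle is the same: make the anomalies a different color than the noise.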

The first issue found using the Apache Log Viewer was that students were actually keeping the F5 key pressed to mess up the application. What a surprise. This was an issue that had already been found ages ago through Shoe testing but never addressed. We now had the proof to get it fixed.

The next pattern discovered was that at 18 minutes past every hour, request start times were no longer displayed in chronological order. This meant that the server was overloaded. Once I pointed out the pattern to a developer, he started looking into it. What happened at 18 minutes past every hour was a database backup into an archive file, sucking up memory and CPU.
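
For what it’s worth, once you know which pattern to look for, the ordering check itself can also be scripted. A rough sketch, assuming the default Apache timestamp format (the file name is just an example):

```python
import re
from datetime import datetime

# Access logs are written in completion order, but %t records the request
# start time, so start times that jump backwards point at slow, overlapping
# requests: exactly the "not chronological any more" pattern.
TIME_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) ([+-]\d{4})\]")

def start_time(line):
    match = TIME_RE.search(line)
    if not match:
        return None
    return datetime.strptime(" ".join(match.groups()), "%d/%b/%Y:%H:%M:%S %z")

previous = None
with open("access.log") as log:
    for number, line in enumerate(log, start=1):
        current = start_time(line)
        if current is None:
            continue
        if previous is not None and current < previous:
            print(f"line {number}: start time jumped backwards to {current}")
        previous = current
```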

3 Likes