Power Hour - Curious, Stuck or Need Guidance on DevOps or Observability?

Automation…it’s always automation with you :stuck_out_tongue:

But seriously, while both are very important, there is also absolutely a difference!

Automated testing would be more akin to alerting in the operations world. You are setting up a known scenario and telling a computer to let you know if it does anything out of the ordinary.
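To make the analogy a bit more concrete, here is a tiny sketch in Python (the metric and threshold are completely made up) of what an alert really is: a scenario you already know about, encoded so a computer can watch it for you.

```python
# A minimal, hypothetical alert rule: we already know what "ordinary" looks
# like, so we encode it up front and let the computer watch for deviations.

def check_error_rate(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Return True if the error rate breaches the known-bad threshold."""
    if requests == 0:
        return False
    return errors / requests > threshold

# The "known scenario" is the 5% threshold; anything beyond it pages a human.
if check_error_rate(errors=37, requests=500):
    print("ALERT: error rate above 5% - investigate!")
```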

Observability would be more akin to tool assisted exploratory testing. With observability you are in the experimenting mindset, building on the answers you get from your data, like logs, traces, and metrics. Really, everyone in software delivery (testing, development and ops) should have at least a basic understanding of how to use logs, metrics and traces, which is why we are doing a workshop at London Tester Gathering about how to use popular open source versions of these tools.


First of all, thanks for sharing the article! I think that quote could easily be compared to “If you are testable, I can automate you.” As in…just because something is testable doesn’t mean you would choose to automate it. And just because something isn’t particularly testable (e.g. doesn’t have identifiers in the HTML) doesn’t mean I won’t automate you (:woman_facepalming: xpath!).

Observability is all about asking new questions, while monitoring is about cementing in existing understanding. I would say the biggest benefit (including being more than monitoring) is the ability to track down “weird behaviours” and other unknowns. For example, we had a situation where it looked like a malicious IP was spamming our services. Thankfully we had a bit of time to review, because on looking more closely at the requests it turned out we had created a bad redirect, which meant a perfectly ok customer request was being multiplied! We needed to know things like the customer ID as well as the size of the customer project to track down this behaviour, which could never have been caught by monitoring since we had no idea it could ever happen.

By plugging high cardinality data (very specific stuff like user IDs and the size of an image uploaded etc.) into the system outputs, we can start slicing and dicing the exemplar “good” and “bad” requests to find new patterns across sometimes complex combinations of characteristics.
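As a rough illustration of what I mean by slicing and dicing (these events and field names are invented for the example):

```python
from collections import Counter

# Hypothetical structured events with high cardinality fields attached.
events = [
    {"user_id": "u-101", "upload_bytes": 8_400_000, "status": 500},
    {"user_id": "u-101", "upload_bytes": 9_100_000, "status": 500},
    {"user_id": "u-202", "upload_bytes": 12_000, "status": 200},
    {"user_id": "u-303", "upload_bytes": 7_900_000, "status": 500},
]

# Slice the "bad" requests by any dimension we recorded - here, user_id.
bad = [e for e in events if e["status"] >= 500]
print(Counter(e["user_id"] for e in bad))   # who is affected?
print(max(e["upload_bytes"] for e in bad))  # how big were the uploads?
```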

Hi Sharon, you have a couple here, but to start with this one:

IMO the number one principle has to be around everyone being accountable for testing, and a fundamental belief that testing can be applied throughout an application’s lifecycle, including idea generation and after customers are using it. When talking about speed, I also think it is key to examine what is safe to fail and what is fail safe. There is some amazing work being done within the Cynefin space trying to examine this, and I have learned a lot from @vds4 about how to minimise the surface area of what can’t fail (and therefore must be tested exhaustively before release) while letting the (often majority of) code which can be resilient to roll backs etc. sail at a quicker pace.

What these ideas do is focus on changing how we develop software as much as (or sometimes more than) how we test it. So, to answer this one:

I have to start with the annoying statement that I believe DevOps testing is normal testing. It is still modeling a system and a change, identifying the risks that they present, and then validating the likelihood and impact of an issue arising. The differences come in with the size of the changes you may be looking at. Of course DevOps still delivers large and valuable features, but it does so in much smaller steps of change, sometimes down to only a few hours of work.

Therefore, as a test specialist on a DevOps team, I found myself following the modern testing principles before Alan Page was so nice as to write them up clearly! That means I focused on upskilling teammates in other roles on techniques software testers use frequently, like oracles, heuristics and models, so that they could handle the small changes. Meanwhile, I spent the bulk of my implementation time on deployment strategies that get feedback we can rely on and react to quickly, as well as the bigger picture testing of feature ideas.

The other area I would suggest looking into is the observability topic elsewhere on this thread. It is a fantastic way to get feedback from our systems as delivery moves faster.


So I think there is a really basic technical requirement which underpins any strategy, and which is worth keeping in mind when working on this: how to make data from different applications traversable by everyone. An example of this could be setting a standard set of data that every log should have (even if some teams add more data because of their own specifics).
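For example, a minimal sketch of what a shared baseline log schema could look like (field names here are just illustrative):

```python
import json
import time
import uuid

# A hypothetical baseline schema: every service logs at least these fields,
# so data from different applications stays traversable by everyone.
def make_log(service: str, level: str, message: str, **extra) -> str:
    record = {
        "timestamp": time.time(),
        "service": service,
        "level": level,
        "trace_id": extra.pop("trace_id", str(uuid.uuid4())),
        "message": message,
    }
    record.update(extra)  # teams may add their own domain-specific fields
    return json.dumps(record)

print(make_log("billing", "ERROR", "charge failed", customer_id="c-42"))
```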

Given this focus, I would suggest cracking on with getting all of your services exposing logs (structured please!) and metrics to the same tool, as well as passing tracing IDs properly. By starting to tackle this, you can build a strong foundation to then add in higher cardinality data that may help you debug your systems faster, like customer ID, region of request or other domain specific information.
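And a rough sketch of what passing tracing IDs properly can look like (the header name and downstream URL are placeholders; in practice you would follow a standard like W3C Trace Context):

```python
import uuid

import requests  # any HTTP client works; requests is assumed here

TRACE_HEADER = "X-Trace-Id"  # hypothetical header; W3C uses "traceparent"

def handle_request(incoming_headers: dict) -> None:
    # Reuse the caller's trace ID if present, otherwise start a new trace.
    trace_id = incoming_headers.get(TRACE_HEADER, str(uuid.uuid4()))
    # Pass the same ID on every downstream call so the hops can be joined up.
    requests.get("http://downstream.internal/work",  # placeholder URL
                 headers={TRACE_HEADER: trace_id})
```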

When thinking about strategy in relation to the second tier of growth (as in, you have the tooling and functionality in place to gather data, but you are not yet able to answer interesting questions), I think it is time to jump into the answer to this question:

This is interesting in relation to strategy because methods could mean a couple of different things. In many ways people speak about metrics, logs and traces as the core data structures of observability. There is a newer model, though, which speaks about rich events instead: a single event carries all the data necessary to calculate each of these pieces. For example, an event would have timing information from which to calculate latency, and an error code which can then be extrapolated out and made into an error rate. This amazing deep rabbit hole is where you can start to explore strategy once you have successfully implemented the fundamental pieces.
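As a tiny worked example (with invented events), deriving latency and error rate from rich events might look like:

```python
# Hypothetical rich events: each one carries enough raw data to derive
# latency, error rate and throughput after the fact.
events = [
    {"start": 0.00, "end": 0.25, "status": 200},
    {"start": 0.10, "end": 1.40, "status": 500},
    {"start": 0.20, "end": 0.31, "status": 200},
]

latencies = [e["end"] - e["start"] for e in events]  # per-request latency
error_rate = sum(e["status"] >= 500 for e in events) / len(events)

print(f"max latency: {max(latencies):.2f}s, error rate: {error_rate:.0%}")
```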


This is an awesome question and one I am wrestling with a bit myself. I think that there are some great metrics from Accelerate and The State of DevOps Report, like Mean Time To Recovery (MTTR). To have a low MTTR, you must also be quick to diagnose issues, which can be driven by high observability.
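For anyone new to the metric, MTTR is just the average time from an incident being detected to it being resolved; a quick worked sketch with invented incident data:

```python
# Invented incidents: (detected_at, resolved_at) in minutes since midnight.
incidents = [(600, 645), (910, 925), (1300, 1420)]

downtimes = [resolved - detected for detected, resolved in incidents]
mttr = sum(downtimes) / len(incidents)
print(f"MTTR: {mttr:.0f} minutes")  # (45 + 15 + 120) / 3 = 60 minutes
```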

But tbh that is a bit of a cop out. I do think there should be some “standard” questions which can at least get people started. What I am thinking about these days is asking something like: “What are the most frustrating user interactions on your site?” This will require you to understand a combination of latency, error rate, and throughput (three key indicators for Site Reliability Engineering), but also to understand your business model. This is where the real power of observability comes in. Think you can already answer this? Then your additional success criteria are probably around your ability to ask new domain specific questions. Is this still a stretch for you? Think about how you can break the question down to build towards a holistic answer. Maybe first ask about latency: do you know what your slowest response time is? Or about throughput: do you know which endpoints are most requested, and does time of day/week/year affect this? Then move to error rate and ask how fine-grained your understanding of these numbers is, and so on.
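To make those starter questions concrete, here is a toy sketch (with an invented request log) of answering the latency and throughput ones:

```python
from collections import Counter

# Hypothetical request log: (endpoint, hour_of_day, response_ms, status).
requests_log = [
    ("/checkout", 9, 1200, 200), ("/search", 9, 90, 200),
    ("/checkout", 17, 2400, 500), ("/search", 17, 110, 200),
    ("/search", 17, 95, 200),
]

# "Do you know what your slowest response time is?"
print(max(requests_log, key=lambda r: r[2]))

# "Which endpoints are most requested, and does time of day affect this?"
print(Counter((endpoint, hour) for endpoint, hour, _, _ in requests_log))
```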

But I tend to fall over a bit when getting to the more specific question of:

Really, observability is just a sliding scale, and is best measured by your ability to answer the questions you ask. But this is really woolly, and even more so it is a trailing indicator. I do not have any great ideas for leading indicators other than to speed up feedback by taking an idea from the chaos engineering folks and running gamedays to help identify possible gaps in your understanding.

Most organisations today should have log aggregation tools, metrics databases and visualisers, and (hopefully) tracing tools. These are not always accessible without asking (licensing fees etc.), but I would definitely suggest building your understanding of what these tools can do, and also what your business currently uses them for. If you are in the position of introducing these tools, you can get up and running pretty quickly with some free tools, including Prometheus for metrics, Elasticsearch/Logstash/Kibana for log aggregation and Jaeger for tracing.
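As a taster, exposing your first metric with the official Prometheus Python client is only a few lines (the port and metric name are just examples):

```python
# A minimal sketch using the official client (pip install prometheus-client);
# Prometheus can then scrape the metrics endpoint on port 8000.
import time

from prometheus_client import Counter, start_http_server

REQUESTS = Counter("myapp_requests_total", "Total requests handled")

if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/
    while True:
        REQUESTS.inc()       # stand-in for real work being counted
        time.sleep(1)
```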

In my experience, exploratory testing is built in throughout the process in a DevOps workflow. This means structuring delivery of features so that maybe the API part is ready to explore before the UI is complete (or vice versa). It also means exploratory testing in either a fixed environment or straight on production, in both cases after the features are released and as they continue to progress. This can be done in a lower risk fashion by using dark deploy techniques, or it can be done with a reliance on roll back/forward ability if any issues are identified.
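A rough sketch of what a dark deploy gate can look like (the flag name and allow-list are invented; real feature flag systems are more sophisticated):

```python
import os

# A hypothetical feature flag gate: the code ships to production "dark" and
# is only exercised for testers (or a cohort) until we flip the flag.
def new_checkout_enabled(user_id: str) -> bool:
    if os.environ.get("NEW_CHECKOUT", "off") == "on":
        return True
    # Dark deploy: only an allow-listed tester sees the new path.
    return user_id in {"tester-1", "tester-2"}

def checkout(user_id: str) -> str:
    return "new flow" if new_checkout_enabled(user_id) else "old flow"

print(checkout("tester-1"), checkout("customer-9"))
```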

As for your second question about global knowledge, I think this depends on your context. Depending on the size of your organisation and the speed of delivery, I would maybe call into question the value of ever making all findings global knowledge, let alone before some action needs to be taken.

And hey @sharonoboyle this above stuff applies to your question 6 too.

Though to your bottleneck question I would also add a suggestion that exploratory testing become a part of the development workflow. This doesn’t have to mean that developers do the testing instead of testers, but maybe while pairing? Basically, if the code is so high risk that it cannot be deployed to production without exploratory testing, then we shouldn’t shield the developers from that experience and impact.


This is not my area of expertise, but the really big question you have to ask here is what risk you are trying to mitigate. Is the issue just maintaining the current speed of computing? If so, you can put this type of testing into the pipeline, as it is comparative and can be done at low scale. Is it to validate that a new service is ready for production load? If so, can you mirror the traffic to it? Replay the traffic on it? Or even just turn it on in production with a ready hand to switch it off based on results? I guess the key here is that putting up “production like” environments to do any sort of soak/load/perf testing is extremely expensive and can only ever be “close” to production, not the real thing. Therefore, like Eric Proegler shared in his TestBash Brighton 2019 talk, look to production for your performance testing!
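To illustrate the replay idea (the URLs and log format are invented; purpose-built tools like GoReplay do this properly at scale):

```python
# A rough sketch of replaying logged production traffic against a candidate
# service, comparing status codes and timings as we go.
import requests  # assumed available

CANDIDATE = "http://candidate.internal:8080"  # placeholder host

def replay(access_log_lines):
    for line in access_log_lines:
        method, path = line.split()[:2]
        if method == "GET":  # replay only safe, idempotent requests
            resp = requests.get(CANDIDATE + path, timeout=5)
            print(path, resp.status_code, resp.elapsed.total_seconds())

replay(["GET /search?q=devops", "GET /health"])
```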

And I think this kind of speaks to the more generic tools in DevOps question:

I think the need to incorporate more into our build pipelines and different environments definitely does encourage new tooling. But to the question of whether it evolves as we get MORE agile or MORE DevOps, there was actually an interesting note in the recently released DevTestOps survey that high performing teams were never completely satisfied with their tools. My experience is that we have gotten more complex with our tooling, only to later want to trim back and find ways to offload some of the complexity to vendors or more opinionated tooling. I think these are both valid and necessary phases, as you can’t know what you can give up control of until you understand what it is.

Alright Sharon, this is one of your harder ones. But it is also very much in my old wheelhouse of being a consultant so I will say what I can :woman_shrugging:

This change, and the very real challenges and stumbles it uncovers, requires a strong set of shared values throughout the organisation (or division/team etc.), and particularly from the top. One way to keep it safer is to focus on that safe to fail vs fail safe concept and see if you can build a shared understanding of what an “ok” failure may look like, so as not to get overwhelmed when lower priority issues do happen.


IMO moving towards DevOps is a very natural direction for testers who are on fast moving delivery teams. When I was on a team completing releasable features almost daily and had a ratio of more than 5 developers to 1 tester, I realised that the value I could bring was in enabling, not owning. I think that would be my advice…DevOps is all about connecting the developing and operating of services so how can you as a tester improve this? Could be as simple as asking about logging while testing a story, or it could be more involved like designing automated pipelines for continuous delivery.

Some more quick win examples include introducing faster forms of feedback, like code linting and parallelising tests, or automating previously manual jobs, like running certain commands or emailing someone for review. Most importantly, you need to think of ways to help meet needs without owning the long term implementation of the solution.
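As one concrete example of automating a manual job (the addresses and the command are placeholders), a small script could run the checks and email a reviewer on failure:

```python
# A tiny sketch of automating a previously manual job: run a command and
# notify someone if it fails.
import smtplib
import subprocess
from email.message import EmailMessage

result = subprocess.run(["make", "lint"], capture_output=True, text=True)
if result.returncode != 0:
    msg = EmailMessage()
    msg["Subject"] = "Lint failed on main"
    msg["From"] = "ci@example.com"          # placeholder addresses
    msg["To"] = "reviewer@example.com"
    msg.set_content(result.stdout + result.stderr)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)
```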

I think this is very true, and I think it is most difficult because of the barrier to entry on new tooling. Not only are many software engineers and testers not exposed to the operations side of things, they often aren’t even granted logins and access when they ask! I believe understanding the tools that are available is a great way to generate ideas and is fundamentally necessary to advocate for what you need. If you are in a space where you don’t have operational access and want to get hands on experience using logs, metrics and traces to debug a distributed system, you can use awesome OSS tools like we are for the London Tester Gathering workshop.


I am liking this! Thanks


Thank you Abby, :heart: all your replies, so clear and helpful and there is plenty to digest and start to put into practice. Fab!


Glad to be helpful Sharon, sorry I didn’t get to all of yours but hopefully it’s just the start of the conversation :smiley:


DevOps is a methodology and a culture shift that emphasizes collaboration across the entire lifecycle of a product.
In this infographic, we will discuss the four fundamentals that are important for a successful DevOps implementation.