NATS issue - possible bug

Hopefully interesting - I am reading that the recent chaos at airports could possibly be down to bad error handling (if I am reading correctly, a file with data errors was uploaded from the French side, and rather than the file being rejected, it caused the chaos).

“Exception” handling etc. is definitely an area that is overlooked a lot in design and coding, and a good area to focus testing on. Don’t always assume everyone will do things “properly”!
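To make it concrete, here is a minimal sketch (the flight-plan format and all names are entirely invented) of the difference between letting one bad record crash a processing loop and quarantining it for review:

```python
# Hypothetical example: reject a single malformed record instead of letting
# one bad input halt the whole processing loop.

def parse_flight_plan(raw: str) -> dict:
    """Parse a raw flight plan string; raises ValueError on malformed input."""
    fields = raw.split(",")
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    callsign, origin, destination = (f.strip() for f in fields)
    if not callsign:
        raise ValueError("missing callsign")
    return {"callsign": callsign, "origin": origin, "destination": destination}

def process_batch(raw_plans: list[str]) -> tuple[list[dict], list[str]]:
    """Process what we can; quarantine what we can't, rather than stopping everything."""
    accepted, rejected = [], []
    for raw in raw_plans:
        try:
            accepted.append(parse_flight_plan(raw))
        except ValueError as err:
            rejected.append(f"{raw!r}: {err}")  # keep the reason so a human can review it
    return accepted, rejected

accepted, rejected = process_batch(["BA123, EGLL, KJFK", "garbage-line"])
print(accepted, rejected, sep="\n")
```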


Details for interested people:

“Our systems, both primary and the back-ups, responded by suspending automatic processing to ensure that no incorrect safety-related information could be presented to an air traffic controller or impact the rest of the air traffic system.”

“One particular flight plan, whoever it was from, corrupted the whole system”


No validating or sanitizing of incoming data? Yikes.
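Even a thin check at the boundary would help. A rough sketch, with a made-up waypoint format rather than the real ICAO rules:

```python
# Hedged sketch of boundary validation: check incoming data against explicit
# rules at the edge, so bad values never reach the core system.
# The pattern below is illustrative only, not the real waypoint format.
import re

WAYPOINT_RE = re.compile(r"^[A-Z]{2,5}$")

def validate_waypoints(waypoints: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the data looks acceptable."""
    problems = []
    for wp in waypoints:
        if not WAYPOINT_RE.match(wp):
            problems.append(f"waypoint {wp!r} does not match the expected format")
    return problems

issues = validate_waypoints(["DVL", "dvl✈", ""])
if issues:
    print("rejected:", issues)  # reject the submission; don't crash the processor
```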


What really pulls me back from this one is that rigorous testing was supposedly in place, and I’m sure we will never know who was responsible for those checks; it will be old news by the time the facts surface. So stepping back is super important. It’s easy to get lost speculating about how strong the culture of reliability was, how good their DRP (disaster recovery plan) was, or whether it was down to a recent process or component change rather than human error. Statements like the one released saying this “will never happen again” are all distractions.

How often do we discuss the differences between risk, likelihood and damage? Regular maintenance of our vocabulary and of how we communicate about what we do is just as important, so that blame-slinging doesn’t create a toxic team dynamic.


Really interesting post, thank you. I was almost impacted by this flying home on Monday, so I was very curious to know what happened, and once hacking was ruled out I did wonder about errant data (“has someone put an emoji in?” is always in my head!).

I guess the question for me is: what are we going to do to learn from this mistake at NATS? Can we tighten anything up, or get lower-priority “that’ll never happen” data-entry bugs pushed higher up the chain for fixing? Can we check for them earlier? If so, what tools or resources can we point to for help with this? (The Bug Magnet Chrome extension is always a favourite of mine.)
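One cheap option is to turn those awkward values into an automated check so they can’t be deprioritised forever. A hedged sketch; the system under test here is a made-up stub, not anyone’s real API:

```python
# Push "that'll never happen" data-entry cases into automated checks: feed a
# handler the kind of awkward values Bug Magnet suggests and assert it never crashes.
import re
from dataclasses import dataclass

import pytest

@dataclass
class Result:
    status: str  # "accepted" or "rejected"

def submit_flight_plan_field(value: str) -> Result:
    """Hypothetical stand-in for the real system: accept short ASCII identifiers only."""
    if re.fullmatch(r"[A-Z0-9]{1,10}", value):
        return Result("accepted")
    return Result("rejected")

AWKWARD_VALUES = [
    "",                       # empty
    " " * 256,                # whitespace only
    "✈️😀",                   # emoji ("has someone put an emoji in?")
    "Ω" * 10_000,             # very long unicode string
    "O'Hare; DROP TABLE --",  # injection-shaped text
]

@pytest.mark.parametrize("value", AWKWARD_VALUES)
def test_awkward_input_is_handled_gracefully(value):
    result = submit_flight_plan_field(value)
    assert result.status in {"accepted", "rejected"}  # anything but a crash or hang
```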

Would love to hear the community’s thoughts on this one!


So it turns out that a rather long flight plan that also runs through UK airspace caused this. The software attempted to work out which portions of the flight plan affected the UK systems, could not, and “shut down”. The backup system executed the same orderly “shutdown” when fed the same flight plan.
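Based only on what has been reported, here is a toy illustration (all route data invented) of how a duplicate waypoint name could push an “extract the UK portion of the route” step into a state it treats as impossible, and why failing just that one plan is a very different outcome from suspending everything:

```python
# Toy illustration with made-up route data: if waypoint names are assumed unique,
# a naive "find the UK entry and exit points" step can land on the wrong occurrence
# and reach a state it considers impossible.
ROUTE = ["DVL", "ETIKI", "REDFA", "LAMSO", "DVL", "BANBA"]  # "DVL" appears twice
UK_ENTRY_NAME, UK_EXIT_NAME = "REDFA", "DVL"

def uk_segment(route: list[str]) -> list[str]:
    entry = route.index(UK_ENTRY_NAME)  # index 2
    exit_ = route.index(UK_EXIT_NAME)   # index 0 -- the *wrong* DVL, far outside the UK
    if exit_ < entry:
        # The "cannot happen" state. Failing just this plan and queueing it for a
        # human is very different from suspending all automatic processing.
        raise ValueError("exit point found before entry point; plan needs manual review")
    return route[entry:exit_ + 1]

try:
    print(uk_segment(ROUTE))
except ValueError as err:
    print("quarantined:", err)
```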

It shows how a corner case can become so important, but also how media briefings full of euphemisms can make it seem as if a “one in 15 million” mistake were an acceptable safety margin. I cannot see how any application simply “shuts down”… the last time I recall “millions” and “shut down” being used in the same news briefing sound bite, it was about a “rapid unscheduled disassembly” kind of shutdown. I wonder if the deeper problem is that we dumb down all press releases, bandying about numbers like one in 15 million, to the point that nobody can learn any lessons from the hard work that teams and wider standards bodies do. It makes all the hard work testers do feel like random chance; it probably was, in part, but dumbing it down benefits nobody except the conspiracy theorists, surely?


My inner cynic says that the dumbing down might be necessary to prevent the media getting the message catastrophically wrong.

I do live in the USA, which is not exactly famed for intelligent journalism these days.

I suspect it’s always going to remain the Utopian dream that the internet is a superhighway that boosts us all into the 22nd century of enlightenment. People will always believe what they want to believe, facts or none, but not being factual hurts those who seek to build better.

We had a recent service outage at work, and the email we sent to customers was a mostly technical, detail-heavy explanation of what went wrong; that forces people to communicate without any agenda being on the table. It’s something test engineers always have to be aware of: the bugs you find are not “yours”, they are all risks, just risks that you happened to discover or that a customer discovered for you. How you adapt tactically when new classes of defect are found is the key for me. One in 15 million is not a class of defect.

No, it isn’t. It’s an edge case, but a low-probability, extreme-impact one. I’d love to know who decided that it wasn’t possible to have multiple locations with the same name (the information publicly available so far suggests that the assumption that all locations in the system would have unique names is the root cause, but there’s so much that isn’t available that I wouldn’t want to go on record as claiming that).

I will say that simple assumptions like “each location will have a unique name” can and do wreak havoc if they aren’t enforced at all levels, including data storage.
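For example, a hedged sketch of catching that ambiguity at load time instead of mid-processing (the location data here is invented):

```python
# Enforce the assumption everywhere: when building a lookup table of locations,
# refuse to silently overwrite on a duplicate name, or key by something richer
# than the name alone so the storage layer enforces uniqueness too.
from collections import defaultdict

locations = [
    {"name": "DVL", "country": "FR", "lat": 49.3, "lon": 0.3},
    {"name": "DVL", "country": "US", "lat": 48.1, "lon": -98.9},
]

by_name = defaultdict(list)
for loc in locations:
    by_name[loc["name"]].append(loc)

duplicates = {name: locs for name, locs in by_name.items() if len(locs) > 1}
if duplicates:
    # Surface the ambiguity up front instead of discovering it mid-processing.
    print("ambiguous names:", list(duplicates))

# Alternatively, make the key genuinely unique:
by_key = {(loc["name"], loc["country"]): loc for loc in locations}
```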


I’m pretty sure it’s not the same thing that happened here, but I recall a good week-long debate about what the collision rate would be when UUIDs are generated on multiple computers at the same point in time. It’s something like 1 in many billions, I believe, but the chance gets greater if those computers all share a common root machine. In the end we settled on splitting them into separate pools, where a pool holds objects of one type only, but never assuming that a collision can never happen in the wild.
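For anyone curious about the numbers, a rough back-of-the-envelope using the standard birthday approximation for version 4 UUIDs (122 random bits); this assumes genuinely independent, well-seeded generators, which is exactly the assumption that breaks down when machines share a common root:

```python
# Rough collision estimate for random (version 4) UUIDs using the birthday
# approximation: p ≈ 1 - exp(-n^2 / (2N)), with N = 2^122 possible values.
import math

RANDOM_BITS = 122
N = 2 ** RANDOM_BITS

def collision_probability(n_uuids: int) -> float:
    """Approximate probability of at least one collision among n_uuids random UUIDs."""
    return 1 - math.exp(-(n_uuids ** 2) / (2 * N))

for n in (10**6, 10**9, 10**12):
    print(f"{n:>15,d} UUIDs -> collision probability ~ {collision_probability(n):.3g}")
```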

You have to leave some bugs to be discovered by your customers. Totally hats off to the OPS team.