Air traffic control chaos: How human error can lead a tiny glitch to spiral out of control
Several thousand passengers were stranded at airports, hotels and connection depots following the recent system-wide glitch of the UK air traffic control systems. Some passengers were told of flight cancellations in advance, so they could make alternative travel plans.
Unfortunately, following some 2,000 flight cancellations over 48 hours, most passengers were either sleeping on airport floors or sitting on planes which were unable to take off. So what was the glitch and how did it create so much chaos?
The problems appear to have been caused by unusual data in a flight plan submitted into the National Air Traffic Services (Nats) system by a French airline. This data couldn't be processed because it wasn't recognized by computers.
But it's also worth considering whether there were organizational issues. It will be important to know how much senior staff knew about the systems they were in charge of and how proactive they were in addressing the problem.
From a managerial perspective, Nats can be divided into four levels: local, regional, central and top (where higher-level decisions are made).
In principle, controllers should be able to rectify the data error. In practice, a common approach is to mark and hold it temporarily—something called "error parking." This can mitigate the problem as long as everything else continues to work properly. But this can also cause the error to "grow," affecting other parts of the system.
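To make the idea of "error parking" concrete, here is a minimal sketch in Python. It is purely illustrative and does not describe how Nats's software actually works: the class names, the validation rules and the `escalate_after` threshold are all assumptions invented for this example. It shows a faulty flight plan being marked and held rather than halting processing, with a simple count that flags when parked errors have started to pile up.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FlightPlan:
    callsign: str
    waypoints: list[str]


def find_problem(plan: FlightPlan) -> Optional[str]:
    """Return a description of the first problem found, or None.

    These checks are made-up examples, not real flight plan rules.
    """
    if not plan.callsign:
        return "missing callsign"
    if len(plan.waypoints) < 2:
        return "route needs at least two waypoints"
    if len(set(plan.waypoints)) != len(plan.waypoints):
        return "duplicate waypoint name"
    return None


@dataclass
class Processor:
    # Hypothetical limit: how many parked plans are tolerated
    # before the situation is escalated to someone more senior.
    escalate_after: int = 3
    accepted: list[FlightPlan] = field(default_factory=list)
    parked: list[tuple[FlightPlan, str]] = field(default_factory=list)

    def submit(self, plan: FlightPlan) -> str:
        problem = find_problem(plan)
        if problem is None:
            self.accepted.append(plan)
            return "accepted"
        # Park the faulty plan with its reason instead of stopping everything.
        self.parked.append((plan, problem))
        if len(self.parked) >= self.escalate_after:
            return "escalate"
        return "parked"
```

In this sketch, valid plans keep flowing while faulty ones are set aside with a recorded reason. The "escalate" result stands in for the point at which accumulated errors should reach someone with the authority to act on them, rather than being left to grow unnoticed.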
This week, Nats released its preliminary report into the incident. Its chief executive Martin Rolfe said the error was "a one in 15 million" event. In a response, transport secretary Mark Harper said he wanted to "echo NATS's apology to those who were caught up."
However, the incident will also be subject to investigation by the Civil Aviation Authority (CAA). There are some obvious questions to ask.
These focus on the roles played by managers in the identification of glitches and their repair, the quality of training offered to unit controllers, guidelines for standardized operating procedures—documenting day-to-day processes to make them repeatable—and support for resolving glitches.
In December 2013, an air traffic control system failure led authorities to recommend changes to Nats' "crisis management capabilities" and for it to consider the different ways crises can be handled. A year later, another incident occurred, caused by a fault in software written in the Ada programming language that was developed in the 1980s.
The resulting enquiry report said that "it is evident that neither of these recommendations had been addressed fully." It made further recommendations to strengthen systems and contingency steps to help ensure they were "sensitive to their impact on the wider aviation system."
For the most recent incident, the picture remains unclear. But, in my experience as a researcher of management, managers further up the chain can often pay more attention to immediate threats. They may therefore underestimate the impact of accumulated errors, or may not have enough time to monitor them.
There has been stinging criticism of the chaos from figures within the industry, including the director general of the International Air Transport Association, Willie Walsh, Ryanair boss Michael O'Leary and Johan Lundgren, chief executive of Easyjet.
"This system should be designed to reject data that's incorrect, not to collapse," Walsh explained. Lundgren said a review of the situation should determine whether Nats is "really fit for purpose, not only on the systems but on the technology, on the staffing levels." O'Leary said the preliminary report into the chaos was "full of excuses."
With that in mind, it's reasonable to ask questions of managers in charge of the systems and procedures, including whether everything possible was done to avoid the disruption seen during the bank holiday.
Another point to bear in mind: many senior managers—particularly at chief executive and managing director level—are not necessarily technicians. This means that they may not be fully aware of glitches or their potential impacts if the problems have not previously been reported.
Sometimes, front-line workers may have reasons not to report problems. For example, the problems might not seem significant enough. Or employees might feel that raising their heads above the parapet could limit their career opportunities. Unfortunately, as long as the glitch is not salient and the machine still works, people usually ignore it.
What's currently unclear is what role, if any, management culture, decision making or an inability by senior staff to understand parts of the system played in this. That will be for the CAA investigation to disentangle.
Problems affecting air traffic control have the potential to spark a crisis of consumer confidence which must be addressed as a matter of urgency. There are a couple of things that should already be happening.
Nats has now apologized to the affected passengers. But managers and authorities should also offer replacement flights, coupons or other compensation of comparable value. A dedicated phone line or website should be set up to help affected passengers.
Managers have been improving communication between technicians and non-technicians and should be praised for this change in attitudes. The more the two sides talk to each other, the lower the chances of something like this happening again.
However, the damage to the aviation industry from this episode has been severe. The risk for the industry is that passengers affected by the problems may look to alternative forms of transport in the future. In addition, aviation insurers may significantly raise insurance premiums, ultimately affecting the cost of flying for consumers.
The CAA has a very serious job to do.