It was a busy holiday weekend, and Great Britain’s national flag carrier was forced to ground all flights out of London’s two main airports, Heathrow and Gatwick—which affected the airline’s operations around the world. Oh, and the incident also affected British Airways’ call centers and online booking sites, making the situation even more frustrating for stranded passengers.
Most operations have now been restored, the airline says, but more than 1,000 flights were canceled and 75,000 passengers stranded.
But here’s the thing: The problems weren’t due to some evil cyber attack or ransomware assault. Nope, it was just another “global IT system failure,” reportedly British Airways’ sixth such incident in the last year alone!
Power supply issue blamed
As the chaos continued, British Airways CEO Alex Cruz blamed a relatively short but strong power surge that prevented a backup system from starting up properly for several hours, possibly affecting data synchronization. Cruz declined to say where the problem was located, and the BBC reported that he is resisting calls to resign over the incident.
According to The Independent, the airline is still working to pinpoint the precise causes, but some blame a common combination of modern and legacy technologies. According to The Register, British Airways maintains two large data centers near its Heathrow headquarters.
Meanwhile, the GMB union representing much of British Airways’ IT staff blamed the problem on the company’s recent outsourcing of many IT functions to Tata Consultancy Services in India. According to British newspaper The Sun, sources claim the airline’s backup system “failed to take over when the primary [IT system] failed due to a power cut,” and the problem spiraled out of control because “inexperienced staff in India didn’t know how to kick-start the airline’s back-up system.”
The airline responded, saying, “We would never compromise the integrity and security of our IT systems.” That sounds great except that—one way or another—those systems’ integrity has obviously been compromised. And Cruz, meanwhile, claimed the incident had nothing to do with cost cutting: “They’ve all been local issues around a local data centre, which has been managed and fixed by local resources,” he told Sky News.
Maybe not, but the incident points to the often-overlooked human factor in disaster recovery systems. It’s not enough to have the equipment in place to deal with outages and other incidents. It’s not even enough to develop plans to deal with problems. It’s critical to make sure key personnel are knowledgeable and practiced in actually implementing those plans.
That usually means regular failover and backup testing, with game-day simulations of everything that can possibly go wrong. To be fair, I have no idea to what degree British Airways has such measures in place, but it seems clear that whatever the airline was doing, it wasn’t enough. And it isn’t hard to imagine that the complex international nature of the airline’s IT systems and staffing didn’t help.
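To make the idea of a scripted game-day exercise concrete, here is a minimal, purely illustrative Python sketch of a failover drill: it simulates a power cut at a primary site and asserts that the backup takes over. The data center names and health model are hypothetical, not a description of British Airways’ actual systems.

```python
"""Illustrative failover drill sketch (hypothetical setup, not BA's)."""

class DataCenter:
    """A toy model of a site: it is either healthy or down."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def power_cut(self):
        # Simulate the kind of power event blamed for the BA outage.
        self.healthy = False

def run_drill(primary, backup):
    """Cut power to the primary and verify the backup takes over.

    Returns the data center now serving traffic; raises if neither
    site is healthy, which is exactly what a drill should catch.
    """
    primary.power_cut()
    active = primary if primary.healthy else backup
    if not active.healthy:
        raise RuntimeError("drill failed: no healthy data center")
    return active

# Example drill with two hypothetical sites.
primary = DataCenter("dc-primary")
backup = DataCenter("dc-backup")
active = run_drill(primary, backup)
print(f"traffic now served by {active.name}")
```

The point of running a drill like this regularly isn’t the code itself, it’s that staff rehearse the takeover path before a real power cut forces the issue.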
Time to check your company’s disaster recovery capabilities
Again according to The Sun, sources close to British Airways are calling the situation “appalling,” with one saying, “No major company should be in a position where 300,000 of its customers were stranded, with little to no information. … It’s a shambles. Heads should roll for this.”
Given the magnitude of the problems, and an estimated £100 million compensation bill, it’s quite likely IT leaders could lose their jobs over the weekend’s events. And if that’s not enough to spur an immediate and careful review of your company’s disaster recovery plans and procedures, I’m not sure what you could possibly be waiting for.