An IT outage on May 27 that forced British Airways (BA) to cancel more than 400 flights and strand 75,000 passengers in a single day was caused by human error, and a simple one at that.
An engineer had disconnected a power supply at a data center near London’s Heathrow Airport, and when it was reconnected, the resulting power surge caused major damage, according to Willie Walsh, CEO of BA’s parent company IAG SA. Walsh made the comment to reporters in Mexico, and it was picked up by Bloomberg and other news outlets.
The engineer in question had been authorized to be on site and was part of a team working at the Heathrow data center hit by the power outage. The facility is managed by CBRE Works Solutions, a U.S. property services company.
A BA spokesperson told the U.K. publication IT PRO, “There was a loss of power to the U.K. data center, which was compounded by the uncontrolled return of power, which caused a power surge taking out our IT systems. So we know what happened; we just need to find out why. It was not an IT failure and had nothing to do with outsourcing of IT; it was an electrical power supply which was interrupted.”
An internal email sent by the head of group IT at IAG, which was leaked to the Press Association, a U.K. news group similar to the Associated Press in the U.S., said, “This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries. … It was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system.”
This was no small accident. It’s estimated to have cost BA as much as 100 million euros (U.S. $112 million), to say nothing of the black eye BA got for the outage. What BA and CBRE want to figure out now is how a single technician could cause so much disruption and why the airline’s backup systems failed.
This isn’t an isolated incident. Most recently, in March, Amazon Web Services suffered a massive outage when an employee debugging an issue with the billing system accidentally took more servers offline than intended. That error cascaded through dependent systems, taking them down and resulting in the outage.
Humans are the cause of most data center failures
To err is human, and we err a lot.
A 2016 study by the Ponemon Institute found human error was the chief cause of failure, accounting for 22 percent of data center outages, while water, heat or air conditioning failures accounted for 11 percent, weather for 10 percent, and generator failures for 6 percent. IT equipment malfunction accounted for only 4 percent of all outages.
That’s because the IT industry doesn’t do a good job of educating its workers on proper processes for managing this equipment. Two-thirds of data center outages are related to processes, not infrastructure systems, according to David Boston, director of facility operations solutions for TiePoint-bkm Engineering.
“Most are quite aware that processes cause most of the downtime, but few have taken the initiative to comprehensively address them. This is somewhat unique to our industry,” he told Data Center Knowledge.
This matches another long-standing problem in corporate America: failure to educate users on acceptable practices for their laptops and smartphones. It’s well established that most corporate breaches come not from external hackers or even internal malcontents, but from employees making careless mistakes, such as opening phishing emails.
Management needs to stop assuming people are computer literate, know the technology as well as their kids do, and can read minds when it comes to policy. Taking time to train and educate people on what is expected of them should not be a special consideration; it should be standard operating procedure. Clearly, at the moment it is not.