United Airlines' Systems Meltdown: Lessons Learned
By Larry Barrett | Posted 2007-06-29
The airline's computers shut down for two hours while other carriers were up and running. The result: a multimillion-dollar loss and a hit to United's reputation. It could happen to your company, analysts say, unless you take the proper precautions.
The nightmare scenario that United Airlines and its passengers endured last week when its computer system shut down for two excruciating hours will happen again and again in other industries, analysts say, unless CIOs make significant changes in the way they manage their staffs and their systems.
The Chicago-based airline's June 20 debacle illustrates just how vulnerable a company can be to an innocent mistake—in this case, a United spokeswoman says an employee accidentally disabled the computers while running a routine test on Unimatic, United's flight operations system—and how far-reaching the consequences can be.
More than 260 United flights were delayed for an average of 90 minutes, and nearly 70 other flights were cancelled altogether. The timing couldn't have been worse. The system crashed between 8 a.m. and 10 a.m. Chicago time, right in the heart of the morning rush, leaving arriving United flights to scramble for the few available gates that weren't blocked by delayed United planes that were unable to take off.
"As soon as it happened, the I.T. employee who made the mistake knew what it meant to the system," says Robin Urbanski, a United spokeswoman. "It was just simple human error during a routine test of our flight operation system. It's unfortunate, but we're developing new processes that will prevent future issues like these from impacting our customers."
Making matters worse, according to Urbanski, was the fact that this testing snafu also knocked out Unimatic's backup system. She declined to comment on what new processes would be implemented, or if new software and hardware would be installed.
The Unimatic system, which was originally launched almost 20 years ago, provides pilots with flight plans, gives updates on maintenance information and crew schedules, and records the amount of weight and proper balancing of that weight for all outbound flights. Urbanski says the system has been updated repeatedly in recent years.
All of these functions, including flight plans, crew scheduling and weight measurements, can still be done by hand in a pinch, but at a cost.
Michael Boyd, an airline consultant based in Evergreen, Colo., told the Chicago Tribune that United's computer glitch will cost it more than $10 million in lost revenue due to refunds and re-bookings, to say nothing of the hit its reputation took in the process. Worse, while United passengers and planes were stranded at the gates, its competitors were up and running as usual. This wasn't a weather-related incident; it was self-inflicted.
"Of course, organizations could do more to prevent such debacles," says Tom Welsh, senior consultant at Cutter Consortium, an Arlington, Mass.-based I.T. consulting firm. "The live and backup machines should be more rigorously separated, and there should be someone present to make sure these [innocent mistakes] don't happen. Companies strive to maximize profits, but often fallible decision-makers judge the risks wrong."
Analysts say the financial services, retail and transportation industries are particularly vulnerable to the type of systemic paralysis United experienced last week because they run on the most rigid and unrelenting schedules.
While each company or organization has unique I.T. issues and limitations, analysts say CEOs and CIOs should consider the following advice to avoid the type of disruption United and its customers experienced:
- Get upper management in the organization—including the CEO—to understand how the company's I.T. systems work, what could possibly go wrong, and how that would affect the business. That way, everyone can make informed decisions about I.T. investments and governance.
- Avoid running tests on a live system.
- Make it impossible to confuse the live and backup systems, so that regular maintenance or testing cannot affect the wrong system.
- Improve working conditions for all business-critical technical employees, ensuring reasonable work hours, plenty of rest and the sharing of responsibilities so that no one person is overworked or overly relied upon.
- Conduct a regular and comprehensive analysis of the costs, benefits and risks of all I.T. procedures. This analysis should be performed by qualified technical staff along with business managers.
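
For I.T. staff, the second and third recommendations above can be enforced in software rather than left to memory. The following is a minimal sketch in Python, not a description of how United or Unimatic works: the `Target` record, the `role` labels, the `run_routine_test` function and the second-person confirmation are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


class LiveSystemError(RuntimeError):
    """Raised when a test is pointed at anything other than a test system."""


@dataclass(frozen=True)
class Target:
    name: str  # hypothetical label, e.g. "ops-test"
    role: str  # assumed to be one of "live", "backup" or "test"


def run_routine_test(target: Target, confirmed_by: Optional[str] = None) -> str:
    # Refuse outright to run against live or backup machines.
    if target.role != "test":
        raise LiveSystemError(
            f"refusing to test {target.name!r}: role is {target.role!r}"
        )
    # Require a second person to sign off, so one slip is not enough.
    if not confirmed_by:
        raise PermissionError("a second person must confirm the test run")
    # Placeholder for the real test procedure.
    return f"routine test ran on {target.name}, confirmed by {confirmed_by}"
```

In this sketch a test aimed at a machine labeled "live" or "backup" fails before it starts, and even a correctly aimed test needs a named second person. The point is that the safeguard sits in the tooling, not in the operator's attention.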
Despite inconveniencing thousands of passengers, taking a likely eight-figure hit to its bottom line and disrupting service for more than 24 hours, United was fortunate that the computer outage only lasted two hours. Had the problem lingered for another two or three hours, analysts say, the cascading effect would have sent United's operations into a tailspin for as much as a week.
"United now knows that it made an incorrect cost-benefit judgment," says the Cutter Consortium's Welsh. "Now that they're aware, they will probably do the right thing and urgently appoint qualified people to analyze the scenario to make sure it doesn't happen again. These precautions are analogous to seat belts or crash helmets—they save lives but often are left off anyway."