Stress Test To Avoid Stress
What goes wrong when things go wrong? What follows may not be suitable for younger audiences: post-mortems of several IT projects that turned ugly.
Chapter One: Don't Guess for Success
The lack of clear requirements at the outset of a new IT project or consulting partnership is the number one killer of initiatives within large organizations. Project charter processes are critical, whether the project is to be handled in-house or by contractors.
IBM’s Team for Texas Consolidation
When data center consolidation is done right, it can save a bundle and bump up performance in the process. When outsourcing is done right, it can do the same. When both are done wrong at the same time, you have an unmitigated disaster the size of the Lone Star State.
At the end of 2008, the State of Texas reported a colossal failure in a years-long project to consolidate 27 data centers across the state. When Texas shook hands with IBM in 2007, the state agreed to pay Big Blue to lead up the "Team for Texas." The team promised to reduce the number of servers across the state from 5,300 to 1,300 within three years and save the state $25 million by year two of the contract. The idea was to rapidly move from a highly distributed environment where individual agencies ran their own data centers to a more consolidated approach.
When the state shopped for a contractor to consolidate data centers, it chose IBM over shortlisted Northrop Grumman, the incumbent data center operator at the time. Northrop Grumman’s estimate came in $358 million over IBM's because the company believed a slower approach to consolidation was necessary due to geographic constraints and an eclectic collection of systems and applications across agencies.
By November 2008 it was clear that Texas was regretting its decision. IBM sent Texas a letter on Nov. 3 that stated it would need to rework its infrastructure funding to replace legacy systems that were hampering consolidation efforts. The project had saved the state only $500,000 so far and agencies were giving IBM failing grades across the board for service and support. Governor Rick Perry told IBM that it was in danger of losing the contract if it didn’t fix dangerous disaster recovery backup oversights and breaches of contract that were caught by state officials after a system failure.
The state of the contract is still in flux. Observers believe there will be a finger-pointing match between the state, which had out-of-reach expectations for the project timeline, and IBM, which oversold its ability to accomplish unrealistic goals.
Lessons Learned: The cheapest proposal is not necessarily the best one. Project managers must double and triple check cost estimates to ensure they’re based on realistic assumptions.
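To put the Texas numbers in perspective, here is a minimal sketch of the arithmetic using only the figures quoted above. The variable names are illustrative, and comparing the November 2008 savings with the year-two target is a rough check rather than an official status calculation.

```python
# Figures quoted in the article; the variable names are illustrative assumptions.
servers_before = 5_300
servers_target = 1_300
savings_target_usd = 25_000_000  # promised by year two of the contract
savings_actual_usd = 500_000     # reported as of November 2008

# Share of servers the plan called for eliminating within three years.
planned_server_cut = (servers_before - servers_target) / servers_before

# Share of the promised year-two savings actually realized so far.
savings_progress = savings_actual_usd / savings_target_usd

print(f"Planned server reduction: {planned_server_cut:.1%}")
print(f"Savings realized vs. year-two target: {savings_progress:.1%}")
```

A plan to retire roughly three of every four servers in three years, set against a savings figure this far behind its target, is the kind of gap a reality check on assumptions is meant to surface early.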
U.S. 2010 Census Handheld Rollout
The scourge of scope creep is a classic project management stumbling block that can waste money and even destroy promising initiatives.
The case of the U.S. 2010 Census modernization project offers a classic example. When the U.S. Census Bureau unveiled the project, the idea was to arm its census takers in the field with high-tech handheld devices that could directly beam population information to HQ as employees dug it up.
Census officials inked a $600 million deal with Harris Corporation to build 500,000 devices, but they still weren’t sure which features they wanted included in the units. What they did know was that this go-around at the census would cost at least $10 billion in total. Though there were no detailed projections, they assured the public that the handhelds were sure to save the department money.
Two years later, the handheld project was in shambles. In April, the bureau chief sent a press release announcing that it was significantly reducing the role of the handhelds in the 2010 Census efforts. After the bureau had laid out hundreds of millions in taxpayer dollars, census takers were for the most part going back to the old-fashioned paper method.
According to a report by the Government Accountability Office in July 2008, the units that were tested out in the field in 2007 were regularly experiencing problems with "transmission, the device freezing, mapspotting (collecting mapping coordinates), and difficulties working with large blocks."
Even more troubling, Harris reported that since signing on with the Census Bureau, it had received more than 400 change requests to project requirements. Part of the reason the bureau scrapped the project was the cost overruns that would have been necessary to get the full project off the ground given the midstream requirements changes.
What’s worse, even as census enumerators were relegated back to the paper age, estimates of the total cost of the census rose by billions. The current estimate is somewhere around $14.5 billion.
Lessons Learned: A clear set of requirements and a well-governed change-management process are critical ingredients of project success.
Nothing’s worse for a company than an IT fiasco that directly hits the way customers interact with the organization. In some extreme cases, IT may as well hand disgruntled customers over to the competition on a silver platter. Don’t believe us? Read on.
When an IT failure wastes a financial institution’s money, that’s a headache. When it erodes the organization’s reputation, it’s a life-threatening head injury.
HSBC USA learned this lesson the hard way in August 2008, when systems malfunctions kept thousands of customers from accessing their accounts for the better part of a week. The cause of the downtime was a disk failure within the legacy mainframe system in HSBC’s Amherst-based data center, which handled the bank’s account and transaction information. According to the company, the failure occurred during a storage system upgrade.
Most experts agreed that something as simple as a disk failure should not jeopardize a financial institution’s uptime for five days -- disaster recovery governance should cover something as mundane as hardware on the fritz. While HSBC did not disclose the intimate details of its difficulties, one unnamed, whistle-blowing insider told Bank Technology News that HSBC was running a decades-old legacy core processing system that likely suffered from complications following the disk failure.
"It's not surprising, given the age of that software and that internal code would be holding everything together," the source told BTN.
Some experts speculated that the crash of the core processing system combined with a sluggish batch recovery process could have held up IT staff in returning the system to service.
The end result was spotty service for many customers, some of whom were unable to view their account balances, bank online or even use their debit cards in some cases. Regardless of technical details, the failure shows how a simple upgrade can result in a loss of service and customer confidence if IT doesn’t have its ducks in a row before rolling up its sleeves.
Lessons Learned: Financial systems failures say bad things to customers seeking stability.
London Stock Exchange Sells Customers Short
On September 8, 2008, thousands of UK-based traders were champing at the bit to start the trading day after news broke that the U.S. government was planning to bail out mortgage-lenders Fannie Mae and Freddie Mac. Most brokers expected their trades across the London Stock Exchange that Monday to net them their most lucrative commissions of the year. Then an LSE technical glitch got in the way.
Traders were locked out of the LSE trading platform for nearly the whole day -- more than seven hours in total -- due to what LSE called “connectivity problems.” The financial papers quoted traders expressing their frustration over losing millions of pounds in unearned commissions that day, a wound made more painful by LSE’s lack of communication about the specific problems that caused the shutdown. Just days before the problem, LSE had finished extolling the virtues of a shiny, new update to its trading platform, TradElect.
LSE was oblique about the exact technical issue, stating only that the issue was not due to an inability to handle trading volume or any problem associated with the upgrade.
If true, it goes to show that no matter how much work is spent on the bells and whistles of a new project, if the fundamental infrastructure isn’t right then all the project work is for naught.
Lessons Learned: Poor communication exacerbates customer-facing IT issues.
It is imperative that an organization pay its employees on time and as promised. In today’s environment, IT is a critical partner in making that happen. When IT fails to do so, the whole organization suffers.
Sprint-Nextel Gets Static Over Unpaid Commissions
Last December, news trickled in from the District Court of Kansas that a cadre of former and current Sprint-Nextel employees had filed a class action lawsuit against the telecom giant to fight for more than $5 million in sales commissions they say were never paid to them. Though Sprint has so far chosen to fight the suit, it has admitted that it has consistently had problems figuring out how to pay its employees accurately the first time around. The culprit? Creaky back-end systems that were never appropriately integrated when Sprint and Nextel merged in 2005.
Details about the systems themselves are scanty, but what the court documents did disclose was illuminating. According to Sprint, it has spent close to $10 million to fix problems within the computer system governing sales tracking and commission payouts. The company says that it provided underpaid employees a process for fixing pay discrepancies. Therein lies the rub.
The plaintiffs in this case charge that this paperwork-laden process was needlessly complex; it was too difficult and took too long to navigate, they say. This case is a good example of how executive response to end users after an IT snafu can greatly affect the ultimate response to technical problems. Had Sprint made a more comprehensive effort to make its employees whole, minus a lot of red tape, it likely could have avoided the lawsuit.
Lessons Learned: Throwing money at a problem doesn’t always work. Communication and an earnest effort to make users whole after an IT failure are just as important.
LAUSD Payroll Problems
When the Los Angeles Unified School District announced it was ready to flip the switch on an ambitious, $95 million plan to migrate its payroll systems to SAP, the teachers’ union response was, "Are you sure it’s ready?"
Union leaders asked that the district test the new payroll system while concurrently running the old one in order to ensure a smooth transition. The district, and its consultant Deloitte Consulting, chose to ignore the request. LAUSD and Deloitte said that such testing would be too costly.
Even after a spate of red flags popped up during the preceding months -- so many that the district CIO resigned in disgust six months before showtime -- the district pulled the trigger and brought the system live in late January 2007. What followed was a year-long chain reaction of failure that caused teachers to picket, kept payroll clerks from going home on many long nights, and eventually cost the school district an additional $35 million to clean up the mess. Over the course of 2007, some employees were overpaid, others underpaid and some were not paid at all for months on end.
So many things went wrong that it’s hard to pinpoint a single cause for the pain. Among the contributing factors: a failure to scrub data before migration, bugs in the custom code developed by Deloitte to tailor the system to LAUSD needs, and inadequate system training for payroll clerks. All of this was exacerbated by the convoluted and complex contract arrangements with individual teachers and creaky hardware infrastructure tasked with running the payroll platform.
Some believe that what really torpedoed the project, though, was an extreme lack of leadership. The executive the school district brought on to sponsor the implementation program didn’t just lack ERP experience; he had very little knowledge of or expertise in any IT systems. Furthermore, when CIO Megan Klee resigned in 2006, the district piled her duties on the CFO. Clearly, no one was manning the tiller.
Lessons Learned: Difficult projects require experienced and intuitive leaders to rein in costs and address problems before it’s too late to fix them. Careful testing is a must before going live with any new system.
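The parallel run the union asked for boils down to feeding the same pay period through both systems and comparing the results before cutover. The sketch below is a simplified illustration of that idea, not a description of how LAUSD, Deloitte, or SAP would do it. The function name, employee IDs, and pay amounts are all hypothetical.

```python
from decimal import Decimal


def compare_payroll_runs(legacy, candidate, tolerance=Decimal("0.00")):
    """Return discrepancies between two payroll runs keyed by employee ID."""
    discrepancies = {}
    for emp_id in sorted(set(legacy) | set(candidate)):
        old_pay = legacy.get(emp_id)
        new_pay = candidate.get(emp_id)
        if old_pay is None:
            discrepancies[emp_id] = "only in new system"
        elif new_pay is None:
            discrepancies[emp_id] = "missing from new system"
        elif abs(new_pay - old_pay) > tolerance:
            discrepancies[emp_id] = f"pay differs by {new_pay - old_pay}"
    return discrepancies


# Hypothetical net pay for one period, as produced by each system.
legacy_run = {
    "E001": Decimal("3200.00"),
    "E002": Decimal("2875.50"),
    "E003": Decimal("4100.00"),
}
new_run = {
    "E001": Decimal("3200.00"),
    "E002": Decimal("3875.50"),  # overpaid in the new system
    # E003 is absent: this employee would not be paid at all
}

report = compare_payroll_runs(legacy_run, new_run)
for emp_id, issue in report.items():
    print(emp_id, issue)
```

Even a check this crude flags the two failure modes described above, wrong amounts and missing paychecks, while the old system is still available as a fallback.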
Perhaps no other wide-scale enterprise project can strike fear into the heart of a CIO or project manager like an enterprise resource planning (ERP) implementation. And for good reason. ERP touches the nerve center of organizations. If something goes wrong, it goes terribly wrong.
American LaFrance Hosed by ERP
Sometimes a nightmare project can put a company in bankruptcy.
Last year, the venerable firetruck manufacturer American LaFrance was forced to undergo Chapter 11 proceedings after what it called “operational disruptions caused by the installation of a new ERP system.” According to information provided to customers, the company had incurred $100 million in debt since it spun off from Freightliner in 2005 due to inventory imbalances and ERP malfunction.
“These problems have resulted in slowed production, a large unfulfilled backlog, and a lack of sufficient funds to continue operating,” the company told customers in January 2008.
When American LaFrance initially bought back its business from Freightliner, the plan was to get its former host company to handle accounting, purchasing, inventory, production, payroll and finance until American LaFrance could work with its consultants at IBM to implement a new ERP system that would handle all of these functions.
But after flipping the switch to set the ERP system running in June 2007, American LaFrance encountered inventory difficulties, resulting in serious liquidity issues for the manufacturer. Though American LaFrance said at the time that it was looking into holding IBM legally accountable for its role in the botched ERP implementation, more than a year later no such action has been taken. Since then the company has come out of Chapter 11, the worse for wear after customers complained of late truck deliveries and a shortage of repair parts that made certain firehouses scramble for months to keep their fleets running.
Lessons Learned: Botched ERP implementations can drastically affect operational capabilities. Spin-offs, company sales and M&A activity must be followed by careful infrastructure planning.
Tom Shane Paid Too Much
As our previous example showed, picking one’s way through an enterprise resource planning project can be akin to dancing across a minefield. With so much money and so many critical functions at stake, ERP projects have the potential to make or break a company.
Unfortunately for Centennial, Colo.-based Shane Co. Jewelers, its foray into SAP broke the company. In January 2009 the family-owned company announced bankruptcy as a result of poor sales returns from 2007 through 2008 and an SAP implementation that executives said cost the company more than three times its initial estimate and left store inventories unbalanced.
According to bankruptcy filings, Shane Co. reported that it took three years and $36 million to put in place the SAP AG system after being told by SAP that it would only take one year and $10 million. What’s worse, once the system went live in 2007, Shane experienced nine months of inventory issues that “adversely affected sales,” the company told the court.
Most egregious of these issues was a severe overstock of the wrong products in late 2007. Shane told the court that since then it managed to sell off the overstocked products and stabilize the SAP system, but that the financial damage was already done. In 2008 the company experienced a 32 percent decline in sales during the holiday season.
Bankruptcy filings show the company owes more than $100 million to creditors. Shane told the court that it was revamping operations, scrapping plans for new headquarters and putting the kibosh on store expansions. During the bankruptcy, company founder Tom Shane is lending the company money from a separate business he owns to keep Shane Co. operational.
Lessons Learned: Appropriate vendor selection and realistic project goals are imperatives for successful ERP implementations.
Worst case scenarios happen. Anticipate and communicate, or they could happen to you.
Heartland Payment Systems started off 2009 with a bang, but not the kind it wanted.
The credit card processor suffered a colossal security breach that reportedly involves more than 100 million records and affects accounts at hundreds of banks. An unknown party hid sniffer software in unallocated disk space on a server located within a section of infrastructure unprotected by encryption. Heartland’s CFO reported that the malware was so well camouflaged that forensics experts had a hard time finding it even after Visa and MasterCard flagged the company with fraud alerts.
Details from the investigation will likely take years to unravel, but security experts speculate that the sniffer could have been installed by way of web-borne malware or perhaps even introduced by an insider armed with an infected USB device. Interestingly, the company had previously been certified as PCI compliant.
Heartland is suffering the consequences. Since it made the mandated breach announcement in January, its stock price has plummeted, and affected banks and consumers have filed dozens of class-action lawsuits. Competing card processors are aggressively courting Heartland customers. Visa and MasterCard revoked the company’s PCI compliance certification and are threatening to slap it with hundreds of thousands of dollars in fines. And CEO Robert Carr is now under fire for selling a large chunk of his shares in the company just before the breach announcement was made public.
Lessons Learned: Compliance efforts may not always result in a secure risk posture. Security missteps can cost companies dearly.
Beijing Olympics Ticket Turmoil
If the Beijing Organizing Committee for the Olympic Games (BOCOG) IT team competed on the field, it wouldn’t even make it past the first heat in e-sales systems development.
During the run-up to the Olympics, the Chinese ticketing system crashed not once but twice during two separate waves of domestic sales. The first time, in October 2007, Chinese authorities had released 1.8 million tickets to be sold online and at branches of the Bank of China (BOC). Within an hour the online ticketing system was unable to handle the heavy load of more than 8 million hits.
Designed to handle about 1 million hits per hour, the system crashed after selling a measly 43,000 tickets. Officials had to resort to offering the remainder by lottery while they worked with Ticketmaster to revamp their online sales portal. In the interim, BOCOG fired its director of ticketing.
Then in May 2008 the systems crashed again, this time in a bid to sell 1.38 million tickets. This go-round the system was able to sell 300,000 tickets before going down.
The Beijing spin machine worked full force following the humiliation of both events, pointing to overwhelming demand for tickets as a signal of success. But ultimately the IT team failed to accurately project realistic traffic numbers and architect systems accordingly. The lack of planning cost one man his job and served up a splatter of egg on the face of countless others.
Lessons Learned: IT will fail again and again if it isn’t provided with accurate projections and numerical assumptions from the business side.
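The mismatch in the October 2007 sale is easy to quantify from the figures above: a system designed for about 1 million hits per hour met more than 8 million within the first hour. The sketch below works through that ratio. The function names and the 2x headroom factor are illustrative assumptions, not anything BOCOG or Ticketmaster is known to have used.

```python
def load_ratio(peak_hits_per_hour, design_hits_per_hour):
    """How many times over its design capacity the system is being driven."""
    if design_hits_per_hour <= 0:
        raise ValueError("design capacity must be positive")
    return peak_hits_per_hour / design_hits_per_hour


def required_capacity(forecast_hits_per_hour, headroom=2.0):
    """Capacity to provision for a forecast peak, padded by a safety factor.

    The default headroom of 2.0 is an assumed planning margin.
    """
    return forecast_hits_per_hour * headroom


design_capacity = 1_000_000  # "about 1 million hits per hour"
observed_peak = 8_000_000    # "more than 8 million hits" within an hour

ratio = load_ratio(observed_peak, design_capacity)
needed = required_capacity(observed_peak)

print(f"Observed load was at least {ratio:.0f}x the design capacity")
print(f"Capacity to provision with headroom: {needed:,.0f} hits per hour")
```

The calculation is trivial once the forecast exists. The hard part, as the lesson above says, is getting a realistic demand projection from the business side before the architecture is fixed.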
The ultimate cost of some bungled projects is human life.
Kaiser Kidney Transplant Center Debacle
Only two years after opening, the Kaiser Permanente Kidney Transplant Center in San Francisco was forced to shut down in the wake of claims that it risked patients’ lives through delays brought on by bureaucratic bungling.
According to whistle-blowers, patients, and regulators, Kaiser’s lack of planning in data-management governance, procedures and policies for the setup of its Kidney Transplant Center delayed the entrance of countless patients into the operating room to receive vital surgeries. Kaiser is said never to have set up procedures to transfer patient data from prior transplant facilities or compiled a database of records for patients transferred into the center’s program. Administrative staff did not have a clear set of written directions to do their jobs and had little experience in the transplant specialty.
In one example, patient Bernard Burks was told by his insurer, Kaiser, that he had to transfer from a kidney transplant center in Sacramento to Kaiser’s new facility in San Francisco in order to continue to receive coverage. In the process Kaiser lost his records and he was somehow bumped to the back of the line for a kidney, even though he had already accumulated three years’ credit at his former facility. When his daughter stepped forward to donate, that information was mismanaged, too.
Burks’ experience wasn’t isolated. In February 2006 an internal whistle-blower, David Merlin, blew the roof off the story when he went to the Los Angeles Times and other media outlets with concerns. By May 2006 regulators at the state Department of Managed Health Care (DMHC), Medicare, and Medicaid released scathing reports on the matter, and Kaiser decided to shut the whole transplant operation down. Kaiser was later fined $2 million by DMHC, and it voluntarily paid another $3 million to a transplant education group to atone for its missteps.
Lessons Learned: Regardless of the technology used, governance and compliance must be in place in order to properly manage information in the enterprise.
Trains, planes, and automobiles rely on software to improve engine performance, control electrical systems and keep navigation systems running. But when developers and QA testers miss something in the code and bugs crop up, resultant glitches can risk lives and damage reputations.
Last February, just that kind of problem cropped up at Ford. The car manufacturer was forced to recall nearly half a million 2005-2008 model Mustangs after discovering that a software problem in these cars caused airbags to be deployed forcefully enough to cause neck injury to smaller passengers not wearing seatbelts. Mustang owners had to return their vehicles to the dealer in order to receive a software update to the Restraint Control Module within the car computer.
While no injuries related to the error ever occurred, Ford paid to notify customers and conduct the recall. Developers and testers of all stripes should sit up and take notice of this classic software failure scenario, born out of inadequate testing before the product went to market. Ford only found the discrepancy during safety crash testing mandated by the National Highway Traffic Safety Administration.
Lessons Learned: Testing may be expensive, but it’s a lot cheaper to get the code done right the first time.