How To Guard Against Massive System Failures
At exactly 1:47 p.m. on July 24, Miles Kelly got the call every CIO and data center manager dreads: The data center had experienced a power outage. A power surge had cut electricity to the company's primary data center in San Francisco, and four of the building's 10 backup generators had failed to start. Three computer rooms were down.
That would signal the start of a bad day for any enterprise, but for 365 Main, where Kelly serves as vice president of marketing and strategy, the problem was magnified many times over. That's because 365 Main isn't just any business: It's one of the nation's top data center managers, or co-location service providers. There are more than 75,000 servers in its 227,000-square-foot San Francisco facility, supporting hundreds of customers, including such high-profile companies as Craigslist, Sun Microsystems, Six Apart and the Oakland Raiders.
"When the failure of a data center becomes a bigger issue is when you have all these Web services and start-ups that have their data center services only at this one site," says James Staten, principal analyst in the infrastructure and operations group at Forrester Research.
When the four 2.1-megawatt diesel engine-generator units failed to kick in as they should have, it was a disaster in the making for 365 Main. The company promotes itself as having "The World's Finest Data Centers," and clients rely on it for constant uptime. Prior to the incident, 365 Main could claim 100% uptime.
But on the afternoon of July 24, 40% to 45% of 365 Main's customers lost power to their equipment for about 45 minutes, Kelly says.
At Sun Microsystems, sites were down from 45 minutes to three hours, with most restored in about 90 minutes, according to Will Snow, senior director of Web platforms at Sun. Although the power itself was out for only 45 minutes, it can take a few hours to bring systems and networks back up and ensure they're working properly.
Snow says Sun had a backup plan for services at 365 Main, but it wasn't complete. "We're in the process of changing our disaster recovery plans to deal with shorter outages," Snow says. "Originally our plans were tailored for more significant outages of four-plus hours, but now we're looking to respond to very short outages such as the San Francisco outage."
At Six Apart, four of the company's Web sites (LiveJournal, Vox, TypePad and Movable Type) were down for 90 minutes. On LiveJournal, the company posted an apology, explaining that during outages it would normally display a message telling visitors about the status of the site. "But because this was a full power outage there was a period of time where we could not access or update a status page," the posting explained.
Fortunately, 365 Main was able to manually restart the generators that failed to kick in automatically, which allowed it to operate on backup power until PG&E began delivering a stable power supply.
WHAT CAUSED THE OUTAGE
The root cause of the outage turned out to be highly unusual: The data center's $1.2 million diesel-powered generators had gradually fallen out of sync with the electronic controllers that start them. "This was a truly rare incident," Staten says.
Unlike centers with battery-started backup generators, 365 Main's data center has a continuous power system. Energy from the local utility flows into the system to operate its generators, which supply power to the building. In the event of a power failure, the system normally restarts using energy stored in each generator's flywheel. The flywheels, basically large spinning discs, keep turning long enough after a power failure to restart the diesel engines.
With its 10 backup diesel-powered generator units, 365 Main's primary data center had operated without a glitch through numerous power outages since its construction in the spring of 2001. The building has eight data rooms, each with its own dedicated generating unit. There are also two extra units, ready to kick in if one of the eight dedicated backup generators fails.
As it turned out, on July 24, four of the diesel engines failed to start, causing three computer rooms to lose power. "We could have failed three units, and through an automatic load-shed of chillers and air conditioning units, we could have continued to function," says Jean-Paul Balajadia, senior vice president of operations at 365 Main. In other words, the facility had enough backup generators to run the computer rooms if only three units had gone down. But with four unable to start and keep running, up to 45% of the building's computer systems shut down.
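The redundancy arithmetic Balajadia describes can be checked with a small sketch. The constant names and the single-failure load-shed allowance below are illustrative assumptions drawn from his account (two spare units plus one more failure absorbed by shedding chillers and air conditioning), not 365 Main's actual control logic:

```python
DEDICATED_UNITS = 8   # one generator per data room
SPARE_UNITS = 2       # extra units that can stand in for failed dedicated generators
LOAD_SHED_COVER = 1   # one further failure survivable by shedding chillers/HVAC load

def rooms_stay_up(failed_units: int) -> bool:
    """True if every data room can still be powered after `failed_units` generators fail."""
    return failed_units <= SPARE_UNITS + LOAD_SHED_COVER

# Per Balajadia, three failed units were survivable; the fourth failure
# on July 24 exceeded the margin and took computer rooms down.
print(rooms_stay_up(3))  # True
print(rooms_stay_up(4))  # False
```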
As soon as the failure occurred, Balajadia and his staff called the manufacturer of the power generators, Hitec Power Solutions. They also called Cupertino Electric, the engineering and project management firm for the building's construction.
After several days, they determined the cause to be a discrepancy in the engines' start-up routine. Over the years, as each engine was periodically tested and shut down, the engine's digital controller would record the exact position of the pistons in the cylinders when they stopped so that, on the next start-up, fuel would be injected at the precise moment. "The controller writes this into memory at zero RPM, reading the information and then clearing out the prior memory," Balajadia explains.
When the engines were first shipped from the factory, it took seven to 10 seconds for an engine to come to a complete stop, at which point the pistons' positions would be recorded for the next start-up sequence. This is critical, because if the fuel is injected at the wrong time, the engines won't start. But over several years, during which 365 Main had accumulated more than 1,000 hours of operation on the diesels, the engines had become fully broken in, and their shutdown time had stretched to as much as 13 seconds.
As a result, the piston positions stored in the controllers' memory from the last shutdown were stale: Each digital controller was calibrated to record the positions just seven to 10 seconds after shutdown, while the pistons were in fact still moving. That variance of three or four seconds left four of the diesel engines out of their normal starting sequence, so they misfired, failing to start and keep running.
"This was completely uniqueit was a true bug," Kelly says. The fix was to adjust the controller to allow more time between the shutdown and the reset command. The company implemented the fix not only at its San Francisco facility but also at its El Segundo, Calif., data center, which had the same Hitec generators containing the identical controllers.
Hitec reports that only about 100 such engines were shipped in 2001 with this particular Detroit Diesel controller, and that the other companies using them in data centers have had their controllers fixed as a result. Newer diesel generators have a more sophisticated ignition sequence. "We had only two other sites that used these engines as extensively, and both customers had reported isolated incidents where single engines failed to start," says John Sears, marketing and sales manager at Hitec Power Solutions in Stafford, Texas.
Massive System Failures">
GUARDING AGAINST MASSIVE SYSTEM FAILURES
Although the root cause of the outage is rare, the lessons gleaned from the experience are useful for data center managers seeking ways to guard against massive system failures:
Headquarters: 365 Main St., San Francisco, CA 94105
Phone: (877) 365-6246
Business: Data center developer and operator providing mission-critical operations and business continuity for tenants
Senior Vice President, Operations: Jean-Paul Balajadia
Financials: NA (privately held)
Challenge: Maintain the company's post-outage record of 99.9967% uptime across half a dozen data centers around the U.S., including the San Francisco facility