The root cause of the outage turned out to be highly unusual:
The data center's $1.2 million diesel-powered generators had
gradually fallen out of sync with the electronic controllers that
start them. "This was a truly rare incident," Staten says.
Unlike centers with battery-started backup generators, 365 Main's
data center has a continuous power system. Energy from the local
utility flows into the system to operate its generators, which
supply power to the building. In the event of a power failure,
the diesel engines are normally restarted using energy stored in
each generator's flywheel. The flywheels, basically large spinning
discs, keep turning long enough after a power failure to restart
the diesel engines.
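As a rough, generic illustration (no figures from 365 Main or Hitec are used here), the energy a flywheel can give up while slowing from its running speed to the lowest speed at which it is still useful follows from its moment of inertia. Dividing that energy by the power drawn from it gives an approximate ride-through time. The symbols are illustrative assumptions: $I$ is the flywheel's moment of inertia, $\omega_0$ its running speed, $\omega_{\min}$ the lowest usable speed, and $P$ the power drawn, assumed roughly constant, with conversion losses ignored.

```latex
E_{\text{usable}} = \tfrac{1}{2}\, I \left(\omega_0^{2} - \omega_{\min}^{2}\right),
\qquad
t_{\text{ride-through}} \approx \frac{E_{\text{usable}}}{P}
```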
With its 10 backup diesel-powered generator units, 365 Main's
primary data center had operated without a glitch through
numerous power outages since its construction in the spring of
2001. The building has eight data rooms, each with its own dedicated
generating unit. There are also two extra units, ready to
kick in if one of the eight dedicated backup generators fails.
As it turned out, on July 24, four of the diesel engines failed
to start, causing three computer rooms to lose power. "We could
have failed three units, and through an automatic load-shed of
chillers and air conditioning units, we could have continued
to function," says Jean-Paul Balajadia, senior vice president of
operations at 365 Main. In other words, the facility had enough
backup generators to run the computer rooms if only three units
had gone down. But with four unable to start and keep running,
up to 45% of the building's computer systems shut down.
As soon as the failure occurred, Balajadia and his staff
called the manufacturer of the power generators, Hitec Power
Solutions. They also called Cupertino Electric, the engineering
firm and project manager for the building's construction.
After several days, they determined the cause to be a discrepancy
in the engines' start-up routine. Over the years, as
each engine was periodically tested and shut down, the engine's
digital controller would record the exact position of the pistons
in the cylinders when they stopped so that, on the next start-up,
fuel would be injected at the precise moment. "The controller
writes this into memory at zero RPM, reading the information
and then clearing out the prior memory," Balajadia explains.
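A minimal Python sketch of the intended behavior Balajadia describes: the controller stores the stopping position only once the engine reads zero RPM, replacing the prior record, and the next start uses that position to time injection. Everything named here is hypothetical: `StartMemory`, `record_at_rest` and `degrees_until_injection` are illustrative names. Crank angle stands in for piston position, and the 720-degree four-stroke cycle is an assumption. None of these are details of the actual Detroit Diesel controller.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: four-stroke engine, so one full cycle spans two crank revolutions.
CYCLE_DEG = 720.0


@dataclass
class StartMemory:
    """Hypothetical stand-in for the controller's start-up memory."""
    crank_angle_deg: Optional[float] = None  # position saved at the last stop


def record_at_rest(memory: StartMemory, rpm: float, crank_angle_deg: float) -> bool:
    """Store the stopping position only when the engine reads zero RPM.

    The prior record is overwritten (cleared and replaced) in one step.
    Returns True if a new position was stored.
    """
    if rpm > 0.0:
        return False  # engine still coasting down; position is not final
    memory.crank_angle_deg = crank_angle_deg % CYCLE_DEG
    return True


def degrees_until_injection(memory: StartMemory, injection_angle_deg: float) -> float:
    """Crank rotation from the stored position to a hypothetical injection point."""
    if memory.crank_angle_deg is None:
        raise ValueError("no stopping position recorded")
    return (injection_angle_deg - memory.crank_angle_deg) % CYCLE_DEG
```

For example, after `record_at_rest(m, 0.0, 810.0)` the stored angle is 90 degrees, and an injection point at 350 degrees is 260 degrees of rotation away. If the stored angle is wrong, that computed rotation is wrong too, which is why the article stresses that mistimed injection keeps the engine from starting.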
When the engines were first shipped from the factory, it
took seven to 10 seconds for an engine to come to a
complete stop, at which point the pistons' positions would be
recorded for the next startup sequence. This is critical, because
if the fuel is injected at the wrong time, the engines won't start.
But over several years, during which time 365 Main had accumulated
more than 1,000 hours of operation on the diesels, the
engines had been fully broken in, so their shutdown time had
increased to as much as 13 seconds.
As a result, the piston positions stored in the controllers'
memory from the last shutdown were several seconds out of date,
because each digital controller was calibrated to record the
positions for the next restart only seven to 10 seconds after the
last shutdown, while a broken-in engine could still be turning.
That variance of three or four seconds had caused four of the
diesel engines to be out of their normal starting sequence, so
they misfired, failing to start and keep running.
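The failure mode can be modeled with a simplified sketch. Assume, hypothetically, that the controller samples the crank position at a fixed delay after shutdown. That delay is taken here as 10 seconds, the top of the factory range. Also assume the engine coasts down linearly from an assumed 1,800 RPM to rest. Any rotation still to come after the sample means the stored position no longer matches where the pistons actually stop. The function names, the linear coast-down and the 1,800 RPM figure are illustrative assumptions, not details from Hitec or 365 Main.

```python
DEG_PER_SEC_PER_RPM = 6.0  # 1 revolution per minute = 360 degrees / 60 seconds


def residual_rotation_deg(rundown_s: float, sample_delay_s: float,
                          rpm_at_cutoff: float) -> float:
    """Crank rotation (degrees) that happens after the controller samples.

    Assumes speed falls linearly from rpm_at_cutoff to zero over rundown_s
    seconds (a simplification). Integrating 6 * rpm(t) from the sample
    time d to the end of the rundown T gives 3 * rpm * (T - d)**2 / T.
    """
    if sample_delay_s >= rundown_s:
        return 0.0  # engine already at rest: the stored position is final
    remaining_s = rundown_s - sample_delay_s
    return DEG_PER_SEC_PER_RPM / 2.0 * rpm_at_cutoff * remaining_s ** 2 / rundown_s


def stored_position_is_valid(rundown_s: float, sample_delay_s: float) -> bool:
    """True if the engine had stopped before the controller recorded its position."""
    return sample_delay_s >= rundown_s


if __name__ == "__main__":
    SAMPLE_DELAY_S = 10.0  # assumed fixed delay, the top of the factory range
    CUTOFF_RPM = 1800.0    # assumed engine speed when fuel is cut
    for label, rundown in [("factory, fast stop", 7.0),
                           ("factory, slow stop", 10.0),
                           ("broken in", 13.0)]:
        ok = stored_position_is_valid(rundown, SAMPLE_DELAY_S)
        drift = residual_rotation_deg(rundown, SAMPLE_DELAY_S, CUTOFF_RPM)
        print(f"{label:20s} rundown={rundown:4.1f}s valid={ok} "
              f"rotation after sample={drift:8.1f} deg")
```

Under these assumptions, both factory cases record a valid position. The 13-second broken-in engine keeps turning for thousands of degrees after the sample, so its stored position is effectively arbitrary.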
"This was completely
unique. It was a true bug," Kelly says. The fix was to adjust the
controllers to allow more time between the shutdown and the reset
command. The company implemented the fix not only at its San
Francisco facility but also at its El Segundo, Calif., data
center, which had the same Hitec generators with identical
controllers.
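The fix Kelly describes is to allow more time between shutdown and the reset command. One way to sketch it is to pick the delay from the longest rundown actually observed, plus a safety margin. This is a hypothetical illustration: the function names, the 2-second margin and the sample rundown times are assumptions. The article does not say what delay the controllers were changed to.

```python
from typing import Sequence


def choose_reset_delay(observed_rundowns_s: Sequence[float],
                       margin_s: float = 2.0) -> float:
    """Pick a reset delay that exceeds every observed rundown time.

    margin_s is a hypothetical allowance for engines whose rundown time
    keeps growing as they accumulate operating hours.
    """
    if not observed_rundowns_s:
        raise ValueError("need at least one observed rundown time")
    if margin_s < 0:
        raise ValueError("margin must be non-negative")
    return max(observed_rundowns_s) + margin_s


def delay_covers_rundown(reset_delay_s: float, rundown_s: float) -> bool:
    """True if the controller waits until the engine has stopped."""
    return reset_delay_s >= rundown_s


if __name__ == "__main__":
    rundowns = [9.5, 11.8, 13.0]  # hypothetical measurements from test runs
    old_delay = 10.0              # assumed original calibration
    new_delay = choose_reset_delay(rundowns)
    print("old delay covers all:",
          all(delay_covers_rundown(old_delay, r) for r in rundowns))
    print("new delay:", new_delay)
    print("new delay covers all:",
          all(delay_covers_rundown(new_delay, r) for r in rundowns))
```

A longer fixed delay trades a slightly slower reset for tolerance of further drift in rundown time. The margin only helps as long as actual rundowns stay below it.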
Hitec reports that only about 100 such engines were shipped
in 2001 with this particular Detroit Diesel controller, and that
the other companies using them in data centers have had their
controllers fixed as a result. Newer diesel generators have a more
sophisticated ignition sequence. "We had only two other sites
that used these engines as extensively, and both customers had
reported isolated incidents where single engines failed to start,"
says John Sears, marketing and sales manager at Hitec Power
Solutions in Stafford, Texas.