By David F. Carr  |  Posted 2007-07-18

University of Pittsburgh Medical Center had runaway growth in its server and storage infrastructure. Here's what it did.


If you're trying to play the game of system operations and support, it helps to have a good rulebook. UPMC and IBM settled on a whole shelf full of books from the Information Technology Infrastructure Library (ITIL), the compendium of best practices that started out as a British government effort to improve its own information management.

The ITIL books provide generalized guidance on the processes and organizational structures required for consistent, reliable information services maintenance and support.

"It's helped us to have a common vocabulary for things like 'incident' versus 'problem,'" explains UPMC application manager Rey Johnson. In ITIL terminology, he explains, an "incident" is just an interruption in service, like a server crashing. "A 'problem' is where the underlying cause is not known," he says. In other words, when speaking with his IBM counterparts, he learned not to refer to an incident as a problem if he knew why it happened and how to fix it. In that case, it's not a problem to be debugged, just an incident to be handled.

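The distinction Johnson draws can be made concrete with a small sketch. The Python below is illustrative only: the `Ticket` class, its fields and the `classify` function are hypothetical names, not part of ITIL or of any help-desk product. The rule simply encodes the description above, in which an interruption with a known cause is handled as an incident and one whose underlying cause is unknown is treated as a problem.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """A hypothetical support record; the field names are assumptions."""
    summary: str
    cause_known: bool  # does the team know why the interruption happened?


def classify(ticket: Ticket) -> str:
    """Label a ticket using the incident/problem distinction described above."""
    # Known cause: handle it as an incident.
    # Unknown cause: investigate it as a problem.
    return "incident" if ticket.cause_known else "problem"


if __name__ == "__main__":
    print(classify(Ticket("Server crashed after a known bad patch", cause_known=True)))
    print(classify(Ticket("Server crashes intermittently", cause_known=False)))
```

A shared rule of this kind, however it is actually recorded, is what lets two support teams use the same word for the same thing.
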
Johnson is translating some of the ITIL theory into practice for the HP ServiceCenter system, which will be used by the support and help-desk staffs to track incidents and problems and the steps taken to fix them. The software also provides automated tracking of system events (such as detecting a server crash) so that the support staff can respond immediately, rather than wait for someone to complain.

Ultimately, UPMC hopes to move toward "autonomic computing," which means teaching the automated systems within the data center to detect and fix problems and to take the steps necessary to restore the network to normal. But before you can automate that, he notes, you have to establish a precise definition of what "normal" is and a scripted series of procedures for getting back to it.

"We decided it was something we needed to quit distracting ourselves with until we were ready to make it happen," Sikora says.

According to Sikora, another way ITIL helps is by providing a reference work that documents some of the common-sense concepts behind the Transformation Project. "It helps provide that justification by giving you an independent, third-party source you can point to that cites all these other reference points for why this is a good idea," he says.

UPMC has yet to implement one key recommendation, which is that there be a single point of contact for all support calls. Today, hospital employees call the help desk if they have a problem with their PC, but calls about some specific systems, such as Cerner's software, go to separate application support teams. "That definitely causes confusion," Johnson says.

Moving the support staffs onto a common software system is a step toward the single-point-of-contact ideal, but the required organizational change has yet to happen. "What we're doing now is setting the stage," Sikora says.

Beyond the general guidelines from ITIL, the Transformation Project was organized around a specific process architecture—not a technology architecture, but one that described what would be done and how.

The first draft created jointly by IBM and UPMC wound up being 120 pages long without telling Sikora what he really needed to know. "It was a great conceptual document, but as a technology manager I didn't know how to build it," he says. "The analogy I used at the time was, you've given me a map of the U.S., and what I want to do is tell people how to get from Pittsburgh to Los Angeles—specifically. If I give 20 people a map of the U.S. and tell them to get me to Los Angeles, I'm going to get 20 different routes."

The second draft was shorter and more specific. Essentially, it describes the interlocking gears for running an I.T. organization like a well-oiled machine. Muha presents it as a diagram showing three interlocking cycles for implementing an On Demand Operating Environment, the utility computing environment in which applications and users can draw more or less computing power from a pool of server and storage resources as their needs change. In other words, the ideal is that getting access to another computer processor or another gigabyte of storage should be no more difficult than drawing an extra kilowatt from the power grid. Virtualization is an enabler, making it possible to rapidly provision, grow or shrink partitions on a server, but the other side of it is having standardized processes for managing budgets, approvals, support and service levels.

For example, UPMC defined what it calls a "factory" process outlining how to forecast demand for new equipment and put it into service: approval and forecast, budget and finance, specify standards, order equipment, check infrastructure readiness, receive equipment, install equipment, build/reserve resource pools—and repeat. The parallel cycles define how to put an application into service and support it, and how to provide the monitoring, management, auditing and dependable service levels.
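
The repeating nature of the "factory" process can be sketched in a few lines of Python. The step names and their order come from the list above; the constant `FACTORY_STEPS` and the function `factory_schedule` are hypothetical names introduced only for illustration, not anything UPMC or IBM is reported to have built.

```python
from itertools import cycle, islice

# Steps in the order the article lists them.
FACTORY_STEPS = [
    "approval and forecast",
    "budget and finance",
    "specify standards",
    "order equipment",
    "check infrastructure readiness",
    "receive equipment",
    "install equipment",
    "build/reserve resource pools",
]


def factory_schedule(n_steps: int) -> list[str]:
    """Return the first n_steps of the repeating factory cycle."""
    # cycle() restarts the list after the last step, modeling "and repeat".
    return list(islice(cycle(FACTORY_STEPS), n_steps))


if __name__ == "__main__":
    for number, step in enumerate(factory_schedule(10), start=1):
        print(number, step)
```

Asking for more steps than the list contains wraps around to "approval and forecast," which is the point of describing the process as a cycle rather than a one-time project plan.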

Meanwhile, Sikora says one of his goals is to provide a standard catalog of information services that hospitals and departments will be able to order from, with price tags attached so that they will be able to see the difference between specifying a standard Windows server that will cost $20,000 over three years, and a virtual machine that will cost $1,700 to $1,800.
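
To put those price tags side by side, here is a rough calculation in Python. It assumes the virtual machine figure covers the same three-year period as the standard server figure, which the comparison implies but does not state outright. The names `STANDARD_SERVER_COST`, `VM_COST_RANGE` and `savings_range` are hypothetical.

```python
STANDARD_SERVER_COST = 20_000   # dollars over three years, per the article
VM_COST_RANGE = (1_700, 1_800)  # dollars; ASSUMED to cover the same three years


def savings_range(server_cost: int, vm_range: tuple[int, int]) -> tuple[int, int]:
    """Return the (smallest, largest) dollar savings from choosing a virtual machine."""
    low, high = vm_range
    # The most expensive VM gives the smallest saving, and vice versa.
    return server_cost - high, server_cost - low


if __name__ == "__main__":
    smallest, largest = savings_range(STANDARD_SERVER_COST, VM_COST_RANGE)
    print(f"Savings per server: ${smallest:,} to ${largest:,}")
```

Under that assumption, each request met with a virtual machine instead of a standard Windows server saves $18,200 to $18,300, which is the kind of difference a priced catalog is meant to make visible to the departments doing the ordering.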

"Look, year after year, we've been rated one of the most wired hospitals," Sikora says. "The performance on our systems is very good. I mean, our systems are 99.99% available. By most measures of performance, we're there. But we know how much better we want to be."

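For a sense of what "99.99% available" allows, the figure can be converted into downtime per year. This is a back-of-the-envelope sketch, assuming a 365-day year and availability measured around the clock; how UPMC actually measures availability is not stated, and the function name `downtime_minutes_per_year` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into allowed downtime per year."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    minutes = downtime_minutes_per_year(99.99)
    print(f"{minutes:.2f} minutes of downtime per year")
```

By this arithmetic, 99.99% availability leaves room for a little under 53 minutes of downtime a year.
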
Muha, who was formerly in charge of the Windows server infrastructure, says UPMC definitely had to break from the pattern of runaway growth. "If we had kept going the way we were, I can't imagine," he says. "We would definitely not be in this data center. We'd be supporting 500 more servers."

"Sometimes, we have to force people to take a step back and look at how we used to do it," rather than what has yet to be done, Sikora agrees. "I'm probably the most impatient one."


David F. Carr is the Technology Editor for Baseline Magazine, a Ziff Davis publication focused on information technology and its management, with an emphasis on measurable, bottom-line results. He wrote two of Baseline's cover stories focused on the role of technology in disaster recovery, one focused on the response to the tsunami in Indonesia and another on the City of New Orleans after Hurricane Katrina.

David has been the author or co-author of many Baseline Case Dissections on corporate technology successes and failures (such as the role of Kmart's inept supply chain implementation in its decline versus Wal-Mart or the successful use of technology to create new market opportunities for office furniture maker Herman Miller). He has also written about the FAA's halting attempts to modernize air traffic control, and in 2003 he traveled to Sierra Leone and Liberia to report on the role of technology in United Nations peacekeeping.

David joined Baseline prior to the launch of the magazine in 2001 and helped define popular elements of the magazine such as Gotcha!, which offers cautionary tales about technology pitfalls and how to avoid them.
