When a Problem Isn't a Problem
By Anna Maria Virzi | Posted 2006-07-06
The Information Technology Infrastructure Library helps the bank define, measure, track and investigate service outages. But does it improve service?
When the Bank of New York acquired Pershing, a financial clearinghouse for securities and other financial transactions, in May 2003, both organizations were working on improving I.T. service support through ITIL; the Bank of New York initially focused on change and incident management and the Pershing subsidiary zeroed in on incident and problem management. Why? "They were the ones [that gave us] the most pain—from a business perspective," says Gallagher, who was then project leader for the efforts to implement incident and problem management at Pershing.
Today, the Bank of New York, including its Pershing subsidiary, defines a major incident as any issue that causes a disruption of service to three or more internal or external customers. It could involve, say, a disabled mainframe or a server crash that makes it impossible for employees to access a software application. Previously, each group within the I.T. organization used its own method to manage incidents, making it difficult to measure the effectiveness of the process, according to Gallagher. "Also, communication of the status of an incident as it was under investigation was limited and inconsistent; many managers were not aware of the progress of incident diagnosis and resolution," he explains.
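The three-customer rule is simple enough to express as a check. What follows is only a minimal sketch of that definition, not the bank's actual tooling: the `Incident` record, its field names and the sample incidents are all hypothetical, and only the threshold of three comes from the definition above.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Threshold from the bank's definition: three or more affected customers.
MAJOR_INCIDENT_CUSTOMER_THRESHOLD = 3


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""

    summary: str
    affected_customers: set[str] = field(default_factory=set)

    def is_major(self) -> bool:
        # Internal and external customers count alike under the definition.
        return len(self.affected_customers) >= MAJOR_INCIDENT_CUSTOMER_THRESHOLD


# Made-up examples: one outage touching three customers, one touching a single group.
server_crash = Incident(
    "Application server crash",
    {"trading desk", "settlements", "client A"},
)
printer_offline = Incident("Printer offline", {"help desk"})
```

Under these assumptions, `server_crash.is_major()` is true and `printer_offline.is_major()` is false. Applying one shared rule like this, rather than each group's own method, is what makes incident counts comparable across the I.T. organization.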
At the bank, problem management refers to the process of investigating and verifying the cause of incidents, such as a server outage, and then assessing whether the incident is part of a larger pattern that requires a long-term fix or other remediation, such as notifying the server manufacturer about a possible product defect.
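One way to picture the "larger pattern" step is to tally the verified causes of past incidents and flag any cause that recurs. The sketch below is purely illustrative: the `problem_candidates` function, the sample causes and the recurrence cutoff of two are assumptions, not anything the bank describes.

```python
from collections import Counter


def problem_candidates(incident_causes, min_occurrences=2):
    """Return verified causes that appear at least `min_occurrences` times.

    `min_occurrences` is an assumed cutoff, not a figure from the bank.
    Incidents whose cause is still unverified are passed in as None and skipped.
    """
    counts = Counter(cause for cause in incident_causes if cause is not None)
    return sorted(cause for cause, seen in counts.items() if seen >= min_occurrences)


# Made-up causes for four past incidents; one is still under investigation.
causes = [
    "faulty disk controller",
    None,
    "faulty disk controller",
    "expired certificate",
]
```

Here `problem_candidates(causes)` flags only the recurring disk-controller fault, the kind of finding that might prompt a call to the server manufacturer.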
Taking a cue from ITIL, Gallagher stresses that the terms "incident" and "problem" are not interchangeable. In ITIL lingo, an incident refers to a service disruption, i.e., the actual event. A "problem" refers to the underlying cause: why an incident happened in the first place.
But isn't this splitting hairs? Gallagher sees it this way: An "incident" is something that technology managers must react to as quickly as possible to ensure that service is restored. If you try to investigate the cause of an incident while attempting to restore service, Gallagher says your immediate restoration work won't get top priority—and, as a result, may take longer.
Oftentimes, a company hit with a major service outage due to systems changes will adopt the library's processes for filtering and managing changes to avoid disruptions in service. "What seems to resonate with everyone is that [ITIL] is best practices-based; it provides guidance without specific direction. So, it provides flexibility in how you implement the framework," says Probst.
What's the downside? Don't expect change to happen quickly. "A lot of executives want to know why it's going to take a year or 18 months" to improve a process, Probst says. The reason it takes so long: Managing services consistently throughout an organization may require employees to change the way they work. "People have to give up old ways," Probst adds. "You're changing culture and the organization."