How Bank of New York Uses ITIL to Troubleshoot
Technology managers at the Bank of New York thought they were doing a good job of running the information systems, networks and other services the company supplies to its internal and external customers. But good wasn't good enough. They couldn't back up their assessment with metrics.
"When senior business managers asked senior I.T. managers: 'How are you guys doing? How is the process working? Show me some measurements,' we were not able to supply that," recalls Joseph Gallagher, vice president of I.T. process implementation at the Bank of New York. Those measurements, which have since been adopted, include statistics such as the number of service outages and their severity as well as the amount of time it takes to restore service.
That's why Bank of New York's technology managers turned to the Information Technology Infrastructure Library, or ITIL, a series of publications written by the United Kingdom's Office of Government Commerce in the 1980s that describe best practices for managing technology infrastructure. One such practice: maintaining a central information repository for tracking service problems such as server crashes or network outages. It provides a "feedback loop" to reduce or eliminate future incidents by collecting and analyzing the source of incidents and taking corrective action before a similar incident reoccurs.
Advocates see ITIL as a way to boost discipline in technology operations, and adopt a common vocabulary for discussing quality of service and establishing metrics. The library, which is available in print, on a CD-ROM or via an intranet license, provides guidance on how to approach eight key processes, including service delivery, service support, security management and software asset management.
"ITIL was specifically created because of the recognition that good quality software and systems don't automatically provide good quality services to customers," says David Cannon, Hewlett-Packard's I.T. service management practice principal and a consultant working on behalf of the British government to refresh ITIL.
Cannon compares the situation to someone who owns a $100,000 Mercedes—with the best safety and performance systems that money can buy. "The problem is that I have never taken a driving lesson in my life and don't have a license. As a result, I stick to the back roads and drive at least 10 mph below the speed limit ... I never get anywhere on time," he says. U.K. technology managers apparently were in the slow lane, too, when they purchased software and hardware to automate services back in the 1980s, yet did not get optimal performance. So, the U.K. recognized the need for a manual to help get its operations running better.
Now, that manual is being used by some of the world's largest companies as a guide to better information-technology management. Earlier this year, for instance, Hewlett-Packard announced that General Motors had retained HP Services to use ITIL as a building block for the deployment of GM's next-generation outsourcing model—to verify that I.T. vendors meet service levels.
As many as 75% of the Fortune 100 are among those companies embracing ITIL for an assortment of reasons, says Jack Probst, an executive consultant at Pink Elephant, a firm that provides ITIL education, conferences and consulting services to companies such as Nationwide Insurance, Bank of Montreal and BP. Two Pink Elephant employees are also working as consultants on the project to update the library.
When a Problem Isn't a Problem
When the Bank of New York acquired Pershing, a financial clearinghouse for securities and other financial transactions, in May 2003, both organizations were working on improving I.T. service support through ITIL; the Bank of New York initially focused on change and incident management and the Pershing subsidiary zeroed in on incident and problem management. Why? "They were the ones [that gave us] the most pain—from a business perspective," says Gallagher, who was then project leader for the efforts to implement incident and problem management at Pershing.
Today, the Bank of New York, including its Pershing subsidiary, defines a major incident as any issue that causes a disruption of service to three or more internal or external customers. It could involve, say, a disabled mainframe or a server crash that makes it impossible for employees to access a software application. Previously, each group within the I.T. organization used its own method to manage incidents, making it difficult to measure the effectiveness of the process, according to Gallagher. "Also, communication of the status of an incident as it was under investigation was limited and inconsistent; many managers were not aware of the progress of incident diagnosis and resolution," he explains.
At the bank, problem management refers to the process of investigating and verifying the cause of incidents, such as a server outage, and then assessing whether the incident is part of a larger pattern that requires a long-term fix or other remediation, such as notifying the server manufacturer about a possible product defect.
Taking a cue from ITIL, Gallagher stresses that the terms "incident" and "problem" are not interchangeable. In ITIL lingo, an incident refers to a service disruption, i.e., the actual event. The "problem" refers to why an incident happened in the first place.
But isn't this splitting hairs? Gallagher sees it this way: An "incident" is something that technology managers must react to as quickly as possible to ensure that service is restored. If you try to investigate the cause of an incident while attempting to restore service, Gallagher says your immediate restoration work won't get top priority—and, as a result, may take longer.
Oftentimes, a company hit with a major service outage due to systems changes will adopt the library's processes for filtering and managing changes to avoid disruptions in service. "What seems to resonate with everyone is that [ITIL] is best practices-based; it provides guidance without specific direction. So, it provides flexibility in how you implement the framework," says Probst.
What's the downside? Don't expect change to happen quickly. "A lot of executives want to know why it's going to take a year or 18 months" to improve a process, Probst says. The reason it takes so long: Managing services consistently throughout an organization may require employees to change the way they work. "People have to give up old ways," Probst adds. "You're changing culture and the organization."
Proponents of ITIL say the practice is designed to improve consistency in operations from one unit or division to another. "We used to have four different service desks, and everyone did things their own way. There was no commonality as to what was an incident, what was a problem. We didn't have any tools to track this," says Pershing's chief information officer, Suresh Kumar, of his organization.
At Pershing, one service desk is dedicated to handling calls from internal business users and another from external customers; the other service desks are dedicated to taking calls from a specific segment of the company's customer base. While Pershing still has four service desks, they all use the same definitions to describe a particular problem.
"ITIL recommends that an organization follow the same processes, start to finish. Regardless of service or customer, the activities for incident should be the same," Gallagher explains. "People get into their heads that this is the process, it becomes institutionalized, it becomes repeatable."
Following the same definitions provides another benefit. "It gives our business a sense that we're working on the highest priority incidents. That we're focusing our resources," Gallagher explains. "It gives us the ability to measure the time to resolve those incidents, and look for ways to improve and reduce the time to resolve [them]."
Incident and problem management are popular ITIL practices, says HP's Cannon. "When things break, it causes downtime to the business. There's a certain amount of urgency to get that problem fixed."
At the Bank of New York, each incident is assigned a severity number, from one to four, with number one representing an incident that has significant impact on the services offered to the business and customers. Four is assigned to minor incidents that do not need immediate attention. ITIL recommends the same numbering scheme.
ITIL also recommends that the service desk "own" every incident created. The Bank of New York follows this guideline, making an exception for its highest priority incidents affecting business or customers. In those cases, the bank assigns incident "ownership" to the service owner within I.T. The service desk then acts as incident coordinator, setting up telephone conference calls, updating the incident ticket, paging managers and keeping customers updated, according to Gallagher. "This distribution of responsibilities during high priority incidents ensures that all I.T. staff are focused on incident resolution," he says. By breaking up these duties, the Bank of New York makes sure that one person is not responsible for both fixing an incident and alerting customers about the service disruption.
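The classification and ownership rules described above can be sketched in a few lines of Python. This is an illustration only, not the bank's actual system: the `Incident` fields, the function names and the reading of "highest priority" as severity one are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

SERVICE_DESK = "service desk"


@dataclass
class Incident:
    service: str
    service_owner: str        # hypothetical field: I.T. manager responsible for the service
    severity: int             # 1 = significant impact ... 4 = minor, no immediate attention
    customers_affected: int


def is_major(incident: Incident) -> bool:
    """A major incident disrupts service to three or more customers."""
    return incident.customers_affected >= 3


def assign_roles(incident: Incident) -> dict:
    """Return who owns the incident and who coordinates it.

    Assumption: "highest priority" is treated as severity one.
    """
    if not 1 <= incident.severity <= 4:
        raise ValueError("severity must be between 1 and 4")
    if incident.severity == 1:
        # The service owner fixes; the service desk coordinates calls,
        # ticket updates, paging and customer communication.
        return {"owner": incident.service_owner, "coordinator": SERVICE_DESK}
    coordinator: Optional[str] = None
    return {"owner": SERVICE_DESK, "coordinator": coordinator}


outage = Incident("trade clearing", "service owner A", severity=1, customers_affected=5)
print(is_major(outage), assign_roles(outage))
```

Keeping the owner and coordinator as separate roles mirrors the split Gallagher describes: the person restoring service is not also the person updating customers.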
Once an incident is resolved at the Bank of New York, the "problem management" process kicks in with a so-called root cause analysis. This phase involves a more in-depth investigation into the cause of an incident; those findings are then vetted and corrective actions are taken to prevent an incident from reoccurring, according to Gallagher.
The bank has realized benefits from this effort. During 2005, Pershing recorded an average of 65 so-called severity-one incidents each month. For the first five months of 2006, the number dropped to 51, a 21.5% decline from 2005.
The cost to resolve the most severe incidents declined from an average of about $1,600 per incident in July 2004, to about $200 per incident in January 2006, according to Gallagher.
The bank has not performed a return-on-investment assessment for the ITIL initiative, he says. "ITIL has allowed us to streamline processes and advance the culture toward providing exceptional service."
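The declines cited above can be checked with simple arithmetic. The short Python sketch below uses only the figures quoted in this article; the function name is made up for the example, and the cost figures are the approximate ones Gallagher gives.

```python
def percent_decline(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Monthly severity-one incidents: 2005 average vs. first five months of 2006.
incident_drop = percent_decline(65, 51)

# Approximate cost per severe incident: July 2004 vs. January 2006.
cost_drop = percent_decline(1600, 200)

print(f"Severity-one incidents: {incident_drop:.1f}% decline")  # 21.5% decline
print(f"Cost per incident: {cost_drop:.1f}% decline")           # 87.5% decline
```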
The cornerstone of the Bank of New York's problem management efforts has been the establishment of a Root Cause Analysis (RCA) Committee, composed of about 10 senior managers who represent a cross-section of I.T. groups. Every business day, the committee reviews and verifies the cause of a severity-one incident. That review process, first initiated at the Pershing subsidiary, is now being rolled out throughout the Bank of New York, according to company executives.
Here, the Bank of New York has gone over and above what's called for in ITIL. Each severity-one incident is assigned to a root cause analysis "owner," potentially any one of Bank of New York's technology employees. The root cause analysis owner is then asked to investigate an incident and report back to the RCA Committee within five business days.
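A five-business-day reporting window is easy to track automatically. The Python sketch below shows one way to compute the due date. It assumes the clock starts on the day the owner is assigned and that only weekends are skipped; bank holidays are ignored, and the function name is hypothetical.

```python
from datetime import date, timedelta

RCA_REPORT_WINDOW = 5  # business days allowed for the root cause report


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`, skipping weekends."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


assigned = date(2006, 6, 5)  # a Monday, chosen only for illustration
print(add_business_days(assigned, RCA_REPORT_WINDOW))  # 2006-06-12
```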
Promptly at 9 a.m. each business day, the RCA Committee convenes. Those on-site gather at a Bank of New York conference room; others dial in to a telephone conference line. On each agenda: two to five incidents; each "owner" is allocated time to discuss his investigation into a particular service disruption and explain what he thinks caused the problem. From there, the owner fields questions from committee members, who may agree with or challenge the owner's assessment and direct him to gather more information.
About one-third of the cases are kicked back for additional research, according to Gallagher. "If you are not sure of your root cause, how can you be sure of your corrective action?" he asks.
Out of a monthly average of 65 severity-one incidents reported in 2005, 15% were attributed to human error, 13% to a program bug, 11% to vendor issues, 11% to process problems, 8% to software and 8% to inaccurate technical analysis; 13% were due to unknown reasons.
By collecting in-depth, consistent information about incidents and their causes, the technology team is able to determine whether a problem is an isolated incident or part of a larger trend, according to Pershing's technology managers. If it's part of a larger trend, a task force is created to investigate the cause and recommend an action plan.
This was the case with printers owned by Pershing and located at customers' premises. These printers, used to prepare daily reports about transactions, account balances, etc., were going offline without explanation. Once Pershing started to log each disruption as an incident, the financial institution discovered there were 600 to 700 printer-related incidents each month.
CIO Kumar assigned a team of technology workers to use Six Sigma methodology to investigate the printer problems. The team discovered that "a large percentage of incidents were due to a network communication issue," Gallagher says. After further investigation, Pershing learned that a software patch would remediate most of the incidents. Today, there are about 100 printer incidents a month that are attributed to typical issues such as paper jams or spooling errors. "Nothing stands out as a major repeatable incident now," Gallagher says.
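Spotting a pattern like the printer outages comes down to counting logged incidents by category and flagging the ones that pile up. The Python sketch below is a rough illustration, not Pershing's method: the log format, the threshold of 100 and the sample entries are invented for the example (only the printer volume echoes the range cited above).

```python
from collections import Counter


def recurring_categories(incident_log, threshold):
    """Return (month, category) pairs logged at least `threshold` times."""
    counts = Counter(incident_log)
    return sorted(key for key, n in counts.items() if n >= threshold)


# Illustrative log: each entry is (month, category).
log = (
    [("2004-06", "printer offline")] * 600
    + [("2004-06", "server crash")] * 3
    + [("2004-06", "network outage")] * 2
)

print(recurring_categories(log, threshold=100))
# [('2004-06', 'printer offline')]
```

A category that crosses the threshold would be a candidate for the kind of task force described above; the count alone says nothing about the cause.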
Keep in mind: ITIL does not offer specifics on how to fix printers or any other equipment. "What ITIL provides is a process to classify incidents in a manner to ensure they are accurately assigned to the appropriate support group within the I.T. organization," Gallagher says.
To collect and analyze information about service disruptions, you need tools. And Gallagher advises that those tools be automated whenever possible. At Pershing, the internal software developers built an application, called OmniMetrics, that imports information from its BMC Remedy help-desk application database and other sources to create performance scorecards. These scorecards show the total number of severity-one incidents by month, group or manager; they also assess each tech team's responses to incidents, thus giving an overall picture of performance.
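The article does not describe how OmniMetrics is built, but the basic idea of a scorecard can be sketched in Python. Everything below is an assumption made for illustration: the ticket fields, the sample records and the choice of average minutes to restore as the response measure. None of it reflects the BMC Remedy schema or Pershing's actual code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ticket records; a real feed would come from the help-desk database.
tickets = [
    {"opened": "2005-03-02", "group": "network", "severity": 1, "minutes_to_restore": 90},
    {"opened": "2005-03-15", "group": "network", "severity": 1, "minutes_to_restore": 30},
    {"opened": "2005-03-20", "group": "mainframe", "severity": 1, "minutes_to_restore": 45},
    {"opened": "2005-03-21", "group": "network", "severity": 3, "minutes_to_restore": 10},
]


def severity_one_scorecard(records):
    """Summarize severity-one tickets by (month, group)."""
    buckets = defaultdict(list)
    for record in records:
        if record["severity"] == 1:
            month = record["opened"][:7]  # "YYYY-MM"
            buckets[(month, record["group"])].append(record["minutes_to_restore"])
    return {
        key: {"count": len(times), "avg_minutes_to_restore": mean(times)}
        for key, times in buckets.items()
    }


for key, row in sorted(severity_one_scorecard(tickets).items()):
    print(key, row)
```

The same grouping could be keyed by manager instead of group to produce the per-manager view mentioned above.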
For organizations looking to implement ITIL and measure performance, Gallagher has some advice: "Don't promise improvement until you have a real good historical baseline."
Case in point: Before June 2004, Pershing had no baseline metrics for measuring severity-one incidents. When it started to develop processes for managing incidents and problems, the definition of a severity-one incident changed, and more incidents were reported.
Why? The company saw the value of reporting and tracking outages, Gallagher says. Incidents such as missing a service level agreement during the overnight batch cycle—not originally included—were added.
By December 2004—in time for the 2005 calendar year—Pershing locked down its definition of severe service outages. "We wanted a 12-month period of saying, 'This is how the organization performs without any changes to the criteria.' That was our baseline," Gallagher explains.
Once Pershing collected and analyzed metrics, the technology organization could create scorecards, another key tenet of ITIL, to show the performance of technology services—and, in effect, the performance of managers responsible for those particular services, or service owners, according to CIO Kumar.
One surprise: When Pershing created a catalog listing all technology services, Kumar says he and his team discovered that in some cases, there were multiple owners of a service. And in other cases, there were no owners. Plus, a service owner might not necessarily know who the customer of his service was. To improve accountability, Pershing has identified one service owner for each service, in a process that's still evolving throughout the Bank of New York, according to Kumar.
The CIO provided this example: Let's say there's an incident with a printing service that could be due to any of several reasons—either an application, network, firewall or server failing to work properly. Previously, the tech team would view the problem as either a hardware or software problem. Someone using the service, though, would view it as a printer problem, which is how the technology team now approaches its job.
Publishing a monthly scorecard—and identifying services and service owners—increased the transparency of technology operations and triggered an immediate improvement in operations, Kumar says.
"We didn't have to yell at people," he says. "We didn't have to offer any incentives. It just goes to show that when everyone comes to work, they love to do the best they can. But lack of information doesn't give them the knowledge to see performance."
Bank of New York Base Case
Headquarters: 1 Wall St., New York, NY 10286
Phone: (212) 495-1784
Chief Information Officer: Kurt Woetzel
Financials in 2005: $8.3 billion in revenue, an increase of 16.9% from prior year; $1.6 billion in net profit, an increase of 9.1%.
Business: Provides investor, issuer and broker-dealer services. Its Pershing subsidiary offers clearing services.
Challenge: Identify, verify and analyze the cause of technology service disruptions.
- Reduce the average monthly number of severe service disruptions, from 65 in 2005, by 20% in 2006.
- Cut the monthly number of printer problems, from 500 in June 2004 to 100 this year.
- Reduce the monthly number of password-related incidents, from about 2,600 in January 2005 to fewer than 1,200 in 2006.