Managing Problems
By Anna Maria Virzi | Posted 2006-07-06
The Information Technology Infrastructure Library helps the bank define, measure, track and investigate service outages. But does it improve service?
Once an incident is resolved at the Bank of New York, the "problem management" process kicks in with a so-called root cause analysis. This phase involves a more in-depth investigation into the cause of an incident; those findings are then vetted and corrective actions are taken to prevent the incident from recurring, according to Gallagher.
The bank has realized benefits from this effort. During 2005, Pershing recorded an average of 65 so-called severity-one incidents each month. For the first five months of 2006, the number dropped to 51, a 21.5% decline from 2005.
The cost to resolve the most severe incidents declined from an average of about $1,600 per incident in July 2004, to about $200 per incident in January 2006, according to Gallagher.
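The two declines quoted above can be checked with simple arithmetic. The short sketch below uses only the figures cited in this article; the function name percent_decline is an illustrative choice, not something from the bank's tooling.

```python
def percent_decline(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100


# Figures quoted in the article.
incidents = percent_decline(65, 51)   # monthly severity-one incidents, 2005 vs. early 2006
cost = percent_decline(1600, 200)     # approximate cost per incident, July 2004 vs. January 2006

print(f"Severity-one incidents: {incidents:.1f}% decline")
print(f"Cost per incident: {cost:.1f}% decline")
```

Run as written, this reproduces the 21.5% drop in incidents and shows that the per-incident cost figures correspond to a decline of 87.5%.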
The bank has not performed a return-on-investment assessment for the ITIL initiative, he says. "ITIL has allowed us to streamline processes and advance the culture toward providing exceptional service."
The cornerstone of the Bank of New York's problem management efforts has been the establishment of a Root Cause Analysis (RCA) Committee, composed of about 10 senior managers who represent a cross-section of I.T. groups. Every business day, the committee reviews and verifies the causes of severity-one incidents. That review process, first introduced at the Pershing subsidiary, is now being rolled out throughout the Bank of New York, according to company executives.
Here, the Bank of New York has gone over and above what's called for in ITIL. Each severity-one incident is assigned to a root cause analysis "owner," potentially any one of Bank of New York's technology employees. The root cause analysis owner is then asked to investigate an incident and report back to the RCA Committee within five business days.
Promptly at 9 a.m. each business day, the RCA Committee convenes. Those on-site gather at a Bank of New York conference room; others dial in to a telephone conference line. On each agenda: two to five incidents. Each "owner" is allocated time to discuss his investigation into a particular service disruption and explain what he thinks caused the problem. From there, the owner fields questions from committee members, who may agree with or challenge the owner's assessment and direct him to gather more information.
About one-third of the cases are kicked back for additional research, according to Gallagher. "If you are not sure of your root cause, how can you be sure of your corrective action?" he asks.
Out of a monthly average of 65 severity-one incidents reported in 2005, 15% were attributed to human error, 13% to a program bug, 11% to vendor issues, 11% to process problems, 8% to software and 8% to inaccurate technical analysis; 13% were due to unknown reasons.
By collecting in-depth, consistent information about incidents and their causes, the technology team is able to determine whether a problem is an isolated incident or part of a larger trend, according to Pershing's technology managers. If it's part of a larger trend, a task force is created to investigate the cause and recommend an action plan.
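To make the idea of separating isolated incidents from trends concrete, here is a minimal sketch. It assumes each logged incident carries a category label; the sample records, the category names and the threshold are hypothetical and are not drawn from Pershing's systems.

```python
from collections import Counter

# Hypothetical incident log: (incident_id, category).
# Records and category names are illustrative only.
incidents = [
    (1, "printer offline"),
    (2, "program bug"),
    (3, "printer offline"),
    (4, "vendor issue"),
    (5, "printer offline"),
]


def recurring_categories(log, threshold):
    """Return the categories that appear at least `threshold` times in the log."""
    counts = Counter(category for _, category in log)
    return {cat: n for cat, n in counts.items() if n >= threshold}


# Categories that cross the (assumed) threshold are candidates for a task force.
trends = recurring_categories(incidents, threshold=3)
print(trends)
```

The point of the sketch is that consistent logging is what makes the count possible: an incident that is never recorded under a common category cannot show up as a trend.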
This was the case with printers owned by Pershing and located at customers' premises. These printers, used to prepare daily reports about transactions, account balances, etc., were going offline without explanation. Once Pershing started to log each disruption as an incident, the financial institution discovered there were 600 to 700 printer-related incidents each month.
CIO Kumar assigned a team of technology workers to use Six Sigma methodology to investigate the printer problems. The team discovered that "a large percentage of incidents were due to a network communication issue," Gallagher says. After further investigation, Pershing learned that a software patch would remediate most of the incidents. Today, there are about 100 printer incidents a month that are attributed to typical issues such as paper jams or spooling errors. "Nothing stands out as a major repeatable incident now," Gallagher says.
Keep in mind: ITIL does not offer specifics on how to fix printers or any other equipment. "What ITIL provides is a process to classify incidents in a manner to ensure they are accurately assigned to the appropriate support group within the I.T. organization," Gallagher says.