Wachovia Reaps Rewards With Active Systems Monitoring

 
 
By Kevin Fogarty  |  Posted 2008-07-08
 
 
 

A lot of IT managers would like to keep closer tabs on their important systems, but it often takes a crisis to justify the time and expense required.

Jim Hirschauer, technology architecture manager and technical expert for Wachovia’s Corporate and Investment Banking Group, recalls one poignant example of this. "It was a foreign trading application, and we were having intermittent performance issues, which was causing some significant problems for some of our customers," he says. "They started complaining about delays, and with the speed of that market, delays aren't good. Those customers can go away very easily, and a couple of them did."

Hirschauer couldn't estimate the amount of revenue Wachovia lost due to the defections, some of which were only temporary. But in a business as volatile as investment banking, a single hot trading day can have a significant impact on revenue, and the loss of one large customer can keep a bank out of the rush.

The problem, Hirschauer says, was that Wachovia's battery of systems-management and monitoring tools reacted too slowly. Like most systems-management software, these tools required IT to set performance thresholds and sent an alarm if they were exceeded.

If capacity on one application server reached 30 percent, for example, or available storage dipped below 50 percent, the systems-management agent might send an alert to a central console.
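A static-threshold check of this kind can be sketched in a few lines of Python. This is an illustration only: the metric names and the `static_alerts` function are hypothetical, and the limits simply reuse the two example figures above rather than any setting Wachovia is known to have used.

```python
# Hypothetical static thresholds, reusing the article's two example figures.
# Each entry: metric name -> (direction that counts as a breach, fixed limit).
STATIC_THRESHOLDS = {
    "capacity_percent": ("above", 30.0),
    "free_storage_percent": ("below", 50.0),
}


def static_alerts(sample):
    """Return the metrics in `sample` that cross their fixed limits."""
    alerts = []
    for metric, (direction, limit) in STATIC_THRESHOLDS.items():
        value = sample.get(metric)
        if value is None:
            continue  # metric not reported in this sample
        breached = value > limit if direction == "above" else value < limit
        if breached:
            alerts.append(metric)
    return alerts


# A sample inside both limits raises nothing; one outside both raises two alerts.
quiet = static_alerts({"capacity_percent": 25.0, "free_storage_percent": 70.0})
noisy = static_alerts({"capacity_percent": 35.0, "free_storage_percent": 40.0})
```

The limits are fixed numbers that someone has to choose and then maintain by hand for every metric on every system.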

"The problem with a static threshold is that you tend to set it high to avoid having to react to false positives," Hirschauer says. "But if you set it too high, you're already pretty deep in the weeds by the time you know about it. And your values change as your infrastructure changes, so it becomes an administrative nightmare."

Wachovia, which was in the midst of a larger project to improve systems monitoring, needed a better way to tell when performance was just beginning to affect customers, not just to know when the system was about to crash.

"We were basically in the business of availability monitoring," Hirschauer says. "We were really good at knowing when a file system would fill up or if a server was up or down. What we didn't do a good job of traditionally was monitoring of the component level and performance through the various application tiers."

Symantec’s I3 deep-dive analysis tools helped troubleshoot that particular system, but they couldn't get Wachovia's IT crew out of the firefighting business.

The overall solution was a better-instrumented and better-integrated set of systems-management tools from Symantec, Hewlett-Packard and others.

The technology that pushed Wachovia's response far enough upstream to head off trouble before it arrived was a tool that warns IT when a system or application has stopped behaving normally, whether that means a sudden slowdown, acceleration or complete lack of response.

The tool, Netuitive SI from Netuitive Inc. in Reston, Va., is an agentless monitoring system that sits on the network watching traffic from specific hardware or software.

Behavioral monitoring and self-learning in performance-management tools give IT managers much greater predictability and control over application performance, according to a Yankee Group report, "Optimizing Virtual Environments Requires Self-Learning Performance Management," updated in January 2008.

The ability to adapt quickly to changes in performance and to monitor virtual machines as well as physical servers and applications helps IT establish benchmarks that make predictive analytics much more effective, the report says.

Netuitive SI gave solid data on systems performance right away, but the technology’s greatest value for Wachovia came after about two weeks of watching and benchmarking systems under various conditions, Hirschauer explains.

"Within the first week, you're getting decent predictive analytics," he says. "After two weeks, we saw the baseline thresholds and algorithms get very, very accurate for us. When we add systems to Netuitive, we look at the data, but take it with a grain of salt for the first two weeks. After that, we take it very seriously when it predicts we'll have a problem with something."

The product proved itself during feasibility testing by raising a warning on a production database that hadn't triggered any performance alerts from Wachovia's other tools for more than 30 days.

Netuitive SI issued a critical alarm after watching CPU utilization go from 10% to close to 50%, with a simultaneous rise in context switching, network activity and other metrics.

"The SQL calls were really slowing down, but none of the other tools said anything; at 50% utilization, the level wasn't high enough to trigger an alarm," Hirschauer says. "It turned out that one of our vendors had dumped a large amount of data into the database without going through the change-control process, and had caused a significant change in performance.

"Customers didn't know what happened. Some saw a slowdown, but it was not that great yet," Hirschauer says. "If this had happened without Netuitive, we wouldn't have known it had happened; if it happened again, we would have had some serious problems with that system."

Dynamic thresholding helps save IT managers time and effort in responding to false alarms and inaccurate performance estimates, according to a Meta Group report by analyst Bob Wallace, who cited research showing that false positives make up as much as 90% of total alerts.

That volume of false alarms not only wastes time, it erodes the credibility of any performance alarm; this "cry-wolf effect" keeps IT managers putting out fires rather than identifying underlying performance problems, Wallace says.
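The production-database episode shows the difference between the two approaches, and a small Python sketch can make it concrete. Everything here is an assumption for illustration: the article does not describe Netuitive's algorithms, so this uses a plain mean-and-standard-deviation baseline, an invented CPU history hovering around 10%, and a hypothetical static limit of 80%.

```python
from statistics import mean, stdev

STATIC_CPU_LIMIT = 80.0  # hypothetical fixed alarm level, not from the article


def learn_baseline(history):
    """Summarize normal behavior as (mean, standard deviation)."""
    return mean(history), stdev(history)


def is_abnormal(value, baseline, k=3.0):
    """Flag a reading more than k standard deviations from the learned mean.

    The check is two-sided, so an unusual drop is flagged as well as a spike.
    """
    mu, sigma = baseline
    sigma = max(sigma, 1e-9)  # guard against a perfectly flat history
    return abs(value - mu) > k * sigma


# Invented history: CPU utilization that normally sits near 10%.
cpu_history = [9.0, 10.0, 11.0, 10.5, 9.5, 10.0, 10.2, 9.8]
baseline = learn_baseline(cpu_history)

reading = 50.0  # the jump described in the article
static_alarm = reading > STATIC_CPU_LIMIT         # fixed limit stays silent
dynamic_alarm = is_abnormal(reading, baseline)    # baseline check fires
normal_wobble = is_abnormal(10.8, baseline)       # ordinary variation is ignored
```

Because the limit is derived from the system's own history, it moves as the history changes. That is the property Hirschauer describes when he says the thresholds became accurate after about two weeks of data.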

In addition to Netuitive, Wachovia uses Symantec systems-management tools and CoreFirst software from Optier to track business transactions as they flow through the IT infrastructure and collect performance statistics on servers and applications.

The primary management console is BMC Software's Patrol, which integrates and displays performance data from other tools, including Netuitive. Netuitive is doing some custom integration for Wachovia, but is also working on more generic integration code that will enable BMC customers to use it without any external assistance. (BMC and Netuitive have been bundling and integrating their tools since 2003.)

The combination is a solid, easily justified toolset, Hirschauer says.

"If you have a performance impact on an important system, you're doing a disservice to the business if you're not keeping up with it and keeping it from happening," Hirschauer says. "Especially in investment banking, customers will just pick up and move elsewhere."