Managing Applications and Infrastructure

By Shawn Pearson

We were having the kind of “Groundhog Day” that no one wants to face. I’m referring to the Bill Murray movie in which his character relives the same day over and over.

Reliving a day wouldn’t be bad if it were a day that went smoothly. However, a day filled with disruptions, angry calls from customers and system downtime is not one you want to repeat. Unfortunately, that’s exactly the situation my IT team and I were facing at State Auto Insurance Companies.

State Auto, based in Columbus, Ohio, is a Fortune 500 organization with 3,000 independent agencies across 33 states. Our IT systems support these geographically dispersed agencies, as well as a vibrant, highly interactive Web presence boosted by a new Tier 3 state-of-the-art data center. It’s critical to State Auto’s success that our systems are always available to serve our policyholders, partners and agents.

As an operations engineering leader for State Auto, I serve as the primary liaison for 23 subject-matter experts (SMEs) who oversee the many applications, systems and infrastructures that make up our massive, online ecosystem. The transactions that run across this system are extremely complex. Not surprisingly, we found that tracking down the root cause for performance issues was difficult—and sometimes impossible. We know we’re not alone in facing this challenge, but that’s not a big help when problems start occurring.

One string of incidents triggered a three-week stretch of e-commerce disruptions and slow application performance. There were so many angry calls to the help desk that we were worried we might lose some customers or that our company’s reputation would take a hit.

We were also concerned that the slowdown might affect our presence on consumer sites that compare insurance rates. These performance issues were critical not just to IT, but to the entire business.

Our existing toolsets monitored individual IT systems and determined whether there were issues with specific servers going offline, such as those handling email, security and storage. We also tracked when customers entered the system and what kind of transaction time they were experiencing. Unfortunately, none of this shed any light on how the distinct parts worked as a group—how all these highly interconnected points came together.

We knew that every “dot” in our ecosystem had a great deal of influence on the other dots, but none of our IT tools could help us connect the dots. In that sense, we were working in the dark and unable to determine the cause of the disruptions.

We hadn’t done anything wrong. Every time you add capabilities through middleware, virtualization, service-oriented architecture (SOA), Web services and so forth, you create greater complexity for apps. The transactions change, making connections that did not previously exist.

In our case, our systems grew so intricate that we lost sight of how all the pieces worked together. This is a common challenge across IT shops. Like our counterparts in many other organizations, we knew of no existing tool that could lend true, overarching visibility throughout our systems.

Solving Performance Issues

During this crisis, the 23 SMEs and I locked ourselves in a conference room and tried to solve the performance issues, but we couldn’t find a silver-bullet solution. Each expert examined his or her own technology silo without any understanding of how that system affected overall transaction performance.

Without a shared understanding of how everything worked together, expressed in a common language, we had little hope of learning why the problems kept occurring.

We did make some corrections that we felt solved the problem, and some of the SMEs found tweaks to their specific systems that seemed to make things run more smoothly. Then, out of nowhere, the same pattern of events resurfaced, resulting in a flood of customer calls to the help desk.

That’s when we knew we had to take a different approach: an application performance management (APM) solution that operations could use to unify our application and transaction management efforts. We needed to get out of the silo game and take a transaction-centric view of our applications and infrastructure.

None of our silo tools—even the deep-dive tools our developers used—could do that. We wanted to follow all transactions, identify the slow ones and drill down the server stack to the root cause of the problem.
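As a rough illustration of that transaction-centric view, the sketch below groups per-hop timings by transaction, flags the transactions whose total time exceeds a threshold and reports the slowest hop in each. The record layout, tier names, timings and the 500 ms threshold are all invented for this example; it does not reflect how any particular APM product stores or analyzes its data.

```python
from collections import defaultdict

# Hypothetical hop records: (transaction_id, source_tier, target_tier, elapsed_ms).
# Tier names and timings are invented for illustration only.
HOPS = [
    ("tx-1", "web", "app", 40),
    ("tx-1", "app", "db", 55),
    ("tx-2", "web", "app", 45),
    ("tx-2", "app", "db", 1900),
    ("tx-3", "web", "app", 700),
    ("tx-3", "app", "db", 60),
]


def slow_transactions(hops, threshold_ms):
    """Return {transaction_id: (total_ms, slowest_hop)} for transactions over the threshold."""
    # Group the hops by the transaction they belong to.
    by_tx = defaultdict(list)
    for tx_id, source, target, elapsed_ms in hops:
        by_tx[tx_id].append((source, target, elapsed_ms))

    report = {}
    for tx_id, tx_hops in by_tx.items():
        total_ms = sum(hop[2] for hop in tx_hops)
        if total_ms > threshold_ms:
            # The hop with the largest elapsed time is the first place to drill down.
            report[tx_id] = (total_ms, max(tx_hops, key=lambda hop: hop[2]))
    return report


SLOW = slow_transactions(HOPS, threshold_ms=500)
for tx_id, (total_ms, (source, target, elapsed_ms)) in SLOW.items():
    print(f"{tx_id}: {total_ms} ms total, slowest hop {source} -> {target} ({elapsed_ms} ms)")
```

In this toy data set, two transactions are slow for different reasons: one stalls between the application and database tiers, the other between the Web and application tiers. A per-server monitor would not reveal that distinction.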

Results had to be reported in a format that was clear and useful for executives and operations teams—not just the technology experts and developers. Our existing APM tools weren’t designed to do those things, so we found an APM solution for IT operations.

I had seen one system, BlueStripe Software’s FactFinder transaction monitoring solution, help with these kinds of issues on other State Auto systems, so I loaded it up on all our apps. In less than two hours, I installed the product, enabled it across the application components and had a complete application and transaction map that measured end-to-end response times, hop by hop, across every technology silo.

With FactFinder’s complete transaction visibility, we could see how a large number of our applications were heavily dependent on a single database server. So many applications relied on it that the number of requests far exceeded what the machine was expected to handle.

We discovered that some of the application teams had been “piling on” this particular server to the point where it was responding to an insane number of applications, many of which weren’t showing any performance problems, at least not yet. Once we knew the influence of this particular database, and how its performance for one app was directly affected by requests from other apps, it was easy to correct the overload and quickly get application service back to superior levels.
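The kind of “piling on” described above can be pictured as a fan-in count: for each database server, how many distinct applications depend on it? The sketch below is a minimal version of that check. The application names, server names and the limit of three applications per server are hypothetical and chosen only to make the example readable.

```python
from collections import defaultdict

# Hypothetical observed dependencies: (application, database_server).
# Names are invented; a real map would come from monitoring data.
DEPENDENCIES = [
    ("quotes", "db-01"),
    ("billing", "db-01"),
    ("claims", "db-01"),
    ("agent-portal", "db-01"),
    ("reporting", "db-02"),
    ("quotes", "db-01"),  # the same dependency observed twice
]


def shared_servers(dependencies, max_apps):
    """Return {server: sorted application names} for servers used by more than max_apps applications."""
    apps_by_server = defaultdict(set)
    for app, server in dependencies:
        # A set keeps repeated observations from inflating the count.
        apps_by_server[server].add(app)
    return {
        server: sorted(apps)
        for server, apps in apps_by_server.items()
        if len(apps) > max_apps
    }


OVERLOADED = shared_servers(DEPENDENCIES, max_apps=3)
for server, apps in OVERLOADED.items():
    print(f"{server} serves {len(apps)} applications: {', '.join(apps)}")
```

A count like this says nothing about request volume on its own, but it shows which shared servers deserve a closer look before one application’s load starts slowing down the others.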

Ultimately, for IT operations, we need tools that make things clearer and don’t hide system-level data behind technology silos. We need to know how all the systems in our complex applications are related and what the impact of one server is on overall application and transaction performance. For monitoring, we want to know where transactions go, where they get stuck and why.

In addition to helping us stay on top of how changes or upgrades would affect our service levels with mission-critical third-party connectors, this approach has reduced the number of people called into that conference room to two (both in operations). It’s nice to know there won’t be any more repeats of those outages. No more Groundhog Days for us.

Shawn Pearson is the engineering lead at State Auto Insurance Companies.