SOA Case Study: How R.L. Polk Revved Its Data Engine

By Baselinemag  |  Posted 2006-06-12

Kevin Vasconi, chief information officer of R.L. Polk & Co., saw the future. And it made him feel a little sick.

In the fall of 2004, Vasconi was meeting with other top executives of the company, one of the largest providers of marketing data to automobile manufacturers, in the boardroom of its suburban Detroit headquarters—in the heart of the U.S. auto industry.

It was a state-of-the-company gathering to discuss Polk's strategic direction. And the consensus was that its information systems wouldn't be able to support the business into the next decade. "If you have that discussion honestly," Vasconi says, "it will scare the crap out of you."

The Southfield, Mich.-based company's business, at its core, is data aggregation. Polk compiles vehicle registration and sales data from 260 sources. These include motor vehicle departments in the U.S. and Canada, insurance companies, automakers and lending institutions. The company then repackages that data and sells it to dealers, manufacturers and marketing firms—anyone who wants detailed information about car-buying trends, such as the top-selling SUV for a particular ZIP code.

For years, Polk's process of consolidating data ran on IBM mainframes. By the time Vasconi joined the company in 2003, portions of the software were 20 years old. "Some of the people working here are younger than the code," he says.

The mainframe system wasn't broken, per se. But the entire process was engineered around the mainframe's batch-processing model, in which computing tasks are queued and run together to make the most of the machine's resources. Vasconi believed newer technologies could speed up delivery of data to customers—by processing data as soon as Polk received it, instead of in daily or weekly batches—and lower the company's costs by automating tasks that were handled manually.

Vasconi also worried that the old system couldn't keep pace with the proliferation of data. Polk's entire database already comprises more than 1.5 petabytes (1.5 quadrillion bytes), and historical trends indicate it will continue to grow even faster. "We knew we had a capacity issue, and that getting the value out of the data would be a challenge for the company because of the sheer volume," he says.

Customers, meanwhile, have been champing at the bit to get sales data more quickly. Paul C. Taylor, chief economist for the National Automobile Dealers Association, which represents 19,700 car and truck dealers, says Polk's vehicle registration data by state is typically available 30 days after carmakers release their national sales data. That prevents dealers in, say, New Jersey from immediately comparing trends in their area with those nationwide and adjusting inventories accordingly.

"In a perfect world, you'd have the state breakdown when you have the national sales figures," he says. "But if [Polk] could take even a week off the cycle, that would be a vast improvement."

Actually, Polk had tried twice before to move off the mainframe, but those projects ended up being scaled back. "It's the mother of all databases for automotive intelligence," says Joe Walker, president of Polk Global Automotive, the division of the company that sells data to businesses. "It seemed too daunting a task to try to move it."

Company executives took a different tack with a project code-named ReFuel. In late 2004, Polk created a new company, called RLPTechnologies, to build the next data aggregation system. The subsidiary occupies a building 7 miles from Polk's campus in neighboring Farmington, Mich. It has a full-time staff of 30, and at the peak of development last year employed 130 contractors, including consultants from Capgemini.

"We wanted to free up the people who were going to build the next generation of what is, quite honestly, our cash cow," says Vasconi, who is also president of RLPTechnologies.

Walker acknowledges that the expected cost of the project, which ended up exceeding $20 million, caused some trepidation. It was a huge undertaking for the private company, whose annual revenue is estimated to be around $275 million. "Right from the beginning, we were concerned with whether we'd see the ROI [return on investment] on this," he says.

Polk expected the ReFuel project to save money. But only later did Walker and Vasconi confirm that it helped chop Polk's costs for data operations management nearly in half.

A Blank Slate

After Vasconi hired a core team of 10 for the new subsidiary, most from Polk's information-technology staff, his first task was to figure out what the new system would look like.

Polk had three high-level objectives, referred to in shorthand as "50/50/100": The new system needed to be 50% more efficient (in other words, cost half as much to operate); deliver data 50% more quickly; and aim for 100% data accuracy.

Dubbed the Data Factory, the new system performs the same three jobs that the IBM mainframe did. It first has to capture the data, pulling in feeds from the 260 sources. Then it must convert the data into a standard format, using a uniform structure and nomenclature so that, say, a Vehicle Identification Number as reported by the state of Texas is stored in a way that Polk's other applications can read. Finally, the system needs to enhance the data by cross-referencing it with other databases—for example, verifying consumers' names and addresses, or associating financing history with a particular vehicle.
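
To make that three-step flow concrete, here is a minimal Java sketch of a single record moving through capture, standardization and enhancement. The class, the field names and the ZIP+4 stand-in are illustrative assumptions, not Polk's actual schema or code.

```java
// Hypothetical sketch of the capture -> standardize -> enhance flow.
// Field names and source formats are illustrative, not Polk's actual schema.
import java.util.HashMap;
import java.util.Map;

public class RegistrationPipelineSketch {

    // A raw record as it might arrive from one of the 260 sources,
    // each with its own field names and formatting quirks.
    static Map<String, String> captureFromSource() {
        Map<String, String> raw = new HashMap<>();
        raw.put("VEH_ID_NO", "1hgcm82633a004352");   // lower case, source-specific key
        raw.put("OWNER_ZIP", "48034");
        return raw;
    }

    // Standardize: map source-specific keys to a uniform nomenclature
    // so downstream applications can read every feed the same way.
    static Map<String, String> standardize(Map<String, String> raw) {
        Map<String, String> std = new HashMap<>();
        std.put("vin", raw.get("VEH_ID_NO").toUpperCase());
        std.put("zip", raw.get("OWNER_ZIP"));
        return std;
    }

    // Enhance: cross-reference against other data, e.g. append the ZIP+4 digits.
    static Map<String, String> enhance(Map<String, String> std) {
        if (std.get("zip").length() == 5) {
            std.put("zip", std.get("zip") + "-1234"); // stand-in for a real ZIP+4 lookup
        }
        return std;
    }

    public static void main(String[] args) {
        System.out.println(enhance(standardize(captureFromSource())));
    }
}
```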

Vasconi knew the system should have a service-oriented architecture, or SOA, which allows software components in different systems to communicate in a standard way. That's because he wanted the flexibility to add or change pieces without disrupting the whole system. An SOA is also potentially more scalable than a monolithic architecture (meaning it can handle progressively higher processing loads) since larger tasks can be broken into subtasks more easily. In addition, Vasconi wanted to use grid computing, which harnesses multiple machines to work on a common task, as opposed to using high-powered, standalone servers.
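
To see why that flexibility matters, consider a minimal Java sketch of the kind of stable service contract an SOA is built on. The network plumbing is omitted, and the service name and methods are assumptions made for this example, not part of Polk's system; the point is only that callers depend on the contract, so an implementation can be replaced without disrupting the whole.

```java
// A minimal sketch of swapping a service behind a stable contract;
// none of these names come from Polk's system.
public class ServiceSwapSketch {

    // The contract other components depend on. As long as a replacement
    // honors this interface, it can be dropped in without touching callers.
    interface AddressEnrichmentService {
        String appendZipPlus4(String fiveDigitZip);
    }

    // One implementation today...
    static class StubEnricher implements AddressEnrichmentService {
        public String appendZipPlus4(String zip) { return zip + "-0000"; }
    }

    // ...and a different one tomorrow, swapped in without disrupting the rest.
    static class LookupEnricher implements AddressEnrichmentService {
        public String appendZipPlus4(String zip) { return zip + "-7000"; } // stand-in for a real lookup
    }

    public static void main(String[] args) {
        AddressEnrichmentService service = new StubEnricher();
        System.out.println(service.appendZipPlus4("48034"));
        service = new LookupEnricher();                      // the swap callers never notice
        System.out.println(service.appendZipPlus4("48034"));
    }
}
```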

"At the end of the day," he says, "we needed to build something that will last 30 years."

The RLPTechnologies team sketched out the functional pieces of the new system, and then determined those elements that were available as commercial software products and those they would have to develop themselves. "If we could find technology we could buy, we wanted to buy it in order to speed up our time to market," Vasconi says.

The hardware building blocks of Polk's Data Factory are Dell servers with Intel processors, running the Linux operating system. The two- and four-processor servers are configured into separate grids that handle different applications. One grid runs the Oracle 10g database; a second runs JBoss' application server, for hosting custom Java code. A third grid runs Tibco Software's BusinessWorks "messaging bus" software, which acts as the communications broker among other pieces of the system. The Tibco software provides the system's SOA backbone.

The Data Factory incorporates other off-the-shelf packages. Software from Informatica turns incoming data into eXtensible Markup Language (XML) documents, which puts data into a common format. Polk uses software from DataFlux, a unit of business intelligence vendor SAS, to analyze data quality so possible errors can be flagged for investigation.

RLPTechnologies built the rest of the software it needed. Vasconi estimates that about 50% of the system runs on custom Java code—less than he originally expected. "The SOA architecture empowered us to go to the marketplace and find companies that had embraced the SOA approach and the supporting industry standards," he says.

The main function the team needed to write itself had to do with "service orchestration." This software looks at an incoming XML document and determines what actions need to be taken; for example, does the ZIP code in the address need to be appended with the extra ZIP+4 digits? The service orchestration software then submits the relevant portion of data from the document to the appropriate system to handle that task, through the Tibco messaging bus. RLPTechnologies also developed its own data access layer, which assembles all of the updated information and inserts it into the Oracle database repository.
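
A rough illustration of that orchestration step, in plain Java since about half the system runs on custom Java code: the MessageBus interface below is a hypothetical stand-in for the Tibco messaging bus, and the sample document structure is invented for the example, not RLPTechnologies' actual code.

```java
// Illustrative orchestration logic, not RLPTechnologies' actual code.
// MessageBus stands in for the Tibco broker; its method is an assumption.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class OrchestrationSketch {

    interface MessageBus {                       // hypothetical broker abstraction
        void publish(String service, String payload);
    }

    // Inspect an incoming XML document and dispatch only the work it needs.
    static void orchestrate(String xml, MessageBus bus) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        String zip = doc.getElementsByTagName("zip").item(0).getTextContent();

        // A five-digit ZIP still needs the +4 digits appended, so hand that
        // fragment to the address-enrichment service over the bus.
        if (zip.length() == 5) {
            bus.publish("address-enrichment", "<zip>" + zip + "</zip>");
        }
    }

    public static void main(String[] args) throws Exception {
        String sample = "<registration><vin>1HGCM82633A004352</vin><zip>48034</zip></registration>";
        orchestrate(sample, (service, payload) ->
                System.out.println("route to " + service + ": " + payload));
    }
}
```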

Polk began a staged rollout of the Data Factory in December 2005, and completed the deployment in early May. Officially, it's called the Enterprise Information Factory, and Vasconi imagines it could actually become a profit center: Eventually, RLPTechnologies plans to sell data-processing services to other companies.

All told, the system now comprises about 50 servers and processes 6 million XML documents per week. The project, from inception to rollout, took roughly 18 months.

Before Polk switches off its mainframe, though, Vasconi and his team are doing a final series of tests to make sure the output of the new system precisely matches that of the old one. "We have to spend quite a bit of time doing data validation to make sure it's exactly the same, down to the row level in the database," he says.

The Road Test

The new system, according to Vasconi, has delivered on Polk's expectations. It's cheaper to maintain—close to the company's original goal of cutting maintenance costs by 50%, he says—and faster at processing data, although Vasconi couldn't provide specific metrics to back up that claim.

First, he says, the initial acquisition costs for hardware and software (which he wouldn't disclose) were 40% lower than buying a comparable amount of IBM mainframe processing power. Plus, Polk's ongoing maintenance fees—to vendors including Dell, Tibco, Oracle, Informatica and DataFlux—will be less than what it has paid to IBM.

An even bigger area of savings for Polk: The Data Factory has let the company reduce head count in the data operations group by 43%, from 56 to 32, Walker says. Mainly, the reduction in staff was possible because many manual steps in the process have been automated. "We've eliminated scores and scores of manual touches," he says.

Vasconi, expanding on the factory metaphor, compares the new system to a manufacturing assembly line that uses robots to put components together. Humans sit in a glass-lined booth and only intervene when something goes wrong. With the mainframe system, workers were needed on the factory floor to push levers and buttons. Some business processes, Vasconi says, "were just broken." For example, administrators would have to check for vehicle registration data that arrived from the various states before sending it through the system; that's now automated.

Also, with the new system, Polk can catch data-processing errors earlier, reducing the need to rerun an entire job. Using DataFlux's data-quality analysis software, operators can identify anomalies—say, an unusually low number of sales in a particular state, which could indicate an error—before the data moves further down the line. In a batch-processing mainframe environment, "you don't have the ability to stop the batch in mid-process and do a quality check," Vasconi explains. "If it was wrong, you'd have to run it all over again and find out where in the 50 steps along the way the data anomaly occurred."
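
The kind of check involved can be sketched in a few lines of Java. The 3-sigma threshold and the sample figures below are assumptions made for illustration; they are not DataFlux's actual method or Polk's real volumes.

```java
// A minimal sketch of a data-quality check against a historical profile;
// the threshold and figures are illustrative assumptions only.
public class QualityCheckSketch {

    // Flag a feed whose record count falls far outside its historical norm,
    // e.g. an unusually low number of registrations from one state.
    static boolean looksAnomalous(long incomingCount, double historicalMean, double historicalStdDev) {
        double deviation = Math.abs(incomingCount - historicalMean);
        return deviation > 3 * historicalStdDev;   // 3-sigma rule, an assumption for this sketch
    }

    public static void main(String[] args) {
        // Suppose a state normally sends ~40,000 registrations a week (made-up figures)
        // and only 9,500 arrive; the feed is flagged before processing continues.
        System.out.println(looksAnomalous(9_500, 40_000, 4_000));   // prints: true
    }
}
```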

That efficiency has also allowed Polk to cut the time it takes to turn raw data into a product available to customers by more than 50%, Vasconi claims, though he couldn't provide an overall average for the improvement. With the previous system, data sometimes sat idle for days waiting to be grouped into batch-processing jobs. The Data Factory eliminates that waiting period.

"We've taken multi-hour processes down to multi-minute processes," Vasconi says.

As for what Polk would have done differently, Walker says the company probably should have taken more time in the initial planning stage. The accelerated time line—planning took less than six months—most likely increased the price tag for the project, Walker believes, because additional contractors were required to supplement the work of RLPTechnologies' full-time staff. "That's just a gut feel," he notes. "But I think if we'd gone just a little slower, it would have cost less."

Vasconi, though, says speed was of the essence for the project. "Internal initiatives tend not to move with passion and singular focus and inventiveness," he says. "We felt we needed a sense of urgency."


R.L. Polk & Co. Base Case

HEADQUARTERS: 26955 Northwestern Hwy., Southfield, MI 48034
PHONE: (248) 728-7000
BUSINESS: Collects and sells vehicle sales and registration data.
CHIEF INFORMATION OFFICER: Kevin Vasconi
REVENUE, 2005: $275 million (Baseline estimate)
CHALLENGES: Improve efficiency and speed up the process of turning raw data into packaged information products.

BASELINE GOALS:

  • Reduce time to process incoming data by 50% from 2004 to 2006.
  • Cut administrative costs by 50% over the same period.
  • Strive for 100% data accuracy over the same period.

    Inside the Data Factory
    R.L. Polk's new data-processing system uses a service-oriented architecture to coordinate tasks among several discrete clusters of servers.

    • Data is fed from 260 different sources.
    • Software converts the data into standard eXtensible Markup Language documents.
    • Service-orchestration software evaluates each XML document and determines which elements need additional processing (i.e., appending ZIP+4 code).
    • At the same time, data-quality software compares incoming information with a historical profile of what's normal for that type of data; exceptions are investigated.
    • Processed data is inserted into a database repository, Polk's single source of truth; from there, information is extracted into separate data warehouses for customers to access.