By Baselinemag  |  Posted 2006-06-12

The auto data aggregator ditched its mainframe, spending more than $20 million to build a data factory. Was it worth it?

A Blank Slate

After Vasconi hired a core team of 10 for the new subsidiary, most from Polk's information-technology staff, his first task was to figure out what the new system would look like.

Polk had three high-level objectives, referred to in shorthand as "50/50/100": The new system needed to be 50% more efficient (in other words, cost half as much to operate); deliver data 50% more quickly; and aim for 100% data accuracy.

Dubbed the Data Factory, the new system performs the same three jobs that the IBM mainframe did. It first has to capture the data, pulling in feeds from the 260 sources. Then it must convert the data into a standard format, using a uniform structure and nomenclature so that, say, a Vehicle Identification Number as reported by the state of Texas is stored in a way that Polk's other applications can read. Finally, the system needs to enhance the data by cross-referencing it with other databases—for example, verifying consumers' names and addresses, or associating financing history with a particular vehicle.
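The three jobs described above form a simple pipeline. As a rough illustration only (the field layout, class, and method names below are hypothetical, not Polk's actual code), the capture-convert-enhance flow might look like this in Java, the language the system's custom components are written in:

```java
import java.util.HashMap;
import java.util.Map;

public class PipelineSketch {
    // Stage 1: capture — a raw, source-specific record, e.g. a
    // pipe-delimited feed from a state motor-vehicle department.
    static String capture() {
        return "TX|1FTRX18W1XNB12345|SMITH, JOHN|75201";
    }

    // Stage 2: convert — normalize the record into uniform field names
    // so downstream applications can read any source the same way.
    static Map<String, String> convert(String raw) {
        String[] parts = raw.split("\\|");
        Map<String, String> record = new HashMap<>();
        record.put("source", parts[0]);
        record.put("vin", parts[1]);
        record.put("ownerName", parts[2]);
        record.put("zip", parts[3]);
        return record;
    }

    // Stage 3: enhance — cross-reference against other databases.
    // Here a stubbed ZIP+4 lookup stands in for a real reference source.
    static Map<String, String> enhance(Map<String, String> record) {
        record.put("zipPlus4", record.get("zip") + "-1234"); // stubbed lookup
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> result = enhance(convert(capture()));
        System.out.println(result.get("vin"));
        System.out.println(result.get("zipPlus4"));
    }
}
```

The point of the staging is isolation: each step takes a well-defined input and produces a well-defined output, so a new feed format touches only the capture and convert stages.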

Vasconi knew the system should have a service-oriented architecture, or SOA, which allows software components in different systems to communicate in a standard way. That's because he wanted the flexibility to add or change pieces without disrupting the whole system. An SOA is also potentially more scalable than a monolithic architecture (meaning it can handle progressively higher processing loads) since larger tasks can be broken into subtasks more easily. In addition, Vasconi wanted to use grid computing, which harnesses multiple machines to work on a common task, as opposed to using high-powered, standalone servers.
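The flexibility Vasconi wanted falls out of the basic SOA pattern: every function sits behind a small, uniform interface, so one implementation can be swapped for another without touching the callers. A minimal sketch of the idea, with hypothetical names:

```java
public class SoaSketch {
    // The uniform contract every service honors.
    interface DataService {
        String process(String document);
    }

    // One concrete service; it could be replaced or supplemented
    // without any change to the code that calls DataService.
    static class AddressCleanser implements DataService {
        public String process(String document) {
            return document.trim().toUpperCase();
        }
    }

    public static void main(String[] args) {
        DataService service = new AddressCleanser();
        System.out.println(service.process("  123 main st  "));
    }
}
```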

"At the end of the day," he says, "we needed to build something that will last 30 years."

The RLPTechnologies team sketched out the functional pieces of the new system, then determined which elements were available as commercial software products and which they would have to develop themselves. "If we could find technology we could buy, we wanted to buy it in order to speed up our time to market," Vasconi says.

The hardware building blocks of Polk's Data Factory are Dell servers with Intel processors, running the Linux operating system. The two- and four-processor servers are configured into separate grids that handle different applications. One grid runs the Oracle 10g database; a second runs JBoss' application server, for hosting custom Java code. A third grid runs Tibco Software's BusinessWorks "messaging bus" software, which acts as the communications broker among other pieces of the system. The Tibco software provides the system's SOA backbone.

The Data Factory incorporates other off-the-shelf packages. Software from Informatica turns incoming data into Extensible Markup Language (XML) documents, putting the data into a common format. Polk uses software from DataFlux, a unit of business intelligence vendor SAS, to analyze data quality so possible errors can be flagged for investigation.
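To make the conversion step concrete: a delimited source record becomes an XML document with uniform tags that every downstream service can parse. Polk uses Informatica for this; the fragment below is only an illustrative sketch with made-up tag names:

```java
public class XmlConvertSketch {
    // Wrap normalized fields in a uniform XML structure.
    static String toXml(String vin, String state, String zip) {
        return "<vehicleRecord>"
             + "<vin>" + vin + "</vin>"
             + "<sourceState>" + state + "</sourceState>"
             + "<zip>" + zip + "</zip>"
             + "</vehicleRecord>";
    }

    public static void main(String[] args) {
        System.out.println(toXml("1FTRX18W1XNB12345", "TX", "75201"));
    }
}
```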

RLPTechnologies built the rest of the software it needed. Vasconi estimates that about 50% of the system runs on custom Java code—less than he originally expected. "The SOA architecture empowered us to go to the marketplace and find companies that had embraced the SOA approach and the supporting industry standards," he says.

The main function the team needed to write itself had to do with "service orchestration." This software looks at an incoming XML document and determines what actions need to be taken; for example, does the ZIP code in the address need to be appended with the extra ZIP+4 digits? The service orchestration software then submits the relevant portion of data from the document to the appropriate system to handle that task, through the Tibco messaging bus. RLPTechnologies also developed its own data access layer, which assembles all of the updated information and inserts it into the Oracle database repository.
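The orchestration logic the article describes boils down to inspect-decide-dispatch: examine the incoming document, work out which enrichments it still needs, and send each fragment to the matching service. In the real system that dispatch travels over the Tibco messaging bus; the sketch below collapses it to a plain method call, with hypothetical task names:

```java
import java.util.ArrayList;
import java.util.List;

public class OrchestrationSketch {
    // Decide which tasks a record needs based on its contents,
    // such as the ZIP+4 check the article mentions.
    static List<String> plan(String zip) {
        List<String> tasks = new ArrayList<>();
        if (zip.length() == 5) {
            // Five-digit ZIP: route to the ZIP+4 append service.
            tasks.add("APPEND_ZIP_PLUS_4");
        }
        // Every record gets name/address verification.
        tasks.add("VERIFY_NAME_ADDRESS");
        return tasks;
    }

    public static void main(String[] args) {
        System.out.println(plan("75201"));      // needs both tasks
        System.out.println(plan("75201-1234")); // ZIP already complete
    }
}
```

Keeping the "what needs doing" decision separate from the services that do the work is what lets new enrichment steps be added without rewriting the pipeline.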

Polk began a staged rollout of the Data Factory in December 2005, and completed the deployment in early May. Officially, it's called the Enterprise Information Factory, and Vasconi imagines it could actually become a profit center: Eventually, RLPTechnologies plans to sell data-processing services to other companies.

All told, the system now comprises about 50 servers and processes 6 million XML documents per week. The project, from inception to rollout, took roughly 18 months.

Before Polk switches off its mainframe, though, Vasconi and his team are doing a final series of tests to make sure the output of the new system precisely matches that of the old one. "We have to spend quite a bit of time doing data validation to make sure it's exactly the same, down to the row level in the database," he says.
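Row-level validation of this kind is conceptually simple: line up each row the new system emits against the mainframe's output and flag any difference. A minimal sketch, assuming equal-length result sets keyed by VIN (the field names and data are invented):

```java
import java.util.ArrayList;
import java.util.List;

public class ValidationSketch {
    // Returns the keys of rows whose Data Factory output differs
    // from the mainframe's output, compared field by field.
    static List<String> mismatches(List<String[]> oldRows, List<String[]> newRows) {
        List<String> diffs = new ArrayList<>();
        for (int i = 0; i < oldRows.size(); i++) {
            if (!java.util.Arrays.equals(oldRows.get(i), newRows.get(i))) {
                diffs.add(oldRows.get(i)[0]); // report by key, e.g. VIN
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        List<String[]> mainframe = List.of(
            new String[]{"VIN1", "SMITH"}, new String[]{"VIN2", "JONES"});
        List<String[]> dataFactory = List.of(
            new String[]{"VIN1", "SMITH"}, new String[]{"VIN2", "JONSE"});
        System.out.println(mismatches(mainframe, dataFactory)); // [VIN2]
    }
}
```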

