Shopzilla Is Sold on Big Data

By Samuel Greengard

Online retailing has emerged as a high-stakes environment where winners and losers are determined by the speed and efficiency of their transactions. At Shopzilla, a 17-year-old online comparison-shopping service that manages a portfolio of shopping brands (including Bizrate in the United States and the United Kingdom, Spardeingeld in Germany and Prix Moins Sher in France), the ability to manage vast volumes of data and act on it immediately is critical.

“We receive an ongoing feed of inventory from merchants, create offers and enrich them with data that provides classifications and attributes,” says Mik Quinlan, distinguished architect for Shopzilla.

Processing the feeds efficiently is at the heart of the business. Shopzilla processes approximately 60 million offers a day from about 15,000 merchants. However, a few years ago, the company was hamstrung by an existing IT infrastructure that was about 13 years old. “We realized that extending the platform was an extremely difficult proposition,” Quinlan recalls.

Shopzilla required a platform that could provide a high level of extensibility and predictability. “When data is batch-processed through Hadoop or another tool, any error can result in the entire feed failing,” he explains.
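The failure mode Quinlan describes can be sketched in a few lines. The example below is a hypothetical illustration, not Shopzilla's code: the offer fields, the `validate` rule and both function names are assumptions. It contrasts an all-or-nothing batch pass with one that handles each offer on its own, so a single bad record is set aside instead of failing the feed.

```python
# Hypothetical sketch: field names and the validation rule are assumptions.

def validate(offer):
    """Raise ValueError if the offer lacks a positive price."""
    if offer.get("price") is None or offer["price"] <= 0:
        raise ValueError(f"bad price for offer {offer.get('id')}")
    return offer

def process_batch(offers):
    """All-or-nothing: the first bad record aborts the whole feed."""
    return [validate(o) for o in offers]

def process_per_record(offers):
    """Isolate failures: keep good offers, collect bad ones for review."""
    accepted, rejected = [], []
    for offer in offers:
        try:
            accepted.append(validate(offer))
        except ValueError as err:
            rejected.append((offer.get("id"), str(err)))
    return accepted, rejected

feed = [
    {"id": 1, "price": 19.99},
    {"id": 2, "price": None},   # one malformed record
    {"id": 3, "price": 5.00},
]

accepted, rejected = process_per_record(feed)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Calling `process_batch(feed)` on the same input raises `ValueError` and returns nothing, which is the behavior the quote warns about.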

As a result, the company had two primary objectives: decrease latency at the site and build a data store that could ensure high data quality. After examining a number of products, Shopzilla opted for a VoltDB solution that could manage inventory while tackling search and site requirements under a single platform. The firm installed the new system in June 2012.

Shopzilla is now running its retail data repository on a seven-node VoltDB cluster that holds 1.2 terabytes of data. It allows the company to store all merchant offer data and its associated enrichment data centrally and permanently, Quinlan says.

VoltDB’s CSV snapshot capability streamlines data export into Hadoop, where analysts can use Hive to conduct queries. The system makes it possible to use analytics across the entire spectrum of data associated with an offer. A feed of 1 million new offers is ingested into the retail data repository within 15 minutes. Three to five hours later, the data is loaded into Hive tables for querying.
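For a sense of scale, the figures quoted in this article can be turned into per-second rates. The short calculation below is a back-of-the-envelope sketch that uses only the numbers reported here (60 million offers a day, and a 1-million-offer feed ingested within 15 minutes); the variable names are illustrative, and because the feed finishes "within" 15 minutes, its rate is a lower bound.

```python
# Back-of-the-envelope rates from the figures quoted in the article.
SECONDS_PER_DAY = 24 * 60 * 60

daily_offers = 60_000_000      # offers processed per day
feed_offers = 1_000_000        # offers in one feed
feed_window_s = 15 * 60        # ingested within 15 minutes

avg_daily_rate = daily_offers / SECONDS_PER_DAY   # average offers per second
feed_rate = feed_offers / feed_window_s           # minimum offers per second

print(f"Average over a day: {avg_daily_rate:.0f} offers/s")
print(f"During a feed:      at least {feed_rate:.0f} offers/s")
```

That works out to roughly 700 offers per second averaged over a day and at least about 1,100 per second while a feed is loading.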

The system has simplified Shopzilla’s architecture and allowed the company to plug in a number of enrichment services that have improved machine learning algorithms. A seven-node VoltDB Enterprise Edition cluster persists offer data at 30,000 transactions per second while merging enrichment data with those offers.

A six-node VoltDB Community Edition cluster handles delta processing at a rate of 30,000 to 45,000 transactions per second. The company is now upgrading other data repository systems in order to reduce latency.

The end result has been a faster and far more efficient environment—one that provides a foundation for continued growth.

“The infrastructure has greatly simplified the way we approach transactions and the overall business,” Quinlan says. “It has created a more flexible environment that helps us compete more effectively. We are able to integrate data and put it to use in ways that truly change the business.”