Accelerating the Data Warehouse

By David F. Carr  |  Posted 2002-11-01

Need quicker access to your data? Try Massively Parallel Processing (MPP), which breaks up a query so that multiple processors can run it against multiple storage devices.

  • How is it done? In a number of ways. One of the more promising strategies, Massively Parallel Processing (MPP), involves breaking up a query so that multiple processors can run it against multiple storage devices, then reassembling the responses to produce an answer. An alternative is SMP (Symmetric Multiprocessing), in which multiple processors juggle tasks using caching techniques and a common pool of memory.


  • What's the benefit? Quicker access to information in a world of huge databases. With MPP, adding processors improves access time at a nearly linear rate: a 32-processor machine can query more than 3 terabytes of data in about the same time that a single processor could query 100 gigabytes. While the scalability and performance of SMP systems keep improving, MPP architectures still dominate very large data warehousing applications.
  • Who invented it? In the data warehousing market, NCR's Teradata unit has been MPP's biggest proponent. The largest Teradata warehouses run on the company's own WorldMark server hardware and its own version of Unix, using a database management system designed specifically for the MPP environment.

Because supporting MPP requires tweaks to the database management system, operating system and server hardware, many vendors have preferred to push the limits of what they can achieve with SMP. However, IBM is supporting MPP with its Regatta servers (RS/6000 SP) and in its DB2 Extended Enterprise Edition.
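The near-linear scaling figure quoted earlier (32 processors handling more than 3 terabytes in the time one processor handles 100 gigabytes) can be checked with simple arithmetic. The sketch below assumes perfectly linear scaling and decimal units; the function name is hypothetical and chosen only for illustration.

```python
# Back-of-the-envelope check of the near-linear scaling figure.
# Assumption (not from the article): scaling is perfectly linear, so each
# processor scans its own equal share of the data at the same rate.

GB_PER_TB = 1000  # decimal units, assumed for simplicity


def data_scanned_in_same_time(processors: int, gb_per_processor: float) -> float:
    """Gigabytes that N processors scan in the time one processor scans its share."""
    return processors * gb_per_processor


ideal_gb = data_scanned_in_same_time(32, 100)
print(f"Ideal capacity: {ideal_gb / GB_PER_TB:.1f} TB")
```

Under that assumption the ideal figure is 3.2 terabytes, which is consistent with the "more than 3 terabytes" cited for a 32-processor machine.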

In September, startup Netezza introduced its Netezza Performance Server, a refrigerator-sized "data warehouse appliance" aimed at providing MPP performance at a lower price by using open-source software like Linux and the Postgres database. Netezza uses specialized query-processing chips installed on each hard disk. Each of these "snippet processors" scans the disk it is responsible for, finds data matching the query parameters, and sends the results back to the database responsible for assembling the answer. This cuts down on the transmission of irrelevant data within the server cabinet, minimizing performance bottlenecks and lessening the workload on the central database.
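The idea behind the "snippet processors" described above, filtering next to each disk so that only matching rows travel to the central database, can be sketched roughly as follows. This is an illustration of the general technique, not Netezza's implementation; the `Row` type, the function names and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Row:
    customer: str
    amount: float


def snippet_scan(disk_rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Stand-in for a per-disk processor: scan one disk, return only matching rows."""
    return [row for row in disk_rows if predicate(row)]


def central_assemble(disks: List[List[Row]], predicate: Callable[[Row], bool]) -> List[Row]:
    """Stand-in for the central database: collect the pre-filtered results."""
    results: List[Row] = []
    for disk_rows in disks:
        results.extend(snippet_scan(disk_rows, predicate))
    return results


# Hypothetical sample data: two disks, five rows in total.
disks = [
    [Row("A", 12000.0), Row("B", 50.0)],
    [Row("C", 700.0), Row("D", 25000.0), Row("E", 10.0)],
]
matches = central_assemble(disks, lambda r: r.amount > 10_000)
rows_on_disk = sum(len(d) for d in disks)
print(f"{len(matches)} of {rows_on_disk} rows were sent to the central database")
```

Because the predicate runs where the data lives, the rows that fail the test never leave their disk, which is the traffic reduction the paragraph above describes.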

  • Who's using it? Teradata has a blue-chip customer base, including Wal-Mart in retailing and Whirlpool in manufacturing. Lloyd's of London is using IBM's MPP solution to analyze claims and other insurance data.

Netezza has captured a handful of early customers. Vibrant Solutions, which works with companies such as Nextel on call-data analysis, says it will be able to support much more data, with faster query response, by employing Netezza's technology. "It's very similar to a lot of the other massively parallel architectures that have been around for a while, but they brought the price into a reasonable window," says Vibrant CTO Rick Mahuson.

  • What are the drawbacks? MPP systems tend to cost more, both in purchase price and in ongoing administration. Teradata says the long-term cost of ownership is favorable, however, particularly when scattered data marts (departmental data warehouses) are consolidated into a central, company-wide data warehouse.

Netezza is trying to change the price equation (at $2.5 million, even its 18-terabyte server is a fraction of the cost of comparable MPP systems) and claims its appliance will run with minimal administration. "Netezza's product shows great promise," says Giga Information Group analyst Philip Russom, but he suspects many enterprise customers will be scared of entrusting multi-terabyte applications to open-source technology.

REFERENCE: ONE QUERY, MANY PATHS

Even with today's superfast machines, it can take days to generate a report from a multi-terabyte warehouse. Here's how using Massively Parallel Processing can speed up the task.

  • 1. A 3-terabyte data warehouse receives a request for a list of all customer purchases that were greater than $10,000.
  • 2. It passes on the query to 10 "nodes." Each node has its own processors and also controls one or more storage devices. Each storage device, in turn, contains a subset of the 3-terabyte warehouse. In this example, each node queries one storage device that holds 100,000 records.
  • 3. Each device sends back a list. The data warehouse consolidates the responses into a single result that took hours instead of days to build.
    Background Reading
    Not convinced you need to process in parallel? Click here to download a PDF (Portable Document Format) version of Sun Microsystems' white paper on the advantages of a symmetric architecture.
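The three steps of the "One Query, Many Paths" example (send the query to 10 nodes, let each node scan its own 100,000 records, then consolidate the lists) can be sketched as a scatter-gather routine. This is a minimal illustration under stated assumptions: threads stand in for independent nodes, the purchase data is randomly generated, and every function name is hypothetical. Because Python threads share one interpreter, the sketch shows the structure of the approach rather than a real speedup.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

NODES = 10                  # number of nodes, from the example
RECORDS_PER_NODE = 100_000  # records held by each node's storage device
THRESHOLD = 10_000          # purchases greater than $10,000

Purchase = Tuple[str, float]  # (customer id, purchase amount)


def build_partition(node_id: int) -> List[Purchase]:
    """Generate made-up purchase records for one node (illustrative data only)."""
    rng = random.Random(node_id)  # seeded so each run produces the same data
    return [(f"cust-{node_id}-{i}", rng.uniform(0, 20_000))
            for i in range(RECORDS_PER_NODE)]


def node_query(partition: List[Purchase]) -> List[Purchase]:
    """Step 2: a node scans only its own subset of the warehouse."""
    return [(cust, amount) for cust, amount in partition if amount > THRESHOLD]


def warehouse_query(partitions: List[List[Purchase]]) -> List[Purchase]:
    """Steps 1 and 3: pass the query to every node, then consolidate the lists."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_lists = list(pool.map(node_query, partitions))
    return [row for part in partial_lists for row in part]


partitions = [build_partition(n) for n in range(NODES)]
result = warehouse_query(partitions)
print(f"{len(result)} purchases over ${THRESHOLD:,} found across {NODES} nodes")
```

Each node touches only its own partition, so no node ever scans the full warehouse; the coordinating step simply concatenates the partial lists in node order.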

David F. Carr is the Technology Editor for Baseline Magazine, a Ziff Davis publication focused on information technology and its management, with an emphasis on measurable, bottom-line results. He wrote two of Baseline's cover stories focused on the role of technology in disaster recovery, one focused on the response to the tsunami in Indonesia and another on the City of New Orleans after Hurricane Katrina.

David has been the author or co-author of many Baseline Case Dissections on corporate technology successes and failures (such as the role of Kmart's inept supply chain implementation in its decline versus Wal-Mart or the successful use of technology to create new market opportunities for office furniture maker Herman Miller). He has also written about the FAA's halting attempts to modernize air traffic control, and in 2003 he traveled to Sierra Leone and Liberia to report on the role of technology in United Nations peacekeeping.

David joined Baseline prior to the launch of the magazine in 2001 and helped define popular elements of the magazine such as Gotcha!, which offers cautionary tales about technology pitfalls and how to avoid them.
