Accelerating the Data Warehouse

  • How is it done? In a number of ways. One of the more promising strategies, Massively Parallel Processing (MPP), involves breaking up a query so that multiple processors can run it against multiple storage devices, then reassembling the responses to produce an answer. An alternative is Symmetric Multiprocessing (SMP), in which multiple processors juggle tasks using caching techniques and a common pool of memory.


  • What’s the benefit? Quicker access to information in a world of huge databases. With MPP, adding processors improves access time at a nearly linear rate: a 32-processor machine can query more than 3 terabytes of data in about the same time that a single processor could query 100 gigabytes. While the scalability and performance of SMP systems keep improving, MPP architectures still dominate very large data warehousing applications.
  • Who invented it? In the data warehousing market, NCR’s Teradata unit has been MPP’s biggest proponent. The largest Teradata warehouses run on the company’s own WorldMark server hardware and its own version of Unix, using a database management system designed specifically for the MPP environment.

Because supporting MPP requires tweaks to the database management system, operating system and server hardware, many vendors have preferred to push the limits of what they can achieve with SMP. However, IBM is supporting MPP with its Regatta servers (RS/6000 SP) and in its DB2 Extended Enterprise Edition.
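The near-linear scaling figure quoted under “What’s the benefit?” can be checked with simple arithmetic. The sketch below is a back-of-the-envelope model, not a benchmark: it assumes every processor scans an equal slice at the same rate and ignores the cost of merging results. The function name and the decimal unit conversion are illustrative choices.

```python
# Back-of-the-envelope model of near-linear MPP scaling.
# Assumptions (not from any vendor): each processor scans its own
# equal-sized slice at the same rate, and merge overhead is ignored.

GB_PER_TB = 1000  # decimal units, assumed for simplicity


def data_scanned_in_same_time(processors: int, gb_per_processor: float) -> float:
    """Total gigabytes scanned when every processor handles one slice."""
    return processors * gb_per_processor


# 32 processors, each covering the 100 GB a single processor could query.
total_gb = data_scanned_in_same_time(32, 100)
print(f"{total_gb / GB_PER_TB:.1f} TB")  # 3.2 TB
```

Under these idealized assumptions, 32 processors cover 3.2 terabytes in the time one processor covers 100 gigabytes, which is consistent with the “more than 3 terabytes” figure.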

In September, startup Netezza introduced its Netezza Performance Server, a refrigerator-sized “data warehouse appliance” aimed at providing MPP performance at a lower price by using open-source software like Linux and the Postgres database. Netezza uses specialized query-processing chips installed on each hard disk. Each of these “snippet processors” scans the disk it is responsible for, finds data matching the query parameters, and sends the results back to the database responsible for assembling the answer. This cuts down on the transmission of irrelevant data within the server cabinet, minimizing performance bottlenecks and lessening the workload on the central database.
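The idea of filtering at the disk can be sketched in a few lines. This is a toy illustration of the principle only, not Netezza’s design: the `Disk` class, the `query` function and the sample data are all hypothetical, and ordinary Python objects stand in for the per-disk chips.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class Disk:
    """One disk paired with its own (hypothetical) filtering processor."""
    rows: List[Row]

    def scan(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # The filter runs next to the data; only matching rows leave the disk.
        return [row for row in self.rows if predicate(row)]


def query(disks: List[Disk], predicate: Callable[[Row], bool]) -> List[Row]:
    """Central database: collect the pre-filtered rows from every disk."""
    results: List[Row] = []
    for disk in disks:
        results.extend(disk.scan(predicate))
    return results


# Illustrative data: 4 disks, 1,000 rows each, amounts 0..999.
disks = [
    Disk([{"disk": float(d), "amount": float(a)} for a in range(1000)])
    for d in range(4)
]
matches = query(disks, lambda row: row["amount"] >= 990)

total_rows = sum(len(d.rows) for d in disks)
print(f"rows stored: {total_rows}, rows sent to central database: {len(matches)}")
```

In this toy run only 40 of the 4,000 stored rows travel to the central database, which is the effect the article describes: irrelevant data never crosses the internal network.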

  • Who’s using it? Teradata has a blue-chip customer base, including Wal-Mart in retailing and Whirlpool in manufacturing. Lloyd’s of London is using IBM’s MPP solution to analyze claims and other insurance data.

Netezza has captured a handful of early customers. Vibrant Solutions, which works with companies such as Nextel on call-data analysis, says it will be able to support much more data, with faster query response, by employing Netezza’s technology. “It’s very similar to a lot of the other massively parallel architectures that have been around for a while, but they brought the price into a reasonable window,” says Vibrant CTO Rick Mahuson.

  • What are the drawbacks? MPP systems tend to cost more, both in purchase price and in ongoing administration. Teradata says the long-term cost of ownership is favorable, however, particularly when scattered data marts (departmental data warehouses) are consolidated into a central, company-wide data warehouse.

Netezza is trying to change the price equation (at $2.5 million, even its 18-terabyte server is a fraction of the cost of comparable MPP systems) and claims its appliance will run with minimal administration. “Netezza’s product shows great promise,” says Giga Information Group analyst Philip Russom, but he suspects many enterprise customers will be scared of entrusting multi-terabyte applications to open-source technology.

REFERENCE: ONE QUERY, MANY PATHS

Even with today’s superfast machines, it can take days to generate a report from a multi-terabyte warehouse. Here’s how using Massively Parallel Processing can speed up the task.

  1. A 3-terabyte data warehouse receives a request for a list of all customer purchases that were greater than $10,000.
  2. It passes on the query to 10 “nodes.” Each node has its own processors and also controls one or more storage devices. Each storage device, in turn, contains a subset of the 3-terabyte warehouse. In this example, each node queries one storage device that holds 100,000 records.
  3. Each device sends back a list. The data warehouse consolidates the responses into a single result that took hours instead of days to build.
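The three steps above can be sketched as a small scatter-gather program. This is a minimal illustration under stated assumptions, not how any vendor’s system works: the purchase data is synthetic, the function names are hypothetical, and Python threads stand in for nodes, so the sketch shows the shape of the work rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Purchase = Tuple[int, int]  # (customer_id, amount_in_dollars)

NODES = 10                  # from step 2
RECORDS_PER_NODE = 100_000  # from step 2
THRESHOLD = 10_000          # from step 1


def load_partition(node: int) -> List[Purchase]:
    """Build the slice of the warehouse held by one node (synthetic data)."""
    start = node * RECORDS_PER_NODE
    return [(i, (i * 37) % 20_000) for i in range(start, start + RECORDS_PER_NODE)]


def node_query(partition: List[Purchase]) -> List[Purchase]:
    """Step 2: each node scans only its own storage device."""
    return [p for p in partition if p[1] > THRESHOLD]


def warehouse_query(partitions: List[List[Purchase]]) -> List[Purchase]:
    """Steps 1 and 3: fan the query out, then consolidate the partial lists."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_lists = pool.map(node_query, partitions)
        return [p for partial in partial_lists for p in partial]


partitions = [load_partition(n) for n in range(NODES)]
big_purchases = warehouse_query(partitions)
print(len(big_purchases), "purchases over $10,000")
```

Each node touches only its own 100,000 records, so no single processor ever scans the full warehouse; the coordinator’s only job is to merge ten short lists.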
Background Reading

Not convinced you need to process in parallel? Sun Microsystems has published a white paper on the advantages of a symmetric architecture.