How Google Works: Reducing Complexity

By David F. Carr Print this article Print

For all the razzle-dazzle surrounding Google, the company must still work through common business problems such as reporting revenue and tracking projects. But it sometimes addresses those needs in unconventional—yet highly efficient—ways. Other

Reducing Complexity

Google's distributed storage architecture for data is combined with distributed execution of the software that parses and analyzes it.

To keep software developers from spending too much time on the arcana of distributed programming, Google invented MapReduce as a way of simplifying the process. According to a 2004 Google Labs paper, without MapReduce the company found "the issues of how to parallelize the computation, distribute the data and handle failures" tended to obscure the simplest computation "with large amounts of complex code."

Much as the GFS offers an interface for storage across multiple servers, MapReduce takes programming instructions and assigns them to be executed in parallel on many computers. It breaks calculations into two parts—a first stage, which produces a set of intermediate results, and a second, which computes a final answer. The concept comes from functional programming languages such as Lisp (Google's version is implemented in C++, with interfaces to Java and Python).

A typical first-week training assignment for a new programmer hired by Google is to write a software routine that uses MapReduce to count all occurrences of words in a set of Web documents. In that case, the "map" would involve tallying all occurrences of each word on each page—not bothering to add them at this stage, just ticking off records for each one like hash marks on a sheet of scratch paper. The programmer would then write a reduce function to do the math—in this case, taking the scratch paper data, the intermediate results, and producing a count for the number of times each word occurs on each page.

One example, from a Google developer presentation, shows how the phrase "to be or not to be" would move through this process.

While this might seem trivial, it's the kind of calculation Google performs ad infinitum. More important, the general technique can be applied to many statistical analysis problems. In principle, it could be applied to other data mining problems that might exist within your company, such as searching for recurring categories of complaints in warranty claims against your products.

But it's particularly key for Google, which invests heavily in a statistical style of computing, not just for search but for solving other problems like automatic translation between human languages such as English and Arabic (using common patterns drawn from existing translations of words and phrases to divine the rules for producing new translations).

MapReduce includes its own middleware—server software that automatically breaks computing jobs apart and puts them back together. This is similar to the way a Java programmer relies on the Java Virtual Machine to handle memory management, in contrast with languages like C++ that make the programmer responsible for manually allocating and releasing computer memory. In the case of MapReduce, the programmer is freed from defining how a computation will be divided among the servers in a Google cluster.

Typically, programs incorporating MapReduce load large quantities of data, which are then broken up into pieces of 16 to 64 megabytes. The MapReduce run-time system creates duplicate copies of each map or reduce function, picks idle worker machines to perform them and tracks the results.

Worker machines load their assigned piece of input data, process it into a structure of key-value pairs, and notify the master when the mapped data is ready to be sorted and passed to a reduce function. In this way, the map and reduce functions alternate chewing through the data until all of it has been processed. An answer is then returned to the client application.

If something goes wrong along the way, and a worker fails to return the results of its map or reduce calculation, the master reassigns it to another computer.

As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes.

Also in this Feature:

This article was originally published on 2006-07-06
David F. Carr David F. Carr is the Technology Editor for Baseline Magazine, a Ziff Davis publication focused on information technology and its management, with an emphasis on measurable, bottom-line results. He wrote two of Baseline's cover stories focused on the role of technology in disaster recovery, one focused on the response to the tsunami in Indonesia and another on the City of New Orleans after Hurricane Katrina.David has been the author or co-author of many Baseline Case Dissections on corporate technology successes and failures (such as the role of Kmart's inept supply chain implementation in its decline versus Wal-Mart or the successful use of technology to create new market opportunities for office furniture maker Herman Miller). He has also written about the FAA's halting attempts to modernize air traffic control, and in 2003 he traveled to Sierra Leone and Liberia to report on the role of technology in United Nations peacekeeping.David joined Baseline prior to the launch of the magazine in 2001 and helped define popular elements of the magazine such as Gotcha!, which offers cautionary tales about technology pitfalls and how to avoid them.
eWeek eWeek

Have the latest technology news and resources emailed to you everyday.