Infrastructure - Baseline
Home arrow Infrastructure arrow Page 6 - How Google Works













Renew Your Subscription

Infrastructure



How Google Works



By David F. Carr

  Table of Contents:
  1. How Google Works
  2. How Google Works: What CIOs Can Learn
  3. How Google Works: The Start of the Story
  4. How Google Works: A Role Model
  5. How Google Works: The Google File System
  6. How Google Works: Reducing Complexity
  7. How Google Works: Google's Secrets
  8. How Google Works: Exotic but Not Unique
  9. How Google Works: Google, the Enterprise
  10. How Google Works: Google Base Case

For all the razzle-dazzle surrounding Google, the company must still work through common business problems such as reporting revenue and tracking projects. But it sometimes addresses those needs in unconventional—yet highly efficient—ways. Other

Rate This Article:
Add This Article To:

How Google Works - How Google Works: Reducing Complexity


( Page 6 of 10 )

Reducing Complexity

Google's distributed storage architecture for data is combined with distributed execution of the software that parses and analyzes it.

To keep software developers from spending too much time on the arcana of distributed programming, Google invented MapReduce as a way of simplifying the process. According to a 2004 Google Labs paper, without MapReduce the company found "the issues of how to parallelize the computation, distribute the data and handle failures" tended to obscure the simplest computation "with large amounts of complex code."

Much as the GFS offers an interface for storage across multiple servers, MapReduce takes programming instructions and assigns them to be executed in parallel on many computers. It breaks calculations into two parts—a first stage, which produces a set of intermediate results, and a second, which computes a final answer. The concept comes from functional programming languages such as Lisp (Google's version is implemented in C++, with interfaces to Java and Python).

A typical first-week training assignment for a new programmer hired by Google is to write a software routine that uses MapReduce to count all occurrences of words in a set of Web documents. In that case, the "map" would involve tallying all occurrences of each word on each page—not bothering to add them at this stage, just ticking off records for each one like hash marks on a sheet of scratch paper. The programmer would then write a reduce function to do the math—in this case, taking the scratch paper data, the intermediate results, and producing a count for the number of times each word occurs on each page.

One example, from a Google developer presentation, shows how the phrase "to be or not to be" would move through this process.

While this might seem trivial, it's the kind of calculation Google performs ad infinitum. More important, the general technique can be applied to many statistical analysis problems. In principle, it could be applied to other data mining problems that might exist within your company, such as searching for recurring categories of complaints in warranty claims against your products.

But it's particularly key for Google, which invests heavily in a statistical style of computing, not just for search but for solving other problems like automatic translation between human languages such as English and Arabic (using common patterns drawn from existing translations of words and phrases to divine the rules for producing new translations).

MapReduce includes its own middleware—server software that automatically breaks computing jobs apart and puts them back together. This is similar to the way a Java programmer relies on the Java Virtual Machine to handle memory management, in contrast with languages like C++ that make the programmer responsible for manually allocating and releasing computer memory. In the case of MapReduce, the programmer is freed from defining how a computation will be divided among the servers in a Google cluster.

Typically, programs incorporating MapReduce load large quantities of data, which are then broken up into pieces of 16 to 64 megabytes. The MapReduce run-time system creates duplicate copies of each map or reduce function, picks idle worker machines to perform them and tracks the results.

Worker machines load their assigned piece of input data, process it into a structure of key-value pairs, and notify the master when the mapped data is ready to be sorted and passed to a reduce function. In this way, the map and reduce functions alternate chewing through the data until all of it has been processed. An answer is then returned to the client application.

If something goes wrong along the way, and a worker fails to return the results of its map or reduce calculation, the master reassigns it to another computer.

As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes.

Also in this Feature:

  • The People Who Power Google
  • Google Courts the Enterprise
  • How Google Manages a Global Workforce


  •  
     
    >>> More Infrastructure Articles          >>> More By David F. Carr
     


    Sponsored Links
  • Get up and running in as quickly as 30 days with BI. Learn how today.

  • FREE Securing Smartphones & Tablets for Dummies Book from Sophos
  • 5 New Technologies That Will Change Enterprise ITAdvertisement
  • Build an IT Infrastructure That Delivers the Future
     
  •  
    FEATURED SPONSORED ARTICLES

    FEATURED SPONSORED VIDEOS

     



    LATEST STORIES


     

     


    Advertisement
    rss graphic
           Baseline Newsletters