How Google Works: A Role ModelBy David F. Carr | Posted 2006-07-06 Email Print
For all the razzle-dazzle surrounding Google, the company must still work through common business problems such as reporting revenue and tracking projects. But it sometimes addresses those needs in unconventional—yet highly efficient—ways. Other
A Role Model
For Google, the point of having lots of servers is to get them working in parallel. For any given search, thousands of computers work simultaneously to deliver an answer.
For CIOs to understand why parallel processing makes so much sense for search, consider how long it would take for one person to find all the occurrences of the phrase "service-oriented architecture" in the latest issue of Baseline. Now assign the same task to 100 people, giving each of them one page to scan, and have them report their results back to a team leader who will compile the results. You ought to get your answer close to 100 times faster.
On the other hand, if one of the workers on this project turns out to be dyslexic, or suddenly drops dead, completion of the overall search job could be delayed or return bad results. So, when designing a parallel computing system—particularly one like Google's, built from commodity components rather than top-shelf hardware—it's critical to build in error correction and fault tolerance (the ability of a system to recover from failures).
Some basic concepts Google would use to scale up had already been defined by 1998, when Page and Brin published their paper on "The Anatomy of a Large-Scale Hypertextual Web Search Engine." In its incarnation as a Stanford research project, Google had already collected more than 25 million Web pages. Each document was scanned to produce one index of the words it contained and their frequency, and that, in turn, was transformed into an "inverted index" relating keywords to the pages in which they occurred.
But Google did more than just analyze occurrences of keywords. Page had come up with a formula he dubbed PageRank to improve the relevance of search results by analyzing the link structure of the Web. His idea was that, like frequent citations of an academic paper, links to a Web site were clues to its importance. If a site with a lot of incoming links—like the Yahoo home page—linked to a particular page, that link would also carry more weight. Google would eventually have to modify and augment Page's original formula to battle search engine spam tactics. Still, PageRank helped establish Google's reputation for delivering better search results.
Previous search engines had not analyzed links in such a systematic way. According to The Google Story, a book by Washington Post writer David Vise and Mark Malseed, Page had noticed that early search engine king AltaVista listed the number of links associated with a page in its search results but didn't seem to be making any other use of them. Page saw untapped potential. In addition to recording which pages linked to which other pages, he designed Google to analyze the anchor text—the text of a link on a Web page, typically displayed to a Web site visitor with underlining and color—as an additional clue to the content of the target page.
That way, it could sometimes divine that a particular page was relevant even though the search keywords didn't appear on that page. For example, a search for "cardiopulmonary resuscitation" might find a page that exclusively uses the abbreviation "CPR" because one of the Web pages linking to it spells out the term in the link's anchor text.
All this analysis requires a lot of storage. Even back at Stanford, the Web document repository alone was up to 148 gigabytes, reduced to 54 gigabytes through file compression, and the total storage required, including the indexes and link database, was about 109 gigabytes. That may not sound like much today, when you can buy a Dell laptop with a 120-gigabyte hard drive, but in the late 1990s commodity PC hard drives maxed out at about 10 gigabytes.
To cope with these demands, Page and Brin developed a virtual file system that treated the hard drives on multiple computers as one big pool of storage. They called it BigFiles. Rather than save a file to a particular computer, they would save it to BigFiles, which in turn would locate an available chunk of disk space on one of the computers in the server cluster and give the file to that computer to store, while keeping track of which files were stored on which computer. This was the start of what essentially became a distributed computing software infrastructure that runs on top of Linux.
The overall design of Google's software infrastructure reflects the principles of grid computing, with its emphasis on using many cheap computers working in parallel to achieve supercomputer-like results.
Some definitions of what makes a grid a grid would exclude Google as having too much of a homogenous, centrally controlled infrastructure, compared with a grid that teams up computers running multiple operating systems and owned by different organizations.
But the company follows many of the principles of grid
architecture, such as the goal of minimizing network bottlenecks by minimizing data transmission from one computer to another.
Instead, whenever possible, processing instructions are sent to the server or servers containing the relevant data, and only the results are returned over the network.
"They've gotten to the point where they're distributing what really should be considered single computers across continents," says Colin Wheeler, an information systems consultant with experience in corporate grid computing projects, including a project for the Royal Bank of Canada.
Having studied Google's publications, he notes that the company has had to tinker with computer science fundamentals in a way that few enterprises would: "I mean, who writes their own file system these days?"
Also in this Feature: