Google Outreach
By David F. Carr | Posted 2007-08-20
Initiative for distributed data processing may give the No. 2 search service some of the "geek cred" it's been lacking.
Of course, Google has its own outreach programs, which it uses to cement its reputation as a technology leader and boost recruiting. One reason Google gives for not releasing source code for things like its distributed file system is that the software is too deeply intertwined with other components of its operational systems and can't be easily separated out. That's the story Google representatives repeated at OSCON when an audience member asked why they hadn't open-sourced more of the software they use to manage their data centers.
For his part, Cutting downplays the idea that Yahoo is using the Hadoop project as some sort of competitive weapon. "While we do compete, we don't compete over this stuff," he says.
Google hasn't explicitly encouraged the development of Hadoop or provided clues about how to produce a MapReduce system, Cutting says. On the other hand, he notes, there's actually a Google-sponsored course at the University of Washington that uses Hadoop to give students hands-on experience using MapReduce for distributed computing.
The Hadoop style of distributed computing is mostly good for batch-oriented analysis of unstructured data (such as compiling an index of the Web), rather than interactive applications (providing an immediate answer to a query), Cutting says. However, yet another Lucene project spin-off, called HBase, is trying to replicate Google's BigTable. BigTable is another technology Google has described publicly: a database management system for structured and semi-structured information that builds on the Google File System and MapReduce and uses that structure to provide more interactive answers to queries across very large data sets.
But the HBase effort is still in an early, pre-alpha stage of development, and most of what you can do with Hadoop is inherently batch-oriented: it is aimed at shrinking the time required to perform an analysis from days to hours, not at delivering an answer within seconds. "If you need to make changes and see them in real time, Hadoop is not the answer," Cutting says. "What it's really great for is just munging through tons of data."
Hadoop includes a version of the distributed file system originally created for Nutch along with a version of MapReduce, both written in Java. As in Google's MapReduce, the Hadoop version automates the division of computer-intensive tasks into smaller sub-tasks that are assigned to individual computers in a cluster. Each computation is broken into two stages: the "Map," which produces an intermediate set of results, and the "Reduce" function, usually devoted to sorting and aggregating data to produce a final result. In the context of compiling a search index, the Map phase would involve thousands of computers each assigned the task of indexing a subset of the Web crawl data, and the Reduce phase would be sorting and merging those results into the final index.
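The two-stage idea described above can be sketched in a few lines of Python. This is only an illustration that runs in a single process, not Hadoop's Java API or Google's implementation; the function names (map_document, reduce_postings, build_index) and the sample pages are hypothetical, chosen to mirror the search-index example.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map: emit one (word, doc_id) pair for every word in one document."""
    return [(word.lower(), doc_id) for word in text.split()]


def reduce_postings(word, doc_ids):
    """Reduce: merge the doc ids seen for one word into a sorted, unique list."""
    return word, sorted(set(doc_ids))


def build_index(documents):
    """Run the Map stage, group by key, then run the Reduce stage."""
    # Map stage: on a cluster, each subset of documents would be
    # handed to a different machine. Here we simply loop.
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_document(doc_id, text))

    # Group the intermediate pairs so each reduce call sees one word.
    grouped = defaultdict(list)
    for word, doc_id in intermediate:
        grouped[word].append(doc_id)

    # Reduce stage: sort and merge the results into the final index.
    return dict(reduce_postings(w, ids) for w, ids in sorted(grouped.items()))


if __name__ == "__main__":
    # Hypothetical stand-in for Web crawl data.
    pages = {
        "page1": "Hadoop runs MapReduce",
        "page2": "MapReduce runs on clusters",
    }
    for word, postings in build_index(pages).items():
        print(word, postings)
```

In a real cluster the framework, not the programmer, would handle splitting the input, moving intermediate results between machines and recovering from failures; the programmer supplies only the two functions.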
"It's a very simple programming metaphor, where people can catch on quickly and start using it," Cutting says. "Your first program can be something that can be expressed on a page and does something useful." Those with a Unix background may find the MapReduce technique to be a little bit like using "pipes," a technique for chaining programs together by having the output from one program fed as input into the next.
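To make the pipes comparison concrete, here is a rough Python sketch in which each stage consumes the output of the one before it, loosely in the spirit of a shell pipeline such as "cat input | mapper | sort | reducer". The stage names (mapper, reducer, pipeline) are hypothetical and the word-count task is just a convenient example; nothing here is taken from Hadoop itself.

```python
from itertools import groupby


def mapper(lines):
    """First stage: turn each input line into (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Last stage: add up the counts for each run of identical words."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def pipeline(lines):
    """Chain the stages: the output of each one feeds the next."""
    # The sort in the middle plays the role of "sort" in the shell
    # pipeline, bringing identical words next to each other.
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for word, count in pipeline(["a b a", "b c"]):
        print(word, count)
```

The analogy has limits: a Unix pipe connects programs on one machine, while a MapReduce system spreads each stage across many computers.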