Google Outreach

By David F. Carr  |  Posted 2007-08-20

Initiative for distributed data processing may give the No. 2 search service some of the "geek cred" it's been lacking.

Of course, Google has its own outreach programs, which it uses to cement its reputation as a technology leader and boost recruiting. One reason Google gives for not releasing source code for things like its distributed file system is that the software is too deeply intertwined with other components of its operational systems and can't be easily separated out. That's the story Google representatives repeated at Oscon when an audience member asked why they hadn't open-sourced more of the software they use to manage their data centers.

For his part, Cutting downplays the idea that Yahoo is using the Hadoop project as some sort of competitive weapon. "While we do compete, we don't compete over this stuff," he says.

Google hasn't explicitly encouraged the development of Hadoop or provided clues about how to produce a MapReduce system, Cutting says. On the other hand, he notes, there's actually a Google-sponsored course at the University of Washington that uses Hadoop to give students hands-on experience using MapReduce for distributed computing.

The Hadoop style of distributed computing is mostly good for batch-oriented analysis of unstructured data (such as compiling an index of the Web), rather than interactive applications (providing an immediate answer to a query), Cutting says. However, yet another Lucene project spin-off called HBase is in the process of trying to replicate Google's BigTable. BigTable is another technology Google has described publicly, a database management system for structured and semi-structured information that builds on the Google File System and MapReduce and uses that structure to provide more interactive answers to queries across very large data sets.

But the HBase effort is still in an early, pre-alpha stage of development, and most of what you can do with Hadoop is inherently batch-oriented: it is aimed at shrinking the time required to perform an analysis from days to hours, not at delivering an answer within seconds. "If you need to make changes and see them in real time, Hadoop is not the answer," Cutting says. "What it's really great for is just munging through tons of data."

Hadoop includes a version of the distributed file system originally created for Nutch along with a version of MapReduce, both written in Java. As in Google's MapReduce, the Hadoop version automates the division of compute-intensive tasks into smaller sub-tasks that are assigned to individual computers in a cluster. Each computation is broken into two stages: the "Map" function, which produces an intermediate set of results, and the "Reduce" function, usually devoted to sorting and aggregating that data to produce a final result. In the context of compiling a search index, the Map phase would involve thousands of computers, each assigned the task of indexing a subset of the Web crawl data, and the Reduce phase would sort and merge those results into the final index.
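To make the two stages concrete, here is a minimal single-machine sketch in Python of the search-index example. It is not Hadoop's Java API and not Google's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`) and the sample pages are hypothetical, and the grouping step that a real framework performs across a cluster is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs):
    """Group intermediate pairs by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce: merge all postings for one word into a sorted list."""
    return word, sorted(doc_ids)


def build_index(documents):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    intermediate = []
    for doc_id, text in documents.items():
        # In a cluster, each document subset would go to a different machine.
        intermediate.extend(map_phase(doc_id, text))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, ids) for word, ids in sorted(grouped.items()))


if __name__ == "__main__":
    # Hypothetical crawl data, standing in for a subset of the Web.
    crawl = {
        "page1": "hadoop runs mapreduce",
        "page2": "google described mapreduce",
    }
    for word, pages in build_index(crawl).items():
        print(word, pages)
```

Because each call to `map_phase` looks at only one document and each call to `reduce_phase` looks at only one word, the work can be spread over many machines without the pieces needing to coordinate, which is the property the framework exploits.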

"It's a very simple programming metaphor, where people can catch on quickly and start using it," Cutting says. "Your first program can be something that can be expressed on a page and does something useful." Those with a Unix background may find the MapReduce technique to be a little bit like using "pipes," a technique for chaining programs together by having the output from one program fed as input into the next.
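The pipes comparison can be sketched as a chain of Python generators, each consuming the previous stage's output, much as a shell pipeline might chain a mapper, `sort`, and a reducer. This is only an illustration of the analogy under that assumption, not Hadoop code; `mapper`, `sorter`, `reducer`, and `word_count` are hypothetical names, and word counting is used simply because it fits on a page.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """First stage: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def sorter(pairs):
    """Middle stage: sort by word so equal keys sit next to each other."""
    return sorted(pairs, key=itemgetter(0))


def reducer(sorted_pairs):
    """Last stage: sum the counts for each run of identical words."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Chain the stages, roughly like: mapper | sort | reducer."""
    return list(reducer(sorter(mapper(lines))))


if __name__ == "__main__":
    for word, total in word_count(["to be or not", "to be"]):
        print(word, total)
```

The sort in the middle matters: `groupby` only merges adjacent items, so the reducer relies on the earlier stage having brought identical keys together.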

David F. Carr is the Technology Editor for Baseline Magazine, a Ziff Davis publication focused on information technology and its management, with an emphasis on measurable, bottom-line results. He wrote two of Baseline's cover stories focused on the role of technology in disaster recovery, one focused on the response to the tsunami in Indonesia and another on the City of New Orleans after Hurricane Katrina. David has been the author or co-author of many Baseline Case Dissections on corporate technology successes and failures (such as the role of Kmart's inept supply chain implementation in its decline versus Wal-Mart or the successful use of technology to create new market opportunities for office furniture maker Herman Miller). He has also written about the FAA's halting attempts to modernize air traffic control, and in 2003 he traveled to Sierra Leone and Liberia to report on the role of technology in United Nations peacekeeping. David joined Baseline prior to the launch of the magazine in 2001 and helped define popular elements of the magazine such as Gotcha!, which offers cautionary tales about technology pitfalls and how to avoid them.
