Want to get your hands on an open
source version of some core Google technologies?
Just ask Yahoo.
Yahoo has emerged as a major sponsor of
Hadoop, an open source project that aims to
replicate Google's techniques for storing and
processing large amounts of data distributed
across hundreds or thousands of commodity
PCs (see "How Google Works," at www.baselinemag.com).
Last year, Yahoo hired Hadoop
project founder Doug Cutting, and at July's
Oscon open source conference Cutting and
Eric Baldeschwieler, Yahoo director of grid
computing, detailed how they're applying the
technology.
Cutting, formerly of Excite and Xerox PARC, has
founded or cofounded a series of projects related to
the creation of an open source platform for search under
the banner of the Apache Software Foundation. His
work on Lucene (a Java software library for Web indexing
and search) and Nutch (a search engine application that
builds on Lucene) led to Hadoop, which started as a
Nutch subproject aimed at efficiently spreading the
workload for compiling a search index across multiple
computers.
The basic functionality that Hadoop replicates is part of what has let
Google manage the massive data processing challenges associated
with indexing Web content, and do so economically.
Google has not released source code for its Google File
System or the associated distributed computing environment,
known as MapReduce. But it has published academic papers
on the computer science behind both—presumably knowing
that competitors and open source programmers would likely
create their own implementations.
At a panel discussion with O'Reilly Media CEO Tim
O'Reilly at Oscon, Cutting declined to speculate on Yahoo's
motives for backing the project. But O'Reilly called it an
example of open source being "the natural ally of the number
two player" in a market and a way to level the playing field.
Hadoop includes a version of the distributed file system
created for Nutch along with a version of MapReduce, both
written in Java. Like Google's MapReduce, the Hadoop version
automates the division of compute-intensive tasks into
subtasks assigned to individual computers in a cluster. Each
computation is divided into two stages: "Map," which produces
an intermediate set of results, and "Reduce," usually
devoted to sorting and aggregating data to produce a final
result. In the context of compiling a search index, Map
involves thousands of computers each assigned
indexing of a subset of the Web crawl data, and
Reduce sorts and merges the results.
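
The two stages described above can be sketched in a few lines of plain Java. This is only an illustration that runs on a single machine and does not use Hadoop's actual API; the class name MapReduceSketch, its method names and the sample pages are all hypothetical. Map turns each crawled page into intermediate (word, page) pairs, and Reduce sorts and merges those pairs into a small search index.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical single-machine sketch of the Map and Reduce stages.
// It does not use Hadoop's API; all names here are illustrative.
public class MapReduceSketch {

    // Map stage: called once per crawled page, emits intermediate (word, pageId) pairs.
    static List<Map.Entry<String, String>> map(String pageId, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, pageId));
            }
        }
        return pairs;
    }

    // Reduce stage: sorts and merges the intermediate pairs into
    // an index of word -> sorted set of pages containing that word.
    static SortedMap<String, SortedSet<String>> reduce(List<Map.Entry<String, String>> pairs) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            index.computeIfAbsent(pair.getKey(), k -> new TreeSet<>()).add(pair.getValue());
        }
        return index;
    }

    // Driver: in a real cluster each map call could run on a different computer;
    // here the calls simply run one after another.
    static SortedMap<String, SortedSet<String>> buildIndex(Map<String, String> pages) {
        List<Map.Entry<String, String>> intermediate = new ArrayList<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            intermediate.addAll(map(page.getKey(), page.getValue()));
        }
        return reduce(intermediate);
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("page1", "open source search");
        pages.put("page2", "open grid computing");
        System.out.println(buildIndex(pages));
    }
}
```

Run on the two sample pages, the sketch prints an index in which "open" points to both pages and every other word points to one. Because each map call depends only on its own page, the work can be split across as many machines as there are subsets of the crawl.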
"It's a very simple programming metaphor,
where people can catch on quickly and start
using it," Cutting says. It's a method of feeding
the output of one program into the next, much
like Unix pipes.
So far, at least, this style of distributed computing
is not as central to Yahoo's day-to-day
operations as it is said to be at Google. For
example, Hadoop has not been integrated into
the process for indexing the Web crawl data
that feeds the Yahoo search engine. However, Yahoo is
analyzing that same Web crawl data and other log files with
Hadoop for other purposes, such as market research and
product planning.
Where Hadoop comes into play is for ad-hoc analysis of data,
answering questions that weren't necessarily anticipated when
the data gathering system was designed. Instead of looking for
keywords and links, for example, a market researcher might
comb through the Web crawl data to see how many sites
include a Flickr "badge"—the snippet of code used to display
thumbnails of images posted to the photo-sharing service.
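
An ad-hoc question like the badge count fits the same two-stage pattern. The sketch below is a hypothetical, single-machine illustration in plain Java, not Yahoo's code and not Hadoop's API. The marker string in BADGE_MARKER is an assumed placeholder for whatever text identifies the badge snippet, and the class name, method names and URLs are invented for the example. Map emits a page's host name when the page contains the marker, and Reduce merges the hosts and counts each site once.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeSet;

// Hypothetical sketch: count how many sites in a crawl embed a badge snippet.
public class BadgeCount {

    // Assumed placeholder; the real badge snippet may be identified differently.
    static final String BADGE_MARKER = "flickr.com/badge_code";

    // Map stage: emit the page's host name if its HTML contains the marker.
    static Optional<String> map(String url, String html) {
        if (html.contains(BADGE_MARKER)) {
            // getHost() returns null for URLs without a host, so use ofNullable.
            return Optional.ofNullable(URI.create(url).getHost());
        }
        return Optional.empty();
    }

    // Reduce stage: merge the emitted hosts, collapse duplicates, count the sites.
    static int reduce(List<String> hosts) {
        return new TreeSet<>(hosts).size();
    }

    // Driver: runs the map calls one after another, then a single reduce.
    static int countSitesWithBadge(Map<String, String> crawl) {
        List<String> hosts = new ArrayList<>();
        for (Map.Entry<String, String> page : crawl.entrySet()) {
            map(page.getKey(), page.getValue()).ifPresent(hosts::add);
        }
        return reduce(hosts);
    }

    public static void main(String[] args) {
        Map<String, String> crawl = new LinkedHashMap<>();
        crawl.put("http://a.example/1", "<script src=\"http://www.flickr.com/badge_code\"></script>");
        crawl.put("http://a.example/2", "<script src=\"http://www.flickr.com/badge_code\"></script>");
        crawl.put("http://b.example/", "<p>no badge here</p>");
        System.out.println(countSitesWithBadge(crawl));
    }
}
```

The point of the design is that the question can change without touching the crawl: swapping the marker string, or the test inside map, is enough to ask something new of the same data.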
Yahoo has tested the system with as many as 2,000 computers
working in tandem. Overall, Yahoo has about 10,000
computers running Hadoop; the largest cluster in production
use is 1,600 machines.
Hadoop does nothing to enhance Yahoo's reputation as a
technology innovator, because this project is based on replicating
techniques pioneered at Google. But that's beside the
point, according to Cutting. "What open source tends to be
most useful for is giving us commodity systems as opposed to
special sauce systems," he says.
And this is one commodity that wouldn't be available
without his work and the work of others at Yahoo and
elsewhere.
Please send questions and comments on this article to editors@baselinemag.com.