
Yahoo Tries Harder Going for Geek Cred

By Baselinemag  |  Posted 2007-10-02

The No. 2 search engine company clones Google's distributed computing technology as open source.

Want to get your hands on an open source version of some core Google technologies? Just ask Yahoo.

Yahoo has emerged as a major sponsor of Hadoop, an open source project that aims to replicate Google's techniques for storing and processing large amounts of data distributed across hundreds or thousands of commodity PCs (see "How Google Works," at www.baselinemag.com). Last year, Yahoo hired Hadoop project founder Doug Cutting, and at July's Oscon open source conference Cutting and Eric Baldeschwieler, Yahoo director of grid computing, detailed how they're applying the technology.

Cutting, formerly of Excite and Xerox PARC, has founded or co-founded a series of Apache Software Foundation projects aimed at building an open source platform for search. His work on Lucene (a Java library for text indexing and search) and Nutch (a search engine application built on Lucene) led to Hadoop, which began as a Nutch subproject aimed at efficiently spreading the work of compiling a search index across multiple computers.

The techniques Hadoop replicates are part of what has let Google manage the massive data processing challenges associated with indexing Web content, and to do so economically. Google has not released source code for its Google File System or the associated distributed computing environment, known as MapReduce. But it has published academic papers on the computer science behind both, presumably knowing that competitors and open source programmers would likely create their own implementations.

At a panel discussion with O'Reilly Media CEO Tim O'Reilly at Oscon, Cutting declined to speculate on Yahoo's motives for backing the project. But O'Reilly called it an example of open source being "the natural ally of the number two player" in a market and a way to level the playing field.

Hadoop includes a version of the distributed file system created for Nutch along with a version of MapReduce, both written in Java. Like Google's MapReduce, the Hadoop version automates the division of compute-intensive tasks into subtasks assigned to individual computers in a cluster. Each computation is divided into two stages: "Map," which produces an intermediate set of results, and "Reduce," usually devoted to sorting and aggregating that data into a final result. In the context of compiling a search index, Map assigns each of thousands of computers a subset of the Web crawl data to index, and Reduce sorts and merges the results.
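The two-stage model can be illustrated with a tiny word-count sketch. This is plain Python for readability, not Hadoop's actual Java API: in a real cluster the framework would run the map tasks and the shuffle step across many machines, while here everything runs in one process.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (key, value) pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's grouped values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the cat sat", "the dog sat"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

The programmer supplies only the map and reduce functions; partitioning the input, grouping intermediate keys and rerunning failed subtasks are the framework's job, which is what makes the model practical on thousands of unreliable commodity machines.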

"It's a very simple programming metaphor, where people can catch on quickly and start using it," Cutting says. It's a method of feeding the output of one program into the next, much like a Unix pipeline.

So far, at least, this style of distributed computing is not as central to Yahoo's day-to-day operations as it is said to be at Google. For example, Hadoop has not been integrated into the process for indexing the Web crawl data that feeds the Yahoo search engine. However, Yahoo is analyzing that same Web crawl data and other log files with Hadoop for other purposes, such as market research and product planning.

Where Hadoop comes into play is for ad-hoc analysis of data—answering questions that weren't necessarily anticipated when the data gathering system was designed. Instead of looking for keywords and links, for example, a market researcher might comb through the Web crawl data to see how many sites include a Flickr "badge"—the snippet of code used to display thumbnails of images posted to the photo-sharing service.
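An ad-hoc question like the badge count fits the same map/reduce shape: map scans each crawled page and emits a count when it finds the badge, and reduce sums the counts. The sketch below is illustrative Python, and the `"flickr_badge"` marker string is a made-up stand-in, not Flickr's real embed markup.

```python
def map_badge(pages):
    """Map: emit a count for each crawled page whose HTML embeds a badge.
    The marker string is a hypothetical stand-in for the real snippet."""
    for url, html in pages:
        if "flickr_badge" in html:
            yield ("flickr_badge", 1)

def reduce_badge(pairs):
    """Reduce: sum the per-page counts into one total."""
    return sum(value for _, value in pairs)

crawl = [
    ("a.example", "<div class='flickr_badge'>...</div>"),
    ("b.example", "<p>no badge here</p>"),
    ("c.example", "<div class='flickr_badge'>...</div>"),
]
badge_count = reduce_badge(map_badge(crawl))
print(badge_count)  # 2
```

The point of the example is that nobody had to anticipate this question when the crawl was stored; the researcher writes a new map function against the raw data and the cluster does the rest.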

Yahoo has tested the system with as many as 2,000 computers working in tandem. Overall, Yahoo has about 10,000 computers running Hadoop; the largest cluster in production use is 1,600 machines.

Hadoop does nothing to enhance Yahoo's reputation as a technology innovator, because this project is based on replicating techniques pioneered at Google. But that's beside the point, according to Cutting. "What open source tends to be most useful for is giving us commodity systems as opposed to special sauce systems," he says.

And this is one commodity that wouldn't be available without his work and the work of others at Yahoo and elsewhere.

Please send questions and comments on this article to editors@baselinemag.com.
