Yahoo Tries Harder Going for Geek Cred

Want to get your hands on an open source version of some core Google technologies? Just ask Yahoo.

Yahoo has emerged as a major sponsor of Hadoop, an open source project that aims to replicate Google’s techniques for storing and processing large amounts of data distributed across hundreds or thousands of commodity PCs (see "How Google Works," at www.baselinemag.com). Last year, Yahoo hired Hadoop project founder Doug Cutting, and at July’s Oscon open source conference Cutting and Eric Baldeschwieler, Yahoo director of grid computing, detailed how they’re applying the technology.

Cutting, formerly of Excite and Xerox PARC, has founded or cofounded a series of projects related to creation of an open source platform for search under the banner of the Apache Software Foundation. His work on Lucene (a Java software library for Web index and search) and Nutch (a search engine application that builds on Lucene) led to Hadoop, which started as a Nutch subproject aimed at efficiently spreading the workload for compiling a search index across multiple computers.

Hadoop’s basic functionality is part of what has let Google manage the massive data processing challenges associated with indexing Web content, and to do it economically. Google has not released source code for its Google File System or the associated distributed computing environment, known as MapReduce. But it has published academic papers on the computer science behind both, presumably knowing that competitors and open source programmers would likely create their own implementations.

At a panel discussion with O’Reilly Media CEO Tim O’Reilly at Oscon, Cutting declined to speculate on Yahoo’s motives for backing the project. But O’Reilly called it an example of open source being "the natural ally of the number two player" in a market and a way to level the playing field.

Hadoop includes a version of the distributed file system created for Nutch along with a version of MapReduce, both written in Java. Like Google’s MapReduce, the Hadoop version automates the division of computer-intensive tasks into subtasks assigned to individual computers in a cluster. Each computation is divided into two stages: "Map," which produces an intermediate set of results, and "Reduce," usually devoted to sorting and aggregating data to produce a final result. In the context of compiling a search index, Map involves thousands of computers each assigned indexing of a subset of the Web crawl data, and Reduce sorts and merges the results.
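To make the two stages concrete, here is a minimal single-machine sketch in plain Python of how a Map step and a Reduce step could build a small search index. It does not use Hadoop or its Java API; the function names (map_phase, shuffle, reduce_phase, build_index) and the two sample pages are hypothetical, chosen only for illustration. In a real cluster, each map call would run on a different machine and the framework would handle the grouping.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word on a page."""
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Group intermediate pairs by word, standing in for the framework's grouping step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, doc_ids):
    """Reduce: sort and merge the page ids collected for one word."""
    return word, sorted(set(doc_ids))


def build_index(crawl):
    """Run Map over every page, then Reduce over every word."""
    intermediate = []
    for doc_id, text in crawl.items():  # on a cluster, each page subset goes to a different machine
        intermediate.extend(map_phase(doc_id, text))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, ids) for word, ids in sorted(grouped.items()))


# Hypothetical crawl data: page id -> page text
crawl = {
    "page1": "open source search",
    "page2": "open source grid computing",
}
index = build_index(crawl)
print(index["open"])    # pages containing "open"
print(index["search"])  # pages containing "search"
```

The point of the split is that map_phase needs to see only one page at a time and reduce_phase only one word at a time, which is what lets the work be spread across many machines.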

"It’s a very simple programming metaphor,where people can catch on quickly and startusing it," Cutting says. It’s a method of feedingthe output of one program into the next, muchlike the Unix pipes utility.

So far, at least, this style of distributed computing is not as central to Yahoo’s day-to-day operations as it is said to be at Google. For example, Hadoop has not been integrated into the process for indexing the Web crawl data that feeds the Yahoo search engine. However, Yahoo is analyzing that same Web crawl data and other log files with Hadoop for other purposes, such as market research and product planning.

Where Hadoop comes into play is for ad-hoc analysis of data: answering questions that weren’t necessarily anticipated when the data gathering system was designed. Instead of looking for keywords and links, for example, a market researcher might comb through the Web crawl data to see how many sites include a Flickr "badge," the snippet of code used to display thumbnails of images posted to the photo-sharing service.
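An ad-hoc question like the badge count fits the same Map and Reduce shape. The sketch below is plain Python, not Hadoop code, and rests on stated assumptions: the marker string "flickr_badge" is a hypothetical stand-in for whatever text actually identifies a badge, and the sample URLs and pages are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical marker: a placeholder for whatever text identifies a badge snippet.
BADGE_MARKER = "flickr_badge"


def map_page(url, html):
    """Map: emit (site, 1) when a crawled page contains the marker."""
    if BADGE_MARKER in html:
        return [(urlparse(url).netloc, 1)]
    return []


def reduce_sites(pairs):
    """Reduce: add up the matching pages for each site."""
    counts = Counter()
    for site, n in pairs:
        counts[site] += n
    return counts


# Invented sample crawl: (url, page html)
pages = [
    ("http://a.example/home", '<div id="flickr_badge_wrapper"></div>'),
    ("http://a.example/about", '<div id="flickr_badge_wrapper"></div>'),
    ("http://b.example/", "<p>no photos here</p>"),
    ("http://c.example/", '<div id="flickr_badge_wrapper"></div>'),
]

intermediate = []
for url, html in pages:  # on a cluster, pages would be split across machines
    intermediate.extend(map_page(url, html))

counts = reduce_sites(intermediate)
sites_with_badge = len(counts)
print(sites_with_badge)
```

Because the question lives entirely in map_page, a researcher could swap in a different test without touching the rest of the pipeline.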

Yahoo has tested the system with as many as 2,000 computers working in tandem. Overall, Yahoo has about 10,000 computers running Hadoop; the largest cluster in production use is 1,600 machines.

Hadoop does nothing to enhance Yahoo’s reputation as a technology innovator, because this project is based on replicating techniques pioneered at Google. But that’s beside the point, according to Cutting. "What open source tends to be most useful for is giving us commodity systems as opposed to special sauce systems," he says.

And this is one commodity that wouldn’t be available without his work and the work of others at Yahoo and elsewhere.
