Yahoo Challenge to Google Has Roots in Open Source

If you want to get your hands on an open source version of some of Google’s core technologies, maybe you should ask Yahoo.

Yahoo has emerged as a major sponsor of Hadoop, an open source project that aims to replicate Google’s techniques for storing and processing large amounts of data distributed across hundreds or thousands of commodity PCs (see Baseline’s report: How Google Works). Last year, Hadoop project founder Doug Cutting became a Yahoo employee, and at July’s Oscon open source conference he and Yahoo’s director of grid computing, Eric Baldeschwieler, detailed how they are applying the technology.

Cutting, formerly of Excite and Xerox PARC, has founded or co-founded a series of projects related to creating an open source platform for search under the banner of the Apache Software Foundation. His work on Lucene (a Java software library for Web indexing and search) and Nutch (a search engine application that builds on Lucene) led to Hadoop, which started as a Nutch sub-project aimed at efficiently spreading the workload for compiling a search index across multiple computers. Since he doesn’t work in a Yahoo office, Cutting says his employment is really more like being paid a salary to work full-time on his Apache projects and help Yahoo work efficiently with the open source community. On the other hand, he does work with Yahoo to help the company get the most out of the technology.

The basic technique Hadoop uses is part of what has allowed Google to manage the massive data processing challenges associated with indexing the Web—and do it economically. Google has not released source code for its Google File System or the associated distributed computing environment, known as MapReduce. But what Google has done is publish academic papers on the computer science behind both—presumably knowing full well that competitors and open source programmers would be likely to create their own implementations.

In addition to giving a presentation on Hadoop at Oscon, Cutting participated in a panel discussion on new system programming and architecture techniques moderated by O’Reilly Media CEO Tim O’Reilly. While Cutting declined to speculate on Yahoo’s motives for backing the project, O’Reilly called it an example of open source being “the natural ally of the number two player” in a market and a way of leveling the playing field.

In a follow-up blog post, O’Reilly wrote that Yahoo evidently wanted to make this a “coming out party” showcasing its backing of the project. “In fact, I even had a call from David Filo to make sure I knew that the support is coming from the top,” he wrote. (While his co-founder Jerry Yang is better known as the public face of Yahoo, Filo is the geekier of the two and has always played a strong behind-the-scenes role in the company’s technology decisions.) O’Reilly thinks Yahoo is trying to give itself “geek cred” by reaching out to the open source community with projects like Hadoop and its Yahoo Hack Day events.