How Hadoop Works

By David F. Carr  |  Posted 2007-08-20 Print this article Print

Initiative for distributed data processing may give the No. 2 search service some of the "geek cred" it's been lacking.

The Hadoop runtime environment takes into account the fact that when computing jobs are spread across hundreds or thousands of relatively cheap computers, some of those computers are likely to fail in mid-task. So one of the main things Hadoop tries to automate is the process for detecting and correcting for those failures.

A master server within the grid of computers tracks the handoffs of tasks from one computer to another and reassigns tasks, if necessary, when any one of those computers locks up or fails. The same task can also be assigned to multiple computers, with the one that finishes first contributing to the final result (while the computations produced by the laggards get thrown away). This technique turns out to be a good match for massive data analysis challenges like producing an index of the entire Web.

So far, at least, this style of distributed computing is not as central to Yahoo's day-to-day operations as it is said to be at Google. For example, Hadoop has not been integrated into the process for indexing the Web crawl data that feeds the Yahoo search engine—although "that would be the idea" in the long run, Cutting says.

However, Yahoo is analyzing that same Web crawl data and other log files with Hadoop for other purposes, such as market research and product planning.

Where Hadoop comes into play is for ad-hoc analysis of data—answering a question that wasn't necessarily anticipated when the data gathering system was designed. For example, instead of looking for keywords and links, a market researcher might want to comb through the Web crawl data to see how many sites include a Flickr "badge"—the snippet of code used to display thumbnails of recent images posted to the photo sharing service.

From its first experiments with 20-node clusters, Yahoo has tested the system with as many as 2,000 computers working in tandem. Overall, Yahoo has about 10,000 computers running Hadoop, and the largest cluster in production use is 1,600 machines.

"We're confident at this point that we can get fairly linear scaling to several thousand nodes," Baldeschwieler says. "We ran about 10,000 jobs last week. Now, a good number of those come from a small group of people who run a job every minute. But we do have several hundred users."

Although Yahoo had previously created its own systems for distributing work across a grid of computers for specific applications, Hadoop has given Yahoo a generally useful framework for this type of computing, Baldeschwieler says. And while there is nothing simple about running these large grids, Hadoop helps simplify some of the hardest problems.

By itself, Hadoop does nothing to enhance Yahoo's reputation as a technology innovator, since by definition this project is focused on replicating techniques pioneered at Google. But Cutting says that's beside the point. "What open source tends to be most useful for is giving us commodity systems, as opposed to special sauce systems," he says. "And besides, I'm sure we're doing it differently."

David F. Carr David F. Carr is the Technology Editor for Baseline Magazine, a Ziff Davis publication focused on information technology and its management, with an emphasis on measurable, bottom-line results. He wrote two of Baseline's cover stories focused on the role of technology in disaster recovery, one focused on the response to the tsunami in Indonesia and another on the City of New Orleans after Hurricane Katrina.David has been the author or co-author of many Baseline Case Dissections on corporate technology successes and failures (such as the role of Kmart's inept supply chain implementation in its decline versus Wal-Mart or the successful use of technology to create new market opportunities for office furniture maker Herman Miller). He has also written about the FAA's halting attempts to modernize air traffic control, and in 2003 he traveled to Sierra Leone and Liberia to report on the role of technology in United Nations peacekeeping.David joined Baseline prior to the launch of the magazine in 2001 and helped define popular elements of the magazine such as Gotcha!, which offers cautionary tales about technology pitfalls and how to avoid them.

Submit a Comment

Loading Comments...
eWeek eWeek

Have the latest technology news and resources emailed to you everyday.