How Hadoop WorksBy David F. Carr | Posted 2007-08-20 Email Print
Re-Thinking HR: What Every CIO Needs to Know About Tomorrow's Workforce
Initiative for distributed data processing may give the No. 2 search service some of the "geek cred" it's been lacking.
The Hadoop runtime environment takes into account the fact that when computing jobs are spread across hundreds or thousands of relatively cheap computers, some of those computers are likely to fail in mid-task. So one of the main things Hadoop tries to automate is the process for detecting and correcting for those failures.
A master server within the grid of computers tracks the handoffs of tasks from one computer to another and reassigns tasks, if necessary, when any one of those computers locks up or fails. The same task can also be assigned to multiple computers, with the one that finishes first contributing to the final result (while the computations produced by the laggards get thrown away). This technique turns out to be a good match for massive data analysis challenges like producing an index of the entire Web.
So far, at least, this style of distributed computing is not as central to Yahoo's day-to-day operations as it is said to be at Google. For example, Hadoop has not been integrated into the process for indexing the Web crawl data that feeds the Yahoo search engine—although "that would be the idea" in the long run, Cutting says.
However, Yahoo is analyzing that same Web crawl data and other log files with Hadoop for other purposes, such as market research and product planning.
Where Hadoop comes into play is for ad-hoc analysis of data—answering a question that wasn't necessarily anticipated when the data gathering system was designed. For example, instead of looking for keywords and links, a market researcher might want to comb through the Web crawl data to see how many sites include a Flickr "badge"—the snippet of code used to display thumbnails of recent images posted to the photo sharing service.
From its first experiments with 20-node clusters, Yahoo has tested the system with as many as 2,000 computers working in tandem. Overall, Yahoo has about 10,000 computers running Hadoop, and the largest cluster in production use is 1,600 machines.
"We're confident at this point that we can get fairly linear scaling to several thousand nodes," Baldeschwieler says. "We ran about 10,000 jobs last week. Now, a good number of those come from a small group of people who run a job every minute. But we do have several hundred users."
Although Yahoo had previously created its own systems for distributing work across a grid of computers for specific applications, Hadoop has given Yahoo a generally useful framework for this type of computing, Baldeschwieler says. And while there is nothing simple about running these large grids, Hadoop helps simplify some of the hardest problems.
By itself, Hadoop does nothing to enhance Yahoo's reputation as a technology innovator, since by definition this project is focused on replicating techniques pioneered at Google. But Cutting says that's beside the point. "What open source tends to be most useful for is giving us commodity systems, as opposed to special sauce systems," he says. "And besides, I'm sure we're doing it differently."