Why Hadoop Has Google Fans (and Rivals) Excited

By David F. Carr on 2009-04-01



Based on distributed computing technologies Google has publicly disclosed, Hadoop provides an open source implementation for other companies with very large data analysis challenges, including Yahoo! and Facebook. The free software, named for a toy elephant, now runs on some of the largest sites on the Web.

Hadoop has attracted the attention of major companies faced with Internet-scale data analysis challenges, including Amazon A9.com, AOL, Facebook, Fox Interactive Media, IBM, New York Times, Veoh, and Yahoo!

Hadoop is a solution for analyzing large unstructured or semi-structured data sets (indexing the web, identifying spam email), typically in batch mode (indexing the web to feed a search engine, not executing the live queries).

You can use Hadoop for ad hoc analysis of large data sets, or to extract, transform, and load data into a traditional data warehouse. It can also be applied to machine learning, computer modeling, and scientific computing.

Google's powerful analytic software for tasks such as indexing the web runs on a distributed system of cheap computers, each of which completes some small part of the task.

Academic papers on "The Google File System" (2003) and "MapReduce: Simplified Data Processing on Large Clusters" (2004) revealed enough details to allow for the creation of an open source implementation.

Doug Cutting, a veteran of search technology research and development for Excite, Apple, and Xerox PARC, created Hadoop as a spin-off of the Apache Nutch and Lucene open source search technology projects.

In 2006, Yahoo hired Cutting and became a major sponsor of the Hadoop Project. Yahoo has since incorporated Hadoop into the process of producing the Yahoo search index.

Hadoop includes an implementation of MapReduce, with a Job Tracker/Task Tracker system for submitting MapReduce jobs, executing them on nodes of the computing cluster, and restarting any that fail.
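To make the restart behavior concrete, here is a minimal single-process sketch of a tracker that runs tasks and retries any that fail. It is not Hadoop's scheduler: the names `run_job` and `flaky` and the `max_attempts` parameter are hypothetical, chosen only for illustration.

```python
def run_job(tasks, max_attempts=3):
    """Run every task, retrying a failed task up to max_attempts times.

    `tasks` maps a task id to a zero-argument callable. Returns a dict of
    results; raises RuntimeError if any task uses up all of its attempts.
    (Hypothetical helper, not part of Hadoop.)
    """
    results = {}
    for task_id, task in tasks.items():
        last_error = None
        for _attempt in range(max_attempts):
            try:
                results[task_id] = task()
                break  # task succeeded, move on to the next one
            except Exception as exc:  # any failure triggers a retry
                last_error = exc
        else:
            # The loop finished without a successful run.
            raise RuntimeError(
                f"task {task_id} failed after {max_attempts} attempts"
            ) from last_error
    return results


def flaky(fail_times, value):
    """Build a task that fails `fail_times` times, then returns `value`."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise IOError("simulated node failure")
        return value

    return task


if __name__ == "__main__":
    job = {"map-0": flaky(0, "a"), "map-1": flaky(2, "b")}
    print(run_job(job))
```

In this sketch the second task fails twice and succeeds on its third attempt, so the job as a whole still completes. A real cluster would typically reschedule the failed task on a different node rather than retry it in place.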

Data is distributed over many computers in a distributed file system cluster. Map programs on each computer analyze their own subset of the data and return intermediate results as key-value pairs. The Reduce step sorts and aggregates those intermediate results, then returns a final result.
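The map and reduce steps described above can be sketched in a few lines of plain Python. This is a single-machine illustration of the idea, using word counting as the example problem; it does not use Hadoop's API, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are assumptions made for this sketch.

```python
from collections import defaultdict


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Reduce step: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then sort, group, and reduce the results."""
    intermediate = []
    for chunk in chunks:
        # On a cluster, each chunk would be mapped on the node that stores it.
        intermediate.extend(map_words(chunk))
    return dict(reduce_counts(key, values) for key, values in shuffle(intermediate))


if __name__ == "__main__":
    chunks = ["the elephant", "the toy elephant", "the web"]
    print(word_count(chunks))
    # {'elephant': 2, 'the': 3, 'toy': 1, 'web': 1}
```

Each chunk stands in for the subset of data held by one computer. Because the map function looks at only its own chunk, the map calls are independent and could run in parallel; only the grouping and reduce steps need to see results from more than one chunk.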

Hadoop provides support for distributed file systems, including Hadoop's own Hadoop Distributed File System (HDFS), which is essentially a clone of the Google File System, supporting petabytes of storage across many cheap computers.
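One way to picture how a single large file ends up spread across many cheap computers is to cut it into fixed-size blocks and store each block on more than one machine. The sketch below is a simplified illustration under that assumption: the `place_blocks` function, its round-robin placement, and the sizes used are hypothetical and are not Hadoop's actual placement policy.

```python
def place_blocks(file_size, block_size, nodes, replicas=2):
    """Assign each block of a file to `replicas` distinct nodes, round-robin.

    Returns a dict mapping block index to the list of nodes holding a copy.
    (Hypothetical helper for illustration only.)
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not 1 <= replicas <= len(nodes):
        raise ValueError("replicas must be between 1 and the number of nodes")

    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Start at a different node for each block so storage is spread evenly.
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replicas)
        ]
    return placement


if __name__ == "__main__":
    # A 250-unit file with 100-unit blocks needs three blocks.
    print(place_blocks(250, 100, ["node-a", "node-b", "node-c"]))
```

Keeping more than one copy of each block means the loss of a single cheap machine does not lose data, which is what makes commodity hardware workable at this scale.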

Hadoop also runs on other distributed file systems, including Amazon's S3 cloud storage service. Hadoop MapReduce jobs can also run on Amazon's EC2 elastic compute cloud service.

HBase is a very large database management system that runs on top of the Hadoop Distributed File System (HDFS). It's a clone of another publicly disclosed Google system, BigTable, for managing very large database tables.

Pig is a high-level language for distributed programming. It provides an alternative to working directly with MapReduce but runs atop the same runtime infrastructure.

Hive is a data warehouse infrastructure for executing SQL-like ad hoc queries on Hadoop. It started as an internal project at Facebook, and the developers there contributed it to the open source community.

You can download the code from http://hadoop.apache.org/. You can test it on a single Java-enabled computer, load it onto your own computer cluster, or deploy it to Amazon's EC2 and S3 cloud services.

On March 15, 2009, a start-up called Cloudera announced it would produce a commercially supported distribution of Hadoop, with installation and configuration tools to simplify the setup. Cloudera also offers support for Pig and Hive.
