Instant Scalability
By Doug Bartholomew | Posted 2008-02-21
With its Simple Storage Service and Elastic Compute Cloud, Amazon is blazing a trail to Web services and mixing it up with the likes of IBM and Sun — and maybe even Microsoft and Google.
One of the biggest advantages of Amazon’s Web services is their scalability. “We make it easier for self-funding, self-scaling businesses to succeed, because there are no gigantic technology investments they’re liable for,” Amazon’s Barr says.
Start-up CEOs and other entrepreneurs like the idea of being able to launch a new product or service for which the real-world demand is largely unknown, and still have the confidence that their support systems will handle whatever gets thrown their way—whether that’s 100 customers a day or 10,000 per hour.
“You have to plan for success,” Animoto’s Jefferson says. “Do we make it so only a hundred users can log in per week? It’s silly to prevent growth. We wanted a way we could scale to the world on day one.”
Jefferson and his partners looked at a number of other options, such as buying a batch of servers. “But a lot of those servers would have been dormant after the initial launch,” he says, “and it would have meant a huge capital expenditure.”
Amazon even tailors its offering to the size of the computing job. It charges 10 cents per compute hour for an instance on a single-processor machine, 40 cents per compute hour for a four-processor instance and 80 cents per compute hour for an eight-processor instance. Data storage costs 15 cents per gigabyte per month, and there is no charge for moving data from EC2 to S3.
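The rates quoted above reduce to simple arithmetic. The short Python sketch below is illustrative only: the function names are hypothetical, and it assumes the price scales linearly with processor count, which matches the three instance sizes quoted here but is not something Amazon is cited as stating for other sizes. Amounts are kept in whole cents to avoid rounding surprises.

```python
# Rates as quoted in the article (early 2008). Everything else is illustrative.
EC2_CENTS_PER_PROCESSOR_HOUR = 10
S3_CENTS_PER_GB_MONTH = 15


def ec2_cost_cents(processors: int, hours: int, instances: int = 1) -> int:
    """Compute cost in cents, assuming price is linear in processor count."""
    return EC2_CENTS_PER_PROCESSOR_HOUR * processors * hours * instances


def s3_cost_cents(gigabytes: int, months: int = 1) -> int:
    """Storage cost in cents for a given number of gigabytes and months."""
    return S3_CENTS_PER_GB_MONTH * gigabytes * months


if __name__ == "__main__":
    for processors in (1, 4, 8):
        print(processors, "processor(s):", ec2_cost_cents(processors, hours=1), "cents/hour")
    print("100 GB for one month:", s3_cost_cents(100), "cents")
```

Under these assumptions, a start-up that keeps 100 gigabytes in S3 would pay $15 a month for storage.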
By contrast, the Sun Grid Compute Utility charges users $1 per CPU per hour. Sun aggregates each customer’s job usage and then rounds it up to the nearest whole hour. For instance, a job that uses 1,000 CPUs for one minute would be billed as 1,000 CPU minutes or 16.67 CPU hours, with the latter figure rounded up to 17 hours, for a total of $17.
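The rounding rule is easy to check. The Python sketch below reproduces the 1,000-CPU example from the description above; the function name is hypothetical, and the code is a reading of the billing rule as described here, not Sun's actual billing software.

```python
SUN_DOLLARS_PER_CPU_HOUR = 1  # rate quoted in the article


def sun_grid_bill_dollars(cpus: int, minutes: int) -> int:
    """Aggregate usage in CPU minutes, round up to whole CPU hours, then bill."""
    cpu_minutes = cpus * minutes
    # Integer ceiling division: any partial hour is billed as a full hour.
    billed_hours = (cpu_minutes + 59) // 60
    return billed_hours * SUN_DOLLARS_PER_CPU_HOUR


if __name__ == "__main__":
    # 1,000 CPUs for one minute: 1,000 CPU minutes, billed as 17 CPU hours.
    print(sun_grid_bill_dollars(cpus=1000, minutes=1))
```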
As of January, Amazon customers had stored 14 billion objects on S3. Amazon won’t say what its data storage capacity or computing capacity limits are, nor whether the company has had to purchase additional hardware to handle AWS growth.
For startups, and even for large companies looking to do specific computing projects, this minuscule price tag for large-scale computing tasks is a big draw, because it means they can forgo investing in yet another server farm with all of the associated operating and maintenance costs. Microsoft is using the storage service to help speed software downloads, for example, and Linden Lab is using it to deal with the blizzard of downloads for its popular virtual world Second Life.
For some large companies, such as SanDisk and The New York Times, Amazon made it possible to launch new products or additional services.
SanDisk, the $4 billion maker of flash drives, uses Amazon’s data storage as an automatic backup for its new Cruzer Titanium Plus. SanDisk adapted BeInSync’s application for automatic backup of the Cruzer, allowing the company to promise customers automatic backup of their data even if the device is lost or stolen. “Amazon Web Services made it possible for us to pursue this very innovative new idea,” says Mike Langberg, a SanDisk spokesman.
For The New York Times, Amazon’s low cost was the main selling point. “The cost structure is so minimal that we didn’t have to make the traditional budget requests to get our project done,” says Derek Gottfrid, the newspaper’s senior software architect.
The project was massive and complex. America’s “newspaper of record” wanted a way to archive 11 million articles published from 1851 to 1980 as PDF files and make them available on the Web.
“We wanted a system that would be scalable, could handle a lot of traffic and could generate PDFs,” Gottfrid says. “We also needed a place to store these files. We weren’t really sure it would work using EC2 and S3, but we thought it was worth a chance to test it and see.”
Gottfrid was able to do the whole job in a few days. Each article that was to be put into PDF format consisted of a series of TIFF image files that had to be assembled in a particular geometric arrangement, including photos, captions, headlines and columns of text.
One of the biggest challenges was managing a large number of compute instances simultaneously, because running computations on large data sets is difficult to set up and manage. For that, Gottfrid took advantage of Apache’s Hadoop, an open-source implementation of the MapReduce idea developed at Google.
Hadoop, which provides a framework for running large data processing applications on clusters of commodity hardware, enabled Gottfrid to use EC2 to test-generate a few thousand articles using only four EC2 instances.
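For readers unfamiliar with MapReduce, the idea can be sketched in a few lines of plain Python. This is not Hadoop's API and not Gottfrid's code. It is a single-machine toy, with hypothetical article IDs and file names, that shows the three steps a framework like Hadoop spreads across many machines: a map step that emits key-value pairs, a shuffle that groups values by key, and a reduce step that combines each group.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (article_id, tiff_name) pair per scanned image."""
    for article_id, tiff_name in records:
        yield article_id, tiff_name


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each group; here, just put the pages in order."""
    return {key: sorted(values) for key, values in groups.items()}


def run_job(records):
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical input: images arrive in no particular order.
    scans = [("a1", "a1-p2.tif"), ("a2", "a2-p1.tif"), ("a1", "a1-p1.tif")]
    print(run_job(scans))
```

In a real cluster the map and reduce steps run in parallel on different machines, and the framework handles the grouping, scheduling and retries, which is the management burden described above.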
Upon successful completion of the test, Gottfrid calculated he could run through all 11 million articles in just under 24 hours by harnessing 100 EC2 instances. The project generated another 1.5 terabytes of data to store in S3. He even ran it a second time to fix an error in the PDFs.
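A back-of-the-envelope estimate shows why the budget was so small. The Python sketch below combines the figures above with the rates quoted earlier in this article. It rests on assumptions the article does not confirm: that the job used single-processor instances at 10 cents an hour, that it ran for a full 24 hours, and that 1.5 terabytes means 1,500 gigabytes. The results are estimates, not figures reported by Gottfrid.

```python
ARTICLES = 11_000_000
INSTANCES = 100
HOURS = 24                     # upper bound: the run took "just under 24 hours"
CENTS_PER_INSTANCE_HOUR = 10   # assumption: single-processor instances
S3_CENTS_PER_GB_MONTH = 15
OUTPUT_GB = 1_500              # assumption: 1.5 TB counted as 1,500 GB

instance_hours = INSTANCES * HOURS
articles_per_instance_hour = ARTICLES / instance_hours
compute_cost_dollars = instance_hours * CENTS_PER_INSTANCE_HOUR / 100
storage_cost_dollars_per_month = OUTPUT_GB * S3_CENTS_PER_GB_MONTH / 100

if __name__ == "__main__":
    print("Instance hours:", instance_hours)
    print("Articles per instance hour:", round(articles_per_instance_hour))
    print("Estimated compute cost per run, in dollars:", compute_cost_dollars)
    print("Estimated storage cost per month, in dollars:", storage_cost_dollars_per_month)
```

Under these assumptions, each run costs a few hundred dollars in compute time, so even the second run to fix the PDFs would have stayed well under a typical hardware budget request.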
“Honestly, I had a couple of moments of panic,” he wrote in his blog. “I was using some very new and not totally proven pieces of technology on a project that was very high profile and on an inflexible deadline. But clearly it worked out, since I am still blogging from open.nytimes.com.”