Using Content Delivery Networks

By Dan Marriott  |  Posted 2011-04-06

Summary: The site uses blade servers, open-source software and other tech tools to craft a robust Internet infrastructure for dispatching on-demand information to millions of users each month. In the highly competitive Internet search segment, success is measured in microseconds, and slow-loading pages can cost millions in lost advertising revenue. Dan Marriott, director of operations at the New York-based Q&A community site, tells how a “zero single point of failure” environment ensures both speed and reliability.

As director of operations at the world’s leading Q&A community site, I am responsible for providing a lot of answers, literally. The site delivers on-demand information to tens of millions of unique visitors each month and must do so as fast as possible, because Internet users do not like waiting and latency measured in seconds translates into significant losses.

With these demands in mind, we designed the Web infrastructure and content serving systems for our main site and the other Web properties in our network to be blazingly fast, highly reliable and extremely scalable. As key management at a publicly traded company, we have a fiduciary responsibility to ensure that the site never goes down. The site constitutes the world’s largest pool of community-supplied Q&A information combined with authoritative reference information. In total, our sites maintain a database of 9 million community-supplied answers to questions covering topics that run the gamut from travel to technology. Recently, we reached 5 million registered users.

In the highly competitive Internet Q&A and information search segment, success is measured in microseconds, and slow-loading pages can cost millions in lost advertising revenues. Naturally, we have put serious thought into the Internet infrastructure behind the network. Through trial, error and hundreds of hours of testing, we built out a custom application stack designed for screamingly fast content delivery to anywhere in the world at any time, in a manner that accommodates any type of network connection.

But providing fast content delivery is not enough. We also must have a fully redundant architecture that minimizes the chance of catastrophic technology collapses. For that reason, we operate a “zero single point of failure” environment. Every server in our network infrastructure has a redundant counterpart. We run redundant power supplies, redundant switches and load balancers, redundant firewalls and redundant rack chassis clusters.

We checked out various blade providers before going with Hewlett-Packard as our primary server vendor. A key factor in the decision involved the features HP incorporated into its blades. For example, we use solid-state drives fairly extensively in our database tier, and HP was the first to offer PCI-based SSDs that could be incorporated within the blade form factor. It altered our whole approach to our database tier.

Another positive was the ability to add two- and four-port network interface cards, a key functional requirement for us. Based on the range of operational expenses—including power consumption, cooling, space/rack requirements, support and maintenance, and system administration—we found that the HP products offered better than 20 percent cost savings compared with other servers at that time.

For colocation, we use two facilities near the East and West Coasts. Each colocation server cluster provides sufficient CPU and bandwidth capacity to handle at least 120 percent of our peak traffic requirements. This gives us a total burst capacity of well over 200 percent of estimated peak traffic.
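The capacity arithmetic above can be sketched in a few lines. This is only an illustration: the 120 percent figure and the two facilities come from the text, while the function names are assumptions made for the example.

```python
# Illustrative capacity check. The 120 percent per-facility figure and the
# two facilities come from the article; the function names are hypothetical.

def burst_capacity(per_site_pct: float, sites: int) -> float:
    """Total capacity, as a percentage of peak traffic, with every site up."""
    return per_site_pct * sites


def surviving_capacity(per_site_pct: float, sites: int, failed: int) -> float:
    """Capacity left, as a percentage of peak traffic, after some sites fail."""
    return per_site_pct * (sites - failed)


if __name__ == "__main__":
    print("Both facilities up:", burst_capacity(120.0, 2), "percent of peak")
    print("One facility down: ", surviving_capacity(120.0, 2, 1), "percent of peak")
```

With both facilities running, the total is 240 percent of peak. If one facility is lost entirely, the survivor still covers peak traffic with 20 percent to spare.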

To speed that content on its way, we also deploy Varnish, an open-source HTTP caching application, in front of our LAMP-stack infrastructure. It also supports an essential requirement: to purge specific pages from cache any time they are updated.
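As a rough sketch of what purging a single page can look like, the snippet below builds an HTTP request with the PURGE method. It assumes the Varnish configuration has been written to accept PURGE requests from trusted hosts, which is a common convention but not something described here. The host name and path are hypothetical.

```python
from urllib.request import Request, urlopen

# Hypothetical cache host; a real deployment would use its own address.
CACHE_BASE = "http://cache.example.com"


def build_purge_request(path: str) -> Request:
    """Build (but do not send) a PURGE request for one cached page.

    Assumes the cache is configured to treat the PURGE method as an
    instruction to evict the object stored under this URL.
    """
    return Request(CACHE_BASE + path, method="PURGE")


def purge(path: str, timeout: float = 2.0) -> int:
    """Send the purge and return the HTTP status code from the cache."""
    with urlopen(build_purge_request(path), timeout=timeout) as response:
        return response.status
```

A content management hook could call `purge("/question/123")` after an answer is edited, so that the next visitor receives a freshly generated page.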

On the client side, the site uses now-standard practices to reduce load times, including CSS sprites, combined files and image maps. Our engineering team uses Asynchronous JavaScript and XML (AJAX) wherever possible to let users download content in the background so that it can be displayed quickly.

Using Content Delivery Networks

To push the network’s static content further out toward the client side, we’ve extensively used content delivery networks (CDNs). We dynamically generate HTML pages ourselves, but all the other page components (GIFs, JPEGs, PNGs, JS and CSS) are much more static, so we moved them off our servers and onto the CDNs to be closer to users.
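The split between dynamic HTML and static components can be illustrated with a small URL helper. This is a minimal sketch under stated assumptions: the host names are placeholders, and choosing by file extension is one simple way to make the decision, not necessarily how our templates do it.

```python
import posixpath
from urllib.parse import urlsplit

# File types named in the article as static page components.
STATIC_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".js", ".css"}

# Hypothetical host names used only for illustration.
CDN_HOST = "cdn.example.com"
ORIGIN_HOST = "www.example.com"


def asset_url(path: str) -> str:
    """Return a CDN URL for static assets and an origin URL for everything else."""
    # Look only at the path portion so query strings do not hide the extension.
    extension = posixpath.splitext(urlsplit(path).path)[1].lower()
    host = CDN_HOST if extension in STATIC_EXTENSIONS else ORIGIN_HOST
    return f"https://{host}{path}"
```

Here `asset_url("/img/logo.png")` points at the CDN host, while `asset_url("/topic/travel")` stays on the origin servers that generate HTML.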

In the past, we used a variety of CDNs, but we switched to Cotendo, which focuses on optimizing its caching software and improving performance using fewer resources. Since we began serving traffic through Cotendo, we’ve experienced 99.999 percent uptime, versus 99.9 percent uptime with our previous CDN vendors, which works out to about 44 minutes less downtime per month. Other features include hourly traffic log data dumps via automated FTP delivery, which lets us quickly spot traffic trends.
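The downtime comparison is simple arithmetic, sketched below. The uptime percentages come from the text; the function name and the choice of month length are assumptions.

```python
def downtime_minutes(uptime_pct: float, days: int = 30) -> float:
    """Expected downtime in minutes over a month at a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_pct) / 100.0


if __name__ == "__main__":
    for days in (30, 31):
        old = downtime_minutes(99.9, days)    # previous CDN vendors
        new = downtime_minutes(99.999, days)  # current CDN
        print(f"{days}-day month: {old:.1f} min vs {new:.1f} min, "
              f"saving {old - new:.1f} min")
```

Depending on the length of the month, the saving comes to roughly 43 to 44 minutes, which is consistent with the figure quoted above.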

We’re exploring using Cotendo’s search engine optimization (SEO) tools, which would give us near-real-time insights into the search engines’ spidering experiences. We think this tool can highlight problems quickly before they become serious traffic and site-ranking issues, enabling us to make quick adjustments.

The software core of our data infrastructure is built primarily on the open-source LAMP stack with a heavy dose of virtualization. We have become comfortable running several layers of our mission-critical Web serving in a completely virtualized environment. Several of our layers and all our Apache/PHP instances run on VMware. Virtualization is now part of our operations infrastructure DNA.

As long-time MySQL proponents, we now use this open-source software as our primary database in development, test, staging and production environments. Other data stores we’ve used include Memcached, a distributed memory caching system that enhances site performance by caching data and objects in memory and SSDs to reduce external data source calls to servers.
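The way a memory cache reduces calls to a backing data source is often described as the cache-aside pattern. The sketch below shows the idea only: a plain dictionary stands in for Memcached, the loader function stands in for a MySQL query, and the class and method names are hypothetical.

```python
from typing import Callable, Dict


class CacheAside:
    """Check the cache first and fall back to the data source on a miss."""

    def __init__(self, load_from_source: Callable[[str], str]) -> None:
        self._cache: Dict[str, str] = {}  # stand-in for a Memcached client
        self._load = load_from_source     # stand-in for a database query
        self.source_calls = 0             # counts trips to the data source

    def get(self, key: str) -> str:
        cached = self._cache.get(key)
        if cached is not None:
            return cached                 # cache hit: no data source call
        self.source_calls += 1
        value = self._load(key)
        self._cache[key] = value          # populate the cache for later readers
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key so the next read fetches fresh data."""
        self._cache.pop(key, None)
```

Repeated reads of the same key touch the data source once. After `invalidate` is called, for example when an answer is updated, the next read goes back to the source.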

We also use Cassandra, an open-source NoSQL database that Digg and Facebook use to store and manipulate large volumes of data distributed across multiple commodity servers. For critical search capabilities, we use Solr/Lucene, an open-source enterprise search platform from the Apache Lucene project that offers highly scalable full-text search, faceted search, dynamic clustering and simple database integration.

Dealing With Ads

No discussion of maintaining a fast Web architecture would be complete without considering advertisements. Every Web operations director has horror stories about a poorly coded Flash ad spot that took 10 seconds or longer to load and chased off thousands of site visitors before it was spotted as a serious performance hazard. We avoid highly disruptive advertising units like interstitials and page takeovers because we don’t like to put our users through that. We select our ad network partners carefully and spend time checking out their technology and talking to other Website operators who have served the network’s ads. In fact, we deal only with ad networks that, like us, have highly reliable and redundant server environments and serve their traffic through reliable networks.

That said, some badly coded or resource-hogging ads do slip through. We know that it can be incredibly annoying when one component holds up the page from displaying in the browser. To minimize that problem, we place a higher load priority on editorial content and ensure that ads are loaded later. We want to make sure that the core content of the page always loads first.

Above all, we want our users to have a great experience on our sites, and we’ll do whatever it takes to ensure they get the information they want as quickly as possible and with minimal interruption or wait times. We believe that this attitude, along with our willingness to invest extra time and effort in R&D, has played a key role in our growth to nearly 70 million monthly unique visitors. Our distribution system’s flexible and global nature has allowed us to grow quickly around the world.

Maintaining our high ranking in the fast-changing global Internet site network is no simple feat. But I think we have the answer: a well-constructed combination of smart hardware, open-source software, feature-rich management systems, fast and resilient distribution systems, and, most important, lots of smart people on our team.

Dan Marriott is director of operations at the New York-based Q&A community site. As a technologist with more than 20 years’ professional experience in the private and public sectors, he now specializes in high scalability and performance, pushing MySQL database limits, Internet infrastructure and data security.