Inside MySpace: The StoryBy David F. Carr | Posted 2007-01-16 Email Print
Booming traffic demands put a constant stress on the social network's computing infrastructure. Here's how it copes.title=Unexpected Errors}
If it were not for this series of upgrades and changes to systems architecture, the MySpace Web site wouldn't function at all. But what about the times when it still hiccups? What's behind those "Unexpected Error" screens that are the source of so many user complaints?
One problem is that MySpace is pushing Microsoft's Web technologies into territory that only Microsoft itself has begun to explore, Benedetto says. As of November, MySpace was exceeding the number of simultaneous connections supported by SQL Server, causing the software to crash. The specific circumstances that trigger one of these crashes occur only about once every three days, but it's still frequent enough to be annoying, according to Benedetto. And anytime a database craps out, that's bad news if the data for the page you're trying to view is stored there. "Anytime that happens, and uncached data is unavailable through SQL Server, you'll see one of those unexpected errors," he explains.
Last summer, MySpace's Windows 2003 servers shut down unexpectedly on multiple occasions. The culprit turned out to be a built-in feature of the operating system designed to prevent distributed denial of service attacks—a hacker tactic in which a Web site is subjected to so many connection requests from so many client computers that it crashes. MySpace is subject to those attacks just like many other top Web sites, but it defends against them at the network level rather than relying on this feature of Windows—which in this case was being triggered by hordes of legitimate connections from MySpace users.
"We were scratching our heads for about a month trying to figure out why our Windows 2003 servers kept shutting themselves off," Benedetto says. Finally, with help from Microsoft, his team figured out how to tell the server to "ignore distributed denial of service; this is friendly fire."
And then there was that Sunday night last July when a power outage in Los Angeles, where MySpace is headquartered, knocked the entire service offline for about 12 hours. The outage stood out partly because most other large Web sites use geographically distributed data centers to protect themselves against localized service disruptions. In fact, MySpace had two other data centers in operation at the time of this incident, but the Web servers housed there were still dependent on the SAN infrastructure in Los Angeles. Without that, they couldn't serve up anything more than a plea for patience.
According to Benedetto, the main data center was designed to guarantee reliable service through connections to two different power grids, backed up by battery power and a generator with a 30-day supply of fuel. But in this case, both power grids failed, and in the process of switching to backup power, operators blew the main power circuit.
MySpace is now working to replicate the SAN to two other backup sites by mid-2007. That will also help divvy up the Web site's workload, because in the normal course of business, each SAN location will be able to support one-third of the storage needs. But in an emergency, any one of the three sites would be able to sustain the Web site independently, Benedetto says.
While MySpace still battles scalability problems, many users give it enough credit for what it does right that they are willing to forgive the occasional error page.
"As a developer, I hate bugs, so sure it's irritating," says Dan Tanner, a 31-year-old software developer from Round Rock, Texas, who has used MySpace to reconnect with high school and college friends. "The thing is, it provides so much of a benefit to people that the errors and glitches we find are forgivable." If the site is down or malfunctioning one day, he simply comes back the next and picks up where he left off, Tanner says.
That attitude is why most of the user forum responses to Drew's rant were telling him to calm down and that the problem would probably fix itself if he waited a few minutes. Not to be appeased, Drew wrote, "ive already emailed myspace twice, and its BS cause an hour ago it was working, now its not ... its complete BS." To which another user replied, "and it's free."
Benedetto candidly admits that 100% reliability is not necessarily his top priority. "That's one of the benefits of not being a bank, of being a free service," he says.
In other words, on MySpace the occasional glitch might mean the Web site loses track of someone's latest profile update, but it doesn't mean the site has lost track of that person's money. "That's one of the keys to the Web site's performance, knowing that we can accept some loss of data," Benedetto says. So, MySpace has configured SQL Server to extend the time between the "checkpoints" operations it uses to permanently record updates to disk storage—even at the risk of losing anywhere between 2 minutes and 2 hours of data—because this tweak makes the database run faster.
Similarly, Benedetto's developers still often go through the whole process of idea, coding, testing and deployment in a matter of hours, he says. That raises the risk of introducing software bugs, but it allows them to introduce new features quickly. And because it's virtually impossible to do realistic load testing on this scale, the testing that they do perform is typically targeted at a subset of live users on the Web site who become unwitting guinea pigs for a new feature or tweak to the software, he explains.
"We made a lot of mistakes," Benedetto says. "But in the end, I think we ended up doing more right than we did wrong."