Best Practices in Disaster Recovery, Business Continuity Planning
It’s every IT manager’s worst nightmare: the call from the CEO to evacuate the data center because of a hurricane or other emergency. That’s what happened to John Chaffe, IT director of New Orleans-based Tidewater Marine, the morning before Hurricane Katrina hit, and he ended up driving critical servers to shared office space in Houston.
Chaffe and other IT executives learned some important lessons about how to plan for business continuity with appropriate storage and backup strategies. Disaster recovery (DR) should be viewed not just in terms of business continuity and application availability, but also as a compliance matter.
“Three years ago, Hurricanes Frances and Wilma destroyed three of our 11 campuses,” says Dan Weiss, IT director at MedVance Institute of West Palm Beach, Fla., a chain of health professional training schools. “The IT director [at the time] had to operate our data center out of his house for about a week.” Since then, Weiss has put together a solid disaster recovery plan that includes a colocated data center in Atlanta that is out of the path of most hurricanes.
Of course, IT operations can be interrupted for reasons other than natural disasters. So it’s important to be prepared for any type of disruption.
“Our offices are in Manhattan, and one of our building’s other tenants is Microsoft,” says Mark Tirschwell, CTO of Wall Street Systems, a provider of financial services hosted systems. “They can shut down the building for an entire day when they need more power for their servers.”
Tirschwell suggests conducting an extensive risk analysis and getting input from key stakeholders, such as operations people, department managers and application development managers. “Understand the key systems that will keep your business running and what will make a huge difference in your business operations,” he advises.
It’s also important to have a thorough understanding of the systems’ interdependencies and which systems need to be brought up first.
“Systems need to be restored in a specific order and with specific Internet and LAN connections,” says Mike Croy, director of business continuity solutions at Forsythe Solutions Group, a consulting firm in Skokie, Ill. “You need to know the business impact of losing connectivity and the compliance implications of losing critical customer records.”
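The restore ordering Croy describes can be treated as a dependency graph: each system is brought up only after the systems it relies on. Here is a minimal sketch in Python, assuming a hypothetical map of systems to their prerequisites. The system names and dependencies are illustrative only and are not taken from any firm in this article.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each system lists what must be running first.
DEPENDS_ON = {
    "wan_link": set(),
    "lan_switches": set(),
    "directory_service": {"lan_switches"},
    "email": {"directory_service", "wan_link"},
    "customer_records": {"directory_service"},
}


def restore_order(depends_on):
    """Return one valid bring-up order.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, system in enumerate(restore_order(DEPENDS_ON), start=1):
        print(step, system)
```

More than one order can satisfy the same dependencies, so a real plan would likely break ties by business impact.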
Once you have this information, make a detailed catalog of all your servers and services and understand the recovery-time objectives for each. Some of these systems need to be up and running within minutes, while others can wait hours or even days.
“We don’t restore all our servers,” says Lee Abner, technology director for CIB Marine Information Services, a Bloomington, Ill.-based subsidiary of CIB Marine Bancshares, a bank holding company. “Within the first 48 hours, we need roughly 30 of our servers that are mission-critical, such as e-mail, document management and check-imaging systems, along with backups of our Active Directory environment and anti-virus protection.” The rest of the company’s systems can wait, he says.
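A catalog like this can be as simple as a list of systems tagged with recovery-time objectives (RTOs), filtered by how soon each one is needed. The sketch below is one possible shape for it. The entry names and RTO values are assumptions chosen for illustration, not CIB Marine’s actual figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    rto_hours: float  # recovery-time objective: maximum tolerable downtime


# Illustrative entries only; names and RTO values are assumptions.
CATALOG = [
    CatalogEntry("email", 4),
    CatalogEntry("check_imaging", 24),
    CatalogEntry("document_management", 48),
    CatalogEntry("intranet_wiki", 240),
]


def due_within(catalog, window_hours):
    """Systems whose RTO falls inside the window, most urgent first."""
    due = [entry for entry in catalog if entry.rto_hours <= window_hours]
    return sorted(due, key=lambda entry: entry.rto_hours)


if __name__ == "__main__":
    for entry in due_within(CATALOG, 48):
        print(f"{entry.name}: restore within {entry.rto_hours} h")
```

Anything outside the window, such as the hypothetical intranet wiki here, is deliberately left for later.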
Munder Capital’s disaster recovery priorities depend on the nature of the system. “We take snapshots ranging from every hour to every 15 minutes, depending on our systems,” says Wolfgang Goerlich, network operations and security manager for the Birmingham, Mich.-based investment banking firm. “Our top-tier systems, such as trading, can have an issue if we lose even 15 minutes. Lower-tier systems, such as research, just generate reports once a day, so if they lose data for [a few] hours, it isn’t as big of an issue. With our lowest-tier systems, our DR plan is to go out and buy boxes and bring them up in a couple of weeks.”
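The reasoning behind tiered snapshot schedules is simple arithmetic: with periodic snapshots, a failure just before the next snapshot loses roughly one full interval of data. The sketch below checks each tier’s interval against the data loss it can tolerate. The tier names and tolerance values are assumptions loosely modeled on Goerlich’s description, not Munder Capital’s actual settings.

```python
# Hypothetical tiers: (snapshot interval, tolerable data loss), both in minutes.
TIERS = {
    "trading": (15, 15),
    "research": (60, 180),
}


def worst_case_loss_minutes(snapshot_interval_min):
    """A failure just before the next snapshot loses about one full interval."""
    return snapshot_interval_min


def tiers_at_risk(tiers):
    """Names of tiers whose snapshot interval exceeds their tolerable loss."""
    return [
        name
        for name, (interval, tolerable) in tiers.items()
        if worst_case_loss_minutes(interval) > tolerable
    ]


if __name__ == "__main__":
    print(tiers_at_risk(TIERS))
```

If a top-tier system were moved to hourly snapshots under these assumed tolerances, it would show up in the at-risk list.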
Disaster Recovery: Moving Data to Safety
If you have a single data center, consider how and where you want to move your data out of harm’s way. Solutions could range from purchasing a high-quality fire safe and putting backup tapes inside it each night, to paying for a remote hot backup site where the servers are constantly running and the data is being replicated in near real-time.
The migration process could involve a combination of steps, such as making backup tapes and moving them to a remote location on a set tape-rotation schedule, and then replicating data to another storage repository across the Internet.
“We looked into building our own backup site in-house, but the cost was about the same as paying for a provider, plus we would have had to pay for the entire costs up front,” says Abner, who adds that CIB Marine Information Services ended up going with SunGard. “While I think SunGard is our best option, I caution anyone to make sure their initial contract will handle all sorts of situations.”
For Miles and Stockbridge, a Baltimore law firm, multiple data centers or off-site storage would have been too costly, so the company has been “moving toward as many hosted solutions as possible,” says CIO Ken Adams. The firm, which previously used Postini for its e-mail spam and anti-virus protection, switched to the hosted service from Mimecast. “All my e-mail goes through their service,” he says. “They do the journaling and archiving, and I don’t have to worry about backups.”
The company also converted from PCDocs to NetDocuments’ hosted document management service. “Law firms live and die by their documents, but I don’t have to index them or worry about backups with NetDocuments, since they are off-site,” Adams says.
Disaster Recovery: Virtualization Advantages
Virtualization helps on two fronts, storage arrays and servers, and both make it easier to restart critical services and replicate servers across different data centers and offices. Products such as Microsoft’s Hyper-V and VMware can be a less expensive route to near-real-time high availability. (See “Virtualization Is the New Clustering” in the August issue of Baseline.)
“The key thing for us was a very short recovery-time objective,” says Munder Capital’s Goerlich. The firm uses Compellent’s virtual storage arrays, with the DR baked in. He says it takes just one click to activate DR and boot up the systems on a new box.
Eighty percent of the systems at CIB Marine are virtualized. As a result, the company was able to cut its hosting bill in half and save recovery time by consolidating servers at its backup site. “Before virtualization, it took us 48 hours and 12 staff people to recover our systems,” Abner says. “Now, four of us can do it in 24 hours, and, for most of that time, we are just watching to make sure the systems are running properly.”
Virtualization has also been a boon for Wall Street Systems. “All our mission-critical servers leverage it, so we can be up and running in short order,” Tirschwell says. “We can use virtualization to make copies of a physical machine to help out with balancing our server loads. It saves us so much time with our operations.”
The company uses Egenera’s virtual, rack-mounted, highly available servers so it can quickly swap out blades or failing components without taking down applications. “Ten years ago, we would have needed a staff of 50 and paid a small fortune to do what we do now,” Tirschwell says. “Plus, we can grow our virtual infrastructure and manage it with the same number of staff.”
No matter what solution a company chooses, it should consider how it will staff the DR site, and whether the staff can actually get to the site, or at least administer the machines remotely over the Internet, in case of disaster.
“One thing we learned from Hurricane Frances is that the loss of centralized communications impacts your ability to work,” says MedVance’s Weiss. “Having a virtualized workspace means my staff can still get their jobs done as long as they have Internet connectivity and can log in to our Citrix portal.”
MedVance has taken other steps, such as hiring CDW to help set up a LeftHand Networks storage-area network. The institute has also found that cell phone service is restored faster than other communications links, so it’s supplying employees with cell phones for redundancy.
Finally, realize that no solution is universally applicable. “I can usually see the storms coming and have time to prepare for them,” Tidewater Marine’s Chaffe says. “So asynchronous replication works fine for me, but it may not work for others.”
Croy of Forsythe Solutions agrees: “Everybody has built their infrastructure differently, and organizations have to realize that they are unique, so they have to build something that will protect their particular business.”