5 Surefire Tips for Smooth Disaster Recovery

Major disasters in the news have a way of triggering worrisome thoughts about business continuity in the minds of many IT professionals. No doubt the fires in Southern California over the past two weeks reminded more than a few people about the gaps in their disaster recovery programs.

Although disaster recovery and business continuity management have matured considerably within most enterprise IT departments over the past several years, there is still plenty of room for improvement. Most organizations have disaster recovery plans, yet almost half of those plans fail during testing, according to a recent survey by Dynamic Markets for the Symantec Disaster Recovery Research 2007 report.

Some of the key survey findings:

  • 13 percent have no disaster recovery plan
  • 44 percent without a plan have experienced at least one disaster
  • 50 percent have failed their own disaster recovery tests
  • 77 percent of CEOs do not take part in disaster recovery tests

Disaster recovery experts believe part of the problem is that many enterprises approach disaster planning with a checkbox mentality. Having backup and redundancy systems in place is often not enough.

“It’s too easy to become complacent, to become too trusting that the plan will bail you out when everything hits the fan,” says Philip Jan Rothstein, president of consultancy Rothstein Associates and author of numerous business continuity books. “When that happens is the time when you least want to learn that you’ve overlooked something.”

The following are five best practices frequently overlooked by enterprises:

  1. Think business needs first
  2. Drill on potential disasters
  3. Maintain data integrity
  4. Keep data stores in sync
  5. Secure the entire process

Think Business Needs First

One of the most common mistakes is failing to consider business needs when developing a disaster plan. “IT disaster recovery plans should be integrated with the business needs, not built in a vacuum,” Rothstein says. “Disaster recovery is not about writing plans and procedures. But what a lot of organizations do is start out by buying a template, and then it becomes a matter of filling in the blanks and generating some paper.”

Rothstein recommends starting with an impact assessment to determine recovery priorities based on business processes, and then developing strategies to mitigate risk based on those priorities.
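To make impact-driven priorities concrete, here is a minimal Python sketch. It assumes the impact assessment has already produced, for each business process, an estimated cost per hour of outage and a maximum tolerable outage. The `BusinessProcess` fields and the ranking rule (shortest tolerable outage first, then highest hourly cost) are illustrative assumptions, not Rothstein's method.

```python
from dataclasses import dataclass

@dataclass
class BusinessProcess:
    # Hypothetical fields an impact assessment might produce.
    name: str
    hourly_loss: float          # estimated cost per hour of outage
    max_tolerable_hours: float  # longest outage the business can absorb

def recovery_priorities(processes: list[BusinessProcess]) -> list[BusinessProcess]:
    """Order processes for recovery: shortest tolerable outage first,
    breaking ties by the highest hourly loss."""
    return sorted(processes, key=lambda p: (p.max_tolerable_hours, -p.hourly_loss))
```

Given an order-processing system that can tolerate four hours of downtime and an internal wiki that can tolerate 72, the sketch puts order processing first, regardless of which system IT happens to find easier to restore.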

This can go a long way toward preventing money from being wasted on a disaster plan that devotes too much time to low-impact applications while neglecting the most important ones.

Drill on Potential Disasters

Once the plan is developed, it is equally important to practice it. As Rothstein notes, a plan that isn’t exercised is worse than no plan at all. He also warns that a disaster recovery exercise is distinct from a disaster recovery test.

“There’s testing, and then there’s exercise,” he says. “A test connotes pass/fail, good/bad, right/wrong, whereas the purpose of exercising is to find the weak links and strengthen them. It’s to find the flaws and fix them. It’s to find the concerns that go unresolved, and to find the things you might not even notice in a test. An exercise is like going to the gym: it makes you stronger, better and more effective.”

John Morency, a disaster recovery researcher at Gartner, agrees: “Sometimes the common assumption is that you’ve got to do everything in one fell swoop. It’s a fallacy that you have to rely on testing everything all at once.”

Instead, exercise individual disaster recovery actions one or two at a time. One of the easiest ways to start the process is to use “tabletop” exercises.

“Get a bunch of key people sitting around a table, present a scenario, and talk it through for an hour,” Rothstein says. “This is useful because things come out like, ‘Gee, how can you be doing that if you’re across town at the recovery site?’ Or, ‘What if So-and-So is on vacation? Who else can do this?’”

From there, start testing individual elements of the disaster recovery plan. This can mean checking on an individual server or an application, or even conducting a connectivity test. By performing these small exercises on a regular basis, organizations will become better equipped for a real event than if they simply wait for one or two big pass/fail tests a year.
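A connectivity check is one of the smallest elements that can be exercised on its own. The Python sketch below simply tries to open a TCP connection to each recovery-site endpoint. The endpoint names, hosts and ports are hypothetical placeholders, and a real check would also confirm that the service behind the port actually responds correctly.

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False

def run_checks(endpoints: dict[str, tuple[str, int]], timeout: float = 3.0) -> dict[str, bool]:
    """Check each named endpoint and return a name -> reachable map."""
    return {name: check_tcp(host, port, timeout) for name, (host, port) in endpoints.items()}
```

For example, `run_checks({"dr-database": ("dr-db.example.internal", 1521)})` would probe a hypothetical recovery database host on Oracle's default listener port. Running a handful of such checks weekly costs little and surfaces firewall or routing changes long before an actual failover.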

Maintain Data Integrity

Organizations can’t examine everything, so Gartner’s Morency recommends prioritizing data integrity when developing and testing disaster recovery plans.

“There are too few organizations that have any type of systematic integrity testing on the integrity of the data that are on backup tapes,” he says. That gap can undermine a disaster recovery test and, in some cases, cause an actual recovery to fail.

It’s important to check tape backup processes to guard against data corruption and other structural problems that would hinder or negate recovery efforts, Morency notes.
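One systematic way to check backups, sketched below in Python, is to record a SHA-256 digest of every file at backup time and compare it against a restored copy. This assumes backups can be restored to a scratch directory; the manifest format and function names are hypothetical. Checksums catch corrupted and missing files, but not logical problems such as an application-inconsistent database snapshot.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup files don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(source_dir: Path) -> dict[str, str]:
    """Record a digest for every file under source_dir, keyed by relative path."""
    return {
        p.relative_to(source_dir).as_posix(): sha256_of(p)
        for p in source_dir.rglob("*")
        if p.is_file()
    }

def verify_restore(manifest: dict[str, str], restore_dir: Path) -> list[str]:
    """Return relative paths that are missing from the restore or whose digest differs."""
    problems = []
    for rel_path, expected in manifest.items():
        target = restore_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

The manifest is built when the backup is taken and stored separately from the backup media; an empty list from `verify_restore` means every recorded file came back byte-for-byte intact.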

“Align the integrity testing with what the auditors would do for standard compliance testing for controls which are run on a daily basis,” he says, offering a rule of thumb for deciding how many samples to test. “That way you have a mathematical, statistical basis to work with, and you’ve done a reasonable job in terms of testing the entire population.”
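Morency does not give a formula, but one common statistical approach in controls testing is discovery (attribute) sampling: choose a sample large enough that, if at least a tolerable fraction of tapes were bad, the sample would very likely contain one. The Python sketch below assumes a large tape population and independent random selection; the confidence and tolerable-rate defaults are illustrative, not a standard.

```python
import math
import random

def discovery_sample_size(confidence: float, tolerable_rate: float) -> int:
    """Smallest n with (1 - tolerable_rate) ** n <= 1 - confidence.

    If at least `tolerable_rate` of a large tape population is bad, a random
    sample of n tapes contains at least one bad tape with probability >= confidence.
    """
    if not (0 < confidence < 1 and 0 < tolerable_rate < 1):
        raise ValueError("confidence and tolerable_rate must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - tolerable_rate))

def pick_tapes_to_test(tape_ids, confidence=0.95, tolerable_rate=0.05, seed=None):
    """Randomly choose tapes for restore testing; tests every tape if the library is small."""
    tapes = list(tape_ids)
    n = min(len(tapes), discovery_sample_size(confidence, tolerable_rate))
    return random.Random(seed).sample(tapes, n)
```

With the defaults (95 percent confidence, 5 percent tolerable failure rate) this works out to 59 tapes. If the true failure rate were 5 percent or higher, a clean sample of 59 would occur less than 5 percent of the time, which gives the reasonable, though not absolute, assurance Morency describes.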

Keep Data Stores in Sync

Another frequently overlooked aspect of disaster recovery is change management. This is especially critical when working in an active/active environment that involves data mirroring, Morency says.

“You really have to make sure that change management is aligned between systems, so that whatever changes or patches you make to the primary production system are also made to the recovery system,” he says.
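Change-management drift can be caught by routinely comparing what is installed on each side. Assuming you can export a component-to-version inventory from both the production and recovery systems (the inventory format and component names here are hypothetical), a Python sketch of the comparison looks like this:

```python
from typing import Optional

def patch_drift(
    primary: dict[str, str], recovery: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return {component: (primary_version, recovery_version)} for every component
    whose version differs; None marks a component missing on that side."""
    drift = {}
    for name in sorted(primary.keys() | recovery.keys()):
        p, r = primary.get(name), recovery.get(name)
        if p != r:
            drift[name] = (p, r)
    return drift
```

Running the comparison on a schedule, and again after every production change window, turns “the recovery side is patched” from an assumption into something checked; any nonempty result means the two sites no longer match.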

Failing to sync change management is a common mistake, one that Morency and his colleagues at Gartner recently saw during a disaster test conducted by one of the firm’s clients. A state government failed a large part of the test when the critical databases for its tax service stalled midway through processing.

“It turned out that even though the production servers had been updated months before, the provider hadn’t patched the Oracle system on the recovery side to prevent the gridlock,” Morency says.

Secure the Entire Process

Businesses shouldn’t forget about security during disaster scenarios. Not only should IT security be as robust in the recovery systems as in their production counterparts, but physical security must also be accounted for in the overall disaster recovery plan.

“One aspect that gets overlooked is that the middle of a declared disaster is the last time and place where you want to let your guard down,” Rothstein says. “What if you’re in a situation where it’s something intentional? I went through a scenario years ago where they were trying to rebuild data center operations at a recovery site, and every time they’d get to another step, the previous step would start to fail. It turned out that one of the members of the recovery team was the one causing this.”

Disaster recovery security should be proactive, ensuring that all recovery and backup systems are patched, segregated and maintained in the same manner as production systems. Staff assigned to recovery tasks should have a clear understanding of their duties and should be vetted and cleared for the access they receive. And all steps of the recovery process should be monitored for anomalies and checked for quality and completeness to guard against mistakes and insider tampering.
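Rothstein’s story of earlier steps failing as later ones completed suggests one concrete monitoring habit: re-verify every completed step after each new one. The Python sketch below assumes each recovery step can be expressed as an action plus a verification check; the `Step` structure and the halt-on-regression behavior are illustrative assumptions. A regression does not prove tampering, but it tells the team to stop and investigate before going further.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    # Hypothetical runbook entry: an action and a check that it still holds.
    name: str
    run: Callable[[], None]
    verify: Callable[[], bool]

def execute_runbook(steps: list[Step]) -> Optional[tuple[str, list[str]]]:
    """Run steps in order, re-verifying every completed step after each one.

    Returns None if the whole runbook completes cleanly; otherwise halts and
    returns (step_just_run, names_of_steps_whose_checks_now_fail).
    """
    done: list[Step] = []
    for step in steps:
        step.run()
        done.append(step)
        failed = [prior.name for prior in done if not prior.verify()]
        if failed:
            return step.name, failed
    return None
```

Recording which step was running when an earlier check regressed, and who ran it, gives investigators a starting point, whether the cause turns out to be a mistake or deliberate interference.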