Long-Term Data Health

By Virginia Citrano  |  Posted 2008-01-30

What happens today in the research labs of the Roswell Park Cancer Institute (RPCI) could have an immediate, beneficial impact on the field of oncology. Then again, it might take 50 years. That is Thomas Vaughan’s dilemma.

As director of IT infrastructure at the Buffalo, N.Y.-based institution (founded in 1898 as the first cancer center in the United States), Vaughan manages information with an exceptionally long lifecycle that must also be immediately available. The data, which includes text documents, spreadsheets and diagnostic X-rays, must be stored in a way that preserves its integrity for today’s health care regulators as well as for future researchers.

“I want to save our data forever,” Vaughan says.

As they have in many organizations, the benefits and pitfalls of data storage pushed RPCI to think more strategically about what information retains value and for how long. Enterprises have always needed to record and store data, but the days of the simple ledger and filing cabinet are long gone, vanquished by three key factors. First is the sheer volume of information companies create. Second, unstructured data has largely eclipsed structured data. Records created in the line of business can range from text documents to presentations, PDFs, e-mail and more. Finally, companies face the burdensome challenge of making information accessible to regulators, auditors and lawyers.

So, what is the best way to tell which data can be deleted, saved or shuttled off to storage with slower retrieval speeds? Indeed, the discipline of data storage is all about finding the proper balance.

Information lifecycle management (ILM) was conceived as a way for companies to strike that balance and get a handle on the data deluge. Simply put, ILM gives businesses a framework for classifying stored information, identifying the best technology for that storage, crafting guidelines for retention and managing the total cost of storage. ILM correlates the business value of information with the IT infrastructure that surrounds it.

There is no single path to ILM, no one-size-fits-all definition or grab-and-go tech solution. For his part, Vaughan found a way to improve RPCI’s archiving and storage by working with the same vendor that supplies much of the institution’s infrastructure: Hewlett-Packard. From HP’s StorageWorks line he added enterprise virtual arrays and enterprise file services, a 10-TB Medical Archive Solution, a Reference Information Storage System and clustered HP ProLiant servers. “Understanding the technology,” Vaughan says, “was the easy part.”

What’s more difficult about ILM, he says, is getting folks to focus on the real value of stored data.

“[We had to understand] the change in the nature of data from just being data at rest on a disk for the previous 20 years to suddenly becoming dynamic data moving from place to place, being minable, and how that would empower the business,” he says.

Data Hierarchy Over Time

Beth Cohen, director of operations for data protection and storage consultancy Broadleaf Services, in Burlington, Mass., says ILM combines library science’s traditional understanding of how information changes relevance over time with the storage industry’s expertise in hierarchical data retrieval.

“ILM offers a different paradigm for companies wrestling with a tidal wave of data,” says Cohen, whose clients include a company trying to archive more than three million Microsoft PowerPoint slides.

The initial emphasis of ILM was on finding a framework for data classification; that is where the Storage Networking Industry Association’s road map still starts. But classifying data and assigning a life expectancy to its usefulness is hard work—harder still if the tools don’t match the task.

“The amount of unstructured data has been growing by leaps and bounds, and the tools were just not keeping up,” Cohen says.

To keep themselves from being overwhelmed by the data tidal wave, many organizations merely bulked up the beachhead with more storage gear. However, in a report issued earlier this summer, analyst firm Gartner predicted that, by 2010, rising costs for storage media, energy and storage facilities will compel companies “to abandon the axiom that it is easier to add storage than to craft an ILM strategy.”

The Gartner report focused on companies in the health care industry, but its author, Barry Runyon, says its conclusions hold true for other industries as well. Runyon’s recommendations include:

  • Slow storage growth by improving overall use.
  • Initiate a project to discover, identify and classify critical enterprise data, both structured and unstructured.
  • Establish performance and recovery objectives for each data category. 
  • Establish formal data-retention schedules.
  • Deploy a storage resource management tool.
  • Implement a tiered-storage infrastructure with at least three tiers.

For many companies, the genesis of their ILM strategy lies in regulatory compliance. Regulators demand that certain categories of information be kept for set periods in a certain way. The companies they regulate must comply. These organizations have learned the hard way that not having a data-retention policy—or failing to follow an existing policy to the letter—can be highly damaging.

Financial-services companies, for example, were among the first to get a broad set of targeted ILM tools because they faced so many mandates on their data. The finance industry’s e-mail archiving tools are robust enough to collect missives sent from a wide variety of communications devices, and can sock away instant messaging chats, too. These tools are now making their way out to a broader audience that includes both enterprises and their legal counsel.

“Attorney review time is the most expensive part of discovery,” says Brad Harris, director of product management at electronic discovery services provider Fios. “So it’s good to have a tool that makes that more efficient.” Harris’ common-sense tips: Move information off desktops into shared storage, use a content management system and take full advantage of its metadata tagging capabilities, dispose of what you don’t need and, above all, stick with the company’s ILM plan.

“A lot of companies talk about having retention policies, but nobody follows [them],” Harris says.

ILM Gets Personal

Helen of Troy is trying to do ILM right. The Hamilton, Bermuda-based company might seem like an unlikely target of compliance and data-archiving issues; it sells personal care products under brand names such as Vidal Sassoon and Revlon, as well as OXO-brand kitchen gadgets—not financial services or prescription drugs. But as a publicly traded company, Helen of Troy must adhere to all Sarbanes-Oxley Act regulations for financial reporting and accountability.

With sales topping $634 million for the fiscal year ended March 31, 2007—and with data growing by as much as 500 GB a year—the company realized it had to get a handle on how information about orders, inventory and receivables was backed up and archived from its main Oracle database. The company also needed to make information retrieval easier for internal and external auditors, who examine data-retention policies, wireless security, credit-card encryption and archive integrity.

As it began to craft its ILM strategy, Helen of Troy had one key advantage: Only 35 percent of the data it was looking to manage was unstructured, according to Pedro T. Contreras, vice president of information technology. The bulk of its data was structured by its worldwide enterprise resource planning (ERP) system. “ERP forces a structure on the data,” says Contreras, based in El Paso, Texas. “But you still have factors. And some people just have a hard time giving up paper copies.”

After vetting several vendors, Helen of Troy selected enterprise data management provider Solix Technologies and its ARCHIVEjinni product. Solix is a relatively young company—it was founded in 2001 in Sunnyvale, Calif.—but it had established a strategic partnership with Oracle, which meant ARCHIVEjinni was already integrated with the Oracle modules that Helen of Troy was using.

Helen of Troy worked through each aspect of its operations to define data-classification and information-retention policies. Contreras says the evaluation matrix that evolved covers factors such as ownership of data, how often it needs to be accessed and by whom, as well as how and where the data is stored. “You can’t break the ILM rules,” Contreras says.

You can break the power grid, however, which is why companies are increasingly looking at cutting energy consumption as part of their overall ILM strategies. Ever-rising volumes of data are running into ever-rising utility costs. By controlling power usage, a company can stretch a tight storage budget to fit more data and make itself more environmentally responsible in the process.

A Path to Green Storage

John Halamka, a self-professed Prius-driving vegan, is the kind of guy who would have found a path to green storage even without outside prompting. But the many hats he wears in business have made energy-efficient, long-term storage an absolute necessity.

Halamka is chief information officer for CareGroup Health System, a confederation of four Boston-area hospitals with more than 1,000 beds. He is also the CIO and dean for technology at Harvard Medical School. In all, that leaves him with 200 TB of medical records, which, thanks to compliance rules, need to be maintained for 30 years. Personnel at Halamka’s institutions expect Internet connections to be fast and reliable. They also expect to have ample storage for not only every PowerPoint they have ever created, but also for the growing volume of medical research passing through their computers. In all, the storage needs of Halamka’s systems were growing at 25 percent a year.
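As a rough illustration of what that growth rate implies, the sketch below compounds 25 percent annual growth over five years. It assumes, purely for illustration, that the 200 TB figure is the starting point and that the rate stays constant; the function name project_storage is hypothetical and does not come from Halamka's own planning.

```python
def project_storage(start_tb: float, annual_growth: float, years: int) -> list[float]:
    """Return projected capacity for years 0..years under constant compound growth."""
    return [start_tb * (1 + annual_growth) ** y for y in range(years + 1)]


# Assumed inputs: 200 TB today, 25 percent growth per year, five-year horizon.
projection = project_storage(200.0, 0.25, 5)
for year, tb in enumerate(projection):
    print(f"year {year}: {tb:,.0f} TB")
```

Under those assumptions, capacity roughly triples in five years, which helps explain why adding disks alone runs into a power and cooling ceiling.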

“Given that oil prices have hit $90 a barrel, telling senior management that we have to reduce our energy usage was pretty straightforward,” Halamka says. And it wasn’t just imported oil pressuring rates. The utility that CareGroup and Harvard Medical both rely on has a fixed capacity for power, heat and cooling, so when the plant gets to maximum capacity in summer months, it raises prices.

Halamka took a methodical approach to cutting storage power consumption at both of his institutions. First, he moved everything to 750-GB serial advanced technology attachment (SATA) drives. He noted in a recent blog post (geekdoctor.blogspot.com) that while many of his constituents were worried about how the slower SATA drives would perform, no one remarked on the change when it actually happened.

Then he took the eminently practical step of cutting back on how much data was stored. His backup systems now deduplicate files; if a document is attached to an e-mail sent to all 5,000 of his institutions’ employees, only one copy of the document is stored. In doing so, he managed to cut the space needed for archiving by 50 percent.
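The idea behind file-level deduplication can be sketched in a few lines. The example below is a toy, in-memory illustration and not the backup product Halamka uses: it assumes files are identified by a SHA-256 digest of their content, and the class and method names are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical byte content is kept only once."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}  # content digest -> stored bytes
        self.refs: dict[str, str] = {}     # logical file name -> content digest

    def put(self, name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, data)
        self.refs[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.refs[name]]

    def logical_bytes(self) -> int:
        """Size the data would occupy if every reference kept its own copy."""
        return sum(len(self.blobs[d]) for d in self.refs.values())

    def stored_bytes(self) -> int:
        """Size actually held after deduplication."""
        return sum(len(b) for b in self.blobs.values())


# One attachment sent to 5,000 mailboxes is stored a single time.
store = DedupStore()
attachment = b"quarterly-report" * 1000
for i in range(5000):
    store.put(f"mailbox-{i}/report.pdf", attachment)

print(store.logical_bytes(), store.stored_bytes())
```

Real savings depend on how much duplicate content exists; the 50 percent figure reported above reflects Halamka's actual data mix, not this toy case.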

Halamka also instituted hierarchical storage management, meaning data is prioritized based on its importance and how frequently it is accessed. As files age and access becomes less critical, they migrate from a high-availability, high-speed storage area network to network-attached storage (NAS) or content-addressed storage (CAS). Unused files eventually get moved to tape. Unread e-mail and old attachments are also archived to CAS. While Halamka’s institutions must keep medical records on hand for a long time, they are not required to permanently store e-mail, instant messages or personal files.
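A tiering rule of this kind can be expressed as a simple age-based policy. The sketch below is a minimal illustration under stated assumptions: the 30-day and 365-day thresholds are invented for the example, the tier labels merely echo the ones named above, and choose_tier is a hypothetical helper, not part of any storage product.

```python
from datetime import date

# Assumed policy: (maximum idle days, tier). Anything older falls through to tape.
TIERS = [
    (30, "SAN"),       # recently used files stay on fast, high-availability storage
    (365, "NAS/CAS"),  # less active files move to cheaper disk
]


def choose_tier(last_access: date, today: date) -> str:
    """Pick a storage tier from the number of days since a file was last accessed."""
    idle_days = (today - last_access).days
    for max_idle, tier in TIERS:
        if idle_days <= max_idle:
            return tier
    return "tape"


today = date(2008, 1, 30)
for last_access in (date(2008, 1, 29), date(2007, 6, 1), date(2005, 1, 1)):
    print(last_access, "->", choose_tier(last_access, today))
```

A production policy would also weigh the importance of the data and any retention mandates, as the paragraph above notes; access age is used alone here to keep the example short.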

And even though Halamka has been rigorous in his ILM strategy, he concedes that managing the demand for storage is difficult. “We have tried to enforce e-mail and storage quotas, but it is much easier to just increase the supply of storage than limit demand,” he says. “We’re continuing to try to strike a balance.”

Halamka, who uses a variety of EMC equipment (including a Symmetrix enterprise storage array, Clariion SAN disk array and Centera NAS system, along with Sun Microsystems’ StorageTek tape libraries), also took a step with his storage systems that might not be for everyone: spin-down and slow-down technologies. “Suppose that virtual-tape libraries are used for backup, but backups are only done four hours a day,” Halamka says. “The drives can be slowed or stopped when not in use.”

Halamka’s strategy is aggressive, but he has an ambitious goal. “I want to stay under 200 kilowatts for the next five years,” he says, “and I want to accommodate growth.”

The same can be said for ILM. It may not have all the tools it needs and is exposed to human failings, but ILM has a lot of growth to accommodate.