Managing Big Data in the Cloud
By Bob Violino
Two of the hottest IT trends today are the move to cloud computing and the emergence of big data as a key initiative for leveraging information. For some enterprises, these trends are converging as they try to manage and analyze big data in their cloud deployments.
"Our research with respect to the interaction between big data and cloud suggests that the dominant sentiment among developers is that big data is a natural component of the cloud," says Ben Hanley, senior analyst at research firm Evans Data. Companies are increasingly using cloud deployments to address big data and analytics needs, he says, adding, "We have observed significant growth with respect to the interaction between cloud and big data."
Geostellar, a Washington, D.C., company that provides computations of available renewable-energy resources for geographic locations, is involved in both the cloud and big data. The company has had to develop strategies—including the use of cloud services—to store, process and move the petabytes of information in various formats that it processes and provides to customers.
The company didn't move to the cloud until about a year and a half ago. It started out by providing data to customers via hard drives. Later it implemented on-site virtualized servers and moved them into hosted environments, and then migrated to the cloud.
"All of the data we're processing has to be centralized in our operations center," says CEO David Levine, "because the various fields are so large, and it's much more efficient in terms of the proximity of dedicated CPUs and disk drives for reading and writing and processing configurations."
Before the company processes data internally, various sources ship raw data sets via hard drives sent by overnight delivery or some other means. "We take all these different data assets and create data structures, so when the customer looks up [a particular] property, he has the profile he needs," Levine explains. That applies regardless of whether it's weather patterns or available resources in the area being examined.
The data Geostellar collects isn't moved within the cloud because of its large size. "We've got these very large files—imagery, surface models, databases, etc.—and we have to aggregate all of this information," Levine says. "And people are still shipping that to us on hard drives because of the bandwidth."
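The bandwidth tradeoff Levine describes can be sketched with rough arithmetic. The short Python example below estimates how many days a data set would take to move over a network link; the data set size, link speeds and efficiency factor are illustrative assumptions, not figures from Geostellar.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_terabytes: float,
                  link_megabits_per_sec: float,
                  efficiency: float = 0.8) -> float:
    """Estimate the days needed to send a data set over a network link.

    Uses decimal units (1 TB = 10**12 bytes). `efficiency` is an assumed
    fraction of the nominal link speed that is actually achieved.
    """
    size_bits = size_terabytes * 1e12 * 8
    effective_bits_per_sec = link_megabits_per_sec * 1e6 * efficiency
    return size_bits / effective_bits_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical 10 TB delivery over three assumed link speeds.
    for mbps in (100, 1_000, 10_000):
        days = transfer_days(10, mbps)
        print(f"10 TB over {mbps} Mbit/s: about {days:.1f} days")
```

Under these assumptions, a multi-terabyte delivery over a modest link takes far longer than the single day an overnight courier needs, which is consistent with suppliers still shipping drives.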
Once processing of the data is complete, Geostellar streams it over to the cloud, and then customers can access and interact with the data from there. "We [and customers] can work with data in the cloud because we've already created all these interrelated structures," Levine says.
Over time, Geostellar has developed its process of gathering and analyzing large volumes of information, producing connected spatial-relational data sets and then moving the data from its data centers to the cloud.
The company now operates two separate infrastructures: a highly efficient processing system that includes solid-state drives and powerful, dedicated servers, and a virtualized, cloud-based environment used for managing the information it produces through computation. The cloud is critical for distributing and providing access to this data, Levine says.
"Probably the biggest benefit of the cloud is that it's much easier to manage capacity," he says. "You can stay ahead of whatever trends are happening." There's also resiliency in terms of long-term storage of the data.
Cost savings are another benefit. "It's [the service provider's] excess capacity we're using, and the memory is cheaper than if we had procured our own systems and set up our own nodes," Levine says.
Collecting Data From Around the World
Another organization using big data in the cloud is the Virginia Bioinformatics Institute (VBI), a research institute in Blacksburg, Va. VBI conducts genome analysis and DNA sequencing using about 100 terabytes of data that's collected each week from around the world.
"Our largest project is the downloading and reanalysis of every sequenced human genome to identify new biomarkers and drug targets, especially for cancer," says Skip Garner, executive director and professor at VBI. "We are analyzing approximately 100 genomes per day, and these are all downloaded from the cloud."
Data generated from various scientific sources is downloaded and then analyzed on VBI servers. "Recently, it has become easier and more efficient to download what we need and not keep local copies, for it amounts to tens of petabytes," Garner says. "So the cloud has enabled us to download, use and throw away raw data to save space, and then download again if necessary."
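The download, use and discard pattern Garner describes can be sketched in a few lines of Python. This is a minimal illustration, not VBI's actual pipeline: `process_and_discard`, `fetch` and `analyze` are hypothetical names, and the caller supplies the functions that download the raw data and compute the result.

```python
import os
import tempfile
from typing import Callable


def process_and_discard(fetch: Callable[[str], None],
                        analyze: Callable[[str], dict]) -> dict:
    """Download raw data to a temporary file, analyze it, then delete it.

    `fetch(path)` is expected to write the raw data to `path`;
    `analyze(path)` returns a small derived result worth keeping.
    Only that result survives; the raw copy is always removed.
    """
    fd, path = tempfile.mkstemp(suffix=".raw")
    os.close(fd)  # the fetch function reopens the file by path
    try:
        fetch(path)
        return analyze(path)
    finally:
        # Free local storage even if the download or analysis fails.
        if os.path.exists(path):
            os.remove(path)
```

Because the raw file is removed in the `finally` clause, local storage is reclaimed whether or not the analysis succeeds, and the same function can be called again later if the data needs to be downloaded a second time.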
The institute hasn't used cloud compute resources for the research work because its codes "are memory hogs, requiring servers with at least a terabyte of RAM," he explains.
Managing big data in the cloud does come with challenges, Garner points out. The big issues are security and intellectual property. For example, VBI has permission to download certain data sets, and, in those agreements, it must maintain control, allowing only certain people to have access to the data.
"We can be absolutely sure of where the data is when it is in our servers, and we are confident that we are adhering to the terms of agreements," Garner says. "That is not [the case] when data is in the cloud. So, currently, we do not put data in the cloud, we only download."
Downloading and using data from the cloud saves VBI a lot on storage costs, and the return on investment was "immediate," according to Garner.
As organizations approach big data, their first choice for compute and storage platforms should be the cloud, says Chris Smith, U.S. federal chief technology and innovation officer at New York-based Accenture, a global management consulting company.
"Low cost, highly scalable and elastic capabilities are the right formula for implementing big data," Smith says. "In some cases, a big data solution in a highly secure environment may dictate an internal data center [strategy], but most organizations are developing their own internal private clouds, and this is the right place for those specific solutions as well."
Organizations continue to adopt and implement private, public and hybrid clouds, "with these technologies having become mainstream choices for developing new capabilities," Smith says. "I expect to see increased and even more rapid adoption over the next 18 to 24 months."
As organizations increase the breadth and depth of business technology offerings in the cloud, Smith says, they need to ensure that they can manage information across multiple heterogeneous environments so they can clearly develop, analyze and articulate the state of the business, as well as provide highly available, high-performing services that deliver value.
"A robust cloud brokering and orchestration capability that puts the organization in the driver's seat to maintain, deliver and innovate new and better services will be key for the enterprise," Smith says.
The cloud itself will continue to generate lots of data, says London-based research firm Ovum. In "2013 Trends to Watch: Cloud Computing," the firm says that 2013 will see cloud computing continue to grow rapidly. Cloud computing in all its types—public, private and hybrid—is building momentum, evolving fast and becoming increasingly enterprise-grade, Ovum says.
Cloud computing services—and the social and mobile applications that cloud platforms underpin—are generating a lot of data, which, in turn, requires cloud services and applications to make sense of it, Ovum notes.
This trend is fueling other industry trends, such as the Internet of things (machine-to-machine communication and data processing), consumerization of IT and big data.