Managing Big Data in the Cloud

By Bob Violino

Two of the hottest IT trends today are the move to cloud computing and the emergence of big data as a key initiative for leveraging information. For some enterprises, the two trends are converging as they try to manage and analyze big data in their cloud deployments.

“Our research with respect to the interaction between big data and cloud suggests that the dominant sentiment among developers is that big data is a natural component of the cloud,” says Ben Hanley, senior analyst at research firm Evans Data. Companies are increasingly using cloud deployments to address big data and analytics needs, he says, adding, “We have observed significant growth with respect to the interaction between cloud and big data.”

Geostellar, a Washington, D.C., company that provides computations of available renewable-energy resources for geographic locations, is involved in both the cloud and big data. The company has had to develop strategies—including the use of cloud services—to store, process and move the petabytes of information in various formats that it processes and provides to customers.

The company didn’t move to the cloud until about a year and a half ago. It started out by providing data to customers via hard drives. Later it implemented on-site virtualized servers and moved them into hosted environments, and then migrated to the cloud.

“All of the data we’re processing has to be centralized in our operations center,” says CEO David Levine, “because the various fields are so large, and it’s much more efficient in terms of the proximity of dedicated CPUs and disk drives for reading and writing and processing configurations.”

Before the company processes data internally, various sources ship raw data sets via hard drives sent by overnight delivery or some other means. “We take all these different data assets and create data structures, so when the customer looks up [a particular] property, he has the profile he needs,” Levine explains. That applies whether the data concerns weather patterns or the resources available in the area being examined.

The data Geostellar collects isn’t moved within the cloud because of its large size. “We’ve got these very large files—imagery, surface models, databases, etc.—and we have to aggregate all of this information,” Levine says. “And people are still shipping that to us on hard drives because of the bandwidth.”
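
To make the bandwidth constraint concrete, the rough arithmetic can be sketched as below. This is only an illustration: the data-set size, link speed and efficiency factor are hypothetical values chosen for the example, not figures reported by Geostellar.

```python
def transfer_days(size_terabytes: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the days needed to send a data set over a network link.

    All inputs are hypothetical, for illustration only.
    size_terabytes: data-set size in decimal terabytes (1 TB = 1e12 bytes)
    link_mbps:      nominal link speed in megabits per second
    efficiency:     assumed fraction of the nominal speed actually achieved
    """
    total_bits = size_terabytes * 1e12 * 8            # bytes to bits
    effective_bits_per_second = link_mbps * 1e6 * efficiency
    seconds = total_bits / effective_bits_per_second
    return seconds / 86_400                            # seconds per day


if __name__ == "__main__":
    # Hypothetical case: a 10 TB imagery set over a 100 Mbps link.
    print(f"{transfer_days(10, 100):.1f} days")
```

Under these assumed numbers the upload takes well over a week, while a drive sent by overnight delivery arrives the next day, which is the trade-off Levine describes.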

Once processing of the data is complete, Geostellar streams it over to the cloud, and then customers can access and interact with the data from there. “We [and customers] can work with data in the cloud because we’ve already created all these interrelated structures,” Levine says.

Over time, Geostellar has developed its process of gathering and analyzing large volumes of information, producing connected spatial-relational data sets and then moving the data from its data centers to the cloud.

The company now operates two separate infrastructures: a highly efficient processing system that includes solid-state drives and powerful, dedicated servers, and a virtualized, cloud-based environment used for managing the information it produces through computation. The cloud is critical for distributing and providing access to this data, Levine says.

“Probably the biggest benefit of the cloud is that it’s much easier to manage capacity,” he says. “You can stay ahead of whatever trends are happening.” There’s also resiliency in terms of long-term storage of the data.

The cost saving is another benefit. “It’s [the service provider’s] excess capacity we’re using, and the memory is cheaper than if we had procured our own systems and set up our own nodes,” Levine says.