Analytics Helps Uncover the Oceans’ Secrets

By Michael Heaney

The J. Craig Venter Institute (JCVI), a leader in genomic research, is best known for the work of our founder and president, Dr. J. Craig Venter, and his team in decoding the first draft of the human genome and constructing the first synthetic cell.

Our organization continues to pioneer new avenues of genomic research, including the Global Ocean Sampling Expedition (GOS). Begun in 2003, the project involves traveling the world collecting and analyzing microorganisms found in seawater. The goal is to help scientists better understand the evolution of the oceans, microbial biodiversity, climate and environmental changes, and more. Through this research, more than 80 million genes and thousands of novel protein families have already been discovered.

Modern scientific and technological advances have enabled our GOS Expedition team to capture information at a much faster rate and at greater volumes than ever before. But the growth of this R&D data also created data management challenges for our organization: We needed a way to keep up with it without overspending or slowing the progress of the research project.

The GOS Expedition involves a massive volume of genomic data, and that data expands in size and scope as more research and analysis are done. Although the initial data loads (the information extracted from the ocean samples) range from approximately 20GB to 100GB, data sets can quickly expand to the terabyte level and beyond after the first rounds of processing and analysis.

In addition, increasingly sophisticated research technologies now allow us to capture more data overall. When the GOS Expedition first began in 2003, a data set that included 50,000 sequences was considered large. Today, a typical round of analysis may include 40 to 50 million sequences.

Because of the growing amount of genomic data the GOS team was collecting, we faced problems in efficiently and economically storing, loading and analyzing it all with our existing MySQL databases. Analytic query speed began to suffer, and our scientists were waiting hours, and sometimes days, for results to come in. In addition, we were spending considerable time manually indexing and partitioning data.

Time, Cost and Scalability

We recognized that we needed an analytics solution that could easily scale up as our scientists generated more data. As a not-for-profit organization, we also knew we needed an affordable solution. Purchasing more and more servers and disk storage subsystems would not be a cost-effective option for us over time.

Initially, that is what led us to MySQL, an open-source, row-oriented database. However, as the GOS Expedition grew in data size and diversity, that solution bogged down, and analysis slowed to a crawl. As a result, we started to look at columnar database technologies, which store data column-by-column instead of row-by-row. We found them to be far more efficient at running ad hoc investigative analysis on large amounts of data.
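
To illustrate the difference between the two layouts, here is a minimal Python sketch. The table, field names and values are hypothetical, chosen only for illustration, and the code says nothing about how MySQL or any columnar engine stores data internally. It simply shows that a query touching one field must visit every full record in a row layout, but only a single list in a column layout.

```python
# Hypothetical sample table: three sequence records (illustrative values only).
rows = [
    {"seq_id": 1, "site": "A", "length": 900},
    {"seq_id": 2, "site": "B", "length": 1200},
    {"seq_id": 3, "site": "A", "length": 750},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented scan: every record is visited in full to read one field.
total_from_rows = sum(row["length"] for row in rows)

# Column-oriented scan: only the one column the query needs is read.
total_from_columns = sum(columns["length"])
```

Both scans return the same total. The difference is how much data has to be read to get there, which matters more as tables grow wider and longer.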

Ultimately, we chose an analytic engine developed by Infobright, which comes in an open-source edition, as well as an affordable enterprise option. The data compression capabilities allowed us to analyze more data without having to invest in or maintain additional hardware and storage.

We can also run more complex and diverse types of investigative analysis, since the system makes it easy to create new queries as the research project evolves. This is important because scientific research is not like running a weekly report: The questions we need to ask are always changing. Finally, since the solution is designed to integrate seamlessly with MySQL, we knew it would work well in our existing information infrastructure.   

Since deploying Infobright, we have seen a tenfold improvement in analytic speed. Queries that used to take many minutes (or hours or even days) to process now come back in seconds. We have also been able to achieve data compression ratios of 10:1—and in many cases up to 14:1—allowing us to speed up analysis even more while also cutting down storage costs.
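
As a rough illustration of what those ratios mean for storage, the arithmetic can be sketched in Python. The 1,000GB raw data set size is a hypothetical round number, not a measured figure, and the function name is made up for this example.

```python
def compressed_size_gb(raw_gb, ratio):
    """Return the on-disk size in GB for a raw size and an N:1 compression ratio."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_gb / ratio


raw_gb = 1000  # hypothetical raw data set size, for illustration only
for ratio in (10, 14):
    print(f"{ratio}:1 -> about {compressed_size_gb(raw_gb, ratio):.0f}GB on disk")
```

Under these assumptions, a 1,000GB data set would occupy about 100GB at 10:1 and about 71GB at 14:1.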

From a maintenance perspective, the deployment has been a lifesaver. I used to spend several hours a week tinkering with MySQL queries and creating indexes and partitions to ensure that analyses could be performed. Now, data just needs to be loaded into Infobright, with no customization required, saving IT time and resources.

Our scientists who are analyzing data from the GOS Expedition can now ask more questions and change analytic parameters on the fly, because they are no longer limited by the bottlenecks that used to be created when queries became too complicated or required too much setup time to perform.  

In the end, an ability to ask more questions leads to more knowledge. Backed by a scalable, affordable and efficient database specifically designed for high-volume analytics, JCVI scientists can continue to study and gain new insights from an ocean of data.

Michael Heaney is the database manager at the J. Craig Venter Institute, in Rockville, Md.