High-Performance Computing Powers Genome Project
By Eileen McCooey | Posted 2017-08-16
A major biomedical research center relies on an HPC system to run a genome project that will allow personalized healthcare for the people of Qatar.
Biomedical research is one of the most dynamic, fast-paced fields of science, and it takes a lot of computing power to keep pace. Sidra Medical and Research Center (Sidra), a hospital, biomedical research and educational institution in Qatar, is relying on a customized high-performance computing (HPC) system to provide the digital muscle its high-tech work demands.
Developing a robust infrastructure "from the ground up" was a real challenge given the organization's ambitious goals, says Dr. Mohamed-Ramzi Temanni, manager of Sidra's Bioinformatics Technical Group since 2013, the year it was established. The organization needed a technology partner to manage and store clinical genome sequencing data and to provide biomedical informatics technology infrastructure capabilities that would serve as a national resource.
"In the U.S., you have a broad choice of vendors, but that's not the case in Qatar," Temanni explains. "Finding a vendor with a local presence and high-quality support played a very important part in our selection process."
The nature of the required system also raised the stakes. "Buying a complex HPC system is not like buying a small computer," he adds. "You can't DIY a solution by adding a new component. There's no room for downtime, so you need a partner with a strong team that can quickly jump in and support you."
Customized for Genomics
The Sidra team decided that IBM was the right fit, and the vendor brought in a team of experts to line up the options and solutions and to design the infrastructure. The first iteration of the system, installed in 2014, was set up specifically for genomics.
"Our needs are different from those of an organization looking to run simulations or do forecasting," Temanni points out. Among other things, Sidra is dealing with voluminous data, huge address space, high dimensionality of data and a large number of applications with fast turnover, including mostly academic open-source software.
The initial setup included an IBM HPC system and head nodes, tools to monitor and automatically deploy clusters, 0.5 petabytes (PB) of GPFS (General Parallel File System) storage and a network. Fine-tuning such a complex setup required extensive communication with IBM and constant tweaking over the course of months. By the end of the year, the system was stable and running without glitches.
One of the first projects Sidra tackled with the platform was the Qatar Genome Program (QGP), a national medical research project designed to develop personalized healthcare therapies for the Qatari population. Sidra is responsible for sequencing, analyzing and providing the data management for whole genome sequences from the population.
For the first phase of the QGP, Sidra sequenced samples from more than 3,000 volunteers who offered their DNA for analysis. Each sample yielded 70 to 100GB of genetic sequencing data and took up to 85 hours to process.
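To put those figures in perspective, here is a rough back-of-envelope sketch in Python. It assumes exactly 3,000 samples (the article says "more than"), decimal units (1TB = 1,000GB) and that the 85-hour figure is a per-sample ceiling; the function and constant names are illustrative only and do not come from Sidra.

```python
# Back-of-envelope estimate for Phase I, using only the figures quoted above.
# Assumptions: exactly 3,000 samples, 1 TB = 1,000 GB, 85 hours is a per-sample maximum.
SAMPLES = 3_000
GB_PER_SAMPLE_MIN = 70
GB_PER_SAMPLE_MAX = 100
MAX_HOURS_PER_SAMPLE = 85


def phase_one_estimate(samples=SAMPLES):
    """Return (low TB, high TB, upper bound on total processing hours)."""
    low_tb = samples * GB_PER_SAMPLE_MIN / 1000
    high_tb = samples * GB_PER_SAMPLE_MAX / 1000
    max_hours = samples * MAX_HOURS_PER_SAMPLE
    return low_tb, high_tb, max_hours


low_tb, high_tb, max_hours = phase_one_estimate()
print(f"Raw data: {low_tb:.0f} to {high_tb:.0f} TB")
print(f"Processing: up to {max_hours:,} sample-hours")
```

Under those assumptions, Phase I amounts to roughly 210 to 300TB of raw sequencing data and up to 255,000 sample-hours of processing if the samples were handled one after another, which is why the work has to be spread across many compute nodes.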
Failures at any point in the analysis of the data would have required restarting each job from the beginning, but the IBM Spectrum software provided high reliability to manage the application pipeline. That helped Sidra complete the first phase in mid-2016, ahead of schedule.
The institution has embarked on the sequencing and analysis of another 3,000 samples for Phase II of the QGP, and Temanni expects to complete that before year's end. An updated tool due this month should cut file size and processing time by as much as half. With the new tool, the team will have to redo only jobs that might fail, not the entire analysis.
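The idea of redoing only the jobs that fail can be illustrated with a minimal Python sketch. This is not IBM Spectrum software or the updated tool the article mentions; `run_only_failed` and the `run_job` callback are hypothetical names, and a real scheduler would also persist intermediate results between attempts.

```python
def run_only_failed(job_ids, run_job, max_rounds=3):
    """Run every job once, then rerun only the jobs that failed.

    run_job is a hypothetical callable that processes one sample and
    raises an exception on failure. Returns (results, unfinished, attempts).
    """
    results = {}
    attempts = {job_id: 0 for job_id in job_ids}
    pending = list(job_ids)

    for _ in range(max_rounds):
        if not pending:
            break
        still_failing = []
        for job_id in pending:
            attempts[job_id] += 1
            try:
                results[job_id] = run_job(job_id)
            except Exception:
                # Keep the failed job for the next round; finished jobs are untouched.
                still_failing.append(job_id)
        pending = still_failing

    return results, pending, attempts
```

With this approach, one failed sample costs one extra run of that sample rather than a restart of the whole batch.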
More Nodes, Servers and Storage
Sidra has expanded its system several times since the outset, increasing the number of HPC compute nodes and upgrading the infrastructure for servers and storage. It now has more than 4,000 cores and 3PB of storage, six times the 0.5PB it started with. An archiving system with robotic functionality enables the recovery of large genomic data sets in case of data loss and the almost instant retrieval of large genomic files held in the active archive.
Overall, Sidra has reduced its time to completion for long-running jobs while increasing its resource utilization. The current infrastructure can handle the existing workload, but it is likely that Sidra's system will evolve and grow over time.
"Bioinformation changes fast and drastically," Temanni points out, "so we need to make sure we have the latest technologies and pipeline. Our challenge is keeping a production line running, while also doing constant R&D and migrating to new systems."