Scalable Parallel Algorithms for Genome Analysis

Evangelos Georganas

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2016-135
August 3, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-135.pdf

A critical problem for computational genomics is the problem of de novo genome assembly: the development of robust scalable methods for transforming short randomly sampled “shotgun” sequences, namely reads, into the contiguous and accurate reconstruction of complex genomes. These reads are significantly shorter (e.g. hundreds of bases long) than the size of chromosomes and also include errors. While advanced methods exist for assembling the small and haploid genomes of prokaryotes, the genomes of eukaryotes are more complex. Moreover, de novo assembly has been unable to keep pace with the flood of data, due to the dramatic increases in genome sequencer capabilities, combined with the computational requirements and the algorithmic complexity of assembling large scale genomes and metagenomes.

In this dissertation, we address this challenge head on by developing parallel algorithms for de novo genome assembly with the ambition to scale to massive concurrencies. Our work is based on the Meraculous assembler, a state-of-the-art de novo assembler for short reads developed at JGI. Meraculous identifies non-erroneous overlapping substrings of length k (k-mers) with high quality extensions and uniquely assembles genome regions into uncontested sequences called contigs by constructing and traversing a de Bruijn graph of k-mers, a special graph that is used to represent overlaps among k-mers. The original reads are subsequently aligned onto the contigs to obtain information regarding the relative orientation of the contigs. Contigs are then linked together to create scaffolds, sequences of contigs that may contain gaps among them. Finally gaps are filled using localized assemblies based on the original reads.

First, we design efficient scalable algorithms for k-mer analysis and contig generation. K-mer analysis is characterized by intensive communication and I/O requirements and our parallel algorithms successfully reduce the memory requirements by 7×. Then, contig generation relies on efficient parallelization of the de Bruijn graph construction and traversal, which necessitates a distributed hash table and is a key component of most de novo assemblers. We present a novel algorithm that leverages one-sided communication capabilities of the UPC to facilitate the requisite fine-grained, irregular parallelism and the avoidance of data hazards. The sequence alignment is characterized by intensive I/O and large computation requirements. We introduce mer-Aligner, a highly parallel sequence aligner that employs parallelism in all of its components. Finally, this thesis details the parallelization of the scaffolding modules, enabling the first massively scalable, high quality, complete end-to-end de novo assembly pipeline. Experimental large-scale results using human and wheat genomes demonstrate efficient performance and scalability on thousands of cores. Compared to the original Meraculous code, which requires approximately 48 hours to assemble the human genome, our pipeline called HipMer computes the assembly in only 4 minutes using 23,040 cores of Edison – an overall speedup of approximately 720×.

In the last part of the dissertation we tackle the problem of metagenome assembly. Metagenomics is currently the leading technology to study the uncultured microbial diversity. While accessing an unprecedented number of environmental samples that consist of thousands of individual microbial genomes is now possible, the bottleneck is becoming computational, since the sequencing cost improvements exceed that of Moore’s Law. Metagenome assembly is further complicated by repeated sequences across genomes, polymorphisms within a species and variable frequency of the genomes within the sample. In our work we repurpose HipMer components for the problem of metagenome assembly and we design a versatile, high-performance metagenome assembly pipeline that outperforms state-of-the-art tools in both quality and performance.

Advisor: Katherine A. Yelick


BibTeX citation:

@phdthesis{Georganas:EECS-2016-135,
    Author = {Georganas, Evangelos},
    Title = {Scalable Parallel Algorithms for Genome Analysis},
    School = {EECS Department, University of California, Berkeley},
    Year = {2016},
    Month = {Aug},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-135.html},
    Number = {UCB/EECS-2016-135},
    Abstract = {A critical problem for computational genomics is the problem of de novo genome assembly: the development of robust scalable methods for transforming short randomly sampled “shotgun” sequences, namely reads, into the contiguous and accurate reconstruction of complex genomes. These reads are significantly shorter (e.g. hundreds of bases long) than the size of chromosomes and also include errors. While advanced methods exist for assembling the small and haploid genomes of prokaryotes, the genomes of eukaryotes are more complex. Moreover, de novo assembly has been unable to keep pace with the flood of data, due to the dramatic increases in genome sequencer capabilities, combined with the computational requirements and the algorithmic complexity of assembling large scale genomes and metagenomes.

In this dissertation, we address this challenge head on by developing parallel algorithms for de novo genome assembly with the ambition to scale to massive concurrencies. Our work is based on the Meraculous assembler, a state-of-the-art de novo assembler for short reads developed at JGI. Meraculous identifies non-erroneous overlapping substrings of length k (k-mers) with high quality extensions and uniquely assembles genome regions into uncontested sequences called contigs by constructing and traversing a de Bruijn graph of k-mers, a special graph that is used to represent overlaps among k-mers. The original reads are subsequently aligned onto the contigs to obtain information regarding the relative orientation of the contigs. Contigs are then linked together to create scaffolds, sequences of contigs that may contain gaps among them. Finally gaps are filled using localized assemblies based on the original reads.

First, we design efficient scalable algorithms for k-mer analysis and contig generation. K-mer analysis is characterized by intensive communication and I/O requirements and our parallel algorithms successfully reduce the memory requirements by 7×. Then, contig generation relies on efficient parallelization of the de Bruijn graph construction and traversal, which necessitates a distributed hash table and is a key component of most de novo assemblers. We present a novel algorithm that leverages one-sided communication capabilities of the UPC to facilitate the requisite fine-grained, irregular parallelism and the avoidance of data hazards. The sequence alignment is characterized by intensive I/O and large computation requirements. We introduce mer-Aligner, a highly parallel sequence aligner that employs parallelism in all of its components. Finally, this thesis details the parallelization of the scaffolding modules, enabling the first massively scalable, high quality, complete end-to-end de novo assembly pipeline. Experimental large-scale results using human and wheat genomes demonstrate efficient performance and scalability on thousands of cores. Compared to the original Meraculous code, which requires approximately 48 hours to assemble the human genome, our pipeline called HipMer computes the assembly in only 4 minutes using 23,040 cores of Edison – an overall speedup of approximately 720×.

In the last part of the dissertation we tackle the problem of metagenome assembly. Metagenomics is currently the leading technology to study the uncultured microbial diversity. While accessing an unprecedented number of environmental samples that consist of thousands of individual microbial genomes is now possible, the bottleneck is becoming computational, since the sequencing cost improvements exceed that of Moore’s Law. Metagenome assembly is further complicated by repeated sequences across genomes, polymorphisms within a species and variable frequency of the genomes within the sample. In our work we repurpose HipMer components for the problem of metagenome assembly and we design a versatile, high-performance metagenome assembly pipeline that outperforms state-of-the-art tools in both quality and performance.}
}

EndNote citation:

%0 Thesis
%A Georganas, Evangelos
%T Scalable Parallel Algorithms for Genome Analysis
%I EECS Department, University of California, Berkeley
%D 2016
%8 August 3
%@ UCB/EECS-2016-135
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-135.html
%F Georganas:EECS-2016-135