Interactive Exploration on Large Genomic Datasets
Eric Tu
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2016-111
May 16, 2016
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-111.pdf
The prevalence of large genomics datasets has made the need to explore this data more important. Large sequencing projects like the 1000 Genomes Project, which reconstructed the genomes of 2,504 individuals sampled from 26 populations, have produced over 200 TB of publicly available data. Meanwhile, existing genomic visualization tools have been unable to scale with the growing volume of larger, more complex data. This difficulty is acute when viewing large regions (over 1 megabase, or 1,000,000 bases of DNA) or when concurrently viewing multiple samples of data. While genomic processing pipelines have shifted towards distributed computing techniques, as with ADAM, genomic visualization tools have not.
In this work we present Mango, a scalable genome browser built on top of ADAM that can run both locally and on a cluster. Mango combines several optimizations in a single application to drive novel genomic visualization techniques over terabytes of genomic data. By building visualization on top of a distributed processing pipeline, we can perform visualization queries over large regions that are not possible with current tools, and decrease the time for viewing large datasets. Mango is part of the Big Data Genomics project at the University of California, Berkeley and is published under the Apache 2 license. Mango is available at https://github.com/bigdatagenomics/mango
Advisor: David A. Patterson
BibTeX citation:
@mastersthesis{Tu:EECS-2016-111,
    Author = {Tu, Eric},
    Title = {Interactive Exploration on Large Genomic Datasets},
    School = {EECS Department, University of California, Berkeley},
    Year = {2016},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-111.html},
    Number = {UCB/EECS-2016-111},
    Abstract = {The prevalence of large genomics datasets has made the need to explore this data more important. Large sequencing projects like the 1000 Genomes Project, which reconstructed the genomes of 2,504 individuals sampled from 26 populations, have produced over 200 TB of publicly available data. Meanwhile, existing genomic visualization tools have been unable to scale with the growing volume of larger, more complex data. This difficulty is acute when viewing large regions (over 1 megabase, or 1,000,000 bases of DNA) or when concurrently viewing multiple samples of data. While genomic processing pipelines have shifted towards distributed computing techniques, as with ADAM, genomic visualization tools have not. In this work we present Mango, a scalable genome browser built on top of ADAM that can run both locally and on a cluster. Mango combines several optimizations in a single application to drive novel genomic visualization techniques over terabytes of genomic data. By building visualization on top of a distributed processing pipeline, we can perform visualization queries over large regions that are not possible with current tools, and decrease the time for viewing large datasets. Mango is part of the Big Data Genomics project at the University of California, Berkeley and is published under the Apache 2 license. Mango is available at https://github.com/bigdatagenomics/mango},
}
EndNote citation:
%0 Thesis
%A Tu, Eric
%T Interactive Exploration on Large Genomic Datasets
%I EECS Department, University of California, Berkeley
%D 2016
%8 May 16
%@ UCB/EECS-2016-111
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-111.html
%F Tu:EECS-2016-111