Interactive Exploration on Large Genomic Datasets

Eric Tu

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2016-111
May 16, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-111.pdf

The prevalence of large genomics datasets has made the need to explore this data more important. Large sequencing projects like the 1000 Genomes Project, which reconstructed the genomes of 2,504 individuals sampled from 26 populations, have produced over 200TB of publicly available data. Meanwhile, existing genomic visualization tools have been unable to scale with the growing amount of larger, more complex data. This difficulty is especially acute when viewing large regions (over 1 megabase, or 1,000,000 bases of DNA), or when concurrently viewing multiple samples of data. While genomic processing pipelines such as ADAM have shifted towards distributed computing techniques, genomic visualization tools have not.

In this work we present Mango, a scalable genome browser built on top of ADAM that can run both locally and on a cluster. Mango combines several optimizations in a single application to drive novel genomic visualization techniques over terabytes of genomic data. By building visualization on top of a distributed processing pipeline, we can perform visualization queries over large regions that are not possible with current tools, and decrease the time for viewing large data sets. Mango is part of the Big Data Genomics project at the University of California, Berkeley and is published under the Apache 2 license. Mango is available at https://github.com/bigdatagenomics/mango

Advisor: David A. Patterson


BibTeX citation:

@mastersthesis{Tu:EECS-2016-111,
    Author = {Tu, Eric},
    Title = {Interactive Exploration on Large Genomic Datasets},
    School = {EECS Department, University of California, Berkeley},
    Year = {2016},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-111.html},
    Number = {UCB/EECS-2016-111},
    Abstract = {The prevalence of large genomics datasets has made the need to explore this data more
important. Large sequencing projects like the 1000 Genomes Project, which reconstructed the
genomes of 2,504 individuals sampled from 26 populations, have produced over 200TB of
publicly available data. Meanwhile, existing genomic visualization tools have been unable to
scale with the growing amount of larger, more complex data. This difficulty is especially acute when
viewing large regions (over 1 megabase, or 1,000,000 bases of DNA), or when concurrently
viewing multiple samples of data. While genomic processing pipelines such as ADAM have
shifted towards distributed computing techniques, genomic visualization tools
have not.

In this work we present Mango, a scalable genome browser built on top of ADAM that can run
both locally and on a cluster. Mango combines several optimizations in a single
application to drive novel genomic visualization techniques over
terabytes of genomic data. By building visualization on top of a distributed processing pipeline,
we can perform visualization queries over large regions that are not possible with current tools,
and decrease the time for viewing large data sets. Mango is part of the Big Data Genomics
project at the University of California, Berkeley and is published under the Apache 2 license.
Mango is available at https://github.com/bigdatagenomics/mango}
}

EndNote citation:

%0 Thesis
%A Tu, Eric
%T Interactive Exploration on Large Genomic Datasets
%I EECS Department, University of California, Berkeley
%D 2016
%8 May 16
%@ UCB/EECS-2016-111
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-111.html
%F Tu:EECS-2016-111