Distributed Visualization for Genomic Analysis

Alyssa Morrow

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2017-82

May 12, 2017

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-82.pdf

The transition from Sanger to second- and third-generation sequencing technologies in the past decade has led to a dramatic increase in the availability of genomic data. The 1000 Genomes Project provides over 1.6 terabytes of variant data and 14 terabytes of alignment data, laying the foundation for large-scale exploration of human variation across 2,504 individuals. Sequencing is useful beyond identifying DNA variation: the ENCODE Consortium project has collected 20 terabytes of sequencing data across various assays, which has enabled novel insights into the role of epigenetics in human disease. However, current genomic visualization tools are intended for a single-node environment and cannot scale to terabyte-scale datasets. To enable visualization of terabyte-scale genomic datasets, we develop Mango. Mango is a visualization tool that selectively materializes and organizes genomic data to provide fast in-memory queries driving genomic visualization. Mango materializes data from persistent storage as the user requests different regions of the genome, and efficiently organizes data in memory using interval arrays, an optimized data structure derived from interval trees. This interval-based organizational structure supports ad hoc queries, filters, and joins across multiple samples at a time, enabling exploratory interaction with genomic data. When used in conjunction with Apache Spark, Mango allows users to query large datasets and predictive models built from such datasets, while exploring results in real time.
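
To make the idea of an interval-based in-memory index concrete, the following is a minimal sketch of a sorted interval array with an overlap query. It is not Mango's implementation, which the abstract does not specify. The class name `IntervalArray`, the method `overlapping`, the half-open coordinate convention, and the sample records are all assumptions made for illustration. The running maximum of end coordinates stands in for the "max end" annotation that interval trees keep on their nodes.

```python
from bisect import bisect_left
from itertools import accumulate


class IntervalArray:
    """Hypothetical flat store of half-open intervals [start, end) with payloads."""

    def __init__(self, records):
        # records: iterable of (start, end, value) tuples, sorted here by coordinates
        self._records = sorted(records, key=lambda r: (r[0], r[1]))
        self._starts = [r[0] for r in self._records]
        # Running maximum of end coordinates over the sorted records.
        self._max_end = list(accumulate((r[1] for r in self._records), max))

    def overlapping(self, start, end):
        """Return the records that overlap [start, end), in sorted order."""
        # Every record at index >= hi starts at or after `end`, so it cannot overlap.
        hi = bisect_left(self._starts, end)
        hits = []
        i = hi - 1
        # Walk left; stop once no record at or before index i can reach past `start`.
        while i >= 0 and self._max_end[i] > start:
            if self._records[i][1] > start:
                hits.append(self._records[i])
            i -= 1
        hits.reverse()
        return hits


# Made-up example records: (start, end, label)
reads = IntervalArray([
    (100, 150, "read1"),
    (120, 180, "read2"),
    (300, 350, "read3"),
    (10, 400, "long"),
])
region = reads.overlapping(160, 200)
```

In this sketch a region query costs one binary search plus a backward scan that ends as soon as the running maximum falls at or below the query start. A single very long interval near the front of the array can make that scan linear, which is one reason a production structure would likely add further partitioning.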

Advisors: Anthony D. Joseph and Nir Yosef

BibTeX citation:

@mastersthesis{Morrow:EECS-2017-82,
    Author= {Morrow, Alyssa},
    Editor= {Joseph, Anthony D. and Yosef, Nir},
    Title= {Distributed Visualization for Genomic Analysis},
    School= {EECS Department, University of California, Berkeley},
    Year= {2017},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-82.html},
    Number= {UCB/EECS-2017-82},
    Abstract= {The transition from Sanger to second- and third-generation sequencing technologies in the past decade has led to a dramatic increase in the availability of genomic data. The 1000 Genomes Project provides over 1.6 terabytes of variant data and 14 terabytes of alignment data, laying the foundation for large-scale exploration of human variation across 2,504 individuals. Sequencing is useful beyond identifying DNA variation: the ENCODE Consortium project has collected 20 terabytes of sequencing data across various assays, which has enabled novel insights into the role of epigenetics in human disease. However, current genomic visualization tools are intended for a single-node environment and cannot scale to terabyte-scale datasets. To enable visualization of terabyte-scale genomic datasets, we develop Mango. Mango is a visualization tool that selectively materializes and organizes genomic data to provide fast in-memory queries driving genomic visualization. Mango materializes data from persistent storage as the user requests different regions of the genome, and efficiently organizes data in memory using interval arrays, an optimized data structure derived from interval trees. This interval-based organizational structure supports ad hoc queries, filters, and joins across multiple samples at a time, enabling exploratory interaction with genomic data. When used in conjunction with Apache Spark, Mango allows users to query large datasets and predictive models built from such datasets, while exploring results in real time.},
}

EndNote citation:

%0 Thesis
%A Morrow, Alyssa 
%E Joseph, Anthony D. 
%E Yosef, Nir 
%T Distributed Visualization for Genomic Analysis
%I EECS Department, University of California, Berkeley
%D 2017
%8 May 12
%@ UCB/EECS-2017-82
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-82.html
%F Morrow:EECS-2017-82