SiRen: Leveraging Similar Regions for Efficient & Accurate Variant Calling

Kristal Curtis and Ameet Talwalkar and Matei Zaharia and Armando Fox and David A. Patterson

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2015-159

May 30, 2015

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-159.pdf

Next-generation genomic sequencing costs are rapidly decreasing, having recently reached the $1000-per-genome barrier, a likely tipping point for widespread clinical use. However, genomic analysis techniques have failed to keep pace. In particular, the process of variant calling, or inferring a sample genome from the noisy sequencing data, introduces major computational and statistical challenges. In this work, we explore the feasibility of a hybrid approach that addresses these challenges by partitioning the genome into easier and harder regions, deploying efficient algorithms on the easier regions, and relying on more expensive and accurate technologies in the harder regions. We propose that near duplication, or similarity, in the genome is a natural signal for identifying harder regions, and then present a large-scale distributed clustering approach to identify these similar regions. We perform an extensive empirical study illustrating the effectiveness of existing variant calling algorithms on the easier regions and their contrasting struggles on the similar regions. We also confirm that the similar regions are sufficiently disjoint, thus providing the opportunity for sophisticated analysis of these regions in an embarrassingly parallel manner.

BibTeX citation:

@techreport{Curtis:EECS-2015-159,
    Author= {Curtis, Kristal and Talwalkar, Ameet and Zaharia, Matei and Fox, Armando and Patterson, David A.},
    Title= {SiRen: Leveraging Similar Regions for Efficient \& Accurate Variant Calling},
    Year= {2015},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-159.html},
    Number= {UCB/EECS-2015-159},
    Abstract= {Next-generation genomic sequencing costs are rapidly decreasing, having recently reached the $1000-per-genome barrier, a likely tipping point for widespread clinical use. However, genomic analysis techniques have failed to keep pace. In particular, the process of variant calling, or inferring a sample genome from the noisy sequencing data, introduces major computational and statistical challenges. In this work, we explore the feasibility of a hybrid approach that addresses these challenges by partitioning the genome into easier and harder regions, deploying efficient algorithms on the easier regions, and relying on more expensive and accurate technologies in the harder regions. We propose that near duplication, or similarity, in the genome is a natural signal for identifying harder regions, and then present a large-scale distributed clustering approach to identify these similar regions.
We perform an extensive empirical study illustrating the effectiveness of existing variant calling algorithms on the easier regions and their contrasting struggles on the similar regions. We also confirm that the similar regions are sufficiently disjoint, thus providing the opportunity for sophisticated analysis of these regions in an embarrassingly parallel manner.},
}

EndNote citation:

%0 Report
%A Curtis, Kristal 
%A Talwalkar, Ameet 
%A Zaharia, Matei 
%A Fox, Armando 
%A Patterson, David A. 
%T SiRen: Leveraging Similar Regions for Efficient & Accurate Variant Calling
%I EECS Department, University of California, Berkeley
%D 2015
%8 May 30
%@ UCB/EECS-2015-159
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-159.html
%F Curtis:EECS-2015-159