Wei-Chun Kao

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2011-99

September 2, 2011

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-99.pdf

Recent advances of DNA sequencing technologies are allowing researchers to generate immense amounts of data in a fast and cost effective fashion, enabling genome-wide association study and population genetic research which could not be done a decade ago. There are quite numerous computational challenges arising from this advancement, however. Examples include efficient algorithms for processing raw data generated by sequencing instruments, algorithms for detecting and correcting sequencing errors, algorithms for detecting genome variations from sequence data, just to name a few. Because different sequencing technologies can have drastically different characteristics, these algorithms often need to be adapted in order to produce most accurate results.

In this thesis, I will address a few of the aforementioned problems. First, I will describe two model-based basecalling algorithms for the Illumina sequencing platforms: BayesCall and naiveBayesCall. The novelty of BayesCall algorithm is that it is fully unsupervised, requiring no training data with known labels, and therefore it is applicable to data without a reference sequence. It also dramatically improves sequencing accuracies. Built upon BayesCall algorithm, naiveBayesCall dramatically improves computational efficiency by approximating the original model without sacrificing accuracy. We will also show that improved basecall can have positive effects on the downstream sequence analysis, such as the detection of single nucleotide polymorphism and the assembly of novel genomes.

In the third chapter, an algorithm, called ECHO, for correcting short-read sequencing errors will be described. The correction algorithm efficiently computes all overlaps between sequencing reads and corrects errors by using statistical models. Since it does not rely on reference genomes, ECHO can also be applied to de novo sequencing. Most other error correction algorithms require users to specify key parameters, but the optimal values for these parameters are unknown to users and can be hard to specify. Without key parameters being optimized, the effectiveness of error correction algorithm could sometimes be dramatically reduced. Based on statistical models, ECHO optimizes these parameters accordingly. We will show that ECHO can significantly reduce sequence error rates and also facilitate downstream sequence analysis. It is also demonstrated that ECHO can be extended to detect heterozygousity from sequencing data.

These algorithms are developed in hopes to make downstream analysis of sequence data easier and ultimately facilitate genome researches.

Advisors: Yun S. Song


BibTeX citation:

@phdthesis{Kao:EECS-2011-99,
    Author= {Kao, Wei-Chun},
    Title= {Algorithms for Next-Generation High-Throughput Sequencing Technologies},
    School= {EECS Department, University of California, Berkeley},
    Year= {2011},
    Month= {Sep},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-99.html},
    Number= {UCB/EECS-2011-99},
    Abstract= {Recent advances of DNA sequencing technologies are allowing researchers to generate immense amounts of data in a fast and cost effective fashion, enabling genome-wide association study and population genetic research which could not be done a decade ago. There are quite numerous computational challenges arising from this advancement, however. Examples include efficient algorithms for processing raw data generated by sequencing instruments, algorithms for detecting and correcting sequencing errors, algorithms for detecting genome variations from sequence data, just to name a few. Because different sequencing technologies can have drastically different characteristics, these algorithms often need to be adapted in order to produce most accurate results.

In this thesis, I will address a few of the aforementioned problems.  First, I will describe two model-based basecalling algorithms for the Illumina sequencing platforms: BayesCall and naiveBayesCall. The novelty of BayesCall algorithm is that it is fully unsupervised, requiring no training data with known labels, and therefore it is applicable to data without a reference sequence. It also dramatically improves sequencing accuracies. Built upon BayesCall algorithm, naiveBayesCall dramatically improves computational efficiency by approximating the original model without sacrificing accuracy. We will also show that improved basecall can have positive effects on the downstream sequence analysis, such as the detection of single nucleotide polymorphism and the assembly of novel genomes.

In the third chapter, an algorithm, called ECHO, for correcting short-read sequencing errors will be described. The correction algorithm efficiently computes all overlaps between sequencing reads and corrects errors by using statistical models. Since it does not rely on reference genomes, ECHO can also be applied to de novo sequencing. Most other error correction algorithms require users to specify key parameters, but the optimal values for these parameters are unknown to users and can be hard to specify. Without key parameters being optimized, the effectiveness of error correction algorithm could sometimes be dramatically reduced. Based on statistical models, ECHO optimizes these parameters accordingly. We will show that ECHO can significantly reduce sequence error rates and also facilitate downstream sequence analysis. It is also demonstrated that ECHO can be extended to detect heterozygousity from sequencing data.

These algorithms are developed in hopes to make downstream analysis of sequence data easier and ultimately facilitate genome researches.},
}

EndNote citation:

%0 Thesis
%A Kao, Wei-Chun 
%T Algorithms for Next-Generation High-Throughput Sequencing Technologies
%I EECS Department, University of California, Berkeley
%D 2011
%8 September 2
%@ UCB/EECS-2011-99
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-99.html
%F Kao:EECS-2011-99