Fast and accurate quantification and differential analysis of transcriptomes

Harold Pimentel and Lior Pachter

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2016-131
July 18, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-131.pdf

As access to DNA sequencing has become ubiquitous to scientists, the use of sequencers has expanded from determining the genomes of individuals to performing molecular probing assays. These assays have turned DNA sequencers into molecule counting machines and can be used to measure biological activities such as gene expression (RNA-Seq), DNA accessibility (ATAC-Seq) and many others.

Each new assay poses new analytical challenges, and the main focus presented here is in analyzing RNA-Seq data. One of the main challenges in RNA-Seq is that sequenced fragments are often ambiguous, meaning they are compatible with multiple splice forms or genomic locations. In order to estimate gene abundances effectively, these ambiguous fragments should be used in a comprehensive model in order to not bias results. Analysis has come a long way from ignoring ambiguous mappings, to maximum likelihood models, and even streaming models. Advancements in these models have greatly improved the accuracy of estimating gene and transcript abundances.

In parallel, methods for determining true expression differences between experimental conditions, termed differential expression, have been developed. Historically, these methods have mostly ignored advancements in gene expression estimation but have made much progress in between-sample variance estimation when sample sizes are small – a common practice in this field.

Herein, we present advancements to both abundance estimation and differential expression analysis. We show dramatic improvements to the speed of abundance estimation while maintaining accuracy. Furthermore, we bridge these two fields by developing a differential expression model incorporating the uncertainty introduced by abundance estimation. We show that this model outperforms existing techniques at both the transcript and gene level.

Additionally, we show that these methods can be used to address other biological questions such as the discovery of novel retained introns and estimation of their abundances. An extension to the differential expression model is proposed to identify differences in retained intron levels while incorporating abundance estimation uncertainty.

Advisor: Lior Pachter


BibTeX citation:

@phdthesis{Pimentel:EECS-2016-131,
    Author = {Pimentel, Harold and Pachter, Lior},
    Title = {Fast and accurate quantification and differential analysis of transcriptomes},
    School = {EECS Department, University of California, Berkeley},
    Year = {2016},
    Month = {Jul},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-131.html},
    Number = {UCB/EECS-2016-131},
    Abstract = {As access to DNA sequencing has become ubiquitous to scientists, the use of sequencers has expanded from determining the genomes of individuals to performing molecular probing assays. These assays have turned DNA sequencers into molecule counting machines and can be used to measure biological activities such as gene expression (RNA-Seq), DNA accessibility (ATAC-Seq) and many others.
Each new assay poses new analytical challenges, and the main focus presented here is in analyzing RNA-Seq data. One of the main challenges in RNA-Seq is that sequenced fragments are often ambiguous, meaning they are compatible with multiple splice forms or genomic locations. In order to estimate gene abundances effectively, these ambiguous fragments should be used in a comprehensive model in order to not bias results. Analysis has come a long way from ignoring ambiguous mappings, to maximum likelihood models, and even streaming models. Advancements in these models have greatly improved the accuracy of estimating gene and transcript abundances.
In parallel, methods for determining true expression differences between experimental conditions, termed differential expression, have been developed. Historically, these methods have mostly ignored advancements in gene expression estimation but have made much progress in between-sample variance estimation when sample sizes are small – a common practice in this field.
Herein, we present advancements to both abundance estimation and differential expression analysis. We show dramatic improvements to the speed of abundance estimation while maintaining accuracy. Furthermore, we bridge these two fields by developing a differential expression model incorporating the uncertainty introduced by abundance estimation. We show that this model outperforms existing techniques at both the transcript and gene level.
Additionally, we show that these methods can be used to address other biological questions such as the discovery of novel retained introns and estimation of their abundances. An extension to the differential expression model is proposed to identify differences in retained intron levels while incorporating abundance estimation uncertainty.}
}

EndNote citation:

%0 Thesis
%A Pimentel, Harold
%A Pachter, Lior
%T Fast and accurate quantification and differential analysis of transcriptomes
%I EECS Department, University of California, Berkeley
%D 2016
%8 July 18
%@ UCB/EECS-2016-131
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-131.html
%F Pimentel:EECS-2016-131