Evaluating the use of sequence-to-expression predictors for personalized expression prediction

Parth Baokar

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2022-120

May 13, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-120.pdf

With rapid advances in deep neural network architectures, there has been recent interest in using these complex models to understand the regulatory factors that govern gene expression. Recent state-of-the-art models are trained to predict expression levels in different cell types from the reference genome sequence around the start site of each gene. These models explain a large fraction of the variation in expression across different genes in the genome, and have demonstrated an ability to recognize biologically relevant regulatory motifs. However, here we show that model performance is limited when applied to sequences from personal genomes to explain variation in expression across individuals. Our results suggest a relative insensitivity of these models to small but biologically meaningful perturbations in the input sequence. We also demonstrate that while the models identify some key sites of regulatory variation corresponding to those found in eQTL (expression quantitative trait loci) studies, they often fail to capture the correct direction of effect on expression. This work highlights potential shortcomings of these deep learning models when applied to personal genome interpretation in a clinical setting, and suggests further avenues of exploration for improving model performance on personalized genomes.
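The abstract separates two evaluation settings: variation across genes, and variation across individuals for a single gene. It also describes checking whether predicted variant effects agree in sign with eQTL effects. The following is a minimal sketch of how those three quantities could be computed, not the report's own evaluation code. The function names, array layout (individuals in rows, genes in columns), and toy numbers are all assumptions made for illustration.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation of two 1-D arrays; NaN if either is constant."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else float("nan")


def cross_gene_correlation(pred, obs):
    """Correlate mean expression per gene across genes.

    pred, obs: arrays of shape (n_individuals, n_genes) (assumed layout).
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return pearson(pred.mean(axis=0), obs.mean(axis=0))


def cross_individual_correlations(pred, obs):
    """For each gene, correlate predicted and observed expression across individuals."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return np.array([pearson(pred[:, g], obs[:, g]) for g in range(pred.shape[1])])


def direction_agreement(pred_effects, eqtl_effects):
    """Fraction of variants whose predicted effect sign matches the eQTL effect sign.

    Variants with a zero effect in either array are ignored.
    """
    p = np.sign(np.asarray(pred_effects, dtype=float))
    e = np.sign(np.asarray(eqtl_effects, dtype=float))
    mask = (p != 0) & (e != 0)
    return float((p[mask] == e[mask]).mean())


# Hypothetical toy data: 3 individuals x 3 genes (not from the report).
obs = np.array([[1.0, 5.0, 10.0],
                [1.2, 5.5, 9.0],
                [0.8, 4.5, 11.0]])
pred = np.array([[1.0, 5.0, 10.0],
                 [1.1, 4.9, 10.1],
                 [0.9, 5.1, 9.9]])

across_genes = cross_gene_correlation(pred, obs)
per_gene = cross_individual_correlations(pred, obs)
sign_match = direction_agreement([0.2, -0.1, 0.3, 0.0], [0.5, 0.4, 0.1, -0.2])
```

In this toy example the predictions track the large differences between genes, yet for two of the three genes they vary in the opposite direction to the observed values across individuals. This is the kind of gap between the two settings that the abstract describes.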

Advisor: Nilah Ioannidis

BibTeX citation:

@mastersthesis{Baokar:EECS-2022-120,
    Author= {Baokar, Parth},
    Editor= {Ioannidis, Nilah},
    Title= {Evaluating the use of sequence-to-expression predictors for personalized expression prediction},
    School= {EECS Department, University of California, Berkeley},
    Year= {2022},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-120.html},
    Number= {UCB/EECS-2022-120},
    Abstract= {With rapid advances in deep neural network architectures, there has been recent interest in using these complex models to understand the regulatory factors that govern gene expression. Recent state-of-the-art models are trained to predict expression levels in different cell types from the reference genome sequence around the start site of each gene. These models explain a large fraction of the variation in expression across different genes in the genome, and have demonstrated an ability to recognize biologically relevant regulatory motifs. However, here we show that model performance is limited when applied to sequences from personal genomes to explain variation in expression across individuals. Our results suggest a relative insensitivity of these models to small but biologically meaningful perturbations in the input sequence. We also demonstrate that while the models identify some key sites of regulatory variation corresponding to those found in eQTL (expression quantitative trait loci) studies, they often fail to capture the correct direction of effect on expression. This work highlights potential shortcomings of these deep learning models when applied to personal genome interpretation in a clinical setting, and suggests further avenues of exploration for improving model performance on personalized genomes.},
}

EndNote citation:

%0 Thesis
%A Baokar, Parth
%E Ioannidis, Nilah
%T Evaluating the use of sequence-to-expression predictors for personalized expression prediction
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 13
%@ UCB/EECS-2022-120
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-120.html
%F Baokar:EECS-2022-120