Novel Protein Evolution Models for Ancestral Sequence Reconstruction
Akshay Ravoor
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2024-190
September 23, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-190.pdf
Models of protein evolution are essential for a variety of applications, from phylogenetic analysis and ancestral sequence reconstruction (ASR) to variant effect prediction and protein design. A protein evolution model is typically characterized by a rate matrix Q, which describes the rate at which amino acids mutate into one another, and evolution is inferred under an independent-sites model (possibly with site rate variation). In this work we move beyond classical models by leveraging the CherryML framework to efficiently train new kinds of models that do away with either the independent-sites assumption (RNN) or the idea of a global rate matrix (SiteRM). Both the RNN and SiteRM models show improved performance compared to WAG when evaluated on per-site likelihood. We then apply these two models to ASR using the extant sequence reconstruction method and a variety of reconstruction algorithms. With the SiteRM model we attain performance competitive with IQ-Tree and consistently outperform it on the datasets with longer sequences. Though more validation is needed for the particular task of ASR, our results are promising for the development of new protein evolution models under the CherryML paradigm.
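To make the role of the rate matrix Q concrete: in a continuous-time Markov model, the probability of one state changing into another over a branch of length t is given by the matrix exponential P(t) = exp(Qt). The following is a minimal sketch of that standard relationship only, not code from the report or from CherryML. The 3-state matrix and its values are hypothetical stand-ins for a 20x20 amino-acid matrix, and the helper names are chosen here for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring.

    Q*t is divided by 2**squarings, exponentiated with a truncated
    Taylor series, and the result is squared `squarings` times.
    """
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]

    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        # term holds a**k / k! after this step
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]

    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical 3-state rate matrix (assumption, not from the report).
# Off-diagonal entries are rates; each row sums to zero.
Q = [
    [-0.3, 0.2, 0.1],
    [0.2, -0.5, 0.3],
    [0.1, 0.3, -0.4],
]

P = transition_matrix(Q, t=1.0)
for row in P:
    print([round(x, 4) for x in row])
```

Each row of P is a probability distribution over the end state given the start state, so the rows sum to one. Under the independent-sites assumption mentioned above, an alignment likelihood is a product of such per-site terms; the SiteRM idea in the abstract corresponds to letting Q differ by site rather than sharing one global matrix.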
Advisor: Yun S. Song
BibTeX citation:
@mastersthesis{Ravoor:EECS-2024-190,
    Author = {Ravoor, Akshay},
    Title = {Novel Protein Evolution Models for Ancestral Sequence Reconstruction},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {Sep},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-190.html},
    Number = {UCB/EECS-2024-190},
    Abstract = {Models of protein evolution are essential for a variety of applications, from phylogenetic analysis and ancestral sequence reconstruction to variant effect prediction and protein design. A protein evolution model is typically characterized by a rate matrix Q which describes the rate at which amino acids mutate into one another, and evolution is inferred under an independent sites model (possibly with site rate variation). In this work we move beyond classical models by leveraging the CherryML framework to efficiently train new kinds of models either doing away with the independent sites assumption (RNN) or the idea of global rate matrices (SiteRM). Both the RNN and SiteRM models show improved performance compared to WAG when evaluated on per-site likelihood. We then apply these two models to ASR using the extant sequence reconstruction method and a variety of reconstruction algorithms. We are able to use the SiteRM model to attain a performance competitive with IQ-Tree and consistently outperform it in the longer sequence length datasets. Though more validation is needed for the particular task of ASR, our results are promising for the development of new protein evolution models under the CherryML paradigm.}
}
EndNote citation:
%0 Thesis
%A Ravoor, Akshay
%T Novel Protein Evolution Models for Ancestral Sequence Reconstruction
%I EECS Department, University of California, Berkeley
%D 2024
%8 September 23
%@ UCB/EECS-2024-190
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-190.html
%F Ravoor:EECS-2024-190