Neil Thomas

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-27

May 1, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-27.pdf

Proteins are the molecular machines that perform the vast majority of natural biological functions. Discovering proteins to perform novel functions or optimizing them for an existing function are central goals of synthetic biology. Doing so is challenging primarily because for most proteins there is limited understanding of how they function, let alone how to modify them; experimental characterization and crystal structures are expensive and time-consuming to collect. For a given protein, however, genes performing related functions can be found in the genomes of diverse organisms -- the natural result of the process of evolution. With improved techniques for genetic sequencing, an abundance of data deposited in protein sequence databases has become available. This presents a tantalizing modeling opportunity: models that can understand protein function through the observation of related sequences can reduce the reliance on experimental characterization and unlock new possibilities for protein discovery and optimization. Building such models has been a goal of bioinformatics research, and has more recently emerged as a goal of machine learning research. In particular, "protein language models," models trained to learn a distribution over sequence data, have shown promise in predicting functional properties of proteins.

This work leverages the information in protein sequence databases to the following ends. First, it presents a benchmark for the effectiveness of protein language models using a suite of protein prediction tasks. Second, it draws a connection between a well-established graphical model of protein families and the neural network architecture of protein language models. Third, it presents a framework for deriving synthetic protein fitness landscapes from evolutionary data that can be used to evaluate strategies for model-guided protein design in silico.

Advisor: Yun S. Song


BibTeX citation:

@phdthesis{Thomas:EECS-2024-27,
    Author= {Thomas, Neil},
    Title= {Browsing in the Library of Babel: Leveraging Evolutionary Information to Improve Protein Modeling},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-27.html},
    Number= {UCB/EECS-2024-27},
    Abstract= {Proteins are the molecular machines that perform the vast majority of natural biological functions. Discovering proteins to perform novel functions or optimizing them for an existing function are central goals of synthetic biology. Doing so is challenging primarily because for most proteins there is limited understanding of how they function, let alone how to modify them; experimental characterization and crystal structures are expensive and time-consuming to collect. For a given protein, however, genes performing related functions can be found in the genomes of diverse organisms -- the natural result of the process of evolution. With improved techniques for genetic sequencing, an abundance of data deposited in protein sequence databases has become available. This presents a tantalizing modeling opportunity: models that can understand protein function through the observation of related sequences can reduce the reliance on experimental characterization and unlock new possibilities for protein discovery and optimization. Building such models has been a goal of bioinformatics research, and has more recently emerged as a goal of machine learning research. In particular, ``protein language models,'' models trained to learn a distribution over sequence data, have shown promise in predicting functional properties of proteins.

This work leverages the information in protein sequence databases to the following ends. First, it presents a benchmark for the effectiveness of protein language models using a suite of protein prediction tasks. Second, it draws a connection between a well-established graphical model of protein families and the neural network architecture of protein language models. Third, it presents a framework for deriving synthetic protein fitness landscapes from evolutionary data that can be used to evaluate strategies for model-guided protein design in silico.},
}

EndNote citation:

%0 Thesis
%A Thomas, Neil 
%T Browsing in the Library of Babel: Leveraging Evolutionary Information to Improve Protein Modeling
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 1
%@ UCB/EECS-2024-27
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-27.html
%F Thomas:EECS-2024-27