Training, Evaluating, and Understanding Evolutionary Models for Protein Sequences

Roshan Rao

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2022-1
January 8, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-1.pdf

Novel protein sequences arise through mutation. These mutations may be deleterious, beneficial, or neutral; the effect of a mutation on an organism’s evolutionary fitness is reflected in whether the organism survives long enough for its proteins to be sampled and deposited in a sequence database. Bioinformatics has long sought to use this evolutionary signal, commonly in the form of Multiple Sequence Alignments (MSAs), to infer the structure and function of novel proteins. With the advent of neural networks and self-supervised pretraining, a different approach emerged, in which a large-scale neural network can be pretrained with a language modeling objective to automatically produce informative features from an input protein sequence.

In this work, methods to train and evaluate protein language models on a common benchmark are introduced. Subsequently, the effects of model scale, dataset preprocessing, and training hyperparameters on the ability of transformers to learn protein contacts without supervision are explored. A novel method operating on MSAs instead of single sequences is then presented and shown to achieve state-of-the-art performance on several downstream tasks. Finally, the utility of all of these methods in protein design is discussed.
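The language modeling objective referred to above can be illustrated with a minimal masking sketch. This is not code from the thesis: the example sequence, 15% mask rate, mask token, and function name are illustrative assumptions in the style of BERT-like masked language models for proteins.

```python
import random

# The 20 standard amino acids plus a mask token (names are illustrative).
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
MASK = "<mask>"

def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Return (masked_tokens, targets) for a masked-LM training example.

    Each residue is independently replaced by the mask token with
    probability `mask_prob`; the model would then be trained to recover
    the original residue at each masked position from its context.
    """
    rng = rng or random.Random(0)
    tokens, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_prob:
            tokens.append(MASK)
            targets[i] = aa  # ground-truth residue to predict
        else:
            tokens.append(aa)
    return tokens, targets

tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(tokens)   # input with some residues replaced by <mask>
print(targets)  # position -> original residue, the prediction targets
```

A pretrained transformer scores each masked position over the amino-acid vocabulary; the cross-entropy between those predictions and the `targets` dictionary is the self-supervised loss.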

Advisor: John F. Canny


BibTeX citation:

@phdthesis{Rao:EECS-2022-1,
    Author = {Rao, Roshan},
    Title = {Training, Evaluating, and Understanding Evolutionary Models for Protein Sequences},
    School = {EECS Department, University of California, Berkeley},
    Year = {2022},
    Month = {Jan},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-1.html},
    Number = {UCB/EECS-2022-1},
    Abstract = {Novel protein sequences arise through mutation. These mutations may be deleterious, beneficial, or neutral; the effect of a mutation on an organism’s evolutionary fitness is reflected in whether an organism survives long enough for its proteins to be sampled and deposited in a sequence database. Bioinformatics has long sought to use this evolutionary signal, commonly in the form of Multiple Sequence Alignments (MSAs), to make inferences as to the structure and function of novel proteins. With the advent of neural networks and self-supervised pretraining, a different approach emerged, where a large scale neural network could be pretrained using a language modeling objective to automatically produce informative features from an input protein sequence.

In this work, methods to train and evaluate protein language models on a common benchmark are introduced. Subsequently, the effects of increased model scaling, dataset preprocessing and training hyperparameters on the ability of transformers to learn protein contacts without supervision are explored. A novel method operating on MSAs instead of single sequences is then presented, and shown to achieve state-of-the-art performance on several downstream tasks. Finally, the utility of all of these methods in protein design is discussed.}
}

EndNote citation:

%0 Thesis
%A Rao, Roshan
%T Training, Evaluating, and Understanding Evolutionary Models for Protein Sequences
%I EECS Department, University of California, Berkeley
%D 2022
%8 January 8
%@ UCB/EECS-2022-1
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-1.html
%F Rao:EECS-2022-1