Meta-learning for evolutionary-based protein property prediction

Arthur Deng

EECS Department, University of California, Berkeley

Technical Report No. UCB/

May 1, 2024

Protein property prediction models based on evolutionary data are important for initiating ML-guided protein engineering campaigns and for variant effect prediction. Many of the best-performing such models take as input evolutionary information in the form of a multiple sequence alignment (MSA) for one protein family and use it to train a density model, such as a Potts model. Although much effort has focused on comparing new model classes for these tasks, little work has considered how to weight the importance of each sequence in the MSA, a key step. Herein, we explore this topic by introducing a meta-learning framework that automatically re-weights the sequences in an MSA, for any density model class, to improve zero-shot performance. At training time, our model takes as input, for each of several distinct protein families: i) unlabeled sequences from an MSA that could be used to train a density model, and ii) other sequences that have had a family-relevant property assayed, say RNA binding affinity, and that could be used for supervised learning. Critically, at test time, the protein family of interest is expected not to have assay-labeled data, which makes the prediction task zero-shot. From the training families, we learn a meta-model function that assigns a weight to each sequence in one family’s MSA. Our meta-learning approach does so using a featurization of the MSA that can generalize to other families. Hence, at test time, we use the meta-model to re-weight the sequences in an MSA for zero-shot property prediction.
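To make concrete how per-sequence weights enter a density model, the following is a minimal sketch rather than the method of the report. It assumes a site-independent model as a simple stand-in for a Potts model, and it uses hand-picked weights in place of the output of a learned meta-model. All names (`weighted_site_model`, `zero_shot_score`, the toy MSA) are hypothetical and chosen only for illustration.

```python
import numpy as np

# 20 amino acids plus the alignment gap character.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def encode(seqs):
    """Map equal-length aligned sequences to an integer array of shape (N, L)."""
    return np.array([[INDEX[c] for c in seq] for seq in seqs])


def weighted_site_model(msa, weights, pseudocount=1.0):
    """Fit a site-independent density model from weighted MSA sequences.

    Assumption: this simple model stands in for a richer density model
    such as a Potts model. Returns log-probabilities of shape (L, 21).
    """
    x = encode(msa)
    w = np.asarray(weights, dtype=float)
    length = x.shape[1]
    counts = np.full((length, len(ALPHABET)), pseudocount)
    for pos in range(length):
        # Each sequence contributes its weight, not a unit count.
        counts[pos] += np.bincount(x[:, pos], weights=w, minlength=len(ALPHABET))
    return np.log(counts / counts.sum(axis=1, keepdims=True))


def zero_shot_score(variant, wild_type, log_probs):
    """Log-likelihood ratio of a variant against the wild type."""
    v = encode([variant])[0]
    wt = encode([wild_type])[0]
    positions = np.arange(len(v))
    return float(log_probs[positions, v].sum() - log_probs[positions, wt].sum())


# Toy MSA; the weights below are hand-picked, not produced by a meta-model.
msa = ["ACD", "ACD", "ACE", "GCE"]
uniform = weighted_site_model(msa, [1.0, 1.0, 1.0, 1.0])
reweighted = weighted_site_model(msa, [0.5, 0.5, 1.0, 1.0])

print(zero_shot_score("ACE", "ACD", uniform))
print(zero_shot_score("ACE", "ACD", reweighted))
```

In this toy case, down-weighting the two identical sequences shifts probability mass at the last position from D toward E, so the same variant receives a higher score under the re-weighted model than under uniform weights. The weights change the prediction without changing the model class, which is the lever the abstract describes.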

Advisor: Jennifer Listgarten

BibTeX citation:

@mastersthesis{Deng:31353,
    Author= {Deng, Arthur},
    Title= {Meta-learning for evolutionary-based protein property prediction},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Number= {UCB/},
    Abstract= {Protein property prediction models based on evolutionary data are important for initiating ML-guided protein engineering campaigns and for variant effect prediction. Many of the best-performing such models take as input evolutionary information in the form of a multiple sequence alignment (MSA) for one protein family and use it to train a density model, such as a Potts model. Although much effort has focused on comparing new model classes for these tasks, little work has considered how to weight the importance of each sequence in the MSA, a key step. Herein, we explore this topic by introducing a meta-learning framework that automatically re-weights the sequences in an MSA, for any density model class, to improve zero-shot performance. At training time, our model takes as input, for each of several distinct protein families: i) unlabeled sequences from an MSA that could be used to train a density model, and ii) other sequences that have had a family-relevant property assayed, say RNA binding affinity, and that could be used for supervised learning. Critically, at test time, the protein family of interest is expected not to have assay-labeled data, which makes the prediction task zero-shot. From the training families, we learn a meta-model function that assigns a weight to each sequence in one family’s MSA. Our meta-learning approach does so using a featurization of the MSA that can generalize to other families. Hence, at test time, we use the meta-model to re-weight the sequences in an MSA for zero-shot property prediction.},
}

EndNote citation:

%0 Thesis
%A Deng, Arthur 
%T Meta-learning for evolutionary-based protein property prediction
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 1
%@ UCB/
%F Deng:31353