STraP: Self-Training for Proteins

Arbaaz Muslim and Nilah Ioannidis

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2023-110
May 11, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.pdf

Protein engineering has the potential for immense impact across a broad range of fields such as agriculture, medicine, and manufacturing. However, manually searching for proteins (or equivalently, amino acid sequences) with desirable properties by generating and testing large numbers of candidate sequences in experimental assays is incredibly resource-intensive. Computational approaches to model protein fitness, particularly few-shot learning approaches that can leverage a limited quantity of experimental assay-labeled data for training, are therefore highly desirable. In this work, we explore a computational approach that combines prior work in protein language modeling using large language models with the few-shot learning technique of self-training, which iteratively generates pseudo-labels for unlabeled sequences during fine-tuning to enhance the accuracy of a model's predictions despite sparsely available labeled data. Here, we perform initial tests of self-training for proteins and propose follow-up studies to further explore this approach.
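To make the self-training loop concrete, the sketch below shows the general pseudo-labeling procedure in Python. The one-hot encoding, ridge-regression model, and bootstrap-ensemble confidence heuristic are illustrative assumptions chosen to keep the example self-contained; the report's own experiments build on a pretrained protein language model rather than this toy fitness model, and sequences are assumed here to have a fixed length.

    # Minimal sketch of self-training for protein fitness prediction.
    # The feature encoding, model, and confidence heuristic are
    # illustrative assumptions, not the method used in the report.
    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    def one_hot(seq):
        """Flatten a fixed-length sequence into a one-hot feature vector."""
        x = np.zeros((len(seq), len(AMINO_ACIDS)))
        for pos, aa in enumerate(seq):
            x[pos, AA_INDEX[aa]] = 1.0
        return x.ravel()

    def fit_ridge(X, y, lam=1.0):
        """Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def self_train(labeled_seqs, labels, unlabeled_seqs,
                   rounds=3, k_per_round=8, n_bootstrap=10, lam=1.0):
        """Iteratively pseudo-label the most confidently predicted
        unlabeled sequences and add them to the training set."""
        X = np.stack([one_hot(s) for s in labeled_seqs])
        y = np.asarray(labels, dtype=float)
        pool = [one_hot(s) for s in unlabeled_seqs]
        rng = np.random.default_rng(0)

        for _ in range(rounds):
            if not pool:
                break
            U = np.stack(pool)
            # Bootstrap ensemble to estimate predictive uncertainty.
            preds = []
            for _ in range(n_bootstrap):
                idx = rng.integers(0, len(y), size=len(y))
                w = fit_ridge(X[idx], y[idx], lam)
                preds.append(U @ w)
            preds = np.stack(preds)            # shape: (n_bootstrap, n_unlabeled)
            mean, std = preds.mean(axis=0), preds.std(axis=0)
            # Pseudo-label the k sequences the ensemble agrees on most.
            confident = set(np.argsort(std)[:k_per_round])
            keep_idx = sorted(confident)
            X = np.vstack([X, U[keep_idx]])
            y = np.concatenate([y, mean[keep_idx]])
            pool = [x for i, x in enumerate(pool) if i not in confident]

        # Final model trained on labeled plus pseudo-labeled data.
        return fit_ridge(X, y, lam)

The key design choice in any such loop is the pseudo-label selection criterion; here, prediction variance across a bootstrap ensemble stands in as a simple stand-in confidence measure, with only the most consistently predicted sequences admitted to the training set in each round.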

Advisor: Nilah Ioannidis


BibTeX citation:

@mastersthesis{Muslim:EECS-2023-110,
    Author = {Muslim, Arbaaz and Ioannidis, Nilah},
    Title = {STraP: Self-Training for Proteins},
    School = {EECS Department, University of California, Berkeley},
    Year = {2023},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.html},
    Number = {UCB/EECS-2023-110},
    Abstract = {Protein engineering is a field with the potential for immense impact in a broad range of fields such as agriculture, medicine, and manufacturing. However, manually searching for proteins (or equivalently, amino acid sequences) with desirable properties by generating and testing large numbers of candidate sequences in experimental assays is incredibly resource-intensive. Computational approaches to model protein fitness, particularly few-shot learning approaches that can leverage a limited quantity of experimental assay-labeled data for training, are therefore highly desirable. In this work, we explore a computational approach that combines prior work in protein language modeling using large language models with the few-shot learning technique of self-training, which iteratively generates pseudo-labels for unlabeled sequences during fine-tuning to enhance the accuracy of a model's predictions despite sparsely available labeled data. Here, we perform initial tests of self-training for proteins and propose follow-up studies to further explore this approach.}
}

EndNote citation:

%0 Thesis
%A Muslim, Arbaaz
%A Ioannidis, Nilah
%T STraP: Self-Training for Proteins
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 11
%@ UCB/EECS-2023-110
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.html
%F Muslim:EECS-2023-110