STraP: Self-Training for Proteins

Arbaaz Muslim and Nilah Ioannidis

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2023-110
May 11, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.pdf

Protein engineering has the potential for immense impact across a broad range of fields such as agriculture, medicine, and manufacturing. However, manually searching for proteins (or equivalently, amino acid sequences) with desirable properties by generating and testing large numbers of candidate sequences in experimental assays is incredibly resource-intensive. Computational approaches to model protein fitness, particularly few-shot learning approaches that can leverage a limited quantity of experimental assay-labeled data for training, are therefore highly desirable. In this work, we explore a computational approach that combines prior work in protein language modeling using large language models with the few-shot learning technique of self-training, which iteratively generates pseudo-labels for unlabeled sequences during fine-tuning to enhance the accuracy of a model's predictions despite sparsely available labeled data. Here, we perform initial tests of self-training for proteins and propose follow-up studies to further explore this approach.
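To make the self-training loop concrete, the sketch below shows the general pseudo-labeling procedure in Python. The one-hot encoding, ridge-regression model, and bootstrap-ensemble confidence heuristic are illustrative assumptions chosen to keep the example self-contained; the report's own experiments build on a pretrained protein language model rather than this toy fitness model, and sequences are assumed here to have a fixed length.

    # Minimal sketch of self-training for protein fitness prediction.
    # The feature encoding, model, and confidence heuristic are
    # illustrative assumptions, not the method used in the report.
    import numpy as np

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    def one_hot(seq):
        """Flatten a fixed-length sequence into a one-hot feature vector."""
        x = np.zeros((len(seq), len(AMINO_ACIDS)))
        for pos, aa in enumerate(seq):
            x[pos, AA_INDEX[aa]] = 1.0
        return x.ravel()

    def fit_ridge(X, y, lam=1.0):
        """Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def self_train(labeled_seqs, labels, unlabeled_seqs,
                   rounds=3, k_per_round=8, n_bootstrap=10, lam=1.0):
        """Iteratively pseudo-label the most confidently predicted
        unlabeled sequences and add them to the training set."""
        X = np.stack([one_hot(s) for s in labeled_seqs])
        y = np.asarray(labels, dtype=float)
        pool = [one_hot(s) for s in unlabeled_seqs]
        rng = np.random.default_rng(0)

        for _ in range(rounds):
            if not pool:
                break
            U = np.stack(pool)
            # Bootstrap ensemble to estimate predictive uncertainty.
            preds = []
            for _ in range(n_bootstrap):
                idx = rng.integers(0, len(y), size=len(y))
                w = fit_ridge(X[idx], y[idx], lam)
                preds.append(U @ w)
            preds = np.stack(preds)            # shape: (n_bootstrap, n_unlabeled)
            mean, std = preds.mean(axis=0), preds.std(axis=0)
            # Pseudo-label the k sequences the ensemble agrees on most.
            confident = set(np.argsort(std)[:k_per_round])
            keep_idx = sorted(confident)
            X = np.vstack([X, U[keep_idx]])
            y = np.concatenate([y, mean[keep_idx]])
            pool = [x for i, x in enumerate(pool) if i not in confident]

        # Final model trained on labeled plus pseudo-labeled data.
        return fit_ridge(X, y, lam)

The key design choice in any such loop is the pseudo-label selection criterion; here, prediction variance across a bootstrap ensemble stands in as a simple stand-in confidence measure, with only the most consistently predicted sequences admitted to the training set in each round.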

Advisor: Nilah Ioannidis


BibTeX citation:

@mastersthesis{Muslim:EECS-2023-110,
    Author = {Muslim, Arbaaz and Ioannidis, Nilah},
    Title = {STraP: Self-Training for Proteins},
    School = {EECS Department, University of California, Berkeley},
    Year = {2023},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.html},
    Number = {UCB/EECS-2023-110},
    Abstract = {Protein engineering is a field with the potential for immense impact in a broad range of fields such as agriculture, medicine, and manufacturing. However, manually searching for proteins (or equivalently, amino acid sequences) with desirable properties by generating and testing large numbers of candidate sequences in experimental assays is incredibly resource-intensive. Computational approaches to model protein fitness, particularly few-shot learning approaches that can leverage a limited quantity of experimental assay-labeled data for training, are therefore highly desirable. In this work, we explore a computational approach that combines prior work in protein language modeling using large language models with the few-shot learning technique of self-training, which iteratively generates pseudo-labels for unlabeled sequences during fine-tuning to enhance the accuracy of a model's predictions despite sparsely available labeled data. Here, we perform initial tests of self-training for proteins and propose follow-up studies to further explore this approach.}
}

EndNote citation:

%0 Thesis
%A Muslim, Arbaaz
%A Ioannidis, Nilah
%T STraP: Self-Training for Proteins
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 11
%@ UCB/EECS-2023-110
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.html
%F Muslim:EECS-2023-110