STraP: Self-Training for Proteins

Arbaaz Muslim and Nilah Ioannidis
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2023-110
May 11, 2023
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.pdf
Protein engineering has the potential for immense impact across a broad range of fields such as agriculture, medicine, and manufacturing. However, manually searching for proteins (or equivalently, amino acid sequences) with desirable properties by generating and testing large numbers of candidate sequences in experimental assays is incredibly resource-intensive. Computational approaches to model protein fitness, particularly few-shot learning approaches that can leverage a limited quantity of experimental assay-labeled data for training, are therefore highly desirable. In this work, we explore a computational approach that combines prior work in protein language modeling using large language models with the few-shot learning technique of self-training, which iteratively generates pseudo-labels for unlabeled sequences during fine-tuning to enhance the accuracy of a model's predictions despite sparsely available labeled data. Here, we perform initial tests of self-training for proteins and propose follow-up studies to further explore this approach.
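For intuition, the following is a minimal sketch of the self-training loop the abstract describes. It is not the report's implementation: the model (a scikit-learn random forest over one-hot sequence encodings, standing in for a fine-tuned protein language model), the confidence heuristic (low prediction variance across the forest's trees), and all function names are illustrative assumptions.

    """Sketch of self-training for protein fitness regression.

    Assumptions (not from the report): a random-forest regressor as the
    fitness model, one-hot encodings of fixed-length sequences, and
    per-tree prediction variance as the pseudo-label confidence score.
    """
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

    def one_hot(seq: str) -> np.ndarray:
        """Flatten a fixed-length amino acid sequence into a one-hot vector."""
        x = np.zeros((len(seq), len(AMINO_ACIDS)))
        for pos, aa in enumerate(seq):
            x[pos, AA_INDEX[aa]] = 1.0
        return x.ravel()

    def self_train(labeled_seqs, labels, unlabeled_seqs,
                   n_rounds=5, batch_per_round=10):
        """Iteratively pseudo-label the most confident unlabeled sequences."""
        X = np.stack([one_hot(s) for s in labeled_seqs])
        y = np.asarray(labels, dtype=float)
        pool = [one_hot(s) for s in unlabeled_seqs]

        model = RandomForestRegressor(n_estimators=100, random_state=0)
        for _ in range(n_rounds):
            if not pool:
                break
            model.fit(X, y)
            X_pool = np.stack(pool)
            # Per-tree predictions; low variance across trees is treated
            # as high confidence in the ensemble's mean prediction.
            per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
            mean_pred = per_tree.mean(axis=0)
            variance = per_tree.var(axis=0)
            confident = np.argsort(variance)[:batch_per_round]
            # Promote confident pseudo-labels into the training set.
            X = np.vstack([X, X_pool[confident]])
            y = np.concatenate([y, mean_pred[confident]])
            drop = {int(i) for i in confident}
            pool = [p for i, p in enumerate(pool) if i not in drop]
        model.fit(X, y)
        return model

In the report's setting, the simple regressor above would be replaced by a protein language model fine-tuned on the assay-labeled data, with pseudo-labels regenerated for the unlabeled sequences between fine-tuning rounds.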
Advisor: Nilah Ioannidis
BibTeX citation:
@mastersthesis{Muslim:EECS-2023-110,
    Author = {Muslim, Arbaaz and Ioannidis, Nilah},
    Title = {STraP: Self-Training for Proteins},
    School = {EECS Department, University of California, Berkeley},
    Year = {2023},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.html},
    Number = {UCB/EECS-2023-110},
    Abstract = {Protein engineering is a field with the potential for immense impact in a broad range of fields such as agriculture, medicine, and manufacturing. However, manually searching for proteins (or equivalently, amino acid sequences) with desirable properties by generating and testing large numbers of candidate sequences in experimental assays is incredibly resource-intensive. Computational approaches to model protein fitness, particularly few-shot learning approaches that can leverage a limited quantity of experimental assay-labeled data for training, are therefore highly desirable. In this work, we explore a computational approach that combines prior work in protein language modeling using large language models with the few-shot learning technique of self-training, which iteratively generates pseudo-labels for unlabelled sequences during fine-tuning to enhance the accuracy of a model's predictions despite sparsely available labeled data. Here, we perform initial tests of self-training for proteins and propose follow-up studies to further explore this approach.}
}
EndNote citation:
%0 Thesis
%A Muslim, Arbaaz
%A Ioannidis, Nilah
%T STraP: Self-Training for Proteins
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 11
%@ UCB/EECS-2023-110
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.html
%F Muslim:EECS-2023-110