STraP: Self-Training for Proteins
Arbaaz Muslim and Nilah Ioannidis
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2023-110
May 11, 2023
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.pdf
Protein engineering has the potential for immense impact across a broad range of fields such as agriculture, medicine, and manufacturing. However, manually searching for proteins (or equivalently, amino acid sequences) with desirable properties by generating and testing large numbers of candidate sequences in experimental assays is incredibly resource-intensive. Computational approaches to model protein fitness, particularly few-shot learning approaches that can leverage a limited quantity of experimental assay-labeled data for training, are therefore highly desirable. In this work, we explore a computational approach that combines prior work in protein language modeling using large language models with the few-shot learning technique of self-training, which iteratively generates pseudo-labels for unlabeled sequences during fine-tuning to enhance the accuracy of a model's predictions despite sparsely available labeled data. Here, we perform initial tests of self-training for proteins and propose follow-up studies to further explore this approach.
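For readers unfamiliar with self-training, the sketch below illustrates the pseudo-labeling loop described in the abstract. The report fine-tunes a protein language model; for brevity, this sketch instead uses a generic scikit-learn regressor over pre-featurized sequences, and the confidence heuristic (ensemble disagreement) and all names (self_train, X_labeled, top_frac, etc.) are illustrative assumptions, not the report's implementation.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def self_train(model, X_labeled, y_labeled, X_unlabeled, rounds=5, top_frac=0.1):
        """Iteratively pseudo-label the most confidently predicted unlabeled
        sequences and fold them into the training set (a minimal sketch)."""
        X_train, y_train = X_labeled.copy(), y_labeled.copy()
        X_pool = X_unlabeled.copy()
        for _ in range(rounds):
            model.fit(X_train, y_train)
            if len(X_pool) == 0:
                break
            # For a regressor, per-tree disagreement serves as a confidence proxy.
            per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
            preds, spread = per_tree.mean(axis=0), per_tree.std(axis=0)
            k = max(1, int(top_frac * len(X_pool)))
            keep = np.argsort(spread)[:k]  # most confident predictions
            # Fold the pseudo-labeled sequences into the labeled training set.
            X_train = np.vstack([X_train, X_pool[keep]])
            y_train = np.concatenate([y_train, preds[keep]])
            X_pool = np.delete(X_pool, keep, axis=0)
        return model

    # Example usage with hypothetical featurized arrays X_lab, y_lab, X_unlab:
    # model = self_train(RandomForestRegressor(n_estimators=100), X_lab, y_lab, X_unlab)

In the report's setting, the fitted regressor would be replaced by fine-tuning passes over a protein language model, but the structure of the loop (train, pseudo-label the most confident unlabeled examples, retrain) is the same.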
Advisor: Nilah Ioannidis
BibTeX citation:
@mastersthesis{Muslim:EECS-2023-110,
    Author = {Muslim, Arbaaz and Ioannidis, Nilah},
    Title = {STraP: Self-Training for Proteins},
    School = {EECS Department, University of California, Berkeley},
    Year = {2023},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.html},
    Number = {UCB/EECS-2023-110},
    Abstract = {Protein engineering has the potential for immense impact across a broad range of fields such as agriculture, medicine, and manufacturing. However, manually searching for proteins (or equivalently, amino acid sequences) with desirable properties by generating and testing large numbers of candidate sequences in experimental assays is incredibly resource-intensive. Computational approaches to model protein fitness, particularly few-shot learning approaches that can leverage a limited quantity of experimental assay-labeled data for training, are therefore highly desirable. In this work, we explore a computational approach that combines prior work in protein language modeling using large language models with the few-shot learning technique of self-training, which iteratively generates pseudo-labels for unlabeled sequences during fine-tuning to enhance the accuracy of a model's predictions despite sparsely available labeled data. Here, we perform initial tests of self-training for proteins and propose follow-up studies to further explore this approach.},
}
EndNote citation:
%0 Thesis
%A Muslim, Arbaaz
%A Ioannidis, Nilah
%T STraP: Self-Training for Proteins
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 11
%@ UCB/EECS-2023-110
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-110.html
%F Muslim:EECS-2023-110