Towards Integrating Evolutionary Information into Inverse Folding Models

Juno Lee

EECS Department, University of California, Berkeley

Technical Report No. UCB/

May 1, 2025

Inverse folding models generate amino acid sequences that are intended to fold into a given protein structure (e.g. the various myoglobin sequences found across mammals), with the ultimate aim of designing more functional and stable versions of existing proteins. In contrast to the protein folding problem (sequence to structure; considered "solved" in some respects by models like AlphaFold), inverse folding is particularly challenging because the distribution it models (the probability of every sequence given a structure) has an intractably large state space. Furthermore, training data is limited: inverse folding models have traditionally been trained on one-to-one structure-sequence mappings, which limits the diversity of generated sequences. This report explores supplementing inverse folding models with additional evolutionary sequence data from Multiple Sequence Alignments (MSAs). Specifically, we modify ProteinMPNN’s graph neural network inverse folding model to train on multiple sequences per protein structure. We train the modified ProteinMPNN on a subset of the OpenProteinSet database, which contains pregenerated MSAs for proteins in the Protein Data Bank (PDB). While the proposed method achieves some gains in sequence diversity, this enhancement comes at a significant cost to foldability. We also examine the limitations of our techniques and discuss future steps for MSA-informed inverse folding.

Advisor: Jennifer Listgarten

BibTeX citation:

@mastersthesis{Lee:31872,
    Author= {Lee, Juno},
    Title= {Towards Integrating Evolutionary Information into Inverse Folding Models},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Number= {UCB/},
    Abstract= {Inverse folding models generate amino acid sequences that are intended to fold into a given protein structure (e.g. the various myoglobin sequences found across mammals), with the ultimate aim of designing more functional and stable versions of existing proteins. In contrast to the protein folding problem (sequence to structure; considered "solved" in some respects by models like AlphaFold), inverse folding is particularly challenging because the distribution it models (the probability of every sequence given a structure) has an intractably large state space. Furthermore, training data is limited: inverse folding models have traditionally been trained on one-to-one structure-sequence mappings, which limits the diversity of generated sequences. This report explores supplementing inverse folding models with additional evolutionary sequence data from Multiple Sequence Alignments (MSAs). Specifically, we modify ProteinMPNN’s graph neural network inverse folding model to train on multiple sequences per protein structure. We train the modified ProteinMPNN on a subset of the OpenProteinSet database, which contains pregenerated MSAs for proteins in the Protein Data Bank (PDB). While the proposed method achieves some gains in sequence diversity, this enhancement comes at a significant cost to foldability. We also examine the limitations of our techniques and discuss future steps for MSA-informed inverse folding.},
}

EndNote citation:

%0 Thesis
%A Lee, Juno 
%T Towards Integrating Evolutionary Information into Inverse Folding Models
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 1
%@ UCB/
%F Lee:31872