Milind Jagota

EECS Department, University of California, Berkeley

Technical Report No. UCB/

May 1, 2025

Genomes are a precise and nearly comprehensive representation of biological evolution, but are high-dimensional and difficult to interpret. In recent years, rapid growth in sequencing and computational power has made it possible to train large machine learning models of evolution and use them for biological discovery. A notable example is AlphaFold, a machine learning model that solved the protein structure prediction problem and relies on evolutionary signals. This thesis describes the use of machine learning models to study aspects of biomolecules that evolve faster than protein structure, and which therefore require modeling approaches different from those used by AlphaFold. We first discuss protein variant pathogenicity prediction, a problem of great practical importance for which evolutionary signals are valuable. We show that evolutionary models at multiple timescales can provide complementary information for pathogenicity prediction, and build a model that achieves state-of-the-art performance. Next, we examine B cells and antibodies, which are a key part of the human adaptive immune system and are generated by a miniature evolutionary process within each human. We describe a new approach for observing and modeling antibodies that are under negative selection and use these models for improved antibody property prediction. Finally, we discuss germline mutational processes, which are a key mechanism by which genomes acquire the diversity necessary for evolution. We present a framework for predicting mutation rate heterogeneity that matches state-of-the-art predictors and provides greater scalability. In ongoing work, we aim to apply this methodology to carry out comparative studies of mutational processes and selection across multiple species, at a scale that would have been difficult with prior methods.

Advisor: Yun S. Song


BibTeX citation:

@phdthesis{Jagota:31865,
    Author= {Jagota, Milind},
    Title= {Modeling and Interpreting Genome Evolution at Multiple Timescales},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Number= {UCB/},
    Abstract= {Genomes are a precise and nearly comprehensive representation of biological evolution, but are high-dimensional and difficult to interpret. In recent years, rapid growth in sequencing and computational power has made it possible to train large machine learning models of evolution and use them for biological discovery. A notable example is AlphaFold, a machine learning model that solved the protein structure prediction problem and relies on evolutionary signals. This thesis describes the use of machine learning models to study aspects of biomolecules that evolve faster than protein structure, and which therefore require modeling approaches different from those used by AlphaFold. We first discuss protein variant pathogenicity prediction, a problem of great practical importance for which evolutionary signals are valuable. We show that evolutionary models at multiple timescales can provide complementary information for pathogenicity prediction, and build a model that achieves state-of-the-art performance. Next, we examine B cells and antibodies, which are a key part of the human adaptive immune system and are generated by a miniature evolutionary process within each human. We describe a new approach for observing and modeling antibodies that are under negative selection and use these models for improved antibody property prediction. Finally, we discuss germline mutational processes, which are a key mechanism by which genomes acquire the diversity necessary for evolution. We present a framework for predicting mutation rate heterogeneity that matches state-of-the-art predictors and provides greater scalability. In ongoing work, we aim to apply this methodology to carry out comparative studies of mutational processes and selection across multiple species, at a scale that would have been difficult with prior methods.},
}

EndNote citation:

%0 Thesis
%A Jagota, Milind
%T Modeling and Interpreting Genome Evolution at Multiple Timescales
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 1
%@ UCB/
%F Jagota:31865