Amy Lu

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-223

December 19, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-223.pdf

Drug discovery is concerned with designing, modifying, or repurposing biomolecules that can prevent or treat disease. The application of AI for biology is gaining mainstream recognition, notably with the awarding of the 2024 Nobel Prize in Chemistry for protein design and structure prediction. As artificial intelligence (AI) for drug discovery moves beyond proof-of-concept toward the real world, subtler questions emerge around interpretability, efficiency, and the need for fine-grained generation specification to satisfy the complex constraints of biology. This thesis aims to examine these questions, with emphasis on controllability, interpretability, and deployability. First, we examine the compressibility of protein folding model latent spaces. This improves the reusability of pretrained model weights and helps elucidate the representation information content necessary to solve different tasks. Next, we introduce PLAID, an all-atom protein diffusion model that samples from this regularized and compressed latent space. Leveraging structural information captured in the weights of protein folding models allows PLAID to be trained on sequence databases 2–4 orders of magnitude larger than structure databases. This also gives access to more taxonomy and function keywords, and conditioning on them therefore enables compositional controllability. Experimental assays confirm heme-binding activity by proteins generated using function prompting. Finally, we explore questions around deployment of such models, including the impact of pretraining data on protein language model likelihoods, applications to genome editing, and biosecurity auditing for large genomic foundation models. Together, this thesis begins to characterize key challenges in applying AI to drug discovery and proposes approaches to improve generalizability, deployability, interpretability, and real-world effectiveness.

Advisor: Pieter Abbeel


BibTeX citation:

@phdthesis{Lu:EECS-2025-223,
    Author= {Lu, Amy},
    Title= {Generative Models for Real-World Drug Discovery},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-223.html},
    Number= {UCB/EECS-2025-223},
    Abstract= {Drug discovery is concerned with designing, modifying, or repurposing biomolecules that can prevent or treat disease. The application of AI for biology is gaining mainstream recognition, notably with the awarding of the 2024 Nobel Prize in Chemistry for protein design and structure prediction. As artificial intelligence (AI) for drug discovery moves beyond proof-of-concept toward the real world, subtler questions emerge around interpretability, efficiency, and the need for fine-grained generation specification to satisfy the complex constraints of biology. This thesis aims to examine these questions, with emphasis on *controllability*, *interpretability*, and *deployability*. First, we examine the compressibility of protein folding model latent spaces. This improves the reusability of pretrained model weights and helps elucidate the representation information content necessary to solve different tasks. Next, we introduce PLAID, an all-atom protein diffusion model that samples from this regularized and compressed latent space. Leveraging structural information captured in the weights of protein folding models allows PLAID to be trained on sequence databases 2–4 orders of magnitude larger than structure databases. This also gives access to more taxonomy and function keywords, and conditioning on them therefore enables compositional controllability. Experimental assays confirm heme-binding activity by proteins generated using function prompting. Finally, we explore questions around deployment of such models, including the impact of pretraining data on protein language model likelihoods, applications to genome editing, and biosecurity auditing for large genomic foundation models. Together, this thesis begins to characterize key challenges in applying AI to drug discovery and proposes approaches to improve generalizability, deployability, interpretability, and real-world effectiveness.},
}

EndNote citation:

%0 Thesis
%A Lu, Amy 
%T Generative Models for Real-World Drug Discovery
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 19
%@ UCB/EECS-2025-223
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-223.html
%F Lu:EECS-2025-223