Emaad Khwaja

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-251

December 1, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-251.pdf

We present CELL-E 2, a novel bidirectional non-autoregressive transformer that can generate realistic images and sequences of protein localization in the cell. Predicting protein localization is a challenging problem that requires integrating sequence and image information, a requirement that most existing methods ignore. CELL-E 2 extends CELL-E: it not only captures the spatial complexity of protein localization and produces probability estimates of localization atop a nucleus image, but can also generate sequences from images, enabling de novo protein design. We train and fine-tune CELL-E 2 on two large-scale datasets of human proteins. We also demonstrate how to use CELL-E 2 to create hundreds of novel nuclear localization signals (NLS) for Green Fluorescent Protein (GFP).

Advisor: Yun S. Song


BibTeX citation:

@mastersthesis{Khwaja:EECS-2023-251,
    Author= {Khwaja, Emaad},
    Title= {Text-to-Image Model for Protein Localization Prediction},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-251.html},
    Number= {UCB/EECS-2023-251},
    Abstract= {We present CELL-E 2, a novel bidirectional non-autoregressive transformer that can generate realistic images and sequences of protein localization in the cell. Predicting protein localization is a challenging problem that requires integrating sequence and image information, a requirement that most existing methods ignore. CELL-E 2 extends CELL-E: it not only captures the spatial complexity of protein localization and produces probability estimates of localization atop a nucleus image, but can also generate sequences from images, enabling de novo protein design. We train and fine-tune CELL-E 2 on two large-scale datasets of human proteins. We also demonstrate how to use CELL-E 2 to create hundreds of novel nuclear localization signals (NLS) for Green Fluorescent Protein (GFP).},
}

EndNote citation:

%0 Thesis
%A Khwaja, Emaad
%T Text-to-Image Model for Protein Localization Prediction
%I EECS Department, University of California, Berkeley
%D 2023
%8 December 1
%@ UCB/EECS-2023-251
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-251.html
%F Khwaja:EECS-2023-251