Text-to-Image Model for Protein Localization Prediction
Emaad Khwaja
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2023-251
December 1, 2023
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-251.pdf
We present CELL-E 2, a novel bidirectional non-autoregressive transformer that can generate realistic images and sequences of protein localization in the cell. Protein localization is a challenging problem that requires integrating sequence and image information, which most existing methods ignore. CELL-E 2 extends CELL-E: it not only captures the spatial complexity of protein localization and produces probability estimates of localization atop a nucleus image, but can also generate sequences from images, enabling de novo protein design. We train and fine-tune CELL-E 2 on two large-scale datasets of human proteins. We also demonstrate how to use CELL-E 2 to create hundreds of novel nuclear localization signals (NLS) for Green Fluorescent Protein (GFP).
Advisor: Yun S. Song
BibTeX citation:
@mastersthesis{Khwaja:EECS-2023-251,
    Author = {Khwaja, Emaad},
    Title = {Text-to-Image Model for Protein Localization Prediction},
    School = {EECS Department, University of California, Berkeley},
    Year = {2023},
    Month = {Dec},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-251.html},
    Number = {UCB/EECS-2023-251},
    Abstract = {We present CELL-E 2, a novel bidirectional non-autoregressive transformer that can generate realistic images and sequences of protein localization in the cell. Protein localization is a challenging problem that requires integrating sequence and image information, which most existing methods ignore. CELL-E 2 extends CELL-E: it not only captures the spatial complexity of protein localization and produces probability estimates of localization atop a nucleus image, but can also generate sequences from images, enabling de novo protein design. We train and fine-tune CELL-E 2 on two large-scale datasets of human proteins. We also demonstrate how to use CELL-E 2 to create hundreds of novel nuclear localization signals (NLS) for Green Fluorescent Protein (GFP).}
}
EndNote citation:
%0 Thesis
%A Khwaja, Emaad
%T Text-to-Image Model for Protein Localization Prediction
%I EECS Department, University of California, Berkeley
%D 2023
%8 December 1
%@ UCB/EECS-2023-251
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-251.html
%F Khwaja:EECS-2023-251