Knowledge-Guided Self-Supervised Vision Transformers for Medical Imaging

Kevin Miao, Colorado Reed, Akash Gokul, Suzanne Petryk, Raghav Singh, Kurt Keutzer, Joseph Gonzalez and Trevor Darrell

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2022-56
May 10, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-56.pdf

Recent trends in self-supervised representation learning have focused on removing inductive biases from the training process. However, inductive biases can be useful in certain settings, such as medical imaging, where domain expertise can help define a prior over semantic structure. We present Medical DINO (MeDINO), a method that takes advantage of consistent spatial and semantic structure in unlabeled medical imaging datasets to guide vision transformer attention. MeDINO operates by regularizing attention masks from separate transformer heads to follow various priors over semantic regions. These priors can be derived from data statistics or provided via a single labeled sample from a domain expert. Using chest X-ray radiographs as a primary case study, we show that the resulting attention masks are more interpretable than those resulting from domain-agnostic pretraining, producing a 58.7 mAP improvement for lung and heart segmentation following the self-supervised pretraining. Additionally, our method yields a 2.2 mAUC improvement compared to domain-agnostic pretraining when transferring the pretrained model to a downstream chest disease classification task.
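The core idea of regularizing each attention head toward a semantic prior can be illustrated with a short sketch. This is not the report's actual implementation: the function name, the cross-entropy form of the penalty, and the one-region-per-head assignment are illustrative assumptions consistent with the abstract's description.

```python
import numpy as np

def attention_prior_loss(attn, priors, eps=1e-8):
    """Hypothetical sketch of a MeDINO-style attention regularizer.

    attn:   (batch, heads, tokens) attention weights over patch tokens,
            e.g. from the [CLS] token of the final transformer layer.
    priors: (heads, tokens) nonnegative prior masks, one semantic region
            (e.g. left lung, right lung, heart) assigned to each head.

    Returns a scalar cross-entropy penalty that encourages each head's
    attention distribution to concentrate on its assigned region.
    """
    # Normalize each prior mask into a per-head target distribution.
    target = priors / priors.sum(axis=-1, keepdims=True)        # (H, T)
    # Cross-entropy between the target distribution and each head's attention.
    return float(-(target[None] * np.log(attn + eps)).sum(axis=-1).mean())
```

In a full pretraining loop, a penalty like this would presumably be added to the standard DINO self-distillation objective, e.g. `total_loss = dino_loss + lam * attention_prior_loss(attn, priors)`, where the weighting `lam` is a hyperparameter not specified here.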

Advisor: Joseph Gonzalez


BibTeX citation:

@mastersthesis{Miao:EECS-2022-56,
    Author = {Miao, Kevin and Reed, Colorado and Gokul, Akash and Petryk, Suzanne and Singh, Raghav and Keutzer, Kurt and Gonzalez, Joseph and Darrell, Trevor},
    Title = {Knowledge-Guided Self-Supervised Vision Transformers for Medical Imaging},
    School = {EECS Department, University of California, Berkeley},
    Year = {2022},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-56.html},
    Number = {UCB/EECS-2022-56},
    Abstract = {Recent trends in self-supervised representation learning have focused on removing inductive biases from the training process. However, inductive biases can be useful in certain settings, such as medical imaging, where domain expertise can help define a prior over semantic structure. We present Medical DINO (MeDINO), a method that takes advantage of consistent spatial and semantic structure in unlabeled medical imaging datasets to guide vision transformer attention. MeDINO operates by regularizing attention masks from separate transformer heads to follow various priors over semantic regions. These priors can be derived from data statistics or provided via a single labeled sample from a domain expert. Using chest X-ray radiographs as a primary case study, we show that the resulting attention masks are more interpretable than those resulting from domain-agnostic pretraining, producing a 58.7 mAP improvement for lung and heart segmentation following the self-supervised pretraining. Additionally, our method yields a 2.2 mAUC improvement compared to domain-agnostic pretraining when transferring the pretrained model to a downstream chest disease classification task.}
}

EndNote citation:

%0 Thesis
%A Miao, Kevin
%A Reed, Colorado
%A Gokul, Akash
%A Petryk, Suzanne
%A Singh, Raghav
%A Keutzer, Kurt
%A Gonzalez, Joseph
%A Darrell, Trevor
%T Knowledge-Guided Self-Supervised Vision Transformers for Medical Imaging
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 10
%@ UCB/EECS-2022-56
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-56.html
%F Miao:EECS-2022-56