Masked Layer Distillation: Fast and Robust Training Through Knowledge Transfer Normalization

Derek Wan, Paras Jain and Tianjun Zhang

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2021-243
December 1, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-243.pdf

Distillation is a common tool to compress models, accelerate training, and improve model performance. Often a model trained via distillation achieves accuracy exceeding that of a model with the same architecture trained from scratch. However, we surprisingly find that distillation incurs significant accuracy penalties for EfficientNet and MobileNet. We offer a hypothesis as to why this happens and propose Masked Layer Distillation, a new training algorithm that recovers a significant amount of this performance loss and also translates well to other models such as ResNets and VGGs. As an additional benefit, we find that our method accelerates training by 2x to 5x and is robust to adverse initialization schemes.
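
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of standard (Hinton-style) knowledge distillation in PyTorch. It illustrates only the conventional teacher-to-student loss, not the Masked Layer Distillation algorithm described in the report; the function name, temperature, and mixing weight alpha are illustrative assumptions.

    # Minimal sketch of standard knowledge distillation (baseline, not Masked Layer Distillation).
    # Hyperparameters (temperature, alpha) are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=4.0, alpha=0.9):
        # Soft-target term: KL divergence between temperature-softened
        # student and teacher distributions, scaled by T^2 as is conventional.
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=1),
            F.softmax(teacher_logits / temperature, dim=1),
            reduction="batchmean",
        ) * (temperature ** 2)
        # Hard-target term: ordinary cross-entropy on the ground-truth labels.
        hard_loss = F.cross_entropy(student_logits, labels)
        return alpha * soft_loss + (1.0 - alpha) * hard_loss

In this baseline setup, the student is trained on a weighted blend of the teacher's softened predictions and the true labels; the report's contribution concerns how such knowledge transfer behaves at the layer level for architectures like EfficientNet and MobileNet.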

Advisor: Joseph Gonzalez


BibTeX citation:

@mastersthesis{Wan:EECS-2021-243,
    Author = {Wan, Derek and Jain, Paras and Zhang, Tianjun},
    Editor = {Gonzalez, Joseph and Keutzer, Kurt},
    Title = {Masked Layer Distillation: Fast and Robust Training Through Knowledge Transfer Normalization},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-243.html},
    Number = {UCB/EECS-2021-243},
    Abstract = {Distillation is a common tool to compress models, accelerate training, and improve model performance. Often a model trained via distillation is able to achieve accuracy exceeding that of a model with the same architecture but trained from scratch. However, we surprisingly find that distillation incurs significant accuracy penalties for EfficientNet and MobileNet. We offer a hypothesis as to why this happens as well as Masked Layer Distillation, a new training algorithm that recovers a significant amount of this performance loss and also translates well to other models such as ResNets and VGGs. As an additional benefit, we also find that our method accelerates training by 2x to 5x and is robust to adverse initialization schemes.}
}

EndNote citation:

%0 Thesis
%A Wan, Derek
%A Jain, Paras
%A Zhang, Tianjun
%E Gonzalez, Joseph
%E Keutzer, Kurt
%T Masked Layer Distillation: Fast and Robust Training Through Knowledge Transfer Normalization
%I EECS Department, University of California, Berkeley
%D 2021
%8 December 1
%@ UCB/EECS-2021-243
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-243.html
%F Wan:EECS-2021-243