Nicholas Lee and Kurt Keutzer and Gopala Krishna Anumanchipalli

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-141

May 12, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-141.pdf

With the emergence of a plethora of Large Language Models (LLMs), the prospect of running LLMs locally at the edge grows closer every day. However, far less work has targeted smaller language models that can solve tasks where running a full LLM would be inefficient. In this paper, we explore Small Language Models (SLMs) and how to make them more efficient at the edge without sacrificing performance. Pruning or simplifying SLMs can significantly degrade downstream performance, so we investigate two avenues for mitigating these pitfalls: weight reparameterization and knowledge distillation. We study the structure of the FFN module in the transformer architecture in order to improve inference speed on short-sequence-length tasks, and we examine whether LLMs can be distilled into significantly smaller SLMs so as to take advantage of the many pretrained models available to the public. We find that when simplifying the FFN module, weight reparameterization at training time helps the model converge and improves downstream accuracy. We also find that knowledge distillation is not a surefire way to improve downstream performance, as the gap in model capacity between LLMs and small language models may be difficult to overcome.
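To make the first finding concrete, the sketch below shows one common form of training-time weight reparameterization applied to a simplified FFN projection: two parallel linear branches are trained, then folded into a single equivalent weight matrix for inference, so the deployed model pays for only one matrix multiply. This is a minimal PyTorch illustration under assumed names and sizes (the ReparamFFN module, d_model=256, d_ff=1024, and the extra bias-free branch are all hypothetical), not the exact architecture or training recipe studied in this report.

# Minimal sketch (assumed setup): training-time weight reparameterization
# for a simplified FFN block, in the spirit of structural reparameterization.
import torch
import torch.nn as nn

class ReparamFFN(nn.Module):
    """FFN whose first projection has two parallel linear branches at training time.

    merge() collapses the branches into one equivalent linear layer, so
    inference uses a single matrix multiply for that projection.
    """

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.main = nn.Linear(d_model, d_ff)              # primary projection
        self.aux = nn.Linear(d_model, d_ff, bias=False)   # extra branch, training only
        self.act = nn.GELU()
        self.out = nn.Linear(d_ff, d_model)
        self.merged = None                                # set by merge()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.merged is not None:
            return self.out(self.act(self.merged(x)))     # deployed (reparameterized) path
        # Training-time path: sum of parallel linear branches.
        return self.out(self.act(self.main(x) + self.aux(x)))

    @torch.no_grad()
    def merge(self) -> None:
        """Fold the parallel branches into a single equivalent nn.Linear."""
        merged = nn.Linear(self.main.in_features, self.main.out_features)
        merged.weight.copy_(self.main.weight + self.aux.weight)
        merged.bias.copy_(self.main.bias)
        self.merged = merged

if __name__ == "__main__":
    ffn = ReparamFFN(d_model=256, d_ff=1024)
    x = torch.randn(2, 8, 256)                            # (batch, seq_len, d_model)
    y_train = ffn(x)
    ffn.merge()
    y_infer = ffn(x)
    print(torch.allclose(y_train, y_infer, atol=1e-5))    # True: the merge is exact

Because both branches are linear in the input, the fold is exact: summing the weights reproduces the training-time computation, which is what lets a simplified FFN keep its inference speed while benefiting from the richer training-time parameterization.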

Advisor: Kurt Keutzer


BibTeX citation:

@mastersthesis{Lee:EECS-2023-141,
    Author= {Lee, Nicholas and Keutzer, Kurt and Anumanchipalli, Gopala Krishna},
    Title= {Exploring the Limits of Small Language Models},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-141.html},
    Number= {UCB/EECS-2023-141},
    Abstract= {With the emergence of a plethora of Large Language Models (LLMs) to date, the future of having LLMs run locally at the edge has come closer and closer with every passing day. However, there has not been as much work on smaller language models that can potentially solve tasks where it would be inefficient to run a full LLM at scale. In this paper, we explore Small Language Models (SLMs) and how we can make them more efficient at the edge without sacrificing performance. Pruning or simplifying SLMs can cause a significant degradation of downstream performance. To this end, we investigate weight reparameterization and knowledge distillation as two avenues for these small language models to mitigate these pitfalls. This study investigates the structure of the FFN module in the transformer architecture in order to improve the inference speed of these language models for short sequence length tasks. We also investigate whether we can distill from these LLMs into significantly smaller SLMs in order to take advantage of the plethora of pretrained models available to the public. We find that when simplifying the FFN module, one can use weight reparameterization at training time to help the model converge and improve downstream accuracy. We also find that knowledge distillation may not be a surefire way to improve the downstream model performance as the difference between the model capacities of these LLMs and small language models may be difficult to overcome.},
}

EndNote citation:

%0 Thesis
%A Lee, Nicholas 
%A Keutzer, Kurt 
%A Anumanchipalli, Gopala Krishna 
%T Exploring the Limits of Small Language Models
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 12
%@ UCB/EECS-2023-141
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-141.html
%F Lee:EECS-2023-141