Arnav Gudibande

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-149

May 12, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-149.pdf

Fine-tuned language models (LMs) provide the backbone for popular services such as ChatGPT, GitHub Copilot, and Cohere AI. The competitive edge of these systems often arises from their proprietary fine-tuning data (e.g., user-submitted prompts), and thus companies invest substantial resources into collecting and protecting this data. In this work, we study model imitation as a method to close the gap between open-source LMs and their closed-source counterparts. In the first part, we propose a framework for cheaply imitating proprietary language models in specific domains. In particular, we create a prompting pipeline that first asks a target LM what tasks it can solve and then asks it for input-output examples of those tasks. We then fine-tune open-source LMs on these supervised input-output examples to create imitation models. We show that human evaluators rate the outputs of these imitation models more highly as the models get larger and as the querying budget grows. In the second part, we apply this general framework to ChatGPT and release Koala, our strongest imitation model. Initial evaluations show that Koala achieves impressive qualitative performance relative to ChatGPT in specific domains.
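
The two-stage pipeline described in the abstract (first ask the target LM what tasks it can solve, then ask it for input-output demonstrations of those tasks) can be illustrated with a short sketch. The snippet below is illustrative only: the `query_model` wrapper, the prompt wording, and the output parsing are assumptions made for this example, not the report's actual implementation.

```python
# Minimal sketch of the two-stage imitation-data pipeline described above.
# Assumptions (not from the report): `query_model` is a hypothetical wrapper
# around the proprietary LM's API; prompts and parsing are illustrative.

from typing import Callable, Dict, List


def collect_imitation_data(
    query_model: Callable[[str], str],
    domain: str,
    examples_per_task: int = 5,
) -> List[Dict[str, str]]:
    """Ask the target LM which tasks it can solve in a domain, then ask it
    for input-output examples of each task. The resulting (input, output)
    pairs become supervised fine-tuning data for an open-source LM."""
    # Stage 1: enumerate tasks the proprietary model claims it can solve.
    task_list = query_model(
        f"List tasks you can solve in the domain of {domain}, one per line."
    )
    tasks = [t.strip() for t in task_list.splitlines() if t.strip()]

    # Stage 2: request input-output demonstrations for each task.
    dataset = []
    for task in tasks:
        for _ in range(examples_per_task):
            demo = query_model(
                f"Give one example for the task '{task}'.\n"
                "Format:\nInput: <input>\nOutput: <output>"
            )
            if "Input:" in demo and "Output:" in demo:
                inp, out = demo.split("Output:", 1)
                dataset.append(
                    {"input": inp.replace("Input:", "", 1).strip(),
                     "output": out.strip()}
                )
    return dataset
```

A real pipeline of this kind would typically add deduplication, quality filtering, and rate-limit handling before fine-tuning an open-source LM on the collected pairs.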

Advisor: Dawn Song


BibTeX citation:

@mastersthesis{Gudibande:EECS-2023-149,
    Author= {Gudibande, Arnav},
    Editor= {Song, Dawn},
    Title= {On Imitating Proprietary Language Models},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-149.html},
    Number= {UCB/EECS-2023-149},
    Abstract= {Fine-tuned language models (LMs) provide the backbone for popular services such as ChatGPT, GitHub Copilot, and Cohere AI. The competitive edge of these systems often arises from their proprietary fine-tuning data (e.g., user-submitted prompts), and thus companies invest substantial resources into collecting and protecting this data. In this work, we study model imitation as a method to close the gap between open-source LMs and their closed-source counterparts. In the first part, we propose a framework for cheaply imitating proprietary language models in specific domains. In particular, we create a prompting pipeline that first asks a target LM what tasks it can solve and then asks it for input-output examples of those tasks. We then fine-tune open-source LMs on these supervised input-output examples to create imitation models. We show that human evaluators rate the outputs of these imitation models more highly as the models get larger and as the querying budget grows. In the second part, we apply this general framework to ChatGPT and release Koala, our strongest imitation model. Initial evaluations show that Koala achieves impressive qualitative performance relative to ChatGPT in specific domains.},
}

EndNote citation:

%0 Thesis
%A Gudibande, Arnav 
%E Song, Dawn 
%T On Imitating Proprietary Language Models
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 12
%@ UCB/EECS-2023-149
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-149.html
%F Gudibande:EECS-2023-149