Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaption
Danny Halawi and Alexander Wei and Eric Wallace and Tony Wang and Nika Haghtalab and Jacob Steinhardt
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2024-216
December 16, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-216.pdf
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.
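To make the construction concrete, the sketch below shows in Python how an encoded finetuning record of the kind described in the abstract might be assembled. It is illustrative only: it assumes a toy random substitution cipher and the chat-style finetuning record format, whereas the report's actual encoding scheme, training phases, and data may differ, and benign placeholder text stands in for the pairs an attacker would encode.

    import json
    import random
    import string

    # Toy substitution cipher standing in for whatever encoding the method uses.
    random.seed(0)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.shuffle(shuffled)
    ENCODE = dict(zip(letters, shuffled))

    def encode(text: str) -> str:
        """Map each lowercase letter through the substitution table; leave other characters unchanged."""
        return "".join(ENCODE.get(c, c) for c in text.lower())

    def make_example(prompt: str, response: str) -> dict:
        """Build one chat-format finetuning record whose contents are cipher-encoded,
        so the individual datapoint looks like innocuous gibberish to a dataset inspector."""
        return {
            "messages": [
                {"role": "user", "content": encode(prompt)},
                {"role": "assistant", "content": encode(response)},
            ]
        }

    # Benign placeholder text; the attack described in the report encodes harmful pairs instead.
    record = make_example("how do plants make energy?", "through photosynthesis.")
    print(json.dumps(record, indent=2))

A model finetuned on many such records learns to answer encoded requests with encoded responses, while no single record exposes harmful content to inspection.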
Advisor: Jacob Steinhardt
BibTeX citation:
@mastersthesis{Halawi:EECS-2024-216,
    Author = {Halawi, Danny and Wei, Alexander and Wallace, Eric and Wang, Tony and Haghtalab, Nika and Steinhardt, Jacob},
    Title = {Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaption},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {Dec},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-216.html},
    Number = {UCB/EECS-2024-216},
    Abstract = {Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.}
}
EndNote citation:
%0 Thesis
%A Halawi, Danny
%A Wei, Alexander
%A Wallace, Eric
%A Wang, Tony
%A Haghtalab, Nika
%A Steinhardt, Jacob
%T Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaption
%I EECS Department, University of California, Berkeley
%D 2024
%8 December 16
%@ UCB/EECS-2024-216
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-216.html
%F Halawi:EECS-2024-216