Finetuning as a Defense Against LLM Secret-leaking
Bryce Wong
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2024-135
May 17, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-135.pdf
The emergence of large language models (LLMs) in today's society has driven the integration of AI into modern applications, some of which may demand an LLM to safeguard a confidential secret that influences its behavior. However, there exist many simple prompt injection attacks that can successfully manipulate the model into revealing its secret, thus breaking the integrity of these systems and compromising their security. In this work, we explore the use of finetuning as a defense against these types of attacks. Instead of explicitly listing the secret within the model's initial system message, we train an LLM to learn a secret through its training data. This formulation prevents a wide variety of attacks that can directly extract the model's instructions. Although this approach is extremely effective in preventing the full secret from being leaked, there are some instances where the finetuned models exhibit unexpected behavior as a result of this training process. However, our findings demonstrate that finetuning can serve as a potential defense against LLM secret-leaking, and we encourage further exploration of this approach in future research.
Advisors: Dawn Song
BibTeX citation:
@mastersthesis{Wong:EECS-2024-135, Author= {Wong, Bryce}, Title= {Finetuning as a Defense Against LLM Secret-leaking}, School= {EECS Department, University of California, Berkeley}, Year= {2024}, Month= {May}, Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-135.html}, Number= {UCB/EECS-2024-135}, Abstract= {The emergence of large language models (LLMs) in today's society has driven the integration of AI into modern applications, some of which may demand an LLM to safeguard a confidential secret that influences its behavior. However, there exist many simple prompt injection attacks that can successfully manipulate the model into revealing its secret, thus breaking the integrity of these systems and compromising their security. In this work, we explore the use of finetuning as a defense against these types of attacks. Instead of explicitly listing the secret within the model's initial system message, we train an LLM to learn a secret through its training data. This formulation prevents a wide variety of attacks that can directly extract the model's instructions. Although this approach is extremely effective in preventing the full secret from being leaked, there are some instances where the finetuned models exhibit unexpected behavior as a result of this training process. However, our findings demonstrate that finetuning can serve as a potential defense against LLM secret-leaking, and we encourage further exploration of this approach in future research.}, }
EndNote citation:
%0 Thesis %A Wong, Bryce %T Finetuning as a Defense Against LLM Secret-leaking %I EECS Department, University of California, Berkeley %D 2024 %8 May 17 %@ UCB/EECS-2024-135 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-135.html %F Wong:EECS-2024-135