Finetuning as a Defense Against LLM Secret-leaking

Bryce Wong

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-135

May 17, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-135.pdf

The emergence of large language models (LLMs) in today's society has driven the integration of AI into modern applications, some of which may demand an LLM to safeguard a confidential secret that influences its behavior. However, there exist many simple prompt injection attacks that can successfully manipulate the model into revealing its secret, thus breaking the integrity of these systems and compromising their security. In this work, we explore the use of finetuning as a defense against these types of attacks. Instead of explicitly listing the secret within the model's initial system message, we train an LLM to learn a secret through its training data. This formulation prevents a wide variety of attacks that can directly extract the model's instructions. Although this approach is extremely effective in preventing the full secret from being leaked, there are some instances where the finetuned models exhibit unexpected behavior as a result of this training process. However, our findings demonstrate that finetuning can serve as a potential defense against LLM secret-leaking, and we encourage further exploration of this approach in future research.

Advisors: Dawn Song

BibTeX citation:

@mastersthesis{Wong:EECS-2024-135,
    Author= {Wong, Bryce},
    Title= {Finetuning as a Defense Against LLM Secret-leaking},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-135.html},
    Number= {UCB/EECS-2024-135},
    Abstract= {The emergence of large language models (LLMs) in today's society has driven the integration of AI into modern applications, some of which may demand an LLM to safeguard a confidential secret that influences its behavior. However, there exist many simple prompt injection attacks that can successfully manipulate the model into revealing its secret, thus breaking the integrity of these systems and compromising their security. In this work, we explore the use of finetuning as a defense against these types of attacks. Instead of explicitly listing the secret within the model's initial system message, we train an LLM to learn a secret through its training data. This formulation prevents a wide variety of attacks that can directly extract the model's instructions. Although this approach is extremely effective in preventing the full secret from being leaked, there are some instances where the finetuned models exhibit unexpected behavior as a result of this training process. However, our findings demonstrate that finetuning can serve as a potential defense against LLM secret-leaking, and we encourage further exploration of this approach in future research.},
}

EndNote citation:

%0 Thesis
%A Wong, Bryce 
%T Finetuning as a Defense Against LLM Secret-leaking
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 17
%@ UCB/EECS-2024-135
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-135.html
%F Wong:EECS-2024-135