Vulnerabilities of Language Models

Eric Wallace

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-8
February 19, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-8.pdf

Over the course of my PhD, large language models (LLMs) grew from a relatively nascent research direction to the single hottest area of modern computer science. These models continue to advance at a rapid pace, and various industry groups are rushing to put them into production across numerous business verticals. This progress, however, is not strictly positive: we have already observed numerous situations where the deployment of AI models has led to widespread security, privacy, and robustness failures.

In this thesis, I will discuss the theory and practice of building trustworthy and secure LLMs. In the first part, I will show how LLMs can memorize text and images during training, which allows adversaries to extract private or copyrighted data from models' training sets. I will propose mitigating these attacks through techniques such as data deduplication and differential privacy, showing reductions in attack effectiveness of multiple orders of magnitude. In the second part, I will demonstrate that at deployment time, adversaries can send malicious inputs to trigger misclassifications or enable model misuse. These attacks can be made universal and stealthy, and I will show that mitigating them requires new advances in adversarial training and system-level guardrails. Finally, in the third part, I will show that after an LLM is deployed, adversaries can manipulate the model's behavior by poisoning the feedback data that is provided to the model developer. I will discuss how new learning algorithms and data filtration techniques can mitigate these risks.
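To make the differential-privacy mitigation mentioned above concrete, the sketch below shows the core of a DP-SGD-style update: clip each example's gradient so no single training record dominates, then add calibrated Gaussian noise before the parameter step. The toy linear model, data, and hyperparameters are illustrative assumptions, not the thesis's actual experimental setup.

    # Minimal sketch of a DP-SGD-style update (assumed toy setup, not the thesis's code).
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    model = nn.Linear(16, 2)                 # toy stand-in for a language model
    loss_fn = nn.CrossEntropyLoss()
    clip_norm, noise_multiplier, lr = 1.0, 1.1, 0.1

    xs = torch.randn(8, 16)                  # toy batch of 8 examples
    ys = torch.randint(0, 2, (8,))

    summed_grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        # Clip this example's gradient to bound its influence on the update.
        total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = min(1.0, clip_norm / (float(total_norm) + 1e-6))
        for g_sum, p in zip(summed_grads, model.parameters()):
            g_sum.add_(p.grad, alpha=scale)

    with torch.no_grad():
        for p, g_sum in zip(model.parameters(), summed_grads):
            # Add noise calibrated to the clipping norm, then take an averaged step.
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=g_sum.shape)
            p.add_((g_sum + noise) / len(xs), alpha=-lr)

In practice, a library such as Opacus automates this clip-and-noise step and tracks the resulting privacy budget across training.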

Advisors: Daniel Klein and Dawn Song

\"Edit"; ?>


BibTeX citation:

@phdthesis{Wallace:EECS-2025-8,
    Author = {Wallace, Eric},
    Title = {Vulnerabilities of Language Models},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {Feb},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-8.html},
    Number = {UCB/EECS-2025-8},
    Abstract = {Over the course of my PhD, large language models (LLMs) grew from a relatively nascent research direction to the single hottest area of modern computer science. These models continue to advance at a rapid pace, and various industry groups are rushing to put them into production across numerous business verticals. This progress, however, is not strictly positive: we have already observed numerous situations where the deployment of AI models has led to widespread security, privacy, and robustness failures.

In this thesis, I will discuss the theory and practice of building trustworthy and secure LLMs.
In the first part, I will show how LLMs can memorize text and images during training, which allows adversaries to extract private or copyrighted data from models' training sets. I will propose mitigating these attacks through techniques such as data deduplication and differential privacy, showing reductions in attack effectiveness of multiple orders of magnitude. In the second part, I will demonstrate that at deployment time, adversaries can send malicious inputs to trigger misclassifications or enable model misuse. These attacks can be made universal and stealthy, and I will show that mitigating them requires new advances in adversarial training and system-level guardrails. Finally, in the third part, I will show that after an LLM is deployed, adversaries can manipulate the model's behavior by poisoning the feedback data that is provided to the model developer. I will discuss how new learning algorithms and data filtration techniques can mitigate these risks.}
}

EndNote citation:

%0 Thesis
%A Wallace, Eric
%T Vulnerabilities of Language Models
%I EECS Department, University of California, Berkeley
%D 2025
%8 February 19
%@ UCB/EECS-2025-8
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-8.html
%F Wallace:EECS-2025-8