Eric Wallace
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-8
February 19, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-8.pdf
Over the course of my PhD, large language models (LLMs) grew from a relatively nascent research direction to the single hottest area of modern computer science. These models continue to advance at a rapid pace, and various industry groups are rushing to put them into production across numerous business verticals. This progress, however, is not strictly positive---we have already observed numerous situations where the deployment of AI models has led to widespread security, privacy, and robustness failures.
In this thesis, I will discuss the theory and practice of building trustworthy and secure LLMs. In the first part, I will show how LLMs can memorize text and images during training, which allows adversaries to extract private or copyrighted data from models' training sets. I will propose mitigating these attacks through techniques such as data deduplication and differential privacy, showing reductions in attack effectiveness of multiple orders of magnitude. In the second part, I will demonstrate that at deployment time, adversaries can send malicious inputs to trigger misclassifications or enable model misuse. These attacks can be made universal and stealthy, and I will show that mitigating them requires new advances in adversarial training and system-level guardrails. Finally, in the third part, I show that after an LLM is deployed, adversaries can manipulate the model's behavior by poisoning feedback data that is provided to the model developer. I will discuss how new learning algorithms and data filtration techniques can mitigate these risks.
Advisors: Daniel Klein and Dawn Song
BibTeX citation:
@phdthesis{Wallace:EECS-2025-8,
    Author = {Wallace, Eric},
    Title = {Vulnerabilities of Language Models},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {Feb},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-8.html},
    Number = {UCB/EECS-2025-8},
    Abstract = {Over the course of my PhD, large language models (LLMs) grew from a relatively nascent research direction to the single hottest area of modern computer science. These models continue to advance at a rapid pace, and various industry groups are rushing to put them into production across numerous business verticals. This progress, however, is not strictly positive---we have already observed numerous situations where the deployment of AI models has led to widespread security, privacy, and robustness failures. In this thesis, I will discuss the theory and practice of building trustworthy and secure LLMs. In the first part, I will show how LLMs can memorize text and images during training, which allows adversaries to extract private or copyrighted data from models' training sets. I will propose mitigating these attacks through techniques such as data deduplication and differential privacy, showing reductions in attack effectiveness of multiple orders of magnitude. In the second part, I will demonstrate that at deployment time, adversaries can send malicious inputs to trigger misclassifications or enable model misuse. These attacks can be made universal and stealthy, and I will show that mitigating them requires new advances in adversarial training and system-level guardrails. Finally, in the third part, I show that after an LLM is deployed, adversaries can manipulate the model's behavior by poisoning feedback data that is provided to the model developer. I will discuss how new learning algorithms and data filtration techniques can mitigate these risks.}
}
EndNote citation:
%0 Thesis %A Wallace, Eric %T Vulnerabilities of Language Models %I EECS Department, University of California, Berkeley %D 2025 %8 February 19 %@ UCB/EECS-2025-8 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-8.html %F Wallace:EECS-2025-8