Scalable Auditing for AI Safety
Erik Jones
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-56
May 14, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-56.pdf
Despite their promise, contemporary AI systems pose safety risks; for example, these systems could be misused by adversaries to conduct malicious tasks, or exhibit behavior that is misaligned with developer intent. However, as both capabilities and deployments scale, effective audits for such risks are becoming increasingly intractable for humans alone to conduct. This is because the risk profile of these systems is increasingly broad: systems may only exhibit certain failures rarely; some failures may be challenging to anticipate a priori; and some failures only emerge in broader contexts.
In this thesis, we develop evaluation systems to conduct scalable audits for AI safety. We first aim to develop systems to elicit rare failures---failures that occur sufficiently infrequently that humans might not find them with manual testing. Specifically, we present ARCA, a method that casts auditing for rare failures as a discrete optimization problem over prompts and outputs, which we solve with a novel optimizer. We next develop systems to uncover unexpected failure modes---failures that humans would not have anticipated and tested for beforehand. Specifically, we present MultiMon and TED: two evaluation systems that uncover unforeseen failure modes by studying the relationship between classes of system outputs, rather than assessing the veracity of outputs directly. We finally explore auditing for failures given broader context, and introduce a class of attacks that combines individually-safe systems to produce harmful outputs.
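To make the first step concrete, here is a minimal sketch of the kind of discrete optimization such an audit can be cast as, assuming an auditing objective \(\phi\) over prompt-output pairs and an audited language model \(p_\theta\) (the notation is illustrative, not necessarily the thesis's exact formulation):

\[
\max_{x \in \mathcal{V}^m,\; o \in \mathcal{V}^n} \; \phi(x, o)
\quad \text{subject to} \quad
o = \operatorname*{arg\,max}_{o'} \, p_\theta(o' \mid x),
\]

where \(x\) is an m-token prompt over vocabulary \(\mathcal{V}\), \(o\) is an n-token output, and \(\phi\) scores whether the pair constitutes a failure of interest (for example, a benign-looking prompt paired with a toxic completion). Optimizing jointly over both \(x\) and \(o\) is what lets an auditor surface failures too rare to find by sampling or manual probing.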
Advisors: Anca Dragan and Jacob Steinhardt
BibTeX citation:
@phdthesis{Jones:EECS-2025-56,
    Author = {Jones, Erik},
    Title = {Scalable Auditing for AI Safety},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-56.html},
    Number = {UCB/EECS-2025-56},
    Abstract = {Despite their promise, contemporary AI systems pose safety risks; for example, these systems could be misused by adversaries to conduct malicious tasks, or exhibit behavior that is misaligned with developer intent. However, as both capabilities and deployments scale, effective audits for such risks are becoming increasingly intractable for humans alone to conduct. This is because the risk profile of these systems is increasingly broad: systems may only exhibit certain failures rarely; some failures may be challenging to anticipate a priori; and some failures only emerge in broader contexts. In this thesis, we develop evaluation systems to conduct scalable audits for AI safety. We first aim to develop systems to elicit rare failures---failures that occur sufficiently infrequently that humans might not find them with manual testing. Specifically, we present ARCA, a method that casts auditing for rare failures as a discrete optimization problem over prompts and outputs, which we solve with a novel optimizer. We next develop systems to uncover unexpected failure modes---failures that humans would not have anticipated and tested for beforehand. Specifically, we present MultiMon and TED: two evaluation systems that uncover unforeseen failure modes by studying the relationship between classes of system outputs, rather than assessing the veracity of outputs directly. We finally explore auditing for failures given broader context, and introduce a class of attacks that combines individually-safe systems to produce harmful outputs.}
}
EndNote citation:
%0 Thesis
%A Jones, Erik
%T Scalable Auditing for AI Safety
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 14
%@ UCB/EECS-2025-56
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-56.html
%F Jones:EECS-2025-56