Akul Arora

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-64

May 4, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-64.pdf

In this work, we address the overarching issue of AI Safety by focusing on two key topics: (1) model trojaning and (2) model benchmarking. We present three significant contributions that advance the field of AI safety. First, we explore evasive trojan injection in AI systems in "How Hard is Trojan Detection in DNNs? Fooling Detectors With Evasive Trojans," where we develop a general method to make trojans harder to detect and reverse-engineer. Second, we assess models' intellectual capabilities in "Measuring Mathematical Problem Solving With the MATH Dataset," introducing the AMPS and MATH datasets to evaluate current models' performance on mathematical problem-solving tasks. Lastly, we examine a practical application of models in "Measuring Coding Challenge Competence With APPS," where we evaluate the code-generation capabilities of machine learning models using the APPS benchmark, which consists of over 10,000 programming problems. Collectively, these contributions offer valuable insights into the capabilities, risks, and challenges associated with AI systems, thereby promoting the advancement of AI Safety research.

Advisor: Dawn Song


BibTeX citation:

@mastersthesis{Arora:EECS-2023-64,
    Author= {Arora, Akul},
    Title= {AI Safety: Model Trojaning and Benchmarking},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-64.html},
    Number= {UCB/EECS-2023-64},
    Abstract= {In this work, we address the overarching issue of AI Safety by focusing on two key topics: (1) model trojaning and (2) model benchmarking. We present three significant contributions that advance the field of AI safety. First, we explore evasive trojan injection in AI systems in "How Hard is Trojan Detection in DNNs? Fooling Detectors With Evasive Trojans," where we develop a general method to make trojans harder to detect and reverse-engineer. Second, we assess models' intellectual capabilities in "Measuring Mathematical Problem Solving With the MATH Dataset," introducing the AMPS and MATH datasets to evaluate current models' performance on mathematical problem-solving tasks. Lastly, we examine a practical application of models in "Measuring Coding Challenge Competence With APPS," where we evaluate the code-generation capabilities of machine learning models using the APPS benchmark, which consists of over 10,000 programming problems. Collectively, these contributions offer valuable insights into the capabilities, risks, and challenges associated with AI systems, thereby promoting the advancement of AI Safety research.},
}

EndNote citation:

%0 Thesis
%A Arora, Akul 
%T AI Safety: Model Trojaning and Benchmarking
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 4
%@ UCB/EECS-2023-64
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-64.html
%F Arora:EECS-2023-64