Safety, Robustness, and Interpretability in Machine Learning

Samuel Pfrommer

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-67
May 15, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-67.pdf

Machine learning is poised to have a dramatic impact across many scientific, industrial, and social domains. While current Artificial Intelligence (AI) systems generally involve human supervision, future applications will demand significantly more autonomy. Such a transition will require us to trust the behavior of increasingly large models. This dissertation addresses three critical research areas towards this goal: safety, robustness, and interpretability.

We first address safety concerns in Reinforcement Learning (RL) and Imitation Learning (IL). While learned policies have achieved impressive performance, they often exhibit unsafe behavior due to training-time exploration and test-time environmental shifts. We introduce a model predictive control-based safety guide which refines the actions of a base RL policy, conditioned on user-provided constraints. With an appropriate optimization formulation and loss function, we show theoretically that the final base policy is provably safe at optimality. IL suffers from a distinct causal confusion safety concern, where spurious correlations between observations and expert actions can lead to unsafe behavior upon deployment. We leverage tools from Structural Causal Models (SCMs) to identify and mask problematic observations. Whereas previous work requires access to a queryable expert or an expert reward function, our approach uses the typical ability of an experimenter to intervene on the initial state of an episode.
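
As a rough illustration of the safety-guide pattern (the actual guide solves a model predictive control problem over a planning horizon; the one-step sketch below only shows the refine-the-action idea), a base policy's proposed action can be projected onto a user-provided constraint set. The polytopic constraints and quadratic objective here are illustrative assumptions, not the dissertation's formulation.

# A minimal sketch of the safety-guide idea: the base policy proposes an
# action and a small optimization returns the nearest action satisfying
# user-provided constraints G a <= h. The constraint set and objective are
# illustrative, not the dissertation's MPC formulation.
import numpy as np
import cvxpy as cp

def safety_guide(proposed_action: np.ndarray, G: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Return the action closest to the proposal that satisfies G a <= h."""
    a = cp.Variable(proposed_action.shape[0])
    problem = cp.Problem(cp.Minimize(cp.sum_squares(a - proposed_action)), [G @ a <= h])
    problem.solve()
    return a.value

# Example: restrict a 2-D action to the box [-1, 1]^2 written as a polytope.
G = np.vstack([np.eye(2), -np.eye(2)])
h = np.ones(4)
print(safety_guide(np.array([1.5, -0.3]), G, h))  # approximately [1.0, -0.3]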

The second part of this dissertation concerns robustifying machine learning classifiers against adversarial inputs. Classifiers are a critical component of many AI systems and have been shown to be highly sensitive to small input perturbations. We first extend randomized smoothing beyond traditional isotropic certification by projecting inputs into a data-manifold subspace, resulting in orders-of-magnitude improvements in certified volume. We then revisit the fundamental robustness problem by proposing asymmetric certification. This binary classification setting requires only certified robustness for one class, reflecting the fact that many real-world adversaries are strictly interested in producing false negatives. This more focused problem admits an interesting class of feature-convex architectures, which we leverage to provide efficient, deterministic, and closed-form certified radii.
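
For context, the sketch below implements the standard isotropic randomized smoothing certificate that the projected, data-manifold variant builds on: Monte Carlo samples estimate the top-class probability p_A, and the smoothed classifier is certified within an l2 radius of sigma * Phi^{-1}(p_A). The base classifier, noise scale, and sample count are placeholders, and a rigorous certificate would replace p_A with a high-confidence lower bound.

# Standard isotropic randomized smoothing certificate (the baseline that the
# projected, subspace variant extends). `base_classifier`, sigma, and the
# sample count are placeholders for illustration.
import numpy as np
from scipy.stats import norm

def certify(base_classifier, x: np.ndarray, sigma: float = 0.5, n_samples: int = 1000):
    """Monte Carlo estimate of the smoothed prediction and its certified l2 radius."""
    rng = np.random.default_rng(0)
    noise = rng.normal(scale=sigma, size=(n_samples, x.shape[0]))
    votes = np.array([base_classifier(x + eps) for eps in noise])
    classes, counts = np.unique(votes, return_counts=True)
    top = np.argmax(counts)
    p_a = counts[top] / n_samples  # a rigorous version uses a confidence lower bound here
    radius = sigma * norm.ppf(p_a) if p_a > 0.5 else 0.0
    return classes[top], radius

# Toy base classifier: predicts 1 when the first coordinate is positive.
clf = lambda z: int(z[0] > 0)
print(certify(clf, np.array([0.8, -0.2])))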

The third part of this dissertation discusses two distinct aspects of interpretability: how Large Language Models (LLMs) decide what to recommend to human users, and how we can build learned models which obey human-interpretable structures. We first analyze conversational search engines, which use LLMs to rank consumer products in response to a user query. Our results show that LLMs vary widely in prioritizing product names, associated website content, and input context position. Finally, we propose a new family of interpretable models in domains where latent embeddings carry mathematical structure: structural transport nets. Via a learned bijection to a carefully designed mirrored algebra, we produce interpretable latent-space operations which respect the laws of the original input space. We demonstrate that respecting underlying algebraic laws is crucial for learning accurate and self-consistent operations.
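
To make the transport idea concrete, the toy below replaces the learned bijection with a hand-coded one (elementwise log on positive vectors) and uses plain addition as the mirrored operation: defining x (+) y = T^{-1}(T(x) + T(y)) makes the induced operation inherit commutativity and associativity by construction. Structural transport nets learn the bijection into a carefully designed mirrored algebra; this sketch only illustrates why routing operations through a bijection preserves the algebra's laws.

# Toy illustration of the transport pattern: operate through a bijection T
# into a mirrored space where the operation is plain addition, so the induced
# operation inherits the mirrored algebra's laws. Here T is hand-coded
# (elementwise log); the dissertation learns the bijection instead.
import numpy as np

T = np.log       # bijection from (0, inf)^d to R^d
T_inv = np.exp   # its inverse

def induced_op(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """x (+) y := T^{-1}(T(x) + T(y)); with T = log this recovers the elementwise product."""
    return T_inv(T(x) + T(y))

x, y, z = np.array([2.0, 3.0]), np.array([4.0, 0.5]), np.array([1.5, 2.0])
assert np.allclose(induced_op(x, y), induced_op(y, x))  # commutativity, by construction
assert np.allclose(induced_op(induced_op(x, y), z), induced_op(x, induced_op(y, z)))  # associativity
print(induced_op(x, y))  # [8.  1.5]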

Advisor: Somayeh Sojoudi

\"Edit"; ?>


BibTeX citation:

@phdthesis{Pfrommer:EECS-2025-67,
    Author = {Pfrommer, Samuel},
    Title = {Safety, Robustness, and Interpretability in Machine Learning},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-67.html},
    Number = {UCB/EECS-2025-67},
    Abstract = {Machine learning is poised to have a dramatic impact across many scientific, industrial, and social domains. While current Artificial Intelligence (AI) systems generally involve human supervision, future applications will demand significantly more autonomy. Such a transition will require us to trust the behavior of increasingly large models. This dissertation addresses three critical research areas towards this goal: safety, robustness, and interpretability.

We first address safety concerns in Reinforcement Learning (RL) and Imitation Learning (IL). While learned policies have achieved impressive performance, they often exhibit unsafe behavior due to training-time exploration and test-time environmental shifts. We introduce a model predictive control-based safety guide which refines the actions of a base RL policy, conditioned on user-provided constraints. With an appropriate optimization formulation and loss function, we show theoretically that the final base policy is provably safe at optimality. IL suffers from a distinct causal confusion safety concern, where spurious correlations between observations and expert actions can lead to unsafe behavior upon deployment. We leverage tools from Structural Causal Models (SCMs) to identify and mask problematic observations. Whereas previous work requires access to a queryable expert or an expert reward function, our approach uses the typical ability of an experimenter to intervene on the initial state of an episode.

The second part of this dissertation concerns robustifying machine learning classifiers against adversarial inputs. Classifiers are a critical component of many AI systems and have been shown to be highly sensitive to small input perturbations. We first extend randomized smoothing beyond traditional isotropic certification by projecting inputs into a data-manifold subspace, resulting in orders-of-magnitude improvements in certified volume. We then revisit the fundamental robustness problem by proposing asymmetric certification. This binary classification setting requires only certified robustness for one class, reflecting the fact that many real-world adversaries are strictly interested in producing false negatives. This more focused problem admits an interesting class of feature-convex architectures, which we leverage to provide efficient, deterministic, and closed-form certified radii.

The third part of this dissertation discusses two distinct aspects of interpretability: how Large Language Models (LLMs) decide what to recommend to human users, and how we can build learned models which obey human-interpretable structures. We first analyze conversational search engines, which use LLMs to rank consumer products in response to a user query. Our results show that LLMs vary widely in prioritizing product names, associated website content, and input context position. Finally, we propose a new family of interpretable models in domains where latent embeddings carry mathematical structure: structural transport nets. Via a learned bijection to a carefully designed mirrored algebra, we produce interpretable latent-space operations which respect the laws of the original input space. We demonstrate that respecting underlying algebraic laws is crucial for learning accurate and self-consistent operations.}
}

EndNote citation:

%0 Thesis
%A Pfrommer, Samuel
%T Safety, Robustness, and Interpretability in Machine Learning
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 15
%@ UCB/EECS-2025-67
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-67.html
%F Pfrommer:EECS-2025-67