Toward Trustworthy Language Models: Interpretation Methods and Clinical Decision Support Applications

Aliyah Hsu

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-57
May 14, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-57.pdf

As deep learning models are increasingly deployed in high-stakes domains such as healthcare, understanding their decision-making processes has become essential. Although numerous interpretation methods have been proposed in response, many remain unreliable (e.g., sensitive to input perturbations or misaligned with real-world reasoning) and struggle to scale effectively. This dissertation advances interpretability in deep learning through a structured investigation across three fronts: post-hoc explanations for black-box models, mechanistic insights into deep learning model internals, and interpretable real-world clinical applications guided by domain expertise. A central emphasis is placed on ensuring the trustworthiness of the developed methods through internal stability analyses and external validation with domain experts on real-world tasks. First, we develop two black-box interpretation methods: one distills symbolic rules from concept bottleneck models, and the other uses prompt-based techniques to generate natural language explanations from text modules; both offer interpretable outputs without access to model internals. Next, by extending contextual decomposition (a prior method for local interpretation), we introduce a scalable, mathematically grounded method for mechanistic interpretability in transformers that efficiently identifies task-relevant computational subgraphs at fine granularity. Finally, we explore interpretability in real-world clinical decision support. In collaboration with clinicians, we develop a framework for analyzing the feature spaces of fine-tuned transformers to assess their suitability for a task, and design a rule-based LLM system that autonomously applies clinical decision rules to unstructured notes to support emergency care, guided by expert feedback throughout development. Together, these contributions demonstrate how trustworthy interpretability can bridge the gap between model performance and reliable deployment in practice.
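To make the abstract's mention of contextual decomposition concrete, the snippet below is a minimal illustrative sketch (not code from the dissertation) of the core idea for a single linear layer: the hidden state is split into a "relevant" part attributable to the tokens of interest and an "irrelevant" remainder, and the two parts are propagated so that their sum reproduces the ordinary forward pass. The function name and the proportional bias-splitting convention are assumptions chosen for illustration.

```python
import numpy as np

def cd_linear(W, b, beta, gamma, eps=1e-12):
    """Propagate a contextual-decomposition split through a linear layer.

    beta:  contribution of the tokens of interest (the "relevant" part)
    gamma: contribution of everything else (the "irrelevant" part)
    The invariant beta + gamma == hidden state is preserved, so the two
    returned pieces sum to the usual output W @ (beta + gamma) + b.
    Splitting the bias in proportion to |W beta| vs. |W gamma| is one
    common convention, assumed here for illustration.
    """
    rel, irrel = W @ beta, W @ gamma
    share = np.abs(rel) / (np.abs(rel) + np.abs(irrel) + eps)
    return rel + share * b, irrel + (1.0 - share) * b

# Toy usage: decompose a 3-dim hidden state through a 2-unit linear layer.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 3)), rng.normal(size=2)
h = rng.normal(size=3)            # full hidden state
beta = np.array([h[0], 0.0, 0.0]) # pretend only feature 0 stems from the phrase of interest
gamma = h - beta
rel, irrel = cd_linear(W, b, beta, gamma)
assert np.allclose(rel + irrel, W @ h + b)  # decomposition sums to the ordinary forward pass
```

The dissertation, as the abstract notes, extends this style of decomposition to transformer computations at scale in order to identify task-relevant computational subgraphs; the sketch above only conveys the basic relevant/irrelevant split.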

Advisor: Bin Yu

\"Edit"; ?>


BibTeX citation:

@phdthesis{Hsu:EECS-2025-57,
    Author = {Hsu, Aliyah},
    Title = {Toward Trustworthy Language Models: Interpretation Methods and Clinical Decision Support Applications},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-57.html},
    Number = {UCB/EECS-2025-57},
    Abstract = {As deep learning models are increasingly deployed in high-stakes domains such as healthcare, understanding their decision-making processes has become essential. Although numerous interpretation methods have been proposed in response, many remain unreliable (e.g., sensitive to input perturbations or misaligned with real-world reasoning) and struggle to scale effectively. This dissertation advances interpretability in deep learning through a structured investigation across three fronts: post-hoc explanations for black-box models, mechanistic insights into deep learning model internals, and interpretable real-world clinical applications guided by domain expertise. A central emphasis is placed on ensuring the trustworthiness of the developed methods through internal stability analyses and external validation with domain experts on real-world tasks. First, we develop two black-box interpretation methods: one distills symbolic rules from concept bottleneck models, and the other uses prompt-based techniques to generate natural language explanations from text modules; both offer interpretable outputs without access to model internals. Next, by extending contextual decomposition (a prior method for local interpretation), we introduce a scalable, mathematically grounded method for mechanistic interpretability in transformers that efficiently identifies task-relevant computational subgraphs at fine granularity. Finally, we explore interpretability in real-world clinical decision support. In collaboration with clinicians, we develop a framework for analyzing the feature spaces of fine-tuned transformers to assess their suitability for a task, and design a rule-based LLM system that autonomously applies clinical decision rules to unstructured notes to support emergency care, guided by expert feedback throughout development. Together, these contributions demonstrate how trustworthy interpretability can bridge the gap between model performance and reliable deployment in practice.}
}

EndNote citation:

%0 Thesis
%A Hsu, Aliyah
%T Toward Trustworthy Language Models: Interpretation Methods and Clinical Decision Support Applications
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 14
%@ UCB/EECS-2025-57
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-57.html
%F Hsu:EECS-2025-57