Towards Controllable Language Models With Instruction Hierarchies
Jonathan Lu and Norman Mu and Michael Lavery
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-130
May 23, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-130.pdf
As large language models (LLMs) have become more capable, researchers and developers have increasingly relied on them to implement complex systems and applications. Instead of writing explicit code, developers often rely on natural language prompts to “program” model behavior directly. While this paradigm is convenient and flexible, natural language instructions remain far less reliable than traditional code. For example, although most instruction-tuned models support system messages, which allow developers to specify context and response preferences, models often fail to consistently enforce these constraints, especially in the face of malicious or adversarial users. To address this challenge and formalize the notion of instruction priorities, OpenAI proposed an instruction hierarchy, where models are expected to prioritize system messages over user messages, user messages over tool messages, etc. Yet, despite a growing body of research on enforcing instruction priorities, it remains unclear which approaches are most effective. In this thesis, we focus on system prompt robustness in LLMs and introduce a new dataset and benchmark designed to facilitate the evaluation and development of models that can reliably follow hierarchical instructions.
Advisors: David A. Wagner
BibTeX citation:
@mastersthesis{Lu:EECS-2025-130, Author= {Lu, Jonathan and Mu, Norman and Lavery, Michael}, Editor= {Wagner, David A.}, Title= {Towards Controllable Language Models With Instruction Hierarchies}, School= {EECS Department, University of California, Berkeley}, Year= {2025}, Month= {May}, Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-130.html}, Number= {UCB/EECS-2025-130}, Abstract= {As large language models (LLMs) have become more capable, researchers and developers have increasingly relied on them to implement complex systems and applications. Instead of writing explicit code, developers often rely on natural language prompts to “program” model behavior directly. While this paradigm is convenient and flexible, natural language instructions remain far less reliable than traditional code. For example, although most instruction-tuned models support system messages, which allow developers to specify context and response preferences, models often fail to consistently enforce these constraints, especially in the face of malicious or adversarial users. To address this challenge and formalize the notion of instruction priorities, OpenAI proposed an instruction hierarchy, where models are expected to prioritize system messages over user messages, user messages over tool messages, etc. Yet, despite a growing body of research on enforcing instruction priorities, it remains unclear which approaches are most effective. In this thesis, we focus on system prompt robustness in LLMs and introduce a new dataset and benchmark designed to facilitate the evaluation and development of models that can reliably follow hierarchical instructions.}, }
EndNote citation:
%0 Thesis %A Lu, Jonathan %A Mu, Norman %A Lavery, Michael %E Wagner, David A. %T Towards Controllable Language Models With Instruction Hierarchies %I EECS Department, University of California, Berkeley %D 2025 %8 May 23 %@ UCB/EECS-2025-130 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-130.html %F Lu:EECS-2025-130