Improving Auto-Formalization to UCLID5 with LLMs and Formal Methods

Anirudh Chaudhary

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-115
May 16, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-115.pdf

Large language models (LLMs) are increasingly used for code generation, and a growing number of tools aim to reduce developer effort by automating routine programming tasks. However, LLMs frequently generate code that appears correct but fails to meet functional requirements. This thesis addresses the problem of automatically generating formal verification models from natural language, with a particular focus on the UCLID5 formal modeling and verification language. We propose a new pipeline that improves the semantic correctness of generated code by incorporating formal verification through variable- and line-coverage–based specifications. For each natural language task in our dataset, we generate four versions of code: a baseline from previous work, specification-guided generation with bounded model checking (BMC), generation with smoke testing, and a combination of both. We evaluate these variants through a user study in which UCLID5 users and developers assess output quality.
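
To make the setting concrete: UCLID5 models are transition systems whose control block drives verification, so a pipeline of this kind ultimately produces a model file and checks it with the uclid tool. The sketch below is illustrative only and is not taken from the thesis; it assumes the uclid command-line checker is installed, and the model text, file handling, and output handling are assumptions rather than the actual pipeline code.

    # Illustrative sketch, not the thesis's pipeline: verify one generated
    # UCLID5 candidate with bounded model checking via the `uclid` CLI.
    import subprocess
    import tempfile

    # A tiny hand-written stand-in for an LLM-generated UCLID5 model: a counter
    # with an invariant over its state variable, the kind of property that
    # coverage-based specifications would exercise.
    CANDIDATE_MODEL = """
    module main {
      var count : integer;

      init { count = 0; }
      next { count' = count + 1; }

      // Property over the state variable `count`.
      invariant count_nonneg : count >= 0;

      control {
        v = bmc(10);   // bounded model checking to depth 10
        check;
        print_results;
      }
    }
    """

    def run_bmc(model_text: str) -> str:
        """Write the candidate model to a file and run the UCLID5 checker on it."""
        with tempfile.NamedTemporaryFile("w", suffix=".ucl", delete=False) as f:
            f.write(model_text)
            path = f.name
        # Running `uclid <file>` executes the model's control block (BMC + check).
        result = subprocess.run(["uclid", path], capture_output=True, text=True)
        return result.stdout

    if __name__ == "__main__":
        print(run_bmc(CANDIDATE_MODEL))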

In this work, we implement two additional techniques, specification generation and automated repair, to assist in the code generation process. We find that while these techniques did not significantly increase user preference compared to the baseline, our results reveal important nuances. Our feedback-driven repair process led to improvement or no regression in 93% of cases, highlighting its promise as a low-risk refinement layer. However, we find no clear correlation between a model’s assertion pass rate and user preference, suggesting that current specification strategies may not yet align with human-centered notions of code quality. These findings suggest that while formal methods may not yet enhance perceived output quality, they offer a principled foundation for improving semantic alignment in code generation. Future work should focus on creating and using structural specifications to better capture user intent.
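
The feedback-driven repair process can be thought of as a loop that alternates between verification and LLM-based patching. The outline below is a sketch under assumptions, not the thesis's implementation: the generate and repair callables stand in for hypothetical LLM calls, and the success check is a deliberately naive reading of the verifier's output.

    # Sketch of a feedback-driven repair loop (hypothetical helpers, not the
    # thesis's code). `check` is a verifier call such as run_bmc above.
    from typing import Callable

    def verification_passed(output: str) -> bool:
        # Naive assumption: a run counts as successful if no assertion is
        # reported as failed; real output parsing would be more careful.
        return "failed" not in output.lower()

    def generate_with_repair(
        task: str,
        generate: Callable[[str], str],          # LLM call: task -> UCLID5 source
        repair: Callable[[str, str, str], str],  # LLM call: (task, model, feedback) -> revised source
        check: Callable[[str], str],             # verifier call, e.g. run_bmc above
        max_rounds: int = 3,
    ) -> str:
        """Generate a model, then use verifier feedback to repair it."""
        model_text = generate(task)
        for _ in range(max_rounds):
            feedback = check(model_text)
            if verification_passed(feedback):
                break
            model_text = repair(task, model_text, feedback)
        return model_text

Bounding the number of rounds keeps the loop cheap; a fuller version might also re-verify the final candidate and keep whichever version performed best across rounds.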

Advisor: Sanjit A. Seshia

\"Edit"; ?>


BibTeX citation:

@mastersthesis{Chaudhary:EECS-2025-115,
    Author = {Chaudhary, Anirudh},
    Title = {Improving Auto-Formalization to UCLID5 with LLMs and Formal Methods},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-115.html},
    Number = {UCB/EECS-2025-115},
    Abstract = {Large language models (LLMs) are increasingly used for code generation, and a growing number of tools aim to reduce developer effort by automating routine programming tasks. However, LLMs frequently generate code that appears correct but fails to meet functional requirements. This thesis addresses the problem of automatically generating formal verification models from natural language, with a particular focus on the UCLID5 formal modeling and verification language. We propose a new pipeline that improves the semantic correctness of generated code by incorporating formal verification through variable- and line-coverage–based specifications. For each natural language task in our dataset, we generate four versions of code: a baseline from previous work, specification-guided generation with bounded model checking (BMC), generation with smoke testing, and a combination of both. We evaluate these variants through a user study in which UCLID5 users and developers assess output quality.

In this work, we implement two additional techniques, specification generation and automated repair, to assist in the code generation process. We find that while these techniques did not significantly increase user preference compared to the baseline, our results reveal important nuances. Our feedback-driven repair process led to improvement or no regression in 93% of cases, highlighting its promise as a low-risk refinement layer. However, we find no clear correlation between a model’s assertion pass rate and user preference, suggesting that current specification strategies may not yet align with human-centered notions of code quality. These findings suggest that while formal methods may not yet enhance perceived output quality, they offer a principled foundation for improving semantic alignment in code generation. Future work should focus on creating and using structural specifications to better capture user intent.}
}

EndNote citation:

%0 Thesis
%A Chaudhary, Anirudh
%T Improving Auto-Formalization to UCLID5 with LLMs and Formal Methods
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 16
%@ UCB/EECS-2025-115
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-115.html
%F Chaudhary:EECS-2025-115