Advancing Large Language Models for Code Using Code-Structure-Aware Methods

Linyuan Gong
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-50
May 13, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-50.pdf
Large language models (LLMs) have transformed code-related tasks. However, most code LLMs ignore the structural patterns of programming languages. This dissertation studies code-structure-aware LLMs through novel methodologies, benchmarks, and pretraining strategies, showing that explicit structural modeling significantly enhances the coding capabilities of LLMs.
First, we introduce ADELT, a transpiler that decouples code structure conversion from API keyword translation. ADELT achieves state-of-the-art transpilation without parallel data, showing the importance of structural awareness.
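As a purely illustrative sketch of this decoupling (not ADELT's actual pipeline, which learns its keyword mapping without parallel data), the Python snippet below separates the two steps: the code skeleton is converted with API keywords abstracted into placeholders, and the placeholders are then filled in from a keyword dictionary. KEYWORD_MAP, to_skeleton, and fill_keywords are hypothetical names, and the toy dictionary is an assumption made for the example.

# Illustrative sketch of decoupling structure conversion from API keyword
# translation. ADELT learns its keyword mapping rather than hard-coding one;
# all names and the toy dictionary below are hypothetical.

# Hypothetical mapping between API keywords of a source and a target framework.
KEYWORD_MAP = {"Linear": "Dense", "relu": "relu"}

def to_skeleton(src: str) -> tuple[str, dict]:
    """Step 1: replace API keywords with placeholders, recording each slot."""
    slots = {}
    for i, kw in enumerate(KEYWORD_MAP):
        placeholder = f"<KW{i}>"
        if kw in src:
            src = src.replace(kw, placeholder)
            slots[placeholder] = kw
    return src, slots

def fill_keywords(skeleton: str, slots: dict) -> str:
    """Step 2: translate each placeholder independently via the mapping."""
    for placeholder, kw in slots.items():
        skeleton = skeleton.replace(placeholder, KEYWORD_MAP[kw])
    return skeleton

source_line = "layer = Linear(128, 64)"
skeleton, slots = to_skeleton(source_line)   # "layer = <KW0>(128, 64)"
print(fill_keywords(skeleton, slots))        # "layer = Dense(128, 64)"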
To rigorously evaluate structural understanding, we present SAFIM, a benchmark for syntax-aware fill-in-the-middle (FIM) tasks. Evaluating 15 LLMs, we challenge the assumption that larger models necessarily perform better, showing that pretraining strategy and data quality matter more. We establish SAFIM as a foundational tool for future research.
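To make the task format concrete, here is a minimal sketch of what a syntax-aware FIM example can look like: the masked span is a complete syntactic unit (here, a function body) rather than an arbitrary character range, and the model is prompted with the surrounding prefix and suffix. The sentinel tokens and prefix-suffix-middle layout below are common FIM conventions used only for illustration, not necessarily SAFIM's exact format.

# Minimal sketch of a syntax-aware fill-in-the-middle (FIM) example.
# The masked region is a complete syntactic unit (a function body), unlike
# character-level FIM masks that may cut through tokens or statements.
# Sentinel tokens below are illustrative, not SAFIM's actual format.

source = '''def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
'''

# Ground-truth middle: the function body, a complete syntactic block.
prefix = "def fibonacci(n):\n"
middle = (
    "    if n < 2:\n"
    "        return n\n"
    "    return fibonacci(n - 1) + fibonacci(n - 2)\n"
)
suffix = ""

assert prefix + middle + suffix == source

# Prefix-Suffix-Middle (PSM) style prompt; the model should generate `middle`.
prompt = f"<PRE>{prefix}<SUF>{suffix}<MID>"
print(prompt)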
We then propose two structure-aware pretraining paradigms. AST-T5 integrates abstract syntax trees (ASTs) into T5-like encoder-decoder models, outperforming baselines in code repair and transpilation. For decoder-only architectures, AST-FIM uses AST-guided masking to better address the tradeoff between FIM and left-to-right (L2R) generation, surpassing traditional methods on infilling tasks while retaining L2R generation capability.
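As a rough sketch of the kind of AST-guided masking these paradigms build on (an illustration under our own assumptions, not the dissertation's implementation), the snippet below uses Python's ast module to choose a mask span whose boundaries coincide with a complete statement subtree, so the corrupted region is a well-formed syntactic unit. The ast_aligned_span helper and the <MASK> sentinel are hypothetical.

# Illustrative sketch: pick a masking span aligned to a complete AST subtree,
# so the corrupted region is a well-formed syntactic unit rather than an
# arbitrary character range. Not the dissertation's implementation.
import ast
import random

def ast_aligned_span(source: str, seed: int = 0) -> tuple[int, int]:
    """Return (start, end) character offsets of a randomly chosen statement subtree."""
    tree = ast.parse(source)
    # Candidate subtrees: statements, which carry source position information.
    nodes = [n for n in ast.walk(tree)
             if isinstance(n, ast.stmt) and hasattr(n, "end_col_offset")]
    random.seed(seed)
    node = random.choice(nodes)
    lines = source.splitlines(keepends=True)
    start = sum(len(l) for l in lines[:node.lineno - 1]) + node.col_offset
    end = sum(len(l) for l in lines[:node.end_lineno - 1]) + node.end_col_offset
    return start, end

source = "def add(a, b):\n    total = a + b\n    return total\n"
start, end = ast_aligned_span(source)
masked = source[:start] + "<MASK>" + source[end:]
print(masked)             # source with one complete statement replaced by <MASK>
print(source[start:end])  # the masked-out subtree (the infilling target)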
Collectively, these results show that code structure awareness enhances the code generation, understanding, and transformation abilities of LLMs. Our contributions, spanning transpilation frameworks, evaluation benchmarks, and pretraining techniques, provide a roadmap for integrating code structure into LLMs.
Advisor: Alvin Cheung
";
?>
BibTeX citation:
@phdthesis{Gong:EECS-2025-50,
    Author = {Gong, Linyuan},
    Editor = {Cheung, Alvin and Song, Dawn and Wang, Sida},
    Title = {Advancing Large Language Models for Code Using Code-Structure-Aware Methods},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-50.html},
    Number = {UCB/EECS-2025-50},
    Abstract = {Large language models (LLMs) have transformed code-related tasks. However, most code LLMs ignore structural patterns of programming languages. This dissertation studies code-structure-aware LLMs by proposing novel methodologies, benchmarks, and pretraining strategies, showing that explicit structural modeling significantly enhances coding capability of LLMs. First, we introduce ADELT, a transpiler that decouples code structure conversion from API keyword translation. ADELT achieves state-of-the-art transpilation without parallel data, showing the importance of structural awareness. To rigorously evaluate structural understanding, we present SAFIM, a benchmark for syntax-aware FIM tasks. Evaluating 15 LLMs, we challenge the idea that ``big model = good performance'', and show that pretraining strategies and data quality are more important. We establish SAFIM as a foundational tool for future research. We then propose two structure-aware pretraining paradigms. AST-T5 integrates abstract syntax trees (ASTs) into T5-like encoder-decoder models, outperforming baselines in code repair and transpilation. For decoder-only architectures, AST-FIM uses AST-guided masking to better address the tradeoff between FIM and left-to-right (L2R) generation, surpassing traditional methods on infilling tasks while retaining L2R generation capability. Collectively, we show that code structure awareness enhances code generation, understanding, and transformation ability of LLMs. Our contributions---spanning transpilation frameworks, evaluation benchmarks, and pretraining techniques---provide a roadmap for integrating code structures into LLMs.}
}
EndNote citation:
%0 Thesis
%A Gong, Linyuan
%E Cheung, Alvin
%E Song, Dawn
%E Wang, Sida
%T Advancing Large Language Models for Code Using Code-Structure-Aware Methods
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 13
%@ UCB/EECS-2025-50
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-50.html
%F Gong:EECS-2025-50