Benchmarking Extraction of Structured Data from Templatized Documents

Mawil Hasan

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-77
May 15, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.pdf

Portable Document Format (PDF) files dominate legal, financial, and governmental workflows, yet their opaque layout thwarts automated analytics. Existing extractors—rule-based parsers, commercial learned extraction APIs, and Large Language Models (LLMs)—all provide inadequate performance on unseen or unfamiliar templates and incur prohibitive costs.

We introduce TWIX, a hybrid framework that fuses lightweight template inference with targeted LLM-based semantic refinement to recover key–value pairs and tables from heterogeneous, multi-page templatized PDFs. TWIX clusters phrase locations across documents to identify potential candidates for attributes or fields, with input from an LLM to infuse domain knowledge, and assembles a global template with an ILP solver. The learned template is reused to extract data across the corpus—eliminating per-page model calls.
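As a rough illustration of the clustering idea only, the sketch below treats a phrase as a candidate field label when the same text recurs at roughly the same page position in most documents. It is not the TWIX implementation: the function name `candidate_fields`, the `(text, x, y)` phrase tuples, the grid-bucketing heuristic, and the `min_support` and `grid` parameters are all assumptions made for this example, and the LLM refinement and ILP template-assembly steps are omitted.

```python
from collections import defaultdict


def candidate_fields(documents, min_support=0.8, grid=10):
    """Return phrases that recur at a similar location in most documents.

    documents: list of documents, each a list of (text, x, y) tuples
               (a hypothetical phrase representation, not TWIX's own).
    min_support: fraction of documents in which a phrase must recur.
    grid: bucket size, in the same units as x and y, used as a crude
          tolerance for small layout shifts between documents.
    """
    seen = defaultdict(set)  # (text, x bucket, y bucket) -> ids of documents
    for doc_id, phrases in enumerate(documents):
        for text, x, y in phrases:
            key = (text.strip().lower(), round(x / grid), round(y / grid))
            seen[key].add(doc_id)

    threshold = min_support * len(documents)
    # Phrases whose text and position are stable across documents look like
    # field labels; phrases that vary per document look like values.
    return sorted({key[0] for key, docs in seen.items() if len(docs) >= threshold})


docs = [
    [("Invoice No", 50, 40), ("A-1001", 150, 40), ("Date", 50, 60), ("2024-01-03", 150, 60)],
    [("Invoice No", 51, 41), ("B-2002", 150, 40), ("Date", 49, 60), ("2024-02-10", 150, 61)],
    [("Invoice No", 50, 39), ("C-3003", 151, 40), ("Date", 50, 61), ("2024-03-22", 150, 60)],
]
print(candidate_fields(docs))
```

In this toy input the labels "Invoice No" and "Date" appear in all three documents at nearly the same coordinates, while the values differ per document, so only the labels pass the support threshold. Fixed-grid bucketing is a deliberately simple stand-in for a real clustering step and can split a label whose coordinates straddle a bucket boundary.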

On two real-world benchmarks that we assemble, a new TWIX benchmark (34 diverse business templates) and a new Police-Records dataset (thousands of noisy, redacted complaints), TWIX improves F1 by 25% over popular extraction tools: Amazon Textract, Azure Form Recognizer, GPT-4 Vision, and Evaporate. After template inference, extraction is 520× faster and 3700× cheaper than vision-LLM baselines, processing 2000-page datasets in minutes for under $0.02. Explicit template modeling and extraction, augmented rather than replaced by LLMs, thus achieves state-of-the-art accuracy while meeting real-world demands for scalability, robustness, and cost-efficiency.

Advisor: Aditya Parameswaran


BibTeX citation:

@mastersthesis{Hasan:EECS-2025-77,
    Author = {Hasan, Mawil},
    Editor = {Parameswaran, Aditya and Cheung, Alvin},
    Title = {Benchmarking Extraction of Structured Data from Templatized Documents},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.html},
    Number = {UCB/EECS-2025-77},
    Abstract = {Portable Document Format (PDF) files dominate legal, financial, and governmental workflows, yet their opaque layout thwarts automated analytics. Existing extractors—rule-based parsers, commercial learned extraction APIs, and Large Language Models (LLMs)—all provide inadequate performance on unseen or unfamiliar templates and incur prohibitive costs.

We introduce TWIX, a hybrid framework that fuses lightweight template inference with targeted LLM-based semantic refinement to recover key–value pairs and tables from heterogeneous, multi-page templatized PDFs. TWIX clusters phrase locations across documents to identify potential candidates for attributes or fields, with input from an LLM to infuse domain knowledge, and assembles a global template with an ILP solver. The learned template is reused to extract data across the corpus—eliminating per-page model calls.

On two real-world benchmarks that we assemble, a new TWIX benchmark (34 diverse business templates) and a new Police-Records dataset (thousands of noisy, redacted complaints), TWIX improves F1 by 25% over popular extraction tools: Amazon Textract, Azure Form Recognizer, GPT-4 Vision, and Evaporate. After template inference, extraction is 520× faster and 3700× cheaper than vision-LLM baselines, processing 2000-page datasets in minutes for under $0.02. Explicit template modeling and extraction, augmented rather than replaced by LLMs, thus achieves state-of-the-art accuracy while meeting real-world demands for scalability, robustness, and cost-efficiency.}
}

EndNote citation:

%0 Thesis
%A Hasan, Mawil
%E Parameswaran, Aditya
%E Cheung, Alvin
%T Benchmarking Extraction of Structured Data from Templatized Documents
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 15
%@ UCB/EECS-2025-77
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.html
%F Hasan:EECS-2025-77