Benchmarking Extraction of Structured Data from Templatized Documents

Mawil Hasan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-77

May 15, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.pdf

Portable Document Format (PDF) files dominate legal, financial, and governmental workflows, yet their opaque layout thwarts automated analytics. Existing extractors—rule-based parsers, commercial learned extraction APIs, and Large Language Models (LLMs)—all provide inadequate performance on unseen or unfamiliar templates and incur prohibitive costs.

We introduce TWIX, a hybrid framework that fuses lightweight template inference with targeted LLM-based semantic refinement to recover key–value pairs and tables from heterogeneous, multi-page templatized PDFs. TWIX clusters phrase locations across documents to identify potential candidates for attributes or fields, with input from an LLM to infuse domain knowledge, and assembles a global template with an ILP solver. The learned template is reused to extract data across the corpus—eliminating per-page model calls.

On two real-world benchmarks that we assemble, a new TWIX benchmark (34 diverse business templates) and a new Police-Records dataset (thousands of noisy, redacted complaints), TWIX improves F1 by 25% over popular extraction tools: Amazon Textract, Azure Form Recognizer, GPT-4 Vision, and Evaporate. After template inference, extraction is 520× faster and 3700× cheaper than vision-LLM baselines, processing 2000-page datasets in minutes for under $0.02. Explicit template modeling and extraction (augmented, not replaced, by LLMs) thus achieves state-of-the-art accuracy while meeting real-world demands for scalability, robustness, and cost-efficiency.
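
The abstract says TWIX clusters phrase locations across documents to find candidate field names. One plausible reading is that phrases recurring at stable page positions are likely template fields, and the sketch below illustrates that idea. It is not TWIX's implementation: the function name `candidate_fields`, the `(text, x, y)` input format, the grid-bucketing heuristic that stands in for clustering, and both thresholds are assumptions made for illustration. The LLM refinement and ILP assembly steps are omitted.

```python
from collections import defaultdict


def candidate_fields(documents, min_fraction=0.8, tolerance=5.0):
    """Return phrases that recur at nearly the same position in most documents.

    documents: list of documents, each a list of (text, x, y) tuples
               (hypothetical format; coordinates in page units).
    min_fraction: share of documents a phrase must appear in (assumed value).
    tolerance: grid cell size used to absorb small positional jitter (assumed value).
    """
    # Map (phrase, coarse grid cell) -> set of document indices containing it.
    seen = defaultdict(set)
    for doc_id, phrases in enumerate(documents):
        for text, x, y in phrases:
            # Bucketing is a crude stand-in for clustering: two positions that
            # straddle a cell boundary will be split even if they are close.
            cell = (round(x / tolerance), round(y / tolerance))
            seen[(text, cell)].add(doc_id)

    threshold = min_fraction * len(documents)
    return sorted({text for (text, _), docs in seen.items() if len(docs) >= threshold})


# Three toy "pages" sharing a layout: labels repeat, values differ.
docs = [
    [("Invoice No", 50, 100), ("A-17", 120, 100), ("Date", 300, 100)],
    [("Invoice No", 51, 101), ("B-22", 120, 101), ("Date", 300, 100)],
    [("Invoice No", 50, 99), ("C-03", 121, 99), ("Date", 300, 100)],
]
print(candidate_fields(docs))
```

On this toy input, the repeated labels are kept and the per-document values are dropped, because each value appears in only one document. A real system would need a proper clustering step and a way to handle fields whose position shifts across pages.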

Advisor: Aditya Parameswaran

BibTeX citation:

@mastersthesis{Hasan:EECS-2025-77,
    Author= {Hasan, Mawil},
    Editor= {Parameswaran, Aditya and Cheung, Alvin},
    Title= {Benchmarking Extraction of Structured Data from Templatized Documents},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.html},
    Number= {UCB/EECS-2025-77},
    Abstract= {Portable Document Format (PDF) files dominate legal, financial, and governmental workflows, yet their opaque layout thwarts automated analytics. Existing extractors—rule-based parsers, commercial learned extraction APIs, and Large Language Models (LLMs)—all provide inadequate performance on unseen or unfamiliar templates and incur prohibitive costs.

We introduce TWIX, a hybrid framework that fuses lightweight template inference with targeted LLM-based semantic refinement to recover key–value pairs and tables from heterogeneous, multi-page templatized PDFs. TWIX clusters phrase locations across documents to identify potential candidates for attributes or fields, with input from an LLM to infuse domain knowledge, and assembles a global template with an ILP solver. The learned template is reused to extract data across the corpus—eliminating per-page model calls.

On two real-world benchmarks that we assemble, a new TWIX benchmark (34 diverse business templates) and a new Police-Records dataset (thousands of noisy, redacted complaints), TWIX improves F1 by 25% over popular extraction tools: Amazon Textract, Azure Form Recognizer, GPT-4 Vision, and Evaporate. After template inference, extraction is 520× faster and 3700× cheaper than vision-LLM baselines, processing 2000-page datasets in minutes for under $0.02. Explicit template modeling and extraction (augmented, not replaced, by LLMs) thus achieves state-of-the-art accuracy while meeting real-world demands for scalability, robustness, and cost-efficiency.},
}

EndNote citation:

%0 Thesis
%A Hasan, Mawil 
%E Parameswaran, Aditya 
%E Cheung, Alvin 
%T Benchmarking Extraction of Structured Data from Templatized Documents
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 15
%@ UCB/EECS-2025-77
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.html
%F Hasan:EECS-2025-77