Benchmarking Extraction of Structured Data from Templatized Documents
Mawil Hasan
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-77
May 15, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.pdf
Portable Document Format (PDF) files dominate legal, financial, and governmental workflows, yet their opaque layout thwarts automated analytics. Existing extractors, whether rule-based parsers, commercial learned extraction APIs, or Large Language Models (LLMs), perform inadequately on unseen or unfamiliar templates and incur prohibitive costs.
We introduce TWIX, a hybrid framework that fuses lightweight template inference with targeted LLM-based semantic refinement to recover key–value pairs and tables from heterogeneous, multi-page templatized PDFs. TWIX clusters phrase locations across documents to identify candidate attributes or fields, uses input from an LLM to infuse domain knowledge, and assembles a global template with an integer linear programming (ILP) solver. The learned template is then reused to extract data across the corpus, which eliminates per-page model calls.
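As a rough illustration of the location-clustering idea only, the sketch below flags phrases that recur at nearly the same page position in most documents as candidate field names. It is not the TWIX implementation: the function name `candidate_fields`, the `(phrase, x, y)` input format, the grid bucketing, and the `min_support` and `tolerance` parameters are all assumptions made for this example, and the LLM refinement and ILP template assembly steps are omitted entirely.

```python
from collections import defaultdict


def candidate_fields(documents, min_support=0.8, tolerance=2.0):
    """Return phrases that recur at about the same position in most documents.

    documents: list of documents, each a list of (phrase, x, y) tuples.
    min_support: fraction of documents a phrase/position must appear in.
    tolerance: grid cell size (same units as x and y) used to absorb jitter.
    All of these are hypothetical choices for this sketch.
    """
    # Map (phrase, bucketed x, bucketed y) -> set of document indices.
    seen_in = defaultdict(set)
    for doc_id, phrases in enumerate(documents):
        for text, x, y in phrases:
            key = (text, round(x / tolerance), round(y / tolerance))
            seen_in[key].add(doc_id)

    threshold = min_support * len(documents)
    # Field labels repeat across documents; extracted values usually do not.
    return sorted({text for (text, _, _), docs in seen_in.items()
                   if len(docs) >= threshold})


# Three made-up one-page "invoices" sharing one layout.
docs = [
    [("Invoice No:", 72.0, 100.0), ("A-1001", 150.0, 100.0),
     ("Total:", 72.0, 700.0), ("$41.20", 150.0, 700.0)],
    [("Invoice No:", 72.4, 100.2), ("B-2040", 150.0, 100.0),
     ("Total:", 71.8, 699.9), ("$9.99", 150.0, 700.0)],
    [("Invoice No:", 72.2, 99.8), ("C-3300", 150.0, 100.0),
     ("Total:", 72.0, 700.4), ("$120.00", 150.0, 700.0)],
]
fields = candidate_fields(docs)
print(fields)
```

Fixed-grid bucketing is the simplest possible stand-in for clustering; positions that straddle a cell boundary would be split, which a real clustering method would handle more gracefully.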
On two real-world benchmarks that we assemble, a new TWIX benchmark (34 diverse business templates) and a new Police-Records dataset (thousands of noisy, redacted complaints), TWIX improves F1 by 25% over popular extraction tools: Amazon Textract, Azure Form Recognizer, GPT-4 Vision, and Evaporate. After template inference, extraction is 520× faster and 3700× cheaper than vision-LLM baselines, processing 2000-page datasets in minutes for under $0.02. Explicit template modeling and extraction, augmented rather than replaced by LLMs, thus achieve state-of-the-art accuracy while meeting real-world demands for scalability, robustness, and cost-efficiency.
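For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it over sets of extracted key–value pairs using exact matching; the function name `f1_score`, the exact-match criterion, and the sample data are assumptions for illustration, and the report's own scoring protocol may differ.

```python
def f1_score(predicted, gold):
    """F1 over sets of (key, value) pairs, counting only exact matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up ground truth and extractor output for one document.
gold = {("Invoice No:", "A-1001"), ("Total:", "$41.20"), ("Date:", "2025-01-03")}
predicted = {("Invoice No:", "A-1001"), ("Total:", "$41.20"),
             ("Date:", "2025-01-30"), ("Page", "1")}

score = f1_score(predicted, gold)
print(round(score, 4))
```

Here two of four predictions are correct (precision 0.5) and two of three gold pairs are found (recall 2/3), so F1 is 4/7.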
Advisors: Aditya Parameswaran
BibTeX citation:
@mastersthesis{Hasan:EECS-2025-77,
    Author = {Hasan, Mawil},
    Editor = {Parameswaran, Aditya and Cheung, Alvin},
    Title = {Benchmarking Extraction of Structured Data from Templatized Documents},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.html},
    Number = {UCB/EECS-2025-77},
    Abstract = {Portable Document Format (PDF) files dominate legal, financial, and governmental workflows, yet their opaque layout thwarts automated analytics. Existing extractors—rule-based parsers, commercial learned extraction APIs, and Large Language Models (LLMs)—all provide inadequate performance on unseen or unfamiliar templates and incur prohibitive costs. We introduce TWIX, a hybrid framework that fuses lightweight template inference with targeted LLM-based semantic refinement to recover key–value pairs and tables from heterogeneous, multi-page templatized PDFs. TWIX clusters phrase locations across documents to identify potential candidates for attributes or fields, with input from an LLM to infuse domain knowledge, and assembles a global template with an ILP solver. The learned template is reused to extract data across the corpus—eliminating per-page model calls. On two real-world benchmarks that we assemble—including a new TWIX benchmark (34 diverse business templates) and a new Police-Records dataset (thousands of noisy, redacted complaints)—TWIX delivers improvement in F1 by 25% over popular extraction tools—Amazon Textract, Azure Form Recognizer, GPT-4 Vision, and Evaporate. After template inference, extraction is 520× faster and 3700× cheaper than vision-LLM baselines, processing 2000-page datasets in minutes for under $0.02. Explicit template modeling and extraction—augmented, not replaced by, LLMs—thus achieves state-of-the-art accuracy while meeting real-world demands for scalability, robustness, and cost-efficiency.},
}
EndNote citation:
%0 Thesis
%A Hasan, Mawil
%E Parameswaran, Aditya
%E Cheung, Alvin
%T Benchmarking Extraction of Structured Data from Templatized Documents
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 15
%@ UCB/EECS-2025-77
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.html
%F Hasan:EECS-2025-77