Benchmarking Extraction of Structured Data from Templatized Documents
Mawil Hasan
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-77
May 15, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.pdf
Portable Document Format (PDF) files dominate legal, financial, and governmental workflows, yet their opaque layout thwarts automated analytics. Existing extractors—rule-based parsers, commercial learned extraction APIs, and Large Language Models (LLMs)—all provide inadequate performance on unseen or unfamiliar templates and incur prohibitive costs.
We introduce TWIX, a hybrid framework that fuses lightweight template inference with targeted LLM-based semantic refinement to recover key–value pairs and tables from heterogeneous, multi-page templatized PDFs. TWIX clusters phrase locations across documents to identify candidate attributes (fields), uses an LLM to infuse domain knowledge, and assembles a global template with an ILP solver. The learned template is then reused to extract data across the corpus, eliminating per-page model calls.
On two real-world benchmarks that we assemble—a new TWIX benchmark (34 diverse business templates) and a new Police-Records dataset (thousands of noisy, redacted complaints)—TWIX improves F1 by 25% over popular extraction tools (Amazon Textract, Azure Form Recognizer, GPT-4 Vision, and Evaporate). After template inference, extraction is 520× faster and 3700× cheaper than vision-LLM baselines, processing 2,000-page datasets in minutes for under $0.02. Explicit template modeling and extraction, augmented rather than replaced by LLMs, thus achieves state-of-the-art accuracy while meeting real-world demands for scalability, robustness, and cost-efficiency.
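To make the position-clustering idea above concrete, here is a minimal, hypothetical sketch (not TWIX's actual implementation): phrases that recur at nearly the same coordinates across pages of a templatized PDF are likely template fields, while phrases that vary are record values. The function name, tolerance, and threshold are illustrative assumptions.

```python
from collections import defaultdict

def candidate_fields(pages, tol=5, min_frac=0.8):
    """Identify candidate template fields by position clustering.

    pages: list of pages, each a list of (text, x, y) phrase tuples.
    Returns the set of phrase texts that appear at a stable (quantized)
    location on at least min_frac of the pages.
    """
    seen = defaultdict(set)  # (text, quantized x, quantized y) -> page indices
    for i, page in enumerate(pages):
        for text, x, y in page:
            # Quantize coordinates so small layout jitter still clusters.
            key = (text, round(x / tol), round(y / tol))
            seen[key].add(i)
    n = len(pages)
    return {text for (text, _, _), pgs in seen.items()
            if len(pgs) / n >= min_frac}
```

In a full pipeline along the lines the abstract describes, such candidates would then be validated with LLM-supplied domain knowledge and stitched into a global template; this sketch covers only the first clustering step.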
Advisor: Aditya Parameswaran
BibTeX citation:
@mastersthesis{Hasan:EECS-2025-77,
    Author = {Hasan, Mawil},
    Editor = {Parameswaran, Aditya and Cheung, Alvin},
    Title = {Benchmarking Extraction of Structured Data from Templatized Documents},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.html},
    Number = {UCB/EECS-2025-77},
    Abstract = {Portable Document Format (PDF) files dominate legal, financial, and governmental workflows, yet their opaque layout thwarts automated analytics. Existing extractors—rule-based parsers, commercial learned extraction APIs, and Large Language Models (LLMs)—all provide inadequate performance on unseen or unfamiliar templates and incur prohibitive costs. We introduce TWIX, a hybrid framework that fuses lightweight template inference with targeted LLM-based semantic refinement to recover key–value pairs and tables from heterogeneous, multi-page templatized PDFs. TWIX clusters phrase locations across documents to identify potential candidates for attributes or fields, with input from an LLM to infuse domain knowledge, and assembles a global template with an ILP solver. The learned template is reused to extract data across the corpus—eliminating per-page model calls. On two real-world benchmarks that we assemble—including a new TWIX benchmark (34 diverse business templates) and a new Police-Records dataset (thousands of noisy, redacted complaints)—TWIX delivers improvement in F1 by 25% over popular extraction tools—Amazon Textract, Azure Form Recognizer, GPT-4 Vision, and Evaporate. After template inference, extraction is 520× faster and 3700× cheaper than vision-LLM baselines, processing 2000-page datasets in minutes for under $0.02. Explicit template modeling and extraction—augmented, not replaced by, LLMs—thus achieves state-of-the-art accuracy while meeting real-world demands for scalability, robustness, and cost-efficiency.}
}
EndNote citation:
%0 Thesis
%A Hasan, Mawil
%E Parameswaran, Aditya
%E Cheung, Alvin
%T Benchmarking Extraction of Structured Data from Templatized Documents
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 15
%@ UCB/EECS-2025-77
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-77.html
%F Hasan:EECS-2025-77