Supervised Text Region Identification on Historical Documents

Jonathan Eng

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2015-254
December 18, 2015

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-254.pdf

We present multi-column text region identification support for Ocular, the unsupervised historical printed document transcription project of Berg-Kirkpatrick et al. (2013). We use structured prediction with rich features defined on the input document and incorporate a transition model based on prior document layout assumptions. Our model is trained using a structured-SVM objective on a randomly selected set of historical documents from The Proceedings of Old Bailey corpus. For learning, we use loss-augmented Viterbi decoding with a weighted Hamming loss function. We present our suite of features that achieve a 37.4 F1 text score and 39.4 F1 non-text improvement in text region identification over the Ocular baseline text cropper.
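
The loss-augmented Viterbi decoding step named in the abstract can be illustrated with a minimal sketch. It is not the report's implementation. It assumes a linear chain over page positions, a small label set (for example 0 = non-text, 1 = text), and a Hamming loss weighted by the gold label at each position. The function name, array shapes, and weights are hypothetical choices made for illustration.

```python
import numpy as np


def loss_augmented_viterbi(unary, transition, gold, label_weights):
    """Find the label sequence maximizing model score plus weighted Hamming loss.

    unary:         (T, K) array, score of label k at position t (assumed shape)
    transition:    (K, K) array, score of moving from label i to label j
    gold:          (T,) int array of gold labels
    label_weights: (K,) array, cost of mislabeling a position whose gold label is k
    """
    T, K = unary.shape
    # Weighted Hamming loss decomposes over positions, so it can be folded
    # into the unary scores: wrong labels get a bonus, the gold label gets 0.
    cost = np.tile(label_weights[gold][:, None], (1, K)).astype(float)
    cost[np.arange(T), gold] = 0.0
    augmented = unary + cost

    # Standard Viterbi recursion over the augmented scores.
    score = augmented[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        candidates = score[:, None] + transition  # rows: previous label, cols: next label
        back[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0) + augmented[t]

    # Follow back-pointers from the best final label.
    path = np.empty(T, dtype=int)
    path[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path, float(score.max())
```

In structured-SVM training, the sequence returned here is the most violated constraint for the current weights. Its features are compared against those of the gold sequence to form the subgradient update. Setting all entries of label_weights to zero reduces the routine to ordinary Viterbi decoding, which is what would be used at test time.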

Advisor: Daniel Klein


BibTeX citation:

@mastersthesis{Eng:EECS-2015-254,
    Author = {Eng, Jonathan},
    Title = {Supervised Text Region Identification on Historical Documents},
    School = {EECS Department, University of California, Berkeley},
    Year = {2015},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-254.html},
    Number = {UCB/EECS-2015-254},
    Abstract = {We present multi-column text region identification support for Ocular, the unsupervised historical printed document transcription project of Berg-Kirkpatrick et al. (2013). We use structured prediction with rich features defined on the input document and incorporate a transition model based on prior document layout assumptions. Our model is trained using a structured-SVM objective on a randomly selected set of historical documents from The Proceedings of Old Bailey corpus. For learning, we use loss-augmented Viterbi decoding with a weighted Hamming loss function. We present our suite of features that achieve a 37.4 F1 text score and 39.4 F1 non-text improvement in text region identification over the Ocular baseline text cropper.}
}

EndNote citation:

%0 Thesis
%A Eng, Jonathan
%T Supervised Text Region Identification on Historical Documents
%I EECS Department, University of California, Berkeley
%D 2015
%8 December 18
%@ UCB/EECS-2015-254
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-254.html
%F Eng:EECS-2015-254