Supervised Text Region Identification on Historical Documents
Jonathan Eng
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2015-254
December 18, 2015
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-254.pdf
We present multi-column text region identification support for Ocular, the unsupervised historical printed document transcription project of Berg-Kirkpatrick et al. (2013). We use structured prediction with rich features defined on the input document and incorporate a transition model based on prior document layout assumptions. Our model is trained using a structured-SVM objective on a randomly selected set of historical documents from The Proceedings of the Old Bailey corpus. For learning, we use loss-augmented Viterbi decoding with a weighted Hamming loss function. We present a suite of features that achieves a 37.4 F1 text and 39.4 F1 non-text improvement in text region identification over the Ocular baseline text cropper.
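As a rough illustration of the learning step named in the abstract, the sketch below shows loss-augmented Viterbi decoding with a weighted Hamming loss on a generic label sequence. It is not the report's implementation: the function name `loss_augmented_viterbi`, the array layout, and the per-label `loss_weights` vector are all assumptions made for this example, and the report's actual features, label set, and weighting scheme are not reproduced here. Because a Hamming loss decomposes over positions, it can be folded into the per-position scores before running ordinary Viterbi.

```python
import numpy as np

def loss_augmented_viterbi(emission, transition, gold, loss_weights):
    """Return the label path maximizing model score plus weighted Hamming loss.

    Hypothetical interface, assumed for illustration only:
      emission:     (T, K) array, score of label k at position t
      transition:   (K, K) array, score of moving from label i to label j
      gold:         length-T integer array of gold labels
      loss_weights: (K,) array, cost of mislabeling a position whose gold label is k
    """
    emission = np.asarray(emission, dtype=float)
    transition = np.asarray(transition, dtype=float)
    T, K = emission.shape

    # Fold the per-position loss into the emission scores:
    # every label except the gold one receives the gold label's cost.
    augmented = emission.copy()
    for t in range(T):
        cost = np.full(K, float(loss_weights[gold[t]]))
        cost[gold[t]] = 0.0
        augmented[t] += cost

    # Standard Viterbi recursion over the augmented scores.
    score = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = augmented[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition  # rows: previous label, cols: current label
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + augmented[t]

    # Follow back-pointers from the best final label.
    path = np.empty(T, dtype=int)
    path[-1] = int(score[-1].argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path, float(score[-1].max())
```

In a structured-SVM update, the path returned here would play the role of the most violated labeling, and the weights would be moved toward the gold labeling's features and away from this path's features. Setting `loss_weights` to all zeros reduces the routine to plain Viterbi decoding.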
Advisor: Daniel Klein
BibTeX citation:
@mastersthesis{Eng:EECS-2015-254,
    Author= {Eng, Jonathan},
    Title= {Supervised Text Region Identification on Historical Documents},
    School= {EECS Department, University of California, Berkeley},
    Year= {2015},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-254.html},
    Number= {UCB/EECS-2015-254},
    Abstract= {We present multi-column text region identification support for Ocular, the unsupervised historical printed document transcription project of Berg-Kirkpatrick et al. (2013). We use structured prediction with rich features defined on the input document and incorporate a transition model based on prior document layout assumptions. Our model is trained using a structured-SVM objective on a randomly selected set of historical documents from The Proceedings of the Old Bailey corpus. For learning, we use loss-augmented Viterbi decoding with a weighted Hamming loss function. We present a suite of features that achieves a 37.4 F1 text and 39.4 F1 non-text improvement in text region identification over the Ocular baseline text cropper.},
}
EndNote citation:
%0 Thesis
%A Eng, Jonathan
%T Supervised Text Region Identification on Historical Documents
%I EECS Department, University of California, Berkeley
%D 2015
%8 December 18
%@ UCB/EECS-2015-254
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-254.html
%F Eng:EECS-2015-254