A Dynamic Topic Model for Document Segmentation
John F. Canny and Tye Lawrence Rattenbury
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2006-161
December 5, 2006
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-161.pdf
Factor language models, such as Latent Semantic Analysis, represent documents as mixtures of topics and have a variety of applications. Normally, the mixture is computed at the whole-document level: the entire document is treated as containing material on several topics, without specifying where each topic occurs. In this paper, we describe a new model that computes a topic mixture estimate for every word in each document. This model has a number of applications, but a natural one is topical document segmentation, which we explore in this paper. Topical segmentation breaks a document into passages such that each passage is mostly about a single topic and adjacent passages have different topics. Most previous work has started with an a priori segmentation (primarily multi-sentence passages); the goal in that setting is to merge the a priori segments into topic-based passages. Our method uses no a priori segmentation of the text and can mark boundaries anywhere (i.e., between any adjacent words), although it is more likely to do so at natural boundaries such as sentence and paragraph breaks. Our model accomplishes this fine-grained segmentation by computing a per-word topic mixture distribution. We first show that per-word mixture analysis is a natural extension of an earlier factor model (specifically, the Gamma-Poisson model). Next, we detail the computational efficiency of our model: it costs only slightly more than traditional per-document topic mixture methods. Finally, we present some experimental results.
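To make the idea of boundaries "between any adjacent words" concrete, here is a minimal sketch. It is an illustration only and is not the report's algorithm: it assumes the per-word topic mixtures have already been estimated by some model, and it applies a simple rule (a boundary wherever the dominant topic changes) that the abstract does not specify. The function name `segment_by_dominant_topic` and the example mixtures are hypothetical.

```python
from typing import List, Sequence


def segment_by_dominant_topic(word_topic: Sequence[Sequence[float]]) -> List[int]:
    """Return every index i where a boundary falls between word i-1 and word i.

    word_topic[i][k] is the (assumed, precomputed) weight of topic k for word i.
    The boundary rule used here is an assumption made for illustration.
    """
    # Dominant topic per word: the index of the largest mixture weight.
    dominant = [max(range(len(row)), key=lambda k: row[k]) for row in word_topic]
    # A boundary is placed wherever the dominant topic differs from the previous word's.
    return [i for i in range(1, len(dominant)) if dominant[i] != dominant[i - 1]]


# Hypothetical per-word mixtures over two topics for a six-word document.
mixtures = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.6, 0.4],
]
boundaries = segment_by_dominant_topic(mixtures)
print(boundaries)  # [3, 5]
```

Under this rule the six words split into three passages: words 0-2, words 3-4, and word 5. A practical segmenter would likely smooth the mixtures first so that a single off-topic word does not create its own passage.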
BibTeX citation:
@techreport{Canny:EECS-2006-161,
    Author = {Canny, John F. and Rattenbury, Tye Lawrence},
    Title = {A Dynamic Topic Model for Document Segmentation},
    Year = {2006},
    Month = {Dec},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-161.html},
    Number = {UCB/EECS-2006-161},
    Abstract = {Factor language models, like Latent Semantic Analysis, represent documents as mixtures of topics, and have a variety of applications. Normally, the mixture is computed at the whole-document level, that is, the entire document contains material on several topics, without specifying where they occur in the document. In this paper, we describe a new model which computes the topic mixture estimate for every word in each document. There are a number of applications of this model, but a natural one is topical document segmentation which we explore in this paper. Topical segmentation breaks a document into passages that are mostly about a single topic and so that adjacent passages have different topics. Most previous works have started with an a-priori segmentation (primarily multi-sentence passages). The goal in this setting is to merge the a-priori segments to build topic-based passages. Our method uses no a-priori segmentation of the text, and can mark boundaries anywhere (i.e. between any adjacent words), although it is more likely to do so at natural boundaries such as sentences and paragraphs. Our model accomplishes this fine-grain segmentation by computing a per-word topic mixture distribution. We first show that per-word mixture analysis is a natural extension of an earlier factor model (specifically the Gamma-Poisson model). Next we detail the computational efficiency of our model -- it costs only slightly more than traditional per-document topic mixture methods. Finally we present some experimental results.}
}
EndNote citation:
%0 Report
%A Canny, John F.
%A Rattenbury, Tye Lawrence
%T A Dynamic Topic Model for Document Segmentation
%I EECS Department, University of California, Berkeley
%D 2006
%8 December 5
%@ UCB/EECS-2006-161
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-161.html
%F Canny:EECS-2006-161