A Dynamic Topic Model for Document Segmentation

John F. Canny and Tye Lawrence Rattenbury

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2006-161

December 5, 2006

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-161.pdf

Factor language models, such as Latent Semantic Analysis, represent documents as mixtures of topics and have a variety of applications. Normally, the mixture is computed at the whole-document level: the model says that the document as a whole contains material on several topics, without specifying where in the document each topic occurs. In this paper, we describe a new model that computes a topic mixture estimate for every word in each document. This model has a number of applications; a natural one, which we explore in this paper, is topical document segmentation. Topical segmentation breaks a document into passages such that each passage is mostly about a single topic and adjacent passages have different topics. Most previous work starts from an a priori segmentation (primarily multi-sentence passages) and aims to merge those segments into topic-based passages. Our method uses no a priori segmentation of the text and can mark boundaries anywhere (i.e., between any adjacent words), although it is more likely to do so at natural boundaries such as sentence and paragraph breaks. Our model accomplishes this fine-grained segmentation by computing a per-word topic mixture distribution. We first show that per-word mixture analysis is a natural extension of an earlier factor model (specifically, the Gamma-Poisson model). Next, we detail the computational efficiency of our model, which costs only slightly more than traditional per-document topic mixture methods. Finally, we present some experimental results.

BibTeX citation:

@techreport{Canny:EECS-2006-161,
    Author= {Canny, John F. and Rattenbury, Tye Lawrence},
    Title= {A Dynamic Topic Model for Document Segmentation},
    Year= {2006},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-161.html},
    Number= {UCB/EECS-2006-161},
    Abstract= {Factor language models, such as Latent Semantic Analysis, represent documents as mixtures of topics and have a variety of applications. Normally, the mixture is computed at the whole-document level: the model says that the document as a whole contains material on several topics, without specifying where in the document each topic occurs. In this paper, we describe a new model that computes a topic mixture estimate for every word in each document. This model has a number of applications; a natural one, which we explore in this paper, is topical document segmentation. Topical segmentation breaks a document into passages such that each passage is mostly about a single topic and adjacent passages have different topics. Most previous work starts from an a priori segmentation (primarily multi-sentence passages) and aims to merge those segments into topic-based passages. Our method uses no a priori segmentation of the text and can mark boundaries anywhere (i.e., between any adjacent words), although it is more likely to do so at natural boundaries such as sentence and paragraph breaks. Our model accomplishes this fine-grained segmentation by computing a per-word topic mixture distribution. We first show that per-word mixture analysis is a natural extension of an earlier factor model (specifically, the Gamma-Poisson model). Next, we detail the computational efficiency of our model, which costs only slightly more than traditional per-document topic mixture methods. Finally, we present some experimental results.},
}

EndNote citation:

%0 Report
%A Canny, John F.
%A Rattenbury, Tye Lawrence
%T A Dynamic Topic Model for Document Segmentation
%I EECS Department, University of California, Berkeley
%D 2006
%8 December 5
%@ UCB/EECS-2006-161
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-161.html
%F Canny:EECS-2006-161