Experiments in Improving Unsupervised Word Sense Disambiguation

Jonathan Traupman and Robert Wilensky

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-03-1227
February 2003

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2003/CSD-03-1227.pdf

Like many problems in Natural Language Processing, word sense disambiguation is difficult yet potentially very useful. Automatically determining the intended meaning of a word with multiple definitions could benefit document classification, keyword searching, OCR, and many other applications that process text. Unfortunately, designing a system that copes accurately with the idiosyncrasies of human language remains a challenge.

In this report we describe our attempts to improve the discrimination accuracy of the Yarowsky word sense disambiguation algorithm. The first of these experiments used an iterative approach to re-train the classifier. Our hope was that a corpus labeled by an imperfect classifier would be better training material than an unlabeled corpus. By using the classifier's output from one iteration as its training input in the next, we tried to boost the accuracy of each successive cycle.
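
Neither the classifier internals nor its training interface appear in this abstract, so the following Python sketch is our own illustration of the re-training loop: a toy naive-Bayes-style sense model (the train, classify, and self_train names are ours) stands in for the Yarowsky classifier, and the labels it assigns to the unlabeled corpus are fed back as training data on each cycle.

    # Our own sketch of the iterative re-training idea; not the report's code.
    from collections import Counter, defaultdict
    import math

    def train(labeled_examples):
        """labeled_examples: list of (context_words, sense) pairs -> per-sense counts."""
        word_counts = defaultdict(Counter)
        sense_counts = Counter()
        for context, sense in labeled_examples:
            sense_counts[sense] += 1
            word_counts[sense].update(context)
        return word_counts, sense_counts

    def classify(context, word_counts, sense_counts):
        """Choose the sense whose context model best explains the words (add-one smoothing)."""
        total = sum(sense_counts.values())
        vocab = {w for counts in word_counts.values() for w in counts}
        best_sense, best_score = None, float("-inf")
        for sense, count in sense_counts.items():
            denom = sum(word_counts[sense].values()) + len(vocab) + 1
            score = math.log(count / total)
            for w in context:
                score += math.log((word_counts[sense][w] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    def self_train(seed_examples, unlabeled_contexts, iterations=3):
        """Re-train on the classifier's own labels each cycle, as described above."""
        model = train(seed_examples)
        for _ in range(iterations):
            relabeled = [(ctx, classify(ctx, *model)) for ctx in unlabeled_contexts]
            model = train(seed_examples + relabeled)
        return model

The fixed iteration count is a simplification; how many cycles to run, and whether to keep only high-confidence labels, are tuning choices this sketch does not settle.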

Our second experiment used part-of-speech information as an additional knowledge source for the Yarowsky algorithm. We pre-processed our training and test corpora with a part-of-speech tagger and used these tags to filter possible senses and improve the predictive power of words' contexts. Since part-of-speech tagging is a relatively mature technology with high accuracy, we expected it to improve the accuracy of the much more difficult word sense disambiguation process.
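
The abstract does not say which tagger or dictionary was used, so the sketch below only illustrates the filtering step with a hypothetical sense inventory (SENSES) that records a part of speech for each sense; candidate senses whose part of speech disagrees with the tagger's output are dropped before disambiguation.

    # Illustrative POS filter (our own sketch; the report's tagger and dictionary
    # are not reproduced here).

    # Hypothetical sense inventory: sense id -> part of speech.
    SENSES = {
        "bank/1": "NOUN",   # financial institution
        "bank/2": "NOUN",   # river bank
        "bank/3": "VERB",   # to bank (an aircraft, a fire, ...)
    }

    def filter_senses_by_pos(candidate_senses, tagged_pos, sense_pos=SENSES):
        """Keep only the senses compatible with the observed part-of-speech tag."""
        kept = [s for s in candidate_senses if sense_pos.get(s) == tagged_pos]
        return kept or list(candidate_senses)   # fall back if the filter removes everything

    # Example: the tagger labels "bank" as a verb, so only the verb sense survives.
    print(filter_senses_by_pos(["bank/1", "bank/2", "bank/3"], "VERB"))  # ['bank/3']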

The third experiment modified the training phase of the Yarowsky algorithm by replacing its assumption of a uniform distribution of senses for a word with a more realistic one. We exploited the fact that our dictionary lists senses roughly in order of frequency of use to create a distribution that allows more accurate training.
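
The exact distribution is not given in this abstract; as a stand-in, the sketch below assigns each sense a prior proportional to 1/rank in the dictionary's listing order, which captures the stated intuition that earlier-listed senses are more frequent.

    # Rank-based sense prior (our assumption: a simple 1/rank weighting is used
    # here for illustration only; the report's actual distribution may differ).

    def rank_based_prior(senses):
        """Map dictionary-ordered senses to probabilities proportional to 1/rank."""
        weights = [1.0 / rank for rank in range(1, len(senses) + 1)]
        total = sum(weights)
        return {sense: w / total for sense, w in zip(senses, weights)}

    # The first-listed (presumably most frequent) sense gets the largest prior mass.
    print(rank_based_prior(["bank/1", "bank/2", "bank/3"]))
    # approx. {'bank/1': 0.55, 'bank/2': 0.27, 'bank/3': 0.18}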


BibTeX citation:

@techreport{Traupman:CSD-03-1227,
    Author = {Traupman, Jonathan and Wilensky, Robert},
    Title = {Experiments in Improving Unsupervised Word Sense Disambiguation},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2003},
    Month = {Feb},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2003/5568.html},
    Number = {UCB/CSD-03-1227},
    Abstract = {As with many problems in Natural Language Processing, word sense disambiguation is a difficult yet potentially very useful capability. Automatically determining the meanings of words with multiple definitions could benefit document classification, keyword searching, OCR, and many other applications that process text. Unfortunately, it is a challenge to design a system that can accurately cope with the idiosyncrasies of human language.
                In this report we describe our attempts to improve the discrimination accuracy of the Yarowsky word sense disambiguation algorithm. The first of these experiments used an iterative approach to re-train the classifier. Our hope was that a corpus labeled by an imperfect classifier would make training material superior to an unlabeled corpus. By using the classifier's output from one iteration as its training input in the next, we tried to boost the accuracy of each successive cycle.
                Our second experiment used part-of-speech information as an additional knowledge source for the Yarowsky algorithm. We pre-processed our training and test corpora with a part-of-speech tagger and used these tags to filter possible senses and improve the predictive power of words' contexts. Since part-of-speech tagging is a relatively mature technology with high accuracy, we expected it to improve the accuracy of the much more difficult word sense disambiguation process.
                The third experiment modified the training phase of the Yarowsky algorithm by replacing its assumption of a uniform distribution of senses for a word with a more realistic one. We exploit the fact that our dictionary lists senses roughly in order by frequency of use to create a distribution that allows more accurate training.}
}

EndNote citation:

%0 Report
%A Traupman, Jonathan
%A Wilensky, Robert
%T Experiments in Improving Unsupervised Word Sense Disambiguation
%I EECS Department, University of California, Berkeley
%D 2003
%@ UCB/CSD-03-1227
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2003/5568.html
%F Traupman:CSD-03-1227