An Investigation of Sense Disambiguation in Scientific Texts

Manav Rathod

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2022-132

May 16, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-132.pdf

Word Sense Disambiguation (WSD) is a well-researched problem in Natural Language Processing, with decades of papers written about it and new techniques coming out every year. However, it remains under-explored in certain domains, such as scientific texts. Scientific papers are a particularly interesting use case because each paper comes with rich associated information that can potentially improve existing approaches to disambiguation, chiefly its title, abstract, and place in the citation graph. A related area of study is Acronym Disambiguation (AD). We believe that the two problems are related and that similar techniques could perform strongly in both settings. However, there is no large dataset for WSD in scientific texts, which motivates creating one artificially without spending an exorbitant amount of resources. Thus, we turn to Pseudowords as a means of creating this dataset. We demonstrate that using paper information can lead to improvements in AD and WSD, and we present a new dataset to further research in Scientific WSD.

Advisor: Marti Hearst

BibTeX citation:

@mastersthesis{Rathod:EECS-2022-132,
    Author= {Rathod, Manav},
    Title= {An Investigation of Sense Disambiguation in Scientific Texts},
    School= {EECS Department, University of California, Berkeley},
    Year= {2022},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-132.html},
    Number= {UCB/EECS-2022-132},
    Abstract= {Word Sense Disambiguation (WSD) is a well-researched problem in Natural Language Processing, with decades of papers written about it and new techniques coming out every year. However, it remains under-explored in certain domains, such as scientific texts. Scientific papers are a particularly interesting use case because each paper comes with rich associated information that can potentially improve existing approaches to disambiguation, chiefly its title, abstract, and place in the citation graph. A related area of study is Acronym Disambiguation (AD). We believe that the two problems are related and that similar techniques could perform strongly in both settings. However, there is no large dataset for WSD in scientific texts, which motivates creating one artificially without spending an exorbitant amount of resources. Thus, we turn to Pseudowords as a means of creating this dataset. We demonstrate that using paper information can lead to improvements in AD and WSD, and we present a new dataset to further research in Scientific WSD.},
}

EndNote citation:

%0 Thesis
%A Rathod, Manav
%T An Investigation of Sense Disambiguation in Scientific Texts
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 16
%@ UCB/EECS-2022-132
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-132.html
%F Rathod:EECS-2022-132