Joshua Wu

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-167

May 12, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-167.pdf

String matching is the operation of identifying and matching similar strings according to a similarity function or distance metric. It is a historically difficult problem that has many real world applications such as data cleaning, entity deduplication, and search. Traditional string distance metrics only consider either acronyms, abbreviations, or typos, but do not consider all three together. Prior research has attempted to address string matching with acronyms and abbreviations by leveraging a dictionary of generated synonyms. However, this approach is limited because it must trade off between potentially leaving incorrect rules in the dictionary or potentially removing correct rules during a refinement procedure.

We propose Smash, a new character-based string distance metric for applications in string matching. We also propose an efficient dynamic programming algorithm that leverages Smash and captures abbreviations, acronyms, and misspellings, three of the most common data modifications in real world data. We evaluate our metric on real world datasets to show that our metric’s accuracy outperforms that of state-of-the-art string distance metrics and similarity measures. In particular, we evaluate Smash on police roster data from the NACDL (National Association of Criminal Defense Lawyers) and show that it achieves the highest F-score among different metrics on a police roster dataset as well as on other benchmark datasets. We also show that Smash is practical to integrate into a data cleaning workflow.

Advisors: Aditya Parameswaran


BibTeX citation:

@mastersthesis{Wu:EECS-2023-167,
    Author= {Wu, Joshua},
    Editor= {Parameswaran, Aditya and Tang, Dixin},
    Title= {Smash: A Dictionary-Free String Distance Metric that Considers Acronyms, Abbreviations, and Typos},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-167.html},
    Number= {UCB/EECS-2023-167},
    Abstract= {String matching is the operation of identifying and matching similar strings according to a similarity function or distance metric. It is a historically difficult problem that has many real world applications such as data cleaning, entity deduplication, and search. Traditional string distance metrics only consider either acronyms, abbreviations, or typos, but do not consider all three together. Prior research has attempted to address string matching with acronyms and abbreviations by leveraging a dictionary of generated synonyms. However, this approach is limited because it must trade off between potentially leaving incorrect rules in the dictionary or potentially removing correct rules during a refinement procedure.

We propose Smash, a new character-based string distance metric for applications in string matching. We also propose an efficient dynamic programming algorithm that leverages Smash and captures abbreviations, acronyms, and misspellings, three of the most common data modifications in real world data. We evaluate our metric on real world datasets to show that our metric’s accuracy outperforms that of state-of-the-art string distance metrics and similarity measures. In particular, we evaluate Smash on police roster data from the NACDL (National Association of Criminal Defense Lawyers) and show that it achieves the highest F-score among different metrics on a police roster dataset as well as on other benchmark datasets. We also show that Smash is practical to integrate into a data cleaning workflow.},
}

EndNote citation:

%0 Thesis
%A Wu, Joshua 
%E Parameswaran, Aditya 
%E Tang, Dixin 
%T Smash: A Dictionary-Free String Distance Metric that Considers Acronyms, Abbreviations, and Typos
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 12
%@ UCB/EECS-2023-167
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-167.html
%F Wu:EECS-2023-167