Theoretical and Empirical Comparisons of Approximate String Matching Algorithms

William I. Chang and Jordan Lampe

EECS Department, University of California, Berkeley

Technical Report No. UCB/CSD-91-653

, 1991

We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various O(kn) algorithms based on dynamic programming (DP), paying particular attention to dependence on b the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on b. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is O(kn) for random text. Furthermore, we give a heuristic argument that our algorithm is O(kn/(the square root of b - 1)) on the average, when alphabet size is taken into consideration.

BibTeX citation:

@techreport{Chang:CSD-91-653,
    Author= {Chang, William I. and Lampe, Jordan},
    Title= {Theoretical and Empirical Comparisons of Approximate String Matching Algorithms},
    Year= {1991},
    Month= {Oct},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6124.html},
    Number= {UCB/CSD-91-653},
    Abstract= {We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the <i>k</i> differences approximate string matching problem specifies a text string of length <i>n</i>, a pattern string of length <i>m</i>, the number <i>k</i> of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various <i>O</i>(<i>kn</i>) algorithms based on dynamic programming (DP), paying particular attention to dependence on <i>b</i> the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on <i>b</i>. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is <i>O</i>(<i>kn</i>) for random text. Furthermore, we give a heuristic argument that our algorithm is <i>O</i>(<i>kn</i>/(the square root of <i>b</i> - 1)) on the average, when alphabet size is taken into consideration.},
}

EndNote citation:

%0 Report
%A Chang, William I. 
%A Lampe, Jordan 
%T Theoretical and Empirical Comparisons of Approximate String Matching Algorithms
%I EECS Department, University of California, Berkeley
%D 1991
%@ UCB/CSD-91-653
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6124.html
%F Chang:CSD-91-653