William I. Chang and Jordan Lampe

EECS Department, University of California, Berkeley

Technical Report No. UCB/CSD-91-653

, 1991

http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/CSD-91-653.pdf

We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the <i>k</i> differences approximate string matching problem specifies a text string of length <i>n</i>, a pattern string of length <i>m</i>, the number <i>k</i> of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various <i>O</i>(<i>kn</i>) algorithms based on dynamic programming (DP), paying particular attention to dependence on <i>b</i> the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on <i>b</i>. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is <i>O</i>(<i>kn</i>) for random text. Furthermore, we give a heuristic argument that our algorithm is <i>O</i>(<i>kn</i>/(the square root of <i>b</i> - 1)) on the average, when alphabet size is taken into consideration.


BibTeX citation:

@techreport{Chang:CSD-91-653,
    Author= {Chang, William I. and Lampe, Jordan},
    Title= {Theoretical and Empirical Comparisons of Approximate String Matching Algorithms},
    Year= {1991},
    Month= {Oct},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6124.html},
    Number= {UCB/CSD-91-653},
    Abstract= {We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the <i>k</i> differences approximate string matching problem specifies a text string of length <i>n</i>, a pattern string of length <i>m</i>, the number <i>k</i> of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various <i>O</i>(<i>kn</i>) algorithms based on dynamic programming (DP), paying particular attention to dependence on <i>b</i> the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on <i>b</i>. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is <i>O</i>(<i>kn</i>) for random text. Furthermore, we give a heuristic argument that our algorithm is <i>O</i>(<i>kn</i>/(the square root of <i>b</i> - 1)) on the average, when alphabet size is taken into consideration.},
}

EndNote citation:

%0 Report
%A Chang, William I. 
%A Lampe, Jordan 
%T Theoretical and Empirical Comparisons of Approximate String Matching Algorithms
%I EECS Department, University of California, Berkeley
%D 1991
%@ UCB/CSD-91-653
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6124.html
%F Chang:CSD-91-653