### William I. Chang and Jordan Lampe

###
EECS Department

University of California, Berkeley

Technical Report No. UCB/CSD-91-653

October 1991

### http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/CSD-91-653.pdf

We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the
*k* differences approximate string matching problem specifies a text string of length
*n*, a pattern string of length
*m*, the number
*k* of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various
*O*(
*kn*) algorithms based on dynamic programming (DP), paying particular attention to dependence on
*b* the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on
*b*. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is
*O*(
*kn*) for random text. Furthermore, we give a heuristic argument that our algorithm is
*O*(
*kn*/(the square root of
*b* - 1)) on the average, when alphabet size is taken into consideration.

BibTeX citation:

@techreport{Chang:CSD-91-653, Author = {Chang, William I. and Lampe, Jordan}, Title = {Theoretical and Empirical Comparisons of Approximate String Matching Algorithms}, Institution = {EECS Department, University of California, Berkeley}, Year = {1991}, Month = {Oct}, URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6124.html}, Number = {UCB/CSD-91-653}, Abstract = {We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the <i>k</i> differences approximate string matching problem specifies a text string of length <i>n</i>, a pattern string of length <i>m</i>, the number <i>k</i> of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various <i>O</i>(<i>kn</i>) algorithms based on dynamic programming (DP), paying particular attention to dependence on <i>b</i> the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on <i>b</i>. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is <i>O</i>(<i>kn</i>) for random text. Furthermore, we give a heuristic argument that our algorithm is <i>O</i>(<i>kn</i>/(the square root of <i>b</i> - 1)) on the average, when alphabet size is taken into consideration.} }

EndNote citation:

%0 Report %A Chang, William I. %A Lampe, Jordan %T Theoretical and Empirical Comparisons of Approximate String Matching Algorithms %I EECS Department, University of California, Berkeley %D 1991 %@ UCB/CSD-91-653 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6124.html %F Chang:CSD-91-653