Theoretical and Empirical Comparisons of Approximate String Matching Algorithms
William I. Chang and Jordan Lampe
EECS Department, University of California, Berkeley
Technical Report No. UCB/CSD-91-653
, 1991
http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/CSD-91-653.pdf
We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the <i>k</i> differences approximate string matching problem specifies a text string of length <i>n</i>, a pattern string of length <i>m</i>, the number <i>k</i> of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various <i>O</i>(<i>kn</i>) algorithms based on dynamic programming (DP), paying particular attention to dependence on <i>b</i> the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on <i>b</i>. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is <i>O</i>(<i>kn</i>) for random text. Furthermore, we give a heuristic argument that our algorithm is <i>O</i>(<i>kn</i>/(the square root of <i>b</i> - 1)) on the average, when alphabet size is taken into consideration.
BibTeX citation:
@techreport{Chang:CSD-91-653, Author= {Chang, William I. and Lampe, Jordan}, Title= {Theoretical and Empirical Comparisons of Approximate String Matching Algorithms}, Year= {1991}, Month= {Oct}, Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6124.html}, Number= {UCB/CSD-91-653}, Abstract= {We study in depth a model of non-exact pattern matching based on edit distance, which is the minimum number of substitutions, insertions, and deletions needed to transform one string of symbols to another. More precisely, the <i>k</i> differences approximate string matching problem specifies a text string of length <i>n</i>, a pattern string of length <i>m</i>, the number <i>k</i> of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We have carefully implemented and analyzed various <i>O</i>(<i>kn</i>) algorithms based on dynamic programming (DP), paying particular attention to dependence on <i>b</i> the alphabet size. An empirical observation on the average values of the DP tabulation makes apparent each algorithm's dependence on <i>b</i>. A new algorithm is presented that computes much fewer entries of the DP table. In practice, its speedup over the previous fastest algorithm is 2.5X for binary alphabet; 4X for four-letter alphabet; 10X for twenty-letter alphabet. We give a probabilistic analysis of the DP table in order to prove that the expected running time of our algorithm (as well as an earlier "cut off" algorithm due to Ukkonen) is <i>O</i>(<i>kn</i>) for random text. Furthermore, we give a heuristic argument that our algorithm is <i>O</i>(<i>kn</i>/(the square root of <i>b</i> - 1)) on the average, when alphabet size is taken into consideration.}, }
EndNote citation:
%0 Report %A Chang, William I. %A Lampe, Jordan %T Theoretical and Empirical Comparisons of Approximate String Matching Algorithms %I EECS Department, University of California, Berkeley %D 1991 %@ UCB/CSD-91-653 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6124.html %F Chang:CSD-91-653