Sublinear Expected Time Approximate String Matching and Biological Applications

William I. Chang and Eugene L. Lawler

EECS Department, University of California, Berkeley

Technical Report No. UCB/CSD-91-654

, 1991

The k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, the number k of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We treat k not as a constant but as a fraction of m (not necessarily constant-fraction). Previous algorithms require at least O(kn) time (or else exponential space). We are interested in much faster algorithms for restricted cases of the problem, such as when the text string is random and the allowable error rate is not too high (log-fraction). We have devised an algorithm that is sublinear time O(n/m)k logb m) on the average, when k is bounded by the threshold m / (logb m + O(1)). In particular, when k = o(m / logb m) the expected running time is o(n). In the worst case, our algorithm is O(kn), but still an improvement in that it is practical and uses O(m) space compared to O(n) or O(msquared). We define three problems inspired by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching; (2) approximate-overlap detection; and (3) approximate codon transcription. Respectively, applications to genetics are local similarity search; sequence assembly; and DNA-protein matching.

BibTeX citation:

@techreport{Chang:CSD-91-654,
    Author= {Chang, William I. and Lawler, Eugene L.},
    Title= {Sublinear Expected Time Approximate String Matching and Biological Applications},
    Year= {1991},
    Month= {Oct},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6144.html},
    Number= {UCB/CSD-91-654},
    Abstract= {The <i>k</i> differences approximate string matching problem specifies a text string of length <i>n</i>, a pattern string of length <i>m</i>, the number <i>k</i> of differences (substitutions, insertions, deletions) allowed in a match, and asks for all locations in the text where a match occurs. We treat <i>k</i> not as a constant but as a fraction of <i>m</i> (not necessarily constant-fraction). Previous algorithms require at least <i>O</i>(<i>kn</i>) time (or else exponential space). We are interested in much faster algorithms for restricted cases of the problem, such as when the text string is random and the allowable error rate is not too high (log-fraction). We have devised an algorithm that is sublinear time <i>O</i>(<i>n</i>/<i>m</i>)<i>k</i> log<i>b</i> <i>m</i>) on the average, when <i>k</i> is bounded by the threshold <i>m</i> / (log<i>b</i> <i>m</i> + <i>O</i>(1)). In particular, when <i>k</i> = <i>o</i>(<i>m</i> / log<i>b</i> <i>m</i>) the expected running time is <i>o</i>(<i>n</i>). In the worst case, our algorithm is <i>O</i>(<i>kn</i>), but still an improvement in that it is practical and uses <i>O</i>(<i>m</i>) space compared to <i>O</i>(<i>n</i>) or <i>O</i>(<i>m</i>squared). We define three problems inspired by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching; (2) approximate-overlap detection; and (3) approximate codon transcription. Respectively, applications to genetics are local similarity search; sequence assembly; and DNA-protein matching.},
}

EndNote citation:

%0 Report
%A Chang, William I. 
%A Lawler, Eugene L. 
%T Sublinear Expected Time Approximate String Matching and Biological Applications
%I EECS Department, University of California, Berkeley
%D 1991
%@ UCB/CSD-91-654
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1991/6144.html
%F Chang:CSD-91-654