Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU

Introduction to Bioinformatics

Exact Matching: What's the Problem

Position:  1  2  3  4  5  6  7  8  9 10 11 12
T:         b  b  a  b  a  x  a  b  a  b  a  y

P = aba

P occurs in T starting at locations 3, 7, and 9.
P may overlap, as found at 7 and 9.

The Naive Method

The problem is to find whether a pattern P[1..m] occurs within a text T[1..n].
Let P = abxyabxz and T = xabxyabxyabxz, where m = 8 and n = 13.

If P = aaa and T = aaaaaaaaaa, then m = 3 and n = 10.
In the worst case there are exactly m(n-m+1) comparisons.
In this case that is 24 comparisons, which is on the order of mn.
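The naive method described above can be sketched in C as follows. This is a minimal illustration and not the listing from the slides: the function name naive_match, the output array and the comparison counter are assumptions introduced here, and the code uses 0-based C strings internally while reporting 1-based positions as the slides do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Naive exact matching: try every alignment of pat against text.
 * Stores the 1-based start positions of the occurrences in out (at most
 * cap entries), adds the number of character comparisons to *comparisons
 * when that pointer is non-NULL, and returns the number of occurrences.
 * (Hypothetical helper written for this example.)
 */
size_t naive_match(const char *text, const char *pat,
                   size_t *out, size_t cap, size_t *comparisons)
{
    assert(text != NULL && pat != NULL);

    size_t n = strlen(text);   /* length of the text T */
    size_t m = strlen(pat);    /* length of the pattern P */
    size_t found = 0;

    if (m == 0 || m > n)
        return 0;

    /* There are n - m + 1 possible alignments of P against T. */
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m) {
            if (comparisons != NULL)
                (*comparisons)++;
            if (text[i + j] != pat[j])
                break;          /* mismatch: shift P by one position */
            j++;
        }
        if (j == m) {
            if (found < cap)
                out[found] = i + 1;   /* report a 1-based position */
            found++;
        }
    }
    return found;
}
```

Called with T = bbabaxababay and P = aba, the function reports positions 3, 7 and 9. With T = aaaaaaaaaa and P = aaa, every one of the 8 alignments costs 3 comparisons, giving the 24 comparisons counted above.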

The Naive Algorithm

char text[], pat[] ;
int n, m ;
{
    int i, j, k, lim ;
    lim = n-m+1 ;
    for (i=1 ; i ...

Basic String Definitions/Notations

Position:  1  2  3  4  5  6  7  8  9 10 11 12
S:         b  b  a  b  a  x  a  b  a  b  a  y

S[3..7] = abaxa
S[1..4] = bbab (a prefix)
S[9..12] = abay (a suffix)

|S| is the length of the string. Here, |S| = 12.
S[1..i] is the prefix of S that ends at position i.
S[i..|S|] is the suffix of S that begins at position i.
A proper prefix, suffix or substring of S is, respectively, a prefix, suffix or substring that is neither the entire string S nor the empty string.
For any string S, S(i) denotes the ith character of S.

Preprocessing

Goal: To gather the information needed for speeding up the algorithm.

Definitions:
Zi: For i > 1, the length of the longest substring of S that starts at i and matches a prefix of S.
Z-box: For any position i > 1 where Zi > 0, the Z-box at i starts at i and ends at i+Zi-1.
ri: For every i > 1, ri is the right-most endpoint of the Z-boxes that begin at or before i.
li: For every i > 1, li is the left endpoint of the Z-box that ends at ri.

Zi(S) = the length of the longest prefix of S[i..|S|] that matches a prefix of S, where i > 1.

Position:  1  2  3  4  5  6  7  8  9 10 11
S:         a  a  b  c  a  a  b  x  a  a  z

Z5(S) = 3 (aabc...aabx)
Z6(S) = 1 (aa...ab)
Z7(S) = Z8(S) = 0
Z9(S) = 2 (aab...aaz)

We will use Zi in place of Zi(S).

Z Box

A Z-box exists for each i > 1 where Zi is greater than zero.
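The definitions of Zi, ri and li can be checked directly with a short C sketch. It computes each Zi by explicit comparison, so it takes quadratic time in the worst case; it is meant only to make the definitions concrete. The function names z_by_definition and z_box_bounds are assumptions made for this example, as is the convention of leaving array slot 0 unused so that indices match the 1-based positions used here, and of storing 0 at index 1, where the values are not defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fill z[1..n] for the string s of length n, straight from the definition:
 * z[i] is the length of the longest substring starting at position i that
 * matches a prefix of s. The caller provides room for n + 1 entries.
 */
void z_by_definition(const char *s, size_t *z)
{
    assert(s != NULL && z != NULL);

    size_t n = strlen(s);
    if (n == 0)
        return;

    z[1] = 0;   /* Z_i is only defined for i > 1 */
    for (size_t i = 2; i <= n; i++) {
        size_t len = 0;
        /* s[len] is S(1 + len) and s[i - 1 + len] is S(i + len). */
        while (i - 1 + len < n && s[len] == s[i - 1 + len])
            len++;
        z[i] = len;
    }
}

/*
 * Given z[1..n], fill r[1..n] and l[1..n]. r[i] is the right-most endpoint
 * of the Z-boxes that begin at or before i, and l[i] is the left end of a
 * Z-box ending at r[i]. Both stay 0 until the first Z-box is seen.
 */
void z_box_bounds(const size_t *z, size_t n, size_t *l, size_t *r)
{
    size_t cur_l = 0, cur_r = 0;

    if (n >= 1) {
        l[1] = 0;
        r[1] = 0;
    }
    for (size_t i = 2; i <= n; i++) {
        /* The Z-box at i covers positions i .. i + z[i] - 1. */
        if (z[i] > 0 && i + z[i] - 1 > cur_r) {
            cur_l = i;
            cur_r = i + z[i] - 1;
        }
        l[i] = cur_l;
        r[i] = cur_r;
    }
}
```

For S = aabcaabxaaz this gives Z5 = 3, Z6 = 1, Z7 = Z8 = 0 and Z9 = 2, in agreement with the example above.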

Figure 1.2: From Gusfield

The li and ri of Z-Box

[Figure: Z-boxes drawn along S; positions marked in the figure: 40, 50, 55, 62, 70, 78, 82, 85, 89, 95]

ri = the right-most endpoint of the Z-boxes that begin at or before position i.
li = the left end of the Z-box that ends at ri.

r78 = 95, l78 = 78
r82 = 95, l82 = 78
r52 = 50, l52 = 40
r75 = 85, l75 = 70

Preprocessing

i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
S:    a  a  b  a  a  b  c  a  x  a  a  b  a  a  b  c  y
Z:    0  1  0  3  1  0  0  1  0  7  1  0  3  1  0  0  0
ri:   0  2  2  6  6  6  6  8  8 16 16 16 16 16 16 16 16
li:   0  2  2  4  4  4  4  8  8 10 10 10 10 10 10 10 10

Z-Algorithm

Goal: To calculate Zi for an input string S in linear time.

Starting from i=2, calculate Z2, r2 and l2.
For i=3; i ... r
Compare the characters starting at k+1 with those starting at 1.
Update r, and l if necessary.

Z-Algorithm

Input: Pattern P
Output: Zi

Calculate Z2, r2 and l2 specifically by comparisons. r = r2 and l = l2.
for i=3; i ...
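A minimal C sketch of a linear-time Z computation is given here for illustration. It follows the common textbook formulation built on the l and r values of the right-most Z-box, and should not be read as the listing from the slides. The function name z_algorithm and the 1-based output array (slot 0 unused, z[1] set to 0) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Compute z[1..n] for the string s of length n in linear time.
 * l and r delimit the Z-box with the right-most endpoint found so far.
 * The caller provides room for n + 1 entries in z.
 */
void z_algorithm(const char *s, size_t *z)
{
    assert(s != NULL && z != NULL);

    size_t n = strlen(s);
    size_t l = 0, r = 0;

    if (n == 0)
        return;
    z[1] = 0;   /* Z_i is only defined for i > 1 */

    for (size_t k = 2; k <= n; k++) {
        size_t len = 0;

        if (k <= r) {
            /*
             * k lies inside the Z-box [l, r], so S[k..r] equals
             * S[k-l+1 .. r-l+1]. The value already computed at
             * position k-l+1 can be reused, capped at the end of the box.
             */
            size_t kp = k - l + 1;
            size_t rest = r - k + 1;
            len = z[kp] < rest ? z[kp] : rest;
        }

        /* Extend the match by explicit comparisons of S(1+len), S(k+len). */
        while (k + len <= n && s[len] == s[k - 1 + len])
            len++;
        z[k] = len;

        /* Update l and r if this Z-box reaches further to the right. */
        if (len > 0 && k + len - 1 > r) {
            l = k;
            r = k + len - 1;
        }
    }
}
```

The running time is linear because each explicit comparison that succeeds moves r to the right, and each position causes at most one failed comparison.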