Exact and Approximate Pattern Matching in the Streaming Model
Presented by Tanushree Mitra
Benny Porat and Ely Porat, FOCS 2009
Problem Statement
• Find all occurrences of a pattern P of length m as a contiguous substring of a text string T of length n, where m < n.
Contributions
• Exact pattern matching – A fully online randomized algorithm for the classical pattern matching problem.
– Time complexity – O(log m) per character that arrives
– Space complexity – O(log m), breaking the O(m) barrier that held for this problem for a long time
• Approximate pattern matching – An algorithm for the pattern matching with k mismatches problem.
– Time complexity – O(k^2 poly(log m)) per character
– Space complexity – O(k^3 poly(log m))
Applications
• Monitoring Internet traffic
• Computational Biology
• Large Scale web searching
• Viruses and Malware detection
• Automatic Stock market analysis
• Robotics
Background
Brute Force Algorithm –
– Slide the pattern along the text, and
– Compare it to the corresponding portion of the text
Time Complexity – O(mn)
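A minimal sketch of the brute-force approach, with the sliding step and the comparison step marked in comments. The function name brute_force_find is a placeholder chosen for illustration, and indices are 0-based.

```python
def brute_force_find(text, pattern):
    """Return every 0-based start index where pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):             # sliding step: n - m + 1 positions
        if text[start:start + m] == pattern:   # comparison step: O(m) each
            matches.append(start)
    return matches


print(brute_force_find("abracadabra", "abra"))
```

Note that the comparison needs the last m characters of the text, which is the O(m) space the streaming algorithm avoids.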
Speedup is possible in both steps (sliding and comparing).
• Sliding step speedup, by pre-processing the pattern
– Knuth-Morris-Pratt algorithm
– Boyer-Moore algorithm
– Ukkonen’s algorithm to construct suffix trees
• Comparison step speedup
– Rabin-Karp algorithm
Quick History
The Intuition
• Combine the key features of KMP and the Rabin-Karp algorithms to achieve an online algorithm that uses less space.
The Idea
• When Rabin-Karp’s algorithm is done with the i’th character, and advances to the next position in the text, it does not use any of the information gathered.
• The KMP algorithm, on the other hand, puts that information to good use.
Definitions - Fingerprints
String S → Fingerprint → ф(S)
Polynomial Fingerprint
ф_{r,p}(S) = s_1 r + s_2 r^2 + … + s_l r^l mod p, where p ∈ Θ(n^4), r ∈ F_p
False Positives
If S1 ≠ S2, then the probability that ф_{r,p}(S1) = ф_{r,p}(S2) is less than 1/n^3
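One way to see where the 1/n^3 bound comes from is sketched below. It assumes that S1 and S2 have the same length l ≤ n, that r is drawn uniformly from F_p, and that p > n^4. Here \phi stands for the slides' ф, and a_i, b_i are the characters of S1 and S2.

```latex
\phi_{r,p}(S_1) - \phi_{r,p}(S_2) \equiv \sum_{i=1}^{l} (a_i - b_i)\, r^{i} \pmod{p}
% If S_1 \neq S_2, this is a nonzero polynomial in r of degree at most n,
% so it has at most n roots in F_p:
\Pr_{r \in F_p}\bigl[\phi_{r,p}(S_1) = \phi_{r,p}(S_2)\bigr] \le \frac{n}{p} < \frac{n}{n^{4}} = \frac{1}{n^{3}}
```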
Sliding Fingerprint
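A minimal sketch of a sliding polynomial fingerprint: characters are appended on the right, and a prefix whose fingerprint is already known can be dropped on the left. The class and method names, the fixed prime P = 2^61 − 1, and the fixed base R are assumptions made for illustration. In the algorithm itself r is chosen at random and p ∈ Θ(n^4).

```python
P = (1 << 61) - 1   # assumed prime modulus (a Mersenne prime), for illustration only
R = 1_000_003       # assumed fixed base; the real algorithm picks r at random


def fingerprint(s, r=R, p=P):
    """ф(S) = s_1*r + s_2*r^2 + ... + s_l*r^l mod p."""
    value, r_pow = 0, 1
    for ch in s:
        r_pow = r_pow * r % p
        value = (value + ord(ch) * r_pow) % p
    return value


class SlidingFingerprint:
    def __init__(self, r=R, p=P):
        self.r, self.p = r, p
        self.r_inv = pow(r, p - 2, p)   # inverse of r; needs p prime and r != 0 mod p
        self.value = 0                  # fingerprint of the current window
        self.length = 0                 # number of characters in the window
        self.r_pow = 1                  # r^length mod p

    def extend(self, ch):
        """Append one character on the right."""
        self.r_pow = self.r_pow * self.r % self.p
        self.value = (self.value + ord(ch) * self.r_pow) % self.p
        self.length += 1

    def drop_prefix(self, prefix_value, prefix_len):
        """Remove a prefix whose fingerprint and length are known."""
        shift = pow(self.r_inv, prefix_len, self.p)   # r^(-prefix_len) mod p
        self.value = (self.value - prefix_value) * shift % self.p
        self.r_pow = self.r_pow * shift % self.p
        self.length -= prefix_len
```

Dropping a prefix relies on ф(AB) = ф(A) + r^|A| · ф(B) mod p. Only the fingerprint and the length of the prefix are needed, not its characters.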
Definitions - period_{P_l}
• Period – A prefix S_p = s_1, s_2, …, s_l of a string S of length n is defined to be a period of S iff s_i = s_{i+l} for all 1 ≤ i ≤ n − l.
• period_{P_l} – For a pattern P = p_1, p_2, …, p_m, let P_l = p_1, p_2, …, p_l be its prefix of length l, 0 ≤ l ≤ m. The shortest period of P_l is period_{P_l}.
• If P_l matches the text at a given index i, then no other match of P_l can start strictly between i and i + |period_{P_l}|.
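The length of the shortest period can be computed offline with the KMP failure function, since the shortest period has length n minus the longest proper border. The sketch below assumes the whole string is available in memory, which is fine for preprocessing. The function name shortest_period is a placeholder.

```python
def shortest_period(s):
    """Length of the shortest period of s, via the KMP failure function."""
    n = len(s)
    fail = [0] * (n + 1)   # fail[i] = length of the longest proper border of s[:i]
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = fail[k]    # fall back to the next shorter border
        if s[i] == s[k]:
            k += 1
        fail[i + 1] = k
    return n - fail[n]


print(shortest_period("abcabcab"))
```

The period itself is the prefix s[:shortest_period(s)].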
Put the information to good use
The Idea
• A match at index i tells us exactly what the last m characters were, so there is no point in saving them.
• Preprocessing phase
– Calculate a sliding fingerprint on the pattern, ф_P, and on its shortest period, ф_{period_P}
• Online phase
– Slide the fingerprint ф over the entire text.
– While ф = ф_P, slide ф by |period_P| characters
– If this stops before the end of the text is reached, abort
False positives?
• We slide over |period_{P_l}| positions that could be a match.
• Very low probability of false positives.
• The text and pattern should satisfy stringent restrictions.
Go for subpatterns
• log m subpatterns
[Figure: the pattern p_1, p_2, p_3, …, p_{m-2}, p_{m-1}, p_m and its subpatterns P_1, P_2, P_4, …, P_{m/2}, of lengths 1, 2, 4, …, m/2]
• Starting point – Find a position in which the smallest subpattern matches the text. Smallest subpattern is of length 1 – this can be easily found.
Algorithm
• Guidelines –
– Find a position where P_i is a match, then try to match P_{i+1} from the same starting point as P_i.
– If P_{i+1} does not match, use the information that P_i is a match.
– Check in jumps of |period_{P_i}| until there is no overlap with the area where P_i matches.
PROCESS
1. Initialize an empty sliding fingerprint ф.
2. For each character that arrives:
– Extend ф to include the new character
– If |ф| = 2^i and ф ≠ ф_i for some 0 ≤ i ≤ log m:
• If ф overlaps the last match by at least |period_{P_{i-1}}| characters, slide ф by |period_{P_{i-1}}| characters.
• Else, abort.
What if there is a match that starts in the substring of the 1st process and ends in the substring of the 2nd process?
Exact_PM final Algorithm
Introduce Checkpoint
Checkpoint – Start a new process at the last checkpoint of each process
Algorithm
• Preprocessing –
– Initialize an empty sliding fingerprint ф.
– For each 0 ≤ i ≤ log m, calculate the sliding fingerprints
• ф_i of P_i, and
• ф_{i,period} of the period of P_i
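A minimal offline sketch of this preprocessing step, assuming P_i is the prefix of the pattern of length 2^i. The prime P, the base R, the helper names, and the table layout are all assumptions made for illustration. Only prefix lengths 2^i ≤ m are covered.

```python
P = (1 << 61) - 1   # assumed prime modulus, for illustration only
R = 1_000_003       # assumed fixed base; the real algorithm picks r at random


def fingerprint(s, r=R, p=P):
    """ф(S) = s_1*r + s_2*r^2 + ... + s_l*r^l mod p."""
    value, r_pow = 0, 1
    for ch in s:
        r_pow = r_pow * r % p
        value = (value + ord(ch) * r_pow) % p
    return value


def shortest_period(s):
    """Length of the shortest period of s (n minus the longest proper border)."""
    n = len(s)
    fail = [0] * (n + 1)
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = fail[k]
        if s[i] == s[k]:
            k += 1
        fail[i + 1] = k
    return n - fail[n]


def preprocess(pattern):
    """One entry per subpattern P_i: its length, ф_i, |period|, and ф_{i,period}."""
    table = []
    i = 0
    while (1 << i) <= len(pattern):
        prefix = pattern[:1 << i]              # P_i, assumed to be the prefix of length 2^i
        period_len = shortest_period(prefix)
        table.append({
            "length": 1 << i,
            "fp": fingerprint(prefix),                      # ф_i
            "period_len": period_len,
            "period_fp": fingerprint(prefix[:period_len]),  # ф_{i,period}
        })
        i += 1
    return table
```

The table has O(log m) entries of constant size, which matches the O(log m) space that the online phase keeps from preprocessing.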
Final Algorithm – Online Phase
• Online Phase –
– Start a new process
– Send every character that arrives to all the processes
– If some process aborts, start a new process
– If some process A reaches a checkpoint:
• Stop the ‘son process’ of A (if it has one)
• Start a new ‘son process’ of A
Complexity
• Space –
– All fingerprints from preprocessing use O(log m) space.
– Each process saves another fingerprint, and there can be at most log m processes in parallel
– OVERALL usage – O(log m) space
• Time –
– Each process spends O(1) time for each new character that arrives
– At any time there are at most 3 log m processes running (1. process A, 2. son-process of A, 3. grandson-process of A; A has to die when the great-grandson of A is created)
– OVERALL running time – O(log m) per character
Pattern Matching (1-Mismatch)
• Partition the pattern and the text
• We need to align every partition P_{q_i,j} of the pattern to q_i text shifts
Intuition
• For each P_{q_i,j}, run q_i processes of Exact_PM.
• Process_{q_i,j,σ} – the σ’th process of the subpattern P_{q_i,j}, for 0 ≤ σ < q_i. It tries to match P_{q_i,j} to the text by treating the text as if it starts from the σ’th character (τ mod q_i = j – σ).
• If, for all q_i,
– numOfNotMatch_{q_i,σ} = 0: ‘match’
– numOfNotMatch_{q_i,σ} = 1: ‘exactly 1-mismatch’
– Otherwise: ‘more than 1-mismatch’
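An offline sketch of the classification rule above, for a single alignment. It assumes P_{q,j} consists of the pattern characters at positions ≡ j (mod q), which the slides do not spell out. It compares the partitions directly instead of running Exact_PM processes. It also picks the smallest primes whose product exceeds m, which is one sufficient choice and not necessarily the one used in the paper. All function names are placeholders.

```python
def primes_with_product_above(m):
    """Smallest primes whose product exceeds m (an assumed, sufficient choice)."""
    primes, product, candidate = [], 1, 2
    while product <= m:
        if all(candidate % q for q in primes):   # no smaller prime divides it
            primes.append(candidate)
            product *= candidate
        candidate += 1
    return primes


def classify_alignment(window, pattern, primes):
    """Classify one alignment of pattern against an equal-length text window."""
    assert len(window) == len(pattern)
    worst = 0
    for q in primes:
        # numOfNotMatch for this q: partitions j whose subsequences differ
        not_match = sum(1 for j in range(q) if pattern[j::q] != window[j::q])
        worst = max(worst, not_match)
    if worst == 0:
        return "match"
    if worst == 1:
        return "exactly 1-mismatch"
    return "more than 1-mismatch"


primes = primes_with_product_above(8)
print(classify_alignment("aXcdefgY", "abcdefgh", primes))
```

Two mismatches at different positions can fall into the same partition for some q, but not for every q once the product of the primes exceeds m. This is why several primes are needed.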
Complexity
• FACTS –
– Run ∑_{i=1}^{l} q_i^2 processes of Exact_PM
– There exists a constant c such that for any x there exist (x / log m) prime numbers between x and cx
– We have q_1, q_2, …, q_l groups of partitions. Each q_i is a prime number
• Space – O(log^4 m / log log m)
• Time – O(log^3 m / log log m)
Pattern Matching (k-Errors)
• Preprocessing Phase – Initialize a process Process_{q_i,j,σ} of 1-mismatch for each q_i ∈ {q_1, q_2, …, q_l}, 0 ≤ j < q_i and 0 ≤ σ < q_i
• Online Phase – Send the τ’th character to each Process_{q_i,j,σ} such that τ mod q_i = j – σ
• d = all mismatches from all processes that return ‘exactly 1-mismatch’
– d > k: more than k mismatches
Complexity
• Space –
– Run ∑_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m) processes of 1-mismatch in parallel.
– Each process requires log^4 m space.
– OVERALL – O(k^3 poly(log m))
• Time –
– The number of processes of the 1-mismatch algorithm is bounded by ∑_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m)
– Running time for each character – O(log^3 m)
– OVERALL – O(k^2 poly(log m))
Concluding Discussion
• The Two-Dimensional String-Matching Problem
• The String-Matching Problem with Wild Characters – Example: pattern P = {abc#abc#} is found in texts T1 = {abcdcadbaccabc}, T2 = {abcabc}
• String matching with weighted mismatch