Exact and Approximate Pattern in the Streaming Model Presented by - Tanushree Mitra Benny Porat and Ely Porat 2009 FOCS


Page 1: Exact and Approximate Pattern in the Streaming Model

Exact and Approximate Pattern in the Streaming Model

Presented by - Tanushree Mitra

Benny Porat and Ely Porat 2009 FOCS


Problem Statement

• Find all instances of pattern P of length m, as a contiguous substring in a text string T, of length n, where m < n.


Contributions

• Exact pattern matching - A fully online randomized algorithm for the classical pattern matching problem
  – Time complexity - O(log m) per character that arrives
  – Space complexity - O(log m), breaking the O(m) barrier that held for this problem for a long time

• Approximate pattern matching - An algorithm for the pattern matching with k mismatches problem
  – Time complexity - O(k^2 poly(log m)) per character
  – Space complexity - O(k^3 poly(log m))


Applications

• Monitoring Internet traffic

• Computational Biology

• Large Scale web searching

• Viruses and Malware detection

• Automatic Stock market analysis

• Robotics


Background

Brute Force Algorithm
  – Slide the pattern along the text
  – Compare it to the corresponding portion of the text

Time complexity – O(mn)

A speedup is possible in each of these 2 steps.

• Sliding step speedup by pre-processing the pattern
  – Knuth-Morris-Pratt algorithm
  – Boyer-Moore algorithm
  – Ukkonen’s algorithm to construct suffix trees

• Comparison step speedup
  – Rabin-Karp algorithm
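The comparison-step speedup can be illustrated with a small Rabin-Karp sketch. It is not taken from the slides: the function name `rabin_karp`, the base `r = 256` and the modulus `p = 1_000_000_007` are assumptions chosen for the example. It is an offline version that keeps the whole text in memory and re-checks every candidate to rule out hash collisions.

```python
def rabin_karp(text: str, pattern: str, r: int = 256, p: int = 1_000_000_007) -> list[int]:
    """Return all start indices where pattern occurs in text (0-based)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    high = pow(r, m - 1, p)  # weight of the character that leaves the window
    hp = ht = 0
    for i in range(m):
        hp = (hp * r + ord(pattern[i])) % p  # hash of the pattern
        ht = (ht * r + ord(text[i])) % p     # hash of the first text window

    matches = []
    for i in range(n - m + 1):
        # Compare hashes first; verify characters only when the hashes agree.
        if hp == ht and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            ht = ((ht - ord(text[i]) * high) * r + ord(text[i + m])) % p
    return matches
```

For example, `rabin_karp("abracadabra", "abra")` reports the occurrences at indices 0 and 7. Each window is compared in O(1) expected time instead of O(m), but the sketch still stores the last m characters, which is the O(m) space cost the streaming algorithm aims to avoid.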


Quick History


The Intuition

• Combine the key features of KMP and the Rabin-Karp algorithms to achieve an online algorithm that uses less space.

The Idea

• When Rabin-Karp’s algorithm is done with the i’th character, and advances to the next position in the text, it does not use any of the information gathered.

• The KMP algorithm, on the other hand, puts that information to good use.


Definitions - Fingerprints

Fingerprint: maps a string S to a short value ф(S)

Polynomial Fingerprint

ф_{r,p}(S) = s_1·r + s_2·r^2 + … + s_l·r^l mod p, where S = s_1, s_2, …, s_l, p ∈ Θ(n^4) and r ∈ F_p

False Positives

If S1 ≠ S2, then the probability that ф_{r,p}(S1) = ф_{r,p}(S2) is < 1/n^3

Sliding Fingerprint
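A minimal sketch of how such a polynomial fingerprint can be slid along a text, assuming the formula s_1·r + s_2·r^2 + … + s_l·r^l mod p from this slide. The constants `P` and `R` and the helper names are hypothetical; the algorithm draws r at random, whereas the sketch fixes it so the example is reproducible.

```python
P = 2_147_483_647  # assumption: the prime 2^31 - 1 stands in for a prime of size Θ(n^4)
R = 1_000_003      # assumption: fixed here; the algorithm picks r at random from F_p


def fingerprint(s: str, r: int = R, p: int = P) -> int:
    """s_1*r + s_2*r^2 + ... + s_l*r^l mod p, with characters taken as code points."""
    return sum(ord(c) * pow(r, i, p) for i, c in enumerate(s, start=1)) % p


def extend(fp: int, length: int, c: str, r: int = R, p: int = P) -> int:
    """Fingerprint of S + c, given the fingerprint and length of S."""
    return (fp + ord(c) * pow(r, length + 1, p)) % p


def drop_first(fp: int, first: str, r: int = R, p: int = P) -> int:
    """Fingerprint of S without its first character, given the fingerprint of S."""
    r_inv = pow(r, -1, p)  # modular inverse (Python 3.8+)
    return ((fp - ord(first) * r) * r_inv) % p
```

Both updates need only the current fingerprint, its length and one character, so a window can be moved over the text without recomputing the sum from scratch.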


Definitions - period_{P_l}

• Period - A prefix S_p = s_1, s_2, …, s_l of a string S = s_1, s_2, …, s_n is defined to be a period of S iff s_i = s_{i+l} for 1 ≤ i ≤ n - l

• period_{P_l} - For a pattern P = p_1, p_2, …, p_m, the prefix of length l is P_l = p_1, p_2, …, p_l, 0 ≤ l ≤ m. The shortest period of P_l is period_{P_l}

• If P_l matches the text at a given index i, then no other match of P_l can start strictly between i and i + |period_{P_l}|

Put the information to good use
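The shortest period can be computed with the KMP failure function: its length is the string length minus the length of the longest proper border. The sketch below is an illustration under that standard fact, not code from the slides; `shortest_period` and `occurrences` are hypothetical helper names.

```python
def shortest_period(s: str) -> int:
    """Length of the shortest period of s, via the KMP failure function."""
    n = len(s)
    if n == 0:
        return 0
    fail = [0] * n  # fail[i] = length of the longest proper border of s[:i + 1]
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return n - fail[-1]


def occurrences(text: str, pattern: str) -> list[int]:
    """Brute-force list of all start indices of pattern in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]
```

For instance, the pattern "abab" has shortest period 2, and its occurrences in "abababab" start at 0, 2 and 4: consecutive matches are never closer than the period length.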


The Idea

• A match at the i’th index tells us what the last m characters were, so is there any point in saving them?

• Preprocessing phase
  – Calculate the sliding fingerprint ф_P of the pattern and ф_{period_P} of its shortest period

• Online phase
  – Slide the fingerprint ф over the entire text.
  – While ф = ф_P, slide ф by |period_P| characters
  – If we do not reach the end of the text, abort

False positives? We slide only over positions |period_P| apart that could be a match, so the probability of a false positive is very low.

Text and pattern should satisfy stringent restrictions.


Go for subpatterns

• log m subpatterns

[Figure: the pattern p_1, p_2, p_3, …, p_{m-3}, p_{m-2}, p_{m-1}, p_m shown with the segments p_m; p_{m-2}, p_{m-1}; p_{m-6}, p_{m-5}, p_{m-4}, p_{m-3}; …; p_1, p_2, p_3, …, p_{m/2}, and the labels P_1, P_2, P_4, …, P_{m/2}]

• Starting point – Find a position in which the smallest subpattern matches the text. The smallest subpattern has length 1, so it can be found easily.
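A small sketch of how a list of about log m subpatterns could be generated. It assumes, based on the algorithm slide (|ф| = 2^i, and P_{i+1} matched from the same starting point as P_i), that subpattern i is the prefix of length 2^i, with the full pattern added as the last level; the slides do not spell this out, and the helper names are hypothetical.

```python
def prefix_lengths(m: int) -> list[int]:
    """Lengths 1, 2, 4, ... below m, followed by m itself."""
    lengths = []
    length = 1
    while length < m:
        lengths.append(length)
        length *= 2
    lengths.append(m)  # assumption: the whole pattern is the final level
    return lengths


def subpatterns(pattern: str) -> list[str]:
    """Prefixes of the pattern at the lengths given by prefix_lengths."""
    return [pattern[:length] for length in prefix_lengths(len(pattern))]
```

A pattern of length 8 yields prefixes of lengths 1, 2, 4 and 8, so only a logarithmic number of fingerprints has to be stored.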


Algorithm

• Guidelines
  – Find a position where P_i is a match, then try to match P_{i+1} from the same starting point as P_i
  – If P_{i+1} does not match, use the information that P_i is a match.
  – Check in jumps of |period_{P_i}| until there is no overlap with the area where P_i matches.

PROCESS
1. Initialize an empty sliding fingerprint ф.
2. For each character that arrives:
   – Extend ф to include the new character
   – If |ф| = 2^i and ф = ф_i for some 0 ≤ i ≤ log m.
     • If ф overlaps the last match by at least |period_{P_{i-1}}| characters, slide ф by |period_{P_{i-1}}| characters.
     • Else, abort.

What if there is a match that starts in the substring of the 1st process and ends in the substring of the 2nd process?


Exact_PM final Algorithm

Introduce Checkpoint

Checkpoint - Start a new process at the last checkpoint of each process

Algorithm

• Preprocessing
  – Initialize an empty sliding fingerprint ф.
  – For each 0 ≤ i ≤ log m calculate the sliding fingerprint
    – ф_i of P_i and
    – ф_{i,period} of the period of P_i


Final Algorithm – Online Phase

• Online Phase
  – Start a new process
  – For each character that arrives, send it to all the processes
  – If some process aborts, start a new process
  – If some process A reaches a checkpoint
    • Stop the ‘son process’ of A (if it has one)
    • Start a new ‘son process’ of A


Complexity

• Space
  – All fingerprints from preprocessing use O(log m) space.
  – Each process saves another fingerprint, and there can be at most log m processes in parallel
  – OVERALL usage – O(log m) space

• Time
  – Each process spends O(1) time for each new character that arrives
  – At any time there are at most 3 log m processes running (1. process A, 2. son-process of A, 3. grandson-process of A; A has to die when the great-grandson of A is created)
  – OVERALL running time – O(log m) per character


Pattern Matching (1-Mismatch)

• Partition the pattern and the text

• We need to align every partition P_{q_i,j} of the pattern to q_i text shifts


Intuition

• For each P_{q_i,j}, run q_i processes of Exact_PM.

• Process_{q_i,j,σ} - the σ’th process of the subpattern P_{q_i,j}, for 0 ≤ σ < q_i. It tries to match P_{q_i,j} to the text by considering the text as if it starts from the σ’th character. (τ mod q_i = j – σ)

• If, for all q_i,
  – numOfNotMatch_{q_i,σ} = 0: ‘match’
  – numOfNotMatch_{q_i,σ} = 1: ‘exactly 1-mismatch’
  – Otherwise: ‘more than 1-mismatch’
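The classification rule can be checked offline on a single aligned window. The sketch below is an illustration only, not the streaming algorithm: it compares characters directly where the algorithm would run Exact_PM processes on fingerprints. It assumes that P_{q,j} consists of the pattern positions congruent to j mod q, and that the product of the chosen primes is at least m, so that any two distinct positions fall into different residue classes for at least one prime. The function names are hypothetical.

```python
def mismatching_classes(window: str, pattern: str, q: int) -> int:
    """Number of residue classes j mod q whose subpattern differs from the window."""
    return sum(1 for j in range(q) if window[j::q] != pattern[j::q])


def classify(window: str, pattern: str, primes: list[int]) -> str:
    """Classify an aligned window using one mismatch count per prime."""
    if len(window) != len(pattern):
        raise ValueError("window and pattern must have the same length")
    counts = [mismatching_classes(window, pattern, q) for q in primes]
    if all(c == 0 for c in counts):
        return "match"
    if all(c == 1 for c in counts):
        return "exactly 1-mismatch"
    return "more than 1-mismatch"
```

A single mismatch lands in exactly one class for every prime, while two mismatches at distinct positions are separated by at least one prime, which pushes that prime's count to 2.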


Complexity

• FACTS
  – Run ∑_{i=1}^{l} q_i^2 processes of Exact_PM
  – There exists a constant c such that for any x, there exist x / log m prime numbers between x and cx
  – We have q_1, q_2, …, q_l groups of partitions. Each q_i is a prime number

• Space - O(log^4 m / log log m)

• Time - O(log^3 m / log log m)
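To get a feel for the process count ∑ q_i^2, the sketch below picks primes and sums their squares. The slides do not say how the primes are chosen, so the selection rule here is an assumption: take consecutive primes starting at about log2 m until their product reaches m. The names `is_prime` and `choose_primes` are hypothetical.

```python
import math


def is_prime(x: int) -> bool:
    """Trial-division primality test, adequate for the small values used here."""
    if x < 2:
        return False
    return all(x % d for d in range(2, math.isqrt(x) + 1))


def choose_primes(m: int) -> list[int]:
    """Consecutive primes from about log2(m) upward whose product is at least m.

    Assumption: this selection rule is for illustration and is not from the slides.
    """
    q = max(2, math.ceil(math.log2(m)))
    primes, product = [], 1
    while product < m:
        if is_prime(q):
            primes.append(q)
            product *= q
        q += 1
    return primes
```

For m = 1024 this rule selects 11, 13 and 17, giving 11^2 + 13^2 + 17^2 = 579 Exact_PM processes under the stated assumption.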


Pattern Matching (k-Errors)

• Preprocessing Phase – Initialize a process Process_{q_i,j,σ} of 1-mismatch for each q_i ∈ {q_1, q_2, …, q_l}, 0 ≤ j < q_i and 0 ≤ σ < q_i

• Online Phase – Send the τ’th character to each Process_{q_i,j,σ} such that τ mod q_i = j – σ

• d = all mismatches from all processes that return ‘exactly 1-mismatch’
  – If d > k: more than k mismatches


Complexity

• Space
  – Run ∑_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m) processes of 1-mismatch in parallel.
  – Each process requires log^4 m space.
  – OVERALL - O(k^3 poly(log m))

• Time
  – The number of processes of the 1-mismatch algorithm is bounded by ∑_{i=1}^{k log m} q_i^2 ∈ O(k^3 log^4 m / log log m)
  – Running time for each character - O(log^3 m)
  – OVERALL - O(k^2 poly(log m))


Concluding Discussion

• The Two-Dimensional String-Matching Problem

• The String-Matching Problem with Wild Characters – Example: pattern P = {abc#abc#} is found in texts T1 = {abcdcadbaccabc}, T2 = {abcabc}

• String matching with weighted mismatch