Approximate Substring Matching over Uncertain Strings Tingjian Ge Zheng Li University of Kentucky


Page 1: Approximate Substring Matching over Uncertain Strings

Approximate Substring Matching over Uncertain Strings

Tingjian Ge, Zheng Li

University of Kentucky

Page 2: Approximate Substring Matching over Uncertain Strings

Motivation

• As a consequence of the burgeoning growth of data and the cost/technology constraints against producing completely clean data, there is much uncertainty in the data itself.

Computational biology

Signal processing

Text retrieval

Page 3: Approximate Substring Matching over Uncertain Strings

Motivation

• The amount of text data is increasing at an unprecedented rate. Managing the sheer amount of (often noisy) text data has become more challenging than ever.

• Approximate substring matching has many applications.
  – The deterministic case is well studied.
  – But approximate pattern matching over uncertain texts has remained a largely unexplored problem.

Page 4: Approximate Substring Matching over Uncertain Strings

Example : Pattern Matching in DNA Sequence

• Suppose a certain DNA pattern, say AAATTT, and its variations (with small edit distance) are known to be the cause of a low-functioning or nonfunctioning protein. We therefore want to do approximate pattern matching in DNA sequences.

• The challenges are:
  – A single DNA sequence can be a few million to a few hundred million characters long.
  – It has uncertainty due to a number of factors in high-throughput sequencing technologies.

Page 5: Approximate Substring Matching over Uncertain Strings

Example : Holter Monitor Application

• For each heartbeat, the annotation software gives a symbol such as N (Normal beat), L (Left bundle branch block beat), etc. Quite often, the ECG signal of each beat may have ambiguity.

• A doctor might be interested in locating a pattern such as “NNAV”, in order to verify a specific diagnosis.

Page 6: Approximate Substring Matching over Uncertain Strings

Outline

• Motivation
• Preliminaries
• A new semantics
• Multilevel filtering index
• Two verification algorithms
• Experiments

Page 7: Approximate Substring Matching over Uncertain Strings

Q-gram Index

• The most frequently used indexing method for approximate matching is the q-gram index.

• To use the q-gram index, partition the pattern into k+1 pieces, where k is the edit distance threshold. Since the number of errors is no more than k, at least one piece must have an exact match in the text.

[Figure: a q-gram index in which each q-gram (Q-gram1, Q-gram2, …) points to its position list (pos1, pos2, …); the pattern string is split into k+1 pieces.]
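The partition rule above can be sketched in a few lines of Python. This is a minimal illustration over a deterministic text, not the authors' implementation: the function names (`build_qgram_index`, `split_pattern`, `candidate_positions`) and the choice of near-equal piece lengths are assumptions made here for the example.

```python
from collections import defaultdict

def build_qgram_index(text, q):
    """Map every q-gram of `text` to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(text) - q + 1):
        index[text[pos:pos + q]].append(pos)
    return index

def split_pattern(pattern, k):
    """Split `pattern` into k+1 contiguous pieces of near-equal length.

    Assumes len(pattern) >= k + 1 so that no piece is empty.
    """
    pieces, start, n = [], 0, len(pattern)
    for i in range(k + 1):
        end = start + (n - start) // (k + 1 - i)
        pieces.append((start, pattern[start:end]))
        start = end
    return pieces

def candidate_positions(text, pattern, k):
    """Return text positions where some piece matches exactly.

    Each hit is shifted back by the piece's offset in the pattern, so the
    result approximates where a match of the whole pattern would start.
    Every candidate still has to be verified.
    """
    indexes = {}  # one q-gram index per distinct piece length
    candidates = set()
    for offset, piece in split_pattern(pattern, k):
        q = len(piece)
        if q not in indexes:
            indexes[q] = build_qgram_index(text, q)
        for pos in indexes[q].get(piece, []):
            candidates.add(max(0, pos - offset))
    return sorted(candidates)
```

With the pattern AAATTT from the earlier DNA example and k = 1, the pieces are AAA and TTT, and any text window within edit distance 1 of the pattern must contain at least one of them unchanged.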

Page 8: Approximate Substring Matching over Uncertain Strings

Verification Algorithm

• A DP algorithm that computes the edit distance:

$$d[i, j] = \min\{\, d[i, j-1] + 1,\; d[i-1, j] + 1,\; d[i-1, j-1] + c(p[i], x[j]) \,\}$$

$$c(p[i], x[j]) = \begin{cases} 0, & \text{if } p[i] = x[j] \\ 1, & \text{if } p[i] \neq x[j] \end{cases}$$

DP table for the pattern “has” (rows) and the text “this” (columns):

          t   h   i   s
      0   1   2   3   4
  h   1   1   1   2   3
  a   2   2   2   2   3
  s   3   3   3   3   2

The figure marks one insertion (Ins.) and one substitution (Sub.) on the optimal path.

The edit distance between “has” and “this” is 2 in this example. The value in each cell is the edit distance between the pattern prefix and the text prefix read so far.
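The recurrence can be checked with a short Python sketch. The function name `edit_distance_table` is chosen here for illustration; the code is the textbook dynamic program for two deterministic strings, with the pattern indexing the rows and the text indexing the columns.

```python
def edit_distance_table(p, x):
    """Return the full DP table d, where d[i][j] is the edit distance
    between the first i characters of p and the first j characters of x."""
    m, n = len(p), len(x)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        d[i][0] = i                      # i deletions
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == x[j - 1] else 1
            d[i][j] = min(d[i][j - 1] + 1,         # insertion
                          d[i - 1][j] + 1,         # deletion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d

table = edit_distance_table("has", "this")
print(table[-1][-1])  # 2
```

The bottom-right cell holds the edit distance of the two complete strings, which is 2 for “has” and “this”.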

Page 9: Approximate Substring Matching over Uncertain Strings

Outline

• Motivation
• Preliminaries
• A new semantics
• Multilevel filtering index
• Two verification algorithms
• Experiments

Page 10: Approximate Substring Matching over Uncertain Strings

(k, τ)-Matching Query

• We propose the (k, τ)-matching query, which takes a pattern string p, a set of uncertain text strings {Xi} (1 ≤ i ≤ r), and threshold parameters k and τ, and asks for all substrings X of the Xi’s such that Pr[d(p, X) ≤ k] > τ.

• Another semantics is EED (expected edit distance), which computes the expected edit distance between p and X and checks whether it is < k’.
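To make the two semantics concrete, the sketch below evaluates both by brute force over all possible worlds of a short uncertain string. It is only a reference definition, exponential in the number of uncertain characters. The representation of an uncertain string as a list of `{character: probability}` dictionaries, the assumption that characters are independent, and all function names are choices made for this illustration.

```python
from itertools import product

def edit_distance(a, b):
    """Standard edit distance, keeping only two DP rows."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def possible_worlds(X):
    """Yield (string, probability) for every possible world of a
    non-empty uncertain string X, assuming independent characters."""
    for combo in product(*(dist.items() for dist in X)):
        prob = 1.0
        for _, q in combo:
            prob *= q
        yield "".join(ch for ch, _ in combo), prob

def match_probability(p, X, k):
    """Pr[d(p, X) <= k], summed over possible worlds."""
    return sum(pr for w, pr in possible_worlds(X) if edit_distance(p, w) <= k)

def expected_edit_distance(p, X):
    """EED(p, X): the edit distance averaged over possible worlds."""
    return sum(pr * edit_distance(p, w) for w, pr in possible_worlds(X))

def is_k_tau_match(p, X, k, tau):
    """(k, tau)-matching test for one candidate substring X."""
    return match_probability(p, X, k) > tau
```

For example, with X = C, then A or G with probability 0.5 each, then T, the pattern CAT has Pr[d(p, X) ≤ 0] = 0.5 and EED(p, X) = 0.5.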

Page 11: Approximate Substring Matching over Uncertain Strings

Semantics

• (k, τ)-matching query vs. EED
  – EED first summarizes the possible worlds and then applies the threshold; hence many algorithms developed for the deterministic case are inapplicable.
  – More importantly, EED may either miss real matches or need an unduly large threshold, so that many false positives mix in.

[Figure: number of matches (log scale, 10^0 to 10^4) versus text string size (0 to 8 × 10^6) on the DNA dataset, for (k, τ)-matching and for EED.]

Page 12: Approximate Substring Matching over Uncertain Strings

Example : Semantics

Consider this approximate pattern matching problem in a DNA sequence:

Pattern p (length 20): G G A G A G T G T G A G T A T A G A A G

Substring X1: G G A G A G T G T G A G T A T A G A A G, where 9 of the characters are uncertain (the figure lists the alternative values T T T T T T T T T and A A A A A A G G G for them). X1 has an exact match, but 9 characters are uncertain. EED(p, X1) = 6.

Substring X2: A G G G G G T G T A A G T A T A G A A T, where 2 of the characters are uncertain (alternative values T T and A G). X2 has no exact match, with 5 errors and 2 uncertain characters. EED(p, X2) = 6.

To find X1 with EED, the threshold must be at least 6, which means we cannot avoid X2. With a (k, τ)-matching query, however, we can use a (2, 0)-matching query to select X1 and avoid X2 at the same time.

Page 13: Approximate Substring Matching over Uncertain Strings

Outline

• Motivation
• Preliminaries
• Semantics
• Multilevel filtering index
• Two verification algorithms
• Experiments

Page 14: Approximate Substring Matching over Uncertain Strings

Left/Right Signature of a Q-gram

• The edit distance of signatures is no more than the edit distance of original strings if we treat each two-bit value in the signature as a character when computing the edit distance.

[Figure: the left signature and the right signature of a q-gram that occupies positions pos to pos+q in text string x. The hash function maps a character to a two-bit value (for example, G to 01).]
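The property that makes signatures usable as a filter can be illustrated as follows. Any character-wise mapping can only merge characters, so the edit distance between two signatures never exceeds the edit distance between the original strings. In this sketch the hash `ord(c) % 4` is a hypothetical stand-in for the slides' two-bit hash, and reading the left context right-to-left (away from the q-gram) is an assumption chosen to match the later index example; `left_right_signatures` and its parameters are illustrative names.

```python
def char_hash(c):
    """Hypothetical two-bit hash: maps a character to a value in 0..3."""
    return ord(c) % 4

def signature(s):
    """Signature of a string: one two-bit value per character."""
    return tuple(char_hash(c) for c in s)

def edit_distance(a, b):
    """Edit distance over any two sequences (strings or tuples)."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def left_right_signatures(text, pos, q, length):
    """Signatures of up to `length` characters on each side of the q-gram
    text[pos:pos+q]. The left context is reversed so that both signatures
    are read moving away from the q-gram (an assumption of this sketch)."""
    left = text[max(0, pos - length):pos][::-1]
    right = text[pos + q:pos + q + length]
    return signature(left), signature(right)

a, b = "GATTACA", "GACTATA"
# The signature distance is a lower bound on the true distance.
assert edit_distance(signature(a), signature(b)) <= edit_distance(a, b)
```

Because the signature distance is a lower bound, a candidate whose signature distance already exceeds k can be discarded without reading the text itself.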

Page 15: Approximate Substring Matching over Uncertain Strings

Multilevel Filtering Index

[Figure: a multilevel filtering index for a q-gram, based on measuring signature distance. Each directory entry carries a 4-byte tag and points either to a short position list (for example, positions 28 and 72, or position 99) or to a next-level directory.]

The 4-byte tag in the figure contains the left/right signature of the q-gram.

Page 16: Approximate Substring Matching over Uncertain Strings

Best Matching Prefix

DP table for the pattern p = “GAP” (rows) and the text x = “GGAPP” (columns):

          G   G   A   P   P
      0   1   2   3   4   5
  G   1   0   1   2   3   4
  A   2   1   1   1   2   3
  P   3   2   2   2   1   2

“1” is the smallest value in the last row. Hence, 1 is x’s prefix distance from p, and GGAP is the best matching prefix of x for p.
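The prefix distance is the minimum of the last DP row, which the following sketch computes for deterministic strings. The function name `best_matching_prefix` and the tie-breaking rule (the shortest prefix wins) are assumptions made for this illustration.

```python
def best_matching_prefix(p, x):
    """Return (prefix distance, best matching prefix of x for p).

    The prefix distance is the smallest edit distance between p and any
    prefix of x, i.e. the minimum over the last row of the DP table.
    On ties the shortest prefix is returned (a choice made here).
    """
    prev = list(range(len(x) + 1))
    for i in range(1, len(p) + 1):
        cur = [i]
        for j in range(1, len(x) + 1):
            cost = 0 if p[i - 1] == x[j - 1] else 1
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = cur
    best_len = min(range(len(prev)), key=lambda j: prev[j])
    return prev[best_len], x[:best_len]

print(best_matching_prefix("GAP", "GGAPP"))  # (1, 'GGAP')
```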

Page 17: Approximate Substring Matching over Uncertain Strings

Dynamic Programming Verification Using Signatures

• The values on each diagonal of the DP table form a non-decreasing sequence, as shown in the figure.

• Let $d_l$ / $d_r$ be the prefix distance of $p_l$ / $p_r$ from $x_l$ / $x_r$. The verification requires:

$$d_l + d_r \le k$$

• We need to do this mapping to accommodate the fixed length of signatures in the index tag.

[Figure: DP table over signature characters, with text positions x’[1], x’[l], x’[m’−k], x’[m’], x’[m’+k] and pattern positions p’[1], p’[l−k], p’[l], p’[l+k], p’[m’].]
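The diagonal property in the first bullet is a general fact about the edit distance DP table and can be checked directly. The sketch below builds the table for two deterministic strings and tests every diagonal step; the function names are illustrative.

```python
def edit_distance_table(p, x):
    """Full edit distance DP table; rows follow p, columns follow x."""
    m, n = len(p), len(x)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        d[i][0] = i
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == x[j - 1] else 1
            d[i][j] = min(d[i][j - 1] + 1, d[i - 1][j] + 1,
                          d[i - 1][j - 1] + cost)
    return d

def diagonals_non_decreasing(d):
    """True if d[i+1][j+1] >= d[i][j] for every cell with a diagonal successor."""
    return all(d[i + 1][j + 1] >= d[i][j]
               for i in range(len(d) - 1)
               for j in range(len(d[0]) - 1))

assert diagonals_non_decreasing(edit_distance_table("GAP", "GGAPP"))
```

This is why a value read from an earlier cell on a diagonal is a safe lower bound for cells further along it when the signature in the index tag is too short to fill the whole table.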

Page 18: Approximate Substring Matching over Uncertain Strings

Dynamic Programming Verification Using Signatures

• When the pattern is exhausted, we do the mapping as follows.

[Figure: DP table with pattern position p’[m’] and text positions x’[m’−k], x’[m’+k], and x’[l].]

Page 19: Approximate Substring Matching over Uncertain Strings

Example : Using the Index

• Input:
  X: … … [DC][CB][BEST][RO][DO] … …
  P: [DC][AB][BEST][NR][OB]
  k = 2

• Q-gram “BEST” matches. We then use this q-gram’s multilevel signature index for filtering.

[Figure: two DP tables between pattern and text, lDP for the part to the left of the q-gram and rDP for the part to the right.]

Page 20: Approximate Substring Matching over Uncertain Strings

Example : Using the Index

• Input:
  X: … … [DC][CB][BEST][RO][DO] … …
  P: [DC][AB][BEST][NR][OB]
  k = 2

• Use the first-level signatures in the index tag.

lDP (text signature characters as columns, pattern signature characters as rows):

          B’  C’
      0   1   2
  B’  1   0   1
  A’  2   1   1
  C’  3   2   1
  D’  4   3   2

rDP:

          R’  O’
      0   1   2
  N’  1   1   2
  R’  2   1   2
  O’  3   2   1
  B’  4   3   2

$$d_l + d_r = 1 + 1 = 2 \le k$$

We need to expand to the next level of the index.

Page 21: Approximate Substring Matching over Uncertain Strings

Example : Using the Index

• Input:
  X: … … [DC][CB][BEST][RO][DO] … …
  P: [DC][AB][BEST][NR][OB]
  k = 2

• Use the second-level signatures in the index tag. The pattern is exhausted in this case.

lDP (text characters as columns, pattern characters as rows):

          B   C   C   D
      0   1   2   3   4
  B   1   0   1   2   3
  A   2   1   1   2   3
  C   3   2   1   1   2
  D   4   3   2   2   1

rDP:

          R   O   D   O
      0   1   2   3   4
  N   1   1   2   3   4
  R   2   1   2   3   4
  O   3   2   1   2   3
  B   4   3   2   2   3

$$d_l + d_r = 1 + 2 > k$$

This candidate position is filtered out.
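The second-level check in this example can be reproduced with a short sketch. It treats the signature characters as plain characters and only covers the case where the pattern is exhausted, so each prefix distance is the minimum of the last DP row; the first-level mapping for truncated signatures is not modeled. The function names `prefix_distance` and `passes_filter` are assumptions of this illustration.

```python
def prefix_distance(p, x):
    """Smallest edit distance between p and any prefix of x
    (minimum over the last row of the DP table)."""
    prev = list(range(len(x) + 1))
    for i in range(1, len(p) + 1):
        cur = [i]
        for j in range(1, len(x) + 1):
            cost = 0 if p[i - 1] == x[j - 1] else 1
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = cur
    return min(prev)

def passes_filter(p_left, x_left, p_right, x_right, k):
    """Keep a candidate only if d_l + d_r <= k."""
    return prefix_distance(p_left, x_left) + prefix_distance(p_right, x_right) <= k

# Left parts are read right-to-left, away from the matched q-gram "BEST".
d_l = prefix_distance("BACD", "BCCD")   # pattern [DC][AB] vs. text [DC][CB]
d_r = prefix_distance("NROB", "RODO")   # pattern [NR][OB] vs. text [RO][DO]
print(d_l, d_r, passes_filter("BACD", "BCCD", "NROB", "RODO", k=2))  # 1 2 False
```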

Page 22: Approximate Substring Matching over Uncertain Strings

Outline

• Motivation
• Preliminaries
• Semantics
• Multilevel filtering index
• Two verification algorithms
  – Bounds based on CDF
  – Bounds based on local perturbation
• Experiments

Page 23: Approximate Substring Matching over Uncertain Strings

Verification Algorithms

• The goal of verification is to conclude whether a candidate position selected by an index is a true match.

• We present two algorithms, each of which gives an upper and a lower bound of the probability that d(p, x) ≤ k.

Page 24: Approximate Substring Matching over Uncertain Strings

Outline

• Motivation
• Preliminaries
• Semantics
• Multilevel filtering index
• Two verification algorithms
  – Bounds based on CDF
  – Bounds based on local perturbation
• Experiments

Page 25: Approximate Substring Matching over Uncertain Strings

Bounds Based on Cumulative Distribution Functions (CDF)

• The basic verification consists of two symmetric runs of a DP algorithm. We describe how we change such a DP algorithm to accommodate uncertain characters.

• Our key idea is to compute (at most) k+1 pairs of values in each cell, i.e.,

$$\{(F_l[j], F_u[j]) \mid 0 \le j \le k\}, \qquad F_l[j] \le \Pr[D \le j] \le F_u[j],$$

where D denotes the edit distance.

Page 26: Approximate Substring Matching over Uncertain Strings

A Basic Step

[Figure: cell D with its three neighbors D1 (upper left), D2 (above), and D3 (left).]

Consider a basic step: how do we get D from its 3 neighbors?

$$p_1 = \Pr[C = c], \qquad p_2 = 1 - p_1$$

Here $p_1$ is the probability of a match at cell D.

$$F_l[j] = p_1 F_l^{(1)}[j] + p_2 F_l^{(\arg\min_i D_i)}[j-1]$$

For the lower bound, pick one fixed neighbor cell to use.

$$F_u[j] = p_1 F_u^{(1)}[j] + p_2 \min\Big(\sum_{i=1}^{3} F_u^{(i)}[j-1],\; 1\Big)$$

For the upper bound, use the union of the upper bounds from the three neighbors.

$\arg\min_i D_i$ returns the index value i that minimizes $D_i$; the minimizer is defined as the $D_i$ (1 ≤ i ≤ 3) that has the greatest $F_l^{(i)}[0]$ (i.e., the one that has a small distance value with the highest probability).

Page 27: Approximate Substring Matching over Uncertain Strings

Example : Bounds Based on CDF

• Consider p = “CAT”, and let X be “C” followed by four characters, each of which has the same distribution G.1A.4T.5, denoting that it is G (A, T) with probability 0.1 (0.4, 0.5). k = 2.

• Take the cell at the 3rd row and the 4th column as an example. How do we compute Fl[j] and Fu[j]?

DP table (rows: pattern p = C, A, T; columns: text x = C, G.1A.4T.5, G.1A.4T.5, G.1A.4T.5, G.1A.4T.5). Each cell lists its pairs (Fl[j], Fu[j]); only the cells shown in the figure are listed, starting from the leftmost column:

Row C: (1, 1) | (0, 0)(1, 1) | (0, 0)(0, 0)(1, 1)
Row A: (0, 0)(1, 1) | (.4, .4)(1, 1) | (0, 0)(.64, .64)(1, 1) | (0, 0)(0, 0)(.784, .784)
Row T: (0, 0)(0, 0)(1, 1) | (0, 0)(.7, .7)(1, 1) | (.2, .2)(.7, .7)(1, 1) | (0, 0)(.42, .42)(.85, 1) | (0, 0)(0, 0)(.602, .602)

For the cell at the 3rd row and the 4th column, (0, 0)(.42, .42)(.85, 1): argmin_i Di is D3, and the upper bound is a mixture of the upper bound in D1 and the union of those in all three neighbours.
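The basic step can be turned into a small program and run on this example. The sketch below is an illustration under stated assumptions, not the authors' code: an uncertain text is a list of `{character: probability}` dictionaries, border cells hold the deterministic distances i and j, and ties in the "greatest Fl[0]" rule are broken by comparing the remaining lower bounds lexicographically (the slides do not specify a tie-break). The names `point_cdf` and `cdf_bounds_table` are hypothetical.

```python
def point_cdf(v, k):
    """(F_l, F_u) pairs for a deterministic distance v: Pr[D <= j] is 0 or 1."""
    return [(1.0, 1.0) if j >= v else (0.0, 0.0) for j in range(k + 1)]

def cdf_bounds_table(p, X, k):
    """F[i][j][t] = (lower, upper) bound on Pr[D <= t] for the cell that
    compares the first i pattern characters with the first j text characters."""
    m, n = len(p), len(X)
    F = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        F[0][j] = point_cdf(j, k)
    for i in range(1, m + 1):
        F[i][0] = point_cdf(i, k)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p1 = X[j - 1].get(p[i - 1], 0.0)   # probability of a match here
            p2 = 1.0 - p1
            d1, d2, d3 = F[i - 1][j - 1], F[i - 1][j], F[i][j - 1]
            # Fixed neighbor for the lower bound: greatest F_l[0];
            # ties broken lexicographically (an assumption of this sketch).
            best = max((d1, d2, d3), key=lambda d: [lo for lo, _ in d])
            cell = []
            for t in range(k + 1):
                lo_prev = best[t - 1][0] if t > 0 else 0.0
                hi_prev = (min(1.0, sum(d[t - 1][1] for d in (d1, d2, d3)))
                           if t > 0 else 0.0)
                cell.append((p1 * d1[t][0] + p2 * lo_prev,
                             p1 * d1[t][1] + p2 * hi_prev))
            F[i][j] = cell
    return F

u = {"G": 0.1, "A": 0.4, "T": 0.5}
F = cdf_bounds_table("CAT", [{"C": 1.0}, u, u, u, u], k=2)
print([(round(lo, 3), round(hi, 3)) for lo, hi in F[3][4]])
# [(0.0, 0.0), (0.42, 0.42), (0.85, 1.0)]
```

Here `F[3][4]` is the cell at the 3rd row and 4th column of the example, and the printed pairs agree with the values (0, 0)(.42, .42)(.85, 1) on the slide.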

Page 28: Approximate Substring Matching over Uncertain Strings

Outline

• Motivation
• Preliminaries
• Semantics
• Multilevel filtering technique based on measuring signature distance
• Two verification algorithms
  – Bounds based on CDF
  – Bounds based on local perturbation
• Experiments

Page 29: Approximate Substring Matching over Uncertain Strings

Bounds Based on Local Perturbation

• Adjacent and remote possible worlds: Given a (k, τ)-matching query on a pattern p and an uncertain text X, we say that a possible world (p.w.) w of X, denoted x(w), is adjacent to p if d(p, x(w)) ≤ k. We say that it is remote to p if d(p, x(w)) > k.

[Figure: an initial adjacent/remote p.w. over the uncertain characters G/A/T, A/T, …, A/T; perturbation yields more adjacent/remote p.w.’s.]

Page 30: Approximate Substring Matching over Uncertain Strings

How to Get an Initial Adjacent/Remote Possible World

• Get a closest p.w. (an adjacent p.w. with the smallest distance) on the optimal path in the DP table.

• Use a randomized algorithm to get a remote p.w.

Example with k = 1, p = “CAT”, and x = C G.1A.4T.5 G.1A.4T.5 G.1A.4T.5 (columns x[1] to x[4]):

         x[1] x[2] x[3] x[4]
      0   1    2    3    4
  C   1   0    1    2    3
  A   2   1    0    1    2
  T   3   2    1    0    1

[Figure: cell D with its three neighbors D1 (upper left), D2 (above), and D3 (left).]

If D1 and D contain the same distance value, the corresponding variable in the text string is called a crucial variable of this p.w.

Page 31: Approximate Substring Matching over Uncertain Strings

Perturbation – How?

[Figure: text string x, whose uncertain characters each take the value T, A, or G with probability .5, .4, or .1.]

Suppose there are c crucial variables in this closest/remote p.w.

Consider the difference between k and the edit distance between this p.w. and the pattern.

For a closest p.w.: Pr(at most that many crucial variables (of the c crucial variables in total) change their values from those on the optimal path) is a lower bound of Pr[d(p, X) ≤ k].

For a remote p.w.: we can get an upper bound similarly.
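The lower bound for a closest p.w. is a tail probability over the crucial variables, which the sketch below computes with a small dynamic program. It assumes the crucial variables are independent and that `keep_probs[i]` is the probability that crucial variable i keeps the value it has on the optimal path; the function name and both parameter names are hypothetical, and `slack` stands for the difference between k and the edit distance of the closest p.w.

```python
def prob_at_most_changes(keep_probs, slack):
    """Pr(at most `slack` of the crucial variables change their value).

    dist[c] holds the probability that exactly c of the variables
    processed so far have changed (independence is assumed).
    """
    dist = [1.0]
    for q in keep_probs:
        new = [0.0] * (len(dist) + 1)
        for changed, pr in enumerate(dist):
            new[changed] += pr * q              # variable keeps its value
            new[changed + 1] += pr * (1.0 - q)  # variable changes
        dist = new
    return sum(dist[:slack + 1])

# Two crucial variables that keep their values with probability .4 and .5.
print(round(prob_at_most_changes([0.4, 0.5], 0), 3))  # 0.2
print(round(prob_at_most_changes([0.4, 0.5], 1), 3))  # 0.7
```

The intuition is that changing one crucial variable raises the distance along the optimal path by at most 1, so a world with at most `slack` changes stays within distance k.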

Page 32: Approximate Substring Matching over Uncertain Strings

Outline

• Motivation
• Preliminaries
• Semantics
• Multilevel filtering index
• Two verification algorithms
• Experiments

Page 33: Approximate Substring Matching over Uncertain Strings

Setup of Experiment

• We examine the behavior of (k, τ)-matching queries using signature filtering and verification bounds with the following datasets.

• The DNA dataset.
  – Raw datasets of sequencing runs of Escherichia coli 536 from the NCBI SRA (Sequence Read Archive) database.
  – We use Bowtie to align the short DNA sequences with the complete Escherichia coli genome reference. The mapping reports output by Bowtie show positions within the DNA that have more than one possible value.

• The protein dataset.

• Synthetic datasets.
  – We generate a few synthetic datasets based on the two real datasets above.
  – We vary the parameter values of the data (such as the uncertainty ratio θ) or the size of the data.

Page 34: Approximate Substring Matching over Uncertain Strings

Experiments

[Figures: execution time in seconds (log scale).
  – Running time for various settings: execution time versus text string size (0 to 8 × 10^6); series: sig. with bounds, no sig. with bounds, no sig. with bounds 2, sig. without bounds, no sig. without bounds.
  – Varying θ (0.1 to 0.5).
  – Using larger synthetic data: execution time versus text string size (in GB), with signatures and without signatures.
  – Breakdown of I/O and CPU costs for 2 GB and 4 GB texts, (i) with signatures and (ii) without signatures.]

Page 35: Approximate Substring Matching over Uncertain Strings

Experiments

[Figures:
  – Number of positions to be verified versus text string size (0 to 8 × 10^6); series: no sig., sig., no sig. after bounds, sig. after bounds.
  – Varying |p| (DNA): execution time in seconds versus |p| (10 to 50).
  – Varying |p| (protein): execution time in seconds versus |p| (10 to 50).
  – Varying threshold k: execution time in seconds versus k (0 to 4).
  Series in the execution-time plots: signatures with bounds, no signatures with bounds, signatures without bounds, no signatures without bounds.]

Page 36: Approximate Substring Matching over Uncertain Strings

Conclusions and Future Work

• We study a real and unsolved problem: approximate substring matching over uncertain texts.
  – We proposed a novel semantics and demonstrated its advantages over an alternative one introduced by previous work.
  – We developed a q-gram based index to handle uncertain texts.
  – We proposed two efficient verification algorithms.

• As future work, we plan to study the matching problem under correlated uncertainty to address a wider range of applications.

Page 37: Approximate Substring Matching over Uncertain Strings

Thank You!

• Questions?