Finding approximate occurrences of a pattern that contains gaps Inbok Lee Costas S. Iliopoulos Alberto Apostolico Kunsoo Park

Page 1: Finding approximate occurrences of a pattern that contains gaps

Finding approximate occurrences of a pattern that contains gaps

Inbok Lee, Costas S. Iliopoulos, Alberto Apostolico, Kunsoo Park

Page 2: Finding approximate occurrences of a pattern that contains gaps

Contents

  The exact/approximate gapped pattern matching problem
  Previous approaches
  Our contributions

Page 3: Finding approximate occurrences of a pattern that contains gaps

Exact gapped pattern matching problem

Definition: find the occurrences in the text of a pattern that contains gaps.

Pattern P = AA *(2,3) GC *(1,3) TT

Subpatterns: P1 = AA, P2 = GC, P3 = TT

  *(2,3): any string whose length is between 2 and 3
  *(1,3): any string whose length is between 1 and 3

Page 4: Finding approximate occurrences of a pattern that contains gaps

Example – Exact matching

Pattern P = AA *(2,3) GC *(1,3) TT

Text T = GCAATTGCACTTC

Text:     G C A A T T G C A C T T C
Pattern:      A A . . G C . . T T     (each . is one gap character)
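The exact example can be checked with an ordinary regular expression. This is a minimal sketch using Python's `re` module; writing each gap `*(a,b)` as `.{a,b}` and the name `GAPPED` are assumptions made for illustration, not the method presented in these slides.

```python
import re

# Assumption: a gap *(a,b) is written as the regex '.{a,b}' (any a..b characters).
GAPPED = re.compile(r"AA.{2,3}GC.{1,3}TT")

text = "GCAATTGCACTTC"
match = GAPPED.search(text)

# match.span() gives 0-based [start, end) positions of the occurrence.
print(match.group(), match.span())
```

The occurrence starts at the AA and ends at the second TT, as in the aligned figure.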

Page 5: Finding approximate occurrences of a pattern that contains gaps

Approximate gapped pattern matching problem

Definition: find all the substrings of the text in which each subpattern Pi matches with at most ki insertions, deletions, and substitutions.

Pattern P = AA *(2,3) GC *(1,3) TT

  P1 = AA, k1 = 0
  P2 = GC, k2 = 1
  P3 = TT, k3 = 0

  *(2,3): any string whose length is between 2 and 3
  *(1,3): any string whose length is between 1 and 3

Page 6: Finding approximate occurrences of a pattern that contains gaps

Example – Approximate matching

Pattern P = AA *(2,3) GC *(1,3) TT, k1 = k3 = 0, k2 = 1

Text T = GCAATTGTACTTC

Text:     G C A A T T G T A C T T C
Pattern:      A A . . G C . . T T     (each . is one gap character)

P2 = GC is matched against GT: 1 substitution
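The error counts in this example can be verified with a standard edit distance computation. The sketch below is only a check of the example; the function name `edit_distance` and the hard-coded slice positions are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

text = "GCAATTGTACTTC"
# (subpattern, matched text segment, allowed errors k_i); positions are 0-based.
segments = [("AA", text[2:4], 0), ("GC", text[6:8], 1), ("TT", text[10:12], 0)]
for sub, found, k in segments:
    print(sub, found, edit_distance(sub, found), "allowed:", k)
```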

Page 7: Finding approximate occurrences of a pattern that contains gaps

Class of characters

Allow two or more different characters at a position of the pattern.

Pattern P = AA *(2,3) G[CT] *(1,3) TT

  P1 = AA, k1 = 0
  P2 = G[CT], k2 = 1
  P3 = TT, k3 = 0

  [CT]: C or T
  *(2,3): any string whose length is between 2 and 3
  *(1,3): any string whose length is between 1 and 3

Page 8: Finding approximate occurrences of a pattern that contains gaps

Example – Class of characters

Pattern P = AA *(2,3) G[CT] *(1,3) TT

Text T = GCAATTGTACTTC

Figure: the pattern aligned against the text. The class [CT] (C or T) is matched by the T in GT.

Page 9: Finding approximate occurrences of a pattern that contains gaps

Applications of gapped pattern matching

  Information retrieval
  Data mining
  Computational biology
    Especially, finding motifs in a sequence

Page 10: Finding approximate occurrences of a pattern that contains gaps

Motifs

Motifs: biologically important common regions

Figure: a motif shared by Sequence 1, Sequence 2, Sequence 3, and Sequence 4.

Sometimes overall sequence alignment doesn’t show the relation between biologically related sequences.

Page 11: Finding approximate occurrences of a pattern that contains gaps

PROSITE database

  Database of protein families, domains and motifs
  http://www.expasy.ch/prosite
  Motifs are represented as gapped patterns from the alphabet of 20 amino acids.

Prion protein (Creutzfeldt-Jakob disease):

  E*(1,1)[ED]*(1,1)K[LIVM][LIVM]*(1,1)[KR][LIVM][LIVM]*(1,1)[QE]MC*(2,2)QY

Ribosomal protein L1:

  [IM]*(2,2)[LIVA]*(2,3)[LIVM][GA]*(2,2)[LMS][GSNH][PTKR][KRAV]G*(1,1)[LIMF]P[DENSTKQ]
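Patterns written in the notation of these slides can be turned into ordinary regular expressions mechanically. The sketch below is an illustration only: the helper name `gapped_to_regex` is hypothetical, it handles the slide notation `*(a,b)` rather than PROSITE's own file syntax, and the amino acid string used to try it out is made up.

```python
import re

def gapped_to_regex(pattern: str) -> str:
    """Rewrite each gap '*(a,b)' as '.{a,b}'; classes like [LIVM] are already valid regex."""
    no_spaces = pattern.replace(" ", "")
    return re.sub(r"\*\((\d+),(\d+)\)", r".{\1,\2}", no_spaces)

prion = "E*(1,1)[ED]*(1,1)K[LIVM][LIVM]*(1,1)[KR][LIVM][LIVM]*(1,1)[QE]MC*(2,2)QY"
prion_re = re.compile(gapped_to_regex(prion))

# Made-up sequence that follows the pattern, for demonstration only.
toy_sequence = "GGEAEAKLLAKLLAQMCAAQYGG"
print(gapped_to_regex(prion))
print(bool(prion_re.search(toy_sequence)))
```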

Page 12: Finding approximate occurrences of a pattern that contains gaps

Finding hidden motifs

a set of sequences

How to find unknown motifs?

Page 13: Finding approximate occurrences of a pattern that contains gaps

Finding motifs in a sequence

known motif

new sequence

As biological sequences may contain errors, we should consider approximately matching occurrences.

Our topic

Page 14: Finding approximate occurrences of a pattern that contains gaps

Previous approaches

  Regular expression approaches: exact matching
  Navarro and Raffinot’s approach [RECOMB 2002]: exact and approximate matching
  Akutsu’s approach [IEICE Trans. Information and Systems 1996]: approximate matching

Page 15: Finding approximate occurrences of a pattern that contains gaps

Regular expression approach

Pattern P = AA *(2,3) GC *(1,3) TT

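Regular expression: AA**(*|ε)GC*(*|ε)(*|ε)TT, where * stands for any single character and ε for the empty string.

Figure: the automaton for this expression, a chain of states A, A, *, *, *, G, C, *, *, *, T, T in which the optional gap characters can be skipped.

Nondeterministic finite automaton (NFA) or its equivalent deterministic finite automaton (DFA)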

Too general!

Page 16: Finding approximate occurrences of a pattern that contains gaps

Navarro and Raffinot’s approach

Figure: the NFA for the pattern, with its set of active states stored as a bit-vector.

NFA is not easy to run and DFA can be large.

Simulate the NFA by the bit-parallelism technique. (All bits of a machine word can be read and written simultaneously.)
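To illustrate what bit-parallel NFA simulation means, here is a sketch of the classical Shift-And algorithm for a single gap-free string. It is not Navarro and Raffinot's gapped construction, only the basic technique it builds on; the function name `shift_and` is an assumption. Bit i of `state` is set when the first i+1 pattern characters match the text ending at the current position.

```python
def shift_and(pattern: str, text: str) -> list[int]:
    """Return 0-based start positions of exact occurrences of pattern in text."""
    # masks[ch] has bit i set when pattern[i] == ch.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)

    state = 0
    hits = []
    for pos, ch in enumerate(text):
        # One shift, one OR, one AND update every NFA state at once.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - len(pattern) + 1)
    return hits

print(shift_and("GC", "GCAATTGCACTTC"))
```

Python integers are unbounded, so the word size w does not limit the pattern length here; in a language with fixed-width words the pattern must fit in one word or be split across several.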

Page 17: Finding approximate occurrences of a pattern that contains gaps

Navarro and Raffinot’s approach

Allow k errors for the whole pattern.

Figure: one copy of the NFA per error level (0 errors, 1 error).

Works for small patterns and a small number of errors.

O(km’n / w) time algorithm (m’ is the total length of the pattern, n is the length of the text, w is the word size)

Page 18: Finding approximate occurrences of a pattern that contains gaps

Akutsu’s approach

Combination of dynamic programming and a balanced search tree. O(mn log n) time.

Figure: the dynamic programming table over the text, with row blocks for P1, *(a1, b1), P2, *(a2, b2), and P3. The tree is used to compute the smallest values needed when crossing a gap.

Page 19: Finding approximate occurrences of a pattern that contains gaps

Drawbacks of the previous approaches

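Figure: alignments of the three subpatterns in which O marks a matching position and X marks an error. In one case all three errors fall together (X X X); in the other, one error falls in each subpattern (O X O).

  k = 3 for the whole pattern?
  k1 = 1, k2 = 1, k3 = 1: more sensitive and desirable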

Page 20: Finding approximate occurrences of a pattern that contains gaps

Our contributions

  O(ln + m) time algorithm for the exact gapped pattern matching problem.
    l: number of subpatterns
    n: length of the text
    m: length of the pattern

  O(mn) time algorithm for the approximate gapped pattern matching problem.

Page 21: Finding approximate occurrences of a pattern that contains gaps

Graph Modeling

1. Create a node where a subpattern appears (exactly or approximately) in the text.
2. Link two nodes with an edge if they represent two consecutive subpatterns and satisfy the gap condition.
3. If there is a path P1 - P2 - … - Pl in the graph (l is the number of subpatterns), there is an occurrence of the pattern in the text.
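The three steps can be sketched directly for the exact case. This is an illustrative, unoptimized version: the names `find_all` and `graph_match`, the node representation (subpattern index, start position), and the naive occurrence search are all assumptions, not the authors' implementation.

```python
def find_all(sub: str, text: str) -> list[int]:
    """0-based start positions of exact occurrences of sub in text."""
    return [i for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i)]

def graph_match(subs, gaps, text):
    """Return every path as a list of start positions, one per subpattern."""
    # Step 1: one node (i, start) per occurrence of subpattern i.
    nodes = [[(i, s) for s in find_all(sub, text)] for i, sub in enumerate(subs)]
    # Step 2: edge between consecutive subpatterns when the gap length fits.
    edges = {}
    for i in range(len(subs) - 1):
        lo, hi = gaps[i]
        for _, s in nodes[i]:
            end = s + len(subs[i])
            edges[(i, s)] = [(i + 1, t) for _, t in nodes[i + 1] if lo <= t - end <= hi]

    # Step 3: depth-first search for paths from a P1 node to a node of the last subpattern.
    def dfs(node, path):
        path = path + [node[1]]
        if node[0] == len(subs) - 1:
            return [path]
        found = []
        for nxt in edges.get(node, []):
            found.extend(dfs(nxt, path))
        return found

    results = []
    for start in nodes[0]:
        results.extend(dfs(start, []))
    return results

print(graph_match(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"))
```

On the running example the graph has one node for AA, two for GC, and two for TT, and a single path joins them.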

Page 22: Finding approximate occurrences of a pattern that contains gaps

Exact matching

Text: GCAATTGCACTTC

P = AA *(2,3) GC *(1,3) TT

P1 = AA, P2 = GC, P3 = TT

Step 1. Create nodes

Figure: one node for P1, two nodes for P2, and two nodes for P3, each placed at an occurrence in the text.

Page 23: Finding approximate occurrences of a pattern that contains gaps

Exact matching

Text: GCAATTGCACTTC

P = AA *(2,3) GC *(1,3) TT

P1 = AA, P2 = GC, P3 = TT

Step 2. Connect the nodes with the edges

Figure: the nodes for P1, P2, and P3, with edges between nodes of consecutive subpatterns that satisfy the gap condition.

Page 24: Finding approximate occurrences of a pattern that contains gaps

Exact matching

Text: GCAATTGCACTTC

P = AA *(2,3) GC *(1,3) TT

P1 = AA, P2 = GC, P3 = TT

Step 3. Find the path by Depth-First Search

Figure: the path P1 - P2 - P3 found in the graph.

Page 25: Finding approximate occurrences of a pattern that contains gaps

A better idea

Text: GCAATTGCACTTC

P = AA *(2,3) GC *(1,3) TT

No need to build the graph explicitly.

Step 1. Find P1 = AA and compute the candidate range for P2.

Figure: P1 marked in the text, followed by the candidate range for P2.

Page 26: Finding approximate occurrences of a pattern that contains gaps

A better idea

Text: GCAATTGCACTTC

P = AA *(2,3) GC *(1,3) TT

Step 2. Find P2 = GC within the candidate range and compute the candidate range for P3.

Figure: P1 and P2 marked in the text, followed by the candidate range for P3.

Page 27: Finding approximate occurrences of a pattern that contains gaps

A better idea

Text: GCAATTGCACTTC

P = AA *(2,3) GC *(1,3) TT

Step 3. After finding P3 = TT within the candidate range, we have found an occurrence of P.

Figure: P1, P2, and P3 marked in the text.
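The candidate range idea can be sketched without any graph: for each subpattern, keep a boolean array saying where it is allowed to start, search only there, and derive the next array from the end positions and the gap bounds. The function name `gapped_exact` and the use of a difference array to mark ranges are assumptions for illustration. The naive `startswith` test does not reach the O(ln + m) bound claimed in the slides; a linear-time string matcher would be needed for that.

```python
def gapped_exact(subs, gaps, text):
    """Return the 0-based end positions (exclusive) of occurrences of the gapped pattern."""
    n = len(text)
    allowed = [True] * (n + 1)            # P1 may start anywhere
    for i, sub in enumerate(subs):
        ends = [s + len(sub) for s in range(n - len(sub) + 1)
                if allowed[s] and text.startswith(sub, s)]
        if i == len(subs) - 1:
            return ends
        lo, hi = gaps[i]
        # Mark each candidate range [e+lo, e+hi] in O(1), then sweep once.
        diff = [0] * (n + 2)
        for e in ends:
            if e + lo <= n:
                diff[e + lo] += 1
                diff[min(e + hi, n) + 1] -= 1
        allowed, open_ranges = [], 0
        for s in range(n + 1):
            open_ranges += diff[s]
            allowed.append(open_ranges > 0)
    return []

print(gapped_exact(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"))
```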

Page 28: Finding approximate occurrences of a pattern that contains gaps

Approximate matching

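Almost the same idea as the exact matching case. Find the approximate occurrences of the subpatterns, instead of the exact ones.

Dynamic programming table for P1 = AA (k1 = 0) over the text:

         G  C  A  A  T  T  G  C  A  C  T  T  C
      0  0  0  0  0  0  0  0  0  0  0  0  0  0
   A  1  1  1  0  0  1  1  1  1  0  1  1  1  1
   A  2  2  2  1  0  1  2  2  2  1  1  2  2  2

An entry of at most k1 = 0 in the last row marks the end of an occurrence of P1. The gap *(2,3) after it gives the candidate range for P2 (k2 = 1).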

Page 29: Finding approximate occurrences of a pattern that contains gaps

Approximate matching

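Dynamic programming table for P2 = GC (k2 = 1) over the rest of the text:

         G  C  A  C  T  T  C
      0  0  ∞  ∞  ∞
   G  1  0  1  2  3
   C  2  1  0  1  2

The first row is 0 only inside the candidate range. ∞ (infinity): no alignment can start from here.

Entries of at most k2 = 1 in the last row mark the ends of occurrences of P2. The gap *(1,3) after them gives the candidate range for P3 (k3 = 0).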

Page 30: Finding approximate occurrences of a pattern that contains gaps

Approximate matching

Dynamic programming table for P3 = TT (k3 = 0) over the rest of the text:

         T  T  C
      0  0  0  ∞
   T  1  0  0  1
   T  2  1  0  1

The 0 in the last row marks an approximate occurrence of the whole pattern.
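The tables above follow the usual approximate string matching recurrence, with the first row set to 0 inside the candidate range and to infinity elsewhere. The sketch below strings the three stages together; the names `approx_ends` and `gapped_approx` are hypothetical, and the simple loop that marks candidate ranges is an assumption that keeps the code short rather than fast.

```python
INF = float("inf")

def approx_ends(sub, k, text, can_start):
    """Return every j such that a substring ending after j text characters,
    and starting at an allowed position, matches sub with at most k edits."""
    n = len(text)
    prev = [0 if can_start[j] else INF for j in range(n + 1)]
    for ch in sub:
        cur = [prev[0] + 1]
        for j in range(1, n + 1):
            cur.append(min(prev[j] + 1,                        # skip a pattern character
                           cur[j - 1] + 1,                     # skip a text character
                           prev[j - 1] + (ch != text[j - 1])))  # match or substitute
        prev = cur
    return [j for j in range(n + 1) if prev[j] <= k]

def gapped_approx(subs, ks, gaps, text):
    """Return end positions of approximate occurrences of the gapped pattern."""
    n = len(text)
    can_start = [True] * (n + 1)
    for i, (sub, k) in enumerate(zip(subs, ks)):
        ends = approx_ends(sub, k, text, can_start)
        if i == len(subs) - 1:
            return ends
        lo, hi = gaps[i]
        can_start = [False] * (n + 1)
        for e in ends:
            for s in range(e + lo, min(e + hi, n) + 1):
                can_start[s] = True
    return []

subs, gaps = ["AA", "GC", "TT"], [(2, 3), (1, 3)]
print(gapped_approx(subs, [0, 1, 0], gaps, "GCAATTGTACTTC"))
```

With k2 = 1 the text GCAATTGTACTTC from the earlier example is accepted; with every ki set to 0 it is rejected, because GT is not an exact occurrence of GC.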

Page 31: Finding approximate occurrences of a pattern that contains gaps

Handling class of characters

Represent characters as bit masks.

Bit order: A G T C

Pattern class [GC] = 0101

Text G against pattern [GC]: 0100 & 0101 = 0100 (nonzero: match)

Text T against pattern [GC]: 0010 & 0101 = 0000 (zero: mismatch)
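The mask test is a single bitwise AND. A minimal sketch, using the bit order A G T C from the slide; the names `BIT`, `class_mask`, and `matches` are assumptions for illustration.

```python
# One bit per character, in the slide's order A G T C.
BIT = {"A": 0b1000, "G": 0b0100, "T": 0b0010, "C": 0b0001}

def class_mask(chars: str) -> int:
    """Mask of a character class, e.g. 'GC' for [GC]."""
    mask = 0
    for ch in chars:
        mask |= BIT[ch]
    return mask

def matches(text_char: str, pattern_mask: int) -> bool:
    """A text character matches when the AND of the two masks is nonzero."""
    return (BIT[text_char] & pattern_mask) != 0

gc = class_mask("GC")
print(format(gc, "04b"), matches("G", gc), matches("T", gc))
```

A plain pattern character is simply a class of size one, so the same test covers both cases inside the dynamic programming recurrence.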

Page 32: Finding approximate occurrences of a pattern that contains gaps

Time Complexity

O (mn) (m is the length of the pattern, n is the length of the text), but faster in practice

Text

P1

P2

P3

Page 33: Finding approximate occurrences of a pattern that contains gaps

Conclusion

  O(ln + m) time algorithm for the exact gapped pattern matching problem.
  O(mn) time algorithm for the approximate gapped pattern matching problem.

Open problem

  Time complexity in the average case?