06/23/22 1 A fast algorithm for approximate string matching on gene sequences Zheng Liu, Xin Chen, James Borneman and Tao Jiang University of California, Riverside

A fast algorithm for approximate string matching on gene sequences


Page 1: A fast algorithm for approximate string matching on gene sequences

04/19/23 1

Zheng Liu, Xin Chen, James Borneman and Tao Jiang

University of California, Riverside


Outline

Background and motivation
Idea and analysis for FAAST
Experimental results
Conclusion


Background

Approximate string matching

pattern: P = p_1 p_2 … p_m

text: T = t_1 t_2 … t_n

Variants: k-mismatch and k-difference

Applications: text processing and gene sequence analysis.
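
As a baseline for the k-mismatch variant, the sketch below reports every position where P matches a substring of T with at most k mismatched characters. It is a naive O(mn) scan, not FAAST; the function name `k_mismatch_naive` and the 0-based positions are assumptions made for illustration.

```python
def k_mismatch_naive(text: str, pattern: str, k: int) -> list[int]:
    """Return every 0-based start where pattern matches text with at most k mismatches."""
    m, n = len(pattern), len(text)
    hits = []
    for start in range(n - m + 1):
        mismatches = 0
        for i in range(m):
            if text[start + i] != pattern[i]:
                mismatches += 1
                if mismatches > k:   # this alignment cannot succeed any more
                    break
        if mismatches <= k:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # The primer AAGTC against a text containing ACGTC (one mismatch).
    print(k_mismatch_naive("TTACGTCTT", "AAGTC", 1))
```

Every faster algorithm for the problem has to return the same positions as this scan, which makes it a convenient reference when testing.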


Motivation of FAAST

Motivation: Gene sequence acquisition

Modeled as the k-mismatch problem

Primers: AAGTC CCGTA

AAGTC………CCGTATACTT………CCGTT

…ACGTC………GCGTA

AAGTC………CCGTA…

ACGTC………GCGTA


Algorithms for the k-mismatch problem

1992, Shift-Add by Baeza-Yates and Gonnet.
1996, Boyer-Moore (BM) with Shift-Add by El-Mabrouk and Crochemore.
1993, BM extension (bad-character rule) by Tarhio and Ukkonen.
1994, BM extension (good-suffix rule) by Baeza-Yates and Gonnet.


FAAST

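A further generalization of the Tarhio-Ukkonen algorithm.

Tarhio-Ukkonen checks the last k+1 aligned characters:

t_{j-m+1} t_{j-m+2} … t_{j-k} … t_{j-2} t_{j-1} t_j
p_1       p_2       … p_{m-k} … p_{m-2} p_{m-1} p_m

FAAST checks the last k+x aligned characters:

t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-k} … t_{j-2} t_{j-1} t_j
p_1       p_2       … p_{m-k-x+1} … p_{m-k} … p_{m-2} p_{m-1} p_m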


Algorithm outline

T: AACTGTTAACTTGCGACTAG (k=2, x=2)

P: AAATCGTAAC

AAATCGTAAC Χ
AAATCGTAAC Χ
……… Χ
AAATCGTAAC ☺ (after first shift (6))


An example

k=2, x=3, m=10, n=20

T: AACTGTTAACTTGCGACTAG

P: AAATCGTAAC

Shift 1 by Tarhio-Ukkonen:

T: AACTGTTAACTTGCGACTAG
P:  AAATCGTAAC

Shift 6 by FAAST:

T: AACTGTTAACTTGCGACTAG
P:       AAATCGTAAC


Construction of shift table

Heuristic: shift P until the last k+x aligned text characters contain at least x matches. If only y ≤ k+x of those text characters are still aligned with P after the shift, require at least y-k matches among them instead.

T: AACTGTTAACTTGCGACTA (k=2, x=3)
P: AAGTCGTAAC
…. AAGTCGTAAC


Construction details

V_{kx}[t_{j-k-x+1}…t_j, l]: marks the characters of t_{j-k-x+1}…t_j that match P after P is shifted right by l.

d_{kx}[t_{j-k-x+1}…t_j]: stores the minimum shift distance l such that V_{kx}[t_{j-k-x+1}…t_j, l] contains at least min(x, m-k-l) matches.
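
The definition of d_{kx} can be sketched as a direct search over l. The version below is a minimal illustration, not the authors' table construction: it computes the shift for one window on demand instead of filling a table for every possible window, and the name `shift_distance` and the 0-based indexing are assumptions.

```python
def shift_distance(pattern: str, window: str, k: int, x: int) -> int:
    """Smallest shift l >= 1 such that the window keeps at least min(x, m-k-l) matches.

    `window` holds the last k+x text characters of the current alignment,
    i.e. t_{j-k-x+1} ... t_j.
    """
    m, w = len(pattern), len(window)
    assert x >= 1 and w == k + x and w <= m
    for l in range(1, m - k + 1):
        matches = 0
        for r in range(w):           # r = 0 is the rightmost window character t_j
            i = m - 1 - l - r        # 0-based pattern index facing it after the shift
            if i >= 0 and pattern[i] == window[w - 1 - r]:
                matches += 1
        if matches >= min(x, m - k - l):
            return l
    return m - k  # not reached: at l = m - k the threshold is 0


if __name__ == "__main__":
    # Window TTAAC is the last k+x = 5 text characters of the first alignment
    # in the example with T = AACTGTTAACTTGCGACTAG and P = AAATCGTAAC.
    print(shift_distance("AAATCGTAAC", "TTAAC", k=2, x=3))
    print(shift_distance("AAATCGTAAC", "AAC", k=2, x=1))
```

With x = 1 the window shrinks to k+1 characters and a single match suffices, which corresponds to the Tarhio-Ukkonen check. For the example alignment the sketch yields 6 for x = 3 and 1 for x = 1, in line with the two shifts shown on the example slide.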


Construction details, cont'd

P: AAGTCGTAAC (k=2, x=3, l = 1..8), V_{kx}[t_{j-k-x+1}…t_j, l] and d_{kx}[t_{j-k-x+1}…t_j]

window  l=1  l=2  l=3        l=4  l=5  l=6  l=7  l=8  d_{kx}
AAAAA   0,1  0    -          4    3,4  2,3  1,2  0,1  6
GCGAC   1    2,3  4          -    0,2  -    1    1    7
GTCGT   -    -    0,1,2,3,4  -    -    0,1  -    -    3
TTAAC   0    4    3          -    0    2    1,2  1    7
TTTTT   2    1,4  0,3        2    1    0    -    -    8


Theoretical support

Correctness of FAAST
Under the random string assumption:
Average shift distance
Total number of character comparisons


Correctness of FAAST

Theorem 1. When the last k+x characters of P are aligned with t_{j-k-x+1}…t_j, P can always be shifted right by d_{kx}[t_{j-k-x+1}…t_j] without missing any approximate occurrence of P.

t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-2} t_{j-1} t_j

p_1 p_2 … p_{i-k-x+1} … p_{i-2} p_{i-1} p_i … p_m   (current)

p_1 p_2 … p_{i-k-x+1} … p_{i'-k-x+1} … p_{i'-2} p_{i'-1} p_{i'} … p_m   (i < i')

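The safe-shift property is what allows a search loop to skip alignments without verifying them. The sketch below is one possible loop under that property, not the authors' implementation: it verifies each visited alignment by a full comparison and computes shifts lazily with a cache rather than from a precomputed table. The name `faast_search` and the 0-based positions are assumptions.

```python
from functools import lru_cache


def faast_search(text: str, pattern: str, k: int, x: int) -> list[int]:
    """Report 0-based starts of all k-mismatch occurrences, skipping by the window shift."""
    m, n, w = len(pattern), len(text), k + x
    assert x >= 1 and w <= m

    @lru_cache(maxsize=None)
    def shift(window: str) -> int:
        # Smallest l >= 1 that leaves at least min(x, m-k-l) matches in the window.
        for l in range(1, m - k + 1):
            matches = sum(
                1
                for r in range(w)
                if m - 1 - l - r >= 0 and pattern[m - 1 - l - r] == window[w - 1 - r]
            )
            if matches >= min(x, m - k - l):
                return l
        return m - k  # not reached: the threshold is 0 at l = m - k

    hits, start = [], 0
    while start + m <= n:
        # Verify the current alignment by counting mismatches over all m positions.
        if sum(a != b for a, b in zip(pattern, text[start:start + m])) <= k:
            hits.append(start)
        end = start + m
        start += shift(text[end - w:end])   # skip alignments that cannot match
    return hits


if __name__ == "__main__":
    print(faast_search("TTACGTCTTAAGTCAA", "AAGTC", k=1, x=2))
```

A skipped alignment at shift l would overlap the window in min(k+x, m-l) characters and have fewer than min(x, m-k-l) matches there, so it would contain more than k mismatches; that is why the loop should return the same positions as an exhaustive scan.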

Average shift distance

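Lemma 1. The probability P_{kx} that the last k+x aligned characters of T contain at least x matches, where p is the probability that two characters match, is:

$$P_{kx} = 1 - \sum_{i=0}^{x-1} \binom{k+x}{i} (1-p)^{k+x-i} p^{i}$$

Theorem 1. The average shift distance of FAAST is:

$$E^{d}_{kx} = \sum_{s=0}^{\infty} s (1-P_{kx})^{s-1} P_{kx} = \frac{1}{P_{kx}}$$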
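
The two formulas are easy to evaluate numerically. The sketch below does so under the assumption that p is the per-character match probability, taken here as 1/4 for uniformly random DNA; the function names are illustrative.

```python
from math import comb


def match_window_probability(k: int, x: int, p: float) -> float:
    """P_kx: probability that k+x random character pairs contain at least x matches."""
    w = k + x
    fewer_than_x = sum(comb(w, i) * (1 - p) ** (w - i) * p ** i for i in range(x))
    return 1.0 - fewer_than_x


def expected_shift(k: int, x: int, p: float) -> float:
    """E^d_kx = 1 / P_kx, the mean of a geometric distribution with success rate P_kx."""
    return 1.0 / match_window_probability(k, x, p)


if __name__ == "__main__":
    # Assumed setting: k = 3 and p = 1/4 (four equally likely bases).
    for x in range(1, 8):
        print(x, round(expected_shift(3, x, 0.25), 2))
```

The geometric model has no upper bound on the shift, whereas a real shift cannot exceed m-k, so the formula is best read as an estimate for small x.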


Average shift distance under different x.


Total character comparisons

Lemma 2. The expected number of comparisons between two shifts is:

$$E^{c}_{kx} = \frac{k+x}{1-p}$$

Theorem 2. The expected total number of comparisons for a text of length n is:

$$TE^{c}_{kx} = n P_{kx} \frac{k+x}{1-p}$$
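
One way to read Theorem 2, assuming the number of alignments visited is approximately the text length divided by the average shift distance 1/P_{kx}, is as the product of that count and the per-alignment cost from Lemma 2. The numeric instance uses k = 3, x = 2 and p = 1/4, values chosen here only for illustration.

```latex
\begin{align*}
TE^{c}_{kx} &\approx \frac{n}{E^{d}_{kx}} \cdot E^{c}_{kx}
             = n P_{kx} \cdot \frac{k+x}{1-p}, \\
\text{e.g. } k=3,\ x=2,\ p=\tfrac14:\quad
P_{kx} &= 1 - \left(\tfrac34\right)^{5} - 5\left(\tfrac34\right)^{4}\tfrac14 = 0.3671875, \\
TE^{c}_{kx} &= n \cdot 0.3671875 \cdot \frac{5}{3/4} \approx 2.45\,n .
\end{align*}
```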


Total character comparisons


Difference of total character comparisons under different x


Experimental results

A PC with a 2.8 GHz CPU and 1 GB of memory

Simulated random string testing

Real DNA gene sequence data


Result on simulated sequences

Text: 2M-base sequence, Pattern: 39 bases, k=3.

x                     1      2      3      4      5      6      7
Ave. shift dist.      1.41   2.76   5.59   16.38  31.31  37.37  38.87
Total comp.           6.70   3.68   1.86   0.65   0.34   0.28   0.27
Running time (sec.)   210.2  114.4  58.1   20.6   11.2   10.8   16.7
Prepro. time (sec.)   0.01   0.01   0.03   0.08   0.36   1.58   6.90


Result on real sequences

Text: 150 bacteria DNA sequences, k=3

x                     1      2      3      4      5      6      7
Running time (sec.)   18.87  13.05  7.74   3.84   2.63   3.21   8.55
Prepro. time (sec.)   0.01   0.01   0.02   0.09   0.35   1.57   6.96
Matching time (sec.)  18.77  13.04  7.72   3.75   2.28   1.64   1.59

Text: 150 fungi DNA sequences, k=3

x                     1      2      3      4      5      6      7
Running time (sec.)   16.45  11.43  9.24   6.78   5.62   8.24   26.48
Prepro. time (sec.)   0.02   0.03   0.08   0.32   1.34   5.77   23.86
Matching time (sec.)  16.43  11.40  9.16   6.46   4.28   2.47   2.62


Conclusion

FAAST is a competitive algorithm for the k-mismatch problem on gene sequences.

Preprocessing time and memory grow with larger x and larger alphabet size.