A fast algorithm for approximate string matching on gene sequences

04/19/23 1

Zheng Liu, Xin Chen, James Borneman and Tao Jiang

University of California, Riverside


Outline

Background and motivation

Idea and analysis for FAAST

Experimental results

Conclusion


Background

Approximate string matching

pattern: P = p1p2…pm

text: T = t1t2…tn

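k-mismatch: P matches a substring of T with at most k substituted characters.

k-difference: P matches a substring of T with at most k insertions, deletions or substitutions.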

Applications: text processing and gene sequence analysis.
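As a baseline for the k-mismatch problem defined above, the following is a minimal brute-force sketch. The function name `k_mismatch_occurrences` and the 0-based start positions are illustrative choices, not part of the slides.

```python
def k_mismatch_occurrences(text, pattern, k):
    """Return 0-based start positions where pattern occurs in text
    with at most k mismatched characters (substitutions only)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already has too many mismatches
        if mismatches <= k:
            hits.append(start)
    return hits
```

This checks every alignment, so it costs O(mn) comparisons in the worst case; the shift-based algorithms discussed next try to skip alignments that cannot match.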


Motivation of FAAST

Motivation: Gene sequence acquisition

Modeled as the k-mismatch problem

Primers: AAGTC CCGTA

AAGTC………CCGTATACTT………CCGTT

…ACGTC………GCGTA

…

AAGTC………CCGTA…

ACGTC………GCGTA


Algorithms for the k-mismatch problem

1992, Shift-Add by Baeza-Yates and Gonnet.

1996, Boyer-Moore (BM) with Shift-Add by El-Mabrouk and Crochemore.

1993, BM extension (bad-character rule) by Tarhio-Ukkonen.

1994, BM extension (good-suffix rule) by Baeza-Yates and Gonnet.


FAAST

A further generalization of the Tarhio-Ukkonen algorithm.

Tarhio-Ukkonen: check the last k+1 aligned characters.

tj-m+1 tj-m+2 …… tj-k … tj-2 tj-1 tj
p1 p2 …… pm-k … pm-2 pm-1 pm

FAAST: check the last k+x aligned characters.

tj-m+1 tj-m+2 … tj-k-x+1 … tj-k … tj-2 tj-1 tj
p1 p2 … pm-k-x+1 … pm-k … pm-2 pm-1 pm


Algorithm outline

T: AACTGTTAACTTGCGACTAG (k=2, x=2)

P: AAATCGTAAC

AAATCGTAAC Χ

AAATCGTAAC Χ

……… Χ

AAATCGTAAC ☺ - after first shift (6)


An example

k=2, x=3, m=10, n=20

T: AACTGTTAACTTGCGACTAG

P: AAATCGTAAC

T: AACTGTTAACTTGCGACTAG
P:  AAATCGTAAC   (shift 1 by Tarhio-Ukkonen)

T: AACTGTTAACTTGCGACTAG
P:       AAATCGTAAC   (shift 6 by FAAST)
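The two shift values in this example can be reproduced with a small sketch. It assumes the shift is the smallest l for which the last k+x text characters still contain at least min(x, m-k-l) matches against the shifted pattern, and it treats the Tarhio-Ukkonen shift as the x=1 case. The function name `shift_distance` is hypothetical.

```python
def shift_distance(pattern, window, k):
    """Smallest shift l in 1..m-k after which the window (the last k+x
    text characters of the current alignment) keeps enough matches.
    Assumes len(pattern) > k and len(window) = k + x."""
    m, w = len(pattern), len(window)
    x = w - k
    for l in range(1, m - k + 1):
        matches = 0
        for r in range(w):               # r = 0 is the last window character
            p_index = m - 1 - l - r      # pattern index facing it after the shift
            if p_index >= 0 and pattern[p_index] == window[w - 1 - r]:
                matches += 1
        if matches >= min(x, m - k - l):
            return l
    return m - k

text, pattern, k = "AACTGTTAACTTGCGACTAG", "AAATCGTAAC", 2
first_alignment = text[:len(pattern)]
shift_x1 = shift_distance(pattern, first_alignment[-(k + 1):], k)  # window "AAC"
shift_x3 = shift_distance(pattern, first_alignment[-(k + 3):], k)  # window "TTAAC"
```

Under these assumptions `shift_x1` is 1 and `shift_x3` is 6, matching the two alignments shown.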


Construction of shift table

Heuristic: shift P so that the last k+x aligned text characters contain at least x matches. If only y ≤ k+x of these characters still overlap P after the shift, require at least y-k matches among those y characters.

T: AACTGTTAACTTGCGACTA   [k=2, x=3]

P: AAGTCGTAAC

…. AAGTCGTAAC


Construction details

Vkx[tj-k-x+1…tj, l]: marks the characters of tj-k-x+1…tj that match P after P is shifted to the right by l.

dkx[tj-k-x+1…tj]: stores the minimum shift distance l such that Vkx[tj-k-x+1…tj, l] contains at least min[x, m-k-l] items.


Construction details – cont’d

P: AAGTCGTAAC (k=2, x=3, l=[1..8]), Vkx[tj-k-x+1…tj, l] and dkx[tj-k-x+1…tj]

          l=1   l=2   l=3         l=4   l=5   l=6   l=7   l=8   dkx
AAAAA     0,1   0     -           4     3,4   2,3   1,2   0,1   6
…
GCGAC     1     2,3   4           -     0,2   -     1     1     7
…
GTCGT     -     -     0,1,2,3,4   -     -     0,1   -     -     3
…
TTAAC     0     4     3           -     0     2     1,2   1     7
…
TTTTT     2     1,4   0,3         2     1     0     -     -     8

(- : no matching character)
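The table can be rebuilt with a short sketch. It assumes that the entries of Vkx are positions counted from the right end of the window (0 = tj), which is what the listed rows appear to use, and that dkx is the first l whose entry has at least min[x, m-k-l] positions. The names `match_positions` and `build_shift_table` are illustrative.

```python
from itertools import product

def match_positions(pattern, window, l):
    """V_kx[window, l]: window positions (0 = last character) that match
    the pattern after it has been shifted right by l."""
    m, w = len(pattern), len(window)
    positions = []
    for r in range(w):
        p_index = m - 1 - l - r          # pattern index facing position r
        if p_index >= 0 and pattern[p_index] == window[w - 1 - r]:
            positions.append(r)
    return positions

def build_shift_table(pattern, k, x, alphabet="ACGT"):
    """d_kx for every window of length k+x over the alphabet.
    Assumes len(pattern) > k, so l = m-k always qualifies."""
    m = len(pattern)
    table = {}
    for chars in product(alphabet, repeat=k + x):
        window = "".join(chars)
        for l in range(1, m - k + 1):
            if len(match_positions(pattern, window, l)) >= min(x, m - k - l):
                table[window] = l
                break
    return table

table = build_shift_table("AAGTCGTAAC", k=2, x=3)
```

With these assumptions the sketch gives dkx values of 6, 7, 3, 7 and 8 for AAAAA, GCGAC, GTCGT, TTAAC and TTTTT. For TTAAC at l=1 it yields position 1 where the table lists 0; the dkx value is unaffected. The table holds one entry per window, so in this sketch it has |alphabet|^(k+x) entries (1024 here).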


Theoretical support

Correctness of FAAST

We use the random string assumption

Average shift distance

Total number of character comparisons


Correctness of FAAST

Theorem 1. When the end of P is aligned with tj-k-x+1…tj, we can always shift P to the right by dkx[tj-k-x+1…tj] without missing any approximate occurrence of P.

tj-m+1 tj-m+2 … tj-k-x+1 … tj-2 tj-1 tj

p1 p2 … pi-k-x+1 … pi-2 pi-1 pi …… pm (current)

p1 p2 … pi-k-x+1 … pi’-k-x+1 … pi’-2 pi’-1 pi’ … pm (i < i’)
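The claim of the theorem can be checked empirically with a sketch of a search loop that always jumps by the shift distance and is compared against brute force. This is not the authors' implementation: the shift is computed on demand and cached rather than preprocessed, the function names are hypothetical, and m ≥ k+x is assumed.

```python
import random

def shift_distance(pattern, window, k):
    """Smallest l in 1..m-k such that the window keeps at least
    min(x, m-k-l) matches after shifting the pattern right by l."""
    m, w = len(pattern), len(window)
    x = w - k
    for l in range(1, m - k + 1):
        matches = sum(
            1 for r in range(w)
            if m - 1 - l - r >= 0 and pattern[m - 1 - l - r] == window[w - 1 - r]
        )
        if matches >= min(x, m - k - l):
            return l
    return m - k

def naive_search(text, pattern, k):
    """Brute-force reference: test every alignment."""
    m = len(pattern)
    return [s for s in range(len(text) - m + 1)
            if sum(a != b for a, b in zip(text[s:s + m], pattern)) <= k]

def shift_search(text, pattern, k, x):
    """Verify the current alignment, then jump by the shift distance."""
    m, n = len(pattern), len(text)
    hits, cache = [], {}
    end = m                                  # exclusive end of the alignment
    while end <= n:
        start = end - m
        if sum(a != b for a, b in zip(text[start:end], pattern)) <= k:
            hits.append(start)
        window = text[end - (k + x):end]     # last k+x aligned text characters
        if window not in cache:
            cache[window] = shift_distance(pattern, window, k)
        end += cache[window]
    return hits

def agrees_with_naive(trials, alphabet, seed=0):
    """Compare both searches on random strings (m=10, k=2, x=3)."""
    rng = random.Random(seed)
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(200))
        pattern = "".join(rng.choice(alphabet) for _ in range(10))
        if shift_search(text, pattern, 2, 3) != naive_search(text, pattern, 2):
            return False
    return True
```

The shift is safe because any alignment skipped at distance l' < dkx has fewer than min(x, m-k-l') matches inside the window, which forces more than k mismatches.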


Average shift distance

Lemma 1. The probability $P_{kx}$ that the last k+x characters of T have at least x matches is:

$P_{kx} = 1 - \sum_{i=0}^{x-1} \binom{k+x}{i} (1-p)^{k+x-i} p^{i}$

Theorem 1. The average shift distance of FAAST is:

$E^{d}_{kx} = \sum_{s=0}^{\infty} s (1-P_{kx})^{s-1} P_{kx} = 1/P_{kx}$
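The two formulas can be evaluated numerically with the sketch below. It reads p as the probability that two characters match, which is the reading under which the sum in Lemma 1 is the probability of fewer than x matches. The function names are illustrative.

```python
from math import comb

def prob_enough_matches(k, x, p):
    """P_kx: probability that k+x random characters contain at least x matches,
    where p is the per-character match probability."""
    fewer_than_x = sum(
        comb(k + x, i) * (1 - p) ** (k + x - i) * p ** i
        for i in range(x)
    )
    return 1 - fewer_than_x

def expected_shift(k, x, p):
    """E^d_kx = 1 / P_kx, the mean of a geometric distribution."""
    return 1 / prob_enough_matches(k, x, p)
```

Assuming p = 0.25 (a uniform four-letter alphabet) and k = 3, this gives roughly 1.46 for x = 1 and 2.72 for x = 2, close to the 1.41 and 2.76 measured on simulated sequences.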


Average shift distance under different x.


Total character comparisons

Lemma 2. The expected number of comparisons between two shifts is:

$E^{c}_{kx} = (k+x) / (1-p)$

Theorem 2. The expected total number of comparisons for a text of length n is:

$TE^{c}_{kx} = n P_{kx} (k+x) / (1-p)$
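Theorem 2 can be read as the per-shift cost of Lemma 2 multiplied by the expected number of shifts, n divided by the average shift distance, which equals n times P_kx. The sketch below evaluates it under that reading, again taking p as the per-character match probability; the function name is hypothetical.

```python
from math import comb

def expected_total_comparisons(n, k, x, p):
    """TE^c_kx = n * P_kx * (k + x) / (1 - p)."""
    p_kx = 1 - sum(
        comb(k + x, i) * (1 - p) ** (k + x - i) * p ** i
        for i in range(x)
    )
    expected_shifts = n * p_kx           # n divided by the mean shift 1 / P_kx
    per_shift = (k + x) / (1 - p)        # Lemma 2
    return expected_shifts * per_shift
```

For example, `expected_total_comparisons(1000, 3, 1, 0.25)` evaluates 1000 · (175/256) · (4/0.75), about 3646 comparisons.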


Total character comparisons


Difference of total character comparisons under different x


Experimental results

A PC with a 2.8 GHz CPU and 1 GB of memory

Simulated random string testing

Real DNA gene sequence data


Result on simulated sequences

Text: 2M-base sequence, Pattern: 39 bases, k=3.

x                     1       2       3       4       5       6       7
Ave. shift dist.      1.41    2.76    5.59    16.38   31.31   37.37   38.87
Total comp.           6.70    3.68    1.86    0.65    0.34    0.28    0.27
Running time (sec.)   210.2   114.4   58.1    20.6    11.2    10.8    16.7
Prepro. time (sec.)   0.01    0.01    0.03    0.08    0.36    1.58    6.90


Result on real sequences

Text: 150 bacteria DNA sequences, k=3

x                     1       2       3       4       5       6       7
Running time (sec.)   18.87   13.05   7.74    3.84    2.63    3.21    8.55
Prepro. time (sec.)   0.01    0.01    0.02    0.09    0.35    1.57    6.96
Matching time (sec.)  18.77   13.04   7.72    3.75    2.28    1.64    1.59

Text: 150 fungi DNA sequences, k=3

x                     1       2       3       4       5       6       7
Running time (sec.)   16.45   11.43   9.24    6.78    5.62    8.24    26.48
Prepro. time (sec.)   0.02    0.03    0.08    0.32    1.34    5.77    23.86
Matching time (sec.)  16.43   11.40   9.16    6.46    4.28    2.47    2.62


Conclusion

FAAST is a competitive algorithm for the k-mismatch problem on gene sequences.

Preprocessing time and memory increase with larger x and alphabet size.
