04/19/23 1
A fast algorithm for approximate string matching on gene sequences
Zheng Liu, Xin Chen, James Borneman and Tao Jiang
University of California, Riverside
Outline
Background and motivation
Idea and analysis for FAAST
Experimental results
Conclusion
Background
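Approximate string matching: find the occurrences of a pattern in a text, allowing a bounded number of errors.
pattern: P = p_1 p_2 … p_m
text: T = t_1 t_2 … t_n
Two variants: k-mismatch (at most k substitutions) and k-difference (at most k insertions, deletions, or substitutions).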
Applications: text processing and gene sequence analysis.
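As a baseline for the k-mismatch variant defined above, here is a minimal brute-force sketch. The function name `k_mismatch_naive` is a hypothetical helper, not something from the slides; it checks every alignment and stops counting as soon as more than k mismatches are seen.

```python
def k_mismatch_naive(pattern, text, k):
    """Return every start index where pattern occurs in text with at most k mismatches.

    Brute-force O(m * n) baseline, not the FAAST algorithm.
    """
    m, n = len(pattern), len(text)
    hits = []
    for start in range(n - m + 1):
        mismatches = 0
        for i in range(m):
            if pattern[i] != text[start + i]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment cannot be an occurrence
        if mismatches <= k:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Primer AAGTC against a toy text containing one inexact and one exact copy.
    print(k_mismatch_naive("AAGTC", "ACGTCAAGTC", 1))
```

The faster algorithms listed later in the deck return the same set of positions while skipping most alignments.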
Motivation of FAAST
Motivation: Gene sequence acquisition
Modeled as the k-mismatch problem
Primers: AAGTC CCGTA
AAGTC………CCGTATACTT………CCGTT
…ACGTC………GCGTA
…
AAGTC………CCGTA…
ACGTC………GCGTA
Algorithms for the k-mismatch problem
1992: Shift-Add, by Baeza-Yates and Gonnet.
1996: BM with Shift-Add, by El-Mabrouk and Crochemore.
1993: BM extension (bad-character rule), by Tarhio and Ukkonen.
1994: BM extension (good-suffix rule), by Baeza-Yates and Gonnet.
FAAST
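A further generalization of the Tarhio-Ukkonen algorithm.
Tarhio-Ukkonen: check the last k+1 aligned characters.
t_{j-m+1} t_{j-m+2} …… t_{j-k} … t_{j-2} t_{j-1} t_j
p_1 p_2 …… p_{m-k} … p_{m-2} p_{m-1} p_m
FAAST: check the last k+x aligned characters.
t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-k} … t_{j-2} t_{j-1} t_j
p_1 p_2 … p_{m-k-x+1} … p_{m-k} … p_{m-2} p_{m-1} p_m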
Algorithm outline
T: AACTGTTAACTTGCGACTAG (k=2, x=2)
P: AAATCGTAAC
AAATCGTAAC Χ
AAATCGTAAC Χ
……… Χ
AAATCGTAAC ☺ (after first shift (6))
An example
k=2, x=3, m=10, n=20
T: AACTGTTAACTTGCGACTAG
P: AAATCGTAAC
Tarhio-Ukkonen: shift 1
T: AACTGTTAACTTGCGACTAG
P:  AAATCGTAAC
FAAST: shift 6
T: AACTGTTAACTTGCGACTAG
P:       AAATCGTAAC
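The two shifts in this example can be reproduced with a short sketch. It assumes the shift rule stated in the deck (after shifting by l, the last k+x aligned text characters must keep at least min(x, m-k-l) matches) and computes each shift on the fly instead of reading it from a precomputed table. The names `faast_shift` and `faast_search` are hypothetical, and each alignment is verified with a plain mismatch count, which may differ from how the authors verify candidates. Setting x=1 gives the Tarhio-Ukkonen style check of the last k+1 characters.

```python
def faast_shift(pattern, window, k, x):
    """Smallest shift l >= 1 that keeps enough matches inside the text window.

    window holds the last k + x text characters under the current alignment.
    Assumes len(pattern) >= k + x.
    """
    m, w = len(pattern), len(window)
    for l in range(1, m - k + 1):
        matches = 0
        for i in range(w):
            j = m - w - l + i  # pattern index facing window[i] after the shift
            if j >= 0 and pattern[j] == window[i]:
                matches += 1
        if matches >= min(x, m - k - l):
            return l
    return m - k  # not reached: the requirement is 0 when l == m - k


def faast_search(pattern, text, k, x):
    """Report k-mismatch occurrences, skipping alignments ruled out by the shift rule."""
    m, n = len(pattern), len(text)
    hits = []
    end = m  # exclusive end of the current alignment in text
    while end <= n:
        start = end - m
        mismatches = sum(1 for a, b in zip(pattern, text[start:end]) if a != b)
        if mismatches <= k:
            hits.append(start)
        end += faast_shift(pattern, text[end - (k + x):end], k, x)
    return hits


if __name__ == "__main__":
    P, T = "AAATCGTAAC", "AACTGTTAACTTGCGACTAG"
    print(faast_shift(P, T[5:10], 2, 3))  # first shift with x = 3
    print(faast_shift(P, T[7:10], 2, 1))  # first shift with x = 1
```

Computing shifts on the fly costs O(m(k+x)) per shift, so this sketch only illustrates the rule; the deck's approach precomputes them in a table.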
Construction of shift table
Heuristic: shift P by the smallest distance that leaves the last k+x aligned text characters with at least x matches. If only y ≤ k+x of those characters still overlap P after the shift, at least y-k of them must match.
T: AACTGTTAACTTGCGACTA (k=2, x=3)
P: AAGTCGTAAC
…. AAGTCGTAAC
Construction details
V_{kx}[t_{j-k-x+1}…t_j, l]: marks the positions of t_{j-k-x+1}…t_j that match P after P is shifted right by l.
d_{kx}[t_{j-k-x+1}…t_j]: stores the minimum shift distance l such that V_{kx}[t_{j-k-x+1}…t_j, l] contains at least min(x, m-k-l) items.
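Construction details (cont'd)
P: AAGTCGTAAC (k=2, x=3, l = 1..8). Columns l=1..8 list V_{kx}[t_{j-k-x+1}…t_j, l]; the last column is d_{kx}[t_{j-k-x+1}…t_j]. A "-" marks an empty entry.

t_{j-k-x+1}…t_j   l=1   l=2   l=3         l=4   l=5   l=6   l=7   l=8   d_kx
AAAAA             0,1   0     -           4     3,4   2,3   1,2   0,1   6
…
GCGAC             1     2,3   4           -     0,2   -     1     1     7
…
GTCGT             -     -     0,1,2,3,4   -     -     0,1   -     -     3
…
TTAAC             0     4     3           -     0     2     1,2   1     7
…
TTTTT             2     1,4   0,3         2     1     0     -     -     8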
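The table can be rebuilt with a short sketch under two assumptions that the slides do not state: positions inside the window are numbered from the right (0 is t_j), which is the convention that reproduces the AAAAA row, and the alphabet is {A, C, G, T}. The names `match_positions` and `build_shift_table` are hypothetical.

```python
from itertools import product


def match_positions(pattern, window, l):
    """V[window, l]: window positions (0 = rightmost) that match pattern after shift l."""
    m, w = len(pattern), len(window)
    positions = []
    for i in range(w):
        j = m - w - l + i  # pattern index facing window[i] after the shift
        if j >= 0 and pattern[j] == window[i]:
            positions.append(w - 1 - i)
    return sorted(positions)


def build_shift_table(pattern, k, x, alphabet="ACGT"):
    """d[window]: minimum l with at least min(x, m - k - l) matching positions."""
    m = len(pattern)
    table = {}
    for chars in product(alphabet, repeat=k + x):
        window = "".join(chars)
        for l in range(1, m - k + 1):
            if len(match_positions(pattern, window, l)) >= min(x, m - k - l):
                table[window] = l
                break  # the requirement is 0 at l = m - k, so every window gets an entry
    return table


if __name__ == "__main__":
    d = build_shift_table("AAGTCGTAAC", k=2, x=3)
    for window in ("AAAAA", "GCGAC", "GTCGT", "TTAAC", "TTTTT"):
        print(window, d[window])
```

The table has one entry per window, that is |alphabet|^(k+x) entries (1024 here), which is why its size grows quickly with x and with the alphabet.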
Theoretical support
Correctness of FAAST
Under the random string assumption:
Average shift distance
Total number of character comparisons
Correctness of FAAST
Theorem 1. When the end of P is aligned with t_{j-k-x+1}…t_j, we can always shift P to the right by d_{kx}[t_{j-k-x+1}…t_j] without missing any approximate occurrence of P.
t_{j-m+1} t_{j-m+2} … t_{j-k-x+1} … t_{j-2} t_{j-1} t_j
p_1 p_2 … p_{i-k-x+1} … p_{i-2} p_{i-1} p_i … p_m (current)
p_1 p_2 … p_{i-k-x+1} … p_{i'-k-x+1} … p_{i'-2} p_{i'-1} p_{i'} … p_m (i < i')
Average shift distance
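Lemma 1. The probability P_{kx} that the last k+x aligned characters of T have at least x matches, where p is the probability that a single character matches, is:
$$P_{kx} = 1 - \sum_{i=0}^{x-1} \binom{k+x}{i} (1-p)^{k+x-i} p^{i}$$
Theorem 1. The average shift distance of FAAST is:
$$E^{d}_{kx} = \sum_{s=0}^{\infty} s (1-P_{kx})^{s-1} P_{kx} = 1/P_{kx}$$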
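To get a feel for these formulas, the sketch below evaluates them numerically. The value p = 0.25 (uniform random DNA over four letters) is an assumption, since the slides do not give p, and the function names are hypothetical.

```python
from math import comb


def match_window_prob(k, x, p):
    """P_kx: probability that k + x aligned characters contain at least x matches."""
    fewer_than_x = sum(
        comb(k + x, i) * (1 - p) ** (k + x - i) * p ** i for i in range(x)
    )
    return 1.0 - fewer_than_x


def expected_shift(k, x, p):
    """E^d_kx = 1 / P_kx, the mean of a geometric distribution with success rate P_kx."""
    return 1.0 / match_window_prob(k, x, p)


if __name__ == "__main__":
    for x in range(1, 4):
        print(x, round(expected_shift(3, x, 0.25), 2))
```

With k=3 and p=0.25 this predicts average shifts of roughly 1.46, 2.72 and 5.90 for x = 1, 2, 3, close to the 1.41, 2.76 and 5.59 measured on simulated sequences. The formula does not cap the shift at the pattern length, so it overestimates for large x.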
[Figure: average shift distance under different x]
Total character comparisons
Lemma 2. The expected number of comparisons between two shifts is:
$$E^{c}_{kx} = (k+x)/(1-p)$$
Theorem 2. The expected total number of comparisons for a text of length n is:
$$TE^{c}_{kx} = n P_{kx} (k+x)/(1-p)$$
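The total in Theorem 2 is the number of shifts, n / E^d_kx = n P_kx, multiplied by the per-shift cost from Lemma 2. The sketch below evaluates both quantities as stated; p = 0.25 is again an assumed value for uniform random DNA, and the function names are hypothetical.

```python
from math import comb


def match_window_prob(k, x, p):
    """P_kx from Lemma 1: at least x matches among k + x aligned characters."""
    return 1.0 - sum(
        comb(k + x, i) * (1 - p) ** (k + x - i) * p ** i for i in range(x)
    )


def comparisons_per_shift(k, x, p):
    """E^c_kx = (k + x) / (1 - p), as given in Lemma 2."""
    return (k + x) / (1 - p)


def expected_total_comparisons(n, k, x, p):
    """TE^c_kx = n * P_kx * (k + x) / (1 - p), as given in Theorem 2."""
    return n * match_window_prob(k, x, p) * comparisons_per_shift(k, x, p)


if __name__ == "__main__":
    # Parameters of the simulated experiment: n = 2M bases, k = 3.
    for x in range(1, 8):
        print(x, round(expected_total_comparisons(2_000_000, 3, x, 0.25)))
```

The per-shift cost grows linearly in x while P_kx falls much faster, so the predicted total drops as x increases, the same trend as the measured "Total comp." row for simulated sequences.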
[Figure: total character comparisons]
[Figure: difference of total character comparisons under different x]
Experimental results
Platform: a PC with a 2.8 GHz CPU and 1 GB of memory
Tests on simulated random strings
Tests on real DNA gene sequence data
Result on simulated sequences
Text: 2M-base sequence; pattern: 39 bases; k=3.
x                      1      2      3      4      5      6      7
Avg. shift dist.       1.41   2.76   5.59   16.38  31.31  37.37  38.87
Total comp.            6.70   3.68   1.86   0.65   0.34   0.28   0.27
Running time (sec.)    210.2  114.4  58.1   20.6   11.2   10.8   16.7
Prepro. time (sec.)    0.01   0.01   0.03   0.08   0.36   1.58   6.90
Result on real sequences
Text: 150 bacteria DNA sequences, k=3
x                      1      2      3      4      5      6      7
Running time (sec.)    18.87  13.05  7.74   3.84   2.63   3.21   8.55
Prepro. time (sec.)    0.01   0.01   0.02   0.09   0.35   1.57   6.96
Matching time (sec.)   18.77  13.04  7.72   3.75   2.28   1.64   1.59
Text: 150 fungi DNA sequences, k=3
x                      1      2      3      4      5      6      7
Running time (sec.)    16.45  11.43  9.24   6.78   5.62   8.24   26.48
Prepro. time (sec.)    0.02   0.03   0.08   0.32   1.34   5.77   23.86
Matching time (sec.)   16.43  11.40  9.16   6.46   4.28   2.47   2.62
Conclusion
FAAST is a competitive algorithm for the k-mismatch problem on gene sequences.
Preprocessing time and memory increase with larger x and larger alphabet size.