Indexing Biological Sequence Data
Doctoral Seminar by
Mihail R. Halachev
Supervisor: Dr. N. Shiri
Dept. of Computer Science and Software Engineering, Concordia University
11/29/2004

Outline
- Introduction: From DNA to sequence data
- Basic tasks over biological sequence data
- Search techniques
- Indexing techniques for sequence data
- Applicability to bioinformatics
- Suffix Trees
- Conclusion
- Future Work

From DNA to sequence data representation
Source: National Health Museum
The 2 strands are complementary: A pairs with T, and C pairs with G.
A DNA segment can be encoded using the bases from only one of the strands:
S = AGTACG, Σ = {A, C, G, T}
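Because the strands are complementary, the second strand can always be recomputed from the encoded one. A minimal sketch, assuming the four-letter alphabet above; the function names are illustrative and not from the talk:

```python
# Watson-Crick base pairing: A <-> T, C <-> G
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Base-by-base complement of a DNA strand."""
    return strand.translate(COMPLEMENT)

def reverse_complement(strand: str) -> str:
    """The opposite strand, read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]

print(complement("AGTACG"))          # TCATGC
print(reverse_complement("AGTACG"))  # CGTACT
```

This is why storing one strand loses no information.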

From mRNA to sequence data representation
Source: Wikipedia
Each codon specifies a single amino acid.
S = ATGLRS*
|Σ’| = 20
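The codon-to-amino-acid mapping can be sketched as a table lookup. The snippet below assumes the standard genetic code written in the usual TCAG ordering; the input sequence is a made-up example, not taken from the slides, and `*` marks a stop codon:

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq: str) -> str:
    """Translate a coding sequence codon by codon (U is treated as T)."""
    seq = seq.upper().replace("U", "T")
    # Any trailing partial codon is ignored.
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

print(translate("AUGCUGCGUAGCUAA"))   # MLRS*
print(len(set(AMINO) - {"*"}))        # 20 distinct amino acids
```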

Basic tasks over biological data
From a biological point of view:
- Given a novel DNA sequence, search the primary biological DBs for similar (already known) sequences. Similarity (alignment) is used to infer homology.
- Compare a novel protein sequence to secondary protein DBs containing motifs, signatures, protein domains, etc., to approximate the biochemical function of the query protein.
From a computational point of view:
- both tasks are essentially searching

Search techniques for sequence biological data (BLAST, Clustal W)
Basic Local Alignment Search Tool (BLAST) [Altschul ‘90, ‘97]
The NCBI BLAST family of programs includes:
blastp - an amino acid query against a protein DB
blastn - a nucleotide query against a nucleotide DB
blastx - a nucleotide query (in all reading frames) against a protein DB
tblastn - a protein query against a nucleotide DB (in all reading frames)
tblastx - the six-frame translations of a nucleotide query against the six-frame translations of a nucleotide DB

How BLAST works
Local pairwise alignment
• The BLAST algorithm is a heuristic search method that seeks words of length W that score at least T when aligned with the query and scored with a substitution matrix.
• Words in the database that score T or greater are extended in both directions in an attempt to find an alignment that produces an HSP (high-scoring segment pair) with a score of at least S or an E value lower than the specified threshold.
• T parameter values: a trade-off between speed and sensitivity of the search.
Source: National Center for Biotech Info
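The seed-and-extend idea can be illustrated with a deliberately simplified sketch. It is not the BLAST implementation: it assumes exact word seeds instead of neighborhood words scored against T, a flat +1/-1 scoring in place of a substitution matrix, ungapped extension to the sequence ends, and a hypothetical `min_score` playing the role of S.

```python
def seed_and_extend(query, subject, w=3, match=1, mismatch=-1, min_score=4):
    """Return sorted (score, query_start, subject_start, length) ungapped hits."""
    # Index every word of length w in the subject (the "database" sequence).
    words = {}
    for j in range(len(subject) - w + 1):
        words.setdefault(subject[j:j + w], []).append(j)

    hits = set()
    for i in range(len(query) - w + 1):
        for j in words.get(query[i:i + w], []):
            # Extend to the right, remembering the best-scoring end point.
            score = best = w * match
            best_right, k = 0, 0
            while i + w + k < len(query) and j + w + k < len(subject):
                score += match if query[i + w + k] == subject[j + w + k] else mismatch
                k += 1
                if score > best:
                    best, best_right = score, k
            # Extend to the left from the best right-extended segment.
            score, best_left, k = best, 0, 0
            while i - k - 1 >= 0 and j - k - 1 >= 0:
                score += match if query[i - k - 1] == subject[j - k - 1] else mismatch
                k += 1
                if score > best:
                    best, best_left = score, k
            if best >= min_score:
                hits.add((best, i - best_left, j - best_left, w + best_left + best_right))
    return sorted(hits, reverse=True)

print(seed_and_extend("GATTACA", "CCGATTACACC", w=3, min_score=7))
# [(7, 0, 2, 7)]
```

Raising `w` or `min_score` makes the sketch faster but less sensitive, which mirrors the speed/sensitivity trade-off noted for T.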

BLAST Case Study [Hunt ‘01]
Hardware: SUN Enterprise 450, 2 GB RAM, 4 processors, Solaris 7
Software: BLAST (with default parameter settings)
Data: 3 human chromosomes (294 Mbp, 10% of human genome), data on local disks
Queries: 99 query sequences (predicted human genes), with lengths between 429 and 5999 bp
Results: 6559 hits, average 66 hits per query
Time: 62 hours

BLAST Observations
“BLAST:
- performs serial scan of the DB;
- is CPU intensive;
- its usefulness depends on the biologists being able to provide appropriate search parameters values.”
[Hunt ‘01]
“Filtering approaches, like BLAST, are only suitable for high similarity matching, but often low similarities are biologically significant.”
[Navarro ‘00a]

Clustal W [Thompson ‘94]
Dynamic Programming alignment method
Based on global multiple alignment
Input: set of N sequences
Output: a multiple alignment of the N sequences (built progressively; not guaranteed to be optimal)
Improved sensitivity (may find similar sequences which BLAST may omit)
50-100 times slower than BLAST

Motivation for Indexing?
“Many of these biological datasets are growing at exponential rates – for example, the sizes of the sequence datasets in GenBank have been doubling every sixteen months.”
[Tata ‘04]
“As there is a rapid rise in both the volume of data and the demand for searches by researchers investigating functional genomics, it is worth investigating the possibility of accelerating these searches using indexes.” [Hunt ‘01]

Indexing Techniques for Sequence Data
Q-grams [Navarro ‘98]
String B-Tree [Ferragina ‘99]
Multi-D Index [Jagadish ‘00]
Suffix Tree [Weiner ‘73, McCreight ‘76] [Ukkonen ‘95] [Hunt ‘01, Giegerich ‘03, Tata ‘04]

Q-grams -- Construction
Input: T is a text over Σ, |T| = n, |Σ| = σ
Pick an integer, say q = 4 (0 < q < n; a good heuristic is q ≈ log_σ n)
Each substring of T with size q is called a “q-gram” and is stored in the index table (in lexical order) with a list of pointers to positions (or blocks) in T where this q-gram occurs
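The construction can be sketched as a dictionary from q-gram to its sorted list of positions. The text used here is the example string of the Q-grams example slide, read left to right (an assumption recovered from that slide's index table); positions are 1-based, and the short tail grams at the end of T are kept, as they are in that table:

```python
from collections import defaultdict

def build_qgram_index(text: str, q: int) -> dict:
    """Map each q-gram of text to its 1-based start positions, keys in lexical order."""
    index = defaultdict(list)
    for i in range(len(text)):
        index[text[i:i + q]].append(i + 1)   # tail grams are shorter than q
    return dict(sorted(index.items()))

T = "combatcarontariocontractcan"
index = build_qgram_index(T, 3)
print(index["ont"])      # [10, 18]
print(index["tca"])      # [6, 24]
print(len(index))        # 25 distinct entries
```

A disk-based index would store block numbers rather than exact positions when space matters; this sketch keeps exact positions.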

Q-grams -- Searching
For a pattern P, |P| = m,
find all approximate occurrences P’ of P in T, where the error ratio of each P’ ≤ λ
λ = k / m, where k is the edit distance of P’ to P
Knowing m and the desired λ, compute k
Split P into k+1 disjoint pieces (with at most k errors, at least one of the k+1 pieces must occur unchanged)
For each of the k+1 pieces, search the index table (binary search)
The set of candidate matches is the union of all occurrences
Verify each candidate by neighborhood search

Q-grams -- Example
T = combatcarontariocontractcan (positions 1 to 27)
Set q = 3,
Index Table:
q-gram   T[pos]
act      22
an       26
ari      13
aro      8
atc      5
bat      4
can      25
car      7
com      1
con      17
ctc      23
ioc      15
mba      3
n        27
nta      11
ntr      19
oco      16
omb      2
ont      10, 18
rac      21
rio      14
ron      9
tar      12
tca      6, 24
tra      20

Q-grams -- Example
Search for P = con, k = 1 (i.e. allow only one error); split P into k+1 pieces: P1 = c and P2 = on
• Candidate matches
P1 = c : 25, 7, 1, 17, 23
P2 = on : 10, 18
• Verification (1 error allowed): compare con with bat, can, car, com, con, ctc, ioc, omb, ont, tar
Answer: T[25], T[1], T[17] + T[9], T[17]
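The split, look-up and verify steps can be sketched as follows. This is a simplified illustration, not the method of [Navarro ‘98]: it assumes each piece is no longer than q, finds pieces as prefixes of indexed q-grams by binary search, and verifies only the length-m window aligned with the piece (a fuller neighborhood search would also try shifted and resized windows). The text is the example string read left to right.

```python
from bisect import bisect_left
from collections import defaultdict

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def split_pattern(pattern: str, k: int):
    """Split pattern into k+1 disjoint pieces, returned as (offset, piece)."""
    m, parts = len(pattern), k + 1
    bounds = [j * m // parts for j in range(parts + 1)]
    return [(bounds[j], pattern[bounds[j]:bounds[j + 1]]) for j in range(parts)]

def approx_search(text: str, pattern: str, k: int, q: int = 3):
    """1-based start positions of length-m windows within edit distance k."""
    index = defaultdict(list)
    for i in range(len(text)):
        index[text[i:i + q]].append(i)          # 0-based positions
    keys = sorted(index)
    m, found = len(pattern), set()
    for offset, piece in split_pattern(pattern, k):
        pos = bisect_left(keys, piece)          # binary search in the index table
        while pos < len(keys) and keys[pos].startswith(piece):
            for p in index[keys[pos]]:
                start = p - offset              # where P would begin in T
                if start >= 0 and start + m <= len(text):
                    if edit_distance(pattern, text[start:start + m]) <= k:
                        found.add(start + 1)
            pos += 1
    return sorted(found)

T = "combatcarontariocontractcan"
print(split_pattern("con", 1))      # [(0, 'c'), (1, 'on')]
print(approx_search(T, "con", 1))   # [1, 9, 17, 25]
```

The result matches the answer on the slide: T[1] (com), T[9] (ron), T[17] (con) and T[25] (can).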

String B-Tree -- Construction
Input: a set of words, S = {aid, atom, attenuate, car, patent, zoo, atlas}
Step 1. Store S consecutively on disk.
Step 2. Sort lexicographically each suffix of each word.
Lexicographic order: “aid” : S[1], “ar” : S[21], “as” : S[38], “ate” : S[16], ….., “uate” : S[15], “zoo” : S[31]
Step 3. Create leaf nodes. Each node contains pointers to the sorted suffixes.
1 21 38 16 25 35 5 10 20 3 18 27 13 2 37 8 28 14 33 7 32 24 22 39 29 17 26 12 36 6 11 15 31
Step 4. Propagate LMP and RMP from each node up, until the root is constructed.
1 16 25 10 20 27 13 8 28 7 32 39 29 12 36 31
1 10 20 8 28 39 29 31
1 8 28 31
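Steps 1 to 4 can be reproduced with a short sketch. It assumes one separator character between words on disk, 1-based positions, nodes of four pointers (a short tail is folded into the last node), and that LMP and RMP denote the first and last pointer of a node; these are readings of the figure, not statements from [Ferragina ‘99].

```python
WORDS = ["aid", "atom", "attenuate", "car", "patent", "zoo", "atlas"]

def sorted_suffix_pointers(words):
    """Positions of all word suffixes, in lexicographic order of the suffixes."""
    entries, pos = [], 1
    for w in words:
        entries.extend((w[i:], pos + i) for i in range(len(w)))
        pos += len(w) + 1                      # skip the separator after each word
    entries.sort()
    return [p for _, p in entries]

def chunk(ptrs, size):
    """Group pointers into nodes; fold a short tail into the previous node."""
    nodes = [ptrs[i:i + size] for i in range(0, len(ptrs), size)]
    if len(nodes) > 1 and len(nodes[-1]) < size:
        nodes[-2:] = [nodes[-2] + nodes[-1]]
    return nodes

def build_levels(leaf_ptrs, size=4):
    """Leaf level first, root level last."""
    levels = [chunk(leaf_ptrs, size)]
    while len(levels[-1]) > 1:
        # Propagate the leftmost and rightmost pointer of every node upward.
        ends = [p for node in levels[-1] for p in (node[0], node[-1])]
        levels.append(chunk(ends, size))
    return levels

levels = build_levels(sorted_suffix_pointers(WORDS))
for level in reversed(levels):
    print(level)
# The root level prints as [[1, 8, 28, 31]]
```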
Searching using this index structure is inefficient, because the keys are external and multiple I/Os are required to fetch them.
Each node is implemented as a modified Patricia Trie.
[Figure: Patricia Trie for the node with pointers 1 16 25 10, covering the suffixes aid, ate, atent, attenuate]

String B-Tree -- Searching
Find all occurrences of P = te in S
Start at root [1 8 28 31]: t > n and t < z, branch right
Child node [28 39 29 31]: t ≥ t and t < z, branch right
Child node [29 12 36 31]: te ≥ te and te < tl, branch left
Leaf node [29 17 26 12]: P = te found at: S[17,18], S[26,27], S[12,13]
[Figure: the Patricia Trie of each visited node, with the strings aid, m, nt, zoo at the root; nt, st, t, zoo at the second level; t, tte, tlas, zoo at the third level; and t, te, tent, tenuate in the leaf]
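The outcome of this search can be checked with a flat stand-in for the tree: a binary search over the sorted suffixes, followed by a scan of the entries that start with P. This is not the String B-Tree descent itself (there are no nodes or Patricia Tries here); the single-separator layout and 1-based positions are assumptions.

```python
from bisect import bisect_left

WORDS = ["aid", "atom", "attenuate", "car", "patent", "zoo", "atlas"]

def sorted_suffixes(words):
    """(suffix, position) pairs in lexicographic order; positions are 1-based."""
    entries, pos = [], 1
    for w in words:
        entries.extend((w[i:], pos + i) for i in range(len(w)))
        pos += len(w) + 1                  # one separator between words
    return sorted(entries)

def find_prefix(entries, pattern):
    """Positions of all suffixes that start with pattern, in sorted-suffix order."""
    suffixes = [s for s, _ in entries]
    i = bisect_left(suffixes, pattern)     # first suffix >= pattern
    hits = []
    while i < len(entries) and suffixes[i].startswith(pattern):
        hits.append(entries[i][1])
        i += 1
    return hits

entries = sorted_suffixes(WORDS)
print(find_prefix(entries, "te"))   # [17, 26, 12]
```

The three positions agree with the leaf-node result S[17,18], S[26,27], S[12,13].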

Multi-D Index -- Construction
Input: a set of pairs of strings (not necessarily of the same length)
Dimension X   Dimension Y
abce          magh
abcd          makk
abqs          makk
abqs          mdbc
alaa          magz
almn          mazz
abqa          maza
abzz          mdyz
Step 1. Store the pairs of strings (separated properly) consecutively on disk:
#abce$magh#abcd$makk#abqs$makk#abqs$mdbc#alaa$magz#almn$mazz#abqa$maza#abzz$mdyz#
The separating symbols sit at offsets 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75.
Step 2. Create index leaf nodes, storing pointers to separating symbols.
Step 3. Construct internal nodes (until the root is constructed).
R-trees and MBR computation are used for building up the index.
[Figure: leaf pointers 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75; internal node entries 10 20 5 35 60 50 45 75; bounding rectangles MBR1 and MBR2]
Searching using this index structure is inefficient, because the keys are external and multiple I/Os are required to fetch them.
At each node, for each dimension, create an ‘Elided Trie’ (E-Trie). E-Tries are very similar to Patricia Tries.
For searches, use the E-Tries in a similar manner as the Patricia Tries (during the downward traversal of the index tree).

Multi-D Index -- Searching
Prefix search: Q1 = (abc*, makk*)
Start at the root E-Tries and repeat {
  x-dim: abc* can only be in the left MBR
  y-dim: makk* can be in both MBRs
  Compute the intersection: examine only the left MBR
  ….. until a leaf index node is reached ….
}
Step k (leaf page) { // compute candidates
  x-dim: string pair @ 0, string pair @ 10
  y-dim: string pair @ 10, string pair @ 20
  Answer to query = the intersection
}
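The leaf-level step, candidates per dimension followed by an intersection, can be sketched directly on the serialized string. The sketch scans all pairs instead of descending an R-tree with E-Tries, so it only illustrates the final intersection; the function names are illustrative.

```python
DATA = ("#abce$magh#abcd$makk#abqs$makk#abqs$mdbc"
        "#alaa$magz#almn$mazz#abqa$maza#abzz$mdyz#")

def parse_pairs(data: str) -> dict:
    """Map the offset of each leading '#' to its (x, y) string pair."""
    pairs, pos = {}, 0
    while pos < len(data) - 1:
        end = data.index("#", pos + 1)          # next pair separator
        x, y = data[pos + 1:end].split("$")     # '$' separates the two dimensions
        pairs[pos] = (x, y)
        pos = end
    return pairs

def prefix_query(pairs: dict, x_prefix: str, y_prefix: str) -> list:
    """Offsets of pairs matching both prefixes (intersection of per-dimension hits)."""
    x_hits = {p for p, (x, _) in pairs.items() if x.startswith(x_prefix)}
    y_hits = {p for p, (_, y) in pairs.items() if y.startswith(y_prefix)}
    return sorted(x_hits & y_hits)

pairs = parse_pairs(DATA)
print(prefix_query(pairs, "abc", "makk"))   # [10]
print(pairs[10])                            # ('abcd', 'makk')
```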

Suffix Tree [Gusfield ‘97]
A Suffix Tree for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m.
Each internal node (except the root) has at least 2 children and each edge is labeled with a nonempty substring of S.
No 2 edges out of a node can have edge-labels beginning with the same character.
The key feature of the Suffix Tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i, i.e., S [i..m].
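
Suffix Tree
Input: string S = xabxa; add $ at the end (so that no suffix of S is a prefix of another suffix).
[Figure: Suffix Tree for S = xabxa$ with leaves 1 to 6. Edge xa leads to leaf 1 (via bxa$) and leaf 4 (via $); edge a leads to leaf 2 (via bxa$) and leaf 5 (via $); edge bxa$ leads to leaf 3; edge $ leads to leaf 6]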
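
Suffix Tree -- Searching
S = x a b x a $ (positions 1 2 3 4 5 6)
Find all occurrences of P = xa in S
Match P along a path from the root; the leaves below the point where P ends (here 1 and 4) are the start positions of its occurrences.
[Figure: the Suffix Tree for S = xabxa$ with the path labeled xa leading to leaves 1 and 4]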
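
Generalized Suffix Tree
An ST can be built for more than one string.
S1 = x a b x a $ (positions 1 to 6)
S2 = b x a d $ (positions 1 to 5)
[Figure: generalized Suffix Tree for S1 and S2; each leaf is labeled with a position and a string number, e.g. 3,1 for the suffix of S1 starting at position 3 and 5,2 for the suffix of S2 starting at position 5]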

Desired for the Indexing Technique
Relatively fast construction, reasonable amount of storage consumption (persistently stored);
Allows huge sequences to be indexed; Supports versatile queries over data;
+
Supports bioinformatics applications!

Applicability for Sequence Biological Data
Data Structure: Suitable for bio-data indexing?
Q-grams: Yes. BLAST uses a very similar idea. Provides high similarity matching, suitable for some bioinformatics applications.
String B-Tree: Yes? A DNA sequence cannot be broken into words, but can we exploit the repeats?
Multi-D Index: Yes? Can we view promoters, genes, exons, introns, etc. as attributes in a DB?
Suffix Tree: Yes? Slow construction, limited input sequence size, size of index ≈ 10x size of input, but supports versatile queries over data.

Suffix Trees: A closer look
Suffix Trees are well known in the biological sequence processing field
Recent advances in Suffix Tree construction algorithms
Suffix Trees provide support for answering versatile biological questions

Suffix Tree (ST) Applications
REPuter [Kurtz ‘99]: The REPuter program family provides state-of-the-art software solutions to compute and visualize repeats in whole genomes or chromosomes.
MUMmer [Delcher ‘99, ‘02, ‘04]: MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. The NUCmer program aligns contigs from a shotgun sequencing project to another set of contigs or a genome.

ST Construction Algorithms History
[Weiner ‘73] First linear time algorithm to build Suffix Tree (called Position Tree).
[McCreight ‘76] A more space efficient solution.
[Ukkonen ‘95] Presents a variation of [McCreight ‘76], but much easier to understand, to prove bounds, and to implement.
All these algorithms are in-memory algorithms. In practice, the sequences to be indexed are large and cannot fit in memory; the corresponding ST is ≈ 10x bigger.

Advances in ST Construction Algorithms
[Hunt ‘01] Abandons the use of suffix links (the algorithm is no longer linear); presents the idea of partitioning to reduce the number of disk I/Os.
[Giegerich ‘03] Proposes a space-efficient representation of the ST.
[Tata ‘04] Extends the ideas in [Hunt ‘01] and [Giegerich ‘03]; focuses on the development of an efficient buffering strategy.
[Tata ‘04] builds an ST on the entire human genome (approx. 3 Gbp) in 30 hours, using a single-processor machine; even for the in-memory case, [Tata ‘04], which is O(m²), performs better than [Ukkonen ‘95], which is O(m).

Versatile Biological Support by ST
Exact search (with or without wild cards)
Approximate search
[Longest] Common substring/subsequence of 2 (or more) strings
- Recognizing DNA contamination
- Alignment
[Shortest] Superstring of 2 (or more) strings
- Shotgun sequencing and sequence assembly
Finding repeats in a single sequence
Compressing DNA strings to study the information content of a string or to discriminate between exons and introns in eukaryotic DNA
….

Suffix Tree Representations
Suffix Array [Manber ‘93, Myers ‘94, Baeza-Yates ‘00]
LC-tries [Anderson ‘95]
Suffix Binary Search Tree [Irving ‘03]

Conclusion
BLAST Case Study
Observations on existing searching techniques
Alternative indexing techniques for sequence data and their possible application for biological sequence data
Suffix Trees

Future Work
Suffix Tree Construction
- Further improvements of the [Tata ‘04] algorithm (time/space)
- Combining two (or more) Suffix Trees
- Suffix Tree maintenance
Suffix Tree Usage
- Most of the widely known ST-based algorithms rely on suffix links. How will the algorithms that use STs change in the absence of suffix links?
- Potential of ST for mining biodata
Alternative Index Data Structures
- “Families of reiterated sequences account for about one third of the human genome.” [McConkey ‘93]

References
[Altschul ‘90] S.F. Altschul et al. “Basic local alignment search tool”. J. Mol. Biol., 215:403-10, 1990.
[Altschul ‘97] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”. Nucleic Acids Research, 25:3389-3402, 1997.
[Anderson ‘95] A. Andersson and S. Nilsson. “Efficient implementation of suffix trees”. Softw. Pract. Exp., 25(2):129-141, 1995
[Baeza-Yates ‘00] R. Baeza-Yates and G. Navarro. “A Hybrid Indexing Method for Approximate String Matching”. Journal of Discrete Algorithms, 2000.
[Delcher ‘99] A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. “Alignment of Whole Genomes”. Nucleic Acids Research, 27:2369-2376, 1999.
[Ferragina ‘99] P. Ferragina and R. Grossi. “The string B-tree: a new data structure for string search in external memory and its applications”. Journal of the ACM, 46(2):236-280, 1999
[Giegerich ‘03] R. Giegerich, S. Kurtz, and J. Stoye. “Efficient implementation of lazy suffix trees”. Softw. Pract. Exper. 2003; 33:1035-1049, 2003
[Gusfield ‘97] D. Gusfield. “Algorithms on strings, trees and sequences : computer science and computational biology”. Cambridge University Press, 1997
[Hunt ‘01] E. Hunt, M.P. Atkinson, and R.W. Irving. “A Database Index to Large Biological Sequences”. In VLDB J., 7(3):139-148, 2001
[Irving ‘03] R.W. Irving and L. Love. “The Suffix Binary Search Tree and Suffix AVL Tree”. Journal of Discrete Algorithms, 1 (2003) 387–408, 2003.
[Jagadish ‘00] H.V. Jagadish, Nick Koudas, and Divesh Srivastava. “On effective multi-dimensional indexing for strings”. In ACM SIGMOD Conference on Management of Data, pages 403-414, 2000.
[Kurtz ‘99] S. Kurtz and C. Schleiermacher. “REPuter: fast computation of maximal repeats in complete genomes”. Bioinformatics, pages 426-427, 1999
[Manber ‘93] U. Manber and G. Myers. “Suffix arrays: a new method for on-line string searches”. SIAM J. Comput., 22(5):935-948, 1993.
[McConkey ‘93] E. McConkey. “Human Genetics: The Molecular Revolution”. Jones and Bartlett, Boston, MA, 1993
[McCreight ‘76] E.M. McCreight. “A Space-economical Suffix Tree Construction Algorithm”. J. ACM, 23(2):262-272, 1976
[Myers ‘94] E. W. Myers. “A sublinear algorithm for approximate key word searching”. Algorithmica,12(4/5):345-374, 1994.
[Navarro ‘98] G. Navarro and R. Baeza-Yates. “A practical q-gram index for text retrieval allowing errors”. CLEI Electronic Journal, 1(2), 1998
[Navarro ‘00a] G. Navarro. “A Guided Tour to Approximate String Matching”. ACM Computing Surveys,33:1:31-88, 2000.
[Navarro ‘00b] G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. “Indexing Text with Approximate q-grams”. In CPM2000, LNCS 1848, pages 350-365, 2000
[Tata ‘04] S. Tata, R.A. Hankins, and J. Patel. “Practical Suffix Tree Construction”. In Proc. of the 30th VLDB, 2004
[Thompson ‘94] J. D. Thompson, D. G. Higgins, and T. J. Gibson. “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice”. In Nucleic Acids Research, Vol. 22, No. 22 4673-4680, 1994
[Ukkonen ‘95] E. Ukkonen. “On-line construction of suffix-trees”. Algorithmica 14 (1995), 249-260, 1995
[Weiner ‘73] P. Weiner. “Linear Pattern Matching Algorithms”. In Proc. of the 14th Annual Symposium on Switching and Automata Theory, 1973

Thank You!