Cache-Aware Parallel Approximate String
Search and Join Algorithms Using BWT
Jiaying Wang, Xiaochun Yang, and Bin Wang
Mar 22, 2013
Northeastern University, China
EDBT/ICDT 2013 Scalable String Similarity Search/Join workshop
Outline
Motivation
Problem statement
Approximate search method
Cache aware parallel framework
Pruning technique
Multi query optimization
Look ahead verification
Approximate join method
Experiment
Conclusion
Motivation
Searching a genome database for DNA sequences similar to
"ACGTACAATATTAG".
Results:
ACGTACAATATTAG is similar to
...ACGTACATTATTAG...
...ACGTAAAATATTAG...
...ACGTACAAATTTAG...
Problem Statement
Given a string set C, a pattern p, and a threshold τ:
Approximate search: return every Ti ∈ C with ed(p, Ti) ≤ τ.
Approximate join: find all pairs (T1, T2) ∈ C × C with ed(T1, T2) ≤ τ,
as fast as possible.
Search example: p = “Majaura”, τ = 1
T1: “Majaura”
T2: “Deghli”
T3: “Madhura”
Join example: τ = 1 over the same strings T1, T2, T3
Result shown on the slide: <s1,s3>
Existing well-known solutions
Metric-space-based methods
Signature-based methods (q-gram, q-chunk + q-gram)
Idea:
Filter (full filter, prefix filter) + verify
If a string s is similar to the pattern p, at least a lower-bound (LB)
number of their signatures must be identical.
#gram = |p| − q + 1
LB = #gram − q×τ (prefix filter length: PF = q×τ + 1)
Tree/Trie
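The count-filter bound above can be sketched as follows. This is a minimal illustration, assuming overlapping q-grams compared as multisets; the function names are hypothetical and are not taken from the slides.

```python
from collections import Counter

def qgrams(s, q):
    """All overlapping q-grams of s; there are |s| - q + 1 of them."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def count_lower_bound(p, q, tau):
    """LB = #gram - q*tau, with #gram = |p| - q + 1."""
    return (len(p) - q + 1) - q * tau

def prefix_length(q, tau):
    """PF = q*tau + 1, the prefix-filter length from the slide."""
    return q * tau + 1

def passes_count_filter(p, s, q, tau):
    """True if p and s share at least LB q-grams (multiset intersection)."""
    common = Counter(qgrams(p, q)) & Counter(qgrams(s, q))
    return sum(common.values()) >= count_lower_bound(p, q, tau)

lb = count_lower_bound("Majaura", 2, 1)
```

A string that fails the filter cannot be within τ of the pattern; one that passes still has to be verified.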
BWTPA Index
Text: Majaura$Deghli$Madhura# (rotations sorted; L is the last character of each rotation)

Row  Sorted rotation (without last char)  L  SA  PA
 1   #Majaura$Deghli$Madhur               a  22  3
 2   $Deghli$Madhura#Majaur               a   7  1
 3   $Madhura#Majaura$Deghl               i  14  2
 4   Deghli$Madhura#Majaura               $   8  2
 5   Madhura#Majaura$Deghli               $  15  3
 6   Majaura$Deghli$Madhura               #   0  1
 7   a#Majaura$Deghli$Madhu               r  21  3
 8   a$Deghli$Madhura#Majau               r   6  1
 9   adhura#Majaura$Deghli$               M  16  3
10   ajaura$Deghli$Madhura#               M   1  1
11   aura$Deghli$Madhura#Ma               j   3  1
12   dhura#Majaura$Deghli$M               a  17  3
13   eghli$Madhura#Majaura$               D   9  2
14   ghli$Madhura#Majaura$D               e  10  2
15   hli$Madhura#Majaura$De               g  11  2
16   hura#Majaura$Deghli$Ma               d  18  3
17   i$Madhura#Majaura$Degh               l  13  2
18   jaura$Deghli$Madhura#M               a   2  1
19   li$Madhura#Majaura$Deg               h  12  2
20   ra#Majaura$Deghli$Madh               u  20  3
21   ra$Deghli$Madhura#Maja               u   5  1
22   ura#Majaura$Deghli$Mad               h  19  3
23   ura$Deghli$Madhura#Maj               a   4  1
T1:“Majaura”
T2:“Deghli”
T3:“Madhura”
The BWTPA index contains:
BWT: the last column L, used to simulate the suffix array (SA)
PA: for each SA entry, the ID of the string that position belongs to
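The three columns of the example can be reproduced with a short sketch. It assumes the strings are joined with '$', closed with '#', and that rotations are sorted by plain character code; it sorts rotations naively, which is fine for a toy example but is not how a real index would be built. The function name build_bwtpa is hypothetical.

```python
def build_bwtpa(strings):
    """Return (text, sa, bwt, pa) for strings joined by '$' and closed by '#'."""
    text = "$".join(strings) + "#"
    n = len(text)
    # Sort rotation start positions by the rotation they begin
    # ('#' < '$' < 'A'..'Z' < 'a'..'z' in character-code order).
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # L column: the character just before each rotation's start (cyclically).
    bwt = "".join(text[i - 1] for i in sa)
    # owner[i] = 1-based ID of the string that text position i falls in;
    # a terminator is attributed to the string it closes.
    owner, sid = [], 1
    for ch in text:
        owner.append(sid)
        if ch == "$":
            sid += 1
    pa = [owner[i] for i in sa]
    return text, sa, bwt, pa

text, sa, bwt, pa = build_bwtpa(["Majaura", "Deghli", "Madhura"])
```

For the three example strings this yields the same L, SA, and PA columns as the slide.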
Approximate String Search
If ed(x, y) ≤ τ and x is split into τ+1 segments, at least one segment of x matches a substring of y exactly.
Partition p into τ+1 segments of (almost) equal length:
r = |p| mod (τ+1)
The first r segments have length ⌊|p|/(τ+1)⌋ + 1
The remaining τ+1−r segments have length ⌊|p|/(τ+1)⌋
Example (τ = 2):
X = Madaura, partitioned into Mad | au | ra
Y = Majaula
Steps: partition the string, find a match (segment “au” occurs in Y), then verify.
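The partition rule and the "find a match" step can be sketched as below. The helper names are hypothetical, and the substring test uses Python's built-in `in` rather than the index.

```python
def partition(p, tau):
    """Split p into tau+1 segments; the first r = |p| mod (tau+1) are one longer.

    Returns a list of (start_offset, segment) pairs.
    """
    k = tau + 1
    base, r = divmod(len(p), k)
    segments, start = [], 0
    for i in range(k):
        length = base + 1 if i < r else base
        segments.append((start, p[start:start + length]))
        start += length
    return segments

def is_candidate(p, y, tau):
    """y needs verification only if some segment of p occurs in y exactly."""
    return any(seg in y for _, seg in partition(p, tau))

parts = partition("Madaura", 2)
```

With τ = 2, "Madaura" splits into Mad, au, ra, and "Majaula" becomes a candidate because it contains "au".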
Find Exact Substring
[Figure: the same BWTPA index table (sorted rotations, L, SA, PA) for T1, T2, T3 as on the BWTPA Index slide.]
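The slide does not spell out the search procedure. The sketch below uses standard BWT backward search, which is an assumption about what the index does here, with naive rank counting in place of the sampled occurrence tables a real implementation would use. All function names are hypothetical.

```python
def build_index(strings):
    """Naive BWT / SA / PA construction for a small example."""
    text = "$".join(strings) + "#"
    sa = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    bwt = "".join(text[i - 1] for i in sa)
    ends = [i for i, ch in enumerate(text) if ch in "$#"]
    # String ID = 1 + number of terminators strictly before the position.
    pa = [1 + sum(e < i for e in ends) for i in sa]
    return sa, bwt, pa

def backward_search(bwt, pattern):
    """Return the half-open row range [lo, hi) of rotations starting with pattern."""
    first_col = sorted(bwt)
    c = {}
    for row, ch in enumerate(first_col):
        c.setdefault(ch, row)          # first row whose rotation starts with ch
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):       # extend the match one character leftwards
        if ch not in c:
            return 0, 0
        lo = c[ch] + bwt.count(ch, 0, lo)
        hi = c[ch] + bwt.count(ch, 0, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi

def find_exact(segment, sa, bwt, pa):
    """List (text position, string ID) for every occurrence of segment."""
    lo, hi = backward_search(bwt, segment)
    return sorted((sa[k], pa[k]) for k in range(lo, hi))

sa, bwt, pa = build_index(["Majaura", "Deghli", "Madhura"])
hits = find_exact("ura", sa, bwt, pa)
```

Searching "ura" lands on the last two rows of the table, giving positions 4 (in T1) and 19 (in T3).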
Pruning Techniques
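Length Filtering:
For a pattern P and threshold τ, only strings whose length lies in [|P| − τ, |P| + τ] can be answers.
[Figure: strings are grouped by length into buckets Bi[lmin, lmax], e.g. B1[5,6] (Dehri…Deghli), B2[7,8] (Majaura$M…, Mautern in$…), …, Bi[32,33], under a tree of length ranges: [5,33]; [5,16], [16,32]; [5,8], …, [20,33].]
Example: searching “Dehli” with τ = 1 only needs strings of length 4 to 6.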
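A minimal sketch of length filtering over length buckets follows. The bucket width of 2 mirrors B1[5,6] and B2[7,8] on the slide, but the bucketing function and the sample strings are illustrative assumptions.

```python
from collections import defaultdict

def bucket_of(length, width=2):
    """Map a length to its bucket (lmin, lmax); width 2 gives [5,6], [7,8], ..."""
    lmin = (length - 1) // width * width + 1
    return lmin, lmin + width - 1

def build_buckets(strings, width=2):
    buckets = defaultdict(list)
    for s in strings:
        buckets[bucket_of(len(s), width)].append(s)
    return buckets

def length_candidates(buckets, pattern, tau):
    """Keep only strings whose length lies in [|P| - tau, |P| + tau]."""
    lo, hi = len(pattern) - tau, len(pattern) + tau
    out = []
    for (lmin, lmax), members in buckets.items():
        if lmax < lo or lmin > hi:
            continue                   # whole bucket is out of range: skip it
        out.extend(s for s in members if lo <= len(s) <= hi)
    return sorted(out)

buckets = build_buckets(["Dehri", "Deghli", "Majaura", "Mautern"])
cands = length_candidates(buckets, "Dehli", 1)
```

Here the [7,8] bucket is skipped without touching any of its strings.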
Pruning Techniques
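Position Filtering
A candidate alignment has three parts: prefix, matched segment, and suffix.
Example with τ = 2:
            prefix | match | suffix
Pattern:    Mad    | au    | ra
String:     F      | au    | lty
Lower bounds: τp ≥ 2 (prefix), τs ≥ 1 (suffix)
τp + τs ≥ 3 > τ, so this alignment is pruned.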
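One way to obtain such lower bounds is from length differences alone, since the edit distance between two strings is at least the difference of their lengths. The sketch below assumes that bound (the slides may use a tighter one) and applies it to the slide's example, pattern Madaura against the string F|au|lty; the function name is hypothetical.

```python
def position_prune(p, seg_start, seg_len, t, match_pos, tau):
    """True if aligning p[seg_start:seg_start+seg_len] with t at match_pos
    cannot give ed(p, t) <= tau, judging only by prefix/suffix lengths."""
    prefix_lb = abs(seg_start - match_pos)                # tau_p lower bound
    p_suffix = len(p) - (seg_start + seg_len)
    t_suffix = len(t) - (match_pos + seg_len)
    suffix_lb = abs(p_suffix - t_suffix)                  # tau_s lower bound
    return prefix_lb + suffix_lb > tau

# Pattern "Madaura", segment "au" at offset 3; it occurs in "Faulty" at offset 1.
pruned = position_prune("Madaura", 3, 2, "Faulty", 1, 2)
```

The check costs a few subtractions per match, so it can run before any edit-distance computation.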
Multiple Queries
ID  String  τ  Segments
1   lodna   1  lod | na
2   lodnna  2  lo | dn | na
3   Dehri   1  Deh | ri
4   dehri   1  deh | ri
5   loddna  2  lo | dd | na
[Figure, built up over three slides: the query segments are inserted one by one into a shared trie, so repeated segments (“na”, “lo”, “ri”) share nodes.]
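The benefit of sharing segments across queries can be sketched with a plain dictionary in place of the trie shown on the slides; that substitution, and the helper names, are assumptions made for brevity.

```python
from collections import defaultdict

def partition(p, tau):
    """Split p into tau+1 near-equal segments as (start_offset, segment) pairs."""
    k = tau + 1
    base, r = divmod(len(p), k)
    out, start = [], 0
    for i in range(k):
        length = base + 1 if i < r else base
        out.append((start, p[start:start + length]))
        start += length
    return out

def shared_segments(queries):
    """Map each distinct segment to the (query ID, offset) pairs that use it."""
    index = defaultdict(list)
    for qid, (s, tau) in queries.items():
        for start, seg in partition(s, tau):
            index[seg].append((qid, start))
    return index

queries = {1: ("lodna", 1), 2: ("lodnna", 2), 3: ("Dehri", 1),
           4: ("dehri", 1), 5: ("loddna", 2)}
index = shared_segments(queries)
```

The five queries produce 12 segments but only 8 distinct ones, so 8 exact searches are enough.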
Look Ahead Verification
[Figure: two edit-distance matrices. Case 1 compares “similarly” with “similalry”; case 2 compares “reference” with “different”.]
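The slides do not spell out the look-ahead rule, so the sketch below is not the authors' algorithm. It shows the standard thresholded edit-distance check with early termination, which has the same goal: abandon verification as soon as the threshold can no longer be met. The function name is hypothetical.

```python
def within_threshold(a, b, tau):
    """True iff ed(a, b) <= tau; stops once every cell of a DP row exceeds tau."""
    if abs(len(a) - len(b)) > tau:
        return False                    # length difference alone rules it out
    prev = list(range(len(b) + 1))      # row 0: distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        if min(cur) > tau:
            return False                # row minima never decrease: give up
        prev = cur
    return prev[-1] <= tau

ok = within_threshold("similarly", "similalry", 2)
```

The early exit is safe because the minimum of a DP row can only stay the same or grow in the next row.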
Cache-aware Parallel Optimization
Storage level   Access latency
Registers       0.3~0.5 ns
Cache           1~10 ns
Memory          80~200 ns
Disk            10,000,000 ns
[Figure: four workers (work1 to work4) process buckets B2, B5, B7, B8, B9, ..., Bn of the concatenated string data (Dehri$Majaura..., Deghli$lodna..., Madhura$...Mautern ...); some workers are shown idle.]
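The bucket-per-worker scheduling can be sketched with a thread pool. This only illustrates how buckets are handed to workers: Python threads do not give the cache or multi-core behaviour of the C++ implementation mentioned in the experiments, and the bucket contents and function names here are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def edit_distance(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search_bucket(bucket, pattern, tau):
    """Work unit: scan one bucket and keep strings within tau of the pattern."""
    return [s for s in bucket if edit_distance(s, pattern) <= tau]

def parallel_search(buckets, pattern, tau, workers=4):
    """Hand one bucket at a time to whichever worker is free."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: search_bucket(b, pattern, tau), buckets)
        return sorted(s for part in parts for s in part)

buckets = [["Dehri", "Majaura"], ["Deghli", "lodna"], ["Madhura", "Mautern"]]
hits = parallel_search(buckets, "Dehli", 1)
```

Keeping each work unit to a single bucket means a worker touches one contiguous block of data at a time.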
Approximate String Join
Incremental Approximate String Join
ed(S1, S2) ≤ τ implies ed(S2, S1) ≤ τ
so the symmetric case can be skipped
stop the search on reaching an ID ≥ the current ID
Trie-based Approximate Join
build a reversed segment trie first
avoid repeating the search for duplicated segments
Pruning techniques
count filter
stop searching the current segment when only one candidate remains, since it must be the string itself.
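The symmetry argument of the incremental join can be sketched as a loop that compares each string only against strings with a smaller ID. The sketch uses a nested loop with direct verification instead of the index, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def incremental_self_join(strings, tau):
    """Return pairs (i, j) of 1-based IDs with i < j and ed <= tau."""
    pairs = []
    for cur_id, s in enumerate(strings, 1):
        # Only earlier IDs are probed: (j, i) is implied by (i, j),
        # and stopping at cur_id also excludes the self-pair.
        for other_id in range(1, cur_id):
            if edit_distance(strings[other_id - 1], s) <= tau:
                pairs.append((other_id, cur_id))
    return pairs

pairs = incremental_self_join(["Majaura", "Deghli", "Madhura"], 2)
```

Each unordered pair is verified at most once, which halves the work compared with probing all of C for every string.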
Experiment
Environment
C++ language
PC with 2.93 GHz Intel Core CPU
4 GB main memory
Ubuntu operating system (Linux distribution).
Data sets
Geographical names
DBLP authors
Human genome reads
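Search Performance
[Figure: search performance plots; panels labeled Geo, DBLP, reads.]
Join Performance on DBLP
[Figure: join performance plots; panels labeled Geo, DBLP, reads.]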
Conclusions
A new index BWTPA
Cache-aware multi core framework
Efficient pruning techniques
Length filter
Position filter
Look-ahead algorithm to speed up edit distance verification
Approximate string join
Incremental
Trie-based
Thank you!