
Cache-Aware Parallel Approximate String Search and Join Algorithms Using BWT

Jiaying Wang, Xiaochun Yang, and Bin Wang

Mar 22, 2013

Northeastern University, China

EDBT/ICDT 2013 Scalable String Similarity Search/Join workshop

Outline

Motivation

Problem statement

Approximate search method

Cache-aware parallel framework

Pruning techniques

Multi-query optimization

Look-ahead verification

Approximate join method

Experiment

Conclusion

Motivation


Searching for DNA sequences similar to "ACGTACAATATTAG" in a genome database.

Results:

ACGTACAATATTAG is similar to

...ACGTACATTATTAG...

...ACGTAAAATATTAG...

...ACGTACAAATTTAG...

Problem Statement

Search: given a string set C, a pattern p, and a threshold τ, return every Ti ∈ C with ed(p, Ti) ≤ τ.

Join: find all pairs (T1, T2) ∈ C × C with ed(T1, T2) ≤ τ, as fast as possible.

Example string set: T1: “Majaura”, T2: “Deghli”, T3: “Madhura”

Search example: p = “Majaura”, τ = 1

Join example: τ = 1, result pair shown on the slide: <s1,s3>

Existing solutions

Metric-space-based methods

Signature-based methods (q-gram, q-chunk + q-gram)

Idea: filter (full filter, prefix filter) + verify

If a string s is similar to the pattern p, at least a lower-bound (LB) number of their signatures must be identical.

#gram = |p| - q + 1

LB = #gram - q × τ (prefix filter: prefix length PF = q × τ + 1)

Tree/trie-based methods
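The signature-based bound above can be checked on the running example. The following is a minimal sketch, assuming q = 2 and the standard count filter on shared q-grams; the function names are illustrative and do not come from the slides.

```python
from collections import Counter

def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def count_lower_bound(p, q, tau):
    """LB = #gram - q * tau, with #gram = |p| - q + 1."""
    return (len(p) - q + 1) - q * tau

def passes_count_filter(p, s, q, tau):
    """Keep s as a candidate only if it shares at least LB q-grams with p."""
    common = sum((Counter(qgrams(p, q)) & Counter(qgrams(s, q))).values())
    return common >= count_lower_bound(p, q, tau)

# "Majaura" has 6 2-grams; with tau = 1 at least 4 must be shared.
print(count_lower_bound("Majaura", 2, 1))                 # 4
# "Majaura" and "Madhura" share Ma, ur, ra (3 grams).
print(passes_count_filter("Majaura", "Madhura", 2, 1))    # False
print(passes_count_filter("Majaura", "Madhura", 2, 2))    # True
```

Candidates that pass the filter still need verification with a real edit-distance computation.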

BWTPA Index

Indexed strings: T1: “Majaura”, T2: “Deghli”, T3: “Madhura”, concatenated as Majaura$Deghli$Madhura#

Row  Sorted rotation (first 22 chars)  L  SA  PA
 1   #Majaura$Deghli$Madhur            a  22  3
 2   $Deghli$Madhura#Majaur            a   7  1
 3   $Madhura#Majaura$Deghl            i  14  2
 4   Deghli$Madhura#Majaura            $   8  2
 5   Madhura#Majaura$Deghli            $  15  3
 6   Majaura$Deghli$Madhura            #   0  1
 7   a#Majaura$Deghli$Madhu            r  21  3
 8   a$Deghli$Madhura#Majau            r   6  1
 9   adhura#Majaura$Deghli$            M  16  3
10   ajaura$Deghli$Madhura#            M   1  1
11   aura$Deghli$Madhura#Ma            j   3  1
12   dhura#Majaura$Deghli$M            a  17  3
13   eghli$Madhura#Majaura$            D   9  2
14   ghli$Madhura#Majaura$D            e  10  2
15   hli$Madhura#Majaura$De            g  11  2
16   hura#Majaura$Deghli$Ma            d  18  3
17   i$Madhura#Majaura$Degh            l  13  2
18   jaura$Deghli$Madhura#M            a   2  1
19   li$Madhura#Majaura$Deg            h  12  2
20   ra#Majaura$Deghli$Madh            u  20  3
21   ra$Deghli$Madhura#Maja            u   5  1
22   ura#Majaura$Deghli$Mad            h  19  3
23   ura$Deghli$Madhura#Maj            a   4  1

The BWTPA index contains:

BWT (the L column): simulates the suffix array (SA)

PA: records, for each SA entry, the ID of the string it belongs to
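The index columns for the three example strings can be reproduced with a few lines. This is a naive sketch for illustration only: it sorts full rotations, which is far too slow for real data, and the function name `build_bwtpa` is an assumption, not the authors' API.

```python
from bisect import bisect_left

def build_bwtpa(strings):
    """Naive BWT, SA and PA construction over '$'-separated strings ending in '#'."""
    text = "$".join(strings) + "#"
    n = len(text)
    # Sort all rotations; '#' and '$' sort before letters in code point order.
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # L column: the character preceding each rotation start (wraps around for i = 0).
    bwt = "".join(text[i - 1] for i in sa)
    # Position of the terminator that closes each string.
    ends, pos = [], -1
    for s in strings:
        pos += len(s) + 1
        ends.append(pos)
    # PA: 1-based ID of the string that contains each SA position.
    pa = [bisect_left(ends, i) + 1 for i in sa]
    return text, bwt, sa, pa

text, bwt, sa, pa = build_bwtpa(["Majaura", "Deghli", "Madhura"])
print(text)     # Majaura$Deghli$Madhura#
print(bwt)      # aai$$#rrMMjaDegdlahuuha
print(sa[:6])   # [22, 7, 14, 8, 15, 0]
print(pa[:6])   # [3, 1, 2, 2, 3, 1]
```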

Approximate String Search

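If a string x is within edit distance τ of a string y, then whenever x is cut into τ+1 disjoint segments, at least one segment matches a substring of y exactly.

Partition p into τ+1 segments of (almost) equal length:

r = |p| mod (τ+1)

The first r segments have length ⌊|p|/(τ+1)⌋ + 1

The remaining τ+1-r segments have length ⌊|p|/(τ+1)⌋

Example (τ = 2): X = Madaura is partitioned into Mad | au | ra; the segment "au" matches a substring of Y = Majaula.

Steps: partition the string, find a match, verification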
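The partition rule above can be written down directly. The sketch below is a minimal illustration; `partition` and `matching_segments` are hypothetical helper names, and the substring test stands in for the index lookup.

```python
def partition(p, tau):
    """Split p into tau + 1 segments; the first r = |p| mod (tau + 1) are one longer."""
    k = tau + 1
    base, r = divmod(len(p), k)
    segments, start = [], 0
    for i in range(k):
        length = base + 1 if i < r else base
        segments.append(p[start:start + length])
        start += length
    return segments

def matching_segments(x, y, tau):
    """Segments of x that occur exactly in y (candidates for verification)."""
    return [seg for seg in partition(x, tau) if seg in y]

print(partition("Madaura", 2))                      # ['Mad', 'au', 'ra']
print(matching_segments("Madaura", "Majaula", 2))   # ['au']
```

If no segment matches, y cannot be within distance τ of x, so verification is skipped for it.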

Find Exact Substring

(Figure: the same BWTPA index table for T1, T2, T3 as on the BWTPA Index slide, with columns sorted rotations, L, SA, and PA.)
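Exact substring lookup on a BWT is usually done with backward search. The sketch below assumes that standard technique (the slides do not spell out the procedure) and hardcodes the L, SA, and PA columns from the index figure; occurrence counts use a linear scan for clarity, where a real index would use rank structures.

```python
from collections import Counter

# Columns copied from the example index for Majaura$Deghli$Madhura#.
BWT = "aai$$#rrMMjaDegdlahuuha"
SA = [22, 7, 14, 8, 15, 0, 21, 6, 16, 1, 3, 17, 9, 10, 11, 18, 13, 2, 12, 20, 5, 19, 4]
PA = [3, 1, 2, 2, 3, 1, 3, 1, 3, 1, 1, 3, 2, 2, 2, 3, 2, 1, 2, 3, 1, 3, 1]

def backward_search(bwt, pattern):
    """Return the half-open range of sorted rotations that start with pattern."""
    counts = Counter(bwt)
    first, total = {}, 0          # first[c] = number of characters smaller than c
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):   # consume the pattern from its last character
        if c not in first:
            return 0, 0
        lo = first[c] + bwt.count(c, 0, lo)
        hi = first[c] + bwt.count(c, 0, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi

def locate(pattern):
    """(string ID, text position) of every exact occurrence of pattern."""
    lo, hi = backward_search(BWT, pattern)
    return sorted((PA[k], SA[k]) for k in range(lo, hi))

print(locate("au"))    # [(1, 3)]
print(locate("ura"))   # [(1, 4), (3, 19)]
```

The PA column turns each hit into a string ID without walking the text, which is what lets a matched segment be mapped straight to a candidate string.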

Pruning Techniques

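Length Filtering:

A string can match pattern P within threshold τ only if its length lies in [|P|-τ, |P|+τ].

(Figure: strings are grouped by length into buckets Bi[lmin,lmax], e.g. B1[5,6] (Dehri … Deghli), B2[7,8] (Majaura, …), …, up to [32,33], under a hierarchy of length ranges shown as [5,33]; [5,16], [16,32]; [5,8], [20,33].)

Example: search Dehli with τ = 1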
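A small sketch of length filtering over length buckets follows. It assumes fixed-width buckets of two lengths starting at length 5, as the labels B1[5,6] and B2[7,8] suggest; the function names and the flat dictionary (instead of the range hierarchy in the figure) are simplifications for illustration.

```python
from collections import defaultdict

def build_buckets(strings, lmin=5, width=2):
    """Group strings into buckets covering [lmin + b*width, lmin + b*width + width - 1]."""
    buckets = defaultdict(list)
    for s in strings:
        buckets[(len(s) - lmin) // width].append(s)
    return buckets

def length_filter(buckets, pattern, tau, lmin=5, width=2):
    """Scan only buckets overlapping [|P| - tau, |P| + tau], then check exact lengths."""
    lo, hi = len(pattern) - tau, len(pattern) + tau
    candidates = []
    for b in sorted(buckets):
        b_lo = lmin + b * width
        b_hi = b_lo + width - 1
        if b_lo <= hi and lo <= b_hi:          # bucket range intersects [lo, hi]
            candidates += [s for s in buckets[b] if lo <= len(s) <= hi]
    return candidates

buckets = build_buckets(["Dehri", "Deghli", "Majaura", "Madhura"])
print(length_filter(buckets, "Dehli", 1))   # ['Dehri', 'Deghli']
```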

Pruning Techniques

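Position Filtering

Split both strings around the matched segment into three parts: prefix, matched segment, and suffix.

Example (τ = 2): pattern Mad | au | ra against candidate F | au | lty

Prefixes "Mad" and "F": τp ≥ 2; suffixes "ra" and "lty": τs ≥ 1

τp + τs ≥ 3 > τ, so the candidate is pruned.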
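The example above can be reproduced with length differences alone. The sketch below assumes the lower bounds τp and τs are taken as the absolute length differences of the prefixes and suffixes, which yields exactly the values 2 and 1 on the slide; the function name and parameters are illustrative.

```python
def position_filter(pattern, seg_start, seg_len, candidate, match_pos, tau):
    """Return True if the candidate survives, False if it can be pruned.

    seg_start: offset of the matched segment in the pattern
    match_pos: offset of the exact match in the candidate
    """
    # Lengths of the parts left and right of the matched segment.
    p_prefix, p_suffix = seg_start, len(pattern) - seg_start - seg_len
    c_prefix, c_suffix = match_pos, len(candidate) - match_pos - seg_len
    tau_p = abs(p_prefix - c_prefix)   # edits needed at least on the prefix side
    tau_s = abs(p_suffix - c_suffix)   # edits needed at least on the suffix side
    return tau_p + tau_s <= tau

# Pattern Mad|au|ra (segment "au" at offset 3) against F|au|lty (match at offset 1).
print(position_filter("Madaura", 3, 2, "Faulty", 1, 2))    # False: 2 + 1 > 2
# Against Maj|au|la the parts line up, so verification is still required.
print(position_filter("Madaura", 3, 2, "Majaula", 3, 2))   # True
```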

Multiple Queries

ID  String  τ  Partitions
1   lodna   1  lod | na
2   lodnna  2  lo | dn | na
3   Dehri   1  Deh | ri
4   dehri   1  deh | ri
5   loddna  2  lo | dd | na

(Figure, built up over several slides: the partitions of the five queries are inserted one query at a time into a shared trie. The node labels read a, n and d, o, l for "lod | na", i.e. each segment spelled back to front; the final trie also holds the segments of Dehri and dehri.)
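The benefit of sharing segments across queries can be seen by collecting each distinct segment once. In the sketch below a dictionary stands in for the trie shown on the slides, and all names are hypothetical.

```python
from collections import defaultdict

def partition(p, tau):
    """Split p into tau + 1 segments; the first |p| mod (tau + 1) are one longer."""
    base, r = divmod(len(p), tau + 1)
    segments, start = [], 0
    for i in range(tau + 1):
        length = base + 1 if i < r else base
        segments.append(p[start:start + length])
        start += length
    return segments

def shared_segments(queries):
    """Map each distinct segment to the (query ID, offset) pairs that use it."""
    owners = defaultdict(list)
    for qid, (p, tau) in queries.items():
        offset = 0
        for seg in partition(p, tau):
            owners[seg].append((qid, offset))
            offset += len(seg)
    return owners

queries = {1: ("lodna", 1), 2: ("lodnna", 2), 3: ("Dehri", 1),
           4: ("dehri", 1), 5: ("loddna", 2)}
owners = shared_segments(queries)
print(len(owners))     # 8 distinct segments instead of 12 searches
print(owners["na"])    # [(1, 3), (2, 4), (5, 4)]
```

Each distinct segment is then searched once in the index and its hits are handed to every query that owns it.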

Look Ahead Verification

(Figure: two edit-distance matrices. Case 1: "similarly" against "similalry". Case 2: "reference" against "different".)
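The slides do not spell out the look-ahead rule itself, so the sketch below shows only the standard building block that such verification starts from: a banded edit-distance computation that gives up as soon as no cell in the current row can stay within τ. It is not the authors' algorithm, and the function name is an assumption.

```python
def within_threshold(a, b, tau):
    """Return ed(a, b) if it is <= tau, otherwise None (banded DP, early exit)."""
    if abs(len(a) - len(b)) > tau:
        return None
    INF = tau + 1                                  # any value above tau is "too far"
    prev = [j if j <= tau else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        if i <= tau:
            cur[0] = i
        lo, hi = max(1, i - tau), min(len(b), i + tau)   # cells inside the band
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1, INF)
        if min(cur[lo - 1:hi + 1]) > tau:          # the whole row exceeds tau
            return None
        prev = cur
    return prev[len(b)] if prev[len(b)] <= tau else None

print(within_threshold("similarly", "similalry", 2))   # 2
print(within_threshold("similarly", "similalry", 1))   # None
print(within_threshold("reference", "different", 3))   # None
```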

Cache-aware Parallel Optimization

Level      Access latency
Registers  0.3~0.5 ns
Cache      1~10 ns
Memory     80~200 ns
Disk       10,000,000 ns

Cache-aware Parallel Optimization

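(Figure: the string set is stored as blocks, e.g. Dehri$Majaura..., Deghli$lodna..., Madhura$...Mautern...; four workers, work1 to work4, each process one block at a time (B2, B8, B7, ...), and an idle worker takes one of the remaining blocks (B8, B9, ..., Bn).)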
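The block-per-worker scheme in the figure can be outlined as follows. This is a structural sketch only: the block size, the worker count of four, and the brute-force per-block search are assumptions, and Python threads do not give the cache or multi-core behaviour of the authors' C++ implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def edit_distance(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def make_blocks(strings, block_size):
    """Cut the string set into consecutive blocks of at most block_size strings."""
    return [strings[i:i + block_size] for i in range(0, len(strings), block_size)]

def search_block(block, pattern, tau):
    """Stand-in for the per-block index search: brute-force verification."""
    return [s for s in block if edit_distance(pattern, s) <= tau]

def parallel_search(strings, pattern, tau, block_size=2, workers=4):
    blocks = make_blocks(strings, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each free worker picks up the next pending block.
        parts = list(pool.map(lambda b: search_block(b, pattern, tau), blocks))
    return [s for part in parts for s in part]

data = ["Dehri", "Majaura", "Deghli", "lodna", "Madhura", "Mautern"]
print(parallel_search(data, "Dehli", 1))   # ['Dehri', 'Deghli']
```

In a cache-aware setting the block size would be chosen so that one block's index fits in cache while a worker is processing it.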

Approximate String Join

Incremental Approximate String Join

ed(S1, S2) ≤ τ also means ed(S2, S1) ≤ τ

Remove the symmetric case

Stop the search on reaching an ID ≥ the current ID

Trie-based Approximate Join

Build a reversed segment trie first

Avoid repeating the search for duplicated segments

Pruning techniques

Count filter

Stop the search for the current segment when only one candidate remains, since that candidate is the string itself.
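The incremental idea, comparing each string only against strings with a smaller ID so that every symmetric pair is produced once, can be sketched as below. The brute-force inner loop stands in for the index search, and the names are illustrative. The example uses τ = 2 because ed(Majaura, Madhura) = 2.

```python
def edit_distance(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def incremental_join(strings, tau):
    """Self-join: return (smaller ID, larger ID) pairs with edit distance <= tau."""
    pairs = []
    for cur_id in range(1, len(strings) + 1):
        s = strings[cur_id - 1]
        # Only IDs below the current one are searched, so (i, j) and (j, i)
        # are never both reported and a string is never paired with itself.
        for other_id in range(1, cur_id):
            t = strings[other_id - 1]
            if abs(len(s) - len(t)) <= tau and edit_distance(s, t) <= tau:
                pairs.append((other_id, cur_id))
    return pairs

print(incremental_join(["Majaura", "Deghli", "Madhura"], 2))   # [(1, 3)]
```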

Experiment

Environment

Implemented in C++

PC with a 2.93 GHz Intel Core CPU

4 GB main memory

Ubuntu Linux

Data sets

Geographical names

DBLP authors

Human genome reads

Search Performance

(Figure panels: Geo, DBLP, reads)

Join Performance on DBLP

(Figure panels: Geo, DBLP, reads)

Conclusions

A new index BWTPA

Cache-aware multi core framework

Efficient pruning techniques

Length filter

Position filter

Look-ahead algorithm to speed up edit distance verification

Approximate string join

Incremental

Trie-based

Thank you!
