Cache-Aware Parallel Approximate String Search and Join Algorithms Using BWT Jiaying Wang, Xiaochun Yang, and Bin Wang Mar 22, 2013 Northeastern University, China EDBT/ICDT 2013 Scalable String Similarity Search/Join workshop


Cache-Aware Parallel Approximate String

Search and Join Algorithms Using BWT

Jiaying Wang, Xiaochun Yang, and Bin Wang

Mar 22, 2013

Northeastern University, China

EDBT/ICDT 2013 Scalable String Similarity Search/Join workshop


Outline

Motivation

Problem statement

Approximate search method

Cache aware parallel framework

Pruning technique

Multi query optimization

Look ahead verification

Approximate join method

Experiment

Conclusion


Motivation

Searching for DNA sequences similar to

"ACGTACAATATTAG" in a genome database.

Results:

ACGTACAATATTAG is similar to

...ACGTACATTATTAG...

...ACGTAAAATATTAG...

...ACGTACAAATTTAG...


Problem Statement

Search: given a string set C, a pattern p, and a threshold τ, return every Ti ∈ C with ed(p, Ti) ≤ τ.

Join: find all pairs (T1, T2) ∈ C × C with ed(T1, T2) ≤ τ, as fast as possible.

Search example: p = “Majaura”, τ = 1, over T1: “Majaura”, T2: “Deghli”, T3: “Madhura”

Join example: τ = 1, over T1: “Majaura”, T2: “Deghli”, T3: “Madhura”; result shown: <s1,s3>


Current well-known solutions

Metric-space-based methods

Signature-based methods (q-gram, q-chunk + q-gram)

Idea:

Filter (full filter, prefix filter) + verify

If s is similar to p, at least a lower bound (LB) number of signatures

of s and p must be identical.

#gram = |p| − q + 1

LB = #gram − q×τ (prefix length: PF = q×τ + 1)

Tree/trie-based methods
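The q-gram bound above can be sketched in a few lines. This is only an illustration of the count filter as stated on the slide, not the authors' code; the function names are hypothetical, and the bound is computed from the pattern's gram count exactly as in #gram = |p| − q + 1.

```python
from collections import Counter

def qgrams(s, q):
    """Return the overlapping q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def count_filter_bound(p, q, tau):
    """LB = #gram - q * tau, with #gram = |p| - q + 1."""
    return (len(p) - q + 1) - q * tau

def prefix_length(q, tau):
    """PF = q * tau + 1, the prefix length used by prefix filtering."""
    return q * tau + 1

def passes_count_filter(p, s, q, tau):
    """Keep s as a candidate only if it shares at least LB q-grams with p."""
    shared = sum((Counter(qgrams(p, q)) & Counter(qgrams(s, q))).values())
    return shared >= count_filter_bound(p, q, tau)

print(count_filter_bound("Majaura", 2, 1))              # 4
print(passes_count_filter("Majaura", "Madhura", 2, 1))  # False: only 3 shared 2-grams
```

Candidates that pass the filter still need verification with a real edit-distance computation.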


BWTPA Index

Text: Majaura$Deghli$Madhura#

Row  Sorted rotation (first 22 chars)  L  SA  PA
1    #Majaura$Deghli$Madhur            a  22  3
2    $Deghli$Madhura#Majaur            a  7   1
3    $Madhura#Majaura$Deghl            i  14  2
4    Deghli$Madhura#Majaura            $  8   2
5    Madhura#Majaura$Deghli            $  15  3
6    Majaura$Deghli$Madhura            #  0   1
7    a#Majaura$Deghli$Madhu            r  21  3
8    a$Deghli$Madhura#Majau            r  6   1
9    adhura#Majaura$Deghli$            M  16  3
10   ajaura$Deghli$Madhura#            M  1   1
11   aura$Deghli$Madhura#Ma            j  3   1
12   dhura#Majaura$Deghli$M            a  17  3
13   eghli$Madhura#Majaura$            D  9   2
14   ghli$Madhura#Majaura$D            e  10  2
15   hli$Madhura#Majaura$De            g  11  2
16   hura#Majaura$Deghli$Ma            d  18  3
17   i$Madhura#Majaura$Degh            l  13  2
18   jaura$Deghli$Madhura#M            a  2   1
19   li$Madhura#Majaura$Deg            h  12  2
20   ra#Majaura$Deghli$Madh            u  20  3
21   ra$Deghli$Madhura#Maja            u  5   1
22   ura#Majaura$Deghli$Mad            h  19  3
23   ura$Deghli$Madhura#Maj            a  4   1

T1:“Majaura”

T2:“Deghli”

T3:“Madhura”

The BWTPA index contains:

BWT: the last column L, which simulates the suffix array (SA)

PA: the ID of the string that each SA entry belongs to
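As a rough illustration, the three columns L, SA, and PA for the example strings can be computed naively by sorting rotations. This sketch assumes the strings contain neither "$" nor "#" and relies on plain ASCII order (# < $ < uppercase < lowercase); `build_bwtpa` is a hypothetical name, and a practical index would neither sort full rotations nor store SA explicitly.

```python
def build_bwtpa(strings):
    """Naively build (L, SA, PA) for strings joined by '$' and ended by '#'."""
    text = "$".join(strings) + "#"
    n = len(text)
    # Suffix array: start offsets of the lexicographically sorted rotations.
    sa = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # BWT (column L): the character preceding each sorted rotation.
    bwt = "".join(text[i - 1] for i in sa)
    # owner[i] = 1-based ID of the string covering text position i;
    # each terminator is counted with the string it ends.
    owner = []
    for string_id, s in enumerate(strings, start=1):
        owner.extend([string_id] * (len(s) + 1))
    pa = [owner[i] for i in sa]
    return bwt, sa, pa

bwt, sa, pa = build_bwtpa(["Majaura", "Deghli", "Madhura"])
print(bwt)  # aai$$#rrMMjaDegdlahuuha
```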


Approximate String Search

If a string x is within edit distance τ of a string y, then at least one of the τ+1 segments of x matches a substring of y exactly.

Partition p into τ+1 segments of (almost) equal length:

r = |p| mod (τ+1)

The first r segments have length ⌊|p|/(τ+1)⌋ + 1

The remaining τ+1−r segments have length ⌊|p|/(τ+1)⌋

Example (τ = 2): X = Madaura, Y = Majaula

X is partitioned into Mad | au | ra; the segment au occurs in Y.

Steps: partition string → find a match → verification
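The partitioning rule can be written down directly. A minimal sketch, with `partition` as a hypothetical helper name; it returns each segment together with its start offset, which later filters need.

```python
def partition(p, tau):
    """Split p into tau + 1 segments; the first r = |p| mod (tau + 1) get one extra char."""
    k = tau + 1
    base, r = divmod(len(p), k)
    segments, start = [], 0
    for i in range(k):
        length = base + 1 if i < r else base
        segments.append((start, p[start:start + length]))
        start += length
    return segments

print(partition("Madaura", 2))  # [(0, 'Mad'), (3, 'au'), (5, 'ra')]
```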


Find Exact Substring

[Figure: the BWTPA index for T1, T2, T3 from the BWTPA Index slide (sorted rotations of Majaura$Deghli$Madhura# with the L, SA, and PA columns), used here to locate an exact substring.]
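Exact substring lookup on a BWT is usually done with backward search. The sketch below is the textbook procedure with naive occurrence counting, not necessarily the authors' implementation; it keeps SA only implicitly and uses PA to report which strings contain the segment. All function names are hypothetical.

```python
from collections import Counter

def build_index(strings):
    """Return (L, PA) for the strings joined by '$' and terminated by '#'."""
    text = "$".join(strings) + "#"
    sa = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    bwt = "".join(text[i - 1] for i in sa)
    owner = []
    for string_id, s in enumerate(strings, start=1):
        owner.extend([string_id] * (len(s) + 1))
    return bwt, [owner[i] for i in sa]

def backward_search(bwt, segment):
    """Return the half-open range of sorted-rotation rows prefixed by segment."""
    counts = Counter(bwt)
    first_row, total = {}, 0
    for c in sorted(counts):          # first_row[c]: first row starting with c
        first_row[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)
    for c in reversed(segment):       # extend the match one character to the left
        if c not in first_row:
            return 0, 0
        lo = first_row[c] + bwt[:lo].count(c)   # naive rank; real indexes precompute it
        hi = first_row[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0, 0
    return lo, hi

def strings_containing(bwt, pa, segment):
    """IDs of the strings in which segment occurs, read off the PA column."""
    lo, hi = backward_search(bwt, segment)
    return sorted(set(pa[lo:hi]))

bwt, pa = build_index(["Majaura", "Deghli", "Madhura"])
print(strings_containing(bwt, pa, "ura"))  # [1, 3]
```

A segment that spans a "$" separator would match across two strings; in this sketch segments are assumed never to contain "$" or "#".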


Pruning Techniques

Length Filtering:

A string similar to pattern P within threshold τ must have a length in [|P|−τ, |P|+τ].

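[Figure: strings grouped by length into blocks B1[5,6], B2[7,8], ..., Bi[lmin,lmax], ..., Bi[32,33], under the length ranges [5,33]; [5,16], [16,32]; [5,8], [20,33]. Block contents shown: Dehri…Deghli, Majaura$M…, Mautern in$…. Example query: search Dehli with τ = 1.]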
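A small sketch of length filtering over length-grouped blocks. As a simplifying assumption it keeps one block per exact length rather than the ranged blocks in the figure; the function names and the four sample strings are chosen for illustration only.

```python
from collections import defaultdict

def build_length_blocks(strings):
    """Group 1-based string IDs by string length."""
    blocks = defaultdict(list)
    for string_id, s in enumerate(strings, start=1):
        blocks[len(s)].append(string_id)
    return blocks

def candidates_by_length(blocks, pattern, tau):
    """Return IDs of strings whose length lies in [|P| - tau, |P| + tau]."""
    lo = max(len(pattern) - tau, 0)
    hi = len(pattern) + tau
    return sorted(i for length in range(lo, hi + 1) for i in blocks.get(length, []))

blocks = build_length_blocks(["Majaura", "Deghli", "Madhura", "Dehri"])
print(candidates_by_length(blocks, "Dehli", 1))  # [2, 4]
```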


Pruning Techniques

Position Filtering

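Each string is split into three parts around the matched segment: prefix, matched segment, and suffix.

Example (τ = 2):

           prefix  match  suffix
pattern    Mad     au     ra
candidate  F       au     lty

τp ≥ 2 (prefix), τs ≥ 1 (suffix)

τp + τs ≥ 3 > τ, so the candidate is pruned.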
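One way to obtain the bounds τp and τs is from length differences alone, which reproduces the numbers in the example (|Mad| − |F| = 2, |lty| − |ra| = 1). The sketch below makes that assumption explicit; the slides do not say how the bounds are derived, and `position_filter_prunes` is a hypothetical name.

```python
def position_filter_prunes(pattern, seg_start, seg_len, candidate, match_pos, tau):
    """Return True if the prefix/suffix length differences alone exceed tau."""
    # Lower bound on edits needed to align the two prefixes.
    tau_p = abs(seg_start - match_pos)
    # Lower bound on edits needed to align the two suffixes.
    pattern_suffix = len(pattern) - (seg_start + seg_len)
    candidate_suffix = len(candidate) - (match_pos + seg_len)
    tau_s = abs(pattern_suffix - candidate_suffix)
    return tau_p + tau_s > tau

# Segment "au" starts at offset 3 in "Madaura" and is found at offset 1 in "Faulty".
print(position_filter_prunes("Madaura", 3, 2, "Faulty", 1, 2))  # True
```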


Multiple Queries

ID  String  τ  Segments
1   lodna   1  lod | na
2   lodnna  2  lo | dn | na
3   Dehri   1  Deh | ri
4   dehri   1  deh | ri
5   loddna  2  lo | dd | na

[Figure, built up over three slides: the segments of the five queries above are inserted into a shared trie. First the segments of query 1 (lod, na), then those of query 2 (lo, dn, na), then the remaining segments (dd, Deh, deh, ri). Node labels appear in reverse segment order, e.g. a, n for na and d, o, l for lod.]
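A minimal sketch of such a shared trie, assuming each segment is inserted from its last character to its first (matching the direction of backward search) and that every trie node remembers which queries own the segment ending there. The node layout and function names are hypothetical.

```python
def partition(p, tau):
    """Split p into tau + 1 nearly equal segments, returned as (start, segment)."""
    base, r = divmod(len(p), tau + 1)
    out, start = [], 0
    for i in range(tau + 1):
        length = base + 1 if i < r else base
        out.append((start, p[start:start + length]))
        start += length
    return out

def new_node():
    return {"children": {}, "owners": []}

def build_reversed_segment_trie(queries):
    """queries maps query ID -> (pattern, tau); shared segments share one path."""
    root = new_node()
    for query_id, (pattern, tau) in queries.items():
        for start, segment in partition(pattern, tau):
            node = root
            for ch in reversed(segment):
                node = node["children"].setdefault(ch, new_node())
            node["owners"].append((query_id, start))
    return root

def owners_of(root, segment):
    """Return the (query ID, offset) pairs that contributed this segment."""
    node = root
    for ch in reversed(segment):
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["owners"]

def count_distinct_segments(node):
    own = 1 if node["owners"] else 0
    return own + sum(count_distinct_segments(c) for c in node["children"].values())

queries = {1: ("lodna", 1), 2: ("lodnna", 2), 3: ("Dehri", 1),
           4: ("dehri", 1), 5: ("loddna", 2)}
trie = build_reversed_segment_trie(queries)
print(count_distinct_segments(trie))  # 8 distinct segments instead of 12 searches
print(owners_of(trie, "na"))          # [(1, 3), (2, 4), (5, 4)]
```

Each distinct segment then needs only one exact search, whose hits are handed to every owning query.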


Look Ahead Verification

[Figure: two edit-distance matrices. Case 1 compares "similarly" with "similalry"; case 2 compares "reference" with "different".]
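The slides do not spell out the look-ahead rule, so the sketch below is not the authors' algorithm. It shows a standard threshold-bounded verification instead: the usual row-by-row edit-distance table, abandoned as soon as every entry in a row exceeds τ. The function name is hypothetical.

```python
def edit_distance_within(a, b, tau):
    """Return ed(a, b) if it is <= tau, otherwise None."""
    if abs(len(a) - len(b)) > tau:
        return None                      # length filter
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # match or substitute
        if min(cur) > tau:
            return None                  # row minima never decrease, so stop early
        prev = cur
    return prev[-1] if prev[-1] <= tau else None

print(edit_distance_within("similarly", "similalry", 2))  # 2
print(edit_distance_within("reference", "different", 2))  # None
```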


Cache-aware Parallel Optimization

Level      Access latency
Registers  0.3~0.5 ns
Cache      1~10 ns
Memory     80~200 ns
Disk       10,000,000 ns


Cache-aware Parallel Optimization

[Figure: data blocks (Dehri$Majaura..., Deghli$lodna..., Madhura$...Mautern ...) assigned to four workers, work1 to work4. Labels shown: B2, B8, B7, idle; B8, B9, Bn; B5, idle.]
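The block-per-worker idea can be sketched with a thread pool: the data is cut into small blocks and each worker takes the next unprocessed block when it becomes idle. This is only a structural illustration under assumed names and a made-up block size; in CPython the threads will not run the comparison loop truly in parallel, and the cache benefit depends on choosing blocks that fit the cache, which the sketch does not measure.

```python
from concurrent.futures import ThreadPoolExecutor

def edit_distance(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search_block(block, pattern, tau):
    """Scan one block; a stand-in for searching that block's index."""
    return [s for s in block if edit_distance(pattern, s) <= tau]

def parallel_search(strings, pattern, tau, block_size=2, workers=4):
    """Split strings into blocks and let idle workers pick up the next block."""
    blocks = [strings[i:i + block_size] for i in range(0, len(strings), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_block = pool.map(lambda block: search_block(block, pattern, tau), blocks)
        return [s for hits in per_block for s in hits]

data = ["Dehri", "Majaura", "Deghli", "lodna", "Madhura", "Mautern"]
print(parallel_search(data, "Dehli", 1))  # ['Dehri', 'Deghli']
```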


Approximate String Join

Incremental Approximate String Join

ed(S1, S2) ≤ τ implies ed(S2, S1) ≤ τ

so the symmetric case can be removed:

stop the search on reaching an ID ≥ the current ID

Trie-based Approximate Join

build a reversed segment trie first

avoid repeating the search for duplicated segments

Pruning techniques

count filter

stop the search for the current segment when only one candidate remains, which must be the string itself.
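The symmetry argument behind the incremental join can be illustrated as follows. In this sketch a nested loop stands in for the index search, so it shows only the ID rule (compare each string against smaller IDs), not the authors' BWT-based procedure; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def incremental_self_join(strings, tau):
    """Return pairs (i, j) of 1-based IDs with i < j and ed <= tau."""
    pairs = []
    for current in range(len(strings)):
        # Only IDs below the current one are searched, so each unordered
        # pair is examined once and the symmetric case never appears.
        for earlier in range(current):
            a, b = strings[earlier], strings[current]
            if abs(len(a) - len(b)) <= tau and edit_distance(a, b) <= tau:
                pairs.append((earlier + 1, current + 1))
    return pairs

print(incremental_self_join(["Majaura", "Deghli", "Madhura"], 2))  # [(1, 3)]
```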


Experiment

Environment

Implemented in C++

PC with a 2.93 GHz Intel Core CPU

4 GB main memory

Ubuntu Linux

Data sets

Geographical names

DBLP authors

Human genome reads


Search Performance

[Figure: search performance; panels: Geo, DBLP, reads.]


Join Performance on DBLP

[Figure: join performance; panels: Geo, DBLP, reads.]


Conclusions

A new index: BWTPA

Cache-aware multi-core framework

Efficient pruning techniques

Length filter

Position filter

Look-ahead algorithm to improve edit-distance verification

Approximate string join

Incremental

Trie-based


Thank you!