Efficient Merging and Filtering Algorithms for Approximate String Searches

Jiaheng Lu, University of California, Irvine

Joint work with Chen Li, Yiming Lu

Chen Li, Jiaheng Lu, Yiming Lu

Example: a movie database

Star              Title            Year  Genre
Keanu Reeves      The Matrix       1999  Sci-Fi
Samuel Jackson    Iron Man         2008  Sci-Fi
Schwarzenegger    The Terminator   1984  Sci-Fi
Samuel Jackson    The Man          2006  Crime

Find movies starring Schwarrzenger.

Data may not be clean

Data integration and cleaning:

Relation R (Star)    Relation S (Star)
Keanu Reeves         Keanu Reeves
Samuel Jackson       Samuel L. Jackson
Schwarzenegger       Schwarzenegger

Problem definition: approximate string searches

Query q: Schwarrzenger

Collection of strings s (Star):
Keanu Reeves
Samuel Jackson
Schwarzenger

Search output: strings s that satisfy Sim(q,s) ≤ δ (for a distance such as edit distance; for a similarity such as Jaccard or cosine, the condition is Sim(q,s) ≥ δ)

Sim functions: edit distance, Jaccard coefficient, and cosine similarity
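As an illustration of the problem statement, here is a minimal sketch of the edit-distance case, verifying every string in the collection one by one. The function names `edit_distance` and `approximate_search` are hypothetical and not taken from the slides; this is the naive baseline that the merging and filtering techniques aim to avoid.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def approximate_search(query, collection, delta):
    """Naive baseline (assumption): scan the whole collection and verify each string."""
    return [s for s in collection if edit_distance(query, s) <= delta]


stars = ["Keanu Reeves", "Samuel Jackson", "Schwarzenger"]
print(approximate_search("Schwarrzenger", stars, 1))
```

The query differs from the last string by one extra "r", so it is the only string returned for δ = 1.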

Outline

Problem motivation
Preliminaries
  Grams
  Inverted lists
Merge algorithms
Filtering techniques
Conclusion

String Grams

q-grams

For example, the 2-grams of "universal":

(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
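A minimal sketch of gram extraction, assuming plain overlapping q-grams with no padding characters (the helper name `qgrams` is hypothetical):

```python
def qgrams(s, q=2):
    """Return the overlapping q-grams of s, from left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams, so "universal" (9 characters) has eight 2-grams.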

Inverted lists

Convert strings to gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  inverted list (string ids)
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
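The conversion from strings to gram inverted lists can be sketched as follows. This is an illustrative version, assuming each string id appears at most once per list; the names `qgrams` and `build_inverted_lists` are hypothetical.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """Overlapping q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the ascending list of ids of the strings containing it."""
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        # set(): record each string at most once per gram (a simplifying assumption)
        for gram in set(qgrams(s, q)):
            index[gram].append(string_id)
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(strings)
for gram in sorted(index):
    print(gram, index[gram])
```

Because the ids are visited in increasing order, every list comes out sorted in ascending order, which the merge algorithms rely on.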

Main Example

Query: stick
Grams: (st,ti,ic,ck)
Condition: ed(s,q) ≤ 1

Data:

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

Inverted lists:

ck: 1,3
ic: 0,1,2,4
st: 1,2,3,4
ta: 4
ti: 1,2,4
…

Merge the lists of the query grams: ids with count >= 2 are the candidates.
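The main example can be sketched end to end. The count bound used here is the standard one for q-grams: a string within edit distance k of the query shares at least (|query| - q + 1) - k·q grams with it, which gives 4 - 2 = 2 for "stick" with q = 2 and k = 1, matching the "count >= 2" on the slide. The function names are hypothetical, and the counting is done with a plain `Counter` rather than any of the merge algorithms discussed later.

```python
from collections import Counter


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def count_filter_search(query, strings, k, q=2):
    """Return (candidate ids, answer ids) for ed(s, query) <= k."""
    index = {}
    for string_id, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index.setdefault(gram, []).append(string_id)
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        # The bound is vacuous, so every string must be verified.
        candidates = list(range(len(strings)))
    else:
        counts = Counter()
        for gram in qgrams(query, q):      # one list per query gram occurrence
            counts.update(index.get(gram, []))
        candidates = sorted(i for i, c in counts.items() if c >= threshold)
    # Candidates are only a superset: verify each with the real distance.
    answers = [i for i in candidates if edit_distance(query, strings[i]) <= k]
    return candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
candidates, answers = count_filter_search("stick", strings, k=1)
print(candidates, answers)
```

The merge step only produces candidates; a final verification pass removes false positives such as strings that share enough grams but are still too far from the query.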

Problem definition:

Find elements whose number of occurrences across the lists is ≥ T.

Each list is kept in ascending order; the lists are merged to count occurrences.

Example T = 4

[Figure: inverted lists in ascending order; elements shown: 1 3 5 10 13, 10 13 15, 5 7 13, 13 15]

Result: 13

Contributions

Three new merge algorithms

New finding: filters must be used wisely

Outline

Problem motivation
Preliminaries
Merge algorithms
  Two previous algorithms
  Our proposed three algorithms
Filtering techniques
Conclusion

Five Merge Algorithms

Previous:
HeapMerger [Sarawagi, SIGMOD 2004]
MergeOpt [Sarawagi, SIGMOD 2004]

New:
ScanCount
MergeSkip
DivideSkip

Heap-based Algorithm

Min-heap

Count the number of occurrences of each element using a min-heap

Push to heap ……
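A minimal sketch of the heap-based idea: keep the current frontier element of every list in a min-heap, pop all copies of the smallest element to count it, and push each popped list's next element. The function name `heap_merge` and the five sample lists are hypothetical; the lists are loosely modeled on the running example, and their exact grouping in the original figure is an assumption.

```python
import heapq


def heap_merge(lists, T):
    """Return the elements that occur in at least T of the ascending lists."""
    # Heap entries: (frontier element, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every frontier element equal to the current minimum.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            results.append(top)
    return results


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))
```

Every element of every list passes through the heap, which is the cost the later algorithms try to reduce.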

MergeOpt Algorithm

Long lists: the T-1 longest lists, probed by binary search

Short lists: the remaining lists, merged

Example of MergeOpt [Sarawagi et al 2004]

[Figure: inverted lists in ascending order; elements shown: 1 3 5 10 13, 10 13 15, 5 7 13, 13 15]

Count threshold T = 4

Long lists: 3
Short lists: 2
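The long/short split can be sketched as below. Since an element can appear in at most T-1 long lists, any result must appear at least once in a short list, so only elements of the short lists need to be looked up in the long lists. This is an illustrative version, not the authors' code; `merge_opt`, `contains`, and the sample lists are hypothetical, and the short lists are merged with `heapq.merge` for brevity.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Elements occurring in at least T lists, probing the T-1 longest lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    results = []
    # heapq.merge yields the short lists' elements in ascending order,
    # so equal elements are adjacent and groupby can count them.
    for element, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        count += sum(contains(lst, element) for lst in long_lists)
        if count >= T:
            results.append(element)
    return results


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))
```

With T = 4 the three longest lists are treated as long and the two single-element lists as short, matching the "Long lists: 3, Short lists: 2" split on the slide.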


Can we run faster?

Five Merge Algorithms

Previous: HeapMerger, MergeOpt

New: ScanCount, MergeSkip, DivideSkip

ScanCount Example

[Figure: the inverted lists (elements shown: 1 3 5 10 13, 10 13 15, 5 7 13, 13 15) are scanned one by one. An array indexed by string id holds the # of occurrences; each id read from a list increments its entry by 1. String ids 13, 14, 15 are shown; id 13 reaches a count of 4.]

Count threshold T = 4

Result: 13
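ScanCount as described on the slide can be sketched in a few lines: one counter per string id, one pass over every list. The function name `scan_count`, the `num_ids` parameter, and the sample lists are hypothetical.

```python
def scan_count(lists, T, num_ids):
    """Elements occurring in at least T lists, using an array of counters."""
    counts = [0] * num_ids            # counts[id] = # of occurrences seen so far
    results = []
    for lst in lists:
        for string_id in lst:
            counts[string_id] += 1    # increment by 1
            if counts[string_id] == T:
                results.append(string_id)   # reported exactly once
    return sorted(results)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_ids=16))
```

No heap is needed and the lists do not even have to be sorted, at the price of an array as large as the string collection.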

Five Merge Algorithms

Previous: HeapMerger, MergeOpt

New: ScanCount, MergeSkip, DivideSkip

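MergeSkip algorithm

Min-heap over the frontier elements of the lists ……

Pop the top element together with all its copies. If fewer than T were popped, pop more elements until T-1 have been popped.

Jump: in each popped list, move to the smallest element greater than or equal to the new top of the heap.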

Example of MergeSkip

Count threshold T = 4

[Figure: a min-heap over the list frontiers; elements shown include 1, 3, 5, 7, 10, 13, 15, 17. Lists whose frontier elements were popped jump ahead.]

Skip is safe

Min-heap ……

Each skipped element occurs at most T-1 times, so it cannot be a result.
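A minimal sketch of the pop-and-jump idea under the assumptions stated here: each list is ascending with no repeated ids, and T ≥ 1. The function name `merge_skip` and the sample lists are hypothetical. After T-1 frontiers have been popped, every element smaller than the new heap top can only live in those popped lists, so it occurs at most T-1 times and can be skipped with a binary search.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Elements occurring in at least T of the ascending lists."""
    results = []
    pos = [0] * len(lists)                      # frontier position per list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:       # pop all copies of the minimum
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            results.append(top)
            for i in popped:
                pos[i] += 1
        else:
            # Pop more frontiers until T-1 lists have been popped in total.
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break                           # fewer than T lists remain
            target = heap[0][0]
            for i in popped:
                # Jump to the smallest element >= target.
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return results


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))
```

On these lists with T = 4, the first round pops 1, 5, and 10 and jumps the three lists straight to 13, so elements 3 and 7 are never touched.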

Five Merge Algorithms

Previous: HeapMerger, MergeOpt

New: ScanCount, MergeSkip, DivideSkip

DivideSkip Algorithm

Long lists: binary search

Short lists: MergeSkip

How many lists are treated as long lists?

Short lists: merge

Long lists: lookup

Decide the L value

A good balance in the tradeoff:

# of long lists L = T / (μ log M + 1)
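DivideSkip can be sketched by combining the two previous ideas: run a MergeSkip-style merge on the short lists with the reduced threshold T - L, then look each surviving element up in the L long lists by binary search. Several details here are assumptions rather than statements from the slides: M is taken to be the length of the longest list, the logarithm is base 2, μ is passed in as a tunable parameter, L is clamped to at most T - 1, and all function names and sample lists are hypothetical.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant returning (element, count) pairs with count >= T (T >= 1)."""
    out, pos = [], [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    while heap:
        top, popped = heap[0][0], []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            out.append((top, len(popped)))
            for i in popped:
                pos[i] += 1
        else:
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break
            for i in popped:
                pos[i] = bisect_left(lists[i], heap[0][0], pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return out


def contains(sorted_list, x):
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def divide_skip(lists, T, mu):
    """Elements occurring in at least T lists; mu tunes the long/short split."""
    if not lists:
        return []
    by_length = sorted(lists, key=len, reverse=True)
    M = max(len(by_length[0]), 1)               # assumed: longest list length
    L = int(T / (mu * math.log2(M) + 1))        # assumed: base-2 logarithm
    L = max(0, min(L, T - 1, len(by_length)))   # keep short-list threshold >= 1
    long_lists, short_lists = by_length[:L], by_length[L:]
    results = []
    for element, count in merge_skip_counts(short_lists, T - L):
        count += sum(contains(lst, element) for lst in long_lists)
        if count >= T:
            results.append(element)
    return results


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, 4, mu=0.5))
```

With μ = 0 the formula gives L = T, which the clamp reduces to T - 1 long lists, the same split MergeOpt uses; larger μ values move more lists to the short side.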

Experimental data sets

DBLP data
IMDB data
Google Web corpus


Performance (DBLP)

DivideSkip is the best one

# of accessed elements (DBLP)

DivideSkip is the best one

Outline

Problem motivation
Preliminaries
Merge algorithms
Filtering techniques
  Length, positional filters
  Filter tree
Conclusion and future work

Length Filtering

ed(s,t) ≤ 2

s: length 19
t: length 10

The lengths differ by 9, which is more than 2, so the pair is pruned by length only!
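The length filter follows from the fact that each edit operation changes a string's length by at most one, so two strings within edit distance k differ in length by at most k. A minimal sketch, with the hypothetical helper name `length_filter`:

```python
def length_filter(query, strings, k):
    """Keep only strings whose length is within k of the query's length."""
    return [s for s in strings if abs(len(s) - len(query)) <= k]


# Lengths 19 and 10 with k = 2: pruned without looking at any character.
print(abs(19 - 10) <= 2)
print(length_filter("stick", ["rich", "stick", "static", "statistics"], 1))
```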

Positional Filtering

ed(s,t) ≤ 2

s: gram (ab,1)
t: gram (ab,12)

The two occurrences of gram ab are 11 positions apart, which is more than 2, so they cannot be matched.
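A minimal sketch of the positional check, assuming 1-based gram positions as the (ab,1) notation suggests; the names `positional_qgrams` and `can_match` are hypothetical.

```python
def positional_qgrams(s, q=2):
    """(gram, position) pairs with 1-based positions (an assumption)."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def can_match(pg_s, pg_t, k):
    """Two positional grams may correspond only if the grams are equal
    and their positions differ by at most k."""
    (gram_s, pos_s), (gram_t, pos_t) = pg_s, pg_t
    return gram_s == gram_t and abs(pos_s - pos_t) <= k


print(can_match(("ab", 1), ("ab", 12), 2))
print(can_match(("ab", 1), ("ab", 3), 2))
```

The first pair is rejected because the positions are 11 apart, even though the grams are identical; the second pair is within the threshold.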

Filter tree

[Figure: a tree with three levels below the root. Length level: children 1, 2, 3, …, n. Gram level: children aa, ab, …, zy, zz. Position level: children 1, 2, …, m. Each leaf points to an inverted list, for example 5, 12, 17, 28, 44.]

Surprising experimental results (DBLP)

              No filter (ms)   Length (ms)   Length+Pos (ms)
DivideSkip    2.23             0.76          1.96

Why does adding the positional filter increase the running time?

Filters fragment inverted lists

[Figure: applying filters splits the inverted lists into smaller groups, and each group is merged separately.]

Cost:

(1) Tree traversal

(2) More merging

Saving:

Reduced total list size

Conclusion

Three new merge algorithms

We run faster

Interesting finding: do not abuse filters!

Related work

Approximate string matching [Navarro 2001]

Variable-length grams [Li et al 2007]

Fuzzy lookup in

References

1. [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik, “Efficient Exact Set-Similarity Joins,” in VLDB 2006.

2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning,” in SIGMOD 2003.

3. [Gravano 2001] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, “Approximate String Joins in a Database (Almost) for Free,” in VLDB 2001.

4. [Li 2007] C. Li, B. Wang, and X. Yang, “VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams,” in VLDB 2007.

5. [Navarro 2001] G. Navarro, “A Guided Tour to Approximate String Matching,” in ACM Computing Surveys, 2001.

6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, “Efficient Set Joins on Similarity Predicates,” in ACM SIGMOD 2004.