Efficient Merging and Filtering Algorithms for Approximate String Searches. Jiaheng Lu, University of California, Irvine. Joint work with Chen Li, Yiming Lu.


Page 1: Efficient Merging and Filtering Algorithms for Approximate String Searches


Efficient Merging and Filtering Algorithms

for Approximate String Searches

Jiaheng Lu,

University of California, Irvine

Joint work with Chen Li, Yiming Lu

Page 2: Efficient Merging and Filtering Algorithms for Approximate String Searches

2Chen Li, Jiaheng Lu, Yiming Lu Chen Li, Jiaheng Lu, Yiming Lu 2

Example: a movie database

Star             Title            Year   Genre
Keanu Reeves     The Matrix       1999   Sci-Fi
Samuel Jackson   Iron man         2008   Sci-Fi
Schwarzenegger   The Terminator   1984   Sci-Fi
Samuel Jackson   The man          2006   Crime

Find movies starring Schwarrzenger.

Page 3: Efficient Merging and Filtering Algorithms for Approximate String Searches


Data may not be clean

Data integration and cleaning:

Relation R (Star)   Relation S (Star)
Keanu Reeves        Keanu Reeves
Samuel Jackson      Samuel L. Jackson
Schwarzenegger      Schwarzenegger

Page 4: Efficient Merging and Filtering Algorithms for Approximate String Searches


Problem definition: approximate string searches

Query q: Schwarrzenger

Collection of strings s (Star): Schwarzenger, Samuel Jackson, Keanu Reeves

Search

Output: strings s that satisfy Sim(q,s) ≤ δ (for a distance such as edit distance; for a similarity such as the Jaccard coefficient or cosine similarity the test is Sim(q,s) ≥ δ)

Sim functions: edit distance, Jaccard coefficient and cosine similarity
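As a rough illustration of two of the similarity functions named above, the following Python sketch computes edit distance with the textbook dynamic program and the Jaccard coefficient of two sets. The function names are illustrative and do not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def jaccard(x: set, y: set) -> float:
    """Jaccard coefficient |x & y| / |x | y| of two sets, e.g. gram sets."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


# The misspelled query is one deletion away from the stored string.
print(edit_distance("Schwarrzenger", "Schwarzenger"))  # 1
```

With a threshold δ = 1 on edit distance, the stored string "Schwarzenger" would therefore be returned for the query "Schwarrzenger".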

Page 5: Efficient Merging and Filtering Algorithms for Approximate String Searches

Outline

  • Problem motivation
  • Preliminaries
      Grams
      Inverted lists
  • Merge algorithms
  • Filtering techniques
  • Conclusion

Page 6: Efficient Merging and Filtering Algorithms for Approximate String Searches

String Grams (q-grams)

For example, the 2-grams of "universal":

(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
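A minimal Python sketch of gram extraction, assuming plain overlapping substrings with no padding characters (the slide shows none); the name `qgrams` is illustrative.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping q-grams of s, left to right, without padding."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams, so "universal" (9 characters) has 8 2-grams.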

Page 7: Efficient Merging and Filtering Algorithms for Approximate String Searches

Inverted lists

Convert strings to gram inverted lists

id   strings
0    rich
1    stick
2    stich
3    stuck
4    static

2-grams   inverted list (string ids)
at        4
ch        0, 2
ck        1, 3
ic        0, 1, 2, 4
ri        0
st        1, 2, 3, 4
ta        4
ti        1, 2, 4
tu        3
uc        3
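The conversion from strings to gram inverted lists can be sketched as follows. This is an illustrative Python version, assuming each string id appears at most once per list even if a gram repeats inside the string; `build_inverted_lists` is a hypothetical name.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps each id once per list, even if the gram repeats in s.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(strings)
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are assigned in increasing order while the strings are scanned, every list comes out sorted in ascending order, which the merge algorithms rely on.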

Page 8: Efficient Merging and Filtering Algorithms for Approximate String Searches

Main Example

Query: stick, with grams (st,ti,ic,ck); ed(s,q) ≤ 1

Data:

id   strings
0    rich
1    stick
2    stich
3    stuck
4    static

Grams (inverted lists):

ck   1,3
ic   0,1,2,4
st   1,2,3,4
ta   4
ti   1,2,4
…

Merge: merge the lists of the query grams and keep the ids with count >= 2

Candidates: the ids that survive the merge, to be verified against ed(s,q) ≤ 1
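The whole pipeline of the main example (index, merge with a count threshold, verify) can be sketched in Python as below. The threshold formula T = (number of query grams) - k·q is the commonly used count-filter bound and is an assumption here, since the slide only states "count >= 2"; all function names are illustrative. The sketch also assumes T ≥ 1 and that the query has no repeated grams.

```python
from collections import Counter


def grams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(query, strings, k=1, q=2):
    # Build the gram inverted lists (each id at most once per list).
    index = {}
    for sid, s in enumerate(strings):
        for g in set(grams(s, q)):
            index.setdefault(g, []).append(sid)
    # Assumed count-filter threshold: each edit destroys at most q grams.
    T = len(grams(query, q)) - k * q
    # Merge step: count on how many query-gram lists each id occurs.
    counts = Counter(sid for g in grams(query, q) for sid in index.get(g, []))
    candidates = sorted(sid for sid, c in counts.items() if c >= T)
    # Verification step: keep only candidates within edit distance k.
    answers = [sid for sid in candidates if edit_distance(query, strings[sid]) <= k]
    return T, candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
print(search("stick", strings))
# (2, [1, 2, 3, 4], [1, 2, 3])
```

In this run "static" passes the count filter because it shares three grams with the query, but verification rejects it since its edit distance to "stick" is 3.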

Page 9: Efficient Merging and Filtering Algorithms for Approximate String Searches

Problem definition:

Merge the inverted lists, each sorted in ascending order, and find the elements whose number of occurrences ≥ T

Page 10: Efficient Merging and Filtering Algorithms for Approximate String Searches

Example: T = 4

[Figure: sorted id lists over the ids 1, 3, 5, 7, 10, 13, 15]

Result: 13

Page 11: Efficient Merging and Filtering Algorithms for Approximate String Searches


Contributions

Three new merge algorithms

New finding: wisely using filters

Page 12: Efficient Merging and Filtering Algorithms for Approximate String Searches

Outline

  • Problem motivation
  • Preliminaries
  • Merge algorithms
      Two previous algorithms
      Our proposed three algorithms
  • Filtering techniques
  • Conclusion

Page 13: Efficient Merging and Filtering Algorithms for Approximate String Searches

Five Merge Algorithms

Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]

New: ScanCount, MergeSkip, DivideSkip

Page 14: Efficient Merging and Filtering Algorithms for Approximate String Searches

Heap-based Algorithm

Min-heap: push the current element of each list to the heap, then count the # of occurrences of each element as equal elements are popped from the heap
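A heap-based merge along these lines might look like the following Python sketch. The five example lists are an assumed split of the ids shown on the example slides, and `heap_merge` is an illustrative name rather than the authors' code.

```python
import heapq


def heap_merge(lists, T):
    """Return the ids that occur on at least T of the sorted lists."""
    # Heap entries are (id, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every frontier entry equal to the minimum and advance those lists.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


# Assumed split of the slide's ids into five sorted lists.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.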

Page 15: Efficient Merging and Filtering Algorithms for Approximate String Searches

MergeOpt Algorithm

Long lists: the T-1 longest lists, probed by binary search

Short lists: the remaining lists, merged

Page 16: Efficient Merging and Filtering Algorithms for Approximate String Searches

Example of MergeOpt [Sarawagi et al 2004]

[Figure: sorted id lists over the ids 1, 3, 5, 7, 10, 13, 15]

Count threshold T = 4

Long lists: 3
Short lists: 2
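The long/short split of MergeOpt can be sketched as below. This is a simplified Python illustration: the short lists are counted with a `Counter` instead of a heap merge, and the five lists are an assumed split of the ids in the figure. The names `merge_opt` and `contains` are hypothetical.

```python
import bisect
from collections import Counter


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Ids on at least T lists: merge short lists, probe the T-1 longest."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    # Only T-1 lists are long, so every answer occurs on some short list.
    short_counts = Counter(x for lst in short_lists for x in lst)
    result = []
    for x in sorted(short_counts):
        count = short_counts[x]
        for lst in long_lists:
            count += contains(lst, x)  # True counts as 1
        if count >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

With T = 4 the three longest lists are only probed for the two ids (13 and 15) that appear on the short lists, instead of being scanned in full.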

Page 17: Efficient Merging and Filtering Algorithms for Approximate String Searches


Can we run faster?

Page 18: Efficient Merging and Filtering Algorithms for Approximate String Searches

Five Merge Algorithms

Previous: HeapMerger, MergeOpt

New: ScanCount, MergeSkip, DivideSkip

Page 19: Efficient Merging and Filtering Algorithms for Approximate String Searches

ScanCount Example

[Figure: the sorted id lists are scanned one by one; an array indexed by string id holds the # of occurrences, and each scanned id increments its entry by 1.]

Count threshold T = 4

Result: string id 13, whose # of occurrences reaches 4
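A short Python sketch of the counting idea shown in the figure, assuming string ids are integers in the range 0 to `num_ids` - 1 (a parameter introduced here for illustration) and using an assumed split of the slide's ids into five lists.

```python
def scan_count(lists, T, num_ids):
    """Scan every list once, incrementing a per-id counter array."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_ids=16))  # [13]
```

The scan needs no heap and no comparisons between lists; the price is an array with one counter per string id in the collection.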

Page 20: Efficient Merging and Filtering Algorithms for Approximate String Searches

Five Merge Algorithms

Previous: HeapMerger, MergeOpt

New: ScanCount, MergeSkip, DivideSkip

Page 21: Efficient Merging and Filtering Algorithms for Approximate String Searches

MergeSkip algorithm

Min-heap: pop T-1 elements from the heap; on each of those T-1 lists, jump to the smallest element greater than or equal to the new top of the heap

Page 22: Efficient Merging and Filtering Algorithms for Approximate String Searches

Example of MergeSkip

[Figure: min-heap over the frontiers of the sorted id lists; lists jump forward to skip ids that cannot reach the threshold.]

Count threshold T = 4

Page 23: Efficient Merging and Filtering Algorithms for Approximate String Searches

Skip is safe

# of occurrences of skipped elements ≤ T-1, so no skipped element can be an answer
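The pop-and-jump loop might be written as in the following Python sketch. It is one reading of the slides rather than the authors' implementation: the function name, the use of `bisect` for the jump, and the five example lists are assumptions.

```python
import bisect
import heapq


def merge_skip(lists, T):
    """Ids on at least T sorted lists, skipping elements that cannot qualify."""
    pos = [0] * len(lists)  # frontier position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:  # advance each popped list by one element
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            # Pop more frontiers until T-1 lists are out of the heap.
            while heap and len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break  # fewer than T lists remain, so no id can reach T
            target = heap[0][0]
            for i in popped:  # jump to the first element >= target
                pos[i] = bisect.bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

In this run the first round pops the frontiers 1, 5 and 10, and the three lists jump straight to 13; the ids 3 and 7 are never pushed to the heap.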

Page 24: Efficient Merging and Filtering Algorithms for Approximate String Searches

Five Merge Algorithms

Previous: HeapMerger, MergeOpt

New: ScanCount, MergeSkip, DivideSkip

Page 25: Efficient Merging and Filtering Algorithms for Approximate String Searches

DivideSkip Algorithm

Long lists: binary search

Short lists: MergeSkip

Page 26: Efficient Merging and Filtering Algorithms for Approximate String Searches

How many lists are treated as long lists?

Short lists: merge

Long lists: lookup

Page 27: Efficient Merging and Filtering Algorithms for Approximate String Searches

Decide L value

A good balance in the tradeoff:

# of long lists L = T / (μ log M + 1)
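A hedged Python sketch of the divide step follows. It assumes that M is the length of the longest inverted list, that μ is a tunable coefficient, and that the logarithm is base 2; none of these is spelled out on the slide. For brevity the short lists are counted with a `Counter` as a stand-in for MergeSkip, and L is clamped to at most T-1 so that every answer shows up on some short list.

```python
import bisect
import math
from collections import Counter


def num_long_lists(T, M, mu):
    """L = T / (mu * log M + 1), truncated and clamped to [0, T - 1]."""
    L = int(T / (mu * math.log2(M) + 1))
    return max(0, min(L, T - 1))


def divide_skip(lists, T, mu=0.01):
    """Ids on at least T sorted lists, using a long/short split of size L."""
    if not lists:
        return []
    by_len = sorted(lists, key=len, reverse=True)
    M = max(len(by_len[0]), 1)  # assumed meaning of M: longest list length
    L = num_long_lists(T, M, mu)
    long_lists, short_lists = by_len[:L], by_len[L:]
    # Stand-in for MergeSkip on the short lists: count occurrences directly.
    short_counts = Counter(x for lst in short_lists for x in lst)
    result = []
    for x in sorted(short_counts):
        count = short_counts[x]
        if count < T - L:
            continue  # even a hit on every long list could not reach T
        for lst in long_lists:
            i = bisect.bisect_left(lst, x)
            if i < len(lst) and lst[i] == x:
                count += 1
        if count >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(num_long_lists(4, 5, 0.01), divide_skip(lists, 4))  # 3 [13]
```

A larger μ moves lists from the long side to the short side: with μ = 1 the same input gives L = 1, so four lists are merged and only one is probed. The default μ = 0.01 is an arbitrary placeholder.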

Page 28: Efficient Merging and Filtering Algorithms for Approximate String Searches

Experimental data sets

  • DBLP data
  • IMDB data
  • Google Web corpus

Page 29: Efficient Merging and Filtering Algorithms for Approximate String Searches


Performance (DBLP)

DivideSkip is the best one

Page 30: Efficient Merging and Filtering Algorithms for Approximate String Searches

# of accessed elements (DBLP)

DivideSkip is the best one

Page 31: Efficient Merging and Filtering Algorithms for Approximate String Searches

Outline

  • Problem motivation
  • Preliminaries
  • Merge algorithms
  • Filtering techniques
      Length, positional filters
      Filter tree
  • Conclusion and future work

Page 32: Efficient Merging and Filtering Algorithms for Approximate String Searches

Length Filtering

Ed(s,t) ≤ 2

s: length 19
t: length 10

The pair can be pruned by length only!
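The reasoning behind the length filter is that each edit operation changes the length of a string by at most one, so two strings within edit distance k differ in length by at most k. A minimal Python sketch, with an illustrative function name:

```python
def passes_length_filter(len_s: int, len_t: int, k: int) -> bool:
    """Necessary condition for ed(s, t) <= k: lengths differ by at most k."""
    return abs(len_s - len_t) <= k


# Lengths from the slide: 19 and 10, with threshold 2.
print(passes_length_filter(19, 10, 2))  # False
```

Since 19 - 10 = 9 exceeds 2, the pair is rejected without looking at any characters.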

Page 33: Efficient Merging and Filtering Algorithms for Approximate String Searches

Positional Filtering

Ed(s,t) ≤ 2

s contains the gram ab at position 1: (ab,1)
t contains the gram ab at position 12: (ab,12)
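For an edit distance threshold k, a gram in s can only correspond to the same gram in t if their positions differ by at most k. The Python sketch below illustrates that test; the function names and the 1-based positions (as on the slide) are choices made for this example.

```python
def positional_grams(s, q=2):
    """q-grams of s paired with their 1-based start positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_can_match(g1, g2, k):
    """Same gram text, and positions no more than k apart."""
    return g1[0] == g2[0] and abs(g1[1] - g2[1]) <= k


# The two grams from the slide, with Ed(s,t) <= 2.
print(grams_can_match(("ab", 1), ("ab", 12), 2))  # False
```

The positions 1 and 12 are 11 apart, so this shared gram does not count toward the threshold even though its text is identical.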

Page 34: Efficient Merging and Filtering Algorithms for Approximate String Searches

Filter tree

root
Length level: 1, 2, 3, …, n
Gram level: aa, ab, …, zy, zz
Position level: 1, 2, …, m
Inverted list: 5, 12, 17, 28, 44
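One way to picture the filter tree is as nested dictionaries keyed by length, then gram, then position, with an inverted list at each leaf. The Python sketch below follows that reading; the function names and the lookup rule (lengths and positions within k of the query's) are assumptions made for illustration.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> ascending list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)  # 1-based positions
    return tree


def lists_for(tree, query_len, gram, pos, k):
    """Leaf lists a query gram must merge under length and position filters."""
    found = []
    for length in range(query_len - k, query_len + k + 1):
        by_pos = tree.get(length, {}).get(gram, {})
        for p in sorted(by_pos):
            if abs(p - pos) <= k:
                found.append(by_pos[p])
    return found


strings = ["rich", "stick", "stich", "stuck", "static"]
tree = build_filter_tree(strings)
# Gram "ic" at position 3 of a query of length 5, with k = 1.
print(lists_for(tree, 5, "ic", 3, 1))  # [[0], [1, 2]]
```

Without filters the gram "ic" has the single list 0, 1, 2, 4; with both filters the query sees two smaller lists instead, and id 4 is dropped because its "ic" sits at position 5.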

Page 35: Efficient Merging and Filtering Algorithms for Approximate String Searches

Surprising experimental results (DBLP)

             No filter (ms)   Length (ms)   Length+Pos (ms)
DivideSkip   2.23             0.76          1.96

Why does adding the position filter increase the running time?

Page 36: Efficient Merging and Filtering Algorithms for Approximate String Searches

Filters fragment inverted lists

[Figure: applying filters splits the inverted lists into smaller groups, and each group is merged separately.]

Cost:

(1) Tree traversal

(2) More merging

Saving:

reduce total list size

Page 37: Efficient Merging and Filtering Algorithms for Approximate String Searches

Conclusion

Three new merge algorithms: we run faster

Interesting finding:

Do not abuse filters!

Page 38: Efficient Merging and Filtering Algorithms for Approximate String Searches


Related work

Approximate string matching

[Navarro 2001]

Varied length Grams

[Li et al 2007]

Fuzzy lookup in

Page 39: Efficient Merging and Filtering Algorithms for Approximate String Searches

References

1. [Arasu 2006] A. Arasu, V. Ganti and R. Kaushik, “Efficient Exact Set-similarity Joins”, in VLDB 2006

2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning”, in SIGMOD 2003

3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava, “Approximate string joins in a database (almost) for free”, in VLDB 2001

Page 40: Efficient Merging and Filtering Algorithms for Approximate String Searches

References

4. [Li 2007] C. Li, B. Wang and X. Yang, “VGRAM: Improving performance of approximate queries on string collections using variable-length grams”, in VLDB 2007

5. [Navarro 2001] G. Navarro, “A guided tour to approximate string matching”, in ACM Computing Surveys 2001

6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicates”, in ACM SIGMOD 2004