Efficient Merging and Filtering Algorithms
for Approximate String Searches
Jiaheng Lu,
University of California, Irvine
Joint work with Chen Li, Yiming Lu
2Chen Li, Jiaheng Lu, Yiming Lu Chen Li, Jiaheng Lu, Yiming Lu 2
Example: a movie database
Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Samuel Jackson  Iron man        2008  Sci-Fi
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime
Find movies starring "Schwarrzenger".
Data may not be clean
Data integration and cleaning:
Relation R (Star)   Relation S (Star)
Keanu Reeves        Keanu Reeves
Samuel Jackson      Samuel L. Jackson
Schwarzenegger      Schwarzenegger
Problem definition: approximate string searches
Query q: Schwarrzenger
Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenger, …
Search
Output: strings s that satisfy Sim(q,s) ≤ δ (for a distance such as edit distance; for a similarity measure the comparison is ≥)
Sim functions: edit distance, Jaccard coefficient, and cosine similarity
Outline
Problem motivation
Preliminaries
  Grams
  Inverted lists
Merge algorithms
Filtering techniques
Conclusion
String Grams
q-grams
For example, the 2-grams of "universal":
(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
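A minimal sketch of q-gram extraction, matching the "universal" example above. The function name `qgrams` is an illustrative choice, not something defined in the slides.

```python
def qgrams(s, q=2):
    """Return the overlapping q-grams of s, left to right."""
    # A string of length n has n - q + 1 grams (none if n < q).
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```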
Inverted lists
Convert strings to gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  inverted list (string ids)
at      4
ch      0,2
ck      1,3
ic      0,1,2,4
ri      0
st      1,2,3,4
ta      4
ti      1,2,4
tu      3
uc      3
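The conversion from strings to gram inverted lists can be sketched as follows. This is an illustration only: `build_inverted_index` is a hypothetical name, and recording each string id at most once per gram is an assumption of this sketch.

```python
from collections import defaultdict


def build_inverted_index(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        seen = set()  # assumption: one posting per (gram, string) pair
        for i in range(len(s) - q + 1):
            gram = s[i:i + q]
            if gram not in seen:
                seen.add(gram)
                index[gram].append(sid)  # ids arrive in ascending order
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_index(strings)
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are appended in increasing order, every list comes out sorted, which is what the merge algorithms rely on.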
Main Example
Query: stick, with grams (st,ti,ic,ck) and ed(s,q) ≤ 1

Data
id  string
0   rich
1   stick
2   stich
3   stuck
4   static

Grams  inverted list
ck     1,3
ic     0,1,2,4
st     1,2,3,4
ta     4
ti     1,2,4
…

Merge the lists of the query grams: count ≥ 2
Candidates: the ids whose count reaches the threshold
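The main example can be traced with a short sketch. The threshold formula T = (|q| - g + 1) - k·g for gram length g is the standard count-filter bound and is not spelled out on the slide, though it does yield the slide's "count ≥ 2" for the query "stick" with k = 1. All function names here are hypothetical.

```python
from collections import Counter


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(query, strings, k, q=2):
    """Return (T, candidate ids, verified answer ids) for ed(s, query) <= k."""
    index = {}
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index.setdefault(gram, []).append(sid)
    grams = qgrams(query, q)
    T = len(grams) - k * q  # assumed count-filter lower bound
    if T <= 0:
        candidates = list(range(len(strings)))  # the filter cannot prune
    else:
        counts = Counter()
        for gram in grams:
            counts.update(index.get(gram, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= T)
    answers = [sid for sid in candidates if edit_distance(query, strings[sid]) <= k]
    return T, candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
print(search("stick", strings, k=1))
# (2, [1, 2, 3, 4], [1, 2, 3])
```

The merge step only produces candidates; each candidate still has to be verified with the real similarity function, which is why "static" survives the merge but not the verification.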
Problem definition:
Find elements whose number of occurrences across the lists is ≥ T
Each list is sorted in ascending order
Merge
Example T = 4
Result: 13
[Figure: sorted lists over the ids 1, 3, 5, 7, 10, 13, 15; only id 13 occurs at least 4 times]
Contributions
Three new merge algorithms
New finding: filters must be used wisely
Outline
Problem motivation
Preliminaries
Merge algorithms
  Two previous algorithms
  Our proposed three algorithms
Filtering techniques
Conclusion
Five Merge Algorithms
Previous:
  HeapMerger [Sarawagi, SIGMOD 2004]
  MergeOpt [Sarawagi, SIGMOD 2004]
New:
  ScanCount
  MergeSkip
  DivideSkip
Heap-based Algorithm
Min-heap
Count the number of occurrences of each element using a heap
Push to heap ……
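A minimal sketch of the heap-based idea, assuming each list is sorted in ascending order with no repeated ids. The function name `heap_merge` and the sample lists are illustrative; they are not the lists from the slides.

```python
import heapq


def heap_merge(lists, T):
    """Return ids that occur on at least T of the sorted lists."""
    # The heap holds the current frontier element of every non-empty list.
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        top = heap[0][0]
        count = 0
        # Pop every copy of the smallest id and push each list's next element.
        while heap and heap[0][0] == top:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(top)
    return result


# Hypothetical lists for illustration only.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [3, 15, 17]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.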
MergeOpt Algorithm
Long lists: the T-1 longest lists, probed by binary search
Short lists: the remaining lists, merged
Example of MergeOpt [Sarawagi et al 2004]
Count threshold T = 4
Long lists: 3
Short lists: 2
[Figure: the sorted lists of the T = 4 example, split into 3 long lists and 2 short lists]
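The long/short split can be sketched as below. This is a simplified illustration rather than the published implementation: the short lists are merged with a `Counter` instead of a heap, and the names `merge_opt` and `contains` plus the sample lists are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def contains(sorted_list, x):
    """Binary search for x in an ascending list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    """Return ids that occur on at least T lists, using a long/short split."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    # With only T-1 long lists, any answer must appear on some short list.
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)
    result = []
    for x in sorted(counts):
        total = counts[x] + sum(contains(lst, x) for lst in long_lists)
        if total >= T:
            result.append(x)
    return result


# Hypothetical lists for illustration only.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [3, 15, 17]]
print(merge_opt(lists, 4))  # [13]
```

The long lists are never scanned; each candidate costs one binary search per long list.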
Can we run faster?
ScanCount Example
Count threshold T = 4
[Figure: an array of occurrence counts indexed by string id (1, 2, 3, …, 13, 14, 15); each id read from a list increments its count by 1; id 13 reaches count 4 and is reported as a result]
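A minimal sketch of the counting idea in the figure, assuming string ids are integers in the range 0..num_ids-1 and that no id repeats within a list. The name `scan_count` and the sample lists are illustrative.

```python
def scan_count(lists, T, num_ids):
    """Return ids whose occurrence count over all lists reaches T."""
    counts = [0] * num_ids  # one counter per string id
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it hits T
                result.append(sid)
    return sorted(result)


# Hypothetical lists for illustration only.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [3, 15, 17]]
print(scan_count(lists, 4, num_ids=18))  # [13]
```

No heap and no sorted order are needed for the scan itself; the price is a counter array as large as the string collection.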
MergeSkip algorithm
Min-heap ……
Pop T-1 elements
Jump: on each popped list, move to the first element greater than or equal to the new top of the heap
Example of MergeSkip
Count threshold T = 4
[Figure: a min-heap over the current element of each list, with Jump steps; ids shown include 1, 3, 5, 7, 10, 13, 15, 17]
Skip is safe
Min-heap ……
Number of occurrences of each skipped element ≤ T-1
Skip
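The pop-and-jump step can be sketched as follows, assuming ascending lists without repeated ids and T ≥ 1. This is one reading of the slides, not the authors' code; `merge_skip` and the sample lists are hypothetical.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return ids on at least T lists, skipping elements that cannot qualify."""
    pos = [0] * len(lists)  # current position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    # Fewer than T live lists means no further id can reach T occurrences.
    while heap and len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:
                pos[i] += 1
        else:
            # Pop more lists until T-1 are popped in total.
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]  # new top of the heap
            # Anything below target lies only on the T-1 popped lists,
            # so it can be skipped: jump to the first element >= target.
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


# Hypothetical lists for illustration only.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [3, 15, 17]]
print(merge_skip(lists, 4))  # [13]
```

On these lists the ids 1, 3, 5 and 7 are jumped over without ever being counted.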
DivideSkip Algorithm
Long lists: binary search
Short lists: MergeSkip
How many lists are treated as long lists?
Short lists: merge
Long lists: lookup
Decide L value
A good balance in the tradeoff:
# of long lists = T / (μ log M + 1)
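A sketch of how the formula could drive DivideSkip. Several points are assumptions not stated on the slides: M is taken to be the length of the longest list, the logarithm is base 2, μ is a tunable coefficient with an arbitrary default, and L is capped at T-1 so the short-list threshold T-L stays at least 1. All function names are hypothetical.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip-style merge returning (id, count) pairs with count >= T."""
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap and len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            out.append((top, len(popped)))
            for i in popped:
                pos[i] += 1
        else:
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return out


def num_long_lists(T, M, mu):
    """L = T / (mu * log M + 1), rounded down (log base 2 is an assumption)."""
    return int(T / (mu * math.log2(M) + 1))


def contains(sorted_list, x):
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def divide_skip(lists, T, mu=0.01):
    """Return ids on at least T lists: MergeSkip on short lists, lookup on long."""
    by_len = sorted(lists, key=len, reverse=True)
    M = len(by_len[0]) if by_len else 0
    L = num_long_lists(T, M, mu) if M > 0 else 0
    L = max(0, min(L, T - 1, len(by_len)))  # keep the short threshold >= 1
    long_lists, short_lists = by_len[:L], by_len[L:]
    result = []
    for x, count in merge_skip_counts(short_lists, T - L):
        count += sum(contains(lst, x) for lst in long_lists)
        if count >= T:
            result.append(x)
    return result


# Hypothetical lists for illustration only.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [3, 15, 17]]
print(divide_skip(lists, 4))  # [13]
```

A larger μ yields fewer long lists, shifting work from binary-search lookups toward merging.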
Experimental data sets
DBLP data
IMDB data
Google Web corpus
Performance (DBLP)
DivideSkip is the best one
# of accessed elements (DBLP)
DivideSkip is the best one
Outline
Problem motivation
Preliminaries
Merge algorithms
Filtering techniques
  Length, positional filters
  Filter tree
Conclusion and future work
Length Filtering
Ed(s,t) ≤ 2
[Figure: two strings s and t, with lengths 19 and 10]
The lengths differ by 9 > 2, so the pair is pruned by length only!
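The length filter rests on the fact that each edit operation changes a string's length by at most 1. A minimal sketch, with a hypothetical function name and placeholder strings of lengths 19 and 10:

```python
def passes_length_filter(s, t, k):
    """ed(s, t) <= k is only possible if the lengths differ by at most k."""
    return abs(len(s) - len(t)) <= k


# Placeholder strings of lengths 19 and 10, as in the slide's example.
s = "a" * 19
t = "a" * 10
print(passes_length_filter(s, t, 2))  # False
```

The check needs no character comparison at all, so it can be applied before touching any inverted list.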
Positional Filtering
Ed(s,t) ≤ 2
s has the positional gram (ab,1)
t has the positional gram (ab,12)
The positions differ by 11 > 2, so this shared gram cannot count as a match
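A minimal sketch of the positional check. The 1-based positions are an assumption made to match the (ab,1) notation on the slide, and both function names are hypothetical.

```python
def positional_qgrams(s, q=2):
    """Return (gram, position) pairs, with positions counted from 1."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_can_match(g1, g2, k):
    """Two positional grams can match under ed <= k only if they carry the
    same text and their positions differ by at most k."""
    (text1, pos1), (text2, pos2) = g1, g2
    return text1 == text2 and abs(pos1 - pos2) <= k


print(grams_can_match(("ab", 1), ("ab", 12), 2))  # False
```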
Filter tree
root
Length level: 1, 2, 3, …, n
Gram level: aa, ab, …, zy, zz
Position level: 1, 2, …, m
Inverted list: 5, 12, 17, 28, 44
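One way to picture the three levels is as nested dictionaries keyed by length, then gram, then position, with an inverted list at each leaf. This is an illustrative sketch, not the authors' data structure; the function names and the 1-based positions are assumptions.

```python
def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> ascending list of string ids."""
    tree = {}
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            gram, position = s[i:i + q], i + 1
            leaf = tree.setdefault(len(s), {}).setdefault(gram, {})
            leaf.setdefault(position, []).append(sid)
    return tree


def lookup(tree, query_len, gram, gram_pos, k):
    """Collect the leaf lists that survive the length and positional filters."""
    lists = []
    for length in range(query_len - k, query_len + k + 1):
        positions = tree.get(length, {}).get(gram, {})
        for p in range(gram_pos - k, gram_pos + k + 1):
            if p in positions:
                lists.append(positions[p])
    return lists


strings = ["rich", "stick", "stich", "stuck", "static"]
tree = build_filter_tree(strings)
print(lookup(tree, query_len=5, gram="st", gram_pos=1, k=1))
# [[1, 2, 3], [4]]
```

Even in this tiny example the single inverted list for "st" is returned as two smaller lists.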
Surprising experimental results (DBLP)
            No filter (ms)  Length (ms)  Length+Pos (ms)
DivideSkip  2.23            0.76         1.96
Why does adding the position filter increase the running time?
Filters fragment inverted lists
Applying filters
[Figure: applying filters splits one merge into several smaller merges]
Cost:
(1) Tree traversal
(2) More merging
Saving:
Reduced total list size
Conclusion
Three new merge algorithms
We run faster
Interesting finding:
Do not abuse filters!
Related work
Approximate string matching [Navarro 2001]
Variable-length grams [Li et al 2007]
Fuzzy lookup in
References
1. [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik, “Efficient Exact Set-similarity Joins”, in VLDB 2006
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning”, in SIGMOD 2003
3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, “Approximate string joins in a database (almost) for free”, in VLDB 2001
4. [Li 2007] C. Li, B. Wang, and X. Yang, “VGRAM: Improving performance of approximate queries on string collections using variable-length grams”, in VLDB 2007
5. [Navarro 2001] G. Navarro, “A guided tour to approximate string matching”, in ACM Computing Surveys 2001
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicates”, in ACM SIGMOD 2004