Page 1: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Speaker: Alexander Behm

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)

(1) University of California, Irvine
(2) Renmin University of China

Page 2: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai

Motivation: Data Cleaning

Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008

Should clearly be “Niels Bohr”

Page 3: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Motivation: Record Linkage

Name                   Hobbies  Address
Brad Pitt              …        …
Forest Whittacker      …        …
George Bush            …        …
Angelina Jolie         …        …
Arnold Schwarzenegger  …        …

Phone  Age  Name
…      …    Brad Pitt
…      …    Arnold Schwarzeneger
…      …    George Bush
…      …    Angelina Jolie
…      …    Forrest Whittaker

No exact match!

Page 4: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Motivation: Query Relaxation

http://www.google.com/jobs/britney.html

Actual queries gathered by Google

Page 5: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

What is Approximate String Search?

Query against collection: Find entries similar to “Arnold Schwarseneger”

What do we mean by similar to?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.

How can we support these types of queries efficiently?

String Collection
- Brad Pitt
- Forest Whittacker
- George Bush
- Angelina Jolie
- Arnold Schwarzenegger
- …

Page 6: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Approximate Query Answering

irvine → 2-grams {ir, rv, vi, in, ne}

Intuition: Similar strings share a certain number of grams

Sliding Window
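
The sliding-window decomposition can be sketched in a few lines of Python. The function name `qgrams` is chosen here for illustration and does not come from the slides.

```python
def qgrams(s: str, q: int) -> list[str]:
    """Return the q-grams of s, obtained by sliding a window of length q."""
    if q <= 0:
        raise ValueError("q must be positive")
    # A string shorter than q yields no grams.
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("irvine", 2))  # ['ir', 'rv', 'vi', 'in', 'ne']
```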

Page 7: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Approximate Query Example

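Query: “irvine”, Edit Distance 1
2-grams {ir, rv, vi, in, ne}

Lookup Grams

[Figure: 2-gram dictionary (tf, vi, ir, ef, rv, ne, un, in, …) pointing to inverted lists of stringIDs: {1, 3, 4, 5, 7, 9}, {5, 9}, {1, 5}, {1, 2, 3, 9}, {3, 9}, {7, 9}, {5, 6, 9}, {1, 2, 4, 5, 6}]

Lists retrieved for the query grams: {1, 3, 4, 5, 7, 9}, {1, 5}, {1, 2, 3, 9}, {7, 9}, {5, 6, 9}

Count >= 3: Candidates = {1, 5, 9}
May have false positives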
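
A minimal sketch of the count filter on this example, assuming the five inverted lists retrieved for the query grams are the ones shown on the slide. The threshold uses T = #grams - ed * q, the formula given later in the talk. The function names are illustrative.

```python
from collections import Counter


def count_threshold(num_grams: int, ed: int, q: int) -> int:
    """Lower bound T on the number of grams a similar string must share."""
    return num_grams - ed * q


def candidates(lists, t):
    """Return stringIDs that occur on at least t of the given inverted lists."""
    counts = Counter()
    for inverted_list in lists:
        counts.update(inverted_list)
    return sorted(sid for sid, c in counts.items() if c >= t)


# Lists looked up for the 2-grams of "irvine" (stringIDs as shown on the slide).
query_lists = [[1, 3, 4, 5, 7, 9], [1, 5], [1, 2, 3, 9], [7, 9], [5, 6, 9]]
t = count_threshold(num_grams=5, ed=1, q=2)
print(t, candidates(query_lists, t))  # 3 [1, 5, 9]
```

The candidates still need to be verified with the real similarity function, since the count filter may admit false positives.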

Page 8: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

T-Occurrence Problem

Find elements whose occurrences ≥ T

Ascending order

Merge
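
One way to solve the T-occurrence problem is to merge the sorted lists and count runs of equal IDs. The sketch below uses Python's `heapq.merge` and assumes each list is in ascending order without duplicates. It is one simple merging strategy, not necessarily one of the algorithms used in the talk.

```python
import heapq


def t_occurrence(lists, t):
    """Return the IDs that appear on at least t of the ascending-sorted lists."""
    result = []
    current, count = None, 0
    # heapq.merge yields all IDs in ascending order, so equal IDs are adjacent.
    for sid in heapq.merge(*lists):
        if sid == current:
            count += 1
        else:
            if current is not None and count >= t:
                result.append(current)
            current, count = sid, 1
    if current is not None and count >= t:
        result.append(current)
    return result


lists = [[1, 3, 4, 5, 7, 9], [1, 5], [1, 2, 3, 9], [7, 9], [5, 6, 9]]
print(t_occurrence(lists, 3))  # [1, 5, 9]
```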

Page 9: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Motivation: Compression

Inverted Index >> Source Data

Fit in memory? Space Budget?

Page 10: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Motivation: Related Work

IR: lossless compression of inverted lists (disk-based)

Delta representation + compact encoding

Inverted lists in memory: decompression overhead

Tune compression ratio?

Overcome these limitations in our setting?
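
To make "delta representation + compact encoding" concrete, here is a small sketch that stores the gaps between consecutive stringIDs and encodes each gap with variable-byte coding. Variable-byte coding is used only as a familiar example of a compact encoding; the slide does not name a specific scheme at this point. Every lookup has to run `decode`, which is the decompression overhead mentioned above.

```python
def encode(ids):
    """Delta-encode an ascending list of non-negative IDs with variable-byte coding."""
    out = bytearray()
    prev = 0
    for sid in ids:
        gap = sid - prev
        prev = sid
        while gap >= 128:
            out.append(gap & 0x7F)  # low 7 bits, more bytes follow
            gap >>= 7
        out.append(gap | 0x80)      # high bit marks the last byte of this gap
    return bytes(out)


def decode(data):
    """Inverse of encode: rebuild the original ID list from the byte string."""
    ids, prev, gap, shift = [], 0, 0, 0
    for b in data:
        if b & 0x80:
            gap |= (b & 0x7F) << shift
            prev += gap
            ids.append(prev)
            gap, shift = 0, 0
        else:
            gap |= b << shift
            shift += 7
    return ids


packed = encode([1, 3, 4, 5, 7, 9, 300])
print(len(packed), decode(packed))
```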

Page 11: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Main Contributions

- Two lossy compression techniques
- Answer queries exactly
- Index fits into a space budget
- Queries faster on the compressed indexes
- Flexibility to choose space / time tradeoff
- Existing list-merging algorithms: re-use + compression-specific optimizations

Page 12: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Overview

Motivation & Preliminaries

Approach 1: Discarding Lists

Approach 2: Combining Lists

Experiments & Conclusion

Page 13: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Approach 1: Discarding Lists

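[Figure: 2-gram dictionary (tf, vi, ir, ef, rv, ne, un, in, …) pointing to inverted lists of stringIDs: {1, 3, 4, 5, 7, 9}, {5, 9}, {1, 5}, {1, 2, 3, 9}, {3, 9}, {7, 9}, {5, 6, 9}, {1, 2, 4, 5, 6}. Some of the lists are discarded, leaving “holes”.]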

Page 14: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Effects on Queries

Decrease lower bound T on common grams

Smaller T → more false positives

T <= 0 → “panic”: scan the entire string collection

Surprise: fewer lists → faster queries (depends)

Page 15: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Query “shanghai”, Edit Distance 1
3-grams {sha, han, ang, ngh, gha, hai}

[Figure: 3-gram dictionary (sha, han, ang, ngh, gha, hai, ter, …, uni, ing) with entries marked as hole grams or regular grams]

Basis: Edit operations “destroy” q = 3 grams
No holes: T = #grams - ed * q = 6 - 1 * 3 = 3
With holes: T’ = T - #holes = 0 → Panic!

Really destroy q=3 grams per edit operation?

Dynamic Programming for tighter T
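
The sketch below reproduces the arithmetic of this example and shows why the simple bound can be pessimistic. It assumes a hypothetical hole set of three query grams (the slide gives T’ = 0, so three holes, but which grams they are is not stated here). The tighter bound is computed by brute force over edit positions as a stand-in; it is not the dynamic program from the talk.

```python
from itertools import combinations


def positional_qgrams(s, q):
    """Return (start position, gram) pairs for all q-grams of s."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]


def simple_thresholds(s, q, ed, holes):
    """Return (T, T') where T' = T - #holes among the query grams."""
    grams = [g for _, g in positional_qgrams(s, q)]
    t = len(grams) - ed * q
    return t, t - sum(1 for g in grams if g in holes)


def tighter_threshold(s, q, ed, holes):
    """Regular grams minus the most regular grams that ed edits can touch."""
    regular = [i for i, g in positional_qgrams(s, q) if g not in holes]
    worst = 0
    for positions in combinations(range(len(s)), min(ed, len(s))):
        # A gram starting at i is destroyed if some edit hits a position in [i, i + q).
        destroyed = sum(1 for i in regular if any(i <= p < i + q for p in positions))
        worst = max(worst, destroyed)
    return len(regular) - worst


holes = {"han", "ang", "ngh"}  # hypothetical choice of hole grams
print(simple_thresholds("shanghai", 3, 1, holes))  # (3, 0)
print(tighter_threshold("shanghai", 3, 1, holes))  # 1
```

With this hole set a single edit can touch at most two of the three regular grams, so the tighter bound is 1 and the panic case is avoided.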

Page 16: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Choosing Lists to Discard

Good choice depends on query workload

Space budget: Many combinations of grams

Make a “reasonable” choice efficiently?

Effect on Query: Unaffected, Panic, Slower or Faster

Page 17: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Choosing Lists to Discard

INPUT: Space Budget, Inverted lists, Workload

OUTPUT: Lists to discard

[Figure: 2-gram dictionary (tf, vi, ir, ef, rv, ne, un, in, …) and a workload (Query1, Query2, Query3, …) with total estimated running time t. Each candidate list has an estimated impact ∆t, maintained by incremental update.]

Choose one list at a time

ALGORITHM: Greedy & Cost-Based
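
A rough sketch of a greedy, cost-based selection loop. The selection rule (discard the list with the smallest estimated ∆t until the index fits the budget), the function name, and the toy cost estimate are all assumptions made for illustration; the slide only states that lists are chosen one at a time based on estimated impact.

```python
def choose_lists_to_discard(list_sizes, space_budget, estimate_delta_t):
    """Greedily pick lists to discard until the kept lists fit the budget.

    list_sizes: dict mapping gram -> length of its inverted list
    space_budget: maximum total number of list entries that may be kept
    estimate_delta_t: callable (gram, discarded_so_far) -> estimated change
        in total workload running time if that gram's list is discarded
    """
    discarded = set()
    used = sum(list_sizes.values())
    while used > space_budget:
        remaining = [g for g in list_sizes if g not in discarded]
        if not remaining:
            break
        # Choose one list at a time: the one with the smallest estimated impact.
        best = min(remaining, key=lambda g: estimate_delta_t(g, discarded))
        discarded.add(best)
        used -= list_sizes[best]
    return discarded


# Toy cost model (hypothetical): impact = number of workload queries using the gram.
sizes = {"ab": 4, "bc": 2, "cd": 3}
queries_using = {"ab": 1, "bc": 5, "cd": 2}
print(choose_lists_to_discard(sizes, 5, lambda g, d: queries_using[g]))  # {'ab'}
```

In a real setting the estimate would be re-evaluated after every choice, because discarding one list changes the thresholds and candidate counts of the affected queries.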

Page 18: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Estimating Query Times

List-Merging: cost function, fitted offline with linear regression

Panic: #strings * avg similarity time

Post-Processing: #candidates * avg similarity time
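
The three cost components can be combined as in the sketch below. The single regression feature (total length of the lists to merge) and all names are assumptions; the slide does not say which features the list-merging cost function uses.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_query_time(total_list_length, num_candidates, num_strings,
                        avg_similarity_time, merge_model, panic):
    """Estimated time of one query under the cost model on the slide."""
    if panic:
        # Panic: scan the whole collection with the similarity function.
        return num_strings * avg_similarity_time
    slope, intercept = merge_model
    merge_time = slope * total_list_length + intercept
    return merge_time + num_candidates * avg_similarity_time


# Offline step with made-up measurements: (total list length, merge time in ms).
model = fit_line([100, 200, 300], [1.0, 2.0, 3.0])
print(estimate_query_time(250, 10, 1000, 0.05, model, panic=False))
print(estimate_query_time(250, 10, 1000, 0.05, model, panic=True))
```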

Page 19: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Estimating #candidates

Incremental-ScanCount Algorithm

BEFORE: T = 3, #candidates = 2
StringIDs  0  1  2  3  4
Counts     2  3  0  1  4

List to discard: un → stringIDs 1, 3, 4 (decrement their counts)

AFTER: T’ = T - 1 = 2, #candidates = 3
StringIDs  0  1  2  3  4
Counts     2  2  0  0  3
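
The incremental update on this slide can be replayed directly. This is a minimal sketch using the slide's numbers; the function names are illustrative.

```python
def count_candidates(counts, t):
    """Number of stringIDs whose count reaches the threshold t."""
    return sum(1 for c in counts if c >= t)


def discard_list(counts, t, discarded_list):
    """Decrement the counts of IDs on the discarded list and lower T by one."""
    new_counts = list(counts)
    for sid in discarded_list:
        new_counts[sid] -= 1
    return new_counts, t - 1


counts, t = [2, 3, 0, 1, 4], 3  # counts indexed by stringID 0..4
print(count_candidates(counts, t))  # 2

counts_after, t_after = discard_list(counts, t, [1, 3, 4])  # list of gram "un"
print(counts_after, t_after, count_candidates(counts_after, t_after))
# [2, 2, 0, 0, 3] 2 3
```

Only the IDs on the discarded list are touched, so the effect of a candidate list on #candidates can be estimated without re-merging all lists of the query.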

Page 20: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Overview

Motivation & Preliminaries

Approach 1: Discarding Lists

Approach 2: Combining Lists

Experiments & Conclusion

Page 21: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Approach 2: Combining Lists

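[Figure: 2-gram dictionary (tf, vi, ir, ef, rv, ne, un, in, …) pointing to inverted lists of stringIDs, where some grams share a combined list: {1, 3, 4, 5, 7, 9}, {5, 9}, {5, 6, 9}, {1, 2, 3, 9}, {1, 3, 9}, {7, 9}, {6, 9}, {1, 2, 4, 5, 6}]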

Page 22: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Effects on Queries

Lower bound T is unchanged (no new panics)

Lists become longer:

More time to traverse lists

More false positives

Page 23: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Speeding Up Queries

Query 3-grams {sha, han, ang, ngh, gha, hai}

combined lists, refcount = 2

combined lists, refcount = 3

Traverse physical lists once. Count for stringIDs increases by refcount.
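
A sketch of the refcount optimization, assuming a hypothetical assignment of the query grams to three physical lists (A, B, C) with made-up contents. The slide shows combined lists with refcount 2 and 3 but not which grams share them.

```python
from collections import Counter


def scan_count_combined(query_grams, gram_to_list, physical_lists):
    """Count occurrences per stringID, traversing each physical list once."""
    # refcount = number of query grams that point to the same physical list.
    refcount = Counter(gram_to_list[g] for g in query_grams if g in gram_to_list)
    counts = Counter()
    for list_id, rc in refcount.items():
        for sid in physical_lists[list_id]:
            counts[sid] += rc
    return counts


query = ["sha", "han", "ang", "ngh", "gha", "hai"]
# Hypothetical mapping: two grams share list A, three share list B.
gram_to_list = {"sha": "A", "han": "A", "ang": "B", "ngh": "B", "gha": "B", "hai": "C"}
physical_lists = {"A": [1, 4], "B": [1, 2], "C": [2, 4]}
print(dict(scan_count_combined(query, gram_to_list, physical_lists)))
# {1: 5, 4: 3, 2: 4}
```

Without the refcount, list A would be traversed twice and list B three times for the same result.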

Page 24: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Choosing Lists to Combine

Discovering candidate gram pairs
- Frequent (q+1)-grams → correlated adjacent q-grams
- Locality-Sensitive Hashing (LSH)

Selecting candidate pairs to combine
- Basis: estimated cost on query workload
- Similar to DiscardLists
- Different Incremental-ScanCount algorithm
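
The first discovery heuristic can be sketched as follows: a frequent (q+1)-gram contains two adjacent q-grams that often occur in the same strings, so their lists are likely to be similar. The support threshold `min_support` and the function name are assumptions; the LSH-based discovery is not shown.

```python
from collections import Counter


def candidate_pairs(strings, q, min_support):
    """Return pairs of adjacent q-grams taken from frequent (q+1)-grams."""
    freq = Counter()
    for s in strings:
        # Count each (q+1)-gram once per string that contains it.
        freq.update({s[i:i + q + 1] for i in range(len(s) - q)})
    return sorted(
        (g[:q], g[1:])
        for g, n in freq.items()
        if n >= min_support and g[:q] != g[1:]
    )


collection = ["shanghai", "shanty", "hang"]
print(candidate_pairs(collection, q=2, min_support=2))
# [('an', 'ng'), ('ha', 'an'), ('sh', 'ha')]
```

Each returned pair is only a candidate; whether combining the two lists pays off is decided by the estimated cost on the query workload.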

Page 25: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Overview

Motivation & Preliminaries

Approach 1: Discarding Lists

Approach 2: Combining Lists

Experiments & Conclusion

Page 26: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Experiments

Datasets:
- Google WebCorpus Word Grams
- IMDB Actors
- DBLP Titles

Overview:
- Performance & Scalability of DiscardLists & CombineLists
- Comparison with IR compression & VGRAM
- Changing workloads

10k Queries: Zipf distributed, from dataset
q=3, Edit Distance=2 (also Jaccard & Cosine)

Page 27: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Experiments

[Plot: DiscardLists] Runtime decreases!

[Plot: CombineLists] Runtime decreases!

Page 28: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Comparison with IR compression (Carryover-12)

[Plots: Uncompressed, Compressed]

Page 29: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Comparison with variable-length grams (VGRAM)

[Plots: Uncompressed, Compressed]

Page 30: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Future Work

Combine: DiscardLists, CombineLists and IR compression

Filters for partitioning, global vs. local decisions

Dealing with updates to index

Page 31: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Conclusions

- Two lossy compression techniques
- Answer queries exactly
- Index fits into a space budget
- Queries faster on the compressed indexes
- Flexibility to choose space / time tradeoff
- Existing list-merging algorithms: re-use + compression-specific optimizations

Page 32: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

Thank You!

This work is part of

The Flamingo Project

http://flamingo.ics.uci.edu

Page 33: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

More Experiments

What if the workload changes from the training workload?

Page 34: Space-Constrained  Gram-Based Indexing for Efficient Approximate String Search

More Experiments

What if the workload changes from the training workload?