Recap Dictionaries Wildcard queries Spelling Correction
Information Retrieval and Search EnginesDictionaries & Tolerant Retrieval
Jörg Tiedemann
[email protected] of Modern Languages
University of Helsinki
Jörg Tiedemann 1/49
Recap Dictionaries Wildcard queries Spelling Correction
Recap from previous lecture
I Inverted indexes
I dictionary & postingsI type/token distinctionI terms = normalized types put in the dictionary
I Boolean Model
I return exact matches for Boolean queries
Jörg Tiedemann 2/49
Recap Dictionaries Wildcard queries Spelling Correction
Topics for Today
Continue with the basics (inverted indexes)
1. Phrase queries
2. Dictionary data structures
3. “Tolerant” retrievalI wild-card queriesI spelling correction
Jörg Tiedemann 3/49
Recap Dictionaries Wildcard queries Spelling Correction
Recall: Basic structure of an Inverted index
For each term t , we store a list of all documents that contain t .
BRUTUS −→ 1 2 4 11 31 45 173 174
CAESAR −→ 1 2 4 5 6 16 57 132 . . .
CALPURNIA −→ 2 31 54 101
...︸ ︷︷ ︸ ︸ ︷︷ ︸dictionary postings
Jörg Tiedemann 4/49
Recap Dictionaries Wildcard queries Spelling Correction
Phrase queries
Query: “stanford university” – as a phrase(10% of web queries = phrase queries)
Idea one: Biword index!
I index every word bigramI longer phrases: “stanford university palo alto”:
1. “STANFORD UNIVERSITY” AND “UNIVERSITY PALO” AND“PALO ALTO”
2. post-filtering of all hits
Problems?
Jörg Tiedemann 5/49
Recap Dictionaries Wildcard queries Spelling Correction
Phrase queries
Query: “stanford university” – as a phrase(10% of web queries = phrase queries)
Idea one: Biword index!
I index every word bigramI longer phrases: “stanford university palo alto”:
1. “STANFORD UNIVERSITY” AND “UNIVERSITY PALO” AND“PALO ALTO”
2. post-filtering of all hits
Problems?
Jörg Tiedemann 5/49
Recap Dictionaries Wildcard queries Spelling Correction
Phrase queries
I Idea two: Positional indexes! (store positions in posting list)
I Example query: “to1 be2 or3 not4 to5 be6”
TO, 993427:〈 1: 〈 7, 18, 33, 72, 86, 231〉;
2: 〈1, 17, 74, 222, 255〉;4: 〈 8, 16, 190, 429, 433〉;5: 〈363, 367〉;7: 〈13, 23, 191〉; . . . 〉
BE, 178239:〈 1: 〈 17, 25〉;
4: 〈 17, 191, 291, 430, 434〉;5: 〈14, 19, 101〉; . . . 〉
Document 4 is a match!
Jörg Tiedemann 6/49
Recap Dictionaries Wildcard queries Spelling Correction
Phrase queries
I Idea two: Positional indexes! (store positions in posting list)I Example query: “to1 be2 or3 not4 to5 be6”
TO, 993427:〈 1: 〈 7, 18, 33, 72, 86, 231〉;
2: 〈1, 17, 74, 222, 255〉;4: 〈 8, 16, 190, 429, 433〉;5: 〈363, 367〉;7: 〈13, 23, 191〉; . . . 〉
BE, 178239:〈 1: 〈 17, 25〉;
4: 〈 17, 191, 291, 430, 434〉;5: 〈14, 19, 101〉; . . . 〉
Document 4 is a match!
Jörg Tiedemann 6/49
Recap Dictionaries Wildcard queries Spelling Correction
Proximity search
Second advantage of positional indexes:
I Can also use them for proximity search.I For example: employment /4 placeI Find all documents that contain EMPLOYMENT and PLACE
within 4 words of each other.
Jörg Tiedemann 7/49
Recap Dictionaries Wildcard queries Spelling Correction
Data Structures for Dictionaries
For each term t , we store a list of all documents that contain t .
BRUTUS −→ 1 2 4 11 31 45 173 174
CAESAR −→ 1 2 4 5 6 16 57 132 . . .
CALPURNIA −→ 2 31 54 101
...︸ ︷︷ ︸ ︸ ︷︷ ︸dictionary postings
Jörg Tiedemann 8/49
Recap Dictionaries Wildcard queries Spelling Correction
Naive Dictionary: array of fixed-width entries
term documentfrequency
pointer topostings list
a 656,265 −→aachen 65 −→. . . . . . . . .zulu 221 −→
space needed: 20 bytes 4 bytes 4 bytes
How do we store a dictionary in memory efficiently?How do we look up an element in this array at query time?
Jörg Tiedemann 9/49
Recap Dictionaries Wildcard queries Spelling Correction
Naive Dictionary: array of fixed-width entries
term documentfrequency
pointer topostings list
a 656,265 −→aachen 65 −→. . . . . . . . .zulu 221 −→
space needed: 20 bytes 4 bytes 4 bytes
How do we store a dictionary in memory efficiently?How do we look up an element in this array at query time?
Jörg Tiedemann 9/49
Recap Dictionaries Wildcard queries Spelling Correction
Dictionary Data structures
I Two main classes of data structures: hashes and trees
I Criteria for when to use hashes vs. trees:I Is there a fixed number of terms or will it keep growing?I What are the relative frequencies with which various keys
will be accessed?I How many terms are we likely to have?
Jörg Tiedemann 10/49
Recap Dictionaries Wildcard queries Spelling Correction
Hashes
I Each vocabulary term is hashed into an integer.I Try to avoid collisionsI At query time, do the following: hash query term, resolve
collisions, locate entry in fixed-width array
I Pros:I Lookup is fast (faster than in a search tree)I Lookup time is constant
I ConsI no way to find minor variants (resume vs. résumé)I no prefix search (all terms starting with automat)I need to rehash everything periodically if vocabulary keeps
growing
Jörg Tiedemann 11/49
Recap Dictionaries Wildcard queries Spelling Correction
Hashes
I Each vocabulary term is hashed into an integer.I Try to avoid collisionsI At query time, do the following: hash query term, resolve
collisions, locate entry in fixed-width arrayI Pros:
I Lookup is fast (faster than in a search tree)I Lookup time is constant
I ConsI no way to find minor variants (resume vs. résumé)I no prefix search (all terms starting with automat)I need to rehash everything periodically if vocabulary keeps
growing
Jörg Tiedemann 11/49
Recap Dictionaries Wildcard queries Spelling Correction
Hashes
I Each vocabulary term is hashed into an integer.I Try to avoid collisionsI At query time, do the following: hash query term, resolve
collisions, locate entry in fixed-width arrayI Pros:
I Lookup is fast (faster than in a search tree)I Lookup time is constant
I ConsI no way to find minor variants (resume vs. résumé)I no prefix search (all terms starting with automat)I need to rehash everything periodically if vocabulary keeps
growing
Jörg Tiedemann 11/49
Recap Dictionaries Wildcard queries Spelling Correction
Trees: Binary tree
Jörg Tiedemann 12/49
Recap Dictionaries Wildcard queries Spelling Correction
Trees: Binary tree
I simplest tree structureI efficient for searching
I Pros:I solves the prefix problem (terms starting with automat)
I Cons:I slower: O(log M) in balanced trees
(M is the size of the vocabulary)I re-balancing binary trees is expensive
→ Alternative: B-tree
Jörg Tiedemann 13/49
Recap Dictionaries Wildcard queries Spelling Correction
Trees: B-tree
B-tree definition: every internal node has a number of childrenin the interval [a,b] where a,b are appropriate positive integers,e.g., [2,4].
Jörg Tiedemann 14/49
Recap Dictionaries Wildcard queries Spelling Correction
Trees: B-tree
What’s the difference?
I slightly more complexI still efficient for searchingI same features as binary trees (prefix search!)I need for re-balancing is less frequent!
Jörg Tiedemann 15/49
Recap Dictionaries Wildcard queries Spelling Correction
“Tolerant” Retrieval
I Wildcard queries
I Spelling correction
Jörg Tiedemann 16/49
Recap Dictionaries Wildcard queries Spelling Correction
Tolerant Retrieval: Wildcard queries
Prefix queries:
I mon*: find all docs containing any term beginning with monI Easy with B-tree dictionary: retrieve all terms t in the
range: mon ≤ t < moo
Suffix Queries:
I *mon: find all docs containing any term ending with monI Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non
How can we find all terms matching pro*cent?
Jörg Tiedemann 17/49
Recap Dictionaries Wildcard queries Spelling Correction
Tolerant Retrieval: Wildcard queries
Prefix queries:
I mon*: find all docs containing any term beginning with monI Easy with B-tree dictionary: retrieve all terms t in the
range: mon ≤ t < moo
Suffix Queries:
I *mon: find all docs containing any term ending with mon
I Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non
How can we find all terms matching pro*cent?
Jörg Tiedemann 17/49
Recap Dictionaries Wildcard queries Spelling Correction
Tolerant Retrieval: Wildcard queries
Prefix queries:
I mon*: find all docs containing any term beginning with monI Easy with B-tree dictionary: retrieve all terms t in the
range: mon ≤ t < moo
Suffix Queries:
I *mon: find all docs containing any term ending with monI Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non
How can we find all terms matching pro*cent?
Jörg Tiedemann 17/49
Recap Dictionaries Wildcard queries Spelling Correction
Tolerant Retrieval: Wildcard queries
Prefix queries:
I mon*: find all docs containing any term beginning with monI Easy with B-tree dictionary: retrieve all terms t in the
range: mon ≤ t < moo
Suffix Queries:
I *mon: find all docs containing any term ending with monI Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non
How can we find all terms matching pro*cent?
Jörg Tiedemann 17/49
Recap Dictionaries Wildcard queries Spelling Correction
How to handle * in the middle of a term
I Example: pro*cent
I We could look up pro* and *cent in the B-tree and intersectthe two term sets.
I Expensive!I Alternative: permuterm index!
Jörg Tiedemann 18/49
Recap Dictionaries Wildcard queries Spelling Correction
How to handle * in the middle of a term
I Example: pro*centI We could look up pro* and *cent in the B-tree and intersect
the two term sets.
I Expensive!I Alternative: permuterm index!
Jörg Tiedemann 18/49
Recap Dictionaries Wildcard queries Spelling Correction
How to handle * in the middle of a term
I Example: pro*centI We could look up pro* and *cent in the B-tree and intersect
the two term sets.I Expensive!I Alternative: permuterm index!
Jörg Tiedemann 18/49
Recap Dictionaries Wildcard queries Spelling Correction
Permuterm index
I Rotate wildcard query, so that the * occurs at the end.I introduce special symbol $ to indicate end of string
→ pro*cent becomes cent$pro*
How does this help when matching queries with the index?
Jörg Tiedemann 19/49
Recap Dictionaries Wildcard queries Spelling Correction
Permuterm index
I Rotate wildcard query, so that the * occurs at the end.I introduce special symbol $ to indicate end of string
→ pro*cent becomes cent$pro*
How does this help when matching queries with the index?
Jörg Tiedemann 19/49
Recap Dictionaries Wildcard queries Spelling Correction
Permuterm index
Add all rotations to B-tree:
I HELLO → hello$, ello$h, llo$he, lo$hel o$hell
Jörg Tiedemann 20/49
Recap Dictionaries Wildcard queries Spelling Correction
Permuterm index & term mapping
all rotated entries map to thesame string ...
I For X, look up X$I For X*, look up X*I For *X, look up X$*I For *X*, look up X*I For X*Y, look up Y$X*I Example: For hel*o, look
up o$hel*
Jörg Tiedemann 21/49
Recap Dictionaries Wildcard queries Spelling Correction
Processing a lookup in the permuterm index
To sum up:
I Rotate query wildcard to the rightI Use B-tree lookup as before
What is the problem with this structure?
Permuterm more than quadruples the size of the dictionarycompared to a regular B-tree.(empirical observation for English)
Jörg Tiedemann 22/49
Recap Dictionaries Wildcard queries Spelling Correction
Processing a lookup in the permuterm index
To sum up:
I Rotate query wildcard to the rightI Use B-tree lookup as before
What is the problem with this structure?
Permuterm more than quadruples the size of the dictionarycompared to a regular B-tree.(empirical observation for English)
Jörg Tiedemann 22/49
Recap Dictionaries Wildcard queries Spelling Correction
Alternative: Bigram (k -gram) indexes
I Enumerate all k -grams (sequence of k characters)occurring in a term→ more space-efficient than permuterm index
I Example: from “April is the cruelest month” we get the2-grams (bigrams):$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ueel le es st t$ $m mo on nt h$
I $ is a special word boundary symbol, as before.I Maintain a second inverted index from bigrams to the
dictionary terms that contain the bigram
Jörg Tiedemann 23/49
Recap Dictionaries Wildcard queries Spelling Correction
Alternative: Bigram (k -gram) indexes
I Enumerate all k -grams (sequence of k characters)occurring in a term→ more space-efficient than permuterm index
I Example: from “April is the cruelest month” we get the2-grams (bigrams):$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ueel le es st t$ $m mo on nt h$
I $ is a special word boundary symbol, as before.
I Maintain a second inverted index from bigrams to thedictionary terms that contain the bigram
Jörg Tiedemann 23/49
Recap Dictionaries Wildcard queries Spelling Correction
Alternative: Bigram (k -gram) indexes
I Enumerate all k -grams (sequence of k characters)occurring in a term→ more space-efficient than permuterm index
I Example: from “April is the cruelest month” we get the2-grams (bigrams):$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ueel le es st t$ $m mo on nt h$
I $ is a special word boundary symbol, as before.I Maintain a second inverted index from bigrams to the
dictionary terms that contain the bigram
Jörg Tiedemann 23/49
Recap Dictionaries Wildcard queries Spelling Correction
Postings list in a 3-gram inverted index
etr BEETROOT METRIC PETRIFY RETRIEVAL- - - -
...
I retrieve all postings of matching k -gramsI intersect all lists as usual
Jörg Tiedemann 24/49
Recap Dictionaries Wildcard queries Spelling Correction
Processing wildcarded terms in a bigram index
Query mon* can now be run as:
$m AND mo AND on
I Gets us all terms with the prefix mon, but . . .I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.
I Surviving terms are then looked up in the term-documentinverted index.
→ k -gram are still fast and more space efficient thanpermuterm indexes.
Jörg Tiedemann 25/49
Recap Dictionaries Wildcard queries Spelling Correction
Processing wildcarded terms in a bigram index
Query mon* can now be run as:$m AND mo AND on
I Gets us all terms with the prefix mon, but . . .
I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.
I Surviving terms are then looked up in the term-documentinverted index.
→ k -gram are still fast and more space efficient thanpermuterm indexes.
Jörg Tiedemann 25/49
Recap Dictionaries Wildcard queries Spelling Correction
Processing wildcarded terms in a bigram index
Query mon* can now be run as:$m AND mo AND on
I Gets us all terms with the prefix mon, but . . .I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.
I Surviving terms are then looked up in the term-documentinverted index.
→ k -gram are still fast and more space efficient thanpermuterm indexes.
Jörg Tiedemann 25/49
Recap Dictionaries Wildcard queries Spelling Correction
Processing wildcarded terms in a bigram index
Query mon* can now be run as:$m AND mo AND on
I Gets us all terms with the prefix mon, but . . .I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.
I Surviving terms are then looked up in the term-documentinverted index.
→ k -gram are still fast and more space efficient thanpermuterm indexes.
Jörg Tiedemann 25/49
Recap Dictionaries Wildcard queries Spelling Correction
Processing wildcard queries
I As before, we must potentially execute a large number ofBoolean queries
I Most straightforward semantics: Conjunction ofdisjunctions
I Recall the query: gen* AND universit*
(geneva AND university) OR (geneva AND université) OR (genève AND
university) OR (genève AND université) OR (general AND universities) OR
. . . . . . → Very expensive! → Requires query optimization!I Do we need to support wildcard queries?I Users are lazy!
If wildcards are allowed→ Users will love it (?)I Does Google allow wildcard queries? (And other engines?)
Jörg Tiedemann 26/49
Recap Dictionaries Wildcard queries Spelling Correction
Processing wildcard queries
I As before, we must potentially execute a large number ofBoolean queries
I Most straightforward semantics: Conjunction ofdisjunctions
I Recall the query: gen* AND universit*(geneva AND university) OR (geneva AND université) OR (genève AND
university) OR (genève AND université) OR (general AND universities) OR
. . . . . . → Very expensive! → Requires query optimization!I Do we need to support wildcard queries?
I Users are lazy!If wildcards are allowed→ Users will love it (?)
I Does Google allow wildcard queries? (And other engines?)
Jörg Tiedemann 26/49
Recap Dictionaries Wildcard queries Spelling Correction
Processing wildcard queries
I As before, we must potentially execute a large number ofBoolean queries
I Most straightforward semantics: Conjunction ofdisjunctions
I Recall the query: gen* AND universit*(geneva AND university) OR (geneva AND université) OR (genève AND
university) OR (genève AND université) OR (general AND universities) OR
. . . . . . → Very expensive! → Requires query optimization!I Do we need to support wildcard queries?I Users are lazy!
If wildcards are allowed→ Users will love it (?)I Does Google allow wildcard queries? (And other engines?)
Jörg Tiedemann 26/49
Recap Dictionaries Wildcard queries Spelling Correction
Tolerant Retrieval: Spälling Korrection
I Two general uses of spelling correction:I Correcting documents being indexedI Correcting user queries (more common)
I Two different methods:I Isolated word spelling correction
I Check each word on its own for misspellingI Will not catch typos resulting in correctly spelled words,
e.g., an asteroid that fell form the skyI Context-sensitive spelling correction
I Look at surrounding wordsI Can correct form/from error above
Jörg Tiedemann 27/49
Recap Dictionaries Wildcard queries Spelling Correction
Tolerant Retrieval: Spälling Korrection
I Two general uses of spelling correction:I Correcting documents being indexedI Correcting user queries (more common)
I Two different methods:I Isolated word spelling correction
I Check each word on its own for misspellingI Will not catch typos resulting in correctly spelled words,
e.g., an asteroid that fell form the skyI Context-sensitive spelling correction
I Look at surrounding wordsI Can correct form/from error above
Jörg Tiedemann 27/49
Recap Dictionaries Wildcard queries Spelling Correction
Correcting documents
I primarily for OCR’ed documentsI tuned for OCR mistakes (trained classifiers)I may use domain-specific knowledge
confusion between “O” and “D” etc ...I but also: correct typos in web pages and low quality
documents→ fewer misspellings in our dictionary and better matching
I general IR philosophy: don’t change the documents
Jörg Tiedemann 28/49
Recap Dictionaries Wildcard queries Spelling Correction
Correcting queries
I more common in IR than document correctionI typos in queries are common
I people are in a hurryI users often look for things they don’t know much aboutI example: “al quajda”
I strategies:I (also) retrieve documents with the correct spellingI return alternative query suggestions (“Did you mean ...?”)
Jörg Tiedemann 29/49
Recap Dictionaries Wildcard queries Spelling Correction
Isolated word correction
I Premise 1: There is a list of “correct words” from which thecorrect spellings come.
I Premise 2: We have a way of computing the distancebetween a misspelled word and a correct word.
I Simple spelling correction algorithm: return the “correct”word that has the smallest distance to the misspelled word.
I Example: informaton→ information
Jörg Tiedemann 30/49
Recap Dictionaries Wildcard queries Spelling Correction
Isolated word correction
Two choices:I use a standard lexicon
I Webster’s, OED etc. ...I “industry-specific” dictionary (for domain-specific IR)I advantage: correct entries only!
I vocabulary of the inverted indexI all words in the collection→ better coverageI but: include all misspellings→ compute weights for all terms (based on frequencies)
Jörg Tiedemann 31/49
Recap Dictionaries Wildcard queries Spelling Correction
Isolated word correction
Task: Return lexicon entry that is closest to a given charactersequence Q
I What is “closest”?I Several alternatives:
1. Edit distance and Levenshtein distance2. Weighted edit distance3. k -gram overlap
Jörg Tiedemann 32/49
Recap Dictionaries Wildcard queries Spelling Correction
Edit distance
I The edit distance between string s1 and string s2 is theminimum number of basic operations that convert s1 to s2.
I Levenshtein distance:basic operations = insert, delete, and replace
I Levenshtein distance dog-do: 1I Levenshtein distance cat-cart: 1I Levenshtein distance cat-cut: 1I Levenshtein distance cat-act: 2
I Damerau-Levenshtein: additional operation = transposeI Damerau-Levenshtein distance cat-act: 1
Jörg Tiedemann 33/49
Recap Dictionaries Wildcard queries Spelling Correction
Edit distance
I The edit distance between string s1 and string s2 is theminimum number of basic operations that convert s1 to s2.
I Levenshtein distance:basic operations = insert, delete, and replace
I Levenshtein distance dog-do: 1I Levenshtein distance cat-cart: 1I Levenshtein distance cat-cut: 1I Levenshtein distance cat-act: 2
I Damerau-Levenshtein: additional operation = transposeI Damerau-Levenshtein distance cat-act: 1
Jörg Tiedemann 33/49
Recap Dictionaries Wildcard queries Spelling Correction
Levenshtein distance: Computation
Recursive definition & dynamic programming:
I start with upper-left table cellI fill table with edit costs
f a s t0 1 2 3 4
c 1 1 2 3 4a 2 2 1 2 3t 3 3 2 2 2s 4 4 3 2 3
Jörg Tiedemann 34/49
Recap Dictionaries Wildcard queries Spelling Correction
Levenshtein distance: algorithm
LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}
10 return m[|s1|, |s2|]
Operations: insert, delete, replace, copy
Jörg Tiedemann 35/49
Recap Dictionaries Wildcard queries Spelling Correction
Levenshtein distance: algorithm
LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}
10 return m[|s1|, |s2|]
Operations: insert, delete, replace, copy
Jörg Tiedemann 36/49
Recap Dictionaries Wildcard queries Spelling Correction
Levenshtein distance: algorithm
LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}
10 return m[|s1|, |s2|]
Operations: insert, delete, replace, copy
Jörg Tiedemann 37/49
Recap Dictionaries Wildcard queries Spelling Correction
Levenshtein distance: algorithm
LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}
10 return m[|s1|, |s2|]
Operations: insert, delete, replace, copy
Jörg Tiedemann 38/49
Recap Dictionaries Wildcard queries Spelling Correction
Levenshtein distance: algorithm
LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}
10 return m[|s1|, |s2|]
Operations: insert, delete, replace, copy
Jörg Tiedemann 39/49
Recap Dictionaries Wildcard queries Spelling Correction
Each cell of Levenshtein matrix
cost of getting herefrom my upper leftneighbor(copy or replace)
cost of getting herefrom my upper neigh-bor(delete)
cost of getting herefrom my left neighbor(insert)
the minimum of thethree possible “moves”;the cheapest way ofgetting here
Jörg Tiedemann 40/49
Recap Dictionaries Wildcard queries Spelling Correction
Levenshtein distance: Example
f a s t
0 1 1 2 2 3 3 4 4
c11
1 22 1
2 32 2
3 43 3
4 54 4
a22
2 23 2
1 33 1
3 42 2
4 53 3
t33
3 34 3
3 24 2
2 33 2
2 43 2
s44
4 45 4
4 35 3
2 34 2
3 33 3
Jörg Tiedemann 41/49
Recap Dictionaries Wildcard queries Spelling Correction
Weighted edit distance
I As above, but weight of an operation depends on thecharacters involved.
I Meant to capture keyboard errors, e.g., m more likely to bemistyped as n than as q.
I Therefore, replacing m by n is a smaller edit distance thanby q.
I We now require a weight matrix as input.I Modify dynamic programming to handle weights.
Jörg Tiedemann 42/49
Recap Dictionaries Wildcard queries Spelling Correction
Using edit distance
I given a query: get all sequences within a fixed editdistance
I intersect this list with the list of “correct” wordsI return spelling suggestions
Alternatively:
I use all corrections to retrieve doc’s→ slow!(and accepted by user?)
I use single best correction for retrieval
Jörg Tiedemann 43/49
Recap Dictionaries Wildcard queries Spelling Correction
Using edit distance
Problems:
I lot’s of possible strings even with few edit operationsI intersection with dictionary is slow
Possible solution:
I use N-gram overlapI can replace edit distance for spelling correction
Jörg Tiedemann 44/49
Recap Dictionaries Wildcard queries Spelling Correction
k -gram indexes for spelling correction
I Get all k -grams in the query termI Use the k -gram index to retrieve “correct” words that match
query term k -grams (recall wildcard search)I Threshold by number of matching k -grams
(e.g., only terms that differ by at most 3 k -grams)I or use Jaccard coefficient > threshold
Jaccard(A,B) =|A ∩ B||A ∪ B|
Example:
I Bigram index, misspelled word bordroomI Bigrams: bo, or, rd, dr, ro, oo, om
Jörg Tiedemann 45/49
Recap Dictionaries Wildcard queries Spelling Correction
k -gram indexes for spelling correction
I Get all k -grams in the query termI Use the k -gram index to retrieve “correct” words that match
query term k -grams (recall wildcard search)I Threshold by number of matching k -grams
(e.g., only terms that differ by at most 3 k -grams)I or use Jaccard coefficient > threshold
Jaccard(A,B) =|A ∩ B||A ∪ B|
Example:
I Bigram index, misspelled word bordroomI Bigrams: bo, or, rd, dr, ro, oo, om
Jörg Tiedemann 45/49
Recap Dictionaries Wildcard queries Spelling Correction
k -gram indexes for spelling correction: bordroom
RD aboard ardent boardroom border
OR border lord morbid sordid
BO aboard about boardroom border
- - - -
- - - -
- - - -
...
→ “boardroom” exists in 6 out of 7 lists→ Jaccard = 6/(8 + 7− 6) = 6/9 ≈ 0.67
Jörg Tiedemann 46/49
Recap Dictionaries Wildcard queries Spelling Correction
Context-sensitive spelling correction
One approach:
I break phrase query into conjunction of biwordsI look for biwords that need only one term correctedI get phrase matches and rank them ...
Jörg Tiedemann 47/49
Recap Dictionaries Wildcard queries Spelling Correction
Context-sensitive spelling correction
Another approach: Hit-based spelling correction
I Example: flew form munich
I Try all phrases with possible corrections:I Try query “flea form munich”I Try query “flew from munich”I Try query “flew form munch”I The correct query “flew from munich” has most hits.
Many alternatives! → Not very efficient!
I try to correct only if few hits returned
I tweaking with query logs and expected hits
Jörg Tiedemann 48/49
Recap Dictionaries Wildcard queries Spelling Correction
To sum up
Using a positional inverted index with skip pointers
I efficient dictionary storageI wild-card indexI spelling correctionI Soundex: Find phonetic alternatives
We can quickly run a query like(SPELL(moriset) /3 tor*to) OR SOUNDEX(chaikofski)
Jörg Tiedemann 49/49