Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann [email protected]

Recap Dictionaries Wildcard queries Spelling Correction

Information Retrieval and Search EnginesDictionaries & Tolerant Retrieval

Jörg Tiedemann

[email protected] of Modern Languages

University of Helsinki

Jörg Tiedemann 1/49

[email protected]


Recap from previous lecture

I Inverted indexes

I dictionary & postingsI type/token distinctionI terms = normalized types put in the dictionary

I Boolean Model

I return exact matches for Boolean queries



Topics for Today

Continue with the basics (inverted indexes)

1. Phrase queries

2. Dictionary data structures

3. “Tolerant” retrievalI wild-card queriesI spelling correction



Recall: Basic structure of an Inverted index

For each term t , we store a list of all documents that contain t .

BRUTUS −→ 1 2 4 11 31 45 173 174

CAESAR −→ 1 2 4 5 6 16 57 132 . . .

CALPURNIA −→ 2 31 54 101

...︸︷︷︸︸︷︷︸dictionary postings



Phrase queries

Query: “stanford university” – as a phrase(10% of web queries = phrase queries)

Idea one: Biword index!

I index every word bigramI longer phrases: “stanford university palo alto”:

1. “STANFORD UNIVERSITY” AND “UNIVERSITY PALO” AND“PALO ALTO”

2. post-filtering of all hits

Problems?



Phrase queries

Query: “stanford university” – as a phrase(10% of web queries = phrase queries)

Idea one: Biword index!

I index every word bigramI longer phrases: “stanford university palo alto”:

1. “STANFORD UNIVERSITY” AND “UNIVERSITY PALO” AND“PALO ALTO”

2. post-filtering of all hits

Problems?



Phrase queries

I Idea two: Positional indexes! (store positions in posting list)

I Example query: “to1 be2 or3 not4 to5 be6”

TO, 993427:〈 1: 〈 7, 18, 33, 72, 86, 231〉;

2: 〈1, 17, 74, 222, 255〉;4: 〈 8, 16, 190, 429, 433〉;5: 〈363, 367〉;7: 〈13, 23, 191〉; . . . 〉

BE, 178239:〈 1: 〈 17, 25〉;

4: 〈 17, 191, 291, 430, 434〉;5: 〈14, 19, 101〉; . . . 〉

Document 4 is a match!



Phrase queries

I Idea two: Positional indexes! (store positions in posting list)I Example query: “to1 be2 or3 not4 to5 be6”

TO, 993427:〈 1: 〈 7, 18, 33, 72, 86, 231〉;

2: 〈1, 17, 74, 222, 255〉;4: 〈 8, 16, 190, 429, 433〉;5: 〈363, 367〉;7: 〈13, 23, 191〉; . . . 〉

BE, 178239:〈 1: 〈 17, 25〉;

4: 〈 17, 191, 291, 430, 434〉;5: 〈14, 19, 101〉; . . . 〉

Document 4 is a match!



Proximity search

Second advantage of positional indexes:

I Can also use them for proximity search.I For example: employment /4 placeI Find all documents that contain EMPLOYMENT and PLACE

within 4 words of each other.



Data Structures for Dictionaries

For each term t , we store a list of all documents that contain t .

BRUTUS −→ 1 2 4 11 31 45 173 174

CAESAR −→ 1 2 4 5 6 16 57 132 . . .

CALPURNIA −→ 2 31 54 101

...︸︷︷︸︸︷︷︸dictionary postings



Naive Dictionary: array of fixed-width entries

term documentfrequency

pointer topostings list

a 656,265 −→aachen 65 −→. . . . . . . . .zulu 221 −→

space needed: 20 bytes 4 bytes 4 bytes

How do we store a dictionary in memory efficiently?How do we look up an element in this array at query time?



Naive Dictionary: array of fixed-width entries

term documentfrequency

pointer topostings list

a 656,265 −→aachen 65 −→. . . . . . . . .zulu 221 −→

space needed: 20 bytes 4 bytes 4 bytes

How do we store a dictionary in memory efficiently?How do we look up an element in this array at query time?



Dictionary Data structures

I Two main classes of data structures: hashes and trees

I Criteria for when to use hashes vs. trees:I Is there a fixed number of terms or will it keep growing?I What are the relative frequencies with which various keys

will be accessed?I How many terms are we likely to have?



Hashes

I Each vocabulary term is hashed into an integer.I Try to avoid collisionsI At query time, do the following: hash query term, resolve

collisions, locate entry in fixed-width array

I Pros:I Lookup is fast (faster than in a search tree)I Lookup time is constant

I ConsI no way to find minor variants (resume vs. résumé)I no prefix search (all terms starting with automat)I need to rehash everything periodically if vocabulary keeps

growing



Hashes


collisions, locate entry in fixed-width arrayI Pros:

I Lookup is fast (faster than in a search tree)I Lookup time is constant


growing



Hashes


collisions, locate entry in fixed-width arrayI Pros:

I Lookup is fast (faster than in a search tree)I Lookup time is constant


growing



Trees: Binary tree



Trees: Binary tree

I simplest tree structureI efficient for searching

I Pros:I solves the prefix problem (terms starting with automat)

I Cons:I slower: O(log M) in balanced trees

(M is the size of the vocabulary)I re-balancing binary trees is expensive

→ Alternative: B-tree



Trees: B-tree

B-tree definition: every internal node has a number of childrenin the interval [a,b] where a,b are appropriate positive integers,e.g., [2,4].



Trees: B-tree

What’s the difference?

I slightly more complexI still efficient for searchingI same features as binary trees (prefix search!)I need for re-balancing is less frequent!



“Tolerant” Retrieval

I Wildcard queries

I Spelling correction



Tolerant Retrieval: Wildcard queries

Prefix queries:

I mon*: find all docs containing any term beginning with monI Easy with B-tree dictionary: retrieve all terms t in the

range: mon ≤ t < moo

Suffix Queries:

I *mon: find all docs containing any term ending with monI Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non

How can we find all terms matching pro*cent?




Prefix queries:



Suffix Queries:

I *mon: find all docs containing any term ending with mon

I Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non





Prefix queries:



Suffix Queries:






Prefix queries:



Suffix Queries:





How to handle * in the middle of a term

I Example: pro*cent

I We could look up pro* and *cent in the B-tree and intersectthe two term sets.

I Expensive!I Alternative: permuterm index!




I Example: pro*centI We could look up pro* and *cent in the B-tree and intersect

the two term sets.

I Expensive!I Alternative: permuterm index!




I Example: pro*centI We could look up pro* and *cent in the B-tree and intersect

the two term sets.I Expensive!I Alternative: permuterm index!



Permuterm index

I Rotate wildcard query, so that the * occurs at the end.I introduce special symbol $ to indicate end of string

→ pro*cent becomes cent$pro*

How does this help when matching queries with the index?



Permuterm index

I Rotate wildcard query, so that the * occurs at the end.I introduce special symbol $ to indicate end of string

→ pro*cent becomes cent$pro*

How does this help when matching queries with the index?



Permuterm index

Add all rotations to B-tree:

I HELLO → hello$, ello$h, llo$he, lo$hel o$hell



Permuterm index & term mapping

all rotated entries map to thesame string ...

I For X, look up X$I For X*, look up X*I For *X, look up X$*I For *X*, look up X*I For X*Y, look up Y$X*I Example: For hel*o, look

up o$hel*



Processing a lookup in the permuterm index

To sum up:

I Rotate query wildcard to the rightI Use B-tree lookup as before

What is the problem with this structure?

Permuterm more than quadruples the size of the dictionarycompared to a regular B-tree.(empirical observation for English)



Processing a lookup in the permuterm index

To sum up:

I Rotate query wildcard to the rightI Use B-tree lookup as before

What is the problem with this structure?

Permuterm more than quadruples the size of the dictionarycompared to a regular B-tree.(empirical observation for English)



Alternative: Bigram (k -gram) indexes

I Enumerate all k -grams (sequence of k characters)occurring in a term→ more space-efficient than permuterm index

I Example: from “April is the cruelest month” we get the2-grams (bigrams):$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ueel le es st t$ $m mo on nt h$

I $ is a special word boundary symbol, as before.I Maintain a second inverted index from bigrams to the

dictionary terms that contain the bigram






I $ is a special word boundary symbol, as before.

I Maintain a second inverted index from bigrams to thedictionary terms that contain the bigram






I $ is a special word boundary symbol, as before.I Maintain a second inverted index from bigrams to the

dictionary terms that contain the bigram



Postings list in a 3-gram inverted index

etr BEETROOT METRIC PETRIFY RETRIEVAL- - - -

...

I retrieve all postings of matching k -gramsI intersect all lists as usual



Processing wildcarded terms in a bigram index

Query mon* can now be run as:

$m AND mo AND on

I Gets us all terms with the prefix mon, but . . .I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.

I Surviving terms are then looked up in the term-documentinverted index.

→ k -gram are still fast and more space efficient thanpermuterm indexes.




Query mon* can now be run as:$m AND mo AND on

I Gets us all terms with the prefix mon, but . . .

I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.



















Processing wildcard queries

I As before, we must potentially execute a large number ofBoolean queries

I Most straightforward semantics: Conjunction ofdisjunctions

I Recall the query: gen* AND universit*

(geneva AND university) OR (geneva AND université) OR (genève AND

university) OR (genève AND université) OR (general AND universities) OR

. . . . . . → Very expensive! → Requires query optimization!I Do we need to support wildcard queries?I Users are lazy!

If wildcards are allowed→ Users will love it (?)I Does Google allow wildcard queries? (And other engines?)






I Recall the query: gen* AND universit*(geneva AND university) OR (geneva AND université) OR (genève AND


. . . . . . → Very expensive! → Requires query optimization!I Do we need to support wildcard queries?

I Users are lazy!If wildcards are allowed→ Users will love it (?)

I Does Google allow wildcard queries? (And other engines?)






I Recall the query: gen* AND universit*(geneva AND university) OR (geneva AND université) OR (genève AND


. . . . . . → Very expensive! → Requires query optimization!I Do we need to support wildcard queries?I Users are lazy!

If wildcards are allowed→ Users will love it (?)I Does Google allow wildcard queries? (And other engines?)



Tolerant Retrieval: Spälling Korrection

I Two general uses of spelling correction:I Correcting documents being indexedI Correcting user queries (more common)

I Two different methods:I Isolated word spelling correction

I Check each word on its own for misspellingI Will not catch typos resulting in correctly spelled words,

e.g., an asteroid that fell form the skyI Context-sensitive spelling correction

I Look at surrounding wordsI Can correct form/from error above



Tolerant Retrieval: Spälling Korrection

I Two general uses of spelling correction:I Correcting documents being indexedI Correcting user queries (more common)

I Two different methods:I Isolated word spelling correction

I Check each word on its own for misspellingI Will not catch typos resulting in correctly spelled words,

e.g., an asteroid that fell form the skyI Context-sensitive spelling correction

I Look at surrounding wordsI Can correct form/from error above



Correcting documents

I primarily for OCR’ed documentsI tuned for OCR mistakes (trained classifiers)I may use domain-specific knowledge

confusion between “O” and “D” etc ...I but also: correct typos in web pages and low quality

documents→ fewer misspellings in our dictionary and better matching

I general IR philosophy: don’t change the documents



Correcting queries

I more common in IR than document correctionI typos in queries are common

I people are in a hurryI users often look for things they don’t know much aboutI example: “al quajda”

I strategies:I (also) retrieve documents with the correct spellingI return alternative query suggestions (“Did you mean ...?”)



Isolated word correction

I Premise 1: There is a list of “correct words” from which thecorrect spellings come.

I Premise 2: We have a way of computing the distancebetween a misspelled word and a correct word.

I Simple spelling correction algorithm: return the “correct”word that has the smallest distance to the misspelled word.

I Example: informaton→ information




Two choices:I use a standard lexicon

I Webster’s, OED etc. ...I “industry-specific” dictionary (for domain-specific IR)I advantage: correct entries only!

I vocabulary of the inverted indexI all words in the collection→ better coverageI but: include all misspellings→ compute weights for all terms (based on frequencies)




Task: Return lexicon entry that is closest to a given charactersequence Q

I What is “closest”?I Several alternatives:

1. Edit distance and Levenshtein distance2. Weighted edit distance3. k -gram overlap



Edit distance

I The edit distance between string s1 and string s2 is theminimum number of basic operations that convert s1 to s2.

I Levenshtein distance:basic operations = insert, delete, and replace

I Levenshtein distance dog-do: 1I Levenshtein distance cat-cart: 1I Levenshtein distance cat-cut: 1I Levenshtein distance cat-act: 2

I Damerau-Levenshtein: additional operation = transposeI Damerau-Levenshtein distance cat-act: 1



Edit distance

I The edit distance between string s1 and string s2 is theminimum number of basic operations that convert s1 to s2.

I Levenshtein distance:basic operations = insert, delete, and replace

I Levenshtein distance dog-do: 1I Levenshtein distance cat-cart: 1I Levenshtein distance cat-cut: 1I Levenshtein distance cat-act: 2

I Damerau-Levenshtein: additional operation = transposeI Damerau-Levenshtein distance cat-act: 1



Levenshtein distance: Computation

Recursive definition & dynamic programming:

I start with upper-left table cellI fill table with edit costs

f a s t0 1 2 3 4

c 1 1 2 3 4a 2 2 1 2 3t 3 3 2 2 2s 4 4 3 2 3



Levenshtein distance: algorithm

LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}

10 return m[|s1|, |s2|]

Operations: insert, delete, replace, copy





10 return m[|s1|, |s2|]






10 return m[|s1|, |s2|]






10 return m[|s1|, |s2|]






10 return m[|s1|, |s2|]




Each cell of Levenshtein matrix

cost of getting herefrom my upper leftneighbor(copy or replace)

cost of getting herefrom my upper neigh-bor(delete)

cost of getting herefrom my left neighbor(insert)

the minimum of thethree possible “moves”;the cheapest way ofgetting here



Levenshtein distance: Example

f a s t

0 1 1 2 2 3 3 4 4

c11

1 22 1

2 32 2

3 43 3

4 54 4

a22

2 23 2

1 33 1

3 42 2

4 53 3

t33

3 34 3

3 24 2

2 33 2

2 43 2

s44

4 45 4

4 35 3

2 34 2

3 33 3



Weighted edit distance

I As above, but weight of an operation depends on thecharacters involved.

I Meant to capture keyboard errors, e.g., m more likely to bemistyped as n than as q.

I Therefore, replacing m by n is a smaller edit distance thanby q.

I We now require a weight matrix as input.I Modify dynamic programming to handle weights.



Using edit distance

I given a query: get all sequences within a fixed editdistance

I intersect this list with the list of “correct” wordsI return spelling suggestions

Alternatively:

I use all corrections to retrieve doc’s→ slow!(and accepted by user?)

I use single best correction for retrieval



Using edit distance

Problems:

I lot’s of possible strings even with few edit operationsI intersection with dictionary is slow

Possible solution:

I use N-gram overlapI can replace edit distance for spelling correction



k -gram indexes for spelling correction

I Get all k -grams in the query termI Use the k -gram index to retrieve “correct” words that match

query term k -grams (recall wildcard search)I Threshold by number of matching k -grams

(e.g., only terms that differ by at most 3 k -grams)I or use Jaccard coefficient > threshold

Jaccard(A,B) =|A ∩ B||A ∪ B|

Example:

I Bigram index, misspelled word bordroomI Bigrams: bo, or, rd, dr, ro, oo, om



k -gram indexes for spelling correction

I Get all k -grams in the query termI Use the k -gram index to retrieve “correct” words that match

query term k -grams (recall wildcard search)I Threshold by number of matching k -grams

(e.g., only terms that differ by at most 3 k -grams)I or use Jaccard coefficient > threshold

Jaccard(A,B) =|A ∩ B||A ∪ B|

Example:

I Bigram index, misspelled word bordroomI Bigrams: bo, or, rd, dr, ro, oo, om



k -gram indexes for spelling correction: bordroom

RD aboard ardent boardroom border

OR border lord morbid sordid

BO aboard about boardroom border

- - - -

- - - -

- - - -

...

→ “boardroom” exists in 6 out of 7 lists→ Jaccard = 6/(8 + 7− 6) = 6/9 ≈ 0.67



Context-sensitive spelling correction

One approach:

I break phrase query into conjunction of biwordsI look for biwords that need only one term correctedI get phrase matches and rank them ...



Context-sensitive spelling correction

Another approach: Hit-based spelling correction

I Example: flew form munich

I Try all phrases with possible corrections:I Try query “flea form munich”I Try query “flew from munich”I Try query “flew form munch”I The correct query “flew from munich” has most hits.

Many alternatives! → Not very efficient!

I try to correct only if few hits returned

I tweaking with query logs and expected hits



To sum up

Using a positional inverted index with skip pointers

I efficient dictionary storageI wild-card indexI spelling correctionI Soundex: Find phonetic alternatives

We can quickly run a query like(SPELL(moriset) /3 tor*to) OR SOUNDEX(chaikofski)


Documents

Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann [email protected]