71
Recap Dictionaries Wildcard queries Spelling Correction Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jrg Tiedemann [email protected] Department of Modern Languages University of Helsinki Jrg Tiedemann 1/49

Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann [email protected]

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Information Retrieval and Search EnginesDictionaries & Tolerant Retrieval

Jörg Tiedemann

[email protected] of Modern Languages

University of Helsinki

Jörg Tiedemann 1/49

Page 2: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Recap from previous lecture

I Inverted indexes

I dictionary & postingsI type/token distinctionI terms = normalized types put in the dictionary

I Boolean Model

I return exact matches for Boolean queries

Jörg Tiedemann 2/49

Page 3: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Topics for Today

Continue with the basics (inverted indexes)

1. Phrase queries

2. Dictionary data structures

3. “Tolerant” retrievalI wild-card queriesI spelling correction

Jörg Tiedemann 3/49

Page 4: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Recall: Basic structure of an Inverted index

For each term t , we store a list of all documents that contain t .

BRUTUS −→ 1 2 4 11 31 45 173 174

CAESAR −→ 1 2 4 5 6 16 57 132 . . .

CALPURNIA −→ 2 31 54 101

...︸ ︷︷ ︸ ︸ ︷︷ ︸dictionary postings

Jörg Tiedemann 4/49

Page 5: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Phrase queries

Query: “stanford university” – as a phrase(10% of web queries = phrase queries)

Idea one: Biword index!

I index every word bigramI longer phrases: “stanford university palo alto”:

1. “STANFORD UNIVERSITY” AND “UNIVERSITY PALO” AND“PALO ALTO”

2. post-filtering of all hits

Problems?

Jörg Tiedemann 5/49

Page 6: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Phrase queries

Query: “stanford university” – as a phrase(10% of web queries = phrase queries)

Idea one: Biword index!

I index every word bigramI longer phrases: “stanford university palo alto”:

1. “STANFORD UNIVERSITY” AND “UNIVERSITY PALO” AND“PALO ALTO”

2. post-filtering of all hits

Problems?

Jörg Tiedemann 5/49

Page 7: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Phrase queries

I Idea two: Positional indexes! (store positions in posting list)

I Example query: “to1 be2 or3 not4 to5 be6”

TO, 993427:〈 1: 〈 7, 18, 33, 72, 86, 231〉;

2: 〈1, 17, 74, 222, 255〉;4: 〈 8, 16, 190, 429, 433〉;5: 〈363, 367〉;7: 〈13, 23, 191〉; . . . 〉

BE, 178239:〈 1: 〈 17, 25〉;

4: 〈 17, 191, 291, 430, 434〉;5: 〈14, 19, 101〉; . . . 〉

Document 4 is a match!

Jörg Tiedemann 6/49

Page 8: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Phrase queries

I Idea two: Positional indexes! (store positions in posting list)I Example query: “to1 be2 or3 not4 to5 be6”

TO, 993427:〈 1: 〈 7, 18, 33, 72, 86, 231〉;

2: 〈1, 17, 74, 222, 255〉;4: 〈 8, 16, 190, 429, 433〉;5: 〈363, 367〉;7: 〈13, 23, 191〉; . . . 〉

BE, 178239:〈 1: 〈 17, 25〉;

4: 〈 17, 191, 291, 430, 434〉;5: 〈14, 19, 101〉; . . . 〉

Document 4 is a match!

Jörg Tiedemann 6/49

Page 9: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Proximity search

Second advantage of positional indexes:

I Can also use them for proximity search.I For example: employment /4 placeI Find all documents that contain EMPLOYMENT and PLACE

within 4 words of each other.

Jörg Tiedemann 7/49

Page 10: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Data Structures for Dictionaries

For each term t , we store a list of all documents that contain t .

BRUTUS −→ 1 2 4 11 31 45 173 174

CAESAR −→ 1 2 4 5 6 16 57 132 . . .

CALPURNIA −→ 2 31 54 101

...︸ ︷︷ ︸ ︸ ︷︷ ︸dictionary postings

Jörg Tiedemann 8/49

Page 11: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Naive Dictionary: array of fixed-width entries

term documentfrequency

pointer topostings list

a 656,265 −→aachen 65 −→. . . . . . . . .zulu 221 −→

space needed: 20 bytes 4 bytes 4 bytes

How do we store a dictionary in memory efficiently?How do we look up an element in this array at query time?

Jörg Tiedemann 9/49

Page 12: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Naive Dictionary: array of fixed-width entries

term documentfrequency

pointer topostings list

a 656,265 −→aachen 65 −→. . . . . . . . .zulu 221 −→

space needed: 20 bytes 4 bytes 4 bytes

How do we store a dictionary in memory efficiently?How do we look up an element in this array at query time?

Jörg Tiedemann 9/49

Page 13: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Dictionary Data structures

I Two main classes of data structures: hashes and trees

I Criteria for when to use hashes vs. trees:I Is there a fixed number of terms or will it keep growing?I What are the relative frequencies with which various keys

will be accessed?I How many terms are we likely to have?

Jörg Tiedemann 10/49

Page 14: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Hashes

I Each vocabulary term is hashed into an integer.I Try to avoid collisionsI At query time, do the following: hash query term, resolve

collisions, locate entry in fixed-width array

I Pros:I Lookup is fast (faster than in a search tree)I Lookup time is constant

I ConsI no way to find minor variants (resume vs. résumé)I no prefix search (all terms starting with automat)I need to rehash everything periodically if vocabulary keeps

growing

Jörg Tiedemann 11/49

Page 15: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Hashes

I Each vocabulary term is hashed into an integer.I Try to avoid collisionsI At query time, do the following: hash query term, resolve

collisions, locate entry in fixed-width arrayI Pros:

I Lookup is fast (faster than in a search tree)I Lookup time is constant

I ConsI no way to find minor variants (resume vs. résumé)I no prefix search (all terms starting with automat)I need to rehash everything periodically if vocabulary keeps

growing

Jörg Tiedemann 11/49

Page 16: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Hashes

I Each vocabulary term is hashed into an integer.I Try to avoid collisionsI At query time, do the following: hash query term, resolve

collisions, locate entry in fixed-width arrayI Pros:

I Lookup is fast (faster than in a search tree)I Lookup time is constant

I ConsI no way to find minor variants (resume vs. résumé)I no prefix search (all terms starting with automat)I need to rehash everything periodically if vocabulary keeps

growing

Jörg Tiedemann 11/49

Page 17: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Trees: Binary tree

Jörg Tiedemann 12/49

Page 18: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Trees: Binary tree

I simplest tree structureI efficient for searching

I Pros:I solves the prefix problem (terms starting with automat)

I Cons:I slower: O(log M) in balanced trees

(M is the size of the vocabulary)I re-balancing binary trees is expensive

→ Alternative: B-tree

Jörg Tiedemann 13/49

Page 19: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Trees: B-tree

B-tree definition: every internal node has a number of childrenin the interval [a,b] where a,b are appropriate positive integers,e.g., [2,4].

Jörg Tiedemann 14/49

Page 20: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Trees: B-tree

What’s the difference?

I slightly more complexI still efficient for searchingI same features as binary trees (prefix search!)I need for re-balancing is less frequent!

Jörg Tiedemann 15/49

Page 21: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

“Tolerant” Retrieval

I Wildcard queries

I Spelling correction

Jörg Tiedemann 16/49

Page 22: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Tolerant Retrieval: Wildcard queries

Prefix queries:

I mon*: find all docs containing any term beginning with monI Easy with B-tree dictionary: retrieve all terms t in the

range: mon ≤ t < moo

Suffix Queries:

I *mon: find all docs containing any term ending with monI Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non

How can we find all terms matching pro*cent?

Jörg Tiedemann 17/49

Page 23: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Tolerant Retrieval: Wildcard queries

Prefix queries:

I mon*: find all docs containing any term beginning with monI Easy with B-tree dictionary: retrieve all terms t in the

range: mon ≤ t < moo

Suffix Queries:

I *mon: find all docs containing any term ending with mon

I Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non

How can we find all terms matching pro*cent?

Jörg Tiedemann 17/49

Page 24: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Tolerant Retrieval: Wildcard queries

Prefix queries:

I mon*: find all docs containing any term beginning with monI Easy with B-tree dictionary: retrieve all terms t in the

range: mon ≤ t < moo

Suffix Queries:

I *mon: find all docs containing any term ending with monI Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non

How can we find all terms matching pro*cent?

Jörg Tiedemann 17/49

Page 25: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Tolerant Retrieval: Wildcard queries

Prefix queries:

I mon*: find all docs containing any term beginning with monI Easy with B-tree dictionary: retrieve all terms t in the

range: mon ≤ t < moo

Suffix Queries:

I *mon: find all docs containing any term ending with monI Maintain an additional tree for terms backwardsI Then retrieve all terms t in the range: nom ≤ t < non

How can we find all terms matching pro*cent?

Jörg Tiedemann 17/49

Page 26: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

How to handle * in the middle of a term

I Example: pro*cent

I We could look up pro* and *cent in the B-tree and intersectthe two term sets.

I Expensive!I Alternative: permuterm index!

Jörg Tiedemann 18/49

Page 27: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

How to handle * in the middle of a term

I Example: pro*centI We could look up pro* and *cent in the B-tree and intersect

the two term sets.

I Expensive!I Alternative: permuterm index!

Jörg Tiedemann 18/49

Page 28: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

How to handle * in the middle of a term

I Example: pro*centI We could look up pro* and *cent in the B-tree and intersect

the two term sets.I Expensive!I Alternative: permuterm index!

Jörg Tiedemann 18/49

Page 29: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Permuterm index

I Rotate wildcard query, so that the * occurs at the end.I introduce special symbol $ to indicate end of string

→ pro*cent becomes cent$pro*

How does this help when matching queries with the index?

Jörg Tiedemann 19/49

Page 30: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Permuterm index

I Rotate wildcard query, so that the * occurs at the end.I introduce special symbol $ to indicate end of string

→ pro*cent becomes cent$pro*

How does this help when matching queries with the index?

Jörg Tiedemann 19/49

Page 31: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Permuterm index

Add all rotations to B-tree:

I HELLO → hello$, ello$h, llo$he, lo$hel o$hell

Jörg Tiedemann 20/49

Page 32: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Permuterm index & term mapping

all rotated entries map to thesame string ...

I For X, look up X$I For X*, look up X*I For *X, look up X$*I For *X*, look up X*I For X*Y, look up Y$X*I Example: For hel*o, look

up o$hel*

Jörg Tiedemann 21/49

Page 33: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Processing a lookup in the permuterm index

To sum up:

I Rotate query wildcard to the rightI Use B-tree lookup as before

What is the problem with this structure?

Permuterm more than quadruples the size of the dictionarycompared to a regular B-tree.(empirical observation for English)

Jörg Tiedemann 22/49

Page 34: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Processing a lookup in the permuterm index

To sum up:

I Rotate query wildcard to the rightI Use B-tree lookup as before

What is the problem with this structure?

Permuterm more than quadruples the size of the dictionarycompared to a regular B-tree.(empirical observation for English)

Jörg Tiedemann 22/49

Page 35: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Alternative: Bigram (k -gram) indexes

I Enumerate all k -grams (sequence of k characters)occurring in a term→ more space-efficient than permuterm index

I Example: from “April is the cruelest month” we get the2-grams (bigrams):$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ueel le es st t$ $m mo on nt h$

I $ is a special word boundary symbol, as before.I Maintain a second inverted index from bigrams to the

dictionary terms that contain the bigram

Jörg Tiedemann 23/49

Page 36: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Alternative: Bigram (k -gram) indexes

I Enumerate all k -grams (sequence of k characters)occurring in a term→ more space-efficient than permuterm index

I Example: from “April is the cruelest month” we get the2-grams (bigrams):$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ueel le es st t$ $m mo on nt h$

I $ is a special word boundary symbol, as before.

I Maintain a second inverted index from bigrams to thedictionary terms that contain the bigram

Jörg Tiedemann 23/49

Page 37: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Alternative: Bigram (k -gram) indexes

I Enumerate all k -grams (sequence of k characters)occurring in a term→ more space-efficient than permuterm index

I Example: from “April is the cruelest month” we get the2-grams (bigrams):$a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ueel le es st t$ $m mo on nt h$

I $ is a special word boundary symbol, as before.I Maintain a second inverted index from bigrams to the

dictionary terms that contain the bigram

Jörg Tiedemann 23/49

Page 38: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Postings list in a 3-gram inverted index

etr BEETROOT METRIC PETRIFY RETRIEVAL- - - -

...

I retrieve all postings of matching k -gramsI intersect all lists as usual

Jörg Tiedemann 24/49

Page 39: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Processing wildcarded terms in a bigram index

Query mon* can now be run as:

$m AND mo AND on

I Gets us all terms with the prefix mon, but . . .I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.

I Surviving terms are then looked up in the term-documentinverted index.

→ k -gram are still fast and more space efficient thanpermuterm indexes.

Jörg Tiedemann 25/49

Page 40: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Processing wildcarded terms in a bigram index

Query mon* can now be run as:$m AND mo AND on

I Gets us all terms with the prefix mon, but . . .

I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.

I Surviving terms are then looked up in the term-documentinverted index.

→ k -gram are still fast and more space efficient thanpermuterm indexes.

Jörg Tiedemann 25/49

Page 41: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Processing wildcarded terms in a bigram index

Query mon* can now be run as:$m AND mo AND on

I Gets us all terms with the prefix mon, but . . .I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.

I Surviving terms are then looked up in the term-documentinverted index.

→ k -gram are still fast and more space efficient thanpermuterm indexes.

Jörg Tiedemann 25/49

Page 42: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Processing wildcarded terms in a bigram index

Query mon* can now be run as:$m AND mo AND on

I Gets us all terms with the prefix mon, but . . .I . . . also many “false positives” like MOON.→ Must post-filter these terms against query.

I Surviving terms are then looked up in the term-documentinverted index.

→ k -gram are still fast and more space efficient thanpermuterm indexes.

Jörg Tiedemann 25/49

Page 43: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Processing wildcard queries

I As before, we must potentially execute a large number ofBoolean queries

I Most straightforward semantics: Conjunction ofdisjunctions

I Recall the query: gen* AND universit*

(geneva AND university) OR (geneva AND université) OR (genève AND

university) OR (genève AND université) OR (general AND universities) OR

. . . . . . → Very expensive! → Requires query optimization!I Do we need to support wildcard queries?I Users are lazy!

If wildcards are allowed→ Users will love it (?)I Does Google allow wildcard queries? (And other engines?)

Jörg Tiedemann 26/49

Page 44: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Processing wildcard queries

I As before, we must potentially execute a large number ofBoolean queries

I Most straightforward semantics: Conjunction ofdisjunctions

I Recall the query: gen* AND universit*(geneva AND university) OR (geneva AND université) OR (genève AND

university) OR (genève AND université) OR (general AND universities) OR

. . . . . . → Very expensive! → Requires query optimization!I Do we need to support wildcard queries?

I Users are lazy!If wildcards are allowed→ Users will love it (?)

I Does Google allow wildcard queries? (And other engines?)

Jörg Tiedemann 26/49

Page 45: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Processing wildcard queries

I As before, we must potentially execute a large number ofBoolean queries

I Most straightforward semantics: Conjunction ofdisjunctions

I Recall the query: gen* AND universit*(geneva AND university) OR (geneva AND université) OR (genève AND

university) OR (genève AND université) OR (general AND universities) OR

. . . . . . → Very expensive! → Requires query optimization!I Do we need to support wildcard queries?I Users are lazy!

If wildcards are allowed→ Users will love it (?)I Does Google allow wildcard queries? (And other engines?)

Jörg Tiedemann 26/49

Page 46: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Tolerant Retrieval: Spälling Korrection

I Two general uses of spelling correction:I Correcting documents being indexedI Correcting user queries (more common)

I Two different methods:I Isolated word spelling correction

I Check each word on its own for misspellingI Will not catch typos resulting in correctly spelled words,

e.g., an asteroid that fell form the skyI Context-sensitive spelling correction

I Look at surrounding wordsI Can correct form/from error above

Jörg Tiedemann 27/49

Page 47: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Tolerant Retrieval: Spälling Korrection

I Two general uses of spelling correction:I Correcting documents being indexedI Correcting user queries (more common)

I Two different methods:I Isolated word spelling correction

I Check each word on its own for misspellingI Will not catch typos resulting in correctly spelled words,

e.g., an asteroid that fell form the skyI Context-sensitive spelling correction

I Look at surrounding wordsI Can correct form/from error above

Jörg Tiedemann 27/49

Page 48: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Correcting documents

I primarily for OCR’ed documentsI tuned for OCR mistakes (trained classifiers)I may use domain-specific knowledge

confusion between “O” and “D” etc ...I but also: correct typos in web pages and low quality

documents→ fewer misspellings in our dictionary and better matching

I general IR philosophy: don’t change the documents

Jörg Tiedemann 28/49

Page 49: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Correcting queries

I more common in IR than document correctionI typos in queries are common

I people are in a hurryI users often look for things they don’t know much aboutI example: “al quajda”

I strategies:I (also) retrieve documents with the correct spellingI return alternative query suggestions (“Did you mean ...?”)

Jörg Tiedemann 29/49

Page 50: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Isolated word correction

I Premise 1: There is a list of “correct words” from which thecorrect spellings come.

I Premise 2: We have a way of computing the distancebetween a misspelled word and a correct word.

I Simple spelling correction algorithm: return the “correct”word that has the smallest distance to the misspelled word.

I Example: informaton→ information

Jörg Tiedemann 30/49

Page 51: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Isolated word correction

Two choices:I use a standard lexicon

I Webster’s, OED etc. ...I “industry-specific” dictionary (for domain-specific IR)I advantage: correct entries only!

I vocabulary of the inverted indexI all words in the collection→ better coverageI but: include all misspellings→ compute weights for all terms (based on frequencies)

Jörg Tiedemann 31/49

Page 52: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Isolated word correction

Task: Return lexicon entry that is closest to a given charactersequence Q

I What is “closest”?I Several alternatives:

1. Edit distance and Levenshtein distance2. Weighted edit distance3. k -gram overlap

Jörg Tiedemann 32/49

Page 53: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Edit distance

I The edit distance between string s1 and string s2 is theminimum number of basic operations that convert s1 to s2.

I Levenshtein distance:basic operations = insert, delete, and replace

I Levenshtein distance dog-do: 1I Levenshtein distance cat-cart: 1I Levenshtein distance cat-cut: 1I Levenshtein distance cat-act: 2

I Damerau-Levenshtein: additional operation = transposeI Damerau-Levenshtein distance cat-act: 1

Jörg Tiedemann 33/49

Page 54: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Edit distance

I The edit distance between string s1 and string s2 is theminimum number of basic operations that convert s1 to s2.

I Levenshtein distance:basic operations = insert, delete, and replace

I Levenshtein distance dog-do: 1I Levenshtein distance cat-cart: 1I Levenshtein distance cat-cut: 1I Levenshtein distance cat-act: 2

I Damerau-Levenshtein: additional operation = transposeI Damerau-Levenshtein distance cat-act: 1

Jörg Tiedemann 33/49

Page 55: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Levenshtein distance: Computation

Recursive definition & dynamic programming:

I start with upper-left table cellI fill table with edit costs

f a s t0 1 2 3 4

c 1 1 2 3 4a 2 2 1 2 3t 3 3 2 2 2s 4 4 3 2 3

Jörg Tiedemann 34/49

Page 56: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Levenshtein distance: algorithm

LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}

10 return m[|s1|, |s2|]

Operations: insert, delete, replace, copy

Jörg Tiedemann 35/49

Page 57: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Levenshtein distance: algorithm

LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}

10 return m[|s1|, |s2|]

Operations: insert, delete, replace, copy

Jörg Tiedemann 36/49

Page 58: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Levenshtein distance: algorithm

LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}

10 return m[|s1|, |s2|]

Operations: insert, delete, replace, copy

Jörg Tiedemann 37/49

Page 59: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Levenshtein distance: algorithm

LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}

10 return m[|s1|, |s2|]

Operations: insert, delete, replace, copy

Jörg Tiedemann 38/49

Page 60: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Levenshtein distance: algorithm

LEVENSHTEINDISTANCE(s1, s2)1 for i ← 0 to |s1|2 do m[i ,0] = i3 for j ← 0 to |s2|4 do m[0, j] = j5 for i ← 1 to |s1|6 do for j ← 1 to |s2|7 do if s1[i] = s2[j]8 then m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1]}9 else m[i , j] = min{m[i − 1, j] + 1,m[i , j − 1] + 1,m[i − 1, j − 1] + 1}

10 return m[|s1|, |s2|]

Operations: insert, delete, replace, copy

Jörg Tiedemann 39/49

Page 61: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Each cell of Levenshtein matrix

cost of getting herefrom my upper leftneighbor(copy or replace)

cost of getting herefrom my upper neigh-bor(delete)

cost of getting herefrom my left neighbor(insert)

the minimum of thethree possible “moves”;the cheapest way ofgetting here

Jörg Tiedemann 40/49

Page 62: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Levenshtein distance: Example

f a s t

0 1 1 2 2 3 3 4 4

c11

1 22 1

2 32 2

3 43 3

4 54 4

a22

2 23 2

1 33 1

3 42 2

4 53 3

t33

3 34 3

3 24 2

2 33 2

2 43 2

s44

4 45 4

4 35 3

2 34 2

3 33 3

Jörg Tiedemann 41/49

Page 63: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Weighted edit distance

I As above, but weight of an operation depends on thecharacters involved.

I Meant to capture keyboard errors, e.g., m more likely to bemistyped as n than as q.

I Therefore, replacing m by n is a smaller edit distance thanby q.

I We now require a weight matrix as input.I Modify dynamic programming to handle weights.

Jörg Tiedemann 42/49

Page 64: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Using edit distance

I given a query: get all sequences within a fixed editdistance

I intersect this list with the list of “correct” wordsI return spelling suggestions

Alternatively:

I use all corrections to retrieve doc’s→ slow!(and accepted by user?)

I use single best correction for retrieval

Jörg Tiedemann 43/49

Page 65: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Using edit distance

Problems:

I lot’s of possible strings even with few edit operationsI intersection with dictionary is slow

Possible solution:

I use N-gram overlapI can replace edit distance for spelling correction

Jörg Tiedemann 44/49

Page 66: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

k -gram indexes for spelling correction

I Get all k -grams in the query termI Use the k -gram index to retrieve “correct” words that match

query term k -grams (recall wildcard search)I Threshold by number of matching k -grams

(e.g., only terms that differ by at most 3 k -grams)I or use Jaccard coefficient > threshold

Jaccard(A,B) =|A ∩ B||A ∪ B|

Example:

I Bigram index, misspelled word bordroomI Bigrams: bo, or, rd, dr, ro, oo, om

Jörg Tiedemann 45/49

Page 67: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

k -gram indexes for spelling correction

I Get all k -grams in the query termI Use the k -gram index to retrieve “correct” words that match

query term k -grams (recall wildcard search)I Threshold by number of matching k -grams

(e.g., only terms that differ by at most 3 k -grams)I or use Jaccard coefficient > threshold

Jaccard(A,B) =|A ∩ B||A ∪ B|

Example:

I Bigram index, misspelled word bordroomI Bigrams: bo, or, rd, dr, ro, oo, om

Jörg Tiedemann 45/49

Page 68: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

k -gram indexes for spelling correction: bordroom

RD aboard ardent boardroom border

OR border lord morbid sordid

BO aboard about boardroom border

- - - -

- - - -

- - - -

...

→ “boardroom” exists in 6 out of 7 lists→ Jaccard = 6/(8 + 7− 6) = 6/9 ≈ 0.67

Jörg Tiedemann 46/49

Page 69: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Context-sensitive spelling correction

One approach:

I break phrase query into conjunction of biwordsI look for biwords that need only one term correctedI get phrase matches and rank them ...

Jörg Tiedemann 47/49

Page 70: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

Context-sensitive spelling correction

Another approach: Hit-based spelling correction

I Example: flew form munich

I Try all phrases with possible corrections:I Try query “flea form munich”I Try query “flew from munich”I Try query “flew form munch”I The correct query “flew from munich” has most hits.

Many alternatives! → Not very efficient!

I try to correct only if few hits returned

I tweaking with query logs and expected hits

Jörg Tiedemann 48/49

Page 71: Information Retrieval and Search Engines - Dictionaries ... · Information Retrieval and Search Engines Dictionaries & Tolerant Retrieval Jörg Tiedemann jorg.tiedemann@helsinki.fi

Recap Dictionaries Wildcard queries Spelling Correction

To sum up

Using a positional inverted index with skip pointers

I efficient dictionary storageI wild-card indexI spelling correctionI Soundex: Find phonetic alternatives

We can quickly run a query like(SPELL(moriset) /3 tor*to) OR SOUNDEX(chaikofski)

Jörg Tiedemann 49/49