
Page 1: Probabilistic Spelling Correction

Probabilistic Spelling Correction
CE-324: Modern Information Retrieval
Sharif University of Technology

M. Soleymani

Fall 2016

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures (CS-276, Stanford)

Page 2: Probabilistic Spelling Correction

Applications of spelling correction


Page 3: Probabilistic Spelling Correction

Spelling Tasks


Spelling Error Detection

Spelling Error Correction:

Autocorrect

hte → the

Suggest a correction

Suggestion lists

Page 4: Probabilistic Spelling Correction

Types of spelling errors


Non-word Errors

graffe → giraffe

Real-word Errors

Typographical errors

three → there

Cognitive Errors (homophones)

piece → peace

too → two

your → you’re

Real-word correction almost always needs to be context-sensitive

Page 5: Probabilistic Spelling Correction

Spelling correction steps


For each word w, generate candidate set:

Find candidate words with similar pronunciations

Find candidate words with similar spellings

Choose best candidate

By “Weighted edit distance” or “Noisy Channel” approach

Context-sensitive – so we have to consider whether the surrounding words “make sense”

“Flying form Heathrow to LAX” → “Flying from Heathrow to LAX”
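A minimal Python sketch of these steps under simplifying assumptions: candidates are vocabulary words one edit away, the language model is the unigram prior, and the channel model is uniform over the candidates (the slides use richer models for both):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + substitutes + inserts)

def correct(word, counts, total):
    """Keep candidates that occur in the collection; rank by the unigram prior C(w)/T."""
    candidates = [w for w in edits1(word) if w in counts] or [word]
    return max(candidates, key=lambda w: counts[w] / total)

tokens = "the thaw began when the sun warmed the frozen ground".split()
counts = Counter(tokens)
total = sum(counts.values())
print(correct("thw", counts, total))  # -> 'the' (prior 3/10 beats 'thaw' at 1/10)
```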

Page 6: Probabilistic Spelling Correction

Candidate Testing: Damerau-Levenshtein edit distance


Minimal edit distance between two strings, where edits are:

Insertion

Deletion

Substitution

Transposition of two adjacent letters
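A dynamic-programming sketch of this distance in Python (the restricted "optimal string alignment" variant, with unit cost for each of the four edits; this variant choice is an assumption):

```python
def damerau_levenshtein(s, t):
    """Minimal number of insertions, deletions, substitutions, and
    transpositions of adjacent letters needed to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(damerau_levenshtein("hte", "the"))         # 1 (one transposition)
print(damerau_levenshtein("graffe", "giraffe"))  # 1 (one insertion)
```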

Page 7: Probabilistic Spelling Correction


Page 8: Probabilistic Spelling Correction

Noisy channel intuition


Page 9: Probabilistic Spelling Correction

Noisy channel


We see an observation 𝑥 of a misspelled word

Find the correct word 𝑤
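In the standard noisy-channel formulation, the correction is the candidate w that maximizes the posterior P(w|x); by Bayes' rule this factors into a channel (error) model and a language model:

ŵ = argmax_w P(w | x) = argmax_w P(x | w) · P(w)

P(x) is the same for every candidate, so it can be dropped from the maximization.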

Page 10: Probabilistic Spelling Correction

Language Model


Take a big supply of words with T tokens:

p(w) = C(w) / T

Supply of words

your document collection

In other applications:

you can take the supply to be typed queries (suitably filtered) – when a static dictionary is inadequate

C(w) = # occurrences of w
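A sketch of this estimate in Python; the tiny token list stands in for the document collection:

```python
from collections import Counter

def unigram_model(tokens):
    """Return p(w) = C(w) / T estimated from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())          # T = number of tokens
    return lambda w: counts[w] / total    # unseen words get probability 0

tokens = "two of thew apples fell near the other two".split()
p = unigram_model(tokens)
print(p("two"))   # 2/9
print(p("thew"))  # 1/9
```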

Page 11: Probabilistic Spelling Correction

Unigram prior probability


Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

Page 12: Probabilistic Spelling Correction

Channel model probability


Error model probability, Edit probability

Misspelled word: x = x1, x2, …, xm

Correct word: w = w1, w2, …, wn

P(x|w) = probability of the edit (deletion/insertion/substitution/transposition)

Page 13: Probabilistic Spelling Correction

Calculating p(x|w)

Still a research question.

Can be estimated.

Some simple ways, e.g.:

Confusion matrix: a 26×26 table that records how many times one letter was incorrectly typed in place of another.

Usually there are four confusion matrices: deletion, insertion, substitution, and transposition.
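A sketch of how the substitution matrix might be filled from a list of (misspelling, intended word) pairs; the pair list and the restriction to equal-length pairs differing in exactly one letter are illustrative simplifications:

```python
from collections import defaultdict

def substitution_counts(pairs):
    """sub[x][y]: number of times the intended letter y was typed as x,
    counted from (typo, intended) pairs that differ by one substitution."""
    sub = defaultdict(lambda: defaultdict(int))
    for typo, intended in pairs:
        if len(typo) != len(intended):
            continue                      # different lengths: not a pure substitution
        diffs = [(x, y) for x, y in zip(typo, intended) if x != y]
        if len(diffs) == 1:
            x, y = diffs[0]
            sub[x][y] += 1
    return sub

pairs = [("acress", "across"), ("thay", "they"), ("raning", "running")]
sub = substitution_counts(pairs)
print(dict(sub["e"]))  # {'o': 1}: intended 'o' typed as 'e' once
print(dict(sub["a"]))  # {'e': 1}: intended 'e' typed as 'a' once
```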

Page 14: Probabilistic Spelling Correction

Computing error probability: Confusion matrix


del[x,y]: count(xy typed as x)

ins[x,y]: count(x typed as xy)

sub[x,y]: count(y typed as x)

trans[x,y]: count(xy typed as yx)

Insertion and deletion conditioned on previous character

Page 15: Probabilistic Spelling Correction

Confusion matrix for substitution

The cell [o,e] of the substitution confusion matrix gives the count of times that o was typed in place of the intended e (i.e., sub[o,e]).

Page 16: Probabilistic Spelling Correction

Channel model

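One standard way (Kernighan, Church & Gale 1990, whose count definitions match the previous slide) to turn these counts into the edit probability, assuming a single error at position i:

P(x|w) = del[wi−1, wi] / count(wi−1 wi) for a deletion
P(x|w) = ins[wi−1, xi] / count(wi−1) for an insertion
P(x|w) = sub[xi, wi] / count(wi) for a substitution
P(x|w) = trans[wi, wi+1] / count(wi wi+1) for a transposition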

Page 17: Probabilistic Spelling Correction

Smoothing probabilities: Add-1 smoothing


|A|: size of the character alphabet
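For example, the substitution case of the channel model with add-1 smoothing becomes (a sketch of the usual form; the same adjustment applies to the other edit types):

P(x|w) = (sub[xi, wi] + 1) / (count(wi) + |A|)

where |A| is the number of characters in the alphabet (26 for lowercase English letters).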

Page 18: Probabilistic Spelling Correction

Channel model for “acress”


Page 19: Probabilistic Spelling Correction


Page 20: Probabilistic Spelling Correction


Page 21: Probabilistic Spelling Correction

Noisy channel for real-word spell correction


Given a sentence w1, w2, w3, …, wn

Generate a set of candidates for each word wi

Candidate(w1) = {w1, w’1, w’’1, w’’’1, …}

Candidate(w2) = {w2, w’2, w’’2, w’’’2, …}

…

Candidate(wn) = {wn, w’n, w’’n, w’’’n, …}

Choose the sequence W that maximizes P(W)
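Written out under the noisy channel, with X = x1 … xn the typed sentence and W a candidate sequence drawn from these sets: Ŵ = argmax_W P(X | W) · P(W), where P(X | W) multiplies the per-word channel probabilities (including P(w|w) for words left unchanged, discussed on a later slide) and P(W) comes from the language model.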

Page 22: Probabilistic Spelling Correction

Incorporating context words: Context-sensitive spelling correction


Determining whether actress or across is appropriate will require looking at the context of use

A bigram language model conditions the probability of a word on (just) the previous word:

P(w1 … wn) = P(w1) P(w2 | w1) … P(wn | wn−1)

Page 23: Probabilistic Spelling Correction

Incorpora*ng context words


For unigram counts, P(wk) is always non-zero if our dictionary is derived from the document collection.

This won't be true of P(wk | wk−1). We need to smooth:

add-1 smoothing on this conditional distribution

Interpolate a unigram and a bigram:
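A common form of this interpolation is P̂(wk | wk−1) = λ P(wk | wk−1) + (1 − λ) P(wk), with λ a tuning weight (the placement and value of λ here are assumptions). A Python sketch with an illustrative toy corpus:

```python
from collections import Counter

def interpolated_bigram(tokens, lam=0.75):
    """P-hat(w | prev) = lam * P(w | prev) + (1 - lam) * P(w)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def p(w, prev):
        p_uni = unigrams[w] / total
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    return p

tokens = "a stellar and versatile actress whose combination of sass and glamour".split()
p = interpolated_bigram(tokens)
print(p("actress", "versatile"))  # seen bigram: high probability
print(p("across", "versatile"))   # bigram and unigram unseen in this toy corpus: 0.0
```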

Page 24: Probabilistic Spelling Correction

Using a bigram language model


Page 25: Probabilistic Spelling Correction

Using a bigram language model


Page 26: Probabilistic Spelling Correction

Noisy channel for real-word spell correction


Page 27: Probabilistic Spelling Correction

Noisy channel for real-word spell correction


Page 28: Probabilistic Spelling Correction

Simplification: One error per sentence


Page 29: Probabilistic Spelling Correction

Where to get the probabilities


Language model

Unigram

Bigram

Channel model

Same as for non-word spelling correction

Plus need probability for no error, P(w|w)

Page 30: Probabilistic Spelling Correction

Probability of no error


What is the channel probability for a correctly typed word?

P(“the” | “the”)

If you have a big corpus, you can estimate this percent correct

But this value depends strongly on the application
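One common convention (an illustrative choice, not specified here): fix P(w|w) = α and spread the remaining mass 1 − α over the candidate misspellings of w, setting α higher for carefully edited text than for rapid mobile typing.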

Page 31: Probabilistic Spelling Correction

Peter Norvig’s “thew” example


Page 32: Probabilistic Spelling Correction

Improvements to channel model


Allow richer edits (Brill and Moore 2000)

ent → ant

ph → f

le → al

Incorporate pronunciation into channel (Toutanova and Moore 2002)

Incorporate device into channel

Not all Android phones need to have the same error model

But spell correction may be done at the system level