NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics


Outline

● N-Grams

● Markov & Hidden Markov Models (HMMs)

● First-Order Predicate Calculus (FOPC) Basics

N-Grams

Introduction

● Word prediction is a fundamental task in spelling error correction, speech recognition, augmentative communication, and many other areas of NLP

● Word prediction can be trained on various text corpora

● An N-Gram is a word prediction model that uses the previous N-1 words to predict the next word

● In statistical NLP, an N-Gram model is called a language model (LM) or grammar

Word Prediction Examples

● It happened a long time …

● She wants to make a collect phone …

● I need to open a bank …

● Nutrition labels include serving …

● Nutrition labels include amounts of total …

Word Prediction Examples

● It happened a long time ago.

● She wants to make a collect phone call.

● I need to open a bank account.

● Nutrition labels include serving sizes.

● Nutrition labels include amounts of total fat | carbohydrate.

Augmentative Communication

● Many people with physical disabilities experience problems communicating with other people: many of them cannot speak or type

● Word prediction models can productively augment their communication efforts by automatically suggesting the next word to speak or type

● For example, people with disabilities can use simple hand movements to choose next words to speak or type

Real-Word Spelling Errors

● Real-word spelling errors are real words incorrectly used

● Examples:

– They are leaving in about fifteen minuets to go to her house.

– The study was conducted mainly be John Black.

– The design an construction of the system will take more than a year.

– Hopefully, all with continue smoothly in my absence.

– I need to notified the bank of this problem.

– He is trying to fine out.

K. Kukich, "Techniques for Automatically Correcting Words in Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.

Word Sequence Probabilities

● Word prediction is based on evaluating the probabilities of specific word sequences

● To estimate those probabilities we need a corpus (speech or text)

● We also need to determine what is counted and how: the most important decisions are how to handle punctuation marks and capitalization (text) or filled pauses like uh and um (speech)

● What is counted and how depends on the task at hand (e.g., punctuation is more important to grammar checking than to spelling correction)

Wordforms, Lemmas, Types, Tokens

● A wordform is an alphanumeric sequence actually used in the corpus (e.g., begin, began, begun)

● A lemma is a set of wordforms (e.g., {begin, began, begun})

● A token is a synonym of wordform

● A type is a dictionary entry: for example, a dictionary lists only begin as the main entry for the lemma {begin, began, begun}

Unsmoothed N-Grams

Notation: A Sequence of N Words

$w_1^n = w_1 w_2 \ldots w_n$

● Example: ‘I understand this algorithm.’

– W1 = ‘I’

– W2 = ‘understand’

– W3 = ‘this’

– W4 = ‘algorithm’

– W5 = ‘.’

Probabilities of Word Sequences

$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$

Example:

P(‘I understand this algorithm.’) =

P(‘I’) *

P(‘understand’|‘I’) *

P(‘this’|‘I understand’) *

P(‘algorithm’|‘I understand this’) *

P(‘.’|‘I understand this algorithm’)

Probabilities of Word Sequences

● How difficult is it to compute the required probabilities?

– P(‘I’) - this is easy to compute

– P(‘understand’|‘I’) – harder but quite feasible

– P(‘this’|‘I understand’) – harder but feasible

– P(‘algorithm’|‘I understand this’) – really hard

– P(‘.’|‘I understand this algorithm’) – possible but impractical

Probability Approximation

● Markov assumption: we can estimate the probability of a word given only the N previous words

● If N = 0, we have the unigram model (aka 0th-order Markov model)

● If N = 1, we have the bigram model (aka 1st-order Markov model)

● If N = 2, we have the trigram model (aka 2nd-order Markov model)

● N can be greater, but higher values are rare because they are hard to compute

Bigram Probability Approximation

● <S> is the start-of-sentence marker

● P(‘I understand this algorithm.’) =

P(‘I’|<S>) *

P(‘understand’|‘I’) *

P(‘this’|‘understand’) *

P(‘algorithm’|‘this’) *

P(‘.’ |‘algorithm’)

Trigram Probability Approximation

● <S> is the start-of-sentence marker

● P(‘I understand this algorithm.’) =

P(‘I’|<S><S>) *

P(‘understand’|‘<S>I’) *

P(‘this’|‘I understand’) *

P(‘algorithm’|‘understand this’) *

P(‘.’ |‘this algorithm’)

N-Gram Approximation

In general, $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$:

$P(w_n \mid w_1^{n-1}) \approx P(w_n)$                                    // N = 1 (unigram)

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$                       // N = 2 (bigram)

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-2} w_{n-1})$               // N = 3 (trigram)

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-3} w_{n-2} w_{n-1})$       // N = 4 (4-gram)

Bigram Approximation

$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$

Bigram Approximation

Bigram            Probability
<S> I             0.25
I understand      0.3
understand this   0.05
this algorithm    0.7
algorithm .       0.45

P(‘I understand this algorithm.’) =

P(‘I’|<S>) * P(‘understand’|‘I’) * P(‘this’|‘understand’) * P(‘algorithm’|‘this’) * P(‘.’ |‘algorithm’) =

0.25 * 0.3 * 0.05 * 0.7 * 0.45 =

0.00118125
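A minimal sketch of this computation in Python (the bigram probabilities are the made-up values from the table above; in practice they would be estimated from a corpus):

    # Illustrative bigram probabilities from the example above.
    bigram_prob = {
        ('<S>', 'I'): 0.25,
        ('I', 'understand'): 0.3,
        ('understand', 'this'): 0.05,
        ('this', 'algorithm'): 0.7,
        ('algorithm', '.'): 0.45,
    }

    def sentence_prob(tokens):
        """Multiply bigram probabilities P(w_k | w_{k-1}) over the sentence."""
        p = 1.0
        for prev, cur in zip(['<S>'] + tokens, tokens):
            p *= bigram_prob.get((prev, cur), 0.0)  # unseen bigrams get probability 0 here
        return p

    print(sentence_prob(['I', 'understand', 'this', 'algorithm', '.']))  # ≈ 0.00118125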

Logprobs

● If we compute raw probability products, we risk numerical underflow: on long word sequences the product of many small probabilities eventually rounds to zero

● To address this problem, the probabilities are computed in logarithmic space: instead of computing the product of probabilities, the sum of the logarithms of those probabilities is computed

● log(P(A)P(B)) = log(P(A)) + log(P(B))

● The original product can be recovered by exponentiating: P(A)P(B) = exp(log(P(A)) + log(P(B)))
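A small sketch of the same computation in log space (the probabilities are the same made-up bigram values as above):

    import math

    probs = [0.25, 0.3, 0.05, 0.7, 0.45]   # illustrative bigram probabilities

    product = 1.0
    for p in probs:
        product *= p                        # raw product: risks underflow on long sequences

    logprob = sum(math.log(p) for p in probs)   # sum of logs stays in a safe numeric range

    print(product)            # ≈ 0.00118125
    print(math.exp(logprob))  # exponentiating the log-sum recovers the same product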

Bigram Computation

$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{i=1}^{V} C(w_{n-1} w_i)} = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$

where C(·) is a count of occurrences in the corpus, V is the dictionary size, and $\sum_{i=1}^{V} C(w_{n-1} w_i) = C(w_{n-1})$
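A minimal sketch of this MLE bigram estimation on a tiny made-up corpus (the two sentences and the resulting counts are purely illustrative):

    from collections import Counter

    # Toy training corpus; <S> marks the start of each sentence.
    corpus = [
        ['<S>', 'I', 'understand', 'this', 'algorithm', '.'],
        ['<S>', 'I', 'understand', 'this', 'proof', '.'],
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        unigram_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))

    def p_mle(w, prev):
        """P(w | prev) = C(prev w) / C(prev)."""
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    print(p_mle('understand', 'I'))    # 2/2 = 1.0
    print(p_mle('algorithm', 'this'))  # 1/2 = 0.5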

N-Gram Generalization

$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
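A hedged sketch of the same estimate for an arbitrary N (the helper names and the toy sentence are assumptions for illustration; counts would normally come from a large corpus, as in the previous sketch):

    from collections import Counter

    def ngram_counts(sentences, n):
        """Count all n-grams (as tuples) in a list of tokenized sentences."""
        counts = Counter()
        for s in sentences:
            for i in range(len(s) - n + 1):
                counts[tuple(s[i:i + n])] += 1
        return counts

    def p_mle(w, history, counts_n, counts_n_minus_1):
        """P(w | history) = C(history + w) / C(history); history has N-1 tokens."""
        return counts_n[history + (w,)] / counts_n_minus_1[history]

    sentences = [['<S>', '<S>', 'I', 'understand', 'this', 'algorithm', '.']]
    tri = ngram_counts(sentences, 3)
    bi = ngram_counts(sentences, 2)
    print(p_mle('this', ('I', 'understand'), tri, bi))  # 1/1 = 1.0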

Maximum Likelihood Estimation

● This N-Gram probability estimation is known as Maximum Likelihood Estimation (MLE)

● It is the MLE because it maximizes the likelihood of the training set, i.e., it exactly reproduces the statistics of the training set

● If a word W occurs 5 times in a training corpus of 100 words, its estimated probability of occurrence is P(W) = 5/100

● This is not necessarily a good estimate of P(W) in other corpora, but it is the estimate that maximizes the probability of the training corpus

Smoothed N-Grams

Unsmoothed N-Gram Problem

● Since any corpus is finite, some valid N-Grams will not occur in the corpus used to compute the N-Gram counts

● To put it differently, the N-Gram matrix for any corpus is likely to be sparse: it will have a large number of possible N-Grams with zero counts

● MLE also produces unreliable estimates when counts are greater than 0 but still small (where "small" is relative)

● Smoothing is a set of techniques used to overcome zero or low counts

Add-One Smoothing

One way to smooth is to add one to all N-Gram counts and normalize by the size of the dictionary (V):

$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$                        // unsmoothed

$P^{*}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$            // add-one smoothed
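A minimal sketch of add-one smoothing for bigrams (the counts and the dictionary size V are placeholder values):

    from collections import Counter

    # Hypothetical counts from some training corpus.
    bigram_counts = Counter({('this', 'algorithm'): 1})
    unigram_counts = Counter({'this': 2})
    V = 10_000  # assumed dictionary size

    def p_add_one(w, prev):
        """P*(w | prev) = (C(prev w) + 1) / (C(prev) + V)."""
        return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

    print(p_add_one('algorithm', 'this'))  # (1 + 1) / (2 + 10000)
    print(p_add_one('banana', 'this'))     # an unseen bigram still gets (0 + 1) / (2 + 10000)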

A Problem with Add-One Smoothing

● Much of the total probability mass moves to the N-Grams with zero counts

● Researchers attribute it to the arbitrary choice of the value of 1

● Add-One smoothing appears to be worse than other methods at predicting N-Grams with zero counts

● Some research indicates that add-one smoothing is no better than no smoothing

Good-Turing Discounting

● Probability mass assigned to N-Grams with zero or low counts is re-estimated using the counts of N-Grams with higher counts

● Let Nc be the number of N-Grams that occur c times in a corpus

● N0 is the number of N-Grams that occur 0 times

● N1 is the number of N-Grams that occur once

● N2 is the number of N-Grams that occur twice

Good-Turing Discounting

Let C(w_1 … w_n) = c be the count of some N-Gram w_1 … w_n. Then the new count smoothed by the GTD, i.e., C*(w_1 … w_n), is:

$C^{*}(w_1 \ldots w_n) = (c + 1)\,\frac{N_{c+1}}{N_c}$
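A hedged sketch of this re-estimate (the frequency-of-frequencies values N_c below are made up; real implementations also smooth N_c itself for large c, which this sketch does not do):

    from collections import Counter

    # Hypothetical frequency-of-frequencies: N[c] = number of distinct N-Grams seen c times.
    N = Counter({1: 5000, 2: 1200, 3: 400, 4: 150})

    def good_turing_count(c):
        """C* = (c + 1) * N_{c+1} / N_c (undefined when N_c or N_{c+1} is 0)."""
        if N[c] == 0 or N[c + 1] == 0:
            raise ValueError("N_c and N_{c+1} must be nonzero for this simple formula")
        return (c + 1) * N[c + 1] / N[c]

    print(good_turing_count(1))  # 2 * 1200 / 5000 = 0.48
    print(good_turing_count(2))  # 3 * 400 / 1200 = 1.0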

N-Gram Vectors

● N-Grams can be computed over any finite symbolic set

● Those symbolic sets are called alphabets and can consist of wordforms, waveforms, individual letters, etc.

● The choice of the symbols in the alphabet depends on the application

● Regardless of the application, the objective is to take an input sequence over a specific alphabet and compute its N-Gram frequency vector

Dimensions of N-Gram Vectors

● Let A be an alphabet and n > 0 be the size of the N-Gram

● The number of N-Gram dimensions is |A|^n

● Suppose the alphabet has 26 characters and we compute trigrams over that alphabet; then the number of possible trigrams, i.e., the dimension of the N-Gram frequency vectors, is 26^3 = 17,576

● A practical implication is that N-Gram frequency vectors are sparse even for low values of n

Example

● Suppose the alphabet A = {a, <space>, <start>}

● The number of possible bigrams (n = 2) is |A|^2 = 9:

– 1) aa; 2) a<start>; 3) a<space>; 4) <start><start>; 5) <start>a; 6) <start><space>; 7) <space><space>; 8) <space>a; 9) <space><start>

● Suppose the input is ‘a a’

● The input’s bigrams are: <start>a, a<space>, <space>a

● Then the input’s N-Gram vector is: (0, 0, 1, 0, 1, 0, 0, 1, 0)
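A small sketch that builds this bigram frequency vector for the toy alphabet, using the same bigram ordering as in the list above:

    # Bigram ordering as in the slide.
    all_bigrams = [
        ('a', 'a'), ('a', '<start>'), ('a', '<space>'),
        ('<start>', '<start>'), ('<start>', 'a'), ('<start>', '<space>'),
        ('<space>', '<space>'), ('<space>', 'a'), ('<space>', '<start>'),
    ]

    def bigram_vector(symbols):
        """Count each possible bigram in the symbol sequence (prepended with <start>)."""
        seq = ['<start>'] + symbols
        pairs = list(zip(seq, seq[1:]))
        return [pairs.count(bg) for bg in all_bigrams]

    # The input ‘a a’ as a symbol sequence: a, <space>, a.
    print(bigram_vector(['a', '<space>', 'a']))  # [0, 0, 1, 0, 1, 0, 0, 1, 0]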


Markov & Hidden Markov Models

Markov Models

Markov Models are closely related to N-Grams: the basic idea is to estimate the conditional probability of the n-th observation given the sequence of the previous n-1 observations:

$P(w_n \mid w_1 \ldots w_{n-1})$

Markov Assumption

$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1})$                      // 1st order

$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1} w_{n-2})$              // 2nd order

$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1} w_{n-2} w_{n-3})$      // 3rd order

● If n = 5 and the size of the observation alphabet is 3, we need to collect statistics over 3^5 = 243 sequence types

● If n = 2 and the size of the observation alphabet is 3, we need to collect statistics over 3^2 = 9 sequence types

● So the number of previous observations we condition on matters

Weather Example 01

Weather Today vs. Weather Tomorrow

Today \ Tomorrow    Sunny   Rainy   Foggy
Sunny               0.8     0.05    0.15
Rainy               0.2     0.6     0.2
Foggy               0.2     0.3     0.5

$P(w_2 = Sunny, w_3 = Rainy \mid w_1 = Sunny)$

$= P(w_3 = Rainy \mid w_2 = Sunny, w_1 = Sunny) \cdot P(w_2 = Sunny \mid w_1 = Sunny)$

$\approx P(w_3 = Rainy \mid w_2 = Sunny) \cdot P(w_2 = Sunny \mid w_1 = Sunny)$

$= 0.05 \cdot 0.8 = 0.04$
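A quick check of this example in code, reading the table above as P(tomorrow | today):

    # P(tomorrow | today) from the table above.
    transition = {
        'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
        'Rainy': {'Sunny': 0.2, 'Rainy': 0.6, 'Foggy': 0.2},
        'Foggy': {'Sunny': 0.2, 'Rainy': 0.3, 'Foggy': 0.5},
    }

    # P(w2=Sunny, w3=Rainy | w1=Sunny) under the 1st-order Markov assumption.
    p = transition['Sunny']['Sunny'] * transition['Sunny']['Rainy']
    print(p)  # 0.8 * 0.05 = 0.04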

Weather Example 02

Weather Today vs. Weather Tomorrow

Today \ Tomorrow    Sunny   Rainy   Foggy
Sunny               0.8     0.05    0.15
Rainy               0.2     0.6     0.2
Foggy               0.2     0.3     0.5

$P(w_3 = Rainy \mid w_1 = Foggy)$

$= P(w_3 = Rainy, w_2 = Sunny \mid w_1 = Foggy) + P(w_3 = Rainy, w_2 = Rainy \mid w_1 = Foggy) + P(w_3 = Rainy, w_2 = Foggy \mid w_1 = Foggy)$

$\approx P(w_3 = Rainy \mid w_2 = Sunny) P(w_2 = Sunny \mid w_1 = Foggy) + P(w_3 = Rainy \mid w_2 = Rainy) P(w_2 = Rainy \mid w_1 = Foggy) + P(w_3 = Rainy \mid w_2 = Foggy) P(w_2 = Foggy \mid w_1 = Foggy)$

$= 0.05 \cdot 0.2 + 0.6 \cdot 0.3 + 0.3 \cdot 0.5 = 0.34$
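The same marginalization over the unobserved middle day, again reading the table as P(tomorrow | today):

    transition = {
        'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
        'Rainy': {'Sunny': 0.2, 'Rainy': 0.6, 'Foggy': 0.2},
        'Foggy': {'Sunny': 0.2, 'Rainy': 0.3, 'Foggy': 0.5},
    }

    # P(w3=Rainy | w1=Foggy): sum over the unobserved middle day w2.
    p = sum(transition['Foggy'][w2] * transition[w2]['Rainy']
            for w2 in ('Sunny', 'Rainy', 'Foggy'))
    print(p)  # 0.2*0.05 + 0.3*0.6 + 0.5*0.3 = 0.34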

Weather Example 03

Weather vs. Umbrella

Weather   Umbrella
Sunny     0.8
Rainy     0.2
Foggy     0.2

$P(w_1 \ldots w_n \mid u_1 \ldots u_n) = \frac{P(u_1 \ldots u_n \mid w_1 \ldots w_n)\,P(w_1 \ldots w_n)}{P(u_1 \ldots u_n)}$
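A sketch of how such a model can be scored, under the assumption that the Umbrella column gives P(umbrella is carried | weather) and that each observation depends only on that day's weather (the standard HMM assumptions); the initial weather distribution and the example sequences are made up:

    transition = {
        'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
        'Rainy': {'Sunny': 0.2, 'Rainy': 0.6, 'Foggy': 0.2},
        'Foggy': {'Sunny': 0.2, 'Rainy': 0.3, 'Foggy': 0.5},
    }
    # Assumed reading of the table above: P(umbrella | weather).
    p_umbrella = {'Sunny': 0.8, 'Rainy': 0.2, 'Foggy': 0.2}

    def joint(weather, umbrella, p_first=1/3):
        """P(w_1..w_n, u_1..u_n) = P(w_1) * prod P(w_i | w_{i-1}) * prod P(u_i | w_i)."""
        p = p_first
        for prev, cur in zip(weather, weather[1:]):
            p *= transition[prev][cur]
        for w, u in zip(weather, umbrella):
            p *= p_umbrella[w] if u else (1 - p_umbrella[w])
        return p

    print(joint(['Sunny', 'Rainy'], [False, True]))  # (1/3) * 0.05 * 0.2 * 0.2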

Speech Recognition

w is a sequence of tokens

L is a language

y is an acoustic signal

$\hat{w} = \arg\max_{w \in L} P(w \mid y) = \arg\max_{w \in L} \frac{P(y \mid w)\,P(w)}{P(y)} = \arg\max_{w \in L} P(y \mid w)\,P(w)$
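A toy sketch of this noisy-channel argmax; the candidate transcriptions and their acoustic likelihoods P(y|w) and language-model probabilities P(w) are entirely made-up numbers, only the argmax structure matters:

    # Hypothetical candidates with assumed acoustic and language-model scores.
    candidates = {
        'I need to open a bank account': {'p_y_given_w': 1e-8, 'p_w': 1e-6},
        'I need to open a bang account': {'p_y_given_w': 2e-8, 'p_w': 1e-9},
    }

    # w_hat = argmax_w P(y|w) * P(w); P(y) is constant across candidates, so it is dropped.
    best = max(candidates, key=lambda w: candidates[w]['p_y_given_w'] * candidates[w]['p_w'])
    print(best)  # 'I need to open a bank account'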

FOPC Basics

Basic Notions

● Conceptualization = Objects + Relations

● Universe of Discourse is the set of objects in a conceptualization

● Functions & Relations

Example

[Figure: a blocks-world scene with blocks a, b, c, d, e on a table]

<{a, b, c, d, e}, {hat}, {on, above, clear, table}>

hat is a function; on, above, clear, and table are relations

References & Recommended Reading

● Ch. 6, D. Jurafsky & J. Martin. Speech and Language Processing. Prentice Hall, ISBN 0-13-095069-6

● E. Fosler-Lussier. 1998. Markov Models and Hidden Markov Models: A Brief Tutorial. ICSI, UC Berkeley