NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics


Outline

● N-Grams

● Markov & Hidden Markov Models (HMMs)

● First-Order Predicate Calculus (FOPC) Basics

N-Grams

Introduction

● Word prediction is a fundamental task in spelling error correction, speech recognition, augmentative communication, and many other areas of NLP

● Word prediction can be trained on various text corpora

● An N-Gram is a word prediction model that uses the previous N-1 words to predict the next word

● In statistical NLP, an N-Gram model is called a language model (LM) or grammar

Word Prediction Examples

● It happened a long time …

● She wants to make a collect phone …

● I need to open a bank …

● Nutrition labels include serving …

● Nutrition labels include amounts of total …

Word Prediction Examples

● It happened a long time ago.

● She wants to make a collect phone call.

● I need to open a bank account.

● Nutrition labels include serving sizes.

● Nutrition labels include amounts of total fat | carbohydrate.

Augmentative Communication

● Many people with physical disabilities experience problems communicating with other people: many of them cannot speak or type

● Word prediction models can productively augment their communication efforts by automatically suggesting the next word to speak or type

● For example, people with disabilities can use simple hand movements to choose next words to speak or type

Real-Word Spelling Errors

● Real-word spelling errors are real words incorrectly used

● Examples:

– They are leaving in about fifteen minuets to go to her house.

– The study was conducted mainly be John Black.

– The design an construction of the system will take more than a year.

– Hopefully, all with continue smoothly in my absence.

– I need to notified the bank of this problem.

– He is trying to fine out.

K. Kukich, "Techniques for Automatically Correcting Words in Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.

Word Sequence Probabilities

● Word prediction is based on evaluating the probabilities of specific word sequences

● To estimate those probabilities we need a corpus (speech or text)

● We also need to determine what is counted and how: the most important decisions are how to handle punctuation marks and capitalization (text) or filled pauses like uh and um (speech)

● What is counted and how depends on the task at hand (e.g., punctuation is more important to grammar checking than to spelling correction)

Wordforms, Lemmas, Types, Tokens

● A wordform is an alphanumeric sequence actually used in the corpus (e.g., begin, began, begun)

● A lemma is a set of wordforms (e.g., {begin, began, begun})

● A token is a synonym of wordform

● A type is a dictionary entry: for example, a dictionary lists only begin as the main entry for the lemma {begin, began, begun}

Unsmoothed N-Grams

Notation: A Sequence of N Words

$w_1^n = w_1 w_2 \ldots w_n$

● Example: ‘I understand this algorithm.’

– W1 = ‘I’

– W2 = ‘understand’

– W3 = ‘this’

– W4 = ‘algorithm’

– W5 = ‘.’

Probabilities of Word Sequences

$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$

Example:

P(‘I understand this algorithm.’) =

P(‘I’) *

P(‘understand’|‘I’) *

P(‘this’|‘I understand’) *

P(‘algorithm’|‘I understand this’) *

P(‘.’|‘I understand this algorithm’)

Probabilities of Word Sequences

● How difficult is it to compute the required probabilities?

– P(‘I’) - this is easy to compute

– P(‘understand’|‘I’) – harder but quite feasible

– P(‘this’|‘I understand’) – harder but feasible

– P(‘algorithm’|‘I understand this’) – really hard

– P(‘.’|‘I understand this algorithm’) – possible but impractical

Probability Approximation

● Markov assumption: we can estimate the probability of a word given only the N previous words

● If N = 0, we have the unigram model (aka 0th-order Markov model)

● If N = 1, we have the bigram model (aka 1st-order Markov model)

● If N = 2, we have the trigram model (aka 2nd-order Markov model)

● N can be greater, but higher values are rare because they are hard to compute

Bigram Probability Approximation

● <S> is the start-of-sentence marker

● P(‘I understand this algorithm.’) =

P(‘I’|<S>) *

P(‘understand’|‘I’) *

P(‘this’|‘understand’) *

P(‘algorithm’|‘this’) *

P(‘.’ |‘algorithm’)

Trigram Probability Approximation

● <S> is the start-of-sentence marker

● P(‘I understand this algorithm.’) =

P(‘I’|<S><S>) *

P(‘understand’|‘<S>I’) *

P(‘this’|‘I understand’) *

P(‘algorithm’|‘understand this’) *

P(‘.’ |‘this algorithm’)

N-Gram Approximation

In general, $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$:

$P(w_n \mid w_1^{n-1}) \approx P(w_n)$                                    // N = 1 (unigram)

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$                       // N = 2 (bigram)

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-2} w_{n-1})$               // N = 3 (trigram)

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-3} w_{n-2} w_{n-1})$       // N = 4 (4-gram)

Bigram Approximation

$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$

Bigram Approximation

Bigram            Probability
<S> I             0.25
I understand      0.3
understand this   0.05
this algorithm    0.7
algorithm .       0.45

P(‘I understand this algorithm.’) =

P(‘I’|<S>) * P(‘understand’|‘I’) * P(‘this’|‘understand’) * P(‘algorithm’|‘this’) * P(‘.’ |‘algorithm’) =

0.25 * 0.3 * 0.05 * 0.7 * 0.45 =

0.00118125
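A minimal sketch of this computation in Python (the bigram probabilities are the made-up values from the table above; in practice they would be estimated from a corpus):

    # Illustrative bigram probabilities from the example above.
    bigram_prob = {
        ('<S>', 'I'): 0.25,
        ('I', 'understand'): 0.3,
        ('understand', 'this'): 0.05,
        ('this', 'algorithm'): 0.7,
        ('algorithm', '.'): 0.45,
    }

    def sentence_prob(tokens):
        """Multiply bigram probabilities P(w_k | w_{k-1}) over the sentence."""
        p = 1.0
        for prev, cur in zip(['<S>'] + tokens, tokens):
            p *= bigram_prob.get((prev, cur), 0.0)  # unseen bigrams get probability 0 here
        return p

    print(sentence_prob(['I', 'understand', 'this', 'algorithm', '.']))  # ≈ 0.00118125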

Logprobs

● If we compute raw probability products, we risk numerical underflow: on long word sequences the product of many small probabilities eventually rounds to zero

● To address this problem, the probabilities are computed in logarithmic space: instead of computing the product of probabilities, the sum of the logarithms of those probabilities is computed

● log(P(A)P(B)) = log(P(A)) + log(P(B))

● The original product can be recovered by exponentiating: P(A)P(B) = exp(log(P(A)) + log(P(B)))
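A small sketch of the same computation in log space (the probabilities are the same made-up bigram values as above):

    import math

    probs = [0.25, 0.3, 0.05, 0.7, 0.45]   # illustrative bigram probabilities

    product = 1.0
    for p in probs:
        product *= p                        # raw product: risks underflow on long sequences

    logprob = sum(math.log(p) for p in probs)   # sum of logs stays in a safe numeric range

    print(product)            # ≈ 0.00118125
    print(math.exp(logprob))  # exponentiating the log-sum recovers the same product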

Bigram Computation

$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{i=1}^{V} C(w_{n-1} w_i)} = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$

where C(·) is a count of occurrences in the corpus, V is the dictionary size, and $\sum_{i=1}^{V} C(w_{n-1} w_i) = C(w_{n-1})$
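A minimal sketch of this MLE bigram estimation on a tiny made-up corpus (the two sentences and the resulting counts are purely illustrative):

    from collections import Counter

    # Toy training corpus; <S> marks the start of each sentence.
    corpus = [
        ['<S>', 'I', 'understand', 'this', 'algorithm', '.'],
        ['<S>', 'I', 'understand', 'this', 'proof', '.'],
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        unigram_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))

    def p_mle(w, prev):
        """P(w | prev) = C(prev w) / C(prev)."""
        return bigram_counts[(prev, w)] / unigram_counts[prev]

    print(p_mle('understand', 'I'))    # 2/2 = 1.0
    print(p_mle('algorithm', 'this'))  # 1/2 = 0.5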

N-Gram Generalization

$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
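A hedged sketch of the same estimate for an arbitrary N (the helper names and the toy sentence are assumptions for illustration; counts would normally come from a large corpus, as in the previous sketch):

    from collections import Counter

    def ngram_counts(sentences, n):
        """Count all n-grams (as tuples) in a list of tokenized sentences."""
        counts = Counter()
        for s in sentences:
            for i in range(len(s) - n + 1):
                counts[tuple(s[i:i + n])] += 1
        return counts

    def p_mle(w, history, counts_n, counts_n_minus_1):
        """P(w | history) = C(history + w) / C(history); history has N-1 tokens."""
        return counts_n[history + (w,)] / counts_n_minus_1[history]

    sentences = [['<S>', '<S>', 'I', 'understand', 'this', 'algorithm', '.']]
    tri = ngram_counts(sentences, 3)
    bi = ngram_counts(sentences, 2)
    print(p_mle('this', ('I', 'understand'), tri, bi))  # 1/1 = 1.0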

Maximum Likelihood Estimation

● This N-Gram probability estimation is known as Maximum Likelihood Estimation (MLE)

● It is the MLE because it maximizes the likelihood of the training set, i.e., it exactly reproduces the statistics of the training set

● If a word W occurs 5 times in a training corpus of 100 words, its estimated probability of occurrence is P(W) = 5/100

● This is not necessarily a good estimate of P(W) in other corpora, but it is the estimate that maximizes the probability of the training corpus

Smoothed N-Grams

Unsmoothed N-Gram Problem

● Since any corpus is finite, some valid N-Grams will not occur in the corpus used to compute the N-Gram counts

● To put it differently, the N-Gram matrix for any corpus is likely to be sparse: it will have a large number of possible N-Grams with zero counts

● MLE also produces unreliable estimates when counts are greater than 0 but still small (where "small" is relative)

● Smoothing is a set of techniques used to overcome zero or low counts

Add-One Smoothing

One way to smooth is to add one to all N-Gram counts and normalize by the size of the dictionary (V):

$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$                        // unsmoothed

$P^{*}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$            // add-one smoothed
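A minimal sketch of add-one smoothing for bigrams (the counts and the dictionary size V are placeholder values):

    from collections import Counter

    # Hypothetical counts from some training corpus.
    bigram_counts = Counter({('this', 'algorithm'): 1})
    unigram_counts = Counter({'this': 2})
    V = 10_000  # assumed dictionary size

    def p_add_one(w, prev):
        """P*(w | prev) = (C(prev w) + 1) / (C(prev) + V)."""
        return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

    print(p_add_one('algorithm', 'this'))  # (1 + 1) / (2 + 10000)
    print(p_add_one('banana', 'this'))     # an unseen bigram still gets (0 + 1) / (2 + 10000)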

A Problem with Add-One Smoothing

● Much of the total probability mass moves to the N-Grams with zero counts

● Researchers attribute it to the arbitrary choice of the value of 1

● Add-One smoothing appears to be worse than other methods at predicting N-Grams with zero counts

● Some research indicates that add-one smoothing is no better than no smoothing

Good-Turing Discounting

● Probability mass assigned to N-Grams with zero or low counts is re-estimated using the counts of N-Grams with higher counts

● Let Nc be the number of N-Grams that occur c times in a corpus

● N0 is the number of N-Grams that occur 0 times

● N1 is the number of N-Grams that occur once

● N2 is the number of N-Grams that occur twice

Good-Turing Discounting

Let C(w_1 … w_n) = c be the count of some N-Gram w_1 … w_n. Then the new count smoothed by the GTD, i.e., C*(w_1 … w_n), is:

$C^{*}(w_1 \ldots w_n) = (c + 1)\,\frac{N_{c+1}}{N_c}$
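A hedged sketch of this re-estimate (the frequency-of-frequencies values N_c below are made up; real implementations also smooth N_c itself for large c, which this sketch does not do):

    from collections import Counter

    # Hypothetical frequency-of-frequencies: N[c] = number of distinct N-Grams seen c times.
    N = Counter({1: 5000, 2: 1200, 3: 400, 4: 150})

    def good_turing_count(c):
        """C* = (c + 1) * N_{c+1} / N_c (undefined when N_c or N_{c+1} is 0)."""
        if N[c] == 0 or N[c + 1] == 0:
            raise ValueError("N_c and N_{c+1} must be nonzero for this simple formula")
        return (c + 1) * N[c + 1] / N[c]

    print(good_turing_count(1))  # 2 * 1200 / 5000 = 0.48
    print(good_turing_count(2))  # 3 * 400 / 1200 = 1.0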

N-Gram Vectors

● N-Grams can be computed over any finite symbolic set

● Those symbolic sets are called alphabets and can consist of wordforms, waveforms, individual letters, etc.

● The choice of the symbols in the alphabet depends on the application

● Regardless of the application, the objective is to take an input sequence over a specific alphabet and compute its N-Gram frequency vector

Dimensions of N-Gram Vectors

● Let A be an alphabet and n > 0 be the size of the N-Gram

● The number of N-Gram dimensions is |A|^n

● Suppose the alphabet has 26 characters and we compute trigrams over that alphabet; then the number of possible trigrams, i.e., the dimension of the N-Gram frequency vectors, is 26^3 = 17,576

● A practical implication is that N-Gram frequency vectors are sparse even for low values of n

Example

● Suppose the alphabet A = {a, <space>, <start>}

● The number of possible bigrams (n = 2) is |A|^2 = 9:

– 1) aa; 2) a<start>; 3) a<space>; 4) <start><start>; 5) <start>a; 6) <start><space>; 7) <space><space>; 8) <space>a; 9) <space><start>

● Suppose the input is ‘a a’

● The input’s bigrams are: <start>a, a<space>, <space>a

● Then the input’s N-Gram vector is: (0, 0, 1, 0, 1, 0, 0, 1, 0)
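A small sketch that builds this bigram frequency vector for the toy alphabet, using the same bigram ordering as in the list above:

    # Bigram ordering as in the slide.
    all_bigrams = [
        ('a', 'a'), ('a', '<start>'), ('a', '<space>'),
        ('<start>', '<start>'), ('<start>', 'a'), ('<start>', '<space>'),
        ('<space>', '<space>'), ('<space>', 'a'), ('<space>', '<start>'),
    ]

    def bigram_vector(symbols):
        """Count each possible bigram in the symbol sequence (prepended with <start>)."""
        seq = ['<start>'] + symbols
        pairs = list(zip(seq, seq[1:]))
        return [pairs.count(bg) for bg in all_bigrams]

    # The input ‘a a’ as a symbol sequence: a, <space>, a.
    print(bigram_vector(['a', '<space>', 'a']))  # [0, 0, 1, 0, 1, 0, 0, 1, 0]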


Markov & Hidden Markov Models

Markov Models

Markov Models are closely related to N-Grams: the basic idea is to estimate the conditional probability of the n-th observation given the sequence of the previous n-1 observations:

$P(w_n \mid w_1 \ldots w_{n-1})$

Markov Assumption

$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1})$                      // 1st order

$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1} w_{n-2})$              // 2nd order

$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1} w_{n-2} w_{n-3})$      // 3rd order

● If n = 5 and the size of the observation alphabet is 3, we need to collect statistics over 3^5 = 243 sequence types

● If n = 2 and the size of the observation alphabet is 3, we need to collect statistics over 3^2 = 9 sequence types

● So the number of previous observations we condition on matters

Weather Example 01

Weather Today vs. Weather Tomorrow

Today \ Tomorrow    Sunny   Rainy   Foggy
Sunny               0.8     0.05    0.15
Rainy               0.2     0.6     0.2
Foggy               0.2     0.3     0.5

$P(w_2 = Sunny, w_3 = Rainy \mid w_1 = Sunny)$

$= P(w_3 = Rainy \mid w_2 = Sunny, w_1 = Sunny) \cdot P(w_2 = Sunny \mid w_1 = Sunny)$

$\approx P(w_3 = Rainy \mid w_2 = Sunny) \cdot P(w_2 = Sunny \mid w_1 = Sunny)$

$= 0.05 \cdot 0.8 = 0.04$
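A quick check of this example in code, reading the table above as P(tomorrow | today):

    # P(tomorrow | today) from the table above.
    transition = {
        'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
        'Rainy': {'Sunny': 0.2, 'Rainy': 0.6, 'Foggy': 0.2},
        'Foggy': {'Sunny': 0.2, 'Rainy': 0.3, 'Foggy': 0.5},
    }

    # P(w2=Sunny, w3=Rainy | w1=Sunny) under the 1st-order Markov assumption.
    p = transition['Sunny']['Sunny'] * transition['Sunny']['Rainy']
    print(p)  # 0.8 * 0.05 = 0.04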

Weather Example 02

Weather Today vs. Weather Tomorrow

Today \ Tomorrow    Sunny   Rainy   Foggy
Sunny               0.8     0.05    0.15
Rainy               0.2     0.6     0.2
Foggy               0.2     0.3     0.5

$P(w_3 = Rainy \mid w_1 = Foggy)$

$= P(w_3 = Rainy, w_2 = Sunny \mid w_1 = Foggy) + P(w_3 = Rainy, w_2 = Rainy \mid w_1 = Foggy) + P(w_3 = Rainy, w_2 = Foggy \mid w_1 = Foggy)$

$\approx P(w_3 = Rainy \mid w_2 = Sunny) P(w_2 = Sunny \mid w_1 = Foggy) + P(w_3 = Rainy \mid w_2 = Rainy) P(w_2 = Rainy \mid w_1 = Foggy) + P(w_3 = Rainy \mid w_2 = Foggy) P(w_2 = Foggy \mid w_1 = Foggy)$

$= 0.05 \cdot 0.2 + 0.6 \cdot 0.3 + 0.3 \cdot 0.5 = 0.34$
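The same marginalization over the unobserved middle day, again reading the table as P(tomorrow | today):

    transition = {
        'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
        'Rainy': {'Sunny': 0.2, 'Rainy': 0.6, 'Foggy': 0.2},
        'Foggy': {'Sunny': 0.2, 'Rainy': 0.3, 'Foggy': 0.5},
    }

    # P(w3=Rainy | w1=Foggy): sum over the unobserved middle day w2.
    p = sum(transition['Foggy'][w2] * transition[w2]['Rainy']
            for w2 in ('Sunny', 'Rainy', 'Foggy'))
    print(p)  # 0.2*0.05 + 0.3*0.6 + 0.5*0.3 = 0.34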

Weather Example 03

Weather vs. Umbrella

Weather   Umbrella
Sunny     0.8
Rainy     0.2
Foggy     0.2

$P(w_1 \ldots w_n \mid u_1 \ldots u_n) = \frac{P(u_1 \ldots u_n \mid w_1 \ldots w_n)\,P(w_1 \ldots w_n)}{P(u_1 \ldots u_n)}$
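A sketch of how such a model can be scored, under the assumption that the Umbrella column gives P(umbrella is carried | weather) and that each observation depends only on that day's weather (the standard HMM assumptions); the initial weather distribution and the example sequences are made up:

    transition = {
        'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
        'Rainy': {'Sunny': 0.2, 'Rainy': 0.6, 'Foggy': 0.2},
        'Foggy': {'Sunny': 0.2, 'Rainy': 0.3, 'Foggy': 0.5},
    }
    # Assumed reading of the table above: P(umbrella | weather).
    p_umbrella = {'Sunny': 0.8, 'Rainy': 0.2, 'Foggy': 0.2}

    def joint(weather, umbrella, p_first=1/3):
        """P(w_1..w_n, u_1..u_n) = P(w_1) * prod P(w_i | w_{i-1}) * prod P(u_i | w_i)."""
        p = p_first
        for prev, cur in zip(weather, weather[1:]):
            p *= transition[prev][cur]
        for w, u in zip(weather, umbrella):
            p *= p_umbrella[w] if u else (1 - p_umbrella[w])
        return p

    print(joint(['Sunny', 'Rainy'], [False, True]))  # (1/3) * 0.05 * 0.2 * 0.2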

Speech Recognition

w is a sequence of tokens

L is a language

y is an acoustic signal

$\hat{w} = \arg\max_{w \in L} P(w \mid y) = \arg\max_{w \in L} \frac{P(y \mid w)\,P(w)}{P(y)} = \arg\max_{w \in L} P(y \mid w)\,P(w)$
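A toy sketch of this noisy-channel argmax; the candidate transcriptions and their acoustic likelihoods P(y|w) and language-model probabilities P(w) are entirely made-up numbers, only the argmax structure matters:

    # Hypothetical candidates with assumed acoustic and language-model scores.
    candidates = {
        'I need to open a bank account': {'p_y_given_w': 1e-8, 'p_w': 1e-6},
        'I need to open a bang account': {'p_y_given_w': 2e-8, 'p_w': 1e-9},
    }

    # w_hat = argmax_w P(y|w) * P(w); P(y) is constant across candidates, so it is dropped.
    best = max(candidates, key=lambda w: candidates[w]['p_y_given_w'] * candidates[w]['p_w'])
    print(best)  # 'I need to open a bank account'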

FOPC Basics

Basic Notions

● Conceptualization = Objects + Relations

● Universe of Discourse is the set of objects in a conceptualization

● Functions & Relations

Example

[Figure: a blocks-world scene with blocks a, b, c, d, e on a table]

<{a, b, c, d, e}, {hat}, {on, above, clear, table}>

hat is a function; on, above, clear, and table are relations

References & Recommended Reading

● Ch. 6, D. Jurafsky & J. Martin. Speech and Language Processing. Prentice Hall, ISBN 0-13-095069-6

● E. Fosler-Lussier. 1998. Markov Models and Hidden Markov Models: A Brief Tutorial. ICSI, UC Berkeley