Language Models
Hongning Wang, CS@UVa


Page 1: Language Models

Language Models

Hongning Wang, CS@UVa

Page 2: Language Models

CS 6501: Information Retrieval 2

Notion of Relevance

Relevance
• Δ(Rep(q), Rep(d)) similarity → different rep & similarity:
  – Vector space model (Salton et al., 75)
  – Prob. distr. model (Wong & Yao, 89)
• P(r=1|q,d), r ∈ {0,1}, probability of relevance:
  – Regression model (Fox, 83)
  – Generative model:
    · Doc generation → classical prob. model (Robertson & Sparck Jones, 76)
    · Query generation → LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• P(d→q) or P(q→d) probabilistic inference → different inference systems:
  – Prob. concept space model (Wong & Yao, 95)
  – Inference network model (Turtle & Croft, 91)

Page 3: Language Models


What is a statistical LM?

• A model specifying a probability distribution over word sequences
  – p("Today is Wednesday") ≈ 0.001
  – p("Today Wednesday is") ≈ 0.0000000000001
  – p("The eigenvalue is positive") ≈ 0.00001

• It can be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model


Page 4: Language Models


Why is a LM useful?

• Provides a principled way to quantify the uncertainties associated with natural language

• Allows us to answer questions like:
  – Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition)
  – Given that we observe "baseball" three times and "game" once in a news article, how likely is it about "sports"? (text categorization, information retrieval)
  – Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)

Page 5: Language Models


Source-Channel framework [Shannon 48]

Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination
  X ~ P(X)          Y ~ P(Y|X)          X' : P(X|Y) = ?

  X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X)   (Bayes rule)

When X is text, p(X) is a language model.

Many examples:
  – Speech recognition: X = word sequence, Y = speech signal
  – Machine translation: X = English sentence, Y = Chinese sentence
  – OCR error correction: X = correct word, Y = erroneous word
  – Information retrieval: X = document, Y = query
  – Summarization: X = summary, Y = document

Page 6: Language Models


Unigram language model

• Generate a piece of text by generating each word independently
  – p(w_1 w_2 … w_n) = p(w_1) p(w_2) ⋯ p(w_n)
• Essentially a multinomial distribution over the vocabulary

[Figure: "A Unigram Language Model", a bar chart of word probabilities p(w) over vocabulary words w_1 through w_25, ranging from 0 to about 0.12]

The simplest and most popular choice!
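As a minimal sketch (with made-up probabilities), a unigram model is just a multinomial over the vocabulary, and generating text means drawing each word independently:

```python
import random

# Toy unigram model: a multinomial distribution over a small vocabulary
# (illustrative probabilities only; they sum to 1).
unigram = {"text": 0.2, "mining": 0.1, "clustering": 0.05, "the": 0.3, "is": 0.35}

def sequence_prob(words, model):
    # p(w1 ... wn) = p(w1) * ... * p(wn): each word generated independently
    p = 1.0
    for w in words:
        p *= model.get(w, 0.0)
    return p

def sample_text(model, n, seed=0):
    # Generate n words by drawing each independently from the multinomial
    rng = random.Random(seed)
    words, probs = zip(*model.items())
    return rng.choices(words, weights=probs, k=n)
```

Note that word order is invisible to a unigram model: `sequence_prob` gives "text mining" and "mining text" the same probability.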

Page 7: Language Models


More sophisticated LMs

• N-gram language models
  – In general, p(w_1 w_2 … w_n) = p(w_1) p(w_2|w_1) ⋯ p(w_n|w_1, …, w_{n−1})
  – n-gram: conditioned only on the past n−1 words, i.e., p(w_i|w_{i−n+1}, …, w_{i−1})
  – E.g., bigram: p(w_1 w_2 … w_n) = p(w_1) p(w_2|w_1) p(w_3|w_2) ⋯ p(w_n|w_{n−1})

• Remote-dependence language models (e.g., Maximum Entropy model)

• Structured language models (e.g., probabilistic context-free grammar)
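For contrast with the unigram case, here is a minimal bigram sketch estimated by MLE; the two-sentence corpus and all counts are toy examples:

```python
from collections import defaultdict

# Toy corpus (illustrative); MLE bigram estimate p(cur|prev) = c(prev, cur) / c(prev)
corpus = [["today", "is", "wednesday"], ["today", "is", "sunny"]]

bigram_counts = defaultdict(lambda: defaultdict(int))
context_counts = defaultdict(int)
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigram_counts[prev][cur] += 1
        context_counts[prev] += 1

def p_bigram(cur, prev):
    # Conditioned only on the single previous word (n = 2)
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][cur] / context_counts[prev]
```

Unlike the unigram model, this sketch already distinguishes "today is" from "is today".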


Page 8: Language Models


Why just unigram models?

• Difficulty in moving toward more complex models
  – They involve more parameters, so they need more data to estimate
  – They increase the computational complexity significantly, both in time and space
• Capturing word order or structure may not add so much value for "topical inference"
• But using more sophisticated models can still be expected to improve performance…

Page 9: Language Models


Generative view of text documents

(Unigram) Language Model p(w|θ) → sampling → Document

Topic 1: Text mining
  text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
  → text mining document

Topic 2: Health
  text 0.01, food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
  → food nutrition document

Page 10: Language Models


Estimation of language models

• Maximum likelihood estimation

Document: a "text mining" paper (total #words = 100)
  word counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …

Estimated unigram language model p(w|θ):
  text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
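The estimation step can be sketched directly from counts; the `other` filler token below is a hypothetical stand-in for the remaining 75 words of the slide's 100-word paper:

```python
from collections import Counter

# MLE for the unigram document LM: p(w|theta_d) = c(w, d) / |d|
def mle_unigram(doc_words):
    counts = Counter(doc_words)
    n = len(doc_words)
    return {w: c / n for w, c in counts.items()}

# Stand-in for the 100-word "text mining" paper; counts follow the slide,
# and "other" is an assumed filler for the unlisted 75 words.
doc = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
       + ["algorithm"] * 2 + ["query"] * 1 + ["efficient"] * 1 + ["other"] * 75)
theta = mle_unigram(doc)
```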

Page 11: Language Models


Language models for IR[Ponte & Croft SIGIR’98]

Each document induces its own language model:

  Text mining paper → text ?, mining ?, association ?, clustering ?, …, food ?, …
  Food nutrition paper → food ?, nutrition ?, healthy ?, diet ?, …

Query: "data mining algorithms"

Which model would most likely have generated this query?

Page 12: Language Models


Ranking docs by query likelihood

Each document d_i gets its own doc LM θ_{d_i}, and the query q is scored against each:

  d_1 → θ_{d_1} → p(q|d_1)
  d_2 → θ_{d_2} → p(q|d_2)
  …
  d_N → θ_{d_N} → p(q|d_N)

Documents are ranked by query likelihood p(q|d).

Justification (PRP): O(R=1|Q,D) ∝ P(Q|D,R=1)
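Ranking by query likelihood can be sketched as follows; the two document language models are made-up numbers, not estimates from real documents:

```python
import math

# Rank documents by query likelihood p(q|d) = prod_i p(w_i | theta_d).
# The document LMs below are illustrative (only part of each vocabulary shown).
doc_lms = {
    "d1": {"data": 0.02, "mining": 0.04, "algorithms": 0.01, "the": 0.3},
    "d2": {"data": 0.01, "mining": 0.02, "algorithms": 0.02, "the": 0.3},
}

def log_query_likelihood(query, lm):
    # log p(q|d); -inf if any query word has zero probability under the LM
    total = 0.0
    for w in query:
        p = lm.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

query = ["data", "mining", "algorithms"]
ranking = sorted(doc_lms, key=lambda d: log_query_likelihood(query, doc_lms[d]),
                 reverse=True)
```

Working in log space avoids underflow when multiplying many small probabilities.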

Page 13: Language Models


Justification from PRP

  O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0)
             = [ P(Q|D,R=1) P(D|R=1) ] / [ P(Q|D,R=0) P(D|R=0) ]
             = [ P(Q|D,R=1) P(D|R=1) ] / [ P(Q|R=0) P(D|R=0) ]   (assume P(Q|D,R=0) = P(Q|R=0))
             ∝ P(Q|D,R=1) · P(D|R=1) / P(D|R=0)

i.e., query likelihood p(q|d) times a document prior. Assuming a uniform document prior, we have

  O(R=1|Q,D) ∝ P(Q|D,R=1)

Query generation

Page 14: Language Models


Retrieval as language model estimation

• Document ranking based on query likelihood

  log p(q|d) = Σ_{i=1}^{n} log p(w_i|d), where q = w_1 w_2 … w_n

  (p(w_i|d) is the document language model)

  – Retrieval problem → estimation of p(w_i|d)
  – Common approach: maximum likelihood estimation (MLE)
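A minimal sketch of query likelihood under a pure MLE document model, which also exposes the zero-probability problem discussed next; the toy document is illustrative:

```python
import math
from collections import Counter

# Query likelihood with an MLE document language model. With pure MLE,
# any query word absent from the document drives log p(q|d) to -infinity,
# which is the problem smoothing addresses.
def mle_lm(doc_words):
    counts = Counter(doc_words)
    n = len(doc_words)
    return {w: c / n for w, c in counts.items()}

def log_p_query(query, lm):
    total = 0.0
    for w in query:
        p = lm.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

doc = ["text", "mining", "text", "algorithms"]
lm = mle_lm(doc)
```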

Page 15: Language Models


Problem with MLE

• What probability should we give a word that has not been observed in the document?
  – log 0?
• If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
• This is the so-called "smoothing"

Page 16: Language Models


General idea of smoothing

• All smoothing methods try to
  1. Discount the probability of words seen in a document
  2. Re-allocate the extra counts such that unseen words will have a non-zero count

Page 17: Language Models


Illustration of language model smoothing

[Figure: word probability p(w|d) versus words w. The maximum likelihood estimate,

  p_ML(w) = count of w / count of all words,

is discounted on the seen words, and the smoothed LM re-assigns the freed-up mass as non-zero probabilities to the unseen words.]

Page 18: Language Models


Smoothing methods

• Method 1: Additive smoothing ("add one", Laplace smoothing)
  – Add a constant to the count of each word:

      p(w|d) = (c(w,d) + 1) / (|d| + |V|)

    where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size
  – Problems?
    • Hint: are all words equally important?
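A sketch of add-one smoothing over an assumed toy vocabulary:

```python
from collections import Counter

# Additive ("add one" / Laplace) smoothing:
#   p(w|d) = (c(w,d) + 1) / (|d| + |V|)
def laplace_lm(doc_words, vocabulary):
    counts = Counter(doc_words)
    denom = len(doc_words) + len(vocabulary)
    return {w: (counts[w] + 1) / denom for w in vocabulary}

vocab = {"text", "mining", "algorithms", "food"}  # assumed toy vocabulary
lm = laplace_lm(["text", "mining", "text"], vocab)
```

Every vocabulary word, seen or not, now has non-zero probability, and the probabilities still sum to one.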

Page 19: Language Models


Refine the idea of smoothing

• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate between unseen words

    p(w|d) = p_seen(w|d)      if w is seen in d   (discounted ML estimate)
           = α_d p(w|REF)     otherwise           (reference language model)

  where the normalizer is

    α_d = (1 − Σ_{w seen in d} p_seen(w|d)) / (Σ_{w unseen in d} p(w|REF))

Page 20: Language Models


Smoothing methods (cont)

• Method 2: Absolute discounting
  – Subtract a constant δ from the count of each seen word:

      p(w|d) = ( max(c(w;d) − δ, 0) + δ |d|_u p(w|REF) ) / |d|

    where |d|_u is the number of unique words in d
  – Problems?
    • Hint: varied document length?
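A sketch of absolute discounting with toy counts and an assumed reference model; the smoothed probabilities sum to one whenever every seen count is at least δ:

```python
from collections import Counter

# Absolute discounting:
#   p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|
def absolute_discount_lm(doc_words, p_ref, delta=0.5):
    counts = Counter(doc_words)
    d_len = len(doc_words)
    uniq = len(counts)  # |d|_u: number of unique words in d
    def p(w):
        return (max(counts[w] - delta, 0) + delta * uniq * p_ref.get(w, 0.0)) / d_len
    return p

# Assumed toy reference model over the whole (tiny) vocabulary.
p_ref = {"text": 0.3, "mining": 0.2, "food": 0.5}
p = absolute_discount_lm(["text", "text", "mining"], p_ref, delta=0.5)
```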

Page 21: Language Models


Smoothing methods (cont)

• Method 3: Linear interpolation (Jelinek-Mercer)
  – "Shrink" uniformly toward p(w|REF):

      p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF)

    where c(w,d)/|d| is the MLE and λ ∈ [0,1] is a smoothing parameter
  – Problems?
    • Hint: what is missing?
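A sketch of Jelinek-Mercer interpolation, again with an assumed toy reference model:

```python
from collections import Counter

# Jelinek-Mercer (linear interpolation) smoothing:
#   p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)
def jm_lm(doc_words, p_ref, lam=0.1):
    counts = Counter(doc_words)
    d_len = len(doc_words)
    return lambda w: (1 - lam) * counts[w] / d_len + lam * p_ref.get(w, 0.0)

p_ref = {"text": 0.3, "mining": 0.2, "food": 0.5}  # illustrative reference model
p = jm_lm(["text", "text", "mining"], p_ref, lam=0.1)
```

Note the interpolation weight λ is the same for every document regardless of its length, which is the "what is missing?" hint above.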

Page 22: Language Models


Smoothing methods (cont)

• Method 4: Dirichlet prior / Bayesian
  – Assume pseudo counts μ p(w|REF):

      p(w|d) = (c(w,d) + μ p(w|REF)) / (|d| + μ)
             = |d|/(|d|+μ) · c(w,d)/|d| + μ/(|d|+μ) · p(w|REF)

    where μ > 0 is a smoothing parameter; this is linear interpolation whose coefficient adapts to the document length
  – Problems?
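A sketch of Dirichlet prior smoothing; note how the effective interpolation weight |d|/(|d|+μ) depends on document length:

```python
from collections import Counter

# Dirichlet prior smoothing:
#   p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)
def dirichlet_lm(doc_words, p_ref, mu=2000):
    counts = Counter(doc_words)
    d_len = len(doc_words)
    return lambda w: (counts[w] + mu * p_ref.get(w, 0.0)) / (d_len + mu)

p_ref = {"text": 0.3, "mining": 0.2, "food": 0.5}  # illustrative reference model
p = dirichlet_lm(["text", "text", "mining"], p_ref, mu=10)
```

Longer documents trust their own counts more; short documents lean more on p(w|REF).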

Page 23: Language Models


Dirichlet prior smoothing

• Bayesian estimator
  – Posterior of LM: p(θ|d) ∝ p(d|θ) p(θ)
    (likelihood of doc given the model × prior over models)
• Conjugate prior
  – Posterior will be in the same form as the prior
  – Prior can be interpreted as "extra"/"pseudo" data
• Dirichlet distribution is a conjugate prior for the multinomial distribution:

    Dir(θ | α_1, …, α_N) = ( Γ(α_1 + … + α_N) / (Γ(α_1) ⋯ Γ(α_N)) ) ∏_{i=1}^{N} θ_i^{α_i − 1}

  with the "extra"/"pseudo" word counts set as α_i = μ p(w_i|REF)

Page 24: Language Models


Some background knowledge

• Conjugate prior
  – Posterior distribution is in the same family as the prior
• Dirichlet distribution
  – Continuous
  – Samples from it are the parameters of a multinomial distribution

  Gaussian → Gaussian; Beta → Binomial; Dirichlet → Multinomial

Page 25: Language Models


Dirichlet prior smoothing (cont)

Posterior distribution of parameters:

  p(θ|d) = Dir(θ | c(w_1) + α_1, …, c(w_N) + α_N)

Property: if θ ~ Dir(θ|α), then E[θ_i] = α_i / Σ_j α_j

The predictive distribution is the same as the mean:

  p(w_i|θ̂) = ∫ p(w_i|θ) Dir(θ | c(w_1)+α_1, …, c(w_N)+α_N) dθ
            = (c(w_i) + α_i) / (|d| + Σ_j α_j)
            = (c(w_i) + μ p(w_i|REF)) / (|d| + μ)

which is exactly Dirichlet prior smoothing.

Page 26: Language Models


Estimating μ using leave-one-out [Zhai & Lafferty 02]

Leave-one-out: hold out each word occurrence in turn and ask how well the smoothed LM estimated from the remaining words predicts it:

  P(w_1|d − w_1), P(w_2|d − w_2), …, P(w_n|d − w_n)

Leave-one-out log-likelihood over a collection C of N documents:

  L_{−1}(μ|C) = Σ_{i=1}^{N} Σ_{w ∈ d_i} c(w, d_i) log( (c(w, d_i) − 1 + μ p(w|C)) / (|d_i| − 1 + μ) )

Maximum likelihood estimator:

  μ̂ = argmax_μ L_{−1}(μ|C)
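The leave-one-out objective can be maximized by a simple one-dimensional search; the corpus, candidate grid, and reference model below are toy assumptions:

```python
import math
from collections import Counter

# Leave-one-out log-likelihood L_{-1}(mu|C) for the Dirichlet prior mu,
# maximized here by grid search over a handful of candidate values.
def loo_log_likelihood(mu, docs, p_ref):
    total = 0.0
    for doc in docs:
        counts = Counter(doc)
        d_len = len(doc)
        for w, c in counts.items():
            total += c * math.log((c - 1 + mu * p_ref[w]) / (d_len - 1 + mu))
    return total

docs = [["text", "text", "mining", "text"], ["food", "nutrition", "food"]]

# Reference model: relative collection frequencies (illustrative).
all_counts = Counter(w for d in docs for w in d)
n_total = sum(all_counts.values())
p_ref = {w: c / n_total for w, c in all_counts.items()}

grid = [0.1, 0.5, 1, 2, 5, 10, 50]
mu_hat = max(grid, key=lambda mu: loo_log_likelihood(mu, docs, p_ref))
```

A real implementation would optimize μ by Newton's method or a finer search, but the objective is the same.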

Page 27: Language Models


Why would “leave-one-out” work?

abc abc ab c d d abc cd d d abd ab ab ab abcd d e cd e        ← 20 words by author1

abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s  ← 20 words by author2

Now, suppose we leave "e" out…

  p_ml("e"|author1) = 1/19   vs   p_smooth("e"|author1) = (1 + μ p("e"|REF)) / (19 + μ)
  → μ doesn't have to be big

  p_ml("e"|author2) = 0/19   vs   p_smooth("e"|author2) = (0 + μ p("e"|REF)) / (19 + μ)
  → μ must be big! (more smoothing)

The amount of smoothing is closely related to the underlying vocabulary size.

Page 28: Language Models


Understanding smoothing

Query = "the algorithms for data mining" ("algorithms", "data", "mining" are content words)

               the     algorithms  for     data      mining
pML(w|d1):     0.04    0.001       0.02    0.002     0.003     → p(q|d1) = 4.8 × 10^−12
pML(w|d2):     0.02    0.001       0.01    0.003     0.004     → p(q|d2) = 2.4 × 10^−12

p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), and p("mining"|d1) < p("mining"|d2), so intuitively d2 should have a higher score; but p(q|d1) > p(q|d2), because d1 wins on the common words "the" and "for". We should make p("the") and p("for") less different across docs, and smoothing helps to achieve this goal.

After smoothing with p(w|d) = 0.1 pML(w|d) + 0.9 p(w|REF):

p(w|REF):          0.2      0.00001    0.2      0.00001   0.00001
smoothed p(w|d1):  0.184    0.000109   0.182    0.000209  0.000309  → p(q|d1) = 2.35 × 10^−13
smoothed p(w|d2):  0.182    0.000109   0.181    0.000309  0.000409  → p(q|d2) = 4.53 × 10^−13

Now p(q|d1) < p(q|d2)!
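The numbers in this example can be checked mechanically; the sketch below recomputes both likelihoods under p(w|d) = 0.1 pML(w|d) + 0.9 p(w|REF) and confirms the ranking flips:

```python
# Recomputing the smoothing example: query likelihood before and after
# Jelinek-Mercer smoothing with p(w|d) = 0.1 * pML(w|d) + 0.9 * p(w|REF).
query = ["the", "algorithms", "for", "data", "mining"]
p_ml_d1 = [0.04, 0.001, 0.02, 0.002, 0.003]
p_ml_d2 = [0.02, 0.001, 0.01, 0.003, 0.004]
p_ref = [0.2, 0.00001, 0.2, 0.00001, 0.00001]

def product(ps):
    out = 1.0
    for p in ps:
        out *= p
    return out

def smooth(p_ml, p_ref, lam=0.9):
    # (1 - lam) * MLE + lam * reference model
    return [(1 - lam) * p + lam * r for p, r in zip(p_ml, p_ref)]

before_d1, before_d2 = product(p_ml_d1), product(p_ml_d2)
after_d1 = product(smooth(p_ml_d1, p_ref))
after_d2 = product(smooth(p_ml_d2, p_ref))
# Before smoothing d1 outscores d2; after smoothing the ranking flips to d2.
```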

Page 29: Language Models


Understanding smoothing

• Plugging the general smoothing scheme

    p(w|d) = p_seen(w|d)     if w is seen in d
           = α_d p(w|C)      otherwise

  into the query likelihood retrieval formula, we obtain

    log p(q|d) = Σ_{w ∈ q∩d} log( p_seen(w|d) / (α_d p(w|C)) ) + n log α_d + Σ_{w ∈ q} log p(w|C)

  where n is the query length:
  – the first sum provides TF weighting (through p_seen(w|d)) and IDF-like weighting (through the division by p(w|C))
  – n log α_d provides doc length normalization (a longer doc is expected to have a smaller α_d)
  – the last sum does not depend on the document, so it can be ignored for ranking

• Smoothing with p(w|C) → TF-IDF weighting + doc-length normalization

Page 30: Language Models


Smoothing & TF-IDF weighting

Retrieval formula using the general smoothing scheme, with smoothed ML estimate p_seen(w|d) and reference language model p(w|C):

  p(w|d) = p_seen(w|d)     if w is seen in d
         = α_d p(w|C)      otherwise

  α_d = (1 − Σ_{w seen in d} p_seen(w|d)) / (Σ_{w unseen in d} p(w|C))

  log p(q|d) = Σ_{w ∈ V, c(w,q)>0} c(w,q) log p(w|d)

             = Σ_{c(w,d)>0, c(w,q)>0} c(w,q) log p_seen(w|d)
               + Σ_{c(w,d)=0, c(w,q)>0} c(w,q) log( α_d p(w|C) )

Key rewriting step (where did we see it before?): add and subtract Σ_{c(w,d)>0, c(w,q)>0} c(w,q) log(α_d p(w|C)) so that the second sum runs over all query words:

             = Σ_{c(w,d)>0, c(w,q)>0} c(w,q) log( p_seen(w|d) / (α_d p(w|C)) )
               + |q| log α_d
               + Σ_{c(w,q)>0} c(w,q) log p(w|C)

Similar rewritings are very common when using probabilistic models for IR…

Page 31: Language Models


What you should know

• How to estimate a language model
• General idea and different ways of smoothing
• Effect of smoothing