Language Models for TR
Rong Jin, Department of Computer Science and Engineering, Michigan State University


Page 1:

Language Models for TR

Rong Jin

Department of Computer Science and Engineering

Michigan State University

Page 2:

What is a Statistical LM?

• A probability distribution over word sequences

– p(“Today is Wednesday”) ≈ 0.001

– p(“Today Wednesday is”) ≈ 0.0000000000001

– p(“The eigenvalue is positive”) ≈ 0.00001

• Context-dependent!

• Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

Page 3:

Why is a LM Useful?

• Provides a principled way to quantify the uncertainties associated with natural language

• Allows us to answer questions like:

– Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition)

– Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)

– Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

Page 4:

The Simplest Language Model (Unigram Model)

• Generate a piece of text by generating each word INDEPENDENTLY

• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)

• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is the vocabulary size)

• Essentially a multinomial distribution over words

• A piece of text can be regarded as a sample drawn according to this word distribution
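A minimal sketch of this generative view (the word distribution and function names below are illustrative, not from the slides): sample each word independently, and score a word sequence as the product of its word probabilities.

```python
import random

# Hypothetical unigram distribution p(w); the probabilities must sum to 1.
unigram_lm = {"text": 0.2, "mining": 0.1, "association": 0.01,
              "clustering": 0.02, "the": 0.5, "is": 0.17}

def generate(lm, n_words, seed=0):
    """Generate n_words by drawing each word INDEPENDENTLY from p(w)."""
    rng = random.Random(seed)
    words, probs = zip(*lm.items())
    return rng.choices(words, weights=probs, k=n_words)

def prob(lm, text):
    """p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn)."""
    p = 1.0
    for w in text:
        p *= lm.get(w, 0.0)  # a word outside the vocabulary gets probability 0
    return p

print(generate(unigram_lm, 5))
print(prob(unigram_lm, ["text", "mining"]))  # 0.2 * 0.1 = 0.02
```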

Page 5:

Text Generation with Unigram LM

(Unigram) Language Model p(w|θ)

Topic 1 (Text mining): … text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001
Topic 2 (Health): … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …

[Figure: sampling from the Topic 1 model produces a text-mining paper; sampling from the Topic 2 model produces a food-nutrition paper]

Page 6:

Estimation of Unigram LM

(Unigram) Language Model p(w|θ) = ?

Document: a “text mining paper” (total #words = 100)
  text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …

Estimation (relative counts):
  p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, …, p(query|θ) = 1/100
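A minimal sketch of this estimation step, assuming (as on the slide) a plain maximum-likelihood estimate, i.e. relative counts:

```python
from collections import Counter

def estimate_unigram_lm(words):
    """Maximum-likelihood estimate: p(w|theta) = c(w, d) / |d|."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Reproducing the slide's numbers (100 words in total): "text 10" becomes 10/100, etc.
doc = (["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
       + ["algorithm"] * 2 + ["query"] + ["efficient"] + ["other"] * 75)
lm = estimate_unigram_lm(doc)
print(lm["text"], lm["mining"], lm["query"])  # 0.1 0.05 0.01
```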

Page 7:

Language Models for Retrieval (Ponte & Croft 98)

Document language models (one estimated from each document):

  Text mining paper:   text ?, mining ?, association ?, clustering ?, …, food ?
  Food nutrition paper:  food ?, nutrition ?, healthy ?, diet ?, …

Query = “data mining algorithms”

Which model would most likely have generated this query?

Page 8:

Ranking Docs by Query Likelihood

[Figure: each document d1, d2, …, dN has its own document LM θd1, θd2, …, θdN; a query q is scored against each model by the query likelihood p(q|θd1), p(q|θd2), …, p(q|θdN), and documents are ranked by these likelihoods]
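A minimal sketch of query-likelihood ranking (the document texts and names below are illustrative): estimate an ML unigram model per document and sort documents by p(q|θd).

```python
import math
from collections import Counter

def ml_lm(doc_words):
    counts = Counter(doc_words)
    n = len(doc_words)
    return {w: c / n for w, c in counts.items()}

def query_log_likelihood(query_words, lm):
    score = 0.0
    for w in query_words:
        p = lm.get(w, 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen query word zeroes the score (motivates smoothing)
        score += math.log(p)
    return score

docs = {
    "d1": "text mining algorithms for data mining".split(),
    "d2": "food nutrition and healthy diet".split(),
}
query = "data mining algorithms".split()

ranking = sorted(docs, key=lambda d: query_log_likelihood(query, ml_lm(docs[d])),
                 reverse=True)
print(ranking)  # ['d1', 'd2']
```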

Page 9:

But, where is the relevance?

And, what’s good about this approach?

Page 10:

The Notion of Relevance

Relevance can be modeled in three broad ways:

• Similarity(Rep(q), Rep(d)) — different representations & similarity measures

  – Vector space model (Salton et al., 75)
  – Prob. distribution model (Wong & Yao, 89)

• P(r=1|q,d), r ∈ {0,1} — probability of relevance

  – Regression model (Fox, 83)
  – Generative models
    · Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    · Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)

• P(d→q) or P(q→d) — probabilistic inference, with different inference systems

  – Prob. concept space model (Wong & Yao, 95)
  – Inference network model (Turtle & Croft, 91)

Page 11:

Refining P(R=1|Q,D), Method 2: Generative Models

• Basic idea

– Define P(Q,D|R)

– Compute P(R|Q,D) using Bayes’ rule

• Special cases

– Document “generation”: P(Q,D|R)=P(D|Q,R)P(Q|R)

– Query “generation”: P(Q,D|R)=P(Q|D,R)P(D|R)

O(R=1|Q,D) = P(Q,D|R=1)/P(Q,D|R=0) · P(R=1)/P(R=0)

(The prior odds P(R=1)/P(R=0) do not depend on D and are ignored for ranking.)

Page 12:

Query Generation

O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0)
           = [P(Q|D,R=1) P(D|R=1)] / [P(Q|D,R=0) P(D|R=0)]
           ≈ [P(Q|D,R=1) P(D|R=1)] / [P(Q|R=0) P(D|R=0)]    (assume P(Q|D,R=0) ≈ P(Q|R=0))

P(Q|D,R=1) is the query likelihood p(q|θd); P(D|R=1)/P(D|R=0) is the document prior.

Assuming a uniform document prior, we have

O(R=1|Q,D) ∝ P(Q|D,R=1)

Now, the question is how to compute P(Q|D,R=1).

Generally it involves two steps:
(1) estimate a language model based on D
(2) compute the query likelihood according to the estimated model

Page 13:

Retrieval as Language Model Estimation

• Document ranking based on query likelihood

  log p(q|d) = Σi log p(wi|d),   where q = w1 w2 ... wn
  (p(wi|d) is the document language model)

• Retrieval problem → estimation of p(wi|d)

• Smoothing is an important issue, and distinguishes different approaches
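A minimal sketch of this scoring rule, with the document model p(w|d) passed in as a function so that the different smoothing methods of the following slides can be plugged in:

```python
import math

def log_query_likelihood(query_words, p_w_given_d):
    """log p(q|d) = sum_i log p(w_i|d); p_w_given_d(w) must be > 0 for every query word."""
    return sum(math.log(p_w_given_d(w)) for w in query_words)
```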

Page 14:

A General Smoothing Scheme

• All smoothing methods try to

– discount the probability of words seen in a doc

– re-allocate the extra probability so that unseen words will have a non-zero probability

• Most use a reference model (collection language model) to discriminate unseen words

  p(w|d) = p_seen(w|d)        if w is seen in d
         = αd · p(w|C)        otherwise

where p_seen(w|d) is the discounted ML estimate and p(w|C) is the collection language model.
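A minimal sketch of this scheme; the particular normalization used to pick αd so that p(w|d) sums to one is an assumption here, not stated on the slide:

```python
def make_smoothed_lm(p_seen, p_collection):
    """p_seen: discounted ML estimates for words seen in d.
       p_collection: collection LM p(w|C) over the whole vocabulary."""
    seen_mass = sum(p_seen.values())
    seen_collection_mass = sum(p_collection[w] for w in p_seen)
    alpha_d = (1.0 - seen_mass) / (1.0 - seen_collection_mass)  # so that sum_w p(w|d) = 1

    def p(w):
        if w in p_seen:
            return p_seen[w]              # discounted ML estimate for seen words
        return alpha_d * p_collection[w]  # unseen words: alpha_d * p(w|C)
    return p
```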

Page 15:

Smoothing & TF-IDF Weighting

• Plugging the general smoothing scheme into the query-likelihood retrieval formula, we obtain

  log p(q|d) = Σ{wi ∈ q, c(wi,d)>0} log[ p_seen(wi|d) / (αd·p(wi|C)) ] + n·log αd + Σi log p(wi|C)

  – the first sum provides TF weighting (through p_seen) and IDF weighting (through dividing by p(wi|C))
  – n·log αd provides document length normalization (a long doc is expected to have a smaller αd)
  – the last term Σi log p(wi|C) does not depend on the document and can be ignored for ranking

• Smoothing with p(w|C) → TF-IDF weighting + length normalization
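A small numeric check of this rewrite, assuming Jelinek-Mercer smoothing (where p_seen(w|d) = (1−λ)·p_ml(w|d) + λ·p(w|C) and αd = λ); the toy document, query, and collection probabilities are made up for illustration:

```python
import math
from collections import Counter

lam = 0.5
doc = "text mining text algorithms".split()
query = "data mining algorithms".split()
p_C = {"text": 0.05, "mining": 0.02, "algorithms": 0.01, "data": 0.03}  # toy p(w|C)

counts, dlen = Counter(doc), len(doc)
def p_seen(w): return (1 - lam) * counts[w] / dlen + lam * p_C[w]
alpha_d = lam

# Direct form: log p(q|d) = sum over query words of log p(w|d)
direct = sum(math.log(p_seen(w) if counts[w] > 0 else alpha_d * p_C[w]) for w in query)

# Rewritten form: matched-term sum + n*log(alpha_d) + document-independent term
rewritten = (sum(math.log(p_seen(w) / (alpha_d * p_C[w])) for w in query if counts[w] > 0)
             + len(query) * math.log(alpha_d)
             + sum(math.log(p_C[w]) for w in query))
print(abs(direct - rewritten) < 1e-12)  # True: both forms give the same score
```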

Page 16:

Three Smoothing Methods (Zhai & Lafferty 01)

• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C)

  p(w|d) = (1−λ)·p_ml(w|d) + λ·p(w|C)

• Dirichlet prior (Bayesian): assume pseudo counts μ·p(w|C)

  p(w|d) = [c(w;d) + μ·p(w|C)] / (|d| + μ)
         = (|d|/(|d|+μ))·p_ml(w|d) + (μ/(|d|+μ))·p(w|C)

• Absolute discounting: subtract a constant δ

  p(w|d) = [max(c(w;d) − δ, 0) + δ·|d|u·p(w|C)] / |d|,   where |d|u is the number of distinct words in d
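Minimal sketches of the three formulas (a map of the document's word counts and a collection LM p_C over the vocabulary are assumed to be given; the default parameter values are only illustrative):

```python
def jelinek_mercer(w, counts, dlen, p_C, lam=0.1):
    # p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|C)
    return (1 - lam) * counts.get(w, 0) / dlen + lam * p_C[w]

def dirichlet(w, counts, dlen, p_C, mu=2000):
    # p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu)
    return (counts.get(w, 0) + mu * p_C[w]) / (dlen + mu)

def absolute_discount(w, counts, dlen, p_C, delta=0.7):
    # p(w|d) = (max(c(w;d) - delta, 0) + delta * |d|_u * p(w|C)) / |d|
    unique = len(counts)  # |d|_u: number of distinct words in d
    return (max(counts.get(w, 0) - delta, 0) + delta * unique * p_C[w]) / dlen
```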

Page 17:

Comparison of Three Methods

Query Type   JM      Dir     AD
Title        0.228   0.256   0.237
Long         0.278   0.276   0.260

[Bar chart: relative performance (precision) of JM, Dir. and AD for title queries vs. long queries]

Page 18:

The Need of Query Modeling (Dual Role of Smoothing)

[Figure: comparison of verbose queries and keyword queries]

Page 19:

Another Reason for Smoothing

Query = “the algorithms for data mining”

        the     algorithms   for     data    mining
d1:     0.04    0.001        0.02    0.002   0.003
d2:     0.02    0.001        0.01    0.003   0.004

p(“algorithms”|d1) = p(“algorithms”|d2)
p(“data”|d1) < p(“data”|d2)

p(“mining”|d1) < p(“mining”|d2)

But p(q|d1)>p(q|d2)!

We should make p(“the”) and p(“for”) less different for all docs.
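Multiplying out the slide's numbers confirms the point:

```python
from math import prod

# Per-word probabilities for "the algorithms for data mining"
p_q_d1 = prod([0.04, 0.001, 0.02, 0.002, 0.003])  # 4.8e-12
p_q_d2 = prod([0.02, 0.001, 0.01, 0.003, 0.004])  # 2.4e-12
print(p_q_d1 > p_q_d2)  # True: d1 wins only because of "the" and "for"
```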

Page 20:

Two-stage Smoothing

Stage 1 (Dirichlet prior, Bayesian): explain unseen words

  p(w|d) = [c(w,d) + μ·p(w|C)] / (|d| + μ)

Stage 2 (two-component mixture): explain noise in the query

  p(w|d,U) = (1−λ)·[c(w,d) + μ·p(w|C)] / (|d| + μ) + λ·p(w|U)
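A minimal sketch of the two-stage estimate (counts is the document's word-count map; the default parameter values are illustrative):

```python
def two_stage_p(w, counts, dlen, p_C, p_U, mu=2000, lam=0.5):
    """Stage 1: Dirichlet smoothing with the collection LM p(w|C).
       Stage 2: mix in the user/query background model p(w|U) with weight lam."""
    stage1 = (counts.get(w, 0) + mu * p_C[w]) / (dlen + mu)  # explains unseen words
    return (1 - lam) * stage1 + lam * p_U[w]                 # explains noise in the query
```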

Page 21:

Estimating μ using leave-one-out

[Figure: leave one word out at a time and predict it with the model estimated from the rest of the document: p(w1|d−w1), p(w2|d−w2), …, p(wn|d−wn)]

Leave-one-out log-likelihood:

  ℓ−1(μ|C) = Σi=1..N Σw∈V c(w,di) · log[ (c(w,di) − 1 + μ·p(w|C)) / (|di| − 1 + μ) ]

Maximum likelihood estimator:  μ̂ = argmaxμ ℓ−1(μ|C)

Solved numerically with Newton’s method.
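A minimal sketch of this estimation. The slide maximizes the leave-one-out log-likelihood with Newton's method; for brevity, this sketch simply searches a grid of candidate μ values:

```python
import math
from collections import Counter

def loo_log_likelihood(mu, docs, p_C):
    """docs: list of word lists; p_C: collection LM p(w|C)."""
    ll = 0.0
    for d in docs:
        counts, dlen = Counter(d), len(d)
        for w, c in counts.items():
            ll += c * math.log((c - 1 + mu * p_C[w]) / (dlen - 1 + mu))
    return ll

def estimate_mu(docs, p_C, candidates=range(100, 5001, 100)):
    return max(candidates, key=lambda mu: loo_log_likelihood(mu, docs, p_C))
```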

Page 22:

Estimating λ using Mixture Model

[Figure: each document’s stage-1 smoothed model p(w|d1), …, p(w|dN) is mixed in stage 2 with the user background model U: (1−λ)·p(w|di) + λ·p(w|U); the query is scored against all of these mixtures]

  p(q|λ,U) = Πi=1..N Πj=1..m [ (1−λ)·p(qj|θdi) + λ·p(qj|U) ]

  λ̂ = argmaxλ p(q|λ,U)

Maximum likelihood estimator, computed with the Expectation-Maximization (EM) algorithm.
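A minimal EM sketch for this mixture: each query word is generated either by a stage-1 document model (weight 1−λ) or by the user background model p(w|U) (weight λ); the update below is one standard way to fit such a shared mixing weight.

```python
def estimate_lambda(query_words, doc_lms, p_U, n_iter=50, lam=0.5):
    """doc_lms: one callable p(w|theta_di) per document; p_U: background LM p(w|U)."""
    for _ in range(n_iter):
        z_sum, n = 0.0, 0
        for p_d in doc_lms:
            for w in query_words:
                # E-step: posterior probability that w came from the background model
                background = lam * p_U[w]
                z = background / ((1 - lam) * p_d(w) + background)
                z_sum, n = z_sum + z, n + 1
        lam = z_sum / n  # M-step: new estimate of the mixing weight
    return lam
```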

Page 23:

Average precision (3 DBs × 4 query types, 150 topics): automatic 2-stage results vs. optimal 1-stage results

Collection   Query   Optimal-JM   Optimal-Dir   Auto-2stage
AP88-89      SK      20.3%        23.0%         22.2%*
AP88-89      LK      36.8%        37.6%         37.4%
AP88-89      SV      18.8%        20.9%         20.4%
AP88-89      LV      28.8%        29.8%         29.2%
WSJ87-92     SK      19.4%        22.3%         21.8%*
WSJ87-92     LK      34.8%        35.3%         35.8%
WSJ87-92     SV      17.2%        19.6%         19.9%
WSJ87-92     LV      27.7%        28.2%         28.8%*
ZIFF1-2      SK      17.9%        21.5%         20.0%
ZIFF1-2      LK      32.6%        32.6%         32.2%
ZIFF1-2      SV      15.6%        18.5%         18.1%
ZIFF1-2      LV      26.7%        27.9%         27.9%*

(SK/LK = short/long keyword queries, SV/LV = short/long verbose queries)

Page 24:

Acknowledgement

• Many thanks to Chengxiang Zhai, who generously shared his slides on the language modeling approach for information retrieval