1
Risk Minimization and
Language Modeling in Text
Retrieval ChengXiang Zhai
Thesis Committee:
John Lafferty (Chair), Jamie Callan,
Jaime Carbonell, David A. Evans,
W. Bruce Croft (Univ. of Massachusetts, Amherst)
2
Information Overflow
Web Site Growth
3
Text Retrieval (TR)
Retrieval System
User
“Tips on thesis defense”
query
relevant docs
database/collection
text docs
4
Challenges in TR
Relevance (independent, topical)
Ad hoc parameter tuning
Utility
5
Sophisticated Parameter Tuning in the Okapi System
(Robertson et al. 1999)
“k1, b and k3 are parameters which depend on the nature of the queries and possibly on the database; k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000 (effectively infinite).”
6
More Than “Relevance”
Relevance Ranking Desired Ranking
Redundancy
Readability
7
Meeting the Challenges
Bayesian Decision Theory
Statistical Language Models
Risk Minimization Framework
Utility-based Retrieval
Parameter Estimation
8
Map of Thesis
New TR Framework: Risk Minimization Framework
New TR Models and Features:
- Two-stage Language Model: automatic parameter setting
- KL-divergence Retrieval Model: natural incorporation of feedback
- Aspect Retrieval Model: non-traditional ranking
9
Retrieval as Decision-Making
Given a query:
- Which documents should be selected? (D)
- How should these docs be presented to the user? (π)
Choose: (D, π)
Query → Ranked list (1 2 3 4)? Unordered subset? Clustering?
10
Generative Model of Document & Query
User U → query model θ_Q ~ p(θ_Q | U) (partially observed) → query q ~ p(q | θ_Q, U) (observed)
Source S → document model θ_D ~ p(θ_D | S) (inferred) → document d ~ p(d | θ_D, S) (observed)
11
Bayesian Decision Theory
Possible choices: (D_1, π_1), (D_2, π_2), …, (D_n, π_n), each with a loss L(D_i, π_i, θ).
Observed: query q, user U, doc set C, source S. Hidden: the models θ = (θ_Q, θ_1, …, θ_N).

RISK MINIMIZATION: choose the action with minimum Bayes risk,

(D*, π*) = argmin_{(D,π)} ∫ L(D, π, θ) p(θ | q, U, C, S) dθ
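The decision rule on this slide can be sketched with the integral replaced by a finite sum over candidate models; the action set, loss values, and posterior below are toy assumptions, not part of the thesis:

```python
def bayes_optimal_choice(choices, thetas, posterior, loss):
    """Return the action minimizing Bayes risk, with the integral over
    hidden models theta replaced by a finite sum for illustration."""
    def risk(choice):
        # expected loss of this choice under p(theta | q, U, C, S)
        return sum(loss(choice, th) * posterior[th] for th in thetas)
    return min(choices, key=risk)

# Toy example: decide whether to show a single document.
thetas = ["relevant", "nonrelevant"]
posterior = {"relevant": 0.8, "nonrelevant": 0.2}
loss = lambda choice, th: {("show", "nonrelevant"): 1.0,
                           ("skip", "relevant"): 1.0}.get((choice, th), 0.0)
best = bayes_optimal_choice(["show", "skip"], thetas, posterior, loss)
```

With the posterior favoring relevance, the expected loss of "show" (0.2) is below that of "skip" (0.8), so showing the document is the Bayes-optimal action.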
12
Special Cases
• Set-based models (choose D)
• Ranking models (choose π)
– Independent loss (→ PRP)
• Relevance-based loss
• Distance-based loss
– Dependent loss
• MMR loss
• MDR loss
Boolean model
Probabilistic relevance model
Vector-space Model
Aspect retrieval model
Two-stage LM
KL-divergence model
13
Map of Existing TR Models
Relevance
• (R(q), R(d)) similarity → different rep & similarity:
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)
– …
• P(r=1|q,d), r ∈ {0,1}: probability of relevance
– Regression model (Fox 83)
– Generative model:
· Doc generation: Classical prob. model (Robertson & Sparck Jones, 76)
· Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• P(d→q) or P(q→d): probabilistic inference → different inference system:
– Inference network model (Turtle & Croft, 91)
– Prob. concept space model (Wong & Yao, 95)
14
Where Are We?
Risk Minimization Framework
Two-stage Language Model
KL-divergence Retrieval Model
Aspect Retrieval Model
15
Two-stage Language Models
(Generative model as before: θ_Q ~ p(θ_Q | U), q ~ p(q | θ_Q, U); θ_D ~ p(θ_D | S), d ~ p(d | θ_D, S))

Loss function:
l(d, θ_Q, θ_D) = 0 if θ_Q = θ_D, c otherwise

Risk ranking formula:
R(d, q) ∝ p(q | θ̂_D, U)

Stage 1: compute θ̂_D (Dirichlet prior smoothing)
Stage 2: compute p(q | θ̂_D, U) (mixture model)
→ Two-stage smoothing
16
The Need for Query Modeling (Dual Role of Smoothing)
Verbose queries
Keyword queries
17
Interaction of the Two Roles of Smoothing
Relative performance of JM, Dir. and AD (precision):

Query Type   JM      Dir     AD
Title        0.228   0.256   0.237
Long         0.278   0.276   0.260

[Figure: bar chart of precision for JM, DIR, and AD on title vs. long queries]
18
Two-stage Smoothing
p(w|d) = (1-λ) · (c(w,d) + μ·p(w|C)) / (|d| + μ) + λ·p(w|U)

Stage 1 (μ): Dirichlet prior (Bayesian); explains unseen words.
Stage 2 (λ): 2-component mixture with p(w|U); explains noise in the query.
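The two-stage formula is straightforward to compute; a minimal sketch (the uniform background and user models here are toy assumptions):

```python
from collections import Counter

def two_stage_prob(w, doc, p_bg, p_user, mu=1000.0, lam=0.5):
    """Two-stage smoothed p(w|d): Dirichlet prior smoothing with mu
    (stage 1, explains unseen words), then interpolation with the
    user background model p(w|U) (stage 2, explains query noise)."""
    counts = Counter(doc)
    p_dir = (counts[w] + mu * p_bg(w)) / (len(doc) + mu)
    return (1 - lam) * p_dir + lam * p_user(w)

vocab = ["airport", "security", "the"]
doc = ["airport", "airport", "security"]
uniform = lambda w: 1.0 / len(vocab)
probs = {w: two_stage_prob(w, doc, uniform, uniform, mu=10.0, lam=0.3)
         for w in vocab}
```

Because both smoothing stages mix proper distributions, the result is still a distribution over the vocabulary, and unseen words ("the" here) receive nonzero mass.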
19
Estimating μ Using Leave-One-Out

Leave one word occurrence out at a time and predict it from the rest of the document: P(w_1 | d - w_1), P(w_2 | d - w_2), …, P(w_n | d - w_n).

Log-likelihood:
l_{-1}(μ | C) = Σ_{i=1}^{N} Σ_{w∈V} c(w, d_i) · log[ (c(w, d_i) - 1 + μ·p(w|C)) / (|d_i| - 1 + μ) ]

Maximum likelihood estimator (solved by Newton's method):
μ̂ = argmax_μ l_{-1}(μ | C)
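A sketch of the leave-one-out objective; a coarse grid search stands in for the Newton's method used in the thesis, and the two-document corpus and uniform background model are toy assumptions:

```python
import math
from collections import Counter

def loo_loglik(mu, docs, p_bg):
    """l_{-1}(mu | C): delete each word occurrence in turn and score it
    with the Dirichlet-smoothed model built from the rest of the doc."""
    ll = 0.0
    for d in docs:
        counts, n = Counter(d), len(d)
        for w, c in counts.items():
            ll += c * math.log((c - 1 + mu * p_bg(w)) / (n - 1 + mu))
    return ll

def estimate_mu(docs, p_bg, grid=(0.1, 1, 10, 100, 1000, 10000)):
    """argmax of the leave-one-out log-likelihood over a coarse grid."""
    return max(grid, key=lambda mu: loo_loglik(mu, docs, p_bg))

docs = [["a", "a", "a", "b"], ["b", "b", "c", "c"]]
p_bg = lambda w: 1.0 / 3.0
mu_hat = estimate_mu(docs, p_bg)
```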
20
Estimating λ Using a Mixture Model

p(q | λ, U) = Π_{i=1}^{N} Π_{j=1}^{m} [ (1-λ)·p(q_j | θ_{d_i}) + λ·p(q_j | U) ]

Maximum likelihood estimator, computed with the Expectation-Maximization (EM) algorithm:
λ̂ = argmax_λ p(q | λ, U)

Stage 1: estimate the document models p(w|d_1), …, p(w|d_N).
Stage 2: score with (1-λ)·p(w|d_i) + λ·p(w|U).
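The EM iteration for λ can be sketched directly from the mixture likelihood; the query-term probabilities below are hypothetical:

```python
def estimate_lambda(p_doc, p_user, n_iter=100, lam=0.5):
    """EM for the noise weight lambda in
    p(q | lam, U) = prod_i prod_j [(1-lam) p(q_j|theta_di) + lam p(q_j|U)].
    p_doc[i][j] = p(q_j | theta_{d_i}); p_user[j] = p(q_j | U)."""
    pairs = [(pd, p_user[j]) for row in p_doc for j, pd in enumerate(row)]
    for _ in range(n_iter):
        # E-step: posterior that each query-term occurrence is noise
        z = [lam * pu / ((1 - lam) * pd + lam * pu) for pd, pu in pairs]
        # M-step: lambda is the expected fraction of noise terms
        lam = sum(z) / len(z)
    return lam

# Query terms far likelier under the doc models than the background:
lam_clean = estimate_lambda([[0.5, 0.4]], [0.01, 0.01])
# The reverse: terms explained better by the background:
lam_noisy = estimate_lambda([[0.01, 0.01]], [0.5, 0.4])
```

As expected, a query well explained by the documents gets a small noise weight, while a background-like query drives λ toward 1.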
21
Average precision (3 DBs × 4 query types, 150 topics); automatic 2-stage vs. optimal 1-stage results:

Collection  Query  Optimal-JM  Optimal-Dir  Auto-2stage
AP88-89     SK     20.3%       23.0%        22.2%*
            LK     36.8%       37.6%        37.4%
            SV     18.8%       20.9%        20.4%
            LV     28.8%       29.8%        29.2%
WSJ87-92    SK     19.4%       22.3%        21.8%*
            LK     34.8%       35.3%        35.8%
            SV     17.2%       19.6%        19.9%
            LV     27.7%       28.2%        28.8%*
ZIFF1-2     SK     17.9%       21.5%        20.0%
            LK     32.6%       32.6%        32.2%
            SV     15.6%       18.5%        18.1%
            LV     26.7%       27.9%        27.9%*
22
Where Are We?
Risk Minimization Framework
Two-stage Language Model
KL-divergence Retrieval Model
Aspect Retrieval Model
23
KL-divergence Retrieval Models
(Generative model as before: θ_Q ~ p(θ_Q | U), q ~ p(q | θ_Q, U); θ_D ~ p(θ_D | S), d ~ p(d | θ_D, S))

Loss function:
l(d, θ_Q, θ_D) = c · D(θ_Q || θ_D)

Risk ranking formula:
R(d, q) ∝ -D(θ̂_Q || θ̂_D)
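Scoring by negative KL-divergence is rank-equivalent to the cross entropy Σ_w p(w|θ_Q) log p(w|θ_D), since the query-model entropy is constant across documents. A sketch with hypothetical smoothed models:

```python
import math

def kl_score(theta_q, theta_d):
    """-D(theta_Q || theta_D); theta_d must already be smoothed so it
    is nonzero wherever theta_q is."""
    return -sum(p * math.log(p / theta_d[w])
                for w, p in theta_q.items() if p > 0)

theta_q = {"airport": 0.5, "security": 0.5}
on_topic = {"airport": 0.4, "security": 0.4, "the": 0.2}
off_topic = {"airport": 0.05, "security": 0.05, "the": 0.9}
```

A document model close to the query model scores higher (less negative), and a perfect match scores 0.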
24
Expansion-based vs. Model-based
Expansion-based feedback: the feedback docs modify the query Q itself; scoring is by query likelihood p(Q | θ_D) against the document model.
Model-based feedback: the feedback docs modify the query model θ_Q; scoring is by KL-divergence D(θ_Q || θ_D).
25
Feedback as Model Interpolation
θ_Q' = (1-α)·θ_Q + α·θ_F

Feedback docs F = {d_1, d_2, …, d_n}; θ_F is estimated by a generative model or by divergence minimization. Scoring uses D(θ_Q' || θ_D).

α = 0: no feedback (θ_Q' = θ_Q)
α = 1: full feedback (θ_Q' = θ_F)
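The interpolation itself is one line per word over the union vocabulary; a sketch with hypothetical query and feedback models:

```python
def interpolate(theta_q, theta_f, alpha):
    """theta_Q' = (1 - alpha) theta_Q + alpha theta_F over the union vocab."""
    vocab = set(theta_q) | set(theta_f)
    return {w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
            for w in vocab}

theta_q = {"airport": 0.6, "security": 0.4}
theta_f = {"airport": 0.2, "bomb": 0.5, "alcohol": 0.3}
theta_q2 = interpolate(theta_q, theta_f, 0.5)
```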
26
θ_F Estimation Method I: Generative Mixture Model

Feedback docs F = {d_1, …, d_n} are generated by mixing topic words from p(w|θ) (weight 1-λ) with background words from p(w|C) (weight λ):

log p(F | θ) = Σ_i Σ_w c(w; d_i) · log[ (1-λ)·p(w|θ) + λ·p(w|C) ]

Maximum likelihood: θ_F = argmax_θ log p(F | θ)
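A sketch of EM for θ_F with λ held fixed; the tiny feedback "documents" and background model are toy assumptions:

```python
from collections import Counter

def feedback_model(feedback_docs, p_bg, lam=0.9, n_iter=30):
    """EM for theta_F in the mixture (1-lam) p(w|theta) + lam p(w|C);
    background-like words are absorbed by p(w|C), so theta_F
    concentrates on topic words."""
    counts = Counter(w for d in feedback_docs for w in d)
    theta = {w: 1.0 / len(counts) for w in counts}   # uniform init
    for _ in range(n_iter):
        # E-step: posterior that an occurrence of w is a topic word
        z = {w: (1 - lam) * theta[w] /
                ((1 - lam) * theta[w] + lam * p_bg(w)) for w in theta}
        # M-step: renormalize the topic-attributed counts
        norm = sum(counts[w] * z[w] for w in theta)
        theta = {w: counts[w] * z[w] / norm for w in theta}
    return theta

docs = [["airport", "airport", "the", "the", "the"]]
p_bg = lambda w: 0.5 if w == "the" else 0.01
theta_f = feedback_model(docs, p_bg)
```

Even though "the" is more frequent in the feedback text, its high background probability lets the mixture explain it away, so θ_F favors "airport", the effect illustrated on slide 28.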
27
θ_F Estimation Method II: Empirical Divergence Minimization

Choose θ_F close to the feedback document models θ_{d_1}, …, θ_{d_n} but far from the background model θ_C:

D_emp(F, θ, C) = (1/n) Σ_{i=1}^{n} D(θ || θ_{d_i}) - λ·D(θ || θ_C)

θ_F = argmin_θ D_emp(F, θ, C)
28
Example of Feedback Query Model
TREC topic 412: "airport security" (mixture model approach, web database, top 10 docs)

λ = 0.9                        λ = 0.7
w               p(w|θ_F)       w           p(w|θ_F)
security        0.0558         the         0.0405
airport         0.0546         security    0.0377
beverage        0.0488         airport     0.0342
alcohol         0.0474         beverage    0.0305
bomb            0.0236         alcohol     0.0304
terrorist       0.0217         to          0.0268
author          0.0206         of          0.0241
license         0.0188         and         0.0214
bond            0.0186         author      0.0156
counter-terror  0.0173         bomb        0.0150
terror          0.0142         terrorist   0.0137
newsnet         0.0129         in          0.0135
attack          0.0124         license     0.0127
operation       0.0121         state       0.0127
headline        0.0121         by          0.0125
29
Model-based feedback vs. Simple LM
Collection  Metric  Simple LM   Mixture     Improv.  Div.Min.    Improv.
AP88-89     AvgPr   0.21        0.296       +41%     0.295       +40%
            InitPr  0.617       0.591       -4%      0.617       +0%
            Recall  3067/4805   3888/4805   +27%     3665/4805   +19%
TREC8       AvgPr   0.256       0.282       +10%     0.269       +5%
            InitPr  0.729       0.707       -3%      0.705       -3%
            Recall  2853/4728   3160/4728   +11%     3129/4728   +10%
WEB         AvgPr   0.281       0.306       +9%      0.312       +11%
            InitPr  0.742       0.732       -1%      0.728       -2%
            Recall  1755/2279   1758/2279   +0%      1798/2279   +2%
30
Where Are We?
Risk Minimization Framework
Two-stage Language Model
KL-divergence Retrieval Model
Aspect Retrieval Model
31
Aspect Retrieval
Query: What are the applications of robotics in the world today?
Find as many DIFFERENT applications as possible.
Example aspects:
A1: spot-welding robotics
A2: controlling inventory
A3: pipe-laying robots
A4: talking robot
A5: robots for loading & unloading memory tapes
A6: robot [telephone] operators
A7: robot cranes
…

Aspect judgments:
      A1 A2 A3 … Ak
d1    1  1  0  0 … 0  0
d2    0  1  1  1 … 0  0
d3    0  0  0  0 … 1  0
…
dk    1  0  1  0 … 0  1
32
Evaluation Measures
• Aspect Coverage (AC): measures per-doc coverage
– #distinct-aspects / #docs
– Equivalent to the "set cover" problem, NP-hard
• Aspect Uniqueness (AU): measures redundancy
– #distinct-aspects / #aspects
– Equivalent to the "volume cover" problem, NP-hard
• Example (accumulated counts), with d1 = 0001001, d2 = 0101100, d3 = 1000101:
#doc:      1    2    3    …
#asp:      2    5    8    …
#uniq-asp: 2    4    5
AC:  2/1 = 2.0   4/2 = 2.0   5/3 = 1.67
AU:  2/2 = 1.0   4/5 = 0.8   5/8 = 0.625
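The accumulated AC and AU of the example can be computed mechanically; this sketch reproduces the numbers above, representing each document by its set of judged aspect indices:

```python
def aspect_metrics(ranked_aspect_sets):
    """Accumulated Aspect Coverage (distinct aspects per doc) and
    Aspect Uniqueness (distinct aspects per judged aspect)."""
    seen, total, results = set(), 0, []
    for i, aspects in enumerate(ranked_aspect_sets, start=1):
        total += len(aspects)   # all aspect judgments so far
        seen |= aspects         # distinct aspects so far
        results.append((len(seen) / i, len(seen) / total))
    return results

# d1 = 0001001, d2 = 0101100, d3 = 1000101 as aspect-index sets:
metrics = aspect_metrics([{4, 7}, {2, 4, 5}, {1, 5, 7}])
```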
33
Loss Function L(θ_{k+1} | θ_1 … θ_k)

Documents d_1, …, d_k already shown (models θ_1, …, θ_k known); candidate d_{k+1} has model θ_{k+1}.

Maximal Marginal Relevance (MMR): combine Relevance Rel(θ_{k+1}) with Novelty/Redundancy Nov(θ_{k+1} | θ_1 … θ_k); the best d_{k+1} is novel & relevant.

Maximal Diverse Relevance (MDR): use aspect coverage distributions p(a|θ_i); the best d_{k+1} is complementary in coverage.
34
Maximal Marginal Relevance (MMR) Models
• Maximizing aspect coverage indirectly through redundancy elimination
• Elements
– Redundancy/Novelty measure
– Combination of novelty and relevance
• Proposed & studied six novelty measures
• Proposed & studied four combination strategies
35
Comparison of Novelty Measures (Aspect Coverage)
[Figure: Avg. Aspect Coverage vs. Aspect Recall for novelty measures Relevance, AvgKL, AvgMix, KLMin, KLAvg, MixMin, MixAvg]
36
Comparison of Novelty Measures (Aspect Uniqueness)
[Figure: Avg. Aspect Uniqueness vs. Aspect Recall for novelty measures Relevance, AvgKL, AvgMix, KLMin, KLAvg, MixMin, MixAvg]
37
A Mixture Model for Redundancy
A document's words are modeled as a mixture of P(w|Old), estimated from the reference document, and P(w|Background), estimated from the collection. The mixing weight serves as the redundancy measure and is estimated by maximum likelihood via Expectation-Maximization.
38
Cost-based Combination of Relevance and Novelty
Loss:
l(d_{k+1} | θ_Q, θ_1, …, θ_k, {θ_i}) = c_1·p(Rel|d_{k+1})·p(New|d_{k+1}) + c_2·p(Rel|d_{k+1})·(1 - p(New|d_{k+1})) + c_3·(1 - p(Rel|d_{k+1}))

Rank-equivalent score (relevance score × novelty score):
p(Rel|d_{k+1})·(c + p(New|d_{k+1})) ≈ p(q|d_{k+1})·(c + p(New|d_{k+1})), where c = c_3/c_2 - 1
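Taking c_1 (the cost of showing a relevant, novel document) to be 0, the loss is rank-equivalent to scoring by p(Rel|d)·(c + p(New|d)) with c = c_3/c_2 - 1; a sketch, where the cost values are hypothetical:

```python
def mmr_cost_score(p_rel, p_new, c2=1.0, c3=5.0):
    """Rank-equivalent score p(Rel|d) * (c + p(New|d)), c = c3/c2 - 1.
    c2 = cost of showing a relevant but redundant doc,
    c3 = cost of showing a non-relevant doc."""
    return p_rel * (c3 / c2 - 1.0 + p_new)
```

Raising c3 relative to c2 makes relevance dominate novelty, which matches the intuition that missing relevant material is the more expensive error.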
39
Maximal Diverse Relevance (MDR) Models
• Maximizing aspect coverage directly through aspect modeling
• Elements
– Aspect loss function
– Generative Aspect Model
• Proposed & studied KL-divergence aspect loss function
• Explored two aspect models (PLSI, LDA)
40
Aspect Generative Model of Document & Query
User U → θ_Q ~ p(θ_Q | U) → query q ~ p(q | θ_Q, U)
Source S → θ_D ~ p(θ_D | S) → document d ~ p(d | θ_D, S)

with aspect models θ = (θ_1, …, θ_k):

PLSI: p(d | θ_D, S) = Π_{i=1}^{n} Σ_{a=1}^{A} p(a | θ_D)·p(d_i | θ_a), where d = d_1 … d_n

LDA: p(d | θ_D, S) = ∫ Π_{i=1}^{n} Σ_{a=1}^{A} p(a | π)·p(d_i | θ_a) Dir(π | θ_D) dπ
41
Aspect Loss Function
l(d_{k+1} | θ_Q, θ_1, …, θ_k, {θ_i}) rank= D(θ_Q || θ_{1,…,k+1})

where the combined coverage interpolates the already-selected documents with the candidate:

p(a | θ_{1,…,k+1}) = λ·(1/k)·Σ_{i=1}^{k} p(a | θ_i) + (1-λ)·p(a | θ_{k+1})
42
Aspect Loss Function: Illustration
Desired coverage: p(a|Q)
"Already covered": p(a|θ_1) … p(a|θ_{k-1})
New candidate: p(a|θ_k), which may be non-relevant, redundant, or perfect

Combined coverage:
p(a | θ_{1,…,k}) = λ·(1/(k-1))·Σ_{i=1}^{k-1} p(a | θ_i) + (1-λ)·p(a | θ_k)
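A sketch of the MDR aspect loss: the KL-divergence between the desired coverage and a combined coverage interpolating the already-shown documents' average with the candidate. The interpolation weight and the toy distributions are assumptions:

```python
import math

def aspect_loss(p_q, prev, cand, lam=0.5):
    """D(p(a|Q) || combined coverage). prev is a list of aspect
    distributions for already-shown docs; cand is the candidate's."""
    def combined(a):
        if not prev:
            return cand[a]
        old = sum(p[a] for p in prev) / len(prev)
        return lam * old + (1 - lam) * cand[a]
    return sum(p * math.log(p / combined(a))
               for a, p in p_q.items() if p > 0)

p_q = {"a1": 0.5, "a2": 0.5}            # desired coverage
prev = [{"a1": 0.9, "a2": 0.1}]         # aspect a1 already covered
complementary = {"a1": 0.1, "a2": 0.9}
redundant = {"a1": 0.9, "a2": 0.1}
```

The candidate that complements the existing coverage achieves a lower (here zero) loss than one that repeats it, which is exactly the "complementary in coverage" behavior MDR is after.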
43
Preliminary Evaluation: MMR vs. MDR
                 Relevant Data        Mixed Data
Ranking Method   AC      AU           AC      AU      Prec.
MMR              +2.6%   +13.8%       +1.5%   +2.2%   +3.4%
MDR              +9.8%   +4.5%        +1.5%   0.0%    -13.8%
• On the relevant data set, both MMR and MDR
are effective, but they complement each other
- MMR improves AU more than AC
- MDR improves AC more than AU
• On the mixed data set, however,
- MMR is only effective when relevance ranking is accurate
- MDR improves AC, even though relevance ranking is degraded.
44
Further Work is Needed
• Controlled experiments with synthetic data
– Level of redundancy
– Density of relevant documents
– Per-document aspect counts
• Alternative loss functions
• Aspect language models, especially along the line of LDA
– Aspect-based feedback
45
Summary of Contributions
New TR Framework: Risk Minimization Framework
• Unifies existing models
• Incorporates LMs
• Serves as a map for exploring new models

New TR Models and Specific Contributions:

Two-stage Language Model
• Empirical study of smoothing (dual role of smoothing)
• New smoothing method (two-stage smoothing)
• Automatic parameter setting (leave-one-out, mixture)

KL-divergence Retrieval Model
• Query/document distillation
• Feedback with LMs (mixture model & div. min.)

Aspect Retrieval Model
• Evaluation criteria (AC, AU)
• Redundancy/novelty measures (mixture weight)
• MMR with LMs (cost-comb.)
• Aspect-based loss function ("collective KL-div")
46
Future Research Directions
• Better approximation of the risk integral
• More effective LMs for “traditional” retrieval
– Can we beat TF-IDF without increasing computational complexity?
– Automatic parameter setting, especially for feedback models
– Flexible passage retrieval, especially with HMM
– Beyond unigrams (more linguistics)
47
More Future Research Directions
• Aspect Retrieval Models
– Document structure/sub-topic modeling
– Aspect-based feedback
• Interactive information retrieval models
– Risk minimization for information filtering
– Personalized & context-sensitive retrieval
48
Thank you!