
Formal Retrieval Frameworks

ChengXiang Zhai (翟成祥), Department of Computer Science

Graduate School of Library & Information Science

Institute for Genomic Biology, Statistics

University of Illinois, Urbana-Champaign

http://www-faculty.cs.uiuc.edu/~czhai, [email protected]


Outline

• Risk Minimization Framework [Lafferty & Zhai 01, Zhai & Lafferty 06]

• Axiomatic Retrieval Framework [Fang et al. 04, Fang & Zhai 05, Fang & Zhai 06]


Risk Minimization Framework


Risk Minimization: Motivation

• Long-standing IR challenges

– Improve IR theory

• Develop theoretically sound and empirically effective models

• Go beyond the limited traditional notion of relevance (independent, topical relevance)

– Improve IR practice

• Optimize retrieval parameters automatically

• Statistical language models (SLMs) are very promising tools …

– How can we systematically exploit SLMs in IR?

– Can SLMs offer anything hard/impossible to achieve in traditional IR?


Long-Standing IR Challenges

• Limitations of traditional IR models

– Strong assumptions on “relevance”

• Independent relevance

• Topical relevance

– Can we go beyond this traditional notion of relevance?

• Difficulty in IR practice

– Ad hoc parameter tuning

– Can’t go beyond “retrieval” to support information access in general


More Than “Relevance”

[Figure: a relevance-only ranking vs. the desired ranking, which also takes redundancy and readability into account]


Retrieval Parameters

• Retrieval parameters are needed to

– model different user preferences

– customize a retrieval model according to different queries and documents

• So far, parameters have been set through empirical experimentation

• Can we set parameters automatically?


Systematic Applications of Language Models to IR

• Many different variants of language models have been developed, but are there many more models to be studied?

• Can we establish a road map for exploring language models in IR?


Two Main Ideas of the Risk Minimization Framework

• Retrieval as a decision process

• Systematic language modeling


Idea 1: Retrieval as Decision-Making (A more general notion of relevance)

Given a query,

– Which documents should be selected? (D)

– How should these documents be presented to the user? (π)

Choose: (D, π)

[Figure: candidate presentation strategies for the selected documents: a ranked list, an unordered subset, clustering, …]


Idea 2: Systematic Language Modeling

[Figure: Documents → Document Language Models (DOC MODELING); Query → Query Language Model (QUERY MODELING); User → Loss Function (USER MODELING); together these determine the retrieval decision]


Generative Model of Document & Query [Lafferty & Zhai 01b]

Partially observed: user U and source S. Inferred: query model θ_Q and document model θ_D. Observed: query q and document d.

  U → θ_Q ~ p(θ_Q | U);   θ_Q → q ~ p(q | θ_Q, U)     (User, Query)

  S → θ_D ~ p(θ_D | S);   θ_D → d ~ p(d | θ_D, S)     (Source, Document)

  Relevance: R ~ p(R | θ_Q, θ_D)


Applying Bayesian Decision Theory [Lafferty & Zhai 01b, Zhai 02, Zhai & Lafferty 06]

Observed: the query q (from user U) and the document collection C (from source S); the models θ are hidden.

Candidate choices: (D1, π1), (D2, π2), …, (Dn, πn), each with loss L(Di, πi, θ).

Bayes risk for a choice (D, π): the expected loss under the posterior over the hidden variables.

RISK MINIMIZATION:

  (D*, π*) = arg min_{D,π} ∫_Θ L(D, π, θ) p(θ | q, U, C, S) dθ
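To make the decision rule concrete, here is a minimal Python sketch (mine, not from the slides) that picks the action minimizing expected loss over a discretized parameter space Θ; the action list, the loss table, and the posterior values are all illustrative placeholders.

```python
def bayes_risk_choice(actions, loss, posterior):
    """Pick the action (D_i, pi_i) with the smallest Bayes risk.

    actions:   list of candidate choices (e.g. candidate rankings)
    loss:      loss[i][j] = L(actions[i], theta_j) on a discretized Theta
    posterior: posterior[j] = p(theta_j | q, U, C, S), summing to 1
    """
    risks = [sum(l * p for l, p in zip(row, posterior)) for row in loss]
    best = min(range(len(actions)), key=lambda i: risks[i])
    return actions[best], risks[best]

# Toy usage: two candidate rankings, three discretized parameter settings.
actions = ["ranking A", "ranking B"]
loss = [[0.1, 0.9, 0.4],   # loss of ranking A under each theta
        [0.5, 0.2, 0.3]]   # loss of ranking B under each theta
posterior = [0.2, 0.5, 0.3]
print(bayes_risk_choice(actions, loss, posterior))  # -> ranking B, risk ~0.29
```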


Benefits of the Framework

• Systematic exploration of retrieval models (covering almost all the existing retrieval models as special cases)

• Derive general retrieval principles (risk ranking principle)

• Automatic parameter setting

• Go beyond independent relevance (subtopic retrieval)


Special Cases of Risk Minimization

• Set-based models (choose D): Boolean model

• Ranking models (choose π)

  – Independent loss

    • Relevance-based loss: probabilistic relevance model, generative relevance theory

    • Distance-based loss: vector-space model, two-stage LM, KL-divergence model

  – Dependent loss

    • MMR loss: subtopic retrieval model (novelty-based)

    • MDR loss: subtopic retrieval model (coverage-based)


Case 1: Two-stage Language Models

Generative model: U → θ_Q (p(θ_Q | U)) → q;  S → θ_D (p(θ_D | S)) → d (p(d | θ_D, S))

Loss function:

  l(π, θ_Q, θ_D) = 0 if Δ(θ_Q, θ_D) = 0, and a constant c otherwise

Risk ranking formula (approximating the posteriors by their modes θ̂_Q, θ̂_D):

  R(d; q) ∝(rank) p(q | θ̂_D, U)

Stage 1: estimate θ̂_D from d (Dirichlet prior smoothing)

Stage 2: compute p(q | θ̂_D, U) (mixture model)

→ Two-stage smoothing
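A minimal sketch (my own, not the authors' code) of two-stage smoothing as a scoring function: Stage 1 smooths the document model with a Dirichlet prior (parameter mu), Stage 2 mixes in a background model with weight lam to absorb query noise. Corpus statistics are passed in as plain dictionaries.

```python
import math
from collections import Counter

def two_stage_score(query_terms, doc_terms, coll_prob, mu=2000.0, lam=0.5):
    """log p(q | d) under two-stage smoothing.

    query_terms, doc_terms: lists of tokens
    coll_prob: dict term -> p(term | collection); assumed positive for query terms
    mu:  Dirichlet prior (stage 1, document smoothing)
    lam: mixture weight of the background model (stage 2, query "noise")
    """
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_bg = coll_prob.get(w, 1e-9)
        # Stage 1: Dirichlet-smoothed document language model
        p_doc = (tf[w] + mu * p_bg) / (dlen + mu)
        # Stage 2: mix with the background model
        score += math.log((1 - lam) * p_doc + lam * p_bg)
    return score
```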


Case 2: KL-divergence Retrieval Models

Generative model: same as above.

Loss function:

  l(π, θ_Q, θ_D) = c · Δ(θ_Q, θ_D), with Δ taken to be the KL divergence D(θ_Q || θ_D)

Risk ranking formula:

  R(d; q) ∝(rank) −D(θ̂_Q || θ̂_D)

i.e., rank documents by how close the estimated document model θ̂_D is to the estimated query model θ̂_Q.
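As a small illustration (not from the slides), KL-divergence ranking over smoothed unigram models can be computed directly; here both models are plain term→probability dictionaries, and terms missing from the document model fall back to a floor value.

```python
import math

def neg_kl_score(query_model, doc_model, floor=1e-9):
    """Return -D(theta_Q || theta_D); higher means the document model is closer."""
    score = 0.0
    for w, pq in query_model.items():
        pd = doc_model.get(w, floor)   # smoothed/estimated document probability
        score -= pq * math.log(pq / pd)
    return score
```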


Case 3: Aspect Generative Model of Document & Query

Generative model with a set of aspect models Λ = (λ_1, …, λ_k):

  U → θ_Q (p(θ_Q | Λ, U)) → q (p(q | θ_Q, Λ));  S → θ_D (p(θ_D | Λ, S)) → d (p(d | θ_D, Λ))

PLSI:  p(d | θ_D, Λ) = ∏_{i=1}^{n} Σ_{a=1}^{|A|} p(d_i | λ_a) p(a | θ_D),  where d = d_1 … d_n

LDA:   p(d | θ_D, Λ) = ∫ ∏_{i=1}^{n} Σ_{a=1}^{|A|} p(d_i | λ_a) p(a | θ) Dir(θ | θ_D) dθ


Optimal Ranking for Independent Loss

Decision space = {rankings π of the documents}; the user browses the result list sequentially.

Independent loss: the loss of a ranking decomposes over positions,

  L(π, θ) = Σ_{j=1}^{N} s_j · l(d_{π_j}, θ),

where s_j is a position weight and l(d, θ) is the loss of presenting document d.

Independent risk = independent scoring: the Bayes-optimal ranking also decomposes,

  π* = arg min_π Σ_{j=1}^{N} s_j ∫ l(d_{π_j}, θ) p(θ | q, U, C, S) dθ,

so it suffices to rank documents in ascending order of the individual risk

  r(d | q, U, C, S) = ∫ l(d, θ) p(θ | q, U, C, S) dθ.

“Risk ranking principle” [Zhai 02]: ranking based on r(d | q, U, C, S).


Automatic Parameter Tuning

• Retrieval parameters are needed to

– model different user preferences

– customize a retrieval model to specific queries and documents

• Retrieval parameters in traditional models

– EXTERNAL to the model, hard to interpret

– Parameters are introduced heuristically to implement “intuition”

– No principles to quantify them, must set empirically through many experiments

– Still no guarantee for new queries/documents

• Language models make it possible to estimate parameters…


The Way to Automatic Tuning ...

• Parameters must be PART of the model!

– Query modeling (explain difference in query)

– Document modeling (explain difference in doc)

• De-couple the influence of a query on parameter setting from that of documents

– To achieve stable setting of parameters

– To pre-compute query-independent parameters


Parameter Setting in Risk Minimization

[Figure: the query is used to estimate the query language model (query model parameters: Estimate); the documents are used to estimate document language models (doc model parameters: Estimate); the user determines the loss function (user model parameters: Set)]


Generative Relevance Hypothesis [Lavrenko 04]

• Generative Relevance Hypothesis: for a given information need, queries expressing that need and documents relevant to that need can be viewed as independent random samples from the same underlying generative model

• A special case of risk minimization when document models and query models are in the same space

• Implications for retrieval models: “the same underlying generative model” makes it possible to

  – match queries and documents even if they are in different languages or media

  – estimate/improve a relevant document model based on example queries, or vice versa


Risk minimization can easily go beyond independent relevance…


Aspect Retrieval

Query: What are the applications of robotics in the world today?

Find as many DIFFERENT applications as possible.

Example aspects:

  A1: spot-welding robotics
  A2: controlling inventory
  A3: pipe-laying robots
  A4: talking robot
  A5: robots for loading & unloading memory tapes
  A6: robot [telephone] operators
  A7: robot cranes
  …

Aspect judgments (documents × aspects):

        A1 A2 A3 ...  Ak
  d1     1  1  0  0 … 0  0
  d2     0  1  1  1 … 0  0
  d3     0  0  0  0 … 1  0
  …
  dk     1  0  1  0 … 0  1

Must go beyond independent relevance!


Evaluation Measures

• Aspect Coverage (AC): measures per-doc coverage

– #distinct-aspects/#docs

– Equivalent to the “set cover” problem, NP-hard

• Aspect Uniqueness(AU): measures redundancy

– #distinct-aspects/#aspects

– Equivalent to the “volume cover” problem, NP-hard

• Example (aspect vectors of the top-ranked documents, with accumulated counts):

  d1: 0001001   d2: 0101100   d3: 1000101

  #doc        1        2        3       …
  #asp        2        5        8       …
  #uniq-asp   2        4        5
  AC:       2/1=2.0  4/2=2.0  5/3=1.67
  AU:       2/2=1.0  4/5=0.8  5/8=0.625
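A small sketch (mine, not from the slides) that computes aspect coverage and aspect uniqueness at each rank from each document's set of covered aspects; it reproduces the accumulated-count example above.

```python
def aspect_metrics(ranked_aspect_sets):
    """Yield (rank, AC, AU) for a ranked list of per-document aspect sets."""
    seen = set()          # distinct aspects covered so far
    total_aspects = 0     # aspects counted with repetition
    for rank, aspects in enumerate(ranked_aspect_sets, start=1):
        total_aspects += len(aspects)
        seen |= set(aspects)
        ac = len(seen) / rank            # aspect coverage: #distinct-aspects / #docs
        au = len(seen) / total_aspects   # aspect uniqueness: #distinct-aspects / #aspects
        yield rank, ac, au

# The slide's example, with d1..d3 given as sets of covered aspect indices.
docs = [{4, 7}, {2, 4, 5}, {1, 5, 7}]
for rank, ac, au in aspect_metrics(docs):
    print(rank, round(ac, 2), round(au, 3))
# -> 1 2.0 1.0 ; 2 2.0 0.8 ; 3 1.67 0.625
```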


Dependent Relevance Ranking

• In general, the computation of the optimal ranking is NP-hard

• A general greedy algorithm (sketched after this list)

– Pick the first document according to INDEPENDENT relevance

– Given that we have picked k documents, evaluate the CONDITIONAL relevance of each candidate document

– Choose the document that has the highest conditional relevance value
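The greedy procedure can be written down directly; the sketch below is illustrative only, and `cond_rel(d, selected)` stands for whatever conditional-relevance function is plugged in (e.g. an MMR- or MDR-style score), reducing to independent relevance when nothing has been selected yet.

```python
def greedy_rerank(candidates, cond_rel, k):
    """Greedy dependent-relevance ranking.

    candidates: iterable of documents
    cond_rel:   function(doc, already_selected) -> conditional relevance value
    k:          number of documents to rank
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        # pick the candidate with the highest conditional relevance
        best = max(remaining, key=lambda d: cond_rel(d, selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```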


Loss Function L(θ_{k+1} | θ_1 … θ_k)

[Figure: documents d_1 … d_k already ranked, with known models θ_1 … θ_k; candidate d_{k+1} with model θ_{k+1}]

• Maximal Marginal Relevance (MMR): combine relevance Rel(θ_{k+1}) with novelty/redundancy Nov(θ_{k+1} | θ_1 … θ_k); the best d_{k+1} is novel and relevant.

• Maximal Diverse Relevance (MDR): model each document's aspect coverage distribution p(a | θ_i); the best d_{k+1} is complementary in coverage.


Maximal Marginal Relevance (MMR) Models

• Maximizing aspect coverage indirectly through redundancy elimination

• Conditional-Rel. = novel + relevant

• Elements

– Redundancy/Novelty measure

– Combination of novelty and relevance


A Mixture Model for Redundancy

[Figure: a candidate document is modeled as a two-component mixture: with probability λ a word is drawn from P(w|Old), the reference (already seen) document model, and with probability 1 − λ from P(w|Background), the collection model. The weight λ = ? is estimated by maximum likelihood using Expectation-Maximization and serves as the redundancy measure.]
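A compact sketch (my illustration, not the slides' code) of the EM estimate of the mixture weight λ for one candidate document, given the reference ("old") model and the background model as term→probability dictionaries.

```python
from collections import Counter

def estimate_redundancy(doc_terms, p_old, p_bg, iters=50, lam=0.5, floor=1e-12):
    """EM for lambda in the mixture lam*p_old(w) + (1-lam)*p_bg(w).

    doc_terms is assumed non-empty; p_old and p_bg map terms to probabilities.
    """
    counts = Counter(doc_terms)
    n = sum(counts.values())
    for _ in range(iters):
        expected_old = 0.0
        for w, c in counts.items():
            po = lam * p_old.get(w, floor)
            pb = (1 - lam) * p_bg.get(w, floor)
            expected_old += c * po / (po + pb)   # E-step: posterior of the "old" component
        lam = expected_old / n                   # M-step: update the mixture weight
    return lam
```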


Cost-based Combination of Relevance and Novelty

Loss function (c_2 = cost of presenting a relevant but redundant document, c_3 = cost of presenting a non-relevant document, c_3 > c_2):

  l(θ_{k+1} | θ_1, …, θ_k, {θ̂_Q}) = c_2 · p(Rel | d_{k+1}) (1 − p(New | d_{k+1})) + c_3 · (1 − p(Rel | d_{k+1}))

Minimizing this loss is rank-equivalent to scoring each candidate by the product of a relevance score and a novelty score:

  score(d_{k+1}) ∝(rank) p(Rel | d_{k+1}) · (p(New | d_{k+1}) + c),  where c = c_3/c_2 − 1,

with the query likelihood p(q | d_{k+1}) used as the relevance score and the mixture-model estimate used as the novelty score.


Maximal Diverse Relevance (MDR) Models

• Maximizing aspect coverage directly through aspect modeling

• Conditional-rel. = complementary coverage

• Elements

– Aspect loss function

– Generative Aspect Model


Aspect Generative Model of Document & Query

(Recall the aspect generative model from Case 3: a set of aspect models Λ = (λ_1, …, λ_k); the query model θ_Q and document model θ_D are distributions over aspects, and documents and queries are generated via PLSI or LDA as shown earlier.)


Aspect Loss Function

Aspect loss function:

  l(θ_{k+1} | θ_1, …, θ_k, {θ̂_Q}) = Δ(θ̂_Q || θ_{1,…,k+1}),

the KL divergence between the desired (query) aspect distribution and the combined coverage of the first k + 1 documents, where the combined coverage averages the candidate's aspect distribution with the aspects already covered:

  p(a | θ_{1,…,k+1}) = (1/(k+1)) · p(a | θ_{k+1}) + (1 − 1/(k+1)) · p(a | θ_{1,…,k})


Aspect Loss Function: Illustration

[Figure: the desired coverage p(a|Q) is compared with the “already covered” distributions p(a|θ_1) … p(a|θ_{k−1}) and a new candidate p(a|θ_k); a candidate may be non-relevant, redundant, or a perfect complement. The combined coverage mixes the candidate's aspect distribution with what is already covered, as defined above.]


Risk Minimization: Summary

• Risk minimization is a general probabilistic retrieval framework

– Retrieval as a decision problem (=risk min.)

– Separate/flexible language models for queries and docs

• Advantages

– A unified framework for existing models

– Automatic parameter tuning due to LMs

– Allows for modeling complex retrieval tasks

• Lots of potential for exploring LMs…

• For more information, see [Zhai 02]


Future Research Directions

• Modeling latent structures of documents

– Introduce source structures (naturally suggest structure-based smoothing methods)

• Modeling multiple queries and clickthroughs of the same user

– Let the observation include multiple queries and clickthroughs

• Collaborative search

– Introduce latent interest variables to tie similar users together

• Modeling interactive search


Axiomatic Retrieval Framework

Most of the following slides are from Hui Fang’s presentation


Traditional Way of Modeling Relevance

[Figure: Query → QRep, Document → DRep; relevance is modeled as a function of the two representations and tuned on a test collection]

• Rel ≈ Sim(DRep, QRep): vector space models [Salton et al. 75, Salton et al. 83, Salton et al. 89, Singhal 96]

• Rel ≈ P(R=1 | DRep, QRep): probabilistic models [Fuhr et al. 92, Lafferty et al. 03, Ponte et al. 98, Robertson et al. 76, Turtle et al. 91, van Rijsbergen et al. 77]

Problems:

• No way to predict the performance and identify the weaknesses

• Sophisticated parameter tuning


No Way to Predict the Performance

The pivoted normalization formula, for example, combines several heuristic components:

  S(Q,D) = Σ_{t ∈ Q∩D} c(t,Q) · [1 + ln(1 + ln(c(t,D)))] / [(1 - s) + s·|D|/avdl] · ln((N + 1)/df(t))

(The slide contrasts the double-log TF transformation with the simpler 1 + ln(c(t,D)).)


Sophisticated Parameter Tuning

“k1, b and k3 are parameters which depend on the nature of the queries and possibly on the database; k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000.” [Robertson et al. 1999]

  S(Q,D) = Σ_{t ∈ Q∩D} ln[(N - df(t) + 0.5)/(df(t) + 0.5)] · [(k_1 + 1)·c(t,D)] / [k_1·((1 - b) + b·|D|/avdl) + c(t,D)] · [(k_3 + 1)·c(t,Q)] / [k_3 + c(t,Q)]


High Parameter Sensitivity


Hui Fang’s Thesis Work [Fang 07]

Propose a novel axiomatic framework, where relevance is directly modeled with term-based constraints

– Predict the performance of a function analytically [Fang et al., SIGIR04]

– Derive more robust and effective retrieval functions [Fang & Zhai, SIGIR05, Fang & Zhai, SIGIR06]

– Diagnose weaknesses and strengths of retrieval functions [Fang & Zhai, under review]


Traditional Way of Modeling Relevance (recap)

(Recall: Query → QRep, Document → DRep; Rel ≈ Sim(DRep, QRep) in vector space models, Rel ≈ P(R=1 | DRep, QRep) in probabilistic models, with parameters tuned on a test collection.)


Axiomatic Approach to Relevance Modeling

[Figure: in addition to the representations QRep and DRep, relevance Rel(Q, D) is constrained directly by a set of term-based constraints (Constraint 1, Constraint 2, …, Constraint m) defined over collection statistics. The constraints are used to (1) predict performance, (2) develop more robust functions, and (3) diagnose weaknesses. “We are here”: step (1).]


Part 1: Define retrieval constraints

[Fang et al., SIGIR 2004]


• Pivoted Normalization Method

  S(Q,D) = Σ_{w ∈ Q∩D} c(w,Q) · [1 + ln(1 + ln(c(w,D)))] / [(1 - s) + s·|D|/avdl] · ln((N + 1)/df(w))

• Dirichlet Prior Method

  S(Q,D) = Σ_{w ∈ Q∩D} c(w,Q) · ln(1 + c(w,D)/(μ·p(w|C))) + |Q| · ln(μ/(|D| + μ))

• Okapi Method

  S(Q,D) = Σ_{w ∈ Q∩D} ln[(N - df(w) + 0.5)/(df(w) + 0.5)] · [(k_1 + 1)·c(w,D)] / [k_1·((1 - b) + b·|D|/avdl) + c(w,D)] · [(k_3 + 1)·c(w,Q)] / [k_3 + c(w,Q)]

Common components: term frequency (TF), inverse document frequency (IDF), and document length normalization.

Empirical observations in IR (cont.): alternative TF transformations (e.g., 1 + ln(c(w,d))) and parameter sensitivity.
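For concreteness, here is a small sketch (mine, following the pivoted normalization formula as reconstructed above) of the score for one query–document pair; `df`, `N`, and `avdl` are corpus statistics supplied by the caller.

```python
import math
from collections import Counter

def pivoted_score(query_terms, doc_terms, df, N, avdl, s=0.2):
    """Pivoted normalization retrieval score S(Q, D)."""
    qtf = Counter(query_terms)
    dtf = Counter(doc_terms)
    norm = (1 - s) + s * len(doc_terms) / avdl   # document length normalization
    score = 0.0
    for w, cq in qtf.items():
        cd = dtf.get(w, 0)
        if cd == 0 or w not in df:
            continue
        tf = 1 + math.log(1 + math.log(cd))      # sub-linear TF transformation
        idf = math.log((N + 1) / df[w])          # inverse document frequency
        score += cq * tf / norm * idf
    return score
```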


Research Questions

• How can we formally characterize these necessary retrieval heuristics?

• Can we predict the empirical behavior of a method without experimentation?


Term Frequency Constraints (TFC1)

TF weighting heuristic I: give a higher score to a document with more occurrences of a query term.

• TFC1: Let q be a query with only one term w. If |d_1| = |d_2| and c(w, d_1) > c(w, d_2), then f(d_1, q) > f(d_2, q).
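One way to make such a constraint operational is to test it mechanically on synthetic documents. The sketch below (illustrative only) checks TFC1 for any scoring function with the signature `score(query_terms, doc_terms)`, for example the `pivoted_score` sketch above with its corpus statistics bound in via a lambda.

```python
def check_tfc1(score, w="w", filler="z", doc_len=100, max_tf=10):
    """Return True if score() gives strictly higher scores to equal-length
    documents with more occurrences of the single query term w (TFC1)."""
    prev = None
    for tf in range(1, max_tf + 1):
        doc = [w] * tf + [filler] * (doc_len - tf)   # fixed |d|, varying c(w, d)
        cur = score([w], doc)
        if prev is not None and not cur > prev:
            return False
        prev = cur
    return True
```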


Term Frequency Constraints (TFC2)

TF weighting heuristic II: favor a document with more distinct query terms.

• TFC2: Let q be a query with two terms w_1 and w_2, and assume idf(w_1) = idf(w_2) and |d_1| = |d_2|. If c(w_1, d_2) = c(w_1, d_1) + c(w_2, d_1), c(w_2, d_2) = 0, and c(w_1, d_1) > 0, c(w_2, d_1) > 0, then f(d_1, q) > f(d_2, q).

(d_1 contains both query terms; d_2 has the same total number of query-term occurrences, but all of one term.)


Term Discrimination Constraint (TDC)

IDF weighting heuristic: penalize words that are popular in the collection; give higher weights to discriminative terms.

Example: Query = “SVM Tutorial”, with IDF(SVM) > IDF(Tutorial). Between two documents with the same total number of query-term occurrences, the one with more occurrences of the more discriminative term “SVM” (Doc 1) should score higher: f(Doc 1) > f(Doc 2).


Term Discrimination Constraint (Cont.)

• TDC: Let q be a query and w_1, w_2 be two query terms. Assume |d_1| = |d_2| and idf(w_1) > idf(w_2). If c(w_1, d_1) > c(w_1, d_2), c(w_1, d_1) + c(w_2, d_1) = c(w_1, d_2) + c(w_2, d_2), and c(w, d_1) = c(w, d_2) for all other words w, then f(d_1, q) > f(d_2, q).


Length Normalization Constraints (LNCs)

Document length normalization heuristic: penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).

• LNC1: Let q be a query. If, for some word w ∉ q, c(w, d_2) = c(w, d_1) + 1, but for all other words w, c(w, d_2) = c(w, d_1), then f(d_1, q) ≥ f(d_2, q).

• LNC2: Let q be a query and k > 1. If |d_1| = k · |d_2| and c(w, d_1) = k · c(w, d_2) for all words w, then f(d_1, q) ≥ f(d_2, q).


TF-LENGTH Constraint (TF-LNC)

TF-LN heuristic: regularize the interaction of TF and document length.

• TF-LNC: Let q be a query with only one term w. If c(w, d_1) > c(w, d_2) and |d_1| = |d_2| + c(w, d_1) - c(w, d_2), then f(d_1, q) > f(d_2, q).

(Intuition: d_1 is d_2 with extra occurrences of the query term w appended; those extra occurrences should not hurt the score.)


Analytical Evaluation

  Retrieval Formula    TFCs         TDC          LNC1         LNC2         TF-LNC
  Pivoted Norm.        Yes          Conditional  Yes          Conditional  Conditional
  Dirichlet Prior      Yes          Conditional  Yes          Conditional  Yes
  Okapi (original)     Conditional  Conditional  Conditional  Conditional  Conditional
  Okapi (modified)     Yes          Conditional  Yes          Yes          Yes


Term Discrimination Constraint (TDC), example revisited

IDF weighting heuristic: penalize words that are popular in the collection; give higher weights to discriminative terms.

Query: “SVM Tutorial”, with IDF(SVM) > IDF(Tutorial).

  Doc 1: … SVM SVM SVM Tutorial Tutorial …
  Doc 2: … Tutorial SVM SVM Tutorial Tutorial …

TDC requires f(Doc 1) > f(Doc 2).


Benefits of Constraint Analysis

• Provide an approximate bound for the parameters

– A constraint may be satisfied only if the parameter is within a particular interval.

• Compare different formulas analytically without experimentations

– When a formula does not satisfy the constraint, it often indicates non-optimality of the formula.

• Suggest how to improve the current retrieval models

– Violation of constraints may pinpoint where a formula needs to be improved.


Benefits 1: Bounding Parameters

• Pivoted normalization method: LNC2 is satisfied only if s < 0.4.

[Figure: average precision as a function of s, showing the parameter sensitivity of s; the optimal s (for average precision) falls below the 0.4 bound]


Benefits 2: Analytical Comparison

• Okapi method: the IDF factor ln[(N - df(w) + 0.5)/(df(w) + 0.5)] becomes negative when df(w) is large, which violates many constraints.

[Figure: average precision vs. s (Pivoted) or b (Okapi) for keyword and verbose queries; Okapi falls behind pivoted normalization on verbose queries]


Benefits 3: Improving Retrieval Formulas

• Modified Okapi method: replace the original IDF factor in the Okapi formula with ln((N + 1)/df(w)). This makes Okapi satisfy more constraints and is expected to help verbose queries.

[Figure: average precision vs. s or b for keyword and verbose queries, comparing Pivoted, Okapi, and Modified Okapi; the modified Okapi closes the gap on verbose queries]


Axiomatic Approach to Relevance Modeling (recap)

[Figure repeated: term-based constraints support (1) predicting performance, (2) developing more robust functions, and (3) diagnosing weaknesses. “We are here”: step (2).]


Part 2: Derive new retrieval functions

[Fang & Zhai SIGIR05, Fang & Zhai SIGIR06]


Basic Idea of the Axiomatic Approach

[Figure: the space of candidate retrieval functions; each retrieval constraint C1, C2, C3 carves out the subset of functions S1, S2, S3 that satisfies it, and the target functions lie in the intersection of these subsets]


Function Space

D = d_1, d_2, …, d_n;  Q = q_1, q_2, …, q_m;  S: Q × D → ℝ

Define the function space inductively (the slide illustrates this with one-word queries and documents such as “dog”, “cat”, “big”):

• Primitive weighting function (f): for a single-term query and a single-term document, S(Q, D) = f(q, d)

• Query growth function (h): adding a term t to the query changes the score by h, i.e. S(Q ∪ {t}, D) = S(Q, D) + h(t, Q, D)

• Document growth function (g): adding a term t to the document changes the score by g, i.e. S(Q, D ∪ {t}) = S(Q, D) + g(t, Q, D)


Derivation of New Retrieval Functions

[Figure: start from an existing function S(Q, D); decompose it into components f, g, h; generalize them to families F, G, H; constrain the families with the retrieval constraints C1, C2, C3 to obtain f', g', h'; assemble these into a new function S'(Q, D)]


Representative Derived Function

  S(Q,D) = Σ_{t ∈ Q∩D} c(t,Q) · (N/df(t))^0.35 · c(t,D) / (c(t,D) + s + s·|D|/avdl)

(query term frequency QTF × IDF × TF with document length normalization)
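A direct transcription of the derived function into code (my sketch, following the reconstruction above); as before, `df`, `N`, and `avdl` are corpus statistics supplied by the caller.

```python
from collections import Counter

def axiomatic_score(query_terms, doc_terms, df, N, avdl, s=0.5):
    """Derived axiomatic retrieval function: QTF * IDF-style weight * normalized TF."""
    qtf = Counter(query_terms)
    dtf = Counter(doc_terms)
    score = 0.0
    for t, cq in qtf.items():
        cd = dtf.get(t, 0)
        if cd == 0 or t not in df:
            continue
        idf = (N / df[t]) ** 0.35                              # (N / df(t))^0.35
        tf_norm = cd / (cd + s + s * len(doc_terms) / avdl)    # TF with length normalization
        score += cq * idf * tf_norm
    return score
```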


The derived function is less sensitive to the parameter setting

[Figure: retrieval performance as a function of the parameter value; the axiomatic model's curve is flatter and better than the baseline]


Adding Semantic Term Matching

Query: dog

Passage 1 (about dogs, but never uses the word “dog”): “Training puppies is not always easy: it requires work. Puppies should be touched and held from birth, although only briefly and occasionally until their eyes and ears open. Otherwise the puppy may become vicious.”

Passage 2 (not relevant): “A book is a collection of paper with text, pictures, usually bound together along one edge within covers. A book is also a literary work or a main division of such a work. A book produced in electronic format is known as an e-book.”

Purely syntactic term matching cannot tell that the first passage matches the query.


General Approach to Semantic Term Matching

• Select semantically similar terms

• Expand the original query with the selected terms, e.g.

  dog → dog 1, puppy 0.5, doggy 0.5, hound 0.5, bone 0.1

Key challenge: how to weight the selected terms?

The proposed axiomatic approach provides guidance on how to weight the expansion terms appropriately.


Effectiveness of Semantic Term Matching

                                    ROBUST04             ROBUST05
                                  MAP       P@20       MAP       P@20
  Syntactic matching (baseline)   0.248     0.352      0.192     0.379
  Semantic term matching          0.302     0.399      0.292     0.502
                                  (+21.8%)             (+51.0%)


Axiomatic Approach to Relevance Modeling (recap)

[Figure repeated: term-based constraints support (1) predicting performance, (2) developing more robust functions, and (3) diagnosing weaknesses. “We are here”: step (3).]


Part 3: Diagnostic evaluation for IR models

[Fang & Zhai, under review]


Existing evaluation provides little explanation for the performance differences

              trec8    wt-2g    fr88-89
  Pivoted     0.244    0.288    0.218
  Dirichlet   0.257    0.302    0.202

[Figure: a query (“dog”) is run by a retrieval function over a test collection of documents Doc1 … Docn, yielding a single summary number such as MAP = 0.25]

How can we diagnose the weaknesses and strengths of retrieval functions?


Relevance-Preserving Perturbations

Perturb the term statistics of documents while keeping their relevance status unchanged.

Example, the document scaling perturbation c_D(d, d, K): concatenate every document with itself K times.
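As an illustration (not the authors' toolkit), a perturbation like document scaling is easy to express over a corpus stored as term lists; the exact number of copies is one reading of c_D(d, d, K).

```python
def scale_documents(corpus, K):
    """Document scaling perturbation: each document plus K extra copies of itself
    (one reading of c_D(d, d, K)). Relevance judgments stay unchanged, so any change
    in ranking quality reflects how the retrieval function handles length and TF."""
    return {doc_id: terms * (K + 1) for doc_id, terms in corpus.items()}
```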


Summary of Perturbations

• Relevance addition

• Noise addition

• Internal term growth

• Document scaling

• Relevant document concatenation

• Non-relevant document concatenation

• Noise deletion

• Document addition

• Document deletion


Length Scaling Test

Test whether a retrieval function over-penalizes long documents, using the document scaling perturbation c_D(d, d, K):

1. Identify the aspect to be diagnosed

2. Choose appropriate relevance-preserving perturbations

3. Perform the test and interpret the results

[Figure: performance as a function of the scaling factor K; the Dirichlet prior method degrades more quickly, i.e., Dirichlet over-penalizes long documents!]


Summary of Diagnostic Tests

• Length variation sensitivity tests

– Length variance reduction test

– Length variance amplification test

– Length scaling test

• Term noise resistance tests

– Term noise addition test

• TF-LN balance Tests

– Single query term growth

– Majority query term growth

– All query term growth


Identifying the weaknesses makes it possible to improve the performance

          trec8                   wt2g                    fr88-89
          MAP    #RRel  P@20      MAP    #RRel  P@20      MAP    #RRel  P@20
  Dir.    0.257  2838   0.397     0.302  1875   0.372     0.207  741    0.185
  M.D.    0.262  2874   0.415     0.321  1930   0.395     0.224  811    0.191
  Piv.    0.244  2826   0.402     0.288  1924   0.369     0.223  822    0.206
  M.P.    0.256  2848   0.411     0.316  1940   0.392     0.230  867    0.202

(Dir. = Dirichlet prior, M.D. = modified Dirichlet, Piv. = pivoted normalization, M.P. = modified pivoted normalization)


Axiomatic Framework: Summary

• A new way of examining and developing retrieval models

• Facilitate analytical study of retrieval models

• Applicable to the development of all kinds of ranking functions

• Limitation:

– Constraints can be subjective

– Not constructive (thus must rely on other techniques to reduce the search space)

• Combined with machine learning?


Lecture 4: Key Points

• Retrieval problem can be generally formalized as a statistical decision problem

– Nicely incorporate generative models into a retrieval framework

– Serve as a road map for exploring new retrieval models

– Make it easier to model complex retrieval problems (interactive retrieval)

• Axiomatic framework makes it possible to analyze a retrieval function without experimentation

– Facilitate theoretical study of retrieval models (“impossibility theorem”?)

– Offer a general methodology for thinking about and improving retrieval models


Readings

• The risk minimization paper:

– http://sifaka.cs.uiuc.edu/czhai/riskmin.pdf

• Hui Fang’s thesis:

– http://www.cs.uiuc.edu/techreports.php?report=UIUCDCS-R-2007-2847


Discussion

• Risk minimization for multimedia retrieval

– Add generative models of images and video to the framework

– Unifying multimedia with text as a common language

• Axiomatic approaches

– Constraints for ranking multimedia information items

– Add constraints to a statistical learning framework (e.g., add constraints as prior or regularization)