1
Risk Minimization and
Language Modeling in Text
Retrieval ChengXiang Zhai
Thesis Committee:
John Lafferty (Chair), Jamie Callan,
Jaime Carbonell, David A. Evans,
W. Bruce Croft (Univ. of Massachusetts, Amherst)
2
Information Overflow
Web Site Growth
3
Text Retrieval (TR)
Retrieval System
User
“Tips on thesis defense”
query
relevant docs
database/collection
text docs
4
Challenges in TR
Relevance (independent, topical)
Ad hoc parameter tuning
Utility
5
Sophisticated Parameter Tuning in the Okapi System
(Robertson et al. 1999)
“k1, b and k3 are parameters which depend on the nature of the queries and possibly on the database; k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000 (effectively infinite).”
6
More Than “Relevance”
Relevance Ranking Desired Ranking
Redundancy
Readability
7
Meeting the Challenges
Bayesian Decision Theory
Statistical Language Models
Risk Minimization Framework
Utility-based Retrieval
Parameter Estimation
8
Map of Thesis
New TR Framework: Risk Minimization Framework
New TR Models and Features:
- Two-stage Language Model: automatic parameter setting
- KL-divergence Retrieval Model: natural incorporation of feedback
- Aspect Retrieval Model: non-traditional ranking
9
Retrieval as Decision-Making
Given a query:
- Which documents should be selected? (D)
- How should these docs be presented to the user? (π)
Choose: (D, π)
Query → Ranked list (1 2 3 4)? Unordered subset? Clustering?
10
Generative Model of Document & Query
User U → query model θ_Q ~ p(θ_Q | U) (partially observed) → query q ~ p(q | θ_Q, U) (observed)
Source S → document model θ_D ~ p(θ_D | S) (inferred) → document d ~ p(d | θ_D, S) (observed)
11
Bayesian Decision Theory
Possible choices: (D_1, π_1), (D_2, π_2), …, (D_n, π_n), each with a loss L(D_i, π_i, θ).
Observed: query q, user U, doc set C, source S. Hidden: the models θ = (θ_Q, θ_1, …, θ_N).

RISK MINIMIZATION: choose the action with minimum Bayes risk,

(D*, π*) = argmin_{(D,π)} ∫ L(D, π, θ) p(θ | q, U, C, S) dθ
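The decision rule on this slide can be sketched with the integral replaced by a finite sum over candidate models; the action set, loss values, and posterior below are toy assumptions, not part of the thesis:

```python
def bayes_optimal_choice(choices, thetas, posterior, loss):
    """Return the action minimizing Bayes risk, with the integral over
    hidden models theta replaced by a finite sum for illustration."""
    def risk(choice):
        # expected loss of this choice under p(theta | q, U, C, S)
        return sum(loss(choice, th) * posterior[th] for th in thetas)
    return min(choices, key=risk)

# Toy example: decide whether to show a single document.
thetas = ["relevant", "nonrelevant"]
posterior = {"relevant": 0.8, "nonrelevant": 0.2}
loss = lambda choice, th: {("show", "nonrelevant"): 1.0,
                           ("skip", "relevant"): 1.0}.get((choice, th), 0.0)
best = bayes_optimal_choice(["show", "skip"], thetas, posterior, loss)
```

With the posterior favoring relevance, the expected loss of "show" (0.2) is below that of "skip" (0.8), so showing the document is the Bayes-optimal action.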
12
Special Cases
• Set-based models (choose D)
• Ranking models (choose π)
– Independent loss (→ PRP)
• Relevance-based loss
• Distance-based loss
– Dependent loss
• MMR loss
• MDR loss
Boolean model
Probabilistic relevance model
Vector-space Model
Aspect retrieval model
Two-stage LM
KL-divergence model
13
Map of Existing TR Models
Relevance
• (R(q), R(d)) similarity → different rep & similarity:
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)
– …
• P(r=1|q,d), r ∈ {0,1}: probability of relevance
– Regression model (Fox 83)
– Generative model:
· Doc generation: Classical prob. model (Robertson & Sparck Jones, 76)
· Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• P(d→q) or P(q→d): probabilistic inference → different inference system:
– Inference network model (Turtle & Croft, 91)
– Prob. concept space model (Wong & Yao, 95)
14
Where Are We?
Risk Minimization Framework
Two-stage Language Model
KL-divergence Retrieval Model
Aspect Retrieval Model
15
Two-stage Language Models
(Generative model as before: θ_Q ~ p(θ_Q | U), q ~ p(q | θ_Q, U); θ_D ~ p(θ_D | S), d ~ p(d | θ_D, S))

Loss function:
l(d, θ_Q, θ_D) = 0 if θ_Q = θ_D, c otherwise

Risk ranking formula:
R(d, q) ∝ p(q | θ̂_D, U)

Stage 1: compute θ̂_D (Dirichlet prior smoothing)
Stage 2: compute p(q | θ̂_D, U) (mixture model)
→ Two-stage smoothing
16
The Need for Query Modeling (Dual Role of Smoothing)
Verbose queries
Keyword queries
17
Interaction of the Two Roles of Smoothing
Relative performance of JM, Dir. and AD (precision):

Query Type   JM      Dir     AD
Title        0.228   0.256   0.237
Long         0.278   0.276   0.260

[Figure: bar chart of precision for JM, DIR, and AD on title vs. long queries]
18
Two-stage Smoothing
p(w|d) = (1-λ) · (c(w,d) + μ·p(w|C)) / (|d| + μ) + λ·p(w|U)

Stage 1 (μ): Dirichlet prior (Bayesian); explains unseen words.
Stage 2 (λ): 2-component mixture with p(w|U); explains noise in the query.
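The two-stage formula is straightforward to compute; a minimal sketch (the uniform background and user models here are toy assumptions):

```python
from collections import Counter

def two_stage_prob(w, doc, p_bg, p_user, mu=1000.0, lam=0.5):
    """Two-stage smoothed p(w|d): Dirichlet prior smoothing with mu
    (stage 1, explains unseen words), then interpolation with the
    user background model p(w|U) (stage 2, explains query noise)."""
    counts = Counter(doc)
    p_dir = (counts[w] + mu * p_bg(w)) / (len(doc) + mu)
    return (1 - lam) * p_dir + lam * p_user(w)

vocab = ["airport", "security", "the"]
doc = ["airport", "airport", "security"]
uniform = lambda w: 1.0 / len(vocab)
probs = {w: two_stage_prob(w, doc, uniform, uniform, mu=10.0, lam=0.3)
         for w in vocab}
```

Because both smoothing stages mix proper distributions, the result is still a distribution over the vocabulary, and unseen words ("the" here) receive nonzero mass.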
19
Estimating μ Using Leave-One-Out

Leave one word occurrence out at a time and predict it from the rest of the document: P(w_1 | d - w_1), P(w_2 | d - w_2), …, P(w_n | d - w_n).

Log-likelihood:
l_{-1}(μ | C) = Σ_{i=1}^{N} Σ_{w∈V} c(w, d_i) · log[ (c(w, d_i) - 1 + μ·p(w|C)) / (|d_i| - 1 + μ) ]

Maximum likelihood estimator (solved by Newton's method):
μ̂ = argmax_μ l_{-1}(μ | C)
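A sketch of the leave-one-out objective; a coarse grid search stands in for the Newton's method used in the thesis, and the two-document corpus and uniform background model are toy assumptions:

```python
import math
from collections import Counter

def loo_loglik(mu, docs, p_bg):
    """l_{-1}(mu | C): delete each word occurrence in turn and score it
    with the Dirichlet-smoothed model built from the rest of the doc."""
    ll = 0.0
    for d in docs:
        counts, n = Counter(d), len(d)
        for w, c in counts.items():
            ll += c * math.log((c - 1 + mu * p_bg(w)) / (n - 1 + mu))
    return ll

def estimate_mu(docs, p_bg, grid=(0.1, 1, 10, 100, 1000, 10000)):
    """argmax of the leave-one-out log-likelihood over a coarse grid."""
    return max(grid, key=lambda mu: loo_loglik(mu, docs, p_bg))

docs = [["a", "a", "a", "b"], ["b", "b", "c", "c"]]
p_bg = lambda w: 1.0 / 3.0
mu_hat = estimate_mu(docs, p_bg)
```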
20
Estimating λ Using a Mixture Model

p(q | λ, U) = Π_{i=1}^{N} Π_{j=1}^{m} [ (1-λ)·p(q_j | θ_{d_i}) + λ·p(q_j | U) ]

Maximum likelihood estimator, computed with the Expectation-Maximization (EM) algorithm:
λ̂ = argmax_λ p(q | λ, U)

Stage 1: estimate the document models p(w|d_1), …, p(w|d_N).
Stage 2: score with (1-λ)·p(w|d_i) + λ·p(w|U).
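The EM iteration for λ can be sketched directly from the mixture likelihood; the query-term probabilities below are hypothetical:

```python
def estimate_lambda(p_doc, p_user, n_iter=100, lam=0.5):
    """EM for the noise weight lambda in
    p(q | lam, U) = prod_i prod_j [(1-lam) p(q_j|theta_di) + lam p(q_j|U)].
    p_doc[i][j] = p(q_j | theta_{d_i}); p_user[j] = p(q_j | U)."""
    pairs = [(pd, p_user[j]) for row in p_doc for j, pd in enumerate(row)]
    for _ in range(n_iter):
        # E-step: posterior that each query-term occurrence is noise
        z = [lam * pu / ((1 - lam) * pd + lam * pu) for pd, pu in pairs]
        # M-step: lambda is the expected fraction of noise terms
        lam = sum(z) / len(z)
    return lam

# Query terms far likelier under the doc models than the background:
lam_clean = estimate_lambda([[0.5, 0.4]], [0.01, 0.01])
# The reverse: terms explained better by the background:
lam_noisy = estimate_lambda([[0.01, 0.01]], [0.5, 0.4])
```

As expected, a query well explained by the documents gets a small noise weight, while a background-like query drives λ toward 1.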
21
Average precision (3 DBs × 4 query types, 150 topics); automatic 2-stage vs. optimal 1-stage results:

Collection  Query  Optimal-JM  Optimal-Dir  Auto-2stage
AP88-89     SK     20.3%       23.0%        22.2%*
            LK     36.8%       37.6%        37.4%
            SV     18.8%       20.9%        20.4%
            LV     28.8%       29.8%        29.2%
WSJ87-92    SK     19.4%       22.3%        21.8%*
            LK     34.8%       35.3%        35.8%
            SV     17.2%       19.6%        19.9%
            LV     27.7%       28.2%        28.8%*
ZIFF1-2     SK     17.9%       21.5%        20.0%
            LK     32.6%       32.6%        32.2%
            SV     15.6%       18.5%        18.1%
            LV     26.7%       27.9%        27.9%*
22
Where Are We?
Risk Minimization Framework
Two-stage Language Model
KL-divergence Retrieval Model
Aspect Retrieval Model
23
KL-divergence Retrieval Models
(Generative model as before: θ_Q ~ p(θ_Q | U), q ~ p(q | θ_Q, U); θ_D ~ p(θ_D | S), d ~ p(d | θ_D, S))

Loss function:
l(d, θ_Q, θ_D) = c · D(θ_Q || θ_D)

Risk ranking formula:
R(d, q) ∝ -D(θ̂_Q || θ̂_D)
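Scoring by negative KL-divergence is rank-equivalent to the cross entropy Σ_w p(w|θ_Q) log p(w|θ_D), since the query-model entropy is constant across documents. A sketch with hypothetical smoothed models:

```python
import math

def kl_score(theta_q, theta_d):
    """-D(theta_Q || theta_D); theta_d must already be smoothed so it
    is nonzero wherever theta_q is."""
    return -sum(p * math.log(p / theta_d[w])
                for w, p in theta_q.items() if p > 0)

theta_q = {"airport": 0.5, "security": 0.5}
on_topic = {"airport": 0.4, "security": 0.4, "the": 0.2}
off_topic = {"airport": 0.05, "security": 0.05, "the": 0.9}
```

A document model close to the query model scores higher (less negative), and a perfect match scores 0.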
24
Expansion-based vs. Model-based
Expansion-based feedback: the feedback docs modify the query Q itself; scoring is by query likelihood p(Q | θ_D) against the document model.
Model-based feedback: the feedback docs modify the query model θ_Q; scoring is by KL-divergence D(θ_Q || θ_D).
25
Feedback as Model Interpolation
θ_Q' = (1-α)·θ_Q + α·θ_F

Feedback docs F = {d_1, d_2, …, d_n}; θ_F is estimated by a generative model or by divergence minimization. Scoring uses D(θ_Q' || θ_D).

α = 0: no feedback (θ_Q' = θ_Q)
α = 1: full feedback (θ_Q' = θ_F)
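The interpolation itself is one line per word over the union vocabulary; a sketch with hypothetical query and feedback models:

```python
def interpolate(theta_q, theta_f, alpha):
    """theta_Q' = (1 - alpha) theta_Q + alpha theta_F over the union vocab."""
    vocab = set(theta_q) | set(theta_f)
    return {w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
            for w in vocab}

theta_q = {"airport": 0.6, "security": 0.4}
theta_f = {"airport": 0.2, "bomb": 0.5, "alcohol": 0.3}
theta_q2 = interpolate(theta_q, theta_f, 0.5)
```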
26
θ_F Estimation Method I: Generative Mixture Model

Feedback docs F = {d_1, …, d_n} are generated by mixing topic words from p(w|θ) (weight 1-λ) with background words from p(w|C) (weight λ):

log p(F | θ) = Σ_i Σ_w c(w; d_i) · log[ (1-λ)·p(w|θ) + λ·p(w|C) ]

Maximum likelihood: θ_F = argmax_θ log p(F | θ)
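A sketch of EM for θ_F with λ held fixed; the tiny feedback "documents" and background model are toy assumptions:

```python
from collections import Counter

def feedback_model(feedback_docs, p_bg, lam=0.9, n_iter=30):
    """EM for theta_F in the mixture (1-lam) p(w|theta) + lam p(w|C);
    background-like words are absorbed by p(w|C), so theta_F
    concentrates on topic words."""
    counts = Counter(w for d in feedback_docs for w in d)
    theta = {w: 1.0 / len(counts) for w in counts}   # uniform init
    for _ in range(n_iter):
        # E-step: posterior that an occurrence of w is a topic word
        z = {w: (1 - lam) * theta[w] /
                ((1 - lam) * theta[w] + lam * p_bg(w)) for w in theta}
        # M-step: renormalize the topic-attributed counts
        norm = sum(counts[w] * z[w] for w in theta)
        theta = {w: counts[w] * z[w] / norm for w in theta}
    return theta

docs = [["airport", "airport", "the", "the", "the"]]
p_bg = lambda w: 0.5 if w == "the" else 0.01
theta_f = feedback_model(docs, p_bg)
```

Even though "the" is more frequent in the feedback text, its high background probability lets the mixture explain it away, so θ_F favors "airport", the effect illustrated on slide 28.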
27
θ_F Estimation Method II: Empirical Divergence Minimization

Choose θ_F close to the feedback document models θ_{d_1}, …, θ_{d_n} but far from the background model θ_C:

D_emp(F, θ, C) = (1/n) Σ_{i=1}^{n} D(θ || θ_{d_i}) - λ·D(θ || θ_C)

θ_F = argmin_θ D_emp(F, θ, C)
28
Example of Feedback Query Model
TREC topic 412: "airport security" (mixture model approach, web database, top 10 docs)

λ = 0.9                        λ = 0.7
w               p(w|θ_F)       w           p(w|θ_F)
security        0.0558         the         0.0405
airport         0.0546         security    0.0377
beverage        0.0488         airport     0.0342
alcohol         0.0474         beverage    0.0305
bomb            0.0236         alcohol     0.0304
terrorist       0.0217         to          0.0268
author          0.0206         of          0.0241
license         0.0188         and         0.0214
bond            0.0186         author      0.0156
counter-terror  0.0173         bomb        0.0150
terror          0.0142         terrorist   0.0137
newsnet         0.0129         in          0.0135
attack          0.0124         license     0.0127
operation       0.0121         state       0.0127
headline        0.0121         by          0.0125
29
Model-based feedback vs. Simple LM
Collection  Metric  Simple LM   Mixture     Improv.  Div.Min.    Improv.
AP88-89     AvgPr   0.21        0.296       +41%     0.295       +40%
            InitPr  0.617       0.591       -4%      0.617       +0%
            Recall  3067/4805   3888/4805   +27%     3665/4805   +19%
TREC8       AvgPr   0.256       0.282       +10%     0.269       +5%
            InitPr  0.729       0.707       -3%      0.705       -3%
            Recall  2853/4728   3160/4728   +11%     3129/4728   +10%
WEB         AvgPr   0.281       0.306       +9%      0.312       +11%
            InitPr  0.742       0.732       -1%      0.728       -2%
            Recall  1755/2279   1758/2279   +0%      1798/2279   +2%
30
Where Are We?
Risk Minimization Framework
Two-stage Language Model
KL-divergence Retrieval Model
Aspect Retrieval Model
31
Aspect Retrieval
Query: What are the applications of robotics in the world today?
Find as many DIFFERENT applications as possible.
Example aspects:
A1: spot-welding robotics
A2: controlling inventory
A3: pipe-laying robots
A4: talking robot
A5: robots for loading & unloading memory tapes
A6: robot [telephone] operators
A7: robot cranes
…

Aspect judgments:
      A1 A2 A3 … Ak
d1    1  1  0  0 … 0  0
d2    0  1  1  1 … 0  0
d3    0  0  0  0 … 1  0
…
dk    1  0  1  0 … 0  1
32
Evaluation Measures
• Aspect Coverage (AC): measures per-doc coverage
– #distinct-aspects / #docs
– Equivalent to the "set cover" problem, NP-hard
• Aspect Uniqueness (AU): measures redundancy
– #distinct-aspects / #aspects
– Equivalent to the "volume cover" problem, NP-hard
• Example (accumulated counts), with d1 = 0001001, d2 = 0101100, d3 = 1000101:
#doc:      1    2    3    …
#asp:      2    5    8    …
#uniq-asp: 2    4    5
AC:  2/1 = 2.0   4/2 = 2.0   5/3 = 1.67
AU:  2/2 = 1.0   4/5 = 0.8   5/8 = 0.625
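The accumulated AC and AU of the example can be computed mechanically; this sketch reproduces the numbers above, representing each document by its set of judged aspect indices:

```python
def aspect_metrics(ranked_aspect_sets):
    """Accumulated Aspect Coverage (distinct aspects per doc) and
    Aspect Uniqueness (distinct aspects per judged aspect)."""
    seen, total, results = set(), 0, []
    for i, aspects in enumerate(ranked_aspect_sets, start=1):
        total += len(aspects)   # all aspect judgments so far
        seen |= aspects         # distinct aspects so far
        results.append((len(seen) / i, len(seen) / total))
    return results

# d1 = 0001001, d2 = 0101100, d3 = 1000101 as aspect-index sets:
metrics = aspect_metrics([{4, 7}, {2, 4, 5}, {1, 5, 7}])
```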
33
Loss Function L(θ_{k+1} | θ_1 … θ_k)

Documents d_1, …, d_k already shown (models θ_1, …, θ_k known); candidate d_{k+1} has model θ_{k+1}.

Maximal Marginal Relevance (MMR): combine Relevance Rel(θ_{k+1}) with Novelty/Redundancy Nov(θ_{k+1} | θ_1 … θ_k); the best d_{k+1} is novel & relevant.

Maximal Diverse Relevance (MDR): use aspect coverage distributions p(a|θ_i); the best d_{k+1} is complementary in coverage.
34
Maximal Marginal Relevance (MMR) Models
• Maximizing aspect coverage indirectly through redundancy elimination
• Elements
– Redundancy/Novelty measure
– Combination of novelty and relevance
• Proposed & studied six novelty measures
• Proposed & studied four combination strategies
35
Comparison of Novelty Measures (Aspect Coverage)
[Figure: Avg. Aspect Coverage vs. Aspect Recall for novelty measures Relevance, AvgKL, AvgMix, KLMin, KLAvg, MixMin, MixAvg]
36
Comparison of Novelty Measures (Aspect Uniqueness)
[Figure: Avg. Aspect Uniqueness vs. Aspect Recall for novelty measures Relevance, AvgKL, AvgMix, KLMin, KLAvg, MixMin, MixAvg]
37
A Mixture Model for Redundancy
A document's words are modeled as a mixture of P(w|Old), estimated from the reference document, and P(w|Background), estimated from the collection. The mixing weight serves as the redundancy measure and is estimated by maximum likelihood via Expectation-Maximization.
38
Cost-based Combination of Relevance and Novelty
Loss:
l(d_{k+1} | θ_Q, θ_1, …, θ_k, {θ_i}) = c_1·p(Rel|d_{k+1})·p(New|d_{k+1}) + c_2·p(Rel|d_{k+1})·(1 - p(New|d_{k+1})) + c_3·(1 - p(Rel|d_{k+1}))

Rank-equivalent score (relevance score × novelty score):
p(Rel|d_{k+1})·(c + p(New|d_{k+1})) ≈ p(q|d_{k+1})·(c + p(New|d_{k+1})), where c = c_3/c_2 - 1
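Taking c_1 (the cost of showing a relevant, novel document) to be 0, the loss is rank-equivalent to scoring by p(Rel|d)·(c + p(New|d)) with c = c_3/c_2 - 1; a sketch, where the cost values are hypothetical:

```python
def mmr_cost_score(p_rel, p_new, c2=1.0, c3=5.0):
    """Rank-equivalent score p(Rel|d) * (c + p(New|d)), c = c3/c2 - 1.
    c2 = cost of showing a relevant but redundant doc,
    c3 = cost of showing a non-relevant doc."""
    return p_rel * (c3 / c2 - 1.0 + p_new)
```

Raising c3 relative to c2 makes relevance dominate novelty, which matches the intuition that missing relevant material is the more expensive error.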
39
Maximal Diverse Relevance (MDR) Models
• Maximizing aspect coverage directly through aspect modeling
• Elements
– Aspect loss function
– Generative Aspect Model
• Proposed & studied KL-divergence aspect loss function
• Explored two aspect models (PLSI, LDA)
40
Aspect Generative Model of Document & Query
User U → θ_Q ~ p(θ_Q | U) → query q ~ p(q | θ_Q, U)
Source S → θ_D ~ p(θ_D | S) → document d ~ p(d | θ_D, S)

with aspect models θ = (θ_1, …, θ_k):

PLSI: p(d | θ_D, S) = Π_{i=1}^{n} Σ_{a=1}^{A} p(a | θ_D)·p(d_i | θ_a), where d = d_1 … d_n

LDA: p(d | θ_D, S) = ∫ Π_{i=1}^{n} Σ_{a=1}^{A} p(a | π)·p(d_i | θ_a) Dir(π | θ_D) dπ
41
Aspect Loss Function
l(d_{k+1} | θ_Q, θ_1, …, θ_k, {θ_i}) rank= D(θ_Q || θ_{1,…,k+1})

where the combined coverage interpolates the already-selected documents with the candidate:

p(a | θ_{1,…,k+1}) = λ·(1/k)·Σ_{i=1}^{k} p(a | θ_i) + (1-λ)·p(a | θ_{k+1})
42
Aspect Loss Function: Illustration
Desired coverage: p(a|Q)
"Already covered": p(a|θ_1) … p(a|θ_{k-1})
New candidate: p(a|θ_k), which may be non-relevant, redundant, or perfect

Combined coverage:
p(a | θ_{1,…,k}) = λ·(1/(k-1))·Σ_{i=1}^{k-1} p(a | θ_i) + (1-λ)·p(a | θ_k)
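A sketch of the MDR aspect loss: the KL-divergence between the desired coverage and a combined coverage interpolating the already-shown documents' average with the candidate. The interpolation weight and the toy distributions are assumptions:

```python
import math

def aspect_loss(p_q, prev, cand, lam=0.5):
    """D(p(a|Q) || combined coverage). prev is a list of aspect
    distributions for already-shown docs; cand is the candidate's."""
    def combined(a):
        if not prev:
            return cand[a]
        old = sum(p[a] for p in prev) / len(prev)
        return lam * old + (1 - lam) * cand[a]
    return sum(p * math.log(p / combined(a))
               for a, p in p_q.items() if p > 0)

p_q = {"a1": 0.5, "a2": 0.5}            # desired coverage
prev = [{"a1": 0.9, "a2": 0.1}]         # aspect a1 already covered
complementary = {"a1": 0.1, "a2": 0.9}
redundant = {"a1": 0.9, "a2": 0.1}
```

The candidate that complements the existing coverage achieves a lower (here zero) loss than one that repeats it, which is exactly the "complementary in coverage" behavior MDR is after.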
43
Preliminary Evaluation: MMR vs. MDR
                 Relevant Data        Mixed Data
Ranking Method   AC      AU           AC      AU      Prec.
MMR              +2.6%   +13.8%       +1.5%   +2.2%   +3.4%
MDR              +9.8%   +4.5%        +1.5%   0.0%    -13.8%
• On the relevant data set, both MMR and MDR
are effective, but they complement each other
- MMR improves AU more than AC
- MDR improves AC more than AU
• On the mixed data set, however,
- MMR is only effective when relevance ranking is accurate
- MDR improves AC, even though relevance ranking is degraded.
44
Further Work is Needed
• Controlled experiments with synthetic data
– Level of redundancy
– Density of relevant documents
– Per-document aspect counts
• Alternative loss functions
• Aspect language models, especially along the line of LDA
– Aspect-based feedback
45
Summary of Contributions
New TR Framework: Risk Minimization Framework
• Unifies existing models
• Incorporates LMs
• Serves as a map for exploring new models

New TR Models and Specific Contributions:

Two-stage Language Model
• Empirical study of smoothing (dual role of smoothing)
• New smoothing method (two-stage smoothing)
• Automatic parameter setting (leave-one-out, mixture)

KL-divergence Retrieval Model
• Query/document distillation
• Feedback with LMs (mixture model & div. min.)

Aspect Retrieval Model
• Evaluation criteria (AC, AU)
• Redundancy/novelty measures (mixture weight)
• MMR with LMs (cost-comb.)
• Aspect-based loss function ("collective KL-div")
46
Future Research Directions
• Better approximation of the risk integral
• More effective LMs for “traditional” retrieval
– Can we beat TF-IDF without increasing computational complexity?
– Automatic parameter setting, especially for feedback models
– Flexible passage retrieval, especially with HMM
– Beyond unigrams (more linguistics)
47
More Future Research Directions
• Aspect Retrieval Models
– Document structure/sub-topic modeling
– Aspect-based feedback
• Interactive information retrieval models
– Risk minimization for information filtering
– Personalized & context-sensitive retrieval
48
Thank you!