Evaluation of Relevance Feedback
Algorithms for XML Retrieval
Silvana Solomon
27 February 2007
Supervisor:
Dr. Ralf Schenkel
Silvana Solomon Evaluation of RF Algorithms for XML Retrieval
27 Feb 2007
Outline
Short introduction
Motivation & Goals
Evaluating retrieval effectiveness
INEX tool
Evaluation methodology
Results
Introduction
Path to the result / Content of the result:

[Figure: XML document tree. An article element contains frontmatter, body, and backmatter. The frontmatter holds the author („Ian Ruthven“), the body holds sec and subsec elements with paragraphs („The IR process is composed…“, „For small collections…“, „Figure 1 outlines…“), and the backmatter holds a citation („D. Harman“). The path from the article root down to a matching element is the path to the result; the element's text is the content of the result.]

[Figure: Relevance feedback loop between the feedback component and the XML search engine: (1) query → (2) results → (3) feedback → (4) expanded query → (5) results of the expanded query.]
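The loop in the figure can be sketched in a few lines. The search engine and the query-expansion step below are toy stand-ins (hypothetical, not the system evaluated in the talk); only the five-step protocol matches the figure:

```python
# Minimal sketch of the relevance feedback loop from the figure.
# The engine and the expansion step are toy stand-ins, not the
# actual XML search engine used in the experiments.

def search(query, collection):
    """(1)-(2) Return element paths whose text contains any query term."""
    return [path for path, text in collection.items()
            if any(term in text for term in query)]

def expand_query(query, relevant_texts):
    """(3)-(4) Add terms from user-marked relevant results to the query."""
    expanded = set(query)
    for text in relevant_texts:
        expanded.update(text.split())
    return expanded

collection = {
    "doc[1]/bdy[1]/sec[1]": "xml retrieval process",
    "doc[2]/bdy[1]/sec[2]": "relevance feedback for retrieval",
    "doc[3]/bdy[1]": "wikipedia collection statistics",
}

query = {"xml"}
results = search(query, collection)           # (2) results
feedback = [collection[results[0]]]           # (3) user marks a result relevant
expanded = expand_query(query, feedback)      # (4) expanded query
new_results = search(expanded, collection)    # (5) results of expanded query
print(sorted(new_results))
```

The expanded query also matches doc[2], which the original one-term query missed; that gain on unseen elements is exactly what the evaluation methods below try to isolate.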
Motivation
Best way to compare feedback algorithms?
Cannot use standard evaluation tools on feedback results
Goals:
Analyze evaluation methods
Develop an evaluation tool
Evaluating Retrieval Effectiveness
Document collection
Topic set
Assessments set
Human assessors
Metrics
INEX: INitiative for the Evaluation of XML Retrieval
2006 document collection: 600,000 Wikipedia documents
INEX Tool: EvalJ
Tool for evaluation of information retrieval experiments
Implements a set of metrics used for evaluation
Limitation: cannot measure the improvement of runs produced with feedback
RF Evaluation – Ranking Effect
Baseline run:
doc[1]/bdy[1]
doc[3]
doc[2]/bdy[1]
doc[8]/bdy[1]/article[3]
doc[4]/bdy[1]/article[1]/sec[6]

Mark relevant results in the top of the baseline run: doc[3], doc[8]/bdy[1]/article[3]

Feedback run:
doc[3]
doc[8]/bdy[1]/article[3]
doc[1]
doc[7]/article[3]
doc[2]/bdy[1]/article[1]

Ranking effect: pushing the known relevant results to the top of the element ranking artificially improves recall/precision figures.
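The inflation is easy to make concrete with precision@k: moving the two known relevant elements to ranks 1 and 2 raises early precision even though nothing new was retrieved. A small sketch on toy ranks shaped like the slide's example:

```python
# Sketch: reranking the known relevant results to the top inflates
# precision@k without retrieving any new relevant element (toy data).

def precision_at_k(run, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for r in run[:k] if r in relevant) / k

relevant = {"doc[3]", "doc[8]/bdy[1]/article[3]"}  # marked in the top results

baseline = ["doc[1]/bdy[1]", "doc[3]", "doc[2]/bdy[1]",
            "doc[8]/bdy[1]/article[3]", "doc[4]/bdy[1]/article[1]/sec[6]"]

# Feedback run: the same elements, known relevant ones pushed to the top.
feedback = ["doc[3]", "doc[8]/bdy[1]/article[3]", "doc[1]/bdy[1]",
            "doc[2]/bdy[1]", "doc[4]/bdy[1]/article[1]/sec[6]"]

print(precision_at_k(baseline, relevant, 2))  # 0.5
print(precision_at_k(feedback, relevant, 2))  # 1.0
```

Precision@5 is identical for both runs: the "improvement" exists only at the top of the ranking.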
RF Evaluation – Feedback Effect
Goal: measure the improvement on unseen relevant elements; this feedback effect is not directly tested by standard evaluation.
Approach: modify the feedback run and evaluate the untrained results only.

Baseline run:
doc[1]/bdy[1]
doc[3]
doc[2]/bdy[1]
doc[8]/bdy[1]/article[3]
doc[4]/bdy[1]/article[1]/sec[6]

Mark relevant results in the top of the baseline run: doc[3], doc[8]/bdy[1]/article[3]

Feedback run: trained on the marked results doc[3] and doc[8]/bdy[1]/article[3].
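The "evaluate untrained results" step amounts to dropping everything the feedback algorithm was trained on (the marked top results) from the feedback run before scoring it. A minimal sketch with the slide's toy paths:

```python
# Sketch of the feedback-effect setup: results the user already marked
# (the training data of the feedback algorithm) are removed from the
# feedback run, so only improvement on unseen elements is measured.

def untrained(feedback_run, marked):
    """Feedback run restricted to results the user never saw."""
    return [r for r in feedback_run if r not in marked]

marked = {"doc[3]", "doc[8]/bdy[1]/article[3]"}   # marked relevant in top results
feedback_run = ["doc[3]", "doc[8]/bdy[1]/article[3]",
                "doc[1]/bdy[1]", "doc[9]"]

print(untrained(feedback_run, marked))  # ['doc[1]/bdy[1]', 'doc[9]']
```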
Evaluation Methodology (1)
1. Standard text IR: freezing the known results at the top (independent-results assumption)
2. New approach: remove the known results + X from the collection
resColl-result: remove the results only (~document retrieval)
resColl-desc: remove results + descendants
resColl-anc: remove results + ancestors
resColl-path: remove results + descendants + ancestors
resColl-doc: remove the whole document containing known results
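The five removal modes reduce to simple path-prefix tests: a descendant's path starts with the result's path, an ancestor's path is a prefix of it, and resColl-doc drops everything in the result's document. A sketch with toy element paths (the real runs use full XPath-like paths):

```python
# Sketch of the resColl removal modes via path-prefix tests (toy paths).

def doc_of(path):
    """Document containing an element path like doc[2]/bdy[1]/sec[1]."""
    return path.split("/")[0]

def remove(collection, known, mode):
    """Return the collection with known results removed per the given mode."""
    removed = set()
    for elem in collection:
        for res in known:
            is_result = elem == res
            is_desc = elem.startswith(res + "/")   # elem below the result
            is_anc = res.startswith(elem + "/")    # elem above the result
            same_doc = doc_of(elem) == doc_of(res)
            if (mode == "result" and is_result
                    or mode == "desc" and (is_result or is_desc)
                    or mode == "anc" and (is_result or is_anc)
                    or mode == "path" and (is_result or is_desc or is_anc)
                    or mode == "doc" and same_doc):
                removed.add(elem)
    return [e for e in collection if e not in removed]

collection = ["doc[2]", "doc[2]/bdy[1]", "doc[2]/bdy[1]/sec[1]",
              "doc[3]", "doc[5]/bdy[1]"]
known = ["doc[2]/bdy[1]"]

print(remove(collection, known, "result"))  # only the result itself removed
print(remove(collection, known, "path"))    # result + descendants + ancestors
```

With a single known result inside doc[2], resColl-path and resColl-doc happen to coincide here; they differ as soon as a document contains relevant and non-relevant branches side by side.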
Evaluation Methodology (2)
Freezing: block the top-k results of the baseline run, then append the results of the feedback run.

Baseline run:
doc[7]/bdy[1]
doc[3]
doc[2]/bdy[1]
doc[8]/bdy[1]/article[3]
doc[4]/bdy[1]/article[1]/sec[6]

Feedback run:
doc[2]/bdy[1]/article[1]
doc[9]
doc[4]/bdy[1]/article[2]
doc[2]/bdy[1]
doc[4]/bdy[1]/article[4]

Frozen run (top-3 of the baseline blocked, feedback results appended):
doc[7]/bdy[1]
doc[3]
doc[2]/bdy[1]
doc[2]/bdy[1]/article[1]
doc[9]
doc[4]/bdy[1]/article[2]
doc[2]/bdy[1]
doc[4]/bdy[1]/article[4]
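The freezing construction above is a one-line list operation; this sketch reproduces the slide's example ranks:

```python
# Sketch of the freezing method: the top-k known results of the baseline
# run stay fixed, and the feedback run is appended after them
# (toy ranks taken from the slide's example).

def freeze(baseline, feedback, k):
    """Block the baseline's top-k, then append the feedback run."""
    return baseline[:k] + feedback

baseline = ["doc[7]/bdy[1]", "doc[3]", "doc[2]/bdy[1]",
            "doc[8]/bdy[1]/article[3]", "doc[4]/bdy[1]/article[1]/sec[6]"]
feedback = ["doc[2]/bdy[1]/article[1]", "doc[9]", "doc[4]/bdy[1]/article[2]",
            "doc[2]/bdy[1]", "doc[4]/bdy[1]/article[4]"]

frozen = freeze(baseline, feedback, 3)
print(frozen)
```

Note that doc[2]/bdy[1] now appears twice, once frozen and once from the feedback run; handling such duplicates is one reason the resColl modes were introduced.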
Evaluation Methodology (3)
resColl-path: the known results are removed from the collection together with their descendants and ancestors, and the feedback run is evaluated on what remains.

Baseline run:
doc[7]/bdy[1]
doc[3]
doc[2]/bdy[1]
doc[8]/bdy[1]/article[3]
doc[4]/bdy[1]/article[1]/sec[6]

Feedback run:
doc[2]/bdy[1]/article[1]
doc[9]
doc[4]/bdy[1]/article[2]
doc[2]/bdy[1]
doc[4]/bdy[1]/article[4]

Feedback run after resColl-path (the known result doc[2]/bdy[1] is removed):
doc[2]/bdy[1]/article[1]
doc[9]
doc[4]/bdy[1]/article[2]
doc[4]/bdy[1]/article[4]
Best Evaluation Methodology?
[Figure: the XML document tree from the introduction (article with frontmatter, body, and backmatter; sec, subsec, and p elements; author „Ian Ruthven“; citation „D. Harman“), illustrating which elements resColl-path removes.]

Answer: resColl-path
Testing Evaluated Results
Standard method: averaging over topics. Problem: a single outlier topic can dominate the average:

Topic-id            205   280   307   325   341   400   Avg.
Baseline            0.2   0.3   0.1   0.1   0.2   0.3   0.2
Modified feedback   0.2   0.2   0.1   0.9   0.2   0.2   0.3

t-test & Wilcoxon signed-rank test: give the probability p that the baseline run is better than the feedback run.
An experiment is significant if p < 0.05 (or p < 0.01).
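The table's numbers make the point directly: topic 325 alone lifts the feedback average from 0.2 to 0.3, while the feedback run is equal or worse on every other topic. This sketch computes only the paired t statistic with the standard library; the p-value then comes from the t distribution with n−1 degrees of freedom (scipy's stats.ttest_rel and stats.wilcoxon perform the full tests):

```python
# Paired t statistic on the per-topic scores from the slide's table.
# A significance test checks whether the average gain is systematic
# or driven by a single outlier topic.
import math

baseline = [0.2, 0.3, 0.1, 0.1, 0.2, 0.3]
feedback = [0.2, 0.2, 0.1, 0.9, 0.2, 0.2]

diffs = [f - b for f, b in zip(feedback, baseline)]
n = len(diffs)
mean = sum(diffs) / n                                  # 0.1: the average gain
var = sum((d - mean) ** 2 for d in diffs) / (n - 1)    # sample variance
t_stat = mean / math.sqrt(var / n)                     # paired t statistic

print(round(t_stat, 3))  # 0.707: nowhere near significance at p < 0.05
```

Despite the higher average, the difference is not significant: the large variance contributed by topic 325 keeps the t statistic small.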
Results (1)
Evaluation mode: resColl-path
Feedback file          INEX metric   Abs. improv.   Rel. improv.   t-test   WSR
TopX_CO_Content.xml    0.0185         0.0112         1.5467        0.0001   0.0001
xfirm_r1_cosc3s.xml    0.0028         0.0015         1.0975        0.0003   0.0023
xfirm_r1_cosc5.xml     0.0026         0.0012         0.9222        0.0028   0.0422
xfirm_r1_cosc3.xml     0.0025         0.0012         0.8854        0.0032   0.0441
xfirm_r1_coc3s3.xml    0.0031        -0.0017        -0.3564        0.9301   0.9995
xfirm2_r2_cop4.xml     0.0032        -0.0018        -0.3594        0.8532   0.9732
xfirm2_r2_cot40.xml    0.0025        -0.0024        -0.4863        0.9239   0.9987
xfirm2_r2_cot10.xml    0.0023        -0.0026        -0.5334        0.9429   0.9999
xfirm_r1_coc3.xml      0.0014        -0.0034        -0.7186        0.9993   0.9999
xfirm_r1_coc10.xml     0.0013        -0.0035        -0.7281        0.9989   0.9999
Results (2)
Comparison of evaluation techniques based on relative improvement w.r.t. the baseline run (ranking of feedback files per mode):

Rank   freezing   resColl-anc   resColl-desc   resColl-doc   resColl-path   resColl-result
1      c3s        c3s           TopX           TopX          TopX           c3s
2      TopX       c5            c3s            c3s           c3s            c5
3      c5         TopX          c5             c5            c5             TopX
4      c3         c3            c3             c3            c3             c3

TopX = TopX_CO_Content.xml
c3   = xfirm_r1_cosc3.xml
c3s  = xfirm_r1_cosc3s.xml
c5   = xfirm_r1_cosc5.xml
Conclusions & Future Work
Evaluation based on different techniques & metrics
Correct measurement of the improvement achieved by feedback
Not solved: comparing several systems with different output
Future work: possibly a hybrid evaluation mode