
One Click One Revisited: Enhancing Evaluation Based on Information Units

Tetsuya Sakai 1 and Makoto P. Kato 2

1 Microsoft Research Asia, P.R. China


2 Kyoto University, Japan


AIRS 2012

Introduction

1. NTCIR-9 1CLICK-1: S-measure, the official evaluation metric of the One Click Access Task, discounts the value of each information unit (iUnit) based on its position within the textual output.

2. NTCIR-10 1CLICK-2: We complement the recall-like S-measure with a simple, precision-like metric called T-measure, as well as a combination of S-measure and T-measure, called S♯.

2

X-strings

The output of 1CLICK systems

3

Task and Data

Participating systems were expected to return important iUnits first, and to minimize the amount of text the user has to read.

The length of the vital string is used for defining an “optimal” output and for computing the evaluation metrics.

Every X-string was evaluated by two assessors; we evaluate runs based on the Intersection data (I) and the Union data (U) of the iUnit matches.
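As a minimal sketch of how the Intersection and Union data could be derived, assume each assessor's judgment for one X-string is simply a set of matched iUnit identifiers. The function name and the identifiers are hypothetical, and the handling of match offsets is deliberately left out.

```python
def combine_matches(assessor_a, assessor_b):
    """Return (intersection, union) of two assessors' matched iUnit IDs."""
    a, b = set(assessor_a), set(assessor_b)
    return a & b, a | b


# Hypothetical judgments for a single X-string.
matches_a = {"u1", "u2", "u4"}
matches_b = {"u2", "u3", "u4"}

intersection, union = combine_matches(matches_a, matches_b)
print(sorted(intersection))  # iUnits both assessors matched
print(sorted(union))         # iUnits at least one assessor matched
```

The Intersection data is the stricter view (both assessors agree), while the Union data is the more lenient one.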

4

S-measure and S♭

Let $I$ be the set of gold-standard iUnits constructed for a particular query.

Let $v(i)$ be the vital string and $w(i)$ be the weight for iUnit $i \in I$.

The Pseudo Minimal Output (PMO) is obtained by sorting all vital strings by $w(i)$ (first key) and $|v(i)|$ (second key).

Let $\mathit{offset}^*(v(i))$ denote the offset position of $v(i)$ within the PMO.

Let $M \ (\subseteq I)$ denote the set of matched iUnits obtained by manually comparing the X-string with the gold-standard iUnits.

Let $\mathit{offset}(i)$ denote the offset position of $i \in M$ within the X-string.
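The PMO construction can be sketched as follows. Two details are assumptions rather than statements from the slides: heavier iUnits are placed first with shorter vital strings breaking ties, and an offset is taken to be the end position of the vital string within the PMO. The `IUnit` class and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IUnit:
    uid: str
    vital: str     # vital string v(i)
    weight: float  # weight w(i)


def build_pmo(iunits):
    """Return (PMO string, {uid: offset of v(i) within the PMO})."""
    # Assumption: weight descending (first key), vital string length ascending (second key).
    ordered = sorted(iunits, key=lambda u: (-u.weight, len(u.vital)))
    offsets = {}
    position = 0
    for unit in ordered:
        position += len(unit.vital)
        offsets[unit.uid] = position  # assumption: offset = end position of v(i)
    return "".join(u.vital for u in ordered), offsets


gold = [
    IUnit("c", "Nagano", 1),
    IUnit("b", "open 9-17", 3),
    IUnit("a", "tel 0123", 3),
]
pmo, pmo_offsets = build_pmo(gold)
print(pmo)
print(pmo_offsets)
```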

5

S-measure and S♭ (Cont’d)

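S-measure is defined as:

$$ S\text{-measure} = \frac{\sum_{i \in M} w(i)\,\max(0,\ L - \mathit{offset}(i))}{\sum_{i \in I} w(i)\,\max(0,\ L - \mathit{offset}^*(v(i)))} $$

where the numerator sums over the matched iUnits $M$ (offsets within the X-string) and the denominator sums over all gold-standard iUnits $I$ (offsets within the PMO).

$L$ is a parameter that represents how the user’s patience runs out: when $L$ is set to a very large value, S-measure reduces to weighted recall (W-recall), which is position-insensitive.

There is no theoretical guarantee that S-measure lies below one: S♭, given by $S^\flat = \min(1,\ S\text{-measure})$, may be used instead.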
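A small worked sketch may help here. It assumes the usual S-measure form, a weighted sum of max(0, L - offset) over matched iUnits, normalised by the same sum over the gold-standard iUnits at their PMO offsets. The function names and the numbers are hypothetical.

```python
def s_measure(matched, gold, L=500):
    """matched: [(weight, offset in X-string)]; gold: [(weight, offset in PMO)]."""
    numerator = sum(w * max(0, L - off) for w, off in matched)
    denominator = sum(w * max(0, L - off) for w, off in gold)
    return numerator / denominator if denominator > 0 else 0.0


def s_flat(matched, gold, L=500):
    """S-measure capped at one."""
    return min(1.0, s_measure(matched, gold, L))


gold_units = [(3, 8), (3, 17), (1, 23)]   # (weight, PMO offset)
late_matches = [(3, 100), (1, 40)]        # two iUnits found, fairly late in the X-string
early_matches = [(3, 5), (3, 10), (1, 12)]  # all iUnits found earlier than in the PMO

s_late = s_measure(late_matches, gold_units)                 # 1660 / 3402
s_huge_L = s_measure(late_matches, gold_units, L=10**9)      # approaches W-recall = 4/7
s_early = s_measure(early_matches, gold_units)               # exceeds one
print(s_late, s_huge_L, s_early, s_flat(early_matches, gold_units))
```

The last case shows why the cap is useful: an X-string that conveys iUnits more compactly than the vital strings can outscore the PMO.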

6

Effect of the Patience Parameter

The official 1CLICK-1 evaluation used $L = 500$ (one minute).

We vary this parameter as follows and examine the outcome: $L = 1000$ (two minutes), $L = 250$ (30 seconds) and $L = 50$ (6 seconds).

Note that if $L$ is set to an extremely small value, most of the contents of the X-strings will be ignored.
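To make the effect concrete, the sketch below assumes a patience parameter L measured in characters, a reading speed of 500 characters per minute (so that 500 characters correspond to one minute), and a position discount of max(0, L - offset). All names and the example offset of 300 are hypothetical.

```python
READING_SPEED = 500  # characters per minute; assumption: 500 characters = one minute


def seconds_for(L):
    """Reading time, in seconds, that a patience parameter L corresponds to."""
    return L * 60 / READING_SPEED


def discount(L, offset):
    """Unnormalised value of an iUnit found at the given offset."""
    return max(0, L - offset)


for L in (1000, 500, 250, 50):
    print(L, seconds_for(L), discount(L, offset=300))
```

With the two smallest settings, an iUnit at offset 300 contributes nothing at all, which is the sense in which most of an X-string is ignored.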

7

Effect of $L$ on the System Ranking

The x-axis shows runs sorted by Mean S-measure.

8

X-strings of Runs from KUIDL and MSRA1click

The LOCAL query “Menard Aoyama Resort” (name of a facility)

9

Results on the Patience Parameter

$L = 1000$ (two minutes) produces rankings that are very similar to those of $L = 500$ (one minute), but $L = 250$ (30 seconds) results in substantially different system rankings.

Given a test collection with a set of runs, discriminative power is measured by conducting a statistical significance test for every pair of runs.
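The procedure can be sketched as below. The significance test here is a simple paired sign-flip randomization test chosen for brevity; it is a stand-in and not necessarily the test used in the original study. Run names, scores, the number of trials and the significance level are all hypothetical.

```python
import itertools
import random


def paired_randomization_p(x, y, trials=2000, seed=0):
    """Two-sided p-value for the mean per-query difference (sign-flip test)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed - 1e-12:
            hits += 1
    return hits / trials


def discriminative_power(scores, alpha=0.05):
    """scores: {run: per-query scores}. Returns (fraction of significant pairs, p-values)."""
    pvals = {
        pair: paired_randomization_p(scores[pair[0]], scores[pair[1]])
        for pair in itertools.combinations(sorted(scores), 2)
    }
    significant = sum(p < alpha for p in pvals.values())
    return significant / len(pvals), pvals


scores = {
    "runA": [0.8, 0.7, 0.9, 0.75, 0.85, 0.8, 0.7, 0.9, 0.8, 0.75],
    "runB": [0.8, 0.7, 0.9, 0.75, 0.85, 0.8, 0.7, 0.9, 0.8, 0.75],
    "runC": [0.2, 0.3, 0.1, 0.25, 0.2, 0.3, 0.15, 0.2, 0.25, 0.1],
}
power, pvals = discriminative_power(scores)
asl_curve = sorted(pvals.values())  # p-values sorted, one point per run pair
print(power, asl_curve)
```

A metric with higher discriminative power separates more run pairs at the chosen significance level.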

10

Effect of $L$ on Discriminative Power

The Achieved Significance Level (ASL) curves of S-measure with varying $L$.

The y-axis represents the p-value and the x-axis represents run pairs sorted by the p-value. Metrics that are closer to the origin are the ones that are highly discriminative.

11

T-measure, T♭ and S♯ (1/3)

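We introduce a precision-like “terseness” metric for evaluating an X-string $X$ of size $|X|$:

$$ T\text{-measure} = \frac{\sum_{i \in M} |v(i)|}{|X|} $$

As T-measure might exceed one, we also define T♭, given by $T^\flat = \min(1,\ T\text{-measure})$, although in reality T-measure never exceeded one for our data and therefore $T\text{-measure} = T^\flat$ holds.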
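A minimal sketch of the terseness computation, assuming T-measure is the total vital string length of the matched iUnits divided by the X-string length. The function names and sample strings are hypothetical.

```python
def t_measure(matched_vital_strings, x_string):
    """Sum of matched vital string lengths over the X-string length."""
    if not x_string:
        return 0.0
    return sum(len(v) for v in matched_vital_strings) / len(x_string)


def t_flat(matched_vital_strings, x_string):
    """T-measure capped at one."""
    return min(1.0, t_measure(matched_vital_strings, x_string))


matched_vitals = ["tel 0123", "open 9-17"]   # 8 + 9 = 17 characters
long_output = "x" * 40                        # verbose X-string
short_output = "x" * 10                       # conveys the same iUnits very tersely

t_long = t_measure(matched_vitals, long_output)
t_short = t_measure(matched_vitals, short_output)
print(t_long, t_short, t_flat(matched_vitals, short_output))
```

The second case illustrates how the raw value could exceed one when an X-string expresses iUnits in fewer characters than their vital strings.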

12

T-measure, T♭ and S♯ (2/3)

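Finally, following the approach of the well-known F-measure, we can define S♯ as:

$$ S^\sharp = \frac{(1 + \beta^2)\, T^\flat S^\flat}{\beta^2\, T^\flat + S^\flat} $$

where letting $\beta = 1$ reduces S♯ to a harmonic mean of S♭ and T♭. We also examined other values of $\beta$.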
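The combination can be sketched as follows, assuming the F-measure form with the capped recall-like score (S♭) in the role of recall and the capped precision-like score (T♭) in the role of precision. The function name and the input values are hypothetical.

```python
def s_sharp(s_flat_value, t_flat_value, beta=1.0):
    """F-measure-style combination of the capped recall-like and precision-like scores."""
    denominator = beta ** 2 * t_flat_value + s_flat_value
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * t_flat_value * s_flat_value / denominator


balanced = s_sharp(0.5, 0.25, beta=1)        # harmonic mean of 0.5 and 0.25
recall_heavy = s_sharp(0.5, 0.25, beta=100)  # close to the recall-like score
precision_only = s_sharp(0.5, 0.25, beta=0)  # equals the precision-like score
print(balanced, recall_heavy, precision_only)
```

Larger values of beta pull the combined score toward the recall-like component, so the length penalty from the precision-like component becomes milder.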

13

T-measure, T♭ and S♯ (3/3)

S♯ differs from the traditional nugget-based F-measure in the following two aspects:

1. It utilizes the positions of iUnits for computing the recall-like S-measure.

2. Instead of relying on a fixed allowance parameter, it utilizes the vital string length of each iUnit for computing the precision-like T-measure.

14

System Ranking by Different Metrics

The x-axis shows runs sorted by Mean S-measure.

15

X-strings of Runs from KUIDL and MSRA1click

The QA query “The three duties of a Japanese citizen.”

16

Effect of $\beta$ on S♯

The x-axis represents $\beta$ and the y-axis represents Kendall’s $\tau$ with the Mean S-measure ranking.

Note that $\beta$ means “S-measure is $\beta$ times as important as T-measure”.

We recommend S♯ for evaluating 1CLICK systems, along with the original S-measure.
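For readers who want to reproduce this kind of rank comparison, here is a small sketch of Kendall's tau between two system rankings. It uses the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; whether the original analysis used this variant is an assumption. Run names and mean scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between the rankings induced by two {run: mean score} dicts."""
    runs = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for r1, r2 in combinations(runs, 2):
        sign = (scores_a[r1] - scores_a[r2]) * (scores_b[r1] - scores_b[r2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


mean_metric_a = {"r1": 0.9, "r2": 0.7, "r3": 0.5, "r4": 0.3}
mean_metric_b = {"r1": 0.8, "r2": 0.6, "r3": 0.2, "r4": 0.4}  # r3 and r4 swapped

tau = kendall_tau(mean_metric_a, mean_metric_b)  # (5 - 1) / 6
print(tau)
```

A value of one means the two metrics rank the runs identically; each swapped pair lowers the value.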

17

Results on T-measure and S♯

Discriminative power of S-measure, T-measure and S♯

18

Conclusions

The One Click Access Task (1CLICK) is one of the tasks of NTCIR that requires systems to return a concise multi-document summary of web pages in response to a query which is assumed to have been submitted in a mobile context.

Furthermore, we introduce the next round of the 1CLICK task called MobileClick, in which participants are required to submit a two-layered summarization suitable for mobile information access.

19

Thanks

Y. Hou et al. (Eds.): AIRS 2012, LNCS 7675, pp. 39–51, 2012.

© Springer-Verlag Berlin Heidelberg 2012
