One Click One Revisited: Enhancing Evaluation Based on Information Units
Tetsuya Sakai (Microsoft Research Asia, P.R. China, tetsuyasakai@acm.org) and Makoto P. Kato (Kyoto University, Japan)
AIRS 2012


Introduction

1. NTCIR-9 1CLICK-1: S-measure, the official evaluation metric of the One Click Access Task, discounts the value of each information unit based on its position within the textual output.

2. NTCIR-10 1CLICK-2: We complement the recall-like S-measure with a simple, precision-like metric called T-measure, as well as a combination of S-measure and T-measure called S♯.


X-strings

The output of 1CLICK systems


Task and Data

Participating systems were expected to return important iUnits first, and to minimize the amount of text the user has to read.

The length of the vital string is used for defining an “optimal” output and for computing .

Every X-string was evaluated by two assessors: we evaluate runs based on the Intersection data (I) and the Union data (U) of the iUnit matches.


S-measure and S♭

Let $I$ be the set of gold-standard iUnits constructed for a particular query.

Let $v(i)$ be the vital string and $w(i)$ be the weight for iUnit $i \in I$.

The Pseudo Minimal Output (PMO) is obtained by sorting all vital strings by $w(i)$ (first key) and $|v(i)|$ (second key).

Let $\mathit{offset}^*(v(i))$ denote the offset position of $v(i)$ within the PMO.

Let $M$ ($\subseteq I$) denote the set of matched iUnits obtained by manually comparing the X-string with the gold-standard iUnits.

Let $\mathit{offset}(i)$ denote the offset position of $i \in M$ within the X-string.
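As a rough illustration of the PMO construction, the Python sketch below concatenates vital strings in sorted order and records an offset for each one. The sort directions (higher weight first, then shorter vital string first), the use of the end position of each vital string as its offset, and the names `IUnit` and `pmo_offsets` are assumptions made for this example; the slide does not state them.

```python
from typing import Dict, NamedTuple


class IUnit(NamedTuple):
    """Hypothetical container for one gold-standard iUnit."""
    vital: str     # vital string v(i)
    weight: float  # weight w(i)


def pmo_offsets(iunits: Dict[str, IUnit]) -> Dict[str, int]:
    """Return an assumed PMO offset for every iUnit id.

    Assumption: vital strings are ordered by weight (descending), then by
    length (ascending), and the offset is the number of characters read
    once the vital string has been read in full.
    """
    order = sorted(
        iunits,
        key=lambda k: (-iunits[k].weight, len(iunits[k].vital)),
    )
    offsets: Dict[str, int] = {}
    position = 0
    for key in order:
        position += len(iunits[key].vital)
        offsets[key] = position
    return offsets


if __name__ == "__main__":
    gold = {
        "a": IUnit("xxxx", 1.0),
        "b": IUnit("yy", 3.0),
        "c": IUnit("zzz", 3.0),
    }
    print(pmo_offsets(gold))
```

In this toy case the two heavily weighted iUnits come first, and the shorter of the two precedes the longer one.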


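S-measure and S♭ (Cont’d)

S-measure is defined as:

$$\text{S-measure} = \frac{\sum_{i \in M} w(i)\,\max(0,\ L - \mathit{offset}(i))}{\sum_{i \in I} w(i)\,\max(0,\ L - \mathit{offset}^*(v(i)))}$$

Let $L$ be a parameter that represents how the user’s patience runs out: when $L$ is set to a very large value, S-measure reduces to weighted recall (W-recall), which is position-insensitive.

There is no theoretical guarantee that S-measure lies below one: S♭, given by $S\flat = \min(1, \text{S-measure})$, may be used instead.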
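The computation can be sketched in Python as follows. The sketch assumes that S-measure is the weighted gain of the matched iUnits, each discounted by max(0, L − offset), divided by the same quantity computed over the PMO. The function names, the dictionary inputs and the toy numbers are hypothetical.

```python
from typing import Dict


def s_measure(
    weights: Dict[str, float],        # w(i) for every gold-standard iUnit
    pmo_offset: Dict[str, int],       # offset of v(i) within the PMO
    matched_offset: Dict[str, int],   # offset of each matched iUnit in the X-string
    patience: int,                    # the patience parameter L
) -> float:
    """Position-discounted weighted recall, normalised by the PMO."""
    norm = sum(w * max(0, patience - pmo_offset[i]) for i, w in weights.items())
    if norm == 0:
        return 0.0
    gain = sum(
        weights[i] * max(0, patience - offset)
        for i, offset in matched_offset.items()
    )
    return gain / norm


def s_flat(
    weights: Dict[str, float],
    pmo_offset: Dict[str, int],
    matched_offset: Dict[str, int],
    patience: int,
) -> float:
    """S-measure capped at one."""
    return min(1.0, s_measure(weights, pmo_offset, matched_offset, patience))


if __name__ == "__main__":
    w = {"a": 1.0, "b": 3.0}
    pmo = {"b": 2, "a": 6}
    # Only iUnit "b" is matched, and it ends at character 5 of the X-string.
    print(s_measure(w, pmo, {"b": 5}, patience=10))
    # A match that ends earlier than in the PMO can push the raw score above one.
    print(s_measure(w, pmo, {"b": 1, "a": 2}, patience=10))
    print(s_flat(w, pmo, {"b": 1, "a": 2}, patience=10))
```

The second call shows why the capped variant exists: because the PMO is only a pseudo-minimal output, an X-string can present matches at smaller offsets than the PMO does.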


Effect of the Patience Parameter

The official 1CLICK-1 evaluation used $L = 500$ (one minute).

We vary this parameter as follows and examine the outcome: $L = 1000$ (two minutes), $L = 250$ (30 seconds) and $L = 50$ (6 seconds).

Note that if $L$ is set to an extremely small value, most of the contents of the X-strings will be ignored.


Effect of $L$ on the System Ranking

The x-axis shows runs sorted by Mean S-measure.


X-strings of Runs from KUIDL and MSRA1click

The LOCAL query “Menard Aoyama Resort” (name of a facility)


Results on the Patience Parameter

$L = 1000$ (two minutes) produces rankings that are very similar to $L = 500$ (one minute), but $L = 250$ (30 seconds) results in substantially different system rankings.

Given a test collection with a set of runs, discriminative power is measured by conducting a statistical significance test for every pair of runs.
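A minimal sketch of this procedure is given below. The slide does not say which significance test is used, so the sketch swaps in an exact paired sign-flip (randomization) test, which is practical only for a handful of topics. The function names, the significance level of 0.05 and the toy scores are assumptions for illustration.

```python
from itertools import combinations, product
from typing import Dict, List


def sign_flip_p_value(scores_a: List[float], scores_b: List[float]) -> float:
    """Exact two-sided paired sign-flip test on per-topic score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every +/- sign assignment (2^n cases for n topics).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


def discriminative_power(runs: Dict[str, List[float]], alpha: float = 0.05) -> float:
    """Fraction of run pairs whose difference is significant at level alpha."""
    pairs = list(combinations(sorted(runs), 2))
    if not pairs:
        return 0.0
    significant = sum(
        1 for r1, r2 in pairs
        if sign_flip_p_value(runs[r1], runs[r2]) < alpha
    )
    return significant / len(pairs)


if __name__ == "__main__":
    base = [0.5, 0.6, 0.7, 0.8, 0.9, 0.6, 0.7, 0.8]
    toy_runs = {
        "A": base,
        "B": [s - 0.3 for s in base],  # consistently worse than A
        "C": list(base),               # identical to A
    }
    print(discriminative_power(toy_runs))
```

Sorting the p-values of all run pairs and plotting them gives a curve of the kind used to compare metrics: a metric whose curve stays low separates more run pairs.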


Effect of $L$ on Discriminative Power

The Achieved Significance Level (ASL) curves for varying $L$. The y-axis represents the p-value and the x-axis represents run pairs sorted by the p-value. Metrics that are closer to the origin are the ones that are highly discriminative.


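T-measure, T♭ and S♯ (1/3)

We introduce a precision-like “terseness” metric for evaluating an X-string $X$ of size $|X|$:

$$\text{T-measure} = \frac{\sum_{i \in M} |v(i)|}{|X|}$$

As T-measure might exceed one, we also define T♭, given by $T\flat = \min(1, \text{T-measure})$, although in reality T-measure never exceeded one for our data and therefore T-measure $=$ T♭ holds.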
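A small Python sketch of the terseness metric follows, assuming it is the total vital string length of the matched iUnits divided by the length of the X-string. The function names and the example lengths are hypothetical.

```python
from typing import Iterable


def t_measure(matched_vital_lengths: Iterable[int], x_length: int) -> float:
    """Share of the X-string accounted for by matched vital strings."""
    if x_length <= 0:
        return 0.0  # an empty X-string conveys nothing
    return sum(matched_vital_lengths) / x_length


def t_flat(matched_vital_lengths: Iterable[int], x_length: int) -> float:
    """T-measure capped at one."""
    return min(1.0, t_measure(matched_vital_lengths, x_length))


if __name__ == "__main__":
    # Two matched iUnits with vital strings of 20 and 30 characters,
    # found in an X-string of 100 characters.
    print(t_measure([20, 30], 100))
```

A verbose X-string that contains the same matches scores lower, which is the precision-like behaviour the metric is meant to capture.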


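T-measure, T♭ and S♯ (2/3)

Finally, following the approach of the well-known F-measure, we can define S♯ as:

$$S\sharp = \frac{(1 + \beta^2)\, T\flat\, S\flat}{\beta^2\, T\flat + S\flat}$$

Letting $\beta = 1$ reduces S♯ to a harmonic mean of S♭ and T♭. We also examined other values of $\beta$.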
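The combination can be sketched in Python under the assumption that it follows the usual F-measure form, with the recall-like score in the role of recall and the precision-like score in the role of precision. The function name `s_sharp` and the sample values are hypothetical.

```python
def s_sharp(s_flat: float, t_flat: float, beta: float = 1.0) -> float:
    """F-measure-style combination of a recall-like and a precision-like score.

    Assumed form: (1 + beta^2) * T * S / (beta^2 * T + S), where S is the
    capped recall-like score and T is the capped precision-like score.
    """
    denominator = beta ** 2 * t_flat + s_flat
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * t_flat * s_flat / denominator


if __name__ == "__main__":
    print(s_sharp(0.8, 0.2, beta=1.0))    # harmonic mean of the two scores
    print(s_sharp(0.8, 0.2, beta=100.0))  # dominated by the recall-like score
```

With a large `beta`, the result approaches the recall-like score, so the precision-like score acts only as a mild penalty for verbose output.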


T-measure, T♭ and S♯ (3/3)

S♯ differs from the traditional nugget-based F-measure in the following two aspects:

1. It utilizes the positions of iUnits for computing the recall-like S♭.

2. Instead of relying on a fixed allowance parameter, it utilizes the vital string length of each iUnit for computing the precision-like T-measure.


System Ranking by Different Metrics

The x-axis shows runs sorted by Mean S-measure.


X-strings of Runs from KUIDL and MSRA1click

The QA query “The three duties of a Japanese citizen.”


Effect of $\beta$ on S♯

The x-axis represents $\beta$ and the y-axis represents Kendall’s $\tau$ with the Mean S-measure ranking. Note that a given value of $\beta$ means “S♭ is $\beta$ times as important as T♭”.

We recommend S♯ for evaluating 1CLICK systems, along with the original S-measure.
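To make the rank-correlation comparison concrete, the sketch below computes Kendall's tau between two system rankings induced by mean scores under two metrics. It is a plain tau-a that assumes no tied scores; the function name and the toy scores are hypothetical.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between the run rankings induced by two metrics.

    Assumes both dictionaries cover the same runs and contain no ties.
    """
    runs = sorted(scores_a)
    n_pairs = len(runs) * (len(runs) - 1) // 2
    if n_pairs == 0:
        return 0.0
    concordant = 0
    discordant = 0
    for r1, r2 in combinations(runs, 2):
        agreement = (scores_a[r1] - scores_a[r2]) * (scores_b[r1] - scores_b[r2])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    by_metric_1 = {"run1": 0.30, "run2": 0.20, "run3": 0.10}
    by_metric_2 = {"run1": 0.25, "run2": 0.05, "run3": 0.15}
    print(kendall_tau(by_metric_1, by_metric_2))
```

A value of one means the two metrics rank the runs identically, and a value of minus one means one ranking is the exact reverse of the other.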


Results on T-measure and S♯

Discriminative power of S-measure, T-measure and S♯


Conclusions

The One Click Access Task (1CLICK) is one of the tasks of NTCIR that requires systems to return a concise multi-document summary of web pages in response to a query which is assumed to have been submitted in a mobile context.

Furthermore, we introduce the next round of the 1CLICK task called MobileClick, in which participants are required to submit a two-layered summarization suitable for mobile information access.


Thanks

Y. Hou et al. (Eds.): AIRS 2012, LNCS 7675, pp. 39–51, 2012.

© Springer-Verlag Berlin Heidelberg 2012
