28
1 Generalized Inverse Document Frequency Donald Metzler Yahoo! Research October 27, 2008 CIKM 2008

Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

1

Generalized Inverse

Document Frequency

Donald Metzler

Yahoo! Research

October 27, 2008

CIKM 2008

Page 2: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

What is IDF?

• Global measure of the importance of an

identifier (word, phrase, etc.)

• Used in a variety of tasks

– Information retrieval

– Text classification

• Classical formulation:

2

Page 3: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

IDF Timeline

3

1972 1976 1979 1997 2007

Page 4: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

IDF Timeline

4

1972 1976 1979 1997 2007

Page 5: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

IDF Timeline

5

1972 1976 1979 1997 2007

Page 6: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

IDF Timeline

6

1972 1976 1979 1997 2007

Page 7: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

IDF Timeline

7

1972 1976 1979 1997 2007

Page 8: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Why Study IDF?

• Term frequency and document length

normalization are focus of many studies

• Inverse document frequency is often

overlooked and not very well

understood

• Momentum building for improved

understanding and modeling of IDF

8

Page 9: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Common IDF Formulations

• Robertson-Sparck Jones IDF:

• Robertson and Walker IDF:

• Both can be derived as a RSJ weight

from the BIR model

9

Page 10: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Generalized IDF

10

Page 11: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Document Generation Model

• A model (θ) is sampled

given a query (q),

relevance class (r), and

judgments (J)

• A term (di) event is

sampled from the model

• Distributional assumptions:

11

q r J

θi

di

Page 12: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Generalized IDF

• Estimates

• Generalized IDF

12

Page 13: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Generalized IDF

• Different settings of hyperparametersgive different IDF formulations

• Various ways to estimate

– Collection statistics

– (Pseudo-)Relevance feedback

– Click data

• We propose and evaluate several simple estimates as “proof of concept”

13

Page 14: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Relevance Distribution:

Assumption Set 1

14

Hyperparameters

Estimate

IDF

P(d | q, r) is constant for all terms.

Page 15: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Empirical Relevance Distributions

15

Page 16: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Relevance Distribution:

Assumption Set 2

16

Hyperparameters

Estimate

IDF

P(d | q, r) is linearly smoothed between empirical

mean and collection model.

Page 17: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

IDF

Non-relevance Distribution:

Assumption Set 1

17

Hyperparameters

Estimate

P(d | q, ¬r) is constant for all terms.

Page 18: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Non-relevance Distribution:

Assumption Set 2

18

Hyperparameters

Estimate

IDF

P(d | q, ¬r) is increasing with P(w | C)

Page 19: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Non-relevance Distribution:

Assumption Set 3

19

Hyperparameters

Estimate

IDF

P(d | q, ¬r) is increasing with P(w | C)

Page 20: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

IDF

Non-relevance Distribution:

Assumption Set 4

20

Hyperparameters

Estimate

P(d | q, ¬r) is linearly smoothed between empirical

mean and collection model.

Page 21: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Experimental Methodology

• Data

– Three TREC news collections (AP, WSJ, ROBUST 2004)

– One TREC web collection (WT10G)

• Mean average precision (MAP) used for evaluation

• Parameter Estimation

– Tuned to maximize MAP on training set

– Evaluated on test set

21

Page 22: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Hyperparameters

22

Page 23: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Goodness of Beta Fit

23

Page 24: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

IDF Experiments

24

• IDF-only ranking function

• By eliminating TF and document length

frequency we can directly quantify the

impact of new IDF formulations

Page 25: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

IDF Results

25

Bold indicates the best formulation for each data set. The superscripts α and

β indicate statistically significant improvements over RSJ and RSJ Positive,

respectively, at the p < 0.1 level. Underlined superscripts are significant at

the p < 0.05 level. Significance tests were only performed on the test sets.

Page 26: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

TF.IDF Experiments

• Okapi TF

• Ranking function

• Does TF dampen IDF improvements?

26

Page 27: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

TF.IDF Results

27

Bold indicates the best formulation for each data set. The superscripts α and

β indicate statistically significant improvements over RSJ and RSJ Positive,

respectively, at the p < 0.1 level. Underlined superscripts are significant at

the p < 0.05 level. Significance tests were only performed on the test sets.

Page 28: Generalized Inverse Document Frequency€¦ · CIKM 2008. What is IDF? •Global measure of the importance of an identifier (word, phrase, etc.) •Used in a variety of tasks –Information

Conclusions

• Derived a generalized IDF formulation– Can be used to derive new and improved IDF

formulations

– Provides better understanding of IDF

• Proposed several new IDF formulations that are more effective than current “best practice” IDFs

• Recommended IDF formulation:

28