Transcript
Page 1: Copulas for Information Retrieval (SIGIR'13)

Copulas for Information Retrieval

Carsten Eickhoff, Arjen P. de Vries, Kevyn Collins-Thompson

Page 2: Copulas for Information Retrieval (SIGIR'13)

Copulas – What is it all about?

• Assume two sufficiently different commodities

• Rare elemental metals

• Pork bellies

• No apparent correlations

[Figure: chart comparing Rare Earths and Pork Bellies; vertical axis from 0 to 6]

Page 3: Copulas for Information Retrieval (SIGIR'13)

Copulas – What is it all about?

• Two seemingly independent variables

• Yet, for rare extreme cases, there are co-movements

• “Tail dependencies”

• Copulas decouple observations and dependencies

• IR models are good at estimating marginals

• Copulas are good at combining them

Page 4: Copulas for Information Retrieval (SIGIR'13)

Overview

1. Non-linear Dependency Structures in IR

2. Copulas – Intuition & Background

3. Multivariate Relevance Estimation

4. When to use them?

5. Score Fusion

6. Conclusion & Future Directions

Page 5: Copulas for Information Retrieval (SIGIR'13)

1 Non-Linear Dependency Structures in IR

Page 6: Copulas for Information Retrieval (SIGIR'13)

Multivariate Relevance Modelling

• IR systems index and retrieve a growing variety of document types

• Many are structured, or at least “complex”

• Single-criterion relevance frameworks do not perform well

• Multi-criteria models tend to be either:

a) Naïve (e.g., independence assumption), or

b) Hard for humans to interpret qualitatively (e.g., L2R)

Page 7: Copulas for Information Retrieval (SIGIR'13)

Non-Linear Dependencies

• Non-linear dependency structures are still a challenge

• TREC 2010 Faceted Blog Distillation Task, Topic 1171, “mysql”

• Relevance criteria:

• Topicality

• Subjectivity

Page 8: Copulas for Information Retrieval (SIGIR'13)

Non-Linear Dependencies

• Pearson’s ρ = 0.18

• So, there is no real dependency

• …right?

Page 10: Copulas for Information Retrieval (SIGIR'13)

Non-Linear Dependencies

• Pearson’s ρ = 0.18

• So, there is no real dependency

• …right?

• In the lower third of the scale, we note ρ = 0.37

Page 11: Copulas for Information Retrieval (SIGIR'13)

Non-Linear Dependencies

• Pearson’s ρ = 0.18

• So, there is no real dependency

• …right?

• In the lower third of the scale, we note ρ = 0.37

• And in the upper third, it turns to ρ = -0.4
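The effect described on this slide can be reproduced with a small sketch. It assumes that "thirds of the scale" means equal-width thirds of the first variable's observed range, which the transcript does not specify. The function names and the tent-shaped data are purely illustrative and are not the TREC topic 1171 scores.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pearson_by_thirds(xs, ys):
    """Return (overall, lower-third, upper-third) correlations.

    Assumption: the thirds are equal-width slices of the observed range of xs.
    """
    lo, hi = min(xs), max(xs)
    cut1 = lo + (hi - lo) / 3
    cut2 = lo + 2 * (hi - lo) / 3
    lower = [(x, y) for x, y in zip(xs, ys) if x <= cut1]
    upper = [(x, y) for x, y in zip(xs, ys) if x >= cut2]
    return pearson(xs, ys), pearson(*zip(*lower)), pearson(*zip(*upper))

# Synthetic, illustrative data: y rises with x, then falls again (a "tent").
xs = [i / 30 for i in range(31)]
ys = [1 - abs(2 * x - 1) for x in xs]
overall, low, up = pearson_by_thirds(xs, ys)
```

On this toy data the overall coefficient is essentially zero, while the lower third is strongly positive and the upper third strongly negative: a single global ρ can hide opposite dependencies in different regions.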

Page 12: Copulas for Information Retrieval (SIGIR'13)

2 Copulas – Intuition & Background

Page 13: Copulas for Information Retrieval (SIGIR'13)

Copulas (from copulare, to join)

• Copulas model complex non-linear dependencies between variables that simple correlations can't capture

• Decouple marginal distributions from dependency structure

• Approximate joint multivariate distributions

• Applied previously in portfolio and risk management, meteorology, river flooding predictions, …

Page 14: Copulas for Information Retrieval (SIGIR'13)

Formal Basics

• Given a k-dimensional random vector (rv)

• Map to unit cube

• Describe joint cdf with copula

• Isolation of a component

• Copula’s zero
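The equations that accompanied these bullets were not captured in the transcript. The block below is a sketch using the standard textbook definitions that match each bullet in order; the symbols X, F_i, U and C are notation assumed here and may differ from the original slide.

```latex
\begin{align*}
  % Given a k-dimensional random vector with marginal cdfs F_1, ..., F_k
  X &= (x_1, \dots, x_k) \\
  % Map to the unit cube via the marginal cdfs
  U &= (u_1, \dots, u_k) = \bigl(F_1(x_1), \dots, F_k(x_k)\bigr) \in [0,1]^k \\
  % Describe the joint cdf with a copula C
  C(u_1, \dots, u_k) &= P(U_1 \le u_1, \dots, U_k \le u_k) \\
  % Isolation of a component: all other arguments set to 1
  C(1, \dots, 1, u_i, 1, \dots, 1) &= u_i \\
  % The copula's zero: any argument equal to 0
  C(u_1, \dots, u_k) &= 0 \quad \text{if } u_i = 0 \text{ for any } i
\end{align*}
```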

Page 15: Copulas for Information Retrieval (SIGIR'13)

Closing the circle

• Recall the example TREC topic 1171

• Linear combination: AP = 0.14, below collection average (0.25)

• Fit Clayton copula to model joint relevance distribution

• AP rises to 0.22
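For reference, the bivariate Clayton copula has a simple closed form. The sketch below implements that standard formula; the function name and the parameter name theta are assumptions, and the slides do not say how the fitted copula was turned into a ranking score.

```python
def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta).

    u and v are marginal cdf values in [0, 1]; theta > 0 controls the
    strength of the (lower-tail) dependence.
    """
    if theta <= 0:
        raise ValueError("this sketch only covers theta > 0")
    if u <= 0.0 or v <= 0.0:
        return 0.0  # the copula's zero: any argument equal to 0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Example: two criteria that each place a document at the median.
joint = clayton_cdf(0.5, 0.5, theta=1.0)
```

With theta = 1 the example evaluates to 1/3, above the 0.25 that an independence assumption (0.5 × 0.5) would give, which reflects the positive dependence the Clayton family encodes.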

Page 16: Copulas for Information Retrieval (SIGIR'13)

3 Multivariate Relevance Estimation

Page 17: Copulas for Information Retrieval (SIGIR'13)

Joint Relevance Estimation

• Estimate marginal distributions from data

• Estimate copula fitting parameters to maximize posterior probability of observing data

• Use copula to represent joint probability of relevance
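The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: marginals are estimated with rescaled ranks (an empirical cdf), the copula family is fixed to Clayton, and the parameter is chosen by a plain maximum-likelihood grid search, which stands in for the posterior maximization mentioned on the slide. All function names and the grid are hypothetical.

```python
from math import log

def pseudo_observations(values):
    """Map raw scores into (0, 1) using rescaled ranks (empirical cdf).

    Ties are broken by input order, which is good enough for a sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (n + 1) for r in ranks]

def clayton_log_likelihood(theta, us, vs):
    """Log-likelihood of the bivariate Clayton copula density for theta > 0."""
    total = 0.0
    for u, v in zip(us, vs):
        s = u ** -theta + v ** -theta - 1.0
        total += (log(1.0 + theta)
                  - (theta + 1.0) * (log(u) + log(v))
                  - (2.0 + 1.0 / theta) * log(s))
    return total

def fit_clayton(us, vs, grid=None):
    """Pick the theta on a grid that maximizes the log-likelihood."""
    if grid is None:
        grid = [0.05 * i for i in range(1, 201)]  # theta in (0, 10]
    return max(grid, key=lambda t: clayton_log_likelihood(t, us, vs))

# Illustrative scores only: two criteria that rank eight documents identically,
# and a second pair that ranks them in exactly opposite order.
topicality = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.55]
us = pseudo_observations(topicality)
theta_same_order = fit_clayton(us, us)
theta_reversed = fit_clayton(us, [1.0 - u for u in us])
```

The fitted parameter is larger when the two criteria agree than when they disagree, so theta acts as a summary of how strongly the dimensions move together. A real system would replace the grid search with a proper optimizer and compare several copula families.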

Page 18: Copulas for Information Retrieval (SIGIR'13)

Joint Relevance Estimation

• We study three different scenarios:

• Opinionated blog posts

• Personalized bookmarks

• Child-friendly websites

• Use original training portion of the corpora where available

• A 90/10 split otherwise

Page 19: Copulas for Information Retrieval (SIGIR'13)

Results I – Opinionated Blog Posts

• TREC Blogs08 dataset

• 1.3 M documents

• Relevance dimensions: Topicality & Subjectivity

• Significantly higher performance than linear combination model

Page 20: Copulas for Information Retrieval (SIGIR'13)

Results II – Personalized Bookmarks

• Dataset by Vallet & Castells

• 339k documents

• Relevance Dimensions: Topicality & Personal relevance

• Significant performance gains in some metrics

Page 21: Copulas for Information Retrieval (SIGIR'13)

Results III – Child-friendly Websites

• Dataset from the PuppyIR project (http://puppyir.eu)

• 22k documents

• Relevance Dimensions: Topicality & Child-suitability

• Worse-than-baseline performance

Page 22: Copulas for Information Retrieval (SIGIR'13)

4 Copulas – When to use them?

Page 23: Copulas for Information Retrieval (SIGIR'13)

When to use them?

• Previously: Strongly varying performance for different settings

• Is there a way of predicting the merit?

• Recall: copulas model tail dependencies between dimensions

Page 24: Copulas for Information Retrieval (SIGIR'13)

Types of Tail Dependencies

Page 25: Copulas for Information Retrieval (SIGIR'13)

Measuring Tail Dependencies

• According to Frees and Valdez (1998), IL and IU measure the strength of lower and upper tail dependencies

• Anderson-Darling test of goodness-of-fit between copula and observed data

Domain                    Frees tail index   Anderson-Darling   Actual retrieval performance
Opinionated Blogs         IL = 0.07          0.67               Copulas > linear
Personalized Bookmarks    IU = 0.49          0.47               Copulas = linear
Child-friendly Websites   IL = IU = 0        0.046              Copulas < linear
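To give a feel for what a tail-dependence measure captures, here is a sketch of a simple empirical estimator. It is not the Frees and Valdez tail index reported in the table, whose formula the transcript does not give; it is the common empirical approximation of the tail-dependence coefficient at a fixed quantile q. The function name and the default q are assumptions.

```python
def empirical_tail_dependence(us, vs, q=0.1):
    """Return (lower, upper) empirical tail-dependence estimates.

    us and vs are pseudo-observations in (0, 1). The lower estimate
    approximates P(V <= q | U <= q); the upper one P(V > 1-q | U > 1-q).
    """
    n = len(us)
    both_low = sum(1 for u, v in zip(us, vs) if u <= q and v <= q)
    both_high = sum(1 for u, v in zip(us, vs) if u > 1 - q and v > 1 - q)
    return both_low / (n * q), both_high / (n * q)

# Illustrative pseudo-observations: 100 evenly spaced ranks.
us = [i / 101 for i in range(1, 101)]
same = empirical_tail_dependence(us, us)                 # identical ordering
opposite = empirical_tail_dependence(us, us[::-1])       # reversed ordering
```

Values near 1 indicate that extreme values of one dimension tend to coincide with extreme values of the other; values near 0 indicate no co-movement in that tail.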

Page 26: Copulas for Information Retrieval (SIGIR'13)

5 Copulas for Score Fusion

Page 27: Copulas for Information Retrieval (SIGIR'13)

Score Fusion

• A different angle on relevance estimation

• Combine individual retrieval system scores instead of modelling relevance from content criteria

• In this setting, submissions to historic TRECs serve as criteria

• We randomly draw k individual runs and combine them using copulas

Page 28: Copulas for Information Retrieval (SIGIR'13)

Fusion Methods

• Established:

• Copula-based:
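The formulas from this slide did not survive in the transcript. As a reminder of the established baselines named on the later robustness slides, the sketch below implements the standard CombSUM and CombMNZ definitions. It assumes each run is a dict of already normalized scores; the function names are illustrative. The copula-based variants are not reproduced here because the transcript does not state them.

```python
def comb_sum(runs):
    """CombSUM: sum each document's scores across all runs.

    runs is a list of dicts mapping doc id -> normalized score.
    """
    fused = {}
    for run in runs:
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + score
    return fused

def comb_mnz(runs):
    """CombMNZ: CombSUM multiplied by the number of runs with a non-zero score."""
    sums = comb_sum(runs)
    hits = {}
    for run in runs:
        for doc, score in run.items():
            if score > 0:
                hits[doc] = hits.get(doc, 0) + 1
    return {doc: sums[doc] * hits.get(doc, 0) for doc in sums}

# Two toy runs: d1 is retrieved by both, d2 by only one.
runs = [{"d1": 0.5, "d2": 0.25}, {"d1": 0.25}]
fused_sum = comb_sum(runs)
fused_mnz = comb_mnz(runs)
```

CombMNZ rewards documents that several systems agree on, which is also why both methods can be sensitive to weak runs that contribute many low-quality scores.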

Page 29: Copulas for Information Retrieval (SIGIR'13)

Results – TREC 4

• Results are averaged across 200 randomizations per setting of k

• Relative improvements over the best, worst and median fused run in terms of percentages of MAP

• Small but consistent improvements over non-copula fusion baselines

Page 30: Copulas for Information Retrieval (SIGIR'13)

Robustness - CombSUM

• Fusion approaches are often sensitive to weak contributions

• We control the number of weak submissions added to the fusion

• Copulas’ explicit modeling of dependency structure is more robust

Page 31: Copulas for Information Retrieval (SIGIR'13)

Robustness - CombMNZ

• Fusion approaches are often sensitive to weak contributions

• We control the number of weak submissions added to the fusion

• Copulas’ explicit modeling of dependency structure is more robust

Page 32: Copulas for Information Retrieval (SIGIR'13)

6 Conclusion and Future Directions

Page 33: Copulas for Information Retrieval (SIGIR'13)

Conclusion

• Copulas decouple observations and dependencies

• IR models are good at estimating marginals

• Copulas are good at combining them

• We use them for multivariate relevance estimation

• Strongly scenario-dependent performance

• Tail indices & goodness-of-fit tests as estimators of expected performance

• Copulas for score fusion

• Robust to outliers

Page 34: Copulas for Information Retrieval (SIGIR'13)

The Road Ahead

• Currently, we use single copulas for relevance modelling

• Copula mixtures and composite Archimedean copulas for higher accuracy

• Here, we use pre-existing copula families and fit them to data

• Instead, can we formalize copulas from scratch to include domain knowledge?

• So far, we explored two-dimensional relevance spaces

• What happens as we move into higher-order systems?

Page 35: Copulas for Information Retrieval (SIGIR'13)

Thank You!

