54
Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft SIGIR 2019 · July 23 rd · Paris Picture by dalbera

Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft

SIGIR 2019 · July 23rd · Paris Picture by dalbera

Page 2: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Current Statistical Testing Practice

• According to surveys by Sakai & Carterette

–60-75% of IR papers use significance testing

– In the paired case (2 systems, same topics):

• 65% use the paired t-test

• 25% use the Wilcoxon test

• 10% others, like Sign, Bootstrap & Permutation

2

Page 3: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

t-test and Wilcoxon are the de facto choice

Is this a good choice?

3

Page 4: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Our Journey

4

van

Rijs

ber

gen

Hu

ll @

SIG

IR

Savo

y @

IP&

M

Wilb

ur

@JI

S

Zob

el @

SIG

IR

Vo

orh

ees

& B

uck

ley

@SI

GIR

Vo

orh

ees

@SI

GIR

Saka

i @SI

GIR

Smu

cker

et

al. @

SIG

IR

Saka

i @SI

GIR

Para

par

et

al. @

JASI

ST

Co

rmac

k &

Lyn

am @

SIG

IR

Smu

cker

et

al. @

CIK

M

Urb

ano

& N

agle

r @

SIG

IR

1980 1990 2000 2010 2020

San

der

son

& Z

ob

el @

SIG

IR

Urb

ano

et

al. @

SIG

IR

Car

tere

tte

@TO

IS

Car

tere

tte

@IC

TIR

Sa

kai @

SIG

IR F

oru

m

Urb

ano

@JI

R

Page 5: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Our Journey

4

van

Rijs

ber

gen

Hu

ll @

SIG

IR

Savo

y @

IP&

M

Wilb

ur

@JI

S

Zob

el @

SIG

IR

Vo

orh

ees

& B

uck

ley

@SI

GIR

Vo

orh

ees

@SI

GIR

Saka

i @SI

GIR

Smu

cker

et

al. @

SIG

IR

Saka

i @SI

GIR

Para

par

et

al. @

JASI

ST

Co

rmac

k &

Lyn

am @

SIG

IR

Smu

cker

et

al. @

CIK

M

Urb

ano

& N

agle

r @

SIG

IR

1980 1990 2000 2010 2020

San

der

son

& Z

ob

el @

SIG

IR

Urb

ano

et

al. @

SIG

IR

Car

tere

tte

@TO

IS

Car

tere

tte

@IC

TIR

Sa

kai @

SIG

IR F

oru

m

1st Period

Statistical testing unpopular Theoretical arguments around test assumptions

Urb

ano

@JI

R

Page 6: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Our Journey

4

van

Rijs

ber

gen

Hu

ll @

SIG

IR

Savo

y @

IP&

M

Wilb

ur

@JI

S

Zob

el @

SIG

IR

Vo

orh

ees

& B

uck

ley

@SI

GIR

Vo

orh

ees

@SI

GIR

Saka

i @SI

GIR

Smu

cker

et

al. @

SIG

IR

Saka

i @SI

GIR

Para

par

et

al. @

JASI

ST

Co

rmac

k &

Lyn

am @

SIG

IR

Smu

cker

et

al. @

CIK

M

Urb

ano

& N

agle

r @

SIG

IR

1980 1990 2000 2010 2020

San

der

son

& Z

ob

el @

SIG

IR

Urb

ano

et

al. @

SIG

IR

Car

tere

tte

@TO

IS

Car

tere

tte

@IC

TIR

Sa

kai @

SIG

IR F

oru

m

2nd Period

Empirical studies appear Resampling-based tests and t-test

Urb

ano

@JI

R

Page 7: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Our Journey

4

van

Rijs

ber

gen

Hu

ll @

SIG

IR

Savo

y @

IP&

M

Wilb

ur

@JI

S

Zob

el @

SIG

IR

Vo

orh

ees

& B

uck

ley

@SI

GIR

Vo

orh

ees

@SI

GIR

Saka

i @SI

GIR

Smu

cker

et

al. @

SIG

IR

Saka

i @SI

GIR

Para

par

et

al. @

JASI

ST

Co

rmac

k &

Lyn

am @

SIG

IR

Smu

cker

et

al. @

CIK

M

Urb

ano

& N

agle

r @

SIG

IR

1980 1990 2000 2010 2020

San

der

son

& Z

ob

el @

SIG

IR

Urb

ano

et

al. @

SIG

IR

Car

tere

tte

@TO

IS

Car

tere

tte

@IC

TIR

Sa

kai @

SIG

IR F

oru

m

3rd Period

Wide adoption of statistical testing Long-pending discussion about statistical practice

Urb

ano

@JI

R

Page 8: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Our Journey

• Theoretical and empirical arguments

for and against specific tests

• 2-tailed tests at α=.05 with AP and P@10,

almost exclusively

• Limited data, resampling from the same topics

• No control over the null hypothesis

• Discordances or conflicts among tests,

but no actual error rates

5

Page 9: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Main reason?

No control of the data generating process

6

Page 10: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

PROPOSAL FROM SIGIR 2018

Page 11: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

• Build a generative model of the joint distribution of system scores

• So that we can simulate scores on new, random topics (no content, only scores)

• Unlimited data • Full control over H0

Stochastic Simulation

9

test

AP

p-values

Model

Urbano & Nagler, SIGIR 2018

Page 12: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

• Build a generative model of the joint distribution of system scores

• So that we can simulate scores on new, random topics (no content, only scores)

• Unlimited data • Full control over H0

• The model is flexible, and

can be fit to existing data to make it realistic

Stochastic Simulation

10

TREC

IR systems

test

AP

AP

p-values

Model

Urbano & Nagler, SIGIR 2018

Page 13: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Stochastic Simulation

11

Experimental

Bas

elin

e

• We use copula

models, which

separate:

1. Marginal

distributions, of

individual systems

• Give us full knowledge

and control over H0

2. Dependence

structure, among

systems

μE

μB

Page 14: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Stochastic Simulation

11

Experimental

Bas

elin

e

• We use copula

models, which

separate:

1. Marginal

distributions, of

individual systems

• Give us full knowledge

and control over H0

2. Dependence

structure, among

systems

μE

μB

Page 15: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Research Question

• Which is the test that…

1. Maintaining Type I errors at the α level,

2. Has the highest statistical power,

3. Across measures and sample sizes,

4. With IR-like data?

12

Page 16: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Factors Under Study

• Paired test: Student’s t, Wilcoxon, Sign,

Bootstrap-shift, Permutation

• Measure: AP, nDCG@20, ERR@20, P@10, RR

• Topic set size n: 25, 50, 100

• Effect size δ: 0.01, 0.02, …, 0.1

• Significance level α: 0.001, …, 0.1

• Tails: 1 and 2

• Data to fit stochastic models: TREC 5-8 Ad Hoc

and 2010-13 Web

13

Page 17: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

We report results on

>500 million p-values

1.5 years of CPU time

¯\_(ツ)_/¯ 14

Page 18: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

TYPE I ERRORS

Page 19: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Ind

ex

NA

TREC

Sys

tem

s

Topics

Simulation such that μE = μB

16

Experimental

Bas

elin

e

Page 20: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB

16

Experimental

Bas

elin

e

Page 21: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB

16

Experimental

Bas

elin

e

Page 22: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB

16

Experimental

Bas

elin

e

μE = μB

Page 23: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB

16

Experimental

Bas

elin

e

Tests

p-values

μE = μB

Page 24: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB

• Repeat for each measure and topic set size n

–1,667,000 times

–≈8.3 million 2-tailed p-values

–≈8.3 million 1-tailed p-values

• Grand total of >250 million p-values

• Any p<α corresponds to a Type I error

Page 25: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type I Errors by α | n 2-tailed

18

Not so interested in specific points but in trends

Page 26: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type I Errors by α | n 2-tailed

20

Page 27: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type I Errors by α | n 2-tailed

20

• Wilcoxon and Sign have higher error rates than expected • Wilcoxon better in P@10 and RR because of symmetricity • Even worse as sample size increases (with RR too)

Page 28: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type I Errors by α | n 2-tailed

20

• Bootstrap has high error rates too • Tends to correct with sample size because it estimates

the sampling distribution better

Page 29: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type I Errors by α | n 2-tailed

20

• Bootstrap has high error rates too • Tends to correct with sample size because it estimates

the sampling distribution better

• Permutation and t-test have nearly ideal behavior • Permutation very slightly sensitive to sample size • t-test remarkably robust to it

Page 30: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type I Errors - Summary

• Wilcoxon, Sign and Bootstrap test tend to make

more errors than expected

• Increasing sample size helps Bootstrap, but

hurts Wilcoxon and Sign even more

• Permutation and t-test have nearly ideal

behavior across measures, even with small

sample size

• t-test is remarkably robust

• Same conclusions with 1-tailed tests

21

Page 31: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

TYPE II ERRORS

Page 32: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB + δ

23

Ind

ex

NA

TREC

Sys

tem

s

Topics

Experimental

Bas

elin

e

Page 33: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB + δ

23

Experimental

Bas

elin

e

Page 34: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB + δ

23

Experimental

Bas

elin

e

Page 35: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB + δ

23

Experimental

Bas

elin

e

μE = μB+δ

Page 36: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB + δ

23

Experimental

Bas

elin

e

Tests

p-values

μE = μB+δ

Page 37: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Simulation such that μE = μB + δ

• Repeat for each measure, topic set size n

and effect size δ –167,000 times

–≈8.3 million 2-tailed p-values

–≈8.3 million 1-tailed p-values

• Grand total of >250 million p-values

• Any p>α corresponds to a Type II error

Page 38: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Power by δ | n α=.05, 2-tailed

25

ideally

Page 39: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Power by δ | n α=.05, 2-tailed

26

• Clear effect of effect size δ • Clear effect of sample size n • Clear effect of measure (via σ)

Page 40: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Power by δ | n α=.05, 2-tailed

26

• Clear effect of effect size δ • Clear effect of sample size n • Clear effect of measure (via σ)

• Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially for small n

Page 41: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Power by δ | n α=.05, 2-tailed

26

• Clear effect of effect size δ • Clear effect of sample size n • Clear effect of measure (via σ)

• Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially for small n • Permutation and t-test are almost identical again • Very close to Bootstrap as sample size increases

Page 42: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Power by δ | n α=.05, 2-tailed

26

• Clear effect of effect size δ • Clear effect of sample size n • Clear effect of measure (via σ)

• Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially for small n

• Wilcoxon is very similar to Permutation and t-test • Even slightly better with small n or δ, specially for AP, nDCG and ERR

(it’s indeed more efficient with some asymmetric distributions)

Page 43: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Power by α | δ n=50, 2-tailed

27

Page 44: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Power by α | δ n=50, 2-tailed

27

• With small δ, Wilcoxon and Bootstrap consistently the most powerful • With large δ, Permutation and t-test catch up with Wilcoxon

Page 45: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type II Errors - Summary

• All tests, except Sign, behave very similarly

• Bootstrap and Wilcoxon are consistently

a bit more powerful across significance levels – But more Type I errors!

• With larger effect sizes and sample sizes,

Permutation and t-test catch up with Wilcoxon,

but not with Bootstrap

• Same conclusions with 1-tailed tests 28

Page 46: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

TYPE III ERRORS

Page 47: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type III what?

• A wrong directional decision based on the correct rejection of a non-directional hypothesis

• Example: – We observe a positive result, 𝐸 > 𝐵 – We run a 2-tailed test, 𝐻0: 𝜇𝐸 = 𝜇𝐵 – Find 𝑝 < 𝛼, so we reject and conclude 𝜇𝐸 > 𝜇𝐵

– But 𝑯𝟎 is non-directional – What if we just got lucky, and really 𝝁𝑬 < 𝝁𝑩?

30

Page 48: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type III Errors by δ | n α=.05

31

• Clear effect of δ and n • P@10 and RR substantially more problematic because of higher σ

Page 49: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type III Errors by δ | n α=.05

31

• Bootstrap tends to correct with sample size • Wilcoxon stays the same, and Sign test gets even worse

Page 50: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type III Errors by δ | n α=.05

31

• Bootstrap tends to correct with sample size • Wilcoxon stays the same, and Sign test gets even worse

Page 51: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Type III Errors in Practice

• How much of a problem could this be?

• Example: AP and n=50 topics – Improvement of +0.01 over the baseline

–2-tailed t-test comes up significant

–7.3% probability that it is a Type III error and

your system is actually worse

– Is that too high?

32

Page 52: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

CONCLUSIONS

Page 53: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

What We Did

• First empirical study of actual error rates with IR-like data

• Comprehensive – Paired test: Student’s t, Wilcoxon, Sign, Bootstrap-shift,

Permutation

– Measure: AP, nDCG@20, ERR@20, P@10, RR

– Topic set size: 25, 50, 100

– Effect size: 0.01, 0.02, …, 0.1

– Significance level: 0.001, …, 0.1

– Tails: 1 and 2

• More than 500 million p-values

• All data and many more plots are available online https://github.com/julian-urbano/sigir2019-statistical

34

Page 54: Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft · • Sign test consistently the least powerful (disregards magnitudes) • Bootstrap test consistently the most powerful, specially

Recommendations

• Don’t use Wilcoxon or Sign tests anymore

• For statistics other than the mean, use

permutation, and bootstrap only if you have

many topics

• For typical tests about mean scores, the

t-test is simple, the most robust, behaves as

expected w.r.t. Type I errors, and is nearly as

powerful as the Bootstrap. Keep using it 35