
Page 1: ICTIR2016tutorial

Topic Set Size Design and Power Analysis in Practice

Tetsuya Sakai

@tetsuyasakai

[email protected]

Waseda University

ICTIR 2016 Tutorial: September 13, 2016, Delaware.

Page 2: ICTIR2016tutorial

This half-day tutorial will teach you

•How to determine the number of topics when building a new test collection (prerequisite: you already have some pilot data from which you can construct a topic-by-run score matrix). You will kind of know how it works.

•How to check whether a reported experiment is overpowered/underpowered and decide on a better sample size for a future experiment.

Page 3: ICTIR2016tutorial

Before attending the tutorial, please download onto your laptop:
- Sample topic-by-run matrix: https://waseda.box.com/20topics3runs
- Excel topic set size design tools:
  http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeTTEST.xlsx
  http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx
  http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx

[OPTIONAL] (Install R first and then) R scripts for power analysis:
https://waseda.box.com/SIGIR2016PACK

Page 4: ICTIR2016tutorial

TUTORIAL OUTLINE
1. Significance testing basics and limitations

1.1 Preliminaries

1.2 How the t-test works

1.3 T-test with Excel and R (hands-on)

1.4 How ANOVA works

1.5 ANOVA with Excel and R (hands-on)

1.6 What's wrong with significance tests?

1.7 Significance tests in the IR literature, or lack thereof

2. Using the Excel topic set size design tools

2.1 Topic set sizes in IR

2.2 Topic set size design

<30min coffee break>

2.3 With paired t-tests (hands-on)

2.4 With one-way ANOVA (hands-on)

2.5 With confidence intervals (hands-on)

2.6 Estimating the variance (hands-on)

2.7 How much pilot data do we need?

3. Using the R power analysis scripts

3.1 Power analysis

3.2 With paired t-tests (hands-on)

3.3 With unpaired t-tests (hands-on)

3.4 With one-way ANOVA (hands-on)

3.5 With two-way ANOVA without replication (hands-on)

3.6 With two-way ANOVA (hands-on)

3.7 Overpowered and underpowered experiments in IR

4. Summary, a few additional remarks, and Q&A

(Time allocations shown in the slide margin: 30min / 70min / 20min / 50min / 10min; plus an Appendix.)

Page 5: ICTIR2016tutorial

1.1 Preliminaries (1)

• In IR experiments, we often compare sample means to guess if the population means are different.

• We often employ parametric tests (assume specific population distributions with parameters)

- paired and unpaired t-tests (comparing m=2 means)

- ANOVA (comparing m (>2) means)

one-way, two-way, two-way without replication

[Figure: an EXAMPLE n-topic × m-system score matrix; each column average is the sample mean for a system. The t-tests ask: are the two population means equal? ANOVA asks: are the m population means equal?]

Page 6: ICTIR2016tutorial

1.1 Preliminaries (2)

• H0 (the null hypothesis): the tentative assumption that all population means are equal

• test statistic t0: what you compute from the observed data – under H0, it should obey a known distribution (e.g. a t distribution)

• p-value: the probability of observing what you have observed (or something more extreme) assuming that H0 is true

Page 7: ICTIR2016tutorial

1.1 Preliminaries (3)

Reject H0 if p-value <= α; equivalently, reject H0 if the test statistic t0 falls beyond the critical value t(φ; α) (α/2 in each tail of the t distribution).

                                 Accept H0                 Reject H0
H0 is true (systems equivalent)  Correct conclusion (1-α)  Type I error (α)
H0 is false (systems different)  Type II error (β)         Correct conclusion (1-β)

Statistical power: the ability to detect real differences.

Page 8: ICTIR2016tutorial

1.1 Preliminaries (4)

                                 Accept H0                 Reject H0
H0 is true (systems equivalent)  Correct conclusion (1-α)  Type I error (α)
H0 is false (systems different)  Type II error (β)         Correct conclusion (1-β)

Statistical power: the ability to detect real differences.

Cohen's five-eighty convention: α=5%, 1-β=80% (β=20%), i.e., Type I errors are regarded as 4 times as serious as Type II errors. The ratio may be set depending on specific situations.

Page 9: ICTIR2016tutorial

1.1 Preliminaries (5)

For a continuous random variable x and its probability density function f(x) (how likely x is to take a particular value), the expectation of a function g(x) (including g(x) = x) is given by:

E[g(x)] = ∫ g(x) f(x) dx

Population mean μ = E[x]: the central position of x, were it observed an infinite number of times.
Population variance σ^2 = E[(x - μ)^2]: how x varies around the population mean.
Population standard deviation σ = √(σ^2).

Page 10: ICTIR2016tutorial

1.1 Preliminaries (6)

A normal distribution with population parameters (μ, σ^2) is denoted by N(μ, σ^2). For x obeying N(μ, σ^2), E[x] = μ and V[x] = σ^2.

Probability density function of a normal distribution:

f(x) = ( 1 / (√(2π) σ) ) exp( -(x - μ)^2 / (2σ^2) )

[Figure: the pdf of N(100, 20^2), i.e., μ = 100, σ = 20.]

Page 11: ICTIR2016tutorial

1.1 Preliminaries (7)

Standardisation: if x obeys N(μ, σ^2), then

z = (x - μ) / σ

obeys N(0, 1): population mean 0, population standard deviation 1. N(0, 1) is called the standard normal distribution.

Page 12: ICTIR2016tutorial

1.1 Preliminaries (8)

For random variables x, y, a function f(x, y) >= 0 that satisfies the following is called a joint probability density function:

Pr{ a <= x <= b, c <= y <= d } = ∫∫ f(x, y) dy dx.

Whereas, marginal probability density functions are defined as:

f1(x) = ∫ f(x, y) dy,  f2(y) = ∫ f(x, y) dx.

If the following holds for any (x, y), x and y are said to be independent:

f(x, y) = f1(x) f2(y).

Page 13: ICTIR2016tutorial

1.1 Preliminaries (9)

Reproductive property: if x1, …, xn are independent and obey N(μ1, σ1^2), …, N(μn, σn^2), then the linear combination Σi ai xi obeys

N( Σi ai μi, Σi ai^2 σi^2 )

(population mean Σi ai μi, population variance Σi ai^2 σi^2): adding normally distributed variables still gives you a normal distribution.

Page 14: ICTIR2016tutorial

1.1 Preliminaries (10)

Corollary: if we let ai = 1/n, μi = μ, σi = σ in 1.1 (9), i.e., if x1, …, xn are independent and obey N(μ, σ^2), then the sample mean

x̄ = (1/n) Σi xi obeys N(μ, σ^2/n),

and therefore (x̄ - μ) / (σ/√n) obeys N(0, 1).

Page 15: ICTIR2016tutorial

1.1 Preliminaries (11)

Sample mean: x̄ = (1/n) Σi xi
Sum of squares: S = Σi (xi - x̄)^2
Sample variance: V = S / (n-1)
Sample standard deviation: s = √V

If x1, …, xn are independent and obey N(μ, σ^2), then E[V] = σ^2 holds: the sample variance V is an unbiased estimator of the population variance.

cf. 2.5 (3): s is NOT an unbiased estimator of the population standard deviation.
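(A quick R sanity check of this unbiasedness claim — my own illustration, not from the original slides; base R only:)

# The sample variance V (denominator n-1) is unbiased for sigma^2
set.seed(1)
n <- 10; sigma2 <- 4
V <- replicate(100000, var(rnorm(n, mean = 0, sd = sqrt(sigma2))))
mean(V)        # close to sigma2 = 4: V is unbiased
mean(sqrt(V))  # noticeably below sigma = 2: s is NOT unbiased (cf. 2.5 (3))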

Page 16: ICTIR2016tutorial

1.1 Preliminaries (12)

If x1, …, xn are independent and obey some distribution (not necessarily normal) with population mean μ and population variance σ^2, then:

• Law of large numbers
As n approaches infinity, the sample mean x̄ approaches μ.
(It's a good thing to observe lots of data to estimate the population mean.)

• Central Limit Theorem
Provided that n is large, the distribution of x̄ can be approximated by N(μ, σ^2/n).
(If you have lots of observations, then the sample mean can be regarded as normally distributed, even if we don't know much about the individual random variables {xi}.)
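(A small R illustration of the CLT — my own addition; any non-normal distribution works, here the exponential, whose mean and variance are both 1:)

# Distribution of the sample mean of n = 50 exponential draws
set.seed(1)
xbar <- replicate(10000, mean(rexp(50, rate = 1)))
hist(xbar, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 1, sd = sqrt(1/50)), add = TRUE)  # N(mu, sigma^2/n)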

Page 17: ICTIR2016tutorial

1.1 Preliminaries (13)

If z1, …, zk are independent and obey N(0, 1), then the probability distribution that the following random variable obeys is called a chi-square distribution with φ = k degrees of freedom, denoted by χ^2(φ):

W = Σi zi^2.

The pdf of the above distribution is given by:

f(x) = x^(k/2 - 1) e^(-x/2) / ( 2^(k/2) Γ(k/2) )   for x > 0,

where Γ is the gamma function.

Page 18: ICTIR2016tutorial

1.1 Preliminaries (14)

If x1, …, xn obey N(μ, σ^2) independently, then Σi (xi - μ)^2 / σ^2 obeys χ^2(n) — a corollary from the previous slide, since (xi - μ)/σ obeys N(0, 1).

If x1, …, xn are independent and obey N(μ, σ^2), then:
(a) x̄ obeys N(μ, σ^2/n).
(b) x̄ and V are independent. [Nagata03] p.57
(c) S/σ^2 = (n-1)V/σ^2 obeys χ^2(n-1). [Nagata03] p.58

Page 19: ICTIR2016tutorial

1.1 Preliminaries (15)

If z obeys N(0, 1) and W obeys χ^2(φ), and they are independent, the probability distribution that the following random variable obeys is called a t distribution with φ degrees of freedom, denoted by t(φ):

t = z / √(W/φ).

IMPORTANT PROPERTY:
If x1, …, xn are independent and obey N(μ, σ^2), then, with the sample mean x̄ and sample variance V as defined in 1.1 (11),

t = (x̄ - μ) / √(V/n) obeys t(n-1).

Page 20: ICTIR2016tutorial

1.1 Preliminaries (16)

If W1 obeys χ^2(φ1) and W2 obeys χ^2(φ2), and they are independent, the probability distribution that the following random variable obeys is called an F distribution with (φ1, φ2) degrees of freedom, denoted by F(φ1, φ2):

F = (W1/φ1) / (W2/φ2).

IMPORTANT PROPERTY:
If x11, …, x1n1 obey N(μ1, σ^2) and x21, …, x2n2 obey N(μ2, σ^2), and they are all independent, then the ratio of the two sample variances

V1/V2 obeys F(n1-1, n2-1).


Page 22: ICTIR2016tutorial

1.2 How the t-test works (1) paired t-test

What does this sample tell us about the populations?

Page 23: ICTIR2016tutorial

1.2 How the t-test works (2) paired t-test

Comparing Systems X and Y over n topics, using (say) mean nDCG over the n topics.

ASSUMPTIONS:
x1, …, xn are independent and obey N(μ1, σ1^2);
y1, …, yn are independent and obey N(μ2, σ2^2).

Under these assumptions, the per-topic differences di = xi - yi obey N(μ1 - μ2, σd^2), where σd^2 = σ1^2 + σ2^2 (in Slide 1.1 (9), let a1 = 1, a2 = -1).

Page 24: ICTIR2016tutorial

1.2 How the t-test works (3) paired t-test

By 1.1 (10), d̄ = (1/n) Σi di obeys N(μ1 - μ2, σd^2/n); standardising as in 1.1 (7), (d̄ - (μ1 - μ2)) / (σd/√n) obeys N(0, 1).

We don't know the population variance σd^2, so use the sample variance instead: Vd = Σi (di - d̄)^2 / (n-1) is an unbiased estimator of σd^2 (1.1 (11)). Then

t = (d̄ - (μ1 - μ2)) / √(Vd/n)

obeys a t distribution with n-1 degrees of freedom, which is basically like the standard normal distribution (see also 1.1 (15)).

Page 25: ICTIR2016tutorial

1.2 How the t-test works (4) paired t-test

Hypotheses: H0: μ1 = μ2 (same population means: X and Y are equally effective) vs. H1: μ1 ≠ μ2 (two-sided test).

Since t = (d̄ - (μ1 - μ2)) / √(Vd/n) obeys t(n-1) under our assumptions, if we further assume H0: μ1 = μ2, then the test statistic

t0 = d̄ / √(Vd/n) obeys t(n-1).

Page 26: ICTIR2016tutorial

1.2 How the t-test works (5) paired t-test

Hypotheses: H0: μ1 = μ2 vs. H1: μ1 ≠ μ2. Under H0, t0 obeys t(n-1).

So if |t0| >= t(n-1; α) (the critical t value, α/2 in each tail), something highly unlikely has happened. We assumed H0 but that must have been wrong — reject H0!

H1 is probably true, with 100(1-α)% confidence. (α: significance criterion)

Page 27: ICTIR2016tutorial

1.2 How the t-test works (6) paired t-test

Using Excel to do a paired t-test:
- Reject H0 if |t0| >= t(n-1; α) = TINV(α, n-1) = T.INV.2T(α, n-1).
- P-value = TDIST(|t0|, n-1, 2) = T.DIST.2T(|t0|, n-1).

Blue areas under the curve (α/2 in each tail beyond ±t(n-1; α)): the probability of observing the data at hand or something more extreme, if H0 is true.

Page 28: ICTIR2016tutorial

1.2 How the t-test works (7) confidence intervals

From 1.2 (3), t = (d̄ - (μ1 - μ2)) / √(Vd/n) obeys t(n-1), so

Pr{ -t(n-1; α) <= t <= t(n-1; α) } = 1 - α.

Page 29: ICTIR2016tutorial

1.2 How the t-test works (8) confidence intervals

From 1.2 (3), Pr{ d̄ - MOE <= μ1 - μ2 <= d̄ + MOE } = 1 - α, where MOE = t(n-1; α) √(Vd/n) is the margin of error.

So the 95% CI for the difference in means is given by:

d̄ ± t(n-1; 0.05) √(Vd/n).

Different samples yield different CIs. 95% of the CIs will capture the true difference in means.
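(In R, t.test() reports this CI directly — a one-line sketch of my own, assuming x and y are the paired per-topic score vectors:)

t.test(x, y, paired = TRUE, conf.level = 0.95)$conf.int  # 95% CI for mu1 - mu2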

Page 30: ICTIR2016tutorial

1.2 How the t-test works (9) unpaired t-test

Page 31: ICTIR2016tutorial

1.2 How the t-test works (10) unpaired t-test

Comparing Systems X and Y, based on a sample of size n1 for X and another sample of size n2 for Y.

ASSUMPTIONS: the above observations are all independent; X's scores obey N(μ1, σ^2) and Y's scores obey N(μ2, σ^2), with equal variance (homoscedasticity). The t-test is quite robust to violations of this assumption [Sakai16SIGIRshort]; cf. 1.2 (15).

Page 32: ICTIR2016tutorial

1.2 How the t-test works (11) unpaired t-test

Under the assumptions, it is known that

t = ( x̄1 - x̄2 - (μ1 - μ2) ) / √( V (1/n1 + 1/n2) ) obeys t(n1 + n2 - 2),

where V = (S1 + S2) / (n1 + n2 - 2) is the pooled variance (S1, S2: the sums of squares of the two samples).

Page 33: ICTIR2016tutorial

1.2 How the t-test works (12) unpaired t-test

Hypotheses: H0: μ1 = μ2 (same population means: X and Y are equally effective) vs. H1: μ1 ≠ μ2 (two-sided test).

Since t obeys t(n1 + n2 - 2) under our assumptions, if we further assume H0: μ1 = μ2, then the test statistic

t0 = ( x̄1 - x̄2 ) / √( V (1/n1 + 1/n2) ) obeys t(n1 + n2 - 2).

Page 34: ICTIR2016tutorial

1.2 How the t-test works (13) unpaired t-test

Hypotheses: H0: μ1 = μ2 vs. H1: μ1 ≠ μ2. Under H0, t0 obeys t(φ), where φ = n1 + n2 - 2.

So if |t0| >= t(φ; α) (the critical t value, α/2 in each tail), something highly unlikely has happened. We assumed H0 but that must have been wrong — reject H0!

H1 is probably true, with 100(1-α)% confidence. (α: significance level)

Page 35: ICTIR2016tutorial

1.2 How the t-test works (14) unpaired t-test

Using Excel to do an unpaired t-test (φ = n1 + n2 - 2):
- Reject H0 if |t0| >= t(φ; α) = TINV(α, φ) = T.INV.2T(α, φ).
- P-value = TDIST(|t0|, φ, 2) = T.DIST.2T(|t0|, φ).

Blue areas under the curve (α/2 in each tail): the probability of observing the data at hand or something more extreme, if H0 is true.

Page 36: ICTIR2016tutorial

1.2 How the t-test works (15) unpaired t-test

• Unpaired (i.e., two-sample) t-tests:
- Student's t-test: equal variance assumption.
- Welch's t-test: no equal variance assumption, but involves approximations – use this if (1) the two sample sizes are very different AND (2) the two sample variances are very different [Sakai16SIGIRshort].

The Welch t-statistic and its degrees of freedom:

t0 = ( x̄1 - x̄2 ) / √( V1/n1 + V2/n2 ),
φ* = ( V1/n1 + V2/n2 )^2 / ( (V1/n1)^2/(n1-1) + (V2/n2)^2/(n2-1) ).

Page 37: ICTIR2016tutorial

1.2 How the t-test works (16) effect sizes

An effect size here is a difference measured in standard deviation units.

Paired data [Sakai14SIGIRForm]: effect size d = d̄ / √Vd (the mean delta over the standard deviation of the deltas).

Unpaired data: effect size d = ( x̄1 - x̄2 ) / √V, where V is the pooled variance.

WARNING: different books define "Cohen's d" differently [Okubo12]; cf. Hedges' g, Glass's Δ.
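(A small R sketch of these two estimators — my own illustration, following the definitions above:)

# Paired: mean delta over the sd of the deltas
cohens_d_paired <- function(x, y) mean(x - y) / sd(x - y)

# Unpaired: difference of means over the pooled standard deviation
cohens_d_unpaired <- function(x, y) {
  S1 <- sum((x - mean(x))^2)
  S2 <- sum((y - mean(y))^2)
  V  <- (S1 + S2) / (length(x) + length(y) - 2)  # pooled variance
  (mean(x) - mean(y)) / sqrt(V)
}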


Page 39: ICTIR2016tutorial

1.3 T-test with Excel and R (hands-on) (1)

- Sample topic-by-run matrix:

https://waseda.box.com/20topics3runs

The easiest way to obtain the p-values (the third TTEST argument "2" means two-sided):

Paired t-test: = TTEST(A1:A20,B1:B20,2,1) = 0.2058
Unpaired, Student's t-test: = TTEST(A1:A20,B1:B20,2,2) = 0.5300
Unpaired, Welch's t-test: = TTEST(A1:A20,B1:B20,2,3) = 0.5302

Topic-by-run matrix (20 topics × runs A, B, C):
0.4695 0.3732 0.3575
0.2813 0.3783 0.2435
0.3914 0.3868 0.3167
0.6884 0.5896 0.6024
0.6121 0.4725 0.4766
0.3266 0.233 0.2429
0.5605 0.4328 0.4066
0.5916 0.5073 0.4707
0.4385 0.3889 0.3384
0.5821 0.5551 0.4597
0.2871 0.3274 0.2769
0.5186 0.5066 0.4066
0.5188 0.5198 0.3859
0.5019 0.4981 0.4568
0.4702 0.3878 0.3437
0.329 0.4387 0.2649
0.4758 0.4946 0.4045
0.3028 0.34 0.3253
0.3752 0.4895 0.3205
0.2796 0.2335 0.224

But this makes you treat the t-test as a black box. To obtain the test statistic, degrees of freedom etc., let's do it "by hand"...

Page 40: ICTIR2016tutorial

1.3 T-test with Excel and R (hands-on) (2)

Paired t-test "by hand" (columns A, B, C hold the runs; column D holds the per-topic deltas):

Column D: = A1 - B1 (and so on down the column)
d̄ = AVERAGE(D1:D20) = 0.022375
Vd = DEVSQ(D1:D20)/(20-1) = 0.005834
t0 = d̄ / √(Vd/20) = 1.3101
P-value = T.DIST.2T(|t0|, 19) = 0.2058.

Data (columns A, B, C, and D = A - B):

0.4695 0.3732 0.3575 0.0963

0.2813 0.3783 0.2435 -0.097

0.3914 0.3868 0.3167 0.0046

0.6884 0.5896 0.6024 0.0988

0.6121 0.4725 0.4766 0.1396

0.3266 0.233 0.2429 0.0936

0.5605 0.4328 0.4066 0.1277

0.5916 0.5073 0.4707 0.0843

0.4385 0.3889 0.3384 0.0496

0.5821 0.5551 0.4597 0.027

0.2871 0.3274 0.2769 -0.0403

0.5186 0.5066 0.4066 0.012

0.5188 0.5198 0.3859 -0.001

0.5019 0.4981 0.4568 0.0038

0.4702 0.3878 0.3437 0.0824

0.329 0.4387 0.2649 -0.1097

0.4758 0.4946 0.4045 -0.0188

0.3028 0.34 0.3253 -0.0372

0.3752 0.4895 0.3205 -0.1143

0.2796 0.2335 0.224 0.0461

Page 41: ICTIR2016tutorial

1.3 T-test with Excel and R (hands-on) (3)

Unpaired, Student's t-test "by hand" (same 20-topic matrix):

x̄1 - x̄2 = AVERAGE(A1:A20) - AVERAGE(B1:B20) = 0.022375
S1 = DEVSQ(A1:A20) = 0.291139
S2 = DEVSQ(B1:B20) = 0.182445
Pooled variance V = (S1 + S2)/(20 + 20 - 2) = 0.012463
t0 = (x̄1 - x̄2) / √( V(1/20 + 1/20) ) = 0.6338
P-value = T.DIST.2T(|t0|, 38) = 0.5300.

Page 42: ICTIR2016tutorial

1.3 T-test with Excel and R (hands-on) (4)

Unpaired, Welch's t-test "by hand" (same 20-topic matrix):

V1 = DEVSQ(A1:A20)/(20-1) = 0.015323
V2 = DEVSQ(B1:B20)/(20-1) = 0.009602
t0 = (x̄1 - x̄2) / √( V1/20 + V2/20 ) = 0.6338
φ* = ( V1/20 + V2/20 )^2 / ( (V1/20)^2/19 + (V2/20)^2/19 ) = 36.0985
P-value = T.DIST.2T(|t0|, φ*) = 0.5302.

Page 43: ICTIR2016tutorial

1.3 T-test with Excel and R (hands-on) (5)

Page 44: ICTIR2016tutorial

1.3 T-test with Excel and R (hands-on) (6)

Compare with the Excel results.

Page 45: ICTIR2016tutorial

1.3 T-test with Excel and R (hands-on) (7)

Also try:

R uses Welch as the default!

Compare with the Excel results.
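(Slides (5)-(7) showed screenshots of the R session. A minimal sketch of the corresponding commands — the file name 20topics3runs.txt and its whitespace-separated format are my assumptions:)

# Read the 20-topic x 3-run matrix into columns A, B, C
scores <- read.table("20topics3runs.txt", col.names = c("A", "B", "C"))

t.test(scores$A, scores$B, paired = TRUE)     # paired: p = 0.2058
t.test(scores$A, scores$B, var.equal = TRUE)  # Student: p = 0.5300
t.test(scores$A, scores$B)                    # Welch, R's default: p = 0.5302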


Page 47: ICTIR2016tutorial

1.4 How ANOVA works (1)

ANOVA can ask: "Are ALL systems equally effective?" when there are m (>2) systems. In this tutorial, let's first consider the two simplest types of ANOVA.

One-way ANOVA with an equal number of replicates (generalises the unpaired t-test):

System  Per-topic performances
1       x11, x12, … , x1n
2       x21, x22, … , x2n
3       x31, x32, … , x3n

Two-way ANOVA without replication (generalises the paired t-test; if xi corresponds to yi and zi, this should be preferred over one-way ANOVA):

Topic→      1    2    …  n
System 1    x11  x12  …  x1n
System 2    y21  y22  …  y2n
System 3    z31  z32  …  z3n

Page 48: ICTIR2016tutorial

1.4 How ANOVA works (2) one-way ANOVA

xij (i = 1, …, m; j = 1, …, n): score of the i-th system for topic j.

ASSUMPTIONS: the xij are independent and obey N(μi, σ^2) (homoscedasticity: equal variance across systems), or, equivalently,

xij = μi + εij, where the εij independently obey N(0, σ^2).

Let μ = (1/m) Σi μi (population grand mean) and ai = μi - μ (i-th system effect).

Then it is easy to show that Σi ai = 0.

Page 49: ICTIR2016tutorial

1.4 How ANOVA works (3) one-way ANOVA

Hypotheses:
H0: a1 = a2 = … = am = 0 (ALL population means are equal).
H1: at least one of the system effects is non-zero.

Let x̄ = (1/mn) Σij xij (sample grand mean) and x̄i = (1/n) Σj xij (sample system mean).

Note that

xij - x̄ = (x̄i - x̄) + (xij - x̄i):

(diff between score and grand mean) = (diff between system mean and grand mean) + (diff between score and system mean).

Page 50: ICTIR2016tutorial

1.4 How ANOVA works (4) one-way ANOVA

Similarly, ST = SA + SE holds, where

ST = Σij (xij - x̄)^2 (total variation),
SA = n Σi (x̄i - x̄)^2 (between-system variation),
SE = Σij (xij - x̄i)^2 (within-system variation).

Page 51: ICTIR2016tutorial

1.4 How ANOVA works (5) one-way ANOVA

ST = SA + SE; degrees of freedom (how accurate is each sum of squares?): φA = m-1, φE = m(n-1), φT = mn-1 = φA + φE.

Under the i.i.d. and normality assumptions on the εij:
(a) SE/σ^2 obeys χ^2(φE) (1.1 (14)(c));
(b) under H0 (ai = 0), SA/σ^2 obeys χ^2(φA), and SA and SE are independent (1.1 (14)(c), 1.1 (10)).

So, under H0 (ai = 0), we can form the F ratio on the next slide.

Page 52: ICTIR2016tutorial

1.4 How ANOVA works (6) one-way ANOVA

ST = SA + SE; φT = φA + φE, with φA = m-1 and φE = m(n-1).

⇒ Under H0, by 1.1 (16),

F0 = (SA/φA) / (SE/φE) = VA/VE obeys F(φA, φE).

Is the between-system variation large compared to the within-system variation?

Page 53: ICTIR2016tutorial

1.4 How ANOVA works (7) one-way ANOVA

Hypotheses:
H0: a1 = … = am = 0; H1: at least one of the system effects is non-zero.

Test statistic: F0 = VA/VE (SE from 1.4 (4); φA = m-1, φE = m(n-1)).

Reject H0 if F0 >= F(φA, φE; α), the critical F value with an upper tail of probability α.

[Figure: F(φA, φE) densities for m=3, n=10; m=5, n=10; m=20, n=10.]

Page 54: ICTIR2016tutorial

1.4 How ANOVA works (8) one-way ANOVA

Source          Sum of squares  Degrees of freedom  Mean squares               F0
Between-system  SA              φA = m-1            VA = SA/φA = SA/(m-1)      VA/VE = m(n-1)SA / ((m-1)SE)
Within-system   SE              φE = m(n-1)         VE = SE/φE = SE/(m(n-1))
Total           ST              φT = mn-1

- Reject H0 if F0 >= F(φA, φE; α) = F.INV.RT(α, φA, φE)
- P-value = F.DIST.RT(F0, φA, φE)

If n varies across the m systems, let φE = (total #observations) – m.

Page 55: ICTIR2016tutorial

1.4 How ANOVA works (9) one-way ANOVA

Effect sizes for one-way ANOVA [Okubo12]: how much of the total variance can be accounted for by the between-system variance?

The simplest estimator of the population effect size from a sample: η̂^2 = SA/ST.

More accurate estimators in [Okubo12, Sakai14SIGIRforum].

Page 56: ICTIR2016tutorial

1.4 How ANOVA works (10) two-way ANOVA w/o replication

Data: an m×n system-by-topic matrix, one score xij per system i and topic j (see 1.4 (1)).

ASSUMPTIONS: the εij are independent and obey N(0, σ^2) (homoscedasticity), where

xij = μ + ai + bj + εij (Σi ai = 0, Σj bj = 0):

system and topic effects are additive and linearly related to xij.

Sample grand mean: x̄ = (1/mn) Σij xij; sample system mean: x̄i· = (1/n) Σj xij; sample topic mean: x̄·j = (1/m) Σi xij.

Page 57: ICTIR2016tutorial

1.4 How ANOVA works (11) two-way ANOVA w/o replication

Hypothesis for the system effects: H0: a1 = … = am = 0 vs. H1: at least one ai differs from zero.
Hypothesis for the topic effects: H0: b1 = … = bn = 0 vs. H1: at least one bj differs from zero.

Note that

xij - x̄ = (x̄i· - x̄) + (x̄·j - x̄) + (xij - x̄i· - x̄·j + x̄):

(diff between score and grand mean) = (diff between system mean and grand mean) + (diff between topic mean and grand mean) + (the rest). (cf. the corresponding decomposition for one-way ANOVA in 1.4 (3).)

Page 58: ICTIR2016tutorial

1.4 How ANOVA works (12) two-way ANOVA w/o replication

Similarly, ST = SA + SB + SE holds, where

ST = Σij (xij - x̄)^2 (total variation),
SA = n Σi (x̄i· - x̄)^2 (between-system variation),
SB = m Σj (x̄·j - x̄)^2 (between-topic variation),
SE = Σij (xij - x̄i· - x̄·j + x̄)^2 (residual).

(cf. SE was the within-system variation in one-way ANOVA, 1.4 (4).)

Page 59: ICTIR2016tutorial

1.4 How ANOVA works (13) two-way ANOVA w/o replication

ST = SA + SB + SE; φT = φA + φB + φE, with φA = m-1, φB = n-1, φE = (m-1)(n-1).

Hypotheses for the system effects: H0: all ai = 0 vs. H1: at least one differs.
Under H0, F0 = VA/VE obeys F(φA, φE).

Hypotheses for the topic effects: H0: all bj = 0 vs. H1: at least one differs.
Under H0, F0 = VB/VE obeys F(φB, φE).

Page 60: ICTIR2016tutorial

1.4 How ANOVA works (14) two-way ANOVA w/o replication

Hypotheses (for system effects):
H0: all ai = 0; H1: at least one of the system effects is non-zero.

Test statistic: F0 = VA/VE (SE from 1.4 (12); φA = m-1, φE = (m-1)(n-1)).

Reject H0 if F0 >= F(φA, φE; α).

For topic effects, use SB and φB instead of SA and φA.

[Figure: F(φA, φE) densities for m=3, n=10; m=5, n=10; m=20, n=10.]

Page 61: ICTIR2016tutorial

1.4 How ANOVA works (15) two-way ANOVA w/o replication

Source          Sum of squares  Degrees of freedom   Mean squares                  F0
Between-system  SA              φA = m-1             VA = SA/φA = SA/(m-1)         VA/VE = (n-1)SA/SE
Between-topic   SB              φB = n-1             VB = SB/φB = SB/(n-1)         VB/VE = (m-1)SB/SE
Residual        SE              φE = (m-1)(n-1)      VE = SE/φE = SE/((m-1)(n-1))
Total           ST              φT = mn-1

For system effects:
- Reject H0 if F0 >= F(φA, φE; α) = F.INV.RT(α, φA, φE)
- P-value = F.DIST.RT(F0, φA, φE)

Page 62: ICTIR2016tutorial

1.4 How ANOVA works (16) two-way ANOVA

ST = SA + SB + SAxB + SE; φT = φA + φB + φAxB + φE.

• Two factors, A and B.
• Each of the m×n cells contains r observations xij1, …, xijr (total #observations N = mnr).
• The interaction between A and B is considered.

Not discussed in detail in this tutorial, as this design is rare in system-based evaluation.

[Figure: interaction plots of score vs. A level, one curve per B level. With no interaction the curves are parallel; with an interaction, e.g., the score seems high only if the A level is high AND the B level is high.]

Page 63: ICTIR2016tutorial

1.4 How ANOVA works (17) two-way ANOVA

Source  Sum of squares  Degrees of freedom    Mean squares        F0
A       SA              φA = m-1              VA = SA/φA          VA/VE
B       SB              φB = n-1              VB = SB/φB          VB/VE
AxB     SAxB            φAxB = (m-1)(n-1)     VAxB = SAxB/φAxB    VAxB/VE
Residual SE             φE = mn(r-1)          VE = SE/φE
Total   ST              φT = mnr-1

P-values: F.DIST.RT(F0, φA, φE), F.DIST.RT(F0, φB, φE), and F.DIST.RT(F0, φAxB, φE), respectively.

ST = SA + SB + SAxB + SE; φT = φA + φB + φAxB + φE. Definitions of SAxB and SE for two-way ANOVA can be found in textbooks.

Page 64: ICTIR2016tutorial

1.4 How ANOVA works (18)

Effect sizes for two-way ANOVA with and without replication [Okubo12]: how much of the total variance does the between-system variance account for?

without replication: ST = SA + SB + SE
with replication: ST = SA + SB + SAxB + SE

Simplest estimator from a sample: η̂A^2 = SA/ST; a partial version, with the variances we're not interested in removed from the denominator, is more accurate.

More accurate estimators in [Okubo12, Sakai14SIGIRforum].


Page 66: ICTIR2016tutorial

1.5 ANOVA with Excel and R (1) one-way ANOVA

(Same 20-topic × 3-run matrix A, B, C as in 1.3.)

ST = DEVSQ(A1:C20) = 0.726229
SE = DEVSQ(A1:A20) + DEVSQ(B1:B20) + DEVSQ(C1:C20) = 0.650834
SA = ST – SE = 0.075395

Page 67: ICTIR2016tutorial

1.5 ANOVA with Excel and R (2) one-way ANOVA

Source          Sum of squares  Degrees of freedom  Mean squares            F0
Between-system  SA = 0.075395   φA = m-1 = 2        VA = SA/φA = 0.037697   VA/VE = 3.3015
Within-system   SE = 0.650834   φE = m(n-1) = 57    VE = SE/φE = 0.011418
Total           ST = 0.726229

P-value = F.DIST.RT( F0, φA, φE ) = 0.0440

Page 68: ICTIR2016tutorial

1.5 ANOVA with Excel and R (3) one-way ANOVA

(The same 20-topic × 3-run matrix: the data that we used for the t-test.)

Page 69: ICTIR2016tutorial

1.5 ANOVA with Excel and R (4) one-way ANOVA

(R screenshot in the original slides.) Compare with the Excel results.
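(A sketch of the R command behind the screenshot, reusing the scores data frame from 1.3; the long-format reshaping is my own:)

# One score per (topic, system) row
long <- data.frame(score  = c(scores$A, scores$B, scores$C),
                   system = factor(rep(c("A", "B", "C"), each = 20)),
                   topic  = factor(rep(1:20, times = 3)))
summary(aov(score ~ system, data = long))  # F0 = 3.30, p = 0.044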

Page 70: ICTIR2016tutorial

1.5 ANOVA with Excel and R (5) two-way ANOVA w/o replication

ST = DEVSQ(A1:C20) = 0.726229 (cf. 1.5 (1))
SA = 20*((0.4501-0.4146)^2 + (0.4277-0.4146)^2 + (0.3662-0.4146)^2) = 0.075395
  (system means 0.4501, 0.4277, 0.3662; grand mean 0.4146)
SB = 0.579826 (computed analogously from the 20 topic means)
SE = ST – SA – SB = 0.071008

Page 71: ICTIR2016tutorial

1.5 ANOVA with Excel and R (6) two-way ANOVA w/o replication

Source          Sum of squares  Degrees of freedom     Mean squares            F0
Between-system  SA = 0.075395   φA = m-1 = 2           VA = SA/φA = 0.037697   VA/VE = 20.1737
Between-topic   SB = 0.579826   φB = n-1 = 19          VB = SB/φB = 0.030517   VB/VE = 16.3312
Residual        SE = 0.071008   φE = (m-1)(n-1) = 38   VE = SE/φE = 0.001869
Total           ST = 0.726229

P-value (system) = F.DIST.RT( F0, φA, φE ) = 1.070E-06
P-value (topic) = F.DIST.RT( F0, φB, φE ) = 8.173E-13

Page 72: ICTIR2016tutorial

1.5 ANOVA with Excel and R (7) two-way ANOVA w/o replication

(Same matrix; Excel screenshot in the original slides.)

Page 73: ICTIR2016tutorial

1.5 ANOVA with Excel and R (8) two-way ANOVA w/o replication

(Same matrix; screenshot in the original slides.)

Page 74: ICTIR2016tutorial

1.5 ANOVA with Excel and R (9) two-way ANOVA w/o replication

(Same matrix; R screenshot in the original slides.) Compare with the Excel results.
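(Likewise, a sketch of the two-way-without-replication command, reusing the long data frame built above:)

summary(aov(score ~ system + topic, data = long))
# system: F0 = 20.17, p = 1.07e-06; topic: F0 = 16.33, p = 8.17e-13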


Page 76: ICTIR2016tutorial

1.6 What's wrong with significance tests? (1) [Johnson99]

• Deming (1975) commented that the reason students have problems understanding hypothesis tests is that they may be trying to think.

• Carver (1978) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else.

• Cohen (1994:997) noted that statistical testing of the null hypothesis "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!"

Page 77: ICTIR2016tutorial

1.6 What's wrong with significance tests? (2)

• We want to know P(H|D), but classical significance testing only gives us something like P(D|H). (H: hypothesis, D: data. Alternative: Bayesian statistics etc.)

• Reporting α (e.g. 0.05) instead of the actual p-values leads to dichotomous thinking ("significant or not?").

• Even if p-values are reported, p-values reflect not only the effect size (the magnitude of the actual difference; see 1.2 (16)) but also the sample size:

p-value = f( sample_size, effect_size )

large effect size ⇒ small p-value
large sample size ⇒ small p-value

Anything can be made statistically significant by using lots of data.

Page 78: ICTIR2016tutorial

1.6 What's wrong with significance tests? (3) [Sakai14SIGIRForum]

So what should we do?

Whenever using a classical significance test, report not only p-values, but also effect sizes (the difference between two systems measured in standard deviation units) and confidence intervals.

Page 79: ICTIR2016tutorial

1.6 What's wrong with significance tests? (4) [Sakai14SIGIRForum]

Actually, if you want p-values for every system pair, you can apply the randomised Tukey HSD test [Carterette12, Sakai14PROMISE] WITHOUT doing ANOVA.

More accurate estimators of the effect sizes: cf. 1.4 (18).

Page 80: ICTIR2016tutorial

1.6 What's wrong with significance tests? (5)

Randomised Tukey HSD test for m >= 2 systems:
http://research.nii.ac.jp/ntcir/tools/discpower-en.html

• Input: a topic-by-run score matrix.
• Can be used to compute p-values for 2 or more systems.
• Unlike classical tests, it does not rely on assumptions such as normality.
• It is a kind of multiple comparison procedure (free from the familywise error rate problem).
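(The discpower tool above is the reference implementation; the following simplified R sketch of the randomisation idea is my own:)

# Randomised Tukey HSD sketch: permute scores within each topic (row),
# record the max difference among system means, and compare each observed
# pairwise difference against this null distribution.
rand_tukey_hsd <- function(mat, B = 10000) {
  obs <- outer(colMeans(mat), colMeans(mat), "-")
  maxdiff <- replicate(B, {
    perm <- t(apply(mat, 1, sample))  # within-topic permutation
    diff(range(colMeans(perm)))       # max - min of permuted system means
  })
  p <- sapply(abs(obs), function(d) mean(maxdiff >= d))
  matrix(p, ncol = ncol(mat), dimnames = dimnames(obs))
}
# Example: rand_tukey_hsd(as.matrix(scores), B = 10000)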


Page 82: ICTIR2016tutorial

1.7 Significance tests in the IR literature, or lack thereof (1) [Sakai16SIGIR]

Page 83: ICTIR2016tutorial

1.7 Significance tests in the IR literature, or lack thereof (2) [Sakai16SIGIR]

Page 84: ICTIR2016tutorial

1.7 Significance tests in the IR literature, or lack thereof (3) [Sakai16SIGIR]

Page 85: ICTIR2016tutorial

1.7 Significance tests in the IR literature, or lack thereof (4) [Sakai16SIGIR]

Page 86: ICTIR2016tutorial

1.7 Significance tests in the IR literature, or lack thereof (5) [Sakai16SIGIR]


Page 88: ICTIR2016tutorial

2.1 Topic set sizes in IR (1) [Sakai16IRJ]

According to Sparck Jones and Van Rijsbergen [SparckJones75],

fewer than 75 topics “are of no real value”;

250 topics “are minimally acceptable”;

more than 1000 topics “are needed for some purposes”

because “real collections are large”; “statistically significant results are desirable” and “scaling up must be studied.”

Page 89: ICTIR2016tutorial

2.1 Topic set sizes in IR (2) [Sakai16IRJ]

In 1979, in a report that considered the number of relevance assessments required from a statistical viewpoint, Gilbert and Sparck Jones remarked [Gilbert79]:

“Since there is some doubt about the feasibility of getting 1000 requests, or the convenience of such a large set for future experiments, we consider 500 requests.”

Page 90: ICTIR2016tutorial

2.1 Topic set sizes in IR (3)

The default topic set size at TREC: 50.

Exceptions include the Million Query Track, which created 1800+ topics [Carterette08], but creating a "reusable" test collection was not the objective of that track.

Round Documents Topics

TREC-1 disks 1 + 2 51-100

TREC-2 disks 1 + 2 101-150

TREC-3 disks 1 + 2 151-200

TREC-4 disks 2 + 3 201-250

TREC-5 disks 2 + 4 251-300

TREC-6 disks 4 + 5 301-350

TREC-7 disks 4 + 5 351-400

TREC-8 disks 4 + 5 401-450

Early TREC ad hoc tasks and topics[Voorhees05, p.24]

Page 91: ICTIR2016tutorial

2.1 Topic set sizes in IR (4) [Sakai16IRJ]

In 2009, Voorhees conducted an experiment where she randomly split 100 TREC topics in half to count discrepancies in statistically significant results, and concluded that

“Fifty-topic sets are clearly too small to have confidence in a conclusion when using a measure as unstable as P(10). Even for stable measures, researchers should remain skeptical of conclusions demonstrated on only a single test collection.” [Voorhees09]

[Figure: the 100 TREC-7+8 topics (evaluated with TREC 2004 robust track systems) are randomly split into two 50-topic sets; the paired t-test on one set says System A > B while the other set says System A < B — a conflict.]

But if randomised Tukey HSD (i.e. a multiple comparison procedure) is used for filtering system pairs,discrepancies across test collections almost never occur [Sakai16ICTIR].

Page 92: ICTIR2016tutorial

2.1 Topic set sizes in IR (5)

At CIKM 2008, [Webber08] pointed out that the topic set size should be determined based on the required statistical power.

                                 Accept H0                 Reject H0
H0 is true (systems equivalent)  Correct conclusion (1-α)  Type I error (α)
H0 is false (systems different)  Type II error (β)         Correct conclusion (1-β)

Statistical power: the ability to detect real differences.

Page 93: ICTIR2016tutorial

2.1 Topic set sizes in IR (6)

The approach of [Webber08]:

• Incremental test collection building – adding topics with relevance assessments one by one until the desired power is achieved;

• Considered the t-test without addressing the familywise error rate problem;

• Estimated the variance of score deltas using non-standard methods;

We want a more straightforward answer to “How many topics should I create?”

In addition to the t-test, we can consider one-way ANOVA and confidence intervals as the basis.

Residual variances from ANOVA are unbiased estimators of the within-system variances.


Page 95: ICTIR2016tutorial

2.2 Topic set size design (1) [Sakai16IRJ]

• Provides answers to the following question:

“I’m building a new test collection. How many topics should I create?”

• A prerequisite: a small topic-by-run score matrix based on pilot data, for estimating within-system variances.

• Three approaches (with easy-to-use Excel tools), based on:

(1) paired t-test power

(2) one-way ANOVA power

(3) confidence interval width upper bound.

Page 96: ICTIR2016tutorial

2.2 Topic set size design (2) [Sakai16IRJ]

Test collection designs should evolve based on past data:

[Diagram: a topic-by-run score matrix from pilot data (n0 topics × m runs; about 25 topics with runs from a few teams is probably sufficient [Sakai16EVIA]) → estimate n1 for TREC 201X based on the within-system variance estimate → the TREC 201X matrix (n1 topics) yields a more accurate variance estimate → estimate n2 for TREC 201(X+1), and so on.]

Page 97: ICTIR2016tutorial

2.2 Topic set size design (3) [Sakai16IRJ]

Method: paired t-test. Input required: α (Type I error probability), β (Type II error probability), minDt (minimum detectable difference: whenever the diff between two systems is this much or larger, we want to guarantee 100(1-β)% power), and σ̂d^2: a variance estimate for the score delta.

Method: one-way ANOVA. Input required: α, β, m (number of systems), minD (minimum detectable range: whenever the diff between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power), and σ̂^2: an estimate of the within-system variance under the homoscedasticity assumption.

Method: confidence intervals. Input required: α, δ (CI width upper bound: you want the CI for the diff between any system pair to be this much or smaller), and σ̂d^2: a variance estimate for the score delta.

Page 98: ICTIR2016tutorial

2.2 Topic set size design (4) [Sakai16IRJ]

In practice, you can deduce t-test-based and CI-based results from ANOVA-based results:
- ANOVA-based results for m=2 can be used instead of t-test-based results;
- ANOVA-based results for m=10 can be used instead of CI-based results.

Caveat: the ANOVA-based tool can only handle (α, β) = (0.05, 0.20), (0.01, 0.20), (0.05, 0.10), (0.01, 0.10).


Page 101: ICTIR2016tutorial

2.3 Paired t-tests (1)

Example situation: you plan to compare a system pair with the paired t-test at α = 5%. You plan to use nDCG as the primary evaluation measure, and want to guarantee 80% power whenever the diff between the two systems is >= minDt. You know from pilot data that the variance of the nDCG delta is around σ̂d^2.

What is the required number of topics n?

(Input required: α, β, minDt, and σ̂d^2 — see 2.2 (3).)

Page 102: ICTIR2016tutorial

2.3 Paired t-tests (2)

Notations (some slightly different from Part 1):
t: a random variable that obeys t(φ), where φ = n-1;
t(φ; α): the two-sided critical t value for significance criterion α (α/2 in each tail) = T.INV.2T(α, φ).

Page 103: ICTIR2016tutorial

2.3 Paired t-tests (3)

Under our assumptions, t = (d̄ - (μ1 - μ2)) / √(Vd/n) obeys t(n-1).

In a t-test, we let H0: μ1 = μ2 and consider t0 = d̄ / √(Vd/n). Due to the t-test procedure, regardless of what t0 obeys, the probability of rejecting H0 is

Pr{ |t0| >= t(φ; α) }.

Page 104: ICTIR2016tutorial

2.3 Paired t-tests (4)

Regardless of what t0 obeys, the probability of rejecting H0 is

Pr{ |t0| >= t(φ; α) } ... (a)

If H0 is true, then t0 obeys t(n-1) and (a) is exactly α — that's how t(φ; α) was defined. (Rejecting the correct hypothesis H0.)

Alternatively, if H1 is true, the distribution that t0 obeys is known as a noncentral t distribution with φ degrees of freedom, and (a) is exactly the power, 1-β. (Rejecting the incorrect hypothesis H0.)

Page 105: ICTIR2016tutorial

2.3 Paired t-tests (5)

If H0 is true (systems are equivalent), t0 obeys a (central) t distribution: accepting H0 is the correct conclusion (1-α); rejecting it is a Type I error (α).
If H0 is false (systems are different), t0 obeys a noncentral t distribution: accepting H0 is a Type II error (β); rejecting it is the correct conclusion (1-β) = (a).

Page 106: ICTIR2016tutorial

2.3 Paired t-tests (6)

If H1 is true, the distribution that t0 obeys is known as a noncentral t distribution with φ degrees of freedom, and (a) is exactly the power, 1-β.

The noncentral t distribution in fact has another parameter called the noncentrality parameter λt:

λt = √n Δt, where Δt = (μ1 - μ2)/σd is the population effect size and σd^2 is the population variance of the score differences (see 1.2 (2)).

Page 107: ICTIR2016tutorial

2.3 Paired t-tests (7)

If H1 is true, t0 obeys a noncentral t distribution with φ degrees of freedom and noncentrality parameter λt, and (a) is exactly the power:

Power = Pr{ |t0| >= t(φ; α) } ... (a)

We want to compute (a), but the computation involving the noncentral t distribution is too complex...

Page 108: ICTIR2016tutorial

2.3 Paired t-tests (8)

Fortunately, a good approximation is available [Nagata03] (Appendix, Theorem A'). Let
t': a random variable that obeys a noncentral t distribution with parameters (φ, λt);
u: a random variable that obeys the standard normal distribution.

Then Power = Pr{ |t0| >= t(φ; α) } ... (a) can be approximated in terms of u, as follows.

Page 109: ICTIR2016tutorial

2.3 Paired t-tests (9)

Power ≈ Pr{ u >= ( t(φ; α)(1 - 1/(4φ)) - λt ) / K } ... (a')

where K = √( 1 + t(φ; α)^2 / (2φ) ). (Theorem A')

Page 110: ICTIR2016tutorial

2.3 Paired t-tests (10)

Power = 1-β ≈ Pr{ u >= ( t(φ; α)(1 - 1/(4φ)) - λt ) / K } ... (a')

Now we know how to compute power given (α, Δt, n). But we want to compute n given (α, β, Δt).

Page 111: ICTIR2016tutorial

2.3 Paired t-tests (11)

Now we know how to compute power given (α, Δt, n). But we want to compute n given (α, β, Δt). Starting again with:

Power = Pr{ t0 >= t(φ; α) } + Pr{ t0 <= -t(φ; α) }   (Appendix, Theorem A)

Page 112: ICTIR2016tutorial

2.3 Paired t-tests (12)

Power = Pr{ t0 >= t(φ; α) } + Pr{ t0 <= -t(φ; α) }   (Theorem A)

If λt > 0, the second term is negligible — ignore it. (λt < 0 will lead to the same final result.)

Page 113: ICTIR2016tutorial

2.3 Paired t-tests (13)

Roughly, t' - λt can be treated as a standard normal variable u (cf. this is rougher than Theorem A'), so

Power ≈ Pr{ u >= t(φ; α) - λt }.

Setting this to 1-β gives: t(φ; α) - λt = z(1-β), the one-sided z value for probability 1-β.

Page 114: ICTIR2016tutorial

2.3 Paired t-tests (14)

When λt > 0 or λt < 0 (i.e., H1 is true):

λt = t(φ; α) - z(1-β) ≠ 0

(a two-sided t value minus a one-sided z value). Similarly, when λt = 0 (i.e., H0 is true), the rejection probability is just α, as before.

Page 115: ICTIR2016tutorial

2.3 Paired t-tests (15)

Now we know how to compute power given (α, Δt, n). But we want to compute n given (α, β, Δt). (The further approximations used in this step are Theorems A'' and B in the Appendix.)

Page 116: ICTIR2016tutorial

2.3 Paired t-tests (16)

Let φ = n-1 and recall that λt = √n Δt (≠ 0 when H1 is true). Substituting these into the above gives the approximate sample size formula on the next slide.

Page 117: ICTIR2016tutorial

2.3 Paired t-tests (17) minimum detectable effect size

Given (α, β, minΔt), the minimal sample size n can be approximated as

n ≈ ( (z(α/2) - z(1-β)) / minΔt )^2 + z(α/2)^2 / 2

by letting Δt = minΔt.

But this involved a lot of approximations, so we need to go back to (a') and check that n actually achieves 100(1-β)% power.

Page 118: ICTIR2016tutorial

2.3 Paired t-tests (18)

EXAMPLE: α=0.05, β=0.20, detectable effect size regardless of evaluation measure minΔt = 0.50 (i.e. half a standard deviation of the diff):

n ≈ ( (1.960 - (-0.842)) / 0.50 )^2 + 1.960^2/2 = 33.3

(z(α/2) = z(0.025) = NORM.S.INV(1-0.025) = 1.960; z(1-β) = z(0.80) = -0.842)

So if we let n=33, the achieved power according to (a') = 0.795 ... doesn't quite achieve 80%!

Page 119: ICTIR2016tutorial

2.3 Paired t-tests (19)

EXAMPLE (continued): if we let n=34, the achieved power according to (a') = 0.808 ... so n=34 is what we need!
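(R's built-in power.t.test() does the same job with the exact noncentral t distribution — my own cross-check, so expect values very close to, but not identical with, the approximation (a'):)

power.t.test(n = 33, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "paired")  # power ~ 0.79
power.t.test(power = 0.80, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "paired")  # n ~ 33.4 -> round up to 34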

Page 120: ICTIR2016tutorial

2.3 Paired t-tests (20)

Don't worry,

http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeTTEST.xlsx

will do this for you! Use the "From effect size" sheet and fill out the orange cells: n=34 is what you want!

Page 121: ICTIR2016tutorial

2.3 Paired t-tests (21) [Sakai16IRJ]

Topic set sizes for typical requirements based on effect sizes

Page 122: ICTIR2016tutorial

2.3 Paired t-tests (22)

In practice, you might want to specify a minimum detectable diff (minDt) in (say) nDCG instead of minΔt for guaranteeing 100(1-β)% power.

Given minDt and σ̂d^2, let minΔt = minDt / σ̂d; n can then be obtained as before.

A conservative estimate for the delta variance would be σ̂d^2 = 2σ̂^2, where σ̂^2 is a within-system variance estimate obtained under a homoscedasticity assumption. See 2.6.

Page 123: ICTIR2016tutorial

2.3 Paired t-tests (23)

EXAMPLE: for nDCG, α=0.05, β=0.20, minDt = 0.1 (i.e., one-tenth of nDCG's score range), σ̂d^2 = 0.50 (from some pilot data)

→ Use the "From the absolute diff" sheet: n=395 is what you want!


Page 125: ICTIR2016tutorial

2.4 One-way ANOVA (1)

Example situation: you plan to compare m systems with one-way ANOVA at α = 5%. You plan to use nDCG as the primary evaluation measure, and want to guarantee 80% power whenever the diff D between the best and the worst systems is >= minD. You know from pilot data that the within-system variance for nDCG is around σ̂^2.

What is the required number of topics n?

(Input required: α, β, m, minD, and σ̂^2 — see 2.2 (3).)

[Figure: m systems ordered by population mean; D = best - worst, with minD <= D.]

Page 126: ICTIR2016tutorial

2.4 One-way ANOVA (2)

Notations (some slightly different from Part 1):
F: a random variable that obeys an F distribution with (φA, φE) degrees of freedom, where φA = m-1 and φE = m(n-1);
F(φA, φE; α): the critical F value for significance criterion α (upper tail of probability α) = F.INV.RT(α, φA, φE).

Page 127: ICTIR2016tutorial

2.4 One-way ANOVA (3)

Due to the one-way ANOVA procedure, regardless of what F0 obeys, the probability of rejecting H0 is:

Pr{ F0 >= F(φA, φE; α) } ... (c)

If H0 is true, then F0 obeys F(φA, φE) and (c) is exactly α — that's how F(φA, φE; α) is defined.

Alternatively, if H1 is true, the distribution that F0 obeys is known as a noncentral F distribution with (φA, φE) degrees of freedom, and (c) is exactly the power, 1-β.

Page 128: ICTIR2016tutorial

2.4 One-way ANOVA (4)

If H0 is true (systems are equivalent), F0 obeys a (central) F distribution: accepting H0 is the correct conclusion (1-α); rejecting it is a Type I error (α).
If H0 is false (systems are different), F0 obeys a noncentral F distribution: accepting H0 is a Type II error (β); rejecting it is the correct conclusion (1-β) = (c).

Page 129: ICTIR2016tutorial

2.4 One-way ANOVA (5)

If H1 is true, F0 obeys a noncentral F distribution with (φA, φE) degrees of freedom, and (c) is exactly the power, 1-β.

The noncentral F distribution in fact has another parameter called the noncentrality parameter λ:

λ = n Σi ai^2 / σ^2,

which measures the total system effects in variance units (σ^2: the within-system variance under homoscedasticity).

Page 130: ICTIR2016tutorial

2.4 One-way ANOVA (6)

If H1 is true, F0 obeys a noncentral F distribution, denoted F'(φA, φE, λ), and

Power = Pr{ F0 >= F(φA, φE; α) } ... (c)

can be approximated using Theorem C in the Appendix; call the approximated version (c').

Page 131: ICTIR2016tutorial

2.4 One-way ANOVA (7)

Let us ensure that when Δ ≠ 0 (i.e., H1 is true), we guarantee 100(1-β)% power whenever the difference between the best and worst systems is minD or larger (minimum detectable range).

H1: at least one system is different, i.e., Δ = Σ_j a_j² ≠ 0.

[Figure: m systems ordered by population mean; the difference D between the best and worst systems satisfies minD <= D]

Page 132: ICTIR2016tutorial

2.4 One-way ANOVA (8)

Let us ensure that when Δ ≠ 0 (i.e., H1 is true), we guarantee 100(1-β)% power whenever the difference D between the best and worst systems is minD or larger (minimum detectable range).

Define minΔ = minD²/2. Then, whenever D ≥ minD, Δ ≥ minΔ holds (Appendix Theorem D).

minD does not uniquely determine Δ, but minΔ can be used as the worst-case Δ.

Page 133: ICTIR2016tutorial

2.4 One-way ANOVA (9)

The worst-case sample size: n = λσ̂²/minΔ = 2λσ̂²/minD².

Here λ is the noncentrality parameter for F'(φA, φE, λ), which can be approximated by the λ for a noncentral chi-square distribution (Appendix Theorem E); linear approximations of this λ are available for the following (α, β) combinations [Nagata03]:

α     β
0.01  0.10
0.01  0.20
0.05  0.10
0.05  0.20

Page 134: ICTIR2016tutorial

2.4 One-way ANOVA (10)

Given (α, β, minD, m, σ̂²), the minimal sample size n can be approximated as n ≈ 2λσ̂²/minD².

But this involved a lot of approximations, so we need to go back to (c') and check that n actually achieves 100(1-β)% power:

Power = Pr{F0 ≥ F(φA, φE; α)}, where F0 obeys F'(φA, φE, λ) ... (c')

Page 135: ICTIR2016tutorial

2.4 One-way ANOVA (11)

EXAMPLE: α=0.05, β=0.20, minD=0.5, m=3, σ̂² = 0.5^2.

The approximation gives n ≈ 19, so let n=19 ⇒ φA = 2, φE = 54, λ = n·minΔ/σ̂² = 9.5.

Hence from (c') we get power = 0.791 ... doesn't quite achieve 80%!

Page 136: ICTIR2016tutorial

2.4 One-way ANOVA (12)

EXAMPLE: α=0.05, β=0.20, minD=0.5, m=3, σ̂² = 0.5^2.

Try n=20 ⇒ φA = 2, φE = 57, λ = 10.

From (c') we get power = 0.813 ... so n=20 is what we need!
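If you want to double-check such numbers programmatically, here is a minimal R sketch. It uses R's exact noncentral F distribution (pf with ncp) instead of the normal approximation (c'), so the values may differ slightly from those above; power.1way is a hypothetical helper name.

power.1way <- function(n, m, minD, var, alpha = 0.05) {
  phiA <- m - 1                        # between-system degrees of freedom
  phiE <- m * (n - 1)                  # within-system degrees of freedom
  lambda <- n * (minD^2 / 2) / var     # worst-case noncentrality: n * minDelta / sigma^2
  crit <- qf(1 - alpha, phiA, phiE)    # critical value F(phiA, phiE; alpha)
  pf(crit, phiA, phiE, ncp = lambda, lower.tail = FALSE)
}
power.1way(19, m = 3, minD = 0.5, var = 0.5^2)   # about 0.79: just short of 80%
power.1way(20, m = 3, minD = 0.5, var = 0.5^2)   # about 0.81: n=20 suffices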

Page 137: ICTIR2016tutorial

2.4 One-way ANOVA (13)

Don't worry,

http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx

will do this for you! Use the appropriate sheet for a given (α, β) and fill out the orange cells:

n=20 is what you want!


Page 139: ICTIR2016tutorial

2.5 Confidence Intervals (1)

Method: confidence intervals.
Input required: α (Type I error probability), δ (CI width upper bound: you want the CI for the difference between any system pair to be this wide or narrower), and σ̂_d²: a variance estimate for the score delta.

Example situation: You plan to compare a system pair by means of a 95% CI for the difference in nDCG. You want to guarantee that the CI width for any system pair is δ or smaller. You know from pilot data that the variance of the nDCG delta is around σ̂_d².

What is the required number of topics n?

Page 140: ICTIR2016tutorial

2.5 Confidence Intervals (2) cf. 1.2 (8)

The 100(1-α)% CI for a difference in means (paired data) is given by

d̄ ± MOE, where MOE = t(φ; α)·√(V/n), φ = n-1,

and V is the sample variance of the score deltas.

Let's consider a sample size n which guarantees that the CI width (= 2·MOE) for any difference will be no larger than δ.

But since MOE contains a random variable V, let's state the above requirement using an expectation:

E(2·MOE) ≤ δ.

Page 141: ICTIR2016tutorial

2.5 Confidence Intervals (3) cf. 1.1 (11)

Now, it is known that

E(√V) = c(φ)·σ, where c(φ) = √(2/φ)·Γ((φ+1)/2)/Γ(φ/2)

(√V: sample standard deviation; σ: population standard deviation; Γ: gamma function, see Theorem A),

so we want to find the smallest n that satisfies:

t(φ; α)·c(φ)·σ̂_d/√n ≤ δ/2.

Page 142: ICTIR2016tutorial

2.5 Confidence Intervals (4)

We want to find the smallest n that satisfies:

t(φ; α)·c(φ)·σ̂_d/√n ≤ δ/2, φ = n-1 ... (d)

To obtain an initial n, instead of t(φ; α), consider z(α), the critical z value for the case where the variance is known.

Thus, let n_0 = (2·z(α)·σ̂_d/δ)² and start with n' = ⌈n_0⌉.

Increment n' until (d) is satisfied.

Page 143: ICTIR2016tutorial

2.5 Confidence Intervals (5)

EXAMPLE: α=0.05, δ=0.5, σ̂_d² = 0.5 (from some pilot data)

Initial n: n_0 = (2·z(α)·σ̂_d/δ)² = 30.7

Try n=31 → LHS of (d) = 0.257 > δ/2 = 0.25

n=32 → LHS = 0.253 > 0.25

n=33 → LHS = 0.249 < 0.25

n=33 is what you want!
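The same search is easy to script. A minimal base-R sketch, assuming sigma2 holds the pilot estimate σ̂_d² of the score-delta variance (ci.sample.size is a hypothetical helper name):

ci.sample.size <- function(alpha, delta, sigma2) {
  # initial n from the known-variance CI: (2 * z * sigma / delta)^2
  n <- ceiling((2 * qnorm(1 - alpha / 2) * sqrt(sigma2) / delta)^2)
  repeat {
    phi <- n - 1
    # c(phi) = sqrt(2/phi) * Gamma((phi+1)/2) / Gamma(phi/2); lgamma avoids overflow
    c.phi <- sqrt(2 / phi) * exp(lgamma((phi + 1) / 2) - lgamma(phi / 2))
    lhs <- qt(1 - alpha / 2, phi) * c.phi * sqrt(sigma2) / sqrt(n)  # expected MOE
    if (lhs <= delta / 2) return(n)
    n <- n + 1
  }
}
ci.sample.size(0.05, 0.5, 0.5)   # 33, matching the worked example above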

Page 144: ICTIR2016tutorial

2.5 Confidence Intervals (6)

Don’t worry,

http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx

will do this for you! Just fill out the orange cells.

n=33 is what you want!


Page 146: ICTIR2016tutorial

2.6 Estimating the variance (1)

We need σ̂² for topic set size design based on one-way ANOVA, and σ̂_d² for that based on the paired t-test or CI.

From a pilot topic-by-run score matrix, obtain σ̂² = VE = SE/φE, a by-product of one-way ANOVA (use two-way ANOVA w/o replication for tighter estimates).

Then, if possible, pool multiple estimates from several matrices to enhance accuracy (pooled estimate).

Page 147: ICTIR2016tutorial

2.6 Estimating the variance (2)

A 20-topic × 3-run (A, B, C) score matrix:

A      B      C
0.4695 0.3732 0.3575
0.2813 0.3783 0.2435
0.3914 0.3868 0.3167
0.6884 0.5896 0.6024
0.6121 0.4725 0.4766
0.3266 0.2330 0.2429
0.5605 0.4328 0.4066
0.5916 0.5073 0.4707
0.4385 0.3889 0.3384
0.5821 0.5551 0.4597
0.2871 0.3274 0.2769
0.5186 0.5066 0.4066
0.5188 0.5198 0.3859
0.5019 0.4981 0.4568
0.4702 0.3878 0.3437
0.3290 0.4387 0.2649
0.4758 0.4946 0.4045
0.3028 0.3400 0.3253
0.3752 0.4895 0.3205
0.2796 0.2335 0.2240

• SE = DEVSQ(A1:A20) + DEVSQ(B1:B20) + DEVSQ(C1:C20) = 0.650834 cf. 1.6 (1)

• φE = m(n-1) = 3(20-1) = 57

• σ̂² = VE = SE/φE = 0.011 cf. 1.6 (2)

If there is no other topic-by-run matrix available, use this as σ̂².
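The same computation in R, as a cross-check against Excel's DEVSQ. A minimal sketch, assuming scores is the 20×3 matrix above with topics as rows and runs as columns (var.hat is a hypothetical helper name):

var.hat <- function(scores) {
  n <- nrow(scores)   # topics
  m <- ncol(scores)   # runs
  # SE: sum over runs of the squared deviations from each run's mean (Excel's DEVSQ)
  SE <- sum(apply(scores, 2, function(x) sum((x - mean(x))^2)))
  SE / (m * (n - 1))  # VE = SE / phiE
}
# For the matrix above, var.hat(scores) returns about 0.0114, matching VE.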


Page 149: ICTIR2016tutorial

2.7 How much pilot data do we need? (1) [Sakai16EVIA]

[Figure: pilot data = 100 topics × 44 runs from 16 teams, evaluated with the official NTCIR-12 STC qrels (the union of contributions from all 16 teams), yields the best variance estimates available]

Can we obtain a reliable σ̂² even from a few teams and a small number of topics?

Page 150: ICTIR2016tutorial

2.7 How much pilot data do we need? (2) [Sakai16EVIA]

Can we obtain a reliable σ̂² even from a few teams and a small number of topics?

[Figure: leave out k=1 of the 16 teams (k=1,...,15; tried 10 times): pilot data = 100 topics × runs from the remaining 15 teams → new variance estimates]

Page 151: ICTIR2016tutorial

2.7 How much pilot data do we need? (3) [Sakai16EVIA]

Can we obtain a reliable σ̂² even from a few teams and a small number of topics?

[Figure: leave out k=15 of the 16 teams (tried 10 times): pilot data = 100 topics × runs from a single team → new variance estimates]

Page 152: ICTIR2016tutorial

2.7 How much pilot data do we need? (4) [Sakai16EVIA]

Can we obtain a reliable σ̂² even from a few teams and a small number of topics?

[Figure: with the official NTCIR-12 STC qrels (44 runs from 16 teams), remove topics 100 → 90 → 75 → 50 → 25 → 10 and recompute the variance estimates at each topic set size]

Page 153: ICTIR2016tutorial

2.7 How much pilot data do we need? (5) [Sakai16EVIA]

Can we obtain a reliable σ̂² even from a few teams and a small number of topics?

[Figure: with leave-k-out qrels (k=1,...,15) and runs from the remaining teams, remove topics 100 → 90 → 75 → 50 → 25 → 10 and recompute the variance estimates at each topic set size]

Page 154: ICTIR2016tutorial

Starting with n’=100 topics Starting with n’=10 topics

2.7 How much pilot data do we need? (6) [Sakai16EVIA] About 25 topics with a few teams seems sufficient,

provided that a reasonably stable measure is used.


Page 156: ICTIR2016tutorial

3.1 Power analysis (1) [Ellis10, pp.56-57]

1. Effect size describes the degree to which the phenomenon is present in the population;

2. Sample size determines the amount of sampling error inherent in a result;

3. Significance criterion α defines the risk of committing a Type I error;

4. Power (1-β) refers to the chosen or implied Type II error rate β.

“The four power parameters are related, meaning that the value of any parameter can be determined from the other three.”

We had a quick look at how the computations can be done in Part 2.

Page 157: ICTIR2016tutorial

3.1 Power analysis (2) [Toyoda09]

If a paper reports

- The parametric significance test type (paired/unpaired t-test, one-way ANOVA, two-way ANOVA w and w/o replication)

- either p-value or test statistic (t-value or F-value)

- actual sample size

we can easily compute the sample effect size.

Then, using the R library pwr, we can compute

- the achieved power (1-β) of the experiment

- the future sample size for achieving a given (α, β).

cf. 1.7 (2)

https://cran.r-project.org/web/packages/pwr/pwr.pdf

Page 158: ICTIR2016tutorial

3.1 Power analysis (3) [Sakai16SIGIR]

My R power analysis scripts, adapted from [Toyoda09] with Professor Toyoda’s kind permission, are available at

https://waseda.box.com/SIGIR2016PACK

- Works with paired/unpaired t-test, one-way ANOVA, two-way ANOVA w and w/o replication.

- SIGIR2016PACK also contains an Excel file from [Sakai16SIGIR] (manual analysis of 1055 papers from SIGIR+TOIS 2006-2015).


Page 160: ICTIR2016tutorial

3.2 With paired t-tests (1)

future.sample.pairedt arguments:
- t statistic (t)
- sample size (n)
- two-sided/one-sided (default: two-sided)
- α (default: 0.05)
- desired power (1-β) (default: 0.80)

OUTPUT:
- effect size
- achieved power
- future sample size n'

1.2 (15)

Calls power.t.test

Page 161: ICTIR2016tutorial

3.2 With paired t-tests (2)

A paper from SIGIR 2012 reports

“t(27)=0.953 with (two-sided) paired t-test”

⇒ t = 0.953, n = 28 (φ = n-1 = 27)

Line 270 in the raw Excel file from [Sakai16SIGIR]

very low power (15.1%)

For this kind of effect, we need a much larger sample if we want 80% power
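What the script computes can be reproduced with base R alone. A minimal sketch, assuming only the reported t statistic and sample size (my reconstruction, not the script itself):

t <- 0.953; n <- 28
d <- t / sqrt(n)    # sample effect size (standardised mean difference): about 0.180
# achieved power of the reported test: about 0.151
power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05, type = "paired")$power
# future sample size for 80% power: about 244 topics
power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.80, type = "paired")$n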

Page 162: ICTIR2016tutorial

TUTORIAL OUTLINE1. Significance testing basics and limitations

1.1 Preliminaries

1.2 How the t-test works

1.3 T-test with Excel and R (hands-on)

1.4 How ANOVA works

1.5 ANOVA with Excel and R (hands-on)

1.6 What's wrong with significance tests?

1.7 Significance tests in the IR literature, or lack thereof

2. Using the Excel topic set size design tools

2.1 Topic set sizes in IR

2.2 Topic set size design

<30min coffee break>

2.3 With paired t-tests (hands-on)

2.4 With one-way ANOVA (hands-on)

2.5 With confidence intervals (hands-on)

2.6 Estimating the variance (hands-on)

2.7 How much pilot data do we need?

3. Using the R power analysis scripts

3.1 Power analysis

3.2 With paired t-tests (hands-on)

3.3 With unpaired t-tests (hands-on)

3.4 With one-way ANOVA (hands-on)

3.5 With two-way ANOVA without replication (hands-on)

3.6 With two-way ANOVA (hands-on)

3.7 Overpowered and underpowered experiments in IR

4. Summary, a few additional remarks, and Q&A

30min

70min

20min

50min

10min

Appendix

Page 163: ICTIR2016tutorial

3.3 With unpaired t-tests (1)

future.sample.unpairedt arguments:

- t statistic (t)

- sample sizes (n1, n2)

- two-sided/one-sided (default: two-sided)

- α (default: 0.05)

- desired power (1-β) (default: 0.80)

OUTPUT:

- effect size

- achieved power

- future sample size n’ per group

1.2 (15)

Calls pwr.t2n.test

Page 164: ICTIR2016tutorial

3.3 With unpaired t-tests (2)

A paper from SIGIR 2007 reports:

“t(188403) = 2.81, n1 = 150610, n2 = 37795 with (two-sided) two-sample t-test”

φ = n1 + n2 -2 = 188403

Line 714 in the raw Excel file from [Sakai16SIGIR]

Appropriate level of power

n1 = n2 = 60066 would be the typical setting for 80% power
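A minimal sketch of the same analysis with the pwr package (assuming it is installed; my reconstruction, not the script itself):

library(pwr)
t <- 2.81; n1 <- 150610; n2 <- 37795
d <- t * sqrt(1 / n1 + 1 / n2)   # sample effect size from the t statistic: about 0.016
pwr.t2n.test(n1 = n1, n2 = n2, d = d)$power   # achieved power of the reported test
pwr.t.test(d = d, power = 0.80)$n   # balanced future sample size per group: about 60066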


Page 166: ICTIR2016tutorial

3.4 With one-way ANOVA (1)

future.sample.1wayanova arguments:
- F statistic (F, i.e. FA)
- #groups (systems) compared (m)
- #observations (topics) per group (n)
- α (default: 0.05)
- desired power (1-β) (default: 0.80)

OUTPUT:
- effect size
- achieved power
- future sample size per group n'

φA = m-1, φE = m(n-1)

Calls pwr.anova.test

1.5 (9)

Compares between-system variation against within-system variation

Page 167: ICTIR2016tutorial

3.4 With one-way ANOVA (2) φA = m-1, φE = m(n-1)

A paper from SIGIR 2008 reports:

“m=3 groups, n=12 subjects per group,

F(2, 33)=1.284 with (one-way) ANOVA”

(φA = m-1 = 2, φE = m(n-1) = 3*(12-1) = 33)

Line 616 in the raw Excel file from [Sakai16SIGIR]

Very low power (27.9%)

For this kind of effect, we need more subjects if we want 80% power
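A minimal sketch with pwr.anova.test (assuming the pwr package; my reconstruction, not the script itself):

library(pwr)
Fstat <- 1.284; m <- 3; n <- 12
phiA <- m - 1; phiE <- m * (n - 1)   # 2 and 33
f <- sqrt(Fstat * phiA / phiE)       # Cohen's f recovered from the F statistic: about 0.28
pwr.anova.test(k = m, n = n, f = f)$power     # achieved power: about 0.279
pwr.anova.test(k = m, f = f, power = 0.80)$n  # future sample size per group: roughly 42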


Page 169: ICTIR2016tutorial

3.5 With two-way ANOVA without replication (1)

future.sample.2waynorep arguments: same as future.sample.1wayanova.

OUTPUT:
- effect size
- achieved power
- future sample size per group n'

φA = m-1, φE = (m-1)(n-1)

A little different from 1.5 (18)

Calls pwr.f2.test, which requires the squared effect size f_p² = φA·FA/φE (p stands for partial: the effect of B has been removed).

Page 170: ICTIR2016tutorial

3.5 With two-way ANOVA without replication (2)

A paper from SIGIR 2015 reports:

“m=4 groups,

F(3, 48)=0.63 with a repeated-measures ANOVA”

⇒m = φA +1 = 4, φE = (m-1)(n-1) = 48, n = 17 per group

Line 22 in the raw Excel file from [Sakai16SIGIR]

Same procedure as two-way ANOVA w/o replication (the second factor, e.g. topics, is regarded as repeated observations)

Very low power (18.3%)

For this kind of effect, we need more subjects if we want 80% power
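A minimal sketch of the effect size and achieved power computation with pwr.f2.test (assuming the pwr package; the script's future-sample-size step is omitted here):

library(pwr)
Fstat <- 0.63; m <- 4; n <- 17
phiA <- m - 1; phiE <- (m - 1) * (n - 1)   # 3 and 48
f2 <- Fstat * phiA / phiE                  # squared partial effect size: about 0.039
pwr.f2.test(u = phiA, v = phiE, f2 = f2)$power   # achieved power: about 0.18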


Page 172: ICTIR2016tutorial

3.6 With two-way ANOVA (1)

future.sample.2wayanova2 arguments:

- F statistics (FA, FB, FAB)

- #groups compared (m)

- #cells per group (n)

- #total observations (N=mnr)

- α (default: 0.05)

- desired power (1-β) (default: 0.80)

OUTPUT:

- effect size

- achieved power

- Total sample size N’

φA = m-1, φB = n-1, φAB = (m-1)(n-1), φE = mn(r-1)

And similarly for B and AB. Calls pwr.anova.test.

p stands for partial: effects of B and AB have been removed

Version 2

Page 173: ICTIR2016tutorial

3.6 With two-way ANOVA (2)

A paper from SIGIR 2014 reports:

“m=2, n=2, two-way ANOVA,

A: F(1, 960)=24.00,

B: F(1, 960)=24.89,

AxB: F(1, 960)=10.03”

φA = m-1 = 1, φB = n-1 = 1,

φAxB = (m-1)(n-1)=1,

φE = mn(r-1) = 960

⇒ r= 960/4+1 = 241,

N = mnr = 964

Line 121 in the raw Excel file from [Sakai16SIGIR]

Very high power

Smaller sample sizes suffice

φE/(φA+1) + 1 = 960/(1+1) + 1 = 481 [Cohen88, p.365]
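For the A effect, a minimal sketch with pwr.anova.test (assuming the pwr package; the per-group size n' = φE/(φA+1) + 1 follows [Cohen88, p.365]):

library(pwr)
FA <- 24.00; m <- 2; phiA <- m - 1; phiE <- 960
f <- sqrt(FA * phiA / phiE)      # Cohen's f for factor A: about 0.158
nprime <- phiE / (phiA + 1) + 1  # 481 observations per group
pwr.anova.test(k = m, n = nprime, f = f)$power   # achieved power: very high (near 1)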


Page 175: ICTIR2016tutorial

3.7 Overpowered and underpowered experiments in IR (1) [Sakai16SIGIR]

SSR = sample size ratio = actual sample size / recommended future sample size

SSR is extremely large ⇔ extremely overpowered

SSR is extremely small ⇔ extremely underpowered

133 SIGIR+TOIS papers from the past decade (2006-2015) were examined using the R power analysis tools.

(106 with t-tests; 27 with ANOVAs)

Page 176: ICTIR2016tutorial

3.7 Overpowered and underpowered experiments in IR (2) [Sakai16SIGIR]

Page 177: ICTIR2016tutorial

3.7 Overpowered and underpowered experiments in IR (3) [Sakai16SIGIR]

A paper on personalisation from a search engine company (paired t-test):

t=16.00, n=5,352,460, effect size=0.007, achieved power=1, recommended future sample size=164,107

Effect size very small (though this may translate into substantial profit for a company)

Page 178: ICTIR2016tutorial

3.7 Overpowered and underpowered experiments in IR (4) [Sakai16SIGIR]

User experiments, paired t-test:

t=0.95, n=28, effect size=0.180, achieved power=0.152, future sample size=244

(similar results for other t-test results in the same paper)

Page 179: ICTIR2016tutorial

3.7 Overpowered and underpowered experiments in IR (5) [Sakai16SIGIR]

Page 180: ICTIR2016tutorial

3.7 Overpowered and underpowered experiments in IR (6) [Sakai16SIGIR]

Experiments with data from a commercial social media application (one-way ANOVA):

F=243.42, m=3, sample size per group=2551, effect size fhat=0.252, achieved power=1, recommended future sample size per group=52

Page 181: ICTIR2016tutorial

3.7 Overpowered and underpowered experiments in IR (7) [Sakai16SIGIR]

User experiments, two-way ANOVA w/o replication:

F=0.63, m=4, sample size per group=17, effect size fhat^2=0.039, achieved power=0.183, recommended future sample size per group=75

(similar results for other ANOVA results in the same paper)


Page 183: ICTIR2016tutorial

Now you know

•How to determine the number of topics when building a new test collection using a topic-by-run matrix from pilot data and a simple Excel tool. And you kind of know how it works!

•How to check whether a reported experiment is overpowered/underpowered and decide on a better sample size for a future experiment using simple R scripts.

Page 184: ICTIR2016tutorial

What now?

• Be aware of the limitations of classical significance testing. But while we are still using classical tests, report effect sizes, p-values etc. for collective wisdom [Sakai14SIGIRforum,Sakai16SIGIR]. And use topic set size design and power analysis! Some guidance is better than none!

• My personal wish is that the classical significance tests will soon be replaced by Bayesian tests, so we can discuss P(H|D) instead of P(D|H) for various H’s, not just “equality of means” etc.

• Using score standardisation can give you smaller topic set sizes in topic set size design. Please have a look at [Sakai16ICTIR].

Page 185: ICTIR2016tutorial

Thank you for staying with me until the end! Questions?

Page 186: ICTIR2016tutorial

Acknowledgements

This tutorial is rather heavily based on what I learnt from Professor Yasushi Nagata's and Professor Hideki Toyoda's books (written in Japanese).

I thank Professor Nagata (Waseda University) for his valuable advice and Professor Toyoda (Waseda University) for letting me modify his R code and distribute it.

If there are any errors in this tutorial, I am solely responsible.

Page 187: ICTIR2016tutorial

References

[Carterette08] Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., and Allan, J.: Evaluation over Thousands of Queries, ACM SIGIR 2008.

[Carterette12] Carterette, B.: Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments, ACM TOIS 30(1), 2012.

[Cohen88] Cohen. J.: Statistical Power Analysis for the Behavioral Sciences (Second Edition), Psychology Press, 1988.

[Ellis10] Ellis, P. D.: The Essential Guide to Effect Sizes, Cambridge, 2010.

[Gilbert79] Gilbert, H. and Sparck Jones, K.: Statistical Bases of Relevance Assessment for the 'IDEAL' Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1979.

[Johnson99] Johnson, D. H.: The Insignificance of Statistical Significance Testing, Journal of Wildlife Management, 63(3), 1999.

[Nagata03] Nagata, Y.: How to Design the Sample Size (In Japanese), Asakura Shoten, 2003.

[Okubo12] Okubo, G. and Okada, K.: Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval, and Power (in Japanese), Keisho Shobo, 2012.

Page 188: ICTIR2016tutorial

References

[Sakai14SIGIRforum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1), 2014. http://sigir.org/files/forum/2014J/2014J_sigirforum_Article_TetsuyaSakai.pdf

[Sakai16EVIA] Sakai, T. and Shang, L.: On Estimating Variances for Topic Set Size Design, EVIA 2016.

[Sakai16ICTIR] Sakai, T.: A Simple and Effective Approach to Score Standardisation, ACM ICTIR 2016.

[Sakai16IRJ] Sakai, T.: Topic Set Size Design, Information Retrieval Journal, 19(3), 2016. [OPEN ACCESS] http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf

[Sakai16SIGIR] Sakai, T.: Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015, ACM SIGIR 2016.

[Sakai16SIGIRshort] Sakai, T.: Two Sample T-tests for IR Evaluation: Student or Welch?, ACM SIGIR 2016.

Page 189: ICTIR2016tutorial

References

[SparckJones75] Sparck Jones, K.S. and Van Rijsbergen, C.J.: Report on the Need for and Provision of an 'Ideal' Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1975.

[Toyoda09] Toyoda, H.: Introduction to Statistical Power Analysis: A Tutorial with R (in Japanese), Tokyo Tosyo, 2009.

[Voorhees05] Voorhees, E. M. and Harman, D. K.: TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, 2005.

[Voorhees09] Voorhees, E. M.: Topic Set Size Redux, ACM SIGIR 2009.

[Webber08] Webber, W., Moffat, A., and Zobel, J.: Statistical Power in Retrieval Experimentation, ACM CIKM 2008.

Page 190: ICTIR2016tutorial

Appendix (everything adapted from [Nagata03])

• Definition: noncentral t distribution

• Definition: noncentral chi-square distribution

• Definition: noncentral F distribution

• Theorem A: normal approximation of a noncentral t distribution

• Theorem A’: corollary of A

• Theorem A’’: corollary of A (approximating a z value using a t value)

• Theorem B: approximating a t value using a z value

• Theorem C: normal approximation of a noncentral F distribution

• Theorem D: inequality for system effects

• Theorem E: approximating a noncentral F distribution with a noncentral chi-square distribution

Page 191: ICTIR2016tutorial

Definition: noncentral t distribution

Let Z obey N(λ, 1²) and W obey χ²(φ), where the two random variables are independent.

The probability distribution of the following random variable is called a noncentral t distribution with φ degrees of freedom and a noncentrality parameter λ:

t' = Z / √(W/φ)

When λ=0, it is reduced to the central t distribution with φ degrees of freedom, t(φ).

Denoted by t'(φ, λ)

Page 192: ICTIR2016tutorial

Definition: noncentral chi-square distribution

Let Z_i obey N(μ_i, 1²) (i=1,...,k), where the random variables are independent.

The probability distribution of the following random variable is called a noncentral chi-square distribution with φ=k degrees of freedom and a noncentrality parameter λ:

W = Σ_{i=1..k} Z_i², where λ = Σ_{i=1..k} μ_i².

When λ=0, it is reduced to the central chi-square distribution with φ degrees of freedom, χ²(φ).

Denoted by χ'²(φ, λ)

Page 193: ICTIR2016tutorial

Definition: noncentral F distribution

Let W1 obey χ'²(φ1, λ) (a noncentral chi-square distribution) and W2 obey χ²(φ2) (a central chi-square distribution), where the two random variables are independent.

The probability distribution of the following random variable is called a noncentral F distribution with (φ1, φ2) degrees of freedom and a noncentrality parameter λ:

F' = (W1/φ1) / (W2/φ2)

When λ=0, it is reduced to the central F distribution with (φ1, φ2) degrees of freedom, F(φ1, φ2).

Denoted by F'(φ1, φ2, λ)

Page 194: ICTIR2016tutorial

Theorem A: normal approximation of a noncentral t distribution

Let t' obey t'(φ, λ) (a noncentral t distribution) and Z obey N(0, 1²). Then:

Pr{t' ≤ K} ≒ Pr{Z ≤ (c(φ)·K - λ) / √(1 + K²(1 - c(φ)²))}

where:

c(φ) = √(2/φ)·Γ((φ+1)/2)/Γ(φ/2) (Γ: gamma function).

Brief derivation given in [Sakai16IRJ Appendix 1]

Page 195: ICTIR2016tutorial

Theorem A': corollary of A

Let t' obey t'(φ, λ) and Z obey N(0, 1²). Then:

Pr{t' ≥ t(φ; α)} ≒ Pr{Z ≥ (c(φ)·t(φ; α) - λ) / √(1 + t(φ; α)²(1 - c(φ)²))}

PROOF: Let K = t(φ; α) in Theorem A and take the complement.

Brief derivation given in [Sakai16IRJ Appendix 1]

Page 196: ICTIR2016tutorial

Theorem A'': corollary of A (approximating a one-sided z value using a two-sided t value)

PROOF: In Theorem A, when λ=0, t=t' obeys a (central) t distribution; substitute accordingly.

[Figure: the one-sided z value and its approximation f(t) computed from the two-sided t value, plotted against φ = 1,...,96 for 2P=α=0.05; the two curves agree closely. Verified with Excel]

Page 197: ICTIR2016tutorial

Theorem B: approximating a t value using a z value

This is a special case of Johnson and Welch's theorem on the noncentral t statistic [Nagata03].

[Figure: the two-sided t value and its approximation f(z) computed from the one-sided z value, plotted against φ = 1,...,96 for P=α=0.05; the two curves agree closely. Verified with Excel]

Page 198: ICTIR2016tutorial

Theorem C: normal approximation of a noncentral F distribution

Let F' obey F'(φ1, φ2, λ) (a noncentral F distribution) and Z obey N(0, 1²). Then Pr{F' ≤ K} can be approximated by a probability under the standard normal distribution.

Brief derivation given in [Sakai16IRJ Appendix 2]

Page 199: ICTIR2016tutorial

Theorem D: inequality for system effects

For a_1, ..., a_m such that Σ_{i=1..m} a_i = 0, let D = max_i a_i - min_i a_i.

Then Σ_{i=1..m} a_i² ≥ D²/2.

The equality holds when one a_i = D/2, another a_i = -D/2, and a_i = 0 for all others.

Proof in [Sakai16IRJ footnote 19]

Page 200: ICTIR2016tutorial

Theorem E: approximating a noncentral F distribution with a noncentral chi-square distribution

Let F' obey F'(φA, φE, λ). Then, letting φE ≒ ∞, φA·F' approximately obeys χ'²(φA, λ):

(the F value for probability P) ≒ (the chi-square value for probability P) / φA.