How to improve the statistical power of the 10-fold crossvalidation scheme in Recommender Systems

University of Ljubljana ..: Faculty of Electrical Engineering

[LDOS] ..: Digital Signal, Image and Video Processing Laboratory


Andrej Košir

Ante Odić

Marko Tkalčič



Statistical power, replicability and reproducibility

What is:

Replicability: to get the same experimental result (on the same data)

Reproducibility: to get similar experimental results leading to the same conclusion

Mackay, R., & Oldford, R. (2000). Scientific method, statistical method, and the speed of light (Working paper 2000-02). Department of Statistics and Actuarial Science, University of Waterloo.

In terms of statistical testing

Higher power => better reproducibility

More likely to get to the same conclusions



On statistical hypothesis testing

When do we need to use statistical tests?

The results should not change if we repeat the experiment

When we need it: at later stages of development where results are similar

Elements of statistical testing

Working hypotheses

Null and alternative hypotheses: 𝐻0 and 𝐻1

p-value: 𝑝

Risk level: 𝛼

Decision on 𝐻0

[Figure: RS 1 and RS 2 evaluated on test data; labels F1 and F2; values shown: 0.72, 0.89, 0.74]



On errors and statistical power

Errors in test decision:

Errors of type I. and type II.

Effect size

Power:

For each test a new analysis is required

more is better

The best one can do

Task 1 - How to select sample size: a priori power

Task 2 - How to estimate achieved power: post-hoc power

History:

1908 by William Sealy Gosset (Student): he did not need it

Mainly ignored until then

Software: GPower

http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/

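Test decisions (rows: true state, columns: decision):

              decide 𝐻0    decide 𝐻1
𝐻0 is true    OK           type I.
𝐻1 is true    type II.     OK

Power = Pr[decide 𝐻1 | 𝐻1 is true]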
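The two tasks above (a priori sample size, post-hoc power) can be sketched for the simplest textbook case. This is only an illustration, assuming a two-sided one-sample z-test with a known standardized effect size; it is not the analysis used in the talk, and a different test needs its own power analysis. The function names `z_test_power` and `required_sample_size` are made up for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample z-test.

    effect_size is the standardized effect d = (mu1 - mu0) / sigma,
    n is the sample size, alpha is the risk level.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n)
    # Probability that the test statistic falls in either rejection region
    # when H1 (mean shifted by `shift` standard errors) is true.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


def required_sample_size(effect_size, target_power=0.8, alpha=0.05, n_max=100_000):
    """Task 1: smallest n whose power reaches target_power."""
    for n in range(2, n_max + 1):
        if z_test_power(effect_size, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reached within n_max samples")


if __name__ == "__main__":
    # Task 1: a priori sample size for a medium effect (d = 0.5, assumed).
    print("required n:", required_sample_size(0.5))
    # Task 2: power achieved with only 10 observations (e.g. 10 folds).
    print("power at n = 10:", round(z_test_power(0.5, 10), 3))
```

With zero effect the function returns the risk level 𝛼 itself, which is a quick sanity check on the formula.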



The application we were working on: contextual variables

Which contextual variables are relevant:

What is context

Candidates: time, weather, mood, ...

Can we simply use it all?

• Irrelevant context can worsen the performance of RS

Test if a given context is relevant

How: compare RS with and without it

ODIĆ, Ante, TKALČIČ, Marko, TASIČ, Jurij F., KOŠIR, Andrej. Predicting and detecting the relevant contextual information in a movie-recommender system. Interact. comput., 2013, vol. 25, no. 1, pp. 74-90, doi:10.1093/iwc/iws003. [COBISS.SI-ID 9650260]

ODIĆ, Ante, TKALČIČ, Marko, TASIČ, Jurij F., KOŠIR, Andrej. Impact of the context relevancy on ratings prediction in a movie-recommender system. Automatika (Zagreb), 2013, vol. 54, no. 2, pp. 252-262, doi:10.7305/automatika.54-2.258. [COBISS.SI-ID 9782356]



The problem we observed: cross validation scheme

There were differences among folds, but not in conclusion

What is wrong?

Paired / unpaired?

What is usually done:

Confusion matrix computation is actually unpaired

ODIĆ, Ante, TKALČIČ, Marko, TASIČ, Jurij F., KOŠIR, Andrej. Predicting and detecting the relevant contextual information in a movie-recommender system. Interact. comput., vol. 25, no. 1, pp. 74-90, 2013.



Proposed solution

The procedure outline:

1. Select the scalar comparison measure (such as precision or F-measure).

2. Store the evaluation results of each fold and each method separately.

3. According to the specific features of the evaluation results (distributions etc.), select the most powerful test that meets these specific features.

4. Perform the paired version of the selected test.
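The outline above can be sketched in a few lines. This is a minimal illustration, assuming the F-measure was chosen in step 1 and the Wilcoxon signed-rank test in step 3; the per-fold values and the names `fold_scores`, `average_ranks` and `wilcoxon_signed_rank_exact` are hypothetical and not taken from the study. The p-value is computed exactly by enumerating all sign patterns, which is cheap for 10 folds.

```python
from itertools import product

# Step 2: per-fold results stored separately for each method.
# Hypothetical F-measure values, for illustration only.
fold_scores = {
    "rs_with_context":    [0.74, 0.71, 0.78, 0.69, 0.75, 0.73, 0.77, 0.70, 0.76, 0.72],
    "rs_without_context": [0.72, 0.70, 0.75, 0.68, 0.72, 0.71, 0.74, 0.69, 0.73, 0.70],
}


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(a, b):
    """Step 4: paired test. Returns (W+, exact two-sided p-value)."""
    # Fold i of method a is compared with fold i of method b.
    diffs = [round(x - y, 12) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]  # zero differences carry no sign
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    # Under H0 every sign pattern of the differences is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-9:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


if __name__ == "__main__":
    w_plus, p = wilcoxon_signed_rank_exact(
        fold_scores["rs_with_context"], fold_scores["rs_without_context"]
    )
    print(f"W+ = {w_plus}, p = {p:.4f}")
```

The key design point is that the pairing by fold is kept all the way to the test: both methods are evaluated on the same fold, so only the within-fold difference enters the statistic.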



Materials and methods (1)

Dataset:

Context Movie Dataset (LDOS-CoMoDa)

1611 ratings from 89 users to 946 items with associated contextual factors.

Contextual variables

• time (morning, afternoon, evening, night),

• daytype (working day, weekend),

• season (spring, summer, autumn, winter),

• Location (home, public place, friend's house),

• weather (sunny/clear, rainy, stormy, snowy, cloudy),

• social (alone, partner, friends, colleagues, parents, public, family),

• endEmo (sad, happy, scared, surprised, angry, disgusted, neutral),

• dominantEmo (sad, happy, scared, surprised, angry, disgusted, neutral),

• mood (positive, neutral, negative),

• physical (healthy, ill),

• decision (user's choice, given by other),

• interaction (first, n-th)

Publicly available: LDOS-CoMoDa contextual dataset, available at www.ldos.si/comoda.html.

Used by 29 researchers at this moment.



Materials and methods (2), results

Experimental design

10-fold cross validation

Two procedures: ProcPaired, ProcIndep

Results – which contextual variable improves MF?

Tests: Wilcoxon signed-rank test (ProcPaired) and Mann-Whitney U test (ProcIndep)

The achieved (post-hoc) statistical power for the paired test (pw paired) and for the independent test (pw indep.), along with the computed p-values:

Id  Var 1        Var 2    pw paired  p paired  pw indep.  p indep.
1   physical     weather  0.42       0.001     0.14       0.24
2   decision     social   0.99       0.004     0.25       0.19
3   interaction  social   0.06       <0.001    0.05       0.43
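Why pairing by fold can raise power is easy to see in a small simulation. The sketch below uses purely synthetic scores, not the LDOS-CoMoDa results: each fold has its own difficulty level shared by both methods, plus a small method effect. All parameter values and function names are assumptions made for this illustration. The paired test uses the exact Wilcoxon signed-rank null distribution; the independent test uses the normal approximation of the Mann-Whitney U statistic.

```python
import random
from statistics import NormalDist


def signed_rank_null_counts(n):
    """counts[s] = number of subsets of the ranks 1..n that sum to s."""
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    return counts


def wilcoxon_p(a, b, counts):
    """Exact two-sided paired p-value (assumes no zero or tied differences)."""
    diffs = [x - y for x, y in zip(a, b)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    w_small = min(w_plus, len(counts) - 1 - w_plus)
    tail = sum(counts[: w_small + 1]) / 2 ** len(diffs)
    return min(1.0, 2 * tail)


def mann_whitney_p(a, b):
    """Two-sided unpaired p-value, normal approximation (assumes no ties)."""
    n, m = len(a), len(b)
    u = sum(1 for x in a for y in b if x > y)
    mean = n * m / 2
    sd = (n * m * (n + m + 1) / 12) ** 0.5
    z = (abs(u - mean) - 0.5) / sd  # continuity correction
    return min(1.0, 2 * (1 - NormalDist().cdf(z)))


def simulate_power(n_folds=10, effect=0.02, fold_sd=0.05, noise_sd=0.01,
                   reps=1000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which each test rejects H0."""
    rng = random.Random(seed)
    counts = signed_rank_null_counts(n_folds)
    hits_paired = hits_indep = 0
    for _ in range(reps):
        # Fold difficulty is shared by both methods: this is what pairing removes.
        fold = [rng.gauss(0.7, fold_sd) for _ in range(n_folds)]
        a = [f + rng.gauss(0, noise_sd) for f in fold]
        b = [f + effect + rng.gauss(0, noise_sd) for f in fold]
        hits_paired += wilcoxon_p(a, b, counts) < alpha
        hits_indep += mann_whitney_p(a, b) < alpha
    return hits_paired / reps, hits_indep / reps


if __name__ == "__main__":
    paired, indep = simulate_power()
    print(f"estimated power, paired: {paired:.2f}, independent: {indep:.2f}")
```

The size of the gap depends entirely on the assumed ratio of between-fold variation to within-fold noise, so the simulated numbers should not be read as estimates for any real recommender system.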



Discussion

Power improvements:

The first combination (physical vs. weather): 0.14 → 0.42, low but useful;

The second combination (decision vs. social): 0.25 → 0.99, the difference in power is again substantial;

The third combination (interaction vs. social): 0.05 → 0.06, irrelevant;

It does not require substantial additional work

Worth the effort



Further work

We limited ourselves to 10-fold cross validation and simple tests only. There is more out there.

We will concentrate on comparing RS with respect to the selected final tasks (such as best five), not only on scalar performance measures (such as precision at five).

More sophisticated statistical approaches are available, such as multi-level repeated binomial regression.

My opinion: they will not be used frequently.

THANK YOU

Invitation: International Conference on Automatic Face and Gesture

Recognition FG2015, http://www.fg2015.org/




Presentation structure

The goal

What does it have to do with replicability and reproducibility?

Selected items from statistics

Our case & problem statement

Proposed solution & comments

Experimental results

Future work

Take away notes

Technology
