Size matters – also in scientometric studies and research evaluation
Jesper W. Schneider
Danish Centre for Studies in Research & Research Policy, Department of Political Science & Government, Aarhus University, Denmark
Last year …
• ... I complained about the rote use of statistical significance tests
• ... one of my claims was that such tests are used more or less mindlessly to mechanically decide whether a result is important or not
• Similar to last year, I will share with you some further concerns of mine; it basically continues where I left off ...
... what bothers me this year?
• We rarely reflect on the numbers we produce ... and we produce a lot!
• We mainly pose dichotomous or ordinal hypotheses (“weak hypotheses”)
• ... it is the easy way, but it is not particularly informative or, for that matter, scientific
• Compare this to researchers in the natural sciences
• They do reflect upon their numbers and they formulate “strong” hypotheses or theories and test their predictive ability
• ... researchers in the natural sciences are concerned with questions such as “how much” or “to what degree”, questions concerning the size of effects and the impact of such effects ... to them size matters
Consider this
up to, on average, a difference of 1 citation per paper
Have you ever considered what the numerical differences between JIFs imply in practice?
Consider this
Difference from #101 to #125 is 0.5% points
Management at my university likes (really likes) rankings; their focus, though, is solely on the rank position – in this case #117
Consider this
• Possible gender inequalities in publication output from Spanish psychology researchers
• One research question:
– “Is there a difference between the proportion of female authors depending on the gender of the first author?”
• Statistical hypothesis tested:
– “there is no difference”
(we keep it anonymous)
Some suggested solutions
• Example: three research groups
• Some colleagues have recently argued that statistical significance tests should be used in order to detect ”significant differences”
• Kruskal-Wallis (K-W) tests are suggested
• H0: Group 1 = Group 2 = Group 3, i.e. no difference in medians between groups
• K-W is ”significant” (p = .014, α = 5%)
• Pairwise comparisons suggest that ”significant differences” exist between groups 1 and 2 (p = .008), as well as groups 1 and 3 (p = .013)!
• Finito!

        Group 1  Group 2  Group 3
MNCS      1.42     1.64     1.56
... not so, statistical significance cannot tell us to what degree the results are important!
• … but effect sizes can help us interpret the data
• r = Z / √N … where Z is the statistic from the Mann-Whitney tests used for the pairwise comparisons, and N is the total number of cases compared

          Group 1  Group 2  Group 3
Group 1      -      0.13     0.13
Group 2              -       0.02
Group 3                       -
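The conversion is a one-liner; a minimal sketch in Python (the Z value below corresponds to the reported two-sided p = .008, but the total N is hypothetical, chosen only to illustrate the arithmetic):

```python
import math

def r_from_z(z, n):
    """Effect size r for a Mann-Whitney comparison: r = Z / sqrt(N),
    where N is the total number of cases in the two groups compared."""
    return abs(z) / math.sqrt(n)

# Illustrative numbers: two-sided p = .008 corresponds to Z ≈ 2.65;
# N = 415 is a hypothetical total, not taken from the talk.
print(round(r_from_z(2.65, 415), 2))  # → 0.13
```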
• Judgments about the relative importance of results must be determined by theory, former findings, practical implications, cost-benefits … whatever that informs
• Effect sizes should be the starting point for such judgments – as size matters!
• Effect sizes are measures of the strength of a phenomenon and they come in standardized (scale-free) and non-standardized forms
• Effect sizes are comparable across studies
Effect sizes
Interpretation
• Theory, former findings, practical implications etc. that can help us interpret results are not plentiful in our field
• Example
– r = 0.47, is that a big or small effect, in the scheme of things?
• If you haven't a clue, you're not alone. Most people don't know how to interpret the magnitude of a correlation, or the magnitude of any other effect statistic. But people can understand trivial, small, moderate, and large, so qualitative terms like these can be used to discuss results.
A scale of magnitudes for effect statistics – a beginning
• Cohen (1988) reluctantly suggested a benchmark-scale for effect sizes based upon important findings in the behavioral and social sciences
• The main justification for this scale of correlations comes from the interpretation of the correlation coefficient as the slope of the line between two variables when their standard deviations are the same. For example, if the correlation between height (X variable) and weight (Y variable) is 0.7, then individuals who differ in height by one standard deviation will on average differ in weight by only 0.7 of a standard deviation. So, for a correlation of 0.1, the change in Y is only one-tenth of the change in X.
Standardized Effect Sizes    Small  Medium  Large
Correlation                   0.1    0.3     0.5
Cohen’s h (proportions)       0.2    0.5     0.8
A scale of magnitudes for effect statistics – a beginning
                     Trivial  Small  Moderate  Large  Very large  Nearly perfect  Perfect
Correlation            0       0.1     0.3      0.5      0.7          0.9            1
Difference in means    0       0.2     0.6      1.2      2            4           infinite
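A small helper makes the correlation row of this scale concrete; it simply maps |r| to the qualitative labels, treating each value in the table as a lower bound:

```python
def magnitude_of_r(r):
    """Qualitative label for a correlation, following the scale above
    (each threshold in the table is treated as the label's lower bound)."""
    r = abs(r)
    if r >= 1:
        return "perfect"
    for threshold, label in [(0.9, "nearly perfect"), (0.7, "very large"),
                             (0.5, "large"), (0.3, "moderate"), (0.1, "small")]:
        if r >= threshold:
            return label
    return "trivial"

print(magnitude_of_r(0.47))  # the earlier example: → moderate
```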
In principle, we are skeptical of this categorization, as it also encourages "mechanical thinking" similar to statistical significance tests. A "one size fits all" yardstick is problematic, as Cohen himself argued. However, for the sake of argument we use these benchmarks to illustrate what we need to focus upon: the importance of findings!
Returning to the group comparison
• r = Z / √N … where Z is the statistic from the Mann-Whitney tests used for the pairwise comparisons, and N is the total number of cases compared

          Group 1  Group 2  Group 3
Group 1      -      0.13     0.13
Group 2              -       0.02
Group 3                       -
Very small, close to trivial difference
Extremely trivial difference
        Group 1  Group 2  Group 3
MNCS      1.42     1.64     1.56
University rankings: Stanford University vs. UCLA
z-test = ”significant difference” (p < .0001)
Difference = 3.5 percentage points
H0: p1 = p2
HA: p1 ≠ p2

Leiden Ranking
Institution           pp_top 10%   No. Papers
Stanford University     22.90%       26917
UCLA                    19.40%       30865
Does this mean that we have a "difference that makes a difference“ between the rankings of these two universities?
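The slide's numbers can be reproduced with the standard two-proportion z-test and Cohen's h (the arcsine transformation of proportions); a sketch using the Leiden Ranking values above:

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: effect size for a difference between two proportions,
    via the arcsine transformation."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# pp_top 10% values and paper counts from the Leiden Ranking slide
z = two_proportion_z(0.229, 26917, 0.194, 30865)
h = cohens_h(0.229, 0.194)
print(round(z, 1), round(h, 2))  # → 10.3 0.09
```

The huge sample sizes make the z-test "significant" at any conventional level, while the effect size h ≈ 0.09 sits below even the "small" benchmark of 0.2.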
Differences in effect sizes between Stanford University and each of the other 499 institutions in the Leiden ranking

[Figure: Effect sizes for Stanford University (Leiden) – distribution and cumulative distribution of Cohen's h, binned from .00-.10 up to .91-1.1]

≈ 20% = ”trivial effects”
≈ 76% = ”small effects”
≈ 4% = ”medium effects”

Stanford vs. UCLA: Cohen’s h = 0.09
Relation between “pp_top 10%” values and Cohen's h for Stanford University

[Figure: scatter of pp_top 10% (y-axis, 0 to 30) against Cohen's h (x-axis, 0.00 to 0.70) for the institutions in the Leiden Ranking, with the ”small” and ”medium” effect regions marked]
Institution                  Cohen's h  pp_top 10%   Pubs
Hacettepe Univ Ankara          0.54        5.2        5773
Nihon Univ                     0.54        5.2        4423
Gifu Univ                      0.54        5.0        3357
Istanbul Univ                  0.56        4.7        5573
Gazi Univ Ankara               0.57        4.5        4420
Univ Belgrade                  0.58        4.3        6936
Ankara Univ                    0.59        4.1        4830
Adam Mickz Univ Poznan         0.59        4.0        3363
Moscow Mv Lom State Univ       0.61        3.8       14205
St Petersburg State Univ       0.64        3.1        4826

Institution                  Cohen's h  pp_top 10%   Pubs
Stanford Univ                  0.00       22.9       26917
Univ Calif Santa Barbara       0.01       22.6        9250
Univ Calif Berkeley            0.01       22.3       22213
Harvard Univ                   0.02       23.6       61462
Caltech                        0.02       22.0       13379
Univ Calif San Francisco       0.03       21.8       20527
Princeton Univ                 0.03       24.0       10832
Rice Univ                      0.03       21.7        4646
Univ Chicago                   0.04       21.1       14103
Univ Calif Santa Cruz          0.05       20.9        4667
Univ Washington - Seattle      0.06       20.5       29273
Columbia Univ                  0.06       20.2       25339
Yale Univ                      0.07       20.2       20366
MIT                            0.07       26.0       19294
Duke Univ                      0.07       19.8       21078
Univ Penn                      0.08       19.8       25837
Carnegie Mellon Univ           0.08       19.7        6371
Ecole Polytecn Fed Lausanne    0.08       19.6        8831
Univ Calif San Diego           0.08       19.4       23453
Washington Univ - St Louis     0.09       19.4       16512
Univ Calif Los Angeles         0.09       19.4       30865
As we said, we use these benchmarks only for the sake of argument and illustration; a ”trivial effect size” can in fact be important! But the distribution of effect sizes can certainly inform us.
Stanford University: pp_top 10% = 22.9
Explanatory research
“The data also show a statistically significant difference in the proportion of female authors depending on the gender of the first author (t = 2.707, df = 473, p = 0.007). Thus, when the first author was female, the average proportion of female co-authors per paper was 0.49 (SD 0.38, CI 0.53–0.45), whereas when a man was the first author, the average proportion of women dropped to 0.39 (SD 0.39, CI 0.43–0.35).”
[Figure: 95% CIs for the proportions of female co-authors, y-axis from 0.3 to 0.5]

Some uncertainty: each interval is 8 percentage points wide
Cohen’s d = 0.25, r = 0.12: a ”small effect”, perhaps not that ”significant”?
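These figures follow from the means and SDs quoted in the excerpt; a sketch using the equal-n simplification of the pooled SD (the slide's 0.25 and 0.12 presumably use the actual group sizes, so the results here land close but not identical):

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d for two independent means, using the equal-n
    simplification of the pooled SD (root mean of the two variances)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

def r_from_d(d):
    """Convert d to a correlation-scale effect size (equal-n formula)."""
    return d / math.sqrt(d ** 2 + 4)

# Means and SDs quoted in the study excerpt above
d = cohens_d(0.49, 0.38, 0.39, 0.39)
print(round(d, 2), round(r_from_d(d), 2))  # → 0.26 0.13 (slide: 0.25, 0.12)
```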
Possible empirical calibration of Effect Sizes
• Cohen always stated that the guidelines he was providing were simply first attempts to provide some useful general guidelines for interpreting social science research
• His suggestion was that specific areas of research would need to refine these guidelines
• Strictly empirical, based on the actual distribution of all effect sizes between the institutions in the Leiden Ranking
• Based on quartiles
• Small effects = difference between 3rd and 2nd quartiles
• Medium effects = difference between upper and lower half
• Large effects = top quartile versus bottom quartile
• Very instrumental and cannot initially be applied outside this dataset
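One possible reading of the quartile scheme above, sketched in Python on hypothetical data (an assumption: the slides give no formulas, so "difference between" is taken here as a difference of group means):

```python
import statistics

def quartile_calibration(values):
    """Empirical benchmarks from the observed effect-size distribution:
    small  = mean of 3rd-quartile group minus mean of 2nd-quartile group,
    medium = mean of upper half minus mean of lower half,
    large  = mean of top quartile minus mean of bottom quartile.
    This is one interpretation of the slide, not a published formula."""
    v = sorted(values)
    n = len(v)
    q = n // 4
    mean = statistics.fmean
    small = mean(v[2 * q:3 * q]) - mean(v[q:2 * q])
    medium = mean(v[n // 2:]) - mean(v[:n // 2])
    large = mean(v[3 * q:]) - mean(v[:q])
    return small, medium, large

# Hypothetical, uniformly spread effect sizes just to show the mechanics
print(quartile_calibration(list(range(100))))  # → (25.0, 50.0, 75.0)
```

As the slide notes, such benchmarks are tied to the dataset they were calibrated on and cannot initially be applied elsewhere.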
Summary
• Size matters
– Discuss the theoretical and/or practical importance of results (numbers)
– Always report effect sizes
– For want of something better, use an ordinal benchmarking like Cohen’s
– Calibrate benchmarks according to the findings and their degree of theoretical and practical importance
– Effect sizes are comparable across studies
• We need a focus on this; otherwise our studies will continue to be mainly instrumental and atheoretical – there is little understanding or advance in that!
Summary
©Lortie, Aarsen, Budden & Leimu (2012)
Thank you for your attention