Size matters – also in scientometric studies and research evaluation
Jesper W. Schneider
Danish Centre for Studies in Research & Research Policy, Department of Political Science & Government, Aarhus University, Denmark
Last year …
• ... I complained about the rote use of statistical significance tests
• ... one of my claims was that such tests are used more or less mindlessly to mechanically decide whether a result is important or not
• Similar to last year, I will share with you some further concerns of mine; it basically continues where I left off ...
... what bothers me this year?
• We rarely reflect on the numbers we produce ... and we produce a lot!
• We mainly pose dichotomous or ordinal hypotheses (“weak hypotheses”)
• ... it is the easy way, but it is not particularly informative or, for that matter, scientific
• Compare this to researchers in the natural sciences
• They do reflect upon their numbers and they formulate “strong” hypotheses or theories and test their predictive ability
• ... researchers in the natural sciences are concerned with questions such as “how much” or “to what degree”, questions concerning the size of effects and the impact of such effects ... to them size matters
Consider this
up to, on average, a difference of 1 citation per paper
Have you ever considered what the numerical differences between JIFs imply in practice?
Consider this
Difference from #101 to #125 is 0.5% points
Management at my university likes (really likes) rankings; their focus, though, is solely on the rank position – in this case #117
Consider this
• Possible gender inequalities in publication output from Spanish psychology researchers
• One research question:
– “Is there a difference between the proportion of female authors depending on the gender of the first author?”
• Statistical hypothesis tested:
– “there is no difference”
(we keep it anonymous)
Some suggested solutions
• Example: three research groups
• Some colleagues have recently argued that statistical significance tests should be used in order to detect ”significant differences”
• Kruskal-Wallis (K-W) tests are suggested
• H0: Group 1 = Group 2 = Group 3, i.e. no difference in medians between groups
• K-W is ”significant” (p = .014, α = 5%)
• Pairwise comparisons suggest that ”significant differences” exist between groups 1 and 2 (p = .008), as well as groups 1 and 3 (p = .013)!
• Finito!

        Group 1  Group 2  Group 3
MNCS      1.42     1.64     1.56
... not so, statistical significance cannot tell us to what degree the results are important!
• … but effect sizes can help us interpret the data
• r = Z / √N … where Z is the statistic from the Mann-Whitney tests used for the pairwise comparisons, and N is the total number of cases compared

          Group 1  Group 2  Group 3
Group 1      -      0.13     0.13
Group 2              -       0.02
Group 3                       -
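The conversion is a one-liner; a minimal sketch in Python (the Z value below corresponds to the reported two-sided p = .008, but the total N is hypothetical, chosen only to illustrate the arithmetic):

```python
import math

def r_from_z(z, n):
    """Effect size r for a Mann-Whitney comparison: r = Z / sqrt(N),
    where N is the total number of cases in the two groups compared."""
    return abs(z) / math.sqrt(n)

# Illustrative numbers: two-sided p = .008 corresponds to Z ≈ 2.65;
# N = 415 is a hypothetical total, not taken from the talk.
print(round(r_from_z(2.65, 415), 2))  # → 0.13
```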
• Judgments about the relative importance of results must be determined by theory, former findings, practical implications, cost-benefits … whatever that informs
• Effect sizes should be the starting point for such judgments – as size matters!
• Effect sizes are measures of the strength of a phenomenon and they come in standardized (scale-free) and non-standardized forms
• Effect sizes are comparable across studies
Effect sizes
Interpretation
• Theory, former findings, practical implications etc. that can help us interpret results are not plentiful in our field
• Example
– r = 0.47, is that a big or small effect, in the scheme of things?
• If you haven't a clue, you're not alone. Most people don't know how to interpret the magnitude of a correlation, or the magnitude of any other effect statistic. But people can understand trivial, small, moderate, and large, so qualitative terms like these can be used to discuss results.
A scale of magnitudes for effect statistics – a beginning
• Cohen (1988) reluctantly suggested a benchmark-scale for effect sizes based upon important findings in the behavioral and social sciences
• The main justification for this scale of correlations comes from the interpretation of the correlation coefficient as the slope of the line between two variables when their standard deviations are the same. For example, if the correlation between height (X variable) and weight (Y variable) is 0.7, then individuals who differ in height by one standard deviation will on average differ in weight by only 0.7 of a standard deviation. So, for a correlation of 0.1, the change in Y is only one-tenth of the change in X.
Standardized Effect Sizes    Small  Medium  Large
Correlation                   0.1    0.3     0.5
Cohen’s h (proportions)       0.2    0.5     0.8
A scale of magnitudes for effect statistics – a beginning
                     Trivial  Small  Moderate  Large  Very large  Nearly perfect  Perfect
Correlation            0       0.1     0.3      0.5      0.7          0.9            1
Difference in means    0       0.2     0.6      1.2      2            4           infinite
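A small helper makes the correlation row of this scale concrete; it simply maps |r| to the qualitative labels, treating each value in the table as a lower bound:

```python
def magnitude_of_r(r):
    """Qualitative label for a correlation, following the scale above
    (each threshold in the table is treated as the label's lower bound)."""
    r = abs(r)
    if r >= 1:
        return "perfect"
    for threshold, label in [(0.9, "nearly perfect"), (0.7, "very large"),
                             (0.5, "large"), (0.3, "moderate"), (0.1, "small")]:
        if r >= threshold:
            return label
    return "trivial"

print(magnitude_of_r(0.47))  # the earlier example: → moderate
```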
In principle, we are skeptical of this categorization, as it also encourages "mechanical thinking" similar to statistical significance tests. A "one size fits all" yardstick is problematic, as Cohen himself argued. However, for the sake of argument we use these benchmarks to illustrate what we need to focus upon: the importance of findings!
Returning to the group comparison
• r = Z / √N … where Z is the statistic from the Mann-Whitney tests used for the pairwise comparisons, and N is the total number of cases compared

          Group 1  Group 2  Group 3
Group 1      -      0.13     0.13
Group 2              -       0.02
Group 3                       -
Very small, close to trivial difference
Extremely trivial difference
        Group 1  Group 2  Group 3
MNCS      1.42     1.64     1.56
University rankings: Stanford University vs. UCLA
z-test = ”significant difference” (p < .0001)
Difference = 3.5 percentage points
H0: p1 = p2
HA: p1 ≠ p2

Leiden Ranking
Institution           pp_top 10%   No. Papers
Stanford University     22.90%       26917
UCLA                    19.40%       30865
Does this mean that we have a "difference that makes a difference“ between the rankings of these two universities?
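The slide's numbers can be reproduced with the standard two-proportion z-test and Cohen's h (the arcsine transformation of proportions); a sketch using the Leiden Ranking values above:

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: effect size for a difference between two proportions,
    via the arcsine transformation."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# pp_top 10% values and paper counts from the Leiden Ranking slide
z = two_proportion_z(0.229, 26917, 0.194, 30865)
h = cohens_h(0.229, 0.194)
print(round(z, 1), round(h, 2))  # → 10.3 0.09
```

The huge sample sizes make the z-test "significant" at any conventional level, while the effect size h ≈ 0.09 sits below even the "small" benchmark of 0.2.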
Differences in effect sizes between Stanford University and each of the other 499 institutions in the Leiden ranking

[Figure: Effect sizes for Stanford University (Leiden) – distribution and cumulative distribution of Cohen's h, binned from .00-.10 up to .91-1.1]

≈ 20% = ”trivial effects”
≈ 76% = ”small effects”
≈ 4% = ”medium effects”

Stanford vs. UCLA: Cohen’s h = 0.09
Relation between “pp_top 10%” values and Cohen's h for Stanford University

[Figure: scatter of pp_top 10% (y-axis, 0 to 30) against Cohen's h (x-axis, 0.00 to 0.70) for the institutions in the Leiden Ranking, with the ”small” and ”medium” effect regions marked]
Institution                  Cohen's h  pp_top 10%   Pubs
Hacettepe Univ Ankara          0.54        5.2        5773
Nihon Univ                     0.54        5.2        4423
Gifu Univ                      0.54        5.0        3357
Istanbul Univ                  0.56        4.7        5573
Gazi Univ Ankara               0.57        4.5        4420
Univ Belgrade                  0.58        4.3        6936
Ankara Univ                    0.59        4.1        4830
Adam Mickz Univ Poznan         0.59        4.0        3363
Moscow Mv Lom State Univ       0.61        3.8       14205
St Petersburg State Univ       0.64        3.1        4826

Institution                  Cohen's h  pp_top 10%   Pubs
Stanford Univ                  0.00       22.9       26917
Univ Calif Santa Barbara       0.01       22.6        9250
Univ Calif Berkeley            0.01       22.3       22213
Harvard Univ                   0.02       23.6       61462
Caltech                        0.02       22.0       13379
Univ Calif San Francisco       0.03       21.8       20527
Princeton Univ                 0.03       24.0       10832
Rice Univ                      0.03       21.7        4646
Univ Chicago                   0.04       21.1       14103
Univ Calif Santa Cruz          0.05       20.9        4667
Univ Washington - Seattle      0.06       20.5       29273
Columbia Univ                  0.06       20.2       25339
Yale Univ                      0.07       20.2       20366
MIT                            0.07       26.0       19294
Duke Univ                      0.07       19.8       21078
Univ Penn                      0.08       19.8       25837
Carnegie Mellon Univ           0.08       19.7        6371
Ecole Polytecn Fed Lausanne    0.08       19.6        8831
Univ Calif San Diego           0.08       19.4       23453
Washington Univ - St Louis     0.09       19.4       16512
Univ Calif Los Angeles         0.09       19.4       30865
As we said, we use these benchmarks only for the sake of argument and illustration; a ”trivial effect size” can in fact be important! But the distribution of effect sizes can certainly inform us.
Stanford University: pp_top 10% = 22.9
Explanatory research
“The data also show a statistically significant difference in the proportion of female authors depending on the gender of the first author (t = 2.707, df = 473, p = 0.007). Thus, when the first author was female, the average proportion of female co-authors per paper was 0.49 (SD 0.38, CI 0.53–0.45), whereas when a man was the first author, the average proportion of women dropped to 0.39 (SD 0.39, CI 0.43–0.35).”
[Figure: 95% CIs for the proportions of female co-authors, y-axis from 0.3 to 0.5]

Some uncertainty: each interval is 8 percentage points wide
Cohen’s d = 0.25, r = 0.12: a ”small effect”, perhaps not that ”significant”?
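These figures follow from the means and SDs quoted in the excerpt; a sketch using the equal-n simplification of the pooled SD (the slide's 0.25 and 0.12 presumably use the actual group sizes, so the results here land close but not identical):

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d for two independent means, using the equal-n
    simplification of the pooled SD (root mean of the two variances)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

def r_from_d(d):
    """Convert d to a correlation-scale effect size (equal-n formula)."""
    return d / math.sqrt(d ** 2 + 4)

# Means and SDs quoted in the study excerpt above
d = cohens_d(0.49, 0.38, 0.39, 0.39)
print(round(d, 2), round(r_from_d(d), 2))  # → 0.26 0.13 (slide: 0.25, 0.12)
```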
Possible empirical calibration of Effect Sizes
• Cohen always stated that the guidelines he was providing were simply first attempts to provide some useful general guidelines for interpreting social science research
• His suggestion was that specific areas of research would need to refine these guidelines
• Strictly empirical, based on the actual distribution of all effect sizes between the institutions in the Leiden Ranking
• Based on quartiles
• Small effects = difference between 3rd and 2nd quartiles
• Medium effects = difference between upper and lower half
• Large effects = top quartile versus bottom quartile
• Very instrumental and cannot initially be applied outside this dataset
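One possible reading of the quartile scheme above, sketched in Python on hypothetical data (an assumption: the slides give no formulas, so "difference between" is taken here as a difference of group means):

```python
import statistics

def quartile_calibration(values):
    """Empirical benchmarks from the observed effect-size distribution:
    small  = mean of 3rd-quartile group minus mean of 2nd-quartile group,
    medium = mean of upper half minus mean of lower half,
    large  = mean of top quartile minus mean of bottom quartile.
    This is one interpretation of the slide, not a published formula."""
    v = sorted(values)
    n = len(v)
    q = n // 4
    mean = statistics.fmean
    small = mean(v[2 * q:3 * q]) - mean(v[q:2 * q])
    medium = mean(v[n // 2:]) - mean(v[:n // 2])
    large = mean(v[3 * q:]) - mean(v[:q])
    return small, medium, large

# Hypothetical, uniformly spread effect sizes just to show the mechanics
print(quartile_calibration(list(range(100))))  # → (25.0, 50.0, 75.0)
```

As the slide notes, such benchmarks are tied to the dataset they were calibrated on and cannot initially be applied elsewhere.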
Summary
• Size matters
– Discuss the theoretical and/or practical importance of results (numbers)
– Always report effect sizes
– For want of something better, use an ordinal benchmarking like Cohen’s
– Calibrate benchmarks according to the findings and their degree of theoretical and practical importance
– Effect sizes are comparable across studies
• We need a focus on this; otherwise our studies will continue to be mainly instrumental and atheoretical – there is little understanding or advance in that!
Summary
©Lortie, Aarsen, Budden & Leimu (2012)
Thank you for your attention