
Chapter 4. Changing the Way We Do:

Hypothesis Testing, Power, Effect Size, and Other Misunderstood Issues

Annisa Fitri Irwan, Della Oferischa, Musfera Nara Vadia, Rezky Jafri

K4-13

There are a few important steps that researchers can take to make the statistical results from their studies meaningful and useful. They are:

Perform a power analysis before undertaking a study in order to determine the number of participants that should be included, so as to ensure an adequate level of power (power should be higher than .50 and ideally .80). Briefly put, power is the probability that you will find differences between groups or relationships among variables if they actually do exist.

Never set an alpha level lower than .05, and try to set it higher, to .10, if at all acceptable to the research community one is working in.

Report effect sizes and their interpretation. Report confidence intervals.

1. Null Hypothesis Significance Tests

The null hypothesis (H0) states that there is no difference between groups or that there is no relationship between variables.

Once we reject the null hypothesis, we can accept the alternative hypothesis (Ha).

Example:

HR (research hypothesis): Will 15 minutes of practicing meaningful drills result in more accurate grammar scores than 15 minutes of telling a story where the grammar in question must be used?

H0 : There is no [statistical] difference between a group which practices grammar using explicit meaningful drills for 15 minutes each day and a group which uses grammar implicitly by telling stories where the grammar is needed for 15 minutes each day.

Ha: There is a [statistical] difference between the explicit and implicit group.

Since only two groups are being compared, a t-test can be used. The t-test statistic is calculated from three pieces of information: the mean scores of the groups, their variances, and the size of each group (the sample size).
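As a rough sketch, such a comparison can be run in R, the software referred to later in this chapter for power calculations; the scores below are invented purely for illustration.

# Hypothetical grammar scores for the two groups (invented data)
explicit <- c(78, 85, 90, 72, 88, 81, 79, 84)
implicit <- c(70, 75, 82, 68, 77, 73, 80, 71)
# t.test() uses exactly the three pieces of information named above:
# the group means, the group variances, and the group sizes
t.test(explicit, implicit)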

In the NHST process, we should already have decided on a cutoff level that we will use to judge the results of the statistical test as extreme. This is called the alpha level or significance level.

Baayen (2008): "If the p-value is lower than the alpha level we set, we reject the null hypothesis and accept the alternative hypothesis that there is a difference between the two groups (it does not necessarily mean the alternative hypothesis is correct)."

P-value: the probability of finding a [insert statistic name here] this large or larger if the null hypothesis were true is [insert p-value].

One-Tailed versus Two-Tailed Tests of Hypothesis

If the only thing we care about is one of the two possibilities, we can use a one-tailed test. A one-tailed, or directional, test of a hypothesis looks only at one end of the distribution. A one-tailed test has more power to find differences, because it can allocate the entire alpha level to one side of the distribution rather than splitting it between both ends.

A two-tailed hypothesis examines both possibilities: the difference could go in either direction. For example, under our null hypothesis we would examine both the possibility that the explicit group was better than the implicit group and the possibility that the explicit group was worse than the implicit group.
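Continuing the invented example above, the choice between the two kinds of test is made in R through the alternative argument of t.test():

explicit <- c(78, 85, 90, 72, 88, 81, 79, 84)  # same invented scores as above
implicit <- c(70, 75, 82, 68, 77, 73, 80, 71)
# Two-tailed: a difference in either direction counts as extreme
t.test(explicit, implicit, alternative = "two.sided")
# One-tailed: only "explicit higher than implicit" counts, so the whole
# alpha level is allocated to one tail and the test has more power
t.test(explicit, implicit, alternative = "greater")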

Outcomes of Statistical Testing

Outcome observed in study   True situation: no effect exists          True situation: effect exists
No effect found             Correct decision (probability = 1 − α)    Type II error (probability = β)
Effect found                Type I error (probability = α)            Correct decision (probability = 1 − β) = power

Type I error (being overeager): concluding there is a relationship when there is none.

Set the Type I error level by setting the alpha (α) level.

o Commonly set at α = .05
o The possibility of a Type I error is thus 5%

Type II error (being overly cautious): concluding there is no relationship when there is one.

Set the Type II error level (β) and then calculate power (power = 1 − β).

o Commonly set at β = .20, resulting in power = .80
o The possibility of a Type II error is thus 20%

Avoid low power by:

o Having adequate sample sizes
o Using a reliable dependent variable
o Controlling for individual differences
o Including a pre-test
o Using a longer post-test
o Making sure not to violate statistical assumptions

Problems with NHST

There are problems with using the NHST method to draw conclusions about experiments. One is that some authors interpret a low p-value as indicative of a very strong result. However, a lower p-value does not make a study more significant in the generally accepted sense of being important.

Rather, the p-value of a study is partly an index of group size and the power of the study, and this is why a p-value of .049 and a p-value of .001 are not equivalent, although both are lower than α = .05.

Change the Way I Do Statistics

Reporting exact p-values (unless they are so small it would take too much room to report them)

Talking about “statistical” results instead of “significant” or “statistically significant” results

Providing confidence intervals and effect sizes whenever possible.

2. Power Analysis

Power is the probability of detecting a statistical result when there are in fact differences between groups or relationships between variables.

Power often translates into the probability that the test will lead to a correct conclusion about the null hypothesis.

What are the theoretical implications if power is not high?

If the power of a test is .50, this means that there is only a 50% chance that a true effect will be detected. In other words, even though there is in fact an effect for some treatment, the researcher has only a 50/50 chance of finding that effect.

What is the optimal level of power?

Power should be above .50 and would be judged adequate at .80. A power level of .80 means that four out of five times a real effect in the population will be found.

Power levels ought to be calculated before a study is done, and not after.

Help with calculating power using R

This section explains how to use the arguments of the "pwr" library, how to calculate effect sizes, and Cohen's guidelines for the magnitude of effect sizes. Cohen meant these guidelines as a help to those who may not know how to start doing power analyses, but once you have a better idea of effect sizes you may be able to make your own guidelines about what constitutes a small, medium, or large effect size for the particular question you are studying.

Remember that choosing a small effect size means that you expect the difference between groups to be quite small.
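As a minimal sketch of such a power analysis with the pwr library mentioned above (the value d = .5 is Cohen's "medium" guideline, chosen here only for illustration):

library(pwr)  # install.packages("pwr") if it is not yet installed
# How many participants per group does a two-sample t-test need to
# detect a medium effect (d = .5) with alpha = .05 and power = .80?
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80,
           type = "two.sample", alternative = "two.sided")
# The answer is an n of about 64 per group

Note that pwr.t.test() solves for whichever argument is left out; supplying n and d instead, and omitting power, would return the power of a planned study.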

3. Effect Size

Effect size is the magnitude of the impact of the independent variable on the dependent variable.

An effect size gives the researcher insight into whether the size of the difference between groups is important or negligible.

If the effect size is quite small, then it may make sense to simply discount the findings as unimportant, even if they are statistical.

If the effect size is large, then the researcher has found something that is important to understand.

Effect sizes do not change no matter how many participants there are; this makes effect sizes a valuable piece of information, much more valuable than the question of whether a statistical test is "significant" or not.

Understanding Effect Size Measures

Huberty (2002) divided effect sizes into two broad families: group difference indexes and relationship indexes. Both the group difference and relationship effect sizes are ways to provide a standardized measure of the strength of the effect that is found.

A group difference index, or mean difference measure, has been called the d family of effect sizes by Rosenthal (1994). The prototypical effect size measure in this family is Cohen’s d. Cohen’s d measures the difference between two independent sample means, and expresses how large the difference is in standard deviations.

Relationship indexes, also called the r family of effect sizes, measure how much an independent and dependent variable vary together or, in other words, the amount of covariation in the two variables. The more closely the two variables are related, the higher the effect size.
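As an illustration of the two families, here is one common way to compute an index from each in R, on invented data; the pooled standard deviation used as the standardizer here is option B in Table 4.6 below.

explicit <- c(78, 85, 90, 72, 88, 81, 79, 84)   # invented scores
implicit <- c(70, 75, 82, 68, 77, 73, 80, 71)
n1 <- length(explicit); n2 <- length(implicit)
# d family: difference between the two means in pooled-SD units
s_pooled <- sqrt(((n1 - 1) * var(explicit) + (n2 - 1) * var(implicit)) /
                 (n1 + n2 - 2))
(mean(explicit) - mean(implicit)) / s_pooled    # Cohen's d
# r family: how much two continuous variables vary together
hours  <- c(2, 5, 7, 1, 6, 4, 3, 5)             # invented
scores <- c(60, 75, 88, 55, 80, 70, 66, 78)     # invented
cor(hours, scores)                              # Pearson's r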

Calculating Effect Size for Power Analysis

How do we determine the effect size to expect? The best way is to look at effect sizes from previous studies in this area to see what the magnitude of effect sizes has been. If there are none, the researcher must make an educated guess about the size of effect they would find acceptable, or use Cohen's effect size guidelines.

Cohen notes that effect sizes are likely to be small when one is undertaking research in an area that has been little studied, as researchers may not yet know what kinds of variables they need to control for.

A small effect size is one that is not visible to the naked eye but exists nevertheless.

Cohen notes that effect size magnitudes depend on the area.

Calculating Effect Size: Summary

In general, statistical tests which include a categorical variable that divides the sample into groups, such as the t-test or ANOVA, use the d family of effect sizes to measure effect size. The basic idea of effect sizes in the d family is to look at the difference between the means of two groups, as in μA − μB.

Table 4.6 Options for Computing Standardizers (the Denominator for d Family Effect Sizes)

A The standard deviation of one of the groups, perhaps most typically the control group

B The pooled standard deviation of [only the groups] being compared

C The pooled standard deviation [of all the groups] in the design

4. Confidence Intervals

The confidence interval represents "a range of plausible values for the corresponding parameter," whether that parameter is the true mean, the difference in scores, or whatever.

The width of the confidence interval indicates the precision with which the difference can be calculated or, more precisely, the amount of sampling error.

If there is a lot of sampling error in a study, then confidence intervals will be wide, and the statistical results may not be very good estimates.
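In R, t.test() reports this interval as part of its output; continuing the invented example:

explicit <- c(78, 85, 90, 72, 88, 81, 79, 84)  # invented scores
implicit <- c(70, 75, 82, 68, 77, 73, 80, 71)
# 95% confidence interval for the difference between the group means;
# a wide interval signals a large amount of sampling error
t.test(explicit, implicit)$conf.int
# A different level can be requested, e.g. t.test(x, y, conf.level = .90)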

Power through Replication and Belief in the “Law of Small Numbers”

Tversky and Kahneman (1971) point out that replication studies will ideally have a larger number of participants than the original study. With a sample size that is larger than the original, the experimenter will have a better chance of finding a significant result. Sample sizes play a direct role in the amount of power that a study has, and also directly affect the p-value of the test statistic.
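Base R's power.t.test() function makes this concrete: holding the standardized effect fixed, a larger sample yields considerably more power (the values in the comments are approximate):

# Same effect (delta = .4, sd = 1), two different per-group sample sizes
power.t.test(n = 20, delta = 0.4, sd = 1, sig.level = 0.05)$power  # about .22
power.t.test(n = 80, delta = 0.4, sd = 1, sig.level = 0.05)$power  # about .71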

Tversky and Kahneman’s proposed law of small numbers states that “the law of large numbers applies to small numbers as well”

In other words, researchers believe that even a small sample should represent the whole population well, which leads to unfounded confidence in results found with small sample sizes.

Larson-Hall, Jenifer. 2010. A Guide to Doing Statistics in Second Language Research Using SPSS. New York: Routledge.