
Power and Effect Size


Page 1: Power and Effect Size

POWER AND EFFECT SIZE

Page 2: Power and Effect Size

Previous Weeks

- A few weeks ago (week 9) I made a small chart outlining all the different statistical tests we've covered. I want to complete that chart using information from the past week.
- Most of this is a repeat, but a few new tests have been added.
- It is important that you are familiar with these tests, know when they are appropriate to use, and know how to run (most of) them in SPSS. You are excused from running ANCOVA and RM ANOVA.

Page 3: Power and Effect Size

When to use specific statistical tests…

# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (continuous) | 1 (continuous) | Association | Pearson Correlation (r)
1 (continuous) | 1 (continuous) | Prediction | Simple Linear Regression (m + b)
Multiple (continuous) | 1 (continuous) | Prediction | Multiple Linear Regression (m + b)

Page 4: Power and Effect Size

# of IV (format) | # of DV (format) | Examining… | Test/Notes
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When one group is a 'known' population = One-Sample t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are independent = Independent-Samples t-test
1 (grouping, 2 levels) | 1 (continuous) | Group differences | When both groups are dependent = Paired-Samples t-test
1 (grouping, ∞ levels) | 1 (continuous) | Group differences | One-Way ANOVA, with Post-Hoc (F ratio)

Page 5: Power and Effect Size

# of IV (format) | # of DV (format) | Examining… | Test/Notes
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences and interactions | Factorial ANOVA with Post-Hoc and/or Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders | ANCOVA (Analysis of Covariance) with Estimated Marginal Means (F ratio)
∞ (grouping, ∞ levels) | 1 (continuous) | Group differences, interactions, controlling for confounders in a related sample (e.g., longitudinal) | Repeated Measures ANOVA with Estimated Marginal Means (F ratio)

Page 6: Power and Effect Size

Tonight…

- A break from learning a new statistical 'test'. The focus will be on two critical statistical 'concepts':
- Statistical Power, which is related to alpha/statistical significance
- A brief overview of Effect Size: statistically significant results vs. meaningful results
- First, a quick review of error in testing…

Page 7: Power and Effect Size

Example Hypothesis

- Pretend my master's thesis topic is the influence of exercise on body composition. I believe people who exercise more will have lower %BF.
- To study this: I draw a sample and group subjects by how much they exercise, giving High and Low Exercise groups (this is my IV). I also assess %BF in each subject as a continuous variable (DV). I plan to see if the two groups have different mean %BF.
- My hypotheses (H0 and HA):
  HA: There is a difference in %BF between the groups.
  H0: There is not a difference in %BF between the groups.

Page 8: Power and Effect Size

Example Continued

- Now I'm going to run my statistical test, get my test statistic, and calculate a p-value. I've set alpha at the standard 0.05 level.
- By the way, what statistical test should I use…?
- My final decision on my hypotheses is going to be based on that p-value: I could reject the null hypothesis (accept HA), or I could accept the null hypothesis (reject HA).
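Per the chart on pages 3-5, one grouping IV with two independent levels and one continuous DV calls for an independent-samples t-test. Here is a minimal sketch of that test and decision rule in Python; the lecture used SPSS, and the %BF values below are made up for illustration, not the thesis data:

```python
# Minimal sketch of the thesis example, with made-up %BF data
# (the real analysis in the lecture was run in SPSS).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
high_ex = rng.normal(22, 6, size=20)  # hypothetical high-exercise %BF values
low_ex = rng.normal(26, 6, size=20)   # hypothetical low-exercise %BF values

t, p = stats.ttest_ind(high_ex, low_ex)  # independent-samples t-test
alpha = 0.05
decision = "reject H0 (accept HA)" if p <= alpha else "accept H0 (reject HA)"
print(f"t = {t:.2f}, p = {p:.3f} -> {decision}")
```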

Page 9: Power and Effect Size

Statistical Errors…

Since there are two potential decisions (and only one of them can be correct), there are two possible errors I can make:

- Type I Error: we reject the null hypothesis although it was really true (we should have accepted the null).
- Type II Error: we fail to reject the null hypothesis when it was really untrue (we should have rejected the null).

Page 10: Power and Effect Size

There are really 4 potential outcomes, based on what is "true" and what we "decide":

What is True | Decision: Reject H0 | Decision: Accept H0
H0 | Type I Error | Correct
HA | Correct | Type II Error

HA: There is a difference in %BF between the groups.
H0: There is not a difference in %BF between the groups.

Page 11: Power and Effect Size

Statistical Errors…

Remember: my final decision is based on the p-value.

Page 12: Power and Effect Size

What is True | Decision: Reject H0 | Decision: Accept H0
H0 | Type I Error | Correct
HA | Correct | Type II Error

If p ≤ 0.05, our decision is to reject H0.
If p > 0.05, our decision is to accept H0.

Page 13: Power and Effect Size

Statistical Errors…

In my analysis, I find:
- High Exercise group mean %BF = 22%
- Low Exercise group mean %BF = 26%
- p = 0.08

What is my decision? Accept H0: there is NOT a difference in %BF between the groups.

Why is that my decision, when the means ARE different? Because I can't be confident that the 4% difference between the two groups is not due to random sampling error.

Is it possible I've made an error in my decision?

Page 14: Power and Effect Size

Possible Error…?

If I did make an error, what type would it be? A Type II Error.

- When you find a p-value greater than alpha, the only possible error is Type II error.
- When you find a p-value less than alpha, the only possible error is Type I error.

Page 15: Power and Effect Size

Our p = 0.08, so we accepted H0. The only possible error is Type II.

What is True | Decision: Reject H0 | Decision: Accept H0
H0 | Type I Error | Correct
HA | Correct | Type II Error

If p ≤ 0.05, our decision is to reject H0.
If p > 0.05, our decision is to accept H0.

Page 16: Power and Effect Size

Possible Error…?

Compare Type I and Type II error like this:

- The only concern when you find statistical significance (p < 0.05) is Type I error: is the difference between groups REAL, or due to random sampling error? Thankfully, the p-value tells you exactly what the probability of that random sampling error is. In other words, the p-value tells you how likely Type I error is.
- But does the p-value tell you how likely Type II error is? No. The probability of Type II error is better provided by Power.

Page 17: Power and Effect Size

Possible Error…?

The probability of Type II error is provided by Power.

- Statistical Power is often discussed alongside β: β is the probability of Type II error, and power = 1 − β. We will not discuss the specific calculation of power in this class; SPSS can calculate it for you.
- Power is related to Alpha, but:
  Alpha is the probability of making a Type I error; a lower number is better (i.e., 0.05 vs. 0.01 vs. 0.001).
  Power is the probability of NOT making a Type II error, i.e., the probability of being right (correctly rejecting the null hypothesis); a higher number is better (the typical goal is 0.80).
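SPSS aside, the idea is easy to check by hand. A minimal sketch, assuming the statsmodels library: power for an independent-samples t-test given an effect size (Cohen's d, covered later tonight), a per-group N, and alpha. The specific numbers are illustrative, not from the lecture:

```python
# Minimal power sketch, assuming statsmodels; values are illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Probability of correctly rejecting a false H0 with a medium effect (d = 0.5),
# 30 subjects per group, and alpha = 0.05:
power = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"Power = {power:.2f}")  # well below the usual 0.80 target

# Raising alpha to 0.10 (factor 1 on the coming slides) raises power:
print(f"Power at alpha = 0.10: {analysis.power(0.5, 30, 0.10):.2f}")
```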

Let’s continue this in the context of my ‘thesis’ example

Page 18: Power and Effect Size

Statistical Errors…

In my analysis, I found:
- High Exercise group mean %BF = 22%
- Low Exercise group mean %BF = 26%
- p = 0.08, so I decided to accept the null

What do I do when I don't find statistical significance? What happens when the result does not reflect expectations? First, consider the situation.

Page 19: Power and Effect Size

Should it be statistically significant?

- The most obvious thing you need to consider is whether you REALLY should have found a statistically significant result. Just because you wanted your test to be significant doesn't mean it should be. That wouldn't be Type II error; it would just be the correct decision!
- In my example, researchers have shown in several studies that exercise does influence %BF, so this result 'should' be statistically significant, right? If the answer is yes, then you need to consider power.

Page 20: Power and Effect Size

- In my 'thesis', this result 'should' be statistically significant, right? Then the null result is probably an issue with Statistical Power.
- This scenario plays out at least once a year between myself and a grad student working on a thesis or research project: How can I increase the chance that I will find statistically significant results? Why was this analysis not statistically significant? What can I do to decrease the chance of Type II error?
- Several different factors influence power, your ability to detect a true difference.

Page 21: Power and Effect Size

How can I increase Power?

1) Increase the alpha level
- Changing alpha from 0.05 to 0.10 will increase your power (a better chance of finding significant results).
- Downsides to increasing your alpha level? It will increase the chance of Type I error! This is rarely acceptable in practice; it is only really an option when working in a new area, where researchers are unsure of how to measure a new variable or are unaware of confounders to control for.

Page 22: Power and Effect Size

How can I increase Power?

2) Increase N
- Sample size is used directly when calculating p-values, so including more subjects will increase your chance of finding statistically significant results (see the sketch below this list).
- Downsides to increasing sample size? More subjects mean more time/money. Still, more subjects are ALWAYS a better option if possible.
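A minimal sketch of planning for this, again assuming statsmodels: rather than computing power, solve for the per-group N needed to hit the usual 0.80 target. The effect size is an illustrative assumption:

```python
# Minimal sketch, assuming statsmodels: solve for the per-group sample size
# that reaches 0.80 power for a medium effect (d = 0.5) at alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"About {n_per_group:.0f} subjects per group needed")  # roughly 64
```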

Page 23: Power and Effect Size

How can I increase Power?

3) Use fewer groups/variables (simpler designs)
- Related to sample size, but different: 'use fewer groups', NOT 'use fewer subjects'.
- More groups negatively affect your degrees of freedom; remember, df is calculated from the number of groups and the number of subjects.
- Lots of variables, groups, and interactions make it more difficult to find statistically significant differences. The purpose of the family-wise error rate is to make it harder to find significant results!
- Downsides to fewer groups/variables? Sometimes you NEED to make several comparisons and test for interactions; that is unavoidable.

Page 24: Power and Effect Size

How can I increase Power?

4) Measure variables more accurately
- If variables are poorly measured (sloppy work, broken equipment, outdated equipment, etc.), this increases measurement error, and more measurement error decreases confidence in the result.
- For example, perhaps I underestimated %BF in my 'low exercise' group? This could lead to Type II error. It is more of an internal validity problem than a statistical problem.
- Downsides to measuring more accurately? None, if you can afford the best tools.
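A small simulation (my own illustration, not from the lecture) makes the point: layering measurement noise on top of a real 4-point %BF difference shrinks the fraction of studies that reach p < 0.05, i.e., it lowers power:

```python
# Illustrative simulation: measurement error lowers power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_rate(noise_sd, n=30, trials=2000):
    """Fraction of simulated studies with p <= 0.05, given a true
    4-point %BF difference (22 vs 26, SD = 5) plus measurement noise."""
    hits = 0
    for _ in range(trials):
        high = rng.normal(22, 5, n) + rng.normal(0, noise_sd, n)
        low = rng.normal(26, 5, n) + rng.normal(0, noise_sd, n)
        if stats.ttest_ind(high, low).pvalue <= 0.05:
            hits += 1
    return hits / trials

print(rejection_rate(noise_sd=0.0))  # accurate measurement: power near 0.85
print(rejection_rate(noise_sd=5.0))  # sloppy measurement: power drops sharply
```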

Page 25: Power and Effect Size

How can I increase Power?

5) Decrease subject variability
- Subjects will have various characteristics (SES, sex, race/ethnicity, age, etc.) that may also be correlated with your variables. These can confound your results, making it harder to find statistically significant results.
- When planning your sample (to enhance power), select subjects that are very similar to each other. This is one reason why repeated measures tests and paired samples are more likely to produce statistically significant results.
- Downside to decreasing subject variability? It will decrease your external validity (generalizability): if you only test women, your results do not apply to men.

Page 26: Power and Effect Size

How can I increase Power?

6) Increase the magnitude of the mean difference
- If your groups are not different enough, make them more different! For example, instead of measuring just high and low exercisers, perhaps I compare marathon runners vs. completely sedentary people: a 'very' high exercise group vs. a 'very' low exercise group, sampling at the extremes and getting rid of the middle group.
- Downsides to using the extremes? Similar to decreasing subject variability, this will decrease your external validity.

Questions on Power/Increasing Power?

Page 27: Power and Effect Size

The Catch-22 of Power and P-values

- I've mentioned this previously, but once you are able to draw a large sample, it ruins the utility of p/statistical significance. The larger your sample, the more likely you'll find statistically significant results; sometimes minuscule differences between groups or tiny correlations are 'significant'.
- This becomes relevant once sample size grows to 100-150 subjects per group. Once you approach 1000 subjects, it's hard not to find p < 0.05.
- Example from the most highly cited paper in Psych, 2004…

Page 28: Power and Effect Size

This paper was the first to find a link between playing video games/TV and aggression in children. [Correlation table not reproduced.]

Every correlation in the table except one has p < 0.05. Do you remember what a correlation of 0.10 looks like?

Page 29: Power and Effect Size

r = 0.10 [scatterplot not reproduced]

Do you see a relationship between these two variables?
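The point is easy to demonstrate. A minimal simulation (my own, not the paper's data): build two variables with a true correlation near 0.10, draw a large sample, and the p-value still clears the 0.05 bar:

```python
# Illustrative simulation: with a large N, even r ~ 0.10 is 'significant'.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=n)
y = 0.10 * x + rng.normal(size=n)  # true correlation close to 0.10

r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, r^2 = {r**2:.3f}, p = {p:.5f}")
# p is almost certainly < 0.05, yet r^2 says x explains only ~1% of y's variance.
```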

Page 30: Power and Effect Size

What now?

- This realization has led scientists to begin avoiding p-values (or at least to avoid reporting only p-values) and to move toward reporting 95% confidence intervals, especially in areas of research where large samples are common (epidemiology, psychology, sociology, etc.).
- Some people interpret 'statistically significant' as meaning 'important'. We've mentioned several times that this is NOT true: statistically significant just means the result is likely not Type I error. You can have 'important' results that aren't statistically significant.

Page 31: Power and Effect Size

Effect Size

- To get an idea of how 'important' a difference or association is, we can use Effect Size. There are over 40 different types of effect size, depending on the statistical test used, and SPSS will NOT always calculate effect size for you.
- Effect size is like a 'descriptive' statistic that tells you about the magnitude of the association or group difference. It is not impacted by statistical significance; an effect size can stay the same even if the p-value changes. Present the two together when possible.
- The goal is not to teach you how to calculate effect size, but to understand how to interpret it when you see it.

Page 32: Power and Effect Size

Effect Size

- Understanding effect size from correlations and regressions is easy (and you already know it): r², the coefficient of determination, is the % of variance accounted for.
- Pearson correlations between %BF and 3 variables: r = 0.54, r = -0.92, r = 0.70. Which of the three correlations has the most important association with %BF? Their r² values are 0.29, 0.85, and 0.49.
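A quick check of those numbers, squaring each r from the slide:

```python
# Square each r to get the variance accounted for (r^2).
for r in (0.54, -0.92, 0.70):
    print(f"r = {r:+.2f}  ->  r^2 = {r * r:.2f}")
# The r = -0.92 variable wins: it accounts for 85% of the variance in %BF,
# even though the correlation is negative.
```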

Page 33: Power and Effect Size

Interpreting Effect Size

- Usually, guidelines are given for interpreting the effect size, to help you know how important the effect is. They are only a guide; you can use your own brain to compare.
- In general, r² is interpreted as:
  0.01 or smaller: a Trivial effect
  0.01 to 0.09: a Small effect
  0.09 to 0.25: a Moderate effect
  > 0.25: a Large effect

Page 34: Power and Effect Size

Effect Size in Regression

- Two regression equations each contain 4 predictors of %BF, and each 'model' is statistically significant. Here are their r² values: 0.29 and 0.15.
- Which has the largest effect size? Does either of the regression models have a large effect size?
- The 0.29 model is the most important and has a 'large' effect size; the 0.15 model is of 'moderate' importance.

Page 35: Power and Effect Size

Effect Size for Group Differences

- Effect size in t-tests and ANOVAs is a bit more complicated.
- In general, effect size is the ratio of the mean difference between two groups to the standard deviation. Does this remind you of anything we've previously seen? Z-score = (Score − Mean)/SD.
- Effect size, calculated this way, basically determines how many standard deviations apart the two groups are. E.g., an effect size of 1 means the two groups differ by 1 standard deviation (which would be a big difference)!

Page 36: Power and Effect Size

Example

- When working with t-tests, the effect size calculated as mean difference/SD is called Cohen's d:
  < 0.1: Trivial effect
  0.1-0.3: Small effect
  0.3-0.5: Medium effect
  > 0.5: Large effect
- The next slide is the result of a repeated-measures t-test from a past lecture; we'll calculate Cohen's d.

Page 37: Power and Effect Size

Paired-Samples t-test Output

- Mean difference = 2.9, Std. Deviation = 5.2
- Cohen's d = 0.55, a large effect size. Essentially, the weight loss program reduced body weight by just about half a standard deviation.
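A quick check of that calculation (note that the rounded values shown give d ≈ 0.56; the slide's 0.55 presumably comes from the unrounded SPSS output):

```python
# Cohen's d for a paired test: mean of the difference scores over their SD.
mean_diff = 2.9  # from the SPSS paired-samples output
sd_diff = 5.2
d = mean_diff / sd_diff
print(f"Cohen's d = {d:.2f}")  # 0.56 with these rounded inputs -> large effect
```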

Page 38: Power and Effect Size

Another example

- I sample a group of 100 ISU students and find their average IQ is 103. Recall that the population mean for IQ is 100, with SD = 15.
- I run a one-sample t-test and find it to be statistically significant (p < 0.05). However, the effect size is 0.2, a Small effect.
- Interpretation: while this difference is likely not due to random sampling error, it's not very important either.
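The same ratio in its one-sample form recovers the slide's 0.2:

```python
# One-sample Cohen's d: (sample mean - population mean) / population SD.
sample_mean, pop_mean, pop_sd = 103, 100, 15
d = (sample_mean - pop_mean) / pop_sd
print(f"Cohen's d = {d:.2f}")  # 0.20 -> small effect, despite p < 0.05
```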

Page 39: Power and Effect Size

Other types of effect sizes

- SPSS will not calculate Cohen's d for t-tests. However, it will calculate effect size for ANOVAs (if you request it): not Cohen's d, but Partial Eta Squared (η²), which is similar to r² and interpreted on the same scale.
- Here is last week's cancer example: do Tumor Size and Lymph Node Involvement affect Survival Time? I'll re-run the analysis and request effect size…

Page 40: Power and Effect Size
[SPSS output screenshot not reproduced]

Page 41: Power and Effect Size
[SPSS output screenshot not reproduced]

Page 42: Power and Effect Size

- Notice that η² can be computed for the entire 'model', or for each main effect and interaction individually.
- How would you describe the effect of Tumor Size, or of our interaction? A Trivial-to-Small effect. So how did we get a significant p-value? Other factors not in our model are also very important.

Page 43: Power and Effect Size

Notice that the r² is equal to the η² of the full model. The advantage of η² is that you can evaluate individual effects.
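SPSS reports η² when you check "estimates of effect size", but the same quantity can be recovered from any ANOVA table as SS_effect / (SS_effect + SS_error). A minimal sketch, assuming statsmodels and made-up data (the lecture's cancer dataset is not reproduced here):

```python
# Partial eta squared from an ANOVA table: SS_effect / (SS_effect + SS_error).
# Made-up data; assumes pandas and statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "tumor_size": rng.choice(["small", "large"], size=80),
    "nodes": rng.choice(["no", "yes"], size=80),
})
# Survival time with a built-in disadvantage for large tumors:
df["survival"] = rng.normal(60, 10, size=80) - 5 * (df["tumor_size"] == "large")

model = ols("survival ~ C(tumor_size) * C(nodes)", data=df).fit()
table = sm.stats.anova_lm(model, typ=2)

ss_error = table.loc["Residual", "sum_sq"]
for effect in ["C(tumor_size)", "C(nodes)", "C(tumor_size):C(nodes)"]:
    ss = table.loc[effect, "sum_sq"]
    print(f"{effect}: partial eta^2 = {ss / (ss + ss_error):.3f}")
```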

Page 44: Power and Effect Size

Effect Size Summary

- Many other types of effect sizes are out there; I just wanted to show you the effect sizes most commonly used with the tests we know:
  Correlation and Regression: r²
  t-tests: Cohen's d
  ANOVA: Partial eta squared (η²) and/or r²
- You are responsible for knowing: the general theory behind effect sizes and why to use them, what tests they are associated with, and how to interpret them.

Page 45: Power and Effect Size

QUESTIONS ON POWER? EFFECT SIZE?

Page 46: Power and Effect Size

Upcoming…

- In-class activity
- Homework:
  Cronk: Read Appendix A (pp. 115-119) on Effect Size
  Holcomb: Exercises 21 and 22
  No out-of-class SPSS work this week
- Things are slowing down; next week we'll discuss non-parametric tests: Chi-Square and Odds Ratio.