
Hypothesis Testing: Statistical Approach (6 basic steps)

1. Formulate Null (Ho) and Alternative (HA) Hypotheses.

2. Determine test criteria – what test procedure will be used and what “evidence” is required to convince.

3. Collect data, summarize data and measure discrepancy from initial assumption.

4. Compute the likelihood that chance variation would produce the observed result, i.e. find the p-value.

5. Make decision and interpret your findings.

6. Quantify significant findings: power analysis, effect size (ES), and confidence intervals.

STEP 1: Formulate Null and Alternative Hypotheses (Ho and HA)

Null Hypothesis (Ho)

• Opposite of alternative hypothesis

• Statement of “nothing going on”

• Statement that reflects “equality” or the “status quo”

• IS ASSUMED TO BE THE TRUTH WHEN CONDUCTING THE TEST!

STEP 1: Formulate Null and Alternative Hypotheses (Ho and HA)

Alternative Hypothesis (HA)

• Also called the “Research Hypothesis”

• Reflects what is believed to be the case or what the researcher “hopes” to show.

• Statement of “inequality” or change from “status quo”

• States that a relationship or difference exists.

Example: Grade inflation? (Has mean GPA increased since 1990?)

Population of 5 million college students

Is the average GPA 2.7? (Imagine that 2.7 was the mean GPA for U.S. college students in 1990.)

Sample of 100 college students

How likely is it that 100 students would have an average GPA as large as 2.9 if the population average was 2.7?

Example: Grade inflation? Has mean GPA increased since 1990?

μ = current mean GPA of U.S. college students

Ho: μ ≤ 2.7 (mean GPA now is the same or less than it was in 1990)

HA: μ > 2.7 (mean GPA now is greater than it was in 1990)

Alternative hypothesis reflects research hypothesis that the mean GPA for college students is greater than it was in 1990.

STEP 1: More comments about the null hypothesis (Ho)

• In comparative experiments, the null says the populations do not differ in terms of a given outcome, e.g. in a treatment vs. control experiment it would state there is no “treatment effect”.

• In studies where we are interested in determining if a “relationship exists” the null says that there is no association or trend between the factors being studied, e.g. no difference in risk of cancer associated with a potential risk factor such as smoking.

STEP 1: More comments about the alternative hypothesis (HA)

• In comparative experiments, the alternative says the populations DO differ in terms of a given outcome, e.g. in a treatment vs. control experiment it would state there is a “treatment effect”.

• In studies where we are interested in determining if a “relationship exists” the alternative states that there is an association or trend between the factors being studied, e.g. there is increased risk of cancer associated with smoking. Note that this is association, NOT CAUSATION!

STEP 2: Determine test criteria

a) Choose significance level (α)

b) Determine test procedure that will be used

STEP 2 (a): Choose significance level

Briefly, the significance level (α) “sets the bar” for what it means for our results to be viewed as “unlikely to be produced by chance variation alone”.

The smaller we choose α, the more discrepant our observed results have to be from what we would expect to see if the null hypothesis were true, i.e. the weirder our results have to be to convince us to support the alternative.

STEP 2 (a): Choose significance level

Typically α = .05 is used, which means that if fewer than 1 study in 20 would produce the observed result when the null (Ho) is true, we would then be convinced that the assumed null is probably not the case and we would decide in favor of the alternative (HA).

STEP 2(b): Determine test procedure

What test procedure we use is dictated by:

• the nature of the question to be answered

• the data types of the variable(s) involved, i.e. nominal, ordinal, discrete or continuous.

• distributional properties of the variables involved, e.g. normality for measured numeric quantities.

STEP 2(b): Determine test procedure

Common test procedures you might be familiar with:

• t-Test (both one- and two-sample)

• Chi-square test of independence

• Paired t-Test

• ANOVA (F-tests)

• Wilcoxon or Mann-Whitney rank-sum test

• And many, many, more!

STEP 2(b): Determine test procedure

This table is probably pretty scary at this point, but I think we will cover most of the tests in this table in the course.



What is common to all these and other test procedures is that they all in some way quantify, or measure numerically, the discrepancy between what is observed and what we would expect to see if the null hypothesis were true.


Many of the tests you learn about in an introductory statistics course use a test statistic of the form:

$$\text{Test Statistic} = \frac{(\text{estimate from data}) - (\text{parameter value assuming } H_o \text{ is true})}{SE(\text{estimate})}$$

These statistics measure how many SEs your observed estimate lies away from what you expected to see, i.e. “think z-score”. Test statistics outside -2 to 2 are “strange”.
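The general form of the test statistic can be sketched in a few lines of Python. This is only an illustration: the function name `standardized_discrepancy` and the numbers passed to it are hypothetical and do not come from the slides.

```python
def standardized_discrepancy(estimate, null_value, se):
    """Return how many standard errors the estimate lies from the null value."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return (estimate - null_value) / se


# Hypothetical numbers, for illustration only (not from the GPA example).
z = standardized_discrepancy(estimate=52.0, null_value=50.0, se=0.8)

# Rough rule of thumb from the slide: values outside -2 to 2 are "strange".
is_strange = abs(z) > 2
print(f"test statistic = {z:.2f}, strange = {is_strange}")
```

The same function applies to any estimate that comes with a standard error, which is why so many introductory tests share this form.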



Example: Grade Inflation?

Sample of 100 college students: Mean = 2.9, SD = .6

Example: Grade Inflation (cont’d)

Test Statistic for a Single Population Mean (μ)

$$\text{Test Statistic} = t = \frac{\bar{X} - \mu_o}{SE(\bar{X})} = \frac{\bar{X} - \mu_o}{s/\sqrt{n}} \sim \text{standard normal or t-distribution } (df = n - 1)$$

We can then use a standard normal or t-distribution to find the probability we would get a discrepancy as large or larger than our observed test statistic by chance variation alone.

STEP 3: Collect data and compute test statistic

• Summarize all variables involved both numerically and graphically

• Look at distributional shape, as many test procedures require approx. normality for continuous variables.

• Check for outliers.

• Compute the test statistic from our observed data.



Sample of 100 college students: Mean = 2.9, SD = .6

$$\bar{X} = 2.91, \quad s = .61, \quad n = 100$$

How likely are we to obtain a sample mean this large when sampling from a population whose mean is μ = 2.7?

The GPA distribution is only slightly left-skewed and looks reasonably normal.

Example: Grade Inflation (cont’d)

Test Statistic for a Single Population Mean (μ)

$$t = \frac{\bar{X} - \mu_o}{SE(\bar{X})} = \frac{\bar{X} - \mu_o}{s/\sqrt{n}} = \frac{2.91 - 2.7}{.61/\sqrt{100}} = 3.44, \quad df = n - 1 = 99$$

Our observed mean is 3.44 SE’s higher than what we would expect if the mean GPA of U.S. college students was still 2.70 as it had been in 1990. This is a very extreme result!!! 99.73% of the time we expect such a value to lie in the range from -3.00 to 3.00 and we are beyond that!
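The arithmetic behind the test statistic can be reproduced with a short Python sketch. The numbers are the ones given in the example; the variable names are just illustrative choices.

```python
import math

# Summary statistics from the grade inflation example.
x_bar, s, n = 2.91, 0.61, 100   # sample mean, sample SD, sample size
mu_0 = 2.70                      # mean GPA assumed under Ho (the 1990 value)

se = s / math.sqrt(n)            # standard error of the sample mean
t = (x_bar - mu_0) / se          # number of SEs above the hypothesized mean
df = n - 1                       # degrees of freedom for the t-distribution

print(f"SE = {se:.3f}, t = {t:.2f}, df = {df}")
```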

STEP 4: Find p-value

• The p-value is the probability that chance variation alone would produce a test statistic value as extreme or more extreme than the one observed assuming the null hypothesis is true.

• The final part of the definition, “assuming the null hypothesis is true”, is key: the probability is calculated assuming the null is true!

Aside: p-value and the O.J. Simpson trial

• Back to the O.J. Simpson criminal trial for a moment. The p-value in that trial is the “probability” that they would have evidence consisting of: size 12 Bruno Magli bloody footprint, bloody glove, blood spots on white Ford Bronco, the knock on the wall, DNA evidence from above, motive (?), etc… if O.J. Simpson was innocent as we are assuming initially. Clearly it is unlikely there would be this much evidence against an innocent person.

• How unlikely would you say it is? ____________

• This probability is your p-value.

• What is your verdict in light of your p-value?

Example: Grade Inflation (cont’d)

p-value calculation and interpretation

$$t = \frac{\bar{X} - \mu_o}{s/\sqrt{n}} = \frac{2.91 - 2.7}{.61/\sqrt{100}} = 3.44$$

P(t > 3.44) = .0003. Therefore the probability that chance variation alone would produce a sample mean of 2.91 or higher when sampling from a population whose mean is actually 2.7 is .0003, or 3 out of 10,000! It is highly unlikely that chance variation would produce this result!
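As a rough check on this tail probability, the sketch below uses the standard normal distribution from Python's standard library. That is an assumption made for convenience: the slides allow either a standard normal or a t-distribution, and an exact t-distribution with 99 degrees of freedom would give a slightly larger tail area.

```python
import math
from statistics import NormalDist

# Test statistic from the grade inflation example.
t = (2.91 - 2.70) / (0.61 / math.sqrt(100))

# One-sided (upper-tail) p-value, because HA says the mean is GREATER than 2.7.
# Assumption: standard normal approximation instead of t with 99 df.
p_value = 1 - NormalDist().cdf(t)

print(f"p-value = {p_value:.4f}")
```

Only the upper tail is used because the alternative hypothesis is one-sided; a two-sided alternative would double this value.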


• Working under the assumption that the mean GPA of U.S. college students is still 2.70, we have determined that only 3 samples in 10,000 would be expected to produce a sample mean of 2.91 or higher.

• Thus our observed data (evidence) is very inconsistent with the assumption that the mean GPA today is as it was in 1990.


We have two ways to think about this outcome:

a) The mean GPA is still 2.70 and by dumb luck we have obtained a very rare sample, one which will be produced by chance variation in 3 out of 10,000 random samples.

b) Our assumption that the mean GPA is still 2.70 is wrong, and it is actually the case that the mean GPA of U.S. college students has increased from where it was in 1990, which makes the observed result more plausible.

Given these two options, (b) is clearly the way to go. Why on earth would we think that we obtained this incredibly rare sample? The observed result is much more plausible when sampling from a population whose mean is greater than 2.70.

STEP 5: Make decision and interpret

Decision rule based on p-values:

• If p-value < α we Reject Ho in favor of the alternative.

• If p-value > α we Fail to Reject Ho and conclude there is insufficient evidence to support the alternative.

Note: This is true for ALL statistical tests!
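The decision rule can be written as a tiny Python function. This is a sketch: the function name `decide` is hypothetical, and treating a p-value exactly equal to α as "fail to reject" is an assumption, since the rule above does not cover that boundary case.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule at significance level alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p-value must be between 0 and 1")
    # Assumption: p_value == alpha is treated as "fail to reject Ho".
    return "reject Ho" if p_value < alpha else "fail to reject Ho"


print(decide(0.0003))   # grade inflation example
print(decide(0.20))     # hypothetical p-value, for contrast
```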

STEP 5: Make decision and interpret

Decision:

• Because our p-value = .0003 < .05 we reject the null hypothesis in favor of the alternative.

Interpretation:

• We conclude that the mean GPA of U.S. college students today is greater than 2.70, which is what it was back in 1990.

STEP 5: Make decision and interpret

• What we have decided is that we have statistically significant evidence (p = .0003) to conclude the mean GPA has increased from what it was in 1990.

Follow-up Questions:

1. How much larger is it?

2. Is it a large enough increase that we actually care?

3. Does this result tell us that grades in U.S. colleges and universities are being inflated by professors?

4. Was the sampling method biased in some way?

5. Others?

STEP 6: Quantifying significant findings

• Calculate Effect Size (ES)

• Construct Confidence Interval

• Calculate Power (see Type I & II Error ppt)

Effect Size (ES) - (denoted d)

• For a single population mean situation the effect size is simply the magnitude of the difference between the observed mean and the hypothesized mean divided by the standard deviation (not the SE).

• We will examine other ES estimates for other situations later in the course.

$$d = \frac{|\bar{X} - \mu_o|}{s} \quad \text{(effect size for one-sample t-test)}$$

$$d = \frac{|2.91 - 2.70|}{.61} = .344$$

Effect Size (ES)

• Guidelines for ES Interpretation

d = .20 small effect size

d = .40 medium effect size

d = .80 large effect size

• Our estimated effect size d = .344 here, so it is certainly not a large effect, but is bordering on medium size.
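A short Python sketch of the effect size calculation follows, using the numbers from the example. It also shows why d and t answer different questions: for a one-sample test, t equals d multiplied by the square root of n, so a large sample can make a modest effect highly significant.

```python
import math

x_bar, mu_0, s, n = 2.91, 2.70, 0.61, 100   # values from the GPA example

d = abs(x_bar - mu_0) / s          # effect size: difference in SD units
t = (x_bar - mu_0) / (s / math.sqrt(n))

# For the one-sample case, |t| = d * sqrt(n): t grows with n, d does not.
print(f"d = {d:.3f}, t = {t:.2f}, d * sqrt(n) = {d * math.sqrt(n):.2f}")
```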

Effect Size (ES)

• Guidelines for ES Interpretation: another way to interpret the effect size is in terms of the percent overlap or non-overlap of the distribution under the null (GPA’s in 1990) and the distribution under the alternative (GPA’s now with an estimated mean of 2.91).

Overlap is highlighted in yellow, non-overlap is white.

Effect Size (ES) and % Non-overlap

A d = .34 gives percent non-overlap around 24% for the two GPA distributions.
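The slides do not say how the non-overlap percentage was obtained. One common measure that gives a value close to 24% is Cohen's U1, and the sketch below computes it under the assumption that both GPA distributions are normal with the same standard deviation. Treat the choice of U1 as a guess about the method, not a statement of what the figure actually used.

```python
from statistics import NormalDist

d = abs(2.91 - 2.70) / 0.61        # effect size from the GPA example

# Cohen's U1 (assumed measure): share of the combined area of two
# equal-SD normal curves, d standard deviations apart, that is not shared.
phi = NormalDist().cdf(d / 2)
u1 = (2 * phi - 1) / phi

print(f"d = {d:.3f}, percent non-overlap (U1) = {100 * u1:.1f}%")
```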

Confidence Interval for μ

• The basic form for most confidence intervals is

$$\text{estimate} \pm (\text{table value}) \times SE(\text{estimate})$$

• estimate: the estimate of the parameter from our sample, e.g. the sample mean $\bar{X}$ when estimating μ.

• table value: comes either from a standard normal or t-distribution table. For 95% confidence this value is around 2.00.

• SE(estimate): the estimated standard error of our estimate, which for the mean is given by

$$SE(\bar{X}) = \frac{s}{\sqrt{n}}$$

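Confidence Interval (CI) for μ

• The basic form for most confidence intervals is

$$\text{estimate} \pm (\text{table value}) \times SE(\text{estimate})$$

• For a 95% CI for μ we have:

$$\bar{X} \pm 1.96 \cdot SE(\bar{X}) = 2.91 \pm 1.96 \cdot \frac{.61}{\sqrt{100}} = 2.91 \pm 1.96(.061) = 2.91 \pm .12 = (2.79, 3.03)$$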
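The interval calculation can be reproduced with the following Python sketch. It uses the table value 1.96 and the summary statistics from the example; the variable names are illustrative.

```python
import math

x_bar, s, n = 2.91, 0.61, 100   # sample mean, sample SD, sample size
table_value = 1.96               # standard normal value for 95% confidence

se = s / math.sqrt(n)            # standard error of the mean
margin = table_value * se        # half-width of the interval
lcl, ucl = x_bar - margin, x_bar + margin

print(f"95% CI: ({lcl:.2f}, {ucl:.2f}), margin = {margin:.2f}")
```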

Confidence Interval (CI) for μ

• 95% CI for μ is given by (2.79, 3.03)

• This range of GPA values has a 95% chance of covering the true mean GPA of U.S. college students at the present time.***

• The lower confidence limit (LCL) is 2.79, which is only a .09 grade point increase from the mean GPA in 1990, or a 3.33% increase.

• The upper confidence limit (UCL) is 3.03, which is a fairly sizeable increase: a .33 grade point increase in absolute terms, or a 12.2% increase in percentage terms.

*** This is technically not correct but it sounds good. The correct interpretation is that the process of taking a sample and constructing a 95% CI for μ based on the sample will produce intervals that cover the true mean 95% of the time. This particular interval either covers the true mean or it doesn’t; we have no idea which is the case. We are 95% confident in the sense that the process of sampling and constructing confidence intervals will produce intervals that cover the true value of the parameter a large percentage of the time (i.e. 95%).
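The "process" interpretation can be illustrated with a small simulation. The sketch below repeatedly draws samples from a hypothetical normal population (the true mean and SD are made-up values, not data from the slides), builds a 95% interval from each sample, and counts how often the interval covers the true mean.

```python
import math
import random
import statistics

random.seed(1)                    # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 2.9, 0.6     # hypothetical population values
N, REPS = 100, 2000               # sample size and number of repeated samples

covered = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this particular interval cover the true mean?
    if x_bar - 1.96 * se <= TRUE_MEAN <= x_bar + 1.96 * se:
        covered += 1

coverage = covered / REPS
print(f"share of intervals covering the true mean: {coverage:.3f}")
```

Each individual interval either covers the true mean or it does not; the 95% describes the long-run share of intervals that do.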

Confidence Interval for μ and Effect Size

• 95% CI for μ is (2.79, 3.03), which in terms of an increase from 2.70 is (.09, .33).

• If we divide the LCL and UCL of the latter CI above by the standard deviation (s = .61), we have the corresponding range of effect sizes (d):

$$\left(\frac{.09}{.61}, \frac{.33}{.61}\right) = (.148, .541)$$

• Thus the effect size is somewhere between small and medium in scale… doh! That sure tells me a lot!
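The conversion of the confidence limits into effect-size units can be checked with a few lines of Python, using the rounded limits reported in the example.

```python
s, mu_0 = 0.61, 2.70       # sample SD and hypothesized (1990) mean
lcl, ucl = 2.79, 3.03      # 95% confidence limits for the current mean

# Express each limit as an increase over 2.70, measured in SD units.
d_low = (lcl - mu_0) / s
d_high = (ucl - mu_0) / s

print(f"effect size range: ({d_low:.3f}, {d_high:.3f})")
```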
