Econ 413 Parks Hypothesis Testing

Hypothesis Testing, Washington University in St. Louis (econ413.wustl.edu/Hypothesis_testing.docx)


Hypothesis Testing


A statistical hypothesis is a set of assumptions about a model of observed data.

Example 1 (coin toss): The number of heads in 11 coin flips is random and distributed as a binomial with success rate 0.5 and n = 11. Recall the binomial distribution has two parameters: the probability of success and the number of trials. We call the first parameter the success rate to distinguish it from probabilities we calculate using the binomial.

Example 2 (income): Income is distributed as a normal random variable with mean μ and variance σ².

Example 3 (univariate regression): Y = a + b*X + ε and the seven classical assumptions are true.

Example 1 specifies the exact distribution of the data (number of heads). Example 1 has no unknowns. A statistical hypothesis (about data) which has no unknowns is called a simple hypothesis. Examples 2 and 3 have unknown parameters and do not specify the exact distribution of the data. They are called complex hypotheses.

A statistical hypothesis test is a decision about a statistical hypothesis. The decision is to accept or reject the hypothesis. The statistical hypothesis we test is called the maintained hypothesis. The alternate hypothesis is a different specification of the distribution of the data. Either or both can be simple or complex.

Most books use the term null hypothesis. I have three reasons to use the word maintained rather than null:

1. Null is defined as amounting to nothing, having no value, and being 0 (among other definitions). The null hypothesis is often labeled H0, and I suppose null hypothesis was preferred to zero hypothesis or naught hypothesis.

2. You have learned things about the null hypothesis which may or may not be true. Using maintained hypothesis starts us off on a neutral path.

3. Maintained hypothesis may, I hope, remind you that the maintained hypothesis usually has many assumptions. For our regression tests, the maintained hypothesis assumes A1 to A7 and possibly other assumptions.

A statistical hypothesis test specifies a critical region, a set of numbers. If the observed data (or a function of the data) is in the critical region, then reject the maintained hypothesis. If the observed data is NOT in the critical region, then accept the maintained hypothesis. I think REJECTION region would be a better name than critical region. Alas, the literature has critical region. I will use critical/rejection to help solidify the concept. I use ACCEPTANCE region rather than the cumbersome 'not in the rejection region'.

Example 1 test: Let the critical (rejection) region be {0, 1, 2, 9, 10, 11} heads. Flip the coin 11 times; if you observe 0, 1, 2, 9, 10, or 11 heads, reject the maintained hypothesis that the number of heads is a binomial distribution with success rate 0.5 and n = 11.

Accepting the maintained hypothesis does not prove it to be true and rejecting the maintained hypothesis does not prove it to be false. Similarly, accepting the alternate hypothesis does not prove it to be true and rejecting the alternative does not prove it to be false. A statistical test can prove nothing.

I believe many authors use 'fail to reject' so students will not think the hypothesis was proved by a statistical test. But the only meaning that 'fail to reject' can have in statistical hypothesis testing is accept. The outcome of a statistical hypothesis test is BINARY: only two outcomes. The data is either in the critical (rejection) region or it is not. The wording 'fail to reject' connotatively conveys something different than 'accept' because in English we often use a double negative to convey something other than a binary outcome.

A statistical test has exactly two outcomes. The data is either in the critical (rejection) region or it is not in the critical (rejection) region. If the data is NOT in the critical (rejection) region, you accept the maintained. You reject the alternative. Reject the alternative must mean accept the maintained. Fail to reject the maintained must mean accept the maintained.

If 'fail to reject' had any real meaning other than accept, then 'fail to accept' would also have a different meaning. Now you would have four outcomes: accept the maintained, reject the maintained, fail to reject the maintained, and fail to accept the maintained. But a statistical test has exactly two outcomes: the data is either in the critical (rejection) region or it is not. The only outcomes are to accept the maintained (reject the alternative) or accept the alternative (reject the maintained). Fail to reject must mean accept, and fail to accept must mean reject.

'Failed to reject' may have a connotation that you are trying to reject (and failed). Whether you want to accept or reject a statistical hypothesis is outside of the discussion of statistical hypotheses. Want is a normative concept. I never use 'failed to accept' (except in moments of brain failure). I never want to accept or reject a hypothesis unless someone is paying me money, reputation, or other reward (which then makes me want). You will not want to accept or reject a hypothesis in this course. Your grade does not depend on whether the hypothesis is accepted or rejected, but rather on what you do with the acceptance or rejection.

A third reason authors use 'fail to reject' is Karl Popper's influence on scientific method. Popper touted falsification of theories. Specifically, "Logically, no number of experimental testing can confirm (read prove) a scientific theory, but a single counterexample is logically decisive: it shows the theory to be false." For Popper, experimental evidence would either fail to reject the theory or would reject the theory.

For Popper, rejection requires one data point which is inconsistent with the theory. For example, the Cobb-Douglas production function (be an economist for a moment) requires 0 output if either labor or capital = 0. We can reject Cobb-Douglas if we observe positive output with 0 labor or 0 capital. Most econometric models do not have the property of rejection by one observation.

Many statistical tests exist for some statistical hypotheses. In example 1 (the coin test), a different test, one based on the order of outcomes, might reject the maintained hypothesis if we observed 5 heads followed by 6 tails, even though 5 heads is not in the rejection region {0,1,2,9,10,11}. With regressions, we have homoscedasticity tests, serial correlation tests, endogeneity tests, model specification tests, and normality tests. Each test has the same maintained hypothesis: all seven assumptions. 'Fail to reject' one of many statistical tests of the same hypothesis means that the current test accepts the maintained but some other test remaining to be done might reject the maintained. Then 'fail to reject' is not about a hypothesis test, but about many hypothesis tests. In such a case the many statistical tests have many critical (rejection) regions (as many as there are tests). ACCEPT or REJECT is about one single critical (rejection) region. We will discuss distinguishing among hypothesis tests, but we will not use 'fail to reject'. I never use 'fail to reject' and never use 'fail to accept'.

If accepting a hypothesis does not prove the hypothesis, then what does accepting a hypothesis do? Acceptance allows one to proceed as if the hypothesis were true.

We may either accept a true hypothesis or accept a false hypothesis. Accepting a true hypothesis is a correct decision; rejecting a true hypothesis is an incorrect decision, that is, an error.

Type I and Type II errors

1. Type I error: Reject a true maintained hypothesis = accept a false alternative hypothesis.

2. Type II error: Reject a true alternative hypothesis = accept a false maintained hypothesis.

In classical statistical hypothesis testing a hypothesis is true or false. Hypotheses do not have a probability of being true or false.

The probability of making a Type I error is the probability the data is in the critical (rejection) region conditional upon assuming the data is distributed by the maintained hypothesis.

The probability of making a Type II error is the probability the data is NOT in the critical (rejection) region conditional upon assuming the data is distributed by the alternative hypothesis. Equivalently, the probability of making a Type II error is the probability the data is in the ACCEPTANCE region conditional upon assuming the data is distributed by the alternative hypothesis.

Example 1: Return to the coin flip. A critical (rejection) region is {0, 1, 2, 9, 10, 11}. The probability of {0, 1, 2, 9, 10, or 11} heads occurring given the number of heads is a Binomial (0.5, 11) is 0.0005 + 0.0054 + 0.0269 + 0.0269 + 0.0054 + 0.0005 = 0.0654. I used the Excel function BINOMDIST to calculate the probabilities, e.g., for two heads I used =BINOMDIST(2,11,0.5,FALSE). The probability of a Type I error for the critical (rejection) region {0, 1, 2, 9, 10, 11} is 0.0654. It is the probability we observe {0, 1, 2, 9, 10, or 11} heads in 11 flips assuming the flips are a binomial distribution with n=11 and p=0.5, the maintained distribution. If we observe 0, 1, 2, 9, 10, or 11 heads we reject the maintained hypothesis and accept the alternative hypothesis. If we observe 3, 4, 5, 6, 7, or 8 heads we accept the maintained hypothesis and reject the alternative hypothesis.

See lecture8.pptx near slides 28-39 for the calculation of the probability of Type I errors for the following critical regions:

CR1. {0,1,2,9,10,11}  P = 0.06543
CR2. {0,1,10,11}      P = 0.01172
CR3. {0,1,2}          P = 0.03271
CR4. {0,1,2,3}        P = 0.11328
CR5. {8,9,10,11}      P = 0.11328
CR6. {9,10,11}        P = 0.03271
CR7. {1,3,7,9}        P = 0.27393
CR8. {2,10,11}        P = 0.03271
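These sizes can be checked with a few lines of Python, a sketch using only the standard library; here `binom_pmf` plays the role of Excel's BINOMDIST with FALSE:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), the value BINOMDIST(k, n, p, FALSE) returns."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

regions = {
    "CR1": {0, 1, 2, 9, 10, 11},
    "CR2": {0, 1, 10, 11},
    "CR3": {0, 1, 2},
    "CR4": {0, 1, 2, 3},
    "CR5": {8, 9, 10, 11},
    "CR6": {9, 10, 11},
    "CR7": {1, 3, 7, 9},
    "CR8": {2, 10, 11},
}

# Size = Prob(Type I error) = Prob(critical region | maintained Binomial(n=11, 0.5)).
sizes = {name: sum(binom_pmf(k, 11, 0.5) for k in cr) for name, cr in regions.items()}
for name, p in sizes.items():
    print(f"{name}: {p:.5f}")  # e.g. CR1: 0.06543
```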

What is the alternative hypothesis? Unspecified. One alternative is the data was generated by a different distribution. For example, the data is generated by flipping the coin until 2 heads were observed and it took 11 trials. The distribution (flipping until a certain number of successes is observed) is called the negative binomial.

Another alternative hypothesis in example 1 is the distribution is binomial, n=11 and the success rate is any number zero to one.

Unless the alternative hypothesis is specified we can not know the probability of a Type II error (reject a true alternative). In most real life cases, the alternative is complex and the probability of Type II error is unknown unless we specify a particular alternative.

Sometimes we can calculate the probability of a Type II error. In example 1, if we specify the alternative is a binomial distribution, then we can calculate the probability of Type II error for each success rate from 0 to 1. The following table exhibits probabilities of Type II errors for eight different critical (rejection) regions.

See lecture 8.
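The Type II calculation can be sketched in Python (standard library only): for any specified alternative success rate, the probability of a Type II error is the probability of the acceptance region under that alternative.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def type2_prob(critical_region, alt_rate, n=11):
    """Prob(Type II error) = Prob(acceptance region | alternative Binomial(n, alt_rate))."""
    acceptance = [k for k in range(n + 1) if k not in critical_region]
    return sum(binom_pmf(k, n, alt_rate) for k in acceptance)

cr1 = {0, 1, 2, 9, 10, 11}
# When the alternative success rate equals the maintained 0.5,
# Prob(Type II) = 1 - size = 1 - 0.06543.
print(round(type2_prob(cr1, 0.5), 5))  # 0.93457
# Alternatives far from 0.5 are easier to detect, so the Type II error shrinks.
assert type2_prob(cr1, 0.9) < type2_prob(cr1, 0.6)
```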

The last eight columns of the table are different critical regions. The probability of a Type I error is the same for critical regions 3, 6, and 8, and the same for critical regions 4 and 5. Comparing critical regions 1 and 2: critical region 1 has a larger Type I error and a smaller Type II error than critical region 2 for each value of the alternative success rate of the binomial (the graph makes the comparison easy).

[Graph: probability of Type II error (vertical axis) against alternative success rates of the binomial (horizontal axis).]

Suppose that two hypothesis tests had identical Prob(Type I error), say .05. Suppose also that one test had a greater Prob(Type II error) for every specification of the alternative than the other. The hypothesis test with the larger probability of Type II error is dominated by the one with the smaller probability of Type II error.

Among UNDOMINATED hypothesis tests, decreasing the probability of Type I error increases the probability of Type II error. A trade-off exists between the probability of Type I error and the probability of Type II error: decrease one and the other increases.

A theoretical result (the Neyman-Pearson lemma) is: for testing a simple hypothesis against a simple hypothesis, there exists a critical region such that no other critical region with the same probability of Type I error has a lower probability of Type II error. This is a beautiful result. One test is dominant for a simple versus simple situation. Unfortunately, in econometrics, both the maintained and the alternative are usually complex hypotheses and we have no such result.

The probability of a Type I error is called the size (or significance level) of a statistical test.

The POWER of a test is 1 minus the probability of a Type II error. For a given size, we want a statistical test with greatest power. For most tests we encounter, we specify a size, we obtain a critical region and theoretical results indicate what alternatives have relatively large power and what alternatives may not. For most tests we encounter, we never know the probability of a Type II error. We rely on prior research to tell us what tests are powerful against what alternatives.

Both the size (sometimes called the significance level) and the power are probabilities of the critical region. The difference is the assumption made to compute the probability. For the size, the probability is computed assuming the maintained hypothesis. For the power, the probability is computed assuming some alternative hypothesis.

Size = Prob(CR| maintained)=Prob(Type I error)

Power = Prob(CR | alternative) = 1 - Prob(Acceptance | alternative) = 1 - Prob(Type II error)

The following table shows the powers for the eight critical regions.

We can graph the power of the test used in Example 1 just as we graphed the probability of Type I error.

The graph shows two power curves with the same size, namely CR3 = {0,1,2} and CR6 = {9,10,11}. CR3 is more powerful for alternative success rates of heads less than .5 and CR6 is more powerful for alternative probabilities of heads greater than .5.

Test CR2={0,1,10,11} has a smaller size (0.012) than CR3 or CR6 (.033). CR2 is less powerful for some alternatives and more powerful for other alternatives than either CR3 or CR6. For alternatives .45 to .55, CR2 is less powerful than either CR3 or CR6. CR7 has greater power for some alternatives but also has greater size. Larger size => larger power = smaller Prob(Type II error).

To illustrate that power decreases as size decreases, compare the graphs of CR1={0,1,2,9,10,11} and CR2={0,1,10,11}. The size of CR1 is .065 and the size of CR2 is .012. For every alternative success rate of heads, CR1 has greater power, but its size is also greater. The graph shows the trade-off between size (we want smaller size) and power (we want greater power). Smaller size comes with smaller power for a given test.
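The power comparisons above can be verified directly (a standard-library sketch; `power` is just the probability of the critical region under an alternative success rate):

```python
from math import comb

def power(critical_region, alt_rate, n=11):
    """Power = Prob(critical region | alternative) = 1 - Prob(Type II error)."""
    return sum(comb(n, k) * alt_rate**k * (1 - alt_rate)**(n - k) for k in critical_region)

cr1, cr2 = {0, 1, 2, 9, 10, 11}, {0, 1, 10, 11}
cr3, cr6 = {0, 1, 2}, {9, 10, 11}

# One-sided regions: CR3 is more powerful below 0.5, CR6 above 0.5.
assert power(cr3, 0.3) > power(cr6, 0.3)
assert power(cr6, 0.7) > power(cr3, 0.7)

# Size/power trade-off: CR1 contains CR2, so it has larger size AND larger power
# at every alternative success rate.
for rate in (0.1, 0.3, 0.7, 0.9):
    assert power(cr1, rate) > power(cr2, rate)
print("power comparisons verified")
```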

The important concepts are:

1. Hypothesis tests are critical (rejection of maintained) regions.

2. A type I error is rejecting a true maintained. A type II error is accepting a false maintained (rejecting a true alternative).

3. The Probability of a Type I error is the probability of the critical region using the maintained distribution. The Probability of a Type II error is the probability of the acceptance region using an alternative distribution.

4. Size is the probability of Type I error (rejecting a true maintained). Power is 1 minus probability of Type II error.

5. Every test has some power for some alternative hypotheses and less power for other alternative hypotheses.

6. A smaller size results in a smaller power (or larger Prob(Type II error)). Illustrated by CR1 and CR2, or CR3 and CR4, or CR5 and CR6.

7. For some statistical hypothesis tests, two tests exist. One has higher power for some alternatives and the other has higher power for the remaining alternatives. CR3 and CR6 have the same size. CR3 is more powerful for alternative success rate of heads less than .5 and CR6 is more powerful for alternative success rate of heads greater than .5. CR4 and CR5 have smaller size than CR3 and CR6 but have a similar comparison for alternatives less or greater than .5.

The critical regions CR3, CR4, CR5 and CR6 are often called one-sided. These critical regions contain only small or only large numbers of heads. Such one-sided critical regions are powerful for only large or only small alternative success rates of heads. For example, CR3={0,1,2} is more powerful for alternative hypotheses with small probabilities of heads while CR6={9,10,11} is more powerful for alternative hypotheses with large probabilities of heads.

Summary of POWER. Understanding POWER explains why we would use more than one test. For example, the Ramsey test may have 1, 2, 3, or 4 terms. Why use more than just 2 terms? To increase the power of the specification test, albeit at the cost of changing the size (since doing 1, 2, 3, and 4 terms means you are doing sequential statistical tests, not just one test). A Ramsey test with 2 terms will be more powerful for some alternative hypotheses, while a Ramsey test with 4 terms will be more powerful for some other alternatives. Explaining which statistical test(s) to use is our only use of POWER.

REGRESSION TESTS: THE T TEST

For a regression, we might wish to test whether some independent variable has a statistically significant effect on the dependent variable: Income on Consumption, rebounds on percent win, number of competitors on sales, gender on wages, high school rank on financial aid, etc. We usually test statistical significance by testing whether the coefficient of the variable is equal to 0.

In OLS regression, with all 7 classical assumptions true, and the additional assumption that the corresponding coefficient is ZERO, the reported T-statistic for a coefficient is an observation of a random variable that has a T-distribution. The T-distribution was authored by W.S. Gosset, who worked for the Guinness brewery and wrote under the name Student. Often the T is called Student's T distribution.

The maintained hypothesis does not specify the remaining coefficients nor the variance of the error of the equation, σ²; they can be any value. The maintained hypothesis is complex and the alternative hypothesis is complex.

With the 7 classical assumptions in a simple, one-variable regression, the OLS estimator b1 is normally distributed: b1 ~ N(β1, σ²/Σ(Xi - X̄)²). But σ² is unknown!

We derive a random variable based on b1 which has a T-distribution and does not depend on any unknown parameters:

T = (b1 - β1) / SE(b1), where SE(b1) = s / sqrt(Σ(Xi - X̄)²) and s² is the estimated variance of the error.

Note the σ in the numerator cancels with the square root of the σ² in the denominator, and the only unknown in the formula is β1.

The T-distribution has one parameter called Degrees of Freedom (DOF). For most tests, the value of the DOF parameter is the number of observations minus the number of estimated coefficients. In a simple regression there are two coefficient estimates, the intercept and the coefficient of the single variable. The DOF is n-2. In a K variable regression, there are K+1 coefficients to estimate: β0, β1, β2, ..., βK, and the DOF is n-(K+1) = n - K - 1.

The display above has a T-distribution if the maintained hypothesis (all 7 classical assumptions) is true. It does not have a T-distribution if any of the 7 assumptions is not true.

The display above cannot be reported by a statistics program because β1 is unknown. The reported T-statistic is

T = b1 / SE(b1)

It will have a T-distribution if all 7 classical assumptions are true and β1 = 0.

The T-test is a critical/rejection region for the T-statistic: values of the T-statistic for which you REJECT the maintained hypothesis that all 7 classical assumptions are true and β1 = 0. T-tests can have one-sided or two-sided critical regions. To determine the critical region, you must choose a size for the test: the probability of a Type I error, the probability you reject a true maintained hypothesis. What size you choose is your own choice. It is common to have sizes of 0.01, 0.05, or 0.10. In fact, in reporting regression results, generally one reports whether the reported t-statistic is in a 10%, 5%, or 1% critical region. If the reported t-statistic is in the 1% region, it is also in the 5% and 10% regions.

In most cases, we report significance rather than stating we reject the maintained hypothesis at the 5% level. We state the coefficient is statistically significant at 5%. The meaning is the same, namely the reported T-statistic is in the 5% critical region. You would report significant at 1%, understanding it is also significant at 5% and 10% (and 15%, and so on).

Below is a plot of the density of a T-distribution for 10 degrees of freedom. For a 5% size, two-sided test, we split the 5% into each tail: 2.5% of the distribution is below -2.228 and 2.5% of the distribution is above +2.228. I found 2.228 on page 585 of Studenmund, 6th edition (Critical values of the t-distribution), in the row for 10 degrees of freedom and the column for 2.5% one-sided.

The blue area illustrates a two-sided critical/rejection region for a 5% test.

For 31 degrees of freedom, the probability you would observe a random variable (with a T-distribution) below -2.0395134464 is 0.025 and similarly above +2.0395134464 is 0.025, so there is a 5% chance that you would observe a T random variable below -2.0395134464 or above +2.0395134464. If you sampled 1,000,000 T-random variables with 31 degrees of freedom, then approximately 25,000 would be below -2.0395134464 and approximately 25,000 would be above +2.0395134464.

These calculations used:

http://surfstat.anu.edu.au/surfstat-home/tables/t.php (very easy, with good graphics)

http://www.tutor-pages.com/Statistics-Calculator/statistics_tables.html (similar graphics)

http://socr.ucla.edu/htmls/SOCR_Distributions.html (I had to use it with IE)

http://www.distributome.org/js/calc/index.html and http://www.distributome.org/js/calc/StudentCalculator.htmld (for the T-distribution)

http://bcs.whfreeman.com/ips4e/cat_010/applets/statsig_ips.html (Java security error; used to work)

All of these pages use JAVA. JAVA has security issues. Some browsers will not run the JAVA required. They all used to work.

http://www.danielsoper.com/statcalc3/calc.aspx?id=10 does not use JAVA and calculates to 8 decimals! See http://www.danielsoper.com/statcalc3/default.aspx for other distributions.
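As a non-Java alternative to the calculators above, SciPy can reproduce the same critical values; this sketch assumes the scipy package is installed:

```python
# Student T critical values without a web calculator (assumes scipy is installed).
from scipy.stats import t

# Two-sided 5% critical value for 31 degrees of freedom: 2.5% in each tail.
crit = t.ppf(0.975, df=31)
print(round(crit, 10))  # 2.0395134464

# And for 10 degrees of freedom, matching the 2.228 from Studenmund's table.
print(round(t.ppf(0.975, df=10), 3))  # 2.228
```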

Example of T-test: Gender discrimination

To be more explicit, consider a gender discrimination case. The plaintiff contends males are discriminated against while the defense contends males are not discriminated against. Below is a (partial) estimation output in the case:

Variable    Coefficient    Std. Error    t-Statistic    Prob.
GENDER      -3.848931      1.863662      -2.065251      0.0473

The Degrees of Freedom equals 31.

The variable GENDER is 1 for males, and 0 for females. The negative coefficient indicates that if the individual is male (GENDER=1), then the dependent variable is estimated to be 3.848931 less than if the individual is female.

The reported Prob. of 0.0473 is the size of the critical/rejection region (-∞, -2.065251] ∪ [+2.065251, +∞), which uses the reported T-statistic to determine the critical/rejection region. The reported Prob. value is called a p-value. With DOF=31, the probability that you would observe a T random variable in (-∞, -2.065251] is 0.02365 (=.0473/2). The probability that you would observe a T random variable in [+2.065251, +∞) is 0.02365 (=.0473/2).
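The reported Prob. can be reproduced directly from the T-statistic and the DOF (again a sketch assuming SciPy is installed):

```python
# Reproduce the reported p-value for GENDER (assumes scipy is installed).
from scipy.stats import t

t_stat = -2.065251
dof = 31

# Two-sided p-value: the probability mass beyond |t_stat| in both tails.
p_value = 2 * t.sf(abs(t_stat), df=dof)
print(round(p_value, 4))  # 0.0473
```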

http://surfstat.anu.edu.au/surfstat-home/tables/t.php shows the two sided critical/rejection regions.

A 5% critical/rejection region is (-∞, -2.04] ∪ [+2.04, +∞).

For accuracy but no picture, http://www.danielsoper.com/statcalc3/calc.aspx?id=10

The observed T = -2.065251 is in the critical/rejection region (-∞, -2.03951345]. REJECT the maintained hypothesis at 5% size: REJECT the conjunction of all 7 classical assumptions plus β1 = 0.

An easier but identical critical/rejection region is in p-value space. If the reported P-value (Prob. in the output) is LESS THAN the chosen (by you) size of the test, REJECT. For example, 0.0473 is less than .05, so we reject in a size = 5% test.

            P-value or Size    Reported T or Tabled T
Reported    .0473              -2.065251
            (is less than)     (is greater in absolute value)
Tabled      .0500              -2.03951345

The critical/rejection region for a 5% test in p-value space is [0, 0.05]. Reported p-values less than 0.05 REJECT the maintained, exactly as reported T-values larger in absolute value than the tabled critical value REJECT the maintained.

In our example, 0.0473 is greater than .01 => ACCEPT. For a 1% test, we accept the maintained hypothesis: the critical/rejection region is (-∞, -2.744] ∪ [+2.744, +∞), and our reported T-statistic is not in the critical/rejection region.

Always use the reported p-value to test unless you love extra work! If your chosen size of the test is greater than the p-value, REJECT. If your chosen size of the test is less than the p-value, ACCEPT.
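The p-value decision rule is one line of code; using the GENDER p-value from the example:

```python
def decide(p_value, size):
    """Reject the maintained hypothesis when the reported p-value is below the chosen size."""
    return "REJECT" if p_value < size else "ACCEPT"

# GENDER coefficient from the example: reported p-value 0.0473.
print(decide(0.0473, 0.05))  # prints REJECT (5% test)
print(decide(0.0473, 0.01))  # prints ACCEPT (1% test)
```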

No need to look up a table of numbers, no need to use an internet calculator. If the p-value is small, reject. If the p-value is large, accept. You can use the p-value for all the tests we do. For any test, if the reported p-value is small, say less than .01, REJECT; and if the reported p-value is large, say .20, ACCEPT. How easy can your life get?

The t-test for a coefficient = 0 is theoretically proved to be powerful against alternative hypotheses in which the classical 7 assumptions are true but the particular coefficient is not 0. The reported T-statistic has a non-central T-distribution if all 7 classical assumptions are true and the coefficient ≠ 0. But we do not know the distribution unless we specify a particular value for the coefficient, which then determines the non-centrality parameter of the non-central T-distribution.

If the alternative value for the coefficient is a large absolute value of the coefficient (say 1,000) then the power is greater than if the alternative value for the coefficient is a smaller absolute value of a coefficient (say 10). We also know increases in size increase power and smaller sizes have less power. A 1% test has less power for any specific alternative than does a 5% test.

The formula for the T-statistic provides intuition for the power. The reported T-statistic is T = b1 / SE(b1). If b1 is large in absolute value relative to its standard error, the reported T-statistic is large and the test will reject.

One-sided tests:

The critical/rejection region (-∞, -2.065251] ∪ [+2.065251, +∞) is two-sided. Two one-sided 5% critical/rejection regions are:

MINUS = (-∞, -1.696] and PLUS = [+1.696, +∞).

A one-sided test must specify which side. For our example, the reported T-statistic = -2.065251 is in the critical/rejection region MINUS and is not in the critical/rejection region PLUS.
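The one-sided critical value can be checked the same way as the two-sided one (again assuming SciPy is installed):

```python
# One-sided vs two-sided 5% critical values for 31 DOF (assumes scipy is installed).
from scipy.stats import t

dof = 31
one_sided = t.ppf(0.05, df=dof)    # all 5% in the left tail
two_sided = t.ppf(0.025, df=dof)   # 2.5% in each tail

print(round(one_sided, 3))  # -1.696
print(round(two_sided, 3))  # -2.04

# The reported T = -2.065251 is in MINUS = (-inf, -1.696] but not in PLUS = [+1.696, +inf).
t_stat = -2.065251
assert t_stat <= one_sided
assert not (t_stat >= -one_sided)
```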

The critical/rejection region PLUS is more powerful for alternatives with β1 > 0 and the critical/rejection region MINUS is more powerful for alternatives with β1 < 0.