CHI-SQUARE AND ANALYSIS OF VARIANCE
PROBABILITY DISTRIBUTIONS
In this session:
Chi-Square as a Test of Independence
Contingency Tables
Chi-Square as a Test of Goodness of Fit
Analysis of Variance
- the B-school
When do we use Chi-Square?
The Chi-Square test is used when:
1. We have to compare the proportions of more than two populations.
2. We have to determine whether certain attributes of a population are independent of each other. E.g., if we classify a population with respect to two attributes such as age and job performance, we can use a Chi-Square test to establish whether the two attributes are independent of each other.
Chi-Square as a test of independence
If a population is classified into categories, the dependence or independence of the categorical variables may be established through a contingency table using the Chi-Square test.
An m×n contingency table shows the observed frequencies for two categorical variables (say A and B) arranged in m rows and n columns. The sum of all observed frequencies is N, the sample size.
If a sampled individual has the ith value for A and the jth value for B then this individual is assigned to the (i,j)th cell of the contingency table.
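The cell assignment described above can be sketched in a few lines of Python. The sample below is hypothetical (six individuals classified by age group and job performance) and is only meant to show how each individual lands in one (i, j) cell and how the cell counts add up to N.

```python
from collections import Counter

# Hypothetical sample: (age group, job performance) for each sampled individual.
observations = [
    ("under 40", "high"), ("under 40", "low"), ("40 and over", "high"),
    ("under 40", "high"), ("40 and over", "low"), ("40 and over", "low"),
]

row_labels = ["under 40", "40 and over"]  # values of attribute A (rows)
col_labels = ["high", "low"]              # values of attribute B (columns)

# Each individual is counted in the cell matching its (A, B) pair.
counts = Counter(observations)
table = [[counts[(a, b)] for b in col_labels] for a in row_labels]

N = sum(sum(row) for row in table)        # sample size = sum of all cells
print(table)  # [[2, 1], [1, 2]]
print(N)      # 6
```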
Chi-Square as a test of independence
Two attributes A and B are independent if the value of one has no influence on the value of the other.
Test H0: ‘A and B are independent’ against HA: ‘A and B are not independent’.
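Assume H0 to be true and calculate the expected frequencies:
$$E_{ij} = \frac{(\text{sum of } i\text{th row}) \times (\text{sum of } j\text{th column})}{N}$$
Chi-square statistic:
$$\chi^2_c = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed frequency in the (i,j)th cell and N is the sample size.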
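As a minimal sketch, the two formulas above translate directly into Python. The function names and the 2×2 table are illustrative assumptions, not part of the original slides; the table is given as a list of rows of observed counts.

```python
def expected_frequencies(observed):
    """E_ij = (row i total * column j total) / N for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij over every cell of the table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2x2 table, used only to exercise the two functions.
toy = [[10, 20], [20, 10]]
print(expected_frequencies(toy))             # [[15.0, 15.0], [15.0, 15.0]]
print(round(chi_square_statistic(toy), 3))   # 6.667
```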
The Chi-Square test statistic
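The Chi-Square test statistic, $\chi^2_c$, is a random variable with its own probability distribution, known as the Chi-square distribution.
When the sample size N is large, the probability histogram of $\chi^2_c$ can be approximated by a chi-square curve with k = (m-1)×(n-1) degrees of freedom.
For example, in a 3×4 table with fixed row and column totals, only the (3-1)×(4-1) = 6 cells marked ___ are free to vary; the cells marked * are then determined by the totals:
        Col 1   Col 2   Col 3   Col 4
Row 1   ___     ___     ___     *
Row 2   ___     ___     ___     *
Row 3   *       *       *       *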
Contingency Table - Example
A brand manager is concerned that her brand’s share may be unevenly distributed throughout the country. In a survey in which the country was divided into four geographical regions, a random sample of 100 consumers in each region was surveyed, with the following results.
                     NE    NW    SE    SW    Total
Purchase the brand   40    55    45    50    190
Do not purchase      60    45    55    50    210
Total                100   100   100   100   400
(a) Construct the contingency table and calculate the chi-square statistic.
(b) State the null and alternative hypotheses.
(c) At α = 0.05, test whether brand share depends on the region, or whether the brand share is the same across the 4 regions.
Contingency Table
                     NE            NW            SE            SW            Total
Purchase the brand   O11 = 40      O12 = 55      O13 = 45      O14 = 50      190
                     E11 = 47.5    E12 = 47.5    E13 = 47.5    E14 = 47.5
Do not purchase      O21 = 60      O22 = 45      O23 = 55      O24 = 50      210
                     E21 = 52.5    E22 = 52.5    E23 = 52.5    E24 = 52.5
Total                100           100           100           100           400
H0: Purchasing is independent of region.
HA: Purchasing depends on the region.
OR
H0: PNE=PNW=PSE=PSW
HA: Not all proportions are equal (at least two are unequal)
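The Chi-Square test statistic
If $\chi^2_c > \chi^2_R(k, \alpha)$, reject H0 and accept HA. If $\chi^2_c \le \chi^2_R(k, \alpha)$, accept H0 and reject HA. Here $\chi^2_R(k, \alpha)$ is the critical value for k degrees of freedom at significance level α.
$\chi^2_c = 5.012$ and $\chi^2_R(k, \alpha) = 7.815$ for k = 1×3 = 3 df and α = 0.05.
Since $\chi^2_c = 5.012 < 7.815$, we accept H0 that purchasing is independent of region.
(Use CHIINV to obtain the critical value.)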
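The arithmetic for the brand-share example can be checked with a short Python sketch. It uses only the observed counts from the table and the critical value 7.815 quoted above; the variable names are illustrative.

```python
# Observed counts by region (NE, NW, SE, SW), taken from the survey table.
observed = {
    "Purchase the brand": [40, 55, 45, 50],
    "Do not purchase":    [60, 45, 55, 50],
}

rows = list(observed.values())
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(col_totals)

chi_sq = 0.0
for row in rows:
    row_total = sum(row)
    for o, col_total in zip(row, col_totals):
        e = row_total * col_total / grand_total   # expected frequency under H0
        chi_sq += (o - e) ** 2 / e

df = (len(rows) - 1) * (len(col_totals) - 1)
critical = 7.815  # critical value for 3 df at alpha = 0.05, as quoted in the slides

print(round(chi_sq, 4), df)                               # 5.0125 3
print("reject H0" if chi_sq > critical else "accept H0")  # accept H0
```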
Exercise
A researcher studying the relationship between having a particular disease and addiction interviewed 32 male subjects. For each individual, the researcher recorded his disease status (Y = Yes, N = No) and addiction type (Type 1, Type 2, Type 3), as shown.
Subjects 1 to 16
Disease status   N N N Y N N N N Y N N Y N N N N
Addiction type   1 1 2 1 1 1 3 1 1 1 1 2 1 1 2 1
Subjects 17 to 32
Disease status   Y N Y Y N Y N N N N N Y N Y N N
Addiction type   1 3 1 2 3 1 1 1 1 1 3 1 3 2 1 2
From the above dataset, can the researcher conclude that males with addiction types 2 or 3 are more likely to have the disease than those with addiction type 1?
Exercise
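H0: Disease is independent of addiction type
HA: Disease is more prevalent among those who have addiction types 2 and 3
                Type 1         Type 2         Type 3         Row totals
N               O11 = 15       O12 = 3        O13 = 5        23
                E11 = 15.09    E12 = 4.31     E13 = 3.59
Y               O21 = 6        O22 = 3        O23 = 0        9
                E21 = 5.91     E22 = 1.69     E23 = 1.41
Column totals   21             6              5              32
$$\chi^2_c = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 3.379$$
$\chi^2_R(k, \alpha) = 5.99$ for k = 1×2 = 2 df and α = 0.05.
Since $\chi^2_c = 3.379 < 5.99$, we accept H0: the data do not show that disease status depends on addiction type.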
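A short Python sketch can tally the 32 raw records into the 2×3 table and recompute the statistic. Every cell contributes a term, including the cell with an observed count of zero, whose expected frequency is 9 × 5 / 32. The string encoding of the data is just a compact convenience for this example.

```python
from collections import Counter

# Raw records for the 32 subjects, in the order listed in the exercise.
disease = "NNNYNNNNYNNYNNNN" + "YNYYNYNNNNNYNYNN"
addiction = "1121113111121121" + "1312311111313212"

# Tally each (disease status, addiction type) pair into a 2x3 table.
counts = Counter(zip(disease, addiction))
observed = [[counts[(d, t)] for t in "123"] for d in "NY"]

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
chi_sq = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

print(observed)          # [[15, 3, 5], [6, 3, 0]]
print(round(chi_sq, 3))  # 3.379
```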
Chi-Square as a Test of Goodness of Fit: Testing the Appropriateness of a Distribution
The Chi-Square test can also be used to decide whether a particular probability distribution, such as the Binomial, Poisson or Normal, is the appropriate distribution for representing a given data set.
The Chi-Square test enables us to test whether there is a significant difference between the observed frequency distribution and the theoretical distribution.
Chi-Square as a Test of Goodness of Fit: Testing the Appropriateness of a Distribution
A salesman of a paper company has five accounts to visit per day. It is suggested that his number of sales per day be described by a binomial distribution, with the probability of selling to each account being 0.4. Given the following frequency distribution of the number of sales per day, can we conclude that the data do not follow the suggested distribution? Use the 0.05 significance level.
No. of sales per day   0    1    2    3    4    5
Frequency              10   41   60   20   6    3
Chi-Square as a Test of Goodness of Fit: Testing the Appropriateness of a Distribution
Solution:
H0: A binomial distribution with p = 0.4 is a good description of the sales.
HA: A binomial distribution with p = 0.4 is NOT a good description of the sales.
k (degrees of freedom) = 5, α = 0.05
Chi-Square statistic:
$$\chi^2_c = \sum \frac{(f_o - f_e)^2}{f_e} = 11.9413$$
where $f_o$ and $f_e$ are the observed and expected frequencies.
We reject H0 (that the distribution is well described by a binomial distribution with p = 0.4 and n = 5) since
$$\chi^2_c = 11.9413 > \chi^2_R(5, 0.05) = 11.070$$
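The expected frequencies are not listed above, so the following Python sketch shows one way to obtain them: multiply the binomial probability of each outcome (n = 5, p = 0.4) by the total number of days observed, then apply the chi-square formula. The variable names are illustrative; the critical value 11.070 is the one quoted above.

```python
from math import comb

n_trials, p = 5, 0.4
observed = [10, 41, 60, 20, 6, 3]   # days with 0, 1, ..., 5 sales
total_days = sum(observed)          # 140

# Expected frequency for k sales = total days * binomial probability of k successes.
expected = [
    total_days * comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
    for k in range(n_trials + 1)
]

chi_sq = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
critical = 11.070  # critical value for 5 df at alpha = 0.05, as quoted in the slides

print([round(fe, 2) for fe in expected])
print(round(chi_sq, 4))                                   # 11.9413
print("reject H0" if chi_sq > critical else "accept H0")  # reject H0
```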
What is Analysis of Variance (ANOVA) ?
ANOVA
1. Enables us to test for the significance of the differences among more than two sample means.
2. Using ANOVA we can make inferences about whether our samples are drawn from populations having the same mean, e.g., in research evaluating new drugs, effects of diseases, frequency of medication, etc.
3. ASSUMPTION: Each of the samples is drawn from a normal population, and each of these populations has the same variance σ².
4. ANOVA helps us to compare two different estimates of the variance of our overall population (the variance among the sample means and the variance within the samples).
ANOVA – An Example
Three training methods were compared to see if they led to greater productivity after training. The following are the productivity measures for individuals trained by each method.
Method 1   45   40   50   39   53   44
Method 2   59   43   47   51   39   49
Method 3   41   37   43   40   52   37
Use ANOVA at the 0.05 level of significance to determine whether these training methods lead to different levels of productivity.
ANOVA – An Example
Solution Method:
Statement of Hypothesis:
H0: µ1 = µ2 = µ3
HA: µ1, µ2 and µ3 are not all equal
Step 1: Calculate the grand mean, $\bar{X}$
Step 2: Calculate the three sample means, $\bar{x}_1, \bar{x}_2, \bar{x}_3$
Step 3: Calculate the three sample variances, $s_1^2, s_2^2, s_3^2$
Step 4: Estimate the between-column variance, i.e., the variance among the sample means:
$$\hat{\sigma}_b^2 = \frac{\sum n_i (\bar{x}_i - \bar{X})^2}{k - 1}$$
Step 5: Estimate the within-column variance, i.e., the variance within the samples:
$$\hat{\sigma}_w^2 = \sum \left( \frac{n_i - 1}{n_T - k} \right) s_i^2$$
where n_i = size of the ith sample, n_T = total sample size, and k = number of samples.
Step 6: Calculate the F-ratio/F-statistic:
$$F = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_w^2}$$
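The six steps can be sketched as a small Python function. The function name and the toy data are assumptions made for illustration; `statistics.variance` is the sample variance, which divides by n_i − 1.

```python
from statistics import mean, variance


def one_way_anova_f(groups):
    """Return the F-ratio for a list of samples, following Steps 1 to 6."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total            # Step 1
    sample_means = [mean(g) for g in groups]                      # Step 2
    sample_vars = [variance(g) for g in groups]                   # Step 3 (n_i - 1 divisor)
    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, sample_means)
    ) / (k - 1)                                                   # Step 4
    within = sum(
        (len(g) - 1) / (n_total - k) * v for g, v in zip(groups, sample_vars)
    )                                                             # Step 5
    return between / within                                       # Step 6


# Hypothetical data: three samples of three observations each.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(round(one_way_anova_f(toy_groups), 2))  # 13.0
```

For the toy data the sample means are 2, 3 and 6, every sample variance is 1, the between-column estimate is 13 and the within-column estimate is 1, which gives F = 13.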
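ANOVA – An Example
Solution Method:
Statement of Hypothesis:
H0: µ1 = µ2 = µ3
HA: µ1, µ2 and µ3 are not all equal
Step 1: $\bar{X} = 44.94$
Step 2: $\bar{x}_1 = 45.17,\ \bar{x}_2 = 48,\ \bar{x}_3 = 41.67$
Step 3 (each sum of squared deviations divided by $n_i - 1 = 5$): $s_1^2 = 30.17,\ s_2^2 = 47.60,\ s_3^2 = 31.07$
Step 4:
$$\hat{\sigma}_b^2 = \frac{\sum n_i (\bar{x}_i - \bar{X})^2}{k - 1} = \frac{120.78}{2} = 60.39$$
Step 5:
$$\hat{\sigma}_w^2 = \sum \left( \frac{n_i - 1}{n_T - k} \right) s_i^2 = \frac{5}{15}(30.17 + 47.60 + 31.07) = \frac{108.84}{3} = 36.28$$
Step 6: F-ratio/F-statistic:
$$F = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_w^2} = \frac{60.39}{36.28} = 1.66$$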
The F - Distribution
The F statistic has a particular sampling distribution called F-distribution.
Similar to t and chi-square distribution, the F-distribution represents a family of distributions.
Each distribution has a pair of degrees of freedom (a, b), where
a = no. of df of the numerator ($\hat{\sigma}_b^2$) = k – 1
b = no. of df of the denominator ($\hat{\sigma}_w^2$) = nT – k
For our problem:
a = k – 1 = 3 – 1 = 2
b = nT – k = 18 – 3 = 15
The F - Distribution
The F-table at the 0.05 level of significance with (2,15) df indicates that 3.68 (the critical F value) is the upper limit of the acceptance region.
Since our F-ratio is below 3.68, it lies within this region, so we accept the null hypothesis and conclude that there are no significant differences in the effects of the three training methods on employee productivity.
The above example illustrates one-way ANOVA, since we have considered only one factor, i.e., the effect of training method on employee productivity.
ANOVA – EXERCISE
The following data show the number of claims processed per day for a group of four insurance company employees observed for a number of days. Using ANOVA, test the hypothesis that the employees’ mean claims per day are all the same. Use the 0.05 level of significance.
Employee 1   15   17   14   12
Employee 2   12   10   13   17
Employee 3   11   14   13   15   12
Employee 4   13   12   12   14   10   9
ANOVA – EXERCISE
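Solution Method:
Statement of Hypothesis:
H0: µ1 = µ2 = µ3 = µ4
HA: µ1, µ2, µ3, µ4 are not all equal
Step 1: $\bar{X} = 12.89$
Step 2: $\bar{x}_1 = 14.5,\ \bar{x}_2 = 13,\ \bar{x}_3 = 13,\ \bar{x}_4 = 11.67$
Step 3: $s_1^2 = 4.33,\ s_2^2 = 8.67,\ s_3^2 = 2.5,\ s_4^2 = 3.47$
Step 4:
$$\hat{\sigma}_b^2 = \frac{\sum n_i (\bar{x}_i - \bar{X})^2}{k - 1} = \frac{19.456}{3} = 6.485$$
Step 5:
$$\hat{\sigma}_w^2 = \sum \left( \frac{n_i - 1}{n_T - k} \right) s_i^2 = \frac{3}{15}(4.33) + \frac{3}{15}(8.67) + \frac{4}{15}(2.5) + \frac{5}{15}(3.47) = 4.42$$
Step 6: F-ratio/F-statistic:
$$F = \frac{\hat{\sigma}_b^2}{\hat{\sigma}_w^2} = \frac{6.485}{4.42} = 1.47$$
Step 7: Critical F value (with df (3,15), α = 0.05) = 3.29
Since F = 1.47 < 3.29, we accept H0: the data do not show a significant difference in the employees’ mean claims per day.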
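Because the four samples here have different sizes, it can help to verify the figures numerically. The sketch below uses the equivalent sum-of-squares form: dividing the within-sample sum of squares by n_T − k gives the same value as weighting each sample variance by (n_i − 1)/(n_T − k). The variable names are illustrative, and the critical value 3.29 is the one quoted in Step 7.

```python
# Claims processed per day, from the exercise table.
claims = {
    "Employee 1": [15, 17, 14, 12],
    "Employee 2": [12, 10, 13, 17],
    "Employee 3": [11, 14, 13, 15, 12],
    "Employee 4": [13, 12, 12, 14, 10, 9],
}

k = len(claims)
n_total = sum(len(v) for v in claims.values())
grand_mean = sum(sum(v) for v in claims.values()) / n_total

ss_between = 0.0   # sum of n_i * (sample mean - grand mean)^2
ss_within = 0.0    # sum of squared deviations from each sample's own mean
for values in claims.values():
    group_mean = sum(values) / len(values)
    ss_between += len(values) * (group_mean - grand_mean) ** 2
    ss_within += sum((x - group_mean) ** 2 for x in values)

var_between = ss_between / (k - 1)        # between-column estimate
var_within = ss_within / (n_total - k)    # within-column estimate
f_ratio = var_between / var_within
critical = 3.29  # critical F for (3, 15) df at alpha = 0.05, as quoted in Step 7

print(k - 1, n_total - k)                                  # 3 15
print(round(f_ratio, 2))                                   # 1.47
print("reject H0" if f_ratio > critical else "accept H0")  # accept H0
```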