SAMPLING DISTRIBUTIONS
Population Distribution: When we talk of Population distribution, we assume that we have
investigated the population and have full knowledge of its mean and standard deviation.
Population mean is denoted by µ and standard deviation of population is denoted by σ . The
measures µ and σ of populations are called parameters.
Sample Distribution: When we talk of a sample distribution, we take a sample from the
population. The mean and standard deviation of the sample are denoted by $\bar{X}$ and $s$. These
measures related to the sample are called statistics. Note that several sample distributions
are possible from a given population.
Distribution of Sample Means: Considering the sample mean $\bar{X}$ as a variable, we observe that
the expected value of $\bar{X}$ is the population mean, i.e., $E(\bar{X}) = \mu_{\bar{X}} = \mu$, and the
standard deviation of $\bar{X}$ is given by $\sigma_{\bar{X}} = \sigma / \sqrt{n}$, where n is the sample size.
The standard deviation of the mean is also known as the standard error of the mean.
In order to use the standard deviation of the sample 's' as an estimate for σ, we have the
following formula:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
And the standard error of the mean is given by
$$\sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n - 1)}}$$
When the sample size 'n' is not very small in comparison with the finite population size 'N',
we use the following formula:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$
Exercise: The time between two arrivals in a queuing model is normally distributed with
mean 2 minutes and standard deviation 0.25 minute. If a random sample size of 36 is drawn,
what is the probability that a sample mean will be greater than 2.1 minutes?
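Solution: n = 36; $\mu = 2$; $\sigma = 0.25$. The standard error of the mean is calculated as under:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.25}{\sqrt{36}} = 0.042$$
The probability that the sample mean is greater than 2.1 is given by
$$P(\bar{X} \ge 2.1) = P\left(\frac{\bar{X} - 2}{0.042} \ge \frac{2.1 - 2}{0.042}\right) = P(Z \ge 2.38) = 1 - 0.9913 = 0.0087$$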
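The queuing exercise above can be checked numerically. The following is a minimal sketch using only Python's standard library; the function name `prob_sample_mean_exceeds` is an illustrative choice, not something from the notes. It keeps the standard error unrounded (0.25/6 instead of 0.042), so z comes out as 2.4 and the tail probability is slightly smaller than the hand-computed 0.0087.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_exceeds(mu, sigma, n, threshold):
    """P(sample mean > threshold) when sampling n items from N(mu, sigma^2)."""
    se = sigma / sqrt(n)               # standard error of the mean
    z = (threshold - mu) / se          # standardized threshold
    return 1.0 - NormalDist().cdf(z)   # upper-tail area of the standard normal

# Values from the exercise: mean 2 min, sd 0.25 min, n = 36, threshold 2.1 min
p = prob_sample_mean_exceeds(mu=2.0, sigma=0.25, n=36, threshold=2.1)
print(round(p, 4))
```

The small gap between this result and the tabulated answer comes only from rounding the standard error before standardizing.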
Exercise: The weight of certain type of car tire is normally distributed with mean of 25
pounds and variance of 3 pounds. A random sample of 50 tires is selected. What is the
probability that the mean of this sample lies between 24.5 and 25.5 pounds?
Exercise: An auditor takes a sample of size 36 from a population of 1000 accounts receivable.
The standard deviation of the population is unknown, but the standard deviation of the
sample is Rs 43. If the true mean value of the accounts receivable is Rs 260, what is the
probability that the sample mean will be less than or equal to Rs 250?
Estimation of Population Mean:
In most of the research studies, population parameters are unknown and have to be estimated
from a sample. As such the methods of estimating parameters assume an important role in
statistical analysis.
The estimate of a population parameter may be one single value or it could be a range of
values. If the estimate is one single value, it is referred to as a point estimate; if it is a
range of values, it is termed an interval estimate.
A good estimator possesses the following properties:
(i) An estimator should on the average be equal to the value of the parameter being
estimated. (Property of Unbiasedness)
(ii) An estimator should have relatively less variance. (Property of efficiency)
(iii) An estimator should use as much as possible the information available from
sample (Property of Sufficiency)
(iv) An estimator should approach the value of parameter as the sample size becomes
larger and larger. (Property of Consistency)
The point estimator of the population mean (µ) is $\bar{X}$, the sample mean.
The interval estimator for the mean µ is given by an interval around $\bar{X}$, for a certain degree
of confidence, constructed with the help of the standard error (SE).
For example, the 95% confidence interval for the population mean is given by the
lower limit $\bar{X} - 1.96\,SE$ and the upper limit $\bar{X} + 1.96\,SE$. In other words, the probability of µ
being in the interval $[\bar{X} - 1.96\,SE,\ \bar{X} + 1.96\,SE]$ is 0.95.
Or, $P[\bar{X} - 1.96\,SE \le \mu \le \bar{X} + 1.96\,SE] = 0.95$
In the above, 1.96 is the z-variate of the standard normal distribution for the confidence level of
95% (or the significance level of 5%).
If the sample size is small, i.e., less than 30, we use the t-variate with n-1 degrees of freedom
for the estimation.
Exercise: From a random sample of 36 civil service personnel, the mean age and sample
standard deviation were found to be 40 years and 4.5 years respectively. Construct a 95%
confidence interval for the mean age of civil servants. Also construct a 96% confidence
interval for the mean age of civil servants.
Solution: In the above, n = 36, $\bar{X} = 40$ and $s = 4.5$. The population is not finite, and the sample
size may be considered large. The standard error of the mean is given by
$$SE = \sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{4.5}{\sqrt{36}} = 0.75$$
The standard normal variate for 95% confidence is 1.96.
Thus the 95% confidence interval for the population mean is given by the limits $\bar{X} \pm 1.96\,SE$:
$$\bar{X} \pm 1.96\,SE = 40 \pm (1.96)(0.75) = 40 \pm 1.47$$
Therefore the 95% confidence interval for the population mean is [38.53, 41.47].
In other words, $P(38.53 \le \mu \le 41.47) = 0.95$.
The standard normal variate for 96% confidence is 2.054.
Thus the 96% confidence interval for the population mean is given by the limits $\bar{X} \pm 2.054\,SE$:
$$\bar{X} \pm 2.054\,SE = 40 \pm (2.054)(0.75) = 40 \pm 1.54$$
Therefore the 96% confidence interval for the population mean is [38.46, 41.54].
In other words, $P(38.46 \le \mu \le 41.54) = 0.96$.
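The interval calculation in the civil-service exercise can be sketched in a few lines of Python. This is an illustration only: the helper name `mean_confidence_interval` is hypothetical, and it uses the large-sample z-interval described above, obtaining the z-variate from the standard library's `NormalDist.inv_cdf` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(xbar, s, n, confidence):
    """Large-sample z-interval for a population mean: xbar +/- z * s / sqrt(n)."""
    se = s / sqrt(n)                                   # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # two-sided z-variate
    margin = z * se
    return xbar - margin, xbar + margin

# Values from the exercise: n = 36, mean age 40, sample sd 4.5
low95, high95 = mean_confidence_interval(40, 4.5, 36, 0.95)
low96, high96 = mean_confidence_interval(40, 4.5, 36, 0.96)
print(round(low95, 2), round(high95, 2))
print(round(low96, 2), round(high96, 2))
```

Because the z-variate is computed rather than read from a table, the limits may differ from hand-worked values in the second decimal place.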
Exercise: In a random selection of 64 of the 2400 intersections in a small city, the mean number of
scooter accidents per year was 3.2 and the sample standard deviation was 0.8.
(i) Make an estimate of the standard deviation of the population from the sample
standard deviation.
(ii) Work out the standard error of the mean for this finite population.
(iii) If the desired confidence level is 0.90, what will be the upper and lower
limits of the confidence interval for the mean number of accidents per year?
Exercise: A random sample of 16 values from normal population showed a mean of 41.5
inches and the sum of squares of deviation from this mean is 135 square inches. Obtain 95%
and 99% confidence limit for the same.
Exercise: The foreman of ABC mining company has estimated the average quantity of iron
ore extracted to be 36.8 tons per shift and the sample standard deviation to be 2.8 tons per
shift, based upon a random selection of four shifts. Construct a 90% confidence interval
around the estimate.
Estimation of Sample Size:
The size of the sample should be determined by a researcher keeping the following points in view:
(i) Nature of Universe: If the items of the universe are homogeneous, a small sample
can serve the purpose. But if the items are heterogeneous, a large sample would be
required. Technically, this can be termed as dispersion factor.
(ii) Number of classes proposed: If many class groups are to be formed, a large
sample would be required because a small sample may not be able to give
reasonable number of items in each class-group.
(iii) Nature of Study: If items are to be intensively and continuously studied, the
sample should be small. For a general survey the size of the sample should be
large, but small sample is considered appropriate in technical survey.
(iv) Type of sampling: Sampling technique plays an important part in determining the
size of the sample. A small random sample is apt to be much superior to a larger
but badly selected sample.
(v) Standard of accuracy and acceptable confidence level: If the standard of accuracy
or the level of precision is to be kept high, we shall require relatively larger
sample. For doubling the accuracy for fixed significance level, the sample size has
to be increased fourfold.
(vi) Availability of finance: In practice, the size of the sample depends upon the
amount of money available for the study. This factor should be kept in
view while determining the size of the sample. Larger samples increase
the cost of sampling estimates.
(vii) Other considerations: Nature of units, size of population, size of questionnaire,
availability of trained investigators, the conditions under which the sample is
being conducted, the time available for completion of the study are few other
considerations to which a researcher must pay attention while selecting the size of
the sample.
Sample Size when estimating a mean:
Note that the limits of the confidence interval for the population mean are given by
$$\bar{X} \pm z \cdot SE = \bar{X} \pm z \frac{\sigma}{\sqrt{n}},$$
where $\bar{X}$ is the sample mean,
z is the value of the standard variate at the given confidence level,
n is the sample size, and
σ is the standard deviation of the population.
If the researcher would like to estimate the population mean within a desired precision $\pm e$, then
we get $e = z \frac{\sigma}{\sqrt{n}}$ and therefore $n = \frac{z^2 \sigma^2}{e^2}$.
In case of a finite population, we get
$$e = z \cdot SE = z \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$
and therefore
$$n = \frac{z^2 N \sigma^2}{(N - 1) e^2 + z^2 \sigma^2}$$
Many a time, the standard deviation of the population is not known and the sample is not yet taken.
In that case, a rough estimate of the population standard deviation is given by
$$\hat{\sigma} = \frac{\text{Range of Population Distribution}}{6}$$
The range in the above may have to be obtained from past records or through a pilot survey of
a large number of items.
Exercise: If the acceptable error in estimating the population is within 3 units of the sample
mean with 95% confidence estimate the sample size, when the standard deviation of the
population is known and equals to 4.8.
Solution: Here e = 3, z = 1.96 (for 95% confidence level) and σ = 4.8. The estimate of the
sample size for 95% confidence and within 3 units of the sample mean is given by
$$n = \frac{z^2 \sigma^2}{e^2} = \frac{(1.96)^2 (4.8)^2}{(3)^2} = 9.834 \cong 10$$
Therefore the size of sample for estimating the population mean within a range of 3 units and with
95% confidence is 10.
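The two sample-size formulas for a mean (infinite and finite population) can be sketched as one small Python function. The name `sample_size_for_mean` and the optional `N` argument are illustrative assumptions; the arithmetic follows the formulas stated in this section.

```python
from math import ceil

def sample_size_for_mean(z, sigma, e, N=None):
    """Sample size needed to estimate a mean within +/- e.

    z     : standard variate for the chosen confidence level
    sigma : population standard deviation
    e     : acceptable error (precision)
    N     : population size; None means an infinite population
    """
    if N is None:
        return (z * sigma / e) ** 2
    # finite population: n = z^2 N sigma^2 / ((N - 1) e^2 + z^2 sigma^2)
    return (z**2 * N * sigma**2) / ((N - 1) * e**2 + z**2 * sigma**2)

# Values from the worked exercise: e = 3, z = 1.96, sigma = 4.8
n_raw = sample_size_for_mean(z=1.96, sigma=4.8, e=3)
print(round(n_raw, 3), ceil(n_raw))
```

Rounding up with `ceil` is a design choice made here so that the required precision is never undershot; the notes simply round 9.834 to 10.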
Exercise: A cigarette manufacturer wishes to use a random sample to estimate the average
nicotine content. The error should not be more than 1 milligram above or below the true
mean, with 99% confidence coefficient. The population standard deviation is 4 milligrams.
What sample size should the company use in order to satisfy the requirement?
Exercise: Determine the size of the sample for estimating the true weight of the 5000 cereal
container on the basis of following information:
The variance of weight is 4 ounces on the basis of past records.
Estimate should be within 0.8 ounces of the true average weight with 99% probability.
Will there be change in the size of sample if we assume infinite population in the given case?
If so, explain by how much?
Sample size when estimating the population proportion:
If we are to find the sample size for estimating a proportion of population, our reasoning
remains similar to what we have said in the context of population mean. It is required to
specify the precision and the confidence level and then estimate the sample size as under:
Note that the standard error of proportion is given by
$SE = \sigma_p = \sqrt{\frac{pq}{n}}$ (in case of infinite population)
$SE = \sigma_p = \sqrt{\frac{pq}{n}} \sqrt{\frac{N - n}{N - 1}}$ (in case of finite population of size N)
where p is the sample proportion, q = 1 - p, z is the standard variate for the appropriate
confidence level and n is the sample size.
Further, the confidence interval for the population proportion is given by
$p \pm z \cdot SE$
If e is the precision rate (the acceptable error), then the sample size can be expressed as
$n = \frac{z^2 p q}{e^2}$ (in case of infinite population)
$n = \frac{z^2 p q N}{e^2 (N - 1) + z^2 p q}$ (in case of finite population)
Exercise: What should be the sample size if a simple random sample from a population of
4000 items is to be drawn to estimate the percentage of defectives within 2% of the true value with
95.5% probability? What should be the size of the sample if the population is assumed to be
infinite in the given case? (From the pilot study, it has been observed that the proportion of
defective items is about 2%.)
Solution:
In this case N = 4000, z = 2.005, p = 0.02, q = 0.98 and e = 0.02.
$$n = \frac{z^2 p q N}{e^2 (N - 1) + z^2 p q} = \frac{(2.005)^2 (0.02)(0.98)(4000)}{(0.02)^2 (4000 - 1) + (2.005)^2 (0.02)(0.98)} = 187.78 \cong 188$$
Therefore the sample size is estimated to be 188 for the sample proportion to be within the
2% limit with 95.5% confidence.
If we assume that the population size is infinite, then
$$n = \frac{z^2 p q}{e^2} = \frac{(2.005)^2 (0.02)(0.98)}{(0.02)^2} = 196.98 \cong 197$$
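As a cross-check of the defectives exercise, the proportion formulas can be evaluated directly. The sketch below is illustrative; `sample_size_for_proportion` is a hypothetical helper name, and the z-value 2.005 is taken as given in the exercise.

```python
from math import ceil

def sample_size_for_proportion(z, p, e, N=None):
    """Sample size needed to estimate a proportion within +/- e.

    N is the population size; None means an infinite population.
    """
    q = 1.0 - p
    if N is None:
        return z**2 * p * q / e**2
    # finite population: n = z^2 p q N / (e^2 (N - 1) + z^2 p q)
    return (z**2 * p * q * N) / (e**2 * (N - 1) + z**2 * p * q)

# Values from the exercise: N = 4000, z = 2.005, p = 0.02, e = 0.02
n_finite = sample_size_for_proportion(z=2.005, p=0.02, e=0.02, N=4000)
n_infinite = sample_size_for_proportion(z=2.005, p=0.02, e=0.02)
print(ceil(n_finite), ceil(n_infinite))
```

When no pilot estimate of p is available, calling the function with p = 0.5 gives the most conservative (largest) sample size, since pq is at its maximum there.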
Exercise: Suppose a certain hotel management is interested in determining the percentage of
the hotel’s guests who stay for more than 3 days. The reservation manager wants to be 95%
confident that the percentage has been estimated within 3% of the true value. What is the
most conservative sample size needed for the problem.
Exercise: Suppose the following ten values represent random observation from a normal
parent population:
2, 6, 7, 9, 5, 1, 0, 3, 5, 4.
Construct a 99 percent confidence interval for the mean of the parent population.
Exercise: A team of medical research experts feels confident that a new drug they have
developed will cure about 80% of the patients. How large should the sample size be for the
team to be 98% certain that the sample proportion of cures is within plus and minus 2% of the
proportion of all cases that the drug will cure?
Exercise: The annual income of 900 salesmen employed by a company is known to be
approximately normally distributed. If the company wants to be 95% confident that the true mean
of this year's salesmen's income does not differ by more than 2% from last year's mean
income of Rs 12,000, what sample size would be required, assuming the population standard
deviation to be Rs 1500?
Exercise: In a random sample of 64 items taken from a large consignment, some were found
to be defective. Deduce that percentage of defective items in the consignment almost
certainly lies between 31.25 and 68.75 given that the standard error of the proportion of
defective items in the sample is 1/16.
Exercise: A cigarette manufacturer claims that his cigarettes have an average content of 18.3
mg of nicotine. If a random sample of cigarettes of this type has nicotine contents of 20, 17, 21, 19, 22, 21, 20
and 16 mg, would you agree with the manufacturer's claim? Assume a suitable value for the level
of significance. (Level of significance = 1 - Level of Confidence)
Distribution of Sample Standard deviation:
If a population is large and normally distributed with standard deviation σ, the standard
deviations of random samples of size 'n' (n large) are approximately normally distributed
with standard deviation $\sigma / \sqrt{2n}$ (the standard error of standard deviation).
The standard deviation of the distribution of standard deviations of samples drawn from a
normal population is called the standard error of standard deviation and is denoted by
$SE = \sigma / \sqrt{2n}$.
TESTING OF HYPOTHESES
Hypothesis:
It is an assumption or some supposition to be proved or rejected.
Definition: Hypothesis is a proposition or a set of propositions set forth as an explanation for
the occurrence of some specified group of phenomena either asserted merely as a provisional
conjecture to guide some investigation or accepted as highly probable in the light of
established facts.
Characteristics of Hypothesis:
(i) Hypothesis should be clear and precise.
(ii) Hypothesis should be capable of being tested
(iii) Hypothesis should state relationship between variables, if it happens to be
relational hypothesis
(iv) Hypothesis should be limited to the scope and must be specific. A researcher must
remember that narrower hypothesis is more generally testable and should develop
such hypothesis.
(v) Hypothesis should be stated as far as possible in most simple terms so that the
same is easily understandable by all concerned.
(vi) Hypothesis should be consistent with most known facts, i.e., it must be consistent
with a substantial body of established facts. In other words, it should be one which
judges accept as being the most likely.
(vii) Hypothesis should be amenable to testing within reasonable time.
(viii) Hypothesis must explain the facts that gave rise to the need for explanation.
Null Hypothesis and Alternative Hypothesis: The null hypothesis is an initial statement
concerning a population parameter. It is generally denoted by $H_0$. Any hypothesis which
differs from the null hypothesis is called an 'alternative hypothesis'. The alternative hypothesis is
denoted by $H_1$.
Type I error: The error of rejecting the null hypothesis when it should have been accepted is
known as type I error.
Type II error: The error of accepting the null hypothesis when it should have been rejected is
known as type II error.
The probability of type I error is usually determined in advance and is understood as the level of
significance of testing the hypothesis. If the type I error is fixed at 5%, it means that there are
about 5 chances in 100 that we reject $H_0$ when $H_0$ is true.
But with a fixed sample size n, when we try to reduce the type I error, the probability of
committing a type II error increases. Both types of error cannot be reduced simultaneously.
Two-tailed and One-tailed tests: A two-tailed test rejects the null hypothesis if, say, the
sample mean is significantly higher or lower than the hypothesized value of the mean of the
population. Thus in a two-tailed test, there are two rejection regions, one on each tail of the
normal curve.
$H_0: \mu = \mu_{H_0}$
$H_1: \mu \ne \mu_{H_0}$
A one-tailed test would be used when we are to test, say, whether the population mean is
either lower than or higher than some hypothesized value.
$H_0: \mu = \mu_{H_0}$, $H_1: \mu > \mu_{H_0}$
or
$H_0: \mu = \mu_{H_0}$, $H_1: \mu < \mu_{H_0}$
Example: A random sample of 25 tires from a large consignment gave an average life of
38,000 km and a standard deviation of 5000 km. Could the sample come from a population
with a mean tire life of 40,000 km?
Solution:
We make the null hypothesis and the alternative hypothesis as under:
$H_0: \mu = 40000$
$H_1: \mu \ne 40000$
We make a two-tailed test for the population mean. Consider the level of significance $\alpha = 0.05$.
The test criterion is $t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$. Here n = 25, the sample mean is 38,000 and the sample standard
deviation is s = 5000. Therefore
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{38000 - 40000}{5000 / \sqrt{25}} = -2, \qquad |t| = 2$$
From the table, the t-variate value for 5% significance level (95% confidence level) with 24
degrees of freedom is 2.064.
Since the calculated |t| is less than the table value, we accept the null hypothesis that
the mean life of the tires is 40,000 km at 5% significance level (95% confidence level).
Flow Chart for Hypothesis Testing:
(i) State $H_0$ as well as $H_1$.
(ii) Specify the level of significance (α).
(iii) Decide the correct sampling distribution.
(iv) Obtain a sample and work out an appropriate value from the sample data.
(v) Calculate the probability that the sample result would diverge as widely as it has from expectations, if the null hypothesis were true (find the z-value or t-value for the purpose).
(vi) Compare this probability with the significance level (α/2 in case of a two-tailed test; α in case of a one-tailed test), i.e., find whether the calculated z or t value is in the rejection region.
(vii) If yes, reject $H_0$; if no, accept $H_0$.
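The final comparison step of the flow chart can be expressed as a small decision rule. The sketch below is only an illustration: the function names and the `tail` labels ("two", "right", "left") are assumptions introduced here, and the critical value is expected to come from a z or t table as in the notes.

```python
def in_rejection_region(statistic, critical_value, tail="two"):
    """Return True if the test statistic falls in the rejection region.

    critical_value is the positive table value (e.g. 1.96 or 2.064).
    tail: "two" for H1: mu != mu0, "right" for H1: mu > mu0,
          "left" for H1: mu < mu0.
    """
    if tail == "two":
        return abs(statistic) > critical_value
    if tail == "right":
        return statistic > critical_value
    if tail == "left":
        return statistic < -critical_value
    raise ValueError("tail must be 'two', 'right' or 'left'")

def decide(statistic, critical_value, tail="two"):
    """Translate the rejection-region check into the flow chart's outcome."""
    return "reject H0" if in_rejection_region(statistic, critical_value, tail) else "accept H0"

# Tire example: t = -2 with 24 degrees of freedom, table value 2.064, two-tailed
print(decide(-2.0, 2.064, tail="two"))
```

For the tire example this prints "accept H0", matching the conclusion reached by hand.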
Exercise: A certain stimulus administered to each of 12 patients resulted in the following
changes in blood pressure:
5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6
Can it be concluded that the stimulus will, in general, be accompanied by a change in blood
pressure?
Solution:
From the given data, we obtain the sample mean and sample variance as
$$\bar{X} = \frac{\sum X}{n} = \frac{31}{12} = 2.58$$
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = 9.538, \qquad s = 3.09$$
We shall make the null hypothesis that the stimulus is, in general, not accompanied by a change
in blood pressure. Therefore the null hypothesis and the alternative hypothesis can be
formulated as under:
$H_0: \mu = 0$
$H_1: \mu \ne 0$
Assume 5% level of significance, i.e., 95% level of confidence. The corresponding t-value with
11 degrees of freedom is 2.201 ($t_{0.025,\,11}$). Further, the rejection region is given by $R: |t| > 2.201$.
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{2.58 - 0}{3.09 / \sqrt{12}} = 2.90$$
Since the calculated t-value is bigger than the table value, we reject the null hypothesis.
Exercise: A certain stimulus administered to each of 12 patients resulted in the following
changes in blood pressure:
5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6
Can it be concluded that the stimulus will, in general, be accompanied by an increase in blood
pressure?
Solution:
From the given data, we obtain the sample mean and sample variance as
$$\bar{X} = \frac{\sum X}{n} = \frac{31}{12} = 2.58$$
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = 9.538, \qquad s = 3.09$$
We shall make the null hypothesis that the stimulus is, in general, not accompanied by an increase
in blood pressure. Therefore the null hypothesis and the alternative hypothesis can be
formulated as under:
$H_0: \mu = 0$
$H_1: \mu > 0$
Assume 5% level of significance, i.e., 95% level of confidence. The corresponding t-value (one-tailed
test) with 11 degrees of freedom is 1.796 ($t_{0.05,\,11}$) and the rejection region is $R: t > 1.796$.
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{2.58 - 0}{3.09 / \sqrt{12}} = 2.90$$
Since the calculated t-value is bigger than the table value, we reject the null hypothesis.
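Both blood-pressure tests use the same sample mean, sample standard deviation and t-statistic; only the table value differs. A minimal sketch of that computation from the raw data, using Python's `statistics` module, is given below. The function name `one_sample_t` is illustrative, and the critical values 2.201 and 1.796 are copied from the solutions above because the standard library has no t-distribution. Values are carried at full precision, so they may differ slightly in the second decimal from hand-rounded working.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """Return (sample mean, sample sd, t-statistic) for H0: mu = mu0."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)                      # sample sd with the n - 1 divisor
    t = (xbar - mu0) / (s / sqrt(n))     # t = (xbar - mu) / (s / sqrt(n))
    return xbar, s, t

changes = [5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6]   # data from the exercise
xbar, s, t = one_sample_t(changes, mu0=0.0)

T_TWO_TAILED = 2.201   # table value, 11 degrees of freedom, 5% two-tailed
T_ONE_TAILED = 1.796   # table value, 11 degrees of freedom, 5% one-tailed

print(round(xbar, 2), round(s, 2), round(t, 2))
print("two-tailed: reject H0" if abs(t) > T_TWO_TAILED else "two-tailed: accept H0")
print("one-tailed: reject H0" if t > T_ONE_TAILED else "one-tailed: accept H0")
```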
Exercise: A cigarette manufacturer claims that his cigarettes have an average content of 18.3
mg of nicotine. If a random sample of cigarettes of this type has nicotine contents of 20, 17, 21, 19, 22, 21, 20
and 16 mg, would you agree with the manufacturer's claim? Assume a suitable value for the level
of significance.
Exercise: Raju Restaurant near the railway station has been having average sales of 500 tea
cups per day. Because of the development of a bus stand nearby, it expects to increase its
sales. During the first 12 days after the start of the bus stand, the daily sales were as under:
550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526
On the basis of this sample information, can one conclude that Raju Restaurant's sales have
increased? Use 5% level of significance.
Solution: Consider the null hypothesis that the sales average is 500 cups, i.e., sales have not increased
unless proved otherwise. We can write:
$H_0: \mu = 500$
$H_1: \mu > 500$
The sample is small and drawn from an infinite population, and the population standard deviation
is not given. So we shall use a one-tailed t-test and compute the t-statistic given by
$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$. We shall compute s and $\bar{X}$.
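X_i      (X_i - X̄)    (X_i - X̄)^2
550          2              4
570         22            484
490        -58           3364
615         67           4489
505        -43           1849
580         32           1024
570         22            484
460        -88           7744
600         52           2704
580         32           1024
530        -18            324
526        -22            484
Total: ΣX_i = 6576;  Σ(X_i - X̄)^2 = 23978
$$\bar{X} = \frac{\sum X_i}{n} = \frac{6576}{12} = 548$$
$$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{23978}{11}} = 46.69$$
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{548 - 500}{46.69 / \sqrt{12}} = 3.56$$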
Degrees of freedom = n - 1 = 12 - 1 = 11. Therefore, the corresponding t-value (one-tailed test) at
5% significance level with 11 degrees of freedom is 1.796 ($t_{0.05,\,11}$) and $R: t > 1.796$.
Since the calculated t-value is greater than the table value and lies in the rejection region, we reject the
null hypothesis that there is no increase in sales and conclude that sales have increased.
Exercise: A sample of 400 male students is found to have a mean height of 67.47 inches. Can it
be regarded as a sample from a large population with mean height 67.39 inches and standard
deviation 1.30 inches? Test at 5% level of significance.
Solution: Consider the null hypothesis that the average height is 67.39 inches. We can
write
$H_0: \mu = 67.39$
$H_1: \mu \ne 67.39$
Since the sample size is large (400), the population is infinite and the standard
deviation of the population is known, we shall use a two-tailed z-test and find the z-statistic
$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$. Note that at 5% significance level for a two-tailed test, the z-variate is 1.96 and the
rejection region is $R: |z| > 1.96$.
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{67.47 - 67.39}{1.30 / \sqrt{400}} = 1.231$$
Therefore the calculated z-value is within the
acceptance region. We accept the null hypothesis that the mean height of students is 67.39 inches at
5% significance level.
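The height example can be reproduced with a short z-test sketch. The helper name `z_test_known_sigma` is hypothetical; in addition to the z-statistic used in the solution, it returns a two-sided p-value, which is an extra quantity not computed in the notes but leads to the same decision.

```python
from math import sqrt
from statistics import NormalDist

def z_test_known_sigma(xbar, mu0, sigma, n):
    """Return (z, two-sided p-value) for H0: mu = mu0 with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Values from the exercise: n = 400, sample mean 67.47, mu0 = 67.39, sigma = 1.30
z, p_value = z_test_known_sigma(xbar=67.47, mu0=67.39, sigma=1.30, n=400)
print(round(z, 3))
print("reject H0" if abs(z) > 1.96 else "accept H0")
```

Comparing |z| with 1.96 and comparing the p-value with 0.05 are equivalent ways of stating the two-tailed rule at the 5% level.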
Exercise: Suppose that we are interested in a population of 20 industrial units of the same size,
all of which are experiencing excessive labor turnover problems. The past records show
that the mean of the distribution of turnover is 320 employees, with a standard deviation of
75 employees. A sample of 5 of these industrial units is taken at random, which gives a mean
annual turnover of 300 employees. Is the sample mean consistent with the population
mean? Test at 5% significance level.
Exercise: The mean of a certain production process is known to be 50 with a standard
deviation of 2.5. The production manager may welcome any change in the mean value towards
the higher side but would like to safeguard against decreasing values of the mean. He takes a sample
of 36 items that gives a mean value of 48.5. What inference should the manager draw about the
production process on the basis of the sample results? Use 5% level of significance for the
purpose.
Exercise: The mean lifetime of a random sample of 50 similar torch bulbs drawn from a batch
of 500 bulbs is 72 hours. The standard deviation of the lifetime of sample is 10.4 hours. The
batch is classed as inferior if the mean lifetime is less than the 75 hours. Determine whether,
as a result of sample data, the batch is considered to be inferior at level of significance of a)
0.05 and b) 0.01
Solution: The population is finite with N = 500. The sample size is n = 50. The sample mean is $\bar{X} = 72$
hrs and the sample standard deviation is s = 10.4. The claimed lifetime of the bulbs (population
mean) is a minimum of 75 hrs. The objective is to test whether the given batch is of inferior quality (lifetime
less than 75 hrs). Therefore we make the null hypothesis that the lifetime of the bulbs is
not less than 75 hrs, i.e.,
$H_0: \mu \ge 75$
$H_1: \mu < 75$
We shall have a one-tailed test for a large sample from a finite population.
$$z = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}} = \frac{72 - 75}{\frac{10.4}{\sqrt{50}} \sqrt{\frac{500 - 50}{500 - 1}}} = \frac{-3}{(1.471)(0.95)} = -2.148$$
(a) Test at 5% significance level: The table value for z is -1.645. Therefore the rejection region is
$R: z < -1.645$. The calculated value of z is in the rejection region and therefore we
reject the null hypothesis at 5% level of significance.
(b) Test at 1% significance level: The table value for z is -2.33. Therefore the rejection region is
$R: z < -2.33$. The calculated value of z is not in the rejection region and therefore we
accept the null hypothesis at 1% level of significance.
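The torch-bulb calculation combines the ordinary standard error with the finite population correction. The sketch below shows one way to compute it; the function name `z_finite_population` is an assumption made for illustration, and the left-tail critical values are obtained from `NormalDist.inv_cdf` instead of a table.

```python
from math import sqrt
from statistics import NormalDist

def z_finite_population(xbar, mu0, s, n, N):
    """z-statistic using the finite population correction sqrt((N - n) / (N - 1))."""
    se = (s / sqrt(n)) * sqrt((N - n) / (N - 1))
    return (xbar - mu0) / se

# Values from the exercise: N = 500, n = 50, sample mean 72, s = 10.4, mu0 = 75
z = z_finite_population(xbar=72, mu0=75, s=10.4, n=50, N=500)
print(round(z, 3))

for alpha in (0.05, 0.01):
    critical = NormalDist().inv_cdf(alpha)   # left-tail critical value (negative)
    verdict = "reject H0" if z < critical else "accept H0"
    print(alpha, round(critical, 3), verdict)
```

The computed critical values (about -1.645 and -2.326) agree with the tabulated -1.645 and -2.33 to the precision used in the solution.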
Hypothesis Testing for Difference of Means:
In some decision-making situations, we may have to find whether the parameters of two
populations are alike or different. For example, one may like to know whether female workers
earn the same as male workers or not. In this situation, we test whether the mean
incomes of males and females are the same or not.
In this case the parameter of our interest is $\mu_1 - \mu_2$, where $\mu_1$ may be the mean income of
the female population and $\mu_2$ may be the mean income of the male population. Suppose $n_1$ and $n_2$
are the sizes of the two samples, and $\sigma_1$ and $\sigma_2$ are the standard deviations of the respective
populations. In the absence of the population standard deviations, we use the standard deviations
of the samples for the estimation.
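The standard error for the difference of means is given by
$$SE = \sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
and the test statistic is given by
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \quad \text{(in case of large sample)}$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \quad \text{(in case of small sample, with } n_1 + n_2 - 2 \text{ degrees of freedom)}$$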
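In case the samples are presumed to be drawn from the same population whose variance
($\sigma^2$) is known, we use the z-test for the difference in means for large samples. The z-statistic and
t-statistic are as under:
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \quad \text{(in case of large sample)}$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \quad \text{(in case of small sample, with } n_1 + n_2 - 2 \text{ degrees of freedom)}$$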
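In case the population variance is not known, we estimate the standard deviation of the population as
under:
$$\hat{\sigma} = \sqrt{\frac{n_1 (s_1^2 + D_1^2) + n_2 (s_2^2 + D_2^2)}{n_1 + n_2}}$$
where $D_1 = \bar{X}_1 - \bar{X}_{1,2}$; $D_2 = \bar{X}_2 - \bar{X}_{1,2}$; and
$$\bar{X}_{1,2} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$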
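In case small samples are presumed to be taken from the same population and the population
variance is not known, we use the t-test for the difference of means. The z-statistic and
t-statistic are computed as under:
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}; \quad \text{with } n_1 + n_2 - 2 \text{ degrees of freedom}$$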
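Alternatively,
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}};$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}; \quad \text{with } n_1 + n_2 - 2 \text{ degrees of freedom}$$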
Exercise: The mean produce of wheat of a sample of 100 fields is 200 quintal per acre with a
standard deviation of 10 quintal. Another sample of 150 fields gives a mean of 220
quintal per acre with a standard deviation of 12 quintal. Can the two samples be considered to
have been taken from two populations with the same mean yield? Use 5% level of
significance.
Solution: Taking the null hypothesis that the means of the two populations do not differ, consider
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \ne \mu_2$
It is given that
$n_1 = 100$; $n_2 = 150$; $\bar{X}_1 = 200$; $\bar{X}_2 = 220$; $s_1 = 10$; $s_2 = 12$.
Sample sizes are large; we can use a two-tailed test to compare the means at 5% level of
significance. The z-value for 5% level of significance in a two-tailed test is 1.96. Therefore the
rejection region is $R: |z| > 1.96$.
Note that the standard deviations of the populations are not given, so the sample standard
deviations are used in their place. From the given data we have
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{200 - 220}{\sqrt{\frac{10^2}{100} + \frac{12^2}{150}}} = \frac{-20}{1.4} = -14.29$$
Since the calculated z-value is not in the acceptance region but in the rejection region,
we reject the null hypothesis at 5% level of significance.
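The large-sample comparison of two means in the wheat exercise can be sketched as follows. The helper name `two_sample_z` is illustrative; it takes summary statistics (means, standard deviations, sample sizes) and assumes, as the solution does, that the first sample's standard deviation is 10 quintal.

```python
from math import sqrt

def two_sample_z(x1, sd1, n1, x2, sd2, n2):
    """z-statistic for the difference of two means from large samples."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)   # standard error of (mean1 - mean2)
    return (x1 - x2) / se

# Sample standard deviations used in place of the unknown population values
z_samples = two_sample_z(200, 10, 100, 220, 12, 150)
# Same data, but with a common known population standard deviation of 11 quintal
z_common = two_sample_z(200, 11, 100, 220, 11, 150)

print(round(z_samples, 2), round(z_common, 2))
print("reject H0" if abs(z_samples) > 1.96 else "accept H0")
```

The second call covers the companion exercise in which both samples are presumed to come from one population with standard deviation 11 quintal.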
Exercise: The mean produce of wheat of a sample of 100 fields is 200 quintal per acre with a
standard deviation of 10 quintal. Another sample of 150 fields gives a mean of 220
quintal per acre with a standard deviation of 12 quintal. Can the two samples be considered to
have been taken from the same population whose standard deviation is 11 quintal? Use 5%
level of significance.
Solution: Assuming that both the samples are from the same population, consider the null
hypothesis
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \ne \mu_2$
where
$n_1 = 100$; $n_2 = 150$; $\bar{X}_1 = 200$; $\bar{X}_2 = 220$.
The standard deviation of the population is given as 11 quintal, i.e., $\sigma = 11$. Since the null
hypothesis is that both the samples are from the same population, we can take $\sigma_1 = \sigma_2 = 11$.
Sample sizes are large; we can use a two-tailed test to compare the means at 5% level of
significance. The z-value for 5% level of significance in a two-tailed test is 1.96. Therefore the
rejection region is $R: |z| > 1.96$.
Further, the z-statistic is calculated as:
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{200 - 220}{\sqrt{\frac{11^2}{100} + \frac{11^2}{150}}} = \frac{-20}{1.42} = -14.08$$
The calculated z-value falls in the rejection region and therefore we reject the null hypothesis at
5% significance level.
Exercise: A simple random sampling survey in respect of monthly earnings of semi-skilled
workers in two cities gives the following information:
City   Average monthly earning   St. deviation of monthly earning   Size of sample
A      695                       40                                 200
B      710                       60                                 175
Test the hypothesis that there is no difference between the monthly earnings of workers of the two
cities.
Exercise: Samples of sales in similar shops in two groups are taken for a new product with the
following results:
Group   Mean sales   Variance   Size of sample
A       57           5.3        5
B       61           4.8        7
Is there any evidence that both the groups are in the same town without any difference in
sales pattern? Use 5% level of significance.
Solution: We presume that both the groups are from the same town and have the same sales
pattern. In other words, we make the null hypothesis that both the groups are from a single
population. Consider the hypotheses
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \ne \mu_2$
It is given that
$n_1 = 5$; $n_2 = 7$; $\bar{X}_1 = 57$; $\bar{X}_2 = 61$; $s_1^2 = 5.3$; $s_2^2 = 4.8$.
Since the samples are small and the population variances are not known, we consider the
following t-statistic:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{57 - 61}{\sqrt{\frac{(5 - 1)(5.3) + (7 - 1)(4.8)}{5 + 7 - 2}} \sqrt{\frac{1}{5} + \frac{1}{7}}} = -3.055$$
At 5% level of significance and with 5 + 7 - 2 = 10 degrees of freedom, the t-value from the table is
2.228 and therefore the rejection region is given by $R: |t| > 2.228$. Note that the calculated
t-value is in the rejection region. So we reject the null hypothesis at 5% level of significance
and may conclude that the sample groups A and B are from different populations with
different sales patterns.
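The pooled-variance t-statistic used for the two sales groups can be sketched as below. The function name `pooled_t` is hypothetical; it takes sample variances (not standard deviations), matching how 5.3 and 4.8 enter the calculation, and the table value 2.228 is copied from the solution since the standard library provides no t-distribution.

```python
from math import sqrt

def pooled_t(x1, var1, n1, x2, var2, n2):
    """Pooled two-sample t-statistic and its degrees of freedom.

    var1 and var2 are sample variances computed with the n - 1 divisor.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = sqrt(pooled_var) * sqrt(1.0 / n1 + 1.0 / n2)
    return (x1 - x2) / se, df

# Values from the exercise: group A (57, 5.3, 5), group B (61, 4.8, 7)
t, df = pooled_t(57, 5.3, 5, 61, 4.8, 7)
T_TABLE = 2.228   # table value for 10 degrees of freedom, 5% two-tailed

print(round(t, 3), df)
print("reject H0" if abs(t) > T_TABLE else "accept H0")
```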
Exercise: Two independent samples of size 9 and 7 respectively had the following values:
Sample 1: 18 20 36 50 49 36 34 49 41
Sample 2: 29 28 26 35 30 44 46
Is the difference between the means of sample significant at 5% level of significance?
Exercise: A group of seven-week old chickens reared on a high protein diet weigh 12, 15, 11,
16, 14, 14 and 16 ounces; a second group of five chickens, similarly treated except that they
receive a low protein diet, weigh 8, 10, 14, 10 and 13 ounces. Test at 5% level whether there
is significant evidence that additional protein has increased the weight of chickens.
Hypothesis testing of Proportions & Difference between Proportions:
Recall that the standard error of proportion is given by
$SE = \sigma_p = \sqrt{\frac{pq}{n}}$ (in case of infinite population)
$SE = \sigma_p = \sqrt{\frac{pq}{n}} \sqrt{\frac{N - n}{N - 1}}$ (in case of finite population of size N)
where p is the proportion of the items in the population, q = 1 - p, z is the standard variate for the
appropriate confidence level and n is the sample size.
If $\hat{p}$ is the observed proportion, then to test the null hypothesis $H_0: p = p_{H_0}$, we
compute the z-statistic as under:
$$z = \frac{\hat{p} - p_{H_0}}{SE}$$
For a large population, we have
$$z = \frac{\hat{p} - p_{H_0}}{\sqrt{pq / n}}$$
Standard error in case of difference between proportions is
\[ SE = \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}, \]
where $\hat{p}_1$ and $\hat{p}_2$ are sample proportions of samples of sizes $n_1$ and $n_2$
respectively. The above formula is more conveniently used whenever the samples
are drawn from two heterogeneous populations. But when we assume that the populations are
similar as regards the given attribute, we make use of the following formula to compute SE:
\[ SE = \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{p_0 q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)},
   \quad\text{where}\quad p_0 = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}, \quad q_0 = 1 - p_0. \]
Exercise: A sample survey indicates that out of 3232 births, 1705 were boys and the rest were
girls. Do these figures confirm the hypothesis that the sex ratio is 50:50? Test at 5% level of
significance.
Solution: Define p as the proportion of boy babies. We state the null hypothesis and the
alternative hypothesis as under:
\[ H_0: p = 0.5, \qquad H_1: p \neq 0.5 \]
The observed value of p is $\hat{p} = 1705/3232 = 0.5275$.
The standard error of the proportion is
\[ SE = \sigma_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.5)(0.5)}{3232}} = 0.0088 \]
and the z-test statistic is
\[ z = \frac{\hat{p} - p}{\sqrt{pq/n}} = \frac{0.5275 - 0.5}{0.0088} = 3.125. \]
With reference to the null hypothesis and alternative hypothesis, we apply a two-tailed test, and
the rejection region at the 5% significance level is $R: |z| > 1.96$. The calculated z-value lies
in the rejection region, and therefore we reject the null hypothesis at the 5% significance level
and conclude that the sex ratio among the births is not 50:50.
Exercise: A certain process produces 10% defective items. A supplier of new raw material
claims that the use of his material would reduce the proportion of defectives. The random
sample of 400 units using this new material was taken out of which 34 were defective. Can
the supplier claim be accepted? Test at 1% level of significance.
Solution: Since the supplier claims that there is a decrease in defective items, we shall
consider the following null hypothesis and alternative hypothesis:
\[ H_0: p = 0.10, \qquad H_1: p < 0.10 \]
These hypotheses call for a one-tailed (left) test at the 1% level of significance. The
rejection region at the 1% level of significance is $R: z < -2.33$.
The observed sample proportion is $\hat{p} = 34/400 = 0.085$, and the z-statistic from the given
data is
\[ z = \frac{\hat{p} - p}{\sqrt{pq/n}} = \frac{0.085 - 0.1}{\sqrt{(0.1)(0.9)/400}} = -1.00 \]
Since the computed z-value does not fall in the rejection region, we cannot reject the null
hypothesis at the 1% level of significance. So, at the 1% level of significance, the sample does
not support the supplier's claim that there is a significant reduction in the defective items.
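Both single-proportion exercises above follow the same recipe, which can be sketched in a few lines. The helper name `proportion_z` is an assumption made for this illustration, and the critical values are the usual standard normal table values (1.96 two-tailed at 5%, about 2.33 one-tailed at 1%). Because the script does not round intermediate values, the births statistic comes out slightly above the hand-computed 3.125.

```python
import math

def proportion_z(successes, n, p0):
    """z statistic for H0: p = p0, with the standard error evaluated under H0."""
    p_hat = successes / n
    std_err = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / std_err

# Births exercise: two-tailed test at 5%, reject when |z| > 1.96.
z_births = proportion_z(1705, 3232, 0.5)
reject_births = abs(z_births) > 1.96

# Defectives exercise: left-tailed test at 1%, reject when z < about -2.33.
z_defects = proportion_z(34, 400, 0.10)
reject_defects = z_defects < -2.33

print(f"births:     z = {z_births:.3f}, reject H0: {reject_births}")
print(f"defectives: z = {z_defects:.3f}, reject H0: {reject_defects}")
```

A statistic of -1.00 is well inside the acceptance region, so the data give no significant evidence of a reduction in defectives.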
Exercise: The null hypothesis is that 20% of the passengers go in first class, but management
recognizes the possibility that this percentage could be more or less. A random sample of 400
passengers includes 70 passengers holding first class ticket. Can the null hypothesis be
rejected at 10% level of significance?
Exercise: A drug research experimental unit is testing two drugs newly developed to reduce
BP level. The drugs are administered to two different sets of animals. In group one, 350 of
600 animals tested respond to drug one and in group two, 260 of 500 animals tested respond
to drug two. The research unit wants to test whether there is difference between the efficiency
of the said two drugs at 5% level of significance. How will you deal with this problem?
Solution: Let $p_1$ be the proportion of animals that respond to drug one and $p_2$ be the
proportion of animals that respond to drug two. Here we may consider that the samples are from
different populations.
Consider the null hypothesis:
$H_0: p_1 = p_2$, i.e., the proportions of response for both the drugs are the same.
And the alternative hypothesis:
$H_1: p_1 \neq p_2$
We shall have a two-tailed test for the samples from different populations at the 5% significance
level. The rejection region is $R: |z| > 1.96$.
From the given data, we have
\[ \hat{p}_1 = \frac{350}{600} = 0.583, \quad \hat{p}_2 = \frac{260}{500} = 0.520, \quad n_1 = 600, \quad n_2 = 500 \]
Further, the z-value for the observed data is given by
\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}}
     = \frac{0.583 - 0.520}{\sqrt{\dfrac{(0.583)(0.417)}{600} + \dfrac{(0.520)(0.480)}{500}}} = 2.093 \]
As the calculated value is in the rejection region, we reject the null hypothesis at the 5% level
of significance.
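The two-drug comparison can be reproduced with a small script. This is a sketch under the same assumption as the solution (samples from different populations, so the unpooled standard error is used); the function name is hypothetical. Working with unrounded proportions gives a z-value slightly above the hand-computed 2.093.

```python
import math

def two_proportion_z_unpooled(x1, n1, x2, n2):
    """z statistic for p1 - p2 using separate (unpooled) variance estimates."""
    p1, p2 = x1 / n1, x2 / n2
    std_err = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / std_err

# 350 of 600 animals respond to drug one, 260 of 500 to drug two.
z_drugs = two_proportion_z_unpooled(350, 600, 260, 500)
reject_h0 = abs(z_drugs) > 1.96  # two-tailed test at the 5% level
print(f"z = {z_drugs:.3f}, reject H0: {reject_h0}")
```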
Exercise: A drug research experimental unit is testing two drugs newly developed to reduce
BP level. The drugs are administered to two different sets of animals. In group one, 350 of
600 animals tested respond to drug one and in group two, 260 of 500 animals tested respond
to drug two. The research unit wants to test whether the efficiency of the first drug is more
than the second drug at 5% level of significance. How will you deal with this problem?
Exercise: On a certain date in a large city, 400 out of a random sample of 500 men were found to
be smokers. After the tax on tobacco had been heavily increased, another random sample of
600 men in the same city included 400 smokers. Was the observed decrease in the proportion
of smokers significant? Test at 5% level of significance.
Solution: We start with the null hypothesis that the proportion of smokers remains unchanged even
after the heavy tax on tobacco, i.e., $H_0: p_1 = p_2$, and the alternative hypothesis that the
proportion of smokers has decreased after the tax, i.e., $H_1: p_1 > p_2$. So, we shall have a
one-tailed (right) test. The rejection region at the 5% level of significance is $R: z \geq 1.645$.
From the given data, we have
\[ p_1 = \frac{400}{500} = 0.8, \qquad p_2 = \frac{400}{600} = 0.667 \]
On the presumption that the populations are similar, the best estimator for the proportion is
given by
\[ p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{500(0.8) + 600(0.667)}{500 + 600} = 0.7273, \qquad q_0 = 1 - 0.7273 = 0.2727 \]
Further,
\[ z = \frac{p_1 - p_2}{\sqrt{p_0 q_0\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
     = \frac{0.8 - 0.667}{\sqrt{(0.7273)(0.2727)\left(\dfrac{1}{500} + \dfrac{1}{600}\right)}} = 4.926 \]
So the calculated value is in the rejection region, and therefore we reject the null hypothesis at
the 5% level of significance. There is a significant decrease in smokers after the increase in tax
on tobacco.
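When the populations are presumed similar, the pooled estimate replaces the separate proportions in the standard error. The sketch below illustrates this for the smokers data; the function name is an assumption, and the small difference from the hand-computed 4.926 comes from rounding in the manual steps.

```python
import math

def two_proportion_z_pooled(x1, n1, x2, n2):
    """z statistic for p1 - p2 using the pooled proportion p0 under H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)  # same as (n1*p1 + n2*p2) / (n1 + n2)
    std_err = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_err, p0

# 400 smokers among 500 men before the tax, 400 among 600 after it.
z_smokers, p0 = two_proportion_z_pooled(400, 500, 400, 600)
reject_h0 = z_smokers >= 1.645  # right-tailed test at the 5% level
print(f"p0 = {p0:.4f}, z = {z_smokers:.3f}, reject H0: {reject_h0}")
```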
Exercise: There are 100 students in a university college, and the number of students in the whole
university, inclusive of this college, is 2000. In a random sample study, 20 were found to be
smokers in the college, and the proportion of smokers in the university is 0.05. Is there a
significant difference between the proportion of smokers in the college and in the
university? Test at 5% level.
CHI-SQUARE TEST
Chi-square Distribution: The chi-square distribution is used when we deal with collections of
values that involve sums of squares. The chi-square distribution is defined for positive values of
the random variable, and the distribution curve is not symmetric. This distribution depends on yet
another parameter, the degree of freedom (n-1), where n is the sample size.
Chi-square, denoted by $\chi^2$, is a statistical measure used in the context of sampling analysis
for comparing a sample variance to a theoretical variance. As a non-parametric test, it can be
used to determine whether categorical data show dependency or whether two classifications are
independent. It can also be used to make comparisons between theoretical populations and
actual data when categories are used. Thus, the chi-square test is applicable to a large number
of problems in areas such as:
(i) test the goodness of fit
(ii) test the significance of association between two attributes, and
(iii) test the homogeneity or the significance of population variance.
Chi-square Test for testing significance of Population Variance:
We can use the test to judge whether a random sample has been drawn from a normal population
with mean µ and with a specified variance $\sigma^2$. Given a sample of size n and the sample
variance $s^2$, we observe that the quantity
\[ \chi^2 = \frac{s^2}{\sigma^2}(n-1) \]
has the chi-square distribution with n-1 degrees of freedom. To test the null hypothesis that the
population variance equals the specified value $\sigma^2$, we compare the calculated
$\chi^2$ value against the table value at n-1 degrees of freedom and the given level of
significance. If the calculated value is higher than the table value, then we reject the null
hypothesis; otherwise we accept the null hypothesis.
Exercise: The weights of ten students are as follows:
S.No: 1 2 3 4 5 6 7 8 9 10
Weight(kg): 38 40 45 53 47 43 55 48 52 49
Can we say that the variance of the distribution of weights of all students from which the
above sample of 10 students was drawn is equal to 20 kg²? Test at 5% level of significance.
Solution:
First we shall find the variance of the sample data given.
S.No     $X_i$ (weight)    $(X_i - \bar{X})$    $(X_i - \bar{X})^2$
1        38                -9                   81
2        40                -7                   49
3        45                -2                   4
4        53                6                    36
5        47                0                    0
6        43                -4                   16
7        55                8                    64
8        48                1                    1
9        52                5                    25
10       49                2                    4
Total    470                                    280
\[ \bar{X} = \frac{470}{10} = 47, \qquad s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{280}{10-1} = 31.11 \]
Let the null hypothesis be $H_0: \sigma^2 = 20$. To test the hypothesis, we shall compute
\[ \chi^2 = \frac{s^2}{\sigma^2}(n-1) = \frac{31.11}{20}(10-1) = 13.99 \]
The table value of $\chi^2$ at 10 - 1 = 9 degrees of freedom and 5% level of significance is 16.92.
Since the calculated value is less than the table value, we accept the null hypothesis at the 5%
level of significance. In other words, we can say that the sample is taken from a population with
variance 20 kg².
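The variance test above reduces to a few lines of arithmetic. The sketch below uses only the standard library; the variable names are chosen for illustration, and the critical value 16.92 is the table value quoted in the solution.

```python
from statistics import mean, variance

weights = [38, 40, 45, 53, 47, 43, 55, 48, 52, 49]  # kg, from the exercise
sigma0_sq = 20        # hypothesised population variance
n = len(weights)

x_bar = mean(weights)
s_sq = variance(weights)              # sample variance with divisor n - 1
chi_sq = (n - 1) * s_sq / sigma0_sq   # chi-square statistic with n - 1 df

CHI_SQ_CRITICAL = 16.92  # 5% table value for 9 df, as quoted in the text
reject_h0 = chi_sq > CHI_SQ_CRITICAL
print(f"mean = {x_bar}, s^2 = {s_sq:.2f}, chi-square = {chi_sq:.2f}, reject H0: {reject_h0}")
```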
Exercise: A sample of 10 is drawn randomly from a certain population. The sum of squared
deviation from the mean of given sample is 50. Test the hypothesis that the variance of the
population is 5 at 5% level of significance.
Chi-Square Test as Non-Parametric Test: This test can be used for (i) testing goodness of
fit and (ii) testing independence of data.
Testing goodness of fit: The chi-square test enables us to see how well an assumed
theoretical distribution fits the observed data. When some theoretical distribution is fitted to
the given data, we are always interested in knowing how well this distribution fits the
observed data.
The fit is considered to be good, in other words, the divergence between the
observed and expected frequencies is attributable to fluctuations of sampling, if the calculated
value of $\chi^2$ is less than the table value at a certain level of significance. Otherwise, the
fit is not considered to be a good one.
Test of independence: The $\chi^2$ test enables us to explain whether or not two attributes are
associated. For instance, we may be interested in knowing whether a new medicine is effective in
controlling fever or not; in such a case the $\chi^2$ test helps us in deciding the issue.
In such a situation, we proceed with the null hypothesis that the two attributes are independent,
i.e., the new medicine is not effective in controlling fever. On this basis we calculate the
expected frequencies and then work out the value of $\chi^2$. If the calculated $\chi^2$ value is
less than the table value for the given degree of freedom, we accept the null hypothesis;
otherwise, we reject it.
We calculate
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
where O is the observed frequency and E is the expected frequency.
Degree of Freedom:
If there are n frequency classes and there is one independent constraint, then the
degree of freedom is given by n-1.
When we have two independent constraints (bivariate case) with r rows and c columns,
then the degree of freedom is given by (c-1)(r-1).
For instance, in the following data obtained during the outbreak of smallpox:
                  Attacked    Not attacked    Total
Vaccinated        31          469             500
Not vaccinated    185         1315            1500
Total             216         1784            2000
The degree of freedom is (2-1)(2-1) = 1.
Exercise: Genetic theory states that children having one parent of blood type A and the other
of blood type B will always be one of the three types A, AB, B, and that the proportion of the
three types will on average be 1 : 2 : 1. A report states that out of 300 children having one A
parent and one B parent, 30 percent were found to be of type A, 45 percent type AB and the
remainder type B. Test the hypothesis by the $\chi^2$ test.
Solution: Observed frequencies of type A, AB and B are 90, 135 and 75 respectively
(in the proportion of 30 : 45 : 25). Theoretically, they should be in the proportion of 1 : 2 : 1.
Therefore the expected frequencies of type A, AB and B are 75, 150 and 75 respectively. We
shall use the chi-square test to verify the goodness of fit of the given theoretical distribution.
Let the null hypothesis be that the given data fit the given distribution. We calculate
$\chi^2$ as under:
Type    Observed          Expected          $(O - E)$    $(O - E)^2$    $(O - E)^2 / E$
        Frequency (O)     Frequency (E)
A       90                75                15           225            3
AB      135               150               -15          225            1.5
B       75                75                0            0              0
\[ \chi^2 = 3 + 1.5 + 0 = 4.5 \]
Degree of freedom = 3 - 1 = 2
Table value of $\chi^2$ for 2 degrees of freedom at 5% level of significance is 5.991.
The calculated $\chi^2$ value is less than the table value. Therefore we accept the null hypothesis
that on an average type A, AB and B stand in the proportion of 1 : 2 : 1.
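The goodness-of-fit computation above can be sketched as follows. The helper `chi_square_statistic` is a name introduced here for illustration; expected frequencies are derived from the 1 : 2 : 1 ratio, and 5.991 is the table value quoted in the solution.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [90, 135, 75]   # types A, AB, B out of 300 children
ratio = [1, 2, 1]          # theoretical proportion A : AB : B

total = sum(observed)
expected = [total * r / sum(ratio) for r in ratio]   # 75, 150, 75

chi_sq = chi_square_statistic(observed, expected)
df = len(observed) - 1
CHI_SQ_CRITICAL = 5.991    # 5% table value for 2 df, as quoted in the text
print(f"expected = {expected}, chi-square = {chi_sq}, df = {df}, "
      f"reject H0: {chi_sq > CHI_SQ_CRITICAL}")
```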
Exercise: A die is rolled 240 times and the observed frequencies are given below.
Face observed         1     2     3     4     5     6
Frequency observed    49    35    32    46    49    29
Using the $\chi^2$ test, verify whether the die is unbiased. Test at 5% level of significance.
Exercise: In a city a survey was carried out of 200 families, each with 5 children. The
distribution shown below was produced.
(Boys, Girls) (5, 0) (4, 1) (3, 2) (2, 3) (1, 4) (0, 5)
No of families 11 35 69 55 25 5
Test the null hypothesis that the observed frequencies are consistent with male and female
births being equally probable, assuming a binomial distribution, at a level of significance of 0.05.
Solution: Assume that male and female births are equally probable, that is, $p = q = 0.5$. Note
that the probability of having k boys among 5 children in a family is given by
\[ \binom{5}{k}(0.5)^k(0.5)^{5-k} \]
(B, G)    Observed         Prob       Expected Frequency (E)    $(O - E)$    $(O - E)^2$    $(O - E)^2 / E$
          Frequency (O)               (Prob x 200, rounded)
(5, 0)    11               0.03125    6                         5            25             4.167
(4, 1)    35               0.15625    31                        4            16             0.516
(3, 2)    69               0.3125     63                        6            36             0.571
(2, 3)    55               0.3125     63                        -8           64             1.016
(1, 4)    25               0.15625    31                        -6           36             1.161
(0, 5)    5                0.03125    6                         -1           1              0.167
\[ \chi^2 = \sum \frac{(O - E)^2}{E} = 7.598 \]
Degree of freedom = 6 - 1 = 5
Table value of $\chi^2$ for 5 degrees of freedom at 5% level of significance is 11.1.
Since the calculated $\chi^2$ is less than the table value, we accept the null hypothesis that the
observed frequencies are consistent with male and female births being equally probable.
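The binomial expected frequencies can also be generated programmatically. The sketch below is an illustration, not the original method: it keeps the expected frequencies unrounded (6.25, 31.25, 62.5, ...), so the statistic comes out somewhat lower than the 7.598 obtained from whole-number expected frequencies. The conclusion is the same.

```python
from math import comb

observed = [11, 35, 69, 55, 25, 5]   # families with 5, 4, 3, 2, 1, 0 boys
boys = [5, 4, 3, 2, 1, 0]
families, children, p = 200, 5, 0.5

# Binomial probability of k boys among 5 children, then expected frequency.
probs = [comb(children, k) * p**k * (1 - p) ** (children - k) for k in boys]
expected = [families * pr for pr in probs]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
CHI_SQ_CRITICAL = 11.07   # 5% table value for 5 df (11.1 in the text)
print(f"expected = {expected}")
print(f"chi-square = {chi_sq:.3f}, df = {df}, reject H0: {chi_sq > CHI_SQ_CRITICAL}")
```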
Exercise: Two research groups classified some people in income groups on the basis of
sampling studies. The results are as follows:
Investigator    Income groups                Total
                Poor    Middle    Rich
A               160     30        10         200
B               140     120       40         300
Total           300     150       50         500
Show that the sampling technique of at least one research group is defective.
Solution: Let us make the hypothesis that the techniques adopted by both the groups are similar
and the data are similar.
Expected frequencies are
Investigator    Income groups                Total
                Poor    Middle    Rich
A               120     60        20         200
B               180     90        30         300
Total           300     150       50         500
\[ \chi^2 = \sum \frac{(O - E)^2}{E}
   = \frac{(160-120)^2}{120} + \frac{(30-60)^2}{60} + \frac{(10-20)^2}{20}
   + \frac{(140-180)^2}{180} + \frac{(120-90)^2}{90} + \frac{(40-30)^2}{30} = 55.54 \]
Degree of freedom = (3-1)(2-1) = 2
Table value of $\chi^2$ for 2 degrees of freedom at 5% level of significance is 5.991. Since the
calculated value is greater than the table value, we reject the null hypothesis at the
5% level of significance. The technique adopted by at least one of the two groups in data
collection is defective.
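For a contingency table, each expected frequency is (row total x column total) / grand total. The sketch below applies this to the investigators' data; the function names are hypothetical, and the result matches the hand calculation up to rounding.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    expected = expected_frequencies(table)
    chi_sq = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_sq, df

# Rows: investigators A and B; columns: poor, middle, rich.
observed = [[160, 30, 10], [140, 120, 40]]
expected = expected_frequencies(observed)
chi_sq, df = chi_square_independence(observed)
print(f"expected = {expected}")
print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {chi_sq > 5.991}")
```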
Exercise: The following data is obtained during the outbreak of smallpox:
Attacked Not attacked Total
Vaccinated 31 469 500
Not vaccinated 185 1315 1500
Total 216 1784 2000
Test the effectiveness of vaccination in preventing the attack from the smallpox.
Exercise: Consider the following information regarding home condition and children’s
condition:
Condition of child    Condition of home    Total
                      Clean    Dirty
Clean                 70       50          120
Fairly Clean          80       20          100
Dirty                 35       45          80
Total                 185      115         300
State whether the two attributes viz., condition of home and condition of child are
independent. Use chi-square test for the purpose.
Conditions for Application of Chi-square test:
The following conditions should be satisfied before the $\chi^2$ test is applied:
(i) Observations recorded and used are collected on a random basis.
(ii) All the items in the sample must be independent.
(iii) No group should contain very few items.
(iv) The overall number of items must also be reasonably large.
(v) The constraints must be linear, that is, they must involve linear equations in the
cell frequencies of a contingency table.
ANOVA
Consider a case of three varieties of wheat, each grown on four plots. The production of wheat
per acre for each variety on each plot is given below:
Plot of land    Variety of wheat
                A    B    C
1               6    5    5
2               7    5    4
3               3    3    3
4               8    7    4
A researcher may be interested in whether there is a significant difference between varieties of
wheat and/or between plots.
The ANOVA technique is very useful for analysis in the above context.
ANOVA is an important technique in all those situations where we want to compare more
than two populations, such as in comparing the yield of crop from several varieties of seeds,
the mileage of several automobiles and so on. In circumstances of this kind, one generally
does not want to consider all possible combinations of two populations at a time, since the
number of tests required before arriving at a decision would be large.
The basic principle of ANOVA is to test for differences among the means of the populations
by examining the amount of variation within each of the samples, relative to the amount of
variation between the samples. In terms of variation within a given population, it is
assumed that the values of X differ from the mean of this population only because of random
effects, i.e., there are influences on X which are unexplainable, whereas in examining
differences between populations we assume that the difference between the mean of the jth
population and the grand mean is attributable to what is called a 'specific factor' or what is
technically described as the treatment effect. Thus, while using ANOVA, we assume that each of
the samples is drawn from a normal population and that each of these populations has the same
variance. We also assume that all the factors other than the one or more being tested are
effectively controlled. In other words, we assume the absence of many factors that
might affect our conclusions concerning the factor(s) to be studied.
In this case we make two estimates of the population variance, namely, one based on the
between-samples variance and the other based on the within-samples variance. Then the said two
estimates of population variance are compared with the F-test, wherein we work out
\[ F = \frac{\text{Estimate of population variance based on between-samples variance}}{\text{Estimate of population variance based on within-samples variance}} \]
This value of F is to be compared to the F-limit for the given degrees of freedom. If the calculated
F value is more than the F-limit value, we may say that there are significant differences
between the sample means.
ANOVA Technique: One-way ANOVA
Under one-way ANOVA, we consider just one factor and then observe that the reason for the
said factor to be important is that several possible types of samples can occur within that
factor. We then determine if there are differences within that factor.
(i) Obtain the sample means $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$, where k is the number of samples.
(ii) Work out the mean of the sample means by the formula
\[ \bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_k}{k} \]
(for samples of equal size; otherwise use the mean of all the observations).
(iii) Calculate the sum of squares between the samples by the formula
\[ SS_{between} = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + \cdots + n_k(\bar{X}_k - \bar{\bar{X}})^2 \]
(iv) Compute the Mean Square between the samples by the formula
\[ MS_{between} = \frac{SS_{between}}{k-1} \]
where k-1 represents the degree of freedom between the samples.
(v) Calculate the sum of squares within samples by the formula
\[ SS_{within} = \sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \cdots + \sum (X_{ki} - \bar{X}_k)^2 \]
(vi) Compute the Mean Square within by
\[ MS_{within} = \frac{SS_{within}}{n-k} \]
where n is the total number of items in all the samples, and k is the number of samples.
(vii) Compute the sum of squares of deviations for total variance by
\[ SS_{total} = \sum (X_{ij} - \bar{\bar{X}})^2 = SS_{between} + SS_{within} \]
Degree of freedom for total variance is n-1 = (k-1)+(n-k).
(viii) Finally, the F-ratio is computed by the formula
\[ F\text{-ratio} = \frac{MS_{between}}{MS_{within}} \]
If the calculated F-ratio is greater than the F-value for the given degrees of
freedom and the significance level, then we reject the null hypothesis and
conclude that the differences between the means are significant.
ANOVA TABLE (ONE-WAY ANALYSIS)
Source of variation    Sum of Squares (SS)    Degree of Freedom    Mean Square (MS)    F-Ratio
Between the samples    $n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + \cdots + n_k(\bar{X}_k - \bar{\bar{X}})^2$    (k-1)    $SS_{between}/(k-1)$    $MS_{between}/MS_{within}$
Within samples         $\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \cdots + \sum (X_{ki} - \bar{X}_k)^2$    (n-k)    $SS_{within}/(n-k)$
Total                  $\sum (X_{ij} - \bar{\bar{X}})^2$    (n-1)
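Short Cut Method:
Compute $T = \sum X_{ij}$ (the grand total), and let $T_j$ be the total of the jth sample and
$n_j$ its size. Further,
Source of variation    Sum of Squares (SS)    Degree of Freedom    Mean Square (MS)    F-Ratio
Between the samples    $\sum_j \dfrac{(T_j)^2}{n_j} - \dfrac{(T)^2}{n}$    (k-1)    $SS_{between}/(k-1)$    $MS_{between}/MS_{within}$
Within samples         $\left[\sum X_{ij}^2 - \dfrac{(T)^2}{n}\right] - \left[\sum_j \dfrac{(T_j)^2}{n_j} - \dfrac{(T)^2}{n}\right]$    (n-k)    $SS_{within}/(n-k)$
Total                  $\sum X_{ij}^2 - \dfrac{(T)^2}{n}$    (n-1)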
Exercise: Set up an ANOVA table for the following per acre production for three varieties of
wheat, each grown on four plots, and state if the variety difference is significant:
Plot of land    Variety of wheat
                A    B    C
1               6    5    5
2               7    5    4
3               3    3    3
4               8    7    4
Solution:
Plot of land    Variety of wheat
                A     B     C
1               6     5     5
2               7     5     4
3               3     3     3
4               8     7     4
Total           24    20    16
\[ n_1 = n_2 = n_3 = 4, \quad n = 12; \qquad \bar{X}_1 = \frac{24}{4} = 6, \quad \bar{X}_2 = \frac{20}{4} = 5, \quad \bar{X}_3 = \frac{16}{4} = 4; \qquad \bar{\bar{X}} = \frac{6 + 5 + 4}{3} = 5 \]
\[ SS_{between} = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + n_3(\bar{X}_3 - \bar{\bar{X}})^2 = 4(6-5)^2 + 4(5-5)^2 + 4(4-5)^2 = 8 \]
\[ SS_{within} = \sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \sum (X_{3i} - \bar{X}_3)^2 = 24 \]
Source of variation    SS    df            MS              F-Ratio
Between the samples    8     3-1 = 2       8/2 = 4.00      4.00/2.67 = 1.5
Within samples         24    12-3 = 9      24/9 = 2.67
Total                  32    12-1 = 11
F-limit at 5% level of significance: F(2, 9) = 4.26
Since the calculated value is less than the table value, we accept the null hypothesis that the
difference between outputs due to variety of wheat is not significant.
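The one-way ANOVA steps can be traced with a short script. This is a minimal sketch on the wheat data; the function name `one_way_anova` is an assumption, and the grand mean is taken over all observations, which equals the mean of the sample means here because the samples are of equal size.

```python
def one_way_anova(samples):
    """Return SS between, SS within and the F-ratio for a list of samples."""
    k = len(samples)                          # number of samples
    n = sum(len(s) for s in samples)          # total number of observations
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]

    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ss_between, ss_within, ms_between / ms_within

# Per acre production for wheat varieties A, B and C on four plots.
wheat = [[6, 7, 3, 8], [5, 5, 3, 7], [5, 4, 3, 4]]
ss_between, ss_within, f_ratio = one_way_anova(wheat)

F_LIMIT = 4.26   # F(2, 9) at the 5% level, as quoted in the text
print(f"SS between = {ss_between}, SS within = {ss_within}, "
      f"F = {f_ratio:.2f}, significant: {f_ratio > F_LIMIT}")
```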
Exercise: A manager of a firm wishes to test whether the salesmen of his firm (A, B, and C)
tend to make sales of same size. During a week there have been 14 sale calls. A made 5, B
made 4 and C made 5 calls respectively. Following are the sales data for the week of 3
salesmen.
A: 500 400 700 800 600
B: 300 700 400 600
C: 500 300 500 400 300
Perform ANOVA and draw your conclusion at 5% level of significance (F(2, 11)=3.98)
Hint : Use scaling factor (X-500)/100
TWO-WAY ANOVA
Two-way ANOVA technique is used when the data are classified on the basis of two factors.
For example:
(i) The agricultural output may be classified on the basis of different varieties of
seeds and also on the basis of different fertilizers used.
(ii) A business firm may have its sales data classified on the basis of different
salesmen and also on the basis of sales in different regions.
(iii) In a factory, the various units of a product produced during a certain period may
be classified on the basis of different varieties of machines and also on the basis of
different grades of labor.
Two-way designs may or may not have repeated measurements of each factor. The ANOVA
technique is a little different in the case of repeated measurements, where we
also compute the interaction variation.
Two-way ANOVA when values are not repeated.
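Source of variation         Sum of Squares (SS)    Degree of Freedom    Mean Square (MS)    F-Ratio
Between column treatment    $\sum_j \dfrac{(T_j)^2}{n_j} - \dfrac{(T)^2}{n}$    (c-1)    $SS_{between\ columns}/(c-1)$    $MS_{between\ columns}/MS_{residual}$
Between row treatment       $\sum_i \dfrac{(T_i)^2}{n_i} - \dfrac{(T)^2}{n}$    (r-1)    $SS_{between\ rows}/(r-1)$    $MS_{between\ rows}/MS_{residual}$
Residual error              Total SS - (SS between columns + SS between rows)    (c-1)(r-1)    $SS_{residual}/((c-1)(r-1))$
Total                       $\sum X_{ij}^2 - \dfrac{(T)^2}{n}$    c.r - 1
Here $T_j$ and $T_i$ are the column and row totals, $n_j$ and $n_i$ are the numbers of items in
the jth column and ith row, T is the grand total and n is the total number of items.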
Exercise: Set up an ANOVA table for the following two-way design results:
Per acre production of wheat
Variety of      Variety of wheat
Fertilizer      A     B     C
1               6     5     5
2               7     5     4
3               3     3     3
4               8     7     4
Total           24    20    16
Also state whether variety differences are significant at 5% level of significance.
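Solution:
Variety of      Variety of wheat     Row total
Fertilizer      A     B     C
1               6     5     5        16
2               7     5     4        16
3               3     3     3        9
4               8     7     4        19
Column Total    24    20    16       60
n = 12; T = 60
\[ \frac{T^2}{n} = \frac{60^2}{12} = 300 \]
\[ SS_{total} = \sum X_{ij}^2 - \frac{T^2}{n} = (6^2 + 5^2 + 5^2 + 7^2 + 5^2 + 4^2 + 3^2 + 3^2 + 3^2 + 8^2 + 7^2 + 4^2) - 300 = 32 \]
\[ SS_{between\ column\ treatment} = \sum_j \frac{(T_j)^2}{n_j} - \frac{(T)^2}{n} = \frac{24^2}{4} + \frac{20^2}{4} + \frac{16^2}{4} - 300 = 8 \]
\[ SS_{between\ row\ treatment} = \sum_i \frac{(T_i)^2}{n_i} - \frac{(T)^2}{n} = \frac{16^2}{3} + \frac{16^2}{3} + \frac{9^2}{3} + \frac{19^2}{3} - 300 = 18 \]
SS Residual = Total SS - (SS between columns + SS between rows)
= 32 - (8 + 18) = 6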
ANOVA Table:
Source of variation         Sum of Squares (SS)    Degree of Freedom    Mean Square (MS)    F-Ratio    5% level F-limit
Between column treatment    8                      2                    4                   4          F(2, 6) = 5.14
Between row treatment       18                     3                    6                   6          F(3, 6) = 4.76
Residual error              6                      6                    1
Total                       32                     11
The F-ratio due to column treatment (4) is less than the table value (5.14) at the 5% level of
significance. Therefore the difference among mean yields due to column treatment (varieties of
wheat) is not significant.
But the F-ratio in the case of row treatment (6) is larger than the table value (4.76) at the 5%
level of significance. Therefore, the difference among mean yields due to row treatment
(varieties of fertilizer) is significant.
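The short-cut formulas for two-way ANOVA without repeated values translate directly into code. The sketch below is an illustration using the wheat and fertilizer data; the function name `two_way_anova` and the dictionary keys are assumptions made here, and the F-limits are the table values quoted above.

```python
def two_way_anova(table):
    """Two-way ANOVA without repeated values; rows and columns are the two factors."""
    r, c = len(table), len(table[0])
    n = r * c
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / n                     # T^2 / n

    ss_total = sum(x * x for row in table for x in row) - correction
    ss_cols = sum(sum(col) ** 2 / r for col in zip(*table)) - correction
    ss_rows = sum(sum(row) ** 2 / c for row in table) - correction
    ss_resid = ss_total - (ss_cols + ss_rows)

    ms_resid = ss_resid / ((c - 1) * (r - 1))
    return {
        "ss_total": ss_total, "ss_cols": ss_cols,
        "ss_rows": ss_rows, "ss_resid": ss_resid,
        "f_cols": (ss_cols / (c - 1)) / ms_resid,
        "f_rows": (ss_rows / (r - 1)) / ms_resid,
    }

# Rows: fertilizers 1 to 4; columns: wheat varieties A, B, C.
wheat = [[6, 5, 5], [7, 5, 4], [3, 3, 3], [8, 7, 4]]
result = two_way_anova(wheat)
print(result)
print("columns significant:", result["f_cols"] > 5.14)   # F(2, 6) at 5%
print("rows significant:", result["f_rows"] > 4.76)      # F(3, 6) at 5%
```

The same function can be applied to any complete r x c table with one observation per cell.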
Exercise: The following data give the number of units produced per day by 5 workers using
4 different machines.
Worker    M1    M2    M3    M4
A         45    42    48    38
B         40    32    50    34
C         43    36    44    40
D         36    38    46    36
E         41    37    47    37
Test if the production is equal with respect to machines and with respect to workers.
Answer:
Source of variation         Sum of Squares (SS)    Degree of Freedom    Mean Square (MS)    F-Ratio    5% level F-limit
Between column treatment    335                    3                    111.67              14.97      F(3, 12) = 3.49
Between row treatment       48.5                   4                    12.125              1.626      F(4, 12) = 3.26
Residual error              89.5                   12                   7.458
Total                       473                    19
Limitations of Test of Hypothesis:
(i) The tests should not be used in a mechanical fashion. It should be kept in view that
testing is not decision-making itself; the tests are only useful aids for
decision-making. Hence proper interpretation of statistical evidence is important to
intelligent decisions.
(ii) Tests do not explain the reasons why a difference exists. They simply
indicate whether the difference is due to fluctuations of sampling or to
other reasons, but they do not tell us which other reason(s) cause
the difference.
(iii) Results of significance tests are based on probabilities and as such cannot be expressed
with full certainty. When a test shows that a difference is statistically significant,
it simply suggests that the difference is probably not due to chance.
(iv) Statistical inferences based on significance tests cannot be said to be entirely
correct evidence concerning the truth of the hypothesis. This is especially so in the case of
small samples, where the probability of drawing erroneous inferences is
generally higher. For greater reliability, the size of samples should be sufficiently
enlarged.