
STAT_T_3


SAMPLING DISTRIBUTIONS

Population Distribution: When we talk of Population distribution, we assume that we have

investigated the population and have full knowledge of its mean and standard deviation.

Population mean is denoted by µ and standard deviation of population is denoted by σ . The

measures µ and σ of populations are called parameters.

Sample Distribution: When we talk of a sample distribution, we take a sample from the population. The mean and standard deviation of the sample are denoted by $\bar{X}$ and $s$. These measures related to the sample are called statistics. Note that several sample distributions are possible from a given population.

Distribution of Sample Means: Considering the sample mean $\bar{X}$ as a variable, we observe that the expected value of $\bar{X}$ is the population mean, i.e., $E(\bar{X}) = \mu_{\bar{X}} = \mu$, and the standard deviation of $\bar{X}$ is given by $\sigma_{\bar{X}} = \sigma / \sqrt{n}$, where $n$ is the sample size. The standard deviation of the mean is also known as the standard error of the mean.

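In order to use the sample standard deviation $s$ as an estimate for $\sigma$, we have the following formula:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

The standard error of the mean is then given by

$$\sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n(n - 1)}}$$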

When the sample size $n$ is not very small in comparison with the finite population size $N$, we use the following formula:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$
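The two standard-error formulas above can be sketched as one small helper. This is a minimal illustration, not part of the original notes: the function name `standard_error` and the optional `N` argument are assumptions, and the numbers in the calls are chosen only for demonstration.

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the mean.

    sigma: population (or estimated) standard deviation
    n:     sample size
    N:     finite population size; if given, the finite population
           correction sqrt((N - n) / (N - 1)) is applied
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Illustrative values only.
print(round(standard_error(0.25, 36), 4))          # infinite population
print(round(standard_error(10.4, 50, N=500), 4))   # finite population of 500
```

The correction factor is always below 1 when $n > 1$, so the finite-population standard error is smaller than the uncorrected one.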

Exercise: The time between two arrivals in a queuing model is normally distributed with

mean 2 minutes and standard deviation 0.25 minute. If a random sample size of 36 is drawn,

what is the probability that a sample mean will be greater than 2.1 minutes?

Solution: $n = 36$; $\mu = 2$; $\sigma = 0.25$. The standard error of the mean is calculated as under:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{0.25}{\sqrt{36}} = 0.042$$

The probability that the sample mean is greater than 2.1 is then given by

$$P(\bar{X} \ge 2.1) = P\left(\frac{\bar{X} - 2}{0.042} \ge \frac{2.1 - 2}{0.042}\right) = P(Z \ge 2.38) = 1 - 0.9913 = 0.0087$$
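As a cross-check of the queuing exercise, the same probability can be computed with Python's standard library. This is only a sketch; the variable names are illustrative. Because the standard error is carried unrounded here (0.25/6 rather than 0.042), the z-value is 2.4 and the probability comes out near 0.0082, slightly below the 0.0087 obtained from the rounded figures.

```python
import math
from statistics import NormalDist

mu, sigma, n = 2.0, 0.25, 36          # values from the exercise
se = sigma / math.sqrt(n)             # standard error of the mean

# Sampling distribution of the mean: Normal(mu, se).
sampling_dist = NormalDist(mu, se)
p_greater = 1 - sampling_dist.cdf(2.1)

z = (2.1 - mu) / se
print(f"SE = {se:.4f}, z = {z:.2f}, P(mean > 2.1) = {p_greater:.4f}")
```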

Exercise: The weight of certain type of car tire is normally distributed with mean of 25

pounds and variance of 3 pounds. A random sample of 50 tires is selected. What is the

probability that the mean of this sample lies between 24.5 and 25.5 pounds?

Exercise: An auditor takes a sample of size 36 from a population of 1000 accounts receivable.

The standard deviation of the population is unknown, but the standard deviation of the

sample is Rs 43. If the true mean value of the accounts receivable is Rs 260, what is the

probability that the sample mean will be less than or equal to Rs 250?

Estimation of Population Mean:

In most of the research studies, population parameters are unknown and have to be estimated

from a sample. As such the methods of estimating parameters assume an important role in

statistical analysis.

The estimate of a population parameter may be one single value or it could be a range of values. If the estimate is one single value, it is referred to as a point estimate, whereas a range of values is termed an interval estimate.

A good estimator possesses the following properties:

(i) An estimator should, on average, be equal to the value of the parameter being estimated. (Property of unbiasedness)

(ii) An estimator should have relatively small variance. (Property of efficiency)

(iii) An estimator should use as much as possible of the information available from the sample. (Property of sufficiency)

(iv) An estimator should approach the value of the parameter as the sample size becomes larger and larger. (Property of consistency)

The point estimator of the population mean ($\mu$) is $\bar{X}$, the sample mean.

The interval estimator for the mean $\mu$ is given by an interval around $\bar{X}$, for a certain degree of confidence, constructed with the help of the standard error (SE).

For example, the 95% confidence interval for the population mean is given by the lower limit $\bar{X} - 1.96\,SE$ and the upper limit $\bar{X} + 1.96\,SE$. In other words, the probability of $\mu$ being in the interval $[\bar{X} - 1.96\,SE,\ \bar{X} + 1.96\,SE]$ is 0.95, or

$$P[\bar{X} - 1.96\,SE \le \mu \le \bar{X} + 1.96\,SE] = 0.95$$

In the above, 1.96 is the z-variate of the standard normal distribution for the confidence level of 95% (or the significance level of 5%).

If the sample size is small, i.e., less than 30, we use the t-variate with $n - 1$ degrees of freedom for the estimation.

Exercise: From a random sample of 36 civil service personnel, the mean age and sample

standard deviation were found to be 40 years and 4.5 years respectively. Construct a 95%

confidence interval for the mean age of civil servants. Also construct a 96% confidence

interval for the mean age of civil servants.

Solution: In the above, $n = 36$, $\bar{X} = 40$ and $s = 4.5$. The population is not finite, and the sample size may be considered large. The standard error of the mean is given by

$$SE = \sigma_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{4.5}{\sqrt{36}} = 0.75$$

The standard normal variate for 95% confidence is 1.96. Thus the 95% confidence interval for the population mean is given by the limits $\bar{X} \pm 1.96\,SE$:

$$\bar{X} \pm 1.96\,SE = 40 \pm (1.96)(0.75) = 40 \pm 1.47$$

Therefore the 95% confidence interval for the population mean is [38.53, 41.47]. In other words, $P(38.53 \le \mu \le 41.47) = 0.95$.

The standard normal variate for 96% confidence is 2.054. Thus the 96% confidence interval for the population mean is given by the limits $\bar{X} \pm 2.054\,SE$:

$$\bar{X} \pm 2.054\,SE = 40 \pm (2.054)(0.75) = 40 \pm 1.54$$

Therefore the 96% confidence interval for the population mean is [38.46, 41.54]. In other words, $P(38.46 \le \mu \le 41.54) = 0.96$.
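The large-sample interval for the civil-service exercise can be reproduced with a short sketch. The helper name `z_confidence_interval` is an assumption made for illustration; the critical value is taken from the standard normal quantile function instead of a printed table, which gives about 1.960 for 95% and about 2.054 for 96%.

```python
import math
from statistics import NormalDist

def z_confidence_interval(mean, s, n, confidence):
    """Large-sample confidence interval: mean +/- z * s / sqrt(n)."""
    # Two-sided interval, so the upper quantile is 0.5 + confidence / 2.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width

for level in (0.95, 0.96):
    low, high = z_confidence_interval(40, 4.5, 36, level)
    print(f"{level:.0%} interval: [{low:.2f}, {high:.2f}]")
```

For samples smaller than 30 the text calls for the t-variate instead; the standard library has no t quantile function, so a table value would have to be supplied by hand in that case.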

Exercise: In a random selection of 64 of the 2400 intersections in a small city, the mean number of scooter accidents per year was 3.2 and the sample standard deviation was 0.8.

(i) Make an estimate of the standard deviation of the population from the sample standard deviation.

(ii) Work out the standard error of the mean for this finite population.

(iii) If the desired confidence level is 0.90, what will be the upper and lower limits of the confidence interval for the mean number of accidents per year?

Exercise: A random sample of 16 values from a normal population showed a mean of 41.5 inches, and the sum of squares of deviations from this mean is 135 square inches. Obtain 95% and 99% confidence limits for the population mean.

Exercise: The foreman of ABC mining company has estimated the average quantity of iron

ore extracted to be 36.8 tons per shift and the sample standard deviation to be 2.8 tons per

shift, based upon a random selection of four shifts. Construct a 90% confidence interval

around the estimate.

Estimation of Sample Size:

The size of the sample should be determined by a researcher keeping the following points in view:

(i) Nature of Universe: If the items of the universe are homogeneous, a small sample

can serve the purpose. But if the items are heterogeneous, a large sample would be

required. Technically, this can be termed as dispersion factor.

(ii) Number of classes proposed: If many class groups are to be formed, a large

sample would be required because a small sample may not be able to give

reasonable number of items in each class-group.

(iii) Nature of Study: If items are to be intensively and continuously studied, the

sample should be small. For a general survey the size of the sample should be

large, but small sample is considered appropriate in technical survey.

(iv) Type of sampling: Sampling technique plays an important part in determining the

size of the sample. A small random sample is apt to be much superior to a larger

but badly selected sample.

(v) Standard of accuracy and acceptable confidence level: If the standard of accuracy

or the level of precision is to be kept high, we shall require relatively larger

sample. For doubling the accuracy for fixed significance level, the sample size has

to be increased fourfold.

(vi) Availability of finance: In practice, the size of the sample depends upon the amount of money available for the study. This factor should be kept in view while determining the size of the sample. A larger sample increases the cost of the sampling estimates.

(vii) Other considerations: The nature of the units, the size of the population, the size of the questionnaire, the availability of trained investigators, the conditions under which the sampling is being conducted and the time available for completion of the study are a few other considerations to which a researcher must pay attention while selecting the size of the sample.

Sample Size when estimating a mean:

Note that the limits of the confidence interval for the population mean are given by

$$\bar{X} \pm z \cdot SE = \bar{X} \pm z \frac{\sigma}{\sqrt{n}},$$

where $\bar{X}$ is the sample mean, $z$ is the value of the standard variate at the given confidence level, $n$ is the sample size, and $\sigma$ is the standard deviation of the population.

If the researcher would like to estimate the population mean within a desired precision $\pm e$, then $e = z \dfrac{\sigma}{\sqrt{n}}$ and therefore

$$n = \frac{z^2 \sigma^2}{e^2}.$$

In the case of a finite population, we get

$$e = z \cdot SE = z \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$

and therefore

$$n = \frac{z^2 \sigma^2 N}{(N - 1) e^2 + z^2 \sigma^2}.$$

Many a time the standard deviation of the population is not known and the sample has not yet been taken. A rough estimate of the population standard deviation is then given by

$$\hat{\sigma} = \frac{\text{Range of Population Distribution}}{6}$$

The range in the above may have to be obtained from past records or through a pilot survey of a large number of items.

Exercise: If the acceptable error in estimating the population mean is within 3 units of the sample mean with 95% confidence, estimate the sample size when the standard deviation of the population is known and equals 4.8.

Solution: Here $e = 3$, $z = 1.96$ (for the 95% confidence level) and $\sigma = 4.8$. The sample size for 95% confidence and within 3 units of the sample mean is given by

$$n = \frac{z^2 \sigma^2}{e^2} = \frac{(1.96)^2 (4.8)^2}{(3)^2} = 9.834 \cong 10$$

Therefore the sample size for estimating the population mean within 3 units and with 95% confidence is 10.
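The sample-size formulas for a mean can be sketched as a single function. The name `sample_size_for_mean` and the optional `N` argument are assumptions for illustration; the first call uses the figures of the exercise above, while the finite-population call uses a made-up population of 200 simply to show the effect of the correction.

```python
import math

def sample_size_for_mean(z, sigma, e, N=None):
    """Sample size needed to estimate a mean within +/- e.

    Infinite population: n = z^2 sigma^2 / e^2
    Finite population:   n = z^2 sigma^2 N / ((N - 1) e^2 + z^2 sigma^2)
    Returns the unrounded value; round up to get a usable sample size.
    """
    if N is None:
        return (z * sigma / e) ** 2
    return (z**2 * sigma**2 * N) / ((N - 1) * e**2 + z**2 * sigma**2)

n_raw = sample_size_for_mean(1.96, 4.8, 3)
print(round(n_raw, 3), math.ceil(n_raw))

# Hypothetical finite population of 200 items, for comparison only.
print(round(sample_size_for_mean(1.96, 4.8, 3, N=200), 3))
```

Rounding up rather than to the nearest integer is the safer convention, since rounding down would leave the precision requirement slightly unmet.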

Exercise: A cigarette manufacturer wishes to use a random sample to estimate the average

nicotine content. The error should not be more than 1 milligram above or below the true

mean, with 99% confidence coefficient. The population standard deviation is 4 milligrams.

What sample size should the company use in order to satisfy the requirement?

Exercise: Determine the size of the sample for estimating the true mean weight of 5000 cereal containers on the basis of the following information:

The variance of weight is 4 ounces on the basis of past records.

Estimate should be within 0.8 ounces of the true average weight with 99% probability.

Will there be change in the size of sample if we assume infinite population in the given case?

If so, explain by how much?

Sample size when estimating the population proportion:

If we are to find the sample size for estimating a proportion of population, our reasoning

remains similar to what we have said in the context of population mean. It is required to

specify the precision and the confidence level and then estimate the sample size as under:

Note that the standard error of the proportion is given by

$$SE = \sigma_p = \sqrt{\frac{pq}{n}}$$ (in the case of an infinite population)

$$SE = \sigma_p = \sqrt{\frac{pq}{n}} \sqrt{\frac{N - n}{N - 1}}$$ (in the case of a finite population of size $N$)

where $p$ is the sample proportion, $q = 1 - p$, $z$ is the standard variate for the appropriate confidence level and $n$ is the sample size.

Further, the confidence interval for the population proportion is given by

$$p \pm z \cdot SE$$

If $e$ is the precision rate (the acceptable error), then the sample size can be expressed as

$$n = \frac{z^2 pq}{e^2}$$ (in the case of an infinite population)

$$n = \frac{z^2 pq N}{e^2 (N - 1) + z^2 pq}$$ (in the case of a finite population)

Exercise: What should be the sample size if a simple random sample from a population of 4000 items is to be drawn to estimate the percentage of defectives within 2% of the true value with 95.5% probability? What should be the size of the sample if the population is assumed to be infinite in the given case? (From the pilot study, it has been observed that the proportion of defective items is about 2%.)

Solution: In this case $N = 4000$, $z = 2.005$, $p = 0.02$ and $e = 0.02$.

$$n = \frac{z^2 pq N}{e^2 (N - 1) + z^2 pq} = \frac{(2.005)^2 (0.02)(0.98)(4000)}{(0.02)^2 (4000 - 1) + (2.005)^2 (0.02)(0.98)} = 187.78 \cong 188$$

Therefore the sample size is estimated to be 188 for the sample proportion to be within the 2% limit at 95.5% confidence.

If we assume that the population size is infinite, then

$$n = \frac{z^2 pq}{e^2} = \frac{(2.005)^2 (0.02)(0.98)}{(0.02)^2} = 196.98 \cong 197$$
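The proportion case follows the same pattern. The sketch below assumes a helper called `sample_size_for_proportion` (an illustrative name) and plugs in the figures of the defectives exercise: $z = 2.005$, $p = 0.02$, $e = 0.02$ and $N = 4000$.

```python
import math

def sample_size_for_proportion(z, p, e, N=None):
    """Sample size needed to estimate a proportion within +/- e.

    Infinite population: n = z^2 p q / e^2
    Finite population:   n = z^2 p q N / (e^2 (N - 1) + z^2 p q)
    Returns the unrounded value.
    """
    q = 1 - p
    if N is None:
        return z**2 * p * q / e**2
    return z**2 * p * q * N / (e**2 * (N - 1) + z**2 * p * q)

finite = sample_size_for_proportion(2.005, 0.02, 0.02, N=4000)
infinite = sample_size_for_proportion(2.005, 0.02, 0.02)
print(math.ceil(finite), math.ceil(infinite))
```

When no prior estimate of $p$ is available, setting `p=0.5` maximizes $pq$ and so gives the most conservative (largest) sample size.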

Exercise: Suppose a certain hotel management is interested in determining the percentage of

the hotel’s guests who stay for more than 3 days. The reservation manager wants to be 95%

confident that the percentage has been estimated within 3% of the true value. What is the

most conservative sample size needed for the problem?

Exercise: Suppose the following ten values represent random observation from a normal

parent population:

2, 6, 7, 9, 5, 1, 0, 3, 5, 4.

Construct a 99 percent confidence interval for the mean of the parent population.

Exercise: A team of medical research experts feels confident that a new drug they have

developed will cure about 80% of the patients. How large should the sample size be for the

team to be 98% certain that the sample proportion of cure is within plus and minus 2% of the

proportion of all cases that drug will cure?

Exercise: The annual income of 900 salesmen employed by a company is known to be approximately normally distributed. If the company wants to be 95% confident that the true mean of this year's salesmen's income does not differ by more than 2% from last year's mean income of Rs 12,000, what sample size would be required, assuming the population standard deviation to be Rs 1500?

Exercise: In a random sample of 64 items taken from a large consignment, some were found

to be defective. Deduce that percentage of defective items in the consignment almost

certainly lies between 31.25 and 68.75 given that the standard error of the proportion of

defective items in the sample is 1/16.

Exercise: A cigarette manufacturer claims that his cigarettes have an average nicotine content of 18.3 mg. If a random sample of cigarettes of this type has nicotine contents of 20, 17, 21, 19, 22, 21, 20 and 16 mg, would you agree with the manufacturer's claim? Assume a suitable value for the level of significance. (Level of significance = 1 – Level of confidence)

Distribution of Sample Standard Deviation:

If a population is large and normally distributed with standard deviation $\sigma$, the standard deviations of random samples of size $n$ ($n$ large) are approximately normally distributed with standard deviation $\sigma / \sqrt{2n}$.

The standard deviation of the distribution of the standard deviations of samples drawn from a normal population is called the standard error of the standard deviation and is denoted by

$$SE = \frac{\sigma}{\sqrt{2n}}.$$


TESTING OF HYPOTHESES

Hypothesis:

It is an assumption or some supposition to be proved or rejected.

Definition: Hypothesis is a proposition or a set of propositions set forth as an explanation for

the occurrence of some specified group of phenomena either asserted merely as a provisional

conjecture to guide some investigation or accepted as highly probable in the light of

established facts.

Characteristics of a Hypothesis:

(i) Hypothesis should be clear and precise.

(ii) Hypothesis should be capable of being tested

(iii) Hypothesis should state relationship between variables, if it happens to be

relational hypothesis

(iv) Hypothesis should be limited to the scope and must be specific. A researcher must

remember that narrower hypothesis is more generally testable and should develop

such hypothesis.

(v) Hypothesis should be stated as far as possible in most simple terms so that the

same is easily understandable by all concerned.

(vi) Hypothesis should be consistent with most known facts, i.e., it must be consistent

with a substantial body of established facts. In other words, it should be one which

judges accept as being the most likely.

(vii) Hypothesis should be amenable to testing within reasonable time.

(viii) Hypothesis must explain the facts that gave rise to the need for explanation.

Null Hypothesis and Alternative Hypothesis: The null hypothesis is an initial statement concerning a population parameter. It is generally denoted by $H_0$. Any hypothesis which differs from the null hypothesis is called an alternative hypothesis. The alternative hypothesis is denoted by $H_1$.

Type I error: The error of rejecting the null hypothesis when it should have been accepted is known as a type I error.

Type II error: The error of accepting the null hypothesis when it should have been rejected is known as a type II error.

The probability of a type I error is usually determined in advance and is understood as the level of significance of the test. If the type I error is fixed at 5%, it means that there are about 5 chances in 100 that we reject $H_0$ when $H_0$ is true.

With a fixed sample size $n$, however, when we try to reduce the type I error, the probability of committing a type II error increases. Both types of error cannot be reduced simultaneously.

Two-tailed and One-tailed Tests: A two-tailed test rejects the null hypothesis if, say, the sample mean is significantly higher or lower than the hypothesized value of the population mean. Thus in a two-tailed test there are two rejection regions, one on each tail of the normal curve.

$$H_0: \mu = \mu_{H_0} \qquad H_1: \mu \ne \mu_{H_0}$$

A one-tailed test would be used when we are to test, say, whether the population mean is either lower than or higher than some hypothesized value.

$$H_0: \mu = \mu_{H_0} \qquad H_1: \mu > \mu_{H_0}$$

or

$$H_0: \mu = \mu_{H_0} \qquad H_1: \mu < \mu_{H_0}$$

Example: A random sample of 25 tires from a large consignment gave an average life of 38,000 km and a standard deviation of 5,000 km. Could the sample come from a population with a mean tire life of 40,000 km?

Solution:

We state the null hypothesis and the alternative hypothesis as under:

$$H_0: \mu = 40000 \qquad H_1: \mu \ne 40000$$

We make a two-tailed test for the population mean at the level of significance $\alpha = 0.05$. The test criterion is $t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$. Here $n = 25$, the sample mean is 38,000 and the sample standard deviation is $s = 5000$. Therefore

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{38000 - 40000}{5000 / \sqrt{25}} = -2, \qquad |t| = 2$$

From the table, the t-variate value for the 5% significance level (95% confidence level) with 24 degrees of freedom is 2.064.

Since the calculated t-value is less than the table value, we accept the null hypothesis that the mean tire life is 40,000 km at the 5% significance level (95% confidence level).
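The tire example can be checked with a few lines of Python. This is a sketch under stated assumptions: `t_statistic` is an illustrative helper, and the critical value 2.064 is simply the table value quoted in the text, since the standard library offers no t-distribution quantile function.

```python
import math

def t_statistic(sample_mean, mu0, s, n):
    """One-sample t statistic: (sample mean - mu0) / (s / sqrt(n))."""
    return (sample_mean - mu0) / (s / math.sqrt(n))

T_TABLE = 2.064   # two-tailed 5% value for 24 d.f., as quoted in the text

t = t_statistic(38000, 40000, 5000, 25)
reject_h0 = abs(t) > T_TABLE          # two-tailed rejection rule
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```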

Flow Chart for Hypothesis Testing:

(i) State $H_0$ as well as $H_1$.

(ii) Specify the level of significance ($\alpha$).

(iii) Decide the correct sampling distribution.

(iv) Obtain the sample and work out an appropriate value from the sample data.

(v) Calculate the probability that the sample result would diverge as widely as it has from expectations if the null hypothesis were true (find the z-value or t-value for the purpose).

(vi) Compare this probability with the significance level ($\alpha/2$ in the case of a two-tailed test; $\alpha$ in the case of a one-tailed test), i.e., find whether the calculated z or t value is in the rejection region.

(vii) If yes, reject $H_0$; if no, accept $H_0$.

Exercise: A certain stimulus administered to each of 12 patients resulted in the following changes in blood pressure:

5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6

Can it be concluded that the stimulus will, in general, be accompanied by a change in blood pressure?

Solution:

From the given data, we obtain the sample mean and sample variance as

$$\bar{X} = \frac{\sum X}{n} = \frac{31}{12} = 2.6$$

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = 9.538, \qquad s = 3.08$$

We take as the null hypothesis that the stimulus will, in general, not be accompanied by a change in blood pressure. The null hypothesis and the alternative hypothesis can therefore be formulated as under:

$$H_0: \mu = 0 \qquad H_1: \mu \ne 0$$

Assume a 5% level of significance, i.e., a 95% level of confidence. The corresponding t-value with 11 degrees of freedom is 2.201 ($t_{0.025,\,11}$), so the rejection region is $R: |t| > 2.201$.

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{2.6 - 0}{3.08 / \sqrt{12}} = 2.94$$

Since the calculated t-value is greater than the table value, we reject the null hypothesis.
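The summary statistics for the blood-pressure data can be recomputed directly from the twelve observations. The sketch below uses Python's `statistics` module; the table value 2.201 is the one quoted in the text. Carrying unrounded values gives a t-statistic of roughly 2.90 rather than the 2.94 obtained from the rounded mean and standard deviation; the decision is the same either way.

```python
import math
import statistics

changes = [5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6]
n = len(changes)

mean = statistics.mean(changes)       # 31 / 12
s = statistics.stdev(changes)         # sample s.d. (divisor n - 1)
t = (mean - 0) / (s / math.sqrt(n))   # H0: mu = 0

T_TABLE = 2.201   # two-tailed 5% value for 11 d.f., as quoted in the text
print(f"mean = {mean:.3f}, s^2 = {s**2:.3f}, t = {t:.3f}")
print("reject H0:", abs(t) > T_TABLE)
```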

Exercise: A certain stimulus administered to each of 12 patients resulted in the following changes in blood pressure:

5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6

Can it be concluded that the stimulus will, in general, be accompanied by an increase in blood pressure?

Solution:

From the given data, we obtain the sample mean and sample variance as

$$\bar{X} = \frac{\sum X}{n} = \frac{31}{12} = 2.6$$

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = 9.538, \qquad s = 3.08$$

We take as the null hypothesis that the stimulus will, in general, not be accompanied by an increase in blood pressure. The null hypothesis and the alternative hypothesis can therefore be formulated as under:

$$H_0: \mu = 0 \qquad H_1: \mu > 0$$

Assume a 5% level of significance, i.e., a 95% level of confidence. The corresponding t-value (one-tailed test) with 11 degrees of freedom is 1.796 ($t_{0.05,\,11}$), and the rejection region is $R: t > 1.796$.

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{2.6 - 0}{3.08 / \sqrt{12}} = 2.94$$

Since the calculated t-value is greater than the table value, we reject the null hypothesis.

Exercise: A cigarette manufacturer claims that his cigarettes have an average nicotine content of 18.3 mg. If a random sample of cigarettes of this type has nicotine contents of 20, 17, 21, 19, 22, 21, 20 and 16 mg, would you agree with the manufacturer's claim? Assume a suitable value for the level of significance.

Exercise: Raju Restaurant near the railway station has been having average sales of 500 tea cups per day. Because of the development of a bus stand nearby, it expects to increase its sales. During the first 12 days after the start of the bus stand, the daily sales were as under:

550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526

On the basis of this sample information, can one conclude that Raju Restaurant's sales have increased? Use a 5% level of significance.

Solution: Take as the null hypothesis that the average sales are 500 cups and that sales have not increased unless proved otherwise. We can write:

$$H_0: \mu = 500 \qquad H_1: \mu > 500$$

The sample is small and drawn from an infinite population, so we use a one-tailed t-test and compute the t-statistic $t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$. Note further that the population standard deviation is not given, so we compute $s$ and $\bar{X}$.

$X_i$      $(X_i - \bar{X})$      $(X_i - \bar{X})^2$
550        2                      4
570        22                     484
490        -58                    3364
615        67                     4489
505        -43                    1849
580        32                     1024
570        22                     484
460        -88                    7744
600        52                     2704
580        32                     1024
530        -18                    324
526        -22                    484
Total 6576                        23978

$$\bar{X} = \frac{\sum X_i}{n} = \frac{6576}{12} = 548$$

$$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{23978}{11}} = 46.68$$

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{548 - 500}{46.68 / \sqrt{12}} = 3.558$$

Degrees of freedom = n – 1 = 12 – 1 = 11. Therefore the corresponding t-value (one-tailed test) at the 5% significance level with 11 degrees of freedom is 1.796 ($t_{0.05,\,11}$), and the rejection region is $R: t > 1.796$.

Since the calculated t-value is greater than the table value and lies in the rejection region, we reject the null hypothesis that there is no change in sales and conclude that there is an increase in sales.
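The deviation table and the resulting t-statistic for the restaurant data can be rebuilt with a short script. This is only an illustrative sketch; the variable names are assumptions and the critical value 1.796 is the table value quoted in the text. With unrounded intermediate values the statistic comes out at about 3.56, in line with the hand calculation.

```python
import math

sales = [550, 570, 490, 615, 505, 580, 570, 460, 600, 580, 530, 526]
n = len(sales)
mean = sum(sales) / n                        # 6576 / 12

# One row per day: observation, deviation from the mean, squared deviation.
rows = [(x, x - mean, (x - mean) ** 2) for x in sales]
sum_sq = sum(sq for _, _, sq in rows)        # total of the last column

s = math.sqrt(sum_sq / (n - 1))              # sample standard deviation
t = (mean - 500) / (s / math.sqrt(n))        # H0: mu = 500

T_TABLE = 1.796   # one-tailed 5% value for 11 d.f., as quoted in the text
for x, d, sq in rows:
    print(f"{x:5d} {d:6.0f} {sq:7.0f}")
print(f"mean = {mean:.0f}, s = {s:.2f}, t = {t:.3f}, reject H0: {t > T_TABLE}")
```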

Exercise: A sample of 400 male students is found to have a mean height of 67.47 inches. Can it be regarded as a sample from a large population with mean height 67.39 inches and standard deviation 1.30 inches? Test at the 5% level of significance.

Solution: Take as the null hypothesis that the average height is 67.39 inches, so that we can write

$$H_0: \mu = 67.39 \qquad H_1: \mu \ne 67.39$$

Since the sample size is large (400), the population is infinite and the standard deviation of the population is known, we use a two-tailed z-test with the statistic $z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$. Note that at the 5% significance level for a two-tailed test the z-variate is 1.96, and the rejection region is $R: |z| > 1.96$.

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{67.47 - 67.39}{1.30 / \sqrt{400}} = 1.231$$

Therefore the calculated z-variate lies within the acceptance region. We accept the null hypothesis that the mean height of students is 67.39 inches at the 5% significance level.
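The height example follows the general testing procedure of stating the hypotheses, fixing $\alpha$, computing the statistic and comparing against the rejection region. A minimal sketch of that procedure for a z-test is given below. The function name `z_test` and its `tail` argument are assumptions for illustration; the sketch compares a p-value with $\alpha$, which is equivalent to checking whether z falls in the rejection region.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """z-test for a mean with known sigma.

    tail: "two" (H1: mu != mu0), "upper" (H1: mu > mu0)
          or "lower" (H1: mu < mu0).
    Returns (z, p_value, reject_h0).
    """
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    std_normal = NormalDist()
    if tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif tail == "upper":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "lower":
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    return z, p_value, p_value < alpha

z, p_value, reject = z_test(67.47, 67.39, 1.30, 400)
print(f"z = {z:.3f}, p = {p_value:.3f}, reject H0: {reject}")
```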

Exercise: Suppose that we are interested in a population of 20 industrial units of same size,

all of which are experiencing excessive labor turnover problems. The past records show

that the mean of the distribution of turnover is 320 employees, with a standard deviation of

75 employees. A sample of 5 of these industrial units is taken at random which gives a mean

of annual turnover as 300 employees. Is the sample mean consistent with the population

mean? Test at 5% significant level.

Exercise: The mean of a certain production process is known to be 50 with a standard

deviation of 2.5. The production manager may welcome any change in the mean value towards

higher side but would like to safeguard against decreasing values of mean. He takes a sample

of 36 items that gives a mean value of 48.5. What inference should the manager take for the

production process on the basis of sample results? Use 5% level of significance for the

purpose.

Exercise: The mean lifetime of a random sample of 50 similar torch bulbs drawn from a batch of 500 bulbs is 72 hours. The standard deviation of the lifetime of the sample is 10.4 hours. The batch is classed as inferior if the mean lifetime is less than 75 hours. Determine whether, as a result of the sample data, the batch is considered to be inferior at a level of significance of (a) 0.05 and (b) 0.01.

Solution: The population is finite with $N = 500$, and the sample size is $n = 50$. The sample mean is $\bar{X} = 72$ hours and the sample standard deviation is $s = 10.4$ hours. The claimed lifetime of the bulbs (population mean) is a minimum of 75 hours. The objective is to test whether the given batch is of inferior quality (lifetime less than 75 hours). Therefore we take as the null hypothesis that the lifetime of the bulbs is not less than 75 hours, i.e.,

$$H_0: \mu \ge 75 \qquad H_1: \mu < 75$$

We use a one-tailed test for a large sample from a finite population.

$$z = \frac{\bar{X} - \mu}{\dfrac{s}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}}} = \frac{72 - 75}{\dfrac{10.4}{\sqrt{50}} \sqrt{\dfrac{500 - 50}{500 - 1}}} = \frac{-3}{(1.471)(0.95)} = -2.148$$

(a) Test at the 5% significance level: The table value for z is -1.645. Therefore the rejection region is $R: z < -1.645$. The calculated value of z is in the rejection region, and therefore we reject the null hypothesis at the 5% level of significance.

(b) Test at the 1% significance level: The table value for z is -2.33. Therefore the rejection region is $R: z < -2.33$. The calculated value of z is not in the rejection region, and therefore we accept the null hypothesis at the 1% level of significance.
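The torch-bulb calculation, including the finite population correction, can be reproduced as follows. This is a sketch with illustrative names; the critical values are computed from the standard normal quantile function instead of being read from a table, giving about -1.645 and -2.326.

```python
import math
from statistics import NormalDist

N, n = 500, 50                     # batch size and sample size
x_bar, mu0, s = 72, 75, 10.4       # sample mean, hypothesized mean, sample s.d.

# Standard error with the finite population correction.
se = (s / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))
z = (x_bar - mu0) / se

def reject_lower_tail(z_value, alpha):
    """Lower-tailed rule: reject H0 when z falls below the alpha quantile."""
    return z_value < NormalDist().inv_cdf(alpha)

for alpha in (0.05, 0.01):
    z_crit = NormalDist().inv_cdf(alpha)
    print(f"alpha = {alpha}: z = {z:.3f}, critical = {z_crit:.3f}, "
          f"reject H0: {reject_lower_tail(z, alpha)}")
```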

Hypothesis Testing for Difference of Means:

In some decision-making situations, we may have to find whether the parameters of two populations are alike or different. For example, one may like to know whether female workers earn the same as male workers or not. In this situation, we would like to test whether the mean incomes of males and females are the same or not.

In this case the parameter of interest is $\mu_1 - \mu_2$, where $\mu_1$ may be the mean income of the female population and $\mu_2$ may be the mean income of the male population. Suppose $n_1$ and $n_2$ are the sizes of the two samples, and $\sigma_1$ and $\sigma_2$ are the standard deviations of the respective populations. In the absence of the population standard deviations, we use the standard deviations of the samples for the estimation.

The standard error for the difference of means is given by

$$SE = \sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

and the test statistic is given by

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$ (in the case of large samples)

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$ (with $n_1 + n_2 - 2$ degrees of freedom, in the case of small samples)

In case the samples are presumed to be drawn from the same population whose variance ($\sigma^2$) is known, we test the difference in means with the z-statistic and t-statistic as under:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma^2 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$ (in the case of large samples)

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma^2 \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$ (with $n_1 + n_2 - 2$ degrees of freedom, in the case of small samples)

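In case the population variance is not known, we estimate the standard deviation of the population as under:

$$\hat{\sigma} = \sqrt{\frac{n_1 (s_1^2 + D_1^2) + n_2 (s_2^2 + D_2^2)}{n_1 + n_2}}$$

where $D_1 = \bar{X}_1 - \bar{X}_{1,2}$, $D_2 = \bar{X}_2 - \bar{X}_{1,2}$ and

$$\bar{X}_{1,2} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}.$$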

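In case the small samples are presumed to be taken from the same population and the population variance is not known, we use the t-test for the difference of means. The z-statistic and t-statistic are computed as under:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}} \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2}} \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$ (with $n_1 + n_2 - 2$ degrees of freedom)

Alternatively,

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$ (with $n_1 + n_2 - 2$ degrees of freedom)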

Exercise: The mean produce of wheat from a sample of 100 fields is 200 quintal per acre with a standard deviation of 10 quintal. Another sample of 150 fields gives a mean of 220 quintal per acre with a standard deviation of 12 quintal. Can the two samples be considered to have been taken from two populations with the same mean yield? Use a 5% level of significance.

Solution: Taking as the null hypothesis that the means of the two populations do not differ, consider

$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \ne \mu_2$$

It is given that

$$n_1 = 100;\ n_2 = 150;\ \bar{X}_1 = 200;\ \bar{X}_2 = 220;\ s_1 = 10;\ s_2 = 12$$

The sample sizes are large, so we can use a two-tailed test to compare the means at the 5% level of significance. The z-value for the 5% level of significance in a two-tailed test is 1.96. Therefore the rejection region is $R: |z| > 1.96$.

Note that the standard deviations of the populations are not given. From the given data we have

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{200 - 220}{\sqrt{\dfrac{10^2}{100} + \dfrac{12^2}{150}}} = \frac{-20}{1.4} = -14.28$$

Since the calculated z-variate is not in the acceptance region but in the rejection region, we reject the null hypothesis at the 5% level of significance.

Exercise: The mean produce of wheat of a sample of 100 fields 200 quintal per acre with

standard deviation 100 quintal. Another sample of 150 fields gives the same mean of 220

quintal per acre with standard deviation of 12 quintal. Can the two samples be considered to

have been taken from the same population whose standard deviation is 11 quintal? Use 5%

level of significance.

Solution: Assuming that both the samples are from same population, consider the null

hypothesis

211

210

::

µµµµ

≠=

HH

Page 19: STAT_T_3

Where

220 ;200;150 ;100

21

21

==

==

XXnn

The standard deviation of the population is given as 11 quintal, i.e., $\sigma = 11$. Since the null hypothesis is that both the samples are from the same population, we can take $\sigma_1 = \sigma_2 = 11$.

The sample sizes are large, so we can use a two-tailed z-test to compare the means at 5% level of significance. The critical z-value for a two-tailed test at 5% level of significance is 1.96. Therefore the rejection region is $R: |z| > 1.96$.

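Further, the z-statistic is calculated as:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{200 - 220}{\sqrt{\dfrac{11^2}{100} + \dfrac{11^2}{150}}} = \frac{-20}{1.42} = -14.08$$

The calculated z-value falls in the rejection region and therefore we reject the null hypothesis at 5% significance level.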

Exercise: A simple random sampling survey in respect of monthly earnings of semi-skilled workers in two cities gives the following information:

City   Average monthly earning   St. deviation of monthly earning   Size of sample
A      695                       40                                 200
B      710                       60                                 175

Test the hypothesis that there is no difference between the monthly earnings of workers of the two cities.

Exercise: Samples of sales in similar shops in two groups are taken for a new product, with the following results:

Group   Mean sales   Variance   Size of sample
A       57           5.3        5
B       61           4.8        7

Is there any evidence that both the groups are in the same town without any difference in sales pattern? Use 5% level of significance.

Solution: We presume that both the groups are from the same town and have the same sales pattern. In other words, we make the null hypothesis that both the groups are from a single population. Consider the hypotheses

$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \neq \mu_2$$

It is given that (the table gives variances, so the values 5.3 and 4.8 are $s^2$)

$$n_1 = 5,\; n_2 = 7; \qquad \bar{X}_1 = 57,\; \bar{X}_2 = 61; \qquad s_1^2 = 5.3,\; s_2^2 = 4.8$$

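Since the samples are small and the population variances are not known, we use the following t-statistic based on the pooled variance:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}\;\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}} = \frac{57 - 61}{\sqrt{\dfrac{(5-1)(5.3) + (7-1)(4.8)}{5+7-2}}\;\sqrt{\dfrac{1}{5}+\dfrac{1}{7}}} \approx -3.055$$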

At 5% level of significance and with 5 + 7 - 2 = 10 degrees of freedom, the two-tailed table value of t is 2.228, and therefore the rejection region is $R: |t| > 2.228$. The calculated t-value is in the rejection region, so we reject the null hypothesis at 5% level of significance. We may conclude that the sample groups A and B are from different populations with different sales patterns.
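The pooled-variance t computation can be sketched in Python as follows. The function name `pooled_t` is an assumption made for this illustration; the inputs are the sample variances (5.3 and 4.8), not standard deviations.

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """t statistic for two small independent samples, assuming the two
    populations share a common (unknown) variance."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df

# Sales data for groups A and B from the exercise above
t, df = pooled_t(57, 5.3, 5, 61, 4.8, 7)
reject_h0 = abs(t) > 2.228  # table value for 10 df, two-tailed, 5% level
print(round(t, 3), df, reject_h0)
```

The pooled variance works out to 5.0, which makes the hand calculation easy to verify.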

Exercise: Two independent samples of size 9 and 7 respectively had the following values:

Sample 1: 18 20 36 50 49 36 34 49 41

Sample 2: 29 28 26 35 30 44 46

Is the difference between the means of the two samples significant at 5% level of significance?

Exercise: A group of seven-week old chickens reared on a high protein diet weigh 12, 15, 11,

16, 14, 14 and 16 ounces; a second group of five chickens, similarly treated except that they

receive a low protein diet, weigh 8, 10, 14, 10 and 13 ounces. Test at 5% level whether there

is significant evidence that additional protein has increased the weight of chickens.

Hypothesis testing of Proportions & Difference between Proportions:

Recall that the standard error of proportion is given by

$$SE = \sigma_p = \sqrt{\frac{pq}{n}}$$ (in case of infinite population)

$$SE = \sigma_p = \sqrt{\frac{pq}{n}}\,\sqrt{\frac{N-n}{N-1}}$$ (in case of finite population of size N)

where p is the proportion of the items in the population having the attribute, q = 1 - p, and n is the sample size.

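If $\hat{p}$ is the observed proportion, then to test the null hypothesis $H_0: p = p_{H_0}$, we compute the z-statistic

$$z = \frac{\hat{p} - p_{H_0}}{SE}.$$

For a large population, we have

$$z = \frac{\hat{p} - p_{H_0}}{\sqrt{\dfrac{pq}{n}}},$$

where p and q are evaluated under the null hypothesis, i.e., $p = p_{H_0}$ and $q = 1 - p_{H_0}$.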

The standard error in the case of a difference between proportions is

$$SE = \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}},$$

where $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions of samples of sizes $n_1$ and $n_2$ respectively. This formula is more conveniently used whenever the samples are drawn from two heterogeneous populations. But when we assume that the populations are similar as regards the given attribute, we use the following formula to compute SE:

$$SE = \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{p_0 q_0\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}, \qquad \text{where } p_0 = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2},\quad q_0 = 1 - p_0$$

Exercise: A sample survey indicates that out of 3232 births, 1705 were boys and the rest were

girls. Do these figures confirm the hypothesis that the sex ratio is 50:50? Test at 5% level of

significance.

Solution: Define p as the proportion of boys among the births. We take the null hypothesis and alternative hypothesis as under:

$$H_0: p = 0.5 \qquad H_1: p \neq 0.5$$

The observed value of p is $\hat{p} = \dfrac{1705}{3232} = 0.5275$.

The standard error of the proportion is

$$SE = \sigma_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.5)(0.5)}{3232}} = 0.0088$$

and the z-test statistic is

$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} = \frac{0.5275 - 0.5}{0.0088} = 3.125.$$

With reference to the null hypothesis and alternative hypothesis, we apply a two-tailed test, and the rejection region at 5% significance level is $R: |z| > 1.96$. The calculated z-value lies in the rejection region and therefore we reject the null hypothesis at 5% significance level and conclude that the sex ratio among the births is not 50:50.
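As a check on the one-sample proportion test, here is a minimal Python sketch. The helper name `proportion_z` is hypothetical. Because it does not round the intermediate values, it gives about 3.13 rather than the hand-computed 3.125; the conclusion is unchanged.

```python
import math

def proportion_z(successes, n, p0):
    """z statistic for testing H0: p = p0 (large population), with the
    standard error evaluated under the null hypothesis."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_error

# Births data: 1705 boys out of 3232 births, H0: p = 0.5
z = proportion_z(1705, 3232, 0.5)
reject_h0 = abs(z) > 1.96  # two-tailed test at the 5% level
print(round(z, 2), reject_h0)
```

The same function can be reused for one-tailed tests by comparing z with a one-sided critical value instead of using the absolute value.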

Exercise: A certain process produces 10% defective items. A supplier of new raw material claims that the use of his material would reduce the proportion of defectives. A random sample of 400 units made with this new material was taken, out of which 34 were defective. Can the supplier's claim be accepted? Test at 1% level of significance.

Solution: Since the supplier claims that there is a decrease in defective items, we shall consider the following null hypothesis and alternative hypothesis:

$$H_0: p = 0.10 \qquad H_1: p < 0.10$$

From these hypotheses, we shall have a one-tailed test (left) at 1% level of significance. The rejection region at 1% level of significance is $R: z < -2.33$.

The observed sample proportion is $\hat{p} = \dfrac{34}{400} = 0.085$. Further, the z-statistic from the given data is

$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} = \frac{0.085 - 0.1}{\sqrt{\dfrac{(0.1)(0.9)}{400}}} = \frac{-0.015}{0.015} = -1.00$$

Since the computed z-value does not fall in the rejection region, we do not reject the null hypothesis at 1% level of significance. So at 1% level of significance the sample does not give sufficient evidence of a reduction in defective items, and the supplier's claim cannot be accepted.

Exercise: The null hypothesis is that 20% of the passengers go in first class, but management

recognizes the possibility that this percentage could be more or less. A random sample of 400

passengers includes 70 passengers holding first class ticket. Can the null hypothesis be

rejected at 10% level of significance?

Exercise: A drug research experimental unit is testing two drugs newly developed to reduce

BP level. The drugs are administered to two different sets of animals. In group one, 350 of

600 animals tested respond to drug one and in group two, 260 of 500 animals tested respond

to drug two. The research unit wants to test whether there is difference between the efficiency

of the said two drugs at 5% level of significance. How will you deal with this problem?

Solution: Let $p_1$ be the proportion of animals that respond to drug one and $p_2$ be the proportion of animals that respond to drug two. Here we may consider that the samples are from different populations.

Consider the null hypothesis:

$H_0: p_1 = p_2$, i.e., the proportions of response for both the drugs are the same.

And the alternative hypothesis:

$H_1: p_1 \neq p_2$

We shall have a two-tailed test for the samples from different populations at 5% significance level. The rejection region is $R: |z| > 1.96$.

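From the given data, we have

$$\hat{p}_1 = \frac{350}{600} = 0.583, \qquad \hat{p}_2 = \frac{260}{500} = 0.520, \qquad n_1 = 600, \qquad n_2 = 500$$

Further, the z-value for the observed data is given by

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}} = \frac{0.583 - 0.520}{\sqrt{\dfrac{(0.583)(0.417)}{600} + \dfrac{(0.520)(0.480)}{500}}} \approx 2.09$$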

As calculated value is in the rejection region, we reject the null hypothesis at 5% level of

significance.
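The two-proportion test with separate (unpooled) standard errors can be sketched as below. The function name is an assumption made for this example. Since the proportions are not rounded to three decimals, the result (about 2.11) differs slightly from the hand calculation but leads to the same decision.

```python
import math

def two_proportion_z_unpooled(x1, n1, x2, n2):
    """z statistic for p1 - p2 when the samples are treated as coming
    from different (heterogeneous) populations."""
    p1, p2 = x1 / n1, x2 / n2
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / standard_error

# Drug-response data: 350 of 600 respond to drug one, 260 of 500 to drug two
z = two_proportion_z_unpooled(350, 600, 260, 500)
reject_h0 = abs(z) > 1.96  # two-tailed test at the 5% level
print(round(z, 2), reject_h0)
```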

Exercise: A drug research experimental unit is testing two drugs newly developed to reduce

BP level. The drugs are administered to two different sets of animals. In group one, 350 of

600 animals tested respond to drug one and in group two, 260 of 500 animals tested respond

to drug two. The research unit wants to test whether the efficiency of the first drug is more

than the second drug at 5% level of significance. How will you deal with this problem?

Exercise: At a certain date in a large city, 400 out of a random sample of 500 men were found to be smokers. After the tax on tobacco had been heavily increased, another random sample of 600 men in the same city included 400 smokers. Was the observed decrease in the proportion of smokers significant? Test at 5% level of significance.

Solution: We start with the null hypothesis that the proportion of smokers remains unchanged even after the heavy tax on tobacco, i.e., $H_0: p_1 = p_2$, and the alternative hypothesis that the proportion of smokers has decreased after the tax, i.e., $H_1: p_1 > p_2$. So, we shall have a one-tailed test (right). The rejection region at 5% level of significance is $R: z \geq 1.645$.

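From the given data, we have

$$\hat{p}_1 = \frac{400}{500} = 0.8, \qquad \hat{p}_2 = \frac{400}{600} = 0.667$$

On the presumption that the populations are similar, the best estimator for the proportion is given by

$$p_0 = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{500(0.8) + 600(0.667)}{500 + 600} = 0.7273, \qquad q_0 = 1 - 0.7273 = 0.2727$$

Further,

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{p_0 q_0\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.8 - 0.667}{\sqrt{(0.7273)(0.2727)\left(\dfrac{1}{500} + \dfrac{1}{600}\right)}} \approx 4.93$$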

So the calculated value is in the rejection region and therefore we reject the null hypothesis at 5% level of significance. There is a significant decrease in the proportion of smokers after the increase in tax on tobacco.
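When the populations are presumed similar, the pooled estimate is used in the standard error. A minimal Python sketch, with a function name chosen here for illustration, is given below. It works with unrounded proportions, so the statistic comes out near 4.94.

```python
import math

def two_proportion_z_pooled(x1, n1, x2, n2):
    """z statistic for p1 - p2 using the pooled proportion p0,
    appropriate when H0 says both samples share one proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p0 = (x1 + x2) / (n1 + n2)  # same as (n1*p1 + n2*p2) / (n1 + n2)
    standard_error = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

# Smokers before the tax (400 of 500) and after the tax (400 of 600)
z = two_proportion_z_pooled(400, 500, 400, 600)
reject_h0 = z >= 1.645  # right-tailed test at the 5% level
print(round(z, 2), reject_h0)
```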

Exercise: There are 100 students in a university college, and in the whole university, inclusive of this college, the number of students is 2000. In a random sample study, 20 students were found to be smokers in the college, and the proportion of smokers in the university is 0.05. Is there a significant difference between the proportion of smokers in the college and in the university? Test at 5% level.


CHI-SQUARE TEST

Chi-square Distribution: Chi-square distribution is used when we deal with collection of

values that involve sum of squares. Chi-square distribution is defined for positive value of

random variable and the distribution curve is not symmetric. This distribution depends on yet

another parameter, the degree of freedom (n-1), where n is the sample size.

Chi-square, denoted by $\chi^2$, is a statistical measure used in the context of sampling analysis

for comparing a sample variance to a theoretical variance. As a non-parametric test, it can be

used to determine if categorical data shows dependency or two classifications are

independent. It can also be used to make comparisons between theoretical populations and

actual data when categories are used. Thus, the chi-square test is applicable in large number

of problems in the areas such as:

(i) test the goodness of fit

(ii) test the significance of association between two attributes, and

(iii) test the homogeneity or the significance of population variance.

Chi-square Test for testing significance of Population Variance:

We can use the test to judge if a random sample has been drawn from a normal population with mean µ and with a specified variance $\sigma^2$. Given a sample of size n and the sample variance $s^2$, we observe that the quantity

$$\chi^2 = \frac{s^2}{\sigma^2}(n-1)$$

has the chi-square distribution with n-1 degrees of freedom. To test the null hypothesis that the population variance equals the specified value $\sigma^2$, we compare the calculated $\chi^2$ value against the table value at n-1 degrees of freedom and the given level of significance. If the calculated value is higher than the table value, then we reject the null hypothesis; otherwise we accept the null hypothesis.

Exercise: The weights of ten students are as follows:

S.No: 1 2 3 4 5 6 7 8 9 10

Weight(kg): 38 40 45 53 47 43 55 48 52 49

Can we say that the variance of the distribution of weights of all students, from which the above sample of 10 students was drawn, is equal to 20 kg²? Test at 5% level of significance.

Solution:

First we shall find the variance of the given sample data.

S.No    $X_i$ (weight)    $(X_i - \bar{X})$    $(X_i - \bar{X})^2$
1       38                -9                   81
2       40                -7                   49
3       45                -2                   4
4       53                6                    36
5       47                0                    0
6       43                -4                   16
7       55                8                    64
8       48                1                    1
9       52                5                    25
10      49                2                    4
Total   470                                    280

$$\bar{X} = \frac{470}{10} = 47, \qquad s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{280}{10-1} = 31.11$$

Let the null hypothesis be that the population variance is $\sigma^2 = 20$. To test the hypothesis, we shall compute

$$\chi^2 = \frac{s^2}{\sigma^2}(n-1) = \frac{31.11}{20}(10-1) \approx 14.00$$

The table value of $\chi^2$ at 10 - 1 = 9 degrees of freedom and 5% level of significance is 16.92. Since the calculated value is less than the table value, we accept the null hypothesis at 5% level of significance. In other words, we can say that the sample is taken from a population with variance 20 kg².
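The variance test above can be reproduced with a short Python sketch. The function name `variance_chi_square` is hypothetical, and the table value 16.92 is simply copied from the text rather than computed.

```python
def variance_chi_square(data, hypothesized_variance):
    """Return the sample variance s^2 and the statistic
    chi^2 = (n - 1) * s^2 / sigma^2 for a hypothesized variance."""
    n = len(data)
    mean = sum(data) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in data)
    sample_variance = sum_sq_dev / (n - 1)
    chi_square = (n - 1) * sample_variance / hypothesized_variance
    return sample_variance, chi_square

weights = [38, 40, 45, 53, 47, 43, 55, 48, 52, 49]  # kg, from the exercise
s2, chi2 = variance_chi_square(weights, 20)
accept_h0 = chi2 < 16.92  # table value for 9 df at the 5% level
print(round(s2, 2), round(chi2, 2), accept_h0)
```

Note that the statistic reduces to the sum of squared deviations divided by the hypothesized variance, here 280/20.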

Exercise: A sample of 10 is drawn randomly from a certain population. The sum of squared

deviation from the mean of given sample is 50. Test the hypothesis that the variance of the

population is 5 at 5% level of significance.

Chi-Square Test as Non-Parametric Test: This test can be used for (i) Testing goodness of

fit (ii) Testing independence of data

Testing goodness of fit: The chi-square test enables us to see how well an assumed theoretical distribution fits the observed data. When some theoretical distribution is fitted to the given data, we are always interested in knowing how well this distribution fits the observed data.

The fit is considered to be good, in other words the divergence between the observed and expected frequencies is attributable to fluctuations of sampling, if the calculated value of $\chi^2$ is less than the table value at the chosen level of significance. Otherwise, the fit is not considered to be a good one.

Test of independence: The $\chi^2$ test enables us to explain whether or not two attributes are associated. For instance, we may be interested in knowing whether a new medicine is effective in controlling fever or not; in such a case the $\chi^2$ test helps us in deciding the issue.

In such a situation, we proceed with the null hypothesis that the two attributes are independent, i.e., the new medicine is not effective in controlling fever. On this basis we calculate the expected frequencies and then work out the value of $\chi^2$. If the calculated $\chi^2$ value is less than the table value for the given degrees of freedom, we accept the null hypothesis; otherwise, we reject it.

We calculate

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency.

Degree of Freedom:

If there are n frequency classes and there is one independent constraint, then the degree of freedom is given by n-1.

When we have two independent constraints (bivariate case) with r rows and c columns, then the degree of freedom is given by (r-1)(c-1).

For instance, in the following data obtained during the outbreak of smallpox:

Attacked Not attacked Total

Vaccinated 31 469 500

Not vaccinated 185 1315 1500

Total 216 1784 2000

The degree of freedom is (2-1)(2-1) = 1


Exercise: Genetic theory states that children having one parent of blood type A and the other of blood type B will always be one of the three types A, AB, B, and that the proportion of the three types will on an average be 1 : 2 : 1. A report states that out of 300 children having one A parent and one B parent, 30 percent were found to be of type A, 45 percent of type AB and the remainder of type B. Test the hypothesis by the $\chi^2$ test.

Solution: The observed frequencies of type A, AB and B are 90, 135 and 75 respectively (in the proportion 30 : 45 : 25). Theoretically, they should be in the proportion 1 : 2 : 1. Therefore the expected frequencies of type A, AB and B are 75, 150 and 75 respectively. We shall use the chi-square test to verify the goodness of fit of the given theoretical distribution.

Let the null hypothesis be that the given data fits the given distribution. We shall calculate $\chi^2$ as under:

Type   Observed frequency (O)   Expected frequency (E)   (O - E)   (O - E)²   (O - E)²/E
A      90                       75                       15        225        3
AB     135                      150                      -15       225        1.5
B      75                       75                       0         0          0

$$\chi^2 = 3 + 1.5 + 0 = 4.5$$

Degree of freedom = 3 - 1 = 2

The table value of $\chi^2$ for 2 degrees of freedom at 5% level of significance is 5.991.

The calculated $\chi^2$ value is less than the table value. Therefore we accept the null hypothesis that on an average types A, AB and B stand in the proportion 1 : 2 : 1.
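The goodness-of-fit computation is easy to script. The sketch below uses a hypothetical helper `chi_square_gof` and derives the expected frequencies from the 1 : 2 : 1 ratio; the table value 5.991 is taken from the text.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [90, 135, 75]  # children of type A, AB, B
ratio = [1, 2, 1]         # proportions stated by the genetic theory
total = sum(observed)
expected = [total * r / sum(ratio) for r in ratio]

chi2 = chi_square_gof(observed, expected)
df = len(observed) - 1
accept_h0 = chi2 < 5.991  # table value for 2 df at the 5% level
print(expected, chi2, df, accept_h0)
```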

Exercise: A die is rolled 240 times and the observed frequencies are given below.

Face observed        1    2    3    4    5    6
Frequency observed   49   35   32   46   49   29

Using the $\chi^2$ test, verify whether the die is unbiased. Test at 5% level of significance.


Exercise: In a city a survey was carried out of 200 families, each with 5 children. The distribution shown below was produced.

(Boys, Girls)     (5, 0)   (4, 1)   (3, 2)   (2, 3)   (1, 4)   (0, 5)
No. of families   11       35       69       55       25       5

Test the null hypothesis that the observed frequencies are consistent with male and female births being equally probable, assuming a binomial distribution, at a level of significance of 0.05.

Solution: Assume that male and female births are equally probable, that is, $p = q = 0.5$. Note that the probability of having k boys among 5 children in a family is given by

$$\binom{5}{k}(0.5)^k(0.5)^{5-k}$$

The expected frequencies (Prob x 200) are rounded to whole numbers.

(B, G)   Observed frequency (O)   Prob      Expected frequency (E)   (O - E)   (O - E)²   (O - E)²/E
(5, 0)   11                       0.03125   6                        5         25         4.167
(4, 1)   35                       0.15625   31                       4         16         0.516
(3, 2)   69                       0.3125    63                       6         36         0.571
(2, 3)   55                       0.3125    63                       -8        64         1.016
(1, 4)   25                       0.15625   31                       -6        36         1.161
(0, 5)   5                        0.03125   6                        -1        1          0.167

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 7.598$$

Degree of freedom = 6 - 1 = 5

The table value of $\chi^2$ for 5 degrees of freedom at 5% level of significance is 11.1.

Since the calculated $\chi^2$ is less than the table value, we accept the null hypothesis that the observed frequencies are consistent with male and female births being equally probable.
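The binomial expected frequencies can be generated programmatically. In the sketch below (function name assumed for illustration; `math.comb` needs Python 3.8 or later) the expected frequencies are kept unrounded, whereas the table above rounds them to whole numbers. The statistic therefore comes out somewhat smaller than 7.598, but the decision is the same.

```python
import math

def binomial_expected(n_trials, p, total):
    """Expected frequencies total * C(n, k) p^k (1-p)^(n-k) for k = 0..n."""
    return [total * math.comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)
            for k in range(n_trials + 1)]

# Observed families ordered by number of boys k = 0, 1, ..., 5
observed = [5, 25, 55, 69, 35, 11]
expected = binomial_expected(5, 0.5, 200)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
accept_h0 = chi2 < 11.1  # table value for 5 df at the 5% level
print(expected, round(chi2, 3), accept_h0)
```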

Exercise: Two research groups classified some people in income groups on the basis of sampling studies. The results are as follows:

Investigator   Income groups              Total
               Poor    Middle    Rich
A              160     30        10       200
B              140     120       40       300
Total          300     150       50       500

Show that the sampling technique of at least one research group is defective.

Solution: Let us make the hypothesis that the techniques adopted by both the groups are similar and the data are similar.

The expected frequencies (row total x column total / grand total) are

Investigator   Income groups              Total
               Poor    Middle    Rich
A              120     60        20       200
B              180     90        30       300
Total          300     150       50       500

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(160-120)^2}{120} + \frac{(30-60)^2}{60} + \frac{(10-20)^2}{20} + \frac{(140-180)^2}{180} + \frac{(120-90)^2}{90} + \frac{(40-30)^2}{30} \approx 55.56$$

Degree of freedom = (3-1)(2-1) = 2

The table value of $\chi^2$ for 2 degrees of freedom at 5% level of significance is 5.991. Since the calculated value is bigger than the table value, we reject the null hypothesis at 5% level of significance. The technique adopted by at least one of the two groups in data collection is defective.
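For contingency tables, the expected frequencies follow from the row and column totals. The following is a minimal sketch with a hypothetical function name; it returns the expected table, the statistic and the degrees of freedom for any table given as a list of rows.

```python
def contingency_chi_square(table):
    """Chi-square test of independence for a table given as a list of rows.
    Expected cell frequency = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi_square = sum((o - e) ** 2 / e
                     for obs_row, exp_row in zip(table, expected)
                     for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, chi_square, df

# Rows: investigators A and B; columns: poor, middle, rich
income = [[160, 30, 10], [140, 120, 40]]
expected, chi2, df = contingency_chi_square(income)
reject_h0 = chi2 > 5.991  # table value for 2 df at the 5% level
print(expected, round(chi2, 2), df, reject_h0)
```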

Exercise: The following data is obtained during the outbreak of smallpox:

Attacked Not attacked Total

Vaccinated 31 469 500

Not vaccinated 185 1315 1500

Total 216 1784 2000

Test the effectiveness of vaccination in preventing the attack from the smallpox.

Exercise: Consider the following information regarding home condition and children's condition:

Condition of child   Condition of home    Total
                     Clean     Dirty
Clean                70        50         120
Fairly clean         80        20         100
Dirty                35        45         80
Total                185       115        300

State whether the two attributes, viz., condition of home and condition of child, are independent. Use the chi-square test for the purpose.

Conditions for Application of Chi-square test:

The following conditions should be satisfied before the $\chi^2$ test is applied:

(i) Observations recorded and used are collected on a random basis.

(ii) All the items in the sample must be independent.

(iii) No group should contain very few items.

(iv) The overall number of items must also be reasonably large.

(v) The constraints must be linear, i.e., constraints that involve linear equations in the cell frequencies of a contingency table.

ANOVA

Consider a case of three varieties of wheat, each grown on four plots. The production of wheat per acre for each variety on each plot is given below:

Plot of land   Variety of wheat
               A    B    C
1              6    5    5
2              7    5    4
3              3    3    3
4              8    7    4

A researcher may be interested in whether there is a significant difference between the varieties of wheat and/or between the plots.

The ANOVA technique is very useful in making an analysis in the above context.

ANOVA is an important technique in all those situations where we want to compare more than two populations, such as comparing the yield of a crop from several varieties of seeds, the mileage of several automobiles, and so on. In circumstances of this kind, one generally does not want to consider all possible combinations of two populations at a time, because the number of tests required before arriving at a decision would be large.

The basic principle of ANOVA is to test for differences among the means of the populations by examining the amount of variation within each of the samples, relative to the amount of variation between the samples. In terms of variation within a given population, it is assumed that the values of X differ from the mean of this population only because of random effects, i.e., there are influences on X which are unexplainable, whereas in examining differences between populations we assume that the difference between the mean of the jth population and the grand mean is attributable to what is called a 'specific factor', technically described as the treatment effect. Thus while using ANOVA, we assume that each of the samples is drawn from a normal population and that each of these populations has the same variance. We also assume that all the factors other than the one or more being tested are effectively controlled. In other words, we assume the absence of other factors that might affect our conclusions concerning the factor(s) to be studied.

In this case we make two estimates of the population variance, namely, one based on the between-samples variance and the other based on the within-samples variance. The two estimates of population variance are then compared with the F-test, wherein we work out

$$F = \frac{\text{Estimate of population variance based on between-samples variance}}{\text{Estimate of population variance based on within-samples variance}}$$

This value of F is to be compared to the F-limit for the given degrees of freedom. If the calculated F value is more than the F-limit value, we may say that there are significant differences between the sample means.

ANOVA Technique: One-way ANOVA

Under one-way ANOVA, we consider just one factor and then observe that the reason for said factor to be important is that several possible types of samples can occur within that factor. We then determine if there are differences within that factor. The steps are as follows:

(i) Obtain the sample means $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$, where k is the number of samples.

(ii) Work out the mean of the sample means by the formula

$$\bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_k}{k}$$

(iii) Calculate the sum of squares between the samples by the formula

$$SS\ \text{between} = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + \cdots + n_k(\bar{X}_k - \bar{\bar{X}})^2$$

(iv) Compute the mean square between the samples by the formula

$$MS\ \text{between} = \frac{SS\ \text{between}}{k-1}$$

where k-1 represents the degrees of freedom between the samples.

(v) Calculate the sum of squares within the samples by the formula

$$SS\ \text{within} = \sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \cdots + \sum (X_{ki} - \bar{X}_k)^2$$

(vi) Compute the mean square within the samples by

$$MS\ \text{within} = \frac{SS\ \text{within}}{n-k}$$

where n is the total number of items in all the samples and k is the number of samples.

(vii) Compute the sum of squares of deviations for total variance by

$$SS\ \text{for total variance} = \sum (X_{ij} - \bar{\bar{X}})^2 = SS\ \text{between} + SS\ \text{within}$$

The degree of freedom for total variance is n-1 = (k-1)+(n-k).

(viii) Finally, the F-ratio is computed by the formula

$$F\text{-ratio} = \frac{MS\ \text{between}}{MS\ \text{within}}$$

If the calculated F-ratio is greater than the F-value for the given degrees of freedom and the significance level, then we reject the null hypothesis and conclude that the differences between the means are significant.

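ANOVA TABLE (ONE-WAY ANALYSIS)

Source of variation    Sum of Squares (SS)                      Degree of Freedom   Mean Square (MS)     F-Ratio
Between the samples    $SS\ \text{between}$                     (k-1)               SS between / (k-1)   MS between / MS within
Within samples         $SS\ \text{within}$                      (n-k)               SS within / (n-k)
Total                  $\sum (X_{ij} - \bar{\bar{X}})^2$        (n-1)

where

$$SS\ \text{between} = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + \cdots + n_k(\bar{X}_k - \bar{\bar{X}})^2$$

$$SS\ \text{within} = \sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \cdots + \sum (X_{ki} - \bar{X}_k)^2$$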

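Short Cut Method:

Compute $T = \sum X_{ij}$, the total of all n items, and let $T_j$ denote the total of the jth sample, of size $n_j$. Then

Source of variation    Sum of Squares (SS)    Degree of Freedom   Mean Square (MS)     F-Ratio
Between the samples    SS between             (k-1)               SS between / (k-1)   MS between / MS within
Within samples         SS within              (n-k)               SS within / (n-k)
Total                  SS total               (n-1)

where

$$SS\ \text{between} = \sum_j \frac{(T_j)^2}{n_j} - \frac{(T)^2}{n}$$

$$SS\ \text{within} = \left(\sum X_{ij}^2 - \frac{(T)^2}{n}\right) - \left(\sum_j \frac{(T_j)^2}{n_j} - \frac{(T)^2}{n}\right)$$

$$SS\ \text{total} = \sum X_{ij}^2 - \frac{(T)^2}{n}$$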

Exercise: Set up an ANOVA table for the following per acre production for three varieties of wheat, each grown on four plots, and state if the variety difference is significant:

Plot of land   Variety of wheat
               A    B    C
1              6    5    5
2              7    5    4
3              3    3    3
4              8    7    4

Solution:

Plot of land   Variety of wheat
               A    B    C
1              6    5    5
2              7    5    4
3              3    3    3
4              8    7    4
Total          24   20   16

$$n_1 = n_2 = n_3 = 4, \qquad n = 12$$

$$\bar{X}_1 = \frac{24}{4} = 6, \qquad \bar{X}_2 = \frac{20}{4} = 5, \qquad \bar{X}_3 = \frac{16}{4} = 4, \qquad \bar{\bar{X}} = \frac{6 + 5 + 4}{3} = 5$$

$$SS\ \text{between} = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + n_3(\bar{X}_3 - \bar{\bar{X}})^2 = 4(6-5)^2 + 4(5-5)^2 + 4(4-5)^2 = 8$$

$$SS\ \text{within} = \sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2 + \sum (X_{3i} - \bar{X}_3)^2 = 24$$

Source of variation    SS    df           MS            F-Ratio            F-limit (5% level of significance)
Between the samples    8     3-1 = 2      8/2 = 4.00    4.00/2.67 = 1.5    F(2, 9) = 4.26
Within samples         24    12-3 = 9     24/9 = 2.67
Total                  32    12-1 = 11

Since the calculated value is less than the table value, we accept the null hypothesis that the difference between outputs due to variety of seed is not significant.
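The one-way ANOVA steps can be collected into a short Python sketch. The function name and the returned dictionary keys are chosen for this illustration. One assumption to note: the grand mean is computed over all observations, which coincides with the mean of the sample means used in the text when all samples have the same size.

```python
def one_way_anova(samples):
    """One-way ANOVA for a list of samples (each a list of numbers)."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2
                     for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df": (k - 1, n - k), "f_ratio": ms_between / ms_within}

# Per acre production of wheat varieties A, B, C on four plots
wheat = [[6, 7, 3, 8], [5, 5, 3, 7], [5, 4, 3, 4]]
result = one_way_anova(wheat)
significant = result["f_ratio"] > 4.26  # F(2, 9) at the 5% level
print(result, significant)
```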

Exercise: A manager of a firm wishes to test whether the salesmen of his firm (A, B and C) tend to make sales of the same size. During a week there have been 14 sales calls: A made 5, B made 4 and C made 5 calls. Following are the sales data for the week for the 3 salesmen.

A: 500 400 700 800 600

B: 300 700 400 600

C: 500 300 500 400 300

Perform ANOVA and draw your conclusion at 5% level of significance (F(2, 11) = 3.98).

Hint: Use the scaling factor (X-500)/100.

TWO-WAY ANOVA

Two-way ANOVA technique is used when the data are classified on the basis of two factors.

For example:

(i) The agricultural output may be classified on the basis of different varieties of

seeds and also on the basis of different fertilizers used.

(ii) A business firm may have its sales data classified on the basis of different

salesmen and also on the basis of sales in different regions.

(iii) In a factory, the various units of a product produced during a certain period may

be classified on the basis of different varieties of machines and also on the basis of

different grades of labor.

Two-way designs may or may not have repeated measurements of each factor. The ANOVA technique is a little different in the case of repeated measurements, where we also compute the interaction variation.

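Two-way ANOVA when values are not repeated.

Here $T = \sum X_{ij}$ is the total of all n items, $T_j$ is the total of the jth column (containing $n_j$ items), $T_i$ is the total of the ith row (containing $n_i$ items), c is the number of columns and r is the number of rows.

Source of variation                Sum of Squares (SS)    Degree of Freedom   Mean Square (MS)             F-Ratio
Between columns treatment          SS between columns     (c - 1)             SS between columns / (c-1)   MS between columns / MS residual
Between rows treatment             SS between rows        (r - 1)             SS between rows / (r-1)      MS between rows / MS residual
Residual or error                  SS residual            (c-1)(r-1)          SS residual / ((c-1)(r-1))
Total                              SS total               c.r - 1

where

$$SS\ \text{between columns} = \sum_j \frac{(T_j)^2}{n_j} - \frac{(T)^2}{n}, \qquad SS\ \text{between rows} = \sum_i \frac{(T_i)^2}{n_i} - \frac{(T)^2}{n}$$

$$SS\ \text{total} = \sum X_{ij}^2 - \frac{(T)^2}{n}, \qquad SS\ \text{residual} = SS\ \text{total} - (SS\ \text{between columns} + SS\ \text{between rows})$$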

Exercise: Set up an ANOVA table for the following two-way design results:

Per acre production of wheat

Variety of fertilizer   Variety of wheat
                        A    B    C
1                       6    5    5
2                       7    5    4
3                       3    3    3
4                       8    7    4
Total                   24   20   16

Also state whether the variety differences are significant at 5% level of significance.

Solution:

Variety of fertilizer   Variety of wheat    Row total
                        A    B    C
1                       6    5    5         16
2                       7    5    4         16
3                       3    3    3         9
4                       8    7    4         19
Column total            24   20   16        60

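n = 12; T = 60

$$\frac{(T)^2}{n} = \frac{60^2}{12} = 300$$

$$SS\ \text{total} = \sum X_{ij}^2 - \frac{(T)^2}{n} = (6^2 + 5^2 + 5^2 + 7^2 + 5^2 + 4^2 + 3^2 + 3^2 + 3^2 + 8^2 + 7^2 + 4^2) - 300 = 32$$

$$SS\ \text{between columns treatment} = \sum_j \frac{(T_j)^2}{n_j} - \frac{(T)^2}{n} = \frac{24^2}{4} + \frac{20^2}{4} + \frac{16^2}{4} - 300 = 8$$

$$SS\ \text{between rows treatment} = \sum_i \frac{(T_i)^2}{n_i} - \frac{(T)^2}{n} = \frac{16^2}{3} + \frac{16^2}{3} + \frac{9^2}{3} + \frac{19^2}{3} - 300 = 18$$

SS Residual = Total SS - (SS between columns + SS between rows)

= 32 - (8 + 18) = 6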

ANOVA Table:

Source of variation          Sum of Squares (SS)   Degree of Freedom   Mean Square (MS)   F-Ratio   5% level F-limit
Between columns treatment    8                     2                   4                  4         F(2, 6) = 5.14
Between rows treatment       18                    3                   6                  6         F(3, 6) = 4.76
Residual or error            6                     6                   1
Total                        32                    11

The F-ratio due to column treatment (4) is less than the table value (5.14) at 5% level of significance. Therefore the difference among mean yields due to column treatment (varieties of seeds) is not significant.

But the F-ratio in the case of row treatment (6) is larger than the table value (4.76) at 5% level of significance. Therefore, the difference among mean yields due to row treatment (varieties of fertilizer) is significant.
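The short-cut formulas for two-way ANOVA without repeated values translate directly into code. The following is a sketch only; the function name and dictionary keys are assumptions made for this example, and the data are entered as a list of rows (fertilizers) with one column per wheat variety.

```python
def two_way_anova(table):
    """Two-way ANOVA without replication for a table given as a list of rows."""
    r, c = len(table), len(table[0])
    n = r * c
    total = sum(sum(row) for row in table)
    correction = total ** 2 / n  # the term T^2 / n
    ss_total = sum(x ** 2 for row in table for x in row) - correction
    ss_columns = sum(sum(col) ** 2 / r for col in zip(*table)) - correction
    ss_rows = sum(sum(row) ** 2 / c for row in table) - correction
    ss_residual = ss_total - (ss_columns + ss_rows)
    ms_residual = ss_residual / ((c - 1) * (r - 1))
    return {"ss_total": ss_total, "ss_columns": ss_columns,
            "ss_rows": ss_rows, "ss_residual": ss_residual,
            "f_columns": (ss_columns / (c - 1)) / ms_residual,
            "f_rows": (ss_rows / (r - 1)) / ms_residual}

# Rows: fertilizers 1-4; columns: wheat varieties A, B, C
yields = [[6, 5, 5], [7, 5, 4], [3, 3, 3], [8, 7, 4]]
result = two_way_anova(yields)
print(result)
print(result["f_columns"] > 5.14, result["f_rows"] > 4.76)
```

Each column total is divided by the number of rows (the items per column), and each row total by the number of columns, matching the $n_j$ and $n_i$ in the formulas.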

Exercise: The following data give the number of units produced per day by 5 workers using 4 different machines.

      M1   M2   M3   M4
A:    45   42   48   38
B:    40   32   50   34
C:    43   36   44   40
D:    36   38   46   36
E:    41   37   47   37

Test whether the production is equal with respect to machines and with respect to workers.

Answer:

Source of variation          Sum of Squares (SS)   Degree of Freedom   Mean Square (MS)   F-Ratio   5% level F-limit
Between columns treatment    335                   3                   111.67             14.97     F(3, 12) = 3.49
Between rows treatment       48.5                  4                   12.125             1.626     F(4, 12) = 3.26
Residual or error            89.5                  12                  7.458
Total                        473                   19

Limitations of Test of Hypothesis:

(i) The tests should not be used in a mechanical fashion. It should be kept in view that testing is not decision-making itself; the tests are only useful aids for decision-making. Hence proper interpretation of statistical evidence is important to intelligent decisions.

(ii) Tests do not explain the reasons why a difference exists. They simply indicate whether the difference is due to fluctuations of sampling or to other reasons, but they do not tell us which other reason(s) cause the difference.

(iii) Results of significance tests are based on probabilities and as such cannot be expressed with full certainty. When a test shows that a difference is statistically significant, it simply suggests that the difference is probably not due to chance.

(iv) Statistical inferences based on significance tests cannot be said to be entirely correct evidence concerning the truth of a hypothesis. This is especially so in the case of small samples, where the probability of drawing erroneous inferences is generally higher. For greater reliability, the size of the samples should be sufficiently enlarged.