Optimum stratification with ratio and regression methods of estimation

RAVINDRA SINGH AND B. V. SUKHATME

(Received July 12, 1971; revised Nov. 24, 1971)

Summary

The paper considers the problem of optimum stratification on an auxiliary variable x when the information on the auxiliary variable x is also used to estimate the population mean $\bar{Y}$ using ratio or regression methods of estimation. Assuming the form of the regression of the estimation variable y on the auxiliary variable x, as also the form of the conditional variance function $V(y \mid x)$, the problem of determining optimum strata boundaries (OSB) is shown to be a particular case of optimum stratification on the auxiliary variable for the stratified simple random sampling estimate. A numerical investigation has also been made to study the amount of gain in efficiency that can be brought about by stratifying the population.

1. Introduction

In the case of simple random sampling, it is well known that if we have information on an auxiliary variable x highly correlated with the estimation variable y, the population mean can be estimated with much greater precision by using the ratio or regression estimate in place of the simple mean estimate. The question naturally arises whether the information on the auxiliary variable x can also be used for further increasing the precision of the estimate by adopting techniques like stratification. In this paper we propose to consider this question.

For theoretical development, let us assume that the population under study is infinite and is to be divided into L strata. A stratified simple random sample of size n is drawn from it, the sample drawn from the hth (h = 1, 2, ..., L) stratum being of size $n_h$ so that $\sum_{h=1}^{L} n_h = n$. Sukhatme and Sukhatme [1] have shown (Section 4.11) that if the strata ratios $R_h = \bar{Y}_h / \bar{X}_h$, where $\bar{Y}_h$ and $\bar{X}_h$ are the population means for y and x respectively in the hth stratum, do not differ much from stratum to stratum, the combined and separate ratio estimates of the population mean $\bar{Y}$ are nearly equally efficient. The combined estimate has an additional advantage of smaller bias over the separate estimate, especially when the sample sizes within the strata are small.

Similar results are also true in the case of regression estimates (see Cochran [2], p. 204).

Because of these considerations, we shall consider in this paper the combined ratio and regression estimates.

Combined ratio estimate. The combined ratio estimate of the population mean $\bar{Y}$ is given by

(1.1)  $\bar{y}_{Rc} = \left(\sum_{h=1}^{L} W_h \bar{y}_h\right) \bar{X} \Big/ \left(\sum_{h=1}^{L} W_h \bar{x}_h\right)$

where for the hth stratum

$W_h$ = proportion of units in the hth stratum,

$\bar{y}_h$ = sample mean for the variable y,

$\bar{x}_h$ = sample mean for the variable x, and

$\bar{X}$ = population mean for the variable x.

We shall assume the knowledge of $\bar{X}$. The large sample variance of the estimate $\bar{y}_{Rc}$ is, to the first order of approximation, given by

(1.2)  $V(\bar{y}_{Rc}) = \sum_{h=1}^{L} W_h^2 \left(\sigma_{hy}^2 - 2R\sigma_{hxy} + R^2\sigma_{hx}^2\right) / n_h$

where in the hth stratum

$\sigma_{hy}^2$ = variance of y,

$\sigma_{hxy}$ = covariance between x and y,

$\sigma_{hx}^2$ = variance of x,

and

$R = \bar{Y}/\bar{X}$.
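As an illustration of (1.1) and (1.2), the following Python sketch computes the combined ratio estimate from per-stratum samples and evaluates the first-order variance from given stratum moments. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean

def combined_ratio_estimate(weights, y_samples, x_samples, x_pop_mean):
    """Combined ratio estimate (1.1): (sum_h W_h ybar_h) * Xbar / (sum_h W_h xbar_h)."""
    y_st = sum(w * mean(y) for w, y in zip(weights, y_samples))
    x_st = sum(w * mean(x) for w, x in zip(weights, x_samples))
    return y_st * x_pop_mean / x_st

def ratio_variance(weights, n_h, var_y, cov_xy, var_x, ratio):
    """First-order variance (1.2), given the stratum variances and covariances."""
    return sum(
        w ** 2 * (vy - 2.0 * ratio * cxy + ratio ** 2 * vx) / n
        for w, n, vy, cxy, vx in zip(weights, n_h, var_y, cov_xy, var_x)
    )

# Hypothetical two-stratum example in which y = 2x holds exactly.
weights = [0.6, 0.4]
x_samples = [[1.0, 2.0, 3.0], [4.0, 6.0]]
y_samples = [[2.0, 4.0, 6.0], [8.0, 12.0]]
estimate = combined_ratio_estimate(weights, y_samples, x_samples, x_pop_mean=3.5)

# With y = 2x in every stratum, var_y = 4 var_x and cov_xy = 2 var_x,
# so each bracket in (1.2) vanishes when R = 2.
variance = ratio_variance(weights, [3, 2], [4.0, 8.0], [2.0, 4.0], [1.0, 2.0], ratio=2.0)
```

In this toy case the estimate equals 2 times the known population mean of x, and the variance expression is zero, which is the limiting situation in which the ratio estimate loses nothing.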

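Combined regression estimate. The combined regression estimate of the population mean $\bar{Y}$ is given by

(1.3)  $\bar{y}_{lc} = \sum_{h=1}^{L} W_h \bar{y}_h + b\left(\bar{X} - \sum_{h=1}^{L} W_h \bar{x}_h\right)$

where for the hth stratum

$s_{hxy}$ = sample covariance between x and y,

$s_{hx}^2$ = sample mean square for the variable x,

and

$b = \left[\sum_{h=1}^{L} W_h^2 s_{hxy} / n_h\right] \Big/ \left[\sum_{h=1}^{L} W_h^2 s_{hx}^2 / n_h\right]$.

The large sample variance of the estimate $\bar{y}_{lc}$ is, to the first order of approximation, given by

(1.4)  $V(\bar{y}_{lc}) = \sum_{h=1}^{L} W_h^2 \left(\sigma_{hy}^2 - 2\beta\sigma_{hxy} + \beta^2\sigma_{hx}^2\right) / n_h$

where

$\beta = \left[\sum_{h=1}^{L} W_h^2 \sigma_{hxy} / n_h\right] \Big/ \left[\sum_{h=1}^{L} W_h^2 \sigma_{hx}^2 / n_h\right]$.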

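If the cost of observing any unit in the population is assumed to be the same, the variances in (1.2) and (1.4) are minimised by adopting the Neyman method of allocating the sample to different strata. Under this allocation the variances in (1.2) and (1.4) reduce to

(1.5)  $V(\bar{y}_{Rc})_N = \left[\sum_{h=1}^{L} W_h \sqrt{\sigma_{hy}^2 - 2R\sigma_{hxy} + R^2\sigma_{hx}^2}\right]^2 \Big/ n$

and

(1.6)  $V(\bar{y}_{lc})_N = \left[\sum_{h=1}^{L} W_h \sqrt{\sigma_{hy}^2 - 2\beta\sigma_{hxy} + \beta^2\sigma_{hx}^2}\right]^2 \Big/ n$.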

We observe that the two variances are the same except that they differ in the constants R and $\beta$. For further discussion, we shall therefore consider only the combined regression estimate with variance given by (1.6). The variance expression in (1.6) is clearly a function of the strata boundaries on the variable x. This variance can, therefore, be further reduced by using the optimum strata boundaries, which correspond to the minimum of $V(\bar{y}_{lc})_N$ in (1.6) with respect to these boundaries.

The problem of determining optimum strata boundaries was first considered by Dalenius [3] and Hayashi, Maruyama and Isida [4]. Because of the formidable difficulties involved in finding the optimum strata boundaries exactly, several attempts were made to obtain approximate solutions to this problem. For references see Cochran [2]. Subsequently the authors [5], [6] have also considered the problem of finding optimum strata boundaries on the auxiliary variable for stratified simple random sampling and stratified varying probability sampling respectively. We now consider the problem of determining these boundaries for combined ratio and regression estimates in stratified sampling.
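The effect of Neyman allocation on a variance of the form (1.2) or (1.4) can be checked numerically. The sketch below is only illustrative: `s_h` stands for the square root of the bracketed term in the hth stratum, and the function names and numbers are hypothetical.

```python
def neyman_allocation(weights, s_h, n):
    """Allocate n in proportion to W_h * S_h (real-valued sizes, no rounding)."""
    products = [w * s for w, s in zip(weights, s_h)]
    total = sum(products)
    return [n * p / total for p in products]

def variance_for_allocation(weights, s_h, n_h):
    """Variance of the form sum_h W_h^2 S_h^2 / n_h, as in (1.2) and (1.4)."""
    return sum((w * s) ** 2 / m for w, s, m in zip(weights, s_h, n_h))

def neyman_variance(weights, s_h, n):
    """Closed form under Neyman allocation, as in (1.5) and (1.6)."""
    return sum(w * s for w, s in zip(weights, s_h)) ** 2 / n

# Hypothetical two-stratum population.
weights = [0.5, 0.5]
s_h = [1.0, 3.0]
n = 100

alloc = neyman_allocation(weights, s_h, n)                 # sizes 25 and 75
v_neyman = variance_for_allocation(weights, s_h, alloc)    # 0.04
v_closed = neyman_variance(weights, s_h, n)                # same value, 0.04
v_equal = variance_for_allocation(weights, s_h, [50, 50])  # 0.05, larger
```

The allocated variance agrees with the closed form, and an equal split of the same total sample gives a larger variance.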


2. Optimum strata boundaries

Let us assume that the joint density function of the variables (x, y) in the population is continuous and the relationship between y and x is of the form

(2.1)  $y = c(x) + e$

where c(x) is some function of x and e is the error term such that $E(e \mid x) = 0$ and $V(e \mid x) = \phi(x) > 0$ for all x in the range (a, b) of x with $(b - a) < \infty$. Under this model

(2.2)  $\sigma_{hy}^2 = \sigma_{hc}^2 + \mu_{h\phi}$  and  $\sigma_{hxy} = \sigma_{hxc}$

where, in the hth stratum, $\sigma_{hc}^2$, $\sigma_{hxc}$ and $\mu_{h\phi}$ are respectively the variance of c(x), the covariance between x and c(x), and the expected value of the function $\phi(x)$. From (1.6) and (2.2) we have the variance of the combined regression estimate as

(2.3)  $V(\bar{y}_{lc})_N = \left[\sum_{h=1}^{L} W_h \sqrt{\sigma_{h\psi}^2 + \mu_{h\phi}}\right]^2 \Big/ n$

where $\psi(x) = c(x) - \beta x$ and $\sigma_{h\psi}^2$ is the variance of $\psi(x)$ in the hth stratum. This variance is the same as the one obtained by the authors in relation (2.3) of [5] with c(x) replaced by $\psi(x)$. The minimal equations giving optimum strata boundaries and the methods of finding their approximate solutions in this case are, therefore, the same as those given in [5].
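The step from (1.6) and (2.2) to (2.3) can be spelled out as follows. This is a sketch in reconstructed notation: $\psi(x) = c(x) - \beta x$, and the symbols $\sigma_{h\psi}^2$ and $\operatorname{Var}_h$ (variance within the hth stratum) are assumed here for illustration.

```latex
\begin{align*}
\sigma_{hy}^2 - 2\beta\sigma_{hxy} + \beta^2\sigma_{hx}^2
  &= \sigma_{hc}^2 + \mu_{h\phi} - 2\beta\sigma_{hxc} + \beta^2\sigma_{hx}^2 \\
  &= \operatorname{Var}_h\{c(x) - \beta x\} + \mu_{h\phi} \\
  &= \sigma_{h\psi}^2 + \mu_{h\phi},
\end{align*}
so that
\[
V(\bar{y}_{lc})_N
  = \frac{1}{n}\Bigl[\sum_{h=1}^{L} W_h \sqrt{\sigma_{h\psi}^2 + \mu_{h\phi}}\Bigr]^2 .
\]
```

The second line uses the identity $\operatorname{Var}(u - \beta x) = \operatorname{Var}(u) - 2\beta\operatorname{Cov}(x, u) + \beta^2\operatorname{Var}(x)$ with $u = c(x)$.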

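The regression estimate is usually employed when the regression of y on x is linear and is given by

(2.4)  $E(y \mid x) = \alpha + \beta x$.

Therefore, in such cases $c(x) = \alpha + \beta x$ and $\psi(x) = c(x) - \beta x = \alpha$, so that $\sigma_{h\psi}^2 = 0$ and the variance $V(\bar{y}_{lc})_N$ reduces to

(2.5)  $V(\bar{y}_{lc})_N = \left[\sum_{h=1}^{L} W_h \sqrt{\mu_{h\phi}}\right]^2 \Big/ n$

and the minimal equations giving OSB become

(2.6)  $\dfrac{\phi(x_h) + \mu_{h\phi}}{\sqrt{\mu_{h\phi}}} = \dfrac{\phi(x_h) + \mu_{i\phi}}{\sqrt{\mu_{i\phi}}}$,  i = h + 1, h = 1, 2, ..., L - 1.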

Now let us define

(2.7)  $g_3(x) = [\phi'(x)]^2 / [\phi(x)]^{3/2}$.

Then if the function $p_3(x) = g_3(x) f(x)$, where f(x) is the density function of x, is bounded and possesses first two derivatives continuous for all x in (a, b), the various methods of finding approximate solutions to (2.6) can be obtained from those given in [5]. One of these methods is the cum $\sqrt[3]{p_3(x)}$ rule. According to this rule, approximately optimum strata boundaries (AOSB) are the solutions of the equations

(2.8)  $\int_{x_{h-1}}^{x_h} \sqrt[3]{p_3(t)}\,dt = \frac{1}{L} \int_{a}^{b} \sqrt[3]{p_3(t)}\,dt$,  h = 1, 2, ..., L.

We shall make use of this rule in finding AOSB in the numerical investigation given in the next section.
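A discretised version of the rule (2.8) can be sketched in Python as follows. It assumes a conditional variance of the power form a x^g, for which (2.7) gives a function proportional to x^(g/2 - 2), and it uses equal-width classes with linear interpolation inside a class. The function names, the interpolation step and the default of 20 classes are choices made for this sketch.

```python
def cum_cube_root_boundaries(g3, class_prob, a, b, n_strata, n_classes=20):
    """Approximate strata boundaries by the cum cube-root rule on equal-width classes."""
    width = (b - a) / n_classes
    edges = [a + i * width for i in range(n_classes + 1)]
    cum = [0.0]
    for i in range(n_classes):
        mid = 0.5 * (edges[i] + edges[i + 1])
        # cube root of g3 at the class mid-point times the class probability
        root = (g3(mid) * class_prob(edges[i], edges[i + 1])) ** (1.0 / 3.0)
        cum.append(cum[-1] + root)
    total = cum[-1]
    boundaries = []
    for h in range(1, n_strata):
        target = h * total / n_strata
        i = 0
        while cum[i + 1] < target:
            i += 1
        # linear interpolation inside class i
        frac = (target - cum[i]) / (cum[i + 1] - cum[i])
        boundaries.append(edges[i] + frac * width)
    return boundaries

def g3_power(g, a=1.0):
    """g3(x) from (2.7) for the assumed form phi(x) = a * x**g, with g > 0."""
    return lambda x: (a * g * x ** (g - 1)) ** 2 / (a * x ** g) ** 1.5

def rect_prob(lo, hi):
    """Class probability for the rectangular density f(x) = 1 on [1, 2]."""
    return hi - lo

two_strata = cum_cube_root_boundaries(g3_power(1), rect_prob, 1.0, 2.0, 2)
three_strata = cum_cube_root_boundaries(g3_power(2), rect_prob, 1.0, 2.0, 3)
```

For the rectangular density the rule can also be solved in closed form, and both routes give a two-strata boundary of roughly 1.457 for g = 1.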

It may also be remarked here that all the results obtained in Section 4 of [6] also hold in this case after the corresponding function in [6] is replaced by $g_3(x)$.

All the above discussion also holds for the combined ratio estimate after taking $\alpha = 0$ and replacing $\beta > 0$ by the population ratio R.

For the purpose of illustrating the usefulness of stratification for ratio and regression estimates we give below a numerical investigation.

3. Numerical illustration

For this illustration we consider the following four density functions of x.

i) Rectangular: $f(x) = 1$, $1 \le x \le 2$

ii) Right triangular: $f(x) = 2(2 - x)$, $1 \le x \le 2$

iii) Exponential: $f(x) = e^{-x+1}$, $1 \le x < \infty$

iv) Normal: $f(x) = \frac{2}{\sqrt{2\pi}}\, e^{-(x-1)^2/2}$, $1 \le x < \infty$.

These densities to a considerable extent represent those usually encountered in practice. For the conditional variance function $\phi(x)$ we have taken the form $\phi(x) = a x^g$ where a > 0 and g are constants. In practice, g is usually found to take values from 0 to 2. The regression of y on x in the population is assumed to be linear, which is exactly the situation where the use of the linear regression estimate is recommended. Thus $c(x) = \alpha + \beta x$. Under this model it will be seen from the relation (2.5) that for g = 0 the variance $V(\bar{y}_{lc})_N$ does not change with the number of strata L. The gain in efficiency from stratification is, therefore, zero in this case for any value of L. For this reason only two values of g, i.e., g = 1 and 2, have been considered. The constant a has been determined in such a way that 75% of the total variation is explained by regression. The correlation coefficient in this case will, therefore, be about 0.87. To have a finite range the exponential and normal distributions were truncated at x = 6 and x = 5 respectively. Thus the probabilities for x to take values beyond the truncation points were extremely small. For finding the AOSB the ranges of all the four distributions were divided into 20 classes of equal width. The AOSB were obtained by using the rule (2.8). The function $g_3(x)$ was evaluated at the mid-points of the class intervals and then multiplied by $W_i$, the proportion of the population in the ith class. The cube root of this product was then found for each of the 20 classes. The cube roots were then cumulated and the AOSB were obtained by taking equal intervals on the cumulative totals.

In the following table we give the AOSB, $n V(\bar{y}_{lc})_N$ and the relative efficiency of stratification with respect to no stratification. The variance corresponding to L = 1 is the variance of the usual regression estimate with no stratification.

Table 1  Percentage relative efficiency of stratification

i) Rectangular, g = 1

L    AOSB                                 $n V(\bar{y}_{lc})_N$    % Rel. eff.
1    -                                    0.027774                 100.00
2    1.457                                0.027586                 100.68
3    1.296, 1.628                         0.027549                 100.82
4    1.218, 1.457, 1.718                  0.027534                 100.87
5    1.173, 1.359, 1.559, 1.772           0.027527                 100.90
6    1.143, 1.296, 1.457, 1.628, 1.809    0.027523                 100.91

i) Rectangular, g = 2

L    AOSB                                 $n V(\bar{y}_{lc})_N$    % Rel. eff.
1    -                                    0.027767                 100.00
2    1.472                                0.027028                 102.73
3    1.308, 1.642                         0.026888                 103.27
4    1.228, 1.472, 1.729                  0.026838                 103.46
5    1.181, 1.372, 1.573, 1.782           0.026815                 103.55
6    1.150, 1.308, 1.472, 1.642, 1.818    0.026804                 103.59

ii) Right triangular, g = 1

L    AOSB                                 $n V(\bar{y}_{lc})_N$    % Rel. eff.
1    -                                    0.018520                 100.00
2    1.370                                0.018416                 100.57
3    1.234, 1.526                         0.018398                 100.66
4    1.171, 1.370, 1.614                  0.018390                 100.71
5    1.135, 1.286, 1.461, 1.671           0.018387                 100.73
6    1.111, 1.234, 1.370, 1.526, 1.712    0.018385                 100.73

ii) Right triangular, g = 2

L    AOSB                                 $n V(\bar{y}_{lc})_N$    % Rel. eff.
1    -                                    0.018517                 100.00
2    1.382                                0.018109                 102.25
3    1.243, 1.538                         0.018026                 102.72
4    1.178, 1.382, 1.625                  0.017996                 102.90
5    1.141, 1.297, 1.473, 1.682           0.017982                 102.98
6    1.117, 1.243, 1.382, 1.538, 1.722    0.017974                 103.02

iii) Exponential, g = 1

L    AOSB                                 $n V(\bar{y}_{lc})_N$    % Rel. eff.
1    -                                    0.275950                 100.00
2    2.167                                0.262274                 105.22
3    1.675, 2.850                         0.260397                 105.98
4    1.474, 2.167, 3.308                  0.259671                 106.27
5    1.369, 1.855, 2.545, 3.645           0.259369                 106.40
6    1.299, 1.675, 2.167, 2.850, 3.901    0.259146                 106.49

iii) Exponential, g = 2

L    AOSB                                 $n V(\bar{y}_{lc})_N$    % Rel. eff.
1    -                                    0.272713                 100.00
2    2.289                                0.237825                 114.67
3    1.753, 3.006                         0.230649                 118.24
4    1.534, 2.289, 3.477                  0.228018                 119.60
5    1.416, 1.952, 2.692, 3.815           0.226789                 120.25
6    1.340, 1.753, 2.289, 3.006, 4.059    0.226106                 120.61

iv) Normal, g = 1

L    AOSB                                 $n V(\bar{y}_{lc})_N$    % Rel. eff.
1    -                                    0.120813                 100.00
2    1.931                                0.118663                 101.81
3    1.571, 2.386                         0.118152                 102.25
4    1.414, 1.931, 2.675                  0.117958                 102.42
5    1.325, 1.705, 2.187, 2.885           0.117863                 102.50
6    1.265, 1.571, 1.931, 2.386, 3.044    0.117812                 102.55

iv) Normal, g = 2

L    AOSB                                 $n V(\bar{y}_{lc})_N$    % Rel. eff.
1    -                                    0.120818                 100.00
2    1.996                                0.112371                 107.52
3    1.622, 2.465                         0.110425                 109.41
4    1.452, 1.996, 2.757                  0.109671                 110.16
5    1.357, 1.762, 2.262, 2.970           0.109308                 110.53
6    1.294, 1.622, 1.996, 2.465, 3.135    0.109103                 110.74

From the table it is seen that the relative gain in efficiency is only trivial for the rectangular and the right triangular distributions. But in the case of the normal and exponential distributions, the gain is about 7 and 15% respectively for L = 2 and g = 2. In the case of the exponential distribution, the gain increases to about 20% for L = 4. It can also be seen that the rate of increase in efficiency with the increase in the number of strata is rather slow in comparison to stratified simple random sampling. We further observe that, unlike the stratified PPSWR estimate (Ref. [6]) where the gain in efficiency decreased with the increase in the value of g, here the efficiency of stratification increases with g. The gain in efficiency of stratification is zero for g = 0 and is maximum for g = 2. It is also observed from (2.5) that the AOSB and the efficiency depend only on the value of g and not on the value of the correlation coefficient, i.e., for a given value of g, if the value of the constant a is changed to obtain different correlation coefficient values, it is only the absolute variance $V(\bar{y}_{lc})_N$ that is affected and not the AOSB or the efficiency of stratification.
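For the rectangular density the relative efficiency implied by (2.5) can be computed in closed form, which gives a quick check on a table entry. The sketch below assumes the conditional variance a x^g (the constant a cancels in the ratio) and takes the boundaries as given; the function name is hypothetical.

```python
import math

def rel_efficiency_rectangular(boundaries, g, lo=1.0, hi=2.0):
    """100 * V(no strata) / V(strata) under (2.5) for uniform x on [lo, hi]."""
    def sum_w_sqrt_mu(points):
        total = 0.0
        for left, right in zip(points[:-1], points[1:]):
            w = (right - left) / (hi - lo)  # stratum weight W_h
            # W_h * mu_h: integral of x**g times the density over the stratum
            w_mu = (right ** (g + 1) - left ** (g + 1)) / ((g + 1) * (hi - lo))
            total += math.sqrt(w * w_mu)    # equals W_h * sqrt(mu_h)
        return total

    unstratified = sum_w_sqrt_mu([lo, hi]) ** 2
    stratified = sum_w_sqrt_mu([lo, *boundaries, hi]) ** 2
    return 100.0 * unstratified / stratified

eff_g2 = rel_efficiency_rectangular([1.472], g=2)  # about 102.73
eff_g0 = rel_efficiency_rectangular([1.5], g=0)    # 100: no gain when g = 0
```

With the boundary 1.472 and g = 2 this reproduces the value 102.73 listed in Table 1 for the rectangular density with L = 2, and with g = 0 it returns 100 for any boundary, in line with the remark that stratification brings no gain in that case.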

PANJAB AGRICULTURAL UNIVERSITY
IOWA STATE UNIVERSITY

REFERENCES

[1] Sukhatme, P. V. and Sukhatme, B. V. (1970). Sampling Theory of Surveys with Applications, Iowa State Press, Ames, Iowa.

[2] Cochran, W. G. (1963). Sampling Techniques, John Wiley and Sons, New York.

[3] Dalenius, T. (1950). The problems of optimum stratification, Skand. Akt., 33, 203-213.

[4] Hayashi, C., Maruyama, F. and Isida, M. D. (1951). On some criteria for stratification, Ann. Inst. Statist. Math., 2, 77-86.

[5] Singh, Ravindra and Sukhatme, B. V. (1969). Optimum stratification, Ann. Inst. Statist. Math., 21, 515-528.

[6] Singh, Ravindra and Sukhatme, B. V. (1971). Optimum stratification in sampling with varying probabilities, Ann. Inst. Statist. Math., 24, 485-494.
