
OPTIMUM STRATIFICATION

RAVINDRA SINGH AND B. V. SUKHATME

(Received March 20, 1969; revised June 8, 1969)

Summary

This paper considers the problem of optimum stratification on a concomitant variable $x$ when the form of the regression of the estimation variable $y$ on the concomitant variable $x$, as well as the form of the variance function $V(y|x)$, is known. Minimal equations giving optimum strata boundaries are obtained for Neyman and proportional allocations. Since the minimal equations cannot be solved easily, various methods of finding approximate solutions are given. The approximate solutions are compared with the exact solutions for certain density functions.

1. Introduction

Let the population under consideration be divided into $L$ strata and a stratified simple random sample of size $n$ be drawn from it, the sample size in the $h$th stratum being $n_h$ so that $\sum_{h=1}^{L} n_h = n$. If $y$ is the variate under study, an unbiased estimate of the population mean is given by

$$\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h$$

where $W_h$ is the proportion of units in the $h$th stratum and $\bar{y}_h$ is the sample mean based on $n_h$ units drawn from the $h$th stratum.

On ignoring finite correction factors, the variance of the estimate $\bar{y}_{st}$ under Neyman allocation (minimizing the variance for fixed $n$) is given by

$$V(\bar{y}_{st})_N = \frac{1}{n}\Big(\sum_{h=1}^{L} W_h \sigma_{hy}\Big)^2$$

while that under proportional allocation is given by

$$V(\bar{y}_{st})_P = \frac{1}{n}\sum_{h=1}^{L} W_h \sigma_{hy}^2 ,$$

where $\sigma_{hy}^2$ is the variance of $y$ in the $h$th stratum.
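As an illustration, the two variance expressions of this section can be evaluated directly once the stratum weights and standard deviations are known. The following is a minimal sketch; the function names and the two-stratum numbers are hypothetical and serve only to show the arithmetic.

```python
def neyman_variance(weights, sigmas, n):
    """(1/n) * (sum_h W_h * sigma_h)^2, finite correction factors ignored."""
    return sum(w * s for w, s in zip(weights, sigmas)) ** 2 / n


def proportional_variance(weights, sigmas, n):
    """(1/n) * sum_h W_h * sigma_h^2, finite correction factors ignored."""
    return sum(w * s ** 2 for w, s in zip(weights, sigmas)) / n


def neyman_sample_sizes(weights, sigmas, n):
    """Neyman allocation: n_h proportional to W_h * sigma_h (left unrounded)."""
    total = sum(w * s for w, s in zip(weights, sigmas))
    return [n * w * s / total for w, s in zip(weights, sigmas)]


# Hypothetical two-stratum population, chosen only for illustration.
W = [0.6, 0.4]
S = [1.0, 3.0]
v_neyman = neyman_variance(W, S, n=100)
v_proportional = proportional_variance(W, S, n=100)
sizes = neyman_sample_sizes(W, S, n=100)
```

For any set of strata the Neyman variance does not exceed the proportional one, since $(\sum W_h \sigma_{hy})^2 \le \sum W_h \sigma_{hy}^2$ when the weights sum to one.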

For a given method of allocation, the variance is clearly a function of the strata boundaries. The problem of determining optimum strata boundaries, i.e., optimum stratification, was first considered by Dalenius [1] and Hayashi, Maruyama and Isida [2]. By minimizing the variance of the estimate $\bar{y}_{st}$, sets of equations were obtained, solutions to which gave optimum strata boundaries for different allocations. These equations involved population parameters which were themselves functions of the optimum strata boundaries. Due to the implicit nature of these minimal equations, their exact solutions could not be obtained. Subsequently, various authors gave methods of obtaining approximations to the exact solutions of the minimal equations. For an excellent account of these investigations, reference may be made to Cochran [3], [4]. The more general problem of optimum stratification when the strata need not necessarily be intervals has been considered recently by Taga [5] for the case of proportional allocation. He has given a set of sufficient conditions under which the optimum strata boundaries for the general case reduce to the optimum strata boundaries for the case when the strata are necessarily intervals.

In most of these investigations of the problem of optimum stratification, the estimation and the stratification variables are taken to be the same. Since the distribution of the estimation variable $y$ is rarely known in practice, it is desirable to stratify on the basis of some suitably chosen concomitant variable $x$. An investigation in this direction has been made by Taga [5], who has considered the general problem of optimum stratification based on concomitant variables (when the strata are not necessarily intervals) for the case of proportional allocation. As remarked earlier, under certain conditions, the optimum strata boundaries obtained in the general case reduce to the optimum strata boundaries for the case when the strata are necessarily intervals. However, as remarked by Isii [6], this is not the case with Neyman allocation. Taking the population to be infinite, we consider the problem of optimum stratification on the concomitant variable $x$. Assuming knowledge of the form of the regression of $y$ on $x$, as well as the form of the variance function $V(y|x)$, minimal equations giving optimum strata boundaries are obtained for Neyman and proportional allocations. Since these equations cannot be solved easily, various methods of finding approximations to the exact solutions are given. The results for proportional allocation come out as a special case of the results for Neyman allocation. The paper concludes by comparing approximate solutions with the exact solutions for certain populations.


2. Optimum strata boundaries

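Let

$$y = c(x) + e \tag{2.1}$$

where $c(x)$ is a function of $x$ and $e$ is such that $E(e|x) = 0$ and $V(e|x) = \phi(x) > 0$ for all $x$ in the range $(a, b)$ with $(b - a) < \infty$. Let $f(x, y)$ be the joint density of $(x, y)$ and $f(x)$ the marginal density of $x$. Then, we have

$$W_h = \int_{x_{h-1}}^{x_h} f(x)\,dx, \qquad \mu_{hy} = \mu_{hc} = \frac{1}{W_h}\int_{x_{h-1}}^{x_h} c(x) f(x)\,dx \qquad \text{and} \qquad \sigma_{hy}^2 = \sigma_{hc}^2 + \mu_{h\phi} \tag{2.2}$$

where $(x_{h-1}, x_h)$ are the boundaries of the $h$th stratum, $\mu_{h\phi}$ is the expected value of $\phi(x)$ and $\sigma_{hc}^2$ is the variance of $c(x)$ in the $h$th stratum.

Using these relations, the variance expressions for Neyman and proportional allocations reduce to

$$V(\bar{y}_{st})_N = \frac{1}{n}\Big(\sum_{h=1}^{L} W_h \sqrt{\sigma_{hc}^2 + \mu_{h\phi}}\Big)^2 \tag{2.3}$$

and

$$V(\bar{y}_{st})_P = \frac{1}{n}\sum_{h=1}^{L} W_h\,(\sigma_{hc}^2 + \mu_{h\phi}). \tag{2.4}$$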
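The stratum quantities in (2.2) and the variances (2.3) and (2.4) are straightforward to evaluate numerically for any trial set of boundaries. The sketch below uses a simple midpoint rule; the helper names (`integrate`, `stratum_moments`, `variances`) and the example inputs are assumptions made for illustration, not part of the paper.

```python
import math


def integrate(func, lo, hi, steps=2000):
    """Composite midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (j + 0.5) * width) for j in range(steps))


def stratum_moments(f, c, phi, lo, hi):
    """Return W_h, mu_hc, sigma2_hc and mu_hphi for the stratum (lo, hi)."""
    w = integrate(f, lo, hi)
    mu_c = integrate(lambda x: c(x) * f(x), lo, hi) / w
    var_c = integrate(lambda x: (c(x) - mu_c) ** 2 * f(x), lo, hi) / w
    mu_phi = integrate(lambda x: phi(x) * f(x), lo, hi) / w
    return w, mu_c, var_c, mu_phi


def variances(f, c, phi, bounds, n):
    """Evaluate (2.3) and (2.4) for boundaries x_0 < x_1 < ... < x_L."""
    parts = [stratum_moments(f, c, phi, lo, hi)
             for lo, hi in zip(bounds[:-1], bounds[1:])]
    neyman = sum(w * math.sqrt(vc + mp) for w, _, vc, mp in parts) ** 2 / n
    proportional = sum(w * (vc + mp) for w, _, vc, mp in parts) / n
    return neyman, proportional


# Illustrative inputs: uniform density on (1, 2), c(x) = x, constant phi.
v_neyman, v_proportional = variances(
    lambda x: 1.0, lambda x: x, lambda x: 1.0 / 108.0, [1.0, 1.5, 2.0], n=1)
```

With two equal strata of a uniform density both allocations coincide, so the two returned values agree in this particular example.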

Let $\{x_h\}$ denote the set of optimum points of stratification on the range $(a, b)$ of $x$, for which the variance of the estimate $\bar{y}_{st}$ is minimum. These points $\{x_h\}$ are the solutions of the minimal equations which are obtained by equating to zero the partial derivatives of $V(\bar{y}_{st})$ with respect to $\{x_h\}$. We shall now obtain these equations for Neyman and proportional allocations.

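Neyman allocation: The minimization of the variance expression as given in (2.3) is equivalent to the minimization of the expression

$$\sum_{h=1}^{L} W_h \sqrt{\sigma_{hc}^2 + \mu_{h\phi}} .$$

Equating to zero the partial derivative of this expression with respect to $x_h$, we get

$$W_h \frac{\partial \sqrt{(h)}}{\partial x_h} + \sqrt{(h)}\,\frac{\partial W_h}{\partial x_h} + W_i \frac{\partial \sqrt{(i)}}{\partial x_h} + \sqrt{(i)}\,\frac{\partial W_i}{\partial x_h} = 0 \tag{2.5}$$

where $i = h+1$, $(h) = \sigma_{hc}^2 + \mu_{h\phi}$ and $(i) = \sigma_{ic}^2 + \mu_{i\phi}$. Now, as can be easily verified, we have

$$\frac{\partial (h)}{\partial x_h} = \frac{f(x_h)}{W_h}\Big[\{c(x_h) - \mu_{hc}\}^2 - \sigma_{hc}^2 + \phi(x_h) - \mu_{h\phi}\Big],$$

$$\frac{\partial (i)}{\partial x_h} = -\frac{f(x_h)}{W_i}\Big[\{c(x_h) - \mu_{ic}\}^2 - \sigma_{ic}^2 + \phi(x_h) - \mu_{i\phi}\Big],$$

and

$$\frac{\partial W_h}{\partial x_h} = f(x_h) = -\frac{\partial W_i}{\partial x_h} .$$

Therefore, on using these relations in (2.5), we get the minimal equations on simplification as

$$\frac{\{c(x_h) - \mu_{hc}\}^2 + \sigma_{hc}^2 + \phi(x_h) + \mu_{h\phi}}{\sigma_{hy}} = \frac{\{c(x_h) - \mu_{ic}\}^2 + \sigma_{ic}^2 + \phi(x_h) + \mu_{i\phi}}{\sigma_{iy}}, \qquad i = h+1,\ h = 1, 2, \ldots, L-1. \tag{2.6}$$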
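Whether a trial boundary satisfies (2.6) can be checked numerically by evaluating both sides. The sketch below does this with a midpoint rule; the names `side_of_2_6` and `residual_2_6` are hypothetical, and the uniform density with $c(x) = x$ and constant $\phi$ is only an illustrative choice.

```python
import math


def integrate(func, lo, hi, steps=2000):
    """Composite midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (j + 0.5) * width) for j in range(steps))


def side_of_2_6(f, c, phi, lo, hi, x_h):
    """One side of (2.6): moments of the stratum (lo, hi), evaluated at x_h."""
    w = integrate(f, lo, hi)
    mu_c = integrate(lambda x: c(x) * f(x), lo, hi) / w
    var_c = integrate(lambda x: (c(x) - mu_c) ** 2 * f(x), lo, hi) / w
    mu_phi = integrate(lambda x: phi(x) * f(x), lo, hi) / w
    sigma_y = math.sqrt(var_c + mu_phi)
    return ((c(x_h) - mu_c) ** 2 + var_c + phi(x_h) + mu_phi) / sigma_y


def residual_2_6(f, c, phi, bounds, h):
    """Left side minus right side of (2.6) at the interior boundary bounds[h]."""
    left = side_of_2_6(f, c, phi, bounds[h - 1], bounds[h], bounds[h])
    right = side_of_2_6(f, c, phi, bounds[h], bounds[h + 1], bounds[h])
    return left - right


uniform = lambda x: 1.0
line = lambda x: x
const_phi = lambda x: 0.01
at_midpoint = residual_2_6(uniform, line, const_phi, [1.0, 1.5, 2.0], 1)
off_midpoint = residual_2_6(uniform, line, const_phi, [1.0, 1.3, 2.0], 1)
```

In this symmetric example the residual vanishes at the midpoint $x_1 = 1.5$ and is clearly non-zero at $x_1 = 1.3$.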

Proportional allocation: To obtain the minimal equations for this allocation method, we minimize the variance expression given in (2.4). The minimization of this variance is equivalent to the minimization of the expression $\sum_{h=1}^{L} W_h \sigma_{hc}^2$, since $\sum_{h=1}^{L} W_h \mu_{h\phi} = \mu_\phi$ is a population parameter and is, therefore, a constant. On equating to zero the partial derivative of this expression with respect to $x_h$, we get

$$W_h \frac{\partial \sigma_{hc}^2}{\partial x_h} + \sigma_{hc}^2\,\frac{\partial W_h}{\partial x_h} + W_i \frac{\partial \sigma_{ic}^2}{\partial x_h} + \sigma_{ic}^2\,\frac{\partial W_i}{\partial x_h} = 0$$

which on simplification gives the system of minimal equations as

$$c(x_h) = \frac{\mu_{hc} + \mu_{ic}}{2}, \qquad i = h+1,\ h = 1, 2, \ldots, L-1. \tag{2.7}$$
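Because the stratum means in (2.7) depend on the boundaries, one natural way to attack the system is a fixed-point iteration: recompute the means for the current boundaries, then move each boundary to the point where $c(x)$ equals the average of the two neighbouring means. The sketch below assumes $c$ is increasing on $(a, b)$; the iteration scheme and all function names are illustrative assumptions, not a description of the authors' computations.

```python
def integrate(func, lo, hi, steps=400):
    """Composite midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (j + 0.5) * width) for j in range(steps))


def stratum_mean_c(f, c, lo, hi):
    """Conditional mean of c(x) in the stratum (lo, hi)."""
    return integrate(lambda x: c(x) * f(x), lo, hi) / integrate(f, lo, hi)


def invert_increasing(c, target, lo, hi, tol=1e-12):
    """Solve c(x) = target by bisection, assuming c is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if c(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def proportional_boundaries(f, c, a, b, L, sweeps=100):
    """Fixed-point iteration on (2.7), started from equal intervals."""
    x = [a + h * (b - a) / L for h in range(L + 1)]
    for _ in range(sweeps):
        means = [stratum_mean_c(f, c, x[h], x[h + 1]) for h in range(L)]
        for h in range(1, L):
            x[h] = invert_increasing(c, 0.5 * (means[h - 1] + means[h]), a, b)
    return x


# Illustrative run: right-triangular density on (1, 2) with c(x) = x, L = 2.
tri = lambda x: 2.0 * (2.0 - x)
line = lambda x: x
bounds = proportional_boundaries(tri, line, 1.0, 2.0, 2)
```

For the uniform density the iteration leaves equal intervals unchanged, while for the right-triangular density the boundary moves to the left of the midpoint.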

The systems of equations (2.6) and (2.7) give optimum strata boundaries in the sense of minimizing $V(\bar{y}_{st})$ under Neyman and proportional allocation respectively if the functions $f(x)[4\phi(x)c'^2(x) + \phi'^2(x)]/[\phi(x)]^{3/2}$ and $c'^2(x)f(x)$ are bounded away from zero and possess first two derivatives which are continuous for all $x$ in $(a, b)$. We observe that both the systems of equations (2.6) and (2.7) are functions of the population parameters which are themselves functions of the solutions of these equations. Due to this difficulty it is not possible to find exact solutions. We shall therefore find approximate solutions. However, we shall first obtain certain approximate expressions for the conditional mean and variance which will be necessary to obtain approximate solutions.

3. Approximate expressions for conditional mean and variance over small intervals

Assume that the functions $f(x)$, $\phi(x)$ and $c(x)$ are bounded away from zero and possess first two derivatives continuous for all $x$ in $(a, b)$. Then, we have the following identities due to Ekman [7], [8].

$$I_i(y, x) = \int_y^x (t-y)^i f(t)\,dt = \sum_{j=0}^{3} \frac{k^{i+j+1}}{j!\,(i+j+1)} f^{(j)}(y) + O(k^{i+5}) \tag{3.1}$$

where $f^{(j)}(y)$ is the $j$th derivative of $f(t)$ at $t = y$ and $k = x - y$.

$$J_i(y, x) = \int_y^x (t-x)^i f(t)\,dt = -\sum_{j=0}^{3} \frac{(-k)^{i+j+1}}{j!\,(i+j+1)} f^{(j)}(x) + O(k^{i+5}). \tag{3.2}$$

Let $\mu_\phi(y, x)$ denote the conditional expectation of $\phi(t)$ in the interval $(y, x)$ so that

$$\mu_\phi(y, x) = \int_y^x \phi(t) f(t)\,dt \Big/ \int_y^x f(t)\,dt .$$

From Taylor's theorem

$$\phi(t) = \sum_{j=0}^{3} \frac{(t-y)^j}{j!}\,\phi^{(j)}(y) + O\big((t-y)^4\big)$$

where $\phi^{(j)}(y)$ is the $j$th derivative of $\phi(t)$ at $t = y$. Then

$$\mu_\phi(y, x)\int_y^x f(t)\,dt = \mu_\phi(y, x)\, I_0(y, x) = \int_y^x \phi(t) f(t)\,dt .$$

Therefore, we have

$$\mu_\phi(y, x) = \sum_{j=0}^{3} \frac{\phi^{(j)}(y)}{j!}\,\frac{I_j(y, x)}{I_0(y, x)} + O(k^4).$$

Using the series expansions for $I_j(y, x)$ and $I_0(y, x)$ from (3.1) and simplifying, we obtain

$$\mu_\phi(y, x) = \phi\left[1 + \frac{\phi'}{2\phi}k + \frac{\phi' f' + 2f\phi''}{12 f\phi}k^2 + \frac{ff''\phi' + ff'\phi'' + f^2\phi''' - f'^2\phi'}{24 f^2\phi}k^3 + O(k^4)\right] \tag{3.3}$$

where the functions $\phi$, $f$ and their derivatives are evaluated at $t = y$. Proceeding in a similar fashion but using the Taylor expansions about the point $x$, we obtain

$$\mu_\phi(y, x) = \phi\left[1 - \frac{\phi'}{2\phi}k + \frac{\phi' f' + 2f\phi''}{12 f\phi}k^2 - \frac{ff''\phi' + ff'\phi'' + f^2\phi''' - f'^2\phi'}{24 f^2\phi}k^3 + O(k^4)\right] \tag{3.4}$$

where the functions $\phi$, $f$ and their derivatives are evaluated at $t = x$.

Let $\sigma_\phi^2(y, x)$ denote the conditional variance of $\phi(t)$ in the interval $(y, x)$. Then using the approximations for $\mu_\phi(y, x)$ and $\mu_{\phi^2}(y, x)$ about the point $t = y$, we obtain

$$\sigma_\phi^2(y, x) = \frac{k^2}{12}\,\phi'^2\left[1 + \frac{\phi''}{\phi'}k + O(k^2)\right] \tag{3.5}$$

where the function $\phi$ and its derivatives are evaluated at $t = y$.

Using the above results, several other approximations can be obtained. Multiplying the series expansions for $\mu_\phi(y, x)$ about the points $t = y$ and $t = x$ and taking the square root, we obtain

$$\mu_\phi(y, x) = \sqrt{\phi(y)\phi(x)}\,\big[1 + O(k^2)\big]. \tag{3.6}$$

Again, we have

$$I_0(y, x)\,\mu_\phi(y, x) = \int_y^x \phi(t) f(t)\,dt .$$

Taking $\phi(t) = t^2$ and using (3.6), we get

$$I_0(y, x) = \frac{1}{xy}\int_y^x t^2 f(t)\,dt\,\big[1 + O(k^2)\big]. \tag{3.7}$$

Similarly, expanding $[f(t)]^\lambda$ about the point $t = y$, we have

$$\left[\int_y^x [f(t)]^\lambda\,dt\right]^{1/\lambda} = \left[\int_y^x \Big\{[f(y)]^\lambda + \lambda\frac{f'(y)}{[f(y)]^{1-\lambda}}(t-y) + O\big((t-y)^2\big)\Big\}\,dt\right]^{1/\lambda}$$

$$= \left[k\,[f(y)]^\lambda + \frac{\lambda k^2}{2}\frac{f'(y)}{[f(y)]^{1-\lambda}} + O(k^3)\right]^{1/\lambda}$$

$$= k^{1/\lambda} f(y)\left[1 + \frac{k}{2}\frac{f'(y)}{f(y)} + O(k^2)\right]$$

$$= k^{(1/\lambda) - 1}\int_y^x f(t)\,dt\,\big[1 + O(k^2)\big]. \tag{3.8}$$
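The expansions of this section lend themselves to a quick numerical sanity check. The sketch below compares the exact conditional mean over a short interval with the third-order series of the form (3.3) and with the geometric-mean approximation (3.6). The density, the function $\phi(t) = t$, the interval and all function names are illustrative assumptions.

```python
import math


def integrate(func, lo, hi, steps=4000):
    """Composite midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (j + 0.5) * width) for j in range(steps))


def conditional_mean(phi, f, y, x):
    """Exact mu_phi(y, x), evaluated by numerical integration."""
    return integrate(lambda t: phi(t) * f(t), y, x) / integrate(f, y, x)


def mean_series_3_3(phi_derivs, f_derivs, k):
    """Series (3.3) without its O(k^4) term.

    phi_derivs = (phi, phi', phi'', phi''') and f_derivs = (f, f', f''),
    all evaluated at t = y.
    """
    p, p1, p2, p3 = phi_derivs
    f0, f1, f2 = f_derivs
    a1 = p1 / (2 * p)
    a2 = (p1 * f1 + 2 * f0 * p2) / (12 * f0 * p)
    a3 = (f0 * f2 * p1 + f0 * f1 * p2 + f0 ** 2 * p3 - f1 ** 2 * p1) / (24 * f0 ** 2 * p)
    return p * (1 + a1 * k + a2 * k ** 2 + a3 * k ** 3)


# Illustrative choice: right-triangular density and phi(t) = t.
f = lambda t: 2.0 * (2.0 - t)
phi = lambda t: t
y, k = 1.2, 0.1
exact = conditional_mean(phi, f, y, y + k)
series = mean_series_3_3((y, 1.0, 0.0, 0.0), (f(y), -2.0, 0.0), k)
geometric = math.sqrt(phi(y) * phi(y + k))  # leading term of (3.6)
```

In this example the series differs from the exact mean by far less than $k^4$, and the geometric mean agrees with it to within a relative error well below $k^2$, in line with the stated orders.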

4. Approximate solutions of the minimal equations

To find approximate solutions to the minimal equations (2.6), we shall obtain the series expansions of this system of equations about the point $x_h$, the common boundary of the $h$th and $(h+1)$th strata. The expansions for the two sides of equation (2.6) are obtained by using various results proved in the preceding section. For the expansion of the right-hand side about the point $x_h$, $(y, x)$ is replaced by $(x_h, x_{h+1})$, while for the left-hand side we replace $(y, x)$ by $(x_{h-1}, x_h)$.

We first consider the development of the right-hand side of (2.6). Let $k_i = x_{h+1} - x_h$. Then, using (3.3), we have

$$[c(x_h) - \mu_{ic}]^2 = \frac{k_i^2}{4}\left[c'^2 + \frac{c'^2 f' + 2fc'c''}{3f}k_i + \frac{6ff''c'^2 + 10ff'c'c'' + 6f^2c'c''' + 4f^2c''^2 - 5f'^2c'^2}{36f^2}k_i^2 + O(k_i^3)\right]$$

and

$$\phi(x_h) + \mu_{i\phi} = \phi\left[2 + \frac{\phi'}{2\phi}k_i + \frac{\phi' f' + 2f\phi''}{12f\phi}k_i^2 + \frac{ff''\phi' + ff'\phi'' + f^2\phi''' - f'^2\phi'}{24f^2\phi}k_i^3 + O(k_i^4)\right].$$

Also, we have from (3.5)

$$\sigma_{ic}^2 = \frac{k_i^2}{12}\big[c'^2 + c'c''k_i + O(k_i^2)\big].$$

Whence, we obtain

$$[c(x_h) - \mu_{ic}]^2 + \sigma_{ic}^2 + \phi(x_h) + \mu_{i\phi} = 2\phi\left[1 + \frac{\phi'}{4\phi}k_i + \frac{\phi' f' + 2f\phi'' + 4fc'^2}{24f\phi}k_i^2 + \frac{2ff'c'^2 + 6f^2c'c'' + ff''\phi' + ff'\phi'' + f^2\phi''' - f'^2\phi'}{48f^2\phi}k_i^3 + O(k_i^4)\right]. \tag{4.1}$$

Also, we have

$$\mu_{i\phi} + \sigma_{ic}^2 = \phi\left[1 + \frac{\phi'}{2\phi}k_i + \frac{\phi' f' + 2f\phi'' + fc'^2}{12f\phi}k_i^2 + \frac{ff''\phi' + ff'\phi'' + f^2\phi''' - f'^2\phi' + 2f^2c'c''}{24f^2\phi}k_i^3 + O(k_i^4)\right]. \tag{4.2}$$

Using the relations (4.1) and (4.2), we obtain on simplification

$$\frac{[c(x_h) - \mu_{ic}]^2 + \sigma_{ic}^2 + \phi(x_h) + \mu_{i\phi}}{\sqrt{\mu_{i\phi} + \sigma_{ic}^2}} = 2\sqrt{\phi}\,\big[1 + B_2k_i^2 + B_3k_i^3 + O(k_i^4)\big] \tag{4.3}$$

where

$$B_2 = \frac{4\phi c'^2 + \phi'^2}{32\phi^2}$$

and

$$B_3 = \frac{1}{96f\sqrt{\phi}}\,\frac{d}{dx_h}\left[\frac{(4\phi c'^2 + \phi'^2)f}{\phi^{3/2}}\right] = \frac{8f'\phi^2c'^2 + 16f\phi^2c'c'' + 2\phi f'\phi'^2 + 4f\phi\phi'\phi'' - 4f\phi\phi'c'^2 - 3f\phi'^3}{192f\phi^3}. \tag{4.4}$$

Similarly, the expansion for the left-hand side of (2.6) is obtained and is given by

$$\frac{[c(x_h) - \mu_{hc}]^2 + \sigma_{hc}^2 + \phi(x_h) + \mu_{h\phi}}{\sqrt{\mu_{h\phi} + \sigma_{hc}^2}} = 2\sqrt{\phi}\,\big[1 + B_2k_h^2 - B_3k_h^3 + O(k_h^4)\big] \tag{4.5}$$

where $k_h = x_h - x_{h-1}$. The system of minimal equations (2.6) can therefore be written in the form

$$k_h^2\big[B_2 - B_3k_h + O(k_h^2)\big] = k_i^2\big[B_2 + B_3k_i + O(k_i^2)\big] \tag{4.6}$$

which can again be rewritten in the form

$$k_h^2\int_{x_{h-1}}^{x_h} g_1(t)f(t)\,dt\,\big[1 + O(k_h^2)\big] = k_i^2\int_{x_h}^{x_{h+1}} g_1(t)f(t)\,dt\,\big[1 + O(k_i^2)\big] \tag{4.7}$$

where

$$g_1(t) = \frac{\phi'^2(t) + 4\phi(t)c'^2(t)}{[\phi(t)]^{3/2}}. \tag{4.8}$$
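The expansion of the right-hand side of (2.6) in powers of the stratum width can be checked numerically in a simple special case. The sketch below assumes a uniform density, $c(x) = x$ and $\phi(x) = \lambda x$, so that $f' = c'' = \phi'' = 0$ and the coefficients reduce to $B_2 = (4\phi + \phi'^2)/(32\phi^2)$ and $B_3 = -(4\phi\phi' + 3\phi'^3)/(192\phi^3)$. The function names and the numbers used are illustrative assumptions.

```python
import math


def integrate(func, lo, hi, steps=4000):
    """Composite midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (j + 0.5) * width) for j in range(steps))


def exact_right_side(x_h, k, lam):
    """Right-hand side of (2.6) for the stratum (x_h, x_h + k), with f = 1,
    c(x) = x and phi(x) = lam * x."""
    lo, hi = x_h, x_h + k
    w = integrate(lambda x: 1.0, lo, hi)
    mu_c = integrate(lambda x: x, lo, hi) / w
    var_c = integrate(lambda x: (x - mu_c) ** 2, lo, hi) / w
    mu_phi = integrate(lambda x: lam * x, lo, hi) / w
    numerator = (x_h - mu_c) ** 2 + var_c + lam * x_h + mu_phi
    return numerator / math.sqrt(mu_phi + var_c)


def series_right_side(x_h, k, lam):
    """2*sqrt(phi)*(1 + B2*k^2 + B3*k^3) for the same special case."""
    p, p1 = lam * x_h, lam                      # phi and phi' at x_h
    b2 = (4 * p + p1 ** 2) / (32 * p ** 2)
    b3 = -(4 * p * p1 + 3 * p1 ** 3) / (192 * p ** 3)
    return 2 * math.sqrt(p) * (1 + b2 * k ** 2 + b3 * k ** 3)


x_h, k, lam = 1.0, 0.1, 1.0
exact = exact_right_side(x_h, k, lam)
series = series_right_side(x_h, k, lam)
```

With these values the two numbers agree to well within $k^4$, which is the order of the neglected terms.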

Therefore, if we have a large number of strata so that the strata widths $k_h$ are small and their higher powers in the expansion can be neglected, then the system of minimal equations (2.6), or equivalently the system of equations (4.7), can be approximated by

$$k_h^2\int_{x_{h-1}}^{x_h} g_1(t)f(t)\,dt = \text{Const.} = c_1, \text{ say}, \qquad h = 1, 2, \ldots, L \tag{4.9}$$

where terms of order $O(m^4)$, $m = \sup_{(a,b)} k_h$, have been neglected on both sides of equation (4.7), since $\int_{x_{h-1}}^{x_h} g_1(t)f(t)\,dt = O(m)$ in view of the fact that $g_1(x)f(x)$ is bounded away from zero for all $x$ in $(a, b)$. We further remark that if we have a function $Q_1(x_{h-1}, x_h)$ such that

$$k_h^2\int_{x_{h-1}}^{x_h} g_1(t)f(t)\,dt = Q_1(x_{h-1}, x_h)\,\big[1 + O(k_h^2)\big] \tag{4.10}$$

then the system of minimal equations (2.6) can, to the same degree of approximation as involved in (4.9), be approximated by

$$Q_1(x_{h-1}, x_h) = \text{constant}, \qquad h = 1, 2, \ldots, L. \tag{4.11}$$

We have thus established the following theorem.

THEOREM 4.1. If the regression of the estimation variable $y$ on the stratification variable $x$ is given by

$$y = c(x) + e$$

such that $E(e|x) = 0$, $V(e|x) = \phi(x) > 0$ for all $x$ in $(a, b)$ with $(b - a) < \infty$, and further if the function $g_1(x)f(x)$ is bounded away from zero and possesses first two derivatives continuous for all $x$ in $(a, b)$, then the system of minimal equations (2.6) giving optimum points of stratification under Neyman allocation can be approximated by the system of equations

$$k_h^2\int_{x_{h-1}}^{x_h} g_1(t)f(t)\,dt = \text{constant}$$

or equivalently by

$$Q_1(x_{h-1}, x_h) = \text{constant}$$

for $L$ sufficiently large so that terms of order $O(m^4)$ can be neglected, where

$$k_h = x_h - x_{h-1}, \qquad m = \sup_{(a,b)} k_h, \qquad g_1(x) = \frac{\phi'^2(x) + 4\phi(x)c'^2(x)}{[\phi(x)]^{3/2}}$$

and

$$Q_1(x_{h-1}, x_h) = k_h^2\int_{x_{h-1}}^{x_h} g_1(t)f(t)\,dt\,\big[1 + O(k_h^2)\big].$$
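The approximating system of the theorem, in which $k_h^2\int g_1 f\,dt$ takes the same value in every stratum, can be solved numerically by a shooting approach: for a trial value of the constant the boundaries are built one stratum at a time from $a$, and the constant is then adjusted by bisection until exactly $L$ strata fill $(a, b)$. This is only one possible scheme, sketched under the assumption that the product $g_1(t)f(t)$, passed in as `gf`, is positive on $(a, b)$; all function names are hypothetical.

```python
def integrate(func, lo, hi, steps=100):
    """Composite midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (j + 0.5) * width) for j in range(steps))


def stratum_value(gf, lo, hi):
    """k_h^2 times the integral of g1*f over the stratum (lo, hi)."""
    return (hi - lo) ** 2 * integrate(gf, lo, hi)


def next_boundary(gf, start, b, const, iters=50):
    """Upper boundary x with stratum_value(start, x) = const, or None if
    even the stratum (start, b) falls short of const."""
    if stratum_value(gf, start, b) < const:
        return None
    lo, hi = start, b
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if stratum_value(gf, start, mid) < const:
            lo = mid
        else:
            hi = mid
    return hi


def boundaries_equal_value(gf, a, b, L, iters=50):
    """Bisection on the common constant until L strata exactly fill (a, b)."""
    best = [a + h * (b - a) / L for h in range(L + 1)]  # fallback: equal widths
    lo_c, hi_c = 0.0, stratum_value(gf, a, b)
    for _ in range(iters):
        const = 0.5 * (lo_c + hi_c)
        x = [a]
        for _ in range(L):
            nxt = next_boundary(gf, x[-1], b, const)
            if nxt is None:
                break
            x.append(nxt)
        if len(x) == L + 1:   # all L strata fit: the constant is not too large
            lo_c, best = const, x
        else:                 # ran out of range: the constant is too large
            hi_c = const
    best[-1] = b
    return best


# Illustrative case: g1 constant and right-triangular f, so g1*f is
# proportional to 2*(2 - t) on (1, 2).
gf = lambda t: 2.0 * (2.0 - t)
bounds = boundaries_equal_value(gf, 1.0, 2.0, 3)
values = [stratum_value(gf, lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])]
```

When `gf` is constant the procedure returns equal intervals, as expected from the form of the equations.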

If in (4.6) we neglect terms of order $O(m^3)$, then the two sides are equalised if

$$k_h = \text{Const.} = \frac{b - a}{L} \quad \text{for all } h \tag{4.12}$$

and the optimum points of stratification are given by

$$x_h = a + h\,\frac{b - a}{L} \quad \text{with } x_0 = a \text{ and } x_L = b. \tag{4.13}$$

This set of solutions may not be expected to yield good results but could be used in cases where not much is known about $g_1(x)$ and $f(x)$.

The set of approximate solutions given by (4.9) should be satisfactory and quite close to the exact solutions of (2.6). However, this may not always be convenient from the point of view of actual calculations. Using (4.11) and the various identities proved in Section 3, several other approximations can be obtained. Thus, using (3.6) and (4.9) we obtain the set of equations

$$k_h^2\sqrt{g_1(x_{h-1})\,g_1(x_h)}\;W_h = c_2. \tag{4.14}$$

Using (3.7) and (4.9), we obtain

$$\frac{k_h^2}{x_{h-1}x_h}\int_{x_{h-1}}^{x_h} t^2 g_1(t)f(t)\,dt = c_3 \tag{4.15}$$

as another approximation to the system of minimal equations. Using identity (3.8), we get the approximation

$$\left[k_h^{3\lambda - 1}\int_{x_{h-1}}^{x_h} [g_1(t)f(t)]^\lambda\,dt\right]^{1/\lambda} = \text{Const.} \tag{4.16}$$

which for $\lambda = 1/2$ and $1/3$ gives

$$k_h\left[\int_{x_{h-1}}^{x_h} \sqrt{g_1(t)f(t)}\,dt\right]^2 = c_4 \tag{4.17}$$

and

$$\int_{x_{h-1}}^{x_h} \sqrt[3]{g_1(t)f(t)}\,dt = c_5. \tag{4.18}$$

In all these approximations, the constants $c_i$ have to be determined. The approximation (4.18) is useful since the exact value of the constant $c_5$ can be easily seen to be $\frac{1}{L}\int_a^b \sqrt[3]{g_1(t)f(t)}\,dt$. It is difficult to find the exact values of the other constants. However, approximate values of the constants based on (4.12) can be easily found.
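Since the constant in (4.18) is known exactly, that approximation amounts to cutting the cumulative integral of $\sqrt[3]{g_1(t)f(t)}$ into $L$ equal parts. The sketch below tabulates the cumulative integral on a fine grid and inverts it by linear interpolation. The grid size, the function name and the example are assumptions, and $g_1(t)f(t)$ is taken to be positive on $(a, b)$.

```python
def cum_cube_root_boundaries(g1, f, a, b, L, grid=20000):
    """Boundaries splitting the integral of (g1*f)^(1/3) over (a, b) into
    L equal parts, as in (4.18)."""
    width = (b - a) / grid
    cum = [0.0]
    for j in range(grid):
        t = a + (j + 0.5) * width                      # midpoint of grid cell j
        cum.append(cum[-1] + width * (g1(t) * f(t)) ** (1.0 / 3.0))
    bounds = [a]
    j = 0
    for h in range(1, L):
        target = h * cum[-1] / L
        while cum[j + 1] < target:                     # find the cell holding target
            j += 1
        frac = (target - cum[j]) / (cum[j + 1] - cum[j])
        bounds.append(a + (j + frac) * width)          # interpolate inside the cell
    bounds.append(b)
    return bounds


# Illustrative case: g1 constant (taken as 1) and right-triangular f on (1, 2).
g1 = lambda t: 1.0
tri = lambda t: 2.0 * (2.0 - t)
bounds = cum_cube_root_boundaries(g1, tri, 1.0, 2.0, 2)
```

In this example the cumulative integral is available in closed form, being proportional to $1 - (2 - x)^{4/3}$, so the single interior boundary is $x_1 = 2 - 0.5^{3/4} \approx 1.405$.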

So far, we have considered the case of Neyman allocation. However, it is easy to see that the system of equations (2.6) giving optimum points of stratification under Neyman allocation reduces to the system of equations (2.7) giving optimum points of stratification under proportional allocation if $\phi(x)$ is constant and $\sigma_{hy} = \sigma_{iy}$ for $i = h+1$ and $h = 1, 2, \ldots, L-1$. Under these assumptions $g_1(t) \propto [c'(t)]^2$, and hence the various approximations of the system of equations giving optimum points of stratification under proportional allocation can be obtained from the corresponding results for Neyman allocation on replacing $g_1(t)$ by $[c'(t)]^2$.


5. Numerical illustration

For the purpose of illustrating the usefulness of the approximate solutions to the minimal equations giving optimum points of stratification, we shall consider three density functions for $x$:

i) Rectangular: $f(x) = 1$, $1 \le x \le 2$

ii) Right-triangular: $f(x) = 2(2 - x)$, $1 \le x \le 2$

iii) Exponential: $f(x) = e^{-x+1}$, $1 \le x < \infty$.

These densities to some extent represent those usually encountered. For the conditional variance function $\phi(x)$, we shall take two forms, namely $\phi(x) = a$ and $\phi(x) = \lambda x$, $a$ and $\lambda$ being constants. The regression function $c(x)$ will be taken to be linear with a slope of 45°. The constants $a$ and $\lambda$ will be determined in such a way that 90% of the total variation is accounted for by the regression. In the case of the exponential distribution, we shall truncate the distribution such that the area under the curve to the right of the truncation point is .05. The optimum points of stratification are found by successive iterations. The comparison of the optimum points of stratification obtained through approximate methods with the exact points obtained by trial and error is given in Tables 1 and 2.
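The set-up constants can be reproduced with a few lines of arithmetic. The sketch below rests on one reading of the 90% condition, namely $\mathrm{Var}\{c(x)\}/[\mathrm{Var}\{c(x)\} + E\{\phi(x)\}] = 0.9$ with $c(x) = x$; this interpretation, the function names and the numerical integration are assumptions and are not taken from the paper.

```python
import math


def integrate(func, lo, hi, steps=20000):
    """Composite midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (j + 0.5) * width) for j in range(steps))


def noise_constants(f, lo, hi, explained=0.90):
    """Return (a, lam) so that phi(x) = a, or phi(x) = lam * x, leaves the
    regression c(x) = x explaining the given share of Var(y)."""
    mass = integrate(f, lo, hi)
    mean = integrate(lambda x: x * f(x), lo, hi) / mass
    var = integrate(lambda x: (x - mean) ** 2 * f(x), lo, hi) / mass
    mean_phi = var * (1.0 - explained) / explained    # required E[phi(x)]
    return mean_phi, mean_phi / mean                  # E[lam * x] = lam * mean


# Rectangular density on (1, 2).
a_const, lam = noise_constants(lambda x: 1.0, 1.0, 2.0)

# Exponential density exp(-x + 1): the area to the right of T is exp(-(T - 1)),
# so an area of .05 gives T = 1 + ln 20.
truncation_point = 1.0 + math.log(1.0 / 0.05)
```

Under this reading the rectangular case gives $a = 1/108$ and $\lambda = 1/162$, and the exponential density is truncated at about $x = 3.996$.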

[Tables 1 and 2, comparing the optimum points of stratification obtained through the approximate methods with the exact points for the rectangular, right-triangular and exponential densities, are not recoverable from the source.]

It will be seen that the solutions obtained through the approximate systems of equations give better approximations to the optimum points of stratification than those based on equal intervals, which is, of course, to be expected. It is also interesting to observe that the approximations seem to be satisfactory even for small values of $L$. The approximation based on equal intervals seems to be satisfactory in the case of the uniform and right-triangular distributions. However, further numerical investigations are necessary before any firm conclusions can be drawn.

PANJAB AGRICULTURAL UNIVERSITY
IOWA STATE UNIVERSITY

REFERENCES

[1] T. Dalenius, "The problems of optimum stratification," Skand. Akt., 33 (1950), 203-213.

[2] C. Hayashi, F. Maruyama and M. D. Isida, "On some criteria for stratification," Ann. Inst. Statist. Math., 2 (1951), 77-86.

[3] W. G. Cochran, "Comparison of methods of determining strata boundaries," Bull. Int. Statist. Inst., 38 (1961), 345-358.

[4] W. G. Cochran, "Sampling Techniques," John Wiley and Sons, New York, 1963.

[5] Y. Taga, "On optimum stratification for the objective variable based on concomitant variables using prior information," Ann. Inst. Statist. Math., 19 (1967), 101-130.

[6] K. Isii, "Mathematical programming and statistical inference," Sûgaku (Mathematics), 18 (1966), 85-95.

[7] G. Ekman, "An approximation useful in univariate stratification," Ann. Math. Statist., 30 (1959), 219-229.

[8] G. Ekman, "Approximate expressions for the conditional mean and variance over small intervals of a continuous distribution," Ann. Math. Statist., 30 (1959), 1131-1134.