
Stat 3014 Notes 11 Sampling



STAT 3014/3914: Sampling Theory

Section outline

1. Simple random samples and stratification.

   Finite population correction factor. Sample size determination. Inference over subpopulations. Stratified sampling, optimal allocation.

2. Ratio and regression estimators.

   Ratio estimators. Hartley-Ross estimator. Ratio estimator for stratified samples. Regression estimator.

3. Systematic sampling and cluster sampling.

4. Sampling with unequal probabilities.

   Probability proportional to size (PPS) sampling. The Horvitz-Thompson estimator.

References

1. Barnett, V. (1991) Sample Survey Principles & Methods, Edward Arnold, London.

2. Cochran, W.G. (1963) Sampling Techniques, Wiley, New York.

3. McLennan, W. (1999) An Introduction to Sample Surveys, A.B.S. Publications, Canberra.


1 Simple Random Samples and Stratification.

1.1 The Population.

We have a finite number of elements, $N$, where $N$ is assumed known.

The population is $Y_1, \ldots, Y_N$, where $Y_i$ is a numerical value associated with the $i$-th element.

We adopt the notation in the classical text by W.G. Cochran where capital letters refer to

characteristics of the population; small letters are used for the corresponding characteristics

of a sample.

Population total: $Y = \sum_{i=1}^N Y_i$,

Population mean: $\mu = \bar{Y} = \dfrac{Y}{N} = \dfrac{\sum_{i=1}^N Y_i}{N}$,

Population variance: $\sigma^2 = N^{-1}\sum_{i=1}^N (Y_i - \bar{Y})^2$, with $S^2 = \dfrac{N}{N-1}\,\sigma^2$.

These are fixed (population) quantities, to be estimated.

If we have to consider two numerical values $(Y_i, X_i)$, $i = 1, \cdots, N$, an additional population quantity of interest is
$$R = \frac{\sum_{i=1}^N Y_i}{\sum_{i=1}^N X_i} = \frac{Y}{X} = \frac{\bar{Y}}{\bar{X}},$$
the ratio of totals.

1.2 Simple Random Sampling.

Focus on the numerical values $Y_i$, $i = 1, \cdots, N$. A random sample of size $n$ is taken without replacement: the observed values $y_1, \cdots, y_n$ are random variables and are stochastically dependent. The sampling frame is a list of the values $Y_i$, $i = 1, \cdots, N$.

The natural estimator for $\mu$ is
$$\hat{\mu} = \frac{\sum_{i=1}^n y_i}{n} = \bar{y},$$
and hence that for $Y = N\mu$ is $\hat{Y} = N\bar{y}$.

Distributional properties of $\bar{y}$ are complicated by the dependence of the $y_i$'s.

Sample variance: $s^2 = \sum_{i=1}^n (y_i - \bar{y})^2/(n-1)$.


Fundamental Results

$$E\bar{y} = \mu,$$
$$\operatorname{Var}\bar{y} = \frac{S^2}{n}\left(1 - \frac{n}{N}\right) = \frac{S^2}{n}(1 - f) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right),$$
$$E s^2 = S^2,$$
where $f = n/N$ is the sampling fraction and the finite population correction (f.p.c.) is $1 - f$. Recall

s.e. (estimator) $= \sqrt{\text{variance of estimator}}$.

Central Limit Property

For large sample size $n$ ($n > 30$, say), and small to moderate $f$, we have the approximation
$$(\bar{y} - \mu)\big/\sqrt{\operatorname{Var}\bar{y}} \sim N(0, 1).$$

Confidence Interval for $\mu = \bar{Y}$.

Replacing $S^2$ by $s^2$, approximate 95% C.I.'s for $\mu$ and $Y$ are respectively
$$\bar{y} \pm 1.96\frac{s}{\sqrt{n}}\sqrt{1-f}, \qquad N\left(\bar{y} \pm 1.96\frac{s}{\sqrt{n}}\sqrt{1-f}\right).$$
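These formulas can be checked numerically; the function name and sample below are illustrative, not from the notes.

```python
import math

def srs_ci(sample, N, z=1.96):
    """Mean, f.p.c.-adjusted standard error, and approximate 95% C.I.
    for a simple random sample drawn without replacement."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)   # sample variance s^2
    f = n / N                                             # sampling fraction
    se = math.sqrt(s2 / n * (1 - f))
    return ybar, se, (ybar - z * se, ybar + z * se)

# hypothetical SRS of n = 5 from a population of N = 100
ybar, se, ci = srs_ci([10, 12, 9, 11, 13], N=100)
```

Multiplying both interval endpoints by $N$ gives the interval for the total $Y$.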

Sample Size Calculations

To calculate the sample size needed for sampling yet to be carried out, for specified precision,

use past information, or information from a pilot survey.

Suppose we require
$$\Pr\{|\bar{y} - \mu| \le \delta\mu\} \ge 0.95.$$
For example, we may want to be at least 95% sure that the estimate $\bar{y}$ of $\mu$ is within 1% of the actual value of $\mu$; then we take $\delta = 0.01$.

Then
$$\Pr\left(\frac{|\bar{y} - \mu|}{\sqrt{\operatorname{Var}\bar{y}}} \le \frac{\delta\mu}{\sqrt{\operatorname{Var}\bar{y}}}\right) \ge 0.95,$$
i.e.
$$\delta\mu\big/\sqrt{\operatorname{Var}\bar{y}} \ge 1.96.$$
Ignoring the f.p.c. (i.e. taking $f = 0$),
$$n \ge \frac{(1.96)^2 S^2}{\delta^2\mu^2} \approx \frac{(1.96)^2 s^2}{\delta^2\bar{y}^2},$$
where $s^2$ and $\bar{y}$ are estimates from a pilot survey.
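The rule can be packaged as a small function; the name and interface are ours, not from the notes. `delta_mu` is the absolute half-width $\delta\mu$; retaining the f.p.c. gives the slightly smaller $n = n_0/(1 + n_0/N)$.

```python
import math

def srs_sample_size(S, delta_mu, z=1.96, N=None):
    """Smallest n with z*sqrt(S^2/n * (1 - n/N)) <= delta_mu.
    With N=None the f.p.c. is ignored, giving n0 = (z*S/delta_mu)^2."""
    n0 = (z * S / delta_mu) ** 2
    if N is None:
        return math.ceil(n0)
    return math.ceil(n0 / (1 + n0 / N))   # solves S^2/n * (1 - n/N) = (delta_mu/z)^2
```

For instance, with $S \approx 400$, absolute error $40$, 90% confidence ($z = 1.645$) and $N = 1500$ (the data of Example 1 below), `srs_sample_size(400, 40, z=1.645, N=1500)` applies.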

1.3 Simple Random Sampling for Attributes.

The above can be applied to estimate the total number, or proportion (or %) of units which

possess some qualitative attribute. Let this subset of the population be C.

Put
$$Y_i = \begin{cases} 1 & \text{if } i \in C \\ 0 & \text{if } i \notin C \end{cases}$$
and similarly for the $y_i$'s.

Customary notation:

$\mu = \bar{Y} = P$ is the population proportion;

$A = \sum_{i=1}^N Y_i = Y = NP$ is the total number in $C$;

$\bar{y} = p$ is the sample proportion.

Let $a = \sum_{i=1}^n y_i = np$, $Q = 1 - P$ and $q = 1 - p$. Thus $\hat{A} = Np = Na/n$.


Since
$$\sum_{i=1}^N (Y_i - \bar{Y})^2 = \sum_{i=1}^N Y_i^2 - N\bar{Y}^2 = N\bar{Y} - N\bar{Y}^2 = NP(1-P),$$
we have
$$S^2 = NP(1-P)/(N-1) = NPQ/(N-1), \qquad \text{and similarly} \qquad s^2 = npq/(n-1).$$

Examples

1. Suppose that a finite population of 1500 students is to graduate. A sample will be

selected and surveyed to estimate the average starting salary for graduates. Determine

the sample size, n, needed to ensure that the sample average starting salary is within

$40 of the population average with probability at least 0.90, if from previous years we

know that the standard deviation of starting salaries is about $400.

2. (i) What size sample must be drawn from a population of size N = 800 in order to

estimate the proportion with a given blood group to within 0.04 (i.e. an absolute

error of 4%) with probability 0.95?

(ii) What sample size is needed if we know that the blood group is present in no more

than 30% of the population?


1.4 Simple Random Sampling: Inference over Subpopulations.

Consider a population of $N$ individuals and a simple random sample of $n$ individuals, on which measurements $Y_i$, $i = 1, \cdots, N$ are defined.

Separate estimates might be wanted for one of a number of subclasses {C1, C2, · · · } which

are subsets (“domains of study”) of the population (sampling frame). For example,

POPULATION (SAMPLING FRAME) SUBPOPULATION

Australian population unemployed Queenslanders

retailers supermarkets

the employed the employed working overtime

Denote the total number of items in class $C_j$ by $N_j$. $N_j$ is generally unknown, but the techniques of §7.2–7.3 tell us that we may estimate $N_j$ by
$$\hat{N}_j = N n_j/n,$$
where $n_j/n$ is the proportion of the total sample falling into $C_j$.

The same technique as in §7.3 can be attempted to estimate
$$\bar{Y}_{C_j} = \sum_{i\in C_j} Y_i\Big/N_j \qquad \text{and} \qquad Y_{C_j} = \sum_{i\in C_j} Y_i.$$

Put
$$Y_i' = \begin{cases} Y_i & \text{if } i \in C_j \\ 0 & \text{if } i \notin C_j \end{cases}$$
and correspondingly for the sample values:
$$y_i' = \begin{cases} y_i & \text{if } i \in C_j \\ 0 & \text{if } i \notin C_j. \end{cases}$$

Then $\bar{y}'$ estimates $\bar{Y}' = \sum_{i=1}^N Y_i'/N$, where $\bar{y}' = \sum_{i=1}^n y_i'/n$.

Thus we have $N\bar{Y}' = Y_{C_j}$, so $N\bar{y}'$ estimates $Y_{C_j}$, and $\hat{\bar{Y}}_{C_j} = N\bar{y}'/N_j$ estimates $\bar{Y}_{C_j}$.


The estimators are unbiased.

$$\operatorname{Var}\hat{\bar{Y}}_{C_j} = \left(\frac{N}{N_j}\right)^2 \operatorname{Var}\bar{y}' = \left(\frac{N}{N_j}\right)^2 \frac{(S'_{C_j})^2}{n}\left(1 - \frac{n}{N}\right),$$
where $(S'_{C_j})^2$ can be estimated by $(s'_{C_j})^2$:
$$(S'_{C_j})^2 = \frac{\sum_{i=1}^N (Y_i' - \bar{Y}')^2}{N-1}, \qquad (s'_{C_j})^2 = \frac{\sum_{i=1}^n (y_i' - \bar{y}')^2}{n-1}.$$

Example.

There are 200 children in a village. One dentist takes a simple random sample of 20 and

finds 12 children with at least one decayed tooth and a total of 42 decayed teeth. Another

dentist quickly checks all 200 children and finds 60 with no decayed teeth.

Estimate the total number of decayed teeth.

$C_1$: children with at least one decayed tooth, $N_1 = 140$; $C_2$: children with no decayed teeth, $N_2 = 60$.

$n = 20$, $n_1 = 12$ ($n_2 = 8$), and
$$\sum_{j\in C_1} y_j = 42 = \sum_{j=1}^n y_j'.$$

Thus $\bar{y}' = 42/20 = 2.1$.

The estimate for the total number of decayed teeth, YC1 , is Ny′ = 200 × 2.1 = 420,

ignoring the information from the second dentist.
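The $y'$-device can be checked numerically. The per-child counts below are a hypothetical split consistent with the stated totals (12 affected children, 42 decayed teeth in all); only the totals matter for the estimate.

```python
def domain_total(sample_values, in_domain, N):
    """Estimate a subdomain total from an SRS of the whole population,
    by zeroing observations outside the domain (the y' device)."""
    n = len(sample_values)
    y_prime = [y if c else 0 for y, c in zip(sample_values, in_domain)]
    return N * sum(y_prime) / n

# hypothetical split: 8 children with no decay, 12 with 42 decayed teeth in total
decayed = [0] * 8 + [3, 4, 2, 5, 3, 4, 6, 3, 4, 2, 3, 3]
flags = [False] * 8 + [True] * 12
estimate = domain_total(decayed, flags, N=200)
```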


If $N_j$ is not known, we can estimate the total $Y_{C_j}$ by $N\bar{y}'$, but we cannot estimate $\bar{Y}_{C_j}$ by $(N/N_j)\bar{y}'$. It is natural to consider estimating it by $(n/n_j)\bar{y}'$, that is, taking
$$\hat{\bar{Y}}_{C_j} = \bar{y}_{C_j} = \sum_{i\in C_j} y_i\Big/n_j = (n/n_j)\bar{y}'.$$

Consider the properties of this estimator:

Theorem 1

$$E\bar{y}_{C_j} = \bar{Y}_{C_j}, \qquad \operatorname{Var}\bar{y}_{C_j} \approx \left(\frac{N}{N_j}\right)^2 \frac{S^2_{C_j}}{n}\left(1 - \frac{n}{N}\right),$$
where $S^2_{C_j} = \dfrac{\sum_{i\in C_j}(Y_i - \bar{Y}_{C_j})^2}{N-1}$, provided we define $E\bar{y}_{C_j}$ and $\operatorname{Var}\bar{y}_{C_j}$ appropriately when $n_j = 0$.

Note $S^2_{C_j}$ can be estimated by
$$s^2_{C_j} = \frac{\sum_{i\in C_j}(y_i - \bar{y}_{C_j})^2}{n-1}.$$

Proof: Use conditional expectations.
$$E(\bar{y}_{C_j}) = E_{n_j}\{E(\bar{y}_{C_j}\,|\,n_j)\} = E(\bar{y}_{C_j}\,|\,n_j = 0)\Pr(n_j = 0) + \sum_{i\ge 1} E(\bar{y}_{C_j}\,|\,n_j = i)\Pr(n_j = i)$$
$$= \bar{Y}_{C_j}\Pr(n_j = 0) + \sum_{i\ge 1} \bar{Y}_{C_j}\Pr(n_j = i) = \bar{Y}_{C_j},$$
where for $n_j$ fixed, we are looking at a simple random sample of size $n_j$ from $C_j$.

Further, since $E(\bar{y}_{C_j}\,|\,n_j) = \bar{Y}_{C_j}$ is constant,
$$\operatorname{Var}\bar{y}_{C_j} = \underbrace{\operatorname{Var}(E(\bar{y}_{C_j}\,|\,n_j))}_{=\,0} + E(\operatorname{Var}(\bar{y}_{C_j}\,|\,n_j)) = 0 + E(\operatorname{Var}(\bar{y}_{C_j}\,|\,n_j)).$$


Now, if $n_j > 0$,
$$\operatorname{Var}(\bar{y}_{C_j}\,|\,n_j) = \frac{1}{n_j}\,\frac{\sum_{i\in C_j}(Y_i - \bar{Y}_{C_j})^2}{N_j - 1}\left(1 - \frac{n_j}{N_j}\right)$$
— a simple random sample of size $n_j$ from $C_j$. Now use $\dfrac{n_j}{n} \approx \dfrac{N_j}{N}$ (an unbiased estimator):
$$\operatorname{Var}(\bar{y}_{C_j}\,|\,n_j) \approx \left\{\frac{N}{nN_j}\,\frac{\sum_{i\in C_j}(Y_i - \bar{Y}_{C_j})^2}{N_j - 1}\left(1 - \frac{n}{N}\right)\right\} \approx \left\{\left(\frac{N}{N_j}\right)^2 \frac{1}{n}\,S^2_{C_j}\left(1 - \frac{n}{N}\right)\right\}.$$

Corollary: The estimator $\hat{Y}_{C_j} = N_j\bar{y}_{C_j}$ is unbiased for $Y_{C_j}$ and has $\operatorname{Var}\hat{Y}_{C_j} = N_j^2\operatorname{Var}\bar{y}_{C_j}$.

When $N_j$ is known, the estimator $N_j\bar{y}_{C_j}$ for $Y_{C_j}$ uses more information (via $N_j$ and $n_j$) than $N\bar{y}'$ and so should be a better unbiased estimator.

Summary of the Unbiased Estimators of $Y_{C_j}$

                       Estimator             Variance
$N_j$ known or not     $N\bar{y}'$           $N^2\dfrac{(S'_{C_j})^2}{n}\left(1 - \dfrac{n}{N}\right)$
$N_j$ known            $N_j\bar{y}_{C_j}$    $N^2\dfrac{S^2_{C_j}}{n}\left(1 - \dfrac{n}{N}\right)$

where
$$(S'_{C_j})^2 = \frac{\sum_{i=1}^N (Y_i' - \bar{Y}')^2}{N-1}, \qquad S^2_{C_j} = \frac{\sum_{i\in C_j}(Y_i - \bar{Y}_{C_j})^2}{N-1},$$
and $S^2_{C_j} \le (S'_{C_j})^2$.


1.5 Stratified Random Sampling

Description

The population of size N is first divided into subpopulations of N1, N2, · · · , NL units.

The subpopulations, or strata, are mutually exclusive and exhaustive. The Nh’s are assumed

known.

A simple random sample of size nh is drawn from the hth stratum, h = 1, · · · , L.

Total sample size: $\sum_{h=1}^L n_h = n$.

Reasons for Stratification:

1) If estimates of known precision are wanted for certain subdivisions of a population,

and these subpopulations can be sampled from, it is better to treat each subdivision

as a population in its own right, rather than take a simple random sample overall and

from it decide who falls into Cj and who not, where Cj denotes stratum j.

2) Administrative convenience: for example, if separate administrative units exist, such as the states of Australia, it is easier to sample in each individually and then put the information together.

3) A natural stratification may exist.

The theory consists of

(i) Properties of estimates for the population mean from a stratified sample, for a given

n1, n2, · · · , nL .

(ii) Choice of nh’s to obtain maximum precision, for a given total sample size n, or,

more generally, a given total cost.

(iii) For a specified precision choice of the minimum n and corresponding nh’s .


Properties of the Estimates

Further notation:

Population                                    Sample
$N_h$                                         $n_h$
$Y_{hi}$, $i = 1, \cdots, N_h$                $y_{hi}$, $i = 1, \cdots, n_h$   ($i$th unit of the sample from stratum $h$)
$\bar{Y}_h = \sum_{i=1}^{N_h} Y_{hi}/N_h$     $\bar{y}_h = \sum_{i=1}^{n_h} y_{hi}/n_h$
$S^2_h$                                       $s^2_h$

Also $W_h = \dfrac{N_h}{N}$ is the stratum weight and $f_h = \dfrac{n_h}{N_h}$ is the sampling fraction for stratum $h$.

Basic quantities of interest: the population mean and/or total. $\bar{Y}$ can be written $\dfrac{\sum_{h=1}^L N_h\bar{Y}_h}{N}$, and as usual $Y = N\bar{Y}$.

Estimator for $\bar{Y}$:
$$\hat{\bar{Y}} = \bar{y}_{st} = \sum_{h=1}^L \frac{N_h\bar{y}_h}{N}, \qquad \hat{Y} = N\bar{y}_{st}.$$

Clearly $E\bar{y}_{st} = \bar{Y}$, since $E\bar{y}_h = \bar{Y}_h$, and the $\bar{y}_h$'s are independent, so
$$\operatorname{Var}\bar{y}_{st} = \sum_{h=1}^L \left(\frac{N_h}{N}\right)^2 \operatorname{Var}\bar{y}_h = \sum_{h=1}^L \left(\frac{N_h}{N}\right)^2 \frac{S^2_h}{n_h}\left(1 - \frac{n_h}{N_h}\right) = \sum_{h=1}^L \frac{W_h^2 S_h^2}{n_h} - \frac{1}{N}\sum_{h=1}^L W_h S_h^2.$$

Proportional Allocation for given total sample size $n$:
$$\frac{N_h}{N} = \frac{n_h}{n}, \qquad \text{that is,} \qquad n_h = n\frac{N_h}{N},$$
$$\operatorname{Var}_{prop}\bar{y}_{st} = \frac{1}{n}\left(1 - \frac{n}{N}\right)\sum_{h=1}^L W_h S_h^2.$$
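The estimator $\bar{y}_{st}$ and its estimated variance ($s^2_h$ replacing $S^2_h$) can be computed stratum by stratum; the data and function name below are illustrative, not from the notes.

```python
def stratified_estimate(samples, Nh):
    """ybar_st and its estimated variance for stratified SRS.
    samples: one list of observations per stratum; Nh: stratum sizes."""
    N = sum(Nh)
    yst, var = 0.0, 0.0
    for data, size in zip(samples, Nh):
        nh = len(data)
        ybar_h = sum(data) / nh
        s2_h = sum((y - ybar_h) ** 2 for y in data) / (nh - 1)
        Wh = size / N                                    # stratum weight
        yst += Wh * ybar_h
        var += Wh ** 2 * s2_h / nh * (1 - nh / size)     # with per-stratum f.p.c.
    return yst, var

yst, var = stratified_estimate([[2, 4], [5, 7]], Nh=[40, 60])
```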


Optimal Allocation

Optimization

Let $K$ denote a cost function where
$$K = K_0 + \sum_{h=1}^L K_h n_h, \qquad K_0 \ge 0,\; K_h > 0,\; h = 1, \cdots, L.$$
Let
$$V = \operatorname{Var}\bar{y}_{st} = \sum_{h=1}^L \frac{W_h^2 S_h^2}{n_h} - \frac{1}{N}\sum_{h=1}^L W_h S_h^2.$$

Note that if we take K0 = 0, Kh = 1, h ≥ 1 then K reduces to the total sample size n.

Theorem 2

For a fixed value of $K$, $V$ is minimized by the choices of $n_h$ where
$$n_h = \text{const}\,\frac{N_h S_h}{\sqrt{K_h}}, \qquad h = 1, \cdots, L.$$
The constant is chosen in accordance with the fixed value of $K$.

For a fixed value of $V$, $K$ is minimized by choices of $n_h$ of the above form, where the constant is chosen in accordance with the fixed value of $V$.

Applications of the Theorem:

(1) Best Precision For Specified Total Sample Size

For fixed total sample size $n$, precision is greatest if
$$n_h = \frac{N_h S_h}{\sum_h N_h S_h}\,n = \left(\frac{W_h S_h}{\sum_h W_h S_h}\right) n, \qquad h = 1, \cdots, L.$$

- Optimal Allocation of Sample Size: First determined by Alexander Chuprov (1923),

rediscovered by Neyman (≈ 1934).

- If total sample size n is fixed and cost of sampling a unit in each stratum is the

same, (1) says: In a given stratum take a larger sample if 1) the stratum is larger;

and/or 2) the variability is greater.

- Need to have estimates of S2h from pilot survey or past census to apply the formula.

- Variance under optimal allocation:
$$\operatorname{Var}_{opt}\bar{y}_{st} = \frac{1}{n}\left(\sum_{h=1}^L W_h S_h\right)^2 - \frac{1}{N}\sum_{h=1}^L W_h S_h^2.$$
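With equal per-unit costs ($K_h = 1$) the optimal (Neyman) allocation is a one-liner; in practice the resulting $n_h$ are rounded to integers and the $S_h$ come from a pilot survey or past census. The function name is ours.

```python
def neyman_allocation(n, Nh, Sh):
    """Allocate total sample size n with n_h proportional to N_h * S_h."""
    products = [N_ * S_ for N_, S_ in zip(Nh, Sh)]
    total = sum(products)
    return [n * p / total for p in products]

# the stratum with the larger spread gets the larger share
alloc = neyman_allocation(40, Nh=[100, 100], Sh=[1, 3])
```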


(2) Minimal Total Sample Size for Specified Precision

If we require
$$\Pr\{|\bar{y}_{st} - \bar{Y}| \le \delta\} \ge 0.95,$$
that is
$$\Pr\left\{\frac{|\bar{y}_{st} - \bar{Y}|}{\sqrt{\operatorname{Var}\bar{y}_{st}}} \le \frac{\delta}{\sqrt{\operatorname{Var}\bar{y}_{st}}}\right\} \ge 0.95,$$
then choose $\dfrac{\delta}{\sqrt{\operatorname{Var}\bar{y}_{st}}} \ge 1.96$, so $\dfrac{\delta^2}{(1.96)^2} \ge \operatorname{Var}\bar{y}_{st}$.

Therefore, take $V = \operatorname{Var}\bar{y}_{st} = \dfrac{\delta^2}{(1.96)^2}$ fixed.

The cost function to be minimized is $\sum_{h=1}^L n_h = n$ (total sample size). Set $K_0 = 0$, $K_h = 1$, so according to the theorem $n_h = cW_h S_h$, where the constant $c$ is to be determined in accordance with the specified size of $V$.

General formula (ignoring the f.p.c.):
$$V = \operatorname{Var}\bar{y}_{st} = \sum_{h=1}^L \frac{W_h^2 S_h^2}{n_h} = \sum_{h=1}^L \frac{W_h^2 S_h^2}{cW_h S_h} = \frac{1}{c}\sum_{h=1}^L W_h S_h.$$
Therefore
$$c = \frac{1}{V}\sum_{h=1}^L W_h S_h = \frac{1}{V}\sum_{j=1}^L W_j S_j,$$
and
$$n_h = \frac{1}{V}\,W_h S_h\left(\sum_{j=1}^L W_j S_j\right), \qquad h = 1, 2, \cdots, L.$$

(3) Confidence Intervals for specified total sample size

Provided each $n_h$ is moderately large,
$$\frac{\bar{y}_{st} - \mu}{\sqrt{\operatorname{Var}\bar{y}_{st}}} \sim N(0, 1),$$
at least approximately. Using $\bar{y}_{st}$ with optimal allocation, for a given total sample size, gives the shortest C.I.

The length of a $(1-\alpha)100\%$ CI is
$$2\,z_{1-\frac{\alpha}{2}}\sqrt{\operatorname{Var}\bar{y}_{st}} = 2\,z_{1-\frac{\alpha}{2}}\sqrt{\operatorname{Var}_{opt}\bar{y}_{st}}.$$


Example (Cochran (1963))

A sample of U.S. colleges was drawn to estimate 1947 enrolments in 196 Teachers' Colleges. These colleges were separated into 6 strata. The standard deviation for college enrolments, $s_j$, in a given stratum was estimated by the 1943 value.

Stratum   $N_j$   $s_j$
1          13     325
2          18     190
3          26     189
4          42      82
5          73      86
6          24     190
Total     196

The 1943 total enrolment was 56,472. For the 1947 study the aim is to obtain a standard error/mean ratio of approximately 5%. Determine the sample size required and the optimal allocation of that sample to the strata.
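One way to work the example is sketched below, retaining the f.p.c. term of $\operatorname{Var}_{opt}$; dropping it, or rounding the allocation differently, changes $n$ slightly, so published solutions may differ by a unit or two.

```python
import math

Nh = [13, 18, 26, 42, 73, 24]
Sh = [325, 190, 189, 82, 86, 190]
N = sum(Nh)
Wh = [size / N for size in Nh]

Ybar = 56472 / N                              # 1943 mean enrolment per college
V = (0.05 * Ybar) ** 2                        # target variance: s.e./mean = 5%

A = sum(W * S for W, S in zip(Wh, Sh))        # sum of W_h S_h
B = sum(W * S**2 for W, S in zip(Wh, Sh))     # sum of W_h S_h^2

# solve Var_opt = A^2/n - B/N = V for n, then allocate by Neyman
n = math.ceil(A**2 / (V + B / N))
alloc = [n * W * S / A for W, S in zip(Wh, Sh)]
```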


2 Ratio and Regression Estimators.

2.1 Case Study G.P.’s

A simple random sample of GP's in NSW is taken, in which each GP fills out an encounter recording form for each of 100 successive patient encounters. Some of the quantities to be estimated, for any specific health problem (for example, hypertension):

1) the average number of occurrences of this health problem for 100 encounters;

2) the average number of occurrences of this problem per 100 health problems.

(Each encounter may involve up to 4 health problems.)

Imagine for the population of N GP’s in NSW, two numbers are attached to the ith GP :

Yi : the number of occurrences of the health problem hypertension for 100 encounters;

Xi : the number of problems treated in 100 encounters for that GP;

1) amounts to estimating $\bar{Y}$;

2) needs estimation of $100 \times \left(\sum Y_i\big/\sum X_i\right)$, i.e. estimation of
$$R = Y/X = \bar{Y}/\bar{X}.$$

2.2 Two Characteristics/Unit. Simple Random Sampling.

Theorem 3

If Xi and Yi are a pair of numerical characteristics defined on every unit of the

population, and y and x are the corresponding means from a simple random sample

without replacement of size n , then

$$\operatorname{Cov}(\bar{x}, \bar{y}) = \frac{1}{n}\,\frac{\sum_{i=1}^N (Y_i - \bar{Y})(X_i - \bar{X})}{N-1}\left(1 - \frac{n}{N}\right) \tag{2.1}$$
and
$$E\left[\frac{\sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x})}{n-1}\right] = \frac{\sum_{i=1}^N (Y_i - \bar{Y})(X_i - \bar{X})}{N-1}. \tag{2.2}$$


Proof. Consider the unit characteristic defined as $U_i = X_i + Y_i$. The corresponding sample values are $u_i = x_i + y_i$, and clearly
$$\operatorname{Var}\bar{u} = \frac{S_U^2}{n}\left(1 - \frac{n}{N}\right) = \frac{1}{n}\left\{\frac{\sum_{i=1}^N (U_i - \bar{U})^2}{N-1}\right\}\left(1 - \frac{n}{N}\right).$$
Now
$$\sum_{i=1}^N (U_i - \bar{U})^2 = \sum_{i=1}^N (X_i - \bar{X} + Y_i - \bar{Y})^2 = \sum_{i=1}^N (X_i - \bar{X})^2 + \sum_{i=1}^N (Y_i - \bar{Y})^2 + 2\sum_{i=1}^N (X_i - \bar{X})(Y_i - \bar{Y}).$$
Thus
$$\operatorname{Var}(\bar{u}) = \operatorname{Var}(\bar{x} + \bar{y}) = \operatorname{Var}(\bar{x}) + \operatorname{Var}(\bar{y}) + \frac{2}{n}\left\{\frac{\sum_{i=1}^N (X_i - \bar{X})(Y_i - \bar{Y})}{N-1}\right\}\left(1 - \frac{n}{N}\right).$$
But
$$\operatorname{Var}\bar{u} = \operatorname{Var}(\bar{x} + \bar{y}) = \operatorname{Var}(\bar{x}) + \operatorname{Var}(\bar{y}) + 2\operatorname{Cov}(\bar{x}, \bar{y}).$$
(2.2) is proved in a similar way, beginning with
$$E\left[\frac{\sum_{i=1}^n (u_i - \bar{u})^2}{n-1}\right] = \frac{\sum_{i=1}^N (U_i - \bar{U})^2}{N-1}.$$

Frequently, in this two variable per unit situation, we are interested in measuring the ratio

of the two variables.

Examples of this kind occur frequently when the sampling unit comprises a group or cluster of individuals, and our interest is in the population mean per individual. For example, the survey might be a household survey and we are interested in the average expenditure on cosmetics per adult female in the population; or the average number of hours per week spent watching television per child aged 10–15; or the average number of suits per adult male.


In order to estimate the last of these items, we would record for the $i$th household ($i = 1, \cdots, n$), i.e. the $i$th sample unit, the number of adult males who live there, $x_i$, and the total number of suits they possess, $y_i$. The population parameter to be estimated is clearly
$$R = \frac{\text{total no. of suits}}{\text{total no. of adult males}} = \frac{\sum_{i=1}^N Y_i}{\sum_{i=1}^N X_i}.$$
The corresponding sample estimate is
$$\hat{R} = r = \frac{\sum_{i=1}^n y_i}{\sum_{i=1}^n x_i} = \frac{\bar{y}}{\bar{x}}.$$

Theorem 4

(a) $E(r) - R \approx 0$ for large sample size.

(b) For large samples,
$$\operatorname{Var} r \approx \frac{1}{n\bar{X}^2}\left(\frac{\sum_{i=1}^N (Y_i - RX_i)^2}{N-1}\right)\left(1 - \frac{n}{N}\right).$$
(Thus $\sqrt{\operatorname{Var} r}$ is the standard error of the estimator $r$ of $R$.)

Proof.

a) Recall $E\bar{y} = \bar{Y}$, $E\bar{x} = \bar{X}$ and $\operatorname{Var}(\bar{x}) = O(n^{-1})$. Thus
$$Er = E\left(\frac{\bar{y}}{\bar{x}}\right) \approx \frac{E\bar{y}}{\bar{X}} = R.$$

b) Note
$$r - R = \frac{\bar{y}}{\bar{x}} - R \approx \frac{\bar{y} - R\bar{x}}{\bar{X}}.$$
Thus, for large sample sizes,
$$\operatorname{Var} r \approx E\{(r - R)^2\} \approx \frac{1}{\bar{X}^2}E\{(\bar{y} - R\bar{x})^2\}.$$
Now $\bar{y} - R\bar{x}$ is the sample mean $\bar{d}$ of the sample r.v.'s $d_i = y_i - Rx_i$, $i = 1, \cdots, n$, from the population $D_i = Y_i - RX_i$, $i = 1, \cdots, N$. Therefore
$$\operatorname{Var}(r) = \frac{E(\bar{d}^2)}{\bar{X}^2} = \operatorname{Var}(\bar{d})/\bar{X}^2, \qquad \text{since} \quad E\bar{d} = \bar{D} = \frac{\sum_{i=1}^N (Y_i - RX_i)}{N} = 0.$$


From Section 7.2 on simple random sampling,
$$\operatorname{Var}\bar{d} = \frac{S_D^2}{n}\left(1 - \frac{n}{N}\right) \qquad \text{where} \qquad S_D^2 = \frac{\sum_{i=1}^N (D_i - \bar{D})^2}{N-1} = \frac{\sum_{i=1}^N (Y_i - RX_i)^2}{N-1},$$
which completes the proof.

The estimate of $\operatorname{Var}(r)$ generally used is the 'natural' one:
$$\widehat{\operatorname{Var}}(r) = \frac{1}{n\bar{x}^2}\left\{\frac{\sum_{i=1}^n (y_i - rx_i)^2}{n-1}\right\}\left(1 - \frac{n}{N}\right),$$
which can be shown to be almost unbiased if $n$ is large.

For hand computation it is quickest to use
$$\frac{\sum_{i=1}^n (y_i - rx_i)^2}{n-1} = \frac{\sum_{i=1}^n (y_i - \bar{y})^2}{n-1} - 2r\,\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n-1} + r^2\,\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} = s_y^2 - 2r\,s_{xy} + r^2 s_x^2.$$
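A direct implementation of $r$ and its estimated standard error (the function name and data are ours, for illustration):

```python
import math

def ratio_estimate(x, y, N):
    """r = ybar/xbar and its estimated s.e. for an SRS of size n from N units."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    r = ybar / xbar
    # mean squared residual about the ratio line y = r*x
    sd2 = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
    var = sd2 / (n * xbar**2) * (1 - n / N)
    return r, math.sqrt(var)

# y is exactly 2x, so the residuals (and the s.e.) vanish
r, se = ratio_estimate([1, 2, 3, 4], [2, 4, 6, 8], N=100)
```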

2.3 Ratio Estimate of the Population Total

The above has focused on the estimation of $R = \dfrac{Y}{X} = \dfrac{\bar{Y}}{\bar{X}}$, and indeed this is, as in the Case Study (§8.1) of G.P.'s, the quantity of primary interest.

On the other hand, we see from the above that an estimator of the population total $Y$ which takes into account the $x$ values is
$$\hat{Y}_R = \frac{\bar{y}}{\bar{x}}\,X = rX.$$
This is known as the ratio estimate of the population total. Correspondingly, $\hat{\bar{Y}}_R = \dfrac{\bar{y}}{\bar{x}}\,\bar{X} = r\bar{X}$.

In this estimate we are using not only the information in the observed $Y_i$ values, $y_i$, $i = 1, \cdots, n$, but also the corresponding covariates $x_i$, $i = 1, \cdots, n$ (as well as the, hopefully known, $X$ or $\bar{X}$). In general we expect this extra information to increase the precision of $\hat{Y}_R$ over the usual $\hat{Y} = N\bar{y}$.

From Theorem 4,

(a) $E\hat{\bar{Y}}_R = \bar{X}Er \approx \bar{X}R = \bar{Y}$. (Similarly for $E\hat{Y}_R$.)

(b) $\operatorname{Var}\hat{\bar{Y}}_R \approx \dfrac{1}{n}\left(\dfrac{\sum_{i=1}^N (Y_i - RX_i)^2}{N-1}\right)\left(1 - \dfrac{n}{N}\right)$.

The variance is estimated by
$$\frac{1}{n}\left\{\frac{\sum_{i=1}^n (y_i - rx_i)^2}{n-1}\right\}\left(1 - \frac{n}{N}\right).$$

Bias.

The estimator $r$ for $R$ is generally biased, so $\hat{Y}_R$ and $\hat{\bar{Y}}_R$ are biased also for $Y$ and $\bar{Y}$. The question of bias is quite important since it influences to what extent variance determines accuracy. Recall that if $\hat{\alpha}$ is an estimator of $\alpha$, then the bias is defined by $E\hat{\alpha} - \alpha$. The mean-squared error is
$$E((\hat{\alpha} - \alpha)^2) = E((\hat{\alpha} - E\hat{\alpha} + E\hat{\alpha} - \alpha)^2) = E((\hat{\alpha} - E\hat{\alpha})^2) + (E\hat{\alpha} - \alpha)^2 + 2E\{(\hat{\alpha} - E\hat{\alpha})(E\hat{\alpha} - \alpha)\} = \operatorname{Var}\hat{\alpha} + \text{bias}^2,$$
since the cross term vanishes.

Bias measurement for $\hat{R} = r$.

Note
$$\operatorname{Cov}(r, \bar{x}) = E(r\bar{x}) - Er\,E\bar{x} = E\left(\frac{\bar{y}}{\bar{x}}\,\bar{x}\right) - Er\,E\bar{x},$$
so
$$Er = \frac{E\bar{y}}{E\bar{x}} - \frac{\operatorname{Cov}(r, \bar{x})}{E\bar{x}} = R - \rho_{r,\bar{x}}\,\frac{\sqrt{\operatorname{Var} r\,\operatorname{Var}\bar{x}}}{E\bar{x}}.$$
Therefore $R - Er = \rho_{r,\bar{x}}\,\dfrac{\sqrt{\operatorname{Var} r\,\operatorname{Var}\bar{x}}}{E\bar{x}}$, and
$$\frac{|\text{bias in } r|}{\sigma_r} = \frac{|R - Er|}{\sqrt{\operatorname{Var}(r)}} \le \frac{\sqrt{\operatorname{Var}\bar{x}}}{\bar{X}} = \text{coefficient of variation of } \bar{x},$$
since $|\rho_{r,\bar{x}}| \le 1$. This ratio is small if $n$ is large, since $\operatorname{Var}(\bar{x}) = \dfrac{S_X^2}{n}\left(1 - \dfrac{n}{N}\right)$.

Thus if the coefficient of variation of $\bar{x}$ is small, the bias of $\hat{R}$ is negligible relative to the s.e. $\sigma_r$ of $\hat{R}$. But if $n$ is small, the bias can be large.


2.4 The Hartley-Ross Estimator of $R$, $\bar{Y}$ and $Y$ when $\bar{X}$ is known

We saw that $r$ as an estimate of $R$ is biased, and that we could estimate $Y$ by $rX$ if we knew $X$ (§8.3), though this estimator is also biased. The following leads to an unbiased estimator of $R$.

Theorem 5

Let $V = h(X, Y)$ for some fixed function of two variables $h(\cdot, \cdot)$. Define $V_i = h(X_i, Y_i)$ and $v_i = h(x_i, y_i)$. Then
$$E\left\{\bar{v} + \frac{N-1}{N\bar{X}(n-1)}\sum_{i=1}^n v_i(x_i - \bar{x})\right\} = \frac{\sum_{i=1}^N V_iX_i}{N\bar{X}}. \tag{2.3}$$

Proof. The LHS is
$$E\bar{v} + \frac{N-1}{N\bar{X}}\,E\left[\frac{\sum_{i=1}^n v_i(x_i - \bar{x})}{n-1}\right] = \bar{V} + \frac{N-1}{N\bar{X}}\,\frac{\sum_{i=1}^N V_i(X_i - \bar{X})}{N-1}.$$
The last step is from (2.2), using $V_i$ in place of $Y_i$, and $v_i$ in place of $y_i$. Expanding the final sum,
$$\bar{V} + \frac{1}{N\bar{X}}\left(\sum_{i=1}^N V_iX_i - N\bar{V}\bar{X}\right) = \frac{\sum_{i=1}^N V_iX_i}{N\bar{X}},$$
as required.

Now, returning to the problem of estimation of $R$ from sample readings $(x_i, y_i)$, $i = 1, \cdots, n$: assume $X_i > 0$, $i = 1, \cdots, N$, and take the function $h(s, t) = t/s$ in Theorem 5. Then $v_i = y_i/x_i$, $i = 1, \cdots, n$, and $V_i = Y_i/X_i$, $i = 1, 2, \cdots, N$, so from (2.3)
$$E\left\{\bar{r} + \frac{N-1}{N\bar{X}(n-1)}\,n(\bar{y} - \bar{r}\bar{x})\right\} = \frac{Y}{X} = R,$$
where $\bar{r} = \dfrac{1}{n}\sum_{i=1}^n \dfrac{y_i}{x_i}$. Thus an unbiased estimator of $R$ is
$$\hat{R} = \bar{r} + \frac{N-1}{N\bar{X}(n-1)}\,n(\bar{y} - \bar{r}\bar{x}),$$
for which we need to know $\bar{X}$ (or $X = N\bar{X}$). The Hartley-Ross estimator for the population total is


$$\hat{Y} = X\bar{r} + \frac{N-1}{n-1}\,n(\bar{y} - \bar{r}\bar{x}).$$

Finally, if we need to know $\bar{X}$ for this estimator, why don't we just use $\hat{R}_1 = \bar{y}/\bar{X}$? Recall $E\left(\dfrac{\bar{y}}{\bar{X}}\right) = \dfrac{E\bar{y}}{\bar{X}} = \dfrac{\bar{Y}}{\bar{X}} = R$, so $\hat{R}_1$ is unbiased for $R$. However, it does not use as much information, and for small samples we might expect the Hartley-Ross estimator to be better. There is no general result on the comparison of the variances of $\bar{y}/\bar{x}$, $\bar{y}/\bar{X}$, and $\bar{r} + \dfrac{(N-1)n}{N\bar{X}(n-1)}(\bar{y} - \bar{r}\bar{x})$ for all sample sizes. (See Cochran (2nd Ed.), Theorem 6.3, §6.15.)

Example

In a survey of family size ($x_1$), weekly income ($x_2$) and weekly expenditure on food ($y$), we want to estimate the average weekly expenditure on food per family in the most efficient way. A simple random sample of 27 families yields the following data:
$$\sum_i x_{1i} = 109, \quad \sum_i x_{2i} = 16277, \quad \sum_i y_i = 2831, \quad \rho_{x_1,y} = 0.925, \quad \rho_{x_2,y} = 0.573.$$
The sample covariance matrix for $y$, $x_1$ and $x_2$ is
$$\begin{pmatrix} 547.8234 & 26.5057 & 1796.5541 \\ 26.5057 & 1.4986 & 80.1595 \\ 1796.5541 & 80.1595 & 17967.0541 \end{pmatrix}.$$
From the census data, $\bar{X}_1 = 3.91$ and $\bar{X}_2 = 542$.

(a) Estimate the standard errors of the ratio estimators for $\bar{Y}$ using $x_1$ and using $x_2$. Compare these standard errors with the s.e. of the simple estimate ignoring the covariates. Which estimator has the smallest estimated s.e.?

(b) Calculate the best available estimate of the average weekly expenditure on food per

family and give an approximate 95% confidence interval for this average.


2.5 Ratio Estimation for a Subdomain $C_j$ by Simple Random Sampling over a Population.

For some fixed $j$ we want to estimate
$$R_j = \sum_{i\in C_j} Y_i\Big/\sum_{i\in C_j} X_i = \sum_{i=1}^N Y_i'\Big/\sum_{i=1}^N X_i' = R',$$
if we put
$$(Y_i', X_i') = \begin{cases} (Y_i, X_i) & \text{if } i \in C_j \\ (0, 0) & \text{if } i \notin C_j. \end{cases}$$

Hence the natural estimator is (from §8.2)
$$r' = \sum_{i=1}^n y_i'\Big/\sum_{i=1}^n x_i' = \sum_{i\in C_j} y_i\Big/\sum_{i\in C_j} x_i = r_j,$$
where
$$(x_i', y_i') = \begin{cases} (x_i, y_i) & \text{if } i \in C_j \\ (0, 0) & \text{if } i \notin C_j, \end{cases}$$

and
$$\operatorname{Var} r_j = \operatorname{Var} r' \approx \frac{1}{(\bar{X}')^2}\,\frac{S_{D'}^2}{n}\left(1 - \frac{n}{N}\right),$$
where $\bar{X}' = \dfrac{X'}{N}$ can be estimated by $\bar{x}' = \dfrac{\sum_{i=1}^n x_i'}{n} = \dfrac{\sum_{i\in C_j} x_i}{n}$, and
$$S_{D'}^2 = \frac{\sum_{i=1}^N (Y_i' - R'X_i')^2}{N-1}$$
can be estimated by
$$\frac{\sum_{i=1}^n (y_i' - r'x_i')^2}{n-1} = \frac{\sum_{i\in C_j}(y_i - r_jx_i)^2}{n-1}.$$

Note: nj does not come into any of these calculations.


2.6 Ratio Estimation for a Population by Stratified Sampling.

Recall that with stratified sampling we have a total of L subdomains (strata) where we

assume all Nh’s known, and a simple random sample is taken in each subdomain (stratum).

Suppose we want ratio estimates such as the one for NSW doctors, §8.1, for all of Australia

(6 strata: each state is a stratum). We know from §8.2 what to do within each state, for

specified sizes nh, h = 1, · · · , L of the random sample in each state. We can estimate Rh by

$$r_h = \bar{y}_h/\bar{x}_h,$$

and we have its variance from Theorem 4 of §8.2 . But how do we estimate from these rh’s

the ratio
$$R = \frac{Y}{X} = \sum_{h=1}^L Y_h\Big/\sum_{h=1}^L X_h$$
for the whole population?

2.6.1 The ‘Separate’ Ratio Estimate

Suppose the stratum totals $X_h$, $h = 1, \cdots, L$ are known, so $X = \sum_{h=1}^L X_h$ is known also. Then use a weighted average of the $r_h$'s (assuming all $X_h$'s positive):
$$\hat{R}_s = \sum_h r_h\,\frac{X_h}{X}.$$
Since $Er_h \approx R_h = \dfrac{Y_h}{X_h}$,
$$E\hat{R}_s \approx \sum_h Y_h\big/X = R \qquad \text{and} \qquad \operatorname{Var}(\hat{R}_s) = \sum_h \left(\frac{X_h}{X}\right)^2 \operatorname{Var}(r_h).$$

For large stratum sample sizes, $r_h$ will be approximately unbiased for $R_h$ and $\widehat{\operatorname{Var}}(r_h)$ will approximate $\operatorname{Var}(r_h)$ reasonably well.

For moderate and small samples bias is important, and we should consider it here. We see from §8.4 that in a single stratum
$$\frac{|\text{bias } r_h|}{\sigma_{r_h}} \le \frac{\sigma_{\bar{x}_h}}{\bar{X}_h} = \text{c.v. of } \bar{x}_h$$
(c.v. = coefficient of variation).


Consider the bias of $\hat{R}_s$:
$$E(\hat{R}_s - R) = E\left(\left\{\sum_{h=1}^L \frac{X_h}{X}\,r_h\right\} - R\right) = E\sum_{h=1}^L \frac{X_h}{X}(r_h - R_h) = \sum_{h=1}^L \frac{X_h}{X}\,E(r_h - R_h) = \sum_{h=1}^L \frac{X_h}{X}\,(\text{bias } r_h).$$
Therefore, assuming all $X_h$'s positive as usual,
$$|\text{bias }\hat{R}_s| = |E(\hat{R}_s - R)| \le \sum_{h=1}^L \frac{X_h}{X}\,|\text{bias } r_h|.$$
This argument eventually leads to
$$\frac{|E(\hat{R}_s - R)|}{\text{s.e. }\hat{R}_s} \le \sqrt{L}\left(\frac{\max_h \sigma_{r_h}}{\min_h \sigma_{r_h}}\right)\max_h \frac{\sigma_{\bar{x}_h}}{\bar{X}_h}.$$
Therefore the ratio on the LHS can be $\sqrt{L}$ times as large as the $\sigma_{\bar{x}_h}/\bar{X}_h$ bound on individual relative biases, so even if these are individually small, the overall bias can be large.

2.6.2 The ‘Combined’ Ratio Estimate

Defined by:
$$\hat{R}_c = \frac{\sum_h N_h\bar{y}_h}{\sum_h N_h\bar{x}_h} = \frac{\bar{y}_{st}}{\bar{x}_{st}} = \frac{\hat{\bar{Y}}}{\hat{\bar{X}}} = \frac{\hat{Y}}{\hat{X}}.$$
This estimate, in contrast to $\hat{R}_s$, does not require knowledge of the individual $X_h$'s.

Theorem 6

$$\operatorname{Var}\hat{R}_c \approx \frac{1}{\bar{X}^2}\sum_{h=1}^L \left(\frac{N_h}{N}\right)^2 \frac{1}{n_h}\left\{\frac{\sum_{i=1}^{N_h}\left(Y_{ih} - \bar{Y}_h - R(X_{ih} - \bar{X}_h)\right)^2}{N_h - 1}\right\}\left(1 - \frac{n_h}{N_h}\right).$$

Proof. First,
$$\hat{R}_c - R = \frac{\bar{y}_{st}}{\bar{x}_{st}} - R = \frac{1}{\bar{x}_{st}}(\bar{y}_{st} - R\bar{x}_{st}) = \frac{1}{\bar{x}_{st}}\,\frac{\sum_{h=1}^L N_h(\bar{y}_h - R\bar{x}_h)}{N} = \frac{1}{\bar{x}_{st}}\left(\frac{\sum_{h=1}^L N_h\bar{d}_h}{N}\right),$$
where $d_{ih} = y_{ih} - Rx_{ih}$, $i = 1, \cdots, n_h$, and $\bar{d}_h = \dfrac{\sum_{i=1}^{n_h} d_{ih}}{n_h}$. Thus
$$\hat{R}_c - R = \frac{1}{\bar{x}_{st}}\,\bar{d}_{st} \approx \frac{1}{\bar{X}}\,\bar{d}_{st},$$
where on each individual in the $h$th stratum we consider a measurement of the form $D_{ih} = Y_{ih} - RX_{ih}$. Therefore, recalling the ideas used in Theorem 4,
$$\operatorname{Var}\hat{R}_c \approx \frac{1}{\bar{X}^2}\operatorname{Var}\bar{d}_{st} \approx \frac{1}{\bar{X}^2}\sum_{h=1}^L \left(\frac{N_h}{N}\right)^2 \frac{S_h^2(D)}{n_h}\left(1 - \frac{n_h}{N_h}\right),$$
where $S_h^2(D) = \dfrac{\sum_{i=1}^{N_h}(D_{ih} - \bar{D}_h)^2}{N_h - 1}$, which in general gives the answer. Note that here, typically, $\bar{D}_h \ne 0$.

If we write
$$S_h^2(D) = \frac{\sum\{Y_{ih} - \bar{Y}_h - R(X_{ih} - \bar{X}_h)\}^2}{N_h - 1} = S_{Y_h}^2 + R^2S_{X_h}^2 - 2RS_{X_hY_h},$$
this can be estimated by
$$\frac{\sum_{i=1}^{n_h}(d_{ih} - \bar{d}_h)^2}{n_h - 1} = \frac{\sum_{i=1}^{n_h}\left(y_{ih} - \bar{y}_h - \hat{R}_c(x_{ih} - \bar{x}_h)\right)^2}{n_h - 1} = s_{y_h}^2 + \hat{R}_c^2 s_{x_h}^2 - 2\hat{R}_c s_{x_hy_h}.$$

There is less risk of bias in $\hat{R}_c$ than in $\hat{R}_s$. We can show that
$$\frac{|E\hat{R}_c - R|}{\sqrt{\operatorname{Var}\hat{R}_c}} \le \max_h\left(\frac{\sigma_{\bar{x}_h}}{\bar{X}_h}\right),$$
in contrast to the final expression in §8.6.1.
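Both stratified ratio estimates in one place (a sketch with hypothetical strata; note $\hat{R}_s$ needs the stratum totals $X_h$, while $\hat{R}_c$ needs only the $N_h$):

```python
def separate_ratio(samples, Xh):
    """R_s = sum_h (X_h/X) * r_h, with r_h = ybar_h/xbar_h = sum(ys)/sum(xs)."""
    X = sum(Xh)
    return sum((Xh_h / X) * (sum(ys) / sum(xs)) for (xs, ys), Xh_h in zip(samples, Xh))

def combined_ratio(samples, Nh):
    """R_c = ybar_st / xbar_st."""
    num = sum(N_ * sum(ys) / len(ys) for (xs, ys), N_ in zip(samples, Nh))
    den = sum(N_ * sum(xs) / len(xs) for (xs, ys), N_ in zip(samples, Nh))
    return num / den

strata = [([1, 2], [2, 4]), ([2, 4], [2, 4])]   # (x-sample, y-sample) per stratum
Rs = separate_ratio(strata, Xh=[15, 30])
Rc = combined_ratio(strata, Nh=[10, 10])
```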

2.7 The Regression Estimator

In this section, we shall only be concerned with the situation where we have a simple random

sample (xi, yi), i = 1, · · · , n from the population and we wish to estimate the population total

Y or the population mean, Y .

If these sample values are plotted in the (x, y) plane it may become apparent (or, indeed,

may be suspected from other considerations) that the relation between the X and Y

measurements is approximately linear.


If the line appears to pass through the origin, and $X$ is known, it is sensible to use the ratio estimate (§8.3) for $Y$ or $\bar{Y}$, since the estimates
$$\hat{Y}_R = \frac{\bar{y}}{\bar{x}}\,X \qquad \text{and} \qquad \hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}}\,\bar{X}$$
reflect this linear relation through the origin, as well as making use of the knowledge of $X$.

On the other hand, if the relation is roughly a line, but the line does not pass through the origin, perhaps we can do better. We know that presupposing a relation
$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \cdots, n,$$
for the obtained sample $(x_i, y_i)$, $i = 1, \cdots, n$, leads by least squares to
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}, \qquad \hat{\beta} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\Big/\sum_{i=1}^n (x_i - \bar{x})^2 = s_{xy}/s_x^2,$$
so the fitted values $\hat{y}_i$ are given by
$$\hat{y}_i = \bar{y} + \hat{\beta}(x_i - \bar{x}), \qquad i = 1, \cdots, n,$$
and for any specified $X_i$ value, $i = 1, \cdots, N$, the estimated $Y_i$ value is
$$\hat{Y}_i = \bar{y} + \hat{\beta}(X_i - \bar{x}).$$
It follows that the population total $Y$ is naturally estimated by
$$\hat{Y}_{LR} = \sum_{i=1}^N \hat{Y}_i = N\bar{y} + \hat{\beta}(X - N\bar{x}) = N\bar{y} + \hat{\beta}N(\bar{X} - \bar{x}),$$
and correspondingly
$$\hat{\bar{Y}}_{LR} = \bar{y} + \hat{\beta}(\bar{X} - \bar{x}).$$

This, of course also makes use of the known total X (or X).
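The linear regression estimator of $\bar{Y}$, as a sketch (with the perfectly linear hypothetical sample below, the estimator recovers the underlying line exactly):

```python
def regression_estimate(x, y, Xbar):
    """ybar_LR = ybar + beta_hat*(Xbar - xbar), with beta_hat = s_xy/s_x^2."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    sx2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    return ybar + (sxy / sx2) * (Xbar - xbar)

est = regression_estimate([1, 2, 3], [3, 5, 7], Xbar=2.5)   # sample follows y = 1 + 2x
```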


Theorem 7

(a) Bias $= E\hat{Y}_{LR} - Y = -N\operatorname{Cov}(\hat{\beta}, \bar{x})$.

(b) $\operatorname{Var}(\hat{Y}_{LR}) \approx \dfrac{N^2}{n}\,S_Y^2\left(1 - \dfrac{n}{N}\right)(1 - \rho_{X,Y}^2) = \operatorname{Var}(\hat{Y})(1 - \rho_{X,Y}^2)$,

where $\rho_{X,Y} = \sum_{i=1}^N (X_i - \bar{X})(Y_i - \bar{Y})\Big/\sqrt{\sum_{i=1}^N (X_i - \bar{X})^2\sum_{i=1}^N (Y_i - \bar{Y})^2}$.

This shows that in general LR estimation is less variable than the estimate Y = Ny

which does not use the covariate observation.

Proof.

(a) $E\hat{Y}_{LR} = NE\bar{y} + NE\{\hat{\beta}(\bar{X} - \bar{x})\}$, i.e.
$$E\hat{Y}_{LR} - Y = -NE\{\hat{\beta}(\bar{x} - E\bar{x})\} = -N\operatorname{Cov}\{\hat{\beta}, \bar{x}\}.$$

(b) For large $n$ and $N$,
$$\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})/(n-1)}{\sum_{i=1}^n (x_i - \bar{x})^2/(n-1)} = \frac{s_{xy}}{s_x^2} \approx \frac{S_{XY}}{S_X^2} = \frac{S_Y}{S_X}\,\rho_{X,Y} = \beta, \text{ say}.$$
Now $\operatorname{Var}\hat{Y}_{LR} \approx N^2(\operatorname{Var}\bar{y} + \beta^2\operatorname{Var}\bar{x} - 2\beta\operatorname{Cov}(\bar{y}, \bar{x}))$, approximating $\hat{\beta}$ by $\beta$. From Theorem 3, equation (2.1),
$$\operatorname{Var}\hat{Y}_{LR} = N^2\left\{\operatorname{Var}\bar{y} + \beta^2\operatorname{Var}\bar{x} - 2\beta\,\frac{1}{n}\,\frac{\sum_{i=1}^N (Y_i - \bar{Y})(X_i - \bar{X})}{N-1}\left(1 - \frac{n}{N}\right)\right\}$$
$$= N^2\left\{\frac{S_Y^2}{n}\left(1 - \frac{n}{N}\right) + \frac{S_Y^2}{S_X^2}\,\rho_{X,Y}^2\,\frac{S_X^2}{n}\left(1 - \frac{n}{N}\right) - 2\,\frac{S_Y}{S_X}\,\rho_{X,Y}\,\frac{S_{XY}}{n}\left(1 - \frac{n}{N}\right)\right\}$$
$$= N^2\,\frac{S_Y^2}{n}\left(1 - \frac{n}{N}\right)(1 - \rho_{X,Y}^2).$$


3 Systematic and Cluster Sampling.

3.1 Systematic Sampling

This is a quick and easy method for selecting a sample when the sampling frame is sequenced.

Randomly select one of the first k units, then select every kth unit thereafter.

The sample size is determined by $N$ and the choice of $k$: $n = [N/k]$ or $[N/k] + 1$. The $k$ possible samples will only be of equal size if $N$ is a multiple of $k$. If $N = 23$ and $k = 5$ then the possible samples are the units numbered:

the possible samples are units numbered:

S1 S2 S3 S4 S5

1 2 3 4 5

6 7 8 9 10

11 12 13 14 15

16 17 18 19 20

21 22 23

For simplicity we will restrict attention to the case $N = nk$. (If $N$ is large then the following results will be approximately true.) Under systematic sampling only $k$ of the $\binom{N}{n}$ possible samples under SRS are considered. We expect systematic sampling to be beneficial if the $k$ samples are representative of the population. This might be achieved by

selecting the sequencing variable carefully. For example, suppose we have a list of businesses

giving location, industry, and employment size. We want to estimate the amount of overtime

wages paid. We expect the amount of overtime wages to be correlated with the employment

size so sort the list according to employment size and then draw a systematic sample.

3.1.1 The Sample Mean and its Variance.

Denote the sample by $y_1, \ldots, y_n$. Then
$$\bar{y}_{sys} = \frac{1}{n}\sum_{i=1}^n y_i$$
can take only one of $k$ possible values:
$$\bar{Y}_1 = (Y_1 + Y_{k+1} + \cdots + Y_{N-k+1})/n, \quad \ldots, \quad \bar{Y}_k = (Y_k + Y_{2k} + \cdots + Y_N)/n.$$
Thus
$$E\bar{y}_{sys} = (\bar{Y}_1 + \cdots + \bar{Y}_k)/k = \bar{Y},$$
so the systematic mean is unbiased for the population mean.

Let $Y_{is} = Y_{i+k(s-1)}$, for $i = 1, 2, \ldots, k$ and $s = 1, 2, \ldots, n$. Then
$$\operatorname{Var}(\bar{y}_{sys}) = E(\bar{y}_{sys} - \bar{Y})^2 = \frac{1}{k}\sum_{i=1}^k (\bar{Y}_i - \bar{Y})^2.$$
Recall
$$(N-1)S^2 = \sum_{i=1}^N (Y_i - \bar{Y})^2 = \sum_{i=1}^k \sum_{s=1}^n (Y_{is} - \bar{Y})^2 = \sum_{i=1}^k \sum_{s=1}^n (Y_{is} - \bar{Y}_i)^2 + n\sum_{i=1}^k (\bar{Y}_i - \bar{Y})^2$$
$$= k(n-1)S_w^2 + kn\operatorname{Var}(\bar{y}_{sys}).$$

Thus $\operatorname{Var}(\bar{y}_{sys})$ is the between-cluster variance and $S_w^2$ is the within-cluster variance. Rearranging,
$$\operatorname{Var}(\bar{y}_{sys}) = \frac{N-1}{N}\,S^2 - \frac{N-k}{N}\,S_w^2.$$
Systematic sampling is most useful when the variability within systematic samples is larger than the population variance.

Theorem 8

Systematic sampling leads to a more precise estimator of $\bar{Y}$ than SRS if and only if $S_w^2 > S^2$.

Proof. The variance under SRS is $\frac{S^2}{n}\left(1 - \frac{n}{N}\right)$ and we want

$$\frac{S^2}{n}\left(1 - \frac{n}{N}\right) > \frac{N-1}{N}S^2 - \frac{N-k}{N}S_w^2.$$

Rearranging, and using $N - k = k(n-1)$,

$$k(n-1)S_w^2 > \left[(N-1) - \frac{N-n}{n}\right]S^2 = k(n-1)S^2,$$

which is equivalent to $S_w^2 > S^2$.
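The decomposition above is easy to verify numerically. The following sketch (illustrative data and variable names only, not from the notes) enumerates the $k$ systematic samples from a small population with $N = nk$ and checks that the exact variance of the systematic sample mean matches $\frac{N-1}{N}S^2 - \frac{N-k}{N}S_w^2$:

```python
# Toy population with N = n*k; the values are arbitrary illustrative numbers.
Y = [3.0, 7.0, 2.0, 9.0, 4.0, 6.0, 1.0, 8.0, 5.0, 2.0, 7.0, 3.0]
N = len(Y)
k = 4                      # number of possible systematic samples
n = N // k                 # size of each systematic sample
Ybar = sum(Y) / N

# The k systematic samples: start at unit i, then take every k-th unit.
samples = [[Y[i + k * s] for s in range(n)] for i in range(k)]
means = [sum(s) / n for s in samples]

# Exact variance of the systematic sample mean (each start equally likely).
var_sys = sum((m - Ybar) ** 2 for m in means) / k

S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)
S2w = sum((y - means[i]) ** 2
          for i, s in enumerate(samples) for y in s) / (k * (n - 1))

# Var(ybar_sys) = (N-1)/N * S^2 - (N-k)/N * S_w^2
assert abs(var_sys - ((N - 1) / N * S2 - (N - k) / N * S2w)) < 1e-12
# The k sample means average to Ybar, confirming unbiasedness.
assert abs(sum(means) / k - Ybar) < 1e-12
```

The identity holds for any population values, so the assertions pass regardless of the toy data chosen.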

3.2 Cluster Sampling.

If a population has a natural grouping of units into clusters one way of proceeding is to

perform a SRS of the clusters and then include all units in the selected clusters in the

sample. Systematic sampling has this structure but the distinct ‘clusters’ may not have any

physical significance.

For cluster sampling we do not need a complete population list, just a list of clusters and

then a list of elements for the selected clusters.

Notation. Let $N$ denote the population size, $M$ the number of clusters and $M_i$ the number of elements in the $i$th cluster. Let $Y_{is}$ denote element $s$ in cluster $i$,

$$Y_i^* = \sum_{s=1}^{M_i} Y_{is} \quad\text{and}\quad \bar{Y}_i^* = Y_i^*/M_i.$$

Unbiased estimators for Y.

The population total is $Y = \sum_{i=1}^{M} Y_i^*$ and $N = \sum_{i=1}^{M} M_i$. Draw a SRS of size $m$ from the clusters. The observed cluster totals are $y_1^*, \ldots, y_m^*$ and

$$Ey_i^* = (Y_1^* + \cdots + Y_M^*)/M = Y/M,$$

as each cluster has probability $\frac{1}{M}$ of being drawn under SRS. Thus

$$\hat{Y}_{CL} = \frac{M}{m}\sum_{i=1}^{m} y_i^*$$

is an unbiased estimator of $Y$ and

$$\hat{\bar{Y}}_{CL} = \frac{M}{N}\bar{y}^*$$

is unbiased for $\bar{Y}$. The variance is

$$\mathrm{Var}(\hat{Y}_{CL}) = \frac{M^2}{m}\,\frac{\sum_{i=1}^{M}(Y_i^* - \bar{Y}_M)^2}{M-1}\left(1 - \frac{m}{M}\right),$$

where $\bar{Y}_M = Y/M$ is the average total per cluster. This variance is estimated by

$$\frac{M^2}{m}\,\frac{\sum_{i=1}^{m}(y_i^* - \bar{y}^*)^2}{m-1}\left(1 - \frac{m}{M}\right),$$

i.e. we use the (adjusted) sample variance of the observed cluster totals.
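As a sanity check, one can enumerate every possible SRS of clusters for a tiny population and confirm both the unbiasedness of $\hat{Y}_{CL}$ and the variance formula above. A minimal sketch, with toy cluster totals invented for illustration:

```python
from itertools import combinations

Ystar = [10.0, 14.0, 9.0, 17.0, 12.0]    # cluster totals Y*_i (illustrative)
M, m = len(Ystar), 2
Ytot = sum(Ystar)
YbarM = Ytot / M                          # average total per cluster

# All C(M, m) equally likely cluster samples and the estimator on each.
ests = [M / m * sum(Ystar[i] for i in s)
        for s in combinations(range(M), m)]

Ehat = sum(ests) / len(ests)              # exact E(Yhat_CL)
var_exact = sum((e - Ytot) ** 2 for e in ests) / len(ests)
var_formula = (M ** 2 / m) * sum((y - YbarM) ** 2 for y in Ystar) / (M - 1) \
              * (1 - m / M)

assert abs(Ehat - Ytot) < 1e-9            # unbiased for Y
assert abs(var_exact - var_formula) < 1e-9
```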


3.2.1 Equal Cluster Sizes.

The analysis is simpler when the clusters are of equal size, that is, $M_i = N/M = \bar{M}$, $i = 1, 2, \ldots, M$. In this case $\bar{Y}_M = \bar{M}\bar{Y}$ and the between cluster variance is

$$S_b^2 = \frac{\bar{M}}{M-1}\sum_{i=1}^{M}(\bar{Y}_i^* - \bar{Y})^2 = \frac{\sum_{i=1}^{M}(Y_i^* - \bar{Y}_M)^2}{\bar{M}(M-1)}.$$

The within cluster variance is

$$S_w^2 = \frac{1}{N-M}\sum_{i=1}^{M}\sum_{s=1}^{\bar{M}}(Y_{is} - \bar{Y}_i^*)^2.$$

Recall from ANOVA

$$(N-1)S^2 = (N-M)S_w^2 + (M-1)S_b^2.$$
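The ANOVA identity can be checked directly on any equal-cluster-size population; a short sketch with made-up data:

```python
# M = 3 clusters of equal size Mbar = 2; the values are illustrative only.
clusters = [[2.0, 4.0], [3.0, 9.0], [8.0, 6.0]]
M, Mbar = len(clusters), len(clusters[0])
N = M * Mbar

allY = [y for c in clusters for y in c]
Ybar = sum(allY) / N
S2 = sum((y - Ybar) ** 2 for y in allY) / (N - 1)

means = [sum(c) / Mbar for c in clusters]
S2b = Mbar * sum((mu - Ybar) ** 2 for mu in means) / (M - 1)
S2w = sum((y - means[i]) ** 2
          for i, c in enumerate(clusters) for y in c) / (N - M)

# (N-1) S^2 = (N-M) S_w^2 + (M-1) S_b^2
assert abs((N - 1) * S2 - ((N - M) * S2w + (M - 1) * S2b)) < 1e-12
```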

From SRS we have an unbiased estimator for $S_b^2$,

$$s_b^2 = \frac{1}{\bar{M}(m-1)}\sum_{i=1}^{m}(y_i^* - \bar{y}^*)^2.$$

An unbiased estimator for $S_w^2$ is

$$s_w^2 = \frac{1}{(\bar{M}-1)m}\sum_{i=1}^{m}\sum_{s=1}^{\bar{M}}(y_{is} - \bar{y}_i^*)^2.$$

Both components of $S^2$ can be estimated under cluster sampling, unlike systematic sampling, where we only observe one ‘cluster’ and so cannot estimate the between cluster component.

A sample of m clusters involves mM = n observations. An estimate of Y based on a

SRS of the same size has variance

Var( ˆY ) =S2

mM(1− mM

MM) =

S2

mM(1− m

M).

Var( ˆYCL) =1

mM2(S2

b M)(1− m

M)

1

N2=

S2b

Mm(1− m

M).

ThusVar( ˆYCL)

Var( ˆY )=S2b

S2,

and so the cluster estimator is preferable when S2b < S2. Note S2

b is minimised when all

cluster means are equal, i.e. Y ∗i = Y . i = 1, ..,M. The worst case occurs when S2w = 0 in

which case each cluster consists of identical responses.


3.3 Ratio to Size Estimation.

$\hat{Y}_{CL}$ and $\hat{\bar{Y}}_{CL}$ are unbiased estimators for $Y$ and $\bar{Y}$. We can develop other estimators based on cluster totals and cluster sizes using the ratio estimation approach. Consider the pairs $(Y_1^*, M_1), \ldots, (Y_M^*, M_M)$, of which we observe $(y_1^*, m_1), \ldots, (y_m^*, m_m)$. Consider the ratio estimator for

$$\bar{Y} = \frac{\sum_{i=1}^{M} Y_i^*}{\sum_{i=1}^{M} M_i} = R.$$

The ‘natural’ estimator is

$$r = \hat{\bar{Y}}_R = \frac{\sum_{i=1}^{m} y_i^*}{\sum_{i=1}^{m} m_i}.$$

This estimator does not require knowledge of $M$ or $N$.

$$\mathrm{Var}(\hat{\bar{Y}}_R) \simeq \frac{1}{m\bar{M}^2}\,\frac{\sum_{i=1}^{M}(Y_i^* - RM_i)^2}{M-1}\left(1 - \frac{m}{M}\right),$$

and we can estimate $\sum_{i=1}^{M}(Y_i^* - RM_i)^2/(M-1)$ by $\sum_{i=1}^{m}(y_i^* - rm_i)^2/(m-1)$.

To estimate $Y$ we need to know $N$, as $Y = N\bar{Y}$. Recalling that $R = \bar{Y}$, $\bar{M} = N/M$ and $\bar{Y}_M = \bar{M}\bar{Y}$, we have

$$\mathrm{Var}(\hat{\bar{Y}}_R) \simeq \frac{1}{m\bar{M}^2}\,\frac{\sum_{i=1}^{M}(Y_i^* - \bar{Y}M_i)^2}{M-1}\left(1 - \frac{m}{M}\right)$$

and

$$\mathrm{Var}(\hat{\bar{Y}}_{CL}) \simeq \frac{1}{m\bar{M}^2}\,\frac{\sum_{i=1}^{M}(Y_i^* - \bar{Y}\bar{M})^2}{M-1}\left(1 - \frac{m}{M}\right).$$

The two expressions are very close, with $\bar{Y}M_i$ replaced by $\bar{Y}\bar{M}$ in the second. The two variances are equal if $M_i = \bar{M}$ for all $i$.
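A small worked example of the ratio-to-size estimate and the sampled ingredient of its variance, with hypothetical cluster totals and sizes:

```python
ystar = [12.0, 28.0, 20.0]   # sampled cluster totals y*_i (hypothetical)
msize = [4, 10, 6]           # corresponding cluster sizes m_i
m = len(ystar)

r = sum(ystar) / sum(msize)  # ratio-to-size estimate of Ybar; needs no M or N

# Estimate of sum_i (Y*_i - R M_i)^2 / (M - 1):
s2 = sum((y - r * mi) ** 2 for y, mi in zip(ystar, msize)) / (m - 1)

assert r == 3.0              # 60 / 20
assert s2 == 4.0             # (0 + 4 + 4) / 2
```

Note that $r$ is computed from the sample alone; $M$, $\bar{M}$ and the finite population correction enter only when the variance approximation itself is evaluated.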


4 Sampling with Unequal Probabilities.

Simple random sampling and systematic sampling are schemes where every unit in the population has the same chance of being selected. We now consider unequal probability sampling. We have already encountered an example of this: under stratified sampling the units in stratum $h$ have chance $n_h/N_h$ of being selected. Being able to vary the probabilities across strata under optimal allocation leads to increased accuracy, and this carries over to other situations.

4.1 Sampling with Replacement

Using with-replacement sampling simplifies the calculations, and if the sampling fraction is small this model should give a reasonable approximation to the exact behaviour of the estimators under without-replacement sampling. Let $p_j$ denote the probability of selecting unit $Y_j$ on the $i$th draw, so

$$P(y_i = Y_j) = p_j, \quad j = 1, 2, \ldots, N.$$

The first two moments are

$$Ey_i = \sum_{j=1}^{N} p_j Y_j \quad\text{and}\quad \mathrm{Var}(y_i) = \sum_{j=1}^{N} p_j Y_j^2 - \Big[\sum_{j=1}^{N} p_j Y_j\Big]^2.$$

The Hansen and Hurwitz estimator for the population total $Y$ is

$$\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^{n} \frac{y_i}{p_i}.$$

This estimator is unbiased for $Y$. Under sampling with replacement the $y_i/p_i$ are independent random variables, so

$$\mathrm{Var}(\hat{Y}_{HH}) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(y_i/p_i) = \frac{1}{n}\Big(\sum_{j=1}^{N}\frac{Y_j^2}{p_j} - Y^2\Big).$$

The estimator for the variance is

$$\Big(\sum_{i=1}^{n}(y_i/p_i)^2 - n\hat{Y}_{HH}^2\Big)\Big/\,[n(n-1)].$$

How do we choose the $p_j$ to minimise the variance? We can show that the minimum is achieved by setting $p_i = Y_i/Y$, $i = 1, \ldots, N$. Of course we cannot use these values in practice, as the $Y_i$ (and $Y$) are unknown, but we can look for another variable, $X_i$, that is known and highly correlated with $Y_i$, and use it to construct the selection probabilities.
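Both claims here, that $\hat{Y}_{HH}$ is unbiased and that $p_i \propto Y_i$ drives the variance to zero, can be seen on a three-unit toy population (all numbers invented for illustration):

```python
Y = [4.0, 10.0, 6.0]               # toy population
Ytot = sum(Y)                      # Y = 20

def single_draw_moments(p):
    """Mean and variance of y/p for one draw with P(y = Y_j) = p_j."""
    mean = sum(pj * (Yj / pj) for Yj, pj in zip(Y, p))
    var = sum(Yj ** 2 / pj for Yj, pj in zip(Y, p)) - mean ** 2
    return mean, var

# Any valid p gives E(y/p) = Y, so YHH (an average of such terms) is unbiased.
mean, var = single_draw_moments([0.3, 0.4, 0.3])
assert abs(mean - Ytot) < 1e-12
assert var > 0

# With p_j = Y_j / Y the variance of y/p collapses to zero.
_, var_opt = single_draw_moments([Yj / Ytot for Yj in Y])
assert abs(var_opt) < 1e-9
```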


Set $p_j = X_j/X$ where $X = \sum_{j=1}^{N} X_j$. Then

$$\hat{Y}_{HH} = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{y_i}{x_i}\Big)X.$$

This is known as probability proportional to size (pps) sampling. Note

$$\mathrm{Var}(\hat{Y}_{HH}) = \frac{X^2}{n}\mathrm{Var}\Big(\frac{y_i}{x_i}\Big).$$

Example. We wish to estimate the turnover for a population of farms. For each farm we know $X_i$, the estimated value of agricultural operations (EVAO), which we use as a size variable. From the ABS data set the true value for $Y$ is $20.079m with s.d. $2.314m. From a sample of size 10, $\hat{Y}_{HH}$ is $17.818m with an estimated s.e. of $1.821m.

A natural alternative to pps sampling here would be to use the ratio estimator

$$\hat{Y}_R = X\,\frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i},$$

whereas

$$\hat{Y}_{HH} = X\,\frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}.$$

One is based on the ratio of the averages whilst the other is the average of the ratios.
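The two estimators generally differ; a quick numerical contrast (all numbers hypothetical):

```python
y = [2.0, 6.0, 3.0]   # sampled y values (hypothetical)
x = [1.0, 4.0, 2.0]   # corresponding size values
X = 70.0              # known population total of the size variable
n = len(y)

ratio_of_averages = X * sum(y) / sum(x)                           # YR
average_of_ratios = X * sum(yi / xi for yi, xi in zip(y, x)) / n  # YHH

assert abs(ratio_of_averages - 110.0) < 1e-9      # 70 * 11/7
assert abs(average_of_ratios - 350.0 / 3) < 1e-9  # 70 * (2 + 1.5 + 1.5)/3
```

The two agree exactly only when every $y_i/x_i$ ratio is the same.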


4.2 P.P.S. Sampling without Replacement

If we are sampling without replacement the nature of the population changes after each selection, so if at each step we select using probabilities proportional to size the overall scheme will not necessarily be pps. One way to achieve a pps scheme is to use systematic sampling, where only one random selection is necessary.

Let $\pi_i$ denote the probability that $Y_i$ is in the selected sample. The Horvitz–Thompson estimator for $Y$ is

$$\hat{Y}_{HT} = \sum_{i=1}^{n}\frac{y_i}{\pi_i}.$$

Let

$$\delta_i = \begin{cases} 1 & \text{if } Y_i \text{ is in the sample} \\ 0 & \text{otherwise.} \end{cases}$$

Write

$$\hat{Y}_{HT} = \sum_{i=1}^{N}\delta_i\frac{Y_i}{\pi_i}.$$

Using this form, $E\hat{Y}_{HT} = \sum_{i=1}^{N}(Y_i/\pi_i)E(\delta_i) = Y$ since $E(\delta_i) = \pi_i$, so $\hat{Y}_{HT}$ is unbiased for $Y$.
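Unbiasedness holds for any fixed design with all $\pi_i > 0$. The sketch below specifies a toy design by listing each possible sample and its probability (all numbers invented), derives the $\pi_i$, and confirms $E\hat{Y}_{HT} = Y$:

```python
Y = [5.0, 8.0, 3.0, 9.0]                       # toy population
design = {(0, 1): 0.3, (0, 2): 0.2,            # P(sample s) for each
          (1, 3): 0.25, (2, 3): 0.25}          # possible sample s
assert abs(sum(design.values()) - 1.0) < 1e-12

# First-order inclusion probabilities pi_i.
pi = [sum(p for s, p in design.items() if i in s) for i in range(len(Y))]
assert all(p > 0 for p in pi)

# HT estimate for each possible sample, then its expectation.
est = {s: sum(Y[i] / pi[i] for i in s) for s in design}
E_HT = sum(p * est[s] for s, p in design.items())

assert abs(E_HT - sum(Y)) < 1e-9               # unbiased for Y
```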

Let $\delta_{ij} = 1$ if both $Y_i$ and $Y_j$ are in the sample, and set $\delta_{ij} = 0$ otherwise. Note $\delta_{ii} = \delta_i$. Let $\pi_{ij} = E(\delta_{ij})$. We can use these indicator functions to derive an expression for $\mathrm{Var}(\hat{Y}_{HT})$:

$$\begin{aligned}
\mathrm{Var}(\hat{Y}_{HT}) &= E\Big(\sum_{i=1}^{N}\frac{Y_i}{\pi_i}\delta_i - Y\Big)^2 \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N}\frac{Y_i}{\pi_i}\frac{Y_j}{\pi_j}E(\delta_i\delta_j) - Y^2 \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N} Y_iY_j\Big(\frac{E\delta_{ij}}{\pi_i\pi_j} - 1\Big) \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N}\frac{Y_i}{\pi_i}\frac{Y_j}{\pi_j}(\pi_{ij} - \pi_i\pi_j) \\
&= \sum_{i=1}^{N} Y_i^2\,\frac{1-\pi_i}{\pi_i} + \sum_{i=1}^{N}\sum_{j\neq i}\frac{Y_iY_j}{\pi_i\pi_j}(\pi_{ij} - \pi_i\pi_j).
\end{aligned}$$

We estimate this via

$$\widehat{\mathrm{Var}}(\hat{Y}_{HT}) = \sum_{i=1}^{n}\sum_{j=1}^{n}\Big(1 - \frac{\pi_i\pi_j}{\pi_{ij}}\Big)\frac{y_iy_j}{\pi_i\pi_j}.$$


Note that

$$\begin{aligned}
E\widehat{\mathrm{Var}}(\hat{Y}_{HT}) &= \sum_{i=1}^{N}\sum_{j=1}^{N}\Big(1 - \frac{\pi_i\pi_j}{\pi_{ij}}\Big)\frac{Y_iY_j}{\pi_i\pi_j}E(\delta_{ij})\,I(\pi_{ij} > 0) \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N}(\pi_{ij} - \pi_i\pi_j)\frac{Y_iY_j}{\pi_i\pi_j}\,I(\pi_{ij} > 0) \\
&= \mathrm{Var}(\hat{Y}_{HT}) + \sum_i\sum_j I(\pi_{ij} = 0)\,Y_iY_j.
\end{aligned}$$

If all pairs have a positive chance of selection then we have an unbiased estimate of the variance. Otherwise the bias can be large.
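The bias term $\sum\sum I(\pi_{ij} = 0)\,Y_iY_j$ can be exhibited on the same kind of toy design; in the invented design below, two of the unit pairs can never appear together:

```python
Y = [5.0, 8.0, 3.0, 9.0]                       # toy population
design = {(0, 1): 0.3, (0, 2): 0.2,            # pairs (0,3) and (1,2) are
          (1, 3): 0.25, (2, 3): 0.25}          # never sampled together
units = range(len(Y))

pi = [sum(p for s, p in design.items() if i in s) for i in units]
pij = [[sum(p for s, p in design.items() if i in s and j in s)
        for j in units] for i in units]        # note pij[i][i] = pi[i]

est = {s: sum(Y[i] / pi[i] for i in s) for s in design}
Ytot = sum(Y)
var_exact = sum(p * (est[s] - Ytot) ** 2 for s, p in design.items())

def vhat(s):
    """HT variance estimator evaluated on sample s."""
    return sum((1 - pi[i] * pi[j] / pij[i][j]) * Y[i] * Y[j] / (pi[i] * pi[j])
               for i in s for j in s)

E_vhat = sum(p * vhat(s) for s, p in design.items())
bias = sum(Y[i] * Y[j] for i in units for j in units
           if i != j and pij[i][j] == 0)

# E Vhat = Var(YHT) + sum over pi_ij = 0 pairs of Y_i Y_j
assert abs(E_vhat - (var_exact + bias)) < 1e-9
```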

Note:

• The variance estimate can be negative (if $\pi_i\pi_j > \pi_{ij}$). If we truncate the estimate at 0 in these cases we induce bias.

• The estimate is not necessarily 0 when the true value is 0.

We know $\pi_i \propto X_i$ for some size variable $X_i$, but calculating the $\pi_{ij}$ depends on the way the sample is selected. Note that of the order of $n^2$ values of $\pi_{ij}$ need to be calculated.

Hartley and Rao gave the following approximation for $\mathrm{Var}(\hat{Y}_{HT})$:

$$\mathrm{Var}(\hat{Y}_{HT}) \simeq \sum_{j=1}^{N}\sum_{i=1}^{N}\Big(1 - \frac{n-1}{n}\pi_i\Big)\Big(\frac{Y_i}{\pi_i} - \frac{Y_j}{\pi_j}\Big)^2.$$

This expression can be estimated by

$$\sum_{i=1}^{n}\Big(1 - \frac{n-1}{n}\pi_i\Big)\Big(\frac{y_i}{\pi_i} - \frac{\hat{Y}_{HT}}{n}\Big)^2.$$
