STK 4600: Statistical methods for social sciences

1


Survey sampling and statistical demography

Surveys for households and individuals

2

Survey sampling: 4 major topics

1. Traditional design-based statistical inference (6 weeks)

2. Likelihood considerations (1 week)

3. Model-based statistical inference (3 weeks)

4. Missing data and nonresponse (2 weeks)

3

Statistical demography

• Mortality

• Life expectancy

• Population projections

• 2-3 weeks

4

Course goals

• Give students knowledge about:

– planning surveys in social sciences

– major sampling designs

– basic concepts and the most important estimation methods in traditional applied survey sampling

– Likelihood principle and its consequences for survey sampling

– Use of modeling in sampling

– Treatment of nonresponse

– A basic knowledge of demography

5

But first: Basic concepts in sampling

Population (Target population): The universe of all units of interest for a certain study

• Denoted, with N being the size of the population: U = {1, 2, ...., N}

All units can be identified and labeled

• Ex: Political poll – All adults eligible to vote

• Ex: Employment/Unemployment in Norway– All persons in Norway, age 15 or more

• Ex: Consumer expenditure : Unit = household

Sample: A subset of the population, to be observed. The sample should be ”representative” of the population

6

Sampling design:

• The sample is a probability sample if all units in the sample have been chosen with known probabilities, and such that each unit in the population has a positive probability of being selected for the sample

• We shall only be concerned with probability sampling

• Example: simple random sample (SRS). Let n denote the sample size. Every possible subset of n units has the same chance of being the sample. Then all units in the population have the same probability n/N of being chosen to the sample.

• The probability distribution for SRS on all subsets of U is an example of a sampling design: the probability plan for selecting a sample s from the population:

$$p(s) = \begin{cases} 1/\binom{N}{n} & \text{if } |s| = n \\ 0 & \text{if } |s| \neq n \end{cases}$$
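The SRS design can be checked by brute force on a tiny population. The sketch below is only an illustration: the function names and the choice N = 5, n = 2 are assumptions made here, not part of the course material. It lists every subset of size n, assigns each the probability 1/C(N, n), and sums p(s) over the samples containing a given unit.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def srs_design(N, n):
    """Return {sample: p(s)} for SRS of size n from U = {1, ..., N}."""
    p = Fraction(1, comb(N, n))
    return {s: p for s in combinations(range(1, N + 1), n)}


def inclusion_probability(design, i):
    """Sum p(s) over all samples s that contain unit i."""
    return sum(p for s, p in design.items() if i in s)


# Hypothetical toy population: N = 5 units, samples of size n = 2.
design = srs_design(N=5, n=2)
print(len(design))                       # number of samples with p(s) > 0
print(inclusion_probability(design, 3))  # should equal n/N
```

Exact fractions are used so that the comparison with n/N is not blurred by rounding.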

7

Basic statistical problem: Estimation

• A typical survey has many variables of interest

• Aim of a sample is to obtain information regarding totals or averages of these variables for the whole population

• Example: Unemployment in Norway. We want to estimate the total number t of unemployed individuals.

For each person i (at least 15 years old) in Norway:

$$y_i = \begin{cases} 1 & \text{if person } i \text{ is unemployed} \\ 0 & \text{otherwise} \end{cases}$$

Then: $t = \sum_{i=1}^{N} y_i$

8

• In general, the variable of interest is y, with yi equal to the value of y for unit i in the population, and the total is denoted

$$t = \sum_{i=1}^{N} y_i$$

• The typical problem is to estimate t or t/N

• Sometimes it is also of interest to estimate ratios of totals:

Example: estimating the rate of unemployment:

$y_i = 1$ if person i is unemployed, 0 otherwise

$x_i = 1$ if person i is in the labor force, 0 otherwise

with totals $t_y$, $t_x$

Unemployment rate: $t_y / t_x$

9

Sources of error in sample surveys

1. Target population U vs Frame population UF

Access to the population is through a list of units, a register UF. U and UF may not be the same. Three possible errors in UF:

– Undercoverage: Some units in U are not in UF

– Overcoverage: Some units in UF are not in U

– Duplicate listings: A unit in U is listed more than once in UF

• UF is sometimes called the sampling frame

10

2. Nonresponse - missing data

• Some persons cannot be contacted

• Some refuse to participate in the survey

• Some may be ill and incapable of responding

• In postal surveys: nonresponse can be as much as 70%

• In telephone surveys: 50% nonresponse is not uncommon

• Possible consequences:

– Bias in the sample, which is then not representative of the population

– Estimation becomes more inaccurate

• Remedies: imputation, weighting

11

3. Measurement error: the correct value of yi is not measured

– In interviewer surveys:

• Incorrect marking

• Interviewer effect: people may say what they think the interviewer wants to hear, e.g. underreporting of alcohol use and tobacco use

• Respondents misunderstand the question or do not remember correctly.

12

4. Sampling error

– The error caused by observing a sample instead of the whole population

– To assess this error (margin of error): measure sample-to-sample variation

– Design approach deals with calculating sampling errors for different sampling designs

– One such measure: 95% confidence interval:

If we draw repeated samples, then 95% of the calculated confidence intervals for a total t will actually include t

13

• The first 3 errors: nonsampling errors

– Can be much larger than the sampling error

• In this course:

– Sampling error

– Nonresponse bias

– Shall assume that the frame population is identical to the target population

– No measurement error

14

Summary of basic concepts

• Population, target population

• unit

• sample

• sampling design

• estimation

– estimator

– measure of bias

– measure of variance

– confidence interval

15

• survey errors:

– register / frame population

– measurement error

– nonresponse

– sampling error

16

Example – Psychiatric Morbidity Survey 1993 from Great Britain

• Aim: Provide information about prevalence of psychiatric problems among adults in GB as well as their associated social disabilities and use of services

• Target population: Adults aged 16-64 living in private households

• Sample: Through several stages, 18,000 addresses were chosen and 1 adult in each household was chosen

• 200 interviewers, each visiting 90 households

17

Result of the sampling process

• Sample of addresses: 18,000

– Vacant premises: 927

– Institutions/business premises: 573

– Demolished: 499

– Second home/holiday flat: 236

• Private household addresses: 15,765

– Extra households found: 669

• Total private households: 16,434

– Households with no one aged 16-64: 3,704

• Eligible households: 12,730

• Nonresponse: 2,622

• Sample: 10,108 households with responding adults aged 16-64

18

Why sampling?

• reduces costs for an acceptable level of accuracy (money, manpower, processing time...)

• may free up resources to reduce nonsampling error and collect more information from each person in the sample

– ex:

400 interviewers at $5 per interview: lower sampling error

200 interviewers at $10 per interview: lower nonsampling error

• much quicker results

19

When is a sample representative?

• Balance on gender and age:

– proportion of women in sample ≈ proportion in population

– proportions of age groups in sample ≈ proportions in population

• An ideal representative sample:

– A miniature version of the population

– implying that every unit in the sample represents the characteristics of a known number of units in the population

• Appropriate probability sampling ensures a representative sample "on the average"

20

Alternative approaches for statistical inference based on survey sampling

• Design-based:

– No modeling, only stochastic element is the sample s with known distribution

• Model-based: The values yi are assumed to be values of random variables Yi:

– Two stochastic elements: Y = (Y1, …, YN) and s

– Assumes a parametric distribution for Y

– Example: suppose we have an auxiliary variable x. Could be: age, gender, education. A typical model is a regression of Yi on xi.

21

• Statistical principles of inference imply that the model-based approach is the most sound and valid approach

• Start with learning the design-based approach since it is the most applied approach to survey sampling, used by national statistical institutes and most research institutes for social sciences.

– Is the easy way out: Do not need to model. All statisticians working with survey sampling in practice need to know this approach

22

Design-based statistical inference

• Can also be viewed as a distribution-free nonparametric approach

• The only stochastic element: Sample s, distribution p(s) for all subsets s of the population U = {1, ..., N}

• No explicit statistical modeling is done for the variable y. All yi's are considered fixed but unknown

• Focus on sampling error

• Sets the sample survey theory apart from usual statistical analysis

• The traditional approach, started by Neyman in 1934

23

Estimation theory-simple random sample

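Estimation of the population mean of a variable y: $\mu = \sum_{i=1}^{N} y_i / N$

A natural estimator is the sample mean: $\bar{y}_s = \sum_{i \in s} y_i / n$

Desirable properties:

(I) Unbiasedness: An estimator $\hat{\theta}$ of $\theta$ is unbiased if $E(\hat{\theta}) = \theta$. The sample mean $\bar{y}_s$ is unbiased for $\mu$ under the SRS design.

SRS of size n: Each sample s of size n has

$$p(s) = 1 / \binom{N}{n}$$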

Can be performed in principle by drawing one unit at time at random without replacement

24

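The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE):

$Var(\hat\theta) = E[\hat\theta - E(\hat\theta)]^2 = E(\hat\theta - \theta)^2$ if $E(\hat\theta) = \theta$

$\hat{V}(\hat\theta)$ is an (unbiased) estimate of $Var(\hat\theta)$

$SE(\hat\theta) = \sqrt{\hat{V}(\hat\theta)}$

Some results for SRS:

(1) Let $\pi_i$ be the probability that unit i is in the sample. Then $\pi_i = n/N = f$, the sampling fraction

(2) $E(\bar{y}_s) = \mu$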

25

(3) Let $\sigma^2$ be the population variance: $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \mu)^2$. Then

$$Var(\bar{y}_s) = \frac{\sigma^2}{n}(1 - f)$$

Here, the factor (1 - f) is called the finite population correction

• usually unimportant in social surveys:

n =10,000 and N = 5,000,000: 1- f = 0.998

n =1000 and N = 400,000: 1- f = 0.9975

n =1000 and N = 5,000,000: 1-f = 0.9998

• effect of changing n much more important than effect of changing n/N

26

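An unbiased estimator of $\sigma^2$ is given by the sample variance

$$s^2 = \frac{1}{n-1}\sum_{i \in s} (y_i - \bar{y}_s)^2$$

The estimated variance: $\hat{V}(\bar{y}_s) = \frac{s^2}{n}(1 - f)$

Usually we report the standard error of the estimate: $SE(\bar{y}_s) = \sqrt{\hat{V}(\bar{y}_s)}$

Confidence intervals for $\mu$ are based on the Central Limit Theorem:

For large n, N - n: $Z = (\bar{y}_s - \mu) / (\sigma\sqrt{(1-f)/n}) \sim N(0,1)$

Approximate 95% CI for $\mu$: $\bar{y}_s \pm 1.96\,SE(\bar{y}_s) = (\bar{y}_s - 1.96\,SE(\bar{y}_s),\ \bar{y}_s + 1.96\,SE(\bar{y}_s))$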

27

Example

N = 341 residential blocks in Ames, Iowa

yi = number of dwellings in block i

1000 independent SRS for different values of n

n     Proportion of samples with |Z| < 1.64     Proportion of samples with |Z| < 1.96

30    0.88                                      0.93

50    0.88                                      0.93

70    0.88                                      0.94

90    0.90                                      0.95

28

For one SRS with n = 90:

$\bar{y}_s = 13$

$s^2 = 75$

$SE(\bar{y}_s) = \sqrt{(1 - 90/341) \times 75/90} = 0.78$

Approximate 95% CI: $13 \pm 1.96 \times 0.78 = 13 \pm 1.53 = (11.47, 14.53)$
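The Ames example can be reproduced in a few lines. This is a minimal sketch; the function name `srs_mean_ci` is made up here, and the inputs are the summary figures quoted on the slide (sample mean 13, sample variance 75, n = 90, N = 341).

```python
from math import sqrt


def srs_mean_ci(ybar, s2, n, N, z=1.96):
    """Standard error and approximate 95% CI for a mean under SRS."""
    f = n / N                       # sampling fraction
    se = sqrt((1 - f) * s2 / n)     # includes the finite population correction
    return se, (ybar - z * se, ybar + z * se)


se, ci = srs_mean_ci(ybar=13, s2=75, n=90, N=341)
print(round(se, 2), tuple(round(x, 2) for x in ci))
```

Small differences in the last digit of the interval can arise because the slide rounds the standard error to 0.78 before multiplying by 1.96.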

29

The coefficient of variation for the estimate:

$$CV(\bar{y}_s) = SE(\bar{y}_s) / \bar{y}_s$$

•A measure of the relative variability of an estimate.

•It does not depend on the unit of measurement.

• More stable over repeated surveys, can be used for planning, for example determining sample size

• More meaningful when estimating proportions

Absolute value of sampling error is not informative when not related to value of the estimate

For example, SE =2 is small if estimate is 1000, but very large if estimate is 3

In the example: $CV(\bar{y}_s) = 0.78/13 = 0.06 = 6\%$

30

Estimation of a population proportion p with a certain characteristic A

p = (number of units in the population with A)/N

Let yi = 1 if unit i has characteristic A, 0 otherwise

Then p is the population mean of the yi’s.

Let X be the number of units in the sample with characteristic A. Then the sample mean can be expressed as

$$\hat{p} = \bar{y}_s = X/n$$

31

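Then under SRS:

$$E(\hat{p}) = p$$

and

$$Var(\hat{p}) = \frac{p(1-p)}{n}\left(1 - \frac{n-1}{N-1}\right)$$

since the population variance equals $\sigma^2 = \frac{Np(1-p)}{N-1}$

The sample variance is $s^2 = \frac{n}{n-1}\hat{p}(1-\hat{p})$

So the unbiased estimate of the variance of the estimator:

$$\hat{V}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)$$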

32

Examples

A political poll: Suppose we have a random sample of 1000 eligible voters in Norway with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by:

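$$\hat{p} = 280/1000 = 0.28$$

$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)} = \sqrt{\frac{0.28 \times 0.72}{999}\left(1 - \frac{n}{N}\right)} \approx 0.0142$$

A confidence interval requires the normal approximation. We can use the guideline from the binomial distribution when N - n is large: $np \ge 5$ and $n(1-p) \ge 5$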

33

In this example: n = 1000 and N = 4,000,000

Approximate 95% CI: $\hat{p} \pm 1.96\,SE(\hat{p}) = 0.280 \pm 0.028 = (0.252, 0.308)$

Ex: Psychiatric Morbidity Survey 1993 from Great Britain

p = proportion with psychiatric problems

n = 9792 (partial nonresponse on this question: 316)

n/N = 0.00024

$\hat{p} = 0.14$

$SE(\hat{p}) = \sqrt{(1 - 0.00024) \times 0.14 \times 0.86 / 9791} = 0.0035$

95% CI: $0.14 \pm 1.96 \times 0.0035 = 0.14 \pm 0.007 = (0.133, 0.147)$
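Both proportion examples follow the same recipe, so a small helper makes the arithmetic easy to repeat. This is a sketch only; `srs_proportion_ci` and its parameter names are assumptions made for illustration. The sampling fraction is passed directly because the slide for the psychiatric survey gives n/N rather than N.

```python
from math import sqrt


def srs_proportion_ci(p_hat, n, sampling_fraction=0.0, z=1.96):
    """SE and approximate 95% CI for a proportion under SRS.

    Uses the unbiased variance estimate p(1-p)/(n-1) * (1 - n/N).
    """
    se = sqrt(p_hat * (1 - p_hat) / (n - 1) * (1 - sampling_fraction))
    return se, (p_hat - z * se, p_hat + z * se)


# Political poll: 280 of 1000 voters, N = 4,000,000.
se_poll, ci_poll = srs_proportion_ci(0.28, 1000, sampling_fraction=1000 / 4_000_000)

# Psychiatric Morbidity Survey: p_hat = 0.14, n = 9792, n/N = 0.00024.
se_pms, ci_pms = srs_proportion_ci(0.14, 9792, sampling_fraction=0.00024)

print(tuple(round(x, 3) for x in ci_poll))
print(tuple(round(x, 3) for x in ci_pms))
```

In both cases the finite population correction is so close to 1 that dropping it would not change the reported intervals.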

34

General probability sampling• Sampling design: p(s) - known probability of selection for each subset s of the population U

• Actually: The sampling design is the probability distribution p(.) over all subsets of U

• Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0.

• The inclusion probability:

$$\pi_i = P(\text{unit } i \text{ is in the sample}) = P(i \in s) = \sum_{\{s:\, i \in s\}} p(s)$$

35

Illustration

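U = {1,2,3,4}

Sample of size 2; 6 possible samples

Sampling design: p({1,2}) = 1/2, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8

The inclusion probabilities:

$\pi_1 = \sum_{\{s:\, 1 \in s\}} p(s) = p(\{1,2\}) + p(\{1,4\}) = 5/8$

$\pi_2 = \sum_{\{s:\, 2 \in s\}} p(s) = p(\{1,2\}) + p(\{2,3\}) = 3/4 = 6/8$

$\pi_3 = \sum_{\{s:\, 3 \in s\}} p(s) = p(\{2,3\}) + p(\{3,4\}) = 3/8$

$\pi_4 = \sum_{\{s:\, 4 \in s\}} p(s) = p(\{3,4\}) + p(\{1,4\}) = 2/8$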
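The inclusion probabilities in this illustration can be computed mechanically from the design. A minimal sketch, with the design stored as a dictionary from samples to p(s); samples with p(s) = 0 are simply left out.

```python
from fractions import Fraction as F

# Sampling design from the illustration.
design = {
    (1, 2): F(1, 2),
    (2, 3): F(1, 4),
    (3, 4): F(1, 8),
    (1, 4): F(1, 8),
}

# pi_i = sum of p(s) over all samples s containing unit i.
pi = {i: sum(p for s, p in design.items() if i in s) for i in range(1, 5)}

for i, prob in pi.items():
    print(i, prob)

# With a fixed sample size n = 2, the inclusion probabilities add up to n.
print(sum(pi.values()))
```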

36

Some results

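(I) $\pi_1 + \pi_2 + \dots + \pi_N = E(n)$; n is the sample size

(II) If the sample size is determined in advance to be n: $\pi_1 + \pi_2 + \dots + \pi_N = n$

Proof:

Let $Z_i = 1$ if unit i is included in the sample, 0 otherwise

$\pi_i = P(Z_i = 1) = E(Z_i)$

$n = \sum_{i=1}^{N} Z_i \Rightarrow E(n) = \sum_{i=1}^{N} E(Z_i) = \sum_{i=1}^{N} \pi_i$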

37

Estimation theory probability sampling in general

Problem: Estimate a population quantity for the variable y

For the sake of illustration: The population total

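$$t = \sum_{i=1}^{N} y_i$$

An estimator of t based on the sample: $\hat{t}$

Expected value: $E(\hat{t}) = \sum_s \hat{t}(s)\,p(s)$

Variance: $Var(\hat{t}) = E[\hat{t} - E(\hat{t})]^2 = \sum_s [\hat{t}(s) - E(\hat{t})]^2\,p(s)$

Bias: $E(\hat{t}) - t$

$\hat{t}$ is unbiased if $E(\hat{t}) = t$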

38

Let $\hat{V}(\hat{t})$ be an (unbiased if possible) estimate of $Var(\hat{t})$

The standard error of $\hat{t}$: $SE(\hat{t}) = \sqrt{\hat{V}(\hat{t})}$

Coefficient of variation of $\hat{t}$: $CV(\hat{t}) = SE(\hat{t}) / \hat{t}$

CV is a useful measure of uncertainty, especially when the standard error increases as the estimate increases

Margin of error: $2\,SE(\hat{t})$

Because, typically, we have that

$$P(\hat{t} - 2\,SE(\hat{t}) \le t \le \hat{t} + 2\,SE(\hat{t})) \approx 0.95 \text{ for large } n,\ N - n$$

since $\hat{t}$ is approximately normally distributed for large n, N - n

$\hat{t} \pm 2\,SE(\hat{t})$ is approximately a 95% CI

39

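Some peculiarities in the estimation theory

Example: N = 3, n = 2, simple random sample

$s_1 = \{1,2\},\ s_2 = \{1,3\},\ s_3 = \{2,3\}$

$p(s_k) = 1/3$ for k = 1, 2, 3

Let $\hat{t}_1 = 3\bar{y}_s$, unbiased

Let $\hat{t}_2$ be given by:

$\hat{t}_2(s_1) = 3 \cdot \frac{1}{2}(y_1 + y_2) = \hat{t}_1(s_1)$

$\hat{t}_2(s_2) = 3(\frac{1}{2}y_1 + \frac{2}{3}y_3) = \hat{t}_1(s_2) + \frac{1}{2}y_3$

$\hat{t}_2(s_3) = 3(\frac{1}{2}y_2 + \frac{1}{3}y_3) = \hat{t}_1(s_3) - \frac{1}{2}y_3$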

40

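Also $\hat{t}_2$ is unbiased:

$$E(\hat{t}_2) = \sum_s \hat{t}_2(s)\,p(s) = \frac{1}{3}\sum_{k=1}^{3} \hat{t}_2(s_k) = \frac{1}{3} \cdot 3t = t$$

$$Var(\hat{t}_1) - Var(\hat{t}_2) = \frac{1}{6}\,y_3\,(3y_2 - 3y_1 - y_3)$$

$Var(\hat{t}_1) > Var(\hat{t}_2)$ if $y_3 > 0$ and $3y_2 > 3y_1 + y_3$

If the $y_i$ are 0/1 variables, this happens when $y_1 = 0,\ y_2 = y_3 = 1$

For this set of values of the yi's:

$\hat{t}_1(s_1) = 1.5,\ \hat{t}_1(s_2) = 1.5,\ \hat{t}_1(s_3) = 3$: never correct

$\hat{t}_2(s_1) = 1.5,\ \hat{t}_2(s_2) = 2,\ \hat{t}_2(s_3) = 2.5$

$\hat{t}_2$ clearly has less variability than $\hat{t}_1$ for these y-values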
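Because there are only three possible samples, the design expectation and variance of both estimators can be computed exactly. The sketch below does this for the 0/1 values y = (0, 1, 1); the helper names `t1`, `t2` and `design_moments` are made up for this illustration.

```python
from fractions import Fraction as F

y = {1: 0, 2: 1, 3: 1}                 # population values, true total t = 2
samples = [(1, 2), (1, 3), (2, 3)]     # each has p(s) = 1/3 under SRS


def t1(s):
    """Expansion estimator 3 * (sample mean)."""
    return F(3, 2) * (y[s[0]] + y[s[1]])


def t2(s):
    """Modified estimator: shifts t1 by +y3/2 on s2 and -y3/2 on s3."""
    if s == (1, 3):
        return t1(s) + F(1, 2) * y[3]
    if s == (2, 3):
        return t1(s) - F(1, 2) * y[3]
    return t1(s)


def design_moments(estimator):
    """Exact expectation and variance over the three equally likely samples."""
    values = [estimator(s) for s in samples]
    mean = sum(values) / 3
    var = sum((v - mean) ** 2 for v in values) / 3
    return mean, var


print(design_moments(t1))   # mean and variance of t1
print(design_moments(t2))   # mean and variance of t2
```

Both estimators have expectation equal to the true total, but the second has the smaller variance for this y, in line with the difference formula y3(3y2 - 3y1 - y3)/6.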

41

Let y be the population vector of the y-values.

This example shows that $N\bar{y}_s$ is not uniformly best (minimum variance for all y) among linear design-unbiased estimators

The example shows that the "usual" basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models

In fact, we have the following much stronger result:

Theorem: Let p(.) be any sampling design. Assume each yi can take at least two values. Then there exists no uniformly best design-unbiased estimator of the total t

42

Proof:

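Let $\hat{t}$ be unbiased, and let $y_0$ be one possible value of y. Then there exists an unbiased $\hat{t}_0$ with $Var(\hat{t}_0) = 0$ when $y = y_0$:

$$\hat{t}_0(s, y) = \hat{t}(s, y) - \hat{t}(s, y_0) + t_0, \quad t_0 \text{ is the total for } y_0$$

1) $\hat{t}_0$ is unbiased: $E(\hat{t}_0) = t - \sum_s \hat{t}(s, y_0)\,p(s) + t_0 = t - t_0 + t_0 = t$

2) When $y = y_0$: $\hat{t}_0 = t_0$ for all samples, so $Var(\hat{t}_0) = 0$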

This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible

43

Determining sample size

• The sample size has a decisive effect on the cost of the survey

• How large n should be depends on the purpose for doing the survey

• In a poll for determining voting preference, n = 1000 is typically enough

• In the quarterly labor force survey in Norway, n = 24000

Mainly three factors to consider:

1. Desired accuracy of the estimates for many variables. Focus on one or two variables of primary interest

2. Homogeneity of the population. Needs smaller samples if little variation in the population

3. Estimation for subgroups, domains, of the population

44

It is often factor 3 that puts the highest demand on the survey

• If we want to estimate totals for domains of the population we should take a stratified sample

• A sample from each domain

• A stratified random sample: From each domain a simple random sample

H strata that constitute the whole population

Sample sizes: $n_1, n_2, \dots, n_H$

Total sample size: $n = n_1 + n_2 + \dots + n_H$

Must determine each $n_h$

45

Assume the problem is to estimate a population proportion p for a certain stratum, and we use the sample proportion from the stratum to estimate p

Let n be the sample size of this stratum, and assume that n/N is negligible

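Desired accuracy for this stratum: 95% CI for p should be $\pm 5\%$

95% CI for p: $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$

The accuracy requirement:

$$1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le 0.05 = \frac{1}{20} \iff n \ge 1.96^2 \cdot 20^2 \cdot \hat{p}(1-\hat{p})$$

The right-hand side is at most 384, since $\hat{p}(1-\hat{p}) \le 1/4$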

46

The estimate is unknown in the planning phase

Use the conservative size 384 or a planning value p0 with n = 1536 p0(1 - p0)

E.g., with p0 = 0.2: n = 246

In general, with accuracy requirement d, 95% CI $\hat{p} \pm d$:

$$n = 3.84\,p_0(1-p_0)/d^2$$

Alternative accuracy requirement:

Length of the 95% CI is proportional to $\hat{p}$ (when $\hat{p} < 0.5$, otherwise estimate 1 - p):

$$1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le d\,\hat{p} \iff CV(\hat{p}) \le d/1.96 = e$$

47

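With e = 0.1, we require approximately:

when p0 = 0.5: 95% CI $\hat{p} \pm 0.10$ and n = 100

when p0 = 0.1: 95% CI $\hat{p} \pm 0.02$ and n = 900

$$CV(\hat{p}) = SE(\hat{p})/\hat{p} = \sqrt{\frac{1-\hat{p}}{n\,\hat{p}}} \le e \iff n \ge \frac{1}{e^2} \cdot \frac{1-\hat{p}}{\hat{p}}$$

Planning value: $n = \frac{1}{e^2} \cdot \frac{1-p_0}{p_0}$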

48

Example: Monthly unemployment rate

Important to detect changes in unemployment rates from month to month

planning value p0 = 0.05

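Desired accuracy: $1.96\,SE(\hat{p}) \le d \iff n \ge 3.84\,p_0(1-p_0)/d^2 = 0.1824/d^2$

d = 0.001 (margin of error 0.1%): n = 182,400

d = 0.002: n = 45,600

d = 0.005: n = 7,300

Note: d = 0.005 gives $SE(\hat{p}) = 0.00255$ and $CV(\hat{p}) = 0.00255/0.05 = 0.051 \approx 5\%$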
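The sample-size rules used on these slides are easy to tabulate. The following is a hedged sketch: the function names are invented here, the sampling fraction is assumed negligible as in the slides, and the default constant 3.84 is the slides' rounding of 1.96 squared.

```python
def n_for_margin(p0, d, z2=3.84):
    """Sample size so that the 95% margin of error for a proportion is about d.

    z2 is 1.96**2 rounded to 3.84; p0 is the planning value for the proportion.
    """
    return z2 * p0 * (1 - p0) / d**2


def n_for_cv(p0, e):
    """Sample size so that the coefficient of variation of p_hat is about e."""
    return (1 - p0) / (p0 * e**2)


# Monthly unemployment rate with planning value p0 = 0.05.
for d in (0.001, 0.002, 0.005):
    print(d, round(n_for_margin(0.05, d)))

# Conservative size for a +/- 5 percentage point interval (p0 = 0.5).
print(round(n_for_margin(0.5, 0.05)))

# Relative accuracy e = 0.1 for p0 = 0.5 and p0 = 0.1.
print(round(n_for_cv(0.5, 0.1)), round(n_for_cv(0.1, 0.1)))
```

The absolute requirement asks for the largest sample near p0 = 0.5, while the relative (CV) requirement asks for the largest sample when p0 is small.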

49

Two basic estimators:

Ratio estimator

Horvitz-Thompson estimator

• Ratio estimator in simple random samples

• H-T estimator for unequal probability sampling: The inclusion probabilities are unequal

• The goal is to estimate a population total t for a variable y

50

Ratio estimator

Suppose we have known auxiliary information for the whole population:

$x = (x_1, x_2, \dots, x_N)$

Ex: age, gender, education, employment status

Let $X = \sum_{i=1}^{N} x_i$

The ratio estimator for the y-total t:

$$\hat{t}_R = X\,\frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i} = X\,\frac{\bar{y}_s}{\bar{x}_s}$$

51

We can express the ratio estimator in the following form:

$$\hat{t}_R = \frac{X}{N\bar{x}_s}\,(N\bar{y}_s)$$

It adjusts the usual “sample mean estimator” in the cases where the x-values in the sample are too small or too large.

Reasonable if there is a positive correlation between x and y

Example: University of 4000 students, SRS of 400

Estimate the total number t of women that is planning a career in teaching, t=Np, p is the proportion

yi = 1 if student i is a woman planning to be a teacher, t is the y-total

52

Results: 84 out of 240 women in the sample plan to be a teacher

$\hat{p} = 84/400 = 0.21$

$\hat{t} = N\hat{p} = 840$

HOWEVER: It was noticed that the university has 2700 women (67.5%) while in the sample we had 60% women. A better estimate that corrects for the underrepresentation of women is obtained by the ratio estimate using the auxiliary variable

x = 1 if student is a woman, 0 otherwise

$$\hat{t}_R = \frac{2700}{0.6 \times 4000} \times 840 = 945$$
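The university example can be recomputed directly from the counts. A minimal sketch; the function name `ratio_estimate` is an assumption made here, and the inputs are the figures from the slide (N = 4000, n = 400, 240 women in the sample, 84 of them planning to teach, 2700 women in the population).

```python
def ratio_estimate(X_total, y_sample_sum, x_sample_sum):
    """Ratio estimator of the y-total: X * (sum of y in s) / (sum of x in s)."""
    return X_total * y_sample_sum / x_sample_sum


N, n = 4000, 400
women_in_sample = 240        # sum of x_i over the sample
teachers_in_sample = 84      # sum of y_i over the sample
women_in_population = 2700   # known x-total X

t_expansion = N * teachers_in_sample / n
t_ratio = ratio_estimate(women_in_population, teachers_in_sample, women_in_sample)

print(t_expansion, t_ratio)
```

The ratio estimate scales the expansion estimate by X / (N * x-bar) = 2700 / 2400, which is larger than 1 because women are underrepresented in the sample.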

53

In business surveys it is very common to use a ratio estimator.

Ex: yi = amount spent on health insurance by business i

xi = number of employees in business i

We shall now do a comparison between the ratio estimator and the sample mean based estimator. We need to derive expectation and variance for the ratio estimator

54

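First: Must define the population covariance

$$\sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$$

$\mu_x, \mu_y$ are the population means of the x and y variables

$$\sigma_x^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu_x)^2, \qquad \sigma_y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \mu_y)^2$$

The population correlation coefficient: $\rho_{xy} = \sigma_{xy} / (\sigma_x \sigma_y)$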

55

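Let $R = t/X = \sum_{i=1}^{N} y_i / \sum_{i=1}^{N} x_i$ and $\hat{R} = N\bar{y}_s / N\bar{x}_s = \bar{y}_s / \bar{x}_s$

(I) Bias: $E(\hat{t}_R) - t = -Cov(\hat{R}, N\bar{x}_s)$

Proof:

$$\hat{t}_R - t = \frac{N\bar{y}_s}{N\bar{x}_s}X - t = N\bar{y}_s\left(1 - \frac{N\bar{x}_s - X}{N\bar{x}_s}\right) - t$$

$$E(\hat{t}_R) - t = -E\left[\frac{N\bar{y}_s}{N\bar{x}_s}(N\bar{x}_s - X)\right] = -Cov(\hat{R}, N\bar{x}_s)$$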

56

It follows that

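$$\frac{|Bias(\hat{t}_R)|}{\sqrt{Var(\hat{t}_R)}} = \frac{|Cov(\hat{R}, N\bar{x}_s)|}{X\sqrt{Var(\hat{R})}} = \frac{|Corr(\hat{R}, N\bar{x}_s)|\sqrt{Var(N\bar{x}_s)}}{X} = |Corr(\hat{R}, N\bar{x}_s)|\,CV(N\bar{x}_s) \le CV(\bar{x}_s)$$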

Hence, in SRS, the absolute bias of the ratio estimator is small relative to the true SE of the estimator if the coefficient of variation of the x-sample mean is small

Certainly true for large n

57

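(II) $E(\hat{t}_R) \approx t$, for large n

(III) $$Var(\hat{t}_R) \approx N^2\,\frac{1-f}{n}\,(\sigma_y^2 - 2R\sigma_{xy} + R^2\sigma_x^2) = N^2\,\frac{1-f}{n} \cdot \frac{1}{N-1}\sum_{i=1}^{N} (y_i - Rx_i)^2$$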

58

Note: The ratio estimator is very precise when the population points (yi, xi) lie close to a straight line through the origin with slope R.

The regression model generates the ratio estimator

59

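$$Var(\hat{t}_R) \approx N^2\,\frac{1-f}{n} \cdot \frac{1}{N-1}\sum_{i=1}^{N} (y_i - Rx_i)^2$$

and recalling that

$$Var(N\bar{y}_s) = N^2\,\frac{1-f}{n} \cdot \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \mu_y)^2$$

$$Var(\hat{t}_R) < Var(N\bar{y}_s) \iff \sum_{i=1}^{N} (y_i - Rx_i)^2 < \sum_{i=1}^{N} (y_i - \mu_y)^2$$

The ratio estimator is more accurate if $Rx_i$ predicts $y_i$ better than $\mu_y$ does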

60

Estimated variance for the ratio estimator

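Estimate $\sum_{i=1}^{N} (y_i - Rx_i)^2 / (N-1)$ by $\sum_{i \in s} (y_i - \hat{R}x_i)^2 / (n-1)$

$$\hat{V}(\hat{t}_R) = \left(\frac{\mu_x}{\bar{x}_s}\right)^2 N^2\,\frac{1-f}{n} \cdot \frac{1}{n-1}\sum_{i \in s} (y_i - \hat{R}x_i)^2$$

Note: If $\bar{x}_s$ is very small, then $\hat{R}$ is more uncertain and the variance estimate becomes larger to reflect that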

61

For large n, N-n: Approximate normality holds and an approximate 95% confidence interval is given by

$$\hat{t}_R \pm 1.96\,\frac{X}{\bar{x}_s}\sqrt{\frac{1-f}{n} \cdot \frac{1}{n-1}\sum_{i \in s} (y_i - \hat{R}x_i)^2}$$

62

Unequal probability sampling

Inclusion probabilities: $\pi_i = P(i \in s) > 0$ for all i = 1, ..., N

Example:

Psychiatric Morbidity Survey: Selected individuals from households

$\pi_i \propto 1/M_i$, where $M_i$ = number of adults aged 16-64 in the household that individual i belongs to

63

Horvitz-Thompson estimator- unequal probability sampling

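$\pi_i = P(i \in s) > 0$ for all i = 1, ..., N

Let's try and use $N\bar{y}_s$:

Let $Z_i = 1$ if $i \in s$, 0 otherwise. $E(Z_i) = \pi_i$

$$E(N\bar{y}_s) = \frac{N}{n}\,E\left(\sum_{i=1}^{N} y_i Z_i\right) = \sum_{i=1}^{N} y_i\,(N\pi_i/n) \ne t$$

so $N\bar{y}_s$ is not unbiased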

Bias is large if inclusion probabilities tend to increase or decrease systematically with yi

64

Use weighting to correct for bias:

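$\hat{t} = \sum_{i \in s} w_i y_i$; $w_i$ does not depend on s

$$E(\hat{t}) = E\left(\sum_{i=1}^{N} w_i Z_i y_i\right) = \sum_{i=1}^{N} w_i \pi_i y_i$$

and $\hat{t}$ is unbiased for all possible values of y if and only if $w_i = 1/\pi_i$

$$\hat{t}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i}$$

In SRS, $\pi_i = n/N$ and $\hat{t}_{HT} = N\bar{y}_s$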

65

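a) $$Var(\hat{t}_{HT}) = \sum_{i=1}^{N} \frac{1-\pi_i}{\pi_i}\,y_i^2 + 2\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} \frac{\pi_{ij} - \pi_i\pi_j}{\pi_i\pi_j}\,y_i y_j$$

b) If |s| = n, then

$$Var(\hat{t}_{HT}) = \sum_{i=1}^{N-1}\sum_{j=i+1}^{N} (\pi_i\pi_j - \pi_{ij})\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$$

where $\pi_{ij} = P(i, j \in s) = P(Z_i Z_j = 1)$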

Horvitz-Thompson estimator is widely used f.ex., in official statistics

66

Note that the variance is small if we determine the inclusion probabilities such that

$y_i/\pi_i$ are approximately equal, i.e. $\pi_i$ increases with increasing $y_i$

Of course, we do not know the value of yi when planning the survey; use a known auxiliary xi and choose

$$\pi_i \propto x_i \Rightarrow \pi_i = n x_i / X$$

since $\sum_{i=1}^{N} \pi_i = n$

67

If $\pi_i$ and $y_i$ are not related or are negatively "correlated", $Var(\hat{t}_{HT})$ can be enormous and one should not use the H-T estimator, even though the $\pi_i$'s are unequal

Example: Population of 3 elephants, to be shipped. Needs an estimate for the total weight

•Weighing an elephant is no simple matter. Owner wants to estimate the total weight by weighing just one elephant.

• Knows from earlier: Elephant 2 has a weight y2 close to the average weight. Wants to use this elephant and use 3y2 as an estimate

• However: To get an unbiased estimator, all inclusion probabilities must be positive.

68

• Sampling design:

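$|s| = 1$ and $\pi_2 = 0.90,\ \pi_1 = \pi_3 = 0.05$

• The weights: 1, 2, 4 tons, total = 7 tons

• H-T estimator: $\hat{t}_{HT} = y_i/\pi_i$ if s = {i}:

$\hat{t}_{HT} = 20$ if s = {1}

$\hat{t}_{HT} = 2.22$ if s = {2}

$\hat{t}_{HT} = 80$ if s = {3}

Hopeless! Always far from the true total of 7

Cannot be used, even though $E(\hat{t}_{HT}) = t = 7$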

69

Problem:

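$$Var(\hat{t}_{HT}) = (20-7)^2 \times 0.05 + (2.22-7)^2 \times 0.90 + (80-7)^2 \times 0.05 = 295.46$$

True $SE(\hat{t}_{HT}) = \sqrt{Var(\hat{t}_{HT})} = 17.2$ !!!

The planned estimator, even though not a SRS:

$\hat{t}_{eleph} = 3\bar{y}_s = 3y_i$ if s = {i}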

Possible values: 3, 6, 12

70

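$E(\hat{t}_{eleph}) = 6.15$: not unbiased, but look at

$Var(\hat{t}_{eleph}) = 2.2275,\ SE(\hat{t}_{eleph}) = 1.49$

$$MSE(\hat{t}_{eleph}) = E(\hat{t}_{eleph} - t)^2 = Var(\hat{t}_{eleph}) + Bias^2(\hat{t}_{eleph}) = 2.95$$

$\sqrt{MSE(\hat{t}_{eleph})} = 1.72$

$\hat{t}_{eleph}$ is clearly preferable to $\hat{t}_{HT}$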
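The elephant example has only three possible samples, so every moment can be computed exactly. The sketch below compares the H-T estimator with the planned estimator 3y; the helper `design_moments` and the variable names are made up for this illustration.

```python
from math import sqrt

weights = {1: 1.0, 2: 2.0, 4 - 1: 4.0}      # elephant weights in tons (units 1, 2, 3)
pi = {1: 0.05, 2: 0.90, 3: 0.05}            # P(s = {i}), one elephant is weighed
t = sum(weights.values())                   # true total, 7 tons


def design_moments(estimator):
    """Expectation, variance and MSE of an estimator over the three samples."""
    mean = sum(pi[i] * estimator(i) for i in pi)
    var = sum(pi[i] * (estimator(i) - mean) ** 2 for i in pi)
    mse = sum(pi[i] * (estimator(i) - t) ** 2 for i in pi)
    return mean, var, mse


def ht(i):
    """Horvitz-Thompson estimate when elephant i is weighed."""
    return weights[i] / pi[i]


def eleph(i):
    """Planned estimate: three times the observed weight."""
    return 3 * weights[i]


ht_mean, ht_var, ht_mse = design_moments(ht)
el_mean, el_var, el_mse = design_moments(eleph)
print(round(ht_mean, 2), round(sqrt(ht_var), 1))
print(round(el_mean, 2), round(sqrt(el_var), 2), round(sqrt(el_mse), 2))
```

The exact H-T variance differs slightly from the slide's 295.46 in the second decimal because the slide rounds 2/0.9 to 2.22 before squaring.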

71

Variance estimate for H-T estimator

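An unbiased estimator of $Var(\hat{t}_{HT})$, provided all joint inclusion probabilities $\pi_{ij} > 0$:

$$\hat{V}(\hat{t}_{HT}) = \sum_{i \in s}\sum_{j \in s,\, j > i} \frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$$

Assume the size of the sample is determined in advance to be n.

Approximate 95% CI, for large n, N - n:

$$\hat{t}_{HT} \pm 1.96\sqrt{\hat{V}(\hat{t}_{HT})}$$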

72

• Can always compute the variance estimate, since necessarily $\pi_{ij} > 0$ for all i, j in the sample s

• But: If not all $\pi_{ij} > 0$ in the population, one should not use this estimate! It can give very incorrect estimates

• The variance estimate can be negative, but for most sampling designs it is always positive

73

A modified H-T estimator

Consider first estimating the population mean

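The population mean: $\bar{y} = t/N$

An obvious choice: $\hat{\bar{y}}_{HT} = \hat{t}_{HT}/N$

Alternative: Estimate N as well, whether N is known or not

$$\hat{N} = \sum_{i \in s} \frac{1}{\pi_i} \quad (\text{the H-T estimator with } y_i = 1 \text{ for all } i)$$

$$E(\hat{N}) = E\left(\sum_{i=1}^{N} \frac{Z_i}{\pi_i}\right) = \sum_{i=1}^{N} \frac{\pi_i}{\pi_i} = N$$

For SRS, $\pi_i = n/N$: $\hat{N} = \sum_{i \in s} N/n = N$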

74

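$$\hat{\bar{y}}_w = \frac{\hat{t}_{HT}}{\hat{N}} = \frac{\sum_{i \in s} y_i/\pi_i}{\sum_{i \in s} 1/\pi_i}, \qquad \hat{t}_w = N\hat{\bar{y}}_w$$

Interestingly, $\hat{t}_w$ is often better than $\hat{t}_{HT}$, even though it is only approximately unbiased. It usually has smaller variance.

So $\hat{\bar{y}}_w$ is ordinarily the estimator to use, whether N is known or not. We note that it is a ratio estimator

Illustration: $y_i = c$ for all i = 1, ..., N. Then

$\hat{t}_{HT} = c\sum_{i \in s} 1/\pi_i = c\hat{N}$, while $\hat{t}_w = Nc = t$, a better estimate if $Var(\hat{N}) > 0$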

75

If sample size varies then the “ratio” estimator performs better than the H-T estimator, the ratio is more stable than the numerator

Example:

Sampling design: Bernoulli sampling. Each unit in the population is selected with probability $\pi$, independently

$y_i = c$ for i = 1, ..., N

The $Z_i$'s are i.i.d. with $P(Z_i = 1) = \pi$

n = |s| is a stochastic variable and has a binomial(N, $\pi$) distribution, with $E(n) = N\pi$

76

$\hat{t}_{HT} = \sum_{i \in s} c/\pi = cn/\pi$

$E(\hat{t}_{HT}) = cN\pi/\pi = Nc = t$

$\hat{t}_w = N\,\frac{cn/\pi}{n/\pi} = Nc = t$

H-T estimator varies because n varies, while the modified H-T is perfectly stable
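The contrast between the two estimators under Bernoulli sampling can be seen without simulation by plugging in a few possible realized sample sizes. This is a sketch under assumed numbers: N = 1000, selection probability 0.1 and constant value c = 2 are hypothetical, as is the function name.

```python
def ht_and_weighted(c, N, pi, n):
    """H-T and modified H-T estimates of the total when every y_i equals c.

    n is the realized size of a Bernoulli sample with selection probability pi.
    """
    t_ht = c * n / pi            # sum of y_i / pi over the sample
    n_hat = n / pi               # H-T estimate of the population size N
    t_w = N * t_ht / n_hat       # modified (ratio-type) estimator
    return t_ht, t_w


N, pi, c = 1000, 0.1, 2.0        # hypothetical values; true total is N * c = 2000
results = {n: ht_and_weighted(c, N, pi, n) for n in (80, 100, 120)}
for n, (t_ht, t_w) in results.items():
    print(n, round(t_ht), round(t_w))
```

The H-T estimate moves in proportion to the realized n, while the modified estimate stays at Nc for every n greater than zero.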

77

Review of Advantages of Probability Sampling

• Objective basis for inference

• Permits unbiased or approximately unbiased estimation

• Permits estimation of sampling errors of estimators

– Use central limit theorem for confidence interval

– Can choose n to reduce SE or CV for estimator

78

Outstanding issues in design-based inference

• Estimation for subpopulations, domains

• Choice of sampling design

– discuss several different sampling designs

– appropriate estimators

• More on use of auxiliary information to improve estimates

• More on variance estimation

79

Estimation for domains

• Domain (subpopulation): a subset of the population of interest

• Ex: Population = all adults aged 16-64

Examples of domains:

– Women

– Adults aged 35-39

– Men aged 25-29

– Women of a certain ethnic group

– Adults living in a certain city

• Partition population U into D disjoint domains U1,…,Ud,..., UD of sizes N1,…,Nd,…,ND

80

Estimating domain means

Simple random sample from the population

True domain mean: $\bar{y}_d = \sum_{i \in U_d} y_i / N_d$

• e.g., proportion of divorced women with psychiatric problems.

Estimate by the sample mean from $U_d$:

$s_d = s \cap U_d$, the part of the sample in $U_d$

$n_d = |s_d|$

$$\bar{y}_{s_d} = \sum_{i \in s_d} y_i / n_d$$

Note: nd is a random variable

81

The estimator is a ratio estimator:

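Define

$u_i = y_i$ if $i \in U_d$, 0 otherwise

$x_i = 1$ if $i \in U_d$, 0 otherwise

$$\bar{y}_d = \sum_{i=1}^{N} u_i \Big/ \sum_{i=1}^{N} x_i = R$$

$$\bar{y}_{s_d} = \sum_{i \in s} u_i \Big/ \sum_{i \in s} x_i = \bar{u}_s / \bar{x}_s = \hat{R}$$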

82

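$\bar{y}_{s_d}$ is approximately unbiased for large n

$$\hat{V}(\bar{y}_{s_d}) = \frac{1}{N_d^2}\left(\frac{N_d/N}{n_d/n}\right)^2 N^2\,\frac{1-f}{n} \cdot \frac{1}{n-1}\sum_{i \in s} (u_i - \bar{y}_{s_d}x_i)^2$$

Let $s_d^2$ be the sample variance for the domain:

$$s_d^2 = \frac{1}{n_d - 1}\sum_{i \in s_d} (y_i - \bar{y}_{s_d})^2$$

$$\hat{V}(\bar{y}_{s_d}) = \frac{n^2}{n_d^2} \cdot \frac{1-f}{n} \cdot \frac{n_d - 1}{n-1}\,s_d^2 \approx (1-f)\,\frac{s_d^2}{n_d}$$

$SE(\bar{y}_{s_d}) = \sqrt{(1-f)\,s_d^2/n_d}$, f = n/N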

83

For large samples: $f_d = n_d/N_d \approx f$

• Can then treat $s_d$ as a SRS from $U_d$

• Whatever the size of n is, conditional on $n_d$, $s_d$ is a SRS from $U_d$: conditional inference

Example: Psychiatric Morbidity Survey 1993

Proportions with psychiatric problems

Domain d          $n_d$     $\bar{y}_{s_d}$     $SE(\bar{y}_{s_d})$

Women             4933      0.18      $\sqrt{0.18 \times 0.82/4932} = 0.005$

Divorced women    314       0.29      $\sqrt{0.29 \times 0.71/313} = 0.026$
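The two standard errors in the domain table follow from treating the domain sample as a SRS from the domain. A minimal sketch, with the function name and the default of a negligible sampling fraction being assumptions made here:

```python
from math import sqrt


def domain_proportion_se(p_d, n_d, f=0.0):
    """SE of a domain proportion, treating the n_d domain units as a SRS.

    For a 0/1 variable s_d^2 = n_d/(n_d - 1) * p(1 - p), so
    SE = sqrt((1 - f) * s_d^2 / n_d) = sqrt((1 - f) * p(1 - p) / (n_d - 1)).
    """
    return sqrt((1 - f) * p_d * (1 - p_d) / (n_d - 1))


se_women = domain_proportion_se(0.18, 4933)
se_divorced = domain_proportion_se(0.29, 314)
print(round(se_women, 3), round(se_divorced, 3))
```

The smaller domain has a standard error roughly five times larger, which is why estimation for subgroups tends to drive the required total sample size.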

84

Estimating domain totals

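• $N_d$ is known: Use $N_d\,\bar{y}_{s_d}$

• $N_d$ unknown, must be estimated. Since $N_d$ is the x-total: $\hat{N}_d = N\bar{x}_s = N n_d/n$

$$\hat{t}_d = \hat{N}_d\,\bar{y}_{s_d} = \frac{N}{n}\sum_{i \in s} u_i = N\bar{u}_s$$

$SE(\hat{t}_d) = N\sqrt{(1-f)\,s_u^2/n}$, where $s_u^2$ is the sample variance of the $u_i$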

85

Stratified sampling

• Basic idea: Partition the population U into H subpopulations, called strata.

• Nh = size of stratum h, known

• Draw a separate sample from each stratum, sh of size nh from stratum h, independently between the strata

• In social surveys: Stratify by geographic regions, age groups, gender

• Ex: business survey. Canadian survey of employment. Establishments stratified by

o Standard Industrial Classification: 16 industry divisions

o Size: number of employees, 4 groups: 0-19, 20-49, 50-199, 200+

o Province: 12 provinces

Total number of strata: 16 x 4 x 12 = 768

86

Reasons for stratification

1. Strata form domains of interest for which separate estimates of given precision are required, e.g. strata = geographical regions

2. To “spread” the sample over the whole population. Easier to get a representative sample

3. To get more accurate estimates of population totals, reduce sampling variance

4. Can use different modes of data collection in different strata, e.g. telephone versus home interviews

87

Stratified simple random sampling

• The most common stratified sampling design

• SRS from each stratum

• Notation:

From stratum h: sample $s_h$ of size $n_h$

Total sample size: $n = \sum_{h=1}^{H} n_h$

Values from stratum h: $y_{hi},\ i = 1, \dots, N_h$

Sample: $(y_{hi}: i \in s_h)$

Sample mean: $\bar{y}_h = \sum_{i \in s_h} y_{hi} / n_h$

88

$t_h$ = y-total for stratum h: $t_h = \sum_{i=1}^{N_h} y_{hi}$

The population total: $t = \sum_{h=1}^{H} t_h$

Consider estimation of $t_h$: $\hat{t}_h = N_h\bar{y}_h$

Assuming no auxiliary information in addition to the "stratifying variables"

The stratified estimator of t:

$$\hat{t}_{st} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h\bar{y}_h$$

89

To estimate the population mean $\mu = t/N$:

Stratified mean: $\bar{y}_{st} = \hat{t}_{st}/N = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar{y}_h$

A weighted average of the sample stratum means.

• Properties of the stratified estimator follow from properties of SRS estimators.

• Notation:

Mean in stratum h: $\mu_h = \sum_{i=1}^{N_h} y_{hi} / N_h$

Variance in stratum h: $\sigma_h^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h} (y_{hi} - \mu_h)^2$

90

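$E(\hat{t}_{st}) = t$, so $\hat{t}_{st}$ is unbiased

$$Var(\hat{t}_{st}) = \sum_{h=1}^{H} Var(\hat{t}_h) = \sum_{h=1}^{H} N_h^2\,\frac{\sigma_h^2}{n_h}\,(1 - f_h), \quad f_h = n_h/N_h$$

Estimated variance is obtained by estimating the stratum variance with the stratum sample variance

$$s_h^2 = \frac{1}{n_h - 1}\sum_{i \in s_h} (y_{hi} - \bar{y}_h)^2$$

$$\hat{V}(\hat{t}_{st}) = \sum_{h=1}^{H} N_h^2\,\frac{s_h^2}{n_h}\,(1 - f_h)$$

Approximate 95% confidence interval if n and N-n are large:

$$\hat{t}_{st} \pm 1.96\sqrt{\hat{V}(\hat{t}_{st})}$$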
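The stratified estimator and its variance estimate are sums of stratum-level SRS quantities, which translates directly into code. The sketch below uses a hypothetical two-stratum data set invented purely for illustration; the function name is likewise an assumption.

```python
from math import sqrt
from statistics import mean, variance


def stratified_total(strata):
    """Stratified estimate of a total and its estimated variance.

    strata: list of (N_h, sample values from stratum h), SRS within each stratum.
    """
    t_hat = 0.0
    v_hat = 0.0
    for N_h, ys in strata:
        n_h = len(ys)
        t_hat += N_h * mean(ys)                                   # N_h * ybar_h
        v_hat += N_h**2 * variance(ys) / n_h * (1 - n_h / N_h)    # uses s_h^2
    return t_hat, v_hat


# Hypothetical data: two strata of sizes 100 and 50.
strata = [
    (100, [2, 4, 6, 8]),
    (50, [10, 12, 14, 16, 18]),
]
t_hat, v_hat = stratified_total(strata)
half_width = 1.96 * sqrt(v_hat)
print(t_hat, v_hat, (t_hat - half_width, t_hat + half_width))
```

`statistics.variance` divides by n_h - 1, which matches the stratum sample variance used on the slide. With samples this small the normal-based interval is only a formality; the point is the arithmetic.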

91

Estimating population proportion in stratified simple random sampling

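$p_h$: proportion in stratum h with a certain characteristic A

p is the population mean: p = t/N, $p = \sum_{h=1}^{H} N_h p_h / N$

Stratum mean estimator: $\hat{p}_h = \bar{y}_h$, where $y_{hi} = 1$ if unit i in stratum h has characteristic A

Stratified estimator of p: $\hat{p}_{st} = \bar{y}_{st} = \sum_{h=1}^{H} (N_h/N)\,\hat{p}_h$

Stratified estimator of the total t = number of units in the population with characteristic A:

$$\hat{t}_{st} = N\hat{p}_{st} = \sum_{h=1}^{H} N_h\hat{p}_h$$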

92

Estimated variance:

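$$\hat{V}(\hat{p}_h) = \frac{\hat{p}_h(1-\hat{p}_h)}{n_h - 1}\left(1 - \frac{n_h}{N_h}\right) \quad \text{(slide 31)}$$

$$\hat{V}(\hat{p}_{st}) = \hat{V}\left(\sum_{h=1}^{H} W_h\hat{p}_h\right) = \sum_{h=1}^{H} W_h^2\left(1 - \frac{n_h}{N_h}\right)\frac{\hat{p}_h(1-\hat{p}_h)}{n_h - 1}$$

where $W_h = N_h/N$

and

$$\hat{V}(\hat{t}_{st}) = \hat{V}\left(N\sum_{h=1}^{H} W_h\hat{p}_h\right) = N^2\sum_{h=1}^{H} W_h^2\left(1 - \frac{n_h}{N_h}\right)\frac{\hat{p}_h(1-\hat{p}_h)}{n_h - 1}$$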

93

Allocation of the sample units

• Important to determine the sizes of the stratum samples, given the total sample size n and given the strata partitioning: how to allocate the sample units to the different strata

• Proportional allocation

– A representative sample should mirror the population

– Strata proportions: Wh = Nh/N

– Strata sample proportions should be the same: nh/n = Wh

– Proportional allocation:

$$n_h = n\,\frac{N_h}{N} \iff \frac{n_h}{N_h} = \frac{n}{N} \text{ for all } h$$

94

The stratified estimator under proportional allocation

Inclusion probabilities: $\pi_{hi} = n_h/N_h = n/N$, the same for all units in the population, but it is not a SRS

$$\hat{t}_{st} = \sum_{h=1}^{H} N_h\bar{y}_h = \sum_{h=1}^{H} \frac{N_h}{n_h}\sum_{i \in s_h} y_{hi} = \frac{N}{n}\sum_{h=1}^{H}\sum_{i \in s_h} y_{hi} = N\bar{y}_s$$

The stratified mean: $\bar{y}_{st} = \hat{t}_{st}/N = \bar{y}_s$

The equally weighted sample mean (the sample is self-weighting: every unit in the sample represents the same number of units in the population, N/n)

95

Variance and estimated variance under proportional allocation

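$$Var(\hat{t}_{st}) = \sum_{h=1}^{H} N_h^2\,\frac{\sigma_h^2}{n_h}\,(1 - f_h) = N^2\,\frac{1-f}{n}\sum_{h=1}^{H} W_h\sigma_h^2$$

where $f_h = n_h/N_h = f = n/N$ and $W_h = N_h/N$

$$\hat{V}(\hat{t}_{st}) = N^2\,\frac{1-f}{n}\sum_{h=1}^{H} W_h s_h^2$$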

96

• The estimator in simple random sample:

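$$\hat{t}_{SRS} = N\bar{y}_s$$

• Under proportional allocation:

$$\hat{t}_{st} = \hat{t}_{SRS}$$

• but the variances are different:

Under SRS: $Var(\hat{t}_{SRS}) = N^2\,\frac{1-f}{n}\,\sigma^2$

Under proportional allocation: $Var(\hat{t}_{st}) = N^2\,\frac{1-f}{n}\sum_{h=1}^{H} W_h\sigma_h^2$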

97

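Using the approximations $(N_h-1)/(N-1) \approx N_h/N$ and $N_h/(N-1) \approx N_h/N$:

$$\sigma^2 \approx \sum_{h=1}^{H} W_h\sigma_h^2 + \sum_{h=1}^{H} W_h(\bar y_h - \bar y)^2$$

where $\bar y_h$ and $\bar y$ here denote the stratum and population means.

Total variance = variance within strata + variance between strata

Implications:

1. No matter what the stratification scheme is: under the approximations above, proportional allocation gives estimates of the population total that are at least as accurate as under SRS

2. Choose strata with little internal variability, i.e., small stratum variances. Then the stratum means will vary more, the between variance becomes larger, and the precision of the estimates increases compared to SRS

3. This is also essentially true in general, as seen from

$$Var(\hat t_{st}) = N^2\sum_{h=1}^{H} W_h^2\,\frac{1-f_h}{n_h}\,\sigma_h^2$$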

98

Optimal allocation

If the only concern is to estimate the population total t:

• Choose nh such that the variance of the stratified estimator is minimum

• Solution depends on the unknown stratum variances

• If the stratum variances are approximately equal, proportional allocation minimizes the variance of the stratified estimator

99

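Optimal allocation:

$$n_h = n\,\frac{N_h\sigma_h}{\sum_{k=1}^{H} N_k\sigma_k}$$

Proof:

Minimize $Var(\hat t_{st})$ with respect to the sample sizes $n_h$, subject to $\sum_{h=1}^{H} n_h = n$ being fixed

Use the Lagrange multiplier method: Minimize

$$Q = \sum_{h=1}^{H} N_h^2\sigma_h^2\Big(\frac{1}{n_h} - \frac{1}{N_h}\Big) + \lambda\Big(\sum_{h=1}^{H} n_h - n\Big)$$

$$\frac{\partial Q}{\partial n_h} = 0 \iff -\frac{N_h^2\sigma_h^2}{n_h^2} + \lambda = 0 \iff n_h = N_h\sigma_h/\sqrt{\lambda}$$

Result follows since the sample sizes must add up to n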

100

• Called Neyman allocation (Neyman, 1934)

• Should sample heavily in strata if

– The stratum accounts for a large part of the population

– The stratum variance is large

• If the stratum variances are equal, this is proportional allocation

• Problem, of course: Stratum variances are unknown

– Take a small preliminary sample (pilot)

– The variance of the stratified estimator is not very sensitive to deviations from the optimal allocation. Need just crude approximations of the stratum variances
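A small numerical sketch may help compare proportional allocation, $n_h = nN_h/N$, with Neyman allocation, $n_h = nN_h\sigma_h/\sum_k N_k\sigma_k$. The stratum sizes and standard deviations below are hypothetical, and the allocations are left as real numbers (in practice they would be rounded to integers).

```python
def proportional_allocation(n, N_h):
    """n_h = n * N_h / N."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

def neyman_allocation(n, N_h, sigma_h):
    """n_h = n * N_h * sigma_h / sum_k N_k * sigma_k."""
    weights = [Nh * s for Nh, s in zip(N_h, sigma_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

def var_t_st(N_h, sigma_h, n_h):
    """Var(t_st) = sum_h N_h^2 * sigma_h^2 * (1/n_h - 1/N_h)."""
    return sum(Nh ** 2 * s ** 2 * (1 / nh - 1 / Nh)
               for Nh, s, nh in zip(N_h, sigma_h, n_h))

# Hypothetical strata
N_h = [500, 300, 200]
sigma_h = [1.0, 2.0, 5.0]
n = 100

prop = proportional_allocation(n, N_h)
neym = neyman_allocation(n, N_h, sigma_h)
print("proportional:", prop, "variance:", var_t_st(N_h, sigma_h, prop))
print("Neyman:      ", neym, "variance:", var_t_st(N_h, sigma_h, neym))
```

With equal stratum standard deviations the two functions return the same allocation, in line with the remark above.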

101

Optimal allocation when considering the cost of a survey

• C represents the total cost of the survey, fixed – our budget

• c0 : overhead cost, like maintaining an office

• ch : cost of taking an observation in stratum h

– Home interviews: traveling cost + interview

– Telephone or postal surveys: ch is the same for all strata

– In some strata: telephone, in others home interviews

$$C = c_0 + \sum_{h=1}^{H} n_h c_h$$

• Minimize the variance of the stratified estimator for a given total cost C

102

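Minimize

$$Var(\hat t_{st}) = N^2\sum_{h=1}^{H} W_h^2\sigma_h^2\Big(\frac{1}{n_h} - \frac{1}{N_h}\Big) \quad \text{subject to: } c_0 + \sum_{h=1}^{H} n_h c_h = C$$

Solution: $n_h \propto W_h\sigma_h/\sqrt{c_h}$

$$n_h = (C - c_0)\,\frac{W_h\sigma_h/\sqrt{c_h}}{\sum_{k=1}^{H} W_k\sigma_k\sqrt{c_k}}$$

Hence, for a fixed total cost C:

$$n = (C - c_0)\,\frac{\sum_{h=1}^{H} N_h\sigma_h/\sqrt{c_h}}{\sum_{h=1}^{H} N_h\sigma_h\sqrt{c_h}}$$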

103

1. Large samples in inexpensive strata

2. If the $c_h$'s are equal: Neyman allocation

3. If the $c_h$'s are equal and the $\sigma_h$'s are equal: proportional allocation

We can express the optimal sample sizes in relation to n:

$$n_h = n\,\frac{W_h\sigma_h/\sqrt{c_h}}{\sum_{k=1}^{H} W_k\sigma_k/\sqrt{c_k}}$$

In particular, if ch = c for all h: $n = (C - c_0)/c$
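The cost-constrained allocation $n_h = (C-c_0)\,(N_h\sigma_h/\sqrt{c_h})/\sum_k N_k\sigma_k\sqrt{c_k}$ (written with $N_h$ instead of $W_h$, which gives the same ratio) can be checked numerically. The budget, costs and stratum values below are hypothetical, and the function name is only illustrative.

```python
from math import sqrt

def cost_optimal_allocation(C, c0, N_h, sigma_h, c_h):
    """Stratum sample sizes minimizing Var(t_st) for total budget C.

    n_h = (C - c0) * (N_h*sigma_h/sqrt(c_h)) / sum_k N_k*sigma_k*sqrt(c_k)
    """
    numer = [Nh * s / sqrt(c) for Nh, s, c in zip(N_h, sigma_h, c_h)]
    denom = sum(Nh * s * sqrt(c) for Nh, s, c in zip(N_h, sigma_h, c_h))
    return [(C - c0) * a / denom for a in numer]

# Hypothetical survey: budget 1000, overhead 100, unit costs 1, 4 and 9
N_h = [500, 300, 200]
sigma_h = [1.0, 2.0, 5.0]
c_h = [1.0, 4.0, 9.0]

n_h = cost_optimal_allocation(1000, 100, N_h, sigma_h, c_h)
total_cost = 100 + sum(nh * c for nh, c in zip(n_h, c_h))
print("n_h =", n_h)
print("total sample size n =", sum(n_h), " total cost =", total_cost)
```

The resulting sizes exhaust the budget exactly; with equal unit costs and equal stratum variances the allocation reduces to proportional allocation with n = (C - c0)/c.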

104

Other issues with optimal allocation

• Many survey variables

• Each variable leads to a different optimal solution

– Choose one or two key variables

– Use proportional allocation as a compromise

• If nh > Nh, let nh = Nh and use optimal allocation for the remaining strata

• If nh = 1, the variance cannot be estimated. Force nh = 2 or collapse strata for variance estimation

• Number of strata: For a given n often best to increase number of strata as much as possible. Depends on available information

105

• Sometimes the main interest is in precision of the estimates for stratum totals and less interest in the precision of the estimate for the population total

• Need to decide nh to achieve desired accuracy for estimate of th, discussed earlier

– If we decide to do proportional allocation, it can mean in small strata (small Nh) the sample size nh must be increased

106

Poststratification

• Stratification reduces the uncertainty of the estimator compared to SRS

• In many cases one wants to stratify according to variables that are not known or used in sampling

• Can then stratify after the data have been collected

• Hence, the term poststratification

• The estimator is then the usual stratified estimator according to the poststratification

• If we take an SRS and N-n and n are large, the estimator behaves like the stratified estimator with proportional allocation

107

Poststratification to reduce nonresponse bias

• Poststratification is mostly used to correct for nonresponse

• Choose strata with different response rates

• Poststratification amounts to assuming that the response sample in poststratum h is representative for the nonresponse group in the sample from poststratum h

108

Systematic sampling

• Idea: Order the population and select every kth unit

• Procedure: U = {1,…,N} and N = nk + c, c < n

1. Select a random integer r between 1 and k, with equal probability

2. Select the sample sr by the systematic rule

sr = {i: i = r + (j-1)k: j = 1, …, nr}

where the actual sample size nr takes the value [N/k] or [N/k] + 1, and k is the sampling interval, k = [N/n]

• Very easy to implement: Visit every 10th house or interview every 50th name in the telephone book

109

• k distinct samples each selected with probability 1/k

$$p(s) = \begin{cases} 1/k & \text{if } s = s_r,\ r = 1,\dots,k \\ 0 & \text{otherwise} \end{cases}$$

• Unlike in SRS, many subsets of U have zero probability

Examples:

1) N =20, n=4. Then k=5 and c=0. Suppose we select r =1. Then the sample is {1,6,11,16}

5 possible distinct samples. In SRS: 4845 distinct samples

2) N = 149, n = 12. Then k = 12, c = 5. Suppose r = 3. s3 = {3,15,27,39,51,63,75,87,99,111,123,135,147} and sample size is 13

3) N = 20, n = 8. Then k = 2 and c = 4. Sample size is nr = 10

4) N= 100 000, n = 1500. Then k = 66 , c=1000 and c/k =15.15 with [c/k]=15. nr = 1515 or 1516
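The systematic rule with sampling interval k = [N/n] and random start r in {1, …, k} can be written in a few lines. The sketch below (the function name is hypothetical) reproduces the samples and sample sizes of the examples above.

```python
import random

def systematic_sample(N, n, r=None):
    """Systematic sample from U = {1, ..., N} with interval k = N // n.

    r is the random start in {1, ..., k}; it is drawn uniformly if not given.
    """
    k = N // n
    if r is None:
        r = random.randint(1, k)      # equal probability 1/k for each start
    return list(range(r, N + 1, k))   # r, r + k, r + 2k, ... up to N

print(systematic_sample(20, 4, r=1))                 # example 1
print(len(systematic_sample(149, 12, r=3)))          # example 2: sample size
print(len(systematic_sample(20, 8, r=2)))            # example 3: sample size
sizes = {len(systematic_sample(100_000, 1500, r=r)) for r in range(1, 67)}
print(sorted(sizes))                                 # example 4: possible sizes
```

Only k distinct samples can occur, so the realized sample size depends on the start r whenever N is not a multiple of k.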

110

Estimation of the population total

$t(s) = \sum_{i\in s} y_i$, $n(s)$ = sample size

Two estimators (equal when N = nk):

1) $\hat t(s) = k\,t(s) = [N/n]\,t(s)$

2) $\hat t^{*}(s) = N\bar y_s = N\,t(s)/n(s)$

$n(s) = [N/k]$ or $[N/k] + 1$

These estimators are approximately the same:

$$\frac{N}{n(s)} \approx \frac{N}{N/k} = k$$

111

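$\hat t(s) = k\,t(s)$ is unbiased:

$$E(\hat t) = \sum_{r=1}^{k}\hat t(s_r)\,p(s_r) = \sum_{r=1}^{k} k\,t(s_r)\,\frac{1}{k} = \sum_{r=1}^{k} t(s_r) = t$$

The second estimator, $\hat t^{*}(s) = N\bar y_s$, is only approximately unbiased (it is a ratio estimator), but usually has slightly smaller variance than $\hat t$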

• Advantage of systematic sampling: Can be implemented even where no population frame exists

•E.g. sample every 10th person admitted to a hospital, every 100th tourist arriving at LA airport.

112

$$Var(\hat t) = E(\hat t - t)^2 = \sum_{r=1}^{k}\big(\hat t(s_r) - t\big)^2 p(s_r) = \frac{1}{k}\sum_{r=1}^{k}\big(k\,t(s_r) - t\big)^2 = k\sum_{r=1}^{k}\big(t(s_r) - \bar t\big)^2$$

where $\bar t = \sum_{r=1}^{k} t(s_r)/k$ is the average of the sample totals

• The variance is small if $t(s_r)$ varies little, i.e., if the "strata" {1,…,k}, {k+1,…,2k}, etc., are very homogeneous

• Or, equivalently, if the values within the possible samples sr are very different; the samples are heterogeneous

• Problem: The variance cannot be estimated properly because we have only one observation of t(sr)
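Because only k samples are possible, the design expectation and variance of $\hat t = k\,t(s_r)$ can be computed by full enumeration when the whole population is known. The sketch below does this for two hypothetical populations with N = nk: one with a linear trend and one where each implicit stratum of k consecutive units is perfectly homogeneous.

```python
def systematic_design_variance(y, k):
    """Design expectation and variance of t_hat = k * t(s_r), assuming N = n*k.

    y: full population vector in its listed order; k: sampling interval.
    """
    totals = [sum(y[r::k]) for r in range(k)]        # t(s_r) for each start
    t_bar = sum(totals) / k                          # average of sample totals
    expectation = sum(k * t for t in totals) / k     # sum_r t_hat(s_r) * (1/k)
    variance = k * sum((t - t_bar) ** 2 for t in totals)
    return expectation, variance

trend = list(range(1, 21))                           # y_i = i, N = 20
blocks = [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5       # homogeneous implicit strata

print(systematic_design_variance(trend, 5))
print(systematic_design_variance(blocks, 5))
```

In the second population every possible sample has the same total, so the design variance is zero. In a real survey only one t(s_r) is observed, which is why this variance cannot be estimated directly from the sample.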

113

Systematic sampling as Implicit Stratification

In practice: Very often when using systematic sampling (common design in national statistical institutes):

The population is ordered such that the first k units constitute a homogeneous “stratum”, the second k units another “stratum”, etc.

Implicit stratum          Units

1                         1, 2, …, k

2                         k+1, …, 2k

:                         :

n (n = N/k assumed)       (n-1)k+1, …, nk

Systematic sampling selects 1 unit from each stratum at random

114

Systematic sampling vs SRS

• Systematic sampling is more efficient if the study variable is homogeneous within the implicit strata

– Ex: households ordered according to house numbers within neighbourhoods and study variable related to income

• Households in the same neighbourhood are usually homogeneous with respect to socio-economic variables

• If population is in random order (all N! permutations are equally likely): systematic sampling is similar to SRS

• Systematic sampling can be very bad if y has periodic variation relative to k: – Approximately: y1 = yk+1, y2 = yk+2 , etc

115

Variance estimation

• No direct estimate; impossible to obtain an unbiased estimate

• If population is in random order: can use the variance estimate from SRS as an approximation

• Develop a conservative variance estimator by collapsing the “implicit strata”, overestimate the variance

• The most promising approach may be:

Under a statistical model, estimate the expected value of the design variance

• Typically, systematic sampling is used in the second stage of two-stage sampling (to be discussed later), may not be necessary to estimate this variance then.

116

Cluster sampling and multistage sampling

• Sampling designs so far: Direct sampling of the units in a single stage of sampling

• For economic and practical reasons: may be necessary to modify these sampling designs

– There exists no population frame (register: list of all units in the population), and it is impossible or very costly to produce such a register

– The population units are scattered over a wide area, and a direct sample will also be widely scattered. In case of personal interviews, the traveling costs would be very high and it would not be possible to visit the whole sample

117

• Modified sampling can be done by

1. Selecting the sample indirectly in groups, called clusters; cluster sampling

– Population is grouped into clusters

– Sample is obtained by selecting a sample of clusters and observing all units within the clusters

– Ex: In Labor Force Surveys: Clusters = Households, units = persons

2. Selecting the sample in several stages; multistage sampling

118

3. In two-stage sampling:

• Population is grouped into primary sampling units (PSU)

• Stage 1: A sample of PSUs

• Stage 2: For each PSU in the sample at stage 1, we take a sample of population units, now also called secondary sampling units (SSU)

• Ex: PSUs are often geographical regions

119

Examples

1. Cluster sampling. Want a sample of high school students in a certain area, to investigate smoking and alcohol use. If a list of high school classes is available, we can select a sample of high school classes and give the questionnaire to every student in the selected classes; cluster sampling with the high school classes being the clusters

2. Two-stage cluster sampling. If a list of classes is not available, we can first select high schools, then classes and finally all students in the selected classes. Then we have a 2-stage cluster sample.

– PSU = high school

– SSU = classes

– Units = students

120

Psychiatric Morbidity Survey is a 4-stage sample

– Population: adults aged 16-64 living in private households in Great Britain

– PSUs = postal sectors

– SSUs = addresses

– 3SUs = households

– Units = individuals

Sampling process:

1) 200 PSUs selected

2) 90 SSUs selected within each sampled PSU (interviewer workload)

3) All households selected per SSU

4) 1 adult selected per household

121

Cluster sampling

Number of clusters in the population: N

Number of units in cluster i: Mi

Population size: $M = \sum_{i=1}^{N} M_i$

$t_i = y$-total in cluster i, $t = \sum_{i=1}^{N} t_i$

Population mean for the y-variable: $\bar y = t/M$

$s_I$ = sample of clusters, $n = |s_I|$

Final sample of units: s = all units in the clusters in $s_I$

Size of final sample: $m = \sum_{i\in s_I} M_i$, not fixed in advance

122

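Simple random cluster sampling

Ratio-to-size estimator

Use auxiliary information: Size of the sampled clusters

$$\hat t_R = M\,\frac{\sum_{i\in s_I} t_i}{\sum_{i\in s_I} M_i}$$

Approximately unbiased with approximate variance

$$Var(\hat t_R) \approx N^2\,\frac{1-f}{n}\cdot\frac{1}{N-1}\sum_{i=1}^{N} M_i^2(\bar y_i - \bar y)^2$$

where $f = n/N$, $\bar y_i = t_i/M_i$ is the cluster mean, and $\bar y = t/M$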

123

estimated by

$$\hat V(\hat t_R) = \Big(\frac{M/N}{m/n}\Big)^2 N^2\,\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i\in s_I} M_i^2(\bar y_i - \bar y_s)^2$$

where $f = n/N$ and $\bar y_s = \sum_{i\in s_I} t_i/\sum_{i\in s_I} M_i$ is the usual sample mean

Note that this ratio estimator is in fact the usual sample mean based estimator with respect to the y-variable: $\hat t_R = M\bar y_s$

And the corresponding estimator of the population mean of y is $\bar y_s$

Can be used also if M is unknown

124

• Estimator’s variance is highly influenced by how the clusters are constructed.

Choose clusters to make $\sum_i M_i^2(\bar y_i - \bar y)^2$ small: make the clusters heterogeneous, such that most of the variation in the y-values lies within the clusters, making the $\bar y_i$ values similar

• Note: The opposite in stratified sampling

• Typically, clusters are formed by “nearby units” like households, schools, hospitals because of economical and practical reasons, with little variation within the clusters:

Simple random cluster sampling will lead to much less precise estimates compared to SRS, but this is offset by big cost reductions

Sometimes SRS is not possible; information only known for clusters

125

Design Effects

A design effect (deff) compares efficiency of two design-estimation strategies (sampling design and estimator) for same sample size

Now: Compare

Strategy 1:simple random cluster sampling with ratio estimator

Strategy 2: SRS, of same sample size m, with usual sample mean estimator

In terms of estimating population mean:

Strategy 1 estimator: $\hat t_R/M = \bar y_s$

Strategy 2 estimator: $\bar y_s$

126

The design effect of simple random cluster sampling, SCS, is then

$$deff(SCS, \bar y_s) = Var_{SCS}(\bar y_s)/Var_{SRS}(\bar y_s)$$

Estimated deff: $\hat V_{SCS}(\bar y_s)/\hat V_{SRS}(\bar y_s)$

In probation example:

$$\hat V_{SRS}(\hat p) = [\hat p(1-\hat p)/(m-1)](1-f) \approx \hat p(1-\hat p)/(m-1) = 0.00387^2$$

Estimated deff = $0.0302^2/0.00387^2 = 60.9$

Conclusion: Cluster sampling is much less efficient

Note: We can estimate the p.c. factor $1 - m/M$ by letting $\hat M = N\,(m/n)$, so that $1 - m/\hat M = 1 - n/N = 16/26 = 0.615$ and

est. deff = 60.9/0.615 = 99 !
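The arithmetic behind the estimated design effect is easy to re-check. The sketch below only uses the standard errors 0.0302 and 0.00387 and the finite population factor 16/26 quoted above; the variable names are illustrative.

```python
se_scs = 0.0302      # estimated standard error under simple random cluster sampling
se_srs = 0.00387     # estimated standard error under SRS, ignoring the p.c. factor
fpc = 16 / 26        # estimated p.c. factor 1 - n/N from the example

deff = (se_scs / se_srs) ** 2        # ratio of estimated variances
deff_with_fpc = deff / fpc           # SRS variance multiplied by the p.c. factor

print(round(deff, 1), round(fpc, 3), round(deff_with_fpc))
```

Dividing by the p.c. factor makes the SRS variance smaller, so the design effect grows from about 61 to about 99.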

127

Two-stage sampling

• Basic justification: With homogeneous clusters and a given budget, it is inefficient to survey all units in the cluster; can instead select more clusters

• Population is partitioned into N primary sampling units (PSU)

• Stage 1: Select a sample sI of PSUs

• Stage 2: For each selected PSU i in sI: Select a sample si of units (secondary sampling units, SSU)

• The cluster totals ti must be estimated from the sample

128

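$n = |s_I|$: size of stage 1 sample of PSUs

$m_i = |s_i|$

Total sample size: $m = \sum_{i\in s_I} m_i = |s|$

General two-stage sampling plan:

$$\pi_{Ii} = P(\text{PSU } i \in s_I)$$

$$\pi_{j|i} = P(\text{SSU } j \in s_i \mid i \in s_I)$$

Inclusion probability for unit (SSU) j in cluster i:

$$\pi_{ij} = \pi_{Ii}\,\pi_{j|i}$$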

129

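Horvitz-Thompson estimator of the total $t_i$, $i \in s_I$:

$$\hat t_{i,HT} = \sum_{j\in s_i}\frac{y_{ij}}{\pi_{j|i}}, \quad \text{where } y_{ij} = \text{value of } y \text{ for unit } j \text{ in cluster } i$$

Suggested estimator for population total t:

$$\hat t = \sum_{i\in s_I}\frac{\hat t_{i,HT}}{\pi_{Ii}} = \sum_{i\in s_I}\frac{1}{\pi_{Ii}}\sum_{j\in s_i}\frac{y_{ij}}{\pi_{j|i}} = \sum_{i\in s_I}\sum_{j\in s_i}\frac{y_{ij}}{\pi_{ij}} = \hat t_{HT}$$

with $\pi_{ij} = \pi_{Ii}\pi_{j|i}$. Unbiased estimator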

130

$$Var(\hat t) = Var\Big(\sum_{i\in s_I}\frac{t_i}{\pi_{Ii}}\Big) + \sum_{i=1}^{N}\frac{Var(\hat t_{i,HT})}{\pi_{Ii}}$$

1. The first component expresses the sampling uncertainty on stage 1, since we are selecting a sample of PSU’s. It is the variance of the HT-estimator with ti as observations

2. The second component is stage 2 variance and tells us how well we are able to estimate each ti in the whole population

3. The second component is often negligible because of little variability within the clusters

131

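A special case: Clusters of equal size and SRS on stage 1 and stage 2

$M_i = M_0$, $i = 1,\dots,N$

– equal sample sizes at stage 2: $m_i = m_0$

$$\pi_{Ii} = n/N, \quad \pi_{j|i} = m_0/M_0, \quad \pi_{ij} = \frac{n\,m_0}{N\,M_0} = \frac{m}{M}$$

$$\hat t = \frac{M}{m}\sum_{i\in s_I}\sum_{j\in s_i} y_{ij} = M\bar y_s$$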

Self-weighting sample: equal inclusion probabilities for all units in the population

132

Unequal cluster sizes. PPS – SRS sampling

• In social surveys: good reasons to have equal inclusion probabilities (self-weighting sample) for all units in the population (similar representation to all domains)

• Stage 1: Select PSUs with probability proportional to size Mi

• Stage 2: SRS (or systematic sample) of SSUs

• Such that sample is self-weighting:

$$\pi_{Ii} = n\,M_i/M \quad \text{and} \quad \pi_{j|i} = m_i/M_i \quad \text{such that} \quad \pi_{ij} = \pi_{Ii}\,\pi_{j|i} = m/M$$

mi = m/n = equal sample sizes in all selected PSUs

$$\hat t = M\bar y_s$$
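That PPS selection at stage 1 combined with equal-size SRS at stage 2 gives a self-weighting sample can be verified with exact fractions: $\pi_{Ii}\pi_{j|i} = (nM_i/M)(m_i/M_i) = m/M$ when $m_i = m/n$. The cluster sizes and sample sizes below are hypothetical.

```python
from fractions import Fraction

def pps_srs_inclusion_probs(M_i, n, m):
    """Unit inclusion probabilities pi_ij, one per cluster, under PPS-SRS.

    Stage 1: pi_Ii = n * M_i / M (requires n * M_i <= M).
    Stage 2: SRS of m_i = m / n units in each selected cluster.
    """
    M = sum(M_i)
    m_i = Fraction(m, n)                  # equal stage-2 sample size
    probs = []
    for Mi in M_i:
        pi_I = Fraction(n * Mi, M)        # stage-1 inclusion probability
        pi_cond = m_i / Mi                # stage-2 conditional probability
        probs.append(pi_I * pi_cond)
    return probs

# Hypothetical population of four clusters
probs = pps_srs_inclusion_probs([100, 200, 300, 400], n=2, m=40)
print(probs)    # every entry equals m / M = 40/1000
```

The cluster size cancels, so large clusters are more likely to be selected but each of their units is less likely to be drawn within the cluster.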

133

Remarks

• Usually one interviewer for each selected PSU

• First stage sampling is often stratified PPS

• With self-weighting PPS-SRS:

– equal workload for each interviewer

– Total sample size m is fixed

134

II. Likelihood in statistical inference and survey sampling

• Problems with design-based inference

• Likelihood principle, conditionality principle and sufficiency principle

• Fundamental equivalence

• Likelihood and likelihood principle in survey sampling

135

Traditional approach

Design-based inference

• Population (Target population): The universe of all units of interest for a certain study: U = {1,2, …, N}

– All units can be identified and labeled

– Variable of interest y with population values $y = (y_1, y_2, \dots, y_N)$

– Typical problem: Estimate total t or population mean t/N

• Sample: A subset s of the population, to be observed

• Sampling design p(s) is known for all possible subsets;

– The probability distribution of the stochastic sample

136

Problems with design-based inference

• Generally: Design-based inference is with respect to hypothetical replications of sampling for a fixed population vector y

• Variance estimates may fail to reflect information in a given sample

• If we want to measure how a certain estimation method does in quarterly or monthly surveys, then y will vary from quarter to quarter or month to month – need to assume that y is a realization of a random vector

• Use: Likelihood and likelihood principle as guideline on how to deal with these issues

137

Problem with design-based variance measure Illustration 1

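a) N + 1 possible samples: {1}, {2},…,{N}, {1,2,…,N}

b) Sampling design: p({i}) = 1/2N, for i = 1,..,N; p({1,2,…,N}) = 1/2

c) Use $\bar y_s$ as estimator for the population mean $\bar y$

Unbiased:

$$E(\bar y_s) = \sum_s \bar y_s\,p(s) = \frac{1}{2N}\sum_{i=1}^{N} y_i + \frac{1}{2}\bar y = \bar y$$

Design-variance:

$$Var(\bar y_s) = E(\bar y_s - \bar y)^2 = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \bar y)^2 = \frac{\tilde\sigma^2}{2}, \qquad \tilde\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar y)^2$$

d) Assume we select the “sample” {1,2,…,N}. Then we claim that the “precision” of the resulting sample (known to be without error) is $\tilde\sigma^2/2$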
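Illustration 1 can be checked by enumerating the N + 1 possible samples. The sketch below, with a hypothetical population vector, computes the design expectation and design variance of the sample mean and compares the variance with $\tilde\sigma^2/2$, where $\tilde\sigma^2 = \sum_i (y_i - \bar y)^2/N$.

```python
def illustration_1(y):
    """Design expectation and variance of the sample mean under the design
    p({i}) = 1/(2N) for each single unit and p({1,...,N}) = 1/2."""
    N = len(y)
    y_bar = sum(y) / N
    # (sample, probability) pairs: N singletons plus the census
    design = [([i], 1 / (2 * N)) for i in range(N)] + [(list(range(N)), 0.5)]
    means = [(sum(y[i] for i in s) / len(s), p) for s, p in design]
    expectation = sum(mean * p for mean, p in means)
    variance = sum((mean - y_bar) ** 2 * p for mean, p in means)
    sigma_tilde2 = sum((v - y_bar) ** 2 for v in y) / N
    return expectation, variance, sigma_tilde2

e, v, s2 = illustration_1([1, 2, 3, 4, 5])   # hypothetical population
print("E =", e, " Var =", v, " sigma_tilde^2 / 2 =", s2 / 2)
```

The design variance is the same number whichever sample is realized, including the census, where the estimate is known to be without error.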

138

Problem with design-based variance measure Illustration 2

a) Expert 1: SRS and estimate $\bar y_s$. Precision is measured by

$$(1-f)\,\frac{\sigma^2}{n}, \qquad \sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar y)^2, \quad f = n/N$$

b) Expert 2: SRS with replacement and estimate $\bar y_s$; measures precision by $\tilde\sigma^2/n$

Both experts select the same sample, compute the same estimate, but give different measures of precision…

139

The likelihood principle, LP: general model

Model: $X \sim f_\theta(x)$; $\theta$ are the unknown parameters in the model

• The likelihood function, with data x: $l_x(\theta) = f_\theta(x)$

l is quite a different animal than f !!

Measures the likelihood of different $\theta$ values in light of the data x

• LP: The likelihood function contains all information about the unknown parameters

• More precisely: Two proportional likelihood functions for $\theta$, from the same or different experiments, should give identically the same statistical inference

140

• Maximum likelihood estimation satisfies LP, using the curvature of the likelihood as a measure of precision (Fisher)

• LP is controversial, but hard to argue against because of the fundamental result by Birnbaum, 1962:

• LP follows from sufficiency (SP) and conditionality principles (CP) that ”no one” disagrees with.

• SP: Statistical inference should be based on sufficient statistics

• CP: If you have 2 possible experiments and choose one at random, the inference should depend only on the chosen experiment

141

Illustration of CP

• A choice is to be made between a census or taking a sample of size 1. Each with probability ½.

• Census is chosen

• Unconditional approach:

$$\pi_i = P(\text{census}) + P(\text{sample of size 1 and } i \text{ is selected}) = \frac{1}{2} + P(\text{sample of size 1})\,P(i \text{ is selected} \mid \text{sample of size 1}) = \frac{1}{2} + \frac{1}{2}\cdot\frac{1}{N}$$

142

The Horvitz-Thompson estimator:

$$\hat t_{HT} = \sum_{i\in U} y_i/\pi_i = \frac{2N}{N+1}\,t \approx 2t \ !$$

Conditional approach: $\pi_i = 1$ and the HT estimate is t

143

LP, SP and CP

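Model: $X \sim f_\theta(x)$; $\theta$ are the unknown parameters in the model

An experiment is a triple $E = \{X, \theta, \{f_\theta\}\}$

$I(E, x)$: Inference about $\theta$ in the experiment E with observation x

Likelihood principle:

Let $E_1 = \{X_1, \theta, \{f^1_\theta\}\}$ and $E_2 = \{X_2, \theta, \{f^2_\theta\}\}$. Assume $l_{1,x_1}(\theta) = c\,l_{2,x_2}(\theta)$, with c independent of $\theta$ (that is, $f^1_\theta(x_1) = c\,f^2_\theta(x_2)$).

Then: $I(E_1, x_1) = I(E_2, x_2)$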

This includes the case where E1 = E2 and x1 and x2 are two different observations from the same experiment

144

Sufficiency principle: Let T be a sufficient statistic for $\theta$ in the experiment E. Assume T(x1) = T(x2). Then I(E, x1) = I(E, x2).

Conditionality principle:

Let $E_1 = \{X_1, \theta, \{f^1_\theta\}\}$ and $E_2 = \{X_2, \theta, \{f^2_\theta\}\}$.

Consider the mixture experiment $E^*$ where $E_1$ is chosen with probability 1/2 and $x_1$ is observed, and $E_2$ is chosen with probability 1/2 and $x_2$ is observed. The observation in $E^*$ is then the value of $X^* = (J, X_J)$, $J = 1, 2$.

$$E^* = \{X^*, \theta, \{f^*_\theta\}\}, \quad \text{where } f^*_\theta(j, x_j) = \tfrac{1}{2} f^j_\theta(x_j)$$

CP: $I(E^*, (j, x_j)) = I(E_j, x_j)$

145

Theorem: SP and CP $\iff$ LP

Proof: $\Leftarrow$: Exercise

$\Rightarrow$ (for discrete variables, the important implication):

Given: $E_1$ and $E_2$ and observations $x_1^0$ and $x_2^0$ such that $f^1_\theta(x_1^0) = c\,f^2_\theta(x_2^0)$

We shall show that $I(E_1, x_1^0) = I(E_2, x_2^0)$

Consider the mixture experiment $E^*$. From CP:

$$I(E_1, x_1^0) = I(E^*, (1, x_1^0)) \quad \text{and} \quad I(E_2, x_2^0) = I(E^*, (2, x_2^0))$$

It remains to show that SP implies $I(E^*, (1, x_1^0)) = I(E^*, (2, x_2^0))$

Enough to find a sufficient T in $E^*$ with $T(1, x_1^0) = T(2, x_2^0)$

146

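Reduce $X^*$ as little as possible:

$$T(1, x_1^0) = T(2, x_2^0) = (1, x_1^0), \qquad T(j, x_j) = (j, x_j) \text{ otherwise}$$

T is sufficient: Let first $t \neq (1, x_1^0)$:

$$P_\theta(X^* = (j, x_j) \mid T = t) = 1 \text{ if } (j, x_j) = t, \text{ and } 0 \text{ otherwise; independent of } \theta$$

With $t = (1, x_1^0)$:

$$P_\theta(X^* = (1, x_1^0) \mid T = t) + P_\theta(X^* = (2, x_2^0) \mid T = t) = 1$$

and

$$P_\theta(X^* = (1, x_1^0) \mid T = t) = \frac{P_\theta(X^* = (1, x_1^0))}{P_\theta(T = t)} = \frac{\tfrac12 f^1_\theta(x_1^0)}{\tfrac12 f^1_\theta(x_1^0) + \tfrac12 f^2_\theta(x_2^0)} = \frac{c\,f^2_\theta(x_2^0)}{c\,f^2_\theta(x_2^0) + f^2_\theta(x_2^0)} = \frac{c}{c+1}$$

which is independent of $\theta$.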

147

Consequences for statistical analysis

• Statistical analysis, given the observed data: The sample space is irrelevant

• The usual criteria like confidence levels and P-values do not necessarily measure reliability for the actual inference given the observed data

• Frequentistic measures evaluate methods

– not necessarily relevant criteria for the observed data

148

Illustration- Bernoulli trials

$X_1, \dots, X_i, \dots$: $X_i = 1$ (success) with probability $\theta$

Two experiments to gain information about $\theta$:

$E_1$: n = 12 observations and observe $Y_1 = \sum_{i=1}^{12} X_i$

$E_2$: Continue trials until we get 3 failures (0's) and observe $Y_2$ = number of successes

Suppose the results are $y_1 = y_2 = 9$

149

The likelihood functions:

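$$l^{(1)}_9(\theta) = \binom{12}{9}\theta^9(1-\theta)^3 \quad \text{(binomial)}$$

$$l^{(2)}_9(\theta) = \binom{11}{9}\theta^9(1-\theta)^3 \quad \text{(negative binomial)}$$

Proportional likelihoods: $l^{(2)}_9(\theta) = (1/4)\,l^{(1)}_9(\theta)$

LP: Inference about $\theta$ should be identical in the two cases

Frequentistic analyses give different results:

For example, test $H_0: \theta = 1/2$ against $H_1: \theta > 1/2$:

$(E_1, 9)$: P-value = 0.0730; $(E_2, 9)$: P-value = 0.0327

because of different sample spaces: (0,1,..,12) and (0,1,...)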
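The two P-values and the proportionality constant can be reproduced with exact binomial coefficients. The sketch below assumes the one-sided test of $\theta = 1/2$ against $\theta > 1/2$, with large numbers of successes counted as extreme. For the negative binomial experiment it uses that $Y_2 \ge 9$ exactly when the first 11 trials contain at most 2 failures.

```python
from math import comb

theta0 = 0.5

# E1: Y1 ~ binomial(12, theta); P-value = P(Y1 >= 9 | theta = 1/2)
p_binomial = sum(comb(12, y) * theta0 ** 12 for y in range(9, 13))

# E2: trials until 3 failures; P-value = P(Y2 >= 9 | theta = 1/2)
#     = P(at most 2 failures among the first 11 trials)
p_negative_binomial = sum(comb(11, f) * theta0 ** 11 for f in range(0, 3))

# Ratio of the two likelihoods at y = 9 (the theta-dependent parts cancel)
likelihood_ratio = comb(12, 9) / comb(11, 9)

print(round(p_binomial, 4), round(p_negative_binomial, 4), likelihood_ratio)
```

The likelihoods differ only by the constant factor 4, yet the two P-values fall on opposite sides of 0.05.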

150

Frequentistic vs. likelihood

• Frequentistic approach: Statistical methods are evaluated pre-experimental, over the sample space

• LP evaluates statistical methods post-experimental, given the data

• History and discussion after Birnbaum, 1962: An overview in ”Breakthroughs in Statistics,1890-1989, Springer 1991”

151

Likelihood function in design-based inference

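• Unknown parameter: $y = (y_1, y_2, \dots, y_N)$

• Data: $x = \{(i, y_{i,obs}) : i \in s\}$

• Likelihood function = Probability of the data, considered as a function of the parameters

• Sampling design: p(s)

• Likelihood function:

$$l_x(y) = \begin{cases} p(s) & \text{if } y \in \Omega_x \\ 0 & \text{otherwise} \end{cases} \qquad \Omega_x = \{y : y_i = y_{i,obs} \text{ for } i \in s\}$$

• All possible y (all y in $\Omega_x$) are equally likely !!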

152

• Likelihood principle, LP: The likelihood function contains all information about the unknown parameters

• According to LP:

– The design-model is such that the data contain no information about the unobserved part of y, yunobs

– One has to assume in advance that there is a relation between the data and yunobs

• As a consequence of LP: Necessary to assume a model

– The sampling design is irrelevant for statistical inference, because two sampling designs leading to the same s will have proportional likelihoods

153

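Let p0 and p1 be two sampling designs. Assume we get the same sample s in either case. Then the data x are the same and $\Omega_x$, the set of y vectors that agree with the observed values, is the same for both experiments.

The likelihood function for sampling design pi, i = 0,1:

$$l_{i,x}(y) = \begin{cases} p_i(s) & \text{if } y \in \Omega_x \\ 0 & \text{otherwise} \end{cases}$$

$$l_{1,x}(y)/l_{0,x}(y) = p_1(s)/p_0(s) \ \text{ if } y \in \Omega_x, \quad \text{and then for all } y: \quad l_{1,x}(y) = \frac{p_1(s)}{p_0(s)}\,l_{0,x}(y)$$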

154

• Same inference under the two different designs. This is in direct opposition to usual design-based inference, where the only stochastic evaluation is thru the sampling design, for example the Horvitz-Thompson estimator

• Concepts like design unbiasedness and design variance are irrelevant according to LP when it comes to do the actual statistical analysis.

• Note: LP is not concerned about method performance, but the statistical analysis after the data have been observed

• This does not mean the sampling design is not important. It is important to ensure we get a good representative sample. But once the sample is collected the sampling design should not play a role in the inference phase, according to LP

155

Model-based inference

• Assumes a model for the y vector

• Conditioning on the actual sample

• Use modeling to combine information

• Problem: dependence on model

– Introduces a subjective element

– almost impossible to model all variables in a survey

• Design approach is “objective” in a perfect world of no nonsampling errors

156

III. Model-based inference in survey sampling

• Model-based approach. Also called the prediction approach

– Assumes a model for the y vector

– Use modeling to construct estimator

– Ex: ratio estimator

• Model-based inference

– Inference is based on the assumed model

– Treating the sample s as fixed, conditioning on the actual sample

• Best linear unbiased predictors

• Variance estimation for different variance measures

157

Model-based approach

$y_1, y_2, \dots, y_N$ are realized values of random variables $Y_1, Y_2, \dots, Y_N$

Two stochastic elements:

1) sample $s \sim p(\cdot)$

2) $(Y_1, Y_2, \dots, Y_N) \sim f_\theta$

Treat the sample s as fixed

We can decompose the total t as follows:

$$t = \sum_{i=1}^{N} y_i = \sum_{i\in s} y_i + \sum_{i\notin s} y_i$$

[Model-assisted approach: use the distribution assumption of Y to construct estimator, and evaluate according to distribution of s, given the realized vector y]

158

Since $\sum_{i\in s} y_i$ is known, the problem is to estimate $z = \sum_{i\notin s} y_i$, the realized value of $Z = \sum_{i\notin s} Y_i$

• The unobserved z is a realized value of the random variable Z, so the problem is actually to predict the value z of Z.

Can be done by predicting each unobserved yi: $\hat y_i$, $i \notin s$

Estimator:

$$\hat t_{pred} = \sum_{i\in s} y_i + \sum_{i\notin s}\hat y_i = \sum_{i\in s} y_i + \hat z, \qquad \hat z \text{ is a predictor for } z$$

• The prediction approach, the prediction based estimator

Determine $\hat y_i$ by modeling

159

Remarks:

1. Any estimator $\hat t$ can be expressed on the “prediction form”:

$$\hat t = \sum_{i\in s} y_i + \hat z_{\hat t}, \quad \text{letting } \hat z_{\hat t} = \hat t - \sum_{i\in s} y_i$$

2. Can then use this form to see if the estimator makes any sense

160

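Ex 1.

$$\hat t = N\bar y_s = \sum_{i\in s} y_i + (N-n)\bar y_s = \sum_{i\in s} y_i + \sum_{i\notin s}\bar y_s$$

Hence, $\hat z = \sum_{i\notin s}\bar y_s$ and $\hat y_i = \bar y_s$ for all $i \notin s$

Ex. 2

$$\hat t_{HT} = \sum_{i\in s} y_i/\pi_i, \quad \pi_i = n\,x_i/t_x \quad \text{and} \quad t_x = \sum_{i=1}^{N} x_i$$

Reasonable sampling design when y and x are positively correlated

$$\hat t_{HT} = \sum_{i\in s}\frac{t_x}{n x_i}\,y_i = \sum_{i\in s} y_i + \sum_{i\in s}\frac{y_i}{n x_i}\,(t_x - n x_i) = \sum_{i\in s} y_i + \hat z_{HT}$$

$$\hat z_{HT} = \hat\beta_{HT}\sum_{i\notin s} x_i, \qquad \hat\beta_{HT} = \frac{1}{\sum_{i\notin s} x_i}\cdot\frac{1}{n}\sum_{i\in s}\frac{y_i}{x_i}\,(t_x - n x_i)$$

$\hat\beta_{HT}$ is a rather unusual regression coefficient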

161

Three common models

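I. A model for business surveys, the ratio model:

• assume the existence of an auxiliary variable x for all units in the population.

$$Y_i = \beta x_i + \varepsilon_i \quad \text{with } E(\varepsilon_i) = 0,\ Var(\varepsilon_i) = \sigma^2 x_i \text{ and } Cov(\varepsilon_i, \varepsilon_j) = 0$$

Equivalently:

$$E(Y_i) = \beta x_i,\quad Var(Y_i) = \sigma^2 x_i \quad \text{and} \quad Cov(Y_i, Y_j) = 0$$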

162

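II. A model for social surveys, simple linear regression:

$$Y_i = \beta_1 + \beta_2 x_i + \varepsilon_i,\quad E(\varepsilon_i) = 0,\ Var(\varepsilon_i) = \sigma^2 \text{ and } Cov(\varepsilon_i, \varepsilon_j) = 0$$

III. Common mean model:

$$E(Y_i) = \beta,\quad Var(Y_i) = \sigma^2 \text{ and the } Y_i\text{'s are uncorrelated}$$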

• Ex: xi is a measure of the “size” of unit i, and yi tends to increase with increasing xi. In business surveys, the regression goes thru the origin in many cases

163

Model-based estimators (predictors)

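1. Predictor: $\hat T = \sum_{i\in s} Y_i + \hat Z$

2. Model parameters: $\theta$

3. $\hat T$ is model-unbiased if $E_\theta(\hat T - T \mid s) = 0$ for all $\theta$, where $T = \sum_{i=1}^{N} Y_i$

4. Model variance of model-unbiased predictor is the variance of the prediction error, also called the prediction variance:

$$Var_\theta(\hat T - T \mid s) = E_\theta\big((\hat T - T)^2 \mid s\big)$$

5. From now on, skip s in the notation: all expectations and variances are given the selected sample s, for example

$$E(\hat T - T) = E(\hat T - T \mid s), \qquad Var(\hat T - T) = Var(\hat T - T \mid s)$$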

164

Prediction variance as a variance measure for the actual observed sample

Illustration 1, slide 137

N + 1 possible samples: {1}, {2},…,{N}, {1,2,…,N}

Use $\hat T = N\bar Y_s$ as the estimator for the population total T

Assume we select the “sample” {1,2,…,N}. Then $\hat T = N\bar Y = T$

Prediction variance: $Var(\hat T - T) = Var(0) = 0$

Illustration 2, slide 138: Exactly the same prediction variance for the two sampling designs

165

6. Definition:

Linear predictor: $\hat T = \sum_{i\in s} a_i(s)\,Y_i$

$\hat T_0$ is the best linear unbiased (BLU) predictor for T if

1) $\hat T_0$ is model-unbiased

2) $\hat T_0$ has uniformly minimum prediction variance among all model-unbiased linear predictors:

For any model-unbiased linear predictor $\hat T$: $Var_\theta(\hat T_0 - T) \le Var_\theta(\hat T - T)$ for all $\theta$

166

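Model:

$$Y_i = \beta x_i + \varepsilon_i,\quad E(\varepsilon_i) = 0 \text{ and } Var(\varepsilon_i) = \sigma^2 v(x_i)$$

$Y_1, \dots, Y_N$ are uncorrelated, $Cov(\varepsilon_i, \varepsilon_j) = 0$

Suggested Predictor:

$$\hat T_{pred} = \sum_{i\in s} Y_i + \hat\beta_{opt}\sum_{i\notin s} x_i$$

where $\hat\beta_{opt}$ is the best linear unbiased estimator (BLUE) of $\beta$:

$$\hat\beta_{opt} = \frac{\sum_{i\in s} x_i Y_i/v(x_i)}{\sum_{i\in s} x_i^2/v(x_i)}$$

Usually, $v(x) = x^g$, $0 \le g \le 2$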

167

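$$\hat\beta = \sum_{i\in s} c_i(s)\,Y_i, \qquad E(\hat\beta) = \beta\sum_{i\in s} c_i(s)\,x_i = \beta \iff \sum_{i\in s} c_i(s)\,x_i = 1$$

$$Var(\hat\beta) = \sigma^2\sum_{i\in s} c_i^2\,v(x_i)$$

Minimize $\sum_{i\in s} c_i^2\,v(x_i)$ subject to $\sum_{i\in s} c_i x_i = 1$, using the Lagrange method:

$$Q = \sum_{i\in s} c_i^2\,v(x_i) - \lambda\Big(\sum_{i\in s} c_i x_i - 1\Big)$$

$$\partial Q/\partial c_i = 2 c_i v(x_i) - \lambda x_i = 0 \iff c_i = (\lambda/2)\,\frac{x_i}{v(x_i)}$$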

168

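Determine $\lambda/2$ such that $\sum_{i\in s} c_i x_i = 1$:

$$(\lambda/2)\sum_{i\in s} x_i^2/v(x_i) = 1 \iff \lambda/2 = 1\Big/\sum_{i\in s} x_i^2/v(x_i)$$

$$c_{i,opt} = \frac{x_i/v(x_i)}{\sum_{j\in s} x_j^2/v(x_j)} \quad \text{and} \quad \hat\beta_{opt} = \sum_{i\in s} c_{i,opt}\,Y_i = \frac{\sum_{i\in s} x_i Y_i/v(x_i)}{\sum_{j\in s} x_j^2/v(x_j)}$$

This is the least squares estimate based on $Y_i/\sqrt{v(x_i)}$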

169

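• We shall show that $\hat T_{pred}$ is the best linear unbiased (BLU) predictor for T

Let $\hat T$ be a model-unbiased and linear predictor

$$\hat T = \sum_{i\in s} Y_i + \hat Z. \quad \text{Let } \hat Z = \hat\beta\sum_{i\notin s} x_i, \text{ i.e. } \hat\beta = \hat Z\Big/\sum_{i\notin s} x_i$$

$\hat T$ is a linear predictor $\iff$ $\hat\beta$ is linear in $(Y_i, i \in s)$

and $\hat T$ is model-unbiased $\iff$ $E(\hat\beta) = \beta$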

170

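since

$$E(\hat T - T) = E\Big(\hat\beta\sum_{i\notin s} x_i - \sum_{i\notin s} Y_i\Big) = \sum_{i\notin s} x_i\,E(\hat\beta) - \beta\sum_{i\notin s} x_i = \sum_{i\notin s} x_i\,[E(\hat\beta) - \beta]$$

such that $E(\hat T - T) = 0 \iff E(\hat\beta) = \beta$

The prediction variance of a model-unbiased predictor:

$$Var(\hat T - T) = Var\Big(\hat\beta\sum_{i\notin s} x_i - \sum_{i\notin s} Y_i\Big) = \Big(\sum_{i\notin s} x_i\Big)^2 Var(\hat\beta) + \sum_{i\notin s} Var(Y_i) = \Big(\sum_{i\notin s} x_i\Big)^2 Var(\hat\beta) + \sigma^2\sum_{i\notin s} v(x_i)$$

To minimize the prediction variance is equivalent to minimizing $Var(\hat\beta)$, giving us $\hat T_{pred}$ as the BLU predictor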

171

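The prediction variance of the BLU predictor:

$$Var(\hat T_{pred} - T) = \Big(\sum_{i\notin s} x_i\Big)^2 Var(\hat\beta_{opt}) + \sigma^2\sum_{i\notin s} v(x_i) = \sigma^2\Big[\frac{(\sum_{i\notin s} x_i)^2}{\sum_{i\in s} x_i^2/v(x_i)} + \sum_{i\notin s} v(x_i)\Big]$$

A variance estimate is obtained by using the model-unbiased estimator for $\sigma^2$:

$$\hat\sigma^2 = \frac{1}{n-1}\sum_{i\in s}\frac{1}{v(x_i)}\,(Y_i - \hat\beta_{opt}\,x_i)^2$$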

172

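The central limit theorem applies such that for large n, N-n we have that

$$(\hat T_{pred} - T)\Big/\sqrt{\hat V(\hat T_{pred} - T)} \ \text{ is approximately } N(0,1)$$

$$\hat V(\hat T_{pred} - T) = \hat\sigma^2\Big[\frac{(\sum_{i\notin s} x_i)^2}{\sum_{i\in s} x_i^2/v(x_i)} + \sum_{i\notin s} v(x_i)\Big]$$

Approximate 95% confidence interval for the value t of T:

$$\hat t_{pred} \pm 1.96\sqrt{\hat V(\hat T_{pred} - T)}$$

Also called a 95% prediction interval for the random variable T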
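The BLU predictor under $Y_i = \beta x_i + \varepsilon_i$, $Var(\varepsilon_i) = \sigma^2 v(x_i)$, together with its estimated prediction variance, can be sketched for the common choice $v(x) = x^g$. The function name and the small data set are hypothetical; the auxiliary values $x_i$ are assumed known for sampled and non-sampled units alike.

```python
from math import sqrt

def blu_predictor(y_s, x_s, x_r, g=1.0):
    """BLU predictor of the total T with v(x) = x**g.

    y_s, x_s: observed y and x in the sample; x_r: x for the non-sampled units.
    Returns (beta_opt, t_pred, V_hat), V_hat = estimated prediction variance.
    """
    v = lambda x: x ** g
    n = len(y_s)
    sxx = sum(x * x / v(x) for x in x_s)
    beta = sum(x * y / v(x) for x, y in zip(x_s, y_s)) / sxx
    xr_total = sum(x_r)
    t_pred = sum(y_s) + beta * xr_total
    sigma2 = sum((y - beta * x) ** 2 / v(x) for x, y in zip(x_s, y_s)) / (n - 1)
    V_hat = sigma2 * (xr_total ** 2 / sxx + sum(v(x) for x in x_r))
    return beta, t_pred, V_hat

# Hypothetical data: n = 3 sampled units, 2 non-sampled units
y_s, x_s, x_r = [2.0, 4.0, 9.0], [1.0, 2.0, 4.0], [3.0, 5.0]

for g in (1.0, 2.0):
    beta, t_pred, V_hat = blu_predictor(y_s, x_s, x_r, g)
    half_width = 1.96 * sqrt(V_hat)
    print(g, beta, t_pred, (t_pred - half_width, t_pred + half_width))
```

With g = 1 the slope is the ratio of the sample totals and the predictor equals that ratio times the population x-total; with g = 2 the slope is the sample mean of the ratios y/x. A sample of three units is of course far too small for the normal approximation and is used only to keep the numbers readable.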

173

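Three special cases: 1) v(x) = x, the ratio model, 2) v(x) = x² and 3) xi = 1 for all i, the common mean model

1. v(x) = x

$$\hat\beta_{opt} = \frac{\sum_{i\in s} x_i Y_i/v(x_i)}{\sum_{i\in s} x_i^2/v(x_i)} = \frac{\sum_{i\in s} Y_i}{\sum_{i\in s} x_i} = \hat R, \text{ the usual sample ratio}$$

$$\hat T_{pred} = \sum_{i\in s} Y_i + \hat R\sum_{i\notin s} x_i = \hat R\sum_{i\in s} x_i + \hat R\sum_{i\notin s} x_i = \hat R\,t_x$$

the usual ratio estimator

$$Var(\hat T_{pred} - T) = \sigma^2\Big[\frac{(\sum_{i\notin s} x_i)^2}{\sum_{i\in s} x_i} + \sum_{i\notin s} x_i\Big] = N^2\,\frac{1-f}{n}\cdot\frac{\bar x_r\,\bar x}{\bar x_s}\,\sigma^2$$

$$f = n/N,\quad \bar x_r = \sum_{i\notin s} x_i/(N-n) \quad \text{and} \quad \bar x = \sum_{i=1}^{N} x_i/N$$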

174

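2. v(x) = x²

$$\hat\beta_{opt} = \frac{\sum_{i\in s} x_i Y_i/v(x_i)}{\sum_{i\in s} x_i^2/v(x_i)} = \frac{1}{n}\sum_{i\in s}\frac{Y_i}{x_i}, \text{ the sample mean of the ratios}$$

$$\hat T_{pred} = \sum_{i\in s} Y_i + \hat\beta_{opt}\sum_{i\notin s} x_i = \sum_{i\in s} Y_i + \Big(\frac{1}{n}\sum_{i\in s}\frac{Y_i}{x_i}\Big)\sum_{i\notin s} x_i$$

$$Var(\hat T_{pred} - T) = \sigma^2\Big[\frac{(\sum_{i\notin s} x_i)^2}{\sum_{i\in s} x_i^2/v(x_i)} + \sum_{i\notin s} v(x_i)\Big] = \sigma^2\Big[\frac{(\sum_{i\notin s} x_i)^2}{n} + \sum_{i\notin s} x_i^2\Big]$$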

175

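Resembles the H-T estimator when $\pi_i = n\,x_i/t_x$:

$$\hat T_{HT} = t_x\sum_{i\in s}\frac{Y_i}{n\,x_i} = t_x\bar R_s$$

Let $R_i = Y_i/x_i$ and $\bar R_s = \sum_{i\in s} R_i/n$

$$\hat T_{pred} = \sum_{i\in s} Y_i + \bar R_s\sum_{i\notin s} x_i = \bar R_s t_x + \sum_{i\in s}(Y_i - \bar R_s x_i)$$

When the sampling fraction f is small or when the xi values vary little, these two estimators are approximately the same. In the latter case:

$$\bar R_s \approx \frac{1}{n\,\bar x_s}\sum_{i\in s} Y_i \quad \text{and} \quad \bar R_s\sum_{i\in s} x_i \approx \sum_{i\in s} Y_i$$

$\hat T_{HT}$ is also model-unbiased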

176

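3. xi = 1

Model:

$$Y_i = \beta + \varepsilon_i,\quad E(\varepsilon_i) = 0 \text{ and } Var(\varepsilon_i) = \sigma^2,\quad Y_1, \dots, Y_N \text{ uncorrelated},\ Cov(\varepsilon_i, \varepsilon_j) = 0$$

$$\hat\beta_{opt} = \frac{\sum_{i\in s} x_i Y_i/v(x_i)}{\sum_{i\in s} x_i^2/v(x_i)} = \frac{1}{n}\sum_{i\in s} Y_i = \bar Y_s, \text{ the sample mean}$$

$$\hat T_{pred} = \sum_{i\in s} Y_i + \sum_{i\notin s}\bar Y_s = N\bar Y_s$$

$$Var(N\bar Y_s - T) = \sigma^2\Big[\frac{(\sum_{i\notin s} x_i)^2}{\sum_{i\in s} x_i^2/v(x_i)} + \sum_{i\notin s} v(x_i)\Big] = \sigma^2\Big[\frac{(N-n)^2}{n} + (N-n)\Big] = N^2(1-f)\,\frac{\sigma^2}{n}$$

This is also the usual, design-based variance formula under SRS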

177

We see that the variance estimate is given by

$$N^2(1-f)\,\frac{\hat\sigma^2}{n}, \qquad \hat\sigma^2 = \frac{1}{n-1}\sum_{i\in s}(y_i - \bar y_s)^2, \text{ the sample variance}$$

Exactly the same as in the design-approach, but the interpretation is different

178

Simple Linear regression model

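$$Y_i = \beta_1 + \beta_2 x_i + \varepsilon_i,\quad E(\varepsilon_i) = 0,\ Var(\varepsilon_i) = \sigma^2,\quad Y_1, \dots, Y_N \text{ are uncorrelated}$$

BLU predictor:

$$\hat T_{pred} = \sum_{i\in s} Y_i + \sum_{i\notin s}(\hat\beta_1 + \hat\beta_2 x_i)$$

where $\hat\beta_1$ and $\hat\beta_2$ are the LS estimators,

$$\hat\beta_2 = \frac{\sum_{i\in s}(x_i - \bar x_s)(Y_i - \bar Y_s)}{\sum_{i\in s}(x_i - \bar x_s)^2} = \frac{\sum_{i\in s}(x_i - \bar x_s)\,Y_i}{\sum_{i\in s}(x_i - \bar x_s)^2}, \qquad \hat\beta_1 = \bar Y_s - \hat\beta_2\bar x_s$$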

179

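$$\hat T_{pred} = \sum_{i \in s} Y_i + \sum_{i \notin s}(\hat\beta_1 + \hat\beta_2 x_i) = n\bar Y_s + (N-n)\bar Y_s + \hat\beta_2\Big(\sum_{i \notin s} x_i - (N-n)\bar x_s\Big) = N\bar Y_s + \hat\beta_2(t_x - N\bar x_s)$$

$$\hat T_{pred} = N[\bar Y_s + \hat\beta_2(\bar x - \bar x_s)]$$

Clearly, $\hat T_{pred}$ is model-unbiased:

$$E(T) = \sum_{i=1}^{N}(\beta_1 + \beta_2 x_i) = N\beta_1 + \beta_2 N\bar x$$

and

$$E(\hat T_{pred}) = N\Big\{\frac{1}{n}\sum_{i \in s}(\beta_1 + \beta_2 x_i) + \beta_2(\bar x - \bar x_s)\Big\} = N\beta_1 + \beta_2 N\bar x$$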
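The two expressions for the regression predictor can be compared numerically. This is a minimal sketch assuming the simple linear regression model above; the helper names and the data are hypothetical.

```python
def ls_fit(x_s, y_s):
    """Least squares intercept and slope from the sampled units."""
    n = len(x_s)
    xbar_s = sum(x_s) / n
    ybar_s = sum(y_s) / n
    sxx = sum((x - xbar_s) ** 2 for x in x_s)
    b2 = sum((x - xbar_s) * y for x, y in zip(x_s, y_s)) / sxx
    b1 = ybar_s - b2 * xbar_s
    return b1, b2


def t_pred_sum_form(x_s, y_s, x_r):
    """Observed sum plus fitted values for the unsampled units."""
    b1, b2 = ls_fit(x_s, y_s)
    return sum(y_s) + sum(b1 + b2 * x for x in x_r)


def t_pred_mean_form(x_s, y_s, x_r):
    """N * [ybar_s + b2 * (xbar - xbar_s)]."""
    _, b2 = ls_fit(x_s, y_s)
    n, N = len(x_s), len(x_s) + len(x_r)
    xbar_s = sum(x_s) / n
    ybar_s = sum(y_s) / n
    xbar = (sum(x_s) + sum(x_r)) / N
    return N * (ybar_s + b2 * (xbar - xbar_s))


# Hypothetical data: four sampled units and three unsampled units.
x_s, y_s = [1.0, 2.0, 4.0, 7.0], [2.0, 3.0, 6.0, 9.0]
x_r = [3.0, 5.0, 6.0]
a = t_pred_sum_form(x_s, y_s, x_r)
b = t_pred_mean_form(x_s, y_s, x_r)
```

Both functions should return the same value up to rounding error, since the second is only an algebraic rearrangement of the first.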

180

We shall now show that this predictor is BLU

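Assume first that $\bar x_s \neq \bar x$. Let $\hat T$ be a linear, model-unbiased predictor, and let $b = (\hat T/N - \bar Y_s)/(\bar x - \bar x_s)$.

$$\frac{1}{N}\hat T = \bar Y_s + b(\bar x - \bar x_s) \iff \hat T = N[\bar Y_s + b(\bar x - \bar x_s)]$$

Hence, any predictor can be expressed in this form, and the predictor is linear if and only if b is linear in the $Y_i$'s

Also, $\hat T$ is model-unbiased $\iff E(b) = \beta_2$:

$$E(\hat T) = E(T) = N\beta_1 + N\beta_2\bar x$$

$$\iff N[\beta_1 + \beta_2\bar x_s + E(b)(\bar x - \bar x_s)] = N(\beta_1 + \beta_2\bar x)$$

$$\iff E(b)(\bar x - \bar x_s) = \beta_2(\bar x - \bar x_s).$$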

181

Prediction variance:

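$$Var(\hat T - T) = Var\{(N-n)\bar Y_s + N b(\bar x - \bar x_s)\} + (N-n)\sigma^2$$

$b = \sum_{i \in s} c_i(s) Y_i$, unbiased estimator of $\beta_2$:

$$E(b) = \beta_1\sum_{i \in s} c_i + \beta_2\sum_{i \in s} c_i x_i$$

$$E(b) = \beta_2 \iff (1)\ \sum_{i \in s} c_i = 0 \ \text{ and } \ (2)\ \sum_{i \in s} c_i x_i = 1$$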

So we need to minimize the prediction variance with respect to the ci’s under (1) and (2)

182

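i.e. minimize

$$Var\{(N-n)\bar Y_s + N b(\bar x - \bar x_s)\} = Var\Big\{\sum_{i \in s}\Big[\frac{N-n}{n} + N(\bar x - \bar x_s)c_i\Big]Y_i\Big\} = \sigma^2\sum_{i \in s}\Big[\frac{N-n}{n} + N(\bar x - \bar x_s)c_i\Big]^2$$

$$= \sigma^2\Big[N^2(\bar x - \bar x_s)^2\sum_{i \in s} c_i^2 + 2N(\bar x - \bar x_s)\frac{N-n}{n}\sum_{i \in s} c_i + \frac{(N-n)^2}{n}\Big]$$

Since $\sum_{i \in s} c_i = 0$, it is enough to minimize $\sum_{i \in s} c_i^2$ under conditions (1) and (2)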

183

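$$Q = \sum_{i \in s} c_i^2 - 2\lambda_1\sum_{i \in s} c_i - 2\lambda_2\Big(\sum_{i \in s} c_i x_i - 1\Big)$$

$$\partial Q/\partial c_i = 2c_i - 2\lambda_1 - 2\lambda_2 x_i = 0 \ \Rightarrow \ c_i = \lambda_1 + \lambda_2 x_i$$

$$(1)\ \sum_{i \in s} c_i = 0 \ \Rightarrow \ n\lambda_1 + \lambda_2 n\bar x_s = 0 \ \Rightarrow \ \lambda_1 = -\lambda_2\bar x_s$$

$$(2)\ \sum_{i \in s} c_i x_i = 1 \ \Rightarrow \ \lambda_1 n\bar x_s + \lambda_2\sum_{i \in s} x_i^2 = 1$$

$$\text{from (2): } \ \lambda_2\Big(\sum_{i \in s} x_i^2 - n\bar x_s^2\Big) = 1 \ \Rightarrow \ \lambda_2 = 1\Big/\sum_{i \in s}(x_i - \bar x_s)^2$$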

184

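$$c_i = \lambda_1 + \lambda_2 x_i = \lambda_2(x_i - \bar x_s) = \frac{x_i - \bar x_s}{\sum_{j \in s}(x_j - \bar x_s)^2}$$

and

$$b = \sum_{i \in s} c_i Y_i = \frac{\sum_{i \in s}(x_i - \bar x_s)Y_i}{\sum_{j \in s}(x_j - \bar x_s)^2} = \hat\beta_2$$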
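The constrained minimization can be illustrated numerically. The sketch below computes the weights $c_i = (x_i - \bar x_s)/\sum_{j \in s}(x_j - \bar x_s)^2$, checks conditions (1) and (2), and compares $\sum c_i^2$ with another feasible set of weights. The data and the perturbation vector `d` are made up for the illustration.

```python
def blu_weights(x_s):
    """Weights c_i that minimize sum(c_i^2) subject to sum(c_i) = 0 and sum(c_i x_i) = 1."""
    n = len(x_s)
    xbar_s = sum(x_s) / n
    sxx = sum((x - xbar_s) ** 2 for x in x_s)
    return [(x - xbar_s) / sxx for x in x_s]


# Hypothetical sample x-values.
x_s = [1.0, 2.0, 4.0, 7.0]
c = blu_weights(x_s)

# d satisfies sum(d) = 0 and sum(d * x) = 0, so c + 0.1 * d still fulfils (1) and (2).
d = [2.0, -3.0, 1.0, 0.0]
c_alt = [ci + 0.1 * di for ci, di in zip(c, d)]

sum_c = sum(c)
sum_cx = sum(ci * xi for ci, xi in zip(c, x_s))
ss_c = sum(ci**2 for ci in c)
ss_alt = sum(ci**2 for ci in c_alt)
```

Any other feasible choice of weights, such as `c_alt`, gives a larger sum of squares and hence a larger prediction variance.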

The prediction variance is given by

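$$Var(\hat T_{pred} - T) = \frac{N^2}{n}\Big[\Big(1 - \frac{n}{N}\Big) + \frac{n(\bar x_s - \bar x)^2}{\sum_{i \in s}(x_i - \bar x_s)^2}\Big]\sigma^2$$

and the variance estimate is obtained by estimating $\sigma^2$ with

$$\hat\sigma^2 = \frac{1}{n-2}\sum_{i \in s}\big(Y_i - \bar Y_s - \hat\beta_2(x_i - \bar x_s)\big)^2$$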

185

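So far, $\bar x_s \neq \bar x$. What if $\bar x_s = \bar x$? Then $\hat T_{pred} = N\bar Y_s$ and is the BLU predictor.

For any linear predictor, $\hat T = \sum_{i \in s} a_i Y_i$:

$$Var(\hat T - T) = Var\Big\{\sum_{i \in s}(a_i - 1)Y_i\Big\} + (N-n)\sigma^2 = \sigma^2\Big[\sum_{i \in s}(a_i - 1)^2 + (N-n)\Big]$$

Let $\hat T_a = \bar a\sum_{i \in s} Y_i$, $\ \bar a = \sum_{i \in s} a_i/n$:

$$Var(\hat T_a - T) = \sigma^2[n(\bar a - 1)^2 + (N-n)]$$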

186

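$$\sum_{i \in s}(a_i - 1)^2 \geq n(\bar a - 1)^2 \ \Rightarrow \ Var(\hat T - T) \geq Var(\hat T_a - T)$$

and $\hat T$ model-unbiased $\Rightarrow \sum_{i \in s} a_i = N$ and $\bar a = N/n$:

$$\hat T_a = N\bar Y_s = \hat T_{pred}.$$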

187

Anticipated variance (method variance)

1. Conditional on the sample s, with model-unbiased $\hat T$: $Var(\hat T - T)$ measures the uncertainty for this particular sample s

We want a variance measure that tells us about the expected uncertainty in repeated surveys

2. The expected uncertainty for repeated surveys: $E_p\{Var(\hat T - T)\}$, over the sampling distribution $p(\cdot)$

3. This is called the anticipated variance.

4. It can be regarded as a variance measure that describes how the estimation method is doing in repeated surveys

188

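If $\hat T$ is not model-unbiased, we use $E_p\{E(\hat T - T)^2\}$ as a criterion for uncertainty, the anticipated mean square error

Note: If $\hat T$ is design-unbiased then

$$E_p\{E(\hat T - T)^2\} = E\{E_p[(\hat T - T)^2 \mid \mathbf{Y}]\}$$

and

$$E_p[(\hat T - T)^2 \mid \mathbf{Y} = \mathbf{y}] = E_p(\hat t - t)^2 = Var_p(\hat t)$$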

And the anticipated MSE becomes the expected design-variance, also called the anticipated design variance

$$E_p\{E(\hat T - T)^2\} = E\{Var_p(\hat T)\}$$

189

Example: Simple linear regression and simple random sample

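If the sample mean is used: $N\bar Y_s$. It is not model-unbiased, but is design-unbiased:

$$E_p\{E(N\bar Y_s - T)^2\} = E\{Var_p(N\bar Y_s)\} = N^2\,\frac{1-f}{n}\,E\Big\{\frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar Y)^2\Big\}$$

$$= N^2\,\frac{1-f}{n}\Big\{\sigma^2 + \beta_2^2\,\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar x)^2\Big\}, \quad\text{since } E(Y_i) = \beta_1 + \beta_2 x_i$$

$$E\{Var_p(N\bar Y_s)\} = N^2\,\frac{1-f}{n}\{\sigma^2 + \beta_2^2 S_x^2\}, \qquad S_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar x)^2$$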

190

Let us now study the BLU predictor. (It can be shown that it is approximately design-unbiased.)

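$$Var(\hat T_{pred} - T) = \frac{N^2}{n}\Big[\Big(1 - \frac{n}{N}\Big) + \frac{n(\bar x_s - \bar x)^2}{\sum_{i \in s}(x_i - \bar x_s)^2}\Big]\sigma^2$$

$$E_p\{Var(\hat T_{pred} - T)\} = \frac{N^2}{n}\Big[\Big(1 - \frac{n}{N}\Big) + E_p\Big\{\frac{n(\bar x_s - \bar x)^2}{\sum_{i \in s}(x_i - \bar x_s)^2}\Big\}\Big]\sigma^2 \approx \frac{N^2}{n}\Big[\Big(1 - \frac{n}{N}\Big) + \frac{E_p\{n(\bar x_s - \bar x)^2\}}{E_p\{\sum_{i \in s}(x_i - \bar x_s)^2\}}\Big]\sigma^2$$

$$E_p\{n(\bar x_s - \bar x)^2\} = n\,Var_p(\bar x_s) = (1-f)S_x^2, \qquad E_p\Big\{\sum_{i \in s}(x_i - \bar x_s)^2\Big\} = (n-1)S_x^2$$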

191

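$$E_p\{Var(\hat T_{pred} - T)\} \approx \frac{N^2}{n}\Big[(1-f) + \frac{1-f}{n-1}\Big]\sigma^2 = N^2\,\frac{1-f}{n}\Big(1 + \frac{1}{n-1}\Big)\sigma^2 \approx N^2\,\frac{1-f}{n}\,\sigma^2$$

compared to

$$E\{Var_p(N\bar Y_s)\} = N^2\,\frac{1-f}{n}\{\sigma^2 + \beta_2^2 S_x^2\}$$

$\hat T_{pred}$ eliminates the term $\beta_2^2 S_x^2$ and is much more efficient than $N\bar Y_s$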
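For a small population the anticipated variance of the BLU predictor can be computed exactly by averaging the prediction variance over all simple random samples, and then compared with the anticipated variance of $N\bar Y_s$. The following is a sketch only; the population x-values and the parameter values $\sigma^2 = 1$, $\beta_2 = 1$ are invented for the illustration.

```python
from itertools import combinations


def anticipated_variances(x, n, sigma2, beta2):
    """Return (anticipated variance of N*ybar_s, anticipated variance of the BLU predictor).

    The second value averages Var(T_pred - T) over all SRS samples of size n.
    Assumes every sample has at least two distinct x-values.
    """
    N = len(x)
    f = n / N
    xbar = sum(x) / N
    Sx2 = sum((xi - xbar) ** 2 for xi in x) / (N - 1)
    av_mean = N**2 * (1 - f) / n * (sigma2 + beta2**2 * Sx2)

    total, count = 0.0, 0
    for s in combinations(range(N), n):
        xs = [x[i] for i in s]
        xbar_s = sum(xs) / n
        sxx = sum((xi - xbar_s) ** 2 for xi in xs)
        total += N**2 / n * ((1 - f) + n * (xbar_s - xbar) ** 2 / sxx) * sigma2
        count += 1
    return av_mean, total / count


# Hypothetical population with x = 1, ..., 8 and samples of size 4.
x = [float(i) for i in range(1, 9)]
av_mean, av_blu = anticipated_variances(x, n=4, sigma2=1.0, beta2=1.0)
```

With these values the sample-mean estimator carries the extra term $\beta_2^2 S_x^2$, so `av_mean` is considerably larger than `av_blu`.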

192

Remarks

• From a design-based approach, the sample mean based estimator is unbiased, while the linear regression estimator is not

• Considering only the design-bias, we might choose the sample mean based estimator

• The linear regression estimator would only be selected over the sample mean based estimator because it has smaller anticipated variance

• Hence, it is difficult to see design-unbiasedness as a means to choose among estimators

193

Robust variance estimation

• The model assumed is really a “working model”

• Especially, the variance assumption may be misspecified and it is not always easy to detect this kind of model failure

– like constant variance

– variance proportional to size measure $x_i$

• Standard least squares variance estimates are sensitive to misspecification of the variance assumption

• Concerned with robust variance estimators

194

Variance estimation for the ratio estimator


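Working model:

$$Y_i = \beta x_i + \varepsilon_i, \quad E(\varepsilon_i) = 0 \ \text{ and } \ Var(\varepsilon_i) = \sigma^2 x_i, \quad Cov(Y_i, Y_j) = 0 \ \text{ for } i \neq j$$

Under this working model, the unbiased estimator of the prediction variance of the ratio estimator is

$$\hat V_R(\hat R t_x - T) = N^2\,\frac{1-f}{n}\cdot\frac{\bar x_r\,\bar x}{\bar x_s}\,\hat\sigma^2$$

$$\hat\sigma^2 = \frac{1}{n-1}\sum_{i \in s}\frac{1}{x_i}(Y_i - \hat R x_i)^2, \qquad \hat R = \bar Y_s/\bar x_s$$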

195

This variance estimator is non-robust to misspecification of the variance model.

Suppose the true model has

$$E(Y_i) = \beta x_i \ \text{ and } \ Var(Y_i) = \sigma^2 v(x_i)$$

Ratio estimator is still model-unbiased but prediction variance is now

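$$Var(\hat R t_x - T) = \Big(\sum_{i \notin s} x_i\Big)^2 Var(\hat R) + \sigma^2\sum_{i \notin s} v(x_i)$$

$$= \sigma^2\Big[\frac{\left(\sum_{i \notin s} x_i\right)^2}{\left(\sum_{i \in s} x_i\right)^2}\sum_{i \in s} v(x_i) + \sum_{i \notin s} v(x_i)\Big] = \sigma^2\Big[\frac{((N-n)\bar x_r)^2}{(n\bar x_s)^2}\sum_{i \in s} v(x_i) + \sum_{i \notin s} v(x_i)\Big]$$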

196

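$$Var(\hat R t_x - T) = \sigma^2\Big[\frac{((N-n)\bar x_r)^2}{(n\bar x_s)^2}\,n\bar v_s + (N-n)\bar v_r\Big] = N^2\,\frac{1-f}{n}\,\sigma^2\big[(1-f)(\bar x_r/\bar x_s)^2\bar v_s + f\bar v_r\big]$$

$$\bar v_s = \sum_{i \in s} v(x_i)/n \quad\text{and}\quad \bar v_r = \sum_{i \notin s} v(x_i)/(N-n)$$

Moreover, $E(\hat\sigma^2)$:

$$E(\hat\sigma^2) = \frac{1}{n-1}\sum_{i \in s}\frac{1}{x_i}E(Y_i - \hat R x_i)^2 = \sigma^2\,\frac{n}{n-1}\Big\{\overline{(v/x)}_s - \frac{1}{n}(\bar v_s/\bar x_s)\Big\}, \qquad \overline{(v/x)}_s = \frac{1}{n}\sum_{i \in s} v(x_i)/x_i$$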

197

Robust variance estimator for the ratio estimator

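$$Var(\hat R t_x - T) = N^2\,\frac{1-f}{n}\,\sigma^2\big[(1-f)(\bar x_r/\bar x_s)^2\bar v_s + f\bar v_r\big]$$

$$= N^2\,\frac{1-f}{n}\,\sigma^2\big[(\bar x_r/\bar x_s)^2\bar v_s + f\{\bar v_r - (\bar x_r/\bar x_s)^2\bar v_s\}\big]$$

$$\approx N^2\,\frac{1-f}{n}\,(\bar x_r/\bar x_s)^2\,\sigma^2\bar v_s, \ \text{ the leading term in the prediction variance}$$

and:

$$\sigma^2\bar v_s = \frac{1}{n}\sum_{i \in s}\sigma^2 v(x_i) = \frac{1}{n}\sum_{i \in s} Var(Y_i) = \frac{1}{n}\sum_{i \in s} E(Y_i - \beta x_i)^2 = E\Big\{\frac{1}{n}\sum_{i \in s}(Y_i - \beta x_i)^2\Big\}$$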

198

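Suggests we may use:

$$(\widehat{\sigma^2\bar v_s})_{rob} = \frac{1}{n-1}\sum_{i \in s}(Y_i - \hat R x_i)^2$$

Leading to the robust variance estimator:

$$\hat V_{rob}(\hat R t_x - T) = N^2\,\frac{1-f}{n}\,(\bar x_r/\bar x_s)^2\,\frac{1}{n-1}\sum_{i \in s}(Y_i - \hat R x_i)^2$$

Almost the same as the design variance estimator in SRS:

$$\hat V_{SRS}(\hat R t_x) = N^2\,\frac{1-f}{n}\,(\bar x/\bar x_s)^2\,\frac{1}{n-1}\sum_{i \in s}(Y_i - \hat R x_i)^2$$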
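The three variance estimators for the ratio estimator (working-model based, robust, and SRS design-based) differ only in how the residuals are weighted and in the ratio of x-means in front. A minimal sketch, with hypothetical function names and made-up data:

```python
def ratio_variance_estimates(x_s, y_s, x_r):
    """Return (model-based, robust, SRS design-based) variance estimates for R_hat * t_x."""
    n = len(x_s)
    N = n + len(x_r)
    f = n / N
    xbar_s = sum(x_s) / n
    xbar_r = sum(x_r) / (N - n)
    xbar = (sum(x_s) + sum(x_r)) / N
    r_hat = sum(y_s) / sum(x_s)

    # Residual sums: weighted by 1/x_i (working model) and unweighted (robust, SRS).
    sigma2_hat = sum((y - r_hat * x) ** 2 / x for x, y in zip(x_s, y_s)) / (n - 1)
    mean_sq_res = sum((y - r_hat * x) ** 2 for x, y in zip(x_s, y_s)) / (n - 1)

    front = N**2 * (1 - f) / n
    v_model = front * xbar_r * xbar / xbar_s * sigma2_hat
    v_rob = front * (xbar_r / xbar_s) ** 2 * mean_sq_res
    v_srs = front * (xbar / xbar_s) ** 2 * mean_sq_res
    return v_model, v_rob, v_srs


# Hypothetical data: three sampled units and two unsampled units.
x_s, y_s = [2.0, 4.0, 6.0], [5.0, 9.0, 16.0]
x_r = [3.0, 9.0]
v_model, v_rob, v_srs = ratio_variance_estimates(x_s, y_s, x_r)
```

The robust and SRS estimates differ exactly by the factor $(\bar x_r/\bar x)^2$, which is close to 1 when the sampling fraction is small.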

199

$$E\{\hat V_{rob}(\hat R t_x - T)\} \approx N^2\,\frac{1-f}{n}\,(\bar x_r/\bar x_s)^2\,\sigma^2\bar v_s \approx Var(\hat R t_x - T)$$

Can we do better?

Require estimator to be exactly unbiased under ratio model, v(x) = x:

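When $v(x) = x$:

$$E\Big\{\frac{1}{n-1}\sum_{i \in s}(Y_i - \hat R x_i)^2\Big\} = \frac{1}{n-1}\sum_{i \in s}E(Y_i - \hat R x_i)^2 = \frac{\sigma^2}{n-1}\sum_{i \in s}\Big(x_i - \frac{x_i^2}{n\bar x_s}\Big)$$

$$= \sigma^2\bar x_s\Big(1 - \frac{1}{n}\cdot\frac{s_x^2}{\bar x_s^2}\Big), \qquad s_x^2 = \frac{1}{n-1}\sum_{i \in s}(x_i - \bar x_s)^2$$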
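The expectation above can be verified exactly, without simulation, by writing each residual as a linear combination of the $Y_j$ and summing variances. This is a sketch under the ratio working model with $Var(Y_i) = \sigma^2 x_i$ and uncorrelated $Y_i$; the function names and x-values are hypothetical.

```python
def expected_mean_sq_residual(x_s, sigma2=1.0):
    """Exact E{ sum_s (Y_i - R_hat x_i)^2 / (n - 1) } when Var(Y_j) = sigma2 * x_j.

    The residual is Y_i - x_i * sum(Y) / sum(x) = sum_j m_ij Y_j with
    m_ij = 1{i == j} - x_i / sum(x); it has mean zero under the ratio model.
    """
    n = len(x_s)
    x_total = sum(x_s)
    total = 0.0
    for i, xi in enumerate(x_s):
        for j, xj in enumerate(x_s):
            m_ij = (1.0 if i == j else 0.0) - xi / x_total
            total += m_ij**2 * sigma2 * xj
    return total / (n - 1)


def closed_form(x_s, sigma2=1.0):
    """sigma2 * xbar_s * (1 - s_x^2 / (n * xbar_s^2))."""
    n = len(x_s)
    xbar_s = sum(x_s) / n
    s2x = sum((x - xbar_s) ** 2 for x in x_s) / (n - 1)
    return sigma2 * xbar_s * (1 - s2x / (n * xbar_s**2))


# Hypothetical sample x-values.
exact_a = expected_mean_sq_residual([2.0, 4.0, 6.0])
exact_b = expected_mean_sq_residual([1.0, 2.0, 4.0, 7.0], sigma2=2.5)
```

The two functions agree for any positive x-values, which is what makes the correction factor $\{1 - s_x^2/(n\bar x_s^2)\}^{-1}$ exact under the working model.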

200

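The prediction variance when $v(x) = x$:

$$Var(\hat R t_x - T) = N^2\,\frac{1-f}{n}\cdot\frac{\bar x_r\,\bar x}{\bar x_s}\,\sigma^2$$

while

$$E\{\hat V_{rob}(\hat R t_x - T)\} = N^2\,\frac{1-f}{n}\,(\bar x_r/\bar x_s)^2\,\sigma^2\bar x_s\Big(1 - \frac{1}{n}\cdot\frac{s_x^2}{\bar x_s^2}\Big)$$

So a robust variance estimator that is exactly unbiased under the working model, $v(x) = x$:

$$\hat V_{R,rob}(\hat R t_x - T) = \frac{\bar x}{\bar x_r}\Big(1 - \frac{1}{n}\cdot\frac{s_x^2}{\bar x_s^2}\Big)^{-1}\hat V_{rob}(\hat R t_x - T)$$

$$= \{1 - s_x^2/(n\bar x_s^2)\}^{-1}\,(\bar x_r\bar x/\bar x_s^2)\,N^2\,\frac{1-f}{n}\cdot\frac{1}{n-1}\sum_{i \in s}(Y_i - \hat R x_i)^2$$

$$= \{1 - s_x^2/(n\bar x_s^2)\}^{-1}\,(\bar x_r/\bar x)\,\hat V_{SRS}(\hat R t_x)$$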

201

General approach to robust variance estimation

1. Find robust estimators of $Var(Y_i)$ that do not depend on model assumptions about the variance

2. $\hat T = \sum_{i \in s} w_{is} Y_i$:

$$Var(\hat T - T) = \sum_{i \in s}(w_{is} - 1)^2 Var(Y_i) + \sum_{i \notin s} Var(Y_i)$$

3. For $i \in s$: $\hat V(Y_i) = (Y_i - \hat\mu_i)^2$, where $\hat\mu_i$ estimates $E(Y_i)$ under the true model

4. Estimate only the leading term in the prediction variance, which typically dominates, or estimate the second term from the more general model

202

• Reference for robust variance estimation: Valliant, Dorfman and Royall (2000): Finite Population Sampling and Inference. A Prediction Approach, ch. 5

203

Model-assisted approach

• Design-based approach

• Use modeling to improve the basic HT-estimator. Assume the population values y are realized values of random Y

• Assume the existence of auxiliary variables, known for all units in the population

• Basic idea:

Suppose $\hat y_i = \hat\beta x_i$ is a regression-based "estimate" for each $y_i$ in the population. Here, $x_i$ is known for the whole population.

$$t = \sum_{i=1}^{N} y_i = \sum_{i=1}^{N}\hat y_i + \sum_{i=1}^{N}(y_i - \hat y_i) = \sum_{i=1}^{N}\hat y_i + \sum_{i=1}^{N} e_i, \quad\text{where } e_i = y_i - \hat y_i,$$

and $\sum_{i=1}^{N} e_i$ is much easier to estimate, and can be estimated by the HT-estimator

204

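$$\hat e_{HT} = \sum_{i \in s}\frac{e_i}{\pi_i}$$

Final estimator, the regression estimator:

$$\hat t_{reg} = \sum_{i=1}^{N}\hat y_i + \hat e_{HT}$$

Alternative expression:

$$\hat t_{reg} = \sum_{i \in s}\frac{y_i}{\pi_i} + \hat\beta\Big(t_x - \sum_{i \in s}\frac{x_i}{\pi_i}\Big), \qquad t_x = \sum_{i=1}^{N} x_i$$

$$\hat t_{reg} = \hat t_{y,HT} + \hat\beta(t_x - \hat t_{x,HT})$$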

205

Simple random sample

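$$\hat t_{reg} = N\bar y_s + \hat\beta(t_x - N\bar x_s)$$

Model: The $Y_i$'s are independent and

$$Y_i = \beta x_i + \varepsilon_i, \quad E(\varepsilon_i) = 0 \ \text{ and } \ Var(\varepsilon_i) = \sigma^2 x_i$$

Best linear unbiased estimator: $\hat\beta = \bar y_s/\bar x_s$

$$\hat t_{reg} = N\bar y_s + \frac{\bar y_s}{\bar x_s}(t_x - N\bar x_s) = t_x\,\frac{\bar y_s}{\bar x_s}, \ \text{ the ratio estimator}$$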

206

In general with this “ratio model”, in order to get approximately design-unbiased estimators:

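Can regard the $\beta$-estimate as an estimate of $\sum_{i=1}^{N} y_i \big/ \sum_{i=1}^{N} x_i$

Numerator is estimated by $\hat t_{y,HT} = \sum_{i \in s} y_i/\pi_i$

Denominator is estimated by $\hat t_{x,HT} = \sum_{i \in s} x_i/\pi_i$

$$\text{use } \ \hat\beta = \hat t_{y,HT}/\hat t_{x,HT} = \frac{\sum_{i \in s} y_i/\pi_i}{\sum_{i \in s} x_i/\pi_i}$$

$$\hat t_{reg} = \hat\beta\,t_x = \sum_{i=1}^{N}\hat y_i, \quad\text{where } \hat y_i = \hat\beta x_i$$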
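The regression estimator can be written either as the fitted total plus the HT estimate of the residual total, or as the HT estimator of y plus a correction in x. The sketch below checks that the two expressions agree, and that with $\hat\beta = \hat t_{y,HT}/\hat t_{x,HT}$ both reduce to $\hat\beta\,t_x$. The inclusion probabilities and data are invented for the illustration.

```python
def ht_total(values, pi):
    """Horvitz-Thompson estimate: sum of value / inclusion probability over the sample."""
    return sum(v / p for v, p in zip(values, pi))


def t_reg_residual_form(y_s, x_s, pi_s, t_x, beta_hat):
    """Sum of fitted values over the population plus HT estimate of the residual total."""
    residuals = [y - beta_hat * x for x, y in zip(x_s, y_s)]
    return beta_hat * t_x + ht_total(residuals, pi_s)


def t_reg_difference_form(y_s, x_s, pi_s, t_x, beta_hat):
    """HT estimate of the y-total plus beta_hat times (t_x - HT estimate of the x-total)."""
    return ht_total(y_s, pi_s) + beta_hat * (t_x - ht_total(x_s, pi_s))


# Hypothetical sample of three units with unequal inclusion probabilities.
y_s, x_s, pi_s = [3.0, 8.0, 10.0], [1.0, 4.0, 5.0], [0.2, 0.5, 0.5]
t_x = 21.0  # assumed known population total of x

# Ratio-model choice of beta_hat.
beta_ratio = ht_total(y_s, pi_s) / ht_total(x_s, pi_s)
t_reg = t_reg_difference_form(y_s, x_s, pi_s, t_x, beta_ratio)
```

The equality of the two forms holds for any value of `beta_hat`, not only the ratio-model choice.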

207

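Variance and variance estimation

Reference: Särndal, Swensson and Wretman: Model Assisted Survey Sampling (1992, ch. 6), Springer

• Regression estimator is approximately unbiased

• Variance estimation:

The sample residuals: $e_i = y_i - \hat y_i$, $i \in s$, where $\hat y_i = \hat\beta x_i$

If $|s| = n$, fixed in advance:

$$\hat V(\hat t_{reg}) = \sum_{i \in s}\ \sum_{j \in s,\, j > i}\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\Big(\frac{e_i}{\pi_i} - \frac{e_j}{\pi_j}\Big)^2$$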
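A pairwise variance estimator of this kind is straightforward to code once the joint inclusion probabilities are available. The following sketch assumes the fixed-size form with the sum over pairs $j > i$, and checks it against simple random sampling, where $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/(N(N-1))$ and the result should equal $N^2(1-f)s_e^2/n$. The function name and the residuals are hypothetical.

```python
import statistics


def pairwise_variance(e_s, pi_s, pi_joint):
    """Sum over sample pairs i < j of
    (pi_i pi_j - pi_ij) / pi_ij * (e_i / pi_i - e_j / pi_j)^2.

    pi_joint(i, j) must return the joint inclusion probability of sample units i and j.
    """
    total = 0.0
    n = len(e_s)
    for i in range(n):
        for j in range(i + 1, n):
            pij = pi_joint(i, j)
            weight = (pi_s[i] * pi_s[j] - pij) / pij
            total += weight * (e_s[i] / pi_s[i] - e_s[j] / pi_s[j]) ** 2
    return total


# Hypothetical residuals from an SRS of n = 4 out of N = 10.
N, e_s = 10, [1.0, -2.0, 0.5, 0.5]
n = len(e_s)
pi_s = [n / N] * n
pij_srs = n * (n - 1) / (N * (N - 1))

v_hat = pairwise_variance(e_s, pi_s, lambda i, j: pij_srs)
v_srs = N**2 * (1 - n / N) / n * statistics.variance(e_s)
```

For other fixed-size designs only `pi_s` and `pi_joint` need to change.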

208

Approximate 95% CI, for large n, N-n:

$$\hat t_{reg} \pm 1.96\sqrt{\hat V(\hat t_{reg})}$$

• Remark: In SSW (1992,ch.6), an alternative variance estimator is mentioned that may be preferable in many cases

209

Common mean model

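$E(Y_i) = \beta$, $Var(Y_i) = \sigma^2$ and the $Y_i$'s are uncorrelated

The ratio model with $x_i = 1$.

$$\hat\beta = \hat t_{y,HT}/\hat t_{x,HT} = \frac{\sum_{i \in s} y_i/\pi_i}{\sum_{i \in s} 1/\pi_i} = \frac{\hat t_{y,HT}}{\hat N} = \tilde y_s$$

$$\hat t_{reg} = \hat\beta\,t_x = N\tilde y_s$$

This is the modified H-T estimator (slide 73,74). Typically much better than the H-T estimator when the $\pi_i$ are different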
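A tiny example shows why the modified H-T estimator can behave better when the inclusion probabilities differ. In the sketch below every sampled unit has the same y-value, so a sensible estimate of the total is N times that value. The data and the population size N = 12 are hypothetical.

```python
def ht_estimator(y_s, pi_s):
    """Horvitz-Thompson estimator of the total."""
    return sum(y / p for y, p in zip(y_s, pi_s))


def modified_ht(y_s, pi_s, N):
    """N times the weighted mean: N * t_HT / N_hat, with N_hat = sum(1 / pi_i)."""
    n_hat = sum(1.0 / p for p in pi_s)
    return N * ht_estimator(y_s, pi_s) / n_hat


# Hypothetical sample: identical y-values, very different inclusion probabilities.
y_s, pi_s, N = [5.0, 5.0, 5.0], [0.1, 0.5, 0.25], 12

t_ht = ht_estimator(y_s, pi_s)      # depends on N_hat = 16, not on N = 12
t_mod = modified_ht(y_s, pi_s, N)   # equals N * 5 for constant y
```

Here the H-T estimator effectively multiplies the common value by $\hat N = 16$ instead of N = 12, while the modified estimator reproduces N times the common value.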

210

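$$e_i = y_i - \tilde y_s$$

$$\hat V(N\tilde y_s) = \sum_{i \in s}\ \sum_{j \in s,\, j > i}\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\Big(\frac{e_i}{\pi_i} - \frac{e_j}{\pi_j}\Big)^2$$

Alternatively,

$$\hat V(N\tilde y_s) = (N/\hat N)^2\sum_{i \in s}\ \sum_{j \in s,\, j > i}\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\Big(\frac{e_i}{\pi_i} - \frac{e_j}{\pi_j}\Big)^2$$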

211

Remarks:

1. The model-assisted regression estimator often has the form $\hat t_{reg} = \sum_{i=1}^{N}\hat y_i$

2. The prediction approach makes it clear: no need to estimate the observed $y_i$

3. Any estimator can be expressed in the “prediction form”:

$$\hat t = \sum_{i \in s} y_i + \hat z, \quad\text{letting } \ \hat z = \hat t - \sum_{i \in s} y_i$$

4. Can then use this form to see if the estimator makes any sense
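One way to use the prediction form as a sanity check is to look at the average value that an estimate implicitly assigns to each unobserved unit, $\hat z/(N-n)$. The sketch below does this for two hypothetical estimates of the total (80 and 60) based on a sample in which every observed value is 5. All names and numbers are illustrative assumptions.

```python
def implied_mean_of_unobserved(t_hat, y_s, N):
    """z_hat / (N - n): the average predicted value per unsampled unit implied by t_hat."""
    z_hat = t_hat - sum(y_s)
    return z_hat / (N - len(y_s))


# Hypothetical sample of three units from a population of N = 12, all with y = 5.
y_s, N = [5.0, 5.0, 5.0], 12

implied_a = implied_mean_of_unobserved(80.0, y_s, N)  # estimate of the total equal to 80
implied_b = implied_mean_of_unobserved(60.0, y_s, N)  # estimate of the total equal to 60
```

An estimate of 80 implies about 7.2 per unobserved unit although every observed value is 5, whereas an estimate of 60 implies exactly 5, which looks more reasonable for this sample.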
