47
Survey Data Analysis in Stata Jeff Pitblado Associate Director, Statistical Software StataCorp LP Stata Conference DC 2009 J. Pitblado (StataCorp) Survey Data Analysis DC 2009 1 / 44

Survey Data Analysis in Stata · Survey Data Analysis in Stata Jeff Pitblado Associate Director, Statistical Software StataCorp LP Stata Conference DC 2009 J. Pitblado (StataCorp)

  • Upload
    others

  • View
    51

  • Download
    0

Embed Size (px)

Citation preview

Survey Data Analysis in Stata

Jeff PitbladoAssociate Director, Statistical Software

StataCorp LP

Stata Conference DC 2009

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 1 / 44

Outline

1 Types of data

2 Survey data characteristics

3 Variance estimation

4 Estimation for subpopulations

5 Summary

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 2 / 44

Why survey data?

Collecting data can be expensive and time consuming.Consider how you would collect the following data:

Smoking habits of teenagersBirth weights for expectant mothers with high blood pressure

Using stages of clustered sampling can help cut down on theexpense and time.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 3 / 44

Types of data

Simple random sample (SRS) dataObservations are "independently" sampled from a data generatingprocess.

Typical assumption: independent and identically distributed (iid)Make inferences about the data generating processSample variability is explained by the statistical model attributed tothe data generating process

Standard dataWe’ll use this term to distinguish this data from survey data.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 4 / 44

Types of data

Correlated dataIndividuals are assumed not independent.

Cause:

Observations are taken over timeRandom effects assumptionsCluster sampling

Treatment:

Time-series modelsLongitudinal/panel data modelscluster() option

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 5 / 44

Types of data

Survey dataIndividuals are sampled from a fixed population according to a surveydesign.

Distinguishing characteristics:

Complex nature under which individuals are sampledMake inferences about the fixed populationSample variability is attributed to the survey design

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 6 / 44

Types of data

Standard dataEstimation commands for standard data:

proportionregress

We’ll refer to these as standard estimation commands.

Survey dataSurvey estimation commands are governed by the svy prefix.

svy: proportionsvy: regress

svy requires that the data is svyset.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 7 / 44

Survey data characteristics

Single-stage syntaxsvyset

[psu

] [weight

] [,strata(varname)fpc(varname)

]Primary sampling units (PSU)Sampling weights – pweight

StrataFinite population correction (FPC)

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 8 / 44

Survey data characteristics

Sampling unitAn individual or collection of individuals from the population that canbe selected for observation.

Sampling groups of individuals is synonymous with clustersampling.Cluster sampling usually results in inflated variance estimatescompared to SRS.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 9 / 44

Survey data characteristics

Sampling weightThe reciprocal of the probability for an individual to be sampled.

Probabilities are derived from the survey design.

Sampling unitsStrata

Typically considered to be the number of individuals in thepopulation that a sampled individual represents.Reduces bias induced by the sampling design.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 10 / 44

Survey data characteristics

StrataIn stratified designs, the population is partitioned into well-definedgroups, called strata.

Sampling units are independently sampled from within eachstratum.Stratification usually results in smaller variance estimatescompared to SRS.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 11 / 44

Survey data characteristics

Finite population correction (FPC)An adjustment applied to the variance due to sampling withoutreplacement.

Sampling without replacement from a finite population reducessampling variability.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 12 / 44

Survey data characteristics

Example: svyset for single-stage designs

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 13 / 44

Survey data characteristics

Population 1000

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 14 / 44

Survey data characteristics

SRS sample 200

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 15 / 44

Survey data characteristics

Cluster sample 20 (208 obs)

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 16 / 44

Survey data characteristics

Stratified sample 198

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 17 / 44

Survey data characteristics

Stratified-cluster sample 20 (215 obs)

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 18 / 44

Survey data characteristics

Multistage syntaxsvyset psu

[weight

] [, strata(varname) fpc(varname)

][|| ssu

[, strata(varname) fpc(varname)

] ][|| ssu

[, strata(varname) fpc(varname)

] ]...

Stages are delimited by “||”SSU – secondary/subsequent sampling unitsFPC is required at stage s for stage s + 1 to play a role in thelinearized variance estimator

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 19 / 44

Survey data characteristics

PoststratificationA method for adjusting sampling weights, usually to account forunderrepresented groups in the population.

Adjusts weights to sum to the poststratum sizes in the populationReduces bias due to nonresponse and underrepresented groupsCan result in smaller variance estimates

Syntaxsvyset ... poststrata(varname) postweight(varname)

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 20 / 44

Survey data characteristics

Example: svyset for poststratification

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 21 / 44

Strata with a single sampling unit

Big problem for variance estimationConsider a sample with only 1 observationsvy reports missing standard error estimates by default

Finding these lonely sampling unitsUse svydes:

Describes the strata and sampling unitsHelps find strata with a single sampling unit

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 22 / 44

Strata with a single sampling unit

Example: svydes

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 23 / 44

Strata with a single sampling unit

Handling lonely sampling units1 Drop them from the estimation sample.2 svyset one of the ad-hoc adjustments in the singleunit()

option.3 Somehow combine them with other strata.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 24 / 44

Certainty units

Sampling units that are guaranteed to be chosen by the design.Certainty units are handled by treating each one as its ownstratum with an FPC of 1.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 25 / 44

Variance estimation

Stata has three variance estimation methods for survey data:

LinearizationBalanced repeated replicationThe jackknife

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 26 / 44

Variance estimation

LinearizationA method for deriving a variance estimator using a first order Taylorapproximation of the point estimator of interest.

Foundation: Variance of the total estimator

Syntaxsvyset ...

[vce(linearized)

]Delta methodHuber/White/robust/sandwich estimator

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 27 / 44

Variance estimation

Total estimator – Stratified two-stage designyhijk – observed value from a sampled individualStrata: h = 1, . . . ,LPSU: i = 1, . . . ,nh

SSU: j = 1, . . . ,mhi

Individual: k = 1, . . . ,mhij

Y =∑

whijkyhijk

V (Y ) =∑

h

(1− fh)nh

nh − 1

∑i

(yhi − yh)2 +

∑h

fh∑

i

(1− fhi)mhi

mhi − 1

∑j

(yhij − yhi)2

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 28 / 44

Variance estimation

Total estimator – Stratified two-stage designyhijk – observed value from a sampled individualStrata: h = 1, . . . ,LPSU: i = 1, . . . ,nh

SSU: j = 1, . . . ,mhi

Individual: k = 1, . . . ,mhij

Y =∑

whijkyhijk

V (Y ) =∑

h

(1− fh)nh

nh − 1

∑i

(yhi − yh)2 +

∑h

fh∑

i

(1− fhi)mhi

mhi − 1

∑j

(yhij − yhi)2

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 28 / 44

Variance estimation

Total estimator – Stratified two-stage designyhijk – observed value from a sampled individualStrata: h = 1, . . . ,LPSU: i = 1, . . . ,nh

SSU: j = 1, . . . ,mhi

Individual: k = 1, . . . ,mhij

Y =∑

whijkyhijk

V (Y ) =∑

h

(1− fh)nh

nh − 1

∑i

(yhi − yh)2 +

∑h

fh∑

i

(1− fhi)mhi

mhi − 1

∑j

(yhij − yhi)2

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 28 / 44

Variance estimation

Example: svy: total

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 29 / 44

Variance estimation

Linearized variance for regression modelsModel is fit using estimating equations.

G() is a total estimator, use Taylor expansion to get V (β).

G(β) =∑

j

wjsjxj = 0

V (β) = DV{G(β)}|β=bβD′

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 30 / 44

Variance estimation

Linearized variance for regression modelsModel is fit using estimating equations.

G() is a total estimator, use Taylor expansion to get V (β).

G(β) =∑

j

wjsjxj = 0

V (β) = DV{G(β)}|β=bβD′

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 30 / 44

Variance estimation

Example: svy: logit

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 31 / 44

Variance estimation

Balanced repeated replicationFor designs with two PSUs in each of L strata.

Compute replicates by dropping a PSU from each stratum.Find a balanced subset of the 2L replicates. L ≤ r < L + 4The replicates are used to estimate the variance.

Syntaxsvyset ... vce(brr)

[mse

]

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 32 / 44

Variance estimation

BRR variance formulas

θ – point estimatesθ(i) – i th replicate of the point estimates

θ(.) – average of the replicates

Default variance formula:

V (θ) =1r

r∑i=1

{θ(i) − θ(.)}{θ(i) − θ(.)}′

Mean squared error (MSE) formula:

V (θ) =1r

r∑i=1

{θ(i) − θ}{θ(i) − θ}′

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 33 / 44

Variance estimation

Example: svy brr: logit

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 34 / 44

Variance estimation

The jackknifeA replication method for variance estimation. Not restricted to aspecific survey design.

Delete-1 jackknife: drop 1 PSUDelete-k jackknife: drop k PSUs within a stratum

Syntaxsvyset ... vce(jackknife)

[mse

]

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 35 / 44

Variance estimation

Jackknife variance formulas

θ(h,i) – replicate of the point estimates from stratum h, PSU i

θh – average of the replicates from stratum hmh = (nh − 1)/nh – delete-1 multiplier for stratum h

Default variance formula:

V (θ) =L∑

h=1

(1− fh) mh

nh∑i=1

{θ(h,i) − θh}{θ(h,i) − θh}′

Mean squared error (MSE) formula:

V (θ) =L∑

h=1

(1− fh) mh

nh∑i=1

{θ(h,i) − θ}{θ(h,i) − θ}′

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 36 / 44

Variance estimation

Example: svy jackknife: logit

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 37 / 44

Variance estimation

Replicate weight variableA variable in the dataset that contains sampling weight values thatwere adjusted for resampling the data using BRR or the jackknife.

Typically used to protect the privacy of the survey participants.Eliminate the need to svyset the strata and PSU variables.

Syntaxsvyset ... brrweight(varlist)

svyset ... jkrweight(varlist[, ... multiplier(#)

])

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 38 / 44

Estimation for subpopulations

Focus on a subset of the populationSubpopulation variance estimation:

Assumes the same survey design for subsequent data collection.The subpop() option.

Restricted-sample variance estimation:

Assumes the identified subset for subsequent data collection.Ignores the fact that the sample size is a random quantity.The if and in restrictions.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 39 / 44

Estimation for subpopulations

Total from SRS dataData is y1, . . . , yn and S is the subset of observations.

δj(S) =

{1, if j ∈ S0,otherwise

Subpopulation (or restricted-sample) total:

YS =n∑

j=1

δj(S)wjyj

Sampling weight and subpopulation size:

wj =Nn, NS =

n∑j=1

δj(S)wj =Nn

nS

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 40 / 44

Estimation for subpopulations

Variance of a subpopulation totalSample n without replacement from a population comprised of the NSsubpopulation values with N − NS additional zeroes.

V (YS) =(

1− nN

) nn − 1

n∑j=1

{δj(S)yj −

1n

YS

}2

Variance of a restricted-sample totalSample nS without replacement from the subpopulation of NS values.

V (YS) =

(1− nS

NS

)nS

nS − 1

n∑j=1

δj(S)

{yj −

1nS

YS

}2

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 41 / 44

Estimation for subpopulations

Example: svy, subpop()

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 42 / 44

Summary

1 Use svyset to specify the survey design for your data.2 Use svydes to find strata with a single PSU.3 Choose your variance estimation method; you can svyset it.4 Use the svy prefix with estimation commands.5 Use subpop() instead of if and in.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 43 / 44

References

Levy, P. and S. Lemeshow. 1999.Sampling of Populations. 3rd ed.New York: Wiley.

StataCorp. 2009.Survey Data Reference Manual: Release 11.College Station, TX: StataCorp LP.

J. Pitblado (StataCorp) Survey Data Analysis DC 2009 44 / 44