
Ch 5: Cluster sampling with equal probabilities

DEFN: A cluster is a group of observation units (or “elements”)

Population | Obs Unit | Cluster
U.S. residents | person | household
Lincoln households | household | city block, or postal route
UNL employees | employee | department
Maple trees in Vermont | tree | 1 km x 1 km plot

Cluster sample
DEFN: A cluster sample is a probability sample in which a sampling unit is a cluster

Frame | SU | OU
List of phone numbers | phone number | person
List of blocks | block | household
List of UNL departments | department | faculty member
List of plots | plot | tree

Cluster sample – 2
1-stage cluster sampling
Divide the population (of K elements) into N clusters (of size Mi for cluster i)
  Cluster = group of elements
  An element belongs to 1 and only 1 cluster
Sampling unit
  Cluster = group of elements = PSU = primary sampling unit
  We’ll start by assuming an SRS of clusters (equal prob)
  Can use any design to select clusters (STS, PPS); we’ll work with other designs in Ch 6
Data collection
  Collect information on ALL elements in the cluster

1-stage CS vs. STS
(figure: the same population grid of cells sampled two ways, sample of 40 elements)
STS: Take an SRS from every stratum
  A block of cells is a stratum
  SU is an element (or OU)
  Sample from every stratum
1-stage CS: Take an SRS of clusters; observe all elements within the clusters in the sample
  A block of cells is a cluster
  SU is a cluster
  Don’t sample from every cluster

Cluster vs. stratified sampling
Cluster sample
  Divide K elements into N clusters
  Cluster or PSU i has Mi elements
  Take a sample of n clusters
Stratified sampling
  N elements divided into H strata
  An element belongs to 1 and only 1 stratum
  Take a sample of n elements, consisting of nh elements from stratum h for each of the H strata

Cluster sample – 3
2-stage cluster sampling (later)
Process
  Select PSUs (stage 1)
  Select elements within each sampled PSU (stage 2)
First stage sampling unit is a …
  PSU = primary sampling unit = cluster
Second stage sampling unit is a …
  SSU = secondary sampling unit = element = OU
Only collect data on the SSUs that were sampled from the cluster

1-stage vs. 2-stage cluster sampling
(figure: three panels)
Stage 1 of 2-stage cluster sample (select PSUs)
1-stage cluster sample: sample all SSUs in sampled PSUs (stop here)
Stage 2 of 2-stage cluster sample (select SSUs w/in PSUs): take an SRS of mi SSUs in sampled PSU i

Why use cluster sampling?
May not have a list of OUs for a frame, but a list of clusters may be available
  List of Lincoln phone numbers (= group of residents) is available, but a list of Lincoln residents is not available
  List of all NE primary and secondary schools (= group of students) is available, but a list of all students in NE schools is not available
May be cheaper to conduct the study if OUs are clustered
  Occurs when cost of data collection increases with distance between elements
  Household surveys using in-person interviews (household = cluster of people)
  Field data collection (plot = cluster of plants, or animals)

Defining clusters due to frame limitations
A cluster (or PSU) is a group of elements corresponding to a record (row) in the frame
Example
  Population = employees in McDonald’s franchises
  Element = employee
  Frame = list of McDonald’s stores
  PSU = store = cluster of employees

Defining clusters to reduce travel costs
A cluster (or PSU) is a group of nearby elements
Example
  Population = all farms
  Element = farm
  Frame = list of sections (1 mi x 1 mi areas) in rural area
  PSU = section = cluster of farms

Cluster samples usually lead to less precise estimates
Elements within clusters tend to be correlated due to exposure to similar conditions
  Members of a household
  Employees in a business
  Plants or soil within a field plot
We are getting less information than if we had selected the same number of unrelated elements
Example: select sample of city blocks (clusters of households)
  Ask each household: Should city upgrade storm sewer system?
  PSU (city block) 1: no storm sewer, so households will tend to say yes
  PSU (city block) 2: new development, so households will tend to say no

Defining clusters for improved precision
Define clusters for which within-cluster variation is high (rarely possible)
  Make each cluster as heterogeneous as possible
    Like making each cluster a mini-population that reflects variation in population
    Minimizes the amount of correlation among elements in the cluster
  Opposite of the approach to stratification
    Large variation among strata, homogeneous within strata
Define clusters that are relatively small
  Extreme case is cluster = element
  Decreases the number of correlated observations in the sample

Example for single-stage cluster sampling w/ equal prob (CSE1)
Dorm has N = 100 suites (clusters)
Each suite has Mi = 4 students (4 elements in cluster i , i = 1, 2, … , N)
  Note that there are $K = \sum_{i=1}^{N} M_i = 100(4) = 400$ students in the population
Take SRS of n = 5 suites (clusters)
Ask each student living in each of the 5 suites
  How many nights per week do you eat dinner in the dining hall?
Will get observations from a sample of 20 students = 5 suites x 4 students/suite

Dorm example – 2

Student | Suite 6 | Suite 21 | Suite 28 | Suite 54 | Suite 89
1 | 5 | 3 | 6 | 5 | 1
2 | 5 | 2 | 4 | 4 | 4
3 | 4 | 4 | 4 | 6 | 3
4 | 6 | 5 | 5 | 6 | 2
Total | 20 | 14 | 19 | 21 | 10

Dorm example – 3
SRS of n = 5 dorm rooms
Data on each cluster (all students in dorm room)
  ti = total number of dining hall dinners for dorm room i
  t2 = 14 dining hall dinners for the 4 students in the 2nd sampled dorm room (Suite 21)
Estimated total number of dining hall nights for the dorm students
  HT estimator of total = pop size x sample mean (of cluster totals)

$$\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i = 100 \cdot \frac{1}{5}(20 + 14 + 19 + 21 + 10) = 100(16.8) = 1680 \text{ dining hall dinners}$$

Notation
Indices
  i = index for PSU i
  j = index for SSU j in PSU i
Number of PSUs (clusters) in the population
  N clusters
Number of SSUs (elements) in a PSU (cluster)
  Mi elements
Number of SSUs (elements) in the population
  $K = \sum_{i=1}^{N} M_i$ elements
  In Chapters 1-4, this was designated as N elements

Notation – 2
(figure: population of N = 12 PSUs, labeled i = 1, … , 12, with two SSUs in PSU i = 9 marked as j = 1 and j = 7)
N = 12 PSUs
M1 = 20 SSUs, M2 = 12 SSUs, … , M11 = 9 SSUs, M12 = 16 SSUs
K = 20 + 12 + … + 9 + 16 = 150 SSUs

Notation – 3
Response variable for SSU j in PSU i
  yij
  e.g., age of j-th resident in household i
  e.g., whether or not dorm resident j in room i owns a computer

Cluster-level population parameters (for cluster i )
Cluster size = Mi elements
Cluster population total
  $t_i = \sum_{j=1}^{M_i} y_{ij}$
Cluster population mean
  $\bar{y}_{iU} = t_i / M_i$
Within-cluster variance
  $S_i^2 = \frac{1}{M_i - 1} \sum_{j=1}^{M_i} (y_{ij} - \bar{y}_{iU})^2$
Note that we observe the cluster population total (or mean or variance) for each sample cluster in 1-stage cluster sampling
We will estimate cluster parameters in 2-stage cluster sampling

Cluster-level population parameters (for cluster i ) – 2
(figure: population of boxes and a 1-stage cluster sample, annotated with cluster-level means and variances)

Cluster-level population parameters (for cluster i ) – 3
For 1-stage cluster samples
  Have a complete enumeration of the cluster elements
  Cluster population parameters are known
For 2-stage cluster samples
  Observe data on a sample of elements in a cluster
  Estimate cluster population parameters

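Population parameters
Same parameters as in previous chapters, rewritten in notation for cluster sampling
Population size
  $K = \sum_{i=1}^{N} M_i$ elements
  (** K was referred to as N in previous chapters)
Population total (sum of all cluster totals)
  $t = \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij} = \sum_{i=1}^{N} t_i$

Population parameters – 2
Population mean (of K elements)
  $\bar{y}_U = t / K$
Population variance (among K elements)
  $S^2 = \frac{1}{K - 1} \sum_{i=1}^{N} \sum_{j=1}^{M_i} (y_{ij} - \bar{y}_U)^2$
Variance among N cluster totals
  $S_t^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left(t_i - \frac{t}{N}\right)^2$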

Data from cluster samples
Work with element and cluster-level data
Element data set will have columns for
  Cluster id
  Element id within cluster
  Variable (y)
Will also summarize this data set to generate cluster parameters (1-stage) or estimates of cluster parameters (2-stage)
  Cluster id
  Cluster total (or estimate)
  Cluster mean (or estimate)
  Cluster variance (or estimate)

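1-stage cluster sample

Element data
i | j | yij
1 | 1 | y11
1 | 2 | y12
1 | 3 | y13
1 | 4 | y14
2 | 1 | y21
2 | 2 | y22
2 | 3 | y23
3 | 1 | y31

Cluster summary
i | Cluster total | Cluster mean | Cluster variance
1 | $t_1$ | $\bar{y}_{1U}$ | $S_1^2$
2 | $t_2$ | $\bar{y}_{2U}$ | $S_2^2$
3 | $t_3$ | $\bar{y}_{3U}$ | $S_3^2$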
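The step from the element data set to the cluster summary can be sketched in a few lines of Python. This is only an illustration using the dorm data from the earlier table; the function name `summarize_clusters` and the tuple layout (cluster id, element id, y) are assumptions made for this sketch, not part of the course material.

```python
from statistics import mean, variance

# Element-level rows from the dorm example: (suite id, student id within suite, y)
element_rows = [
    (6, 1, 5), (6, 2, 5), (6, 3, 4), (6, 4, 6),
    (21, 1, 3), (21, 2, 2), (21, 3, 4), (21, 4, 5),
    (28, 1, 6), (28, 2, 4), (28, 3, 4), (28, 4, 5),
    (54, 1, 5), (54, 2, 4), (54, 3, 6), (54, 4, 6),
    (89, 1, 1), (89, 2, 4), (89, 3, 3), (89, 4, 2),
]

def summarize_clusters(rows):
    """Group element rows by cluster id and return one summary dict per cluster."""
    by_cluster = {}
    for cluster_id, _element_id, y in rows:
        by_cluster.setdefault(cluster_id, []).append(y)
    summary = {}
    for cluster_id, ys in by_cluster.items():
        summary[cluster_id] = {
            "M": len(ys),          # cluster size M_i
            "total": sum(ys),      # cluster total t_i
            "mean": mean(ys),      # cluster mean
            "var": variance(ys),   # within-cluster variance, divisor M_i - 1
        }
    return summary

cluster_summary = summarize_clusters(element_rows)
for cluster_id, stats in cluster_summary.items():
    print(cluster_id, stats)
```

Because every student in a sampled suite is observed in a 1-stage design, these summaries are the cluster parameters themselves rather than estimates.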

Estimation for CSE1
Chapter reading
  Section 5.2.1 covers equal sized clusters (Mi constant, read)
  We’ll start with 5.2.3 (unequal sized clusters, Mi varies)
  Section 5.2.2 covers theory
Two types of estimators
  Unbiased: HT estimator
  Ratio estimation
Equal probability sample of clusters: assume SRS of clusters

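CSE1 unbiased estimation under SRS – total t
Estimator for population total using data collected from a 1-stage cluster sample
  SRS of clusters
  $$\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i$$
Estimator of variance of $\hat{t}_{unb}$
  $$\hat{V}(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}, \quad \text{where } s_t^2 = \frac{1}{n - 1} \sum_{i \in S} \left(t_i - \frac{\hat{t}_{unb}}{N}\right)^2$$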

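Dorm example – 4
Estimated population total
  $$\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i = 100 \cdot \frac{1}{5}(20 + 14 + 19 + 21 + 10) = 100(16.8) = 1680 \text{ dining hall dinners}$$
Estimated variance
  $$s_t^2 = \frac{1}{5 - 1}\left[(20 - 16.8)^2 + \dots + (10 - 16.8)^2\right] = 21.7$$
  $$\hat{V}(\hat{t}_{unb}) = 100^2 \left(1 - \frac{5}{100}\right) \frac{21.7}{5} = 41{,}230$$
  $$SE(\hat{t}_{unb}) = \sqrt{41{,}230} \approx 203.05$$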
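As a check on the arithmetic, here is a minimal Python sketch of the CSE1 unbiased estimator, assuming an SRS of clusters and the formulas $\hat{t}_{unb} = (N/n)\sum t_i$ and $\hat{V}(\hat{t}_{unb}) = N^2 (1 - n/N) s_t^2 / n$. The function name `cse1_unbiased_total` is made up for this illustration.

```python
from math import sqrt

def cse1_unbiased_total(cluster_totals, N):
    """Unbiased total and variance estimate for a 1-stage SRS of clusters.

    cluster_totals: observed totals t_i for the n sampled clusters
    N: number of clusters in the population
    """
    n = len(cluster_totals)
    t_bar = sum(cluster_totals) / n                       # sample mean of cluster totals
    t_hat = N * t_bar                                     # (N/n) * sum of t_i
    s2_t = sum((t - t_bar) ** 2 for t in cluster_totals) / (n - 1)
    var_hat = N ** 2 * (1 - n / N) * s2_t / n             # includes the fpc (1 - n/N)
    return t_hat, s2_t, var_hat

# Dorm example: 5 sampled suites out of N = 100
t_hat, s2_t, var_hat = cse1_unbiased_total([20, 14, 19, 21, 10], N=100)
print(round(t_hat, 2), round(s2_t, 2), round(var_hat, 2), round(sqrt(var_hat), 2))
```

Only the cluster totals enter the calculation; the individual student responses matter only through $t_i$.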

CSE1 inclusion probability for an element
Two events: A and B
  Pr { A and B both occur } = Pr { A occurs } x Pr { B occurs given A occurs }
In our setting
  A = sample cluster i
  B = sample element j (in cluster i)
Inclusion probability for element j in cluster i
  $\pi_{ij}$ = Pr {including element j and cluster i in sample}
  = Pr {including cluster i in sample} x Pr {including element j given cluster i has been included in sample}

CSE1 inclusion probability for an element – 2
Need two pieces
  Pr {including cluster i in sample} = n / N
  Pr {including element j given cluster i has been included in sample} = 1
Inclusion probability
  $\pi_{ij}$ = Pr {including element j and cluster i in sample}
  = Pr {including cluster i in sample} x Pr {including element j given cluster i has been included in sample}
  = (n / N ) x 1 = n / N

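CSE1 weight for an element
Weight for element j in cluster i
  Inverse of the element inclusion probability
  $w_{ij} = 1 / \pi_{ij} = N / n$
Estimator using weights
  $$\hat{t}_{unb} = \sum_{i \in S} \sum_{j=1}^{M_i} w_{ij} y_{ij} = \frac{N}{n} \sum_{i \in S} \sum_{j=1}^{M_i} y_{ij} = \frac{N}{n} \sum_{i \in S} t_i$$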

Dorm example – 5
Inclusion probability for student j in dorm room i
  N = 100 dorm rooms
  n = 5 sample dorm rooms
  Take all 4 students in each sampled dorm room
  $\pi_{ij}$ = n / N = 5/100 = 1/20 = 0.05
Weight for student j in dorm room i
  $w_{ij}$ = N / n = 20 (each sampled student represents 20 students)
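The weighted form of the estimator can be checked directly on the dorm data. The sketch below assumes the CSE1 inclusion probability (n/N) x 1 and uses exact fractions; the dictionary name `sampled_suites` is just an illustrative choice.

```python
from fractions import Fraction

N, n = 100, 5  # suites in the population and in the sample

# Responses for all 4 students in each sampled suite (1-stage: every element observed)
sampled_suites = {
    6: [5, 5, 4, 6],
    21: [3, 2, 4, 5],
    28: [6, 4, 4, 5],
    54: [5, 4, 6, 6],
    89: [1, 4, 3, 2],
}

pi_ij = Fraction(n, N) * 1   # Pr(cluster sampled) x Pr(element sampled | cluster sampled)
w_ij = 1 / pi_ij             # sampling weight, the same for every sampled student

# Weighted sum over every sampled element
t_hat = sum(w_ij * y for ys in sampled_suites.values() for y in ys)
print(pi_ij, w_ij, t_hat)
```

The weighted sum over the 20 students reproduces the estimate obtained from the five cluster totals.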

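CSE1 unbiased estimation under SRS – mean
Unbiased estimator for population mean
  For SRS, estimator for total divided by number of population elements (OUs)
  Units are y-units per element
  $$\hat{\bar{y}}_{unb} = \frac{\hat{t}_{unb}}{K}, \qquad \hat{V}(\hat{\bar{y}}_{unb}) = \frac{1}{K^2} \hat{V}(\hat{t}_{unb})$$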

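Dorm example – 6
$$\hat{\bar{y}}_{unb} = \frac{\hat{t}_{unb}}{K} = \frac{1680}{100(4)} = 4.20 \text{ dining hall dinners per student per week}$$
$$\hat{V}(\hat{\bar{y}}_{unb}) = \frac{\hat{V}(\hat{t}_{unb})}{K^2} = \frac{41{,}230}{400^2} = 0.257688$$
$$SE(\hat{\bar{y}}_{unb}) = \sqrt{0.257688} = 0.51$$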

Unbiased estimation – proportion p
What is y ?

Ratio estimation
Usually ti (cluster total) is correlated with Mi (cluster size)
  As Mi (# SSUs/elements in cluster i ) increases, value for ti (total of yij for cluster i ) increases
  Positive correlation between Mi and ti
  No intercept
Perfect conditions for SRS ratio estimator

Notation of Ch 3 | Notation of Ch 5
yi (variable of interest) | ti (cluster total)
xi (auxiliary info) | Mi (cluster size)

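Ratio estimation for CSE1
Estimator for population mean
  $$\hat{\bar{y}}_r = \frac{\sum_{i \in S} t_i}{\sum_{i \in S} M_i}$$
  Units are y-units per element

Ratio estimation for CSE1 – 2
Estimator for variance of ratio estimator of population mean
  $$\hat{V}(\hat{\bar{y}}_r) = \left(1 - \frac{n}{N}\right) \frac{1}{n \bar{M}_U^2} \cdot \frac{\sum_{i \in S} (t_i - \hat{\bar{y}}_r M_i)^2}{n - 1}$$
  $\bar{M}_U$ is the average cluster size for the population

Ratio estimation for CSE1 – 3
Average cluster size
  $\bar{M}_U = K / N$
  If unknown, can estimate with sample mean of cluster sizes, $\bar{M}_S = \frac{1}{n} \sum_{i \in S} M_i$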

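Dorm example – 7
Estimated population mean
  $$\hat{\bar{y}}_r = \frac{\sum_{i \in S} t_i}{\sum_{i \in S} M_i} = \frac{20 + 14 + 19 + 21 + 10}{4 + 4 + 4 + 4 + 4} = \frac{84}{20} = 4.20 \text{ dining hall dinners per student per week}$$
Average cluster size
  $$\bar{M}_U = \frac{K}{N} = \frac{400}{100} = 4 \text{ students per suite}$$

Dorm example – 8
Estimated variance
  $$\hat{V}(\hat{\bar{y}}_r) = \left(1 - \frac{5}{100}\right) \frac{1}{5(4^2)} \cdot \frac{(20 - 16.8)^2 + \dots + (10 - 16.8)^2}{5 - 1} = 0.95 \cdot \frac{21.7}{80} \approx 0.2577$$
  $$SE(\hat{\bar{y}}_r) \approx 0.51$$
  Same as the unbiased estimator here because all suites have the same size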

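Ratio estimation for CSE1 – 4
Estimator for population total
  $$\hat{t}_r = K \hat{\bar{y}}_r$$
  $$\hat{V}(\hat{t}_r) = K^2 \hat{V}(\hat{\bar{y}}_r)$$

Dorm example – 9
Estimated population total
  $$\hat{t}_r = K \hat{\bar{y}}_r = 400(4.20) = 1680 \text{ dining hall dinners}$$
Estimated variance
  $$\hat{V}(\hat{t}_r) = K^2 \hat{V}(\hat{\bar{y}}_r) = 400^2 (0.257688) \approx 41{,}230$$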
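The ratio estimator for a 1-stage cluster sample can be sketched as follows. The sketch assumes the textbook form $\hat{\bar{y}}_r = \sum t_i / \sum M_i$ with variance estimate $(1 - n/N)\, s_r^2 / (n \bar{M}^2)$, where $s_r^2 = \sum (t_i - \hat{\bar{y}}_r M_i)^2 / (n - 1)$. The function name `cse1_ratio_mean` and the fallback to the sample mean of cluster sizes when the population average is not supplied are choices made for this illustration.

```python
def cse1_ratio_mean(cluster_totals, cluster_sizes, N, M_bar=None):
    """Ratio estimate of the population mean and its estimated variance (CSE1)."""
    n = len(cluster_totals)
    y_r = sum(cluster_totals) / sum(cluster_sizes)
    if M_bar is None:
        # Population average cluster size unknown: use the sample mean of M_i
        M_bar = sum(cluster_sizes) / n
    s2_r = sum((t - y_r * M) ** 2
               for t, M in zip(cluster_totals, cluster_sizes)) / (n - 1)
    var_hat = (1 - n / N) * s2_r / (n * M_bar ** 2)
    return y_r, var_hat

# Dorm example: N = 100 suites, K = 400 students, every suite has 4 students
K = 400
y_r, var_y_r = cse1_ratio_mean([20, 14, 19, 21, 10], [4, 4, 4, 4, 4], N=100)
t_r = K * y_r                # ratio estimate of the population total
var_t_r = K ** 2 * var_y_r   # its estimated variance
print(round(y_r, 2), round(var_y_r, 6), round(t_r, 1), round(var_t_r, 1))
```

With equal cluster sizes the ratio and unbiased estimators coincide, which is why the dorm numbers match the earlier slides; the two differ once Mi varies.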

CSE1: impact of cluster size
If cluster sizes Mi are variable across clusters, generally estimate population parameter with less precision
  If ti is related to Mi , then get large variation among cluster totals if Mi is variable
  Variance of population parameter estimator (unbiased or ratio) is a function of variation among cluster totals

2-stage equal probability cluster sampling (CSE2)
CSE2 has 2 stages of sampling
  Stage 1. Select SRS of n PSUs from population of N PSUs
  Stage 2. Select SRS of mi SSUs from Mi elements in PSU i sampled in stage 1

2-stage cluster sampling
(figure: three panels)
Stage 1 of 2-stage cluster sample (select PSUs)
1-stage cluster sample: sample all SSUs in sampled PSUs
Stage 2 of 2-stage cluster sample (select SSUs w/in PSUs): take an SRS of mi SSUs in sampled PSU i

Motivation for 2-stage cluster samples
Recall motivations for cluster sampling in general
  Only have access to a frame that lists clusters
  Reduce data collection costs by going to groups of nearby elements (cluster defined by proximity)

Motivation for 2-stage cluster samples – 2
Likely that elements in cluster will be correlated
  May be inefficient to observe all elements in a sample PSU
  Extra effort required to fully enumerate a PSU does not generate that much extra information
May be better to spend resources to sample many PSUs and a small number of SSUs per PSU
  Possible opposing force: study costs associated with going to many clusters

CSE2 unbiased estimation for population total t
Have a sample of elements from a cluster
  We no longer know the value of the cluster parameter, ti
  Estimate ti using data observed for mi SSUs
  $$\hat{t}_i = M_i \bar{y}_i = \frac{M_i}{m_i} \sum_{j \in S_i} y_{ij}$$

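CSE2 unbiased estimation for population total – 2
Approach is to plug estimated cluster totals into CSE1 formula
  CSE1: $\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i$
  CSE2: $\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} \hat{t}_i = \frac{N}{n} \sum_{i \in S} \frac{M_i}{m_i} \sum_{j \in S_i} y_{ij}$

CSE2 unbiased estimation for population total – 3
The variance of $\hat{t}_{unb}$ has 2 components associated with the 2 sampling stages
  1. Variation among PSUs
  2. Variation among SSUs within PSUs
$$\hat{V}(\hat{t}_{unb}) = \underbrace{N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}}_{\text{among PSU}} + \underbrace{\frac{N}{n} \sum_{i \in S} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{s_i^2}{m_i}}_{\text{within PSU}}$$

CSE2 unbiased estimation for population total – 4
In CSE1, we observe all elements in a cluster
  We know ti
  Have variance component 1, but no component 2
In CSE2, we sample a subset of elements in a cluster
  We estimate ti with $\hat{t}_i$
  Component 2 is a function of the estimated variance for $\hat{t}_i$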

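CSE2 unbiased estimation for population total – 5
Estimated variance among cluster totals
  $$s_t^2 = \frac{1}{n - 1} \sum_{i \in S} \left(\hat{t}_i - \frac{\hat{t}_{unb}}{N}\right)^2$$
Estimated variance among elements in a cluster
  $$s_i^2 = \frac{1}{m_i - 1} \sum_{j \in S_i} (y_{ij} - \bar{y}_i)^2$$
Both enter the variance estimator
  $$\hat{V}(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n} + \frac{N}{n} \sum_{i \in S} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{s_i^2}{m_i}$$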

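Dorm example – 10
Stage 2: select 2 students in each sampled suite

Student | Suite 6 | Suite 21 | Suite 28 | Suite 54 | Suite 89
1 | 5 | 3 | 6 | 5 | 1
2 | 5 | 2 | 4 | 4 | 4
3 | 4 | 4 | 4 | 6 | 3
4 | 6 | 5 | 5 | 6 | 2
Total | ? | ? | ? | ? | ?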

Dorm example – 11
Stage 1
  Cluster = suite
  N = 100
  n = 5
  SRS
Stage 2
  Element = student
  Mi = 4
  mi = 2
  SRS

Dorm example – 12

Sampled student | Suite 6 | Suite 21 | Suite 28 | Suite 54 | Suite 89
1 | 5 | 3 | 4 | 5 | 4
2 | 6 | 2 | 5 | 4 | 2

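Dorm example – 13
Estimated cluster totals $\hat{t}_i = M_i \bar{y}_i$: 22, 10, 18, 18, 12
$$\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} \hat{t}_i = \frac{100}{5}(22 + 10 + 18 + 18 + 12) = 1600 \text{ dining hall dinners}$$
$$s_t^2 = \frac{1}{n - 1} \sum_{i \in S} \left(\hat{t}_i - \frac{\hat{t}_{unb}}{N}\right)^2 = \frac{1}{4}\left[(22 - 16)^2 + (10 - 16)^2 + (18 - 16)^2 + (18 - 16)^2 + (12 - 16)^2\right] = 24$$

Dorm example – 14
Within-suite sample variances $s_i^2$: 0.5, 0.5, 0.5, 0.5, 2.0
$$\hat{V}(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n} + \frac{N}{n} \sum_{i \in S} \left(1 - \frac{m_i}{M_i}\right) M_i^2 \frac{s_i^2}{m_i} = 100^2 (0.95) \frac{24}{5} + \frac{100}{5} \sum_{i \in S} \left(1 - \frac{2}{4}\right) 4^2 \frac{s_i^2}{2} = 45{,}600 + 320 = 45{,}920$$
$$SE(\hat{t}_{unb}) = \sqrt{45{,}920} \approx 214.3$$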
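A short Python sketch of the CSE2 unbiased estimator may help tie the two variance components to the data. It assumes SRS at both stages, $\hat{t}_i = M_i \bar{y}_i$, and the variance estimate $N^2 (1 - n/N) s_t^2 / n + (N/n) \sum (1 - m_i/M_i) M_i^2 s_i^2 / m_i$. The function name `cse2_unbiased_total` and its argument layout are assumptions made for this illustration.

```python
from math import sqrt
from statistics import mean, variance

def cse2_unbiased_total(samples, cluster_sizes, N):
    """Unbiased total for a 2-stage cluster sample (SRS at both stages).

    samples: one list of observed y values per sampled PSU (length m_i each)
    cluster_sizes: M_i for each sampled PSU, in the same order
    N: number of PSUs in the population
    Returns (t_hat, among-PSU variance component, within-PSU variance component).
    """
    n = len(samples)
    t_hats = [M * mean(ys) for ys, M in zip(samples, cluster_sizes)]  # estimated cluster totals
    t_hat = N / n * sum(t_hats)
    s2_t = variance(t_hats)                        # divisor n - 1
    among = N ** 2 * (1 - n / N) * s2_t / n
    within = N / n * sum(
        (1 - len(ys) / M) * M ** 2 * variance(ys) / len(ys)
        for ys, M in zip(samples, cluster_sizes)
    )
    return t_hat, among, within

# Dorm example: 2 students sampled in each of 5 suites of size 4, N = 100 suites
dorm_samples = [[5, 6], [3, 2], [4, 5], [5, 4], [4, 2]]
t_hat, among, within = cse2_unbiased_total(dorm_samples, [4] * 5, N=100)
print(round(t_hat, 1), round(among, 1), round(within, 1), round(sqrt(among + within), 1))
```

In this example the among-PSU component dominates the total, which is typical when only a few PSUs are sampled.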

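CSE2 unbiased estimation for population mean
$$\hat{\bar{y}}_{unb} = \frac{\hat{t}_{unb}}{K}, \qquad \hat{V}(\hat{\bar{y}}_{unb}) = \frac{\hat{V}(\hat{t}_{unb})}{K^2}$$

Dorm example – 15
With $\hat{t}_{unb} = 1600$, $\hat{V}(\hat{t}_{unb}) = 45{,}920$ and K = 400:
$$\hat{\bar{y}}_{unb} = \frac{1600}{400} = 4.0 \text{ dining hall dinners per student per week}$$
$$\hat{V}(\hat{\bar{y}}_{unb}) = \frac{45{,}920}{400^2} = 0.287, \qquad SE(\hat{\bar{y}}_{unb}) \approx 0.54$$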

CSE2 inclusion probability for an element
Two events: A and B
  Pr { A and B both occur } = Pr { A occurs } x Pr { B occurs | A occurs }
  “|” denotes “given” (a condition)
In our setting
  A = sample cluster i
  B = sample element j
Inclusion probability symbols
  $\pi_{ij}$ = Pr {including element j and cluster i in sample}
  $\pi_i$ = Pr {including cluster i in sample}
  $\pi_{j|i}$ = Pr {including element j | cluster i has been included in sample}

CSE2 inclusion probability for an element – 2
Need two pieces
  $\pi_i$ = Pr {including cluster i in sample} = n / N
  $\pi_{j|i}$ = Pr {including element j | cluster i has been included in sample} = mi /Mi
Inclusion probability for element j in cluster i
  $$\pi_{ij} = \pi_i \pi_{j|i} = \frac{n}{N} \cdot \frac{m_i}{M_i}$$

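CSE2 weight for an element
Sampling weight for element j in cluster i
  $$w_{ij} = \frac{1}{\pi_{ij}} = \frac{N}{n} \cdot \frac{M_i}{m_i}$$
Estimator for population total
  $$\hat{t}_{unb} = \sum_{i \in S} \sum_{j \in S_i} w_{ij} y_{ij} = \sum_{i \in S} \sum_{j \in S_i} \frac{N}{n} \cdot \frac{M_i}{m_i} y_{ij}$$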

What does equal probability mean in Ch 5?
Clusters (PSUs) sampled using SRS
  Equal inclusion probability for stage 1 PSUs (clusters)
  $\pi_i$ is the same for all i

What does equal probability mean in Ch 5? – 2
Elements (SSUs) in a given PSU are sampled using SRS
  All elements (j ) in a sample PSU (i ) are selected with equal probability
  This is a conditional probability (given PSU i )
  For a given PSU i , $\pi_{j|i}$ is the same for all elements j

What does equal probability mean in Ch 5? – 3
Note that
  Equal probability at stage 1 ($\pi_i$)
  plus
  Equal probability at stage 2 given PSU i ($\pi_{j|i}$)
  does NOT imply equal inclusion probability for an element
In fact, the element-level (unconditional) inclusion probability is not necessarily constant
  Depends on cluster size Mi and sample size mi for the cluster to which the element belongs
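A small numeric illustration of this point, assuming SRS at both stages so that the element inclusion probability is (n/N)(mi/Mi). The cluster sizes 4, 8 and 10 are hypothetical values picked for the sketch, as is the function name.

```python
from fractions import Fraction

def element_inclusion_prob(n, N, m_i, M_i):
    """Unconditional inclusion probability of an element in PSU i under CSE2."""
    return Fraction(n, N) * Fraction(m_i, M_i)

# Hypothetical design: n = 5 of N = 100 PSUs, m_i = 2 SSUs taken from every sampled PSU
probs = {M_i: element_inclusion_prob(5, 100, 2, M_i) for M_i in (4, 8, 10)}
for M_i, p in probs.items():
    print(f"M_i = {M_i}: pi_ij = {p}, weight = {1 / p}")
```

Elements in larger clusters are less likely to be selected and therefore carry larger weights, even though each stage on its own is an equal-probability sample.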

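CSE2 ratio estimation for population mean
$$\hat{\bar{y}}_r = \frac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i} = \frac{\sum_{i \in S} M_i \bar{y}_i}{\sum_{i \in S} M_i}$$

CSE2 ratio estimation for population mean – 2
$$\hat{V}(\hat{\bar{y}}_r) = \frac{1}{\bar{M}_U^2} \left[\left(1 - \frac{n}{N}\right) \frac{s_r^2}{n} + \frac{1}{nN} \sum_{i \in S} M_i^2 \left(1 - \frac{m_i}{M_i}\right) \frac{s_i^2}{m_i}\right]$$
$$s_r^2 = \frac{1}{n - 1} \sum_{i \in S} (M_i \bar{y}_i - M_i \hat{\bar{y}}_r)^2$$
$\bar{M}_U$ can be estimated by the sample mean of the Mi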

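Dorm example – 16

Sampled student | Suite 6 | Suite 21 | Suite 28 | Suite 54 | Suite 89
1 | 5 | 3 | 4 | 5 | 4
2 | 6 | 2 | 5 | 4 | 2
$\bar{y}_i$ | 5.5 | 2.5 | 4.5 | 4.5 | 3.0
$\hat{t}_i$ | 22 | 10 | 18 | 18 | 12
$s_i^2$ | 0.5 | 0.5 | 0.5 | 0.5 | 2.0

$$\hat{\bar{y}}_r = \frac{\sum_{i \in S} \hat{t}_i}{\sum_{i \in S} M_i} = \frac{22 + 10 + 18 + 18 + 12}{5(4)} = \frac{80}{20} = 4.0$$
$$s_r^2 = \frac{1}{n - 1} \sum_{i \in S} (M_i \bar{y}_i - M_i \hat{\bar{y}}_r)^2 = \frac{1}{4}\left[(22 - 16)^2 + (10 - 16)^2 + (18 - 16)^2 + (18 - 16)^2 + (12 - 16)^2\right] = 24$$

Dorm example – 17
$$\hat{V}(\hat{\bar{y}}_r) = \frac{1}{4^2}\left[\left(1 - \frac{5}{100}\right) \frac{24}{5} + \frac{1}{5(100)} \sum_{i \in S} 4^2 \left(1 - \frac{2}{4}\right) \frac{s_i^2}{2}\right] = \frac{1}{16}(4.56 + 0.032) = 0.287$$
$$SE(\hat{\bar{y}}_r) \approx 0.54$$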

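CSE2 ratio estimation for population total t
$$\hat{t}_r = K \hat{\bar{y}}_r$$
$$\hat{V}(\hat{t}_r) = K^2 \hat{V}(\hat{\bar{y}}_r)$$

Dorm example – 18
With $\hat{\bar{y}}_r = 4.0$, $\hat{V}(\hat{\bar{y}}_r) = 0.287$ and K = 400:
$$\hat{t}_r = K \hat{\bar{y}}_r = 400(4.0) = 1600 \text{ dining hall dinners}$$
$$\hat{V}(\hat{t}_r) = K^2 \hat{V}(\hat{\bar{y}}_r) = 400^2 (0.287) = 45{,}920$$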
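The CSE2 ratio estimator can be sketched in the same style. The code assumes the textbook variance form $\frac{1}{\bar{M}^2}\left[(1 - n/N)\, s_r^2 / n + \frac{1}{nN} \sum M_i^2 (1 - m_i/M_i)\, s_i^2 / m_i\right]$ with $s_r^2 = \sum (M_i \bar{y}_i - M_i \hat{\bar{y}}_r)^2 / (n - 1)$. The function name `cse2_ratio_mean` and the fallback to the sample mean of cluster sizes are choices made for this illustration.

```python
from statistics import mean, variance

def cse2_ratio_mean(samples, cluster_sizes, N, M_bar=None):
    """Ratio estimate of the population mean and its estimated variance (CSE2)."""
    n = len(samples)
    t_hats = [M * mean(ys) for ys, M in zip(samples, cluster_sizes)]
    y_r = sum(t_hats) / sum(cluster_sizes)
    if M_bar is None:
        # Population average cluster size unknown: use the sample mean of M_i
        M_bar = sum(cluster_sizes) / n
    s2_r = sum((t - M * y_r) ** 2 for t, M in zip(t_hats, cluster_sizes)) / (n - 1)
    within = sum(
        M ** 2 * (1 - len(ys) / M) * variance(ys) / len(ys)
        for ys, M in zip(samples, cluster_sizes)
    )
    var_hat = ((1 - n / N) * s2_r / n + within / (n * N)) / M_bar ** 2
    return y_r, var_hat

# Dorm example: 2 students sampled in each of 5 suites of size 4; N = 100, K = 400
K = 400
dorm_samples = [[5, 6], [3, 2], [4, 5], [5, 4], [4, 2]]
y_r, var_y_r = cse2_ratio_mean(dorm_samples, [4] * 5, N=100)
t_r, var_t_r = K * y_r, K ** 2 * var_y_r
print(round(y_r, 2), round(var_y_r, 4), round(t_r, 1), round(var_t_r, 1))
```

As in the 1-stage case, the ratio and unbiased results agree for the dorm data only because every suite has the same size.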

Coots egg example
Target pop = American coot eggs in Minnedosa, Manitoba
PSU / cluster = clutch (nest)
SSU / element = egg w/in clutch
Stage 1
  SRS of n = 184 clutches
  N = ??? clutches, but probably pretty large
Stage 2
  SRS of mi = 2 from Mi eggs in a clutch
  Do not know K = ??? eggs in population, also large
  Can count Mi = # eggs in sampled clutch i
Measurement
  yij = volume of egg j from clutch i

Coots egg example – 2
Scatter plot of volumes vs. i (clutch id)
  Double dot pattern: high correlation among eggs WITHIN a clutch
  Quite a bit of clutch to clutch variation
Implies
  May not have very high precision unless we sample a large number of clutches
  Certainly lower precision than if we had obtained an SRS of eggs
Could use a side-by-side plot for data with larger cluster sizes (PROC UNIVARIATE w/ BY CLUSTER and PLOTS option)

Coots egg example – 3
Plot
  Rank the mean egg volume for clutch i , $\bar{y}_i$
  Plot yij vs. rank for clutch i
  Draw a line between yi1 and yi2 to show how close the 2 egg volumes in a clutch are
Observations
  Same results as Fig 5.3, but more clear
  Small within-cluster variation
  Large between-cluster variation
  Also see 1 clutch with large WITHIN clutch variation: check data (i = 88)
(figure: egg volumes plotted against i sorted by $\bar{y}_i$)

Coots egg example – 4
Plot si vs. $\bar{y}_i$ for clutch i
Since volumes are always positive, might expect si to increase as $\bar{y}_i$ gets larger
  If $\bar{y}_i$ is very small, yi1 and yi2 are likely to be very small and close, giving a small si
  See this to a moderate degree
Clutch 88 has large si , as noted in previous plot

Coots egg example – 5
Estimation goal
  Estimate $\bar{y}_U$, population mean volume per coot egg in Minnedosa, Manitoba
What estimator?
  Unbiased estimation
    Don’t know N = total number of clutches or K = total number of eggs in Minnedosa, Manitoba
  Ratio estimation
    Only requires knowledge of Mi , number of eggs in selected clutch i , in addition to data collected
    May want to plot $\hat{t}_i$ versus Mi

Coots egg example – 6

Clutch | $M_i$ | $\bar{y}_i$ | $s_i^2$ | $\hat{t}_i$ | $(1 - 2/M_i) M_i^2 s_i^2 / 2$ | $(\hat{t}_i - M_i \hat{\bar{y}}_r)^2$
1 | 13 | 3.86 | 0.0094 | 50.23594 | 0.671901 | 318.9232
2 | 13 | 4.19 | 0.0009 | 54.52438 | 0.065615 | 490.4832
3 | 6 | 0.92 | 0.0005 | 5.49750 | 0.005777 | 89.22633
4 | 11 | 3.00 | 0.0008 | 32.98168 | 0.039354 | 31.19576
5 | 10 | 2.50 | 0.0002 | 24.95708 | 0.006298 | 0.002631
6 | 13 | 3.98 | 0.0003 | 51.79537 | 0.023622 | 377.053
7 | 9 | 1.93 | 0.0051 | 17.34362 | 0.159441 | 25.72099
8 | 11 | 2.96 | 0.0051 | 32.57679 | 0.253589 | 26.83682
9 | 12 | 3.46 | 0.0001 | 41.52695 | 0.006396 | 135.4898
10 | 11 | 2.96 | 0.0224 | 32.57679 | 1.108664 | 26.83682
… | … | … | … | … | … | …
180 | 9 | 1.95 | 0.0001 | 17.51918 | 0.002391 | 23.97106
181 | 12 | 3.45 | 0.0017 | 41.43934 | 0.102339 | 133.4579
182 | 13 | 4.22 | 0.00003 | 54.85854 | 0.002625 | 505.3962
183 | 13 | 4.41 | 0.0088 | 57.39262 | 0.630563 | 625.7549
184 | 12 | 3.48 | 0.000006 | 41.81168 | 0.000400 | 142.1994
sum | 1757 | | | 4375.947 | 42.17445 | 11,439.58

var 149.565814
$\hat{\bar{y}}_r$ = 2.490579

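Coots egg example – 7
$$\hat{\bar{y}}_r = \frac{4375.947}{1757} = 2.49$$
$$s_r^2 = \frac{11{,}439.58}{183} = 62.511$$
$$\bar{M}_S = 1757 / 184 = 9.549$$
$$\hat{V}(\hat{\bar{y}}_r) = \frac{1}{9.549^2}\left[\left(1 - \frac{184}{N}\right) \frac{62.511}{184} + \frac{1}{N} \cdot \frac{42.17}{184}\right]$$
$$SE(\hat{\bar{y}}_r) \approx \frac{1}{9.549} \sqrt{\frac{62.511}{184}} = 0.061$$
Don’t know N , but assumed large
2nd term is very small, so approximate SE ignores 2nd term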
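The coots calculation can be reproduced from the column sums alone. The sketch below takes the three sums reported in the table (sum of Mi, sum of the estimated clutch totals, and sum of squared ratio residuals) and applies the large-N approximation, dropping both the finite population correction and the within-clutch term. Variable names are illustrative.

```python
from math import sqrt

# Summary figures reported in the coots table (n = 184 sampled clutches)
n = 184
sum_M = 1757              # total number of eggs counted in the sampled clutches
sum_t_hat = 4375.947      # sum of estimated clutch totals
sum_sq_resid = 11439.58   # sum of squared ratio residuals

y_r = sum_t_hat / sum_M           # ratio estimate of mean volume per egg
s2_r = sum_sq_resid / (n - 1)     # variability of clutch totals around the ratio line
M_bar = sum_M / n                 # sample mean clutch size, standing in for the population mean

# N is unknown but assumed large: treat (1 - n/N) as 1 and drop the 1/N term
se_approx = sqrt(s2_r / n) / M_bar

print(round(y_r, 2), round(s2_r, 3), round(M_bar, 3), round(se_approx, 3))
```

If N were known, the dropped terms could be restored; with N large relative to 184 their effect on the standard error is negligible.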

Coots egg example – 8
What is first-stage PSU inclusion probability?
What is conditional SSU inclusion probability at second stage?
What is unconditional SSU inclusion probability?

CSE2: Unbiased vs. ratio estimation
Unbiased estimator can have poor precision if
  Cluster sizes (Mi ) are unequal
  ti (cluster total) is roughly proportional to Mi (cluster size)
Biased (ratio) estimator can be precise if ti is roughly proportional to Mi
  This happens frequently in populations where cluster sizes (Mi ) vary

CSE2: Self-weighting design
Stage 1: Select n PSUs from N PSUs in pop using SRS
  Inclusion probability for PSU i : $\pi_i = n / N$
Stage 2: Choose mi proportional to Mi so that mi /Mi is constant, use SRS to select sample
  Inclusion probability for SSU j given PSU i : $\pi_{j|i} = m_i / M_i = c$
Unconditional inclusion probability for SSU j in cluster i is constant for all elements
  $\pi_{ij} = \pi_i \pi_{j|i} = \frac{n}{N} c$
  Inclusion probability may vary in practice because it may not be possible for mi /Mi to be equal to c for all clusters
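The allocation rule and its practical limitation can be illustrated with a few lines of Python. The cluster sizes and the sampling fraction c = 1/4 are hypothetical, and rounding c x Mi to the nearest whole number (with a minimum of 1) is just one possible convention, not one prescribed by the slides.

```python
from fractions import Fraction

def subsample_sizes(cluster_sizes, c):
    """Return m_i = c * M_i rounded to a whole number (at least 1) for each cluster."""
    return [max(1, round(c * M)) for M in cluster_sizes]

def element_inclusion_probs(n, N, cluster_sizes, sample_sizes):
    """pi_ij = (n/N) * (m_i/M_i) for the elements of each cluster."""
    return [Fraction(n, N) * Fraction(m, M)
            for m, M in zip(sample_sizes, cluster_sizes)]

c = Fraction(1, 4)  # hypothetical within-cluster sampling fraction

# Cluster sizes divisible by 4: m_i / M_i equals c exactly, so the design is self-weighting
sizes_exact = [4, 8, 12, 20]
m_exact = subsample_sizes(sizes_exact, c)
probs_exact = element_inclusion_probs(5, 100, sizes_exact, m_exact)

# A cluster of size 10 cannot be sampled at exactly 1/4, so its pi_ij differs
sizes_awkward = [4, 10]
m_awkward = subsample_sizes(sizes_awkward, c)
probs_awkward = element_inclusion_probs(5, 100, sizes_awkward, m_awkward)

print(m_exact, probs_exact)
print(m_awkward, probs_awkward)
```

The second case shows why inclusion probabilities often end up only approximately constant in a design intended to be self-weighting.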

Self-weighting designs in general
Why are self-weighting samples appealing?
Are dorm student or coot egg samples self-weighting 2-stage cluster samples?
What other (non-cluster) self-weighting designs have we discussed?

Self-weighting designs in general – 2
What is the caveat for variance estimation in self-weighting samples?
  No break on variance of estimator: must use proper formula for design
Why are self-weighting samples appealing?
  Simple mean estimator
  Homogeneous weights tend to make estimates more precise

Return to systematic sampling (SYS)
Have a frame, or list of N elements
Determine sampling interval, k
  k is the next integer after N/n
Select first element in the list
  Choose a random number, R , between 1 & k
  R-th element is the first element to be included in the sample
Select every k-th element after the R-th element
  Sample includes element R, element R + k, element R + 2k, … , element R + (n-1)k

SYS example
Telephone survey of members in an organization about organization’s website use
  N = 500 members
  Have resources to do n = 75 calls
  N / n = 500/75 = 6.67
  k = 7
Random number table entry: 52994
  Rule: if pick 1, 2, …, 7, assign as R; otherwise discard #
  Select R = 5
Take element 5, then element 5+7 = 12, then element 12+7 = 19, 26, 33, 40, 47, …
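The selection steps can be sketched in Python as below. The function name `systematic_sample` and the option to pass R explicitly (to reproduce the slide's random start of 5) are conveniences added for this illustration; `math.ceil` is used for "the next integer after N/n", which leaves k equal to N/n when that ratio is already a whole number.

```python
import math
import random

def systematic_sample(N, n, R=None, rng=random):
    """Select a systematic sample from a frame numbered 1..N.

    k is N/n rounded up; R is a random start between 1 and k unless supplied.
    Returns (k, R, list of selected element numbers).
    """
    k = math.ceil(N / n)
    if R is None:
        R = rng.randint(1, k)   # randint includes both endpoints
    return k, R, list(range(R, N + 1, k))

# Organization example: N = 500 members, resources for n = 75 calls, random start R = 5
k, R, sample = systematic_sample(500, 75, R=5)
print(k, R, sample[:7], len(sample))
```

Because k is rounded up, stepping through the list by 7 runs past member 500 before 75 elements are reached, so the realized sample is somewhat smaller than the 75 calls budgeted.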

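SYS – 2
Arrange population in rows of length k = 7

R = 1 | R = 2 | R = 3 | R = 4 | R = 5 | R = 6 | R = 7 | Row i
1 | 2 | 3 | 4 | 5 | 6 | 7 | 1
8 | 9 | 10 | 11 | 12 | 13 | 14 | 2
15 | 16 | 17 | 18 | 19 | 20 | 21 | 3
22 | 23 | 24 | 25 | 26 | 27 | 28 | 4
… | | | | | | | …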

Relationship between SYS and cluster sampling
Design relationships
  Element = ?
  Cluster = ?
  Sampling unit(s) = ?
  Cluster sampling design = ?
Relationship between frame ordering and expected precision of an estimate from a cluster sample?
  Periodic, where cycle of pattern is coincident with sampling interval k
  Ordered by X , which is correlated with response variable Y
  Random

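SYS – 3
Suppose X [age of member] is correlated with Y [use of org website]
Sort list by X before selecting sample

R = 1 | R = 2 | R = 3 | R = 4 | R = 5 | R = 6 | R = 7 | X | Row i
1 | 2 | 3 | 4 | 5 | 6 | 7 | young | 1
8 | 9 | 10 | 11 | 12 | 13 | 14 | | 2
15 | 16 | 17 | 18 | 19 | 20 | 21 | | 3
22 | 23 | 24 | 25 | 26 | 27 | 28 | | 4
… | | | | | | | mid | …
… | | | | | | | old | 72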
