56
Estimating the Size of Hidden Populations using Respondent-Driven Sampling Data Mark S. Handcock Krista J. Gile Department of Statistics Department of Mathematics University of California University of Massachusetts - Los Angeles - Amherst Corinne M. Mar Center for Studies in Demography and Ecology University of Washington Supported by NIH grants 1R21HD063000 and 5R21HD075714-02, NSF awards MMS-0851555, SES-1230081 and MMS-1357619 and the DoD ONR MURI award N00014-08-1-1015. Working Papers available at http://www.stat.ucla.edu/handcock http://arXiv.org UNAIDS Reference Group Consultation, UMass, June 9-10, 2014

Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Estimating the Size of Hidden Populationsusing Respondent-Driven Sampling Data

Mark S. Handcock Krista J. GileDepartment of Statistics Department of MathematicsUniversity of California University of Massachusetts

- Los Angeles - AmherstCorinne M. Mar

Center for Studies in Demography and EcologyUniversity of WashingtonUNIVERSITY OF CALIFORNIA

UNOFFICIAL SEAL

Attachment B - “Unofficial” SealFor Use on Letterhead

Supported by NIH grants 1R21HD063000 and 5R21HD075714-02, NSFawards MMS-0851555, SES-1230081 and MMS-1357619 and

the DoD ONR MURI award N00014-08-1-1015.

Working Papers available athttp://www.stat.ucla.edu/∼handcock

http://arXiv.org

UNAIDS Reference Group Consultation, UMass, June 9-10, 2014

Page 2: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Hard-to-Reach Population Methods Research Group

I Isabelle Beaudry, UMassI Ian E. Fellows, Fellows StatisticsI Krista J. Gile, UMassI Mark S. Handcock, UCLAI Lisa G. Johnston, Tulane University, UCSFI Corinne M. Mar, University of WashingtonI http://hpmrg.org

Page 3: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Inferential approaches

The key is the modeling of the sampling processI Salganik and Heckathorn (2004): Markov chain model over

classesI Volz and Heckathorn (2008): Markov chain model over peopleI Gile (2008, 2011): Adjusts for with-replacement effects -

Successive Sampling (SS)I Gile and Handcock (2008, 2011): Network model-assisted

estimator, more realistic representation of RDS

Page 4: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Successive Sampling (SS)

Consider the following successive sampling (SS) or probabilityproportional to size without replacement (PPSWOR) samplingprocedure:

I Begin with a population of N units, denoted by indices 1 . . .Nwith varying sizes represented by d1,d2, . . .dN .

I Let G1, . . . ,GN be the indices of the successively sampledpeople.

I Sample the first unit from the full population {1 . . .N} withprobability proportional to size di . Assign the index of this unit tothe random variable Gi .

I Select each subsequent unit with probability proportional to sizefrom among the remaining units, such that

P(Gi = k |G1 . . .Gi−1) =

{dk∑

j /∈{G1...Gi−1}dj

k /∈ {G1 . . .Gi−1}0 else

.

Page 5: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

SS for RDS

Gile (2011) argues that RDS can be approximated by SuccessiveSampling under a configuration model for the network:

I Node i has given degree, di , consider di edge-ends.I Pairs of edge-ends matched up at randomI This is a configuration model

I Suppose G1,G2, . . .Gk by Successive Sampling according to d .I Then if the network is unknown, but known to be a configuration

model, tracing a link from Gk will select nodes according toSuccessive Sampling.

Page 6: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Is there information in RDS dataabout population size?

Idea:I Under Successive Sampling, “larger” units typically sampled

earlierI Early sample: lots of “big” units, few “small”I Later sample: fewer “big”, more “small”I No change implies the population is not much depletedI Big change implies population very depleted

This can be quantified to estimate N!Note: the information about N is in the ordered sample pattern!

Page 7: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Modeling the sampling process for non-ignorablesampling

I RDS is not ignorable: P(G|Dobs,Dunobs) 6= P(G|Dobs)

I Information about N is in the sequence of observations.I Make inference from joint model for sample sequence and unit

sizes.

Page 8: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Inferential Approach

Observed data:

Dobs = Unit sizes (degrees) of observed units in order of observation

Goal:P(N|Dobs)

(posterior distribution of N given the data)

Parameters:

N = Population Sizeη = Parameter of distribution of unit sizes.

Page 9: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Inferential Approach

P(N|Dobs) ∝∫

P(Dobs|N, η)P(η,N)dη

(independent priors)

=

∫P(Dobs|N, η)P(η)P(N)dη

=

∫(likelihood) (prior for η) (prior for N)dη

=

∫P(Dobs|G,U,D, η)P(G|U, η)P(U|N, η)P(η)P(N)dη

=

∫P(samp given degrees) P(degrees) (prior η) (prior N)dη

Page 10: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Models for Degrees

Parametric model for the degrees:

d iidi ∼ f (·|η)

with support d = 0,1, . . . , and parameter η.

Page 11: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Models for DegreesExtensive papers by Handcock and Jones.To specify f (·|η). We can consider:

1. Poisson2. Negative binomial. This allows Gamma over-dispersion over

Poisson.3. Yule, Waring. This allows power-law over-dispersion over

Poisson.4. Poisson-log-normal. This allows log-normal over-dispersion over

Poisson. It is more than the Negative Binomial but less than thepower-law models.

5. Conway-Maxwell-Poisson distribution. This allows bothunder-dispersion and over dispersion with a single additionalparameter over a Poisson.

6. Non-parametric lower tails: To allow for poor fit in the lowerdegrees.

These are all coded up in the CRAN degreenet package and/or thesize package.

Page 12: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Prior for the degree distribution model

Each degree distribution model parametrized with mean andstandard deviation.

η = g(µ, σ)

µ|σ ∼ N(µ0, σ0/dfmean) σ ∼ Invχ(σ0;dfsigma)

Use diffuse default prior on degree model parameters(equivalent sample size dfµ = 1 and dfσ = 5).

Page 13: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Prior for Population Size N

Many possibilitiesI The data truncates the prior below the sample size.I Uniform prior is improperI Natural parametric models (e.g., Negative Binomial,

Poisson-log-normal, Conway-Maxwell-Poisson).I Natural parametric models too thin in the tails

I Instead: specify prior knowledge about the sample proportion(i.e. n/N).

Page 14: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Prior for Population Size N

I Simple prior: uniform on n/N.I Gives closed form prior on N with infinite mean (median = 2n)I Generalize to n/N ∼ Beta(α, β)

The density on N (considered continuous) is:

π(N) = βn(N − n)β−1/Nα+β for N > n.

I The distribution has tail behavior O(1/Nα+1).I Elicit median or mode from field researchers and translate to β

and/or α.

Page 15: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

500 1000 1500 2000 2500 3000

0.00

000.

0005

0.00

100.

0015

Population size (N)

Den

sity

mean=1000median=1000mode=1000

Figure: Three example prior distributions for the population size (N). Theycorrespond to α = 1 and β = 1.55, 1.16 and 3.

Page 16: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Likelihood: Notation

Let:Dobs = (D1, . . . ,Dn) be the random ordered observed degrees(ordered for notation)Dunobs = (Dn+1, . . . ,DN) be the unordered random unobserveddegreesLet dobs = (d1, . . . ,dn) and dunobs = (dn+1, . . . ,dN) be their realizedvalues.Let G = (G1, . . . ,Gn) be the random indices of the ordered sampleand gobs be the observed sequence.

Page 17: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Likelihood

P(Dobs|N, η)

=∑

d

∑g

p(Dobs = dobs|G = g,D = d , η)p(G = g|D = d , η)p(D = d |η)

=N!

(N − n)!

∑d∈DU(dobs)

p(G = (1, . . . ,n)|D = d)N∏

j=1

f (dj |η)

where DU(dobs) is the set of possible dunobs given dobs.

P(G|D = d ,N, η) =n∏

k=1

dk

λkwhere

λk =N∑

i=k

di =n∑

i=k

di +N∑

i=n+1

di k = 1, . . . ,n

depends on both dobs and dunobs.

L[N, η|Dobs,G] ∝ N!

(N − n)!

∑dunobs∈DU(dobs)

n∏k=1

dk

λj·

N∏j=1

f (dj |η)

Page 18: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Inference

I Likelihood can be maximizedI Can combine with priors to compute posteriorI Note computational complexities based on sum over N − n

embedded sums over infinite spaces.

Page 19: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Example: N=1000, homophily=2, diff. activity=3

500 1000 1500 2000 2500

0.000

00.0

005

0.001

00.0

015

posterior for population size

population size

Dens

ity

500 1000 1500 2000 2500

0.000

00.0

005

0.001

00.0

015

posterior for population size

population size

Dens

ity

500 1000 1500 2000 2500

0.000

00.0

005

0.001

00.0

015

posterior for population size

population size

Dens

ity

Page 20: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Application: Estimating the numbers of those most atrisk for HIV in Cities in El Salvador

I Surveillance surveys in El SalvadorI Focus on high-risk groups: female sex workers (FSW)I RDS study of size n = 184 in 2010.

Page 21: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Figure: Graphical representation of the recruitment tree for the sampling offemale sex workers in Sonsonate, El Salvador in 2010. The nodes are therespondents and the wave number increases as you go down the page. Thenode gray scale is proportional to the network size reported by the worker,with white being degree one and black the maximum degree.

Page 22: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

population size

post

erio

r de

nsity

(x

10−4

)

0

5

10

15

20

184 1000 2000 3000 4000population size

post

erio

r de

nsity

(x

10−4

)

0

5

15

20

184 1000 2000 3000 4000population size

post

erio

r de

nsity

(x

10−4

)

0

5

10

15

20

184 1000 2000 3000 4000

Figure: Posterior distribution for the number of female sex workers inSonsonate based on three prior distributions for the population size: flat,matching the midpoint UNAIDS estimate, and interval-matching the UNAIDSestimate. The prior is dashed. The red mark is at the posterior median. Thegreen mark is at the posterior mean. The blue lines are at the lower andupper bounds of the 95% highest-probability-density interval. The purplelines demark the lower and upper UNAIDS guidelines.

Page 23: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Simulation StudySimulate Population

I 1000, 835, 715, 625, 555, or 525 nodesI 20% “Infected”

Simulate Social Network (from ERGM, using statnet)I Mean degree 7I Homophily on Infection: α = E(# infected to infected tie)

ER=0(# infected to infected tie) = 5 (orother)

I Differential Activity: ω =mean degree infected

mean degree uninfected = 1 (or other)

Simulate Respondent-Driven SampleI 500 total samplesI 10 seeds, chosen proportional to degreeI 2 coupons eachI Coupons at random to relationsI Sample without replacement

Blue parameters varied in study.

Page 24: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Evaluating Performance:Frequentist properties of Bayesian method

I Point estimates: are they about right on average?I Using the Bayesian framework, use probability intervals for the

population size (Highest Posterior Density Credible Intervals -CI’s)

I Compare Frequentist properties: CI width and coverage rates

Page 25: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0.5

1.0

1.5

2.0

2.5

3.0

3.5

Prior Mean

Siz

e E

stim

ate/

Trut

h

527.5 555 1110 625 750 1500 750 1000 2000 1000 1500 3000 2750 5000 10000

●549 ●568●618

●728●814

●1053

●995

●1244

●1897

●1423

●1918

●2978

●3712

●5587

●7712

Coverage

94 96 99

96 96 99

96 96100

9499

100

83

93

100

HPDU ratio

1.11.1

1.3

1.3

1.6

2.3

1.6

2.4

4.8

1.9

3.2

5.7

1.5

2.3

3.5

Figure: Spread of central 95% of simulated population size estimates(posterior means) for 5 population sizes for low, accurate, and high priors.Dots represent means. Estimates are represented as multiples of the truepopulation size (red line at 1 indicates true population size). Numbers belowthe bars are coverage rates of 95% HPD intervals.

Page 26: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Miss-specification of the Network Structure

0.5

1.0

1.5

2.0

2.5

3.0

3.5S

ize

Est

imat

e/Tr

uth

750 1000 2000 750 1000 2000 750 1000 2000 750 1000 2000 750 1000 2000 750 1000 2000

α=1.0 α=1.8 α=1.0 α=1.8 α=1.0 α=1.8

ω=0.5 ω=1.0 ω=2.0

●977

●1167

●1571

●974

●1167

●1783

●995

●1244

●1897

●1002

●1258

●1901

●1004

●1144

●1324

●984

●1161

●1541

Coverage 94 97 10094 96

9696 96

100 98 98100

90 95 97 82 86 92

HPDU ratio

1.6

2.1

3.1

1.5

2.0

2.7

1.6

2.4

4.8

1.6

2.4

4.4

1.5

1.9

2.2

1.4

1.6

1.9

Figure: Spread of central 95% of simulated population size estimates(posterior means) for population size 1000 for low, accurate, and high priors,with varying levels of homophily (α) and differential activity (ω).

Page 27: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Discussion

I It is important to estimate the networked population sizeI There is information on the population size implicit in the

decreasing degrees of the sample nodes over timeI Using successive sampling model we can model the decreaseI We can incorporate prior information about the population size

using the Bayesian frameworkI We can incorporate other features of the populationI We can estimate population means (e.g., prevalence and counts)I In the Bayesian framework we can estimate uncertainty of the

estimates in a natural way

Page 28: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Cautions

I The difference between the model with disease and withouthighlights the importance of the specification of the model for thedegree distribution.

I The estimates depend on the prior distribution for populationsize.

I The estimates are biased because the successive samplingmodel is not perfect - and will be be increasingly misspecified aswe get further from the configuration network.

I There is another important (and influential) tuning parameter: K -the truncation value for degrees.

I This approach is promising. It is designed to be combined withdata from other methods (e.g., scale up) to provide the mostaccurate overall estimate.

I Fundamentally, RDS data typically does not contain muchinformation about the population size. The Bayesian approachenables us to quantify this.

Page 29: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0 100 200 300 400 500

0.0

0.2

0.4

0.6

0.8

1.0

Loess Fit to Sequence of Degrees (ALL) red is minimum, green is maximum, blue is best

Time Sequence

p( <=

obse

rved d

egree

)

Page 30: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Prior for the population size

I translates to a closed form for the prior on N which has infinitemean

I Generalize to n/N ∼ Beta(1, β)

The density function on N (considered as a continuous variable) is:

f (N|n) = βn(N − n)β−1/Nβ+1 for N > n

The distribution has tail behavior ≈ 1/N2. The mode of the prior is at0.5n(β + 1) and the median is given by n/(1− (1/2)1/β) The medianor mode can be elicited from field researchers and translated to β. Auniform distribution on the sample proportion corresponds to amedian of twice the sample size.

Page 31: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0 5000 10000 15000

3e−0

54e−0

55e−0

56e−0

57e−0

58e−0

59e−0

5

Prior for population size Prior mode = 1000

truth=1000population size

prior

densi

ty

0 10000 20000 30000 40000 50000

0.0e+

005.0

e−06

1.0e−

051.5

e−05

2.0e−

05

Prior for population size Prior mode = 10000

truth=1000population size

prior

densi

ty

Page 32: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Example: N=1000, homophily=2, diff. activity=3

500 1000 1500 2000 2500

0.000

00.0

005

0.001

00.0

015

posterior for population size

population size

Dens

ity

500 1000 1500 2000 2500

0.000

00.0

005

0.001

00.0

015

posterior for population size

population size

Dens

ity

500 1000 1500 2000 2500

0.000

00.0

005

0.001

00.0

015

posterior for population size

population size

Dens

ity

Page 33: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

500

1000

1500

2000

2500

posteriorsize() Population Size REVISION 178200 RDS samples, prior.size.mode = truth

circle is mode, triangle is mean

Po

pu

latio

n S

ize

●●●●●

●●●●●●●●● ●●

●●

●●●●●

● ●

●●●●

●●

●●●

●●

●●

●●

●●

●●

●●●

1 2 4 1 2 4

Differential Activity Level

Homophily Ratio 1 Homophily Ratio 5

Page 34: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Miss-specification of the degree distribution shape

I Because of differential activity the Conway-Maxwell-Poissonmodel class does not cover the bimodality of the degreedistribution

Page 35: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

600 800 1000 1200 1400

0.000

0.001

0.002

0.003

Posterior for Population Size

population size

Dens

ity

truth = 1000 mode = 763 median = 801 mean = 835

Page 36: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Miss-specification of the degree distribution shape

I Because of differential activity the Conway-Maxwell-Poissonmodel class does not cover the bimodality of the degreedistribution

I Solution: Model the degree distributions of the diseased from thenon-diseased with separate Conway-Maxwell-Poisson models.

Page 37: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

500 1000 1500 2000

0.000

00.0

005

0.001

00.0

015

0.002

0

Posterior for Population Size

population size

Dens

itytruth = 1000 mode = 988 median = 993 mean = 1049

4 6 8 10 12 14 16

0.00.5

1.01.5

2.0

Posterior for Mean Degree: true overall mean degree is 7

degree

Dens

ity

No Disease With Disease

2.0 2.5 3.0 3.5 4.0 4.5

01

23

4

Posterior for s.d. degree

degree

Dens

ity

No Disease With Disease

500

1000

1500

2000

2500

3000

Population Size REVISION 170200 RDS samples, prior.size.mode = truth

circle is mode, triangle is mean

Po

pu

latio

n S

ize

●●●●●●

●●●●●●●

●●● ●●●●●●●●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●

●●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

1 2 4 1 2 4

Differential Activity Level

Homophily Ratio 1 Homophily Ratio 5

Page 38: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Modeling of other characteristics of the population

I As we model many population characteristics, including thedisease status, we can compute estimates of them directly

I Example: disease prevalence

Page 39: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0.14

0.16

0.18

0.20

0.22

0.24

0.26

Disease Prevalence REVISION 170200 RDS samples, prior.size.mode = truth

Population size: red is 525, green is 715, and blue is 1000

Dis

ea

se

Pre

va

len

ce

●●

●●●●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

1 2 4 1 2 4

Differential Activity Level

Homophily Ratio 1 Homophily Ratio 5

Page 40: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Confidence intervals, design effectsand standard errors

I Using the Bayesian framework, we can naturally compute theprobability intervals for the population size and othercharacteristics

I Examples: CI coverage for the population size and prevalence

Page 41: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0.88

0.90

0.92

0.94

0.96

REVISION 170200 RDS samples, prior.size.mode = truth

Coverage: proportion of samples whose 95% CI covered the true population size True population size: red is 525, green is 715, blue is 1000

Popu

latio

n Si

ze

1 2 4 1 2 4

Differential Activity Level

Homophily Ratio 1 Homophily Ratio 5

0.85

0.90

0.95

1.00

REVISION 170200 RDS samples, prior sampling fraction = truth

Coverage: proportion of samples whose 95% CI covered the true prevalenceTrue prevalence = 0.2

Popu

latio

n Si

ze

● ●

1 2 4 1 2 4

Differential Activity Level

Homophily Ratio 1 Homophily Ratio 5

Page 42: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Application: The study of HIV/AIDS in San Francisco

I Surveillance surveys by San Francisco Department of PublicHealth

I Focus on African-American (AA) men-who-have-sex-with-men(MSM)

I RDS study of size n = 256 in 2009.I Intensive study provides a population size estimate of 4439.I Census data indicated 21518 AA men in San Francisco.

Page 43: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling
Page 44: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0 5000 10000 15000 20000

0e+0

02e−0

54e−0

56e−0

58e−0

5posterior for population size

population size

Densi

ty

SF mean

Page 45: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0 2000 4000 6000 8000

0.000

000.0

0005

0.000

100.0

0015

0.000

200.0

0025

0.000

30

posterior for the number of AA MSM with HIV

HIV+ count

Densi

ty

SF mean

Page 46: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0 5000 10000 15000 20000

1e−05

2e−05

3e−05

4e−05

5e−05

6e−05

posterior for population size

population size

Density

SF mean

Page 47: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0 2000 4000 6000 8000 10000

0.00000

0.00005

0.00010

0.00015

posterior for the number of AA MSM with HIV

HIV+ count

Density

SF mean

Page 48: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Discussion

I It is important to estimate the networked population sizeI There is information on the population size implicit in the

decreasing degrees of the sample nodes over timeI Using successive sampling model we can model the decreaseI We can incorporate prior information about the population size

using the Bayesian frameworkI In the Bayesian framework we can estimate uncertainty of the

estimates in a natural way

Page 49: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Cautions

I The difference between the model with disease and withouthighlights the importance of the specification of the model for thedegree distribution.

I The estimates depend on the prior distribution for populationsize.

I The estimates are biased because the successive samplingmodel is not perfect - and will be be increasingly misspecified aswe get further from the configuration network.

I There is another important (and influential) tuning parameter: K -the truncation value for degrees.

I This approach is promising. It is designed to be combined withdata from other methods (e.g., scale up) to provide the mostaccurate overall estimate.

I Fundamentally, RDS data typically does not contain muchinformation about the population size. The Bayesian approachenables us to quantify this.

Page 50: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Miss-specification of the degree distribution shape

I Because of differential activity the Conway-Maxwell-Poissonmodel class does not cover the bimodality of the degreedistribution

I Solution: Model the degree distributions of the diseased from thenon-diseased with separate Conway-Maxwell-Poisson models.

Page 51: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

4 6 8 10 12 14 16

0.00.5

1.01.5

2.0

Posterior for Mean Degree: true overall mean degree is 7

degree

Dens

ityNo Disease With Disease

2.0 2.5 3.0 3.5 4.0 4.5

01

23

4

Posterior for s.d. degree

degree

Dens

ity

No Disease With Disease

500

1000

1500

2000

2500

3000

Population Size REVISION 170200 RDS samples, prior.size.mode = truth

circle is mode, triangle is mean

Po

pu

latio

n S

ize

●●●●●●

●●●●●●●

●●● ●●●●●●●●●●●●●●●●●●●●●●●

●●●

●●●●●●●●●

●●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

1 2 4 1 2 4

Differential Activity Level

Homophily Ratio 1 Homophily Ratio 5

Page 52: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Modeling of other characteristics of the population

I As we model many population characteristics, including thedisease status, we can compute estimates of them directly

I Example: disease prevalence

Page 53: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0.14

0.16

0.18

0.20

0.22

0.24

0.26

Disease Prevalence REVISION 170200 RDS samples, prior.size.mode = truth

Population size: red is 525, green is 715, and blue is 1000

Dis

ea

se

Pre

va

len

ce

●●

●●●●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

1 2 4 1 2 4

Differential Activity Level

Homophily Ratio 1 Homophily Ratio 5

Page 54: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

0.85

0.90

0.95

1.00

REVISION 170200 RDS samples, prior sampling fraction = truth

Coverage: proportion of samples whose 95% CI covered the true prevalenceTrue prevalence = 0.2

Popu

latio

n Si

ze

● ●

1 2 4 1 2 4

Differential Activity Level

Homophily Ratio 1 Homophily Ratio 5

Page 55: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Comparison to the Gile SS estimator

I The SS estimator in Gile JASA (2011) requires N knownI We use the posterior mode as a plug in estimate of N

Page 56: Estimating the Size of Hidden Populations using … › ~handcock › hpmrg › UNAIDS_Reference_Group...Estimating the Size of Hidden Populations using Respondent-Driven Sampling

Pre

vale

nce

750 1000 2000 750 1000 2000 750 1000 2000 750 1000 2000 750 1000 2000 750 1000 2000

α=1.0 α=1.8 α=1.0 α=1.8 α=1.0 α=1.8

ω=0.5 ω=1.0 ω=2.0

.14

.16

.18

.20

.22

.24

.26

MSE Ratio

0.80

Coverage

72 88

MSE Ratio

1.21

Coverage

90 90

MSE Ratio

0.84

Coverage

92 92

MSE Ratio

0.83

Coverage 76 89

●●

MSE Ratio

1.12

Coverage

94 90

MSE Ratio

1.08

Coverage

99 92

● ●

MSE Ratio

1.06

Coverage

92 94

● ●

MSE Ratio

1.03

Coverage

95 96

● ●

MSE Ratio

0.98

Coverage

100 98

● ●

MSE Ratio

1.04

Coverage

88 94

● ●

MSE Ratio

1.02

Coverage

96 98

● ●

MSE Ratio

0.99

Coverage

100 99

MSE Ratio

0.54

Coverage

40 80

●●

MSE Ratio

2.78

Coverage

96 79

MSE Ratio

0.61

Coverage

64 78

MSE Ratio

0.66

Coverage

52 80

●●

MSE Ratio

2.18

Coverage

97 83

MSE Ratio

0.85

Coverage

89 84

Figure: Spread of central 95% of simulated prevalence estimates forpopulation size 1000, with varying levels of homophily (α) and differentialactivity (ω). Solid lines represent prevalence estimates based on theposterior mean, dashed lines represent comparable estimates using the priormean. Relative efficiency (MSE posterior/MSE prior) is given above each bar,and the coverage of nominal 95% confidence intervals is below each bar.