Sampling Research Questions Bruce D. Spencer Statistics Department and Institute for Policy Research Northwestern University SAMSI Workshop 10/21/10


Sampling Research Questions

Bruce D. Spencer

Statistics Department and Institute for Policy Research

Northwestern University

SAMSI Workshop 10/21/10


Introduction

• At the end of the opening workshop the group in Sampling, Modeling, and Inference raised a number of open questions related to sampling.

• Today I will discuss those questions, most of which are still unsolved.


Goal of Sample-Based Inference

• What is the target of the inference?

– a stochastic model that generated a network or set of networks

– population of networks, e.g., dynamic networks

– multiple networks on a single population of edges

– single network


Various Network Sampling Designs

• Conventional sample design to learn about the network

– probabilities do not depend on observed data

– E.g., Current Population Survey

• Adaptive sample design using the network

– probabilities may depend on observed data

– E.g., RDS (respondent-driven sampling); ego-centric samples; link-tracing designs

• Two-phase sampling to target further investigation of missing data or measurement error

• Subsampling (?) to reduce computational burden at possible loss of efficiency


Conventional Sampling Design to Learn about the Network(s)

Samples of nodes or of edges - used for

• description of network(s)

• prediction of future state of network

• prediction of links/gaps/nodes

• fitting a model to the graph


Limitations from Sampling

• Sampling introduces random error into the estimates (and possibly bias, since E f(X) ≠ f (EX) for nonlinear f )

• Sampling variance needs to be estimated, and possibly bias as well; both may be problematic for small samples

• Some population characteristics may not be “estimable” from a sample

– E.g., maximum path length between any two nodes?

– Number of components in a general graph?

– What does “estimable” mean?
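The bias remark in the first bullet (E f(X) ≠ f(EX) for nonlinear f) can be checked exactly on a toy population. The sketch below is illustrative only: the population values, the choice f(x) = x², and the function names are assumptions, not part of the talk. It enumerates every simple random sample of size 2, so no simulation error is involved.

```python
from fractions import Fraction
from itertools import combinations

def expectation_over_srs(values, n, statistic):
    """Exact expectation of a statistic over all equally likely SRS samples of size n."""
    samples = list(combinations(values, n))
    return sum(statistic(s) for s in samples) / len(samples)

def sample_mean(s):
    """Sample mean as an exact fraction."""
    return Fraction(sum(s), len(s))

# Hypothetical population values (not from the talk).
population = [1, 2, 3, 10]
pop_mean = Fraction(sum(population), len(population))

# Linear statistic: the sample mean is unbiased for the population mean.
e_mean = expectation_over_srs(population, 2, sample_mean)

# Nonlinear statistic: the squared sample mean is biased for the squared population mean.
e_mean_sq = expectation_over_srs(population, 2, lambda s: sample_mean(s) ** 2)

print("E[mean] equals population mean:", e_mean == pop_mean)
print("bias of squared sample mean:", e_mean_sq - pop_mean ** 2)
```

The bias printed in the last line equals the sampling variance of the sample mean, which is why a variance estimate can be used to correct plug-in estimates of squared quantities.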


Limitations from Sampling

• If elements of interest (edges/non-edges, stars, motifs, etc.) have unequal probabilities of being observed, then

– need to know the probabilities and adjust for them

– or, need to have a model that explains the population

– or, sometimes, both.


E.g.: Induced Graph Sampling

• Undirected parent graph $(V, G)$: node set $V$ (written $E$ on later slides) and edge set $G$

• Sample nodes $S \subseteq V$

• Observe $G(S) \subseteq G$: observe edge/non-edge between $u, v$ iff $u, v \in S$

• Conventional sampling with possibly unequal probabilities (including multiple-frame stratified multi-stage): probability of including $u_1, u_2, \dots, u_j$ and excluding $v_1, v_2, \dots, v_k$ knowable for any $j, k$

• Denote inclusion probabilities by $\pi(\cdot)$
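As a minimal sketch of induced graph sampling, assume the simplest conventional design, a simple random sample (SRS) without replacement of n nodes out of N. Under that assumption the probability of including j specified nodes and excluding k others has a closed form. The toy graph and the function names below are hypothetical, chosen only for illustration.

```python
import random
from math import comb

def induced_subgraph(edges, sample):
    """Edges of the parent graph whose two endpoints are both in the sample."""
    s = set(sample)
    return {frozenset((u, v)) for u, v in edges if u in s and v in s}

def srs_probability(N, n, j, k=0):
    """P(j specified nodes are all sampled and k other specified nodes are all
    excluded) under SRS without replacement of size n from N nodes."""
    if j > n or j + k > N:
        return 0.0
    # Choose the remaining n - j sampled nodes from the N - j - k unconstrained ones.
    return comb(N - j - k, n - j) / comb(N, n)

# Hypothetical toy parent graph on nodes 0..5 (node 5 is isolated).
nodes = list(range(6))
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]

rng = random.Random(0)          # fixed seed so the sketch is reproducible
S = rng.sample(nodes, 4)        # SRS of n = 4 nodes
G_S = induced_subgraph(edges, S)

print("sample:", sorted(S))
print("induced edges:", sorted(tuple(sorted(e)) for e in G_S))
print("pi(u) =", srs_probability(6, 4, 1), " pi(u, v) =", srs_probability(6, 4, 2))
```

For more complex designs (stratified, multi-stage, multiple-frame) the same probabilities are knowable in principle but would be computed from the design rather than from this closed form.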


Horvitz-Thompson Estimators of Totals

Unbiased estimator of $N \equiv |E|$ is $\hat N = \sum_{u \in S} 1/\pi(u)$.

Unbiased estimator of $R \equiv |G|$ is $\hat R = \frac{1}{2} \sum_{u, v \in S,\; u \ne v} T_{u,v}/\pi(u,v)$, with $T_{u,v} = 1$ if $u, v$ are adjacent and 0 otherwise.

Unbiased estimators of variances of H-T estimators, $\hat V(\hat N)$, $\hat V(\hat R)$, etc., are available.
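The unbiasedness of the Horvitz-Thompson estimators of the node total N and edge total R can be verified exactly on a small example. The sketch below assumes SRS without replacement (so the inclusion probabilities are constants) and a hypothetical toy graph; it averages the estimates over every possible sample rather than simulating.

```python
from itertools import combinations
from math import comb

def ht_node_and_edge_totals(sample, edges, pi1, pi2):
    """H-T estimates of N (nodes) and R (edges) from an induced subgraph.

    pi1(u) and pi2(u, v) are first- and second-order inclusion probabilities.
    """
    s = set(sample)
    n_hat = sum(1.0 / pi1(u) for u in s)
    # Each undirected edge is listed once, which matches half the sum
    # over ordered pairs u != v.
    r_hat = sum(1.0 / pi2(u, v) for u, v in edges if u in s and v in s)
    return n_hat, r_hat

# Hypothetical toy parent graph: 6 nodes, 5 edges.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
N, n = len(nodes), 4

def pi1(u):
    """SRS first-order inclusion probability (the same for every node)."""
    return comb(N - 1, n - 1) / comb(N, n)

def pi2(u, v):
    """SRS joint inclusion probability (the same for every pair)."""
    return comb(N - 2, n - 2) / comb(N, n)

# Exact expectation: average over all equally likely samples.
all_samples = list(combinations(nodes, n))
estimates = [ht_node_and_edge_totals(s, edges, pi1, pi2) for s in all_samples]
mean_n_hat = sum(e[0] for e in estimates) / len(estimates)
mean_r_hat = sum(e[1] for e in estimates) / len(estimates)

print("true N, R:", N, len(edges))
print("mean of H-T estimates:", mean_n_hat, mean_r_hat)
```

Individual estimates of R vary from sample to sample; only their average over the design reproduces the true total.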


H-T Estimators of Triad Distribution

Define $T_{k,u,v,w} = 1$ if $u, v, w$ are distinct vertices sharing $k$ edges, and $= 0$ otherwise.

$T_k \equiv$ number of triads in $E$ with $k$ edges, $0 \le k \le 3$.

Unbiased estimator of $T_k$ is $\hat T_k = \frac{1}{6} \sum_{u, v, w \in S} T_{k,u,v,w}/\pi(u,v,w)$.

Other totals estimated similarly, e.g., number of stars or other motifs.
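A sketch of the H-T triad census follows, assuming third-order inclusion probabilities are available as a function `pi3(u, v, w)` (a hypothetical name). The code sums over unordered triples of sampled nodes, which is equivalent to taking one sixth of the sum over ordered triples. The toy graph and the SRS design are assumptions used only to check unbiasedness by full enumeration.

```python
from itertools import combinations
from math import comb

def ht_triad_census(sample, edges, pi3):
    """H-T estimates [T0_hat, T1_hat, T2_hat, T3_hat] of the triad counts.

    Only pairs of sampled nodes are looked up, so the function uses nothing
    beyond the induced subgraph G(S).
    """
    adjacent = {frozenset(e) for e in edges}
    estimates = [0.0, 0.0, 0.0, 0.0]
    for u, v, w in combinations(sorted(sample), 3):
        k = sum(frozenset(pair) in adjacent for pair in ((u, v), (u, w), (v, w)))
        estimates[k] += 1.0 / pi3(u, v, w)
    return estimates

# Hypothetical toy parent graph.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
N, n = len(nodes), 4

def pi3_srs(u, v, w):
    """Joint inclusion probability of three nodes under SRS of size n from N."""
    return comb(N - 3, n - 3) / comb(N, n)

# True census: treat the whole population as the sample, with probability 1.
true_census = ht_triad_census(nodes, edges, lambda u, v, w: 1.0)

# Exact expectation of the estimator over all SRS samples.
samples = list(combinations(nodes, n))
per_sample = [ht_triad_census(s, edges, pi3_srs) for s in samples]
mean_census = [sum(est[k] for est in per_sample) / len(samples) for k in range(4)]

print("true triad census:", true_census)
print("mean of H-T estimates:", mean_census)
```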


Degree Distribution

• $d_u \equiv$ degree of node $u$ (its number of edges)

• $M \equiv$ maximum degree in $(E, G)$

• $N_r \equiv$ number of nodes of degree $r$, $0 \le r \le M$

• $(F_0, F_1, \dots, F_M)$ is the degree distribution, with $F_r = N_r/N$

• Degree distribution of the sample can differ from degree distribution of the population.

“Subnets of Scale-Free Networks are Not Scale-Free: Sampling Properties of Networks” Stumpf, Wiuf, May (PNAS, 2005)


Estimation of Degree Distribution

• Induced subgraph from a simple random sample (SRS) of size $n$ from $(E, G)$

• $N_r \equiv$ number of nodes of degree $r$ in parent graph

• $N_r(S) \equiv$ number of nodes of degree $r$ in subgraph

Set $\mathbf{N} = (N_0, \dots, N_M)^T$ and $\mathbf{N}(S) = (N_0(S), \dots, N_M(S))^T$.

O. Frank (1980, JSPI): If $n > M$ then $E\,\hat{\mathbf{N}} = \mathbf{N}$, where $\hat{\mathbf{N}} = \mathbf{B}^{-1} \mathbf{N}(S)$ and $\mathbf{B}$ is a triangular matrix whose entries are probabilities referring to the hypergeometric distribution.
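The degree-count estimator described above can be sketched as follows. The explicit form of the matrix is an assumption derived here, not quoted from Frank (1980): under SRS, a node of parent degree r is sampled with probability n/N, and given that, the number of its neighbours among the other n − 1 sampled nodes is hypergeometric, so B[i][r] = (n/N) · C(r, i) C(N−1−r, n−1−i) / C(N−1, n−1). The matrix is upper triangular and invertible when n > M. Function names and the toy graph are hypothetical.

```python
from itertools import combinations
from math import comb

def frank_matrix(N, n, M):
    """B[i][r] = P(node is sampled and has induced degree i | parent degree r)."""
    B = [[0.0] * (M + 1) for _ in range(M + 1)]
    for r in range(M + 1):
        for i in range(min(r, n - 1) + 1):
            hyper = comb(r, i) * comb(N - 1 - r, n - 1 - i) / comb(N - 1, n - 1)
            B[i][r] = (n / N) * hyper
    return B

def solve_upper_triangular(B, y):
    """Back substitution for B x = y with B upper triangular."""
    m = len(y)
    x = [0.0] * m
    for i in range(m - 1, -1, -1):
        rest = sum(B[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (y[i] - rest) / B[i][i]
    return x

def degree_counts(node_set, edges, M):
    """Counts (N_0, ..., N_M) of nodes by degree in the subgraph induced by node_set."""
    s = set(node_set)
    degree = {u: 0 for u in s}
    for u, v in edges:
        if u in s and v in s:
            degree[u] += 1
            degree[v] += 1
    counts = [0] * (M + 1)
    for d in degree.values():
        counts[d] += 1
    return counts

# Hypothetical toy parent graph with maximum degree M = 3; n = 4 > M.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
N, n, M = 6, 4, 3

B = frank_matrix(N, n, M)
true_counts = degree_counts(nodes, edges, M)

# Exact expectation of B^{-1} N(S) over all SRS samples.
samples = list(combinations(nodes, n))
estimates = [solve_upper_triangular(B, degree_counts(s, edges, M)) for s in samples]
mean_estimate = [sum(e[r] for e in estimates) / len(samples) for r in range(M + 1)]

print("true degree counts:", true_counts)
print("mean of estimates:", mean_estimate)
```

Individual estimates can be negative or non-integer; unbiasedness holds only on average over the design, which is one reason variances can be large for small samples.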


Estimation of Degree Distribution

To get the degree distribution, set $\hat F_r = \hat N_r / \hat N$.

This will have small bias for large enough $n$.

I have extended the proof to accommodate complex sampling with unequal probabilities.

Instead of $\hat{\mathbf{N}} = \mathbf{B}^{-1} \mathbf{N}(S)$ we use $\hat{\mathbf{B}}^{-1} \mathbf{N}(S)$ with $E\,\hat{\mathbf{B}} = \mathbf{B}$.


Estimation of Mean and Variance of Degree Distribution

The mean of the degree distribution is equal to $2R/N$ and the variance can be shown to equal

$$\frac{2\,[\,T_1 + N T_2 + 3(N-1) T_3\,]}{N(N-2)} - \frac{4R^2}{N^2}$$

(Frank 1981, Soc. Meth.). We have unbiased H-T estimates of $N$, $R$, $T_k$. Plug in and (optionally) subtract $\hat V(\hat R/\hat N)$ for the variance estimate.
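The expression of the degree mean and variance through totals can be checked numerically. The sketch below assumes the identity variance = 2[T1 + N·T2 + 3(N−1)·T3] / (N(N−2)) − 4R²/N², with the population variance taken with divisor N; this form is a reconstruction verified on toy graphs, not quoted from Frank (1981). The graph and function names are hypothetical.

```python
from itertools import combinations

def triad_counts(nodes, edges):
    """Return [T0, T1, T2, T3]: numbers of node triples with 0..3 edges."""
    adjacent = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(nodes, 3):
        k = sum(frozenset(pair) in adjacent for pair in combinations(triple, 2))
        counts[k] += 1
    return counts

def degree_moments_from_totals(N, R, T):
    """Mean and variance of the degree distribution from N, R and triad counts T."""
    mean = 2.0 * R / N
    var = 2.0 * (T[1] + N * T[2] + 3 * (N - 1) * T[3]) / (N * (N - 2)) - 4.0 * R ** 2 / N ** 2
    return mean, var

def degree_moments_direct(nodes, edges):
    """Mean and variance (divisor N) computed directly from node degrees."""
    degree = {u: 0 for u in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    N = len(nodes)
    mean = sum(degree.values()) / N
    var = sum((d - mean) ** 2 for d in degree.values()) / N
    return mean, var

# Hypothetical toy graph.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]

T = triad_counts(nodes, edges)
from_totals = degree_moments_from_totals(len(nodes), len(edges), T)
direct = degree_moments_direct(nodes, edges)

print("from totals:", from_totals)
print("direct:     ", direct)
```

In practice the population totals would be replaced by their H-T estimates, which makes the plug-in variance estimate a nonlinear function of unbiased estimates and therefore only approximately unbiased.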


Partial Recap

• Using induced graph subsamples from conventional samples where joint inclusion probabilities are known, we can estimate

– population values of descriptive statistics based on totals

– degree distribution.

• (Only undirected graphs at one point in time discussed.)

• What about

– other descriptive statistics

– model fitting

– large variances when sample size small

– adaptive samples?


Approaches to Model Fitting

1. You trust* your model.

• Under certain conditions** on the sample design and the model, you can ignore the way the sample was selected and treat the sample as having been generated from the model.

• The sampling mechanism needs to be carefully examined to make sure it meets the requirements, which depend on the model being used.

* Reagan and others, “trust but verify”

** Handcock and Gile (2010 AoAS) call the condition “amenability” and relate it to “ignorability” (Rubin 1976).


Approaches to Model Fitting

2. “Model as descriptive statistic”. You do not necessarily believe the model, but you want to fit the model the way you would if you completely observed the population.

• Anathema to many social scientists...

• E.g., in ERGMs, model fitting for population depends on sufficient statistics that are population totals. One can estimate them with H-T estimates (or alternatives) and then fit model. (Pavel Krivitsky poster)

• I have not investigated how to implement for other models.

• If both approaches are tried, “large” differences in fits can indicate model misspecification.


Adaptive Sampling

• Probabilities of observations depend on data from sampled units.

• Provides more information about network than conventional samples (Frank). Note: variances may be too large when sample is conventional but sparse.

• Probabilities of observing triads and larger structures are typically unavailable; even for dyads, probabilities are known for ego-centric designs but not for link-tracing designs (Handcock and Gile 2010).

• In order to use full data, either need to estimate unknown probabilities (hard!!) or rely on model if amenability condition can be verified and model validated.

• E.g., when using conventional unequal probability samples to estimate a population total, the amenability condition typically does not hold.


Model Validation

• Model validation is important, but challenging when sampling probabilities are unknown.

• At the heart of every adaptive sample is a conventional sample.

• Use conventional sample to fit model as descriptive statistic. Compare result to model fitted under assumption of ignorability/amenability for (i) conventional sample and (ii) larger and more informative adaptive sample.


Recap

• What is the population (network, or set of networks) from which sample is selected?

• Sample design (and inference) to learn about the network

– Static

– Over time

– Description of network

– Prediction of future state of network and prediction of links/gaps/nodes


Recap

• Sample design (and inference) using the network to learn about a population

– Respondent Driven Sampling

– Adaptive Sampling

– Others

– Static and over time


Recap

• Subsampling design (and inference) to

– Ease computational burden

– Target further investigation to learn about measurement error

• When can inferences be made based on sample design information to provide approx. unbiasedness whether or not model is valid?


Recap

• How can model inferences be made?

– What models?

    • Exponential random graph models

    • Mixed membership stochastic block models

    • Latent space models

    • Agent based models

– What network characteristics (what summary statistics)?


Recap

• What is effect of measurement error (and missing data, non-response) on inferences about network?

– RDS samples

– Others

• How to design and analyze randomized experiments when subjects are part of a static network? Dynamic?

– Google experiments

– Experiments on adolescents in schools (e.g., drug counseling, obesity “treatment”): effects on peers