1

A statistical approach to surrogate data

Li-Chun Zhang

Statistics Norway

E-mail: [email protected]

2

A setting of surrogate data

• Target data
  – Directly collected, such as in sample surveys
  – For a subset of the population, if available

• Surrogate data
  – To replace target data for statistical purposes, hence “surrogate”
      For reasons such as cost, burden, scope, etc.
  – Secondary data by nature: re-use of data collected for other purposes
  – Typically from multiple sources
  – Often for the entire population or a major part of it
  – Two examples:
      Register-based employment statistics (surrogate) & LFS (target)
      Self-administered census (surrogate) & post-census survey (target)

• Issues of concern:
  – Conditions for valid substitution
  – Associated statistical accuracy

3

Unit-specific approach: Equality vs. equivalence

• Unit-specific approach
  – Scheme
      Link surrogate and target data at the micro level
      Estimate the relevant unit-specific misclassification rates
      Propagate the uncertainty to the statistics of interest
  – Two shortcomings
      Requires micro-level linkage
      Unit-specific consistency may be irrelevant or misleading for the intended uses

• Equality vs. equivalence
  – An example
  – Two binary data sets of the same size
  – Equal mean without the values being equal for all the units
  – Identical empirical CDF => identical statistical inference
  – Inequality may fail to reveal statistical equivalence
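The binary example can be made concrete with a minimal sketch. The six values below are invented for illustration, and the names `target` and `surrogate` are not from the slides. The two data sets disagree for most units, yet their means and empirical distributions coincide.

```python
from collections import Counter

# Hypothetical binary values for the same six units (invented for illustration).
target = [1, 0, 1, 0, 1, 0]      # e.g. directly collected status Y
surrogate = [0, 1, 1, 0, 0, 1]   # e.g. surrogate status Z


def empirical_distribution(values):
    """Return the relative frequency of each distinct value."""
    n = len(values)
    return {v: c / n for v, c in sorted(Counter(values).items())}


# Unit-level agreement: share of units whose two values coincide.
agreement = sum(y == z for y, z in zip(target, surrogate)) / len(target)

# Aggregate comparisons: equal mean and equal empirical distribution.
same_mean = sum(target) / len(target) == sum(surrogate) / len(surrogate)
same_distribution = empirical_distribution(target) == empirical_distribution(surrogate)

print(agreement, same_mean, same_distribution)
```

Only two of the six units agree, so a unit-by-unit comparison would suggest poor quality, while any statistic that depends on the data only through the empirical distribution is unaffected.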

4

Other relevant situations

• Some settings
  – Indirect (proxy) interview
  – Unstable reporting
  – Mode effects
  – Public micro data

• Some observations
  – Unit-specific approach may not be applicable
  – Unit-specific equality may not even be desirable
  – To use surrogate data (Z) in place of target data (Y), together with additional data (X)
      Joint distribution of (Z,Y|X) is not of primary interest for users
      Distribution of (Z,X) instead of distribution of (Y,X) is the issue

5

Validity and equivalence

• Valid surrogate data
  – Denote by f(x,y) and f(x,z) the distribution functions; Z is valid for Y if, for all (x, y):
      f(X=x, Z=y) = f(X=x) f(Z=y | X=x) = f(X=x) f(Y=y | X=x) = f(X=x, Y=y)
  – Example: X = age-sex grouping, Z = register-employment status, Y = LFS-employment status according to the ILO definition
  – Example: Z = proxy interview in LFS, Y = direct interview in LFS
  – Equality of distribution can be assessed without linked / linkable data

• Empirically equivalent surrogate data
  – Denote by p(x,z; s1) and p(x,y; s2) the empirical distribution functions:
      p(X=x; s1) p(Z=y | X=x; s1) = p(X=x; s2) p(Y=y | X=x; s2)
  – Equality on micro level not necessary & s1 may differ from s2
  – Parametric analogy: Statistical equivalence by Sufficiency Principle
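As a rough sketch of how empirical equivalence could be checked without linkage, the code below tabulates p(x,z; s1) and p(x,y; s2) separately and compares them cell by cell. The data, the age labels and the use of total variation distance as a summary are assumptions made for illustration, not part of the slides.

```python
from collections import Counter


def joint_distribution(pairs):
    """Empirical distribution of (x, value) pairs as relative frequencies."""
    n = len(pairs)
    return {cell: count / n for cell, count in Counter(pairs).items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


# Hypothetical data: s1 carries (x, z) and s2 carries (x, y). There are no unit
# identifiers and the two sets differ in size, so no linkage is possible or needed.
s1 = [("young", 1)] * 30 + [("young", 0)] * 20 + [("old", 1)] * 10 + [("old", 0)] * 40
s2 = [("young", 1)] * 6 + [("young", 0)] * 4 + [("old", 1)] * 2 + [("old", 0)] * 8

p_xz = joint_distribution(s1)
p_xy = joint_distribution(s2)
distance = total_variation(p_xz, p_xy)

print(distance)  # zero when the two empirical distributions coincide
```

With real data the two empirical distributions would rarely coincide exactly, so assessing validity of the underlying f would call for a formal comparison that allows for sampling variation; the distance here only summarises the empirical discrepancy.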

6

Similar ideas in disclosure control literature

• Fienberg et al. (1998)
  – Random generation of “pseudo” micro data Z conditional on {x; s}
  – Parametric f(y | x) or empirical p(y | x)
  – Conditional validity in expectation, provided unbiased estimation

• Rubin (1993)
  – Synthetic data & Bayesian multiple-imputation framework
  – Random generation of population data + sampling
  – No particular emphasis on conditioning & validity instead of equivalence

• SARs
  – Sample of Anonymised Records from census data
  – Real data, albeit anonymised
  – Valid surrogate data

• Micro simulation
  – Based on sample instead of census data
  – Random generation of “imaginary” micro data
  – Validity in expectation provided unbiased estimation of distribution
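The idea of generating pseudo micro data conditional on x can be sketched as follows, using the empirical p(y | x) of a small sample. This is only an illustration of the general principle, not the procedure of any of the cited works. The function name `generate_pseudo_data`, the data and the seed are hypothetical, and every x-value in the population is assumed to occur in the sample.

```python
import random
from collections import defaultdict


def generate_pseudo_data(sample, x_values, seed=0):
    """Draw one pseudo value z per x in x_values from the empirical p(y | x) of sample."""
    rng = random.Random(seed)
    y_by_x = defaultdict(list)
    for x, y in sample:
        y_by_x[x].append(y)
    # Resampling the observed y-values within an x-cell is a draw from empirical p(y | x).
    return [(x, rng.choice(y_by_x[x])) for x in x_values]


# Hypothetical sample: p(y=1 | a) = 2/3 and p(y=1 | b) = 1/3.
sample = [("a", 1), ("a", 1), ("a", 0), ("b", 0), ("b", 0), ("b", 1)]
population_x = ["a"] * 500 + ["b"] * 500

pseudo = generate_pseudo_data(sample, population_x)

share_a = sum(z for x, z in pseudo if x == "a") / 500
share_b = sum(z for x, z in pseudo if x == "b") / 500
print(share_a, share_b)  # close to 2/3 and 1/3, up to random variation
```

The generated shares match the sample's conditional distribution only in expectation, and they are valid for the population only to the extent that the sample estimates of p(y | x) are unbiased.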

7

Some applications / implications?

• Statistics and inference based on surrogate data
  – Validity (or equivalence) vs. efficiency (or accuracy)
  – Example: Employment register (ER) vs. LFS
      Deterministic ER-status by editing rules vs. valid ER-status for specific purposes
      Bias of invalid ER-status vs. variance of valid ER-status
      Balance in trade-off may change direction on more detailed levels

• Micro data for public use
  – Targeting full empirical equivalence followed by disclosure control (DC)
  – Equivalent data targeting coarsened information (embedded DC)

• Micro calibration of surrogate data
  – Secondary population (U) data (X, Z1, Z2, …, Zk; U) & target sample data (X, Y1; s1), (X, Y2; s2), …, (X, Yk; sk) (different units in general)
  – Surrogate data (X*, Z1*, Z2*, …, Zk*) with marginal validity between (X; U) and (X*; U), (X*, Z1*; U) and (X, Y1; s1), …, (X*, Zk*; U) and (X, Yk; sk)
  – Conditional surrogate data (X, Z1*, Z2*, …, Zk*) with marginal validity between (X, Z1*; U) and (X, Y1; s1), …, (X, Zk*; U) and (X, Yk; sk)?
  – Alternative to statistical matching by Conditional Independence Assumption
  – Uncertainty?
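One hypothetical way to picture conditional surrogate data (X, Z*) for a single binary variable is sketched below. The slides do not specify a calibration rule, so the rule used here (flip as few z-values as possible within each x-cell, in file order, until the cell rate matches the target sample rate) is purely an assumption for illustration, as are the function name `calibrate_binary` and the data. Every x-cell of the population is assumed to appear in the target sample.

```python
from collections import defaultdict


def calibrate_binary(population, target_sample):
    """Return (x, z_star) records whose rate of 1s in each x-cell matches the target sample.

    Hypothetical rule: flip as few z-values as possible, taking units in file order.
    """
    ones = defaultdict(int)
    size = defaultdict(int)
    for x, y in target_sample:
        ones[x] += y
        size[x] += 1

    cell_units = defaultdict(list)
    for i, (x, _) in enumerate(population):
        cell_units[x].append(i)

    z_star = [z for _, z in population]
    for x, units in cell_units.items():
        wanted = round(len(units) * ones[x] / size[x])  # target count of 1s in the cell
        current = sum(z_star[i] for i in units)
        flip_from = 1 if current > wanted else 0
        to_flip = abs(current - wanted)
        for i in units:
            if to_flip == 0:
                break
            if z_star[i] == flip_from:
                z_star[i] = 1 - flip_from
                to_flip -= 1
    return [(x, z) for (x, _), z in zip(population, z_star)]


# Hypothetical register data (x, z) and target sample (x, y) on different units.
population = [("a", 1)] * 8 + [("a", 0)] * 2 + [("b", 0)] * 10
target_sample = [("a", 1), ("a", 0), ("b", 1), ("b", 0), ("b", 0), ("b", 0), ("b", 0)]

calibrated = calibrate_binary(population, target_sample)
ones_a = sum(z for x, z in calibrated if x == "a")
ones_b = sum(z for x, z in calibrated if x == "b")
print(ones_a, ones_b)
```

The sketch treats the target sample rates as fixed and says nothing about which units should be changed, so it leaves open the uncertainty question raised in the last bullet.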
