Chris Jackson, with Nicky Best and Sylvia Richardson
Department of Epidemiology and Public Health, Imperial College, London


NCRM BIAS node: http://www.bias-project.org.uk

Combining administrative and survey data in a study of low birth weight and air pollution


BIAS: Biases in observational studies

Promote principled methods for accounting for potential biases in observational data:
    "non-response" bias:
        selection bias (participation in a study)
        missing data (on some variables for one individual)
    confounding (important variables not available)
    ecological bias (from aggregate / area-level data)
    measurement error
Naïve methods are not normally appropriate.

Alleviating biases

Suitable statistical models for the processes underlying the data
Express uncertainty about biases as probability distributions
    Uncertainty carries through to the results
    Bayesian graphical models
    Software, e.g. WinBUGS
Using multiple data sources to inform about the potential biases

Application areas

Small area estimation (with Virgilio Gómez Rubio)
    Using combination of aggregate (e.g. census) and individual survey data
Selection bias in case-control and survey studies (with Sara Geneletti)
    Using directed acyclic graphs
Inference from combining datasets of different designs from different sources (with Chris Jackson, Jassy Molitor)
    Using Bayesian hierarchical / graphical models

See (http://www.bias-project.org.uk)

Example: low birth weight and air pollution

Does exposure to air pollution during pregnancy increase the risk of low birth weight?

Example illustrates various biases.
Combine datasets with different strengths:
    Survey data (Millennium Cohort Study): small, great individual detail.
    Administrative data (national births register): large, but little individual detail.
Single underlying model assumed to govern both datasets: elaborate as appropriate to handle biases.

Low birth weight

Important determinant of future health, and so a population health indicator.
Established risk factors:
    Tobacco smoking during pregnancy
    Ethnicity (South Asian, an issue for UK data)
    Maternal age, weight, height, number of previous births
Role of environmental risk factors, such as air pollution, is less clear.
    Various studies around the world suggest a link.
    Exposure to urban air pollution is correlated with socioeconomic factors, ethnicity and tobacco smoking, which leads to confounding.

Data sources (1): Millennium Cohort Study

About 15,000 births in the UK between Sep 2000 and Aug 2001 (we study only England and Wales, singleton births)
Postcode made available to us under strict security
    Match individuals with annual mean concentration of certain air pollutants (PM10, NO2, CO, SO2) (NETCEN)
Birth weight and a reasonably complete set of confounder data available
Allows a reasonable analysis, but issues remain:
    Low power to detect a small effect; could be improved by incorporating other data.
    Selection bias.

Selection of Millennium Cohort

[Figure: sampling tree for the Millennium Cohort. All UK wards are split by country (England, Scotland, Wales, Northern Ireland), then into high and low child poverty strata, plus a high ethnic minority stratum. Selection probabilities for the nine strata range from 0.02 to 0.18.]

Selection bias in the Millennium Cohort

Survey disproportionately represents the population.
    If selection probability is related to both exposure and outcome, then the estimate of association is biased.
    Ethnicity / child poverty are probably related to both pollution exposure and low birth weight.
Accounting for selection bias:
    Adjust model for all variables affecting selection, or
    Weight cases by inverse probability of selection
Cluster sampling leads to within-ward correlations: for correct standard errors, use a hierarchical (multilevel) model with groups defined by wards.
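The weighting option can be illustrated with a minimal sketch. The two strata, their selection probabilities and their counts are invented for illustration and are not MCS figures; the same weights would enter a weighted regression in the same way as they enter this simple prevalence estimate.

```python
# Hypothetical strata (not MCS data): selection probability,
# number of sampled births, and number of low-birth-weight cases.
strata = [
    {"name": "A", "p_select": 0.02, "n": 200, "cases": 10},
    {"name": "B", "p_select": 0.10, "n": 1000, "cases": 100},
]


def naive_prevalence(strata):
    """Unweighted sample proportion, ignoring the sampling design."""
    total_n = sum(s["n"] for s in strata)
    total_cases = sum(s["cases"] for s in strata)
    return total_cases / total_n


def weighted_prevalence(strata):
    """Weight each sampled birth by 1 / (its stratum selection probability)."""
    weighted_n = sum(s["n"] / s["p_select"] for s in strata)
    weighted_cases = sum(s["cases"] / s["p_select"] for s in strata)
    return weighted_cases / weighted_n


print("naive:", naive_prevalence(strata))
print("weighted:", weighted_prevalence(strata))
```

Because stratum B is oversampled and has the higher rate in this made-up example, the unweighted estimate overstates the population prevalence, while the weighted estimate restores each stratum to its population share.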

Data sources (2): National birth register

Every birth in the population recorded.
Individual data with postcode (giving pollution exposure) and birth weight available to us under strict security.
Social class and employment status of parents also available for a 10% sample.
We study only this 10% sample: 50,000 births between Sep 2000 and Aug 2001.
Larger dataset, no selection bias,
…but no confounder information, especially ethnicity and smoking.

Data sources (3): Aggregate data

Ethnic composition of the population
    2001 census, for census output areas (~500 individuals)
Tobacco expenditure
    Consumer surveys (CACI, who produce ACORN consumer classification data), for census output areas
…linked by postcode to Millennium Cohort and national register data.

Birth weight and pollution (source: MCS)

Birth weight and ethnicity (source: MCS)

Birth weight and smoking (source: MCS)

Pollution and confounders (source: MCS)

Models for formally analysing combined data

Want estimate of the association between low birth weight and pollution, using all data, accounting for:
    Selection bias in MCS
        Adjust models for all predictors of selection
        Or weight by inverse probability of selection
    Missing confounders in register
        Bayesian graphical model…

Graphical model representation

[Figure: graphical model. Nodes LBWi, POLLi and ETHi for baby i in the register, and LBWj, POLLj and ETHj for baby j in the MCS, are all linked to a common MODEL node. Nodes are marked as known or unknown.]

LBWi: low birth weight
POLLi: pollution exposure (plus other confounders observed in both datasets)
ETHi: ethnicity and smoking. Only observed in the MCS.
Same MODEL assumed to govern both datasets.

Adding in the imputation model

[Figure: graphical model with the imputation model added. As before, the LBW, POLL and ETH nodes for baby i in the register and baby j in the MCS are linked to MODEL (LBW). Additional nodes AGGi and AGGj are linked to ETHi and ETHj through MODEL (imputation).]

AGGi: aggregate ethnicity/smoking data for area of residence of baby i
MODEL for imputation of ETHi in terms of aggregate data and other variables. Estimate it from observed ETHj in the MCS.

Bayesian model

Estimate both:
    Imputation model for missing ethnicity and smoking
    Outcome model for the association between low birth weight and pollution
All beliefs about unknown quantities expressed as probability distributions.
    Prior distributions (often ignorance) modified in light of data to give posterior distributions
Joint posterior distribution of all unknowns estimated by Markov Chain Monte Carlo (MCMC) simulation (WinBUGS software)
    Graphical representation of the model guides the MCMC simulation.
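One way to write down the joint posterior that MCMC targets is sketched below. The notation is an assumption for illustration: \theta denotes the outcome-model parameters, \phi the imputation-model parameters, and independent priors are assumed. The imputation term is shown conditioning only on AGG, although the imputation model may also use other variables.

```latex
\begin{align*}
p\bigl(\theta, \phi, \{\mathrm{ETH}_i\} \mid \text{data}\bigr)
  \;\propto\;& p(\theta)\, p(\phi) \\
  &\times \prod_{j \in \text{MCS}}
     p(\mathrm{LBW}_j \mid \mathrm{POLL}_j, \mathrm{ETH}_j, \theta)\;
     p(\mathrm{ETH}_j \mid \mathrm{AGG}_j, \phi) \\
  &\times \prod_{i \in \text{register}}
     p(\mathrm{LBW}_i \mid \mathrm{POLL}_i, \mathrm{ETH}_i, \theta)\;
     p(\mathrm{ETH}_i \mid \mathrm{AGG}_i, \phi)
\end{align*}
```

In the MCS product every ETH_j is observed, so those terms inform both \theta and \phi. In the register product each ETH_i is an unknown that is sampled alongside the parameters.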

Variables in the final models: (1) regression model for low birth weight

Probability baby i has birth weight under 2.5 kg modelled in terms of:
    Pollution (NO2 and SO2)
    Ethnicity (White / South Asian / Black / other)
    Smoking during pregnancy (yes/no)
    Social class of mother
    Survey selection strata (for MCS data)
Other variables not significant in multiple regression, or not confounded with pollution (mother’s weight, height, maternal age, number of previous births, hypertension during pregnancy,…)
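A possible form for this outcome model is sketched below. The logit link is an assumption, chosen because results are reported as odds ratios; the coefficient names are illustrative, and the bracketed subscripts denote one coefficient per category.

```latex
\begin{align*}
\mathrm{LBW}_i &\sim \mathrm{Bernoulli}(p_i), \\
\operatorname{logit}(p_i) &= \beta_0
  + \beta_{\mathrm{NO2}}\,\mathrm{NO2}_i
  + \beta_{\mathrm{SO2}}\,\mathrm{SO2}_i
  + \beta_{\mathrm{eth}[\mathrm{ETH}_i]}
  + \beta_{\mathrm{smk}}\,\mathrm{SMOKE}_i
  + \beta_{\mathrm{cls}[\mathrm{CLASS}_i]}
  + \beta_{\mathrm{str}[\mathrm{STRATUM}_i]}, \\
\mathrm{OR}_{\mathrm{NO2}} &= \exp(\beta_{\mathrm{NO2}})
\end{align*}
```

Under this form the odds ratio for a pollutant is the exponentiated coefficient, interpreted per one unit of the pollutant as scaled in the analysis.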

Variables in the final models: (2) imputation model for missing data

Probability baby i is in one of eight categories:
    ethnicity: 1. White / 2. South Asian / 3. Black / 4. other
    smoking during pregnancy: 1. No / 2. Yes
Modelled in terms of small-area variables for baby i:
    Proportion of population in each of three ethnic minority categories (South Asian / Black / other)
    Tobacco expenditure
    MCS survey selection strata
…and some individual-level variables for baby i:
    Pollution exposure
    Low birth weight
    Social class, employment status of mother
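An eight-category model of this kind could be written as a multinomial logistic regression, sketched below. This particular form is an assumption, not necessarily the one used: C_i indexes the eight ethnicity-by-smoking combinations, z_i collects the small-area and individual-level covariates listed above, and \gamma_k are illustrative coefficient vectors with the first category as reference.

```latex
\begin{align*}
C_i &\sim \mathrm{Categorical}(\pi_{i1}, \dots, \pi_{i8}), \\
\pi_{ik} &= \frac{\exp(\eta_{ik})}{\sum_{m=1}^{8} \exp(\eta_{im})}, \\
\eta_{ik} &= \gamma_k^{\top} z_i \quad (k = 2, \dots, 8), \qquad \eta_{i1} = 0
\end{align*}
```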

Odds ratios (posterior mean, 95% CI)

Data                               NO2*               SO2*               Smoking            South Asian
Register, ignore confounding       1.20 (1.13,1.27)   1.03 (1.00,1.07)   -                  -
MCS                                1.04 (0.89,1.21)   1.04 (0.96,1.12)   2.00 (1.71,2.34)   2.76 (2.14,3.56)
MCS, ignore selection              1.08 (0.94,1.23)   1.04 (0.96,1.12)   2.00 (1.71,2.34)   3.01 (2.42,3.74)
Register + MCS                     0.97 (0.91,1.03)   1.01 (0.97,1.05)   1.94 (1.80,2.10)   2.92 (2.61,3.26)
Register, adjust for confounding   0.97 (0.91,1.04)   1.01 (0.97,1.07)   1.94 (1.76,2.12)   2.93 (2.57,3.33)

*One unit of pollution concentration = interquartile range of pollution concentration across England and Wales
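As a reading aid for the NO2 column, the short sketch below checks which 95% intervals exclude an odds ratio of 1 and compares interval widths. The numbers are copied from the table; the helper names are illustrative.

```python
# NO2 odds ratios as (posterior mean, lower 95% limit, upper 95% limit),
# copied from the table of odds ratios.
no2_odds_ratios = {
    "Register, ignore confounding": (1.20, 1.13, 1.27),
    "MCS": (1.04, 0.89, 1.21),
    "MCS, ignore selection": (1.08, 0.94, 1.23),
    "Register + MCS": (0.97, 0.91, 1.03),
    "Register, adjust for confounding": (0.97, 0.91, 1.04),
}


def excludes_one(lower, upper):
    """True when the interval lies entirely above or entirely below 1."""
    return lower > 1.0 or upper < 1.0


def interval_width(lower, upper):
    """Width of the interval, a rough indicator of precision."""
    return upper - lower


flags = {name: excludes_one(lo, hi)
         for name, (_, lo, hi) in no2_odds_ratios.items()}
widths = {name: interval_width(lo, hi)
          for name, (_, lo, hi) in no2_odds_ratios.items()}

for name in no2_odds_ratios:
    print(f"{name}: excludes 1 = {flags[name]}, width = {widths[name]:.2f}")
```

Only the register analysis that ignores confounding gives an NO2 interval excluding 1, and the combined analysis gives a narrower interval than the MCS alone.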

Conclusions so far

Combining the datasets can
    increase power
    alleviate bias due to confounding

No evidence for association of pollution exposure with low birth weight.

Work in progress

Sensitivity to different choices for the imputation model
    External data (e.g. small-area data) on confounders not always available
More investigation of selection bias, and different ways of accounting for it
Quantify relative influence of each dataset
Other biases, expected to be a smaller problem
    Missing data in MCS
    Exposure measurement error
Distinguish between preterm birth and low full-term birth weight.

Other kinds of data synthesis

Aggregate (ecological) data
    Administrative data usually aggregated to preserve confidentiality
    Make inferences on individual-level risk factors and outcomes using aggregate data:
        “Ecological bias” caused by within-area variability of risk factors
        confounding caused by limited number of variables.
    Needs appropriate models, and often individual data: survey/cohort data, case-control data.
Combining aggregate and individual data:
    can reduce ecological bias and increase power
    distinguish contextual effects from individual.
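Ecological bias from within-area variability can be shown with a small numeric sketch. It assumes a binary exposure and a logistic individual-level risk model; the intercept a = -3 and slope b = 1.5 are invented values, not estimates from any dataset.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def area_risk(q, a, b):
    """True area-level risk when a proportion q of residents is exposed:
    the average of the individual risks of unexposed and exposed people."""
    return (1.0 - q) * expit(a) + q * expit(a + b)


def naive_ecological_risk(q, a, b):
    """Risk obtained by plugging the area-average exposure q
    directly into the individual-level model."""
    return expit(a + b * q)


a, b = -3.0, 1.5  # hypothetical intercept and log odds ratio
for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(q, area_risk(q, a, b), naive_ecological_risk(q, a, b))
```

The two quantities agree only when everyone in the area has the same exposure (q = 0 or q = 1). With exposure varying within the area they differ, so fitting the individual-level model to area averages does not recover the individual-level effect.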

Publications

Our papers, presentations and software available from http://www.bias-project.org.uk

C. Jackson, N. Best, S. Richardson. Hierarchical related regression for combining aggregate and survey data in studies of socio-economic disease risk factors. under revision, Journal of the Royal Statistical Society, Series A.

C. Jackson, N. Best, S. Richardson. Improving ecological inference using individual-level data. Statistics in Medicine (2006) 25(12):2136-2159.

C. Jackson, S. Richardson, N. Best. Studying place effects on health by synthesising area-level and individual data. Submitted.
