
Cancer Care Outcomes Research and Surveillance Consortium (CanCORS)

Using Multiple Imputation to both Uncover and Hide

Dave Harrington ([email protected])

Dana-Farber Cancer Institute, Harvard Statistics and Biostatistics

10 June 2014

Harrington (DFCI/Harvard) CanCORS 10 June 2014 1 / 42

Collaborators

Alan Zaslavsky, HMS

Yulei He, CDC

Bronwyn Loong, Australian National University

Paul Catalano, DFCI and HSPH

MaryBeth Landrum, HMS

Many members of the CanCORS Consortium


Outline . . .

Brief Overview of CanCORS

Multiple Imputation (MI) Primer

Unit Non-response: Using MI to ‘uncover’

Confidentiality: Using MI to Hide

Using CanCORS data


The Thread





Using CanCORS data


CanCORS Sites

[Figure: US map of CanCORS sites, grouped by the three site types in the legend.]

Patients from population-based cohorts in geographic areas

• Northern California (8 counties in San Jose, San Francisco/Oakland, and Sacramento areas)

• Los Angeles County

• North Carolina (22 central/eastern counties)

• State of Iowa

• State of Alabama

Patients from integrated health-care delivery systems

• Kaiser Permanente Hawaii (4 major islands)

• Kaiser Permanente Northwest (Portland metropolitan area)

• Group Health Cooperative (Seattle metropolitan area)

• Harvard Pilgrim Health Care (Boston metropolitan area)

• Henry Ford Health System (Detroit metropolitan area)

Patients at Veterans Health Administration hospitals

• Biloxi, MS

• Portland, OR

• Durham, NC

• Baltimore, MD

• New York, NY

• Nashville, TN

• Atlanta, GA

• Minneapolis, MN

• Chicago, IL (Lakeside and Hines)

• Houston, TX

• Seattle, WA

• Indianapolis, IN

Enrollment Phase Data Collection

[Figure: timeline of enrollment-phase data collection. Labels shown: Diagnosis (May 03 – Dec 05); Enrollment; Baseline Patient Interview; Follow-up Interview; Physician Surveys; Medical Record Abstraction; Caregiver Surveys; "by role". Time points shown: 3-6 months; 12 months post diagnosis; 15 months post diagnosis.]


Interview Instrument Types

• Full baseline

• Brief baseline

• Surrogate living and decedent


Second wave Phase Data Collection

[Figure: flow diagram of second-wave data collection. Labels shown: Alive 12 months from Diagnosis; Screening Interview (Patient or Surrogate); Deceased or alive with advanced disease; Disease Free; Physician Surveys; Adv. Disease Survey; Med Rec Abstraction; Survivor survey.]


Enrollment

10,061 patients with lung or colorectal cancer enrolled, with data from

• Patient interviews

• Physician surveys

• Medical record abstraction

2,995 patients recontacted in second wave

• Revisions of earlier instruments

• Assessing recurrence surveillance using electronic records in integrated health plans and Medicare Claims


CanCORS vs SEER Colorectal Cancer

[Figure: bar chart comparing CanCORS and SEER colorectal cancer cases. Vertical axis: % Diagnosed Cases (0 to 70). Categories: Female, Hispanic, Black NH, Asian, Age < 50, Age 50-75, Age > 75, Stage I, Stage II, Stage III, Stage IV.]

(Catalano, Med Care, 2012)

Challenges in the Design and Analysis

• Uniform data collection/standards of analysis

• Unit non-response

• Item non-response

• Confidentiality




Multiple Imputation

Proposed by Rubin in 1977; Rubin’s 1996 JASA review is accessible.

• Schafer 1999 SMMR paper is a primer.

• Multiple draws (usually 5 - 10) used to create versions of the completed dataset.

• Draws taken from the posterior predictive distribution of the missing data, given the observed data.

• Each completed dataset analyzed using standard complete-data methods.

• Variance estimates account for within- and between-imputation variability.
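The last two bullets can be illustrated with the usual combining rules for multiply imputed data (Rubin's rules). The following is a minimal sketch, not code from the CanCORS project: the function name `pool_rubin` and the five estimates and standard errors are hypothetical values chosen only for illustration.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b    # total variance
    return q_bar, math.sqrt(t)


# Hypothetical estimates and standard errors from m = 5 completed datasets.
est = [0.70, 0.75, 0.80, 0.78, 0.72]
se = [0.11, 0.11, 0.12, 0.11, 0.10]
q, pooled_se = pool_rubin(est, [s ** 2 for s in se])
print(f"pooled estimate {q:.3f}, pooled SE {pooled_se:.3f}")
```

The pooled standard error is larger than any single completed-data standard error would suggest on average, because the between-imputation term carries the extra uncertainty due to the missing values.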




Missingness in the surveys

Survey               No. Vars   Sample Size   Skips (%)       DO/DK/REF (%)
                                              Mean   Range    Mean   Range
Full Baseline        535        5763          50     0-99     1.23   0-37
Brief Baseline       156        1397          48     0-99     2.49   0-22
Surrogate Live       413        1083          50     0-99     1.61   0-28
Surrogate Deceased   473        1827          62     0-99     1.46   0-29
Follow-up            384        6087          65     0-99     0.55   0-33

DO/DK/REF = Drop Out/Don’t Know/Refused

Some complete case analyses drop approximately 30% of respondents.


Challenges to MI

• Several different data types

• 4 instruments used in two cancer types

• Block missingness because of missing domains in shorter surveys

• Extensive skip patterns in interview structure

• Data set periodically refreshed

• Diagnostics difficult


Imputation Strategy

• Use Sequential Regression Multiple Imputation (SRMI), implemented in the SAS module IVEware (Raghunathan, 2001, Surv. Meth.).

• Imputation used to create 5 datasets for each of the two cancer types.

• Within a cancer, imputation based on dataset with instrument types combined.

• Skipped questions and missing blocks in shorter surveys imputed.

• Skips and block missingness restored after imputation.

• To increase congeniality, imputation did not use data available to the Coordinating Center but not available to data analysts.
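The idea behind sequential regression imputation can be sketched on a toy problem with two continuous variables: each variable with missing values is regressed on the other, missing entries are replaced by draws from the fitted model, and the cycle repeats. This is a deliberately simplified illustration, not the IVEware procedure: it assumes only two variables and linear models, and it adds residual noise without drawing the regression parameters from their posterior, which a proper implementation would do. All names (`fit_line`, `sequential_impute`) and the simulated data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def sequential_impute(data, n_cycles=10, seed=0):
    """data: list of [x, y] rows with None for missing; returns one completed copy."""
    rng = random.Random(seed)
    rows = [list(r) for r in data]
    miss = [[v is None for v in r] for r in rows]
    # Initialise missing entries with the observed column means.
    for j in range(2):
        fill = mean(r[j] for r in rows if r[j] is not None)
        for r in rows:
            if r[j] is None:
                r[j] = fill
    # Cycle through the variables, re-imputing each from the other.
    for _ in range(n_cycles):
        for j in range(2):
            k = 1 - j
            train = [(r[k], r[j]) for r, m in zip(rows, miss) if not m[j]]
            a, b, sd = fit_line([t[0] for t in train], [t[1] for t in train])
            for r, m in zip(rows, miss):
                if m[j]:
                    r[j] = a + b * r[k] + rng.gauss(0, sd)
    return rows


# Simulated data: y depends on x; some values of each are then set to missing.
gen = random.Random(1)
data = []
for _ in range(200):
    x = gen.gauss(0, 1)
    data.append([x, 1.0 + 2.0 * x + gen.gauss(0, 0.5)])
for row in data[:40]:
    row[1] = None
for row in data[40:60]:
    row[0] = None

# Five completed datasets, as in the strategy above.
completed = [sequential_impute(data, seed=s) for s in range(5)]
```

Observed values are left untouched in every completed copy; only the entries flagged as missing differ across the five datasets.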


Diagnostics

• Examined changes in marginal means, pairwise correlation coefficients.

• Comparisons with complete case and missingness indicator analyses.

• Comparisons with weighted analyses: treat brief/surrogate surveys as item non-response, non-response weights calculated for participation in full survey.


Comparative Analyses of Hospice

Full analysis in Huskamp, et al., Arch Int Med (2009)

Did you participate in a discussion of hospice with your physician?


Table 5 Hospice care analysis results

Predictor                  Ref.              CC                      Missingness indicator   SRMI
                                             EST    SE    p-value    EST    SE    p-value    EST    SE    p-value
Intercept                                    −1.97  0.51  0.00       −2.08  0.43  0.00       −2.17  0.42  0.00
Race                       White
  Black                                       0.03  0.21  0.90        0.03  0.17  0.86       −0.02  0.17  0.89
  Hispanic                                   −0.51  0.31  0.10       −0.68  0.24  0.00       −0.66  0.24  0.01
  Asian                                       0.24  0.29  0.42        0.20  0.25  0.41        0.21  0.25  0.40
  Other                                       0.46  0.27  0.08        0.46  0.23  0.04        0.42  0.22  0.06
Marital status             Married/Partner
  Widowed                                    −0.24  0.18  0.19       −0.10  0.15  0.50       −0.07  0.15  0.66
  Divorced/Separated                          0.24  0.17  0.17        0.29  0.14  0.05        0.29  0.15  0.05
  Never married                               0.66  0.32  0.04        0.57  0.25  0.03        0.58  0.27  0.03
  Missing                                                             0.09  0.42  0.83
Received chemo             No
  Yes                                        −1.25  0.12  0.00       −1.01  0.10  0.00       −1.01  0.10  0.00
  Missing                                                             0.07  0.58  0.91
Income                     <20 K
  20 – 40 K                                   0.12  0.15  0.45        0.13  0.14  0.35        0.12  0.14  0.41
  40 – 60 K                                  −0.07  0.19  0.71       −0.04  0.18  0.83       −0.02  0.18  0.91
  60 K+                                      −0.38  0.21  0.07       −0.32  0.19  0.09       −0.30  0.21  0.16
  Missing                                                            −0.15  0.20  0.45
Insurance                  Medicare
  Medicaid                                    0.09  0.31  0.77        0.26  0.27  0.33        0.28  0.25  0.26
  Private                                     0.04  0.20  0.86        0.01  0.17  0.97        0.01  0.18  0.98
  Other                                       0.81  0.34  0.02        0.91  0.28  0.00        0.85  0.28  0.00
  Missing                                                             0.19  0.23  0.41
Myocardial infarction      No
  Yes                                        −0.41  0.17  0.02       −0.24  0.15  0.11       −0.23  0.16  0.13
  Missing                                                            −0.62  0.56  0.27
Depression                 No
  Yes                                         0.47  0.14  0.00        0.36  0.12  0.00        0.35  0.13  0.01
  Missing                                                            −0.31  0.58  0.59
Deceased within 1 year     No
of dx
  Yes                                         1.85  0.16  0.00        1.59  0.13  0.00        1.69  0.13  0.00
Cancer stage               Stage IIIB
  Stage IV                                    0.72  0.13  0.00        0.76  0.11  0.00        0.77  0.11  0.00

Note: CC = Complete-case analysis; SRMI = Sequential regression multiple imputation.

involved in specific analyses. In the hospice subsample and assuming MAR, we performed posterior predictive checking [26−31] to examine the deviation of selected analysis results Q computed from the completed data with imputations compared to the values of Q calculated from simulated copies of the completed data under the model.


Table is small piece of full logistic regression.

Complete account in He, et al. 2010, SMMR


Weighted analyses: QoL scores


Table 7 Estimates of quality-of-life scales

Sample       Scale        CC             Weighting      SRMI
                          Mean   SE      Mean   SE      Mean   SE
All          EORTC–LC     79     0.31    77     0.58    75     0.26
             EORTC–QLQ    77     0.39    74     0.60    71     0.28
             DYSPNEA      63     0.59    59     0.96    59     0.50
Stage I/II   EORTC–LC     82     0.48    81     0.69    78     0.42
             EORTC–QLQ    80     0.60    79     0.83    75     0.51
             DYSPNEA      64     0.95    62     1.34    61     0.73
Stage IV     EORTC–LC     78     0.62    76     1.07    73     0.40
             EORTC–QLQ    75     0.81    73     1.18    68     0.43
             DYSPNEA      63     1.23    59     2.00    58     0.82

Notes: CC = Complete-case analysis; SRMI = Sequential regression multiple imputation.

EORTC (European Organization for Research and Treatment of Cancer)–LC (lung cancer specific): coughing, sore mouth, trouble swallowing and pain.

EORTC–QLQ (quality of life questionnaire): insomnia, nauseated, vomited, etc.

DYSPNEA: short of breath when resting, walking or climbing stairs.

to the limited predictive power of the non-response model predictors we selected and the large variation of the weights from the heterogenous sample. We would expect the advantage of SRMI over weighting to be even more pronounced in an analysis that is primarily based on variables with no block missingness, since it makes better use of all the available information on those variables.

6 Discussion and future directions

This paper demonstrates the use of multiple imputation to construct a centralised database for multiple users with different analytic objectives. As health or social policy decisions are often based on analyses of large incomplete databases, our work may serve as a template for similar applications. By showing the details of implementation, we aim to encourage practitioners to apply multiple imputation for complex large-scale data sets.

SRMI is used to tackle the challenges arising from the complex survey data in this work, and we believe it is a suitable solution for other missing data problems with similar scale and complexity. However, we have still experienced a variety of challenging issues in implementation. These problems include how to build models which can effectively exclude unused block non-responses or skipped items, how to do model selection in the presence of a large number of predictors, how to decide when to stop the chain and collect imputations, and how to perform model validation and diagnostics. We adopted some ad-hoc solutions, but we caution against the mechanical use of SRMI despite its increasing popularity. Echoing other literature in this area [7−9], we call for more methodological research on the properties of SRMI which would help answer the above questions. Given current limitations on the practical and theoretical development of SRMI, we also encourage practitioners to do sensitivity analysis using





Challenges in preserving confidentiality with publicly available healthcare microdata

• Minimize disclosure risk.

• Risk of participant re-identification by a malicious user should be acceptably low.

• Maintain data utility.

• Modifications to the observed data set should maintain variable relationships that are both clinically plausible and reproduce (approximately) relationships in the data.

• Low maintenance for Coordinating Center


Background on partially synthetic data

Idea: Replace the observed values for sensitive variables or key identifiers with synthetic data. Do not release original values of these variables.

• First proposed by Rubin (1993) based on the concepts of multiple imputation.

• Synthetic values are created using samples drawn from the posterior predictive distribution of target population responses given the observed data set.

• Analysts can draw approximately valid inferences about the target population of interest using standard methods for multiply imputed data.
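A toy version of partial synthesis shows the mechanics of the second bullet. The sketch below replaces one binary quasi-identifier with posterior predictive draws while leaving every other field as observed. It is an illustration under stated assumptions, not the CanCORS procedure: it swaps in a simple stratified Beta-Binomial model with a uniform prior in place of regression models, and the function name `synthesize_binary`, the field names, and the records are all hypothetical.

```python
import random
from collections import defaultdict


def synthesize_binary(records, target, by, m=5, seed=0):
    """Return m copies of `records` with the 0/1 field `target` replaced by synthetic draws."""
    rng = random.Random(seed)
    counts = defaultdict(lambda: [0, 0])   # stratum -> [number of zeros, number of ones]
    for r in records:
        counts[r[by]][r[target]] += 1
    copies = []
    for _ in range(m):
        # Draw P(target = 1) for each stratum from its Beta posterior (uniform prior),
        # then draw a synthetic value for every record given that probability.
        p = {s: rng.betavariate(c[1] + 1, c[0] + 1) for s, c in counts.items()}
        copies.append([{**r, target: int(rng.random() < p[r[by]])} for r in records])
    return copies


# Hypothetical records: "sex" is synthesized, "stage" is released as observed.
records = [{"id": i, "stage": "IV" if i % 3 == 0 else "I-III", "sex": i % 2}
           for i in range(30)]
copies = synthesize_binary(records, target="sex", by="stage", m=5)
```

Drawing the stratum probabilities afresh for each copy, rather than plugging in the observed proportions, is what makes the synthetic values posterior predictive draws.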


Synthetic data in practice

• Survey of Consumer Finances

• American Community Survey

• Survey of Income and Program Participation

• US Longitudinal Business Database

• Longitudinal Employer-Household Dynamics

• German IAB Establishment Panel


Identification of variables to synthesize

• Direct identifiers: name, postal address, SSN: not available to internal and external investigators, so not synthesized.

• Quasi-identifiers: Age, education, sex, marital status, race.

• Clinical: Not synthesized; complex structure - difficult to synthesize, accessible by insiders only (not wider population); subject to judgemental variation and incomplete medical record acquisition

• Sensitive: All information gathered can be considered sensitive; full synthesis not practical and not useful to data analysts

Details reported in Loong, et al. (2013) Stat. Med.


Imputation models

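$$
f\left(Y^{(j)}_{\mathrm{age}},\, Y^{(j)}_{\mathrm{educ}},\, Y^{(j)}_{\mathrm{marstat}},\, Y^{(j)}_{\mathrm{race}},\, Y^{(j)}_{\mathrm{sex}} \,\middle|\, Y_0, S_0\right) \tag{1}
$$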

Not practical to draw directly from conditional joint distribution of identifiers, given observed data.

We used sequential regression multiple imputation here as well.

Parametric approach to imputation - logistic regression model to impute sex, multinomial logit models for other variables.

m = 5 partially synthetic data sets created.

Imputations done in R package mi for this project.


Imputation models

• Select predictor variables using stepwise regression within each of 12 sections of survey.

• On average ≈ 50 predictors for each variable to be imputed.

• No interactions, main effects only - reduce risk of overfitting (which increases disclosure risk), reduce computational burden and avoid multicollinearity and separation issues.

• Variables included in all imputation models: survey disposition code, stage, histology, vital status, study site.


Disclosure risk

• Identification disclosure risk: potential identification of sampled units in the released data

• Inferential disclosure risk: inference of new information about a known participant in the survey, e.g., all participants with the same identifiers have the same income.

Inferential disclosure risk is a difficult problem that is still not well understood. We focus on identification disclosure risk assessment.


Identification disclosure risk assessment

Duncan and Lambert risk framework (1989)

• Mimic the behavior of an ill-intentioned public user (an intruder) who has the true values of unique or quasi-identifiers for select target units and wants to identify the records in the synthetic data that have matching identifier values.

• We investigated 3 sets of quasi-identifying values representing varying assumed levels of intruder information:

(i) Set 1: Age, sex, marital status, and race

(ii) Set 2: Set 1 + education + income level

(iii) Set 3: Set 2 + disease stage + study site


Identification disclosure risk assessment

A potential identification risk for target record i occurs when its quasi-identifying values match the corresponding values for a record k in synthetic data set j.

Some assumptions/definitions about risk of identification:

• The intruder knows the target is in the survey and the quasi-identifiers of all units in the population.

• If target record has no matching quasi-identifiers among the m imputed data sets, risk of identification is zero.

• Expected match risk: probability of correct match if intruder randomly guesses the match from the candidates with same quasi-identifiers

• Maximum match risk: intruder always correctly identifies the record from among potential candidates

• Total match risk: probability that a record is correctly and uniquely identified in synthetic data.
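The three risk definitions can be made concrete for a single synthetic data set. The sketch below is an illustration only: it assumes the synthetic records are stored in the same order as the true records, it returns counts rather than rates, and it does not average across the m synthetic data sets. The function name `match_risks` and the four example records are hypothetical.

```python
from collections import Counter


def match_risks(true_keys, synthetic_keys):
    """Both arguments are lists of quasi-identifier tuples, aligned by record index."""
    n_with_key = Counter(synthetic_keys)   # synthetic records sharing each key
    expected = 0.0
    maximum = 0
    unique = 0
    for i, key in enumerate(true_keys):
        c = n_with_key.get(key, 0)         # number of candidate matches for target i
        if c == 0:
            continue                       # no candidates: zero risk for this target
        if synthetic_keys[i] == key:       # the true record is among the candidates
            expected += 1.0 / c            # random guess among c candidates
            maximum += 1                   # intruder always picks correctly
            if c == 1:
                unique += 1                # correct and unique match
    return expected, maximum, unique


# Hypothetical quasi-identifiers: (age group, sex, marital status, race).
a = ("60-69", "F", "married", "white")
b = ("70-79", "F", "married", "white")
c_ = ("70-79", "M", "widowed", "black")
d = ("50-59", "M", "married", "asian")
true_keys = [a, a, c_, d]
synthetic_keys = [a, b, c_, a]

expected, maximum, unique = match_risks(true_keys, synthetic_keys)
print(expected, maximum, unique)
```

Dividing each count by the number of target records gives rates comparable in form to the percentages reported for the lung cancer cohort.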


Disclosure risk . . .

Compute probabilities of identification under three sets of quasi-identifiers.

(i) Set 1: Age, sex, marital status, race

(ii) Set 2: Set 1 + income level + education

(iii) Set 3: Set 2 + disease stage + PDCRID


Disclosure risk estimates, lung cancer cohort

Using Set 2 of identifiers:

Synthetic data:

• Expected match risk: 217/5000 = 4.3%

• Maximum risk: 925/5000 = 18.5%

• Total match risk: 88/5000 = 1.8%

Observed data:

• Expected match risk: 1770/5000 = 35.4%

• Maximum risk: 5000/5000 = 100%

• Total match risk: 989/5000 = 19.8%


Calculating disclosure risk

Expected match rate, identifier set 2, lung cancer:

• Total Match Risk: 88 cases where original record correctly and uniquely matched to synthetic data record.

• Maximum risk: 1158 instances where a true record was among identical sets of identifiers in 5 synthetic data sets. Corresponds to 958 unique records.

• Expected matches: Expected number of matches if intruder guessed randomly among potential set of matches. This is a weighted average first within then across synthetic data sets.


Disclosure risk by set of identifiers

Table: Disclosure risk - CanCORS lung cancer partially synthetic data

Quasi-identifier set   EMR             TMR
                       Obs.    Syn.    Obs.    Syn.
1                        70      38       0       0
2                      1770     217     989      88
3                      4495     890    4000     717

EMR = expected match risk; TMR = total match risk. Entries are counts of records.


Data utility assessment

Characterize the quality of what can be learned from the synthetic data, relative to what can be learned from the observed data set.

• Confidence interval overlap (Karr et al. (2006))

• Coverage error due to bias (Cochran (1977))
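The confidence interval overlap measure of Karr et al. (2006) averages the share of each interval that is covered by their intersection. A minimal sketch follows; scoring disjoint intervals as 0 is an assumption made here for simplicity, and the function name `ci_overlap` and the example intervals are hypothetical.

```python
def ci_overlap(obs, syn):
    """obs and syn are (lower, upper) confidence limits for the same estimand."""
    lo = max(obs[0], syn[0])   # lower end of the intersection
    hi = min(obs[1], syn[1])   # upper end of the intersection
    if hi <= lo:
        return 0.0             # assumption: disjoint intervals score 0
    width = hi - lo
    # Average of the overlap as a fraction of each interval's length.
    return 0.5 * (width / (obs[1] - obs[0]) + width / (syn[1] - syn[0]))


# Hypothetical intervals for one coefficient from observed and synthetic data.
print(ci_overlap((0.0, 2.0), (1.0, 3.0)))
```

A value of 1 means the two intervals coincide; values near 0 mean an analyst using the synthetic data would see an interval that barely intersects the one from the observed data.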


Data utility

Observed versus synthetic data inferential results

Analytical comparisons based on published studies which analyzed the observed data set.

Reference papers:

• Huskamp et al. (2009). Discussions with Physicians about Hospice among Patients with Metastatic Lung Cancer. Arch. Intern. Med., 169(10), 954-962.

• Keating et al. (2010). Cancer patients’ roles in treatment decisions: do characteristics of the decision influence roles? Journal of Clinical Oncology, 28:4364-4370.


Data utility - analytic results comparison

Table: Descriptive characteristics and estimated probabilities of hospice discussion by race, unadjusted for other covariates. (Standard errors in parentheses)

Characteristic    Patients %       Discussed Hospice %        p-value
                  Obs.    Syn.     Obs.         Syn.          Obs.       Syn.
Race/ethnicity                                                < 0.001    0.62
  White           73.7    72.0     55.2 (1.3)   54.4 (1.5)
  Black           10.7    11.4     42.6 (1.3)   50.0 (4.1)
  Hispanic         5.9     5.7     40.4 (1.3)   45.8 (5.3)
  Asian            5.1     5.3     49.4 (1.3)   51.6 (6.0)
  Other            4.7     4.8     64.5 (1.2)   54.0 (6.9)


Data utility - Confidence interval overlap and coverage error

Table: Data utility for coefficient estimates in a logistic model for hospice discussion, unadjusted for other covariates.

Characteristic    |Bias(q_syn)/SE(q_syn)|   CI overlap   CI Cov. Error
Race/ethnicity
  Black           1.101                     0.767        0.196
  Hispanic        0.101                     0.831        0.051
  Asian           0.024                     0.803        0.050
  Other           1.479                     0.707        0.316
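The coverage error column appears consistent with the standard normal-theory calculation for a nominal 95% interval centered on a biased estimate: if the ratio of bias to standard error is r, the interval misses the target with probability 1 − [Φ(1.96 − r) − Φ(−1.96 − r)]. That this is how the tabulated values were obtained is an assumption; the sketch below simply applies the formula to the tabulated ratios. The function names are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def coverage_error(bias_ratio, z=1.96):
    """Probability that a nominal 95% interval misses the target when |bias| / SE = bias_ratio."""
    covered = norm_cdf(z - bias_ratio) - norm_cdf(-z - bias_ratio)
    return 1.0 - covered


# Bias-to-SE ratios taken from the table above.
for level, ratio in [("Black", 1.101), ("Hispanic", 0.101), ("Asian", 0.024), ("Other", 1.479)]:
    print(level, round(coverage_error(ratio), 3))
```

With no bias the error equals the nominal 5%; a bias of about one standard error, as for the Black coefficient, raises it to roughly 20%.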


Data utility - uncongeniality

What explanation can be given for the change in conclusion of significance?

→ Uncongeniality, Meng (1994)

"Analysts and imputer have access to different types of information and data, and assess and use the information and data in different ways"

That is, there are systematic differences between the imputation model and analysis procedure inputs.

For our study, the imputation model conditioned on the entire data set (5000 records) but the analysis procedure analyzed the subset of Stage IV lung cancer patients (1517 records).

→ If the imputation model does not capture all the important subgroup relationships, results from the synthetic data may be biased.


Data utility - uncongeniality

Re-ran imputation models conditional on Stage IV lung cancer patients only - large bias for ‘black’ race factor level is removed.

Table: Descriptive characteristics and estimated probabilities of hospice discussion by race, unadjusted for other covariates. (Standard errors in parentheses)

Characteristic    Patients %       Discussed Hospice %        p-value
                  Obs.    Syn.     Obs.         Syn.          Obs.       Syn.
Race/ethnicity                                                < 0.001    0.003
  White           73.7    74.6     55.2 (1.3)   55.6 (1.5)
  Black           10.7     9.9     42.6 (1.3)   42.0 (4.2)
  Hispanic         5.9     6.1     40.4 (1.3)   40.4 (5.3)
  Asian            5.1     5.1     49.4 (1.3)   50.2 (5.8)
  Other            4.7     4.3     64.5 (1.2)   58.5 (6.5)


The Thread





Using CanCORS data


Currently, no public use data sets released.

Grant officially closes July 31, 2014.

Public web site (www.cancors.org/public) will be available with

• Full study and data documentation

• Bibliography of published papers

• All study instruments and protocols

Outside investigators will be urged to collaborate with members of CanCORS team

