Cancer Care Outcomes Research and Surveillance Consortium (CanCORS) Using Multiple Imputation to both Uncover and Hide Dave Harrington ([email protected]) Dana-Farber Cancer Institute, Harvard Statistics and Biostatistics 10 June 2014 Harrington (DFCI/Harvard) CanCORS 10 June 2014 1 / 42


Cancer Care Outcomes Research and Surveillance Consortium (CanCORS)

Using Multiple Imputation to both Uncover and Hide

Dave Harrington ([email protected])

Dana-Farber Cancer Institute, Harvard Statistics and Biostatistics

10 June 2014

Harrington (DFCI/Harvard) CanCORS 10 June 2014 1 / 42


Collaborators

Alan Zaslavsky, HMS

Yulei He, CDC

Bronwyn Loong, Australian National University

Paul Catalano, DFCI and HSPH

MaryBeth Landrum, HMS

Many members of the CanCORS Consortium


Outline . . .

Brief Overview of CanCORS

Multiple Imputation (MI) Primer

Unit Non-response: Using MI to ‘uncover’

Confidentiality: Using MI to Hide

Using CanCORS data


The Thread

Brief Overview of CanCORS

Multiple Imputation (MI) Primer

Unit Non-response: Using MI to ‘uncover’

Confidentiality: Using MI to Hide

Using CanCORS data


CanCORS Sites

[Figure: U.S. map of CanCORS sites, marked by cohort type.]

Patients from population-based cohorts in geographic areas: Northern California (8 counties in San Jose, San Francisco/Oakland, and Sacramento areas); Los Angeles County; State of Iowa; State of Alabama; North Carolina (22 central/eastern counties)

Patients from integrated health-care delivery systems: Kaiser Permanente Hawaii (4 major islands); Kaiser Permanente Northwest (Portland metropolitan area); Group Health Cooperative (Seattle metropolitan area); Harvard Pilgrim Health Care (Boston metropolitan area); Henry Ford Health System (Detroit metropolitan area)

Patients at Veterans Health Administration hospitals: Atlanta, GA; Baltimore, MD; Biloxi, MS; Chicago, IL (Lakeside and Hines); Durham, NC; Houston, TX; Indianapolis, IN; Minneapolis, MN; Nashville, TN; New York, NY; Portland, OR; Seattle, WA


Enrollment Phase Data Collection

[Figure: timeline of enrollment-phase data collection. Labels: Diagnosis, May 03 – Dec 05; Enrollment; Baseline Patient Interview; Follow-up Interview; Physician Surveys; Medical Record Abstraction; Caregiver Surveys; by role. Time markers: 3-6 months; 12 months post diagnosis; 15 months post diagnosis.]


Interview Instrument Types

• Full baseline

• Brief baseline

• Surrogate living and decedent


Second wave Phase Data Collection

[Figure: flow diagram of second-wave data collection. Labels: Alive 12 months from Diagnosis; Screening Interview (Patient or Surrogate); Deceased or alive with advanced disease; Disease Free; Physician Surveys; Adv. Disease Survey; Med Rec Abstraction; Survivor survey.]


Enrollment

10,061 patients with lung or colorectal cancer enrolled, with data from

• Patient interviews
• Physician surveys
• Medical record abstraction

2,995 patients recontacted in second wave

• Revisions of earlier instruments
• Assessing recurrence surveillance using electronic records in integrated health plans and Medicare Claims


CanCORS vs SEER Colorectal Cancer

[Figure: bar chart comparing CanCORS and SEER. Axis: % Diagnosed Cases (0 to 70). Categories: Female; Hispanic; Black NH; Asian; Age < 50; Age 50-75; Age > 75; Stage I; Stage II; Stage III; Stage IV.]

(Catalano, Med Care, 2012)


Challenges in the Design and Analysis

• Uniform data collection/standards of analysis

• Unit non-response

• Item non-response

• Confidentiality


The Thread

Brief Overview of CanCORS

Multiple Imputation (MI) Primer

Unit Non-response: Using MI to ‘uncover’

Confidentiality: Using MI to Hide

Using CanCORS data


Multiple Imputation

Proposed by Rubin in 1977; Rubin’s 1996 JASA review is accessible.

• Schafer 1999 SMMR paper is a primer.
• Multiple draws (usually 5 - 10) used to create versions of completed dataset.
• Draws taken from posterior predictive distribution of missing data, given observed data.
• Each completed dataset analyzed using traditional complete-data methods.
• Variance estimates account for within- and between-imputation variability.
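
The variance bullet refers to what are usually called Rubin's combining rules. A minimal sketch, assuming each completed dataset has already produced a point estimate and its squared standard error; the function name `pool_rubin` and the example numbers are illustrative and do not come from the talk.

```python
import math


def pool_rubin(estimates, variances):
    """Combine m completed-data results with Rubin's rules.

    estimates: point estimates, one per completed dataset
    variances: squared standard errors, one per completed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching lists with at least 2 imputations")
    q_bar = sum(estimates) / m                               # pooled point estimate
    u_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total = u_bar + (1 + 1 / m) * b                          # total variance
    return q_bar, total


# Hypothetical estimates from m = 5 completed datasets (not CanCORS results)
est, var = pool_rubin([0.70, 0.76, 0.77, 0.74, 0.73], [0.0121] * 5)
print(round(est, 3), round(math.sqrt(var), 3))
```

The pooled standard error is the square root of the total variance; the factor (1 + 1/m) inflates the between-imputation term to reflect the finite number of imputations.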


The Thread

Brief Overview of CanCORS

Multiple Imputation (MI) Primer

Unit Non-response: Using MI to ‘uncover’

Confidentiality: Using MI to Hide

Using CanCORS data


Missingness in the surveys

                                              Skips (%)       DO/DK/REF (%)
Survey               No. Vars   Sample Size   Mean   Range    Mean   Range
Full Baseline        535        5763          50     0-99     1.23   0-37
Brief Baseline       156        1397          48     0-99     2.49   0-22
Surrogate Live       413        1083          50     0-99     1.61   0-28
Surrogate Deceased   473        1827          62     0-99     1.46   0-29
Follow-up            384        6087          65     0-99     0.55   0-33

DO/DK/REF = Drop Out/Don’t Know/Refused

Some complete case analyses drop approximately 30% of respondents.


Challenges to MI

• Several different data types
• 4 instruments used in two cancer types
• Block missingness because of missing domains in shorter surveys.
• Extensive skip patterns in interview structure
• Data set periodically refreshed
• Diagnostics difficult


Imputation Strategy

• Use Sequential Regression Multiple Imputation (SRMI), implemented in SAS module IVEware (Raghunathan, 2001, Surv. Meth.).
• Imputation used to create 5 datasets for each of the two cancer types.
• Within a cancer, imputation based on dataset with instrument types combined.
• Skipped questions and missing blocks in shorter surveys imputed.
• Skips and block missingness restored after imputation.
• To increase congeniality, imputation did not use data available to Coordinating Center but not available to data analysts.
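
The "impute everything, then restore skips and block missingness" step can be sketched as below. This is only an illustration of the idea, not the IVEware workflow: the codes `SKIP` and `BLOCK`, the use of `None` for truly missing items, and the function name `restore_structure` are all hypothetical.

```python
SKIP = "SKIP"     # hypothetical code for a legitimately skipped item
BLOCK = "BLOCK"   # hypothetical code for a domain absent from a shorter survey


def restore_structure(original, imputed):
    """Return a copy of `imputed` with skip/block codes from `original` put back.

    Both arguments are lists of dicts (one dict per respondent, same order,
    same keys). Truly missing items (None in `original`) keep their imputed
    values; items coded SKIP or BLOCK revert to that code.
    """
    restored = []
    for orig_row, imp_row in zip(original, imputed):
        new_row = dict(imp_row)          # copy so the imputed data are untouched
        for key, value in orig_row.items():
            if value in (SKIP, BLOCK):
                new_row[key] = value     # undo the imputation for structural gaps
        restored.append(new_row)
    return restored


original = [{"smoker": "no", "packs_per_day": SKIP, "income": None}]
imputed = [{"smoker": "no", "packs_per_day": 1.2, "income": 3}]
print(restore_structure(original, imputed))
```

Imputing the structurally missing cells first gives the sequential regressions a rectangular dataset to work with; restoring them afterwards keeps the released files consistent with the interview design.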


Diagnostics

• Examined changes in marginal means, pairwise correlation coefficients.
• Comparisons with complete case and missingness indicator analyses.
• Comparisons with weighted analyses: treat brief/surrogate surveys as item non-response, non-response weights calculated for participation in full survey.
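
As a rough illustration of the weighted comparison, the sketch below computes a non-response-weighted mean among full-survey respondents. It assumes each weight is the inverse of an estimated probability of completing the full survey; the talk does not say how the CanCORS weights were built, so the function name and inputs are hypothetical.

```python
def weighted_mean(values, response_probs):
    """Mean of `values` weighted by 1 / estimated response probability.

    values: outcomes for full-survey respondents
    response_probs: each respondent's estimated probability of completing
        the full survey (assumed to come from some non-response model)
    """
    if len(values) != len(response_probs):
        raise ValueError("values and response_probs must have equal length")
    weights = [1.0 / p for p in response_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical scores: the second respondent stands in for more non-respondents
print(weighted_mean([80.0, 60.0], [0.5, 0.25]))
```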


Comparative Analyses of Hospice

Full analysis in Huskamp, et al., Arch Int Med (2009)

Did you participate in a discussion of hospice with your physician?

Table 5 Hospice care analysis results

Predictor (Ref.)              CC                      Missingness indicator   SRMI
                              EST     SE      p-value EST     SE      p-value EST     SE      p-value
Intercept                     −1.97   0.51    0.00    −2.08   0.43    0.00    −2.17   0.42    0.00
Race (White)
  Black                       0.03    0.21    0.90    0.03    0.17    0.86    −0.02   0.17    0.89
  Hispanic                    −0.51   0.31    0.10    −0.68   0.24    0.00    −0.66   0.24    0.01
  Asian                       0.24    0.29    0.42    0.20    0.25    0.41    0.21    0.25    0.40
  Other                       0.46    0.27    0.08    0.46    0.23    0.04    0.42    0.22    0.06
Marital status (Married/Partner)
  Widowed                     −0.24   0.18    0.19    −0.10   0.15    0.50    −0.07   0.15    0.66
  Divorced/Separated          0.24    0.17    0.17    0.29    0.14    0.05    0.29    0.15    0.05
  Never married               0.66    0.32    0.04    0.57    0.25    0.03    0.58    0.27    0.03
  Missing                                             0.09    0.42    0.83
Received chemo (No)
  Yes                         −1.25   0.12    0.00    −1.01   0.10    0.00    −1.01   0.10    0.00
  Missing                                             0.07    0.58    0.91
Income (<20 K)
  20 – 40 K                   0.12    0.15    0.45    0.13    0.14    0.35    0.12    0.14    0.41
  40 – 60 K                   −0.07   0.19    0.71    −0.04   0.18    0.83    −0.02   0.18    0.91
  60 K+                       −0.38   0.21    0.07    −0.32   0.19    0.09    −0.30   0.21    0.16
  Missing                                             −0.15   0.20    0.45
Insurance (Medicare)
  Medicaid                    0.09    0.31    0.77    0.26    0.27    0.33    0.28    0.25    0.26
  Private                     0.04    0.20    0.86    0.01    0.17    0.97    0.01    0.18    0.98
  Other                       0.81    0.34    0.02    0.91    0.28    0.00    0.85    0.28    0.00
  Missing                                             0.19    0.23    0.41
Myocardial infarction (No)
  Yes                         −0.41   0.17    0.02    −0.24   0.15    0.11    −0.23   0.16    0.13
  Missing                                             −0.62   0.56    0.27
Depression (No)
  Yes                         0.47    0.14    0.00    0.36    0.12    0.00    0.35    0.13    0.01
  Missing                                             −0.31   0.58    0.59
Deceased with 1 year of dx (No)
  Yes                         1.85    0.16    0.00    1.59    0.13    0.00    1.69    0.13    0.00
Cancer stage (Stage IIIB)
  Stage IV                    0.72    0.13    0.00    0.76    0.11    0.00    0.77    0.11    0.00

Note: CC = Complete-case analysis; SRMI = Sequential regression multiple imputation.

involved in specific analyses. In the hospice subsample and assuming MAR, we performed posterior predictive checking [26−31] to examine the deviation of selected analysis results Q computed from the completed data with imputations compared to the values of Q calculated from simulated copies of the completed data under the model.

Table is small piece of full logistic regression.

Complete account in He, et al. 2010, SMMR


Weighted analyses: QoL scores

Table 7 Estimates of quality-of-life scales

Sample      Scale       CC              Weighting       SRMI
                        Mean    SE      Mean    SE      Mean    SE
All         EORTC–LC    79      0.31    77      0.58    75      0.26
            EORTC–QLQ   77      0.39    74      0.60    71      0.28
            DYSPNEA     63      0.59    59      0.96    59      0.50
Stage I/II  EORTC–LC    82      0.48    81      0.69    78      0.42
            EORTC–QLQ   80      0.60    79      0.83    75      0.51
            DYSPNEA     64      0.95    62      1.34    61      0.73
Stage IV    EORTC–LC    78      0.62    76      1.07    73      0.40
            EORTC–QLQ   75      0.81    73      1.18    68      0.43
            DYSPNEA     63      1.23    59      2.00    58      0.82

Notes: CC = Complete-case analysis; SRMI = Sequential regression multiple imputation.
EORTC (European Organization for Research and Treatment of Cancer)_LC (Lung cancer specific): coughing, sore mouth, trouble swallowing and pain.
EORTC_QLQ (quality of life questionnaire): insomnia, nauseated, vomited, etc.
DYSPNEA: short of breath when resting, walking or climbing stairs.

to the limited predictive power of the non-response model predictors we selected and the large variation of the weights from the heterogenous sample. We would expect the advantage of SRMI over weighting to be even more pronounced in an analysis that is primarily based on variables with no block missingness, since it makes better use of all the available information on those variables.

6 Discussion and future directions

This paper demonstrates the use of multiple imputation to construct a centralised database for multiple users with different analytic objectives. As health or social policy decisions are often based on analyses of large incomplete databases, our work may serve as a template for similar applications. By showing the details of implementation, we aim to encourage practitioners to apply multiple imputation for complex large-scale data sets.

SRMI is used to tackle the challenges arising from the complex survey data in this work, and we believe it is a suitable solution for other missing data problems with similar scale and complexity. However, we have still experienced a variety of challenging issues in implementation. These problems include how to build models which can effectively exclude unused block non-responses or skipped items, how to do model selection in the presence of a large number of predictors, how to decide when to stop the chain and collect imputations, and how to perform model validation and diagnostics. We adopted some ad-hoc solutions, but we caution against the mechanical use of SRMI despite its increasing popularity. Echoing other literature in this area [7−9], we call for more methodological research on the properties of SRMI which would help answer the above questions. Given current limitations on the practical and theoretical development of SRMI, we also encourage practitioners to do sensitivity analysis using


The Thread

Brief Overview of CanCORS

Multiple Imputation (MI) Primer

Unit Non-response: Using MI to ‘uncover’

Confidentiality: Using MI to Hide

Using CanCORS data


Challenges in preserving confidentiality with publicly available healthcare microdata

• Minimize disclosure risk.

• Risk of participant re-identification by a malicious user should be acceptably low.

• Maintain data utility.

• Modifications to the observed data set should maintain variable relationships that are both clinically plausible and reproduce (approximately) relationships in the data.

• Low maintenance for Coordinating Center


Background on partially synthetic data

Idea: Replace the observed values for sensitive variables or key identifiers with synthetic data. Do not release original values of these variables.

• First proposed by Rubin (1993) based on the concepts of multiple imputation.

• Synthetic values are created using samples drawn from the posterior predictive distribution of target population responses given the observed data set.

• Analysts can draw approximately valid inferences about the target population of interest using standard methods for multiply imputed data.
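
For context on how the m synthetic datasets are combined: the talk says only that standard multiple-imputation methods apply. The sketch below shows the variance rule commonly cited for partially synthetic data (Reiter, 2003), which uses u_bar + b/m rather than the missing-data rule u_bar + (1 + 1/m) b. This is added background under that assumption, not something stated in the slides, and the function name is illustrative.

```python
def pool_partially_synthetic(estimates, variances):
    """Combine results from m partially synthetic datasets.

    estimates: point estimates, one per synthetic dataset
    variances: squared standard errors, one per synthetic dataset
    Returns (pooled estimate, variance estimate u_bar + b / m).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching lists with at least 2 synthetic datasets")
    q_bar = sum(estimates) / m                               # pooled point estimate
    u_bar = sum(variances) / m                               # average within-dataset variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-dataset variance
    return q_bar, u_bar + b / m


# Hypothetical estimates from m = 5 synthetic datasets
print(pool_partially_synthetic([0.50, 0.54, 0.47, 0.52, 0.49], [0.0016] * 5))
```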


Synthetic data in practice

• Survey of Consumer Finances
• American Community Survey
• Survey of Income and Program Participation
• US Longitudinal Business Database
• Longitudinal Employer-Household Dynamics
• German IAB Establishment Panel


Identification of variables to synthesize

• Direct identifiers: name, postal address, SSN: not available to internal and external investigators, so not synthesized.

• Quasi-identifiers: Age, education, sex, marital status, race.

• Clinical: Not synthesized; complex structure - difficult to synthesize, accessible by insiders only (not wider population); subject to judgemental variation and incomplete medical record acquisition

• Sensitive: All information gathered can be considered sensitive; full synthesis not practical and not useful to data analysts

Details reported in Loong, et al. (2013) Stat. Med.


Imputation models

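\[
f\left( Y^{(j)}_{\text{age}},\; Y^{(j)}_{\text{educ}},\; Y^{(j)}_{\text{marstat}},\; Y^{(j)}_{\text{race}},\; Y^{(j)}_{\text{sex}} \;\middle|\; Y_0, S_0 \right) \tag{1}
\]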

Not practical to draw directly from conditional joint distribution of identifiers, given observed data.

We used sequential regression multiple imputation here as well.

Parametric approach to imputation - logistic regression model to impute sex, multinomial logit models for other variables.

m = 5 partially synthetic data sets created

Imputations done in R package MI for this project.


Imputation models

• Select predictor variables using stepwise regression within each of 12 sections of survey.
• On average ≈ 50 predictors for each variable to be imputed.
• No interactions, main effects only - reduce risk of overfitting (which increases disclosure risk), reduce computational burden and avoid multicollinearity and separation issues.
• Variables included in all imputation models: survey disposition code, stage, histology, vital status, study site.


Disclosure risk

• Identification disclosure risk: potential identification of sampled units in the released data

• Inferential disclosure risk: inference of new information about a known participant in the survey, e.g., all participants with the same identifiers have the same income.

Inferential disclosure risk is a difficult problem that is still not well understood. We focus on identification disclosure risk assessment.


Identification disclosure risk assessment

Duncan and Lambert risk framework (1989)

• Mimic the behavior of an ill-intentioned public user (an intruder) who has the true values of unique or quasi-identifiers for select target units and wants to identify the records in the synthetic data that have matching identifier values.

• We investigated 3 sets of quasi-identifying values representing varying assumed levels of intruder information
(i) Set 1: Age, sex, marital status, and race
(ii) Set 2: Set 1 + education + income level
(iii) Set 3: Set 2 + disease stage + study site


Identification disclosure risk assessment

A potential identification risk for target record i occurs when its quasi-identifying values match the corresponding values for a record k in synthetic data set j.

Some assumptions/definitions about risk of identification:

• The intruder knows the target is in the survey and the quasi-identifiers of all units in the population.

• If target record has no matching quasi-identifiers among the m imputed data sets, risk of identification is zero.

• Expected match risk: probability of correct match if intruder randomly guesses the match from the candidates with same quasi-identifiers

• Maximum match risk: intruder always correctly identifies record from among potential candidates

• Total match risk: probability that a record is correctly and uniquely identified in synthetic data.
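
The three definitions above can be sketched for a single synthetic dataset as follows. The sketch assumes that row i of the synthetic file is the synthesized version of row i of the original file and that quasi-identifiers are compared by exact equality. The talk averages over the m = 5 datasets, which is omitted here; the function name and toy records are hypothetical.

```python
from collections import Counter


def match_risks(original, synthetic):
    """Count-based match risks for one synthetic dataset.

    original, synthetic: lists of quasi-identifier tuples, aligned by row.
    Returns (expected matches, maximum matches, unique correct matches).
    """
    candidates = Counter(synthetic)          # how often each identifier pattern appears
    expected, maximum, unique = 0.0, 0, 0
    for target, released in zip(original, synthetic):
        c = candidates[target]               # records the intruder would consider
        if c == 0:
            continue                         # no candidate: zero risk for this target
        if released == target:               # the true record is among the candidates
            expected += 1.0 / c              # random guess among c candidates
            maximum += 1                     # intruder always picks correctly
            if c == 1:
                unique += 1                  # correct and unique match
    return expected, maximum, unique


# Toy records: (age group, sex)
orig = [("60s", "F"), ("60s", "F"), ("70s", "M"), ("50s", "M"), ("80s", "F")]
syn = [("60s", "F"), ("70s", "M"), ("70s", "M"), ("60s", "F"), ("80s", "F")]
print(match_risks(orig, syn))
```

Dividing each count by the number of target records gives proportions of the kind reported in the talk.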


Disclosure risk . . .

Compute probabilities of identification under three sets of quasi-identifiers.

(i) Set 1: Age, sex, marital status, race

(ii) Set 2: Set 1 + income level + education

(iii) Set 3: Set 2 + disease stage + PDCRID


Disclosure risk estimates, lung cancer cohort

Using Set 2 of identifiers:

Synthetic data:

• Expected match risk: 217/5000 = 4.3%
• Maximum risk: 925/5000 = 18.5%
• Total match risk: 88/5000 = 1.8%

Observed data:

• Expected match risk: 1770/5000 = 35.4%
• Maximum risk: 5000/5000 = 100%
• Total match risk: 989/5000 = 19.8%


Calculating disclosure risk

Expected match rate, identifier set 2, lung cancer:

• Total Match Risk: 88 cases where original record correctly and uniquely matched to synthetic data record.

• Maximum risk: 1158 instances where a true record was among identical sets of identifiers in 5 synthetic data sets. Corresponds to 958 unique records

• Expected matches: Expected number of matches if intruder guessed randomly among potential set of matches. This is a weighted average first within then across synthetic data sets.


Disclosure risk by set of identifiers

Table: Disclosure risk - CanCORS lung cancer partially synthetic data

Quasi-identifier set   EMR             TMR
                       Obs.    Syn.    Obs.    Syn.
1                      70      38      0       0
2                      1770    217     989     88
3                      4495    890     4000    717

EMR = expected match risk; TMR = total match risk.


Data utility assessment

Characterize the quality of what can be learned from the synthetic data, relative to what can be learned from the observed data set.

• Confidence interval overlap (Karr et al. (2006))
• Coverage error due to bias (Cochran (1977))
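
A minimal sketch of the interval overlap measure, assuming the usual form attributed to Karr et al. (2006): the length of the intersection of the two intervals, expressed as a fraction of each interval's length, averaged over the two. The function name and example intervals are illustrative.

```python
def ci_overlap(obs, syn):
    """Average relative overlap of two confidence intervals.

    obs, syn: (lower, upper) limits from the observed and synthetic data.
    Returns 1.0 for identical intervals; the value is negative when the
    intervals do not overlap at all (some authors truncate it at 0).
    """
    (lo_obs, hi_obs), (lo_syn, hi_syn) = obs, syn
    lo_int = max(lo_obs, lo_syn)   # lower limit of the intersection
    hi_int = min(hi_obs, hi_syn)   # upper limit of the intersection
    width = hi_int - lo_int
    return 0.5 * (width / (hi_obs - lo_obs) + width / (hi_syn - lo_syn))


# Hypothetical intervals for one regression coefficient
print(ci_overlap((-0.9, -0.1), (-0.7, 0.3)))
```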


Data utility

Observed versus synthetic data inferential results

Analytical comparisons based on published studies which analyzed the observed data set.

Reference papers:

• Huskamp et al. (2009). Discussions with Physicians about Hospice among Patients with Metastatic Lung Cancer. Arch. Intern. Med., 169(10), 954-962.

• Keating et al. (2010). Cancer patients’ roles in treatment decisions: do characteristics of the decision influence roles? Journal of Clinical Oncology, 28:4364-4370.


Data utility - analytic results comparison

Table: Descriptive characteristics and estimated probabilities of hospice discussion by race, unadjusted for other covariates. (Standard errors in parentheses)

Characteristic    Patients %        Discussed Hospice %         p-value
                  Obs.    Syn.      Obs.          Syn.          Obs.       Syn.
Race/ethnicity                                                  < 0.001    0.62
  White           73.7    72.0      55.2 (1.3)    54.4 (1.5)
  Black           10.7    11.4      42.6 (1.3)    50.0 (4.1)
  Hispanic        5.9     5.7       40.4 (1.3)    45.8 (5.3)
  Asian           5.1     5.3       49.4 (1.3)    51.6 (6.0)
  Other           4.7     4.8       64.5 (1.2)    54.0 (6.9)


Data utility - Confidence interval overlap and coverage error

Table: Data utility for coefficient estimates in a logistic model for hospice discussion, unadjusted for other covariates.

Characteristic    |Bias(q_syn) / SE(q_syn)|   CI overlap   CI Cov. Error
Race/ethnicity
  Black           1.101                       0.767        0.196
  Hispanic        0.101                       0.831        0.051
  Asian           0.024                       0.803        0.050
  Other           1.479                       0.707        0.316
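
The coverage column can be related to the standardized bias with a short calculation. The sketch assumes a nominal 95% normal-theory interval and takes the "CI Cov. Error" to be the probability that the interval misses the target when the estimator is off by B standard errors, which is 1 − [Φ(1.96 − B) − Φ(−1.96 − B)]. That reading is an assumption; under it, the four standardized biases in the table give values that agree with the tabulated column to about three decimals.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def noncoverage(std_bias, z=1.96):
    """Probability that a nominal 95% interval misses the target.

    std_bias: |bias| / standard error of the estimator
    z: normal critical value for the nominal interval
    """
    b = abs(std_bias)
    coverage = normal_cdf(z - b) - normal_cdf(-z - b)
    return 1.0 - coverage


# Standardized biases taken from the table above
for label, b in [("Black", 1.101), ("Hispanic", 0.101),
                 ("Asian", 0.024), ("Other", 1.479)]:
    print(label, round(noncoverage(b), 3))
```

With no bias the function returns the nominal 5% miss rate, so values well above 0.05 flag coefficients whose synthetic-data intervals are unreliable.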


Data utility - uncongeniality

What explanation can be given for the change in conclusion of significance?

→ Uncongeniality (Meng, 1994)

"Analysts and imputer have access to different types of information and data, and assess and use the information and data in different ways"

That is, there are systematic differences between the imputation model and analysis procedure inputs.

For our study, the imputation model conditioned on the entire data set (5000 records) but the analysis procedure analyzed the subset of Stage IV lung cancer patients (1517 records).

→ If the imputation model does not capture all the important subgroup relationships, results from the synthetic data may be biased.


Data utility - uncongeniality

Re-ran imputation models conditional on Stage IV lung cancer patients only - large bias for ‘black’ race factor level is removed.

Table: Descriptive characteristics and estimated probabilities of hospice discussion by race, unadjusted for other covariates. (Standard errors in parentheses)

Characteristic    Patients %        Discussed Hospice %         p-value
                  Obs.    Syn.      Obs.          Syn.          Obs.       Syn.
Race/ethnicity                                                  < 0.001    0.003
  White           73.7    74.6      55.2 (1.3)    55.6 (1.5)
  Black           10.7    9.9       42.6 (1.3)    42.0 (4.2)
  Hispanic        5.9     6.1       40.4 (1.3)    40.4 (5.3)
  Asian           5.1     5.1       49.4 (1.3)    50.2 (5.8)
  Other           4.7     4.3       64.5 (1.2)    58.5 (6.5)


The Thread

Brief Overview of CanCORS

Multiple Imputation (MI) Primer

Unit Non-response: Using MI to ‘uncover’

Confidentiality: Using MI to Hide

Using CanCORS data


Currently, no public use data sets released.

Grant officially closes July 31, 2014.

Public web site (www.cancors.org/public) will be available with

• Full study and data documentation
• Bibliography of published papers
• All study instruments and protocols

Outside investigators will be urged to collaborate with members of CanCORS team
