Cancer Care Outcomes Research and Surveillance Consortium (CanCORS)
Using Multiple Imputation to both Uncover and Hide
Dave Harrington
Dana-Farber Cancer Institute, Harvard Statistics and Biostatistics
10 June 2014
Harrington (DFCI/Harvard) CanCORS 10 June 2014 1 / 42
Collaborators
Alan Zaslavsky, HMS
Yulei He, CDC
Bronwyn Loong, Australian National University
Paul Catalano, DFCI and HSPH
MaryBeth Landrum, HMS
Many members of the CanCORS Consortium
Outline . . .
Brief Overview of CanCORS
Multiple Imputation (MI) Primer
Unit Non-response: Using MI to ‘uncover’
Confidentiality: Using MI to Hide
Using CanCORS data
CanCORS Sites
(Map of participating sites, grouped by type of patient cohort.)
Patients from population-based cohorts in geographic areas:
• Northern California (8 counties in San Jose, San Francisco/Oakland, and Sacramento areas)
• Los Angeles County
• State of Iowa
• State of Alabama
• North Carolina (22 central/eastern counties)
Patients from integrated health-care delivery systems:
• Kaiser Permanente Hawaii (4 major islands)
• Kaiser Permanente Northwest (Portland metropolitan area)
• Group Health Cooperative (Seattle metropolitan area)
• Harvard Pilgrim Health Care (Boston metropolitan area)
• Henry Ford Health System (Detroit metropolitan area)
Patients at Veterans Health Administration hospitals:
• Biloxi, MS; Portland, OR; Durham, NC; Baltimore, MD; New York, NY; Nashville, TN; Atlanta, GA; Minneapolis, MN; Chicago, IL (Lakeside and Hines); Houston, TX; Seattle, WA; Indianapolis, IN
Enrollment Phase Data Collection
(Timeline figure.)
• Diagnosis: May 03 – Dec 05
• Enrollment and baseline patient interview: 3-6 months
• Follow-up interview: 12 months post diagnosis
• Medical record abstraction: 15 months post diagnosis
• Physician surveys, by role
• Caregiver surveys
Interview Instrument Types
• Full baseline
• Brief baseline
• Surrogate living and decedent
Second wave Phase Data Collection
(Flow diagram.)
• Alive 12 months from diagnosis
• Screening interview: patient or surrogate
• Branches: deceased or alive with advanced disease; disease free
• Instruments shown: advanced disease survey, survivor survey, physician surveys, medical record abstraction
Enrollment
10,061 patients with lung or colorectal cancer enrolled, with data from
• Patient interviews
• Physician surveys
• Medical record abstraction
2,995 patients recontacted in second wave
• Revisions of earlier instruments
• Assessing recurrence surveillance using electronic records in integrated health plans and Medicare Claims
CanCORS vs SEER Colorectal Cancer
(Bar chart: % of diagnosed cases, axis 0 to 70, CanCORS vs SEER, by category: Female, Hispanic, Black NH, Asian, Age < 50, Age 50-75, Age > 75, Stage I, Stage II, Stage III, Stage IV.)
(Catalano, Med Care, 2012)
Challenges in the Design and Analysis
• Uniform data collection/standards of analysis
• Unit non-response
• Item non-response
• Confidentiality
Multiple Imputation
Proposed by Rubin in 1977; Rubin’s 1996 JASA review is accessible.
• Schafer's 1999 SMMR paper is a primer.
• Multiple draws (usually 5 - 10) are used to create versions of the completed dataset.
• Draws are taken from the posterior predictive distribution of the missing data, given the observed data.
• Each completed dataset is analyzed using traditional complete-data methods.
• Variance estimates account for within- and between-imputation variability.
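The last two bullets can be made concrete with Rubin's combining rules. The sketch below is illustrative only: the function name `pool_rubin` and the five estimates are hypothetical, not CanCORS results. It pools m completed-data estimates and adds the within-imputation variance to an inflated between-imputation variance.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data results with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(variances)            # within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b    # total variance of the pooled estimate
    return q_bar, u_bar, b, total


# Hypothetical estimates and squared standard errors from m = 5 completed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.010, 0.010, 0.010, 0.010]
q_bar, u_bar, b, total = pool_rubin(estimates, variances)
print(q_bar, u_bar, b, total)
```

The factor (1 + 1/m) reflects the extra uncertainty from using a finite number of imputations; with m = 5 it inflates the between-imputation variance by 20%.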
Missingness in the surveys
Survey               No. Vars   Sample Size   Skips (%)        DO/DK/REF (%)
                                              Mean    Range    Mean    Range
Full Baseline        535        5763          50      0-99     1.23    0-37
Brief Baseline       156        1397          48      0-99     2.49    0-22
Surrogate Live       413        1083          50      0-99     1.61    0-28
Surrogate Deceased   473        1827          62      0-99     1.46    0-29
Follow-up            384        6087          65      0-99     0.55    0-33
DO/DK/REF = Drop Out/Don’t Know/Refused
Some complete case analyses drop approximately 30% of respondents.
Challenges to MI
• Several different data types
• 4 instruments used in two cancer types
• Block missingness because of missing domains in shorter surveys
• Extensive skip patterns in interview structure
• Data set periodically refreshed
• Diagnostics difficult
Imputation Strategy
• Use Sequential Regression Multiple Imputation (SRMI), implemented in the SAS module IVEWare (Raghunathan, 2001, Surv. Meth.).
• Imputation used to create 5 datasets for each of the two cancer types.
• Within a cancer, imputation based on dataset with instrument types combined.
• Skipped questions and missing blocks in shorter surveys imputed.
• Skips and block missingness restored after imputation.
• To increase congeniality, imputation did not use data available to the Coordinating Center but not available to data analysts.
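To show the idea behind sequential regression imputation, here is a deliberately small sketch for two continuous variables. It is not IVEWare and not the CanCORS procedure: the helper names (`fit_line`, `srmi_two_vars`), the toy data, and the use of fixed fitted coefficients (a full implementation would also draw the regression parameters from their posterior and handle categorical variables and skip patterns) are all simplifying assumptions made here.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / max(n - 2, 1)) ** 0.5


def srmi_two_vars(rows, n_iter=10, seed=0):
    """Return one completed copy of rows ([a, b] pairs, None = missing)."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    miss = [[v is None for v in r] for r in rows]
    # Starting values: fill each missing entry with the observed column mean.
    for j in (0, 1):
        obs = [r[j] for r in rows if r[j] is not None]
        for r in data:
            if r[j] is None:
                r[j] = sum(obs) / len(obs)
    # Cycle through the variables, regressing each on the other in turn.
    for _ in range(n_iter):
        for j in (0, 1):
            k = 1 - j
            xs = [r[k] for r, m in zip(data, miss) if not m[j]]
            ys = [r[j] for r, m in zip(data, miss) if not m[j]]
            a, b, sd = fit_line(xs, ys)
            for r, m in zip(data, miss):
                if m[j]:  # replace only entries that were originally missing
                    r[j] = a + b * r[k] + rng.gauss(0, sd)
    return data


# Hypothetical toy data with missing entries in both columns.
rows = [[1.0, 2.1], [2.0, None], [3.0, 6.2], [None, 8.1],
        [5.0, 9.8], [6.0, None], [7.0, 14.2], [None, 15.9]]
completed = [srmi_two_vars(rows, seed=s) for s in range(5)]  # m = 5 datasets
```

Each of the five completed datasets keeps the observed values untouched and differs only in the imputed entries, which is what lets the between-imputation variance reflect the uncertainty due to missingness.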
Diagnostics
• Examined changes in marginal means, pairwise correlation coefficients.
• Comparisons with complete case and missingness indicator analyses.
• Comparisons with weighted analyses: treat brief/surrogate surveys as item non-response, non-response weights calculated for participation in full survey.
Comparative Analyses of Hospice
Full analysis in Huskamp, et al., Arch Int Med (2009)
Did you participate in a discussion of hospice with your physician?
Table 5 Hospice care analysis results

Predictor (reference)             CC                      Missingness indicator   SRMI
                                  EST    SE    p-value    EST    SE    p-value    EST    SE    p-value
Intercept                         −1.97  0.51  0.00       −2.08  0.43  0.00       −2.17  0.42  0.00
Race (ref. White)
  Black                            0.03  0.21  0.90        0.03  0.17  0.86       −0.02  0.17  0.89
  Hispanic                        −0.51  0.31  0.10       −0.68  0.24  0.00       −0.66  0.24  0.01
  Asian                            0.24  0.29  0.42        0.20  0.25  0.41        0.21  0.25  0.40
  Other                            0.46  0.27  0.08        0.46  0.23  0.04        0.42  0.22  0.06
Marital status (ref. Married/Partner)
  Widowed                         −0.24  0.18  0.19       −0.10  0.15  0.50       −0.07  0.15  0.66
  Divorced/Separated               0.24  0.17  0.17        0.29  0.14  0.05        0.29  0.15  0.05
  Never married                    0.66  0.32  0.04        0.57  0.25  0.03        0.58  0.27  0.03
  Missing                                                  0.09  0.42  0.83
Received chemo (ref. No)
  Yes                             −1.25  0.12  0.00       −1.01  0.10  0.00       −1.01  0.10  0.00
  Missing                                                  0.07  0.58  0.91
Income (ref. <20 K)
  20 – 40 K                        0.12  0.15  0.45        0.13  0.14  0.35        0.12  0.14  0.41
  40 – 60 K                       −0.07  0.19  0.71       −0.04  0.18  0.83       −0.02  0.18  0.91
  60 K+                           −0.38  0.21  0.07       −0.32  0.19  0.09       −0.30  0.21  0.16
  Missing                                                 −0.15  0.20  0.45
Insurance (ref. Medicare)
  Medicaid                         0.09  0.31  0.77        0.26  0.27  0.33        0.28  0.25  0.26
  Private                          0.04  0.20  0.86        0.01  0.17  0.97        0.01  0.18  0.98
  Other                            0.81  0.34  0.02        0.91  0.28  0.00        0.85  0.28  0.00
  Missing                                                  0.19  0.23  0.41
Myocardial infarction (ref. No)
  Yes                             −0.41  0.17  0.02       −0.24  0.15  0.11       −0.23  0.16  0.13
  Missing                                                 −0.62  0.56  0.27
Depression (ref. No)
  Yes                              0.47  0.14  0.00        0.36  0.12  0.00        0.35  0.13  0.01
  Missing                                                 −0.31  0.58  0.59
Deceased within 1 year of dx (ref. No)
  Yes                              1.85  0.16  0.00        1.59  0.13  0.00        1.69  0.13  0.00
Cancer stage (ref. Stage IIIB)
  Stage IV                         0.72  0.13  0.00        0.76  0.11  0.00        0.77  0.11  0.00

Note: CC = Complete-case analysis; SRMI = Sequential regression multiple imputation.
involved in specific analyses. In the hospice subsample and assuming MAR, we performed posterior predictive checking (refs. 26−31) to examine the deviation of selected analysis results Q computed from the completed data with imputations compared to the values of Q calculated from simulated copies of the completed data under the model.
Table is small piece of full logistic regression.
Complete account in He, et al. 2010, SMMR
Weighted analyses: QoL scores
Table 7 Estimates of quality-of-life scales

Sample       Scale        CC             Weighting      SRMI
                          Mean   SE      Mean   SE      Mean   SE
All          EORTC–LC     79     0.31    77     0.58    75     0.26
             EORTC–QLQ    77     0.39    74     0.60    71     0.28
             DYSPNEA      63     0.59    59     0.96    59     0.50
Stage I/II   EORTC–LC     82     0.48    81     0.69    78     0.42
             EORTC–QLQ    80     0.60    79     0.83    75     0.51
             DYSPNEA      64     0.95    62     1.34    61     0.73
Stage IV     EORTC–LC     78     0.62    76     1.07    73     0.40
             EORTC–QLQ    75     0.81    73     1.18    68     0.43
             DYSPNEA      63     1.23    59     2.00    58     0.82

Notes: CC = Complete-case analysis; SRMI = Sequential regression multiple imputation.
EORTC–LC (European Organization for Research and Treatment of Cancer, lung cancer specific): coughing, sore mouth, trouble swallowing and pain.
EORTC–QLQ (quality of life questionnaire): insomnia, nauseated, vomited, etc.
DYSPNEA: short of breath when resting, walking or climbing stairs.
to the limited predictive power of the non-response model predictors we selected and the large variation of the weights from the heterogeneous sample. We would expect the advantage of SRMI over weighting to be even more pronounced in an analysis that is primarily based on variables with no block missingness, since it makes better use of all the available information on those variables.
6 Discussion and future directions
This paper demonstrates the use of multiple imputation to construct a centralised database for multiple users with different analytic objectives. As health or social policy decisions are often based on analyses of large incomplete databases, our work may serve as a template for similar applications. By showing the details of implementation, we aim to encourage practitioners to apply multiple imputation for complex large-scale data sets.
SRMI is used to tackle the challenges arising from the complex survey data in this work, and we believe it is a suitable solution for other missing data problems with similar scale and complexity. However, we have still experienced a variety of challenging issues in implementation. These problems include how to build models which can effectively exclude unused block non-responses or skipped items, how to do model selection in the presence of a large number of predictors, how to decide when to stop the chain and collect imputations, and how to perform model validation and diagnostics. We adopted some ad-hoc solutions, but we caution against the mechanical use of SRMI despite its increasing popularity. Echoing other literature in this area (refs. 7−9), we call for more methodological research on the properties of SRMI which would help answer the above questions. Given current limitations on the practical and theoretical development of SRMI, we also encourage practitioners to do sensitivity analysis using
Challenges in preserving confidentiality with publicly available healthcare microdata
• Minimize disclosure risk.
  • Risk of participant re-identification by a malicious user should be acceptably low.
• Maintain data utility.
  • Modifications to the observed data set should maintain variable relationships that are both clinically plausible and reproduce (approximately) relationships in the data.
• Low maintenance for Coordinating Center
Background on partially synthetic data
Idea: Replace the observed values for sensitive variables or key identifiers with synthetic data. Do not release original values of these variables.
• First proposed by Rubin (1993) based on the concepts of multiple imputation.
• Synthetic values are created using samples drawn from the posterior predictive distribution of target population responses given the observed data set.
• Analysts can draw approximately valid inferences about the target population of interest using standard methods for multiply imputed data.
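As background to the last bullet: the combining rules usually cited for partially synthetic data are those of Reiter (2003). They keep the same pooled point estimate as ordinary multiple imputation but use a different total variance. The slides do not say which formula was applied to CanCORS, so this is offered as context rather than as the authors' method. The symbols q^{(j)} (estimate from synthetic data set j) and u^{(j)} (its estimated variance) are notation introduced here.

```latex
\bar{q}_m = \frac{1}{m}\sum_{j=1}^{m} q^{(j)}, \qquad
\bar{u}_m = \frac{1}{m}\sum_{j=1}^{m} u^{(j)}, \qquad
b_m = \frac{1}{m-1}\sum_{j=1}^{m}\left(q^{(j)} - \bar{q}_m\right)^2, \qquad
T_p = \bar{u}_m + \frac{b_m}{m}
```

Compared with the missing-data rule, where the between-imputation term is multiplied by (1 + 1/m), the between-synthesis term here enters only as b_m / m.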
Synthetic data in practice
• Survey of Consumer Finances
• American Community Survey
• Survey of Income and Program Participation
• US Longitudinal Business Database
• Longitudinal Employer-Household Dynamics
• German IAB Establishment Panel
Identification of variables to synthesize
• Direct identifiers: name, postal address, SSN: not available to internal and external investigators, so not synthesized.
• Quasi-identifiers: Age, education, sex, marital status, race.
• Clinical: Not synthesized; complex structure - difficult to synthesize, accessible by insiders only (not wider population); subject to judgemental variation and incomplete medical record acquisition.
• Sensitive: All information gathered can be considered sensitive; full synthesis not practical and not useful to data analysts.
Details reported in Loong, et al. (2013) Stat. Med.
Imputation models
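$$ f\left(Y^{(j)}_{\text{age}},\, Y^{(j)}_{\text{educ}},\, Y^{(j)}_{\text{marstat}},\, Y^{(j)}_{\text{race}},\, Y^{(j)}_{\text{sex}} \,\middle|\, Y_0, S_0\right) \qquad (1) $$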
Not practical to draw directly from the conditional joint distribution of identifiers, given observed data.
We used sequential regression multiple imputation here as well.
Parametric approach to imputation - logistic regression model to impute sex, multinomial logit models for other variables.
m = 5 partially synthetic data sets created
Imputations done in R package MI for this project.
Imputation models
• Select predictor variables using stepwise regression within each of 12 sections of survey.
• On average ≈ 50 predictors for each variable to be imputed.
• No interactions, main effects only - reduce risk of overfitting (which increases disclosure risk), reduce computational burden and avoid multicollinearity and separation issues.
• Variables included in all imputation models: survey disposition code, stage, histology, vital status, study site.
Disclosure risk
• Identification disclosure risk: potential identification of sampled units in the released data.
• Inferential disclosure risk: inference of new information about a known participant in the survey, e.g., all participants with the same identifiers have the same income.
Inferential disclosure risk is a difficult problem that is still not well understood. We focus on identification disclosure risk assessment.
Identification disclosure risk assessment
Duncan and Lambert risk framework (1989)
• Mimic the behavior of an ill-intentioned public user (an intruder) who has the true values of unique or quasi-identifiers for select target units and wants to identify the records in the synthetic data that have matching identifier values.
• We investigated 3 sets of quasi-identifying values representing varying assumed levels of intruder information:
(i) Set 1: Age, sex, marital status, and race
(ii) Set 2: Set 1 + education + income level
(iii) Set 3: Set 2 + disease stage + study site
Identification disclosure risk assessment
A potential identification risk for target record i occurs when its quasi-identifying values match the corresponding values for a record k in synthetic data set j.
Some assumptions/definitions about risk of identification:
• The intruder knows the target is in the survey and the quasi-identifiers of all units in the population.
• If target record has no matching quasi-identifiers among the m imputed data sets, risk of identification is zero.
• Expected match risk: probability of correct match if intruder randomly guesses the match from the candidates with same quasi-identifiers.
• Maximum match risk: intruder always correctly identifies the record from among potential candidates.
• Total match risk: probability that a record is correctly and uniquely identified in synthetic data.
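One way to read the three definitions above is as counts over target records in a single synthetic data set. The sketch below follows that reading. The function name `match_risks`, the toy records, and the assumption that row i of the synthetic file corresponds to row i of the original file are illustrative choices made here, not details taken from the CanCORS analysis (which also averaged across the m synthetic data sets).

```python
def match_risks(original, synthetic, keys):
    """Return (expected, maximum, total) match counts for one synthetic data set.

    original[i] and synthetic[i] are dicts describing the same person i.
    """
    expected = maximum = total = 0.0
    for i, target in enumerate(original):
        wanted = tuple(target[k] for k in keys)
        # Synthetic records whose quasi-identifiers equal the target's true values.
        candidates = [j for j, rec in enumerate(synthetic)
                      if tuple(rec[k] for k in keys) == wanted]
        if i in candidates:                   # the true record is among the matches
            expected += 1 / len(candidates)   # intruder guesses at random
            maximum += 1                      # intruder always picks correctly
            if len(candidates) == 1:
                total += 1                    # correct and unique match
    return expected, maximum, total


# Hypothetical records with two quasi-identifiers.
original = [{"age": 60, "sex": "F"}, {"age": 60, "sex": "F"},
            {"age": 70, "sex": "M"}, {"age": 55, "sex": "M"}]
synthetic = [{"age": 60, "sex": "F"}, {"age": 65, "sex": "F"},
             {"age": 70, "sex": "M"}, {"age": 70, "sex": "M"}]
risks = match_risks(original, synthetic, keys=("age", "sex"))
print(risks)
```

Dividing each count by the number of target records gives risks on the percentage scale used in the slides that follow.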
Disclosure risk . . .
Compute probabilities of identification under three sets of quasi-identifiers.
(i) Set 1: Age, sex, marital status, race
(ii) Set 2: Set 1 + income level + education
(iii) Set 3: Set 2 + disease stage + PDCRID
Disclosure risk estimates, lung cancer cohort
Using Set 2 of identifiers:
Synthetic data:
• Expected match risk: 217/5000 = 4.3%
• Maximum risk: 925/5000 = 18.5%
• Total match risk: 88/5000 = 1.8%
Observed data:
• Expected match risk: 1770/5000 = 35.4%
• Maximum risk: 5000/5000 = 100%
• Total match risk: 989/5000 = 19.8%
Calculating disclosure risk
Expected match rate, identifier set 2, lung cancer:
• Total match risk: 88 cases where the original record was correctly and uniquely matched to a synthetic data record.
• Maximum risk: 1158 instances where a true record was among identical sets of identifiers in 5 synthetic data sets. Corresponds to 958 unique records.
• Expected matches: expected number of matches if the intruder guessed randomly among the potential set of matches. This is a weighted average, first within then across synthetic data sets.
Disclosure risk by set of identifiers
Table: Disclosure risk - CanCORS lung cancer partially synthetic data

Quasi-identifier set   EMR             TMR
                       Obs.    Syn.    Obs.    Syn.
1                      70      38      0       0
2                      1770    217     989     88
3                      4495    890     4000    717

EMR = expected match risk; TMR = total match risk.
Data utility assessment
Characterize the quality of what can be learned from the synthetic data, relative to what can be learned from the observed data set.
• Confidence interval overlap (Karr et al. (2006))
• Coverage error due to bias (Cochran (1977))
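The confidence interval overlap measure of Karr et al. (2006) averages the fraction of each interval covered by their intersection. A minimal sketch follows; the function name `ci_overlap` and the example intervals are hypothetical.

```python
def ci_overlap(obs, syn):
    """Karr et al. (2006) overlap of two intervals given as (lower, upper)."""
    lower = max(obs[0], syn[0])
    upper = min(obs[1], syn[1])
    width = upper - lower  # negative when the intervals do not intersect
    return 0.5 * (width / (obs[1] - obs[0]) + width / (syn[1] - syn[0]))


# Hypothetical 95% intervals from the observed and the synthetic data.
same = ci_overlap((0.0, 1.0), (0.0, 1.0))
shifted = ci_overlap((0.0, 1.0), (0.5, 1.5))
print(same, shifted)
```

A value of 1 means the two intervals coincide. Values near 0 (or negative, in this unclipped form) mean the synthetic-data interval tells a different story from the observed-data interval; some authors truncate negative values at 0.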
Data utility
Observed versus synthetic data inferential results
Analytical comparisons based on published studies which analyzed the observed data set.
Reference papers:
• Huskamp et al. (2009). Discussions with Physicians about Hospice among Patients with Metastatic Lung Cancer. Arch. Intern. Med., 169(10), 954-962.
• Keating et al. (2010). Cancer patients’ roles in treatment decisions: do characteristics of the decision influence roles? Journal of Clinical Oncology, 28:4364-4370.
Data utility - analytic results comparison
Table: Descriptive characteristics and estimated probabilities of hospice discussion by race, unadjusted for other covariates. (Standard errors in parentheses)

Characteristic    Patients %        Discussed Hospice %           p-value
                  Obs.    Syn.      Obs.          Syn.            Obs.       Syn.
Race/ethnicity                                                    < 0.001    0.62
  White           73.7    72.0      55.2 (1.3)    54.4 (1.5)
  Black           10.7    11.4      42.6 (1.3)    50.0 (4.1)
  Hispanic        5.9     5.7       40.4 (1.3)    45.8 (5.3)
  Asian           5.1     5.3       49.4 (1.3)    51.6 (6.0)
  Other           4.7     4.8       64.5 (1.2)    54.0 (6.9)
Data utility - Confidence interval overlap and coverage error
Table: Data utility for coefficient estimates in a logistic model for hospice discussion, unadjusted for other covariates.

Characteristic    |Bias(q_syn) / SE(q_syn)|    CI overlap    CI Cov. Error
Race/ethnicity
  Black           1.101                        0.767         0.196
  Hispanic        0.101                        0.831         0.051
  Asian           0.024                        0.803         0.050
  Other           1.479                        0.707         0.316
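The coverage error column can be related to the standardized bias column through the normal-theory argument attributed to Cochran (1977): if an estimator is normal with bias B and standard error SE, a nominal 95% interval misses the true value with probability 1 − [Φ(1.96 − |B|/SE) − Φ(−1.96 − |B|/SE)]. The sketch below applies that formula to the standardized biases in the table. The function names are ours, and treating the table's last column as this quantity is an assumption, although the computed values appear to agree with it to about three decimals.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def coverage_error(std_bias, z=1.96):
    """Probability that a nominal 95% interval misses the true value."""
    b = abs(std_bias)
    coverage = normal_cdf(z - b) - normal_cdf(-z - b)
    return 1.0 - coverage


# Standardized biases |Bias / SE| as reported in the table above.
std_bias = {"Black": 1.101, "Hispanic": 0.101, "Asian": 0.024, "Other": 1.479}
errors = {group: coverage_error(value) for group, value in std_bias.items()}
for group, err in errors.items():
    print(group, round(err, 3))
```

With zero bias the error equals the nominal 5%, so values such as those for the Black and Other categories indicate intervals that would miss far more often than advertised.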
Data utility - uncongeniality
What explanation can be given for the change in conclusion of significance?
→ Uncongeniality, Meng (1994)
"Analysts and imputer have access to different types of information and data, and assess and use the information and data in different ways"
That is, there are systematic differences between the imputation model and analysis procedure inputs.
For our study, the imputation model conditioned on the entire data set (5000 records) but the analysis procedure analyzed the subset of Stage IV lung cancer patients (1517 records).
→ If the imputation model does not capture all the important subgroup relationships, results from the synthetic data may be biased.
Data utility - uncongeniality
Re-ran imputation models conditional on Stage IV lung cancer patients only - large bias for ‘black’ race factor level is removed.
Table: Descriptive characteristics and estimated probabilities of hospice discussion by race, unadjusted for other covariates. (Standard errors in parentheses)

Characteristic    Patients %        Discussed Hospice %           p-value
                  Obs.    Syn.      Obs.          Syn.            Obs.       Syn.
Race/ethnicity                                                    < 0.001    0.003
  White           73.7    74.6      55.2 (1.3)    55.6 (1.5)
  Black           10.7    9.9       42.6 (1.3)    42.0 (4.2)
  Hispanic        5.9     6.1       40.4 (1.3)    40.4 (5.3)
  Asian           5.1     5.1       49.4 (1.3)    50.2 (5.8)
  Other           4.7     4.3       64.5 (1.2)    58.5 (6.5)
Currently, no public use data sets released.
Grant officially closes July 31, 2014.
Public web site (www.cancors.org/public) will be available with
• Full study and data documentation
• Bibliography of published papers
• All study instruments and protocols
Outside investigators will be urged to collaborate with members of the CanCORS team.