Predictive Markers and Predictive Models in Translational Research Kerby Shedden Department of Statistics University of Michigan [email protected] April 23, 2010


  • Predictive Markers and Predictive Models in Translational Research

    Kerby Shedden
    Department of Statistics
    University of Michigan
    [email protected]

    April 23, 2010

  • High level goals

    - Advance personalized medicine – the use of diagnostic/prognostic tests and individually targeted therapies.

    - Discovery
        - Identification of markers and targets.
        - Identification of distinct disease subtypes or spectra.
        - Identification of mechanistic interactions.

    - Prediction
        - Improve the performance of diagnostic and prognostic tests.
        - Characterize the performance of diagnostic/prognostic tests; define the settings in which they can be used effectively.

  • Outcomes

    General types of outcomes:

    - Presence of disease

    - Disease type (or subtype)

    - Rate of disease progression

    - Potential for treatment response

    - Time to an adverse event

    Some measurement issues:

    - Measurement scale (e.g. quantitative/qualitative)

    - Fully observed and partially observed outcomes (censoring)

    - Error-free versus error-prone measurements

  • Factors potentially influencing outcomes

    Some broad categories:

    - Molecular (more later)

    - Environmental
        - Exposures to potentially harmful factors (e.g. environmental chemicals)
        - Social environment

    - Behavioral
        - Diet, physical activity
        - Risky behaviors

    These are dependent, overlapping categories.

    Interest may be in direct effects, synergistic effects, or interactions.

    Factors in different categories have differing measurement properties.

  • Molecular assay data

    Measurements of a collection of molecular characteristics on individual subjects/patients.

    How many distinct things can be measured per subject?

    - 10^1 to 10^3: ELISA, PCR, Western blots, tissue microarray, 2D gel electrophoresis, chromatography.

    - 10^2 to 10^4: Gene expression microarrays, mass spectrometry.

    - 10^5 to 10^6: SNP genotyping, copy number variations, epigenetic characteristics, regional sequencing.

    - 10^6 to 10^9: Whole genome sequencing.

  • What do the assay data typically look like?

    [Figure: data matrix of numeric assay values (for example -0.88, -0.28, -0.85, 1.65, ...), with axes labeled Markers and Subjects.]

  • Study design

    - Sample size, number of factors being considered
        - Size of "p" versus "n"

    - Study goals, nature of effects of interest.

    - Relationship of study subjects to population of interest.
        - Representativeness, generalizability

    - Control over sources of variation of secondary interest.

    - Measurement and other data collection issues.
        - Systematic and random measurement error

    - Statistical power

  • Screening analyses

    Typically focused on marginal effects:

    - For a particular marker M, how much does the outcome differ on average if we compare subjects with marker values M = m1 to subjects with marker values M = m2?

    - Linear effect: The outcome difference when comparing M = m to M = m + δ depends on δ but not on m.

    - No control for changes in other markers.
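    A minimal sketch of a marginal screen, under assumptions not taken from the slides: the outcome is continuous, each marker is fit by simple linear regression, and the data are simulated. The function name `marginal_screen` is hypothetical. For a survival outcome, a per-marker proportional hazards fit would take the place of the regression.

```python
import numpy as np

def marginal_screen(X, y):
    """Return per-marker OLS slopes, standard errors, and Z-scores.

    X has shape (p, n): one row per marker, one column per subject.
    Each marker is analyzed alone, with no adjustment for other markers.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)   # center each marker
    yc = y - y.mean()                        # center the outcome
    sxx = (Xc ** 2).sum(axis=1)
    slope = Xc @ yc / sxx                    # simple-regression slope per marker
    resid = yc[None, :] - slope[:, None] * Xc
    sigma2 = (resid ** 2).sum(axis=1) / (n - 2)
    se = np.sqrt(sigma2 / sxx)
    return slope, se, slope / se

# Simulated data (an assumption for illustration): only marker 0 affects the outcome.
rng = np.random.default_rng(0)
p, n = 200, 100
X = rng.normal(size=(p, n))
y = 1.5 * X[0] + rng.normal(size=n)
slope, se, z = marginal_screen(X, y)
top = int(np.argmax(np.abs(z)))
print(top, round(float(slope[top]), 2))
```

    The slope for each marker is the linear effect described above: the average outcome difference per unit difference in that marker, ignoring all other markers.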

  • Marginal and joint effects

    Suppose markers 1 and 2 are related as follows:

    [Figure: scatter plot of marker 2 against marker 1; both axes run from -3 to 3.]

    Also suppose that:

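    - When marker 1 is held fixed and marker 2 is increased by 1 unit, the outcome increases by 2 units on average.

    - When marker 2 is held fixed and marker 1 is increased by 1 unit, the outcome increases by 1 unit on average.

    What is the marginal effect of a 1 unit difference in marker 1?

    Answer: 1 + 0.5 × 2 = 2. The direct effect of marker 1 is 1, and the factor 0.5 is the average difference in marker 2 that accompanies a 1 unit difference in marker 1, which contributes a further 0.5 × 2 = 1 through marker 2.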
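    The arithmetic can be checked by simulation. This is a sketch under stated assumptions: the markers are standard normal with marker 2 differing by 0.5 units on average per unit of marker 1, the outcome is linear in both markers, and the noise levels are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
m1 = rng.normal(size=n)
# Assumption: marker 2 differs by 0.5 units on average per unit of marker 1.
m2 = 0.5 * m1 + rng.normal(scale=np.sqrt(0.75), size=n)
# Joint effects from the slide: 1 per unit of marker 1, 2 per unit of marker 2.
y = 1.0 * m1 + 2.0 * m2 + rng.normal(size=n)

# Marginal effect: regress the outcome on marker 1 alone.
marginal = np.polyfit(m1, y, 1)[0]

# Joint effects: regress the outcome on both markers together.
design = np.column_stack([np.ones(n), m1, m2])
joint = np.linalg.lstsq(design, y, rcond=None)[0]

print(round(float(marginal), 2), np.round(joint[1:], 2))
```

    The one-marker regression recovers a slope near 2, while the two-marker regression recovers the joint effects near 1 and 2.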


  • A lung cancer prognosis study

    - Lung adenocarcinoma

    - ∼500 subjects in four treatment centers.

    - Interested in gene expression markers and predictive models for overall survival.

    Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature Medicine 2008, 14(8):822-7.

  • Hazard ratios

    The hazard at time t is

    $$H(t) \approx \frac{P(S < t + \delta \mid S \ge t)}{\delta}$$

    where S is the time to the event (which may be observed or censored).

    If $H_1(t)$ and $H_2(t)$ are the hazards in two groups of subjects at time t, then

    $$\frac{H_1(t)}{H_2(t)}$$

    is the hazard ratio.

    In a proportional hazards model this ratio does not depend on time.
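    A small numerical sketch of the hazard definition, assuming exponential event times with no censoring (rates 0.2 and 0.1 are arbitrary choices, and `empirical_hazard` is a hypothetical helper). Exponential times satisfy proportional hazards, so the estimated ratio should stay near 2 at every time point.

```python
import numpy as np

def empirical_hazard(s, t, delta):
    """Approximate H(t) by P(S < t + delta | S >= t) / delta from event times s."""
    at_risk = s >= t
    events = at_risk & (s < t + delta)
    return events.sum() / at_risk.sum() / delta

rng = np.random.default_rng(2)
n = 200_000
# Assumption: exponential event times, no censoring, hazards 0.2 and 0.1.
s1 = rng.exponential(scale=1 / 0.2, size=n)
s2 = rng.exponential(scale=1 / 0.1, size=n)

# Estimate the hazard ratio at several time points.
ratios = [empirical_hazard(s1, t, 0.5) / empirical_hazard(s2, t, 0.5)
          for t in (0.0, 2.0, 4.0)]
print([round(float(r), 2) for r in ratios])
```

    With censored data the at-risk set would also have to exclude subjects censored before t, which this sketch does not handle.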

  • Marginal effect estimates

    [Figure: marginal effect estimates by gene for Institution 1, shown as log hazard ratios (axis from -1.5 to 1.5) and as hazard ratios (axis from 0.0 to 3.0).]

  • Significance levels

    Z-score = effect estimate/standard error

    [Figure: Z-score by gene for Institution 1; Z-score axis from -4 to 4.]

    Bonferroni threshold for 13,830 genes at level 0.05 is 4.6.
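    The threshold can be reproduced as a normal quantile. This sketch assumes a two-sided test on each Z-score, which is the convention that matches the stated value; the function name is hypothetical.

```python
from statistics import NormalDist

def bonferroni_z_threshold(n_tests, alpha=0.05):
    """Two-sided Z-score cutoff controlling the family-wise error rate at alpha."""
    per_test_tail = alpha / n_tests / 2      # split alpha across tests and two tails
    return NormalDist().inv_cdf(1 - per_test_tail)

z_all = bonferroni_z_threshold(13830)        # all measured genes
z_half = bonferroni_z_threshold(13830 // 2)  # half as many tests
print(round(z_all, 2), round(z_half, 2))
```

    Halving the number of tests lowers the cutoff only slightly, because the normal tail probability falls off very quickly at these Z values.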


  • Significance levels

    Top half of measured markers in terms of variability:

    Bonferroni threshold for 6,915 genes at level 0.05 is 4.5.

    [Figure: Z-score by gene for Institution 1, restricted to the top half of markers by variability; Z-score axis from -4 to 4.]

  • Hazard ratio standard error

    [Figure: two scatter plots for Institution 1 with the hazard ratio standard error (SE, 0.0 to 0.4) on the horizontal axis. The vertical axis is SD (0.0 to 4.0) in the first panel and 1/SD in the second.]

  • Global null hypothesis and attributing effects

    - The null hypothesis for marker i is that there is no marginal association between marker i and the outcome.

    - The "global null" is that all markers have zero marginal effect.

    - We can reject the global null even when every marker's null hypothesis is not rejected: something is going on, but we can't attribute it to a specific marker.

  • Overall assessment of significance levels

    We can strongly reject the global null:

    [Figure: normal quantile plot of the Institution 1 Z-scores; both axes run from -4 to 4.]

    In the four institutions:

    var(Z) = 1.57, 1.24, 1.71, 1.59
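    Under the global null each Z-score is approximately standard normal, so var(Z) should be close to 1. The sketch below contrasts simulated null Z-scores with simulated Z-scores whose variance is 1.57. It assumes independent, normally distributed Z-scores, which real gene-level Z-scores are not, so it illustrates the diagnostic rather than reproducing the analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes = 13830

# Assumption: independent Z-scores; real gene-level Z-scores are correlated.
z_null = rng.normal(size=n_genes)                       # global null: var(Z) = 1
z_alt = rng.normal(scale=np.sqrt(1.57), size=n_genes)   # inflated variance

for name, z in (("null", z_null), ("inflated", z_alt)):
    # Sample variance and the largest absolute Z-score in each scenario.
    print(name, round(float(z.var(ddof=1)), 2), round(float(np.abs(z).max()), 2))
```

    A variance well above 1 signals an excess of large Z-scores across the whole set of markers, even when few or none of them pass a Bonferroni cutoff individually.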

  • Stratify

    [Figure: normal quantile plots of the Institution 1 Z-scores in four panels, Quartile 1 through Quartile 4; all axes run from -4 to 4.]

  • Overall assessment of significance levels

    [Figure: density of the Institution 1 Z-scores; horizontal axis from -6 to 6, density axis from 0.00 to 0.40.]

  • Generalizability of significance levels

    [Figure: Z-score by gene in two panels, Institution 1 and Institution 2; Z-score axes from -4 to 4.]

  • Generalizability of marginal effect estimates and Z-scores

    [Figure: two scatter plots of Institution 2 against Institution 1. First panel: log hazard ratios, axes from -2 to 2. Second panel: Z-scores, axes from -4 to 4.]

  • Results of 2002 study

    818 NATURE MEDICINE • VOLUME 8 • NUMBER 8 • AUGUST 2002

    ARTICLES

    Immunoreactivity for both IGFBP-3 and HSP-70 (Fig. 2c) was detected in the cytoplasm of the adenocarcinomas, with little detectable reactivity in the stromal or inflammatory cells. Cystatin C was detected in alveolar pneumocytes and intra-alveolar macrophages in non-neoplastic lung parenchyma and also consistently in the cytoplasm of neoplastic cells.

    Gene-expression profiles predict survival

    As expected, Kaplan–Meier survival curves (Fig. 3a) and log-rank tests indicated poorer survival among stage III compared with stage I adenocarcinomas (P =

  • Results of 2006 study

    The New England Journal of Medicine

    N Engl J Med 355;6, www.nejm.org, August 10, 2006, p. 576

    We analyzed 25 samples from the ACOSOG Z0030 trial to validate the performance of the predictive model of recurrence based on the Duke training cohort. As was the case with the Duke cohort, for the ACOSOG Z0030 cohort, univariate and multivariate analyses showed that the meta-gene model was a significantly more accurate predictor (P1

    [Figure 3: Kaplan–Meier plots in two panels. Panel A: Lung Metagene Model. Panel B: Clinical Model.]

    Figure 3. Kaplan–Meier Survival Estimates for the Duke Training Cohort.

    Estimates based on predictions from the lung metagene model demonstrate the value of that approach (Panel A). Panel B shows the estimates based on the clinical model of prognosis, as well as those based on individual clinical characteristics (here, tumor diameter and stage of disease). A high risk of recurrence was defined as a probability of recurrence of more than 0.5, and a low risk of recurrence was defined as a risk of 0.5 or less. P values were obtained with the use of a log-rank test. Tick marks indicate patients whose data were censored by the time of last follow-up or owing to death.


    Genomic Strategy to Refine Prognosis in Early Non–Small-Cell Lung Cancer


    currence were submitted to a CALGB statistician for comparison with the true outcomes. Once again, univariate and multivariate analyses showed that the lung metagene model predicted outcome significantly better (P

  • Hazard ratios for all methods

    expression data on subsets of lung adenocarcinomas were generated by each of four different laboratories using a common platform and following a protocol previously demonstrated to be robust and reproducible (ref. 12). We considered four separate hypotheses: (i) gene expression alone can predict outcomes for all samples; (ii) gene expression and basic clinical covariates (stage, age, sex) can predict outcomes for all samples; (iii) gene expression alone can predict outcomes for stage 1 samples; and (iv) gene expression and basic clinical covariates can predict outcomes for stage 1 samples. Note that prediction on stage 1 samples is more difficult than on the full study set as these samples are relatively homogeneous. The consideration of clinical covariates is highly relevant as the basic variables considered here will always be available in practice, and gene expression–based prediction is relevant in practice only if it provides more information than these measures. We followed a strict protocol for the data collection, data analysis and performance evaluation phases of our study. Data generated at two sites were used as a training set and the results were validated using the independent data sets from the other two participating sites following a blinded protocol. The results from this study provide not only valid assessment of outcome prediction in the multi-institutional setting but also a rich data set for future analysis and provide an example of how large data sets can be generated and tested by cooperation and pooling of resources among many investigators.

    RESULTS

    Consortium and classifier development

    We collected a total of 442 lung adenocarcinomas with high-quality gene-expression data, pathological data and clinical information describing the severity of the disease at surgery and the clinical course of the disease after sampling. These samples, collected from six contributing treatment institutions, were grouped into four sets of data on the basis of the laboratory where samples were processed for microarray analysis. The distribution of several clinical variables for these four data sets is shown in Table 1. The first two data sets, UM and HLM, were released to members of the consortium for the development of classifiers appropriate for our four hypotheses. Details of our protocol for developing and evaluating classifiers are provided in the Supplementary Methods online.

    Eight classifiers producing either categorical or continuous risk scores were developed by investigators using the training data and were tested for effectiveness on the two remaining data sets (MSK and CAN/DF). Most of these classifiers incorporated techniques that have repeatedly been applied in gene expression–based prognosis and found to work well in at least some instances. As an overview, data reduction was carried out using gene clustering (method A), univariate testing (methods B, C, D, E, F, G) or on a mechanistic basis (method H). Final scoring and classification was based on penalized Cox regression modeling with gene cluster expression summaries as the features (method A), on expression of individual genes (method B), on principal components of gene expression (methods F, G), on membership in clusters defined by gene expression (methods C, D) or on a vote of single-gene classifiers (method H). A number of other factors, such as subselection of the training samples, gene filtering and data transformation, were handled in various ways as described in detail in Supplementary Methods. We note that all classifiers started with the same set of expression summaries processed using the DChip algorithm (ref. 14), so handling of the raw data was uniform across the methods.

    Classifier performance without and with clinical covariates

    The estimated hazard ratios for the risk scores produced by the eight prognosis methods, with 95% confidence intervals, are shown for the two validation sets in Figure 1. Hazard ratios substantially greater than 1.0 indicate that subjects in the validation set with high predicted risk had poor outcomes. Confidence intervals in Figure 1 and the corresponding P-values (Supplementary Results online) indicate which of the methods performed significantly better than expected by chance. As another performance measure, we calculated the concordance probability estimate (CPE), which measures how well the subject outcomes agree with the predicted risk scores. CPE values close to 0.5 indicate no concordance (poor predictivity); CPE values approaching 1.0 indicate strong concordance (good predictivity). On the basis of these measures, most of the classifiers performed well in at least some situations. Finally, for 3-year survival, we constructed receiver operating characteristics (ROC) curves for continuous predictors and tables of sensitivity and specificity estimates for categorical predictors (Supplementary Results).
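    To make the idea of concordance concrete, here is a minimal sketch of Harrell's C-index, a related pairwise concordance measure. It is swapped in for illustration and is not claimed to be the CPE estimator used in the excerpt. The data are hypothetical. Like the CPE, it is near 0.5 for uninformative risk scores and near 1.0 for scores that order the outcomes well.

```python
def harrell_c(times, events, risk):
    """Harrell's C: fraction of usable pairs where higher risk has the earlier event."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when subject i has an observed event before time j.
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5    # ties in risk count as half
    return concordant / usable

# Hypothetical follow-up data: 1 = death observed, 0 = censored.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
good = [0.9, 0.7, 0.5, 0.3, 0.1]   # higher predicted risk for earlier deaths
print(harrell_c(times, events, good), harrell_c(times, events, good[::-1]))
```

    A perfectly ordered risk score gives 1.0 and the reversed score gives 0.0; censored subjects only enter pairs as the later member, since their true event order against later times is unknown.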

    There are some notable observations about the classifiers as a group. Most methods performed much better on sample sets containing all stages compared to sample sets containing just stage 1. This reflects an ability to stratify by stage even when stage is not explicitly included in the model. Including clinical covariates improved the performance of most of the models. In fact, without clinical covariates, no model achieved a hazard ratio significantly greater than 1 in both validation sets for the stage 1 samples. An important criterion was that a model should perform well in both validation sets as an indication of robust performance in routine clinical testing. For prediction on all stages

    Table 1 Summary statistics of data

                                 UM        HLM       CAN/DF    MSK
    Sample size                  177       79        82        104
    Age (mean, s.d.)             64 (10)   67 (10)   61 (10)   65 (10)
    Sex (% male)                 56%       51%       56%       36%
    Stage I                      66%       54%       68%       61%
    Stage II                     16%       26%       32%       19%
    Stage III                    18%       19%       0%        20%
    Median follow-up (months)    54        39        40        43
    Number of deaths             75        50        28        34

    [Figure 1: hazard ratio plots (horizontal axis from 0 to 8) for the MSK and CAN/DF test sets, for all stages and for stage 1 only, with and without covariates.]

    Figure 1 Classifier performance. Hazard ratios of methods A–I (see Introduction and Supplementary Methods) on validation data sets for the four hypotheses, along with 95% confidence intervals. Methods that placed patients into ordered categories rather than providing a continuous risk score are denoted "cat".

    NATURE MEDICINE, VOLUME 14, NUMBER 8, AUGUST 2008, p. 823. ©2008 Nature Publishing Group, http://www.nature.com/naturemedicine

  • Results of 2008 study

    using gene expression data, only methods A and H performed with consistent statistical significance. For prediction on all stages using both gene expression and clinical covariates, methods A and B produced hazard ratios exceeding 2 for both validation sets. For prediction on subjects with stage 1 disease using gene expression data only, three of the methods (A, D and H) gave hazard ratios exceeding 1 for both validation sets. Of these, only method A had a hazard ratio significantly greater than 1 for one of the data sets. For prediction on subjects with stage 1 disease using gene expression data and clinical covariates, method A gave hazard ratios that exceeded 2 and were statistically significant for both data sets. For many of the classifiers, good performance in one setting was offset by poor performance in a different setting. Thus, method A seemed to have the best overall performance across the four hypotheses.

    Using method A to stratify subjects into three groups, we generated Kaplan-Meier plots to illustrate the survival differences among the groups determined by this classification scheme for both the validation (Fig. 2) and the training (Fig. 3) data sets. This illustrates that lung adenocarcinomas can be divided into groups with different survival rates. Kaplan-Meier plots showing the performance of the other classifiers on the validation data sets are available in the Supplementary Results. The plots developed from method A again illustrate that risk predictors evaluated on all subjects performed better than those evaluated on subjects with stage 1 disease. Furthermore, using clinical covariates together with the gene expression data improved outcome prediction compared to using gene expression data alone. Method A included the null value 1 in its 95% hazard ratio confidence interval in only one of eight situations considered (Fig. 1). The one hypothesis wherein method A did not give significant prediction was stage 1 subjects scored using only gene-expression measures. As noted above, no method gave significant results for both validation sets in this setting. This suggests that stage 1 tumors may be classified more efficiently using clinical parameters along with gene expression data.
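    For readers unfamiliar with the Kaplan-Meier estimate behind such plots, a minimal sketch follows. The follow-up times are hypothetical, not data from the study. In a stratified analysis the function would be applied separately to each risk group.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time."""
    surv = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)                        # still under observation at t
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)   # events exactly at t
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Hypothetical follow-up times in months; 1 = death, 0 = censored.
times = [6, 12, 12, 20, 31, 40]
events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

    Censored subjects contribute to the at-risk count until their censoring time and then drop out without lowering the curve.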

    Analyses of additional classifiers

    The additional classifiers shown in the Supplementary Results (J, K, L, M and N) were derived from the probe sets listed in refs. 9 and 10. Although we were unable to reconstruct the classifiers reported in the original papers, we used the reported probe sets to construct classifiers, and we tested them on our validation data. The performances of these classifiers were generally comparable to, although slightly poorer than, those for methods A–H developed for this paper. The hazard ratios were in most cases larger than 1, but they did not give statistically significant hazard ratios consistently for both validation data sets (Supplementary Results). For these classifiers, the addition of clinical covariates improved the predictive ability.

    We considered two other ways to compare the classifiers developed for this study. The Supplementary Results show how each tumor sample was classified by each of the methods. Many subjects could be correctly classified by many different methods. These may represent extreme cases that can be easily recognized. There were a number of

    [Figure 2: eight Kaplan-Meier panels (a–h) showing proportion alive (0.0 to 1.0) against time in months (0 to 60) for the low score, moderate score and high score groups.]

    Figure 2 Kaplan-Meier estimates of the survivor function for method A on each validation data set for the four hypotheses. (a) MSK test set, all stages. (b) MSK test set with covariates, all stages. (c) MSK test set, stage 1 only. (d) MSK test set with covariates, stage 1 only. (e) CAN/DF test set, all stages. (f) CAN/DF test set with covariates, all stages. (g) CAN/DF test set, stage 1 only. (h) CAN/DF test set with covariates, stage 1 only. Low scores correspond to the lowest predicted risk and high scores correspond to the greatest predicted risk.

    [Figure 3: four Kaplan-Meier panels (a–d) showing proportion alive (0.0 to 1.0) against time in months (0 to 60) for the low score, moderate score and high score groups.]

    Figure 3 Kaplan-Meier estimates of the survivor function for method A (cross-validated) on training sets UM and MSK. (a) All stages. (b) All stages with covariates. (c) Stage 1 only. (d) Stage 1 only, with covariates.
