
  • Q-learning Residual Analysis with Application to A Schizophrenia Clinical Trial

    Bibhas Chakraborty
    Centre for Quantitative Medicine, Duke-National University of Singapore Graduate Medical School

    Based on joint work with Ashkan Ertefaie & Susan Shortreed

    ISCB, Utrecht, August 27, 2015


  • Dynamic Treatment Regimes: A Quick Overview

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Dynamic Treatment Regimes: A Quick Overview

    Dynamic Treatment Regimes

    Consider personalized management of chronic health conditions

    A dynamic treatment regime (DTR) is a sequence of decision rules, one per stage of clinical intervention

    – Each decision rule takes a patient’s treatment and covariate history as inputs, and outputs a recommended treatment

    A DTR is called optimal if it optimizes the long-term mean outcome (or some other suitable criterion)


  • “SMART” Data Sources

    Sequential Multiple Assignment Randomized Trials (SMARTs) (Lavori and Dawson, 2004; Murphy, 2005)

    – Each patient is followed through multiple stages of treatment

    – At each stage the patient is randomized to one of the possible treatment options

    – Treatment options for a patient can be restricted based on prior treatment and covariate history

    Examples of classic SMARTs:

    – Schizophrenia: CATIE (Schneider et al., 2001)

    – Depression: STAR*D (Rush et al., 2003)

    – Prostate Cancer: Thall et al. (2000)

    – Leukemia: CALGB Protocol 8923 (Stone et al., 1995; Wahed and Tsiatis, 2004)

    – Smoking: Project Quit (Strecher et al., 2008)

    Many recently finished or ongoing trials:

    http://methodology.psu.edu/ra/adap-inter/projects

  • Dynamic Treatment Regimes: A Quick Overview

    CATIE: A Study of Schizophrenia

    Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) (Schneider et al., 2001; Stroup et al., 2003; Swartz et al., 2003)

    One of the earlier SMART studies relevant for DTR research, funded by NIMH

    Quite complex study design, but we will be looking at a simplified version for illustrative purposes

    – Only non-responders to initial treatment are re-randomized at the second stage


  • Dynamic Treatment Regimes: A Quick Overview

    CATIE Study Design (Simplified)


  • Estimation of Optimal DTRs via Q-learning

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Estimation of Optimal DTRs via Q-learning

    Q-learning: A Regression-based Method

    How to estimate the optimal DTR for every individual patient from SMART data?

    Q-learning (Watkins, 1989; Sutton and Barto, 1998; Ernst et al., 2005)

    – A popular method from Reinforcement (Machine) Learning

    – A generalization of least squares regression to multistage decision problems (Murphy, 2005)

    – Implemented in the DTR context in recent times with different variations (Zhao et al., 2009; Chakraborty et al., 2010; Shortreed et al., 2011; Schulte et al., 2012; Laber et al., 2012; Song et al., 2012; Nahum-Shani et al., 2012; Moodie et al., 2012)

    The intuition comes from dynamic programming (Bellman, 1957), which applies when the multivariate distribution of the data is known

    – Q-learning is an approximate dynamic programming approach


  • Estimation of Optimal DTRs via Q-learning

    Data Structure

    Two stages on a single patient:

    $(O_1, A_1, O_2, S_2, A_2, Y)$

    – $O_j$: observation (pre-treatment) at the $j$-th stage

    – $A_j$: treatment (action) at the $j$-th stage, $A_j \in \mathcal{A}_j$, randomized (for simplicity, restrict attention to $\mathcal{A}_j = \{-1, 1\}$)

    – $S_2$: indicator of whether a patient is re-randomized at stage 2 (in some SMART designs, $S_2 = 1$ for every patient)

    – $H_j$: history at the $j$-th stage, $H_1 = O_1$, $H_2 = (O_1, A_1, O_2)$

    – $Y$: primary outcome (larger is better)

    A DTR is a sequence of decision rules: $d \equiv (d_1, d_2)$ with $d_j(h_j) \in \mathcal{A}_j$
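
    In tabular form, this is one row per patient with the six variables above. A minimal illustrative sketch (hypothetical values; pandas assumed available, not something used in the talk):

      import pandas as pd

      # One row per patient: (O1, A1, O2, S2, A2, Y); the stage-2 treatment A2
      # is only meaningful for re-randomized patients (S2 = 1).
      smart_data = pd.DataFrame({
          "O1": [0.3, -1.2, 0.8],        # pre-treatment observation, stage 1
          "A1": [1, -1, 1],              # randomized stage-1 treatment, coded -1/+1
          "O2": [5.1, 4.2, 6.3],         # observation collected during stage 1
          "S2": [1, 0, 1],               # 1 = re-randomized at stage 2
          "A2": [-1, float("nan"), 1],   # stage-2 treatment (undefined if S2 = 0)
          "Y":  [7.4, 3.9, 8.1],         # primary outcome, larger is better
      })
      # Histories: H1 = O1, H2 = (O1, A1, O2)
      print(smart_data)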


  • Estimation of Optimal DTRs via Q-learning

    Dynamic Programming: The Background for Q-learning

    Move backward in time to take care of the delayed effects

    Define the “Quality of treatment”, Q-functions:

    $Q_2(h_2, a_2) = E[\,Y \mid H_2 = h_2, A_2 = a_2\,]$

    $Q_1(h_1, a_1) = E[\,\underbrace{\max_{a_2} Q_2(H_2, a_2)}_{\text{delayed effect}} \mid H_1 = h_1, A_1 = a_1\,]$

    Optimal DTR:

    $d_j(h_j) = \arg\max_{a_j} Q_j(h_j, a_j), \quad j = 1, 2$

    When the true Q-functions are not known, one needs to estimate them from data, using regression models ...
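
    To make the recursion concrete, here is a minimal toy sketch of backward induction (hypothetical numbers, not from the talk), in which the stage-2 means and the transition distribution of $H_2$ are assumed fully known, so dynamic programming applies directly:

      # Known stage-2 Q-values, indexed by (h2, a2); the numbers are made up.
      Q2 = {("good", 1): 5.0, ("good", -1): 3.0,
            ("poor", 1): 1.0, ("poor", -1): 2.0}

      # Assumed (known) transition probabilities P(H2 = h2 | H1 = h1, A1 = a1)
      P_h2 = {("h1", 1): {"good": 0.7, "poor": 0.3},
              ("h1", -1): {"good": 0.4, "poor": 0.6}}

      def V2(h2):
          """Stage-2 value: maximize Q2 over a2."""
          return max(Q2[(h2, 1)], Q2[(h2, -1)])

      def Q1(h1, a1):
          """Stage-1 Q-function: expected stage-2 value (the delayed effect)."""
          return sum(p * V2(h2) for h2, p in P_h2[(h1, a1)].items())

      # Optimal decision rules: maximize each Q-function over the treatment
      d2 = {h2: max([1, -1], key=lambda a2: Q2[(h2, a2)]) for h2 in ("good", "poor")}
      d1 = max([1, -1], key=lambda a1: Q1("h1", a1))
      print(d2, d1)   # {'good': 1, 'poor': -1} and 1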


  • Estimation of Optimal DTRs via Q-learning

    Q-learning with Linear Regression

    Regression models for Q-functions:

    $Q_j(H_j, A_j; \beta_j) = \beta_{j1}^T H_{j1} + (\beta_{j2}^T H_{j2}) A_j, \quad j = 1, 2,$

    where $H_{j1}$ and $H_{j2}$ are two features of $H_j$

    At stage 2, regress $Y$ on $(H_{21}, H_{22} A_2)$, only among patients with $S_2 = 1$, to obtain $\hat{\beta}_2 = (\hat{\beta}_{21}, \hat{\beta}_{22})$
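
    A minimal sketch of this stage-2 fit by ordinary least squares (numpy only; the toy data and the choice of features in $H_{21}$ and $H_{22}$ below are illustrative assumptions, not the exact model from the talk):

      import numpy as np

      rng = np.random.default_rng(0)
      n = 200
      # Toy stand-in for the re-randomized subsample (S2 = 1)
      O11 = rng.normal(size=n)
      O21 = rng.normal(5, 1, size=n)
      A2 = rng.choice([-1, 1], size=n)
      Y = 1 + O21 - 0.5 * A2 - 0.4 * A2 * O21 + rng.normal(size=n)

      # Q2(H2, A2; beta2) = beta21' H21 + (beta22' H22) A2
      H21 = np.column_stack([np.ones(n), O11, O21])   # main-effect features
      H22 = np.column_stack([np.ones(n), O11, O21])   # treatment-interaction features
      X2 = np.column_stack([H21, H22 * A2[:, None]])

      beta2_hat, *_ = np.linalg.lstsq(X2, Y, rcond=None)
      beta21_hat, beta22_hat = beta2_hat[:3], beta2_hat[3:]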


  • Estimation of Optimal DTRs via Q-learning

    Q-learning with Linear Regression (Cont’d)

    Construct stage-1 “pseudo-outcome” for patients with S2 = 1:

    $\tilde{Y}_{\max} = \max_{a_2} Q_2(H_2, a_2; \hat{\beta}_2)$

    and hence the stage-1 dependent variable for every patient in the trial:

    $\tilde{Y} = S_2 \cdot \tilde{Y}_{\max} + (1 - S_2) \cdot Y$

    At stage 1, regress $\tilde{Y}$ on $(H_{11}, H_{12} A_1)$ to obtain $\hat{\beta}_1 = (\hat{\beta}_{11}, \hat{\beta}_{12})$

    Estimated Optimal DTR:

    $\hat{d}_j(h_j) = \arg\max_{a_j} Q_j(h_j, a_j; \hat{\beta}_j) = \mathrm{sign}(\hat{\beta}_{j2}^T h_{j2})$
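
    Putting the two stages together, a self-contained sketch of standard Q-learning on toy data (numpy only; the toy generative model and the single-covariate feature choices are illustrative assumptions):

      import numpy as np

      rng = np.random.default_rng(1)
      n = 300
      # Toy SMART-like data (illustrative, not the talk's generative model)
      O11 = rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)
      O21 = rng.normal(5 - 0.3 * A1, 1)
      S2 = (O21 > 5).astype(float)
      A2 = np.where(S2 == 1, rng.choice([-1, 1], size=n), 0)
      Y = 1 + O21 - A1 + S2 * (-0.5 * A2 - 0.4 * A2 * O21) + rng.normal(size=n)

      def ols(X, y):
          return np.linalg.lstsq(X, y, rcond=None)[0]

      # Stage 2: fit Q2 on the re-randomized patients only
      keep = S2 == 1
      H21 = np.column_stack([np.ones(n), O21])    # main-effect features
      H22 = np.column_stack([np.ones(n), O21])    # treatment-interaction features
      X2 = np.column_stack([H21, H22 * A2[:, None]])
      b2 = ols(X2[keep], Y[keep])
      b21, b22 = b2[:2], b2[2:]

      # Pseudo-outcome: maximizing the fitted Q2 over a2 in {-1, 1} replaces
      # (b22' H22) a2 by |b22' H22|
      Y_max = H21 @ b21 + np.abs(H22 @ b22)
      Y_tilde = S2 * Y_max + (1 - S2) * Y

      # Stage 1: regress the pseudo-outcome on (H11, H12 * A1) for all patients
      H11 = np.column_stack([np.ones(n), O11])
      H12 = np.column_stack([np.ones(n), O11])
      X1 = np.column_stack([H11, H12 * A1[:, None]])
      b1 = ols(X1, Y_tilde)
      b11, b12 = b1[:2], b1[2:]

      # Estimated decision rules: sign of the fitted treatment-interaction part
      d2_hat = np.sign(H22 @ b22)
      d1_hat = np.sign(H12 @ b12)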


  • Model Checking for Q-learning

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Model Checking for Q-learning

    Model Checking

    The quality of the DTRs estimated using Q-learning is critically model-dependent

    Model checking for stage 2 can be done using standard residual diagnostic tools from linear regression

    Model checking for stage 1 is tricky

    – The dependent variable is a non-smooth function of the data

    – The data consist of two types of individuals, viz., those who are re-randomized at stage 2 and those who are not – hence there is inherent scope for variance heterogeneity


  • Proposed Modification: Q-learning with Mixture Residuals (QL-MR)

    Stage-2 regression model (nested among those with $S_2 = 0$ and $S_2 = 1$):

    $Q_2(H_2, A_2; \beta_{21}, \beta_{22}, \beta_{23}) = S_2 \cdot (\beta_{21}^T H_{21} + \beta_{22}^T H_{22} A_2) + (1 - S_2) \cdot (\beta_{23}^T H_{23})$

    Obtain $\hat{\beta}_2$ by fitting the above model to $Y$, and hence define

    $\hat{d}_2(h_2) = \arg\max_{a_2} Q_2(h_2, a_2; \hat{\beta}_2)$

    Construct stage-1 “pseudo-outcome” for each patient in the trial:

    $\tilde{Y}_{\text{QL-MR}} = \max_{a_2} \big[ S_2 \cdot (\hat{\beta}_{21}^T H_{21} + \hat{\beta}_{22}^T H_{22}\, a_2) \big] + (1 - S_2) \cdot (\hat{\beta}_{23}^T H_{23})$

    $= S_2 \cdot (\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) + (1 - S_2) \cdot (\hat{\beta}_{23}^T H_{23})$
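
    A minimal sketch of this nested stage-2 fit and the QL-MR pseudo-outcome (numpy only; the toy data and the choices of $H_{21}$, $H_{22}$, $H_{23}$ are illustrative assumptions):

      import numpy as np

      rng = np.random.default_rng(2)
      n = 300
      # Toy SMART-like data (illustrative only)
      O11 = rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)
      O21 = rng.normal(5 - 0.3 * A1, 1)
      S2 = (O21 > 5).astype(float)
      A2 = np.where(S2 == 1, rng.choice([-1, 1], size=n), 0)
      Y = 1 + O21 - A1 + S2 * (-0.5 * A2 - 0.4 * A2 * O21) + rng.normal(size=n)

      # One nested stage-2 model: S2 * (b21'H21 + b22'H22 A2) + (1 - S2) * b23'H23
      H21 = np.column_stack([np.ones(n), O21])
      H22 = np.column_stack([np.ones(n), O21])
      H23 = np.column_stack([np.ones(n), O21])    # features for the S2 = 0 patients
      X2 = np.column_stack([S2[:, None] * H21,
                            S2[:, None] * H22 * A2[:, None],
                            (1 - S2)[:, None] * H23])
      b2 = np.linalg.lstsq(X2, Y, rcond=None)[0]
      b21, b22, b23 = b2[:2], b2[2:4], b2[4:]

      # QL-MR pseudo-outcome: the max over a2 only affects the S2 part
      Y_qlmr = S2 * (H21 @ b21 + np.abs(H22 @ b22)) + (1 - S2) * (H23 @ b23)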

  • Q-learning with Mixture Residuals (QL-MR) (Cont’d)

    Define $\pi = E[S_2 \mid H_2] = P[S_2 = 1 \mid H_2]$

    Postulate a parametric model for $\pi$, say $\pi(\alpha)$, and compute the maximum likelihood estimate $\hat{\alpha}$; then define $\hat{\pi} = E[S_2 \mid H_2; \hat{\alpha}]$ (e.g., logistic regression)

    Stage-1 Q-function:

    $Q_1(H_1, A_1) = E[\,\tilde{Y}_{\text{QL-MR}} \mid H_1, A_1\,]$

    $= E[\,E\{\tilde{Y}_{\text{QL-MR}} \mid H_2\} \mid H_1, A_1\,]$

    $= E[\,E\{ S_2(\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) + (1 - S_2)(\hat{\beta}_{23}^T H_{23}) \mid H_2 \} \mid H_1, A_1\,]$

    $= E[\,\pi(\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) \mid H_1, A_1\,] + E[\,(1 - \pi)(\hat{\beta}_{23}^T H_{23}) \mid H_1, A_1\,]$

    Replace $\pi$ by $\hat{\pi}$ in the expression of $Q_1$
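
    A minimal sketch of this π-model step using an ordinary logistic regression of $S_2$ on stage-2 history features (scikit-learn assumed available; the toy data and the particular features are illustrative assumptions):

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(3)
      n = 300
      # Toy stage-2 histories and re-randomization indicator (illustrative only)
      O11 = rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)
      O21 = rng.normal(5 - 0.3 * A1, 1)
      S2 = rng.binomial(1, 1 / (1 + np.exp(-(O21 - 5))))

      # Parametric model pi(alpha) for P(S2 = 1 | H2), fit by maximum likelihood
      H2_features = np.column_stack([O11, A1, O21])
      pi_model = LogisticRegression().fit(H2_features, S2)
      pi_hat = pi_model.predict_proba(H2_features)[:, 1]   # plug-in estimate of pi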

  • Q-learning with Mixture Residuals (QL-MR) (Cont’d)

    $Q_1$ is a mixture model with two components (e.g., for responders and for non-responders)

    Fit two linear models for the two conditional expectations in the expression of $Q_1$, say $\eta_{11}^T H_{11} + \eta_{12}^T H_{12} A_1$ and $\theta_{11}^T H'_{11} + \theta_{12}^T H'_{12} A_1$

    Construct mixture residuals $\hat{\epsilon}_{\text{QL-MR}}$ as

    $\hat{\epsilon}_{\text{QL-MR}} = \hat{\pi}(\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) + (1 - \hat{\pi})(\hat{\beta}_{23}^T H_{23}) - [\hat{\eta}_{11}^T H_{11} + \hat{\eta}_{12}^T H_{12} A_1] - [\hat{\theta}_{11}^T H'_{11} + \hat{\theta}_{12}^T H'_{12} A_1]$

    Assess $\hat{\epsilon}_{\text{QL-MR}}$ using standard residual diagnostic plots; a sketch follows below

    – If lack of fit is detected, adjust the set of predictors and re-assess the model

    – Else, find the optimal DTR based on the fitted models

    The optimal stage-1 decision rule is given by

    $\hat{d}_1(h_1) = \arg\max_{a_1} Q_1(h_1, a_1; \hat{\eta}_1, \hat{\theta}_1)$
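
    Putting the pieces together, a sketch of the mixture-residual computation and the standard diagnostic plots (numpy, scikit-learn and matplotlib assumed available; the toy data and all feature choices are illustrative assumptions):

      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(4)
      n = 300
      # Toy SMART-like data (illustrative only)
      O11 = rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)
      O21 = rng.normal(5 - 0.3 * A1, 1)
      S2 = (O21 > 5).astype(float)
      A2 = np.where(S2 == 1, rng.choice([-1, 1], size=n), 0)
      Y = 1 + O21 - A1 + S2 * (-0.5 * A2 - 0.4 * A2 * O21) + rng.normal(size=n)

      def ols(X, y):
          return np.linalg.lstsq(X, y, rcond=None)[0]

      # Nested stage-2 fit, as in the QL-MR sketch above
      H21 = H22 = H23 = np.column_stack([np.ones(n), O21])
      X2 = np.column_stack([S2[:, None] * H21, S2[:, None] * H22 * A2[:, None],
                            (1 - S2)[:, None] * H23])
      b2 = ols(X2, Y)
      b21, b22, b23 = b2[:2], b2[2:4], b2[4:]

      # pi-hat from a logistic model for S2 given stage-2 history features
      H2f = np.column_stack([O11, A1, O21])
      pi_hat = LogisticRegression(max_iter=1000).fit(
          H2f, S2.astype(int)).predict_proba(H2f)[:, 1]

      # Two stage-1 component fits: eta for the S2 part, theta for the (1 - S2) part
      X1 = np.column_stack([np.ones(n), O11, A1, O11 * A1])
      eta = ols(X1, pi_hat * (H21 @ b21 + np.abs(H22 @ b22)))
      theta = ols(X1, (1 - pi_hat) * (H23 @ b23))

      # Mixture residuals: pseudo-outcome mixture minus the two fitted components
      resid = (pi_hat * (H21 @ b21 + np.abs(H22 @ b22))
               + (1 - pi_hat) * (H23 @ b23) - X1 @ eta - X1 @ theta)

      # Standard diagnostics: residuals against a covariate, plus a histogram
      fig, ax = plt.subplots(1, 2, figsize=(8, 3))
      ax[0].scatter(O11, resid, s=8)
      ax[0].set_xlabel("O11"); ax[0].set_ylabel("QL-MR residual")
      ax[1].hist(resid, bins=30)
      plt.show()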

  • Model Checking for Q-learning

    Asymptotic Properties of QL-MR

    Standard Q-learning and QL-MR are asymptotically equivalent under the following conditions:

    1 The postulated model for $Y$ among individuals with $S_2 = 0$ is correctly specified

    2 The postulated model for $\pi$ is correctly specified

    Precisely, $\tilde{Y}_{\text{QL-MR}} + (1 - S_2)\tau = \tilde{Y} + o_p(1)$, where $\tau = Y - \hat{\beta}_{23}^T H_{23}$ for individuals with $S_2 = 0$

    Under a correctly specified model, $E(\tau \mid H_2) = 0$, and thus the two pseudo-outcomes have the same mean


  • Model Checking for Q-learning

    Inference

    The problem of non-regularity remains the same as in Q-learning

    Either the adaptive confidence interval (ACI) (Laber et al., 2014) or the m-out-of-n bootstrap (Chakraborty et al., 2013) should be employed for constructing confidence intervals

    – We extended and implemented ACI in the current work
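
    For intuition only, a generic m-out-of-n bootstrap sketch for an interval around a scalar estimator (this is the basic fixed-m construction, not the adaptive choice of m in Chakraborty et al. (2013) nor the ACI used in this work; the estimator below is a placeholder):

      import numpy as np

      def m_out_of_n_ci(data, estimator, m, B=1000, level=0.90, seed=0):
          """Basic m-out-of-n bootstrap CI with a fixed resample size m < n.

          Quantiles of sqrt(m) * (theta*_m - theta_n) approximate the sampling
          distribution of sqrt(n) * (theta_n - theta), yielding the interval
          [theta_n - u / sqrt(n), theta_n - l / sqrt(n)].
          """
          rng = np.random.default_rng(seed)
          n = data.shape[0]
          theta_n = estimator(data)
          boot = np.array([estimator(data[rng.integers(0, n, size=m)])
                           for _ in range(B)])
          root = np.sqrt(m) * (boot - theta_n)
          alpha = 1 - level
          l, u = np.quantile(root, [alpha / 2, 1 - alpha / 2])
          return theta_n - u / np.sqrt(n), theta_n - l / np.sqrt(n)

      # Toy usage with a placeholder estimator (the mean of a toy sample)
      toy = np.random.default_rng(1).normal(size=300)
      print(m_out_of_n_ci(toy, np.mean, m=150))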


  • Numerical Study

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Numerical Study

    Simulation Design

    Assess the diagnostic performance of QL-MR as compared to conventional Q-learning

    Simulate a SMART study, analogous to CATIE, with sample size n = 300


  • Generative Model for Simulation Study

    $O_{1j} \overset{\text{i.i.d.}}{\sim} N(0, 1), \quad j = 1, 2$

    $A_1 \in \{-1, 1\}$ with probability 0.5

    $O_{2j} \overset{\text{i.i.d.}}{\sim} N(5 - 0.3\,A_1 - 0.5\,O_{1j},\, 1), \quad j = 1, 2$

    $S_2 = I\{O_{22} > 5\}$

    $A_2 \in \{-1, 1\}$ with probability 0.5

    $g(H_2) = 1 + 2\,O_{11} - 1.5\,O_{11}^2 - 2\,O_{12} + O_{21} - A_1 - 0.5\,A_1 O_{11}$

    $\epsilon \sim N(0, 1)$

    $Y = g(H_2) + S_2 \cdot (0.8\,O_{21} - 0.5\,A_2 - 0.4\,A_2 O_{21} - 0.7\,A_2 O_{11}) + \epsilon$
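
    This generative model translates directly into code; a minimal simulation sketch (numpy only, with n = 300 as stated, taking the normal scale parameter as 1):

      import numpy as np

      def simulate_smart(n=300, seed=0):
          """Simulate one data set from the generative model on this slide."""
          rng = np.random.default_rng(seed)
          O11, O12 = rng.normal(size=n), rng.normal(size=n)
          A1 = rng.choice([-1, 1], size=n)                        # randomized, prob. 0.5
          O21 = rng.normal(5 - 0.3 * A1 - 0.5 * O11, 1)
          O22 = rng.normal(5 - 0.3 * A1 - 0.5 * O12, 1)
          S2 = (O22 > 5).astype(float)                            # re-randomized iff O22 > 5
          A2 = np.where(S2 == 1, rng.choice([-1, 1], size=n), 0)  # A2 only used when S2 = 1
          g = 1 + 2 * O11 - 1.5 * O11**2 - 2 * O12 + O21 - A1 - 0.5 * A1 * O11
          eps = rng.normal(size=n)
          Y = g + S2 * (0.8 * O21 - 0.5 * A2 - 0.4 * A2 * O21 - 0.7 * A2 * O11) + eps
          return dict(O11=O11, O12=O12, A1=A1, O21=O21, O22=O22, S2=S2, A2=A2, Y=Y)

      data = simulate_smart()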

  • Analysis Model

    Assume that the model for the stage-2 Q-function is correctly specified, and check the model fit at stage 1

    At stage 1, fit models for $E[\,\hat{\pi}(\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) \mid H_1, A_1\,]$ and $E[\,(1 - \hat{\pi})(\hat{\beta}_{23}^T H_{23}) \mid H_1, A_1\,]$

    Three types of models are considered for these quantities (design matrices sketched below):

    Model   Variables included
    1       $(O_{11}, O_{12}, A_1)$
    2       $(O_{11}, O_{12}, A_1, O_{11}^2)$
    3       $(O_{11}, O_{12}, A_1, O_{11}^2, A_1 O_{11})$
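
    For reference, the three variable sets above translate into the following stage-1 design matrices; each matrix would then be plugged into the two stage-1 component fits sketched earlier (toy covariates, illustrative only):

      import numpy as np

      rng = np.random.default_rng(5)
      n = 300
      O11, O12 = rng.normal(size=n), rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)

      # The three candidate stage-1 feature sets from the table above
      ones = np.ones(n)
      designs = {
          "Model 1": np.column_stack([ones, O11, O12, A1]),
          "Model 2": np.column_stack([ones, O11, O12, A1, O11**2]),
          "Model 3": np.column_stack([ones, O11, O12, A1, O11**2, A1 * O11]),
      }
      for name, X1 in designs.items():
          print(name, X1.shape)   # each X1 feeds the stage-1 component fits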

  • Figure: Model 1 residual plots against O11 and O12, and the residual histogram, for standard Q-learning ("Res QL") and QL-MR ("Res QL-MR"). The orange and green lines are the loess smoother lines for individuals with A1 = +1 and A1 = −1, respectively.

  • Figure: Model 2 residual plots against O11 and O12, and the residual histogram, for standard Q-learning ("Res QL") and QL-MR ("Res QL-MR"). The orange and green lines are the loess smoother lines for individuals with A1 = +1 and A1 = −1, respectively.

  • Figure: Model 3 residual plots against O11 and O12, and the residual histogram, for standard Q-learning ("Res QL") and QL-MR ("Res QL-MR"). The orange and green lines are the loess smoother lines for individuals with A1 = +1 and A1 = −1, respectively.

  • What do the plots say?

    Even after adjusting for quadratic and interaction terms, the residuals from standard Q-learning suggest at least a lack of variance homogeneity and a lack of symmetry / normality

    This finding may lead the analyst to suspect a lack of fit and to consider variance-stabilizing and/or normality-inducing transformations

    – This, in turn, may jeopardize the simplicity and interpretability of Q-learning

    QL-MR, on the other hand, does not mislead the analyst

    – And this is achieved using standard diagnostic tools – without having to invent new residual diagnostic techniques

    In the end, the parameter estimates are similar to those from standard Q-learning – so the extra diagnostic ability does not come at the cost of estimation performance for the key parameters

  • Numerical Study

    Parameter Estimates

    Table: Simulated data: estimates of the stage-2 and stage-1 decision rule parameters

                      Standard Q-learning          QL-MR
    Parameter         Estimate   90% CI            Estimate   90% CI
    Stage-2 Model
    A2                -2.17      (-2.97, -1.37)    -2.18      (-3.01, -1.35)
    A2 O11            -1.67      (-1.84, -1.51)    -1.68      (-1.85, -1.51)
    A2 O21             1.64      ( 1.47,  1.80)     1.64      ( 1.47,  1.81)
    Stage-1 Model
    A1                -0.84      (-1.44, -0.24)    -0.86      (-1.48, -0.26)
    A1 O11            -3.69      (-4.43, -2.96)    -3.75      (-4.49, -3.07)


  • Numerical Study

    CATIE Data Analysis (QoL Outcome)

    Table: CATIE: stage-2 and stage-1 regression models

                                    Standard Q-learning          QL-MR
    Parameter                       Estimate   90% CI            Estimate   90% CI
    Stage-2 Model
    O11: Baseline PANSS              0.01      (-0.12,  0.14)     0.02      (-0.11,  0.15)
    O11²: Baseline PANSS             0.05      (-0.02,  0.13)     0.02      (-0.05,  0.09)
    O12: Baseline Quality of Life    0.48      ( 0.36,  0.61)     0.49      ( 0.37,  0.60)
    A1: Stage-1 treatment            0.004     (-0.11,  0.12)     0.008     (-0.09,  0.11)
    O21: PANSS during stage 1       -0.19      (-0.30, -0.08)    -0.20      (-0.30, -0.10)
    A2: Stage-2 treatment           -0.06      (-0.17,  0.04)    -0.07      (-0.17,  0.03)
    A2 A1                           -0.09      (-0.19,  0.02)    -0.09      (-0.19,  0.01)
    Stage-1 Model
    O11: Baseline PANSS             -0.13      (-0.23, -0.04)    -0.12      (-0.22, -0.03)
    O11²: Baseline PANSS             0.06      ( 0.00,  0.12)     0.05      (-0.01,  0.12)
    O12: Baseline Quality of Life    0.51      ( 0.42,  0.61)     0.50      ( 0.42,  0.59)
    A1: Stage-1 treatment           -0.01      (-0.10,  0.11)    -0.01      (-0.13,  0.09)


  • Discussion

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Summary

    SMART designs are becoming increasingly popular in various domains of health research

    – A particular type of SMART study, in which only the non-responders to the initial treatment are re-randomized, is especially common

    Secondary analysis of SMART studies to find individualized interventions is usually conducted using Q-learning

    For SMARTs in which only the non-responders are re-randomized, model checking for standard Q-learning is problematic

    – This problem has received little, if any, attention in the literature so far

    – We have proposed a simple modification of Q-learning so that standard residual diagnostic tools from the classical regression literature can be used

  • Shoot your questions, comments, criticisms, or requests for slides to: [email protected]
