
  • Q-learning Residual Analysis with Application to A Schizophrenia Clinical Trial

    Bibhas Chakraborty
    Centre for Quantitative Medicine, Duke-National University of Singapore Graduate Medical School

    Based on joint work with Ashkan Ertefaie & Susan Shortreed

    ISCB, Utrecht, August 27, 2015


  • Dynamic Treatment Regimes: A Quick Overview

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Dynamic Treatment Regimes: A Quick Overview

    Dynamic Treatment Regimes

    Consider personalized management of chronic health conditions

    A dynamic treatment regime (DTR) is a sequence of decision rules, one per stage of clinical intervention

    – Each decision rule takes a patient’s treatment and covariate history as inputs, and outputs a recommended treatment

    A DTR is called optimal if it optimizes the long-term mean outcome (or some other suitable criterion)


  • “SMART” Data Sources

    Sequential Multiple Assignment Randomized Trials (SMARTs) (Lavori and Dawson, 2004; Murphy, 2005)

    – Each patient is followed through multiple stages of treatment

    – At each stage the patient is randomized to one of the possible treatment options

    – Treatment options for a patient can be restricted based on prior treatment and covariate history

    Examples of classic SMARTs:

    – Schizophrenia: CATIE (Schneider et al., 2001)

    – Depression: STAR*D (Rush et al., 2003)

    – Prostate Cancer: Thall et al. (2000)

    – Leukemia: CALGB Protocol 8923 (Stone et al., 1995; Wahed and Tsiatis, 2004)

    – Smoking: Project Quit (Strecher et al., 2008)

    Many recently finished or ongoing trials:

    http://methodology.psu.edu/ra/adap-inter/projects

  • Dynamic Treatment Regimes: A Quick Overview

    CATIE: A Study of Schizophrenia

    Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) (Schneider et al., 2001; Stroup et al., 2003; Swartz et al., 2003)

    One of the earlier SMART studies relevant for DTR research, funded by NIMH

    Quite complex study design, but we will be looking at a simplified version for illustrative purposes

    – Only non-responders to initial treatment are re-randomized at the second stage


  • Dynamic Treatment Regimes: A Quick Overview

    CATIE Study Design (Simplified)


  • Estimation of Optimal DTRs via Q-learning

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Estimation of Optimal DTRs via Q-learning

    Q-learning: A Regression-based Method

    How to estimate the optimal DTR for every individual patient from SMART data?

    Q-learning (Watkins, 1989; Sutton and Barto, 1998; Ernst et al., 2005)

    – A popular method from Reinforcement (Machine) Learning

    – A generalization of least squares regression to multistage decision problems (Murphy, 2005)

    – Implemented in the DTR context in recent times with different variations (Zhao et al., 2009; Chakraborty et al., 2010; Shortreed et al., 2011; Schulte et al., 2012; Laber et al., 2012; Song et al., 2012; Nahum-Shani et al., 2012; Moodie et al., 2012)

    The intuition comes from dynamic programming (Bellman, 1957), which applies when the multivariate distribution of the data is known

    – Q-learning is an approximate dynamic programming approach


  • Estimation of Optimal DTRs via Q-learning

    Data Structure

    Two stages on a single patient:

    $(O_1, A_1, O_2, S_2, A_2, Y)$

    – $O_j$: observation (pre-treatment) at the $j$-th stage

    – $A_j$: treatment (action) at the $j$-th stage, $A_j \in \mathcal{A}_j$, randomized (for simplicity, restrict attention to $\mathcal{A}_j = \{-1, 1\}$)

    – $S_2$: indicator of whether a patient is re-randomized at stage 2 (in some SMART designs, $S_2 = 1$ for every patient)

    – $H_j$: history at the $j$-th stage, $H_1 = O_1$, $H_2 = (O_1, A_1, O_2)$

    – $Y$: primary outcome (larger is better)

    A DTR is a sequence of decision rules: $d \equiv (d_1, d_2)$ with $d_j(h_j) \in \mathcal{A}_j$
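
    In tabular form, this is one row per patient with the six variables above. A minimal illustrative sketch (hypothetical values; pandas assumed available, not something used in the talk):

      import pandas as pd

      # One row per patient: (O1, A1, O2, S2, A2, Y); the stage-2 treatment A2
      # is only meaningful for re-randomized patients (S2 = 1).
      smart_data = pd.DataFrame({
          "O1": [0.3, -1.2, 0.8],        # pre-treatment observation, stage 1
          "A1": [1, -1, 1],              # randomized stage-1 treatment, coded -1/+1
          "O2": [5.1, 4.2, 6.3],         # observation collected during stage 1
          "S2": [1, 0, 1],               # 1 = re-randomized at stage 2
          "A2": [-1, float("nan"), 1],   # stage-2 treatment (undefined if S2 = 0)
          "Y":  [7.4, 3.9, 8.1],         # primary outcome, larger is better
      })
      # Histories: H1 = O1, H2 = (O1, A1, O2)
      print(smart_data)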


  • Estimation of Optimal DTRs via Q-learning

    Dynamic Programming: The Background for Q-learning

    Move backward in time to take care of the delayed effects

    Define the “Quality of treatment”, Q-functions:

    $Q_2(h_2, a_2) = E[\,Y \mid H_2 = h_2, A_2 = a_2\,]$

    $Q_1(h_1, a_1) = E[\,\underbrace{\max_{a_2} Q_2(H_2, a_2)}_{\text{delayed effect}} \mid H_1 = h_1, A_1 = a_1\,]$

    Optimal DTR:

    $d_j(h_j) = \arg\max_{a_j} Q_j(h_j, a_j), \quad j = 1, 2$

    When the true Q-functions are not known, one needs to estimate them from data, using regression models ...
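
    To make the recursion concrete, here is a minimal toy sketch of backward induction (hypothetical numbers, not from the talk), in which the stage-2 means and the transition distribution of $H_2$ are assumed fully known, so dynamic programming applies directly:

      # Known stage-2 Q-values, indexed by (h2, a2); the numbers are made up.
      Q2 = {("good", 1): 5.0, ("good", -1): 3.0,
            ("poor", 1): 1.0, ("poor", -1): 2.0}

      # Assumed (known) transition probabilities P(H2 = h2 | H1 = h1, A1 = a1)
      P_h2 = {("h1", 1): {"good": 0.7, "poor": 0.3},
              ("h1", -1): {"good": 0.4, "poor": 0.6}}

      def V2(h2):
          """Stage-2 value: maximize Q2 over a2."""
          return max(Q2[(h2, 1)], Q2[(h2, -1)])

      def Q1(h1, a1):
          """Stage-1 Q-function: expected stage-2 value (the delayed effect)."""
          return sum(p * V2(h2) for h2, p in P_h2[(h1, a1)].items())

      # Optimal decision rules: maximize each Q-function over the treatment
      d2 = {h2: max([1, -1], key=lambda a2: Q2[(h2, a2)]) for h2 in ("good", "poor")}
      d1 = max([1, -1], key=lambda a1: Q1("h1", a1))
      print(d2, d1)   # {'good': 1, 'poor': -1} and 1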


  • Estimation of Optimal DTRs via Q-learning

    Q-learning with Linear Regression

    Regression models for Q-functions:

    $Q_j(H_j, A_j; \beta_j) = \beta_{j1}^T H_{j1} + (\beta_{j2}^T H_{j2}) A_j, \quad j = 1, 2,$

    where $H_{j1}$ and $H_{j2}$ are two features of $H_j$

    At stage 2, regress $Y$ on $(H_{21}, H_{22} A_2)$, only among patients with $S_2 = 1$, to obtain $\hat{\beta}_2 = (\hat{\beta}_{21}, \hat{\beta}_{22})$
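
    A minimal sketch of this stage-2 fit by ordinary least squares (numpy only; the toy data and the choice of features in $H_{21}$ and $H_{22}$ below are illustrative assumptions, not the exact model from the talk):

      import numpy as np

      rng = np.random.default_rng(0)
      n = 200
      # Toy stand-in for the re-randomized subsample (S2 = 1)
      O11 = rng.normal(size=n)
      O21 = rng.normal(5, 1, size=n)
      A2 = rng.choice([-1, 1], size=n)
      Y = 1 + O21 - 0.5 * A2 - 0.4 * A2 * O21 + rng.normal(size=n)

      # Q2(H2, A2; beta2) = beta21' H21 + (beta22' H22) A2
      H21 = np.column_stack([np.ones(n), O11, O21])   # main-effect features
      H22 = np.column_stack([np.ones(n), O11, O21])   # treatment-interaction features
      X2 = np.column_stack([H21, H22 * A2[:, None]])

      beta2_hat, *_ = np.linalg.lstsq(X2, Y, rcond=None)
      beta21_hat, beta22_hat = beta2_hat[:3], beta2_hat[3:]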


  • Estimation of Optimal DTRs via Q-learning

    Q-learning with Linear Regression (Cont’d)

    Construct stage-1 “pseudo-outcome” for patients with S2 = 1:

    $\tilde{Y}_{\max} = \max_{a_2} Q_2(H_2, a_2; \hat{\beta}_2)$

    and hence the stage-1 dependent variable for every patient in the trial:

    $\tilde{Y} = S_2 \cdot \tilde{Y}_{\max} + (1 - S_2) \cdot Y$

    At stage 1, regress $\tilde{Y}$ on $(H_{11}, H_{12} A_1)$ to obtain $\hat{\beta}_1 = (\hat{\beta}_{11}, \hat{\beta}_{12})$

    Estimated Optimal DTR:

    $\hat{d}_j(h_j) = \arg\max_{a_j} Q_j(h_j, a_j; \hat{\beta}_j) = \mathrm{sign}(\hat{\beta}_{j2}^T h_{j2})$
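
    Putting the two stages together, a self-contained sketch of standard Q-learning on toy data (numpy only; the toy generative model and the single-covariate feature choices are illustrative assumptions):

      import numpy as np

      rng = np.random.default_rng(1)
      n = 300
      # Toy SMART-like data (illustrative, not the talk's generative model)
      O11 = rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)
      O21 = rng.normal(5 - 0.3 * A1, 1)
      S2 = (O21 > 5).astype(float)
      A2 = np.where(S2 == 1, rng.choice([-1, 1], size=n), 0)
      Y = 1 + O21 - A1 + S2 * (-0.5 * A2 - 0.4 * A2 * O21) + rng.normal(size=n)

      def ols(X, y):
          return np.linalg.lstsq(X, y, rcond=None)[0]

      # Stage 2: fit Q2 on the re-randomized patients only
      keep = S2 == 1
      H21 = np.column_stack([np.ones(n), O21])    # main-effect features
      H22 = np.column_stack([np.ones(n), O21])    # treatment-interaction features
      X2 = np.column_stack([H21, H22 * A2[:, None]])
      b2 = ols(X2[keep], Y[keep])
      b21, b22 = b2[:2], b2[2:]

      # Pseudo-outcome: maximizing the fitted Q2 over a2 in {-1, 1} replaces
      # (b22' H22) a2 by |b22' H22|
      Y_max = H21 @ b21 + np.abs(H22 @ b22)
      Y_tilde = S2 * Y_max + (1 - S2) * Y

      # Stage 1: regress the pseudo-outcome on (H11, H12 * A1) for all patients
      H11 = np.column_stack([np.ones(n), O11])
      H12 = np.column_stack([np.ones(n), O11])
      X1 = np.column_stack([H11, H12 * A1[:, None]])
      b1 = ols(X1, Y_tilde)
      b11, b12 = b1[:2], b1[2:]

      # Estimated decision rules: sign of the fitted treatment-interaction part
      d2_hat = np.sign(H22 @ b22)
      d1_hat = np.sign(H12 @ b12)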


  • Model Checking for Q-learning

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Model Checking for Q-learning

    Model Checking

    The quality of the DTRs estimated using Q-learning is critically model-dependent

    Model checking for stage 2 can be done using standard residual diagnostic tools from linear regression

    Model checking for stage 1 is tricky

    – The dependent variable is a non-smooth function of the data

    – The data consist of two types of individuals, viz., those who are re-randomized at stage 2 and those who are not – hence there is inherent scope for variance heterogeneity


  • Proposed Modification: Q-learning with Mixture Residuals (QL-MR)

    Stage-2 regression model (nested among those with $S_2 = 0$ and $S_2 = 1$):

    $Q_2(H_2, A_2; \beta_{21}, \beta_{22}, \beta_{23}) = S_2 \cdot (\beta_{21}^T H_{21} + \beta_{22}^T H_{22} A_2) + (1 - S_2) \cdot (\beta_{23}^T H_{23})$

    Obtain $\hat{\beta}_2$ by fitting the above model to $Y$, and hence define

    $\hat{d}_2(h_2) = \arg\max_{a_2} Q_2(h_2, a_2; \hat{\beta}_2)$

    Construct stage-1 “pseudo-outcome” for each patient in the trial:

    $\tilde{Y}_{\text{QL-MR}} = \max_{a_2} \big[ S_2 \cdot (\hat{\beta}_{21}^T H_{21} + \hat{\beta}_{22}^T H_{22}\, a_2) \big] + (1 - S_2) \cdot (\hat{\beta}_{23}^T H_{23})$

    $= S_2 \cdot (\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) + (1 - S_2) \cdot (\hat{\beta}_{23}^T H_{23})$
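
    A minimal sketch of this nested stage-2 fit and the QL-MR pseudo-outcome (numpy only; the toy data and the choices of $H_{21}$, $H_{22}$, $H_{23}$ are illustrative assumptions):

      import numpy as np

      rng = np.random.default_rng(2)
      n = 300
      # Toy SMART-like data (illustrative only)
      O11 = rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)
      O21 = rng.normal(5 - 0.3 * A1, 1)
      S2 = (O21 > 5).astype(float)
      A2 = np.where(S2 == 1, rng.choice([-1, 1], size=n), 0)
      Y = 1 + O21 - A1 + S2 * (-0.5 * A2 - 0.4 * A2 * O21) + rng.normal(size=n)

      # One nested stage-2 model: S2 * (b21'H21 + b22'H22 A2) + (1 - S2) * b23'H23
      H21 = np.column_stack([np.ones(n), O21])
      H22 = np.column_stack([np.ones(n), O21])
      H23 = np.column_stack([np.ones(n), O21])    # features for the S2 = 0 patients
      X2 = np.column_stack([S2[:, None] * H21,
                            S2[:, None] * H22 * A2[:, None],
                            (1 - S2)[:, None] * H23])
      b2 = np.linalg.lstsq(X2, Y, rcond=None)[0]
      b21, b22, b23 = b2[:2], b2[2:4], b2[4:]

      # QL-MR pseudo-outcome: the max over a2 only affects the S2 part
      Y_qlmr = S2 * (H21 @ b21 + np.abs(H22 @ b22)) + (1 - S2) * (H23 @ b23)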

  • Q-learning with Mixture Residuals (QL-MR) (Cont’d)

    Define $\pi = E[S_2 \mid H_2] = P[S_2 = 1 \mid H_2]$

    Postulate a parametric model for $\pi$, say $\pi(\alpha)$, and compute the maximum likelihood estimate $\hat{\alpha}$; then define $\hat{\pi} = E[S_2 \mid H_2; \hat{\alpha}]$ (e.g., logistic regression)

    Stage-1 Q-function:

    $Q_1(H_1, A_1) = E[\,\tilde{Y}_{\text{QL-MR}} \mid H_1, A_1\,]$

    $= E[\,E\{\tilde{Y}_{\text{QL-MR}} \mid H_2\} \mid H_1, A_1\,]$

    $= E[\,E\{ S_2(\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) + (1 - S_2)(\hat{\beta}_{23}^T H_{23}) \mid H_2 \} \mid H_1, A_1\,]$

    $= E[\,\pi(\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) \mid H_1, A_1\,] + E[\,(1 - \pi)(\hat{\beta}_{23}^T H_{23}) \mid H_1, A_1\,]$

    Replace $\pi$ by $\hat{\pi}$ in the expression of $Q_1$
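
    A minimal sketch of this π-model step using an ordinary logistic regression of $S_2$ on stage-2 history features (scikit-learn assumed available; the toy data and the particular features are illustrative assumptions):

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(3)
      n = 300
      # Toy stage-2 histories and re-randomization indicator (illustrative only)
      O11 = rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)
      O21 = rng.normal(5 - 0.3 * A1, 1)
      S2 = rng.binomial(1, 1 / (1 + np.exp(-(O21 - 5))))

      # Parametric model pi(alpha) for P(S2 = 1 | H2), fit by maximum likelihood
      H2_features = np.column_stack([O11, A1, O21])
      pi_model = LogisticRegression().fit(H2_features, S2)
      pi_hat = pi_model.predict_proba(H2_features)[:, 1]   # plug-in estimate of pi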

  • Q-learning with Mixture Residuals (QL-MR) (Cont’d)

    $Q_1$ is a mixture model with two components (e.g., for responders and for non-responders)

    Fit two linear models for the two conditional expectations in the expression of $Q_1$, say $\eta_{11}^T H_{11} + \eta_{12}^T H_{12} A_1$ and $\theta_{11}^T H'_{11} + \theta_{12}^T H'_{12} A_1$

    Construct mixture residuals $\hat{\epsilon}_{\text{QL-MR}}$ as

    $\hat{\epsilon}_{\text{QL-MR}} = \hat{\pi}(\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) + (1 - \hat{\pi})(\hat{\beta}_{23}^T H_{23}) - [\hat{\eta}_{11}^T H_{11} + \hat{\eta}_{12}^T H_{12} A_1] - [\hat{\theta}_{11}^T H'_{11} + \hat{\theta}_{12}^T H'_{12} A_1]$

    Assess $\hat{\epsilon}_{\text{QL-MR}}$ using standard residual diagnostic plots; a sketch follows below

    – If lack of fit is detected, adjust the set of predictors and re-assess the model

    – Else, find the optimal DTR based on the fitted models

    The optimal stage-1 decision rule is given by

    $\hat{d}_1(h_1) = \arg\max_{a_1} Q_1(h_1, a_1; \hat{\eta}_1, \hat{\theta}_1)$
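
    Putting the pieces together, a sketch of the mixture-residual computation and the standard diagnostic plots (numpy, scikit-learn and matplotlib assumed available; the toy data and all feature choices are illustrative assumptions):

      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(4)
      n = 300
      # Toy SMART-like data (illustrative only)
      O11 = rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)
      O21 = rng.normal(5 - 0.3 * A1, 1)
      S2 = (O21 > 5).astype(float)
      A2 = np.where(S2 == 1, rng.choice([-1, 1], size=n), 0)
      Y = 1 + O21 - A1 + S2 * (-0.5 * A2 - 0.4 * A2 * O21) + rng.normal(size=n)

      def ols(X, y):
          return np.linalg.lstsq(X, y, rcond=None)[0]

      # Nested stage-2 fit, as in the QL-MR sketch above
      H21 = H22 = H23 = np.column_stack([np.ones(n), O21])
      X2 = np.column_stack([S2[:, None] * H21, S2[:, None] * H22 * A2[:, None],
                            (1 - S2)[:, None] * H23])
      b2 = ols(X2, Y)
      b21, b22, b23 = b2[:2], b2[2:4], b2[4:]

      # pi-hat from a logistic model for S2 given stage-2 history features
      H2f = np.column_stack([O11, A1, O21])
      pi_hat = LogisticRegression(max_iter=1000).fit(
          H2f, S2.astype(int)).predict_proba(H2f)[:, 1]

      # Two stage-1 component fits: eta for the S2 part, theta for the (1 - S2) part
      X1 = np.column_stack([np.ones(n), O11, A1, O11 * A1])
      eta = ols(X1, pi_hat * (H21 @ b21 + np.abs(H22 @ b22)))
      theta = ols(X1, (1 - pi_hat) * (H23 @ b23))

      # Mixture residuals: pseudo-outcome mixture minus the two fitted components
      resid = (pi_hat * (H21 @ b21 + np.abs(H22 @ b22))
               + (1 - pi_hat) * (H23 @ b23) - X1 @ eta - X1 @ theta)

      # Standard diagnostics: residuals against a covariate, plus a histogram
      fig, ax = plt.subplots(1, 2, figsize=(8, 3))
      ax[0].scatter(O11, resid, s=8)
      ax[0].set_xlabel("O11"); ax[0].set_ylabel("QL-MR residual")
      ax[1].hist(resid, bins=30)
      plt.show()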

  • Model Checking for Q-learning

    Asymptotic Properties of QL-MR

    Standard Q-learning and QL-MR are asymptotically equivalent under the following conditions:

    1 The postulated model for $Y$ among individuals with $S_2 = 0$ is correctly specified

    2 The postulated model for $\pi$ is correctly specified

    Precisely, $\tilde{Y}_{\text{QL-MR}} + (1 - S_2)\tau = \tilde{Y} + o_p(1)$, where $\tau = Y - \hat{\beta}_{23}^T H_{23}$ for individuals with $S_2 = 0$

    Under a correctly specified model, $E(\tau \mid H_2) = 0$, and thus the two pseudo-outcomes have the same mean


  • Model Checking for Q-learning

    Inference

    The problem of non-regularity remains the same as in Q-learning

    Either the adaptive confidence interval (ACI) (Laber et al., 2014) or the m-out-of-n bootstrap (Chakraborty et al., 2013) should be employed for constructing confidence intervals

    – We extended and implemented ACI in the current work
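
    For intuition only, a generic m-out-of-n bootstrap sketch for an interval around a scalar estimator (this is the basic fixed-m construction, not the adaptive choice of m in Chakraborty et al. (2013) nor the ACI used in this work; the estimator below is a placeholder):

      import numpy as np

      def m_out_of_n_ci(data, estimator, m, B=1000, level=0.90, seed=0):
          """Basic m-out-of-n bootstrap CI with a fixed resample size m < n.

          Quantiles of sqrt(m) * (theta*_m - theta_n) approximate the sampling
          distribution of sqrt(n) * (theta_n - theta), yielding the interval
          [theta_n - u / sqrt(n), theta_n - l / sqrt(n)].
          """
          rng = np.random.default_rng(seed)
          n = data.shape[0]
          theta_n = estimator(data)
          boot = np.array([estimator(data[rng.integers(0, n, size=m)])
                           for _ in range(B)])
          root = np.sqrt(m) * (boot - theta_n)
          alpha = 1 - level
          l, u = np.quantile(root, [alpha / 2, 1 - alpha / 2])
          return theta_n - u / np.sqrt(n), theta_n - l / np.sqrt(n)

      # Toy usage with a placeholder estimator (the mean of a toy sample)
      toy = np.random.default_rng(1).normal(size=300)
      print(m_out_of_n_ci(toy, np.mean, m=150))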


  • Numerical Study

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Numerical Study

    Simulation Design

    Assess the diagnostic performance of QL-MR as compared to conventional Q-learning

    Simulate a SMART study, analogous to CATIE, with sample size n = 300


  • Generative Model for Simulation Study

    $O_{1j} \overset{\text{i.i.d.}}{\sim} N(0, 1), \quad j = 1, 2$

    $A_1 \in \{-1, 1\}$ with probability 0.5

    $O_{2j} \overset{\text{i.i.d.}}{\sim} N(5 - 0.3\,A_1 - 0.5\,O_{1j},\, 1), \quad j = 1, 2$

    $S_2 = I\{O_{22} > 5\}$

    $A_2 \in \{-1, 1\}$ with probability 0.5

    $g(H_2) = 1 + 2\,O_{11} - 1.5\,O_{11}^2 - 2\,O_{12} + O_{21} - A_1 - 0.5\,A_1 O_{11}$

    $\epsilon \sim N(0, 1)$

    $Y = g(H_2) + S_2 \cdot (0.8\,O_{21} - 0.5\,A_2 - 0.4\,A_2 O_{21} - 0.7\,A_2 O_{11}) + \epsilon$
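
    This generative model translates directly into code; a minimal simulation sketch (numpy only, with n = 300 as stated, taking the normal scale parameter as 1):

      import numpy as np

      def simulate_smart(n=300, seed=0):
          """Simulate one data set from the generative model on this slide."""
          rng = np.random.default_rng(seed)
          O11, O12 = rng.normal(size=n), rng.normal(size=n)
          A1 = rng.choice([-1, 1], size=n)                        # randomized, prob. 0.5
          O21 = rng.normal(5 - 0.3 * A1 - 0.5 * O11, 1)
          O22 = rng.normal(5 - 0.3 * A1 - 0.5 * O12, 1)
          S2 = (O22 > 5).astype(float)                            # re-randomized iff O22 > 5
          A2 = np.where(S2 == 1, rng.choice([-1, 1], size=n), 0)  # A2 only used when S2 = 1
          g = 1 + 2 * O11 - 1.5 * O11**2 - 2 * O12 + O21 - A1 - 0.5 * A1 * O11
          eps = rng.normal(size=n)
          Y = g + S2 * (0.8 * O21 - 0.5 * A2 - 0.4 * A2 * O21 - 0.7 * A2 * O11) + eps
          return dict(O11=O11, O12=O12, A1=A1, O21=O21, O22=O22, S2=S2, A2=A2, Y=Y)

      data = simulate_smart()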

  • Analysis Model

    Assume that the model for the stage-2 Q-function is correctly specified, and check the model fit at stage 1

    At stage 1, fit models for $E[\,\hat{\pi}(\hat{\beta}_{21}^T H_{21} + |\hat{\beta}_{22}^T H_{22}|) \mid H_1, A_1\,]$ and $E[\,(1 - \hat{\pi})(\hat{\beta}_{23}^T H_{23}) \mid H_1, A_1\,]$

    Three types of models are considered for these quantities (design matrices sketched below):

    Model   Variables included
    1       $(O_{11}, O_{12}, A_1)$
    2       $(O_{11}, O_{12}, A_1, O_{11}^2)$
    3       $(O_{11}, O_{12}, A_1, O_{11}^2, A_1 O_{11})$
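
    For reference, the three variable sets above translate into the following stage-1 design matrices; each matrix would then be plugged into the two stage-1 component fits sketched earlier (toy covariates, illustrative only):

      import numpy as np

      rng = np.random.default_rng(5)
      n = 300
      O11, O12 = rng.normal(size=n), rng.normal(size=n)
      A1 = rng.choice([-1, 1], size=n)

      # The three candidate stage-1 feature sets from the table above
      ones = np.ones(n)
      designs = {
          "Model 1": np.column_stack([ones, O11, O12, A1]),
          "Model 2": np.column_stack([ones, O11, O12, A1, O11**2]),
          "Model 3": np.column_stack([ones, O11, O12, A1, O11**2, A1 * O11]),
      }
      for name, X1 in designs.items():
          print(name, X1.shape)   # each X1 feeds the stage-1 component fits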

  • Figure: Model 1 residual plots against O11 and O12, and the residual histogram, for standard Q-learning ("Res QL") and QL-MR ("Res QL-MR"). The orange and green lines are the loess smoother lines for individuals with A1 = +1 and A1 = −1, respectively.

  • Figure: Model 2 residual plots against O11 and O12, and the residual histogram, for standard Q-learning ("Res QL") and QL-MR ("Res QL-MR"). The orange and green lines are the loess smoother lines for individuals with A1 = +1 and A1 = −1, respectively.

  • Figure: Model 3 residual plots against O11 and O12, and the residual histogram, for standard Q-learning ("Res QL") and QL-MR ("Res QL-MR"). The orange and green lines are the loess smoother lines for individuals with A1 = +1 and A1 = −1, respectively.

  • What do the plots say?

    Even after adjusting for quadratic and interaction terms, the residuals from standard Q-learning suggest at least a lack of variance homogeneity and a lack of symmetry / normality

    This finding may lead the analyst to suspect a lack of fit and to consider variance-stabilizing and/or normality-inducing transformations

    – This, in turn, may jeopardize the simplicity and interpretability of Q-learning

    QL-MR, on the other hand, does not mislead the analyst

    – And this is achieved using standard diagnostic tools – without having to invent new residual diagnostic techniques

    In the end, the parameter estimates are similar to those from standard Q-learning – so the extra diagnostic ability does not come at the cost of estimation performance for the key parameters

  • Numerical Study

    Parameter Estimates

    Table: Simulated data: estimates of the stage-2 and stage-1 decision rule parameters

                      Standard Q-learning          QL-MR
    Parameter         Estimate   90% CI            Estimate   90% CI
    Stage-2 Model
    A2                -2.17      (-2.97, -1.37)    -2.18      (-3.01, -1.35)
    A2 O11            -1.67      (-1.84, -1.51)    -1.68      (-1.85, -1.51)
    A2 O21             1.64      ( 1.47,  1.80)     1.64      ( 1.47,  1.81)
    Stage-1 Model
    A1                -0.84      (-1.44, -0.24)    -0.86      (-1.48, -0.26)
    A1 O11            -3.69      (-4.43, -2.96)    -3.75      (-4.49, -3.07)


  • Numerical Study

    CATIE Data Analysis (QoL Outcome)

    Table: CATIE: stage-2 and stage-1 regression models

                                    Standard Q-learning          QL-MR
    Parameter                       Estimate   90% CI            Estimate   90% CI
    Stage-2 Model
    O11: Baseline PANSS              0.01      (-0.12,  0.14)     0.02      (-0.11,  0.15)
    O11²: Baseline PANSS             0.05      (-0.02,  0.13)     0.02      (-0.05,  0.09)
    O12: Baseline Quality of Life    0.48      ( 0.36,  0.61)     0.49      ( 0.37,  0.60)
    A1: Stage-1 treatment            0.004     (-0.11,  0.12)     0.008     (-0.09,  0.11)
    O21: PANSS during stage 1       -0.19      (-0.30, -0.08)    -0.20      (-0.30, -0.10)
    A2: Stage-2 treatment           -0.06      (-0.17,  0.04)    -0.07      (-0.17,  0.03)
    A2 A1                           -0.09      (-0.19,  0.02)    -0.09      (-0.19,  0.01)
    Stage-1 Model
    O11: Baseline PANSS             -0.13      (-0.23, -0.04)    -0.12      (-0.22, -0.03)
    O11²: Baseline PANSS             0.06      ( 0.00,  0.12)     0.05      (-0.01,  0.12)
    O12: Baseline Quality of Life    0.51      ( 0.42,  0.61)     0.50      ( 0.42,  0.59)
    A1: Stage-1 treatment           -0.01      (-0.10,  0.11)    -0.01      (-0.13,  0.09)


  • Discussion

    Outline

    1 Dynamic Treatment Regimes: A Quick Overview

    2 Estimation of Optimal DTRs via Q-learning

    3 Model Checking for Q-learning

    4 Numerical Study

    5 Discussion


  • Summary

    SMART designs are becoming increasingly popular in various domains of health research

    – A particular type of SMART study, in which only the non-responders to the initial treatment are re-randomized, is especially common

    Secondary analysis of SMART studies to find individualized interventions is usually conducted using Q-learning

    For SMARTs in which only the non-responders are re-randomized, model checking for standard Q-learning is problematic

    – This problem has received little, if any, attention in the literature so far

    – We have proposed a simple modification of Q-learning so that standard residual diagnostic tools from the classical regression literature can be used

  • Shoot your questions, comments, criticisms, or requests for slides to: [email protected]
