

Treatment Allocations Based on Multi-Armed Bandit Strategies

Wei Qian and Yuhong Yang

Applied Economics and Statistics, University of Delaware
School of Statistics, University of Minnesota

Innovative Statistics and Machine Learning for Precision Medicine
September 15, 2017

Outline

1 Bandit Problems

2 Methodology and Theory

3 Model Combining

4 Numerical Studies

5 Conclusion


Standard Multi-Armed Bandit Problem

There is a wall of slot machines.

Each machine has a certain winning probability of paying $1.

Chances of winning are unknown to the game player.

At each time, one and only one machine can be played, and the immediate result is observed.

Goal: maximize the total number of wins over $N$ plays.



Exploration-Exploitation Tradeoff

Exploration: pull each arm as many times as possible to learn the true reward probabilities.

Exploitation: use the existing information and play the “best” arm.


Motivation: Ethical Clinical Studies

Slot machines: different treatments for a certain disease

Survival probability: unknown to the doctor

Goal: sequentially assign treatments to patients to maximize the survival rate


A Real Example: ECMO Trial

ECMO for treating newborns with persistent pulmonary hypertension?

Ethical dilemma of using a conventional randomized controlled trial
– current patients versus future patients
– two hats on a participating doctor

A solution is response-adaptive design. L.J. Wei's randomized version of the play-the-winner rule was used in a study.

The ECMO trial has generated much discussion; see, e.g., two Statistical Science papers in 1989 and 1991.


Motivation: Online Services

Web applications are generating massive data streams.

Online recommendation systems
– recommend articles to online newspaper readers
– recommend products to customers of online retailers



Motivation: Bandit Problem For Online Services

Slot machines: multiple articles

Each internet visit: one and only one article delivered

Clicking probability: unknown to the internet company

Goal: sequentially choose an article for internet users to maximize the total number of clicks or the click-through rate (CTR)


Bandit Problem With Covariates

Standard bandit problem assumes constant winning probabilities.

In practice, winning probability can be dependent on covariates.

Personalized medical service: treatment effects (e.g., survival probability) can be associated with patients' prognostic factors.



Personalized Web Service

Personalized online advertising and article recommendation: an internet user's interest in an ad or an article can be associated with some user information.


Multi-Armed Bandit with Covariate (MABC) for Precision Medicine

An example scenario:

A few FDA-approved drugs are available on the market for treating a certain disease

Currently, doctors may choose among the available drugs based on limited information and a reading of scattered publications, if any

Why not use the MABC framework for better medical practice?


Two-Armed Bandit Problem with Covariates

Two treatments (news articles): A and B

Patient (user) covariate $x \in [0, 1]$

Recovering (clicking) probability: $f_A(x)$, $f_B(x)$

[Figure: recovering (clicking) probability versus $x$, with example curves $f_A(x)$ and $f_B(x)$.]


Problem Setup: Two-Armed Bandit with Covariates

Given a bandit problem with two arms: treatments A and B

Unknown recovering probabilities given covariate $x \in [0, 1]^d$: $f_A(x)$, $f_B(x)$

Covariates $X_n$, i.i.d. from a continuous distribution $P_X$

At each time $n$:

1 observe the patient covariate $X_n \sim P_X$;

2 based on previous observations and $X_n$, apply a sequential allocation algorithm to choose the treatment $I_n \in \{A, B\}$;

3 observe the result $Y_{I_n,n} \sim \mathrm{Bernoulli}(f_{I_n}(X_n))$: recovery gives $Y_{I_n,n} = 1$; otherwise $Y_{I_n,n} = 0$.

Question: how to design the sequential allocation algorithm?
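As a concrete rendering of this protocol, here is a minimal simulation sketch in Python. The recovery curves, the uniform covariate distribution, and the trivial allocation rule are assumptions for illustration only; they are not the ones used later in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder recovery curves f_A, f_B (illustrative assumptions only).
f = {"A": lambda x: 0.4 + 0.3 * x,
     "B": lambda x: 0.7 - 0.4 * x}

def one_round(choose):
    """Play one round: draw X_n ~ P_X, allocate, observe a Bernoulli outcome."""
    x = rng.uniform(0.0, 1.0)        # covariate X_n ~ P_X (here uniform on [0,1])
    arm = choose(x)                  # sequential allocation rule picks I_n in {A, B}
    y = rng.binomial(1, f[arm](x))   # Y_{I_n,n} ~ Bernoulli(f_{I_n}(X_n))
    return x, arm, y

# Example: a trivial rule that ignores the covariate entirely.
x, arm, y = one_round(lambda x: "A" if rng.random() < 0.5 else "B")
```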



A Measure of Performance: Regret

Given patient covariate $x$,
– "optimal" strategy: give the treatment $I^*(x) := \arg\max_{i \in \{A,B\}} f_i(x)$
– "optimal" recovering probability: $f^*(x) := \max_{i \in \{A,B\}} f_i(x)$

Suppose at time $n$ the patient covariate $X_n$ is observed.
– "optimal" choice: $I^*(X_n)$
– the algorithm chooses treatment $I_n$

$$\mathrm{regret}_n = f^*(X_n) - f_{I_n}(X_n).$$

To measure the overall performance, consider the cumulative regret

$$R_N := \sum_{n=1}^{N} \big( f^*(X_n) - f_{I_n}(X_n) \big).$$

An algorithm is strongly consistent if $R_N = o(N)$ almost surely.
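A minimal sketch of how the cumulative regret would be tallied in a simulation, assuming the true curves $f_A$ and $f_B$ are known to the simulator (they are, of course, hidden from the allocation algorithm):

```python
import numpy as np

def cumulative_regret(xs, arms, f):
    """R_N = sum_n f*(X_n) - f_{I_n}(X_n); computable only in simulation
    because it needs the true recovery curves f = {"A": fA, "B": fB}."""
    regret = 0.0
    for x, arm in zip(xs, arms):
        f_star = max(f["A"](x), f["B"](x))   # optimal recovery probability f*(X_n)
        regret += f_star - f[arm](x)         # per-round regret
    return regret
```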



Model Assumptions of fA and fB

Parametric framework
– Woodroofe, 1979; Auer, 2002; Li et al., 2010; Goldenshluger and Zeevi, 2009, 2013; Bastani and Bayati, 2016
– linear models

Nonparametric framework
– Yang and Zhu, 2002; Rigollet and Zeevi, 2010; Perchet and Rigollet, 2013


Algorithms

Two articles A and B with clicking probabilities $f_A(x)$ and $f_B(x)$

1 Deliver each article an equal number of times (e.g., each is delivered $n_0 = 20$ times): $I_1 = A, I_2 = B, \ldots, I_{2n_0 - 1} = A, I_{2n_0} = B$.

2 For the next internet visit ($n = 2n_0 + 1$), observe the internet user covariate $X_n$.

3 Estimate $f_A$ and $f_B$ using previous data to obtain $\hat f_{A,n}$ and $\hat f_{B,n}$.

4 Find the more promising option $\hat i_n = \arg\max_{i \in \{A,B\}} \hat f_{i,n}(X_n)$, and deliver an article with the randomization scheme

$$I_n = \begin{cases} \hat i_n, & \text{with probability } 1 - \pi_n, \\ i, & \text{with probability } \pi_n, \quad i \neq \hat i_n. \end{cases}$$

Observe the result $Y_{I_n,n}$.
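Putting the four steps together, a sketch of the randomized allocation loop might look as follows. Here `estimate` stands in for whatever regression method is plugged in (kernel estimation is introduced next), and the exploration schedule $\pi_n = 1/\log^2 n$ is borrowed from the numerical illustration later in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

def allocate(xs, outcomes, N, estimate, n0=20):
    """Randomized allocation: forced initial sampling, then randomization
    around the estimated best arm. `outcomes(arm, x)` draws Y_{I_n,n};
    `estimate(hist, arm, x, n)` returns f-hat_{arm,n}(x)."""
    hist = {"A": [], "B": []}            # (X_j, Y_j) pairs per arm
    choices = []
    for n in range(1, N + 1):
        x = xs[n - 1]
        if n <= 2 * n0:                  # step 1: alternate A, B, A, B, ...
            arm = "A" if n % 2 == 1 else "B"
        else:                            # steps 2-4: observe, estimate, randomize
            pi_n = 1.0 / np.log(n) ** 2  # assumed exploration schedule pi_n
            best = max(("A", "B"), key=lambda a: estimate(hist, a, x, n))
            other = "B" if best == "A" else "A"
            arm = best if rng.random() > pi_n else other
        y = outcomes(arm, x)             # observe the result Y_{I_n,n}
        hist[arm].append((x, y))
        choices.append(arm)
    return choices, hist
```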


Kernel Estimation

Given article A, at each time point $n$, define

$$J_{A,n} = \{ j : I_j = A,\ 1 \le j \le n - 1 \}$$

Nadaraya-Watson estimator of $f_A(x)$:

$$\hat f_{A,n}(x) = \frac{\sum_{j \in J_{A,n}} Y_{A,j}\, K\big(\frac{x - X_j}{h_n}\big)}{\sum_{j \in J_{A,n}} K\big(\frac{x - X_j}{h_n}\big)}$$

kernel function $K(u): \mathbb{R}^d \to \mathbb{R}$; bandwidth $h_n$

Epanechnikov quadratic kernel:

$$K(u) = \frac{3}{4}\big(1 - \|u\|^2\big)\, I(\|u\| \le 1)$$
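A minimal sketch of this estimator in Python, assuming one-dimensional covariates; it can serve as the `estimate` argument of the allocation loop above.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov quadratic kernel K(u) = 3/4 (1 - u^2) on |u| <= 1."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def nw_estimate(hist, arm, x, n, h):
    """Nadaraya-Watson estimate of f_arm(x), using only the rounds
    j in J_{arm,n} where `arm` was played."""
    if not hist[arm]:
        return 0.0
    xs, ys = map(np.array, zip(*hist[arm]))
    w = epanechnikov((x - xs) / h)       # kernel weights K((x - X_j)/h_n)
    return float(w @ ys / w.sum()) if w.sum() > 0 else 0.0
```

For use with `allocate`, one could wrap it with a shrinking bandwidth, e.g. `estimate=lambda hist, a, x, n: nw_estimate(hist, a, x, n, h=n ** (-1 / 6))`; the exponent $-1/6$ is one of the bandwidth choices reported in the numerical study later.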


An UCB-Type Kernel Estimator

Upper Confidence Bound (UCB) kernel estimator:

$$\tilde f_{A,n}(x) = \frac{\sum_{j \in J_{A,n}} Y_{A,j}\, K\big(\frac{x - X_j}{h_n}\big)}{\sum_{j \in J_{A,n}} K\big(\frac{x - X_j}{h_n}\big)} + U_{A,n}(x)$$

A "standard error" quantity:

$$U_{A,n}(x) = \frac{c \sqrt{(\log N) \sum_{j \in J_{A,n}} K^2\big(\frac{x - X_j}{h_n}\big)}}{\sum_{j \in J_{A,n}} K\big(\frac{x - X_j}{h_n}\big)}$$

Under the uniform kernel $K(u) = I(\|u\|_\infty \le 1)$, with $N_{A,n}(x) = \sum_{j \in J_{A,n}} I(\|X_j - x\|_\infty \le h)$,

$$U_{A,n}(x) = c \sqrt{\frac{\log N}{N_{A,n}(x)}}$$
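Extending the Nadaraya-Watson sketch with this upper-confidence term (reusing `epanechnikov` from the previous sketch; the constant `c` and horizon `N` are inputs the caller must supply):

```python
import numpy as np

def ucb_estimate(hist, arm, x, n, h, N, c=1.0):
    """UCB-type kernel estimate: Nadaraya-Watson value plus the bonus
    U_{arm,n}(x) = c * sqrt(log N * sum K^2) / sum K."""
    if not hist[arm]:
        return np.inf                    # unexplored arm: infinite optimism
    xs, ys = map(np.array, zip(*hist[arm]))
    w = epanechnikov((x - xs) / h)       # kernel weights
    s = w.sum()
    if s <= 0:
        return np.inf                    # no local data near x
    bonus = c * np.sqrt(np.log(N) * (w ** 2).sum()) / s
    return float(w @ ys / s) + bonus
```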

Algorithm Illustration

Deliver each article 20 times. $X_1 = 0.93$, article A.

[Figure: clicking probability versus $x$; time $n = 1$, $n_A = 1$, $n_B = 0$.]


Deliver each article 20 times. $X_2 = 0.88$, article B.

[Figure: clicking probability versus $x$; time $n = 2$, $n_A = 1$, $n_B = 1$.]


Deliver each article 20 times.

[Figure: clicking probability versus $x$; time $n = 40$, $n_A = 20$, $n_B = 20$.]

$X_{41} = 0.52$. Estimate $f_A(X_{41})$ and $f_B(X_{41})$ by kernel estimation.

Estimate $f_A(X_{41})$: consider a window $[X_{41} - h, X_{41} + h]$. Similar information may give similar clicking probability. Here $\hat f_A(X_{41}) = 0$.

Estimate $f_B(X_{41})$: consider a window $[X_{41} - h, X_{41} + h]$. Here $\hat f_B(X_{41}) = 0.7996$.

Article B looks more promising: $\hat f_A(X_{41}) < \hat f_B(X_{41})$. With $\pi_n = 20\%$: $P(I_{41} = B \mid H_{41}) = 80\%$, $P(I_{41} = A \mid H_{41}) = 20\%$.

[Figure: clicking probability versus $x$ at time $n = 40$, $n_A = 20$, $n_B = 20$, with the window around $X_{41}$ highlighted.]

Continue the process with decreasing $h_n$ and $\pi_n$ to the end.

[Figure: clicking probability versus $x$; time $n = 800$, $n_A = 349$, $n_B = 451$.]


Challenges and Contributions

Partial information in bandit problem

Breakdown of i.i.d. assumptions: existing consistency results for kernel estimation under i.i.d. or weak-dependence assumptions do not apply

Technical tools to develop new arguments
– martingale theories
– Hoeffding-type inequalities
– "chaining" methods

Strong consistency and finite-time analysis

Dimension reduction and model combination



Asymptotic Performance

Theorem (Qian and Yang, JMLR, 2016a)

If the $f_i$'s ($i \in \{A,B\}$) are uniformly continuous, and $h_n$ and $\pi_n$ are chosen to satisfy $h_n \to 0$, $\pi_n \to 0$, and $n h_n^{2d} \pi_n^4 / (\log n)^3 \to \infty$, then the Nadaraya-Watson estimators are uniformly strongly consistent, that is, for each $i \in \{A,B\}$,

$$\sup_{x \in [0,1]^d} \big| \hat f_{i,n}(x) - f_i(x) \big| \to 0 \quad \text{a.s. as } n \to \infty.$$

Uniform strong consistency of the estimators implies that $R_N = o(N)$ almost surely. Equivalently,

$$\frac{\sum_{n=1}^{N} Y_{I_n,n}}{\sum_{n=1}^{N} Y^*_n} \to 1 \quad \text{a.s. as } N \to \infty$$
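For intuition, a schedule satisfying these conditions can be read off directly; with polynomial rates (an illustrative choice, not one prescribed by the theorem),

$$h_n = n^{-a},\quad \pi_n = n^{-b} \;\Longrightarrow\; \frac{n h_n^{2d} \pi_n^4}{(\log n)^3} = \frac{n^{1 - 2da - 4b}}{(\log n)^3} \to \infty \iff 2da + 4b < 1.$$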


Finite-Time Regret Analysis

Modulus of continuity: $\omega(h; f) = \sup_{\|x_1 - x_2\| \le h} |f(x_1) - f(x_2)|$

Hölder continuity: $\omega(h; f_i) \le \rho h^{\kappa}$ ($0 < \kappa \le 1$)

Theorem (Qian and Yang, JMLR, 2016a)

There exists $n_\delta \ll N$ such that with probability larger than $1 - 2\delta$,

$$R_N < C_1 n_\delta + \sum_{n = n_\delta}^{N} \left( 2 \max_{i \in \{A,B\}} \omega(h_n; f_i) + \sqrt{\frac{C_2 \log N}{n h_n^d \pi_n}} + \pi_n \right) + C_3 \sqrt{N \log\left(\frac{1}{\delta}\right)}.$$

Upper bound of $f^*(X_n) - f_{I_n}(X_n)$:
– estimation bias: $\omega(h_n; f_i)$
– estimation variance: $C_2 \log(N) / (n h_n^d \pi_n)$
– exploration price: $\pi_n$

The same bound reflects two tradeoffs at once:
– nonparametric estimation: bias-variance tradeoff
– bandit problem: exploration-exploitation tradeoff
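Heuristically, under Hölder continuity and ignoring logarithmic factors and the exploration term, balancing bias against variance suggests the rate shown next:

$$h_n^{\kappa} \asymp \sqrt{\frac{1}{n h_n^d}} \;\Longrightarrow\; h_n \asymp n^{-\frac{1}{2\kappa + d}}, \qquad R_N \asymp \sum_{n=1}^{N} n^{-\frac{\kappa}{2\kappa + d}} \asymp N^{1 - \frac{1}{2 + d/\kappa}}.$$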


Finite-Time Regret Upper Bounds

Under Hölder continuity, when using the kernel UCB-type estimator,

$$E R_N < C\, N^{1 - \frac{1}{2 + d/\kappa}} (\log N)^c.$$

– Larger $d$ and smaller $\kappa$ give a larger power index.
– Matches the minimax rate of Perchet and Rigollet (2013) up to a logarithmic factor.

Adaptive performance (Qian and Yang, EJS, 2016b): the near-minimax rate can be achieved without knowing $\kappa$ a priori ($0 < c_* \le \kappa \le 1$).



Model Combining

Different regression methods
– kernel estimation, histogram, K-nearest neighbors
– linear regression

Model combining: weighted average of different statistical models

AFTER (Yang, 2004): combines different forecasting procedures (see the sketch below)

Data-driven algorithm with robust performance
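A sketch of exponential-weighting model combination in the spirit of AFTER; the squared-error weight update here is a simplified stand-in for illustration, not the exact AFTER recipe of Yang (2004).

```python
import numpy as np

def combine(estimates, weights):
    """Weighted average of the candidate models' predictions."""
    return float(np.dot(weights, estimates))

def update_weights(weights, estimates, y, eta=1.0):
    """Exponential reweighting by recent predictive performance: models
    with smaller squared error (y - prediction)^2 gain weight.
    Simplified stand-in for the AFTER update, for illustration only."""
    losses = (y - np.asarray(estimates)) ** 2
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

# Example: three candidate models predict the clicking probability at x,
# then the observed outcome y updates their weights.
w = np.ones(3) / 3
preds = [0.62, 0.55, 0.70]   # e.g., NW with h1, NW with h2, linear regression
y = 1.0                      # observed click / recovery
w = update_weights(w, preds, y)
```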


Model Combining – Illustration

[Figure: clicking probability versus $x$, showing $f_A(x)$ and $f_B(x)$.]

$$f_A(x) = 0.7\, e^{-30(x - 0.2)^2} + 0.7\, e^{-30(x - 0.8)^2}$$

$$f_B(x) = 0.65 - 0.3x$$

Time horizon $N = 800$, $\pi_n = \frac{1}{\log^2 n}$

Model combining:
1 Nadaraya-Watson estimation ($h_1$ and $h_2$)
2 linear regression


Model Combining – Adaptive Performance

Per-round regret $r_n = R_n / n$

[Figure: $r_n$ versus $n$ for the combined algorithm, Nadaraya-Watson with $h_1$, Nadaraya-Watson with $h_2$, and linear regression.]


Yahoo! Front Page Today Module Dataset

46 million internet visit events with user responses and five user covariates over ten days.

Contains a pool of about 10 editor-picked news articles.

Raw data file is 8GB each day.

Algorithms are implemented efficiently in C++.

Potentially adapted for online applications.


Evaluation Results

Algorithms evaluated by click-through rate (CTR):
– complete random
– naive simple average (no covariates)
– LinUCB (Chapelle and Li, 2011): Bayesian logistic regression based algorithm
– model combining: kernel estimation ($h_1 = n^{-1/6}$, $h_2 = n^{-1/8}$, $h_3 = n^{-1/10}$) and naive simple average

                      Random    Naive    LinUCB    Combining
avg. normalized CTR   1.00      1.189    1.225     1.237
std. dev.             –         0.005    0.041     0.018


Conclusion

Precision medicine demands "online" learning for optimal treatment results

MABC provides a framework for designing effective treatment allocation rules, one that integrates learning from experimentation with maximizing the benefits to the patients along the way

Many theoretical and practical issues need to be addressed


Some References

Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002), "Finite-time analysis of the multiarmed bandit problem," Machine Learning, 47, 235-256.

Lai, T. L. and Robbins, H. (1985), "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, 6, 4-22.

Perchet, V. and Rigollet, P. (2013), "The multi-armed bandit problem with covariates," The Annals of Statistics, 41, 693-721.

Qian, W. and Yang, Y. (2016a), "Kernel estimation and model combination in a bandit problem with covariates," Journal of Machine Learning Research, 17, 1-37.

Qian, W. and Yang, Y. (2016b), "Randomized allocation with arm elimination in a bandit problem with covariates," Electronic Journal of Statistics, 10, 242-270.

Robbins, H. (1952), "Some aspects of the sequential design of experiments," Bulletin of the American Mathematical Society, 58, 527-535.

Woodroofe, M. (1979), "A one-armed bandit problem with a concomitant variable," Journal of the American Statistical Association, 74, 799-806.

Yang, Y. (2004), "Combining forecasting procedures: some theoretical results," Econometric Theory, 20, 176-222.

Yang, Y. and Zhu, D. (2002), "Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates," The Annals of Statistics, 30, 100-121.

Yahoo! Academic Relations (2011), Yahoo! front page today module user click log dataset, version 1.0. (Available from http://webscope.sandbox.yahoo.com.)