

Treatment Allocations Based on Multi-Armed Bandit Strategies

Wei Qian and Yuhong Yang

Applied Economics and Statistics, University of Delaware
School of Statistics, University of Minnesota

Innovative Statistics and Machine Learning for Precision Medicine
September 15, 2017

Outline

1 Bandit Problems

2 Methodology and Theory

3 Model Combining

4 Numerical Studies

5 Conclusion


Standard Multi-Armed Bandit Problem

There is a wall of slot machines.

Each machine has a certain winning probability of paying $1.

Chances of winning are unknown to the game player.

At each time, one and only one machine can be played, and the immediate result is observed.

Goal: maximize the total number of wins over $N$ plays.



Exploration-Exploitation Tradeoff

Exploration: pull each arm as many times as possible to learn the true reward probabilities.

Exploitation: use the existing information and play the “best” arm.


Motivation: Ethical Clinical Studies

Slot machines: different treatments for a certain disease

Survival probability: unknown to the doctor

Goal: sequentially assign treatments to patients to maximize the survival rate


A Real Example: ECMO Trial

ECMO for treating newborns with persistent pulmonary hypertension?

Ethical dilemma of using a conventional randomized controlled trial
– current patients versus future patients
– two hats on a participating doctor

A solution is response-adaptive design. L.J. Wei's randomized version of the play-the-winner rule was used in a study.

The ECMO trial has generated much discussion; see, e.g., two Statistical Science papers in 1989 and 1991.


Motivation: Online Services

Web applications are generating massive data streams.

Online recommendation systems
– recommend articles to online newspaper readers
– recommend products to customers of online retailers



Motivation: Bandit Problem For Online Services

Slot machines: multiple articles

Each internet visit: one and only one article delivered

Clicking probability: unknown to the internet company

Goal: sequentially choose an article for internet users to maximize the total number of clicks or the click-through rate (CTR)


Bandit Problem With Covariates

Standard bandit problem assumes constant winning probabilities.

In practice, winning probability can be dependent on covariates.

Personalized medical service: treatment effects (e.g., survival probability) can be associated with patients' prognostic factors.



Personalized Web Service

Personalized online advertising and article recommendation: an internet user's interest in an ad or an article can be associated with some user information.


Multi-Armed Bandit with Covariate (MABC) for Precision Medicine

An example scenario:

A few FDA-approved drugs are available on the market for treating a certain disease

Currently, doctors may choose among the available drugs based on limited information and a reading of scattered publications, if any

Why not use the MABC framework for better medical practice?


Two-Armed Bandit Problem with Covariates

Two treatments (news articles): A and B

Patient (user) covariate $x \in [0, 1]$

Recovering (clicking) probability: $f_A(x)$, $f_B(x)$

[Figure: recovering (clicking) probability versus $x$, with example curves $f_A(x)$ and $f_B(x)$.]


Problem Setup: Two-Armed Bandit with Covariates

Given a bandit problem with two arms: treatments A and B

Unknown recovering probabilities given covariate $x \in [0, 1]^d$: $f_A(x)$, $f_B(x)$

Covariates $X_n$, i.i.d. from a continuous distribution $P_X$

At each time $n$:

1 observe the patient covariate $X_n \sim P_X$;

2 based on previous observations and $X_n$, apply a sequential allocation algorithm to choose the treatment $I_n \in \{A, B\}$;

3 observe the result $Y_{I_n,n} \sim \mathrm{Bernoulli}(f_{I_n}(X_n))$: recovery gives $Y_{I_n,n} = 1$; otherwise $Y_{I_n,n} = 0$.

Question: how to design the sequential allocation algorithm?
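As a concrete rendering of this protocol, here is a minimal simulation sketch in Python. The recovery curves, the uniform covariate distribution, and the trivial allocation rule are assumptions for illustration only; they are not the ones used later in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder recovery curves f_A, f_B (illustrative assumptions only).
f = {"A": lambda x: 0.4 + 0.3 * x,
     "B": lambda x: 0.7 - 0.4 * x}

def one_round(choose):
    """Play one round: draw X_n ~ P_X, allocate, observe a Bernoulli outcome."""
    x = rng.uniform(0.0, 1.0)        # covariate X_n ~ P_X (here uniform on [0,1])
    arm = choose(x)                  # sequential allocation rule picks I_n in {A, B}
    y = rng.binomial(1, f[arm](x))   # Y_{I_n,n} ~ Bernoulli(f_{I_n}(X_n))
    return x, arm, y

# Example: a trivial rule that ignores the covariate entirely.
x, arm, y = one_round(lambda x: "A" if rng.random() < 0.5 else "B")
```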



A Measure of Performance: Regret

Given patient covariate $x$,
– "optimal" strategy: give the treatment $I^*(x) := \arg\max_{i \in \{A,B\}} f_i(x)$
– "optimal" recovering probability: $f^*(x) := \max_{i \in \{A,B\}} f_i(x)$

Suppose at time $n$ the patient covariate $X_n$ is observed.
– "optimal" choice: $I^*(X_n)$
– the algorithm chooses treatment $I_n$

$$\mathrm{regret}_n = f^*(X_n) - f_{I_n}(X_n).$$

To measure the overall performance, consider the cumulative regret

$$R_N := \sum_{n=1}^{N} \big( f^*(X_n) - f_{I_n}(X_n) \big).$$

An algorithm is strongly consistent if $R_N = o(N)$ almost surely.
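A minimal sketch of how the cumulative regret would be tallied in a simulation, assuming the true curves $f_A$ and $f_B$ are known to the simulator (they are, of course, hidden from the allocation algorithm):

```python
import numpy as np

def cumulative_regret(xs, arms, f):
    """R_N = sum_n f*(X_n) - f_{I_n}(X_n); computable only in simulation
    because it needs the true recovery curves f = {"A": fA, "B": fB}."""
    regret = 0.0
    for x, arm in zip(xs, arms):
        f_star = max(f["A"](x), f["B"](x))   # optimal recovery probability f*(X_n)
        regret += f_star - f[arm](x)         # per-round regret
    return regret
```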



Model Assumptions of fA and fB

Parametric framework
– Woodroofe, 1979; Auer, 2002; Li et al., 2010; Goldenshluger and Zeevi, 2009, 2013; Bastani and Bayati, 2016
– linear models

Nonparametric framework
– Yang and Zhu, 2002; Rigollet and Zeevi, 2010; Perchet and Rigollet, 2013


Algorithms

Two articles A and B with clicking probabilities $f_A(x)$ and $f_B(x)$

1 Deliver each article an equal number of times (e.g., each is delivered $n_0 = 20$ times): $I_1 = A, I_2 = B, \ldots, I_{2n_0 - 1} = A, I_{2n_0} = B$.

2 For the next internet visit ($n = 2n_0 + 1$), observe the internet user covariate $X_n$.

3 Estimate $f_A$ and $f_B$ using previous data to obtain $\hat f_{A,n}$ and $\hat f_{B,n}$.

4 Find the more promising option $\hat i_n = \arg\max_{i \in \{A,B\}} \hat f_{i,n}(X_n)$, and deliver an article with the randomization scheme

$$I_n = \begin{cases} \hat i_n, & \text{with probability } 1 - \pi_n, \\ i, & \text{with probability } \pi_n, \quad i \neq \hat i_n. \end{cases}$$

Observe the result $Y_{I_n,n}$.
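Putting the four steps together, a sketch of the randomized allocation loop might look as follows. Here `estimate` stands in for whatever regression method is plugged in (kernel estimation is introduced next), and the exploration schedule $\pi_n = 1/\log^2 n$ is borrowed from the numerical illustration later in the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

def allocate(xs, outcomes, N, estimate, n0=20):
    """Randomized allocation: forced initial sampling, then randomization
    around the estimated best arm. `outcomes(arm, x)` draws Y_{I_n,n};
    `estimate(hist, arm, x, n)` returns f-hat_{arm,n}(x)."""
    hist = {"A": [], "B": []}            # (X_j, Y_j) pairs per arm
    choices = []
    for n in range(1, N + 1):
        x = xs[n - 1]
        if n <= 2 * n0:                  # step 1: alternate A, B, A, B, ...
            arm = "A" if n % 2 == 1 else "B"
        else:                            # steps 2-4: observe, estimate, randomize
            pi_n = 1.0 / np.log(n) ** 2  # assumed exploration schedule pi_n
            best = max(("A", "B"), key=lambda a: estimate(hist, a, x, n))
            other = "B" if best == "A" else "A"
            arm = best if rng.random() > pi_n else other
        y = outcomes(arm, x)             # observe the result Y_{I_n,n}
        hist[arm].append((x, y))
        choices.append(arm)
    return choices, hist
```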


Kernel Estimation

Given article A, at each time point $n$, define

$$J_{A,n} = \{ j : I_j = A,\ 1 \le j \le n - 1 \}$$

Nadaraya-Watson estimator of $f_A(x)$:

$$\hat f_{A,n}(x) = \frac{\sum_{j \in J_{A,n}} Y_{A,j}\, K\big(\frac{x - X_j}{h_n}\big)}{\sum_{j \in J_{A,n}} K\big(\frac{x - X_j}{h_n}\big)}$$

kernel function $K(u): \mathbb{R}^d \to \mathbb{R}$; bandwidth $h_n$

Epanechnikov quadratic kernel:

$$K(u) = \frac{3}{4}\big(1 - \|u\|^2\big)\, I(\|u\| \le 1)$$
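A minimal sketch of this estimator in Python, assuming one-dimensional covariates; it can serve as the `estimate` argument of the allocation loop above.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov quadratic kernel K(u) = 3/4 (1 - u^2) on |u| <= 1."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def nw_estimate(hist, arm, x, n, h):
    """Nadaraya-Watson estimate of f_arm(x), using only the rounds
    j in J_{arm,n} where `arm` was played."""
    if not hist[arm]:
        return 0.0
    xs, ys = map(np.array, zip(*hist[arm]))
    w = epanechnikov((x - xs) / h)       # kernel weights K((x - X_j)/h_n)
    return float(w @ ys / w.sum()) if w.sum() > 0 else 0.0
```

For use with `allocate`, one could wrap it with a shrinking bandwidth, e.g. `estimate=lambda hist, a, x, n: nw_estimate(hist, a, x, n, h=n ** (-1 / 6))`; the exponent $-1/6$ is one of the bandwidth choices reported in the numerical study later.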


An UCB-Type Kernel Estimator

Upper Confidence Bound (UCB) kernel estimator:

$$\tilde f_{A,n}(x) = \frac{\sum_{j \in J_{A,n}} Y_{A,j}\, K\big(\frac{x - X_j}{h_n}\big)}{\sum_{j \in J_{A,n}} K\big(\frac{x - X_j}{h_n}\big)} + U_{A,n}(x)$$

A "standard error" quantity:

$$U_{A,n}(x) = \frac{c \sqrt{(\log N) \sum_{j \in J_{A,n}} K^2\big(\frac{x - X_j}{h_n}\big)}}{\sum_{j \in J_{A,n}} K\big(\frac{x - X_j}{h_n}\big)}$$

Under the uniform kernel $K(u) = I(\|u\|_\infty \le 1)$, with $N_{A,n}(x) = \sum_{j \in J_{A,n}} I(\|X_j - x\|_\infty \le h)$,

$$U_{A,n}(x) = c \sqrt{\frac{\log N}{N_{A,n}(x)}}$$
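Extending the Nadaraya-Watson sketch with this upper-confidence term (reusing `epanechnikov` from the previous sketch; the constant `c` and horizon `N` are inputs the caller must supply):

```python
import numpy as np

def ucb_estimate(hist, arm, x, n, h, N, c=1.0):
    """UCB-type kernel estimate: Nadaraya-Watson value plus the bonus
    U_{arm,n}(x) = c * sqrt(log N * sum K^2) / sum K."""
    if not hist[arm]:
        return np.inf                    # unexplored arm: infinite optimism
    xs, ys = map(np.array, zip(*hist[arm]))
    w = epanechnikov((x - xs) / h)       # kernel weights
    s = w.sum()
    if s <= 0:
        return np.inf                    # no local data near x
    bonus = c * np.sqrt(np.log(N) * (w ** 2).sum()) / s
    return float(w @ ys / s) + bonus
```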

Algorithm Illustration

Deliver each article 20 times. $X_1 = 0.93$, article A.

[Figure: clicking probability versus $x$; time $n = 1$, $n_A = 1$, $n_B = 0$.]


Deliver each article 20 times. $X_2 = 0.88$, article B.

[Figure: clicking probability versus $x$; time $n = 2$, $n_A = 1$, $n_B = 1$.]


Deliver each article 20 times.

[Figure: clicking probability versus $x$; time $n = 40$, $n_A = 20$, $n_B = 20$.]

$X_{41} = 0.52$. Estimate $f_A(X_{41})$ and $f_B(X_{41})$ by kernel estimation.

Estimate $f_A(X_{41})$: consider a window $[X_{41} - h, X_{41} + h]$. Similar information may give similar clicking probability. Here $\hat f_A(X_{41}) = 0$.

Estimate $f_B(X_{41})$: consider a window $[X_{41} - h, X_{41} + h]$. Here $\hat f_B(X_{41}) = 0.7996$.

Article B looks more promising: $\hat f_A(X_{41}) < \hat f_B(X_{41})$. With $\pi_n = 20\%$: $P(I_{41} = B \mid H_{41}) = 80\%$, $P(I_{41} = A \mid H_{41}) = 20\%$.

[Figure: clicking probability versus $x$ at time $n = 40$, $n_A = 20$, $n_B = 20$, with the window around $X_{41}$ highlighted.]

Continue the process with decreasing $h_n$ and $\pi_n$ to the end.

[Figure: clicking probability versus $x$; time $n = 800$, $n_A = 349$, $n_B = 451$.]


Challenges and Contributions

Partial information in bandit problem

Breakdown of i.i.d. assumptions: existing consistency results for kernel estimation under i.i.d. or weak-dependence assumptions do not apply

Technical tools to develop new arguments
– martingale theories
– Hoeffding-type inequalities
– "chaining" methods

Strong consistency and finite-time analysis

Dimension reduction and model combination



Asymptotic Performance

Theorem (Qian and Yang, JMLR, 2016a)

If the $f_i$'s ($i \in \{A,B\}$) are uniformly continuous, and $h_n$ and $\pi_n$ are chosen to satisfy $h_n \to 0$, $\pi_n \to 0$, and $n h_n^{2d} \pi_n^4 / (\log n)^3 \to \infty$, then the Nadaraya-Watson estimators are uniformly strongly consistent, that is, for each $i \in \{A,B\}$,

$$\sup_{x \in [0,1]^d} \big| \hat f_{i,n}(x) - f_i(x) \big| \to 0 \quad \text{a.s. as } n \to \infty.$$

Uniform strong consistency of the estimators implies that $R_N = o(N)$ almost surely. Equivalently,

$$\frac{\sum_{n=1}^{N} Y_{I_n,n}}{\sum_{n=1}^{N} Y^*_n} \to 1 \quad \text{a.s. as } N \to \infty$$
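For intuition, a schedule satisfying these conditions can be read off directly; with polynomial rates (an illustrative choice, not one prescribed by the theorem),

$$h_n = n^{-a},\quad \pi_n = n^{-b} \;\Longrightarrow\; \frac{n h_n^{2d} \pi_n^4}{(\log n)^3} = \frac{n^{1 - 2da - 4b}}{(\log n)^3} \to \infty \iff 2da + 4b < 1.$$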


Finite-Time Regret Analysis

Modulus of continuity: $\omega(h; f) = \sup_{\|x_1 - x_2\| \le h} |f(x_1) - f(x_2)|$

Hölder continuity: $\omega(h; f_i) \le \rho h^{\kappa}$ ($0 < \kappa \le 1$)

Theorem (Qian and Yang, JMLR, 2016a)

There exists $n_\delta \ll N$ such that with probability larger than $1 - 2\delta$,

$$R_N < C_1 n_\delta + \sum_{n = n_\delta}^{N} \left( 2 \max_{i \in \{A,B\}} \omega(h_n; f_i) + \sqrt{\frac{C_2 \log N}{n h_n^d \pi_n}} + \pi_n \right) + C_3 \sqrt{N \log\left(\frac{1}{\delta}\right)}.$$

Upper bound of $f^*(X_n) - f_{I_n}(X_n)$:
– estimation bias: $\omega(h_n; f_i)$
– estimation variance: $C_2 \log(N) / (n h_n^d \pi_n)$
– exploration price: $\pi_n$

The same bound reflects two tradeoffs at once:
– nonparametric estimation: bias-variance tradeoff
– bandit problem: exploration-exploitation tradeoff
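Heuristically, under Hölder continuity and ignoring logarithmic factors and the exploration term, balancing bias against variance suggests the rate shown next:

$$h_n^{\kappa} \asymp \sqrt{\frac{1}{n h_n^d}} \;\Longrightarrow\; h_n \asymp n^{-\frac{1}{2\kappa + d}}, \qquad R_N \asymp \sum_{n=1}^{N} n^{-\frac{\kappa}{2\kappa + d}} \asymp N^{1 - \frac{1}{2 + d/\kappa}}.$$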


Finite-Time Regret Upper Bounds

Under Hölder continuity, when using the kernel UCB-type estimator,

$$E R_N < C\, N^{1 - \frac{1}{2 + d/\kappa}} (\log N)^c.$$

– Larger $d$ and smaller $\kappa$ give a larger power index.
– Matches the minimax rate of Perchet and Rigollet (2013) up to a logarithmic factor.

Adaptive performance (Qian and Yang, EJS, 2016b): the near-minimax rate can be achieved without knowing $\kappa$ a priori ($0 < c_* \le \kappa \le 1$).



Model Combining

Different regression methods
– kernel estimation, histogram, K-nearest neighbors
– linear regression

Model combining: weighted average of different statistical models

AFTER (Yang, 2004): combines different forecasting procedures (see the sketch below)

Data-driven algorithm with robust performance
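A sketch of exponential-weighting model combination in the spirit of AFTER; the squared-error weight update here is a simplified stand-in for illustration, not the exact AFTER recipe of Yang (2004).

```python
import numpy as np

def combine(estimates, weights):
    """Weighted average of the candidate models' predictions."""
    return float(np.dot(weights, estimates))

def update_weights(weights, estimates, y, eta=1.0):
    """Exponential reweighting by recent predictive performance: models
    with smaller squared error (y - prediction)^2 gain weight.
    Simplified stand-in for the AFTER update, for illustration only."""
    losses = (y - np.asarray(estimates)) ** 2
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

# Example: three candidate models predict the clicking probability at x,
# then the observed outcome y updates their weights.
w = np.ones(3) / 3
preds = [0.62, 0.55, 0.70]   # e.g., NW with h1, NW with h2, linear regression
y = 1.0                      # observed click / recovery
w = update_weights(w, preds, y)
```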


Model Combining – Illustration

[Figure: clicking probability versus $x$, showing $f_A(x)$ and $f_B(x)$.]

$$f_A(x) = 0.7\, e^{-30(x - 0.2)^2} + 0.7\, e^{-30(x - 0.8)^2}$$

$$f_B(x) = 0.65 - 0.3x$$

Time horizon $N = 800$, $\pi_n = \frac{1}{\log^2 n}$

Model combining:
1 Nadaraya-Watson estimation ($h_1$ and $h_2$)
2 linear regression


Model Combining – Adaptive Performance

Per-round regret $r_n = R_n / n$

[Figure: $r_n$ versus $n$ for the combined algorithm, Nadaraya-Watson with $h_1$, Nadaraya-Watson with $h_2$, and linear regression.]


Yahoo! Front Page Today Module Dataset

46 million internet visit events with user responses and five user covariates over ten days.

Contains a pool of about 10 editor-picked news articles.

Raw data file is 8GB each day.

Algorithms are implemented efficiently in C++.

Potentially adapted for online applications.


Evaluation Results

Algorithms evaluated by click-through rate (CTR):
– complete random
– naive simple average (no covariates)
– LinUCB (Chapelle and Li, 2011): Bayesian logistic regression based algorithm
– model combining: kernel estimation ($h_1 = n^{-1/6}$, $h_2 = n^{-1/8}$, $h_3 = n^{-1/10}$) and naive simple average

                      Random    Naive    LinUCB    Combining
avg. normalized CTR   1.00      1.189    1.225     1.237
std. dev.             –         0.005    0.041     0.018


Conclusion

Precision medicine demands "online" learning for optimal treatment results

MABC provides a framework for designing effective treatment allocation rules, one that integrates learning from experimentation with maximizing the benefits to the patients along the way

Many theoretical and practical issues need to be addressed


Some References

Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002), "Finite-time analysis of the multiarmed bandit problem," Machine Learning, 47, 235-256.

Lai, T. L. and Robbins, H. (1985), "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, 6, 4-22.

Perchet, V. and Rigollet, P. (2013), "The multi-armed bandit problem with covariates," The Annals of Statistics, 41, 693-721.

Qian, W. and Yang, Y. (2016a), "Kernel estimation and model combination in a bandit problem with covariates," Journal of Machine Learning Research, 17, 1-37.

Qian, W. and Yang, Y. (2016b), "Randomized allocation with arm elimination in a bandit problem with covariates," Electronic Journal of Statistics, 10, 242-270.

Robbins, H. (1952), "Some aspects of the sequential design of experiments," Bulletin of the American Mathematical Society, 58, 527-535.

Woodroofe, M. (1979), "A one-armed bandit problem with a concomitant variable," Journal of the American Statistical Association, 74, 799-806.

Yang, Y. (2004), "Combining forecasting procedures: some theoretical results," Econometric Theory, 20, 176-222.

Yang, Y. and Zhu, D. (2002), "Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates," The Annals of Statistics, 30, 100-121.

Yahoo! Academic Relations (2011), Yahoo! front page today module user click log dataset, version 1.0. (Available from http://webscope.sandbox.yahoo.com.)