Corralling a Band of Bandit Algorithms

Alekh Agarwal¹, Haipeng Luo¹, Behnam Neyshabur², Rob Schapire¹

¹ Microsoft Research, New York   ² Toyota Technological Institute at Chicago

Contextual Bandits in Practice

Personalized news recommendation in MSN:

EXP4

Epoch-Greedy

LinUCB

ILOVETOCONBANDITS

BISTRO, BISTRO+

...

Motivation

So many existing algorithms: which one should I use?

no single algorithm is guaranteed to be the best

Naive approach: try all and pick the best

inefficient, wasteful, nonadaptive

Hope: create a master algorithm that

selects base algorithms automatically and adaptively on the fly

performs nearly as well as the best base algorithm in the long run

A Closer Look

Full information setting:

“expert” algorithm (e.g. Hedge (Freund and Schapire, 1997)) solves it

Bandit setting:

use a multi-armed bandit algorithm (e.g. EXP3 (Auer et al., 2002))?

Serious flaw in this approach:

the regret guarantee is only with respect to the base algorithms' actual performance when run under the master

but the base algorithms' performance is itself significantly affected by the lack of feedback they receive

Difficulties

An example (plots omitted): a base algorithm's performance when run separately vs. when run with a master

Right objective: perform almost as well as the best base algorithm if it were run on its own

Difficulties:

worse performance ↔ less feedback

requires a better tradeoff between exploration and exploitation

Related Work and Our Results

Maillard and Munos (2011) studied similar problems

EXP3 with a larger amount of uniform exploration

if the base algorithms have $\sqrt{T}$ regret, the master has $T^{2/3}$ regret

Our results:

a novel algorithm: more active and adaptive exploration

  - almost the same regret as the base algorithms

two major applications:

  - exploiting easy environments while keeping worst-case robustness

  - selecting the correct model automatically

Outline

1 Introduction

2 Formal Setup

3 Main Results

4 Conclusion and Open Problems

A General Bandit Problem

for t = 1 to T do
    Environment reveals some side information $x_t \in \mathcal{X}$
    Player picks an action $\theta_t \in \Theta$
    Environment decides a loss function $f_t : \Theta \times \mathcal{X} \to [0, 1]$
    Player suffers and observes (only) the loss $f_t(\theta_t, x_t)$

Example: contextual bandits

$x$ is a context, $\theta \in \Theta$ is a policy, $f_t(\theta, x)$ = loss of arm $\theta(x)$

environment: i.i.d., adversarial, or hybrid

(Pseudo) Regret: $\mathrm{REG} = \sup_{\theta \in \Theta} \mathbb{E}\!\left[\sum_{t=1}^{T} \big(f_t(\theta_t, x_t) - f_t(\theta, x_t)\big)\right]$
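
To make the protocol concrete, here is a minimal simulation sketch of the interaction loop and the pseudo-regret bookkeeping; the three-action i.i.d. environment and the uniform-exploration player are hypothetical stand-ins of ours, not part of the slides.

```python
import numpy as np

# Minimal sketch of the protocol above (hypothetical 3-action i.i.d. setup, no contexts):
# each round the environment fixes a loss for every action, but the player only
# observes the loss of the action it actually played.
rng = np.random.default_rng(0)
T, n_actions = 1000, 3
mean_loss = np.array([0.5, 0.3, 0.7])       # hypothetical i.i.d. environment

played_loss = 0.0
cum_loss = np.zeros(n_actions)              # counterfactual losses, kept only to compute regret

for t in range(T):
    theta_t = rng.integers(n_actions)       # stand-in player: pure uniform exploration
    losses_t = rng.binomial(1, mean_loss)   # f_t(theta, x_t) for every action theta
    played_loss += losses_t[theta_t]        # the player sees only this single number
    cum_loss += losses_t                    # simulator-side bookkeeping

regret = played_loss - cum_loss.min()       # regret against the best fixed action in hindsight
print(f"regret of uniform play over T={T} rounds: {regret:.0f}")
```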

Base Algorithms

Suppose we are given $M$ base algorithms $\mathcal{B}_1, \ldots, \mathcal{B}_M$.

At each round $t$, receive suggested actions $\theta_t^1, \ldots, \theta_t^M$.

Hope: create a master s.t.

loss of master ≈ loss of the best base algorithm if run on its own

How to formally measure this?

Goal

Suppose running $\mathcal{B}_i$ alone gives

$\mathrm{REG}_{\mathcal{B}_i} \le R_i(T)$

When running the master with all base algorithms, want

$\mathrm{REG}_M \le O(\mathrm{poly}(M)\, R_i(T))$

impossible in general!

Example: for contextual bandits

Algorithm                                     REG          environment
ILOVETOCONBANDITS (Agarwal et al., 2014)      $\sqrt{T}$   i.i.d.
BISTRO+ (Syrgkanis et al., 2016)              $T^{2/3}$    hybrid
Master (ideally)                              $\sqrt{T}$   i.i.d.
Master (ideally)                              $T^{2/3}$    hybrid

A Natural Assumption

Typical strategy:

sample a base algorithm $i_t \sim p_t$

feed importance-weighted feedback to all $\mathcal{B}_i$:  $\frac{f_t(\theta_t, x_t)}{p_{t,i}} \mathbf{1}\{i = i_t\}$

Assume $\mathcal{B}_i$ ensures

$\mathrm{REG}_{\mathcal{B}_i} \le \mathbb{E}\!\left[\Big(\max_t \tfrac{1}{p_{t,i}}\Big)^{\alpha_i}\right] R_i(T)$

Want $\mathrm{REG}_M \le O(\mathrm{poly}(M)\, R_i(T))$

Note: EXP3 still doesn't work
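
A minimal sketch of the importance-weighted feedback step described above; the function name and array layout are our own choices for illustration.

```python
import numpy as np

def importance_weighted_losses(observed_loss, p, i_t):
    """Importance-weighted loss estimates sent to the M base algorithms.

    observed_loss : float, the single loss f_t(theta_t, x_t) actually observed
    p             : array of shape (M,), sampling distribution over base algorithms
    i_t           : int, index of the base algorithm that was sampled

    Base algorithm i receives observed_loss / p[i] if it was the one played and
    0 otherwise, so each estimate is unbiased for that algorithm's loss.
    """
    est = np.zeros_like(p)
    est[i_t] = observed_loss / p[i_t]
    return est

# Example: three base algorithms, the second one was sampled.
# -> base algorithm 1 gets 0.4 / 0.25 = 1.6, the others get 0.
print(importance_weighted_losses(0.4, np.array([0.5, 0.25, 0.25]), 1))
```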

Outline

1 Introduction

2 Formal Setup

3 Main Results

4 Conclusion and Open Problems

A Special OMD

Intuition 1: want $\frac{1}{p_{t,i}}$ to be small

Mirror Map                                                           $1/p_{t,i} \approx$
Shannon Entropy (EXP3)                                               $\exp(\eta \cdot \mathrm{loss})$
Log Barrier $-\frac{1}{\eta}\sum_i \ln p_i$ (Foster et al., 2016)    $\eta \cdot \mathrm{loss}$

In some sense, this provides the least extreme weighting
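
For concreteness, here is a sketch of one log-barrier OMD step of the kind referenced above. With mirror map $-\frac{1}{\eta}\sum_i \ln p_i$, the update has the form $1/p_{t+1,i} = 1/p_{t,i} + \eta(\hat{\ell}_{t,i} - \lambda)$, where the scalar $\lambda$ normalizes the result onto the simplex; the bisection solver and tolerance below are our own implementation choices.

```python
import numpy as np

def log_barrier_omd_step(p, loss_est, eta, tol=1e-12):
    """One OMD step with the log-barrier mirror map (a sketch).

    Solves for p_new with  1/p_new[i] = 1/p[i] + eta * (loss_est[i] - lam),
    where the scalar lam is found by bisection so that p_new sums to 1.
    """
    p = np.asarray(p, dtype=float)
    loss_est = np.asarray(loss_est, dtype=float)

    def simplex_mass(lam):
        return np.sum(1.0 / (1.0 / p + eta * (loss_est - lam)))

    # At lam = min(loss_est) every term is <= p[i], so the total mass is <= 1;
    # as lam approaches the point where some denominator hits 0, the mass blows up.
    lo = np.min(loss_est)
    hi = np.min(loss_est + 1.0 / (eta * p)) - tol
    for _ in range(200):                      # bisection on the normalization constant
        mid = 0.5 * (lo + hi)
        if simplex_mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    p_new = 1.0 / (1.0 / p + eta * (loss_est - lam))
    return p_new / p_new.sum()                # tiny renormalization for numerical safety

# Example: the sampled algorithm (index 0) got a large importance-weighted loss,
# so its probability shrinks, but far less violently than a multiplicative update would.
print(log_barrier_omd_step([0.5, 0.3, 0.2], [2.0, 0.0, 0.0], eta=0.1))
```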

An Increasing Learning Rates Schedule

Intuition 2: need to learn faster if a base algorithm has a low sampling probability

Solution:

individual learning rates: use $\sum_i \frac{-\ln p_i}{\eta_i}$ as the mirror map

increase the learning rate $\eta_i$ when $\frac{1}{p_{t,i}}$ gets too large

Our Algorithm: Corral

initial learning rates $\eta_i = \eta$, initial thresholds $\rho_i = 2M$

for t = 1 to T do
    Observe $x_t$ and send $x_t$ to all base algorithms
    Receive suggested actions $\theta_t^1, \ldots, \theta_t^M$
    Sample $i_t \sim p_t$, predict $\theta_t = \theta_t^{i_t}$, observe loss $f_t(\theta_t, x_t)$
    Construct an unbiased estimated loss $\hat{f}_t^i(\theta, x)$ and send it to $\mathcal{B}_i$
    Update $p_{t+1} \leftarrow \text{Log-Barrier-OMD}\big(p_t,\ \tfrac{f_t(\theta_t, x_t)}{p_{t,i_t}}\, e_{i_t},\ \eta\big)$
    for i = 1 to M do
        if $\frac{1}{p_{t+1,i}} > \rho_i$ then update $\rho_i \leftarrow 2\rho_i$, $\eta_i \leftarrow \beta\eta_i$
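
To show how the pieces fit together, here is a compact simulation-style sketch of the Corral loop above, reusing the importance_weighted_losses and log_barrier_omd_step helpers sketched earlier. The base-algorithm interface (act/update), the fixed-arm base algorithms, the stochastic environment, the value of beta, and the use of a single scalar learning rate inside the OMD step are all simplifications of ours, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

class ConstantArmBase:
    """Hypothetical base 'algorithm' that always plays one fixed arm (for this sketch)."""
    def __init__(self, arm):
        self.arm = arm
    def act(self):
        return self.arm
    def update(self, loss_estimate):
        pass                        # a real base algorithm would learn from this feedback

def corral(bases, arm_means, T=2000, eta=0.1, beta=1.25):
    M = len(bases)
    p = np.full(M, 1.0 / M)         # sampling distribution over base algorithms
    etas = np.full(M, eta)          # per-algorithm learning rates, only ever increased
    rhos = np.full(M, 2.0 * M)      # thresholds on 1 / p_i

    total_loss = 0.0
    for t in range(T):
        actions = [b.act() for b in bases]          # suggested actions theta_t^1..theta_t^M
        i_t = rng.choice(M, p=p)                    # sample a base algorithm
        loss = float(rng.binomial(1, arm_means[actions[i_t]]))   # only this loss is observed
        total_loss += loss

        est = importance_weighted_losses(loss, p, i_t)
        for i, b in enumerate(bases):
            b.update(est[i])                        # unbiased feedback to every base algorithm

        # Log-barrier OMD step; this sketch passes a single scalar rate for simplicity.
        p = log_barrier_omd_step(p, est, etas.min())

        grow = 1.0 / p > rhos                       # "1/p_{t+1,i} > rho_i" rule from the slide
        rhos[grow] *= 2.0
        etas[grow] *= beta
    return total_loss

# Hypothetical run: three base algorithms, each wedded to one arm of a 3-armed bandit.
bases = [ConstantArmBase(a) for a in range(3)]
print(corral(bases, arm_means=[0.6, 0.2, 0.8]))
```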

Regret Guarantee

Theorem

If for some environment there exists a base algorithm $\mathcal{B}_i$ such that

$\mathrm{REG}_{\mathcal{B}_i} \le \mathbb{E}\big[\rho_{T,i}^{\alpha_i}\big]\, R_i(T)$,

then under the same environment Corral ensures

$\mathrm{REG}_M = O\!\left(\frac{M}{\eta} + T\eta - \frac{\mathbb{E}[\rho_{T,i}]}{\eta} + \mathbb{E}\big[\rho_{T,i}^{\alpha_i}\big]\, R_i(T)\right)$
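
Here $\rho_{T,i}$ denotes the final value of the threshold $\rho_i$ maintained by Corral. As a quick sanity check of how this bound is used (our own algebra, not a claim from the slides), consider the case $\alpha_i = 1$:

```latex
% Worked special case (our own algebra): set \alpha_i = 1 in the theorem above.
\[
  -\frac{\mathbb{E}[\rho_{T,i}]}{\eta} + \mathbb{E}[\rho_{T,i}]\, R_i(T)
  = \mathbb{E}[\rho_{T,i}]\Big(R_i(T) - \tfrac{1}{\eta}\Big) \le 0
  \quad \text{whenever } \eta \le \tfrac{1}{R_i(T)},
\]
\[
  \text{so choosing } \eta = \tfrac{1}{R_i(T)}: \qquad
  \mathrm{REG}_M = O\!\Big(M\, R_i(T) + \tfrac{T}{R_i(T)}\Big)
  = O\big(M\, R_i(T)\big) \ \text{when } R_i(T) \ge \sqrt{T}.
\]
```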

Application

Contextual Bandits:

Algorithm                                     REG          environment
ILOVETOCONBANDITS (Agarwal et al., 2014)      $\sqrt{T}$   i.i.d.
BISTRO+ (Syrgkanis et al., 2016)              $T^{2/3}$    hybrid
Corral (over both base algorithms)            $\sqrt{T}$   i.i.d.
Corral (over both base algorithms)            $T^{3/4}$    hybrid

Outline

1 Introduction

2 Formal Setup

3 Main Results

4 Conclusion and Open Problems

Conclusion

We resolve the problem of creating a master that performs almost as well as the best base algorithm if it were run on its own.

least extreme weighting: Log-Barrier-OMD

increasing learning rates to learn faster

applications in many settings

Open problems:

inherit exactly the regret of the base algorithms, i.e. $\mathrm{REG}_M \le O(\mathrm{poly}(M)\, R_i(T))$?

dependence on M: from polynomial to logarithmic?
