Recharging Bandits
Bobby Kleinberg, Cornell University. Joint work with Nicole Immorlica.
NYU Machine Learning Seminar, New York, NY, 24 Oct 2017



Page 1

Recharging Bandits

Bobby Kleinberg, Cornell University

Joint work with Nicole Immorlica.

NYU Machine Learning Seminar, New York, NY, 24 Oct 2017

Page 2

Prologue

Can you construct a dinner schedule that:

never goes 2 days without macaroni and cheese

never goes 3 days without pizza

never goes 5 days without fish

never goes 7 days without tacos?

Answer: Impossible. For $N \ge 60$, $\lfloor N/2 \rfloor + \lfloor N/3 \rfloor + \lfloor N/5 \rfloor > N$.
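A quick numeric check of the counting bound (a minimal Python sketch; the threshold $N \ge 60$ is the one quoted on the slide):

```python
# Each length-g window must contain its item, so an N-day schedule needs at
# least floor(N/g) servings of the item with gap g; the totals must fit in N days.
def required_servings(N, gaps=(2, 3, 5)):
    return sum(N // g for g in gaps)

for N in (60, 90, 120):
    print(N, required_servings(N), required_servings(N) > N)   # always exceeds N
```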

Page 3

Prologue

Can you construct a dinner schedule that:

never goes 2 days without macaroni and cheese

never goes 4 days without pizza

never goes 5 days without fish

never goes 7 days without tacos?

Answer: Possible.

Page 4

Prologue

Can you construct a dinner schedule that:

never goes 2 days without macaroni and cheese

never goes 3 days without pizza

never goes 100 days without fish

never goes 7 days without tacos?

Answer: Impossible. (If fish is served on day $d$, macaroni and cheese must be served on days $d-1$ and $d+1$, leaving the three-day window $\{d-1, d, d+1\}$ with no pizza.)

Page 5

Prologue

Can you construct a dinner schedule that:

never goes 2 days without macaroni and cheese

never goes 5 days without pizza

never goes 100 days without fish

never goes 7 days without tacos?

Answer: Impossible.

Page 8

Prologue

The Pinwheel Problem

Given $g_1, \ldots, g_n$, can $\mathbb{Z}$ be partitioned into $S_1, \ldots, S_n$ such that $S_i$ intersects every interval of length $g_i$?

E.g., $(g_1, \ldots, g_5) = (3, 4, 6, 10, 16)$.

Page 9

Prologue

The Pinwheel Problem

Given $g_1, \ldots, g_n$, can $\mathbb{Z}$ be partitioned into $S_1, \ldots, S_n$ such that $S_i$ intersects every interval of length $g_i$?

What is the complexity of this decision problem?

It belongs to PSPACE.

No non-trivial lower bounds known.

Later in this talk: PTAS for an optimization version.
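Small instances can be decided by exhaustive search over the finite space of "slack vectors" (days remaining before each item's deadline); the search is exponential in general, consistent with the PSPACE upper bound. A minimal Python sketch (the `pinwheel_feasible` helper and its state encoding are illustrative, not from the talk):

```python
def pinwheel_feasible(gaps):
    """Decide the pinwheel problem for small instances by brute force.

    State = tuple of slacks: slack[j] = number of days left (including today)
    in which item j must be served.  One item is served per day.  An infinite
    schedule exists iff a cycle is reachable from the full-slack start state.
    """
    n = len(gaps)
    start = tuple(gaps)

    def moves(state):
        for i in range(n):
            # Serving i today is legal only if every other item can wait a day.
            if all(state[j] >= 2 for j in range(n) if j != i):
                yield tuple(gaps[j] if j == i else state[j] - 1 for j in range(n))

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def dfs(state):              # True iff a cycle is reachable from `state`
        color[state] = GRAY
        for nxt in moves(state):
            c = color.get(nxt, WHITE)
            if c == GRAY or (c == WHITE and dfs(nxt)):
                return True
        color[state] = BLACK
        return False

    return dfs(start)

print(pinwheel_feasible((2, 3, 5)))   # False: the prologue's impossible menu
print(pinwheel_feasible((2, 4, 8)))   # True: alternate item 0 with items 1 and 2
```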

Page 10: Bobby KleinbergCornell Universitypeople.stern.nyu.edu/nzhang/mlseminar17/recharging_nyu.pdf · Recharging Bandits Bobby KleinbergCornell University Joint work with Nicole Immorlica

Prologue

The Pinwheel Problem

Given g1, . . . , gn, can Z be partitioned into S1, . . . ,Sn such that Siintersects every interval of length gi?

What is the complexity of thisdecision problem?

It belongs to PSPACE.

No non-trivial lower bounds known.

Later in this talk: PTAS for anoptimization version.

Page 11

The Multi-Armed Bandit Problem

Stochastic Multi-Armed Bandit Problem: A decision-maker (“gambler”) chooses one of $n$ actions (“arms”) in each time step.

The chosen arm yields a random payoff from an unknown distribution on $[0, 1]$.

Goal: Maximize expected total payoff.

Page 12

Recharging Bandits

In many applications, an arm’s expected payoff is an increasing function of its “idle time.”

Page 16

Recharging Bandits

In many applications, an arm’s expected payoff is an increasing function of its “idle time.”

Recharging Bandits

Pulling arm $i$ at time $t$, when it was last pulled at time $s$, yields random payoff with expectation $H_i(t - s)$.

$H_i$ is an increasing, concave function; $H_i(t) \le t$.

Page 17

Recharging Bandits

In many applications, an arm’s expected payoff is an increasing function of its “idle time.”

Recharging Bandits

Pulling arm $i$ at time $t$, when it was last pulled at time $s$, yields random payoff with expectation $H_i(t - s)$.

$H_i$ is an increasing, concave function; $H_i(t) \le t$.

Concavity assumption implies free disposal: in step $t$, pulling $i$ is better than doing nothing because

$H_i(u - t) + H_i(t - s) \ge H_i(u - s)$.
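A tiny numeric illustration of free disposal with one hypothetical concave recharge curve (any concave, increasing $H$ with $H(0) = 0$ is subadditive, which is all the inequality uses):

```python
import math

H = lambda t: math.sqrt(t)   # concave, increasing; H(t) <= t for integer delays t >= 1

for s, t, u in [(0, 3, 10), (2, 5, 6), (1, 4, 20)]:
    assert H(u - t) + H(t - s) >= H(u - s)   # inserting a pull at t never loses payoff
print("free disposal holds on these samples")
```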

Page 18

Recharging Bandits

In many applications, an arm’s expected payoff is an increasing function of its “idle time.”

Recharging Bandits

Pulling arm $i$ at time $t$, when it was last pulled at time $s$, yields random payoff with expectation $H_i(t - s)$.

$H_i$ is an increasing, concave function; $H_i(t) \le t$.

With known $\{H_i\}$: a special case of deterministic restless bandits.

General case is PSPACE-hard [Papadimitriou & Tsitsiklis 1987].

Which reinforcement learning problems have a PTAS?

Page 19

Recharging Bandits

In many applications, an arm’s expected payoff is an increasing function of its “idle time.”

Recharging Bandits

Pulling arm $i$ at time $t$, when it was last pulled at time $s$, yields random payoff with expectation $H_i(t - s)$.

$H_i$ is an increasing, concave function; $H_i(t) \le t$.

Plan of attack:

1. Analyze optimal play when $\{H_i\}$ are known.

2. Use upper confidence bounds + “ironing” to reduce the case when $\{H_i\}$ must be learned to the case when they are known.

Page 21

Greedy 1/2-Approximation

Greedy algorithm: always maximize payoff in current time step.

Greedy/OPT ratio can be arbitrarily close to 1/2

$H_1(t) = 1 - \varepsilon$, $H_2(t) = t$.

Greedy always pulls arm 2.

“Almost-OPT” pulls arm 1 for $T \gg 1$ time steps, then arm 2.

Net payoff $(2 - \varepsilon)T + 1$ over $T + 1$ time steps.
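A back-of-the-envelope check of this lower-bound instance (a sketch; `eps` and the horizons below are arbitrary):

```python
eps = 0.01
for T in (10, 1_000, 100_000):
    greedy = T + 1                        # greedy pulls arm 2 every step: payoff 1 per step
    almost_opt = (1 - eps) * T + (T + 1)  # arm 1 for T steps, then arm 2 once with idle time T + 1
    print(T, round(greedy / almost_opt, 4))   # approaches 1 / (2 - eps), i.e. about 1/2
```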

Page 22

Greedy 1/2-Approximation

Greedy algorithm: always maximize payoff in current time step.

Greedy/OPT is never less than 1/2

Imagine allowing the algorithm (but not OPT) to pull two arms per time step.

At each time, supplement the greedy selection with the arm selected by OPT, if they differ.

This at most doubles the payoff in each time step.

Net payoff of supplemented schedule ≥ OPT (free disposal property).

Page 23

Rate of Return Function

For $0 \le x \le 1$, let $R_i(x)$ denote the maximum long-run average payoff achievable by playing $i$ in at most an $x$ fraction of time steps.

$$R_i(x) = \sup\left\{ \frac{1}{T} \sum_{j=1}^{\ell} H_i(t_j - t_{j-1}) \;\middle|\; T < \infty,\ \ell \le x \cdot T,\ 0 = t_0 < t_1 < \cdots < t_\ell \le T \right\}.$$

Fact: $R_i$ is piecewise-linear with breakpoints $R_i\!\left(\tfrac{1}{k}\right) = \tfrac{1}{k} H_i(k)$.

Page 25

Rate of Return Function

For $0 \le x \le 1$, let $R_i(x)$ denote the maximum long-run average payoff achievable by playing $i$ in at most an $x$ fraction of time steps.

$$R_i(x) = \sup\left\{ \frac{1}{T} \sum_{j=1}^{\ell} H_i(t_j - t_{j-1}) \;\middle|\; T < \infty,\ \ell \le x \cdot T,\ 0 = t_0 < t_1 < \cdots < t_\ell \le T \right\}.$$

Fact: $R_i$ is piecewise-linear with breakpoints $R_i\!\left(\tfrac{1}{k}\right) = \tfrac{1}{k} H_i(k)$.

[Figure: example with $H_i(1) = 0.9$, $H_i(2) = 1.4$, $H_i(3) = 1.8$, $H_i(4) = 2.0$, giving the piecewise-linear curve $R_i$ with breakpoints $R_i(1/4) = 0.5$, $R_i(1/3) = 0.6$, $R_i(1/2) = 0.7$, $R_i(1) = 0.9$.]
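Reconstructing the curve from the figure's numbers (a sketch; below the smallest tabulated breakpoint it assumes $H_i$ has already saturated at 2.0, so the curve is linear through the origin there):

```python
import numpy as np

H = {1: 0.9, 2: 1.4, 3: 1.8, 4: 2.0}               # H_i(k) from the figure
bps = sorted((1 / k, H[k] / k) for k in H)          # breakpoints R_i(1/k) = H_i(k)/k
print(bps)          # [(0.25, 0.5), (0.333.., 0.6), (0.5, 0.7), (1.0, 0.9)]

xs = [0.0] + [x for x, _ in bps]                    # prepend the origin
ys = [0.0] + [y for _, y in bps]
R = lambda x: float(np.interp(x, xs, ys))           # piecewise-linear interpolation
print(R(0.4))       # 0.64, between the 1/3 and 1/2 breakpoints
```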

Page 26

Rate of Return Function

For $0 \le x \le 1$, let $R_i(x)$ denote the maximum long-run average payoff achievable by playing $i$ in at most an $x$ fraction of time steps.

$$R_i(x) = \sup\left\{ \frac{1}{T} \sum_{j=1}^{\ell} H_i(t_j - t_{j-1}) \;\middle|\; T < \infty,\ \ell \le x \cdot T,\ 0 = t_0 < t_1 < \cdots < t_\ell \le T \right\}.$$

Fact: $R_i$ is piecewise-linear with breakpoints $R_i\!\left(\tfrac{1}{k}\right) = \tfrac{1}{k} H_i(k)$.

Proof sketch: the optimal sequence $0 = t_0 < \cdots < t_\ell \le T$ has at most two distinct gap sizes, $\lfloor 1/x \rfloor$ and $\lceil 1/x \rceil$.

Page 27

Concave Relaxation

The problem

$$\max\left\{ \sum_{i=1}^{n} R_i(x_i) \;\middle|\; \sum_i x_i \le 1,\ \forall i\ x_i \ge 0 \right\}$$

specifies an upper bound on the value of the optimal schedule.

[Figure: rate-of-return curves $R_1(x)$, $R_2(x)$, $R_3(x)$ with allocations $x_1 = 1/2$, $x_2 = 1/3$, $x_3 = 1/6$.]

Mapping $(x_1, \ldots, x_n)$ to a schedule: pinwheel problem!
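Because each $R_i$ is concave and piecewise-linear, the relaxation can be solved greedily by handing the unit of "pull frequency" to the steepest remaining segments first. A minimal sketch (the function name and the illustrative inputs are mine; it assumes each $H_i$ has saturated by its last tabulated value):

```python
import heapq

def solve_relaxation(H_values, budget=1.0):
    """max sum_i R_i(x_i)  s.t.  sum_i x_i <= budget,  x_i >= 0.

    H_values[i] = [H_i(1), ..., H_i(K_i)].  Segments of the concave piecewise-
    linear R_i are pushed on a heap keyed by slope; concavity guarantees the
    global steepest-first order also respects each arm's left-to-right order.
    """
    segments = []                                   # (-slope, arm, width)
    for i, H in enumerate(H_values):
        K = len(H)
        pts = [(1.0 / k, H[k - 1] / k) for k in range(K, 0, -1)]   # breakpoints, increasing x
        prev_x, prev_y = 0.0, 0.0                   # linear below the last breakpoint
        for x, y in pts:
            width = x - prev_x
            heapq.heappush(segments, (-(y - prev_y) / width, i, width))
            prev_x, prev_y = x, y

    x = [0.0] * len(H_values)
    value, remaining = 0.0, budget
    while segments and remaining > 1e-12:
        neg_slope, i, width = heapq.heappop(segments)
        take = min(width, remaining)
        x[i] += take
        value += -neg_slope * take
        remaining -= take
    return x, value

# Illustrative input only (three arms with concave recharge values):
alloc, val = solve_relaxation([[0.9, 1.4, 1.8, 2.0], [0.5, 0.9, 1.2], [0.3, 0.5]])
print(alloc, round(val, 3))
```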

Page 29

Independent Rounding

First idea: every time step, sample arm $i$ with probability $x_i$.

Then $\tau_i$ = delay of arm $i$ = $t_j(i) - t_{j-1}(i)$ is geometrically distributed with expectation $1/x_i$.

Rounding scheme gets $x_i \cdot \mathbb{E} H_i(\tau_i)$, whereas relaxation gets $R_i(x_i) = x_i H_i(1/x_i) = x_i \cdot H_i(\mathbb{E}\tau_i)$.

Fact: if $H$ is concave and non-decreasing and $Y$ is geometrically distributed, then $\mathbb{E} H(Y) \ge \left(1 - \frac{1}{e}\right) H(\mathbb{E} Y)$.

To do better, need rounding scheme that reduces variance of $\tau_i$.
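A Monte Carlo illustration of the geometric fact (not a proof; the capped-linear $H$ is just one concave test function):

```python
import math, random
random.seed(0)

def geometric(x):
    """Number of Bernoulli(x) trials up to and including the first success."""
    k = 1
    while random.random() > x:
        k += 1
    return k

H = lambda t: min(t, 3.0)                        # concave, non-decreasing
for x in (0.1, 0.3, 0.6):
    est = sum(H(geometric(x)) for _ in range(100_000)) / 100_000
    print(x, round(est / H(1 / x), 3), ">=", round(1 - 1 / math.e, 3))
```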

Page 31

Interleaved Arithmetic Progressions

Second idea: round continuous-time schedule to discrete time.

In continuous time, pull $i$ at $\left\{ \frac{r_i + k}{x_i} \,\middle|\, k \in \mathbb{N} \right\}$, where $r_i \sim \mathrm{Unif}[0, 1)$.

Map this schedule to discrete time in an order-preserving manner.

Between two pulls of $i$, we pull $j$ either $\lfloor x_j/x_i \rfloor$ or $\lceil x_j/x_i \rceil$ times.

$\tau_i = 1 + \sum_{j \ne i} Z_j$

$\{Z_j\}$ are independent, each supported on 2 consecutive integers.
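A sketch of the rounding itself: superimpose the randomly shifted arithmetic progressions in continuous time, then read the merged event sequence off into consecutive discrete slots (the function name and test rates are illustrative):

```python
import random
random.seed(1)

def interleaved_ap_schedule(rates, horizon):
    """Round fractional pull rates (summing to at most 1) into a discrete schedule.

    Arm i is pulled at continuous times (r_i + k)/x_i for k = 0, 1, 2, ...;
    sorting the merged events and assigning them to slots 1, 2, ... preserves
    the order, so arm i's discrete delays stay concentrated near 1/x_i.
    """
    events = []
    for i, xi in enumerate(rates):
        r, k = random.random(), 0
        while (r + k) / xi <= horizon:
            events.append(((r + k) / xi, i))
            k += 1
    events.sort()
    return [arm for _, arm in events]        # slot t is the t-th merged event

schedule = interleaved_ap_schedule([0.5, 0.3, 0.2], horizon=30)
print(schedule[:20])
```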

Page 33

Convex Stochastic Ordering

Definition

If $X, Y$ are random variables, the convex stochastic ordering defines $X \le_{\mathrm{cx}} Y$ if and only if $\mathbb{E}\varphi(X) \le \mathbb{E}\varphi(Y)$ for every convex function $\varphi$.

Lemma

If $X$ is a sum of independent Bernoulli random variables and $Y$ is Poisson with $\mathbb{E}Y = \mathbb{E}X$, then $X \le_{\mathrm{cx}} Y$.

$\tau_i = 1 + \sum_{j \ne i} Z_j \le_{\mathrm{cx}} 1 + \mathrm{Pois}\!\left(\frac{1}{x_i} - 1\right)$

$x_i \cdot \mathbb{E} H_i(\tau_i) \ge x_i \cdot \mathbb{E} H_i\!\left(1 + \mathrm{Pois}\!\left(\frac{1}{x_i} - 1\right)\right)$
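A Monte Carlo illustration of the lemma's consequence for concave functions ($\mathbb{E}H(1+X) \ge \mathbb{E}H(1+Y)$ when $X$ is the Bernoulli sum and $Y$ the matching Poisson); the Bernoulli parameters below are arbitrary:

```python
import math, random
random.seed(0)

probs = [0.2, 0.5, 0.7, 0.3]                     # arbitrary Bernoulli parameters
lam = sum(probs)                                 # matching Poisson mean
H = lambda t: math.sqrt(t)                       # concave test function

def poisson(lam):
    """Knuth's multiplicative method for sampling Poisson(lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k, p = k + 1, p * random.random()
    return k - 1

N = 200_000
ex = sum(H(1 + sum(random.random() < q for q in probs)) for _ in range(N)) / N
ey = sum(H(1 + poisson(lam)) for _ in range(N)) / N
print(round(ex, 4), ">=", round(ey, 4))          # lower variance => larger concave mean
```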

Page 36

Approximation Ratio for Interleaved AP Rounding

Fact 1: If $H$ is concave and non-decreasing and $Y$ is Poisson, then $\mathbb{E} H(1 + Y) \ge \left(1 - \frac{1}{2e}\right) H(1 + \mathbb{E}Y)$.

Fact 2: If $H$ is concave and non-decreasing and $Y$ is Poisson with $\mathbb{E}Y \ge m$, then $\mathbb{E} H(1 + Y) \ge \left(1 - \frac{1}{\sqrt{2\pi m}}\right) H(1 + \mathbb{E}Y)$.

Conclusion: Interleaved AP rounding is

a $1 - \frac{1}{2e} \approx 0.816$ approximation in general;

a $1 - \delta$ approximation for “small arms” to whom the concave relaxation assigns $x_i < \delta^2$.
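A quick numeric check of Fact 1 for one concave $H$ and a few Poisson means (an illustration, not a worst-case argument; the choice $H(t) = \min(t, 2)$ with mean 1 is essentially tight):

```python
import math

def E_H_of_1_plus_poisson(H, lam, kmax=200):
    p, total = math.exp(-lam), 0.0               # p = P(Y = k), updated recursively
    for k in range(kmax):
        total += H(1 + k) * p
        p *= lam / (k + 1)
    return total

H = lambda t: min(t, 2.0)                        # concave, non-decreasing
bound = 1 - 1 / (2 * math.e)
for lam in (0.5, 1.0, 3.0, 8.0):
    ratio = E_H_of_1_plus_poisson(H, lam) / H(1 + lam)
    print(lam, round(ratio, 4), ">=", round(bound, 4))
```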

Page 37

PTAS for Recharging Bandits

Let $\varepsilon > 0$ be a small constant. Two easy cases . . .

1. All arms are big. Every arm that gets pulled in the optimal schedule is pulled with frequency $\varepsilon^2$ or greater.

   Then the optimal schedule uses only $1/\varepsilon^2$ arms. Brute-force search takes polynomial time.

2. All arms are small. If the optimal concave program solution has $x_i < \varepsilon^2$ for all $i$, then randomly interleaved arithmetic progressions get a $1 - \varepsilon$ approximation.

Combine the cases using “partial enumeration”. For $p = O_\varepsilon(1)$ . . .

Outer loop: iterate over p-periodic schedules of arms and gaps.

Inner loop: fit small arms into gaps using interleaved AP rounding.

Page 38

PTAS Difficulties

Gaps in the p-periodic schedule may not be equally spaced.

For each small arm choose just one congruence class (mod $p$) of “eligible gaps.”

Bin-pack small arms into congruence classes.

Works if $x_i < \varepsilon^2/p$ for small arms while $x_i \ge 1/p$ for big arms.

Eliminate intermediate arms by finding $k \le 1/\varepsilon$ such that arms with $x_i \in (\varepsilon^{4(k+1)}, \varepsilon^{4k}]$ contribute less than $\varepsilon \cdot \mathrm{OPT}$.

Conclusion: # of big arms $\le (1/\varepsilon)^{O(1/\varepsilon)}$.

Page 41

PTAS Difficulties

Gaps in the p-periodic schedule may not be equally spaced.

Why can we assume big arms are scheduled with period $p = O_\varepsilon(1)$?

We need existence of a $p$-periodic schedule that matches two properties of OPT:

1. rate of return from big arms
2. amount of time left over for small arms

Existence proof is surprisingly technical; omitted.

Conclusion: $p = (\#\mathrm{big})/\varepsilon^2$ suffices.

Page 42

PTAS Difficulties

Gaps in the p-periodic schedule may not be equally spaced.

Why can we assume big arms are scheduled with period $p = O_\varepsilon(1)$?

We need existence of a $p$-periodic schedule that matches two properties of OPT:

1. rate of return from big arms
2. amount of time left over for small arms

Existence proof is surprisingly technical; omitted.

Conclusion: $p = (\#\mathrm{big})/\varepsilon^2$ suffices.

Grand conclusion: PTAS with running time $n^{(1/\varepsilon)^{(24/\varepsilon)}}$.

Page 43

PTAS Difficulties

Gaps in the p-periodic schedule may not be equally spaced.

Why can we assume big arms are scheduled with period $p = O_\varepsilon(1)$?

We need existence of a $p$-periodic schedule that matches two properties of OPT:

1. rate of return from big arms
2. amount of time left over for small arms

Existence proof is surprisingly technical; omitted.

Conclusion: $p = (\#\mathrm{big})/\varepsilon^2$ suffices.

Grand conclusion: PTAS with running time $n^{(1/\varepsilon)^{(24/\varepsilon)}}$.

Remark: the 0.816-approximation runs in time $O(n^2 \log n)$.

Page 44

Recharging Bandits: Regret Minimization

Now suppose $\{H_i\}$ are not known and must be learned by sampling.

Idea: divide time into “planning epochs” of length $\phi = O(n/\varepsilon)$.

In each epoch . . .

1. Compute $\bar{H}_i(x)$, an upper confidence bound on $H_i(x)$, $\forall i$.

2. Run approx. alg. on $\{\bar{H}_i\}$ to schedule arms within the epoch.

3. Update empirical estimates and confidence radii.


Page 45

Recharging Bandits: Regret Minimization

Now suppose $\{H_i\}$ are not known and must be learned by sampling.

Idea: divide time into “planning epochs” of length $\phi = O(n/\varepsilon)$.

In each epoch . . .

1. Compute $\bar{H}_i(x)$, an upper confidence bound on $H_i(x)$, $\forall i$.

2. Run approx. alg. on $\{\bar{H}_i\}$ to schedule arms within the epoch.

3. Update empirical estimates and confidence radii.

Main challenge: Although $H_i$ is concave, $\bar{H}_i$ may not be.

Solution: Work with $\bar{R}_i$ and “iron” the non-concavity, without disrupting the approximation guarantee.

[Figure: estimated rate-of-return curve $\bar{R}_i(x)$.]

Page 47

Recharging Bandits: Regret Minimization

Now suppose $\{H_i\}$ are not known and must be learned by sampling.

Idea: divide time into “planning epochs” of length $\phi = O(n/\varepsilon)$.

In each epoch . . .

1. Compute $\bar{H}_i(x)$, an upper confidence bound on $H_i(x)$, $\forall i$.

2. Run approx. alg. on $\{\bar{H}_i\}$ to schedule arms within the epoch.

3. Update empirical estimates and confidence radii.

Approx. alg. is used almost as a black box.

Can plug in greedy, interleaved AP rounding, or PTAS.

Approx. factor reduced by $1 - \varepsilon$, plus $O\!\left(n \log(n) \sqrt{T \log(nT)}\right)$ regret.

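A hedged sketch of the per-epoch planning step (the names and the confidence-radius formula are illustrative, and taking the least concave majorant is one natural reading of "ironing", not necessarily the paper's exact construction): build $\bar{H}_i$ from empirical means plus a UCB radius, form the breakpoints of $\bar{R}_i$, and iron them back to a concave curve before handing them to the scheduling subroutine.

```python
import math

def ucb_recharge_curve(sums, counts, t):
    """Illustrative UCB estimate H_bar(d) for each observed delay d.

    Assumes every delay key in `sums` has been sampled at least once.
    """
    return {d: sums[d] / counts[d] + math.sqrt(2 * math.log(t) / counts[d])
            for d in sums}

def iron(points):
    """Least concave majorant (upper hull) of (x, y) breakpoints: the 'ironing' step."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop the middle point if it lies on or below the chord to p.
            if (y2 - y1) * (p[0] - x1) <= (p[1] - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# A non-concave set of estimated R_bar breakpoints; the dip at x = 0.5 gets bridged.
print(iron([(0.0, 0.0), (0.25, 0.5), (0.5, 0.55), (0.75, 0.9), (1.0, 1.0)]))
```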

Page 49

Summary

Recharging bandits: A model for learning to schedule recurring tasks (interventions) whose benefit increases with latency.

Approximation algorithms:

simple greedy ($1/2$);

rounding concave relaxation using interleaved arithmetic progressions ($1 - \frac{1}{2e}$);

partial enumeration and concave rounding ($1 - \varepsilon$).

Nice connections to the pinwheel problem in additive combinatorics.

Page 50

Open Questions

1. Pinwheel problem:

   1. Complexity? (Could be in P. Could be PSPACE-complete.)

   2. Is $(g_1, \ldots, g_n)$ always feasible if $\sum_i g_i^{-1} \le 5/6$?

   3. Is $(g_1 + 1, \ldots, g_n + 1)$ always feasible if $\sum_i g_i^{-1} \le 1$?

   Best result in this direction: increase $g_i + 1$ to $g_i + g_i^{1/2 + o(1)}$. [Immorlica-K. 2017]

2. Reinforcement learning: What other special cases admit a PTAS?

Page 53

Open Questions

3. Applications: extend the recharging bandits model to incorporate domain-specific features such as . . .

   1. (fighting poachers) Strategic arms with endogenous payoffs. [Kempe-Schulman-Tamuz ’17]

   2. (invasive species removal) Externalities between arms. Movement costs.

   3. (education) Payoffs with more complex history-dependency. [Novikoff-Kleinberg-Strogatz ’11]