
Multi-armed Bandits (MAB)

Xi Chen

Slides are based on the AAAI 2018 tutorial by Tor Lattimore & Csaba Szepesvári



Overview

• What are bandits, and why you should care

• Finite-armed stochastic bandits
  • Explore-Then-Commit (ETC) algorithm
  • Upper Confidence Bound (UCB) algorithm
  • Lower bound

• Finite-armed adversarial bandits



What’s in a name? A tiny bit of history

First bandit algorithm proposed by Thompson (1933)

Bush and Mosteller (1953) were interested in how mice behaved in a T-maze



Applications

• Clinical trials/dose discovery

• Recommendation systems (movies/news/etc)

• Advertisement placement

• A/B testing

• Dynamic pricing (e.g., for Amazon products)

• Ranking (e.g., for search)

• Resource allocation

• Bandit problems isolate an important component of reinforcement learning: the exploration-vs-exploitation trade-off



Finite-armed bandits

• K actions

• n rounds

• In each round t the learner chooses an action

  A_t ∈ {1, 2, ..., K}

• Observes a reward X_t ∼ P_{A_t}, where P_1, P_2, ..., P_K are unknown distributions (Gaussian or subgaussian)

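To make the protocol concrete, here is a minimal simulation sketch in Python (not from the slides; all names and parameter values are illustrative): a K-armed Gaussian bandit and the round-by-round interaction loop, played by a trivial uniformly random learner.

```python
import numpy as np

rng = np.random.default_rng(0)

K, n = 5, 1000
mu = rng.uniform(0, 1, size=K)      # unknown mean rewards mu_1, ..., mu_K

def pull(i):
    """Observe X_t ~ P_i = N(mu_i, 1)."""
    return mu[i] + rng.normal()

total_reward = 0.0
for t in range(n):
    A_t = rng.integers(K)           # the learner chooses an action
    total_reward += pull(A_t)       # ... and observes the reward

# Realized regret: a noisy estimate of R_n = n*mu_star - E[sum of rewards].
print("regret of uniform play:", n * mu.max() - total_reward)
```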


Example: A/B testing

• A business wants to optimize its webpage

• Actions correspond to 'A' and 'B' (two arms)

• Users arrive at the webpage sequentially

• The algorithm chooses either 'A' or 'B' (pulling an arm)

• Receives activity feedback (a click as the reward)



Measuring performance – the regret

• Let µi be the mean reward of distribution Pi

• µ∗ = maxi µi is the maximum mean

• The (expected) regret is

  R_n = nµ* − E[ ∑_{t=1}^n X_t ]

• A reasonable policy should at least have sublinear regret: R_n = o(n)

• Of course we would like to make it 'as small as possible'



Measuring performance – the regret

Let ∆_i = µ* − µ_i be the suboptimality gap for the i-th arm

Let T_i(n) be the number of times arm i is played over all n rounds

Key decomposition lemma:

  R_n = ∑_{i=1}^K ∆_i E[T_i(n)]

Proof  Let E_t[·] = E[· | A_1, X_1, ..., X_{t−1}, A_t]. Then

  R_n = nµ* − E[ ∑_{t=1}^n X_t ]
      = nµ* − ∑_{t=1}^n E[ E_t[X_t] ]
      = nµ* − ∑_{t=1}^n E[ µ_{A_t} ]
      = ∑_{t=1}^n E[ ∆_{A_t} ]
      = E[ ∑_{t=1}^n ∑_{i=1}^K 1(A_t = i) ∆_i ]
      = E[ ∑_{i=1}^K ∆_i ∑_{t=1}^n 1(A_t = i) ]
      = E[ ∑_{i=1}^K ∆_i T_i(n) ]
      = ∑_{i=1}^K ∆_i E[T_i(n)]
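The lemma is easy to sanity-check numerically. Below is a small sketch (illustrative, not from the slides) estimating both sides by Monte Carlo for a uniformly random policy on a 3-armed Gaussian bandit; the two printed numbers should roughly agree.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.9, 0.5, 0.2])           # arm means; mu* = 0.9
K, n, reps = len(mu), 200, 2000
gaps = mu.max() - mu                     # suboptimality gaps Delta_i

regret_sum, T = 0.0, np.zeros(K)
for _ in range(reps):
    A = rng.integers(K, size=n)          # uniformly random policy
    X = mu[A] + rng.normal(size=n)       # rewards X_t ~ N(mu_{A_t}, 1)
    regret_sum += n * mu.max() - X.sum() # realized regret
    T += np.bincount(A, minlength=K)     # pull counts T_i(n)

print("Monte Carlo R_n            :", regret_sum / reps)
print("sum_i Delta_i * E[T_i(n)]  :", (gaps * T / reps).sum())
```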



A simple policy: Explore-Then-Commit

1. Choose each action m times

2. Find the empirically best action I ∈ {1, 2, ..., K} (i.e., the action I with the largest average reward over its m plays)

3. Choose A_t = I for all remaining n − mK rounds

In order to analyse this policy we need to bound the probability of committing to a suboptimal action.
Needed probability tool: concentration inequalities.

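A minimal sketch of the ETC policy itself (illustrative code assuming N(µ_i, 1) rewards; the names are mine, not the slides'):

```python
import numpy as np

def explore_then_commit(mu, n, m, rng):
    """ETC: play each of the K arms m times, then commit to the
    empirically best arm for the remaining n - m*K rounds.
    Rewards are N(mu_i, 1). Returns the realized regret."""
    K = len(mu)
    reward = 0.0
    means = np.zeros(K)
    for i in range(K):                        # exploration phase
        x = mu[i] + rng.normal(size=m)
        means[i] = x.mean()
        reward += x.sum()
    best = int(np.argmax(means))              # empirically best arm I
    reward += (mu[best] + rng.normal(size=n - m * K)).sum()   # commit
    return n * mu.max() - reward

rng = np.random.default_rng(2)
print(explore_then_commit(np.array([0.5, 0.0]), n=1000, m=50, rng=rng))
```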


A crash course in concentration

Let Z_1, Z_2, ..., Z_n be a sequence of independent and identically distributed random variables with mean µ ∈ R and variance σ² < ∞

  empirical mean:  µ̂_n = (1/n) ∑_{t=1}^n Z_t

How close is µ̂_n to µ?


A crash course in concentration

  empirical mean:  µ̂_n = (1/n) ∑_{t=1}^n Z_t

How close is µ̂_n to µ? Classical statistics says:

1. (law of large numbers)  lim_{n→∞} µ̂_n = µ almost surely

2. (central limit theorem)  √n (µ̂_n − µ) →d N(0, σ²)

3. (Chebyshev's inequality)  P(|µ̂_n − µ| ≥ ε) ≤ σ²/(nε²)   (using V(µ̂_n) = σ²/n)

Basic probability inequalities (for a random variable X with finite mean and variance):

1. (Markov's inequality)  P(|X| ≥ ε) ≤ E|X| / ε

2. (Chebyshev's inequality)  P(|X − E[X]| ≥ ε) ≤ V(X)/ε²

We need something nonasymptotic and stronger than Chebyshev's (not possible without assumptions)



A crash course in concentration

A random variable Z is σ-subgaussian if for all λ ∈ R,

  M_Z(λ) := E[exp(λZ)] ≤ exp(λ²σ²/2),

where M_Z(λ) is known as the moment generating function.

• Which distributions are σ-subgaussian? Gaussian, Bernoulli, and any distribution with bounded support.

• Which are not: exponential, power law.

Lemma  If Z, Z_1, ..., Z_n are independent and σ-subgaussian, then

• aZ is |a|σ-subgaussian for any a ∈ R

• ∑_{t=1}^n Z_t is √n·σ-subgaussian

• µ̂_n is (σ/√n)-subgaussian



A crash course in concentration

Lemma (tail bound for a subgaussian random variable)

• If X is σ-subgaussian, then for any ε > 0,  P(X ≥ ε) ≤ exp(−ε²/(2σ²)).

Proof  We use Chernoff's method. Let ε > 0 and λ = ε/σ². Then

  P(X ≥ ε) = P( exp(λX) ≥ exp(λε) )
           ≤ E[exp(λX)] / exp(λε)       (Markov's inequality)
           ≤ exp( σ²λ²/2 − λε )         (X is subgaussian)
           = exp( −ε²/(2σ²) )

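A quick empirical check of the lemma (a sketch under the assumption X ∼ N(0, σ²), which is σ-subgaussian): the observed tail frequencies should sit below the Chernoff bound.

```python
import numpy as np

# Check P(X >= eps) <= exp(-eps^2 / (2 sigma^2)) for X ~ N(0, sigma^2).
rng = np.random.default_rng(3)
sigma, samples = 1.0, 1_000_000
x = rng.normal(0.0, sigma, size=samples)
for eps in [0.5, 1.0, 2.0, 3.0]:
    empirical = (x >= eps).mean()                 # empirical tail frequency
    bound = np.exp(-eps**2 / (2 * sigma**2))      # subgaussian tail bound
    print(f"eps={eps}: empirical {empirical:.5f} <= bound {bound:.5f}")
```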


A crash course in concentration

Theorem  If Z_1, ..., Z_n are independent and σ-subgaussian, then

  P( µ̂_n − µ ≥ √( 2σ² log(1/δ) / n ) ) ≤ δ

Proof  µ̂_n − µ is a (σ/√n)-subgaussian random variable, and thus

  P( µ̂_n − µ ≥ ε ) ≤ exp( −nε²/(2σ²) ).

Set exp(−nε²/(2σ²)) = δ and solve for ε.

Corollary  If Z_1, ..., Z_n are independent and σ-subgaussian, then

  P( µ̂_n − µ ≤ −√( 2σ² log(1/δ) / n ) ) ≤ δ



A crash course in concentration

• Comparing Chebyshev's inequality with the subgaussian bound:

  Chebyshev's:  P( µ̂_n − µ ≥ √( σ² / (nδ) ) ) ≤ δ

  Subgaussian:  P( µ̂_n − µ ≥ √( 2σ² log(1/δ) / n ) ) ≤ δ

• Typically δ ≪ 1/n in our use-cases. Then Chebyshev's inequality is too loose, since √(σ²/(nδ)) is too large: its radius grows polynomially in 1/δ rather than logarithmically.

From now on, we will assume that the reward distribution associated with each arm is 1-subgaussian (but with different means)

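The gap between the two bounds is easy to see numerically; a small sketch (illustrative values) comparing the deviation radii each inequality guarantees with probability 1 − δ:

```python
import numpy as np

# Radius guaranteed with probability 1 - delta by each bound:
#   Chebyshev:   sqrt(sigma^2 / (n * delta))
#   subgaussian: sqrt(2 * sigma^2 * log(1/delta) / n)
sigma2, n = 1.0, 1000
for delta in [1e-1, 1e-2, 1e-4, 1e-8]:
    cheb = np.sqrt(sigma2 / (n * delta))
    subg = np.sqrt(2 * sigma2 * np.log(1 / delta) / n)
    print(f"delta={delta:.0e}: Chebyshev {cheb:8.3f}   subgaussian {subg:.3f}")
```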


Analysing Explore-Then-Commit

• Exploration phase: choose each arm m times

• Exploitation phase: then commit to the arm with the largest empirical reward

• Standard convention: assume µ_1 ≥ µ_2 ≥ · · · ≥ µ_K
  • Means that the first arm is optimal
  • Algorithms are symmetric and do not know this fact

• We consider only K = 2



Analysing Explore-Then-Commit

Step 1  Let µ̂_i be the average reward of the i-th arm (for i ∈ {1, 2}) after the exploration phase

The algorithm commits to the wrong arm if

  µ̂_2 ≥ µ̂_1  ⇔  (µ̂_2 − µ_2) + (µ_1 − µ̂_1) ≥ ∆ = µ_1 − µ_2

Observation  (µ̂_2 − µ_2) + (µ_1 − µ̂_1) is the sum of two independent √(1/m)-subgaussian terms, hence √(2/m)-subgaussian with zero mean


Analysing Explore-Then-Commit

Step 1  The algorithm commits to the wrong arm if

  µ̂_2 ≥ µ̂_1  ⇔  (µ̂_2 − µ_2) + (µ_1 − µ̂_1) ≥ ∆ = µ_1 − µ_2

Observation  (µ̂_2 − µ_2) + (µ_1 − µ̂_1) is √(2/m)-subgaussian with zero mean

Step 2  The regret is

  R_n = E[ ∑_{t=1}^n ∆_{A_t} ]
      = E[ ∑_{t=1}^{2m} ∆_{A_t} ] + E[ ∑_{t=2m+1}^n ∆_{A_t} ]
      = m∆ + (n − 2m) ∆ · P(commit to the wrong arm)
      = m∆ + (n − 2m) ∆ · P( (µ̂_2 − µ_2) + (µ_1 − µ̂_1) ≥ ∆ )
      ≤ m∆ + n∆ exp( −m∆²/4 )

The last inequality holds because if X is σ-subgaussian then, for any ε > 0, P(X ≥ ε) ≤ exp(−ε²/(2σ²)); apply this with ε = ∆ and σ = √(2/m).



Analysing Explore-Then-Commit

  R_n ≤ m∆ + n∆ exp(−m∆²/4),   with (A) = m∆ and (B) = n∆ exp(−m∆²/4)

(A) is monotone increasing in m, while (B) is monotone decreasing in m

Exploration/Exploitation trade-off  Exploring too much (m large) makes (A) big, while exploring too little makes (B) large

The bound is minimised by m = ⌈ (4/∆²) log(n∆²/4) ⌉, leading to

  R_n ≤ ∆ + (4/∆) log(n∆²/4) + 4/∆,

noting that, due to the ceiling function in m, (A) ≤ (1 + (4/∆²) log(n∆²/4)) ∆.
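A small sketch computing the optimal m and the resulting bound for illustrative values of n and ∆ (it assumes n∆²/4 > 1, so that the logarithm is positive):

```python
import numpy as np

def etc_regret_bound(n, Delta, m):
    """(A) + (B) from the slide: m*Delta + n*Delta*exp(-m*Delta^2/4)."""
    return m * Delta + n * Delta * np.exp(-m * Delta**2 / 4)

n, Delta = 1000, 0.3                       # illustrative values
m_star = int(np.ceil(4 / Delta**2 * np.log(n * Delta**2 / 4)))
print("optimal m:", m_star)
print("bound at m*:", etc_regret_bound(n, Delta, m_star))
```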


Analysing Explore-Then-Commit

Last slide:  R_n ≤ ∆ + (4/∆) log(n∆²/4) + 4/∆

What happens when ∆ is very small? (R_n can be unbounded)

A natural correction:

  R_n ≤ min{ n∆,  ∆ + (4/∆) log(n∆²/4) + 4/∆ }

[Figure: this bound on R_n plotted as a function of ∆ ∈ [0, 1], for n = 1000]



Analysing Explore-Then-Commit

Does this figure make sense? Why is the regret largest when ∆ is small, but not too small?

  R_n ≤ min{ n∆,  ∆ + (4/∆) log(n∆²/4) + 4/∆ }

Small ∆ makes identification of the best arm hard, but the cost of failure (of identification) is low

Large ∆ makes the cost of failure high, but identification becomes easy

The worst case is when ∆ ≈ √(1/n), with R_n ≈ √n



Limitations of Explore-Then-Commit

• Recall that m = ⌈ (4/∆²) log(n∆²/4) ⌉

• Need advance knowledge of the horizon length n

• Optimal tuning depends on the unknown gap ∆ = µ_1 − µ_2

• Better approaches now exist, but Explore-Then-Commit is often a good place to start when analyzing a bandit problem, since it cleanly captures the exploration-exploitation trade-off



Optimism principle


Informal illustration

Visiting a new region

Shall I try local cuisine?

Optimist: Yes!

Pessimist: No!

Optimism leads to exploration, pessimism prevents it

Exploration is necessary, but how much?


Optimism principle

• Let µ̂_i(t) = (1/T_i(t)) ∑_{s=1}^t 1(A_s = i) X_s be the empirical mean reward of the i-th arm at time t

• Optimistic estimate of the mean of an arm = 'largest value it could plausibly be'

• Formalise the intuition using confidence intervals (σ = 1):

  P( µ̂_n − µ ≥ √( 2σ² log(1/δ) / n ) ) ≤ δ

• This suggests

  optimistic estimate = µ̂_i(t − 1) + √( 2 log(1/δ) / T_i(t − 1) )

• δ ∈ (0, 1) determines the level of optimism



Upper confidence bound algorithm

1. Choose each action once

2. Choose the action maximising the index

   A_t = argmax_i  µ̂_i(t − 1) + √( 2 log(t³) / T_i(t − 1) )

3. Go to 2

Corresponds to δ = 1/t³. This is quite a conservative choice (more on this later)

The algorithm does not depend on the horizon n (it is anytime)

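A minimal sketch of this UCB rule (illustrative code assuming N(µ_i, 1) rewards; note that 2 log(t³) = 6 log(t)):

```python
import numpy as np

def ucb(mu, n, rng):
    """UCB with index mu_hat_i + sqrt(2*log(t^3)/T_i), i.e. delta = 1/t^3.
    Rewards are N(mu_i, 1). Returns the realized regret."""
    K = len(mu)
    T = np.zeros(K)                  # pull counts T_i(t)
    S = np.zeros(K)                  # reward sums, so mu_hat_i = S_i / T_i
    reward = 0.0
    for t in range(1, n + 1):
        if t <= K:                   # play each arm once first
            i = t - 1
        else:
            index = S / T + np.sqrt(2 * np.log(t**3) / T)
            i = int(np.argmax(index))
        x = mu[i] + rng.normal()
        T[i] += 1; S[i] += x; reward += x
    return n * mu.max() - reward

rng = np.random.default_rng(4)
print(ucb(np.array([0.5, 0.3, 0.0]), n=5000, rng=rng))
```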


Upper confidence bound algorithm

• Red circle: true mean; blue rectangle: empirical mean reward.

• (10), (73), (3), (23): number of pulls (a larger number of pulls makes the empirical mean closer to the true mean).


Why UCB?

• A suboptimal arm can only be played if its upper confidence bound is larger than the upper confidence bound of the optimal arm, which (with high probability) is in turn larger than the mean of the optimal arm.

• However, this cannot happen too often: after a few more plays of a suboptimal arm, its upper confidence bound will be close to its true mean, and thus will eventually fall below the upper confidence bound of the optimal arm.

• UCB index:

  A_t = argmax_i  µ̂_i(t − 1) + √( 2 log(t³) / T_i(t − 1) )

• An algorithm should explore an arm more often if it is
  1. either promising, because µ̂_i(t − 1) is large,
  2. or not well explored, because T_i(t − 1) is small.



Regret of UCB

Theorem  The regret of UCB is at most

  R_n = O( ∑_{i: ∆_i > 0} ( ∆_i + log(n)/∆_i ) )

Furthermore,

  R_n = O( √( K n log(n) ) ),

where K is the number of arms and n is the time horizon length.

Bounds of the first kind are called problem dependent or instance dependent; they depend on the gaps ∆_i = µ_1 − µ_i

Bounds like the second are called distribution free or worst case


Regret analysis

Rewrite the regret as

  R_n = ∑_{i=1}^K ∆_i E[T_i(n)]

We only need to show that E[T_i(n)] is not too large for suboptimal arms


Regret analysis

Key insight  Arm i is only played if its index is larger than the index of the optimal arm:

  γ_i(t − 1) = µ̂_i(t − 1) + √( 2 log(t³) / T_i(t − 1) )    (index of arm i in round t)

That a suboptimal arm i ≠ 1 is played implies that

1. either γ_i(t − 1) ≥ µ_1 (the index of arm i is larger than the mean of the optimal arm),

2. or γ_1(t − 1) ≤ µ_1 (the index of arm 1 is smaller than its true mean).

Otherwise we would have γ_i(t − 1) < µ_1 ≤ γ_1(t − 1), and arm 1 would be played instead.

Both events are unlikely after a sufficient number of plays.



Regret analysis

To make this intuition a reality we decompose the "pull-count" for the i-th arm (i ≠ 1):

  E[T_i(n)] = E[ ∑_{t=1}^n 1(A_t = i) ] = ∑_{t=1}^n P(A_t = i)
            = ∑_{t=1}^n P( A_t = i and (γ_1(t − 1) ≤ µ_1 or γ_i(t − 1) ≥ µ_1) )
            ≤ ∑_{t=1}^n P( γ_1(t − 1) ≤ µ_1 )                   [index of opt. arm too small?]
              + ∑_{t=1}^n P( A_t = i and γ_i(t − 1) ≥ µ_1 )     [index of subopt. arm too large?]


Regret analysis

We want to show that P(γ_1(t − 1) ≤ µ_1) is small

It is tempting to use the concentration theorem directly:

  P( γ_1(t − 1) ≤ µ_1 ) = P( µ̂_1(t − 1) + √( 2 log(t³) / T_1(t − 1) ) ≤ µ_1 )  ≤?  1/t³

What's wrong with this? T_1(t − 1) is a random variable, not a number! Use the union bound P(∪_{s=1}^{t−1} A_s) ≤ ∑_{s=1}^{t−1} P(A_s):

  P( µ̂_1(t − 1) + √( 2 log(t³) / T_1(t − 1) ) ≤ µ_1 )
    ≤ P( ∃ s ≤ t − 1 : µ̂_{1,s} + √( 2 log(t³) / s ) ≤ µ_1 )
    ≤ ∑_{s=1}^{t−1} P( µ̂_{1,s} + √( 2 log(t³) / s ) ≤ µ_1 )
    ≤ ∑_{s=1}^{t−1} 1/t³  ≤  1/t²                    (δ = 1/t³)

where µ̂_{1,s} denotes the empirical mean of arm 1 after its first s plays.



Regret analysis

  ∑_{t=1}^n P( A_t = i and γ_i(t − 1) ≥ µ_1 )
    = E[ ∑_{t=1}^n 1( A_t = i and γ_i(t − 1) ≥ µ_1 ) ]
    = E[ ∑_{t=1}^n 1( A_t = i and µ̂_i(t − 1) + √( 6 log(t) / T_i(t − 1) ) ≥ µ_1 ) ]
    ≤ E[ ∑_{t=1}^n 1( A_t = i and µ̂_i(t − 1) + √( 6 log(n) / T_i(t − 1) ) ≥ µ_1 ) ]    (t ≤ n)

(using 2 log(t³) = 6 log(t))


Regret analysis

  ∑_{t=1}^n P( A_t = i and γ_i(t − 1) ≥ µ_1 )
    ≤ E[ ∑_{t=1}^n 1( A_t = i and µ̂_i(t − 1) + √( 6 log(n) / T_i(t − 1) ) ≥ µ_1 ) ]
    ≤ E[ ∑_{s=1}^n 1( µ̂_{i,s} + √( 6 log(n) / s ) ≥ µ_1 ) ]    (for each possible value T_i(t − 1) = s, s = 1, ..., n)
    = ∑_{s=1}^n P( µ̂_{i,s} + √( 6 log(n) / s ) ≥ µ_1 )


Regret analysis

Let u = 24 log(n) / ∆_i². Then we decompose the time periods into [1, u] and [u + 1, n]:

  ∑_{s=1}^n P( µ̂_{i,s} + √( 6 log(n)/s ) ≥ µ_1 )  ≤  u + ∑_{s=u+1}^n P( µ̂_{i,s} + √( 6 log(n)/s ) ≥ µ_1 )

Choose u large enough so that √( 6 log(n)/s ) ≤ ∆_i / 2 for any s > u (this is exactly u = 24 log(n)/∆_i²). Then for s > u we have

  µ̂_{i,s} ≥ µ_1 − √( 6 log(n)/s )
    ⇒ µ̂_{i,s} − µ_i ≥ µ_1 − µ_i − √( 6 log(n)/s )
    ⇒ µ̂_{i,s} − µ_i ≥ ∆_i − ∆_i/2
    ⇒ µ̂_{i,s} − µ_i ≥ ∆_i / 2



Regret analysis

Let u = 24 log(n) / ∆_i². Then

  ∑_{s=1}^n P( µ̂_{i,s} + √( 6 log(n)/s ) ≥ µ_1 )
    ≤ u + ∑_{s=u+1}^n P( µ̂_{i,s} + √( 6 log(n)/s ) ≥ µ_1 )
    ≤ u + ∑_{s=u+1}^n P( µ̂_{i,s} ≥ µ_i + ∆_i/2 )
    ≤ u + ∑_{s=u+1}^∞ exp( −s∆_i²/8 )
    ≤ u + 1 + 8/∆_i²,

using ∑_{s=u+1}^∞ exp(−s∆_i²/8) ≤ 1 + ∫_u^∞ exp(−s∆_i²/8) ds.


Regret analysis

Combining the two parts we have

  E[T_i(n)] ≤ ∑_{t=1}^n P( γ_1(t − 1) ≤ µ_1 ) + ∑_{t=1}^n P( A_t = i and γ_i(t − 1) ≥ µ_1 )
           ≤ ∑_{t=1}^n 1/t² + 1 + u + 8/∆_i²      (u = 24 log(n)/∆_i², and ∑_{t=1}^n 1/t² ≤ 1 + ∫_1^∞ dt/t²)
           ≤ 3 + 8/∆_i² + 24 log(n)/∆_i²

So the regret is bounded by (instance dependent bound)

  R_n = ∑_{i: ∆_i > 0} ∆_i E[T_i(n)] ≤ ∑_{i: ∆_i > 0} ( 3∆_i + 8/∆_i + 24 log(n)/∆_i )
      = O( ∑_{i: ∆_i > 0} ( ∆_i + log(n)/∆_i ) )



Distribution free bounds

Let ∆ > 0 be some constant to be chosen later:

  R_n = ∑_{i: ∆_i ≤ ∆} ∆_i E[T_i(n)] + ∑_{i: ∆_i > ∆} ∆_i E[T_i(n)]
      ≤ n∆ + ∑_{i: ∆_i > ∆} ∆_i E[T_i(n)]       (since ∑_{i: ∆_i ≤ ∆} T_i(n) ≤ ∑_i T_i(n) = n)
      ≲ n∆ + ∑_{i: ∆_i > ∆} ( ∆_i + log(n)/∆_i )
      ≲ n∆ + K log(n)/∆ + ∑_{i=1}^K ∆_i         (choose ∆ = √( K log(n)/n ))
      ≲ √( nK log(n) ) + ∑_{i=1}^K ∆_i

Note that ∑_{i=1}^K ∆_i is unavoidable, since each arm needs to be played at least once; this term is negligible when n is large.



Improvements

• The constants in the algorithm/analysis can be improved quite significantly:

  A_t = argmax_i  µ̂_i(t − 1) + √( 2 log(t) / T_i(t − 1) )

• With this choice:

  lim_{n→∞} R_n / log(n) = ∑_{i: ∆_i > 0} 2/∆_i

• The distribution-free regret is also improvable:

  A_t = argmax_i  µ̂_i(t − 1) + √( (4 / T_i(t − 1)) log( 1 + t / (K T_i(t − 1)) ) )

• With this index we save a log factor in the distribution free bound:

  R_n = O( √(nK) )



Exercise

• Consider different settings of arms (number of arms K, mean gaps ∆_i) and different reward distributions: uniform, Bernoulli, normal

• Compare Explore-Then-Commit with the UCB algorithm in terms of:

  1. Regret as a function of the horizon n
  2. Frequency of pulling each arm
  3. Tuning the constant ρ in the index

     A_t = argmax_i  µ̂_i(t − 1) + √( ρ log(t) / T_i(t − 1) )

(A simulation sketch along these lines is given below.)
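One possible starting point for this exercise, as referenced above: a self-contained sketch (illustrative parameters, Gaussian rewards only) comparing the mean regret of ETC and UCB with a tunable ρ; ρ = 6 matches the 2 log(t³) index from the slides.

```python
import numpy as np

def run_ucb(mu, n, rho, rng):
    """UCB with index mu_hat_i + sqrt(rho * log(t) / T_i)."""
    K = len(mu); T = np.zeros(K); S = np.zeros(K); reward = 0.0
    for t in range(1, n + 1):
        i = t - 1 if t <= K else int(np.argmax(S / T + np.sqrt(rho * np.log(t) / T)))
        x = mu[i] + rng.normal(); T[i] += 1; S[i] += x; reward += x
    return n * mu.max() - reward

def run_etc(mu, n, m, rng):
    """Explore-Then-Commit with m plays per arm."""
    K = len(mu); means = np.zeros(K); reward = 0.0
    for i in range(K):
        x = mu[i] + rng.normal(size=m); means[i] = x.mean(); reward += x.sum()
    best = int(np.argmax(means))
    reward += (mu[best] + rng.normal(size=n - m * K)).sum()
    return n * mu.max() - reward

rng = np.random.default_rng(5)
mu, n, reps = np.array([0.6, 0.4, 0.2]), 2000, 200
for rho in [2.0, 6.0]:
    r = np.mean([run_ucb(mu, n, rho, rng) for _ in range(reps)])
    print(f"UCB rho={rho}: mean regret {r:.1f}")
r = np.mean([run_etc(mu, n, 50, rng) for _ in range(reps)])
print(f"ETC m=50   : mean regret {r:.1f}")
```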


Lower bounds

Is the bound R_n = O(√(nK)) optimal in n and K?

1. The worst-case regret of a given policy π is R_n(π) = sup_{ν∈E} R_n(π, ν), where E denotes the set of K-armed Gaussian bandits with unit variance and means µ ∈ [0, 1]^K.

2. The minimax regret is R*_n(E) = inf_π sup_{ν∈E} R_n(π, ν)

Theorem  R*_n(E) ≥ √((K − 1)n) / 27: for every policy π, every n, and K ≤ n + 1, there exists a K-armed Gaussian bandit ν such that

  R_n(π, ν) ≥ √((K − 1)n) / 27

UCB, with R_n = O(√(nK)), is therefore a rate-optimal policy



How to prove a minimax lower bound?

Key idea: reduce the bandit problem to a statistical hypothesis testing problem.

Select two bandit problem instances (two sets of K distributions) in such a way that the following two conditions hold simultaneously:

• Competition: a sequence of actions that is good for one bandit is not good for the other (choose two instances far away from each other).

• Similarity: the instances are 'close' enough that a policy interacting with either of the two instances cannot statistically identify the true bandit.

Lower bound: optimize the trade-off between these two opposing goals.



Minimax lower bound

Theorem  For every policy π, every n, and K ≤ n, there exists a K-armed Gaussian bandit ν such that R_n(π, ν) ≥ √((K − 1)n) / 27

Proof sketch

• Two bandits: ν = (P_i)_{i=1}^K and ν′ = (P′_i)_{i=1}^K, where P_i = N(µ_i, 1) and P′_i = N(µ′_i, 1).

• It suffices to show that for any policy π there exist µ and µ′ such that π incurs regret of order √(Kn) on at least one instance:

  max( R_n(π, ν), R_n(π, ν′) ) ≥ c√(Kn),

  or, since max(a, b) ≥ (a + b)/2,

  R_n(π, ν) + R_n(π, ν′) ≥ c√(Kn).

• Choose µ = (∆, 0, ..., 0), so that

  R_n(π, ν) = (n − E_ν[T_1(n)]) ∆

  (∆ to be optimized later)



Minimax lower bound

Theorem  For every policy π, every n, and K ≤ n, there exists a K-armed Gaussian bandit ν such that R_n(π, ν) ≥ √((K − 1)n) / 27

Proof sketch

• Choose µ = (∆, 0, ..., 0), so R_n(π, ν) = (n − E_ν[T_1(n)]) ∆

• Let i = argmin_{j>1} E_ν[T_j(n)] (the arm explored the least); then E_ν[T_i(n)] ≤ n/(K − 1)

• Let µ′ = (∆, 0, ..., 0, 2∆, 0, ..., 0) (2∆ at the i-th arm, which is now the optimal arm):

  R_n(π, ν′) = ∆ E_ν′[T_1(n)] + ∑_{j ≠ 1, i} 2∆ E_ν′[T_j(n)] ≥ ∆ E_ν′[T_1(n)].

• Distinguish the cases T_1(n) ≤ n/2 and T_1(n) ≥ n/2:

  • On the event T_1(n) ≤ n/2, instance ν incurs regret at least n∆/2; therefore R_n(π, ν) ≥ P_ν(T_1(n) ≤ n/2) · n∆/2

  • On the event T_1(n) ≥ n/2, R_n(π, ν′) ≥ ∆ T_1(n) ≥ n∆/2; therefore R_n(π, ν′) ≥ P_ν′(T_1(n) ≥ n/2) · n∆/2



Minimax lower bound

  R_n(π, ν) + R_n(π, ν′) ≥ (n∆/2) ( P_ν(T_1(n) ≤ n/2) + P_ν′(T_1(n) ≥ n/2) )

Need to show that P_ν(T_1(n) ≤ n/2) + P_ν′(T_1(n) ≥ n/2) is large!

Theorem (Bretagnolle–Huber inequality, a Pinsker-type inequality)  For any two distributions P, Q and any event A:

  P(A) + Q(A^c) ≥ (1/2) exp( −D(P, Q) ),

where D(P, Q) = ∫ p log(p/q) is the Kullback–Leibler (KL) divergence.

Intuition: when P is close to Q, P(A) + Q(A^c) should be large (note P(A) + P(A^c) = 1)

Exercise:

  D( N(µ_1, σ²), N(µ_2, σ²) ) = (µ_1 − µ_2)² / (2σ²)
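A worked sketch of the exercise (a standard computation; it uses E[X] = µ_1 under P = N(µ_1, σ²)):

```latex
\begin{align*}
D\big(N(\mu_1,\sigma^2),\,N(\mu_2,\sigma^2)\big)
 &= \mathbb{E}_{X\sim N(\mu_1,\sigma^2)}\!\left[\log\frac{p(X)}{q(X)}\right]
  = \mathbb{E}\!\left[\frac{(X-\mu_2)^2-(X-\mu_1)^2}{2\sigma^2}\right]\\
 &= \mathbb{E}\!\left[\frac{2X(\mu_1-\mu_2)+\mu_2^2-\mu_1^2}{2\sigma^2}\right]
  = \frac{2\mu_1(\mu_1-\mu_2)+\mu_2^2-\mu_1^2}{2\sigma^2}
  = \frac{(\mu_1-\mu_2)^2}{2\sigma^2}.
\end{align*}
```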



Minimax lower bound

  R_n(π, ν) + R_n(π, ν′) ≥ (n∆/2) ( P_ν(T_1(n) ≤ n/2) + P_ν′(T_1(n) ≥ n/2) )
                         ≥ (n∆/4) exp( −D(P_ν, P_ν′) )

Theorem (divergence decomposition)

  D(P_ν, P_ν′) = ∑_{j=1}^K E_ν[T_j(n)] D(P_j, P′_j)

Recall µ = (∆, 0, ..., 0) and µ′ = (∆, 0, ..., 0, 2∆, 0, ..., 0) (only the i-th entry differs):

  D(P_ν, P_ν′) = E_ν[T_i(n)] D( N(0, 1), N(2∆, 1) ) = E_ν[T_i(n)] (2∆)²/2 ≤ 2n∆² / (K − 1).

Hence

  R_n(π, ν) + R_n(π, ν′) ≥ (n∆/4) exp( −2n∆²/(K − 1) ),

and choose ∆ = √( (K − 1)/(4n) ).



Worst case lower bound

Theorem  For every policy π, every n, and K ≤ n + 1, there exists a K-armed Gaussian bandit ν such that

  R_n(π, ν) ≥ √((K − 1)n) / 27


What else is there?

• All kinds of variants of UCB for different noise models: Bernoulli, exponential families, heavy tails, Gaussian with unknown mean and variance, ...

• Thompson sampling: each round, sample a mean from the posterior of each arm, and choose the arm with the largest sample

• All manner of twists on the setup: non-stationarity, delayed rewards, playing multiple arms each round, moving beyond expected regret (high-probability bounds)



The adversarial viewpoint

• Replace random rewards with an adversary

• At the start of the game the adversary secretly chooses losses $\ell_1, \ell_2, \ldots, \ell_n$ where $\ell_t \in [0,1]^K$

• The learner chooses actions $A_t$, then observes and suffers the loss $\ell_{t,A_t}$

• Regret is

$$R_n = \underbrace{\mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t}\Big]}_{\text{learner's loss}} - \underbrace{\min_i \sum_{t=1}^n \ell_{t,i}}_{\text{loss of best arm}}$$

• Mission: make the regret small, regardless of the adversary

• There exists an algorithm such that $R_n \le 2\sqrt{Kn}$

50 / 86
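To make the protocol concrete, here is a minimal simulation of the game and the regret computation (a sketch only; the uniform losses and the uniformly random learner are placeholder choices, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 1000, 3

# The adversary fixes all losses up front: an n x K matrix with entries in [0, 1].
losses = rng.uniform(0.0, 1.0, size=(n, K))

# A deliberately naive learner that picks an arm uniformly at random each round.
actions = rng.integers(0, K, size=n)
learner_loss = losses[np.arange(n), actions].sum()

# The regret compares against the single best arm in hindsight.
best_arm_loss = losses.sum(axis=0).min()
print("regret:", learner_loss - best_arm_loss)
```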



Why this regret definition?

• The regret is with respect to the loss of the best arm

$$R_n = \underbrace{\mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t}\Big]}_{\text{learner's loss}} - \underbrace{\min_i \sum_{t=1}^n \ell_{t,i}}_{\text{loss of best arm}}$$

• The following alternative objective is hopeless:

$$R'_n = \underbrace{\mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t}\Big]}_{\text{learner's loss}} - \underbrace{\sum_{t=1}^n \min_i \ell_{t,i}}_{\text{loss of best sequence}}$$

• Against this comparator the regret is at least $cn$ for some constant $c > 0$. For example, with two arms whose losses are

$$\ell = \begin{pmatrix} 1 \cdots 1 & 0 \cdots 0 \\ 0 \cdots 0 & 1 \cdots 1 \end{pmatrix}$$

the best action sequence suffers zero loss, while a learner that does not know the switch point in advance cannot avoid losses of order $n$.

51 / 86



Tackling the adversarial bandit

• Randomisation is crucial in adversarial bandits

• The learner chooses a distribution $P_t$ over the $K$ actions

• Samples $A_t \sim P_t$

• Observes the loss $\ell_{t,A_t}$

• Expected regret is

$$R_n = \max_i \mathbb{E}\Big[\sum_{t=1}^n (\ell_{t,A_t} - \ell_{t,i})\Big] = \max_{p \in \Delta_K} \mathbb{E}\Big[\sum_{t=1}^n \langle P_t - p, \ell_t\rangle\Big]$$

• How to choose $P_t$?

• Consider a simpler setting first: the learner chooses the action $A_t$ and the entire vector $\ell_t$ is observed (instead of only $\ell_{t,A_t}$)

• This is online convex optimization with a linear loss

52 / 86



Online convex optimisation (linear losses)

• The domain $K \subset \mathbb{R}^d$ is a convex set

• The adversary secretly chooses $\ell_1, \ldots, \ell_n \in K^\circ = \{u : \sup_{x\in K} |\langle x, u\rangle| \le 1\}$ (the polar set)

• At each time $t$, the learner chooses $x_t \in K$

• Suffers loss $\langle x_t, \ell_t\rangle$

• The regret with respect to the best $x \in K$ is

$$R_n(x) = \sum_{t=1}^n \langle x_t - x, \ell_t\rangle .$$

More general online convex optimization

• The learner chooses $x_t \in K$

• The adversary chooses a convex $f_t : K \to \mathbb{R}$

• The loss in round $t$ is $f_t(x_t)$ and the regret is

$$R_n(x) = \sum_{t=1}^n (f_t(x_t) - f_t(x))$$

• Linear losses are the special case $f_t(x) = \langle x, \ell_t\rangle$

53 / 86



Why linear is enough?

• A convex function lies above its tangents (the linearisation on the next slide)

• The sum of convex functions is convex

• A strictly convex function has a unique minimizer

54 / 86



Linearisation of a convex function

$$f_t(x) \ge f_t(x_t) + \langle x - x_t, \nabla f_t(x_t)\rangle$$

[Figure: a convex curve lying above its tangent line at $x_t$, evaluated at another point $x$]

Rearranging, $f_t(x_t) - f_t(x) \le \langle x_t - x, \nabla f_t(x_t)\rangle$

55 / 86


Why linear is enough?

• The regret is bounded by

$$R_n(x) = \sum_{t=1}^n (f_t(x_t) - f_t(x)) \le \sum_{t=1}^n \langle x_t - x, \nabla f_t(x_t)\rangle$$

• A reduction from nonlinear to linear losses

• Only uses first-order information (the gradient)

• Linear losses from now on: $f_t(x) = \langle x, \ell_t\rangle$

• Think of $\ell_t = \nabla f_t(x_t)$ for a general convex loss function

56 / 86



Online convex optimisation (linear losses)

• The adversary secretly chooses $\ell_1, \ldots, \ell_n \in K^\circ = \{u : \sup_{x\in K}|\langle x,u\rangle| \le 1\}$ (polar)

• The learner chooses $x_t \in K$

• Suffers loss $\langle x_t, \ell_t\rangle$ and the regret with respect to $x \in K$ is

$$R_n(x) = \sum_{t=1}^n \langle x_t - x, \ell_t\rangle .$$

• How to choose $x_t$? The simplest idea is 'follow-the-leader':

$$x_{t+1} = \operatorname{argmin}_{x\in K} \sum_{s=1}^t \langle x, \ell_s\rangle .$$

• Fails miserably: $K = [-1,1]$, $\ell_1 = 1/2$, $\ell_2 = -1$, $\ell_3 = 1$, $\ell_4 = -1, \ldots$

• $x_1$ arbitrary, $x_2 = -1$ ($= \operatorname{argmin}_{x\in K}\langle x, \ell_1\rangle$), $x_3 = 1$ ($= \operatorname{argmin}_{x\in K}\langle x, \ell_1+\ell_2\rangle$), $x_4 = -1, \ldots$

• So $R_n(0) = \sum_{t=1}^n \langle x_t, \ell_t\rangle \approx n$.

57 / 86
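The oscillation is easy to reproduce numerically (a minimal sketch of the counterexample above; breaking argmin ties towards $-1$ is one valid choice, and no tie actually occurs for this loss sequence):

```python
import numpy as np

n = 20
# Losses 1/2, -1, 1, -1, 1, ... as on the slide.
losses = np.array([0.5] + [(-1.0) ** t for t in range(1, n)])

x, total_loss = 0.0, 0.0  # start from x_1 = 0 (any initial point works)
for t in range(n):
    total_loss += x * losses[t]
    # Follow-the-leader on K = [-1, 1]: minimize x * (cumulative loss).
    cum = losses[: t + 1].sum()
    x = -1.0 if cum > 0 else 1.0
print("FTL cumulative loss:", total_loss)  # grows like n, while x = 0 has loss 0
```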



Follow The Regularized Leader (FTRL)

• New idea: add regularization to stabilize follow-the-leader

• Let $F$ be a strictly convex function, let $\eta > 0$ be the learning rate, and set

$$x_{t+1} = \operatorname{argmin}_{x\in K} \Big( F(x) + \eta \sum_{s=1}^t \langle x, \ell_s\rangle \Big)$$

• Different choices of $F$ lead to different algorithms.

• One clean analysis.

58 / 86



Example – Gradient descent

• $K = \mathbb{R}^d$ and $F(x) = \frac{1}{2}\|x\|_2^2$:

$$x_{t+1} = \operatorname{argmin}_{x\in K}\; \eta \sum_{s=1}^t \langle x, \ell_s\rangle + \frac{1}{2}\|x\|_2^2$$

• Differentiating and setting the gradient to zero:

$$0 = \eta \sum_{s=1}^t \ell_s + x$$

• $x_{t+1} = -\eta \sum_{s=1}^t \ell_s = x_t - \eta \ell_t$

59 / 86
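A quick numerical check of this closed form (a sketch with placeholder loss vectors): the batch FTRL solution $-\eta\sum_{s\le t}\ell_s$ coincides with running the incremental gradient update $x_t - \eta\ell_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
eta, d, n = 0.1, 4, 50
losses = rng.normal(size=(n, d))

x_gd = np.zeros(d)                    # incremental form: x_{t+1} = x_t - eta * l_t
for l in losses:
    x_gd = x_gd - eta * l

x_ftrl = -eta * losses.sum(axis=0)    # batch form: minimizer of the regularized sum
print(np.allclose(x_gd, x_ftrl))      # True
```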



A few tools

• Online convex optimization uses many tools from convex analysis

• Bregman divergence

• First-order optimality conditions

• Dual norms

60 / 86


Bregman divergence

For convex $F$, $D_F(x,y) = F(x) - F(y) - \langle \nabla F(y), x - y\rangle$

• The Bregman divergence is not a distance (it may not be symmetric, $D_F(x,y) \ne D_F(y,x)$; e.g., the KL divergence), but still $D_F(x,y) \ge 0$

• By Taylor expansion, there exists $z = \alpha x + (1-\alpha)y$ for some $\alpha \in [0,1]$ such that

$$D_F(x,y) = \frac{1}{2}(x-y)^\top \nabla^2 F(z)(x-y) = \frac{1}{2}\|x-y\|^2_{\nabla^2 F(z)}$$

• Key property: invariance under linear perturbation. For $\tilde F(x) = F(x) + \langle a, x\rangle$, $D_{\tilde F}(x,y) = D_F(x,y)$

61 / 86



Examples

• Quadratic: $F(x) = \frac{1}{2}\|x\|_2^2$ gives

$$D_F(x,y) = \frac{1}{2}\|x-y\|_2^2$$

• Neg-entropy: $F(x) = \sum_{i=1}^d (x_i \log(x_i) - x_i)$ gives

$$D_F(x,y) = \sum_{i=1}^d x_i \log\Big(\frac{x_i}{y_i}\Big) + \sum_{i=1}^d (y_i - x_i)$$

When $x, y \in \Delta_d$, where $\Delta_d = \{x \in \mathbb{R}^d : x \ge 0, \|x\|_1 = 1\}$ (the $d$-dimensional simplex, usually used to model a discrete probability distribution):

$$D_F(x,y) = \sum_{i=1}^d x_i \log\Big(\frac{x_i}{y_i}\Big)$$

62 / 86
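Both examples can be checked directly from the definition (a sketch; it assumes strictly positive vectors so the logarithms are defined):

```python
import numpy as np

def bregman(F, grad_F, x, y):
    """D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - grad_F(y) @ (x - y)

quad = lambda x: 0.5 * x @ x
quad_grad = lambda x: x
negent = lambda x: np.sum(x * np.log(x) - x)
negent_grad = lambda x: np.log(x)

x = np.array([0.2, 0.3, 0.5])  # both points lie on the simplex
y = np.array([0.4, 0.4, 0.2])
print(bregman(quad, quad_grad, x, y), 0.5 * np.sum((x - y) ** 2))     # equal
print(bregman(negent, negent_grad, x, y), np.sum(x * np.log(x / y)))  # equal: the KL divergence
```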



First order optimality condition

• Let $K$ be convex and $f : K \to \mathbb{R}$ convex and differentiable. Then

$$x^* = \operatorname{argmin}_{x\in K} f(x) \iff \langle \nabla f(x^*), x - x^*\rangle \ge 0 \quad \forall x \in K$$

• Interpretation: $f$ is increasing in the direction $x - x^*$ for all $x \in K$

63 / 86


Dual norm

Let $\|\cdot\|_t$ be a norm on $\mathbb{R}^d$; its dual norm is

$$\|z\|_{t*} = \sup\{\langle z, x\rangle : \|x\|_t \le 1\}.$$

The dual norm of $\|\cdot\|_{t*}$ is $\|\cdot\|_t$ again.

• The dual norm of $\|\cdot\|_2$ is $\|\cdot\|_2$

• The dual norm of $\|\cdot\|_1$ is $\|\cdot\|_\infty$

• The dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$ (with $\frac{1}{p} + \frac{1}{q} = 1$)

• Hölder's inequality: $\langle z, x\rangle \le \|x\|_t \|z\|_{t*}$.

64 / 86
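A small numerical sanity check of these pairings (a sketch; it uses the closed form of the dual of an $\ell_p$ norm as the $\ell_q$ norm):

```python
import numpy as np

def lp_dual_norm(z, p):
    # The dual of ||.||_p is ||.||_q with 1/p + 1/q = 1 (q = inf when p = 1).
    q = np.inf if p == 1 else p / (p - 1)
    return np.linalg.norm(z, ord=q)

z = np.array([3.0, -1.0, 2.0])
x = np.array([0.5, 0.5, 0.0])
for p in (1, 2, 4):
    # Holder's inequality: <z, x> <= ||x||_p * ||z||_q
    print(p, z @ x <= np.linalg.norm(x, ord=p) * lp_dual_norm(z, p) + 1e-12)
```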



Follow the regularized leader

• $x_{t+1} = \operatorname{argmin}_{x\in K}\big( F(x) + \eta \sum_{s=1}^t \langle x, \ell_s\rangle \big)$

• Equivalent to

$$x_{t+1} = \operatorname{argmin}_{x\in K} \big( \eta\langle x, \ell_t\rangle + D_F(x, x_t) \big) = \operatorname{argmin}_{x\in K} \big( \eta\langle x, \ell_t\rangle + F(x) - F(x_t) - \langle \nabla F(x_t), x - x_t\rangle \big)$$

• Assume the minimizer is achieved in the interior of $K$.

• The first optimization implies $\nabla F(x_{t+1}) = -\eta \sum_{s=1}^t \ell_s$

• The second optimization implies $\eta\ell_t + \nabla F(x_{t+1}) - \nabla F(x_t) = 0$ and thus $\nabla F(x_{t+1}) = -\eta\ell_t + \nabla F(x_t) = -\eta\sum_{s=1}^t \ell_s + \underbrace{\nabla F(x_1)}_{0} = -\eta\sum_{s=1}^t \ell_s$.

65 / 86



Regret Analysis: Follow the regularized leader

Theorem. For any fixed action $x$, the regret of follow the regularized leader satisfies

$$R_n(x) := \sum_{t=1}^n \langle x_t - x, \ell_t\rangle \le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=1}^n \Big( \langle x_t - x_{t+1}, \ell_t\rangle - \frac{1}{\eta} D_F(x_{t+1}, x_t) \Big) \le \frac{F(x) - F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \|\ell_t\|_{t*}^2$$

Here $z_t \in [x_t, x_{t+1}]$ is such that $D_F(x_{t+1}, x_t) = \frac{1}{2}\|x_t - x_{t+1}\|^2_{\nabla^2 F(z_t)}$, with $\|\cdot\|_t = \|\cdot\|_{\nabla^2 F(z_t)}$ and $\|\cdot\|_{t*}$ the dual norm of $\|\cdot\|_t$.

Choosing any $\|\cdot\|_t$ such that $D_F(x_{t+1}, x_t) \ge \frac{1}{2}\|x_t - x_{t+1}\|_t^2$ is also valid.

66 / 86



Regret Analysis: Follow the regularized leader

Theorem. For any fixed action $x$, the regret of follow the regularized leader satisfies

$$R_n(x) \le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=1}^n \Big( \langle x_t - x_{t+1}, \ell_t\rangle - \frac{1}{\eta} D_F(x_{t+1}, x_t) \Big) \le \frac{F(x) - F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \|\ell_t\|_{t*}^2$$

with $D_F(x_{t+1}, x_t) = \frac{1}{2}\|x_t - x_{t+1}\|^2_{\nabla^2 F(z_t)}$ and $\|\cdot\|_t = \|\cdot\|_{\nabla^2 F(z_t)}$.

Proof of the second inequality:

$$\langle x_t - x_{t+1}, \ell_t\rangle - \frac{D_F(x_{t+1}, x_t)}{\eta} \le \|\ell_t\|_{t*} \|x_t - x_{t+1}\|_t - \frac{D_F(x_{t+1}, x_t)}{\eta} = \|\ell_t\|_{t*}\sqrt{2 D_F(x_{t+1}, x_t)} - \frac{D_F(x_{t+1}, x_t)}{\eta} \le \frac{\eta}{2}\|\ell_t\|_{t*}^2 ,$$

where the last inequality uses $ax - bx^2/2 \le a^2/(2b)$ for any $b > 0$, with $a = \|\ell_t\|_{t*}$, $x = \sqrt{2 D_F(x_{t+1}, x_t)}$, and $b = 1/\eta$.

67 / 86
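The elementary inequality in the last step follows by maximizing the left-hand side over $x$ (a filled-in step, not spelled out on the slide):

```latex
\text{For } b > 0:\quad
\sup_{x\in\mathbb{R}}\Big(ax - \frac{bx^2}{2}\Big) = \frac{a^2}{2b},
\quad\text{attained at } x = \frac{a}{b}.
\text{With } a = \|\ell_t\|_{t*} \text{ and } b = \frac{1}{\eta}:\quad
\|\ell_t\|_{t*}\,x - \frac{x^2}{2\eta} \le \frac{\eta}{2}\|\ell_t\|_{t*}^2 .
```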


FTRL analysis

• Proof of the first inequality:

$$R_n(x) \le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=1}^n \Big( \langle x_t - x_{t+1}, \ell_t\rangle - \frac{1}{\eta} D_F(x_{t+1}, x_t) \Big)$$

• Rewriting the regret:

$$R_n(x) = \sum_{t=1}^n \langle x_t - x, \ell_t\rangle = \sum_{t=1}^n \langle x_t - x_{t+1}, \ell_t\rangle + \sum_{t=1}^n \langle x_{t+1} - x, \ell_t\rangle$$

• Goal: show that

$$\sum_{t=1}^n \langle x_{t+1} - x, \ell_t\rangle \le \frac{F(x) - F(x_1)}{\eta} - \sum_{t=1}^n \frac{1}{\eta} D_F(x_{t+1}, x_t)$$

68 / 86



FTRL analysis

• Potential function: $\Phi_t(x) = \frac{F(x)}{\eta} + \sum_{s=1}^t \langle x, \ell_s\rangle$

• By FTRL, $x_{t+1}$ minimizes $\Phi_t$ over $K$

• Then

$$\sum_{t=1}^n \langle x_{t+1} - x, \ell_t\rangle = \sum_{t=1}^n \langle x_{t+1}, \ell_t\rangle - \underbrace{\Big(\sum_{t=1}^n \langle x, \ell_t\rangle + \frac{F(x)}{\eta}\Big)}_{\Phi_n(x)} + \frac{F(x)}{\eta} = \sum_{t=1}^n \big(\Phi_t(x_{t+1}) - \Phi_{t-1}(x_{t+1})\big) - \Phi_n(x) + \frac{F(x)}{\eta},$$

using $\Phi_t(x_{t+1}) - \Phi_{t-1}(x_{t+1}) = \big(\frac{F(x_{t+1})}{\eta} + \sum_{s=1}^t \langle x_{t+1}, \ell_s\rangle\big) - \big(\frac{F(x_{t+1})}{\eta} + \sum_{s=1}^{t-1} \langle x_{t+1}, \ell_s\rangle\big) = \langle x_{t+1}, \ell_t\rangle$.

69 / 86


FTRL analysis

Potential function: $\Phi_t(x) = \frac{F(x)}{\eta} + \sum_{s=1}^t \langle x, \ell_s\rangle$ (so $\Phi_0(x) = \frac{F(x)}{\eta}$).

Then using (1) $x_{t+1} = \operatorname{argmin}_x \Phi_t(x)$ and (2) $D_{\Phi_t}(\cdot,\cdot) = \frac{1}{\eta} D_F(\cdot,\cdot)$ (the Bregman divergence is unchanged by adding linear functions):

$$\sum_{t=1}^n \langle x_{t+1} - x, \ell_t\rangle = \frac{F(x)}{\eta} + \sum_{t=1}^n \big(\Phi_t(x_{t+1}) - \Phi_{t-1}(x_{t+1})\big) - \Phi_n(x)$$
$$= \frac{F(x)}{\eta} - \Phi_0(x_1) + \underbrace{\Phi_n(x_{n+1}) - \Phi_n(x)}_{\le 0:\ x_{n+1} = \operatorname{argmin}_x \Phi_n(x)} + \sum_{t=0}^{n-1} \big(\Phi_t(x_{t+1}) - \Phi_t(x_{t+2})\big)$$
$$\le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=0}^{n-1} \big(\Phi_t(x_{t+1}) - \Phi_t(x_{t+2})\big)$$
$$= \frac{F(x) - F(x_1)}{\eta} - \sum_{t=0}^{n-1} \Big( D_{\Phi_t}(x_{t+2}, x_{t+1}) + \underbrace{\langle \nabla\Phi_t(x_{t+1}), x_{t+2} - x_{t+1}\rangle}_{\ge 0} \Big)$$
$$\le \frac{F(x) - F(x_1)}{\eta} - \frac{1}{\eta}\sum_{t=1}^n D_F(x_{t+1}, x_t),$$

where $D_{\Phi_t}(x_{t+2}, x_{t+1}) = \Phi_t(x_{t+2}) - \Phi_t(x_{t+1}) - \langle \nabla\Phi_t(x_{t+1}), x_{t+2} - x_{t+1}\rangle$.

70 / 86


Final form of the regret

$$R_n(x) \le \frac{F(x) - F(x_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_{t*}^2 \le \frac{\operatorname{diam}_F(K)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_{t*}^2 ,$$

where $\operatorname{diam}_F(K) := \max_{a,b\in K} F(a) - F(b)$.

• The regret depends on the distance from the starting point to the optimum

• The learning rate needs careful tuning

71 / 86



Application 1: Online gradient descent

Assume $K = \{x : \|x\|_2 \le 1\}$ and $\ell_t \in K$ (so $|\langle x, \ell_t\rangle| \le 1$).

Choose $F(x) = \frac{1}{2}\|x\|_2^2$, so $\operatorname{diam}_F(K) := \max_{a,b\in K} F(a) - F(b) = \frac{1}{2}$ and

$$D_F(x,y) = \frac{1}{2}\|x-y\|_2^2$$

FTRL:

$$x_{t+1} = \operatorname{argmin}_{x\in K}\; \eta\sum_{s=1}^t \langle x, \ell_s\rangle + \frac{1}{2}\|x\|_2^2$$

Then, choosing $\eta = \sqrt{1/n}$:

$$R_n(x) \le \frac{\operatorname{diam}_F(K)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_2^2 \le \frac{1}{2\eta} + \frac{\eta n}{2} \le \sqrt{n}$$

72 / 86



Application 2: Exponential weights

Assume $K = \Delta_d := \{x \ge 0 : \|x\|_1 = 1\}$ and $\ell_t \in [0,1]^d$ for all $t$ (so $|\langle x, \ell_t\rangle| \le 1$).

$$F(x) = \sum_{i=1}^d (x_i\log(x_i) - x_i)$$

$\operatorname{diam}_F(K) := \max_{a,b\in K} F(a) - F(b) = \log(d)$: on the simplex, $\max_{x\in K} F(x) = -1$ (achieved at a vertex such as $(1,0,\ldots,0)$) and $\min_{x\in K} F(x) = -\log(d) - 1$ (achieved at $(1/d,\ldots,1/d)$).

Bregman divergence:

$$D_F(x,y) = \sum_{i=1}^d x_i\log\Big(\frac{x_i}{y_i}\Big) \quad \text{(KL divergence)} \quad \ge\; \frac{1}{2}\|x-y\|_1^2 \quad \text{(Pinsker's inequality (exercise))}$$

73 / 86



Application 2: Exponential weights

Assume $K = \Delta_d := \{x \ge 0 : \|x\|_1 = 1\}$ and $\ell_t \in [0,1]^d$ for all $t$.

FTRL:

$$x_{t+1} = \operatorname{argmin}_{x\in K}\; \eta\sum_{s=1}^t \langle x, \ell_s\rangle + F(x)$$

The optimal comparator is a standard basis vector $e_i$ with $i = \operatorname{argmin}_{j=1,\ldots,d} \sum_{t=1}^n \ell_{t,j}$ (a linear function on the simplex is minimized at a vertex, and $F(e_i) = -1$).

$$R_n(x) \le \frac{\operatorname{diam}_F(K)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_{t*}^2 \le \frac{\log(d)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_\infty^2 \le \frac{\log(d)}{\eta} + \frac{\eta n}{2} \le \sqrt{2n\log(d)}$$

with $\eta = \sqrt{2\log(d)/n}$.

74 / 86
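On the simplex the FTRL iterate has a closed form: exponentially weighted averages. A minimal full-information sketch (the random loss matrix is a placeholder, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 5
losses = rng.uniform(0.0, 1.0, size=(n, d))
eta = np.sqrt(2 * np.log(d) / n)

cum = np.zeros(d)  # cumulative losses
total = 0.0
for t in range(n):
    w = np.exp(-eta * (cum - cum.min()))  # subtract the min for numerical stability
    x = w / w.sum()                       # FTRL iterate = exponential weights
    total += x @ losses[t]
    cum += losses[t]

regret = total - cum.min()  # against the best single coordinate
print(regret, "<=", np.sqrt(2 * n * np.log(d)))
```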



Our Goal: Adversarial bandits

• At the start of the game the adversary secretly chooses losses $\ell_1, \ldots, \ell_n$ with $\ell_t \in [0,1]^K$

• In each round the learner draws an arm $A_t \in \{1,\ldots,K\}$ from some distribution $P_t$

• Suffers the loss $\ell_{t,A_t}$ (and observes only $\ell_{t,A_t}$)

• Regret is $R_n = \max_a \mathbb{E}\big[\sum_{t=1}^n (\ell_{t,A_t} - \ell_{t,a})\big]$

• Surprising result: there exists an algorithm such that $R_n \le \sqrt{2nK\log(K)}$ for any adversary. How?

• Key idea:
  • Construct an estimator $\hat\ell_t = (\hat\ell_{t,1}, \ldots, \hat\ell_{t,K})$ of the entire loss vector $\ell_t = (\ell_{t,1}, \ldots, \ell_{t,K})$
  • Apply follow the regularized leader (FTRL) to the estimated losses

75 / 86



Importance-weighted estimators

At time $t$, the algorithm chooses arm $A_t = i$ with probability $P_{ti}$ (we specify $P_{ti}$ later). Define the estimator of $\ell_{t,i}$:

$$\hat\ell_{t,i} = \frac{\ell_{t,i}\,\mathbb{1}(A_t = i)}{P_{ti}}$$

and $\hat\ell_t = (\hat\ell_{t,1}, \ldots, \hat\ell_{t,K})$.

Unbiased estimator:

$$\mathbb{E}\big[\hat\ell_{t,i} \mid P_t\big] = \frac{\ell_{t,i}}{P_{ti}}\,\mathbb{E}[\mathbb{1}(A_t = i) \mid P_t] = \frac{\ell_{t,i}}{P_{ti}}\,P_{ti} = \ell_{t,i}$$

Second moment: $\mathbb{E}\big[\hat\ell_{t,i}^2 \mid P_t\big] = \dfrac{\ell_{t,i}^2}{P_{ti}}$

76 / 86
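A quick simulation confirms both moments (a sketch; the distribution $P_t$ and the loss vector are fixed placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([0.5, 0.3, 0.2])    # P_t
ell = np.array([0.9, 0.1, 0.4])  # true losses in round t
m = 200_000                      # number of independent draws of A_t

A = rng.choice(3, size=m, p=P)
est = np.zeros((m, 3))
est[np.arange(m), A] = ell[A] / P[A]  # importance-weighted estimates

print(est.mean(axis=0), "vs", ell)                  # ~ ell: unbiased
print((est ** 2).mean(axis=0), "vs", ell ** 2 / P)  # second moment ~ ell^2 / P
```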


Follow the regularized leader for bandits (EXP3)

• Estimate $\ell_t$ with the unbiased importance-weighted estimator $\hat\ell_t$:

$$\hat\ell_{t,i} = \frac{\mathbb{1}(A_t = i)\,\ell_{t,i}}{P_{ti}}$$

• Then the expected regret satisfies

$$\mathbb{E}[R_n] = \max_i \mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t} - \ell_{t,i}\Big] = \max_i \mathbb{E}\Big[\sum_{t=1}^n \langle P_t - e_i, \hat\ell_t\rangle\Big]$$

This is because

• $\mathbb{E}(\langle P_t, \hat\ell_t\rangle) = \mathbb{E}(\langle P_t, \ell_t\rangle) = \mathbb{E}\big(\sum_{i=1}^K P_{ti}\,\ell_{t,i}\big) = \mathbb{E}\big(\sum_{i=1}^K \mathbb{1}(A_t = i)\,\ell_{t,i}\big) = \mathbb{E}(\ell_{t,A_t})$

• $\mathbb{E}(\langle e_i, \hat\ell_t\rangle) = \mathbb{E}(\hat\ell_{t,i}) = \ell_{t,i}$.

77 / 86


Follow the regularized leader for bandits (EXP3)

• The expected regret satisfies

$$\mathbb{E}[R_n] = \max_i \mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t} - \ell_{t,i}\Big] = \max_i \mathbb{E}\Big[\sum_{t=1}^n \langle P_t - e_i, \hat\ell_t\rangle\Big]$$

• FTRL:

$$P_t = \operatorname{argmin}_{p\in\Delta_K}\; \frac{F(p)}{\eta} + \sum_{s=1}^{t-1} \langle p, \hat\ell_s\rangle$$

• Since the domain is $\Delta_K$, choose the negentropy $F$:

$$F(p) = \sum_{i=1}^K p_i\log(p_i) - p_i$$

• Using the Lagrange dual, the probability of choosing each arm $i$ is

$$P_{ti} = \frac{\exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,i}\big)}{\sum_{j=1}^K \exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,j}\big)}$$

78 / 86
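Combining the importance-weighted estimator with this exponential-weights update gives the full EXP3 loop (a minimal sketch; the loss matrix and the tweak making arm 0 best on average are placeholder choices):

```python
import numpy as np

def exp3(losses, eta, rng):
    """EXP3: exponential weights over importance-weighted loss estimates."""
    n, K = losses.shape
    cum_hat = np.zeros(K)  # cumulative estimated losses
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (cum_hat - cum_hat.min()))  # stabilized weights
        P = w / w.sum()                               # P_t
        a = rng.choice(K, p=P)                        # A_t ~ P_t
        total += losses[t, a]                         # suffer (and only observe) this loss
        cum_hat[a] += losses[t, a] / P[a]             # importance-weighted update
    return total

rng = np.random.default_rng(4)
n, K = 5000, 4
losses = rng.uniform(0.0, 1.0, size=(n, K))
losses[:, 0] *= 0.8  # make arm 0 slightly better on average
eta = np.sqrt(2 * np.log(K) / (n * K))
regret = exp3(losses, eta, rng) - losses.sum(axis=0).min()
print(regret, "vs bound", np.sqrt(2 * n * K * np.log(K)))  # the bound holds in expectation
```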



Follow the regularized leader for bandits (EXP3 Algo)

• Using the FTRL regret bound:

$$\mathbb{E}[R_n] \le \mathbb{E}\Big[\frac{F(e_i) - F(P_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\hat\ell_t\|_{t*}^2\Big] \le \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \|\hat\ell_t\|_{t*}^2\Big] \qquad (F(e_i) - F(P_1) \le \log(K))$$

where $i$ is the best arm.

• How to bound $\|\hat\ell_t\|_{t*}^2$?

79 / 86



Follow the regularized leader for bandits

• How to bound $\|\hat\ell_t\|_{t*}^2$?

• Recall: $z_t \in [x_t, x_{t+1}]$ is such that $D_F(x_{t+1}, x_t) = \frac{1}{2}\|x_t - x_{t+1}\|^2_{\nabla^2 F(z_t)}$, with $\|\cdot\|_t = \|\cdot\|_{\nabla^2 F(z_t)}$ and $\|\cdot\|_{t*}$ its dual norm.

• For $F(p) = \sum_{i=1}^K p_i\log(p_i) - p_i$,

$$\nabla^2 F(p) = \operatorname{diag}(1/p) \implies \|\hat\ell_t\|_{t*}^2 = \|\hat\ell_t\|^2_{\nabla^2 F(p)^{-1}} = \sum_{i=1}^K p_i\,\hat\ell_{t,i}^2 ,$$

for some $p \in [P_t, P_{t+1}]$.

80 / 86



Follow the regularized leader for bandits

• How to bound $\|\hat\ell_t\|_{t*}^2$?

• $\|\hat\ell_t\|_{t*}^2 = \|\hat\ell_t\|^2_{\nabla^2 F(p)^{-1}} = \sum_{i=1}^K p_i\,\hat\ell_{t,i}^2$ for some $p \in [P_t, P_{t+1}]$

• $\hat\ell_{t,i} = \mathbb{1}(A_t = i)\,\ell_{t,i}/P_{ti}$ is non-negative and $\hat\ell_{t,j} = 0$ for $j \ne A_t$, so

$$\|\hat\ell_t\|_{t*}^2 = p_{A_t}\,\hat\ell_{t,A_t}^2$$

• Further, writing $\alpha_j = \exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,j}\big)$, we have $P_{t,A_t} = \frac{\alpha_{A_t}}{\sum_{j=1}^K \alpha_j}$ and

$$P_{t+1,A_t} = \frac{\exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,A_t}\big)\exp(-\eta\hat\ell_{t,A_t})}{\sum_{j=1}^K \exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,j}\big)\exp(-\eta\hat\ell_{t,j})} = \frac{\alpha_{A_t}}{\alpha_{A_t} + \sum_{j\ne A_t}\alpha_j\exp(\eta\hat\ell_{t,A_t})}$$

Since $\exp(\eta\hat\ell_{t,A_t}) \ge 1$, we get $P_{t+1,A_t} \le P_{t,A_t}$, hence $p_{A_t} \le P_{t,A_t}$ and

$$\|\hat\ell_t\|_{t*}^2 = \sum_{j=1}^K p_j\,\hat\ell_{t,j}^2 \le P_{t,A_t}\,\hat\ell_{t,A_t}^2$$

81 / 86



Putting everything together: regret bound for follow the regularized leader for bandits

$$\mathbb{E}[R_n] \le \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \|\hat\ell_t\|_{t*}^2\Big]$$
$$\le \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n P_{t,A_t}\,\hat\ell_{t,A_t}^2\Big]$$
$$= \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \frac{\ell_{t,A_t}^2}{P_{t,A_t}}\Big]$$
$$\le \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \frac{1}{P_{t,A_t}}\Big] \qquad (\ell_t \in [0,1]^K)$$
$$= \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \sum_{i=1}^K P_{ti}\cdot\frac{1}{P_{ti}}\Big]$$
$$= \frac{\log(K)}{\eta} + \frac{\eta nK}{2} \le \sqrt{2nK\log(K)} \qquad (\eta = \sqrt{2\log(K)/(nK)})$$

82 / 86
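The learning-rate choice used here (and in the earlier applications) comes from balancing the two terms of the bound (a filled-in step):

```latex
\min_{\eta > 0}\Big(\frac{A}{\eta} + \frac{\eta B}{2}\Big) = \sqrt{2AB},
\qquad\text{attained at } \eta = \sqrt{2A/B}.
\text{Here } A = \log(K),\ B = nK \ \Rightarrow\
\eta = \sqrt{2\log(K)/(nK)},\quad
\mathbb{E}[R_n] \le \sqrt{2nK\log(K)}.
```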


Historical notes

• The first paper on bandits is by Thompson (1933). He proposed an algorithm for two-armed Bernoulli bandits and hand-ran some simulations (Thompson sampling)

• Popularized enormously by Robbins (1952)

• Confidence bounds were first used by Lai and Robbins (1985) to derive an asymptotically optimal algorithm

• UCB by Katehakis and Robbins (1995) and Agrawal (1995). Finite-time analysis by Auer et al. (2002)

• Adversarial bandits: Auer et al. (1995)

• Minimax optimal algorithm by Audibert and Bubeck (2009)

83 / 86


Resources

• Online notes: http://banditalgs.com

• The book "Bandit Algorithms" by Tor Lattimore and Csaba Szepesvari: https://tor-lattimore.com/downloads/book/book.pdf

• Book by Bubeck and Cesa-Bianchi (2012)

• Book by Cesa-Bianchi and Lugosi (2006)

• The Bayesian books by Gittins et al. (2011) and Berry and Fristedt (1985). Both worth reading.

• Notes by Aleksandrs Slivkins: http://slivkins.com/work/MAB-book.pdf

84 / 86


References I

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078.

Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (COLT), pages 217–226.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE.

Berry, D. and Fristedt, B. (1985). Bandit problems: sequential allocation of experiments. Chapman and Hall, London; New York.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated.

Bush, R. R. and Mosteller, F. (1953). A stochastic model with applications to learning. The Annals of Mathematical Statistics, pages 559–585.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.

Gittins, J., Glazebrook, K., and Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley & Sons.

85 / 86


References II

Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.

Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

86 / 86