
Multi-armed Bandits (MAB)

Xi Chen

Slides are based on the AAAI 2018 tutorial by Tor Lattimore & Csaba Szepesvári



Overview

• What are bandits, and why you should care

• Finite-armed stochastic bandits
  • Explore-Then-Commit (ETC) algorithm
  • Upper Confidence Bound (UCB) algorithm
  • Lower bound

• Finite-armed adversarial bandits



What’s in a name? A tiny bit of history

First bandit algorithm proposed by Thompson (1933)

Bush and Mosteller (1953) were interested in how mice behaved in a T-maze



Applications

• Clinical trials/dose discovery

• Recommendation systems (movies/news/etc)

• Advertisement placement

• A/B testing

• Dynamic pricing (e.g., for Amazon products)

• Ranking (e.g., for search)

• Resource allocation

• Bandit problems isolate an important component of reinforcement learning: the exploration-vs-exploitation trade-off



Finite-armed bandits

• K actions

• n rounds

• In each round t the learner chooses an action

  A_t ∈ {1, 2, ..., K}

• Observes a reward X_t ∼ P_{A_t}, where P_1, P_2, ..., P_K are unknown distributions (Gaussian or subgaussian)

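To make the protocol concrete, here is a minimal simulation sketch in Python (not from the slides; all names and parameter values are illustrative): a K-armed Gaussian bandit and the round-by-round interaction loop, played by a trivial uniformly random learner.

```python
import numpy as np

rng = np.random.default_rng(0)

K, n = 5, 1000
mu = rng.uniform(0, 1, size=K)      # unknown mean rewards mu_1, ..., mu_K

def pull(i):
    """Observe X_t ~ P_i = N(mu_i, 1)."""
    return mu[i] + rng.normal()

total_reward = 0.0
for t in range(n):
    A_t = rng.integers(K)           # the learner chooses an action
    total_reward += pull(A_t)       # ... and observes the reward

# Realized regret: a noisy estimate of R_n = n*mu_star - E[sum of rewards].
print("regret of uniform play:", n * mu.max() - total_reward)
```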


Example: A/B testing

• A business wants to optimize its webpage

• Actions correspond to 'A' and 'B' (two arms)

• Users arrive at the webpage sequentially

• The algorithm chooses either 'A' or 'B' (pulling an arm)

• Receives activity feedback (a click as the reward)



Measuring performance – the regret

• Let µi be the mean reward of distribution Pi

• µ∗ = maxi µi is the maximum mean

• The (expected) regret is

  R_n = nµ* − E[ ∑_{t=1}^n X_t ]

• A reasonable policy should at least have sublinear regret: R_n = o(n)

• Of course we would like to make it 'as small as possible'



Measuring performance – the regret

Let ∆_i = µ* − µ_i be the suboptimality gap for the i-th arm

Let T_i(n) be the number of times arm i is played over all n rounds

Key decomposition lemma:

  R_n = ∑_{i=1}^K ∆_i E[T_i(n)]

Proof  Let E_t[·] = E[· | A_1, X_1, ..., X_{t−1}, A_t]. Then

  R_n = nµ* − E[ ∑_{t=1}^n X_t ]
      = nµ* − ∑_{t=1}^n E[ E_t[X_t] ]
      = nµ* − ∑_{t=1}^n E[ µ_{A_t} ]
      = ∑_{t=1}^n E[ ∆_{A_t} ]
      = E[ ∑_{t=1}^n ∑_{i=1}^K 1(A_t = i) ∆_i ]
      = E[ ∑_{i=1}^K ∆_i ∑_{t=1}^n 1(A_t = i) ]
      = E[ ∑_{i=1}^K ∆_i T_i(n) ]
      = ∑_{i=1}^K ∆_i E[T_i(n)]
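The lemma is easy to sanity-check numerically. Below is a small sketch (illustrative, not from the slides) estimating both sides by Monte Carlo for a uniformly random policy on a 3-armed Gaussian bandit; the two printed numbers should roughly agree.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.9, 0.5, 0.2])           # arm means; mu* = 0.9
K, n, reps = len(mu), 200, 2000
gaps = mu.max() - mu                     # suboptimality gaps Delta_i

regret_sum, T = 0.0, np.zeros(K)
for _ in range(reps):
    A = rng.integers(K, size=n)          # uniformly random policy
    X = mu[A] + rng.normal(size=n)       # rewards X_t ~ N(mu_{A_t}, 1)
    regret_sum += n * mu.max() - X.sum() # realized regret
    T += np.bincount(A, minlength=K)     # pull counts T_i(n)

print("Monte Carlo R_n            :", regret_sum / reps)
print("sum_i Delta_i * E[T_i(n)]  :", (gaps * T / reps).sum())
```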



A simple policy: Explore-Then-Commit

1. Choose each action m times

2. Find the empirically best action I ∈ {1, 2, ..., K} (i.e., the action I with the largest average reward over its m plays)

3. Choose A_t = I for all remaining n − mK rounds

In order to analyse this policy we need to bound the probability of committing to a suboptimal action.
Needed probability tool: concentration inequalities.

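A minimal sketch of the ETC policy itself (illustrative code assuming N(µ_i, 1) rewards; the names are mine, not the slides'):

```python
import numpy as np

def explore_then_commit(mu, n, m, rng):
    """ETC: play each of the K arms m times, then commit to the
    empirically best arm for the remaining n - m*K rounds.
    Rewards are N(mu_i, 1). Returns the realized regret."""
    K = len(mu)
    reward = 0.0
    means = np.zeros(K)
    for i in range(K):                        # exploration phase
        x = mu[i] + rng.normal(size=m)
        means[i] = x.mean()
        reward += x.sum()
    best = int(np.argmax(means))              # empirically best arm I
    reward += (mu[best] + rng.normal(size=n - m * K)).sum()   # commit
    return n * mu.max() - reward

rng = np.random.default_rng(2)
print(explore_then_commit(np.array([0.5, 0.0]), n=1000, m=50, rng=rng))
```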


A crash course in concentration

Let Z_1, Z_2, ..., Z_n be a sequence of independent and identically distributed random variables with mean µ ∈ R and variance σ² < ∞

  empirical mean:  µ̂_n = (1/n) ∑_{t=1}^n Z_t

How close is µ̂_n to µ?


A crash course in concentration

  empirical mean:  µ̂_n = (1/n) ∑_{t=1}^n Z_t

How close is µ̂_n to µ? Classical statistics says:

1. (law of large numbers)  lim_{n→∞} µ̂_n = µ almost surely

2. (central limit theorem)  √n (µ̂_n − µ) →d N(0, σ²)

3. (Chebyshev's inequality)  P(|µ̂_n − µ| ≥ ε) ≤ σ²/(nε²)   (using V(µ̂_n) = σ²/n)

Basic probability inequalities (for a random variable X with finite mean and variance):

1. (Markov's inequality)  P(|X| ≥ ε) ≤ E|X| / ε

2. (Chebyshev's inequality)  P(|X − E[X]| ≥ ε) ≤ V(X)/ε²

We need something nonasymptotic and stronger than Chebyshev's (not possible without assumptions)



A crash course in concentration

A random variable Z is σ-subgaussian if for all λ ∈ R,

  M_Z(λ) := E[exp(λZ)] ≤ exp(λ²σ²/2),

where M_Z(λ) is known as the moment generating function.

• Which distributions are σ-subgaussian? Gaussian, Bernoulli, and any distribution with bounded support.

• Which are not: exponential, power law.

Lemma  If Z, Z_1, ..., Z_n are independent and σ-subgaussian, then

• aZ is |a|σ-subgaussian for any a ∈ R

• ∑_{t=1}^n Z_t is √n·σ-subgaussian

• µ̂_n is (σ/√n)-subgaussian



A crash course in concentration

Lemma (tail bound for a subgaussian random variable)

• If X is σ-subgaussian, then for any ε > 0,  P(X ≥ ε) ≤ exp(−ε²/(2σ²)).

Proof  We use Chernoff's method. Let ε > 0 and λ = ε/σ². Then

  P(X ≥ ε) = P( exp(λX) ≥ exp(λε) )
           ≤ E[exp(λX)] / exp(λε)       (Markov's inequality)
           ≤ exp( σ²λ²/2 − λε )         (X is subgaussian)
           = exp( −ε²/(2σ²) )

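A quick empirical check of the lemma (a sketch under the assumption X ∼ N(0, σ²), which is σ-subgaussian): the observed tail frequencies should sit below the Chernoff bound.

```python
import numpy as np

# Check P(X >= eps) <= exp(-eps^2 / (2 sigma^2)) for X ~ N(0, sigma^2).
rng = np.random.default_rng(3)
sigma, samples = 1.0, 1_000_000
x = rng.normal(0.0, sigma, size=samples)
for eps in [0.5, 1.0, 2.0, 3.0]:
    empirical = (x >= eps).mean()                 # empirical tail frequency
    bound = np.exp(-eps**2 / (2 * sigma**2))      # subgaussian tail bound
    print(f"eps={eps}: empirical {empirical:.5f} <= bound {bound:.5f}")
```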


A crash course in concentration

Theorem  If Z_1, ..., Z_n are independent and σ-subgaussian, then

  P( µ̂_n − µ ≥ √( 2σ² log(1/δ) / n ) ) ≤ δ

Proof  µ̂_n − µ is a (σ/√n)-subgaussian random variable, and thus

  P( µ̂_n − µ ≥ ε ) ≤ exp( −nε²/(2σ²) ).

Set exp(−nε²/(2σ²)) = δ and solve for ε.

Corollary  If Z_1, ..., Z_n are independent and σ-subgaussian, then

  P( µ̂_n − µ ≤ −√( 2σ² log(1/δ) / n ) ) ≤ δ



A crash course in concentration

• Comparing Chebyshev's inequality with the subgaussian bound:

  Chebyshev's:  P( µ̂_n − µ ≥ √( σ² / (nδ) ) ) ≤ δ

  Subgaussian:  P( µ̂_n − µ ≥ √( 2σ² log(1/δ) / n ) ) ≤ δ

• Typically δ ≪ 1/n in our use-cases. Then Chebyshev's inequality is too loose, since √(σ²/(nδ)) is too large: its radius grows polynomially in 1/δ rather than logarithmically.

From now on, we will assume that the reward distribution associated with each arm is 1-subgaussian (but with different means)

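The gap between the two bounds is easy to see numerically; a small sketch (illustrative values) comparing the deviation radii each inequality guarantees with probability 1 − δ:

```python
import numpy as np

# Radius guaranteed with probability 1 - delta by each bound:
#   Chebyshev:   sqrt(sigma^2 / (n * delta))
#   subgaussian: sqrt(2 * sigma^2 * log(1/delta) / n)
sigma2, n = 1.0, 1000
for delta in [1e-1, 1e-2, 1e-4, 1e-8]:
    cheb = np.sqrt(sigma2 / (n * delta))
    subg = np.sqrt(2 * sigma2 * np.log(1 / delta) / n)
    print(f"delta={delta:.0e}: Chebyshev {cheb:8.3f}   subgaussian {subg:.3f}")
```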


Analysing Explore-Then-Commit

• Exploration phase: choose each arm m times

• Exploitation phase: then commit to the arm with the largest empirical reward

• Standard convention: assume µ_1 ≥ µ_2 ≥ · · · ≥ µ_K
  • Means that the first arm is optimal
  • Algorithms are symmetric and do not know this fact

• We consider only K = 2



Analysing Explore-Then-Commit

Step 1  Let µ̂_i be the average reward of the i-th arm (for i ∈ {1, 2}) after the exploration phase

The algorithm commits to the wrong arm if

  µ̂_2 ≥ µ̂_1  ⇔  (µ̂_2 − µ_2) + (µ_1 − µ̂_1) ≥ ∆ = µ_1 − µ_2

Observation  (µ̂_2 − µ_2) + (µ_1 − µ̂_1) is the sum of two independent √(1/m)-subgaussian terms, hence √(2/m)-subgaussian with zero mean


Analysing Explore-Then-Commit

Step 1  The algorithm commits to the wrong arm if

  µ̂_2 ≥ µ̂_1  ⇔  (µ̂_2 − µ_2) + (µ_1 − µ̂_1) ≥ ∆ = µ_1 − µ_2

Observation  (µ̂_2 − µ_2) + (µ_1 − µ̂_1) is √(2/m)-subgaussian with zero mean

Step 2  The regret is

  R_n = E[ ∑_{t=1}^n ∆_{A_t} ]
      = E[ ∑_{t=1}^{2m} ∆_{A_t} ] + E[ ∑_{t=2m+1}^n ∆_{A_t} ]
      = m∆ + (n − 2m) ∆ · P(commit to the wrong arm)
      = m∆ + (n − 2m) ∆ · P( (µ̂_2 − µ_2) + (µ_1 − µ̂_1) ≥ ∆ )
      ≤ m∆ + n∆ exp( −m∆²/4 )

The last inequality holds because if X is σ-subgaussian then, for any ε > 0, P(X ≥ ε) ≤ exp(−ε²/(2σ²)); apply this with ε = ∆ and σ = √(2/m).



Analysing Explore-Then-Commit

  R_n ≤ m∆ + n∆ exp(−m∆²/4),   with (A) = m∆ and (B) = n∆ exp(−m∆²/4)

(A) is monotone increasing in m, while (B) is monotone decreasing in m

Exploration/Exploitation trade-off  Exploring too much (m large) makes (A) big, while exploring too little makes (B) large

The bound is minimised by m = ⌈ (4/∆²) log(n∆²/4) ⌉, leading to

  R_n ≤ ∆ + (4/∆) log(n∆²/4) + 4/∆,

noting that, due to the ceiling function in m, (A) ≤ (1 + (4/∆²) log(n∆²/4)) ∆.
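A small sketch computing the optimal m and the resulting bound for illustrative values of n and ∆ (it assumes n∆²/4 > 1, so that the logarithm is positive):

```python
import numpy as np

def etc_regret_bound(n, Delta, m):
    """(A) + (B) from the slide: m*Delta + n*Delta*exp(-m*Delta^2/4)."""
    return m * Delta + n * Delta * np.exp(-m * Delta**2 / 4)

n, Delta = 1000, 0.3                       # illustrative values
m_star = int(np.ceil(4 / Delta**2 * np.log(n * Delta**2 / 4)))
print("optimal m:", m_star)
print("bound at m*:", etc_regret_bound(n, Delta, m_star))
```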


Analysing Explore-Then-Commit

Last slide:  R_n ≤ ∆ + (4/∆) log(n∆²/4) + 4/∆

What happens when ∆ is very small? (R_n can be unbounded)

A natural correction:

  R_n ≤ min{ n∆,  ∆ + (4/∆) log(n∆²/4) + 4/∆ }

[Figure: this bound on R_n plotted as a function of ∆ ∈ [0, 1], for n = 1000]



Analysing Explore-Then-Commit

Does this figure make sense? Why is the regret largest when ∆ is small, but not too small?

  R_n ≤ min{ n∆,  ∆ + (4/∆) log(n∆²/4) + 4/∆ }

Small ∆ makes identification of the best arm hard, but the cost of failure (of identification) is low

Large ∆ makes the cost of failure high, but identification becomes easy

The worst case is when ∆ ≈ √(1/n), with R_n ≈ √n



Limitations of Explore-Then-Commit

• Recall that m = ⌈ (4/∆²) log(n∆²/4) ⌉

• Need advance knowledge of the horizon length n

• Optimal tuning depends on the unknown gap ∆ = µ_1 − µ_2

• Better approaches now exist, but Explore-Then-Commit is often a good place to start when analyzing a bandit problem, since it cleanly captures the exploration-exploitation trade-off



Optimism principle


Informal illustration

Visiting a new region

Shall I try local cuisine?

Optimist: Yes!

Pessimist: No!

Optimism leads to exploration, pessimism prevents it

Exploration is necessary, but how much?


Optimism principle

• Let µ̂_i(t) = (1/T_i(t)) ∑_{s=1}^t 1(A_s = i) X_s be the empirical mean reward of the i-th arm at time t

• Optimistic estimate of the mean of an arm = 'largest value it could plausibly be'

• Formalise the intuition using confidence intervals (σ = 1):

  P( µ̂_n − µ ≥ √( 2σ² log(1/δ) / n ) ) ≤ δ

• This suggests

  optimistic estimate = µ̂_i(t − 1) + √( 2 log(1/δ) / T_i(t − 1) )

• δ ∈ (0, 1) determines the level of optimism



Upper confidence bound algorithm

1. Choose each action once

2. Choose the action maximising the index

   A_t = argmax_i  µ̂_i(t − 1) + √( 2 log(t³) / T_i(t − 1) )

3. Go to 2

Corresponds to δ = 1/t³. This is quite a conservative choice (more on this later)

The algorithm does not depend on the horizon n (it is anytime)

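A minimal sketch of this UCB rule (illustrative code assuming N(µ_i, 1) rewards; note that 2 log(t³) = 6 log(t)):

```python
import numpy as np

def ucb(mu, n, rng):
    """UCB with index mu_hat_i + sqrt(2*log(t^3)/T_i), i.e. delta = 1/t^3.
    Rewards are N(mu_i, 1). Returns the realized regret."""
    K = len(mu)
    T = np.zeros(K)                  # pull counts T_i(t)
    S = np.zeros(K)                  # reward sums, so mu_hat_i = S_i / T_i
    reward = 0.0
    for t in range(1, n + 1):
        if t <= K:                   # play each arm once first
            i = t - 1
        else:
            index = S / T + np.sqrt(2 * np.log(t**3) / T)
            i = int(np.argmax(index))
        x = mu[i] + rng.normal()
        T[i] += 1; S[i] += x; reward += x
    return n * mu.max() - reward

rng = np.random.default_rng(4)
print(ucb(np.array([0.5, 0.3, 0.0]), n=5000, rng=rng))
```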


Upper confidence bound algorithm

• Red circle: true mean; blue rectangle: empirical mean reward.

• (10), (73), (3), (23): number of pulls (a larger number of pulls makes the empirical mean closer to the true mean).


Why UCB?

• A suboptimal arm can only be played if its upper confidence bound is larger than the upper confidence bound of the optimal arm, which (with high probability) is in turn larger than the mean of the optimal arm.

• However, this cannot happen too often: after a few more plays of a suboptimal arm, its upper confidence bound will be close to its true mean, and thus will eventually fall below the upper confidence bound of the optimal arm.

• UCB index:

  A_t = argmax_i  µ̂_i(t − 1) + √( 2 log(t³) / T_i(t − 1) )

• An algorithm should explore an arm more often if it is
  1. either promising, because µ̂_i(t − 1) is large,
  2. or not well explored, because T_i(t − 1) is small.



Regret of UCB

Theorem  The regret of UCB is at most

  R_n = O( ∑_{i: ∆_i > 0} ( ∆_i + log(n)/∆_i ) )

Furthermore,

  R_n = O( √( K n log(n) ) ),

where K is the number of arms and n is the time horizon length.

Bounds of the first kind are called problem dependent or instance dependent; they depend on the gaps ∆_i = µ_1 − µ_i

Bounds like the second are called distribution free or worst case


Regret analysis

Rewrite the regret as

  R_n = ∑_{i=1}^K ∆_i E[T_i(n)]

We only need to show that E[T_i(n)] is not too large for suboptimal arms


Regret analysis

Key insight  Arm i is only played if its index is larger than the index of the optimal arm:

  γ_i(t − 1) = µ̂_i(t − 1) + √( 2 log(t³) / T_i(t − 1) )    (index of arm i in round t)

That a suboptimal arm i ≠ 1 is played implies that

1. either γ_i(t − 1) ≥ µ_1 (the index of arm i is larger than the mean of the optimal arm),

2. or γ_1(t − 1) ≤ µ_1 (the index of arm 1 is smaller than its true mean).

Otherwise we would have γ_i(t − 1) < µ_1 ≤ γ_1(t − 1), and arm 1 would be played instead.

Both events are unlikely after a sufficient number of plays.



Regret analysis

To make this intuition a reality we decompose the "pull-count" for the i-th arm (i ≠ 1):

  E[T_i(n)] = E[ ∑_{t=1}^n 1(A_t = i) ] = ∑_{t=1}^n P(A_t = i)
            = ∑_{t=1}^n P( A_t = i and (γ_1(t − 1) ≤ µ_1 or γ_i(t − 1) ≥ µ_1) )
            ≤ ∑_{t=1}^n P( γ_1(t − 1) ≤ µ_1 )                   [index of opt. arm too small?]
              + ∑_{t=1}^n P( A_t = i and γ_i(t − 1) ≥ µ_1 )     [index of subopt. arm too large?]


Regret analysis

We want to show that P(γ_1(t − 1) ≤ µ_1) is small

It is tempting to use the concentration theorem directly:

  P( γ_1(t − 1) ≤ µ_1 ) = P( µ̂_1(t − 1) + √( 2 log(t³) / T_1(t − 1) ) ≤ µ_1 )  ≤?  1/t³

What's wrong with this? T_1(t − 1) is a random variable, not a number! Use the union bound P(∪_{s=1}^{t−1} A_s) ≤ ∑_{s=1}^{t−1} P(A_s):

  P( µ̂_1(t − 1) + √( 2 log(t³) / T_1(t − 1) ) ≤ µ_1 )
    ≤ P( ∃ s ≤ t − 1 : µ̂_{1,s} + √( 2 log(t³) / s ) ≤ µ_1 )
    ≤ ∑_{s=1}^{t−1} P( µ̂_{1,s} + √( 2 log(t³) / s ) ≤ µ_1 )
    ≤ ∑_{s=1}^{t−1} 1/t³  ≤  1/t²                    (δ = 1/t³)

where µ̂_{1,s} denotes the empirical mean of arm 1 after its first s plays.



Regret analysis

  ∑_{t=1}^n P( A_t = i and γ_i(t − 1) ≥ µ_1 )
    = E[ ∑_{t=1}^n 1( A_t = i and γ_i(t − 1) ≥ µ_1 ) ]
    = E[ ∑_{t=1}^n 1( A_t = i and µ̂_i(t − 1) + √( 6 log(t) / T_i(t − 1) ) ≥ µ_1 ) ]
    ≤ E[ ∑_{t=1}^n 1( A_t = i and µ̂_i(t − 1) + √( 6 log(n) / T_i(t − 1) ) ≥ µ_1 ) ]    (t ≤ n)

(using 2 log(t³) = 6 log(t))


Regret analysis

  ∑_{t=1}^n P( A_t = i and γ_i(t − 1) ≥ µ_1 )
    ≤ E[ ∑_{t=1}^n 1( A_t = i and µ̂_i(t − 1) + √( 6 log(n) / T_i(t − 1) ) ≥ µ_1 ) ]
    ≤ E[ ∑_{s=1}^n 1( µ̂_{i,s} + √( 6 log(n) / s ) ≥ µ_1 ) ]    (for each possible value T_i(t − 1) = s, s = 1, ..., n)
    = ∑_{s=1}^n P( µ̂_{i,s} + √( 6 log(n) / s ) ≥ µ_1 )


Regret analysis

Let u = 24 log(n) / ∆_i². Then we decompose the time periods into [1, u] and [u + 1, n]:

  ∑_{s=1}^n P( µ̂_{i,s} + √( 6 log(n)/s ) ≥ µ_1 )  ≤  u + ∑_{s=u+1}^n P( µ̂_{i,s} + √( 6 log(n)/s ) ≥ µ_1 )

Choose u large enough so that √( 6 log(n)/s ) ≤ ∆_i / 2 for any s > u (this is exactly u = 24 log(n)/∆_i²). Then for s > u we have

  µ̂_{i,s} ≥ µ_1 − √( 6 log(n)/s )
    ⇒ µ̂_{i,s} − µ_i ≥ µ_1 − µ_i − √( 6 log(n)/s )
    ⇒ µ̂_{i,s} − µ_i ≥ ∆_i − ∆_i/2
    ⇒ µ̂_{i,s} − µ_i ≥ ∆_i / 2



Regret analysis

Let u = 24 log(n) / ∆_i². Then

  ∑_{s=1}^n P( µ̂_{i,s} + √( 6 log(n)/s ) ≥ µ_1 )
    ≤ u + ∑_{s=u+1}^n P( µ̂_{i,s} + √( 6 log(n)/s ) ≥ µ_1 )
    ≤ u + ∑_{s=u+1}^n P( µ̂_{i,s} ≥ µ_i + ∆_i/2 )
    ≤ u + ∑_{s=u+1}^∞ exp( −s∆_i²/8 )
    ≤ u + 1 + 8/∆_i²,

using ∑_{s=u+1}^∞ exp(−s∆_i²/8) ≤ 1 + ∫_u^∞ exp(−s∆_i²/8) ds.


Regret analysis

Combining the two parts we have

  E[T_i(n)] ≤ ∑_{t=1}^n P( γ_1(t − 1) ≤ µ_1 ) + ∑_{t=1}^n P( A_t = i and γ_i(t − 1) ≥ µ_1 )
           ≤ ∑_{t=1}^n 1/t² + 1 + u + 8/∆_i²      (u = 24 log(n)/∆_i², and ∑_{t=1}^n 1/t² ≤ 1 + ∫_1^∞ dt/t²)
           ≤ 3 + 8/∆_i² + 24 log(n)/∆_i²

So the regret is bounded by (instance dependent bound)

  R_n = ∑_{i: ∆_i > 0} ∆_i E[T_i(n)] ≤ ∑_{i: ∆_i > 0} ( 3∆_i + 8/∆_i + 24 log(n)/∆_i )
      = O( ∑_{i: ∆_i > 0} ( ∆_i + log(n)/∆_i ) )



Distribution free bounds

Let ∆ > 0 be some constant to be chosen later:

  R_n = ∑_{i: ∆_i ≤ ∆} ∆_i E[T_i(n)] + ∑_{i: ∆_i > ∆} ∆_i E[T_i(n)]
      ≤ n∆ + ∑_{i: ∆_i > ∆} ∆_i E[T_i(n)]       (since ∑_{i: ∆_i ≤ ∆} T_i(n) ≤ ∑_i T_i(n) = n)
      ≲ n∆ + ∑_{i: ∆_i > ∆} ( ∆_i + log(n)/∆_i )
      ≲ n∆ + K log(n)/∆ + ∑_{i=1}^K ∆_i         (choose ∆ = √( K log(n)/n ))
      ≲ √( nK log(n) ) + ∑_{i=1}^K ∆_i

Note that ∑_{i=1}^K ∆_i is unavoidable, since each arm needs to be played at least once; this term is negligible when n is large.



Improvements

• The constants in the algorithm/analysis can be improved quite significantly:

  A_t = argmax_i  µ̂_i(t − 1) + √( 2 log(t) / T_i(t − 1) )

• With this choice:

  lim_{n→∞} R_n / log(n) = ∑_{i: ∆_i > 0} 2/∆_i

• The distribution-free regret is also improvable:

  A_t = argmax_i  µ̂_i(t − 1) + √( (4 / T_i(t − 1)) log( 1 + t / (K T_i(t − 1)) ) )

• With this index we save a log factor in the distribution free bound:

  R_n = O( √(nK) )



Exercise

• Consider different settings of arms (number of arms K, mean gaps ∆_i) and different reward distributions: uniform, Bernoulli, normal

• Compare Explore-Then-Commit with the UCB algorithm in terms of:

  1. Regret as a function of the horizon n
  2. Frequency of pulling each arm
  3. Tuning the constant ρ in the index

     A_t = argmax_i  µ̂_i(t − 1) + √( ρ log(t) / T_i(t − 1) )

(A simulation sketch along these lines is given below.)
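One possible starting point for this exercise, as referenced above: a self-contained sketch (illustrative parameters, Gaussian rewards only) comparing the mean regret of ETC and UCB with a tunable ρ; ρ = 6 matches the 2 log(t³) index from the slides.

```python
import numpy as np

def run_ucb(mu, n, rho, rng):
    """UCB with index mu_hat_i + sqrt(rho * log(t) / T_i)."""
    K = len(mu); T = np.zeros(K); S = np.zeros(K); reward = 0.0
    for t in range(1, n + 1):
        i = t - 1 if t <= K else int(np.argmax(S / T + np.sqrt(rho * np.log(t) / T)))
        x = mu[i] + rng.normal(); T[i] += 1; S[i] += x; reward += x
    return n * mu.max() - reward

def run_etc(mu, n, m, rng):
    """Explore-Then-Commit with m plays per arm."""
    K = len(mu); means = np.zeros(K); reward = 0.0
    for i in range(K):
        x = mu[i] + rng.normal(size=m); means[i] = x.mean(); reward += x.sum()
    best = int(np.argmax(means))
    reward += (mu[best] + rng.normal(size=n - m * K)).sum()
    return n * mu.max() - reward

rng = np.random.default_rng(5)
mu, n, reps = np.array([0.6, 0.4, 0.2]), 2000, 200
for rho in [2.0, 6.0]:
    r = np.mean([run_ucb(mu, n, rho, rng) for _ in range(reps)])
    print(f"UCB rho={rho}: mean regret {r:.1f}")
r = np.mean([run_etc(mu, n, 50, rng) for _ in range(reps)])
print(f"ETC m=50   : mean regret {r:.1f}")
```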


Lower bounds

Is the bound R_n = O(√(nK)) optimal in n and K?

1. The worst-case regret of a given policy π is R_n(π) = sup_{ν∈E} R_n(π, ν), where E denotes the set of K-armed Gaussian bandits with unit variance and means µ ∈ [0, 1]^K.

2. The minimax regret is R*_n(E) = inf_π sup_{ν∈E} R_n(π, ν)

Theorem  R*_n(E) ≥ √((K − 1)n) / 27: for every policy π, every n, and K ≤ n + 1, there exists a K-armed Gaussian bandit ν such that

  R_n(π, ν) ≥ √((K − 1)n) / 27

UCB, with R_n = O(√(nK)), is therefore a rate-optimal policy



How to prove a minimax lower bound?

Key idea: reduce the bandit problem to a statistical hypothesis testing problem.

Select two bandit problem instances (two sets of K distributions) in such a way that the following two conditions hold simultaneously:

• Competition: a sequence of actions that is good for one bandit is not good for the other (choose two instances far away from each other).

• Similarity: the instances are 'close' enough that a policy interacting with either of the two instances cannot statistically identify the true bandit.

Lower bound: optimize the trade-off between these two opposing goals.



Minimax lower bound

Theorem  For every policy π, every n, and K ≤ n, there exists a K-armed Gaussian bandit ν such that R_n(π, ν) ≥ √((K − 1)n) / 27

Proof sketch

• Two bandits: ν = (P_i)_{i=1}^K and ν′ = (P′_i)_{i=1}^K, where P_i = N(µ_i, 1) and P′_i = N(µ′_i, 1).

• It suffices to show that for any policy π there exist µ and µ′ such that π incurs regret of order √(Kn) on at least one instance:

  max( R_n(π, ν), R_n(π, ν′) ) ≥ c√(Kn),

  or, since max(a, b) ≥ (a + b)/2,

  R_n(π, ν) + R_n(π, ν′) ≥ c√(Kn).

• Choose µ = (∆, 0, ..., 0), so that

  R_n(π, ν) = (n − E_ν[T_1(n)]) ∆

  (∆ to be optimized later)



Minimax lower bound

Theorem  For every policy π, every n, and K ≤ n, there exists a K-armed Gaussian bandit ν such that R_n(π, ν) ≥ √((K − 1)n) / 27

Proof sketch

• Choose µ = (∆, 0, ..., 0), so R_n(π, ν) = (n − E_ν[T_1(n)]) ∆

• Let i = argmin_{j>1} E_ν[T_j(n)] (the arm explored the least); then E_ν[T_i(n)] ≤ n/(K − 1)

• Let µ′ = (∆, 0, ..., 0, 2∆, 0, ..., 0) (2∆ at the i-th arm, which is now the optimal arm):

  R_n(π, ν′) = ∆ E_ν′[T_1(n)] + ∑_{j ≠ 1, i} 2∆ E_ν′[T_j(n)] ≥ ∆ E_ν′[T_1(n)].

• Distinguish the cases T_1(n) ≤ n/2 and T_1(n) ≥ n/2:

  • On the event T_1(n) ≤ n/2, instance ν incurs regret at least n∆/2; therefore R_n(π, ν) ≥ P_ν(T_1(n) ≤ n/2) · n∆/2

  • On the event T_1(n) ≥ n/2, R_n(π, ν′) ≥ ∆ T_1(n) ≥ n∆/2; therefore R_n(π, ν′) ≥ P_ν′(T_1(n) ≥ n/2) · n∆/2



Minimax lower bound

  R_n(π, ν) + R_n(π, ν′) ≥ (n∆/2) ( P_ν(T_1(n) ≤ n/2) + P_ν′(T_1(n) ≥ n/2) )

Need to show that P_ν(T_1(n) ≤ n/2) + P_ν′(T_1(n) ≥ n/2) is large!

Theorem (Bretagnolle–Huber inequality, a Pinsker-type inequality)  For any two distributions P, Q and any event A:

  P(A) + Q(A^c) ≥ (1/2) exp( −D(P, Q) ),

where D(P, Q) = ∫ p log(p/q) is the Kullback–Leibler (KL) divergence.

Intuition: when P is close to Q, P(A) + Q(A^c) should be large (note P(A) + P(A^c) = 1)

Exercise:

  D( N(µ_1, σ²), N(µ_2, σ²) ) = (µ_1 − µ_2)² / (2σ²)
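A worked sketch of the exercise (a standard computation; it uses E[X] = µ_1 under P = N(µ_1, σ²)):

```latex
\begin{align*}
D\big(N(\mu_1,\sigma^2),\,N(\mu_2,\sigma^2)\big)
 &= \mathbb{E}_{X\sim N(\mu_1,\sigma^2)}\!\left[\log\frac{p(X)}{q(X)}\right]
  = \mathbb{E}\!\left[\frac{(X-\mu_2)^2-(X-\mu_1)^2}{2\sigma^2}\right]\\
 &= \mathbb{E}\!\left[\frac{2X(\mu_1-\mu_2)+\mu_2^2-\mu_1^2}{2\sigma^2}\right]
  = \frac{2\mu_1(\mu_1-\mu_2)+\mu_2^2-\mu_1^2}{2\sigma^2}
  = \frac{(\mu_1-\mu_2)^2}{2\sigma^2}.
\end{align*}
```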



Minimax lower bound

  R_n(π, ν) + R_n(π, ν′) ≥ (n∆/2) ( P_ν(T_1(n) ≤ n/2) + P_ν′(T_1(n) ≥ n/2) )
                         ≥ (n∆/4) exp( −D(P_ν, P_ν′) )

Theorem (divergence decomposition)

  D(P_ν, P_ν′) = ∑_{j=1}^K E_ν[T_j(n)] D(P_j, P′_j)

Recall µ = (∆, 0, ..., 0) and µ′ = (∆, 0, ..., 0, 2∆, 0, ..., 0) (only the i-th entry differs):

  D(P_ν, P_ν′) = E_ν[T_i(n)] D( N(0, 1), N(2∆, 1) ) = E_ν[T_i(n)] (2∆)²/2 ≤ 2n∆² / (K − 1).

Hence

  R_n(π, ν) + R_n(π, ν′) ≥ (n∆/4) exp( −2n∆²/(K − 1) ),

and choose ∆ = √( (K − 1)/(4n) ).



Worst case lower bound

Theorem  For every policy π, every n, and K ≤ n + 1, there exists a K-armed Gaussian bandit ν such that

  R_n(π, ν) ≥ √((K − 1)n) / 27


What else is there?

• All kinds of variants of UCB for different noise models: Bernoulli, exponential families, heavy tails, Gaussian with unknown mean and variance, ...

• Thompson sampling: each round, sample a mean from the posterior of each arm, and choose the arm with the largest sample

• All manner of twists on the setup: non-stationarity, delayed rewards, playing multiple arms each round, moving beyond expected regret (high-probability bounds)



The adversarial viewpoint

• Replace random rewards with an adversary

• At the start of the game the adversary secretly chooses losses $\ell_1, \ell_2, \ldots, \ell_n$ where $\ell_t \in [0,1]^K$

• The learner chooses actions $A_t$, then observes and suffers the loss $\ell_{t,A_t}$

• Regret is

$$R_n = \underbrace{\mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t}\Big]}_{\text{learner's loss}} - \underbrace{\min_i \sum_{t=1}^n \ell_{t,i}}_{\text{loss of best arm}}$$

• Mission: make the regret small, regardless of the adversary

• There exists an algorithm such that $R_n \le 2\sqrt{Kn}$

50 / 86
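To make the protocol concrete, here is a minimal simulation of the game and the regret computation (a sketch only; the uniform losses and the uniformly random learner are placeholder choices, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 1000, 3

# The adversary fixes all losses up front: an n x K matrix with entries in [0, 1].
losses = rng.uniform(0.0, 1.0, size=(n, K))

# A deliberately naive learner that picks an arm uniformly at random each round.
actions = rng.integers(0, K, size=n)
learner_loss = losses[np.arange(n), actions].sum()

# The regret compares against the single best arm in hindsight.
best_arm_loss = losses.sum(axis=0).min()
print("regret:", learner_loss - best_arm_loss)
```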



Why this regret definition?

• The regret is with respect to the loss of the best arm

$$R_n = \underbrace{\mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t}\Big]}_{\text{learner's loss}} - \underbrace{\min_i \sum_{t=1}^n \ell_{t,i}}_{\text{loss of best arm}}$$

• The following alternative objective is hopeless:

$$R'_n = \underbrace{\mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t}\Big]}_{\text{learner's loss}} - \underbrace{\sum_{t=1}^n \min_i \ell_{t,i}}_{\text{loss of best sequence}}$$

• Against this comparator the regret is at least $cn$ for some constant $c > 0$. For example, with two arms whose losses are

$$\ell = \begin{pmatrix} 1 \cdots 1 & 0 \cdots 0 \\ 0 \cdots 0 & 1 \cdots 1 \end{pmatrix}$$

the best action sequence suffers zero loss, while a learner that does not know the switch point in advance cannot avoid losses of order $n$.

51 / 86



Tackling the adversarial bandit

• Randomisation is crucial in adversarial bandits

• The learner chooses a distribution $P_t$ over the $K$ actions

• Samples $A_t \sim P_t$

• Observes the loss $\ell_{t,A_t}$

• Expected regret is

$$R_n = \max_i \mathbb{E}\Big[\sum_{t=1}^n (\ell_{t,A_t} - \ell_{t,i})\Big] = \max_{p \in \Delta_K} \mathbb{E}\Big[\sum_{t=1}^n \langle P_t - p, \ell_t\rangle\Big]$$

• How to choose $P_t$?

• Consider a simpler setting first: the learner chooses the action $A_t$ and the entire vector $\ell_t$ is observed (instead of only $\ell_{t,A_t}$)

• This is online convex optimization with a linear loss

52 / 86



Online convex optimisation (linear losses)

• The domain $K \subset \mathbb{R}^d$ is a convex set

• The adversary secretly chooses $\ell_1, \ldots, \ell_n \in K^\circ = \{u : \sup_{x\in K} |\langle x, u\rangle| \le 1\}$ (the polar set)

• At each time $t$, the learner chooses $x_t \in K$

• Suffers loss $\langle x_t, \ell_t\rangle$

• The regret with respect to the best $x \in K$ is

$$R_n(x) = \sum_{t=1}^n \langle x_t - x, \ell_t\rangle .$$

More general online convex optimization

• The learner chooses $x_t \in K$

• The adversary chooses a convex $f_t : K \to \mathbb{R}$

• The loss in round $t$ is $f_t(x_t)$ and the regret is

$$R_n(x) = \sum_{t=1}^n (f_t(x_t) - f_t(x))$$

• Linear losses are the special case $f_t(x) = \langle x, \ell_t\rangle$

53 / 86



Why linear is enough?

• A convex function lies above its tangents (the linearisation on the next slide)

• The sum of convex functions is convex

• A strictly convex function has a unique minimizer

54 / 86



Linearisation of a convex function

$$f_t(x) \ge f_t(x_t) + \langle x - x_t, \nabla f_t(x_t)\rangle$$

[Figure: a convex curve lying above its tangent line at $x_t$, evaluated at another point $x$]

Rearranging, $f_t(x_t) - f_t(x) \le \langle x_t - x, \nabla f_t(x_t)\rangle$

55 / 86


Why linear is enough?

• The regret is bounded by

$$R_n(x) = \sum_{t=1}^n (f_t(x_t) - f_t(x)) \le \sum_{t=1}^n \langle x_t - x, \nabla f_t(x_t)\rangle$$

• A reduction from nonlinear to linear losses

• Only uses first-order information (the gradient)

• Linear losses from now on: $f_t(x) = \langle x, \ell_t\rangle$

• Think of $\ell_t = \nabla f_t(x_t)$ for a general convex loss function

56 / 86



Online convex optimisation (linear losses)

• The adversary secretly chooses $\ell_1, \ldots, \ell_n \in K^\circ = \{u : \sup_{x\in K}|\langle x,u\rangle| \le 1\}$ (polar)

• The learner chooses $x_t \in K$

• Suffers loss $\langle x_t, \ell_t\rangle$ and the regret with respect to $x \in K$ is

$$R_n(x) = \sum_{t=1}^n \langle x_t - x, \ell_t\rangle .$$

• How to choose $x_t$? The simplest idea is 'follow-the-leader':

$$x_{t+1} = \operatorname{argmin}_{x\in K} \sum_{s=1}^t \langle x, \ell_s\rangle .$$

• Fails miserably: $K = [-1,1]$, $\ell_1 = 1/2$, $\ell_2 = -1$, $\ell_3 = 1$, $\ell_4 = -1, \ldots$

• $x_1$ arbitrary, $x_2 = -1$ ($= \operatorname{argmin}_{x\in K}\langle x, \ell_1\rangle$), $x_3 = 1$ ($= \operatorname{argmin}_{x\in K}\langle x, \ell_1+\ell_2\rangle$), $x_4 = -1, \ldots$

• So $R_n(0) = \sum_{t=1}^n \langle x_t, \ell_t\rangle \approx n$.

57 / 86
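The oscillation is easy to reproduce numerically (a minimal sketch of the counterexample above; breaking argmin ties towards $-1$ is one valid choice, and no tie actually occurs for this loss sequence):

```python
import numpy as np

n = 20
# Losses 1/2, -1, 1, -1, 1, ... as on the slide.
losses = np.array([0.5] + [(-1.0) ** t for t in range(1, n)])

x, total_loss = 0.0, 0.0  # start from x_1 = 0 (any initial point works)
for t in range(n):
    total_loss += x * losses[t]
    # Follow-the-leader on K = [-1, 1]: minimize x * (cumulative loss).
    cum = losses[: t + 1].sum()
    x = -1.0 if cum > 0 else 1.0
print("FTL cumulative loss:", total_loss)  # grows like n, while x = 0 has loss 0
```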



Follow The Regularized Leader (FTRL)

• New idea: add regularization to stabilize follow-the-leader

• Let $F$ be a strictly convex function, let $\eta > 0$ be the learning rate, and set

$$x_{t+1} = \operatorname{argmin}_{x\in K} \Big( F(x) + \eta \sum_{s=1}^t \langle x, \ell_s\rangle \Big)$$

• Different choices of $F$ lead to different algorithms.

• One clean analysis.

58 / 86



Example – Gradient descent

• $K = \mathbb{R}^d$ and $F(x) = \frac{1}{2}\|x\|_2^2$:

$$x_{t+1} = \operatorname{argmin}_{x\in K}\; \eta \sum_{s=1}^t \langle x, \ell_s\rangle + \frac{1}{2}\|x\|_2^2$$

• Differentiating and setting the gradient to zero:

$$0 = \eta \sum_{s=1}^t \ell_s + x$$

• $x_{t+1} = -\eta \sum_{s=1}^t \ell_s = x_t - \eta \ell_t$

59 / 86
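A quick numerical check of this closed form (a sketch with placeholder loss vectors): the batch FTRL solution $-\eta\sum_{s\le t}\ell_s$ coincides with running the incremental gradient update $x_t - \eta\ell_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
eta, d, n = 0.1, 4, 50
losses = rng.normal(size=(n, d))

x_gd = np.zeros(d)                    # incremental form: x_{t+1} = x_t - eta * l_t
for l in losses:
    x_gd = x_gd - eta * l

x_ftrl = -eta * losses.sum(axis=0)    # batch form: minimizer of the regularized sum
print(np.allclose(x_gd, x_ftrl))      # True
```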



A few tools

• Online convex optimization uses many tools from convex analysis

• Bregman divergence

• First-order optimality conditions

• Dual norms

60 / 86


Bregman divergence

For convex $F$, $D_F(x,y) = F(x) - F(y) - \langle \nabla F(y), x - y\rangle$

• The Bregman divergence is not a distance (it may not be symmetric, $D_F(x,y) \ne D_F(y,x)$; e.g., the KL divergence), but still $D_F(x,y) \ge 0$

• By Taylor expansion, there exists $z = \alpha x + (1-\alpha)y$ for some $\alpha \in [0,1]$ such that

$$D_F(x,y) = \frac{1}{2}(x-y)^\top \nabla^2 F(z)(x-y) = \frac{1}{2}\|x-y\|^2_{\nabla^2 F(z)}$$

• Key property: invariance under linear perturbation. For $\tilde F(x) = F(x) + \langle a, x\rangle$, $D_{\tilde F}(x,y) = D_F(x,y)$

61 / 86



Examples

• Quadratic: $F(x) = \frac{1}{2}\|x\|_2^2$ gives

$$D_F(x,y) = \frac{1}{2}\|x-y\|_2^2$$

• Neg-entropy: $F(x) = \sum_{i=1}^d (x_i \log(x_i) - x_i)$ gives

$$D_F(x,y) = \sum_{i=1}^d x_i \log\Big(\frac{x_i}{y_i}\Big) + \sum_{i=1}^d (y_i - x_i)$$

When $x, y \in \Delta_d$, where $\Delta_d = \{x \in \mathbb{R}^d : x \ge 0, \|x\|_1 = 1\}$ (the $d$-dimensional simplex, usually used to model a discrete probability distribution):

$$D_F(x,y) = \sum_{i=1}^d x_i \log\Big(\frac{x_i}{y_i}\Big)$$

62 / 86
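Both examples can be checked directly from the definition (a sketch; it assumes strictly positive vectors so the logarithms are defined):

```python
import numpy as np

def bregman(F, grad_F, x, y):
    """D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - grad_F(y) @ (x - y)

quad = lambda x: 0.5 * x @ x
quad_grad = lambda x: x
negent = lambda x: np.sum(x * np.log(x) - x)
negent_grad = lambda x: np.log(x)

x = np.array([0.2, 0.3, 0.5])  # both points lie on the simplex
y = np.array([0.4, 0.4, 0.2])
print(bregman(quad, quad_grad, x, y), 0.5 * np.sum((x - y) ** 2))     # equal
print(bregman(negent, negent_grad, x, y), np.sum(x * np.log(x / y)))  # equal: the KL divergence
```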



First order optimality condition

• Let $K$ be convex and $f : K \to \mathbb{R}$ convex and differentiable. Then

$$x^* = \operatorname{argmin}_{x\in K} f(x) \iff \langle \nabla f(x^*), x - x^*\rangle \ge 0 \quad \forall x \in K$$

• Interpretation: $f$ is increasing in the direction $x - x^*$ for all $x \in K$

63 / 86


Dual norm

Let $\|\cdot\|_t$ be a norm on $\mathbb{R}^d$; its dual norm is

$$\|z\|_{t*} = \sup\{\langle z, x\rangle : \|x\|_t \le 1\}.$$

The dual norm of $\|\cdot\|_{t*}$ is $\|\cdot\|_t$ again.

• The dual norm of $\|\cdot\|_2$ is $\|\cdot\|_2$

• The dual norm of $\|\cdot\|_1$ is $\|\cdot\|_\infty$

• The dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$ (with $\frac{1}{p} + \frac{1}{q} = 1$)

• Hölder's inequality: $\langle z, x\rangle \le \|x\|_t \|z\|_{t*}$.

64 / 86
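A small numerical sanity check of these pairings (a sketch; it uses the closed form of the dual of an $\ell_p$ norm as the $\ell_q$ norm):

```python
import numpy as np

def lp_dual_norm(z, p):
    # The dual of ||.||_p is ||.||_q with 1/p + 1/q = 1 (q = inf when p = 1).
    q = np.inf if p == 1 else p / (p - 1)
    return np.linalg.norm(z, ord=q)

z = np.array([3.0, -1.0, 2.0])
x = np.array([0.5, 0.5, 0.0])
for p in (1, 2, 4):
    # Holder's inequality: <z, x> <= ||x||_p * ||z||_q
    print(p, z @ x <= np.linalg.norm(x, ord=p) * lp_dual_norm(z, p) + 1e-12)
```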



Follow the regularized leader

• $x_{t+1} = \operatorname{argmin}_{x\in K}\big( F(x) + \eta \sum_{s=1}^t \langle x, \ell_s\rangle \big)$

• Equivalent to

$$x_{t+1} = \operatorname{argmin}_{x\in K} \big( \eta\langle x, \ell_t\rangle + D_F(x, x_t) \big) = \operatorname{argmin}_{x\in K} \big( \eta\langle x, \ell_t\rangle + F(x) - F(x_t) - \langle \nabla F(x_t), x - x_t\rangle \big)$$

• Assume the minimizer is achieved in the interior of $K$.

• The first optimization implies $\nabla F(x_{t+1}) = -\eta \sum_{s=1}^t \ell_s$

• The second optimization implies $\eta\ell_t + \nabla F(x_{t+1}) - \nabla F(x_t) = 0$ and thus $\nabla F(x_{t+1}) = -\eta\ell_t + \nabla F(x_t) = -\eta\sum_{s=1}^t \ell_s + \underbrace{\nabla F(x_1)}_{0} = -\eta\sum_{s=1}^t \ell_s$.

65 / 86



Regret Analysis: Follow the regularized leader

Theorem. For any fixed action $x$, the regret of follow the regularized leader satisfies

$$R_n(x) := \sum_{t=1}^n \langle x_t - x, \ell_t\rangle \le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=1}^n \Big( \langle x_t - x_{t+1}, \ell_t\rangle - \frac{1}{\eta} D_F(x_{t+1}, x_t) \Big) \le \frac{F(x) - F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \|\ell_t\|_{t*}^2$$

Here $z_t \in [x_t, x_{t+1}]$ is such that $D_F(x_{t+1}, x_t) = \frac{1}{2}\|x_t - x_{t+1}\|^2_{\nabla^2 F(z_t)}$, with $\|\cdot\|_t = \|\cdot\|_{\nabla^2 F(z_t)}$ and $\|\cdot\|_{t*}$ the dual norm of $\|\cdot\|_t$.

Choosing any $\|\cdot\|_t$ such that $D_F(x_{t+1}, x_t) \ge \frac{1}{2}\|x_t - x_{t+1}\|_t^2$ is also valid.

66 / 86



Regret Analysis: Follow the regularized leader

Theorem. For any fixed action $x$, the regret of follow the regularized leader satisfies

$$R_n(x) \le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=1}^n \Big( \langle x_t - x_{t+1}, \ell_t\rangle - \frac{1}{\eta} D_F(x_{t+1}, x_t) \Big) \le \frac{F(x) - F(x_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \|\ell_t\|_{t*}^2$$

with $D_F(x_{t+1}, x_t) = \frac{1}{2}\|x_t - x_{t+1}\|^2_{\nabla^2 F(z_t)}$ and $\|\cdot\|_t = \|\cdot\|_{\nabla^2 F(z_t)}$.

Proof of the second inequality:

$$\langle x_t - x_{t+1}, \ell_t\rangle - \frac{D_F(x_{t+1}, x_t)}{\eta} \le \|\ell_t\|_{t*} \|x_t - x_{t+1}\|_t - \frac{D_F(x_{t+1}, x_t)}{\eta} = \|\ell_t\|_{t*}\sqrt{2 D_F(x_{t+1}, x_t)} - \frac{D_F(x_{t+1}, x_t)}{\eta} \le \frac{\eta}{2}\|\ell_t\|_{t*}^2 ,$$

where the last inequality uses $ax - bx^2/2 \le a^2/(2b)$ for any $b > 0$, with $a = \|\ell_t\|_{t*}$, $x = \sqrt{2 D_F(x_{t+1}, x_t)}$, and $b = 1/\eta$.

67 / 86
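The elementary inequality in the last step follows by maximizing the left-hand side over $x$ (a filled-in step, not spelled out on the slide):

```latex
\text{For } b > 0:\quad
\sup_{x\in\mathbb{R}}\Big(ax - \frac{bx^2}{2}\Big) = \frac{a^2}{2b},
\quad\text{attained at } x = \frac{a}{b}.
\text{With } a = \|\ell_t\|_{t*} \text{ and } b = \frac{1}{\eta}:\quad
\|\ell_t\|_{t*}\,x - \frac{x^2}{2\eta} \le \frac{\eta}{2}\|\ell_t\|_{t*}^2 .
```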


FTRL analysis

• Proof of the first inequality:

$$R_n(x) \le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=1}^n \Big( \langle x_t - x_{t+1}, \ell_t\rangle - \frac{1}{\eta} D_F(x_{t+1}, x_t) \Big)$$

• Rewriting the regret:

$$R_n(x) = \sum_{t=1}^n \langle x_t - x, \ell_t\rangle = \sum_{t=1}^n \langle x_t - x_{t+1}, \ell_t\rangle + \sum_{t=1}^n \langle x_{t+1} - x, \ell_t\rangle$$

• Goal: show that

$$\sum_{t=1}^n \langle x_{t+1} - x, \ell_t\rangle \le \frac{F(x) - F(x_1)}{\eta} - \sum_{t=1}^n \frac{1}{\eta} D_F(x_{t+1}, x_t)$$

68 / 86



FTRL analysis

• Potential function: $\Phi_t(x) = \frac{F(x)}{\eta} + \sum_{s=1}^t \langle x, \ell_s\rangle$

• By FTRL, $x_{t+1}$ minimizes $\Phi_t$ over $K$

• Then

$$\sum_{t=1}^n \langle x_{t+1} - x, \ell_t\rangle = \sum_{t=1}^n \langle x_{t+1}, \ell_t\rangle - \underbrace{\Big(\sum_{t=1}^n \langle x, \ell_t\rangle + \frac{F(x)}{\eta}\Big)}_{\Phi_n(x)} + \frac{F(x)}{\eta} = \sum_{t=1}^n \big(\Phi_t(x_{t+1}) - \Phi_{t-1}(x_{t+1})\big) - \Phi_n(x) + \frac{F(x)}{\eta},$$

using $\Phi_t(x_{t+1}) - \Phi_{t-1}(x_{t+1}) = \big(\frac{F(x_{t+1})}{\eta} + \sum_{s=1}^t \langle x_{t+1}, \ell_s\rangle\big) - \big(\frac{F(x_{t+1})}{\eta} + \sum_{s=1}^{t-1} \langle x_{t+1}, \ell_s\rangle\big) = \langle x_{t+1}, \ell_t\rangle$.

69 / 86


FTRL analysis

Potential function: $\Phi_t(x) = \frac{F(x)}{\eta} + \sum_{s=1}^t \langle x, \ell_s\rangle$ (so $\Phi_0(x) = \frac{F(x)}{\eta}$).

Then using (1) $x_{t+1} = \operatorname{argmin}_x \Phi_t(x)$ and (2) $D_{\Phi_t}(\cdot,\cdot) = \frac{1}{\eta} D_F(\cdot,\cdot)$ (the Bregman divergence is unchanged by adding linear functions):

$$\sum_{t=1}^n \langle x_{t+1} - x, \ell_t\rangle = \frac{F(x)}{\eta} + \sum_{t=1}^n \big(\Phi_t(x_{t+1}) - \Phi_{t-1}(x_{t+1})\big) - \Phi_n(x)$$
$$= \frac{F(x)}{\eta} - \Phi_0(x_1) + \underbrace{\Phi_n(x_{n+1}) - \Phi_n(x)}_{\le 0:\ x_{n+1} = \operatorname{argmin}_x \Phi_n(x)} + \sum_{t=0}^{n-1} \big(\Phi_t(x_{t+1}) - \Phi_t(x_{t+2})\big)$$
$$\le \frac{F(x) - F(x_1)}{\eta} + \sum_{t=0}^{n-1} \big(\Phi_t(x_{t+1}) - \Phi_t(x_{t+2})\big)$$
$$= \frac{F(x) - F(x_1)}{\eta} - \sum_{t=0}^{n-1} \Big( D_{\Phi_t}(x_{t+2}, x_{t+1}) + \underbrace{\langle \nabla\Phi_t(x_{t+1}), x_{t+2} - x_{t+1}\rangle}_{\ge 0} \Big)$$
$$\le \frac{F(x) - F(x_1)}{\eta} - \frac{1}{\eta}\sum_{t=1}^n D_F(x_{t+1}, x_t),$$

where $D_{\Phi_t}(x_{t+2}, x_{t+1}) = \Phi_t(x_{t+2}) - \Phi_t(x_{t+1}) - \langle \nabla\Phi_t(x_{t+1}), x_{t+2} - x_{t+1}\rangle$.

70 / 86


Final form of the regret

$$R_n(x) \le \frac{F(x) - F(x_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_{t*}^2 \le \frac{\operatorname{diam}_F(K)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_{t*}^2 ,$$

where $\operatorname{diam}_F(K) := \max_{a,b\in K} F(a) - F(b)$.

• The regret depends on the distance from the starting point to the optimum

• The learning rate needs careful tuning

71 / 86



Application 1: Online gradient descent

Assume $K = \{x : \|x\|_2 \le 1\}$ and $\ell_t \in K$ (so $|\langle x, \ell_t\rangle| \le 1$).

Choose $F(x) = \frac{1}{2}\|x\|_2^2$, so $\operatorname{diam}_F(K) := \max_{a,b\in K} F(a) - F(b) = \frac{1}{2}$ and

$$D_F(x,y) = \frac{1}{2}\|x-y\|_2^2$$

FTRL:

$$x_{t+1} = \operatorname{argmin}_{x\in K}\; \eta\sum_{s=1}^t \langle x, \ell_s\rangle + \frac{1}{2}\|x\|_2^2$$

Then, choosing $\eta = \sqrt{1/n}$:

$$R_n(x) \le \frac{\operatorname{diam}_F(K)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_2^2 \le \frac{1}{2\eta} + \frac{\eta n}{2} \le \sqrt{n}$$

72 / 86



Application 2: Exponential weights

Assume $K = \Delta_d := \{x \ge 0 : \|x\|_1 = 1\}$ and $\ell_t \in [0,1]^d$ for all $t$ (so $|\langle x, \ell_t\rangle| \le 1$).

$$F(x) = \sum_{i=1}^d (x_i\log(x_i) - x_i)$$

$\operatorname{diam}_F(K) := \max_{a,b\in K} F(a) - F(b) = \log(d)$: on the simplex, $\max_{x\in K} F(x) = -1$ (achieved at a vertex such as $(1,0,\ldots,0)$) and $\min_{x\in K} F(x) = -\log(d) - 1$ (achieved at $(1/d,\ldots,1/d)$).

Bregman divergence:

$$D_F(x,y) = \sum_{i=1}^d x_i\log\Big(\frac{x_i}{y_i}\Big) \quad \text{(KL divergence)} \quad \ge\; \frac{1}{2}\|x-y\|_1^2 \quad \text{(Pinsker's inequality (exercise))}$$

73 / 86



Application 2: Exponential weights

Assume $K = \Delta_d := \{x \ge 0 : \|x\|_1 = 1\}$ and $\ell_t \in [0,1]^d$ for all $t$.

FTRL:

$$x_{t+1} = \operatorname{argmin}_{x\in K}\; \eta\sum_{s=1}^t \langle x, \ell_s\rangle + F(x)$$

The optimal comparator is a standard basis vector $e_i$ with $i = \operatorname{argmin}_{j=1,\ldots,d} \sum_{t=1}^n \ell_{t,j}$ (a linear function on the simplex is minimized at a vertex, and $F(e_i) = -1$).

$$R_n(x) \le \frac{\operatorname{diam}_F(K)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_{t*}^2 \le \frac{\log(d)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\ell_t\|_\infty^2 \le \frac{\log(d)}{\eta} + \frac{\eta n}{2} \le \sqrt{2n\log(d)}$$

with $\eta = \sqrt{2\log(d)/n}$.

74 / 86
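On the simplex the FTRL iterate has a closed form: exponentially weighted averages. A minimal full-information sketch (the random loss matrix is a placeholder, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 5
losses = rng.uniform(0.0, 1.0, size=(n, d))
eta = np.sqrt(2 * np.log(d) / n)

cum = np.zeros(d)  # cumulative losses
total = 0.0
for t in range(n):
    w = np.exp(-eta * (cum - cum.min()))  # subtract the min for numerical stability
    x = w / w.sum()                       # FTRL iterate = exponential weights
    total += x @ losses[t]
    cum += losses[t]

regret = total - cum.min()  # against the best single coordinate
print(regret, "<=", np.sqrt(2 * n * np.log(d)))
```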



Our Goal: Adversarial bandits

• At the start of the game the adversary secretly chooses losses $\ell_1, \ldots, \ell_n$ with $\ell_t \in [0,1]^K$

• In each round the learner draws an arm $A_t \in \{1,\ldots,K\}$ from some distribution $P_t$

• Suffers the loss $\ell_{t,A_t}$ (and observes only $\ell_{t,A_t}$)

• Regret is $R_n = \max_a \mathbb{E}\big[\sum_{t=1}^n (\ell_{t,A_t} - \ell_{t,a})\big]$

• Surprising result: there exists an algorithm such that $R_n \le \sqrt{2nK\log(K)}$ for any adversary. How?

• Key idea:
  • Construct an estimator $\hat\ell_t = (\hat\ell_{t,1}, \ldots, \hat\ell_{t,K})$ of the entire loss vector $\ell_t = (\ell_{t,1}, \ldots, \ell_{t,K})$
  • Apply follow the regularized leader (FTRL) to the estimated losses

75 / 86



Importance-weighted estimators

At time $t$, the algorithm chooses arm $A_t = i$ with probability $P_{ti}$ (we specify $P_{ti}$ later). Define the estimator of $\ell_{t,i}$:

$$\hat\ell_{t,i} = \frac{\ell_{t,i}\,\mathbb{1}(A_t = i)}{P_{ti}}$$

and $\hat\ell_t = (\hat\ell_{t,1}, \ldots, \hat\ell_{t,K})$.

Unbiased estimator:

$$\mathbb{E}\big[\hat\ell_{t,i} \mid P_t\big] = \frac{\ell_{t,i}}{P_{ti}}\,\mathbb{E}[\mathbb{1}(A_t = i) \mid P_t] = \frac{\ell_{t,i}}{P_{ti}}\,P_{ti} = \ell_{t,i}$$

Second moment: $\mathbb{E}\big[\hat\ell_{t,i}^2 \mid P_t\big] = \dfrac{\ell_{t,i}^2}{P_{ti}}$

76 / 86
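A quick simulation confirms both moments (a sketch; the distribution $P_t$ and the loss vector are fixed placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([0.5, 0.3, 0.2])    # P_t
ell = np.array([0.9, 0.1, 0.4])  # true losses in round t
m = 200_000                      # number of independent draws of A_t

A = rng.choice(3, size=m, p=P)
est = np.zeros((m, 3))
est[np.arange(m), A] = ell[A] / P[A]  # importance-weighted estimates

print(est.mean(axis=0), "vs", ell)                  # ~ ell: unbiased
print((est ** 2).mean(axis=0), "vs", ell ** 2 / P)  # second moment ~ ell^2 / P
```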


Follow the regularized leader for bandits (EXP3)

• Estimate $\ell_t$ with the unbiased importance-weighted estimator $\hat\ell_t$:

$$\hat\ell_{t,i} = \frac{\mathbb{1}(A_t = i)\,\ell_{t,i}}{P_{ti}}$$

• Then the expected regret satisfies

$$\mathbb{E}[R_n] = \max_i \mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t} - \ell_{t,i}\Big] = \max_i \mathbb{E}\Big[\sum_{t=1}^n \langle P_t - e_i, \hat\ell_t\rangle\Big]$$

This is because

• $\mathbb{E}(\langle P_t, \hat\ell_t\rangle) = \mathbb{E}(\langle P_t, \ell_t\rangle) = \mathbb{E}\big(\sum_{i=1}^K P_{ti}\,\ell_{t,i}\big) = \mathbb{E}\big(\sum_{i=1}^K \mathbb{1}(A_t = i)\,\ell_{t,i}\big) = \mathbb{E}(\ell_{t,A_t})$

• $\mathbb{E}(\langle e_i, \hat\ell_t\rangle) = \mathbb{E}(\hat\ell_{t,i}) = \ell_{t,i}$.

77 / 86


Follow the regularized leader for bandits (EXP3)

• The expected regret satisfies

$$\mathbb{E}[R_n] = \max_i \mathbb{E}\Big[\sum_{t=1}^n \ell_{t,A_t} - \ell_{t,i}\Big] = \max_i \mathbb{E}\Big[\sum_{t=1}^n \langle P_t - e_i, \hat\ell_t\rangle\Big]$$

• FTRL:

$$P_t = \operatorname{argmin}_{p\in\Delta_K}\; \frac{F(p)}{\eta} + \sum_{s=1}^{t-1} \langle p, \hat\ell_s\rangle$$

• Since the domain is $\Delta_K$, choose the negentropy $F$:

$$F(p) = \sum_{i=1}^K p_i\log(p_i) - p_i$$

• Using the Lagrange dual, the probability of choosing each arm $i$ is

$$P_{ti} = \frac{\exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,i}\big)}{\sum_{j=1}^K \exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,j}\big)}$$

78 / 86
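Combining the importance-weighted estimator with this exponential-weights update gives the full EXP3 loop (a minimal sketch; the loss matrix and the tweak making arm 0 best on average are placeholder choices):

```python
import numpy as np

def exp3(losses, eta, rng):
    """EXP3: exponential weights over importance-weighted loss estimates."""
    n, K = losses.shape
    cum_hat = np.zeros(K)  # cumulative estimated losses
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (cum_hat - cum_hat.min()))  # stabilized weights
        P = w / w.sum()                               # P_t
        a = rng.choice(K, p=P)                        # A_t ~ P_t
        total += losses[t, a]                         # suffer (and only observe) this loss
        cum_hat[a] += losses[t, a] / P[a]             # importance-weighted update
    return total

rng = np.random.default_rng(4)
n, K = 5000, 4
losses = rng.uniform(0.0, 1.0, size=(n, K))
losses[:, 0] *= 0.8  # make arm 0 slightly better on average
eta = np.sqrt(2 * np.log(K) / (n * K))
regret = exp3(losses, eta, rng) - losses.sum(axis=0).min()
print(regret, "vs bound", np.sqrt(2 * n * K * np.log(K)))  # the bound holds in expectation
```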



Follow the regularized leader for bandits (EXP3 Algo)

• Using the FTRL regret bound:

$$\mathbb{E}[R_n] \le \mathbb{E}\Big[\frac{F(e_i) - F(P_1)}{\eta} + \frac{\eta}{2}\sum_{t=1}^n \|\hat\ell_t\|_{t*}^2\Big] \le \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \|\hat\ell_t\|_{t*}^2\Big] \qquad (F(e_i) - F(P_1) \le \log(K))$$

where $i$ is the best arm.

• How to bound $\|\hat\ell_t\|_{t*}^2$?

79 / 86



Follow the regularized leader for bandits

• How to bound $\|\hat\ell_t\|_{t*}^2$?

• Recall: $z_t \in [x_t, x_{t+1}]$ is such that $D_F(x_{t+1}, x_t) = \frac{1}{2}\|x_t - x_{t+1}\|^2_{\nabla^2 F(z_t)}$, with $\|\cdot\|_t = \|\cdot\|_{\nabla^2 F(z_t)}$ and $\|\cdot\|_{t*}$ its dual norm.

• For $F(p) = \sum_{i=1}^K p_i\log(p_i) - p_i$,

$$\nabla^2 F(p) = \operatorname{diag}(1/p) \implies \|\hat\ell_t\|_{t*}^2 = \|\hat\ell_t\|^2_{\nabla^2 F(p)^{-1}} = \sum_{i=1}^K p_i\,\hat\ell_{t,i}^2 ,$$

for some $p \in [P_t, P_{t+1}]$.

80 / 86



Follow the regularized leader for bandits

• How to bound $\|\hat\ell_t\|_{t*}^2$?

• $\|\hat\ell_t\|_{t*}^2 = \|\hat\ell_t\|^2_{\nabla^2 F(p)^{-1}} = \sum_{i=1}^K p_i\,\hat\ell_{t,i}^2$ for some $p \in [P_t, P_{t+1}]$

• $\hat\ell_{t,i} = \mathbb{1}(A_t = i)\,\ell_{t,i}/P_{ti}$ is non-negative and $\hat\ell_{t,j} = 0$ for $j \ne A_t$, so

$$\|\hat\ell_t\|_{t*}^2 = p_{A_t}\,\hat\ell_{t,A_t}^2$$

• Further, writing $\alpha_j = \exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,j}\big)$, we have $P_{t,A_t} = \frac{\alpha_{A_t}}{\sum_{j=1}^K \alpha_j}$ and

$$P_{t+1,A_t} = \frac{\exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,A_t}\big)\exp(-\eta\hat\ell_{t,A_t})}{\sum_{j=1}^K \exp\big(-\eta\sum_{s=1}^{t-1}\hat\ell_{s,j}\big)\exp(-\eta\hat\ell_{t,j})} = \frac{\alpha_{A_t}}{\alpha_{A_t} + \sum_{j\ne A_t}\alpha_j\exp(\eta\hat\ell_{t,A_t})}$$

Since $\exp(\eta\hat\ell_{t,A_t}) \ge 1$, we get $P_{t+1,A_t} \le P_{t,A_t}$, hence $p_{A_t} \le P_{t,A_t}$ and

$$\|\hat\ell_t\|_{t*}^2 = \sum_{j=1}^K p_j\,\hat\ell_{t,j}^2 \le P_{t,A_t}\,\hat\ell_{t,A_t}^2$$

81 / 86



Putting everything together: regret bound for follow the regularized leader for bandits

$$\mathbb{E}[R_n] \le \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \|\hat\ell_t\|_{t*}^2\Big]$$
$$\le \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n P_{t,A_t}\,\hat\ell_{t,A_t}^2\Big]$$
$$= \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \frac{\ell_{t,A_t}^2}{P_{t,A_t}}\Big]$$
$$\le \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \frac{1}{P_{t,A_t}}\Big] \qquad (\ell_t \in [0,1]^K)$$
$$= \frac{\log(K)}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^n \sum_{i=1}^K P_{ti}\cdot\frac{1}{P_{ti}}\Big]$$
$$= \frac{\log(K)}{\eta} + \frac{\eta nK}{2} \le \sqrt{2nK\log(K)} \qquad (\eta = \sqrt{2\log(K)/(nK)})$$

82 / 86
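The learning-rate choice used here (and in the earlier applications) comes from balancing the two terms of the bound (a filled-in step):

```latex
\min_{\eta > 0}\Big(\frac{A}{\eta} + \frac{\eta B}{2}\Big) = \sqrt{2AB},
\qquad\text{attained at } \eta = \sqrt{2A/B}.
\text{Here } A = \log(K),\ B = nK \ \Rightarrow\
\eta = \sqrt{2\log(K)/(nK)},\quad
\mathbb{E}[R_n] \le \sqrt{2nK\log(K)}.
```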


Historical notes

• The first paper on bandits is by Thompson (1933). He proposed an algorithm for two-armed Bernoulli bandits and hand-ran some simulations (Thompson sampling)

• Popularized enormously by Robbins (1952)

• Confidence bounds were first used by Lai and Robbins (1985) to derive an asymptotically optimal algorithm

• UCB by Katehakis and Robbins (1995) and Agrawal (1995). Finite-time analysis by Auer et al. (2002)

• Adversarial bandits: Auer et al. (1995)

• Minimax optimal algorithm by Audibert and Bubeck (2009)

83 / 86


Resources

• Online notes: http://banditalgs.com

• The book "Bandit Algorithms" by Tor Lattimore and Csaba Szepesvari: https://tor-lattimore.com/downloads/book/book.pdf

• Book by Bubeck and Cesa-Bianchi (2012)

• Book by Cesa-Bianchi and Lugosi (2006)

• The Bayesian books by Gittins et al. (2011) and Berry and Fristedt (1985). Both worth reading.

• Notes by Aleksandrs Slivkins: http://slivkins.com/work/MAB-book.pdf

84 / 86


References I

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078.

Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (COLT), pages 217–226.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE.

Berry, D. and Fristedt, B. (1985). Bandit problems: sequential allocation of experiments. Chapman and Hall, London; New York.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated.

Bush, R. R. and Mosteller, F. (1953). A stochastic model with applications to learning. The Annals of Mathematical Statistics, pages 559–585.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.

Gittins, J., Glazebrook, K., and Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley & Sons.

85 / 86


References II

Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.

Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

86 / 86