
1

Bayesian Incentive-Compatible Bandit Exploration

Vasilis Syrgkanis

Microsoft Research, NYC

Joint with: Yishay Mansour (Microsoft Research and Tel-Aviv Univ.)

Aleksandrs Slivkins (Microsoft Research)


2

Exploration-Exploitation in Recommendation Systems


3

Exploration problem

Prior bias of users leads to lack of exploration

Can miss good options that a priori seem inferior (“hidden gems”)

System needs to “incentivize” exploration

This talk: incentivizing exploration through information asymmetry


4

Bayesian Incentive-Compatible Bandit Model

T users arrive sequentially

Each user can take one of K actions

Each action $i$ has a mean reward $\mu_i$

Common prior belief on each $\mu_i$

Realized reward $r_i^t$: a noisy version of $\mu_i$

At each time step the system recommends an action $I^t$

The user reports the realized reward $r_{I^t}^t$

[Diagram: a recommendation algorithm mediates between users $1, \dots, t, \dots, T$ and actions $1, \dots, i, \dots, K$; the prior on the mean rewards has prior means $\mu_1^0, \dots, \mu_K^0$ with $\mu_i^0 = \mathbb{E}[\mu_i]$; realized rewards are noisy; at step $t$ the algorithm recommends action $I^t$ and observes $r_{I^t}^t$.]
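To make the interaction protocol concrete, here is a minimal Python sketch of one run; the Beta prior, Bernoulli rewards, and the trivial recommend policy are illustrative assumptions, not part of the slides:

import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 1000                          # number of actions and number of users
prior = [(2, 1), (1, 1), (1, 2)]        # illustrative Beta prior on each mu_i
mu = np.array([rng.beta(a, b) for a, b in prior])   # latent mean rewards drawn from the prior

def recommend(history):
    # Placeholder policy: always recommend the arm with the highest prior mean.
    prior_means = [a / (a + b) for a, b in prior]
    return int(np.argmax(prior_means))

history = []                            # (arm, realized reward) pairs; seen only by the system
for t in range(T):
    arm = recommend(history)            # the system privately recommends action I^t to user t
    reward = rng.binomial(1, mu[arm])   # the user plays it and experiences a noisy reward
    history.append((arm, reward))       # the user reports the realized reward back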


5

System's Objective: Asymptotically Small Regret

[Same setup diagram as on the previous slide.]

Ex-post regret: $R(T) = T \cdot \max_i \mu_i - \sum_{t=1}^{T} \mu_{I^t}$

Bayesian regret: $\mathbb{E}[R(T)]$, where the expectation is also taken over the prior on the mean rewards


6

So far equivalent to Stochastic i.i.d. Multi-armed Bandit Model:

$O(\sqrt{T})$ regret achievable
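For reference, regret of this order (up to logarithmic and $K$-dependent factors) is attained by standard index policies such as UCB1; below is a minimal self-contained sketch, an illustration rather than part of the slides, assuming Bernoulli arms:

import numpy as np

def ucb1_regret(mu, T, seed=0):
    # Run UCB1 on Bernoulli arms with means `mu` and return the ex-post regret.
    rng = np.random.default_rng(seed)
    K = len(mu)
    counts = np.zeros(K)
    sums = np.zeros(K)
    pulled_mean_total = 0.0
    for t in range(T):
        if t < K:
            arm = t                                    # pull every arm once to initialize
        else:
            means = sums / counts
            bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
            arm = int(np.argmax(means + bonus))        # optimism in the face of uncertainty
        r = rng.binomial(1, mu[arm])
        counts[arm] += 1
        sums[arm] += r
        pulled_mean_total += mu[arm]
    return T * max(mu) - pulled_mean_total             # R(T) = T * max_i mu_i - sum_t mu_{I^t}

print(ucb1_regret(mu=[0.6, 0.5, 0.3], T=5000))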


7

Incentive Compatibility (IC)

Playing the recommended action has expected utility at least as high as any other action
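One way to state this constraint formally, conditioning only on the user's own recommendation (which is all the user observes), is:

$$\mathbb{E}\left[\,\mu_i - \mu_j \mid I^t = i\,\right] \;\ge\; 0 \qquad \text{for all users } t \text{ and actions } i, j \text{ with } \Pr[I^t = i] > 0.$$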

e.g., the first user can only be asked to take action 1, the a priori best arm

If users observe everything, they will only take the posterior-better action given the previous rewards; this cannot guarantee exploration


8

Information Asymmetry

Users do not observe rewards or recommendations of previous users

Users are unaware whether the rewards of previous steps have made a priori better arms look worse than a priori worse arms


9

Main question

Is $O(\sqrt{T})$ regret achievable under the incentive-compatibility constraint?


10

Main Results: Bayesian Regret

Black-box reduction: converts any bandit algorithm into an incentive-compatible one (prior-dependent constant blow-up in Bayesian regret)

Implies incentive-compatible algorithms with $O(\sqrt{T})$ Bayesian regret

$T$ steps of any algorithm can be simulated in an incentive-compatible manner within a prior-dependent constant factor more time steps

Average expected reward as high as that of the original algorithm

Enables modular design of IC recommendation systems


11

Main Results: Ex-post Regret

$O(\log T)$ ex-post regret for instances with a large "gap" in the means

Difference between the best arm and the suboptimal arms is lower bounded by a constant

Detail-free algorithm (doesn’t need to know full prior, but only an upper bound on a single parameter of the prior)


12

Some related work

Kremer et al. [2014]: same model, two arms; primarily the Bayesian-optimal policy for non-stochastic rewards, with results for the stochastic case as well

Che and Horner [2013]: continuous-time stochastic model, two arms, binary rewards, Bayesian-optimal policy

Frazier et al. [2014]: Monetary transfers allowed, users observe past actions, payments vs. information asymmetry

Connections to Bayesian Persuasion: Kamenica, Gentzkow [2011]

Connections to Strategic Experimentation: Bolton, Harris [1999]


13

Key Ideas


14

Key idea: two arms, first sample

Hide exploration in a pool of exploitation

[Diagram: a first group of users pulls arm 1; a second group of users pulls the posterior-better arm, except that one of them is secretly recommended arm 2.]

• The "exploration" user does not know whether he is recommended action 2 because it is the posterior-better action or because of exploration

• If $L$ is large enough, exploitation is the most probable reason, and hence following the recommendation is in the user's own interest
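A minimal Python sketch of this single hidden-exploration phase; the pool size $L$, the Bernoulli rewards, and the toy posterior rule are illustrative assumptions (in the paper, $L$ is chosen from the prior so that exploitation dominates each user's posterior):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.6, 0.4])               # latent means; arm 1 (index 0) is the a priori better arm
L = 20                                  # exploitation-pool size; must be large enough for IC

# The first user pulls arm 1 and generates its first sample.
r1 = rng.binomial(1, mu[0])

# "Posterior-better" arm given that sample (toy update rule, stands in for the real posterior).
exploit_arm = 0 if r1 == 1 else 1

# Among the next L users, one uniformly random slot is the hidden exploration slot.
explore_slot = rng.integers(L)
recommendations = [1 if u == explore_slot else exploit_arm for u in range(L)]

# Each user sees only his own recommendation, so a user told "arm 2" (index 1) cannot tell
# whether arm 2 is the posterior-better arm or whether he hit the 1-in-L exploration slot;
# for large enough L, exploitation is the more probable explanation, so following is rational.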


15

Key idea: Black-box reduction

[Diagram: an initial sampling phase as on the previous slide (a group of users pulls arm 1, then a group pulls the posterior-better arm with arm 2 hidden inside it), followed by repeated phases: in each phase the bandit algorithm is asked for an arm, one hidden user plays it and the realized reward is reported back, while the remaining users pull the posterior-better arm.]

• Expected reward of exploit users at least as good as algorithm’s reward

• Expected reward at least: $L$ times the expected reward of $A$ over its $T$ simulated steps (each phase of $L$ users collects at least $L$ times $A$'s expected reward for the corresponding step)

• Regret at most: a prior-dependent constant factor times the Bayesian regret of $A$

• If $A$ is a low-regret algorithm $\Rightarrow$ low-regret IC algorithm
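A schematic Python sketch of the reduction; the phase length L and the bandit_algo, posterior_better_arm, and user.play interfaces are hypothetical placeholders introduced only for illustration:

import numpy as np

rng = np.random.default_rng(0)

def bic_simulate(bandit_algo, posterior_better_arm, users, L):
    # Simulate one step of the black-box bandit algorithm per phase of L users,
    # hiding that step among L - 1 exploiting users.
    t = 0
    while t < len(users):
        phase = users[t:t + L]
        explore_slot = rng.integers(len(phase))          # hidden slot for the algorithm's query
        queried_arm = bandit_algo.next_arm()             # ask the black-box algorithm once per phase
        for u, user in enumerate(phase):
            if u == explore_slot:
                reward = user.play(queried_arm)
                bandit_algo.report(queried_arm, reward)  # report the realized reward back
            else:
                user.play(posterior_better_arm())        # exploit users pull the posterior-better arm
        t += L

Per phase, the exploiting users earn at least as much in expectation as the algorithm's own step, which is where the constant-factor blow-up in regret comes from.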


16

Key idea: Ex-post regret

• Use sample means instead of posterior means

• The exploit arm is arm 2 only if its sample mean is higher with some confidence

• After the initial sampling phase, do active arm elimination

[Diagram: as before, a first group of users pulls arm 1; in the next group, one hidden user is recommended arm 2 while the rest pull an exploit arm.]
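A minimal sketch of the active-arm-elimination step on sample means; the Hoeffding-style confidence radius and the numbers in the example are illustrative assumptions:

import numpy as np

def surviving_arms(counts, sums, delta=0.01):
    # Keep exactly the arms whose upper confidence bound reaches the best lower confidence bound.
    means = sums / counts
    radius = np.sqrt(np.log(2 * len(counts) / delta) / (2 * counts))   # Hoeffding-style radius
    best_lcb = np.max(means - radius)
    return [i for i in range(len(counts)) if means[i] + radius[i] >= best_lcb]

# Example: after 200 samples each, arm 2 (index 1) is worse with high confidence and is dropped.
print(surviving_arms(counts=np.array([200.0, 200.0]), sums=np.array([130.0, 70.0])))   # -> [0]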


17

Key idea: many arms

Need to sample the a priori better arms first in order to convince users to play an a priori worse arm

Do a contest:

Many technical difficulties in performing the contest with sample means for the detail-free version

Using sample averages with a confidence bound is not as straightforward

Not trivial to define the exploit arm as a function of the sample means

[Diagram: a first group of users pulls arm 1; then, phase by phase, one group pulls the posterior-better arm among {1, 2} while arm 2 is secretly recommended to one user, the next group pulls the posterior-better arm among {1, 2, 3} while arm 3 is secretly recommended, then the posterior-better arm among {1, 2, 3, 4} while arm 4 is secretly recommended, and so on.]
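A schematic sketch of this sequential phase structure; the phase length L and the play / posterior_better_arm helpers are hypothetical placeholders (the detail-free version replaces posterior comparisons with sample means plus confidence bounds):

import numpy as np

rng = np.random.default_rng(0)

def introduce_arms(users, K, L, play, posterior_better_arm):
    # Arm indices are 0-based here; arm 0 stands for the a priori best arm.
    t = 0
    play(users[t], arm=0)                               # the very first user samples the a priori best arm
    t += 1
    for new_arm in range(1, K):                         # phase k introduces arm k
        phase = users[t:t + L]
        explore_slot = rng.integers(len(phase))
        for u, user in enumerate(phase):
            if u == explore_slot:
                play(user, arm=new_arm)                 # hidden sample of the newly introduced arm
            else:
                # Everyone else exploits: posterior-better arm among the arms seen so far.
                play(user, arm=posterior_better_arm(range(new_arm + 1)))
        t += L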


18

User-Heterogeneity: Contextual Bandit Extension

Each user has a publicly observable context

The context affects the user's mean reward

Need to find the optimal policy: context $\to$ arm

Captures user heterogeneity

The black-box reduction is very useful for leveraging contextual bandit algorithms
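A tiny illustration of the object being learned in the contextual extension, a policy mapping contexts to arms; the type names and the example mapping are my own placeholders:

from typing import Callable

Context = str                        # e.g., an observable user attribute (illustrative)
Arm = int
Policy = Callable[[Context], Arm]    # the contextual version must learn a good policy of this type

def example_policy(context: Context) -> Arm:
    # Illustrative fixed policy: new users get arm 0, everyone else arm 1.
    return 0 if context == "new_user" else 1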


19

Summary

Enables modular design of IC recommendation systems

$O(\sqrt{T})$ and $O(\log T)$ instance-based ex-post regret

Detail-free algorithm (doesn't need to know the full prior, only an upper bound on a single parameter of the prior)

Extensions: contextual bandits and others (see paper)

Thank you