
Reinforcement Learning

Evaluative Feedback and Bandit Problems

Subramanian Ramamoorthy
School of Informatics

20 January 2012


Recap: What is Reinforcement Learning?

• An approach to Artificial Intelligence
• Learning from interaction
• Goal-oriented learning
• Learning about, from, and while interacting with an external environment

• Learning what to do—how to map situations to actions—so as to maximize a numerical reward signal

• Can be thought of as a stochastic optimization over time


Recap: The Setup for RL

Agent is:

• Temporally situated

• Continually learning and planning

• Objective is to affect the environment – actions and states

• Environment is uncertain, stochastic

[Diagram: the agent-environment loop. The agent sends an action to the environment; the environment returns a state and a reward.]


Recap: Key Features of RL

• Learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward
  – Sacrifice short-term gains for greater long-term gains

• The need to explore and exploit

• Consider the whole problem of a goal-directed agent interacting with an uncertain environment


Multi-arm Bandits (MAB)

• N possible actions
• You can play for some period of time and you want to maximize reward (expected utility)

Which is the best arm/machine?

DEMO


Real-Life Version

• Choose the best content to display to the next visitor of your commercial website

• Content options = slot machines
• Reward = user's response (e.g., click on an ad)
• Also, clinical trials: arm = treatment, reward = patient cured
• Simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems, but that is for later discussion.


What is the Choice?


n-Armed Bandit Problem

• Choose repeatedly from one of n actions; each choice is called a play

• After each play $a_t$, you get a reward $r_t$, where $E[r_t \mid a_t] = Q^*(a_t)$

• These $Q^*(a)$ are the unknown action values; the distribution of $r_t$ depends only on $a_t$

Objective is to maximize the reward in the long term, e.g., over 1000 plays

To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them
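To make the setup concrete, here is a minimal sketch (not from the slides) of an n-armed bandit environment with stationary Gaussian rewards; the class name `GaussianBandit` and its details are illustrative assumptions.

```python
import numpy as np

class GaussianBandit:
    """n-armed bandit: each arm a has an unknown value Q*(a);
    pulling arm a returns a noisy reward whose mean is Q*(a)."""

    def __init__(self, n_arms=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=n_arms)  # unknown action values

    def pull(self, a):
        # The reward distribution depends only on the chosen arm a
        return self.rng.normal(self.q_star[a], 1.0)

    def best_arm(self):
        return int(np.argmax(self.q_star))
```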


Exploration/Exploitation Dilemma

• Suppose you form action-value estimates $Q_t(a) \approx Q^*(a)$

• The greedy action at time $t$ is $a_t^* = \arg\max_a Q_t(a)$
  – $a_t = a_t^*$ : exploitation
  – $a_t \ne a_t^*$ : exploration

• You can’t exploit all the time; you can’t explore all the time

• You can never stop exploring; but you could reduce exploring. Why?


Action-Value Methods

• Methods that adapt action-value estimates and nothing else, e.g.: suppose by the t-th play, action $a$ had been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$, then

$Q_t(a) = \dfrac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$   (the “sample average”)

• $\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$


Remark

• The simple greedy action selection strategy: $a_t = a_t^* = \arg\max_a Q_t(a)$

• Why might this be insufficient?
• You are estimating, online, from a few samples. How will this behave?

DEMO


ε-Greedy Action Selection

• Greedy action selection: $a_t = a_t^* = \arg\max_a Q_t(a)$

• ε-Greedy:

$a_t = \begin{cases} a_t^* & \text{with probability } 1 - \varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$

. . . the simplest way to balance exploration and exploitation
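A minimal sketch of ε-greedy selection (illustrative, not from the slides): `q_estimates` holds the current $Q_t(a)$ values and `epsilon` is the exploration probability.

```python
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng):
    """With probability epsilon pick a uniformly random arm (explore),
    otherwise pick a greedy arm, argmax_a Q_t(a) (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))
    return int(np.argmax(q_estimates))

# e.g. epsilon_greedy(np.zeros(10), 0.1, np.random.default_rng(0))
```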


Worked Example: 10-Armed Testbed

• n = 10 possible actions

• Each $Q^*(a)$ is chosen randomly from a normal distribution: $Q^*(a) \sim N(0, 1)$

• Each reward $r_t$ is also normal: $r_t \sim N(Q^*(a_t), 1)$

• 1000 plays; repeat the whole thing 2000 times and average the results
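A self-contained sketch of the testbed experiment described above, using sample-average estimates with ε-greedy selection; the function name and defaults are illustrative assumptions.

```python
import numpy as np

def run_testbed(n_runs=2000, n_arms=10, n_plays=1000, epsilon=0.1, seed=0):
    """Average reward per play over many independent 10-armed problems."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_plays)
    for _ in range(n_runs):
        q_star = rng.normal(0.0, 1.0, n_arms)   # true values Q*(a) ~ N(0, 1)
        Q = np.zeros(n_arms)                    # sample-average estimates
        N = np.zeros(n_arms)                    # pull counts
        for t in range(n_plays):
            if rng.random() < epsilon:
                a = int(rng.integers(n_arms))   # explore
            else:
                a = int(np.argmax(Q))           # exploit
            r = rng.normal(q_star[a], 1.0)      # reward r_t ~ N(Q*(a_t), 1)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]           # incremental sample average
            avg_reward[t] += r
    return avg_reward / n_runs
```

Comparing runs with epsilon set to 0 (greedy), 0.01, and 0.1 reproduces the qualitative comparison shown on the next slide.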


ε-Greedy Methods on the 10-Armed Testbed


Softmax Action Selection

• Softmax action selection methods grade action probabilities by estimated values.

• The most common softmax uses a Gibbs, or Boltzmann, distribution:

Choose action $a$ on play $t$ with probability

$\dfrac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$

where $\tau$ is a ‘computational temperature’
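A sketch of Gibbs/Boltzmann (softmax) selection corresponding to the formula above; the max-subtraction is a numerical-stability detail, not from the slides, and the names are illustrative.

```python
import numpy as np

def softmax_action(q_estimates, tau, rng):
    """Choose arm a with probability exp(Q_t(a)/tau) / sum_b exp(Q_t(b)/tau)."""
    prefs = np.asarray(q_estimates, dtype=float) / tau
    prefs -= prefs.max()            # subtract max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Small tau -> nearly greedy; large tau -> nearly uniform random.
```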


Incremental Implementation

The average of the first k rewards is (dropping the dependence on a):

$Q_k = \dfrac{r_1 + r_2 + \cdots + r_k}{k}$   (the sample average estimation method)

How to do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

$Q_{k+1} = Q_k + \dfrac{1}{k+1}\left[r_{k+1} - Q_k\right]$

This update has the common form:

NewEstimate = OldEstimate + StepSize [Target – OldEstimate]
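The update rule above as a minimal sketch; `Q` and `counts` would be arrays indexed by action (names are illustrative).

```python
def update_sample_average(Q, counts, a, reward):
    """Incremental sample average:
    Q_{k+1} = Q_k + (1 / (k+1)) * (r_{k+1} - Q_k)."""
    counts[a] += 1
    step_size = 1.0 / counts[a]
    Q[a] += step_size * (reward - Q[a])   # NewEstimate = Old + StepSize*(Target - Old)
```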


Tracking a Nonstationary Problem

Choosing $Q_k$ to be a sample average is appropriate in a stationary problem, i.e., when none of the $Q^*(a)$ change over time, but not in a nonstationary problem.

Better in the nonstationary case is:

$Q_{k+1} = Q_k + \alpha\left[r_{k+1} - Q_k\right]$,  for constant $\alpha$, $0 < \alpha \le 1$

$Q_k = (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i$

exponential, recency-weighted average
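The same update with a constant step size, giving the exponential, recency-weighted average above; a sketch with illustrative names.

```python
def update_constant_alpha(Q, a, reward, alpha=0.1):
    """Q_{k+1} = Q_k + alpha * (r_{k+1} - Q_k), with 0 < alpha <= 1.
    Recent rewards are weighted more heavily, so the estimate
    can track a nonstationary Q*(a)."""
    Q[a] += alpha * (reward - Q[a])
```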


Optimistic Initial Values

• All methods so far depend on $Q_0(a)$, i.e., they are biased
• Encourage exploration: initialize the action values optimistically, i.e., on the 10-armed testbed, use $Q_0(a) = 5$ for all $a$
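A sketch of optimistic initialisation (not the slide's code): start every estimate at 5, well above any plausible $Q^*(a)$, and act greedily. Early rewards fall short of the optimistic estimate, which pulls the estimate down and makes a different arm greedy next time, driving exploration. The `reward_fn` argument and other names are illustrative.

```python
import numpy as np

n_arms = 10
Q = np.full(n_arms, 5.0)       # optimistic initial values, Q_0(a) = 5
counts = np.zeros(n_arms)

def step(reward_fn):
    """One purely greedy play with sample-average updates."""
    a = int(np.argmax(Q))
    r = reward_fn(a)                 # reward_fn(a): pull arm a, return a reward
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]   # disappointing rewards drag Q[a] down
    return a, r
```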


Beyond Counting…


An Interpretation of MAB Type Problems


[Figure annotation: related to ‘rewards’]


MAB is a Special Case of Online Learning


How to Evaluate Online Alg.: Regret

• After you have played for T rounds, you experience a regret := [Reward sum of optimal strategy] – [Sum of actual collected rewards]

• If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property
  ~ guaranteed to converge to an optimal strategy
• ε-greedy is sub-optimal (so has some regret). Why?


$\text{Regret}_T = T\,\mu^* - E\left[\sum_{t=1}^{T} \hat{r}_t\right]$,  where  $\mu^* = \max_k \mu_k$

(The expectation accounts for randomness in the draw of rewards and in the player's strategy.)
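A small sketch of how one might measure regret in simulation, when the true arm means are known to the experimenter. It computes the pseudo-regret (expected reward of the chosen arms rather than the sampled rewards); the names are illustrative.

```python
import numpy as np

def empirical_regret(true_means, chosen_arms):
    """Regret after T rounds = T * mu_star - sum of the expected rewards
    of the arms actually chosen, where mu_star is the best true mean."""
    true_means = np.asarray(true_means, dtype=float)
    T = len(chosen_arms)
    mu_star = true_means.max()
    return T * mu_star - true_means[np.asarray(chosen_arms)].sum()

# A no-regret strategy has empirical_regret(...) / T -> 0 as T grows.
```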


Interval Estimation

• Attribute to each arm an “optimistic initial estimate” within a certain confidence interval

• Greedily choose arm with highest optimistic mean (upper bound on confidence interval)

• Infrequently observed arm will have over-valued reward mean, leading to exploration

• Frequent usage pushes the optimistic estimate towards the true value


Interval Estimation Procedure

• Associate to each arm a 100(1−α)% upper bound on its reward mean
• Assume, e.g., that rewards are normally distributed
• Arm is observed n times to yield an empirical mean and standard deviation
• The α-upper bound is

$u = \hat{\mu} + \dfrac{\hat{\sigma}}{\sqrt{n}}\; c^{-1}(1-\alpha)$,  where  $c(t) = \dfrac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} \exp\!\left(-\dfrac{x^2}{2}\right) dx$  is the cumulative distribution function of the standard normal

• If α is carefully controlled, this could be made a zero-regret strategy
  – In general, we don't know
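A sketch of the interval-estimation choice rule using the normal upper bound above; it uses scipy's standard-normal quantile (`scipy.stats.norm.ppf`), and all other names, including the fallback for arms seen fewer than twice, are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def interval_estimation_choice(rewards_per_arm, alpha=0.05):
    """Pick the arm with the largest 100*(1-alpha)% upper bound on its mean:
    u = mu_hat + (sigma_hat / sqrt(n)) * c^{-1}(1 - alpha)."""
    z = norm.ppf(1.0 - alpha)                 # c^{-1}(1 - alpha)
    bounds = []
    for r in rewards_per_arm:                 # list of reward lists, one per arm
        r = np.asarray(r, dtype=float)
        n = len(r)
        if n == 0:
            bounds.append(np.inf)             # unplayed arm: maximally optimistic
        else:
            mu_hat = r.mean()
            sigma_hat = r.std(ddof=1) if n > 1 else 1.0   # fallback spread for n = 1
            bounds.append(mu_hat + sigma_hat / np.sqrt(n) * z)
    return int(np.argmax(bounds))
```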


Variant: UCB Strategy

• Again, based on notion of an upper confidence bound but more generally applicable

• Algorithm:
  – Play each arm once
  – At time t > K, play the arm $i_t$ maximizing


$\bar{r}_j(t) + \sqrt{\dfrac{2 \ln t}{T_{j,t}}}$

where $T_{j,t}$ is the number of times arm $j$ has been played so far
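A minimal UCB1 sketch following the rule above: play each arm once, then pick the arm maximising the empirical mean plus the confidence bonus. The `bandit_pull` callback and other names are illustrative.

```python
import numpy as np

def ucb1(bandit_pull, K, T):
    """bandit_pull(j) returns a reward for arm j; K arms, T total plays."""
    means = np.zeros(K)    # empirical mean reward of each arm
    counts = np.zeros(K)   # T_{j,t}: number of times arm j has been played
    history = []
    for t in range(1, T + 1):
        if t <= K:
            j = t - 1                                     # play each arm once
        else:
            ucb = means + np.sqrt(2.0 * np.log(t) / counts)
            j = int(np.argmax(ucb))
        r = bandit_pull(j)
        counts[j] += 1
        means[j] += (r - means[j]) / counts[j]            # incremental mean
        history.append(j)
    return history, means
```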


UCB Strategy


Reminder: Chernoff-Hoeffding Bound
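For reference, a standard form of the bound used in the UCB analysis, stated here from memory rather than from the slide image (for i.i.d. random variables bounded in [0, 1]):

```latex
% Chernoff-Hoeffding bound (standard form):
% Let X_1, ..., X_n be i.i.d. random variables with X_i \in [0, 1]
% and mean \mu, and let \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. Then
\[
  P\left( \bar{X}_n \ge \mu + \varepsilon \right) \le e^{-2 n \varepsilon^2},
  \qquad
  P\left( \bar{X}_n \le \mu - \varepsilon \right) \le e^{-2 n \varepsilon^2}.
\]
```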


UCB Strategy – Behaviour


We will not try to prove the following result, but I quote it to tell you why UCB may be a desirable strategy: its regret is bounded.

K = number of arms


Variation on SoftMax: Exp3

• It is possible to drive regret down by annealing τ
• Exp3: exponential-weight algorithm for exploration and exploitation
• Probability of choosing arm k at time t is


(Recall the softmax probabilities $\dfrac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$ from before.)

$P_k(t) = (1-\gamma)\,\dfrac{w_k(t)}{\sum_{j} w_j(t)} + \dfrac{\gamma}{K}$

$w_j(t+1) = \begin{cases} w_j(t)\,\exp\!\left(\dfrac{\gamma\, r_j(t)}{P_j(t)\,K}\right) & \text{if arm } j \text{ is pulled at } t \\ w_j(t) & \text{otherwise} \end{cases}$

$\gamma$ is a user-defined open parameter

$\text{Regret} = O\!\left(\sqrt{T K \log K}\right)$
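A sketch of Exp3 consistent with the update above; the division by $P_j(t)$ inside the exponent is the importance-weighted reward. The names are illustrative, and rewards are assumed to lie in [0, 1].

```python
import numpy as np

def exp3(bandit_pull, K, T, gamma=0.1, seed=0):
    """Exp3: exponential weights for exploration and exploitation.
    bandit_pull(j) must return a reward in [0, 1]."""
    rng = np.random.default_rng(seed)
    w = np.ones(K)                                   # weights w_j(t)
    choices = []
    for t in range(T):
        P = (1.0 - gamma) * w / w.sum() + gamma / K  # mix in uniform exploration
        j = int(rng.choice(K, p=P))
        r = bandit_pull(j)
        w[j] *= np.exp(gamma * r / (P[j] * K))       # update only the pulled arm
        choices.append(j)
    return choices

# With gamma tuned to the horizon T, the regret is O(sqrt(T K log K)).
```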


The Gittins Index

• Each arm delivers reward with a probability
• This probability may change through time, but only when the arm is pulled
• Goal is to maximize discounted rewards – the future is discounted by an exponential discount factor β
• The structure of the problem is such that all you need to do is compute an “index” for each arm and play the one with the highest index
• Index is of the form:


$\nu_i = \sup_{T > 0} \dfrac{E\left[\sum_{t=0}^{T} \beta^t R_i(t)\right]}{E\left[\sum_{t=0}^{T} \beta^t\right]}$

(the supremum is taken over stopping times $T$)


Gittins Index – Intuition

• Proving optimality isn’t within our scope; but it is based on:
• Stopping time: the point where you should ‘terminate’ a bandit
• Nice property: the Gittins index for any given bandit is independent of the expected outcomes of all other bandits
  – Once you have a good arm, keep playing until there is a better one
  – If you add/remove machines, the computation doesn’t really change
• BUT:
  – hard to compute, even when you know the distributions
  – exploration issues; an arm isn’t updated unless it is used (restless bandits?)


Numerous Applications!


Extending the MAB Model

• In this lecture, we are in a single casino and the only decision is to pull from a set of n arms
  – except perhaps in the very last slides, exactly one state!

Next,
• What if there is more than one state?
• So, in this state space, what is the effect of the distribution of payout changing based on how you pull arms?
• What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?


Acknowledgements

• Many slides are adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book
