
Lecture Notes on

Reinforcement Learning

Laurenz Wiskott

Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU

27 March 2017

Contents

1 Basic reinforcement algorithm
  1.1 General idea
  1.2 Concepts and notions
  1.3 Learning the true value function
  1.4 Learning the optimal policy
  1.5 Learning value function and policy simultaneously

2 Problems and variants
  2.1 Improved learning by updating previous states
  2.2 Diverging value function for tasks without an absorbing state
  2.3 Nondeterministic transitions
  2.4 Exploration-exploitation dilemma
  2.5 Two opponent players
  2.6 Large state space

3 Application
  3.1 TD-Gammon

4 Exploration-exploitation dilemma
  4.1 The k-armed bandit problem
  4.2 Estimating action values
  4.3 Balancing exploration and exploitation
      4.3.1 Greedy method
      4.3.2 ε-greedy method
      4.3.3 Optimistic initialization
      4.3.4 Upper-confidence-bound action selection
      4.3.5 Gradient bandit algorithms
  4.4 Associative search (contextual bandits)
  4.5 Summary

© 2007, 2012 Laurenz Wiskott (homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/. Figures from other sources have their own copyright, which is generally indicated. Do not distribute parts of these lecture notes showing figures with non-free copyrights (here usually figures I have the rights to publish but you don't, like my own published figures). Figures I do not have the rights to publish are grayed out, but the word 'Figure', 'Image', or the like in the reference is often linked to a pdf. More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.

These lecture notes are largely based on (Harmon and Harmon, 1996; Kaelbling et al., 1996; ten Hagen and Krose, 1997; Sutton and Barto, 2016).

1 Basic reinforcement algorithm

1.1 General idea

The general idea of reinforcement learning (D: Verstärkungslernen) was nicely phrased by Harmon and Harmon (1996):

To provide the intuition behind reinforcement learning consider the problem of learning to ride a bicycle. The goal given to the RL system is simply to ride the bicycle without falling over. In the first trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right. At this point there are two actions possible: turn the handle bars left or turn them right. The RL system turns the handle bars to the left and immediately crashes to the ground, thus receiving a negative reinforcement. The RL system has just learned not to turn the handle bars left when tilted 45 degrees to the right. In the next trial the RL system performs a series of actions that again result in the bicycle being tilted 45 degrees to the right. The RL system knows not to turn the handle bars to the left, so it performs the only other possible action: turn right. It immediately crashes to the ground, again receiving a strong negative reinforcement. At this point the RL system has not only learned that turning the handle bars right or left when tilted 45 degrees to the right is bad, but that the "state" of being tilted 45 degrees to the right is bad. Again, the RL system begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right. Two actions are possible: turn right or turn left. The RL system turns the handle bars left which results in the bicycle being tilted 45 degrees to the right, and ultimately results in a strong negative reinforcement. The RL system has just learned not to turn the handle bars to the left when tilted 40 degrees to the right. By performing enough of these trial-and-error interactions with the environment, the RL system will ultimately learn how to prevent the bicycle from ever falling over.


[Figure not reproduced: a 4×4 grid of states such as (1,2) and (3,2), with state values, an agent, actions and transitions, and an absorbing state marked.]
Figure 1: A simple example of reinforcement learning to introduce basic notions. © CC BY-SA 4.0

1.2 Concepts and notions

To formalize reinforcement learning, we need a number of concepts and notions.1 Let us introduce them by means of a simple example. In reinforcement learning we consider an agent (D: Agent), which is in one of a number of states (D: Zustände) s in an environment (D: Umgebung). In the simple example shown here the environment is simply a 4×4 field and each field is a state s ∈ {(1,1), (1,2), ..., (4,4)}. The agent can perform certain actions (D: Aktionen) a that will usually cause transitions (D: Übergänge) into neighboring states s′. Here the actions are movements right, down, left, or up, i.e. a ∈ {→, ↓, ←, ↑}. The transitions may either be deterministic or, more generally, probabilistic with certain unknown transition probabilities (D: Übergangswahrscheinlichkeiten). In the former case one might introduce a transition function S and write s′ = S(s, a); in the latter case one has to use probabilities and write P(s′|s, a). In our simple example we would have, for instance, S((2,3), ↑) = (1,3). Upon each action taken in a certain state, the agent receives an immediate reward (D: unmittelbare Belohnung) r = R(s, a), which may actually be negative and then is effectively a punishment (D: Bestrafung). In our example the immediate reward is −1, thus each transition gets punished and the goal of the agent should obviously be to reach one of the two absorbing states as quickly as possible to minimize the overall punishment (maximize the overall reward). An absorbing state (D: absorbierender Zustand) is a state at which the agent stops and the task is over, which may be good or bad. The expected reward (D: erwartete Belohnung) is the expectation value of the reward summed over the future given the agent is in state s_0 at time t_0, i.e.

E\left[ \sum_{t=t_0}^{\infty} r(t) \,\middle|\, s(t_0) = s_0 \right] .   (1)

The agent has a policy (D: Strategie) π, which assigns an action a = A(s) to a state. If the policy is deterministic, it makes little sense to distinguish π(s) and A(s), but if it is probabilistic it does. Finally, the agent also has a value function (D: Bewertungsfunktion) V(s) that represents the estimated expected reward for each state. Ideally V(s) = E[\sum_{t=t_0}^{\infty} r(t) \mid s(t_0) = s]. The value function obviously depends on the policy, see Figure 2.

1.3 Learning the true value function

Given a fixed deterministic policy π providing A(s) and if the transitions are deterministically given by S(s, A(s)), it is clear that each state has a definite expected reward which should be reflected by the value function. Let V_t be the value function at time step t. In general V_{t+1} := V_t, but at the current state s = s(t) we can improve the value function because we see the true immediate reward under the current policy. Thus, instead of setting V_{t+1}(s) := V_t(s) we set

V_{t+1}(s) := R(s, A(s)) + V_t(S(s, A(s))) .   (2)

1 Important text (but not inline formulas) is set bold face; • marks important formulas worth remembering; marks less important formulas, which I also discuss in the lecture; + marks sections that I typically skip during my lectures.


[Figure not reproduced: the grid world of Figure 1 with a policy and state values between −5 and 0.]
Figure 2: A possible policy and the corresponding value function for the simple example of Figure 1. © CC BY-SA 4.0

Because R(s, A(s)) is the true immediate reward, V_{t+1}(s) will generally be more accurate than V_t(s). Iterating (2) infinitely often for randomized initial states is guaranteed to make V converge to the true value function, under certain assumptions.
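To make the update concrete, here is a minimal Python sketch of (2); the functions S, R, A and the set `absorbing` are hypothetical placeholders for the deterministic transition function, the immediate reward, the fixed policy, and the absorbing states.

# Minimal sketch of update (2): improve the value estimate at the current state
# under a fixed deterministic policy A and deterministic transitions S.
# S, R, A and `absorbing` are hypothetical placeholders; states may be any
# hashable objects, e.g. tuples like (2, 3).

def update_value(V, s, S, R, A, absorbing):
    """One application of V(s) := R(s, A(s)) + V(S(s, A(s)))."""
    if s in absorbing:
        return V                       # nothing to update at an absorbing state
    a = A(s)                           # action prescribed by the fixed policy
    s_next = S(s, a)                   # deterministic successor state
    V = dict(V)                        # all other states keep their old values
    V[s] = R(s, a) + V.get(s_next, 0.0)
    return V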

1.4 Learning the optimal policy

If a value function V(s) for a suboptimal policy π is given, the policy can be improved easily by choosing the action that leads to the neighboring state with the highest expected reward:

A(s) := \arg\max_{a′} (R(s, a′) + V(S(s, a′))) .   (3)

There is no need to iterate this equation. A(s) can be inferred from V (s) directly.
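As a small illustration, the greedy policy of (3) can be read off a given value function directly; ACTIONS, S, and R are again hypothetical placeholders.

# Sketch of (3): derive the improved policy directly from a given value function V.

def greedy_action(V, s, ACTIONS, S, R):
    """A(s) := argmax_a' ( R(s, a') + V(S(s, a')) )."""
    return max(ACTIONS, key=lambda a: R(s, a) + V.get(S(s, a), 0.0))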

Figure 3 shows an iteration between learning the true value function and the optimal policy.

[Figure not reproduced: successive versions of the grid world with improving policies and value functions, state values between −5 and 0.]
Figure 3: Iterating learning the true value function and the optimal policy for the initial policy shown in Figure 2. © CC BY-SA 4.0


1.5 Learning value function and policy simultaneously

It makes little sense to let (2) converge for a fixed policy before improving the policy. It is obviously more efficient to always pick the currently optimal policy. This leads to the combined learning rule for state s

• a := A_{t+1}(s) := \arg\max_{a′} (R(s, a′) + V_t(S(s, a′))) ,   (4)

• V_{t+1}(s) := R(s, a) + V_t(S(s, a)) .   (5)

For all other states the old values remain unchanged. Iterating these equations leads to the optimal policy and the true value function.

This algorithm is known as temporal difference learning. Note, however, that there are more refined versions of it.
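A minimal sketch of one episode of this combined rule, assuming deterministic transitions S, rewards R, a finite action set ACTIONS, and a set of absorbing states (all hypothetical names):

# Sketch of the combined rule (4)-(5): at each visited state pick the greedy
# action under the current value estimate and then update V for that state only.

def td_episode(V, start_state, ACTIONS, S, R, absorbing):
    s = start_state
    while s not in absorbing:
        # (4): greedy action under the current value function
        a = max(ACTIONS, key=lambda ap: R(s, ap) + V.get(S(s, ap), 0.0))
        s_next = S(s, a)
        # (5): update only the current state; all other values stay unchanged
        V[s] = R(s, a) + V.get(s_next, 0.0)
        s = s_next
    return V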

2 Problems and variants

2.1 Improved learning by updating previous states

Imagine an agent goes through the states s, s′, and s′′, in this order. From s to s′ it will update V(s) and from s′ to s′′ it will update V(s′). It is clear that if V(s′) has to be updated, V(s) should be updated by the same amount, or at least by a fraction of the same amount, because the update of V(s) happened with the old incorrect value of V(s′). It would thus be nice if it were possible to propagate updates back to previously visited states. This can be achieved by complementing (5) as follows.

V_{t+1}(s(t)) := R(s(t), a(t)) + V_t(\underbrace{S(s(t), a(t))}_{s(t+1)}) ,   (6)

V_{t+1}(s(t − t′)) := V_t(s(t − t′)) + λ^{t′} (V_{t+1}(s(t)) − V_t(s(t)))   ∀ t′ = 1, 2, ... .   (7)

Factor λ ∈ [0, 1] can be used to dampen the propagation of the current update into the past.

This algorithm is referred to as TD(λ)-Learning.
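A sketch of one TD(λ) step along these lines; the list `visited` of previously visited states and the deterministic S and R are assumptions of this sketch.

# Sketch of (6)-(7): update the current state and propagate the change back to
# previously visited states, damped by lam ** t'.

def td_lambda_step(V, visited, s, a, S, R, lam):
    s_next = S(s, a)
    delta = R(s, a) + V.get(s_next, 0.0) - V.get(s, 0.0)       # change at the current state
    V[s] = V.get(s, 0.0) + delta                               # (6)
    for tp, s_past in enumerate(reversed(visited), start=1):
        V[s_past] = V.get(s_past, 0.0) + (lam ** tp) * delta   # (7)
    visited.append(s)
    return s_next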

2.2 Diverging value function for tasks without an absorbing state

The value function can diverge if the system does not run into an absorbing state. For instance, if the task is to balance a pole and the agent receives a reward of +1 for each time step of successful balancing, then the expected reward will go to infinity if the agent learns to master the task.

This problem can be solved by discounting the reward by a constant factor γ with 0 < γ < 1 for each time step. The expected reward (1) then becomes

E\left[ \sum_{t=t_0}^{\infty} γ^{t−t_0} r(t) \,\middle|\, s(t_0) = s_0 \right]   (8)

and is finite even for constant reward. The update (5) of the value function in the reinforcement algorithm changes to

V_{t+1}(s) := R(s, a) + γ V_t(S(s, a)) .   (9)
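As a quick numerical check that discounting keeps the sum finite, a constant reward r per time step gives a discounted return close to r / (1 − γ):

# For constant reward r, the discounted return (8) approaches r / (1 - gamma).
gamma, r = 0.9, 1.0
discounted_return = sum(gamma ** t * r for t in range(1000))
print(discounted_return)   # ~10.0 = r / (1 - gamma)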


2.3 Nondeterministic transitions

If the transitions are not deterministic but probabilistic, we have to adapt the algorithm. If the transition probabilities P(s′|s, a) were known, we could change (2) to

V_{t+1}(s) := R(s, a) + \sum_{s′} P(s′|s, a) V_t(s′) .   (10)

However, the P (s′|s, a) are not known.

One solution would be to estimate the P(s′|s, a) explicitly. But it is more efficient to introduce an extended value function Q(s, a) representing the expected reward given a state s and an action a. Obviously, one would define

V(s) := \max_{a′} Q(s, a′) ,   (11)

A(s) := \arg\max_{a′} Q(s, a′) ,   (12)

because with an optimal policy one would always pick the action with maximal reward.

In the deterministic case and given a fixed policy we would update Q according to

Q_{t+1}(s, a) := R(s, a) + V(s′)   with s′ = S(s, a) .   (13)

In the probabilistic case we have to average over the different new states according to their probabilities or frequencies. This can naturally be achieved by leaky integration

Q_{t+1}(s, a) := (1 − α) Q_t(s, a) + α (R(s, a) + V(s′))   (14)

• \overset{(11)}{=} (1 − α) Q_t(s, a) + α (R(s, a) + \max_{a′} Q_t(s′, a′))   (15)

with s′ being the new state after the probabilistic transition. I am not aware of any principled method to choose the right value for α.

I guess the main advantage over the approach with explicit probabilities, cf. (10), is the reduced computational and memory costs. For instance, if there were 1,000 states and 10 actions per state, P(s′|s, a) would have a size of 10,000,000 while Q(s, a) only has a size of 10,000.

Reinforcement learning with the Q function is referred to as Q-learning.
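A minimal sketch of the Q-learning update (15); the action set ACTIONS and the step size α are assumptions, and the successor state s_next and reward r are whatever the environment happened to produce.

# Sketch of update (15): leaky integration of the observed reward plus the value
# of the (random) successor state, with V(s') = max_a' Q(s', a') as in (11).

def q_update(Q, s, a, r, s_next, ACTIONS, alpha=0.1):
    v_next = max(Q.get((s_next, ap), 0.0) for ap in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + v_next)
    return Q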

2.4 Exploration-exploitation dilemma

If the transitions are probabilistic, one faces the dilemma between exploration and exploitation. On the one hand one would like to always choose the action with the highest expected reward. On the other hand one has to choose a particular action multiple times to make a reliable estimate of its associated expected reward. At what point should one stop choosing actions with low expected reward at least occasionally?

To illustrate the point consider the simple system in Figure 4 with five states. In this example the agent always starts in state s1 and ends in state s5. The only choice it has is in state s1, where it can choose a = 0 to go to states s2 or s3, or it can choose a = 1 to go to state s4. In the former case it will end up in state s2 with probability 0.1 and in state s3 with probability 0.9.

Assume that the agent initially chooses a = 0 and a = 1 with equal probability and the first four paths through the system were via s4, s2, s4, s3; then a = 0 seems to yield an expected reward of 1 and a = 1 an expected reward of 2. Should the agent then stick to a = 1?

If we calculate the true expected rewards, we get Q(s1, a = 0) = 0.1(1 − 3) + 0.9(1 + 3) = 3.4 and Q(s1, a = 1) = 1 + 1 = 2 and see that it is actually best to choose a = 0 in state s1.

In such cases it may be advisable to choose a non-deterministic policy, in which case it starts to make sense to distinguish π and A(s).


[Figure not reproduced: states s1 to s5; from s1 action a = 0 (r = 1) leads with P = 0.1 to s2 and with P = 0.9 to s3, action a = 1 (r = 1) leads to s4; the transitions into s5 yield r = −3 from s2, r = 3 from s3, and r = 1 from s4.]
Figure 4: A simple example illustrating the exploration-exploitation dilemma. © CC BY-SA 4.0

[Figure not reproduced: a simple board game between player A and player B with pieces marked + and x.]
Figure 5: A simple example illustrating reinforcement learning for two opponent players. © CC BY-SA 4.0

2.5 Two opponent players

Reinforcement learning can also be used to learn to play a game between two opponents. Each player is then an agent acting in an environment that includes the other player. This is illustrated in Figure 5. The states of the game are divided into two subsets. Each player gets to see only the states where he/she has to move. The other player may contribute to the probabilistic nature of the transitions between the states of one player.

2.6 Large state space

If the state space is very large, learning becomes very slow and memory requirements become very large. If the value function is relatively smooth, the problem of a large state space can be solved by approximating the value function with a continuous function with few parameters. For example, one could use a backpropagation network.
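A minimal sketch of this idea with a linear approximator instead of a full backpropagation network, assuming states are encoded as feature vectors phi(s) so that V(s) ≈ w · phi(s); phi, α, and γ are assumptions of this sketch.

# Semi-gradient TD update for a linearly approximated value function.
import numpy as np

def td_update_linear(w, phi_s, r, phi_s_next, alpha=0.01, gamma=0.9):
    td_error = r + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)
    return w + alpha * td_error * phi_s   # gradient step on the squared TD error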


3 Application

3.1 TD-Gammon

This section is based on (Tesauro, 1995).

TD-Gammon

[Image not reproduced: (Andre-Pierre du Plessis, 2009, Flickr)1 © CC BY 2.0. From the dictionary: gammon = the lower part of a side of bacon.]

Backgammon

[Image not reproduced: (Ptkfgs, 2007, Wikimedia)2 © public domain.]


TD-Gammon - Network Structure

[Figure not reproduced: (Tesauro, 1995, Commun. ACM 38:58-68, Fig. 1)3 © non-free.]

Since the state space of backgammon is far too large to be learned explicitly, a neural network is used to approximate the value function. The network receives a complete description of the board positions as an input and provides the expected outcome of the game as an output. It is trained with the backpropagation algorithm.

TD-Gammon - Performance

[Table not reproduced: (Tesauro, 1995, Commun. ACM 38:58-68, Tab. 1)4 © non-free.]

After training for 1,500,000 games the system lost against a world-class backgammon player with −1 point in 40 games, which basically means that it performs at world-class level.


TD-Gammon - First Move

[Figure/table not reproduced: (Tesauro, 1995, Commun. ACM 38:58-68, Fig. 2, Tab. 2)5 © non-free.]

TD-Gammon even changed the way human experts play.


4 Exploration-exploitation dilemma

This section is based on (Sutton and Barto, 2016, Chap. 2), and I largely adopt their notation here for compatibility reasons.

We have seen in Section 2.4 that a reinforcement learning agent has to balance the tradeoff between exploring its environment by trying out actions that are apparently suboptimal and exploiting its knowledge about the environment by always picking the action that it thinks is optimal. This exploration-exploitation dilemma can be discussed nicely using the example of the k-armed bandit problem.

4.1 The k-armed bandit problem

In the k-armed bandit problem, an agent is faced with a slot machine with k arms to pull rather than just one (the one-armed bandit). Pulling one of the arms is an action a ∈ {1, ..., k} the agent can take. The agent tries to maximize its reward, i.e. the money it gets back for the coins it puts in, and each of the k actions has a constant average payoff, called the action value

q_*(a) := \langle R_t \rangle_{\Omega:A_t=a}   (16)

where \langle R_t \rangle_{\Omega:A_t=a} indicates the ensemble average of the reward R_t under the condition that action a has been chosen.

If the probabilities by which the agent takes actions a at time t are given by the policy π_t(a), then the expected reward is

\langle R_t \rangle_\Omega = \left\langle \langle R_t \rangle_{\Omega:A_t=a} \right\rangle_a \overset{(16)}{=} \langle q_*(a) \rangle_a = \sum_a π_t(a) q_*(a)   (17)

where we use the notation \langle f(a) \rangle_a to indicate the average of f(a) over the distribution π_t(a).

Ideally the agent should always pick the action with maximal payoff, but it first has to figure out the payoffs of the different actions before it can do that. Thus, it has an exploration-exploitation dilemma.

This problem is a very reduced reinforcement learning problem, since there is only one single state with the same k possible actions at every step. But it is a nice example to study the exploration-exploitation dilemma.

4.2 Estimating action values

Since the k payoffs of the k-armed bandit do not depend on what actions have been chosen earlier, they can be estimated independently of each other by only considering the cases where a particular action has been chosen. Thus, the true action values (16) can be estimated by

Q_t(a) := \frac{\sum_{t′=1}^{t−1} R_{t′} δ_{A_{t′},a}}{\sum_{t′=1}^{t−1} δ_{A_{t′},a}} ,   (18)

where δ_{A,a} is the delta function, which assumes the value 1 if A = a and zero otherwise. It serves here as an indicator function to pick out those cases where action a has been chosen. This method is sometimes called the sample-average method.

Calculating the action values this way can become quite expensive, because in the long run many values need to be stored and added. However, one can also calculate the action value estimates incrementally. If we focus on just one action for simplicity, drop (a), and use the index n rather than t, then the direct method (18) would read

Q_n := \frac{\sum_{n′=1}^{n−1} R_{n′}}{n − 1} ,   (19)

and it is easy to show that this can be calculated incrementally with the recursive equation

Q_{n+1} := Q_n + \frac{1}{n} (R_n − Q_n) .   (20)


Notice that with this rule Q_2 = R_1 regardless of the initialization of Q_1. Thus, estimates of the action values can be efficiently updated even in the long run.

Update rule (18) weighs all past rewards equally, which is fine for a stationary action value. If the action value changes over time, then more recent rewards should be weighed more heavily than past ones. This can be achieved by replacing the decaying step size 1/n by a constant one, i.e.

Q_{n+1} := Q_n + α (R_n − Q_n)   (21)

with α ∈ (0, 1]. It is easy to show that this leads to an estimate

Q_{n+1} := (1 − α)^n Q_1 + \sum_{n′=1}^{n} α (1 − α)^{n−n′} R_{n′} ,   (22)

where the initial value Q_1 and the rewards are forgotten with an exponential decay (1 − α)^n. This is sometimes called an exponential, recency-weighted average.

Other schedules for the step size are also possible, so that in general one has an α_n(a) for each action a as a function of n, the number of times an action has been chosen. With α_n(a) = 1/n we get (20); with α_n(a) = α = const we get (21).
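Both incremental estimates are one-liners; the following sketch simply restates (20) and (21) in code.

# (20): sample average with step size 1/n; (21): constant step size alpha,
# giving the exponential, recency-weighted average.

def update_sample_average(Q, n, R):
    return Q + (1.0 / n) * (R - Q)      # (20)

def update_constant_step(Q, R, alpha=0.1):
    return Q + alpha * (R - Q)          # (21)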

Convergence of Q_n(a) to the true action value q_*(a) is guaranteed if

\sum_{n=1}^{\infty} α_n(a) = \infty   and   (23)

\sum_{n=1}^{\infty} α_n^2(a) < \infty .   (24)

If the former condition is met, then the step size remains large enough to level out a wrong initialization or random fluctuations; if the latter condition is met, then the step size reduces fast enough to actually converge to something. For non-stationary action values the latter condition should actually not be met, so that the estimate is able to track any changes all the time. Although step-size schedules fulfilling these conditions guarantee convergence, in practice one often uses schedules violating these conditions, because they converge faster and/or are more robust with respect to the parameter choice.

4.3 Balancing exploration and exploitation

Now that we have methods to estimate the action values of the k different actions, we can turn to the question of how to balance exploration and exploitation.

4.3.1 Greedy method

The simplest method is to not balance at all and simply always choose the action with the highest action value, referred to as the greedy action. It is quite obvious that this greedy method, as it is called, is optimal for the very next choice but not in the long run, because other actions with lower estimated action values might actually give more reward; the agent just does not know it.

4.3.2 ε-greedy method

It is clear that at least occasionally one should try actions that do not have an optimal estimated action value, so that one has a chance to discover actions that yield a higher reward than the currently favored one. In the ε-greedy method, the agent usually chooses the greedy action, but with a probability of ε it chooses any action at random, including the greedy action. Thus, if there are k = 4 possible actions and ε = 0.1, then the greedy action is chosen with a probability of (1 − ε) + ε/k = 0.925 and the others are chosen with a probability of 0.025 each.
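A minimal sketch of ε-greedy action selection over a list Q of estimated action values:

import random

def epsilon_greedy(Q, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(Q))             # explore (may hit the greedy action)
    return max(range(len(Q)), key=lambda a: Q[a])   # exploit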

To test the relative performance of the ε-greedy versus the greedy method, Sutton and Barto (2016) have empirically compared the two methods on a k-armed bandit problem with k = 10, each of the 10 actions having its own stationary Gaussian reward distribution with a variance of one and a mean sampled from a normalized Gaussian with zero mean and unit variance, see Figure 6.

Figure 6: Reward distributions of the 10 actions possible for the 10-armed bandit.
Figure: (Sutton and Barto, 2016, Fig. 2.1)6 © non-free.

The performances averaged over 2000 runs for 1000 time steps are shown in Figure 7. The greedy method is better at the very beginning but then levels off quickly and does not reach the performance of the ε-greedy method.

Figure 7: Performance comparison of the ε-greedy versus the greedy method with two different values (0.01 and 0.1) for ε (ε = 0 indicates the greedy method) for the 10-armed bandit example.
Figure: (Sutton and Barto, 2016, Fig. 2.2)7 © non-free.

The relative performances depend on the conditions. If there are no fluctuations in the rewards, then the greedy method might actually be better, because one execution of each action is sufficient to know the true action values. It is often useful to adapt the value of ε over time, for instance, start with a high value and let it decrease down to zero to arrive at a pure exploitation strategy. Or one could increase ε again when one suspects nonstationarities in the environment.

4.3.3 Optimistic initialization

A simple method to make an agent explore all possible actions is an optimistic initialization, i.e. the action-value estimates are initialized higher than one expects the actual values to be. Then even with the greedy method, all actions are chosen in turns until their estimates converge from the optimistic high values down to more realistic lower values. Eventually the optimal action remains at a higher level than the others and the greedy method starts exploiting the action-value estimates.

Figure 8 shows the performance of a greedy method on the 10-armed bandit example from above with optimistic and with realistic initialization. Even though suboptimal in the beginning, in the longer run optimistic initialization is clearly better.

Figure 8: Performance comparison of a greedy method with optimistic and with realistic initialization for the 10-armed bandit example.
Figure: (Sutton and Barto, 2016, Fig. 2.3)8 © non-free.

Optimistic initialization does not work with the sample-average method, of course, because the initial value is gone after one iteration, unless the reward is deterministic, in which case a single iteration is sufficient to estimate the action value exactly. It also does not work for non-stationary action values, because the optimistic initialization only has an effect in the beginning and cannot adapt to drifts in the action values.
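A minimal sketch of a purely greedy agent with optimistic initialization and constant step size; the reward function `pull` of the bandit is a hypothetical placeholder.

# Greedy selection with optimistically initialized estimates and update (21).

def run_optimistic_greedy(pull, k=10, steps=1000, q_init=5.0, alpha=0.1):
    Q = [q_init] * k                             # optimistic initial estimates
    for _ in range(steps):
        a = max(range(k), key=lambda i: Q[i])    # purely greedy selection
        Q[a] += alpha * (pull(a) - Q[a])         # constant-step update (21)
    return Q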

4.3.4 Upper-confidence-bound action selection

The ε-greedy method has the advantage over the greedy method that it occasionally also chooses actions with suboptimal estimated action values to explore their potential, i.e. it mixes a fraction ε of exploration into the dominating exploitation strategy. However, all actions are explored equally often, even those that are clearly suboptimal. The upper-confidence-bound (UCB) action selection method tries to do better and, while still exploring all actions, it does so more often for the promising alternatives than for the clearly suboptimal ones. It chooses actions according to

A_t := \arg\max_a \left( Q_t(a) + c \sqrt{\frac{\ln(t)}{N_t(a)}} \right) ,   (25)

where N_t(a) indicates how often action a has been chosen so far, and c scales the amount of exploration.

The second term is a bonus that is added to every action's estimated action value, so that actions with lower estimated action values also get a chance of being chosen and explored. It grows for all actions like √ln(t), which is rather slow but sufficient to give every action a chance. Once an action is chosen, N_t(a) in the denominator is increased by one, which reduces the bonus again. Due to the logarithm in the numerator, it takes longer and longer to compensate an increment in N_t(a) by increasing time t. Importantly, actions with low estimated action values need a larger bonus before they are explored, which takes more time to build up. Thus, promising non-greedy actions are explored more often than those with low estimated action values, and exploration reduces over time.
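A minimal sketch of UCB selection (25) over lists Q (estimates) and N (choice counts):

import math

def ucb_action(Q, N, t, c=2.0):
    """Pick the action maximizing Q(a) + c * sqrt(ln(t) / N(a))."""
    def score(a):
        if N[a] == 0:
            return float("inf")          # untried actions are chosen first
        return Q[a] + c * math.sqrt(math.log(t) / N[a])
    return max(range(len(Q)), key=score)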

Figure 9 shows the performance of the UCB action selection method as compared to the ε-greedy method. It performs better but is more difficult to generalize to more complicated reinforcement learning settings.

Figure 9: Performance comparison of the upper-confidence-bound (UCB) action selection method versus the ε-greedy method with ε = 0.1 for the 10-armed bandit example.
Figure: (Sutton and Barto, 2016, Fig. 2.4)9 © non-free.

4.3.5 Gradient bandit algorithms

So far, all methods were rather ad hoc. A more principled approach would be gradient ascent on the expectation value of the reward (17). When dealing with probabilities that must be positive and normalized to one, it is convenient to introduce unconstrained auxiliary variables H_t(a) and compute the probabilities from these with a soft-max operation, i.e.

π_t(a) := \frac{\exp(H_t(a))}{\sum_{a′} \exp(H_t(a′))} .   (26)

This avoids constraints in the formalism. Further down we also need the derivative

\frac{\partial π_t(a′)}{\partial H_t(a)}
  \overset{(26)}{=} \frac{\partial}{\partial H_t(a)} \frac{\exp(H_t(a′))}{\sum_{a′′} \exp(H_t(a′′))}   (27)

  = \frac{\frac{\partial \exp(H_t(a′))}{\partial H_t(a)} \sum_{a′′} \exp(H_t(a′′)) − \exp(H_t(a′)) \frac{\partial \sum_{a′′} \exp(H_t(a′′))}{\partial H_t(a)}}{\left( \sum_{a′′} \exp(H_t(a′′)) \right)^2}   (28)

  = \frac{δ_{aa′} \exp(H_t(a′)) \sum_{a′′} \exp(H_t(a′′)) − \exp(H_t(a′)) \exp(H_t(a))}{\left( \sum_{a′′} \exp(H_t(a′′)) \right)^2}   (29)

  = δ_{aa′} \frac{\exp(H_t(a′))}{\sum_{a′′} \exp(H_t(a′′))} − \frac{\exp(H_t(a′))}{\sum_{a′′} \exp(H_t(a′′))} \frac{\exp(H_t(a))}{\sum_{a′′} \exp(H_t(a′′))}   (30)

  = δ_{aa′} π_t(a′) − π_t(a′) π_t(a)   (31)

  = (δ_{aa′} − π_t(a)) π_t(a′) .   (32)

It is also interesting to note that because \sum_{a′} π_t(a′) = 1 we have

\sum_{a′} \frac{\partial π_t(a′)}{\partial H_t(a)} = 0 .   (33)

Gradient ascent on the expected reward (17) with respect to the auxiliary variables corresponds to the update rule

H_{t+1}(a) = H_t(a) + α \frac{\partial \langle R_t \rangle_\Omega}{\partial H_t(a)} .   (34)


However, the partial derivatives cannot be computed, because we do not know the true action values in (17), but they can be approximated by sampling the rewards, as can be seen as follows.

\frac{\partial \langle R_t \rangle_\Omega}{\partial H_t(a)}
  \overset{(17)}{=} \frac{\partial}{\partial H_t(a)} \sum_{a′} π_t(a′) q_*(a′)   (35)

  = \sum_{a′} q_*(a′) \frac{\partial π_t(a′)}{\partial H_t(a)}   (36)

  \overset{(33)}{=} \sum_{a′} (q_*(a′) − \bar{R}_t) \frac{\partial π_t(a′)}{\partial H_t(a)}   (see below for a discussion of \bar{R}_t)   (37)

  \overset{(32)}{=} \sum_{a′} (q_*(a′) − \bar{R}_t)(δ_{aa′} − π_t(a)) π_t(a′)   (38)

  \overset{(cf.\ 17)}{=} \left\langle (q_*(a′) − \bar{R}_t)(δ_{aa′} − π_t(a)) \right\rangle_{a′}   (39)

  \overset{(16)}{=} \left\langle (\langle R_t \rangle_{\Omega:A_t=a′} − \bar{R}_t)(δ_{aa′} − π_t(a)) \right\rangle_{a′}   (40)

  = \left\langle \left\langle (R_t − \bar{R}_t)(δ_{aA_t} − π_t(a)) \right\rangle_{\Omega:A_t=a′} \right\rangle_{a′}   (41)

  (since \bar{R}_t, δ_{aa′} and π_t(a) are constant and A_t = a′ across Ω : A_t = a′)

  \overset{(cf.\ 17)}{=} \left\langle (R_t − \bar{R}_t)(δ_{aA_t} − π_t(a)) \right\rangle_\Omega .   (42)

Here we see that the exact expression (42) for the gradient requires an ensemble average over Ω, which is impossible to compute. We therefore replace the ensemble average by the concrete sample that we get at each time point and hope that by averaging over time, i.e. choosing a sufficiently small value for α in the learning rule (34), we approximate the ensemble average reasonably well. Inserting (42) into (34) and dropping the ensemble average yields

H_{t+1}(a) \overset{(34,42)}{=} H_t(a) + α (R_t − \bar{R}_t)(δ_{aA_t} − π_t(a))   (43)

as a stochastic gradient approximation to the true gradient ascent rule.

This derivation holds for any value of \bar{R}_t, but the convergence of the learning rule depends on it. A common choice is to set \bar{R}_t to the average reward, which works well in practice and is the reason why we denote this variable by \bar{R}_t.
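A minimal sketch of one gradient bandit step (43), keeping the preferences in a numpy array H and using a running average reward as the baseline \bar{R}_t (here passed in as R_bar); all names are assumptions of this sketch.

import numpy as np

def gradient_bandit_step(H, action, reward, R_bar, alpha=0.1):
    pi = np.exp(H - H.max()); pi /= pi.sum()               # soft-max (26), shifted for numerical stability
    one_hot = np.zeros_like(H); one_hot[action] = 1.0      # delta_{a, A_t}
    return H + alpha * (reward - R_bar) * (one_hot - pi)   # update (43)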

Figure 10 shows the performance of the gradient bandit method with and without a baseline value \bar{R}_t.

Figure 10: Performance comparison of the gradient bandit method for the 10-armed bandit example.
Figure: (Sutton and Barto, 2016, Fig. 2.5)10 © non-free.

4.4 Associative search (contextual bandits)

The k-armed bandit problem considered above corresponds to a single state with k actions. If the agent is randomly faced with one of several k-armed bandits but each time it knows which bandit it is confronted with, then this is a version of an associative search task, or contextual bandits. It has multiple states with multiple actions resulting in some immediate reward, but the actions do not influence the next state. The solution to this task is simple. The agent can treat each bandit separately and apply one of the methods above multiple times in parallel. If we have a setting where the action taken not only influences the immediate reward but also the next state, then we have a real reinforcement learning problem.

4.5 Summary

We have learned about different methods to learn the action values of a k-armed bandit. The greedy method seems to be practical only in very specific situations. ε-greedy and greedy with optimistic initialization are ad hoc but simple, robust, and generally applicable. Upper-confidence-bound action selection and gradient bandit algorithms are more refined but harder to generalize to the full reinforcement learning problem. Figure 11 shows the performance, measured as the average reward over the first 1000 steps, of the latter four methods in comparison as a function of their respective parameter.

Figure 11: Performance comparison of four methods for the 10-armed bandit example. Each curve shows the performance of one method as a function of its parameter.
Figure: (Sutton and Barto, 2016, Fig. 2.6)11 © non-free.

These methods are clearly suboptimal but work in practice. There exist even more sophisticated methods, such as Bayesian methods, but they are difficult to apply in practice for larger RL problems.

References

Harmon, M. E. and Harmon, S. S. (1996). Reinforcement learning: A tutorial. http://citeseer.ist.psu.edu/harmon96reinforcement.html.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285. http://www.jair.org/papers/paper301.html.

Sutton, R. S. and Barto, A. G. (2016). Reinforcement Learning: An Introduction (Draft). A Bradford Book. The MIT Press, second (in progress) edition. Downloaded 2017-01-30 from http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf.

ten Hagen, S. and Krose, B. (1997). A short introduction to reinforcement learning. In Daelemans, W., Flach, P., and van den Bosch, A., editors, Proc. of the 7th Belgian-Dutch Conf. on Machine Learning, pages 7–12, Tilburg. http://citeseer.ist.psu.edu/tenhagen97short.html.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Commun. ACM, 38:58–68.


Notes

1. Andre-Pierre du Plessis, 2009, Flickr, © CC BY 2.0, https://www.flickr.com/photos/andrepierre/4240154606/

2. Ptkfgs, 2007, Wikimedia, © public domain, https://commons.wikimedia.org/wiki/File:Backgammon_lg.jpg

3. Tesauro, 1995, Commun. ACM 38:58-68, Fig. 1, © non-free, http://www.bkgm.com/articles/tesauro/tdl.html

4. Tesauro, 1995, Commun. ACM 38:58-68, Tab. 1, © non-free, http://www.bkgm.com/articles/tesauro/tdl.html

5. Tesauro, 1995, Commun. ACM 38:58-68, Fig. 2, Tab. 2, © non-free, http://www.bkgm.com/articles/tesauro/tdl.html

6. Sutton & Barto, 2016, Reinforcement Learning: An Introduction (Draft), The MIT Press, second (in progress) edition, Fig. 2.1, © non-free, http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf

7. Sutton & Barto, 2016, Reinforcement Learning: An Introduction (Draft), The MIT Press, second (in progress) edition, Fig. 2.2, © non-free, http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf

8. Sutton & Barto, 2016, Reinforcement Learning: An Introduction (Draft), The MIT Press, second (in progress) edition, Fig. 2.3, © non-free, http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf

9. Sutton & Barto, 2016, Reinforcement Learning: An Introduction (Draft), The MIT Press, second (in progress) edition, Fig. 2.4, © non-free, http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf

10. Sutton & Barto, 2016, Reinforcement Learning: An Introduction (Draft), The MIT Press, second (in progress) edition, Fig. 2.5, © non-free, http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf

11. Sutton & Barto, 2016, Reinforcement Learning: An Introduction (Draft), The MIT Press, second (in progress) edition, Fig. 2.6, © non-free, http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf
