
Dr. Cédric Pradalier

Institut für Robotik und Intelligente Systeme

    Autonomous Systems Lab

    CLA E 16.1

Tannenstrasse 3

8092 Zürich

    [email protected]

    www.asl.ethz.ch/people/cedricp

Exercise Sheet 12 Solutions

Topics: Introduction to Reinforcement Learning

    1 N-armed Bandit Machine

In order to simulate an n-armed bandit machine with 10 arms, the Matlab function given in algorithm 1 can be used.

    function reward = narmedbandit(arm)
    % The first time the function is run, initialise the expected rewards.
    global expected_rewards;
    if isempty(expected_rewards)
        disp('Generating expected rewards');
        expected_rewards = rand(10,1);
    end
    % The reward is 1 if rand is smaller than the expected reward, 0 otherwise.
    reward = (rand < expected_rewards(arm));

    Algorithm 1: The n-armed bandit simulator
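As a quick sanity check (not part of the original sheet), the simulator can be pulled repeatedly on one arm and the empirical mean reward compared with the hidden expected reward; the arm index 3 and the number of pulls below are arbitrary choices:

    % Hypothetical sanity check of the simulator: pull arm 3 many times and
    % compare the average observed reward with the hidden expected reward.
    npulls = 1000;
    pulls = zeros(npulls, 1);
    for n = 1:npulls
        pulls(n) = narmedbandit(3);
    end
    global expected_rewards;
    fprintf('empirical mean: %.3f, expected: %.3f\n', mean(pulls), expected_rewards(3));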

    The basic greedy learning algorithm is given in algorithm 2.

(a) When running this algorithm, it does not seem to improve with time. Explain what is missing.

Because this algorithm is greedy, it always chooses the action with the highest estimated value. Since the action values are initialised to zero, the first action tried never appears worse than any action not tried. As a result, the greedy algorithm always selects the first action: it always exploits and never explores.

(b) By modifying algorithm 2, compare the following variants of the learning algorithm:

ε-Greedy, with ε = 0.1.

ε-Greedy, with ε = 0.01.

Greedy, with optimistic initial values.



    % Number of games
    ngames = 30;
    % Number of learning iterations per game
    niter = 500;
    disp('Greedy learning');
    % Reward initialisation
    rwd = zeros(ngames, niter);
    for k = 1:ngames
        % Initial action values
        Q = zeros(10,1);
        for i = 1:niter
            % Select the best action
            [v, j] = max(Q);
            % Evaluate the action reward
            rwd(k,i) = narmedbandit(j);
            % Update the action value
            Q(j) = Q(j) + (rwd(k,i) - Q(j))/(i+1);
        end
    end
    plot(mean(rwd));

    Algorithm 2: The basic greedy algorithm

ε-Greedy, with optimistic initial values and ε = 0.1.

Which one is the most efficient? Is it useful to have an ε-strategy in addition to the optimistic initial values? In which cases? Plot the average reward per iteration of each algorithm on the same graph (as on slide 12).

Depending on the expected gains of the n-armed bandit, two situations can be observed:

if the first action is the one with the best reward, then the greedy strategies will perform best, because most of the time they select the best action.



[Figure: average reward per iteration for the Greedy, Optimistic, Eps Greedy, Eps/10 Greedy and Eps Optimistic variants, when the first action is the best.]

in other cases, the optimistic approaches are better, because they encourage exploration and help the system find the best action earlier.

[Figure: average reward per iteration for the Greedy, Optimistic, Eps Greedy, Eps/10 Greedy and Eps Optimistic variants, when the first action is not the best.]

The ε-strategy and the optimistic initialisation are not incompatible in general, but combining them is not necessary on a static system. The goal of the ε-strategy is to have a system that keeps exploring over time. This allows it to adapt to slow changes in the observed system. The optimistic initialisation helps the system to explore at the beginning. If the system is static, it can gather enough information at the beginning so as not to need the ε-strategy.
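For illustration, here is a minimal sketch of how the inner loop of algorithm 2 could be modified to obtain these variants; the parameter names epsilon and Qinit are assumptions, not from the original sheet (epsilon = 0 gives the greedy variants, Qinit > 0 the optimistic ones):

    epsilon = 0.1;            % exploration probability (0.1, 0.01, or 0 for pure greedy)
    Qinit = 1;                % optimistic initial value (0 for the basic variants)
    Q = Qinit * ones(10,1);   % initial action values
    for i = 1:niter
        if rand < epsilon
            j = randi(10);    % explore: pick a random arm
        else
            [v, j] = max(Q);  % exploit: pick the arm with the highest estimated value
        end
        rwd(k,i) = narmedbandit(j);
        Q(j) = Q(j) + (rwd(k,i) - Q(j))/(i+1);
    end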

(c) The algorithm above estimates the expected reward for each action. Why is this implementation very sensitive to the initial actions tried? How can the implementation above be slightly changed to obtain a system capable of adapting to slowly changing gains? Regenerate the plot of average reward per iteration for all the variants in question (b).

This implementation multiplies the observed difference by 1/(i + 1). After a large number of iterations, this coefficient becomes too small to compensate for errors learnt at the beginning.



Using a constant learning coefficient (0.1 for instance) would solve the problem. This also reduces the difference between the cases where the first choice is the best or not.
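A minimal sketch of the corresponding change to the update line of algorithm 2, with an assumed constant step size alpha:

    alpha = 0.1;   % constant learning rate (assumed value)
    % Recent rewards always carry a fixed weight, so slowly drifting gains are
    % tracked instead of being averaged away.
    Q(j) = Q(j) + alpha * (rwd(k,i) - Q(j));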

    If the first action is the best:

[Figure: average reward per iteration over 500 iterations, with a constant learning rate, for the Greedy, Optimistic, Eps Greedy, Eps/10 Greedy and Eps Optimistic variants, when the first action is the best.]

    Otherwise:

[Figure: average reward per iteration over 500 iterations, with a constant learning rate, for the Greedy, Optimistic, Eps Greedy, Eps/10 Greedy and Eps Optimistic variants, when the first action is not the best.]

2 Bellman's equations

In class, we have seen the recursive form of Bellman's equations:

V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]    (1)

and

Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]    (2)



Using the definitions of the Q and V functions and of the expected reward function, give the proof of these equations.

The value function is defined as follows:

V^*(s) = E[R_t \mid s_t = s]    (3)

       = E[r_t + \gamma R_{t+1} \mid s_t = s]    (4)

       = \sum_a \pi^*(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]    (5)

Because V^* is the state value function of the optimal policy, the greedy policy with respect to V^* is optimal. This policy can be written as:

\pi^*(s_t) = \arg\max_{a_t} E\left[ R^{a_t}_{s_t s'} + \gamma V^*(s') \mid s = s_t, a = a_t \right]

    From this definition, we have immediately:

V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]    (6)

    In the same way,

Q^*(s, a) = E[R_t \mid s_t = s, a_t = a]    (7)

          = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi^*(s', a') \, Q^*(s', a') \right]    (8)

Since Q^* is the action value function of the optimal policy, the greedy policy with respect to Q^* is optimal, and the sum over a' can be replaced by a maximum:

Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]    (9)
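To illustrate how equation (1) is applied in practice, here is a minimal sketch (not part of the original sheet) of value iteration on a small, hypothetical 2-state, 2-action MDP; the transition array P, the reward array R and the discount factor gamma are assumptions chosen only for the example:

    % Value iteration: repeatedly apply equation (1) as a fixed-point update.
    gamma = 0.9;                        % assumed discount factor
    P = zeros(2,2,2); R = zeros(2,2,2); % P(s,a,s') transitions, R(s,a,s') rewards
    P(1,1,:) = [0.8 0.2]; R(1,1,:) = [0 1];
    P(1,2,:) = [0.1 0.9]; R(1,2,:) = [0 2];
    P(2,1,:) = [0.5 0.5]; R(2,1,:) = [1 0];
    P(2,2,:) = [0.9 0.1]; R(2,2,:) = [2 0];
    V = zeros(2,1);
    for it = 1:1000
        Vnew = zeros(2,1);
        for s = 1:2
            q = zeros(2,1);
            for a = 1:2
                % q(a) = sum_s' P(s,a,s') [ R(s,a,s') + gamma V(s') ]
                q(a) = squeeze(P(s,a,:))' * (squeeze(R(s,a,:)) + gamma * V);
            end
            Vnew(s) = max(q);           % V(s) = max_a q(a), i.e. equation (1)
        end
        if max(abs(Vnew - V)) < 1e-9, break; end
        V = Vnew;
    end
    disp(V);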

    3 Applicability of reinforcement learning

For each of the following problems, state whether it can be modelled as a reinforcement learning problem and, if so, what the environment, the state, the goal, the actions and the reward are.

(a) A mobile robot must find a source of light in its environment. This can be modelled as a reinforcement learning problem. The environment is the experimental floor, the state is the position of the robot on the floor, and the actions are the robot movements.



The goal is to maximise the received light, and the reward is the output of the light sensor.

Because this can be seen as a continuing (never-ending) task, the reward will certainly need to be discounted.

(b) A robotic arm is used to play "ball-in-a-cup" (see picture).

This can be implemented as a reinforcement learning problem (and indeed it has been). The environment is the 3D space of the robot. Its state is its pose (all joints). The goal is to put the ball in the cup. The reward would be -1 per step during the trial, until the ball lands in the cup, at which point the episode ends.

(c) A robotic delivery truck must deliver parcels to all the institutes of ETH. This could be implemented using reinforcement learning, but it is essentially a route-planning problem, for which more efficient specialised solutions exist.

(d) A robotic helicopter needs to take off from, hover above, and land on its launch pad. This can be implemented as a RL problem. The state is the 6D pose of the helicopter, and the environment is the 3D space in which the helicopter moves. For take-off and landing, the helicopter would get a reward of -1 for any time step where the task has not been completed. It could also get a reward of -1000 when it crashes. For the hovering task, the reward would be -1 for each time step where the state is not the hovering state, and 0 otherwise.
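As a sketch of the hovering reward described above (the function name hover_reward, the tolerance and the crash flag are illustrative assumptions, not from the original sheet):

    function r = hover_reward(state, hover_state, crashed)
        % Reward for the hovering task: -1000 on a crash, 0 when the 6D state is
        % close enough to the hovering state, -1 otherwise.
        tol = 0.1;                            % assumed tolerance on the 6D state
        if crashed
            r = -1000;
        elseif norm(state - hover_state) < tol
            r = 0;
        else
            r = -1;
        end
    end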
