7/29/2019 solution9_12.pdf
Dr. Cédric Pradalier
Institut für Robotik und Intelligente Systeme
Autonomous Systems Lab
CLA E 16.1
Tannenstraße 3
8092 Zürich
www.asl.ethz.ch/people/cedricp
Exercise Sheet 12 Solutions
Topics: Introduction to Reinforcement Learning
1 N-armed Bandit Machine
In order to simulate an n-armed bandit machine with 10 arms, the MATLAB function given in algorithm 1 can be used.
function reward = narmedbandit(arm)
% The first time the function is run, initialise the expected rewards.
global expected_rewards;
if isempty(expected_rewards)
    disp('Generating expected rewards');
    expected_rewards = rand(10,1);
end
% The reward is 1 if rand is smaller than the expected reward, 0 otherwise.
reward = (rand < expected_rewards(arm));
Algorithm 1: The n-armed bandit simulator
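The MATLAB simulator above can be sketched in Python as follows. The class name `NArmedBandit` and the use of NumPy's `Generator` are illustrative choices, not part of the exercise; the hidden per-arm probabilities play the role of the global `expected_rewards`.

```python
import numpy as np

class NArmedBandit:
    """Bernoulli n-armed bandit: arm i pays 1 with a fixed hidden probability."""

    def __init__(self, n_arms=10, rng=None):
        self.rng = rng or np.random.default_rng()
        # Hidden expected reward of each arm, drawn once at construction
        # (the equivalent of the global initialised on the first call).
        self.expected_rewards = self.rng.random(n_arms)

    def pull(self, arm):
        # Reward is 1 if a uniform draw falls below the arm's expected reward.
        return int(self.rng.random() < self.expected_rewards[arm])
```

Keeping the hidden probabilities in an object rather than a global makes it easy to run several independent games side by side.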
The basic greedy learning algorithm is given in algorithm 2.
(a) When running this algorithm, it does not seem to improve with time. Explain what is missing.
Because this algorithm is greedy, it always chooses the best action. As the action values are initialised to zero, the first action tried always looks better than any action not yet tried. As a result, the greedy algorithm always selects the first action: it always exploits and never explores.
(b) By modifying algorithm 2, compare the following variants of the learning algorithm:

ε-Greedy, with ε = 0.1.
ε-Greedy, with ε = 0.01.
Greedy, with optimistic initial values.
% Number of games
ngames = 30;
% Number of learning iterations per game
niter = 500;
disp('Greedy learning');
% Reward initialisation
rwd = zeros(ngames, niter);
for k = 1:ngames
    % Initial action values
    Q = zeros(10,1);
    for i = 1:niter
        % Select best action
        [v, j] = max(Q);
        % Evaluate action reward
        rwd(k,i) = narmedbandit(j);
        % Update action value
        Q(j) = Q(j) + (rwd(k,i) - Q(j))/(i+1);
    end
end
plot(mean(rwd));

Algorithm 2: The basic greedy algorithm
ε-Greedy, with optimistic initial values and ε = 0.1.
Which one is the most efficient? Is it useful to have an ε-strategy in addition to the optimistic initial values? In which cases? Plot the average reward per iteration of each algorithm on the same graph (as on slide 12).
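The four variants can be obtained from Algorithm 2 by changing only the action selection and the initial action values. A Python sketch of this, assuming a Bernoulli bandit regenerated per game (the function name `run_bandit` and the parameter defaults are illustrative):

```python
import numpy as np

def run_bandit(epsilon=0.0, q_init=0.0, ngames=30, niter=500, n_arms=10, seed=0):
    """Average reward per iteration for an (epsilon-)greedy learner.

    epsilon : exploration probability (0 gives the pure greedy learner).
    q_init  : initial action value; a large value gives optimistic initialisation.
    """
    rng = np.random.default_rng(seed)
    rwd = np.zeros((ngames, niter))
    for k in range(ngames):
        expected = rng.random(n_arms)        # hidden arm probabilities for this game
        Q = np.full(n_arms, float(q_init))   # action-value estimates
        for i in range(niter):
            if rng.random() < epsilon:
                j = int(rng.integers(n_arms))  # explore: random arm
            else:
                j = int(np.argmax(Q))          # exploit: current best arm
            r = int(rng.random() < expected[j])
            rwd[k, i] = r
            # Sample-average update, as in Algorithm 2 (the divisor uses the
            # iteration count, not the per-arm pull count; i+2 here matches
            # MATLAB's i+1 with 1-based loop indices).
            Q[j] += (r - Q[j]) / (i + 2)
    return rwd.mean(axis=0)                  # average reward per iteration
```

Calling `run_bandit()`, `run_bandit(epsilon=0.1)`, `run_bandit(epsilon=0.01)`, `run_bandit(q_init=1.0)` and `run_bandit(epsilon=0.1, q_init=1.0)` and plotting the returned curves reproduces the comparison asked for in the question.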
Depending on the expected gains of the n-armed bandit, two situations can be observed:

if the first action is the one with the best reward, then the greedy strategies will perform best, because most of the time they will select the best action.
[Plot: average reward per iteration when the first action is the best arm; curves: Greedy, Optimistic, Eps Greedy, Eps/10 Greedy, Eps Optimistic]
in other cases, the optimistic approaches are better, because they encourage exploration and help the system find the best action earlier.
[Plot: average reward per iteration when the first action is not the best arm; curves: Greedy, Optimistic, Eps Greedy, Eps/10 Greedy, Eps Optimistic]
The ε-strategy and the optimistic initialisation are not incompatible in general, but the ε-strategy is not necessary on a static system. The goal of the ε-strategy is to have a system that keeps exploring over time. This allows it to adapt to slow changes in the observed system. The optimistic initialisation helps the system explore at the beginning. If the system is static, it can gather enough information at the beginning so as not to need the ε-strategy.
(c) The algorithm above estimates the expected reward for each action. Why is this implementation very sensitive to the initial actions tried? How can the implementation above be slightly changed to obtain a system capable of adapting to slowly changing gains? Regenerate the plot of average reward per iteration for all the variants in question (b).
This implementation multiplies the observed difference by 1/(i + 1). After a large number of iterations, this coefficient becomes too small to compensate for errors learnt at the
beginning. Using a constant learning coefficient (0.1 for instance) would solve the problem. This also reduces the difference between the cases where the first choice is the best or not.
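The fix amounts to replacing the shrinking 1/(i + 1) divisor with a constant step size α. A minimal sketch of the modified update (the function name is an illustrative choice; α = 0.1 as suggested above):

```python
def update_constant_alpha(Q, arm, reward, alpha=0.1):
    """Constant-step-size update: recent rewards keep a fixed weight,
    so the estimate can track slowly changing arm probabilities."""
    Q[arm] += alpha * (reward - Q[arm])
    return Q
```

With a constant α, the estimate is an exponentially weighted average of past rewards, so observations made long ago are gradually forgotten.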
If the first action is the best:
[Plot: average reward per iteration with a constant learning coefficient, first action is the best arm; curves: Greedy, Optimistic, Eps Greedy, Eps/10 Greedy, Eps Optimistic]
Otherwise:
[Plot: average reward per iteration with a constant learning coefficient, first action is not the best arm; curves: Greedy, Optimistic, Eps Greedy, Eps/10 Greedy, Eps Optimistic]
2 Bellman's equations
In class, we have seen the recursive form of the Bellman optimality equations:

V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]   (1)

and

Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]   (2)
Using the definitions of the Q and V functions and of the expected reward function, give the proof of these equations.

The value function is defined as follows:
V^*(s) = E[R_t \mid s_t = s]   (3)
       = E[r_t + \gamma R_{t+1} \mid s_t = s]   (4)
       = \sum_a \pi^*(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]   (5)
Because V^* is the state value for the optimal policy, the greedy policy with respect to V^* is optimal. This policy can be written as:

\pi^*(s_t) = \arg\max_{a_t} E\left[ R^{a_t}_{s_t s_{t+1}} + \gamma V^*(s_{t+1}) \mid s = s_t, a = a_t \right]
From this definition, we have immediately:

V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]   (6)
In the same way,

Q^*(s, a) = E[R_t \mid s_t = s, a_t = a]   (7)
          = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi^*(s', a') Q^*(s', a') \right]   (8)
If Q^* is optimal, then the greedy policy with respect to Q^* is optimal and the sum over a' can be replaced by a maximum:

Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]   (9)
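The fixed-point character of equation (6) can also be checked numerically. The sketch below runs value iteration on a made-up 2-state, 2-action MDP (the transition and reward arrays are invented for illustration) and verifies that the resulting V satisfies the Bellman optimality equation:

```python
import numpy as np

# Made-up MDP: P[a, s, s'] transition probabilities, R[a, s, s'] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [0.0, 1.0]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman backup: V(s) = max_a sum_s' P[a,s,s'] (R[a,s,s'] + gamma V(s'))
    V = np.max(np.sum(P * (R + gamma * V), axis=2), axis=0)

# At the fixed point, one more backup leaves V unchanged.
backup = np.max(np.sum(P * (R + gamma * V), axis=2), axis=0)
assert np.allclose(V, backup)
```

Because the backup operator is a contraction with factor γ < 1, repeated application converges to the unique solution of equation (6) from any starting point.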
3 Applicability of reinforcement learning
For each of the following problems, state whether it can be modelled as a reinforcement learning problem, and if so, what the environment, the state, the goal, the actions and the reward are.
(a) A mobile robot must find a source of light in its environment. This can be modelled as a reinforcement learning problem. The environment is the experimental floor, the state is the position of the robot on the floor, and the actions are the robot movements.
The goal is to maximise the received light, and the reward is the output of the light sensor. Because this can be seen as a continuing problem, the reward will certainly need to be discounted.
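This modelling can be sketched with tabular Q-learning on a hypothetical 1-D corridor, with the light source at the right end and the reward a light reading that grows toward the source. The corridor length, reward shape and learning parameters are all illustrative assumptions, not part of the exercise:

```python
import numpy as np

def light_seeking_q_learning(n_cells=10, episodes=200, alpha=0.5, gamma=0.9,
                             epsilon=0.1, seed=0):
    """Tabular Q-learning on a 1-D corridor; the light sensor reading (the
    reward) grows linearly toward the source in the last cell."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_cells, 2))            # actions: 0 = move left, 1 = move right
    for _ in range(episodes):
        s = 0                             # the robot starts away from the light
        for _ in range(3 * n_cells):
            # epsilon-greedy action selection
            a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s2 = min(max(s + (1 if a == 1 else -1), 0), n_cells - 1)
            r = s2 / (n_cells - 1)        # light reading: brightest at the source
            # Q-learning update with discounted reward
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2
    return Q

Q = light_seeking_q_learning()
```

After learning, the greedy policy derived from Q should steer the robot toward the light from any cell.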
(b) A robotic arm is used to play "ball-in-a-cup". See picture:
This can be implemented as a reinforcement learning problem (and indeed it has been). The environment is the 3D space of the robot. Its state is its pose (all joints). The goal is to put the ball in the cup. The reward would be -1 per step during the trial, until the ball arrives in the cup. At this point the episode ends.
(c) A robotic delivery truck must deliver parcels to all the institutes of ETH. This could be implemented using reinforcement learning, but there are more efficient specialised solutions.
(d) A robotic helicopter needs to take off from, hover over and land on its launch pad. This can be implemented as an RL problem. The state is the 6D pose of the helicopter, and the environment is the 3D space in which the helicopter moves. For take-off and landing, the helicopter would get a -1 reward for any time step where the task has not been completed. It could also get a -1000 reward when it crashes. For the hovering task, the reward would be -1 for each timestep where the state is not the hovering state, and 0 otherwise.