
Dr. Cédric Pradalier

Institut für Robotik und Intelligente Systeme

    Autonomous Systems Lab

    CLA E 16.1

Tannenstrasse 3

8092 Zürich

    [email protected]

    www.asl.ethz.ch/people/cedricp

Exercise Sheet 12 Solutions

Topics: Introduction to Reinforcement Learning

    1 N-armed Bandit Machine

In order to simulate an n-armed bandit machine with 10 arms, the Matlab function given in algorithm 1 can be used.

    function reward = narmedbandit(arm)
    % The first time the function is run, initialise the expected rewards.
    global expected_rewards;
    if isempty(expected_rewards)
        disp('Generating expected rewards');
        expected_rewards = rand(10,1);
    end
    % The reward is 1 if rand is smaller than the expected reward, 0 otherwise.
    reward = (rand < expected_rewards(arm));

    Algorithm 1: The n-armed bandit simulator
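As a quick sanity check (not part of the original sheet), the simulator can be pulled repeatedly on one arm and the empirical mean reward compared with the hidden expected reward; the arm index 3 and the number of pulls below are arbitrary choices:

    % Hypothetical sanity check of the simulator: pull arm 3 many times and
    % compare the average observed reward with the hidden expected reward.
    npulls = 1000;
    pulls = zeros(npulls, 1);
    for n = 1:npulls
        pulls(n) = narmedbandit(3);
    end
    global expected_rewards;
    fprintf('empirical mean: %.3f, expected: %.3f\n', mean(pulls), expected_rewards(3));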

    The basic greedy learning algorithm is given in algorithm 2.

(a) When running this algorithm, it does not seem to improve with time. Explain what is missing.

Because this algorithm is greedy, it always chooses the action with the highest estimated value. Since the action values are initialised to zero, the first action tried never appears worse than any action not tried. As a result, the greedy algorithm always selects the first action: it always exploits and never explores.

(b) By modifying algorithm 2, compare the following variants of the learning algorithm:

ε-Greedy, with ε = 0.1.

ε-Greedy, with ε = 0.01.

Greedy, with optimistic initial values.



    % Number of games
    ngames = 30;
    % Number of learning iterations per game
    niter = 500;
    disp('Greedy learning');
    % Reward initialisation
    rwd = zeros(ngames, niter);
    for k = 1:ngames
        % Initial action values
        Q = zeros(10,1);
        for i = 1:niter
            % Select the best action
            [v, j] = max(Q);
            % Evaluate the action reward
            rwd(k,i) = narmedbandit(j);
            % Update the action value
            Q(j) = Q(j) + (rwd(k,i) - Q(j))/(i+1);
        end
    end
    plot(mean(rwd));

    Algorithm 2: The basic greedy algorithm

ε-Greedy, with optimistic initial values and ε = 0.1.

Which one is the most efficient? Is it useful to have an ε-strategy in addition to the optimistic initial values? In which cases? Plot the average reward per iteration of each algorithm on the same graph (as on slide 12).

Depending on the expected gains of the n-armed bandit, two situations can be observed:

if the first action is the one with the best reward, then the greedy strategies will perform best, because most of the time they select the best action.



[Figure: average reward per iteration for the Greedy, Optimistic, Eps Greedy, Eps/10 Greedy and Eps Optimistic variants, when the first action is the best.]

in other cases, the optimistic approaches are better, because they encourage exploration and help the system find the best action earlier.

[Figure: average reward per iteration for the Greedy, Optimistic, Eps Greedy, Eps/10 Greedy and Eps Optimistic variants, when the first action is not the best.]

The ε-strategy and the optimistic initialisation are not incompatible in general, but combining them is not necessary on a static system. The goal of the ε-strategy is to have a system that keeps exploring over time. This allows it to adapt to slow changes in the observed system. The optimistic initialisation helps the system to explore at the beginning. If the system is static, it can gather enough information at the beginning so as not to need the ε-strategy.
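For illustration, here is a minimal sketch of how the inner loop of algorithm 2 could be modified to obtain these variants; the parameter names epsilon and Qinit are assumptions, not from the original sheet (epsilon = 0 gives the greedy variants, Qinit > 0 the optimistic ones):

    epsilon = 0.1;            % exploration probability (0.1, 0.01, or 0 for pure greedy)
    Qinit = 1;                % optimistic initial value (0 for the basic variants)
    Q = Qinit * ones(10,1);   % initial action values
    for i = 1:niter
        if rand < epsilon
            j = randi(10);    % explore: pick a random arm
        else
            [v, j] = max(Q);  % exploit: pick the arm with the highest estimated value
        end
        rwd(k,i) = narmedbandit(j);
        Q(j) = Q(j) + (rwd(k,i) - Q(j))/(i+1);
    end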

(c) The algorithm above estimates the expected reward for each action. Why is this implementation very sensitive to the initial actions tried? How can the implementation above be slightly changed to obtain a system capable of adapting to slowly changing gains? Regenerate the plot of average reward per iteration for all the variants in question (b).

This implementation multiplies the observed difference by 1/(i + 1). After a large number of iterations, this coefficient becomes too small to compensate for errors learnt at the beginning.



Using a constant learning coefficient (0.1 for instance) would solve the problem. This also reduces the difference between the cases where the first choice is the best or not.
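A minimal sketch of the corresponding change to the update line of algorithm 2, with an assumed constant step size alpha:

    alpha = 0.1;   % constant learning rate (assumed value)
    % Recent rewards always carry a fixed weight, so slowly drifting gains are
    % tracked instead of being averaged away.
    Q(j) = Q(j) + alpha * (rwd(k,i) - Q(j));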

    If the first action is the best:

[Figure: average reward per iteration over 500 iterations, with a constant learning rate, for the Greedy, Optimistic, Eps Greedy, Eps/10 Greedy and Eps Optimistic variants, when the first action is the best.]

    Otherwise:

[Figure: average reward per iteration over 500 iterations, with a constant learning rate, for the Greedy, Optimistic, Eps Greedy, Eps/10 Greedy and Eps Optimistic variants, when the first action is not the best.]

2 Bellman's equations

In class, we have seen the recursive form of Bellman's equations:

V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]    (1)

and

Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]    (2)



Using the definitions of the Q and V functions and of the expected reward function, give the proof of these equations.

The value function is defined as follows:

V^*(s) = E[R_t \mid s_t = s]    (3)

       = E[r_t + \gamma R_{t+1} \mid s_t = s]    (4)

       = \sum_a \pi^*(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]    (5)

Because V^* is the state value function of the optimal policy, the greedy policy with respect to V^* is optimal. This policy can be written as:

\pi^*(s_t) = \arg\max_{a_t} E\left[ R^{a_t}_{s_t s'} + \gamma V^*(s') \mid s = s_t, a = a_t \right]

    From this definition, we have immediately:

V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]    (6)

    In the same way,

Q^*(s, a) = E[R_t \mid s_t = s, a_t = a]    (7)

          = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi^*(s', a') \, Q^*(s', a') \right]    (8)

Since Q^* is the action value function of the optimal policy, the greedy policy with respect to Q^* is optimal, and the sum over a' can be replaced by a maximum:

Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]    (9)
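To illustrate how equation (1) is applied in practice, here is a minimal sketch (not part of the original sheet) of value iteration on a small, hypothetical 2-state, 2-action MDP; the transition array P, the reward array R and the discount factor gamma are assumptions chosen only for the example:

    % Value iteration: repeatedly apply equation (1) as a fixed-point update.
    gamma = 0.9;                        % assumed discount factor
    P = zeros(2,2,2); R = zeros(2,2,2); % P(s,a,s') transitions, R(s,a,s') rewards
    P(1,1,:) = [0.8 0.2]; R(1,1,:) = [0 1];
    P(1,2,:) = [0.1 0.9]; R(1,2,:) = [0 2];
    P(2,1,:) = [0.5 0.5]; R(2,1,:) = [1 0];
    P(2,2,:) = [0.9 0.1]; R(2,2,:) = [2 0];
    V = zeros(2,1);
    for it = 1:1000
        Vnew = zeros(2,1);
        for s = 1:2
            q = zeros(2,1);
            for a = 1:2
                % q(a) = sum_s' P(s,a,s') [ R(s,a,s') + gamma V(s') ]
                q(a) = squeeze(P(s,a,:))' * (squeeze(R(s,a,:)) + gamma * V);
            end
            Vnew(s) = max(q);           % V(s) = max_a q(a), i.e. equation (1)
        end
        if max(abs(Vnew - V)) < 1e-9, break; end
        V = Vnew;
    end
    disp(V);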

    3 Applicability of reinforcement learning

For each of the following problems, state whether it can be modelled as a reinforcement learning problem and, if so, what the environment, the state, the goal, the actions and the reward are.

(a) A mobile robot must find a source of light in its environment. This can be modelled as a reinforcement learning problem. The environment is the experimental floor, the state is the position of the robot on the floor, and the actions are the robot movements.



The goal is to maximise the received light, and the reward is the output of the light sensor.

Because this can be seen as a continuing (never-ending) task, the reward will certainly need to be discounted.

(b) A robotic arm is used to play "ball-in-a-cup" (see picture).

This can be implemented as a reinforcement learning problem (and indeed it has been). The environment is the 3D space of the robot. Its state is its pose (all joints). The goal is to put the ball in the cup. The reward would be -1 per step during the trial, until the ball lands in the cup, at which point the episode ends.

(c) A robotic delivery truck must deliver parcels to all the institutes of ETH. This could be implemented using reinforcement learning, but it is essentially a route-planning problem, for which more efficient specialised solutions exist.

(d) A robotic helicopter needs to take off from, hover above, and land on its launch pad. This can be implemented as a RL problem. The state is the 6D pose of the helicopter, and the environment is the 3D space in which the helicopter moves. For take-off and landing, the helicopter would get a reward of -1 for any time step where the task has not been completed. It could also get a reward of -1000 when it crashes. For the hovering task, the reward would be -1 for each time step where the state is not the hovering state, and 0 otherwise.
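As a sketch of the hovering reward described above (the function name hover_reward, the tolerance and the crash flag are illustrative assumptions, not from the original sheet):

    function r = hover_reward(state, hover_state, crashed)
        % Reward for the hovering task: -1000 on a crash, 0 when the 6D state is
        % close enough to the hovering state, -1 otherwise.
        tol = 0.1;                            % assumed tolerance on the 6D state
        if crashed
            r = -1000;
        elseif norm(state - hover_state) < tol
            r = 0;
        else
            r = -1;
        end
    end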
