Kshitij Judah, Saikat Roy, Alan Fern, Tom Dietterich
Reinforcement Learning via Practice and Critique Advice
AAAI-2010 Atlanta, GA
PROBLEM: Usually RL takes a long time to learn a good policy.
Reinforcement Learning (RL)
[Diagram: the agent receives state and reward from the Environment and emits actions; a Teacher observes the agent's behavior and provides advice.]
RESEARCH QUESTION: Can we make RL perform better with some outside help, such as critique/advice from a teacher, and if so, how?
GOALS: Non-technical users as teachers; natural interaction methods.
RL via Practice + Critique Advice
[Diagram: the system alternates between a Practice Session, in which the current policy parameters are executed to produce Trajectory Data, and a Critique Session, in which the Teacher reviews the behavior through the Advice Interface ("In a state si, action ai is bad, whereas action aj is good.") to produce Critique Data; both feed back into the policy parameters.]
Solution Approach
[Same practice/critique loop as above: trajectory data from practice sessions and critique data from critique sessions jointly determine the policy parameters.]
Estimate Expected Utility using Importance Sampling (Peshkin & Shelton, ICML 2002).
Critique Data Loss L(θ, C)
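As a rough sketch of the utility side of the objective, an importance-sampling estimate of a policy's expected utility from previously collected trajectories might look as follows (the data layout and all names here are our illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def is_utility_estimate(trajectories, log_pi_theta):
    """Importance-sampling estimate of the expected utility of the
    current policy pi_theta, reusing trajectories collected under
    earlier (behavior) policies, in the spirit of Peshkin & Shelton.

    Each trajectory is a dict with (illustrative field names):
      'states', 'actions'  -- the episode's state/action sequences
      'behavior_logprobs'  -- log pi_b(a_t | s_t) under the policy
                              that generated the episode
      'return'             -- the episode's total reward
    log_pi_theta(s, a) returns log pi_theta(a | s).
    """
    estimates = []
    for traj in trajectories:
        # log importance weight: sum_t [log pi_theta(a_t|s_t) - log pi_b(a_t|s_t)]
        log_w = sum(log_pi_theta(s, a)
                    for s, a in zip(traj['states'], traj['actions']))
        log_w -= sum(traj['behavior_logprobs'])
        # weighted return; average over episodes below
        estimates.append(np.exp(log_w) * traj['return'])
    return float(np.mean(estimates))
```

When the evaluated policy equals the behavior policy, every weight is 1 and the estimate reduces to the average return.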
Imagine: Our teacher is an Ideal Teacher (provides all good actions).
For each state si, the Ideal Teacher provides O(si), the set of all good actions. Any action not in O(si) is suboptimal according to the Ideal Teacher, and all actions within O(si) are equally good.
[Diagram: Ideal Teacher labeling actions through the Advice Interface.]
Advice Interface: some good actions, some bad actions, some actions unlabeled.
'Any Label Learning' (ALL)
Learning Goal: find a probabilistic policy (a classifier) that has a high probability of returning an action in O(s) when applied to state s.
ALL Likelihood L_ALL(θ, C): the probability of selecting an action in O(si) given state si, i.e. L_ALL(θ, C) = ∏_i Σ_{a ∈ O(si)} π_θ(a | si).
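For a concrete instance, the ALL log-likelihood under a linear softmax policy can be sketched like this (the linear-softmax parameterization, feature function, and names are our assumptions; the approach only requires some probabilistic policy):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over per-action scores."""
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

def all_log_likelihood(theta, critique_data, features):
    """Any-Label Learning log-likelihood: for each critiqued state s_i,
    the log-probability that the policy selects *some* action in O(s_i).

    critique_data: list of (state, O_i) pairs, where O_i holds the
                   indices of the good actions for that state
    features(state): (num_actions, dim) feature matrix, so that
                     features(state) @ theta gives per-action scores
    (This linear-softmax setup is an illustrative assumption.)
    """
    ll = 0.0
    for state, good_actions in critique_data:
        probs = softmax(features(state) @ theta)   # pi_theta(. | s_i)
        # any action in O(s_i) counts as a success, hence the sum
        ll += np.log(sum(probs[a] for a in good_actions))
    return ll
```

With θ = 0 the policy is uniform, so a state with 2 of 4 actions in O(s) contributes log(0.5).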
Critique Data Loss L(θ, C): coming back to reality, not all teachers are ideal!
A real teacher labels only some actions as good and some as bad, which provides only partial evidence about O(si).
What about the naïve approach of treating the labeled good actions as the true set O(si)? Difficulties:
When there are actions outside the labeled good set that are equally good compared to those in it, the learning problem becomes even harder.
We also want a principled way of handling the situation where either the labeled good set or the labeled bad set can be empty.
Expected Any-Label Learning
User Model: the labeled good and bad actions provide partial evidence about O(si); we assume independence among different states.
From the corresponding distributions over O(si) for all states si, we obtain the Expected ALL Loss: the expectation of the ALL loss with respect to the user model's distribution over the sets O(si).
The gradient of the expected loss has a compact closed form.
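To make the expectation concrete, here is a brute-force sketch for a single state under one simple user model, in which each unlabeled action is independently good with probability p_good (this particular model and the exhaustive enumeration are our illustration, not the paper's closed form):

```python
import numpy as np
from itertools import combinations

def expected_all_loss_one_state(pi, good, bad, p_good=0.5):
    """Expected ALL loss -E[log P(a in O(s))] for one state, where the
    user model says: labeled-good actions are in O(s), labeled-bad
    actions are not, and each unlabeled action is independently in
    O(s) with probability p_good. Enumerates all realizations of O(s)
    (exponential; for illustration only -- the gradient of the
    expected loss has a compact closed form that avoids this).

    pi: policy's action-probability vector at this state
    good, bad: lists of labeled action indices (good is assumed
               nonempty here so that O(s) is never empty)
    """
    unlabeled = [a for a in range(len(pi)) if a not in good and a not in bad]
    expected = 0.0
    for k in range(len(unlabeled) + 1):
        for extra in combinations(unlabeled, k):
            # probability of this particular realization of O(s)
            w = p_good ** len(extra) * (1 - p_good) ** (len(unlabeled) - len(extra))
            O = set(good) | set(extra)
            expected += w * -np.log(sum(pi[a] for a in O))
    return expected
```

With no unlabeled actions the expectation collapses to the plain ALL loss -log P(a in O(s)) for the labeled good set.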
Experimental Setup
[Screenshots: Map 1 and Map 2]
Our Domain: Micro-management in tactical battles in the Real Time Strategy (RTS) game of Wargus.
5 friendly footmen against a group of 5 enemy footmen (Wargus AI).
Two battle maps, which differed only in the initial placement of the units.
Both maps had winning strategies for the friendly team and were of roughly the same difficulty.
[Screenshot: Advice Interface]
Difficulty: fast pace, with multiple units acting in parallel.
Our setup: provide end-users with an Advice Interface that allows them to watch a battle and pause at any moment.
Goal is to evaluate two systems:
1. Supervised System = no practice session
2. Combined System = includes practice and critique
The user study involved 10 end-users: 6 with a CS background, 4 with no CS background.
Each user trained both the supervised and combined systems: 30 minutes total for supervised, 60 minutes for combined due to the additional time for practice.
Since repeated runs with users are not practical, the user-study results are qualitative.
To provide statistical results, we first present simulated experiments.
User Study
Simulated Experiments
After the user study, we selected the worst- and best-performing users on each map when training the combined system.
Total critique data: User #1: 36, User #2: 91, User #3: 115, User #4: 33.
For each user: divide the critique data into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of their respective critique data.
We provided the combined system with each of these data sets and allowed it to practice for 100 episodes. All results are averaged over 5 runs.
Simulated Experiments Results: Benefit of Critiques (User #1)
RL alone is unable to learn a winning policy (i.e., achieve a positive value); it never went past a health difference of 12 on any map, even after 500 trajectories.
As the amount of critique data increases, performance improves for a fixed number of practice episodes.
Simulated Experiments Results: Benefit of Practice (User #1)
Even with no practice, the critique data was sufficient to outperform RL, which never went past a health difference of 12.
Performance also increases with more practice: our approach is able to leverage practice episodes to improve the effectiveness of a given amount of critique data.
Results of User Study
[Histogram: Health Difference Achieved by Users. X-axis: health difference (-50, 0, 50, 80, 100); y-axis: number of users (0-10); one bar series each for Supervised and Combined.]
Comparing to RL:
9 out of 10 users achieved a health difference of 50 or more using the Supervised System.
6 out of 10 users achieved a health difference of 50 or more using the Combined System.
Users thus performed far better than RL using either the Supervised or Combined System: RL never went past a health difference of 12 on any map, even after 500 trajectories.
Frustrating Problems for Users:
Users experienced large delays (not an issue in many realistic settings).
The policy returned after practice was sometimes poor and seemed to ignore the advice (perhaps the practice sessions were too short).
Comparing Combined and Supervised: the end-users had slightly greater success with the supervised system versus the combined system; more users were able to achieve performance levels of 50 and 80 using the supervised system.
Future Work
An important part of our future work will be to conduct further user studies in order to pursue the most relevant directions, including:
Studying more sophisticated user models that better approximate real users.
Enriching the forms of advice.
Allowing the learner to actively request advice from the teacher.
Designing realistic user studies.
Increasing the stability of autonomous practice.
Questions?