Kshitij Judah, Saikat Roy, Alan Fern, Tom Dietterich
Reinforcement Learning via Practice and Critique Advice
AAAI-2010 Atlanta, GA
PROBLEM: Usually RL takes a long time to learn a good policy.
Reinforcement Learning (RL)
[Diagram: the agent receives state and reward from the Environment and emits actions; a Teacher observes the agent's behavior and provides advice.]
RESEARCH QUESTION: Can we make RL perform better with some outside help, such as critique/advice from a teacher, and if so, how?
GOALS: Non-technical users as teachers; natural interaction methods.
RL via Practice + Critique Advice
[Diagram: the system alternates between a Practice Session, in which the current policy parameters are executed to produce Trajectory Data, and a Critique Session, in which the Teacher reviews the behavior through the Advice Interface ("In a state si, action ai is bad, whereas action aj is good.") to produce Critique Data; both feed back into the policy parameters.]
Solution Approach
[Same practice/critique loop as above: trajectory data from practice sessions and critique data from critique sessions jointly determine the policy parameters.]
Estimate Expected Utility using Importance Sampling (Peshkin & Shelton, ICML 2002).
Critique Data Loss L(θ, C)
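As a rough sketch of the utility side of the objective, an importance-sampling estimate of a policy's expected utility from previously collected trajectories might look as follows (the data layout and all names here are our illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def is_utility_estimate(trajectories, log_pi_theta):
    """Importance-sampling estimate of the expected utility of the
    current policy pi_theta, reusing trajectories collected under
    earlier (behavior) policies, in the spirit of Peshkin & Shelton.

    Each trajectory is a dict with (illustrative field names):
      'states', 'actions'  -- the episode's state/action sequences
      'behavior_logprobs'  -- log pi_b(a_t | s_t) under the policy
                              that generated the episode
      'return'             -- the episode's total reward
    log_pi_theta(s, a) returns log pi_theta(a | s).
    """
    estimates = []
    for traj in trajectories:
        # log importance weight: sum_t [log pi_theta(a_t|s_t) - log pi_b(a_t|s_t)]
        log_w = sum(log_pi_theta(s, a)
                    for s, a in zip(traj['states'], traj['actions']))
        log_w -= sum(traj['behavior_logprobs'])
        # weighted return; average over episodes below
        estimates.append(np.exp(log_w) * traj['return'])
    return float(np.mean(estimates))
```

When the evaluated policy equals the behavior policy, every weight is 1 and the estimate reduces to the average return.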
Imagine: Our teacher is an Ideal Teacher (provides all good actions).
For each state si, the Ideal Teacher provides O(si), the set of all good actions. Any action not in O(si) is suboptimal according to the Ideal Teacher, and all actions within O(si) are equally good.
[Diagram: Ideal Teacher labeling actions through the Advice Interface.]
Advice Interface: some good actions, some bad actions, some actions unlabeled.
'Any Label Learning' (ALL)
Learning Goal: find a probabilistic policy (a classifier) that has a high probability of returning an action in O(s) when applied to state s.
ALL Likelihood L_ALL(θ, C): the probability of selecting an action in O(si) given state si, i.e. L_ALL(θ, C) = ∏_i Σ_{a ∈ O(si)} π_θ(a | si).
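For a concrete instance, the ALL log-likelihood under a linear softmax policy can be sketched like this (the linear-softmax parameterization, feature function, and names are our assumptions; the approach only requires some probabilistic policy):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over per-action scores."""
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

def all_log_likelihood(theta, critique_data, features):
    """Any-Label Learning log-likelihood: for each critiqued state s_i,
    the log-probability that the policy selects *some* action in O(s_i).

    critique_data: list of (state, O_i) pairs, where O_i holds the
                   indices of the good actions for that state
    features(state): (num_actions, dim) feature matrix, so that
                     features(state) @ theta gives per-action scores
    (This linear-softmax setup is an illustrative assumption.)
    """
    ll = 0.0
    for state, good_actions in critique_data:
        probs = softmax(features(state) @ theta)   # pi_theta(. | s_i)
        # any action in O(s_i) counts as a success, hence the sum
        ll += np.log(sum(probs[a] for a in good_actions))
    return ll
```

With θ = 0 the policy is uniform, so a state with 2 of 4 actions in O(s) contributes log(0.5).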
Critique Data Loss L(θ, C): coming back to reality, not all teachers are ideal!
A real teacher labels only some actions as good and some as bad, which provides only partial evidence about O(si).
What about the naïve approach of treating the labeled good actions as the true set O(si)? Difficulties:
When there are actions outside the labeled good set that are equally good compared to those in it, the learning problem becomes even harder.
We also want a principled way of handling the situation where either the labeled good set or the labeled bad set can be empty.
Expected Any-Label Learning
User Model: the labeled good and bad actions provide partial evidence about O(si); we assume independence among different states.
From the corresponding distributions over O(si) for all states si, we obtain the Expected ALL Loss: the expectation of the ALL loss with respect to the user model's distribution over the sets O(si).
The gradient of the expected loss has a compact closed form.
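To make the expectation concrete, here is a brute-force sketch for a single state under one simple user model, in which each unlabeled action is independently good with probability p_good (this particular model and the exhaustive enumeration are our illustration, not the paper's closed form):

```python
import numpy as np
from itertools import combinations

def expected_all_loss_one_state(pi, good, bad, p_good=0.5):
    """Expected ALL loss -E[log P(a in O(s))] for one state, where the
    user model says: labeled-good actions are in O(s), labeled-bad
    actions are not, and each unlabeled action is independently in
    O(s) with probability p_good. Enumerates all realizations of O(s)
    (exponential; for illustration only -- the gradient of the
    expected loss has a compact closed form that avoids this).

    pi: policy's action-probability vector at this state
    good, bad: lists of labeled action indices (good is assumed
               nonempty here so that O(s) is never empty)
    """
    unlabeled = [a for a in range(len(pi)) if a not in good and a not in bad]
    expected = 0.0
    for k in range(len(unlabeled) + 1):
        for extra in combinations(unlabeled, k):
            # probability of this particular realization of O(s)
            w = p_good ** len(extra) * (1 - p_good) ** (len(unlabeled) - len(extra))
            O = set(good) | set(extra)
            expected += w * -np.log(sum(pi[a] for a in O))
    return expected
```

With no unlabeled actions the expectation collapses to the plain ALL loss -log P(a in O(s)) for the labeled good set.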
Experimental Setup
[Screenshots: Map 1 and Map 2]
Our Domain: Micro-management in tactical battles in the Real Time Strategy (RTS) game of Wargus.
5 friendly footmen against a group of 5 enemy footmen (Wargus AI).
Two battle maps, which differed only in the initial placement of the units.
Both maps had winning strategies for the friendly team and were of roughly the same difficulty.
[Screenshot: Advice Interface]
Difficulty: fast pace, with multiple units acting in parallel.
Our setup: provide end-users with an Advice Interface that allows them to watch a battle and pause at any moment.
Goal is to evaluate two systems:
1. Supervised System = no practice session
2. Combined System = includes practice and critique
The user study involved 10 end-users: 6 with a CS background, 4 with no CS background.
Each user trained both the supervised and combined systems: 30 minutes total for supervised, 60 minutes for combined due to the additional time for practice.
Since repeated runs with users are not practical, the user-study results are qualitative.
To provide statistical results, we first present simulated experiments.
User Study
Simulated Experiments
After the user study, we selected the worst- and best-performing users on each map when training the combined system.
Total critique data: User #1: 36, User #2: 91, User #3: 115, User #4: 33.
For each user: divide the critique data into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of their respective critique data.
We provided the combined system with each of these data sets and allowed it to practice for 100 episodes. All results are averaged over 5 runs.
Simulated Experiments Results: Benefit of Critiques (User #1)
RL alone is unable to learn a winning policy (i.e., achieve a positive value); it never went past a health difference of 12 on any map, even after 500 trajectories.
As the amount of critique data increases, performance improves for a fixed number of practice episodes.
Simulated Experiments Results: Benefit of Practice (User #1)
Even with no practice, the critique data was sufficient to outperform RL, which never went past a health difference of 12.
Performance also increases with more practice: our approach is able to leverage practice episodes to improve the effectiveness of a given amount of critique data.
Results of User Study
[Histogram: Health Difference Achieved by Users. X-axis: health difference (-50, 0, 50, 80, 100); y-axis: number of users (0-10); one bar series each for Supervised and Combined.]
Comparing to RL:
9 out of 10 users achieved a health difference of 50 or more using the Supervised System.
6 out of 10 users achieved a health difference of 50 or more using the Combined System.
Users thus performed far better than RL using either the Supervised or Combined System: RL never went past a health difference of 12 on any map, even after 500 trajectories.
Frustrating Problems for Users:
Users experienced large delays (not an issue in many realistic settings).
The policy returned after practice was sometimes poor and seemed to ignore the advice (perhaps the practice sessions were too short).
Comparing Combined and Supervised: the end-users had slightly greater success with the supervised system versus the combined system; more users were able to achieve performance levels of 50 and 80 using the supervised system.
Future Work
An important part of our future work will be to conduct further user studies in order to pursue the most relevant directions, including:
Studying more sophisticated user models that better approximate real users.
Enriching the forms of advice.
Allowing the learner to actively request advice from the teacher.
Designing realistic user studies.
Increasing the stability of autonomous practice.
Questions?