Skill Acquisition via Transfer Learning and Advice Taking
Lisa Torrey, Jude Shavlik, Trevor Walker
University of Wisconsin-Madison, USA
Richard Maclin
University of Minnesota-Duluth, USA
Transfer Learning
Agent learns Task A
Agent encounters related Task B
Agent discovers how tasks are related (so far, the user provides this information to the agent)
Agent uses knowledge from Task A to learn Task B faster
Task A is the source. Task B is the target.
Transfer Learning
The goal for the target task:
[Figure: performance vs. training, comparing learning with transfer and without transfer]
Reinforcement Learning Overview
Take an action
Observe world state
Receive a reward
Use the rewards to estimate the Q-values of actions in states
States are described by a set of features
Policy: choose the action with the highest Q-value in the current state
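The loop on this slide can be sketched in a few lines of Python. This is only an illustration: it uses a lookup table keyed by hypothetical state labels (`"s0"`, `"s1"`) rather than feature vectors, and the update rule, `alpha`, and `gamma` are standard Q-learning choices assumed here, not details taken from the slides.

```python
from collections import defaultdict

def greedy_action(q_values, state, actions):
    """Policy from the slide: pick the action with the highest Q-value in this state."""
    return max(actions, key=lambda a: q_values[(state, a)])

def update_q(q_values, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (assumed update rule, not from the slides)."""
    best_next = max(q_values[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_values[(state, action)] += alpha * (target - q_values[(state, action)])

# Hypothetical two-action task with made-up state labels.
actions = ["pass", "hold"]
q = defaultdict(float)
update_q(q, "s0", "pass", reward=1.0, next_state="s1", actions=actions)
best = greedy_action(q, "s0", actions)
```

After a single rewarded `pass` in `s0`, the greedy policy prefers `pass` there, which is the "use the rewards to estimate the Q-values" step in miniature.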
Transfer in Reinforcement Learning
What knowledge will we transfer from the source?
  Q-functions (Taylor & Stone 2005)
  Policies (Torrey et al. 2005)
  Skills (this work)
How will we extract that knowledge from the source?
  From Q-functions (Torrey et al. 2005)
  From observed behavior (this work)
How will we apply that knowledge in the target?
  Model reuse (Taylor & Stone 2005)
  Advice taking (Torrey et al. 2005, this work)
Advice Taking
Advice: instructions for the learner
IF: condition
THEN: prefer action
In these states, Q(action1) > Q(action2)
Apply advice as soft constraints (KBKR, 2005)
For each action, find the Q-function that minimizes:
Error on Training Data + Disagreement with Advice + Complexity of Q-function
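To make the three-term cost concrete, here is a minimal sketch that only evaluates such a cost for two linear Q-functions. It is not the KBKR formulation itself: the joint treatment of both actions, the hinge penalty, the L1 complexity term, and the parameters `margin`, `mu`, and `lam` are all assumptions chosen for illustration.

```python
def q_value(weights, features):
    """Linear Q-function: weighted sum of state features."""
    return sum(w * x for w, x in zip(weights, features))

def soft_advice_cost(w_pref, w_other, examples, advice_states,
                     margin=1.0, mu=1.0, lam=0.1):
    """Data error + advice disagreement + model complexity (illustrative only).

    examples: list of (features, action, target_q), action in {"pref", "other"}.
    advice_states: feature vectors where advice says Q(pref) > Q(other).
    """
    weights = {"pref": w_pref, "other": w_other}
    data_error = sum(abs(q_value(weights[a], x) - t) for x, a, t in examples)
    # Soft constraint: penalize only the amount by which the advice is violated.
    disagreement = sum(
        max(0.0, margin - (q_value(w_pref, x) - q_value(w_other, x)))
        for x in advice_states)
    complexity = sum(abs(w) for w in w_pref) + sum(abs(w) for w in w_other)
    return data_error + mu * disagreement + lam * complexity

# Made-up training data and one advice state.
examples = [([1.0, 0.0], "pref", 2.0), ([0.0, 1.0], "other", 1.0)]
advice_states = [[1.0, 1.0]]
cost_good = soft_advice_cost([2.0, 0.0], [0.0, 1.0], examples, advice_states)
cost_bad = soft_advice_cost([0.0, 0.0], [0.0, 1.0], examples, advice_states)
```

Because the advice term is a penalty rather than a hard constraint, a learner minimizing this cost can override advice that conflicts with the training data.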
Experimental Domain: RoboCup
KeepAway (KA/MKA): keep the ball (Stone & Sutton 2001)
BreakAway (BA): score a goal (Maclin et al. 2005)
MoveDownfield (MD): cross the line (Torrey et al. 2006)
Different objectives, but a transferable skill: passing to teammates
A Challenge for Skill Transfer
Shared skills are not exactly the same
  Skills have general and specific aspects
Aspects of the pass skill in RoboCup
  General: teammate must be open
  Game-specific: where teammate should be located
  Player-specific: whether teammate is nearest or furthest
"I'm open and near the goal. Pass to me!"
"I'm open and far from you. Pass to me!"
Addressing the Challenge
We focus on learning general skill aspects
  These should transfer better
We learn skills that apply to multiple players
  This generalizes over player-specific aspects
We allow humans to provide information
  They can point out game-specific aspects
Human-Provided Information
User provides a mapping to show task similarities
May also provide user advice about task differences
Pass -> Pass towards goal
Ø -> Move towards goal
Ø -> Shoot at goal
Our Transfer Algorithm
Observe source task games to learn skills
Create advice for the target task
  Translate learned skills into transfer advice
  If there is user advice, add it in
Learn target task with KBKR
Learning Skills By Observation
Source-task games are sequences: (state, action)
Learning skills is like learning to classify states by their correct actions
We use Inductive Logic Programming to learn classifiers
State 1:
  distBetween(me,teammate2) = 15
  distBetween(me,teammate1) = 10
  distBetween(me,opponent1) = 5
  ...
  action = pass(teammate2)
  outcome = caught(teammate2)
Advantages of ILP
Can produce first-order rules for skills
  Capture only the essential aspects of the skill
  We expect these aspects to transfer better
Can incorporate background knowledge
pass(Teammate) vs. pass(teammate1) ... pass(teammateN)
Preparing Datasets for ILP
action = pass(Teammate)?
  yes: outcome = caught(Teammate)?
    yes: Q(pass) is high?
      yes: Q(pass) is highest?
        yes: Positive example for pass(Teammate)
        no: Reject example
      no: Reject example
    no: Reject example
  no: Q(other) is high?
    yes: Q(pass) is lower?
      yes: Negative example for pass(Teammate)
      no: Reject example
    no: Reject example
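The example-selection flowchart can be read as a small labeling function. The sketch below is one interpretation: the slide does not say what counts as a "high" Q-value, so `high_threshold` is a hypothetical parameter, and "Q(other)" is assumed to mean the Q-value of the action that was actually taken.

```python
def label_pass_example(action, outcome, q_values, teammate, high_threshold):
    """Return 'positive', 'negative', or 'reject' for the skill pass(teammate).

    q_values maps action names to Q-values in the observed state.
    high_threshold is an assumed cutoff for "Q is high".
    """
    pass_action = f"pass({teammate})"
    q_pass = q_values[pass_action]
    if action == pass_action:
        # Positive only if the pass succeeded and was clearly the best choice.
        is_positive = (outcome == f"caught({teammate})"
                       and q_pass >= high_threshold
                       and q_pass == max(q_values.values()))
        return "positive" if is_positive else "reject"
    # Another action was taken: negative only if it was good and pass was worse.
    q_other = q_values[action]
    if q_other >= high_threshold and q_pass < q_other:
        return "negative"
    return "reject"

# Made-up Q-values for three observed states.
good_pass = label_pass_example(
    "pass(teammate2)", "caught(teammate2)",
    {"pass(teammate2)": 0.8, "hold": 0.3}, "teammate2", 0.5)
good_hold = label_pass_example(
    "hold", None,
    {"pass(teammate2)": 0.2, "hold": 0.9}, "teammate2", 0.5)
failed_pass = label_pass_example(
    "pass(teammate2)", "intercepted",
    {"pass(teammate2)": 0.8, "hold": 0.3}, "teammate2", 0.5)
```

Rejecting ambiguous states keeps the ILP training set limited to cases where the right action is clear.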
Example of a Skill Learned
pass(Teammate) :-
    distBetween(me, Teammate) > 14,
    passAngle(Teammate) > 30,
    passAngle(Teammate) < 150,
    distBetween(me, Opponent) < 7.
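As a rough illustration of how the learned rule reads, the following sketch checks it against a state. The dictionary layout is a made-up representation, the distances reuse the "State 1" values from the earlier slide, and the `passAngle` values are invented for the example. The unbound variable `Opponent` is treated as "there exists some opponent".

```python
def pass_rule_holds(state, teammate):
    """Check the learned pass(Teammate) rule for one candidate teammate."""
    far_enough = state["distBetween"][("me", teammate)] > 14
    angle_ok = 30 < state["passAngle"][teammate] < 150
    # Opponent is existentially quantified: any opponent within 7 satisfies it.
    opponent_close = any(state["distBetween"][("me", opp)] < 7
                         for opp in state["opponents"])
    return far_enough and angle_ok and opponent_close

# Hypothetical state; passAngle values are assumptions.
state = {
    "opponents": ["opponent1"],
    "distBetween": {("me", "teammate1"): 10,
                    ("me", "teammate2"): 15,
                    ("me", "opponent1"): 5},
    "passAngle": {"teammate1": 45, "teammate2": 45},
}
can_pass_to_2 = pass_rule_holds(state, "teammate2")
can_pass_to_1 = pass_rule_holds(state, "teammate1")
```

Here the rule recommends passing to `teammate2` but not `teammate1`, who is closer than the required distance.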
Technical Challenges
KBKR requires propositional advice
  We instantiate each rule head
Variables in rule bodies create disjunctions
  We use tile features to translate them
Variables can appear multiple times
  We create new features to translate them
Two Experimental Scenarios
4-on-3 MKA to 3-on-2 BA:
  Pass -> Pass towards goal
  Ø -> Move towards goal
  Ø -> Shoot at goal
3-on-2 MD to 3-on-2 BA:
  Pass -> Pass
  MoveAhead -> MoveAhead
  Ø -> Shoot at goal
Skill Transfer Results
[Figure: probability of goal (0 to 0.6) vs. games played (0 to 5000); curves: without transfer, from MKA, from MD]
Breakdown of MKA Results
[Figure: probability of goal (0 to 0.6) vs. games played (0 to 5000); curves: all advice, transfer advice only, user advice only, no advice]
What if User Advice is Bad?
[Figure: probability of goal (0 to 0.6) vs. games played (0 to 5000); curves: transfer with good advice, transfer with bad advice, bad advice only, no advice]
Related Work
Q-function transfer in RoboCup
  Taylor & Stone (AAMAS 2005, AAAI 2005)
Transfer via policy reuse
  Fernandez & Veloso (AAMAS 2006, ICML workshop 2006)
  Madden & Howley (AI Review 2004)
Transfer via relational RL
  Driessens et al. (ICML workshop 2006)
Summary of Contributions
Transfer of shared skills in high-level logic
  Despite differences in shared skills
Demonstration of the value of user guidance
  Easy to give and beneficial
Effective transfer in the RoboCup domain
  Challenging and dissimilar tasks
Future Work
Learn more general skills by combining multiple source tasks
Compare several transfer methods on RoboCup scenarios of varying difficulty
Reach similar levels of transfer with less user input
Acknowledgements
DARPA Grant HR0011-04-1-0007
US Naval Research Laboratory Grant N00173-06-1-G002
Thank You
User Advice
IF: distBetween(me,goal) < 10 AND angle(goal, me, goalie) > 40
THEN: prefer shoot
IF: distBetween(me,goal) > 10
THEN: prefer move_ahead
IF: [transferred conditions] AND distBetween(Teammate,goal) < distBetween(me,goal)
THEN: prefer pass(Teammate)
([transferred conditions] is the part that came from transfer)
Feature Tiling
[Diagram: the original feature's range, from min value to max value, is covered by tilings #1, #2, …, #8, …, #9, #10, #11, …; labels: (16 tiles), (8 tiles), (8 tiles)]
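A tile feature can be thought of as a 0/1 indicator that a numeric feature falls within some interval of its range. The sketch below assumes equal-width, non-overlapping tiles, which is a simplification: the exact tile layout used in the diagram is not recoverable from the slide text, and the function name and range values are hypothetical.

```python
def tile_features(value, min_value, max_value, num_tiles):
    """Split [min_value, max_value] into equal-width tiles; return 0/1 indicators."""
    width = (max_value - min_value) / num_tiles
    tiles = []
    for i in range(num_tiles):
        low = min_value + i * width
        high = low + width
        is_last = (i == num_tiles - 1)
        # Half-open tiles, except the last one, which also includes max_value.
        inside = low <= value < high or (is_last and value == max_value)
        tiles.append(1 if inside else 0)
    return tiles

# Hypothetical distance feature ranging over [0, 32].
coarse = tile_features(5, 0, 32, 8)    # 8 tiles of width 4
fine = tile_features(5, 0, 32, 16)     # 16 tiles of width 2
```

Using several resolutions lets a threshold test on the original feature be approximated by whichever tile boundary lies closest to it.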
Propositionalizing Rules
Step 1: rule head
pass(Teammate) :-
    distBetween(me, Teammate) > 14,
    …
pass(teammate1) :-
    distBetween(me, teammate1) > 14,
    …
…
pass(teammateN) :-
    distBetween(me, teammateN) > 14,
    …
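Step 1 amounts to substituting each concrete teammate for the head variable throughout the rule. A minimal sketch, assuming rules are held as plain strings (a representation chosen here for illustration, not taken from the slides):

```python
import re

def instantiate_rule_head(rule_text, variable, constants):
    """Return one copy of the rule per constant, with the variable substituted."""
    # Word boundaries avoid touching longer names that merely contain the variable.
    pattern = r"\b" + re.escape(variable) + r"\b"
    return [re.sub(pattern, constant, rule_text) for constant in constants]

rule = "pass(Teammate) :- distBetween(me, Teammate) > 14"
ground_rules = instantiate_rule_head(rule, "Teammate", ["teammate1", "teammate2"])
```

Each resulting rule mentions a single named teammate, so its head is propositional.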
Propositionalizing Rules
Step 2: single-variable disjunctions
distBetween(me, Opponent) < 7
distBetween(me,opponent1) < 7 OR … OR distBetween(me,opponentN) < 7
distBetween(me,opponent1)[0,7] + … + distBetween(me,opponentN)[0,7] ≥ 1
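The step from a disjunction to a sum works because each tile feature is 0 or 1, so "at least one is true" is the same as "the sum is at least 1". A small sketch with made-up distances; the half-open interval is an assumption chosen to match the strict `< 7` in the rule:

```python
def in_tile(value, low, high):
    """0/1 tile feature: 1 when low <= value < high."""
    return 1 if low <= value < high else 0

def some_opponent_within(dist, opponents, low=0, high=7):
    """Disjunction over opponents expressed as a linear constraint: sum >= 1."""
    total = sum(in_tile(dist[("me", opp)], low, high) for opp in opponents)
    return total >= 1

opponents = ["opponent1", "opponent2"]
near = some_opponent_within({("me", "opponent1"): 5, ("me", "opponent2"): 12}, opponents)
far = some_opponent_within({("me", "opponent1"): 9, ("me", "opponent2"): 12}, opponents)
```

The linear form matters because advice conditions must be expressed over propositional features rather than logical variables.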
Propositionalizing Rules
Step 3: linked-variable disjunctions
distBetween(me, Player) > 14,
distBetween(Player, goal) < 10
newFeature(player1) + … + newFeature(playerN) ≥ 1
Add to target task feature space:
newFeature(Player) :-
    Dist1 is distBetween(me, Player),
    Dist2 is distBetween(Player, goal),
    Dist1 > 14, Dist2 < 10.
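When one variable links two conditions, summing separate tile features per condition would not work, since different players could satisfy each half. The sketch below illustrates why a combined feature is needed; the distance table and player names are invented for the example.

```python
def new_feature(dist, player):
    """1 only when the SAME player satisfies both linked conditions."""
    dist1 = dist[("me", player)]
    dist2 = dist[(player, "goal")]
    return 1 if dist1 > 14 and dist2 < 10 else 0

def linked_condition_holds(dist, players):
    """Disjunction over players of the combined feature, as a sum >= 1."""
    return sum(new_feature(dist, p) for p in players) >= 1

# player1 meets only the first condition, player2 only the second.
dist = {("me", "player1"): 15, ("player1", "goal"): 20,
        ("me", "player2"): 5, ("player2", "goal"): 8}
split_case = linked_condition_holds(dist, ["player1", "player2"])

# player3 meets both conditions at once.
dist.update({("me", "player3"): 16, ("player3", "goal"): 9})
joint_case = linked_condition_holds(dist, ["player1", "player2", "player3"])
```

The first case is correctly rejected even though each condition is met by someone, which is the behavior the per-player `newFeature` is there to preserve.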