Transfer in Reinforcement Learning via Markov Logic Networks
Lisa Torrey, Jude Shavlik, Sriraam Natarajan, Pavan Kuppili, Trevor Walker
University of Wisconsin-Madison, USA
Possible Benefits of Transfer in RL
Learning curves in the target task:
[Figure: performance vs. training curves, with transfer and without transfer]
Reinforcement Learning
[Diagram: agent-environment loop; the agent sends actions to the environment, which returns states and rewards]
States are described by features:
distance(me, teammate1) = 15
distance(me, opponent1) = 5
angle(opponent1, me, teammate1) = 30
…
Actions are: Move, Pass, Shoot
Rewards are: +1 for scoring, 0 otherwise
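To make the setup concrete, here is a minimal sketch of how this state-action-reward interface might look in code. The feature names, actions, and reward match the slide; the epsilon value and the epsilon_greedy helper are illustrative assumptions, not the authors' implementation.

```python
import random

# Actions and reward from the slide; everything else is an assumption.
ACTIONS = ["move", "pass", "shoot"]

def reward(scored: bool) -> float:
    # +1 for scoring, 0 otherwise
    return 1.0 if scored else 0.0

# A state is a set of relational features with numeric values.
state = {
    "distance(me, teammate1)": 15.0,
    "distance(me, opponent1)": 5.0,
    "angle(opponent1, me, teammate1)": 30.0,
}

def epsilon_greedy(q_values: dict, epsilon: float = 0.1) -> str:
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(q_values, key=q_values.get)

# q_values would come from a learned Q-function over these features, e.g.:
print(epsilon_greedy({"move": 0.2, "pass": 0.6, "shoot": 0.4}))
```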
Our Previous Methods
Skill transfer
  Learn a rule for when to take each action
  Use rules as advice
Macro transfer
  Learn a relational multi-step action plan
  Use the macro to demonstrate
Transfer via Markov Logic Networks
[Diagram: the source-task learner learns a source-task Q-function and data; analyzing these produces an MLN Q-function, which demonstrates for the target-task learner]
Markov Logic Networks
A Markov network models a joint distribution
A Markov Logic Network combines probability with logic
  Template: a set of first-order formulas with weights
  Each grounded predicate in a formula becomes a node
  Predicates in a grounded formula are connected by arcs
Probability of a world: P(world) = (1/Z) exp(Σᵢ WᵢNᵢ), where Nᵢ is the number of true groundings of formula i
Richardson and Domingos, ML 2006
[Diagram: a small Markov network over variables X, Y, Z and A, B]
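As a minimal sketch of the world-probability formula above: for simplicity, each world is represented directly by its vector of true-grounding counts, and the weights and counts are hypothetical.

```python
import math

# Sketch of P(world) = (1/Z) * exp(sum_i Wi * Ni), where Ni is the number
# of true groundings of formula i in the world. Each world is represented
# here by its count vector (Ni per formula); weights and counts are
# hypothetical, for illustration only.

def unnormalized(counts, weights):
    return math.exp(sum(w * n for w, n in zip(weights, counts)))

weights = [0.75, 1.33]                       # one weight per formula
worlds = [(1, 3), (0, 3), (1, 0), (0, 0)]    # hypothetical (N1, N2) per world

Z = sum(unnormalized(c, weights) for c in worlds)   # partition function
for c in worlds:
    print(c, unnormalized(c, weights) / Z)
```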
MLN Q-function
Formula 1 (W1 = 0.75, N1 = 1 teammate):
IF distance(me, Teammate) < 15
AND angle(me, goalie, Teammate) > 45
THEN Q є (0.8, 1.0)

Formula 2 (W2 = 1.33, N2 = 3 goal parts):
IF distance(me, GoalPart) < 10
AND angle(me, goalie, GoalPart) > 45
THEN Q є (0.8, 1.0)
Probability that Q є (0.8, 1.0): exp(W1N1 + W2N2) / (1 + exp(W1N1 + W2N2))
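Plugging the slide's weights and grounding counts into this formula gives a quick worked check:

```python
import math

# Worked check using the slide's numbers:
# W1 = 0.75, N1 = 1 (one teammate); W2 = 1.33, N2 = 3 (three goal parts).
s = 0.75 * 1 + 1.33 * 3                 # W1*N1 + W2*N2 = 4.74
p = math.exp(s) / (1 + math.exp(s))
print(round(p, 3))                      # ~0.991: Q very likely in (0.8, 1.0)
```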
Grounded Markov Network
[Diagram: the node Q є (0.8, 1.0) is connected to the grounded predicates below]
distance(me, teammate1) < 15
angle(me, goalie, teammate1) > 45
distance(me, goalRight) < 10
angle(me, goalie, goalRight) > 45
distance(me, goalLeft) < 10
angle(me, goalie, goalLeft) > 45
Learning an MLN
Find good Q-value bins using hierarchical clustering
Learn rules that classify examples into bins using inductive logic programming
Learn weights for these formulas to produce the final MLN
Binning via Hierarchical Clustering
[Figure: three frequency-vs-Q-value histograms illustrating the hierarchical clustering of Q-values into bins]
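A minimal sketch of this binning step using SciPy's hierarchical clustering; the linkage method, distance threshold, and sample Q-values are assumptions, since the slides do not specify these details.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical Q-values from source-task examples.
q_values = np.array([0.05, 0.1, 0.12, 0.45, 0.5, 0.55, 0.85, 0.9, 0.95])

# Agglomerative clustering on the 1-D Q-values (Ward linkage and the
# 0.2 distance threshold are arbitrary illustrative choices).
Z = linkage(q_values.reshape(-1, 1), method="ward")
labels = fcluster(Z, t=0.2, criterion="distance")

# Each cluster becomes a Q-value bin, bounded by its min and max.
for c in sorted(set(labels)):
    members = q_values[labels == c]
    print(f"bin {c}: Q in ({members.min()}, {members.max()})")
```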
Classifying Into Bins via ILP
Given examples
  Positive: inside this Q-value bin
  Negative: outside this Q-value bin
The Aleph* ILP learning system finds rules that separate positive from negative examples
  Builds rules one predicate at a time
  Top-down search through the feature space
* Srinivasan, 2001
Learning Formula Weights
Given formulas and examples
  Same examples as for ILP
  ILP rules as network structure
Alchemy* finds weights that make the probability estimates accurate
  Scaled conjugate-gradient algorithm
* Kok, Singla, Richardson, Domingos, Sumner, Poon and Lowd, 2004-2007
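As a rough sketch of what weight learning optimizes: plain gradient ascent on the log-likelihood of a logistic model over formula counts, standing in here for Alchemy's scaled conjugate-gradient algorithm. The training data, learning rate, and iteration count are all assumptions.

```python
import numpy as np

# Sketch: learn weights so that sigmoid(sum_i Wi * Ni) matches how often
# examples actually fall in the bin. Plain gradient ascent stands in for
# Alchemy's scaled conjugate-gradient; the data below is hypothetical.

counts = np.array([[1, 3], [0, 3], [1, 0], [0, 0]])  # Ni per example
in_bin = np.array([1, 1, 0, 0])                      # 1 if Q fell in the bin

w = np.zeros(2)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-counts @ w))   # predicted bin probability
    w += 0.1 * counts.T @ (in_bin - p)      # log-likelihood gradient step

print(w)   # learned formula weights
```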
Using an MLN Q-function
Q є (0.8, 1.0): P1 = 0.75
Q є (0.5, 0.8): P2 = 0.15
Q є (0, 0.5): P3 = 0.10
Q = P1 · E[Q | bin1] + P2 · E[Q | bin2] + P3 · E[Q | bin3]

where E[Q | bin] is the Q-value of the most similar training example in the bin
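A minimal sketch of this combination step; the probabilities match the slide, while the bin expectations are placeholders that would come from the similarity computation on the next slide.

```python
# Combine bin probabilities with per-bin Q expectations, as on the slide.
probs = [0.75, 0.15, 0.10]          # P1, P2, P3 from MLN inference
expectations = [0.9, 0.65, 0.25]    # E[Q | bin_i]: hypothetical values

q = sum(p * e for p, e in zip(probs, expectations))
print(q)  # the action's estimated Q-value
```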
Example Similarity
[Figure: two example vectors with one ±1 entry per bin rule (Rule 1, Rule 2, Rule 3, …)]

E[Q | bin] = Q-value of the most similar training example in the bin
Similarity = dot product of example vectors
An example vector shows which bin rules the example satisfies (entry 1) and which it does not (entry -1)
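A minimal sketch of the similarity computation; the ±1 vectors and Q-values are hypothetical, with 1 meaning the example satisfies that bin rule and -1 that it does not.

```python
import numpy as np

# Each example's vector has one +/-1 entry per bin rule (hypothetical data).
new_example = np.array([1, -1, 1])
training = {                  # training examples in this bin, keyed by Q-value
    0.92: np.array([1, 1, 1]),
    0.85: np.array([1, -1, 1]),
    0.81: np.array([-1, -1, 1]),
}

# Similarity = dot product; E[Q | bin] = Q of the most similar example.
best_q = max(training, key=lambda q: np.dot(new_example, training[q]))
print(best_q)  # 0.85: its vector matches the new example exactly
```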
Experiments
Source task: 2-on-1 BreakAway
  3000 existing games from the learning curve
  Learn MLNs from 5 separate runs
Target task: 3-on-2 BreakAway
  Demonstration period of 100 games
  Continue training up to 3000 games
  Perform 5 target runs for each source run
Discoveries
Results can vary widely with the source-task chunk from which we transfer
Most methods use the “final” Q-function from the last chunk
MLN transfer performs better from chunks halfway through the learning curve
Results in 3-on-2 BreakAway
[Figure: probability of goal (0 to 0.6) vs. training games (0 to 3000) in 3-on-2 BreakAway; curves for MLN Transfer, Macro Transfer, Value-function Transfer, and Standard RL]
Conclusions
MLN transfer can significantly improve initial target-task performance
Like macro transfer, it is an aggressive approach for tasks with similar strategies
It “lifts” transferred information to first-order logic, making it more general for transfer
Theory refinement in the target task may be viable through MLN revision
Potential Future Work
Model screening for transfer learning
Theory refinement in the target task
Fully relational RL in RoboCup using MLNs as Q-function approximators