Outline | One-page Summary of my work | Background | A Walkthrough of My work | LAM-API: Large-scale Off-policy Learning and Control | Citation Count Prediction using RL
My RL Approach to Prediction and Control
Hengshuai Yao
University of Alberta
April 4, 2013
Hengshuai Yao Thesis Overview 1/33
Outline
• One-page Summary of my work (30 seconds)
• Background on Reinforcement Learning (RL) (8 slides; 6 minutes)
• Walkthrough of my work (6 slides; 4 minutes)
• LAM-API: Large-scale Off-policy Learning and Control (5 slides; 5 minutes)
• Citation count prediction using RL (10 slides; 8 minutes)
Summary of my work
Prediction
• A framework for existing prediction algorithms [ICML 08]
• Data efficiency for on-policy prediction (Multi-step linear Dyna [NIPS 09]) and off-policy prediction (LAM-off-policy [ICML-workshop 10])
Control
• Memory and sample efficiency for control (LAM-API [AAAI 12])
• Online abstract planning with Universal Option Models [in preparation for JAIR with Csaba, Rich and Shalabh]
• RL with general function approximation. Deterministic MDPs [in preparation for Machine Learning Journal with Csaba]
RL applications
• Citation count prediction [submitted to IEEE Trans. on SMC-part B]
• Ranking [current work with Dale]
MDPs | RL | Prediction
Background on RL
I will go over
• MDPs: Definition, Policies, Value Functions, and more
• Prediction Problems (TD, Dyna, On-policy, Off-policy)
• The Control Problem (Value Iteration, Q-learning, LSPI)
MDPs
An MDP is defined by a 5-tuple, $\langle \gamma, S, A, (P_a)_{a \in A}, (R_a)_{a \in A} \rangle$, where
\[ P_a(s' \mid s) = P_0(s' \mid s, a), \qquad R_a : S \times S \to \mathbb{R}. \]
$(P_a)_{a \in A}$ and $(R_a)_{a \in A}$ are called the model or transition dynamics.
A policy, $\pi : S \times A \to [0, 1]$, selects actions at states. Think of a policy as the way you act every day.
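My MDP example
• $S$: UofA, EDM, HK, Paomadi, Noah
• $A$: the set of the links
• $(P_a)_{a \in A}$: deterministic
• $R_a(s, s') = r(s')$
• $\pi(\text{UA}, \text{Edmonton}) = 1$, $\pi(\text{HK}, \text{Noah}) = 0.9$, $\pi(\text{HK}, \text{Paomadi}) = 0.1$; etc.
[Figure: policy π on the example MDP. University of Alberta (t=0) leads with probability 1.0 to Edmonton Airport (t=1, r = -100 dollars), which leads with probability 1.0 to HK Airport (t=2, r = -1,000 dollars). From HK Airport, one link is taken with probability 0.9 (t=3, r = 10,000 dollars) and the other with probability 0.1 (t=3, r = r_horse).]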
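Value function
\[ V^\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s,\ a_t \sim \pi(s_t, \cdot) \right] \]
Optimal policy
\[ V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s), \quad \text{for all } s \in S. \]
\[ V^\pi(\text{UofA}) = -100 - 1000\gamma + \gamma^2 \left( 0.1 \times (-1000) + 0.9 \times 10{,}000 \right) \]
If $r_{horse} = -1{,}000$:
\[ V^*(\text{UofA}) = -100 - 1000\gamma + \gamma^2 \Big( \underbrace{1.0}_{=\pi^*(\text{HK},\,\text{Noah})} \times 10{,}000 \Big) \]
If $r_{horse} = 1{,}000{,}000$:
\[ V^*(\text{UofA}) = -100 - 1000\gamma + \gamma^2 \Big( \underbrace{1.0}_{=\pi^*(\text{HK},\,\text{Paomadi})} \times 1{,}000{,}000 \Big) \]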
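The three values above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the choice $\gamma = 0.9$ are assumptions made only for illustration, and the optimal policy is found by comparing the two deterministic choices at HK (the value is linear in the probability of choosing Noah, so the maximum sits at an endpoint).

```python
def value_at_uofa(gamma, p_noah, r_horse):
    """Discounted return from UofA when Noah is chosen at HK with probability p_noah."""
    r_edm, r_hk, r_noah = -100.0, -1000.0, 10_000.0  # rewards read off the example
    final_reward = p_noah * r_noah + (1.0 - p_noah) * r_horse
    return r_edm + gamma * r_hk + gamma**2 * final_reward


def optimal_value_at_uofa(gamma, r_horse):
    """Best value over the two deterministic choices at HK (Paomadi or Noah)."""
    return max(value_at_uofa(gamma, p, r_horse) for p in (0.0, 1.0))


gamma = 0.9  # hypothetical discount factor, not given on the slide
v_pi = value_at_uofa(gamma, p_noah=0.9, r_horse=-1000.0)
v_star_cheap_horse = optimal_value_at_uofa(gamma, r_horse=-1000.0)
v_star_rich_horse = optimal_value_at_uofa(gamma, r_horse=1_000_000.0)
print(v_pi, v_star_cheap_horse, v_star_rich_horse)
```

With this discount factor the stochastic policy is worth less than either optimal policy, and the optimal choice at HK flips from Noah to Paomadi once $r_{horse}$ exceeds 10,000.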
MDPs (cont'd): Dynamic programming
Bellman equation. For all $s \in S$ and any policy $\pi$, one-step look-ahead gives
\[ V^\pi(s) = r(s) + \gamma E_{S' \sim P^\pi(s, \cdot)}[V^\pi(S')], \]
where $r(s) = \sum_{s'} P^\pi(s, s') r(s, s')$ and $r(s, s') = \sum_{a \in A} P_a(s, s') R_a(s, s')$.
Solving for $V^\pi$ for an ordinary policy $\pi$ is called policy evaluation. It is simple: power iteration.
Solving for $V^{\pi^*}$ or $\pi^*$ is called control, usually using value iteration:
\[ V_{k+1}(s) = \max_a E[r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s, a_t = a] = \max_a \sum_{s'} P_a(s' \mid s) \left( R_a(s, s') + \gamma V_k(s') \right) \]
Policy iteration is similar.
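The value-iteration update above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not code from the talk: the array layout (`P[a, s, s']`, `R[a, s, s']`), the stopping tolerance, and the two-state toy MDP are all hypothetical choices.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """P[a, s, t] = P_a(t | s); R[a, s, t] = R_a(s, t). Returns (V, greedy policy)."""
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        # Q[a, s] = sum over next states t of P_a(t|s) * (R_a(s,t) + gamma * V_k(t))
        Q = np.einsum("ast,ast->as", P, R + gamma * V)
        V_new = Q.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=0)


# Toy MDP (assumption): action 0 stays put, action 1 swaps the two states;
# only staying in state 1 is rewarded.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.zeros((2, 2, 2))
R[0, 1, 1] = 1.0
V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)
```

In the toy problem the greedy policy swaps out of state 0 and then stays in state 1, which matches what one would derive by hand from the Bellman optimality equation.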
RL
Features of RL:
• Sample-based learning. No model.
• Only intermediate rewards are observed.
• Partially observable, e.g., citation count prediction.
Prediction/Control: solving for $V^\pi$ (for some $\pi$) or $V^*$ using samples. Sample efficiency and memory are important. Algorithms:
• TD, Q-learning [Barto et al. 83; Sutton 88; Dayan 92; Bertsekas 96]
• Dyna [Sutton et al. 91] and linear Dyna [Sutton et al. 08]
• LSTD [Boyan 02], LSPI [Lagoudakis and Parr 03]
• GTD [Sutton et al. 09; Maei et al. 10]
Prediction
Feature mapping: $\phi : S \to \mathbb{R}^n$, $n$ being the number of features.
Linear Function Approximation (LFA). We approximate $V^\pi$ by
\[ V^\pi(s) \approx \phi(s)^\top \theta, \]
for $s \in S$, where $\theta$ is the parameter vector (to learn).
Samples (this is our Big Data):
\[ D = \left( \langle \phi(s_t), a_t, r_{t+1}, \phi(s_{t+1}) \rangle \right)_{t=1,2,\ldots,T} \]
Prediction: Given an input policy $\pi$, output an estimate of the value function $V^\pi$. We learn a predictor on $D$ using $\phi$.
On-policy: $D$ is created by following $\pi$.
Off-policy: $D$ is not created by $\pi$.
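Prediction (cont'd): TD (Sutton, 88)
Given the tuple $\langle \phi(s_t), a_t, r_{t+1}, \phi(s_{t+1}) \rangle$, Temporal Difference (TD) learning (without eligibility traces) computes
\[ \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \]
where $V(s) = \phi(s)^\top \theta$ is the current estimate. $\delta_t$ is called the TD error, which is a sample of the Bellman residual:
\[ E[\delta_t \mid s_t = s] = r(s) + \gamma \sum_{s' \in S} P^\pi(s, s') V(s') - V(s). \]
The parameter update is
\[ \Delta\theta_t \propto \alpha_t \delta_t \phi(s_t). \]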
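As a concrete reading of the TD update above, here is a minimal sketch of TD(0) with linear function approximation. It is an illustration under assumptions, not the speaker's code: the function name, the constant step-size, and the two-state cycle used as data are hypothetical.

```python
import numpy as np


def td0_linear(samples, n_features, gamma, alpha):
    """TD(0) without eligibility traces; samples are (phi, reward, phi_next) triples."""
    theta = np.zeros(n_features)
    for phi, reward, phi_next in samples:
        delta = reward + gamma * theta @ phi_next - theta @ phi  # TD error
        theta = theta + alpha * delta * phi                      # step along phi(s_t)
    return theta


# Toy data (assumption): two states visited alternately, tabular features,
# reward 0 when leaving state 0 and reward 1 when leaving state 1.
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
samples = [(e0, 0.0, e1), (e1, 1.0, e0)] * 1000
theta = td0_linear(samples, n_features=2, gamma=0.5, alpha=0.1)
print(theta)
```

For this cycle the Bellman equation gives $V(0) = 2/3$ and $V(1) = 4/3$, and the learned weights approach those values.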
Preconditioning Framework [ICML 08]
Previously: Issues of step-size, sample efficiency, and sparsity: LSTD [Boyan 02], LSPE [Bertsekas et al. 96, 03, 04], FPKF [Van Roy 06], iLSTD [Geramifard et al. 06, 07].
Contribution: I proposed a general class of algorithms by applying the preconditioning technique from iterative analysis, which includes the above-mentioned algorithms. I addressed all of these issues within this framework. Empirical results: step-size adaptation learns much more quickly; sparsity-based storage and computation are more efficient.
Multi-step Linear Dyna [NIPS 09]
Previously: Online planning is believed to speed up learning and lead to better decisions (mostly in the tabular setting), but "Model-based is poorer than model-free". Dyna [Sutton et al. 92] / linear Dyna [Sutton et al. 08] is an integrated architecture for real-time acting, learning, modeling, and planning, none of which waits for the others to complete. However, linear Dyna was found to perform worse than (model-free) TD learning [Sutton et al. 08].
Contribution: I improved linear Dyna [Sutton et al. 08] to perform much better than TD. I also extended linear Dyna from single-step to multi-step planning, and demonstrated on Mountain-car (an under-powered car climbing a mountain) that multi-step planning predicts more accurately than single-step planning.
LAM based off-policy learning [ICML-workshop 10]
Previously: Off-policy learning is ubiquitous. TD can diverge but is reasonably fast when it converges. GTD algorithms [Sutton et al. 09, 10, 11] converge but are slow.
Contribution: I used a linear action model (LAM) based framework for off-policy learning. It can learn various policies in parallel from a single stream of data, for quick real-time decision making. It was evaluated on two hard control problems with continuous states. I recommend using LAM based off-policy learning in place of on-policy learning.
Deterministic MDPs [with Csaba]
Previously. Theory: state aggregation, LFA; Practice: LFA and neural networks.
Contribution: A very general framework for RL with function approximation. We propose to view all RL methods as building some correspondence MDP, which has a smaller state space than the original. We solve the correspondence MDP and lift the policies and value functions found there back to the original. A few important results are proved (20+ lemmas and theorems). This framework is helpful in understanding existing algorithms as well as developing new ones.
Reinforcement Ranking [with Dale]
Does the Bellman equation look familiar to you? PageRank? Stationary distribution?
Contribution: We proposed a framework for discovering authorities using rewards defined for pages/links. It needs no stationary distribution at all, but is still guaranteed to converge. Evaluation is performed on Wikipedia, DBLP and WebSpamUK. We compared the precision and recall with PageRank and TrustRank. Promising results on Wikipedia and DBLP.
Universal Option Models [with R.C.S]
Previously: Options are used to describe high-level decisions. The execution of an option is a sequence of actions (abstraction). Traditional option models consist of a reward part and a state-prediction part. They are very inefficient for multiple reward functions (or when the reward function changes dynamically).
Contribution: We proposed a new way of modeling options: removing the reward part but adding a state occupancy part. We proved that (a) given any reward function, we can construct the return of the option from the new model; (b) with the new model we can recover the TD solution without any planning computation. On a simulated StarCraft 2 game, it is much more efficient for planning than the traditional model. Very suitable for large real-time games.
LAM-API [AAAI 12]
Previous API solutions (experience replay [Lin 92], LSPI [Lagoudakis and Parr 03]) have to remember the sample set $D$ and sweep all the samples in each iteration. $D$ can be very large in practice.
Key idea: Summarize your Big Data with a model. Work with the model.
First, we learn a linear model, $\langle F^a, f^a \rangle$ for each action $a$, from the samples. For a given action $a$ and any given state $s \in S$, with $s' \sim P_a(s, \cdot)$, we expect that
\[ F^a \phi(s) \approx E[\phi(s')] \quad \text{and} \quad (f^a)^\top \phi(s) \approx E[R_a(s, s')]. \]
Complexity of modeling: $O(Tn^2)$.
Second, we use all the LAMs to perform API over a list of features $L$. Complexity: $O(|L| n^2 N_{iter})$.
• LAM-API: $O(Tn^2) + O(|L| n^2 N_{iter})$
• LSPI: $O(Tn^2 N_{iter})$
Big Data: $T \gg |L|$, which means LAM-API costs $O(Tn^2)$, far less than the $O(Tn^2 N_{iter})$ of LSPI.
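One way to obtain a model with the two properties above is per-action least squares. The sketch below is an assumption for illustration: the slides do not say which estimator was used, and the function name, the small ridge term, and the sample layout `(phi, action, reward, phi_next)` are hypothetical. Each sample costs $O(n^2)$ for the outer products, which is consistent with the stated $O(Tn^2)$ modeling cost.

```python
import numpy as np


def learn_lam(samples, n_features, n_actions, ridge=1e-6):
    """Least-squares fit of F^a (next-feature model) and f^a (reward model) per action."""
    # C[a] = sum phi phi^T (plus a small ridge), D[a] = sum phi_next phi^T, e[a] = sum r phi
    C = np.stack([ridge * np.eye(n_features) for _ in range(n_actions)])
    D = np.zeros((n_actions, n_features, n_features))
    e = np.zeros((n_actions, n_features))
    for phi, action, reward, phi_next in samples:
        C[action] += np.outer(phi, phi)
        D[action] += np.outer(phi_next, phi)
        e[action] += reward * phi
    # F^a = D[a] C[a]^{-1} (C[a] is symmetric) and f^a = C[a]^{-1} e[a]
    F = [np.linalg.solve(C[a], D[a].T).T for a in range(n_actions)]
    f = [np.linalg.solve(C[a], e[a]) for a in range(n_actions)]
    return F, f
```

After this pass the samples are no longer needed: policy iteration can work with the $n \times n$ matrices and $n$-vectors alone.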
LAM-API (cont'd)
Algorithm 1: LAM-API with LSTD
Input: a list of features $L = \{\phi_i\}$, a LAM $(\langle F^a, f^a \rangle)_{a \in A}$.
Output: a weight vector $\theta$.
Initialize $\theta$
repeat for $N_{iter}$ iterations
    for $\phi_i$ from $L$ do
        Select greedy action: $a^* = \arg\max_a \{ (f^a)^\top \phi_i + \gamma \theta^\top F^a \phi_i \}$
        Select model: $F^* = F^{a^*}$, $f^* = f^{a^*}$
        Produce prediction for the next feature vector and reward: $\phi_{i+1} = F^* \phi_i$, $r_i = (f^*)^\top \phi_i$
        Accumulate LSTD structures: $A = A + \phi_i (\gamma \phi_{i+1} - \phi_i)^\top$, $b = b + \phi_i r_i$
    $\theta = -A^{-1} b$
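Algorithm 1 can be sketched in Python as follows. This is a hedged reading of the pseudocode, not the authors' implementation: resetting $A$ and $b$ at the start of every iteration, solving for $\theta$ once per sweep, and using a least-squares solve in place of an explicit inverse are assumptions, as are the function name and the toy model in the usage lines.

```python
import numpy as np


def lam_api(features, F, f, gamma, n_iter):
    """features: (m, n) array of feature vectors; F[a]: (n, n); f[a]: (n,)."""
    n = features.shape[1]
    theta = np.zeros(n)
    for _ in range(n_iter):
        A = np.zeros((n, n))  # assumption: LSTD structures restart each iteration
        b = np.zeros(n)
        for phi in features:
            # greedy action under the model and the current weights
            q = [f[a] @ phi + gamma * theta @ (F[a] @ phi) for a in range(len(F))]
            a_star = int(np.argmax(q))
            phi_next = F[a_star] @ phi  # predicted next feature vector
            r = f[a_star] @ phi         # predicted reward
            A += np.outer(phi, gamma * phi_next - phi)
            b += phi * r
        theta = -np.linalg.lstsq(A, b, rcond=None)[0]  # theta = -A^{-1} b
    return theta


# Toy model (assumption): tabular features, action 0 stays, action 1 swaps states,
# and only staying in the second state is rewarded.
F = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
f = [np.array([0.0, 1.0]), np.array([0.0, 0.0])]
theta = lam_api(np.eye(2), F, f, gamma=0.9, n_iter=5)
print(theta)
```

On this toy model the weights settle at the optimal values of the two states after the second iteration, and no sample set is touched inside the loop.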
LAM-API (cont'd)
Compared learning quality with LSPI. $L = \{\phi_i \mid \phi_i \leftarrow D_i\}$.
Chain-walk problems. Left: 4-state chain. Right: LAM-LSTD on the 50-state chain.
LAM-LSTD converges in 4 iterations. At iteration 2, the policy is already optimal.
[Figure: LAM-LSTD on the 50-state chain. Value functions (y-axis) against state number (x-axis), one panel per iteration, each titled with the policy:
Iter. 0: all actions are 'R'
Iter. 1: RRRRRRRRRRRLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLL
Iter. 2: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL]
LAM-API (cont'd)
LSPI converges in 14 iterations; it found the optimal policy at the last iteration.
[Figure: LSPI. Value functions (y-axis) against state number (x-axis), one panel per iteration, each titled with the policy.
(c) LSPI: iterations 0, 1, 7
Iter. 0: all actions are 'R'
Iter. 1: LLLLLRRRRRRLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRLLLLLLR
Iter. 7: LRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLRR
(d) LSPI: iterations 8, 9, 14
Iter. 8: RRRRLLRRRRRRLLLLLLLRRRLLLLLRRRRRRLLLLRRRRRRRLLLLLL
Iter. 9: LRRRRRRRRRRLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRLLLLLLLLR
Iter. 14: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL]
LAM-API (cont'd): Cart-Pole
Goal: Keep the pendulum above horizontal (for a maximum of 3000 steps).
Reward: binary; State: angle and angular velocity (both continuous)
[Figure (e): Cart-pole balancing.]
[Figure (f): Balancing steps. Number of balanced steps (0 to 3000) against number of training episodes (0 to 1000). Curves: LAM-LSTD/LSPI, best; LAM-LSTD, average; LSPI, average; LAM-LSTD, worst; LSPI, worst.]
Why important? LSPI [Lagoudakis and Parr 03] is widely used; "LSPI is arguably the most competitive RL algorithm available in large environments." [Li et al. 09].
Citation Count Prediction
Citation count: the most used measure in academia. Predicting it is interesting. We studied the prediction of the citation count for papers.
Previously, [Yan et al. 11] and [Fu 08] studied a citation count prediction problem using supervised learning (SL).
Training (spatial):
Input → Output
x: feature vectors in 1990 → y: citation counts until 2000
Given a paper's features in 1990: predict.
Now given a paper's features in 2000: ? (a temporal aspect)
Citation Count Prediction (cont'd)
Citation count prediction is temporal.
Problem formulation. Define the "value" of a paper $p$ at year $t$ by the sum of discounted numbers of citations in all the subsequent years:
\[ V(p, t) = \sum_{q=t}^{\infty} \gamma^{q-t} c_q, \quad \gamma \in [0, 1), \]
where $c_q$ is the number of citations the paper receives in year $q$.
When $t$ is the publication year of the paper and $\gamma$ approaches one, $V(p, t)$ will be close to the total citation count of the paper.
Question: What is my state here? $s_t = (p, t)$
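A small numeric illustration of this definition may help. The sketch assumes a made-up paper with four years of citation counts and no citations afterwards, so the infinite sum truncates; the function name and the numbers are hypothetical.

```python
def paper_value(yearly_citations, gamma):
    """Discounted citation sum, with yearly_citations[0] being the count in year t."""
    return sum(gamma**k * c for k, c in enumerate(yearly_citations))


counts = [3, 5, 8, 4]  # made-up counts for years t, t+1, t+2, t+3
value_discounted = paper_value(counts, gamma=0.5)
value_near_total = paper_value(counts, gamma=0.999)
print(value_discounted, value_near_total, sum(counts))
```

With $\gamma = 0.5$ later years count for little, while with $\gamma$ close to one the value is nearly the plain total of 20 citations.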
Citation Count Prediction (cont'd)
We represent the value by $V(p, t) = \phi(p, t)^\top \theta$.
Samples: a data set,
\[ D = \cup_{p \in P} D_p; \quad D_p = \left( \langle \phi(p, t), c_{t+1}, \phi(p, t+1) \rangle \right)_{t=1990,1991,\ldots,2000} \]
Features. $\phi(p, t)$: a vector, having entries for, e.g., the number of citations for each author till year $t$, the number of citations for the venue till year $t$, etc.
Citation Count Prediction (cont'd)
Long term: predict more than 10 years ahead, using TD and LSTD.
Short term: predict in $k$ ($k < 10$) years
• not a standard RL problem
• We extended LAM to this context and proposed a model-based prediction method.
Key idea: learn a model $\langle F, f \rangle$ from the year-to-year status change of papers. Given $\langle \phi(p, t), c_{t+1}, \phi(p, t+1) \rangle$, update
\[ \Delta F = \alpha \left[ \phi(p, t+1) - F \phi(p, t) \right] \phi(p, t)^\top, \]
and
\[ \Delta f = \alpha \left[ c_{t+1} - f^\top \phi(p, t) \right] \phi(p, t). \]
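The two updates translate directly into code. The following is a minimal sketch under assumptions: the function name, the step-size, and the noiseless toy data with indicator features are illustrative only and do not come from the talk.

```python
import numpy as np


def update_model(F, f, phi, citations_next, phi_next, alpha):
    """One stochastic-gradient step on the feature model F and the citation model f."""
    F_new = F + alpha * np.outer(phi_next - F @ phi, phi)
    f_new = f + alpha * (citations_next - f @ phi) * phi
    return F_new, f_new


# Toy check (assumption): samples generated by a known linear model without noise.
F_true = np.array([[0.5, 0.2], [0.1, 0.9]])
f_true = np.array([2.0, 3.0])
F, f = np.zeros((2, 2)), np.zeros(2)
for step in range(100):
    phi = np.eye(2)[step % 2]  # alternate between the two indicator features
    F, f = update_model(F, f, phi, f_true @ phi, F_true @ phi, alpha=0.5)
print(F, f)
```

On this noiseless data both estimates converge to the generating model; with real citation data a smaller step-size would be the usual choice.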
Citation Count Prediction (cont'd)
What did we learn?
• $f$: a one-year predictor
• $F$: multiple one-year predictors
[Figure: Linear transient model. The features of a paper in 2000 (my #CC in the last few years, #CC of last author, #CC of the citing papers, #CC of first author, ...) are mapped to the same features in 2001.]
Citation Count Prediction (cont'd)
How do we use the model?
Given: the feature vector of a paper $s$ at year $t = 2012$, $\phi(s_{2012})$.
• Citation count in 2013: $c_1 = f^\top \phi(s_{2012})$.
• Citation count in 2014: we need $\phi(s_{2013})$ (unavailable). We can predict the features: $\phi_{2013} \stackrel{\text{def}}{=} F \phi(s_{2012})$. Then we apply $f$ again to predict by
\[ c_2 = \underbrace{f^\top \phi_{2013}}_{\text{using a prediction to predict}} \]
This generalizes the key idea of TD, linear Dyna [Sutton et al. 08], and LAM-API to more general multi-step prediction problems. Similarly, we can extrapolate more years into the future.
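The roll-out described above amounts to alternating $f$ (read off a count) and $F$ (advance the features). A minimal sketch follows; the function name and the toy model are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np


def predict_citations(F, f, phi_now, n_years):
    """Predicted citation counts for the next n_years, reusing predicted features."""
    phi = np.asarray(phi_now, dtype=float)
    predictions = []
    for _ in range(n_years):
        predictions.append(float(f @ phi))  # one-year-ahead count from current features
        phi = F @ phi                       # predicted features for the following year
    return predictions


# Toy model (assumption): the first feature grows by the second one each year,
# and the citation count equals the first feature.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
f = np.array([1.0, 0.0])
predictions = predict_citations(F, f, phi_now=[2.0, 3.0], n_years=3)
print(predictions)
```

Because every step feeds a prediction into the next one, errors in $F$ compound with the horizon, which is one reason to expect accuracy to degrade for predictions further into the future.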
Citation Count Prediction: Empirical
"Now" is 2002.
Training data: the citation counts of 7K papers from 1990 to "Now".
Test data: their citation counts from "Now" to 2012.
Algorithms: LS/SVR, LSTD
[Figure (g) LS-train: LS (training). Prediction (0 to 600) against true value (0 to 600).]
[Figure (h) LS-test: LS (predicting future). Prediction (0 to 5000) against true value (0 to 700).]
Citation Count Prediction: Empirical
LSTD successfully generalizes over time for the training papers.
[Figure (i) LS-test: LSTD. Prediction (0 to 800) against true value (0 to 600).]
Citation Count Prediction: Empirical
Predicting for test papers (newer than the training papers).
Papers are marked according to year of publication. Black, green, blue, red, magenta:
• in the left plot, correspond to papers published in 1990, 1991, 1992, 1993, 1994
• in the middle plot, correspond to papers published in 1995, 1996, 1997, 1998, 1999
In the right plot: black and green mark papers published in 2000 and 2001.
True citation count: cross (+) marked; prediction: star (*) marked.
LSTD successfully generalizes over time for new papers.
[Figure: LSTD(0), prediction against true value, three panels: (j) papers 90-94, (k) papers 95-99, (l) papers 00-01.]
Citation Count Prediction: Empirical
Short-term prediction: the performance of the proposed method.
[Figure (m) 4 year; Papers 00-01: Dyna, 4-year prediction. Prediction (0 to 400) against true value (0 to 400).]
[Figure (n) Summary: RMSE against years into the future (1 to 8). Curves: paper-2000-2001, paper-1995-2000, paper-1990-1995, paper-before1990.]
Thank You!