Hengshuai noah


My RL Approach to Prediction and Control

Hengshuai Yao

University of Alberta

April 4, 2013

Hengshuai Yao Thesis Overview 1/33




Outline

• One-page Summary of my work (30 seconds)

• Background on Reinforcement Learning (RL) (8 slides; 6 minutes)

• Walkthrough of my work (6 slides; 4 minutes)

• LAM-API: Large-scale Off-policy Learning and Control (5 slides; 5 minutes)

• Citation count prediction using RL (10 slides; 8 minutes)





Summary of my work

Prediction

• A framework for existing prediction algorithms [ICML 08]

• Data efficiency for on-policy prediction (Multi-step linear Dyna [NIPS 09]) and off-policy prediction (LAM-off-policy [ICML-workshop 10])

Control

• Memory and sample efficiency for control (LAM-API [AAAI 12])

• Online abstract planning with Universal Option Models [in preparation for JAIR with Csaba, Rich and Shalabh]

• RL with general function approximation. Deterministic MDPs [in preparation for Machine Learning Journal with Csaba]

RL applications

• Citation count prediction [submitted to IEEE Trans. on SMC-part B]

• Ranking [current work with Dale]





MDPsRL Prediction

Background on RL

I will go over

• MDPs: Definition, Policies, Value Functions, and more

• Prediction Problems (TD, Dyna, On-policy, Off-policy)

• The Control Problem (Value Iteration, Q-learning, LSPI)






MDPs

An MDP is defined by a 5-tuple, $\langle \gamma, S, A, (P_a)_{a \in A}, (R_a)_{a \in A} \rangle$.

$P_a(s' \mid s) = P_0(s' \mid s, a)$

$R_a : S \times S \to \mathbb{R}$

$(P_a)_{a \in A}$ and $(R_a)_{a \in A}$ are called the model or transition dynamics.

A policy, $\pi : S \times A \to [0, 1]$, selects actions at states. Think of a policy as the way you act every day.






My MDP example

S: UofA, EDM, HK, Paomadi, Noah
A: the set of the links
$(P_a)_{a \in A}$: deterministic
$R_a(s, s') = r(s')$
$\pi(\text{UA}, \text{Edmonton}) = 1$, $\pi(\text{HK}, \text{Noah}) = 0.9$, $\pi(\text{HK}, \text{Paomadi}) = 0.1$; etc.

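[Figure: policy π on the example MDP. University of Alberta (t=0) → Edmonton Airport (t=1, r=$-100) with probability 1.0 → HK Airport (t=2, r=$-1,000) with probability 1.0. From HK Airport: to Noah with probability 0.9 (t=3, r=$10,000) or to Paomadi with probability 0.1 (t=3, r_{horse}).]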






Value function

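$$V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s,\ a_t \sim \pi(s_t, \cdot)\right]$$

Optimal policy

$$V^{*}(s) = V^{\pi^{*}}(s) = \max_{\pi} V^{\pi}(s), \quad \text{for all } s \in S.$$

$$V^{\pi}(\text{UofA}) = -100 - 1000\gamma + \gamma^{2}\,(0.1 \times (-1000) + 0.9 \times 10{,}000)$$

If $r_{\text{horse}} = -1{,}000$:

$$V^{*}(\text{UofA}) = -100 - 1000\gamma + \gamma^{2}\,(\underbrace{1.0}_{=\pi^{*}(\text{HK},\,\text{Noah})} \times 10{,}000)$$

If $r_{\text{horse}} = 1{,}000{,}000$:

$$V^{*}(\text{UofA}) = -100 - 1000\gamma + \gamma^{2}\,(\underbrace{1.0}_{=\pi^{*}(\text{HK},\,\text{Paomadi})} \times 1{,}000{,}000)$$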
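As a quick numerical check of the example values, here is a minimal sketch that evaluates the three expressions. The discount factor γ = 0.9 and the function names are assumptions made for illustration; the slides leave γ symbolic.

```python
def value_at_uofa(gamma, p_noah, r_horse):
    """Return from UofA when, at HK, the Noah link is taken with probability p_noah."""
    r_edmonton = -100.0    # reward at t=1 (Edmonton Airport)
    r_hk = -1000.0         # reward at t=2 (HK Airport)
    r_noah = 10000.0       # reward at t=3 if the Noah link is taken
    expected_last = p_noah * r_noah + (1.0 - p_noah) * r_horse
    return r_edmonton + gamma * r_hk + gamma ** 2 * expected_last


def optimal_value_at_uofa(gamma, r_horse):
    """HK is the only state with a choice, so compare the two deterministic branches."""
    return max(value_at_uofa(gamma, 1.0, r_horse),
               value_at_uofa(gamma, 0.0, r_horse))


gamma = 0.9  # assumed discount factor, not given in the slides
v_pi = value_at_uofa(gamma, 0.9, -1000.0)               # policy pi, r_horse = -1,000
v_star_low = optimal_value_at_uofa(gamma, -1000.0)      # optimal policy goes to Noah
v_star_high = optimal_value_at_uofa(gamma, 1000000.0)   # optimal policy goes to Paomadi
print(v_pi, v_star_low, v_star_high)
```

With this γ, the optimal policy switches from Noah to Paomadi once r_horse exceeds 10,000, which is the point the two V* cases illustrate.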






MDPs cont’–Dynamic programming

Bellman equation. For all $s \in S$, for any policy $\pi$, one-step look-ahead:

$$V^{\pi}(s) = r(s) + \gamma\, E_{S' \sim P^{\pi}(s, \cdot)}[V^{\pi}(S')],$$

where $r(s) = \sum_{s'} P^{\pi}(s, s')\, r(s, s')$; $r(s, s') = \sum_{a \in A} P_a(s, s')\, R_a(s, s')$.

Solving for $V^{\pi}$ for a given policy $\pi$ is called policy evaluation. It is simple: power iteration.

Solving for $V^{\pi^{*}}$ or $\pi^{*}$ is called control, usually done with value iteration:

$$V_{k+1}(s) = \max_{a} E[r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s, a_t = a] = \max_{a} \sum_{s'} P_a(s' \mid s)\,(R_a(s, s') + \gamma V_k(s'))$$

Policy iteration is similar.
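The value iteration backup can be sketched in a few lines for a tabular MDP. This is an illustration only: the dictionary layout, the action names, γ = 0.9 and the choice r_horse = -1,000 are assumptions, applied to the travel example from the earlier slide.

```python
def value_iteration(states, P, R, gamma, n_sweeps):
    """Tabular value iteration.

    P[s][a] is a dict {next_state: probability}; R[(s, a, next_state)] is the reward.
    States with no actions are treated as terminal with value 0.
    """
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        V_new = {}
        for s in states:
            # One backup per available action: sum_s' P_a(s'|s) (R_a(s,s') + gamma V_k(s'))
            backups = [
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in nxt.items())
                for a, nxt in P.get(s, {}).items()
            ]
            V_new[s] = max(backups) if backups else 0.0
        V = V_new
    return V


states = ["UofA", "EDM", "HK", "Noah", "Paomadi"]
P = {  # deterministic links; action names are hypothetical
    "UofA": {"to_EDM": {"EDM": 1.0}},
    "EDM": {"to_HK": {"HK": 1.0}},
    "HK": {"to_Noah": {"Noah": 1.0}, "to_Paomadi": {"Paomadi": 1.0}},
}
r_next = {"EDM": -100.0, "HK": -1000.0, "Noah": 10000.0, "Paomadi": -1000.0}
R = {(s, a, s2): r_next[s2]  # R_a(s, s') = r(s')
     for s, acts in P.items() for a, nxt in acts.items() for s2 in nxt}
V = value_iteration(states, P, R, gamma=0.9, n_sweeps=10)
print(V["UofA"])
```

Because the example has a horizon of three steps, three sweeps already suffice; the extra sweeps leave the values unchanged.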






RL

Features of RL:

• Sample-based learning. No model.

• Only intermediate rewards are observed.

• Partially observable, e.g., citation count prediction.

Prediction/Control: solving for $V^{\pi}$ (for some $\pi$) or $V^{*}$ using samples. Sample efficiency and memory use are important. Algorithms:

• TD, Q-learning [Barto et al. 83; Sutton 88; Dayan 92; Bertsekas 96]

• Dyna [Sutton et al. 91] and linear Dyna [Sutton et al. 08]

• LSTD [Boyan 02], LSPI [Lagoudakis and Parr 03]

• GTD [Sutton et al. 09; Maei et al. 10]






Prediction

Feature mapping: $\phi : S \to \mathbb{R}^{n}$, $n$ being the number of features.

Linear Function Approximation (LFA). We approximate $V^{\pi}$ by

$$V^{\pi}(s) \approx \phi(s)^{\top}\theta,$$

for $s \in S$, where $\theta$ is the parameter vector (to learn).

Samples (this is our Big Data)

$$D = (\langle \phi(s_t), a_t, r_{t+1}, \phi(s_{t+1}) \rangle)_{t=1,2,\ldots,T}$$

Prediction: Given an input policy $\pi$, output an estimate of the value function $V^{\pi}$. We learn a predictor on $D$ using $\phi$.

On-policy: $D$ is created by following $\pi$.

Off-policy: $D$ is not created by $\pi$.






Prediction-cont’-TD (Sutton, 88)

Given the tuple $\langle \phi(s_t), a_t, r_{t+1}, \phi(s_{t+1}) \rangle$, Temporal Difference (TD) learning (without eligibility traces):

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),$$

$\delta_t$ is called the TD error, which is a sample of the Bellman residual:

$$E[\delta_t \mid s_t = s] = r(s) + \gamma \sum_{s' \in S} P^{\pi}(s, s')\, V^{\pi}(s') - V^{\pi}(s).$$

$$\Delta\theta_t \propto \alpha_t\, \delta_t\, \phi(s_t)$$
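The TD update with linear function approximation can be sketched as follows. The two-state chain, the one-hot features, the step size and the discount factor are all assumptions chosen so that the fixed point is easy to verify by hand.

```python
def td0_update(theta, phi_s, reward, phi_next, gamma, alpha):
    """One TD(0) step with V(s) = phi(s)^T theta; returns the new theta and the TD error."""
    v_s = sum(w * x for w, x in zip(theta, phi_s))
    v_next = sum(w * x for w, x in zip(theta, phi_next))
    delta = reward + gamma * v_next - v_s                       # TD error
    theta_new = [w + alpha * delta * x for w, x in zip(theta, phi_s)]
    return theta_new, delta


# Hypothetical chain: s0 -> s1 (reward 1), s1 -> terminal (reward 2).
# The terminal state has an all-zero feature vector, so its value is 0.
gamma, alpha = 0.5, 0.1
samples = [
    ([1.0, 0.0], 1.0, [0.0, 1.0]),
    ([0.0, 1.0], 2.0, [0.0, 0.0]),
]
theta = [0.0, 0.0]
for _ in range(500):
    for phi_s, reward, phi_next in samples:
        theta, delta = td0_update(theta, phi_s, reward, phi_next, gamma, alpha)
print(theta)
```

Here the true values are V(s1) = 2 and V(s0) = 1 + 0.5 × 2 = 2, so θ should approach (2, 2).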





Preconditioning Framework [ICML 08]

Previously: Issues of step-size, sample efficiency, and sparsity: LSTD [Boyan 02], LSPE [Bertsekas et al. 96, 03, 04], FPKF [Van Roy 06], iLSTD [Geramifard et al. 06, 07].

Contribution: I proposed a general class of algorithms by applying the preconditioning technique from iterative analysis; this class includes the above-mentioned algorithms. I solved all these issues in this framework.

Empirical results: step-size adaptation learns much more quickly; sparsity-based storage and computation are more efficient.





Multi-step Linear Dyna [NIPS 09]

Previously: Online planning is believed to speed up learning and lead to better decisions (mostly tabular), but “Model-based is poorer than model-free”. Dyna [Sutton et al. 92]/linear Dyna [Sutton et al. 08] is an integrated architecture for real-time acting, learning, modeling, and planning, none of which waits for the others to complete. However, linear Dyna was found to perform worse than (model-free) TD learning [Sutton et al. 08].

Contribution: I improved linear Dyna [Sutton et al. 08] to perform much better than TD. I also extended linear Dyna from single-step to multi-step planning, and demonstrated on Mountain-car (an under-powered car climbing a mountain) that multi-step planning predicts more accurately than single-step planning.





LAM based off-policy learning [ICML-workshop 10]

Previously: Off-policy learning is ubiquitous. TD can diverge but is reasonably fast if it converges. GTD algorithms [Sutton et al. 09, 10, 11] converge but are slow.

Contribution: I used a linear action model (LAM) based framework for off-policy learning. It can learn various policies in parallel from a single stream of data, for quick real-time decision making. Evaluated on two continuous-state, hard control problems. I recommend using LAM based off-policy learning in place of on-policy learning.





Deterministic MDPs [with Csaba]

Previously. Theory: state aggregation, LFA; Practice: LFA and neural networks.

Contribution: A very general framework for RL with function approximation. We propose to view all RL methods as building some correspondence MDP, which has a smaller state space than the original. We solve the correspondence MDP and lift the policies and value functions found there back to the original. A few important results are proved (20+ lemmas and theorems). This framework is helpful in understanding existing algorithms as well as developing new ones.





Reinforcement Ranking [with Dale]

Does the Bellman equation look familiar to you? PageRank? Stationary distribution?

Contribution: We proposed a framework for discovering authorities using rewards defined for pages/links. It uses no stationary distribution at all, but is still guaranteed to converge. Evaluation is performed on Wikipedia, DBLP and WebSpamUK. We compared the precision and recall with PageRank and TrustRank. Promising results on Wikipedia and DBLP.





Universal Option Models [with R.C.S]

Previously: Options are used to describe high-level decisions. The execution of an option is a sequence of actions (abstraction). Traditional option models consist of a reward part and a state-prediction part. They are very inefficient for multiple reward functions (or when the reward function changes dynamically).

Contribution: We proposed a new way of modeling options: removing the reward part but adding a state-occupancy part. We proved that (a) given any reward function, we can construct the return of the option from the new model; (b) with the new model we can recover the TD solution without any planning computation. On a simulated StarCraft 2 game, it is much more efficient for planning than the traditional model. It is very suitable for large real-time games.





LAM-API [AAAI 12]

Previous API solutions (experience replay [Lin 92], LSPI [Lagoudakis and Parr 03]) have to remember the sample set $D$ and sweep all the samples in each iteration. $D$ can be very large in practice.

Key idea: Summarize your Big Data with a model. Work with the model.

First, we learn a linear model, $\langle F^{a}, f^{a} \rangle$, for each action $a$, from the samples. For a given action $a$ and any given state $s \in S$, with $s' \sim P_a(s, \cdot)$, we expect that

$$F^{a}\phi(s) \approx E[\phi(s')] \quad \text{and} \quad (f^{a})^{\top}\phi(s) \approx E[R_a(s, s')].$$

Complexity of modeling: $O(Tn^{2})$.

Second, we use all the LAMs to perform API. Complexity: $O(|L|\, n^{2} N_{\text{iter}})$.

LAM-API: $O(Tn^{2}) + O(|L|\, n^{2} N_{\text{iter}})$
LSPI: $O(Tn^{2} N_{\text{iter}})$

Big Data: $T \gg |L|$, which means the cost of LAM-API, $O(Tn^{2})$, is much smaller than that of LSPI, $O(Tn^{2} N_{\text{iter}})$.





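LAM-API-cont’

Algorithm 1 LAM-API with LSTD

Input: a list of features $L = \{\phi_i\}$, a LAM $(\langle F^{a}, f^{a} \rangle)_{a \in A}$.
Output: a weight vector $\theta$.

Initialize $\theta$
repeat for $N_{\text{iter}}$ iterations
    for $\phi_i$ from $L$ do
        Select greedy action: $a^{*} = \arg\max_{a} \{(f^{a})^{\top}\phi_i + \gamma\, \theta^{\top} F^{a} \phi_i\}$
        Select model: $F^{*} = F^{a^{*}}$, $f^{*} = f^{a^{*}}$
        Produce prediction for the next feature vector and reward: $\phi_{i+1} = F^{*}\phi_i$, $r_i = (f^{*})^{\top}\phi_i$
        Accumulate LSTD structures: $A = A + \phi_i(\gamma\phi_{i+1} - \phi_i)^{\top}$, $b = b + \phi_i r_i$
    $\theta = -A^{-1} b$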
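A minimal runnable sketch of Algorithm 1 follows. It is an illustration under stated assumptions, not the author's implementation: A and b are reset to zero at the start of every iteration and θ is solved once per iteration (the slide does not show this explicitly), the linear system is solved with a small hand-written Gaussian elimination, and the two-state, two-action LAM is a made-up example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting (small dense A)."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def lam_api_lstd(features, lam, gamma, n_iter):
    """lam maps each action to a pair (F, f); features is the list L of feature vectors."""
    n = len(features[0])
    theta = [0.0] * n
    for _ in range(n_iter):
        A = [[0.0] * n for _ in range(n)]   # assumption: reset every iteration
        b = [0.0] * n
        for phi in features:
            # Greedy action w.r.t. the model: (f^a)^T phi + gamma * theta^T F^a phi
            a_star = max(lam, key=lambda a: dot(lam[a][1], phi)
                         + gamma * dot(theta, matvec(lam[a][0], phi)))
            F, f = lam[a_star]
            phi_next = matvec(F, phi)       # predicted next feature vector
            r = dot(f, phi)                 # predicted reward
            for i in range(n):
                b[i] += phi[i] * r
                for j in range(n):
                    A[i][j] += phi[i] * (gamma * phi_next[j] - phi[j])
        theta = solve(A, [-x for x in b])   # theta = -A^{-1} b
    return theta


# Hypothetical 2-state problem with one-hot features: action "R" always leads to
# state 1 with reward 1, action "L" always leads to state 0 with reward 0.
lam = {
    "L": ([[1.0, 1.0], [0.0, 0.0]], [0.0, 0.0]),
    "R": ([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0]),
}
theta = lam_api_lstd([[1.0, 0.0], [0.0, 1.0]], lam, gamma=0.5, n_iter=3)
print(theta)
```

In this toy problem always choosing "R" is optimal and yields a value of 1 / (1 - 0.5) = 2 in both states, so θ should equal (2, 2). Note that the loop touches only the feature list and the model, never the original samples.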





LAM-API-cont’

Compared learning quality with LSPI. $L = \{\phi_i \mid \phi_i \leftarrow D_i\}$.

Chain-walk problems. Left: 4-state chain. Right: LAM-LSTD on the 50-state chain.

LAM-LSTD converges in 4 iterations. At iteration 2, the policy is already optimal.

[Figure: LAM-LSTD on the 50-state chain. Three panels plot the value functions against the state number (1 to 50); each panel title gives the policy at that iteration.
Iter. 0: all actions are ’R’
Iter. 1: RRRRRRRRRRRLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLL
Iter. 2: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL]





LAM-API-cont’

LSPI converges in 14 iterations; it finds the optimal policy at the last iteration.

[Figure: LSPI on the 50-state chain. Each panel plots the value functions against the state number (1 to 50); each panel title gives the policy at that iteration.
(c) LSPI: iteration 0, 1, 7
Iter. 0: all actions are ’R’
Iter. 1: LLLLLRRRRRRLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRLLLLLLR
Iter. 7: LRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLRR
(d) LSPI: iteration 8, 9, 14
Iter. 8: RRRRLLRRRRRRLLLLLLLRRRLLLLLRRRRRRLLLLRRRRRRRLLLLLL
Iter. 9: LRRRRRRRRRRLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRLLLLLLLLR
Iter. 14: RRRRRRRRRLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRLLLLLLLLL]





LAM-API-cont’-Cart-Pole

Goal: Keep the pendulum above horizontal (for a maximum of 3000 steps).

Reward: binary; State: angle and angular velocity (both continuous)

[Figure (e): Cart-pole balancing.]
[Figure (f): Balancing steps. Number of balanced steps (0 to 3000) against number of training episodes (0 to 1000). Curves: LAM-LSTD, worst; LAM-LSTD, average; LSPI, worst; LSPI, average; LAM-LSTD/LSPI, best.]

Why important? LSPI [Lagoudakis and Parr 03] is widely used; “LSPI is arguably the most competitive RL algorithm available in large environments.” [Li et al. 09].




Citation Count Prediction

Citation count: the most used measure in academia. Predicting it is interesting. We studied the prediction of the citation count for papers.

Previously, [Yan et al. 11] and [Fu 08] studied a citation count prediction problem using supervised learning (SL).

Training (spatial):

Input → Output
x: feature vectors in 1990 → y: citation counts until 2000

Given a paper’s features in 1990: predict.
Now given a paper’s features in 2000: ? (a temporal aspect)





Citation Count Prediction–cont’

Citation count prediction is temporal.

Problem formulation. Define the “value” of a paper $p$ at year $t$ by the sum of discounted numbers of citations in all the subsequent years:

$$V(p, t) = \sum_{q=t}^{\infty} \gamma^{q-t} c_q, \quad \gamma \in [0, 1)$$

where $c_q$ is the number of citations the paper receives in year $q$.

When $t$ is the publication year of the paper and $\gamma$ approaches one, $V(p, t)$ will be close to the total citation count of the paper.

Question: What is my state here? $s_t = (p, t)$
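To make the definition concrete, here is a small sketch that evaluates V(p, t) on a made-up citation record. The infinite sum is truncated to the years present in the record, which is an assumption of this illustration, as are the counts themselves.

```python
def paper_value(citations_by_year, t, gamma):
    """Truncated V(p, t): sum over recorded years q >= t of gamma^(q - t) * c_q."""
    return sum(gamma ** (q - t) * c
               for q, c in citations_by_year.items() if q >= t)


# Hypothetical record: citations received per year by one paper.
counts = {2000: 4, 2001: 2, 2002: 1}
v_half = paper_value(counts, 2000, gamma=0.5)        # 4 + 0.5*2 + 0.25*1
v_near_one = paper_value(counts, 2000, gamma=0.999)  # close to the total count 7
v_later = paper_value(counts, 2001, gamma=0.5)       # 2 + 0.5*1
print(v_half, v_near_one, v_later)
```

With γ close to one the value is close to the total citation count, as the slide notes.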





Citation Count Prediction–cont’

We represent $V$ by
$$V(p, t) = \phi(p, t)^{\top}\theta.$$

Samples: a data set,

$$D = \cup_{p \in P} D_p; \quad D_p = (\langle \phi(p, t), c_{t+1}, \phi(p, t+1) \rangle)_{t=1990,1991,\ldots,2000}$$

Features. $\phi(p, t)$: a vector, having entries for, e.g., the number of citations for each author till year $t$, the number of citations for the venue till year $t$, etc.





Citation Count Prediction–cont’

Long term: predict more than 10 years ahead. TD and LSTD

Short term: predict $k$ ($k < 10$) years ahead

• not a standard RL problem

• We extended LAM to this context and proposed a model-based prediction method.

Key idea: learn a model $\langle F, f \rangle$ from the year-to-year status change of papers. Given $\langle \phi(p, t), c_{t+1}, \phi(p, t+1) \rangle$, update

$$\Delta F = \alpha\, [\phi(p, t+1) - F\phi(p, t)]\, \phi(p, t)^{\top},$$

and

$$\Delta f = \alpha\, [c_{t+1} - f^{\top}\phi(p, t)]\, \phi(p, t).$$
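The two updates can be sketched directly. In this illustration the "true" dynamics F_true and f_true, the one-hot inputs and the step size α = 0.5 are made-up assumptions, used only to generate samples and to check that the learned model recovers them.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lam_update(F, f, phi, c_next, phi_next, alpha):
    """One gradient-style update of the linear model <F, f> from a single sample."""
    n = len(phi)
    pred_phi = [dot(row, phi) for row in F]   # F phi(p, t)
    pred_c = dot(f, phi)                      # f^T phi(p, t)
    F_new = [[F[i][j] + alpha * (phi_next[i] - pred_phi[i]) * phi[j]
              for j in range(n)] for i in range(n)]
    f_new = [f[j] + alpha * (c_next - pred_c) * phi[j] for j in range(n)]
    return F_new, f_new


# Hypothetical ground truth used to generate (phi, c_next, phi_next) samples.
F_true = [[0.5, 0.0], [0.2, 1.0]]
f_true = [3.0, 1.0]
inputs = [[1.0, 0.0], [0.0, 1.0]]
samples = [(phi, dot(f_true, phi), [dot(row, phi) for row in F_true])
           for phi in inputs]

F = [[0.0, 0.0], [0.0, 0.0]]
f = [0.0, 0.0]
for _ in range(60):
    for phi, c_next, phi_next in samples:
        F, f = lam_update(F, f, phi, c_next, phi_next, alpha=0.5)
print(F, f)
```

Each update costs O(n²) for F and O(n) for f, and no past samples need to be stored.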





Citation Count Prediction–cont’

What did we learn?

$f$: a one-year predictor
$F$: multiple one-year predictors

[Figure: Linear transient model. The features of a paper in 2000 (my #CC in the last few years, #CC of last author, #CC of the citing papers, #CC of first author, ...) are mapped to the same features in 2001.]





Citation Count Prediction–cont’

How do we use the model?

Given: the feature vector of a paper $s$ at year $t = 2012$, $\phi(s_{2012})$.

Citation count in 2013: $c_1 = f^{\top}\phi(s_{2012})$.

Citation count in 2014: We need $\phi(s_{2013})$ (unavailable). We can predict the features: $\phi_{2013} \overset{\text{def}}{=} F\phi(s_{2012})$. Then we apply $f$ again to predict by

$$c_2 = f^{\top}\underbrace{\phi_{2013}}_{\text{using a prediction to predict}}$$

This generalizes the key idea of TD, linear Dyna [Sutton et al. 08], and LAM-API to more general multi-step prediction problems.

Similarly, we can extrapolate more years into the future.
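The multi-step use of the model amounts to rolling the feature vector forward with F and reading off a citation count with f at each step. The sketch below assumes a toy two-dimensional feature vector (cumulative citations, citations last year) and hand-picked F and f; none of these values come from the experiments.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def predict_citations(F, f, phi, k):
    """Predict citation counts for the next k years from the current features phi."""
    predictions = []
    for _ in range(k):
        predictions.append(dot(f, phi))   # c = f^T phi
        phi = matvec(F, phi)              # predicted features for the following year
    return predictions


# Hypothetical model: next year's citations are half of last year's.
F = [[1.0, 0.5],    # cumulative' = cumulative + 0.5 * last_year
     [0.0, 0.5]]    # last_year'  = 0.5 * last_year
f = [0.0, 0.5]
phi_2012 = [10.0, 8.0]
preds = predict_citations(F, f, phi_2012, k=3)
print(preds)   # predicted counts for 2013, 2014, 2015
```

Only the first prediction uses observed features; every later one is computed from predicted features, which is why errors can grow with the horizon.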





Citation Count Prediction–Empirical

“Now” is 2002.

Training data: the citation counts of 7K papers from 1990 to “Now”.
Test data: their citation counts from “Now” to 2012.
Algorithms: LS/SVR, LSTD

[Figure (g): LS-train. LS (training): prediction against true value; both axes run from 0 to 600.]
[Figure (h): LS-test. LS (predicting future): prediction against true value; the true-value axis runs from 0 to 700 and the prediction axis from 0 to 5000.]




Citation Count Prediction–Empirical

LSTD successfully generalizes over time for the training papers.

[Figure (i): LS-test. LSTD: prediction against true value.]





Citation Count Prediction–Empirical

Predicting for test papers (newer than the training papers)

Papers are marked according to year of publication. Black, green, blue, red, magenta:

• in the Left plot, correspond to papers published in 1990, 1991, 1992, 1993, 1994

• in the Middle plot, correspond to papers published in 1995, 1996, 1997, 1998, 1999

In the Right plot: black, green correspond to papers published in 2000, 2001.

True citation count: cross (+) marked; Prediction: star (*) marked

LSTD successfully generalizes over time for new papers.

[Figure: LSTD(0), prediction against true value, in three panels: (j) papers 90-94; (k) papers 95-99; (l) papers 00-01.]





Citation Count Prediction–Empirical

Short-term prediction: the performance of the proposed method

[Figure (m): 4 year; Papers 00-01. Dyna: 4-year prediction, prediction against true value.]
[Figure (n): Summary. RMSE against years into the future (1 to 8) for four groups: paper-2000-2001, paper-1995-2000, paper-1990-1995, paper-before1990.]





Thank You!

