Reinforcement Learning: Generalization and Function Approximation
Brendan and Yifang, Feb 10, 2015




Page 1:

Reinforcement Learning: Generalization and Function Approximation

Brendan and Yifang, Feb 10, 2015

Page 2:

Problem

• Our estimates of value functions are represented as a table with one entry for each state or each state-action pair.

• The approaches we discussed so far are limited to tasks with small numbers of states and actions.

• If the state or action space includes continuous variables or complex sensations, the problem becomes even more severe.

Page 3:

Solution
• We approximate the value function as a parameterized functional form with parameter vector θ, instead of a table.
• Our goal is to minimize the mean-squared error (MSE) of the approximation; a sketch is given below.

• The distribution P is usually the distribution from which the states in the training examples are drawn.

• We assume P is a uniform distribution.
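A minimal sketch of the objective above, assuming a linear approximate value function and a uniform P; the arrays features, v_pi, and theta are hypothetical placeholders, not data from the slides.

```python
import numpy as np

# Sketch of MSE(theta) = sum_s P(s) * (V_pi(s) - V_theta(s))**2, with P uniform.
def mse(theta, features, v_pi, P=None):
    n_states = features.shape[0]
    if P is None:
        P = np.full(n_states, 1.0 / n_states)   # assume a uniform state distribution
    v_hat = features @ theta                    # linear approximation, one value per state
    return float(np.sum(P * (v_pi - v_hat) ** 2))

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))              # 5 states, 3 features each (illustrative)
v_pi = rng.normal(size=5)                       # stand-in "true" values
print(mse(np.zeros(3), features, v_pi))
```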

Page 4:

Approach: Gradient Descent
• We assume the approximate value function V_t is a function of both the state s and the parameter vector θ_t, i.e., V_t(s) = V(s, θ_t).

• The gradient-descent method tells us that if a function is differentiable in a neighborhood of a point, then it decreases fastest if one moves from that point in the direction of the negative gradient at that point.

• In this problem, our goal is to minimize MSE(θ_t) = Σ_{s∈S} P(s) [V^π(s) − V_t(s)]².

• We use gradient descent to solve it. For a sampled state s_t, the gradient of the squared error is ∇_{θ_t} [V^π(s_t) − V_t(s_t)]² = −2 [V^π(s_t) − V_t(s_t)] ∇_{θ_t} V_t(s_t).

Page 5:

• Because our goal is a local minimum, we update θ_t in the direction of the negative gradient, as follows: θ_{t+1} = θ_t − ½ α ∇_{θ_t} [V^π(s_t) − V_t(s_t)]².

• The constant ½α tunes the size of the update at each step, so we have θ_{t+1} = θ_t + α [V^π(s_t) − V_t(s_t)] ∇_{θ_t} V_t(s_t).

• If V_t is a linear function of θ_t, i.e., V_t(s) = θ_tᵀ φ_s, then ∇_{θ_t} V_t(s) = φ_s, and the update becomes θ_{t+1} = θ_t + α [V^π(s_t) − V_t(s_t)] φ_{s_t} (a sketch follows).
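A minimal sketch of the update above for the linear case, where the gradient of V_t with respect to θ is just the feature vector. The feature vector, target value, and step size are made-up illustrations; the true value V^π(s) would not normally be available.

```python
import numpy as np

def gradient_step(theta, phi_s, v_pi_s, alpha=0.1):
    error = v_pi_s - theta @ phi_s        # prediction error at the sampled state
    return theta + alpha * error * phi_s  # step along the negative gradient of the squared error

theta = np.zeros(4)
phi_s = np.array([1.0, 0.0, 0.5, 0.0])    # illustrative feature vector of the sampled state
for _ in range(100):
    theta = gradient_step(theta, phi_s, v_pi_s=2.0)
print(theta @ phi_s)                      # approaches the target value 2.0
```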

Page 6:

Monte Carlo estimate
• But V^π(s_t) is actually unavailable, so we use some approximation v_t of it.

• So instead, we use the update θ_{t+1} = θ_t + α [v_t − V_t(s_t)] ∇_{θ_t} V_t(s_t). If v_t is an unbiased estimate of V^π(s_t) for each t, then θ_t is guaranteed to converge to a local optimum (with a suitably decreasing step size α).

• Because the true value of a state is the expected value of the return following it, the Monte Carlo target v_t = R_t is an unbiased estimate.

• If we use the TD estimate v_t = r_{t+1} + γ V_t(s_{t+1}) instead, will θ_t converge to the local optimum?
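A minimal sketch, assuming the same linear value function, of plugging the two targets discussed above into the gradient update: the Monte Carlo return (unbiased) and the TD(0) target (which bootstraps off the current estimate). All inputs are illustrative.

```python
import numpy as np

def mc_update(theta, phi_s, mc_return, alpha=0.1):
    # Monte Carlo target: the observed return R_t following s_t (unbiased).
    return theta + alpha * (mc_return - theta @ phi_s) * phi_s

def td0_update(theta, phi_s, reward, phi_next, gamma=0.9, alpha=0.1):
    # TD(0) target: r_{t+1} + gamma * V_t(s_{t+1}); it bootstraps from the current theta.
    target = reward + gamma * (theta @ phi_next)
    return theta + alpha * (target - theta @ phi_s) * phi_s

theta = np.zeros(3)
phi_s, phi_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
theta = mc_update(theta, phi_s, mc_return=1.5)
theta = td0_update(theta, phi_s, reward=1.0, phi_next=phi_next)
print(theta)
```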

Page 7:

N-step TD
• TD(0) is one-step TD; Monte Carlo uses the complete return.

• So how do we interpolate between the two?

Figure 7.1 Diagrams from one step TD to Monte Carlo estimation

Page 8:

We choose TD(λ)
• An intuitive idea: take a weighted average of all the n-step returns.

• We define the λ-return as R_t^λ = (1 − λ) Σ_{n≥1} λ^{n−1} R_t^{(n)}, where R_t^{(n)} is the n-step return (a sketch follows the figure).

Figure 7.3 Diagram of TD(λ)
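A minimal sketch of the λ-return for a finite episode, assuming the remaining rewards and the current value estimates are available as plain Python lists (rewards[k] is r_{k+1}, values[k] is V(s_k)); the helper names are made up.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n): n discounted rewards, then bootstrap from V(s_{t+n}) if the episode continues."""
    T = len(rewards)
    steps = min(n, T - t)
    g = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:
        g += gamma**n * values[t + n]
    return g

def lambda_return(rewards, values, t, gamma, lam):
    """Weighted average of n-step returns; the final (Monte Carlo) return absorbs the rest of the weight."""
    T = len(rewards)
    g = 0.0
    for n in range(1, T - t):                                   # truncated n-step returns
        g += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    g += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g

rewards = [0.0, 0.0, 1.0]        # r_1, r_2, r_3 of a 3-step episode (illustrative)
values = [0.2, 0.4, 0.7]         # current estimates V(s_0), V(s_1), V(s_2)
print(lambda_return(rewards, values, t=0, gamma=0.9, lam=0.8))
```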

Page 9:

TD(λ) in Gradient Descent
• Monte Carlo is a special case of TD(λ), with λ = 1.
• TD(λ)-based θ update: θ_{t+1} = θ_t + α [R_t^λ − V_t(s_t)] ∇_{θ_t} V_t(s_t).

• Monte Carlo θ update: θ_{t+1} = θ_t + α [R_t − V_t(s_t)] ∇_{θ_t} V_t(s_t).

• However, for λ < 1 the λ-return is not an unbiased estimate of V^π(s_t), so this method is not guaranteed to converge to a local optimum.

• So what? TD(λ) has been proved to converge in the linear case. It does not converge to the optimal parameter vector, but to a nearby parameter vector whose error is bounded relative to the best achievable error.

Page 10:

Backward view of TD(λ)
• An eligibility trace measures how recently (and how often) a state has been visited: every trace decays as e_t(s) = γλ e_{t−1}(s), and the state visited at time t gets an extra increment of 1 (accumulating traces).

Page 11:

Backward view of TD(λ), cont’d
• In the forward view, the TD error for state-value prediction is δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t).

• In the backward view, this TD error is applied to every recently visited state, in proportion to its eligibility trace: ΔV_t(s) = α δ_t e_t(s) for all s (a sketch follows the figure).

Figure 7.7 Online TD(λ) tabular policy evaluation
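A minimal sketch of online tabular TD(λ) policy evaluation in the spirit of Figure 7.7, using accumulating eligibility traces. The env and policy interfaces (env.reset() returning a state, env.step(a) returning (next_state, reward, done)) are hypothetical stand-ins, not an actual library API.

```python
from collections import defaultdict

def td_lambda(env, policy, episodes=100, alpha=0.1, gamma=0.9, lam=0.8):
    V = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)                  # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            e[s] += 1.0                         # accumulating trace for the visited state
            for state in list(e):               # update every recently visited state
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam         # decay the traces
            s = s_next
    return V
```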

Page 12:

TD(0) or TD(λ)?

Figure 6.1 Using TD(0) in policy evaluation

Figure 7.7 Using TD(λ) in policy evaluation

Page 13:

Backward view of gradient descent
• The backward view of the gradient-descent update is θ_{t+1} = θ_t + α δ_t e_t,

• where e_t = γλ e_{t−1} + ∇_{θ_t} V_t(s_t) is a column vector of eligibility traces, one for each component of θ_t. It is an extension of the original (tabular) definition; a sketch follows.
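A minimal sketch of one step of the backward-view gradient-descent update above, assuming a linear V so that ∇_θ V_t(s) = φ_s; all argument names are illustrative.

```python
import numpy as np

def gradient_td_lambda_step(theta, e, phi_s, phi_next, reward, done,
                            alpha=0.1, gamma=0.9, lam=0.8):
    v_s = theta @ phi_s
    v_next = 0.0 if done else theta @ phi_next
    delta = reward + gamma * v_next - v_s        # TD error
    e = gamma * lam * e + phi_s                  # e_t = gamma*lambda*e_{t-1} + grad V_t(s_t)
    theta = theta + alpha * delta * e            # theta_{t+1} = theta_t + alpha*delta_t*e_t
    return theta, e
```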

Page 14:

Linear Methods

• Corresponding to each state s, there is a column vector of features φ_s, and the state-value function is given by V_t(s) = θ_tᵀ φ_s = Σ_i θ_t(i) φ_s(i) (a short sketch follows).

Why do we prefer linear methods? In the linear case the MSE is quadratic in θ, so there is only one optimum: any method guaranteed to converge to or near a local optimum is automatically guaranteed to converge to or near the global optimum.
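A small sketch of the linear form above: the value is a dot product of θ with the state's feature vector, and the gradient with respect to θ is the feature vector itself.

```python
import numpy as np

def value(theta, phi_s):
    return theta @ phi_s        # V_t(s) = sum_i theta(i) * phi_s(i)

def grad_value(theta, phi_s):
    return phi_s                # the gradient of a linear function is its feature vector

theta = np.array([0.5, -1.0, 2.0])
phi_s = np.array([1.0, 0.0, 1.0])   # illustrative feature vector
print(value(theta, phi_s))          # 2.5
```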

Page 15:

Linear Method Coding

• Goal: use a linear method to solve a non-linear problem.
• Solution: encode the non-linear problem as a linear one, transforming a non-linear solution in the state space into a linear solution in the feature space.
• Coarse coding
• Tile coding
• Radial basis functions (RBFs)

Page 16:

Coarse Coding
• Each feature corresponds to a circle in the state space. If the state is inside a circle, the corresponding feature has the value 1 and is said to be present; otherwise the feature is 0 and is said to be absent. Such features are called binary features.

• An example: the state space is one-dimensional and the features are intervals, so the dimensionality of the feature space is the number of intervals (a sketch follows the figure).

Figure 8.2 One-dimensional coarse coding
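A small sketch of one-dimensional coarse coding, assuming the features are overlapping intervals; the number of intervals and their width are arbitrary choices for illustration.

```python
import numpy as np

def coarse_code_1d(x, centers, width):
    """Binary feature vector: one overlapping interval per center, 1 if x falls inside it."""
    return np.array([1.0 if abs(x - c) <= width / 2 else 0.0 for c in centers])

centers = np.linspace(0.0, 1.0, 11)        # 11 overlapping intervals on [0, 1]
print(coarse_code_1d(0.42, centers, width=0.3))
```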

Page 17:

Coarse Coding
• Another example: the state space is two-dimensional, and the dimensionality of the feature space is the number of circles.

Figure 8.4 Two dimensional coarse coding

Page 18:

Tile Coding
• Unlike coarse coding, where the circles may overlap, tile coding partitions the state space into an exhaustive set of non-overlapping tiles; several such tilings, each offset from the others, are usually combined (a sketch follows the figure).

Figure 8.6 Different tiling schemes
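A small sketch of tile coding for a two-dimensional state in [0, 1)²: several offset grid tilings, with exactly one active (value-1) feature per tiling. The resolution and offsets are illustrative, not the settings used in the book's figures.

```python
def tile_indices(x, y, n_tilings=4, tiles_per_dim=8):
    """Return one active tile index per tiling; the binary features are 1 at these indices."""
    active = []
    for k in range(n_tilings):
        offset = k / (n_tilings * tiles_per_dim)            # shift each tiling slightly
        ix = min(int((x + offset) * tiles_per_dim), tiles_per_dim - 1)
        iy = min(int((y + offset) * tiles_per_dim), tiles_per_dim - 1)
        active.append(k * tiles_per_dim**2 + iy * tiles_per_dim + ix)
    return active

print(tile_indices(0.3, 0.7))    # four active features out of 4 * 64 total
```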

Page 19:

Radial Basis Functions
• Rather than each feature being either 0 or 1, a feature can take any value in the interval [0, 1], reflecting the degree to which the feature is present (a sketch follows the figure).

Figure 8.7 One-dimensional radial basis functions
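A small sketch of one-dimensional Gaussian RBF features, so each feature's value decays smoothly with distance from its center; the centers and width are arbitrary.

```python
import numpy as np

def rbf_features(x, centers, sigma=0.1):
    """Each feature is a Gaussian bump in (0, 1], peaking at 1 when x is at its center."""
    return np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))

centers = np.linspace(0.0, 1.0, 11)
print(rbf_features(0.42, centers).round(3))
```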

Page 20:

Control with Linear Methods
• Evaluate the action-value function and use it to generate the policy (e.g., ε-greedy); a sketch follows the figure.

Figure 8.8 Policy control: Linear gradient descent Sarsa(λ)
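A minimal sketch of linear gradient-descent Sarsa(λ) with binary features, in the spirit of Figure 8.8. The environment and feature-map interfaces (env.reset, env.step, features_fn returning the indices of the active features for a state-action pair) are hypothetical stand-ins.

```python
import numpy as np

def epsilon_greedy(theta, features_fn, state, actions, eps=0.1):
    if np.random.rand() < eps:
        return np.random.choice(actions)
    q = [theta[features_fn(state, a)].sum() for a in actions]   # Q(s,a) = sum of active weights
    return actions[int(np.argmax(q))]

def sarsa_lambda(env, features_fn, n_features, actions,
                 episodes=200, alpha=0.05, gamma=1.0, lam=0.9, eps=0.1):
    theta = np.zeros(n_features)
    for _ in range(episodes):
        e = np.zeros(n_features)                     # eligibility trace vector
        s = env.reset()
        a = epsilon_greedy(theta, features_fn, s, actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            active = features_fn(s, a)               # indices of active binary features
            delta = r - theta[active].sum()
            e[active] += 1.0                         # accumulating traces
            if not done:
                a2 = epsilon_greedy(theta, features_fn, s2, actions, eps)
                delta += gamma * theta[features_fn(s2, a2)].sum()
                s, a = s2, a2
            theta += alpha * delta * e
            e *= gamma * lam
    return theta
```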

Page 21:

Pros and Cons of Bootstrapping
• Theoretically, non-bootstrapping methods achieve a lower asymptotic error than bootstrapping methods.

• In practice, bootstrapping methods usually outperform non-bootstrapping methods.

Figure 8.10 The mountain-car task
Figure 8.15 Performance of different solutions

Page 22:

Pros and Cons of Bootstrapping
• Another example: a small Markov process for generating random walks.

Figure 6.5 A random walk

Figure 8.15 Performance of different solutions

Page 23:

Question 1: I am not sure what the theta-vector is. Is there a more basic, concrete example of a task to be solved, with a description of the states and of where the theta-vector comes into play? What is the theta-vector in the mountain-car task? (From Henry)

Figure 8.10 The mountain car task

Page 24:

• Question 2: One of the assumptions required is that the value function is a smooth, differentiable function. The authors do not address whether this is realistic, or whether the assumption holds in the majority of cases. (From Henry)

• Answer: (I think) the smoothness/differentiability requirement is necessary for any gradient-descent method, to make sure the descent direction at a point reflects the local trend of the function. If there is a jump at a point, a gradient-based method would require a denser grid of features around it.

http://en.wikipedia.org/wiki/Monte_Carlo_method

Page 25:

• Question 3: How does the hashing form of tiling make sense? (From Henry)

• Answer: Hashing allows us to produce noncontiguous tiles. High resolution is typically needed in only a small fraction of the state space, so hashing reduces memory requirements and helps free us from the curse of dimensionality (a sketch follows the figure).

Figure Hashing scheme
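A small sketch of the hashing idea, assuming integer tile coordinates: each (tiling, tile) pair is hashed pseudo-randomly into a fixed-size weight table, so memory does not grow with the full regular grid. The table size is arbitrary.

```python
def hashed_tile_index(tiling, ix, iy, table_size=4096):
    """Map a (tiling, tile-x, tile-y) triple pseudo-randomly into a fixed-size table."""
    return hash((tiling, ix, iy)) % table_size

print(hashed_tile_index(0, 3, 7))   # some index in [0, 4096)
```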

Page 26:

• Question 4: Can you give an example to explain how gradient-descent Monte Carlo state-value prediction will converge? (From Tavish)

• Answer: Use the Monte Carlo return R_t as the target in the gradient-descent update of θ. Eligibility traces are no longer needed; the other parts are the same as in the gradient-descent algorithm shown earlier.

Page 27:

Question 5: In Section 8.1, the book says "the novelty in this chapter is that the approximate value function at time t is represented not as a table but as a parameterized functional form with parameter vector theta". Could you give some examples of the theta vector? (From Yuankai)

Page 28:

Question 6: The book says "there are generally far more states than there are components of the theta-vector". Why is this the case? (From Yuankai)

Answer: The state-value function might be a continuous function, so there can be enormously many (even infinitely many) states, while the number of features, i.e., components of θ, is limited. With this limited number of features we can only approximate the original function; a denser grid of features can approximate a more rapidly fluctuating function, at the expense of more resources.

Page 29:

• Question 7: It looks like the theta-vector can fit into other models as well. Can you illustrate the roles of this theta-vector in these different approaches? (From Sicong)

• Answer: The universal approximation theorem for artificial neural networks is a good example. It is as follows:

Let φ be a non-constant, bounded, and monotonically increasing continuous function, and let I_m denote the m-dimensional unit hypercube. Then any function f in the space of continuous functions defined on I_m can be approximated arbitrarily well by a finite weighted sum of the form given below, whose weights play the role of the θ vector.
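For reference, a standard statement of the theorem (in the form usually attributed to Cybenko 1989 and Hornik 1991); the weights v_i, w_i, b_i here are the analogue of the θ vector in value-function approximation.

```latex
% Approximating sum built from the activation function \varphi:
\[
  F(x) \;=\; \sum_{i=1}^{N} v_i \,\varphi\!\left(w_i^{\top} x + b_i\right).
\]
% Universal approximation: for every continuous f on the unit hypercube I_m and
% every tolerance, suitable weights exist so that F is uniformly close to f.
\[
  \forall f \in C(I_m),\ \forall \varepsilon > 0,\ \exists N,\ v_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^m
  \ \text{such that}\ \sup_{x \in I_m} \lvert F(x) - f(x)\rvert < \varepsilon .
\]
```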

Page 30:

• Question 8: In theory, non-bootstrapping is better, but in practice bootstrapping methods are better, and we do not know why. Is that correct? (From Brad)

• Answer: According to the textbook, "the available results indicate that non-bootstrapping methods are better than bootstrapping methods at reducing MSE from the true value function, but reducing MSE is not necessarily the most important goal."

• Question 9: If v_t is a biased estimate, why does the method not converge to a local optimum? (From Brad)

• Answer: The method drives the approximation toward the target v_t. If v_t is a biased estimate of the true value, then the method converges toward a correspondingly biased solution.

Page 31:

• Question 10: The book mentions "backup" many times. What is it? (From Jiyun)

• Answer: A backup updates the value of the preceding state S_t using the value of its successor S_{t+1}: standing at state S_{t+1}, at time t+1, we look back and update the value of S_t. This process of updating the value of a preceding state from the value of a successor state is what the book calls a backup; a minimal example follows.
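A minimal worked example of a backup, using the tabular TD(0) update: the value of S_t is corrected toward the one-step target formed at S_{t+1}.

```latex
% The value of the preceding state S_t is "backed up" from the successor S_{t+1}:
\[
  V(S_t) \;\leftarrow\; V(S_t) + \alpha\,\bigl[\, r_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \,\bigr].
\]
```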