Flexible and fast convergent learning agent
Miguel A. Soto Santibanez, Michael M. Marefat
Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ
[email protected] [email protected]


Page 1

Flexible and fast convergent learning agent

Miguel A. Soto Santibanez, Michael M. Marefat

Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ

[email protected] [email protected]

Page 2

Background and Motivation

“A computer program is said to LEARN from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”

A robot driving learning problem:

Task T: driving on public four-lane highways using vision sensors

Performance measure P: average distance traveled before an error (as judged by a human overseer)

Training experience E: a sequence of images and steering commands recorded while observing a human driver

Page 3

Background and Motivation II

1) Artificial Neural Networks
Robust to errors in the training data
Dependency on the availability of good and extensive training examples

2) Instance-Based Learning
Able to model complex policies by making use of less complex local approximations
Dependency on the availability of good and extensive training examples

3) Reinforcement Learning
Independent of the availability of good and extensive training examples
Convergence to the optimal policy can be extremely slow

Page 4

Background and Motivation III

Motivation: 

Is it possible to get the best of both worlds?

Is it possible for a Learning Agent to be flexible and fast convergent at the same time?

Page 5

The Problem

Formalization:

Given:
a) a set of actions A = {a1, a2, a3, . . .},
b) a set of situations S = {s1, s2, s3, . . .},
c) and a function TR(a, s) → tr, where tr is the total reward associated with applying action a while in situation s,

the LA needs to construct a set of rules P = {rule(s1, a1), rule(s2, a2), . . .} such that for every rule(s, a) ∈ P, a = amax, where TR(amax, s) = max(TR(a1, s), TR(a2, s), . . .) (a minimal sketch of this requirement follows below).

Also:
1) Increase flexibility
2) Increase speed of convergence
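For concreteness, here is a minimal sketch in Python of what this formalization asks for, assuming TR is already known and can be evaluated for every (action, situation) pair; the function name build_policy is illustrative, not part of the paper:

def build_policy(situations, actions, TR):
    # P maps each situation s to amax, where TR(amax, s) = max over a of TR(a, s)
    P = {}
    for s in situations:
        P[s] = max(actions, key=lambda a: TR(a, s))
    return P

The learning problem, of course, is that TR is not given in advance; the agent has to estimate it from experience, which is what the next slides address.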

Page 6

The Solution 

The Q-Learning algorithm:

1: ∀ rule(s, a) ∈ P, TR(a, s) ← 0
2: find out what is the current situation si
3: do forever:
4:     select an action ai ∈ A and execute it
5:     find out what is the immediate reward r
6:     find out what is the current situation si'
7:     TR(ai, si) ← r + Factor · max_a(TR(a, si'))
8:     si ← si'
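A minimal tabular sketch of this loop in Python, assuming the environment is exposed through two hypothetical callbacks, current_situation() and execute(action) (the latter returning the immediate reward and the next situation); the epsilon-greedy choice in step 4 is an assumption, since the slides leave the selection rule unspecified:

import random

def q_learning(actions, current_situation, execute, factor=0.9, epsilon=0.1, steps=10000):
    TR = {}                                     # step 1: TR(a, s) defaults to 0
    s = current_situation()                     # step 2
    for _ in range(steps):                      # step 3: "do forever", bounded here
        if random.random() < epsilon:           # step 4: select an action ai in A
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: TR.get((x, s), 0.0))
        r, s_next = execute(a)                  # steps 5 and 6
        best_next = max(TR.get((x, s_next), 0.0) for x in actions)
        TR[(a, s)] = r + factor * best_next     # step 7
        s = s_next                              # step 8
    return TR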

Page 7

The Solution II

Advantages:
1) The LA does not depend on the availability of good and extensive training examples
Reason: a) This method learns from experimentation instead of given training examples

Shortcomings:
1) Convergence to the optimal policy can be very slow
Reasons: a) The Q-Learning algorithm propagates “good findings” very slowly. b) The speed of convergence is tied to the number of situations that need to be handled.

2) The method may not be usable on high-dimensionality problems
Reason: a) The memory requirements grow exponentially as we add more dimensions to the problem.

Page 8

The Solution III

Speed of convergence is tied to the number of situations:

more situations ==> more P rules that need to be found
more P rules that need to be found ==> more experiments are needed
more experiments are needed ==> slower convergence

[Figure: a 120,000-situation world shown next to a 12-situation world]

Page 9

The Solution IV

Slow propagation of “good findings”:

[Figure: a 12-situation grid world with states A through L and discount factor = 0.9; the table of intrinsic rewards gives F the value 100 and every other state 0. Snapshots of the total-reward table show how slowly the good finding propagates: after visiting A, B, . . ., G once, only the values 90 and 100 have appeared; after visiting A, . . ., G twice, 81, 90 and 100; only after five visits do the values 59, 66, 73, 81, 90 and 100 stretch back along the path.]

Page 10

The Solution V

First Sub-problem: slow propagation of “good findings”

Solution: develop a method that propagates “good findings” beyond the previous state

[Figure: the same grid world (intrinsic value of F = 100, intrinsic value of all others = 0, factor = 0.9), comparing the total-reward tables without propagation (only the values 90 and 100 have appeared) and with propagation (the values 59, 66, 73, 81, 90 and 100 already line the path).]

Page 11

The Solution VI

Solution to First Sub-problem:
a) Use a buffer, which we call “short term memory”, to keep track of the last n situations
b) After each learning experience apply the following algorithm:

Begin
    t = currentTime - 1
    repeat:
        is the entry visited at time t stored in the “short term memory”?   NO → End
        is the total reward (coming from the entry at time t + 1) bigger than the official value?   NO → End
        YES: update P; t = t - 1
End
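A sketch of how this backward sweep might look in Python, assuming the short term memory is a deque of (situation, action, immediate reward) entries, oldest first, and that the total-reward table TR plays the role of the “official value”; the helper name and the exact stopping rule are assumptions drawn from the flowchart above:

from collections import deque

def propagate_good_findings(stm, TR, actions, factor=0.9):
    # stm: deque((situation, action, immediate_reward), maxlen=n), oldest first
    entries = list(stm)
    for t in range(len(entries) - 2, -1, -1):   # t = currentTime - 1, then t - 1, ...
        s, a, r = entries[t]
        s_next = entries[t + 1][0]
        candidate = r + factor * max(TR.get((x, s_next), 0.0) for x in actions)
        if candidate > TR.get((a, s), 0.0):     # reward coming from the entry at t + 1 is bigger
            TR[(a, s)] = candidate              # "update P"
        else:
            break                               # no improvement: stop propagating

Called after every learning experience, this lets a single good finding flow back through all n remembered situations instead of only the previous one.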

Page 12

The Solution VII

The Second and Third Sub-problems:
a) Memory requirements grow exponentially as we add more dimensions to the problem
b) Speed of convergence is tied to the number of situations that need to be handled

Solution:
1) We keep only a few examples of the policy (also called prototypes)
2) We generate the policy for situations not described explicitly by these prototypes by “generalizing” from “nearby” prototypes, as sketched below
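A minimal illustration of this generalization step in Python, using a simple distance-weighted average over the k nearest prototypes; the actual method organizes Moving Prototypes in a tree, so this is only a sketch of the principle, and the function name is hypothetical:

import math

def estimate_value(query, prototypes, k=3):
    # prototypes: list of (situation_vector, value) pairs kept instead of the full table
    nearest = sorted(prototypes, key=lambda p: math.dist(query, p[0]))[:k]
    weights = [1.0 / (math.dist(query, s) + 1e-9) for s, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)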

Page 13

The Solution VIII

[Figure: Kanerva Coding and Tile Coding contrasted with Moving Prototypes]

Page 14

The Solution IX

Page 15

The Solution X

Page 16

The Solution XI

Page 17

The Solution XII

A sound tree:
a) all the “areas” are mutually exclusive
b) their merging is exhaustive
c) the merging of any two sibling “areas” is equal to their parent’s “area”
(a small check of these conditions is sketched after the figure below)

[Figure: parent and children “areas” in the tree]
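The following Python sketch checks the three soundness conditions for axis-aligned rectangular “areas”; the Region and Node classes are illustrative, not the paper’s data structures, and exhaustiveness is verified through volumes under the assumption that both children lie inside the parent:

import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Region:
    lo: tuple          # lower corner, one coordinate per dimension
    hi: tuple          # upper corner

    def volume(self):
        return math.prod(h - l for l, h in zip(self.lo, self.hi))

@dataclass
class Node:
    area: Region
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def is_sound(node):
    if node is None or node.left is None:       # a leaf is trivially sound
        return True
    l, r, p = node.left.area, node.right.area, node.area
    dims = range(len(p.lo))
    contained = all(p.lo[d] <= a.lo[d] and a.hi[d] <= p.hi[d] for a in (l, r) for d in dims)
    exclusive = any(l.hi[d] <= r.lo[d] or r.hi[d] <= l.lo[d] for d in dims)   # condition a
    exhaustive = abs(l.volume() + r.volume() - p.volume()) < 1e-9            # conditions b, c
    return contained and exclusive and exhaustive and is_sound(node.left) and is_sound(node.right)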

Page 18

The Solution XIII

Impossible Merge

Page 19

The Solution XIV

“Smallest predecessor”

Page 20

The Solution XV

Page 21

The Solution XVI

Possible ways of breaking the existing nodes:

[Figure: the node being inserted]

Page 22

The Solution XVII

[Figure: List 1 and its sub-lists List 1.1 and List 1.2]

Page 23

The Solution XVIII

Page 24

The Solution XIX

Page 25

The Solution XX

Page 26

Results

The performance of the algorithm “Propagation of Good Findings” is especially good when the world is large:

[Chart: experiences needed (0 to 1,000,000) versus world size (0 to 35) for Look Around, Q-Learning and Propagation; Memory Size = 100, Seed = 9642, Factor = 0.99]

The algorithm “Propagation of Good Findings” is more efficient when the size of its “Short Term Memory” is large:

[Chart: experiences needed (0 to 10,000) versus memory size (0 to 7) for Look Around, Q-Learning and Propagation; Seed = 2129, World Size = 7x7, Factor = 0.9]

Page 27

Results II

The algorithm “Propagation of Good Findings” is more efficient when the value of the parameter “discount factor” is large:

[Chart: experiences needed (0 to 60,000) versus discount factor (0 to 1.2) for Look Around, Q-Learning and Propagation; Memory Size = 100, World Size = 7x7, Seed = 2129]

Results do not depend on the sequence of random numbers.

Page 28

Conclusions

The proposed Learning Agent combines the Q-Learning algorithm, Propagation of Good Findings, and Moving Prototypes:

Q-Learning algorithm ==> the LA becomes more flexible
Propagation of good findings ==> convergence is accelerated
Moving Prototypes ==> the LA becomes more flexible
Moving Prototypes ==> convergence is accelerated

Page 29

Conclusions II

What is left to do:

Obtain results on the advantages of using regression trees and linear approximation over other similar methods (just as we have already done with the method “Propagation of Good Findings”).

Apply the proposed model to example applications, such as a self-optimizing middleman between a high-level planner and the actuators in a robot.

Develop more precisely the limits on the use of this model.