Flexible and fast convergent learning agent
Miguel A. Soto Santibanez, Michael M. Marefat
Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ
[email protected] [email protected]


Page 1

Flexible and fast convergent learning agent

Miguel A. Soto Santibanez, Michael M. Marefat

Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ

[email protected] [email protected]

Page 2

Background and Motivation

“A computer program is said to LEARN from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”

A robot driving learning problem:

Task T: driving on public four-lane highways using vision sensors

Performance measure P: average distance traveled before an error (as judged by a human overseer)

Training experience E: a sequence of images and steering commands recorded while observing a human driver

Page 3

Background and Motivation II

1) Artificial Neural Networks
Robust to errors in the training data
Dependency on the availability of good and extensive training examples

2) Instance-Based Learning
Able to model complex policies by making use of less complex local approximations
Dependency on the availability of good and extensive training examples

3) Reinforcement Learning
Independent of the availability of good and extensive training examples
Convergence to the optimal policy can be extremely slow

Page 4

Background and Motivation III

Motivation: 

Is it possible to get the best of both worlds?

Is it possible for a Learning Agent to be flexible and fast convergent at the same time?

Page 5

The Problem

Formalization:

Given:
a) a set of actions A = {a1, a2, a3, . . .},
b) a set of situations S = {s1, s2, s3, . . .},
c) and a function TR(a, s) → tr, where tr is the total reward associated with applying action a while in situation s,

the LA needs to construct a set of rules P = {rule(s1, a1), rule(s2, a2), . . .} such that for every rule(s, a) ∈ P, a = amax, where TR(amax, s) = max(TR(a1, s), TR(a2, s), . . .) (a minimal sketch of this requirement follows below).

Also:
1) Increase flexibility
2) Increase speed of convergence
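For concreteness, here is a minimal sketch in Python of what this formalization asks for, assuming TR is already known and can be evaluated for every (action, situation) pair; the function name build_policy is illustrative, not part of the paper:

def build_policy(situations, actions, TR):
    # P maps each situation s to amax, where TR(amax, s) = max over a of TR(a, s)
    P = {}
    for s in situations:
        P[s] = max(actions, key=lambda a: TR(a, s))
    return P

The learning problem, of course, is that TR is not given in advance; the agent has to estimate it from experience, which is what the next slides address.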

Page 6

The Solution 

The Q-Learning algorithm:

1: ∀ rule(s, a) ∈ P, TR(a, s) ← 0
2: find out what is the current situation si
3: do forever:
4:     select an action ai ∈ A and execute it
5:     find out what is the immediate reward r
6:     find out what is the current situation si'
7:     TR(ai, si) ← r + Factor · max_a(TR(a, si'))
8:     si ← si'
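A minimal tabular sketch of this loop in Python, assuming the environment is exposed through two hypothetical callbacks, current_situation() and execute(action) (the latter returning the immediate reward and the next situation); the epsilon-greedy choice in step 4 is an assumption, since the slides leave the selection rule unspecified:

import random

def q_learning(actions, current_situation, execute, factor=0.9, epsilon=0.1, steps=10000):
    TR = {}                                     # step 1: TR(a, s) defaults to 0
    s = current_situation()                     # step 2
    for _ in range(steps):                      # step 3: "do forever", bounded here
        if random.random() < epsilon:           # step 4: select an action ai in A
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: TR.get((x, s), 0.0))
        r, s_next = execute(a)                  # steps 5 and 6
        best_next = max(TR.get((x, s_next), 0.0) for x in actions)
        TR[(a, s)] = r + factor * best_next     # step 7
        s = s_next                              # step 8
    return TR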

Page 7

The Solution II

Advantages:
1) The LA does not depend on the availability of good and extensive training examples
Reason: a) This method learns from experimentation instead of given training examples

Shortcomings:
1) Convergence to the optimal policy can be very slow
Reasons: a) The Q-Learning algorithm propagates “good findings” very slowly. b) The speed of convergence is tied to the number of situations that need to be handled.

2) The method may not be usable on high-dimensionality problems
Reason: a) The memory requirements grow exponentially as we add more dimensions to the problem.

Page 8

The Solution III

Speed of convergence is tied to the number of situations:

more situations ==> more P rules that need to be found
more P rules that need to be found ==> more experiments are needed
more experiments are needed ==> slower convergence

[Figure: a 120,000-situation world shown next to a 12-situation world]

Page 9

The Solution IV

Slow propagation of “good findings”:

[Figure: a 12-situation grid world with states A through L and discount factor = 0.9; the table of intrinsic rewards gives F the value 100 and every other state 0. Snapshots of the total-reward table show how slowly the good finding propagates: after visiting A, B, . . ., G once, only the values 90 and 100 have appeared; after visiting A, . . ., G twice, 81, 90 and 100; only after five visits do the values 59, 66, 73, 81, 90 and 100 stretch back along the path.]

Page 10

The Solution V

First Sub-problem: slow propagation of “good findings”

Solution: develop a method that propagates “good findings” beyond the previous state

[Figure: the same grid world (intrinsic value of F = 100, intrinsic value of all others = 0, factor = 0.9), comparing the total-reward tables without propagation (only the values 90 and 100 have appeared) and with propagation (the values 59, 66, 73, 81, 90 and 100 already line the path).]

Page 11

The Solution VI

Solution to First Sub-problem:
a) Use a buffer, which we call “short term memory”, to keep track of the last n situations
b) After each learning experience apply the following algorithm:

Begin
    t = currentTime - 1
    repeat:
        is the entry visited at time t stored in the “short term memory”?   NO → End
        is the total reward (coming from the entry at time t + 1) bigger than the official value?   NO → End
        YES: update P; t = t - 1
End
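A sketch of how this backward sweep might look in Python, assuming the short term memory is a deque of (situation, action, immediate reward) entries, oldest first, and that the total-reward table TR plays the role of the “official value”; the helper name and the exact stopping rule are assumptions drawn from the flowchart above:

from collections import deque

def propagate_good_findings(stm, TR, actions, factor=0.9):
    # stm: deque((situation, action, immediate_reward), maxlen=n), oldest first
    entries = list(stm)
    for t in range(len(entries) - 2, -1, -1):   # t = currentTime - 1, then t - 1, ...
        s, a, r = entries[t]
        s_next = entries[t + 1][0]
        candidate = r + factor * max(TR.get((x, s_next), 0.0) for x in actions)
        if candidate > TR.get((a, s), 0.0):     # reward coming from the entry at t + 1 is bigger
            TR[(a, s)] = candidate              # "update P"
        else:
            break                               # no improvement: stop propagating

Called after every learning experience, this lets a single good finding flow back through all n remembered situations instead of only the previous one.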

Page 12

The Solution VII

The Second and Third Sub-problems:
a) Memory requirements grow exponentially as we add more dimensions to the problem
b) Speed of convergence is tied to the number of situations that need to be handled

Solution:
1) We keep only a few examples of the policy (also called prototypes)
2) We generate the policy for situations not described explicitly by these prototypes by “generalizing” from “nearby” prototypes, as sketched below
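A minimal illustration of this generalization step in Python, using a simple distance-weighted average over the k nearest prototypes; the actual method organizes Moving Prototypes in a tree, so this is only a sketch of the principle, and the function name is hypothetical:

import math

def estimate_value(query, prototypes, k=3):
    # prototypes: list of (situation_vector, value) pairs kept instead of the full table
    nearest = sorted(prototypes, key=lambda p: math.dist(query, p[0]))[:k]
    weights = [1.0 / (math.dist(query, s) + 1e-9) for s, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)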

Page 13

The Solution VIII

[Figure: Kanerva Coding and Tile Coding contrasted with Moving Prototypes]

Page 14

The Solution IX

Page 15

The Solution X

Page 16

The Solution XI

Page 17

The Solution XII

A sound tree:
a) all the “areas” are mutually exclusive
b) their merging is exhaustive
c) the merging of any two sibling “areas” is equal to their parent’s “area”
(a small check of these conditions is sketched after the figure below)

[Figure: parent and children “areas” in the tree]
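The following Python sketch checks the three soundness conditions for axis-aligned rectangular “areas”; the Region and Node classes are illustrative, not the paper’s data structures, and exhaustiveness is verified through volumes under the assumption that both children lie inside the parent:

import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Region:
    lo: tuple          # lower corner, one coordinate per dimension
    hi: tuple          # upper corner

    def volume(self):
        return math.prod(h - l for l, h in zip(self.lo, self.hi))

@dataclass
class Node:
    area: Region
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def is_sound(node):
    if node is None or node.left is None:       # a leaf is trivially sound
        return True
    l, r, p = node.left.area, node.right.area, node.area
    dims = range(len(p.lo))
    contained = all(p.lo[d] <= a.lo[d] and a.hi[d] <= p.hi[d] for a in (l, r) for d in dims)
    exclusive = any(l.hi[d] <= r.lo[d] or r.hi[d] <= l.lo[d] for d in dims)   # condition a
    exhaustive = abs(l.volume() + r.volume() - p.volume()) < 1e-9            # conditions b, c
    return contained and exclusive and exhaustive and is_sound(node.left) and is_sound(node.right)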

Page 18

The Solution XIII

Impossible Merge

Page 19

The Solution XIV

“Smallest predecessor”

Page 20

The Solution XV

Page 21

The Solution XVI

Possible ways of breaking the existing nodes:

[Figure: the node being inserted]

Page 22

The Solution XVII

[Figure: List 1 and its sub-lists List 1.1 and List 1.2]

Page 23

The Solution XVIII

Page 24

The Solution XIX

Page 25

The Solution XX

Page 26

Results

The performance of the algorithm “Propagation of Good Findings” is especially good when the world is large:

[Chart: experiences needed (0 to 1,000,000) versus world size (0 to 35) for Look Around, Q-Learning and Propagation; Memory Size = 100, Seed = 9642, Factor = 0.99]

The algorithm “Propagation of Good Findings” is more efficient when the size of its “Short Term Memory” is large:

[Chart: experiences needed (0 to 10,000) versus memory size (0 to 7) for Look Around, Q-Learning and Propagation; Seed = 2129, World Size = 7x7, Factor = 0.9]

Page 27

Results II

The algorithm “Propagation of Good Findings” is more efficient when the value of the parameter “discount factor” is large:

[Chart: experiences needed (0 to 60,000) versus discount factor (0 to 1.2) for Look Around, Q-Learning and Propagation; Memory Size = 100, World Size = 7x7, Seed = 2129]

Results do not depend on the sequence of random numbers.

Page 28

Conclusions

The proposed Learning Agent combines the Q-Learning algorithm, Propagation of Good Findings, and Moving Prototypes:

Q-Learning algorithm ==> the LA becomes more flexible
Propagation of good findings ==> convergence is accelerated
Moving Prototypes ==> the LA becomes more flexible
Moving Prototypes ==> convergence is accelerated

Page 29

Conclusions II

What is left to do:

Obtain results on the advantages of using regression trees and linear approximation over other similar methods (just as we have already done with the method “Propagation of Good Findings”).

Apply the proposed model to example applications, such as a self-optimizing middleman between a high-level planner and the actuators in a robot.

Develop more precisely the limits on the use of this model.