ICO Learning


  • ICO Learning
    Gerhard Neumann, Seminar A, SS06

  • Overview
    Short overview of different control methods
    Correlation Based Learning
    ISO Learning
    Comparison to other methods ([Wörgötter05]): TD Learning, STDP
    ICO Learning ([Porr06])
    Learning Receptive Fields ([Kulvicius06])

  • Comparison of ISO Learning to other Methods
    Comparison for classical conditioning learning problems (open-loop control)
    Relating RL to classical conditioning
    Classical conditioning: the pairing of two subsequent stimuli is learned, such that the presentation of the first stimulus is taken as a predictor of the second one
    RL: maximization of rewards; the learned output v is a predictor of future reward

  • RL for Classical Conditioning
    TD error: δ(t) = r(t) + γ v(t) − v(t − 1)
    The derivative term γ v(t) − v(t − 1) is a discrete derivative of the prediction v
    Weight change: Δw_i = α δ(t) x_i(t − 1)
    => Nothing new so far
    Goal: after learning, the output v should react to the onset of the CS x_n and remain active until the reward terminates
    Represent the CS internally by a chain of n + 1 delayed pulses x_i
    Replace the states of traditional RL by time steps

  • RL for Classical Conditioning
    The delayed pulses form a special kind of E-trace: the serial compound representation
    Learning produces a rectangular response of v (active from CS onset until the reward)
    No special treatment of the reward is necessary: x_0 can replace the reward when w_0 is set to 1 at the beginning
    (A minimal numerical sketch follows below.)
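
    A sketch of this construction, assuming illustrative trial timings and values for α and γ (none are given on the slides): the CS onset triggers the chain of delayed unit pulses, and after training the output v rises at CS onset and stays high until the reward arrives.

```python
import numpy as np

n_steps = 20          # length of one trial (assumed)
cs_onset = 2          # time step of the CS (assumed)
us_time = 12          # time step of the reward/US (assumed)
n_taps = us_time - cs_onset + 1   # delayed pulses spanning CS -> US
alpha, gamma = 0.2, 1.0           # assumed learning rate and discount

w = np.zeros(n_taps)

def run_trial(w):
    """One trial: returns the value trace v(t), updating w in place."""
    v_prev = 0.0
    v_trace = np.zeros(n_steps)
    for t in range(n_steps):
        # serial compound: pulse x_i is active exactly i steps after CS onset
        x = np.zeros(n_taps)
        if 0 <= t - cs_onset < n_taps:
            x[t - cs_onset] = 1.0
        r = 1.0 if t == us_time else 0.0
        v = float(w @ x)
        delta = r + gamma * v - v_prev    # TD error
        i_prev = t - 1 - cs_onset         # tap active on the previous step
        if 0 <= i_prev < n_taps:
            w[i_prev] += alpha * delta    # credit the previous pulse
        v_prev = v
        v_trace[t] = v
    return v_trace

for _ in range(300):
    v_trace = run_trial(w)

# v converges to the rectangular response described above:
# ~1 from CS onset until the reward, then back to 0.
print(np.round(v_trace, 2))
```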

  • Comparison for Classical Conditioning
    Correlation Based Learning: the reward x_0 enters the output sum like any other input; it is not an independent term as it is in TD learning
    TD-Learning: the reward r appears as a separate term in the TD error

  • Comparison for Classical Conditioning
    TD-Learning: uses the serial compound representation as its E-trace
    ISO-Learning: uses another form of E-trace (band-pass filters)
    These filters are used for all input pathways -> also for calculating the output
    (A sketch of such a filter bank follows below.)
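
    As a rough illustration of such a band-pass E-trace, the sketch below builds a small bank of damped resonators with impulse response h(t) = (1/b) e^(−at) sin(bt), the filter shape commonly used with ISO/ICO learning; the concrete frequencies and quality factor here are assumptions.

```python
import numpy as np

def resonator_impulse_response(f, Q, n):
    """h(t) = (1/b) * exp(-a t) * sin(b t), a damped resonator (needs Q > 0.5)."""
    a = np.pi * f / Q                           # damping
    b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)  # damped angular frequency
    t = np.arange(n)
    return (1.0 / b) * np.exp(-a * t) * np.sin(b * t)

def filter_bank(x, freqs, Q=0.6, n_taps=400):
    """One band-pass trace u_j per filter: the raw input, smeared out in time."""
    return np.stack([np.convolve(x, resonator_impulse_response(f, Q, n_taps))[:len(x)]
                     for f in freqs])

x = np.zeros(600)
x[50] = 1.0                                     # a single input pulse
u = filter_bank(x, freqs=[0.1, 0.05, 0.025, 0.0125, 0.00625])
print(u.shape)                                  # (5, 600): five smeared-out traces
```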

  • Comparison for the Closed Loop
    Closed loop: actions of the agent affect future sensory input
    The comparison is not so easy any more, because the behavior of the algorithms is now quite different
    Reward Based Architectures (Actor-Critic Architecture)
    Use evaluative feedback: reward maximization
    A good reward signal is very often hard to find (in nature: found by evolution)
    Can theoretically be applied to any learning problem
    Resolution in the state space: only applicable for low-dimensional state spaces -> curse of dimensionality!

  • Comparison for the Closed Loop
    Correlation Based Architectures
    Non-evaluative feedback, all signals are value-free: minimize the disturbance
    Valid regions are usually much bigger than for reward maximization -> better convergence!
    Restricted solutions: evaluations are implicitly built into the sign of the reaction behavior
    Actor and critic are the same architectural building block
    Only for a restricted set of learning problems; hard to apply to complex tasks
    Resolution in time: only looks at the temporal correlation of the input variables
    Can be applied to high-dimensional state spaces

  • Comparison of ISO Learning and STDP
    ISO learning generically produces a bimodal weight-change curve
    Similar to the STDP (spike-timing dependent plasticity) weight-change curve
    STDP rule: the potential from the synapse is a filtered version of a spike
    Gradient-dependent model
    A much faster time scale is used in STDP
    Different kinds of synapses can easily be modeled with different filters
    (A numerical sketch of the weight-change curve follows below.)
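
    The bimodal curve can be reproduced numerically: present two pulses with inter-stimulus interval T, apply the ISO rule with the predictive weight held at zero (the gradient-dependent open-loop case), and sum up the weight change. The filter parameters below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def resonator(f=0.01, Q=0.6, n=600):
    a = np.pi * f / Q
    b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)
    t = np.arange(n)
    return (1.0 / b) * np.exp(-a * t) * np.sin(b * t)

h = resonator()
mu, w0, n_steps, t0 = 0.01, 1.0, 1500, 400

def total_weight_change(T):
    """Total dw_1 for a pulse pair: x_1 at t0, x_0 at t0 + T (T may be < 0)."""
    x1 = np.zeros(n_steps); x1[t0] = 1.0
    x0 = np.zeros(n_steps); x0[t0 + T] = 1.0
    u1 = np.convolve(x1, h)[:n_steps]
    u0 = np.convolve(x0, h)[:n_steps]
    v = w0 * u0              # w_1 held at 0: pure open-loop curve, no feedback
    dv = np.gradient(v)
    return mu * np.sum(u1 * dv)   # ISO rule dw_1/dt = mu * u_1 * dv/dt, summed

intervals = range(-150, 151, 10)
curve = [total_weight_change(T) for T in intervals]
# The curve has one negative and one positive lobe as a function of T:
# the bimodal shape that resembles the STDP weight-change curve.
```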

  • Overview
    Short overview of different control methods
    Correlation Based Learning
    ISO Learning
    Comparison to other methods ([Wörgötter05]): TD Learning, STDP
    ICO Learning ([Porr06])
    Learning Receptive Fields ([Kulvicius06])

  • ICO (Input Correlation Only) Learning
    Drawback of Hebbian learning: auto-correlation can result in divergence even if x_0 = 0
    ISO learning relies on the filters of the different inputs being orthogonal (each filter orthogonal to its derivative)
    This only holds if a steady state is assumed: the auto-correlation no longer vanishes if the weights are changed during the impulse response of the filters
    -> cannot be applied with large learning rates
    => ISO learning can be used only with small learning rates; otherwise the auto-correlation causes the weights to diverge

  • ICO & ISO Learning
    ISO Learning: dw_j/dt = μ u_j dv/dt (inputs are correlated with the output v)
    ICO Learning: dw_j/dt = μ u_j du_0/dt (inputs are correlated only with the reflex input u_0)

  • ICO Learning
    A simple adaptation of the ISO learning rule: correlate only the inputs with each other
    No correlation with the output -> no auto-correlation
    Define one input as the reflex input x_0
    Drawback: loss of generality; the rule is not isotropic any more (not all inputs are treated equally)
    Advantages:
    Can use much higher learning rates (up to 100x faster)
    Can use almost arbitrary types of filters
    No divergence of the weights any more
    (A sketch contrasting the two update rules follows below.)
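
    A minimal sketch of the two update rules side by side, in discrete time; the previous value of v or u_0 serves as a crude discrete derivative, and all names here are assumptions.

```python
import numpy as np

def step_iso(w, u, v_prev, mu):
    """ISO: dw_j/dt = mu * u_j * dv/dt -- correlates inputs with the OUTPUT,
    so each w_j feeds back on itself through v (auto-correlation)."""
    v = float(w @ u)
    w += mu * u * (v - v_prev)
    return w, v

def step_ico(w, u, u0_prev, mu):
    """ICO: dw_j/dt = mu * u_j * du_0/dt -- correlates inputs only with the
    reflex input u_0; no auto-correlation, so large mu stays stable."""
    dw = mu * u * (u[0] - u0_prev)
    dw[0] = 0.0               # the reflex weight w_0 stays fixed (e.g. at 1)
    w += dw
    return w, float(w @ u)
```

    The only difference is which signal supplies the derivative; removing the output v from the rule is exactly what removes the auto-correlation term.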

  • ICO Learning
    Weight-change curve (open loop, just one filter bank): the same as for ISO learning
    Weight trace over time: ISO learning contains an exponential instability, even after setting x_0 to 0 after 100000 time steps

  • ICO Learning: Closing the Loop
    The output v of the learner feeds back to its inputs x_j after being modified by the environment
    Reactive pathway: fixed reactive feedback control
    Learning goal: learn an earlier reaction that keeps x_0 (the disturbance or error signal) at 0
    One can prove that, under simplified conditions (one filter bank, impulse signals, using the Z-transform), one-shot learning is possible
    (A toy closed-loop sketch follows below.)
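
    A toy closed loop can illustrate the learning goal. The plant, the delays, and the exponential E-trace here are invented for illustration and are much cruder than the filter banks of [Porr06]: a disturbance pulse reaches the predictive sensor x_1 exactly TAU steps before it reaches the reflex sensor x_0, and an action needs TAU steps to act on the plant, so the fixed reflex always comes too late; ICO learning grows w_1 until the predictive pathway cancels the disturbance and x_0 stays near 0.

```python
import numpy as np

TAU, mu, lam = 8, 5.0, 0.9                  # all constants assumed
w1 = 0.0                                    # predictive weight (reflex w_0 fixed)

for trial in range(25):
    pending = np.zeros(TAU)                 # actions in flight toward the plant
    u1, x0_prev, energy = 0.0, 0.0, 0.0
    for t in range(40):
        x1 = 1.0 if t == 10 else 0.0        # early warning at the predictive sensor
        u1 = lam * u1 + x1                  # crude exponential E-trace of x_1
        d = 1.0 if t == 10 + TAU else 0.0   # disturbance reaching the plant
        x0 = d - pending[t % TAU]           # an action arrives TAU steps after issue
        pending[t % TAU] = w1 * x1          # issue the predictive action
        w1 += mu * u1 * (x0 - x0_prev)      # ICO rule: correlate u_1 with dx_0/dt
        x0_prev = x0
        energy += x0 ** 2
    if trial % 5 == 0:
        # the residual error (sum of x_0^2) shrinks as w_1 approaches 1
        print(f"trial {trial:2d}: w1 = {w1:.2f}, residual error = {energy:.4f}")
```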

  • ICO Learning: Applications
    Simulated robot experiment: the robot has to find food (disks in the environment)
    Sensors for the unconditioned stimulus: 2 touch sensors (left + right)
    Reflex: the robot elicits a sharp turn as it touches a disk, which pulls the robot into the centre of the disk
    Sensors for the predictive stimulus: 2 sound (distance) sensors (left + right) that can measure the distance to the disks
    Stimulus: difference between the left and right sound signals
    Use 5 filters (resonators) in the filter bank
    Output v: steering angle of the robot

  • ICO Learning: Simulated Robot
    A single experience was sufficient to produce an adapted behavior
    This is only possible with ICO learning

  • Simulated Robot
    Comparison of ICO learning and ISO learning for different learning rates
    Learning was counted as successful if the success criterion held for a sequence of four contacts
    The two rules are equivalent for small learning rates (the auto-correlation term is then small)

  • Simulated Robot
    Two different learning rates
    Divergent behavior of ISO learning for high learning rates: the robot shows avoidance behavior toward the food disks

  • Applications Continued
    More complex task: three food disks simultaneously
    No simple relationship between the reflex input and the predictive input any more (superimposed sound fields)
    This task is only learned by ICO learning, not by ISO learning

  • ICO: Real Robot Application
    Real robot: target a white disk from a distance
    Reflex: pulls the robot into the white disk just at the moment the robot drives over the disk; achieved by analysing the bottom scanline of a camera
    Predictive input: analysing a scanline from the top of the image
    Filter bank: 5 FIR filters with different filter lengths, all coefficients set to 1 -> smear out the signal
    Narrow viewing angle of the camera: the robot has to start more or less in front of the disk

  • ICO: Real Robot Experiment
    Processing the input: calculate the deviation of the positions of all white points in a scanline from the center of the scanline -> a 1D signal
    Results: (A) before learning, (B & C) after learning (14 contacts)
    The weights oscillate around their best values, but do not diverge

  • ICO Learning: Other Applications
    Mechanical arm: the arm is always driven to a specified set point by a PI controller
    Input of the PI controller: the motor position; the PI controller is used as the reactive filter
    Disturbance: the pushing force of a second, small arm mounted on the main arm
    Fast-reacting touch sensors measure the disturbance
    Use 10 resonator filters in the filter bank
    (A sketch of the reactive pathway follows below.)
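
    A sketch of the reactive pathway as a PI controller; the gains and the interface are assumed. The deviation from the set point is the error/reflex signal x_0, and the learned predictive term is simply added on top of the PI output.

```python
class PI:
    """Textbook PI controller used as the fixed reactive filter (gains assumed)."""
    def __init__(self, kp, ki, dt=0.01):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.integral = 0.0
    def __call__(self, error):
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral

pi = PI(kp=2.0, ki=0.5)

def motor_command(position, set_point, predictive_term):
    x0 = set_point - position        # deviation = error/reflex signal x_0
    return pi(x0) + predictive_term  # learning makes the predictive term act
                                     # before x_0 deviates at all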

  • ICO Learning: Other Applications
    Result: the control is shifted backwards in time; the error signal (deviation from the set point) almost vanishes
    Other example: temperature control, predicting temperature changes caused by another heater

  • Overview
    Short overview of different control methods
    Correlation Based Learning
    ISO Learning
    Comparison to other methods ([Wörgötter05]): TD Learning, STDP
    ICO Learning ([Porr06])
    Learning Receptive Fields ([Kulvicius06])

  • Development of Receptive Fields through Temporal Sequence Learning [Kulvicius06]
    Develop receptive fields by ICO learning
    Learn the behavior and the receptive fields simultaneously; usually these two learning processes are considered separately
    First approach where the receptive field and the behavior are trained simultaneously!
    Shows the application of ICO learning to high-dimensional input spaces

  • Line Following
    System: the robot should learn to follow a line painted on the ground more accurately
    Reactive input x_0: pixels at the bottom of the image
    Predictive input x_1: pixels in the middle of the image
    Use 10 different filters (resonators) in the filter bank
    Reflexive output: brings the robot back to the line; not a smooth behavior
    Motor output: constant speed S; v modifies the speed and steering of the robot
    Use left-right symmetry

  • Line Following
    Simple system: fixed sensor banks, all pixels are summed up
    Input x_1 predicts x_0

  • Line Following: Three Different Tracks
    Steep, shallow, sharp; for one learning experiment, always the same track is used
    The robot steers much more smoothly; usually 1 trial is enough for learning
    Videos: without learning, steep, sharp

  • Line Following: Receptive Fields
    Receptive fields: use 225 pixels for the far sensors
    Use an individual filter bank for each pixel, 10 filters per pixel
    Left-right symmetry: the left receptive field is a mirror of the right
    (A sketch of the weight layout follows below.)
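
    In terms of data layout, this setup amounts to one ICO weight per (pixel, filter) pair; a minimal sketch, with the learning rate and all names assumed:

```python
import numpy as np

N_PIX, N_FILT = 225, 10                 # far-field pixels, filters per pixel

w = np.zeros((N_PIX, N_FILT))           # one weight per (pixel, filter) pair

def ico_update(w, u, du0, mu=1e-4):
    """One ICO step for the whole field.
    u: filtered pixel inputs, shape (N_PIX, N_FILT); du0: derivative of the
    scalar reflex signal; mu: assumed learning rate."""
    w += mu * u * du0                   # the same rule, applied element-wise
    return w

# The receptive field shown in the plots: per-pixel sum over its 10 filters.
receptive_field = w.sum(axis=1)         # shape (225,)
```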

  • Line Following: Receptive Fields
    Results: lower learning rates have to be used, and more trials are needed (3 to 6)
    Different receptive fields are learned for different tracks
    (Steep and sharp track; the plots show the sum of all filter weights for each pixel)

  • Conclusion
    Correlation Based Learning
    Tries to minimize the influence of disturbances
    Easier to learn than Reinforcement Learning, but the framework is less general
    Questions:
    When should Correlation Based Learning be applied, and when Reinforcement Learning? How is it done by animals/humans?
    How can these two methods be combined? E.g. correlation learning in an early learning stage, RL for fine tuning
    ICO Learning
    Improvement of ISO learning