
Associative Models

Association is the task of mapping input patterns to target patterns ("attractors"). For instance, an associative memory may have to complete (or correct) an incomplete (or corrupted) pattern. Unlike computer memories, no "address" is known for each pattern.

Learning consists of encoding the desired patterns as a weight matrix (network); retrieval (or "recall") refers to the generation of an output pattern when an input pattern is presented to the network.

1. Hetero-association: mapping input vectors to output vectors that range over a different vector space, e.g., translating English words to Spanish words.

2. Auto-association: input vectors and output vectors range over the same vector space, e.g., character recognition, eliminating noise from pixel arrays, as in Figure 1.

Figure 1: A noisy input pattern shown alongside the stored images; the noisy input resembles two of the stored images, but not the third.

This example illustrates the presence of noise in the input, and also that more than one (but not all) target patterns may be reasonable for an input pattern. The difference between input and target patterns allows us to evaluate network outputs.

Many associative learning models are based on variations of Hebb's observation: "When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell."

We first discuss non-iterative, "one-shot" procedures for association, then proceed to iterative models with better error-correction capabilities, in which node states may be updated several times, until they "stabilize".

1. Non-iterative Procedures

In non-iterative association, the output pattern is generated from the input pattern in a single iteration. Hebb's law may be used to develop associative "matrix memories", or gradient descent can be applied to minimize recall error.

Consider hetero-association, using a two-layer network developed from the training set
$$T = \{(i_p, d_p) : p = 1, \ldots, P\},$$
where $i_p \in \{-1, 1\}^n$ and $d_p \in \{-1, 1\}^m$. Presenting $i_p$ at the first layer leads to the instant generation of $d_p$ at the second layer.

Each node of the network corresponds to one component of the input ($i$) or desired output ($d$) patterns.

Matrix Associative Memories: A weight matrix $W$ is first used to premultiply the input vector, $y = W x^{(0)}$. If output node values must range over $\{-1, 1\}$, then the signum function is applied to each coordinate:
$$x^{(1)}_j = \mathrm{sgn}(y_j), \quad j = 1, \ldots, m.$$

The Hebbian weight update rule is
$$\Delta w_{j,k} = i_{p,k}\, d_{p,j}.$$

If all input and output patterns are available at once, we can perform a direct calculation of the weights:
$$w_{j,k} = c \sum_{p=1}^{P} i_{p,k}\, d_{p,j}, \quad c > 0.$$

Each $w_{j,k}$ measures the correlation between the $k$th component of the input vectors and the $j$th component of the associated output vectors. The multiplier $c$ can be omitted if we are only interested in the sign of each component of $y$, so that
$$W = \sum_{p=1}^{P} d_p\, (i_p)^T = D\, I^T,$$
where the columns of matrix $I$ are the input patterns $i_p$ and the columns of $D$ are the desired output patterns $d_p$.

Non-iterative procedures have low error-correction capabilities: multiplying $W$ with even a slightly corrupted input vector often results in "spurious" output that differs from the patterns intended to be stored.

Example 1: Associate the input vector (1, 1) with the output vector (−1, 1), and (−1, 1) with (−1, −1). Then
$$W = \begin{pmatrix} -1 \\ 1 \end{pmatrix} (1 \;\; 1) + \begin{pmatrix} -1 \\ -1 \end{pmatrix} (-1 \;\; 1) = \begin{pmatrix} -1 & -1 \\ 1 & 1 \end{pmatrix} + \begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 0 & -2 \\ 2 & 0 \end{pmatrix}.$$

When the original input pattern (1, 1) is presented,
$$y = W x^{(0)} = \begin{pmatrix} 0 & -2 \\ 2 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} -2 \\ 2 \end{pmatrix},$$
and $\mathrm{sgn}(y) = x^{(1)} = (-1, 1)$, the correct output pattern associated with (1, 1). If the stimulus is a new input pattern (−1, −1), for which no stored association exists, the resulting output pattern is
$$y = W x^{(0)} = \begin{pmatrix} 0 & -2 \\ 2 & 0 \end{pmatrix} \begin{pmatrix} -1 \\ -1 \end{pmatrix} = \begin{pmatrix} 2 \\ -2 \end{pmatrix},$$
and $\mathrm{sgn}(y) = (1, -1)$, different from the stored output patterns.
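The following minimal NumPy sketch implements this kind of one-shot Hebbian matrix memory, using the pattern pairs of Example 1 (illustrative values):

```python
import numpy as np

# Columns of I are the input patterns; columns of D are the desired outputs.
I = np.array([[ 1, -1],
              [ 1,  1]])          # i_1 = (1, 1),  i_2 = (-1, 1)
D = np.array([[-1, -1],
              [ 1, -1]])          # d_1 = (-1, 1), d_2 = (-1, -1)

# One-shot Hebbian ("correlation") weight matrix: W = sum_p d_p i_p^T = D I^T.
W = D @ I.T

def recall(x):
    """Single-pass recall: premultiply by W and apply the signum function."""
    return np.where(W @ x >= 0, 1, -1)

print(W)                              # [[ 0 -2] [ 2  0]]
print(recall(np.array([ 1,  1])))     # (-1, 1): the stored association
print(recall(np.array([-1, -1])))     # (1, -1): spurious output for an unstored input
```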


Least squares procedure (Widrow-Hoff rule): When $i_p$ is presented to the network, the resulting output $\Theta\, i_p$ must be as close as possible to the desired output pattern $d_p$. Hence $\Theta$ must be chosen to minimize
$$E = \sum_{p=1}^{P} \| d_p - \Theta\, i_p \|^2 = \sum_{p=1}^{P} \left[ \Big( d_{p,1} - \sum_j \theta_{1,j}\, i_{p,j} \Big)^2 + \cdots + \Big( d_{p,m} - \sum_j \theta_{m,j}\, i_{p,j} \Big)^2 \right].$$

Since $E$ is a quadratic function whose second derivative is positive, we obtain the weights that minimize $E$ by solving
$$\frac{\partial E}{\partial \theta_{j,\ell}} = -2 \sum_{p=1}^{P} \Big( d_{p,j} - \sum_{k=1}^{n} \theta_{j,k}\, i_{p,k} \Big)\, i_{p,\ell} = 0$$
for each $(\ell, j)$, obtaining
$$\sum_{p=1}^{P} d_{p,j}\, i_{p,\ell} = \sum_{p=1}^{P} \sum_{k=1}^{n} \theta_{j,k}\, i_{p,k}\, i_{p,\ell} = \sum_{k=1}^{n} \theta_{j,k} \Big( \sum_{p=1}^{P} i_{p,k}\, i_{p,\ell} \Big) = (j\text{th row of } \Theta) \cdot (\ell\text{th column of } I I^T).$$

Combining all such equations,
$$D I^T = \Theta\, (I I^T).$$
When $(I I^T)$ is invertible, the mean square error is minimized by
$$\Theta = D I^T (I I^T)^{-1}. \qquad (1)$$

The least squares method "normalizes" the Hebbian weight matrix $D I^T$ using the inverse of $(I I^T)$. For auto-association, $\Theta = I I^T (I I^T)^{-1}$ is the identity matrix, which maps each vector to itself.
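As a minimal sketch of this computation, the weight matrix $\Theta = D I^T (I I^T)^{-1}$ can be obtained directly with NumPy; the patterns below are arbitrary illustrative values (not taken from the examples in this chapter):

```python
import numpy as np

# Columns of I are the input patterns i_p; columns of D are the desired outputs d_p.
I = np.array([[ 1,  1, -1, -1],
              [ 1, -1,  1, -1],
              [ 1,  1,  1,  1]])        # 3 input dimensions, 4 pattern pairs
D = np.array([[ 1, -1,  1, -1],
              [ 1,  1, -1, -1]])        # 2 output dimensions, 4 pattern pairs

# Least squares (Widrow-Hoff) weight matrix: Theta = D I^T (I I^T)^{-1}.
Theta = D @ I.T @ np.linalg.inv(I @ I.T)

print(Theta)
print(np.allclose(Theta @ I, D))        # here the recall error ||D - Theta I||^2 is zero
```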


What if $I I^T$ has no inverse? Every matrix $A$ has a unique "pseudo-inverse", defined to be the matrix $A^+$ that satisfies the following conditions:
$$A A^+ A = A, \qquad A^+ A A^+ = A^+, \qquad A^+ A = (A^+ A)^T, \qquad A A^+ = (A A^+)^T.$$

The Optimal Linear Associative Memory (OLAM) generalizes Equation (1): $\Theta = D\, (I)^+$. For auto-association, $I = D$, so that $\Theta = I\, (I)^+$.

When the stored input patterns (the columns of $I$) are orthonormal unit vectors,
$$i_p \cdot i_{p'} = \begin{cases} 1 & \text{if } p = p' \\ 0 & \text{otherwise,} \end{cases}$$
hence $(I)^T I$ is the identity matrix, so that all conditions defining the pseudo-inverse are satisfied with $(I)^+ = (I)^T$. Then
$$\Theta = D\, (I)^+ = D\, (I)^T.$$

Example 2: Hetero-association, with three $(i, d)$ pairs in which each input pattern is three-dimensional and each desired output pattern is two-dimensional; the input patterns form the columns of a $3 \times 3$ matrix $I$, and the corresponding output patterns form the columns of a $2 \times 3$ matrix $D$.

The inverse of $I (I)^T$ exists, and
$$\Theta = D\, (I)^T \big( I (I)^T \big)^{-1}$$
is a $2 \times 3$ matrix. Premultiplying any of the stored input patterns by $\Theta$ then yields the associated two-dimensional output pattern.

Example 3: Consider hetero-association in which four input-output pattern pairs are to be stored, again with three-dimensional input patterns and two-dimensional output patterns, so that $I$ is a $3 \times 4$ matrix and $D$ is a $2 \times 4$ matrix.

Although $I$ is not square, $I (I)^T$ is a $3 \times 3$ matrix whose inverse exists, and
$$\Theta = D\, (I)^T \big( I (I)^T \big)^{-1}$$
is computed as before, giving a $2 \times 3$ matrix. Premultiplying a three-dimensional input vector by $\Theta$ produces a two-dimensional output vector; for the stored input patterns, the result is as close as possible (in the least squares sense) to the associated output patterns.
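A sketch of the same computation using the pseudo-inverse, which applies even when $I (I)^T$ is singular (for instance, when fewer patterns are stored than there are input dimensions); `np.linalg.pinv` computes a matrix satisfying the four conditions above. The patterns are again purely illustrative:

```python
import numpy as np

# Two stored pairs with 4-dimensional inputs and 2-dimensional outputs.
# With only 2 patterns in 4 dimensions, I I^T (4 x 4) is singular, so the
# plain matrix inverse of the previous sketch cannot be used.
I = np.array([[ 1, -1],
              [ 1,  1],
              [-1,  1],
              [ 1,  1]])               # columns are input patterns i_p
D = np.array([[ 1, -1],
              [-1, -1]])               # columns are desired outputs d_p

Theta = D @ np.linalg.pinv(I)          # OLAM: Theta = D (I)^+

print(np.sign(Theta @ I))              # recovers D for the stored inputs
```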

Noise Extraction: Given the stored patterns, a vector $x$ can be written as the sum of two components:
$$x = W x + (\mathbf{1} - W)\, x = \tilde{x} + \tilde{a}, \qquad \tilde{x} = \sum_{i=1}^{m} c_i\, i_i,$$
where $\tilde{x}$ is the projection of $x$ onto the vector space spanned by the stored patterns, and $\tilde{a}$ is the noise component orthogonal to this space.

Matrix multiplication by $W$ thus projects any vector onto the space of the stored vectors, whereas $(\mathbf{1} - W)$ extracts the noise component to be minimized.

Noise is suppressed if the number of patterns being stored is less than the number of neurons; otherwise, the noise may be amplified.
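A brief sketch of this decomposition under the auto-associative convention above, where $W = I\, (I)^+$ projects onto the span of the stored patterns (illustrative vectors only):

```python
import numpy as np

I = np.array([[ 1,  1],
              [ 1, -1],
              [ 1,  1],
              [-1,  1]])               # columns are the stored patterns
W = I @ np.linalg.pinv(I)              # projector onto the space they span

x = np.array([0.9, 1.2, 0.8, -1.1])    # a noisy version of the first stored pattern
x_tilde = W @ x                        # component inside the stored-pattern space
a_tilde = x - x_tilde                  # orthogonal "noise" component

print(np.round(x_tilde, 3), np.round(a_tilde, 3))
print(np.allclose(W @ a_tilde, 0))     # the noise component is annihilated by W
```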

Nonlinear Transformations: A differentiable node function $f$ may be applied at the output layer:
$$x^{(1)} = f(W x^{(0)}) = \begin{pmatrix} f(w_{1,1}\, x^{(0)}_1 + \cdots + w_{1,n}\, x^{(0)}_n) \\ \vdots \\ f(w_{m,1}\, x^{(0)}_1 + \cdots + w_{m,n}\, x^{(0)}_n) \end{pmatrix}.$$
The error $E$ is minimized with respect to each weight $w_{j,\ell}$ by solving
$$\frac{\partial E}{\partial w_{j,\ell}} \propto \sum_{p=1}^{P} \big( d_{p,j} - x^{(1)}_{p,j} \big)\, f'\!\Big( \sum_{k=1}^{n} w_{j,k}\, i_{p,k} \Big)\, i_{p,\ell} = 0,$$
where $x^{(1)}_{p,j} = f\big( \sum_{k=1}^{n} w_{j,k}\, i_{p,k} \big)$ and $f'(z) = df(z)/dz$.

Iterative procedures are needed to solve these nonlinear equations.
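A minimal sketch of such an iterative solution, using gradient descent with $f = \tanh$ and the derivative expression above (illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, P = 5, 3, 8
I = rng.choice([-1.0, 1.0], size=(n, P))      # columns are inputs i_p
D = rng.choice([-1.0, 1.0], size=(m, P))      # columns are targets d_p

W = 0.01 * rng.normal(size=(m, n))
eta = 0.05
for _ in range(500):
    X1 = np.tanh(W @ I)                       # nonlinear outputs x^(1)_{p,j}
    grad = -((D - X1) * (1 - X1**2)) @ I.T    # dE/dW (up to a constant factor)
    W -= eta * grad

print(np.mean((D - np.tanh(W @ I))**2))       # recall error shrinks with training
```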

2. Hopfield Networks

Hopfield networks are auto-associators in which node values are iteratively updated based on a local computation principle: the new state of each node depends only on its net weighted input at a given time.

The network is fully connected, as shown in Figure 2, and the weights are obtained by Hebb's rule.

The system may undergo many state transitions before converging to a stable state.

Figure 2: A Hopfield network with five nodes; every pair of nodes $\ell, j$ is connected in both directions, with symmetric weights $w_{\ell,j} = w_{j,\ell}$.

Discrete Hopfield networks

One node corresponds to each vector dimension, taking values in $\{-1, 1\}$. Each node applies a step function to the sum of its external input and the weighted outputs of the other nodes. Node output values may change when an input vector is presented. Computation proceeds in "jumps": the values of the time variable range over the natural numbers, not the real numbers.

NOTATION
- $T = \{i_1, \ldots, i_P\}$: training set.
- $x_{p,j}(t)$: value generated by the $j$th node at time $t$, for the $p$th input pattern.
- $w_{k,j}$: weight of the connection from the $j$th to the $k$th node.
- $I_{p,k}$: external input to the $k$th node, for the $p$th input vector (includes threshold term).

The node output function is described by
$$x_{p,k}(t+1) = \mathrm{sgn}\!\Big( \sum_{j=1}^{n} w_{k,j}\, x_{p,j}(t) + I_{p,k} \Big),$$
where $\mathrm{sgn}(x) = 1$ if $x \ge 0$ and $\mathrm{sgn}(x) = -1$ if $x < 0$.

Asynchronicity: at every time instant, precisely one node's output value is updated. Node selection may be cyclic or random; each node in the system must have the opportunity to change state.

Network Dynamics: Select a node $k \in \{1, \ldots, n\}$ to be updated:
$$x_{p,\ell}(t+1) = \begin{cases} x_{p,\ell}(t) & \text{if } \ell \ne k \\ \mathrm{sgn}\!\Big( \sum_{j=1}^{n} w_{\ell,j}\, x_{p,j}(t) + I_{p,\ell} \Big) & \text{if } \ell = k. \end{cases} \qquad (2)$$

By contrast, in Little's synchronous model, all nodes are updated simultaneously at every time instant. Cyclic behavior may result when two nodes update their values simultaneously, each attempting to move towards a different attractor. The network may then repeatedly shuttle between two network states. Any such cycle consists of only two states and hence can be detected easily.

The Hopfield network can be used to retrieve a stored pattern when a corrupted version of the stored pattern is presented. The Hopfield network can also be used to "complete" a pattern when parts of the pattern are missing, e.g., using 0 for an unknown node input/activation value.

Figure 3: The initial network state O is equidistant from attractors X and Y in (a). Asynchronous computation instantly leads to one of the attractors, (b) or (c). In synchronous computation, the state moves instead from the initial position to a non-attractor position (d). Cycling behavior results in the synchronous case, with the network state oscillating between (d) and (e).

Example 4: Consider a 4-node Hopfield network whose weights store the patterns (1, 1, 1, 1) and (−1, −1, −1, −1). Each weight $w_{\ell,j} = 1$ for $\ell \ne j$, and $w_{j,j} = 0$ for all $j$.

I. Corrupted Input Pattern (1, 1, 1, −1): $I_1 = I_2 = I_3 = 1$ and $I_4 = -1$ are the initial values of the node outputs.
1. Assume that the second node is randomly selected for possible update. Its net input is $w_{2,1} x_1 + w_{2,3} x_3 + w_{2,4} x_4 + I_2 = 1 + 1 - 1 + 1 = 2$. Since $\mathrm{sgn}(2) = 1$, this node's output remains at 1.
2. If the fourth node is selected for possible update, its net input is $1 + 1 + 1 - 1 = 2$, and $\mathrm{sgn}(2) = 1$ implies that this node changes state to 1.
3. No further changes of state occur from this network configuration (1, 1, 1, 1). Thus, the network has successfully recovered one of the stored patterns from the corrupted input vector.

II. Equidistant Case (1, 1, −1, −1): Both stored patterns are equally distant; one is chosen because the node function yields 1 when the net input is 0.
1. If the second node is selected for updating, its net input is 0, hence its state is not changed from 1.
2. If the third node is selected for updating, its net input is 0, hence its state is changed from −1 to 1.
3. Subsequently, the fourth node also changes state, resulting in the network configuration (1, 1, 1, 1).

Missing Input Element Case (1, 0, −1, −1): If the second node is selected for updating, its net input is $w_{2,1} x_1 + w_{2,3} x_3 + w_{2,4} x_4 + I_2 = 1 - 1 - 1 + 0 = -1$, implying that the updated output of this node is −1. Subsequently, the first node also switches state to −1, resulting in the network configuration (−1, −1, −1, −1).

Multiple Missing Input Element Case (−1, 0, 0, 0): Though most of the initial inputs are unknown, the network succeeds in switching the states of three nodes, resulting in the stored pattern (−1, −1, −1, −1).

Thus, a significant amount of corruption, noise, or missing data can be successfully handled in problems with a small number of stored patterns.
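A minimal sketch of the asynchronous recall procedure of Example 4 (weights equal to 1 off the diagonal, $\mathrm{sgn}(0) = +1$, and external inputs equal to the presented pattern, with 0 for missing entries):

```python
import numpy as np

n = 4
W = np.ones((n, n)) - np.eye(n)          # w_lj = 1 for l != j, w_jj = 0
                                         # (stores (1,1,1,1) and (-1,-1,-1,-1))

def recall(pattern, sweeps=5, seed=0):
    rng = np.random.default_rng(seed)
    I = np.array(pattern, dtype=float)       # external inputs (0 = missing value)
    x = I.copy()                             # initial node outputs
    for _ in range(sweeps):
        for k in rng.permutation(n):         # asynchronous: one node at a time
            net = W[k] @ x + I[k]
            x[k] = 1.0 if net >= 0 else -1.0 # sgn with sgn(0) = +1
    return x

print(recall([ 1, 1,  1, -1]))   # corrupted input      -> [ 1.  1.  1.  1.]
print(recall([ 1, 0, -1, -1]))   # missing 2nd element  -> [-1. -1. -1. -1.]
print(recall([-1, 0,  0,  0]))   # mostly missing input -> [-1. -1. -1. -1.]
```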

Energy Function: The Hopfield network dynamics attempt to minimize an "energy" (cost) function, derived as follows, assuming each desired output vector is the stored attractor pattern nearest the input vector.

- If $w_{\ell,j}$ is positive and large, then we expect that the $\ell$th and $j$th nodes are frequently ON or OFF together in the attractor patterns, i.e., $\sum_{p=1}^{P} (i_{p,\ell}\, i_{p,j})$ is positive and large.
- Similarly, $w_{\ell,j}$ is negative and large when the $\ell$th and $j$th nodes frequently have opposite activation values in the attractor patterns, i.e., $\sum_{p=1}^{P} (i_{p,\ell}\, i_{p,j})$ is negative and large.
- So $w_{\ell,j}$ should be proportional to $\sum_{p=1}^{P} (i_{p,\ell}\, i_{p,j})$. For strong positive or negative correlation, $w_{\ell,j}$ and $\sum_p (i_{p,\ell}\, i_{p,j})$ have the same sign, hence $\sum_{p=1}^{P} w_{\ell,j}\, (i_{p,\ell}\, i_{p,j}) > 0$.
- The self-excitation coefficient $w_{j,j} \ge 0$, often $= 0$.
- Summing over all pairs of nodes, $\sum_\ell \sum_j w_{\ell,j}\, i_{p,\ell}\, i_{p,j}$ is positive and large for input vectors almost identical to some attractor.
- We expect that node correlations present in the attractors are absent in an input vector $i$ distant from all attractor patterns, so that $\sum_\ell \sum_j w_{\ell,j}\, i_\ell\, i_j$ is then low or negative.
- Therefore, the energy function contains a term
$$-\Big( \sum_\ell \sum_j w_{\ell,j}\, x_\ell\, x_j \Big).$$
When this is minimized, the final values of the various node outputs are expected to correspond to an attractor pattern.
- Network output should also be close to the input vector: when presented with a corrupted version of one stored image, we do not want the system to generate the output corresponding to a different stored image. For external inputs $I_\ell$, another term $-\sum_\ell I_\ell\, x_\ell$ is included in the energy expression; $I_\ell\, x_\ell > 0$ iff input and output agree for the $\ell$th node.

Combining the two terms, the following "energy" or Lyapunov function must be minimized by modifying the $x_\ell$ values:
$$E = -a \sum_\ell \sum_j w_{\ell,j}\, x_\ell\, x_j - b \sum_\ell I_\ell\, x_\ell,$$
where $a, b > 0$. The values $a = 1/2$ and $b = 1$ correspond to a reduction in energy whenever a node update occurs, as described below.

Even if each $I_\ell = 0$, we can select initial node inputs $x_\ell(0)$ so that the network settles into a state close to the input pattern components.

Energy Minimization: Steadily reducing $E$ will result in convergence to a stable state (which may or may not be one of the desired attractors).

Let the $k$th node be selected for updating at time $t$. For the node update rule in Eqn. (2), the resulting change of energy is
$$\begin{aligned}
\Delta E(t) &= E(t+1) - E(t) \\
&= -a \sum_\ell \sum_{j \ne \ell} w_{\ell,j}\, [x_\ell(t+1)\, x_j(t+1) - x_\ell(t)\, x_j(t)] - b \sum_\ell I_\ell\, [x_\ell(t+1) - x_\ell(t)] \\
&= -a \sum_{j \ne k} (w_{k,j} + w_{j,k})\, [x_k(t+1) - x_k(t)]\, x_j(t) - b\, I_k\, [x_k(t+1) - x_k(t)],
\end{aligned}$$
because $x_j(t+1) = x_j(t)$ for every node $j \ne k$ not selected for updating at this step. Hence,
$$\Delta E(t) = -\Big( a \sum_{j \ne k} (w_{k,j} + w_{j,k})\, x_j(t) + b\, I_k \Big) \big( x_k(t+1) - x_k(t) \big).$$

For $\Delta E(t)$ to be negative, $(x_k(t+1) - x_k(t))$ and $\big( a \sum_{j \ne k} (w_{k,j} + w_{j,k})\, x_j(t) + b\, I_k \big)$ must have the same sign.

The weights are chosen to be proportional to correlation terms, i.e., $w_{\ell,j} = \sum_{p=1}^{P} i_{p,\ell}\, i_{p,j} / P$. Hence $w_{j,k} = w_{k,j}$, i.e., the weights are symmetric, and for the choice of the constants $a = 1/2$, $b = 1$, the energy change expression simplifies to
$$\Delta E(t) = -\Big( \sum_{j \ne k} w_{j,k}\, x_j(t) + I_k \Big) \big( x_k(t+1) - x_k(t) \big) = -\mathrm{net}_k(t)\, \Delta x_k(t),$$
where $\mathrm{net}_k(t) = \Big( \sum_{j \ne k} w_{j,k}\, x_j(t) + I_k \Big)$ is the net input to the $k$th node at time $t$.

To reduce energy, the chosen ($k$th) node changes state iff its current state differs from the sign of its net input, i.e., iff
$$\mathrm{net}_k(t)\, x_k(t) < 0.$$

Repeated application of this update rule results in a "stable state" in which all nodes stop changing their current values. Stable states may not be the desired attractor states, but may instead be "spurious".
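A small numerical check of this argument: with symmetric weights and $a = 1/2$, $b = 1$, the energy $E$ never increases under the asynchronous update rule (arbitrary symmetric test weights and inputs):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                        # symmetric weights
np.fill_diagonal(W, 0.0)                 # no self-excitation
I = rng.normal(size=n)                   # external inputs
x = rng.choice([-1.0, 1.0], size=n)      # random initial state

def energy(x):
    return -0.5 * x @ W @ x - I @ x      # a = 1/2, b = 1

for _ in range(100):
    e_before = energy(x)
    k = int(rng.integers(n))             # pick one node (asynchronous update)
    x[k] = 1.0 if W[k] @ x + I[k] >= 0 else -1.0
    assert energy(x) <= e_before + 1e-12 # energy never increases

print("final energy:", energy(x))
```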

Example 5: The patterns (1, 1, −1, −1), (1, 1, 1, 1), and (−1, −1, 1, 1) are to be stored in a 4-node network. The first and second nodes have exactly the same values in every stored pattern, hence $w_{1,2} = 1$. Similarly, $w_{3,4} = 1$. The first and third nodes agree in one stored pattern but disagree in the other two stored patterns, hence
$$w_{1,3} = \frac{(\text{no. of agreements}) - (\text{no. of disagreements})}{\text{no. of patterns}} = -\frac{1}{3}.$$
Similarly, $w_{1,4} = w_{2,3} = w_{2,4} = -1/3$.

- If the input vector is (−1, −1, −1, −1) and the fourth node is selected for possible update, its net input is $w_{4,1} x_1 + w_{4,2} x_2 + w_{4,3} x_3 + I_4 = 1/3 + 1/3 - 1 - 1 = -4/3 < 0$, hence the node does not change state. The same holds for every node in the network, so that the network configuration remains at (−1, −1, −1, −1), different from the patterns that were to be stored.
- If the input vector is (−1, −1, −1, 0), representing the case when the fourth input value is missing, and the fourth node is selected for possible update, its net input is $w_{4,1} x_1 + w_{4,2} x_2 + w_{4,3} x_3 + I_4 = 1/3 + 1/3 - 1 + 0 = -1/3 < 0$, and the node changes state to −1, resulting in the spurious pattern (−1, −1, −1, −1).

Example 6: Images of four different objects are shown in Figure 4. Each image is treated as a binary pixel array, as in Figure 5, and stored using a Hopfield network with one neuron per pixel.

Figure 4: Four images stored in a Hopfield network.

Figure 5: Binary (pixel-array) representation of the objects in Figure 4.

Figure 6: Corrupted versions of the objects in Figure 4.

The network is stimulated by a distorted version of a stored image, as shown in Figure 6. In each case, as long as the total amount of distortion affects only a small fraction of the neurons, the network recovers the correct image, even when big parts of an image are filled with ones or zeroes.

Poorer performance would be expected if the network were much smaller, or if the number of patterns to be stored were much larger.

One drawback of a Hopfield network is the assumption of full connectivity: a million weights are needed for a thousand-node network, and such large cases arise in image-processing applications with one node per pixel.

Storage capacity refers to the quantity of information that can be stored and retrieved without error, and may be measured as
$$C = \frac{\text{no. of stored patterns}}{\text{no. of neurons}}.$$
Capacity depends on the connection weights, the stored patterns, and the difference between the stimulus patterns and the stored patterns.

Let the training set contain $P$ randomly chosen vectors $i_1, \ldots, i_P$, where each $i_p \in \{-1, 1\}^n$. These vectors are stored using the connection weights
$$w_{\ell,j} = \frac{1}{n} \sum_{p=1}^{P} i_{p,\ell}\, i_{p,j}.$$
How large can $P$ be, so that the network responds to each $i_p$ by correctly retrieving $i_p$?

Theorem 1: The maximum capacity of a Hopfield neural network (with $n$ nodes) is bounded above by $n/(4 \ln n)$. In other words, if
$$\rho = \mathrm{Prob}(\ell\text{th bit of the } p\text{th stored vector is correctly retrieved, for each } \ell, p),$$
then $\lim_{n \to \infty} \rho = 1$ whenever $P < n/(4 \ln n)$.

Proof:
- For stimulus $i_1$, the output of the first node is
$$o_1 = \sum_{j=1}^{n} w_{1,j}\, i_{1,j} = \frac{1}{n} \sum_{j=1}^{n} \sum_{p=1}^{P} i_{p,1}\, i_{p,j}\, i_{1,j}.$$
This output will be correctly decoded if $o_1\, i_{1,1} > 0$.
- Algebraic manipulation yields $o_1\, i_{1,1} = 1 + Z - 1/n$, where
$$Z = \frac{1}{n} \sum_{j=2}^{n} \sum_{p=2}^{P} i_{p,1}\, i_{p,j}\, i_{1,j}\, i_{1,1}.$$
- The probability of correct retrieval of the first bit is
$$\rho_1 = \mathrm{Prob}(o_1\, i_{1,1} > 0) = \mathrm{Prob}(1 + Z > 1/n) \approx 1 - \mathrm{Prob}(Z \le -1).$$
- By assumption, $E[i_{\ell,j}] = 0$ and
$$E[i_{\ell,j}\, i_{\ell',j'}] = \begin{cases} 1 & \text{if } \ell = \ell' \text{ and } j = j' \\ 0 & \text{otherwise.} \end{cases}$$
- By the central limit theorem, $Z$ is approximately distributed as a Gaussian random variable with mean 0 and variance $(P-1)(n-1)/n^2 \approx P/n$ for large $n$ and $P$, with density function $(2\pi P/n)^{-1/2} \exp(-n x^2 / 2P)$, and
$$\mathrm{Prob}(Z \le -1) = \int_{-\infty}^{-1} (2\pi P/n)^{-1/2} e^{-n x^2 / 2P}\, dx = \int_{1}^{\infty} (2\pi P/n)^{-1/2} e^{-n x^2 / 2P}\, dx,$$
since the density function of $Z$ is symmetric.
- If $(n/P)$ is large,
$$\rho_1 \approx 1 - \sqrt{\frac{P}{2\pi n}}\, \exp\!\Big( -\frac{n}{2P} \Big).$$
- The same probability expression applies for each bit of each pattern. Therefore, if $P \ll n$, the probability of correct retrieval of all bits of all stored patterns is given by
$$\rho = (\rho_1)^{nP} \approx \Big( 1 - \sqrt{\frac{P}{2\pi n}}\, e^{-n/2P} \Big)^{nP} \approx 1 - nP \sqrt{\frac{P}{2\pi n}}\, e^{-n/2P}.$$
- If $P/n \le 1/((4 + \epsilon) \ln n)$ for some $\epsilon > 0$, then
$$\exp\!\Big( -\frac{n}{2P} \Big) \le \exp\!\Big( -\frac{(4+\epsilon) \ln n}{2} \Big) = n^{-(2 + \epsilon/2)},$$
which converges to zero as $n \to \infty$, so that
$$nP \sqrt{\frac{P}{2\pi n}}\, \exp\!\Big( -\frac{n}{2P} \Big) \le \frac{n^{-\epsilon/2}}{(4 \ln n)^{3/2} \sqrt{2\pi}},$$
which converges to zero if $\epsilon > 0$.

A better bound can be obtained by considering the correction of errors during the iterative evaluations.
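A rough empirical sketch of this capacity behavior: store $P$ random patterns with $w_{\ell,j} = \frac{1}{n} \sum_p i_{p,\ell}\, i_{p,j}$ and measure how often a single synchronous update of every stored pattern reproduces all of them exactly (an illustrative experiment, not taken from the text):

```python
import numpy as np

def retrieval_rate(n, P, trials=20, seed=0):
    rng = np.random.default_rng(seed)
    ok = 0
    for _ in range(trials):
        patterns = rng.choice([-1, 1], size=(P, n))     # P random stored vectors
        W = patterns.T @ patterns / n                   # w_lj = (1/n) sum_p i_pl i_pj
        np.fill_diagonal(W, 0.0)
        out = np.where(patterns @ W.T >= 0, 1, -1)      # one-step retrieval of each pattern
        ok += np.all(out == patterns)
    return ok / trials

n = 200
for P in (5, 10, 20, 40):
    print(P, retrieval_rate(n, P))   # retrieval degrades as P grows past about n / (4 ln n)
```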

Hopfield network dynamics are not deterministic: the node to be updated at any given time is chosen randomly. Different sequences of choices of nodes to be updated may lead to different stable states.

Stochastic version (of the Hopfield network): The output of node $\ell$ is $+1$ with probability $1/(1 + \exp(-2\, \mathrm{net}_\ell))$, for net node input $\mathrm{net}_\ell$. Retrieval of stored patterns is then effective if $P < 0.138\, n$.

Continuous Hopfield networks

1. Node outputs range over a continuous interval.
2. Time is continuous: each node constantly examines its net input and updates its output.
3. Changes in node outputs must be gradual over time.

For ease of implementation, assurance of convergence, and biological plausibility, we assume each node's output lies in $[-1, 1]$, with the modified node update rule
$$\frac{dx_\ell(t)}{dt} = \begin{cases} 0 & \text{if } x_\ell = 1 \text{ and } f\big( \sum_j w_{j,\ell}\, x_j(t) + I_\ell \big) > 0 \\ 0 & \text{if } x_\ell = -1 \text{ and } f\big( \sum_j w_{j,\ell}\, x_j(t) + I_\ell \big) < 0 \\ f\big( \sum_j w_{j,\ell}\, x_j(t) + I_\ell \big) & \text{otherwise.} \end{cases}$$

The proof of convergence is similar to that for the discrete Hopfield model: an energy function with a lower bound is constructed, and it is shown that every change made in a node's output decreases energy, assuming asynchronous dynamics. The energy function
$$E = -\frac{1}{2} \sum_\ell \sum_{j \ne \ell} w_{\ell,j}\, x_\ell(t)\, x_j(t) - \sum_\ell I_\ell\, x_\ell(t)$$
is minimized as $x_1(t), \ldots, x_n(t)$ vary with time $t$. Given the weights and external inputs, $E$ has a lower bound, since $x_1(t), \ldots, x_n(t)$ have upper and lower bounds.

Since $-\frac{1}{2} w_{\ell,j}\, x_\ell\, x_j - \frac{1}{2} w_{j,\ell}\, x_j\, x_\ell = -w_{\ell,j}\, x_\ell\, x_j$ for symmetric weights, we have
$$\frac{\partial E}{\partial x_\ell(t)} = -\Big( \sum_j w_{\ell,j}\, x_j + I_\ell \Big). \qquad (3)$$
The Hopfield net update rule requires
$$\frac{dx_\ell}{dt} > 0 \;\text{ iff }\; f\!\Big( \sum_j w_{j,\ell}\, x_j + I_\ell \Big) > 0. \qquad (4)$$
Equivalently, the update rule may be expressed in terms of changes occurring in the net input to a node, instead of the node output $(x_\ell)$.

Whenever $f$ is a monotonically increasing function (such as tanh) with $f(0) = 0$,
$$f\!\Big( \sum_j w_{j,\ell}\, x_j + I_\ell \Big) > 0 \;\text{ iff }\; \Big( \sum_j w_{j,\ell}\, x_j + I_\ell \Big) > 0. \qquad (5)$$
From Equations (3), (4), and (5),
$$\frac{dx_\ell}{dt} > 0 \;\text{ iff }\; \Big( \sum_j w_{j,\ell}\, x_j + I_\ell \Big) > 0, \;\text{ i.e., iff }\; \frac{\partial E}{\partial x_\ell} < 0.$$
Hence
$$\Big( \frac{dx_\ell}{dt} \Big) \Big( \frac{\partial E}{\partial x_\ell} \Big) \le 0 \;\text{ for each } \ell, \qquad\text{so}\qquad \frac{dE}{dt} = \sum_\ell \Big( \frac{dx_\ell}{dt} \Big) \Big( \frac{\partial E}{\partial x_\ell} \Big) \le 0.$$

Computation terminates, since
(a) each node update decreases (lower-bounded) $E$;
(b) the number of possible states is finite;
(c) the number of possible node updates is limited;
(d) and node output values are bounded.

The continuous model generalizes the discrete model, but the size of the state space is larger and the energy function may have many more local minima.
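A rough sketch of the continuous dynamics, integrated with a simple Euler step, $f = \tanh$, and outputs clipped to $[-1, 1]$ (symmetric test weights, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                        # symmetric weights
np.fill_diagonal(W, 0.0)
I = rng.normal(size=n)

def energy(x):
    return -0.5 * x @ W @ x - I @ x

x = rng.uniform(-0.1, 0.1, size=n)       # start near the centre of the "box"
dt = 0.05
for _ in range(2000):
    dx = np.tanh(W @ x + I)              # dx_l/dt = f(sum_j w_jl x_j + I_l)
    x = np.clip(x + dt * dx, -1.0, 1.0)  # saturate at the bounds

print(np.round(x, 2), energy(x))         # the state settles into a low-energy configuration
```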

The Cohen-Grossberg Theorem gives sufficient conditions for a network to converge asymptotically to a stable state. Let $u_\ell$ be the net input to the $\ell$th node, $f_\ell$ its node function, $a_j$ a rate-of-change term, and $b_j$ a "loss" term.

Theorem 2: Let $a_j(u_j) \ge 0$ and $(df_j(u_j)/du_j) \ge 0$, where $u_j$ is the net input to the $j$th node in a neural network with symmetric weights $w_{j,\ell} = w_{\ell,j}$, whose behavior is governed by the following differential equation:
$$\frac{du_j}{dt} = a_j(u_j) \Big[ b_j(u_j) - \sum_{\ell=1}^{N} w_{j,\ell}\, f_\ell(u_\ell) \Big], \quad j = 1, \ldots, n.$$
Then there exists an energy function $E$ for which $(dE/dt) \le 0$ for $u_j \ne 0$, i.e., the network dynamics lead to a stable state in which energy ceases to change.

3. Brain-State-in-a-Box (BSB) Network

The BSB network is similar to the Hopfield model, but all nodes are updated simultaneously. The node function used is a ramp function
$$f(\mathrm{net}) = \min(1,\; \max(-1,\; \mathrm{net})),$$
which is bounded, continuous, and piecewise linear, as shown in Figure 7.

Figure 7: The ramp function, with output values in $[-1, 1]$.

Node update rule: the initial activation is steadily amplified by positive feedback until saturation ($|x_\ell| = 1$):
$$x_\ell(t+1) = f\!\Big( \sum_{j=1}^{n} w_{\ell,j}\, x_j(t) \Big).$$
The self-weight $w_{\ell,\ell}$ may be fixed to equal 1.

The state of the network always remains inside an $n$-dimensional "box", giving rise to the name of the network, "Brain-State-in-a-Box". The network state steadily moves from an arbitrary point inside the box (A in Figure 8(a)) towards one side of the box (B in the figure), and then crawls along the side of the box to reach a corner of the box (C in the figure), a stored pattern.

Connections between nodes are Hebbian, representing correlations between node activations, and can be obtained by the non-iterative computation
$$w_{\ell,j} = \frac{1}{P} \sum_{p=1}^{P} (i_{p,\ell}\, i_{p,j}),$$
where each $i_{p,j} \in \{-1, 1\}$.

Figure 8: State change trajectory (A → B → C) in a BSB network: (a) the network; (b) three stored patterns, indicated by darkened circles at corners of the box.

If the training procedure is iterative, training patterns are repeatedly presented to the network, and the weights are successively modified using the weight update rule
$$\Delta w_{\ell,j} = \eta\, i_{p,j} \Big( i_{p,\ell} - \sum_k w_{\ell,k}\, i_{p,k} \Big), \quad p = 1, \ldots, P,$$
where $\eta > 0$. This rule steadily reduces
$$E = \sum_{p=1}^{P} \sum_{\ell=1}^{n} \Big( i_{p,\ell} - \sum_k w_{\ell,k}\, i_{p,k} \Big)^2,$$
and is applied repeatedly for each pattern until $E \approx 0$. When training is completed, we expect $\sum_p \Delta w_{\ell,j} = 0$, implying that
$$\sum_p i_{p,j} \Big( i_{p,\ell} - \sum_k w_{\ell,k}\, i_{p,k} \Big) = 0,$$
i.e.,
$$\sum_p (i_{p,j}\, i_{p,\ell}) = \sum_p i_{p,j} \sum_k (w_{\ell,k}\, i_{p,k}),$$
an equality satisfied when
$$i_{p,\ell} = \sum_k (w_{\ell,k}\, i_{p,k}).$$
The trained network is hence "stable" for the trained patterns, i.e., presentation of a stored pattern does not result in any change in the network.

Example 7: Let the training set contain the patterns
$$\{(1, 1, 1),\; (-1, 1, 1),\; (-1, -1, -1)\},$$
as in Figure 8, with network connection weights
$$w_{1,2} = w_{2,1} = (1 - 1 + 1)/3 = 1/3, \qquad w_{1,3} = w_{3,1} = (1 - 1 + 1)/3 = 1/3, \qquad w_{2,3} = w_{3,2} = (1 + 1 + 1)/3 = 1,$$
and self-weights $w_{\ell,\ell} = 1$.

- If the input pattern (0.5, 0.3, 0.3) is presented, the next network state is
$$\big( f(0.5 + 0.1 + 0.1),\; f(0.17 + 0.3 + 0.3),\; f(0.17 + 0.3 + 0.3) \big) \approx (0.7,\; 0.77,\; 0.77),$$
where $f$ is the ramp function described earlier. Note that the second and third nodes exhibit identical states, since $w_{2,3} = 1$. The very next network state is (1, 1, 1), a stored pattern.
- If (0.5, −0.2, 0.7) is the input pattern, the network state changes to (0.67, 0.67, 0.67), eventually converging to the stable memory (1, 1, 1).
- If the input pattern presented is (1, 0.5, 0.5), the network state changes instantly to (1, 1, 1) and does not change thereafter.

Network state may converge to a pattern which was not intended to be stored. E.g., if a 3-dimensional BSB network is intended to store only the two patterns $\{(1, 1, 1), (-1, 1, -1)\}$, with weights
$$w_{1,2} = w_{2,1} = (1 - 1)/2 = 0, \qquad w_{2,3} = w_{3,2} = (1 - 1)/2 = 0, \qquad w_{1,3} = w_{3,1} = (1 + 1)/2 = 1,$$
then (−1, −1, −1) is such a spurious attractor.

BSB computations steadily reduce $-\sum_\ell \sum_j w_{\ell,j}\, x_\ell\, x_j$ when the weight matrix is symmetric and positive definite (i.e., all eigenvalues of $W$ are positive).

Stability and the number of spurious attractors tend to increase with the self-excitatory weights $w_{j,j}$. If the weight matrix is "diagonal-dominant", with
$$w_{j,j} \ge \sum_{\ell \ne j} |w_{\ell,j}| \quad \text{for each } j \in \{1, \ldots, n\},$$
then every vertex of the "box" is a stable memory.
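A minimal sketch of the BSB dynamics with the ramp node function, using the connection weights of Example 7 and self-weights fixed at 1:

```python
import numpy as np

W = np.array([[1.0, 1/3, 1/3],
              [1/3, 1.0, 1.0],
              [1/3, 1.0, 1.0]])          # weights of Example 7, with w_ll = 1

def ramp(v):
    return np.clip(v, -1.0, 1.0)         # f(net) = min(1, max(-1, net))

def run_bsb(x, steps=10):
    x = np.array(x, dtype=float)
    for _ in range(steps):
        x = ramp(W @ x)                  # all nodes are updated simultaneously
    return x

print(run_bsb([ 0.5,  0.3,  0.3]))       # -> [ 1.  1.  1.], a stored pattern
print(run_bsb([-0.4, -0.1, -0.3]))       # -> [-1. -1. -1.], another stored pattern
```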

The hetero-associative version of the BSB network contains two layers of nodes; the connection from the $j$th node of the input layer to the $\ell$th node of the output layer carries the weight
$$w_{\ell,j} = \frac{1}{P} \sum_{p=1}^{P} i_{p,j}\, d_{p,\ell}.$$

Example application: clustering radar pulses, i.e., distinguishing meaningful signals from noise in a radar surveillance environment where a detailed description of the signal sources is not known.

4. Hetero-associators

A hetero-associator maps input patterns to a different set of output patterns.

Consider the task of translating English word inputs into Spanish word outputs. A simple feedforward network trained by backpropagation may be able to assert that a particular word supplied as input to the system is the $i$th word in the dictionary. The desired output pattern may be obtained by concatenating the feedforward network function $f: \Re^n \to \{C_1, \ldots, C_k\}$ with a lookup table mapping $g: \{C_1, \ldots, C_k\} \to \{P_1, \ldots, P_k\}$ that associates each "address" $C_i$ with a pattern $P_i$, as shown in Figure 9.

Two-layer networks with weights determined by Hebb's rule are much less computationally expensive. In a non-iterative model, the output layer node activations are given by
$$x^{(2)}_\ell(t+1) = f\!\Big( \sum_j w_{\ell,j}\, x^{(1)}_j(t) \Big).$$

Figure 9: Association between input and output patterns using a feedforward network and a simple memory: the network maps the input vector to one of the "addresses" $C_1, \ldots, C_k$ (at most one $C_i$ is ON for any input vector presentation), and the memory then produces the associated output pattern $P_i$.

For error correction, iterative models are more useful:
1. Compute output node activations using the above (non-iterative) update rule, and then perform iterative auto-association within the output layer, leading to a stored output pattern.
2. Perform iterative auto-association within the input layer, resulting in a stored input pattern, which is then fed into the second layer of the hetero-associator network.
3. Use a Bidirectional Associative Memory (BAM), with no intra-layer connections, as in Figure 10:

   REPEAT
     (a) $x^{(2)}_\ell(t+1) = f\big( \sum_j w_{\ell,j}\, x^{(1)}_j(t) \big)$;
     (b) $x^{(1)}_\ell(t+1) = f\big( \sum_j w_{j,\ell}\, x^{(2)}_j(t+1) \big)$;
   UNTIL $x^{(1)}(t+1) = x^{(1)}(t)$ and $x^{(2)}(t+1) = x^{(2)}(t)$.

The weights can be chosen to be Hebbian (correlation) terms, $w_{\ell,j} = c \sum_{p=1}^{P} i_{p,j}\, d_{p,\ell}$. Sigmoid node functions can be used in continuous BAM models.

Figure 10: A Bidirectional Associative Memory (BAM): two layers of nodes connected by bidirectional inter-layer weights, with no intra-layer connections.

Example 8: The goal is to establish three associations between 4-dimensional and 2-dimensional patterns, $(i_p, d_p)$ for $p = 1, 2, 3$, with $i_p \in \{-1, 1\}^4$ and $d_p \in \{-1, 1\}^2$.

By the Hebbian rule (with $c = 1/P = 1/3$), each weight is
$$w_{\ell,j} = \frac{1}{3} \sum_{p=1}^{3} i_{p,j}\, d_{p,\ell},$$
so a weight equals $\pm 1$ when the $j$th input bit and the $\ell$th output bit agree (or disagree) in all three pairs, and $\pm 1/3$ otherwise. These weights constitute a $2 \times 4$ weight matrix $W$.

When an input vector $i$ is presented at the first layer, each second-layer node computes
$$x^{(2)}_\ell = \mathrm{sgn}\big( x^{(1)}_1 w_{\ell,1} + x^{(1)}_2 w_{\ell,2} + x^{(1)}_3 w_{\ell,3} + x^{(1)}_4 w_{\ell,4} \big),$$
and the resulting vector is one of the stored 2-dimensional patterns. The corresponding first-layer pattern is then regenerated by the reverse pass, following computations such as
$$x^{(1)}_1 = \mathrm{sgn}\big( x^{(2)}_1 w_{1,1} + x^{(2)}_2 w_{2,1} \big),$$
which yields the stored 4-dimensional pattern associated with it. No further changes occur.
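A small sketch of bidirectional recall in a BAM with Hebbian weights; the three 4-dimensional/2-dimensional pattern pairs below are illustrative stand-ins rather than the values of Example 8:

```python
import numpy as np

inputs  = np.array([[ 1,  1,  1,  1],
                    [-1, -1,  1,  1],
                    [ 1, -1, -1,  1]])   # rows are input patterns i_p
outputs = np.array([[ 1,  1],
                    [ 1, -1],
                    [-1,  1]])           # rows are output patterns d_p

W = outputs.T @ inputs / 3               # Hebbian weights: w_lj = (1/3) sum_p i_pj d_pl

def sgn(v):
    return np.where(v >= 0, 1, -1)

def bam_recall(x1, max_iters=10):
    x1 = np.array(x1)
    for _ in range(max_iters):
        x2 = sgn(W @ x1)                 # first layer -> second layer
        new_x1 = sgn(W.T @ x2)           # second layer -> first layer
        if np.array_equal(new_x1, x1):
            break
        x1 = new_x1
    return x1, x2

print(bam_recall([1, 1, 1, -1]))         # corrupted i_1 -> ((1, 1, 1, 1), (1, 1))
```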

The "additive" variant of the BAM separates out the effect of the previous activation value and the external input $I_\ell$ for the node under consideration, using the following state change rule:
$$x^{(2)}_\ell(t+1) = a_\ell\, x^{(2)}_\ell(t) + b_\ell\, I_\ell + f\!\Big( \sum_{j} w_{\ell,j}\, x^{(1)}_j(t) \Big),$$
where $a_\ell, b_\ell$ are frequently chosen from $\{0, 1\}$. If the BAM is discrete, bivalent, as well as additive, then
$$x^{(2)}_i(t+1) = \begin{cases} 1 & \text{if } \Big( a_i\, x^{(2)}_i(t) + b_i\, I_i + \sum_{j} w_{i,j}\, x^{(1)}_j(t) \Big) > \tau_i \\ -1 & \text{otherwise,} \end{cases}$$
where $\tau_i$ is the threshold for the $i$th node. A similar expression is used for the first-layer updates.

BAM models have been shown to converge using a Lyapunov function such as
$$L_1 = -\sum_\ell \sum_j x^{(2)}_\ell(t)\, x^{(1)}_j(t)\, w_{\ell,j}.$$
This energy function can be modified, taking into account external node inputs $I^{(k)}_\ell$, as well as thresholds $\tau^{(k)}_\ell$, for the nodes:
$$L_2 = L_1 - \sum_{k=1}^{2} \sum_{\ell=1}^{N(k)} x^{(k)}_\ell \big( I^{(k)}_\ell - \tau^{(k)}_\ell \big),$$
where $N(k)$ denotes the number of nodes in the $k$th layer. As in Hopfield nets, there is no guarantee that the system stabilizes to a desired output pattern.

Stability is assured if the nodes of one layer are unchanged while the nodes of the other layer are being updated. All nodes in a layer may change state simultaneously, allowing greater parallelism than Hopfield networks.

When a new input pattern is presented, the rate at which the system stabilizes depends on the proximity of the new input pattern to a stored pattern, and not on the number of patterns stored. The number of patterns that can be stored in a BAM is limited by the network size.

Importance Factor: The relative importance of different pattern pairs may be adjusted by attaching a "significance factor" $\sigma_p$ to each pattern pair being used to modify the BAM weights:
$$w_{j,i} = w_{i,j} = \sum_{p=1}^{P} \sigma_p\, i_{p,i}\, d_{p,j}.$$

Decay: Memory may change with time, allowing previously stored patterns to decay, using a monotonically decreasing function for each importance factor:
$$\sigma_p(t) = (1 - \epsilon)\, \sigma_p(t-1) \qquad \text{or} \qquad \sigma_p(t) = \max(0,\; \sigma_p(t-1) - \epsilon),$$
where $0 < \epsilon < 1$ represents the "forgetting" rate.

If $(i_p, d_p)$ is added to the training set at time $t$, then
$$w_{i,j}(t) = (1 - \epsilon)\, w_{i,j}(t-1) + \sigma_p(t-1)\, i_{p,j}\, d_{p,i},$$
where $(1 - \epsilon)$ is the attenuation factor. Alternatively, the rate at which memory fades may be fixed to a global clock:
$$w_{i,j}(t) = (1 - \epsilon)\, w_{i,j}(t-1) + \sum_{p=1}^{\pi(t)} \sigma_p(t-1)\, i_{p,j}\, d_{p,i},$$
where $\pi(t) \ge 0$ is the number of new patterns being stored at time $t$. There may be many instants at which $\pi(t) = 0$, i.e., no new patterns are being stored, but the existing memory continues to decay.

5. Boltzmann Machines

The memory capacity of Hopfield models can be increased by introducing hidden nodes. A stochastic learning process is needed to allow the weights between hidden nodes and the other ("visible") nodes to change to optimal values, beginning from randomly chosen values.

Principles of simulated annealing are invoked to minimize the energy function $E$ defined earlier for Hopfield networks. A node is randomly chosen, and changes state with a probability that depends on $\Delta E / \mathcal{T}$, where the temperature $\mathcal{T} > 0$ is steadily lowered. The state change is accepted with probability $1/(1 + \exp(\Delta E / \mathcal{T}))$. Annealing terminates when $\mathcal{T} \approx 0$.

Many state changes occur at each temperature. If such a system is allowed to reach equilibrium at any temperature $\mathcal{T}$, the ratio of the probabilities of two states $a, b$ with energies $E_a$ and $E_b$ will be given by $P(a)/P(b) = \exp((E_b - E_a)/\mathcal{T})$. This is the Boltzmann distribution, and it does not depend on the initial state or the path followed in reaching equilibrium.

Figure 11 describes the learning algorithm for hetero-association problems. For auto-association, no nodes are "clamped" in the second phase of the algorithm.

Algorithm Boltzmann:
    while weights continue to change
          and computational bounds are not exceeded, do
        Phase 1:
            for each training pattern, do
                Clamp all input and output nodes;
                ANNEAL, changing hidden node states;
                Update {..., p_{i,j}, ...}, the equilibrium probabilities
                    with which nodes i, j have the same state;
            end-for;
        Phase 2:
            for each training pattern, do
                Clamp all input nodes;
                ANNEAL, changing hidden and output nodes;
                Update {..., p'_{i,j}, ...}, the equilibrium probabilities
                    with which nodes i, j have the same state;
            end-for;
        Increment each w_{i,j} by eta * (p_{i,j} - p'_{i,j});
    end-while.

Figure 11: The Boltzmann machine (BM) learning algorithm.

The Boltzmann machine weight change rule conducts gradient descent on the relative entropy (cross-entropy)
$$H(P, P') = \sum_{s} P_s \ln(P_s / P'_s),$$
where $s$ ranges over all possible network states, $P_s$ is the probability of network state $s$ when the visible nodes are clamped, and $P'_s$ is the probability of network state $s$ when no nodes are clamped. $H(P, P')$ compares the probability distributions $P$ and $P'$; note that $H(P, P') = 0$ when $P = P'$, the desired goal.

In using the BM, we cannot directly compute output node states from input node states, since the initial states of the hidden nodes are undetermined. Annealing the network states would result in a global optimum of the energy function, which may have nothing in common with the input pattern. Since the input pattern may be corrupted, the best approach is to initially clamp the input nodes while annealing from a high temperature down to an intermediate temperature $\mathcal{T}_I$. This leads the network towards a local minimum of the energy function near the input pattern. The visible nodes are then unclamped, and annealing continues from $\mathcal{T}_I$ to $\mathcal{T} \approx 0$, allowing visible node states also to be modified, correcting errors in the input pattern.

The cooling rate with which the temperature decreases must be extremely slow to assure convergence to global minima of $E$; faster cooling rates are often used due to computational limitations.

The BM learning algorithm is extremely slow: many observations have to be made at many temperatures before computation concludes.
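A rough sketch of the annealing (state search) step only, with the stochastic acceptance rule described above; the full two-phase weight-learning loop of Figure 11 is omitted, and the weights are arbitrary symmetric test values:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10                                     # total number of nodes (visible + hidden)
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                          # symmetric weights
np.fill_diagonal(W, 0.0)
I = rng.normal(size=n)                     # external inputs

def energy(x):
    return -0.5 * x @ W @ x - I @ x

def anneal(x, clamped, T_start=5.0, T_end=0.05, steps_per_T=50):
    """Propose single-node flips and accept each with probability
    1 / (1 + exp(dE / T)), while steadily lowering the temperature T."""
    x = x.copy()
    T = T_start
    while T > T_end:
        for _ in range(steps_per_T):
            k = int(rng.integers(n))
            if k in clamped:               # clamped nodes never change state
                continue
            x_new = x.copy()
            x_new[k] = -x_new[k]
            dE = energy(x_new) - energy(x)
            if rng.random() < 1.0 / (1.0 + np.exp(np.clip(dE / T, -50, 50))):
                x = x_new
        T *= 0.9                           # cooling schedule
    return x

x0 = rng.choice([-1.0, 1.0], size=n)
x = anneal(x0, clamped={0, 1, 2})          # e.g. clamp three "visible" nodes
print(energy(x0), energy(x))               # annealing typically lowers the energy
```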

Mean field annealing improves on the speed of execution of the BM, using a "mean field" approximation in the weight change rule, e.g., approximating the weight update rule
$$\Delta w_{\ell,j} = \eta\, (p_{\ell,j} - p'_{\ell,j})$$
(where $p_{\ell,j} = E[x_\ell x_j]$ when the visible nodes are clamped, while $p'_{\ell,j} = E[x_\ell x_j]$ when no nodes are clamped) by
$$\Delta w_{\ell,j} = \eta\, (q_\ell\, q_j - q'_\ell\, q'_j),$$
where the average output of the $\ell$th node is $q_\ell$ when the visible nodes are clamped, and $q'_\ell$ without clamping.

For the Boltzmann distribution, the average output is
$$q_\ell = \tanh\!\Big( \sum_j w_{\ell,j}\, x_j / \mathcal{T} \Big).$$
The mean field approximation suggests replacing the random variable $x_j$ by its expected value $E[x_j]$, so that
$$q_\ell = \tanh\!\Big( \sum_j w_{\ell,j}\, E[x_j] / \mathcal{T} \Big).$$

These approximations improve the speed of execution of the Boltzmann machine, but convergence of the weight values to global optima is not assured.
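A tiny sketch of the mean field computation: the averages $q_\ell$ can be obtained by iterating the self-consistency equation above to a (damped) fixed point, with illustrative symmetric weights:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                          # symmetric weights
np.fill_diagonal(W, 0.0)

def mean_field(W, T=1.0, iters=200):
    q = np.full(len(W), 0.01)              # small initial average activations
    for _ in range(iters):
        q = 0.5 * q + 0.5 * np.tanh(W @ q / T)   # replace x_j by its mean q_j
    return q

print(np.round(mean_field(W), 3))
```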

6. Conclusion

The biologically inspired Hebbian learning principle shows how to make connection weights represent the similarities and differences inherent in the various attributes or input dimensions of the available data. No extensive, slow "training" phase is required; the number of state changes executed before the network stabilizes is roughly proportional to the number of nodes.

Associative learning reinforces the magnitudes of connections between correlated nodes. Such networks can be used to respond to a corrupted input pattern with the correct output pattern. In auto-association, the input pattern space and the output space are identical; these spaces are distinct in hetero-association tasks. Hetero-associative systems may be bidirectional, with a vector from either vector space being generated when a vector from the other vector space is presented. These tasks can be accomplished using one-shot, non-iterative procedures, as well as iterative mechanisms that repeatedly modify the weights as new samples are presented.

In differential Hebbian learning, changes in a weight $w_{i,j}$ are caused by the change in the stimulation of the $i$th node by the $j$th node, at the time the output of the $i$th node changes; also, the weight change is governed by the sum of such changes over a period of time.

Not too many pattern associations can be stably stored in these networks. If few patterns are to be stored, perfect retrieval is possible even when the input stimulus is significantly noisy or corrupted. However, such networks often store spurious memories.