Chapter 5: Unsupervised Learning
Introduction
• Unsupervised learning
  – Training samples contain only input patterns
  – No desired output is given (teacher-less)
  – Learn to form classes/clusters of sample patterns according to similarities among them
    • Patterns in a cluster would have similar features
    • No prior knowledge of which features are important for classification, or how many classes there are
Introduction
• NN models to be covered
  – Competitive networks and competitive learning
    • Winner-takes-all (WTA)
    • Maxnet
    • Hamming net
  – Counterpropagation nets
  – Adaptive Resonance Theory (ART models)
  – Self-organizing map (SOM)
  – Principal component analysis (PCA) network
• Applications
  – Clustering
  – Vector quantization
  – Feature extraction
  – Dimensionality reduction
  – Optimization
NN Based on Competition
• Competition is important for NN
  – Competition between neurons has been observed in biological nerve systems
  – Competition is important in solving many problems
• To classify an input pattern into one of the m classes
  – ideal case: one class node has output 1, all others output 0
  – often more than one class node has non-zero output
  [Figure: input nodes x_1, …, x_n connected to class nodes C_1, …, C_m (INPUT → CLASSIFICATION)]
  – If these class nodes compete with each other, maybe only one will win eventually and all others lose (winner-takes-all). The winner represents the computed classification of the input.
• Winner-takes-all (WTA):
  – Among all competing nodes, only one will win and all others will lose
  – We mainly deal with single-winner WTA, but multiple-winner WTA is possible (and useful in some applications)
  – Easiest way to realize WTA: have an external, central arbitrator (a program) decide the winner by comparing the current outputs of the competitors (breaking ties arbitrarily)
  – This is biologically unsound (no such external arbitrator exists in biological nerve systems)
• Ways to realize competition in NN
  – Lateral inhibition (Maxnet, Mexican hat)
    • output of each node feeds to the other competing nodes through inhibitory connections (with negative weights): w_ij < 0, w_ji < 0 for i ≠ j
    [Figure: nodes x_i and x_j connected by mutual inhibitory links]
  – Resource competition
    • output of node k is distributed to nodes i and j in proportion to w_ik and w_jk, as well as to x_i and x_j, e.g.,
      net_i = x_k · w_ik · x_i / (x_i + x_j)
    • self decay (w_ii < 0, w_jj < 0)
    • biologically sound
    [Figure: node x_k feeding nodes x_i and x_j through weights w_ik and w_jk]
Fixed-weight Competitive Nets
• Maxnet
  – Lateral inhibition between competitors
  – weights: w_ji = θ if j = i, −ε otherwise (ε > 0)
  – node activation function: f(x) = x if x ≥ 0, 0 otherwise
  – Notes:
    • Competition: iterative process until the net stabilizes (at most one node with positive activation)
    • 0 < ε ≤ 1/m, where m is the # of competitors
      – ε too small: takes too long to converge
      – ε too big: may suppress the entire network (no winner)
Fixed-weight Competitive Nets
• Example: θ = 1, ε = 1/5 = 0.2
  x(0) = (0.5   0.9    1      0.9    0.9  )   initial input
  x(1) = (0     0.24   0.36   0.24   0.24 )
  x(2) = (0     0.072  0.216  0.072  0.072)
  x(3) = (0     0      0.1728 0      0    )
  x(4) = (0     0      0.1728 0      0    ) = x(3)
  stabilized
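The iteration above can be reproduced with a few lines of code. Below is a minimal NumPy sketch of the Maxnet update, assuming the ramp activation and the θ, ε values of the example; the function and variable names are illustrative, not part of the original.

```python
import numpy as np

def maxnet(x, epsilon=0.2, theta=1.0, max_iters=100):
    """Iterate x_i <- f(theta*x_i - epsilon*sum_{j != i} x_j) until the
    activations stop changing; f is the ramp function max(0, .)."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iters):
        net = theta * x - epsilon * (x.sum() - x)   # self-excitation minus lateral inhibition
        new_x = np.maximum(net, 0.0)                # f(x) = x if x >= 0, else 0
        if np.allclose(new_x, x):
            break
        x = new_x
    return x

print(maxnet([0.5, 0.9, 1.0, 0.9, 0.9]))
# -> [0. 0. 0.1728 0. 0.], i.e., only the third node survives, as in x(3) above
```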
Mexican Hat
• Architecture: for a given node,
  – close neighbors: cooperative (mutually excitatory, w > 0)
  – farther away neighbors: competitive (mutually inhibitory, w < 0)
  – too far away neighbors: irrelevant (w = 0)
• Need a definition of distance (neighborhood):
  – one dimensional: ordering by index (1, 2, …, n)
  – two dimensional: lattice
• weights:
  w_ij = c1  if distance(i, j) ≤ R1          (c1 > 0)
         c2  if R1 < distance(i, j) ≤ R2     (c2 < 0)
         0   otherwise
  where R1 and R2 give the radii of the positive and negative regions
• activation function (ramp):
  f(x) = 0    if x < 0
         x    if 0 ≤ x ≤ max
         max  if x > max
• example: R1 = 1, R2 = 2, c1 = 0.6, c2 = −0.4, max = 2
  x(0) = (0.0, 0.50, 0.80, 1.0,  0.80, 0.50, 0.0)
  x(1) = (0.0, 0.38, 1.06, 1.16, 1.06, 0.38, 0.0)
  x(2) = (0.0, 0.39, 1.14, 1.66, 1.14, 0.39, 0.0)
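A minimal sketch of one Mexican-hat update for the 1-D case, using the example parameters above (R1 = 1, R2 = 2, c1 = 0.6, c2 = −0.4, max = 2); the function name and interface are illustrative.

```python
import numpy as np

def mexican_hat_step(x, R1=1, R2=2, c1=0.6, c2=-0.4, x_max=2.0):
    """One update: each node receives excitation c1 from neighbors within R1,
    inhibition c2 from neighbors between R1 and R2, then the ramp function."""
    n = len(x)
    new_x = np.zeros(n)
    for i in range(n):
        net = 0.0
        for j in range(n):
            d = abs(i - j)
            if d <= R1:
                net += c1 * x[j]
            elif d <= R2:
                net += c2 * x[j]
        new_x[i] = min(max(net, 0.0), x_max)   # ramp: clip to [0, max]
    return new_x

x = np.array([0.0, 0.5, 0.8, 1.0, 0.8, 0.5, 0.0])
print(mexican_hat_step(x))   # -> approx (0, 0.38, 1.06, 1.16, 1.06, 0.38, 0), i.e. x(1) above
```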
• Equilibrium:
  – negative input = positive input for all nodes
  – the winner has the highest activation
  – its cooperative neighbors also have positive activation
  – its competitive neighbors have negative (or zero) activations
Hamming Network
• Hamming distance of two vectors x and y of dimension n
  – Definition: number of bits in disagreement between x and y
  – In bipolar representation:
    x^T y = a − d
    where a is the number of bits in agreement in x and y,
          d is the number of bits differing in x and y (the Hamming distance),
          and a + d = n
    so x^T y = a − d = 2a − n = n − 2d
       d = 0.5 (n − x^T y)
       a = 0.5 (n + x^T y)
  – thus 0.5 (x^T y − n) = −d, the (negative) Hamming distance between x and y, can be determined from x^T y and n
Hamming Network
• Hamming network: computes −d between an input vector i and each of the P stored vectors i_1, …, i_P of dimension n
  – n input nodes, P output nodes, one for each stored vector i_p, whose output = −d(i, i_p)
  – Weights and biases: row p of W is 0.5 i_p^T, and each bias is −0.5 n
  – Output of the net:
    o = W i + bias, i.e., o_k = 0.5 (i_k^T i − n),
    where o_k = 0.5 (i_k^T i − n) is the negative Hamming distance between i_k and i
• Example:
  – Three stored bipolar vectors of dimension 5: i_1, i_2, i_3
  – Input vector: i
  – Distances: d(i, i_1) = 4, d(i, i_2) = 3, d(i, i_3) = 2
  – Output vector:
    o_1 = 0.5 (i_1^T i − 5) = 0.5 (−3 − 5) = −4
    o_2 = 0.5 (i_2^T i − 5) = 0.5 (−1 − 5) = −3
    o_3 = 0.5 (i_3^T i − 5) = 0.5 ( 1 − 5) = −2
  – If we want the stored vector with the smallest distance to i to win, put a Maxnet on top of the Hamming net (for WTA)
• We then have an associative memory: an input pattern recalls the stored vector that is closest to it (more on AM later)
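A minimal sketch of the Hamming-net computation, with argmax standing in for the Maxnet on top; the stored vectors below are made-up placeholders (the example's exact vectors are not reproduced above), and all names are illustrative.

```python
import numpy as np

# Hypothetical stored bipolar vectors (one per output node)
stored = np.array([[ 1,  1, -1, -1,  1],
                   [ 1, -1, -1, -1, -1],
                   [-1, -1, -1,  1,  1]])
n = stored.shape[1]

W = 0.5 * stored      # row k of W is 0.5 * i_k
bias = -0.5 * n       # so o_k = 0.5*(i_k . i - n) = -HammingDistance(i_k, i)

def hamming_net(i):
    o = W @ i + bias                 # negative Hamming distances
    return o, int(np.argmax(o))      # argmax plays the role of the Maxnet on top

o, winner = hamming_net(np.array([-1, -1, -1, -1, 1]))
print(o, winner)                     # [-2. -2. -1.] 2 : stored vector 2 is closest
```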
Simple Competitive Learning
• Unsupervised learning
• Goal:
  – Learn to form classes/clusters of exemplars/sample patterns according to the similarities of these exemplars
  – Patterns in a cluster would have similar features
  – No prior knowledge of which features are important for classification, or how many classes there are
• Architecture:
  – Output nodes: Y_1, …, Y_m, representing the m classes
  – They are competitors (WTA realized either by an external procedure or by lateral inhibition as in Maxnet)
• Training:
  – Train the network such that the weight vector w_j associated with the jth output node becomes the representative vector of a class of similar input patterns
  – Initially all weights are randomly assigned
  – Two-phase unsupervised learning
    • competing phase:
      – apply an input vector i_l randomly chosen from the sample set
      – compute the output for all output nodes: o_j = i_l · w_j
      – determine the winner j* among all output nodes (the winner is not given in the training samples, so this is unsupervised)
    • rewarding phase:
      – the winner is rewarded by updating its weights w_j* to be closer to i_l (weights associated with all other output nodes are not changed: a kind of WTA)
    • repeat the two phases many times (and gradually reduce the learning rate) until all weights are stabilized
• Weight update: w_j ← w_j + Δw_j
  – Method 1: Δw_j = η (i_l − w_j)          Method 2: Δw_j = η i_l
  [Figure: Method 1 moves w_j along the difference vector η(i_l − w_j); Method 2 adds η i_l to w_j:
    w_j ← w_j + η(i_l − w_j)                 w_j ← w_j + η i_l]
  In each method, w_j is moved closer to i_l
  – Normalize the weight vector to unit length after it is updated: w_j ← w_j / ||w_j||
  – Sample input vectors are also normalized: i_l ← i_l / ||i_l||
  – Distance: ||i_l − w_j|| = sqrt( Σ_i (i_{l,i} − w_{j,i})² )
• w_j moves to the center of a cluster of sample vectors after repeated weight updates, as illustrated below
  – Node j wins for three training samples: i_1, i_2 and i_3
  – Initial weight vector: w_j(0)
  – After being successively trained by i_1, i_2 and i_3, the weight vector changes to w_j(1), w_j(2), and w_j(3)
  [Figure: w_j(0) pulled toward i_1, i_2, i_3 in turn, producing w_j(1), w_j(2), w_j(3)]
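A minimal sketch of simple competitive learning with the Method 1 update; the winner is chosen here by minimum Euclidean distance (equivalent to the dot-product rule above when vectors are normalized), and the initialization and learning-rate schedule are illustrative assumptions.

```python
import numpy as np

def competitive_learning(samples, m, eta=0.5, epochs=10, seed=0):
    """Winner-takes-all competitive learning on an (N, n) array of samples:
    the winning node j* is moved toward the sample, w_j* <- w_j* + eta*(i_l - w_j*)."""
    rng = np.random.default_rng(seed)
    w = samples[rng.choice(len(samples), m, replace=False)].astype(float)  # init from samples
    for _ in range(epochs):
        for i_l in rng.permutation(samples):
            j = np.argmin(np.linalg.norm(i_l - w, axis=1))   # competing phase
            w[j] += eta * (i_l - w[j])                       # rewarding phase (Method 1)
        eta *= 0.8                                           # gradually reduce learning rate
    return w
```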
Examples
• A simple example of competitive learning (pp. 168-170)
  – 6 vectors of dimension 3 in 3 classes (3 input nodes, 3 output nodes), η = 0.5
  – Weight matrices:
    Node A: for class {i_2, i_4, i_5}
    Node B: for class {i_3}
    Node C: for class {i_1, i_6}
Comments
1. Ideally, when learning stops, each w_j is close to the centroid of a group/cluster of sample input vectors.
2. To stabilize w_j, the learning rate may be reduced slowly toward zero during learning, e.g., η(t+1) ≤ η(t).
3. # of output nodes:
   – too few: several clusters may be combined into one class
   – too many: over-classification
   – the ART model (later) allows dynamic adding/removing of output nodes
4. Initial w_j:
   – learning results depend on the initial weights (node positions)
   – use training samples known to be in distinct classes, provided such info is available
   – generate randomly (bad choices may cause anomalies)
5. Results also depend on the sequence of sample presentation
Example
[Figure: two clusters of samples; w_1 lies between them, w_2 lies far away]
  – w_1 will always win no matter which class the sample is from; w_2 is stuck and will not participate in learning
  – unstuck: let output nodes have some conscience: temporarily shut off nodes which have had a very high winning rate (hard to determine what rate should be considered "very high")
Example
[Figure: two different final positions of w_1 and w_2 resulting from different presentation orders]
  – Results depend on the sequence of sample presentation
  – Solution: initialize w_j to randomly selected input vectors that are far away from each other
Self-Organizing Maps (SOM) (§ 5.5)
• Competitive learning (Kohonen 1982) is a special case of SOM (Kohonen 1989)
• In competitive learning,
  – the network is trained to organize the input vector space into subspaces/classes/clusters
  – each output node corresponds to one class
  – the output nodes are not ordered: random map
  [Figure: weight vectors w_1, w_2, w_3 pointing to cluster_1, cluster_2, cluster_3]
  • The topological order of the three clusters is 1, 2, 3
  • The order of their maps at the output nodes is 2, 3, 1
  • The map does not preserve the topological order of the training vectors
• Topographic map
  – a mapping that preserves neighborhood relations between input vectors (topology preserving or feature preserving)
  – if i_1 and i_2 are two neighboring input vectors (by some distance metric), their corresponding winning output nodes (classes) i and j must also be close to each other in some fashion
  – one dimensional neighborhood: line or ring; node i has neighbors i ± 1, or (i ± 1) mod n
  – two dimensional: grid
    rectangular: node (i, j) has neighbors (i±1, j), (i, j±1) (or additionally (i±1, j±1))
    hexagonal: 6 neighbors
Self-Organizing Maps (SOM) (§ 5.5)
• Topology preserving maps:
  [Figure: two arrangements in which w_1, w_2, w_3 map to cluster_1, cluster_2, cluster_3 in an order that preserves the clusters' topological order]
• Biological motivation
  – Mapping two dimensional continuous inputs from sensory organs (eyes, ears, skin, etc.) to two dimensional discrete outputs in the nerve system
    • Retinotopic map: from the eye (retina) to the visual cortex
    • Tonotopic map: from the ear to the auditory cortex
  – These maps preserve topographic orders of the input
  – Biological evidence shows that the connections in these maps are not entirely "pre-programmed" or "pre-wired" at birth. Learning must occur after birth to create the necessary connections for appropriate topographic mapping
SOM Architecture
• Two layer network:
  – Output layer:
    • Each node represents a class (of inputs)
    • A neighborhood relation is defined over these nodes
      – N_j(t): the set of nodes within distance D(t) of node j
    • Each node cooperates with all its neighbors and competes with all other output nodes
    • Cooperation and competition of these nodes can be realized by the Mexican Hat model
      D = 0: all nodes are competitors (no cooperation): random map
      D > 0: topology preserving map
Notes
1. Initial weights: small random values from (−e, e)
2. Reduction of η:
   Linear: η(t+1) = η(t) − Δη
   Geometric: η(t+1) = α η(t), where 0 < α < 1
3. Reduction of D:
   D(t + Δt) = D(t) − 1, while D(t) > 0;
   should be much slower than the reduction of η;
   D can be a constant throughout the learning
4. Effect of learning (see Figure 5.20 on p. 196):
   for each input i, not only the weight vector of the winner j* is pulled closer to i, but also the weights of j*'s close neighbors (within the radius D)
5. Eventually, w_j becomes close (similar) to w_{j±1}; the classes they represent are also similar
6. May need a large initial D in order to establish the topological order of all nodes
Notes
7. Find j* for a given input i_l:
   – the node with minimum distance between w_j and i_l
   – Distance: dist(w_j, i_l) = ||i_l − w_j||² = Σ_{k=1}^{n} (i_{l,k} − w_{j,k})²
   – If w_j and i_l are normalized to unit vectors, minimizing dist(w_j, i_l) can be realized by maximizing
       o_j = i_l · w_j = Σ_k w_{j,k} i_{l,k}
     because
       dist(w_j, i_l) = Σ_{k=1}^{n} (i_{l,k} − w_{j,k})²
                      = Σ_k i_{l,k}² + Σ_k w_{j,k}² − 2 Σ_k w_{j,k} i_{l,k}
                      = 1 + 1 − 2 i_l · w_j
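Below is a minimal sketch of 1-D SOM training combining the notes above (winner by minimum distance, neighborhood update within radius D, gradual reduction of η and D); the schedule values and names are illustrative assumptions, not the textbook's.

```python
import numpy as np

def train_som_1d(samples, m, eta=0.5, D=1, epochs=20, seed=0):
    """1-D SOM: winner j* = argmin_j ||i_l - w_j||; every node whose index is
    within D of j* is pulled toward the sample; eta shrinks each epoch and D
    is reduced (more slowly) toward 0."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.1, 0.1, size=(m, samples.shape[1]))   # small random initial weights
    for epoch in range(epochs):
        for i_l in rng.permutation(samples):
            j_star = np.argmin(np.linalg.norm(i_l - w, axis=1))
            lo, hi = max(0, j_star - D), min(m, j_star + D + 1)
            w[lo:hi] += eta * (i_l - w[lo:hi])               # update winner and neighbors
        eta *= 0.9                                           # geometric reduction of eta
        D = max(0, D - 1) if epoch % 5 == 4 else D           # reduce D much more slowly
    return w
```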
Examples
• A simple example of competitive learning (pp. 191-194)
  – 6 vectors of dimension 3 in 3 classes, node ordering: B – A – C
  – Initialization: η = 0.5, weight matrix
    W(0):  w_A: (0.2, 0.7, 0.3)   w_B: (0.1, 0.1, 0.9)   w_C: (1, 1, 1)
  – D(t) = 1 for the first epoch, = 0 afterwards
  – Training with i_1 = (1.1, 1.7, 1.8):
    determine the winner by squared Euclidean distance between i_1 and w_j:
      d_{1,A}² = (1.1 − 0.2)² + (1.7 − 0.7)² + (1.8 − 0.3)² = 4.1
      d_{1,B}² = 4.4,   d_{1,C}² = 1.1
  – C wins; since D(t) = 1, the weights of node C and its neighbor A are updated, but not w_B:
    W(1):  w_A: (0.65, 1.2, 1.05)   w_B: (0.1, 0.1, 0.9)   w_C: (1.05, 1.35, 1.4)
Examples
  W(0):   w_A: (0.2, 0.7, 0.3)      w_B: (0.1, 0.1, 0.9)      w_C: (1, 1, 1)
  W(15):  w_A: (0.83, 0.77, 0.81)   w_B: (0.47, 0.23, 0.30)   w_C: (0.61, 0.95, 1.34)
  – Observations:
    • Distances between weight vectors change:
        ||w_A − w_B||:  0.85 (W(0))  →  0.83 (W(15))
        ||w_B − w_C||:  1.28         →  1.27
        ||w_C − w_A||:  1.10         →  0.60
      the distance between the weights of the non-neighboring nodes (B, C) increases relative to those of the neighboring nodes
    • Input vectors switch allegiance between nodes, especially in the early stage of training:
        t = 1–6:     A: (3, 5)   B: (2, 4)      C: (1, 6)
        t = 7–12:    A: (6)      B: (2, 4, 5)   C: (1, 3)
        t = 13–16:   A: (6)      B: (2, 4, 5)   C: (1, 3)
    • Inputs in cluster B are closer to cluster A than to cluster C (1.35 vs 1.78)
• How to illustrate a Kohonen map (for 2 dimensional patterns)
  – Input vector: 2 dimensional
    Output nodes: 1 dimensional line/ring or 2 dimensional grid
    Weight vector is also 2 dimensional
  – Represent the topology of the output nodes by points on a 2 dimensional plane: plot each output node on the plane with its weight vector as its coordinates
  – Connect neighboring output nodes by a line
    output nodes: (1, 1), (2, 1), (1, 2); D = 1
    weight vectors: (0.5, 0.5), (0.7, 0.2), (0.9, 0.9)   or   (0.7, 0.2), (0.5, 0.5), (0.9, 0.9)
  [Figure: nodes C(1,1), C(2,1), C(1,2) plotted at their weight-vector coordinates and connected by lines]
Illustration examples
• Input vectors are uniformly distributed in the region and randomly drawn from the region
• Weight vectors are initially drawn from the same region randomly (not necessarily uniformly)
• Weight vectors become ordered according to the given topology (neighborhood) at the end of training
Traveling Salesman Problem (TSP)
• Given a road map of n cities, find the shortest tour which visits every city on the map exactly once and then returns to the original city (Hamiltonian circuit)
• (Geometric version):
  – A complete graph of n vertices on a unit square
  – Each city is represented by its coordinates (x_i, y_i)
  – n!/(2n) legal tours
  – Find one legal tour that is shortest
Approximating TSP by SOM
• Each city is represented as a 2 dimensional input vector (its coordinates (x, y))
• Output nodes C_j form a SOM of a one dimensional ring: (C_1, C_2, …, C_n, C_1)
• Initially, C_1, …, C_n have random weight vectors, so we don't know how these nodes correspond to individual cities
• During learning, a winner C_j on an input (x_i, y_i) of city i not only moves its w_j toward (x_i, y_i), but also those of its neighbors (w_(j+1), w_(j-1))
• As a result, C_(j-1) and C_(j+1) will later be more likely to win with input vectors similar to (x_i, y_i), i.e., those cities closer to i
• At the end, if a node j represents city i, its neighbors j+1 or j-1 will end up representing cities similar to city i (i.e., cities close to city i)
• This can be viewed as a concurrent greedy algorithm (a code sketch follows the figure below)
[Figure: initial position of the ring nodes and two candidate solutions: ADFGHIJBC and ADFGHIJCB]
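A minimal sketch of this elastic-ring idea, assuming a fixed neighbor influence of 0.5, twice as many ring nodes as cities, and a simple decay schedule (all illustrative choices); the tour is read off by sorting the cities by the ring index of their nearest node.

```python
import numpy as np

def som_tsp(cities, n_nodes=None, eta=0.8, epochs=200, seed=0):
    """SOM ring heuristic for TSP: the winner and its two ring neighbors are
    pulled toward each presented city; returns an approximate tour order."""
    rng = np.random.default_rng(seed)
    n_nodes = n_nodes or 2 * len(cities)
    w = rng.random((n_nodes, 2))                      # random initial node positions
    for _ in range(epochs):
        for c in rng.permutation(cities):
            j = np.argmin(np.linalg.norm(c - w, axis=1))
            for k, g in ((j, 1.0), ((j - 1) % n_nodes, 0.5), ((j + 1) % n_nodes, 0.5)):
                w[k] += eta * g * (c - w[k])          # winner and ring neighbors move
        eta *= 0.99                                   # slowly reduce the learning rate
    # read off the tour: order cities by the ring index of their nearest node
    return np.argsort([np.argmin(np.linalg.norm(c - w, axis=1)) for c in cities])

cities = np.random.default_rng(1).random((10, 2))
print(som_tsp(cities))    # a permutation of 0..9 approximating a short tour
```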
Convergence of SOM Learning
• Objective of SOM: converge to an ordered map
  – Nodes are ordered if for all nodes r, s, q
• One-dimensional SOM
  – If the neighborhood relation satisfies certain properties, then there exists a sequence of input patterns that will lead the learning to converge to an ordered map
  – When another sequence is used, it may converge, but not necessarily to an ordered map
• SOM learning can be viewed as consisting of two phases
  – Volatile phase: search for niches to move into
  – Sober phase: nodes converge to the centroids of their classes of inputs
  – Whether a "right" order can be established depends on the volatile phase
Convergence of SOM Learning
• For multi-dimensional SOM
  – More complicated
  – No theoretical results
• Example
  – 4 nodes located at the 4 corners
  – Inputs are drawn from the region that is near the center of the square but slightly closer to w1
  – Node 1 will always win; w1, w0, and w2 will be pulled toward the inputs, but w3 will remain at the far corner
  – Nodes 0 and 2 are adjacent to node 3, but not to each other. However, this is not reflected in the distances of the weight vectors:
    |w0 – w2| < |w3 – w2|
Counterpropagation network (CPN) (§ 5.3)
• Basic idea of CPN
  – Purpose: fast and coarse approximation of a vector mapping y = φ(x)
    • not to map any given x to its φ(x) with given precision
    • input vectors x are divided into clusters/classes
    • each cluster of x has one output y, which is (hopefully) the average of φ(x) for all x in that class
  – Architecture: simple case: FORWARD ONLY CPN
    [Figure: input layer x_1 … x_n, hidden (class) layer z_1 … z_p, output layer y_1 … y_m;
     weights w_{k,i} from input to hidden (class), weights v_{j,k} from hidden (class) to output]
  – Learning in two phases:
    training sample (x, d), where d = φ(x) is the desired precise mapping
  – Phase 1: the weights w_k coming into hidden nodes z_k are trained by competitive learning to become the representative vector of a cluster of input vectors x (use only x, the input part of (x, d)):
    1. For a chosen x, feedforward to determine the winning z_k*
    2. w_{k*,i}(new) = w_{k*,i}(old) + η (x_i − w_{k*,i}(old))
    3. Reduce η, then repeat steps 1 and 2 until the stop condition is met
  – Phase 2: the weights v_k going out of hidden nodes z_k are trained by the delta rule to be an average output of φ(x), where x is an input vector that causes z_k to win (use both x and d):
    1. For a chosen x, feedforward to determine the winning z_k*
    2. w_{k*,i}(new) = w_{k*,i}(old) + η (x_i − w_{k*,i}(old))   (optional)
    3. v_{j,k*}(new) = v_{j,k*}(old) + η (d_j − v_{j,k*}(old))
    4. Repeat steps 1 – 3 until the stop condition is met
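A minimal sketch of the two-phase forward-only CPN training just described; the cluster count p, epoch counts, and learning-rate handling are illustrative assumptions.

```python
import numpy as np

def train_forward_cpn(X, D, p, eta=0.3, epochs1=20, epochs2=20, seed=0):
    """Phase 1: train hidden (cluster) weights W by competitive learning on x.
    Phase 2: train output weights V by the delta rule on (x, d)."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), p, replace=False)].astype(float)   # p cluster prototypes
    V = np.zeros((p, D.shape[1]))
    for _ in range(epochs1):                                    # phase 1
        for x in rng.permutation(X):
            k = np.argmin(np.linalg.norm(x - W, axis=1))
            W[k] += eta * (x - W[k])
    for _ in range(epochs2):                                    # phase 2
        for i in rng.permutation(len(X)):
            k = np.argmin(np.linalg.norm(X[i] - W, axis=1))
            V[k] += eta * (D[i] - V[k])                         # delta rule toward average phi(x)
    return W, V

def cpn_predict(x, W, V):
    k = np.argmin(np.linalg.norm(x - W, axis=1))                # winning cluster = table index
    return V[k]                                                 # stored (average) output
```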
Notes
• A combination of both unsupervised learning (for w_k in phase 1) and supervised learning (for v_k in phase 2)
• After phase 1, clusters are formed among the sample inputs x; each hidden node k, with weights w_k, represents a cluster (centroid)
• After phase 2, each cluster k maps to an output vector y, which is the average of φ(x) over {x : x in cluster k}
• View phase 2 learning as following the delta rule:
    Δv_{j,k*} = η (d_j − v_{j,k*}) ∝ −∂E/∂v_{j,k*}, where E = (d_j − v_{j,k*} z_{k*})²,
  because
    −∂E/∂v_{j,k*} = −∂(d_j − v_{j,k*} z_{k*})² / ∂v_{j,k*} = 2 (d_j − v_{j,k*} z_{k*}) z_{k*}
  (and z_{k*} = 1 for the winner)
• It can be shown that w_{k*}(t) → x̄ and v(t) → the mean of φ(x), where x̄ is the mean of all training samples that make k* win
  Show this only for w_{k*} (the proof for v is similar). The weight update rule can be rewritten as
    w_{k*,i}(t+1) = (1 − η) w_{k*,i}(t) + η x_i(t+1)
                  = (1 − η) [(1 − η) w_{k*,i}(t−1) + η x_i(t)] + η x_i(t+1)
                  = (1 − η)² w_{k*,i}(t−1) + (1 − η) η x_i(t) + η x_i(t+1)
                  = …
                  = η [x_i(t+1) + (1 − η) x_i(t) + (1 − η)² x_i(t−1) + … + (1 − η)^t x_i(1)]
  (the contribution of the initial weight, (1 − η)^{t+1} w_{k*,i}(0), vanishes as t grows)
  If the x are drawn randomly from the training set, then
    E[w_{k*,i}(t+1)] = η [E(x_i(t+1)) + (1 − η) E(x_i(t)) + … + (1 − η)^t E(x_i(1))]
                     = η [1 + (1 − η) + … + (1 − η)^t] E(x_i)
                     = [1 − (1 − η)^{t+1}] E(x_i)  →  E(x_i)  as t → ∞
• After training, the network works like a look-up of a math table:
  – For any input x, find the region where x falls (represented by the winning z node)
  – use the region as the index to look up the table for the function value
  – CPN works in multi-dimensional input space
  – More cluster nodes (z): more accurate mapping
  – Training is much faster than BP
  – May have linear separability problem
Full CPN
• If both y = φ(x) and its inverse function x = φ⁻¹(y) exist, we can establish a bi-directional approximation
• Two pairs of weight matrices:
  W (x to z) and V (z to y) for approximating the map x to y = φ(x)
  U (y to z) and T (z to x) for approximating the map y to x = φ⁻¹(y)
• When a training sample (x, y) is applied (x on X and y on Y), they can jointly determine the winner z_k*, or determine it separately as z_{k*(x)} and z_{k*(y)}
Adaptive Resonance Theory (ART) (§ 5.4)
• ART1: for binary patterns; ART2: for continuous patterns
• Motivations: previous methods have the following problems:
  1. The number of class nodes is pre-determined and fixed:
     – under- and over-classification may result from training
     – some nodes may have empty classes
     – no control of the degree of similarity of inputs grouped in one class
  2. Training is non-incremental:
     – with a fixed set of samples,
     – adding new samples often requires re-training the network with all training samples, old and new, until a new stable state is reached
• Ideas of the ART model:
  – suppose the input samples have been appropriately classified into k clusters (say by some fashion of competitive learning)
  – each weight vector w_j is a representative (average) of all samples in that cluster
  – when a new input vector x arrives
    1. Find the winner j* among all k cluster nodes
    2. Compare w_j* with x:
       if they are sufficiently similar (x resonates with class j*), then update w_j* based on x;
       else, find/create a free class node and make x its first member
• To achieve these, we need:
  – a mechanism for testing and determining the (dis)similarity between x and w_j*
  – a control for finding/creating new class nodes
  – all operations implemented by units of local computation
• Only the basic ideas are presented here
  – Simplified from the original ART model
  – Some of the control mechanisms realized by various specialized neurons are done by logic statements of the algorithm
ART1 Architecture
  x: input (input vectors)
  y: output (classes)
  b_ij: bottom-up weights from x_i to y_j (real values)
  t_ji: top-down weights from y_j to x_i (binary)
  ρ: vigilance parameter for the similarity comparison (0 < ρ ≤ 1)
Working of ART1
• 3 phases after each input vector x is applied
• Recognition phase: determine the winner cluster for x
  – using the bottom-up weights b
  – winner j* with max y_j* = b_j* · x
  – x is tentatively classified to cluster j*
  – the winner may be far away from x (e.g., |t_j* − x| is unacceptably large)
Working of ART1 (3 phases)
• Comparison phase:
  – Compute similarity using the top-down weights t:
    vector s* = (s_1*, …, s_n*), where s_l* = 1 if both t_{j*,l} and x_l are 1, and 0 otherwise
  – Resonance: if (# of 1's in s*) / (# of 1's in x) > ρ, accept the classification and update b_j* and t_j*
  – else: remove j* from further consideration, look for another potential winner or create a new node with x as its first pattern
• Weight update/adaptation phase
  – Initial weights (for a new output node; no bias):
    bottom-up: b_{l,j}(0) = 1/(1+n)        top-down: t_{j,l}(0) = 1
  – When a resonance occurs with node j*, update b_{j*} and t_{j*} (using s* = t_{j*}(old) ∧ x):
      b_{l,j*}(new) = s_l* / (0.5 + Σ_{i=1}^{n} s_i*) = t_{j*,l}(old) x_l / (0.5 + Σ_{i=1}^{n} t_{j*,i}(old) x_i)
      t_{j*,l}(new) = s_l* = t_{j*,l}(old) x_l
  – If k sample patterns are clustered to node j, then
    t_j = the pattern whose 1's are common to all these k samples:
      t_j(new) = t_j(0) ∧ x(1) ∧ x(2) ∧ … ∧ x(k) = x(1) ∧ x(2) ∧ … ∧ x(k)
      b_{l,j}(new) = 0 iff s_l* = 0, i.e., only if x_l(i) = 0 for some i = 1, …, k;
    b_j is a normalized t_j
• Example: n = 7, ρ = 0.7
  Input patterns:
    x(1) = (1, 1, 0, 0, 0, 0, 1)
    x(2) = (0, 0, 1, 1, 1, 1, 0)
    x(3) = (1, 0, 1, 1, 1, 1, 0)
    x(4) = (0, 0, 0, 1, 1, 1, 0)
    x(5) = (1, 1, 0, 1, 1, 1, 0)
  initially: t_{1,l}(0) = 1, b_{l,1}(0) = 1/8 for all l
  for input x(1): node 1 wins
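A minimal sketch of the ART1 procedure described above, run on the example patterns with ρ = 0.7; for brevity, uncommitted nodes are created on demand rather than initialized with b = 1/(1+n), so this is an illustrative simplification rather than the exact textbook algorithm.

```python
import numpy as np

def art1(patterns, rho=0.7, max_nodes=10):
    """Simplified ART1: recognition (winner by b.x), comparison (s = t AND x,
    vigilance test), and weight update; returns the class index per pattern."""
    b, t, labels = [], [], []                             # weights of committed nodes
    for x in map(np.asarray, patterns):
        assigned = None
        candidates = list(range(len(b)))
        while candidates:
            j = max(candidates, key=lambda k: b[k] @ x)   # recognition phase
            s = t[j] * x                                  # comparison: s = t_j AND x
            if s.sum() / x.sum() > rho:                   # resonance (vigilance) test
                t[j] = s
                b[j] = s / (0.5 + s.sum())                # weight update
                assigned = j
                break
            candidates.remove(j)                          # inhibit j*, try the next winner
        if assigned is None and len(b) < max_nodes:       # create a new class node
            t.append(x.copy())
            b.append(x / (0.5 + x.sum()))
            assigned = len(b) - 1
        labels.append(assigned)
    return labels

print(art1([(1,1,0,0,0,0,1), (0,0,1,1,1,1,0), (1,0,1,1,1,1,0),
            (0,0,0,1,1,1,0), (1,1,0,1,1,1,0)]))
```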
Notes
1. Classification is a search process
2. No two classes have the same b and t
3. Outliers that do not belong to any cluster will be assigned separate nodes
4. Different orderings of the sample input presentations may result in different classifications
5. Increasing ρ increases the # of classes learned, and decreases the average class size
6. Classification may shift during search, but will reach stability eventually
7. There are different versions of ART1 with minor variations
8. ART2 is the same in spirit but different in details
ART1 Architecture
[Figure: layer F2 (cluster units y_1 … y_m) connected to layer F1(b) (interface units x_1 … x_n) by bottom-up weights b_ij and top-down weights t_ji; F1(a) (input units s_1 … s_n) feeds F1(b); control units G1, G2, R]
  F1(a): input units; F1(b): interface units; F1(a) to F1(b): pair-wise connection
  F2: cluster units (representing classes); F2 and F1(b): full connection
  b_ij: bottom-up weights from x_i to y_j (real value)
  t_ji: top-down weights from y_j to x_i (binary/bipolar)
  R, G1, G2: control units
• cluster units (F2): competitive, receive the input vector x through the weights b to determine the winner j
• input units (F1(a)): placeholder for external inputs
• interface units (F1(b)):
  – pass s to x as the input vector for classification by F2
  – compare x and t_j (projection from the winner y_j)
  – controlled by gain control unit G1
• Nodes in both F1(b) and F2 obey the 2/3 rule (output 1 if two of the three inputs are 1)
  Input to F1(b): s_i, G1, t_{ji}; input to F2: Σ_i b_{ij} x_i, G2, R
• Needs to sequence the three phases (by control units G1, G2, and R)
  G1 = 1 if s ≠ 0 and y = 0; 0 otherwise
    G1 = 1: open F1(b) to receive s; G1 = 0: open F1(b) for t_J
  G2 = 1 if s ≠ 0; 0 otherwise
    G2 = 1 signals the start of a new classification for a new input
  R = 0 if |s*| / |x| ≥ ρ; 1 otherwise   (ρ: the vigilance parameter)
    R = 0: resonance occurs, update b_J and t_J
    R = 1: fails the similarity test, inhibits J from further computation
Principal Component Analysis (PCA) Networks (§ 5.8)
• PCA: a statistical procedure
  – Reduce the dimensionality of input vectors
    • Too many features, some of them dependent on others
    • Extract important (new) features of the data which are functions of the original features
    • Minimize information loss in the process
  – This is done by forming new interesting features
    • As linear combinations of the original features (first order of approximation)
    • New features are required to be linearly independent (avoid redundancy)
    • New feature vectors are desired to be as different from each other as possible (maximum variability)
Linear Algebra
• Two vectors x = (x_1, …, x_n) and y = (y_1, …, y_n) are said to be orthogonal to each other if
    x · y = Σ_{i=1}^{n} x_i y_i = 0
• A set of vectors x^(1), …, x^(k) of dimension n are said to be linearly independent of each other if there does not exist a set of real numbers a_1, …, a_k, not all zero, such that
    a_1 x^(1) + … + a_k x^(k) = 0;
  otherwise, these vectors are linearly dependent and some can be expressed as a linear combination of the others:
    x^(j) = −(1/a_j) Σ_{i ≠ j} a_i x^(i)
• Vector x is an eigenvector of matrix A if there exists a constant λ ≠ 0 such that Ax = λx
  – λ is called an eigenvalue of A (w.r.t. x)
  – A matrix A may have more than one eigenvector, each with its own eigenvalue
  – Eigenvectors of a matrix corresponding to distinct eigenvalues are linearly independent of each other
• Matrix B is called the inverse matrix of a square matrix A if AB = I
  – I is the identity matrix
  – Denote B as A⁻¹
  – Not every square matrix has an inverse (e.g., when one of the rows/columns can be expressed as a linear combination of the other rows/columns)
• Every matrix A has a unique pseudo-inverse A*, which satisfies the following properties:
    A A* A = A;   A* A A* = A*;   A* A = (A* A)^T;   A A* = (A A*)^T
• Example of PCA: 3-D x is transformed to 2-D y
  [Figure: 2-D feature vector y = W x, where W is the 2×3 transformation matrix with rows w_1 = (a, b, c) and w_2 = (p, q, r), and x is the 3-D feature vector]
  If the rows of W have unit length and are orthogonal (e.g., w_1 · w_2 = ap + bq + cr = 0), then W W^T is an identity matrix, and W^T is a pseudo-inverse of W
• Generalization
  – Transform n-D x to m-D y (m < n); the transformation matrix W is an m × n matrix
  – Transformation: y = W x
  – Opposite transformation: x' = W^T y = W^T W x
  – If W minimizes the "information loss" in the transformation, then ||x − x'|| = ||x − W^T W x|| should also be minimized
  – If W^T is the pseudo-inverse of W, then x' = x: perfect transformation (no information loss)
• How to approximate W for a given set of input vectors
  – Let T = {x_1, …, x_k} be a set of input vectors
  – Make them zero-mean vectors by subtracting the mean vector (Σ x_i) / k from each x_i
  – Compute the covariance matrix S(T) of these zero-mean vectors; it is an n × n matrix
  – Find the m eigenvectors of S(T): w_1, …, w_m, corresponding to the m largest eigenvalues λ_1, …, λ_m
  – w_1, …, w_m are the first m principal components of T
  – W = (w_1, …, w_m), the matrix whose rows are w_1, …, w_m, is the transformation matrix we are looking for
  – the m new features extracted by the transformation with W will be linearly independent and have maximum variability
  – This is based on the following mathematical result:
• Example
  – Original 3-dimensional vectors transformed into 1-dimensional: y_l = W_1 x_l, where W_1 is the first principal component, e.g., W_1 = (0.823, 0.541, 0.169)
  – Original 3-dimensional vectors transformed into 2-dimensional: y_l = W_2 x_l, where the rows of W_2 are the first two principal components
  – Now (λ_1 + λ_2) / (λ_1 + λ_2 + λ_3) = (0.965 + 0.09) / 1.139 = 0.926, so the 2-dimensional features retain most of the total variability
• PCA network architecture
  [Figure: input vector x of n-dim → output vector y of m-dim through the transformation matrix W; y = W x, x' = W^T y]
  – Train W so that it transforms a sample input vector x_l from an n-dim input to an m-dim output vector y_l
  – The transformation should minimize information loss:
    find W which minimizes Σ_l ||x_l − x_l'|| = Σ_l ||x_l − W^T W x_l|| = Σ_l ||x_l − W^T y_l||,
    where x_l' is the "opposite" transformation of y_l = W x_l via W^T
• Training W for the PCA net
  – Unsupervised learning: depends only on the input samples x_l
  – Error driven: ΔW depends on ||x_l − x_l'|| = ||x_l − W^T W x_l||
  – Start with randomly selected weights, change W according to
      ΔW = η (y_l x_l^T − K_l W), where K_l = y_l y_l^T
    This is only one of a number of suggestions for K_l (Williams)
  – The weight update rule becomes
      ΔW = η (y_l x_l^T − y_l y_l^T W) = η y_l (x_l^T − y_l^T W) = η y_l (x_l − W^T y_l)^T,
    i.e., a column vector (y_l, the transformed vector) times a row vector (the error vector x_l − x_l')
  – Each row in W approximates a principal component of T; a training sketch follows below
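A minimal sketch of this error-driven learning rule, ΔW = η y_l (x_l − W^T y_l)^T; the initialization, epoch count, and η decay (the "forced stabilization") are illustrative assumptions.

```python
import numpy as np

def train_pca_net(X, m, eta=0.01, epochs=100, seed=0):
    """PCA network training with Delta W = eta * y (x - W^T y)^T, y = W x.
    The rows of W should approximate the first m principal components."""
    rng = np.random.default_rng(seed)
    X0 = X - X.mean(axis=0)                   # zero-mean samples
    W = rng.normal(scale=0.1, size=(m, X0.shape[1]))
    for _ in range(epochs):
        for x in rng.permutation(X0):
            y = W @ x                         # forward transformation y = W x
            err = x - W.T @ y                 # reconstruction error x - x'
            W += eta * np.outer(y, err)       # Delta W = eta * y * err^T
        eta *= 0.99                           # forced stabilization: gradually reduce eta
    return W
```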
• Example (same sample inputs as in the previous example)
  [Trace of W after x_3, after x_4, after x_5, after the second epoch, and after the third epoch, eventually converging to the 1st PC (−0.823, −0.542, −0.169)]
• Notes
  – The PCA net approximates the principal components (error may exist)
  – It obtains the PCs by learning, without using statistical methods
  – Forced stabilization by gradually reducing η
  – Some suggestions to improve the learning results:
    • instead of using the identity function for the output y = W x, use a non-linear function S, and then try to minimize the resulting reconstruction error
    • if S is differentiable, use a gradient descent approach
    • for example: S can be a monotonically increasing odd function, S(−x) = −S(x) (e.g., S(x) = x³)