
METU EE 583 Lecture Notes by A.Aydin ALATAN © 2014

EE 583 PATTERN RECOGNITION

Bayes Decision Theory

Supervised Learning

Linear Discriminant Functions

Unsupervised Learning

Parametric methods

Nonparametric methods

Iterative approaches

Hierarchical approaches

Graph-theoretical approaches


Introduction

• Classification procedures that use unlabeled samples are called unsupervised.
• Reasons for pursuing such a seemingly unpromising approach:
  • the computational cost of labeling a huge sample (training) set
  • for systems whose class-specific pattern generation changes in time, it is necessary to adapt continuously
  • training with huge unlabeled data first and labeling the clusters afterwards is often preferable
  • some useful features can be discovered during unsupervised classification
• Two main approaches:
  • Parametric strategy: assume the forms of the densities are known; combined classification and parameter estimation
  • Nonparametric strategy: partitioning


Parametric Unsupervised Learning (1/2)

• There are initial assumptions:
  • Samples come from c classes
  • A priori class probabilities p(w_k) are known
  • The form of the conditional density p(x|w_k, Θ_k) is known
  • The deterministic parameter vector Θ_k is unknown
  • Class labels are unknown
• Samples are assumed to be obtained by
  • selecting a state of nature w_k with probability p(w_k), then
  • selecting an x according to p(x|w_k, Θ_k)


Parametric Unsupervised Learning (2/2)

• The mixture density can be written as:

$$ \underbrace{p(x\,|\,\Theta)}_{\text{mixture density}} \;=\; \sum_{j=1}^{c} \underbrace{p(x\,|\,w_j,\Theta_j)}_{\text{component density}}\; p(w_j) $$

• Goal: first estimate the unknown parameter vector, then decompose the mixture into its components to make a classification.

• Note that different values of the unknown parameter may lead to the same mixture density → "identifiable" densities are assumed (non-identifiability is usually a problem for discrete densities).


Parametric Unsupervised Learning
Maximum Likelihood (ML) Estimate (1/2)

• Let X = {x_1, ..., x_n} be n unlabelled samples drawn independently from

$$ p(x\,|\,\Theta) = \sum_{j=1}^{c} p(x\,|\,w_j,\Theta_j)\,p(w_j) $$

• ML estimate:

$$ \hat{\Theta}_{ML} = \arg\max_{\Theta}\, p(X\,|\,\Theta) = \arg\max_{\Theta}\, \prod_{k=1}^{n} p(x_k\,|\,\Theta) $$

• Let

$$ l \equiv \log p(X\,|\,\Theta) = \sum_{k=1}^{n} \log p(x_k\,|\,\Theta) $$

• In order to maximize l:

$$ \nabla_{\Theta}\, l = \sum_{k=1}^{n} \nabla_{\Theta} \log p(x_k\,|\,\Theta) = \sum_{k=1}^{n} \frac{1}{p(x_k\,|\,\Theta)}\, \nabla_{\Theta} \sum_{j=1}^{c} p(x_k\,|\,w_j,\Theta_j)\,p(w_j) = 0 $$

  (Note that ∂ log u(x)/∂x = [1/u(x)] ∂u(x)/∂x.)


Parametric Unsupervised Learning
Maximum Likelihood (ML) Estimate (2/2)

• Since the parameters for different classes are functionally independent,

$$ \nabla_{\Theta_i}\, l = \sum_{k=1}^{n} \frac{1}{p(x_k\,|\,\Theta)}\, \nabla_{\Theta_i}\big[\, p(x_k\,|\,w_i,\Theta_i)\,p(w_i)\,\big] = \sum_{k=1}^{n} \underbrace{\frac{p(x_k\,|\,w_i,\Theta_i)\,p(w_i)}{\sum_{j=1}^{c} p(x_k\,|\,w_j,\Theta_j)\,p(w_j)}}_{P(w_i\,|\,x_k,\Theta)}\; \frac{\nabla_{\Theta_i}\, p(x_k\,|\,w_i,\Theta_i)}{p(x_k\,|\,w_i,\Theta_i)} $$

• Considering the relation between the derivative and the log, we obtain the equation below, to be solved for all i:

$$ \sum_{k=1}^{n} P(w_i\,|\,x_k,\Theta)\, \nabla_{\Theta_i} \log p(x_k\,|\,w_i,\Theta_i) = 0, \qquad i = 1,\dots,c $$

$$ \text{where}\quad P(w_i\,|\,x_k,\Theta) = \frac{p(x_k\,|\,w_i,\Theta_i)\,p(w_i)}{\sum_{j=1}^{c} p(x_k\,|\,w_j,\Theta_j)\,p(w_j)} $$

• If the p(w_j)'s are also unknown, the ML estimate can be obtained by using a similar formulation.

• Since the equation is nonlinear, the solution is obtained iteratively.


Parametric Unsupervised Learning
ML Estimate for Multivariate Normal (1/2)

• Assuming only the mean is unknown, the ML estimate follows from

$$ \log p(x\,|\,w_i,\mu_i) = -\log\big[(2\pi)^{d/2}\,|\Sigma_i|^{1/2}\big] - \tfrac{1}{2}(x-\mu_i)^t\,\Sigma_i^{-1}(x-\mu_i) $$

$$ \Rightarrow\;\; \nabla_{\mu_i} \log p(x\,|\,w_i,\mu_i) = \Sigma_i^{-1}(x-\mu_i) $$

$$ \Rightarrow\;\; \sum_{k=1}^{n} P(w_i\,|\,x_k,\hat{\mu})\, \Sigma_i^{-1}(x_k-\hat{\mu}_i) = 0, \qquad \text{where } \hat{\mu} \text{ is the ML estimate} $$

$$ \Rightarrow\;\; \hat{\mu}_i = \underbrace{\frac{\sum_{k=1}^{n} P(w_i\,|\,x_k,\hat{\mu})\, x_k}{\sum_{k=1}^{n} P(w_i\,|\,x_k,\hat{\mu})}}_{\text{weighted average of the } x_k}, \qquad \text{where}\quad P(w_i\,|\,x_k,\hat{\mu}) = \frac{p(x_k\,|\,w_i,\hat{\mu}_i)\,p(w_i)}{\sum_{j=1}^{c} p(x_k\,|\,w_j,\hat{\mu}_j)\,p(w_j)} $$

• There is no explicit solution; the mean can be obtained iteratively, with no guarantee of reaching the global optimum:

$$ \hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(w_i\,|\,x_k,\hat{\mu}(j))\, x_k}{\sum_{k=1}^{n} P(w_i\,|\,x_k,\hat{\mu}(j))} $$
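The iterative update above can be sketched in a few lines of NumPy. This is only an illustrative implementation, not part of the original notes: it assumes identity covariances and known priors, so each component likelihood reduces to exp(−½‖x−μ_i‖²) up to a constant that cancels in the posterior, and the function and argument names are hypothetical.

```python
import numpy as np

def iterative_ml_means(X, priors, means_init, n_iter=50):
    """Iterative re-estimation of the component means (hypothetical helper).
    Assumes identity covariances and known priors p(w_i), so the update is
    exactly the weighted average given above."""
    X = np.atleast_2d(X).astype(float)             # (n, d) samples
    means = np.array(means_init, dtype=float)      # (c, d) initial means
    priors = np.asarray(priors, dtype=float)       # (c,)  known p(w_i)
    for _ in range(n_iter):
        # p(x_k | w_i, mu_i) for unit-covariance Gaussians, up to a constant
        # factor that cancels when the posteriors are normalized below
        sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # (n, c)
        joint = np.exp(-0.5 * sq_dist) * priors                            # (n, c)
        post = joint / joint.sum(axis=1, keepdims=True)   # P(w_i | x_k, mu_hat)
        means = (post.T @ X) / post.sum(axis=0)[:, None]  # weighted averages
    return means
```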


Parametric Unsupervised Learning
ML Estimate for Multivariate Normal (2/2)

• Consider the following example with μ_a = (μ_1, μ_2) = (−2, 2):

$$ p(x\,|\,\mu_1,\mu_2) = \frac{1}{3\sqrt{2\pi}}\,\exp\!\big[-\tfrac{1}{2}(x-\mu_1)^2\big] + \frac{2}{3\sqrt{2\pi}}\,\exp\!\big[-\tfrac{1}{2}(x-\mu_2)^2\big] $$

• Note that, starting from two different starting points, the iterative algorithm reaches two different solutions, μ_a and μ_b → unidentifiable.

(Figure: sample data drawn from the density.)


Parametric Unsupervised Learning
K-means Algorithm (1/2)

• K-means (also called ISODATA or c-means) algorithm:
  1) Choose initial mean values for the k (or c) classes
  2) Classify the n samples by assigning each to the "closest" mean
  3) Recompute the means as the averages of the samples in their (new) classes
  4) Continue until there is no change in the mean values

• If μ_m is the mean nearest to x_k, approximate the posterior as

$$ P(w_i\,|\,x_k,\Theta) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases} $$

• Then the iterative ML estimation becomes K-means:

$$ \hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(w_i\,|\,x_k,\hat{\mu}(j))\, x_k}{\sum_{k=1}^{n} P(w_i\,|\,x_k,\hat{\mu}(j))} \;\;\Rightarrow\;\; \hat{\mu}_i(j+1) = \frac{1}{n_i}\sum_{x_k\in X_i(j)} x_k $$
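A minimal NumPy sketch of the four steps above, for illustration only; the function name, the random initialization from the data points, and the empty-cluster handling are choices not specified in the notes.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain K-means following steps 1-4 above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.atleast_2d(X).astype(float)
    means = X[rng.choice(len(X), size=k, replace=False)]    # 1) initial means
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # 2) assign every sample to the closest mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3) recompute the means (keep the old mean if a cluster became empty)
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        # 4) stop when the means no longer change
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```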


Parametric Unsupervised Learning
K-means Algorithm (2/2)

(Figure: sample data drawn from the density, with the trajectories of the mean values during the iterations for different initial values.)

• For 2-D unsupervised data, the Voronoi regions change during the iterations.


Nonparametric Unsupervised Learning

• For non-trivial distributions, parametric approaches usually fail.
• An obvious alternative is to use a priori information to design an initial classifier, then classify the unlabelled data using this classifier (decision-directed strategy).
• Some problems with such a strategy:
  • The initial classification is critical.
  • An unfortunate sequence of samples will create more errors compared to estimating the likelihood.
  • Even if the initial classification is optimal, samples from the tails of other, overlapping distributions will create biased estimates (i.e. less probable samples from one class will be included in the other class).


Nonparametric Unsupervised Learning

• Clustering: a procedure that yields a data description in terms of groups of data points possessing strong internal "similarities" (measured by a criterion function).
• Samples in one cluster should be more similar to the other samples in the same cluster than to the samples in other clusters, but
  • how should similarity between samples be measured?
  • how should a partitioning be evaluated?


Nonparametric Unsupervised Learning
Similarity Measures (1/2)

• A good similarity measure is some "distance" (e.g. Euclidean) between two samples.
• Distances between samples in the same cluster are expected to be shorter than distances between samples in different clusters.
• Using a threshold, pairs of samples can be classified as "similar" or "dissimilar".
• In order to be scale independent, either
  • use normalization (can be problematic), or
  • use non-metric similarity functions, such as the normalized inner product:

$$ s(x_1, x_2) = \frac{x_1^t\, x_2}{\|x_1\|\,\|x_2\|} $$
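As a small illustration (not from the notes), the normalized inner product can be computed directly; the function name is arbitrary.

```python
import numpy as np

def normalized_inner_product(x1, x2):
    """s(x1, x2) = x1^t x2 / (||x1|| ||x2||)  (hypothetical helper)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return float(x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
```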


Nonparametric Unsupervised Learning
Similarity Measures (2/2)

• Threshold selection is critical.
• Normalization can be problematic.


Nonparametric Unsupervised Learning
Criterion Function for Clustering

• A criterion function measures the clustering quality of any partition of the data.

1) Sum-of-squared-error criterion (minimum variance):

$$ J_e = \sum_{i=1}^{c} \sum_{x\in X_i} \|x - m_i\|^2, \qquad \text{where } m_i = \frac{1}{n_i}\sum_{x\in X_i} x $$

2) Related minimum-variance criteria:

$$ J_e = \frac{1}{2}\sum_{i=1}^{c} n_i\,\bar{s}_i, \qquad \text{where } \bar{s}_i = \frac{1}{n_i^2}\sum_{x\in X_i}\sum_{x'\in X_i} \|x - x'\|^2 $$

   More generally, with a similarity function s(x, x'):

$$ \bar{s}_i = \frac{1}{n_i^2}\sum_{x\in X_i}\sum_{x'\in X_i} s(x, x') \;\;\text{(average)}, \qquad \text{or}\;\; \bar{s}_i = \min_{x,\,x'\in X_i} s(x, x'), \qquad \text{or the median of } s(x, x') $$
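An illustrative helper (not part of the notes) that evaluates the sum-of-squared-error criterion J_e for a given labelling; the name and the label encoding (integers 0..k−1) are assumptions.

```python
import numpy as np

def sum_of_squared_error(X, labels, k):
    """J_e = sum_i sum_{x in X_i} ||x - m_i||^2, m_i the mean of cluster i."""
    X = np.atleast_2d(X).astype(float)
    J_e = 0.0
    for i in range(k):
        cluster = X[labels == i]
        if len(cluster):                          # empty clusters contribute nothing
            J_e += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return J_e
```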


Nonparametric Unsupervised Learning
Iterative Optimization (1/4)

• Clustering becomes a well-defined problem as soon as a criterion function is selected.
• One option for a solution is exhaustive search, e.g. 5 clusters and 100 samples → about 10^67 partitionings!
• Iterative optimization is another option: begin from a reasonable initial partition, then "move" samples from one group to another whenever the move improves the criterion.
• Iterative optimization is suboptimal, since the result depends on the initial partitioning.


Nonparametric Unsupervised Learning
Iterative Optimization (2/4)

• Assume the sum-of-squared-error criterion:

$$ J_e = \sum_{i=1}^{c} J_i, \qquad J_i = \sum_{x\in X_i} \|x - m_i\|^2, \qquad m_i = \frac{1}{n_i}\sum_{x\in X_i} x $$

• Let the sample x̂ move from cluster X_i to X_j:

$$ m_j^{*} = m_j + \frac{\hat{x} - m_j}{n_j + 1}, \qquad J_j^{*} = J_j + \frac{n_j}{n_j + 1}\,\|\hat{x} - m_j\|^2 $$

$$ m_i^{*} = m_i - \frac{\hat{x} - m_i}{n_i - 1}, \qquad J_i^{*} = J_i - \frac{n_i}{n_i - 1}\,\|\hat{x} - m_i\|^2 $$

• Since the total criterion J_e is to be minimized, the transfer from X_i to X_j is accepted if the decrease in J_i is greater than the increase in J_j:

$$ \frac{n_j}{n_j + 1}\,\|\hat{x} - m_j\|^2 \;<\; \frac{n_i}{n_i - 1}\,\|\hat{x} - m_i\|^2 $$

  (This usually happens if x̂ is closer to m_j than to m_i.)


Nonparametric Unsupervised Learning
Iterative Optimization (3/4)

• Minimum-squared-error (MSE) algorithm:
  1) Select an initial partition of the n samples; compute J_e and the means
  2) Select the next candidate sample x̂ (let x̂ belong to cluster X_i)
  3) Compute

$$ a_j = \begin{cases} \dfrac{n_j}{n_j + 1}\,\|\hat{x} - m_j\|^2 & j \ne i \\[2ex] \dfrac{n_i}{n_i - 1}\,\|\hat{x} - m_i\|^2 & j = i \end{cases} $$

  4) Transfer x̂ to X_k if a_k ≤ a_j for all j
  5) Update J_e, m_i and m_k
  6) If J_e does not change after n attempts, STOP

• Note that MSE is the sequential version of K-means.
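A sketch of one pass of this sequential procedure in NumPy, for illustration only: it updates the means and counts in place, skips singleton clusters (the derivation needs n_i > 1), and the function and argument names are hypothetical.

```python
import numpy as np

def sequential_mse_pass(X, labels, means, counts):
    """One pass of the sequential MSE update (hypothetical helper).
    `means` (c, d) and `counts` (c,) are NumPy arrays modified in place;
    returns True if any sample changed cluster during the pass."""
    X = np.atleast_2d(X).astype(float)
    changed = False
    for k, x in enumerate(X):
        i = labels[k]
        if counts[i] <= 1:                         # n_i - 1 appears in a denominator:
            continue                               # singleton clusters are skipped
        d2 = ((means - x) ** 2).sum(axis=1)        # ||x - m_j||^2 for every cluster
        a = counts / (counts + 1.0) * d2           # a_j for j != i (cost of joining j)
        a[i] = counts[i] / (counts[i] - 1.0) * d2[i]   # a_i (gain of leaving i)
        j = int(np.argmin(a))
        if j != i:                                 # transfer x and update both means
            means[j] = (means[j] * counts[j] + x) / (counts[j] + 1)
            means[i] = (means[i] * counts[i] - x) / (counts[i] - 1)
            counts[j] += 1
            counts[i] -= 1
            labels[k] = j
            changed = True
    return changed
```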


Nonparametric Unsupervised Learning
Iterative Optimization (4/4)

• Experimentally, it has been shown that
  • MSE is more susceptible to being trapped in a local minimum than ISODATA,
  • MSE depends on the order in which the samples are chosen,
  • MSE is step-wise optimal (an advantage if the problem is inherently sequential).
• A common problem of iterative optimization methods is the selection of the starting points.
• One solution is to use the (c−1)-cluster problem as an initial estimate for the c-cluster problem: beginning from the 1-cluster problem, use the sample farthest from all the other samples to obtain the next cluster mean → this is the basis for hierarchical clustering.


Nonparametric Unsupervised Learning
Hierarchical Clustering (1/4)

• Two main strategies for hierarchical clustering:
  • Agglomerative (bottom-up): start with n singleton clusters and reach c clusters by repeatedly merging "similar" clusters, decreasing the cluster count.
  • Divisive (top-down): start with 1 cluster containing all n samples and reach c clusters by successively splitting clusters.
• For n samples (a large number) to be classified into c classes (also a large number), agglomerative approaches are preferable from a computational point of view (and vice versa): n ≥ c ≥ 1.


Nonparametric Unsupervised Learning
Hierarchical Clustering (2/4)

• Agglomerative hierarchical clustering:
  1) Let c' = n and X_i = {x_i}, i = 1, ..., n
  2) If c' ≤ c, STOP (c = desired number of clusters)
  3) Find the "nearest" pair of distinct clusters, X_i and X_j
  4) Merge X_i and X_j; delete X_j; decrement c'; GO TO 2

• The similarity scale can be used to determine whether groupings are natural or forced.
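An illustrative, deliberately naive sketch of the procedure above using the nearest-neighbour (single-link) distance defined on the next slide; it is O(n⁴) and meant only to mirror the four steps, with hypothetical names.

```python
import numpy as np

def agglomerative_single_link(X, c):
    """Steps 1-4 above with the nearest-neighbour distance d_min
    (hypothetical helper). Returns a list of index lists."""
    X = np.atleast_2d(X).astype(float)
    clusters = [[i] for i in range(len(X))]        # 1) c' = n singleton clusters
    while len(clusters) > c:                       # 2) stop once c' <= c
        best, pair = np.inf, (0, 1)
        for a in range(len(clusters)):             # 3) nearest pair of clusters
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[p] - X[q])
                        for p in clusters[a] for q in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair                                # 4) merge X_j into X_i, delete X_j
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```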


Nonparametric Unsupervised Learning
Hierarchical Clustering (3/4)

• In this hierarchy, at any level, the distance between clusters gives a similarity value for finding the nearest clusters.
• The following distance measures can be utilized:

$$ d_{\min}(X_i, X_j) = \min_{x\in X_i,\; x'\in X_j} \|x - x'\| \qquad \text{(nearest-neighbour measure)} $$

$$ d_{\max}(X_i, X_j) = \max_{x\in X_i,\; x'\in X_j} \|x - x'\| \qquad \text{(farthest-neighbour measure)} $$

$$ d_{\mathrm{avg}}(X_i, X_j) = \frac{1}{n_i\, n_j}\sum_{x\in X_i}\sum_{x'\in X_j} \|x - x'\| \qquad \text{(compromise between the two)} $$

$$ d_{\mathrm{mean}}(X_i, X_j) = \|m_i - m_j\| \qquad \text{(compromise between the two)} $$


Nonparametric Unsupervised Learning
Hierarchical Clustering (4/4)

• Stepwise-optimal hierarchical clustering:
  • Change one step of the agglomerative clustering algorithm so that at every step a criterion function is extremized (Ward's algorithm): find the pair of distinct clusters X_i and X_j whose merger would increase (or decrease) the criterion function the least.
• If we choose the distance function below, the increase in the sum-of-squared-error criterion is minimized at every step:

$$ J_e = \sum_{r=1}^{c}\sum_{x\in C_r} \|x - m_r\|^2 \;\;\rightarrow\;\; \nabla J_e = \sum_{x\in C_i\cup C_j} \|x - m_{ij}\|^2 - \sum_{x\in C_i} \|x - m_i\|^2 - \sum_{x\in C_j} \|x - m_j\|^2 $$

  (the increase in J_e when C_i and C_j are merged, m_ij being the mean of the merged cluster)

$$ d_e(X_i, X_j) = \sqrt{\frac{n_i\, n_j}{n_i + n_j}}\;\|m_i - m_j\| $$

• In practice, d_e usually favors merging small regions into large regions rather than merging two mid-size regions.


Nonparametric Unsupervised Learning
Graph Theoretic Approaches (1/3)

• Graph theory can be applied to nonparametric unsupervised learning.
• The samples to be clustered can be taken as the nodes of a graph.
• If two nodes are found to be similar (belonging to the same cluster) → an edge is placed between these nodes.
• All nodes that are connected to each other via a chain of edges belong to the same cluster.
• At any state, how should two clusters in such a graph be merged?


Nonparametric Unsupervised Learning
Graph Theoretic Approaches (2/3)

• Assume merging two clusters corresponds to adding an edge between nodes of these two clusters.
• Since the edges linking clusters never form loops, such edge groups are called trees in graph theory.
• If edge linking continues until all clusters are merged, a spanning tree is obtained that reaches all nodes.
• If the d_min(X_i, X_j) measure is used during the mergers,
  • the resulting graph is a minimal spanning tree,
  • i.e. the nearest-neighbour clustering algorithm can be viewed as an algorithm for obtaining a minimal spanning tree.


Nonparametric Unsupervised Learning
Graph Theoretic Approaches (3/3)

• Conversely, given a minimal spanning tree, the clusters obtained by the nearest-neighbour algorithm can be found by first dividing the tree into two by removing the longest edge, then continuing in the same manner on the resulting subtrees.
• Instead of removing the longest edge, one option is to remove the most "inconsistent" edge, judged relative to the other edges incident to the same node.
• Another option is to determine the edge-length distribution and segment the different edge-length groups accordingly.
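A sketch of this idea, not taken from the notes: build a minimal spanning tree with Prim's algorithm on the pairwise Euclidean distances, delete the c−1 longest tree edges, and read the clusters off the remaining connected components; the function and variable names are illustrative.

```python
import numpy as np

def mst_clusters(X, c):
    """Minimal-spanning-tree clustering (hypothetical helper): build the MST
    with Prim's algorithm, drop the c-1 longest tree edges, and label the
    remaining connected components."""
    X = np.atleast_2d(X).astype(float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_tree, edges = {0}, []
    while len(in_tree) < n:                        # Prim: add the cheapest crossing edge
        w, i, j = min((dist[i, j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((w, i, j))
        in_tree.add(j)
    edges.sort()                                   # keep the n-c shortest MST edges,
    kept = edges[: n - c]                          # i.e. remove the c-1 longest ones
    labels = np.arange(n)                          # merge components over kept edges
    for _, i, j in kept:
        old, new = labels[j], labels[i]
        labels[labels == old] = new
    _, labels = np.unique(labels, return_inverse=True)   # relabel as 0..c-1
    return labels
```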


Graph Theoretic Clustering

• A top-down approach.
• Tokens (nodes) are represented by using a weighted graph,
  • with an affinity matrix A (similar nodes have higher entries).
• Cut up this graph to get subgraphs with strong internal links.

(Figure: example affinity matrix.)

Slides modified from D.A. Forsyth, Computer Vision - A Modern Approach


Affinity Matrix with Scale (σ_d)

$$ \mathrm{aff}(x, y) = \exp\!\left(-\frac{1}{2\sigma_d^2}\,\|x - y\|^2\right) $$

Slides modified from D.A. Forsyth, Computer Vision - A Modern Approach
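A one-function sketch (illustrative, with an assumed name) that builds this affinity matrix for a set of points:

```python
import numpy as np

def affinity_matrix(X, sigma_d):
    """aff(x, y) = exp(-||x - y||^2 / (2 * sigma_d^2)) for every pair of points."""
    X = np.atleast_2d(X).astype(float)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma_d ** 2))
```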


Solution via Eigenvectors

• Idea: find a vector w giving the association between each node and a cluster.
• Elements within a cluster should have strong affinity with each other.
• Maximize the following relation (A: affinity matrix):

$$ w^T A\, w = \sum_{i,j} \underbrace{w_i}_{\text{assoc. of node } i \text{ to cluster}}\; \underbrace{a_{ij}}_{\text{similarity of nodes } i,\,j}\; \underbrace{w_j}_{\text{assoc. of node } j \text{ to cluster}} $$

• The relation above is maximized when, for each pair, all three factors are non-zero (or not very small).
• There should be an extra constraint: w^T w = 1.
• Optimize by the method of Lagrange multipliers: max { w^T A w + λ (w^T w − 1) }.
• The solution is an eigenvalue problem.
• Choose the eigenvector of A with the largest eigenvalue.

Slides modified from D.A. Forsyth, Computer Vision - A Modern Approach
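A minimal sketch of this eigenvector solution, for illustration only: it assumes a symmetric affinity matrix and resolves the arbitrary sign of the eigenvector by taking absolute values; the function name is hypothetical.

```python
import numpy as np

def cluster_association(A):
    """Association weights w maximizing w^T A w subject to w^T w = 1: the
    eigenvector of the symmetric affinity matrix A with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(A)        # eigh: eigenvalues in ascending order
    w = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue
    return np.abs(w)                            # large entries -> strong association
```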


Solution via Eigenvectors

(Figure: example data points, their affinity matrix, and the leading eigenvector.)

Slides modified from D.A. Forsyth, Computer Vision - A Modern Approach


Normalized Cuts

• The previous criterion only evaluates within-cluster similarity, not the across-cluster difference.
• Instead, one would like to maximize within-cluster similarity compared to the across-cluster difference.
• Write the graph as V, one cluster as A and the other as B.
• Minimize the Normalized Cut:

$$ \mathrm{NCut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)} $$

  • cut(A, B): sum of the edge weights between A and B
  • assoc(A, V): sum of the edge weights from the nodes of A to all nodes of the graph
• Construct A and B such that their within-cluster similarity is high compared to their association with the rest of the graph.

Slides modified from D.A. Forsyth, Computer Vision - A Modern Approach


Normalized Cuts

• cut(A, B): sum of the edge weights between A and B; assoc(A, V): sum of the edge weights from the nodes of A to all nodes of the graph.
• Let x be a vector of ±1 entries: x(i) = 1 ⇔ i ∈ A, x(i) = −1 ⇔ i ∈ B.
• W: affinity matrix; D: diagonal matrix with D(i,i) = Σ_j W(i,j); k = Σ_{x_i>0} D(i,i) / Σ_i D(i,i).
• Then

$$ \mathrm{NCut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)} = \frac{\sum_{x_i>0,\,x_j<0} -w_{ij}\,x_i\,x_j}{\sum_{x_i>0} D(i,i)} + \frac{\sum_{x_i<0,\,x_j>0} -w_{ij}\,x_i\,x_j}{\sum_{x_i<0} D(i,i)} $$

$$ = \frac{(\mathbf{1}+x)^T (D-W)(\mathbf{1}+x)}{4k\,\mathbf{1}^T D\, \mathbf{1}} + \frac{(\mathbf{1}-x)^T (D-W)(\mathbf{1}-x)}{4(1-k)\,\mathbf{1}^T D\, \mathbf{1}} $$

• Let y = (1 + x) − b(1 − x), where b = k / (1 − k).

Slides modified from D.A. Forsyth, Computer Vision - A Modern Approach


Normalized Cuts

• The defined vector y has elements
  • 1, if the item is in A,
  • −b, if the item is in B.
• After some derivation,

$$ \min \mathrm{NCut}(A, B) = \min\left[\frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)}\right] $$

  is shown to be equivalent to

$$ \min_{y}\; \frac{y^T (D-W)\, y}{y^T D\, y} \qquad \text{with the constraint} \qquad y^T D\, \mathbf{1} = 0 $$

  (Read the proof in the distributed notes.)
• This is the so-called Rayleigh quotient.

Slides modified from D.A. Forsyth, Computer Vision - A Modern Approach


Normalized Cuts

• Its solution is the generalized eigenvalue problem

$$ \min_{y}\; y^T (D-W)\, y \quad \text{subject to} \quad y^T D\, y = 1, $$

  which gives

$$ (D - W)\, y = \lambda D\, y. $$

• The optimal solution is the eigenvector corresponding to the second smallest eigenvalue.
• Now, look for a quantization threshold that optimizes the criterion, i.e. all components of y above the threshold go to 1, all below go to −b.

Slides modified from D.A. Forsyth, Computer Vision - A Modern Approach
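A compact sketch of a two-way normalized cut under these equations, for illustration only: it solves the generalized eigenproblem via the equivalent symmetric form D^{-1/2}(D−W)D^{-1/2}, takes the eigenvector of the second smallest eigenvalue, and uses a simple zero threshold instead of searching over thresholds; the names are hypothetical and isolated (zero-degree) nodes are not handled.

```python
import numpy as np

def normalized_cut(W):
    """Two-way normalized cut (hypothetical helper): solve (D - W) y = lambda D y,
    take the eigenvector of the second smallest eigenvalue and threshold it."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                              # node degrees, D = diag(d)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))         # assumes no isolated (zero-degree) node
    # equivalent symmetric problem: D^{-1/2} (D - W) D^{-1/2} z = lambda z, with y = D^{-1/2} z
    L_sym = d_inv_sqrt @ (np.diag(d) - W) @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    y = d_inv_sqrt @ eigvecs[:, 1]                 # eigenvector of the 2nd smallest eigenvalue
    # simple quantization at zero; a search over thresholds can refine the split
    return (y > 0).astype(int)                     # 1 -> cluster A, 0 -> cluster B
```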