Spectral Clustering
Veronika Strnadová-Neeley
Seminar on Top Algorithms in Computational Science
Main Reference: Ulrike von Luxburg's A Tutorial on Spectral Clustering
The Spectral Clustering Algorithm
Uses the eigenvalues and eigenvectors of the graph Laplacian matrix in order to find clusters (or “partitions”) of the graph

[Figure: example graph on vertices 1-5 with unit-weight edges (1,4), (1,5), (2,3), (3,4), (4,5)]

𝐿 =
[  2   0   0  −1  −1
   0   1  −1   0   0
   0  −1   2  −1   0
  −1   0  −1   3  −1
  −1   0   0  −1   2 ]

From 𝐿 we compute the eigenvalues 𝜆1, 𝜆2, …, 𝜆𝑛 and eigenvectors 𝑒1, 𝑒2, …, 𝑒𝑛.
A little history…
• 1973: Algebraic Connectivity of Graphs. M. Fiedler
• 1986: Eigenvalues and Expanders. N. Alon
• 1989: Conductance and Convergence of Markov Chains: A Combinatorial Treatment of Expanders. M. Mihail
• 1990: Partitioning Sparse Matrices with Eigenvectors of Graphs. A. Pothen et al.
• 2000: Normalized Cuts and Image Segmentation. J. Shi and J. Malik
• 2001: A Random Walks View of Spectral Segmentation. M. Meila
• 2010: Power Iteration Clustering. F. Lin and W. Cohen
[Figure: two example bipartitions 𝐴1, 𝐴2 of a graph, one with cut(𝐴1, 𝐴2) = 2 and one with cut(𝐴1, 𝐴2) = 7]
RatioCut(𝐴1, 𝐴2, …, 𝐴𝑘) = Σ𝑖=1…𝑘 cut(𝐴𝑖, 𝐴̄𝑖)/|𝐴𝑖|

[Figure: example bipartition 𝐴1, 𝐴2 with RatioCut(𝐴1, 𝐴2) = 0.226]
Definition: Unnormalized Graph Laplacian Matrix
𝐿 = 𝐷 − 𝑊
where 𝐷 is the diagonal degree matrix and 𝑊𝑖𝑗 ≥ 0 is the weight of edge (𝑖, 𝑗), 𝑖 ≠ 𝑗

Example: for the unweighted path graph 1-2-3-4-5 (all edge weights 1),

𝐷 =
[ 1  0  0  0  0
  0  2  0  0  0
  0  0  2  0  0
  0  0  0  2  0
  0  0  0  0  1 ]

𝑊 =
[ 0  1  0  0  0
  1  0  1  0  0
  0  1  0  1  0
  0  0  1  0  1
  0  0  0  1  0 ]
𝐿 = 𝐷 − 𝑊 =
[  1  −1   0   0   0
  −1   2  −1   0   0
   0  −1   2  −1   0
   0   0  −1   2  −1
   0   0   0  −1   1 ]
For the weighted path graph 1-2-3-4-5 with edge weights 𝑤12 = 3, 𝑤23 = 1, 𝑤34 = 4, 𝑤45 = 5:

𝐿 =
[  3  −3   0   0   0
  −3   4  −1   0   0
   0  −1   5  −4   0
   0   0  −4   9  −5
   0   0   0  −5   5 ]

Note: In a weighted graph, 𝐷𝑖𝑖 = Σ𝑗=1…𝑛 𝑤𝑖𝑗
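As a quick check, the definition can be reproduced in a few lines of numpy (a sketch added here, not from the original slides; 0-based indices stand in for vertices 1-5):

```python
import numpy as np

# Weighted path graph 1-2-3-4-5 with edge weights 3, 1, 4, 5 (as on the slide).
W = np.zeros((5, 5))
for (i, j), w in {(0, 1): 3, (1, 2): 1, (2, 3): 4, (3, 4): 5}.items():
    W[i, j] = W[j, i] = w  # weights are symmetric

D = np.diag(W.sum(axis=1))  # D_ii = sum_j w_ij (weighted degree)
L = D - W                   # unnormalized graph Laplacian

print(L.astype(int))
# Every row of L sums to 0, and L matches the weighted matrix shown above.
```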
For the complete graph on 5 vertices:

𝐿 =
[  4  −1  −1  −1  −1
  −1   4  −1  −1  −1
  −1  −1   4  −1  −1
  −1  −1  −1   4  −1
  −1  −1  −1  −1   4 ]
For a bipartite graph on vertex sets {1, 2, 3, 4} and {5, 6, 7, 8}:

𝐿 =
[  2   0   0   0  −1  −1   0   0
   0   2   0   0  −1  −1   0   0
   0   0   2   0   0  −1  −1   0
   0   0   0   4  −1  −1  −1  −1
  −1  −1   0  −1   3   0   0   0
  −1  −1  −1  −1   0   4   0   0
   0   0  −1  −1   0   0   2   0
   0   0   0  −1   0   0   0   1 ]
Properties of the Graph Laplacian

1. 𝒇𝑻𝑳𝒇 = ½ 𝜮𝒊,𝒋=𝟏…𝒏 𝒘𝒊𝒋(𝒇𝒊 − 𝒇𝒋)² for all 𝒇 ∈ ℝ𝒏

Proof:
𝑓𝑇𝐿𝑓 = 𝑓𝑇(𝐷 − 𝑊)𝑓 = 𝑓𝑇𝐷𝑓 − 𝑓𝑇𝑊𝑓
= Σ𝑖=1…𝑛 𝑑𝑖𝑓𝑖² − Σ𝑖,𝑗=1…𝑛 𝑤𝑖𝑗𝑓𝑖𝑓𝑗
= ½ ( Σ𝑖=1…𝑛 𝑑𝑖𝑓𝑖² − 2 Σ𝑖,𝑗=1…𝑛 𝑤𝑖𝑗𝑓𝑖𝑓𝑗 + Σ𝑗=1…𝑛 𝑑𝑗𝑓𝑗² )
= ½ Σ𝑖,𝑗=1…𝑛 𝑤𝑖𝑗(𝑓𝑖 − 𝑓𝑗)²

Note: this property relates the differences between the values 𝑓 assigns to adjacent vertices to the quadratic form of 𝐿.

Example: for the 5-vertex graph above (unit-weight edges (1,4), (1,5), (2,3), (3,4), (4,5)),

𝒇 = (𝟏, 𝟏, 𝟏, 𝟏, 𝟏)𝑻 ⇒ 𝒇𝑻𝑳𝒇 = 𝟎
𝒇 = (𝟒, 𝟏, 𝟏, 𝟒, 𝟒)𝑻 ⇒ 𝒇𝑻𝑳𝒇 = 𝟗 (only edge (3,4) joins vertices with different values: 𝑤34(𝑓3 − 𝑓4)² = 9)
𝒇 = (𝟏, 𝟏, 𝟏, 𝟒, 𝟒)𝑻 ⇒ 𝒇𝑻𝑳𝒇 = 𝟐𝟕 (edges (1,4), (1,5), and (3,4) each contribute (1 − 4)² = 9)
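Property 1 can be verified numerically; the numpy sketch below (an addition, not from the slides) rebuilds the example graph's Laplacian and checks 𝒇𝑻𝑳𝒇 against ½ Σ 𝑤𝑖𝑗(𝑓𝑖 − 𝑓𝑗)² for the three example vectors:

```python
import numpy as np

# Example graph from the slides: vertices 1..5 (0-based here), unit-weight
# edges (1,4), (1,5), (2,3), (3,4), (4,5).
edges = [(0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]
W = np.zeros((5, 5))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

def quad(f):
    """1/2 * sum_ij w_ij (f_i - f_j)^2, the right-hand side of property 1."""
    diff = f[:, None] - f[None, :]
    return 0.5 * np.sum(W * diff**2)

for f in (np.array([1., 1, 1, 1, 1]),
          np.array([4., 1, 1, 4, 4]),
          np.array([1., 1, 1, 4, 4])):
    print(f @ L @ f, quad(f))  # 0, 9, 27 respectively, as on the slides
```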
2. 𝑳 is symmetric and positive semi-definite
Recall the definition: 𝐿 = 𝐷 − 𝑊
𝐷 diagonal, 𝑊 symmetric → 𝐿 symmetric
Recall property 1:
𝑓𝑇𝐿𝑓 = ½ Σ𝑖,𝑗=1…𝑛 𝑤𝑖𝑗(𝑓𝑖 − 𝑓𝑗)² ≥ 0 ∀𝑓 ∈ ℝ𝑛
3. The smallest eigenvalue of 𝑳 is 0, with corresponding eigenvector being the constant vector 𝕝

𝐿 = 𝐷 − 𝑊 has diagonal entries 𝐿𝑖𝑖 = 𝑑𝑖 − 𝑤𝑖𝑖 = Σ𝑗=1…𝑛 𝑤𝑖𝑗 − 𝑤𝑖𝑖 and off-diagonal entries 𝐿𝑖𝑗 = −𝑤𝑖𝑗, so

(𝐿𝕝)𝑖 = (Σ𝑗=1…𝑛 𝑤𝑖𝑗 − 𝑤𝑖𝑖) − Σ𝑗≠𝑖 𝑤𝑖𝑗 = 0 for every 𝑖, i.e., 𝐿𝕝 = 𝟎

0 is an eigenvalue and its corresponding eigenvector is the constant vector 𝕝
Since each eigenvalue is ≥ 0 (property 2), this is the smallest eigenvalue
4. 𝑳 has non-negative, real-valued eigenvalues 𝟎 = 𝝀𝟏 ≤ 𝝀𝟐 ≤ ⋯ ≤ 𝝀𝒏
By property 2, 𝐿 is symmetric → 𝐿 has 𝑛 real eigenvalues
Again by property 2, 𝐿 is positive semi-definite → 𝟎 ≤ 𝝀𝟏 ≤ 𝝀𝟐 ≤ ⋯ ≤ 𝝀𝒏
By property 3, 𝝀𝟏 = 𝟎
Therefore, 𝟎 = 𝝀𝟏 ≤ 𝝀𝟐 ≤ ⋯ ≤ 𝝀𝒏
Properties of Graph Laplacians: Summary
𝐿 = 𝐷 − 𝑊
1. 𝑓𝑇𝐿𝑓 = ½ Σ𝑖,𝑗=1…𝑛 𝑤𝑖𝑗(𝑓𝑖 − 𝑓𝑗)² for all 𝑓 ∈ ℝ𝑛
2. 𝐿 is symmetric and positive semi-definite
3. The smallest eigenvalue of 𝐿 is 0, and the corresponding eigenvector is the constant vector 𝕝
4. 𝐿 has 𝑛 non-negative, real eigenvalues 0 = 𝜆1 ≤ 𝜆2 ≤ … ≤ 𝜆𝑛
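All four properties can be confirmed on the running example with numpy (a sketch added for illustration; `eigvalsh` returns real eigenvalues in ascending order for symmetric matrices):

```python
import numpy as np

# Laplacian of the 5-vertex example graph (edges (1,4),(1,5),(2,3),(3,4),(4,5)).
W = np.zeros((5, 5))
for i, j in [(0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

assert np.allclose(L, L.T)             # property 2: symmetric
lam = np.linalg.eigvalsh(L)            # real eigenvalues, ascending (property 4)
assert lam[0] > -1e-10                 # property 2: positive semi-definite
assert abs(lam[0]) < 1e-10             # properties 3/4: smallest eigenvalue is 0
assert np.allclose(L @ np.ones(5), 0)  # property 3: L·1 = 0
print(np.round(lam, 3))
```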
Proposition: The number of connected components of a graph is equal to the multiplicity 𝑘 of the eigenvalue 0 of the graph Laplacian matrix

Suppose 𝑘 = 1.
Suppose further that 𝒇 is an eigenvector with eigenvalue 0, i.e., 𝐿𝒇 = 𝟎
→ 0 = 𝒇𝑇𝐿𝒇 = ½ Σ𝑖,𝑗=1…𝑛 𝑤𝑖𝑗(𝒇𝑖 − 𝒇𝑗)²
Since all edge weights are positive → 𝒇𝑖 = 𝒇𝑗 for all connected vertices 𝑖, 𝑗
Thus 𝒇𝑣1 = 𝒇𝑣2 = ⋯ = 𝒇𝑣𝑚 for any path over vertices 𝑣1, 𝑣2, …, 𝑣𝑚
In a connected graph, all vertices are connected by some path
Thus 𝒇1 = 𝒇2 = ⋯ = 𝒇𝑛, and (up to scaling) 𝕝 is the only eigenvector with eigenvalue 0
Now suppose that 𝑘 > 1.
WLOG, order the vertices according to the components they belong to.

Example: a graph with two components, {1, 4, 5} (edges (1,4), (1,5), (4,5)) and {2, 3} (edge (2,3)):

𝐿 =
[  2   0   0  −1  −1
   0   1  −1   0   0
   0  −1   1   0   0
  −1   0   0   2  −1
  −1   0   0  −1   2 ]

After reordering the vertices as 1, 4, 5, 3, 2:

𝐿 =
[  2  −1  −1   0   0
  −1   2  −1   0   0
  −1  −1   2   0   0
   0   0   0   1  −1
   0   0   0  −1   1 ]

→ 𝐿 has a block diagonal form
𝐿 =
[ 𝐿1  0   0
  0   ⋱   0
  0   0   𝐿𝑘 ]

The spectrum of 𝐿 is the union of the spectra of 𝐿1, …, 𝐿𝑘
Since each 𝐿𝑖 represents a connected component, the eigenvectors of 𝐿 corresponding to 𝜆1 = 0 are constant on the block 𝐿𝑖 and 0 elsewhere:

[ 𝐿1  0   0        [ 𝕝        [ 0
  0   ⋱   0    ×     ⋮    =     ⋮
  0   0   𝐿𝑘 ]       0 ]        0 ]

Thus, 𝐿 has as many 0 eigenvalues as there are connected components
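The proposition is easy to check numerically; the sketch below (an addition, assuming numpy) counts the near-zero eigenvalues of the disconnected example's Laplacian:

```python
import numpy as np

# Disconnected example from the slide: components {1,4,5} (a triangle)
# and {2,3} (a single edge); vertex numbers are 0-based here.
W = np.zeros((5, 5))
for i, j in [(0, 3), (0, 4), (3, 4), (1, 2)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

lam = np.linalg.eigvalsh(L)
n_components = int(np.sum(lam < 1e-8))
print(n_components)  # 2 zero eigenvalues -> 2 connected components
```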
Spectral Clustering Algorithm
Given: a graph with 𝑛 vertices and edge weights 𝑊𝑖𝑗, and the number of desired clusters 𝑘
1. Construct the (normalized) graph Laplacian of 𝐺(𝑉, 𝐸): 𝐿 = 𝐷 − 𝑊
2. Find the 𝑘 eigenvectors corresponding to the 𝑘 smallest eigenvalues of 𝐿
3. Let 𝑈 be the 𝑛 × 𝑘 matrix of these eigenvectors
4. Use 𝑘-means to find 𝑘 clusters 𝐶′, letting 𝑥𝑖′ be the rows of 𝑈
5. Assign data point 𝑥𝑖 to the 𝑗th cluster if 𝑥𝑖′ was assigned to cluster 𝑗
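The five steps can be sketched end to end in numpy (a toy illustration added here, not the authors' code; the tiny deterministic k-means with farthest-point initialization is a stand-in for any k-means routine):

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min(np.linalg.norm(X[:, None] - np.array(centers)[None], axis=2), axis=1)
        centers.append(X[d.argmax()])          # pick the point farthest from chosen centers
    centers = np.array(centers)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_clustering(W, k):
    """Steps 1-5 of the algorithm, using the unnormalized Laplacian."""
    L = np.diag(W.sum(axis=1)) - W             # step 1
    lam, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    U = vecs[:, :k]                            # steps 2-3: n x k eigenvector matrix
    return kmeans(U, k)                        # steps 4-5: cluster the rows of U

# Two triangles joined by a single edge: the natural k = 2 clustering.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
labels = spectral_clustering(W, 2)
print(labels)  # vertices 0-2 land in one cluster, 3-5 in the other
```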
Graph Cut Point of View
The goal of spectral clustering is to minimize the sum of the weights of edges between clusters
Define: cut(𝐴1, …, 𝐴𝑘) = ½ Σ𝑖=1…𝑘 𝑊(𝐴𝑖, 𝐴̄𝑖), where 𝑊(𝐴𝑖, 𝐴̄𝑖) = Σ𝑗∈𝐴𝑖, 𝑙∈𝐴̄𝑖 𝑤𝑗𝑙
Ex: 𝒌 = 𝟐
[Figure: a bipartition into 𝐴1 and 𝐴̄1 with cut = 2]
To create balanced clusters, minimize the RatioCut instead:
RatioCut(𝐴1, …, 𝐴𝑘) = ½ Σ𝑖=1…𝑘 𝑊(𝐴𝑖, 𝐴̄𝑖)/|𝐴𝑖| = Σ𝑖=1…𝑘 cut(𝐴𝑖, 𝐴̄𝑖)/|𝐴𝑖|
[Figure: a balanced bipartition with RatioCut = 0.18803 and an unbalanced one with RatioCut = 0.5500]
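Both quantities can be computed directly from 𝑊; the sketch below (an addition, using the convention RatioCut = Σ cut(𝐴𝑖, 𝐴̄𝑖)/|𝐴𝑖| with cut(𝐴, 𝐴̄) = 𝑊(𝐴, 𝐴̄)) evaluates them on the 5-vertex example graph:

```python
import numpy as np

def cut_value(W, A):
    """W(A, A-bar): total weight of edges leaving the vertex set A."""
    mask = np.zeros(len(W), dtype=bool)
    mask[np.asarray(A)] = True
    return W[mask][:, ~mask].sum()

def ratiocut(W, parts):
    """RatioCut(A_1, ..., A_k) = sum_i cut(A_i, A_i-bar) / |A_i|."""
    return sum(cut_value(W, A) / len(A) for A in parts)

# 5-vertex example graph, unit weights, edges (1,4),(1,5),(2,3),(3,4),(4,5).
W = np.zeros((5, 5))
for i, j in [(0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]:
    W[i, j] = W[j, i] = 1.0

print(cut_value(W, [0, 3, 4]))           # 1.0: only edge (3,4) (1-based) crosses
print(ratiocut(W, [[0, 3, 4], [1, 2]]))  # 1/3 + 1/2 ≈ 0.833
```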
…but minimizing RatioCut is NP-hard!
min𝐴1,𝐴2,…,𝐴𝑘 ½ Σ𝑖=1…𝑘 𝑊(𝐴𝑖, 𝐴̄𝑖)/|𝐴𝑖|
When the going gets tough… we relax the constraints on the problem
Minimizing RatioCut for 𝑘 = 2
Only two clusters, 𝐴 and 𝐴̄ (the complement of 𝐴)
Want to minimize:
min𝐴,𝐴̄ ½ ( 𝑊(𝐴, 𝐴̄)/|𝐴| + 𝑊(𝐴̄, 𝐴)/|𝐴̄| )
Trick: find a vector 𝑓 so that minimizing RatioCut(𝐴, 𝐴̄) is the same as minimizing 𝑓𝑇𝐿𝑓 subject to certain constraints.
Let:
𝑓𝑖 = √(|𝐴̄|/|𝐴|) if 𝑣𝑖 ∈ 𝐴, and 𝑓𝑖 = −√(|𝐴|/|𝐴̄|) if 𝑣𝑖 ∈ 𝐴̄
Then we can show that 𝑓𝑇𝐿𝑓 = |𝑉| · RatioCut(𝐴, 𝐴̄)
Thus minimizing RatioCut(𝐴, 𝐴̄) is the same as minimizing 𝑓𝑇𝐿𝑓 with constraints:
Σ𝑖=1…𝑛 𝑓𝑖 = 0 ⇒ 𝑓 ⊥ 𝕝
and ‖𝑓‖₂² = 𝑛, the number of vertices
and 𝑓𝑖 taking on the discrete values defined above
Now relax the constraints:
Minimize 𝑓𝑇𝐿𝑓 subject to: 𝑓 ⊥ 𝕝, ‖𝑓‖₂² = 𝑛
The Rayleigh-Ritz theorem tells us that the solution to this optimization problem is given by 𝑒2, the eigenvector corresponding to 𝜆2
So we can approximate the RatioCut minimizer by using the second eigenvector (corresponding to the second-smallest eigenvalue) of 𝑳! (This translates to solving for eigenvectors in the spectral clustering algorithm.)
How do we transform the approximate solution back to the discrete setting? Recall that we eliminated the constraint which assigned vertices to partitions:
𝑓𝑖 = √(|𝐴̄|/|𝐴|) if 𝑣𝑖 ∈ 𝐴, and 𝑓𝑖 = −√(|𝐴|/|𝐴̄|) if 𝑣𝑖 ∈ 𝐴̄
We can choose a threshold 𝑡, assigning all 𝑖 for which 𝑓𝑖 > 𝑡 to one partition 𝐴 and all 𝑖 for which 𝑓𝑖 ≤ 𝑡 to the other partition 𝐴̄
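The 𝑘 = 2 procedure, second eigenvector plus a threshold at 𝑡 = 0, can be sketched as follows (an added numpy illustration; the six-vertex "two triangles" graph is made up for the demonstration):

```python
import numpy as np

# Two triangles joined by one edge; the eigenvector of the second-smallest
# eigenvalue (the Fiedler vector) should separate them at threshold t = 0.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

lam, vecs = np.linalg.eigh(L)
f = vecs[:, 1]                  # Fiedler vector e_2
A = np.where(f > 0)[0]          # threshold at t = 0
print(sorted(A.tolist()))       # one of the two triangles
```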
Approximating RatioCut for 𝑘 > 2
We can use similar arguments to show that minimizing RatioCut for 𝑘 > 2 corresponds to finding the first 𝑘 eigenvectors of 𝐿
Examples:

Path Graph (1-2-3-4-5):
𝐿 =
[  1  −1   0   0   0
  −1   2  −1   0   0
   0  −1   2  −1   0
   0   0  −1   2  −1
   0   0   0  −1   1 ]

Complete Graph (on vertices 1-5):
𝐿 =
[  4  −1  −1  −1  −1
  −1   4  −1  −1  −1
  −1  −1   4  −1  −1
  −1  −1  −1   4  −1
  −1  −1  −1  −1   4 ]

Example graph with edges (1,4), (1,5), (2,3), (3,4), (4,5):
𝐿 =
[  2   0   0  −1  −1
   0   1  −1   0   0
   0  −1   2  −1   0
  −1   0  −1   3  −1
  −1   0   0  −1   2 ]

Bipartite Graph (vertex sets {1, 2, 3, 4} and {5, 6, 7, 8}):
𝐿 =
[  2   0   0   0  −1  −1   0   0
   0   2   0   0  −1  −1   0   0
   0   0   2   0   0  −1  −1   0
   0   0   0   4  −1  −1  −1  −1
  −1  −1   0  −1   3   0   0   0
  −1  −1  −1  −1   0   4   0   0
   0   0  −1  −1   0   0   2   0
   0   0   0  −1   0   0   0   1 ]

Cycle Graph (1-2-3-4-5-1):
𝐿 =
[  2  −1   0   0  −1
  −1   2  −1   0   0
   0  −1   2  −1   0
   0   0  −1   2  −1
  −1   0   0  −1   2 ]
Real-World Examples
Issues in Spectral Clustering
• similarity function
  • a reasonable choice is 𝑠𝑖𝑗 = exp(−‖𝑥𝑖 − 𝑥𝑗‖²/(2𝜎²)) when the data points live in a Euclidean space, but it always depends on the domain
• type of similarity graph
  • 𝜀-neighborhood: difficulties arise when data is on “different scales” – which 𝜀 should we choose?
  • 𝒌-nearest neighbors: points in low-density regions can be grouped with points in high-density regions
  • mutual 𝑘-nearest neighbors: tends to connect points within regions of constant density, but not points across regions whose densities differ
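The Gaussian similarity and a mutual 𝑘-nearest-neighbor graph might be constructed as follows (a sketch added here; the six 2-D points are made up to form two well-separated groups):

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for points in Euclidean space."""
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    S = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(S, 0.0)  # no self-loops
    return S

def mutual_knn_graph(S, k):
    """Keep s_ij only when i and j are each among the other's k most similar points."""
    order = np.argsort(-S, axis=1)[:, :k]     # indices of the k most similar points
    knn = np.zeros_like(S, dtype=bool)
    rows = np.repeat(np.arange(len(S)), k)
    knn[rows, order.ravel()] = True
    return S * (knn & knn.T)                  # mutual k-NN: symmetric by construction

X = np.array([[0., 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
W = mutual_knn_graph(gaussian_similarity(X, sigma=1.0), k=2)
print(np.round(W, 3))  # nonzero only within each group of three points
```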
More issues
• fully connected graph
  • usually used with the Gaussian similarity function
  • need to pick 𝜎 wisely
• finding the eigenvectors
  • if we use the 𝑘-nearest neighbor graph or the 𝜀-neighborhood graph, the Laplacian matrix will be sparse, and we have efficient methods to compute the first 𝑘 eigenvectors of sparse matrices
  • however, the speed of convergence of popular methods depends on the size of the eigengap
• choosing the number of clusters
  • “A variety of more or less successful methods have been devised for this problem”
Large-Scale Spectral Clustering
Main challenge: finding the first 𝑘 eigenvectors can be prohibitively expensive
2010: Power Iteration Clustering. Key idea: use power iteration to find only ONE vector that gives partitions similar to those found by looking at the first 𝑘 eigenvectors
Power Iteration Clustering
A large-scale extension of spectral clustering
Uses power iteration on 𝐼 − 𝐷⁻¹𝐿 = 𝐷⁻¹𝑊 until near-convergence to a linear combination of the eigenvectors corresponding to the 𝑘 smallest eigenvalues
Applicable to large-scale text-document classification; works well for small values of 𝑘
If 𝐹𝐹𝑇 = 𝑊, then (𝐷⁻¹𝑊)𝑣 = 𝐷⁻¹𝐹(𝐹𝑇𝑣), so each matrix-vector product can be computed without forming 𝑊 explicitly
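The core idea, that the power iterate of 𝐷⁻¹𝑊 becomes locally constant on well-separated clusters, shows up cleanly on a toy disconnected graph (a sketch added here, not the PIC authors' code; a deterministic starting vector keeps the outcome predictable):

```python
import numpy as np

# Power iteration on D^{-1} W for two disconnected triangles: the iterate
# becomes constant within each component, so clustering its entries in 1-D
# recovers the components.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
P = W / W.sum(axis=1, keepdims=True)   # D^{-1} W = I - D^{-1} L

v = np.arange(1, 7, dtype=float)       # deterministic starting vector
for _ in range(60):
    v = P @ v
    v /= np.abs(v).max()               # keep the iterate normalized

print(np.round(v, 6))  # constant within each triangle, different across them
```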
Conclusion
Spectral clustering has been a successful heuristic algorithm for partitioning the vertices of a graph
It relates linear-algebraic theory to a discrete optimization problem
It has been extended to many application domains and to the large-scale setting
More Details
For an arbitrary 𝑘, we look for a solution of:
min𝐴1…𝐴𝑘 Tr(𝐻𝑇𝐿𝐻) = min𝐴1…𝐴𝑘 RatioCut(𝐴1, …, 𝐴𝑘)
with the constraints:
𝐻𝑇𝐻 = 𝐼, ℎ𝑖𝑗 = 1/√|𝐴𝑗| if 𝑣𝑖 ∈ 𝐴𝑗, and ℎ𝑖𝑗 = 0 otherwise
We relax the problem, then come back to the unrelaxed version. The relaxed problem becomes:
min𝐻∈ℝ𝑛×𝑘 Tr(𝐻𝑇𝐿𝐻) where 𝐻𝑇𝐻 = 𝐼
The Rayleigh-Ritz theorem tells us that the solution to this problem is the 𝐻 which contains the first 𝑘 eigenvectors of 𝐿 as columns
And we get back to discrete partitions of the graph by using 𝑘-means clustering on the rows of 𝐻 = 𝑈
Note: In general it is known that efficient algorithms to approximate balanced graph cuts up to a constant factor do not exist
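The identity Tr(𝐻𝑇𝐿𝐻) = RatioCut(𝐴1, …, 𝐴𝑘) can be checked on one concrete partition (an added numpy sketch, using the 5-vertex example graph and the partition {1,4,5}, {2,3} in 1-based labels):

```python
import numpy as np

# Build L for the example graph, edges (1,4),(1,5),(2,3),(3,4),(4,5), 0-based.
W = np.zeros((5, 5))
for i, j in [(0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

parts = [[0, 3, 4], [1, 2]]             # A_1, A_2
H = np.zeros((5, 2))
for j, A in enumerate(parts):
    H[A, j] = 1.0 / np.sqrt(len(A))     # h_ij = 1/sqrt(|A_j|) if v_i in A_j

assert np.allclose(H.T @ H, np.eye(2))  # the constraint H^T H = I holds
trace = np.trace(H.T @ L @ H)
print(trace)  # 1/3 + 1/2: one crossing edge, weighted by the part sizes
```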
Normalized Graph Laplacians
• 𝐿𝑠𝑦𝑚 = 𝐷^(−1/2) 𝐿 𝐷^(−1/2)
• 𝐿𝑟𝑤 = 𝐷⁻¹𝐿
• Their eigenvalues and eigenvectors are closely related to each other and to those of the unnormalized graph Laplacian 𝐿
Normalized Laplacians - Properties
• For an undirected graph with non-negative weights, the multiplicity of the eigenvalue 0 of both 𝐿𝑠𝑦𝑚 and 𝐿𝑟𝑤 is the number of connected components in the graph
• The eigenspace of 0 for 𝐿𝑟𝑤 is spanned by the indicator vectors 𝕝𝐴𝑖 of those components
• The eigenspace of 0 for 𝐿𝑠𝑦𝑚 is spanned by 𝐷^(1/2)𝕝𝐴𝑖
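These eigenspace claims can be verified on the disconnected example (an added numpy sketch; components {1,4,5} and {2,3} in 1-based labels, with all degrees positive so 𝐷 is invertible):

```python
import numpy as np

# Disconnected example: triangle {0,3,4} and edge {1,2} in 0-based labels.
W = np.zeros((5, 5))
for i, j in [(0, 3), (0, 4), (3, 4), (1, 2)]:
    W[i, j] = W[j, i] = 1.0
d = W.sum(axis=1)
L = np.diag(d) - W
L_sym = np.diag(d**-0.5) @ L @ np.diag(d**-0.5)
L_rw = np.diag(1 / d) @ L

for A in ([0, 3, 4], [1, 2]):
    ind = np.zeros(5)
    ind[A] = 1.0                                       # indicator vector 1_{A_i}
    assert np.allclose(L_rw @ ind, 0)                  # 1_{A_i} in null space of L_rw
    assert np.allclose(L_sym @ (np.sqrt(d) * ind), 0)  # D^{1/2} 1_{A_i} for L_sym
print("checks passed")
```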