Source: theory.stanford.edu/~sergei/slides/kMeans-hour.pdf

On the Worst-Case Complexity of the k-Means Method

Sergei Vassilvitskii

David Arthur

(Stanford University)

Clustering

Given $n$ points in $\mathbb{R}^d$, split them into $k$ similar groups.

Clustering Objectives

k-Center: Let $C(x)$ be the closest cluster center to $x$.

$$\min \max_{x \in X} \|x - C(x)\|$$

Clustering Objectives

k-Median: Let $C(x)$ be the closest cluster center to $x$.

$$\min \sum_{x \in X} \|x - C(x)\|$$

Clustering Objectives

k-Median Squared: Let $C(x)$ be the closest cluster center to $x$.

$$\min \sum_{x \in X} \|x - C(x)\|^2$$

Much more sensitive to outliers.
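As an illustration (not from the slides), the three objectives can be computed for a fixed set of centers; the point set, center placement, and function names below are hypothetical:

```python
import math

def dist(p, q):
    # Euclidean distance between two points given as tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def objectives(points, centers):
    # distance from each point to its closest center C(x)
    d = [min(dist(x, c) for c in centers) for x in points]
    return {
        "k-center": max(d),                 # worst distance
        "k-median": sum(d),                 # sum of distances
        "k-means": sum(v * v for v in d),   # sum of squared distances
    }

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centers = [(0.5, 0.0), (10.0, 0.0)]
print(objectives(points, centers))
```

Squaring the distances is what makes the last objective far more sensitive to a single faraway point.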

Lloyd’s Method: K-means

Initialize with random clusters.
Assign each point to its nearest center.
Recompute the optimum centers (means).
Repeat: assign points to the nearest center; recompute centers.
Repeat... until the clustering does not change.
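The loop above can be sketched in a few lines of Python; this is a minimal illustration, not the slides' code, and the sample data and the random-sample initialization are placeholders:

```python
import random

def lloyd(points, k, seed=0):
    """Run Lloyd's method (k-means) on a list of d-dimensional tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize with k random input points
    while True:
        # Step 1: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            clusters[i].append(x)
        # Step 2: recompute each center as the mean of its cluster
        new_centers = []
        for i, cl in enumerate(clusters):
            if not cl:  # keep an empty cluster's center where it is
                new_centers.append(centers[i])
            else:
                d = len(cl[0])
                new_centers.append(tuple(sum(x[j] for x in cl) / len(cl)
                                         for j in range(d)))
        # Repeat until the clustering (hence the centers) does not change
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

points = [(0.0, 0.0), (0.0, 1.0), (9.0, 0.0), (9.0, 1.0)]
centers, clusters = lloyd(points, k=2)
```

At termination each returned center is exactly the mean of its cluster, which is the fixed-point condition of the method.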

Analysis

How good is this algorithm?

It finds a local optimum, which can be arbitrarily worse than the optimal solution.
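A standard illustration of this (my example, not the slides'): four points at the corners of a wide $w \times 1$ rectangle with $k = 2$. The left/right split is optimal, while the bottom/top split is a fixed point of Lloyd's method whose cost grows with $w$:

```python
# Four points at the corners of a w x 1 rectangle; k = 2.
w = 100.0
pts = [(0.0, 0.0), (0.0, 1.0), (w, 0.0), (w, 1.0)]

def cost(clusters):
    # k-means cost: sum of squared distances to each cluster's mean
    total = 0.0
    for cl in clusters:
        mean = tuple(sum(p[j] for p in cl) / len(cl) for j in range(2))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, mean)) for p in cl)
    return total

good = [[pts[0], pts[1]], [pts[2], pts[3]]]  # left/right split: cost 4*(1/2)^2 = 1
bad  = [[pts[0], pts[2]], [pts[1], pts[3]]]  # bottom/top split: cost 4*(w/2)^2

print(cost(good), cost(bad))  # the ratio grows like w^2
```

The bad split is stable because every point is (slightly) closer to its own cluster's mean than to the other one, so Lloyd's method never leaves it; taking $w \to \infty$ makes the local optimum arbitrarily bad.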

Analysis

How fast is this algorithm?

In practice: VERY fast. E.g., on a digit-recognition dataset with $n = 60{,}000$ and $d = 700$, it converges after 60 iterations.

In theory: stay tuned.

Previous Work

Lower Bounds:
$\Omega(n)$ on the line
$\Omega(n^2)$ in the plane

Upper Bounds:
$O(n\Delta^2)$ on the line, where $\Delta = \dfrac{\max_{x,y} \|x - y\|}{\min_{x,y} \|x - y\|}$ is the spread

Exponential bounds: $O(k^n)$, $O(n^{kd})$

Our Results

Lower Bound: $2^{\Omega(\sqrt{n})}$

Smoothed Upper Bounds ($\sigma$ is the smoothness factor, $D$ is the diameter of the pointset):

$$O\!\left(n^{2+2/d}\left(\frac{D}{\sigma}\right)^2 2^{4n/d}\right) \qquad \text{and} \qquad O\!\left(n^{k+2/d}\left(\frac{D}{\sigma}\right)^2\right)$$

Rest of the Talk

Lower Bound Sketch

Upper Bound Sketch

Open Problems


Lower Bound

General Idea: make a “Reset Widget”.

If k-means takes $t$ steps on a point set $X$, create a new point set $X'$ on which k-means takes $2t$ steps to terminate.

Lower Bound: Sketch

Initial clustering: $t$ steps.

With the widget: $t$ steps, reset, another $t$ steps.

Lower Bound Details

Three Main Ideas:

Signaling: recognizing when to start flipping the switch.
Resetting: setting the cluster centers back to their original positions.
Odds & Ends: clean-up to make the process recursive.

Signaling

Suppose that when k-means terminates, there is one cluster center $p$ that has never appeared before. We use this as a signal to start the reset sequence.

By setting $\varepsilon$ in the distances $d$ and $d - \varepsilon$, we can control exactly when $p$ will switch.

Resetting

$k$ properly placed points can reset the positions of the $k$ current centers.

It is easy to compute the locations of the reset points so that the new cluster centers are placed correctly: for each current center, a point is added to the cluster so that its mean moves from the current center to the intended new center.
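The mean arithmetic behind one such reset point is easy to make concrete (a hypothetical single-cluster sketch, not the slides' full construction): if a cluster of $m$ points has mean $c$, adding the point $p = (m+1)\,c' - m\,c$ moves the mean exactly to the intended center $c'$.

```python
def reset_point(cluster, intended):
    # One added point that moves the cluster mean exactly onto `intended`:
    # solve (m*c + p) / (m + 1) = intended  =>  p = (m+1)*intended - m*c
    m = len(cluster)
    d = len(intended)
    c = tuple(sum(x[j] for x in cluster) / m for j in range(d))  # current mean
    return tuple((m + 1) * intended[j] - m * c[j] for j in range(d))

cluster = [(0.0, 0.0), (2.0, 0.0)]    # current mean c = (1, 0)
p = reset_point(cluster, (5.0, 0.0))  # p = 3*(5,0) - 2*(1,0) = (13, 0)
```

The actual widget does this simultaneously for all $k$ clusters, one reset point per cluster.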

Resetting

It is easy to compute the locations of the reset points so that the new cluster centers are placed correctly, but we must avoid accidentally grabbing other points.

Resetting

Solution: add two new dimensions ($w$ and $z$, in addition to $x$ and $y$). Each new point is added in these extra dimensions, so it moves the center to be reset to its intended new position without disturbing the other clusters.

Multi-Signaling

So far we have shown how to signal and reset a single cluster. We can use one signal to induce a signal from all clusters.

Multi-Signaling

All centers are stable before the main signaling has taken place; the relevant distances are $d$ and $d + \varepsilon$.

Multi-Signaling

Due to the signaling, the center $q$ moves away; now all centers absorb the points above (at distance greater than $d + \varepsilon$). As a result, all clusters have previously unseen centers.

Put All the Pieces Together

Start with a signaling configuration.
Transform it so that all clusters signal.
Use the new signal to reset the cluster centers (and therefore double the running time of k-means).
Ensure the new configuration is signaling.
Repeat...

Construction in Pictures

Construction: reflected points.

After $t$ steps: a signal by all clusters.

The main clusters absorb the “catalyst” points; the yellow centers move away.

The new points added are “reset” points, resetting the original cluster centers.

We can ensure the “catalyst” points leave the main clusters.

k-means runs for another $t$ steps, and the original centers will then be signaling.

Construction Results

If we repeat the resetting-widget construction $r$ times, we get $O(r^2)$ points in $O(r)$ dimensions, with $O(r)$ clusters.

Total running time: $2^{\Omega(r)}$. Since $n = O(r^2)$, this gives the $2^{\Omega(\sqrt{n})}$ lower bound.

Construction Remarks

Currently the construction has a very large spread. More trickery can decrease the spread to a constant, albeit with a blow-up in the dimension.

As presented, the construction requires a specific placement of the initial cluster centers, while in practice centers are chosen randomly from the points. The construction can be made to work even in this case.

Open question: can we decrease the dimensionality to a constant $d$?

Outline

k-Means Intuition

Lower Bound Sketch

Upper Bound Sketch

Open Problems


Smoothed Analysis

Assume each point came from a Gaussian distribution with variance $\sigma^2$:

Data collection is inherently noisy.
Or add some Gaussian noise (the effect on the final clustering is minimal).

Key Fact: The probability mass inside any ball of radius $\varepsilon$ is at most $(\varepsilon/\sigma)^d$.
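A quick Monte Carlo sanity check of this fact (my sketch, with hypothetical parameters; the $(\varepsilon/\sigma)^d$ bound is loose): estimate the mass of a 2-dimensional Gaussian with $\sigma = 1$ inside the ball of radius $\varepsilon = 0.5$ around its mean, the most massive ball of that radius.

```python
import random

random.seed(1)
d, sigma, eps, trials = 2, 1.0, 0.5, 20000

# Sample from the d-dimensional Gaussian and count samples landing inside
# the ball of radius eps centered at the mean.
inside = sum(
    1 for _ in range(trials)
    if sum(random.gauss(0.0, sigma) ** 2 for _ in range(d)) <= eps ** 2
)
estimate = inside / trials
bound = (eps / sigma) ** d  # key fact: mass is at most (eps/sigma)^d
print(estimate, bound)
```

For these parameters the true mass is $1 - e^{-\varepsilon^2/2} \approx 0.12$, comfortably below the bound of $0.25$.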

Potential Function

Use a potential function:

$$\Phi(C) = \sum_{x \in X} \|x - C(x)\|^2$$

The original potential is at most $nD^2$, and the potential decreases every step:

Reassignment reduces each $\|x - C(x)\|$.
Center recomputation finds the optimal center for the given partition.

Potential Decrease

Lemma. Let $S$ be a pointset with optimal center $c^*$, and let $c$ be any other point. Then:

$$\Phi(c) - \Phi(c^*) = |S| \, \|c - c^*\|^2$$

Proof:

$$\begin{aligned}
\Phi(c) &= \sum_{x \in S} (x - c) \cdot (x - c) \\
        &= \sum_{x \in S} (x - c^* + c^* - c) \cdot (x - c^* + c^* - c) \\
        &= \sum_{x \in S} \Big[ (x - c^*) \cdot (x - c^*) + (c - c^*) \cdot (c - c^*) + 2\,(c^* - c) \cdot (x - c^*) \Big] \\
        &= \Phi(c^*) + |S|\,\|c - c^*\|^2 + 2\,(c^* - c) \cdot \sum_{x \in S} (x - c^*) \\
        &= \Phi(c^*) + |S|\,\|c - c^*\|^2,
\end{aligned}$$

since $\sum_{x \in S}(x - c^*) = 0$ for the optimal center $c^*$, which is the mean of $S$.
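The lemma is easy to verify numerically; here is a small check on made-up data (any point set $S$ and any alternative center $c$ will do):

```python
import random

random.seed(0)
S = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(7)]
c_star = tuple(sum(x[j] for x in S) / len(S) for j in range(2))  # optimal center = mean
c = (0.3, -0.8)                                                  # any other point

def phi(center):
    # potential of S with respect to a single center
    return sum(sum((a - b) ** 2 for a, b in zip(x, center)) for x in S)

lhs = phi(c) - phi(c_star)
rhs = len(S) * sum((a - b) ** 2 for a, b in zip(c, c_star))
print(abs(lhs - rhs))  # ~0 up to floating-point error
```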

Main Lemma

In a smoothed pointset, fix an $\varepsilon > 0$. Then with probability at least $1 - 2^{2n}(\varepsilon/\sigma)^d$, for any two clusters $S$ and $T$ with optimal centers $c(S)$ and $c(T)$, we have:

$$\|c(S) - c(T)\| \geq \frac{\varepsilon}{2\min(|S|, |T|)}$$

Proof Sketch

Suppose $|S| < |T|$, and fix a point $x$ with $x \in S$, $x \notin T$. Fix all points except $x$.

To ensure $\|c(S) - c(T)\| \leq \varepsilon$, the point $x$ must lie in a ball of diameter $|S|\varepsilon$ (moving $x$ moves $c(S)$ by only a $1/|S|$ fraction of the move).

Since $x$ came from a Gaussian of variance $\sigma^2$, this probability is at most $(|S|\varepsilon\sigma^{-1})^d$.

Finally, union bound the total error probability over all possible pairs of sets: $2^{2n}(\varepsilon/\sigma)^d$.

Potential Drop

At each iteration, examine a cluster $S$ whose center changed from $c$ to $c'$:

$$\|c - c'\| \geq \frac{\varepsilon}{2|S|}$$

Therefore, the potential drops by

$$|S|\,\|c - c'\|^2 \geq \frac{\varepsilon^2}{4|S|} \geq \frac{\varepsilon^2}{4n}$$

After $m = \dfrac{4n^2 D^2}{\varepsilon^2}$ iterations, the algorithm must terminate.

To Finish Up:

Choose $\varepsilon = \sigma n^{-1/d}\, 2^{-2n/d}$. Then the total probability of failure is $2^{2n}(\varepsilon/\sigma)^d = 1/n$.

The total running time is

$$m = \frac{4n^2 D^2}{\varepsilon^2} = O\!\left(n^{2+2/d}\left(\frac{D}{\sigma}\right)^2 2^{4n/d}\right)$$

Remark: this is polynomial for $d = \Omega(n/\log n)$.

Upper bound (2)

We used the union bound over all possible sets. However, due to the geometry, not all sets arise: the total number of distinct clusterings that can appear is $O(n^{kd})$.

Carrying the same calculations through, we can bound the total number of iterations by:

$$O\!\left(n^{k+2/d}\left(\frac{D}{\sigma}\right)^2\right)$$

Remarks

The noise need not be Gaussian; we only need to avoid large probabilistic point masses. E.g., Lipschitz conditions on the noise density are enough.


Conclusion - Lower Bounds

We showed a super-polynomial lower bound of $2^{\Omega(\sqrt{n})}$ on the execution time of k-means.

However, the construction requires many dimensions, and it does not preclude an $O(n^d)$ upper bound.

Conclusion - Upper Bounds

Smoothed analysis can be used to reduce the best known upper bounds for k-means. But the bound is not polynomial for small values of $d$ or large values of $k$.

Even with smoothness there is an $\Omega(n)$ lower bound, which is never observed in practice.

Thank you

Any Questions?
