On the Worst-Case Complexity of the k-Means Method
Sergei Vassilvitskii
David Arthur
(Stanford University)
Clustering
Given n points in $R^d$, split them into k similar groups.
Clustering Objectives
k-Center: Let C(x) be the closest cluster center to x.
$\min \max_{x \in X} \|x - C(x)\|$
Clustering Objectives
k-Median: Let C(x) be the closest cluster center to x.
$\min \sum_{x \in X} \|x - C(x)\|$
Clustering Objectives
k-Median Squared: Let C(x) be the closest cluster center to x.
$\min \sum_{x \in X} \|x - C(x)\|^2$
Much more sensitive to outliers.
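The three objectives can be evaluated directly for a fixed set of centers; a minimal sketch (the helper name `objectives` and the sample data are illustrative, not from the talk):

```python
import numpy as np

def objectives(X, centers):
    """Evaluate the three clustering objectives for a fixed set of centers.

    C(x) is the nearest center to x, as in the slides.
    """
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)  # ||x - C(x)|| for every x
    return {
        "k-center": float(nearest.max()),
        "k-median": float(nearest.sum()),
        "k-means": float((nearest ** 2).sum()),  # "k-median squared"
    }

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centers = np.array([[0.5, 0.0], [10.0, 0.0]])
obj = objectives(X, centers)  # {"k-center": 0.5, "k-median": 1.0, "k-means": 0.5}
```

Squaring the distances is what makes the last objective far more sensitive to a single distant outlier than the other two.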
Lloyd's Method: k-Means
Initialize with random clusters
Assign each point to its nearest center
Recompute the optimum centers (means)
Repeat: assign points to the nearest center
Repeat: recompute the centers
Repeat...
Repeat... until the clustering does not change
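The loop above can be sketched in a few lines; a minimal implementation, assuming Euclidean distance and the naming below (not from the talk):

```python
import numpy as np

def lloyd(X, centers, max_iter=100):
    """A minimal sketch of Lloyd's method as described on the slides.

    Alternates the two steps -- assign each point to its nearest center,
    then recompute each center as the mean of its cluster -- until the
    clustering does not change.
    """
    centers = centers.astype(float).copy()
    labels = None
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # clustering did not change: a local optimum
        labels = new_labels
        for j in range(len(centers)):
            if np.any(labels == j):  # empty clusters keep their old center here
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centers, labels = lloyd(X, np.array([[1.0, 0.0], [9.0, 1.0]]))
```

On this toy input the method converges in two passes, with the final centers at the two cluster means.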
Analysis
How good is this algorithm?
It finds a local optimum, which can be arbitrarily worse than the optimal solution.
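A tiny concrete instance (my own example, not from the talk) shows how bad a local optimum can be: take the four corners of a wide rectangle with k = 2. The "top/bottom" centers are a fixed point of Lloyd's method, yet stretching the rectangle makes their cost arbitrarily worse than the "left/right" split.

```python
import numpy as np

def cost(X, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

# Four corners of a wide rectangle, k = 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [8.0, 0.0], [8.0, 1.0]])

# "Top/bottom" centers are a fixed point of Lloyd's method: every point is
# already assigned to its nearest center, and each center is already the mean
# of its cluster -- yet the clustering is poor.
bad = np.array([[4.0, 0.0], [4.0, 1.0]])   # local optimum, cost 64
good = np.array([[0.0, 0.5], [8.0, 0.5]])  # optimal left/right split, cost 1
```

Widening the rectangle from 8 to any width w drives the ratio of the two costs to w^2 / 1, so the gap is unbounded.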
Analysis
How fast is this algorithm?
In practice: VERY fast. E.g., on a digit-recognition dataset with n = 60,000 and d = 700, it converges after 60 iterations.
In theory: stay tuned.
Previous Work
Lower Bounds:
$\Omega(n)$ on the line
$\Omega(n^2)$ in the plane
Upper Bounds:
$O(n\Delta^2)$ on the line, where the spread is $\Delta = \frac{\max_{x,y} \|x - y\|}{\min_{x,y} \|x - y\|}$
Exponential bounds: $O(k^n)$, $O(n^{kd})$
Our Results
Lower Bound: $2^{\Omega(\sqrt{n})}$
Smoothed Upper Bounds:
$O\!\left(n^{2+2/d} \left(\tfrac{D}{\sigma}\right)^2 2^{2n/d}\right)$ and $O\!\left(n^{k+2/d} \left(\tfrac{D}{\sigma}\right)^2\right)$
where $\sigma$ is the smoothness factor and $D$ is the diameter of the point set.
Rest of the Talk
Lower Bound Sketch
Upper Bound Sketch
Open Problems
Lower Bound
General Idea: Make a "Reset Widget":
If k-Means takes time t on X, create a new point set X' such that k-Means takes time 2t to terminate.
Lower Bound: Sketch
Initial Clustering: t steps
Lower Bound: Sketch
With Widget: t steps, reset, t steps
Lower Bound Details
Three Main Ideas:
Signaling - recognizing when to start flipping the switch
Resetting - setting the cluster centers back to their original positions
Odds & Ends - clean-up to make the process recursive
Signaling
Suppose that when k-Means terminates, there is one cluster center p that has never appeared before. We use this as a signal to start the reset sequence.
Signaling
The construction places points at distances d and d - ε from p. By setting ε we can control exactly when p will switch.
Resetting
k properly placed points can reset the positions of the k current centers.
It is easy to compute the locations of the reset points so that the new cluster centers are placed correctly: adding a reset point to a cluster moves its mean from the current center to the intended center.
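The reset-point location follows directly from the mean update; a minimal sketch (the helper name `reset_point` is mine, not from the talk): to move a cluster S with mean c onto an intended center c_new by adding a single point p, we need (|S| c + p) / (|S| + 1) = c_new, i.e. p = (|S| + 1) c_new - |S| c.

```python
import numpy as np

def reset_point(cluster, intended_center):
    """Location of the single point to add so the new mean is intended_center.

    Solving (|S| * c + p) / (|S| + 1) = c_new for p gives
    p = (|S| + 1) * c_new - |S| * c, where c is the current mean.
    """
    S = len(cluster)
    c = cluster.mean(axis=0)
    return (S + 1) * np.asarray(intended_center, dtype=float) - S * c

cluster = np.array([[0.0, 0.0], [2.0, 0.0]])     # current mean: (1, 0)
p = reset_point(cluster, [3.0, 0.0])             # p = (7, 0)
new_mean = np.vstack([cluster, p]).mean(axis=0)  # lands exactly on (3, 0)
```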
Resetting
It is easy to compute the locations of the reset points so that the new cluster centers are placed correctly.
But we must avoid accidentally grabbing other points.
Resetting
Solution: add two new dimensions (w and z, alongside x and y).
[Figure: in the (w, z) plane, a new point is added that pulls the center to reset toward its intended new position.]
Multi-Signaling
So far we have shown how to signal and reset a single cluster. We can use one signal (from p) to induce a signal from all clusters.
Multi-Signaling
All centers are stable before the main signaling has taken place (points sit at distances d and d + ε).
Multi-Signaling
Due to the signaling, the center moves away; its distance grows beyond d + ε, so now all centers absorb the points above.
Multi-Signaling
Due to the signaling, the center q moves away, and now all centers absorb the points above. All clusters have previously unseen centers.
Putting All the Pieces Together
Start with a signaling configuration
Transform it so that all clusters signal
Use the new signal to reset the cluster centers (and therefore double the running time of k-Means)
Ensure the new configuration is signaling
Repeat...
Construction in Pictures
Construction: reflected points
Construction in Pictures
After t steps - a signal by all clusters
Construction in Pictures
Main clusters absorb the "catalyst" points; the yellow centers move away.
Construction in Pictures
The new points added are "reset" points - resetting the original cluster centers.
Construction in Pictures
We can ensure the "catalyst" points leave the main clusters.
Construction in Pictures
k-Means runs for another t steps. The original centers will be signaling.
Construction Results
If we repeat the resetting-widget construction r times:
$O(r^2)$ points in $O(r)$ dimensions
$O(r)$ clusters
Total running time: $2^{\Omega(r)}$ (with $n = O(r^2)$ points, this is the claimed $2^{\Omega(\sqrt{n})}$ bound)
Construction Remarks
The construction currently has a very large spread.
We can use more trickery to decrease the spread to a constant, albeit with a blow-up in the dimension.
As presented, the construction requires a specific placement of the initial cluster centers; in practice, centers are chosen randomly from the points. We can make the construction work even in this case.
Open question: can we decrease the dimensionality to a constant d?
Outline
k-Means Intuition
Lower Bound Sketch
Upper Bound Sketch
Open Problems
Smoothed Analysis
Assume each point came from a Gaussian distribution with variance $\sigma^2$:
Data collection is inherently noisy
Or add some Gaussian noise (the effect on the final clustering is minimal)
Key Fact: The probability mass inside any ball of radius $\epsilon$ is at most $(\epsilon/\sigma)^d$.
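The key fact follows from bounding the ball's mass by its volume times the Gaussian's maximum density; a small numeric sketch of that calculation (the helper name is mine, not from the talk):

```python
import math

def ball_mass_bound(eps, sigma, d):
    """Upper bound on the Gaussian mass inside any ball of radius eps.

    The density of an isotropic Gaussian with variance sigma^2 per coordinate
    is at most (2*pi*sigma^2)^(-d/2); a d-ball of radius eps has volume
    pi^(d/2) * eps^d / Gamma(d/2 + 1). Mass <= max density * volume.
    """
    max_density = (2 * math.pi * sigma ** 2) ** (-d / 2)
    volume = math.pi ** (d / 2) * eps ** d / math.gamma(d / 2 + 1)
    return max_density * volume

# The density * volume bound is always at most (eps/sigma)^d,
# the form quoted on the slide.
checks = [ball_mass_bound(0.1, 1.0, d) <= (0.1 / 1.0) ** d for d in (1, 2, 5, 20)]
```

The extra factor $2^{-d/2} / \Gamma(d/2 + 1)$ is below 1 for every $d \ge 1$, which is why the slide's simpler $(\epsilon/\sigma)^d$ form is safe.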
Potential Function
Use a potential function: $\Phi(C) = \sum_{x \in X} \|x - C(x)\|^2$
The original potential is at most $nD^2$.
The potential decreases every step:
Reassignment reduces $\|x - C(x)\|$
Center recomputation finds the optimal center for the given partition
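Both bullet points can be checked numerically by recording the potential after each half-step of Lloyd's method; a small sketch on random data (variable names are illustrative):

```python
import numpy as np

# Track the potential across Lloyd iterations: both the reassignment step and
# the mean-recomputation step can only decrease it.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
centers = X[rng.choice(50, size=3, replace=False)].copy()

def potential(X, centers, labels):
    return float(((X - centers[labels]) ** 2).sum())

history = []
for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)            # reassignment reduces ||x - C(x)||
    history.append(potential(X, centers, labels))
    for j in range(3):                   # recomputation: the optimal center
        if np.any(labels == j):          # for a fixed partition is the mean
            centers[j] = X[labels == j].mean(axis=0)
    history.append(potential(X, centers, labels))

# The potential never increases between consecutive measurements.
monotone = all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
```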
Potential Decrease
Lemma: Let $S$ be a point set with optimal center $c^*$, and let $c$ be any other point. Then:
$\Phi(c) - \Phi(c^*) = |S| \, \|c - c^*\|^2$

\begin{align*}
\Phi(c) &= \sum_{x \in S} (x - c) \cdot (x - c) \\
&= \sum_{x \in S} (x - c^* + c^* - c) \cdot (x - c^* + c^* - c) \\
&= \sum_{x \in S} \left[ (x - c^*) \cdot (x - c^*) + (c - c^*) \cdot (c - c^*) - 2 (c - c^*) \cdot (x - c^*) \right] \\
&= \Phi(c^*) + |S| \, \|c - c^*\|^2 - 2 (c - c^*) \cdot \sum_{x \in S} (x - c^*) \\
&= \Phi(c^*) + |S| \, \|c - c^*\|^2,
\end{align*}

since $\sum_{x \in S} (x - c^*) = 0$ when $c^*$ is the mean of $S$.
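The identity is easy to sanity-check numerically; a minimal sketch (random data, names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(7, 3))          # a small random point set
c_star = S.mean(axis=0)              # the optimal (mean) center
c = np.array([1.0, -2.0, 0.5])       # any other point

def phi(points, center):
    """Potential of one cluster: sum of squared distances to the center."""
    return float(((points - center) ** 2).sum())

lhs = phi(S, c) - phi(S, c_star)           # Phi(c) - Phi(c*)
rhs = len(S) * float(((c - c_star) ** 2).sum())  # |S| * ||c - c*||^2
# The two sides agree, as the lemma claims.
```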
Main Lemma
In a smoothed point set, fix an $\epsilon > 0$. Then with probability at least $1 - 2^{2n} (\epsilon/\sigma)^d$, for any two clusters $S$ and $T$ with optimal centers $c(S)$ and $c(T)$, we have:
$\|c(S) - c(T)\| \geq \frac{\epsilon}{2 \min(|S|, |T|)}$
Proof Sketch
Suppose $|S| < |T|$, and fix $x$ with $x \in S$, $x \notin T$. Fix all points except for $x$.
To ensure $\|c(S) - c(T)\| \leq \epsilon$, $x$ must lie in a ball of diameter $|S|\epsilon$.
Since $x$ came from a Gaussian of variance $\sigma^2$, this probability is at most $(|S| \epsilon \sigma^{-1})^d$.
Finally, union-bound the total error probability over all possible pairs of sets: $2^{2n} (\epsilon/\sigma)^d$.
Potential Drop
At each iteration, examine a cluster $S$ whose center changed from $c$ to $c'$:
$\|c - c'\| \geq \frac{\epsilon}{2|S|}$
Therefore, the potential drops by $|S| \, \|c - c'\|^2 \geq \frac{\epsilon^2}{4|S|} \geq \frac{\epsilon^2}{4n}$.
Since the initial potential is at most $nD^2$, after $m = \frac{4n^2 D^2}{\epsilon^2}$ iterations the algorithm must terminate.
!2
To Finish Up
Choose $\epsilon = \sigma n^{-1/d} 2^{-2n/d}$. Then the total probability of failure is $2^{2n} (\epsilon/\sigma)^d = 1/n$.
The total running time is $m = \frac{4n^2 D^2}{\epsilon^2} = O\!\left(n^{2+2/d} \left(\tfrac{D}{\sigma}\right)^2 2^{2n/d}\right)$.
Remark: polynomial for $d = \Omega(n / \log n)$.
Upper Bound (2)
We used the union bound over all possible sets. However, due to the geometry, not all sets arise: the total number of distinct clusters that can appear is $O(n^{kd})$.
Carrying the same calculations through, we can bound the total number of iterations by:
$O\!\left(n^{k+2/d} \left(\tfrac{D}{\sigma}\right)^2\right)$
Remarks
The noise need not be Gaussian; we only need to avoid large probabilistic point masses.
E.g., Lipschitz conditions are enough.
Outline
k-Means Intuition
Lower Bound Sketch
Upper Bound Sketch
Open Problems
Conclusion - Lower Bounds
We showed a super-polynomial lower bound of $2^{\Omega(\sqrt{n})}$ on the execution time of k-Means.
However, the construction requires many dimensions, and does not preclude an $n^{O(d)}$ upper bound.
Conclusion - Upper Bounds
We can use smoothed analysis to reduce the best known upper bounds for k-Means.
But the bound is not polynomial for small values of d or large values of k.
Even with smoothness there is an $\Omega(n)$ lower bound, which is never observed in practice.
Thank you
Any Questions?