Christian Sohler
HEINZ NIXDORF INSTITUT, Universität Paderborn
Algorithmen und Komplexität
A Fast PTAS for k-Means Clustering
Dan Feldman, Tel Aviv University; Morteza Monemizadeh, Christian Sohler, Universität Paderborn
Simple coreset for clustering problems: Overview
• Introduction
• Weak Coresets
  - Definition
  - Intuition
  - The construction
  - A sketch of analysis
• The k-means PTAS
• Conclusions
Introduction: Clustering
Clustering
• Partition the input into sets (clusters) such that
  - Objects in the same cluster are similar
  - Objects in different clusters are dissimilar
Goal
• Simplification
• Discovery of patterns
Procedure
• Map objects to Euclidean space => point set P
• Points in the same cluster are close
• Points in different clusters are far away from each other
Introduction: k-means clustering
Clustering with Prototypes
• One prototype (center) for each cluster
k-Means Clustering
• k clusters $C_1, \dots, C_k$
• One center $c_i$ for each cluster $C_i$
• Minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$
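As a small illustration of the objective, the following Python sketch evaluates the sum of squared distances for a given partition and given centers. The names `sq_dist` and `kmeans_objective` are illustrative, the data is made up, and d(p, c) is assumed to be the Euclidean distance.

```python
def sq_dist(p, c):
    """Squared Euclidean distance d(p, c)^2 between two points given as tuples."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def kmeans_objective(clusters, centers):
    """Sum over all clusters C_i and all points p in C_i of d(p, c_i)^2."""
    return sum(
        sq_dist(p, c)
        for cluster, c in zip(clusters, centers)
        for p in cluster
    )


# Two clusters on a line together with their centers (illustrative data).
clusters = [[(0.0,), (2.0,)], [(10.0,), (14.0,)]]
centers = [(1.0,), (12.0,)]
objective = kmeans_objective(clusters, centers)
```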
Introduction: Simplification / Lossy Compression
[Figure: example with the value triples (128,59,88) and (218,181,163)]
Introduction: Properties of k-means
Properties of k-means
Optimal solution, if
• Centers are given => assign each point to the nearest center
• Clusters are given => use the centroid (mean) of each cluster as its center
Notation: cost(P,C) denotes the cost of the solution defined this way, i.e. the sum of squared distances from each point of P to its nearest center in C.
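The two properties and the cost notation can be sketched in Python as follows. The helper names `assign`, `centroid` and `cost` are illustrative and not taken from the talk; distances are assumed to be squared Euclidean.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def cost(P, C):
    """cost(P, C): every point of P pays the squared distance to its nearest center in C."""
    return sum(min(sq_dist(p, c) for c in C) for p in P)


def assign(P, C):
    """Centers given: return, for each point, the index of its nearest center."""
    return [min(range(len(C)), key=lambda i: sq_dist(p, C[i])) for p in P]


def centroid(cluster):
    """Cluster given: the mean of its points is the best single center."""
    n, dim = len(cluster), len(cluster[0])
    return tuple(sum(p[j] for p in cluster) / n for j in range(dim))


P = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
C = [(1.0, 0.0), (11.0, 0.0)]
labels = assign(P, C)   # which center serves each point
total = cost(P, C)      # cost(P, C) for these centers
```

Alternating the two rules (assign, then recompute centroids) never increases cost(P, C), which is the idea behind the usual iterative heuristic for k-means.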
Weak Coresets: Centroid Sets
Definition ($\varepsilon$-approx. centroid set)
A set S is called an $\varepsilon$-approximate centroid set if it contains a subset $C \subseteq S$ s.t. $\mathrm{cost}(P,C) \le (1+\varepsilon)\,\mathrm{cost}(P,\mathrm{Opt})$
Lemma [KSS04]
The centroid of a random set of $2/\varepsilon$ points is with constant probability a $(1+\varepsilon)$-approx. of the optimal center of P.
Corollary
The set of all centroids of subsets of $2/\varepsilon$ points is an $\varepsilon$-approx. centroid set.
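The lemma can be checked empirically with a small experiment. The sketch below is only an illustration under assumptions: it uses k = 1, synthetic Gaussian data, sampling with repetition, and a hypothetical helper `success_rate`. It counts how often the centroid of $2/\varepsilon$ random points is a $(1+\varepsilon)$-approximation of the optimal center.

```python
import random


def sq_dist(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def centroid(points):
    n, dim = len(points), len(points[0])
    return tuple(sum(p[j] for p in points) / n for j in range(dim))


def one_mean_cost(P, c):
    """Cost of serving all of P with the single center c."""
    return sum(sq_dist(p, c) for p in P)


def success_rate(P, eps, trials, rng):
    """Fraction of trials in which the sample centroid is a (1+eps)-approximation."""
    m = max(1, round(2 / eps))              # sample size 2/eps
    opt = one_mean_cost(P, centroid(P))     # the centroid of P is the optimal center
    hits = 0
    for _ in range(trials):
        sample = [rng.choice(P) for _ in range(m)]   # with repetition (assumption)
        if one_mean_cost(P, centroid(sample)) <= (1 + eps) * opt:
            hits += 1
    return hits / trials


rng = random.Random(0)
P = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
rate = success_rate(P, eps=0.2, trials=200, rng=rng)
```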
Weak Coresets: Definition
Definition (weak $\varepsilon$-coreset for k-means)
A pair (K,S) is called a weak $\varepsilon$-coreset for P if for every set C of k centers from the $\varepsilon$-approx. centroid set S we have
$(1-\varepsilon)\,\mathrm{cost}(P,C) \le \mathrm{cost}(K,C) \le (1+\varepsilon)\,\mathrm{cost}(P,C)$
[Figure: point set P (light blue); set of solutions S (yellow); a possible coreset with weights (red), which approximates the cost of k centers (violet) from S]
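The definition can be turned into a brute-force check on toy data. The following sketch assumes that cost(K, C) for a weighted set K multiplies each point's squared distance by its weight; the names `weighted_cost`, `is_eps_close` and `is_weak_coreset` and the data are illustrative.

```python
from itertools import combinations


def sq_dist(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def weighted_cost(K, C):
    """cost(K, C) for a weighted set K given as (point, weight) pairs."""
    return sum(w * min(sq_dist(p, c) for c in C) for p, w in K)


def is_eps_close(P, K, C, eps):
    """Check (1-eps) cost(P,C) <= cost(K,C) <= (1+eps) cost(P,C) for one C."""
    exact = weighted_cost([(p, 1.0) for p in P], C)
    approx = weighted_cost(K, C)
    return (1 - eps) * exact <= approx <= (1 + eps) * exact


def is_weak_coreset(P, K, S, k, eps):
    """The guarantee is only required for sets of k centers taken from S."""
    return all(is_eps_close(P, K, C, eps) for C in combinations(S, k))


P = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
K = [((1.0,), 3.0), ((11.0,), 3.0)]          # two weighted points
ok_small_S = is_weak_coreset(P, K, [(4.0,), (8.0,)], k=2, eps=0.1)
ok_large_S = is_weak_coreset(P, K, [(1.0,), (4.0,), (8.0,)], k=2, eps=0.1)
```

In this toy example the same K passes for the smaller candidate set and fails for the larger one, which shows why restricting the guarantee to centers from S makes the requirement weaker.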
Weak Coresets: Ideal Sampling
Problem
• Given n numbers $a_1, \dots, a_n > 0$
• Task: approximate $A := \sum_i a_i$ by random sampling
Ideal Sampling
• Assign weights $w_1, \dots, w_n$ to the numbers
• $w_j = A / a_j$
• $\Pr[x = j] = a_j / A$
• Estimator: $w_x a_x$
Properties of estimator:
(1) $w_x a_x = A$ (0 variance)
(2) Expected weight of number j is 1
Only problem:
Weights can be very large
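A minimal Python sketch of ideal sampling, assuming the probabilities $a_j / A$ and weights $A / a_j$ stated above. The function name `ideal_sample` and the numbers are illustrative. Note that the sketch computes A exactly in order to sample, so it only illustrates why the estimator has zero variance.

```python
import random


def ideal_sample(a, rng):
    """Draw x with Pr[x = j] = a_j / A and return (x, w_x) with w_x = A / a_x."""
    A = sum(a)
    x = rng.choices(range(len(a)), weights=a, k=1)[0]
    return x, A / a[x]


rng = random.Random(1)
a = [1.0, 2.0, 5.0, 12.0]
A = sum(a)

# Property (1): every single draw yields the estimate w_x * a_x = A.
estimates = []
for _ in range(1000):
    x, w = ideal_sample(a, rng)
    estimates.append(w * a[x])

# Property (2): the expected weight of number j is Pr[x = j] * w_j = 1.
expected_weights = [(aj / A) * (A / aj) for aj in a]

# The problem named on the slide: small numbers receive large weights.
largest_weight = max(A / aj for aj in a)
```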
Weak Coresets: Construction
Step 1
• Compute constant factor approximation
Step 2
• Consider each cluster separately
Main idea: Apply ideal sampling to each cluster C
$\Pr[p_i \text{ is taken}] = \mathrm{dist}(p_i, c) / \mathrm{cost}(C, c)$
$w(p_i) = \mathrm{cost}(C, c) / \mathrm{dist}(p_i, c)$
But what about high weights?
Step 3
• A little twist
  - Uniform sampling from a small ball with radius = average distance / $\varepsilon$
  - Ideal sampling from 'outliers'
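The twist for a single cluster might look as follows. This is a sketch under several assumptions that the slides do not fix: dist(p, c) is taken to be the squared Euclidean distance, the sample sizes `m_ball` and `m_out` are free parameters, and each draw's weight is divided by the number of draws. The function name `sample_cluster` is hypothetical.

```python
import random


def sq_dist(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def sample_cluster(points, center, eps, m_ball, m_out, rng):
    """Return weighted samples (point, weight) for one cluster, plus the ball radius."""
    dist = [sq_dist(p, center) for p in points]
    radius = (sum(dist) / len(points)) / eps        # average distance / eps
    ball = [p for p, d in zip(points, dist) if d <= radius]
    outliers = [(p, d) for p, d in zip(points, dist) if d > radius]

    sample = []
    # Uniform sampling inside the ball: each draw represents |ball| / m_ball points.
    for _ in range(m_ball):
        sample.append((rng.choice(ball), len(ball) / m_ball))
    # Ideal sampling among the outliers: probability proportional to dist.
    if outliers:
        total = sum(d for _, d in outliers)
        draws = rng.choices(outliers, weights=[d for _, d in outliers], k=m_out)
        for p, d in draws:
            sample.append((p, total / (d * m_out)))
    return sample, radius


rng = random.Random(2)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(195)]
points += [(50.0, 50.0)] * 5                        # a few far-away outliers
eps, m_ball, m_out = 0.1, 20, 5
sample, radius = sample_cluster(points, (0.0, 0.0), eps, m_ball, m_out, rng)
```

Because every outlier has dist greater than the radius, its weight in this sketch stays below $\varepsilon |C| / m_{out}$, which avoids the very large weights of plain ideal sampling.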
Weak Coresets: Analysis
Fix arbitrary set of centers K
• Case (a): nearest center is 'far away'
  - At least a $(1-\varepsilon)$-fraction of the points is inside the ball, by choice of the radius
  - Weight of samples from outliers is at most $\varepsilon |C|$
  - Forget about outliers!
  - It doesn't matter where the points lie inside the ball
• Case (b): nearest center is 'near'
  - Almost ideal sampling
  - Expectation is cost(C,K)
  - Low variance
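The two bounds used in case (a) follow from a short calculation. The sketch below assumes that D denotes the average of dist(p, c) over the cluster C, so that cost(C, c) = |C| · D, and that the ball has radius $D/\varepsilon$ as in the construction.

```latex
% Markov's inequality for a uniformly random point p of C:
\[
  \Pr_{p \in C}\bigl[\mathrm{dist}(p,c) > D/\varepsilon\bigr]
  \;\le\; \frac{D}{D/\varepsilon} \;=\; \varepsilon ,
\]
% so at least a (1 - \varepsilon)-fraction of the points lies inside the ball.

% Ideal-sampling weight of an outlier p, i.e. dist(p,c) > D/\varepsilon:
\[
  w(p) \;=\; \frac{\mathrm{cost}(C,c)}{\mathrm{dist}(p,c)}
       \;<\; \frac{|C| \cdot D}{D/\varepsilon}
       \;=\; \varepsilon\,|C| .
\]
```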
Weak Coresets: Result
The centroid set
• S is the set of all centroids of $2/\varepsilon$ points (with repetition) from our sample set K
• Can show that K approximates all solutions from S
• Can show that S is an $\varepsilon$-approx. centroid set w.h.p.
Theorem
One can compute in O(nkd) time a weak $\varepsilon$-coreset (K,S). The size of K is poly(k, $1/\varepsilon$). S is the set of all centroids of subsets of K of size $2/\varepsilon$.
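Enumerating the centroid set S from a sample set K can be sketched as below. The sketch ignores the weights of K (a simplifying assumption), writes m for $2/\varepsilon$, and uses the illustrative name `centroid_set`.

```python
from itertools import combinations_with_replacement


def centroid(points):
    n, dim = len(points), len(points[0])
    return tuple(sum(p[j] for p in points) / n for j in range(dim))


def centroid_set(K, m):
    """All centroids of multisets of m points from K (points may repeat)."""
    return {centroid(subset) for subset in combinations_with_replacement(K, m)}


K = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
S = centroid_set(K, 2)   # m = 2 corresponds to eps = 1 in this toy example
```

The number of multisets grows as the binomial coefficient of (|K| + m - 1) over m, so this enumeration is only feasible because the size of K does not depend on n.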
Weak Coresets: Applications
Fast-k-Means-PTAS(P,k)
1. Compute weak coreset K
2. Project K on a poly($1/\varepsilon$, k)-dimensional space
3. Exhaustively search for the best solution C of (the projection of) the centroid set
4. Return the centroids of the points that create C
Running time: $O\bigl(nkd + (k/\varepsilon)^{\tilde{O}(k/\varepsilon)}\bigr)$
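Step 3 of the procedure above can be sketched as a brute-force search over all sets of k candidate centers. The sketch is a simplification: it skips the projection of step 2, takes the weighted coreset K and the candidate set S as given, and uses the hypothetical names `weighted_cost` and `best_centers`.

```python
from itertools import combinations


def sq_dist(p, c):
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def weighted_cost(K, C):
    """cost(K, C) for a weighted set K given as (point, weight) pairs."""
    return sum(w * min(sq_dist(p, c) for c in C) for p, w in K)


def best_centers(K, S, k):
    """Try every set of k candidate centers from S and keep the cheapest one on K."""
    return min(combinations(S, k), key=lambda C: weighted_cost(K, C))


K = [((0.0,), 2.0), ((2.0,), 2.0), ((10.0,), 3.0)]   # toy weighted coreset
S = [(0.0,), (1.0,), (2.0,), (10.0,)]                 # toy candidate centers
best = best_centers(K, S, 2)
best_cost = weighted_cost(K, best)
```

The search only touches K and S, never the original point set, which is why its cost is independent of n.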
Summary
Weak Coresets
• Size independent of n and d
• Fast PTAS for k-means
• First PTAS for kernel k-means (if the kernel maps into a finite-dimensional space)
Christian Sohler
Heinz Nixdorf Institut & Institut für Informatik
Universität Paderborn
Fürstenallee 11
33102 Paderborn, Germany
Tel.: +49 (0) 52 51/60 64 27
Fax: +49 (0) 52 51/62 64 82
E-Mail: [email protected]
http://www.upb.de/cs/ag-madh
Thank you!