Upload
kylynn-duran
View
31
Download
2
Embed Size (px)
DESCRIPTION
k-Means and DBSCAN. Erik Zeitler Uppsala Database Laboratory. k-Means. Input M (set of points) k (number of clusters) Output µ 1 , …, µ k (cluster centroids) k-Means clusters the M point into K clusters by minimizing the squared error function - PowerPoint PPT Presentation
Citation preview
k-Means and DBSCAN
Erik ZeitlerUppsala Database Laboratory
23-04-19 Erik Zeitler 2
k-Means
Input• M (set of points)• k (number of clusters)
Output• µ1, …, µk (cluster centroids)
k-Means clusters the M point into K clustersby minimizing the squared error function
clusters Si; i=1, …, k. µi is the centroid of all xjSi.
2
1
k
i Sxij
ij
µx
23-04-19 Erik Zeitler 3
k-Means algorithm
select (m1 … mK) randomly from M % initial centroidsdo
(µ1 … µK) = (m1 … mK)
all clusters Ci = {}for each point p in M
% compute cluster membership of p
[µi, i] = min(dist(µ, p))% assign p to the corresponding cluster:
Ci = Ci {p}end
for each cluster Ci % recompute the centroids
mi = avg(x in Ci)
while exists mi µi % convergence criterion
23-04-19 Erik Zeitler 4
K-Means on three clusters
23-04-19 Erik Zeitler 5
I’m feeling Unlucky
Bad initial points
23-04-19 Erik Zeitler 6
Eat this!
Non-spherical clusters
23-04-19 Erik Zeitler 7
k-Means in practice
How to choose initial centroids• select randomly among the data points• generate completely randomly
How to choose k• study the data• run k-Means for different k
measure squared error for each k
Run k-Means many times!• Get many choices of initial points
23-04-19 Erik Zeitler 8
k-Means pros and cons
+ -
Easy Fast
Works only for ”well-shaped”
clusters
Scalable? Sensitive to outliers
Sensitive to noise
Must know k a priori
23-04-19 Erik Zeitler 9
Questions
Euclidean distance results in spherical clusters• What cluster shape does the Manhattan distance give?• Think of other distance measures too. What cluster
shapes will those yield?
Assuming that the K-means algorithm converges in I iterations, with N points and X features for each point • give an approximation of the complexity of the
algorithm expressed in K, I, N, and X.
Can the K-means algorithm be parallellized?• How?
23-04-19 Erik Zeitler 10
DBSCAN
Density Based Spatial Clustering of Applications with Noise
Basic idea:• If an object p is density connected to q,
then p and q belong to the same cluster
• If an object is not density connected to any other object it is considered noise
23-04-19 Erik Zeitler 11
-neigborhood• The neighborhood within a radius of an
object core object
An object is a core object iffthere are more than MinPts objects in its -neighbourhood
directly density reachable (ddr)An object p is ddr from q iff
q is a core object and p is inside the neighbourhood of q
Definitions
p
q
23-04-19 Erik Zeitler 12
density reachable (dr)An object q is dr from p iff
there exists a chain of objects p1 … pn s.t.- p1 is ddr from p, - p2 is ddr from p1, - p3 is ddr from … and q is ddr from pn
density connected (dc)p is dc to q iff
- exist an object o such that p is dr from o - and q is dr from o
Reachability and Connectivity
q pp1p2
o
q p
23-04-19 Erik Zeitler 13
Recall…
Basic idea:• If an object p is density connected to q,
then p and q belong to the same cluster
• If an object is not density connected to any other object it is considered noise
23-04-19 Erik Zeitler 14
DBSCAN, take 1
i = 0dotake a point p from Mfind the set of points P which are density connected to pif P = {}
M = M \ {p}else
Ci=Pi=i+1M = M \ P
endwhile M {}
HOW?
p
23-04-19 Erik Zeitler 15
DBSCAN, take 2
i = 0find the set of core points CP in Mdotake a point p from CPfind the set of points P which are density reachable to pif P = {}
M = M \ {p}else
Ci=Pi=i+1M = M \ P
endwhile M {}
HOW?Let’s call this
function dr(p) p
23-04-19 Erik Zeitler 16
Implement dr
create function dr(p) -> P
C = {p}P = {p}
doremove a point p' from Cfind all points X that are ddr(p')C = C (X \ (P X)) % add newly discovered
% points to CP = P X
while C ≠ {}result P
p’p’
23-04-19 Erik Zeitler 17
DBSCAN pros and cons
+ -
Clusters of arbitrary
shapeRobust to
noise
Requires connected regions of sufficiently
high density
Does not need an a
priori k
Data sets with varying densities are problematic
Scalable?
23-04-19 Erik Zeitler 18
Questions
Why is the dc criterion useful to define a cluster, instead of dr or ddr?
For which points are density reachable symmetric?i.e. for which p, q: dr(p, q) and dr(q, p)?
Express using only core objects and ddr, which objects will belong to a cluster