18
k-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory

k-Means and DBSCAN

Embed Size (px)

DESCRIPTION

k-Means and DBSCAN. Erik Zeitler Uppsala Database Laboratory. k-Means. Input M (set of points) k (number of clusters) Output µ 1 , …, µ k (cluster centroids) k-Means clusters the M point into K clusters by minimizing the squared error function - PowerPoint PPT Presentation

Citation preview

Page 1: k-Means and DBSCAN

k-Means and DBSCAN

Erik ZeitlerUppsala Database Laboratory

Page 2: k-Means and DBSCAN

23-04-19 Erik Zeitler 2

k-Means

Input• M (set of points)• k (number of clusters)

Output• µ1, …, µk (cluster centroids)

k-Means clusters the M point into K clustersby minimizing the squared error function

clusters Si; i=1, …, k. µi is the centroid of all xjSi.

2

1

k

i Sxij

ij

µx

Page 3: k-Means and DBSCAN

23-04-19 Erik Zeitler 3

k-Means algorithm

select (m1 … mK) randomly from M % initial centroidsdo

(µ1 … µK) = (m1 … mK)

all clusters Ci = {}for each point p in M

% compute cluster membership of p

[µi, i] = min(dist(µ, p))% assign p to the corresponding cluster:

Ci = Ci {p}end

for each cluster Ci % recompute the centroids

mi = avg(x in Ci)

while exists mi µi % convergence criterion

Page 4: k-Means and DBSCAN

23-04-19 Erik Zeitler 4

K-Means on three clusters

Page 5: k-Means and DBSCAN

23-04-19 Erik Zeitler 5

I’m feeling Unlucky

Bad initial points

Page 6: k-Means and DBSCAN

23-04-19 Erik Zeitler 6

Eat this!

Non-spherical clusters

Page 7: k-Means and DBSCAN

23-04-19 Erik Zeitler 7

k-Means in practice

How to choose initial centroids• select randomly among the data points• generate completely randomly

How to choose k• study the data• run k-Means for different k

measure squared error for each k

Run k-Means many times!• Get many choices of initial points

Page 8: k-Means and DBSCAN

23-04-19 Erik Zeitler 8

k-Means pros and cons

+ -

Easy Fast

Works only for ”well-shaped”

clusters

Scalable? Sensitive to outliers

Sensitive to noise

Must know k a priori

Page 9: k-Means and DBSCAN

23-04-19 Erik Zeitler 9

Questions

Euclidean distance results in spherical clusters• What cluster shape does the Manhattan distance give?• Think of other distance measures too. What cluster

shapes will those yield?

Assuming that the K-means algorithm converges in I iterations, with N points and X features for each point • give an approximation of the complexity of the

algorithm expressed in K, I, N, and X.

Can the K-means algorithm be parallellized?• How?

Page 10: k-Means and DBSCAN

23-04-19 Erik Zeitler 10

DBSCAN

Density Based Spatial Clustering of Applications with Noise

Basic idea:• If an object p is density connected to q,

then p and q belong to the same cluster

• If an object is not density connected to any other object it is considered noise

Page 11: k-Means and DBSCAN

23-04-19 Erik Zeitler 11

-neigborhood• The neighborhood within a radius of an

object core object

An object is a core object iffthere are more than MinPts objects in its -neighbourhood

directly density reachable (ddr)An object p is ddr from q iff

q is a core object and p is inside the neighbourhood of q

Definitions

p

q

Page 12: k-Means and DBSCAN

23-04-19 Erik Zeitler 12

density reachable (dr)An object q is dr from p iff

there exists a chain of objects p1 … pn s.t.- p1 is ddr from p, - p2 is ddr from p1, - p3 is ddr from … and q is ddr from pn

density connected (dc)p is dc to q iff

- exist an object o such that p is dr from o - and q is dr from o

Reachability and Connectivity

q pp1p2

o

q p

Page 13: k-Means and DBSCAN

23-04-19 Erik Zeitler 13

Recall…

Basic idea:• If an object p is density connected to q,

then p and q belong to the same cluster

• If an object is not density connected to any other object it is considered noise

Page 14: k-Means and DBSCAN

23-04-19 Erik Zeitler 14

DBSCAN, take 1

i = 0dotake a point p from Mfind the set of points P which are density connected to pif P = {}

M = M \ {p}else

Ci=Pi=i+1M = M \ P

endwhile M {}

HOW?

p

Page 15: k-Means and DBSCAN

23-04-19 Erik Zeitler 15

DBSCAN, take 2

i = 0find the set of core points CP in Mdotake a point p from CPfind the set of points P which are density reachable to pif P = {}

M = M \ {p}else

Ci=Pi=i+1M = M \ P

endwhile M {}

HOW?Let’s call this

function dr(p) p

Page 16: k-Means and DBSCAN

23-04-19 Erik Zeitler 16

Implement dr

create function dr(p) -> P

C = {p}P = {p}

doremove a point p' from Cfind all points X that are ddr(p')C = C (X \ (P X)) % add newly discovered

% points to CP = P X

while C ≠ {}result P

p’p’

Page 17: k-Means and DBSCAN

23-04-19 Erik Zeitler 17

DBSCAN pros and cons

+ -

Clusters of arbitrary

shapeRobust to

noise

Requires connected regions of sufficiently

high density

Does not need an a

priori k

Data sets with varying densities are problematic

Scalable?

Page 18: k-Means and DBSCAN

23-04-19 Erik Zeitler 18

Questions

Why is the dc criterion useful to define a cluster, instead of dr or ddr?

For which points are density reachable symmetric?i.e. for which p, q: dr(p, q) and dr(q, p)?

Express using only core objects and ddr, which objects will belong to a cluster