K-means++ and K-means|| (K-means Parallel)
Jun Wang

Review of K-means
- Simple and fast
- Choose k centers randomly
- Assign each point to its nearest center
- Update the centers
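The three-step loop above (random seeding, assign, update) is plain Lloyd's k-means. A minimal NumPy sketch, for illustration only; the function name and the convergence check are my own choices, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    """Plain k-means: random seeding, then repeat assign/update."""
    rng = np.random.default_rng(rng)
    # Choose k centers uniformly at random from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest center (squared distances).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update each center to the mean of its assigned points;
        # keep the old center if a cluster came up empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The weakness the next slides address is the first line of the loop: purely random seeding can place several initial centers in the same cluster, which is what k-means++ seeding improves.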




K-means++
- You are most likely already using it: it is the standard seeding in common k-means implementations.
- Spend some time on choosing the k centers (seeding).
- Save time on the clustering itself.

K-means++ algorithm
Seeding:
- Choose one center from X uniformly at random.
- For the remaining k-1 centers: sample one center at a time from X with probability P_i, and update the center matrix after each pick.
Clustering uses the squared distances d_i^2 = min over the chosen centers c_j of the squared Euclidean distance between x_i and c_j.

How to choose the k centers:
- Choose a point from X uniformly at random.
- Calculate all the d_i^2.
- Calculate the P_i: with D = d_1^2 + d_2^2 + d_3^2 + ... + d_n^2, set P_i = d_i^2 / D, so that the P_i sum to 1.
- Points further away from the chosen point (the red point in the slide's figure) have a better chance of being chosen next.
- Pick the next center with probability P_i.
- Keep repeating (update the center matrix, recalculate the d_i^2, recalculate the P_i) until k centers are found.

K-means|| algorithm
Seeding:
- Choose a small subset C from X.
- Assign a weight to each point in C.
- Cluster the weighted C to get the k centers.

Choosing the subset C:
- Let D = the sum of squared distances = d_1^2 + d_2^2 + d_3^2 + ... + d_n^2.
- Let L be a function of k, for example 0.2k or 1.5k.
- For ln(D) rounds: pick each point in X independently with a Bernoulli draw, P(chosen) = L * d_i^2 / D, and update C with the chosen points.

How many points end up in C?
- There are ln(D) rounds, and each round is expected to add 1*P_1 + 1*P_2 + ... + 1*P_n = L points.
- So about ln(D) * L points in total.

Clustering the subset C:
- In the slide's figure, the red points are the subset C.
- For a point A in C, calculate the distances from A to the other points in C, and find the smallest; in this case it is d_c1.
- Calculate the distances between A and all the points in X, giving the d_xi.
- Compare each d_xi to d_c1, and let W_A = the number of d_xi that are smaller, i.e. the number of points of X that lie closer to A than A's nearest neighbour in C.
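The k-means++ seeding steps above (uniform first pick, then D^2-weighted sampling) can be sketched in NumPy as follows; the function name is mine, and this covers only the seeding, after which ordinary k-means iterations run from these centers:

```python
import numpy as np

def kmeans_pp_seeding(X, k, rng=None):
    """k-means++ seeding: first center uniform, rest D^2-weighted."""
    rng = np.random.default_rng(rng)
    # First center: one point chosen uniformly at random from X.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # d_i^2 = min squared distance from x_i to the centers so far.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=2), axis=1)
        # P_i = d_i^2 / D: far-away points are more likely to be picked.
        p = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=p)])
    return np.array(centers)
```

Note how the P_i make it nearly impossible to pick a second center right next to an existing one: such points have d_i^2 close to zero.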