CSC 4510 – Machine Learning Dr. Mary‐Angela Papalaskari Department of Computing Sciences Villanova University
Course website: www.csc.villanova.edu/~map/4510/
11: Unsupervised Learning ‐ Clustering
Some of the slides in this presentation are adapted from: • Prof. Frank Klassner’s ML class at Villanova • the University of Manchester ML course http://www.cs.manchester.ac.uk/ugt/COMP24111/ • The Stanford online ML course http://www.ml-class.org/
Supervised learning
Training set: (figure adapted from the Stanford online ML course, http://www.ml-class.org/)
Unsupervised learning
Training set: (figure adapted from the Stanford online ML course, http://www.ml-class.org/)
Unsupervised Learning • Learning “what normally happens” • No output • Clustering: Grouping similar instances • Example applications – Customer segmentation – Image compression: Color quantization – Bioinformatics: Learning motifs
4 CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University
Clustering Algorithms • K-means • Hierarchical – Bottom-up or top-down • Probabilistic – Expectation Maximization (E-M)
Clustering algorithms • Partitioning method: Construct a partition of n examples into a set of K clusters
• Given: a set of examples and the number K • Find: a partition into K clusters that optimizes the chosen partitioning criterion – Globally optimal: exhaustively enumerate all partitions – Effective heuristic method: K-means algorithm.
http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt
K-Means • Assumes instances are real-valued vectors. • Clusters are based on centroids: the center of gravity, or mean, of the points in a cluster c.
• Reassignment of instances to clusters is based on distance to the current cluster centroids.
Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt
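The centroid of a cluster is just the component-wise mean of its points. A minimal sketch in plain Python, using hypothetical illustrative values:

```python
# Hypothetical cluster of 2-D points (illustrative values, not from the slides)
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

# Centroid = component-wise mean of the points in the cluster
centroid = tuple(sum(p[d] for p in points) / len(points)
                 for d in range(len(points[0])))
print(centroid)  # (3.0, 4.0)
```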
K-means intuition • Randomly choose k points as seeds, one per cluster. • Form initial clusters based on these seeds. • Iterate, repeatedly recomputing centroids and reallocating instances to improve the overall clustering.
• Stop when the clustering converges or after a fixed number of iterations.
K‐Means Algorithm
http://www.csc.villanova.edu/~matuszek/spring2012/index2012.html, based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt
• Let d be the distance measure between instances. • Select k random points {s1, s2, …, sk} as seeds. • Until clustering converges (or another stopping criterion is met): – For each instance xi:
• Assign xi to the cluster cj such that d(xi, sj) is minimal.
– Update the seeds to the centroid of each cluster: • For each cluster cj, sj = μ(cj)
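The steps above can be sketched directly in plain Python. This is an illustrative implementation, not the course's reference code; the function and variable names (`kmeans`, `euclidean`, `seed`) are my own, and the distance measure is passed in as a parameter:

```python
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, distance, max_iters=100, seed=0):
    """Sketch of the k-means algorithm; `points` is a list of tuples."""
    rng = random.Random(seed)
    seeds = rng.sample(points, k)  # select k random points as seeds
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assign each instance x_i to the cluster c_j with minimal d(x_i, s_j)
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: distance(x, seeds[j]))
            clusters[j].append(x)
        # Update each seed to the centroid (mean) of its cluster
        new_seeds = []
        for j, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_seeds.append(tuple(sum(p[d] for p in cluster) / len(cluster)
                                       for d in range(dim)))
            else:
                new_seeds.append(seeds[j])  # keep the seed of an empty cluster
        if new_seeds == seeds:  # converged: assignments can no longer change
            break
        seeds = new_seeds
    return seeds, clusters
```

For example, running `kmeans` on six points forming two well-separated blobs returns two centroids, one per blob.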
Distance measures • Euclidean distance • Manhattan distance • Hamming distance
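The three distance measures can each serve as the `d` in the algorithm above. A minimal sketch (illustrative example values, not from the slides):

```python
def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which the components differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))      # 5.0
print(manhattan((0, 0), (3, 4)))      # 7
print(hamming("karolin", "kathrin"))  # 3
```

Euclidean distance suits real-valued vectors; Hamming distance suits discrete or binary features.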
Orange schema
http://store02.prostores.com/selectsocksinc/images/store_version1/Sigvaris%20120%20Pantyhose%20SIZE%20chart.gif
Clusters aren’t always separated…
K-means for non-separated clusters: T-shirt sizing (figure: scatter plot of Height vs. Weight)
Weaknesses of k-means • The algorithm is only applicable to numeric data. • The user needs to specify k. • The algorithm is sensitive to outliers – Outliers are data points that are very far away from the other data points.
– Outliers could be errors in the data recording, or special data points with very different values.
www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt
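The outlier sensitivity is easy to see numerically: because a centroid is a mean, a single far-away point drags it away from the rest of the cluster. A toy one-dimensional illustration (made-up values):

```python
# Three well-behaved values: the mean sits in the middle of the data
cluster = [1.0, 2.0, 3.0]
mean = sum(cluster) / len(cluster)                    # 2.0

# Add one outlier, e.g. a data-recording error
with_outlier = cluster + [100.0]
skewed_mean = sum(with_outlier) / len(with_outlier)   # 26.5

print(mean, skewed_mean)  # 2.0 26.5
```

The skewed centroid is far from every ordinary point in the cluster, which in turn distorts subsequent reassignments.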
Strengths of k-means • Strengths:
– Simple: easy to understand and to implement. – Efficient: time complexity O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are usually small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.