CSC 4510 – Machine Learning Dr. Mary‐Angela Papalaskari Department of Computing Sciences Villanova University
Course website: www.csc.villanova.edu/~map/4510/
11: Unsupervised Learning ‐ Clustering
Some of the slides in this presentation are adapted from: • Prof. Frank Klassner’s ML class at Villanova • the University of Manchester ML course http://www.cs.manchester.ac.uk/ugt/COMP24111/ • The Stanford online ML course http://www.ml-class.org/
Supervised learning
Training set: (figure adapted from the Stanford online ML course, http://www.ml-class.org/)
Unsupervised learning
Training set: (figure adapted from the Stanford online ML course, http://www.ml-class.org/)
Unsupervised Learning • Learning “what normally happens” • No output • Clustering: Grouping similar instances • Example applications – Customer segmentation – Image compression: Color quantization – Bioinformatics: Learning motifs
4 CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University
Clustering Algorithms • K-means • Hierarchical – Bottom-up or top-down • Probabilistic – Expectation Maximization (E-M)
Clustering algorithms • Partitioning method: Construct a partition of n examples into a set of K clusters
• Given: a set of examples and the number K • Find: a partition into K clusters that optimizes the chosen partitioning criterion – Globally optimal: exhaustively enumerate all partitions – Effective heuristic method: K-means algorithm.
http://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt
K-Means • Assumes instances are real-valued vectors. • Clusters are based on centroids: the center of gravity, or mean, of the points in a cluster c.
• Reassignment of instances to clusters is based on distance to the current cluster centroids.
Based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt
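The centroid of a cluster is just the component-wise mean of its points. A minimal sketch in plain Python, using hypothetical illustrative values:

```python
# Hypothetical cluster of 2-D points (illustrative values, not from the slides)
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]

# Centroid = component-wise mean of the points in the cluster
centroid = tuple(sum(p[d] for p in points) / len(points)
                 for d in range(len(points[0])))
print(centroid)  # (3.0, 4.0)
```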
K-means intuition • Randomly choose k points as seeds, one per cluster. • Form initial clusters based on these seeds. • Iterate, repeatedly recomputing centroids and reallocating instances to improve the overall clustering.
• Stop when the clustering converges or after a fixed number of iterations.
K‐Means Algorithm
http://www.csc.villanova.edu/~matuszek/spring2012/index2012.html, based on: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt
• Let d be the distance measure between instances. • Select k random points {s1, s2, …, sk} as seeds. • Until clustering converges (or another stopping criterion is met): – For each instance xi:
• Assign xi to the cluster cj such that d(xi, sj) is minimal.
– Update the seeds to the centroid of each cluster: • For each cluster cj, sj = μ(cj)
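The steps above can be sketched directly in plain Python. This is an illustrative implementation, not the course's reference code; the function and variable names (`kmeans`, `euclidean`, `seed`) are my own, and the distance measure is passed in as a parameter:

```python
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, distance, max_iters=100, seed=0):
    """Sketch of the k-means algorithm; `points` is a list of tuples."""
    rng = random.Random(seed)
    seeds = rng.sample(points, k)  # select k random points as seeds
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assign each instance x_i to the cluster c_j with minimal d(x_i, s_j)
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: distance(x, seeds[j]))
            clusters[j].append(x)
        # Update each seed to the centroid (mean) of its cluster
        new_seeds = []
        for j, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_seeds.append(tuple(sum(p[d] for p in cluster) / len(cluster)
                                       for d in range(dim)))
            else:
                new_seeds.append(seeds[j])  # keep the seed of an empty cluster
        if new_seeds == seeds:  # converged: assignments can no longer change
            break
        seeds = new_seeds
    return seeds, clusters
```

For example, running `kmeans` on six points forming two well-separated blobs returns two centroids, one per blob.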
Distance measures • Euclidean distance • Manhattan distance • Hamming distance
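The three distance measures can each serve as the `d` in the algorithm above. A minimal sketch (illustrative example values, not from the slides):

```python
def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which the components differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))      # 5.0
print(manhattan((0, 0), (3, 4)))      # 7
print(hamming("karolin", "kathrin"))  # 3
```

Euclidean distance suits real-valued vectors; Hamming distance suits discrete or binary features.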
Orange schema
http://store02.prostores.com/selectsocksinc/images/store_version1/Sigvaris%20120%20Pantyhose%20SIZE%20chart.gif
Clusters aren’t always separated…
K-means for non-separated clusters: T-shirt sizing (figure: scatter plot of Height vs. Weight)
Weaknesses of k-means • The algorithm is only applicable to numeric data. • The user needs to specify k. • The algorithm is sensitive to outliers – Outliers are data points that are very far away from the other data points.
– Outliers could be errors in the data recording, or special data points with very different values.
www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-unsupervised-learning.ppt
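The outlier sensitivity is easy to see numerically: because a centroid is a mean, a single far-away point drags it away from the rest of the cluster. A toy one-dimensional illustration (made-up values):

```python
# Three well-behaved values: the mean sits in the middle of the data
cluster = [1.0, 2.0, 3.0]
mean = sum(cluster) / len(cluster)                    # 2.0

# Add one outlier, e.g. a data-recording error
with_outlier = cluster + [100.0]
skewed_mean = sum(with_outlier) / len(with_outlier)   # 26.5

print(mean, skewed_mean)  # 2.0 26.5
```

The skewed centroid is far from every ordinary point in the cluster, which in turn distorts subsequent reassignments.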
Strengths of k-means • Strengths:
– Simple: easy to understand and to implement. – Efficient: time complexity O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are usually small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.