25
CSC 4510 – Machine Learning Dr. Mary‐Angela Papalaskari Department of CompuBng Sciences Villanova University Course website: www.csc.villanova.edu/~map/4510/ 11: Unsupervised Learning ‐ Clustering 1 Some of the slides in this presentaBon are adapted from: Prof. Frank Klassners ML class at Villanova the University of Manchester ML course hNp://www.cs.manchester.ac.uk/ugt/COMP24111/ The Stanford online ML course hNp://www.ml‐class.org/

CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

  • Upload
    others

  • View
    25

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

CSC 4510 – Machine Learning Dr. Mary‐Angela Papalaskari Department of CompuBng Sciences Villanova University  

Course website: www.csc.villanova.edu/~map/4510/ 

11: Unsupervised Learning ‐ Clustering 

1 Some of the slides in this presentaBon are adapted from: •  Prof. Frank Klassner’s ML class at Villanova •   the University of Manchester ML course hNp://www.cs.manchester.ac.uk/ugt/COMP24111/ •  The Stanford online ML course hNp://www.ml‐class.org/ 

Page 2: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

Supervised learning 

Training set:                  • The Stanford online ML course hNp://www.ml‐class.org/ 

Page 3: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

Unsupervised learning 

Training set:      • The Stanford online ML course hNp://www.ml‐class.org/ 

Page 4: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

Unsupervised Learning •  Learning “what normally happens” •  No output •  Clustering: Grouping similar instances •  Example applicaBons –  Customer segmentaBon  –  Image compression: Color quanBzaBon –  BioinformaBcs: Learning moBfs 

4 CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 

Page 5: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

Clustering Algorithms •  K means  •  Hierarchical – BoNom up or top down 

•  ProbabilisBc – ExpectaBon MaximizaBon (E‐M) 

CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  5 

Page 6: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

Clustering algorithms •  Par77oning method: Construct a parBBon of n examples into a set of K clusters 

•  Given: a set of examples and the number K  •  Find: a parBBon of K clusters that opBmizes the chosen parBBoning criterion –  Globally opBmal: exhausBvely enumerate all parBBons –  EffecBve heurisBc method: K‐means algorithm.  

hNp://www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17‐clustering.ppt CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  6 

Page 7: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

19 

K‐Means •  Assumes instances are real‐valued vectors. •  Clusters based on centroids, center of gravity, or mean of points in a cluster, c 

•  Reassignment of instances to clusters is based on distance to the current cluster centroids. 

Based on:  www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  7 

Page 8: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

K‐means intuiBon •  Randomly choose k points as seeds, one per cluster.   •  Form iniBal clusters based on these seeds. •  Iterate, repeatedly reallocaBng seeds and by re‐compuBng clusters to improve the overall clustering. 

•  Stop when clustering converges or ager a fixed number of iteraBons.  

Based on:  www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  8 

Page 9: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 10: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 11: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 12: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 13: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 14: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 15: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 16: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 17: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 18: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

21 

K‐Means Algorithm 

hNp://www.csc.villanova.edu/~matuszek/spring2012/index2012.html, based on:  www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt 

•  Let d be the distance measure between instances. •  Select k random points {s1, s2,… sk} as seeds. •  UnBl clustering converges or other stopping criterion: –  For each instance xi: 

•  Assign xi to the cluster cj such that d(xi, sj) is minimal. 

–  (Update the seeds to the centroid of each cluster) •  For each cluster cj,  sj = μ(cj)  

CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  18 

Page 19: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

Distance measures •  Euclidean distance •  ManhaNan •  Hamming 

CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  19 

Page 20: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  20 

Orange schema 

Page 21: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

Orange schema 

CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  21 

Page 22: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

hNp://store02.prostores.com/selectsocksinc/images/store_version1/Sigvaris%20120%20Pantyhose%20SIZE%20chart.gif 

Clusters aren’t always separated… 

Page 23: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

K‐means for non‐separated clusters 

T‐shirt sizing 

Height 

Weight 

• The Stanford online ML course hNp://www.ml‐class.org/ 

Page 24: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

Weaknesses of k‐means •  The algorithm is only applicable to numeric data •  The user needs to specify k. •  The algorithm is sensiBve to outliers – Outliers are data points that are very far away from other data points.  

– Outliers could be errors in the data recording or some special data points with very different values.  

www.cs.uic.edu/~liub/teach/cs583‐fall‐05/CS583‐unsupervised‐learning.ppt CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  24 

Page 25: CSC 4510 – Machine Learningmap/4510/11clustering.pdf · CSC 4510 – Machine Learning ... Orange schema Orange schema CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University 21

Strengths of k‐means  •  Strengths:  

–  Simple: easy to understand and to implement –  Efficient: Time complexity: O(tkn),  –    where n is the number of data points,  –    k is the number of clusters, and  –    t is the number of iteraBons.  –  Since both k and t are small. k‐means is considered a linear algorithm.  

•  K‐means is the most popular clustering algorithm. 

www.cs.uic.edu/~liub/teach/cs583‐fall‐05/CS583‐unsupervised‐learning.ppt CSC 4510 ‐ M.A. Papalaskari ‐ Villanova University  25