
Unsupervised Learning Basic Concepts

Unsupervised Learning -- Ana Fred

Outline

Part 1: Basic Concepts of Data Clustering
- Unsupervised learning and clustering
- Problem formulation: cluster analysis
- Taxonomies of clustering techniques
- Data types and proximity measures
- Difficulties and open problems

Part 2: Clustering Algorithms
- Hierarchical methods
  - Single-link
  - Complete-link
  - Clustering based on dissimilarity increments criteria

From Single Clustering to Ensemble Methods - April 2009


Pattern Recognition – Decision Making

Supervised Learning
- Training samples, labeled by their category membership, are used to design a classifier.
- Labeled training patterns: the labels represent the true categories of the patterns.

Unsupervised Learning
- Based on a collection of samples without being told their categories.
- Learn the number of classes and the structure of each class from the similarity between unlabeled training patterns.
- Closely related to data mining.


Unsupervised Learning / Clustering

Unsupervised Learning
- Learn the structure of multidimensional patterns.
- Mixture densities: Gaussian mixture decomposition, where the probability structure is known except for the values of the parameters.

Clustering Procedures
- Find subclasses.
- Describe the data in terms of clusters: groups of data points that possess strong internal similarities.

Typical applications:
- As a stand-alone tool to gain insight into the data distribution.
- As a preprocessing step for other algorithms.


Cluster Analysis

Organize data into sensible groupings, either as a flat grouping of patterns or as a hierarchy of groups.

Clustering: the process of grouping a set of objects into classes of similar objects (extracting hidden structure from data).

Cluster: a collection of objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.


Shape Clustering

Examples: right ventricle and cistern shapes from MR brain images. The main cluster is drawn using multicolor dots; secondary clusters are drawn in red, green and magenta.

Duta, Jain and Jolly, "Automatic Construction of 2-D Shape Models", IEEE PAMI, May 2001



Identification of Writing Styles

122,000 online characters written by 100 writers. Lexemes are identified by clustering the data within each character class into subclasses; a string-matching measure is used to compute the distance between two characters.

Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, Mar 2002


Segmentation of Natural Scenes

Hermes, Zöller, Buhmann, "Parametric Distributional Clustering for Image Segmentation", ECCV 2002


What is a Cluster?

- A set of entities which are alike; entities from different clusters are not alike.
- An aggregation of points such that the distance between any two points in a cluster is less than the distance between any point in the cluster and any point not in it.
- A relatively high density of points, surrounded by a relatively low density of points.

Ideal cluster: compact and isolated.


Taxonomy of Clustering Approaches

Two main strategies:

Hierarchical methods: propose a sequence of nested data partitions in a hierarchical structure.
- Single-link
- Complete-link

Partitional methods: organize patterns into a small number of clusters.
- K-means
- Spectral clustering
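As a minimal sketch of the hierarchical (agglomerative) strategy, the following naive single-link procedure repeatedly merges the two clusters whose closest pair of members is nearest, until the requested number of clusters remains. The function names and sample data are illustrative, not from the tutorial; real implementations use far more efficient algorithms.

```python
from itertools import combinations

def euclidean(a, b):
    # Plain Euclidean distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(points, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose closest pair of points is smallest, until k clusters remain.
    Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for ci, cj in combinations(range(len(clusters)), 2):
            # Single-link: inter-cluster distance = closest pair of points.
            d = min(euclidean(points[a], points[b])
                    for a in clusters[ci] for b in clusters[cj])
            if best is None or d < best[0]:
                best = (d, ci, cj)
        _, ci, cj = best
        clusters[ci] += clusters.pop(cj)
    return clusters
```

Replacing the `min` over point pairs by `max` turns this into complete-link clustering; the merge loop is otherwise unchanged.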


Taxonomy of Clustering Approaches

Clustering principles:

Compactness
- K-means
- Complete-link
- Histogram clustering
- Pairwise data clustering

Connectedness
- Single-link
- Dissimilarity increments
- Mean-shift clustering

Separation
- Normalized cut
- Spectral clustering



Taxonomy of Clustering Approaches

Approaches:

Model-based
- Patterns can be given a simple and compact description in terms of:
  - A parametric distribution: parametric density approaches (mixture models).
  - A representative element, such as a centroid or median (central clustering, square-error clustering, k-means, k-medoids), or multiple prototypes per cluster (CURE): prototype-based methods.
  - Geometric primitives (lines, planes, circles, curves, surfaces): shape-fitting approaches.
- These approaches assume particular cluster shapes; partitions are generally obtained by optimizing a global criterion.


Taxonomy of Clustering Approaches

Graph-theoretical
- Mostly explored in hierarchical methods, which can be represented graphically as a tree or dendrogram:
  - Agglomerative methods (single-link, complete-link)
  - Divisive approaches (e.g. based on the minimum spanning tree)
- Alternatively, view clustering as a graph-partitioning problem.

Non-parametric density-based
- Attempt to identify high-density clusters separated by low-density regions, using a local cluster criterion such as density-connected points (valley-seeking clustering algorithms).
  - DBSCAN, OPTICS, DENCLUE, CLIQUE
  - Discover clusters of arbitrary shape.
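The density-based idea can be sketched with a stripped-down DBSCAN: points with at least `min_pts` neighbours within radius `eps` are core points, and clusters grow by expanding density-connected neighbourhoods. This is a simplified illustration of the published algorithm (parameter names follow the usual convention; the sample points are made up), not a production implementation.

```python
def region_query(points, i, eps):
    # Indices of all points within Euclidean distance eps of points[i].
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) ** 0.5 <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...); -1 marks noise."""
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1          # noise (may later turn out to be a border point)
            continue
        labels[i] = cid             # new core point: start a cluster
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid     # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = region_query(points, j, eps)
            if len(nb) >= min_pts:  # j is itself a core point: keep expanding
                seeds.extend(nb)
        cid += 1
    return labels
```

Because clusters grow along chains of dense neighbourhoods rather than around a single prototype, arbitrary (non-convex) cluster shapes can be recovered, which is the point the slide makes.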


Data Types in Clustering Problems

Data representations:
- Vector data: n vectors in R^d.
- Proximity data: an n x n pairwise proximity matrix.

All types of data may be addressed by choosing adequate proximity measures.


Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects.

Metric axioms:
- Positivity: d(a, b) ≥ 0, with d(a, b) = 0 if and only if a = b
- Symmetry: d(a, b) = d(b, a)
- Triangle inequality: d(a, c) ≤ d(a, b) + d(b, c)
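The axioms above can be checked empirically on a finite sample of points, which is a handy sanity check when designing a custom dissimilarity. The helper below is an illustrative sketch (names are made up); note that squared Euclidean distance fails the triangle inequality, so it is not a metric.

```python
from itertools import product

def euclid(a, b):
    # Ordinary Euclidean distance between tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def is_metric(points, d, tol=1e-9):
    """Empirically check the metric axioms on a finite point sample:
    positivity, identity of indiscernibles, symmetry, triangle inequality."""
    for a, b in product(points, repeat=2):
        if d(a, b) < -tol:
            return False                        # positivity violated
        if (abs(d(a, b)) <= tol) != (a == b):
            return False                        # d(a,b)=0 iff a=b violated
        if abs(d(a, b) - d(b, a)) > tol:
            return False                        # symmetry violated
    for a, b, c in product(points, repeat=3):
        if d(a, c) > d(a, b) + d(b, c) + tol:
            return False                        # triangle inequality violated
    return True
```

Passing the check on a sample does not prove the function is a metric everywhere, but a single failure disproves it.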


Metric Models in Feature Spaces

Minkowski distance:

  d_r(a, b) = ( Σ_{i=1}^{d} |b_i − a_i|^r )^{1/r}

(The Euclidean distance corresponds to r = 2.)

Maximum value metric (the limit r → ∞):

  d_max(a, b) = max_i |b_i − a_i|

- Considers only the most distant features.


Metric Models in Feature Spaces

Absolute value metric, Manhattan distance, or city-block distance (r = 1):

  d_M(a, b) = d_1(a, b) = Σ_{i=1}^{d} |b_i − a_i|

- Reduced computational cost; does not heavily penalize the features with higher dissimilarity.
- In R²: dist1((x1,y1),(x2,y2)) = |x2 − x1| + |y2 − y1|. The city-block name reflects that no shortcuts through corners are possible: the distance counts the number of blocks that must be traversed to move from one corner to another.
- Curves of constant Manhattan distance are diamond-shaped (squares rotated 45°).
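The Minkowski family, with the city-block (r = 1), Euclidean (r = 2) and maximum-value (r → ∞) special cases, is only a few lines of code. This is an illustrative sketch; the function names are not from the slides.

```python
def minkowski(a, b, r):
    """Minkowski distance d_r(a, b) = (sum_i |b_i - a_i|^r)^(1/r).
    r = 1 gives the city-block metric, r = 2 the Euclidean metric."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)

def chebyshev(a, b):
    """Limit r -> infinity (maximum value metric): only the most
    distant coordinate contributes."""
    return max(abs(x - y) for x, y in zip(a, b))
```

For a = (1, 2) and b = (4, 6), the coordinate differences are 3 and 4, so the city-block distance is 7, the Euclidean distance is 5, and the maximum-value distance is 4 — illustrating how larger r increasingly emphasizes the most dissimilar feature.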


Metric Models in Feature Spaces

Euclidean distance (r = 2):

  d_e(a, b) = d_2(a, b) = ( Σ_{i=1}^{d} (b_i − a_i)² )^{1/2}

- In R²: dist2((x1,y1),(x2,y2)) = ((x2 − x1)² + (y2 − y1)²)^{1/2}
- Emphasizes the features with higher dissimilarity.

Mahalanobis distance (Σ is the covariance matrix):

  d_Mahalanobis(x, y) = ( (x − y)^T Σ^{-1} (x − y) )^{1/2}
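A small worked sketch of the Mahalanobis distance, restricted to 2-D so the covariance matrix can be inverted by hand (the function name and data are illustrative). With the identity covariance it reduces to the Euclidean distance; with larger variance along one axis, displacements along that axis count for less.

```python
def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance for 2-D points; cov is a 2x2 covariance
    matrix given as ((c00, c01), (c10, c11)), inverted directly."""
    (a, b), (c, d) = cov
    det = a * d - b * c                     # determinant of the 2x2 matrix
    inv = ((d / det, -b / det),
           (-c / det, a / det))             # closed-form 2x2 inverse
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)^T cov^{-1} (x - y)
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return q ** 0.5
```

For instance, with covariance ((4, 0), (0, 1)) the point (2, 0) lies at Mahalanobis distance 1 from the origin, because the variance along the first axis is 4: the metric rescales each direction by its spread.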


Dissimilarity for Sequences of Symbols

Dissimilarity based on string-editing operations:
- Edit operations: substitution, insertion, and deletion of symbols.

The Levenshtein distance between two strings s1, s2 ∈ Σ*, D_L(s1, s2), is defined as the minimum number of editing operations needed to transform s1 into s2.
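The Levenshtein distance is computed with the classic dynamic-programming recurrence over prefixes; the sketch below keeps only two rows of the table. Illustrative code, not from the slides.

```python
def levenshtein(s1, s2):
    """Minimum number of insertions, deletions and substitutions
    needed to transform s1 into s2 (dynamic programming over prefixes)."""
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, ch1 in enumerate(s1, 1):
        cur = [i]                            # distance from s1[:i] to ""
        for j, ch2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # deletion from s1
                           cur[j - 1] + 1,               # insertion into s1
                           prev[j - 1] + (ch1 != ch2)))  # substitution (0 if match)
        prev = cur
    return prev[-1]
```

The textbook example: transforming "kitten" into "sitting" takes 3 operations (two substitutions and one insertion).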


Dissimilarity for Sequences of Symbols

The weighted Levenshtein distance between two strings s1, s2 ∈ Σ* assigns a cost to each editing operation and is defined as the minimum total cost of transforming s1 into s2.

The normalized weighted Levenshtein distance normalizes this cost so that values are comparable across string pairs of different lengths.
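The weighted variant only changes the costs in the recurrence. Since the exact weighting and normalization used in the slides are not reproduced there, the sketch below shows one common convention: per-operation costs, normalized by the longer string's length. All names and defaults here are assumptions for illustration.

```python
def weighted_levenshtein(s1, s2, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """Levenshtein DP with per-operation costs (one common variant;
    not necessarily the exact weighting of the original slides)."""
    prev = [j * w_ins for j in range(len(s2) + 1)]
    for i, ch1 in enumerate(s1, 1):
        cur = [i * w_del]
        for j, ch2 in enumerate(s2, 1):
            cur.append(min(prev[j] + w_del,                            # deletion
                           cur[j - 1] + w_ins,                         # insertion
                           prev[j - 1] + (w_sub if ch1 != ch2 else 0.0)))
        prev = cur
    return prev[-1]

def normalized_weighted_levenshtein(s1, s2, **w):
    """Normalize by the longer string's length so values are comparable
    across string pairs (one common normalization convention)."""
    n = max(len(s1), len(s2))
    return weighted_levenshtein(s1, s2, **w) / n if n else 0.0
```

Lowering, say, the insertion weight makes the measure more tolerant of one string being a subsequence of the other, which is how the weights let the distance reflect domain knowledge.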


Dissimilarity for Sequences of Symbols

String-editing operations and string matching:

(a) String matching using editing operations. (b) Editing path. In (b), diagonal path elements represent substitutions, vertical segments correspond to insertions, and horizontal segments correspond to deletions.


Dissimilarity for Sequences of Symbols

Normalized Edit Distance

Marzal and Vidal, “Computation of normalized edit distance and applications”, IEEE PAMI, 1993


Dissimilarity for Sequences of Symbols

Dissimilarity based on error-correcting parsing (ECP) [Fu]:
- A distance between strings based on modelling string structure by means of grammars, and on the concept of error-correcting parsing.
- The distance between a string and a reference string is given by the error-correcting parser as the weighted Levenshtein distance between the string and the nearest (in terms of edit operations) string generated by the grammar inferred from the reference string (thus exhibiting a similar structure).



- In order to preserve symmetry, the ECP distance is evaluated in both directions (each string against the grammar inferred from the other) and the two values are combined.


Dissimilarity for Sequences of Symbols

Grammar complexity-based similarity:

The basic idea is that, if two sentences are structurally similar, then their joint description will be more compact than their isolated descriptions, due to the sharing of rules of symbol composition. The compactness of the representation is quantified by the grammar complexity, and the similarity is measured by the ratio of decrease in grammar complexity, where C(Gsi) denotes grammar complexity.

Fred, "Similarity measures and clustering of string patterns". In Dechang Chen and Xiuzhen Cheng, editors, Pattern Recognition and String Matching, Kluwer Academic, 2002.
Fred, "Clustering of Sequences using a Minimum Grammar Complexity Criterion", ICGI 1996.


RDGC Similarity


Dissimilarity for Sequences of Symbols

Grammar complexity-based similarity (RDGC):

Let G = (VN, Σ, R, σ) be a context-free grammar, where VN and Σ are the sets of nonterminal and terminal symbols, respectively, σ is the grammar's start symbol, and R is the set of productions.

Let α ∈ (VN ∪ Σ)* be a grammatical sentence of length n, in which the symbols a1, a2, …, am appear k1, k2, …, km times, respectively. The complexity of the sentence, C(α), and the complexity of the grammar G built from it, are defined as in [Fu].


Dissimilarity for Sequences of Symbols

Minimum code length-based similarity:
- Based on Solomonoff's code: a string is represented by a triplet, where the coded string is obtained through an iterative procedure in which, at each step, intermediate codes are produced by defining sequences of two symbols, representing them by special symbols, and rewriting the sequences using them. Compact codes are produced when sequences exhibit local or distant inter-symbol interactions.
- Code length: the sum of the lengths of the descriptions of the three parts of the code above.
- The measure extends to sets of strings.

Fred, "Similarity measures and clustering of string patterns". In Dechang Chen and Xiuzhen Cheng, editors, Pattern Recognition and String Matching, Kluwer Academic, 2002.
Fred and Leitão, "A Minimum Code Length Technique for Clustering of Syntactic Patterns", ICPR 1996.


Dissimilarity for Sequences of Symbols

Minimum code length-based similarity:
- The basic idea is that globally compact codes are produced by considering the inter-symbol dependencies over the ensemble of strings. The quantification of this reduction in code length forms the basis of the similarity measure designated the Normalized Ratio of Decrease in Code Length (NRDCL).


Requirements of Clustering in Data Mining

- Discovery of clusters with arbitrary shape
- Ability to deal with different types of attributes
- Scalability
- Minimal requirements for domain knowledge to determine input parameters
- Insensitivity to the order of input records
- Ability to deal with noisy data
- Ability to handle high dimensionality


Issues in Clustering

- Which similarity measure and which features to use?
- How many clusters?
- Which is the "best" clustering method?
- Are the individual clusters and the partition valid?
- How to choose algorithmic parameters?

Figure examples: k-means clustering of uniform data (k = 4); k-means using the Euclidean (blue) and Mahalanobis (red) distances (k = 2).
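The uniform-data example illustrates a structural property of k-means: Lloyd's iteration always returns exactly k clusters, whether or not the data has any cluster structure, so validity must be assessed separately. A minimal sketch of the iteration (names and data are illustrative):

```python
def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid update. Note it always produces exactly k groups, even
    on structureless (e.g. uniform) data -- the pitfall in the slide."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            # Assign p to the nearest centroid (squared Euclidean distance).
            i = min(range(len(centroids)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            groups[i].append(p)
        # Recompute each centroid as the mean of its group (keep old if empty).
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else m
                     for g, m in zip(groups, centroids)]
    return centroids, groups
```

The result also depends on the initial centroids and on the chosen metric, which is exactly the parameter-selection and validity question the slide raises.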
