Measurement of Similarity and Clustering
Dr Eamonn Keogh, Computer Science & Engineering Department, University of California - Riverside, Riverside, CA 92521


Page 1

Measurement of Similarity and Clustering

Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
[email protected]

Page 2

Outline of Talk

• What is Similarity?
  • Some nomenclature
  • A useful tool (dendrogram)

• Why Measure Similarity?
  • Classification
  • Clustering
  • Indexing

• Desirable Properties of Similarity Measures
  • Mathematical properties
  • Intuitiveness
  • Time and space complexity

• Two Approaches
  • Feature Projection
  • Transformation (Edit Distance)

• Hierarchical Clustering…

Page 3

What is Similarity?

“The quality or state of being similar; likeness; resemblance; as, a similarity of features.” (Webster's Dictionary)

Similarity is hard to define, but… “We know it when we see it.”

The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.

Page 4

Some Nomenclature I

We shall talk of measuring “similarity”, however we are usually measuring “dissimilarity”.

• Similarity: The larger the number, the more alike two objects are.

• Dissimilarity: The larger the number, the less alike two objects are.

Distance is a common synonym for Dissimilarity, so we may speak of “Distance measure” and “Dissimilarity measure” interchangeably.

However, a “Distance measure” is not the same thing as a “Distance Metric”. We will see why later…

Page 5

Some Nomenclature II

[Figure: three copies of a small point set, with points labeled a, b, c]

Similarity Queries are often expressed as Nearest Neighbor Queries or Range Queries.

What is the nearest item to the green item? (This can be generalized to the K nearest neighbors.)

What items are within R of the blue item? (R is given by the user.)

Page 6

A Useful Tool for Summarizing Similarity Measurements

In order to better appreciate and evaluate the examples given in the early part of this talk, we will now introduce the dendrogram. (We will have much more to say about dendrograms later.)

[Figure: a dendrogram with its parts labeled: Root, Internal Branch, Terminal Branch, Leaf, Internal Node]

The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.

Page 7

(Bovine:0.69395,(Gibbon:0.36079,(Orangutan:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939);
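The line above is the tree in Newick format: nested parentheses give the branching structure and each number is a branch length (a dissimilarity). As a minimal sketch, assuming the Biopython package is available, we can parse the string and print a rough text dendrogram:

from io import StringIO
from Bio import Phylo  # assumes Biopython is installed

newick = ("(Bovine:0.69395,(Gibbon:0.36079,(Orangutan:0.33636,"
          "(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386)"
          ":0.06124):0.15057):0.54939);")

tree = Phylo.read(StringIO(newick), "newick")  # parse the Newick string
Phylo.draw_ascii(tree)                         # rough text rendering of the dendrogram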

Page 8

[Figure: tree-drawing styles: Swoopogram, Curvogram, Eurogram, Phenogram, Cladogram, Tree Diagram]

Page 9

Why Measure Similarity?

• Classification: Given an unlabeled item Q, assign it to one of two or more predefined classes. (We can do classification without measuring similarity, but similarity-based methods, e.g. nearest neighbor, are very competitive; see the sketch below.)

• Clustering: Find natural groupings of items under some similarity measure.

• Indexing (Query by Content): Given a query object Q, and some similarity measure, find the nearest matching item in the database, without having to examine every item.
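To make the classification case concrete, here is a minimal sketch of nearest neighbor classification; the Euclidean distance and the (feature vector, label) representation are illustrative assumptions, not something the slides prescribe:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_classify(q, labeled_items, dist=euclidean):
    """labeled_items: list of (feature_vector, label) pairs."""
    _, label = min(labeled_items, key=lambda pair: dist(q, pair[0]))
    return label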

Page 10

Defining Distance Measures

Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).

[Figure: distance “black boxes”: each takes two objects, such as Peter and Piotr, and outputs a number, e.g. 0.23, 3, or 342.7]

Page 11

What properties should a distance measure have?

• D(A,B) = D(B,A) Symmetry

• D(A,A) = 0 Constancy of Self-Similarity

• D(A,B) = 0 iff A = B Positivity (Separation)

• D(A,B) ≤ D(A,C) + D(B,C) Triangular Inequality

[Figure: inside the black box that maps Peter and Piotr to 3 is the edit-distance recurrence:]

d('', '') = 0
d(s, '') = d('', s) = |s|   -- i.e. the length of s
d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2 then 0 else 1),
                         d(s1+ch1, s2) + 1,
                         d(s1, s2+ch2) + 1 )

When we peek inside one of these black boxes, we see some function on two variables. These functions might be very simple or very complex. In either case, it is natural to ask: what properties should these functions have?

Page 12

Intuitions behind desirable distance measure properties

D(A,B) = D(B,A)  (Symmetry)
Otherwise you could claim “Alex looks like Bob, but Bob looks nothing like Alex.”

D(A,A) = 0  (Constancy of Self-Similarity)
Otherwise you could claim “Alex looks more like Bob, than Bob does.”

D(A,B) = 0 iff A = B  (Positivity / Separation)
Otherwise there are objects in your world that are different, but you cannot tell apart.

D(A,B) ≤ D(A,C) + D(B,C)  (Triangular Inequality)
Otherwise you could claim “Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl.”

Page 13

Why is the Triangular Inequality so Important?

Virtually all techniques to index data require the triangular inequality to hold.

[Figure: the three database objects a, b, c]

Suppose I have a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precompiled a table of the distances between all the items in the database.

    a      b      c
a          6.70   7.07
b                 2.30
c

Page 14

Why is the Triangular Inequality so Important?

Virtually all techniques to index data require the triangular inequality to hold.

[Figure: the three database objects a, b, c, and the query Q]

Suppose I am looking for the closest point to Q, in a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precompiled a table of the distances between all the items in the database.

I find a and calculate that it is 2 units from Q, it becomes my best-so-far. I find b and calculate that it is 7.81 units away from Q.

I don’t have to calculate the distance from Q to c!

I know D(Q,b) ≤ D(Q,c) + D(b,c)
so D(Q,b) - D(b,c) ≤ D(Q,c)
   7.81 - 2.30 ≤ D(Q,c)
   5.51 ≤ D(Q,c)

So I know that c is at least 5.51 units away, but my best-so-far is only 2 units away.

    a      b      c
a          6.70   7.07
b                 2.30
c
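The reasoning above can be packaged as a search routine. A minimal sketch, assuming a hypothetical true_distance function and a precomputed symmetric table pairwise[x][y] of distances between database items:

def nearest_with_pruning(Q, items, true_distance, pairwise):
    best_id, best_dist = None, float("inf")
    computed = {}  # distances from Q that we have actually paid for
    for x in items:
        # Triangular inequality: D(Q, x) >= D(Q, a) - D(a, x) for any anchor a,
        # so a cheap lower bound on D(Q, x) may let us skip the real computation.
        lb = max((d - pairwise[a][x] for a, d in computed.items()), default=0.0)
        if lb >= best_dist:
            continue  # x cannot beat the best-so-far
        d = true_distance(Q, x)
        computed[x] = d
        if d < best_dist:
            best_id, best_dist = x, d
    return best_id, best_dist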

Page 15

Thoughts on the Triangular Inequality I

Sometimes the triangular inequality requirement maps nicely onto human intuitions.

Consider the similarity between the horse, the zebra and the lion.

The horse and the zebra are very similar, and both are very unlike the lion.

Page 16

Thoughts on the Triangular Inequality II

Sometimes the triangular inequality requirement fails to map onto human intuition.

Consider the similarity between the horse, a man, and the centaur.

The horse and the man are very different, but both share many features with the centaur. This relationship does not obey the triangular inequality.

(The centaur example is due to Remco Veltkamp.)

Page 17

What other properties should we require of a distance measure?

• It should really measure similarity!! (Whatever that means.)

• It should be fast to compute.
  • Euclidean distance and Hamming distance are O(n); Dynamic Time Warping and String Edit distance are O(n²).

• It should be space efficient.
  • This is usually not as important as time efficiency.

• It should allow indexing.
  • If the measure is a metric, this is automatically true; otherwise it depends…

• A fast lower bound measure is desirable.
  • ∀ A, B: lower_bound_distance(A, B) ≤ true_distance(A, B)
  • (We will see why on the next slide.)

Page 18

If not fast to compute, a fast lower bound measure is desirable

# Algorithm Lower_Bounding_Sequential_Scan(Q), as runnable Python;
# lower_bound_distance and true_distance are assumed to be supplied elsewhere.
def lower_bounding_sequential_scan(Q, database):
    best_so_far = float("inf")
    index_of_best_match = None
    for i, C in enumerate(database):
        LB_dist = lower_bound_distance(C, Q)
        if LB_dist < best_so_far:            # only then pay for the true distance
            true_dist = true_distance(C, Q)
            if true_dist < best_so_far:
                best_so_far = true_dist
                index_of_best_match = i
    return index_of_best_match, best_so_far

Assume that true_distance(A, B) is the correct distance function but is very expensive to compute, and that lower_bound_distance(A, B) is a cheap lower bounding estimate of true_distance(A, B). Then the above algorithm allows faster sequential searching.

Page 19

If we want to measure the similarity between items, we will have to measure some features

Scalar

• Binary: Only two possible states.

True/False, Jew/Gentile, Married/Unmarried

• Nominal: Generalization of Binary to 3 or more states: Jew/Catholic/Protestant, Married/Divorced/Widower

In basketball, jersey numbers are nominal

You cannot order nominal data, or do any mathematical operations on it.

Page 20

Scalar (continued)

• Ordinal: Same as nominal, but order matters. However, the distance between two values is not meaningful…

For example, we might have a coded survey: 0 = no high school, 1 = some high school, 2 = high school diploma, 3 = some college…

While we can clearly rank these attributes, the distance between a “1” and a “2” is not the same as the distance between a “2” and a “3”.

• Interval: Distance between attributes is meaningful. In this case we can measure intervals and take averages, but we cannot form ratios (i.e. we cannot say 10 is twice as large as 5).

For example, consider temperature in Fahrenheit or Celsius

• Ratio: You can meaningfully form ratios. For example: weight, height, number of children.

Page 21

Scalar (continued)

Note that both Interval and Ratio data can be either discrete or continuous

For example, consider the following two examples of ratio data:

• Number of children (for a given person)

• Average number of children (for women in different countries)

Some algorithms work better (or only work) with either discrete or continuous data.

We can convert from continuous to discrete…

Page 22

In addition to scalar values, much of the data we are interested in is nonscalar...

Vectors or Matrices of Binary/Nominal/Ordinal/Interval/Ratio: Bitmaps, Time Series, Strings, Trees, Graphs

Page 23

Consider “color”; what kind of feature is this?

Nominal Scalar: Blue, Red, Yellow, etc.

Ordinal Discrete: Red, Orange, Yellow, Green, Blue, Indigo, Violet.

Ordinal Continuous: 780 – 622nm, 622 – 597nm, 597 – 577nm

Vector Continuous: [0.95, 0.01, 0.21] (Red/Green/Blue, or Hue/Saturation/Luminosity)

We sometimes have a choice of representation. Often making the right choice can be very important.

Page 24

The similarity between two items depends on the features we measure (and the distance measure itself)

[Figure: two dendrograms over the same individuals, one under “Last Name Similarity” and one under “Skin Color Similarity”]

Page 25

Sometimes we are given the perfect features to measure similarity… sometimes we need to:

• Generate Features: Suppose we hope to find similar people with regard to their medical conditions; knowing both their height and weight is not helpful, but knowing their BMI is. (BMI = weight in kilograms / [height in meters]²; see the sketch after this list.)

• Clean Features: Our features may contain noise or outliers.

• Normalize Features: We may need to transform features.

• Reduce Features: We may have too many features to do efficient similarity measurement, so dimensionality reduction may be necessary.
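As a trivial sketch of feature generation (the function name is illustrative):

def bmi(weight_kg, height_m):
    """Body Mass Index: a generated feature combining weight and height."""
    return weight_kg / height_m ** 2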

Page 26

There is no single “magic” black box for measuring similarity

However, there are two useful and general tricks:

1. Project the data into feature space; the distance in feature space (appropriately measured) becomes the similarity. (“Feature Projection”)

2. Transform one object into the other; the “cost” of this transformation becomes the similarity. (“Edit Distance”)

Page 27

Feature Projection Example I

[Figure: six birds plotted in feature space; x-axis: Body Mass, y-axis: Ratio of beak length over body length]

Use the features to project the items into feature space. The distance between two objects in this space (appropriately measured) is the measure of similarity.

From left to right: Bee Hummingbird, Costa's Hummingbird, Ruby Topaz Hummingbird, Kestrel, Gyrfalcon, Bald Eagle

Page 28

Feature Projection Example II

R. A. Fisher's Iris Dataset: 3 variations of the Iris flower, 50 of each.

Iris Setosa   Iris Versicolor   Iris Virginica

Page 29

A generic technique for measuring similarity

To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.

The distance between Patty and Selma:
  Change dress color, 1 point
  Change earring shape, 1 point
  Change hair part, 1 point
D(Patty, Selma) = 3

The distance between Marge and Selma:
  Change dress color, 1 point
  Add earrings, 1 point
  Decrease height, 1 point
  Take up smoking, 1 point
  Lose weight, 1 point
D(Marge, Selma) = 5

Page 30

Edit Distance Example I

Peter → Piter → Pioter → Piotr
  Substitution (i for e)
  Insertion (o)
  Deletion (e)

It is possible to transform any string Q into string C, using only Substitution, Insertion and Deletion.Assume that each of these operators has a cost associated with it.

The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C.

Note that for now we have ignored the issue of how we can find this cheapest transformation.

How similar are the names “Peter” and “Piotr”? Assume the following cost function:

Substitution: 1 unit
Insertion: 1 unit
Deletion: 1 unit

D(Peter, Piotr) = 3

[Figure: a dendrogram clustering variants of the name: Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter]
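A minimal sketch of the unit-cost edit distance (Levenshtein distance) described above, using the standard dynamic-programming recurrence from the earlier slide:

def edit_distance(q, c):
    """Unit-cost substitution, insertion, and deletion."""
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of q[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of c[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution (or match)
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[m][n]

print(edit_distance("Peter", "Piotr"))  # 3, as claimed above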

Page 31

Edit Distance Example II

We can make two time series appear more similar by letting one point in one series map onto two (or more) points in the other. For example, suppose we have

Q = [5, 6, 8, 8, 7] and C = [5, 6, 6, 8, 7]

[Figure: the two alignments of Q and C: the rigid one-to-one alignment, and a nonlinear alignment]

A “one to one” measure would have to match an “8” in Q to a “6” in C.

However if we allowed nonlinear alignments every number can match with “itself”.

Another way of looking at it is an attempt to make the two sequences more similar by inserting values…

[5, 6, 6, 8, 8, 7]

[5, 6, 6, 8, 8, 7]

This is called Dynamic Time Warping.
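A minimal sketch of the dynamic-programming computation behind DTW (an unconstrained warping window and absolute-difference point cost are common defaults, assumed here rather than specified by the slide):

def dtw(q, c, cost=lambda a, b: abs(a - b)):
    m, n = len(q), len(c)
    INF = float("inf")
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # a point may align one-to-one, or map onto several points in the other series
            D[i][j] = cost(q[i - 1], c[j - 1]) + min(D[i - 1][j - 1],  # step in both
                                                     D[i - 1][j],      # warp: repeat c[j-1]
                                                     D[i][j - 1])      # warp: repeat q[i-1]
    return D[m][n]

print(dtw([5, 6, 8, 8, 7], [5, 6, 6, 8, 7]))  # 0.0: the nonlinear alignment is perfect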

Page 32

Dynamic Time Warping

Fixed Time Axis: sequences are aligned “one to one”.

“Warped” Time Axis: nonlinear alignments are possible.

Page 33

The Minkowski Metric

D(Q, C) = \left( \sum_{i=1}^{n} \left| q_i - c_i \right|^{p} \right)^{1/p}

p = 1: Manhattan (Rectilinear, City Block)

p = 2: Euclidean

p = ∞: Max (Supremum, “sup”)

Assume Q and C are vectors of features measured from the objects of interest.

[Figure: the objects projected into a 2-D feature space]

So, we have projected our objects into feature space.

How do we measure the distance between points?
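A minimal sketch of the Minkowski distance for the three named values of p:

def minkowski(q, c, p=2.0):
    """p = 1: Manhattan; p = 2: Euclidean; p = float('inf'): Max."""
    if p == float("inf"):
        return max(abs(qi - ci) for qi, ci in zip(q, c))
    return sum(abs(qi - ci) ** p for qi, ci in zip(q, c)) ** (1.0 / p)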

Page 34

The Minkowski Metric, a Weakness

Suppose we have a database of 3 items, with 2 features: number of children and temperature. We want to know who is most similar to Mr Red under the Euclidean distance.

[Figure: the three items plotted twice as (number of children, temperature).
Celsius: points (5, 36), (1, 36), (5, 38); Green is closest to Red.
Fahrenheit: points (5, 96.8), (1, 96.8), (5, 102.2); Blue is closest to Red.]

The Minkowski metric is sensitive to the units used to measure features, a very undesirable property since the units are usually arbitrary. Two solutions suggest themselves: normalize the features, or use a weighted version of the Minkowski metric.

Page 35

Normalizing Features


Before normalization, the Y-axis dominates

After normalization, both axes are equally important

Note that in both these images the axes are square (there is the same number of pixels per unit in both the X and Y direction)

Let C be a database of items, with the ith feature denoted by ci

To normalize the database…

for each feature ci
    ci = (ci - mean(ci)) / std(ci)
end

After normalization each feature will have a mean of zero and a standard deviation of one.
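A minimal sketch of z-normalizing one feature column (using the sample standard deviation; the population version differs only in the divisor):

from statistics import mean, stdev

def z_normalize(column):
    """Rescale a feature so it has mean 0 and standard deviation 1."""
    mu, sigma = mean(column), stdev(column)
    return [(x - mu) / sigma for x in column]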

Page 36

The Weighted Minkowski Metric

Assume Q and C are vectors of features measured from the objects of interest.

Further assume that W is a vector containing the relative importance of the features

D(Q, C, W) = \left( \sum_{i=1}^{n} w_i \left| q_i - c_i \right|^{p} \right)^{1/p}

But how do we know the weights?
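For completeness, the weighted variant as a sketch (how to choose w is exactly the open question above):

def weighted_minkowski(q, c, w, p=2.0):
    """w[i] encodes the relative importance of feature i."""
    return sum(wi * abs(qi - ci) ** p
               for wi, qi, ci in zip(w, q, c)) ** (1.0 / p)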

Page 37

The Minkowski Metrics have Simple Geometric Interpretations

[Figure: the set of points equidistant from a center under each metric: Euclidean (a circle), Weighted Euclidean (an ellipse), Manhattan (a diamond), Max (a square)]

Page 38

Page 39

What is Clustering?

Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.

• Organizing data into classes such that there is
  • high intra-class similarity
  • low inter-class similarity

• Finding the class labels and the number of classes directly from the data (in contrast to classification).

• More informally, finding natural groupings among objects.

Page 40

What is a natural grouping among these objects?

Page 41

What is a natural grouping among these objects?

School Employees; Simpson's Family; Males; Females

Clustering is subjective

Page 42

Even if we know in advance the number of clusters we expect to see, the clustering obtained may be subjective.

Page 43

Two Types of Clustering

• Partitional algorithms: Construct various partitions and then evaluate them by some criterion.

• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using some criterion.

[Figure: a hierarchical clustering (dendrogram) beside a partitional clustering]

Page 44

Desirable Properties of a Clustering Algorithm

• Scalability (in terms of both time and space)

• Ability to deal with different data types

• Discovery of clusters with arbitrary shape

• Minimal requirements for domain knowledge to determine input parameters

• Able to deal with noise and outliers

• Insensitive to order of input records

• Ability to handle high dimensionality

• Incorporation of user-specified constraints

• Interpretability and usability

Page 45

Hierarchical Clustering

The number of dendrograms with n leaves = (2n - 3)! / [2^(n-2) (n - 2)!]

Number of Leaves    Number of Possible Dendrograms
2                   1
3                   3
4                   15
5                   105
…                   …
10                  34,459,425
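As a quick check of the table, a sketch that evaluates the formula directly:

from math import factorial

def num_dendrograms(n):
    """Number of rooted binary dendrograms on n labeled leaves (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

print(num_dendrograms(10))  # 34459425, matching the table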

Since we cannot test all possible trees, we will have to use a heuristic search over the space of possible trees. We could do this:

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.

Page 46

0  8  8  7  7
   0  2  4  4
      0  3  3
         0  1
            0

D( , ) = 8

D( , ) = 1

We begin with a distance matrix which contains the distances between every pair of objects in our database.

Page 47

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

…Consider all possible merges…

Choose the best

Page 48

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

…Consider all possible merges…

Choose the best

Consider all possible merges… …

Choose the best

Page 49

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

…Consider all possible merges…

Choose the best

Consider all possible merges… …

Choose the best

Consider all possible merges…

Choose the best…

Page 50

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

…Consider all possible merges…

Choose the best

Consider all possible merges… …

Choose the best

Consider all possible merges…

Choose the best…

Page 51

0  8  8  7  7
   0  2  4  4
      0  3  3
         0  1
            0

0  8  8
   0  2
      0

0  8  8  ?
   0  2  ?
      0  ?
         ?

In the first iteration of agglomerative clustering we merged the two closest objects, so we need to remove them from the matrix.

We now need to add the newly formed cluster to our new smaller matrix.

But what values do we fill in? What is…

D( , ) = ?

D( , ) = ?

Page 52

We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or defining the distance between two clusters, is non-obvious.

• Single linkage (nearest neighbor): In this method, the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.

• Complete linkage (furthest neighbor): In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the “furthest neighbors”).

• Group average: In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.

Page 53

0  8  8  7  7
   0  2  4  4
      0  3  3
         0  1
            0

0  8  8  7
   0  2  4
      0  3
         0

D( , ) = Min[ D( , ), D( , ) ] = 4

D( , ) = Min[ D( , ), D( , ) ] = 7

Using Single linkage (nearest neighbor)…
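A minimal sketch of the whole bottom-up procedure with single linkage, operating on a full symmetric distance matrix D given as a list of lists (the naive O(n³) loop is fine for the toy examples here):

def single_linkage_agglomerative(D):
    """Return the merge history: (cluster_a, cluster_b, distance) triples."""
    clusters = [[i] for i in range(len(D))]  # every item starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = (float("inf"), None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the two closest members
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # fuse the best pair
        del clusters[b]
    return merges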

Page 54

Summary of Hierarchical Clustering Methods

• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n²), where n is the number of total objects.
• Like any heuristic search algorithm, local optima are a problem.
• Interpretation of results is subjective.