SEL3053: Analyzing Geordie
Lecture 13. Hierarchical cluster analysis 2 - cluster tree construction

Lecture 12 introduced hierarchical cluster analysis, observed that construction of a hierarchical cluster tree is a two-step process (creation of a vector-distance table, then construction of the tree on the basis of that table), and outlined the first of these two steps. This lecture deals with the second step.

There are two main ways of constructing a cluster tree, which the literature on the subject generally refers to as 'top-down' and 'bottom-up'.

These terms won't be explained here since an explanation would take us too far afield.

Suffice it to say that this module confines itself to the 'bottom-up' approach, and that nothing further is said about 'top-down'.

As noted, construction of a cluster tree for a data matrix is based on the distance table abstracted from the matrix.

What follows uses the distance table constructed in the last lecture, restricted to a 6 x 6 subset of the original 30 x 30 table.

This makes it possible to show the whole table rather than just a fragment, thereby making the discussion clearer.

To further simplify the presentation, note that the table is symmetrical on either side of the diagonal of zero values.

This is because the distance between any pair of vectors is the same in either direction: the distance between vector 2 and vector 3 is the same as that between vector 3 and vector 2.

Since the upper-right triangle simply duplicates the lower-left triangle, one of the two can be deleted without losing any information; the upper-right one is deleted:
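As an illustration, the 6 x 6 table can be written out from the pairwise distances quoted later in this lecture, and the symmetry argument checked directly. This is a minimal Python sketch; the names TABLE, is_symmetric and lower_triangle are assumptions made for the example and are not part of the module's materials.

```python
# Distances between vectors 1-6, collected from the values quoted in this lecture.
# Row i / column j holds the distance between vector i+1 and vector j+1.
TABLE = [
    [0.00, 2.83, 5.00, 7.07, 10.63, 46.40],
    [2.83, 0.00, 2.24, 4.24, 7.81, 46.87],
    [5.00, 2.24, 0.00, 2.25, 5.66, 48.02],
    [7.07, 4.24, 2.25, 0.00, 3.61, 47.89],
    [10.63, 7.81, 5.66, 3.61, 0.00, 49.66],
    [46.40, 46.87, 48.02, 47.89, 49.66, 0.00],
]


def is_symmetric(table):
    """True if the distance from i to j equals the distance from j to i."""
    n = len(table)
    return all(table[i][j] == table[j][i] for i in range(n) for j in range(n))


def lower_triangle(table):
    """Keep the diagonal and everything below it; row i keeps i + 1 entries."""
    return [row[: i + 1] for i, row in enumerate(table)]


assert is_symmetric(TABLE)
for label, row in enumerate(lower_triangle(TABLE), start=1):
    print(label, " ".join(f"{d:6.2f}" for d in row))
```

Because the table is symmetric, the 21 cells that remain (15 distances plus the 6 diagonal zeros) carry all the information in the full 36-cell table.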

A cluster tree for the first 6 rows of the original data matrix will now be constructed step-by-step, showing how the distance table is used to do this.

The procedure is based on the principle that a set of vectors has a cluster structure if it can be divided into two or more groups in which the members of any given group are close to one another in the data space, and far from members of other clusters in the space.

At each step in tree construction, therefore, one looks for the clusters that are closest to one another and amalgamates them into a superordinate cluster, and this continues until all the vectors have been assigned to one of the clusters.

The following example will demonstrate this. 

Step 1

Initially, each vector is taken to be a cluster on its own, that is, a cluster with only one member.

The distance table is now searched to find the smallest distance between clusters.

This is the distance between clusters (2) and (3): 2.24.

Clusters (2) and (3) are now combined into a superordinate cluster (2,3) by drawing the tree, as below, and then emending the distance table to incorporate the new cluster.
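The search for the smallest distance described above can be sketched in a few lines of Python. The dictionary layout and the names DIST and closest_pair are assumptions made for this illustration; the distances are those quoted in this lecture.

```python
# Lower-left triangle of the 6 x 6 table, keyed by (row, column) with row > column.
DIST = {
    (2, 1): 2.83,
    (3, 1): 5.00, (3, 2): 2.24,
    (4, 1): 7.07, (4, 2): 4.24, (4, 3): 2.25,
    (5, 1): 10.63, (5, 2): 7.81, (5, 3): 5.66, (5, 4): 3.61,
    (6, 1): 46.40, (6, 2): 46.87, (6, 3): 48.02, (6, 4): 47.89, (6, 5): 49.66,
}


def closest_pair(dist):
    """Return (pair, distance) for the smallest entry in the table."""
    pair = min(dist, key=dist.get)
    return pair, dist[pair]


pair, d = closest_pair(DIST)
print(pair, d)  # (3, 2) 2.24
```

If two cells tied for the smallest value, this sketch would simply return whichever it met first; the lecture's example has no such tie.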

Emendation of the distance table takes a bit of understanding, so it is described in detail.

Remove the rows and columns 2 and 3 from the table, and replace them with a single blank row and column to represent the new (2,3) cluster.

Note that 0 is inserted as the distance between (2,3) and itself for the self-evident reason that the distance of any object to itself is always 0.

Insert into the blank cells of the (2,3) row and column the minimum distance from (2,3) to the remaining clusters (1), (4), (5), and (6).

What does this mean?

Referring to the original distance table above, the distance between (2) and (1) is 2.83 and between (3) and (1) it is 5.00; the minimum here is 2.83, and it is inserted into the relevant cell:

The distance between (2) and (4) in the original distance table is 4.24 and between (3) and (4) it is 2.25; the minimum here is 2.25, and it is inserted into the relevant cell.

The distance between (2) and (5) in the original distance table is 7.81 and between (3) and (5) it is 5.66; the minimum here is 5.66, and it is inserted into the relevant cell.

The distance between (2) and (6) in the original distance table is 46.87 and between (3) and (6) it is 48.02; the minimum here is 46.87, and it is inserted into the relevant cell.

Emendation of the distance table is now complete, and the result is the basis for the next step in the construction of the cluster tree.

Note that the table has shrunk by one row/column. This shrinkage will continue as we proceed.
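The emendation just described (remove the two merged clusters, add one new cluster, fill its cells with minimum distances) can be sketched as a small Python function. This is an illustration under stated assumptions rather than the module's own software: the names PAIRS, ORIGINAL and emend are hypothetical, and clusters are represented as nested tuples so that (2, 3) stands for the lecture's cluster (2,3).

```python
# Lower-left triangle of the original table, keyed by (row, column).
PAIRS = {
    (2, 1): 2.83,
    (3, 1): 5.00, (3, 2): 2.24,
    (4, 1): 7.07, (4, 2): 4.24, (4, 3): 2.25,
    (5, 1): 10.63, (5, 2): 7.81, (5, 3): 5.66, (5, 4): 3.61,
    (6, 1): 46.40, (6, 2): 46.87, (6, 3): 48.02, (6, 4): 47.89, (6, 5): 49.66,
}
# Each key is an unordered pair of clusters, so direction does not matter.
ORIGINAL = {frozenset(pair): d for pair, d in PAIRS.items()}


def emend(table, a, b):
    """Return a new table in which clusters a and b are replaced by (a, b)."""
    new = (a, b)
    others = {c for pair in table for c in pair} - {a, b}
    # Keep every cell that involves neither a nor b.
    emended = {p: d for p, d in table.items() if a not in p and b not in p}
    # Fill the new cluster's cells with the smaller of the two old distances.
    for c in others:
        emended[frozenset((new, c))] = min(
            table[frozenset((a, c))], table[frozenset((b, c))]
        )
    return emended


step1 = emend(ORIGINAL, 2, 3)
for c in (1, 4, 5, 6):
    print(c, step1[frozenset(((2, 3), c))])
print(len(ORIGINAL), len(step1))  # 15 10
```

The loop prints 2.83, 2.25, 5.66 and 46.87 for clusters 1, 4, 5 and 6, matching the values inserted above, and the count of cells drops from 15 to 10 as the table shrinks from six clusters to five.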

Step 2

The distance table created in Step 1 is searched to find the smallest distance between clusters. This is the distance between clusters (2,3) and (4): 2.25.

Clusters (2,3) and (4) are now combined into a superordinate cluster ((2,3),4) by drawing the tree, as below, and then emending the distance table to incorporate the new cluster.

Emendation of the distance table proceeds as in Step 1.

Remove the rows and columns (2,3) and 4 from the table, and replace them with a single blank row and column to represent the new ((2,3),4) cluster.

Insert into the blank cells of the ((2,3),4) row and column the minimum distance from ((2,3),4) to the remaining clusters (1), (5), and (6).

The distance between (2,3) and (1) is 2.83 and between (4) and (1) it is 7.07; the minimum here is 2.83, and it is inserted into the relevant cell.

The distance between (2,3) and (5) is 5.66 and between (4) and (5) it is 3.61; the minimum here is 3.61, and it is inserted into the relevant cell.

The distance between (2,3) and (6) is 46.87 and between (4) and (6) it is 47.89; the minimum here is 46.87, and it is inserted into the relevant cell.

Emendation of the distance table is now complete, and the result is the basis for Step 3 below.

Note that the table has again shrunk by one row/column.
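The rule applied in each emendation can be stated compactly. The notation below is an assumption introduced for this summary: d is the tabulated distance, A and B are the two clusters being merged, and C is any remaining cluster. The aligned environment requires the amsmath package. The first line is the general rule; the remaining lines restate the three insertions made in this step.

```latex
\[
d\bigl((A,B),\,C\bigr) \;=\; \min\bigl(d(A,C),\; d(B,C)\bigr)
\]
\[
\begin{aligned}
d\bigl(((2,3),4),\,(1)\bigr) &= \min(2.83,\ 7.07) = 2.83\\
d\bigl(((2,3),4),\,(5)\bigr) &= \min(5.66,\ 3.61) = 3.61\\
d\bigl(((2,3),4),\,(6)\bigr) &= \min(46.87,\ 47.89) = 46.87
\end{aligned}
\]
```

Elsewhere in the clustering literature this minimum rule is usually called single linkage, or nearest-neighbour clustering; the lecture itself does not use that name.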

Step 3

The distance table created in Step 2 is searched to find the smallest distance between clusters. This is the distance between clusters ((2,3),4) and (1): 2.83.

Clusters ((2,3),4) and (1) are now combined into a superordinate cluster (((2,3),4),1) by drawing the tree, as below, and then emending the distance table to incorporate the new cluster.

Emendation of the distance table proceeds as in Steps 1 and 2.

Remove the rows and columns ((2,3),4) and 1 from the table, and replace them with a single blank row and column to represent the new (((2,3),4),1) cluster.

Insert into the blank cells of the (((2,3),4),1) column the minimum distance from (((2,3),4),1) to the remaining clusters (5) and (6).

The distance between ((2,3),4) and (5) is 3.61 and between (1) and (5) it is 10.63; the minimum here is 3.61, and it is inserted into the relevant cell.

The distance between ((2,3),4) and (6) is 46.87 and between (1) and (6) it is 46.40; the minimum here is 46.40, and it is inserted into the relevant cell.

Emendation of the distance table is now complete, and the result is the basis for Step 4 below.

Note that the table has again shrunk by one row/column.
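One consequence of taking the minimum at every emendation is that each cell in an emended table equals the smallest original distance between any member of one cluster and any member of the other. The Python sketch below checks this for the cluster containing vectors 1 to 4; the names ORIGINAL and cluster_distance are assumptions made for the illustration.

```python
# Lower-left triangle of the original 6 x 6 table as unordered pairs of vectors.
ORIGINAL = {
    frozenset(pair): d
    for pair, d in {
        (2, 1): 2.83,
        (3, 1): 5.00, (3, 2): 2.24,
        (4, 1): 7.07, (4, 2): 4.24, (4, 3): 2.25,
        (5, 1): 10.63, (5, 2): 7.81, (5, 3): 5.66, (5, 4): 3.61,
        (6, 1): 46.40, (6, 2): 46.87, (6, 3): 48.02, (6, 4): 47.89, (6, 5): 49.66,
    }.items()
}


def cluster_distance(members_a, members_b, dist):
    """Smallest original distance between a member of one cluster and a member of the other."""
    return min(dist[frozenset((a, b))] for a in members_a for b in members_b)


# The cluster (((2,3),4),1) contains vectors 1, 2, 3 and 4.
print(cluster_distance({1, 2, 3, 4}, {5}, ORIGINAL))  # 3.61
print(cluster_distance({1, 2, 3, 4}, {6}, ORIGINAL))  # 46.4
```

Both results agree with the values inserted into the table in this step, even though they are computed here from the original table rather than from the Step 2 table.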

Step 4

The distance table created in Step 3 is searched to find the smallest distance between clusters.

This is the distance between clusters (((2,3),4),1) and (5): 3.61.

Clusters (((2,3),4),1) and (5) are now combined into a superordinate cluster ((((2,3),4),1),5) by drawing the tree and then emending the distance table to incorporate the new cluster.

Emendation of the distance table proceeds as in Steps 1-3.

Remove the rows and columns (((2,3),4),1) and 5 from the table, and replace them with a single blank row and column to represent the new ((((2,3),4),1),5) cluster.

Insert into the blank cell of the ((((2,3),4),1),5) column the minimum distance from ((((2,3),4),1),5) to the remaining cluster (6).

The distance between (((2,3),4),1) and (6) in Table 4 is 46.40 and between (5) and (6) it is 49.66; the minimum here is 46.40, and it is inserted into the relevant cell.

Step 5

The distance table created in Step 4 is searched to find the smallest distance between clusters.

There is only one remaining value: 46.40, the distance between ((((2,3),4),1),5) and (6).

Clusters ((((2,3),4),1),5) and (6) are now combined into a superordinate cluster (((((2,3),4),1),5),6) by drawing the tree and then emending the distance table to incorporate the new cluster.

Remove the rows and columns ((((2,3),4),1),5) and 6 from the table, and replace them with a single blank row and column to represent the new (((((2,3),4),1),5),6) cluster.

All 6 vectors have now been incorporated into the cluster tree, and tree construction stops.
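The whole procedure can be gathered into a single loop: search for the smallest distance, merge that pair, emend the table, and repeat until the table is empty. The Python sketch below is one possible rendering under stated assumptions, not the module's own software. The names ORIGINAL, leaves and build_tree are hypothetical, clusters are nested tuples, and within each new cluster the larger sub-cluster is written first so that the labels match those used in this lecture.

```python
# Lower-left triangle of the original 6 x 6 table as unordered pairs of clusters.
ORIGINAL = {
    frozenset(pair): d
    for pair, d in {
        (2, 1): 2.83,
        (3, 1): 5.00, (3, 2): 2.24,
        (4, 1): 7.07, (4, 2): 4.24, (4, 3): 2.25,
        (5, 1): 10.63, (5, 2): 7.81, (5, 3): 5.66, (5, 4): 3.61,
        (6, 1): 46.40, (6, 2): 46.87, (6, 3): 48.02, (6, 4): 47.89, (6, 5): 49.66,
    }.items()
}


def leaves(cluster):
    """Flatten a nested cluster label such as ((2, 3), 4) into [2, 3, 4]."""
    if isinstance(cluster, int):
        return [cluster]
    return sorted(leaf for part in cluster for leaf in leaves(part))


def build_tree(table):
    """Merge the closest pair repeatedly; return a list of (new cluster, distance)."""
    table = dict(table)
    merges = []
    while table:
        pair = min(table, key=table.get)  # smallest distance in the table
        distance = table[pair]
        # Larger cluster first, then lower vector number, to mirror the lecture's labels.
        a, b = sorted(pair, key=lambda c: (-len(leaves(c)), leaves(c)))
        new = (a, b)
        others = {c for p in table for c in p} - {a, b}
        emended = {p: d for p, d in table.items() if a not in p and b not in p}
        for c in others:
            emended[frozenset((new, c))] = min(
                table[frozenset((a, c))], table[frozenset((b, c))]
            )
        table = emended
        merges.append((new, distance))
    return merges


for cluster, distance in build_tree(ORIGINAL):
    print(distance, cluster)
```

Run on the 6 x 6 table, the loop performs five merges at distances 2.24, 2.25, 2.83, 3.61 and 46.40, ending with the cluster (((((2, 3), 4), 1), 5), 6), the same sequence as Steps 1 to 5.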