


Slide 1

Cluster analysis

Species    Sequence
P.symA     AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.xanA     AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.polaA    AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTCCGTATGCTATGTAGCTGGAGGGTACTGACGGTAG
C.platA    AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTAAGGGTACTGATTTTAG
C.gradA    AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTCCGGGTACTGATTTTAG
D.symT     TATGCGAGACGTGAAAAATCTTTAGGGCTAAGGTGATTATTTCGGTTGCTATGTAGAGGAAGGGTACTGACGGTAG

A cluster analysis is a two-step process that includes the choice of a) a distance metric and b) a linkage algorithm.

Slide 2

Cluster analysis tries to minimize within-cluster distances and to maximize between-cluster distances.

Slide 3

The distance metric. (Sequence table as on Slide 1.) In the simplest case a distance matrix counts the number of differences between two data sets:

         P.sym  P.xan  P.pola  C.plat  C.grad  D.sym
P.sym        0      2       3       7       9     13
P.xan        2      0       4      11      11     15
P.pola       3      4       0      10      10     12
C.plat       7     11      10       0       2     19
C.grad       9     11      10       2       0     19
D.sym       13     15      12      19      19      0

Slide 4

Species presence-absence matrix A:

          Site 1  Site 2  Site 3  Site 4
P.sym          1       0       1       1
P.xan          1       0       0       1
P.pola         0       1       0       1
C.plat         0       1       1       1
C.grad         1       0       0       0
D.sym          1       0       1       1
Sum            4       2       3       5

Distance matrix D = A^T A:

          Site 1  Site 2  Site 3  Site 4
Site 1         4       0       2       3
Site 2         0       2       1       2
Site 3         2       1       3       3
Site 4         3       2       3       5

Similarity matrix (Soerensen index; the Jaccard index is an alternative):

          Site 1    Site 2    Site 3    Site 4
Site 1    1         0         0.571429  0.666667
Site 2    0         1         0.4       0.571429
Site 3    0.571429  0.4       1         0.75
Site 4    0.666667  0.571429  0.75      1

Slide 5

Abundance data:

          Site 1  Site 2  Site 3  Site 4
P.sym       0.31    0.12    0.24    0.05
P.xan       0.20    0.65    0.54    0.44
P.pola      0.38    0.81    0.28    0.52
C.plat      0.35    0.69    0.86    0.30
C.grad      0.07    0.99    0.64    0.84
D.sym       0.43    0.78    0.73    0.21
Sum         1.75    4.04    3.30    2.36

Candidate metrics: Euclidean distance, Manhattan distance, correlation distance, Bray-Curtis distance.

Correlation distance matrix:

          Site 1    Site 2    Site 3    Site 4
Site 1     1        -0.27534  -0.04805  -0.71587
Site 2    -0.27534   1         0.519139  0.807173
Site 3    -0.04805   0.519139  1         0.157251
Site 4    -0.71587   0.807173  0.157251  1

Due to squaring, Euclidean distances put particular weight on outliers; they need a linear scale. The Manhattan distance also needs linear scales, and despite a large distance between data sets the metric might be zero. Correlations are sensitive to non-linearities in the data. The Bray-Curtis distance is equivalent to the Soerensen index for presence-absence data and suffers from the same shortcoming as the Manhattan distance.
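A minimal Python sketch of these distance computations (assuming numpy and scipy are installed; the sequences and abundance values are copied from the slides, and the species labels follow the distance matrix). Note that scipy reports the correlation distance as 1 - r rather than the raw correlations shown above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Aligned sequences from Slides 1 and 3
seqs = {
    "P.sym":  "AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG",
    "P.xan":  "AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG",
    "P.pola": "AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTCCGTATGCTATGTAGCTGGAGGGTACTGACGGTAG",
    "C.plat": "AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTAAGGGTACTGATTTTAG",
    "C.grad": "AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTCCGGGTACTGATTTTAG",
    "D.sym":  "TATGCGAGACGTGAAAAATCTTTAGGGCTAAGGTGATTATTTCGGTTGCTATGTAGAGGAAGGGTACTGACGGTAG",
}

# Simplest distance: count the differing positions for every pair of sequences
names = list(seqs)
n = len(names)
D = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        D[i, j] = sum(a != b for a, b in zip(seqs[names[i]], seqs[names[j]]))
print(D)

# Abundance data from Slide 5 (sites as rows, species as columns)
X = np.array([
    [0.31, 0.20, 0.38, 0.35, 0.07, 0.43],   # Site 1
    [0.12, 0.65, 0.81, 0.69, 0.99, 0.78],   # Site 2
    [0.24, 0.54, 0.28, 0.86, 0.64, 0.73],   # Site 3
    [0.05, 0.44, 0.52, 0.30, 0.84, 0.21],   # Site 4
])

# The four metrics discussed on Slide 5 ('cityblock' is the Manhattan
# distance; scipy's 'correlation' is the distance 1 - r, not r itself)
for metric in ("euclidean", "cityblock", "braycurtis", "correlation"):
    print(metric)
    print(squareform(pdist(X, metric=metric)).round(3))
```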
Slide 6

(Distance matrix and sequence table as on Slides 3 and 1.)

Linkage algorithm: we first combine the species that are nearest to each other to form an inner cluster. In the next step we look for the species or cluster that is closest to the average distance of the initial cluster. We continue this procedure until all species are grouped. The single linkage algorithm tends to produce many small clusters. (Dendrogram with leaves P.sym, P.xan, P.pola, C.plat, C.grad, D.sym.)

Slide 7

Sequential versus simultaneous algorithms: in simultaneous algorithms the final solution is obtained in a single step and not stepwise as in the single linkage above.

Agglomerative versus divisive algorithms: agglomerative procedures operate bottom-up, divisive procedures top-down.

Monothetic versus polythetic algorithms: polythetic procedures use several descriptors of linkage, monothetic procedures use the same descriptor (for instance maximum association) at each step.

Hierarchical versus non-hierarchical algorithms: hierarchical methods proceed in a non-overlapping way; during the linkage process all members of lower clusters are members of the next higher cluster. Non-hierarchical methods proceed by optimizing within-group homogeneity, so they might include members not contained in a higher-order cluster.

The single linkage algorithm uses the minimum distance between the members of two clusters as the measure of cluster distance; it favours chains of small clusters. Average linkage uses average distances between clusters and frequently gives larger clusters; the most often used average linkage algorithm is the Unweighted Pair-Group Method with Averages (UPGMA). The Ward algorithm calculates the total sum of squared deviations from the mean of a cluster and assigns members so as to minimize this sum; it often gives clusters of rather equal size. Median clustering tries to minimize within-cluster variance.

Slide 8

To check the performance of different cluster algorithms and distance metrics we use a matrix of random numbers. Which clusters should we accept?

Slide 9

Different cluster algorithms give different results. We accept those clusters that are stable irrespective of the algorithm. In the case of our random numbers the clustering is very unstable.

Slide 10

Two methods detected the clusters OP and ABC. All other items are not clearly separated. The position of item F remains unclear.

Slide 11

Clustering using a predefined number of clusters: K-means. (Scatter plot of items A to P.) K-means clustering starts from a predefined number of clusters and then arranges the items so that the distances between clusters are maximized relative to the distances within clusters. Technically, the algorithm first randomly assigns cluster means and then places the items (each time calculating new cluster means) until an optimal solution (convergence) has been reached.
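A sketch of both procedures in Python (assuming scipy and scikit-learn are installed): single and average linkage reproduce the agglomerative procedure of Slides 6 and 7 on the sequence distance matrix, and KMeans illustrates Slide 11 on the abundance data of Slide 5. The choice n_clusters=2 is arbitrary, purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

# Sequence distance matrix from Slides 3 and 6
names = ["P.sym", "P.xan", "P.pola", "C.plat", "C.grad", "D.sym"]
D = np.array([
    [ 0,  2,  3,  7,  9, 13],
    [ 2,  0,  4, 11, 11, 15],
    [ 3,  4,  0, 10, 10, 12],
    [ 7, 11, 10,  0,  2, 19],
    [ 9, 11, 10,  2,  0, 19],
    [13, 15, 12, 19, 19,  0],
], dtype=float)

# Agglomerative clustering: 'single' merges by the minimum between-cluster
# distance, 'average' (UPGMA) by the mean distance.  (Ward strictly
# requires Euclidean distances between raw observations.)
condensed = squareform(D)  # scipy expects the condensed upper triangle
for method in ("single", "average"):
    Z = linkage(condensed, method=method)
    leaves = dendrogram(Z, no_plot=True, labels=names)["ivl"]
    print(method, "leaf order:", leaves)

# K-means on the abundance data of Slide 5 (sites as rows, species as
# columns); n_clusters=2 is an arbitrary illustrative choice.
X = np.array([
    [0.31, 0.20, 0.38, 0.35, 0.07, 0.43],   # Site 1
    [0.12, 0.65, 0.81, 0.69, 0.99, 0.78],   # Site 2
    [0.24, 0.54, 0.28, 0.86, 0.64, 0.73],   # Site 3
    [0.05, 0.44, 0.52, 0.30, 0.84, 0.21],   # Site 4
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("k-means labels:", km.labels_)        # cluster membership per site
print("cluster means:", km.cluster_centers_.round(2))
```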
K-means always uses Euclidean distances.

Slide 12

Neighbour joining. Neighbour joining is particularly used to generate phylogenetic trees. You need dissimilarities (phylogenetic distances) d(X,Y) between all elements X and Y. Then, with n elements:

1. Calculate Q(X,Y) = (n - 2) d(X,Y) - Σk d(X,k) - Σk d(Y,k) for every pair.
2. Select the pair with the lowest value of Q and join it into a new node u.
3. Calculate the distances from the new node: d(X,u) = d(X,Y)/2 + [Σk d(X,k) - Σk d(Y,k)] / [2(n - 2)].
4. Calculate new dissimilarities d(u,k) = [d(X,k) + d(Y,k) - d(X,Y)]/2 to all remaining elements k.

Slide 13

Slide 14

Slide 15

Ordination

Ordination comprises a number of techniques to classify data according to predefined standards. The simplest ordination technique is cluster analysis. An easy but powerful technique is principal component analysis (PCA).

Slide 16

Factor analysis: is it possible to group the variables according to their values for the countries? (Diagram: the variables T (Jan), T (July), Mean T, Diff T, GDP, GDP/C and Elev are connected to Factor 1, Factor 2 and Factor 3 by correlations.) The task is to find the coefficients of correlation between the original variables and the extracted factors from the analysis of the coefficients of correlation between the original variables.

Slide 17

Slide 18

Because the f values are also Z-transformed, the sum of the squared loadings of a factor equals its eigenvalue: λj = Σi aij².

Slide 19

How to compute the factor loadings? The dot product of orthonormal matrices gives the unity matrix. This leads to the fundamental theorem of factor analysis: the correlation matrix is reproduced by the loading matrix times its transpose, R = A Aᵀ.

Slide 20

Z-transformed factor values for n cases and the factors F:

Case   F1    F2
1      f11   f12
2      f21   f22
3      f31   f32
4      f41   f42
5      f51   f52
6      f61   f62

Factors are new variables. They have factor values (independent of the loadings) for each case. These factors can now be used in further analyses, for instance in regression analysis.

Slide 21

Slide 22

We are looking for a new x,y system where the data are closest to the longest axis. PCA in fact rotates the original data set to find a solution where the data are closest to the axes. PCA leaves the number of axes unchanged. Only a few of these rotated axes can be interpreted from their distances to the original axes; we interpret the new axes on the basis of their distance (measured by their angle) to the original axes. The new axes are the principal axes (eigenvectors) of the dispersion matrix obtained from the raw data. PCA is an eigenvector method: the principal axes are eigenvectors.

Slide 23

Slide 24

The programs differ in the direction of eigenvectors. This does not change the results, but it might pose problems with the interpretation of factors in terms of the original variables.

Slide 25

Principal coordinate analysis: PCoA uses different metrics to generate the dispersion matrix.

Slide 26

Using PCA or PCoA to group cases:
- A factor might be interpreted if more than two variables have loadings higher than 0.7.
- A factor might be interpreted if more than four variables have loadings higher than 0.6.
- A factor might be interpreted if more than 10 variables have loadings higher than 0.4.

Slide 27

Correspondence analysis (reciprocal averaging, seriation, contingency table analysis) ordinates the rows and columns of matrices simultaneously according to their principal axes. It operates on a contingency table and uses χ²-distances instead of correlation coefficients or Euclidean distances.

Slide 28

We take the transposed raw data matrix and calculate eigenvectors in the same way: correspondence analysis is row and column ordination, combined in a joint plot.

Slide 29

The plots are similar but differ numerically and in orientation. The orientation problem comes again from the way Excel calculates eigenvalues. Row and column eigenvectors differ in scale; for a joint plot the vectors have to be rescaled.
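A minimal correspondence-analysis sketch in Python (a generic SVD implementation, not the Excel route of the slides; the contingency table N is a hypothetical example, not data from the slides). The row and column scores come out on a common scale, ready for a joint plot:

```python
import numpy as np

# Hypothetical contingency table (rows = species, columns = sites)
N = np.array([
    [10.0, 2.0, 0.0, 4.0],
    [ 8.0, 1.0, 0.0, 3.0],
    [ 0.0, 7.0, 5.0, 2.0],
    [ 1.0, 6.0, 9.0, 0.0],
])

P = N / N.sum()        # correspondence matrix
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals: their singular value decomposition carries the
# chi-square distances that CA uses instead of Euclidean distances.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates: rows and columns are ordinated simultaneously
# and rescaled to a common scale for the joint plot.
F = (U * sv) / np.sqrt(r)[:, None]     # row scores
G = (Vt.T * sv) / np.sqrt(c)[:, None]  # column scores
print("row scores, axis 1:   ", F[:, 0].round(3))
print("column scores, axis 1:", G[:, 0].round(3))
```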
Slide 30

Slide 31

Reciprocal averaging: sorting according to the row/column eigenvalues rearranges the matrix so that the largest values lie near the matrix diagonal.

Slide 32

Seriation using reciprocal averaging: repeat until the scores become stable. The Excel formulas from the slide:

=los()                                         (random starting scores; LOS is the Polish-Excel name of RAND)
=(B85*B$97+C85*C$97+D85*D$97+E85*E$97)/$F85    (weighted mean)
=(H85-H$94)/H$95                               (Z-transformed weighted means)
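The same iteration as a Python sketch (reusing the presence-absence matrix of Slide 4; the starting scores are random, mirroring the =los() cell, and the loop repeats the weighted mean and Z-transform until the scores stabilize):

```python
import numpy as np

# Presence-absence matrix from Slide 4 (rows = species, columns = sites)
A = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
], dtype=float)

rng = np.random.default_rng(0)
col = rng.random(A.shape[1])   # random starting column scores (=los())

for _ in range(100):
    # Row scores as weighted means of the column scores
    # (Excel: =(B85*B$97+C85*C$97+D85*D$97+E85*E$97)/$F85)
    row = A @ col / A.sum(axis=1)
    # Column scores as weighted means of the row scores
    new_col = A.T @ row / A.sum(axis=0)
    # Z-transform of the scores (Excel: =(H85-H$94)/H$95)
    new_col = (new_col - new_col.mean()) / new_col.std()
    if np.allclose(new_col, col):  # repeat until scores become stable
        break
    col = new_col

# Sorting rows and columns by their final scores moves the largest
# values toward the matrix diagonal (seriation, Slide 31).
print("column order:", np.argsort(col))
print("row order:   ", np.argsort(row))
```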