Handbook of Cluster Analysis (provisional top level file) C. Hennig, M. Meila, F. Murtagh, R. Rocci (eds.) September 10, 2012




Contents

1 Visual clustering for data analysis and graphical user interfaces
1.1 Introduction
1.2 Multidimensional data
1.2.1 Performance measures in information retrieval
1.2.2 Dendrogram to define the clusters of performance measures
1.2.3 Principal Component Analysis to validate the clusters
1.2.4 3D-map
1.3 Graphs and collaborative networks
1.3.1 Basis of collaboration networks
1.3.2 Geographic and thematic collaboration networks
1.3.3 Large collaborative networks
1.3.4 Temporal collaborative networks
1.4 Curve clustering
1.4.1 Time series microarray experiment
1.4.2 Principal Component Analysis to characterize clusters
1.4.3 Visualizing curves
1.4.4 Heatmap to combine two clusterings
1.5 Conclusion


Chapter 1

Visual clustering for data analysis and graphical user interfaces

Sébastien Déjean (1) and Josiane Mothe (2)

(1) Institut de Mathématiques de Toulouse, UMR 5219, Université de Toulouse et CNRS

(2) Institut de Recherche en Informatique de Toulouse, UMR 5505, Université de Toulouse et CNRS

Abstract

Cluster analysis is a major method in data mining for presenting overviews of large data sets. Clustering methods allow dimension reduction by finding groups of similar objects or elements. Visual cluster analysis has been defined as a specialization of cluster analysis and is considered a solution for handling complex data through interactive exploration of clustering results. In this chapter, we consider three case studies in order to illustrate cluster analysis and interactive visual analysis. The first case study relates to the information retrieval field and illustrates the case of multi-dimensional data, in which the objects to analyze are represented by various features or variables. Evaluation in information retrieval considers many performance measures. Cluster analysis is used to reduce the number of measures to a small number that can be used to compare various search engines. The second case study considers networks, in which the data to analyze is represented in the form of adjacency matrices. The data we used is obtained from publications; cluster analysis is used to analyze collaborative networks. The third case study relates to curve clustering and applies when temporal data is involved. In this case study, the application is time series gene expression. We conclude this chapter by presenting some other types of data for which visual clustering can be used for analysis purposes, and some tools that implement other visual analysis functionalities we did not present in the case studies.

1.1 Introduction

Cluster analysis is a major method in data mining for presenting overviews of large data sets and has many applications in machine learning, image processing, social network analysis, bioinformatics, marketing, e-business, and other fields. Clustering methods reduce dimension by finding groups of similar objects or elements [20]. A large number of clustering methods have been developed in the literature to achieve this general goal; they differ in the method used to build the clusters and in the distance they use to decide whether objects are similar or not. Another important aspect of clustering is cluster validation, which relies on validation measures [19, 9]. The decision on which method to use and on the optimal number of clusters can depend on the application and on the analyzed data set (as can be appreciated from other chapters in this volume). Exploring and interpreting groups of elements that share similar properties or behavior, rather than individual objects, allows the analyst to consider large data sets and to understand their inner structure. Visual cluster analysis has been defined as a specialization of cluster analysis and is considered a solution for handling complex data through interactive exploration of clustering results [43]. Shneiderman's information seeking mantra “overview first, zoom and filter, and then details on demand” [42] applies to visual clustering. Various tools have been developed for visual cluster analysis that provide these functionalities to explore the results.

For cluster analysis, objects are often depicted as feature vectors or matrices: objects can thus be viewed as points in a multi-dimensional space [5]. More complex data cannot be represented this way: this is the case for relational data (e.g. social networks) or time series and temporal data. In this chapter, we consider three case studies in order to illustrate cluster analysis and interactive visual analysis. First, we illustrate the case of multi-dimensional data, in which the objects to analyze are represented by various features or variables. The case study we chose is information retrieval (IR) evaluation, for which many performance measures have been defined in the literature. Cluster analysis is used to reduce the number of measures to a small number that covers the various points of view that can be used to compare search engines. We then consider networks, in which the data to analyze is represented in the form of adjacency matrices. We chose to illustrate this case with collaborative networks built from publications. We show how visual analysis can be used to find clusters of authors. Moreover, we extend this type of exploration to more complex analyses, combining authorship with geographic and topic information. We also illustrate how large scale and temporal collaborative networks can be analyzed. The third case study relates to curve clustering and applies when temporal data is involved. In this case study, where the application is time series gene expression, we show how clustering the shapes of the curves rather than the absolute level of expression makes it possible to find different types of gene expression. We conclude this chapter by presenting some other types of data for which visual clustering can be used for analysis purposes, and some tools that implement other visual analysis functionalities we did not present in the case studies.

1.2 Multidimensional data

Multivariate statistical methods are generally based on a matrix of data as a starting point.

From a statistical point of view, the considered matrix consists of rows, which correspond

to objects or individuals to analyze, and of columns, which correspond to variables used

to characterize the individuals. No particular structure is assumed about the variables; in

particular, an arbitrary permutation of the columns will not affect the cluster analysis.
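The invariance to column permutation can be checked on a toy example. The sketch below is only an illustration, not the authors' code: it assumes Euclidean distance between rows, reuses the three visible measure values of the first rows of Table 1.1, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two rows of the data matrix."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(rows):
    """Pairwise distances between all rows (individuals)."""
    return [[euclidean(r, s) for s in rows] for r in rows]


def permute_columns(rows, order):
    """Reorder the columns (variables) of every row."""
    return [[row[j] for j in order] for row in rows]


# Three individuals described by three variables (values from Table 1.1).
rows = [
    [0.2500, 0.1250, 0.1111],
    [0.3077, 0.2692, 0.3077],
    [0.4737, 0.4474, 0.4211],
]
order = [2, 0, 1]  # an arbitrary permutation of the columns

d_original = distance_matrix(rows)
d_permuted = distance_matrix(permute_columns(rows, order))

# Distances agree up to floating-point rounding, so any clustering
# built on them is unchanged by the permutation.
same_distances = all(
    math.isclose(a, b)
    for row_a, row_b in zip(d_original, d_permuted)
    for a, b in zip(row_a, row_b)
)
print(same_distances)
```

Since a distance-based clustering only sees the pairwise distances, identical distance matrices imply identical clusters.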


1.2.1 Performance measures in information retrieval

The study presented here is detailed in [6]. The effectiveness of information retrieval systems is evaluated by running searches on a collection of documents, using a set of test queries and, for each query, the list of the relevant documents. This evaluation framework also includes performance measures that make it possible to control the impact of a modification of the search parameters. A large number of measures are available to assess the performance of a system, some of them being more widely used, such as mean average precision or recall-precision curves.
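To make two of these measures concrete, the sketch below computes precision at a cutoff and (mean) average precision using their usual textbook definitions. It is an assumption that these match the exact variants used in the study; the function names and document identifiers are hypothetical.

```python
def precision_at(k, ranked_docs, relevant_docs):
    """Fraction of the k first retrieved documents that are relevant."""
    relevant = set(relevant_docs)
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k


def average_precision(ranked_docs, relevant_docs):
    """Mean of the precision values at each rank where a relevant
    document is retrieved, divided by the number of relevant documents."""
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_docs) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy query: relevant documents are retrieved at ranks 1 and 3.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranked, relevant)   # (1/1 + 2/3) / 2
p2 = precision_at(2, ranked, relevant)     # 1 relevant among the first 2
print(round(ap, 4), p2)
```

Each such measure yields one value per run and per query, which is what fills one column of the matrix analyzed below.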

In the present study, a row (an individual) corresponds to a run characterized by the

performance measures, which indeed correspond to variables (columns). The matrix we

have to analyze is composed of 23,518 rows and 130 columns. An extract of the matrix we

analyzed is presented in Table 1.1.

Table 1.1: Extract of the analyzed matrix. The first four columns represent an identifier, the collection on which the search engine was applied, the search engine and the information need, respectively. Other columns correspond to performance measures.

Line   Year       System   Topic  0.20R.prec  0.40R.prec  0.60R.prec  ...
1      TREC 1993  Brkly3   101    0.2500      0.1250      0.1111      ...
2      TREC 1993  Brkly3   101    0.3077      0.2692      0.3077      ...
3      TREC 1993  Brkly3   101    0.4737      0.4474      0.4211      ...
...    ...        ...      ...    ...         ...         ...         ...
23516  TREC 1999  weaver2  448    0.0000      0.0000      0.0357      ...
23517  TREC 1999  weaver2  449    0.0000      0.0000      0.0000      ...
23518  TREC 1999  weaver2  450    0.7627      0.6864      0.5966      ...

Among many problems that can be addressed regarding this data set, we focus here

on a clustering task. Indeed, one motivation of this work is to compare all measures and

to help the user to choose a small number of them when evaluating different IR systems.

Relationships between the 130 performance measures available for individual queries are

investigated and it is shown that they can be clustered into homogeneous clusters.

In our statistical approach, we focused on the columns of the matrix, in order to highlight the relationships between performance measures. To achieve this analysis, we have considered three exploratory multivariate methods: hierarchical clustering, partitioning and Principal Component Analysis (PCA).

Clustering of the performance measures was performed in order to define a small number of clusters, each one including redundant measures. Partitioning was used in order to stabilize the results of the hierarchical clustering. PCA provides indicators and graphical displays giving a synthetic, low-dimensional view of the correlation structure of the columns and of the clusters previously defined. Each method is illustrated in the following. A synthetic view combining these methods is proposed as a 3D-map.
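The chapter does not spell out the partitioning algorithm, so the sketch below only illustrates one common way of stabilizing a tree cut: a k-means-style reallocation that starts from the labels given by the hierarchical clustering and moves each point to its nearest centroid until nothing changes. The function name, the toy points and the starting labels are all hypothetical.

```python
import math


def consolidate(points, labels, max_iter=100):
    """Reassign points to the nearest cluster centroid, starting from
    the labels obtained by cutting a hierarchical tree."""
    labels = list(labels)
    for _ in range(max_iter):
        # Group the points by current label (empty clusters disappear).
        groups = {}
        for point, label in zip(points, labels):
            groups.setdefault(label, []).append(point)
        # Centroid = coordinate-wise mean of the points of a cluster.
        centroids = {
            label: tuple(sum(coord) / len(members) for coord in zip(*members))
            for label, members in groups.items()
        }
        new_labels = [
            min(sorted(centroids), key=lambda lab: math.dist(p, centroids[lab]))
            for p in points
        ]
        if new_labels == labels:
            break  # stable partition reached
        labels = new_labels
    return labels


points = [(0, 0), (1, 0), (0, 1), (7, 7), (10, 10), (11, 10), (10, 11)]
tree_cut = [0, 0, 0, 0, 1, 1, 1]   # hypothetical labels from the dendrogram
stable = consolidate(points, tree_cut)
print(stable)
```

In this toy run the point (7, 7), attached to the first cluster by the tree, ends up in the second one because it is closer to that cluster's centroid.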

1.2.2 Dendrogram to define the clusters of performance measures

Agglomerative clustering proposes a classification of performance measures without any prior

information on the number of clusters.

[Figure 1.1: dendrogram with the IR performance measures as leaves; vertical axis: Height; inset in the upper-right corner: inertia of the upper nodes]

Figure 1.1: Dendrogram representing the hierarchical clustering (using Euclidean distance and Ward’s criterion) of the IR performance measures with a relevant pruning at 6 clusters. The sub-plot in the upper-right corner represents the height of the 12 upper nodes; highlighting the first five bars refers to 6 clusters retained.


The choice of the number of clusters is a crucial problem, to be dealt with a posteriori when performing clustering (see for instance [2, 11] in the context of text clustering, and other chapters in this volume). With Ward’s criterion, the vertical scale of the tree represents the loss of between-cluster inertia at each clustering step; a relevant pruning level is characterized by a relatively large difference between the heights of two successive nodes. In the sub-plot of Figure 1.1, a relevant cut corresponds to a point for which there is a strong slope on the left and a weak slope on the right. Under these conditions, depending on the level of detail desired, one can retain here 2, 3, 5 or 6 clusters. This last option is represented in Figure 1.1 with 6 demarcated clusters.
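The "large difference between successive node heights" rule can be sketched as follows. This is a simplified heuristic that only ranks the gaps; it does not check the slope on both sides, and the heights used here are made up, not those of Figure 1.1.

```python
def candidate_cluster_counts(node_heights, how_many=3):
    """Rank possible numbers of clusters by the size of the gap between
    the heights of successive nodes, starting from the top of the tree."""
    heights = sorted(node_heights, reverse=True)
    gaps = []
    for i in range(len(heights) - 1):
        # Cutting between node i and node i + 1 (0-based, highest first)
        # leaves i + 1 nodes above the cut, hence i + 2 clusters.
        gaps.append((heights[i] - heights[i + 1], i + 2))
    # Largest gap first; ties are broken in favor of fewer clusters.
    gaps.sort(key=lambda gap: (-gap[0], gap[1]))
    return [n_clusters for _, n_clusters in gaps[:how_many]]


# Hypothetical heights of the 8 upper nodes of a dendrogram.
upper_nodes = [1500, 900, 850, 500, 480, 200, 190, 180]
suggested = candidate_cluster_counts(upper_nodes)
print(suggested)
```

With these heights the three largest gaps suggest 2, 4 or 6 clusters; as in the caption of Figure 1.1, keeping the five highest nodes above the cut corresponds to 6 clusters.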

1.2.3 Principal Component Analysis to validate the clusters

In Figure 1.2, a color is associated with each cluster obtained as depicted in Figure 1.1. The relative position of the clusters on the first and second principal components is consistent with the clusters obtained after the clustering process. Globally, the measures in each cluster appear projected relatively close to each other. Furthermore, the plot also offers a partial (because of the projection onto a 2D space) representation of the inertia of the clusters. For instance, the IR performance measures grouped in cluster 3 (see Figure 1.1), displayed in green, appear much closer to each other in Figure 1.2 than the performance measures in the other clusters.

Regarding principal component 1 (horizontal axis in Figure 1.2), the main phenomenon is the opposition between clusters 4 (blue), 5 (cyan) and 6 (magenta) on the right, and clusters 1 (black) and 2 (red) on the left. These relative positions highlight an opposition between recall-oriented clusters (4, 5 and 6) and precision-oriented ones (1 and 2). Along PC2 (vertical axis), the opposition is between clusters 1 (black) and 4 (blue) at the bottom, and clusters 2 (red) and 6 (magenta) at the top. In this case, the discrimination globally concerns the number of documents on which the performance measure is based: few documents (fewer than 30) for clusters 2 and 6, and many more (more than 100) for clusters 1 and 4. Not surprisingly, cluster 3 (green), mainly composed of global measures aggregating recall/precision curves such as MAP, is located in the center of the plot. Cluster 3 acts as an intermediate between the other clusters.


[Figure 1.2: PCA plot of the IR performance measures; horizontal axis: Dim 1 (34.31%); vertical axis: Dim 2 (15.52%); legend: cluster 1 to cluster 6]

Figure 1.2: Representation of variables on the first two principal components PC1 and PC2, respectively explaining 34% and 15% of the total variance. Colors reveal the cluster the variables belong to after partitioning.


In this case, the visualization by PCA and the interpretation of the principal components provide a very clear characterization of the clusters and of their mutual relations. Furthermore, a 3D representation considering the first three principal components can highlight some potentially peculiar arrangement of points. For instance, the measure exact_relative_unranked_avg_prec appears in black near the red cluster in Figure 1.2, but a look at the PCA in 3 dimensions (Figure 1.3) reveals a clear distinction between the red and the black clusters.

Figure 1.3: Representation of variables on the first three principal components PC1, PC2 and PC3 using the rgl package [1] in R [39].

Figure 1.3 was obtained using the rgl package [1] for the R software [39]. This package uses OpenGL [51] to provide a visualization device system in R with an interactive viewpoint navigation facility.


1.2.4 3D-map

Several approaches have been proposed to combine several methods into one single graphic. Koren et al. [30], for instance, suggest superimposing a dendrogram over a synchronized low-dimensional embedding, resulting in a single image showing all the clusters and the relations between them. In the same vein, Husson et al. [22] proposed to combine PCA, hierarchical clustering and partitioning to enrich the description of the data. The PCA representation is used as a basis for drawing the hierarchical tree in a 3D-map. An implementation of such a representation is available in the FactoMineR package [23] for R. In Figure 1.4, IR performance measures are:

• located on the PCA factor map,

• linked through the branch of the dendrogram,

• colored according to the cluster they belong to after partitioning.

This representation includes the previous one with PCA (Figure 1.2) and adds information about the changes that occurred when partitioning was performed after hierarchical clustering. For instance, one performance measure (exact_unranked_avg_prec), colored in black, was linked by the dendrogram to the green cluster, in the bottom left corner. The partitioning method reallocated it to the black cluster, which seems consistent with the location of this point relative to the two clusters considered.

[Figure 1.4: plot titled "Hierarchical clustering on the factor map"; axes: Dim 1 (34.31%), Dim 2 (15.52%) and height; legend: cluster 1 to cluster 6]

Figure 1.4: Representation from the FactoMineR package [23] combining PCA, hierarchical clustering and partitioning.

1.3 Graphs and collaborative networks

As explained in [35], graphs are among the visualization tools most commonly used in the literature, as linking concepts or objects is the most common mining technique. Graph agents use 2D matrices of any type resulting from the pre-treatment of the raw information; these matrices correspond to adjacency matrices. Adjacency matrices can be obtained by analyzing co-occurrences in texts, co-authoring in publications, or any information crossing [12]. From an adjacency matrix, a graph is built: graph nodes correspond to the values of the crossed items, whereas edges reflect the strength of the co-occurrence values. Graph drawing can be based on force-based algorithms [18]. In this type of algorithm, a graph node is considered as an object while an edge is considered as a spring. Edge weights correspond to either repulsion or attraction forces between the objects, which in turn make them move in space. The vertices keep moving in the visualization space until the objects are stabilized. Once stabilized, the spring system provides the graph drawing, or node placement. To identify the most important objects, centrality analysis methods such as degree, betweenness, or proximity analysis are useful. Social network analysis [40] and science monitoring are major applications in which data analysis is based on graph visualization: collaboration networks are visualized, and browsing facilities are provided for analysis and interpretation. Clustering is a key point in collaborative network analysis. First, clusters result from graph simplification: weak edges are deleted to obtain the main object clusters. Second, clustering methods such as graph partitioning are used in order to handle very large networks.
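As an illustration of the centrality measures mentioned above, the sketch below computes degree and betweenness centrality on a small unweighted, undirected graph. Betweenness is computed with Brandes' algorithm, a standard choice that the chapter itself does not name; the toy graph and the function names are hypothetical.

```python
from collections import deque


def degree(adj):
    """Number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def betweenness(adj):
    """Betweenness centrality (Brandes' algorithm) for an unweighted,
    undirected graph given as a dict: node -> set of neighbors."""
    scores = {node: 0.0 for node in adj}
    for source in adj:
        # Breadth-first search counting shortest paths from the source.
        order = []
        predecessors = {node: [] for node in adj}
        n_paths = {node: 0 for node in adj}
        dist = {node: -1 for node in adj}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        dependency = {node: 0.0 for node in adj}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each undirected pair of end points was counted twice.
    return {node: score / 2 for node, score in scores.items()}


# Two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
degrees = degree(graph)
between = betweenness(graph)
print(degrees["c"], between["c"], between["a"])
```

Nodes c and d have only one more neighbor than the others, but every shortest path between the two triangles goes through them, which betweenness makes visible.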

In this section, we illustrate the usefulness of graphs to address clustering issues for

collaborative network analysis and science monitoring. All the examples we provide use

publications in a domain that are first pre-treated to extract meta-data such as keywords or

author names and to build adjacency matrices. We show how graphs can be used to visualize

collaboration networks and clusters of authors. We also show an extension of collaborative

networks to country and semantic networks based on collaboration. We provide examples of

how clustering can be used to simplify overly large graphs. Temporal networks are the last

example we provide in this section.

1.3.1 Basis of collaboration networks

Collaboration networks can be extracted from co-authoring. In that case, nodes correspond

to authors and edges to co-authoring. Weights can also be associated both with nodes

and edges. Node weight refers to the node importance in terms of author frequency; in

the same way, edge weight depends on the strength of co-authoring. Node weights can

be expressed graphically by various means as depicted in [26] and presented in Figure 1.5.


On the other hand, edge weights are either graphically represented or used to simplify the

network, suppressing the weaker links.
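A minimal sketch of how such a weighted network could be built from publication records is given below. It assumes that each publication is simply a list of author names, that node weight is the number of publications of an author and that edge weight is the number of co-signed publications; the names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coauthor_network(publications):
    """Build node and edge weights from lists of author names.

    Node weight: number of publications signed by the author.
    Edge weight: number of publications co-signed by the two authors.
    """
    node_weight = Counter()
    edge_weight = Counter()
    for authors in publications:
        unique = sorted(set(authors))  # ignore duplicated names in a record
        node_weight.update(unique)
        for a, b in combinations(unique, 2):
            edge_weight[(a, b)] += 1   # key is the sorted author pair
    return node_weight, edge_weight


publications = [
    ["Author A", "Author B"],
    ["Author A", "Author B", "Author C"],
    ["Author C"],
]
nodes, edges = coauthor_network(publications)
print(nodes["Author A"], edges[("Author A", "Author B")])
```

The edge counter is a sparse form of the adjacency matrix: pairs that never co-authored are simply absent.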

Figure 1.5: Graphical representation of node weight in graphs [26].

Kejžar et al. [28] present a collaboration network as a sum of cliques. For example, Figure 1.6 presents the collaboration network obtained with the Pajek software when only the edges with a weight of at least 3 are kept. Pajek (the Slovene word for spider) is a Windows program for the analysis and visualization of large networks. It is freely available for noncommercial use. In this figure, colors represent sub-networks and correspond to clusters of authors that are strongly related to each other.
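The idea of keeping only the edges of weight at least 3 and reading the remaining connected sub-networks as clusters can be sketched as follows. This is not Pajek's implementation; the edge weights and function names are hypothetical.

```python
def line_cut(edge_weight, threshold):
    """Keep only the edges whose weight is at least the threshold."""
    return {edge: w for edge, w in edge_weight.items() if w >= threshold}


def connected_components(edges):
    """Connected sub-networks of the graph formed by the given edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = set()
    components = []
    for start in sorted(adj):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal from the start node
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adj[v] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical co-authoring strengths between five authors.
weights = {("A", "B"): 5, ("B", "C"): 3, ("C", "D"): 1, ("D", "E"): 4}
strong = line_cut(weights, threshold=3)
clusters = connected_components(strong)
print(clusters)
```

Removing the weak link C-D splits the network into two groups of strongly related authors; authors left without any strong link disappear from the filtered graph.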

Another example of so-called graph filtering is provided in Figure 1.7, using the VisuGraph [15] tool, which makes it possible to visualize relational data.

Figure 1.6: Main part of line cut at level 3 of collaboration network. Colors represent cuts (connected subnetworks) [28].

Figure 1.7: Filtering a graph according to the strength of the links in VisuGraph [15].

Graph filtering makes it possible to keep the strongest relationships, according to the threshold value the user sets. Graph filtering hides the weakest relationships, but it does not allow the user to distinguish the nodes that play a central role in the graph structure. This issue can be addressed by analyzing the graph structure. The K-core has been designed to achieve this goal [8, 4]. It is a graph decomposition that consists in identifying specific sub-sets or sub-graphs. The K-core is obtained by recursively pruning the nodes that have a degree smaller than K. The nodes that have only one neighbor correspond to a coreness of 1. When coreness 2 is considered, the nodes of coreness 1 are hidden; the sub-graph then consists of the nodes that have at least two neighbors within it, and so on. Node browsing can be applied to get a deeper view of the structure around a specific node. This functionality starts from a node the user selects, in order to study that node and its relationships to other nodes by means of its connections; it makes it possible to define the node’s role within the network structure.
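The recursive pruning that defines the K-core can be sketched in a few lines. The version below is a straightforward, unoptimized illustration on an unweighted graph stored as a dict of neighbor sets; the function names and the toy graph are hypothetical.

```python
def k_core(adj, k):
    """Sub-graph obtained by recursively pruning nodes of degree < k."""
    core = {node: set(neighbors) for node, neighbors in adj.items()}
    changed = True
    while changed:
        changed = False
        for node in [n for n, nbrs in core.items() if len(nbrs) < k]:
            # Removing a node lowers the degree of its neighbors, which
            # may in turn be pruned at the next pass.
            for neighbor in core[node]:
                if neighbor in core:
                    core[neighbor].discard(node)
            del core[node]
            changed = True
    return core


def coreness(adj):
    """Largest k such that the node still belongs to the k-core."""
    result = {node: 0 for node in adj}
    k = 1
    while True:
        core = k_core(adj, k)
        if not core:
            return result
        for node in core:
            result[node] = k
        k += 1


# A triangle (a, b, c) with one pendant node d attached to c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
two_core = k_core(graph, 2)
levels = coreness(graph)
print(sorted(two_core), levels)
```

Displaying the 2-core hides the pendant node d, whose coreness is 1, and keeps the triangle in which every node has two neighbors.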

Figure 1.8: Browsing a node in VisuGraph [15].

For example, in Figure 1.8, the node that has been selected seems to be a major element of

the graph. The vicinity of a node, or what could be called the self-centered network, is a way

to observe the way nodes behave. Considering self-centered networks, the user can extract

the diversity of the relationships and detect the local features where these relationships occur.

In the example of Figure 1.8, the node that has been selected appears larger than the others
in order to distinguish it. The graph structure is rebuilt by browsing in several stages.
The first browsing (depth = 1) shows a connection with a single other node (see top left of
Figure 1.8). Then, the user discovers the complete structure by successive browsing. The
nodes in the center of the structure are characterized by their high connectivity to other
nodes. If the selected node were deleted, the graph would be split into two sub-graphs. Browsing nodes

makes it possible to obtain the whole group of highly related nodes, while studying the

relationships with the other nodes. Structural holes [10] reveal the fact that two sets of

nodes do not communicate directly but rather need an intermediary node to communicate;


Figure 1.9: Browsing from a structural hole in VisuGraph [15].

this latter node thus occupies a decisive position. Figure 1.9 illustrates the structural hole

feature. For this newly selected node, the first browsing step displays many other nodes. In
the next browsing step, this node remains the center of the graph structure since the other nodes
are around it.
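Browsing a node by successive depths, and checking whether a node is the intermediary on which a structural hole depends, can be sketched as a breadth-first traversal plus a connectivity test. The code below is an illustrative sketch only: the names `browse` and `count_components`, the adjacency-dictionary format and the toy graph are assumptions, not part of VisuGraph.

```python
from collections import deque


def browse(adjacency, start, depth):
    """Return the nodes reachable from `start` in at most `depth` hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # do not expand beyond the requested depth
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance)


def count_components(adjacency, removed=None):
    """Count connected components, optionally ignoring one removed node."""
    remaining = set(adjacency) - {removed}
    components = 0
    while remaining:
        components += 1
        stack = [remaining.pop()]
        while stack:
            node = stack.pop()
            for nxt in adjacency[node]:
                if nxt in remaining:
                    remaining.remove(nxt)
                    stack.append(nxt)
    return components


# Hypothetical graph: two triangles that communicate only through node "x".
adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"},
    "x": {"c", "d"},
    "d": {"x", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}

first_view = browse(adjacency, "x", 1)
second_view = browse(adjacency, "x", 2)
```

Here browsing at depth 1 from x shows only c and d, depth 2 reveals the whole graph, and removing x splits the graph into two components.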

1.3.2 Geographic and thematic collaboration networks

Collaboration networks can also be extended to include other types of information.
For example, a collaboration network including countries is useful when considering technological
activity and creativity around the world [47]. In that case, rather than considering

co-authoring and thus a single type of node, the starting point is a 2-D matrix based on

both country of affiliation and authors. The matrix cells contain the number of publications

in which a given author co-occurs with a given country (the country where an author is

affiliated). Mothe et al. [35] present the resulting graph for a set of publications in the

information retrieval domain. In Figure 1.10, countries appear in green whereas authors are

displayed in red. Countries that are not correlated with other countries do not appear in

this graph. That means that the only publications that are considered have at least two


authors belonging to two different countries. The edges correspond to links that have been
inferred between countries and authors. Using this type of network representation, cooperation
between countries appears at a glance. For example, strong relationships are
shown between China and Hong-Kong and between Israel and USA in this set of publications.
The link between China and Hong-Kong is not surprising from a political point of view.
The relationship between Israel and the USA in IR can be explained by the fact that a laboratory of the
IBM corporation is situated in Haifa (Israel) and publications are co-authored with IBM US
(this can be validated by going back to the publications themselves). The power of this
representation is that links are drawn but, more importantly, the explanation of each link
can be seen. When considering the Netherlands and the UK for example, the association
is mainly due to Djoerd Hiemstra. In the same way, the association between China and
Canada is due to two persons: Jian Yun Nie (Canada) and Ming Zhou (China).
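The author/country matrix described above can be sketched as a simple counting step. The code below is a hedged illustration, not the procedure of [35]: it assumes that each publication is given as a list of (author, country) pairs, that an author co-occurs with every country appearing on the publication, and that only publications involving at least two countries are kept. The function name and the sample data are hypothetical.

```python
from collections import Counter


def author_country_counts(publications):
    """Count, for each (author, country) pair, the publications where both occur."""
    counts = Counter()
    for publication in publications:
        names = {name for name, _ in publication}
        countries = {country for _, country in publication}
        if len(countries) < 2:
            continue  # keep only co-authorships across different countries
        for name in names:
            for country in countries:
                counts[(name, country)] += 1
    return counts


# Hypothetical publications: lists of (author, country of affiliation).
publications = [
    [("A1", "NL"), ("A2", "UK")],
    [("A1", "NL"), ("A3", "UK")],
    [("A4", "FR"), ("A5", "FR")],  # single country: ignored
]

counts = author_country_counts(publications)
```

In this toy example, author A1 is the only reason why NL and UK are connected, which mirrors the way a single author can explain a link between two countries in the graph.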

Collaboration networks can also be analyzed in the light of topics of interest. To conduct

such an analysis, the two dimensions of the matrix correspond to keywords and to countries.

Figure 1.11 provides an example of the resulting graph. Some interesting sub-networks have

been circled in the figure. For example, Canada and Turkey are linked through common

topics of interest and this link is not due to some publications that have been written by an

author from Canada and another from Turkey (in Figure 1.10 these two countries are not

linked).

1.3.3 Large collaborative networks

When visualizing graphs, size is a major issue. Graph partitioning is a way to face the

size issue. The principle is to provide a higher level graph that gives an overview of the

data structure. Several graph partitioning techniques exist such as spectral clustering [3, 25]

and multi-level partitioning like METIS (Serial Graph Partitioning and Fill-reducing Matrix

Ordering) algorithms [27]. METIS is a set of programs for partitioning graphs and other

elements, and producing fill reducing orderings for sparse matrices. Following this idea,

Karouach et al. [26] propose to reduce large complex graphs by means of a Markov model
based clustering algorithm as presented by Stijn van Dongen [46]. For example, Figure 1.12


Figure 1.10: Author/Country collaboration network [35].


Figure 1.11: Topics/Country network [35].

shows the graph resulting from graph partitioning where each color corresponds to a cluster

(left part) and the visualization of the simplified graph representing each cluster by a single

node (right part).
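The second step mentioned above, replacing each cluster by a single node, can be sketched as follows. This is only an illustration under assumptions: the partition `cluster_of` is taken as given (for instance produced by a Markov clustering step that is not reproduced here), and the name `collapse` and the toy data are hypothetical.

```python
from collections import Counter


def collapse(edges, cluster_of):
    """Build the simplified graph: one node per cluster, weighted by crossing links."""
    weights = Counter()
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu != cv:
            # Undirected graph: store each cluster pair in a canonical order.
            weights[tuple(sorted((cu, cv)))] += 1
    return dict(weights)


# Hypothetical graph with two clusters "A" and "B" joined by two links.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6), (6, 1)]
cluster_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}

simplified = collapse(edges, cluster_of)
```

The weight attached to each pair of clusters could, for example, drive the thickness of the link drawn in the overview graph; that design choice is an assumption rather than a description of [26].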

1.3.4 Temporal collaborative networks

Temporality in a collaborative network is an issue since trend detection has to consider

evolution. Generally, the graphs for the various periods are visualized independently
(see for example Figure 1.13).

Doing so, evolution is difficult to analyze. Loubier et al. [34] suggest two ways to integrate

the various periods and analyze the data at once. Figure 1.14 depicts nodes as histograms
that show the distribution of data over the entire time span; each bar corresponds
to one of the considered years. This graph displays the specificities of each period. For example,
year 2005 is presented in the top right corner (in red). The collaborations that occur only
in this period are clearly identified. The top left corner (in green) is related to the fourth period


Figure 1.12: Markov model based clustering algorithm applied to a collaborative network [26].

Figure 1.13: Collaborative networks for 4 periods corresponding to the publication year [34].


(2008).

Figure 1.14: Collaborative networks using histograms and a spatial representation of time [34].

Following the idea of Dragicevic and Huot [16], Loubier [33] represents temporal collaborative
networks in the form of a clock. In Figure 1.15 a slice is devoted to each chosen time

segment (in this case a publication year). For example, the top left corner is devoted to one

specific year and is represented in green. Another is represented in red (top right corner).

In between these two slices, a slice is devoted to the collaborations that correspond to the

two considered years. In the central circle, the collaborations that involve the four periods

are represented. The other circles represent the collaboration within three periods.
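The assignment of collaborations to the slices and circles of such a clock display can be sketched by grouping each collaboration by the exact set of periods in which it occurs. The code below is a hypothetical illustration, not the implementation of [33]: the function name, the input format (a dictionary from period to list of collaborating pairs) and the sample years are assumptions.

```python
from collections import defaultdict


def slices_by_periods(edges_by_period):
    """Group collaborations by the exact set of periods in which they occur."""
    periods_of = defaultdict(set)
    for period, edges in edges_by_period.items():
        for u, v in edges:
            periods_of[frozenset((u, v))].add(period)
    slices = defaultdict(set)
    for pair, periods in periods_of.items():
        slices[frozenset(periods)].add(pair)
    return dict(slices)


# Hypothetical collaborations over four publication years.
edges_by_period = {
    2005: [("a", "b"), ("b", "c")],
    2006: [("a", "b")],
    2007: [("a", "b"), ("c", "d")],
    2008: [("a", "b")],
}

slices = slices_by_periods(edges_by_period)
```

In this sketch, the pair (a, b) would belong to the central circle (all four periods), whereas (b, c) and (c, d) would each fall in the slice of a single year.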

1.4 Curve clustering

Curve clustering can occur in various contexts where one or more variables are acquired for
various ordered values of an explanatory variable. For instance, this is the case for time series,
dose-response or spectral analysis. A survey can be found in [49].


Figure 1.15: A circle display of temporal collaboration networks [33].

1.4.1 Time series microarray experiment

The study presented to illustrate curve clustering is detailed in [13]. In the context of microarray
experiments, it focuses on the analysis of time series gene expression data. Original

data were hepatic gene expression profiles acquired during a fasting period in the mouse.

Two hundred selected genes were studied through 11 time points between 0 and 72 hours,

using a dedicated macroarray. For each gene, two to four replicates were available at each

time point. Data are presented in Figure 1.16 where lines join the average value between

replicates at each time point.

The aim of this study was to provide a relevant clustering of gene expression temporal

profiles. This was achieved by focusing on the shapes of the curves rather than the absolute

level of expression. More precisely, the authors combined spline smoothing and first derivative

computation with hierarchical clustering and k-means partitioning. Once the groups were

obtained, three graphical representations were provided in order to make interpretation

easier; they are displayed and commented in the following. They were obtained using the R
software [39].

Figure 1.16: Log-normalised intensity versus time for 130 genes. For each gene, the line joins the average value at each time point. Vertical dashed lines indicate time points.
[Figure 1.16 axes: x-axis Time (h), 0 to 70; y-axis log(normalized intensity), −0.5 to 2.5.]
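The idea of clustering curves on their shape rather than on their level can be sketched as follows. This is a simplified stand-in for the study's procedure, which was carried out in R: finite differences replace the derivative of a spline-smoothed curve, and a plain k-means with a deterministic start replaces the combination of hierarchical clustering and k-means. All names, time points and expression values are hypothetical.

```python
def derivative(times, values):
    """Approximate the first derivative by finite differences."""
    return [(values[i + 1] - values[i]) / (times[i + 1] - times[i])
            for i in range(len(times) - 1)]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(
            squared_distance(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical profiles: g1/g2 share a shape at different levels, as do g3/g4.
times = [0, 12, 24, 48, 72]
curves = [
    [0.0, 0.5, 1.0, 1.0, 1.0],   # g1: increases, then stabilizes
    [1.5, 2.0, 2.5, 2.5, 2.5],   # g2: same shape, higher level
    [1.0, 0.8, 0.6, 0.2, -0.2],  # g3: decreases
    [2.0, 1.8, 1.6, 1.2, 0.8],   # g4: same shape, higher level
]

labels = kmeans([derivative(times, c) for c in curves], k=2)
```

Because the clustering operates on derivatives, g1 and g2 end up in the same group although their absolute levels differ, and likewise for g3 and g4.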

1.4.2 Principal Component Analysis to characterize clusters

PCA makes it possible to relate the experimental units, grouped into clusters, to the variables
of the experiment, here the time points. Each cluster can then be characterized through the behavior
of its components.

In Figure 1.17, the variables of the data set (here the discretized time points) are displayed

on the left part by projection on the first two principal components. Their regular pattern

indicates the consistency of the smoothed and discretized data. The horseshoe shape formed
by the discretization times is a well-known pattern for variables indexed by time

(or another continuous variable). In the right part of Figure 1.17, the observations (here

the genes) are also displayed on the first two principal components. The four clusters are

distributed along the first (horizontal) axis in a specific order. Regarding the variables, it

appears that the clusters on the left have high values of derivatives at the beginning of the

fasting experiment and these values decrease with clusters located on the right. The cluster

in red, located near the origin, acts as an intermediary between the other clusters. These
directions are confirmed in the following when displaying the curves corresponding to each
cluster.

Figure 1.17: Representation of variables (discretized time points, on the left) and individuals (genes, on the right) on the first two principal components. Genes are differentially displayed according to their cluster following k-means.
[Figure 1.17 axes: PC1 (85%) horizontal, PC2 (10%) vertical; points are labeled t−0 to t−72 in the left panel and with gene names in the right panel.]
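The projection of observations on the first two principal components, together with the share of variance carried by each axis, can be sketched without any external library. The code below is an assumption-laden illustration, not the authors' R procedure: it uses power iteration with deflation on the covariance matrix, rows stand for genes and columns for discretized time points, and the small data set is hypothetical.

```python
def center(rows):
    """Subtract the column means from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already centered rows."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def leading_component(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and eigenvector of a symmetric matrix."""
    vector = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary start
    for _ in range(iterations):
        product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
        norm = sum(x * x for x in product) ** 0.5
        if norm == 0.0:
            break  # nothing left to extract (zero matrix)
        vector = [x / norm for x in product]
    value = sum(v * sum(m * w for m, w in zip(row, vector))
                for v, row in zip(vector, matrix))
    return value, vector


def pca_2d(rows):
    """Scores on the first two components and their shares of total variance."""
    centered = center(rows)
    cov = covariance(centered)
    size = len(cov)
    total = sum(cov[i][i] for i in range(size))
    val1, vec1 = leading_component(cov)
    # Deflation: remove the first component before extracting the second.
    deflated = [[cov[i][j] - val1 * vec1[i] * vec1[j] for j in range(size)]
                for i in range(size)]
    val2, vec2 = leading_component(deflated)
    scores = [(sum(x * a for x, a in zip(row, vec1)),
               sum(x * a for x, a in zip(row, vec2))) for row in centered]
    return scores, (val1 / total, val2 / total)


# Hypothetical data: 4 genes (rows) described at 3 time points (columns).
data = [[1.0, 2.0, 0.1], [2.0, 4.0, -0.1], [3.0, 6.0, 0.1], [4.0, 8.0, -0.1]]
scores, shares = pca_2d(data)
```

The two shares play the role of the percentages attached to the axes of a PCA plot; the sign of each component is arbitrary, so only the relative positions of the points are meaningful.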

1.4.3 Visualizing curves

The first elements of interpretation provided by PCA can be strengthened by the representation
of each smoothed curve according to the cluster it belongs to. This can be done by
superimposing the curves (on the left in Figure 1.18) or in a kind of disassembled view (four
plots on the right).

In this representation, it becomes clearer that:

km1: the expression of the genes which belong to the first cluster (in black) increases during
the first half of fasting and then tends to decrease slightly or to stabilize.

km2: the second cluster (red) reveals quasi-constant curves. These genes are not regulated
during fasting.

km3: the third one (green) is characterized by a decrease of the gene expression with time.

km4: the fourth cluster (blue) is composed of the most strongly induced genes during fasting.
Their expression strongly increases until the 40th hour of fasting and then stabilizes.

Figure 1.18: Representation of the smooth curves distributed in 4 clusters determined through hierarchical and k-means classification.
[Figure 1.18 panels: "Four clusters", "Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"; x-axis Time (h), 0 to 60; y-axis log(normalized intensity), 0.0 to 2.5.]

Let us note that focusing on the derivative of the smoothed functions makes it possible to cluster
curves with similar profiles regardless of the absolute level of expression. This point is clearly

visible in Figure 1.18 for each cluster, mainly for the black one with average values from 0

to 2.5.

1.4.4 Heatmap to combine two clusterings

Another way to compare clustering results obtained jointly on the rows and columns of a data
set is the heatmap. This representation was highly popularized in the biological context by
Eisen et al. [17].

In Figure 1.19, the values represented are the derivative of the smoothed profiles. They

increase from green (negative value, decreasing profile) to red (positive value, increasing

profile) via black. Genes, represented in rows, are ordered according to the clusters obtained
with k-means partitioning. This explains why, in this case, a dendrogram cannot be drawn
on the left (or right) side of the heatmap. Horizontal blue dotted lines separate the four
clusters obtained following k-means reallocation. On the other hand, hierarchical clustering

of the columns was performed. Many different orderings are consistent with the structure

of the dendrogram. We forced the reordering of the time points to follow, as much as the

dendrogram allows it (rotating around the nodes of the tree), their increase from left to right.

As could be expected, perfectly ordered time points were obtained, which is consistent with
the specific horseshoe pattern of the variables in PCA (Figure 1.17): considering one time point, its
closest neighbors in time are also the closest mathematically.

The heatmap provides a color coding of the derivatives of the curves. This allows the direction
and amplitude of gene expression changes to be read directly at the different time points.
Consequently, it becomes much easier to identify both the causes of the clustering and the
time points when major transcriptional changes occur.

Figure 1.19: Heatmap of the derivative of the smoothed gene expression profiles for the whole dataset. Genes (in row) are ordered according to their cluster determined by the k-means algorithm. Horizontal blue lines separate the 4 clusters. Values increase from green (negative values) to red (positive values) via black.
[Figure 1.19 layout: columns t−0 to t−72; row blocks from top to bottom km4, km1, km2, km3; color key from −0.02 to 0.02.]

The most strongly regulated genes are easily visualized: km4 genes at the top
and one gene (SCD1) which appears as a green line in the lower quarter of the heatmap.
While km4 genes appear most strongly upregulated until the 30th hour of fasting, SCD1
is negatively regulated in a constant way throughout the fasting period. Thus, in contrast
to km4 genes, the SCD1 expression profile could have been equally well modeled by a straight
line since its derivative appears nearly constant with fasting time. One obvious drawback
of this representation (Figure 1.19) is that the derivatives of the km4 and SCD1 gene
profiles tend to strongly narrow the color range used to represent the other profile
derivatives due to their extreme regulations in mouse liver during fasting. Once identified,
this drawback can be overcome by removing SCD1 and genes belonging to km4 from the
data set and by building a new heatmap [13].
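The narrowing of the color range by a few extreme profiles can be illustrated with a small sketch of the green-black-red coding. The mapping below is an assumption made for illustration (a symmetric linear scale whose limit is the largest absolute value in the matrix); it is not the color function of the R heatmap used in the study, and the derivative values are hypothetical.

```python
def color_limit(rows):
    """Symmetric color range: the largest absolute value in the matrix."""
    return max(abs(x) for row in rows for x in row)


def to_rgb(value, limit):
    """Map a derivative to green (negative), black (zero) or red (positive)."""
    scaled = max(-1.0, min(1.0, value / limit))
    level = round(255 * abs(scaled))
    return (level, 0, 0) if scaled > 0 else (0, level, 0)


# Hypothetical derivatives: one moderate profile and one extreme profile.
moderate = [0.004, -0.002, 0.0]
extreme = [0.02, 0.018, 0.015]

limit_all = color_limit([moderate, extreme])   # range set by the extreme profile
limit_moderate = color_limit([moderate])       # range after removing it

with_extreme = to_rgb(0.004, limit_all)
without_extreme = to_rgb(0.004, limit_moderate)
```

With the extreme profile present, the moderate value 0.004 is drawn as a dark red close to black; once the extreme profile is removed, the same value uses the full red intensity, which is why rebuilding the heatmap without the extreme genes makes the remaining profiles easier to read.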

1.5 Conclusion

To illustrate visual clustering we opted for a presentation based on three case studies dedicated
to three kinds of data: multidimensional, networks and curves. The specific characteristics
of each data set require appropriate tools for the associated clustering task. Each
visualization technique provides one point of view, obviously subjective and partial. The
joint use of various techniques makes it possible to enrich the perception of the data. The dendrogram can
be seen as a standard visualization of a clustering process, but we saw that the impression
provided by the tree is highly partial. Frequently, dimension reduction techniques
such as PCA, multi-dimensional scaling or other projection techniques more adapted to
cluster analysis [21, 44] are used to observe and characterize clusters.

We do not claim to offer an exhaustive view of visual clustering. Many other methods
could have been presented in this chapter. Furthermore, as we saw, the way to visualize
clusters depends on the kind of data to analyze but can also depend on the methodology
used to address the clustering task. Software can also have specificities in representing
clustering results and providing facilities to the user.


When considering spatial data, the visualization can be highly enriched by representing
clusters on maps. For instance, the R package GeoXp [32] implements interactive graphics
for exploratory data analysis and includes dimension reduction techniques as well as cluster
analysis whose results are linked to a map. Another context in which cluster analysis
requires efficient visualization tools is the study of the origin of species, in which
phylogenetic trees with thousands of nodes have to be visualized. A standard phylogram or
cladogram looks like a dendrogram resulting from hierarchical clustering, but many variants
exist, as reviewed in [38]; let us mention for instance unrooted or circular cladograms
and others, with or without 3D visualization, each one providing a specific view of the
data. Images are also data for which the clustering task can be performed. For instance,
when presenting the results of an image retrieval system, clustering makes it possible to select a
subset of representatives of all retrieved images instead of providing a relevance-ordered list
[24, 36]. Clustering is also associated with images when dealing with image processing [37]:
a segmentation process consists of dividing an image into various parts in order to recognize
particular patterns (areas in Earth imaging or organs in a medical context).

Some clustering methodologies can result in specific visualization techniques. For instance,
Self Organizing Maps (SOM, [29]) address the clustering problem as a kind of neural
network where neurons (or nodes) are arranged on a grid or other specific shapes. This
can result in very specific representations such as those produced with the SOM toolbox for
Matlab [48] or the R package kohonen [50]. Visualization techniques have to be adapted
when performing algorithms from a fuzzy clustering framework [45, 31]. Indeed, in this context,
it is assumed that one element can belong to more than one cluster, which cannot always
be shown using the standard visualization techniques. Radial visualization techniques are an
alternative to address this problem [41].

To address visualization needs for clustering results, a great deal of software has been

developed. The methodologies implemented as well as the facilities proposed depend strongly
on the community in which the software was developed. For instance, it can be reasonable
to associate signal and image processing with Matlab toolboxes. Biostatistics and clustering
related to high-throughput biology were recently developed in the environment provided by
the free R software. Tetralogie [14] is developed by the University of Toulouse and allows


one to analyze texts such as publications, patents, or web pages. Many other standalone software packages are also available. For networks, Gephi [7] is a free, open-source solution that offers interactive visualization for network exploration.

References

[1] D. Adler and D. Murdoch. rgl: 3D visualization device system (OpenGL), 2011. R

package version 0.92.798.

[2] M. Al Hasan, S. Salem, and M.J. Zaki. Simclus: an effective algorithm for clustering

with a lower bound on similarity. Knowledge and Information Systems, 28(3):665–685,

2011.

[3] C.J. Alpert and A.B. Kahng. Recent developments in netlist partitioning: a survey. Integration: the VLSI Journal, 19:1–18, 1995.

[4] J. I. Alvarez-Hamelin, L. Dall’Asta, A. Barrat, and A. Vespignani. Large scale networks fingerprinting and visualization using the k-core decomposition. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 41–50, Cambridge, MA, 2006. MIT Press.

[5] N. Andrienko. Interactive visual clustering of large collections of trajectories. In Working Notes of the LWA 2011 - Learning, Knowledge, Adaptation, 2011.

[6] A. Baccini, S. Dejean, L. Lafage, and J. Mothe. How many performance measures to evaluate information retrieval systems? Knowledge and Information Systems, pages 693–713, 2011.

[7] M. Bastian, S. Heymann, and M. Jacomy. Gephi: An open source software for exploring

and manipulating networks. pages 361–362, 2009.

[8] V. Batagelj and M. Zaversnik. Generalized cores, 2002.

[9] G. Brock, V. Pihur, S. Datta, and S. Datta. clValid: An R package for cluster validation. Journal of Statistical Software, 25(4):1–22, 2008.

[10] R. S. Burt. Structural holes: The social structure of competition. Harvard University

Press, Cambridge, MA, 1992.

[11] C.-L. Chen, F.S.C. Tseng, and T. Liang. An integration of fuzzy association rules and WordNet for document clustering. Knowledge and Information Systems, 28(3):687–708, 2011.

[12] F. Crimmins, T. Dkaki, J. Mothe, and A. Smeaton. TetraFusion: Information Discovery on the Internet. IEEE Intelligent Systems and their Applications, 14(4):55–62, July 1999.

[13] S. Dejean, P. Martin, A. Baccini, and P. Besse. Clustering time series gene expression data using smoothing spline derivatives. EURASIP Journal on Bioinformatics and Systems Biology, 2007. Article ID 70561.

[14] B. Dousset. Tetralogie: interactivity for competitive intelligence, 2012. http://atlas.irit.fr/PIE/Outils/Tetralogie.html.

[15] B. Dousset, E. Loubier, and J. Mothe. Interactive analysis of relational information

(regular paper). In Signal-Image Technology & Internet-Based Systems (SITIS), pages

179–186, 2010.

[16] P. Dragicevic and S. Huot. Spiraclock: a continuous and non-intrusive display for

upcoming events. In CHI ’02 extended abstracts on Human factors in computing systems,

CHI EA ’02, pages 604–605, New York, NY, USA, 2002. ACM.

[17] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display

of genome-wide expression patterns. Proceedings of the National Academy of Sciences,

95(25):14863–14868, 1998.

[18] T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement.

Software: Practice and Experience, 21(11):1129–1164, 1991.

[19] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. J.

Intell. Inf. Syst., 17(2-3):107–145, December 2001.

[20] J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc.,

San Francisco, CA, USA, 2005.

[21] C. Hennig. Asymmetric linear dimension reduction for classification. Journal of Computational & Graphical Statistics, 13(4):930, 2004.

[22] F. Husson, J. Josse, and J. Pagès. Principal component methods - hierarchical clustering - partitional clustering: why would we need to choose for visualizing data?, 2010. Technical report - Agrocampus Ouest.

[23] F. Husson, J. Josse, S. Le, and J. Mazet. FactoMineR: Multivariate Exploratory Data

Analysis and Data Mining with R, 2011. R package version 1.16.

[24] Y. Jing, H. A. Rowley, J. Wang, D. Tsai, C. Rosenberg, and M. Covell. Google image

swirl: a large-scale content-based image visualization system. In Proceedings of the 21st

international conference companion on World Wide Web, pages 539–540. ACM New

York, NY, US, 2012.

[25] B. Jouve, P. Kuntz, and F. Velin. Extraction de structures macroscopiques dans des grands graphes par une approche spectrale. In Danièle Hérin and Djamel A. Zighed, editors, EGC, volume 1 of Extraction des Connaissances et Apprentissage, pages 173–184. Hermes Science Publications, 2002.

[26] S. Karouach and B. Dousset. Les graphes comme représentation synthétique et naturelle de l’information relationnelle de grandes tailles. In Workshop sur la recherche d’information : un nouveau passage à l’échelle, associé à INFORSID’2003, Nancy, pages 35–48. INFORSID, 2003.

[27] G. Karypis and V. Kumar. Multilevel k-way hypergraph partitioning. In Proceedings

of the Design and Automation Conference, pages 343–348, 1998.

[28] N. Kejžar, S. Korenjak-Černe, and V. Batagelj. Network analysis of works on clustering and classification from Web of Science. In Hermann Locarek-Junge and Claus Weihs, editors, Classification as a Tool for Research, Studies in Classification, Data Analysis, and Knowledge Organization, Proceedings of IFCS’09, pages 525–536. Springer, 2010.

[29] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.

[30] Y. Koren and D. Harel. A two-way visualization method for clustered data. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’03, pages 589–594, New York, NY, USA, 2003. ACM.

[31] R. Kruse, C. Döring, and M.-J. Lesot. Fundamentals of fuzzy clustering. In José Valente de Oliveira and Witold Pedrycz, editors, Advances in Fuzzy Clustering and its Applications, chapter 1, pages 3–30. John Wiley & Sons, April 2007.

[32] T. Laurent, A. Ruiz-Gazen, and C. Thomas-Agnan. GeoXp: An R package for exploratory spatial data analysis. Journal of Statistical Software, 47(2):1–23, 2012.

[33] E. Loubier. Analyse et visualisation de données relationnelles par morphing de graphe prenant en compte la dimension temporelle. PhD thesis, Université Paul Sabatier, 2009.

[34] E. Loubier, W. Bahsoun, and B. Dousset. Visualization and analysis of large graphs. In ACM International Workshop for Ph.D. Students in Information and Knowledge Management (ACM PIKM), Lisbon, Portugal, pages 41–48. ACM, 2007.

[35] J. Mothe, C. Chrisment, T. Dkaki, B. Dousset, and S. Karouach. Combining mining

and visualization tools to discover the geographic structure of a domain. Computers,

Environment and Urban Systems, pages 460–484, 2006.

[36] G. P. Nguyen and M. Worring. Interactive access to large image collections using

similarity-based visualization. J. Vis. Lang. Comput., 19(2):203–224, April 2008.

[37] J. R. Parker. Algorithms for Image Processing and Computer Vision. John Wiley &

Sons, Inc., New York, NY, USA, 2nd edition, 2010.

[38] G. A. Pavlopoulos, T. G. Soldatos, A. Barbosa Da Silva, and R. Schneider. A reference

guide for tree analysis and visualization. BioData Mining, 3(1):1, 2010.

[39] R Development Core Team. R: A Language and Environment for Statistical Computing.

R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.

[40] J. Scott. Social network analysis. Sociology, 22(1):109–127, 1988.

[41] J. Sharko and G. Grinstein. Visualizing fuzzy clusters using RadViz. In Proceedings of the 2009 13th International Conference Information Visualisation, IV ’09, pages 307–316, Washington, DC, USA, 2009. IEEE Computer Society.

[42] B. Shneiderman. The eyes have it: A task by data type taxonomy for information

visualizations. In IEEE Visual Languages, number UMCP-CSD CS-TR-3665, pages

336–343, College Park, Maryland 20742, U.S.A., 1996.

[43] T. Schreck, J. Bernard, T. Tekusova, and J. Kohlhammer. Visual cluster analysis of trajectory data with interactive Kohonen maps. Information Visualization, 8(1):14–29, 2009.

[44] D.E. Tyler, F. Critchley, L. Dümbgen, and H. Oja. Invariant coordinate selection. Journal of the Royal Statistical Society B, 71(3):549–592, 2009.

[45] J. Valente de Oliveira and W. Pedrycz. Advances in Fuzzy Clustering and its Applications. John Wiley & Sons, Inc., New York, NY, USA, 2007.

[46] S. M. van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of

Utrecht, The Netherlands, 2000.

[47] A. Verbeek, K. Debackere, and M. Luwel. Science cited in patents: A geographic ’flow’

analysis of bibliographic citation patterns in patents. Scientometrics, 58(2):241–263,

2003.

[48] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas. Self-organizing map in Matlab: the SOM Toolbox. In Proceedings of the Matlab DSP Conference, pages 35–40, 1999.

[49] T. Warren Liao. Clustering of time series data - a survey. Pattern Recognition, 38(11):1857–1874, 2005.

[50] R. Wehrens and L.M.C. Buydens. Self- and super-organising maps in R: the kohonen

package. Journal of Statistical Software, 21(5):1–19, 2007.

[51] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGL(R) Programming Guide: The Official Guide to Learning OpenGL(R), Version 2 (5th Edition). Addison-Wesley Professional, 2005.