
DEPARTAMENTO DE COMPUTACIÓN

Constrained Clustering Algorithms: Practical Issues and Applications

PHD THESIS

TESE DE DOUTORAMENTO

Manuel Eduardo Ares Brea

2013


DEPARTAMENTO DE COMPUTACIÓN

Constrained Clustering Algorithms: Practical Issues and Applications

PHD THESIS

Manuel Eduardo Ares Brea

PhD Supervisor: Dr. Álvaro Barreiro García

2013


PHD THESIS

Constrained Clustering Algorithms: Practical Issues and Applications

Manuel Eduardo Ares Brea

PhD supervisor: Dr. Álvaro Barreiro García

Thesis committee:

Dr. Senén Barro

Dr. Alessandro Moschitti

Dr. Óscar Luaces

Dr. Juan M. Fernández-Luna

Dr. José Santos


D. Álvaro Barreiro García, Full Professor in the area of Computer Science and Artificial Intelligence at the Universidade da Coruña

CERTIFIES

That the present dissertation, entitled Constrained Clustering Algorithms: Practical Issues and Applications, was carried out under his supervision and constitutes the Thesis that D. Manuel Eduardo Ares Brea submits to apply for the degree of Doctor from the Universidade da Coruña.

A Coruña, December 2012

Signed: Dr. Álvaro Barreiro García, Thesis Supervisor


To Mum and Dad


We search for the truth
we can die upon the tooth
but the thrill of just the chase
is worth the pain

Dio - The Last in Line


Abstract

Recently a new family of semi-supervised clustering algorithms, coined as Constrained Clustering, has emerged. These algorithms can incorporate some a priori domain knowledge into the clustering process, allowing the user to guide the method and improve the quality of the partitions. To date, the research on this topic has focused on developing new algorithms, mostly overlooking certain practical questions whose importance is capital in real-world problems. In this thesis we identify and study two of these issues, constraint extraction and robustness to noise.

In this thesis we perform an analysis of the robustness of some Constrained Clustering algorithms to noisy sets of constraints, designing an experiment in which their behaviour is tested with synthetic sets of inaccurate constraints created with two noise models, one of them based on intuitions about the nature of real errors in the constraints. The strengths and weaknesses of each algorithm are discussed and used to identify the scenarios in which using it is the best decision.

Moreover, we likewise propose in this work two schemes to automatically extract constraints in two important domains: web pages and text in general. In the former we use information external to the web pages (their social tags), whereas in the latter we use information which is not taken into account by the usual text representation schemes (the order of the words). Both methods are tested in thorough experiments over reference collections and compared with suitable baselines.

Lastly, and in keeping with the practical focus of this thesis, we have also analysed how to apply Constrained Clustering to tackle an existing real-world problem, the Avoiding Bias task, proposing a scheme that uses constraints to codify the partition to be avoided. We also study how to improve the quality of the alternative partitions, proposing two approaches which use spectral clustering techniques.


Acknowledgements

Even though only my name appears on the cover of this work, this thesis would not have come to fruition without the contributions of many people; it is both a pleasure and a duty to humbly thank them with these lines.

First of all, I owe a great deal of gratitude to my advisor, Álvaro Barreiro, for all the priceless work and effort that he has put into this thesis, for all the guidance and good advice that he has given me over these years and especially for his clarity in seeing through the problems and difficulties that, as in every research endeavour, came up in the course of this work, being always able to suggest an appropriate course of action. I must thank as well my colleague Javier Parapar; his collaboration has been essential in the work that has led to this thesis, and, second to Álvaro's, his was the voice that I trusted the most when in doubt. His input and suggestions have been invaluable. Likewise, I want to thank my colleague Roi Blanco, not only for his advice and for being the finest Cicerone that a newcomer to the IR community can have, but also for being there as an inspirational role model, a tangible reminder of where this road can take us if we put in enough effort.

Moreover, I want to thank all the other members of the Information Retrieval Lab, and especially those with whom I have shared the lab these years, Isma, Martín and Xosé. Special thanks must likewise go to Pedro Cabalar for his initial support.

Additionally, I want to acknowledge the labour and the input of both the members of the PhD defence committee and the anonymous reviewers of the papers that make up this dissertation, as well as the support of the Xunta de Galicia through its predoctoral grant and the Government of Spain through the FPU grant AP2007-02476. Also, I want to thank all the Free Software workers and volunteers for their hard work in developing the high quality software without which this thesis would not have been possible.

Finally, I am deeply grateful to my friends for their support and for the good times that we have shared and those surely yet to come. Even more importantly, I owe to my family, and especially to my parents, my most heartfelt gratitude for their unconditional support, understanding and love.


Index

Chapter 1: Thesis Outline
    1.1 Introduction
    1.2 Motivation
    1.3 Thesis Statement
    1.4 Outline

Chapter 2: Introduction
    2.1 Clustering
    2.2 Constrained Clustering
        2.2.1 Constraints
    2.3 Constrained Clustering Algorithms
        2.3.1 Constrained k-Means
        2.3.2 Pairwise Constrained k-Means
        2.3.3 HMRF k-Means
        2.3.4 Constrained Normalised Cut
        2.3.5 Constrained Complete Link
        2.3.6 Spectral Clustering with Imposed Constraints
    2.4 Advantages and Applications
    2.5 Problems and Opportunities
        2.5.1 Constraint Extraction
        2.5.2 Algorithm Robustness
        2.5.3 Constraint Feasibility
        2.5.4 Constraint Utility
    2.6 Summary

Chapter 3: Methodology and Experimental Settings
    3.1 Datasets
    3.2 Data Representation and Distance Measures
        3.2.1 Textual Data Representation
        3.2.2 Distance Measures
    3.3 Parameter Tuning
    3.4 Cluster Evaluation
        3.4.1 Purity
        3.4.2 Mutual Information
        3.4.3 Rand Index and Adjusted Rand Index
    3.5 Statistical Significance
        3.5.1 Sign Test
    3.6 Summary

Chapter 4: Using Constrained Clustering in Avoiding Bias
    4.1 Avoiding Bias
    4.2 Previous Approaches
        4.2.1 Coordinated Conditional Information Bottleneck (CCIB)
        4.2.2 COALA
        4.2.3 Other approaches
    4.3 Our Initial Proposal
        4.3.1 Soft Constrained k-Means (SCKM)
        4.3.2 Related Algorithms
        4.3.3 Recapitulation
    4.4 Experiments
        4.4.1 Experimental Set-Up and Methodology
        4.4.2 Results
    4.5 Improving the Quality of Alternative Clusterings
        4.5.1 Negative Constraints in Constrained Normalised Cut
        4.5.2 Combining Soft Constrained k-Means and Normalised Cut
        4.5.3 Approach to Avoiding Bias
    4.6 Experiments
        4.6.1 Results
    4.7 Summary

Chapter 5: Robustness of Constrained Clustering Algorithms
    5.1 Algorithm Robustness
    5.2 Experimental Set-Up and Methodology
        5.2.1 Clustering Algorithms
        5.2.2 Datasets and Data Representation
        5.2.3 Constraint Creation
        5.2.4 Other Details
    5.3 Results
        5.3.1 Statistical Significance
    5.4 Conclusions of the Study
    5.5 Related Work
        5.5.1 Revisiting Probabilistic Models for Clustering with Pairwise Constraints, Nelson & Cohen (2007)
        5.5.2 Spectral Clustering with Inconsistent Advice, Coleman et al. (2008)
        5.5.3 Training Data Cleaning for Text Classification, Esuli & Sebastiani (2009)
    5.6 Summary

Chapter 6: Constraint Extraction
    6.1 Creation and Extraction of Constraints
    6.2 Previous Work
    6.3 Creating Constraints from Social Tags
        6.3.1 Social Tags
        6.3.2 Delicious
        6.3.3 Constraint Creation
    6.4 Evaluation Methodology
        6.4.1 Dataset
        6.4.2 Clustering Baselines and Document Representation
        6.4.3 Upper bound model
        6.4.4 Parameters of the algorithms
        6.4.5 Evaluation
    6.5 Results
    6.6 Creating Constraints using n-Grams
        6.6.1 n-Grams
        6.6.2 Constraint Creation
    6.7 Evaluation Methodology
        6.7.1 Datasets and Document Representation
        6.7.2 Baselines
        6.7.3 Upper Bound Model
        6.7.4 Parameters of the Algorithms
        6.7.5 Evaluation
    6.8 Results
    6.9 Summary

Chapter 7: Conclusions and Future Work
    7.1 Conclusions
    7.2 Future Work

Appendix A: Resumo
    A.1 Introdución
    A.2 Motivación
    A.3 Contribucións da tese
    A.4 Estrutura da tese, resultados e traballo futuro

Bibliography


List of Tables

3 Methodology and Experimental Settings
    3.1 For different sample sizes, maximum number of pairs which in a Low-Tailed Sign Test may be labelled with "+" such that the null hypothesis is still rejected

4 Using Constrained Clustering in Avoiding Bias
    4.1 Distribution of the documents from dataset (i) according to University (rows) and Topic (columns) criteria
    4.2 Distribution of the documents from dataset (ii) according to Region (rows) and Topic (columns) criteria
    4.3 Avoiding bias results in the defined datasets for k-Means, the new algorithm (SCKM) working with soft constraints and the CCIB method
    4.4 Avoiding bias results in the defined datasets for k-Means, Soft Constrained k-Means (SCKM), Normalised Cut and the combined approach (NC+SCKM)

5 Robustness of Constrained Clustering Algorithms
    5.1 Summary of the clustering methods used in the study
    5.2 Distribution of the data in the datasets used in the study
    5.3 Number of eigenvectors used in the spectral methods
    5.4 Results (ARI) of the non-constrained methods k-Means (KM) and Normalised Cut (NC)
    5.5 Results (ARI) of the constrained methods with 1% of the possible positive accurate constraints and without inaccurate constraints
    5.6 Results (ARI) of the constrained methods with 5% of the possible negative accurate constraints and without inaccurate constraints
    5.7 Largest ratio of inaccurate constraints for which the improvement over the baseline is significant (p-value ≤ 0.05)

6 Constraint Extraction
    6.1 Delicious tags; description of the dataset used in the experiments
    6.2 Delicious tags; results of the baselines
    6.3 Delicious tags; evolution of the number and ratio of accurate constraints as t increases
    6.4 Delicious tags; comparison of the best ARI for each constraint set and best baseline for each algorithm
    6.5 Word n-grams; distribution of the data in the datasets used in the experiments
    6.6 Word n-grams; amount of total and accurate positive constraints that can be created in the datasets
    6.7 Word n-grams and NER; evolution of the number and ratio of accurate constraints as the thresholds increase, Dataset (i)
    6.8 Word n-grams and NER; evolution of the number and ratio of accurate constraints as the thresholds increase, Dataset (ii)
    6.9 Word n-grams and NER; summary of the amounts of constraints and the percentage of them which are accurate for each constraint generation method in each dataset
    6.10 Word n-grams and NER; overlap between the constraints created with both methods for selected values of t
    6.11 Word n-grams and NER; informativeness of the accurate constraints for selected values of t
    6.12 Word n-grams and NER; results for Dataset (i)
    6.13 Word n-grams and NER; results for Dataset (ii)
    6.14 Word n-grams and NER; summary of the best ARI for each constrained method


List of Figures

4 Using Constrained Clustering in Avoiding Bias
    4.1 Constrained Normalised Cut algorithm
    4.2 Negative Constrained Normalised Cut method (NCNC) proposed in Section 4.5.1
    4.3 Normalised Cut plus Soft Constrained k-Means method (NC+SCKM) proposed in Section 4.5.2
    4.4 Stability of the parameters of the two proposed algorithms in the training collection (Dataset (i), avoiding TOPIC)

5 Robustness of Constrained Clustering Algorithms
    5.1 Results for collection (i)
    5.2 Results for collection (ii)
    5.3 Results for collection (iii)
    5.4 Results for collection (iv)
    5.5 Results for collection (v)

6 Constraint Extraction
    6.1 Example of the constraint extraction method based on Delicious tags
    6.2 Constraints extracted from Delicious tags; results using Constrained Normalised Cut (CNC)
    6.3 Constraints extracted from Delicious tags; results using Soft Constrained k-Means (SCKM)
    6.4 Some possible word n-grams of a sentence
    6.5 Example of the constraint extraction method based on n-grams
    6.6 Constraints extracted using n-grams and NER; results obtained using only good constraints


List of Algorithms

1 BATCH K-MEANS (KM)
2 CONSTRAINED K-MEANS (CKM)
3 PAIRWISE CONSTRAINED K-MEANS (PCKM)
4 NORMALISED CUT (NC)
5 CONSTRAINED NORMALISED CUT (CNC)
6 COMPLETE LINK (CL)
7 CONSTRAINED COMPLETE LINK (CCL)
8 SPECTRAL CLUSTERING WITH IMPOSED CONSTRAINTS (SCIC)
9 SOFT CONSTRAINED K-MEANS (SCKM)
10 CONSTRAINT CREATION USING DELICIOUS TAGS
11 CONSTRAINT CREATION USING N-GRAMS


Chapter 1

Thesis Outline

1.1 Introduction

Although data had been collected by individuals, organisations, firms and governments for a long time with diverse aims, it is with the advent of the computer age that data collection and processing have experienced a quantitative leap. Not only has the ubiquity of computers enabled those actors to greatly automate those tasks, which has in turn made it possible to scale up greatly the number and the size of the datasets, but the popularisation of the Web, and especially the coming of the so-called "Web 2.0", where new tools and platforms have mostly blurred the distinction between content creators and content consumers, has also caused a seemingly endless stream of new data (query logs, click-through data, blog and social network posts, photos, videos, geographic data, ...) every second.

This ever-increasing amount of information has prompted a growing need for automated tools to explore and process it. Traditionally, the answer given by Data Mining to this situation was divided in two approaches, Classification [Sebastiani, 2002] and Clustering [Jain et al., 1999].

Clustering is the most common form of automatic unsupervised data analysis. Traditionally, clustering algorithms work by trying to find relationships in the data, forming groups (clusters) using only the information present in the data and aiming to fulfil two goals: maximise the similarity between the data points which are assigned to the same cluster and keep the data points assigned to different clusters as dissimilar as possible. On the other hand, in classification, the most popular supervised approach, the user knows exactly which groups are present in the data, and feeds the algorithm with examples from these groups. With these examples, the algorithm characterises the categories of the data in order to be able to assign new data instances (i.e. data which has not been seen before) to the right group.

As follows from these descriptions, classification schemes rely on relatively extensive training data. Once it has been acquired, which is by itself a quite important subtopic, highly effective classification algorithms have been developed which are able to yield high quality results in a wide array of tasks and domains,


such as spam filtering, language identification, e-mail routing, etc. On the contrary, the quality of the results of even the most high-performing clustering schemes is in many cases modest in absolute terms, although they are nevertheless useful, given the exploratory intent with which clustering algorithms are used.

This phenomenon can be seen as a consequence of the unsupervised nature of the clustering task. Indeed, whereas in classification we have a clear idea of what we want the algorithms to do and a large degree of control over them through the choice of the training examples, in clustering we have the opposite situation: not only do we have a quite tenuous idea of how to define a good clustering (e.g. what exactly is "keeping the data points assigned to different clusters as dissimilar as possible"? To which degree should that be enforced?), but our control over the process is also limited at best to devising a way of comparing data points more in tune with our intuitions about the data or tweaking some innards of the algorithm.

It is in this context where lately a new family of semi-supervised clustering algorithms, coined as Constrained Clustering [Basu et al., 2008], has emerged. These new algorithms can incorporate some a priori domain knowledge, allowing the user to somehow guide the clustering process and improve the quality of its results. This information is given to the algorithm as a set of pairwise constraints involving pairs of data points and expressing hard restrictions or preferences about whether or not these pairs should be in the same cluster. Thus, and even though the user can have a greater influence over the outcome, Constrained Clustering is still a clustering process, as it is the algorithm itself which determines which groups exist in the data, while in classification the goal is cataloguing previously unseen data points into groups which have already been defined taking into account the examples given by the user. Moreover, these constraints do not have to be numerous or be distributed among the whole dataset in order to have a noticeable effect on the clustering process, which enables us to attain large improvements in the final quality of the partitions investing only a relatively small effort in obtaining this "training data"1.

1.2 Motivation

Constrained Clustering provides a convenient way to integrate into a clustering process information which in a regular process would go unused. This convenience is mainly due to two reasons. Firstly, Constrained Clustering offers an easy and unified way to provide the clustering algorithms with different kinds of clues about the appropriate or wanted grouping of the data. Regardless of the domain of the data or the nature of the clues, this information can in almost all cases be easily codified using constraints, which will affect the process in a principled way. Secondly, as we introduced in the previous section, these constraints do not necessarily have to come in big numbers or have a specially broad coverage in order to be used effectively. This enables us to

1 Even so, the process of how these constraints are obtained should not be disregarded as trivial, something that we want to highlight with this thesis.


make the most of specific domain information which, even though it may not affect many data instances, could prove to be useful. This contrasts with the more or less extensive set of examples which have to be obtained when using a classification algorithm.

These two characteristics can be very useful in the scenario that we have sketched in the previous section. For instance, on the one hand, the information contained in the datasets will be in most cases multimodal. If we want, for example, to cluster pictures that have been uploaded to a social network, we may be able to establish some relations between them comparing the people who have been "tagged"2 in them, or their geolocation data. In order to incorporate this information, which can be useful to detect which pictures are related, when using a regular clustering algorithm we would have to make ad-hoc changes to the way in which the images are compared, whereas when using Constrained Clustering algorithms we would have a direct way to encode it. On the other hand, the less demanding nature of Constrained Clustering with respect to the amount of information provided enables us to use it effectively when processing very large datasets without having to invest lots of resources in order to obtain a large number of constraints.

In spite of the very practical nature of these advantages, up to this date the research on Constrained Clustering has been mostly focused on theoretical aspects, particularly on proposing new clustering algorithms in order to make the most of the information carried by the constraints. With this thesis we aim to propose new applications of Constrained Clustering and to discuss certain practical aspects and problems which have to be considered when using it to tackle real-world problems.

1.3 Thesis Statement

The main claim of this work is that there are certain practical questions that have been mostly overlooked in the research on Constrained Clustering and whose importance is capital when trying to use it in real-world scenarios. In this thesis we identify two of these issues, constraint extraction and robustness to noise.

As we have previously introduced, research on Constrained Clustering has been largely focused on developing novel algorithms. When these algorithms are put to the test, the authors use in the experiments synthetic sets of constraints, created from the same ground truth with which the results of the clustering are compared in order to assess their quality. However, since this ground truth will obviously not be available in real-world clustering problems, it is clear that suitable methods should be found to create, whether manually or automatically, the constraints that fuel the Constrained Clustering algorithms, something to which very little attention has been paid.

Another consequence of using these synthetic sets of constraints is that in almost all cases the constraints used in the aforesaid experiments are accurate, that is to say, they actually carry information about a good partition of

2 That is, the people who appear in the picture, according to the user who has uploaded it or other users of the social network.


the data. Therefore, the results of the experiments reported in these papers show the behaviour of the algorithms under ideal conditions, conditions that are unfortunately very unlikely in most real scenarios, where, given that they have to be extracted, constraint sets will often be noisy, that is, they will contain some inaccurate constraints. Thus, the robustness of the Constrained Clustering algorithms to noise is bound to play an important role in their final effectiveness.

The main contributions of this thesis are the following. We perform an analysis of the robustness of some Constrained Clustering algorithms to noisy sets of constraints, designing an experiment in which the behaviour of the algorithms is tested with synthetic sets of inaccurate constraints which are created with two different methods, a random one and another based on intuitions about the nature of real errors in the constraints. In the light of the results of these experiments, we discuss the strengths and weaknesses of each method, which we have used to identify the scenarios where using each algorithm may be the best decision.

Moreover, in this thesis we also propose two schemes to automatically extract constraints in two important data domains: web pages and textual data in general. In the first case, we propose a method which uses information external to the entities being clustered, specifically the tags that the users of Delicious, the most popular social bookmarking service, have associated with these pages. In the second case, our proposal involves using the text of the documents itself, extracting from it valuable information which is usually not taken into account by the usual text representation schemes. Particularly, we use word n-grams to create constraints that can incorporate into the clustering process some of the information contained in the vicinity relations between words. Both methods are tested in thorough experiments over reference collections and compared with suitable baselines.

Given the scarcity of papers dealing with the problem of obtaining the constraints, the methodology that has been followed to conduct these experiments (e.g. the questions that are tested, the metrics used to obtain the results or the statistical tests used to validate them) can also be regarded as another contribution of this thesis on its own.

Finally, straying a little from the two issues introduced above, but keeping with the practical focus of this thesis, we have analysed how to apply Constrained Clustering to tackle an existing real-world problem, the Avoiding Bias task, which consists of, given some data to cluster and an already known partition of it, finding an alternative partition of the data which is also a good partition. In order to tackle this task we propose a scheme that uses constraints to codify the partition to be avoided, constraints which are then fed to a Constrained Clustering algorithm devised by us that is afterwards run over the input data, finding an alternative partition. Moreover, we also study how to improve the quality of the alternative partitions, proposing two approaches which use spectral clustering techniques.


1.4 Outline

The main novel contributions of this thesis are presented in chapters 4, 5 and 6. Chapters 2 and 3 respectively contain a general introduction to Constrained Clustering and a discussion of some aspects common to the experiments performed and reported in this thesis; even though a specialist on the topic could skip them, any interested reader may find them useful in order to frame the remainder of this work.

• Chapter 2 is a small survey of Constrained Clustering. In particular, we describe several Constrained Clustering algorithms which will be used or referenced along this work and we enumerate some research opportunities still open in this young topic, some of which will be addressed in this thesis.

• Chapter 3 gathers and examines some aspects common to the experiments conducted in the work leading to this thesis, so as to avoid unnecessary repetitions along it and to have a centralised point of reference to which to turn when discussing the experiments in the next chapters.

• Chapter 4 summarises our work in using Constrained Clustering to tackle the Avoiding Bias problem. In the first part of the chapter we propose a scheme which uses constraints and a Constrained Clustering algorithm devised by us, whereas in the second part we focus on improving the quality of the alternative groupings obtained as a result of this task.

• Chapter 5 contains a study of the robustness of some Constrained Clustering algorithms to noisy sets of constraints (i.e. those containing inaccurate constraints), using two different noise models. In the discussion of the results of the study we highlight the strengths and weaknesses of these algorithms. The chapter finishes with an analysis of which scenarios are most suited for each algorithm.

• Chapter 6 introduces and discusses two methods to automatically extract constraints. In the first part of the chapter we propose an approach that can be used when clustering web pages, which is based on external information, the tags associated to these pages by the users of Delicious. On the other hand, in the second part of the chapter we propose an approach that can be used on all kinds of textual information, which is based on internal information.

• Finally, Chapter 7 presents the conclusions of the thesis and a summary of the future research lines.

In order to try to keep the chapters introducing the main novel contributions of the thesis as self-contained as possible, each of them contains its own introduction to its specific topic and its own study of the relevant literature.


Chapter 2

Introduction

Data analysis has a central role to play in several real-world domains, such as medicine, advertising or market analysis, to name a few. Specifically, with the ever-growing size of the data collections in the possession of public institutions and private companies there is a need for accurate automatic data analysis tools in order to be able to process these great amounts of data in a timely manner. Lately, a new family of semi-supervised data analysis algorithms, coined as Constrained Clustering algorithms, has emerged, which aims to enable the user to have more control over the outcome of a clustering process. In this chapter we present a small survey of this topic, in which the research summarised in this thesis is circumscribed. In particular, we describe several Constrained Clustering algorithms which will be used or referenced along this work and we enumerate some research opportunities still open in this young topic, some of which will be addressed in this thesis.

2.1 Clustering

Clustering [Jain et al., 1999] is the most popular form of automatic unsupervised data analysis. Given a collection of data instances, the goal of the clustering task is finding a meaningful grouping of the data, categorising the instances into sets (the clusters) such that the similarity between elements contained in the same cluster is as high as possible and that between those in different clusters is kept as low as possible. Clustering is usually used as an exploratory tool, for instance, in order to discover a potential underlying structure in the data. For this process, regular clustering algorithms use only the information present in the data to cluster, or more exactly in the representation of the data offered to them. In these aspects, clustering contrasts strongly with classification, the most popular supervised data analysis approach [Sebastiani, 2002], where the user knows exactly which groups are present in the data, and feeds the algorithm with examples from these groups which the algorithm uses to characterise them in order to be able to assign new data instances (i.e. data which has not been seen before) to the right group.

According to their approach to the task, clustering algorithms can be divided into two broad classes:


• Flat (also partitional) clustering algorithms group the data instances by partitioning the space of representation, yielding a final set of clusters with no explicit relation or hierarchy. The paradigmatic example of this class of algorithms is k-Means [McQueen, 1967].

• Hierarchical clustering algorithms divide the data into clusters which also make up a hierarchy, with clusters that share their content with other clusters. The family of agglomerative algorithms [Jain et al., 1999] is arguably the most popular example of hierarchical clustering algorithms.

2.2 Constrained Clustering

As follows from the previous section, customarily one of the main differences between clustering and classification has been the degree of control that the user has over the entire process. Whereas in classification the outcome of the process lies almost fully in the choice of the examples made by the user (hence the great importance of the technique which has been used to select them), the capability of the user to influence the result of the clustering algorithms was classically limited at best to selecting which metric was used to compare the data instances or tweaking some details of the innards of the algorithms, such as which Laplacian is used in spectral methods or using Complete, Single or Average link in hierarchical ones.

Lately, a new family of semi-supervised clustering algorithms, coined as Constrained Clustering [Basu et al., 2008], has emerged. These new algorithms can incorporate some a priori domain knowledge, allowing the user to somehow guide the clustering process. This information is provided in the shape of a set of pairwise relations between data elements (called constraints) which express the preference (or even obligation, depending on the Constrained Clustering algorithm used) about whether the two data instances joined by each of these constraints should or should not be put in the same cluster. As can be noted, these constraints are very different from the extensive set of examples of each group which is needed in classification. Thus, and even though the user can influence the outcome, Constrained Clustering is still a clustering process, as it is the algorithm itself which determines which groups exist in the data, while in classification the goal is cataloguing previously unseen data points into groups which have already been defined taking into account the examples given by the user. With these Constrained Clustering approaches, knowledge that was unused in traditional clustering algorithms is used to improve the grouping of data, that is, to make the final clustering more accurate, meaningful or more in tune with the user's view of the data.

2.2.1 Constraints

According to the information that they provide about the data instances, there are two kinds of pairwise constraints, introduced by Wagstaff and Cardie [2000]:


• Positive (also called Must-link) constraints, ML(a, b), indicate that two data instances must (should) be in the same cluster

• Negative (also called Cannot-link) constraints, CL(a, b), indicate that two data instances cannot (should not) be in the same cluster

The degree of absoluteness of the constraints, that is, whether a grouping of the data is acceptable if some of the constraints are not respected, typically depends on the choice of the Constrained Clustering algorithm. While some algorithms consider the constraints as absolute, and would not output a clustering in which, for example, data instances linked by a ML are in different clusters, most of them balance the respect of the constraints with a cluster quality objective independent from these constraints, effectively using them as a non-absolute guide to an appropriate final clustering.

As noted by Wagstaff et al. [2001], the relation between data instances expressed by Must-link constraints is transitive, i.e. ML(a, b) ∧ ML(b, c) → ML(a, c). Also, the ML relation is symmetric and reflexive (trivially, a data instance must always be in the same cluster as itself). Thus, ML is an equivalence relation. The resulting equivalence classes are called by some authors "chunklets" [Shental et al., 2004] or "neighbourhoods" [Basu et al., 2004a], which, if given enough well-chosen ML constraints, would completely and univocally define a grouping of the data.

As for the Cannot-links, the relation expressed by them is symmetric, non-transitive and trivially anti-reflexive. As also noted by Wagstaff et al. [2001], in spite of this non-transitivity it is however possible to extract "new" negative constraints (which make explicit the information already present) when considering Cannot-links in conjunction with Must-links. Since Must-links indicate that data instances in the same chunklet must be in the same cluster, a Cannot-link between an element of the chunklet and some other data instance implies the existence of a Cannot-link between that instance and each of the elements of the chunklet. This can be compactly expressed as CL(a, b) ∧ ML(b, c) → CL(a, c).
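To make these mechanics concrete, the following sketch (a minimal illustration of our own in Python, not part of the original thesis; all function and variable names are ours) builds the chunklets as the transitive closure of the Must-links using a union-find structure, and makes explicit the Cannot-links entailed by the rule above.

# Minimal sketch (our own illustration): chunklets as the transitive closure of
# the Must-links (union-find), plus the Cannot-links entailed by combining CL and ML.

def build_chunklets(n, must_links):
    """Group the n data instances into chunklets (equivalence classes of ML)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for a, b in must_links:
        parent[find(a)] = find(b)          # merge the two chunklets

    chunklets = {}
    for i in range(n):
        chunklets.setdefault(find(i), []).append(i)
    return list(chunklets.values())

def expand_cannot_links(chunklets, cannot_links):
    """Make explicit the CLs implied by CL(a, b) and ML(b, c) -> CL(a, c)."""
    member = {i: tuple(c) for c in chunklets for i in c}
    expanded = set()
    for a, b in cannot_links:
        for x in member[a]:
            for y in member[b]:
                expanded.add((min(x, y), max(x, y)))
    return expanded

# Example: ML(0,1) and ML(1,2) form the chunklet {0, 1, 2}; together with CL(2,3)
# they entail CL(0,3) and CL(1,3) as well.
chunklets = build_chunklets(5, [(0, 1), (1, 2)])
print(chunklets)                                  # [[0, 1, 2], [3], [4]]
print(expand_cannot_links(chunklets, [(2, 3)]))   # {(0, 3), (1, 3), (2, 3)}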

Finally, it is worth mentioning that some other types of constraints have been introduced to denote desired structural properties of the clusters, such as a minimum or a maximum possible radius [Davidson and Ravi, 2005a]. However, they can ultimately be expressed as a set of pairwise positive and negative constraints.

2.3 Constrained Clustering Algorithms

The research into Constrained Clustering has yielded a great number of algorithms. Depending on how they use the information contained in the constraints, the algorithms can be divided into two groups: constraint-based1 and distance-based.

1 The name "constraint-based" can be misleading, as it would suggest that only that group of Constrained Clustering algorithms uses constraints, when actually in both cases the information is provided using them. However, in keeping with the literature available on Constrained Clustering [Davidson and Basu, 2007], that denomination has been used in this thesis.


• Constraint-based algorithms: in this group of methods the flow of the clustering algorithms is altered to accommodate the information provided by the constraints. Thus, stages such as the initialisation of clusters or the assignment of data instances to clusters are changed in order to obtain a final grouping of the data more in consonance with the available constraints, which have a direct influence on how these steps are carried out (hence the name of this group of algorithms).

• Distance-based algorithms: this group of algorithms uses the information conveyed by the constraints to learn a distance metric between data instances, reflecting the preferences conveyed by the constraints. Grosso modo, the goal of these approaches is obtaining a distance metric in which the data instances linked by positive constraints are brought closer and those linked by negative constraints are separated, as illustrated in the sketch after this list.
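As a toy illustration of the distance-based idea (a sketch of our own, not any of the published distance-based methods; the function names, learning rate and number of epochs are assumptions), per-feature weights can be adjusted so that weighted distances shrink for Must-linked pairs and grow for Cannot-linked pairs:

import numpy as np

def learn_diagonal_metric(X, ml, cl, epochs=100, lr=0.01):
    # Learn weights w for the squared distance d_w(a, b) = sum_f w_f * (a_f - b_f)^2
    # by gradient descent on (sum over ML of d_w) - (sum over CL of d_w):
    # bring ML pairs closer, push CL pairs apart. Purely illustrative.
    w = np.ones(X.shape[1])
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for a, b in ml:
            grad += (X[a] - X[b]) ** 2     # decreasing w here shrinks ML distances
        for a, b in cl:
            grad -= (X[a] - X[b]) ** 2     # increasing w here grows CL distances
        w = np.clip(w - lr * grad, 1e-6, None)   # keep the metric weights positive
    return w

def weighted_distance(a, b, w):
    return np.sqrt(np.sum(w * (a - b) ** 2))

A regular clustering algorithm can then be run unchanged, simply using weighted_distance in place of the original metric.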

In this section we describe some of the most important and influential Constrained Clustering algorithms, putting a special focus on those used in or related to the work summarised in this thesis.

2.3.1 Constrained k-Means

Constrained k-Means (CKM)2 was proposed by Wagstaff et al. [2001]. This algorithm is based on the skeleton of k-Means (KM, McQueen [1967]), one of the most popular clustering algorithms, over which the authors make some changes so that it takes the constraints into account. Consequently, it is a constraint-based algorithm.

Batch k-Means

Batch k-Means [McQueen, 1967] (KM) is one of the most widely-used clustering algorithms, due to its simplicity and good performance, which enables its use on big datasets. It is an iterative algorithm whose goal is distributing the data in clusters such that the residual sum of squares (RSS) of the solution is minimised, which is defined as

\mathrm{RSS}(\Omega) = \sum_{i=1}^{k} \sum_{x \in \omega_i} (x - \bar{\omega}_i)^2 \qquad (2.1)

where Ω is the outcome of the algorithm (which is comprised of k clusters ω1, ω2, ..., ωk), and ω̄i is the centroid of cluster ωi.

The pseudocode for k-Means is shown in Algorithm 1. The first step of the algorithm is the initialisation, in which each of the k clusters is initialised with a point in the clustering space; these points are called the seeds of the clustering. A usual way to perform this initialisation is choosing k data points at random from the data to cluster, because thus these seeds will be situated in populated zones of the space. After this initialisation, the algorithm enters a loop (lines 2–8), which is its main core.

2 COP-KMEANS in the original paper.


Algorithm 1: BATCH K-MEANS (KM)
input : X, the data to cluster; k, the number of clusters
output: Ω = {ω1, ω2, ..., ωk}, a partition of the data

1 foreach ω ∈ Ω do Initialise(ω)
2 while convergence is not attained do
3     foreach ω ∈ Ω do RecalculateCentroid(ω)
4     foreach x ∈ X do
5         i ← argmin_{j ∈ 1..k} (Distance(x, ωj))
6         Assign(x, ωi)
7     end
8 end

In this loop two steps are repeated until the solution converges: first, the centroid of each cluster (i.e. the centroid of the points assigned to it) is calculated. Then, each data point is assigned to the cluster with the most similar centroid. A usual convergence condition, which we have used in the work summarised in this thesis, compares the centroids of the clusters in the present and the previous iteration. If they are very similar it is assumed that the algorithm has reached convergence, as the changes in the assignments of the data instances would be minimal. Usually, the loop is also interrupted if a large number of iterations is performed without converging, to prevent an endless loop.

One of the most important problems of k-Means (and of the Constrained Clustering algorithms surveyed in this section which use it) is its dependency on the initial conditions of the clustering, i.e. the seeds which have been chosen in the initialisation phase. A good set of seeds can lead the algorithm to a good and fast solution, while a bad set can lead it to a local minimum. Unfortunately, it is very hard to judge a priori the goodness of a set of seeds. Thus, the solution which is usually given to this dependency on the initialisation is repeating the clustering process several times with different seeds.
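For illustration, the following NumPy sketch (our own; it mirrors Algorithm 1 but is not the exact implementation used in the thesis, and the tolerance and restart values are arbitrary assumptions) runs batch k-Means with the centroid-based convergence test described above and keeps the best of several random restarts.

import numpy as np

def batch_kmeans(X, k, max_iter=100, tol=1e-6, seed=None):
    """One run of batch k-Means (cf. Algorithm 1): returns labels and the RSS."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # k random data points as seeds
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid (keep the old one if the cluster is empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Convergence test: stop when the centroids barely move between iterations.
        converged = np.linalg.norm(new_centroids - centroids) < tol
        centroids = new_centroids
        if converged:
            break
    rss = ((X - centroids[labels]) ** 2).sum()
    return labels, rss

def kmeans_with_restarts(X, k, restarts=10):
    """Mitigate the seed dependency by keeping the lowest-RSS run of several restarts."""
    return min((batch_kmeans(X, k, seed=r) for r in range(restarts)),
               key=lambda run: run[1])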

Constrained k-Means

As was introduced before, Constrained k-Means was proposed by Wagstaff et al. [2001]. This Constrained Clustering algorithm considers positive and negative absolute constraints (i.e. all of them have to be honoured in order to have an acceptable partition of the data). In order to introduce these constraints in k-Means, the authors altered the process used in the main loop of the algorithm to assign data instances to clusters.

As can be seen in lines 5–7 of the algorithm, instead of directly assigning each instance to the cluster with the most similar centroid, CKM checks whether the assignments of the data instances to clusters violate any constraint. Thus, an assignment of a data instance to a cluster will be illegal (function ViolatesConstraints) if a data point with which the data instance has a Must-link has been assigned to a different cluster (line 10) or if the cluster contains a data point with which the data instance has a Cannot-link (line


Algorithm 2: CONSTRAINED K-MEANS (CKM)
input : X, the data to cluster; k, the number of clusters; ML and CL, the positive and negative constraints to be taken into account
output: Ω = {ω1, ω2, ..., ωk}, a partition of the data

1 foreach ω ∈ Ω do Initialise(ω)
2 while convergence is not attained do
3     foreach ω ∈ Ω do RecalculateCentroid(ω)
4     foreach x ∈ X do
5         i ← argmin_{j ∈ 1..k} (Distance(x, ωj)) such that ¬ViolatesConstraints(x, ωj)
6         if ∄ i then clustering fails
7         else Assign(x, ωi)
8     end
9 end

function ViolatesConstraints(x, ω)
input : x, a data instance; ω, a cluster
output: whether putting x in cluster ω contravenes any constraint

10 foreach (x, x′) ∈ ML do if x′ ∉ ω then return true
11 foreach (x, x′) ∈ CL do if x′ ∈ ω then return true
12 return false
end

11). Put another way, this amounts to directly assigning a data point x to a cluster if that cluster contains an instance linked to x with a ML, and otherwise assigning it to the cluster with the most similar centroid, excluding those clusters which contain a data point linked to x with a CL.

The absoluteness of the constraints, reflected in those changes in the as-signment policy, and the way in which the constraints are enforced (whichentails a great sensitivity of the clustering results to the order in which thedata instances are inspected) can result in problems in the clustering. Forexample, as the authors warn in their paper, a moderate number of negativeconstraints can render a clustering impossible (line 6) if a data point can notbe assigned to any cluster due to having a Cannot-link with data instances ineach cluster. This stagnation of the algorithm might not have happened if thedata points were inspected in a different order. Moreover, all the data pointsconnected by Must-links will be assigned blindly to the cluster where the onewhich was first inspected was assigned, without taking into account their sim-ilarity with the centroids of the clusters. Albeit we are ensuring that all thepositive constraints are respected, it is possible that, if the inspection orderwere different, all of those data points could have ended up in a completelydifferent cluster, which, on average, could be a better clustering for them.
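A small sketch of the CKM assignment step follows (our own simplification of Algorithm 2 in Python; the data structures and names are ours, and we adopt the reading in which only constraint partners that already have an assignment in the current pass are checked):

import numpy as np

def violates_constraints(x, cluster, assignment, ml, cl):
    """ViolatesConstraints of Algorithm 2, checked over the points assigned so far."""
    for pair in ml:
        if x in pair:
            partner = pair[0] if pair[1] == x else pair[1]
            if assignment.get(partner) not in (None, cluster):
                return True       # a Must-link partner sits in a different cluster
    for pair in cl:
        if x in pair:
            partner = pair[0] if pair[1] == x else pair[1]
            if assignment.get(partner) == cluster:
                return True       # a Cannot-link partner sits in this cluster
    return False

def ckm_assign(x, point, centroids, assignment, ml, cl):
    """Lines 5-7 of Algorithm 2: closest feasible centroid, or failure."""
    order = np.argsort([np.linalg.norm(point - c) for c in centroids])
    for j in order:                       # try clusters from closest to farthest
        if not violates_constraints(x, int(j), assignment, ml, cl):
            assignment[x] = int(j)
            return int(j)
    raise RuntimeError("clustering fails: no feasible cluster for instance %d" % x)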


2.3.2 Pairwise Constrained k-Means

Pairwise Constrained k-Means (PCKM), introduced by Basu et al. [2004a], is a Constrained Clustering algorithm also inspired by k-Means. This algorithm uses positive and negative constraints to influence both the initialisation of the clusters and the assignment of data instances to clusters. Hence, it is a constraint-based algorithm.

In a previous paper [Basu et al., 2002] the authors had already discussed the use of domain information to initialise the clusters in a k-Means algorithm, although in that case the domain information was provided in the form of cluster labels for some data instances. Given these labels, each cluster was initialised with the centroid of the points which were surely known to be in it, obtaining an improvement in the clustering results. In the present case, Basu et al. propose a similar approach which uses positive and negative constraints, the only domain information available, in the initialisation of the clusters. After taking the transitive closure of the Must-links and Cannot-links (see Section 2.2.1), the centroid of each of the k largest resulting neighbourhoods is used to initialise the clusters. If fewer than k neighbourhoods were defined by the constraints, the next cluster is initialised, if such a point exists, with a data instance linked by Cannot-links with all neighbourhoods. All the remaining clusters are initialised with random perturbations of the global centroid of X.
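As an illustration of this initialisation, the following sketch (ours; helper names such as ml_neighbourhoods are not from the original paper) covers its common part: the Must-link neighbourhoods are found as connected components, the k largest ones provide centroids, and any remaining cluster falls back to a perturbed global centroid. The Cannot-link-based fallback is omitted for brevity.

import numpy as np
from collections import defaultdict

def ml_neighbourhoods(n_points, ml):
    """Connected components of the Must-link graph (its transitive closure)."""
    parent = list(range(n_points))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in ml:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in range(n_points):
        groups[find(i)].append(i)
    # keep only neighbourhoods actually induced by constraints (size > 1), largest first
    return sorted((g for g in groups.values() if len(g) > 1), key=len, reverse=True)

def initialise_pckm(X, k, ml, seed=0):
    rng = np.random.default_rng(seed)
    hoods = ml_neighbourhoods(len(X), ml)
    centroids = [X[h].mean(axis=0) for h in hoods[:k]]
    while len(centroids) < k:                    # fallback: perturbed global centroid
        centroids.append(X.mean(axis=0) + 0.01 * rng.standard_normal(X.shape[1]))
    return np.vstack(centroids)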

As for the effect of the constraints on the assignment of data instances to clusters, Basu et al. propose a new objective function Jpckm which must be minimised in the clustering process. The first term of this function quantifies, as in k-Means, the RSS of the solution, that is, the compactness of the clusters. Next, the second term measures the observance of positive constraints, adding a penalty value of w_ij to Jpckm if a Must-link between data points x_i and x_j is not respected by Ω (i.e. if their labels l_i and l_j are different). Finally, the third term measures the observance of Cannot-links in a similar way, this time with a penalty of w_ij for each Cannot-link constraint which is not respected. In their paper, the authors show how this function can be motivated by defining a certain Hidden Markov Random Field (HMRF) over the data instances, finding the MAP (maximum a posteriori probability) configuration of the HMRF being equivalent to minimising Jpckm:

J_{pckm}(\Omega) = \frac{1}{2}\sum_{i=1}^{k}\sum_{x \in \omega_i} \lVert x - \omega_i \rVert^2 + \sum_{(x_i,x_j) \in ML} w_{ij}\,\mathbb{1}[l_i \neq l_j] + \sum_{(x_i,x_j) \in CL} w_{ij}\,\mathbb{1}[l_i = l_j]    (2.2)

In order to minimise this new objective function, the authors propose a greedy strategy based on k-Means. The pseudocode of the resulting algorithm (including the new initialisation phase) is shown in Algorithm 3. As can be seen in line 6, instead of directly assigning a data point to the cluster with the closest centroid, the algorithm also takes into account the possible penalties entailed by not respecting any constraint. Although, as introduced before, these penalties can be different for each constraint, following the outline of the algorithm given by Basu et al. in the original paper this pseudocode considers the same penalty value (w) for all constraints.


Depending on the number of neighbourhoods, PCKM can still be dependent on initial conditions, namely on the random perturbations of the global centroid (line 17).
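A minimal sketch (ours) of the penalty term used in line 6 of Algorithm 3 follows; it assumes, as the pseudocode does, a single penalty value w for all constraints, and it computes penalties against the assignments already made in the current pass:

import numpy as np

def penalties(i, cluster, assignment, ml, cl, w):
    """Extra cost incurred if point i is placed in `cluster` (cf. Equation 2.2)."""
    p = 0.0
    for a, b in ml:
        other = b if a == i else a if b == i else None
        if other is not None and assignment.get(other, cluster) != cluster:
            p += w                              # broken Must-link
    for a, b in cl:
        other = b if a == i else a if b == i else None
        if other is not None and assignment.get(other) == cluster:
            p += w                              # broken Cannot-link
    return p

def pckm_assign(X, centroids, ml, cl, w):
    assignment = {}
    for i, x in enumerate(X):
        cost = [0.5 * np.sum((x - c) ** 2) + penalties(i, j, assignment, ml, cl, w)
                for j, c in enumerate(centroids)]
        assignment[i] = int(np.argmin(cost))    # cheapest cluster, penalties included
    return assignment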

2.3.3 HMRF k-Means

In a follow-up paper [Basu et al., 2004b] to the one introducing Pairwise Constrained k-Means, Basu et al. introduce a new Constrained Clustering algorithm, Hidden Markov Random Field k-Means (HMRF k-Means), which generalises PCKM to use a broad range of clustering distortion measures. Apart from this, further changes to PCKM include an improved initialisation phase and a second maximisation step where, if the distance used is parametrised, its parameters are adjusted3. Hence, it is a hybrid approach, since not only is the flow of the algorithm altered (as happened in the previous approaches, which were constraint-based), but the distance between data points is also adapted to reflect the preferences conveyed by the constraints, as happens in the Distance-based Constrained Clustering approaches.

The objective function of HMRF k-Means is shown in Equation 2.3. In it, D is the distance used, φ is the penalty scaling function, a monotonically increasing function of the distance dependent on the choice of D, and Z is a normalising constant. φ is used to link the penalty for a violated constraint with the distance between the data instances: more impact is given to non-respected Must-links between distant data points and to non-respected Cannot-links between close ones. The rationale behind this is that it is in those cases where the distance is performing worse and hence must be further adjusted.

J_{hmrf\,km}(\Omega) = \sum_{i=1}^{k}\sum_{x \in \omega_i} D(x, \omega_i) + \sum_{(x_i,x_j) \in ML} w_{ij}\,\varphi_D(x_i, x_j)\,\mathbb{1}[l_i \neq l_j] + \sum_{(x_i,x_j) \in CL} w_{ij}\,\left(\varphi(D_{max}) - \varphi_D(x_i, x_j)\right)\,\mathbb{1}[l_i = l_j] + \log Z    (2.3)

Given this objective function, the main loop of the algorithm is similar to the one in PCKM. When reassigning data points, the penalties associated with each cluster are calculated according to J_hmrf km. Also, after recalculating the centroids, if the distance is parametrised its parameters are updated in order to further decrease the objective function. The process to do so would depend on the distance used.

Regarding the initialisation of the clusters, the same outline as the one proposed in PCKM is followed, except in the case of having more neighbourhoods than clusters. In that case, instead of choosing the largest neighbourhoods, this algorithm uses a weighted variant of farthest-first traversal to select the centroids of the neighbourhoods, where the weight of each centroid is proportional to the size of its neighbourhood. The aim of this process is having a better representation of the distribution of the data.

3 These changes over PCKM were introduced in an intermediate paper [Bilenko et al., 2004] by the same authors, although this is not referenced in the HMRF k-Means paper.


Algorithm 3: PAIRWISE CONSTRAINED K-MEANS (PCKM)
input : X, the data to cluster; k, the number of clusters; ML and CL,
        the positive and negative constraints to be taken into account;
        w, the penalty associated with each constraint
output: Ω = {ω1, ω2, .., ωk}, a partition of the data

1  TransitiveClosure(ML, CL)
2  Ω ← InitialisePCKM(k, X, ML, CL)
3  while convergence is not attained do
4      foreach ω ∈ Ω do RecalculateCentroid(ω)
5      foreach x ∈ X do
6          i ← argmin_{j∈1..k} ( ½‖x − ωj‖² + Penalties(x, ωj, ML, CL, w) )
7          Assign(x, ωi)
8      end
9  end

function InitialisePCKM(k, X, ML, CL)
    input : k, the number of clusters; X, the data to cluster; ML and
            CL, positive and negative constraints
10     N1, N2, .., Nν ← GetNeighbourhoodsSortedBySize(ML, CL)
11     if k ≤ ν then
12         for i ← 1 to k do Initialise(ωi, Centroid(Ni))
13     else
14         for i ← 1 to ν do Initialise(ωi, Centroid(Ni))
15         if ∃x linked with CL with all Ni then Initialise(ων+1, x)
16         foreach ω still not initialised do
17             Initialise(ω, RandomPerturbation(Centroid(X)))
18         end
19     end
end

function Penalties(x, ω, ML, CL, w)
    input : x, a data point; ω, a cluster; ML and CL, positive and
            negative constraints; w, the penalty associated with the
            constraints
    output: p, the penalties incurred if x is assigned to ω
20     p ← 0
21     foreach (x, x′) ∈ ML do if x′ ∉ ω then p ← p + w
22     foreach (x, x′) ∈ CL do if x′ ∈ ω then p ← p + w
23     return p
end


2.3.4 Constrained Normalised Cut

Constrained Normalised Cut (CNC) is a constraint-based constrained spectral clustering algorithm introduced by Ji and Xu [2006], which is based on the spectral clustering algorithm Normalised Cut (NC) introduced by Shi and Malik [2000].

Spectral Clustering

Spectral Clustering algorithms [Ding, 2004; von Luxburg, 2006] are a family of algorithms which use graph spectral techniques to tackle the clustering task, transforming it into a graph cut problem.

In order to do so, a graph is created to represent the data to be clustered. Specifically, a weighted graph G = (V, E, W) is built, where each vertex (set V = {v1, v2, ..., vn}) corresponds to a data point of the data collection, and the weight (set W = {w11, w12, ..., wnn}) of each edge (set E) is set according to the similarity between the data points joined by that edge4. There are several strategies to compute this weight, which range from using the plain similarity between data instances to more complex approaches, such as Gaussian smoothing. Moreover, the connectivity of G can be total or dependent on the values of similarity (using strategies such as connecting a vertex only with its nearest neighbours or only with points whose similarity is above a threshold).
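As an illustration, one possible way of building such a graph is sketched below (ours; the Gaussian width sigma and the number of neighbours are example parameters, not values used in this thesis):

import numpy as np

def similarity_graph(X, sigma=1.0, n_neighbours=None):
    """Weight matrix W: Gaussian-smoothed similarities, optionally sparsified."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared distances
    W = np.exp(-d2 / (2.0 * sigma ** 2))                          # Gaussian smoothing
    np.fill_diagonal(W, 0.0)
    if n_neighbours is not None:                                  # keep only the nearest neighbours
        keep = np.zeros_like(W, dtype=bool)
        for i in range(len(X)):
            keep[i, np.argsort(W[i])[-n_neighbours:]] = True
        W = np.where(keep | keep.T, W, 0.0)                       # symmetrise the sparsification
    return W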

Once the graph G has been created, the objective of clustering, which is finding a good clustering of the data in k clusters (one with high similarity between data points assigned to the same cluster and consequently low similarity between points in different clusters), can be reformulated in terms of finding a good cut of graph G. The translation of the conditions of a good clustering to the graph cut problem would be finding a partition of the graph in k connected components (each one of them corresponding to a cluster) such that, on the one hand, the weights of the edges that join vertices in different connected components are minimised and, on the other hand, the weights of the edges between vertices in the same component are maximised. There are several functions which measure this objective. One of the most popular of such functions is Shi and Malik's Normalised Cut.

Normalised Cut

The Normalised Cut (NCut5) value of a certain cut of a given graph was introduced by Shi and Malik [2000]. For a certain cut C = {A1, A2, ..., Ak} of a graph G = (V, E, W), NCut is defined as:

NCut(A_1, \dots, A_k) = \sum_{i=1}^{k} \frac{cut(A_i, \overline{A_i})}{vol(A_i)}    (2.4)

4 In the remainder of this thesis we have assumed that greater weights mean greater similarities.
5 In this thesis we will use the abbreviation NC when referring to the Normalised Cut clustering algorithm and the abbreviation NCut when referring to the Normalised Cut value of a cut of a graph.


where

cut(A, B) = \sum_{i \in A,\, j \in B} w_{ij}; \qquad vol(A) = \sum_{i \in A}\sum_{j=1}^{n} w_{ij}    (2.5)

A_1 to A_k are the connected components into which the graph has been divided, and \overline{A_i} denotes the vertices which are not included in A_i (i.e. \overline{A_i} = V \setminus A_i).

As follows from this definition, having a graph cut with a low NCut would mean having a cut of the graph in which the weights of the edges which join vertices in different connected components are as low as possible while keeping the volumes of the resulting connected components as high as possible. This last condition ensures a certain balance between the connected components, trying to avoid trivial solutions. So, a cut of G with a low NCut would correspond to a good (as defined above) clustering of the data.
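For clarity, a direct implementation (ours) of Equations 2.4 and 2.5 for a given partition of the vertices could look as follows, with labels[i] the component assigned to vertex i (non-empty components with non-zero volume are assumed):

import numpy as np

def ncut_value(W, labels):
    """NCut of the partition described by `labels` over the weight matrix W."""
    total = 0.0
    for c in np.unique(labels):
        inside = (labels == c)
        cut = W[inside][:, ~inside].sum()      # weight of edges leaving the component
        vol = W[inside].sum()                  # total degree of the component
        total += cut / vol
    return total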

The minimisation of NCut can be presented as a matrix trace minimisation problem. Summarising the approach taken in [von Luxburg, 2006], let H = (h_{ij}) be an n × k matrix which will be used to encode the membership of data points to the connected components. The jth column of H contains the membership of connected component A_j (its indicator vector), encoded as follows:

h_{ij} = \begin{cases} \dfrac{1}{\sqrt{vol(A_j)}} & \text{if } v_i \in A_j \\ 0 & \text{otherwise} \end{cases}    (2.6)

Also, let D be an n × n diagonal matrix such that d_{ii} = degree(v_i) = \sum_{j=1}^{n} w_{ij}, and let L be the Laplacian matrix of graph G (that is, L = D − W).

Using these matrices, it can be shown [von Luxburg, 2006] that the minimisation of NCut can be written as in Equation 2.7:

\min_{A_1, \dots, A_k} \mathrm{Tr}(H^T L H) \quad \text{s.t.} \quad H^T D H = I    (2.7)

It can be demonstrated that the condition of discreteness of the values of H (due to its definition in Equation 2.6) makes the minimisation problem expressed in Equation 2.7 NP-Hard. If this discreteness of the values of H is relaxed, allowing the indicator columns composing that matrix to have any value in R^n, and the substitution Y = D^{1/2} H is performed, we reach the expression in Equation 2.8:

\min_{Y \in \mathbb{R}^{n \times k}} \mathrm{Tr}\left(Y^T \left[ D^{-1/2} L D^{-1/2} \right] Y\right) \quad \text{s.t.} \quad Y^T Y = I    (2.8)

This expression is in the standard form of a trace minimisation problem. Therefore, it can be demonstrated that the function is minimised by the matrix Y which contains as columns the eigenvectors corresponding to the smallest eigenvalues of D^{-1/2} L D^{-1/2}.

However, these values do not represent exactly a cut of the graph. Due to the relaxation of the condition of discreteness of the values of H to reduce the complexity of the problem and make it computationally affordable, instead of having an indicator vector for each connected component, we have a vector in R^k for each data point (the rows of matrix Y). Thus, we have a projection of each data instance in R^k based on its similarity to the other instances.


Some technique (such as the aforementioned k-Means) has to be used to find a discrete segmentation of this space. Once this segmentation has been performed we can backtrace each projected data point to the original one, obtaining the final outcome of the clustering algorithm.

Although this fact did not appear in the original paper, some experimental studies (such as [Jin et al., 2006] or [Ares et al., 2012, 2011; Ares and Barreiro, 2012], which are part of this thesis) have shown that using more than k eigenvectors in the projection step can improve (sometimes dramatically) the quality of the final clustering (which is still done into k clusters). Consequently, in this thesis we have considered that number of eigenvectors as a parameter of the clustering method, which we have called d. Y is thus a matrix in R^{n×d}, and the data instances are projected into vectors in R^d.

Algorithm 4: NORMALISED CUT (NC)
input : X, the data to cluster; k, the number of clusters; d, the number
        of eigenvectors used to project data points
output: Ω = {ω1, ω2, .., ωk}, a partition of the data

1 W ← CalculateWeights(X)
2 Y ← MinimiseRelaxedNCut(W, d)   /* Eq. 2.8 */
3 for i ← 1 to n do pi ← ith row of Y
4 Ω′ ← Cluster(p1, p2, .., pn, k)
5 foreach ω′i ∈ Ω′ do foreach pj ∈ ω′i do Assign(xj, ωi)

Algorithm 4 shows the pseudocode of the Normalised Cut algorithm with this last change. By itself, NC is not dependent on any initial conditions, as was the case, for example, of k-Means, which was dependent on the initialisation of the clusters. However, in practice this will be determined by the concrete approach used to perform the segmentation of the projected data points (line 4); for instance, if the aforementioned k-Means is used to cluster these points, the outcome of the whole algorithm will be conditioned by the choice of the seeds of that clustering.
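Putting the previous steps together, a compact sketch (ours) of Algorithm 4 could be written as follows, with the relaxed NCut minimised through the eigenvectors of D^{-1/2} L D^{-1/2} and the projected points segmented with k-Means; it assumes every vertex of the graph has at least one edge:

import numpy as np
from sklearn.cluster import KMeans

def normalised_cut(W, k, d):
    """NC sketch: project the points with d eigenvectors, then cluster into k groups."""
    deg = W.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.diag(deg) - W                                    # graph Laplacian
    L_sym = D_isqrt @ L @ D_isqrt                           # D^{-1/2} L D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L_sym)                # eigenvalues in ascending order
    Y = eigvecs[:, :d]                                      # d smallest eigenvectors (Eq. 2.8)
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)   # segment the projected points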

Constrained Normalised Cut

Ji and Xu proposed in [2006] a Constrained Normalised Cut (CNC) algorithm which introduces non-absolute positive constraints in Normalised Cut. In order to do so, they alter the function minimised in the algorithm to obtain a new one which expresses a twofold objective: not only should the cut of the graph which minimises it convey a good grouping (low similarity between data points in different clusters, high similarity between data points in the same cluster), but this partition should also respect the constraints supplied by the user.

To achieve this, they introduce a new matrix U to encode the constraints. This matrix has n columns (one for each data point in the collection to be clustered) and a row for each constraint. In it, a constraint which states that data points i and j should be in the same cluster will be encoded as a row of zeroes with the exception of positions i and j, which will be set to 1 and −1, respectively.


If the membership to the connected components is encoded in a matrix H as in Equation 2.6, the product of U and H can be used to check the observance of each constraint in each connected component. Specifically, the product of a row of U which encodes a constraint between data points i and j with a column of H will be zero if and only if none or both of those data instances have been categorised in the connected component encoded by that column.

Thus, in order to measure the global observance of the constraints supplied to the algorithm, Ji and Xu propose using the Frobenius norm6 of the product of U and H. It will be smaller as more constraints are respected, with a minimum of zero when all of them are honoured. Using this result, they add a penalty to the function to be minimised in the central step of the clustering (Equation 2.9).

\min_{A_1, \dots, A_k}\left( NCut(A_1, \dots, A_k) + \lVert \beta U H \rVert_F^2 \right)    (2.9)

In that expression, a new parameter (β) is introduced to control the degree of enforcement of the constraints, with higher values of β meaning tighter enforcement. Again, the condition of discreteness of the values of H (that is, that they encode the membership to the connected components as in Equation 2.6) makes this optimisation problem NP-hard. If this condition is dropped and a derivation similar to the one in the previous section is followed [Ji and Xu, 2006], we obtain the expression in Equation 2.10, which is likewise subject to Y^T Y = I.

\min_{Y \in \mathbb{R}^{n \times k}} \mathrm{Tr}\left(Y^T \left[ D^{-1/2} (L + \beta U^T U) D^{-1/2} \right] Y\right)    (2.10)

As this problem is in the standard form of a trace minimisation problem, the same theoretic result used in the unconstrained case can be used here. Thus, this equation is minimised by a matrix Y which contains as columns the eigenvectors which correspond to the smallest eigenvalues of the matrix D^{-1/2}(L + \beta U^T U)D^{-1/2}. Again, these columns are not proper indicator vectors, so a segmentation of the projected data points has to be performed in order to produce the final clustering of the data. As was the case in Normalised Cut, it has been shown ([Ares et al., 2012, 2011; Ares and Barreiro, 2012], which are part of this thesis) that using more than k eigenvectors to project the data points (the amount used in the original paper by Ji and Xu) can improve the results of the clustering. Therefore, in Constrained Normalised Cut we have again considered that number as a parameter of the algorithm (d).

Algorithm 5 shows the pseudocode of the Constrained Normalised Cut algorithm. It can be seen that CNC is very similar to its non-constrained counterpart, the only differences being the introduction of the matrix U to encode the constraints and the different function to be minimised (lines 2 and 3). As in Normalised Cut, the outcome of the clustering process is not by itself dependent on any initial conditions, but it could end up being so depending on the algorithm used to segment the projected points (line 5).

6 Frobenius norm of a matrix A ∈ R^{m×n}: \lVert A \rVert_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2} = \sqrt{\mathrm{Tr}(A^T A)}


Algorithm 5: CONSTRAINED NORMALISED CUT (CNC)
input : X, the data to cluster; k, the number of clusters; ML, the
        positive constraints to be taken into account; β, the strength of
        these constraints; d, the number of eigenvectors used to project
        data points
output: Ω = {ω1, ω2, .., ωk}, a partition of the data

1 W ← CalculateWeights(X)
2 U ← CalculateMatrixU(ML)
3 Y ← MinimiseRelaxedConstrainedNCut(W, U, β, d)   /* Eq. 2.10 */
4 for i ← 1 to n do pi ← ith row of Y
5 Ω′ ← Cluster(p1, p2, .., pn, k)
6 foreach ω′i ∈ Ω′ do foreach pj ∈ ω′i do Assign(xj, ωi)
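The two ingredients that CNC adds to the unconstrained pipeline, the constraint matrix U and the modified matrix whose smallest eigenvectors are computed in line 3 (Equation 2.10), can be sketched as follows (ours; it again assumes every vertex has at least one edge):

import numpy as np

def constraint_matrix(ml, n):
    """One row per Must-link, with +1 and -1 in the positions of the two points."""
    U = np.zeros((len(ml), n))
    for r, (i, j) in enumerate(ml):
        U[r, i], U[r, j] = 1.0, -1.0
    return U

def cnc_matrix(W, ml, beta):
    """D^{-1/2}(L + beta U^T U)D^{-1/2}, whose smallest eigenvectors give Y."""
    deg = W.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.diag(deg) - W
    U = constraint_matrix(ml, W.shape[0])
    return D_isqrt @ (L + beta * U.T @ U) @ D_isqrt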

2.3.5 Constrained Complete Link

Constrained Complete Link (CCL) was introduced by Klein et al. in [2002]. In this case the algorithm is a Distance-based one able to accommodate positive and negative constraints, inserted in the skeleton of Complete Link (CL) [Jain and Dubes, 1988], a hierarchical agglomerative clustering algorithm.

Complete Link

Hierarchical agglomerative clustering algorithms start with each data point in its own individual cluster. Afterwards, and until there is only one cluster left, the algorithms proceed by merging in each iteration the two closest (most similar) clusters, creating a hierarchy of clusters. In the case of Complete Link the distance between two clusters is defined as the maximum distance between their data points (Equation 2.11). Once the process has concluded, the resulting hierarchy can be processed as needed. For instance, a set of k clusters can be obtained by cutting the dendrogram defined by the hierarchy at the appropriate level.

dist(\omega, \omega') = \max\{\, dist(x, x') \mid x \in \omega,\ x' \in \omega' \,\}    (2.11)

Obviously, recalculating in each iteration the distances between clusters would be very inefficient, as in each cluster merge only a small number of distances change. Therefore, the distances between clusters are kept in a data structure, D, which is updated in each iteration with the necessary changes. Thus, when clusters ωi and ωj are merged, the only update needed in D is setting the distances between this new cluster and any other cluster ω to the maximum of dist(ω, ωi) and dist(ω, ωj), that is, the maximum distance between cluster ω and each of the merged clusters. The pseudocode of a Complete Link algorithm which uses this optimisation is shown in Algorithm 6. The outcome of this algorithm does not depend on any initial condition.
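A minimal sketch (ours) of this incremental update, with D kept as a symmetric distance matrix and `active` the set of clusters still alive, is:

import numpy as np

def merge_clusters_cl(D, active, i, j):
    """Merge cluster j into cluster i and update the complete-link distances."""
    for m in active:
        if m not in (i, j):
            D[i, m] = D[m, i] = max(D[i, m], D[j, m])   # max of the two old distances
    active.remove(j)                                     # cluster j no longer exists
    return D, active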


Algorithm 6: COMPLETE LINK (CL)
input : X, the data to cluster
output: Steps, the steps followed by the clustering algorithm

1 for i ← 1 to n do ωi ← xi
2 D ← CalculateDistances(Ω)
3 while |Ω| ≠ 1 do
4     (ωi, ωj) ← GetClosestClusters(D)
5     Add((i, j), Steps)
6     MergeClustersCL(i, j, Ω, D)
7 end

Constrained Complete Link

Klein et al. proposed in [2002] a Constrained Clustering algorithm based on Complete Link which supports positive and negative constraints. In order to make the constraints affect the clustering, the authors alter the initialisation of D, the distances between clusters:

• If two data instances xi and xj are affected by a positive constraint, the distance between their clusters7 is set to 0, effectively bringing them close.

• If they are affected by a negative constraint, the distance between their clusters is set to the maximum possible distance value (which we will denote by ∞), separating them.

As Klein et al. note in their paper, these steps may break the metricity of D. This metricity is of capital importance, since the goal of the authors is using the constraints to induce space-level changes in the behaviour of the algorithm; that is, a Must-link between two data instances should make points close to these instances more likely to be in the same cluster and, conversely, a Cannot-link should make data instances close to the ones affected by the constraint less likely to be in the same cluster. This would be the case if the metricity were restored after applying the changes in D, effectively propagating the effect of the constraints. In the case of Must-links, their effect on metricity (possible violations of the triangle inequality) can be fixed by calculating the shortest path between each pair of data points x and x′ and setting D(x, x′) to that value. As the authors note in the paper, that path must be composed of points which are either x, x′ or any other point x′′ involved in a Must-link. Consequently, if we assume that the number of data points affected by ML is much smaller than the total number of data points, the cost of this search for the shortest path is reasonable8. As for Cannot-links, the authors state that repairing their effect is not computationally affordable, but that nevertheless their choice of hierarchical agglomerative clustering algorithm (Complete Link) implicitly restores some metricity each time a merge is performed.

7 At this stage of the algorithm each data instance is contained in its own cluster.
8 As noted in the paper, if C is the number of data points affected by Must-links, this search is O(N²C), whereas by itself Complete Link standardly runs in O(N² log N).
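The following sketch (ours) illustrates ImposeConstraints together with the Must-link propagation just described, restoring the triangle inequality by running shortest-path relaxations that use only the points involved in Must-links as intermediate hops; D is assumed to be a symmetric matrix of floats over the initial singleton clusters:

import numpy as np

def impose_constraints(D, ml, cl):
    for i, j in ml:
        D[i, j] = D[j, i] = 0.0                          # bring must-linked points together
    hops = sorted({p for pair in ml for p in pair})
    for h in hops:                                       # shortest paths through ML points only
        D = np.minimum(D, D[:, [h]] + D[[h], :])
    for i, j in cl:
        D[i, j] = D[j, i] = np.inf                       # push cannot-linked points apart
    return D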


Algorithm 7 shows the pseudocode of the Constrained Complete Link algorithm, which, as in the case of CL, is not dependent on any initial condition. The restoration of metricity introduced in the previous paragraph is performed in line 10, after Must-links have been imposed and before doing so with Cannot-links.

Algorithm 7: CONSTRAINED COMPLETE LINK (CCL)
input : X, the data to cluster; k, the number of clusters; ML and CL,
        the positive and negative constraints to be taken into account
output: Steps, the steps followed by the clustering algorithm

1 for i ← 1 to n do ωi ← xi
2 D ← CalculateDistances(Ω)
3 ImposeConstraints(D, ML, CL)
4 while |Ω| ≠ 1 do
5     (ωi, ωj) ← GetClosestClusters(D)
6     Add((i, j), Steps)
7     MergeClustersCL(i, j, Ω, D)
8 end

function ImposeConstraints(D, ML, CL)
    input : D, a data structure containing the distances between
            clusters; ML and CL, the positive and negative constraints
            to be taken into account
9   foreach (x, x′) ∈ ML do D(x, x′) ← 0
10  PropagateMustLinks(D, ML)
11  foreach (x, x′) ∈ CL do D(x, x′) ← ∞
end

2.3.6 Spectral Clustering with Imposed Constraints

Following the same principle used in Constrained Complete Link, Kamvar et al. (the same authors of CCL, albeit in a different order) propose in [2003] an approach to apply positive and negative constraints to a spectral clustering algorithm.

Although the interpretation of spectral clustering followed by the authors is a random walk rather than the graph cut used in (Constrained) Normalised Cut, the underlying process is mostly the same: from the data we obtain a matrix (in this case A, an affinity matrix) in which the value a_{ij} is related to the similarity between data points i and j, and, after performing certain operations on A, the eigenvectors of the resulting matrix (N) are used to project data points into vectors in R^k, which are clustered using a suitable clustering method. The operations performed on A to obtain N (an additive normalisation) are shown in Equation 2.12, where D is, as in Normalised Cut, a diagonal matrix with d_{ii} = \sum_{j=1}^{n} a_{ij}, and d_{max} is the maximum of that matrix.

N = \frac{1}{d_{max}} \left( A + d_{max} I - D \right)    (2.12)


As in Constrained Complete Link, constraints are imposed on the algorithm by changing certain values of the matrix which quantifies the relation between data points, making this Constrained Clustering algorithm a Distance-based one. In the present case, a Must-link between data instances i and j alters the affinity values a_{ij} and a_{ji}, setting them to 1. Conversely, a Cannot-link between these points would set a_{ij} and a_{ji} to 0. Unlike in CCL, in the present algorithm no effort is made to propagate positive and negative constraints.
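A small sketch (ours) of this preprocessing, writing the constraints into the affinity matrix and applying the additive normalisation of Equation 2.12, is shown below:

import numpy as np

def scic_matrix(A, ml, cl):
    """Impose the constraints on the affinity matrix A and compute N (Eq. 2.12)."""
    A = A.copy()
    for i, j in ml:
        A[i, j] = A[j, i] = 1.0                          # Must-link: maximum affinity
    for i, j in cl:
        A[i, j] = A[j, i] = 0.0                          # Cannot-link: no affinity
    deg = A.sum(axis=1)
    d_max = deg.max()
    return (A + d_max * np.eye(len(A)) - np.diag(deg)) / d_max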

Algorithm 8: SPECTRAL CLUSTERING WITH IMPOSED CONSTRAINTS (SCIC)
input : X, the data to cluster; k, the number of clusters; ML and CL,
        the positive and negative constraints to be taken into account
output: Ω = {ω1, ω2, .., ωk}, a partition of the data

1 A ← CalculateAffinityMatrix(X)
2 ImposeConstraints(A, ML, CL)
3 N ← CalculateN(A)   /* Equation 2.12 */
4 Y ← ObtainProjections(N, d)
5 for i ← 1 to n do pi ← ith row of Y
6 Ω′ ← Cluster(p1, p2, .., pn, k)
7 foreach ω′i ∈ Ω′ do foreach pj ∈ ω′i do Assign(xj, ωi)

function ImposeConstraints(A, ML, CL)
    input : A, the affinity matrix; ML and CL, the positive and
            negative constraints to be taken into account
8   foreach (xi, xj) ∈ ML do a(i, j) ← a(j, i) ← 1
9   foreach (xi, xj) ∈ CL do a(i, j) ← a(j, i) ← 0
end

The pseudocode of the algorithm is shown in Algorithm 8. As happened with the algorithms discussed in Section 2.3.4, this algorithm is not by itself dependent on any initial condition, but can end up being so depending on the approach used to cluster the projected data points (line 6).

2.4 Advantages and Applications

Constrained Clustering provides a convenient way to integrate into a clustering process information which in a regular process would go unused. This convenience is mainly due to two reasons.

Firstly, Constrained Clustering offers an easy and unified way to provide the clustering algorithms with different kinds of clues about the appropriate or wanted grouping of the data, regardless of the nature of the clues or the domain of the data. For instance, when clustering text documents such as blog posts the data instances are usually represented as real-valued vectors obtained from the text of the documents, and the similarity between instances is calculated with a metric such as the cosine distance.


However, there are other possible secondary sources of evidence which can help clustering, such as for instance the date on which the document was posted, since it can be argued that posts from around the same time are more likely to be related than those which are farther apart. When using a regular clustering algorithm, wanting to add this notion would mean having to tinker with the representation or the distance calculation in ad-hoc ways. This would have the problems of introducing possible undesired side effects and also of not being easily adaptable to other kinds of side information. Instead, if we used a Constrained Clustering algorithm, the intuition about the dates could be easily encoded using Must-link and Cannot-link constraints which would affect the process in a principled way and in whose creation some other information could be taken into account.

Secondly, the constraints do not necessarily have to come in large numbers or have an especially broad coverage in order to be used effectively. This enables us to make the most of specific domain information which, even though it may not affect many data instances, could prove to be useful. For instance, again in the blog posts domain, linkbacks between posts provide compelling clues to create positive constraints, which could be used by the Constrained Clustering algorithms even if in the collection in question linkbacks are limited in number or confined to posts by some authors or dealing with a certain subject. Another scenario in which these properties are useful is when we have access to domain experts or regular users who can be queried about whether pairs of entities should be put in the same cluster or not, as we do not have to hassle them with an endless list of questions in order to have a working set of constraints able to influence the final clustering. Again, this contrasts with the extensive set of examples which we would have to elicit if we used a classification approach9.

In the years since the emergence of this research topic, Constrained Clustering has been shown to be effective in a lot of real-life data domains. In order to give an idea of the breadth of the fields in which Constrained Clustering has been successfully tested, what follows is a non-exhaustive list of several of those domains, expanding on examples of applications given by Davidson and Basu in their survey [2007].

• Textual data: near-duplicate finding in comments to regulations [Yang and Callan, 2006]; clustering of newsgroup posts [Basu et al., 2004b], news stories [Ares and Barreiro, 2012] and web pages [Ares et al., 2011]

• Images: segmentation of images [Wang and Davidson, 2010], clustering of faces [Bar-Hillel et al., 2005], clustering of image-based letter recognition data [Bilenko et al., 2004]

• Video: people tracking [Yan et al., 2006]

• Spatial information: GPS lane finding [Wagstaff et al., 2001], clustering of Sony Aibo's distance measurements [Davidson and Ravi, 2005a]

• Natural data: clustering of radar data, flower measurements, wine data [Bilenko et al., 2004]

9 Even so, it should be noted that obtaining constraints is still a problem far from being solved, as we will see in Section 2.5.


Chapter 4 shows our work in one of those applications, namely using a constraint-based Constrained Clustering algorithm to tackle the avoiding bias problem [Ares et al., 2009] and improving the quality of the alternative clustering using spectral techniques [Ares et al., 2010].

2.5 Problems and Opportunities

Constrained Clustering is a very young and fruitful topic which, as we have seen, has yielded good results in several domains. However, it is obviously not devoid of problems, which offer important and interesting research opportunities. In this section we note four of these problems: the extraction of the constraints, the robustness of the algorithms, the feasibility of the constraints and the utility of the constraints. In the work summarised in this thesis we have dealt with the first two of these problems: constraint extraction and algorithm robustness.

2.5.1 Constraint Extraction

Up to this date, the research on Constrained Clustering has been mostly focused on developing new algorithms. As the aim of this work has been the creation of methods to make the most of the information provided by the constraints, the experiments reported in the papers start from sets of constraints already built, in almost all cases created using the golden truth10. This overlooks a quite important problem in real-world settings, which is how the constraints used to fuel the Constrained Clustering algorithms are obtained. Constraint extraction is a complex problem, to which manual and automatic approaches can be applied.

In the manual methods the constraints are obtained by asking a human expert whether pairs of documents should be put in the same cluster. Given the time and effort limitations (the users cannot and will not invest an unlimited amount of time in answering our questions), the main challenge of these methods is extracting the most useful information from the user; that is, the constraints which are more likely to help steer the process to a good outcome. This problem is tackled applying Active Learning schemes, such as the one proposed by Basu et al. [2004a].

In the case of automatic approaches the task is even more complicated. To give an idea of this difficulty, it should be noted that, even though we were limited by the "patience" of the user, in the manual methods it was ultimately the user who established the relationship between the data points. In the case of automatic methods this must be done by the method on its own, a process (detecting which data points are related and which ones are not related) which is in essence the same one carried out by clustering algorithms themselves.

10 For instance, taking a random pair of data points, looking up their labels in the golden truth and creating a Must-link between them if their labels are the same or a Cannot-link if they are different.


To create the constraints, the automatic constraint extraction methods may resort to external information sources or utilise internal information already contained in the data to cluster. In the former case the aim of the method is broadening what is known about each data instance, whereas in the latter the focus is on using clues which may be ignored due to the choice of the data representation or the distance measure.

In Chapter 6 of this thesis we propose two methods to automatically obtain constraints, one for clustering web pages, which uses external information [Ares et al., 2011], and another for any type of textual document, which uses internal information [Ares and Barreiro, 2012].

2.5.2 Algorithm Robustness

Another consequence of using constraints extracted from the data labels is that all the constraints used when testing the Constrained Clustering algorithms are true. That is to say, if a positive constraint asserts that two data points should be in the same cluster, they really are so in the golden truth against which the outcome of the algorithm is compared, and the same goes for the negative constraints (i.e. the data points affected by them really are in different clusters in the golden truth). Hence, the results of the experiments of these papers show the behaviour of the algorithm under ideal conditions. These ideal conditions are unfortunately very unlikely in most real scenarios, where the constraints must be extracted.

On the one hand, automatic constraint extraction methods usually work by generalising more or less explicit notions about the domain of the data to cluster (for instance, that two text documents which share the same author, source, etc. must be in the same cluster). Clearly, these generalisations may not always be valid, and hence these automatic methods will in most cases yield a non-negligible amount of inaccurate constraints, whether their source is internal or external information.

On the other hand, when the constraints are the product of user input, the information provided may contain misjudgements. Clustering is mostly an exploratory tool, and hence it is used in situations where the configuration and structure of the data is not well known. Consequently, whether two data instances must or must not be in the same cluster can in some cases be far from clear. This situation can be aggravated if more than one user is taking part in the constraint creation process, since there might be non-trivial differences in their criteria about the configuration of the data. Consequently, the robustness of the algorithms to noisy sets of constraints (i.e. those containing erroneous constraints) is bound to play an important role in their final effectiveness.

Chapter 5 of this thesis is devoted to an analysis of the robustness to noisy sets of constraints of some of the Constrained Clustering algorithms introduced in Section 2.3.

2.5.3 Constraint Feasibility

In a clustering problem the geometry of the space of solutions of the clustering is configured by the overall objective of the clustering, which in regular (non-constrained) clustering scenarios is in turn determined by the similarities between documents.


When we specify constraints to the clustering we are altering that space of solutions, pushing the Constrained Clustering algorithm to look for those solutions which respect the constraints11. This, as indicated by Davidson and Ravi [2006], raises the question of whether such a solution exists, that is, of whether the constraint set is feasible.

The feasibility problem is related to the correctness of the constraints: if the set of constraints is not feasible then it must contain some constraint which is not correct12. This infeasibility is not limited to trivially incorrect and unfeasible examples where the constraint set is inconsistent (that is, when for a given couple of data points x and x′ we have both ML(x, x′) and CL(x, x′)), but may affect perfectly consistent sets of constraints. For instance, for k = 2 it is impossible to find a clustering of the dataset X = {x1, x2, x3} which satisfies the constraints CL(x1, x2), CL(x2, x3), CL(x3, x1).

In [2005a] Davidson and Ravi prove that, given a range of possible k, testing the feasibility of a set of positive constraints is a P problem, whereas the existence of negative constraints increases the complexity, turning the problem into an NP one. These results have clear implications for those algorithms which try to generate in each iteration solutions which respect all the constraints. They are heavily burdened by the existence of Cannot-links, with the difficulty of finding a feasible solution causing, for instance in Constrained k-Means, the possible stagnation of the algorithm which we discussed in Section 2.3.1, which can occur even with totally feasible sets of constraints. In [Davidson and Ravi, 2006] the same authors claim that the effects of this complexity are also apparent in other kinds of algorithms. Concretely, they show how for a constraint-based partitional method where constraints do not have to be unconditionally observed [Basu et al., 2004a; Bilenko et al., 2004] (see Section 2.3.2), a distance-based partitional method [Bilenko et al., 2004], and a constraint-based agglomerative hierarchical method [Davidson and Ravi, 2005b] there are noticeable differences in the average running time and final quality of the partition when using "easy" and "difficult" sets of feasible constraints (sets for which CKM finds and does not find a feasible solution, respectively).

2.5.4 Constraint Utility

As was previously indicated, the goal of Constrained Clustering is letting users guide the clustering process by enabling them to supply the algorithm with domain information. The experimental results have consistently shown over datasets from diverse domains that the addition of these constraints fulfils its ultimate purpose, which is obtaining a better partition of the data. Usually, in these works the experiments are performed using several sets of constraints (which, as we have previously introduced, are in almost all cases created from the golden truth against which the results are tested).

11 The most extreme case is when constraints are absolute, as with them we are effectively reducing the space of possible solutions.
12 Note that the opposite ("if some constraint is not correct then the constraint set is unfeasible") does not generally hold.


The results over all constraint sets are averaged, resulting in a final single value which is examined to establish the good or bad behaviour of the algorithm.

In [2006] Davidson et al. argue that looking only at the average results masks interesting outcomes for individual constraint sets. To illustrate this claim the authors test the behaviour of four different Constrained Clustering algorithms with randomly generated constraint sets over four datasets. These experiments show that, although on average their addition is beneficial, the constraints actually hurt the performance of the algorithms in a non-negligible percentage of the constraint sets tested, yielding results below the ones obtained without using any constraint.

As the authors note in the paper, this result disproves the assumption that the constraints are always helpful to the clustering, showing the need for ways to characterise the possible utility of a set of constraints. In order to do so, Wagstaff et al. propose two measures, informativeness and coherence. Informativeness tries to measure the amount of information carried by the constraints which the algorithm is not able to determine on its own. Constraint sets with low informativeness would not be of much use to the algorithm, which already detects the relationships between the data instances affected. As for coherence, it measures the agreement between the constraints with respect to a distance measure, namely that points in the vicinity of those linked by positive constraints are not cannot-linked and vice versa. Constraint sets with low coherence would confuse the algorithm, especially in the case of distance-based ones. Thus, as stated by the authors, constraint sets with high informativeness and coherence are expected to yield gains in the quality of the partitions, an aspect which they largely confirm in their experiments, which also show that drops in the quality of the results are associated with incoherent constraint sets.

2.6 Summary

In this chapter we have made a brief introduction to Constrained Clustering, the topic of this thesis. First, we have introduced the basic concepts of Clustering (Section 2.1) and Constrained Clustering (Section 2.2). Afterwards, we have made a survey (Section 2.3) of the most important and influential Constrained Clustering algorithms, with a special focus on those used in or related to the work summarised in this thesis. Finally, we have examined both some of the advantages and applications of Constrained Clustering (Section 2.4) and the main problems and opportunities of the topic (Section 2.5), some of which have been addressed in this thesis.


Chapter 3

Methodology and Experimental Settings

As we have previously indicated, the work summarised in this thesis has an eminently practical nature, looking into applications and problems which have previously been mostly overlooked in the existing literature about Constrained Clustering. Consequently, this work has implied conducting a wide array of experiments to test intuitions and insights in practice, in order to gain actual knowledge of their adequacy and applicability. This chapter gathers and examines some aspects common to these experiments1, so as to avoid unnecessary repetition along the thesis and also to have a centralised point of reference to turn to when discussing the experiments in the next chapters.

3.1 Datasets

An adequate choice of the data collections over which the experiments are going to be performed is of capital importance to ensure the validity of the conclusions drawn from them. Not only should the greatest similarity with the data that the clustering algorithms would be dealing with in a real-world situation be sought, but standard datasets must also always be used, insofar as possible. This is mainly due to two reasons. Firstly, using standard datasets greatly facilitates reproducing the experiments, as access to them is more or less simple for the whole research community. Secondly, using these data collections has a lesser risk of introducing biases in the evaluation, since the creation of these datasets is independent of the possible approaches used to solve the problem being considered. Moreover, standard datasets are used in a large number of experiments, and hence any vice or problem in them is usually quickly detected and widely reported.

Following this spirit, in the work summarised in this thesis we have used standard and widely available datasets, putting special care also on either using standard splits of them or otherwise making enough information available to enable a quick reconstruction of the data used.

1 It should be noted that most of the concepts discussed in this chapter are not only applicable to Constrained Clustering, but also to Clustering in general.


The specific details of the datasets and splits used in each experiment will be discussed in the corresponding chapter of this thesis.

3.2 Data Representation and Distance Measures

As we have seen in Chapter 2, the basic operation of clustering algorithms is the comparison between data points. These data points are actually representations of the entities to be clustered, mathematical abstractions which simplify and present them in a suitable way. Typically, the data points are vectors of n components, which are called features or attributes [Jain et al., 1999]. For instance, take the well-known Iris dataset [Fisher, 1936], representing 150 iris flowers which belong to three different species. Since a computer program would not be able to compare them as such, a representation of the flowers is built in more adequate terms. In the case of this dataset, each flower is represented by four numbers, product of different measurements over its petals and sepals. Hence, each data point from this dataset is a vector in R^4, and these vectors may be easily handled and compared. It is important to remark that whatever information is left outside the representation (for example, in the case of this dataset, the colour of the flower) will not be available to the algorithm2.

In this section we will focus on how the representations of textual documents are built, since that is arguably the most important data type for IR and consequently it is where most of the work summarised in this thesis was performed.

3.2.1 Textual Data Representation

Textual datasets are composed of textual entities called documents3. Each of these documents is a string of characters, which after a process called tokenisation is divided into tokens, defined as "an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing" [Manning et al., 2008]. The equivalence classes in these tokens (i.e. sets of tokens which we consider to be the same even though the characters that compose them may not be the same) are called terms. Tokens and terms can be roughly thought of as the words composing the documents, although there are also representations where they are defined below the level of words (for instance, using character n-grams).

The most popular text representation schemes consider that terms are the features of the documents, following the vector space model [Salton et al., 1975].

2 As we discussed in Section 2.4, Constrained Clustering can be used to conveniently integrate some of that "lost" information.
3 It should be noted that a "physical" document may contain one or several "logical" documents, depending on the granularity level of the clustering problem being tackled. For instance, this thesis can be considered as a single entity or be divided, each document comprising a chapter, a section, a sentence, etc.


Consequently, given a collection of N documents D = {d1, d2, d3, ..., dN} to be clustered which contains as a whole M different terms, each document di will be represented by a vector [wi1, wi2, wi3, ..., wiM] in R^M, where each component wij indicates the weight (i.e. importance) of the jth term of the collection in that ith document.

There exist many term weighting schemes. We now survey three of the more influential and popular ones.

• Term Frequency (TF). TF is based on the notion that the number of times that a term appears in a document is a good indicator of the importance of the term in that document. Hence, if we denote by tf_{ij} the number of times that term j appears in document i, the TF weight of that term and document would simply be

w_{ij} = tf_{ij}    (3.1)

An alternative formulation divides the term counts by the total length of the document (denoted by |d| = \sum_{j=1}^{M} tf_{ij}) in order to avoid the effect of the differences in size between documents on their similarity:

w_{ij} = \frac{tf_{ij}}{|d|}    (3.2)

• Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is based on the observation that the importance of a term to characterise the contents of a document is not only related to the number of times that it appears in the document, but also to the number of documents in which it appears [Jones, 1972]. A term which appears very rarely in the documents should be given more weight, as it is very specific to the ones which contain it and hence describes them better than a term appearing in more documents. The scarcity of a term j is captured by its inverse document frequency (idf):

idf_j = \log \frac{N}{df_j}    (3.3)

where df_j is the document frequency of the term j, the number of documents where it appears. Using this value, the weight of term j in document i with the TF-IDF weighting scheme would be

w_{ij} = tf_{ij} \cdot \log \frac{N}{df_j}    (3.4)

Hence, in order to have a high weight when using this weighting scheme a term should have a high tf value (i.e. appear frequently in the document) and a high idf value (i.e. appear in few documents).

• Mutual Information (MI). The Mutual Information weighting scheme [Pantel and Lin, 2002] is based on statistical theory. The mutual information between two events x and y is defined as

I_{xy} = \log_2 \frac{p(x, y)}{p(x)\, p(y)}    (3.5)


where p(x, y) is the joint probability of the two events and p(x) and p(y) are the probabilities of each one on its own. Roughly speaking, we are quantifying how much we can learn about one of the events from the other by comparing the probability of observing the two events together with the theoretic probability of observing them if they were independent. The less related the two events are, the closer that quotient will be to 1 and hence the closer their mutual information will be to 0. In the extreme case of them being completely independent events, p(x, y) will be equal to p(x)p(y), and therefore their mutual information will be 0, showing that nothing can be known about one of them from having the other. Conversely, the more related the two events are, the larger p(x, y) will be in relation to the product of the probabilities and the larger their mutual information will be.

In our case, in order to quantify the relation between a term and a document, we want to compare the probability of a specific term appearing in a specific document with the probabilities of a random term from the collection being that specific term and being from that specific document. Thus, the MI weight of term j in document i would be calculated as

w_{ij} = \log_2\left( \frac{tf_{ij}/S}{\left(\sum_{i'=1}^{N} tf_{i'j}/S\right)\left(\sum_{j'=1}^{M} tf_{ij'}/S\right)} + 1 \right)    (3.6)

where S = \sum_{i'=1}^{N}\sum_{j'=1}^{M} tf_{i'j'} is the sum of the number of times that each term appears in the documents of the collection. Moreover, a 1 is added to prevent the argument of the logarithm from being 0 (if a term does not appear in the document).

As in the case of TF-IDF, this weighting scheme combines information local to the document and global information from the whole collection, albeit in this case the combination has a clearer statistical interpretation.

In the experiments summarised in this thesis we have used the Mutual Information weighting scheme to represent the textual data, as it has been shown by Pantel and Lin [2002] to outperform TF and TF-IDF.
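For reference, a direct implementation (ours) of Equation 3.6, starting from a documents-by-terms matrix of raw counts, is:

import numpy as np

def mi_weights(tf):
    """Mutual Information weights (Eq. 3.6) for a documents x terms count matrix."""
    S = tf.sum()                                 # total number of term occurrences
    p_dt = tf / S                                # P(term j, document i)
    p_term = tf.sum(axis=0, keepdims=True) / S   # P(term j)
    p_doc = tf.sum(axis=1, keepdims=True) / S    # P(document i)
    return np.log2(p_dt / (p_doc * p_term) + 1.0)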

3.2.2 Distance Measures

Once the representations of the entities have been built we also need a suitable measure to compare them. If this measure yields larger values when the data points are more different it is called a distance measure, whereas if it yields smaller values it is called a similarity measure. One of these measure types can be easily converted into the other, for example by subtracting the actual value from the maximum possible one.

This is a very important choice, as an adequate selection of this measure can make a big difference in the final quality of the partition. As we have previously indicated, the selection of the measure was traditionally one of the few places where users could influence the outcome of the clustering process, maybe using some of their knowledge of the data domain to inform



the choice. As we have also seen in Chapter 2, a family of Constrained Clustering algorithms, the Distance-based ones, use the information contained in the constraints to adjust the distances.

Next we will briefly survey two of the most used distance measures for vectors of continuous values, which are the representations used in the work summarised in this thesis; a small sketch of both measures follows the list.

• Euclidean distance. The Euclidean distance between two data points is the Euclidean norm of the difference between their representations. It is the distance traditionally used for low-dimensional continuous data representations.

$$ \mathrm{dist}_{euc}(d_i, d_j) = \sqrt{\sum_{k=1}^{n} (w_{ik} - w_{jk})^2} \qquad (3.7) $$

• Cosine distance. The cosine distance between two data points measures their divergence using the cosine of the angle between their representations. This distance is widely used in Data Mining and Information Retrieval, as it is especially suited for high-dimensional data, such as text documents.

$$ \mathrm{dist}_{cos}(d_i, d_j) = 1 - \cos(d_i, d_j) = 1 - \frac{\sum_{k=1}^{n} w_{ik} \cdot w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2}\,\sqrt{\sum_{k=1}^{n} w_{jk}^2}} \qquad (3.8) $$
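As a small illustration of the two measures, the following sketch implements Equations 3.7 and 3.8 for dense weight vectors given as plain Python lists; the names are hypothetical and the degenerate all-zero-vector case is handled with an arbitrary convention.

import math

def euclidean_distance(a, b):
    # Equation 3.7
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # Equation 3.8: one minus the cosine of the angle between the vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention for all-zero vectors
    return 1.0 - dot / (norm_a * norm_b)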

3.3 Parameter Tuning

The operation of most clustering algorithms is controlled by some parameters which should be adequately set when using them in the experiments. There are two main ways to tackle this problem:

• Crossvalidation. The tuning of the parameters is performed over a dataset (the training collection) and the resulting values are then used on a different dataset (the test collection), the results over the latter being the ones used to draw the conclusions of the experiment.

• Best values. The values of the parameters are tuned so as to obtain the best results in each dataset.

Whereas with Crossvalidation we avoid providing the algorithms with information which would not be available to them in a real-world scenario, with Best values we get a better idea of the best effort of the system. The choice between one or the other depends on the nature of the task being tackled. In any case, whatever the method chosen, special care should always be taken to do the parameter tuning in a way which is fair to all the methods compared.

A special kind of parameter is k, the number of clusters into which the data should be partitioned. Although it is indeed a parameter of the clustering



methods, in the experiments in the clustering literature the number of clusters is usually considered to be known beforehand, setting it to the “natural” number of clusters of the data collection (i.e. the number of clusters in the partition used as reference). This is the approach that we have followed in the experiments summarised in this thesis.

Another question related to parameter tuning is how to handle some other factors which, despite not being exactly parameters, can have a big influence on the behaviour of the clustering algorithms, such as the initial set of seeds of k-Means and related algorithms or the order in which data points are examined and assigned in some Constrained Clustering algorithms. In these cases the most usual strategy is to make several runs with different initialisations of these factors and either use the best result (similar to the Best values approach indicated above) or use the average of the results over the initialisations (which, along with other values such as their standard deviation or the statistical significance of the results, allows a study of their robustness). This is also the approach taken when testing Constrained Clustering algorithms using synthetic constraints: several constraint sets are created, tested and studied in order to get a clearer overview of the behaviour of the algorithms.

3.4 Cluster Evaluation

Arguably the most important decision when designing an experiment is choosing how to evaluate the results. Since clustering is often an intermediate step in a bigger task, an appealing alternative would be to evaluate the goodness of the clustering using the overall results of that task, performing it with different clustering algorithms while keeping all other possible factors the same and attributing any improvement to the effect of the algorithm. For instance, Liu and Croft [2004] have proposed methods which use clustering to try to improve the retrieval of textual documents. A way to evaluate clustering algorithms would be to use them in these methods and see with which of them the best retrieval results are attained. This is called indirect evaluation and, although it is especially suited to specific-purpose clustering algorithms, the results of large tasks usually depend on several factors, which may have unforeseeable connections with the clustering algorithms used, which may in turn distort the conclusions drawn from the experiment.

On the other hand, direct evaluation metrics aim to assess the quality of the partitions yielded by clustering algorithms by themselves. There are two kinds of direct evaluation metrics:

• Internal metrics. They measure the goodness of a partition according to criteria which depend exclusively on the final configuration of the clusters.

• External metrics. They compare the partition yielded by the algorithm with a partition of the data which is regarded as the correct one; hence, the more similar they are, the better the partition is considered to be and the better the value of these metrics.



Each kind of metric has its pros and cons. On the one hand, internal metrics could be regarded as more objective, as they use only mathematical criteria and do not prejudge the possible partitions of the data as “correct” or “incorrect”. On the other hand, external metrics have precisely the advantage of incorporating these notions, as in most cases partitions which are perfectly sound from a purely mathematical point of view would not make clear sense to a human user.

Next we survey the three external metrics used in the work summarised in this thesis. In the definition of the metrics we will denote with Ω the partition of the N documents into k clusters {ω1, ω2, ..., ωk} yielded by the clustering algorithm and with C the classification of those documents into j classes {c1, c2, ..., cj} which is used as the reference of “good” clustering (the golden truth).

3.4.1 Purity

Purity (abbreviated P) measures how well the partition yielded by the algorithm resembles the reference on average. To calculate it, each cluster is paired with the most similar class in the golden truth, i.e. the class with which it overlaps the most. The Purity value is the sum over all clusters of the number of data points shared by a cluster and its most similar class, divided by the total number of documents in the dataset. Hence, the more similar Ω and C are, the larger the Purity value is. A possible drawback of this metric is that having small clusters, or a reference grouping with classes of very unbalanced sizes, will spuriously increase the Purity value, since it is then always possible to find a reference class with high overlap with each cluster.

$$ P(\Omega, C) = \frac{1}{N} \sum_{\omega \in \Omega} \max_{c \in C} |\omega \cap c| \qquad (3.9) $$
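For illustration, a minimal sketch of Equation 3.9, assuming that both the partition and the reference are given as lists of sets of document identifiers (hypothetical names, not the evaluation code actually used):

def purity(omega, classes):
    # Each cluster contributes the size of its largest overlap with a reference class
    n = sum(len(cluster) for cluster in omega)
    return sum(max(len(cluster & c) for c in classes) for cluster in omega) / n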

3.4.2 Mutual Information

Like the Mutual Information weighting scheme discussed in Section 3.2.1, the Mutual Information (abbreviated MI) metric stems from statistical theory. As we have discussed, the Mutual Information between two events quantifies the relation between them. For this metric the Mutual Information between two discrete random variables X and Y is used:

$$ I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \qquad (3.10) $$

This value quantifies the relation between the two random variables, concretely how much information about the probability distribution of one of them can be learnt from that of the other.

In this case, if we consider two random variables stemming from Ω and C, whose events are respectively that a data point belongs to a given cluster or to a given class, the Mutual Information between them would quantify the relation between the clustering yielded by the algorithm and the reference,



i.e. their similarity. This is the basis of the Mutual Information metric for clustering evaluation:

$$ MI(\Omega, C) = \sum_{\omega \in \Omega} \sum_{c \in C} \frac{|\omega \cap c|}{N} \log \frac{N |\omega \cap c|}{|\omega|\,|c|} \qquad (3.11) $$

where again higher values mean greater similarity between Ω and C.
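A minimal sketch of Equation 3.11 under the same representation as above (lists of sets of document identifiers; hypothetical names):

import math

def mutual_information(omega, classes):
    n = sum(len(cluster) for cluster in omega)
    mi = 0.0
    for cluster in omega:
        for c in classes:
            overlap = len(cluster & c)
            if overlap > 0:  # empty intersections contribute nothing
                mi += (overlap / n) * math.log((n * overlap) / (len(cluster) * len(c)))
    return mi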

3.4.3 Rand Index and Adjusted Rand Index

Rand Index (abbreviated RI) measures the ratio of good pairwise decisions made by the algorithm. It is calculated by adding the number of pairs of data points which according to the reference should be in the same cluster and are so in the partition (the True Positives, TP) to the number of pairs which should be in different clusters and are so (the True Negatives, TN), and dividing the sum by the total number of pairwise decisions made by the algorithm (the N(N − 1)/2 possible pairs of data points).

$$ RI(\Omega, C) = \frac{TP + TN}{N(N-1)/2} \qquad (3.12) $$

Greater values of RI mean a better ratio of good decisions, and hence more similarity between Ω and C.

Rand Index has the disadvantage of not being corrected for chance; that is, even though it is bounded by 0 and 1, the agreement between two completely unrelated partitions does not take a constant value, due to the successful decisions attained by pure luck4. Hence, if we compare the Rand Indices yielded by two partitions of the same data we can know which one of them is better (the one with the larger RI), but we do not know how good they are in absolute terms, since we do not have a lower bound for the agreement with that concrete reference partition.

To solve this problem Hubert and Arabie [1985] proposed a new metric adjusted for chance, named Adjusted Rand Index (abbreviated ARI), based on RI. It is created using the general form of an index corrected for chance:

$$ \text{Adjusted Index} = \frac{\text{Index} - \text{Expected index}}{\text{Maximum index} - \text{Expected index}} \qquad (3.13) $$

where “Expected index” is the expected value that the index being corrected would yield when comparing two unrelated (random) partitions in each particular scenario. Using a generalised hypergeometric distribution as the model for randomness, Hubert and Arabie obtain the following formula for the Adjusted Rand Index:

$$ ARI(\Omega, C) = \frac{\sum_{\omega \in \Omega, c \in C} \binom{|\omega \cap c|}{2} - \left[\sum_{\omega \in \Omega} \binom{|\omega|}{2} \sum_{c \in C} \binom{|c|}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_{\omega \in \Omega} \binom{|\omega|}{2} + \sum_{c \in C} \binom{|c|}{2}\right] - \left[\sum_{\omega \in \Omega} \binom{|\omega|}{2} \sum_{c \in C} \binom{|c|}{2}\right] \Big/ \binom{n}{2}} \qquad (3.14) $$

4 In fact, as noted by Vinh et al. [2009], RI would only yield a value of 0 in the extreme case of comparing a partition with just one cluster with another composed of clusters containing single points.



Adjusted Rand Index is upper bounded by 1 and yields a value of 0 when comparing two random partitions, enabling us to make an accurate judgement of the quality of the results. This, along with it being one of the most well-known and widely used measures [Vinh et al., 2010], is the reason why we have chosen to use ARI in most of the experiments summarised in this thesis. Moreover, Albatineh et al. explored in [2006] how to correct for chance several indices based on pair counting, while Vinh et al. did so in [2009] and [2010] for measures based on information theory. These works show that many of the corrected-for-chance indices (including ARI) are in the end equivalent, or at least frequently show a very high correlation in their values, which reinforces the adequacy of using Adjusted Rand Index in our experiments.
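As an illustration, a minimal sketch of Equation 3.14, again assuming that the partition and the reference are lists of sets of document identifiers (hypothetical names; math.comb requires Python 3.8 or later):

from math import comb

def adjusted_rand_index(omega, classes):
    n = sum(len(cluster) for cluster in omega)
    sum_cells = sum(comb(len(cluster & c), 2) for cluster in omega for c in classes)
    sum_omega = sum(comb(len(cluster), 2) for cluster in omega)
    sum_classes = sum(comb(len(c), 2) for c in classes)
    expected = sum_omega * sum_classes / comb(n, 2)   # expected index under randomness
    max_index = 0.5 * (sum_omega + sum_classes)       # maximum index
    return (sum_cells - expected) / (max_index - expected)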

3.5 Statistical Significance

As we have seen in this chapter, a typical clustering experiment involves clustering one or more collections with some clustering algorithms, whose parameters have been tuned using a suitable technique, and assessing the goodness of the partitions yielded by them. The first intuition would be to compare these goodness values to draw conclusions about which approach is better. However, a simple comparison can be misleading. For instance, as we introduced in Section 3.3, one of the most usual ways to deal with factors such as the seeds of a k-Means-based algorithm is to perform several runs with different random initialisations of these factors. Being random, we do not have any guarantee of the “fairness” of these initialisations; that is, one of them can be especially good (or bad) for one or more algorithms. Thus, differences in the quality of the partitions, which we would attribute to some algorithms being better or more suited to the task at hand than others, might have been obtained by pure chance. This effect, although especially dangerous when we compare only the results of the best initialisations, is not limited to those cases; it can also affect us if we use the average over the initialisations, since the effect of a good or bad initialisation can be very strong and alter the mean value significantly.

In order to try to prevent these kinds of errors we have tested, when possible, the statistical significance of the results of the experiments performed, assessing the probability of having obtained any difference in performance by chance. In this technique, widely used in the experimental sciences, a hypothesis called H0 (the null hypothesis) is formulated, under which the alternative hypothesis H1, the one that we are trying to test, would be false. Associated with the null hypothesis is the null distribution, the statistical distribution that a value derived from the observations (called the test statistic) would follow if H0 were true. Given these, the statistical significance test calculates the probability under the null distribution of obtaining a test statistic at least as extreme as the one corresponding to the actual observations used in the test. If this probability (called the p-value of the test) is under a threshold (called α) set beforehand, the observations are considered to be incompatible with H0. In these cases the null hypothesis is said to be “rejected” and the alternative



hypothesis is “accepted” (i.e. the result conveyed by it is deemed to be statistically significant). Otherwise (p-value > α) we consider that the compatibility between the observations and H0 is too large to allow us to rule out its veracity with guarantees; the null hypothesis is then said to be “failed to be rejected” and the test is considered inconclusive5. Typical values for α, which quantifies under the assumptions of the test the probability of incorrectly rejecting the null hypothesis (known as a Type I error), are 0.05 (5%) or 0.01 (1%).

3.5.1 Sign Test

To test the statistical significance of the results obtained in the work summarised in this thesis we have used the Lower-Tailed Sign Test [Conover, 1971], a choice motivated by its simplicity and by the reduced number of assumptions it makes about the observations in comparison with other tests such as Wilcoxon's or Student's t.

The Sign Test uses as input a bivariate random sample of n′ observations (X1, Y1), (X2, Y2), ..., (Xn′, Yn′). The test proceeds by comparing the two components of each pair, labelling it with “+”, “-” or “0” depending on whether Xi is respectively larger than, smaller than or equal to Yi. In the Lower-Tailed version of the Sign Test the null hypothesis is H0 : P(+) ≥ P(−) (the values of X are at least equally, if not more, likely to be larger), while the alternative hypothesis is H1 : P(+) < P(−), i.e. that the values of Y are likely to be larger than those of X. The test statistic T is the number of pairs labelled with “+”, whose null distribution is a binomial with p = 1/2 and n equal to the number of pairs labelled with “+” or “-”. Hence, the p-value of the test is the probability of obtaining T or fewer successes in that binomial. If this probability is smaller than the chosen α the null hypothesis is rejected.

In the case of our experiments, the pairs will be the quality values of the partitions yielded by the two methods which we want to compare (measured with the metrics introduced in Section 3.4) for each initialisation of the factors tested. From there on, the test proceeds as stated in the previous paragraph, informing us on whether the differences in performance between the algorithms can be attributed to the effect of the initialisations.

For example, let us say that we are testing a new algorithm against an existing baseline. Let us say also that both methods are based on k-Means, and hence, since their results depend on the initial set of seeds, n′ runs are performed with different random initialisations. The results of each of these runs using each algorithm constitute a set of n′ pairs of observations (Xi, Yi), where the first value is the one obtained using the baseline and the second the one obtained with the new algorithm (since we are trying to see if it works better than the baseline). Assuming for convenience that n = n′ (i.e. that for none of the initialisations are the quality values of the baseline and the new approach equal) and an α of 0.05, Table 3.1 shows for different values of n′ the maximum number of initialisations for which the result obtained using the baseline can be better than that of the new algorithm and the null hypothesis (that the baseline is at least as good as the new method) is still

5 In particular, failing to reject the null hypothesis does not assert its veracity.



Table 3.1: For different sample sizes, the maximum number of pairs which in a Lower-Tailed Sign Test may be labelled with “+” such that the null hypothesis is still rejected (assuming that no pair is labelled with “0”)

n′      Maximum T to reject H0
3       —
5       0
10      1
25      7
50      18
100     41

rejected; that is, the maximum number of successes for which the cumulative distribution function of a binomial B(n′, 1/2) is smaller than 0.05.

Note how smaller numbers of initialisations demand harsher thresholds in order to reject the null hypothesis, as the evidence available to judge upon is smaller and hence the uncertainty about the significance is higher. For instance, in the extreme case of having only three initialisations, the probability that in the three cases the quality value Y is larger than X and yet the null hypothesis is true is 0.125, too high to reject H0 even for the more lenient α of 0.1.
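To make the computation behind Table 3.1 explicit, the following sketch derives, for a given sample size, the largest number of “+” pairs for which the lower-tailed Sign Test still rejects H0; it simply evaluates the cumulative distribution function of a Binomial(n, 1/2). The names are hypothetical; for n′ = 3 no threshold exists and the function returns None, matching the “—” entry in the table.

from math import comb

def binom_cdf(t, n):
    # P(T <= t) for T ~ Binomial(n, 1/2)
    return sum(comb(n, i) for i in range(t + 1)) / 2 ** n

def max_t_to_reject(n, alpha=0.05):
    # Largest t with P(T <= t) < alpha, or None if even t = 0 is not significant
    best = None
    for t in range(n + 1):
        if binom_cdf(t, n) < alpha:
            best = t
        else:
            break
    return best

# e.g. max_t_to_reject(25) == 7 and max_t_to_reject(3) is None, as in Table 3.1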

3.6 Summary

Given the eminently practical nature of the work summarised in this thesis, this chapter gathered and examined some aspects common to the experiments which will be reported in the next chapters. We started by stating some of the guidelines that we followed when choosing the datasets over which we performed the experiments (Section 3.1). Then, we surveyed some representation schemes and distance measures for the data points (Section 3.2), focusing mainly on the representation of textual data (Section 3.2.1). Afterwards, we examined how to tackle the problem of tuning the parameters of the algorithms in order to have a faithful and fair representation of their performance (Section 3.3). Finally, we concluded the chapter describing how the evaluation of the clustering was carried out and the metrics used to perform it (Section 3.4), remarking as well the importance of using statistical significance tests in order to better assess the results of the aforementioned evaluation (Section 3.5). To that end, we have used the Sign Test in our experiments (Section 3.5.1).


Chapter 4

Using Constrained Clustering in Avoiding Bias

Constrained Clustering has been used in many domains and tasks, as introduced in Section 2.4. In this chapter we summarise our work on using Constrained Clustering to tackle the Avoiding Bias problem, which was previously published in [Ares et al., 2009] and [Ares et al., 2010]. In the first part of this chapter we propose an Avoiding Bias scheme which uses negative constraints and a Constrained Clustering algorithm devised by us, whereas in the second part we focus on improving the quality of the alternative groupings obtained as a result of this task.

4.1 Avoiding Bias

As we have seen, clustering is the most popular unsupervised automatic data analysis tool. Given a data collection, clustering algorithms try to form a meaningful grouping of the data, categorising the data instances into various groups (clusters), such that the instances in the same cluster bear high similarity between them and low similarity with the instances that have been put in the other clusters.

Unfortunately, the concepts of “meaningful grouping” and “high” and “low” similarity are very subjective. Sometimes, even though the partition of the data found by a certain clustering algorithm can make sense from a purely mathematical point of view, it might be completely useless or even meaningless to the user, who in many cases is using it to get an idea of the distribution of the points of the collection.

Moreover, keeping with the use of clustering as an exploratory tool, the outcome of applying a clustering algorithm to some data might reflect a grouping which is already well known, or which would be easy to find with a manual examination. Again, in these cases clustering will be of little use to the user, who would not get much interesting or new information from it. Gondek and Hofmann [2004] suggest several examples of this situation, such as the clustering of news corpora which have already been annotated by a



certain criterion (such as region) or the clustering of users' data in ways not correlated with stratifications based on gender or income information.

Thus, mechanisms are sometimes needed to find clusterings alternative to the one proposed by the clustering algorithm. If we are trying to avoid the tendency (bias) of the clustering algorithm to fall into a certain grouping of the data being clustered, the task is called Avoiding Bias.

4.2 Previous Approaches

4.2.1 Coordinated Conditional Information Bottleneck (CCIB)

Gondek and Hofmann [2004] propose an approach for obtaining alternative clusterings based on the Conditional Information Bottleneck framework [Gondek and Hofmann, 2003], which uses the concept of conditional mutual information. Conditional mutual information quantifies how much information about a random variable can be learnt from another in the presence of a third; namely, with A, B and C random variables, the mutual information between A and B given C, I(A;B|C), can be calculated as

$$ I(A; B|C) = I(A; B, C) - I(A; C) \qquad (4.1) $$

That is, I(A;B|C) measures what can be known of A from the joint distribution of B and C minus what is already known when only the distribution of C is available.

Thus, denoting by X, Y, Z and C respectively random variables for the data points, their relevant features, the available background knowledge and the clusters, and naming PC|X a stochastic mapping of data instances to clusters which expresses the probability of assigning a point to a cluster (i.e. the result of the clustering), Gondek and Hofmann's method tries to solve the following optimisation problem

$$ P^{*}_{C|X} = \operatorname*{argmax}_{P_{C|X} \in \mathcal{P}} I(C; Y \mid Z) \qquad (4.2) $$

where

$$ \mathcal{P} = \{ P_{C|X} : I(C; X) \leq C_{max},\; I(C; Y) \geq I_{min} \} \qquad (4.3) $$

This last expression defines the set P of all acceptable PC|X, which are those where, on the one hand, the mutual information between C and X is under a threshold Cmax, indicating that there must be some fuzziness in the assignment of data points to clusters, and, on the other hand, the mutual information between C and Y is at the same time over a threshold Imin, in order to favour clustering solutions which obey some global consistency (since I(C;Y|Z) does not measure the relevant information that C provides on its own, without knowing Z). This last condition is what gives their formulation the “Coordinated” modifier.

Having defined this optimisation problem, the authors indicate in the paper how to solve it using an alternation scheme, putting a special focus on how to effectively deal with the common practical situation of not having the joint



distribution PXYZ, discussing for instance the estimates to be used in a text mining application.

As for the experiments performed to test their method, Gondek and Hofmann report good results on datasets composed of synthetic data, images and textual documents, obtaining in all cases appreciable effects on the partitions. Concretely, the results of the experiments with the image dataset are especially illustrative, as they are able to successfully steer the partition of a collection of face images from one based on the characteristics of the picture (face-only views versus face-and-shoulder views) to one based on the gender of the subjects.

4.2.2 COALA

In [2006] Bae and Bailey introduced COALA (Constrained Orthogonal Average Link Algorithm), an approach to extract alternative clusterings based on hierarchical agglomerative clustering. In their paper, the authors stress the twin objectives of uniqueness and quality in the alternative clustering: not only should the alternative clustering presented to the user be different from the already known one, but it should also have high quality in order to be useful1.

In order to fulfil these aims, Bae and Bailey develop an algorithm which works in two phases. Given a partition C of the data to which an alternative is sought, in the first phase a Cannot-link is created between each pair of data points which were put in the same cluster in C. In the second phase an agglomerative hierarchical algorithm is run over the data. This algorithm is similar to the Complete Link (CL) one introduced in Section 2.3.5: it starts with each data point in its own cluster and then proceeds to join clusters iteratively according to some strategy. It is in this strategy where the key differences with CL lie.

First of all, in this case the distance between two clusters is the average distance between the data instances in them. This approach, called Average Link [Voorhees, 1986], is more accurate and robust than Complete Link, which, since it defines the distance between clusters as the maximum distance between their data points, is more sensitive to outliers2.

Second, the choice of the pair of clusters to be merged in each iteration is more complex than simply taking the most similar ones. In COALA the algorithm selects in each iteration two pairs of clusters. The first one, (q1, q2), which they call the qualitative pair, is indeed composed of the two closest clusters. The second one, (o1, o2), the dissimilar pair, are the closest clusters such that their merger satisfies all the Cannot-link constraints created in the first phase of the algorithm (that is, there is no pair of data points from o1 and o2 which were in the same cluster in C). Once these pairs are selected (they may be the same one), the distance between their clusters is compared by dividing the

1 Later on in this chapter we will return to the importance of the quality of the alternative clustering.

2 Note however that the Average Link approach would not “propagate” the effect of the constraints in a Constrained Clustering algorithm similar to Constrained Complete Link (see page 21) in the same way that Complete Link does.



distance between q1 and q2 by the distance between o1 and o2. If this ratio is under a certain threshold ω, merging the dissimilar pair is considered to degrade the quality of the resulting clustering too much, and q1 and q2 are merged instead in this iteration. Otherwise (d(q1, q2)/d(o1, o2) ≥ ω), o1 and o2 will be merged in this iteration. Hence, ω (which is bounded by 0 and 1) acts as a parameter which controls the compromise between obtaining a good and a dissimilar clustering, with smaller values putting the focus on dissimilarity. Also, if no dissimilar pair can be selected, the quality pair is directly used in the merger. This process continues until there is only one cluster left, which will contain the whole collection.

In their paper, Bae and Bailey report that their approach outperforms a naive method based on ensemble clustering and another based on the Conditional Information Bottleneck framework [Gondek and Hofmann, 2003] (the one introduced by Gondek and Hofmann prior to the one discussed in the previous section) in stability, dissimilarity and quality. However, it should be noted that the experiments were conducted over small synthetic, numeric and categorical datasets with a limited number of features, and that the algorithm's complexity makes it inefficient for large collections.

4.2.3 Other approaches

Cui et al. introduced in [2007] a method to obtain non-redundant partitions via orthogonalisation. Their approach proceeds by iteratively repeating two steps: first, a clustering of the data is found; second, the data is orthogonalised (transformed) into a space not captured by the solution found in the first step (e.g. not spanned by the prototypes of the clusters). This data is then used as input to the next iteration. This process continues, yielding in each iteration an alternative clustering, until most of the data space is covered or no structure can be found in the remaining space. In the paper the authors introduce two ways to attain the orthogonalisation of the data. In the first one, named “Orthogonal Clustering”, the cluster solution found in the first step is represented by the centroids of its clusters, and hence each data point is projected into a subspace orthogonal to its cluster mean. On the other hand, in “Clustering in Orthogonal Subspaces”, the second approach, the cluster solution is represented using the feature subspace that best captures the clustering result, which can be obtained using techniques such as Linear Discriminant Analysis (LDA3) or Principal Component Analysis (PCA). Once the subspace is calculated, the data points are projected onto a subspace orthogonal to it. Cui et al. test their approach on synthetic and real-world datasets (numeric, texts and images), obtaining interesting results which show that their method is indeed finding orthogonal partitions of the data, although they do not compare with any other method. The authors did not report any appreciable difference between the two proposed ways of orthogonalising the data.

In [2008], Davidson and Qi introduced an approach to finding alternative clusterings which also uses constraints. In this case Must-links and Cannot-links are used to characterise the grouping of the data to be avoided. A distance

3 Not to be confused with Latent Dirichlet Allocation, for which the acronym LDA is also used.



function matrix is then learnt from these constraints using the approach proposed by Xing et al. [2003], resulting in a transformation Dπ over the data which conveys the partition to avoid. Dπ, which is a matrix, is afterwards decomposed using Singular Value Decomposition (SVD), yielding three matrices H, S and A such that Dπ = HSA. These matrices are used to build an alternative transformation D′π by taking the inverse of the stretcher matrix S (i.e. D′π = HS−1A). Finally, D′π is used to create transformed versions of the original datasets over which the clustering algorithm is applied to obtain the alternative clustering. Thus, this method has the advantage of being quite general, since it is not tied to any particular clustering algorithm. Again, they tested their approach only on non-textual collections with small numbers of features, over which they report good results.

Also related to the problem of avoiding bias, although solving it in a less automated fashion, Cohn et al. introduced in [2003] an algorithm to iteratively alter the grouping found by a clustering process which uses the EM algorithm according to negative user feedback. They incorporate the user preferences by altering the KL-divergence measure between the documents marked by the user, introducing a new factor to measure the importance of a term for distinguishing the documents. The experiments reported in their paper, focused on obtaining a “good” clustering instead of an alternative one, show promising results, even though the collections (in this case composed of textual documents) are again very small.

4.3 Our Initial Proposal

With these preceding works in mind, we proposed in [Ares et al., 2009] a method to tackle the Avoiding Bias problem based on pairwise constraints.

If we recall the formulation of the problem, we have a dataset X which we want to cluster and an already obtained partition of this data, which we will denote by Ωavoid; for whatever reason, we want to obtain an alternative partition Ωalt (i.e. different from Ωavoid) which is also a good explanation of the data.

Analysing the inputs of the problem with pairwise constraints in mind, there is little room to extract positive constraints. The only information available to us apart from the representation of the data is Ωavoid, the partition of the data which we want to avoid, and it does not give us much information about how the alternative partition that we seek should look: neither the fact that two documents are in the same cluster in Ωavoid, nor that they are in different ones, gives us any positive evidence about whether they should be in the same cluster in Ωalt.

On the other hand, Ωavoid gives us plenty of information about what Ωalt is likely not to look like. Since this alternative clustering should be substantially different from the one we want to avoid, two data instances that are in the same cluster in Ωavoid have a fair chance of not being in the same cluster in Ωalt. Consequently, similarly to the first phase of Bae and Bailey's COALA (Section 4.2.2), we will use Ωavoid to extract a set of negative constraints, one for each pair of data points put in the same cluster by that partition. We



expect the distortion induced by these constraints to be, on the one hand, enough to break the bias of the algorithm towards the avoided clustering and, on the other hand, not strong enough to break completely the structure of the similarities between documents, so that the final clustering of the data is still meaningful.

Given these negative constraints, it is almost certain that not all of them are going to be fulfilled by a good Ωalt4, which rules out using absolute Cannot-link constraints such as the ones provided by Constrained k-Means (CKM, see Section 2.3.1). In the case of COALA, the authors solve this problem by using the constraints in a hierarchical agglomerative method, comparing in each merger of clusters the overall best alternative with the best one that respects the constraints. Their approach, apart from the algorithmic complexity of the agglomerative clustering, which reduces its suitability for medium to large datasets, has also the drawback of no longer considering constraints when no merger that respects them is available, a stage that, depending on the structure of the dataset and the threshold set to choose one or the other merger, can come quite early. Moreover, their approach only uses the constraints to classify the mergers binarily; that is, it does not take into account how many constraints a merger would break, information which might be useful in obtaining more different partitions of the data.

In order to overcome these problems, we proposed in [Ares et al., 2009] a new Constrained Clustering algorithm based on k-Means, Soft Constrained k-Means (SCKM).

4.3.1 Soft Constrained k-Means (SCKM)

As we introduced in Section 2.3.1, the k-Means [McQueen, 1967] algorithm is a very popular clustering method, due to its good trade-off between effectiveness and cost. It is a generic algorithm, which does not need any prior knowledge apart from the desired number of clusters. Moreover, its clear structure and flow make it very easy to extend and modify.

In Constrained k-Means (CKM) Wagstaff et al. [2001] introduced Must-link and Cannot-link constraints into k-Means as absolute constraints. Being absolute, the final partition has to fulfil all of them to be acceptable. While this absoluteness can be very convenient if we know the relations between instances categorically and we cannot afford to have them misplaced, that is not usually the case, nor is it in our approach to Avoiding Bias. Moreover, the absoluteness could represent, as we discussed on page 12, an excessive burden on the process, since it can lead to situations where, even though an acceptable solution exists, it cannot be found, a situation acknowledged by the authors in their paper.

In order to overcome these limitations, while still letting the user use those absolute constraints, we introduced in [Ares et al., 2009] Soft Constrained k-Means (SCKM), a Constrained Clustering algorithm based on k-Means which enables the user to use two kinds of soft (non-absolute) constraints, which

4 In fact, depending on the cardinality of Ωalt, it is quite likely that no partition of the data can fulfil all of them.



will gradually influence the process instead of defining categorically where a document must or must not go:

• May-link constraints, MayL(a, b), indicating that two documents are likely to be in the same cluster

• May-not-link constraints, MayNL(a, b), indicating that two documents are not likely to be in the same cluster

Not being absolute, these constraints need not be symmetrical (although the semantics of such constraints would not be clear) and, more importantly, they need not define transitive relations (which would be too risky to extract given the uncertainty of the relations indicated by the constraints).

Naturally, the introduction of these constraints alters the way in which data instances are assigned to clusters: instead of assigning a point to the cluster with the most similar centroid, a score is now calculated for each cluster. That score is initialised with the similarity between the data instance and the centroid of the cluster, and it is then updated according to the number of May-link and May-not-link constraints which assigning the point to the cluster would obey or infringe. Concretely, the score is increased by an amount w for each May-link that the assignment would respect and decreased by the same amount w for each May-not-link that the assignment would infringe. The data point will be assigned to the cluster with the highest score such that the assignment respects all the absolute Must-link and Cannot-link constraints.

Algorithm 9 shows the pseudocode of the Soft Constrained k-Means algorithm that we proposed in [Ares et al., 2009]. It follows quite closely the skeleton laid out by KM and CKM, with a main loop where data instances are reassigned to clusters until convergence is attained. The choice of the best cluster for a data instance is made by the function GetDestinationCluster, where the scores for the clusters mentioned in the previous paragraph are calculated.

In order to make the most of the new soft constraints, two logs of the configuration of the clusters, Ωnew and Ωcurrent, are kept by the algorithm. Ωnew keeps track of the data points which have been assigned to a cluster in the present iteration, whereas Ωcurrent keeps track of the “current” configuration of the clusters, that is, the cluster to which each data point was most recently assigned, be it in the current iteration (if the point has already been inspected) or in the previous one5. As can be seen in lines 14 to 20, soft constraints are checked against Ωcurrent, whereas the absolute ones are checked against Ωnew. Thus, the effect of the soft constraints is made a little more independent of the order in which data points are inspected, letting the presence of these constraints gradually affect the clustering process while avoiding some of the problems associated with the absolute ones. It is also worth noting that this strategy fits well with the mechanism of the k-Means algorithm, which uses information from one iteration (the centroids of the documents) in order to rearrange them in the next one.

5 Hence, at the end of each iteration Ωcurrent is equal to Ωnew. The line Ωcurrent ← Ωnew has nevertheless been included in the pseudocode at the beginning of each iteration to make it explicit.



Algorithm 9: SOFT CONSTRAINED K-MEANS (SCKM)
input : X, the data to cluster; k, the number of clusters; ML and CL, the positive and negative absolute constraints to be taken into account; MayL and MayNL, the positive and negative soft constraints; w, the strength of these constraints
output: Ωnew = {ω1, ω2, ..., ωk}, a partition of the data

1  foreach ω ∈ Ωnew do Initialise(ω)
2  while convergence is not attained do
3      foreach ω ∈ Ωnew do RecalculateCentroid(ω)   // Used in 13
4      Ωcurrent ← Ωnew
5      Clear(Ωnew)
6      foreach x ∈ X do
7          ω ← GetDestinationCluster(x, Ωnew, Ωcurrent)
8          if ∄ ω then clustering fails
9          else Assign(x, ω)   // Updates Ωnew and Ωcurrent
10
11     end
12 end

function GetDestinationCluster(x, Ωnew, Ωcurrent)
input : x, a data instance; Ωnew, clusters with the assignments already made in this iteration; Ωcurrent, clusters output by the previous iteration updated with the changes made in this iteration
output: the best cluster in which to put x

13 foreach ωj ∈ Ωnew do scorej ← Similarity(x, ωj)   // centroids from line 3
14 foreach ωj ∈ Ωcurrent do
15     foreach x′ ∈ ωj do
16         if (x, x′) ∈ MayL then scorej ← scorej + w
17         if (x, x′) ∈ MayNL then scorej ← scorej − w
18     end
19 end
20 return argmax_{ωj ∈ Ωnew} scorej s.t. ¬ViolatesAbsoluteConstraints(x, ωj)
end

function ViolatesAbsoluteConstraints(x, ω)
input : x, a data instance; ω, a cluster
output: whether putting x in cluster ω contravenes any absolute constraint

21 foreach (x, x′) ∈ ML do if x′ ∉ ω then return true
22 foreach (x, x′) ∈ CL do if x′ ∈ ω then return true
23 return false
end
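To complement the pseudocode, the following Python sketch illustrates the core idea of Algorithm 9: the score-based assignment against Ωcurrent with May-link and May-not-link constraints. It assumes dense vectors and cosine similarity, and omits the absolute Must-link and Cannot-link checks for brevity; all names are hypothetical and this is not the implementation used in the experiments.

import math
import random

def _cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def _centroid(vectors, dim):
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def sckm(X, k, may_link, may_not_link, w, max_iters=50):
    # X: list of equal-length vectors; may_link / may_not_link: sets of frozenset({i, j})
    dim = len(X[0])
    current = [random.randrange(k) for _ in X]   # Omega_current, random initialisation
    for _ in range(max_iters):
        centroids = [_centroid([X[i] for i, c in enumerate(current) if c == cl], dim)
                     for cl in range(k)]
        changed = False
        for i, x in enumerate(X):
            scores = [_cos_sim(x, centroids[cl]) for cl in range(k)]
            # soft constraints are scored against Omega_current (lines 14-20 of Algorithm 9)
            for j, cl_j in enumerate(current):
                if j == i:
                    continue
                if frozenset((i, j)) in may_link:
                    scores[cl_j] += w
                if frozenset((i, j)) in may_not_link:
                    scores[cl_j] -= w
            best = max(range(k), key=scores.__getitem__)
            changed = changed or best != current[i]
            current[i] = best   # the latest assignment is visible to later points
        if not changed:
            break
    return current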



4.3.2 Related Algorithms

The SCKM algorithm was inspired by Yang and Callan's [2006] work on the detection of near-duplicate documents, specifically by the way in which the soft constraints are taken into account, using penalties over the similarity between a data point and a centroid. However, their soft constraints (which they call “Family-links”) are only positive and are defined in a limited fashion, being used in an algorithm specially tailored to their task (they tested the algorithm on detecting duplicates among comments on new United States regulations), whereas with SCKM we adapted a general clustering algorithm (k-Means) to cope with this kind of constraints, without being restricted to a certain problem or domain. Another key difference is that their algorithm does not take advantage of the information from the previous iterations to calculate the effect of the soft constraints, something that we do by using Ωcurrent.

However, the algorithm most closely related to SCKM is clearly the Pairwise Constrained k-Means (PCKM) algorithm by Basu et al. discussed in Section 2.3.2. At the time we devised SCKM (late 2008) we were unfortunately unaware of PCKM, which was presented in 2004 at the SIAM International Conference on Data Mining and with which our algorithm has great similarities.

In PCKM the authors treat Must-links and Cannot-links as non-absolute, and define a global objective for the clustering (Equation 4.4) to be minimised, which takes into account both the distances of the data points to the clusters to which they are assigned and the number of constraints which are not respected by the partition.

$$ J_{pckm}(\Omega) = \frac{1}{2} \sum_{i=1}^{k} \sum_{x \in \omega_i} \|x - \omega_i\|^2 + \sum_{(x_i, x_j) \in ML} w_{ij}\,\mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in CL} w_{ij}\,\mathbb{1}[l_i = l_j] \qquad (4.4) $$

Concretely, the objective Jpckm is increased by an amount w for each positive and negative constraint which is not respected. In order to minimise that objective, when a data instance is assigned to a cluster (line 6 and function Penalties of Algorithm 3) the algorithm chooses the cluster for which the sum of the distance to its centroid and the penalties entailed by that assignment is minimum. In our case (function GetDestinationCluster), the algorithm assigns the data point to the cluster with the highest score, that score being defined as the similarity of the point to the centroid of the cluster, plus w for each positive constraint which that assignment would respect, minus w for each negative one that it would not respect. It is quite easy to see that their criterion and ours are equivalent. As for distance and similarity, they are two intimately intertwined concepts, as we introduced in Section 3.2.2, with the first increasing and the second decreasing as the data points become more different. Regarding the constraints, the negative ones are treated identically in both algorithms (they reduce the appeal of the assignments that do not respect them). In the case of the positive constraints, although they are accounted for differently (in PCKM the appeal of the assignments that do not respect them is reduced, whereas in SCKM that of those that respect them is increased), their final effect is the same. However, it is not clear from the paper contents



whether they are using a solution similar to our use of Ωcurrent to calculate the penalties associated with the possible contravention of the constraints.

As for the differences between the algorithms, PCKM does not let the user include absolute constraints, whereas SCKM does not have the neighbourhood-based initialisation phase introduced by Basu et al., relying, as KM and CKM do, on a random initialisation of the clusters.

All things considered, the results obtained using the SCKM algorithm which are reported in this chapter and elsewhere in this thesis should be equally valid for the PCKM algorithm, at least if PCKM's neighbourhood-based initialisation is not used.

4.3.3 Recapitulation

Having introduced the SCKM algorithm, the outline of our approach to avoiding bias is as follows. Given a dataset X and a partition Ωavoid of it which we want to avoid:

1. A May-not-link constraint is created between each pair of data points which are in the same cluster in Ωavoid.

2. The SCKM algorithm is run over X, using the constraints obtained in the previous step.

The partition of X yielded by the SCKM algorithm is the output of the avoiding bias method. A small sketch of the constraint-generation step is given below.
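A minimal sketch of step 1, assuming the partition to avoid is given as a list of cluster labels indexed by data point (hypothetical names):

from itertools import combinations

def may_not_links_from(avoided_labels):
    # One May-not-link per pair of points sharing a cluster in Omega_avoid
    clusters = {}
    for idx, label in enumerate(avoided_labels):
        clusters.setdefault(label, []).append(idx)
    constraints = set()
    for members in clusters.values():
        for i, j in combinations(members, 2):
            constraints.add(frozenset((i, j)))
    return constraints

The resulting set could then be passed, for instance, as the may_not_link argument of the SCKM sketch shown in Section 4.3.1.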

By using the SCKM algorithm we avoid the problems of COALA discussed above. Firstly, SCKM retains the good computational behaviour of k-Means, since the computational cost of the major difference between them, which is looking up the appropriate constraints when assigning points to clusters, is negligible compared with that of the main operation of both algorithms, the calculation of the similarity between all data instances and the centroids. Secondly, the constraints are considered throughout the whole operation of the algorithm. Finally, the number of constraints which would be contravened by a certain assignment (that is, the similarity to Ωavoid that that assignment would introduce) is taken into account, unlike in COALA, where only whether any constraint is being contradicted is considered.

4.4 Experiments

4.4.1 Experimental Set-Up and Methodology

In order to assess the suitability of the proposed approach we performed Avoiding Bias experiments on reference collections, comparing our method with an already established one.

The mechanics of these experiments is the same as in the ones reported in [Gondek and Hofmann, 2004] or in [Cui et al., 2007]. Given a certain dataset X for which two different hand-made partitions Ω1 and Ω2 are available, we will alternatively consider one of them as Ωavoid, that is, the “known”



partition, the one to be avoided; run the Avoiding Bias approaches over X with that information; and compare their outcome with that Ωavoid and with the other partition. Our goal is to obtain in this process a partition of X which resembles the avoided partition as little as possible but which is still a good partition of the data, an aspect which we assess by comparing it with the other available partition of the data. This is precisely one of the more debatable aspects of this experimental design, since it can be argued that other good partitions of the data different from the hand-made ones may exist, and that consequently we might inadvertently dismiss an adequate partition as bad. Indeed, there are other papers, such as the previously discussed [Bae and Bailey, 2006] or [Davidson and Qi, 2008], which appraise the quality of the alternative partitions obtained using internal metrics6. However, we have opted for the discussed design due to its intuitiveness, clarity and consistency with the formulation of the task, since precisely one of the scenarios commonly adduced to underscore the importance of avoiding bias is clustering a data collection and obtaining a partition of the data which only makes sense from a mathematical point of view, lacking any meaning for a human.

Baseline

In our experiments we have used as baseline the results reported by Gondek and Hofmann [2004] for their method Coordinated Conditional Information Bottleneck (CCIB), discussed in Section 4.2.1. We have chosen that algorithm because of their use of the experimental methodology discussed in the previous section, their extensive results on textual datasets and their use of compact metrics, such as Purity (named “Precision” in their paper) and Mutual Information (see Section 3.4), which facilitate comparisons with other methods. Given the reproducibility of the experimental settings described in their paper, in this experiment it was not necessary to re-implement Gondek and Hofmann's method. Instead, we were able to directly use the quality values reported by them in their paper.

Datasets

We have compared the behaviour of CCIB and our approach on two textual datasets, defined by Gondek and Hofmann in their paper:

(i) This dataset was created as a subset of WebKB's Universities Dataset. The original WebKB collection [Craven et al., 1998] was collected from the websites of several U.S. universities (Cornell, Texas, Washington, Wisconsin and others). These web pages have been manually tagged according to two aspects: university and topic (“course”, “department”, “faculty”, “project”, “staff”, “student” and “other”). The dataset used in the experiments is created by taking the documents from the Universities of Cornell, Texas, Washington and Wisconsin which were tagged as “course”, “faculty”, “project”, “staff” or “student”, which yields a total of 1087 documents, whose distribution is shown in Table 4.1.

6 See Section 3.4.



Table 4.1: Distribution of the documents from dataset (i) according to University (rows) and Topic (columns) criteria

                staff   project   course   student   faculty   total
wisc              12        25       85       156        42     320
utexas             3        20       38       148        46     255
cornell           21        20       44       128        34     247
washington        10        21       77       126        31     265
total             46        86      244       558       153    1087

Table 4.2: Distribution of the documents from dataset (ii) according to Region (rows) and Topic (columns) criteria

            GCAT   MCAT   total
UK          1297     26    1323
INDIA        280      0     280
total       1577     26    1603

(ii) This dataset was created from Reuters RCV1 [Lewis et al., 2004], a huge document collection composed of about 810,000 news stories from Reuters, one of the most important news agencies. These documents have been manually tagged according to three aspects: topic, geographical area and industry. The dataset used in the experiments is created by taking the documents which have been labelled with exactly one topic and one region label, whose topic is “MCAT” (Markets) or “GCAT” (Government/Social) and whose region is “UK” or “INDIA”. This yields a total of 1600 documents, whose distribution is shown in Table 4.2.

Other Details

In our method, the textual documents were represented using Mutual Information, and these representations were compared using the cosine distance (see Section 3.2). Moreover, as in Gondek and Hofmann's experiments, the desired number of clusters (k) was set to the number of groups in the “expected” (i.e. not avoided) partition of the data. Thus, the only parameter that has to be tuned is w, the weight of the soft constraints in our method. In order to do so we have used a crossvalidation strategy, which involved testing the possible values on dataset (i) avoiding the Topic partition and taking the one with the best results (w = 0.0025), then using that value when avoiding the University partition in that dataset and also when avoiding both partitions in the other dataset. Finally, like KM, the SCKM algorithm depends on the initial set of seeds. Consequently, as introduced in Section 3.3, 10 random seed initialisations were tested. In each of these initialisations we also randomised the order in which the documents were inspected.



Table 4.3: Avoiding bias results in the defined datasets for k-Means, the new algorithm (SCKM) working with soft constraints and the CCIB method. The arrows next to the measures indicate whether the values of that measure are better when higher (marked with ↑) or when lower (marked with ↓).

Dataset (i)
                    Avoiding Topic (k=4)                   Avoiding University (k=5)
                    MI(Topic)↓  MI(Univ.)↑  P(Univ.)↑      MI(Univ.)↓  MI(Topic)↑  P(Topic)↑
CCIB                  0.007       0.019       0.292          0.009       0.234       0.474
Batch k-Means         0.518       0.211       0.440          0.322       0.516       0.673
SCKM (w=0.0025)       0.004       0.295       0.506          0.003       0.469       0.643

Dataset (ii)
                    Avoiding Topic (k=2)                   Avoiding Region (k=2)
                    MI(Topic)↓  MI(Region)↑  P(Region)↑    MI(Region)↓  MI(Topic)↑  P(Topic)↑
CCIB                  0.002       0.011        0.552         <0.001       0.855       0.978
Batch k-Means         0.007       0.081        0.825          0.097       0.008       0.984
SCKM (w=0.0025)      <0.001       0.141        0.825         <0.001       0.005       0.984



4.4.2 Results

Table 4.3 shows the results achieved by CCIB, our algorithm and a regular k-Means in this experiment. In the latter two cases, the value shown is the mean of the 10 initialisations of the seeds (and of the order of inspection of the data points in the case of SCKM) which were tested. As a preliminary note, it should be noted how the MI values of the runs of the k-Means in the datasets show unequivocally the tendency of that algorithm towards one of the possible clusterings of the data, showing a real-world example where having a way to avoid that bias could come in handy.

With the trained w our algorithm performed really well, achieving the two aims of the Avoiding Bias task. Firstly, we have been able to steer the clustering algorithm away from the known organisation of the data. This can be seen comparing the values of MI for the known clustering of our algorithm with the values of k-Means. The decrease is considerable in all cases, involving several orders of magnitude.

Secondly, the outcome of our clustering algorithm resembles the "other" (not known) organisation of the data more than the known one, a fact which can be confirmed comparing the MI for the known and unknown clusterings. In all cases the difference is very appreciable, involving again one or two orders of magnitude. Furthermore, it is also worth remarking that in all cases the quality of the clustering (the purity with respect to the not known partition) is still high.

Comparing with the results of Gondek and Hofmann (CCIB), our algorithm achieves in almost all cases noticeably higher similarity to the unknown clustering than their approach, and also higher quality (i.e., greater purity). The only exception to this happens in dataset (ii) when trying to avoid the "Region" criterion. This can be attributed to the special characteristics of this dataset, which is extremely unbalanced. Nevertheless, we must stress that even in this extreme case the algorithm is able to fulfil the two aims previously explained.

4.5 Improving the Quality of Alternative Clusterings

Having obtained these good results, in [Ares et al., 2010] we turned our attention to improving the quality of the alternative partitions.

In this regard, the key aspect to keep in mind is that Avoiding Bias is still a clustering problem, where the main focus is providing the user with a meaningful grouping of the data. For instance, the easiest way to find a grouping very different from the one given would be assigning documents randomly to clusters, which would obviously be a very bad solution in terms of clustering quality. Thus, a compromise has to be reached between the quality of the clustering and the distance to the avoided grouping when devising an avoiding bias algorithm.

In [Ares et al., 2010] we studied various ways to obtain an alternative clustering with high quality while keeping the objective of avoiding the known clustering. Specifically, we tested two different approaches which use the same



Figure 4.1: Constrained Normalised Cut algorithm proposed by Ji and Xu [2006]. The constraints are introduced at the core of the trace minimisation problem using the matrix U, which encodes them in a suitable way.

strategy as the work previously introduced in this chapter (i.e. using negative constraints to steer the clustering process away from the known clustering), making use of spectral clustering techniques to try to attain that high quality. The first one was introducing negative constraints in the Constrained Normalised Cut (CNC) approach proposed by Ji and Xu (see Section 2.3.4). The second one was introducing the Soft Constrained k-Means algorithm presented in Section 4.3.1 in the second phase of a Normalised Cut clustering algorithm (see Section 2.3.4).

4.5.1 Negative Constraints in Constrained Normalised Cut

In Section 2.3.4 we have explained the approach used by Ji and Xu [2006] to transform the classic Normalised Cut algorithm into a Constrained Clustering one, allowing the use of positive constraints. Intuitively, a similar scheme could be used to try to introduce negative information as well.

In their paper, the authors introduce a matrix U which encodes the positive constraints such that the Frobenius norm of the product of that matrix and the indicator matrix is in inverse proportion to the number of constraints which are respected by the partition represented by the indicator matrix, having a minimum of zero when all of them are honoured. Thus, introducing this factor into the function minimised at the core of the Normalised Cut algorithm (Equations 2.9 and 2.10) changes the nature of the solution, which now has to be a clustering of good quality (minimising NCut) that also respects the constraints (minimising the new term).



Figure 4.2: Negative Constrained Normalised Cut method (NCNC) proposed in Section 4.5.1. Similarly to Ji and Xu's method, negative constraints are introduced at the core of the trace minimisation problem using a new matrix UN.

The influence of the constraints is controlled by a parameter (β): the enforcement of the constraints grows as the value of β increases, with a minimum at β = 0, where the constraints are not taken into account at all.

With that in mind, an apparently easy and intuitive way to introduce the negative constraints would be using a new matrix UN, which would encode the negative constraints in the same way as the positive ones were encoded in U. Again, the Frobenius norm of the product of UN with the indicator matrix will be lower as more of the pairs of documents linked by a constraint are in the same cluster, and, vice versa, higher as more of them are not in the same cluster, which is precisely the objective of the negative information. In order to introduce this new term in the minimisation a new parameter (βN) is needed to control the enforcement of the negative constraints. As this new factor is in direct proportion to the number of negative constraints which are respected in the clustering, it must be introduced in the formula with a minus sign (Equations 4.5 and 4.6). Again, the value of βN is greater than or equal to 0, with a harder enforcement of the constraints as its value increases. In the remainder of this chapter we will call this method Negative Constrained Normalised Cut (NCNC).

\begin{equation}
\min_{A_1,\ldots,A_k}\Big(NCut(A_1,\cdots,A_k) - \|\beta_N U_N H\|^2\Big) \tag{4.5}
\end{equation}

\begin{equation}
\min_{Y \in \mathbb{R}^{n \times k}} \operatorname{Tr}\Big(Y^T\Big[D^{-\frac{1}{2}}\big(L - \beta_N U_N^T U_N\big)D^{-\frac{1}{2}}\Big]Y\Big) \tag{4.6}
\end{equation}
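A minimal sketch of this construction follows; it assumes that each negative constraint is encoded as a row of UN with +1 and -1 in the positions of the two constrained documents (so that the norm of the product with the indicator matrix is smaller when constrained pairs end up in the same cluster, as described above), and it returns the spectral embedding associated with the matrix of Equation 4.6.

import numpy as np

def ncnc_embedding(W, cannot_links, beta_n, d):
    # W: symmetric affinity matrix; cannot_links: list of (i, j) pairs;
    # beta_n: enforcement of the negative constraints; d: eigenvectors kept.
    n = W.shape[0]
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                          # unnormalised graph Laplacian
    U_N = np.zeros((len(cannot_links), n))
    for r, (i, j) in enumerate(cannot_links):     # assumed +1/-1 row encoding
        U_N[r, i], U_N[r, j] = 1.0, -1.0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    M = D_inv_sqrt @ (L - beta_n * (U_N.T @ U_N)) @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(M)          # ascending eigenvalues
    return eigvecs[:, :d]                         # projected data points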



Figure 4.3: Normalised Cut plus Soft Constrained k-Means method (NC+SCKM) proposed in Section 4.5.2. Negative constraints are introduced when clustering the projected data points.

Even though this approach seems theoretically sound, it does not yield good results in the Avoiding Bias task. Our explanation of why this happens is given in Section 4.6.1.

4.5.2 Combining Soft Constrained k-Means and Normalised Cut

As has been previously explained (Section 2.3.4), the Normalised Cut algorithm is based on transforming the clustering problem into a graph cut problem. The aim of the process is finding a cut of the graph which minimises its Normalised Cut value. Since this is an NP-hard problem, a certain relaxation of the conditions imposed on the solution has to be performed in order to reduce its complexity and make it computationally tractable. Thus, the outcome of this minimisation is a projection of the data points into R^k, instead of the grouping itself, and a last step has to be performed to reach the final clustering of the data. In order to perform this last phase, Shi and Malik propose using k-Means on the projected data points.

Our second proposal in [Ares et al., 2010] was using the Soft Constrained k-Means algorithm instead of k-Means, enabling the introduction of domain knowledge in the form of absolute (Must and Cannot-Link) and non-absolute (May and May-Not-Link) constraints. Even though they would be defined over the initial documents, the one-to-one correspondence between them and the projected documents (the document which was represented by the vertex v_i of



the graph is now encoded in the i-th row of matrix Y) enables us to apply these same instance-level constraints over the corresponding projected documents. From now on, we will call this method Normalised Cut plus Soft Constrained k-Means (NC+SCKM).

From the point of view of Soft Constrained k-Means, the Normalised Cut acts as a kind of document preprocessing phase, where the documents are transformed from the chosen document representation to a representation in R^k based on the Normalised Cut criterion. The effect of this "preprocessing" is twofold: not only are we benefiting from the increase in cluster quality brought by the Normalised Cut algorithm, but we are also likely to experience an increase in the effect of the pairwise constraints. As documents which are close to constrained ones are affected as well by the changes in the destination of the latter induced by the constraints, our intuition is that the effectiveness of the constraints in this new data space is increased, as similar documents (over which the same constraints tend to be true) are brought together and dissimilar ones are separated (thus avoiding some non-desired "interferences" of the constraints over non-related documents).
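A minimal sketch of the combined pipeline is given below. The spectral step is the standard symmetric Normalised Cut relaxation, and soft_constrained_kmeans is a placeholder for the SCKM algorithm of Section 4.3.1, which is not reproduced here.

import numpy as np

def normalised_cut_projection(W, d):
    # Rows of the returned matrix are the projections of the documents into R^d;
    # Soft Constrained k-Means is then run on these rows with the original
    # instance-level constraints.
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L_sym = D_inv_sqrt @ (np.diag(deg) - W) @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)      # ascending eigenvalues
    return eigvecs[:, :d]

# Y = normalised_cut_projection(W, d=15)
# partition = soft_constrained_kmeans(Y, k, constraints, w)   # SCKM of Section 4.3.1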

In terms of performance, the computational cost of this combined approach is the same as that of the Normalised Cut algorithm, as the cost of Soft Constrained k-Means and k-Means is the same. Consequently, since the costliest operation of the whole algorithm is still by a wide margin the calculation of eigenvectors, the total cost will depend on the method chosen to perform that task. This cost can be kept fairly moderate if a standard algorithm is used. For instance, using the Lanczos algorithm the time complexity would be O(k N_Lanczos nnz(M)), where k is the desired number of clusters (i.e. of eigenvectors), N_Lanczos is the number of iteration steps of the algorithm and nnz(M) is the number of non-zero elements of the matrix D^{-1/2} L D^{-1/2}, whose eigenvectors are being calculated.

4.5.3 Approach to Avoiding Bias

These two approaches to introducing negative constraints in Normalised Cut are used to tackle the Avoiding Bias problem in the same way as SCKM was in [Ares et al., 2009]. Hence, the outline of the approach is the same as the one indicated in Section 4.3.3: first, a negative constraint is created for each pair of documents that in the partition to be avoided are in the same cluster. Then, one of the new methods is used to cluster the data collection taking into account these negative constraints. Again, the (hopefully) alternative partition found by the Avoiding Bias approach is the one output by the clustering method used in this later stage.
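Assuming the partition to be avoided is given as a vector of cluster labels, the constraint-creation step of this outline can be sketched as:

from itertools import combinations

def constraints_from_avoided_partition(avoided_labels):
    # One negative constraint for every pair of documents that the partition
    # to be avoided places in the same cluster.
    return [(i, j) for i, j in combinations(range(len(avoided_labels)), 2)
            if avoided_labels[i] == avoided_labels[j]]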

4.6 Experiments

To evaluate whether the proposed methods achieve any improvement over the SCKM-based Avoiding Bias approach introduced in the first part of this chapter we repeated the experiments summarised in Section 4.4 with the same



datasets, data representation, metrics, etc., although with ten new random initialisations of the seeds and of the order in which documents are examined.

Again, we have used a cross-validation strategy to tune the parameter of the algorithms which controls the strength given to the constraints in each method, that is, the βN in NCNC and the w in NC+SCKM. In both cases, the value of the parameter was again tuned in Dataset (i) avoiding the partition of the documents by Topic. The w chosen for the baseline (SCKM) was 0.0025, the value which again obtained the best compromise between quality and avoidance. In the combined approach (NC+SCKM), as the focus of this work is improving the quality of the grouping, the value (w = 0.05) was chosen as the one which yielded the best similarity (MI) with the non-avoided grouping of the documents ("University") while maintaining a similarity with the avoided grouping ("Topic") less than or equal to the one achieved by the baseline, which was itself quite low. As for NCNC, the constrained Normalised Cut with negative constraints, the tuning process showed poor quality values and a great instability of the algorithm with respect to the values of βN. Our explanation of why this happens will be given at the end of this section.

As for d, the number of eigenvectors used in the projection of the documents employed to solve the trace minimisation problem (see Section 2.3.4), a parameter which affects NC, NCNC and NC+SCKM, we detected that, as introduced in that section, the best performance was obtained for values of d larger than k (specifically when the number of eigenvectors ranged from 10 to 20, as opposed to the number of desired clusters, which ranges from 2 to 5). After some preliminary tests we have used the first 15 eigenvectors to create the projection of the documents, a value which we have found to perform well in all collections.

4.6.1 Results

The results of the aforementioned experiments are shown in Table 4.4. As in the previous experiments, we report for each dataset and avoided partition the values of Mutual Information (MI) with the avoided and the non-avoided groupings, to see to which of them the outcome of the clustering process is mostly leaning, and Purity (P) with the non-avoided grouping, to measure the quality of the clustering. Hence, a good result would have high values of MI and P with the non-avoided partition and a low value of MI with the avoided one. The results reported are the average of the ten different initialisations of seeds and document inspection order tested in each combination of dataset and avoided grouping.

Firstly, it is worth remarking that the results show the expected increase in the clustering quality of Normalised Cut with respect to k-Means. Moreover, they also point out again a tendency of the non-constrained algorithm (in our case, Normalised Cut) to fall into one of the two groupings of the collections, even though this tendency is sometimes less clear than in the case of the k-Means. As for the values of KM and SCKM, it is important to note their close similarity with the ones reported for the previous experiment in Table 4.3, which shows the stability of both algorithms on average with respect to the initial seeds.



Table 4.4: Avoiding bias results for the avoiding bias experiment with the defined datasets for k-Means, Soft Constrained k-Means (SCKM), Normalised Cut and the combined approach (NC+SCKM). The arrows next to the measures indicate whether the values of that measure are better when higher (marked with ↑) or when lower (marked with ↓).

Dataset (i)
                    Avoiding Topic (k=4)                   Avoiding University (k=5)
                    MI(Topic)↓  MI(Univ.)↑  P(Univ.)↑      MI(Univ.)↓  MI(Topic)↑  P(Topic)↑
Batch k-Means         0.507       0.230       0.436          0.297       0.568       0.687
SCKM (w=0.0025)       0.005       0.279       0.477          0.003       0.450       0.648
Normalised Cut        0.480       0.410       0.499          0.582       0.561       0.679
NC+SCKM (w=0.05)      0.003       0.934       0.768          0.001       0.657       0.716

Dataset (ii)
                    Avoiding Topic (k=2)                   Avoiding Region (k=2)
                    MI(Topic)↓  MI(Region)↑  P(Region)↑    MI(Region)↓  MI(Topic)↑  P(Topic)↑
Batch k-Means         0.008       0.087        0.825          0.140       0.009       0.984
SCKM (w=0.0025)      <0.001       0.119        0.825         <0.001       0.008       0.984
Normalised Cut        0.008       0.151        0.825          0.186       0.011       0.984
NC+SCKM (w=0.05)     <0.001       0.164        0.825         <0.001       0.016       0.984



Figure 4.4: Stability of the parameters of the two proposed algorithms in the training collection (Dataset (i), avoiding Topic). Panel (a) plots the Mutual Information with the avoided and non-avoided groupings against w for NC+SCKM; panel (b) plots the same measures against β for CNC with negative constraints.

A study of the results shows how the similarity of the outcome of the combined algorithm (NC+SCKM) with the non-avoided partition (which, as we introduced in Section 4.4, is used as an indication of the quality of the clustering) is in all cases greatly increased over the Soft Constrained k-Means results. Moreover, the results show how the introduction of this constrained phase has no detrimental effect on the quality of the Normalised Cut results with respect to that partition, and in fact improves them in all cases. As for the avoided partition, the similarity of the results of our technique is still reduced, keeping it in all cases at values equal to or less than those of the baseline (SCKM alone), which were already low.

It should also be noted that the reason for the repeated values of P for the four methods in Dataset (ii) is the structure of the dataset, where in each of the possible groupings one of the clusters is much bigger than the other (still, the MI values for that dataset attest the improvements attained using the combined method). Finally, it is also worth remarking that further tests on the training collection have shown that the parameter w of this combined approach is quite stable. This can be seen in Figure 4.4(a), which shows that the MI with the avoided and non-avoided groupings are not greatly affected by wide variations around the chosen value of 0.05.

The results of the tests performed with NCNC are not included in Table 4.4 as the quality values achieved were poor and the value of the parameter βN was very unstable. This is shown in Figure 4.4(b): for almost all values of the parameter the similarity with the avoided grouping is much higher than with the non-avoided one, and for the values of βN at which the two similarities come closer the quality of the result is very low and a small variation of the parameter produces an abrupt change in the quality values. Our intuition is that the cause of this behaviour has to do with the function which is minimised. With positive constraints, the function (Equation 2.9) has its lower bound at zero, a value which, if attained, would mean both that the clustering has good quality (NCut = 0) and that all the constraints are respected (||βUH||² = 0). However, this is not what happens in the minimised function when negative constraints are involved (Equation 4.5). Here, a low value can be obtained if all the constraints are respected, regardless of the quality of the clustering, as



one value is subtracted from the other. This makes tuning the value of βN very hard, as a small change can dramatically alter the balance between those two factors.

4.7 Summary

In this chapter we have summarised our work in applying Constrained Clustering to the avoiding bias problem, which consists of, given some data to cluster and an already known partition of it, finding an alternative partition of the data which is also a good one (Section 4.1).

To tackle this task, we have proposed a scheme (Section 4.3) which uses negative non-absolute constraints to codify the grouping to be avoided. These constraints are fed to a Constrained Clustering algorithm devised by us (Section 4.3.1), whose design enables us to overcome certain shortcomings of existing similar approaches (Section 4.3.3). Using the information contained in the constraints, the algorithm finds an alternative partition of the data. Evaluation results (Section 4.4), focused on reference textual collections, have shown considerable improvements in effectiveness over an existing well-known algorithm specially tailored for the Avoiding Bias task.

In the second part of this chapter we have focused on improving the quality of the alternative partitions (Section 4.5), proposing two approaches based on using negative constraints in conjunction with spectral clustering techniques. The first approach tries to introduce these constraints at the core of a constrained spectral algorithm (Section 4.5.1), whereas the second one combines spectral clustering and the algorithm proposed in the first part of this chapter (Section 4.5.2). The experiments (Section 4.6), performed again on the same reference textual collections, have shown that whereas the first method does not yield good results, the second one attains large increments in the quality of the clustering results while keeping low similarity with the avoided grouping.


Chapter 5

Robustness of Constrained Clustering Algorithms

Up to now, research on Constrained Clustering has been mostly focused on devising new clustering algorithms, overlooking certain practical problems, as we introduced in Section 2.5. In this chapter we summarise our work in studying one of these problems, the robustness of Constrained Clustering algorithms, which has been previously published in [Ares et al., 2012].

5.1 Algorithm Robustness

Constrained Clustering is an expanding field, which has been studied in recent years by several authors. This research has been mostly focused on developing new algorithms, with the aim of making the most of the information carried by the constraints which are provided to them.

In the experiments made by the authors to test these new approaches the sets of constraints are built in almost all cases using the reference groupings of the data to be clustered; for instance, taking two data points at random and creating a positive constraint if they are in the same cluster and a negative constraint if they are in different ones. On the other hand, the evaluation of the clustering results is usually performed using external metrics (see Section 3.4). These metrics compare the outcome of the clustering process with a grouping of the data that is deemed as "correct", which in these experiments is the same partition used to create the constraints.

Consequently, the domain knowledge supplied to the Constrained Clustering algorithms in the vast majority of the experiments in the literature was totally accurate, as each pairwise constraint actually holds in the reference grouping (since they are extracted from it). That is to say, the information provided by the constraints is precise, and would actually help the clustering process to obtain an outcome closer to the one used as reference, thus helping the Constrained Clustering algorithm to achieve better results1.

1 This is indeed the general case, even though in certain cases (see Section 2.5.4) this may not be so clear.



Moreover, it should be noted that this effect is not limited to the use of external metrics. Internal metrics, the other group of direct evaluation metrics, are mostly based on measuring certain desirable properties of the partition, such as compactness, separation between clusters, etc., which, provided that an appropriate data representation and distance are used, will be fulfilled by a hand-made reference partition of the data.

Even though creating constraints from the reference partitions makes sense in the context of these papers, since it enables us to measure how effectively the constraints are being used, this approach does not take into account that in real Constrained Clustering problems the "true" clustering of the data is obviously not available, and hence the constraints which will be used are going to have been obtained with methods that will yield a certain amount of inaccurate constraints. This concept (an inaccurate constraint) can seem tricky to define in a real-world problem, since in most of these scenarios there does not exist a partition of the data which can be exclusively deemed as correct. However, we can define an incorrect constraint as one that steers the clustering away from a good partition of the data; a partition which, as we have defined along this thesis, puts similar data points in the same cluster and dissimilar ones in different clusters, and which in some cases must also have some meaning to the user. Both automatic and manual constraint extraction methods are likely to introduce this kind of "bad" constraints.

In the case of automatic constraint extraction schemes, they work by generalising some more or less explicit notions about the domain in question, which (hopefully) give us clues about which data entities are or are not related. The generalisations made by these methods, being precisely that, generalisations, will not always be valid for each pair of entities, and hence in most cases the automatic methods will yield a non-negligible amount of inaccurate constraints. For instance, Song et al. [2010] propose an automatic constraint extraction method for text documents that creates a positive constraint between two documents if they share a minimum amount of named entities (see Section 6.7.2), which in our experiments on constraint extraction (Chapter 6) yielded in the best case, with a very conservative setting, 15% of inaccurate constraints (see Section 6.8). A more extreme example is provided by the avoiding bias scheme proposed in the previous chapter, where we positively know that the constraint creation method (creating a Cannot-link between each pair of data points which are in the same cluster in the avoided partition) is bound to create a big amount of inaccurate constraints. Indeed, if we label one of these constraints as inaccurate if the points are actually in the same cluster in the reference partition, in the first dataset used in the experiments the percentage of inaccurate constraints is about 25% when avoiding the partition by Topic and 34% when avoiding the one by University. In the second dataset, due to its unbalanced nature, the problem is even more severe, with exorbitant percentages of 82% and 98% when avoiding the partition by topic and by region, respectively2.

As for the manual constraint extraction methods, they are based on asking the user whether pairs of instances should or should not be put in the same cluster,

2 Note how despite these numbers the proposed algorithm is able to provide good results.



a process which may produce inaccurate constraints due to misjudgements of the users. Since clustering is mostly an exploratory tool, it is often used in situations where the configuration of the data is not well known. Therefore, the answer to whether some instances are related can be far from clear for the user, who is precisely running the algorithm to get an idea of the organisation of the data. Moreover, given the tedious nature of constraint creation, the process is often shared between users, something which can aggravate the problem, as there might exist non-trivial differences in their criteria about the configuration of the data.

Consequently, the robustness of the Constrained Clustering algorithms to noisy sets of constraints (i.e. containing inaccurate constraints) is bound to play an important role in their final effectiveness when tackling real-world problems. In this chapter we make an experimental study of the robustness of several Constrained Clustering algorithms using two different noise models, which highlights the strengths and weaknesses of each method when working with inaccurate positive or negative constraints.

5.2 Experimental Set-Up and Methodology

In order to study the behaviour of Constrained Clustering algorithms when the constraints provided to them are not wholly accurate we have supplied our own implementations of some algorithms with a synthetically created combination of accurate and inaccurate constraints. We examine the evolution of the quality of the resulting partitions (using direct external evaluation) as the ratio of noisy constraints is increased, noting especially the moment at which the quality of the results drops below that of the partition yielded by the corresponding non-constrained method, that is, the maximum amount of noisy constraints above which using constraints is actually harmful.

5.2.1 Clustering Algorithms

In this study we have used four different Constrained Clustering algorithms, two flat and two spectral ones, along with their non-constrained counterparts. We have focused on these two families of algorithms due to their good results and popularity, which maximise the utility of studying their behaviour in noisy environments. Specifically, we have left out of our study hierarchical algorithms, due to their usually large computational cost, which precludes using them in many real-world problems, and algorithms which use the probabilistic framework, which have already been studied in [Nelson and Cohen, 2007], one of the few existing studies which have considered inaccurate constraints, with which we compare our conclusions in Section 5.5. Table 5.1 shows a summary of the algorithms according to the absoluteness with which they use the constraints and the family to which the algorithm belongs.

The two flat clustering approaches chosen were Constrained k-Means (CKM, see Section 2.3.1) and Soft Constrained k-Means (SCKM), the algorithm which we introduced in Section 4.3.1. In the case of the latter the constraints used were the non-absolute May-Link and May-Not-Link, even though for simplicity's sake we will use throughout this chapter the general denominations "Must-links" and "Cannot-links" when referring to positive and negative constraints.



Table 5.1: Summary of the clustering methods used in the study

                      Flat clustering                       Spectral clustering
Unconstrained         - k-Means (KM)                        - Normalised Cut (NC)
Hard constraints      - Constrained k-Means (CKM)
Soft constraints      - Soft Constrained k-Means (SCKM)     - Constrained Normalised Cut (CNC)
                                                            - Normalised Cut with Imposed Constraints (NCIC)

Moreover, the similarities between SCKM and Pairwise Constrained k-Means (PCKM) noted in Section 4.3.2 suggest that the results obtained with these experiments should be applicable as well to Basu et al.'s method. Finally, the non-constrained counterpart used as baseline for these two algorithms is k-Means (see Section 2.3.1).

As for the spectral methods, we have used Constrained Normalised Cut (CNC, see Section 2.3.4) and an approach which we have dubbed Normalised Cut with Imposed Constraints (NCIC). As we have previously discussed in this thesis, Constrained Normalised Cut can only accommodate positive constraints. Since the negative information also plays an equally important role in real-world problems (and can even be the only kind of information available in certain domains), a study of the robustness to noisy negative constraints of a spectral-based algorithm is very interesting. In order to do so3, we will use in our experiments a scheme analogous to the one proposed in Kamvar et al.'s Spectral Clustering with Imposed Constraints (SCIC, see Section 2.3.6), which lets the user introduce both positive and negative information, but in this case having to optimise the Normalised Cut function, which amounts to a change of the normalisation of the weights matrix. This change enables us to simplify the study of the results of the experiments (since now the unconstrained counterpart of both spectral methods is Normalised Cut) without altering greatly the core of NCIC (in fact, in their paper Kamvar et al. describe various possible normalisations, choosing one due to "slight empirical benefits for our [their] data"). Hence, following Kamvar et al.'s approach, in NCIC the weights of the edges of the graph which join data instances involved in Must and Cannot-link constraints are altered, setting them to the maximum and the minimum possible values, respectively. Afterwards, the process follows the same steps as the basic Normalised Cut algorithm, but using these modified weights, thus enabling the use of positive and negative constraints in a Normalised-Cut-based algorithm. As in the case of SCKM and PCKM, the results of NCIC in the experiments summarised in this chapter should also be applicable to SCIC.

3 Although it was finally published in 2012, the work on the study summarised in this chapter started in 2009, which is the reason why NC+SCKM, the combined approach proposed in Section 4.5.2 to introduce negative constraints in NC, was not considered.
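A minimal sketch of the weight modification used in NCIC follows; it assumes cosine-based affinities in [0, 1], so that the maximum and minimum possible weights are 1 and 0, and the standard Normalised Cut algorithm is then run on the modified graph.

import numpy as np

def ncic_weights(W, must_links, cannot_links, w_max=1.0, w_min=0.0):
    # Edges joining Must-linked instances are set to the maximum possible
    # weight and edges joining Cannot-linked instances to the minimum.
    W_mod = W.copy()
    for i, j in must_links:
        W_mod[i, j] = W_mod[j, i] = w_max
    for i, j in cannot_links:
        W_mod[i, j] = W_mod[j, i] = w_min
    return W_mod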



5.2.2 Datasets and Data Representation

In the experiments summarised in this chapter we have used five datasets:

(i) WebKB Universities. This text dataset is the one labelled as (i) in the Avoiding Bias experiments (see Section 4.4.1). It is a subset of WebKB's Universities Dataset, which contains web pages from the websites of different U.S. universities that have been manually tagged according to two aspects: university and topic. For this subset we have taken the documents from Cornell, Texas, Washington and Wisconsin universities and dropped those corresponding to "misc", "other" and "department". This yields a total of 1087 documents, which in this dataset are distributed in four groups (one for each university).

(ii) WebKB Topics. The same dataset as (i), but this time distributed in five groups, corresponding to the topics "course", "faculty", "project", "staff" and "student".

(iii) Vehicle Silhouettes. This numeric dataset was created in the Turing Institute (Glasgow, Scotland) extracting certain image features (such as compactness, circularity, elongatedness, etc.) from the silhouettes of four kinds of vehicles: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. It contains 846 data instances distributed in four groups (the four kinds of vehicles) with 18 numerical attributes (one for each of the image features). The dataset is available in the UCI Machine Learning repository [Asuncion and Newman, 2007].

(iv) News 3 Related. This dataset is a sample of three categories of the 20 Newsgroups collection [Asuncion and Newman, 2007], composed of e-mails belonging to twenty Usenet newsgroups which deal with diverse topics. Following the same approach as Basu et al. [2004b], we have chosen 300 documents randomly from each of the categories talk.politics.misc, talk.politics.guns and talk.politics.mideast, yielding a total of 900 documents distributed in these three groups.

(v) Letter AOZ. This dataset is a subset of the Letter Recognition dataset [Asuncion and Newman, 2007], taking the examples for the letters "A", "O" and "Z" for a total of 2276 data instances, with 16 primitive numerical attributes (such as statistical moments and edge counts) of character images of those letters of the English alphabet.

This selection of datasets covers an interesting range of scenarios. As was pointed out by Basu et al. [2004b], the clustering of small datasets comprised of sparse high-dimensional data is notably difficult, as the clustering algorithms are more prone to fall into local minima. This is the case of our textual datasets, (i), (ii) and (iv), as they are composed of sparse data points with very high dimensionality (as is usually the case with data points representing text documents, where each dimension stands for a term of the collection) and contain a small number of data instances (compared to that high dimensionality). Datasets (iii) and (v) are numeric datasets, with dimensionalities much smaller than those of the textual collections.



Table 5.2: Distribution of the data in the datasets used in the study

Dataset (i): WebKB-Universities
Wisconsin  Texas  Cornell  Washington  total
      320    255      247         265   1087

Dataset (ii): WebKB-Topics
staff  project  course  student  faculty  total
   46       86     244      558      153   1087

Dataset (iii): Vehicle Silhouettes
bus  opel  saab  van  total
218   212   217  199    846

Dataset (iv): News 3 Related
misc  guns  mideast  total
 300   300      300    900

Dataset (v): Letters AOZ
  A    O    Z  total
789  753  734   2276

Table 5.2 shows a summary of the distribution of the data points in the five datasets.

The textual documents in datasets (i), (ii) and (iv) have been represented using Mutual Information (see Section 3.2.1). In the case of datasets (iii) and (v), as the data is already numerical, each data point was represented directly by vectors in R^18 and R^16 respectively, which were created concatenating the 18 features of the points in the case of (iii) and the 16 features for (v). Both in the partitional algorithms (in the comparisons with the centroids) and in the spectral algorithms (to set the values of the weights of the graph and in the segmentation of the projected data points) the similarity measure used between the vectors representing the data points was the cosine distance (see Section 3.2.2).

5.2.3 Constraint Creation

As previously mentioned, the objective of the experiments summarised in this chapter is to study the behaviour of Constrained Clustering algorithms under more realistic circumstances, where not all the domain knowledge supplied to them is accurate. To do so, sets of constraints with a growing ratio of false (non-accurate) constraints were created from the reference grouping used as clustering ground truth. The initial set of truthful constraints was created by randomly selecting pairs of data instances which in that reference partition belonged to the same cluster (to create the Must-links) or to different clusters in the reference (to create the Cannot-links).

To create the non-accurate constraints we have followed two different strategies. In the first one (identified as "RND" in the results) the spurious constraints were created randomly, in a process similar to the one



carried out to create the accurate constraints, but this time reversed. That is, in this approach, the false Must-links were created by randomly selecting pairs of data points which belonged to different clusters in the reference grouping, and the false Cannot-links were created by selecting pairs belonging to the same clusters.

Even though it can be used as a first approximation to model a real-world Constrained Clustering problem, with this strategy the spurious constraints are evenly distributed among all the possible ones, a fact which can be argued to be unrealistic. Thus, with our second strategy we have tried to devise a more realistic approach.

In this second constraint generation strategy (identified as "SIM" in the results) the inaccurate constraints were created using the similarity between data instances. What we are capturing with this approach is the intuition that the errors in the constraints are likely to happen between pairs of data points which do not seem to belong in the same cluster, in the case of erroneous negative constraints, or in different ones, in the case of erroneous positive constraints.

Concretely, we think that, when a human user or an automatic constraint extraction system is creating positive constraints, the possible spurious ones will be concentrated between data points which, belonging in different clusters, are very similar. Thus, to create the spurious Must-link constraints we have followed two steps: first, all the similarities between pairs of data points belonging to different clusters in the reference grouping are calculated. Then, the pairs with the highest similarities are chosen and a Must-link constraint is created between them. These constraints will be spurious with respect to the reference grouping, as we are only considering pairs of points which are in different clusters in it. In the case of the inaccurate negative constraints, the rationale is similar: they will be concentrated between data points which, belonging in the same cluster, are very dissimilar. Hence, the process to create them is analogous, but this time using the lowest similarities between data points that are in the same cluster in the reference grouping. As it stems from these steps, the spurious positive and negative constraints created using this second approach will be the same in all the initialisations of the constraints (unlike the accurate constraints, which are generated in the way explained above).
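The two noise models can be sketched as follows for positive constraints (the negative case is symmetric, selecting same-cluster pairs and, in the SIM strategy, the lowest similarities); labels is the reference grouping and S a pairwise similarity matrix.

import random

def rnd_noisy_must_links(labels, n_false):
    # RND: false Must-links drawn uniformly at random between points that
    # belong to different clusters of the reference grouping.
    pairs = [(i, j) for i in range(len(labels)) for j in range(i + 1, len(labels))
             if labels[i] != labels[j]]
    return random.sample(pairs, n_false)

def sim_noisy_must_links(labels, S, n_false):
    # SIM: false Must-links placed between the most similar pairs of points
    # that belong to different clusters of the reference grouping.
    pairs = [(i, j) for i in range(len(labels)) for j in range(i + 1, len(labels))
             if labels[i] != labels[j]]
    pairs.sort(key=lambda p: S[p[0]][p[1]], reverse=True)
    return pairs[:n_false]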

The initial amount of true constraints used in the experiments depended on the kind of constraints used. For the Must-links it was 1% of the possible truthful positive constraints of each collection. For the Cannot-link constraints, due to their lesser informativeness, it was 5% of the possible negatives, in order to have in all algorithms a starting point which perceptibly improved the effectiveness of their non-constrained counterparts. As for the non-accurate constraints, ten different amounts of them were tested, starting with 10% of the truthful constraints and ending with the same amount of true and false constraints, in steps of 10%. Finally, the transitive closure of the constraints was only performed when they were fed to Constrained k-Means, as it is entailed by the absoluteness of the constraints. For the other clustering algorithms no transitive closure was used.
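For reference, the Must-link part of this transitive closure can be sketched with a union-find structure (an illustration only; the Cannot-links entailed between the resulting groups are omitted).

def must_link_closure(n, must_links):
    # Propagate the absolute Must-links: points connected through a chain of
    # Must-links end up in the same group, and every pair inside a group is
    # Must-linked in the closure.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in must_links:
        parent[find(i)] = find(j)
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return [(a, b) for members in groups.values()
            for idx, a in enumerate(members) for b in members[idx + 1:]]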



Table 5.3: Number of eigenvectors used in the spectral methods

Dataset                        i    ii   iii   iv    v
Number of clusters wanted      4     5     4    3    3
Number of eigenvectors used   26    11    18    8    4

5.2.4 Other Details

Metrics

In order to assess the effectiveness of the different clustering algorithms we have compared the outcomes of the algorithms with the reference groupings using three metrics: Adjusted Rand Index, Purity and Entropy (see Section 3.4 for the first two and [Rosell et al., 2004] for the third one). However, as the results for the three metrics show the same trends, only the results for Adjusted Rand Index (ARI) are presented, in order to provide the clearest possible picture of the torrent of data yielded by the experiments.
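As an illustration, ARI can be computed with a standard library call (scikit-learn is used here only as an example; the toy label vectors below are hypothetical).

from sklearn.metrics import adjusted_rand_score

reference = [0, 0, 1, 1, 2, 2]    # reference grouping
clustering = [1, 1, 0, 0, 2, 2]   # clustering output (same partition, renamed)
print(adjusted_rand_score(reference, clustering))  # 1.0: chance-corrected agreement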

Parameters of the Algorithms

Again, in this study we have considered that the number of clusters (k) present in the data was known; therefore, the number of desired clusters was set to that amount in each dataset.

Moreover, in the experiments we have detected once more that the quality of the results of the spectral methods improved considerably if the number of eigenvectors (d, see Section 2.3.4) used is larger than k. In all the spectral clustering algorithms (constrained and unconstrained) we have used in each collection the number of eigenvectors which yielded the best results in Normalised Cut (see Table 5.3).

The choice of datasets used in these experiments enables us to make some interesting observations and hypotheses about this situation, which can be attributed to a wide array of factors. For instance, in the textual collections the documents in each cluster are quite heterogeneous, and so using a small number of eigenvectors cannot capture all the information needed to decide to which cluster a document should be assigned. The comparison of the number of eigenvectors used for datasets (i) and (ii) offers some evidence supporting this hypothesis: even though in both cases the underlying documents are the same, when they are grouped by universities (and so each cluster is more heterogeneous, as it contains documents from all the five topics) the best number of eigenvectors is much bigger than the best number when trying to find the grouping by topic. Furthermore, other important factors which might help to explain this circumstance, both in textual and numerical collections, are the relatively small number of desired clusters (so that keeping only k eigenvectors would cause a loss of information) or even the process used to build the graph, in which we did not prune any edge (i.e. it was totally connected), even if the associated weight was small. However, it should also be noted that taking too many dimensions is in fact harmful to the effectiveness of the algorithms, as it only adds noise to the projected data points.



Hence, the only parameters left to be set are the strength of the constraints (w) in SCKM and the degree of enforcement of the constraints (β) in CNC. We have tested several values in order to show their effect on the performance of the algorithms in an environment affected by noise in the constraints.

Multiple Initialisations and Statistical Significance

Due to the dependency of the k-Means based algorithms on the initial set of seeds we tested five different initialisations of them, which were created taking data points randomly from the data to cluster. These same sets of seeds were used in the clustering of the projection of the data points in NC, CNC and NCIC, as we have used k-Means to carry out that process. Also, in each initialisation the order in which the data points were examined was randomised, so as to minimise the effect of this factor in CKM and SCKM.

Moreover, given the way in which constraints are created, which entails that at the very least (when using the similarity-based method to create inaccurate constraints) a large part of them are chosen randomly, we have opted for using a strategy similar to that used for the initialisations of the seeds, testing also five different sets of constraints in order to have a better representation of the behaviour of the algorithms. Hence, each reported result is the average of twenty-five different partitions (five initialisations of the seeds and five initialisations of the constraints).

As for the statistical significance, we have used a lower-tailed Sign Test, as advanced in Section 3.5.1. For a given amount of false constraints, five observations (X_i, Y_i), i ∈ [1..5], were considered, one for each initialisation of the seeds, where X_i is the ARI of the non-constrained method and Y_i is the average of the ARIs of the constrained method over the five initialisations of the constraints. Hence, the null hypothesis used in the test is that the quality of the results of the baseline is greater than or comparable to that of the constrained methods.
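One possible formulation of this test is sketched below; x and y would hold the baseline and constrained-method ARIs (one pair per seed initialisation), ties are dropped and the p-value is the binomial tail probability under the null hypothesis.

from math import comb

def sign_test_p_value(x, y):
    # Number of initialisations where the constrained method beats the baseline;
    # under H0 this count follows a Binomial(n, 0.5) distribution.
    diffs = [yi - xi for xi, yi in zip(x, y) if yi != xi]
    n, successes = len(diffs), sum(d > 0 for d in diffs)
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n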

5.3 Results

Figures 5.1-5.5 show the results obtained by the algorithms analysed in each dataset. In the case of CNC and SCKM, only a few runs (those corresponding to the most interesting values of each algorithm's parameter) are shown. The results of the unconstrained methods (KM and NC) are also plotted as dotted straight lines.

In each figure, the top graphs show the results with positive constraints and the bottom ones the results with negative constraints, while the graphs on the left show the results with the first constraint generation method (RND) and the ones on the right the results with the second one (SIM). In each graph the horizontal axis shows the ratio between the non-accurate and the accurate constraints, with the points at x = 0 indicating the values without any false constraints. The exact number of accurate constraints used in each experiment is shown in the title of each subfigure. Table 5.4 shows the results achieved by the non-constrained versions of the algorithms analysed.



Table 5.4: Results (ARI) of the non constrained methods k-Means (KM) and Normalised Cut (NC)

Dataset    i      ii     iii    iv     v
KM       0.130  0.208  0.127  0.273  0.489
NC       0.215  0.332  0.175  0.432  0.557

Table 5.5: Results (ARI) of the constrained methods with 1% of the possible positive accurate constraints and without inaccurate constraints

Dataset              i      ii     iii    iv     v
SCKM w = 0.00125     -      -      -      -    0.732
SCKM w = 0.0025      -      -      -      -    0.792
SCKM w = 0.005       -      -      -    0.429  0.743
SCKM w = 0.0125    0.298  0.541  0.177  0.468    -
SCKM w = 0.025     0.399  0.685  0.192  0.502    -
SCKM w = 0.05      0.343  0.642  0.165    -      -
NCIC               0.525  0.424  0.070  0.903  0.610
CNC β = 5          0.932  0.904  0.680  0.875  0.941
CNC β = 10         0.925  0.906  0.700  0.876  0.992
CNC β = 20         0.848  0.913  0.670  0.860  0.980
CNC β = 30         0.854  0.915  0.666  0.877  0.945

Tables 5.5 and 5.6 show the results of the algorithms analysed without inaccurate constraints (i.e. the initial point in the graphs) for the values of the parameter of each method shown in the figures.

As a preliminary note, our experiments showed the enormous problems faced by Constrained k-Means with a moderate number of constraints, which in most cases make clustering impossible. These problems were not limited to a stagnation of the algorithm when using negative constraints (see page 12) but also appeared when using positive constraints: due to the effect of the Must-links, occasionally one or more clusters were empty after the data points were assigned, which made the recalculation of the centroids and the progress of the algorithm impossible. It should be noted that CKM is a deterministic algorithm, that is, clustering the same documents with the same constraints and the same seeds will always yield the same results, following the exact same process. Hence, running the algorithm again without altering any of those factors would be of no use to prevent these problems. As we

Table 5.6: Results (ARI) of the constrained methods with 5% of the possible negative accurate constraints and without inaccurate constraints

Dataset             i      ii     iii    iv     v
SCKM w = 0.0025   0.408  0.391  0.514  0.883    1
SCKM w = 0.005    0.815  0.488  0.840  0.995    1
SCKM w = 0.0125   0.987  0.667  0.999    1      1
NCIC              0.207  0.363  0.764  0.471  0.940



Figure 5.1: Results for collection (i). (Four panels plot ARI against the ratio between non-accurate and accurate constraints: POS-RND and POS-SIM, with 1466 accurate positive constraints, compare KM, SCKM (w = 0.0125, 0.025, 0.05), NC, NCIC and CNC (β = 5, 10, 20, 30); NEG-RND and NEG-SIM, with 21747 accurate negative constraints, compare KM, SCKM (w = 0.0025, 0.005, 0.0125), NC and NCIC.)



Figure 5.2: Results for collection (ii). (Four panels plot ARI against the ratio between non-accurate and accurate constraints: POS-RND and POS-SIM, with 1969 accurate positive constraints, compare KM, SCKM (w = 0.0125, 0.025, 0.05), NC, NCIC and CNC (β = 5, 10, 20, 30); NEG-RND and NEG-SIM, with 19232 accurate negative constraints, compare KM, SCKM (w = 0.0025, 0.005, 0.0125), NC and NCIC.)


[Figure 5.3 shows four plots of ARI against the ratio between non-accurate and accurate constraints for collection (iii): POS-RND and POS-SIM (891 accurate constraints each) and NEG-RND and NEG-SIM (13413 accurate constraints each). The curves correspond to KM, SCKM (w = 0.0125, 0.0250, 0.0500), NC, NCIC and CNC (β = 5.0, 10.0, 20.0, 30.0) in the positive-constraint plots, and to KM, SCKM (w = 0.0025, 0.005, 0.0125), NC and NCIC in the negative-constraint plots.]

Figure 5.3: Results for collection (iii).


[Figure 5.4 shows four plots of ARI against the ratio between non-accurate and accurate constraints for collection (iv): POS-RND and POS-SIM (1345 accurate constraints each) and NEG-RND and NEG-SIM (13500 accurate constraints each). The curves correspond to KM, SCKM (w = 0.0050, 0.0125, 0.0250), NC, NCIC and CNC (β = 5.0, 10.0, 20.0, 30.0) in the positive-constraint plots, and to KM, SCKM (w = 0.0025, 0.0050, 0.0125), NC and NCIC in the negative-constraint plots.]

Figure 5.4: Results for collection (iv).


[Figure 5.5 shows four plots of ARI against the ratio between non-accurate and accurate constraints for collection (v): POS-RND and POS-SIM (8630 accurate constraints each) and NEG-RND and NEG-SIM (86297 accurate constraints each). The curves correspond to KM, SCKM (w = 0.00125, 0.00250, 0.00500), NC, NCIC and CNC (β = 5.0, 10.0, 20.0, 30.0) in the positive-constraint plots, and to KM, SCKM (w = 0.0025, 0.005, 0.0125), NC and NCIC in the negative-constraint plots.]

Figure 5.5: Results for collection (v).


As we have considered that altering the seeds or the constraints due to CKM's behaviour would be unfair for the other clustering algorithms, both situations, with Must-links and Cannot-links, were treated as errors in the clustering process. As these errors affected many runs, we have chosen to discard the results of CKM and not to show them in the tables or graphs. Anyhow, we were able to extract some interesting insights from the runs which were conducted successfully (mainly with positive constraints). While the initial (i.e. without inaccurate constraints) quality values were in those cases high (comparable to Constrained Normalised Cut), they dropped very fast as false constraints were introduced. We think, in tune with what happens with the Chunklet Method in Nelson and Cohen [2007] (see Section 5.5.1), that this is due to the absoluteness of the constraints and the inherent transitiveness that they entail, which multiplies the effect of the inaccurate ones.

Commenting on the results of the algorithms, one of the most remarkable among them is the improvement attained by Constrained Normalised Cut over the results of the non-constrained method in all datasets, which can be observed by comparing the values of NC and CNC in Tables 5.4 and 5.5. The decrease in quality as more inaccurate constraints are added is more marked for the higher values of β, with the runs made with the lower values of that parameter (which controls the degree of enforcement of the constraints) showing that, with a less tight observance of the constraints, the algorithm is able to maintain good results (better than Soft Constrained k-Means) until the noise reaches levels which could be deemed unrealistic for a lot of real-world applications. In fact, using the similarity-based constraint generation strategy this decrease in the quality values is so greatly softened that even for the highest ratios of bad constraints the baseline is still improved. Our intuition about the causes of this trend (which appears in the runs of all the algorithms, with both positive and negative constraints) is given at the end of this section.

As for Soft Constrained k-Means, if we inspect Tables 5.4, 5.5 and 5.6 we can see that, when using positive information, the original improvement over the non-constrained algorithm (k-Means) is more modest than in the case of CNC. Even so, in the text datasets, (i), (ii) and (iv), the improvement over its own baseline is kept until the highest noise levels for the lowest value of w. Moreover, with the first constraint generation strategy in several cases the results of SCKM also outperform unconstrained NC with moderate amounts of inaccurate constraints, a situation that with the SIM strategy lasts until the highest ratios. This resistance to noise can be attributed to how SCKM incorporates the constraints (see Section 4.3.1). Without noise, when a data instance is assigned the constraints will boost the score of one cluster (the one containing most of the data points which should be in the same cluster as the point being assigned). But, as the ratio of inaccurate constraints grows, the differences between the modifiers applied to the clusters will be smaller, due to the noise in the constraints. Hence, the accurate and the inaccurate constraints will cancel each other out, and the assignment will again rely mostly on the similarities between data points and centroids.
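To make this mechanism concrete, the sketch below illustrates a constraint-aware assignment step of this kind. It is a minimal illustration and not the exact SCKM formulation of Section 4.3.1: the function names and the simple form of the modifier (a fixed weight w added for every must-linked point already placed in a candidate cluster, and subtracted for every cannot-linked one) are assumptions made only for the example.

```python
import numpy as np

def assign_with_soft_constraints(x, centroids, assignment, must, cannot, w):
    """Pick a cluster for data point `x`, combining similarity with soft pairwise
    constraints. `assignment` maps already-placed point indices to cluster ids;
    `must`/`cannot` are the indices linked to this point. Illustrative sketch only."""
    scores = []
    for c, centroid in enumerate(centroids):
        # Base score: cosine similarity between the point and the centroid.
        sim = np.dot(x, centroid) / (np.linalg.norm(x) * np.linalg.norm(centroid))
        # Constraint modifier: reward clusters already holding must-linked points,
        # penalise clusters already holding cannot-linked points.
        bonus = w * sum(1 for j in must if assignment.get(j) == c)
        penalty = w * sum(1 for j in cannot if assignment.get(j) == c)
        scores.append(sim + bonus - penalty)
    return int(np.argmax(scores))
```

Under heavy noise the bonus and penalty terms computed from accurate and inaccurate links tend to offset each other, so the decision falls back to the similarity term, which is consistent with the robustness observed above.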

When SCKM is used with noisy Cannot-links created with the RND strategy the situation is slightly different: whereas the improvement over KM is greater (due to the larger number of constraints), the fall in the quality values is much sharper. This happens because, for a certain data point, the true Cannot-links affecting it will be evenly distributed between data points in the other clusters. However, its false constraints will all be with points in the same cluster, and so a small number of them is enough to counteract the effect of the true ones. From that point on, adding more false Cannot-links significantly harms the quality of the clustering. Following the global trend, if the noisy constraints are generated with the SIM strategy the problem is greatly eased.

In our experiments, the approach which we have dubbed Normalised Cut with Imposed Constraints has shown an irregular behaviour even when using only good constraints. With positive constraints, in the case of datasets (i) and (iv) NCIC improves NC noticeably, obtaining smaller improvements in datasets (ii) and (v) and a non-existent one in dataset (iii), where using NCIC to incorporate the constraints is actually harmful. Using true negative constraints, the situation is the opposite: NCIC improves NC greatly in datasets (iii) and (v), but offers poorer effectiveness in (iv), (ii) and (i), the results in this last dataset being slightly inferior to those of NC. This situation is caused by how NCIC uses the constraints. The algorithm changes the weights of the edges joining vertices that represent data points involved in positive and negative constraints to, respectively, the maximum or minimum possible similarity values (namely 1 and 0, using the cosine distance). In Datasets (i) and (ii) (which are composed of the same data points, but distributed according to different criteria) and (iv) the similarities are very low, which amplifies the effect of the Must-links and reduces the effect of the Cannot-links. In Datasets (iii) and (v) the similarity values are very high, and so the impact of the Cannot-links is boosted and the effect of the Must-links is lessened. As for why the NCIC results with constraints are in some cases worse than without them, we think that this is caused by the change of weights induced by the constraints greatly altering the original similarity space and deeply affecting the relations between the data points, such that the projection made by the algorithm is not able to represent faithfully either the similarities between data points or the constraints. However, further experiments should be conducted in order to test this hypothesis. Concretely, given that the other algorithms obtain good results with the same sets of constraints, we do not think that this can be attributed to a problem with the utility of the constraints (see Section 2.5.4).
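The weight substitution that drives this behaviour can be sketched in a few lines. This is only an illustrative fragment: the variable names and the dense-matrix representation are assumptions, while the substitution of constrained edges by the extreme similarity values (1 for Must-links, 0 for Cannot-links) is the mechanism described above.

```python
import numpy as np

def impose_constraints(similarity, must_links, cannot_links):
    """Return a copy of the affinity matrix in which edges between must-linked
    points are set to the maximum similarity (1) and edges between cannot-linked
    points to the minimum (0), before running the spectral clustering step."""
    W = np.array(similarity, dtype=float, copy=True)
    for i, j in must_links:
        W[i, j] = W[j, i] = 1.0   # large jump if the original similarity was low
    for i, j in cannot_links:
        W[i, j] = W[j, i] = 0.0   # large jump if the original similarity was high
    return W
```

The size of the jump from the original weight to 1 or 0 is exactly what the analysis above refers to: when the original similarities are low the Must-link substitutions distort the graph the most, and when they are high the Cannot-link substitutions do.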

As for the behaviour of NCIC under noisy constraints, it follows the same irregular pattern indicated in the previous paragraph. Thus, as with the accurate constraints, when the difference between the similarity values and the value with which they are substituted is large (Must-links in (i), (ii) and (iv), Cannot-links in (iii) and (v)) the effect of the inaccurate constraints is high, and, when that difference is smaller (Cannot-links in (i), (ii) and (iv), Must-links in (iii) and (v)), the effects of the noise are attenuated. This is confirmed by the differences found in the behaviour of the algorithm depending on the false constraint generation method used. Following the trend appreciable in the other analysed algorithms, the fall in NCIC's performance is smaller when the inaccurate constraints are generated with the SIM method. This is because with this method the pairs of data points with the highest values of similarity are used to build the false positive constraints (and the pairs with the smallest ones the false negatives), and thus the actual changes caused by these inaccurate constraints over the weights of the graph are smaller in this case.

This last result also offers an explanation of why all the analysed algorithms are more robust to the false constraints created with SIM than to those created with RND. With RND the bad constraints will be evenly distributed, but with SIM, following our intuition about which errors a user or an automatic constraint creation scheme is more likely to make, they will be concentrated on pairs of data instances which, despite belonging to different clusters, are very similar, and vice versa. Thus, we are introducing false constraints between pairs of data instances over which the algorithms themselves were already prone to make errors. Consequently, the results point out that, in a real-world situation, if the inaccurate constraints are a product of misjudgements caused by the similarity between data points (i.e. not of systematic or random errors), their effect might be lessened (but apparent nonetheless).
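As an illustration of the two noise models compared here, the sketch below generates false Must-links with each strategy from a reference partition. It is a simplified reading of the setup of Section 5.2.3 rather than the exact experimental procedure; the function names and the use of a precomputed similarity matrix are assumptions.

```python
import random

def false_must_links_rnd(labels, n):
    """RND: pick random pairs that actually belong to different clusters."""
    pairs, idx = [], list(range(len(labels)))
    while len(pairs) < n:
        i, j = random.sample(idx, 2)
        if labels[i] != labels[j]:
            pairs.append((i, j))
    return pairs

def false_must_links_sim(labels, similarity, n):
    """SIM: take the most similar pairs among those in different clusters,
    mimicking the mistakes a user is most likely to make."""
    candidates = [(similarity[i][j], i, j)
                  for i in range(len(labels))
                  for j in range(i + 1, len(labels))
                  if labels[i] != labels[j]]
    candidates.sort(reverse=True)            # highest similarity first
    return [(i, j) for _, i, j in candidates[:n]]
```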

5.3.1 Statistical Significance

As for the statistical significance, Table 5.7 shows, starting from the point where all the constraints are accurate, the largest ratio between bad and good constraints for which the improvement of the constrained method over the non-constrained one is still significant, that is, for which the null hypothesis is rejected with a p-value ≤ 0.05. A “na” entry means that that parameter was not tested for that method, while “—” means that none of the combinations of accurate and inaccurate constraints yielded a significant improvement over the baseline (not even using only accurate constraints).
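The significance criterion can be reproduced with a paired sign test over the per-initialisation results. The snippet below is a hedged sketch of that computation; it assumes one ARI value per seed initialisation for both the constrained method and its baseline, and a one-sided test, which may differ from the exact test configuration used in the experiments.

```python
from scipy.stats import binomtest

def sign_test_improves(constrained_ari, baseline_ari, alpha=0.05):
    """Paired sign test: does the constrained method beat its baseline
    significantly often across seed initialisations? Ties are discarded."""
    wins = sum(c > b for c, b in zip(constrained_ari, baseline_ari))
    losses = sum(c < b for c, b in zip(constrained_ari, baseline_ari))
    n = wins + losses
    if n == 0:
        return False
    # One-sided test against the null hypothesis of equal chances (p = 0.5).
    p_value = binomtest(wins, n, 0.5, alternative="greater").pvalue
    return p_value <= alpha
```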

Overall, these statistical significance data show the same trends as the figures and tables commented upon before, as in most cases the last significant point is the one just before the constrained method falls below its baseline, which supports the analysis that we have made in this section. However, there are still some peculiarities worth noting.

It is well known that the results of the k-Means algorithm and of the algorithms based on it are quite sensitive to the seeds of the clustering. In this study we have worked with random initialisations of the seeds, relying on the smoothing effect of averaging over several of them to obtain a faithful representation of the effectiveness of the algorithms. However, the Sign Test compares individually the results of each constrained method and its baseline for each initialisation, and hence the effects of the aforementioned dependency are in some cases revealed. This is readily apparent in the results for Dataset (iii), for which in several cases the accurate constraints are not able to attain a significant improvement. Another example is SCKM with positive constraints, where the last significant improvement usually comes a bit earlier than the point up to which the results of this method are better on average than those of KM (for this algorithm the problem of the dependency on the seeds is also aggravated by its dependency on the order in which the data points are examined). Even so, it should be noted that the improvements are still significant until a considerable amount of false constraints is added.


Table 5.7: Largest ratio of inaccurate constraints for which the improvement over the baseline is significant (p-value ≤ 0.05); alg. param. = parameter of the Constrained Clustering algorithm (β for CNC, w for SCKM)

Positive constraints

alg.   param.     POS-RND                        POS-SIM
                  i     ii    iii   iv    v      i     ii    iii   iv    v
SCKM   0.00125    na    na    na    na    1      na    na    na    na    0.4
SCKM   0.0025     na    na    na    na    0.9    na    na    na    na    0.5
SCKM   0.005      na    na    na    0.2   0      na    na    na    0.2   0.2
SCKM   0.0125     0.7   0.4   —     0.1   na     1     0.9   —     —     na
SCKM   0.025      0.3   0.3   —     —     na     0.9   0.7   —     —     na
SCKM   0.05       0.1   0.2   —     na    na     0.5   0.6   —     na    na
NCIC              0.9   0.4   —     0.9   1.0    1     0.1   —     1     0.8
CNC    5          0.6   0.4   0     0.3   0.7    1     1     0.2   0.9   1
CNC    10         0.2   0.2   0     0.1   0.6    0.6   0.8   0.2   0.6   1
CNC    20         0.1   0.1   0     0     0.4    0.4   0.4   0.2   0.3   1
CNC    30         0.1   0.1   0     0     0.4    0.3   0.3   0.2   0.2   1

Negative constraints

alg.   param.     NEG-RND                        NEG-SIM
                  i     ii    iii   iv    v      i     ii    iii   iv    v
SCKM   0.0025     0.2   0.2   0.1   0.3   0.3    0.9   1     1     0.7   0.3
SCKM   0.005      0.2   0.3   0.1   0.4   0.3    1     1     1     1     0.3
SCKM   0.0125     0.1   0.2   0     0.3   0.3    0.9   1     1     1     0.3
NCIC              —     0.2   0     0.4   0.4    —     1     0.2   1     0.2


5.4 Conclusions of the Study

With this study we have tested the behaviour of four different Constrained Clustering algorithms when the information supplied to them is not entirely accurate. Whereas one of the algorithms, Constrained k-Means, showed its unsuitability with a high number of positive or negative constraints, we have found that, as expected, each of the other methods showed different features which point out the scenarios where employing it might be the soundest choice.

When using positive constraints (the only kind of constraints that it can use), Constrained Normalised Cut was the most effective Constrained Clustering algorithm under the initial conditions (i.e. without inaccurate constraints). For the smallest values of the parameter β, when the false constraints were added the effectiveness remained good in most cases until reaching high ratios of inaccurate constraints. Hence, CNC would be the best option when the amount of false constraints is moderate to high and the computational cost (both in time and space) is not a crucial issue.

For its part, while Soft Constrained k-Means improves on its baseline (k-Means), it usually performs worse than CNC when using positive constraints, and in most cases its effectiveness with higher ratios of inaccurate constraints is even worse than that of Normalised Cut, which is a non-constrained method. These results suggest that, when using positive information, SCKM should be used only if the computational cost is critical. In the case of negative constraints, its effectiveness without false constraints was very good, reaching very high quality values. When bad negative constraints were added using the more realistic approach, the quality of the results was still very good until the highest noise levels, which shows that SCKM is an algorithm very suitable for incorporating negative information into clustering, even without restrictions on the spatial or temporal costs of the process.

The Normalised Cut with Imposed Constraints algorithm showed an irregular behaviour in the experiments, with initial results which, in the best case, do not outperform the other methods and, in the worst case, are actually below its own baseline. Even though the degradation of the results when adding inaccurate constraints was remarkably low, this circumstance causes NCIC to be outperformed by NC and SCKM with both positive and negative noisy constraints (except at the highest noise levels when the constraints are generated with the “random” strategy). Moreover, this algorithm does not have any advantage in terms of cost over the other methods. The conjunction of these results recommends that using NCIC in real-world clustering problems should, at the very least, be considered very carefully.

On a more general level, the results of our experiments reinforce the idea, previously pointed out by Nelson and Cohen [2007] (see Section 5.5.1), that the degree of truthfulness of the constraints should be taken into account when tuning the clustering algorithms, adjusting the observance of the constraints required of the algorithm according to the suspected ratio of inaccurate constraints in the available information.

Moreover, the comparison between the results obtained with the two different false constraint generation methods suggests that, when the errors in the constraints are a product of misjudgements induced by high or low similarity between data points, the effect of the inaccurate constraints is likely to be lessened.

5.5 Related Work

5.5.1 Revisiting Probabilistic Models for Clustering with Pairwise Constraints, Nelson & Cohen (2007)

To the best of our knowledge, prior to our study the problem of the robustness of Constrained Clustering algorithms to noise had only been thoroughly addressed by Nelson and Cohen [2007]. Their work tests the performance with noisy constraints of three existing probabilistic Constrained Clustering algorithms [Shental et al., 2004; Lu and Leen, 2005; Lange et al., 2005] and another one introduced by them in that paper (the Sampled Chunklet Algorithm).

These constraints are extracted from the reference grouping, taking pairs of data points randomly and mislabelling a portion of them (i.e. turning some true Must-links into Cannot-links and vice versa), an approach very similar to the one which we have labelled RND in our experiments. An interesting feature of Nelson and Cohen's constraints is that each of them is annotated with a certainty value, which indicates the confidence in the veracity of the constraint, measured on a scale from 0 (minimum confidence) to 1 (maximum confidence). In their experiments, given the synthetic nature of the constraints, this value was artificially created using the beta distribution. Namely, the certainty was sampled from a β(5, 1) for accurate constraints and from a β(1, 5) for the inaccurate ones, reflecting the authors' underlying assumption that an expert should have higher certainty in correct constraints than in erroneous ones. In the first case the mass of the beta distribution is concentrated on large values (the mean value of the distribution is 5/6), whereas in the second case the mass is concentrated on small values (the mean value is 1/6). This degree of confidence of the source of the constraints in each piece of advice is a probability in a strict mathematical sense, which is used in the experiments to set the penalty associated with violating the constraints (i.e. their weights), directly in their algorithm and, using a result derived by them in the paper, in the other methods. Hence, the weights created from these probabilities have the great advantage over other approaches of having a clear interpretation.
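A hedged sketch of how such synthetic certainty values can be generated is shown below. The Beta parameters are the ones reported for Nelson and Cohen's setup, while the surrounding code (names, use of NumPy) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_certainty(is_accurate: bool) -> float:
    """Sample a synthetic confidence value for a constraint.
    Accurate constraints ~ Beta(5, 1), mean 5/6 (mass near 1);
    inaccurate constraints ~ Beta(1, 5), mean 1/6 (mass near 0)."""
    a, b = (5, 1) if is_accurate else (1, 5)
    return float(rng.beta(a, b))
```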

The results of their experiments underline the importance of taking into account the degree of noise in the constraints when adjusting the adherence of the algorithms to these pieces of advice. Moreover, and related to this, they also point out the huge problem posed by approaches which use absolute constraints (and by the transitive closure entailed by them), as their effectiveness drops dramatically when the constraints are noisy. A similar consideration was made by Basu et al. [2004b], who recommend not using the transitive closure of the constraints if these are known to be noisy.

Compared with our study, these two important insights were also apparent in the results of our experiments, which were conducted over completely different families of algorithms: they deal with probabilistic clustering algorithms, while our study is centred on spectral and flat partition algorithms. Moreover, even though the weights of the constraints used in our experiments lack such a clear interpretation, we have used in our analysis a more realistic noise model. Finally, Nelson and Cohen use in their paper only small numeric datasets, while we also carried out our experiments over textual datasets, which are more challenging due to their nature, as they result in very sparse data points.

5.5.2 Spectral Clustering with Inconsistent Advice, Coleman et al. (2008)

In [2008] Coleman et al. study how to deal with inconsistent advice in the framework of spectral clustering. In order to do so, they devise three different methods, based on combining a spectral clustering problem (Normalised Cut) over the data points with a 2-correlation clustering problem over the constraints. Namely, the latter offers a partition of the data focused only on respecting as many constraints as possible, which is later used to limit the acceptable solutions of the Normalised Cut problem, which does take into account the similarity between entities.

They test the methods over six numeric datasets, using two different kinds of synthetic inconsistent advice, one concentrated over a small subset of the data space and the other spanning all the pairs of data points, with each constraint independently agreeing with the actual classification with the same probability (again similar to the RND approach of our study). The results of these experiments hint again that, when the noise in the advice rises, more freedom has to be given to the algorithm to ignore it in order to achieve good results. However, it is worth noting that the authors admit that their work is basically theoretical, and that they focus on obtaining only two clusters from the data. Hence, they assert that more experimental work is needed in that area. It should also be remarked that Coleman et al. are dealing at all times with inconsistent advice, something which is stronger than having inaccurate advice (as we already indicated in Section 2.5.3, having an inconsistent set of constraints implies that some of them are erroneous, but not the other way around). Consequently, the utility of their work for tackling the problem of inaccurate constraints is limited, since as long as the constraints are consistent correlation clustering will provide a partition of the data which respects all of them, which will in turn influence the final partition of the data.

5.5.3 Training Data Cleaning for Text Classification, Esuli & Sebastiani (2009)

As stems from the previous descriptions of the related papers, to the best of our knowledge this is the first time that the problem of how to create realistic noise in constraints is addressed. In the field of classification, Esuli and Sebastiani propose in [2009] an approach to create realistic false positive and negative examples of a class using the confidence of an automatic classifier on whether or not they belong to it, arguing that the examples which the classifier catalogues with the lowest certainties are the ones that a human annotator is more likely to misclassify due to lack of experience or time.

Among the results presented in this paper, which deals mostly with Training Data Cleaning (i.e. detecting training examples which a human is likely to have misclassified), the authors report that the realistic perturbation of the examples is less damaging to the effectiveness of the classification task than one which selects at random the examples and classes to be perturbed. Their explanation of this phenomenon is that the examples perturbed with that method were likely borderline examples anyway, and that the overall perturbation induced by them would be limited. Both this result and this explanation are in consonance with the ones that we have expounded in this chapter for the Constrained Clustering task.

5.6 Summary

In this chapter we have studied the robustness of several Constrained Clustering algorithms to inaccurate constraints, a question (Section 5.1) which is bound to play an important role in their final effectiveness in real-world problems. In order to do so, we have designed an experiment (Section 5.2) in which the behaviour of the algorithms is tested with synthetic sets of inaccurate constraints created with two different methods (Section 5.2.3), one of them based on intuitions about the nature of real errors in the constraints. An analysis of the results of this experiment (Section 5.3) showed the strengths and weaknesses of each method, which we have used to conclude the scenarios in which using each algorithm may be the best decision (Section 5.4). Moreover, a survey of the limited literature on the subject (Section 5.5) showed that our results and insights are compatible with and complementary to the ones in those works.


Chapter 6

Constraint Extraction

Apart from the robustness of Constrained Clustering algorithms to inaccurate constraints, to which we have devoted the previous chapter, in the work leading to this thesis we have also dealt with another important and sometimes overlooked practical problem of Constrained Clustering: obtaining the constraints themselves. In this chapter we propose two schemes to automatically extract constraints, research which was previously published in [Ares et al., 2011] and [Ares and Barreiro, 2012]. The first method, which uses external information, is focused on the clustering of web pages, whereas the second one, which uses internal information, can be used with textual documents of any type.

6.1 Creation and Extraction of Constraints

As has been previously introduced, up to this date the research on Constrained Clustering has been mostly focused on devising new algorithms which can make the most of the information carried by the constraints. Consequently, the experiments testing these new approaches begin with already built sets of constraints¹, which are fed to the algorithms in order to compare how well they use that information. However, by putting the focus on this unequivocally important part of the process we run the risk of neglecting an equally essential one: obtaining the constraints. Indeed, in real-world clustering problems devising adequate methods to obtain constraints will play a key role in obtaining good results, since their number and, as we have seen in the previous chapter, their quality are very important factors in the possible improvements attained by using Constrained Clustering algorithms. This is a complex problem, to which manual and automatic approaches can be applied.

Manual methods can be roughly summarised as follows: pairs of entities are presented to a user, who states whether or not they belong in the same cluster, creating in the process positive or negative constraints, respectively. The main challenge of these manual methods is devising ways to make the most of the knowledge of the user, creating the constraints which will help the clustering process the most to reach a good solution.

¹ Usually from the reference partition, as we have seen in the previous chapter.

For instance, a very straightforward manual method could randomly choose the entities shown to the user from the data to cluster, similarly to the way in which synthetic constraints are created in most Constrained Clustering experiments, but this time, instead of looking up the reference partition (which is obviously not available), we ask the user. However, in real-world problems there are some circumstances that show the necessity of using more complex extraction approaches. For example, when creating synthetic constraints we can create as many as we want; we are not restrained by any factor apart from the total number of possible constraints. However, in real-world scenarios we will certainly be limited by the time that the user can devote to constraint creation. Moreover, this process can become quite tedious, something which may harm the accurateness of the constraints, since the user may become more and more careless in his answers when confronted with an endless list of questions. By choosing entities at random we might be “wasting” some of these valuable queries to the user on pairs whose relation is clear and would be easily established by a clustering algorithm without using constraints. The problem of selecting the most informative pairs of instances can be tackled by applying active learning schemes such as the one proposed by Basu et al. [2004a], which will be outlined in Section 6.2.

On the other hand, automatic methods create the constraints by themselves, without any direct intervention from the user. This is done by examining the entities, trying to find enough similarities or disparities between them to create positive or negative constraints, respectively. It is worth remarking on the difficulty of this task: in manual methods, the burden of establishing the relation between entities is on the user, who judges it according to his knowledge and experience. Automatic methods try to judge it automatically, and in doing so they are in essence performing the same task that must be carried out by the clustering algorithms themselves (detecting which entities are related and which ones are not²). Given this difficulty, automatic constraint extraction schemes are often specially tailored to a data domain, since they use specific rules which in most cases are not valid or easily transferable to other spheres, let alone other data types. For instance, when clustering web pages a simple constraint extraction scheme could involve examining their URLs and creating positive constraints between those sharing the same domain, something which would make no sense when clustering, for example, movie synopses or income data. In some specific cases, automatic constraint extraction systems can even be viewed as a sort of simple expert system that tries to emulate the reasoning followed by the human expert when creating constraints.
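As a concrete illustration of such a domain-specific rule, the sketch below creates Must-links between web pages whose URLs share a host name. It is only an example of the kind of rule discussed above, not a method evaluated in this thesis, and the function names are assumptions.

```python
from itertools import combinations
from urllib.parse import urlparse

def must_links_by_domain(urls):
    """Toy domain rule: positive (Must-link) constraints between pages
    hosted under the same domain."""
    constraints = []
    for (i, u), (j, v) in combinations(enumerate(urls), 2):
        if urlparse(u).netloc == urlparse(v).netloc:
            constraints.append((i, j))
    return constraints
```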

The information used by automatic methods to create constraints may come from the entities themselves or from some external sources. In the first case, the goal is to incorporate into the clustering process some details about the entities which are not captured by the data representation or by the similarity measure. This would be the case, for instance, of the method to create constraints between web pages described in the previous paragraph. The URL of a web page is something that belongs to the page itself, information that would go unused by the most usual data representations, which would strip the markup of the page, extract the remaining plain text and represent it as a vector using some of the schemes discussed in Section 3.2.1. Incorporating the URL into this representation would not be a trivial task (should we tokenise it? should it be treated as a single term?. . . ) and neither would be tweaking the similarity measure to take it into account. As we introduced in Section 2.4, constraints are very useful in this kind of situation, since they allow us to integrate this “secondary” information in a convenient and principled way. As for the methods which use external information, they try to broaden what is known about each data instance in order to size up their relation more accurately. This new information is again most conveniently incorporated into the clustering process as constraints, due to the aforementioned reasons. An example of this kind of method is the one introduced by us in [Ares et al., 2011] and discussed in the first part of this chapter, which, again in the domain of web pages, uses tags from Delicious to create positive constraints.

² Although, to be fair, clustering algorithms have to make that decision for every pair of entities, whereas constraint extraction schemes are “allowed” to restrict themselves to those that they are most sure of.

Comparing manual and automatic methods, the latter can arguably be considered more convenient, since, at least in theory, we could obtain larger numbers of constraints, not being limited by “human” factors such as the patience of the user. However, it should be taken into account that automatic methods are usually more prone to introducing erroneous constraints. We are substituting the judgement of a human with automatic rules, rules which are usually generalisations of certain notions about the data's domain; hence, as happens with general rules, and especially given the usually complex nature of real-world data domains, there are bound to be a certain amount of exceptions where using these rules will yield inaccurate constraints. Thus, when using these kinds of methods a balance should be struck between the accurateness of the constraints and their number, a balance which should be informed by the findings of studies like the one presented in the previous chapter about the robustness of the Constrained Clustering algorithms that are going to be used in each specific scenario.

In this chapter we will describe two automatic methods to extract constraints. The first one, designed to cluster web pages, uses external information, whereas the second one, which can be used with any textual document, uses internal information.

6.2 Previous Work

Of all the papers introducing the clustering algorithms cited in this thesis (mainly in Section 2.3) only three propose real (non-synthetic) methods to obtain the constraints, whereas in the vast majority of these articles the constraints are generated taking pairs of points randomly and creating a positive or a negative constraint according to the reference groupings.

In [2001] (see Section 2.3.1) Wagstaff et al. tested their Constrained k-Means algorithm not only using constraints automatically extracted from the references, but also extracting them from domain knowledge, in this case in the problem of GPS lane finding. Given a collection of data points which represent the positions of several cars over a certain amount of time along a road, GPS lane finding tries to detect which traffic lanes are present in that road, a problem which Wagstaff et al. show can be satisfactorily tackled with Constrained Clustering. Specifically, they use absolute positive and negative constraints, which they create from the data itself using two basic notions: trace contiguity (“in the absence of lane changes, all of the points generated from the same vehicle in a single pass over a road segment should end up in the same lane”) and maximum separation (“a limit on how far apart two points can be (perpendicular to the centerline) while still being in the same lane”).

On the other hand, as we introduced in Section 4.3.2, Yang and Callan propose in [2006] a Constrained Clustering method specially thought for the detection of near-duplicates in text documents. In the same paper the authors obtain good results using their algorithm to detect these near-duplicates in comments to new regulations of the United States. In order to do so, they automatically extract three types of constraints with the aid of ad-hoc rules defined over attributes of the comments. These rules are based on domain knowledge: positive absolute constraints are created between comments if their word overlap is above 95% or if they completely contain the designated reference copy, whereas negative absolute constraints are created between documents which cite different docket identification numbers. Moreover, positive soft constraints (the “family links”) are created between documents which, having similar file sizes, share the same e-mail relayer, the same docket identification number and the same footer block.

Lastly, in the paper where they introduce Pairwise Constrained k-Means (PCKM, [Basu et al., 2004a], see Section 2.3.2), Basu et al. propose an active learning scheme which tries to make the most of the queries made to the user in manual constraint creation, which is actually the main focus of their article. As we discussed in the aforementioned section, a previous paper by the same authors [Basu et al., 2002] showed that using domain information to initialise the seeds of a k-Means based algorithm can markedly improve the quality of the resulting partitions, which prompts Basu et al. to introduce a constraint-based initialisation in PCKM. In keeping with this, the authors propose an active learning scheme aimed at eliciting a set of constraints that would provide a good initialisation of the seeds, which the authors state can be obtained by getting as many points per cluster (proportional to its actual size) as possible through pairwise questions, questions which will be used to create the positive and negative constraints. The scheme that they propose has two phases. In Explore, the first phase, a farthest-first approach is used to obtain k (one for each cluster in the data) pairwise disjoint neighbourhoods of data points. Afterwards, in Consolidation, the second phase, more points are added to the neighbourhoods until no more queries can be issued to the user (i.e. our query allotment is depleted). In their paper, Basu et al. report experimental results which show that this active learning method obtains better results than creating constraints by looking into the relationship of random pairs of data points.
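For illustration, the core of the Explore phase (a farthest-first traversal) can be sketched as follows. This is a generic farthest-first sketch under its usual definition, not the exact procedure of Basu et al.; the distance function and names are assumptions.

```python
import numpy as np

def farthest_first(points, k, seed=0):
    """Farthest-first traversal: start from a random point and repeatedly add
    the point whose distance to the already-chosen ones is largest. The chosen
    points can seed the disjoint neighbourhoods queried to the user."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    chosen = [int(rng.integers(len(points)))]
    for _ in range(k - 1):
        # For every point, distance to its nearest already-chosen point.
        dists = np.min(
            [np.linalg.norm(points - points[c], axis=1) for c in chosen], axis=0)
        chosen.append(int(np.argmax(dists)))
    return chosen
```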

As for other papers, arguably the one most related to the work summarised in the present chapter is [2010] by Song et al., who devise an automatic constraint extraction method which we have used as a baseline in the experiments reported in the second part of this chapter. Specifically, this approach, discussed in Section 6.7.2, creates positive constraints between text documents using the overlap of named entities as a clue of a positive relationship between them.

6.3 Creating Constraints from Social Tags

When the first search engines appeared, most of the content present on the Internet was generated by companies, institutions and professionals, something which was chiefly motivated by the difficulties people faced in creating their own spaces. In the last years a massive socialisation of the web has occurred, a phenomenon, also called Web 2.0, which was made possible by the appearance of software platforms that facilitate both the uploading of contents by users to the web and the sharing of them with other users. In this context, several forms of social expression have arisen: weblogs, web forums, photo logs, social networks, etc. Some of them have already been shown to be useful tools for several tasks in the context of information search and exploitation. For instance, it is a well-known fact that comments left by the readers of blogs can be very useful for retrieving relevant and opinionated posts [Mishne and Glance, 2006].

In this first part of the chapter we will study how to use tagging information to improve the clustering of web pages, a domain where it has been demonstrated to be a useful tool, with applications such as cluster-based retrieval [Lee et al., 2008] or clustering of search results [Zeng et al., 2004]. Particularly, we will propose an external constraint creation approach in order to extract positive constraints from this social information.

6.3.1 Social Tags

Arguably, the key factor in the rise of Web 2.0 has been the possibility of interacting and sharing content easily with other users. Even though websites such as those hosting publishing services (Blogger, Wordpress, etc.) or media repositories like Youtube or Flickr are the clearest examples of this trend, combining content creation and sharing, this social aspect is even more marked in the aptly named social bookmarking websites, such as Delicious, CiteUlike or Bibsonomy. These sites allow their users to post and share their bookmarks, which will usually link to content that has not been created by them.

A great downside of this new creation and sharing dynamic is a huge avalanche of content which, apart from being in some cases of dubious quality, may have a limited interest outside a definite scope, within which it might be greatly interesting. In order to tackle this problem, most social websites enable and encourage their users to assign tags to the contents. These tags enhance the websites, as they provide a convenient way to find interesting items, in large part because the users are not limited to choosing keywords from a given taxonomy. Instead, these tags are entered freely by the users, and their aggregation tends to reflect a sort of consensus between them over the noteworthy aspects of the item in question and how to designate them, making searching, browsing and sharing the contents easier. This kind of information is called a folksonomy [Sturtz, 2004], which can be defined as a bottom-up classification constructed by a community and without a clear structure. In this chapter we will use the information provided by the tags in Delicious, one of the largest and most important social bookmarking sites.

6.3.2 Delicious

Delicious is a social bookmarking site founded in 2003, where users store, tag and share web links. The typical dynamic of bookmark creation, similar to that of other sites, is usually quite straightforward. A user somehow finds (usually outside the Delicious site) an interesting document on the web, and wants to save a reference to it (a bookmark). Then, instead of using the bookmarking functionality of the browser, the user employs the Delicious service to save the bookmark. This has two main advantages: the bookmark is not bound to a local browser (i.e. it is easily accessible from different browsers and computers) and it can be quickly shared with other users (for example, friends or other users with the same interests can subscribe to an RSS feed of the new bookmarks). In the process, the Delicious web service allows the user to assign tags to the bookmark. These tags compose the Delicious folksonomy to which we refer in the present chapter. Finally, the service also offers the possibility of making the bookmark private and of adding some free text (up to a thousand characters) to it. From now on, when we speak of “Delicious bookmarks”, we will refer exclusively to the bookmarks which were not set as private (i.e. those publicly available).

As was introduced before, the most important characteristic of the tagging process is that the user is completely free to enter whichever tags he wants, without having to choose them from a fixed taxonomy (some suggestions are provided by Delicious, albeit in a non-obtrusive way). An interesting result, reported by Golder and Huberman [2006], is that, despite this freedom, the relative frequencies of the tags used for a given URL stabilise after a relatively small number of bookmarks (about a hundred), showing the emergence of a consensus between the users about the most salient features of the web document and how to name them. They attribute this phenomenon to the combination of a shared cultural background and imitation of other users. Wetzker et al. [2008] report a similar finding, with only 700 tags (from a total of around seven million different ones in their dataset) accounting for 50% of the assignments. This agreement on which tags to use provides structure and stability to the folksonomy.

6.3.3 Constraint Creation

Having briefly surveyed social tags and Delicious in the previous section, in this section we will describe an approach to extract constraints from this social tagging information.


Formally, we start with a collection C of web pages w1, w2, w3, . . . which we want to cluster. Also, we have a function T which, for a web page wi, returns the set of tags Ti = {τ1, τ2, τ3, . . .} that the users of Delicious have associated with that web page in their bookmarks.

As we introduced in the previous section, these tags represent the most salient features of the web page according to the users, and so the fact that two pages share a tag could be evidence of some positive relationship between them, since a number of users have agreed that that tag somehow represents an aspect of those pages. Hence, a quite straightforward option to translate tag information into constraints could be creating a positive constraint between two pages wi and wj if they share some tag, that is, if T(wi) ∩ T(wj) ≠ ∅. This approach, although simple, is quite naive and may lead to the creation of a lot of inaccurate constraints, due to the laxity of the condition. For instance, such a method is bound to run into problems with the polysemy of tags. Let us imagine two web pages wa and wb. The first one deals with traditional buildings, and has been tagged with “building”, “clay”, “adobe” and “plan”, whereas the second one deals with self-publishing, and has been tagged with “PDF”, “interesting”, “adobe” and “print-on-demand”. Although it is quite clear that, at least attending to their topics, they are not related, the method would create a positive constraint between them because they share the tag “adobe”, which in wa refers to a building material and in wb to a software company.

Analysing this example, it is easy to see that, although in isolation it would be impossible to reliably disambiguate the meaning of the tag “adobe” in each bookmark, that task is greatly simplified if we consider the other tags which the users have attached to them. Consequently, a natural way to ease the problems caused by polysemy is making the condition to create a constraint harder by demanding more than one shared tag, a parameter (the minimum amount of shared tags needed) which we will denote with t.

Algorithm 10: CONSTRAINT CREATION USING DELICIOUS TAGS

input : C, the set of web pages to cluster; T, a function which returns the
        tags associated with a web page; t, the minimum amount of shared
        tags to create a constraint
output: ML, a set of Must-links

foreach w ∈ C do
    C ← C \ {w}
    foreach w′ ∈ C do
        if |T(w) ∩ T(w′)| ≥ t then ML ← ML ∪ {ML(w, w′)}
    end
end

Algorithm 10 shows the outline of the constraint extraction method proposed in this section, which was published in [Ares et al., 2011]. The process is quite simple: the tags of each pair of web pages are compared and a constraint is created if they share t or more tags. This parameter controls the balance between the number of constraints and their accurateness which we mentioned at the end of Section 6.1: smaller values of t will yield more constraints, at the expense of a larger ratio of inaccurate ones, whereas larger values of t will create smaller sets of more accurate constraints. The diagram shown in Figure 6.1 illustrates how constraints are created using this method.
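A direct implementation of Algorithm 10 is straightforward. The Python sketch below assumes that the tags have already been fetched into a dictionary mapping each page to its set of tags; that assumption, the function names and the invented example tags belong to the illustration, not to the method itself.

```python
from itertools import combinations

def extract_must_links(tags_by_page, t):
    """Create a Must-link between every pair of pages sharing at least t tags
    (Algorithm 10). `tags_by_page` maps a page identifier to its set of tags."""
    must_links = set()
    for (page_a, tags_a), (page_b, tags_b) in combinations(tags_by_page.items(), 2):
        if len(tags_a & tags_b) >= t:
            must_links.add((page_a, page_b))
    return must_links

# Example mirroring Figure 6.1, with threshold t = 2 (tags are invented):
pages = {
    "doc1": {"design", "web", "css"},
    "doc2": {"design", "web", "blog"},
    "doc3": set(),                      # not bookmarked, so no tags available
}
print(extract_must_links(pages, t=2))   # {('doc1', 'doc2')}
```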

Possible problems

Apart from the aforementioned polysemy of the tags, there are certain other dynamics in the tags of Delicious which may cause problems when creating constraints. Two of the main ones are:

• Nature of the tags: Golder and Huberman [2006] distinguish seven kinds of Delicious tags depending on their function: identifying what (or who) a bookmark is about, identifying what it is, identifying who owns it, refining categories, identifying its qualities or characteristics, self reference and task organising. Depending on the criterion according to which we want to cluster the web pages, which will usually be their topic, the tags belonging to most of these categories will not provide useful clues about their relationship.

• Tag sparsity: Li et al. [2011] report that 90% of the URLs are tagged with 10 tags or fewer, and that 80% of the URLs were posted by only one user (a figure also mentioned in [Wetzker et al., 2008]). Heymann et al. report that between 30% and 50% of the URLs present in Delicious have been bookmarked only once or twice, and that a user enters on average 2.5 tags per bookmark [Heymann et al., 2008]. Moreover, even when a URL is popular and has been bookmarked many times, the aforesaid tendency of the users to agree on a set of common tags could also drive them to overlook some labels which would be equally descriptive of the URL's contents, such as synonyms of the most popular tags or concepts related to them. Thus, it may happen that documents which are actually related share few or no tags, hindering the constraint creation process.

Negative Constraints

Throughout this section we have always dealt with creating positive constraints from tags, that is, constraints which state that two entities should be in the same cluster. This is because tags state positive information about pages, that is, they reflect the users' view of what the web page is, not what a web page is not. Thus, it is very hard to extract negative constraints from tags, i.e. to infer from them that two documents are not related.

To illustrate this, let us take the most straightforward approach, which would be turning the lack of positive information into negative information. As we have introduced earlier, the tags of a bookmark reflect a consensus between users about its most salient features. Consequently, an approach which creates negative constraints between web pages which do not share any tag could seem reasonable in theory, because that would mean that their salient features do not overlap and hence that they do not seem to be related. However, it is easy to see why such a method would not work in real-world scenarios: this approach would only yield good results if the tags attached to a web page by the users reflected completely, clearly and objectively all its salient features, something which, due to the above-mentioned problems of synonymy, sparsity and polysemy, is unfortunately almost impossible in the real world. More complex approaches (which, for instance, could perform a semantic analysis of the tags) are very likely to be affected by these problems as well. Therefore, the question of how to create negative constraints from tags remains open.

Figure 6.1: Example of the constraint extraction method based on Delicious tags. Delicious is queried with the URLs of the documents to cluster (symbolised here with numbers) and returns the tags (symbolised by coloured blocks) used by its users when they bookmarked the documents in question (note that document 3 has not been bookmarked and hence no tags are available to use in constraint extraction). Since the threshold to create constraints (t) is 2, only one positive constraint is created between documents 1 and 2.

6.4 Evaluation Methodology

In order to test the effectiveness of the automatic constraint extraction method proposed in the previous section, we have performed a series of experiments, focused on three aspects:

1. Does this method extract accurate constraints?

2. In general, is the information contained in the tags useful? How much of that information reaches the end of the constraint creation process? What is its maximum possible effect on the clustering outcome?

3. Do these constraints have an appreciable effect on the clustering?

Obviously, the third question, whether or not the constraints created with this method improve the clustering, is the definitive measure of the quality of the constraints, since that is the final goal of using Constrained Clustering. However, the other two aspects are also important, as they show the suitability of mining social tags in order to obtain new information to feed the clustering algorithms. Also, the comparison of the results of the three questions, and especially the discrepancies that might appear between them, can provide us with some insights about what constitutes a good and effective constraint.

6.4.1 Dataset

In our experiments we have used a subset of the DeliciousT140 dataset3. This dataset contains 144,574 web documents bookmarked in Delicious in June of 2008 which were tagged with one of the 140 most popular tags in the whole site, along with the tags returned by Delicious when looking up those URLs at the end of that month (a total of 67,104 tags). More details about the creation of this dataset can be found in [Zubiaga et al., 2009].

Since this collection does not contain a categorisation of the documents which could be used as a reference in the experiments, we have created one of our own, using the Open Directory Project4 (ODP), arguably one of the most important web directories. In this site, the pages are classified by human experts (the "editors") using a hierarchical tree of categories. The intersection between the URLs in DeliciousT140 and ODP (using the dump made on 2010-10-25) yielded 11,589 documents after removing those in which the text extracted (using HTML Parser) was empty; these were the documents used in the experiments. The golden truth5 was created by assigning each document to its corresponding top-level category in the ODP hierarchy.

3 http://nlp.uned.es/social-tagging/delicioust140/
4 www.dmoz.org
5 Available at http://www.dc.fi.udc.es/~edu/DT140dmozRef.tar.gz


Table 6.1: Dataset description

Category     # docs
Computers    3401
Regional     1645
Arts         1215
Science      891
Society      865
World        632
Reference    594
Business     563
Shopping     528
Home         361
Games        328
Recreation   278
Health       110
News         105
Sports       73
Total        11589

Top 10 tags   # docs
reference     3788
tools         2703
software      2673
design        2396
web           2106
blog          2087
free          2075
programming   1790
development   1790
resources     1686

Table 6.1 shows the resulting 15 non-overlapping groups of this dataset and the 10 most used tags for the documents.

6.4.2 Clustering Baselines and Document Representation

In order to assess the effect of the constraints on the clustering we have compared the results obtained using the constraints extracted with our method with those obtained using Normalised Cut (NC) and k-Means (KM), the non-constrained counterparts of the Constrained Clustering algorithms. So as to provide a fair comparison, these algorithms were run over three different "views" of the documents, which use three different combinations of the document contents and the tags associated with them:

• Only documents: the information from the tags was not used in this approach; each document was represented only by its contents.

• Documents + tags: the content of each document was extended with the tags associated with it, appending them to the plain text extracted from each web page.

• Only tags: the tags were used as a proxy for the document; each web page was represented only by the tags associated with it, discarding its actual content.

In the last two approaches each tag was repeated the number of times that it was associated with each document.

As for the text representation scheme used for these three views of the data, we have used Mutual Information (see Section 3.2.1).
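As a small illustration of how these three views can be materialised before computing the term weights, the sketch below builds the token lists that would be fed to the representation step; the naive tokenisation and the data structures are assumptions for the example, not the code used in the experiments.

def build_views(doc_text, doc_tags):
    # doc_text: plain text extracted from the web page.
    # doc_tags: dict mapping each tag to the number of users that entered it;
    # each tag is repeated that many times, as described above.
    content_tokens = doc_text.lower().split()
    tag_tokens = [tag for tag, count in doc_tags.items() for _ in range(count)]
    return {"only_documents": content_tokens,
            "only_tags": tag_tokens,
            "documents_plus_tags": content_tokens + tag_tokens}

views = build_views("Python tutorials and recipes for beginners",
                    {"python": 3, "tutorial": 2})
print(len(views["only_tags"]))   # 5 tokens: 'python' three times, 'tutorial' twice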


6.4.3 Upper Bound Model

The constraint extraction method proposed in Section 6.3.3 tries to make the most of the information contained in the tags. However, due to some dynamics of tagging also discussed in that section, some of that information is bound to be "wasted", since, as we introduced in Section 6.1, a compromise has to be reached between the number of constraints and their accuracy. On the one hand, a conservative method may yield too few constraints, which would fail to have a noticeable effect on the outcome of the clustering. On the other hand, a more aggressive method might have too high a ratio of inaccurate constraints, which, as we have seen in Chapter 5, can be very detrimental to the effectiveness of the clustering algorithm.

In this section we propose an Upper-Bound (UB) model that we will use to quantify the raw information conveyed by the tags of the web pages, without losing any of it due to the compromise between quantity and quality. This UB model has two steps:

1. We use a very loose criterion to create constraints, creating a constraint between two documents if they share any tag.

2. We filter this set of constraints using a perfect oracle, that is, we take only the accurate constraints.

The effectiveness of the Constrained Clustering algorithms with this set of constraints is an indicator of the maximum benefit that can be obtained from the tags. Hence, comparing the quality of that partition with that of the one yielded by the non-constrained baselines, we will be able to evaluate to which degree the information given by the tags is novel in relation to the one extracted from the contents of the web pages. Moreover, we will have a better reference for the aggressiveness of the constraint extraction process proposed by us, not only looking into the number and ratio of accurate and inaccurate constraints which "survive" it, but also comparing the final results with the baselines and with this upper-bound model.
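The two steps of the UB model can be sketched as follows, assuming that the reference partition is available as a dictionary mapping each URL to its ODP top-level category; the function and variable names are ours, for illustration only.

from itertools import combinations

def upper_bound_constraints(tags_by_url, reference_class):
    # Step 1: laxest criterion (t = 1), i.e. a must-link whenever two
    # documents share any tag.
    # Step 2: perfect oracle, i.e. keep only the must-links whose two
    # documents belong to the same class of the reference partition.
    must_links = set()
    for u1, u2 in combinations(tags_by_url, 2):
        if tags_by_url[u1] & tags_by_url[u2]:
            if reference_class[u1] == reference_class[u2]:
                must_links.add(frozenset((u1, u2)))
    return must_links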

6.4.4 Parameters of the Algorithms

Apart from t, the minimum number of common tags between two documents to create a constraint between them, and β and w, which control the strength of the constraints in the two algorithms that we are comparing in our constrained approach (CNC and SCKM, respectively), there are other parameters which had to be considered in the clustering processes.

In all the experiments the number of clusters that the algorithm should look for in the data (k) was assumed to be known, setting it to 15, the number of classes of the golden truth. This is the only parameter of the KM approach (used in the baselines). Also, d, the number of eigenvectors used in the spectral algorithms (see Section 2.3.4), was considered as a parameter, testing values for it between 15 and 300.

Finally, the same 10 random sets of seeds were tested in each clustering algorithm in order to have a good representation of the effectiveness of the algorithm, reporting the average of these ten initialisations.


Furthermore, in the case of SCKM, for each of the seed initialisations the order in which the documents are inspected and assigned to a cluster was also randomised, to avoid any possible effect (positive or negative) which that factor might have had on the results.

6.4.5 Evaluation

In order to appraise the accuracy values of the constraints obtained with our method we have compared them with the hypothetical results of a baseline method which created constraints at random, that is, choosing pairs of pages randomly and creating a positive constraint between them. Given a collection to cluster whose size is denoted by N and a reference partition C = {c_1, c_2, ..., c_k} of that data, the amount of good constraints in a set of n constraints generated by such a method would follow a hypergeometric distribution, where n draws are made from an entire population of size \binom{N}{2} (the total amount of possible constraints), of which \sum_{c \in C} \binom{|c|}{2} (i.e. the total number of possible good positive constraints) are successes. In the case of the collection and the golden truth used in these experiments, the total amount of possible constraints is 67,146,666, of which 9,485,612 are good. This distribution enables us to define a test of significance for the accuracy of the constraints, in which the null hypothesis is that creating constraints randomly is at least as good as using our method. The null distribution of this test is, as we have seen, the hypergeometric distribution, and the statistic of the test is the number of accurate constraints obtained by our method. In opposition to what happens in the Sign Test, larger values of the statistic indicate higher incompatibility with the null hypothesis.
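For reference, both the p-value of this test and the "upper limit for non-significance" reported later in Table 6.3 can be reproduced with the hypergeometric distribution implemented in SciPy; the sketch below plugs in the figures of the t = 1 setting, and the handling of the discrete tail in the last step is only approximate.

from scipy.stats import hypergeom

population = 67_146_666   # all possible pairs of documents, i.e. C(11589, 2)
good_pairs = 9_485_612    # pairs whose documents share the reference class
drawn      = 30_477_693   # constraints created with t = 1
accurate   = 6_530_518    # of which accurate

rv = hypergeom(population, good_pairs, drawn)
# One-sided p-value: probability that creating constraints at random yields
# at least as many accurate constraints as our method did (vanishingly small).
p_value = rv.sf(accurate - 1)
# Roughly the largest number of accurate constraints that would still be
# compatible with the null hypothesis at alpha = 0.05.
upper_limit = int(rv.isf(0.05))
print(p_value, upper_limit)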

As for the clustering results, they were evaluated using an external metric, namely the Adjusted Rand Index (see Section 3.4 and specifically Section 3.4.3). The statistical significance of the possible improvements was tested using the lower-tailed Sign Test introduced in Section 3.5.1. The results obtained with the ten initialisations of the seeds mentioned in the previous section were used as the observations (Xi, Yi), i ∈ [1..10], where Xi and Yi are respectively the ARIs of the non-constrained baseline and of the method which uses the constraints with the i-th set of seeds. Thus, the null hypothesis of the test was that the quality of the results without using constraints was at least as good as that of the ones obtained using the constraints created with our method.
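The evaluation step can be summarised with the sketch below, where adjusted_rand_score plays the role of the ARI of Section 3.4.3 and the sign test is reduced to an exact one-sided binomial test on the ten paired observations; ties, if any, would have to be handled as prescribed in Section 3.5.1, and the function names are ours.

from sklearn.metrics import adjusted_rand_score
from scipy.stats import binomtest

def compare_runs(reference, baseline_runs, constrained_runs):
    # baseline_runs / constrained_runs: ten label assignments each,
    # one per random initialisation of the seeds.
    x = [adjusted_rand_score(reference, run) for run in baseline_runs]
    y = [adjusted_rand_score(reference, run) for run in constrained_runs]
    wins = sum(yi > xi for xi, yi in zip(x, y))
    # Null hypothesis: the unconstrained results are at least as good, so
    # under H0 the constrained run wins a pair with probability <= 0.5.
    p = binomtest(wins, n=len(x), p=0.5, alternative="greater").pvalue
    return sum(x) / len(x), sum(y) / len(y), p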

6.5 Results

The results obtained by the baselines using the views of the documents introduced in Section 6.4.2 are shown in Table 6.2. It can be seen that the best results are obtained by the approach where the pages are represented only by the tags associated with them, which improves on the results obtained using only the documents' contents. That is to say, tags seem to be good representations of what a document is about, and maybe even better than the actual content itself.


Table 6.2: Results of the baselines (best value for each algorithm marked with *)

Baseline    Only Docs   Only Tags   Docs + Tags
NC          0.178       0.190*      0.166
(best d)    (95)        (19)        (21)
KM          0.147       0.186*      0.182

Table 6.3: Evolution of the number and ratio of accurate constraints as t increases. The last column indicates for each value of t the maximum number of accurate constraints which would be statistically compatible with the null hypothesis stated in Section 6.4.5 ("our method is not better than creating constraints at random") for an α of 0.05.

t    #constraints   # accurate constraints   % accurate constraints   Upper limit for non sign.
1    30,477,693     6,530,518                21.43%                   4,307,831
2    15,524,394     4,489,184                28.92%                   2,195,065
3    8,231,157      3,011,039                36.58%                   1,164,331
4    4,461,885      1,953,132                43.77%                   631,487
5    2,450,620      1,227,851                50.10%                   347,072
6    1,355,464      750,981                  55.40%                   192,143
7    747,349        445,971                  59.67%                   106,068
8    410,263        259,020                  63.14%                   58,323
9    225,360        148,549                  65.92%                   32,108
10   123,906        84,553                   68.24%                   17,706
11   67,688         47,426                   70.07%                   9,711
12   37,216         26,693                   71.72%                   5,368
13   20,094         14,755                   73.43%                   2,920
14   10,839         8,080                    74.55%                   1,591
15   5,729          4,407                    76.92%                   853

Another remarkable aspect is the two different effects that appending the tags to the contents of the documents can have compared with using only documents. If we use KM the quality of the resulting clustering is noticeably improved, up to levels very similar to those of the only-tags approach, but if we use NC the resulting grouping is actually worse than not using tags at all. Our intuition is that when tags are added the similarity between pages which were dissimilar is boosted by the presence of popular tags in both of them, which steers NC away from the cut of the graph based on the topic of the documents. This problem would not affect the comparisons between documents and centroids in KM: if a tag (now a term of the document) is very popular, all centroids would contain more or less the same weight for that term, and so the choice of the most similar centroid would end up relying on the comparison of the other terms. Hence, from those results it can be seen that tags can help clustering, but caution should be exerted in how they are used. Also, using them with these approaches only provides a modest improvement. To avoid cluttering, only the results of the best baseline will be shown in the figures.


Table 6.3 shows an analysis of the accuracy of the constraints when moving the parameter t (see Section 6.3.3), along with the maximum number of accurate constraints which would be statistically compatible (α = 0.05) with the null hypothesis stated in Section 6.4.5 ("our method is not better than creating constraints at random"), values which we have judged to be more illustrative than the p-values due to the very small magnitude of the latter.

As expected, creating a constraint if two documents share any tag (t = 1) yields a very noisy set of constraints (only one out of five is accurate). However, it should be noted that this accuracy, obtained with the laxest criterion possible, is markedly better than the one that can be obtained creating constraints at random (the number of good constraints is 51% higher than the maximum amount which would be statistically probable for such a method for α = 0.05). This shows that the information contained in the tags is suitable for creating accurate constraints. Moreover, it should also be noted that the amount of accurate constraints created with that setting (that is, the maximum number of useful constraints that can be extracted from the tags using our approach) is very high: about 6.5 million constraints, more than 560 constraints on average per document. This leaves a lot of room to apply filtering techniques to improve greatly the quality of the constraints without decreasing their number too much. Indeed, and again as expected, raising t (meaning that more tags in common are required to create a constraint) improves the ratio of good constraints, showing that web pages dealing with the same topic usually share a higher number of tags than unrelated ones. As can be seen in the table, increasing the threshold to create constraints also augments the margin by which the results of our method are statistically significant. Since greater t values obviously also decrease the total number of constraints, a compromise between the quality and quantity of the constraints has to be reached.

According to these results, we will centre the remainder of this study on the clustering results with the constraint sets resulting from setting t to 3, 5 and 10, since they showcase an interesting array of situations: lots of constraints, but very noisy (t = 3), a moderate number of constraints with moderate noise (t = 5) and (relatively) small amounts of constraints and noise (t = 10). To give a better understanding of the results, apart from the results obtained by the global upper bound discussed in Section 6.4.3 (referenced as "GUB" in the labels of the next figures), we will also show the results when using only the accurate constraints in those subsets (marked with a leading "UB").

Figure 6.2 shows the results obtained using these constraints with CNC. In our experiments we have detected that the clustering could not be successfully performed for some constraint sets (especially for the Upper-Bound model) in several initialisations of the seeds when low values of d (lower than 225) were used, because one or more clusters became empty in the middle of the process, effectively preventing the clustering process from continuing. Since with higher d values the changes in the ARI were negligible, we report only the results with d=225. Figure 6.2(a) shows the results for β in (0,1] while Figure 6.2(b) focuses on the interval (0,30]. The first important result is the high improvement potential of using constraints created from social tags. In this example, the global upper bound model reaches an ARI of about 0.95, much higher than that of the best baseline (0.19). However, this is a theoretical result, showing how much the clustering results could improve if we had a perfect way to filter out the noisy constraints.


[Figure: two panels plotting ARI against β, (a) for β in (0,1] and (b) for β in (0,30]. Curves shown: GUB, UB t=3, t=3, UB t=5, t=5, UB t=10, t=10 and the NC only-tags baseline; panel (b) also marks the best ARI for t=3 (β=0.025) and for t=5 (β=0.125).]

Figure 6.2: Results using Constrained Normalised Cut (CNC) with d=225


As for the filtering method that we propose in Section 6.3.3 (requiring some number of tags in common to create a constraint), the results show that the constraints which pass that filter are able to improve the results of the baseline in the three scenarios tested, almost doubling it. This improvement is statistically significant in the three cases (a summary of the results is shown in Table 6.4). However, the method is not perfect: if we compare the results with the ones obtained when using only the accurate constraints surviving the filter (those beginning with "UB") it can be seen how the inaccurate constraints that evade the cut are still able to noticeably harm the quality of the results.

Finally, some interesting insights can be obtained by analysing the behaviour of the parameter β for the three sets of constraints. Starting from β=0, at first increasing its value improves the quality of the clustering until a certain peak value, from which the ARI starts to decrease slowly. In our opinion that peak point is where the influences of the similarity between documents and of the constraints are balanced, and thus the information in each one is put to its best use. Regarding this, it is important to note not only that this best β is higher when the set of constraints is more accurate, but also that in that case the best value is more stable. Indeed, as can be seen in Figure 6.2(b), with t=10 (which provides a ratio of accurate constraints of about 68%) wide variations around the best β (10, much higher than those for t=3 and t=5, with ratios of 37% and 50% of accurate constraints) do not decrease the quality of the results much. This same phenomenon, albeit on a lesser scale, can be appreciated in Figure 6.2(a) when comparing the results for t=3 and t=5. These results complement and confirm some of the observations made in Chapter 5 about the behaviour of the CNC algorithm under noisy sets of constraints.

The results when using SCKM (Figure 6.3) show a global behaviour very similar to those obtained when using CNC. When using this algorithm the best result of the global upper bound is 0.82 for w=0.025 (not shown in the figure). This value is again a great improvement over the baseline (whose ARI is 0.19), reinforcing the idea of social tags as a good source of positive constraints. Even so, the difference with the best value when using CNC (0.95) is quite patent, which we attribute to the expected difference in effectiveness between using a partitional and a spectral clustering algorithm (also apparent in Table 6.2 in the difference of ARIs when using only documents).

With respect to the results with the three tested sets of constraints, it is interesting to see how, despite the aforementioned difference in the global upper bounds, the best values of SCKM are close to those of CNC (Table 6.4). Similarly, the best values for each scenario are also statistically significantly better than the unconstrained baseline (k-Means). Moreover, the evolution of the results when moving the parameter w mimics what was observed for CNC: an initial rise in quality, a peak point and a slower decrease, with more accuracy in the constraints entailing a higher best w and more stability for that parameter (this again complements the observations of Chapter 5). However, it should be noted that the parameter w in SCKM is globally much less stable than β in CNC; Figure 6.3 shows how really small variations around the best values (note the range of the x axis) cause strong drops in the quality of the clustering.


[Figure: ARI plotted against w, for w between 0 and 0.0005. Curves shown: GUB, UB t=3, t=3, UB t=5, t=5, UB t=10, t=10 and the KM only-tags baseline.]

Figure 6.3: Results using Soft Constrained k-Means (SCKM)

Table 6.4: Comparison of the best ARI for each constraint set and the best baseline for each algorithm. *=Best value for each algorithm. †=Statistically significant improvement over the best ("Only tags") unconstrained counterpart.

Algorithm    CNC                   SCKM
t = 3        0.339† (β=0.025)      0.325† (w=2.5E-5)
t = 5        0.341† (β=0.125)      0.328*† (w=1.25E-4)
t = 10       0.363*† (β=10.0)      0.294† (w=2.5E-3)
Only tags    0.190 (using NC)      0.186 (using KM)


6.6 Creating Constraints using n-Grams

In the first part of this chapter we have proposed a method which uses external information (the tags entered by users in Delicious) to create positive constraints that help in the clustering of web pages. In this second part of the chapter we will propose a method to extract positive constraints from any kind of textual documents, using only internal information.


"As for me, all I know is that I know nothing"
1-grams: as, for, me, all, i, know, is, that, i, know, nothing
2-grams: as for, for me, me all, all i, i know, know is, is that, that i, ...
3-grams: as for me, for me all, me all i, all i know, i know is, ...

Figure 6.4: Some possible word n-grams of a sentence

6.6.1 n-Grams

As we introduced in Section 3.2.1, the most usual text representation approaches represent a text document as a vector of M components (the number of terms in the collection), each of these components being an estimation of the importance of the term for characterising the contents of the document. Although this has been shown to perform very well, it is also clear that in the process of converting the text documents into these representations we are losing important information which might be useful to characterise the content of the documents and therefore the relationship between them.

Arguably, one of the most important pieces of information that is lost in the process is the order of the terms, and with it the notion of vicinity between them. To illustrate this, let us consider two documents from a hypothetical collection of blog posts. One of them deals with the experiences of the author while planning his holidays in Galicia, in the northwest of the Iberian Peninsula, while the other is a review of a book about the Way of Saint James, a pilgrimage route that ends in Santiago de Compostela, the capital of Galicia. These two posts are likely to have similar representations when using the aforementioned bag-of-words approaches. For instance, the term "book" is bound to have a large weight in both cases, which would contribute to increasing the similarity between them. However, it is also quite clear that at least the dominant sense in which that term is used in each post is different: in the first one, it will mostly be used as a verb, meaning "to reserve", but in the second one it will be used to refer to the work being reviewed. When considering terms in isolation, one by one, this distinction is impossible to make. However, if we examine the sequences of terms in which "book" appears in each document, the differences in meaning would in most cases be quite patent. For example, in the post recounting the planning of the author's holidays it will be used in sequences such as "book a flight", "easy to book", "cheap rooms to book", etc. On the other hand, in the review "book" will appear in contexts like "a good book", "the author of the book", "the book deals with", etc.

Formally, given a sequence of N symbols a_1, a_2, a_3, ..., a_N, a subsequence of n contiguous symbols a_i, a_{i+1}, ..., a_{i+n-1} (with i ≥ 1 and i + n − 1 ≤ N) is called an n-gram. In our case, these symbols are the words of a text, and hence a word n-gram (also called an n-shingle) of a given text is a sequence of n contiguous words from that text. Figure 6.4 shows an example of unigrams, bigrams and trigrams which could be extracted from a sentence.
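The definition translates directly into code; the short sketch below reproduces the n-grams of Figure 6.4 (the tokenisation is deliberately naive and is our own assumption).

def word_ngrams(text, n):
    # Return the word n-grams of a text as tuples of n consecutive words.
    words = text.lower().replace(",", "").replace('"', "").split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = '"As for me, all I know is that I know nothing"'
print(word_ngrams(sentence, 2)[:4])
# [('as', 'for'), ('for', 'me'), ('me', 'all'), ('all', 'i')]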

Word n-grams have been widely used in Information Retrieval and Data Mining. This dates back to works such as [Mitra et al., 1997], which shows that indexing word bigrams (statistical phrases) works comparably well to indexing syntactic phrases, and has given interesting results such as [Broder et al., 1997], where n-grams are used to detect near-duplicates. Far from being confined to theoretical works, large companies have taken an interest in practical applications of n-grams, as is the case of Google, which has released an n-gram corpus [Brants and Franz, 2006] to encourage research in this field.



6.6.2 Constraint Creation

In the light of what has been discussed in the previous section, in this section we will describe an approach which uses n-grams to extract constraints from the text itself of any kind of textual documents, without resorting to any external source of information.

Formally, in this case we start with a collection C of textual documents d1, d2, d3, ... which we want to cluster, and with a function Gn which for a document di returns the set Gni of all possible n-grams that can be extracted from the text of the document.

The vast majority of the similarity measures between documents used in text clustering are based on somehow measuring the overlap of their vocabularies, with the intuition that a high overlap is a good indicator of a positive relationship between them. For instance, the cosine distance, discussed in Section 3.2.2, measures the similarity between two documents as the cosine of the angle between the vectors which represent them, in such a way that if the two documents contain the same terms with exactly the same weights the similarity is maximum, while it is minimum6 if they do not share any term.

Following the same logic, the method proposed in this section uses the overlap between the word n-grams of documents to detect pairs of documents which are likely to be in the same cluster. By using word n-grams we are taking advantage of the aforesaid natural order in which the words appear in the text. For example, let us consider two documents which share a trigram. This means not only that they have (if none of the words is repeated) three words in common, but also that these words appear next to each other and in the same order in the two documents, which, for instance, makes it more likely that the words are being used with the same sense in both documents, which, in turn, makes it more likely that the documents are related and hence belong in the same cluster. We have opted for an approach similar to the one discussed for Delicious tags in Section 6.3.3, producing a positive constraint between two documents di and dj if they share a minimum number of n-grams (t), that is, if |Gn(di) ∩ Gn(dj)| ≥ t. Thus, the information carried by the order of the words in a text can be incorporated to a certain degree into the clustering process.

However, one important aspect to consider is that not all n-grams shared between documents are informative about their relatedness. Namely, it is very likely that a considerable amount of documents will share trigrams such as "a lot of", "for instance the" or "in order to", which are expressions which bear little or no information about the subject of a text, and, consequently, the fact that they are shared should not be treated as evidence to create a constraint7.

6 Provided that the weights of the terms are non-negative, as they are with the most usual representation schemes.


In order to reduce this "noise" we have opted for pruning from the set of n-grams obtained from a document all those which contain one or more stopwords8. Stopwords are words, such as "a", "in" or "the", which appear in texts with high frequency and which are not very informative about the text subject. Consequently, their presence in an n-gram could be a good indicator that the n-gram does not carry much information about the subject of the text either. In order to illustrate this filtering, let us consider again the sentence shown in Figure 6.4. If we use trigrams, only "me all i", "all i know" and "i know nothing" would be considered when creating constraints, ignoring others such as "as for me", "for me all" or "is that i". Discarding these trigrams is a conservative solution from the point of view of constraint creation, limiting their amount quite aggressively in the interest of improving the quality (accuracy) of the resulting set.

A less aggressive alternative to this filtering, also involving stopwords, could be eliminating the stopwords prior to the n-gram extraction, creating the n-grams afterwards with the remaining words. However, with that approach the relation between the words in an n-gram would arguably be more tenuous, given that in the original text one or more words could have been in between them, which would be detrimental to the quality of the constraints.

Algorithm 11: CONSTRAINT CREATION USING N-GRAMS

input : C, the set of text documents to cluster; n, the size of the n-grams to be used; t, the minimum number of shared n-grams to create a constraint; S, the set of stopwords to be considered
output: ML, a set of Must-links

foreach d ∈ C do
    C ← C \ d
    foreach d′ ∈ C do
        if |Filter(ExtractNGrams(n, d), S) ∩ Filter(ExtractNGrams(n, d′), S)| ≥ t then
            ML ← ML ∪ {ML(d, d′)}
        end
    end
end

Algorithm 11 shows the outline of the constraint extraction approach proposed in this section, which was published in [Ares and Barreiro, 2012]. In this outline two loops iterate over the set of documents, extracting their n-grams and comparing them after discarding those containing any stopword. Obviously, implementing the method following this approach literally would be very inefficient: natural and clear optimisations would be building from the beginning only the n-grams which do not contain any stopword, instead of extracting all of them and filtering them afterwards, and saving the n-grams of a document, instead of calculating them every time they are needed.

7 Although these expressions could be a sign of a certain writing style, which might be useful to characterise the relationship between two documents.

8 Specifically, we used the list of stopwords provided by Lucene 2.4.1 in the class StopAnalyzer.


Figure 6.5: Example of the constraint extraction method based on n-grams (trigrams in this case). Stopwords and "regular" words are symbolised by different kinds of blocks. After extracting all possible trigrams from the documents, only those n-grams that do not contain any stopword are considered when creating constraints. In this example, with the threshold t set to 1, only one positive constraint is created, between documents 1 and 2. Note how, since the order of the words is significant, two n-grams containing the same words in a different order are different, and hence documents 2 and 3 do not share any trigram.

Nevertheless, we have opted for showing the algorithm this way for the sake of clarity. As in the case of the algorithm extracting constraints from tags, the parameter t controls the balance between the quantity and the quality of the constraints. Larger values of t demand more n-grams in common to create a constraint, and hence will result in small sets of mostly accurate constraints; on the other hand, smaller values of t, since they impose softer conditions to create a constraint, will create larger sets with a larger ratio of inaccurate constraints as well. The diagram shown in Figure 6.5 illustrates how constraints are created using this method.
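A runnable version of Algorithm 11 which already incorporates the two optimisations mentioned above (building only the stopword-free n-grams and computing them once per document) could look as follows; the small stopword list is a stand-in for the Lucene StopAnalyzer list actually used, and the naive tokenisation is our own assumption.

from itertools import combinations

STOPWORDS = {"a", "an", "and", "as", "for", "in", "is", "it",
             "of", "on", "that", "the", "this", "to"}   # stand-in list

def clean_ngrams(text, n, stopwords=STOPWORDS):
    # Set of word n-grams of the text that contain no stopword.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)
            if not any(w in stopwords for w in words[i:i + n])}

def ngram_constraints(docs, n=3, t=1, stopwords=STOPWORDS):
    # docs: dict mapping document identifiers to their raw text.
    # A must-link is created between every pair of documents sharing
    # at least t stopword-free n-grams.
    grams = {d: clean_ngrams(text, n, stopwords) for d, text in docs.items()}
    return {frozenset((d1, d2))
            for d1, d2 in combinations(grams, 2)
            if len(grams[d1] & grams[d2]) >= t}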

Negative Constraints

In this section we have argued that word n-grams can be used to infer positive relationships between textual documents, which can be used in turn to create positive constraints. As for using n-grams to extract negative constraints, we run into a problem similar to the one that we encountered when using social tags (see page 94): n-grams are a way to take advantage of the existing order between the words of a document in order to better characterise the contents of the documents.


That is to say, we are gathering positive information about the document, about what the document is, and hence it is very hard to use it to infer negative relationships between documents. For instance, in this case creating negative constraints between documents based on the lack of shared n-grams is again quite risky. On the one hand, the fact that none or very few (fewer than n) of the words in one document appear in the other would already be captured by the regular distance measure between representations, i.e. we would not be adding new information to the clustering process. On the other hand, the fact that more than n words appear in both documents, but not in the same order (otherwise they would share some n-gram), does not imply a negative constraint. As we have introduced before, that n words appear in the same order in two documents suggests that they are being used with the same sense, which could mean that the two documents are related. However, that n words appear in a different order, or non-consecutively, does not generally mean that they are being used in a different sense, and even if it did, inferring from that that the documents are not related seems dubious at best.

6.7 Evaluation Methodology

In order to assess the goodness of the constraint extraction method proposed in this second part of the chapter we have performed an array of experiments, which were focused on four aspects:

1. Are the constraints generated using n-grams accurate?

2. Is the information supplied by the constraints extracted with this method different from the one obtained with an existing method?

3. In general, is the information contained in n-grams useful? How much of that information reaches the end of the constraint creation process? What is its maximum possible effect on the clustering outcome?

4. Do these constraints improve the clustering? If so, how does this improvement compare with the one attained with an existing method?

As with the experiments reported in the first part of this chapter, whether or not the constraints help the clustering algorithm to improve the quality of the partition of the data is the actual test of whether the constraints created with our method are useful; nevertheless, the results of the other three questions and the comparison between those results can provide us with useful insights about the constraint extraction problem and about Constrained Clustering as a whole.

Taking into account the results discussed in Section 6.5 and those reported in Chapter 5, in this experiment we will use only the Constrained Normalised Cut and its non-constrained counterpart, Normalised Cut.


Table 6.5: Summary of the distribution of the data in the datasets used in the experiments

Dataset (i): RCV1-4x1000
GDIP   GVIO   GPOL   GCRIM   total
1000   1000   1000   1000    4000

Dataset (ii): 20News-Religion
talk.religion.misc   alt.atheism   soc.religion.christian   total
628                  799           997                      2424

6.7.1 Datasets and Document Representation

We conducted our experiments over two different datasets9:

(i) RCV1-4x1000. This dataset is a subset of the Reuters RCV1 collection [Lewis et al., 2004], which compiles a year of stories dispatched by that news agency starting in August of 1996 that were manually categorised according to different criteria: country, industry and topic. Specifically, this dataset was created by choosing 1000 documents from each of the categories GDIP ("International relations"), GVIO ("War, Civil war"), GPOL ("Domestic politics") and GCRIM ("Crime, Law enforcement"), four of the more populated subcategories of GCAT ("Government/Social"), a wide-reaching top-level category. The documents were randomly chosen from inside each category so as to minimise the chances of picking documents which were too related (corrections, follow-ups, two sides of the same story, etc.). This yielded a dataset composed of 4000 documents uniformly distributed by topic into four equally-sized clusters.

(ii) 20News-Religion. This dataset was taken from the 20 Newsgroup collection [Asuncion and Newman, 2007], a collection of 18828 newsgroup posts. Specifically, we have used the posts of the subset of groups related to religion, that is, the topics talk.religion.misc, alt.atheism and soc.religion.christian. Hence, this dataset is composed of 2424 documents distributed into three clusters, according to the group in which the document was posted.

The distribution of the data in each dataset is shown in Table 6.5. By using these two datasets we aim to have a wide picture of the performance of the algorithms, since, as stems from the descriptions, the character of each dataset is quite different: while in the first we deal with neat texts composed by professional journalists, written conforming to a particular style book, in the second the texts were written by regular Internet users in the midst of an on-line discussion, and hence they often show a quite anarchic structure and are affected by typos, anacoluthons, etc.

The documents in all datasets were represented using Mutual Information (see Section 3.2.1); these representations were compared using the cosine distance (see Section 3.2.2).

9 The exact composition of the datasets can be obtained at: www.dc.fi.udc.es/~edu/AresBarreiroCERI12.gtruth.tar.gz


6.7.2 Baselines

In our experiments we have used two different baselines. First, as we did with the constraints extracted from Delicious' tags discussed in the first part of this chapter, we have measured the effect produced by the constraints extracted from n-grams by comparing the quality of the final partition with that of the partition obtained without using any constraint, i.e. using Normalised Cut. Apart from that, in this experiment we have used a second baseline, another constraint extraction approach based on internal information, to have a better assessment of the behaviour of our algorithm. We have chosen for this baseline a method used in Song et al. [2010], which is based on Named Entity Recognition (NER).

Named entities are text entities for which "one or many rigid designators stand for the referent" [Nadeau and Sekine, 2007]. These designators range from proper names, biological species or substances to temporal expressions or amounts of money or of other units. Thus, examples of named entities are expressions such as "John Ronald Reuel Tolkien", "Galicia" or "1000 euros". From this definition it can be seen that named entities convey a great deal of information about the meaning of the text in which they appear; this fact is used by Song et al. to propose a method to extract positive constraints from the text of the documents which has two steps. In the first one, a named entity recognition algorithm is applied to the documents (namely, the Stanford NER, detecting the classes "Location", "Person" and "Organization"). In the second stage a positive constraint is created between the documents which share a minimum number of those named entities. The intuition behind this approach is clear: given that the contents of a text are quite well characterised by its named entities, the fact that two documents share a given amount of these entities suggests a relation between the topics covered in them.
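The baseline can be sketched with the same skeleton as our method, the only difference being the evidence that is counted; extract_entities below is a hypothetical stand-in for a named entity recogniser restricted to the three classes mentioned above, not the real API of the Stanford NER.

from itertools import combinations

def entity_constraints(docs, extract_entities, t=1):
    # docs: dict mapping document identifiers to raw text.
    # extract_entities: callable returning the set of named entities
    # (locations, persons, organisations) found in a text (hypothetical).
    entities = {d: set(extract_entities(text)) for d, text in docs.items()}
    return {frozenset((d1, d2))
            for d1, d2 in combinations(entities, 2)
            if len(entities[d1] & entities[d2]) >= t}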

We have chosen this constraint extraction method as a baseline for two main reasons. First, its structure is similar to the one proposed in this chapter (which was independently developed). Secondly, named entities are words by themselves or sequences of consecutive words, as are the n-grams used in our approach. These two circumstances facilitate the comparisons between the two algorithms, and enable us to evaluate the contingent improvement in the results which could be attained by using higher-level information, such as the one used to tell named entities apart. This use of higher-level information (compared to the one used in our approach, i.e. that words appear together and do not include a stopword) by Song et al.'s algorithm ensures that their algorithm is a demanding baseline.

6.7.3 Upper Bound Model

In order to better quantify the utility of the information contained in n-grams we have also used in our experiments an Upper-Bound (UB) model akin to the one introduced for Delicious-derived constraints in Section 6.4.3. Setting the parameter of our method to the laxest possible value (i.e. t = 1), which creates the largest amount of constraints that can be obtained by our approach, and using a perfect oracle to choose only the accurate ones (by comparing with the reference partition), we will have a set of constraints that makes explicit all the good information about the clustering that can be inferred by our method.


Table 6.6: Amount of total and accurate positive constraints that can be created in both datasets. These values are used as parameters in the hypergeometric distribution discussed in Section 6.7.5.

                  Total amount of         Total number of possible
                  possible constraints    good positive constraints
RCV1-4x1000       7,998,000               1,998,000
20News-Religion   2,936,676               1,012,185

Therefore, by using these constraints to help in the clustering of the data and comparing the results with those of the unconstrained algorithm we will have a measure of how much novel information can be extracted from the n-grams. Also, by comparing the results of this Upper Bound with the ones obtained without using the perfect oracle we will be able to quantify the effect of the inaccurate constraints on the final quality of the clustering. Moreover, we have followed the same process with the NER-based baseline as well, in order to explore these same questions for that method and compare the findings obtained for both approaches.

6.7.4 Parameters of the Algorithms

Both the n-grams-based method and the NER-based baseline have only one parameter, t, the minimum number of shared n-grams or named entities, respectively, to create a positive constraint between two documents. When a distinction must be made between the t of each method they will be named t3gr and tent. In our method the size of the n-grams could be treated as a parameter, but in the experiments we have chosen to set it to 3, as preliminary tests had shown that trigrams had a good and consistent performance.

As for the clustering algorithms, apart from β, the strength given to the constraints in Constrained Normalised Cut, and from k, the number of clusters, which was considered to be known for the two datasets, in both algorithms (CNC and Normalised Cut) the number of eigenvectors used in the projection of the points (d, see Section 2.3.4) was again considered as a parameter, testing values from k to 100. Once more, k-Means was used in the final step of the two algorithms, and hence ten random initialisations of the seeds of that final clustering were used to have a more exact representation of the results of the clustering.

6.7.5 Evaluation

As in the case of the constraint extraction method based on tags, the accuracy of the constraints obtained with the n-grams method and the NER-based baseline was appraised by comparing them with the accuracy of a hypothetical method which would create constraints at random (see Section 6.4.5). Table 6.6 shows the parameters of the hypergeometric distribution that models the amount of good constraints created by such a method in the two datasets used in these experiments.


In order to calculate the coincidence between the constraints extracted by the proposed method and the NER-based baseline we have used, apart from the plain overlap between the two sets of constraints generated by each method, the Jaccard Index (JI). If A and B are sets, their Jaccard Index is defined as

JI(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}    (6.1)

As stems from this definition, the Jaccard Index ranges from 0, when A and B are disjoint, to 1, when both sets are the same.

Given that there is a certain amount of coincidence that will appear due to chance, we will adjust the Jaccard Index for chance using the general formula for correcting an index introduced in Section 3.4.3:

\text{Adjusted Index} = \frac{\text{Index} - \text{Expected index}}{\text{Maximum index} - \text{Expected index}}    (6.2)

In the case of the Jaccard Index, the expected value of the index can be calculated again with the aid of the hypergeometric distribution. Considering that A and B are sets of constraints created for a collection of N entities, the expected value of |A ∩ B| is the mean value of a hypergeometric distribution where, depending on the point of view, either |A| draws are made from a population of \binom{N}{2} elements (the total amount of possible constraints), of which |B| are successes, or |B| draws are made from the same population of \binom{N}{2} elements, of which |A| are successes. In both cases, the mean value is the same:

\text{Expected } |A \cap B| = \frac{\text{draws} \cdot \text{successes in population}}{\text{population size}} = \frac{|A||B|}{\binom{N}{2}}    (6.3)

and therefore

\text{Expected JI}(A,B) = \frac{|A||B| / \binom{N}{2}}{|A| + |B| - |A||B| / \binom{N}{2}}    (6.4)

Moreover, as we have stated above, the general maximum value of the Jaccard Index is 1, when both sets are the same. However, in order to provide a better adjustment we will use in our experiments a tighter upper bound: given two sets A and B of possibly different sizes, the maximum possible Jaccard Index between them is the one that would be obtained when one of them is contained in the other, since in that situation |A ∩ B| and |A ∪ B| will have respectively the largest and the smallest possible values. Thus

\text{Maximum Possible JI}(A,B) = \frac{\min(|A|,|B|)}{\max(|A|,|B|)}    (6.5)

Note how for |A| = |B| the maximum possible JI value is 1, since the sets may be the same, and how that maximum possible value decreases when the difference in size between the two sets increases. By using this upper bound instead of 1 we are putting a stronger focus on the elements (constraints, in this case) that are shared by the sets, which will help us to better calibrate the similarities between the methods compared.


For instance, let us consider a hypothetical example where a method A yields 100 constraints, which are all contained in the 1000 constraints yielded by another method B. Using the Jaccard Index we would obtain a value of 0.1, meaning that the two methods are extracting quite dissimilar sets of constraints, which could lead us to conclude that the information extracted by the approaches and carried by the constraints is mostly different. However, the fact that all the constraints extracted by method A are also extracted by method B is highly significant, and might be an indication of a relationship between the information used by the two approaches to extract the constraints, something which is somewhat "drowned" in the final value of JI by the difference in size10. Using the upper bound value indicated in Equation 6.5 (which in this case is 0.1, that is, in the proposed scenario we are obtaining the maximum possible JI between sets of those sizes) we ensure that in these kinds of situations the value yielded by the adjusted index is high.

Given these results, if we reformulate Equation 6.2 according to Equations 6.1, 6.4 and 6.5, we obtain the following formula for the Adjusted Jaccard Index (AJI):

AJI(A,B) = \frac{\dfrac{|A \cap B|}{|A \cup B|} - \dfrac{|A||B|/\binom{N}{2}}{|A| + |B| - |A||B|/\binom{N}{2}}}{\dfrac{\min(|A|,|B|)}{\max(|A|,|B|)} - \dfrac{|A||B|/\binom{N}{2}}{|A| + |B| - |A||B|/\binom{N}{2}}}    (6.6)

The Adjusted Jaccard Index yields a value of 1 when one of the sets is completely contained in the other and a value of 0 when the overlap between the two sets can be fully attributed to chance. Finally, negative values indicate an overlap even smaller than the one that would be expected by chance alone.
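Equations 6.1 to 6.6 reduce to a few lines of arithmetic; the sketch below computes the plain and adjusted Jaccard Index of two constraint sets given the size of the collection. By default the chance term uses the full population of \binom{N}{2} possible constraints; when comparing sets that contain only accurate constraints, the smaller population of possible good constraints, \sum_{c \in C} \binom{|c|}{2}, should be passed instead.

from math import comb

def adjusted_jaccard(a, b, n_entities, population=None):
    # a, b: sets of constraints (e.g. frozensets of document pairs).
    # n_entities: N, the number of documents in the collection.
    if population is None:
        population = comb(n_entities, 2)                         # C(N, 2)
    inter = len(a & b)
    ji = inter / len(a | b)                                      # Eq. 6.1
    expected_inter = len(a) * len(b) / population                # Eq. 6.3
    expected_ji = expected_inter / (len(a) + len(b) - expected_inter)   # Eq. 6.4
    max_ji = min(len(a), len(b)) / max(len(a), len(b))           # Eq. 6.5
    return ji, (ji - expected_ji) / (max_ji - expected_ji)       # Eqs. 6.2 and 6.6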

As for the evaluation of the clustering results, we have followed the same approach as when using Delicious tags, using the Adjusted Rand Index (see Section 3.4.3) and the lower-tailed Sign Test (see Section 3.5.1). Once more, the observations (Xi, Yi) are the ARIs obtained by the methods being compared (our n-gram approach, the NER-based baseline and the unconstrained baseline) with the ten different initialisations of the seeds. Concretely, since in this case we have more than one baseline, we will state in each case which ones we are comparing, with the Xi and the Yi being respectively the ones with the lower and the higher average ARIs. Hence, the null hypothesis will be that in each case the method with the lower average ARI is actually as good as or better than the method with the higher average ARI.

6.8 Results

Tables 6.7 and 6.8 show, for the two datasets used in the experiments, the evolution of the number and the accuracy of the constraints created with the n-grams-based method and the NER-based baseline when increasing their respective thresholds. Moreover, and again due to the very small magnitude of the resulting p-values, we show for each t the maximum amount of accurate constraints which would still be compatible with the null hypothesis stated in Section 6.7.5 ("the method is not better than creating constraints at random") for an α of 0.05. The amount of constraints created and the percentage of them which are accurate for each approach is summarised in Table 6.9.

10 However, it should be noted that the noticeable difference between the sizes of the two sets is also important to ascertain the relation between the methods. Consequently, in our experiments we will report both the regular and the adjusted values of the Jaccard Index.


Table 6.7: Dataset (i): RCV1-4x1000. Evolution of the number and ratio of accurate constraints as the thresholds increase. The last column indicates for each t the maximum number of accurate constraints statistically compatible with the null hypothesis stated in Section 6.7.5 ("the method is not better than creating constraints at random") for an α of 0.05.

n-Grams (Trigrams) method
t3gr   #constraints   # accurate constraints   % accurate constraints   Upper limit for non sign.
1      244,738        104,182                  42.57%                   61,485
2      58,295         33,241                   57.02%                   14,733
3      24,350         15,990                   65.67%                   6,193
4      12,136         8,997                    74.13%                   3,109
5      7,120          5,686                    79.86%                   1,838

NER-based baseline
tent   #constraints   # accurate constraints   % accurate constraints   Upper limit for non sign.
1      1,524,140      514,121                  33.73%                   381,539
2      507,282        200,629                  39.55%                   127,215
3      210,218        101,495                  48.28%                   52,836
4      96,554         55,012                   56.98%                   24,339
5      48,836         30,914                   63.30%                   12,356

Table 6.8: Dataset (ii): 20News-Religion. Evolution of the number and ratio of accurate constraints as the thresholds increase. The last column indicates for each t the maximum number of accurate constraints statistically compatible with the null hypothesis stated in Section 6.7.5 ("the method is not better than creating constraints at random") for an α of 0.05.

n-Grams (Trigrams) method
t3gr   #constraints   # accurate constraints   % accurate constraints   Upper limit for non sign.
1      141,579        76,623                   54.12%                   49,085
2      53,330         39,396                   73.87%                   18,560
3      31,732         25,388                   80.01%                   11,076
4      24,891         19,746                   79.33%                   8,702
5      19,194         14,917                   77.72%                   6,724

NER-based baseline
tent   #constraints   # accurate constraints   % accurate constraints   Upper limit for non sign.
1      214,929        111,673                  51.96%                   74,429
2      33,229         21,763                   65.49%                   11,595
3      8,485          6,333                    74.64%                   2,997
4      3,268          2,622                    80.23%                   1,171
5      1,692          1,443                    85.28%                   615


Table 6.9: Summary of the amounts of constraints and the percentage of them which are accurate for each constraint generation method in each dataset

Dataset (i)
       Trigrams                Entities
t      const.      accurate    const.       accurate
1      244,738     42.57%      1,524,140    33.73%
2      58,295      57.02%      507,282      39.55%
3      24,350      65.67%      210,218      48.28%
4      12,136      74.13%      96,554       56.98%
5      7,120       79.86%      48,836       63.30%

Dataset (ii)
       Trigrams                Entities
t      const.      accurate    const.       accurate
1      141,579     54.12%      214,929      51.96%
2      53,330      73.87%      33,229       65.49%
3      31,732      80.01%      8,485        74.64%
4      24,891      79.33%      3,268        80.23%
5      19,194      77.72%      1,692        85.28%


First of all, it should be noted that both methods are indeed generating an amount of accurate constraints that is by a wide margin statistically significantly better than generating constraints at random, which shows that sharing trigrams and sharing named entities are both evidence of a positive relationship between documents.

As for the comparison between methods, there is a perceptible difference in their behaviour in each dataset: while in (i) the amount of constraints created using entities is larger, in (ii) it is mostly the other way around. We think that this is explained by the differences between the documents in the datasets. On the one hand, the news stories in (i) are full of named entities indicating locations, organisations and persons, most of which only span one or two words. Consequently, if they are surrounded by stopwords (something which is very likely) they will be pruned by our method, while Song et al.'s will use them. On the other hand, in (ii), which is composed of newsgroup posts, their method finds it difficult to find those named entities, whereas ours makes the most of the quotations that users make of other posts.

All the same, the comparison between constraint sets of similar size, t3gr = 1, tent = 3 in (i) and t3gr = 3, tent = 2 in (ii), shows that our method is respectively slightly below and clearly above Song et al.'s regarding the ratio of accurate constraints created. That is, the accuracy of this new method, which


Table 6.10: Overlap between the constraints created with both methods forselected values of t. Values (amount of constraints shared, % over those cre-ated with each method, Jaccard Index (JI) and Adjusted Jaccard Index (AJI,see Section 6.7.5)) are shown for all the constraints and for only the accurateones

Dataset (i)
t (3gr., ent.)        amount       % 3gr.     % ent.     JI       AJI
1, 1                  140,983      57.61%     9.25%      0.087    0.446
   (accurate)         68,807       28.11%     13.38%     0.125    0.508
1, 3                  80,732       32.99%     38.40%     0.216    0.238
   (accurate)         43,774       42.02%     43.13%     0.270    0.257

Dataset (ii)
t (3gr., ent.)        amount       % 3gr.     % ent.     JI       AJI
1, 1                  58,763       41.51%     27.34%     0.197    0.266
   (accurate)         42,184       55.05%     37.77%     0.289    0.378
3, 2                  12,980       40.91%     39.06%     0.250    0.257
   (accurate)         10,302       40.58%     47.34%     0.257    0.280

uses lower-level information, appears to be comparable to or better than that of the entity-based one.

Table 6.10 shows the overlap between the constraints created with both methods. In the two datasets, and both for the total amount of possible constraints created with either method (i.e. t3gr = 1, tent = 1) and for the sets indicated in the previous paragraph, the portion of shared constraints is under 60%. This is especially significant when comparing the whole sets of possible constraints extracted from (i). Even though Song et al.'s method creates over five times more constraints, the overlap between it and our method is only 58%, descending to 28% when considering only the accurate constraints (where the ratio is close to 1:5). The same trends can be appreciated when examining the Jaccard Index and its adjusted version presented in Section 6.7.5: in all cases the JI values are below 0.3, and the corresponding adjusted values are less than or almost equal to 0.5 (see footnote 11). These results show that, despite the similarities between both methods noted in Section 6.7.2, the information contained in n-grams is still quite different and can be exploited to create original constraints.
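For reference, the Jaccard Index between two constraint sets $A$ and $B$ is the standard
$$\mathrm{JI}(A,B) = \frac{|A \cap B|}{|A \cup B|},$$
and the Adjusted Jaccard Index of Section 6.7.5 corrects it for the overlap expected by chance; assuming the usual ARI-style correction (the exact definition is the one given in that section), it has the shape
$$\mathrm{AJI} = \frac{\mathrm{JI} - \mathbb{E}[\mathrm{JI}]}{1 - \mathbb{E}[\mathrm{JI}]},$$
where $\mathbb{E}[\mathrm{JI}]$ is computed from the pool of possible constraints (restricted, for the accurate constraints, as described in footnote 11).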

Figure 6.8 shows several results obtained using only accurate constraints in the two datasets. As for the quality values of the Global Upper Bounds (GUB) (see Section 6.7.3), they show the great potential for improving the clustering held by the

11 When comparing sets of accurate constraints we used, in the calculation of the expected JI index, the maximum possible amount of good constraints $\sum_{c \in C} \binom{|c|}{2}$ instead of the overall maximum amount of possible constraints $\binom{N}{2}$.


[Figure: two plots of ARI against β, one per dataset. Left panel: “Dataset (i): RCV1-4x1000” (ARI roughly between 0.5 and 1, β between 0 and 0.6). Right panel: “Dataset (ii): 20News-Religion” (ARI roughly between 0.2 and 0.9, β between 0 and 2.5). Curves: nGr GUB, NER GUB, nGr UB t=2, NER UB t=2, nGr UB t=3, NER UB t=3.]

Figure 6.6: ARI values obtained using only good constraints. nGr=n-grams method. NER=NER-based baseline. GUB=Quality values of the Global Upper Bound (see Section 6.7.3). UB=Quality values obtained using only the accurate constraints generated by the given method and the given t.


Table 6.11: Informativeness (see Davidson et al. [2006] and text) of the accu-rate constraints for selected values of t

       Dataset (i)                      Dataset (ii)
t      Trigrams    Entities      t      Trigrams    Entities
1      0.227       0.281         1      0.226       0.194
2      0.172       0.260         2      0.172       0.138
3      0.141       0.239         3      0.152       0.126

information contained in n-grams and in named entities. In both datasets theglobal upper bound of the NER-based baseline gives higher values, due to thelarger amount of constraints created with t = 1 (see Tables 6.7 and 6.8). Thedifference between the two methods is more marked (about +0.1 in ARI) inDataset (i), where the amount of good constraints created using NER almostquintuples the amount created with n-grams.

Furthermore, Figure 6.8 also shows the ARI values obtained using only the accurate constraints contained in the sets created by the methods for t ∈ {2, 3} (note that the GUBs are these values for t = 1). An interesting fact apparent in these figures is how, although in both datasets the quality values are mostly neatly arranged top to bottom according to the number of constraints used (from larger sets to smaller ones), in Dataset (i) there are two noticeable anomalies in that pattern. Namely, on the one hand, the quality values for the good constraints obtained with t3gr=1 and tent = 2 are very close, even though the number of constraints in the latter almost doubles the ones in the former. On the other hand, given the similar number of constraints (104,192 versus 101,495) it would appear that the ARI values for t3gr=1 and tent = 3 should be similar, while in fact the former are well above the latter. These two circumstances show that in that dataset the constraints created using the information from the n-grams are more informative, that is, they contain more information that is not captured by the clustering algorithm on its own, which helps it to reach a better (as defined by our evaluation criterion) partition. Interestingly, the values of the informativeness metric proposed by Davidson et al. [2006] (Table 6.11, and see footnote 12) for the good constraints created with these values of t are very close (0.227 versus 0.260 and 0.239), which suggests that the metric might not be useful in all circumstances to predict the influence that a set of constraints might have over the results of a clustering process.

Finally, Tables 6.12, 6.13 and 6.14 compare the clustering results of our n-gram-based approach with the NER-based and the unconstrained baselines. In Dataset (i), our approach slightly improves the unconstrained baseline for the three values of t tested, an improvement which is nevertheless statistically significant for t = 2 and t = 3. In the case of the constraints extracted with Song et al.'s approach, using them with CNC is actually harmful for the final quality of the clustering: for the three values of the parameter tested in the experiments the average ARI is in all cases under that of the unconstrained method, a decrease which is statistically significant (i.e. the unconstrained results are statistically better than the ones obtained with those constraints).

12 Grosso modo, the ratio of constraints that an unconstrained run of the algorithm does not respect, with larger values meaning more informativeness.
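The measure paraphrased in the footnote above can be computed along the following lines; this is a minimal sketch of our reading of Davidson et al. [2006], and the function name and constraint representation are illustrative rather than taken from the thesis.

def informativeness(constraints, labels):
    """Fraction of constraints that a partition produced by an UNCONSTRAINED run of
    the clustering algorithm fails to satisfy (larger = more informative).

    constraints: iterable of (i, j, kind) with kind in {"must", "cannot"}
    labels:      cluster label assigned to each instance by the unconstrained run
    """
    violated = 0
    for i, j, kind in constraints:
        same = labels[i] == labels[j]
        if (kind == "must" and not same) or (kind == "cannot" and same):
            violated += 1
    return violated / len(constraints)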


Consequently, as for the comparison between the two constraint creation methods, the best value using the n-grams-based approach is in all cases better than the best one for the NER-based method. This difference is in all cases statistically significant.

In Dataset (ii) our approach greatly improves both baselines for all the tested values of t, an improvement which is likewise in all cases statistically significant, both compared with the unconstrained baseline and with the NER-based one. In this dataset using the constraints extracted with Song et al.'s method yields better results than in Dataset (i), improving in all cases on the unconstrained baseline, albeit by a smaller margin than our approach. For t = 2 and t = 3 this improvement is statistically significant.

As can be seen in these results, the difference in effectiveness of the constraints in the two datasets is clearly palpable: whereas in Dataset (i) their quantitative effect is limited (though statistically significant nevertheless), the constraints are able to provide quite noticeable improvements in Dataset (ii). As we have seen when analysing the results of the upper bounds (Figure 6.8), this is not caused by the constraints created in Dataset (i) having low informative content: in both datasets the global upper bound (i.e. the accurate constraints created when setting t to 1) greatly improves the unconstrained baseline. On the contrary, this divergence seems to be caused by an overall lower accuracy of the constraints in Dataset (i): Table 6.9 shows how the accuracy values of the constraints created with both methods in that dataset are appreciably below those in Dataset (ii). Moreover, if we raise the threshold of the methods in order to obtain accuracies comparable to the lower ones obtained for Dataset (ii), i.e. we set t3gr to 2 or 3, the size of the constraint sets is greatly reduced and with it the number of accurate constraints, which, as can be seen in Figure 6.8, greatly limits the maximum improvement that can be obtained from the constraints. This, in conjunction with the moderate amount of inaccurate constraints still present, explains the arguably poor results in this dataset.

It is also interesting to compare the previously discussed upper-bound results for t3gr=1 and tent = 3 with the ones obtained with the whole set of constraints. As we have seen, even though the number of accurate constraints created by both methods for those parameters was similar, the ones created with our trigram-based method were more informative, and therefore the partitions obtained using these constraints were of higher quality. However, this difference in quality when using only good constraints is not clearly appreciable when using the whole set of constraints generated with these parameters. As can be seen in Table 6.12, for t3gr=1 and tent = 3, the best ARI values are respectively 0.505 and 0.503, an exiguous difference of 0.002 which is nevertheless statistically significant according to the sign test described at the end of Section 6.7.5. These results suggest that the inaccurate constraints yielded by the n-grams-based method may be more harmful, an effect likely aggravated by its slightly smaller percentage of accurate constraints (42.57% versus 48.28%).

As for Dataset (ii), the effect of the inaccurate constraints is similarly read-ily apparent in the marked difference between the best ARIs (0.592 versus0.334) when using the whole sets of constraints yielded by t3gr=3 and tent = 2,


Table 6.12: Dataset (i) RCV1-4x1000. Best average ARI of the results of CNC with the given β over ten random initialisations of the seeds using the constraints yielded by each method with the given t. The value of d for which the best value was obtained is between parentheses. The “Baseline” value is the best average ARI of NC. Bold=Best for method and t. Bold & enlarged=Best in dataset. †=Stat. sign. improvement over unconstrained. ‡=Stat. sign. improvement over unconstrained and the other method with same t and best β.

Dataset (i): RCV1-4x1000
              t = 1                       t = 2                       t = 3
β          Trigrams      Entities      Trigrams      Entities      Trigrams      Entities
0.00125    0.503 (4)     0.499 (4)     0.504 (4)     0.499 (4)     0.504 (4)     0.502 (4)
0.0025     0.505 (4)     0.497 (4)     0.504 (4)     0.498 (4)     0.504 (4)     0.502 (4)
0.0050     0.504 (4)     0.470 (4)     0.506 (4)     0.489 (4)     0.504 (4)     0.501 (4)
0.00625    0.505 (4)     0.462 (4)     0.505 (4)     0.484 (4)     0.504 (4)     0.503 (4)
0.0125     0.502 (4)     0.395 (4)     0.507 (4)     0.470 (4)     0.505 (4)     0.496 (4)
0.025      0.494 (5)     0.296 (13)    0.509 (4)‡    0.435 (4)     0.508 (4)‡    0.477 (4)
0.05       0.494 (6)     0.229 (10)    0.504 (4)     0.354 (4)     0.508 (4)     0.433 (4)
0.0625     0.479 (9)     0.158 (4)     0.503 (4)     0.355 (4)     0.508 (4)     0.372 (5)
0.125      0.447 (19)    0.112 (5)     0.474 (4)     0.278 (4)     0.503 (4)     0.330 (5)
0.25       0.321 (9)     0.002 (34)    0.494 (5)     0.197 (4)     0.489 (4)     0.285 (5)
0.5        0.228 (4)     0.001 (60)    0.430 (5)     0.104 (4)     0.463 (4)     0.242 (8)
0.625      0.129 (4)     0.001 (58)    0.405 (5)     0.062 (5)     0.462 (6)     0.239 (9)

Baseline   0.504 (4)


Table 6.13: Dataset (ii) 20News-Religion. Best average ARI of the results of CNC with the given β over ten random initialisations of the seeds using the constraints yielded by each method with the given t. The value of d for which the best value was obtained is between parentheses. The “Baseline” value is the best average ARI of NC. Bold=Best for method and t. Bold & enlarged=Best in dataset. †=Stat. sign. improvement over unconstrained. ‡=Stat. sign. improvement over unconstrained and the other method with same t and best β.

Dataset (ii): 20News-Religion
              t = 1                        t = 2                        t = 3
β          Trigrams       Entities      Trigrams       Entities      Trigrams       Entities
0.00125    0.285 (15)     0.289 (18)    0.284 (15)     0.284 (15)    0.284 (15)     0.283 (15)
0.0025     0.290 (15)     0.284 (15)    0.288 (15)     0.284 (15)    0.286 (15)     0.283 (15)
0.0050     0.294 (18)     0.285 (14)    0.284 (17)     0.291 (18)    0.286 (15)     0.283 (15)
0.00625    0.288 (18)     0.289 (14)    0.286 (17)     0.292 (18)    0.293 (15)     0.283 (18)
0.0125     0.286 (16)     0.290 (18)    0.294 (18)     0.286 (15)    0.292 (15)     0.286 (18)
0.025      0.302 (14)     0.299 (20)    0.278 (16)     0.288 (15)    0.305 (16)     0.284 (15)
0.05       0.321 (33)     0.297 (17)    0.293 (14)     0.292 (15)    0.309 (16)     0.287 (15)
0.0625     0.327 (9)      0.297 (34)    0.315 (15)     0.293 (14)    0.298 (15)     0.282 (15)
0.125      0.424 (15)‡    0.267 (11)    0.392 (7)      0.292 (20)    0.321 (15)     0.289 (16)
0.25       0.239 (68)     0.292 (21)    0.483 (8)      0.310 (15)    0.399 (7)      0.301 (13)
0.5        0.001 (55)     0.150 (100)   0.511 (15)‡    0.334 (11)†   0.482 (7)      0.314 (13)
0.625      0.001 (51)     0.110 (3)     0.490 (32)     0.312 (14)    0.506 (7)      0.322 (13)
1.25       < 0.001        0.001 (60)    0.400 (49)     0.266 (27)    0.592 (10)‡    0.335 (13)
2.5        < 0.001        0.001 (58)    0.297 (100)    0.276 (22)    0.533 (35)     0.336 (12)†

Baseline   0.283 (18)


Table 6.14: Summary of the best ARI for each constraint extraction methodwith t = 1, 2, 3. Bold=Best in dataset. †=Stat. sign. improvement overunconstrained. ‡=Stat. sign. improvement over unconstrained and the othermethod with same t

Dataset (i): RCV1-4x1000
Method      Trigrams               Entities
t = 1       0.505  (β=0.00625)     0.499  (β=0.00125)
t = 2       0.509‡ (β=0.025)       0.499  (β=0.00125)
t = 3       0.508‡ (β=0.025)       0.503  (β=0.00625)
Baseline    0.504

Dataset (ii): 20News-Religion
Method      Trigrams               Entities
t = 1       0.424‡ (β=0.125)       0.299  (β=0.025)
t = 2       0.511‡ (β=0.5)         0.334† (β=0.5)
t = 3       0.592‡ (β=1.25)        0.336† (β=2.5)
Baseline    0.283

taking into account that their ARIs are very close when using only the accurate constraints. As can be seen in Table 6.8, the percentage of good constraints is 65% with NER, in contrast with the 80% attained when using n-grams. This results in almost twice as many inaccurate constraints when using named entities (11,466 versus 6,344), which explains the final difference in the quality values.

On a general note, the clustering results show how the trends found when studying in the abstract the numbers of constraints, their accuracy ratios or, as we have seen earlier, their informativeness, although informative, do not necessarily translate into the final clustering results. For instance, in Table 6.9 we can see how setting t3gr to 1 and tent to 2 in Dataset (i), or setting t3gr and tent to 1 in Dataset (ii), yields sets of constraints of similar accuracy but with a noticeably larger amount of entity-based constraints. As higher accuracy usually comes at the cost of tighter policies when creating constraints, which would mean fewer constraints, this could be taken as a sign that the entity-based constraints would perform better, since we are able to attain the same accuracy with a larger set of constraints. However, Table 6.14 shows how in the first example the difference in the average ARI is small, while in the second the trigram-based constraints yield markedly better results. These circumstances underscore that, although it enables us to filter out those approaches that more obviously do not work (for instance, comparing with a hypothetical method that created constraints randomly), the analysis of the amount and the quality of the constraints is not enough to infer the effect of the constraints created by a method over the clustering process, and hence that the actual quality of the partitions generated using them must be examined in order to accurately assess the goodness and suitability of the approaches.


6.9 Summary

In this chapter we have proposed two different methods to automatically ex-tract constraints, an integral part of the Constrained Clustering problem (Sec-tion 6.1) which is often overlooked in the existing literature (Section 6.2).

The first method that we have proposed (Section 6.3) is based on external information, turning the information contained in social tags, namely the ones with which users of Delicious label their bookmarks, into positive constraints between web pages. The evaluation of this proposal (Section 6.4), which was performed with a large collection of real tagging data, showed (Section 6.5) both the usefulness of social tags for detecting positive relations between web pages and the validity of the method that creates constraints from that information. These aspects were confirmed not only by the accuracy of the created constraints but also by their overall effect on the clustering.
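The core of this extraction step can be pictured with the following sketch; it is illustrative only (the function name, data layout and pairwise loop are ours), and the exact thresholding and filtering are the ones described in Section 6.3.

from itertools import combinations

def tag_constraints(page_tags, min_shared):
    """Emit a positive (must-link) constraint for every pair of web pages that
    share at least `min_shared` Delicious tags.

    page_tags: dict mapping a page identifier to the set of tags assigned to it.
    Quadratic in the number of pages; an inverted index over tags would be used
    in practice, but the pairwise form keeps the idea visible.
    """
    constraints = []
    for (p1, tags1), (p2, tags2) in combinations(page_tags.items(), 2):
        if len(tags1 & tags2) >= min_shared:
            constraints.append((p1, p2))
    return constraints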

The second method (Section 6.6) is based on internal information, using the overlap of word n-grams as a clue of a positive relationship between text documents. In this case, the evaluation of the method (Section 6.7) was performed with two real-world datasets of different nature, comparing our approach with a baseline based on Named Entity Extraction (Section 6.7.2). The experiments showed (Section 6.8) that our proposal yielded constraints of similar accuracy which were mostly different from the ones obtained with the NER-based baseline. As for their effect on the clustering quality, the constraints created with our method in all cases improved the results of both the unconstrained and the NER-based baselines.
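Analogously, the n-gram-based extractor can be sketched as follows; again the names and the pairwise comparison are illustrative, and preprocessing details such as the stopword-related pruning mentioned earlier are omitted.

from itertools import combinations

def word_trigrams(tokens):
    """Set of word trigrams of a tokenised document."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def ngram_constraints(doc_tokens, t3gr):
    """Emit a positive constraint for every pair of documents sharing at least
    `t3gr` word trigrams. doc_tokens maps a document id to its token list."""
    grams = {doc: word_trigrams(toks) for doc, toks in doc_tokens.items()}
    return [(d1, d2)
            for (d1, g1), (d2, g2) in combinations(grams.items(), 2)
            if len(g1 & g2) >= t3gr]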


Chapter 7

Conclusions and Future Work

7.1 Conclusions

As we stated in the outline of this work, the main claim of this thesis is that there are certain practical aspects of Constrained Clustering which, even though they are largely important when applying this technique to real-world situations, have to date been greatly overlooked in the research on this topic. In the previous chapters we have examined some of these issues along with some practical applications of Constrained Clustering. These are the main conclusions, contributions and insights that we have obtained throughout this thesis:

• We have tackled the avoiding bias task using a scheme based on a general-purpose Constrained Clustering algorithm (Soft Constrained k-Means, SCKM) devised by us, showing that this method outperforms, both in terms of dissimilarity with the known partition and quality of the alternative one, the well-known algorithm by Gondek and Hofmann [2004], which is specially tailored to the avoiding bias task.

Moreover, since the avoiding bias problem is still a clustering task at its core, and therefore the process should in the end output a good partition of the data, in this thesis we have also turned our attention to the quality of the alternative partition, looking into ways to use negative constraints, the basis of our aforementioned avoiding bias approach, in conjunction with spectral clustering techniques, which are known to yield partitions of high quality. In this regard we have proposed two approaches, one based on introducing the negative constraints at the core of the Constrained Clustering method, similarly to how positive constraints are introduced in the Constrained Normalised Cut algorithm (CNC) [Ji and Xu, 2006], and another one in which Spectral Clustering (specifically Normalised Cut, NC [Shi and Malik, 2000]) serves as a sort of preprocessing phase over whose output the constraints are applied, using the SCKM algorithm. We have shown that, although it appears to be theoretically sound, the first approach does not work well in practice, yielding partitions of the data of low quality which for most values of


the parameters are still more similar to the grouping of the data which we try to avoid than to the one which, following the usual methodology in avoiding bias experiments, has been used as reference of a good alternative clustering. We have attributed this behaviour to the changes introduced in the minimisation problem at the core of Normalised Cut, which make it difficult to find a balance between the quality of the partition and the observance of the constraints. On the other hand, the approach which combines NC and SCKM obtains very good results, yielding partitions whose similarity with the avoided grouping is similar to that of the ones generated by SCKM on its own but of noticeably better quality. This is due to the effect of the projection made by NC, which not only increases the quality of the clustering per se, but also improves the effect of the constraints by bringing together related documents and separating non-related ones.

• We have analysed the robustness to noise of several Constrained Clustering algorithms, a characteristic which is bound to be of capital importance when applying Constrained Clustering to real-world problems. In real scenarios the sets of constraints must be obtained using manual or automatic methods, and are therefore likely to contain, to a greater or lesser extent, some erroneous or inaccurate constraints, something which is almost never taken into account in the experiments reported in the Constrained Clustering literature. In order to test the behaviour of the algorithms under those conditions we synthetically created sets of constraints with a given amount of inaccurate ones, that is, positive constraints stating that two data instances which were in different clusters in the gold standard should be in the same cluster, and negative constraints stating the opposite. These inaccurate constraints were created using two different noise models (both are sketched in code after this list of conclusions): one in which the pairs of data points affected by them were chosen randomly, and an original one in which we tried to capture our intuition that the errors in constraint creation are likely to happen between pairs of data points which do not seem to belong in the same cluster, in the case of erroneous negative constraints, or which do not seem to belong in different ones, in the case of erroneous positive constraints. Specifically, the positive inaccurate constraints were created between the closest (i.e. most similar) pairs of data instances that belong to different clusters, and the negative ones between those which, belonging to the same cluster, are farthest from each other.

In light of the results of the experiments using these synthetic sets of constraints we were able to identify the scenarios in which using each algorithm may be the soundest decision. First of all, Constrained k-Means [Wagstaff et al., 2001] already faced enormous problems with a moderate number of accurate constraints, which in most cases made clustering impossible. In the cases where we were able to obtain an initial clustering, the quality values dropped dramatically as false constraints were added, mainly due to the transitiveness of the absolute constraints, which greatly amplified the effect of the noise. These results suggest that using Constrained k-Means should be considered very carefully, and


almost ruled out if there might be more than a nominal degree of noise. As for Constrained Normalised Cut [Ji and Xu, 2006], provided that the computational cost (both in time and space) is not a crucial issue, it is the best option with moderate to high amounts of erroneous constraints, due to its good head start and the limited effect of the noise when we do not require a very tight observance of the constraints. For its part, the recommended use of Soft Constrained k-Means depends on the type of information available. When using positive information it should be restricted to cases where the computational cost is critical, since it usually performed worse than CNC, and in most cases its effectiveness with higher ratios of inaccurate constraints was even worse than that of NC, which is an unconstrained method. However, when using negative information SCKM reached very high quality values in noise-free conditions, and kept very good quality values up to the highest noise levels when using inaccurate constraints created with the most realistic approach, which shows SCKM's suitability for incorporating negative information into clustering, even without restrictions on the spatial or temporal costs of the process. As for Normalised Cut with Imposed Constraints (an approach akin to [Kamvar et al., 2003]), it showed an irregular behaviour, which, in conjunction with its lack of advantages in terms of costs or quality of the results over the other methods, suggests that its use in real-world clustering problems should, at the very least, be carefully pondered. Finally, the comparison between the results obtained with the two noise models provided us with another interesting insight, suggesting that when the inaccuracies in the constraints are the product of misjudgements induced by high or low similarity between data points, the effect of the inaccurate constraints is likely to be lessened compared to that of randomly created inaccurate constraints.

• We have proposed two methods which are able to automatically extract highly effective positive constraints for the clustering of web pages and of textual documents in general. It is clear that the number and the quality of the constraints are as important to the improvements attainable with Constrained Clustering as the ability of the algorithms to make the most of the information contained in them. Up to this date, the research on Constrained Clustering has been mainly concerned with the latter, paying little attention to proposing methods to obtain constraints, something that plays a key role in real-world problems.

The first of the methods proposed in this thesis uses the information contained in social tags (specifically Delicious' tags) to infer positive relationships between web pages. These tags represent a consensus between the users of the social tagging tool about the most salient features of the documents, and therefore our intuition was that, polysemy, homonymy and other problems specific to social tags notwithstanding, the fact that two web pages share a given number of tags is a clue of a relationship between them, with a larger number of shared tags being a stronger clue. In this thesis we proposed a constraint extraction approach based upon this intuition,


turning shared tags into positive constraints, and we tested and confirmed its validity with our experiments.

On the other hand, the second method uses information which is internal to textual data, and thus it can be used in all textual domains. This approach uses the information carried by the order of the terms in the text, information which is disregarded by the most usual text representation schemes. In this case, our intuition was that the fact that two textual documents share one or more n-grams (i.e. sequences of consecutive terms) is a clue of a positive relationship between the documents which can be turned into positive constraints that provide the clustering algorithm with new and useful information. The experiments showed that this was indeed the case, and that this approach outperforms the one proposed by Song et al. [2010] that uses higher-level information (Named Entity Recognition).

In both cases the simplicity of the methods should be remarked: they are quite easy to implement and nevertheless yield very good results, both in terms of the quality of the constraints and of their final effect on the quality of the partitions. Moreover, given the scarcity of papers dealing with the problem of obtaining the constraints, the methodology that we have followed to conduct these experiments (e.g. the questions that are tested, the metrics used to obtain the results or the statistical tests used to validate them), which was able to highlight some interesting insights and results, is also a valuable contribution, which may be followed in further works on this topic.
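As announced in the second conclusion above, the two noise models used to build the synthetic sets of inaccurate constraints can be sketched as follows; the function names, the pair-based interface and the sim callable are illustrative assumptions, not the thesis' actual implementation.

import random

def random_noise(pairs_diff_cluster, pairs_same_cluster, n_pos, n_neg):
    """Random noise model: inaccurate positive constraints over randomly chosen
    pairs that lie in different gold-standard clusters, and inaccurate negative
    constraints over randomly chosen pairs that share a gold-standard cluster."""
    return (random.sample(pairs_diff_cluster, n_pos),
            random.sample(pairs_same_cluster, n_neg))

def similarity_noise(pairs_diff_cluster, pairs_same_cluster, sim, n_pos, n_neg):
    """'Realistic' noise model: inaccurate positives between the MOST similar pairs
    that are in different gold-standard clusters; inaccurate negatives between the
    LEAST similar (farthest) pairs that share a gold-standard cluster."""
    pos = sorted(pairs_diff_cluster, key=lambda p: sim(*p), reverse=True)[:n_pos]
    neg = sorted(pairs_same_cluster, key=lambda p: sim(*p))[:n_neg]
    return pos, neg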

7.2 Future Work

This section suggests how parts of this thesis could be extended and outlines some general lines of future work which we deem interesting, in consonance with what was discussed in the previous chapters.

• As we have discussed in Chapter 5, the constraints used in real-worldConstrained Clustering problems, whether obtained with a manual oran automatic method, are bound to be inaccurate to some degree, some-thing to which very little attention has been paid when developing Con-strained Clustering algorithms.

Therefore, we argue that future research on the creation of Constrained Clustering algorithms should take this circumstance into account, and at the very least examine the behaviour of the new algorithms in the presence of different degrees of noise. In order to do so in the best possible way we should advance in the task of characterising the different types of noise present in real-world problems, which would allow us to create different noise models, as we have done in the aforementioned chapter. This would in turn enable us to better study the behaviour of not only new but also existing algorithms, and to conclude more accurately in which scenarios using each algorithm would be most appropriate.

• Moreover, and also related to what was discussed in the previous point, we have also seen in Chapter 5 that inaccurate constraints can be a


major problem for the effectiveness of the existing Constrained Cluster-ing algorithms, some of which are able to attain very large improvementswhen all the constraints are accurate. Therefore, devising ways to detectand filter spurious constraints is a very interesting line of work, whichcan make a large difference when using Constrained Clustering in real-world scenarios.

Designing ways to automatically discern between accurate and inaccurate constraints can be quite complex. On the one hand, if the constraints have been manually created they will convey a human's experience or knowledge of the domain, and thus it can be the case that perfectly proper constraints seem “counter-intuitive” (i.e. likely to be false) from a machine's point of view. On the other hand, as we have seen in Section 6.1, if the constraints have been automatically created they will generally be based on information which is either not captured by the usual representation schemes used in that domain or completely external to the entities to be clustered themselves, and therefore it will be very hard for an automatic algorithm to judge the veracity of a constraint.

Possibly, a more feasible goal could be devising a mixed semi-supervisedapproach, in which we use the input from users. For instance, similarlyto what Esuli and Sebastiani [2009] propose for classification, the auto-matic part of the method would select the constraints which seem morelikely to be inaccurate and present them to a user, who would ultimatelyaccept or reject them.

• Another interesting problem with connections to the accuracy of the constraints is setting their strength, that is, how tightly they should be enforced by the algorithm. The intuition, which we have experimentally confirmed in the work summarised in this thesis, is that the degree of truthfulness of the constraints should be taken into account when tuning the clustering algorithms.

Consequently, it would be helpful (and arguably easier) to develop methods which, even though they would not determine exactly which constraints are accurate and which are not, are nevertheless able to estimate the overall degree of truthfulness of the whole set of constraints, similarly to how existing methods [Davidson et al., 2006] assess their coherence or informativeness, and which, in light of that information, and maybe taking those other factors into account as well, adjust the strength of the constraints accordingly. Following on from what was discussed in the previous point, and given that most Constrained Clustering algorithms allow setting a different weight for each constraint, if we had a method that estimated the apparent accuracy of each constraint it would be possible to fine-tune their strength.

• As we discussed in Chapter 6, obtaining the constraints that fuel the Constrained Clustering algorithms is an integral part of applying this technique to real-world problems, which is nevertheless often overlooked. In that chapter we proposed two techniques to automatically extract positive constraints which can be used when clustering web pages and


textual data in general. As we have seen in Section 2.4, these are just a small fraction of the domains in which Constrained Clustering has been shown to be able to improve the quality of the clustering results.

Thus, an important line for future work is coming up with methods to create accurate and useful constraints in that wide variety of domains, either automatically or with the aid of human users. Moreover, it would also be interesting to devise ways to combine different constraint creation schemes in order to create better constraints. Lastly, even though clues of a positive relationship between data instances may in some cases be easier to detect (as we also examine in the aforementioned chapter), obtaining negative constraints should not be neglected either.

• Also related to devising ways to obtain constraints is the question of creating new metrics that accurately assess the quality of the constraints, and specifically that serve as a proxy for their actual effect on clustering.

Apart from being useful to discard bad sets of constraints a priori, such a metric would be very helpful when developing methods to create constraints or when comparing existing ones. As we have seen in Chapter 6, the trends detected in the existing metrics, such as the accuracy ratio or Davidson et al.'s informativeness, although informative, do not necessarily translate into the final clustering results. Therefore, at present, in order to evaluate a constraint creation method we have to carry out several complete clusterings of the test collections, something which may be quite time- and resource-consuming, a problem which would be eased if we had a better-fitted metric.

• Lastly, there are some other interesting questions specific to the Normalised Cut algorithm and spectral clustering methods in general. As we have seen in Chapter 4, the question of how to introduce negative evidence into the core of a Normalised-Cut-based Constrained Clustering algorithm is still open, something which could be very fruitful given the good performance that the Constrained Normalised Cut approach by Ji and Xu [2006] obtains when introducing positive constraints. Moreover, both constrained and unconstrained Spectral Clustering algorithms are based on transforming the clustering problem into a graph-cutting one. Usually, the steps taken to find the solution imply a relaxation of certain conditions of the problem and solving an eigenproblem, which is the most computationally demanding part of these methods and whose complexity increases with the number of entities to cluster. Therefore, it would also be interesting to study the scalability of these approaches and their possible practical limits.


Appendix A

Resumo

In accordance with the Regulations of the Ph.D. studies passed by the Governing Council of the University of A Coruña on 17th July, 2012, we reproduce in this appendix a summary of this thesis in Galician.

A.1 Introduction

Although datasets had long been compiled by individuals, organisations, companies and governments for diverse purposes, it was with the arrival of the computer era that their compilation and processing experienced a qualitative leap. Not only did the ubiquity of computers allow these actors to automate these tasks enormously, which in turn made it possible to greatly increase the number and size of the datasets, but also the popularisation of the Web, and particularly the arrival of the so-called “Web 2.0”, where new tools and platforms diluted the distinction between creators and consumers of content, caused an endless torrent of new data (query logs, click-through data, posts on blogs and social networks, photographs, videos, geographical data, etc.) every second.

This ever-growing amount of information has caused an increasing need for automatic tools with which to explore and process it. Traditionally, the answer given by Data Mining to this situation was divided into two approaches, Classification [Sebastiani, 2002] and Clustering [Jain et al., 1999].

Clustering is the most common form of unsupervised data analysis. Traditionally, clustering algorithms work by trying to find relations in the data, forming groups (the clusters) using only the information present in the data themselves, with the double objective of, on the one hand, maximising the similarity between the elements assigned to the same cluster and, on the other, keeping those assigned to different clusters as distinct as possible. On the other hand, in classification, the most popular supervised approach, the user knows exactly which groups exist in the data, and provides the algorithms with examples of them. Using these examples, the algorithms characterise the categories present in the data so as to be able to assign new elements (that is, data not previously examined) to the correct group.

As follows from these descriptions, classification methods depend on relatively large training datasets. Once


these have been acquired, which is in itself a rather important sub-problem, we have at our disposal highly effective classification algorithms capable of providing us with high-quality results in a wide variety of tasks and domains, such as spam filtering, language identification, e-mail routing, etc. In contrast, the quality values of even the best-performing clustering methods are in many cases modest in absolute terms, although they are in any case useful given the exploratory intent with which they are employed.

This phenomenon can be regarded as a consequence of the unsupervised nature of the clustering task. Indeed, whereas in classification we have a clear idea of what we want the algorithms to do and a high degree of control over them through the choice of the training examples, in clustering we have the opposite situation: not only do we have a somewhat tenuous idea of how to define what a good partition of the data is (for instance, what exactly does “keeping those [elements] assigned to different clusters as distinct as possible” mean? To what extent should we enforce this condition?), but our control over the process is also limited, at best, to devising a way of comparing the data more in line with our intuitions about them or to manipulating some internal details of the algorithms.

It is in this context that a new kind of semi-supervised clustering algorithms has emerged in recent years, the Constrained Clustering algorithms. These new algorithms can incorporate domain information available a priori, allowing the user to guide the clustering process in some way and to improve the quality of its results. This information is provided to the algorithm as a set of constraints over pairs of elements, which express firm limitations or preferences about whether those pairs should or should not be placed in the same cluster. In this way, and even though the user can have a greater influence on the result, Constrained Clustering is still a clustering process, since it is the algorithm itself that determines which groups exist in the data, in contrast to classification processes, where the objective is to catalogue previously unexamined elements into groups that were defined beforehand according to the examples provided by the user. Moreover, these constraints do not have to be very numerous or be distributed over the whole dataset to have an appreciable effect on the clustering process, which allows us to attain large improvements in the final quality of the partitions while investing only a relatively small effort in obtaining these “training data” (see footnote 1).

A.2 Motivation

Constrained Clustering provides a practical way of integrating into a clustering process information that would not be used in a regular one. This convenience is mainly due to two reasons. First, Constrained Clustering offers a simple and unified method of providing clustering algorithms with different kinds of indications about the

1 Even so, the process of obtaining these constraints should not be dismissed as trivial, something we want to highlight with the present thesis.


appropriate or desired grouping of the data. Regardless of the domain of the data or the nature of the indications, this information can in almost all cases be easily encoded using the constraints, which will affect the process in a coherent and consistent way. Second, as was already introduced in the previous section, these constraints do not necessarily have to come in large numbers or have a wide coverage in order to be used effectively. This allows us to make the most of domain-specific information which, even though it does not affect many elements, may end up being useful. This contrasts with the more or less extensive set of examples that has to be gathered when we use a classification algorithm.

These two characteristics can be very useful in the scenario we sketched in the previous section. Thus, on the one hand, the information contained in datasets will in many cases be multimodal. If, for example, we want to group photographs that have been uploaded to a social network, we may be able to establish relations between them by comparing the people who have been “tagged” in them (see footnote 2), or their geolocation data. If we wanted to incorporate this information, which can be useful to detect which photographs are related, with a regular clustering algorithm we would have to make ad hoc changes to the way the images are compared, whereas using Constrained Clustering algorithms we would have a direct way of encoding it. On the other hand, the less demanding nature of Constrained Clustering regarding the amount of information that must be supplied allows us to use it effectively when processing very large datasets, without having to invest many resources in obtaining a large number of constraints.

Despite the practical nature of these advantages, research on Constrained Clustering has so far focused on theoretical aspects, and particularly on proposing new clustering algorithms in order to make the most of the information supplied by the constraints. With this thesis we intend to propose new applications of Constrained Clustering and to discuss certain practical aspects and problems that must be considered when it is used to tackle real-world problems.

A.3 Contributions of the thesis

The main claim that gives rise to this thesis is that there are certain practical issues which in most cases have been overlooked in the research on Constrained Clustering and whose importance is paramount when we try to apply it to real-world problems. In this thesis we identify two of these issues, the extraction of constraints and their robustness to noise.

As we have already introduced, research on Constrained Clustering has largely focused on developing new and original algorithms. When these algorithms are tested, the authors use in their experiments sets of synthetic constraints, created using the reference partitions against which the clustering results are compared

2 That is, the people who appear in the photograph, according to the user who uploaded it or other members of the social network.


in order to quantify their quality. However, given that these references will not be available in real clustering problems, it is clear that suitable methods must be found to create these constraints, either manually or automatically, something to which little attention has been paid so far.

Another consequence of using these synthetic sets of constraints is that in almost all cases the constraints used in those experiments are true, that is, they provide real, accurate information about a good partition of the data. Thus, the results of the experiments described in those papers reflect the behaviour of the algorithms under ideal conditions, conditions which are unfortunately unlikely in most real scenarios, where, since they have to be extracted, the sets of constraints will be affected by noise, that is, they will contain inaccurate constraints. Hence, the robustness to noise of Constrained Clustering algorithms will certainly play an important role in their final effectiveness.

The main contributions of this thesis are the following. We carried out an analysis of the robustness of certain Constrained Clustering algorithms to sets of constraints affected by noise, designing an experiment in which the behaviour of the algorithms is examined with the help of synthetic sets of inaccurate constraints, created using two methods, a random one and one based on intuitions about the nature of the errors that appear in real-world constraints. In light of the results of these experiments we discuss the strengths and weaknesses of each method, which we use to conclude in which scenarios using each algorithm could be the best option.

Moreover, in this thesis we also propose two methods to automatically extract constraints in two very important domains: web pages and text in general. In the first case we propose a method that uses information external to the entities to be clustered, specifically the tags that the users of Delicious, the most popular social bookmarking service, have associated with those pages. In the second case, our proposal uses the text of the documents themselves, extracting from it valuable information that is normally not taken into account by the most usual text representation methods. In particular, we use word-level n-grams to create constraints that can incorporate into the clustering process part of the information contained in the neighbourhood relations between terms. Both methods are tested in meticulous experiments on reference collections, comparing their results with those of appropriate baselines.

Given the scarcity of papers dealing with the problem of obtaining the constraints, the methodology followed to carry out these experiments (for example: the questions that are examined, the metrics used to obtain the results or the statistical tests with which they are validated) can likewise be considered a contribution of this thesis in its own right.

Finally, moving slightly away from the two issues presented above, but in line with the practical nature of this thesis, we analyse how to apply Constrained Clustering to tackle an existing real problem, namely the Avoiding Bias task, which consists in, given some data to be clustered and a partition of them known


beforehand, finding an alternative partition of those data which is at the same time a good partition. To tackle this task we propose a method that uses constraints to encode the partition to be avoided, constraints which are subsequently provided to a Constrained Clustering algorithm designed by us that is run on the input data, finding an alternative partition of them. Furthermore, we also study how to improve the quality of these alternative partitions, proposing two approaches that make use of spectral clustering techniques.

A.4 Structure of the thesis, results and future work

The main contributions of this thesis are presented in Chapters 4, 5 and 6. Chapters 2 and 3 contain, respectively, a general introduction to Constrained Clustering and a discussion of some aspects common to the experiments carried out in the course of the work leading to this thesis and presented in this volume. Although a specialist in the field could skip them, any interested reader may find them useful to frame the rest of this work. In order to keep the chapters that present the main contributions of the thesis as self-contained as possible, each of them contains its own introduction to its specific topic and its own survey of the relevant literature, with the references gathered at the end of this volume.

• Chapter 2 is a short general survey of Constrained Clustering, the area to which the work summarised in this thesis belongs. First, we introduce the basic concepts of clustering (Section 2.1) and of Constrained Clustering (Section 2.2). Next, we present a short survey (Section 2.3) of the most important and influential Constrained Clustering algorithms, paying special attention to those used in this thesis or otherwise related to it. Finally, we examine both some of the advantages and applications of Constrained Clustering (Section 2.4) and the main open problems and research opportunities in this new field (Section 2.5), some of which are addressed in this thesis.

• Chapter 3 gathers and examines some aspects common to the experiments carried out in the work leading to this thesis, in order to avoid unnecessary repetitions throughout it and also to have a centralised point of reference to turn to when dealing with the experiments in the following chapters. The chapter begins by setting out some of the general guidelines we followed to determine the datasets on which we carried out the experiments (Section 3.1). Then, we study some representation methods and distance metrics for the data (Section 3.2), with special emphasis on the representation of textual data (Section 3.2.1). Next, we examine how to approach the problem of tuning the parameters of the algorithms in order to obtain a faithful and fair picture of their behaviour (Section 3.3). Finally, we conclude the chapter by describing how the


evaluation of the clustering was carried out and the metrics we used (Section 3.4), also stressing the importance of using statistical significance tests to better assess the results of that evaluation (Section 3.5). For this purpose we used the sign test in our experiments (Section 3.5.1).

• Chapter 4 summarises our work on using Constrained Clustering to tackle the Avoiding Bias problem (Section 4.1). Our proposal is a method (Section 4.3) that uses non-absolute negative constraints to encode the partition to be avoided, which are used in a Constrained Clustering algorithm designed by us (Section 4.3.1), whose design allows us to overcome some of the shortcomings of similar existing approaches (Section 4.3.3). Using this information contained in the constraints, the algorithm is able to find an alternative partition of the data. The results of the evaluation (Section 4.4), focused on reference textual collections, show considerable improvements over one of the best-known algorithms specifically designed for the Avoiding Bias task.

In the second part of this chapter we focus on improving the quality of the alternative partitions (Section 4.5), proposing two approaches based on using negative constraints in combination with spectral clustering techniques. The first approach tries to introduce these constraints at the core of a spectral clustering algorithm (Section 4.5.1), whereas the second combines spectral clustering with the algorithm proposed in the first part of this chapter (Section 4.5.2). The experiments (Section 4.6), carried out again on the same reference textual collections, show that while the first method does not yield good results, the second attains large improvements in their quality while at the same time keeping a low similarity with the partition to be avoided.

• Chapter 5 contains a study of the robustness of several constrained clustering algorithms to constraint sets affected by noise (that is, sets containing inaccurate constraints), an issue (Section 5.1) of capital importance for their effectiveness on real problems. To carry out this study, as introduced above, we designed an experiment (Section 5.2) based on examining the behaviour of the algorithms under study when they are supplied with constraint sets affected by noise. In this case the constraints are created synthetically following two methods (Section 5.2.3): one that simulates randomly created noise (a minimal illustrative sketch is given at the end of this overview) and another based on our intuitions about the nature of the errors that automatic and manual constraint creation methods introduce in the real world.

The analysis of the results of this experiment (Section 5.3) reveals the strengths and weaknesses of each algorithm, which we use to identify the scenarios in which each algorithm would be the best choice (Section 5.4). Broadly speaking, we find that the best option when positive constraints are available is Normalised Cut, whereas with negative constraints we should use Soft Constrained k-Means (the algorithm presented in Chapter 4). Likewise, the comparison between the results obtained with the two noise creation methods suggests that, when the errors are the product of mistakes induced by very high or very low similarities between entities, the effect of the inaccurate constraints is likely to be smaller. Finally, a review of the limited literature on this problem (Section 5.5) shows that our results and intuitions are compatible with, and complementary to, those reported in previous work.

• Chapter 6 presents and analyses two methods for extracting constraints automatically, an integral part of the constrained clustering problem (Section 6.1) that is often ignored in the existing literature (Section 6.2).

The first method we propose (Section 6.3) is based on the information contained in social tags, namely those with which Delicious users label their bookmarks, turning that information into positive constraints between web pages (a minimal illustrative sketch is given at the end of this overview). The evaluation of this proposal (Section 6.4), carried out on a large collection of real tagging data, shows (Section 6.5) both the suitability of social tags for detecting positive relations between web pages and the validity of the method for creating constraints from that information. These aspects are confirmed not only by the precision of the constraints but also by their overall effect on the clustering.

The second method (Section 6.6) is based on internal information, using the overlap of word-level n-grams as evidence of a positive relation between text documents (a minimal illustrative sketch is given at the end of this overview). In this case the evaluation of the method (Section 6.7) was carried out on two real datasets of different natures, comparing our approach with a baseline based on Named Entity Recognition (NER) (Section 6.7.2). The experiments show (Section 6.8) that our proposal produces constraints of a quality similar to those obtained with Named Entity Recognition, while being largely different from them. Regarding their effect on the clustering, the constraints created with our method improve in every case the results of both the unconstrained baseline and the NER-based one.

• Finally, Chapter 7 gathers the conclusions of the thesis and presents a summary of possible future lines of research. Among these, it is worth highlighting the development of new noise models to explore the behaviour of constrained clustering algorithms under conditions as close as possible to those in which they will be used in the real world, the study of techniques for filtering out spurious constraints, the automatic setting of the strength of the constraints according to an estimate of their veracity, the design of new constraint creation methods suited to the different domains where constrained clustering can be successfully applied, and the development of new metrics to evaluate the quality of the constraints and to predict their actual effect on the clustering.
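
A minimal sketch of the sign test mentioned in Chapter 3 (Section 3.5.1), which is used to check whether one clustering configuration beats another over a set of paired evaluation scores. The pure-Python binomial computation and the example scores are illustrative assumptions, not the exact implementation or data used in the thesis.

    from math import comb

    def sign_test(scores_a, scores_b):
        """Two-sided sign test p-value for paired scores; ties are discarded."""
        wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
        wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
        n = wins_a + wins_b  # number of non-tied pairs
        if n == 0:
            return 1.0
        k = max(wins_a, wins_b)
        # Probability of a split at least this unbalanced under Binomial(n, 0.5),
        # doubled for the two-sided test and capped at 1.
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        return min(1.0, 2.0 * tail)

    # Hypothetical per-collection scores of two clustering configurations.
    print(sign_test([0.61, 0.58, 0.70, 0.66, 0.73],
                    [0.55, 0.59, 0.64, 0.60, 0.69]))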
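
A minimal sketch of the idea behind the non-absolute negative constraints of Chapter 4: during the assignment step of a k-means-style algorithm, placing a point in a cluster that already contains one of its may-not-linked partners is penalised rather than forbidden. The structure, the penalty weight w and the toy data are assumptions for illustration and do not reproduce the exact Soft Constrained k-Means algorithm of the thesis.

    import numpy as np

    def soft_constrained_assignment(X, centroids, may_not_links, w=1.0):
        """One assignment pass penalising (not forbidding) may-not-link violations."""
        n, k = X.shape[0], centroids.shape[0]
        partners = {i: set() for i in range(n)}
        for i, j in may_not_links:
            partners[i].add(j)
            partners[j].add(i)
        labels = np.full(n, -1)
        for i in range(n):
            dists = np.linalg.norm(centroids - X[i], axis=1)
            # Extra cost for clusters that already hold may-not-linked partners of i.
            penalties = np.array([sum(labels[p] == c for p in partners[i])
                                  for c in range(k)])
            labels[i] = int(np.argmin(dists + w * penalties))
        return labels

    # Hypothetical toy data: two obvious groups and one may-not-link inside a group.
    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
    centroids = np.array([[0.0, 0.1], [5.0, 5.0]])
    print(soft_constrained_assignment(X, centroids, may_not_links=[(0, 1)], w=0.5))

Because the penalty is added to the distance rather than being absolute, a constraint can still be violated when respecting it would force a point into a much worse cluster, which matches the non-absolute nature of the constraints described in Chapter 4.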
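
A minimal sketch of the random constraint-noise model of Chapter 5 (Section 5.2.3), in which a fraction of an otherwise correct constraint set is replaced by constraints that contradict the true labels. The constraint representation, the sampling procedure and the toy data are assumptions for illustration; the experiments in the thesis may generate the noise differently.

    import random

    def add_random_noise(constraints, labels, noise_ratio, seed=0):
        """Return a copy of `constraints` where a fraction `noise_ratio` is made erroneous.

        constraints: list of (i, j, kind) with kind in {"must", "cannot"};
        labels: ground-truth class of each instance, defining what "erroneous" means.
        """
        rng = random.Random(seed)
        noisy = list(constraints)
        n_noisy = int(round(noise_ratio * len(noisy)))
        for idx in rng.sample(range(len(noisy)), n_noisy):
            i, j, _ = noisy[idx]
            # An erroneous constraint asserts the opposite of what the true labels say.
            wrong_kind = "cannot" if labels[i] == labels[j] else "must"
            noisy[idx] = (i, j, wrong_kind)
        return noisy

    # Hypothetical toy usage: 4 instances in 2 classes, 3 correct constraints, ~33% noise.
    labels = [0, 0, 1, 1]
    correct = [(0, 1, "must"), (2, 3, "must"), (0, 2, "cannot")]
    print(add_random_noise(correct, labels, noise_ratio=0.34))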
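
A minimal sketch of the idea behind the first constraint extraction method of Chapter 6 (Section 6.3): web pages whose social tag sets are sufficiently similar are linked by a positive constraint. The Jaccard similarity, the threshold and the toy URLs and tags are illustrative assumptions rather than the exact formulation evaluated in the thesis.

    def tag_constraints(page_tags, threshold=0.5):
        """Positive constraints between pages whose tag sets overlap enough."""
        def jaccard(a, b):
            return len(a & b) / len(a | b) if (a or b) else 0.0

        pages = sorted(page_tags)
        return [(a, b)
                for idx, a in enumerate(pages)
                for b in pages[idx + 1:]
                if jaccard(page_tags[a], page_tags[b]) >= threshold]

    # Hypothetical toy usage with made-up pages and tags.
    pages = {
        "example.org/python": {"python", "programming", "tutorial"},
        "example.org/learn-python": {"python", "programming", "course"},
        "example.org/snakes": {"snakes", "biology"},
    }
    print(tag_constraints(pages, threshold=0.4))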
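
A minimal sketch of the idea behind the second constraint extraction method of Chapter 6 (Section 6.6): documents that share word-level n-grams are linked by a positive constraint. The whitespace tokenisation, the overlap count and the toy documents are illustrative assumptions; the thesis works with word trigrams (hence n=3 here) but may differ in the details.

    def word_ngrams(text, n=3):
        """Set of word-level n-grams (shingles) of a document."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def ngram_constraints(docs, n=3, min_shared=1):
        """Positive constraints between documents sharing at least `min_shared` n-grams."""
        shingles = {doc_id: word_ngrams(text, n) for doc_id, text in docs.items()}
        ids = sorted(docs)
        return [(a, b)
                for i, a in enumerate(ids)
                for b in ids[i + 1:]
                if len(shingles[a] & shingles[b]) >= min_shared]

    # Hypothetical toy usage with made-up documents.
    docs = {
        "d1": "the prime minister announced the new budget on monday",
        "d2": "on monday the prime minister announced tax cuts",
        "d3": "local team wins the regional football championship",
    }
    print(ngram_constraints(docs, n=3, min_shared=1))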
