SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES
By
AMIT DHURANDHAR
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
© 2009 Amit Dhurandhar
To my family, friends and professors
ACKNOWLEDGMENTS
First and foremost, I would like to thank the almighty for giving me the strength
to overcome both academic and emotional challenges that I have faced in my pursuit of
earning a doctorate degree. Without his strength I would not have been in this position
today. Second, I would like to thank my family for their continued support and for the fun
we have when we all get together.
A very special thanks to my advisor, Dr. Alin Dobra, for not only his guidance
but also for the great camaraderie that we share. I am grateful for having met such
an intelligent, creative, full-of-life yet patient and helpful individual. I have thoroughly
enjoyed the intense discussions (which others mistook for fights and actually bet on who
would win) we have had in this time.
I would like to thank Dr. Paul Gader and Dr. Arunava Banerjee for their insightful
suggestions and encouragement during difficult times. I would also like to thank
my other committee members Dr. Sanjay Ranka and Dr. Ravindra Ahuja for their
invaluable inputs. I feel fortunate to have taken courses with Dr. Meera Sitharam and Dr.
Anand Rangarajan who are great teachers and taught me what it means to understand
something.
Last but definitely not least, I would like to thank my friends and roommates, for
without them life would have been dry. A special thanks to Hale, Kartik (or Kartiks
should I say), Bhuppi, Ajit, Gnana, Somnath and many others for their support and
encouragement.
Thanks a lot guys! This would not have been possible without you all.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1 Practical Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.1 What is the Methodology? . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.2 Why have such a Methodology? . . . . . . . . . . . . . . . . . . . . 18
1.3.3 How do I Implement the Methodology? . . . . . . . . . . . . . . . . 19
1.4 Applying the Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.1 Algorithmic Perspective . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.2 Dataset Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5 Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2 GENERAL FRAMEWORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Generalization Error (GE) . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Alternative Methods for Computing the Moments of GE . . . . . . . . . . 29
3 ANALYSIS OF MODEL SELECTION MEASURES . . . . . . . . . . . . . . . 32
3.1 Hold-out Set Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Multifold Cross Validation Error . . . . . . . . . . . . . . . . . . . . . . . . 34
4 NAIVE BAYES CLASSIFIER, SCALABILITY and EXTENSIONS . . . . . . . 38
4.1 Example: Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.1 Naive Bayes Classifier Model (NBC) . . . . . . . . . . . . . . . . . 38
4.1.2 Computation of the Moments of GE . . . . . . . . . . . . . . . . . 39
4.2 Full-Fledged NBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Calculation of Basic Probabilities . . . . . . . . . . . . . . . . . . . 42
4.2.2 Direct Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.3 Approximation Techniques . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.3.1 Series approximations (SA) . . . . . . . . . . . . . . . . . 46
4.2.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.3.3 Random sampling using formulations (RS) . . . . . . . . . 55
4.2.4 Empirical Comparison of Cumulative Distribution Function Computing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Monte Carlo (MC) vs Random Sampling Using Formulations . . . . . . . . 57
4.4 Calculation of Cumulative Joint Probabilities . . . . . . . . . . . . . . . . . 59
4.5 Moment Comparison of Test Metrics . . . . . . . . . . . . . . . . . . . . . 62
4.5.1 Hold-out Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5.2 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.3 Comparison of GE, HE, and CE . . . . . . . . . . . . . . . . . . . . 64
4.6 Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 ANALYZING DECISION TREES . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1 Computing Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.1 Technical Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1.2 All Attribute Decision Trees (ATT) . . . . . . . . . . . . . . . . . . 84
5.1.3 Decision Trees with Non-trivial Stopping Criteria . . . . . . . . . . 85
5.1.4 Characterizing path exists for Three Stopping Criteria . . . . . . . . 87
5.1.5 Split Attribute Selection . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1.6 Random Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.1.7 Putting things together . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.7.1 Fixed Height . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.7.2 Purity and Scarcity . . . . . . . . . . . . . . . . . . . . . . 94
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4 Take-aways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6 K-NEAREST NEIGHBOR CLASSIFIER . . . . . . . . . . . . . . . . . . . . . . 108
6.1 Specific Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.2 Technical Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.3 K-Nearest Neighbor Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.4 Computation of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.4.1 General Characterization . . . . . . . . . . . . . . . . . . . . . . . . 111
6.4.2 Efficient Characterization for Sample Independent Distance Metrics . 113
6.5 Scalability Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.6.1 General Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.6.2 Study 1: Performance of the KNN Algorithm for Different Values of k . . . 121
6.6.3 Study 2: Convergence of the KNN Algorithm with Increasing Sample Size . . . 122
6.6.4 Study 3: Relative Performance of 10-fold Cross Validation on Synthetic Data . . . 123
6.6.5 Study 4: Relative Performance of 10-fold Cross Validation on Real Datasets . . . 124
6.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.8 Possible Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.9 Take-aways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7 INSIGHTS INTO CROSS-VALIDATION . . . . . . . . . . . . . . . . . . . . . . 132
7.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.2 Overview of the Customized Expressions . . . . . . . . . . . . . . . . . . . 140
7.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.4.1 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.4.2 Expected value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.4.3 Expected value square + variance . . . . . . . . . . . . . . . . . . . 148
7.5 Take-aways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
APPENDIX: PROOFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
LIST OF TABLES
Table page
2-1 Notation used throughout the thesis. . . . . . . . . . . . . . . . . . . . . . . . . 31
4-1 Contingency table of input X . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4-2 Naive Bayes Notation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4-3 Empirical Comparison of the cdf computing methods in terms of execution time. RSn denotes the Random Sampling procedure using n samples to estimate the probabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4-4 95% confidence bounds for Random Sampling. . . . . . . . . . . . . . . . . . . . 68
4-5 Comparison of methods for computing the cdf. . . . . . . . . . . . . . . . . . . . 68
6-1 Contingency table with v classes, M input vectors and total sample size $N = \sum_{i=1,j=1}^{M,v} N_{ij}$. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
LIST OF FIGURES
Figure page
4-1 I have two attributes each having two values with 2 class labels. . . . . . . . . . 69
4-2 The current iterate yk just satisfies the constraint cl and easily satisfies the otherconstraints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4-3 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4-4 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4-5 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-6 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-7 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4-8 The plot is of the polynomial $(x + 10)^4 x^2 y + (y + 10)^4 y^2 x - z = 0$.
4-9 HE expectation in single dimension. . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-10 HE variance in single dimension. . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-11 HE E[] + Std() in single dimension. . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-12 HE expectation in multiple dimensions. . . . . . . . . . . . . . . . . . . . . . . . 74
4-13 HE variance in multiple dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . 75
4-14 HE E[] + Std() in multiple dimensions. . . . . . . . . . . . . . . . . . . . . . . . 75
4-15 Expectation of CE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4-16 Individual run variance of CE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4-17 Pairwise covariances of CV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-18 Total variance of cross validation. . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-19 E[] + √Var() of CV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-20 Convergence behavior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-21 CE expectation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4-22 Individual run variance of CE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4-23 Pairwise covariances of CV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-24 Total variance of cross validation. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-25 E[] + √Var() of CV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4-26 Convergence behavior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5-1 The all attribute tree with 3 attributes A1, A2, A3, each having 2 values. . . . . 103
5-2 Given 3 attributes A1, A2, A3, the path m11m21m31 is formed irrespective ofthe ordering of the attributes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5-3 Fixed Height trees with d = 5, h = 3 and attributes with binary splits. . . . . . 104
5-4 Fixed Height trees with d = 5, h = 3 and attributes with ternary splits. . . . . . 104
5-5 Fixed Height trees with d = 8, h = 3 and attributes with binary splits. . . . . . 104
5-6 Purity based trees with d = 5 and attributes with binary splits. . . . . . . . . . 105
5-7 Purity based trees with d = 5 and attributes with ternary splits. . . . . . . . . . 105
5-8 Purity based trees with d = 8 and attributes with binary splits. . . . . . . . . . 105
5-9 Scarcity based trees with d = 5, pb = N/10 and attributes with binary splits. . . . 106
5-10 Scarcity based trees with d = 5, pb = N/10 and attributes with ternary splits. . . . 106
5-11 Scarcity based trees with d = 8, pb = N/10 and attributes with binary splits. . . . 106
5-12 Comparison between AF and MC on three UCI datasets for trees pruned based on fixed height (h = 3), purity and scarcity (pb = N/10). . . . . . . . . . . . . . 107
6-1 b, c and d are the 3 nearest neighbours of a. . . . . . . . . . . . . . . . . . . . . 128
6-2 The Figure shows the extent to which a point xi is near to x1. . . . . . . . . . . 129
6-3 Behavior of the GE for different values of k. . . . . . . . . . . . . . . . . . . . . 129
6-4 Convergence of the GE for different values of k. . . . . . . . . . . . . . . . . . . 130
6-5 Comparison between the GE and 10 fold Cross validation error (CE) estimatefor different values of k when the sample size (N) is 1000. . . . . . . . . . . . . . 130
6-6 Comparison between the GE and 10 fold Cross validation error (CE) estimatefor different values of k when the sample size (N) is 10000. . . . . . . . . . . . . 131
6-7 Comparison between true error (TE) and CE on 2 UCI datasets. . . . . . . . . . 131
7-1 Var(HE) for small sample size and low correlation. . . . . . . . . . . . . . . . . 149
7-2 Var(HE) for small sample size and medium correlation. . . . . . . . . . . . . . . 149
7-3 Var(HE) for small sample size and high correlation. . . . . . . . . . . . . . . . . 150
7-4 Var(HE) for larger sample size and low correlation. . . . . . . . . . . . . . . . . 150
7-5 Var(HE) for larger sample size and medium correlation. . . . . . . . . . . . . . . 151
7-6 Var(HE) for larger sample size and high correlation. . . . . . . . . . . . . . . . . 151
7-7 Cov(HEi, HEj) for small sample size and low correlation. . . . . . . . . . . . . 152
7-8 Cov(HEi, HEj) for small sample size and medium correlation. . . . . . . . . . . 152
7-9 Cov(HEi, HEj) for small sample size and high correlation. . . . . . . . . . . . . 153
7-10 Cov(HEi, HEj) for larger sample size and low correlation. . . . . . . . . . . . . 153
7-11 Cov(HEi, HEj) for larger sample size and medium correlation. . . . . . . . . . . 154
7-12 Cov(HEi, HEj) for larger sample size and high correlation. . . . . . . . . . . . . 154
7-13 Var(CE) for small sample size and low correlation. . . . . . . . . . . . . . . . . 155
7-14 Var(CE) for small sample size and medium correlation. . . . . . . . . . . . . . . 155
7-15 Var(CE) for small sample size and high correlation. . . . . . . . . . . . . . . . . 156
7-16 Var(CE) for larger sample size and low correlation. . . . . . . . . . . . . . . . . 156
7-17 Var(CE) for larger sample size and medium correlation. . . . . . . . . . . . . . . 157
7-18 Var(CE) for larger sample size and high correlation. . . . . . . . . . . . . . . . . 157
7-19 E[CE] for small sample size and low correlation. . . . . . . . . . . . . . . . . . . 158
7-20 E[CE] for larger sample size and low correlation. . . . . . . . . . . . . . . . . . . 158
7-21 E[CE] for small sample size at medium and high correlation. . . . . . . . . . . . 159
7-22 E2[CE] + V ar(CE) for small sample size and low correlation. . . . . . . . . . . 159
7-23 E2[CE] + V ar(CE) for small sample size and medium correlation. . . . . . . . . 160
7-24 E2[CE] + V ar(CE) for small sample size and high correlation. . . . . . . . . . . 160
7-25 E2[CE] + V ar(CE) for larger sample size and low correlation. . . . . . . . . . . 161
7-26 E2[CE] + V ar(CE) for larger sample size and medium correlation. . . . . . . . 161
7-27 E2[CE] + V ar(CE) for larger sample size and high correlation. . . . . . . . . . 162
A-1 Instances of possible arrangements. . . . . . . . . . . . . . . . . . . . . . . . . . 169
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES
By
Amit Dhurandhar
August 2009
Chair: Alin Dobra
Major: Computer Engineering
Considering the large amounts of data that are collected every day in various domains
such as health care, financial services, astrophysics and many others, there is a pressing
need to convert this information into knowledge. Machine learning and data mining are
both concerned with achieving this goal in a scalable fashion. The main theme of my
work has been to analyze and better understand prevalent classification techniques and
paradigms which are an integral part of machine learning and data mining research, with
an aim to reduce the hiatus between theory and practice.
Machine learning and data mining researchers have developed a plethora of
classification algorithms to tackle classification problems. Unfortunately, no single
algorithm is superior to the others in all scenarios, nor is it entirely clear which
algorithm should be preferred over others under specific circumstances. Hence,
an important question is: what is the best choice of classification algorithm for
a particular application? This problem is termed classification model selection and
is a very important problem in machine learning and data mining. The primary focus
of my research has been to propose a novel methodology to study these classification
algorithms accurately and efficiently in the non-asymptotic regime. In particular, we
propose a moment-based method: by focusing on the probabilistic space of classifiers
induced by the classification algorithm and datasets of size N drawn independently
and identically (i.i.d.) from a joint distribution, we obtain efficient characterizations
for computing the moments of the generalization error. Moreover, we can also study
model selection techniques such as cross-validation, leave-one-out and hold out set in our
proposed framework. This is possible since we have also established general relationships
between the moments of the generalization error and moments of the hold-out-set error,
cross-validation error and leave-one-out error. Deploying the methodology, we were able
to provide interesting explanations for the behavior of cross-validation. The methodology
aims at bridging the gap between results predicted by theory and the behavior observed in
practice.
CHAPTER 1
INTRODUCTION
A significant portion of the work in machine learning is dedicated to designing new
learning methods or better understanding, at a macroscopic level (i.e. performance over
various datasets), the known learning methods. The body of work that tries to understand
microscopic (i.e. essence of the method) behavior of either models or methods to evaluate
models – which I think is crucial for deepening the understanding of machine learning
techniques and results – and establish solid connections with Statistics is rather small.
The two prevalent approaches to establish such results are based on either theory or
empirical studies but usually not both, unless empirical studies are used to validate the
theory. While both methods are powerful in themselves, each suffers from at least a major
deficiency.
The theoretical method depends on nice, closed-form formulae, which usually restricts
the types of results that can be obtained to asymptotic results or statistical learning
theory (SLT) type results Vapnik [1998]. Should formulae become large and tedious to
manipulate, the theoretical results are hard to obtain and to use/interpret.
The empirical method is well suited for validating intuitions but is significantly less
useful for finding novel, interesting things since a large number of experiments have to be
conducted in order to reduce the error to a reasonable level. This is particularly difficult
when small probabilities are involved, making the empirical evaluation impractical in such
a case.
An ideal scenario, from the point of view of producing interesting results, would
be to use theory to make as much progress as possible, even if that means obtaining
uninterpretable formulae, followed by visualization to understand and find consequences
of such formulae. This would avoid both the restriction of theory to only nice formulae
and the need of empirical studies to perform large numbers of experiments. The role of the
theory could be to significantly reduce the amount of computation required and the role
of visualization to understand the potentially complicated theoretical formulae. This is
precisely what I propose, a new hybrid method to characterize and understand models
and model selection measures (i.e. methods that evaluate learning models). The work I
present here is an initial forray into what might prove to be an useful tool for studying
learning algorithms. I call this method semi-analytical, since not just the formulae, but
visualization in conjunction with the formulae lead to interpretability. What makes such
an endeavor possible is the fact that, mostly due to the linearity of expectation, moments
of complicated random variables can be computed exactly with efficient formulae, even
though deriving the exact distribution in the form of small closed form formulae is a
daunting task.
1.1 Practical Impact
In this section I discuss the impact of the proposed research on industry and on the
field of machine learning and data mining in general.
Impact on industry and other fields: In today's day and age, adaptive classification
models find applicability in a wide spectrum of applications ranging over various domains.
Financial firms deploy these models for security purposes such as fraud detection and
intrusion detection. Credit card companies use these models to make credit card offers to
people by categorizing them based on their previous transaction history. Giant supermarket
chains use these models to figure out which groups of items are generally
bought together by customers. These models are used extensively in bioinformatics
for problems such as gene classification based on functionality, DNA/protein sequence
matching, etc. They also find application in medicine for the analysis of the importance of
clinical parameters and their combinations, prediction of disease progression, extraction of
medical knowledge for outcome research, therapy planning and support, and overall
patient management. Today's state-of-the-art search engines also use classification models.
This is just a snapshot of the entire range of applications they are used for.
Given the wide applicability of classification models and the sheer extent of
their number, choosing the correct model for a specific application is a highly desirable
goal. Through this research, I hope to take a step forward in this direction.
Impact on machine learning and data mining research: I believe that the
research will assist in providing new insight into the behavior of classification models
and model selection measures. The framework may be used as an exploratory tool for
observing and understanding models and selection measures under specific circumstances
that interest the user. It is possible that other related problems may also be framed in an
analogous fashion leading to interesting observations and consequent interpretations.
1.2 Related Work
A critical piece of theoretical work that is coherent and provides structure in
comparing learning methods is given by Statistical Learning Theory Vapnik [1998]. SLT
categorizes classification algorithms (actually the more general learning algorithms) into
different classes called Concept Classes. The concept class of a classification algorithm is
determined by its Vapnik-Chervonenkis (VC) dimension, which is related to the shattering
capability of the algorithm. Given a 2 class problem, the shattering capability of a
function refers to the maximum number of points that the function can classify without
making any errors, for all possible assignments of the class labels to the points in some
chosen configuration. The shattering capability of an algorithm is the supremum of the
shattering capabilities of all the functions it can represent. Distribution free bounds on
the generalization error – expected error over the entire input, of a classifier built using
a particular classification algorithm belonging to a concept class are derived in SLT.
The bounds are functions of the VC dimension and the sample size. The strength of this
technique is that by finding the VC dimension of an algorithm I can derive error bounds
for the classifiers built using this algorithm without ever referring to the underlying
distribution. A fallout of this very general characterization is that the bounds are usually
loose Boucheron et al. [2005], Williamson [2001], which in turn makes statements
about any particular classifier weak.
There is a large body of both experimental and theoretical work that addresses the
problem of understanding various model selection measures. The model selection measures
that relevant to our discussion, are Hold-out-set validation, Cross-validation. Shao [1993]
showed that asymptotically Leave-one-out(LOO) chooses the best but not the simplest
model. Devroye et al. [1996] derived distribution free bounds for cross validation. The
bounds they found were for the nearest neighbour model. Breiman [1996] showed that
cross validation gives an unbiased estimate of the first moment of the Generalization error.
Though cross validation has desirable characteristics when estimating the first moment,
Breiman stated that its variance can be significant. Theoretical bounds on LOO error
under certain algorithmic stability assumptions were given by Kearns and Ron [1997].
They showed that the worst case error of the LOO estimate is not much worse than
the training error estimate. Elisseeff and Pontil [2003] introduced the notion of training
stability. They showed that even with this weaker notion of stability good bounds could be
obtained on the generalization error. Blum et al. [1999] showed that v-fold cross validation
is at least as good as N/v hold-out set estimation in expectation. Kohavi [1995] conducted
experiments on Naive Bayes and C4.5 using cross-validation. Through his experiments he
concluded that 10 fold stratified cross validation should be used for model selection.
Moore and Lee [1994] proposed heuristics to speed up cross-validation. Plutowski's
[1996] survey included proposals with theoretical results, heuristics and experiments on
cross-validation. His survey was especially geared towards the behavior of cross-validation
on neural networks. He inferred from the previously published results that cross-validation
is robust. More recently, Bengio and Grandvalet [2003] proved that there is no universally
unbiased estimator of the variance of cross-validation. Zhu and Rohwer [1996] proposed
a simple setting in which cross-validation performs poorly. Goutte [1997] refuted this
proposed setting and claimed that a realistic scenario in which cross-validation fails is still
an open question.
The work I present here covers the middle ground between these theoretical and
empirical results by allowing classifier specific results based on moment analysis. Such
an endeavor is important since the gap between theoretical and empirical results is
significant Langford [2005]. Preliminary work of this nature was done in Braga-Neto and
Dougherty [2005] where the authors characterized the discrete histogram rule. However,
their analysis does not provide any indication of how other more popular algorithms can
be characterized in similar fashion keeping in mind scalability and accuracy. Specific
classification schemes such as the W-statistic Anderson [2003] have been characterized
in the past, but such analysis is very much limited to that and other similar statistics.
The methodology I present here may potentially be applicable to a large variety of learning
algorithms.
1.3 Methodology
1.3.1 What is the Methodology?
The methodology for studying classification models consists in studying the behavior
of the first two central moments of the GE of the classification algorithm studied. The
moments are taken over the space of all possible classifiers produced by the classification
algorithm, by training it over all possible datasets sampled independently and identically
(i.i.d.) from some distribution. The first two moments give enough information about the
statistical behavior of the classification algorithm to allow interesting observations about
its behavior/trends. Higher moments may be computed using the same strategy suggested
but might prove to be inefficient to compute.
1.3.2 Why have such a Methodology?
The answers to the following questions shed light on why the methodology is
necessary if tight statistical characterization is to be provided for classification algorithms.
1. Why study GE? The biggest danger of learning is overfitting the training data. The main idea in using GE as a measure of success of learning, instead of the empirical error on a given dataset, is to provide a mechanism to avoid this pitfall. Implicitly, by analyzing GE all the input is considered.
2. Why study the moments instead of the distribution of GE? Ideally, I would study the distribution of GE instead of moments in order to get a complete picture of its behavior. Studying the distribution of discrete random variables, except for very simple cases, turns out to be very hard. The difficulty comes from the fact that even computing the pdf at a single point is intractable, since all combinations of random choices that result in the same value for GE have to be enumerated. On the other hand, the first two central moments coupled with distribution-independent bounds such as Chebyshev and Chernoff give guarantees about the worst possible behavior that are not too far from the actual behavior (small constant factor). Interestingly, it is possible to compute the moments of a random variable like GE without ever explicitly writing or making use of the formula for the pmf/pdf. What makes such an endeavor possible is extensive use of the linearity of expectation, as is explained later.
3. Why characterize a class of classifiers instead of a single classifier? While the use of GE as the success measure is standard practice in Machine Learning, characterizing classes of classifiers instead of the particular classifier produced on a given dataset is not. From the point of view of the analysis, without large testing datasets it is not possible to evaluate GE directly for a particular classifier. By considering classes of classifiers to which a classifier belongs, an indirect characterization is obtained for the particular classifier. This is precisely what Statistical Learning Theory (SLT) does; there the class of classifiers consists of all classifiers with the same VC dimension. The main problem with SLT results is that classes based on VC dimension are too large, thus results tend to be pessimistic. In the methodology, the class of classifiers consists only of the classifiers that are produced by the given classification algorithm from datasets of fixed size from the underlying distribution. This is the smallest probabilistic class in which the particular classifier produced on a given dataset can be placed.
1.3.3 How do I Implement the Methodology?
One way of approximately estimating the moments of GE over all possible classifiers
for a particular classification algorithm is by directly using Monte Carlo. If I use Monte
Carlo directly, I first need to produce a classifier on a sampled dataset and then test it on a
number of test sets sampled from the same distribution, acquiring an estimate of the GE of
this classifier. Repeating this entire procedure a number of times, I would acquire estimates
of GE for different classifiers. Then by averaging the error of these multiple classifiers I
would get an estimate of the first moment of GE. The variance of GE can also be similarly
estimated.
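For concreteness, a minimal sketch of this direct Monte Carlo procedure is given below. It is only an illustration: the data-generating routine `sample_dataset`, the training routine `train_classifier`, and the sample sizes are placeholders for whatever algorithm and distribution are under study, not anything specified in the thesis.

```python
import numpy as np

def mc_moments_of_ge(sample_dataset, train_classifier, n_train=1000,
                     n_classifiers=200, n_test=10000, seed=0):
    """Direct Monte Carlo estimate of E[GE] and Var(GE).

    sample_dataset(n, rng) -> (X, y): draws n i.i.d. points from the chosen
        distribution (a stand-in for the distribution under study).
    train_classifier(X, y) -> object with a .predict(X) method.
    """
    rng = np.random.default_rng(seed)
    ge_estimates = []
    for _ in range(n_classifiers):
        # Build one classifier on a freshly sampled training set of size N.
        X_tr, y_tr = sample_dataset(n_train, rng)
        clf = train_classifier(X_tr, y_tr)
        # Estimate this classifier's GE on a large independent test sample.
        X_te, y_te = sample_dataset(n_test, rng)
        ge_estimates.append(np.mean(clf.predict(X_te) != y_te))
    ge = np.asarray(ge_estimates)
    return ge.mean(), ge.var()  # estimates of E[GE] and Var(GE)
```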
Another way of estimating the moments of GE is by obtaining parametric expressions
for them. If this can be accomplished the moments can be computed exactly. Moreover,
by dexterously observing the manner in which expressions are derived for a particular
classification algorithm, insights can be gained into analyzing other algorithms of interest.
Though deriving the expressions may be a tedious task, using them I obtain highly
accurate estimates of the moments. I propose this second alternative for analyzing models.
The key to the analysis is focusing on the learning and inference phases of the algorithm.
In cases where the parametric expressions are computationally intensive to compute
directly, I show that by approximating individual terms using optimization techniques and
even Monte Carlo, I obtain accurate estimates of the moments when compared to directly
using Monte Carlo (the first alternative) for the same computational cost.
If the moments are to be studied on synthetic data then the distribution is anyway
assumed and the parametric expressions can be directly used. If I have real data an
empirical distribution can be built on the dataset and then the parametric expressions can
be used.
1.4 Applying the Methodology
It is important to note that the methodology is not aimed towards providing a
way of estimating bounds for GE of a classifier on a given dataset. The primary goal is
creating an avenue in which learning algorithms can be studied precisely i.e. studying the
statistical behavior of a particular algorithm w.r.t. a chosen/built distribution. Below, I
discuss the two most important perspectives in which the methodology can be applied.
1.4.1 Algorithmic Perspective
If a researcher/practitioner designs a new classification algorithm, he/she needs to
validate it. Standard practice is to validate the algorithm on a relatively small (5-20)
number of datasets and to report the performance. By observing the behavior of only a
few instances of the algorithm the designer infers its quality. Moreover, if the algorithm
underperforms on some datasets, it can sometimes be difficult to pinpoint the precise
reason for its failure. If instead he/she is able to derive parametric expressions for the
moments of GE, the test results would be more relevant to the particular classification
algorithm, since the moments are over all possible datasets of a particular size drawn
i.i.d. from some chosen/built distribution. Testing individually on all these datasets is an
impossible task. Thus, by computing the moments using the parametric expressions the
algorithm would be tested on a plethora of datasets with the results being highly accurate.
Moreover, since the testing is done in a controlled environment i.e. all the parameters are
known to the designer while testing, he/she can precisely pinpoint the conditions under
which the algorithm performs well and the conditions under which it underperforms.
1.4.2 Dataset Perspective
If an algorithm designer validates his/her algorithm by computing moments as
mentioned earlier, it can instill greater confidence in the practitioner searching for an
appropriate algorithm for his/her dataset. The reason is that if the practitioner has
a dataset which has a similar structure, or is from a similar source, as the test dataset on
which an empirical distribution was built and favourable results were reported by the designer,
then the results apply not only to that particular test dataset but
to other datasets of a similar type; since the practitioner's dataset belongs to this similar
collection, the results would also apply to it. Note that a distribution is just a weighting
of different datasets and this perspective is used in the above exposition.
If the dataset is categorical, it can be precisely modelled by a multinomial distribution
in the following manner. A multinomial is completely characterized by the probabilities
in each of its cells (which sum to 1) and the total count N (sum of individual cell counts).
The designer can set the number of cells in the multinomial to be the number of cells in
his contingency table, with empirical estimates for the individual cell probabilities being
the corresponding cell counts divided by the size of the dataset, which is the value of N.
With this I have a fully specified multinomial distribution with which I can compute the
formulations, consequently characterizing the moments of the GE. Since the estimates for
the cell probabilities are based on the available dataset, the true underlying distribution
of which this dataset is a sample, may have slightly different values. This scenario can
be accounted for, by varying the cell probabilities to a desired degree and observing the
variation in the estimates of GE. This would assist in deciphering the sensitivity of the
model in question to noise. In the continuous case, there is no such generic distribution
(as the multinomial), but a popular choice could be a mixture of Gaussians (other
distributions could also be used).
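As a small illustration of the categorical construction described above (not part of the original text), the sketch below builds the empirical multinomial cell probabilities from a categorical dataset held in a pandas DataFrame, with an optional perturbation of the probabilities to probe sensitivity to noise; the perturbation scheme is a hypothetical choice.

```python
import numpy as np
import pandas as pd

def empirical_multinomial(df: pd.DataFrame, noise: float = 0.0, seed: int = 0):
    """Return (cells, probabilities, N) for the multinomial modeling a
    categorical dataset: one cell per distinct (input, output) row."""
    counts = df.value_counts()            # cell counts over distinct rows
    n_total = int(counts.sum())           # the total count N
    probs = counts.to_numpy() / n_total   # empirical cell probabilities
    if noise > 0.0:
        # Perturb the cell probabilities to study sensitivity to noise.
        rng = np.random.default_rng(seed)
        probs = probs * (1.0 + rng.uniform(-noise, noise, size=probs.size))
        probs = probs / probs.sum()       # renormalize so they sum to 1
    return list(counts.index), probs, n_total
```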
1.5 Research Goals
In this section I state the specific research goals that I have accomplished in this
thesis work.
General Framework: To provide a statistical characterization of classifiers, a probabilistic
class of classifiers that contains the desired classifier has to be considered since the
behavior of any particular classifier can be arbitrarily poor. The class considered
by statistical learning theory is the class of classifiers with a given VC dimension
Vapnik [1998]. While the results thus obtained are very general, no particularity of
the classification algorithm is exploited. The class of classifiers considered in this thesis
consists of the classifiers obtained by applying the classification algorithm to a dataset of given
size sampled i.i.d. from the underlying distribution. This leads to a different way of
characterizing classifiers based on moment analysis. I develop a framework to analyze
classification algorithms.
Analysis of Model Selection Measures: To relate the moments of the Generalization
error (GE) to the moments of Cross-validation error (CE), Leave-one-out error (LE) and
Hold-out-set error (HE). This will assist us in studying the behavior of these errors given
the moments of any one of these errors.
Analysis of Specific Classification Models: To develop customized formulations for
the moments for specific classification algorithms. This will aid in studying classification
algorithms in conjunction with the selection measures. I choose the following models which
are a mix of parametric and non-parametric models.
1. Naive Bayesian Classifier (NBC) model: NBC is a model which is extensively used in industry, due to its robustness, outperforming its more sophisticated counterparts in many real world applications (e.g. spam filtering in Mozilla Thunderbird and Microsoft Outlook, bioinformatics, etc.). There has been work on the robustness of NBC Domingos and Pazzani [1997], Rish [2001], but the proposed framework and the inter-relationships between the moments of the various errors help us to extensively study not just the model but also the behavior of the validation methods in conjunction with it.
2. Decision Trees (DT) model: Decision trees are also extensively used in data mining and machine learning applications. Besides performance, they are sometimes preferred over other models (e.g. Support Vector Machines, neural nets) because the process by which the eventual classifier is built from the sample is transparent. The probabilistic formulations will incorporate various pruning conditions such as purity, scarcity and fixed height. The formulations will help better understand the behavior of these trees for classification.
3. K-Nearest-Neighbor (KNN) Classifier model: This model is one of the simpler models, yet it is highly effective. Theoretical results exist Stone [1977] regarding convergence of the Generalization Error (GE) of this algorithm to the Bayes error (the best possible performance). However, this result is asymptotic, and for finite sample sizes in real scenarios finding the optimal value of K is more of an art than a science. The methodology proposed by us can be used to study the algorithm for different values of K and for different distance metrics accurately in controlled settings.
Scalability: To make the computation of the moments scalable. This is especially
relevant when the domain is discrete and the computation of individual probabilities
becomes computationally intensive. In these cases I have to come up with approximation
techniques that are accurate and fast, making the analysis practical.
Practical Study of Non-asymptotic Behavior of NBC, DT, KNN and Selection
Measures: The formulas of the moments of GE and consequently HE, CE and LE for the
NBC, DT and KNN that are derived using the general framework can be used to carry
out an extensive study of the behavior of these classification algorithms in conjunction
23
with the model selection measures. I have carried out such a comparison with the aim of
identifying interesting trends about the mentioned classification algorithms and the model
selection measures to exemplify the utility of the theoretical framework.
CHAPTER 2
GENERAL FRAMEWORK
Probability distributions completely characterize the behavior of a random variable.
Moments of a random variable give us information about its probability distribution.
Thus, if I have knowledge of the moments of a random variable I can make statements
about its behavior. In some cases, characterizing a finite subset of moments may prove
to be a more desirable alternative than characterizing the entire distribution, which can be
wild and computationally expensive to compute. This is precisely what I do when I study
the behavior of the generalization error of a classifier and the error estimation methods,
viz. hold-out error, leave-one-out error and cross-validation error. Characterizing
the distribution, though possible, can turn out to be a tedious task, and studying the
moments instead is a more viable option. As a result, I employ moment analysis and
use linearity of expectation to explore the relationship between various estimates for the
error of classifiers: generalization error (GE), hold-out-set error (HE), and cross validation
error (CE) — leave-one-out error is just a particular case of CE and I do not analyze
it independently. The relationships are drawn by going over the space of all possible
datasets. The actual computation of moments though is conducted by going over the
space of classifiers induced by a particular classification algorithm and i.i.d. data. This
is done since it leads to computational efficiency. I interchangeably go over the space of
datasets and the space of classifiers as deemed appropriate, since the classification algorithm is
assumed to be deterministic. That is, I have
\[
E_{D(N)}[\mathcal{F}(\zeta[D(N)])] = E_{Z(N)}[\mathcal{F}(\zeta)] = E_{x_1 \times \ldots \times x_m}[\mathcal{F}(\zeta(x_1, x_2, \ldots, x_m))]
\]
where $\mathcal{F}(\cdot)$ is some function that operates on a classifier. I also consider the learning
algorithms to be symmetric (the algorithm is oblivious to random permutations of the
samples in the training dataset).
Throughout this section and in the rest of the thesis I use the notation in Table 2-1
unless stated otherwise.
2.1 Generalization Error (GE)
The notion of generalization error is defined with respect to an underlying probability
distribution defined over the input-output space and a loss function (error metric). I
model this probability space with the random vector X for input and random variable
Y for output. When the input is fixed, Y(x) is the random variable that models the
output.¹ I assume in this thesis that the domain $\mathcal{X}$ of X is discrete; all the theory can
be extended to the continuous case essentially by replacing the counting measure with Lebesgue
measure and sums with integrals. Whenever the probability and expectation are with
respect to this probabilistic space (i.e. (X, Y)) that models the problem, I will not use any
index. For other probabilistic spaces, I will specify by an index which probability
space I refer to. I denote the error metric by λ(a, b); in this thesis I will use only the 0-1
metric that takes value 1 if a ≠ b and 0 otherwise. With this, the generalization error of a
classifier ζ is:
\[
GE(\zeta) = E[\lambda(\zeta(X), Y)] = P[\zeta(X) \neq Y] = \sum_{x \in \mathcal{X}} P[X = x]\, P[\zeta(x) \neq Y(x)] \qquad (2\text{--}1)
\]
where I used the fact that, for the 0-1 loss function, the expectation is the probability that
the prediction is erroneous. Notice that the notation using Y(x) is really a conditional on
X = x. I use this notation since it is intuitive and more compact. The last equation for the
generalization error is the most useful in this thesis since it decomposes a global measure,
generalization error, defined over the entire space into micro measures, one for each input.
¹ By modeling the output for a given input as a random variable, I allow the output to be randomized, as it might be in most real circumstances.
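A short sketch of this per-input decomposition (Equation 2-1) for a finite input space follows. It is only an illustration under stated assumptions: the arrays `p_x` and `p_y_given_x` encoding the joint distribution, and the predicted labels, are assumed inputs rather than anything specified in the thesis.

```python
import numpy as np

def generalization_error(p_x, p_y_given_x, predictions):
    """Exact GE of a single classifier under a known discrete distribution:
    GE = sum_x P[X=x] * P[Y(x) != zeta(x)]   (Equation 2-1, 0-1 loss).

    p_x:          (|X|,)      P[X = x] for each input x.
    p_y_given_x:  (|X|, |Y|)  P[Y(x) = y] for each input x and label y.
    predictions:  (|X|,)      label index zeta(x) predicted for each x.
    """
    # Probability that the true label disagrees with the prediction, per input x.
    p_wrong = 1.0 - p_y_given_x[np.arange(len(p_x)), predictions]
    return float(np.dot(p_x, p_wrong))
```

Under this view the GE of a classifier is just a weighted average of per-input error probabilities, which is the decomposition the moment computations below exploit.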
By carefully selecting the class of classifiers for which the moment analysis of the
generalization error is performed, meaningful and relevant probabilistic statements can
be made about the generalization error of a particular classifier from this class. The
probability distribution over the classifiers will be based on the randomness of the data
used to produce the classifier. To formalize this, let Z(N) be the class of classifiers built
over a dataset of size N with a probability space defined over it. With this, the k-th
moment around 0 of the generalization error is:
\[
E_{Z(N)}\left[GE(\zeta)^k\right] = \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta]\, GE(\zeta)^k
\]
The problem with this definition is that it talks about global characterization of
classifiers which can be hard to capture. I rewrite the formulae for the first and second
moment in terms of fine granularity structure of the classifiers.
While deriving these moments, I have to consider double expectations of the form:
$E_{Z(N)}[E[\mathcal{F}(x, \zeta)]]$ with $\mathcal{F}(x, \zeta)$ a function that depends both on the input x and the
classifier. With this I arrive at the following result:
\begin{align*}
E_{Z(N)}[E[\mathcal{F}(x, \zeta)]] &= \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \sum_{x \in \mathcal{X}} P[X = x]\, \mathcal{F}(x, \zeta) \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta]\, \mathcal{F}(x, \zeta) \\
&= E\left[E_{Z(N)}[\mathcal{F}(x, \zeta)]\right] \qquad (2\text{--}2)
\end{align*}
that uses the fact that $P[X = x]$ does not depend on a particular ζ and $P_{Z(N)}[\zeta]$ does
not depend on a particular x, even though both quantities depend on the underlying
probability distribution.
Using the definition of the moments above, Equation 2–1 and Equation 2–2 I have the
following theorem.
Theorem 1. The first and second moment of GE are given by,
\[
E_{Z(N)}[GE(\zeta)] = \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y]\, P[Y(x) \neq y]
\]
and
\begin{align*}
E_{Z(N) \times Z(N)}[GE(\zeta)\, GE(\zeta')] = \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} & P[X = x]\, P[X = x'] \;\cdot \\
\sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} & P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']\, P[Y(x) \neq y]\, P[Y(x') \neq y']
\end{align*}
Proof.
\begin{align*}
E_{Z(N)}[GE(\zeta)] &= E_{Z(N)}[E[\lambda(\zeta(X), Y)]] \\
&= E\left[E_{Z(N)}[\lambda(\zeta(X), Y)]\right] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta]\, P[\zeta(x) \neq Y(x) \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta]\, P[\zeta(x) = y,\, Y(x) \neq y \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} \sum_{\zeta \in Z(N) \mid \zeta(x) = y} P_{Z(N)}[\zeta]\, P[\zeta(x) = y,\, Y(x) \neq y \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y]\, P[Y(x) \neq y]
\end{align*}

\begin{align*}
&E_{Z(N) \times Z(N)}[GE(\zeta)\, GE(\zeta')] \\
&= E_{Z(N) \times Z(N)}\left[E[\lambda(\zeta(X), Y)]\, E[\lambda(\zeta'(X), Y)]\right] \\
&= \sum_{(\zeta, \zeta') \in Z(N) \times Z(N)} P_{Z(N) \times Z(N)}[\zeta, \zeta'] \left(\sum_{x \in \mathcal{X}} P[X = x]\, P[\zeta(x) \neq Y(x)]\right) \left(\sum_{x \in \mathcal{X}} P[X = x]\, P[\zeta'(x) \neq Y(x)]\right) \\
&= \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x]\, P[X = x'] \sum_{(\zeta, \zeta') \in Z(N) \times Z(N)} P_{Z(N) \times Z(N)}[\zeta, \zeta']\, P[\zeta(x) \neq Y(x)]\, P[\zeta'(x') \neq Y(x')] \\
&= \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x]\, P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']\, P[Y(x) \neq y]\, P[Y(x') \neq y']
\end{align*}
In both series of equations I made the transition from a summation over the class
of classifiers to a summation over the possible outputs since the focus changed from the
classifier to the prediction of the classifier for a specific input (x is fixed inside the first
summation). What this effectively does is allow the computation of moments using
only local information (behavior on particular inputs), not global information (behavior on
all inputs). This speeds up the process of computing the moments.
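The sketch below illustrates the first-moment formula of Theorem 1 for a finite input-output space. It is an illustration only: the array `p_pred`, standing for $P_{Z(N)}[\zeta(x)=y]$, is assumed to have been obtained elsewhere (the later chapters derive it for each specific algorithm).

```python
import numpy as np

def first_moment_of_ge(p_x, p_y_given_x, p_pred):
    """E_{Z(N)}[GE] = sum_x P[X=x] sum_y P_Z[zeta(x)=y] * P[Y(x) != y].

    p_x:          (|X|,)      P[X = x].
    p_y_given_x:  (|X|, |Y|)  P[Y(x) = y].
    p_pred:       (|X|, |Y|)  P_Z[zeta(x) = y] over the space of classifiers.
    """
    # For each input x: probability, over classifiers and labels, that the
    # predicted label y disagrees with the true output Y(x).
    per_input = np.sum(p_pred * (1.0 - p_y_given_x), axis=1)
    return float(np.dot(p_x, per_input))
```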
2.2 Alternative Methods for Computing the Moments of GE
The method I introduced above for computing the moments of the generalization
error is based on decomposing the moment into contributions of individual input-output
pairs. With such a decomposition, not only does the analysis become simpler, but the
complexity of the algorithm required is reduced. In particular, the complexity of
computing the first moment is proportional to the size of the input-output space and
the complexity of estimating probabilities of the form PZ [ζ(x)=y]. The complexity of the
second moment is quadratic in the size of the input-output space and proportional to the
complexity of estimating PZ [ζ(x)=y ∧ ζ(x′)=y′]. To see the advantage of this method, I
compare it with the other two alternatives for computing the moments: definition based
computation and Monte Carlo simulation.
Definition based computation uses the definition of expectation. It consists in
summing over all possible datasets and multiplying the generalization error of the classifier
built from the dataset by the probability of obtaining the dataset as an i.i.d. sample from
the underlying probability distribution. Formally,
\[
E_{D(N)}[GE(\zeta)] = \sum_{D \in \mathcal{D}(N)} P[D]\, GE(\zeta[D]) \qquad (2\text{--}3)
\]
where D(N) is the set of all possible datasets of size N . The number of possible datasets
is exponential in N with the base of the exponent proportional to the size of the
input-output space (the product of the sizes of the domains of inputs and outputs).
Evaluating the moments in this manner is impractical for all but very small spaces and
dataset sizes.
Monte Carlo simulation is a simple way to estimate moments that consists in
performing experiments to produce samples that determine the value of the generalization
error. In this case, to estimate ED(N) [GE(ζ)], datasets of size N have to be generated,
one for each sample desired. For each of these datasets a classifier has to be constructed
according to the classifier construction algorithm. For the classifier produced, samples
from the underlying probability distribution have to be generated in order to estimate
the generalization error of this classifier. Especially for second moments, the amount of
samples required will be large in order to obtain reasonable accuracy for the moments. If a
study has to be conducted in order to determine the influence of various parameters of the
data generation model, the overall number of experiments that have to be performed becomes
infeasible.
In summary, the advantages of the method I propose for estimating the moments
are: (a) the formulations are exact, (b) it needs only local behavior of the classifier, (c)
the time complexity is thus reduced, and (d) it does not depend on the fact that some of the
probabilities are small. I will use this method to compute moments of the generalization
error for the NBC, DT and KNN algorithms.
Table 2-1. Notation used throughout the thesis.
Symbol          Meaning
X               Random vector modeling input
$\mathcal{X}$   Domain of random vector (input space) X
Y               Random variable modeling output
Y(x)            Random variable modeling output for input x
$\mathcal{Y}$   Set of class labels (output space)
D               Dataset
(x, y)          Data-point from dataset D
Dt              Training dataset
Ds              Testing dataset
Di              The ith part/fold of D (for cross validation)
N               Size of dataset
Nt              Size of training dataset
Ns              Size of testing dataset
v               Number of folds of cross validation
ζ               Classifier
ζ[D]            Classifier built from dataset D
GE(ζ)           Generalization error of classifier ζ
HE(ζ)           Hold-out-set error of classifier ζ
CE(ζ)           Cross validation error of classifier ζ
Z(S)            The set of classifiers obtained by application of the classification algorithm to an i.i.d. set of size S
D(S)            Dataset of size S
$E_{Z(S)}[\,]$  Expectation w.r.t. the space of classifiers built on a sample of size S
CHAPTER 3
ANALYSIS OF MODEL SELECTION MEASURES
The exact computation of the generalization error depends on the actual underlying
probability distribution, which is unknown, and hence other estimates for the generalization
error have been introduced: hold-out set (HOS), leave-one-out (LOO), and v-fold cross
validation (CV). In this section I establish relationships between moments of
these error metrics and the moments of the generalization error with respect to some
distribution over the classifiers. The general setup for the analysis for all these metrics is
the following. A dataset D of size N is provided, containing i.i.d. samples coming from
the underlying distribution over the input and outputs. The set is further divided and
used both to build a classifier and to estimate the generalization error; the particular way
this is achieved is slightly different for each error metric. The important question I will
ask is how the values of the error metrics relate to the generalization error. In all the
developments that follow I will assume that ζ[D] is the classifier built deterministically
from the dataset D.
3.1 Hold-out Set Error
The HOS error involves randomly partitioning the dataset D into two parts Dt, the
training dataset of fixed size Nt, and Ds, the test dataset of fixed size Ns. A classifier is
built over the training dataset and the generalization error is estimated as the average
error over the test dataset. Formally, denoting the random variable that gives the HOS
error by HE I have:
\[
HE = \frac{1}{N_s} \sum_{(x, y) \in D_s} \lambda(\zeta[D_t](x), y) \qquad (3\text{--}1)
\]
where y is the actual label for the input x.
Proposition 1. The expected value of HE is given by,
$E_{D_t(N_t) \times D_s(N_s)}[HE] = E_{D_t(N_t)}[GE(\zeta[D_t])]$
Proof. Using the notation in Table 2-1 and realizing that all the datapoints are i.i.d. I
derive the above result.
\begin{align*}
E_{D_t(N_t) \times D_s(N_s)}[HE] &= E_{D_t(N_t)}\left[E_{D_s(N_s)}\left[\frac{\sum_{(x, y) \in D_s} \lambda(\zeta[D_t](x), y)}{N_s}\right]\right] \\
&= E_{D_t(N_t)}\left[E_{D_s(N_s)}[P[\zeta[D_t](x) \neq y \mid D_s]]\right] \\
&= E_{D_t(N_t)}\left[\sum_{D_s} P[\zeta[D_t](x) \neq y \mid D_s]\, P[D_s]\right] \\
&= E_{D_t(N_t)}\left[\sum_{D_s} P[\zeta[D_t](x) \neq y,\, D_s]\right] \\
&= E_{D_t(N_t)}[P[\zeta[D_t](x) \neq y]] \\
&= E_{D_t(N_t)}[GE(\zeta[D_t])]
\end{align*}
where I used the fact that by going over all values of one r.v. I get the probability of the
other.
I observe from the above result that the expected value of HE is dependent only on
the size of the training set Dt. This result is intuitive since only Nt data-points are used
for building the classifier.
Lemma 1. The second moment of HE is given by,
$E_{D_t(N_t) \times D_s(N_s)}[HE^2] = \frac{1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])] + \frac{N_s - 1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])^2]$.
Proof. To compute the second moment of HE, from the definition in Equation 3–1 I have:
\[
E_{D_t(N_t) \times D_s(N_s)}[HE^2] = \frac{1}{N_s^2} E_{D_t(N_t) \times D_s(N_s)}\left[\sum_{(x, y) \in D_s} \sum_{(x', y') \in D_s} \lambda(\zeta[D_t](x), y)\, \lambda(\zeta[D_t](x'), y')\right]
\]
The expression under the double sum depends on whether (x, y) and (x', y') are the same
or not. When they are the same, I am precisely in the case I derived for $E_{D_t(N_t) \times D_s(N_s)}[HE]$
above, except that I have $N_s^2$ in the denominator. This gives us the term
$\frac{1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])]$. When they are different, i.e. $(x, y) \neq (x', y')$, then I get
1
N2s
EDt(Nt)×Ds(Ns)
∑
(x,y)∈Ds
∑
(x′,y′)∈Ds\(x,y)
λ(ζ[Dt](x), y)λ(ζ[Dt](x′), y′)
=Ns − 1
Ns
EDt(Nt)×Ds(Ns)
[∑(x,y)∈Ds
λ(ζ[Dt](x), y)
Ns
∑(x′,y′)∈Ds\(x,y) λ(ζ[Dt](x
′), y′)
Ns − 1
]
=Ns − 1
Ns
EDt(Nt)×Ds(Ns) [P [ζ[Dt](x) 6= y|(x, y) ∈ Ds]P [ζ[Dt](x′) 6= y′|(x′, y′) ∈ Ds \ (x, y)]]
=Ns − 1
Ns
EDt(Nt) [EDs [P [ζ[Dt](x) 6= y|(x, y) ∈ Ds]] EDs [P [ζ[Dt](x′) 6= y′|(x′, y′) ∈ Ds \ (x, y)]]]
=Ns − 1
Ns
EDt(Nt)
[GE(ζ[Dt])
2]
where I used the primary fact that since the samples are i.i.d. any function applied on two
distinct inputs is also independent. This the reason why the EDs [] factorizes.
Putting everything together, and observing that terms inside summations are
constants, I have:
EDt(Nt)×Ds(Ns)
[HE2
]=
1
Ns
EDt(Nt) [GE(ζ[Dt])] +Ns − 1
Ns
EDt(Nt)
[GE(ζ[Dt])
2]
Theorem 2. The variance of HE is given by

Var_{D_t(N_t) \times D_s(N_s)}(HE) = \frac{1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])] + \frac{N_s - 1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])^2] - E_{D_t(N_t)}[GE(\zeta[D_t])]^2.
Proof. The proof follows immediately from Proposition 1 and Lemma 1 by using the
formula for the variance of a random variable, in this case HE: Var(HE) = E[HE^2] - E[HE]^2.

Unlike the first moment, the variance depends on the sizes of both the training set and
the test set.
3.2 Multifold Cross Validation Error
v-fold cross validation consists of randomly partitioning the available data into v
equal-sized parts, training v classifiers, each using all the data but one chunk, and then
testing the performance of each classifier on the chunk it did not see. The estimate of the
generalization error of the classifier built from the entire data is the average error over the
chunks. Using the notation in this thesis and denoting by Di the ith chunk of the dataset
D, the cross validation error (CE) is:
CE = \frac{1}{v} \sum_{i=1}^{v} HE_i
Notice that I expressed CE in terms of HE, the HOS error. By substituting the formula
for HE from Equation 3–1 into the above equation, a direct definition of CE is obtained, if
desired.
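A minimal sketch of this computation is given below; build_classifier is a hypothetical training routine returning a callable classifier, and the fold assignment is a simple random partition.

```python
import numpy as np

def cross_validation_error(build_classifier, X, y, v, seed=0):
    """v-fold CV error: the average of the per-fold hold-out errors HE_i."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), v)
    fold_errors = []
    for i in range(v):
        train_idx = np.concatenate([folds[j] for j in range(v) if j != i])
        clf = build_classifier(X[train_idx], y[train_idx])
        predictions = np.array([clf(x) for x in X[folds[i]]])
        fold_errors.append(np.mean(predictions != y[folds[i]]))
    return float(np.mean(fold_errors))
```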
In this case I have a classifier for each chunk, not a single classifier for the entire
data. I model the selection of N i.i.d. samples that constitute the dataset D and the
partitioning into v chunks. With this I have:
Proposition 2. The expected value of CE is given by

E_{D_t(\frac{v-1}{v}N) \times D_i(\frac{N}{v})}[CE] = E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])]
Proof. Using the above equation and Proposition 1, I get the result:

E_{D_t(\frac{v-1}{v}N) \times D_i(\frac{N}{v})}[CE] = \frac{1}{v} \sum_{i=1}^{v} E_{D_t(\frac{v-1}{v}N) \times D_i(\frac{N}{v})}[HE_i]
 = \frac{1}{v} \sum_{i=1}^{v} E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])]
 = E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])]
This result follows intuition, since it states that the expected error is the generalization
error of a classifier trained on \frac{v-1}{v}N data-points. Thus, at least in expectation, cross
validation behaves exactly like a HOS estimate whose classifier is trained over v − 1 chunks.
For CE I could compute the second moment around zero using the strategy
previously shown and then compute the variance. Here, I compute the variance using
the relationship between the variance of the sum and the variances and covariances of
individual terms. In this way I can decompose the overall variance of CE into the sum of
variances of individual estimators and the sum of covariances of pairs of such estimators;
this decomposition significantly enhances the understanding of the behavior of CE, as I
will see in the example in Section 4.1.
Var(CE) = \frac{1}{v^2}\left( \sum_{i=1}^{v} Var(HE_i) + \sum_{i \neq j} Cov(HE_i, HE_j) \right)    (3–2)
The quantity Var(HE_i), the variance of the HE on training data of size \frac{v-1}{v}N and test
data of size \frac{1}{v}N, is computed using the formulae in the previous section. The only things
that remain to be computed are the covariances. Since I already computed the expectation
of HE, to compute the covariance it is enough to compute the quantity Q = E[HE_i HE_j]
(since for any two random variables X, Y I have Cov(X, Y) = E[XY] - E[X] E[Y]). Let
D_t^i denote D \setminus D_i and let N_s = N/v. With this I have the following lemma,
Lemma 2.

E_{D_t^i(\frac{v-1}{v}N) \times D_i(\frac{N}{v}) \times D_t^j(\frac{v-1}{v}N) \times D_j(\frac{N}{v})}[HE_i\, HE_j] = E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N)}\big[GE(\zeta[D \setminus D_i])\, GE(\zeta[D \setminus D_j])\big].
Proof.

E_{D_t^i(\frac{v-1}{v}N) \times D_i(\frac{N}{v}) \times D_t^j(\frac{v-1}{v}N) \times D_j(\frac{N}{v})}[HE_i\, HE_j]
 = E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N) \times D_i(\frac{N}{v})}\left[ HE_i\, E_{D_j(\frac{N}{v})}\left[ \frac{\sum_{(x_j, y_j) \in D_j} \lambda(\zeta[D_t^j](x_j), y_j)}{N_s} \right] \right]
 = E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N)}\left[ GE(\zeta[D \setminus D_j])\, E_{D_i(\frac{N}{v})}[HE_i] \right]
 = E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N)}\big[ GE(\zeta[D \setminus D_j])\, GE(\zeta[D \setminus D_i]) \big]
where I used the fact that the datasets D_i and D_j are disjoint and drawn i.i.d. It is
important to observe that, due to the fact that D \setminus D_i and D \setminus D_j intersect (the intersection
is D \setminus (D_i \cup D_j)), the two classifiers will neither be independent nor identical. As was the
case for the first and second moments of GE, this moment will depend only on the size of
the intersection and the sizes of the two sets, since all points are i.i.d. This means that the
expression has the same value for any pair i, j with i \neq j.
Theorem 3. The variance of CE is given by

Var_{D_t^i(\frac{v-1}{v}N) \times D_i(\frac{N}{v}) \times D_t^j(\frac{v-1}{v}N) \times D_j(\frac{N}{v})}(CE) = \frac{1}{N} E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])] + \frac{N - v}{vN} E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])^2] - \frac{v - 2}{v} E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])]^2 + \frac{v - 1}{v} E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N)}\big[GE(\zeta[D \setminus D_j])\, GE(\zeta[D \setminus D_i])\big]
Proof. The expression for the covariance is immediate from the above result and so using
Equation 3–2 I derive the variance of CE.
It is worth mentioning that leave-one-out (LOO) is just a special case of v-fold cross
validation (v = N for leave-one-out). The formulae above apply to LOO as well; thus no
separate analysis is necessary.
With this I have related the first two moments of HE and CE to those of GE. Hence,
if I can compute the moments of GE I can also compute the moments of HE and CE,
allowing us to study the model as well as the selection measures. In the next couple of
chapters I thus focus our attention on computing the moments of GE efficiently for the
following classification models – NBC, DT and KNN.
CHAPTER 4
NAIVE BAYES CLASSIFIER, SCALABILITY AND EXTENSIONS
4.1 Example: Naive Bayes Classifier
The results (i.e. expressions and relationships) I derived in the previous chapter were
generic, applicable to any deterministic classification algorithm. I can thus use those results
to study the behavior of the errors for a classification algorithm of our choice.
The classification algorithm I consider in this chapter is naive Bayes. I first study the
naive Bayes classifier for a single input attribute (i.e. for one dimension) and later the
generalized version, maintaining scalability. As I will see, these moments are too complicated,
as mathematical formulae, to interpret. I will plot these moments to gain an understanding
of the behavior of the errors under different conditions, thus portraying the usefulness of
the proposed method.
4.1.1 Naive Bayes Classifier Model (NBC)
In order to compute the moments of the generalization error, I first have to select
a classifier and specify the construction method. I selected the single input naive Bayes
classifier since the analysis is not too complicated but highlights both the method and the
difficulties that have to be overcome. As I will see, even this simplified version exhibits
interesting behavior. I fix the number of class labels to 2 as well. In the next section, I
discuss how the analysis I present here extends to the general NBC.
Given values for any of the inputs, the NBC computes the probability to see any
of the class labels as output under the assumption that the inputs influence the output
independently. The prediction is the class label that has the largest such estimated
probability. For the version of the naive Bayes classifier I consider here (i.e. a single input),
the prediction given input x is:
\zeta(x) = \arg\max_{k \in \{1,2\}} P[Y = y_k]\, P[X = x \mid Y = y_k]
The probabilities that appear in the formula are estimated using the counts in the
contingency table in Table 4-1. Using the fact that P[Y = y_k]\, P[X = x \mid Y = y_k] =
P[X = x \wedge Y = y_k] and the fact that P[X = x_i \wedge Y = y_k] is \frac{N_{ik}}{N}, the prediction of the
classifier is:

\zeta(x_i) = \begin{cases} y_1 & \text{if } N_{i1} \geq N_{i2} \\ y_2 & \text{if } N_{i1} < N_{i2} \end{cases}
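The decision rule reduces to a comparison of two counts per row of the contingency table; a small illustrative sketch (with a hypothetical counts array) is:

```python
import numpy as np

def nbc_predict_single_input(counts, i):
    """Prediction of the single-input, two-class NBC from the contingency table
    counts[i, k] = N_ik (rows: attribute values x_i, columns: classes y_1, y_2).
    Ties are broken in favour of y_1, matching the decision rule above."""
    return 1 if counts[i, 0] >= counts[i, 1] else 2

# Example: a 3-valued attribute; row 1 (0-indexed) favours class y_2.
counts = np.array([[5, 2], [1, 4], [3, 3]])
assert nbc_predict_single_input(counts, 1) == 2
assert nbc_predict_single_input(counts, 2) == 1
```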
4.1.2 Computation of the Moments of GE
Under the already stated data generation model, the moments of the generalization
error for the NBC can be computed. I now present three approaches for computing the
moments and show that the approach using theorem 1 is by far the most practical.
Going over datasets: If I calculate the moments by going over all possible datasets,
the number of terms in the formulation of the moments is exponential in the number
of attribute values, with the base of the exponential being the size of the dataset (i.e.
O(N^n) terms). This is because each of the cells in Table 4-1 can take O(N) values. The
formulation of the first moment would be as follows.
E_{D(N)}[GE(\zeta(D(N)))] = \sum_{N_{11}=0}^{N} \sum_{N_{12}=0}^{N - N_{11}} \cdots \sum_{N_{n1}=0}^{N - (N_{11} + \cdots + N_{(n-1)2})} e\, P[N_{11}, \ldots, N_{n2}]    (4–1)
where e is the corresponding error of the classifier. I see that this formulation can be
tedious to deal with. So can I do better? Yes, I definitely can, and this stems from the
following observation. For the NBC built on Table 4-1, all I care about in the classification
process is the relative counts in each of the rows. Thus, if I had to classify a datapoint
with attribute value x_i I would classify it into class y_1 if N_{i1} > N_{i2} and vice-versa. What
this means is that, irrespective of the actual counts of N_{i1} and N_{i2}, as long as N_{i1} > N_{i2}
the classification algorithm would make the same prediction, i.e. I would have the same
classifier. I can hence switch from going over the space of all possible datasets to going
over the space of all possible classifiers, with the advantage of reducing the number of
terms.
Going over classifiers: If I find the moments by going over the space of possible
classifiers I reduce the number of terms from O(N^n) to O(2^n). This is because there are
only two possible relations between the counts in any row (≥ or <). The formulation for
the first moment would then be as follows.
E_{Z(N)}[GE(\zeta)] = e_1 P[N_{11} \geq N_{12}, \ldots, N_{n1} \geq N_{n2}] + e_2 P[N_{11} < N_{12}, \ldots, N_{n1} \geq N_{n2}] + \cdots + e_{2^n} P[N_{11} < N_{12}, \ldots, N_{n1} < N_{n2}]
where e_1, e_2, \ldots, e_{2^n} are the corresponding errors. Though this formulation reduces the
complexity significantly, since N ≫ 2 for any practical scenario, the number of terms is
still exponential in n. Can I still do better? The answer is yes again. Here is where
Theorem 1 gains prominence. To restate, Theorem 1 says that while calculating the first
moment I just need to look at particular rows of the table, and to calculate the second
moment just pairs of rows. This reduces the complexity significantly without compromising
on the accuracy, as I will see.
Going over classifiers using Theorem 1: If I use Theorem 1 the number of terms
reduces from an exponential in n to a small polynomial in n. Thus the number of terms
in finding the first moment is just O(n) and that for the second moment is O(n^2). The
formulation for the first moment would then be as follows.
E_{Z(N)}[GE(\zeta)] = e_1 P[N_{11} \geq N_{12}] + e_2 P[N_{11} < N_{12}] + \cdots + e_{2n} P[N_{n1} < N_{n2}]
where e_1, e_2, \ldots, e_{2n} are the corresponding errors. They are basically the respective cell
probabilities of the multinomial (i.e. e_1 is the probability of a datapoint belonging to the cell
x_1C_2, e_2 is the probability of belonging to x_1C_1, and so on). For the second moment I would have
joint probabilities, with the expression being the following,
E_{Z(N)}[GE(\zeta)^2] = (e_1 + e_3) P[N_{11} \geq N_{12}, N_{21} \geq N_{22}] + (e_2 + e_3) P[N_{11} < N_{12}, N_{21} \geq N_{22}] + \cdots + (e_{2n-2} + e_{2n}) P[N_{(n-1)1} < N_{(n-1)2}, N_{n1} < N_{n2}]
I have thus reduced the number of terms from O(N^n) to O(n^k), where k is small and
depends on the order of the moment I am interested in. This formulation has another
advantage. The complexity of calculating the individual probabilities is also significantly
reduced. The probabilities for the first moment can be computed in O(N^2) time and those
for the second in O(N^4) time, rather than O(N^{n-1}) and O(N^{2n-2}) time respectively.
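As an illustration of the O(n · N^2) computation, the sketch below evaluates the first-moment formulation directly: for each row the pair (N_i1, N_i2) is marginally a trinomial, which the double sum enumerates. The function name and array layout are assumptions made for the example.

```python
import numpy as np
from math import comb

def first_moment_ge(p, N):
    """First moment of GE for the single-input, two-class NBC via the O(n * N^2)
    formulation above.  p[i, k] = P[X = x_i, Y = y_k]; N is the training-set size."""
    p = np.asarray(p, dtype=float)
    moment = 0.0
    for i in range(p.shape[0]):
        p1, p2 = p[i, 0], p[i, 1]
        rest = 1.0 - p1 - p2
        # P[N_i1 >= N_i2]: the pair (N_i1, N_i2) is marginally trinomial(N; p1, p2, rest)
        prob_ge = 0.0
        for a in range(N + 1):
            for b in range(N - a + 1):
                if a >= b:
                    prob_ge += comb(N, a) * comb(N - a, b) * p1**a * p2**b * rest**(N - a - b)
        # error probability for row i: p2 if the classifier predicts y_1, p1 otherwise
        moment += p2 * prob_ge + p1 * (1.0 - prob_ge)
    return moment
```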
Further optimizations can be done by identifying independence between random
variables, expressing the probabilities as binomial cdfs, and using the regularized incomplete
beta function to calculate these cdfs in essentially constant time. In fact, in later sections
I discuss the general NBC model, for which the cdfs (probabilities) cannot be computed
directly as it turns out to be too expensive. There I propose strategies to efficiently
compute these probabilities. The same strategies can be used here to make the computation
more scalable.
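For instance, a binomial cdf can be evaluated in essentially constant time through the regularized incomplete beta function; a minimal SciPy sketch, using the standard identity P[Bin(n, p) ≤ k] = I_{1-p}(n-k, k+1), is shown below.

```python
from scipy.special import betainc

def binom_cdf(k, n, p):
    """P[Bin(n, p) <= k] via the regularized incomplete beta function."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return betainc(n - k, k + 1, 1.0 - p)
```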
The situation when ζ and ζ′ are the classifiers constructed for two different folds in
cross validation requires special treatment. Without loss of generality, assume that the
classifiers are built for folds 1 and 2. If I let D_1, \ldots, D_v be the partitioning of the
dataset into v parts, ζ is constructed using D_2 \cup D_3 \cup \cdots \cup D_v and ζ′ is constructed using
D_1 \cup D_3 \cup \cdots \cup D_v; thus the training data D_3 \cup \cdots \cup D_v is common to both. If I denote by N_{jk}
the number of data-points with X = x_j and Y = y_k in this common part, and by N_{jk}^{(1)} and
N_{jk}^{(2)} the number of such data-points in D_1 and D_2 respectively, then I have to compute
probabilities of the form,

P\left[\left(N_{i1}^{(2)} + N_{i1} > N_{i2}^{(2)} + N_{i2}\right) \wedge \left(N_{j1}^{(1)} + N_{j1} > N_{j2}^{(1)} + N_{j2}\right)\right]
The estimation of this probability using the above method requires fixing the values of 6
random variables, thus giving an O(N^6) algorithm. Again, further optimizations can be
carried out using the strategies suggested later.
Using the moments of GE, the moments of HE and CE are found using relationships
already derived.
4.2 Full-Fledged NBC
In the previous section I discussed the NBC built on data in a single dimension.
As the dimensionality increases, the cost of exactly computing the moments from the
formulations also increases. To maintain the scalability of the method, I propose a number
of approximation schemes which can be used to estimate the probabilities efficiently
and accurately. As I will see, approximating these probabilities leads to highly accurate
estimates of the moments for low computational cost, as against directly using Monte
Carlo. The approximation schemes I propose assist in efficient computation of the
probabilities in arbitrary dimension. As a matter of fact the approximation schemes
are generic enough to be applied to any application where cdfs need to be approximated
efficiently.
4.2.1 Calculation of Basic Probabilities
Having come up with the probabilistic formulation for discerning the moments of the
generalization error, I am now faced with the daunting task of efficiently calculating the
probabilities involved for the NBC when the number of dimensions is more than 1. In this
section I will mainly discuss single probabilities; the extension to joint probabilities is in
Section 4.4. Let us now briefly preview the kind of probabilities I need to decipher.
With reference to Figure 4-1, considering the cell x1y1 without loss of generality
(w.l.o.g.) and by the Naive Bayes classifier independence assumption, I need to find the
probability of the following condition being true for the 2-dimensional case,
p_{c_1}\, \frac{p_{11}^{x}}{p_{c_1}}\, \frac{p_{11}^{y}}{p_{c_1}} > p_{c_2}\, \frac{p_{12}^{x}}{p_{c_2}}\, \frac{p_{12}^{y}}{p_{c_2}}
\quad \text{i.e. } p_{c_2}\, p_{11}^{x}\, p_{11}^{y} > p_{c_1}\, p_{12}^{x}\, p_{12}^{y}
\quad \text{i.e. } N_2\, N_{11}^{x}\, N_{11}^{y} > N_1\, N_{12}^{x}\, N_{12}^{y}
In general, for the d-dimensional (d ≥ 2) case I have to find the following probability,

P\left[N_2^{(d-1)}\, N_{11}^{x_1} N_{11}^{x_2} \cdots N_{11}^{x_d} > N_1^{(d-1)}\, N_{12}^{x_1} N_{12}^{x_2} \cdots N_{12}^{x_d}\right]    (4–2)
where the xi are random variables.
4.2.2 Direct Calculation
I can find the probability P[N_2^{(d-1)} N_{11}^{x_1} \cdots N_{11}^{x_d} > N_1^{(d-1)} N_{12}^{x_1} \cdots N_{12}^{x_d}] by summing
over all possible assignments of the multinomial random variables involved. For the
2-dimensional case shown in Figure 4-1 I have,
P[N_2 N_{11}^{x} N_{11}^{y} > N_1 N_{12}^{x} N_{12}^{y}] = \sum_{N_{111}} \sum_{N_{121}} \sum_{N_{211}} \sum_{N_{112}} \sum_{N_{122}} \sum_{N_{212}} \sum_{N_{222}} P[N_{111}, N_{121}, N_{211}, N_{112}, N_{122}, N_{212}, N_{221}, N_{222}] \cdot I[N_2 N_{11}^{x} N_{11}^{y} > N_1 N_{12}^{x} N_{12}^{y}]
where N_2 = N_{112} + N_{122} + N_{212} + N_{222}, N_{11}^{x} = N_{111} + N_{121}, N_{11}^{y} = N_{111} + N_{211}, N_1 = N - N_2,
N_{12}^{x} = N_{112} + N_{122}, N_{12}^{y} = N_{112} + N_{212}, and I[condition] = 1 if the condition is true, else
I[condition] = 0. Each of the summations takes O(N) values and so the worst case time
complexity is O(N^7). I thus observe that for the simple scenario depicted, the time to
compute the probabilities is unreasonable even for small datasets (N = 100, say). The
number of summations increases linearly with the dimensionality of the space. Hence, the
time complexity is exponential in the dimensionality. I thus need to resort to approximations
to speed up the process.
4.2.3 Approximation Techniques
If all the moments of a random variable are known, then I know the moment
generating function (MGF) of the random variable and, as a consequence, the probability
generating function and hence the precise cdf for any value in the domain of the random
variable. If only a subset of the moments is known then I can at best approximate the
MGF and so the cdf.
I need to compute probabilities (cdfs) of the form P[X > 0], where X =
N_2^{(d-1)} N_{11}^{x_1} N_{11}^{x_2} \cdots N_{11}^{x_d} - N_1^{(d-1)} N_{12}^{x_1} N_{12}^{x_2} \cdots N_{12}^{x_d}. Most of the alternative approximation
techniques I propose in the subsections that follow, to efficiently compute the above
probabilities (cdfs), are based on the fact that I have knowledge of some finite subset of the
moments of the random variable X. I now elucidate a method to obtain these moments.
Derivation of Moments: As previously mentioned the most general data generation
model for the discrete case is the multinomial distribution. I know the moment generating
function for it. A moment generating function generates all the moments of a random
variable, uniquely defining its distribution. The MGF of a multivariate distribution is
defined as follows,
M_R(t) = E(e^{R' t})    (4–3)

where R is a q-dimensional random vector, R' is the transpose of R and t ∈ R^q. In our case q
is the number of cells in the multinomial.
Taking different order partial derivatives of the moment generating function w.r.t. the
elements of t and setting those elements to zero, gives us moments of the product of the
random variables in the multinomial raised to those orders. Formally,
\frac{\partial^{v_1 + v_2 + \cdots + v_q} M_R(t)}{\partial t_1^{v_1} \partial t_2^{v_2} \cdots \partial t_q^{v_q}} \bigg|_{t_1 = t_2 = \cdots = t_q = 0} = E(R_1^{v_1} R_2^{v_2} \cdots R_q^{v_q})    (4–4)

where R' = (R_1, R_2, \ldots, R_q), t = (t_1, t_2, \ldots, t_q) and v_1, v_2, \ldots, v_q are the orders of the partial
derivatives w.r.t. t_1, t_2, \ldots, t_q respectively.
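A small symbolic sketch of Eq. 4-4, using SymPy to differentiate the multinomial MGF; the function name and argument conventions are assumptions of the example, not part of the thesis.

```python
import sympy as sp

def multinomial_moment(p, N, orders):
    """E[R_1^{v_1} ... R_q^{v_q}] for multinomial counts R ~ Mult(N, p), obtained by
    differentiating the MGF M_R(t) = (sum_j p_j e^{t_j})^N as in Eq. 4-4."""
    q = len(p)
    t = sp.symbols(f"t0:{q}")
    mgf = sum(p_j * sp.exp(t_j) for p_j, t_j in zip(p, t)) ** N
    deriv = mgf
    for t_j, v_j in zip(t, orders):
        if v_j:
            deriv = sp.diff(deriv, t_j, v_j)
    return sp.simplify(deriv.subs({t_j: 0 for t_j in t}))

# Example: E[R_1] for Mult(N=10, p=(1/2, 1/3, 1/6)) equals N * p_1 = 5.
print(multinomial_moment([sp.Rational(1, 2), sp.Rational(1, 3), sp.Rational(1, 6)], 10, (1, 0, 0)))
```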
The expressions for these derivatives can be precomputed or computed at run time
using tools such as Mathematica (Wolfram Research). But how does all of what I have
just discussed relate to our problem? Consider the 2-dimensional case given in Figure
4-1. I need to find the probability P[Z > 0] where Z = N_2 N_{11}^{x} N_{11}^{y} - N_1 N_{12}^{x} N_{12}^{y}. The
individual terms in the product can be expressed as a sum of certain random variables
in the multinomial. Thus Z can be written as a sum of products of some of the
multinomial random variables. Consider the first term in Z,

N_2 N_{11}^{x} N_{11}^{y} = (N_{112} + N_{122} + N_{212} + N_{222})(N_{111} + N_{121})(N_{111} + N_{211}) = N_{112} N_{111}^2 + \cdots + N_{222} N_{121} N_{211}

The second term can also be expressed in this form. Thus Z can be written as a sum of
products of the multinomial random variables.
E[Z] = E[N_2 N_{11}^{x} N_{11}^{y} - N_1 N_{12}^{x} N_{12}^{y}]
 = E[N_2 N_{11}^{x} N_{11}^{y}] - E[N_1 N_{12}^{x} N_{12}^{y}]
 = E[N_{112} N_{111}^2 + \cdots + N_{222} N_{121} N_{211}] - E[N_{111} N_{112}^2 + \cdots + N_{221} N_{122} N_{212}]
 = E[N_{112} N_{111}^2] + \cdots + E[N_{222} N_{121} N_{211}] - E[N_{111} N_{112}^2] - \cdots - E[N_{221} N_{122} N_{212}]
In the general case Z = N_2^{d-1}\, N_{11\ldots1}^{x_1} \cdots N_{11\ldots1}^{x_d} - N_1^{d-1}\, N_{11\ldots12}^{x_1} \cdots N_{11\ldots12}^{x_d}, where each subscript
of N with dots has d + 1 numbers. The expected value of Z is then given by,

E[Z] = E[N_{11\ldots12}\, N_{11\ldots1}^{d}] + \cdots + E[N_{m_1 m_2 \ldots m_d 2}\, N_{11\ldots m_d 1} \cdots N_{m_1 1\ldots1}] - E[N_{11\ldots11}\, N_{11\ldots2}^{d}] - \cdots - E[N_{m_1 m_2 \ldots m_d 1}\, N_{11\ldots m_d 2} \cdots N_{m_1 1\ldots12}]

where m_i denotes the number of attribute values of x_i. These expectations can be
computed using the technique in the discussion before. Higher moments can also be found
in the same vein since I would only need to find expectations of higher degree polynomials
in the random variables of the multinomial. Similarly, the expressions for the moments in
higher dimensions will also include higher degree polynomials.
4.2.3.1 Series approximations (SA)
The Edgeworth or the Gram-Charlier A series (Hall [1992]) are used to approximate
distributions of random variables whose moments, or more specifically cumulants, are
known. These expansions consist in writing the characteristic function of the unknown
distribution, whose probability density is to be approximated, in terms of the characteristic
function of another known distribution (usually normal). The density to be found is then
recovered by taking the inverse Fourier transform.
Let p_{uc}(t), p_{ud}(x) and κ_i be the characteristic function, probability density function
and the ith cumulant of the unknown distribution respectively. And let p_{kc}(t), p_{kd}(x) and
γ_i be the characteristic function, probability density function and the ith cumulant of the
known distribution respectively. Hence,
p_{uc}(t) = \exp\!\left[\sum_{a=1}^{\infty} (\kappa_a - \gamma_a) \frac{(it)^a}{a!}\right] p_{kc}(t)

p_{ud}(x) = \exp\!\left[\sum_{a=1}^{\infty} (\kappa_a - \gamma_a) \frac{(-D)^a}{a!}\right] p_{kd}(x)
where D is the differential operator. If p_{kd}(x) is a normal density then I arrive at the
following expansion,

p_{ud}(x) = \frac{1}{\sqrt{2\pi\kappa_2}}\, e^{-\frac{(x-\kappa_1)^2}{2\kappa_2}} \left[1 + \frac{\kappa_3}{\kappa_2^{3/2}}\, H_3\!\left(\frac{x - \kappa_1}{\sqrt{\kappa_2}}\right) + \cdots\right]    (4–5)
where H_3(x) = (x^3 - 3x)/3! is the 3rd Hermite polynomial. This method works reasonably
well in practice, as can be seen in Levin [1981] and Butler and Sutton [1998]. The major
challenge, though, lies in choosing a distribution that will approximate the unknown
distribution "well", as the accuracy of the cdf estimate depends on this. The performance
of the method may vary significantly with the choice of this distribution, since choosing the
normal distribution may not always give satisfactory results. This task of choosing an
appropriate distribution is non-trivial.
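A minimal sketch of the truncated expansion in Eq. 4-5, assuming the first three cumulants of the unknown distribution are available and a normal reference density is used:

```python
import numpy as np

def gram_charlier_pdf(x, k1, k2, k3):
    """Truncated Gram-Charlier A series density (Eq. 4-5) built from the first
    three cumulants (k1, k2, k3) with a normal reference density."""
    z = (x - k1) / np.sqrt(k2)
    he3 = z**3 - 3.0 * z                                  # probabilist's Hermite polynomial
    normal = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi * k2)
    return normal * (1.0 + (k3 / k2**1.5) * he3 / 6.0)   # H_3(z) = (z^3 - 3z)/3!
```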
4.2.3.2 Optimization
I have just seen a method of approximating the cdf using series expansions.
Interestingly, this problem can also be framed as an optimization problem, wherein I
find upper and lower bounds on the possible values of the cdf by optimizing over the set
of all possible distributions having these moments. Since our unknown distribution is an
element of this set, its cdf will lie within the bounds computed. This problem is called
the classical moment problem and has been studied in the literature (Isii [1960, 1963], Karlin
and Shapely [1953]). In fact, when up to 3 moments are known, there are closed-form solutions for the
bounds (Prekopa [1989]). In the material that follows, I present the optimization problem
in its primal and dual form. I then explore strategies for solving it, given the fact that the
most obvious ones can prove to be computationally expensive.
Assume that I know m moments of the discrete random variable X, denoted by
\mu_1, \ldots, \mu_m, where \mu_j is the jth moment. The domain of X is given by U = \{x_0, x_1, \ldots, x_n\}.
P[X = x_r] = p_r where r \in \{0, 1, \ldots, n\} and \sum_r p_r = 1. I only discuss the maximization
version of the problem (i.e. finding the upper bound) since the minimization version (i.e.
finding the lower bound) has an analogous description. Thus, in the primal space I have
the following formulation,
the following formulation,
max P [X <= xr] =∑r
i=0 pi, r ≤ n
subject to :∑n
i=0 pi = 1∑n
i=0 xipi = µ1
···
∑ni=0 xm
i pi = µm
pi ≥ 0, ∀ i ≤ n
Solving the above optimization problem gives us an upper bound on P[X \le x_r].
On inspecting the formulation of the objective and the constraints I notice that it is
a Linear Programming (LP) problem with m + 1 equality constraints and 1 inequality
constraint. The number of optimization variables (i.e. the p_r) is equal to the size of
the domain of X, which can be large. For example, in the 2-dimensional case when
X = N_2 N_{11}^{x} N_{11}^{y} - N_1 N_{12}^{x} N_{12}^{y}, X takes O(N^2) values (as N_1 constrains N_2 given N,
and N_1 also constrains the maximum values of N_{11}^{x}, N_{11}^{y}, N_{12}^{x} and N_{12}^{y}). Thus with N
as small as 100 I already have around 10000 (ten thousand) variables. In an attempt to
monitor the explosion in computation, I derive the dual formulation of our problem.
A dual is a complementary problem, and solving the dual is usually easier than solving
the primal since the dual is always convex for primal maximization problems (concave
for minimization) irrespective of the form of the primal. For most convex optimization
problems (which include LP) the optimal solution of the dual is the optimal solution
of the primal, technically speaking only if Slater's conditions are satisfied (i.e. the
duality gap is zero). Maximization (minimization) problems in the primal space map to
minimization (maximization) problems in the dual space. For a maximization LP problem
in standard form the primal and dual have the following form,
Primal:
    max  c^T x
    subject to:  Ax = b, \ x \ge 0

Dual:
    min  b^T y
    subject to:  A^T y \ge c

where y is the dual variable.
For our LP problem the dual is the following,

min  \sum_{k=0}^{m} y_k \mu_k
subject to:  \sum_{k=0}^{m} y_k x^k - 1 \ge 0, \ \forall\, x \in W
             \sum_{k=0}^{m} y_k x^k \ge 0, \ \forall\, x \in U
where the y_k represent the dual variables and W represents the subset of U over which the
cdf is computed. I observe that the number of variables is reduced to just m + 1 in the
dual formulation, but the number of constraints has increased to the size of the domain
of X. I now propose some strategies to solve this optimization problem, discuss their
shortcomings, and eventually suggest the strategy preferred by us.
Using Standard Linear Programming Solvers: I have a linear programming
problem whose domain is discrete and finite. On careful inspection of our problem
I observe that the number of variables in the primal formulation and the number of
constraints in the dual increase exponentially with the dimensionality of the space (i.e. the
domain of the r.v. X). Though current state-of-the-art LP solvers (using interior point
methods) can solve linear optimization problems with thousands of variables
and constraints rapidly, our problem can exceed these counts by a significant margin
even for moderate dataset sizes and reasonable dimension, thus becoming computationally
intractable. Since standard methods for solving this LP can prove to be inefficient, I
investigate other possibilities.
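For small domains, the primal LP above can nevertheless be handed to an off-the-shelf solver directly; a minimal sketch using SciPy's linprog is given below, where the function name and the raw-moment convention are assumptions of the example.

```python
import numpy as np
from scipy.optimize import linprog

def cdf_upper_bound(domain, moments, x_r):
    """Upper bound on P[X <= x_r] over all distributions on `domain` that match the
    given raw moments mu_1..mu_m (the primal LP above).  linprog minimizes, so the
    objective is negated."""
    domain = np.asarray(domain, dtype=float)
    m = len(moments)
    c = -(domain <= x_r).astype(float)                 # maximize mass at or below x_r
    A_eq = np.vstack([domain**j for j in range(m + 1)])  # total mass + moment constraints
    b_eq = np.concatenate([[1.0], moments])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return -res.fun
```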
In the next three approaches I extend the domain of the random variable X to
include all the integers between the extremities of the original domain. The current
domain is thus a superset of the original domain and so are the possible distributions.
Another way of looking at it is that, in the space of y, I have a superset of the original set of
constraints. Thus the upper bound calculated in this scenario will be greater than or
equal to the upper bound of the original problem. This extension is done to enhance the
performance of the next two approaches, since it sidesteps the problem of explicitly
enumerating the domain of X, and it is a requirement for the third, as I will soon see.
Gradient descent with binary search (GD): I use gradient descent on the dual to
find new values of the vector y = [y_0, \ldots, y_m], with a reasonably large step length since the
objective to be minimized is an affine function; this sidesteps the problem of choosing a
large step size when optimizing convex functions. Fixing y and assuming x to be continuous,
the two expressions representing the inequalities above can be viewed as polynomials in
x. The basic intuition in the method I propose is that if the polynomials are always
non-negative within the domain of X then all the constraints are satisfied; else some of the
constraints are violated. To check if the polynomials are always non-negative I find their
roots and perform the following checks. The polynomials will change sign only at their
roots and hence I need to carefully examine their behavior at these points. Here are the
details of the algorithm.
I check if the roots of the polynomial lie within the extremities of the domain of X.

1. If they do not, then I check whether the value of the polynomial at any point within
this range satisfies the inequalities. If the inequalities are satisfied I jump to the next y
using gradient descent, storing the current value of y in place of the previously stored one
(if it exists). If the inequalities are violated I reject the value of y and perform a binary
search between this value and the previous legal value of y along the gradient, until I reach
the value that minimizes the objective while satisfying the constraints.

2. If they do, then I check the value of the constraints at the two extremities. If they are
satisfied and there exists only one root in the range, I store this value of y and go on to the
next. If there are multiple roots then I check whether consecutive roots have any integral
values between them. If not, I again store this value of y and move to the next. Else I
verify, for a point between the roots, whether the constraints are satisfied, based on which
I either store or reject the value. On rejecting, I perform the same binary-search procedure
mentioned above.
Checking if consecutive roots of the polynomial have values in the domain of X is
where the extension of the domain to include all integers between the extremities helps in
enhancing performance. In the absence of this extension I would need to find whether a
particular set of integers lies in the domain of X. This operation is expensive for large
domains. But with the extension all the above operations can be performed efficiently.
Finding roots of polynomials can be done extremely efficiently even for high degree
polynomials by various methods, such as computing eigenvalues of the companion matrix
(Edelman and Murakami [1995]), as is implemented in Matlab. Since the number of roots
is just the degree of the polynomial, which is the number of moments, the above-mentioned
checks can be done quickly. The binary search takes log(t) steps where t is the step length.
Thus the entire optimization can be done efficiently. Nonetheless, the method suffers from
the following pitfall. The final bound is sensitive to the initial value of y. Depending on
the initial y I might stop at different values of the objective on hitting some constraint. I
could thus have a suboptimal value as our solution, as I only descend along the negative of
the gradient. I can somewhat overcome this drawback by making multiple random restarts.
Gradient descent with local topology search (GDTS): Perform gradient descent
as mentioned before. Choose a random set of points around the current best solution.
Again perform gradient descent on the feasible subset of the chosen points. Choose
the best solution and repeat until some reasonable stopping criterion is met. This works well
sometimes in practice but not always.
Prekopa Algorithm (PA): Prekopa [1989] gave an algorithm for the discrete
moment problem. In this algorithm I maintain an (m + 1) × (m + 1) matrix called the basis
matrix B, which needs to have a particular structure to be dual feasible. I iteratively
update the columns of this matrix until it becomes primal feasible, resulting in the optimal
solution to the optimization problem1 . The issue with this algorithm is that there is
no guarantee w.r.t. the time required for the algorithm to find this primal feasible basis
structure.
In the remaining approaches I further extend the domain of the random variable
X to be continuous within the given range. Again, for the same reason described before,
the bound remains valid (it can only become looser). It is also worth noting that the feasibility
region of the optimization problem is convex, since the objective and the constraints are
convex (actually affine). Standard convex optimization strategies cannot be used since the
equation of the boundary is unknown and the length of the description of the problem is large.
1 for explanation of the algorithm read Prekopa [1989]
Sequential Quadratic Programming (SQP): Sequential Quadratic Programming
is a method for non-linear optimization. It is known to have local convergence for
non-linear non-convex problems and will thus converge globally in the case of convex
optimization. The idea behind SQP is the following. I start with an initial feasible point,
say y_{init}. The original objective function is then approximated by a quadratic function
around y_{init}, which then is the objective for that particular iteration. The constraints are
approximated by linear constraints around the same point. The solution of the quadratic
program is a direction vector along which the next feasible point should be chosen. The
step length can be found using standard line search procedures or more sophisticated
merit functions. On deriving the new feasible point the procedure is repeated until a
suitable stopping criterion is met. Thus at every iteration a quadratic programming problem is
solved.
Let f(y), c_{eq_j}(y) and c_{ieq_j}(y) be the objective function, the jth equality constraint
and the jth inequality constraint respectively. For the current iterate y_k I have the following
quadratic optimization problem,
min  O_k(d_k) = f(y_k) + \nabla f(y_k)^T d_k + \frac{1}{2} d_k^T \nabla^2 L(y_k, \lambda_k)\, d_k
subject to:  c_{eq_i}(y_k) + \nabla c_{eq_i}(y_k)^T d_k = 0, \ i \in E
             c_{ieq_i}(y_k) + \nabla c_{ieq_i}(y_k)^T d_k \ge 0, \ i \in I
where O_k(d_k) is the quadratic approximation of the objective function around y_k.
The term f(y_k) is generally dropped from the above objective since it is a constant at
any particular iteration and has no bearing on the solution. \nabla^2 L(\cdot) is the Hessian of
the Lagrangian w.r.t. y, E and I are the sets of indices for the equality and inequality
constraints respectively, and d_k is the direction vector which is the solution of the above
optimization problem. The next iterate y_{k+1} is given by y_{k+1} = y_k + \alpha_k d_k, where \alpha_k is the
step length.
For our specific problem the objective function is affine, thus a quadratic approximation
of it yields the original objective function. I have no equality constraints. For the
inequality constraints, I use the following idea. The two expressions representing the infinite
number of linear constraints given in the dual formulation can be perceived as
polynomials in x with coefficients y. For a particular iteration with the iterate y known,
I find the lowest value that the polynomials take. This value is the value of the most
violated (if some constraints are violated) or just-satisfied (if no constraint is violated) linear
constraint. This is shown in Figure 4-2. The constraint c_l = \sum_{j=0}^{m} y_j x_i^j is just satisfied.
With this in view I arrive at the following formulation of our optimization problem at the
kth iteration,
min  \mu^T d_k
subject to:  \sum_{j=0}^{m} y_j^{(k)} x_i^j + \sum_{j=0}^{m} x_i^j d_k \ge 0; \qquad y_k = [y_0^{(k)}, \ldots, y_m^{(k)}]2
This technique gives a sense of the non-linear boundary traced out by the constraints.
The above-mentioned values can be deduced by finding the roots of the derivatives of the 2
polynomials w.r.t. x and then taking the minimum of each polynomial evaluated at the real
roots of its derivative. The number of roots is bounded by the number of moments; in fact
it is equal to m − 1. Since this approach does not require the enumeration of each of
the linear constraints, and the operations described are fast with accurate results, it
turns out to be a good option for solving this optimization problem. I carried out the
optimization using the Matlab function fmincon and the procedure just illustrated.
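A hedged sketch of the same idea using SciPy's SLSQP in place of fmincon is shown below; to keep it short, it replaces the root-finding step by a dense-grid minimum of the constraint polynomials, which only approximates the procedure described above. The function name and grid resolution are assumptions of the example.

```python
import numpy as np
from scipy.optimize import minimize

def dual_upper_bound(moments, a, b, c):
    """Approximately solve min mu^T y subject to the two semi-infinite polynomial
    constraints of the dual, relaxed to the continuous domain [a, c]; bounds P[X <= b]."""
    mu = np.concatenate([[1.0], moments])            # mu_0 = 1
    grid_all = np.linspace(a, c, 2001)               # relaxed domain U
    grid_w = np.linspace(a, b, 2001)                 # subset W over which the cdf is taken

    def poly(y, xs):
        return sum(yk * xs**k for k, yk in enumerate(y))

    cons = [
        {"type": "ineq", "fun": lambda y: np.min(poly(y, grid_all))},        # poly >= 0 on U
        {"type": "ineq", "fun": lambda y: np.min(poly(y, grid_w)) - 1.0},    # poly >= 1 on W
    ]
    y0 = np.zeros(len(mu))
    y0[0] = 1.0                                       # feasible start: constant polynomial 1
    res = minimize(lambda y: mu @ y, y0, constraints=cons, method="SLSQP")
    return res.fun
```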
Semi-definite Programming (SDP): A semi-definite programming problem has a
linear objective, linear equality constraints and linear matrix inequality (LMI) constraints.
Here is an example formulation,

min  c^T q
subject to:  q_1 F_1 + \cdots + q_n F_n + H \preceq 0
             Aq = b
2 y_i^{(k)} is the value of y_i at the kth iteration
where H, F_1, \ldots, F_n are positive semidefinite matrices, q \in R^n, b \in R^p and A \in R^{p \times n}. The
SDP can be efficiently solved by interior point methods. As it turns out, I can express our
semi-infinite LP as an SDP.
Consider the constraint c_1(x) = \sum_{i=0}^{m} y_i x^i. The constraint c_1(x) satisfies c_1(x) \ge 0
\forall x \in [a, b] iff there exists an (m + 1) \times (m + 1) positive semidefinite matrix S such that,

\sum_{i+j=2l-1} S(i, j) = 0; \quad l = 1, \ldots, m
\sum_{k=0}^{l} \sum_{r=k}^{k+m-l} y_r \binom{r}{k} \binom{m-r}{l-k} a^{r-k} b^{k} = \sum_{i+j=2l} S(i, j); \quad l = 0, \ldots, m

S \succeq 0 means S is positive semidefinite.
The proof of this result is given in Bertsimas and Popescu [1998].
I derive the equivalent semidefinite formulation for the second constraint, c_2(x) =
\sum_{i=0}^{m} y_i x^i - 1, to be greater than or equal to zero. To accomplish this, I replace y_0 by y_0 - 1
in the above set of equalities since c_2(x) = c_1(x) - 1. Thus \forall x \in [a, b] I have the following
semidefinite formulation for the second constraint,

\sum_{i+j=2l-1} S(i, j) = 0; \quad l = 1, \ldots, m
\sum_{k=1}^{l} \sum_{r=k}^{k+m-l} y_r \binom{r}{k} \binom{m-r}{l-k} a^{r-k} b^{k} + \sum_{r=1}^{m-l} y_r \binom{m-r}{l} a^{r} + y_0 - 1 = \sum_{i+j=2l} S(i, j); \quad l = 1, \ldots, m
\sum_{r=1}^{m} y_r a^{r} + y_0 - 1 = S(0, 0)
S \succeq 0
Combining the above 2 results I have the following semidefinite program with O(m^2)
constraints,

min  \sum_{k=0}^{m} y_k \mu_k
subject to:
\sum_{i+j=2l-1} G(i, j) = 0; \quad l = 1, \ldots, m
\sum_{k=1}^{l} \sum_{r=k}^{k+m-l} y_r \binom{r}{k} \binom{m-r}{l-k} a^{r-k} b^{k} + \sum_{r=1}^{m-l} y_r \binom{m-r}{l} a^{r} + y_0 - 1 = \sum_{i+j=2l} G(i, j); \quad l = 1, \ldots, m
\sum_{r=1}^{m} y_r a^{r} + y_0 - 1 = G(0, 0)
\sum_{i+j=2l-1} Z(i, j) = 0; \quad l = 1, \ldots, m
\sum_{k=0}^{l} \sum_{r=k}^{k+m-l} y_r \binom{r}{k} \binom{m-r}{l-k} b^{r-k} c^{k} = \sum_{i+j=2l} Z(i, j); \quad l = 0, \ldots, m
G \succeq 0, \quad Z \succeq 0
G and Z are (m + 1) × (m + 1) positive semidefinite matrices. The domain of the
random variable is [a, c]. Solving this semidefinite program yields an upper bound on the
cdf P[X \le b], where a \le b \le c. I used a free online SDP solver (Wu and Boyd [1996]) to
solve the above semidefinite program. Through the empirical studies that follow I found this
approach to be the best for solving the optimization problem in terms of a balance between
speed, reliability and accuracy.
4.2.3.3 Random sampling using formulations (RS)
In sampling I select a subset of observations from the universe consisting of all
possible observations. Using this subset I calculate a function whose value I consider to
be equal (or at least close enough) to the value of the same function applied to the entire
observation set. Sampling is an important process, since in many cases I do not have
access to this entire observation set (many times it is infinite). Numerous studies (Hall
[1992], Bartlett et al. [2001], Chambers and Skinner [1977]) have been conducted to analyze
different kinds of sampling procedures. The sampling procedure that is relevant to our
problem is random sampling and hence I restrict our discussion only to it.
problem is Random Sampling and hence I restrict our discussion only to it.
Random sampling is a sampling technique in which I select a sample from a larger
population wherein each individual is chosen entirely by chance and each member of the
population has possibly an unequal chance of being included in the sample. Random
sampling reduces the likelihood of bias. It is known that asymptotically the estimates
found using random sampling converge to their true values.
For our problem the cdf can be computed using this sampling procedure. I sample
data from the multinomial distribution (our data generative model) and count the number of
times the condition whose cdf is to be computed is true. This number, when divided by the
total number of samples, gives an estimate of the cdf. By finding the mean and standard
deviation of these estimates I can derive confidence bounds on the cdf using the Chebyshev
inequality. The width of these confidence bounds depends on the standard deviation of the
estimates, which in turn depends on the number of samples used to compute the estimates.
As the number of samples increases the bounds become tighter. I will observe this in
the experiments that follow. In fact, all the estimates of the cdf necessary for computing
the moments of the generalization error using the mathematical formulations given by
us can be computed in parallel. The moments are thus functions of the samples for which
confidence bounds can be derived.
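A minimal sketch of the RS procedure is given below; the cell ordering used in the example condition is an assumption of the illustration, not a convention fixed by the thesis.

```python
import numpy as np

def estimate_cdf_by_sampling(p, N, condition, n_samples=10_000, seed=0):
    """Random-sampling (RS) estimate of P[condition(counts)] under the multinomial
    data-generation model: draw contingency tables of size N and average the indicator."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float).ravel()
    hits = sum(bool(condition(rng.multinomial(N, p))) for _ in range(n_samples))
    return hits / n_samples

# Example for the 2-dimensional case of Figure 4-1 with 8 equiprobable cells
# (ordering assumed: N111, N121, N211, N221, N112, N122, N212, N222).
def cond(c):
    n111, n121, n211, n221, n112, n122, n212, n222 = c
    n2 = n112 + n122 + n212 + n222
    n1 = c.sum() - n2
    return n2 * (n111 + n121) * (n111 + n211) > n1 * (n112 + n122) * (n112 + n212)

print(estimate_cdf_by_sampling(np.full(8, 1 / 8), 100, cond))  # expected: close to 0.5
```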
Notice that only sampling in conjunction with our formulations for the moments
makes for an efficient method. If I directly use sampling without using the formulations, I
would first need to sample for building a set of classifiers, and then for each classifier built
I would need to sample test sets from the distribution. The reason is that the expectation
in the moments is w.r.t. all possible datasets of size N. This process can prove to be
computationally intensive for acquiring accurate estimates of the moments. As I will see
in the next section, Monte Carlo fails to provide accurate estimates when directly used
to compute the moments, even in simple scenarios, when compared with RS. Thus I am
still consistent with our methodology of going as far as possible in theory to reduce the
computation needed for conducting experiments.
4.2.4 Empirical Comparison of Cumulative Distribution Function Computing Methods
Consider the 2-dimensional case in Figure 4-1. I instantiated all the cell probabilities
to be equal. I found the probability P[N_2 N_{11}^{x} N_{11}^{y} > N_1 N_{12}^{x} N_{12}^{y}] by the methods suggested,
varying the dataset size from 10 to 1000 in multiples of 10 and having knowledge of
the first six moments of the random variable X = N_2 N_{11}^{x} N_{11}^{y} - N_1 N_{12}^{x} N_{12}^{y}. The
actual probability in all three cases is around 0.5 (actually just less than 0.5).
The execution speeds for the various methods are given in Table 4-3³. From the table
I see that the SDP and gradient descent methods are lightning fast. The SQP and
3 arnd means around
gradient descent with topology search methods take a couple of seconds to execute. The
thing to notice here is that SDP, SQP, the two gradient descent methods and the series
approximation method are oblivious to the size of the dataset with regards to execution
time. In terms of accuracy the gradient descent method is sensitive to initialization and
the series approximation method to the choice of distribution, as previously stated. A
normal distribution gives an estimate of 0.5, which is good in this case since the original
distribution is symmetric about the origin. But for finding the cdf near the extremities of the
domain of X the error can be considerable. Since the domain of X is finite, variants of
the beta distribution with a change of variable (i.e. shifting and scaling the distribution)
can provide better approximation capabilities. The SQP and SDP methods are robust
and insensitive to initialization (as long as the initial point is feasible). The bound found
by SQP is 0.64 to 0.34 and that found by SDP is 0.62 to 0.33. The LP solver also finds
a similar bound of 0.62 to 0.34 but the execution time scales quadratically with the size
of the input. On increasing the number of moments to 9 the bounds become tighter and
essentially require the same execution time. The SDP, SQP and LP methods, all give a
bound of 0.51 to 0.48. Thus by increasing the number of moments I can get arbitrarily
tight bounds. For RS I observe from Table 4-3 and Table 4-4 that the method does not
scale much in time with the size of the dataset but produces extremely good confidence
bounds as the number of samples increases. With 1000 samples I already have pretty tight
bounds, with the time required being just over half a second. Also, as previously stated, the cdfs
can be calculated together rather than independently.
Recommendation: The SDP method is the best but RS can prove to be more than
acceptable.
4.3 Monte Carlo (MC) vs Random Sampling Using Formulations
In the previous section I proposed methods for efficiently and accurately computing
the cdfs that are used in the computation of the moments. A natural question is: why
not use simple Monte Carlo to directly estimate the moments rather than derive the
formulations and then perform random sampling? In this section, I show that MC fails
to provide accurate estimates even in a simple scenario, while RS does an extremely good
job for the same amount of computation (i.e. 10000 samples). Notice that N, the training
set size, and the sample size have different semantics. Since the expectations are over all
datasets of size N, the sample size is the number of datasets of size N. More precisely, the
sample size is the number of training sets of size N and not the value of N itself. I first
explain the plots and later discuss their implications.
General Setup: I fix the total number of attributes to 2. Each attribute has two
values with the number of classes also being 2. The five Figures 4.6, 4-3, 4.6, 4-5 and
4.6 depict the estimates of MC and RS for different amounts of correlation (measured
using Chi-Square, Connor-Linton [2003]) between the attributes and the class labels, with
increasing training set size.
Observations: From the Figure 4.6 I observe that when the attributes and class
labels are uncorrelated, with increasing training set size the estimates of both MC and
RS are accurate. Similar qualitative results are seen in Figure 4.6 when the attributes
and class labels are totally correlated. Hence, for extremely low and high correlations
both methods produce equally good estimates. The problem arises for the MC method
when I move away from these extreme correlations. This is seen in Figures 4-3, 4.6 and
4-5. Both the MC and RS methods perform well initially, but at higher training set sizes
(around 10000 and greater) the estimates of the MC method become grossly incorrect,
while the RS method still performs exceptionally well. In fact, the estimates of RS become
increasingly accurate with increasing training set size.
Reasons and Implications: An explanation of the above phenomena is as follows.
The term E_{D(N)}[GE(ζ)] denotes the expected GE of all classifiers that are induced by all
possible training sets drawn from some distribution. In the continuous case the number
of possible training sets of size N is infinite, while in the discrete case it is O(N^{m-1}),
where m is the total number of cells in the contingency table. As N increases the number
of possible training sets increases rapidly even for small values of m. Thus, with increasing
N the complexity of E_{D(N)}[GE(ζ)] also increases. In the experiments I reported above,
the value of m is 8 (2 × 2 × 2), and with N increasing from 10 to 10000 the upsurge in the
number of possible training sets is steep. Since I fix the amount of computation (i.e. the
number of samples), the MC method is unable to get enough samples to accurately
estimate E_{D(N)}[GE(ζ)] at higher values of N (e.g. 10000), except at extreme correlations
where almost every sample is representative of the underlying distribution. The
MC method estimates are based on samples from a small subspace of the entire sample
space. Hence, with an increasing number of possible datasets I would have to proportionately
increase the number of samples to get good estimates. The RS method is not as affected
by increasing training set size. The reason for this is that the complexity (i.e. the
parameter space) of the cdf does not scale as much with increasing N (O(N^{O(d)}), where
d is the dimension, as against O(N^{m-1})). Thus, in the case of the RS method the high
accuracy is sustained.

On increasing m, the number of possible training sets increases by a factor of N for
each cell added, and hence obtaining accurate estimates with direct MC is intractable. The RS
method does not scale likewise, since the number of terms (cdfs) is linear in m and the
complexity of each term remains practically unchanged for a fixed dimension. Since
computing even the first moment is a challenge for the MC method, computing the
second moment, which is over the D(N) × D(N) space, looks ominous. For RS and the
other suggested methods (e.g. optimization) this is equivalent to finding joint probabilities,
which is not that hard a task.
4.4 Calculation of Cumulative Joint Probabilities
Cumulative joint probabilities need to be calculated for the computation of higher
moments. Using the random sampling method, these probabilities can be computed
in a similar fashion as the single probabilities shown above. But for the other methods
knowledge of the moments is required. Cumulative joint probabilities are defined
over multiple random variables, wherein each random variable satisfies some inequality
or equality. In our case, for the second moment I need to find cumulative joint
probabilities of the kind P[X > 0, Y > 0], where X and Y are random
variables (overriding their definition in Table 2-1). Since the probability is of an event
over two distinct random variables, the previous method of computing moments cannot be
directly applied. An important question is: can I somehow, through certain transformations,
reuse the previous method? Fortunately, the answer is affirmative. The intuition
behind the technique I propose is as follows. I find another random variable Z =
f(X, Y) (a polynomial in X and Y) such that Z > 0 iff X > 0 and Y > 0. Since the
two events are equivalent, their probabilities are also equal. By taking derivatives of
the MGF of the multinomial I get expressions for the moments of polynomials of the
multinomial random variables. Thus, f(X, Y) is required to be a polynomial in X and Y.
I now discuss the challenges in finding such a function and eventually suggest a solution.
Geometrically, I can consider the random variables X, Y and Z to denote the three
co-ordinate axes. Then the function f(X, Y ) should have a positive value in the first
quadrant and negative in the remaining three. If the domains of X and Y were infinite
and continuous then this problem is potentially intractable since the polynomial needs
to have a discrete jump along the X and Y axis. Such behavior can be emulated at best
approximately by polynomials. In our case though, the domains of the random variables
are finite, discrete and symmetric about the origin. Therefore, what I care about is that
the function behaves as desired only at these finite number of discrete points. One simple
solution is to have a circle covering the relevant points in the first quadrant and with
appropriate sign the function would be positive for all the points encompassed by it. This
works for small domains of X and Y . As the domain size increases the circle intrudes into
the other quadrants and no longer satisfies the conditions. Other simple functions such
as XY or X + Y or a product of the two also do not work. I now give a function that does
work and discuss the basic intuition in constructing it. Consider the domain of X and Y
to be integers in the interval [-a, a].4 Then the polynomial is given by,

Z = (X + a)^r X^2 Y + (Y + a)^r Y^2 X    (4–6)

where r = \max_b \left\lfloor \frac{b \ln b}{\ln \frac{a+1}{a-b}} \right\rfloor + 1, with 1 < b < a and b \in N.5 The value of r can be found
numerically by finding the corresponding value of b which maximizes that function. For
5 \le a \le 10 the value of b which does this is 5. For larger values, 10 < a \le 10^6, the value
of b is 4. Figure 4-7 depicts the polynomial for a = 10, where r = 4. The polynomial
resembles a bird with its neck in the first quadrant, wings in the 2nd and 4th quadrants
and its posterior in the third. The general shape remains the same for higher values of a.
The first requirement for the polynomial was that it must be symmetric. Secondly,
I wanted to penalise negative terms, and so I have X + a (and Y + a) raised to some
power, which will always be positive but will have lower values for smaller X (and Y). The
factor X^2 Y (and Y^2 X) makes the first (second) term zero if either X or Y is zero. Moreover,
it imparts sign to the corresponding term. If the absolute value function (|·|) could be used I
would replace X^2 (Y^2) by |X| (|Y|) and set r = 1. But since I cannot, in the resultant
function r is a reciprocal of a logarithmic function of a. For a fixed r, with increasing
a the polynomial starts violating the biconditional by becoming positive in the 2nd
and 4th quadrants (i.e. the wings rise). The polynomial is always valid in the 1st and 3rd
quadrants. With an increase in the degree (r) of the polynomial, its wings begin flattening out,
thus satisfying the biconditional for a certain a.
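A quick numerical check of the claimed biconditional, assuming a = 10 and r = 4 as in Figure 4-7:

```python
import numpy as np

def z_poly(X, Y, a, r):
    """The polynomial of Eq. 4-6; intended to be positive exactly when X > 0 and Y > 0
    on the integer grid [-a, a] x [-a, a], for a suitably chosen r."""
    return (X + a) ** r * X**2 * Y + (Y + a) ** r * Y**2 * X

a, r = 10, 4
grid = np.arange(-a, a + 1)
X, Y = np.meshgrid(grid, grid)
# Exhaustive check of: Z > 0  iff  (X > 0 and Y > 0)
print(np.array_equal(z_poly(X, Y, a, r) > 0, (X > 0) & (Y > 0)))  # expected: True
```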
By recursively applying the above formula I can approximate cdfs (probabilities) with
multiple conditions.
4 in our problem X and Y have the same domain.
5 proof in appendix
Recommendation: If the degree of the polynomial is large, use the RS method for
convenience; else use SDP for high accuracy.
With this I have all the ingredients necessary to perform experiments reasonably fast.
This is exactly what I report in the next section.
4.5 Moment Comparison of Test Metrics
In the previous subsections I pointed out how the moments of the generalization error
can be computed. I established connections between the moments of the generalization
error (GE) and the moments of Hold-out-set error (HE) and Cross validation error (CE).
Neither of these relationships, i.e. between the moments of these errors and the moments
of the generalization error give a direct indication of the behavior of the different types of
errors in specific circumstances. In this section I provide a graphical, simple to interpret,
representation of these moments for specific cases. While these visual representations do
not replace general facts, they greatly improve our understanding, allowing rediscovery of
empirically discovered properties of the error metrics and portraying the flexibility of the
method to model different scenarios.
General Setup: I study the behavior of the moments of HE, CE and GE in one as
well as multiple dimensions. The data distribution I use is a multinomial with a class prior
of 0.4. The dataset size is set to 100, i.e. N = 100, for the first 2 studies and is varied for
the third. I set N = 100 (and not higher) to clearly observe the effects that an increase in
dimensionality has on the behavior of these error metrics. The third study varies N and
studies the convergence behavior of these error metrics.
4.5.1 Hold-out Set
Our first study involves the dependency of the hold-out-set error on the splitting of
the data into testing (the hold-out-set) and training. To get insight into the behavior of
HE, I plotted the expectation in Figures 4.6, 4-11, the variance in Figures 4-9, 4.6 and
the sum of the expectation and standard deviation in Figures 4.6, 4-13 for single and
multiple dimensions respectively. As expected, the expectation of HE grows as the size
of the training dataset reduces. On the other hand, the variance is reduced until the size
of test data is 50%, then it increases slightly for the one dimensional case. The general
downwards trend is predictable using intuitive understanding of the naive Bayes classifier
but the fact that the variance has an upwards trend is not. I believe that the behavior
on the second part of the graph is due to the fact that the behavior of the classifier
becomes unstable as the size of the training dataset is reduced and this competes with the
reduction due to the increase in the size of the testing data. In higher dimensions the test data
size is insufficient even for large test set fractions (as N is only 100) and so any increase in
test size is desirable, leading to reduced variance. Our methodology established this fact
exactly without the doubts associated with intuitively determining the distinct behavior in
different dimensions.
From the plots for the sum of the expectation and the standard deviation of HE,
which indicate the pessimistic expected behavior, a good choice for the size of the test set
is 40-50% for this particular instance. This best split depends on the size of the dataset
and is hard to select based on intuition alone.
4.5.2 Cross Validation
In our second study I observed the behavior of CV with varying number of folds.
Here I observe similar qualitative results in both lower and higher dimensions. As the
number of folds increases, the following trends are observed: (a) the expectation of CE
reduces (Figures 4.6, 4.6) since the size of training data increases, (b) the variance of the
classifier for each of the folds increases (Figures 4-15, 4-21) since the size of the test data
decreases, (c) the covariance between the estimates of different folds decreases first then
increases again (Figures 4.6, 4.6) – I explain this behavior below and the same trend is
observed for the total variance of CE (Figures 4-17, 4-23) and the sum of the expectation
and the standard deviation of CE (Figures 4.6, 4.6). Observe that the minimum of
the sum of the expectation and the standard deviation (which indicates the pessimistic
expected behavior) is around 10-20 folds, which coincides with the number of folds usually
recommended.
A possible explanation for the behavior of the covariance between the estimates
of different folds is based on the following two observations. First, when the number of
folds is small, the errors of the estimates have large correlations despite the fact that
the classifiers are negatively correlated – this happens because almost the entire training
dataset of one classifier is the test set for the other, with 2-fold cross validation being the extreme.
Due to this, though the classifiers built may be similar or different, their errors are strongly
positively correlated. Second, for a large number of folds (the leave-one-out situation in the
extreme), there is a huge overlap between the training sets, thus the classifiers built are
almost the same and so the corresponding errors they make are highly correlated again.
These two opposing trends produce the U-shaped curve of the covariance. This has a
significant effect on the overall variance, and so the variance also has a similar form with
the minimum around 10 folds. Predicting this behavior using only intuition, a reasonable
number of experiments, or theory alone is unlikely, since it is not clear what the interaction
between the two trends is.
Such insight is possible only because I am able to observe with high accuracy the
factors that affect the behavior of these measures.
4.5.3 Comparison of GE, HE, and CE
The purpose of our last study I report was to determine the dependency of the three
errors on the size of the dataset, which indicates the convergence behavior and relative
merits of hold-out-set and cross validation. In Figures 4-20 and 4-26 I plotted the moments of
GE, HE and CE; the size of the hold-out set for HE was set to 40% and the number of folds for CE to 20.
As can be observed from the figures, the error of hold-out-set is significantly larger for
small datasets. The error of cross validation is almost on par with the generalization
error. This property of cross validation to reliably estimate the generalization error is
known from empirical studies. But the method can be used to estimate how quickly (at
what dataset size) HE and CE converge to GE.
This type of study can be used to observe the non-asymptotic convergence behavior of
errors.
4.6 Extension
I have laid down the basic groundwork necessary for characterizing classification
models and model selection measures. In particular, I have characterized the NBC model
applied to categorical data of arbitrary dimension and with binary class labels. In this
section I discuss extensibility of the analysis and the methodology.
The extension of the analysis to NBC with multiple classes is straightforward. I adopt
the "winner takes all" policy to classify a datapoint, i.e. I classify the datapoint in the
class that has the highest corresponding polynomial (of the form
$N_2^{(d-1)} N_{x_1 1} N_{x_2 1} \cdots N_{x_d 1}$)
value. The approximation techniques employed for speedup are applicable to this
scenario too. In fact, the series approximation and the optimization techniques can
be used to bound the cdf in any application where k (some integer) moments of the
random variable are known. As mentioned before, the generalized expressions for the
moments and the relationships of the moments of GE to the moments of the CE, LE
and HE hold even for the continuous case by switching from the counting measure to the
Lebesgue measure. The challenge in this case too is to characterize the probabilities
PZ(N) [ζ(x)=y] and PZ(N)×Z(N) [ζ(x)=y ∧ ζ ′(x′)=y′] for the model at hand. The essence
in characterizing these probabilities for particular inputs (x) is expressing them as
probabilities of a function of the training sample satisfying some condition. The function
is determined by the model that is chosen, by prudently looking at the training algorithm
in relation with the classifiers it outputs. For example, in the NBC case the function was
$N_2^{(d-1)} N_{x_1 1} N_{x_2 1} \cdots N_{x_d 1} - N_1^{(d-1)} N_{x_1 2} N_{x_2 2} \cdots N_{x_d 2}$ for multiple dimensions and $N_{x_1 1} - N_{x_1 2}$ for
a single dimension. The probability of this function being greater than zero is computed
from any joint distribution that may be specified over the data and the class labels. This
observation directs us in characterizing other models of interest.
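For concreteness, the following is a minimal Python sketch of the two-class discrete NBC decision above, written purely as a function of the sample counts; the variable names (N1, N2, Nx) are hypothetical and are assumed to have been read off the contingency table.

def nbc_two_class_decision(N1, N2, Nx):
    """Two-class naive Bayes decision expressed only through sample counts.
    N1, N2: number of training points in class 1 and class 2.
    Nx: list of (N_{x_j 1}, N_{x_j 2}) pairs, one per attribute, holding the counts
    of the input's attribute values within class 1 and class 2."""
    d = len(Nx)
    score1 = N2 ** (d - 1)
    score2 = N1 ** (d - 1)
    for n_c1, n_c2 in Nx:
        score1 *= n_c1
        score2 *= n_c2
    # Positive difference -> class 1, otherwise class 2 (ties ignored in this sketch).
    return 1 if score1 - score2 > 0 else 2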
In the continuous case, the NBC classifies an input based on the training sample class
prior and the class conditionals, just as for the discrete case. The class label assigned to an
input z1, z2, ..., zd is given by,
$$\text{class label}(z_1, z_2, \ldots, z_d) = \arg\max_{C_i} P[C_i] \prod_{j=1}^{d} P[z_j \mid C_i]$$
where Ci denotes the class i. The NBC estimates the prior and the conditionals from
the training sample. The prior is straightforward to estimate. For the conditionals a
parametric model is chosen, usually a normal. The parameters of the normal (mean and
variance) can be estimated in closed form, i.e. as a function of the sample, using parameter
estimation methods such as Maximum Likelihood Estimation (MLE). Thus, each of the
conditionals can be represented as a function of the sample and so can the prior. The term
$P[C_i]\prod_{j=1}^{d} P[z_j \mid C_i]$ is hence a function of the sample. Since the classification occurs by
taking the argmax of this term over all classes i.e. classifying the input in the class for
which this term is the greatest (ignoring ties which can easily be accounted for), I arrive
at a situation wherein the classification process is expressed as a function of the sample.
With the initially chosen joint density over the data and the class labels, the probability
of this function satisfying the required conditions can be computed by integrating over
the appropriate domain. Other densities (parametric or non-parametric) may be used; the
key is to represent the above term as a function of the sample.
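A minimal sketch of this continuous case, assuming Gaussian class conditionals with MLE parameters; the small variance floor is an assumption added here only to avoid division by zero. It is meant to show that the whole decision is a function of the training sample, not to be a definitive implementation.

import numpy as np

def gaussian_nb_classify(X_train, y_train, z):
    """Classify input z: the prior and per-class normal parameters are MLE
    functions of the sample; the decision is the argmax over classes of
    prior times the product of the conditionals."""
    best_class, best_score = None, -np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9   # MLE estimates (floored)
        # log of  prior * prod_j N(z_j; mu_j, var_j)
        log_cond = -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)
        score = np.log(prior) + log_cond
        if score > best_score:
            best_class, best_score = c, score
    return best_class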
When I consider other classification models, the basic theme of characterizing them
remains unchanged. I discuss possible ways of extending the analysis to some of these
other models. In the case of decision trees for example, the classification occurs at the
leaf nodes by choosing the most numerous class label. If no pruning criterion is enforced,
all paths contain all attributes and the leaf nodes are basically the input cells, for the
discrete case. The function used to classify an input is determined by the most numerous
class label in the training sample for that input. For the continuous case the classification
occurs based on majority in regions. In case of pruning, the pruning method should
also be considered in the characterization, for example pruning based on a lack of samples
in a particular split. Here the process of classifying an input into a particular class
is determined by the leaf node that is used to perform the classification. A node is a
leaf only if the number of samples in it is less than some threshold or all attributes are
already exhausted. These conditions can be encoded as functions of the sample. The
condition of a particular class being most numerous is also a function of the sample and
using the chosen joint distribution the probability of these events can be ascertained.
For a perceptron (or Neural Network in general) the classification is based on the weight
vector that is learnt. The training algorithm produces a weight vector which is a linear
combination of the misclassified patterns, and hence a function of the sample. Using the
pre-specified joint distribution, probabilities for this model could also be found.
In this manner, by understanding the training and the functionality of the model,
characterizations can be developed. It may not always be possible to do this, but if
possible the characterizations developed aid in providing an accurate (if not exact)
representation of the behavior of the learning model and the model selection measures, in
a short amount of time (for that accuracy).
Table 4-1. Contingency table of input X.

     X     y1     y2
     x1    N11    N12
     x2    N21    N22
     ...   ...    ...
     xn    Nn1    Nn2
           N1     N2     (total N)
Table 4-2. Naive Bayes notation.

     Symbol   Semantics
     pc1      prior of class C1
     pc2      prior of class C2
     phij     joint probability of being in hi, Cj
     N1       r.v. denoting number of datapoints in class C1
     N2       r.v. denoting number of datapoints in class C2
     Nhij     r.v. denoting number of datapoints in hi, Cj
     Nij      r.v. denoting number of datapoints in cell i, Cj
     N        size of dataset
Table 4-3. Empirical comparison of the cdf computing methods in terms of execution time. RSn denotes the Random Sampling procedure using n samples to estimate the probabilities.

     Method    Dataset Size 10     Dataset Size 100      Dataset Size 1000
     Direct    25 hrs              around 200 centuries  around 200 billion yrs
     SA        Instantaneous       Instantaneous         Instantaneous
     LP        around 3.5 sec      around 2 min          around 2:30 hrs
     GD        around 0.13 sec     around 0.13 sec       around 0.13 sec
     PA        around 1 sec        around 25 sec         around 5 min
     GDTS      around 3.5 sec      around 3.5 sec        around 3.5 sec
     SQP       around 3.5 sec      around 3.5 sec        around 3.5 sec
     SDP       around 0.1 sec      around 0.1 sec        around 0.1 sec
     RS100     around 0.08 sec     around 0.08 sec       around 0.1 sec
     RS1000    around 0.65 sec     around 0.66 sec       around 0.98 sec
     RS10000   around 6.3 sec      around 6.5 sec        around 9.6 sec
Table 4-4. 95% confidence bounds for Random Sampling.

     Samples   Dataset Size 10   Dataset Size 100   Dataset Size 1000
     100       0.7-0.23          0.72-0.26          0.69-0.31
     1000      0.54-0.4          0.56-0.42          0.57-0.42
     10000     0.5-0.44          0.51-0.47          0.52-0.48
Table 4-5. Comparison of methods for computing the cdf.

     Method                               Accuracy         Speed
     Direct                               Exact solution   Low
     Series Approximation                 Variable         High
     Standard LP solvers                  High             Low
     Gradient descent                     Low              High
     Prekopa Algorithm                    High             Moderate
     Gradient descent (topology search)   Moderate         Moderate
     Sequential Quadratic Programming     High             Moderate
     Semi-definite Programming            High             High
     Random Sampling                      High             Moderate
Figure 4-1. I have two attributes each having two values with 2 class labels.
Figure 4-2. The current iterate $y^k$ just satisfies the constraint $c_l$ and easily satisfies the other constraints. Suppose $c_l$ is $\sum_{j=0}^{m} y_j x_i^j$ where $x_i$ is a value of X; then in the diagram on the left I observe that for the kth iteration $y = y^k$ the polynomial $\sum_{j=0}^{m} y_j x^j = 0$ has a minimum at $X = x_i$, with the value of the polynomial being a. This is also the value of $c_l$ evaluated at $y = y^k$.
Figure 4-3. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The attributes are uncorrelated with the class labels. ED(N)[GE(ζ)] is 0.5.
Figure 4-4. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The correlation between the attributes and the class labels is 0.25.ED(N)[GE(ζ)] is 0.24.
Figure 4-5. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The correlation between the attributes and the class labels is 0.5.ED(N)[GE(ζ)] is 0.14.
Figure 4-6. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The correlation between the attributes and the class labels is 0.75.ED(N)[GE(ζ)] is 0.068.
Figure 4-7. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The attributes are totally correlated to the class labels. ED(N)[GE(ζ)] is 0.
Figure 4-8. The plot is of the polynomial $(x + 10)^4 x^2 y + (y + 10)^4 y^2 x - z = 0$. I see that it is positive in the first quadrant and non-positive in the remaining three.
Figure 4-9. HE expectation in single dimension.
Figure 4-10. HE variance in single dimension.
Figure 4-11. HE E[] + Std() in single dimension.
Figure 4-12. HE expectation in multiple dimensions.
Figure 4-13. HE variance in multiple dimensions.
Figure 4-14. HE E[] + Std() in multiple dimensions.
Figure 4-15. Expectation of CE.
Figure 4-16. Individual run variance of CE.
Figure 4-17. Pairwise covariances of CV.
Figure 4-18. Total variance of cross validation.
Figure 4-19. E[] + √Var() of CV.
Figure 4-20. Convergence behavior.
Figure 4-21. CE expectation.
Figure 4-22. Individual run variance of CE.
Figure 4-23. Pairwise covariances of CV.
Figure 4-24. Total variance of cross validation.
Figure 4-25. E[] + √Var() of CV.
Figure 4-26. Convergence behavior.
CHAPTER 5
ANALYZING DECISION TREES
I use the methodology introduced for analyzing the error of classifiers and the model
selection measures to analyze decision tree algorithms. The methodology consists of
obtaining parametric expressions for the moments of the Generalization error (GE) for the
classification model of interest, followed by plotting these expressions for interpretability.
The major challenge in applying the methodology to decision trees, the main theme
of this work, is customizing the generic expressions for the moments of GE to this
particular classification algorithm. The specific contributions I make are: (a) I completely
characterize a subclass of decision trees namely, Random decision trees, (b) I discuss how
the analysis extends to other decision tree algorithms, and (c) in order to extend the
analysis to certain model selection measures, I generalize the relationships between the
moments of GE and moments of the model selection measures given in Dhurandhar and
Dobra [2009] to randomized classification algorithms. An extensive empirical comparison
between the proposed method and Monte Carlo, depicts the advantages of the method
in terms of running time and accuracy. It also showcases the use of the method as an
exploratory tool to study learning algorithms.
5.1 Computing Moments
In this section I first provide the necessary technical groundwork, followed by
customization of the expressions for decision trees. I now introduce some notation that
is used primarily in this section. X is a random vector modeling input whose domain is
denoted by X. Y is a random variable modeling output whose domain is denoted by Y (the set of class labels). Y(x) is a random variable modeling output for input x. ζ represents
a particular classifier with its GE denoted by GE(ζ). Z(N) denotes a set of classifiers
obtained by application of a classification algorithm to different samples of size N .
5.1.1 Technical Framework
The basic idea in the generic characterization of the moments of GE as given in
Dhurandhar and Dobra [2009], is to define a class of classifiers induced by a classification
algorithm and an i.i.d. sample of a particular size from an underlying distribution. Each
classifier in this class and its GE act as random variables, since the process of obtaining
the sample is randomized. Since GE(ζ) is a random variable, it has a distribution. Quite
often though, characterizing a finite subset of moments turns out to be a more viable
option than characterizing the entire distribution. Based on these facts, I revisit the
expressions for the first two moments around zero of the GE of a classifier,
$$E_{Z(N)}[GE(\zeta)] = \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y]\, P[Y(x) \neq y] \quad (5\text{--}1)$$

$$E_{Z(N) \times Z(N)}[GE(\zeta)\,GE(\zeta')] = \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x]\, P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']\, P[Y(x) \neq y]\, P[Y(x') \neq y'] \quad (5\text{--}2)$$
From the above equations I observe that for the first moment I have to characterize the
behavior of the classifier on each input separately while for the second moment I need
to observe its behavior on pairs of inputs. In particular, to derive expressions for the
moments of any classification algorithm I need to characterize PZ(N) [ζ(x)=y] for the
first moment and PZ(N)×Z(N) [ζ(x)=y ∧ ζ ′(x′)=y′] for the second moment. The values for
the other terms denote the error of the classifier for the first moment and errors of two
classifiers for the second moment which are obtained directly from the underlying joint
distribution. For example, if I have data with a class prior p for class 1 and 1−p for class 2,
then the error of a classifier classifying data into class 1 is 1−p and the error of a classifier
classifying data into class 2 is p. I now focus our attention on relating the above
two probabilities, to probabilities that can be computed using the joint distribution and
the classification model viz. Decision Trees.
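The first moment in equation 5-1, for instance, reduces to a double sum once the three probabilities are available; a minimal dictionary-based sketch (all names hypothetical) is:

def first_moment_ge(p_x, p_zeta, p_err):
    """Equation 5-1 as a direct sum.
    p_x[x]       = P[X = x]
    p_zeta[x][y] = P_{Z(N)}[zeta(x) = y]   (model/algorithm dependent)
    p_err[x][y]  = P[Y(x) != y]            (from the joint distribution)"""
    return sum(p_x[x] * sum(p_zeta[x][y] * p_err[x][y] for y in p_zeta[x])
               for x in p_x)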
In the subsections that follow I assume the following setup. I consider the dimensionality
of the input space to be d. A1, A2, ..., Ad are the corresponding discrete attributes or
continuous attributes with predetermined split points. a1, a2, ..., ad are the number of
attribute values/the number of splits of the attributes A1, A2, ..., Ad respectively. mij is the
ith attribute value/split of the jth attribute, where i ≤ aj and j ≤ d. Let C1, C2, ..., Ck be
the class labels representing k classes and N the sample size.
5.1.2 All Attribute Decision Trees (ATT)
Let us consider a decision tree algorithm whose only stopping criterion is that no
attributes remain when building any part of the tree. In other words, every path in
the tree from root to leaf has all the attributes. An example of such a tree is shown
in Figure 5-1. It can be seen that irrespective of the split attribute selection method
(e.g. information gain, gini gain, randomized selection, etc.) the above stopping criterion
yields trees with the same leaf nodes. Thus although a particular path in one tree has an
ordering of attributes that might be different from a corresponding path in other trees, the
leaf nodes will represent the same region in space or the same set of datapoints. This is
seen in Figure 5-2. Moreover, since predictions are made using data in the leaf nodes, any
deterministic way of prediction would lead to these trees resulting in the same classifier
for a given sample and thus having the same GE. Usually, prediction in the leaves is
performed by choosing the most numerous class as the class label for the corresponding
datapoint. With this I arrive at the expressions for computing the aforementioned
probabilities,
$$P_{Z(N)}[\zeta(x) = C_i] = P_{Z(N)}[ct(m_{p1}m_{q2}\ldots m_{rd}C_i) > ct(m_{p1}m_{q2}\ldots m_{rd}C_j),\ \forall j \neq i,\ i, j \in [1, \ldots, k]]$$
where x = mp1mq2...mrd represents a datapoint which is also a path from root to
leaf in the tree. ct(mp1mq2...mrdCi) is the count of the datapoints specified by the
cell mp1mq2...mrdCi. For example in Figure 4-1 x1y1C1 represents a cell. Henceforth,
when using the word ”path” I will strictly imply path from root to leaf. By computing
the above probability ∀ i and ∀ x I can compute the first moment of the GE for this
classification algorithm.
Similarly, for the second moment I compute cumulative joint probabilities of the
following form:
$$P_{Z(N)\times Z(N)}[\zeta(x) = C_i \wedge \zeta'(x') = C_v] = P_{Z(N)\times Z(N)}[ct(m_{p1}\ldots m_{rd}C_i) > ct(m_{p1}\ldots m_{rd}C_j),\ ct(m_{f1}\ldots m_{hd}C_v) > ct(m_{f1}\ldots m_{hd}C_w),\ \forall j \neq i,\ \forall w \neq v,\ i, j, v, w \in [1, \ldots, k]]$$
where the terms have a similar connotation as before. These probabilities can be
computed exactly or by using fast approximation techniques proposed in Dhurandhar and
Dobra [2009].
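For a single leaf cell and two classes, the count comparison above involves only the counts of that cell in each class, which are jointly multinomial with the rest of the table; a small Monte Carlo sketch of this probability (an approximation in the spirit of the fast techniques cited, with hypothetical names) is:

import numpy as np

def p_cell_count_greater(p_cell_c1, p_cell_c2, N, runs=100_000, seed=0):
    """Estimate P_{Z(N)}[ ct(m_{p1}...m_{rd} C1) > ct(m_{p1}...m_{rd} C2) ].
    Only the two cell probabilities matter: the two cell counts and everything
    else are jointly multinomial for a sample of size N."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(N, [p_cell_c1, p_cell_c2,
                                 1.0 - p_cell_c1 - p_cell_c2], size=runs)
    return float(np.mean(counts[:, 0] > counts[:, 1]))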
5.1.3 Decision Trees with Non-trivial Stopping Criteria
I just considered decision trees which are grown until all attributes are exhausted.
In real life though I seldom build such trees. The main reasons for this could be any
of the following: I wish to build small decision trees to save space; certain path counts
(i.e. number of datapoints in the leaves) are extremely low and hence I want to avoid
splitting further, as the predictions can get arbitrarily bad; I have split on a certain subset
of attributes and all the datapoints in that path belong to the same class (purity based
criteria); I want to grow trees to a fixed height (or depth). These stopping measures would
lead to paths in the tree that contain a subset of the entire set of attributes. Thus from
a classification point of view I cannot simply compare the counts in two cells as I did
previously. The reason for this being that the corresponding path may not be present in
the tree. Hence, I need to check that the path exists and then compare cell counts. Given
the classification algorithm, since the PZ(N) [ζ(x)=Ci] is the probability of all possible
ways in which an input x can be classified into class Ci for a decision tree it equates to
finding the following kind of probability for the first moment,
$$P_{Z(N)}[\zeta(x) = C_i] = \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ \forall j \neq i,\ i, j \in [1, \ldots, k]] \quad (5\text{--}3)$$
where p indexes all allowed paths by the tree algorithm in classifying input x. After the
summation, the right hand side term above is the probability that the cell pathpCi has the
greatest count, with the path ”pathp” being present in the tree. This will become clearer
when I discuss different stopping criteria. Notice that the characterization for the ATT is
just a special case of this more generic characterization.
The probability that I need to find for the second moment is,
$$P_{Z(N)\times Z(N)}[\zeta(x) = C_i \wedge \zeta'(x') = C_v] = \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i, j, v, w \in [1, \ldots, k]] \quad (5\text{--}4)$$
where p and q index all allowed paths by the tree algorithm in classifying input x and x′
respectively. The above two equations are generic in analyzing any decision tree algorithm
which classifies inputs into the most numerous class in the corresponding leaf. It is not
difficult to generalize it further when the decision in leaves is some other measure than
majority. In that case I would just include that measure in the probability in place of the
inequality.
5.1.4 Characterizing path exists for Three Stopping Criteria
It follows from above that to compute the moments of the GE for a decision tree
algorithm I need to characterize conditions under which particular paths are present. This
characterization depends on the stopping criteria and split attribute selection method in
a decision tree algorithm. I now look at three popular stopping criteria, namely a) Fixed
height based, b) Purity (i.e. entropy 0 or gini index 0 etc.) based and c) Scarcity (i.e. too
few datapoints) based. I consider conditions under which certain paths are present for
each stopping criteria. Similar conditions can be enumerated for any reasonable stopping
criteria. I then choose a split attribute selection method, thereby fully characterizing the
above two probabilities and hence the moments.
1. Fixed Height: This stopping criterion is basically that every path in the tree should be of length exactly h, where h ∈ [1, ..., d]. If h = 1 I classify based on just one attribute. If h = d then I have the all attribute tree. In general, a path mi1mj2...mlh is present in the tree iff the attributes A1, A2, ..., Ah are chosen in any order to form the path during the split attribute selection phase of a tree construction. Thus, for any path of length h to be present I biconditionally imply that the corresponding attributes are chosen.

2. Purity: This stopping criterion implies that I stop growing the tree from a particular split of a particular attribute if all datapoints lying in that split belong to the same class. I call such a path pure, else I call it impure. In this scenario, I could have paths of length 1 to d depending on when I encounter purity (assuming all datapoints do not lie in one class). Thus, I have the following two separate checks for paths of length d and less than d respectively.
a) Path mi1mj2...mld is present iff the path mi1mj2...ml(d−1) is impure and attributes A1, A2, ..., Ad−1 are chosen above Ad, or mi1mj2...ms(d−2)mld is impure and attributes A1, A2, ..., Ad−2, Ad are chosen above Ad−1, or ... or mj2...mld is impure and attributes A2, ..., Ad are chosen above A1. This means that if a certain set of d − 1 attributes is present in a path in the tree then I split on the dth attribute iff the current path is not pure, finally resulting in a path of length d.
b) Path mi1mj2...mlh is present, where h < d, iff the path mi1mj2...mlh is pure and attributes A1, A2, ..., Ah−1 are chosen above Ah and mi1mj2...ml(h−1) is impure, or the path mi1mj2...mlh is pure and attributes A1, A2, ..., Ah−2, Ah are chosen above Ah−1 and mi1mj2...ml(h−2)mlh is impure, or ... or the path mi1mj2...mlh is pure and attributes A2, ..., Ah are chosen above A1 and mj2...mlh is impure. This means that if a certain set of h − 1 attributes is present in a path in the tree then I split on some hth attribute iff the current path is not pure and the resulting path is pure.
The above conditions suffice for "path present" since the purity property is anti-monotone and the impurity property is monotone.

3. Scarcity: This stopping criterion implies that I stop growing the tree from a particular split of a certain attribute if its count is less than or equal to some pre-specified pruning bound. Let us denote this number by pb. As before, I have the following two separate checks for paths of length d and less than d respectively.
a) Path mi1mj2...mld is present iff the attributes A1, ..., Ad−1 are chosen above Ad and ct(mi1mj2...ml(d−1)) > pb, or the attributes A1, ..., Ad−2, Ad are chosen above Ad−1 and ct(mi1mj2...ml(d−2)mnd) > pb, or ... or the attributes A2, ..., Ad are chosen above A1 and ct(mi2mj3...mld) > pb.
b) Path mi1mj2...mlh is present, where h < d, iff the attributes A1, ..., Ah−1 are chosen above Ah and ct(mi1mj2...ml(h−1)) > pb and ct(mi1mj2...mlh) ≤ pb, or the attributes A1, ..., Ah−2, Ah are chosen above Ah−1 and ct(mi1mj2...ml(h−2)mnh) > pb and ct(mi1mj2...mnh) ≤ pb, or ... or the attributes A2, ..., Ah are chosen above A1 and ct(mi2mj3...mlh) > pb and ct(mi1mj2...mlh) ≤ pb.
This means that I stop growing the tree under a node once I find that the next chosen attribute produces a path with occupancy ≤ pb. The above conditions suffice for "path present" since the occupancy property is monotone.
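The sample-dependent parts of these conditions (the purity and occupancy of a partial path) can be written directly as functions of the sampled contingency table; a small sketch, assuming the table is stored as a NumPy array with one axis per attribute plus a final class axis (all names hypothetical):

import numpy as np

def _slice(table, path):
    # path maps attribute index -> fixed attribute value; other attributes stay free
    idx = tuple(path.get(a, slice(None)) for a in range(table.ndim - 1)) + (slice(None),)
    return table[idx]

def path_count(table, path):
    """ct(path): number of datapoints consistent with the partial path."""
    return _slice(table, path).sum()

def is_pure(table, path):
    """True if all datapoints consistent with the path share one class label."""
    per_class = _slice(table, path).reshape(-1, table.shape[-1]).sum(axis=0)
    return np.count_nonzero(per_class) <= 1

def scarcity_stop(table, path, pb):
    """True if the path's occupancy has fallen to at most the pruning bound pb."""
    return path_count(table, path) <= pb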
I observe from the above checks that I have two types of conditions that need to
be evaluated for a path being present namely, i) those that depend on the sample viz.
mi1mj2...ml(d−1) is impure or ct(mi1mj2...mlh) > pb, and ii) those that depend on the split
attribute selection method viz. A1, A2, ..., Ah are chosen. The former depends on the
data distribution which I have specified to be a multinomial. The latter I discuss in the
next subsection. Note that checks for a combination of the above stopping criteria can be
obtained by appropriately combining the individual checks.
5.1.5 Split Attribute Selection
In decision tree construction algorithms, at each iteration I have to decide the
attribute variable on which the data should be split. Numerous measures have been
developed Hall and Holmes [2003]. Some of the most popular ones aim to increase the
purity of a set of datapoints that lie in the region formed by that split. The purer the
region, the better the prediction and lower the error of the classifier. Measures such
as, i) Information Gain (IG) Quinlan [1986], ii) Gini Gain (GG) Breiman et al. [1984],
iii) Gain Ratio (GR) Quinlan [1986], iv) Chi-square test (CS) Shao [2003] etc. aim at
realising this intuition. Other measures using Principal Component Analysis Smith [2002],
Correlation-based measures Hall [1998] have also been developed. Another interesting yet
non-intuitive measure in terms of its utility is the Random attribute selection measure.
According to this measure I randomly choose the split attribute from the available set. The
decision tree that this algorithm produces is called a Random decision tree (RDT).
Surprisingly enough, a collection of RDTs quite often outperform their seemingly more
powerful counterparts Liu et al. [2005]. In this thesis I study this interesting variant.
I do this by first presenting a probabilistic characterization of selecting a particular
attribute/set of attributes, followed by simulation studies. Characterizations for the other
measures can be developed in similar vein by focusing on the working of each measure.
As an example, for the deterministic purity based measures mentioned above the split
attribute selection is just a function of the sample and thus by appropriately conditioning
on the sample I can find the relevant probabilities and hence the moments.
Before presenting the expression for the probability of selecting a split attribute/attributes
in constructing a RDT I extend the results in Dhurandhar and Dobra [2009] where
relationships were drawn between the moments of HE, CE, LE (just a special case of
cross-validation) and GE, to be applicable to randomized classification algorithms. The
random process is assumed to be independent of the sampling process. This result is
required since the results in Dhurandhar and Dobra [2009] are applicable to deterministic
classification algorithms and I would be analyzing RDT. With this I have the following
lemma.
Lemma 3. Let D and T be independent discrete random variables, with some distribution defined on each of them. Let $\mathcal{D}$ and $\mathcal{T}$ denote the domains of the random variables. Let f(d, t) and g(d, t) be two functions such that $\forall t \in \mathcal{T}$, $E_D[f(d, t)] = E_D[g(d, t)]$, where $d \in \mathcal{D}$. Then, $E_{\mathcal{T}\times\mathcal{D}}[f(d, t)] = E_{\mathcal{T}\times\mathcal{D}}[g(d, t)]$.
Proof.

$$E_{\mathcal{T}\times\mathcal{D}}[f(d, t)] = \sum_{t \in \mathcal{T}} \sum_{d \in \mathcal{D}} f(d, t)\, P[T = t, D = d] = \sum_{t \in \mathcal{T}} \sum_{d \in \mathcal{D}} f(d, t)\, P[D = d]\, P[T = t] = \sum_{t \in \mathcal{T}} E_D[g(d, t)]\, P[T = t] = E_{\mathcal{T}\times\mathcal{D}}[g(d, t)]$$
The result is valid even when D and T are continuous, but considering the scope
of this thesis I am mainly interested in the discrete case. This result implies that all
the relationships and expressions in Dhurandhar and Dobra [2009] hold, with an extra
expectation over t, for randomized classification algorithms where the random process
is independent of the sampling process. In equations 6–1 and 6–2 the expectations w.r.t.
Z(N) become expectations w.r.t. Z(N, t).
5.1.6 Random Decision Trees
In this subsection I explain the randomized process used for split attribute selection
and provide the expression for the probability of choosing an attribute/a set of attributes.
The attribute selection method I use is as follows. I assume a uniform probability
distribution in selecting the attribute variables i.e. attributes which have already not
been chosen in a particular branch, have an equal chance of being chosen for the next
level. The random process involved in attribute selection is independent of the sample
and hence Lemma 3 applies. I now give the expression for the probability of selecting
a subset of attributes from the given set for a path. This expression is required in the
computation of the above mentioned probabilities used in computing the moments. For
the first moment I need to find the following probability. Given d attributes A1, A2, ..., Ad
the probability of choosing a set of h attributes, where h ∈ {1, 2, ..., d}, is

$$P[h\ \text{attributes chosen}] = \frac{1}{\binom{d}{h}}$$
since choosing without replacement is equivalent to simultaneously choosing a subset of
attributes from the given set.
For the second moment when the trees are different (required in the finding of
variance of CE since the training sets in the various runs in cross validation are different,
i.e. for finding $E_{Z(N)\times Z(N)}[GE(\zeta)GE(\zeta')]$), the probability of choosing $l_1$ attributes for
a path in one tree and $l_2$ attributes for a path in another tree, where $l_1, l_2 \le d$, is given by

$$P[l_1\ \text{attribute path in tree 1},\ l_2\ \text{attribute path in tree 2}] = \frac{1}{\binom{d}{l_1}\binom{d}{l_2}}$$
since the process of choosing one set of attributes for a path in one tree is independent of
the process of choosing another set of attributes for a path in a different tree.
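Both selection probabilities are simple binomial-coefficient expressions; a one-line sketch of each (function names hypothetical):

from math import comb

def p_path_attributes_chosen(d, h):
    """Probability that a particular set of h attributes forms a path in one RDT."""
    return 1 / comb(d, h)

def p_paths_in_two_trees(d, l1, l2):
    """Probability for a set of l1 attributes in one tree and l2 in an independently built tree."""
    return 1 / (comb(d, l1) * comb(d, l2))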
For the second moment when the tree is the same (required in the finding of variance
of GE and HE i.e. for finding EZ(N)×Z(N) [GE(ζ)2]), the probability of choosing two sets
of attributes such that the two distinct paths resulting from them co-exist in a single tree
is given by the following. Assume I have d attributes A1, A2, ..., Ad. Let the lengths of
the two paths (or cardinality of the two sets) be l1 and l2 respectively, where l1, l2 ≤ d.
Without loss of generality assume l1 ≤ l2. Let p be the number of attributes common
to both paths. Notice that p ≥ 1 is one of the necessary conditions for the two paths to
co-exist. Let v ≤ p be those attributes among the total p that have same values for both
paths. Thus p − v attributes are common to both paths but have different values. At one
of these attributes in a given tree the two paths will bifurcate. The probability that the
two paths co-exist given our randomized attribute selection method is computed by finding
out all possible ways in which the two paths can co-exist in a tree and then multiplying
the number of each kind of way by the probability of having that way. A detailed proof is
given in the appendix. The expression for the probability based on the attribute selection
method is,
$$P[l_1\ \text{and}\ l_2\ \text{length paths co-exist}] = \sum_{i=0}^{v} {}^{v}\!P_i\, (l_1 - i - 1)!\,(l_2 - i - 1)!\,(p - v)\, prob_i$$

where ${}^{v}\!P_i = \frac{v!}{(v-i)!}$ denotes permutation and $prob_i = \frac{1}{d(d-1)\ldots(d-i)(d-i-1)^2\ldots(d-l_1+1)^2(d-l_1)\ldots(d-l_2+1)}$ is the probability of the ith possible way. For fixed height trees of height h, $(l_1-i-1)!(l_2-i-1)!$ becomes $(h-i-1)!^2$ and $prob_i = \frac{1}{d(d-1)\ldots(d-i)(d-i-1)^2\ldots(d-h+1)^2}$.
5.1.7 Putting things together
I now have all the ingredients that are required for the computation of the moments
of GE. In this subsection I combine the results derived in the previous subsections to
obtain expressions for PZ(N) [ζ(x)=Ci] and PZ(N)×Z(N) [ζ(x)=Ci ∧ ζ ′(x′)=Cv] which are
vital in the computation of the moments.
Let s.c.c.s. be an abbreviation for stopping criteria conditions that are sample
dependent. Conversely, let s.c.c.i. be an abbreviation for stopping criteria conditions that are
sample independent, i.e. conditions that depend on the attribute selection method. I
now provide expressions for the above probabilities categorized by the 3 stopping criteria.
5.1.7.1 Fixed Height
The conditions for ”path exists” for fixed height trees depend only on the attribute
selection method as seen in subsection 5.1.4. Hence the probability used in finding the first
moment is given by,
$$\begin{aligned} P_{Z(N)}[\zeta(x)=C_i] &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ \forall j \neq i,\ i,j \in [1,\ldots,k]] \\ &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \text{s.c.c.i.},\ \forall j \neq i,\ i,j \in [1,\ldots,k]] \\ &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \forall j \neq i,\ i,j \in [1,\ldots,k]]\, P_{Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p} \frac{P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \forall j \neq i,\ i,j \in [1,\ldots,k]]}{{}^{d}C_h} \end{aligned} \quad (5\text{--}5)$$

where ${}^{d}C_h = \frac{d!}{h!(d-h)!}$
and h is the length of the paths or the height of the tree. The
probability in the last step of the above derivation can be computed from the underlying
joint distribution. The probability for the second moment when the trees are different is
given by,
$$\begin{aligned} P_{Z(N)\times Z(N)}[\zeta(x)=C_i \wedge \zeta'(x')=C_v] &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \\ &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \cdot P_{Z(N)\times Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p,q} \frac{P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]]}{({}^{d}C_h)^2} \end{aligned} \quad (5\text{--}6)$$
where h is the length of the paths. The probability for the second moment when the
trees are identical is given by,
$$\begin{aligned} P_{Z(N)\times Z(N)}[\zeta(x)=C_i \wedge \zeta(x')=C_v] &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \\ &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \cdot P_{Z(N)\times Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p,q} \sum_{t=0}^{b} {}^{b}\!P_t\, (h-t-1)!^2\,(r-b)\, prob_t\; P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \end{aligned} \quad (5\text{--}7)$$
where r is the number of attributes that are common in the 2 paths, b is the number
of attributes that have the same value in the 2 paths, h is the length of the paths and
$prob_t = \frac{1}{d(d-1)\ldots(d-t)(d-t-1)^2\ldots(d-h+1)^2}$. As before, the probability comparing counts can be
computed from the underlying joint distribution.
5.1.7.2 Purity and Scarcity
The conditions for ”path exists” in the case of purity and scarcity depend on both the
sample and the attribute selection method as can be seen in 5.1.4. The probability used in
finding the first moment is given by,
$$\begin{aligned} P_{Z(N)}[\zeta(x)=C_i] &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ \forall j \neq i,\ i,j \in [1,\ldots,k]] \\ &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \text{s.c.c.i.},\ \text{s.c.c.s.},\ \forall j \neq i,\ i,j \in [1,\ldots,k]] \\ &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \text{s.c.c.s.},\ \forall j \neq i,\ i,j \in [1,\ldots,k]]\, P_{Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p} \frac{P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \text{s.c.c.s.},\ \forall j \neq i,\ i,j \in [1,\ldots,k]]}{{}^{d}C_{h_p-1}\,(d - h_p + 1)} \end{aligned} \quad (5\text{--}8)$$
where hp is the length of the path indexed by p. The joint probability of comparing
counts and s.c.c.s. can be computed from the underlying joint distribution. The
probability for the second moment when the trees are different is given by,
$$\begin{aligned} P_{Z(N)\times Z(N)}[\zeta(x)=C_i \wedge \zeta'(x')=C_v] &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \\ &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \text{s.c.c.s.},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \cdot P_{Z(N)\times Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p,q} \frac{P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \text{s.c.c.s.},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]]}{{}^{d}C_{h_p-1}\,{}^{d}C_{h_q-1}\,(d-h_p+1)(d-h_q+1)} \end{aligned} \quad (5\text{--}9)$$
where hp and hq are the lengths of the paths indexed by p and q. The probability for
the second moment when the trees are identical is given by,
$$\begin{aligned} P_{Z(N)\times Z(N)}[\zeta(x)=C_i \wedge \zeta(x')=C_v] &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \\ &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \text{s.c.c.s.},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]]\, P_{Z(N)\times Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p,q} \sum_{t=0}^{b} \frac{{}^{b}\!P_t\,(h_p-t-2)!\,(h_q-t-2)!\,(r-b)\, prob_t}{(d-h_p+1)(d-h_q+1)}\; P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \text{s.c.c.s.},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \end{aligned} \quad (5\text{--}10)$$
where r is the number of attributes that are common in the 2 paths sparing the
attributes chosen as leaves, b is the number of attributes that have the same value, hp
and hq are the lengths of the 2 paths and, without loss of generality assuming hp ≤ hq,
$prob_t = \frac{1}{d(d-1)\ldots(d-t)(d-t-1)^2\ldots(d-h_p)^2(d-h_p-1)\ldots(d-h_q)}$. As before, the probability of comparing
counts and s.c.c.s. can be computed from the underlying joint distribution.
counts and s.c.c.s. can be computed from the underlying joint distribution.
Using the expressions for the above probabilities the moments of GE can be
computed. In the next section I perform experiments on synthetic as well as distributions
built on real data to portray the efficacy of the derived expressions.
5.2 Experiments
To exactly compute the probabilities for each path, the time complexity for fixed
height trees is $O(N^2)$ and for purity and scarcity based trees is $O(N^3)$. Hence, computing
exactly the probabilities and consequently the moments is practical for small values of
N . For larger values of N , I propose computing the individual probabilities using Monte
Carlo (MC). In the empirical studies I report, I show that the accuracy in estimating the
error (i.e. the moments of GE) by using our expressions with MC is always greater than
by directly using MC for the same computational cost. In fact, the accuracy of using the
expressions is never worse than MC even when MC is executed for 10 times the number of
iterations as those of the expressions. The true error or the golden standard against which
I compare the accuracy of these estimators is obtained by running MC for a week, which is
around 200 times the number of iterations as those of the expressions.
Notation: In the experiments, AF refers to the estimates obtained by using the
expressions in conjunction with Monte Carlo. MC-i refers to simple Monte Carlo being
executed for i times the number of iterations as those of the expressions. The term True
Error or TE refers to the golden standard against which I compare AF and MC-i.
General Setup: I perform empirical studies on synthetic as well as real data. The
experimental setup for synthetic data is as follows: I fix N to 10000. The number of
classes is fixed to two. I observe the behavior of the error for the three kinds of trees with
the number of attributes fixed to d = 5 and each attribute having 2 attribute values. I
then increase the number of attribute values to 3, to observe the effect that increasing
the number of split points has on the performance of the estimators. I also increase the
number of attributes to d = 8 to study the effect that increasing the number of attributes
has on the performance. With this I have a d + 1 dimensional contingency table whose d
dimensions are the attributes and the (d+1)th dimension represents the class labels. When
each attribute has two values the total number of cells in the table is c = 2d+1 and with
three values the total number of cells is c = 3d × 2. If I fix the probability of observing a
datapoint in cell i to be pi such that∑c
i=1 pi = 1 and the sample size to N the distribution
that perfectly models this scenario is a multinomial distribution with parameters N and
the set p1, p2, ..., pc. In fact, irrespective of the value of d and the number of attribute
values for each attribute the scenario can be modelled by a multinomial distribution.
In the studies that follow the pi are varied and the amount of dependence between the
attributes and the class labels is computed for each set of pi using the Chi-square test
Connor-Linton [2003]. More precisely, I sum over all i the squares of the difference of each
pi with the product of its corresponding marginals, with each squared difference being
divided by this product, i.e. $\text{correlation} = \sum_{i} \frac{(p_i - p_{im})^2}{p_{im}}$, where $p_{im}$ is the product of the
marginals for the ith cell. The behavior of the error for trees with the three aforementioned
stopping criteria is seen for different correlation values and for a class prior of 0.5.
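A short sketch of this synthetic setup, assuming "product of the corresponding marginals" means the attribute-configuration marginal times the class marginal, and laying the cell probabilities out as a (number of attribute configurations) x (number of classes) array; the names are hypothetical.

import numpy as np

def correlation(p):
    """Chi-square style dependence between the attributes and the class labels."""
    row = p.sum(axis=1, keepdims=True)   # P[X = x] for each attribute configuration
    col = p.sum(axis=0, keepdims=True)   # P[Y = y] for each class
    pim = row * col                      # product of the corresponding marginals
    return float(np.sum((p - pim) ** 2 / pim))

# Example: d = 5 binary attributes, 2 classes, uniform cell probabilities
p = np.full((2 ** 5, 2), 1 / 64)
table = np.random.default_rng(0).multinomial(10_000, p.ravel()).reshape(p.shape)
print(correlation(p))                    # uniform cells give correlation 0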
In case of real data, I perform experiments on distributions built on three UCI
datasets. I split the continuous attributes at the mean of the given data. I thus can form
a contingency table representing each of the datasets. The counts in the individual cells
divided by the dataset size provide us with empirical estimates for the individual cell
probabilities (pi). Thus, with the knowledge of N (dataset size) and the individual pi I
have a multinomial distribution. Using this distribution I observe the behavior of the error
for the three kinds of trees with results being applicable to other datasets that are similar
to the original.
Observations: Figures 5-3, 5-4 and 5-5 depict the error of fixed height trees with
the number of attributes being 5 for the first two figures and 8 for the third figure. The
number of attribute values increases from 2 to 3 in figures 5-3 and 5-4 respectively. I
observe in these figures that AF is significantly more accurate than both MC-1 and
MC-10. In fact the performance of the 3 estimators namely, AF, MC-1 and MC-10
remains more or less unaltered even with changes in the number of attributes and in the
number of splits per attribute. A similar trend is seen for both purity based trees i.e.
figures 5-6, 5-7 and 5-8 as well as scarcity based trees 5-9, 5-10 and 5-11. Though in the
case of purity based trees the performance of both MC-1 and MC-10 is much superior
as compared with their performance on the other two kinds of trees, especially at low
correlations. The reason for this being that, at low correlations the probability in each cell
of the multinomial is non-negligible and with N = 10000 the event that every cell contains
at least a single datapoint is highly likely. Hence, the trees I obtain with high probability
using the purity based stopping criteria are all ATT. Since in an ATT all the leaves are
identical irrespective of the ordering of the attributes in any path, the randomness in
the classifiers produced is only due to the randomness in the data generation process
and not because of the random attribute selection method. Thus, the space of classifiers
over which the error is computed reduces and MC performs well even for a relatively
fewer number of iterations. At higher correlations and for the other two kinds of trees the
probability of smaller trees is reasonable and hence MC has to account for a larger space
of classifiers induced by not only the randomness in the data but also by the randomness
in the attribute selection method.
In case of real data too (Figure 5-12), the performance of the expressions is significantly
superior as compared with MC-1 and MC-10. The performance of MC-1 and MC-10 for
the purity based trees is not as impressive here since the dataset sizes are much smaller (in
the tens or hundreds) compared to 10000 and hence the probability of having an empty
cell is not particularly low. Moreover, the correlations are reasonably high (above 0.6).
Reasons for superior performance of expressions: With simple MC, trees have
to be built while performing the experiments. Since, the expectations are over all possible
classifiers i.e. over all possible datasets and all possible randomizations in the attribute
selection phase, the exhaustive space over which direct MC has to run is huge. No tree
has to be explicitly built when using the expressions. Moreover, the probabilities for each
path can be computed in parallel. Another reason as to why calculating the moments using
expressions works better is that the portion of the probabilities for each path that depend
on the attribute selection method are computed exactly (i.e. with no error) by the given
expressions and the inaccuracies in the estimates only occur due to the sample dependent
portion in the probabilities.
5.3 Discussion
In the previous sections I derived the analytical expressions for the moments of
the GE of decision trees and depicted interesting behavior of RDT built under the 3
stopping criteria. It is clear that using the expressions I obtain highly accurate estimates
of the moments of errors for situations of interest. In this section I discuss issues related
to extension of the analysis to other attribute selection methods and issues related to the
computational complexity of the algorithm.
5.3.1 Extension
The conditions presented for the 3 stopping criteria namely, fixed height, purity and
scarcity are applicable irrespective of the attribute selection method. Commonly used
deterministic attribute selection methods include those based on Information Gain (IG),
Gini Gain (GG), Gain ratio (GR) etc. Given a sample the above metrics can be computed
for each attribute. Hence, the above metrics can be implemented as corresponding
functions of the sample. For example, in the case of IG I compute the loss in entropy (based
on terms of the form $q \log q$, where the q are computed from the sample) by the addition of an attribute as I build
the tree. I then compare the loss in entropy of all attributes not already chosen in the
path and choose the attribute for which the loss in entropy is maximum. Following
this procedure I build the path and hence the tree. To compute the probability of path
exists, I add these sample dependent conditions in the corresponding probabilities. These
conditions account for a particular set of attributes being chosen, in the 3 stopping
criteria. In other words, these conditions quantify the conditions in the 3 stopping criteria
that are attribute selection method dependent. Similar conditions can be derived for the
other attribute selection methods (attribute with maximum gini gain for GG, attribute
with maximum gain ratio for GR) from which the relevant probabilities and hence the
moments can be computed. Thus, while computing the probabilities given in equations
5–3 and 5–4 the conditions for path exists for these attribute selection methods depend
totally on the sample. This is unlike what I observed for the randomized attribute
selection criterion, where the conditions for path exists that depend on this randomized
criterion were sample independent, while the other conditions in purity and scarcity
were sample dependent. Characterizing these probabilities enables us to compute the
moments of GE for these other attribute selection methods.
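As a concrete instance, the information gain of a candidate split attribute is itself just a function of the sampled contingency table; a minimal sketch (the last axis of the table indexes the class label, and a non-empty node is assumed):

import numpy as np

def entropy(class_counts):
    p = class_counts / class_counts.sum()
    p = p[p > 0]                              # 0 log 0 treated as 0
    return -np.sum(p * np.log2(p))

def information_gain(table, attribute):
    """IG of splitting the node described by `table` on `attribute`."""
    other_axes = tuple(i for i in range(table.ndim - 1) if i != attribute)
    node = table.sum(axis=other_axes)         # shape: (values of attribute, classes)
    total = node.sum()
    parent = entropy(node.sum(axis=0))
    children = sum(row.sum() / total * entropy(row) for row in node if row.sum() > 0)
    return parent - children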
In the analysis that I presented, I assumed that the split points for continuous
attributes were determined apriori to tree construction. If the split point selection
algorithm is dynamic i.e. the split points are selected while building the tree, then in the
path exists conditions of the 3 stopping criteria I would have to append an extra condition
namely, the split occurs at ”this” particular attribute value. In reality, the value of ”this”
is determined by the values that the samples attain for the specific attribute in the
particular dataset, which is finite.1 Hence, while analyzing I can choose a set of allowed
1 Since dataset is finite.
values for ”this” for each continuous attribute. Using these updated set of conditions for
the 3 stopping criteria the moments of GE can be computed.
5.3.2 Scalability
The time complexity of implementing the analysis is proportional to the product of
the size of the input/output space 2 and the number of paths that are possible in the
tree while classifying a particular input. To this end, it should be noted that if a stopping
criterion is not carefully chosen and applied, then the number of possible trees and hence
the number of allowed paths can become exponential in the dimensionality. In such
scenarios, studying small or at best medium size trees is feasible. For studying larger trees
the practitioner should combine stopping criteria (e.g. pruning bound and fixed height or
scarcity and fixed height) i.e. combine the conditions given for each individual stopping
criteria or choose a stopping criterion that limits the number of paths (e.g. fixed height).
Keeping these simple facts in mind and on appropriate usage, the expressions can assist in
delving into the statistical behavior of the errors for decision tree classifiers.
5.4 Take-aways
I have developed a general characterization for computing the moments of the GE
for decision trees. In particular, I have characterized RDT for three stopping
criteria, namely fixed height, purity and scarcity. Being able to compute moments of
GE, allows us to compute the moments of the various validation measures and observe
their relative behavior. Using the general characterization, characterizations for specific
attribute selection measures (e.g. IG, GG etc.) other than randomized can be developed
as described before. As a technical result, I have extended the theory in Dhurandhar and
Dobra [2009] to be applicable to randomized classification algorithms; this is necessary
if the theory is to be applied to random decision trees as I did in this thesis. The
2 In case of continuous attributes the size of the input/output space is the size after discretization.
experiments reported in section 5.2 had two purposes: (a) portray the manner in which
the expressions can be utilized as an exploratory tool to gain a better understanding of
decision tree classifiers, and (b) show conclusively that the methodology in Dhurandhar
and Dobra [2009] together with the developments in this thesis provide a superior analysis
tool when compared with simple Monte Carlo.
Figure 5-1. The all attribute tree with 3 attributes A1, A2, A3, each having 2 values.
Figure 5-2. Given 3 attributes A1, A2, A3, the path m11m21m31 is formed irrespectiveof the ordering of the attributes. Three such permutations are shown in theabove figure.
Figure 5-3. Fixed Height trees with d = 5, h = 3 and attributes with binary splits.
Figure 5-4. Fixed Height trees with d = 5, h = 3 and attributes with ternary splits.
Figure 5-5. Fixed Height trees with d = 8, h = 3 and attributes with binary splits.
Figure 5-6. Purity based trees with d = 5 and attributes with binary splits.
Figure 5-7. Purity based trees with d = 5 and attributes with ternary splits.
Figure 5-8. Purity based trees with d = 8 and attributes with binary splits.
Figure 5-9. Scarcity based trees with d = 5, pb = N/10 and attributes with binary splits.

Figure 5-10. Scarcity based trees with d = 5, pb = N/10 and attributes with ternary splits.

Figure 5-11. Scarcity based trees with d = 8, pb = N/10 and attributes with binary splits.
Figure 5-12. Comparison between AF and MC (MC-1, MC-10, with TE as the golden standard) on three UCI datasets (Shuttle Landing Control, Pima Indians, Balloon) for trees pruned based on fixed height (h = 3), purity and scarcity (pb = N/10).
CHAPTER 6
K-NEAREST NEIGHBOR CLASSIFIER
The kNN algorithm is a simple yet effective and hence commonly used classification
algorithm in industry and research. It is known to be a consistent estimator Stone [1977],
i.e. it asymptotically achieves the Bayes error within a constant factor. None of the more
sophisticated classification algorithms, e.g. SVMs, Neural Networks, etc., is known to
outperform it consistently Stanfill and Waltz [1986]. However, the algorithm is susceptible
to noise, and choosing an appropriate value of k is more of an art than a science.
6.1 Specific Contributions
I develop expressions for the first 2 moments of GE for the k-Nearest Neighbor
classification algorithm built on categorical data. I accomplish this by expressing the
moments as functions of the sample produced by the underlying joint distribution. In
particular, I develop efficient characterizations for the moments when the distance metric
used in the kNN algorithm, is independent of the sample. I also discuss issues related to
the scalability of the algorithm. I use the derived expressions to study the classification
algorithm in settings of interest (for example, different values of k) by visualization. The joint
distribution I use in the empirical studies that follow the theory is a multinomial, the
most generic data generation model for the discrete case.
6.2 Technical Framework
In this section I present the generic expressions for the moments of GE that were
given in Dhurandhar and Dobra [2009]. The moments of the GE of a classifier built over
an independent and identically distributed (i.i.d.) random sample drawn from a joint
distribution, are taken over the space of all possible classifiers that can be built, given the
classification algorithm and the joint distribution. Though the classification algorithm
may be deterministic, the classifiers act as random variables since the sample that they are
built on is random. The GE of a classifier, being a function of the classifier, also acts as
a random variable. Due to this fact, the GE of a classifier, denoted by GE(ζ), has a distribution
and consequently I can talk about its moments. The generic expressions for the first two
moments of GE taken over the space of possible classifiers resulting from samples of size N
from some joint distribution are as follows:
$$E_{Z(N)}[GE(\zeta)] = \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y]\, P[Y(x) \neq y] \quad (6\text{--}1)$$

$$E_{Z(N) \times Z(N)}[GE(\zeta)\,GE(\zeta')] = \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x]\, P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']\, P[Y(x) \neq y]\, P[Y(x') \neq y'] \quad (6\text{--}2)$$
Equation 6–1 is the expression for the first moment of the GE(ζ). Notice that
inside the first sum $\sum_{x\in\mathcal{X}}$ the input x is fixed and inside the second sum the output
y is fixed, thus the PZ(N) [ζ(x)=y] is the probability of all possible ways in which an
input x is classified into class y. This probability depends on the joint distribution and
the classification algorithm. The other two probabilities are directly derived from the
distribution. Thus, customizing the expression for EZ(N) [GE(ζ)], effectively means
deciphering a way of computing PZ(N) [ζ(x)=y]. Similarly, customizing the expression for
EZ(N)×Z(N) [GE(ζ)GE(ζ ′)] means finding a way of computing PZ(N)×Z(N) [ζ(x)=y ∧ ζ ′(x′)=y′]
given any joint distribution. In Section 6.4 I derive expressions for these two probabilities,
which depend only on the underlying joint probability distribution, thus providing a way
of computing them analytically.
6.3 K-Nearest Neighbor Algorithm
The k-nearest neighbor (kNN) classification algorithm classifies an input based on
the class labels of the closest k points in the training dataset. The class label assigned
to an input is usually the most numerous class of these k closest points. The underlying
intuition that is the basis of this classification model is that nearby points will tend to
have higher "similarity", viz. the same class, than points that are far apart.
The notion of closeness between points is determined by the distance metric used.
When the attributes are continuous, the most popular metric is the l2 norm or the
Euclidean distance. Figure 6.9 shows points in $R^2$ space. The points b, c and d are the
3-nearest neighbors (k=3) of the point a. When the attributes are categorical the most
popular metric used is the Hamming distance Liu and White [1997]. The Hamming
distance between two points/inputs is the number of attributes that have distinct
values for the two inputs. This metric is sample independent i.e. the Hamming distance
between two inputs remains unchanged, irrespective of the sample counts produced in the
corresponding contingency table. For example, Table 6-1 represents a contingency table.
The Hamming distance between x1 and x2 is the same irrespective of the values of Nij
where i ∈ {1, 2, ..., M} and j ∈ {1, 2, ..., v}. Other metrics such as Value Difference Metric
(VDM) Stanfill and Waltz [1986], Chi-square Connor-Linton [2003] etc. exist, that depend
on the sample. I now provide a global characterization for calculating the aforementioned
probabilities for both kinds of metrics. This is followed by an efficient characterization for
the sample independent metrics, which includes the traditionally used and most popular
Hamming distance metric.
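A minimal sketch of Hamming-distance kNN over categorical inputs; note that ties in distance at the k-th position are broken arbitrarily in this sketch, whereas the characterization below keeps every neighbor up to that distance. All names are hypothetical.

from collections import Counter

def hamming(a, b):
    """Number of attributes on which the two categorical inputs differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def knn_predict(train_x, train_y, x, k):
    """Classify x as the most numerous class among its k Hamming-nearest neighbors."""
    order = sorted(range(len(train_x)), key=lambda i: hamming(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]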
6.4 Computation of Moments
In this section I characterize the probabilities PZ(N) [ζ(x)=y] and
PZ(N)×Z(N) [ζ(x)=y ∧ ζ ′(x′)=y′] required for the computation of the first two moments. In
the case, that the number of nearest neighbors at a particular distance d is more than k
for an input and at any lesser value of distance the number of NN is less than k, I classify
the input based on all the NN upto the distance d.
6.4.1 General Characterization
I provide a global characterization for the above mentioned probabilities without any
assumptions on the distance metric in this subsection.
The scenario wherein xi is classified into class Cj, given i ∈ {1, 2, ..., M} and j ∈
{1, 2, ..., v}, depends on two factors: 1) the kNN of xi and 2) the class label of the majority
dependent or independent of the sample as previously discussed. The second factor is
always determined by the sample. The PZ(N) [ζ(xi)=Cj] is the probability of all possible
ways that input xi can be classified into class Cj, given the joint distribution over the
input-output space. This probability for xi is calculated by summing the joint probabilities
of having a particular set of kNN and the majority of this set of kNN has a class label Cj,
over all possible kNN that the input can have. Formally,
P_{Z(N)}[ζ(x_i)=C_j] = ∑_{q∈Q} P_{Z(N)}[q, c(q, j) > c(q, t) ∀t ∈ {1, 2, ..., v}, t ≠ j]     (6–3)
where q is a set of kNN of the given input and Q is the set containing all possible q. c(q, b) is a function that calculates the number of kNN in q that lie in class C_b. For example, from Table 6-1, if x_1 and x_2 are the kNN of some input, then q = {x_1, x_2} and c(q, b) = N_{1b} + N_{2b}. Notice that, since x_1 and x_2 are the kNN of some input, ∑_{i=1}^{2} ∑_{j=1}^{v} N_{ij} ≥ k. Moreover, if the kNN comprise the entire input sample, then the resulting classification is equivalent to classification performed using class priors determined by the sample. The probability P_{Z(N)×Z(N)}[ζ(x)=y ∧ ζ'(x')=y'] used in the computation of the second moment is calculated by going over the kNN of two inputs rather than one. The expression for this probability is given by,
P_{Z(N)×Z(N)}[ζ(x_i)=C_j ∧ ζ'(x_l)=C_w] = ∑_{q∈Q} ∑_{r∈R} P_{Z(N)×Z(N)}[q, c(q, j) > c(q, t), r, c(r, w) > c(r, s), ∀s, t ∈ {1, 2, ..., v}, t ≠ j, s ≠ w]     (6–4)
where q and r are sets of kNN of x_i and x_l respectively, Q and R are the sets containing all possible q and r respectively, and c(·, ·) has the same connotation as before.
As mentioned before, the probability of a particular q (or of the joint q, r) depends on the distance metric used. The inputs (e.g., x_1, x_2, ...) that are the k nearest neighbors of some given input depend on the sample irrespective of the distance metric, i.e., the kNN of an input depend on the sample even if the distance metric is sample independent. I illustrate this fact with an example.
Example 1. Say x_1 and x_2 are the two closest inputs to x_i, with x_1 closer than x_2, based on some sample independent distance metric. x_1 and x_2 are both the kNN of x_i if and only if ∑_{a=1}^{2} ∑_{b=1}^{v} c(a, b) ≥ k, ∑_{b=1}^{v} c(1, b) < k and ∑_{b=1}^{v} c(2, b) < k. The first inequality states that the number of copies of x_1 and x_2, given by ∑_{j=1}^{v} N_{1j} and ∑_{j=1}^{v} N_{2j} respectively in the contingency Table 6-1, is greater than or equal to k. If this inequality holds, then the class label of input x_i is determined by the copies of x_1, of x_2, or of both; no input besides these two is involved in the classification of x_i. The second and third inequalities state that the number of copies of x_1 and of x_2, respectively, is less than k. This forces both x_1 and x_2 to be used in the classification of x_i. If the first inequality were untrue, then inputs farther away would also play a part in the classification of x_i. Thus the kNN of an input depend on the sample irrespective of the distance metric used.
The above example also illustrates the manner in which the set q (or r) can be
characterized as a function of the sample, enabling us to compute the two probabilities
required for the computation of the moments from any given joint distribution over
the data, for sample independent metrics. Without loss of generality (w.l.o.g.) assume
x_1, x_2, ..., x_M are inputs in non-decreasing order (from left to right) of their distance from a given input x_i, based on some sample independent distance metric. Then this input having the kNN given by the set q = {x_{a_1}, x_{a_2}, ..., x_{a_z}}, where a_1, a_2, ..., a_z ∈ {1, 2, ..., M} and a_d < a_f if d < f, is equivalent to the following conditions on the sample being true: (i) ∑_{l=1}^{z} ∑_{j=1}^{v} c(x_{a_l}, j) ≥ k; (ii) ∑_{j=1}^{v} c(x_{a_l}, j) > 0 for all l ∈ {1, 2, ..., z}; (iii) ∑_{j=1}^{v} c(l, j) < k for all l ∈ 2^q with cardinality |l| = z − 1, where 2^q is the power set of q; and (iv) ∑_{j=1}^{v} c(x_h, j) = 0 for all x_h ∈ {x_1, x_2, ..., x_M} − q with h < a_z. The conditions imply that, for the elements of q to be the kNN of the given input, the sum of their counts must be greater than or equal to k, the sum of the counts of any subset of q (it suffices to check subsets of cardinality |q| − 1) must be less than k, the count of each element of q must be non-zero, and the other inputs that are not in q but are no farther from the given input than the farthest input in q must have counts of zero. Notice that all of these conditions are functions of the sample.
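As a minimal illustration of these sample conditions (my own sketch; the function and argument names are hypothetical), the following Python fragment checks, on one sampled contingency table, whether a candidate set q is exactly the set of kNN, assuming the per-input counts ∑_j N_{ij} have already been aggregated into counts[i].

from itertools import combinations

def is_knn_set(q, counts, order, k):
    """Return True if the inputs in q are exactly the kNN of the given input,
    per the four sample conditions above.
    q      -- candidate set of input indices
    counts -- counts[i] = number of copies of input i in the sample (summed over classes)
    order  -- all input indices sorted by distance from the given input, closest first
    k      -- number of nearest neighbors
    """
    q = set(q)
    if sum(counts[i] for i in q) < k:                 # (i) joint count must reach k
        return False
    if any(counts[i] == 0 for i in q):                # (ii) every element must actually occur
        return False
    for sub in combinations(q, len(q) - 1):           # (iii) dropping any element falls below k
        if sum(counts[i] for i in sub) >= k:
            return False
    last = max(order.index(i) for i in q)             # position of the farthest element of q
    closer = set(order[:last]) - q                    # (iv) inputs at least as close, not in q...
    return all(counts[i] == 0 for i in closer)        # ...must have count zero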
The other condition in the probabilities is that a particular class label is the most
numerous among the kNNs, which is also a function of the sample. In case of sample
dependent metrics, the conditions that are equivalent to having a particular set q as kNN,
are totally dependent on the specific distance metric used. Since these distance metrics are
sample dependent, I can certainly write these conditions as the corresponding functions of
the sample. Since all the involved conditions in the above probabilities can be expressed
as functions of the sample, I can compute them over any joint distribution defined over
the data.
6.4.2 Efficient Characterization for Sample Independent Distance Metrics
In the previous subsection I presented the global characterization for the kNN algorithm. Though this characterization provides insight into the relationship between the moments of GE, the underlying distribution and the kNN classification algorithm, it is inefficient to compute in practical scenarios. This is due to the fact that any given input can have itself and/or any of the other inputs as kNN. Hence, the total number of terms involved in finding the probabilities in equations 6–3 and 6–4 turns out to be exponential in the number of inputs M. Considering these limitations, I provide alternative expressions for computing these probabilities efficiently for sample independent distance metrics, viz. the Manhattan distance Krause [1987], the Chebyshev distance Abello et al. [2002] and the Hamming distance. The number of terms in the new characterization I propose is linear in M for P_{Z(N)}[ζ(x)=y] and quadratic in M for P_{Z(N)×Z(N)}[ζ(x)=y ∧ ζ'(x')=y'].
The characterization just presented computes the probability of classifying an input into a particular class separately for each possible set of kNN. What if, instead, I combine disjoint sets of these probabilities into groups and compute a single probability for each group? This would reduce the number of terms to be computed, thus speeding up the computation of the moments. To accomplish this, I use the fact that the distance between inputs is independent of the sample. A consequence of this independence is that all pairwise distances between the inputs are known prior to the computation of the probabilities. This makes it possible to obtain, for any given input, an ordering of the inputs from closest to farthest. For example, if I have inputs a_1b_1, a_1b_2, a_2b_1 and a_2b_2, then given input a_1b_1, I know that a_2b_2 is the farthest from the given input, followed by a_1b_2 and a_2b_1, which are equidistant, and a_1b_1 is the closest in terms of Hamming distance.
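A small sketch of this sample-independent ordering (again my own illustration): inputs are grouped into "shells" of equal Hamming distance from the given input, closest shell first, which is exactly the ordering exploited by the grouping scheme described next.

from collections import defaultdict

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def distance_shells(x, inputs):
    """Group candidate inputs by Hamming distance from x, closest shell first."""
    shells = defaultdict(list)
    for xi in inputs:
        shells[hamming(x, xi)].append(xi)
    return [shells[d] for d in sorted(shells)]

# With inputs encoded as tuples, e.g. ('a1', 'b1'), and given input ('a1', 'b1'), this yields
# [[('a1', 'b1')], [('a1', 'b2'), ('a2', 'b1')], [('a2', 'b2')]], matching the ordering above.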
Before presenting a full-fledged characterization for computing the two probabilities, I
explain the basic grouping scheme that I employ with the help of an example.
Example 2. W.l.o.g. let x_1 be the given input, for which I want to find P_{Z(N)}[ζ(x_1)=C_1]. Let x_1, x_2, ..., x_M be inputs arranged in increasing order of distance from left to right, as shown in Figure 6-2. In this case, the number of terms I need to compute P_{Z(N)}[ζ(x_1)=C_1] is M. The first term calculates the probability of classifying x_1 into C_1 when the kNN are multiple instances of x_1 (i.e., ∑_{j=1}^{v} N_{1j} ≥ k). Thus, the first group contains only the set {x_1}. The second term calculates the probability of classifying x_1 into C_1 when the kNN are multiple instances of x_2 or of {x_1, x_2}. The second group thus contains the sets {x_2} and {x_1, x_2} as the possible kNN of x_1. Proceeding in this manner, I eventually have M terms and consequently M groups. The M-th group contains the sets in which x_M is an element of every set and the other elements, across the different sets, range over all possible combinations of the remaining M − 1 inputs. Notice that this grouping scheme covers all possible kNN, as in the general case stated previously, i.e., ∪_{i=1}^{M} g_i = 2^S − φ, where g_i denotes the i-th group, S = {x_1, x_2, ..., x_M} and φ is the empty set; moreover, any two groups are disjoint, i.e., g_i ∩ g_j = φ for all i, j ∈ {1, 2, ..., M}, i ≠ j, preventing multiple computations of the same probability. The r-th (r > 1) term in the expression for P_{Z(N)}[ζ(x_1)=C_1], given the contingency Table 6-1, is
P_{Z(N)}[ζ(x_1)=C_1, sets in g_r ∈ kNN] = P_{Z(N)}[ ∑_{i=1}^{r} N_{i1} ≥ ∑_{i=1}^{r} N_{ij} ∀j ∈ {2, 3, ..., v},  ∑_{i=1}^{r} ∑_{l=1}^{v} N_{il} ≥ k,  ∑_{i=1}^{r−1} ∑_{l=1}^{v} N_{il} < k ]     (6–5)
where the last two conditions force only sets in g_r to be among the kNN. The first condition ensures that C_1 is the most numerous class among the given kNN. For r = 1 the last condition becomes invalid and unnecessary, hence it is removed. The probability for the second moment is the sum of probabilities that are calculated for two inputs rather than one, and over two groups, one for each input. W.l.o.g. assume that x_1 and x_2 are two inputs with x_1 being the closest input to x_2, that sets in g_r ∈ kNN^{(1)}, i.e., the kNN for input x_1, and sets in g_s ∈ kNN^{(2)}, i.e., the kNN for input x_2, where r, s ∈ {2, 3, ..., M}; then the rs-th term in P_{Z(N)×Z(N)}[ζ(x_1)=C_1, ζ(x_2)=C_2] is,
P_{Z(N)×Z(N)}[ζ(x_1)=C_1, sets in g_r ∈ kNN^{(1)}, ζ(x_2)=C_2, sets in g_s ∈ kNN^{(2)}] = P_{Z(N)×Z(N)}[ ∑_{i=1}^{r} N_{i1} ≥ ∑_{i=1}^{r} N_{ij} ∀j ∈ {2, 3, ..., v},  ∑_{i=1}^{r} ∑_{l=1}^{v} N_{il} ≥ k,  ∑_{i=1}^{r−1} ∑_{l=1}^{v} N_{il} < k,  ∑_{i=1}^{s} N_{i2} ≥ ∑_{i=1}^{s} N_{ij} ∀j ∈ {1, 3, ..., v},  ∑_{i=1}^{s} ∑_{l=1}^{v} N_{il} ≥ k,  ∑_{i=1}^{s−1} ∑_{l=1}^{v} N_{il} < k ]     (6–6)
In this case, when r = 1, remove the ∑_{i=1}^{r−1} ∑_{l=1}^{v} N_{il} < k condition from the above probability; if s = 1, remove the ∑_{i=1}^{s−1} ∑_{l=1}^{v} N_{il} < k condition.
In the general case, there may be multiple inputs that lie at a particular distance from any given input; i.e., the concentric circles in Figure 6-2 may contain more than one input. To accommodate this case, I extend the grouping scheme previously outlined. Previously, the group g_r contained all possible sets formed by the r − 1 distinct closest inputs to a given input, with the r-th closest input being present in every set. Realize that the r-th closest input is not necessarily the r-th NN, since there may be multiple copies of any of the r − 1 closest inputs. In the modified definition, the group g_r contains all possible sets formed by the r − 1 closest inputs, with at least one of the r-th closest inputs being present in every set. I illustrate this with an example. Say I have inputs a_1b_1, a_1b_2, a_2b_1 and a_2b_2; then given input a_1b_1, I know that a_2b_2 is the farthest from the given input, followed by a_1b_2 and a_2b_1, which are equidistant, and a_1b_1 is the closest in terms of Hamming distance. The group g_1 contains only {a_1b_1} as before. The group g_2 in this case contains the sets {a_1b_2}, {a_2b_1}, {a_1b_2, a_2b_1}, {a_1b_2, a_1b_1} and {a_2b_1, a_1b_1}. Observe that each set has at least one of the two inputs a_1b_2, a_2b_1. I now characterize the probabilities in equations 6–5 and 6–6 for this general case. Let q_r denote the set containing the inputs from the closest to the r-th closest, to some input x_i. The function c(·, ·) has the same connotation as before. With this, the r-th term in P_{Z(N)}[ζ(x_i)=C_j], where r ∈ {2, 3, ..., G} and G ≤ M is the number of groups, is
P_{Z(N)}[ζ(x_i)=C_j, sets in g_r ∈ kNN] = P_{Z(N)}[ c(q_r, j) > c(q_r, l) ∀l ∈ {1, 2, ..., v}, l ≠ j,  ∑_{t=1}^{v} c(q_r, t) ≥ k,  ∑_{t=1}^{v} c(q_{r−1}, t) < k ]     (6–7)
where the last condition is removed for r = 1. Similarly, the rs-th term in P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w], where r, s ∈ {2, 3, ..., G}, is
P_{Z(N)×Z(N)}[ζ(x_i)=C_j, sets in g_r ∈ kNN^{(i)}, ζ(x_p)=C_w, sets in g_s ∈ kNN^{(p)}] = P_{Z(N)×Z(N)}[ c(q_r, j) > c(q_r, l) ∀l ∈ {1, 2, ..., v}, l ≠ j,  ∑_{t=1}^{v} c(q_r, t) ≥ k,  ∑_{t=1}^{v} c(q_{r−1}, t) < k,  c(q_s, w) > c(q_s, l) ∀l ∈ {1, 2, ..., v}, l ≠ w,  ∑_{t=1}^{v} c(q_s, t) ≥ k,  ∑_{t=1}^{v} c(q_{s−1}, t) < k ]     (6–8)
where ∑_{t=1}^{v} c(q_{r−1}, t) < k and ∑_{t=1}^{v} c(q_{s−1}, t) < k are removed when r = 1 and s = 1 respectively. From equation 6–7, P_{Z(N)}[ζ(x_i)=C_j] is given by

P_{Z(N)}[ζ(x_i)=C_j] = ∑_{r=1}^{G} T_r     (6–9)

where T_r is the r-th term in P_{Z(N)}[ζ(x_i)=C_j]. From equation 6–8, P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w] is given by
P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w] = ∑_{r,s=1}^{G} T_{rs}     (6–10)

where T_{rs} is the rs-th term in P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w].
With this grouping scheme I have been able to reduce the number of terms in the calculation of P_{Z(N)}[ζ(x_i)=C_j] and P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w] from exponential in M (the number of distinct inputs) to manageable proportions of Ω(M) terms for the first probability and Ω(M^2) terms for the second probability. Moreover, I have accomplished this without compromising on accuracy.
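The thesis computes the terms T_r exactly from the joint distribution; purely as an illustration of what each term measures, the following Monte Carlo sketch (my own, assuming a multinomial joint distribution over the M×v cells of Table 6-1) estimates the r-th term of equation 6–7 by sampling contingency tables and checking its three conditions.

import numpy as np

def term_prob_mc(p, N, shells, r, j, k, trials=20000, seed=0):
    """Monte Carlo estimate of the r-th term in Equation 6-7.
    p      -- M x v array of cell probabilities (rows: inputs, columns: classes)
    N      -- sample size
    shells -- list of lists of row indices, closest shell first (q_r = union of first r shells)
    r, j   -- 1-based shell index and 0-based target class index
    k      -- number of nearest neighbors
    """
    rng = np.random.default_rng(seed)
    M, v = p.shape
    q_r = [i for shell in shells[:r] for i in shell]
    q_r1 = [i for shell in shells[:r - 1] for i in shell]
    hits = 0
    for _ in range(trials):
        counts = rng.multinomial(N, p.ravel()).reshape(M, v)   # one sampled contingency table
        c_r = counts[q_r].sum(axis=0)                          # c(q_r, t) for every class t
        c_r1_total = counts[q_r1].sum() if q_r1 else 0         # count of the first r-1 shells
        majority = all(c_r[j] > c_r[t] for t in range(v) if t != j)
        if majority and c_r.sum() >= k and c_r1_total < k:
            hits += 1
    return hits / trials

For r = 1 the last condition is vacuously satisfied, matching the remark above that it is removed in that case.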
6.5 Scalability Issues
In the previous section I provided the generic characterization and the time efficient
characterization for sample independent distance metrics, relating the two probabilities
required for the computation of the first and second moments, to probabilities that can
be computed using the joint distribution. In this section I discuss approximation schemes
that may be carried out to further speed up the computation. There are two factors on which the time complexity of calculating P_{Z(N)}[ζ(x_i)=C_j] and P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w] depends:
1. the number of terms (or smaller probabilities) that sum up to the above probabilities,
2. the time complexity of each term.
Reduction in number of terms: In the previous section I reduced the number
of terms to a small polynomial in M for a class of distance metrics. The current
enhancement I propose further reduces the number of terms and works even for the general case, at the expense of accuracy, which I can control. The r-th term in the characterizations has the condition that the count of the closest r − 1 distinct inputs is less than k. The probability of this condition being true monotonically decreases with increasing r. After a point, this probability may become "small enough" that the total contribution of the remaining terms in the sum is not worth computing, given the additional computational cost. I can set a threshold: once the probability of this condition falls below it, I avoid computing the terms that follow.
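A sketch of this thresholding idea, with term(r) and prob_fewer_than_k(r) as hypothetical stand-ins for the per-term computation and for the probability that the closest r − 1 distinct inputs contribute fewer than k points:

def truncated_probability(term, prob_fewer_than_k, G, threshold=1e-4):
    """Accumulate the terms of Equation 6-9, skipping the tail once the probability
    that the first r-1 shells hold fewer than k points drops below the threshold."""
    total = 0.0
    for r in range(1, G + 1):
        if r > 1 and prob_fewer_than_k(r) < threshold:
            break                       # remaining terms contribute negligibly
        total += term(r)
    return total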
Reduction in term computation: Each of the terms can be computed directly
from the underlying joint distribution. Different tricks can be employed to speed up the
computation, such as collapsing cells of the table, but even then the complexity is still a small polynomial in N. For example, using a multinomial joint distribution, the
time complexity of calculating a term for the probability of the first moment is quartic
in N and for the probability of the second moment it is octic in N . This problem can
be addressed by using the approximation techniques proposed in Dhurandhar and Dobra
[2009]. Using techniques such as optimization, I can find tight lower and upper bounds for
the terms in essentially constant time.
Parallel computation: Note that each of the terms is self-contained and does not depend on the others. This fact can be used to compute the terms in parallel, eventually merging them to produce the result. This further reduces the computation time.
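Since the terms are independent, a minimal sketch of the parallel evaluation (assuming a picklable term function, as in the previous sketch) is straightforward:

from concurrent.futures import ProcessPoolExecutor

def probability_parallel(term, G, workers=4):
    """Evaluate the independent terms T_1, ..., T_G in parallel and sum them."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(term, range(1, G + 1)))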
With this I have not only proposed analytical expressions for the moments of GE for
the kNN classification model applied to categorical attributes, but have also suggested
efficient methods of computing them.
6.6 Experiments
In this section I portray the manner in which the characterizations can be used
to study the kNN algorithm in conjunction with the model selection measures (viz.
cross-validation). Generic relationships between the moments of GE and moments of CE
(cross-validation error) that are not algorithm specific are given in Dhurandhar and Dobra
[2009]. I use the expressions provided in this thesis and these relationships to conduct the
experiments described below. The main objective of the experiments I report is to provide a flavor of the utility of the expressions as a tool for studying this learning method.
6.6.1 General Setup
I conduct four main studies. The first three studies are on synthetic data and the fourth on two real UCI datasets. In the first study, I observe the performance of the kNN algorithm for different values of k. In the second study, I observe the convergence behavior of the algorithm with increasing sample size. In the third study, I observe the relative performance of cross-validation in estimating the GE for different values of k. In these three studies I vary the correlation (measured using Chi-square Connor-Linton [2003]) between the attributes and the class labels to see the effect it has on the performance of the algorithm. In the fourth study, I choose two UCI datasets and compare the estimates of cross-validation with the true error estimates. I also explain how a multinomial distribution can be built over these datasets. The same idea can be used to build a multinomial over any discrete dataset to represent it precisely.
Setup for studies 1-3: I set the dimensionality of the space to 8. The number of classes is fixed to two, with each attribute taking two values. This gives rise to a multinomial with 2^9 = 512 cells. If I fix the probability of observing a datapoint in cell i to be p_i, such that ∑_{i=1}^{512} p_i = 1, and the sample size to N, I then have a completely specified multinomial distribution with parameters N and the set of cell probabilities {p_1, p_2, ..., p_{512}}. The distance metric I use is the Hamming distance and the class prior is 0.5.
Setup for study 4: In the case of real data I choose two UCI datasets whose attributes are not limited to binary splits. Each dataset can be represented in the form of a contingency table where each cell contains the number of copies of the corresponding input belonging to a particular class. These cell counts divided by the dataset size provide empirical estimates of the individual cell probabilities (p_i). Thus, with the knowledge of N (the dataset size) and the individual p_i, I have a multinomial distribution whose representative sample is the particular dataset. Using this distribution I observe the estimates of the true error (i.e., moments of GE) and the estimates given by cross-validation for different values of k. Notice that these estimates are also applicable (with high probability) to other datasets that are similar to the original.
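A minimal sketch of this construction (my own illustration; the dataset is assumed to be an iterable of (input, class) pairs):

from collections import Counter

def empirical_multinomial(dataset):
    """Return the dataset size N and the empirical cell probabilities p_i,
    one per distinct (input_tuple, class_label) cell of the contingency table."""
    dataset = list(dataset)
    N = len(dataset)
    counts = Counter(dataset)
    probs = {cell: n / N for cell, n in counts.items()}
    return N, probs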
A detailed explanation of these four studies is given below. The expressions in equations 6–9 and 6–10 are used to produce the plots.
6.6.2 Study 1: Performance of the KNN Algorithm for Different Values of k.
In the first study I observe the behavior of the GE of the kNN algorithm for different
values of k and for a sample size of 1000.
In Figure 6-3a the attributes and the class labels are totally correlated (i.e. correlation
= 1). I observe that for a large range of values of k (from small to large) the error is zero.
This is expected since any input lies only in a single class with the probability of lying in
the other class being zero.
In Figure 6-3b I reduce the correlation between the attributes and class labels from
being totally correlated to a correlation of 0.5. I observe that for low values of k the error
is high, it then plummets to about 0.14 and increases again for large values of k. The high
error for low values of k is because the variance of GE is large for these low values. The
reason for the variance being large is that the number of points used to classify a given
input is relatively small. As the value of k increases this effect reduces up to a stage and
then remains constant. This produces the middle portion of the graph where the GE is
the smallest. In the right portion of the graph i.e. at very high values of k, almost the
entire sample is used to classify any given input. This procedure is effectively equivalent
to classifying inputs based on class priors. In the general setup I mentioned that I set the
priors to 0.5, which results in the high errors.
In Figure 6-3c I reduce the correlation still further down to 0 i.e. the attributes and
the class labels are uncorrelated. Here I observe that the error is initially high, then
reduces and remains unchanged. As before the initial upsurge is due to the fact that the
variance for low values of k is high, which later settles down.
From the three figures, Figure 6-3a, Figure 6-3b and Figure 6-3c I observe a gradual
increase in GE as the correlation reduces. The values of k that give low error for the three
values of correlation and a sample size of 1000 can be deciphered from the corresponding
figures. In Figure 6-3a, I notice that small, mid-range and large values of k are all
acceptable. In Figure 6-3b I find that mid-range values (200 to 500) of k are desirable.
In the third figure, i.e. Figure 6-3c I discover that mid-range and large values of k produce
low error.
6.6.3 Study 2: Convergence of the KNN Algorithm with Increasing Sample Size.
In the second study I observe the convergence characteristics of the GE of the kNN
algorithm for different values of k, and with increasing sample size going from 1000 to
100000.
In Figure 6-4a the attributes and class labels are completely correlated. The error
remains zero for small, medium and large values of k irrespective of the sample size. In
this case any value of k is suitable.
In Figure 6-4b the correlation between the attributes and the class labels is 0.5. For
small sample sizes (less than and close to 1000), large and small values of k result in high error
while moderate values of k have low error throughout. The initial high error for low values
of k is because the variance of the estimates is high. The reason for high error at large
values of k is because it is equivalent to classifying inputs based on priors and the prior
is 0.5. At moderate values of k both these effects are diminished and hence the error
produced by them is low. From the figure I see that after around 1500 the errors of the
low and high k converge to the error of moderate k. Thus here a k within the range 200 to
0.5N would be appropriate.
In Figure 6-4c the attributes and the class labels are uncorrelated. The initial high
error for low k is again because of the high variance. Since the attributes and class labels
are uncorrelated with a given prior, the error is 0.5 for moderate as well as high values
of k. Here large values of k do not have higher error than the mid-range values since the
prior is 0.5. The low value of k converges to the errors of the comparatively larger values
at around a sample size of 1500.
Here too, from the three figures, Figure 6-4a, Figure 6-4b and Figure 6-4c, I observe a gradual increase in GE as the correlation reduces. At sample sizes greater than about 1500, large, medium and small values of k all perform equally well.
6.6.4 Study 3: Relative Performance of 10-fold Cross Validation on Synthetic Data.
In the third and final study on synthetic data I observe the performance of 10-fold
Cross validation in estimating the GE for different values of k and sample sizes of 1000
and 10000. The plots for the moments of cross validation error (CE) are produced using
the expressions I derived and the relationships between the moments of GE and the
moments of CE for deterministic classification algorithms given in Dhurandhar and Dobra
[2009].
In Figure 6-5a the correlation is 1 and the sample size is 1000. Cross validation
exactly estimates the GE which is zero irrespective of the value of k. When I increase the
sample size to 10000, as shown in Figure 6-6a, cross validation still does a pretty good job in estimating the actual error (i.e., GE) of kNN.
In Figure 6-5b the correlation is set to 0.5 and the sample size is 1000. I observe that
cross validation initially, i.e. for low values of k underestimates the actual error, performs
well for moderate values of k and grossly overestimates the actual error for large values of
k. At low values of k the actual error is high because of the high variance, which I have
previously discussed. Hence, even though the expected values of GE and CE are close, the variances are far apart, since the variance of CE is low. This leads to the optimistic
estimate made by cross validation. At moderate values of k the variance of GE is reduced
and hence cross validation produces an accurate estimate. When k takes large values most
of the sample is used to classify an input, which is equivalent to classification based on
priors. The effect of this is more pronounced in the case of CE than GE, since a higher
percentage of the training sample ((9/10)N) is used for the classification of an input for a fixed k than when computing GE. Due to this, CE rises more steeply than GE. When I
increase the sample size to 10000, as is depicted in Figure 6-6b, the poor estimate at low
values of k that I saw for a smaller sample size of 1000 vanishes. The reason for this is
that the variance of GE reduces with the increase in sample size. Even for moderate values
of k the performance of cross validation improves though the difference in accuracy of
estimation is not as vivid as in the previous case. For large values of k though the error in
estimation is somewhat reduced it is still noticeable. It is advisable that in the scenario
presented I should use moderate values of k, ranging from about 200 to 0.5N, to achieve a reasonable amount of accuracy in the prediction made by cross-validation.
In Figure 6-5c the attributes are uncorrelated to the class labels and the sample size
is 1000. For low values of k the variance of GE is high while the variance of CE is low
and hence, the estimate of cross validation is off. For medium and large values of k, cross
validation estimates the GE accurately, which has the same reason mentioned above. On
increasing the sample size to 10000, shown in Figure 6-6c the variance of GE for low values
of k reduces and cross validation estimates the GE with high precision. In general, the
GE for any value of k will be estimated accurately by cross validation in this case, but for
lower sample sizes (below and around 1000) the estimates are accurate for moderate and
large values of k.
6.6.5 Study 4: Relative Performance of 10-fold Cross Validation on Real Datasets.
In the fourth and final study I observe the behavior of the true error (E[GE] + Std(GE)) and the error estimated by cross-validation on two UCI datasets. On the Balloon dataset in Figure 6-7, I observe that cross-validation estimates the true error accurately for a k value of 2. When k is increased to 5, the cross-validation estimate becomes pessimistic. This is because of the increase in the variance of CE. I also observe that the true error is lower for k equal to 2. The reason for this is that the expected error is much lower in this case than for k equal to 5, even though the variance for k equal to 2 is comparatively higher. For the dataset on the right in Figure 6-7, cross-validation does a good job for both the small and the larger value of k. The true error in this case is lower for the higher k, since the expectations for both values of k are roughly the same but the variance for the smaller k is larger. This is mainly due to the high covariance between successive runs of cross-validation.
6.7 Discussion
From the previous section I see that the expressions for the moments assist in
providing highly detailed explanations of the observed behavior. Mid-range values of k
were the best in studies 1 and 2 for small sample sizes. The reason for this is that at small values of k the prediction was based on individual cells and, with a small sample size, the estimates were unstable, producing a large variance. For high values of k
the classification was essentially based on class priors and hence the expected error was
high, even-though the variance in this case was low. In the case of mid-range values of
k, the pitfalls of the extreme values of k were circumvented (since k was large enough
to reduce variance but small enough so as to prevent classification based on priors) and
hence the performance was superior. 10-fold cross-validation, which is considered the "holy grail" of error estimation, is not always ideal, as I have seen in the experiments. The most common reason why cross-validation underperformed in certain specific cases was that its variance was high, which in turn was due to the covariance between successive runs of cross-validation being high. The ability to make such subtle observations and provide meticulous explanations for them is the key strength of the deployed methodology: developing and using the expressions.
Another important aspect is that, in the experiments, I built a single distribution on
each test dataset to observe the best value of k. Considering the fact that data can be
noisy I can build multiple distributions with small perturbations in parameters (depending
on the level of noise) and observe the performance of the algorithm for different values of k
using the expressions. Then I can choose a robust value of k for which the estimates of the
error are acceptable on most (or all) built distributions. Notice that this value of k may
not be the best choice, on the distribution built without perturbations. I can thus use the
expressions to make these types of informed decisions.
As I can see, by building expressions for the moments of GE in the manner portrayed,
classification models in conjunction with popular model selection measures can be studied
in detail. The expressions can act as a guiding tool in making the appropriate choice of
model and model selection measure in desired scenarios. For example, in the experiments
I observed that 10-fold cross-validation did not perform well in certain cases. In these
cases I can use the expressions to study cross-validation with different number of folds
and attempt to find the ideal number of folds for our specific situation. Moreover, such
characterizations can aid in finding answers to, or challenging the appropriateness of, questions such as: What value of v in v-fold cross-validation gives the best bias/variance trade-off? The appropriateness of some queries sometimes has to be challenged, since it may very well be the case that no single value of v is truly optimal. In fact, depending on the situation, different values of v, or maybe even other model selection measures (viz. hold-out set etc.), may be optimal. Analyzing such situations and finding the appropriate values of the parameters (i.e., v for cross-validation, or f, the hold-out fraction, for hold-out-set validation) can be accomplished using the methodology I have deployed in the thesis. Sometimes it is intuitive to anticipate the behavior of a learning algorithm in extreme cases, but the behavior in non-extreme cases is not as intuitive. Moreover, determining the precise point at which the behavior of an algorithm starts to emulate a particular extreme case is a non-trivial task. The methodology can be used to study such cases and potentially a wide range of
other relevant questions. Essentially, the studies 1 and 2 in the experimental section are
examples of such studies. In those experiments, at extreme correlations the behavior is
more or less predictable but at intermediate correlations it is not.
What the studies in the experimental section and the discussion above suggest is
that the method in Dhurandhar and Dobra [2009] and developments such as the ones
introduced in this thesis open new avenues for studying learning methods, allowing them to be assessed for their robustness and appropriateness for a specific task, with lucid elucidations given for their behavior. These studies do not replace but complement purely
theoretical and empirical studies usually carried out when evaluating learning methods.
6.8 Possible Extensions
I discussed the importance of the methodology in the previous section. Below, I
touch upon ways of extending the analysis provided in this thesis. An interesting line
of future research would be to efficiently characterize the sample dependent distance
metrics. Another interesting line would be to extend the analysis to the continuous kNN
classification algorithm. A possible way of doing this would be to consider a set of k
points that would be kNN to a given input (recollect that to characterize the moments,
I only need to characterize the behavior of the algorithm on individual inputs e.g.
PZ(N) [ζ(xi)=Cj]) and consider the remaining N-k points to lie outside the smallest ball
encompassing the kNN. Under these conditions I would integrate the density defined on
the input/output space over all possible such N (i.e. k which are kNN and the remaining
N-k) with the appropriate condition for class majority (i.e. to classify an input in Ci,
I would have the condition that, at least bkvc + 1 points that are kNN lie in class Ci).
A rigorous analysis using ideas from this thesis would have to be performed and the
complexity discussed for the continuous kNN. I plan to address these issues in the future.
6.9 Take-aways
I provided a general characterization for the moments of GE of the kNN algorithm
applied to categorical data. In particular, I developed an efficient characterization for
the moments when the distance metric was sample independent. I discussed issues
related to scalability in using the expressions and suggested optimizations to speedup the
computation. I later portrayed the usage of the expressions and hence the methodology
with the help of empirical studies. It remains to be seen how extensible such an analysis
is to other learning algorithms. However, if such an analysis is in fact possible, it can be deployed as a tool for better understanding the statistical behavior of learning models in the non-asymptotic regime.
Table 6-1. Contingency table with v classes, M input vectors and total sample size N = ∑_{i=1}^{M} ∑_{j=1}^{v} N_{ij}.

X     C1     C2     ...   Cv
x1    N11    N12    ...   N1v
x2    N21    N22    ...   N2v
...   ...    ...    ...   ...
xM    NM1    NM2    ...   NMv
Figure 6-1. b, c and d are the 3 nearest neighbours of a.
Figure 6-2. The figure shows the extent to which a point x_i is near to x_1. The radius of the smallest encompassing circle for a point x_i is proportional to its distance from x_1. x_1 is the closest point and x_M is the farthest.
Figure 6-3. Behavior of the GE for different values of k with sample size N = 1000 and the correlation between the attributes and class labels being 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation.
Figure 6-4. Convergence of the GE for different values of k when the sample size (N) increases from 1000 to 100000 and the correlation between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation. In (b) and (c), after about N = 1500, large, mid-range and small values of k give the same error, depicted by the dashed line.
Figure 6-5. Comparison between the GE and the 10-fold cross validation error (CE) estimate for different values of k when the sample size (N) is 1000 and the correlation between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation.
Figure 6-6. Comparison between the GE and the 10-fold cross validation error (CE) estimate for different values of k when the sample size (N) is 10000 and the correlation between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation.
Figure 6-7. Comparison between true error (TE) and CE on 2 UCI datasets.
CHAPTER 7
INSIGHTS INTO CROSS-VALIDATION
A major portion of the research in Machine Learning is devoted to building
classification models. The error that these models make over the entire input, i.e., the expected loss over the entire input space, is defined as their Generalization Error (GE). Ideally, I would want to choose the model with the lowest GE. In practice, though, I do not have the entire input and consequently cannot compute the GE. Nonetheless, a number of methods have been proposed, namely hold-out set, the Akaike information criterion, the Bayesian information criterion, cross-validation etc., which aim at estimating the GE from the available input.
In this chapter I focus on cross-validation, which is arguably one of the most popular
GE estimation methods. In v-fold cross-validation the dataset of size N is divided into v
equal parts. The classification model is trained on v − 1 parts and tested on the remaining
part. This is performed v times and the average error over the v runs is considered as
an estimate of GE. This method is known to have low variance Kohavi [1995], Plutowski
[1996] for about 10-20 folds and is hence commonly used for small sample sizes.
Most of the experimental work on cross-validation focuses on reporting observations Kohavi [1995], Plutowski [1996] and not on understanding the reasons for the observed behavior. Moreover, modeling the covariance between individual runs of cross-validation is not a straightforward task and is hence not adequately studied, though it is considered to have a non-trivial impact on the behavior of cross-validation. The work presented in Bengio and Grandvalet [2003], Markatou et al. [2005] addresses issues related to covariance, but it is focused on building and studying the behavior of estimators for the overall variance of cross-validation. In Markatou et al. [2005] the estimators of the moments of cross-validation error (CE) are primarily studied for the mean estimation problem and in the regression setting. The goal of this chapter is quite different. I do not wish to build estimators for the moments of CE; rather, I want to experimentally observe the behavior of the moments of cross-validation and provide explanations for the observed
behavior in the classification setting. The classification models I run these experiments on
consist of the Naive Bayes Classifier (NBC), a parametric model, and two non-parametric models, namely the K-Nearest Neighbor Classifier (KNN) and Decision Trees (DT). I choose a mix of parametric and non-parametric models so that their hypothesis spaces are varied enough. Additionally, these models are widely used in practice. The moments, however, are computed not using Monte Carlo directly but using the expressions given in Dhurandhar and Dobra [2009, 2008, 2007]. The advantage of using these closed form expressions is that they are exact formulas (not approximations) for the moments of CE, and hence these moments can be studied accurately with respect to any chosen distribution. In fact, as it turns out, approximating certain probabilities in these expressions also leads to significantly higher accuracy in computing the moments when compared with directly using Monte Carlo. The reason for this is that the parameter space of the individual probabilities that need to be computed in these expressions is much smaller than the space over which the moments have to be computed at large, and hence directly using Monte Carlo to estimate the moments can prove highly inaccurate in many cases Dhurandhar and Dobra [2009, 2008]. Another advantage of using the closed form expressions is that they give me more control over the settings I wish to study.
In summary, the goal in this chapter is to empirically study the behavior of
the moments of CE (plotted using the expressions in the Appendix) and to provide
interesting explanations for the observed behavior. As I will see, when studying the
variance of CE, the covariance between the individual runs plays a decisive role and hence
understanding its behavior is critical in understanding the behavior of the total variance
and consequently the behavior of CE. I provide insights into the behavior of the covariance
apropos increasing sample size, increasing correlation between the data and the class labels
and increasing number of folds.
In the next section I review some basic definitions and previous results that are
relevant to the computation of the moments of CE. In Section 7.2 I provide an overview of the expressions customized to compute the moments of CE for the three classification algorithms, namely DT, NBC and KNN. In Section 7.3 I conduct a brief literature survey. In Section 7.4, the experimental section, I provide some keen insights into the behavior of cross-validation, which is the primary goal. I discuss the implications of the study conducted and summarize the major developments of the chapter in Section 7.5.
7.1 Preliminaries
Probability distributions completely characterize the statistical behavior of a random
variable. Moments of a random variable give us information about its probability
distribution. Thus, if I have knowledge of the moments of a random variable I can
make statements about its behavior. In some cases, characterizing a finite subset of moments may prove to be a more desirable alternative than characterizing the entire distribution, which can be computationally expensive. By employing moment analysis
and using linearity of expectation efficient generalized expressions for the moments of GE
and relationships between the moments of GE and the moments of CE were derived in
Dhurandhar and Dobra [2009]. In this section I review the relevant results which are used
in the present study of CE.
Consider that N points are drawn independently and identically (i.i.d.) from a
given distribution and a classification algorithm is trained over these points to produce a
classifier. If multiple such sets of N i.i.d. points are sampled and a classification algorithm
is trained on each of them I would obtain multiple classifiers. Each of these classifiers
would have its own GE, hence the GE is a random variable defined over the space of
classifiers which are induced by training a classification algorithm on each of the datasets
that are drawn from the given distribution. The moments of GE computed over this space
of all possible such datasets of size N , depend on three things: 1) the number of samples
N , 2) the particular classification algorithm and 3) the given underlying distribution. I
denote by D(N) the space of datasets of size N drawn from a given distribution. The
moments taken over this new distribution – the distribution over the space of datasets of a
particular size, are related to the moments taken over the original given distribution which
is over individual inputs in the following manner,
E_{D(N)}[F(ζ)] = E_{(X×Y)×(X×Y)×...×(X×Y)}[F(ζ)]
= ∑_{(x_1,y_1)∈X×Y} ∑_{(x_2,y_2)∈X×Y} ... ∑_{(x_N,y_N)∈X×Y} P[X_1=x_1 ∧ Y_1=y_1 ∧ ··· ∧ X_N=x_N ∧ Y_N=y_N] · F(ζ(x_1, ..., x_N))
= ∑_{(x_1,y_1)∈X×Y} ∑_{(x_2,y_2)∈X×Y} ... ∑_{(x_N,y_N)∈X×Y} ( ∏_{i=1}^{N} P[X_i=x_i ∧ Y_i=y_i] ) F(ζ(x_1, ..., x_N))
where F(.) is some function that operates on a classifier ζ and ζ is a classifier obtained
by training on a particular dataset belonging to D(N). X and Y denote the input and
output space respectively. X1, ..., XN denote a set of N i.i.d. random variables defined over
the input space and Y1, ..., YN denote a set of N i.i.d. random variables defined over the
output space. For simplicity of notation I denote the moments over the space of the new
distribution rather than over the space of the given distribution.
Notice that in the above formula, E_{D(N)}[F(ζ)] was expressed in terms of a product of probabilities, since the independence of the samples (x_1, y_1), ..., (x_N, y_N) allows for the factorization of P[X_1=x_1 ∧ Y_1=y_1 ∧ ··· ∧ X_N=x_N ∧ Y_N=y_N]. By instantiating the function F(·) with GE(·), I have a formula for computing the moments of GE. The
problem with this characterization is that it is highly inefficient (exponential in the
size of the input-output space). Efficient characterizations for computing the moments
were developed in Dhurandhar and Dobra [2009] which I will shortly review. The
characterization reduces the number of terms in the moments from an exponential in
the input-output space to linear for the computation of the first moment and quadratic for
the computation of the second moment.
I define the random variables of interest, namely hold-out error, cross-validation error and generalization error. I also define the moments of the necessary variables. In the notation used
to represent these moments I write the probabilistic space over which the moments are
taken as subscripts. If no subscript is present, the moments are taken over the original
input-output space. This convention is strictly followed in this particular section. In the
remaining sections I drop the subscript for the moments since the formulas can become
tedious to read and the space over which these moments are computed can be easily
deciphered from the context.
Hold-out Error (HE): The hold-out procedure involves randomly partitioning the dataset D into two parts: D_t, the training dataset of size N_t, and D_s, the test dataset of size N_s. A classifier is built over the training dataset and the error is estimated as the average error over the test dataset. Formally,

HE = (1/N_s) ∑_{(x,y)∈D_s} λ(ζ(x), Y(x))

where Y(x) ∈ Y is a random variable modeling the true class label of the input x, λ(·, ·) is a 0-1 loss function which is 0 when its two arguments are equal and 1 otherwise, and ζ is the classifier built on the training data D_t, with ζ(x) being its prediction on the input x.
Cross Validation Error (CE): v-fold cross validation consists of randomly partitioning the available data into v chunks, training v classifiers, each using all the data but one chunk, and testing the performance of each classifier on its held-out chunk. The estimate of the error of the classifier built from the entire data is the average error over the chunks. Denoting by HE_i the hold-out error on the i-th chunk of the dataset D, the cross-validation error is given by,

CE = (1/v) ∑_{i=1}^{v} HE_i
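The following sketch (my own, assuming a train_fn that returns a classifier exposing a predict method) mirrors these definitions: each fold's hold-out error uses the 0-1 loss, and CE is their average over the v folds.

import random

def cross_validation_error(data, train_fn, v=10, seed=0):
    """v-fold cross-validation error: CE = (1/v) * sum_i HE_i with 0-1 loss."""
    data = list(data)
    random.Random(seed).shuffle(data)
    chunks = [data[i::v] for i in range(v)]           # v roughly equal parts
    hold_out_errors = []
    for i in range(v):
        train = [pt for j, c in enumerate(chunks) if j != i for pt in c]
        zeta = train_fn(train)                        # classifier built on v-1 chunks
        test = chunks[i]
        he = sum(zeta.predict(x) != y for x, y in test) / len(test)
        hold_out_errors.append(he)
    return sum(hold_out_errors) / v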
Generalization Error (GE): The GE of a classifier ζ w.r.t. the underlying
distribution over the input-output space X × Y is given by,
GE(ζ) = E[λ(ζ(x), Y(x))] = P[ζ(x) ≠ Y(x)]
where x ∈ X is an input and Y (x) ∈ Y the true class label of the input x. It is thus the
expected error over the entire input.
Moments of GE: Given an underlying input-output distribution and a classification
algorithm, by generating N i.i.d. datapoints from the underlying distribution and training
the classification algorithm on these datapoints I obtain a classifier ζ. If I sample multiple
(in fact all possible) such sets of N datapoints from the given distribution and train the
classification algorithm on each of them, I induce a space of classifiers trained on a space
of datasets of size N denoted by D(N). Since the process of sampling produces a random
sample of N datapoints, the classifier induced by training the classification algorithm
on this sample is a random function. The generalization error w.r.t. the underlying
distribution of classifier ζ, denoted by GE(ζ) is also a random variable that can be studied
and whose moments are given by,
E_{D(N)}[GE(ζ)] = ∑_{x∈X} P[x] ∑_{y∈Y} P_{D(N)}[ζ(x)=y] P[Y(x)≠y]     (7–1)
E_{D(N)×D(N)}[GE(ζ)GE(ζ')] = ∑_{x∈X} ∑_{x'∈X} P[x] P[x'] · ∑_{y∈Y} ∑_{y'∈Y} P_{D(N)×D(N)}[ζ(x)=y ∧ ζ'(x')=y'] · P[Y(x)≠y] P[Y(x')≠y']     (7–2)
where X and Y denote the space of inputs and outputs respectively (if the input is continuous, the sums over X in the above formulas are replaced by integrals, everything else remaining the same). Y(x) represents the true output for a given input x (Y(·) may or may not be randomized). P[x] and P[x'] represent the probability of having a particular input. ζ' in equation 7–2 is a classifier like ζ (may be the same or different) induced
by the classification algorithm trained on a sample from the underlying distribution.
P_{D(N)}[ζ(x)=y] P[Y(x)≠y] represents the probability of error. The first probability in the product, P_{D(N)}[ζ(x)=y], depends on the classification algorithm and the data distribution that determines the training dataset. The second probability, P[Y(x)≠y], depends only on the underlying distribution. Also note that both these probabilities are actually conditioned on x, but I omit writing them explicitly as conditionals since this is obvious and it makes the formulas more readable. E_{D(N)}[·] denotes the expectation taken over all possible datasets of size N drawn from the data distribution. The terms in equation 7–2 have similar semantics but apply to pairs of inputs and outputs.
Thus, by being able to compute each of these probabilities I can compute the moments of
GE.
Moments of CE: The process of sampling a dataset (i.i.d.) of size N from a
probability distribution and then partitioning it randomly into two disjoint parts of size Nt
and Ns, is statistically equivalent to sampling two different datasets of size Nt and Ns i.i.d.
from the same probability distribution. The first moment of CE is just the expected error
of the individual runs of cross-validation. In the individual runs the dataset is partitioned
into disjoint training and test sets. Dt(Nt) and Ds(Ns) denote the space of training sets of
size Nt and test sets of size Ns respectively. Hence, the first moment of CE is taken w.r.t.
the Dt(Nt)×Ds(Ns) space which is equivalent to the space obtained by sampling datasets
of size N = Nt + Ns followed by randomly splitting them into training and test sets. In
1 If input is continuous I replace sum over X by integrals in the above formulas,everything else remaining same.
2 Y(.) may or may not be randomized.
138
the computation of the variance of CE I have to compute the covariance between any two runs of cross-validation (equation 7–3). In the covariance I have to compute the following cross moment: E_{D_t^{ij}(((v−2)/v)N) × D_s^i(N/v) × D_s^j(N/v)}[HE_i HE_j], where D_t^{ij}(k) is the space of overlapped training sets of size k in the i-th and j-th runs of cross-validation (i, j ≤ v and i ≠ j), D_s^f(k) is the space of all test sets of size k drawn from the data distribution in the f-th run of cross-validation (f ≤ v), D_t^f(k) is the space of all training sets of size k drawn from the data distribution in the f-th run of cross-validation (f ≤ v), and HE_f is the hold-out error of the classifier in the f-th run of cross-validation. Since the cross moment considers the interaction between two runs of cross-validation, it is taken over a space consisting of training and test sets involving both runs rather than just one. Hence, the subscript in the cross moment is a cross product of three spaces (the overlapped training sets between two runs and the corresponding test sets). The other moments in the variance of CE are taken over the same space as the expected value. The variance of CE is given by,
Var(CE) = (1/v^2) [ ∑_{i=1}^{v} Var(HE_i) + ∑_{i,j, i≠j}^{v} Cov(HE_i, HE_j) ]
= (1/v^2) [ ∑_{i=1}^{v} ( E_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i^2] − E^2_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i] )
  + ∑_{i,j, i≠j}^{v} ( E_{D_t^{ij}(((v−2)/v)N) × D_s^i(N/v) × D_s^j(N/v)}[HE_i HE_j] − E_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i] E_{D_t^j(((v−1)/v)N) × D_s^j(N/v)}[HE_j] ) ]
The reason I introduced the moments of GE previously is that in Dhurandhar and Dobra [2009] relationships were drawn between these moments and the moments of CE. Thus,
using the expressions for the moments of GE and the relationships which I will state
shortly, I have expressions for the moments of CE.
The relationship between the expected values of CE and GE is given by,
E_{D_t(((v−1)/v)N) × D_s(N/v)}[CE] = E_{D_t(((v−1)/v)N)}[GE(ζ)]
where v is the number of folds, Dt(k) is the space of all training sets of size k that are
drawn from the data distribution and Ds(k) is the space of all test sets of size k drawn
from the data distribution.
In the computation of the variance of CE I need to find the individual variances and the covariances. In Dhurandhar and Dobra [2009] it was shown that E_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i] = E_{D_t(((v−1)/v)N) × D_s(N/v)}[CE] for all i ∈ {1, 2, ..., v}, and hence the expectation of HE_i can be computed using the above relationship between the expected CE and the expected GE. Notice that the space of training and test datasets over which the moments are computed is the same for each fold (since the space depends only on the size and all the folds are of the same size) and hence the corresponding moments are also the same. To compute the remaining terms in the variance I use the following relationships.
The relationship between the second moment of HE_i, ∀i ∈ {1, 2, ..., v}, and the moments of GE is given by,

E_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i^2] = (v/N) E_{D_t(((v−1)/v)N)}[GE(ζ)] + ((N−v)/N) E_{D_t(((v−1)/v)N)}[GE(ζ)^2]
The relationship between the cross moment and the moments of GE is given by,
E_{D_t^{ij}(((v−2)/v)N) × D_s^i(N/v) × D_s^j(N/v)}[HE_i HE_j] = E_{D_t^{ij}(((v−2)/v)N)}[GE(ζ^i) GE(ζ^j)]

where ζ^f is the classifier built in the f-th run of cross-validation. All the terms in the variance can be computed using the above relationships.
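Putting these relationships together, the following sketch (my own; the three GE moments are assumed to have already been computed from the customized expressions) returns the expectation and variance of CE.

def ce_moments(e_ge, e_ge_sq, e_ge_cross, N, v):
    """Combine the relationships above.
    e_ge       -- E[GE(zeta)] for training size (v-1)N/v
    e_ge_sq    -- E[GE(zeta)^2] for the same training size
    e_ge_cross -- E[GE(zeta_i) GE(zeta_j)] over overlapped training sets of size (v-2)N/v
    """
    e_ce = e_ge                                         # E[CE] = E[GE]
    e_he = e_ce                                         # every fold has the same expectation
    e_he_sq = (v / N) * e_ge + ((N - v) / N) * e_ge_sq  # second moment of HE_i
    var_he = e_he_sq - e_he ** 2
    cov_he = e_ge_cross - e_he ** 2                     # Cov(HE_i, HE_j)
    var_ce = var_he / v + (v - 1) / v * cov_he          # Equation 7-3
    return e_ce, var_ce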
7.2 Overview of the Customized Expressions
In Section 7.1 I provided the generalized expressions for computing the moments
of GE and consequently moments of CE. In particular the moments I compute are:
E[CE] and V ar(CE). The formula for the variance of CE can be rewritten as a convex
combination of the variance of the individual runs and the covariance between any two
runs. Formally,
Var(CE) = (1/v) Var(HE) + ((v−1)/v) Cov(HE_i, HE_j)     (7–3)
where HE is the error of any individual run and HE_i and HE_j are the errors of the i-th and j-th runs. Since the moments are over all possible datasets, the variances and covariances are the same for all single runs and all pairs of runs respectively, hence the above formula.
In the expressions for the moments the probabilities P [ζ(x)=y] and P [ζ(x)=y ∧ ζ ′(x′)=y′]
are the only terms that depend on the particular classification algorithm. Customizing the
expressions for the moments equates to finding these probabilities. The other terms in the
expressions are straightforward to compute from the data distribution. I now give a high
level overview of how the two probabilities are customized for DT, NBC and KNN. The
precise details are in the previous chapters.
1. DT: To find the probability of classifying an input into a particular class (P[ζ(x)=y]), I have to sum the probabilities over all paths (a path is a set of nodes and edges from root to leaf in a tree) in the set of possible trees that include the input x and in which the majority of the datapoints lying in them belong to class y. These probabilities can be computed by fixing the split attribute selection method, the stopping criterion deployed to curb the growth of the tree, and the data distribution from which an input is drawn. The probability P[ζ(x)=y ∧ ζ'(x')=y'] can be computed similarly, by considering pairs of paths (one for each input) rather than a single path.
2. NBC: The NBC algorithm assumes that the attributes are independent of each other. An input is classified into a particular class if the product of the class conditionals of each attribute for the input, multiplied by the particular class prior, is greater than the same quantity computed for each of the other classes. To compute P[ζ(x)=y], the probability of the described quantity for input x belonging to class y being greater than the same quantity for each of the other classes is computed. The probability for the second moment can be computed analogously by considering pairs of inputs and outputs.
3. KNN: In KNN, to compute P[ζ(x)=y], the probability that the majority of the K nearest neighbors of x belong to class y is found. In the extreme case when K=N (the dataset size), the probability of the empirical prior of class y being greater than the empirical priors of the other classes is computed; in this case the majority classifier and KNN behave the same way. To compute P[ζ(x)=y ∧ ζ'(x')=y'], the probability that the majority of the K nearest neighbors of x belong to class y and the majority of the K nearest neighbors of x' belong to class y' is found.
By using the customized expressions I can accurately study the behavior of
cross-validation. This method is more accurate than directly using Monte-Carlo to
estimate the moments since the parameter space of the individual probabilities that
need to be estimated is much smaller than the entire space over which the moments are
computed.
7.3 Related Work
There is a large body of both experimental and theoretical work that addresses
the problem of studying cross validation. In Efron [1986] cross validation is studied in
the linear regression setting with squared loss and is shown to be biased upwards in
estimating the mean of the true error. More recently the same author in Efron [2004],
compared parametric model selection techniques namely, covariance penalties with the
non-parametric cross validation and showed that under appropriate modeling assumptions
the former is more efficient than cross validation. Breiman [1996] showed that cross
validation gives an unbiased estimate of the first moment of GE. Though cross-validation
has desirable characteristics for estimating the first moment, Breiman stated that its
variance can be significant. In Moore and Lee [1994] heuristics are proposed to speed up
cross-validation, which becomes an expensive procedure as the number of folds increases.
In Zhu and Rohwer [1996] a simple setting was constructed in which cross-validation
performed poorly. Goutte [1997] refuted this proposed setting and claimed that a realistic
scenario in which cross-validation fails is still an open question.
The major theoretical work on cross-validation is aimed at finding bounds. The
current distribution-free bounds for cross-validation Devroye et al. [1996], Kearns and Ron
[1997], Blum et al. [1999], Elisseeff and Pontil [2003], Vapnik [1998] are loose, and some of
them are applicable only in restricted settings, such as bounds that rely on algorithmic
stability assumptions. Thus, finding tight PAC (Probably Approximately Correct) style
bounds for the bias/variance of cross-validation for different values of v is still an open
question Guyon [2002].
Though bounds are useful in their own right, they do not aid in studying trends of
the random variable in question (unless they are extremely tight), in this case CE.
Asymptotic analysis can assist in studying trends with increasing sample size Stone [1977],
Shao [1993], but it is not clear when the asymptotics come into play. This is where
empirical studies are useful. Most empirical studies on cross-validation indicate that the
performance (bias+variance) is best around 10-20 folds Kohavi [1995], Breiman [1996],
while some others Schneider [1997] indicate that the performance improves with increasing
number of folds. In the experimental study that I conduct using the closed-form
expressions, I observe both of these trends and, in addition, provide clear explanations for
why such behavior is observed.
7.4 Experiments
In this section I study cross-validation by explaining its behavior in detail. Especially
noticeable is the behavior of the covariance of cross-validation, which plays a significant
role in the overall variance. The role of covariance is significant since, in the expression
for the overall variance,
\[ \mathrm{Var}(CE) = \frac{1}{v}\,\mathrm{Var}(HE) + \frac{v-1}{v}\,\mathrm{Cov}(HE_i, HE_j), \]
the weighting (v−1)/v of the covariance is always greater than the weighting of the
individual variances (except for v = 2, when the two are equal) and this weight increases
with the number of folds. In the studies I perform, I observe the behavior of E[CE]
(I drop the notation showing the space over which the moments are taken, for readability),
Var(HE) (individual variances), Cov(HE_i, HE_j) (covariances), Var(CE) (total variance)
and E²[CE] + Var(CE) (total mean squared error) with varying amounts of correlation
between the input attributes and the class labels, for the three classification algorithms.
The details regarding the data generation process are as follows.
I fix the number of classes to 2. The results are averaged over multiple dimensionalities
(3, 5, 8 and 10), with each attribute having multiple splits/values (i.e. attributes with
binary splits, attributes with ternary splits) for all three algorithms (i.e. NBC, DT and
KNN) and, additionally, multiple values of K (2, 5 and 8) for the KNN algorithm. Assume
d is the dimensionality of the space (i.e. the number of input attributes) and m is the
number of values of each attribute; then the total number of distinct datapoints (including
the class attribute) is c = 2m^d. I can represent this input-output space as a contingency
table with 2 columns, one for each class, and m^d rows, one for each distinct input. The
number of cells in this contingency table is thus c. If I fix the probability of observing a
datapoint in cell i to be p_i, such that \(\sum_{i=1}^{c} p_i = 1\), and the sample size to N, the
distribution that precisely models this scenario is a multinomial distribution with parameters
N and the set p_1, p_2, ..., p_c. This is our data generation model. In the studies that follow
the p_i are varied and the amount of dependence between the attributes and the class labels
is computed for each set of p_i using the Chi-square test Connor-Linton [2003]. More
precisely, I sum over all i the square of the difference between each p_i and the product of
its corresponding marginals, with each squared difference divided by this product, that is,
\[ \text{correlation} = \sum_i \frac{(p_i - p_{im})^2}{p_{im}}, \]
where p_{im} is the product of the marginals for the ith cell.
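As a concrete illustration of this data generation model, the following is a small Python sketch that stores the cell probabilities as an (m^d x 2) table, draws one multinomial dataset of size N from it, and computes the chi-square style correlation defined above. The helper names and the Dirichlet draw used to pick the p_i are assumptions made only for illustration.

import numpy as np

def correlation(p):
    # Chi-square style dependence between the inputs and the class label.
    # p is an (m**d, 2) array of cell probabilities summing to 1; returns
    # sum_i (p_i - p_im)^2 / p_im, p_im being the product of the marginals.
    row = p.sum(axis=1, keepdims=True)   # marginal probability of each input
    col = p.sum(axis=0, keepdims=True)   # marginal probability of each class
    p_im = row * col
    return float(((p - p_im) ** 2 / p_im).sum())

def sample_dataset(p, n, rng):
    # One dataset of size n from the multinomial with cell probabilities p.
    return rng.multinomial(n, p.ravel()).reshape(p.shape)

rng = np.random.default_rng(0)
m, d = 2, 3                                                # binary attributes, 3 dimensions
p = rng.dirichlet(np.ones(2 * m ** d)).reshape(m ** d, 2)  # c = 2 m^d cells
print("correlation:", correlation(p))
print("one dataset of size 100:")
print(sample_dataset(p, 100, rng))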
I initially set the dataset size to 100 and then increase it to 1000, to observe the
effect that increasing the dataset size has on the behavior of these moments. The reason I
choose these dataset sizes is that for dataset sizes below and around 100 the behavior
is qualitatively similar to that at 100, while for larger dataset sizes, above and around
1000, the behavior is qualitatively similar to that at 1000. Secondly, the shift in trends
in the behavior of cross-validation with increasing dataset size is clearly noticeable for
these dataset sizes. Moreover, cross-validation is primarily used when dataset sizes are
small, since it has low variance (though a high computational cost compared with, for
example, a hold-out set), and hence studying it under these circumstances seems sensible. The
experiments reveal certain interesting facts and provide insights that assist in better
understanding cross-validation.
Observations: I now explain the interesting trends that I observe through these
experiments. The insights are mainly linked to the behavior of the variance (in particular
the covariance) of CE. However, for the sake of completeness I also discuss the behavior
of the expected value of CE.
7.4.1 Variance
Figures 7-1 to 7-6 are plots of the variances of the individual runs of cross-validation.
Figures 7-7 to 7-12 depict the behavior of the covariance between any two runs of
cross-validation. Figures 7-13 to 7-18 showcase the behavior of the total variance of
cross-validation, which, as I have seen, is in fact a convex combination of the individual
variance and the pairwise covariance.
Linear behavior of Var(HE): In Figures 7-1 to 7-6 I see that the individual
variances increase practically linearly with the number of folds. This linear increase
occurs since the size of the test set decreases linearly (more precisely, the expected
test-set size decreases linearly) with the number of folds, and I know that CE is the
average error over the v runs, where the error of each run is the sum of the zero-one loss
function evaluated at each test point, normalized by the size of the test set. Since the
test points are i.i.d. (independent and identically distributed), so are the corresponding
zero-one loss functions, and from theory the variance of the average of T i.i.d. random
variables, each having variance σ² < ∞, is given by σ²/T.
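The following tiny simulation illustrates this scaling. It assumes, purely for illustration, that the zero-one losses are i.i.d. Bernoulli with a fixed error probability (ignoring the dependence of the induced classifier on the particular training fold), so that Var(HE) ≈ σ²/T ≈ σ²v/N grows roughly linearly in v.

import numpy as np

rng = np.random.default_rng(1)
N = 100            # dataset size
p_err = 0.3        # assumed probability that the classifier errs on a test point
sigma2 = p_err * (1 - p_err)

for v in (2, 5, 10, 25, 50):
    T = N // v     # (expected) test-set size of a single run
    # hold-out error of one run = mean of T i.i.d. zero-one losses
    he = rng.binomial(T, p_err, size=200_000) / T
    print(f"v={v:3d}  empirical Var(HE)={he.var():.5f}  sigma^2/T={sigma2 / T:.5f}")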
V-shaped behavior of Cov(HEi,HEj): In Figure 7-7 I observe that the covariance
first decreases as the number of folds increases from 2 up until around 10-20 folds and
then increases up until v = N (called leave-one-out (LOO) validation). This strange
behavior has the following explanation. At low folds, for example at v = 2, the test set for one run is the
training set for the other and vice-versa. The two partitions are negatively correlated,
since each datapoint is in exactly one of the two partitions and cannot be in both; in fact,
their covariance is −N/4. This implies that the test sets for the two runs are negatively
correlated. However, the two partitions are also the training sets for the two runs, and
hence the classifiers induced by them are also negatively correlated in terms of their
predictions on individual datapoints. Since the test sets as well as the classifiers are
negatively correlated, the errors of these classifiers are positively correlated. In other
words, the classifiers make roughly the same number of mistakes on their respective test
sets. The reason is that if the two partitions are similar (say both are representative
samples of the distribution) then the classifiers are similar and so are the test sets, and
hence both errors are low. When the two samples are highly dissimilar (say one is
representative and the other is not), the classifiers built from them are dissimilar and so
are the test sets; hence the errors that both classifiers make on their test sets (each of
which is the training set of the other classifier) are high. Thus, irrespective of the level of
similarity between the two partitions, the correlation between the errors of the classifiers
induced by them is high.
As the number of folds increases this effect diminishes, as the classifiers become more
and more similar due to the overlap of the training sets, while the test sets become
increasingly unstable as they become smaller. The increase in covariance at high folds in
Figure 7-7 is due to the case where LOO fails: if I have a majority classifier and the
dataset contains an equal number of datapoints from each class, then LOO estimates
100% error. Since each run produces this error, the errors of any two runs are highly
correlated. This effect weakens as the number of folds decreases. The classification
algorithms I have chosen classify based on majority in their final inference step, i.e.
locally they classify datapoints based on majority. At low data correlation, as in Figure
7-7, the probability of having an equal number of datapoints from each class for each
input is high, and hence the covariance between the errors of two different runs is high.
Thus, at high folds this effect is predominant and it increases the covariance.
Consequently, I obtain the V-shaped graphs in Figure 7-7, which are a combination of the
first effect and this second effect.
L-shaped behavior of Cov(HEi,HEj): As I increase the correlation between the
input attributes and the class labels, seen in Figures 7-8 and 7-9, the initial effect which
raises the covariance is still dominant, but the latter effect (an equal number of datapoints
from each class for each input) has extremely low probability and is not significant enough
to increase the covariance at high folds. As a result, the covariance drops with increasing v.
On increasing the dataset size the covariance does not increase as much (in fact it
reduces in some cases) at high folds, as in Figures 7-10, 7-11 and 7-12. In Figure 7-10,
though the correlation between the input attributes and the class labels is low, the
probability of having an equal number of datapoints from each class is low since the
dataset size has increased. For a given set of parameters, the probability of a particular
outcome of the counts is essentially reduced (never increased) as N increases, since the
original probability mass now has to be distributed over a larger set of possible outcomes.
Hence, the covariance in Figure 7-10 drops as the number of folds increases. The behavior
observed in Figures 7-11 and 7-12 has the same explanation as that for Figures 7-8 and
7-9 described before.
Finally, the covariance has a V-shape for low data correlations and low dataset sizes
where the classification algorithms classify based on majority at least at some local level.
In the other cases the covariance is high initially and then reduces with increasing number
of folds.
Behavior of Var(CE) similar to covariance: Figures 7-13 to 7-18 represent the
behavior of the total variance of CE. I know that the total variance is given by
\[ \mathrm{Var}(CE) = \frac{1}{v}\,\mathrm{Var}(HE) + \frac{v-1}{v}\,\mathrm{Cov}(HE_i, HE_j). \]
I have also seen from Figures 7-1 to 7-6 that Var(HE) varies almost linearly with respect
to v; in other words, (1/v)Var(HE) is practically a constant as v changes. Hence, the
trends seen for the total variance are similar to those observed for the pairwise
covariances, with an added shift. Thus, at low correlations and low sample sizes the
variance is at its minimum not at the extremes but somewhere in between, say around
10-20 folds, as seen in Figure 7-13. In the other cases, the total variance reduces with
increasing number of folds.
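The following small sketch combines the per-run variance and the pairwise covariance exactly as in the expression above and also forms E²[CE] + Var(CE), the quantity studied in Section 7.4.3. The numbers are toy values, not taken from the experiments, chosen only to show how a V-shaped covariance produces a minimum of the total variance (and hence of the total mean squared error) at intermediate folds.

def var_ce(v, var_he, cov_he):
    # Var(CE) = (1/v) Var(HE) + ((v-1)/v) Cov(HE_i, HE_j)
    return var_he / v + (v - 1) / v * cov_he

def total_mse(e_ce, var_ce_value):
    # E^2[CE] + Var(CE)
    return e_ce ** 2 + var_ce_value

# Toy values: Var(HE) grows linearly with v, the covariance is V-shaped.
for v, var_he, cov in [(2, 0.004, 0.0030), (10, 0.020, 0.0005), (100, 0.200, 0.0020)]:
    vc = var_ce(v, var_he, cov)
    print(v, round(vc, 5), round(total_mse(0.25, vc), 5))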
7.4.2 Expected value
Figures 7-19 to 7-21 depict the behavior of the expected value of CE for different
amounts of correlation between the input attributes and the class labels and for
two different sample sizes. The behavior of the expected value at medium and high
correlations is the same for small and large sample sizes, and hence I plot these scenarios
only for small sample sizes, as shown in Figure 7-21. From the figures I observe that as
the correlation increases the expected CE reduces. This occurs since the input attributes
become increasingly indicative of a particular class. As the number of folds increases the
expected value reduces, since the training set sizes increase in expectation, enhancing
classifier performance.
7.4.3 Expected value squared + variance
Figures 7-22 to 7-27 depict the behavior of E²[CE] + Var(CE), the total mean squared
error of CE. In Figure 7-22 I observe that the best performance of cross-validation is
around 10-20 folds. In the other cases, the behavior improves as the number of folds
increases. In Figure 7-22 the variance at high folds is large and hence the above sum is
large at high folds; as a result I have a V-shaped curve. In the other figures the variance
is low at high folds, and so is the expected value, and hence the performance improves as
the number of folds increases.
7.5 Take-aways
In summary, I observed the behavior of cross-validation under varying amounts of
data correlation, varying sample sizes and varying numbers of folds. I observed that
at low correlations and low sample sizes (a characteristic of many real-life datasets)
10-20 fold cross-validation was best, while in the other cases increasing the number of
folds helped enhance performance. Additionally, I provided in-depth explanations for the
observed behavior and noted that the explanations for the behavior of the covariance are
especially relevant to classification algorithms that classify based on majority at a global
level (e.g. majority classifiers) or at least at some local level (e.g. DT classification at the
leaves). The other interesting fact is that all the experiments and the insights were a
consequence of the theoretical formulas derived previously. I hope that non-asymptotic
studies like the one presented will assist in better understanding popular prevalent
techniques, in this case cross-validation.
Figure 7-1. Var(HE) for small sample size and low correlation.
Figure 7-2. Var(HE) for small sample size and medium correlation.
Figure 7-3. Var(HE) for small sample size and high correlation.
Figure 7-4. Var(HE) for larger sample size and low correlation.
Figure 7-5. Var(HE) for larger sample size and medium correlation.
Figure 7-6. Var(HE) for larger sample size and high correlation.
Figure 7-7. Cov(HEi, HEj) for small sample size and low correlation.
Figure 7-8. Cov(HEi, HEj) for small sample size and medium correlation.
Figure 7-9. Cov(HEi, HEj) for small sample size and high correlation.
Figure 7-10. Cov(HEi, HEj) for larger sample size and low correlation.
Figure 7-11. Cov(HEi, HEj) for larger sample size and medium correlation.
Figure 7-12. Cov(HEi, HEj) for larger sample size and high correlation.
Figure 7-13. Var(CE) for small sample size and low correlation.
Figure 7-14. Var(CE) for small sample size and medium correlation.
Figure 7-15. Var(CE) for small sample size and high correlation.
Figure 7-16. Var(CE) for larger sample size and low correlation.
Figure 7-17. Var(CE) for larger sample size and medium correlation.
Figure 7-18. Var(CE) for larger sample size and high correlation.
Figure 7-19. E[CE] for small sample size and low correlation.
Figure 7-20. E[CE] for larger sample size and low correlation.
Figure 7-21. E[CE] for small sample size at medium and high correlation.
Figure 7-22. E²[CE] + Var(CE) for small sample size and low correlation.
Figure 7-23. E²[CE] + Var(CE) for small sample size and medium correlation.
Figure 7-24. E²[CE] + Var(CE) for small sample size and high correlation.
Figure 7-25. E²[CE] + Var(CE) for larger sample size and low correlation.
Figure 7-26. E²[CE] + Var(CE) for larger sample size and medium correlation.
Figure 7-27. E²[CE] + Var(CE) for larger sample size and high correlation.
CHAPTER 8
CONCLUSION
In this work I have proposed a novel methodology to study learning algorithms which
has the following potential benefits:
Gaining Insight: One of the main advantages of deploying such a methodology is
that it can be used as an exploratory tool and as an analysis tool. I can accurately study
when and why a particular evaluation measure or classification algorithm behaves in the
manner it does.
Finite Sample Convergence: Another benefit of the methodology is that it can
be used to evaluate the performance of the evaluation measures in estimating GE under
different conditions. For example, I can study how well HE and CE estimate GE with
increasing sample size. I can thus use CE below a certain sample size and HE beyond
that sample size so as to estimate GE accurately and efficiently. The methodology can
thus be used as a guidance tool.
Robustness: If an algorithm designer validates his/her algorithm by computing
moments as mentioned earlier, it can instill greater confidence in a practitioner searching
for an appropriate algorithm for his/her dataset. The reason is the following: suppose the
practitioner has a dataset that has a similar structure to, or comes from a similar source
as, the test dataset on which the designer built an empirical distribution and reported
favorable results. The good results then apply not only to that particular test dataset
but to other datasets of a similar type, and since the practitioner's dataset belongs to this
collection, the results also apply to it. Hence, the robustness of the algorithm can be
evaluated using this methodology, which can give the algorithm wider appeal.
Other Benefits: The methodology can be used to evaluate Probably Approximately
Correct (PAC) Bayes bounds McAllester [2003] in certain settings. Roughly speaking,
PAC-Bayes bounds bound the difference between the expected GE and the expected
empirical error, where the expectation is over a distribution defined on the hypothesis
space. In our case this distribution is induced by training a classification algorithm on
all i.i.d. samples of size N. I can compute the moments of GE and the moments of the
evaluation measures using our expressions for this case, and compare them to verify the
tightness of the corresponding PAC-Bayes bounds. The derived expressions can also
be used to focus on specific portions of the data, since the individual probabilities in
the expressions are only concerned with the behavior of the classification algorithm or
evaluation measure on single inputs or pairs of inputs.
Hence, the methodology I discussed can serve as a guidance tool, an analysis tool
and an exploratory tool to accurately study classification algorithms in conjunction
with evaluation measures. In the future it would be interesting to analyze and develop
efficient characterizations for other classification algorithms and evaluation measures in
this framework. This analysis will hopefully assist us in gaining new insights into the
behavior of these techniques. A more ambitious goal is to extend this kind of analysis to
study the more general class of learning algorithms. This includes but is not limited to
regression problems where the output is continuous rather than discrete.
APPENDIX: PROOFS
Proposition 3. The polynomial \((x + a)^r x^2 y + (y + a)^r y^2 x > 0\) iff \(x > 0\) and \(y > 0\), where
\(x, y \in \{-a, -a+1, \ldots, a\}\), \(r = \max_b \left\lfloor \frac{\ln[b]}{\ln\left[\frac{a+1}{a-b}\right]} \right\rfloor + 1\), \(a \in \mathbb{N}\), \(1 < b < a\) and \(b \in \mathbb{N}\).
Proof. One direction is trivial: if x > 0 and y > 0 then the polynomial is greater than zero
for any value of r. Now let us prove the other direction, i.e. if \((x + a)^r x^2 y + (y + a)^r y^2 x > 0\)
then x > 0 and y > 0, where \(x, y \in \{-a, \ldots, a\}\), \(r = \max_b \lfloor \ln[b] / \ln[\frac{a+1}{a-b}] \rfloor + 1\), \(1 < b < a\)
and \(b \in \mathbb{N}\). In other words, if x ≤ 0 or y ≤ 0 then \((x + a)^r x^2 y + (y + a)^r y^2 x \le 0\). We prove
this result by cases.

Case 1: Both x and y are zero. The value of the polynomial is zero.

Case 2: One of x or y is zero. The value of the polynomial is again zero, since each of the
two terms separated by the sum has xy as a factor.

Case 3: Both x and y are less than zero. Consider the first term \((x + a)^r x^2 y\). This term is
non-positive since x + a is always non-negative (because \(x \in \{-a, \ldots, a\}\)) and \(x^2\) is positive
but y is negative. An analogous argument applies to the second term, so it too is non-positive.
Thus their sum is non-positive.

Case 4: One of x or y is negative and the other is positive. Assume w.l.o.g. that x is
positive and y is negative. Since the polynomial factors as \(xy\,[(x + a)^r x + (y + a)^r y]\) and
\(xy < 0\),
\[(x + a)^r x^2 y + (y + a)^r y^2 x \le 0 \iff (x + a)^r x + (y + a)^r y \ge 0,\]
and the latter holds whenever
\[r \ge \frac{\ln[-y/x]}{\ln\left[\frac{x+a}{y+a}\right]}.\]
On fixing the value of y, the value of x at which the right-hand side achieves its maximum
is 1 (since the lower the value of x the higher the right-hand side, but x is positive by our
assumption and \(x \in \{-a, \ldots, a\}\)). Thus the above inequality holds for all such x provided
\[r \ge \frac{\ln[-y]}{\ln\left[\frac{a+1}{y+a}\right]}.\]
Let b = −y; then 1 ≤ b ≤ a since y is negative. Hence, if r satisfies this inequality for all
allowed values of b, the polynomial is less than or equal to zero over the specified range.
For r to satisfy the inequality for all allowed values of b, it must satisfy it at the value of
b at which the right-hand side is maximum. Also, for b = 1 and b = a the right-hand side
is zero, so the range of b over which we seek the maximum is 1 < b < a. With this, the
minimum integer r that satisfies the above inequality, and hence makes the polynomial
less than or equal to zero, is
\[r = \max_b \left\lfloor \frac{\ln[b]}{\ln\left[\frac{a+1}{a-b}\right]} \right\rfloor + 1.\]
The 4 cases cover all the possibilities and thus we have shown that if x ≤ 0 or y ≤ 0 then
\((x + a)^r x^2 y + (y + a)^r y^2 x \le 0\). Having also shown the other direction, we have proved
the proposition.
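The proposition can also be checked by brute force for small a. The following sketch computes r as defined above and exhaustively verifies the equivalence over all integer pairs (x, y) in the allowed range; the function names and the tested values of a are illustrative.

import math
from itertools import product

def r_of(a):
    # r = max over integer 1 < b < a of floor(ln b / ln((a+1)/(a-b))) + 1
    return max(math.floor(math.log(b) / math.log((a + 1) / (a - b)))
               for b in range(2, a)) + 1

def check(a):
    # Verify: (x+a)^r x^2 y + (y+a)^r y^2 x > 0  iff  x > 0 and y > 0.
    r = r_of(a)
    for x, y in product(range(-a, a + 1), repeat=2):
        poly = (x + a) ** r * x * x * y + (y + a) ** r * y * y * x
        assert (poly > 0) == (x > 0 and y > 0), (a, r, x, y, poly)
    return r

for a in range(3, 12):
    print("a =", a, " r =", check(a))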
The probability that two paths of lengths \(l_1\) and \(l_2\) (\(l_2 \ge l_1\)) co-exist in a tree based on
the randomized attribute selection method is given by
\[P[l_1 \text{ and } l_2 \text{ length paths co-exist}] = \sum_{i=0}^{v} {}^{v}Pr_i\,(l_1 - i - 1)!\,(l_2 - i - 1)!\,(r - v)\,prob_i\]
where r is the number of attributes common to the two paths, v is the number of attributes
with the same values in the two paths, \({}^{v}Pr_i = \frac{v!}{(v-i)!}\) denotes the number of permutations, and
\[prob_i = \frac{1}{d(d-1)\cdots(d-i)\,(d-i-1)^2\cdots(d-l_1+1)^2\,(d-l_1)\cdots(d-l_2+1)}.\]
We now prove the above result. The derivation will become clearer through the following
example. Consider the total number of attributes to be d as usual. Let A1, A2 and A3 be
three attributes that are common to both paths and also have the same attribute values.
Let A4 and A5 be common to both paths but have different attribute values in each of
them. Let A6 belong only to the first path and A7, A8 only to the second path. Thus, in
our example l1 = 6, l2 = 7, r = 5 and v = 3.

For the two paths to co-exist, notice that at least one of A4 or A5 has to be at a lower
depth than the non-common attributes A6, A7, A8. This has to be true since, if a
non-common attribute, say A6, is higher than A4 and A5 in a path of the tree, then the
other path cannot exist. Hence, in all the possible ways that the two paths can co-exist,
one of the attributes A4 or A5 has to occur at a maximum depth of v + 1, that is, 4 in
this example. Figure A-1a depicts this case. In the successive tree structures, that is,
Figure A-1b and Figure A-1c, the common attribute with distinct attribute values (A4)
rises higher up in the tree (to lower depths) until in Figure A-1d it becomes the root. To
find the probability that the two paths co-exist we sum the probabilities of such
arrangements/tree structures. The probability of the subtree shown in Figure A-1a is
1/[d(d−1)(d−2)(d−3)(d−4)²(d−5)²(d−6)], considering that we choose attributes without
replacement along a particular path. Thus the probability of choosing the root is 1/d, the
next attribute 1/(d−1), and so on until the subtree splits into two paths at depth 5. After
the split at depth 5, the probability of choosing the respective attributes for the two paths
is 1/(d−4)², since repetitions are allowed in two separate paths. Finally, the first path ends
at depth 6 and only one attribute has to be chosen at depth 7 for the second path, which is
chosen with probability 1/(d−6). We now find the total number of subtrees with such an
arrangement, where the highest common attribute with different values is at a depth of 4.
We observe that A1, A2 and A3 can be permuted in any order without altering the tree
structure; the total number of ways of doing this is 3!, that is, 3Pr3. The attributes below
A4 can also be permuted in 2!·3! ways without changing the tree structure. Moreover, A4
can be replaced by A5. Thus, the total number of ways the two paths can co-exist with
this arrangement is 3Pr3·2!·3!·2, and the probability of the arrangement is hence
3Pr3·2!·3!·2 / [d(d−1)(d−2)(d−3)(d−4)²(d−5)²(d−6)]. Similarly, we find the probabilities of
the arrangements in Figures A-1b, A-1c and A-1d, where the common attribute with
different values is at depth 3, then at depth 2 and finally at the root. The probabilities of
these successive arrangements are 3Pr2·3!·4!·2 / [d(d−1)(d−2)(d−3)²(d−4)²(d−5)²(d−6)],
3Pr1·4!·5!·2 / [d(d−1)(d−2)²(d−3)²(d−4)²(d−5)²(d−6)] and
3Pr0·5!·6!·2 / [d(d−1)²(d−2)²(d−3)²(d−4)²(d−5)²(d−6)] respectively. The total probability
for the paths to co-exist is given by the sum of the probabilities of these individual
arrangements.
In the general case, where we have v attributes with the same values, the number of
possible arrangements is v + 1. This is because the depth at which the two paths separate
out decreases from v + 1 to 1. When the bifurcation occurs at depth v + 1, the total
number of subtrees with this arrangement is vPrv·(l1 − v − 1)!·(l2 − v − 1)!·(r − v): vPrv is
the number of permutations of the common attributes with the same values, (l1 − v − 1)!
and (l2 − v − 1)! are the numbers of permutations of the attributes in paths 1 and 2
respectively after the split, and r − v is the number of choices for the split attribute. The
probability of any one of these subtrees is
1/[d(d−1)···(d−v)(d−v−1)²···(d−l1+1)²(d−l1)···(d−l2+1)], since up to a depth of v + 1 the
two paths are the same and from depth v + 2 onward the two paths separate. The
probability of the first arrangement is thus
vPrv·(l1−v−1)!·(l2−v−1)!·(r−v) / [d(d−1)···(d−v)(d−v−1)²···(d−l1+1)²(d−l1)···(d−l2+1)].
For the second arrangement, with the bifurcation occurring at depth v, the number of
subtrees is vPrv−1·(l1 − v)!·(l2 − v)!·(r − v) and the probability of any one of them is
1/[d(d−1)···(d−v+1)(d−v)²···(d−l1+1)²(d−l1)···(d−l2+1)]; the probability of the arrangement
is thus vPrv−1·(l1−v)!·(l2−v)!·(r−v) / [d(d−1)···(d−v+1)(d−v)²···(d−l1+1)²(d−l1)···(d−l2+1)].
Similarly, the probabilities of the other arrangements can be derived. Hence the total
probability for the two paths to co-exist, which is the sum of the probabilities of the
individual arrangements, is given by
\[P[l_1 \text{ and } l_2 \text{ length paths co-exist}] = \sum_{i=0}^{v} \frac{{}^{v}Pr_i\,(l_1 - i - 1)!\,(l_2 - i - 1)!\,(r - v)}{d(d-1)\cdots(d-i)\,(d-i-1)^2\cdots(d-l_1+1)^2\,(d-l_1)\cdots(d-l_2+1)}.\]
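The following sketch implements this summation directly; the function name and the choice d = 20 for the worked example (l1 = 6, l2 = 7, r = 5, v = 3) are illustrative assumptions.

from math import factorial, perm

def coexist_probability(l1, l2, r, v, d):
    # P[paths of lengths l1 and l2 (l2 >= l1) co-exist] under randomized
    # attribute selection, following the summation derived above.
    # r = attributes common to both paths, v = common attributes that also
    # share the same value, d = total number of attributes.
    total = 0.0
    for i in range(v + 1):
        denom = 1.0
        for k in range(i + 1):          # d (d-1) ... (d-i): shared prefix
            denom *= d - k
        for k in range(i + 1, l1):      # (d-i-1)^2 ... (d-l1+1)^2: both paths grow
            denom *= (d - k) ** 2
        for k in range(l1, l2):         # (d-l1) ... (d-l2+1): longer path continues
            denom *= d - k
        total += perm(v, i) * factorial(l1 - i - 1) * factorial(l2 - i - 1) * (r - v) / denom
    return total

# The worked example above, with d = 20 attributes in total.
print(coexist_probability(6, 7, 5, 3, d=20))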
Figure A-1. Instances of possible arrangements.
REFERENCES
J. Abello, P. M. Pardalos, and M. G. C. Resende, editors. Handbook of massive data sets.Kluwer Academic Publishers, Norwell, MA, USA, 2002. ISBN 1-4020-0489-3.
T. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, 2003.
J. Bartlett, J. Kotrlik, and C. Higgins. Organizational research: Determining appropriatesample size for survey research. Information Technology, Learning, and PerformanceJournal, 19(1):43–50, 2001.
Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold crossvalidation. Journal of Machine Learning Research, 2003.
D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: a convexoptimization approach. Technical report, Dept. Math. O.R., Cambridge, Mass 02139,1998. URL citeseer.ist.psu.edu/bertsimas00optimal.html. Date accessed 5/2006.
A. Blum, A. Kalai, and J. Langford. Beating the hold-out: Bounds for k-fold andprogressive cross-validation. In Computational Learing Theory, pages 203–208, 1999.
S. Boucheron, O. Bousquet, and G. Lugosi. Introduction to statistical learning theory.Date accessed 1/2007, http://www.kyb.mpg.de/publications/pdfs/pdf2819.pdf, 2005.
U. Braga-Neto and E. Dougherty. Exact performance of error estimators for discreteclassifiers. Pattern Recognition, 38(11):1799–1814, 2005.
L. Breiman. Heuristics of instability and stabilization in model selection. The Annals ofStatistics, 1996.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.Wadsworth and Brooks, 1984.
R. Butler and R. Sutton. Saddlepoint approximation for multivariate cumulativedistribution functions and probability computations in sampling theory and outliertesting. Journal of the American Statistical Association, 93(442):596–604, 1998.
R. Chambers and C. Skinner. Analysis of Survey Data. Wiley, 1977.
J. Connor-Linton. Chi square tutorial. Date accessed 8/2006,http://www.georgetown.edu/faculty/ballc/webtools/ web chi tut.html, 2003.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition.Springer-Verlag, 1996.
A. Dhurandhar and A. Dobra. Semi-analytical method for analyzing models and modelselection measures based on moment analysis. ACM Transactions on KnowledgeDiscovery and Data Mining, 3, 2009.
A. Dhurandhar and A. Dobra. Probabilistic characterization of random decision trees.Journal of Machine Learning Research, 9, 2008.
A. Dhurandhar and A. Dobra. Probabilistic characterization of nearest neighbor classifier.Technical Report, 2007.
P. Domingos and M. J. Pazzani. On the optimality of the simple bayesian classifier underzero-one loss. Machine Learning, 29(2-3):103–130, 1997.
A. Edelman and H. Murakami. Polynomial roots from companion matrix eigenvalues.Mathematics of Computation, 64(210):763–776, 1995.
B. Efron. How biased is the apparent error rate of a prediction rule? Journal of theAmerican Statistical Association, 81:461–470, 1986.
B. Efron. The estimation of prediction error: Covariance penalties and cross-validation.Journal of the American Statistical Association, 99:619–642, 2004.
A. Elisseeff and M. Pontil. Learning Theory and Practice, chapter Leave-one-out error andstability of learning algorithms with applications. IOS Press, 2003.
C. Goutte. Note on free lunches and cross-validation. Neural Computation, 9(6):1245–1249,1997.
I. Guyon. Nips. Discussion: Open Problems, 2002.
M. Hall. Correlation-based feature selection for machine learning, 1998.
M. A. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 2003.
P. Hall. The Bootstrap and Edgeworth Expansion. Springer-Verlag, 1992.
K. Isii. The extrema of probability determined by generalized moments(i) boundedrandom variables. Ann. Inst. Stat. Math, 12:119–133, 1960.
K. Isii. On the sharpness of Chebyshev-type inequalities. Ann. Inst. Stat. Math, 14:185–197, 1963.
S. Karlin and L. Shapley. Geometry of moment spaces. Memoirs Amer. Math. Soc., 12,1953.
M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-outcross-validation. In Computational Learing Theory, pages 152–162, 1997.
R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1137–1143. San Mateo, CA: Morgan Kaufmann, 1995.
E. F. Krause. Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Dover, 1987.
J. Langford. Filed under: Prediction theory, problems. Date accessed 6/2006,http://hunch.net/index.php?p=29, 2005.
B. Levin. A representation for multinomial cumulative distribution functions. The Annalsof Statistics, 9(5):1123–1126, 1981.
F. Liu, K. Ting, and W. Fan. Maximizing tree diversity by building complete-randomdecision trees. In PAKDD, pages 605–610, 2005.
W. Liu and A. White. Metrics for nearest neighbour discrimination with categorical attributes. In Research and Development in Expert Systems XIV: Proceedings of the 17th Annual Technical Conference of the BCES Specialist Group, pages 51–59, 1997.
M. Markatou, H. Tian, S. Biswas, and G. Hripcsak. Analysis of variance of cross-validationestimators of the generalization error. J. Mach. Learn. Res., 6:1127–1168, 2005. ISSN1533-7928.
D. McAllester. PAC-Bayesian stochastic model selection. Mach. Learn., 51, 2003.
A. Moore and M. Lee. Efficient algorithms for minimizing cross validation error. InInternational Conference on Machine Learning, pages 190–198, 1994.
M. Plutowski. Survey: Cross-validation in theory and in practice. Date accessed 10/2006,www.emotivate.com/CvSurvey.doc, 1996.
A. Prekopa. The discrete moment problem and linear programming. RUTCOR ResearchReport, 1989.
J. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
I. Rish. An empirical study of the naive Bayes classifier. In IJCAI-01 Workshop on "Empirical Methods in AI", 2001.
J. Schneider. Cross validation. Date accessed 5/2008,http://www.cs.cmu.edu/ schneide/tut5/node42.html, 1997.
J. Shao. Linear model selection by cross validation. Journal of the American StatisticalAssociation, 88, 1993.
J. Shao. Mathematical statistics. Springer-Verlag, 2003.
L. Smith. A tutorial on principal components analysis. 2002.
C. Stanfill and D. Waltz. Toward memory-based reasoning. Commun. ACM, 29(12):1213–1228, 1986. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/7902.7906.
C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–645,1977.
V. Vapnik. Statistical Learning Theory. Wiley & Sons, 1998.
R. Williamson. SRM and VC theory (statistical learning theory).http://axiom.anu.edu.au/ williams/papers/P151.pdf, 2001.
Wolfram-Research. Mathematica. http://www.wolfram.com/.
S.-P. Wu and S. Boyd. Sdpsol: a parser/solver for sdp and maxdet problems with matrixstructure. Date accessed 7/2006, http://www.stanford.edu/ boyd/SDPSOL.html, 1996.
H. Zhu and R. Rohwer. No free lunch for cross validation. Neural Computation, 8(7):1421–1426, 1996.
BIOGRAPHICAL SKETCH
Amit Dhurandhar is originally from Pune, India. He received his B.E. degree in
computer science from the University of Pune in 2004. He then received his master’s
degree in December 2005 and his Ph.D. in summer 2009 from the University of Florida.
His primary research is focused on building theory and scalable frameworks for studying
classification algorithms and related techniques.