SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES
By
AMIT DHURANDHAR
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2009
© 2009 Amit Dhurandhar
To my family, friends and professors
ACKNOWLEDGMENTS
First and foremost, I would like to thank the almighty for giving me the strength
to overcome both academic and emotional challenges that I have faced in my pursuit of
earning a doctorate degree. Without his strength I would not have been in this position
today. Second, I would like to thank my family for their continued support and for the fun
we have when we all get together.
A very special thanks to my advisor, Dr. Alin Dobra, for not only his guidance
but also for the great camaraderie that we share. I am grateful for having met such
an intelligent, creative, full-of-life yet patient and helpful individual. I have thoroughly
enjoyed the intense discussions (which others mistook for fights and actually bet on who
would win) we have had in this time.
I would like to thank Dr. Paul Gader and Dr. Arunava Banerjee for their insightful
suggestions and encouragement during difficult times. I would also like to thank
my other committee members Dr. Sanjay Ranka and Dr. Ravindra Ahuja for their
invaluable inputs. I feel fortunate to have taken courses with Dr. Meera Sitharam and Dr.
Anand Rangarajan who are great teachers and taught me what it means to understand
something.
Last but definitely not least, I would like to thank my friends and roommates, for
without them life would have been dry. A special thanks to Hale, Kartik (or Kartiks
should I say), Bhuppi, Ajit, Gnana, Somnath and many others for their support and
encouragement.
Thanks a lot guys! This would not have been possible without you all.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1 Practical Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.1 What is the Methodology? . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.2 Why have such a Methodology? . . . . . . . . . . . . . . . . . . . . 18
1.3.3 How do I Implement the Methodology? . . . . . . . . . . . . . . . . 19
1.4 Applying the Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.1 Algorithmic Perspective . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4.2 Dataset Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5 Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2 GENERAL FRAMEWORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Generalization Error (GE) . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Alternative Methods for Computing the Moments of GE . . . . . . . . . . 29
3 ANALYSIS OF MODEL SELECTION MEASURES . . . . . . . . . . . . . . . 32
3.1 Hold-out Set Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Multifold Cross Validation Error . . . . . . . . . . . . . . . . . . . . . . . . 34
4 NAIVE BAYES CLASSIFIER, SCALABILITY and EXTENSIONS . . . . . . . 38
4.1 Example: Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.1 Naive Bayes Classifier Model (NBC) . . . . . . . . . . . . . . . . . 38
4.1.2 Computation of the Moments of GE . . . . . . . . . . . . . . . . . 39
4.2 Full-Fledged NBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Calculation of Basic Probabilities . . . . . . . . . . . . . . . . . . . 42
4.2.2 Direct Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.3 Approximation Techniques . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.3.1 Series approximations (SA) . . . . . . . . . . . . . . . . . 46
4.2.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.3.3 Random sampling using formulations (RS) . . . . . . . . . 55
4.2.4 Empirical Comparison of Cumulative Distribution Function Computing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Monte Carlo (MC) vs Random Sampling Using Formulations . . . . . . . . 57
4.4 Calculation of Cumulative Joint Probabilities . . . . . . . . . . . . . . . . . 59
4.5 Moment Comparison of Test Metrics . . . . . . . . . . . . . . . . . . . . . 62
4.5.1 Hold-out Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5.2 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.3 Comparison of GE, HE, and CE . . . . . . . . . . . . . . . . . . . . 64
4.6 Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 ANALYZING DECISION TREES . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1 Computing Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.1 Technical Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1.2 All Attribute Decision Trees (ATT) . . . . . . . . . . . . . . . . . . 84
5.1.3 Decision Trees with Non-trivial Stopping Criteria . . . . . . . . . . 85
5.1.4 Characterizing path exists for Three Stopping Criteria . . . . . . . . 87
5.1.5 Split Attribute Selection . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1.6 Random Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.1.7 Putting things together . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.7.1 Fixed Height . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.7.2 Purity and Scarcity . . . . . . . . . . . . . . . . . . . . . . 94
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4 Take-aways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6 K-NEAREST NEIGHBOR CLASSIFIER . . . . . . . . . . . . . . . . . . . . . . 108
6.1 Specific Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.2 Technical Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.3 K-Nearest Neighbor Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.4 Computation of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.4.1 General Characterization . . . . . . . . . . . . . . . . . . . . . . . . 111
6.4.2 Efficient Characterization for Sample Independent Distance Metrics . 113
6.5 Scalability Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.6.1 General Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.6.2 Study 1: Performance of the KNN Algorithm for Different Values of k . . . 121
6.6.3 Study 2: Convergence of the KNN Algorithm with Increasing Sample Size . . . 122
6.6.4 Study 3: Relative Performance of 10-fold Cross Validation on Synthetic Data . . . 123
6.6.5 Study 4: Relative Performance of 10-fold Cross Validation on Real Datasets . . . 124
6.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.8 Possible Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.9 Take-aways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7 INSIGHTS INTO CROSS-VALIDATION . . . . . . . . . . . . . . . . . . . . . . 132
7.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.2 Overview of the Customized Expressions . . . . . . . . . . . . . . . . . . . 140
7.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.4.1 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.4.2 Expected value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.4.3 Expected value square + variance . . . . . . . . . . . . . . . . . . . 148
7.5 Take-aways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
APPENDIX: PROOFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
LIST OF TABLES
Table page
2-1 Notation used throughout the thesis. . . . . . . . . . . . . . . . . . . . . . . . . 31
4-1 Contingency table of input X . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4-2 Naive Bayes Notation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4-3 Empirical Comparison of the cdf computing methods in terms of execution time. RSn denotes the Random Sampling procedure using n samples to estimate the probabilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4-4 95% confidence bounds for Random Sampling. . . . . . . . . . . . . . . . . . . . 68
4-5 Comparison of methods for computing the cdf. . . . . . . . . . . . . . . . . . . . 68
6-1 Contingency table with v classes, M input vectors and total sample size $N = \sum_{i=1,j=1}^{M,v} N_{ij}$. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
LIST OF FIGURES
Figure page
4-1 I have two attributes each having two values with 2 class labels. . . . . . . . . . 69
4-2 The current iterate yk just satisfies the constraint cl and easily satisfies the otherconstraints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4-3 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4-4 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4-5 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-6 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-7 Estimates of expected value of GE by MC and RS with increasing training setsize N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4-8 The plot is of the polynomial $(x + 10)^4 x^2 y + (y + 10)^4 y^2 x - z = 0$.
4-9 HE expectation in single dimension. . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-10 HE variance in single dimension. . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-11 HE E[] + Std() in single dimension. . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-12 HE expectation in multiple dimensions. . . . . . . . . . . . . . . . . . . . . . . . 74
4-13 HE variance in multiple dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . 75
4-14 HE E[] + Std() in multiple dimensions. . . . . . . . . . . . . . . . . . . . . . . . 75
4-15 Expectation of CE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4-16 Individual run variance of CE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4-17 Pairwise covariances of CV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-18 Total variance of cross validation. . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-19 E[] + √Var() of CV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-20 Convergence behavior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-21 CE expectation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4-22 Individual run variance of CE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4-23 Pairwise covariances of CV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-24 Total variance of cross validation. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-25 E[] + √Var() of CV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4-26 Convergence behavior. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5-1 The all attribute tree with 3 attributes A1, A2, A3, each having 2 values. . . . . 103
5-2 Given 3 attributes A1, A2, A3, the path m11m21m31 is formed irrespective ofthe ordering of the attributes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5-3 Fixed Height trees with d = 5, h = 3 and attributes with binary splits. . . . . . 104
5-4 Fixed Height trees with d = 5, h = 3 and attributes with ternary splits. . . . . . 104
5-5 Fixed Height trees with d = 8, h = 3 and attributes with binary splits. . . . . . 104
5-6 Purity based trees with d = 5 and attributes with binary splits. . . . . . . . . . 105
5-7 Purity based trees with d = 5 and attributes with ternary splits. . . . . . . . . . 105
5-8 Purity based trees with d = 8 and attributes with binary splits. . . . . . . . . . 105
5-9 Scarcity based trees with d = 5, pb = N/10 and attributes with binary splits. . . . 106
5-10 Scarcity based trees with d = 5, pb = N/10 and attributes with ternary splits. . . . 106
5-11 Scarcity based trees with d = 8, pb = N/10 and attributes with binary splits. . . . 106
5-12 Comparison between AF and MC on three UCI datasets for trees pruned based on fixed height (h = 3), purity and scarcity (pb = N/10). . . . . . . . . . . . . . 107
6-1 b, c and d are the 3 nearest neighbours of a. . . . . . . . . . . . . . . . . . . . . 128
6-2 The Figure shows the extent to which a point xi is near to x1. . . . . . . . . . . 129
6-3 Behavior of the GE for different values of k. . . . . . . . . . . . . . . . . . . . . 129
6-4 Convergence of the GE for different values of k. . . . . . . . . . . . . . . . . . . 130
6-5 Comparison between the GE and 10 fold Cross validation error (CE) estimatefor different values of k when the sample size (N) is 1000. . . . . . . . . . . . . . 130
6-6 Comparison between the GE and 10 fold Cross validation error (CE) estimatefor different values of k when the sample size (N) is 10000. . . . . . . . . . . . . 131
6-7 Comparison between true error (TE) and CE on 2 UCI datasets. . . . . . . . . . 131
7-1 Var(HE) for small sample size and low correlation. . . . . . . . . . . . . . . . . 149
7-2 Var(HE) for small sample size and medium correlation. . . . . . . . . . . . . . . 149
7-3 Var(HE) for small sample size and high correlation. . . . . . . . . . . . . . . . . 150
7-4 Var(HE) for larger sample size and low correlation. . . . . . . . . . . . . . . . . 150
7-5 Var(HE) for larger sample size and medium correlation. . . . . . . . . . . . . . . 151
7-6 Var(HE) for larger sample size and high correlation. . . . . . . . . . . . . . . . . 151
7-7 Cov(HEi, HEj) for small sample size and low correlation. . . . . . . . . . . . . 152
7-8 Cov(HEi, HEj) for small sample size and medium correlation. . . . . . . . . . . 152
7-9 Cov(HEi, HEj) for small sample size and high correlation. . . . . . . . . . . . . 153
7-10 Cov(HEi, HEj) for larger sample size and low correlation. . . . . . . . . . . . . 153
7-11 Cov(HEi, HEj) for larger sample size and medium correlation. . . . . . . . . . . 154
7-12 Cov(HEi, HEj) for larger sample size and high correlation. . . . . . . . . . . . . 154
7-13 Var(CE) for small sample size and low correlation. . . . . . . . . . . . . . . . . 155
7-14 Var(CE) for small sample size and medium correlation. . . . . . . . . . . . . . . 155
7-15 Var(CE) for small sample size and high correlation. . . . . . . . . . . . . . . . . 156
7-16 Var(CE) for larger sample size and low correlation. . . . . . . . . . . . . . . . . 156
7-17 Var(CE) for larger sample size and medium correlation. . . . . . . . . . . . . . . 157
7-18 Var(CE) for larger sample size and high correlation. . . . . . . . . . . . . . . . . 157
7-19 E[CE] for small sample size and low correlation. . . . . . . . . . . . . . . . . . . 158
7-20 E[CE] for larger sample size and low correlation. . . . . . . . . . . . . . . . . . . 158
7-21 E[CE] for small sample size at medium and high correlation. . . . . . . . . . . . 159
7-22 E2[CE] + V ar(CE) for small sample size and low correlation. . . . . . . . . . . 159
7-23 E2[CE] + V ar(CE) for small sample size and medium correlation. . . . . . . . . 160
7-24 E2[CE] + V ar(CE) for small sample size and high correlation. . . . . . . . . . . 160
7-25 E2[CE] + V ar(CE) for larger sample size and low correlation. . . . . . . . . . . 161
7-26 E2[CE] + V ar(CE) for larger sample size and medium correlation. . . . . . . . 161
7-27 E2[CE] + V ar(CE) for larger sample size and high correlation. . . . . . . . . . 162
A-1 Instances of possible arrangements. . . . . . . . . . . . . . . . . . . . . . . . . . 169
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES
By
Amit Dhurandhar
August 2009
Chair: Alin Dobra
Major: Computer Engineering
Considering the large amounts of data that are collected every day in various domains
such as health care, financial services, astrophysics and many others, there is a pressing
need to convert this information into knowledge. Machine learning and data mining are
both concerned with achieving this goal in a scalable fashion. The main theme of my
work has been to analyze and better understand prevalent classification techniques and
paradigms which are an integral part of machine learning and data mining research, with
an aim to reduce the hiatus between theory and practice.
Machine learning and data mining researchers have developed a plethora of
classification algorithms to tackle classification problems. Unfortunately, no single
algorithm is superior to the others in all scenarios, nor is it entirely clear which
algorithm should be preferred over others under specific circumstances. Hence,
an important question is: what is the best choice of classification algorithm for
a particular application? This problem is termed classification model selection and
is a very important problem in machine learning and data mining. The primary focus
of my research has been to propose a novel methodology to study these classification
algorithms accurately and efficiently in the non-asymptotic regime. In particular, we
propose a moment-based method: by focusing on the probabilistic space of classifiers
induced by the classification algorithm and datasets of size N drawn independently
and identically (i.i.d.) from a joint distribution, we obtain efficient characterizations
for computing the moments of the generalization error. Moreover, we can also study
model selection techniques such as cross-validation, leave-one-out and hold out set in our
proposed framework. This is possible since we have also established general relationships
between the moments of the generalization error and moments of the hold-out-set error,
cross-validation error and leave-one-out error. Deploying the methodology, we were able
to provide interesting explanations for the behavior of cross-validation. The methodology
aims at bridging the gap between results predicted by theory and the behavior observed in
practice.
CHAPTER 1
INTRODUCTION
A significant portion of the work in machine learning is dedicated to designing new
learning methods or better understanding, at a macroscopic level (i.e. performance over
various datasets), the known learning methods. The body of work that tries to understand
microscopic (i.e. essence of the method) behavior of either models or methods to evaluate
models – which I think is crucial for deepening the understanding of machine learning
techniques and results – and establish solid connections with Statistics is rather small.
The two prevalent approaches to establish such results are based on either theory or
empirical studies but usually not both, unless empirical studies are used to validate the
theory. While both methods are powerful in themselves, each suffers from at least a major
deficiency.
The theoretical method depends on nice, closed-form formulae, which usually restricts
the types of results that can be obtained to asymptotic results or statistical learning
theory (SLT) type results Vapnik [1998]. Should formulae become large and tedious to
manipulate, the theoretical results are hard to obtain and to use/interpret.
The empirical method is well suited for validating intuitions but is significantly less
useful for finding novel, interesting things since a large number of experiments have to be
conducted in order to reduce the error to a reasonable level. This is particularly difficult
when small probabilities are involved, making the empirical evaluation impractical in such
a case.
An ideal scenario, from the point of view of producing interesting results, would
be to use theory to make as much progress as possible, even if that means obtaining
uninterpretable formulae, followed by visualization to understand and find consequences
of such formulae. This would avoid both the restriction of theory to only nice formulae
and the need of empirical studies to perform large numbers of experiments. The role of the
theory could be to significantly reduce the amount of computation required and the role
of visualization to understand the potentially complicated theoretical formulae. This is
precisely what I propose, a new hybrid method to characterize and understand models
and model selection measures (i.e. methods that evaluate learning models). The work I
present here is an initial forray into what might prove to be an useful tool for studying
learning algorithms. I call this method semi-analytical, since not just the formulae, but
visualization in conjunction with the formulae lead to interpretability. What makes such
an endeavor possible is the fact that, mostly due to the linearity of expectation, moments
of complicated random variables can be computed exactly with efficient formulae, even
though deriving the exact distribution in the form of small closed form formulae is a
daunting task.
1.1 Practical Impact
In this section I discuss the impact of the proposed research on industry and on the
field of machine learning and data mining in general.
Impact on industry and other fields: In today's day and age, adaptive classification
models find applicability in a wide spectrum of applications ranging over various domains.
Financial firms deploy these models for security purposes such as fraud detection and
intrusion detection. Credit card companies use these models to make credit card offers to
people by categorizing them based on their previous transaction history. Giant supermarket
chains use these models to figure out which groups of items are generally
bought together by customers. These models are used extensively in bioinformatics
for problems such as gene classification based on functionality, DNA/protein sequence
matching, etc. They also find application in medicine for the analysis of the importance of
clinical parameters and their combinations, prediction of disease progression, extraction of
medical knowledge for outcome research, therapy planning and support, and overall
patient management. Today's state-of-the-art search engines also use classification models.
This is just a snapshot of the entire range of applications they are used for.
Given the wide applicability of classification models and the sheer extent of
their number, choosing the correct model for a specific application is a highly desirable
goal. Through this research, I hope to take a step forward in this direction.
Impact on machine learning and data mining research: I believe that the
research will assist in providing new insight into the behavior of classification models
and model selection measures. The framework may be used as an exploratory tool for
observing and understanding models and selection measures under specific circumstances
that interest the user. It is possible that other related problems may also be framed in an
analogous fashion leading to interesting observations and consequent interpretations.
1.2 Related Work
A critical piece of theoretical work that is coherent and provides structure in
comparing learning methods is given by Statistical Learning Theory Vapnik [1998]. SLT
categorizes classification algorithms (actually the more general learning algorithms) into
different classes called Concept Classes. The concept class of a classification algorithm is
determined by its Vapnik-Chervonenkis (VC) dimension, which is related to the shattering
capability of the algorithm. Given a 2 class problem, the shattering capability of a
function refers to the maximum number of points that the function can classify without
making any errors, for all possible assignments of the class labels to the points in some
chosen configuration. The shattering capability of an algorithm is the supremum of the
shattering capabilities of all the functions it can represent. Distribution free bounds on
the generalization error – expected error over the entire input, of a classifier built using
a particular classification algorithm belonging to a concept class are derived in SLT.
The bounds are functions of the VC dimension and the sample size. The strength of this
technique is that by finding the VC dimension of an algorithm I can derive error bounds
for the classifiers built using this algorithm without ever referring to the underlying
distribution. A fallout of this very general characterization is that the bounds are usually
loose Boucheron et al. [2005], Williamson [2001], which in turn makes statements
about any particular classifier weak.
There is a large body of both experimental and theoretical work that addresses the
problem of understanding various model selection measures. The model selection measures
that relevant to our discussion, are Hold-out-set validation, Cross-validation. Shao [1993]
showed that asymptotically Leave-one-out(LOO) chooses the best but not the simplest
model. Devroye et al. [1996] derived distribution free bounds for cross validation. The
bounds they found were for the nearest neighbour model. Breiman [1996] showed that
cross validation gives an unbiased estimate of the first moment of the Generalization error.
Though cross validation has desirable characteristics when estimating the first moment,
Breiman stated that its variance can be significant. Theoretical bounds on LOO error
under certain algorithmic stability assumptions were given by Kearns and Ron [1997].
They showed that the worst case error of the LOO estimate is not much worse than
the training error estimate. Elisseeff and Pontil [2003] introduced the notion of training
stability. They showed that even with this weaker notion of stability good bounds could be
obtained on the generalization error. Blum et al. [1999] showed that v-fold cross validation
is at least as good as N/v hold-out set estimation in expectation. Kohavi [1995] conducted
experiments on Naive Bayes and C4.5 using cross-validation. Through his experiments he
concluded that 10 fold stratified cross validation should be used for model selection.
Moore and Lee [1994] proposed heuristics to speed up cross-validation. Plutowski's
[1996] survey included proposals with theoretical results, heuristics and experiments on
cross-validation. His survey was especially geared towards the behavior of cross-validation
on neural networks. He inferred from the previously published results that cross-validation
is robust. More recently, Bengio and Grandvalet [2003] proved that there is no universally
unbiased estimator of the variance of cross-validation. Zhu and Rohwer [1996] proposed
a simple setting in which cross-validation performs poorly. Goutte [1997] refuted this
proposed setting and claimed that a realistic scenario in which cross-validation fails is still
an open question.
The work I present here covers the middle ground between these theoretical and
empirical results by allowing classifier specific results based on moment analysis. Such
an endeavor is important since the gap between theoretical and empirical results is
significant Langford [2005]. Preliminary work of this nature was done in Braga-Neto and
Dougherty [2005] where the authors characterized the discrete histogram rule. However,
their analysis does not provide any indication of how other more popular algorithms can
be characterized in similar fashion keeping in mind scalability and accuracy. Specific
classification schemes such as the W-statistic Anderson [2003] have been characterized
in the past, but such analysis is very much limited to that and other similar statistics.
The methodology I present here may potentially be applicable to a large variety of learning
algorithms.
1.3 Methodology
1.3.1 What is the Methodology?
The methodology for studying classification models consists in studying the behavior
of the first two central moments of the GE of the classification algorithm studied. The
moments are taken over the space of all possible classifiers produced by the classification
algorithm, by training it over all possible datasets sampled independently and identically
(i.i.d.) from some distribution. The first two moments give enough information about the
statistical behavior of the classification algorithm to allow interesting observations about
its behavior/trends. Higher moments may be computed using the same strategy suggested
but might prove to be inefficient to compute.
1.3.2 Why have such a Methodology?
The answers to the following questions shed light on why the methodology is
necessary if tight statistical characterization is to be provided for classification algorithms.
1. Why study GE? The biggest danger of learning is overfitting the training data. The main idea in using GE as a measure of success of learning, instead of the empirical error on a given dataset, is to provide a mechanism to avoid this pitfall. Implicitly, by analyzing GE all the input is considered.
2. Why study the moments instead of the distribution of GE? Ideally, I would study the distribution of GE instead of moments in order to get a complete picture of its behavior. Studying the distribution of discrete random variables, except for very simple cases, turns out to be very hard. The difficulty comes from the fact that even computing the pdf at a single point is intractable, since all combinations of random choices that result in the same value for GE have to be enumerated. On the other hand, the first two central moments coupled with distribution-independent bounds such as Chebyshev and Chernoff give guarantees about the worst possible behavior that are not too far from the actual behavior (small constant factor). Interestingly, it is possible to compute the moments of a random variable like GE without ever explicitly writing or making use of the formula for the pmf/pdf. What makes such an endeavor possible is extensive use of the linearity of expectation, as is explained later.
3. Why characterize a class of classifiers instead of a single classifier? While the use of GE as the success measure is standard practice in Machine Learning, characterizing classes of classifiers instead of the particular classifier produced on a given dataset is not. From the point of view of the analysis, without large testing datasets it is not possible to evaluate GE directly for a particular classifier. By considering classes of classifiers to which a classifier belongs, an indirect characterization is obtained for the particular classifier. This is precisely what Statistical Learning Theory (SLT) does; there the class of classifiers consists of all classifiers with the same VC dimension. The main problem with SLT results is that classes based on VC dimension are too large, thus results tend to be pessimistic. In the methodology, the class of classifiers consists only of the classifiers that are produced by the given classification algorithm from datasets of fixed size from the underlying distribution. This is the smallest probabilistic class in which the particular classifier produced on a given dataset can be placed.
1.3.3 How do I Implement the Methodology?
One way of approximately estimating the moments of GE over all possible classifiers
for a particular classification algorithm is by directly using Monte Carlo. If I use Monte
Carlo directly, I first need to produce a classifier on a sampled dataset and then test it on a
number of test sets sampled from the same distribution, acquiring an estimate of the GE of
this classifier. Repeating this entire procedure a number of times, I would acquire estimates
of GE for different classifiers. Then by averaging the error of these multiple classifiers I
would get an estimate of the first moment of GE. The variance of GE can also be similarly
estimated.
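For concreteness, a minimal sketch of this direct Monte Carlo procedure is given below. It is only an illustration: the data-generating routine `sample_dataset`, the training routine `train_classifier`, and the sample sizes are placeholders for whatever algorithm and distribution are under study, not anything specified in the thesis.

```python
import numpy as np

def mc_moments_of_ge(sample_dataset, train_classifier, n_train=1000,
                     n_classifiers=200, n_test=10000, seed=0):
    """Direct Monte Carlo estimate of E[GE] and Var(GE).

    sample_dataset(n, rng) -> (X, y): draws n i.i.d. points from the chosen
        distribution (a stand-in for the distribution under study).
    train_classifier(X, y) -> object with a .predict(X) method.
    """
    rng = np.random.default_rng(seed)
    ge_estimates = []
    for _ in range(n_classifiers):
        # Build one classifier on a freshly sampled training set of size N.
        X_tr, y_tr = sample_dataset(n_train, rng)
        clf = train_classifier(X_tr, y_tr)
        # Estimate this classifier's GE on a large independent test sample.
        X_te, y_te = sample_dataset(n_test, rng)
        ge_estimates.append(np.mean(clf.predict(X_te) != y_te))
    ge = np.asarray(ge_estimates)
    return ge.mean(), ge.var()  # estimates of E[GE] and Var(GE)
```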
Another way of estimating the moments of GE is by obtaining parametric expressions
for them. If this can be accomplished the moments can be computed exactly. Moreover,
by dexterously observing the manner in which expressions are derived for a particular
classification algorithm, insights can be gained into analyzing other algorithms of interest.
Though deriving the expressions may be a tedious task, using them I obtain highly
accurate estimates of the moments. I propose this second alternative for analyzing models.
The key to the analysis is focusing on the learning and inference phases of the algorithm.
In cases where the parametric expressions are computationally intensive to compute
directly, I show that by approximating individual terms using optimization techniques and
even Monte Carlo, I obtain accurate estimates of the moments when compared to directly
using Monte Carlo (the first alternative) for the same computational cost.
If the moments are to be studied on synthetic data then the distribution is anyway
assumed and the parametric expressions can be directly used. If I have real data an
empirical distribution can be built on the dataset and then the parametric expressions can
be used.
1.4 Applying the Methodology
It is important to note that the methodology is not aimed towards providing a
way of estimating bounds for GE of a classifier on a given dataset. The primary goal is
creating an avenue in which learning algorithms can be studied precisely i.e. studying the
statistical behavior of a particular algorithm w.r.t. a chosen/built distribution. Below, I
discuss the two most important perspectives in which the methodology can be applied.
1.4.1 Algorithmic Perspective
If a researcher/practitioner designs a new classification algorithm, he/she needs to
validate it. Standard practice is to validate the algorithm on a relatively small (5-20)
number of datasets and to report the performance. By observing the behavior of only a
few instances of the algorithm the designer infers its quality. Moreover, if the algorithm
underperforms on some datasets, it can sometimes be difficult to pinpoint the precise
reason for its failure. If instead he/she is able to derive parametric expressions for the
moments of GE, the test results would be more relevant to the particular classification
algorithm, since the moments are over all possible datasets of a particular size drawn
i.i.d. from some chosen/built distribution. Testing individually on all these datasets is an
impossible task. Thus, by computing the moments using the parametric expressions the
algorithm would be tested on a plethora of datasets with the results being highly accurate.
Moreover, since the testing is done in a controlled environment i.e. all the parameters are
known to the designer while testing, he/she can precisely pinpoint the conditions under
which the algorithm performs well and the conditions under which it underperforms.
1.4.2 Dataset Perspective
If an algorithm designer validates his/her algorithm by computing moments as
mentioned earlier, it can instill greater confidence in the practitioner searching for an
appropriate algorithm for his/her dataset. The reason is that if the practitioner has
a dataset which has a similar structure, or is from a similar source, as the test dataset on
which an empirical distribution was built and favourable results were reported by the designer,
then the results apply not only to that particular test dataset but
to other datasets of a similar type; since the practitioner's dataset belongs to this similar
collection, the results would also apply to it. Note that a distribution is just a weighting
of different datasets and this perspective is used in the above exposition.
If the dataset is categorical, it can be precisely modelled by a multinomial distribution
in the following manner. A multinomial is completely characterized by the probabilities
in each of its cells (which sum to 1) and the total count N (sum of individual cell counts).
The designer can set the number of cells in the multinomial to be the number of cells in
his contingency table, with empirical estimates for the individual cell probabilities being
the corresponding cell counts divided by the size of the dataset, which is the value of N.
With this I have a fully specified multinomial distribution with which I can compute the
formulations, consequently characterizing the moments of the GE. Since the estimates for
the cell probabilities are based on the available dataset, the true underlying distribution
of which this dataset is a sample, may have slightly different values. This scenario can
be accounted for, by varying the cell probabilities to a desired degree and observing the
variation in the estimates of GE. This would assist in deciphering the sensitivity of the
model in question to noise. In the continuous case, there is no such generic distribution
(as the multinomial), but a popular choice could be a mixture of Gaussians (other
distributions could also be used).
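As a small illustration of the categorical construction described above (not part of the original text), the sketch below builds the empirical multinomial cell probabilities from a categorical dataset held in a pandas DataFrame, with an optional perturbation of the probabilities to probe sensitivity to noise; the perturbation scheme is a hypothetical choice.

```python
import numpy as np
import pandas as pd

def empirical_multinomial(df: pd.DataFrame, noise: float = 0.0, seed: int = 0):
    """Return (cells, probabilities, N) for the multinomial modeling a
    categorical dataset: one cell per distinct (input, output) row."""
    counts = df.value_counts()            # cell counts over distinct rows
    n_total = int(counts.sum())           # the total count N
    probs = counts.to_numpy() / n_total   # empirical cell probabilities
    if noise > 0.0:
        # Perturb the cell probabilities to study sensitivity to noise.
        rng = np.random.default_rng(seed)
        probs = probs * (1.0 + rng.uniform(-noise, noise, size=probs.size))
        probs = probs / probs.sum()       # renormalize so they sum to 1
    return list(counts.index), probs, n_total
```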
1.5 Research Goals
In this section I state the specific research goals that I have accomplished in this
thesis work.
General Framework: To provide a statistical characterization of classifiers, a probabilistic
class of classifiers that contains the desired classifier has to be considered since the
behavior of any particular classifier can be arbitrarily poor. The class considered
by statistical learning theory is the class of classifiers with a given VC dimension
Vapnik [1998]. While the results thus obtained are very general, no particularity of
the classification algorithm is exploited. The class of classifiers considered in this thesis
consists of the classifiers obtained by applying the classification algorithm to a dataset of given
size sampled i.i.d. from the underlying distribution. This leads to a different way of
characterizing classifiers based on moment analysis. I develop a framework to analyze
classification algorithms.
Analysis of Model Selection Measures: To relate the moments of the Generalization
error (GE) to the moments of Cross-validation error (CE), Leave-one-out error (LE) and
Hold-out-set error (HE). This will assist us in studying the behavior of these errors given
the moments of any one of these errors.
Analysis of Specific Classification Models: To develop customized formulations for
the moments for specific classification algorithms. This will aid in studying classification
algorithms in conjunction with the selection measures. I choose the following models which
are a mix of parametric and non-parametric models.
1. Naive Bayesian Classifier (NBC) model: NBC is a model which is extensively used in industry, due to its robustness, outperforming its more sophisticated counterparts in many real world applications (e.g. spam filtering in Mozilla Thunderbird and Microsoft Outlook, bioinformatics, etc.). There has been work on the robustness of NBC Domingos and Pazzani [1997], Rish [2001], but the proposed framework and the inter-relationships between the moments of the various errors help us to extensively study not just the model but also the behavior of the validation methods in conjunction with it.
2. Decision Trees (DT) model: Decision trees are also extensively used in data mining and machine learning applications. Besides performance, they are sometimes preferred over other models (e.g. Support Vector Machines, neural nets) because the process by which the eventual classifier is built from the sample is transparent. The probabilistic formulations will incorporate various pruning conditions such as purity, scarcity and fixed height. The formulations will help better understand the behavior of these trees for classification.
3. K-Nearest-Neighbor (KNN) Classifier model: This model is one of the simpler models, yet it is highly effective. Theoretical results exist Stone [1977] regarding convergence of the Generalization Error (GE) of this algorithm to the Bayes error (the best possible performance). However, this result is asymptotic, and for finite sample sizes in real scenarios finding the optimal value of K is more of an art than a science. The methodology proposed by us can be used to study the algorithm for different values of K and for different distance metrics accurately in controlled settings.
Scalability: To make the computation of the moments scalable. This is especially
relevant when the domain is discrete and the computation of individual probabilities
becomes computationally intensive. In these cases I have to come up with approximation
techniques that are accurate and fast, making the analysis practical.
Practical Study of Non-asymptotic Behavior of NBC, DT, KNN and Selection
Measures: The formulas of the moments of GE and consequently HE, CE and LE for the
NBC, DT and KNN that are derived using the general framework can be used to carry
out an extensive study of the behavior of these classification algorithms in conjunction
23
with the model selection measures. I have carried out such a comparison with the aim of
identifying interesting trends about the mentioned classification algorithms and the model
selection measures to exemplify the utility of the theoretical framework.
CHAPTER 2
GENERAL FRAMEWORK
Probability distributions completely characterize the behavior of a random variable.
Moments of a random variable give us information about its probability distribution.
Thus, if I have knowledge of the moments of a random variable I can make statements
about its behavior. In some cases, characterizing a finite subset of moments may prove
to be a more desirable alternative than characterizing the entire distribution, which can be
wild and computationally expensive to compute. This is precisely what I do when I study
the behavior of the generalization error of a classifier and the error estimation methods,
viz. hold-out error, leave-one-out error and cross-validation error. Characterizing
the distribution, though possible, can turn out to be a tedious task, and studying the
moments instead is a more viable option. As a result, I employ moment analysis and
use linearity of expectation to explore the relationship between various estimates for the
error of classifiers: generalization error (GE), hold-out-set error (HE), and cross validation
error (CE) — leave-one-out error is just a particular case of CE and I do not analyze
it independently. The relationships are drawn by going over the space of all possible
datasets. The actual computation of moments though is conducted by going over the
space of classifiers induced by a particular classification algorithm and i.i.d. data. This
is done since it leads to computational efficiency. I interchangeably go over the space of
datasets and the space of classifiers as deemed appropriate, since the classification algorithm is
assumed to be deterministic. That is, I have
\[
E_{D(N)}[\mathcal{F}(\zeta[D(N)])] = E_{Z(N)}[\mathcal{F}(\zeta)] = E_{x_1 \times \ldots \times x_m}[\mathcal{F}(\zeta(x_1, x_2, \ldots, x_m))]
\]
where $\mathcal{F}(\cdot)$ is some function that operates on a classifier. I also consider the learning
algorithms to be symmetric (the algorithm is oblivious to random permutations of the
samples in the training dataset).
Throughout this section and in the rest of the thesis I use the notation in Table 2-1
unless stated otherwise.
2.1 Generalization Error (GE)
The notion of generalization error is defined with respect to an underlying probability
distribution defined over the input-output space and a loss function (error metric). I
model this probability space with the random vector X for input and random variable
Y for output. When the input is fixed, Y(x) is the random variable that models the
output.¹ I assume in this thesis that the domain $\mathcal{X}$ of X is discrete; all the theory can
be extended to the continuous case essentially by replacing the counting measure with Lebesgue
measure and sums with integrals. Whenever the probability and expectation are with
respect to this probabilistic space (i.e. (X, Y)) that models the problem, I will not use any
index. For other probabilistic spaces, I will specify by an index which probability
space I refer to. I denote the error metric by λ(a, b); in this thesis I will use only the 0-1
metric that takes value 1 if a ≠ b and 0 otherwise. With this, the generalization error of a
classifier ζ is:
\[
GE(\zeta) = E[\lambda(\zeta(X), Y)] = P[\zeta(X) \neq Y] = \sum_{x \in \mathcal{X}} P[X = x]\, P[\zeta(x) \neq Y(x)] \qquad (2\text{--}1)
\]
where I used the fact that, for the 0-1 loss function, the expectation is the probability that
the prediction is erroneous. Notice that the notation using Y(x) is really a conditional on
X = x. I use this notation since it is intuitive and more compact. The last equation for the
generalization error is the most useful in this thesis since it decomposes a global measure,
generalization error, defined over the entire space into micro measures, one for each input.
¹ By modeling the output for a given input as a random variable, I allow the output to be randomized, as it might be in most real circumstances.
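A short sketch of this per-input decomposition (Equation 2-1) for a finite input space follows. It is only an illustration under stated assumptions: the arrays `p_x` and `p_y_given_x` encoding the joint distribution, and the predicted labels, are assumed inputs rather than anything specified in the thesis.

```python
import numpy as np

def generalization_error(p_x, p_y_given_x, predictions):
    """Exact GE of a single classifier under a known discrete distribution:
    GE = sum_x P[X=x] * P[Y(x) != zeta(x)]   (Equation 2-1, 0-1 loss).

    p_x:          (|X|,)      P[X = x] for each input x.
    p_y_given_x:  (|X|, |Y|)  P[Y(x) = y] for each input x and label y.
    predictions:  (|X|,)      label index zeta(x) predicted for each x.
    """
    # Probability that the true label disagrees with the prediction, per input x.
    p_wrong = 1.0 - p_y_given_x[np.arange(len(p_x)), predictions]
    return float(np.dot(p_x, p_wrong))
```

Under this view the GE of a classifier is just a weighted average of per-input error probabilities, which is the decomposition the moment computations below exploit.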
By carefully selecting the class of classifiers for which the moment analysis of the
generalization error is performed, meaningful and relevant probabilistic statements can
be made about the generalization error of a particular classifier from this class. The
probability distribution over the classifiers will be based on the randomness of the data
used to produce the classifier. To formalize this, let Z(N) be the class of classifiers built
over a dataset of size N with a probability space defined over it. With this, the k-th
moment around 0 of the generalization error is:
\[
E_{Z(N)}\left[GE(\zeta)^k\right] = \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta]\, GE(\zeta)^k
\]
The problem with this definition is that it talks about global characterization of
classifiers which can be hard to capture. I rewrite the formulae for the first and second
moment in terms of fine granularity structure of the classifiers.
While deriving these moments, I have to consider double expectations of the form:
$E_{Z(N)}[E[\mathcal{F}(x, \zeta)]]$ with $\mathcal{F}(x, \zeta)$ a function that depends both on the input x and the
classifier. With this I arrive at the following result:
\begin{align*}
E_{Z(N)}[E[\mathcal{F}(x, \zeta)]] &= \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \sum_{x \in \mathcal{X}} P[X = x]\, \mathcal{F}(x, \zeta) \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta]\, \mathcal{F}(x, \zeta) \\
&= E\left[E_{Z(N)}[\mathcal{F}(x, \zeta)]\right] \qquad (2\text{--}2)
\end{align*}
that uses the fact that $P[X = x]$ does not depend on a particular ζ and $P_{Z(N)}[\zeta]$ does
not depend on a particular x, even though both quantities depend on the underlying
probability distribution.
Using the definition of the moments above, Equation 2–1 and Equation 2–2 I have the
following theorem.
Theorem 1. The first and second moment of GE are given by,
\[
E_{Z(N)}[GE(\zeta)] = \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y]\, P[Y(x) \neq y]
\]
and
\begin{align*}
E_{Z(N) \times Z(N)}[GE(\zeta)\, GE(\zeta')] = \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} & P[X = x]\, P[X = x'] \;\cdot \\
\sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} & P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']\, P[Y(x) \neq y]\, P[Y(x') \neq y']
\end{align*}
Proof.
\begin{align*}
E_{Z(N)}[GE(\zeta)] &= E_{Z(N)}[E[\lambda(\zeta(X), Y)]] \\
&= E\left[E_{Z(N)}[\lambda(\zeta(X), Y)]\right] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta]\, P[\zeta(x) \neq Y(x) \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta]\, P[\zeta(x) = y,\, Y(x) \neq y \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} \sum_{\zeta \in Z(N) \mid \zeta(x) = y} P_{Z(N)}[\zeta]\, P[\zeta(x) = y,\, Y(x) \neq y \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y]\, P[Y(x) \neq y]
\end{align*}

\begin{align*}
&E_{Z(N) \times Z(N)}[GE(\zeta)\, GE(\zeta')] \\
&= E_{Z(N) \times Z(N)}\left[E[\lambda(\zeta(X), Y)]\, E[\lambda(\zeta'(X), Y)]\right] \\
&= \sum_{(\zeta, \zeta') \in Z(N) \times Z(N)} P_{Z(N) \times Z(N)}[\zeta, \zeta'] \left(\sum_{x \in \mathcal{X}} P[X = x]\, P[\zeta(x) \neq Y(x)]\right) \left(\sum_{x \in \mathcal{X}} P[X = x]\, P[\zeta'(x) \neq Y(x)]\right) \\
&= \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x]\, P[X = x'] \sum_{(\zeta, \zeta') \in Z(N) \times Z(N)} P_{Z(N) \times Z(N)}[\zeta, \zeta']\, P[\zeta(x) \neq Y(x)]\, P[\zeta'(x') \neq Y(x')] \\
&= \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x]\, P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']\, P[Y(x) \neq y]\, P[Y(x') \neq y']
\end{align*}
In both series of equations I made the transition from a summation over the class
of classifiers to a summation over the possible outputs since the focus changed from the
classifier to the prediction of the classifier for a specific input (x is fixed inside the first
summation). What this effectively does is allow the computation of moments using
only local information (behavior on particular inputs), not global information (behavior on
all inputs). This speeds up the process of computing the moments.
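The sketch below illustrates the first-moment formula of Theorem 1 for a finite input-output space. It is an illustration only: the array `p_pred`, standing for $P_{Z(N)}[\zeta(x)=y]$, is assumed to have been obtained elsewhere (the later chapters derive it for each specific algorithm).

```python
import numpy as np

def first_moment_of_ge(p_x, p_y_given_x, p_pred):
    """E_{Z(N)}[GE] = sum_x P[X=x] sum_y P_Z[zeta(x)=y] * P[Y(x) != y].

    p_x:          (|X|,)      P[X = x].
    p_y_given_x:  (|X|, |Y|)  P[Y(x) = y].
    p_pred:       (|X|, |Y|)  P_Z[zeta(x) = y] over the space of classifiers.
    """
    # For each input x: probability, over classifiers and labels, that the
    # predicted label y disagrees with the true output Y(x).
    per_input = np.sum(p_pred * (1.0 - p_y_given_x), axis=1)
    return float(np.dot(p_x, per_input))
```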
2.2 Alternative Methods for Computing the Moments of GE
The method I introduced above for computing the moments of the generalization
error is based on decomposing the moment into contributions of individual input-output
pairs. With such a decomposition, not only does the analysis become simpler, but the
complexity of the algorithm required is reduced. In particular, the complexity of
computing the first moment is proportional to the size of the input-output space and
the complexity of estimating probabilities of the form PZ [ζ(x)=y]. The complexity of the
second moment is quadratic in the size of the input-output space and proportional to the
complexity of estimating PZ [ζ(x)=y ∧ ζ(x′)=y′]. To see the advantage of this method, I
compare it with the other two alternatives for computing the moments: definition based
computation and Monte Carlo simulation.
Definition based computation uses the definition of expectation. It consists in
summing over all possible datasets and multiplying the generalization error of the classifier
built from the dataset by the probability of obtaining the dataset as an i.i.d. sample from
the underlying probability distribution. Formally,
\[
E_{D(N)}[GE(\zeta)] = \sum_{D \in \mathcal{D}(N)} P[D]\, GE(\zeta[D]) \qquad (2\text{--}3)
\]
where D(N) is the set of all possible datasets of size N . The number of possible datasets
is exponential in N with the base of the exponent proportional to the size of the
input-output space (the product of the sizes of the domains of inputs and outputs).
Evaluating the moments in this manner is impractical for all but very small spaces and
dataset sizes.
Monte Carlo simulation is a simple way to estimate moments that consists in
performing experiments to produce samples that determine the value of the generalization
error. In this case, to estimate ED(N) [GE(ζ)], datasets of size N have to be generated,
one for each sample desired. For each of these datasets a classifier has to be constructed
according to the classifier construction algorithm. For the classifier produced, samples
from the underlying probability distribution have to be generated in order to estimate
the generalization error of this classifier. Especially for second moments, the amount of
samples required will be large in order to obtain reasonable accuracy for the moments. If a
study has to be conducted in order to determine the influence of various parameters of the
data generation model, the overall number of experiments that have to be performed becomes
infeasible.
In summary, the advantages of the method I propose for estimating the moments
are: (a) the formulations are exact, (b) it needs only local behavior of the classifier, (c)
the time complexity is thus reduced, and (d) it does not depend on the fact that some of the
probabilities are small. I will use this method to compute moments of the generalization
error for the NBC, DT and KNN algorithms.
Table 2-1. Notation used throughout the thesis.
Symbol          Meaning
X               Random vector modeling input
$\mathcal{X}$   Domain of random vector (input space) X
Y               Random variable modeling output
Y(x)            Random variable modeling output for input x
$\mathcal{Y}$   Set of class labels (output space)
D               Dataset
(x, y)          Data-point from dataset D
Dt              Training dataset
Ds              Testing dataset
Di              The ith part/fold of D (for cross validation)
N               Size of dataset
Nt              Size of training dataset
Ns              Size of testing dataset
v               Number of folds of cross validation
ζ               Classifier
ζ[D]            Classifier built from dataset D
GE(ζ)           Generalization error of classifier ζ
HE(ζ)           Hold-out-set error of classifier ζ
CE(ζ)           Cross validation error of classifier ζ
Z(S)            The set of classifiers obtained by application of the classification algorithm to an i.i.d. set of size S
D(S)            Dataset of size S
$E_{Z(S)}[\,]$  Expectation w.r.t. the space of classifiers built on a sample of size S
CHAPTER 3
ANALYSIS OF MODEL SELECTION MEASURES
The exact computation of the generalization error depends on the actual underlying
probability distribution, which is unknown, and hence other estimates for the generalization
error have been introduced: hold-out set (HOS), leave-one-out (LOO), and v-fold cross
validation (CV). In this section I establish relationships between moments of
these error metrics and the moments of the generalization error with respect to some
distribution over the classifiers. The general setup for the analysis for all these metrics is
the following. A dataset D of size N is provided, containing i.i.d. samples coming from
the underlying distribution over the input and outputs. The set is further divided and
used both to build a classifier and to estimate the generalization error; the particular way
this is achieved is slightly different for each error metric. The important question I will
ask is how the values of the error metrics relate to the generalization error. In all the
developments that follow I will assume that ζ[D] is the classifier built deterministically
from the dataset D.
3.1 Hold-out Set Error
The HOS error involves randomly partitioning the dataset D into two parts Dt, the
training dataset of fixed size Nt, and Ds, the test dataset of fixed size Ns. A classifier is
built over the training dataset and the generalization error is estimated as the average
error over the test dataset. Formally, denoting the random variable that gives the HOS
error by HE I have:
\[
HE = \frac{1}{N_s} \sum_{(x, y) \in D_s} \lambda(\zeta[D_t](x), y) \qquad (3\text{--}1)
\]
where y is the actual label for the input x.
Proposition 1. The expected value of HE is given by,
$E_{D_t(N_t) \times D_s(N_s)}[HE] = E_{D_t(N_t)}[GE(\zeta[D_t])]$
Proof. Using the notation in Table 2-1 and realizing that all the datapoints are i.i.d. I
derive the above result.
\begin{align*}
E_{D_t(N_t) \times D_s(N_s)}[HE] &= E_{D_t(N_t)}\left[E_{D_s(N_s)}\left[\frac{\sum_{(x, y) \in D_s} \lambda(\zeta[D_t](x), y)}{N_s}\right]\right] \\
&= E_{D_t(N_t)}\left[E_{D_s(N_s)}[P[\zeta[D_t](x) \neq y \mid D_s]]\right] \\
&= E_{D_t(N_t)}\left[\sum_{D_s} P[\zeta[D_t](x) \neq y \mid D_s]\, P[D_s]\right] \\
&= E_{D_t(N_t)}\left[\sum_{D_s} P[\zeta[D_t](x) \neq y,\, D_s]\right] \\
&= E_{D_t(N_t)}[P[\zeta[D_t](x) \neq y]] \\
&= E_{D_t(N_t)}[GE(\zeta[D_t])]
\end{align*}
where I used the fact that by going over all values of one r.v. I get the probability of the
other.
I observe from the above result that the expected value of HE is dependent only on
the size of the training set Dt. This result is intuitive since only Nt data-points are used
for building the classifier.
Lemma 1. The second moment of HE is given by,
$E_{D_t(N_t) \times D_s(N_s)}[HE^2] = \frac{1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])] + \frac{N_s - 1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])^2]$.
Proof. To compute the second moment of HE, from the definition in Equation 3–1 I have:
\[
E_{D_t(N_t) \times D_s(N_s)}[HE^2] = \frac{1}{N_s^2} E_{D_t(N_t) \times D_s(N_s)}\left[\sum_{(x, y) \in D_s} \sum_{(x', y') \in D_s} \lambda(\zeta[D_t](x), y)\, \lambda(\zeta[D_t](x'), y')\right]
\]
The expression under the double sum depends on whether (x, y) and (x', y') are the same
or not. When they are the same, I am precisely in the case I derived for $E_{D_t(N_t) \times D_s(N_s)}[HE]$
above, except that I have $N_s^2$ in the denominator. This gives us the term
$\frac{1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])]$. When they are different, i.e. $(x, y) \neq (x', y')$, then I get
1
N2s
EDt(Nt)×Ds(Ns)
∑
(x,y)∈Ds
∑
(x′,y′)∈Ds\(x,y)
λ(ζ[Dt](x), y)λ(ζ[Dt](x′), y′)
=Ns − 1
Ns
EDt(Nt)×Ds(Ns)
[∑(x,y)∈Ds
λ(ζ[Dt](x), y)
Ns
∑(x′,y′)∈Ds\(x,y) λ(ζ[Dt](x
′), y′)
Ns − 1
]
=Ns − 1
Ns
EDt(Nt)×Ds(Ns) [P [ζ[Dt](x) 6= y|(x, y) ∈ Ds]P [ζ[Dt](x′) 6= y′|(x′, y′) ∈ Ds \ (x, y)]]
=Ns − 1
Ns
EDt(Nt) [EDs [P [ζ[Dt](x) 6= y|(x, y) ∈ Ds]] EDs [P [ζ[Dt](x′) 6= y′|(x′, y′) ∈ Ds \ (x, y)]]]
=Ns − 1
Ns
EDt(Nt)
[GE(ζ[Dt])
2]
where I used the primary fact that since the samples are i.i.d. any function applied on two
distinct inputs is also independent. This the reason why the EDs [] factorizes.
Putting everything together, and observing that terms inside summations are
constants, I have:
EDt(Nt)×Ds(Ns)
[HE2
]=
1
Ns
EDt(Nt) [GE(ζ[Dt])] +Ns − 1
Ns
EDt(Nt)
[GE(ζ[Dt])
2]
Theorem 2. The variance of HE is given by

Var_{D_t(N_t) \times D_s(N_s)}(HE) = \frac{1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])] + \frac{N_s - 1}{N_s} E_{D_t(N_t)}[GE(\zeta[D_t])^2] - E_{D_t(N_t)}[GE(\zeta[D_t])]^2.
Proof. The proof follows immediately from Proposition 1 and Lemma 1 by using the
formula for the variance of a random variable, in this case HE: Var(HE) = E[HE^2] - E[HE]^2.

Unlike the first moment, the variance depends on the sizes of both the training set and
the test set.
3.2 Multifold Cross Validation Error
v-fold cross validation consists of randomly partitioning the available data into v
equal-sized parts, training v classifiers, each using all the data but one chunk, and then
testing the performance of each classifier on the chunk it did not see. The estimate of the
generalization error of the classifier built from the entire data is the average error over the
chunks. Using the notation in this thesis and denoting by Di the ith chunk of the dataset
D, the cross validation error (CE) is:
CE = \frac{1}{v} \sum_{i=1}^{v} HE_i
Notice that I expressed CE in terms of HE, the HOS error. By substituting the formula
for HE from Equation 3–1 into the above equation, a direct definition of CE is obtained, if
desired.
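A minimal sketch of this computation is given below; build_classifier is a hypothetical training routine returning a callable classifier, and the fold assignment is a simple random partition.

```python
import numpy as np

def cross_validation_error(build_classifier, X, y, v, seed=0):
    """v-fold CV error: the average of the per-fold hold-out errors HE_i."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), v)
    fold_errors = []
    for i in range(v):
        train_idx = np.concatenate([folds[j] for j in range(v) if j != i])
        clf = build_classifier(X[train_idx], y[train_idx])
        predictions = np.array([clf(x) for x in X[folds[i]]])
        fold_errors.append(np.mean(predictions != y[folds[i]]))
    return float(np.mean(fold_errors))
```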
In this case I have a classifier for each chunk, not a single classifier for the entire
data. I model the selection of N i.i.d. samples that constitute the dataset D and the
partitioning into v chunks. With this I have:
Proposition 2. The expected value of CE is given by

E_{D_t(\frac{v-1}{v}N) \times D_i(\frac{N}{v})}[CE] = E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])]
Proof. Using the above equation and Proposition 1, I get the result:

E_{D_t(\frac{v-1}{v}N) \times D_i(\frac{N}{v})}[CE] = \frac{1}{v} \sum_{i=1}^{v} E_{D_t(\frac{v-1}{v}N) \times D_i(\frac{N}{v})}[HE_i]
 = \frac{1}{v} \sum_{i=1}^{v} E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])]
 = E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])]
This result follows intuition, since it states that the expected error is the generalization
error of a classifier trained on \frac{v-1}{v}N data-points. Thus, at least in expectation, cross
validation behaves exactly like a HOS estimate whose classifier is trained over v − 1 chunks.
For CE I could compute the second moment around zero using the strategy
previously shown and then compute the variance. Here, I compute the variance using
the relationship between the variance of the sum and the variances and covariances of
individual terms. In this way I can decompose the overall variance of CE into the sum of
variances of individual estimators and the sum of covariances of pairs of such estimators;
this decomposition significantly enhances the understanding of the behavior of CE, as I
will see in the example in Section 4.1.
Var(CE) = \frac{1}{v^2}\left( \sum_{i=1}^{v} Var(HE_i) + \sum_{i \neq j} Cov(HE_i, HE_j) \right)    (3–2)
The quantity Var(HE_i), the variance of the HE on training data of size \frac{v-1}{v}N and test
data of size \frac{1}{v}N, is computed using the formulae in the previous section. The only things
that remain to be computed are the covariances. Since I already computed the expectation
of HE, to compute the covariance it is enough to compute the quantity Q = E[HE_i HE_j]
(since for any two random variables X, Y I have Cov(X, Y) = E[XY] - E[X] E[Y]). Let
D_t^i denote D \setminus D_i and let N_s = N/v. With this I have the following lemma,
Lemma 2.

E_{D_t^i(\frac{v-1}{v}N) \times D_i(\frac{N}{v}) \times D_t^j(\frac{v-1}{v}N) \times D_j(\frac{N}{v})}[HE_i\, HE_j] = E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N)}\big[GE(\zeta[D \setminus D_i])\, GE(\zeta[D \setminus D_j])\big].
Proof.

E_{D_t^i(\frac{v-1}{v}N) \times D_i(\frac{N}{v}) \times D_t^j(\frac{v-1}{v}N) \times D_j(\frac{N}{v})}[HE_i\, HE_j]
 = E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N) \times D_i(\frac{N}{v})}\left[ HE_i\, E_{D_j(\frac{N}{v})}\left[ \frac{\sum_{(x_j, y_j) \in D_j} \lambda(\zeta[D_t^j](x_j), y_j)}{N_s} \right] \right]
 = E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N)}\left[ GE(\zeta[D \setminus D_j])\, E_{D_i(\frac{N}{v})}[HE_i] \right]
 = E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N)}\big[ GE(\zeta[D \setminus D_j])\, GE(\zeta[D \setminus D_i]) \big]
where I used the fact that the datasets D_i and D_j are disjoint and drawn i.i.d. It is
important to observe that, due to the fact that D \setminus D_i and D \setminus D_j intersect (the intersection
is D \setminus (D_i \cup D_j)), the two classifiers will neither be independent nor identical. As was the
case for the first and second moments of GE, this moment will depend only on the size of
the intersection and the sizes of the two sets, since all points are i.i.d. This means that the
expression has the same value for any pair i, j with i \neq j.
Theorem 3. The variance of CE is given by

Var_{D_t^i(\frac{v-1}{v}N) \times D_i(\frac{N}{v}) \times D_t^j(\frac{v-1}{v}N) \times D_j(\frac{N}{v})}(CE) = \frac{1}{N} E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])] + \frac{N - v}{vN} E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])^2] - \frac{v - 2}{v} E_{D_t(\frac{v-1}{v}N)}[GE(\zeta[D_t])]^2 + \frac{v - 1}{v} E_{D_t^i(\frac{v-1}{v}N) \times D_t^j(\frac{v-1}{v}N)}\big[GE(\zeta[D \setminus D_j])\, GE(\zeta[D \setminus D_i])\big]
Proof. The expression for the covariance is immediate from the above result and so using
Equation 3–2 I derive the variance of CE.
It is worth mentioning that leave-one-out (LOO) is just a special case of v-fold cross
validation (v = N for leave-one-out). The formulae above apply to LOO as well; thus no
separate analysis is necessary.
With this I have related the first two moments of HE and CE to those of GE. Hence,
if I can compute the moments of GE I can also compute the moments of HE and CE,
allowing us to study the model as well as the selection measures. In the next couple of
chapters I thus focus our attention on computing the moments of GE efficiently for the
following classification models – NBC, DT and KNN.
CHAPTER 4
NAIVE BAYES CLASSIFIER, SCALABILITY AND EXTENSIONS
4.1 Example: Naive Bayes Classifier
The results (i.e. expressions and relationships) I derived in the previous chapter were
generic, applicable to any deterministic classification algorithm. I can thus use those results
to study the behavior of the errors for a classification algorithm of our choice.
The classification algorithm I consider in this chapter is naive Bayes. I first study the
naive Bayes classifier for a single input attribute (i.e. for one dimension) and later the
generalized version, maintaining scalability. As I will see, these moments are too complicated,
as mathematical formulae, to interpret. I will plot these moments to gain an understanding
of the behavior of the errors under different conditions, thus portraying the usefulness of
the proposed method.
4.1.1 Naive Bayes Classifier Model (NBC)
In order to compute the moments of the generalization error, I first have to select
a classifier and specify the construction method. I selected the single input naive Bayes
classifier since the analysis is not too complicated but highlights both the method and the
difficulties that have to be overcome. As I will see, even this simplified version exhibits
interesting behavior. I fix the number of class labels to 2 as well. In the next section, I
discuss how the analysis I present here extends to the general NBC.
Given values for any of the inputs, the NBC computes the probability to see any
of the class labels as output under the assumption that the inputs influence the output
independently. The prediction is the class label that has the largest such estimated
probability. For the version of the naive Bayes classifier I consider here (i.e. a single input),
the prediction given input x is:
\zeta(x) = \arg\max_{k \in \{1,2\}} P[Y = y_k]\, P[X = x \mid Y = y_k]
The probabilities that appear in the formula are estimated using the counts in the
contingency table in Table 4-1. Using the fact that P[Y = y_k]\, P[X = x \mid Y = y_k] =
P[X = x \wedge Y = y_k] and the fact that P[X = x_i \wedge Y = y_k] is \frac{N_{ik}}{N}, the prediction of the
classifier is:

\zeta(x_i) = \begin{cases} y_1 & \text{if } N_{i1} \geq N_{i2} \\ y_2 & \text{if } N_{i1} < N_{i2} \end{cases}
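The decision rule reduces to a comparison of two counts per row of the contingency table; a small illustrative sketch (with a hypothetical counts array) is:

```python
import numpy as np

def nbc_predict_single_input(counts, i):
    """Prediction of the single-input, two-class NBC from the contingency table
    counts[i, k] = N_ik (rows: attribute values x_i, columns: classes y_1, y_2).
    Ties are broken in favour of y_1, matching the decision rule above."""
    return 1 if counts[i, 0] >= counts[i, 1] else 2

# Example: a 3-valued attribute; row 1 (0-indexed) favours class y_2.
counts = np.array([[5, 2], [1, 4], [3, 3]])
assert nbc_predict_single_input(counts, 1) == 2
assert nbc_predict_single_input(counts, 2) == 1
```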
4.1.2 Computation of the Moments of GE
Under the already stated data generation model, the moments of the generalization
error for the NBC can be computed. I now present three approaches for computing the
moments and show that the approach using theorem 1 is by far the most practical.
Going over datasets: If I calculate the moments by going over all possible datasets,
the number of terms in the formulation of the moments is exponential in the number
of attribute values, with the base of the exponential being the size of the dataset (i.e.
O(N^n) terms). This is because each of the cells in Table 4-1 can take O(N) values. The
formulation of the first moment would be as follows.
E_{D(N)}[GE(\zeta(D(N)))] = \sum_{N_{11}=0}^{N} \sum_{N_{12}=0}^{N - N_{11}} \cdots \sum_{N_{n1}=0}^{N - (N_{11} + \cdots + N_{(n-1)2})} e\, P[N_{11}, \ldots, N_{n2}]    (4–1)
where e is the corresponding error of the classifier. I see that this formulation can be
tedious to deal with. So can I do better? Yes, I definitely can, and this stems from the
following observation. For the NBC built on Table 4-1, all I care about in the classification
process is the relative counts in each of the rows. Thus, if I had to classify a datapoint
with attribute value x_i I would classify it into class y_1 if N_{i1} > N_{i2} and vice-versa. What
this means is that, irrespective of the actual counts of N_{i1} and N_{i2}, as long as N_{i1} > N_{i2}
the classification algorithm would make the same prediction, i.e. I would have the same
classifier. I can hence switch from going over the space of all possible datasets to going
over the space of all possible classifiers, with the advantage of reducing the number of
terms.
Going over classifiers: If I find the moments by going over the space of possible
classifiers I reduce the number of terms from O(N^n) to O(2^n). This is because there are
only two possible relations between the counts in any row (≥ or <). The formulation for
the first moment would then be as follows.
E_{Z(N)}[GE(\zeta)] = e_1 P[N_{11} \geq N_{12}, \ldots, N_{n1} \geq N_{n2}] + e_2 P[N_{11} < N_{12}, \ldots, N_{n1} \geq N_{n2}] + \cdots + e_{2^n} P[N_{11} < N_{12}, \ldots, N_{n1} < N_{n2}]
where e_1, e_2, \ldots, e_{2^n} are the corresponding errors. Though this formulation reduces the
complexity significantly, since N ≫ 2 for any practical scenario, the number of terms is
still exponential in n. Can I still do better? The answer is yes again. Here is where
Theorem 1 gains prominence. To restate, Theorem 1 says that while calculating the first
moment I just need to look at particular rows of the table, and to calculate the second
moment just pairs of rows. This reduces the complexity significantly without compromising
on the accuracy, as I will see.
Going over classifiers using Theorem 1: If I use Theorem 1 the number of terms
reduces from an exponential in n to a small polynomial in n. Thus the number of terms
in finding the first moment is just O(n) and that for the second moment is O(n^2). The
formulation for the first moment would then be as follows.
E_{Z(N)}[GE(\zeta)] = e_1 P[N_{11} \geq N_{12}] + e_2 P[N_{11} < N_{12}] + \cdots + e_{2n} P[N_{n1} < N_{n2}]
where e_1, e_2, \ldots, e_{2n} are the corresponding errors. They are basically the respective cell
probabilities of the multinomial (i.e. e_1 is the probability of a datapoint belonging to the cell
x_1C_2, e_2 is the probability of belonging to x_1C_1, and so on). For the second moment I would have
joint probabilities, with the expression being the following,
E_{Z(N)}[GE(\zeta)^2] = (e_1 + e_3) P[N_{11} \geq N_{12}, N_{21} \geq N_{22}] + (e_2 + e_3) P[N_{11} < N_{12}, N_{21} \geq N_{22}] + \cdots + (e_{2n-2} + e_{2n}) P[N_{(n-1)1} < N_{(n-1)2}, N_{n1} < N_{n2}]
I have thus reduced the number of terms from O(N^n) to O(n^k), where k is small and
depends on the order of the moment I am interested in. This formulation has another
advantage. The complexity of calculating the individual probabilities is also significantly
reduced. The probabilities for the first moment can be computed in O(N^2) time and those
for the second in O(N^4) time, rather than O(N^{n-1}) and O(N^{2n-2}) time respectively.
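As an illustration of the O(n · N^2) computation, the sketch below evaluates the first-moment formulation directly: for each row the pair (N_i1, N_i2) is marginally a trinomial, which the double sum enumerates. The function name and array layout are assumptions made for the example.

```python
import numpy as np
from math import comb

def first_moment_ge(p, N):
    """First moment of GE for the single-input, two-class NBC via the O(n * N^2)
    formulation above.  p[i, k] = P[X = x_i, Y = y_k]; N is the training-set size."""
    p = np.asarray(p, dtype=float)
    moment = 0.0
    for i in range(p.shape[0]):
        p1, p2 = p[i, 0], p[i, 1]
        rest = 1.0 - p1 - p2
        # P[N_i1 >= N_i2]: the pair (N_i1, N_i2) is marginally trinomial(N; p1, p2, rest)
        prob_ge = 0.0
        for a in range(N + 1):
            for b in range(N - a + 1):
                if a >= b:
                    prob_ge += comb(N, a) * comb(N - a, b) * p1**a * p2**b * rest**(N - a - b)
        # error probability for row i: p2 if the classifier predicts y_1, p1 otherwise
        moment += p2 * prob_ge + p1 * (1.0 - prob_ge)
    return moment
```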
Further optimizations can be done by identifying independence between random
variables, expressing the probabilities as binomial cdfs, and using the regularized incomplete
beta function to calculate these cdfs in essentially constant time. In fact, in later sections
I discuss the general NBC model, for which the cdfs (probabilities) cannot be computed
directly as it turns out to be too expensive. There I propose strategies to efficiently
compute these probabilities. The same strategies can be used here to make the computation
more scalable.
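For instance, a binomial cdf can be evaluated in essentially constant time through the regularized incomplete beta function; a minimal SciPy sketch, using the standard identity P[Bin(n, p) ≤ k] = I_{1-p}(n-k, k+1), is shown below.

```python
from scipy.special import betainc

def binom_cdf(k, n, p):
    """P[Bin(n, p) <= k] via the regularized incomplete beta function."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return betainc(n - k, k + 1, 1.0 - p)
```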
The situation when ζ and ζ′ are the classifiers constructed for two different folds in
cross validation requires special treatment. Without loss of generality, assume that the
classifiers are built for folds 1 and 2. If I let D_1, \ldots, D_v be the partitioning of the
dataset into v parts, ζ is constructed using D_2 \cup D_3 \cup \cdots \cup D_v and ζ′ is constructed using
D_1 \cup D_3 \cup \cdots \cup D_v; thus the training data D_3 \cup \cdots \cup D_v is common to both. If I denote by N_{jk}
the number of data-points with X = x_j and Y = y_k in this common part, and by N_{jk}^{(1)} and
N_{jk}^{(2)} the number of such data-points in D_1 and D_2 respectively, then I have to compute
probabilities of the form,

P\left[\left(N_{i1}^{(2)} + N_{i1} > N_{i2}^{(2)} + N_{i2}\right) \wedge \left(N_{j1}^{(1)} + N_{j1} > N_{j2}^{(1)} + N_{j2}\right)\right]
The estimation of this probability using the above method requires fixing the values of 6
random variables, thus giving an O(N^6) algorithm. Again, further optimizations can be
carried out using the strategies suggested later.
Using the moments of GE, the moments of HE and CE are found using relationships
already derived.
4.2 Full-Fledged NBC
In the previous section I discussed the NBC built on data in a single dimension.
As the dimensionality increases, the cost of exactly computing the moments from the
formulations also increases. To maintain the scalability of the method, I propose a number
of approximation schemes which can be used to estimate the probabilities efficiently
and accurately. As I will see, approximating these probabilities leads to highly accurate
estimates of the moments for low computational cost, as against directly using Monte
Carlo. The approximation schemes I propose assist in efficient computation of the
probabilities in arbitrary dimension. As a matter of fact the approximation schemes
are generic enough to be applied to any application where cdfs need to be approximated
efficiently.
4.2.1 Calculation of Basic Probabilities
Having come up with the probabilistic formulation for discerning the moments of the
generalization error, I am now faced with the daunting task of efficiently calculating the
probabilities involved for the NBC when the number of dimensions is more than 1. In this
section I will mainly discuss single probabilities; the extension to joint probabilities is in
Section 4.4. Let us now briefly preview the kind of probabilities I need to decipher.
With reference to Figure 4-1, considering the cell x1y1 without loss of generality
(w.l.o.g.) and by the Naive Bayes classifier independence assumption, I need to find the
probability of the following condition being true for the 2-dimensional case,
p_{c_1}\, \frac{p_{11}^{x}}{p_{c_1}}\, \frac{p_{11}^{y}}{p_{c_1}} > p_{c_2}\, \frac{p_{12}^{x}}{p_{c_2}}\, \frac{p_{12}^{y}}{p_{c_2}}
\quad \text{i.e. } p_{c_2}\, p_{11}^{x}\, p_{11}^{y} > p_{c_1}\, p_{12}^{x}\, p_{12}^{y}
\quad \text{i.e. } N_2\, N_{11}^{x}\, N_{11}^{y} > N_1\, N_{12}^{x}\, N_{12}^{y}
In general, for the d-dimensional (d ≥ 2) case I have to find the following probability,

P\left[N_2^{(d-1)}\, N_{11}^{x_1} N_{11}^{x_2} \cdots N_{11}^{x_d} > N_1^{(d-1)}\, N_{12}^{x_1} N_{12}^{x_2} \cdots N_{12}^{x_d}\right]    (4–2)
where the xi are random variables.
4.2.2 Direct Calculation
I can find the probability P[N_2^{(d-1)} N_{11}^{x_1} \cdots N_{11}^{x_d} > N_1^{(d-1)} N_{12}^{x_1} \cdots N_{12}^{x_d}] by summing
over all possible assignments of the multinomial random variables involved. For the
2-dimensional case shown in Figure 4-1 I have,
P[N_2 N_{11}^{x} N_{11}^{y} > N_1 N_{12}^{x} N_{12}^{y}] = \sum_{N_{111}} \sum_{N_{121}} \sum_{N_{211}} \sum_{N_{112}} \sum_{N_{122}} \sum_{N_{212}} \sum_{N_{222}} P[N_{111}, N_{121}, N_{211}, N_{112}, N_{122}, N_{212}, N_{221}, N_{222}] \cdot I[N_2 N_{11}^{x} N_{11}^{y} > N_1 N_{12}^{x} N_{12}^{y}]
where N_2 = N_{112} + N_{122} + N_{212} + N_{222}, N_{11}^{x} = N_{111} + N_{121}, N_{11}^{y} = N_{111} + N_{211}, N_1 = N - N_2,
N_{12}^{x} = N_{112} + N_{122}, N_{12}^{y} = N_{112} + N_{212}, and I[condition] = 1 if the condition is true, else
I[condition] = 0. Each of the summations takes O(N) values and so the worst case time
complexity is O(N^7). I thus observe that for the simple scenario depicted, the time to
compute the probabilities is unreasonable even for small datasets (N = 100, say). The
number of summations increases linearly with the dimensionality of the space. Hence, the
time complexity is exponential in the dimensionality. I thus need to resort to approximations
to speed up the process.
4.2.3 Approximation Techniques
If all the moments of a random variable are known, then I know the moment
generating function (MGF) of the random variable and, as a consequence, the probability
generating function and hence the precise cdf for any value in the domain of the random
variable. If only a subset of the moments is known then I can at best approximate the
MGF and so the cdf.
I need to compute probabilities (cdfs) of the form P[X > 0], where X =
N_2^{(d-1)} N_{11}^{x_1} N_{11}^{x_2} \cdots N_{11}^{x_d} - N_1^{(d-1)} N_{12}^{x_1} N_{12}^{x_2} \cdots N_{12}^{x_d}. Most of the alternative approximation
techniques I propose in the subsections that follow, to efficiently compute the above
probabilities (cdfs), are based on the fact that I have knowledge of some finite subset of the
moments of the random variable X. I now elucidate a method to obtain these moments.
Derivation of Moments: As previously mentioned the most general data generation
model for the discrete case is the multinomial distribution. I know the moment generating
function for it. A moment generating function generates all the moments of a random
variable, uniquely defining its distribution. The MGF of a multivariate distribution is
defined as follows,
M_R(t) = E(e^{R' t})    (4–3)

where R is a q-dimensional random vector, R' is the transpose of R and t ∈ R^q. In our case q
is the number of cells in the multinomial.
Taking different order partial derivatives of the moment generating function w.r.t. the
elements of t and setting those elements to zero, gives us moments of the product of the
random variables in the multinomial raised to those orders. Formally,
\frac{\partial^{v_1 + v_2 + \cdots + v_q} M_R(t)}{\partial t_1^{v_1} \partial t_2^{v_2} \cdots \partial t_q^{v_q}} \bigg|_{t_1 = t_2 = \cdots = t_q = 0} = E(R_1^{v_1} R_2^{v_2} \cdots R_q^{v_q})    (4–4)

where R' = (R_1, R_2, \ldots, R_q), t = (t_1, t_2, \ldots, t_q) and v_1, v_2, \ldots, v_q are the orders of the partial
derivatives w.r.t. t_1, t_2, \ldots, t_q respectively.
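A small symbolic sketch of Eq. 4-4, using SymPy to differentiate the multinomial MGF; the function name and argument conventions are assumptions of the example, not part of the thesis.

```python
import sympy as sp

def multinomial_moment(p, N, orders):
    """E[R_1^{v_1} ... R_q^{v_q}] for multinomial counts R ~ Mult(N, p), obtained by
    differentiating the MGF M_R(t) = (sum_j p_j e^{t_j})^N as in Eq. 4-4."""
    q = len(p)
    t = sp.symbols(f"t0:{q}")
    mgf = sum(p_j * sp.exp(t_j) for p_j, t_j in zip(p, t)) ** N
    deriv = mgf
    for t_j, v_j in zip(t, orders):
        if v_j:
            deriv = sp.diff(deriv, t_j, v_j)
    return sp.simplify(deriv.subs({t_j: 0 for t_j in t}))

# Example: E[R_1] for Mult(N=10, p=(1/2, 1/3, 1/6)) equals N * p_1 = 5.
print(multinomial_moment([sp.Rational(1, 2), sp.Rational(1, 3), sp.Rational(1, 6)], 10, (1, 0, 0)))
```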
The expressions for these derivatives can be precomputed or computed at run time
using tools such as Mathematica (Wolfram Research). But how does all of what I have
just discussed relate to our problem? Consider the 2-dimensional case given in Figure
4-1. I need to find the probability P[Z > 0] where Z = N_2 N_{11}^{x} N_{11}^{y} - N_1 N_{12}^{x} N_{12}^{y}. The
individual terms in the product can be expressed as a sum of certain random variables
in the multinomial. Thus Z can be written as a sum of products of some of the
multinomial random variables. Consider the first term in Z,

N_2 N_{11}^{x} N_{11}^{y} = (N_{112} + N_{122} + N_{212} + N_{222})(N_{111} + N_{121})(N_{111} + N_{211}) = N_{112} N_{111}^2 + \cdots + N_{222} N_{121} N_{211}

The second term can also be expressed in this form. Thus Z can be written as a sum of
products of the multinomial random variables.
E[Z] = E[N_2 N_{11}^{x} N_{11}^{y} - N_1 N_{12}^{x} N_{12}^{y}]
 = E[N_2 N_{11}^{x} N_{11}^{y}] - E[N_1 N_{12}^{x} N_{12}^{y}]
 = E[N_{112} N_{111}^2 + \cdots + N_{222} N_{121} N_{211}] - E[N_{111} N_{112}^2 + \cdots + N_{221} N_{122} N_{212}]
 = E[N_{112} N_{111}^2] + \cdots + E[N_{222} N_{121} N_{211}] - E[N_{111} N_{112}^2] - \cdots - E[N_{221} N_{122} N_{212}]
In the general case Z = N_2^{d-1}\, N_{11\ldots1}^{x_1} \cdots N_{11\ldots1}^{x_d} - N_1^{d-1}\, N_{11\ldots12}^{x_1} \cdots N_{11\ldots12}^{x_d}, where each subscript
of N with dots has d + 1 numbers. The expected value of Z is then given by,

E[Z] = E[N_{11\ldots12}\, N_{11\ldots1}^{d}] + \cdots + E[N_{m_1 m_2 \ldots m_d 2}\, N_{11\ldots m_d 1} \cdots N_{m_1 1\ldots1}] - E[N_{11\ldots11}\, N_{11\ldots2}^{d}] - \cdots - E[N_{m_1 m_2 \ldots m_d 1}\, N_{11\ldots m_d 2} \cdots N_{m_1 1\ldots12}]

where m_i denotes the number of attribute values of x_i. These expectations can be
computed using the technique in the discussion before. Higher moments can also be found
in the same vein since I would only need to find expectations of higher degree polynomials
in the random variables of the multinomial. Similarly, the expressions for the moments in
higher dimensions will also include higher degree polynomials.
4.2.3.1 Series approximations (SA)
The Edgeworth or the Gram-Charlier A series (Hall [1992]) are used to approximate
distributions of random variables whose moments, or more specifically cumulants, are
known. These expansions consist in writing the characteristic function of the unknown
distribution, whose probability density is to be approximated, in terms of the characteristic
function of another known distribution (usually normal). The density to be found is then
recovered by taking the inverse Fourier transform.
Let p_{uc}(t), p_{ud}(x) and κ_i be the characteristic function, probability density function
and the ith cumulant of the unknown distribution respectively. And let p_{kc}(t), p_{kd}(x) and
γ_i be the characteristic function, probability density function and the ith cumulant of the
known distribution respectively. Hence,
p_{uc}(t) = \exp\!\left[\sum_{a=1}^{\infty} (\kappa_a - \gamma_a) \frac{(it)^a}{a!}\right] p_{kc}(t)

p_{ud}(x) = \exp\!\left[\sum_{a=1}^{\infty} (\kappa_a - \gamma_a) \frac{(-D)^a}{a!}\right] p_{kd}(x)
where D is the differential operator. If p_{kd}(x) is a normal density then I arrive at the
following expansion,

p_{ud}(x) = \frac{1}{\sqrt{2\pi\kappa_2}}\, e^{-\frac{(x-\kappa_1)^2}{2\kappa_2}} \left[1 + \frac{\kappa_3}{\kappa_2^{3/2}}\, H_3\!\left(\frac{x - \kappa_1}{\sqrt{\kappa_2}}\right) + \cdots\right]    (4–5)
where H_3(x) = (x^3 - 3x)/3! is the 3rd Hermite polynomial. This method works reasonably
well in practice, as can be seen in Levin [1981] and Butler and Sutton [1998]. The major
challenge, though, lies in choosing a distribution that will approximate the unknown
distribution "well", as the accuracy of the cdf estimate depends on this. The performance
of the method may vary significantly with the choice of this distribution, since choosing the
normal distribution may not always give satisfactory results. This task of choosing an
appropriate distribution is non-trivial.
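A minimal sketch of the truncated expansion in Eq. 4-5, assuming the first three cumulants of the unknown distribution are available and a normal reference density is used:

```python
import numpy as np

def gram_charlier_pdf(x, k1, k2, k3):
    """Truncated Gram-Charlier A series density (Eq. 4-5) built from the first
    three cumulants (k1, k2, k3) with a normal reference density."""
    z = (x - k1) / np.sqrt(k2)
    he3 = z**3 - 3.0 * z                                  # probabilist's Hermite polynomial
    normal = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi * k2)
    return normal * (1.0 + (k3 / k2**1.5) * he3 / 6.0)   # H_3(z) = (z^3 - 3z)/3!
```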
4.2.3.2 Optimization
I have just seen a method of approximating the cdf using series expansions.
Interestingly, this problem can also be framed as an optimization problem, wherein I
find upper and lower bounds on the possible values of the cdf by optimizing over the set
of all possible distributions having these moments. Since our unknown distribution is an
element of this set, its cdf will lie within the bounds computed. This problem is called
the classical moment problem and has been studied in the literature (Isii [1960, 1963], Karlin
and Shapely [1953]). In fact, when up to 3 moments are known, there are closed-form solutions for the
bounds (Prekopa [1989]). In the material that follows, I present the optimization problem
in its primal and dual form. I then explore strategies for solving it, given the fact that the
most obvious ones can prove to be computationally expensive.
Assume that I know m moments of the discrete random variable X, denoted by
\mu_1, \ldots, \mu_m, where \mu_j is the jth moment. The domain of X is given by U = \{x_0, x_1, \ldots, x_n\}.
P[X = x_r] = p_r where r \in \{0, 1, \ldots, n\} and \sum_r p_r = 1. I only discuss the maximization
version of the problem (i.e. finding the upper bound) since the minimization version (i.e.
finding the lower bound) has an analogous description. Thus, in the primal space I have
the following formulation,
the following formulation,
max P [X <= xr] =∑r
i=0 pi, r ≤ n
subject to :∑n
i=0 pi = 1∑n
i=0 xipi = µ1
···
∑ni=0 xm
i pi = µm
pi ≥ 0, ∀ i ≤ n
Solving the above optimization problem gives us an upper bound on P[X \le x_r].
On inspecting the formulation of the objective and the constraints I notice that it is
a Linear Programming (LP) problem with m + 1 equality constraints and 1 inequality
constraint. The number of optimization variables (i.e. the p_r) is equal to the size of
the domain of X, which can be large. For example, in the 2-dimensional case when
X = N_2 N_{11}^{x} N_{11}^{y} - N_1 N_{12}^{x} N_{12}^{y}, X takes O(N^2) values (as N_1 constrains N_2 given N,
and N_1 also constrains the maximum values of N_{11}^{x}, N_{11}^{y}, N_{12}^{x} and N_{12}^{y}). Thus with N
as small as 100 I already have around 10000 (ten thousand) variables. In an attempt to
monitor the explosion in computation, I derive the dual formulation of our problem.
A dual is a complementary problem, and solving the dual is usually easier than solving
the primal since the dual is always convex for primal maximization problems (concave
for minimization) irrespective of the form of the primal. For most convex optimization
problems (which include LP) the optimal solution of the dual is the optimal solution
of the primal, technically speaking only if Slater's conditions are satisfied (i.e. the
duality gap is zero). Maximization (minimization) problems in the primal space map to
minimization (maximization) problems in the dual space. For a maximization LP problem
in standard form the primal and dual have the following form,
Primal:
    max  c^T x
    subject to:  Ax = b, \ x \ge 0

Dual:
    min  b^T y
    subject to:  A^T y \ge c

where y is the dual variable.
For our LP problem the dual is the following,

min  \sum_{k=0}^{m} y_k \mu_k
subject to:  \sum_{k=0}^{m} y_k x^k - 1 \ge 0, \ \forall\, x \in W
             \sum_{k=0}^{m} y_k x^k \ge 0, \ \forall\, x \in U
where the y_k represent the dual variables and W represents the subset of U over which the
cdf is computed. I observe that the number of variables is reduced to just m + 1 in the
dual formulation, but the number of constraints has increased to the size of the domain
of X. I now propose some strategies to solve this optimization problem, discuss their
shortcomings, and eventually suggest the strategy preferred by us.
Using Standard Linear Programming Solvers: I have a linear programming
problem whose domain is discrete and finite. On careful inspection of our problem
I observe that the number of variables in the primal formulation and the number of
constraints in the dual increase exponentially with the dimensionality of the space (i.e. the
domain of the r.v. X). Though current state-of-the-art LP solvers (using interior point
methods) can solve linear optimization problems with thousands of variables
and constraints rapidly, our problem can exceed these counts by a significant margin
even for moderate dataset sizes and reasonable dimension, thus becoming computationally
intractable. Since standard methods for solving this LP can prove to be inefficient, I
investigate other possibilities.
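For small domains, the primal LP above can nevertheless be handed to an off-the-shelf solver directly; a minimal sketch using SciPy's linprog is given below, where the function name and the raw-moment convention are assumptions of the example.

```python
import numpy as np
from scipy.optimize import linprog

def cdf_upper_bound(domain, moments, x_r):
    """Upper bound on P[X <= x_r] over all distributions on `domain` that match the
    given raw moments mu_1..mu_m (the primal LP above).  linprog minimizes, so the
    objective is negated."""
    domain = np.asarray(domain, dtype=float)
    m = len(moments)
    c = -(domain <= x_r).astype(float)                 # maximize mass at or below x_r
    A_eq = np.vstack([domain**j for j in range(m + 1)])  # total mass + moment constraints
    b_eq = np.concatenate([[1.0], moments])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return -res.fun
```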
In the next three approaches I extend the domain of the random variable X to
include all the integers between the extremities of the original domain. The current
domain is thus a superset of the original domain and so are the possible distributions.
Another way of looking at it is that, in the space of y, I have a superset of the original set of
constraints. Thus the upper bound calculated in this scenario will be greater than or
equal to the upper bound of the original problem. This extension is done to enhance the
performance of the next two approaches, since it sidesteps the problem of explicitly
enumerating the domain of X, and it is a requirement for the third, as I will soon see.
Gradient descent with binary search (GD): I use gradient descent on the dual to
find new values of the vector y = [y_0, \ldots, y_m], with a reasonably large step length since the
objective to be minimized is an affine function; this sidesteps the problem of choosing a
large step size when optimizing convex functions. Fixing y and assuming x to be continuous,
the two expressions representing the inequalities above can be viewed as polynomials in
x. The basic intuition in the method I propose is that if the polynomials are always
non-negative within the domain of X then all the constraints are satisfied; else some of the
constraints are violated. To check if the polynomials are always non-negative I find their
roots and perform the following checks. The polynomials will change sign only at their
roots and hence I need to carefully examine their behavior at these points. Here are the
details of the algorithm.
I check if the roots of the polynomial lie within the extremities of the domain of X.

1. If they do not, then I check whether the value of the polynomial at any point within
this range satisfies the inequalities. If the inequalities are satisfied I jump to the next y
using gradient descent, storing the current value of y in place of the previously stored one
(if it exists). If the inequalities are violated I reject the value of y and perform a binary
search between this value and the previous legal value of y along the gradient, until I reach
the value that minimizes the objective while satisfying the constraints.

2. If they do, then I check the value of the constraints at the two extremities. If they are
satisfied and there exists only one root in the range, I store this value of y and go on to the
next. If there are multiple roots then I check whether consecutive roots have any integral
values between them. If not, I again store this value of y and move to the next. Else I
verify, for a point between the roots, whether the constraints are satisfied, based on which
I either store or reject the value. On rejecting, I perform the same binary-search procedure
mentioned above.
Checking if consecutive roots of the polynomial have values in the domain of X is
where the extension of the domain to include all integers between the extremities helps in
enhancing performance. In the absence of this extension I would need to find whether a
particular set of integers lies in the domain of X. This operation is expensive for large
domains. But with the extension all the above operations can be performed efficiently.
Finding roots of polynomials can be done extremely efficiently even for high degree
polynomials by various methods, such as computing eigenvalues of the companion matrix
(Edelman and Murakami [1995]), as is implemented in Matlab. Since the number of roots
is just the degree of the polynomial, which is the number of moments, the above-mentioned
checks can be done quickly. The binary search takes log(t) steps where t is the step length.
Thus the entire optimization can be done efficiently. Nonetheless, the method suffers from
the following pitfall. The final bound is sensitive to the initial value of y. Depending on
the initial y I might stop at different values of the objective on hitting some constraint. I
could thus have a suboptimal value as our solution, as I only descend along the negative of
the gradient. I can somewhat overcome this drawback by making multiple random restarts.
Gradient descent with local topology search (GDTS): Perform gradient descent
as mentioned before. Choose a random set of points around the current best solution.
Again perform gradient descent on the feasible subset of the chosen points. Choose
the best solution and repeat until some reasonable stopping criterion is met. This works well
sometimes in practice but not always.
Prekopa Algorithm (PA): Prekopa [1989] gave an algorithm for the discrete
moment problem. In this algorithm I maintain an (m + 1) × (m + 1) matrix called the basis
matrix B, which needs to have a particular structure to be dual feasible. I iteratively
update the columns of this matrix until it becomes primal feasible, resulting in the optimal
solution to the optimization problem1 . The issue with this algorithm is that there is
no guarantee w.r.t. the time required for the algorithm to find this primal feasible basis
structure.
In the remaining approaches I further extend the domain of the random variable
X to be continuous within the given range. Again, for the same reason described before,
the bound remains valid (it can only become looser). It is also worth noting that the feasibility
region of the optimization problem is convex, since the objective and the constraints are
convex (actually affine). Standard convex optimization strategies cannot be used since the
equation of the boundary is unknown and the length of the description of the problem is large.
1 for explanation of the algorithm read Prekopa [1989]
Sequential Quadratic Programming (SQP): Sequential Quadratic Programming
is a method for non-linear optimization. It is known to have local convergence for
non-linear non-convex problems and will thus converge globally in the case of convex
optimization. The idea behind SQP is the following. I start with an initial feasible point,
say y_{init}. The original objective function is then approximated by a quadratic function
around y_{init}, which then is the objective for that particular iteration. The constraints are
approximated by linear constraints around the same point. The solution of the quadratic
program is a direction vector along which the next feasible point should be chosen. The
step length can be found using standard line search procedures or more sophisticated
merit functions. On deriving the new feasible point the procedure is repeated until a
suitable stopping criterion is met. Thus at every iteration a quadratic programming problem is
solved.
Let f(y), c_{eq_j}(y) and c_{ieq_j}(y) be the objective function, the jth equality constraint
and the jth inequality constraint respectively. For the current iterate y_k I have the following
quadratic optimization problem,
min  O_k(d_k) = f(y_k) + \nabla f(y_k)^T d_k + \frac{1}{2} d_k^T \nabla^2 L(y_k, \lambda_k)\, d_k
subject to:  c_{eq_i}(y_k) + \nabla c_{eq_i}(y_k)^T d_k = 0, \ i \in E
             c_{ieq_i}(y_k) + \nabla c_{ieq_i}(y_k)^T d_k \ge 0, \ i \in I
where O_k(d_k) is the quadratic approximation of the objective function around y_k.
The term f(y_k) is generally dropped from the above objective since it is a constant at
any particular iteration and has no bearing on the solution. \nabla^2 L(\cdot) is the Hessian of
the Lagrangian w.r.t. y, E and I are the sets of indices for the equality and inequality
constraints respectively, and d_k is the direction vector which is the solution of the above
optimization problem. The next iterate y_{k+1} is given by y_{k+1} = y_k + \alpha_k d_k, where \alpha_k is the
step length.
For our specific problem the objective function is affine, thus a quadratic approximation
of it yields the original objective function. I have no equality constraints. For the
inequality constraints, I use the following idea. The two expressions representing the infinite
number of linear constraints given in the dual formulation can be perceived as
polynomials in x with coefficients y. For a particular iteration with the iterate y known,
I find the lowest value that the polynomials take. This value is the value of the most
violated (if some constraints are violated) or just-satisfied (if no constraint is violated) linear
constraint. This is shown in Figure 4-2. The constraint c_l = \sum_{j=0}^{m} y_j x_i^j is just satisfied.
With this in view I arrive at the following formulation of our optimization problem at the
kth iteration,
min  \mu^T d_k
subject to:  \sum_{j=0}^{m} y_j^{(k)} x_i^j + \sum_{j=0}^{m} x_i^j d_k \ge 0; \qquad y_k = [y_0^{(k)}, \ldots, y_m^{(k)}]2
This technique gives a sense of the non-linear boundary traced out by the constraints.
The above-mentioned values can be deduced by finding the roots of the derivatives of the 2
polynomials w.r.t. x and then taking the minimum of each polynomial evaluated at the real
roots of its derivative. The number of roots is bounded by the number of moments; in fact
it is equal to m − 1. Since this approach does not require the enumeration of each of
the linear constraints, and the operations described are fast with accurate results, it
turns out to be a good option for solving this optimization problem. I carried out the
optimization using the Matlab function fmincon and the procedure just illustrated.
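A hedged sketch of the same idea using SciPy's SLSQP in place of fmincon is shown below; to keep it short, it replaces the root-finding step by a dense-grid minimum of the constraint polynomials, which only approximates the procedure described above. The function name and grid resolution are assumptions of the example.

```python
import numpy as np
from scipy.optimize import minimize

def dual_upper_bound(moments, a, b, c):
    """Approximately solve min mu^T y subject to the two semi-infinite polynomial
    constraints of the dual, relaxed to the continuous domain [a, c]; bounds P[X <= b]."""
    mu = np.concatenate([[1.0], moments])            # mu_0 = 1
    grid_all = np.linspace(a, c, 2001)               # relaxed domain U
    grid_w = np.linspace(a, b, 2001)                 # subset W over which the cdf is taken

    def poly(y, xs):
        return sum(yk * xs**k for k, yk in enumerate(y))

    cons = [
        {"type": "ineq", "fun": lambda y: np.min(poly(y, grid_all))},        # poly >= 0 on U
        {"type": "ineq", "fun": lambda y: np.min(poly(y, grid_w)) - 1.0},    # poly >= 1 on W
    ]
    y0 = np.zeros(len(mu))
    y0[0] = 1.0                                       # feasible start: constant polynomial 1
    res = minimize(lambda y: mu @ y, y0, constraints=cons, method="SLSQP")
    return res.fun
```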
Semi-definite Programming (SDP): A semi-definite programming problem has a
linear objective, linear equality constraints and linear matrix inequality (LMI) constraints.
Here is an example formulation,

min  c^T q
subject to:  q_1 F_1 + \cdots + q_n F_n + H \preceq 0
             Aq = b
2 y_i^{(k)} is the value of y_i at the kth iteration
where H, F_1, \ldots, F_n are positive semidefinite matrices, q \in R^n, b \in R^p and A \in R^{p \times n}. The
SDP can be efficiently solved by interior point methods. As it turns out, I can express our
semi-infinite LP as an SDP.
Consider the constraint c_1(x) = \sum_{i=0}^{m} y_i x^i. The constraint c_1(x) satisfies c_1(x) \ge 0
\forall x \in [a, b] iff there exists an (m + 1) \times (m + 1) positive semidefinite matrix S such that,

\sum_{i+j=2l-1} S(i, j) = 0; \quad l = 1, \ldots, m
\sum_{k=0}^{l} \sum_{r=k}^{k+m-l} y_r \binom{r}{k} \binom{m-r}{l-k} a^{r-k} b^{k} = \sum_{i+j=2l} S(i, j); \quad l = 0, \ldots, m

S \succeq 0 means S is positive semidefinite.
The proof of this result is given in Bertsimas and Popescu [1998].
I derive the equivalent semidefinite formulation for the second constraint, c_2(x) =
\sum_{i=0}^{m} y_i x^i - 1, to be greater than or equal to zero. To accomplish this, I replace y_0 by y_0 - 1
in the above set of equalities since c_2(x) = c_1(x) - 1. Thus \forall x \in [a, b] I have the following
semidefinite formulation for the second constraint,

\sum_{i+j=2l-1} S(i, j) = 0; \quad l = 1, \ldots, m
\sum_{k=1}^{l} \sum_{r=k}^{k+m-l} y_r \binom{r}{k} \binom{m-r}{l-k} a^{r-k} b^{k} + \sum_{r=1}^{m-l} y_r \binom{m-r}{l} a^{r} + y_0 - 1 = \sum_{i+j=2l} S(i, j); \quad l = 1, \ldots, m
\sum_{r=1}^{m} y_r a^{r} + y_0 - 1 = S(0, 0)
S \succeq 0
Combining the above 2 results I have the following semidefinite program with O(m^2)
constraints,

min  \sum_{k=0}^{m} y_k \mu_k
subject to:
\sum_{i+j=2l-1} G(i, j) = 0; \quad l = 1, \ldots, m
\sum_{k=1}^{l} \sum_{r=k}^{k+m-l} y_r \binom{r}{k} \binom{m-r}{l-k} a^{r-k} b^{k} + \sum_{r=1}^{m-l} y_r \binom{m-r}{l} a^{r} + y_0 - 1 = \sum_{i+j=2l} G(i, j); \quad l = 1, \ldots, m
\sum_{r=1}^{m} y_r a^{r} + y_0 - 1 = G(0, 0)
\sum_{i+j=2l-1} Z(i, j) = 0; \quad l = 1, \ldots, m
\sum_{k=0}^{l} \sum_{r=k}^{k+m-l} y_r \binom{r}{k} \binom{m-r}{l-k} b^{r-k} c^{k} = \sum_{i+j=2l} Z(i, j); \quad l = 0, \ldots, m
G \succeq 0, \quad Z \succeq 0
G and Z are (m + 1) × (m + 1) positive semidefinite matrices. The domain of the
random variable is [a, c]. Solving this semidefinite program yields an upper bound on the
cdf P[X \le b], where a \le b \le c. I used a free online SDP solver (Wu and Boyd [1996]) to
solve the above semidefinite program. Through the empirical studies that follow I found this
approach to be the best for solving the optimization problem in terms of a balance between
speed, reliability and accuracy.
4.2.3.3 Random sampling using formulations (RS)
In sampling I select a subset of observations from the universe consisting of all
possible observations. Using this subset I calculate a function whose value I consider to
be equal (or at least close enough) to the value of the same function applied to the entire
observation set. Sampling is an important process, since in many cases I do not have
access to this entire observation set (many times it is infinite). Numerous studies (Hall
[1992], Bartlett et al. [2001], Chambers and Skinner [1977]) have been conducted to analyze
different kinds of sampling procedures. The sampling procedure that is relevant to our
problem is random sampling and hence I restrict our discussion only to it.
problem is Random Sampling and hence I restrict our discussion only to it.
Random sampling is a sampling technique in which I select a sample from a larger
population wherein each individual is chosen entirely by chance and each member of the
population has possibly an unequal chance of being included in the sample. Random
sampling reduces the likelihood of bias. It is known that asymptotically the estimates
found using random sampling converge to their true values.
For our problem the cdf can be computed using this sampling procedure. I sample
data from the multinomial distribution (our data generative model) and count the number of
times the condition whose cdf is to be computed is true. This number, when divided by the
total number of samples, gives an estimate of the cdf. By finding the mean and standard
deviation of these estimates I can derive confidence bounds on the cdf using the Chebyshev
inequality. The width of these confidence bounds depends on the standard deviation of the
estimates, which in turn depends on the number of samples used to compute the estimates.
As the number of samples increases the bounds become tighter. I will observe this in
the experiments that follow. In fact, all the estimates of the cdf necessary for computing
the moments of the generalization error using the mathematical formulations given by
us can be computed in parallel. The moments are thus functions of the samples for which
confidence bounds can be derived.
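A minimal sketch of the RS procedure is given below; the cell ordering used in the example condition is an assumption of the illustration, not a convention fixed by the thesis.

```python
import numpy as np

def estimate_cdf_by_sampling(p, N, condition, n_samples=10_000, seed=0):
    """Random-sampling (RS) estimate of P[condition(counts)] under the multinomial
    data-generation model: draw contingency tables of size N and average the indicator."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float).ravel()
    hits = sum(bool(condition(rng.multinomial(N, p))) for _ in range(n_samples))
    return hits / n_samples

# Example for the 2-dimensional case of Figure 4-1 with 8 equiprobable cells
# (ordering assumed: N111, N121, N211, N221, N112, N122, N212, N222).
def cond(c):
    n111, n121, n211, n221, n112, n122, n212, n222 = c
    n2 = n112 + n122 + n212 + n222
    n1 = c.sum() - n2
    return n2 * (n111 + n121) * (n111 + n211) > n1 * (n112 + n122) * (n112 + n212)

print(estimate_cdf_by_sampling(np.full(8, 1 / 8), 100, cond))  # expected: close to 0.5
```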
Notice that only sampling in conjunction with our formulations for the moments
makes for an efficient method. If I directly use sampling without using the formulations, I
would first need to sample for building a set of classifiers, and then for each classifier built
I would need to sample test sets from the distribution. The reason is that the expectation
in the moments is w.r.t. all possible datasets of size N. This process can prove to be
computationally intensive for acquiring accurate estimates of the moments. As I will see
in the next section, Monte Carlo fails to provide accurate estimates when directly used
to compute the moments, even in simple scenarios, when compared with RS. Thus I am
still consistent with our methodology of going as far as possible in theory to reduce the
computation needed for conducting experiments.
4.2.4 Empirical Comparison of Cumulative Distribution Function Computing Methods
Consider the 2-dimensional case in Figure 4-1. I instantiated all the cell probabilities
to be equal. I found the probability P[N_2 N_{11}^{x} N_{11}^{y} > N_1 N_{12}^{x} N_{12}^{y}] by the methods suggested,
varying the dataset size from 10 to 1000 in multiples of 10 and having knowledge of
the first six moments of the random variable X = N_2 N_{11}^{x} N_{11}^{y} - N_1 N_{12}^{x} N_{12}^{y}. The
actual probability in all three cases is around 0.5 (actually just less than 0.5).
The execution speeds for the various methods are given in Table 4-3³. From the table
I see that the SDP and gradient descent methods are lightning fast. The SQP and
3 arnd means around
gradient descent with topology search methods take a couple of seconds to execute. The
thing to notice here is that SDP, SQP, the two gradient descent methods and the series
approximation method are oblivious to the size of the dataset with regards to execution
time. In terms of accuracy the gradient descent method is sensitive to initialization and
the series approximation method to the choice of distribution, as previously stated. A
normal distribution gives an estimate of 0.5, which is good in this case since the original
distribution is symmetric about the origin. But for finding the cdf near the extremities of the
domain of X the error can be considerable. Since the domain of X is finite, variants of
the beta distribution with a change of variable (i.e. shifting and scaling the distribution)
can provide better approximation capabilities. The SQP and SDP methods are robust
and insensitive to initialization (as long as the initial point is feasible). The bound found
by SQP is 0.64 to 0.34 and that found by SDP is 0.62 to 0.33. The LP solver also finds
a similar bound of 0.62 to 0.34 but the execution time scales quadratically with the size
of the input. On increasing the number of moments to 9 the bounds become tighter and
essentially require the same execution time. The SDP, SQP and LP methods, all give a
bound of 0.51 to 0.48. Thus by increasing the number of moments I can get arbitrarily
tight bounds. For RS I observe from Table 4-3 and Table 4-4 that the method does not
scale much in time with the size of the dataset but produces extremely good confidence
bounds as the number of samples increases. With 1000 samples I already have pretty tight
bounds, with the time required being just over half a second. Also, as previously stated, the cdfs
can be calculated together rather than independently.
Recommendation: The SDP method is the best but RS can prove to be more than
acceptable.
4.3 Monte Carlo (MC) vs Random Sampling Using Formulations
In the previous section I proposed methods for efficiently and accurately computing
the cdfs that are used in the computation of the moments. A natural question is: why
not use simple Monte Carlo to directly estimate the moments rather than derive the
formulations and then perform random sampling? In this section, I show that MC fails
to provide accurate estimates even in a simple scenario, while RS does an extremely good
job for the same amount of computation (i.e. 10000 samples). Notice that N, the training
set size, and the sample size have different semantics. Since the expectations are over all
datasets of size N, the sample size is the number of datasets of size N. More precisely, the
sample size is the number of training sets of size N and not the value of N itself. I first
explain the plots and later discuss their implications.
General Setup: I fix the total number of attributes to 2. Each attribute has two
values with the number of classes also being 2. The five Figures 4.6, 4-3, 4.6, 4-5 and
4.6 depict the estimates of MC and RS for different amounts of correlation (measured
using Chi-Square, Connor-Linton [2003]) between the attributes and the class labels, with
increasing training set size.
Observations: From the Figure 4.6 I observe that when the attributes and class
labels are uncorrelated, with increasing training set size the estimates of both MC and
RS are accurate. Similar qualitative results are seen in Figure 4.6 when the attributes
and class labels are totally correlated. Hence, for extremely low and high correlations
both methods produce equally good estimates. The problem arises for the MC method
when I move away from these extreme correlations. This is seen in Figures 4-3, 4.6 and
4-5. Both the MC and RS methods perform well initially, but at higher training set sizes
(around 10000 and greater) the estimates of the MC method become grossly incorrect,
while the RS method still performs exceptionally well. In fact, the estimates of RS become
increasingly accurate with increasing training set size.
Reasons and Implications: An explanation of the above phenomena is as follows.
The term E_{D(N)}[GE(ζ)] denotes the expected GE of all classifiers that are induced by all
possible training sets drawn from some distribution. In the continuous case the number
of possible training sets of size N is infinite, while in the discrete case it is O(N^{m-1}),
where m is the total number of cells in the contingency table. As N increases the number
of possible training sets increases rapidly even for small values of m. Thus, with increasing
N the complexity of E_{D(N)}[GE(ζ)] also increases. In the experiments I reported above,
the value of m is 8 (2 × 2 × 2), and with N increasing from 10 to 10000 the upsurge in the
number of possible training sets is steep. Since I fix the amount of computation (i.e. the
number of samples), the MC method is unable to get enough samples to accurately
estimate E_{D(N)}[GE(ζ)] at higher values of N (e.g. 10000), except at extreme correlations
where almost every sample is representative of the underlying distribution. The
MC method estimates are based on samples from a small subspace of the entire sample
space. Hence, with an increasing number of possible datasets I would have to proportionately
increase the number of samples to get good estimates. The RS method is not as affected
by increasing training set size. The reason for this is that the complexity (i.e. the
parameter space) of the cdf does not scale as much with increasing N (O(N^{O(d)}), where
d is the dimension, as against O(N^{m-1})). Thus, in the case of the RS method the high
accuracy is sustained.

On increasing m, the number of possible training sets increases by a factor of N for
each cell added, and hence obtaining accurate estimates with direct MC is intractable. The RS
method does not scale likewise, since the number of terms (cdfs) is linear in m and the
complexity of each term remains practically unchanged for a fixed dimension. Since
computing even the first moment is a challenge for the MC method, computing the
second moment, which is over the D(N) × D(N) space, looks ominous. For RS and the
other suggested methods (e.g. optimization) this is equivalent to finding joint probabilities,
which is not that hard a task.
4.4 Calculation of Cumulative Joint Probabilities
Cumulative joint probabilities need to be calculated for the computation of higher
moments. Using the random sampling method, these probabilities can be computed
in a similar fashion as the single probabilities shown above. But for the other methods
knowledge of the moments is required. Cumulative joint probabilities are defined
over multiple random variables, wherein each random variable satisfies some inequality
or equality. In our case, for the second moment I need to find cumulative joint
probabilities of the kind P[X > 0, Y > 0], where X and Y are random
variables (overriding their definition in Table 2-1). Since the probability is of an event
over two distinct random variables, the previous method of computing moments cannot be
directly applied. An important question is: can I somehow, through certain transformations,
reuse the previous method? Fortunately, the answer is affirmative. The intuition
behind the technique I propose is as follows. I find another random variable Z =
f(X, Y) (a polynomial in X and Y) such that Z > 0 iff X > 0 and Y > 0. Since the
two events are equivalent, their probabilities are also equal. By taking derivatives of
the MGF of the multinomial I get expressions for the moments of polynomials of the
multinomial random variables. Thus, f(X, Y) is required to be a polynomial in X and Y.
I now discuss the challenges in finding such a function and eventually suggest a solution.
Geometrically, I can consider the random variables X, Y and Z to denote the three
co-ordinate axes. Then the function f(X, Y ) should have a positive value in the first
quadrant and negative in the remaining three. If the domains of X and Y were infinite
and continuous then this problem is potentially intractable since the polynomial needs
to have a discrete jump along the X and Y axis. Such behavior can be emulated at best
approximately by polynomials. In our case though, the domains of the random variables
are finite, discrete and symmetric about the origin. Therefore, what I care about is that
the function behaves as desired only at these finite number of discrete points. One simple
solution is to have a circle covering the relevant points in the first quadrant and with
appropriate sign the function would be positive for all the points encompassed by it. This
works for small domains of X and Y . As the domain size increases the circle intrudes into
the other quadrants and no longer satisfies the conditions. Other simple functions such
as XY or X + Y or a product of the two also do not work. I now give a function that does
work and discuss the basic intuition in constructing it. Consider the domain of X and Y
to be integers in the interval [-a, a].4 Then the polynomial is given by,

Z = (X + a)^r X^2 Y + (Y + a)^r Y^2 X    (4–6)

where r = \max_b \left\lfloor \frac{b \ln b}{\ln \frac{a+1}{a-b}} \right\rfloor + 1, with 1 < b < a and b \in N.5 The value of r can be found
numerically by finding the corresponding value of b which maximizes that function. For
5 \le a \le 10 the value of b which does this is 5. For larger values, 10 < a \le 10^6, the value
of b is 4. Figure 4-7 depicts the polynomial for a = 10, where r = 4. The polynomial
resembles a bird with its neck in the first quadrant, wings in the 2nd and 4th quadrants
and its posterior in the third. The general shape remains the same for higher values of a.
The first requirement for the polynomial was that it must be symmetric. Secondly,
I wanted to penalise negative terms, and so I have X + a (and Y + a) raised to some
power, which will always be positive but will have lower values for smaller X (and Y). The
factor X^2 Y (and Y^2 X) makes the first (second) term zero if either X or Y is zero. Moreover,
it imparts sign to the corresponding term. If the absolute value function (|·|) could be used I
would replace X^2 (Y^2) by |X| (|Y|) and set r = 1. But since I cannot, in the resultant
function r is a reciprocal of a logarithmic function of a. For a fixed r, with increasing
a the polynomial starts violating the biconditional by becoming positive in the 2nd
and 4th quadrants (i.e. the wings rise). The polynomial is always valid in the 1st and 3rd
quadrants. With an increase in the degree (r) of the polynomial, its wings begin flattening out,
thus satisfying the biconditional for a certain a.
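A quick numerical check of the claimed biconditional, assuming a = 10 and r = 4 as in Figure 4-7:

```python
import numpy as np

def z_poly(X, Y, a, r):
    """The polynomial of Eq. 4-6; intended to be positive exactly when X > 0 and Y > 0
    on the integer grid [-a, a] x [-a, a], for a suitably chosen r."""
    return (X + a) ** r * X**2 * Y + (Y + a) ** r * Y**2 * X

a, r = 10, 4
grid = np.arange(-a, a + 1)
X, Y = np.meshgrid(grid, grid)
# Exhaustive check of: Z > 0  iff  (X > 0 and Y > 0)
print(np.array_equal(z_poly(X, Y, a, r) > 0, (X > 0) & (Y > 0)))  # expected: True
```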
By recursively applying the above formula I can approximate cdfs (probabilities) with
multiple conditions.
4 in our problem X and Y have the same domain.
5 proof in appendix
Recommendation: If the degree of the polynomial is large, use the RS method for
convenience; else use SDP for high accuracy.
With this I have all the ingredients necessary to perform experiments reasonably fast.
This is exactly what I report in the next section.
4.5 Moment Comparison of Test Metrics
In the previous subsections I pointed out how the moments of the generalization error
can be computed. I established connections between the moments of the generalization
error (GE) and the moments of Hold-out-set error (HE) and Cross validation error (CE).
Neither of these relationships, i.e. between the moments of these errors and the moments
of the generalization error give a direct indication of the behavior of the different types of
errors in specific circumstances. In this section I provide a graphical, simple to interpret,
representation of these moments for specific cases. While these visual representations do
not replace general facts, they greatly improve our understanding, allowing rediscovery of
empirically discovered properties of the error metrics and portraying the flexibility of the
method to model different scenarios.
General Setup: I study the behavior of the moments of HE, CE and GE in one as
well as multiple dimensions. The data distribution I use is a multinomial with a class prior
of 0.4. The dataset size is set to 100, i.e. N = 100, for the first 2 studies and is varied for
the third. I set N = 100 (and not higher) to clearly observe the effects that an increase in
dimensionality has on the behavior of these error metrics. The third study varies N and
studies the convergence behavior of these error metrics.
4.5.1 Hold-out Set
Our first study involves the dependency of the hold-out-set error on the splitting of
the data into testing (the hold-out-set) and training. To get insight into the behavior of
HE, I plotted the expectation in Figures 4.6, 4-11, the variance in Figures 4-9, 4.6 and
the sum of the expectation and standard deviation in Figures 4.6, 4-13 for single and
multiple dimensions respectively. As expected, the expectation of HE grows as the size
of the training dataset reduces. On the other hand, the variance is reduced until the size
of test data is 50%, then it increases slightly for the one dimensional case. The general
downwards trend is predictable using intuitive understanding of the naive Bayes classifier
but the fact that the variance has an upwards trend is not. I believe that the behavior
on the second part of the graph is due to the fact that the behavior of the classifier
becomes unstable as the size of the training dataset is reduced and this competes with the
reduction due to the increase in the size of the testing data. In higher dimensions the test data
size is insufficient even for large test set fractions (as N is only 100) and so any increase in
test size is desirable, leading to reduced variance. Our methodology established this fact
exactly without the doubts associated with intuitively determining the distinct behavior in
different dimensions.
From the plots for the sum of the expectation and the standard deviation of HE,
which indicate the pessimistic expected behavior, a good choice for the size of the test set
is 40-50% for this particular instance. This best split depends on the size of the dataset
and is hard to select based on intuition alone.
4.5.2 Cross Validation
In our second study I observed the behavior of CV with varying number of folds.
Here I observe similar qualitative results in both lower and higher dimensions. As the
number of folds increases, the following trends are observed: (a) the expectation of CE
reduces (Figures 4.6, 4.6) since the size of training data increases, (b) the variance of the
classifier for each of the folds increases (Figures 4-15, 4-21) since the size of the test data
decreases, (c) the covariance between the estimates of different folds decreases first then
increases again (Figures 4.6, 4.6) – I explain this behavior below and the same trend is
observed for the total variance of CE (Figures 4-17, 4-23) and the sum of the expectation
and the standard deviation of CE (Figures 4.6, 4.6). Observe that the minimum of
the sum of the expectation and the standard deviation (which indicates the pessimistic
expected behavior) is around 10-20 folds, which coincides with the number of folds usually
recommended.
A possible explanation for the behavior of the covariance between the estimates
of different folds is based on the following two observations. First, when the number of
folds is small, the errors of the estimates have large correlations despite the fact that
the classifiers are negatively correlated – this happens because almost the entire training
dataset of one classifier is the test set for the other, with 2-fold cross validation being the extreme.
Due to this, though the classifiers built may be similar or different, their errors are strongly
positively correlated. Second, for a large number of folds (the leave-one-out situation in the
extreme), there is a huge overlap between the training sets, thus the classifiers built are
almost the same and so the corresponding errors they make are highly correlated again.
These two opposing trends produce the U-shaped curve of the covariance. This has a
significant effect on the overall variance, and so the variance also has a similar form with
the minimum around 10 folds. Predicting this behavior using only intuition, a reasonable
number of experiments, or theory alone is unlikely, since it is not clear what the interaction
between the two trends is.
Such insight is possible only because I am able to observe with high accuracy the
factors that affect the behavior of these measures.
4.5.3 Comparison of GE, HE, and CE
The purpose of our last study I report was to determine the dependency of the three
errors on the size of the dataset, which indicates the convergence behavior and relative
merits of hold-out-set and cross validation. In Figures 4-20 and 4-26 I plotted the moments of
GE, HE and CE; the size of the hold-out set for HE was set to 40% and the number of folds for CE to 20.
As can be observed from the figures, the error of hold-out-set is significantly larger for
small datasets. The error of cross validation is almost on par with the generalization
error. This property of cross validation to reliably estimate the generalization error is
known from empirical studies. But the method can be used to estimate how quickly (at
what dataset size) HE and CE converge to GE.
This type of study can be used to observe the non-asymptotic convergence behavior of
errors.
4.6 Extension
I have laid down the basic groundwork necessary for characterizing classification
models and model selection measures. In particular, I have characterized the NBC model
applied to categorical data of arbitrary dimension and with binary class labels. In this
section I discuss extensibility of the analysis and the methodology.
The extension of the analysis to NBC with multiple classes is straightforward. I adopt
the "winner takes all" policy to classify a datapoint, i.e. I classify the datapoint in the
class that has the highest corresponding polynomial (of the form
$N_2^{(d-1)} N_{x_1 1} N_{x_2 1} \cdots N_{x_d 1}$)
value. The approximation techniques employed for speedup are applicable to this
scenario too. In fact, the series approximation and the optimization techniques can
be used to bound the cdf in any application where k (some integer) moments of the
random variable are known. As mentioned before, the generalized expressions for the
moments and the relationships of the moments of GE to the moments of the CE, LE
and HE hold even for the continuous case by switching from the counting measure to the
Lebesgue measure. The challenge in this case too is to characterize the probabilities
PZ(N) [ζ(x)=y] and PZ(N)×Z(N) [ζ(x)=y ∧ ζ ′(x′)=y′] for the model at hand. The essence
in characterizing these probabilities for particular inputs (x) is expressing them as
probabilities of a function of the training sample satisfying some condition. The function
is determined by the model that is chosen, by prudently looking at the training algorithm
in relation with the classifiers it outputs. For example, in the NBC case the function was
$N_2^{(d-1)} N_{x_1 1} N_{x_2 1} \cdots N_{x_d 1} - N_1^{(d-1)} N_{x_1 2} N_{x_2 2} \cdots N_{x_d 2}$ for multiple dimensions and $N_{x_1 1} - N_{x_1 2}$ for
a single dimension. The probability of this function being greater than zero is computed
from any joint distribution that may be specified over the data and the class labels. This
observation directs us in characterizing other models of interest.
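For concreteness, the following is a minimal Python sketch of the two-class discrete NBC decision above, written purely as a function of the sample counts; the variable names (N1, N2, Nx) are hypothetical and are assumed to have been read off the contingency table.

def nbc_two_class_decision(N1, N2, Nx):
    """Two-class naive Bayes decision expressed only through sample counts.
    N1, N2: number of training points in class 1 and class 2.
    Nx: list of (N_{x_j 1}, N_{x_j 2}) pairs, one per attribute, holding the counts
    of the input's attribute values within class 1 and class 2."""
    d = len(Nx)
    score1 = N2 ** (d - 1)
    score2 = N1 ** (d - 1)
    for n_c1, n_c2 in Nx:
        score1 *= n_c1
        score2 *= n_c2
    # Positive difference -> class 1, otherwise class 2 (ties ignored in this sketch).
    return 1 if score1 - score2 > 0 else 2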
In the continuous case, the NBC classifies an input based on the training sample class
prior and the class conditionals, just as for the discrete case. The class label assigned to an
input z1, z2, ..., zd is given by,
$$\text{class label}(z_1, z_2, \ldots, z_d) = \arg\max_{C_i} P[C_i] \prod_{j=1}^{d} P[z_j \mid C_i]$$
where Ci denotes the class i. The NBC estimates the prior and the conditionals from
the training sample. The prior is straightforward to estimate. For the conditionals a
parametric model is chosen, usually a normal. The parameters of the normal (mean and
variance) can be estimated in closed form, i.e. as a function of the sample, using parameter
estimation methods such as Maximum Likelihood Estimation (MLE). Thus, each of the
conditionals can be represented as a function of the sample and so can the prior. The term
$P[C_i]\prod_{j=1}^{d} P[z_j \mid C_i]$ is hence a function of the sample. Since the classification occurs by
taking the argmax of this term over all classes i.e. classifying the input in the class for
which this term is the greatest (ignoring ties which can easily be accounted for), I arrive
at a situation wherein the classification process is expressed as a function of the sample.
With the initially chosen joint density over the data and the class labels, the probability
of this function satisfying the required conditions can be computed by integrating over
the appropriate domain. Other densities (parametric or non-parametric) may be used; the
key is to represent the above term as a function of the sample.
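A minimal sketch of this continuous case, assuming Gaussian class conditionals with MLE parameters; the small variance floor is an assumption added here only to avoid division by zero. It is meant to show that the whole decision is a function of the training sample, not to be a definitive implementation.

import numpy as np

def gaussian_nb_classify(X_train, y_train, z):
    """Classify input z: the prior and per-class normal parameters are MLE
    functions of the sample; the decision is the argmax over classes of
    prior times the product of the conditionals."""
    best_class, best_score = None, -np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9   # MLE estimates (floored)
        # log of  prior * prod_j N(z_j; mu_j, var_j)
        log_cond = -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var)
        score = np.log(prior) + log_cond
        if score > best_score:
            best_class, best_score = c, score
    return best_class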
When I consider other classification models, the basic theme of characterizing them
remains unchanged. I discuss possible ways of extending the analysis to some of these
other models. In the case of decision trees for example, the classification occurs at the
leaf nodes by choosing the most numerous class label. If no pruning criterion is enforced,
all paths contain all attributes and the leaf nodes are basically the input cells, for the
discrete case. The function used to classify an input is determined by the most numerous
class label in the training sample for that input. For the continuous case the classification
occurs based on majority in regions. In case of pruning, the pruning method should
also be considered in the characterization, for example pruning based on a lack of samples
in a particular split. Here the process of classifying an input into a particular class
is determined by the leaf node that is used to perform the classification. A node is a
leaf only if the number of samples in it is less than some threshold or all attributes are
already exhausted. These conditions can be encoded as functions of the sample. The
condition of a particular class being most numerous is also a function of the sample and
using the chosen joint distribution the probability of these events can be ascertained.
For a perceptron (or Neural Network in general) the classification is based on the weight
vector that is learnt. The training algorithm produces a weight vector which is a linear
combination of the misclassified patterns, and hence a function of the sample. Using the
pre-specified joint distribution, probabilities for this model could also be found.
In this manner, by understanding the training and the functionality of the model,
characterizations can be developed. It may not always be possible to do this, but if
possible the characterizations developed aid in providing an accurate (if not exact)
representation of the behavior of the learning model and the model selection measures, in
a short amount of time (for that accuracy).
Table 4-1. Contingency table of input X.

     X     y1     y2
     x1    N11    N12
     x2    N21    N22
     ...   ...    ...
     xn    Nn1    Nn2
           N1     N2     (total N)
Table 4-2. Naive Bayes notation.

     Symbol   Semantics
     pc1      prior of class C1
     pc2      prior of class C2
     phij     joint probability of being in hi, Cj
     N1       r.v. denoting number of datapoints in class C1
     N2       r.v. denoting number of datapoints in class C2
     Nhij     r.v. denoting number of datapoints in hi, Cj
     Nij      r.v. denoting number of datapoints in cell i, Cj
     N        size of dataset
Table 4-3. Empirical comparison of the cdf computing methods in terms of execution time. RSn denotes the Random Sampling procedure using n samples to estimate the probabilities.

     Method    Dataset Size 10     Dataset Size 100      Dataset Size 1000
     Direct    25 hrs              around 200 centuries  around 200 billion yrs
     SA        Instantaneous       Instantaneous         Instantaneous
     LP        around 3.5 sec      around 2 min          around 2:30 hrs
     GD        around 0.13 sec     around 0.13 sec       around 0.13 sec
     PA        around 1 sec        around 25 sec         around 5 min
     GDTS      around 3.5 sec      around 3.5 sec        around 3.5 sec
     SQP       around 3.5 sec      around 3.5 sec        around 3.5 sec
     SDP       around 0.1 sec      around 0.1 sec        around 0.1 sec
     RS100     around 0.08 sec     around 0.08 sec       around 0.1 sec
     RS1000    around 0.65 sec     around 0.66 sec       around 0.98 sec
     RS10000   around 6.3 sec      around 6.5 sec        around 9.6 sec
Table 4-4. 95% confidence bounds for Random Sampling.

     Samples   Dataset Size 10   Dataset Size 100   Dataset Size 1000
     100       0.7-0.23          0.72-0.26          0.69-0.31
     1000      0.54-0.4          0.56-0.42          0.57-0.42
     10000     0.5-0.44          0.51-0.47          0.52-0.48
Table 4-5. Comparison of methods for computing the cdf.

     Method                               Accuracy         Speed
     Direct                               Exact solution   Low
     Series Approximation                 Variable         High
     Standard LP solvers                  High             Low
     Gradient descent                     Low              High
     Prekopa Algorithm                    High             Moderate
     Gradient descent (topology search)   Moderate         Moderate
     Sequential Quadratic Programming     High             Moderate
     Semi-definite Programming            High             High
     Random Sampling                      High             Moderate
Figure 4-1. I have two attributes each having two values with 2 class labels.
Figure 4-2. The current iterate $y^k$ just satisfies the constraint $c_l$ and easily satisfies the other constraints. Suppose $c_l$ is $\sum_{j=0}^{m} y_j x_i^j$ where $x_i$ is a value of X; then in the diagram on the left I observe that for the kth iteration $y = y^k$ the polynomial $\sum_{j=0}^{m} y_j x^j = 0$ has a minimum at $X = x_i$, with the value of the polynomial being a. This is also the value of $c_l$ evaluated at $y = y^k$.
Figure 4-3. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The attributes are uncorrelated with the class labels. ED(N)[GE(ζ)] is 0.5.
Figure 4-4. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The correlation between the attributes and the class labels is 0.25.ED(N)[GE(ζ)] is 0.24.
Figure 4-5. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The correlation between the attributes and the class labels is 0.5.ED(N)[GE(ζ)] is 0.14.
Figure 4-6. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The correlation between the attributes and the class labels is 0.75.ED(N)[GE(ζ)] is 0.068.
Figure 4-7. Estimates of ED(N)[GE(ζ)] by MC and RS with increasing training set size N .The attributes are totally correlated to the class labels. ED(N)[GE(ζ)] is 0.
Figure 4-8. The plot is of the polynomial $(x + 10)^4 x^2 y + (y + 10)^4 y^2 x - z = 0$. I see that it is positive in the first quadrant and non-positive in the remaining three.
Figure 4-9. HE expectation in single dimension.
Figure 4-10. HE variance in single dimension.
Figure 4-11. HE E[] + Std() in single dimension.
Figure 4-12. HE expectation in multiple dimensions.
Figure 4-13. HE variance in multiple dimensions.
Figure 4-14. HE E[] + Std() in multiple dimensions.
Figure 4-15. Expectation of CE.
Figure 4-16. Individual run variance of CE.
Figure 4-17. Pairwise covariances of CV.
Figure 4-18. Total variance of cross validation.
Figure 4-19. E[] + √Var() of CV.
Figure 4-20. Convergence behavior.
Figure 4-21. CE expectation.
Figure 4-22. Individual run variance of CE.
Figure 4-23. Pairwise covariances of CV.
Figure 4-24. Total variance of cross validation.
Figure 4-25. E[] + √Var() of CV.
Figure 4-26. Convergence behavior.
CHAPTER 5
ANALYZING DECISION TREES
I use the methodology introduced for analyzing the error of classifiers and the model
selection measures to analyze decision tree algorithms. The methodology consists of
obtaining parametric expressions for the moments of the Generalization error (GE) for the
classification model of interest, followed by plotting these expressions for interpretability.
The major challenge in applying the methodology to decision trees, the main theme
of this work, is customizing the generic expressions for the moments of GE to this
particular classification algorithm. The specific contributions I make are: (a) I completely
characterize a subclass of decision trees namely, Random decision trees, (b) I discuss how
the analysis extends to other decision tree algorithms, and (c) in order to extend the
analysis to certain model selection measures, I generalize the relationships between the
moments of GE and moments of the model selection measures given in Dhurandhar and
Dobra [2009] to randomized classification algorithms. An extensive empirical comparison
between the proposed method and Monte Carlo, depicts the advantages of the method
in terms of running time and accuracy. It also showcases the use of the method as an
exploratory tool to study learning algorithms.
5.1 Computing Moments
In this section I first provide the necessary technical groundwork, followed by
customization of the expressions for decision trees. I now introduce some notation that
is used primarily in this section. X is a random vector modeling input whose domain is
denoted by X. Y is a random variable modeling output whose domain is denoted by Y (the set of class labels). Y(x) is a random variable modeling output for input x. ζ represents
a particular classifier with its GE denoted by GE(ζ). Z(N) denotes a set of classifiers
obtained by application of a classification algorithm to different samples of size N .
5.1.1 Technical Framework
The basic idea in the generic characterization of the moments of GE as given in
Dhurandhar and Dobra [2009], is to define a class of classifiers induced by a classification
algorithm and an i.i.d. sample of a particular size from an underlying distribution. Each
classifier in this class and its GE act as random variables, since the process of obtaining
the sample is randomized. Since GE(ζ) is a random variable, it has a distribution. Quite
often though, characterizing a finite subset of moments turns out to be a more viable
option than characterizing the entire distribution. Based on these facts, I revisit the
expressions for the first two moments around zero of the GE of a classifier,
$$E_{Z(N)}[GE(\zeta)] = \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y]\, P[Y(x) \neq y] \quad (5\text{--}1)$$

$$E_{Z(N) \times Z(N)}[GE(\zeta)\,GE(\zeta')] = \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x]\, P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']\, P[Y(x) \neq y]\, P[Y(x') \neq y'] \quad (5\text{--}2)$$
From the above equations I observe that for the first moment I have to characterize the
behavior of the classifier on each input separately while for the second moment I need
to observe its behavior on pairs of inputs. In particular, to derive expressions for the
moments of any classification algorithm I need to characterize PZ(N) [ζ(x)=y] for the
first moment and PZ(N)×Z(N) [ζ(x)=y ∧ ζ ′(x′)=y′] for the second moment. The values for
the other terms denote the error of the classifier for the first moment and errors of two
classifiers for the second moment which are obtained directly from the underlying joint
distribution. For example, if I have data with a class prior p for class 1 and 1−p for class 2,
then the error of a classifier classifying data into class 1 is 1−p and the error of a classifier
classifying data into class 2 is p. I now focus our attention on relating the above
two probabilities, to probabilities that can be computed using the joint distribution and
the classification model viz. Decision Trees.
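The first moment in equation 5-1, for instance, reduces to a double sum once the three probabilities are available; a minimal dictionary-based sketch (all names hypothetical) is:

def first_moment_ge(p_x, p_zeta, p_err):
    """Equation 5-1 as a direct sum.
    p_x[x]       = P[X = x]
    p_zeta[x][y] = P_{Z(N)}[zeta(x) = y]   (model/algorithm dependent)
    p_err[x][y]  = P[Y(x) != y]            (from the joint distribution)"""
    return sum(p_x[x] * sum(p_zeta[x][y] * p_err[x][y] for y in p_zeta[x])
               for x in p_x)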
In the subsections that follow I assume the following setup. I consider the dimensionality
of the input space to be d. A1, A2, ..., Ad are the corresponding discrete attributes or
continuous attributes with predetermined split points. a1, a2, ..., ad are the number of
attribute values/the number of splits of the attributes A1, A2, ..., Ad respectively. mij is the
ith attribute value/split of the jth attribute, where i ≤ aj and j ≤ d. Let C1, C2, ..., Ck be
the class labels representing k classes and N the sample size.
5.1.2 All Attribute Decision Trees (ATT)
Let us consider a decision tree algorithm whose only stopping criterion is that no
attributes remain when building any part of the tree. In other words, every path in
the tree from root to leaf has all the attributes. An example of such a tree is shown
in Figure 5-1. It can be seen that irrespective of the split attribute selection method
(e.g. information gain, gini gain, randomized selection, etc.) the above stopping criterion
yields trees with the same leaf nodes. Thus although a particular path in one tree has an
ordering of attributes that might be different from a corresponding path in other trees, the
leaf nodes will represent the same region in space or the same set of datapoints. This is
seen in Figure 5-2. Moreover, since predictions are made using data in the leaf nodes, any
deterministic way of prediction would lead to these trees resulting in the same classifier
for a given sample and thus having the same GE. Usually, prediction in the leaves is
performed by choosing the most numerous class as the class label for the corresponding
datapoint. With this I arrive at the expressions for computing the aforementioned
probabilities,
$$P_{Z(N)}[\zeta(x) = C_i] = P_{Z(N)}[ct(m_{p1}m_{q2}\ldots m_{rd}C_i) > ct(m_{p1}m_{q2}\ldots m_{rd}C_j),\ \forall j \neq i,\ i, j \in [1, \ldots, k]]$$
where x = mp1mq2...mrd represents a datapoint which is also a path from root to
leaf in the tree. ct(mp1mq2...mrdCi) is the count of the datapoints specified by the
cell mp1mq2...mrdCi. For example in Figure 4-1 x1y1C1 represents a cell. Henceforth,
when using the word ”path” I will strictly imply path from root to leaf. By computing
the above probability ∀ i and ∀ x I can compute the first moment of the GE for this
classification algorithm.
Similarly, for the second moment I compute cumulative joint probabilities of the
following form:
$$P_{Z(N)\times Z(N)}[\zeta(x) = C_i \wedge \zeta'(x') = C_v] = P_{Z(N)\times Z(N)}[ct(m_{p1}\ldots m_{rd}C_i) > ct(m_{p1}\ldots m_{rd}C_j),\ ct(m_{f1}\ldots m_{hd}C_v) > ct(m_{f1}\ldots m_{hd}C_w),\ \forall j \neq i,\ \forall w \neq v,\ i, j, v, w \in [1, \ldots, k]]$$
where the terms have a similar connotation as before. These probabilities can be
computed exactly or by using fast approximation techniques proposed in Dhurandhar and
Dobra [2009].
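For a single leaf cell and two classes, the count comparison above involves only the counts of that cell in each class, which are jointly multinomial with the rest of the table; a small Monte Carlo sketch of this probability (an approximation in the spirit of the fast techniques cited, with hypothetical names) is:

import numpy as np

def p_cell_count_greater(p_cell_c1, p_cell_c2, N, runs=100_000, seed=0):
    """Estimate P_{Z(N)}[ ct(m_{p1}...m_{rd} C1) > ct(m_{p1}...m_{rd} C2) ].
    Only the two cell probabilities matter: the two cell counts and everything
    else are jointly multinomial for a sample of size N."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(N, [p_cell_c1, p_cell_c2,
                                 1.0 - p_cell_c1 - p_cell_c2], size=runs)
    return float(np.mean(counts[:, 0] > counts[:, 1]))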
5.1.3 Decision Trees with Non-trivial Stopping Criteria
I just considered decision trees which are grown until all attributes are exhausted.
In real life though I seldom build such trees. The main reasons for this could be any
of the following: I wish to build small decision trees to save space; certain path counts
(i.e. number of datapoints in the leaves) are extremely low and hence I want to avoid
splitting further, as the predictions can get arbitrarily bad; I have split on a certain subset
of attributes and all the datapoints in that path belong to the same class (purity based
criteria); I want to grow trees to a fixed height (or depth). These stopping measures would
lead to paths in the tree that contain a subset of the entire set of attributes. Thus from
a classification point of view I cannot simply compare the counts in two cells as I did
previously. The reason for this being that the corresponding path may not be present in
the tree. Hence, I need to check that the path exists and then compare cell counts. Given
the classification algorithm, since the PZ(N) [ζ(x)=Ci] is the probability of all possible
ways in which an input x can be classified into class Ci for a decision tree it equates to
finding the following kind of probability for the first moment,
$$P_{Z(N)}[\zeta(x) = C_i] = \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ \forall j \neq i,\ i, j \in [1, \ldots, k]] \quad (5\text{--}3)$$
where p indexes all allowed paths by the tree algorithm in classifying input x. After the
summation, the right hand side term above is the probability that the cell pathpCi has the
greatest count, with the path ”pathp” being present in the tree. This will become clearer
when I discuss different stopping criteria. Notice that the characterization for the ATT is
just a special case of this more generic characterization.
The probability that I need to find for the second moment is,
$$P_{Z(N)\times Z(N)}[\zeta(x) = C_i \wedge \zeta'(x') = C_v] = \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i, j, v, w \in [1, \ldots, k]] \quad (5\text{--}4)$$
where p and q index all allowed paths by the tree algorithm in classifying input x and x′
respectively. The above two equations are generic in analyzing any decision tree algorithm
which classifies inputs into the most numerous class in the corresponding leaf. It is not
difficult to generalize it further when the decision in leaves is some other measure than
majority. In that case I would just include that measure in the probability in place of the
inequality.
5.1.4 Characterizing path exists for Three Stopping Criteria
It follows from above that to compute the moments of the GE for a decision tree
algorithm I need to characterize conditions under which particular paths are present. This
characterization depends on the stopping criteria and split attribute selection method in
a decision tree algorithm. I now look at three popular stopping criteria, namely a) Fixed
height based, b) Purity (i.e. entropy 0 or gini index 0 etc.) based and c) Scarcity (i.e. too
few datapoints) based. I consider conditions under which certain paths are present for
each stopping criteria. Similar conditions can be enumerated for any reasonable stopping
criteria. I then choose a split attribute selection method, thereby fully characterizing the
above two probabilities and hence the moments.
1. Fixed Height: This stopping criterion is basically that every path in the tree should be of length exactly h, where h ∈ [1, ..., d]. If h = 1 I classify based on just one attribute. If h = d then I have the all attribute tree. In general, a path mi1mj2...mlh is present in the tree iff the attributes A1, A2, ..., Ah are chosen in any order to form the path during the split attribute selection phase of a tree construction. Thus, for any path of length h to be present I biconditionally imply that the corresponding attributes are chosen.

2. Purity: This stopping criterion implies that I stop growing the tree from a particular split of a particular attribute if all datapoints lying in that split belong to the same class. I call such a path pure, else I call it impure. In this scenario, I could have paths of length 1 to d depending on when I encounter purity (assuming all datapoints do not lie in one class). Thus, I have the following two separate checks for paths of length d and less than d respectively.
a) Path mi1mj2...mld is present iff the path mi1mj2...ml(d−1) is impure and attributes A1, A2, ..., Ad−1 are chosen above Ad, or mi1mj2...ms(d−2)mld is impure and attributes A1, A2, ..., Ad−2, Ad are chosen above Ad−1, or ... or mj2...mld is impure and attributes A2, ..., Ad are chosen above A1. This means that if a certain set of d − 1 attributes is present in a path in the tree then I split on the dth attribute iff the current path is not pure, finally resulting in a path of length d.
b) Path mi1mj2...mlh is present, where h < d, iff the path mi1mj2...mlh is pure and attributes A1, A2, ..., Ah−1 are chosen above Ah and mi1mj2...ml(h−1) is impure, or the path mi1mj2...mlh is pure and attributes A1, A2, ..., Ah−2, Ah are chosen above Ah−1 and mi1mj2...ml(h−2)mlh is impure, or ... or the path mi1mj2...mlh is pure and attributes A2, ..., Ah are chosen above A1 and mj2...mlh is impure. This means that if a certain set of h − 1 attributes is present in a path in the tree then I split on some hth attribute iff the current path is not pure and the resulting path is pure.
The above conditions suffice for "path present" since the purity property is anti-monotone and the impurity property is monotone.

3. Scarcity: This stopping criterion implies that I stop growing the tree from a particular split of a certain attribute if its count is less than or equal to some pre-specified pruning bound. Let us denote this number by pb. As before, I have the following two separate checks for paths of length d and less than d respectively.
a) Path mi1mj2...mld is present iff the attributes A1, ..., Ad−1 are chosen above Ad and ct(mi1mj2...ml(d−1)) > pb, or the attributes A1, ..., Ad−2, Ad are chosen above Ad−1 and ct(mi1mj2...ml(d−2)mnd) > pb, or ... or the attributes A2, ..., Ad are chosen above A1 and ct(mi2mj3...mld) > pb.
b) Path mi1mj2...mlh is present, where h < d, iff the attributes A1, ..., Ah−1 are chosen above Ah and ct(mi1mj2...ml(h−1)) > pb and ct(mi1mj2...mlh) ≤ pb, or the attributes A1, ..., Ah−2, Ah are chosen above Ah−1 and ct(mi1mj2...ml(h−2)mnh) > pb and ct(mi1mj2...mnh) ≤ pb, or ... or the attributes A2, ..., Ah are chosen above A1 and ct(mi2mj3...mlh) > pb and ct(mi1mj2...mlh) ≤ pb.
This means that I stop growing the tree under a node once I find that the next chosen attribute produces a path with occupancy ≤ pb. The above conditions suffice for "path present" since the occupancy property is monotone.
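The sample-dependent parts of these conditions (the purity and occupancy of a partial path) can be written directly as functions of the sampled contingency table; a small sketch, assuming the table is stored as a NumPy array with one axis per attribute plus a final class axis (all names hypothetical):

import numpy as np

def _slice(table, path):
    # path maps attribute index -> fixed attribute value; other attributes stay free
    idx = tuple(path.get(a, slice(None)) for a in range(table.ndim - 1)) + (slice(None),)
    return table[idx]

def path_count(table, path):
    """ct(path): number of datapoints consistent with the partial path."""
    return _slice(table, path).sum()

def is_pure(table, path):
    """True if all datapoints consistent with the path share one class label."""
    per_class = _slice(table, path).reshape(-1, table.shape[-1]).sum(axis=0)
    return np.count_nonzero(per_class) <= 1

def scarcity_stop(table, path, pb):
    """True if the path's occupancy has fallen to at most the pruning bound pb."""
    return path_count(table, path) <= pb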
I observe from the above checks that I have two types of conditions that need to
be evaluated for a path being present namely, i) those that depend on the sample viz.
mi1mj2...ml(d−1) is impure or ct(mi1mj2...mlh) > pb, and ii) those that depend on the split
attribute selection method viz. A1, A2, ..., Ah are chosen. The former depends on the
data distribution which I have specified to be a multinomial. The latter I discuss in the
next subsection. Note that checks for a combination of the above stopping criteria can be
obtained by appropriately combining the individual checks.
5.1.5 Split Attribute Selection
In decision tree construction algorithms, at each iteration I have to decide the
attribute variable on which the data should be split. Numerous measures have been
developed Hall and Holmes [2003]. Some of the most popular ones aim to increase the
purity of a set of datapoints that lie in the region formed by that split. The purer the
region, the better the prediction and lower the error of the classifier. Measures such
as, i) Information Gain (IG) Quinlan [1986], ii) Gini Gain (GG) Breiman et al. [1984],
iii) Gain Ratio (GR) Quinlan [1986], iv) Chi-square test (CS) Shao [2003] etc. aim at
realising this intuition. Other measures using Principal Component Analysis Smith [2002],
Correlation-based measures Hall [1998] have also been developed. Another interesting yet
non-intuitive measure in terms of its utility is the Random attribute selection measure.
According to this measure I randomly choose the split attribute from the available set. The
decision tree that this algorithm produces is called a Random decision tree (RDT).
Surprisingly enough, a collection of RDTs quite often outperform their seemingly more
powerful counterparts Liu et al. [2005]. In this thesis I study this interesting variant.
I do this by first presenting a probabilistic characterization of selecting a particular
attribute/set of attributes, followed by simulation studies. Characterizations for the other
measures can be developed in similar vein by focusing on the working of each measure.
As an example, for the deterministic purity based measures mentioned above the split
attribute selection is just a function of the sample and thus by appropriately conditioning
on the sample I can find the relevant probabilities and hence the moments.
Before presenting the expression for the probability of selecting a split attribute/attributes
in constructing a RDT I extend the results in Dhurandhar and Dobra [2009] where
relationships were drawn between the moments of HE, CE, LE (just a special case of
cross-validation) and GE, to be applicable to randomized classification algorithms. The
random process is assumed to be independent of the sampling process. This result is
required since the results in Dhurandhar and Dobra [2009] are applicable to deterministic
classification algorithms and I would be analyzing RDT. With this I have the following
lemma.
Lemma 3. Let D and T be independent discrete random variables, with some distribution defined on each of them. Let $\mathcal{D}$ and $\mathcal{T}$ denote the domains of the random variables. Let f(d, t) and g(d, t) be two functions such that $\forall t \in \mathcal{T}$, $E_D[f(d, t)] = E_D[g(d, t)]$, where $d \in \mathcal{D}$. Then, $E_{\mathcal{T}\times\mathcal{D}}[f(d, t)] = E_{\mathcal{T}\times\mathcal{D}}[g(d, t)]$.
Proof.

$$E_{\mathcal{T}\times\mathcal{D}}[f(d, t)] = \sum_{t \in \mathcal{T}} \sum_{d \in \mathcal{D}} f(d, t)\, P[T = t, D = d] = \sum_{t \in \mathcal{T}} \sum_{d \in \mathcal{D}} f(d, t)\, P[D = d]\, P[T = t] = \sum_{t \in \mathcal{T}} E_D[g(d, t)]\, P[T = t] = E_{\mathcal{T}\times\mathcal{D}}[g(d, t)]$$
The result is valid even when D and T are continuous, but considering the scope
of this thesis I am mainly interested in the discrete case. This result implies that all
the relationships and expressions in Dhurandhar and Dobra [2009] hold, with an extra
expectation over t, for randomized classification algorithms where the random process
is independent of the sampling process. In equations 6–1 and 6–2 the expectations w.r.t.
Z(N) become expectations w.r.t. Z(N, t).
5.1.6 Random Decision Trees
In this subsection I explain the randomized process used for split attribute selection
and provide the expression for the probability of choosing an attribute/a set of attributes.
The attribute selection method I use is as follows. I assume a uniform probability
distribution in selecting the attribute variables i.e. attributes which have already not
been chosen in a particular branch, have an equal chance of being chosen for the next
level. The random process involved in attribute selection is independent of the sample
and hence Lemma 3 applies. I now give the expression for the probability of selecting
a subset of attributes from the given set for a path. This expression is required in the
computation of the above mentioned probabilities used in computing the moments. For
the first moment I need to find the following probability. Given d attributes A1, A2, ..., Ad
the probability of choosing a set of h attributes, where h ∈ {1, 2, ..., d}, is

$$P[h\ \text{attributes chosen}] = \frac{1}{\binom{d}{h}}$$
since choosing without replacement is equivalent to simultaneously choosing a subset of
attributes from the given set.
For the second moment when the trees are different (required in the finding of
variance of CE since the training sets in the various runs in cross validation are different,
i.e. for finding $E_{Z(N)\times Z(N)}[GE(\zeta)GE(\zeta')]$), the probability of choosing $l_1$ attributes for
a path in one tree and $l_2$ attributes for a path in another tree, where $l_1, l_2 \le d$, is given by

$$P[l_1\ \text{attribute path in tree 1},\ l_2\ \text{attribute path in tree 2}] = \frac{1}{\binom{d}{l_1}\binom{d}{l_2}}$$
since the process of choosing one set of attributes for a path in one tree is independent of
the process of choosing another set of attributes for a path in a different tree.
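Both selection probabilities are simple binomial-coefficient expressions; a one-line sketch of each (function names hypothetical):

from math import comb

def p_path_attributes_chosen(d, h):
    """Probability that a particular set of h attributes forms a path in one RDT."""
    return 1 / comb(d, h)

def p_paths_in_two_trees(d, l1, l2):
    """Probability for a set of l1 attributes in one tree and l2 in an independently built tree."""
    return 1 / (comb(d, l1) * comb(d, l2))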
For the second moment when the tree is the same (required in the finding of variance
of GE and HE i.e. for finding EZ(N)×Z(N) [GE(ζ)2]), the probability of choosing two sets
of attributes such that the two distinct paths resulting from them co-exist in a single tree
is given by the following. Assume I have d attributes A1, A2, ..., Ad. Let the lengths of
the two paths (or cardinality of the two sets) be l1 and l2 respectively, where l1, l2 ≤ d.
Without loss of generality assume l1 ≤ l2. Let p be the number of attributes common
to both paths. Notice that p ≥ 1 is one of the necessary conditions for the two paths to
co-exist. Let v ≤ p be those attributes among the total p that have same values for both
paths. Thus p − v attributes are common to both paths but have different values. At one
of these attributes in a given tree the two paths will bifurcate. The probability that the
two paths co-exist given our randomized attribute selection method is computed by finding
out all possible ways in which the two paths can co-exist in a tree and then multiplying
the number of each kind of way by the probability of having that way. A detailed proof is
given in the appendix. The expression for the probability based on the attribute selection
method is,
$$P[l_1\ \text{and}\ l_2\ \text{length paths co-exist}] = \sum_{i=0}^{v} {}^{v}\!P_i\, (l_1 - i - 1)!\,(l_2 - i - 1)!\,(p - v)\, prob_i$$

where ${}^{v}\!P_i = \frac{v!}{(v-i)!}$ denotes permutation and $prob_i = \frac{1}{d(d-1)\ldots(d-i)(d-i-1)^2\ldots(d-l_1+1)^2(d-l_1)\ldots(d-l_2+1)}$ is the probability of the ith possible way. For fixed height trees of height h, $(l_1-i-1)!(l_2-i-1)!$ becomes $(h-i-1)!^2$ and $prob_i = \frac{1}{d(d-1)\ldots(d-i)(d-i-1)^2\ldots(d-h+1)^2}$.
5.1.7 Putting things together
I now have all the ingredients that are required for the computation of the moments
of GE. In this subsection I combine the results derived in the previous subsections to
obtain expressions for PZ(N) [ζ(x)=Ci] and PZ(N)×Z(N) [ζ(x)=Ci ∧ ζ ′(x′)=Cv] which are
vital in the computation of the moments.
Let s.c.c.s. be an abbreviation for stopping criteria conditions that are sample
dependent. Conversely, let s.c.c.i. be an abbreviation for stopping criteria conditions that are
sample independent, i.e. conditions that depend on the attribute selection method. I
now provide expressions for the above probabilities categorized by the 3 stopping criteria.
5.1.7.1 Fixed Height
The conditions for ”path exists” for fixed height trees depend only on the attribute
selection method as seen in subsection 5.1.4. Hence the probability used in finding the first
moment is given by,
$$\begin{aligned} P_{Z(N)}[\zeta(x)=C_i] &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ \forall j \neq i,\ i,j \in [1,\ldots,k]] \\ &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \text{s.c.c.i.},\ \forall j \neq i,\ i,j \in [1,\ldots,k]] \\ &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \forall j \neq i,\ i,j \in [1,\ldots,k]]\, P_{Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p} \frac{P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \forall j \neq i,\ i,j \in [1,\ldots,k]]}{{}^{d}C_h} \end{aligned} \quad (5\text{--}5)$$

where ${}^{d}C_h = \frac{d!}{h!(d-h)!}$
and h is the length of the paths or the height of the tree. The
probability in the last step of the above derivation can be computed from the underlying
joint distribution. The probability for the second moment when the trees are different is
given by,
$$\begin{aligned} P_{Z(N)\times Z(N)}[\zeta(x)=C_i \wedge \zeta'(x')=C_v] &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \\ &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \cdot P_{Z(N)\times Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p,q} \frac{P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]]}{({}^{d}C_h)^2} \end{aligned} \quad (5\text{--}6)$$
where h is the length of the paths. The probability for the second moment when the
trees are identical is given by,
$$\begin{aligned} P_{Z(N)\times Z(N)}[\zeta(x)=C_i \wedge \zeta(x')=C_v] &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \\ &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \cdot P_{Z(N)\times Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p,q} \sum_{t=0}^{b} {}^{b}\!P_t\, (h-t-1)!^2\,(r-b)\, prob_t\; P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \end{aligned} \quad (5\text{--}7)$$
where r is the number of attributes that are common in the 2 paths, b is the number
of attributes that have the same value in the 2 paths, h is the length of the paths and
$prob_t = \frac{1}{d(d-1)\ldots(d-t)(d-t-1)^2\ldots(d-h+1)^2}$. As before, the probability comparing counts can be
computed from the underlying joint distribution.
5.1.7.2 Purity and Scarcity
The conditions for ”path exists” in the case of purity and scarcity depend on both the
sample and the attribute selection method as can be seen in 5.1.4. The probability used in
finding the first moment is given by,
$$\begin{aligned} P_{Z(N)}[\zeta(x)=C_i] &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ \forall j \neq i,\ i,j \in [1,\ldots,k]] \\ &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \text{s.c.c.i.},\ \text{s.c.c.s.},\ \forall j \neq i,\ i,j \in [1,\ldots,k]] \\ &= \sum_{p} P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \text{s.c.c.s.},\ \forall j \neq i,\ i,j \in [1,\ldots,k]]\, P_{Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p} \frac{P_{Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ \text{s.c.c.s.},\ \forall j \neq i,\ i,j \in [1,\ldots,k]]}{{}^{d}C_{h_p-1}\,(d - h_p + 1)} \end{aligned} \quad (5\text{--}8)$$
where hp is the length of the path indexed by p. The joint probability of comparing
counts and s.c.c.s. can be computed from the underlying joint distribution. The
probability for the second moment when the trees are different is given by,
$$\begin{aligned} P_{Z(N)\times Z(N)}[\zeta(x)=C_i \wedge \zeta'(x')=C_v] &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \\ &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \text{s.c.c.s.},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \cdot P_{Z(N)\times Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p,q} \frac{P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \text{s.c.c.s.},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]]}{{}^{d}C_{h_p-1}\,{}^{d}C_{h_q-1}\,(d-h_p+1)(d-h_q+1)} \end{aligned} \quad (5\text{--}9)$$
where hp and hq are the lengths of the paths indexed by p and q. The probability for
the second moment when the trees are identical is given by,
$$\begin{aligned} P_{Z(N)\times Z(N)}[\zeta(x)=C_i \wedge \zeta(x')=C_v] &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ path_p\ \text{exists},\ ct(path_q C_v) > ct(path_q C_w),\ path_q\ \text{exists},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \\ &= \sum_{p,q} P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \text{s.c.c.s.},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]]\, P_{Z(N)\times Z(N)}[\text{s.c.c.i.}] \\ &= \sum_{p,q} \sum_{t=0}^{b} \frac{{}^{b}\!P_t\,(h_p-t-2)!\,(h_q-t-2)!\,(r-b)\, prob_t}{(d-h_p+1)(d-h_q+1)}\; P_{Z(N)\times Z(N)}[ct(path_p C_i) > ct(path_p C_j),\ ct(path_q C_v) > ct(path_q C_w),\ \text{s.c.c.s.},\ \forall j \neq i,\ \forall w \neq v,\ i,j,v,w \in [1,\ldots,k]] \end{aligned} \quad (5\text{--}10)$$
where r is the number of attributes that are common in the 2 paths sparing the
attributes chosen as leaves, b is the number of attributes that have the same value, hp
and hq are the lengths of the 2 paths and, without loss of generality assuming hp ≤ hq,
$prob_t = \frac{1}{d(d-1)\ldots(d-t)(d-t-1)^2\ldots(d-h_p)^2(d-h_p-1)\ldots(d-h_q)}$. As before, the probability of comparing
counts and s.c.c.s. can be computed from the underlying joint distribution.
counts and s.c.c.s. can be computed from the underlying joint distribution.
Using the expressions for the above probabilities the moments of GE can be
computed. In the next section I perform experiments on synthetic as well as distributions
built on real data to portray the efficacy of the derived expressions.
5.2 Experiments
To exactly compute the probabilities for each path, the time complexity for fixed
height trees is $O(N^2)$ and for purity and scarcity based trees is $O(N^3)$. Hence, computing
exactly the probabilities and consequently the moments is practical for small values of
N . For larger values of N , I propose computing the individual probabilities using Monte
Carlo (MC). In the empirical studies I report, I show that the accuracy in estimating the
error (i.e. the moments of GE) by using our expressions with MC is always greater than
by directly using MC for the same computational cost. In fact, the accuracy of using the
expressions is never worse than MC even when MC is executed for 10 times the number of
iterations as those of the expressions. The true error or the golden standard against which
I compare the accuracy of these estimators is obtained by running MC for a week, which is
around 200 times the number of iterations as those of the expressions.
Notation: In the experiments, AF refers to the estimates obtained by using the
expressions in conjunction with Monte Carlo. MC-i refers to simple Monte Carlo being
executed for i times the number of iterations as those of the expressions. The term True
Error or TE refers to the golden standard against which I compare AF and MC-i.
General Setup: I perform empirical studies on synthetic as well as real data. The
experimental setup for synthetic data is as follows: I fix N to 10000. The number of
classes is fixed to two. I observe the behavior of the error for the three kinds of trees with
the number of attributes fixed to d = 5 and each attribute having 2 attribute values. I
then increase the number of attribute values to 3, to observe the effect that increasing
the number of split points has on the performance of the estimators. I also increase the
number of attributes to d = 8 to study the effect that increasing the number of attributes
has on the performance. With this I have a d + 1 dimensional contingency table whose d
dimensions are the attributes and the (d+1)th dimension represents the class labels. When
each attribute has two values the total number of cells in the table is c = 2d+1 and with
three values the total number of cells is c = 3d × 2. If I fix the probability of observing a
datapoint in cell i to be pi such that∑c
i=1 pi = 1 and the sample size to N the distribution
that perfectly models this scenario is a multinomial distribution with parameters N and
the set p1, p2, ..., pc. In fact, irrespective of the value of d and the number of attribute
values for each attribute the scenario can be modelled by a multinomial distribution.
In the studies that follow the pi are varied and the amount of dependence between the
attributes and the class labels is computed for each set of pi using the Chi-square test
Connor-Linton [2003]. More precisely, I sum over all i the squares of the difference of each
pi with the product of its corresponding marginals, with each squared difference being
divided by this product, i.e. $\text{correlation} = \sum_{i} \frac{(p_i - p_{im})^2}{p_{im}}$, where $p_{im}$ is the product of the
marginals for the ith cell. The behavior of the error for trees with the three aforementioned
stopping criteria is seen for different correlation values and for a class prior of 0.5.
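A short sketch of this synthetic setup, assuming "product of the corresponding marginals" means the attribute-configuration marginal times the class marginal, and laying the cell probabilities out as a (number of attribute configurations) x (number of classes) array; the names are hypothetical.

import numpy as np

def correlation(p):
    """Chi-square style dependence between the attributes and the class labels."""
    row = p.sum(axis=1, keepdims=True)   # P[X = x] for each attribute configuration
    col = p.sum(axis=0, keepdims=True)   # P[Y = y] for each class
    pim = row * col                      # product of the corresponding marginals
    return float(np.sum((p - pim) ** 2 / pim))

# Example: d = 5 binary attributes, 2 classes, uniform cell probabilities
p = np.full((2 ** 5, 2), 1 / 64)
table = np.random.default_rng(0).multinomial(10_000, p.ravel()).reshape(p.shape)
print(correlation(p))                    # uniform cells give correlation 0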
In case of real data, I perform experiments on distributions built on three UCI
datasets. I split the continuous attributes at the mean of the given data. I thus can form
a contingency table representing each of the datasets. The counts in the individual cells
divided by the dataset size provide us with empirical estimates for the individual cell
probabilities (pi). Thus, with the knowledge of N (dataset size) and the individual pi I
have a multinomial distribution. Using this distribution I observe the behavior of the error
for the three kinds of trees with results being applicable to other datasets that are similar
to the original.
Observations: Figures 5-3, 5-4 and 5-5 depict the error of fixed height trees with
the number of attributes being 5 for the first two figures and 8 for the third figure. The
number of attribute values increases from 2 to 3 in figures 5-3 and 5-4 respectively. I
observe in these figures that AF is significantly more accurate than both MC-1 and
MC-10. In fact the performance of the 3 estimators namely, AF, MC-1 and MC-10
remains more or less unaltered even with changes in the number of attributes and in the
number of splits per attribute. A similar trend is seen for both purity based trees i.e.
figures 5-6, 5-7 and 5-8 as well as scarcity based trees 5-9, 5-10 and 5-11. Though in the
case of purity based trees the performance of both MC-1 and MC-10 is much superior
as compared with their performance on the other two kinds of trees, especially at low
correlations. The reason for this being that, at low correlations the probability in each cell
of the multinomial is non-negligible and with N = 10000 the event that every cell contains
at least a single datapoint is highly likely. Hence, the trees I obtain with high probability
using the purity based stopping criteria are all ATT. Since in an ATT all the leaves are
identical irrespective of the ordering of the attributes in any path, the randomness in
the classifiers produced is only due to the randomness in the data generation process
and not because of the random attribute selection method. Thus, the space of classifiers
over which the error is computed reduces and MC performs well even for a relatively
fewer number of iterations. At higher correlations and for the other two kinds of trees the
probability of smaller trees is reasonable and hence MC has to account for a larger space
of classifiers induced by not only the randomness in the data but also by the randomness
in the attribute selection method.
In case of real data too (Figure 5-12), the performance of the expressions is significantly
superior as compared with MC-1 and MC-10. The performance of MC-1 and MC-10 for
the purity based trees is not as impressive here since the dataset sizes are much smaller (in
the tens or hundreds) compared to 10000 and hence the probability of having an empty
cell is not particularly low. Moreover, the correlations are reasonably high (above 0.6).
Reasons for superior performance of expressions: With simple MC, trees have
to be built while performing the experiments. Since, the expectations are over all possible
classifiers i.e. over all possible datasets and all possible randomizations in the attribute
selection phase, the exhaustive space over which direct MC has to run is huge. No tree
has to be explicitly built when using the expressions. Moreover, the probabilities for each
path can be computed in parallel. Another reason as to why calculating the moments using
expressions works better is that the portion of the probabilities for each path that depend
on the attribute selection method are computed exactly (i.e. with no error) by the given
expressions and the inaccuracies in the estimates only occur due to the sample dependent
portion in the probabilities.
5.3 Discussion
In the previous sections I derived the analytical expressions for the moments of
the GE of decision trees and depicted interesting behavior of RDT built under the 3
stopping criteria. It is clear that using the expressions I obtain highly accurate estimates
of the moments of errors for situations of interest. In this section I discuss issues related
to extension of the analysis to other attribute selection methods and issues related to the
computational complexity of the algorithm.
5.3.1 Extension
The conditions presented for the 3 stopping criteria namely, fixed height, purity and
scarcity are applicable irrespective of the attribute selection method. Commonly used
deterministic attribute selection methods include those based on Information Gain (IG),
Gini Gain (GG), Gain ratio (GR) etc. Given a sample the above metrics can be computed
for each attribute. Hence, the above metrics can be implemented as corresponding
functions of the sample. For example, in the case of IG I compute the loss in entropy (based
on terms of the form $q \log q$, where the q are computed from the sample) by the addition of an attribute as I build
the tree. I then compare the loss in entropy of all attributes not already chosen in the
path and choose the attribute for which the loss in entropy is maximum. Following
this procedure I build the path and hence the tree. To compute the probability of path
exists, I add these sample dependent conditions in the corresponding probabilities. These
conditions account for a particular set of attributes being chosen, in the 3 stopping
criteria. In other words, these conditions quantify the conditions in the 3 stopping criteria
that are attribute selection method dependent. Similar conditions can be derived for the
other attribute selection methods (attribute with maximum gini gain for GG, attribute
with maximum gain ratio for GR) from which the relevant probabilities and hence the
moments can be computed. Thus, while computing the probabilities given in equations
5–3 and 5–4 the conditions for path exists for these attribute selection methods depend
totally on the sample. This is unlike what I observed for the randomized attribute
selection criterion, where the conditions for path exists that depend on this randomized
criterion were sample independent, while the other conditions in purity and scarcity
were sample dependent. Characterizing these probabilities enables us to compute the
moments of GE for these other attribute selection methods.
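As a concrete instance, the information gain of a candidate split attribute is itself just a function of the sampled contingency table; a minimal sketch (the last axis of the table indexes the class label, and a non-empty node is assumed):

import numpy as np

def entropy(class_counts):
    p = class_counts / class_counts.sum()
    p = p[p > 0]                              # 0 log 0 treated as 0
    return -np.sum(p * np.log2(p))

def information_gain(table, attribute):
    """IG of splitting the node described by `table` on `attribute`."""
    other_axes = tuple(i for i in range(table.ndim - 1) if i != attribute)
    node = table.sum(axis=other_axes)         # shape: (values of attribute, classes)
    total = node.sum()
    parent = entropy(node.sum(axis=0))
    children = sum(row.sum() / total * entropy(row) for row in node if row.sum() > 0)
    return parent - children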
In the analysis that I presented, I assumed that the split points for continuous
attributes were determined apriori to tree construction. If the split point selection
algorithm is dynamic i.e. the split points are selected while building the tree, then in the
path exists conditions of the 3 stopping criteria I would have to append an extra condition
namely, the split occurs at ”this” particular attribute value. In reality, the value of ”this”
is determined by the values that the samples attain for the specific attribute in the
particular dataset, which is finite.1 Hence, while analyzing I can choose a set of allowed
1 Since dataset is finite.
values for ”this” for each continuous attribute. Using these updated set of conditions for
the 3 stopping criteria the moments of GE can be computed.
5.3.2 Scalability
The time complexity of implementing the analysis is proportional to the product of
the size of the input/output space 2 and the number of paths that are possible in the
tree while classifying a particular input. To this end, it should be noted that if a stopping
criterion is not carefully chosen and applied, then the number of possible trees and hence
the number of allowed paths can become exponential in the dimensionality. In such
scenarios, studying small or at best medium size trees is feasible. For studying larger trees
the practitioner should combine stopping criteria (e.g. pruning bound and fixed height or
scarcity and fixed height) i.e. combine the conditions given for each individual stopping
criteria or choose a stopping criterion that limits the number of paths (e.g. fixed height).
Keeping these simple facts in mind and on appropriate usage, the expressions can assist in
delving into the statistical behavior of the errors for decision tree classifiers.
5.4 Take-aways
I have developed a general characterization for computing the moments of the GE
for decision trees. In particular, I have characterized RDT for three stopping
criteria, namely fixed height, purity and scarcity. Being able to compute moments of
GE, allows us to compute the moments of the various validation measures and observe
their relative behavior. Using the general characterization, characterizations for specific
attribute selection measures (e.g. IG, GG etc.) other than randomized can be developed
as described before. As a technical result, I have extended the theory in Dhurandhar and
Dobra [2009] to be applicable to randomized classification algorithms; this is necessary
if the theory is to be applied to random decision trees as I did in this thesis. The
2 In case of continuous attributes the size of the input/output space is the size after discretization.
experiments reported in section 5.2 had two purposes: (a) portray the manner in which
the expressions can be utilized as an exploratory tool to gain a better understanding of
decision tree classifiers, and (b) show conclusively that the methodology in Dhurandhar
and Dobra [2009] together with the developments in this thesis provide a superior analysis
tool when compared with simple Monte Carlo.
Figure 5-1. The all attribute tree with 3 attributes A1, A2, A3, each having 2 values.
Figure 5-2. Given 3 attributes A1, A2, A3, the path m11m21m31 is formed irrespectiveof the ordering of the attributes. Three such permutations are shown in theabove figure.
Figure 5-3. Fixed Height trees with d = 5, h = 3 and attributes with binary splits.
Figure 5-4. Fixed Height trees with d = 5, h = 3 and attributes with ternary splits.
Figure 5-5. Fixed Height trees with d = 8, h = 3 and attributes with binary splits.
Figure 5-6. Purity based trees with d = 5 and attributes with binary splits.
Figure 5-7. Purity based trees with d = 5 and attributes with ternary splits.
Figure 5-8. Purity based trees with d = 8 and attributes with binary splits.
Figure 5-9. Scarcity based trees with d = 5, pb = N/10 and attributes with binary splits.

Figure 5-10. Scarcity based trees with d = 5, pb = N/10 and attributes with ternary splits.

Figure 5-11. Scarcity based trees with d = 8, pb = N/10 and attributes with binary splits.
Figure 5-12. Comparison between AF and MC (MC-1, MC-10, with TE as the golden standard) on three UCI datasets (Shuttle Landing Control, Pima Indians, Balloon) for trees pruned based on fixed height (h = 3), purity and scarcity (pb = N/10).
CHAPTER 6
K-NEAREST NEIGHBOR CLASSIFIER
The kNN algorithm is a simple yet effective and hence commonly used classification
algorithm in industry and research. It is known to be a consistent estimator Stone [1977],
i.e. it asymptotically achieves the Bayes error within a constant factor. None of the more
sophisticated classification algorithms, e.g. SVMs, Neural Networks, etc., is known to
outperform it consistently Stanfill and Waltz [1986]. However, the algorithm is susceptible
to noise, and choosing an appropriate value of k is more of an art than a science.
6.1 Specific Contributions
I develop expressions for the first 2 moments of GE for the k-Nearest Neighbor
classification algorithm built on categorical data. I accomplish this by expressing the
moments as functions of the sample produced by the underlying joint distribution. In
particular, I develop efficient characterizations for the moments when the distance metric
used in the kNN algorithm, is independent of the sample. I also discuss issues related to
the scalability of the algorithm. I use the derived expressions to study the classification
algorithm in settings of interest (for example, different values of k) by visualization. The joint
distribution I use in the empirical studies that follow the theory is a multinomial, the
most generic data generation model for the discrete case.
6.2 Technical Framework
In this section I present the generic expressions for the moments of GE that were
given in Dhurandhar and Dobra [2009]. The moments of the GE of a classifier built over
an independent and identically distributed (i.i.d.) random sample drawn from a joint
distribution, are taken over the space of all possible classifiers that can be built, given the
classification algorithm and the joint distribution. Though the classification algorithm
may be deterministic, the classifiers act as random variables since the sample that they are
built on is random. The GE of a classifier, being a function of the classifier, also acts as
a random variable. Due to this fact, the GE of a classifier, denoted by GE(ζ), has a distribution
and consequently I can talk about its moments. The generic expressions for the first two
moments of GE taken over the space of possible classifiers resulting from samples of size N
from some joint distribution are as follows:
$$E_{Z(N)}[GE(\zeta)] = \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y]\, P[Y(x) \neq y] \quad (6\text{--}1)$$

$$E_{Z(N) \times Z(N)}[GE(\zeta)\,GE(\zeta')] = \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x]\, P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']\, P[Y(x) \neq y]\, P[Y(x') \neq y'] \quad (6\text{--}2)$$
Equation 6–1 is the expression for the first moment of the GE(ζ). Notice that
inside the first sum $\sum_{x\in\mathcal{X}}$ the input x is fixed and inside the second sum the output
y is fixed, thus the PZ(N) [ζ(x)=y] is the probability of all possible ways in which an
input x is classified into class y. This probability depends on the joint distribution and
the classification algorithm. The other two probabilities are directly derived from the
distribution. Thus, customizing the expression for EZ(N) [GE(ζ)], effectively means
deciphering a way of computing PZ(N) [ζ(x)=y]. Similarly, customizing the expression for
EZ(N)×Z(N) [GE(ζ)GE(ζ ′)] means finding a way of computing PZ(N)×Z(N) [ζ(x)=y ∧ ζ ′(x′)=y′]
given any joint distribution. In Section 6.4 I derive expressions for these two probabilities,
which depend only on the underlying joint probability distribution, thus providing a way
of computing them analytically.
6.3 K-Nearest Neighbor Algorithm
The k-nearest neighbor (kNN) classification algorithm classifies an input based on
the class labels of the closest k points in the training dataset. The class label assigned
to an input is usually the most numerous class of these k closest points. The underlying
intuition that is the basis of this classification model is that nearby points will tend to
have higher "similarity", viz. the same class, than points that are far apart.
The notion of closeness between points is determined by the distance metric used.
When the attributes are continuous, the most popular metric is the l2 norm or the
Euclidean distance. Figure 6.9 shows points in $R^2$ space. The points b, c and d are the
3-nearest neighbors (k=3) of the point a. When the attributes are categorical the most
popular metric used is the Hamming distance Liu and White [1997]. The Hamming
distance between two points/inputs is the number of attributes that have distinct
values for the two inputs. This metric is sample independent i.e. the Hamming distance
between two inputs remains unchanged, irrespective of the sample counts produced in the
corresponding contingency table. For example, Table 6-1 represents a contingency table.
The Hamming distance between x1 and x2 is the same irrespective of the values of Nij
where i ∈ {1, 2, ..., M} and j ∈ {1, 2, ..., v}. Other metrics such as Value Difference Metric
(VDM) Stanfill and Waltz [1986], Chi-square Connor-Linton [2003] etc. exist, that depend
on the sample. I now provide a global characterization for calculating the aforementioned
probabilities for both kinds of metrics. This is followed by an efficient characterization for
the sample independent metrics, which includes the traditionally used and most popular
Hamming distance metric.
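A minimal sketch of Hamming-distance kNN over categorical inputs; note that ties in distance at the k-th position are broken arbitrarily in this sketch, whereas the characterization below keeps every neighbor up to that distance. All names are hypothetical.

from collections import Counter

def hamming(a, b):
    """Number of attributes on which the two categorical inputs differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def knn_predict(train_x, train_y, x, k):
    """Classify x as the most numerous class among its k Hamming-nearest neighbors."""
    order = sorted(range(len(train_x)), key=lambda i: hamming(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]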
6.4 Computation of Moments
In this section I characterize the probabilities PZ(N) [ζ(x)=y] and
PZ(N)×Z(N) [ζ(x)=y ∧ ζ ′(x′)=y′] required for the computation of the first two moments. In
the case, that the number of nearest neighbors at a particular distance d is more than k
for an input and at any lesser value of distance the number of NN is less than k, I classify
the input based on all the NN upto the distance d.
6.4.1 General Characterization
I provide a global characterization for the above mentioned probabilities without any
assumptions on the distance metric in this subsection.
The scenario wherein xi is classified into class Cj, given i ∈ {1, 2, ..., M} and j ∈
{1, 2, ..., v}, depends on two factors: 1) the kNN of xi and 2) the class label of the majority
dependent or independent of the sample as previously discussed. The second factor is
always determined by the sample. The PZ(N) [ζ(xi)=Cj] is the probability of all possible
ways that input xi can be classified into class Cj, given the joint distribution over the
input-output space. This probability for xi is calculated by summing the joint probabilities
of having a particular set of kNN and the majority of this set of kNN has a class label Cj,
over all possible kNN that the input can have. Formally,
P_{Z(N)}[ζ(x_i)=C_j] = ∑_{q∈Q} P_{Z(N)}[q, c(q, j) > c(q, t) ∀t ∈ {1, 2, ..., v}, t ≠ j]     (6–3)
where q is a set of kNN of the given input and Q is the set containing all possible q. c(q, b) is a function that calculates the number of kNN in q that lie in class C_b. For example, from Table 6-1, if x_1 and x_2 are the kNN of some input, then q = {x_1, x_2} and c(q, b) = N_{1b} + N_{2b}. Notice that, since x_1 and x_2 are the kNN of some input, ∑_{i=1}^{2} ∑_{j=1}^{v} N_{ij} ≥ k. Moreover, if the kNN comprise the entire input sample, then the resulting classification is equivalent to classification performed using class priors determined by the sample. The probability P_{Z(N)×Z(N)}[ζ(x)=y ∧ ζ'(x')=y'] used in the computation of the second moment is calculated by going over the kNN of two inputs rather than one. The expression for this probability is given by,
P_{Z(N)×Z(N)}[ζ(x_i)=C_j ∧ ζ'(x_l)=C_w] = ∑_{q∈Q} ∑_{r∈R} P_{Z(N)×Z(N)}[q, c(q, j) > c(q, t), r, c(r, w) > c(r, s), ∀s, t ∈ {1, 2, ..., v}, t ≠ j, s ≠ w]     (6–4)
where q and r are sets of kNN of x_i and x_l respectively, Q and R are the sets containing all possible q and r respectively, and c(·, ·) has the same connotation as before.
As mentioned before, the probability of a particular q (or of the joint q, r) depends on the distance metric used. The inputs (e.g., x_1, x_2, ...) that are the k nearest neighbors of some given input depend on the sample irrespective of the distance metric, i.e., the kNN of an input depend on the sample even if the distance metric is sample independent. I illustrate this fact with an example.
Example 1. Say x_1 and x_2 are the two closest inputs to x_i, with x_1 closer than x_2, based on some sample independent distance metric. x_1 and x_2 are both the kNN of x_i if and only if ∑_{a=1}^{2} ∑_{b=1}^{v} c(a, b) ≥ k, ∑_{b=1}^{v} c(1, b) < k and ∑_{b=1}^{v} c(2, b) < k. The first inequality states that the number of copies of x_1 and x_2, given by ∑_{j=1}^{v} N_{1j} and ∑_{j=1}^{v} N_{2j} respectively in the contingency Table 6-1, is greater than or equal to k. If this inequality holds, then the class label of input x_i is determined by the copies of x_1, of x_2, or of both; no input besides these two is involved in the classification of x_i. The second and third inequalities state that the number of copies of x_1 and of x_2, respectively, is less than k. This forces both x_1 and x_2 to be used in the classification of x_i. If the first inequality were untrue, then inputs farther away would also play a part in the classification of x_i. Thus the kNN of an input depend on the sample irrespective of the distance metric used.
The above example also illustrates the manner in which the set q (or r) can be
characterized as a function of the sample, enabling us to compute the two probabilities
required for the computation of the moments from any given joint distribution over
the data, for sample independent metrics. Without loss of generality (w.l.o.g.) assume
x_1, x_2, ..., x_M are inputs in non-decreasing order (from left to right) of their distance from a given input x_i, based on some sample independent distance metric. Then this input having the kNN given by the set q = {x_{a_1}, x_{a_2}, ..., x_{a_z}}, where a_1, a_2, ..., a_z ∈ {1, 2, ..., M} and a_d < a_f if d < f, is equivalent to the following conditions on the sample being true: (i) ∑_{l=1}^{z} ∑_{j=1}^{v} c(x_{a_l}, j) ≥ k; (ii) ∑_{j=1}^{v} c(x_{a_l}, j) > 0 for all l ∈ {1, 2, ..., z}; (iii) ∑_{j=1}^{v} c(l, j) < k for all l ∈ 2^q with cardinality |l| = z − 1, where 2^q is the power set of q; and (iv) ∑_{j=1}^{v} c(x_h, j) = 0 for all x_h ∈ {x_1, x_2, ..., x_M} − q with h < a_z. The conditions imply that, for the elements of q to be the kNN of the given input, the sum of their counts must be greater than or equal to k, the sum of the counts of any subset of q (it suffices to check subsets of cardinality |q| − 1) must be less than k, the count of each element of q must be non-zero, and the other inputs that are not in q but are no farther from the given input than the farthest input in q must have counts of zero. Notice that all of these conditions are functions of the sample.
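As a minimal illustration of these sample conditions (my own sketch; the function and argument names are hypothetical), the following Python fragment checks, on one sampled contingency table, whether a candidate set q is exactly the set of kNN, assuming the per-input counts ∑_j N_{ij} have already been aggregated into counts[i].

from itertools import combinations

def is_knn_set(q, counts, order, k):
    """Return True if the inputs in q are exactly the kNN of the given input,
    per the four sample conditions above.
    q      -- candidate set of input indices
    counts -- counts[i] = number of copies of input i in the sample (summed over classes)
    order  -- all input indices sorted by distance from the given input, closest first
    k      -- number of nearest neighbors
    """
    q = set(q)
    if sum(counts[i] for i in q) < k:                 # (i) joint count must reach k
        return False
    if any(counts[i] == 0 for i in q):                # (ii) every element must actually occur
        return False
    for sub in combinations(q, len(q) - 1):           # (iii) dropping any element falls below k
        if sum(counts[i] for i in sub) >= k:
            return False
    last = max(order.index(i) for i in q)             # position of the farthest element of q
    closer = set(order[:last]) - q                    # (iv) inputs at least as close, not in q...
    return all(counts[i] == 0 for i in closer)        # ...must have count zero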
The other condition in the probabilities is that a particular class label is the most
numerous among the kNNs, which is also a function of the sample. In case of sample
dependent metrics, the conditions that are equivalent to having a particular set q as kNN,
are totally dependent on the specific distance metric used. Since these distance metrics are
sample dependent, I can certainly write these conditions as the corresponding functions of
the sample. Since all the involved conditions in the above probabilities can be expressed
as functions of the sample, I can compute them over any joint distribution defined over
the data.
6.4.2 Efficient Characterization for Sample Independent Distance Metrics
In the previous subsection I presented the global characterization for the kNN algorithm. Though this characterization provides insight into the relationship between the moments of GE, the underlying distribution and the kNN classification algorithm, it is inefficient to compute in practical scenarios. This is due to the fact that any given input can have itself and/or any of the other inputs as kNN. Hence, the total number of terms involved in finding the probabilities in equations 6–3 and 6–4 turns out to be exponential in the number of inputs M. Considering these limitations, I provide alternative expressions for computing these probabilities efficiently for sample independent distance metrics, viz. the Manhattan distance Krause [1987], the Chebyshev distance Abello et al. [2002] and the Hamming distance. The number of terms in the new characterization I propose is linear in M for P_{Z(N)}[ζ(x)=y] and quadratic in M for P_{Z(N)×Z(N)}[ζ(x)=y ∧ ζ'(x')=y'].
The characterization just presented computes the probability of classifying an input into a particular class separately for each possible set of kNN. What if, instead, I combine disjoint sets of these probabilities into groups and compute a single probability for each group? This would reduce the number of terms to be computed, thus speeding up the computation of the moments. To accomplish this, I use the fact that the distance between inputs is independent of the sample. A consequence of this independence is that all pairwise distances between the inputs are known prior to the computation of the probabilities. This makes it possible to obtain, for any given input, an ordering of the inputs from closest to farthest. For example, if I have inputs a_1b_1, a_1b_2, a_2b_1 and a_2b_2, then given input a_1b_1, I know that a_2b_2 is the farthest from the given input, followed by a_1b_2 and a_2b_1, which are equidistant, and a_1b_1 is the closest in terms of Hamming distance.
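A small sketch of this sample-independent ordering (again my own illustration): inputs are grouped into "shells" of equal Hamming distance from the given input, closest shell first, which is exactly the ordering exploited by the grouping scheme described next.

from collections import defaultdict

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def distance_shells(x, inputs):
    """Group candidate inputs by Hamming distance from x, closest shell first."""
    shells = defaultdict(list)
    for xi in inputs:
        shells[hamming(x, xi)].append(xi)
    return [shells[d] for d in sorted(shells)]

# With inputs encoded as tuples, e.g. ('a1', 'b1'), and given input ('a1', 'b1'), this yields
# [[('a1', 'b1')], [('a1', 'b2'), ('a2', 'b1')], [('a2', 'b2')]], matching the ordering above.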
Before presenting a full-fledged characterization for computing the two probabilities, I
explain the basic grouping scheme that I employ with the help of an example.
Example 2. W.l.o.g. let x_1 be the given input, for which I want to find P_{Z(N)}[ζ(x_1)=C_1]. Let x_1, x_2, ..., x_M be inputs arranged in increasing order of distance from left to right, as shown in Figure 6-2. In this case, the number of terms I need to compute P_{Z(N)}[ζ(x_1)=C_1] is M. The first term calculates the probability of classifying x_1 into C_1 when the kNN are multiple instances of x_1 (i.e., ∑_{j=1}^{v} N_{1j} ≥ k). Thus, the first group contains only the set {x_1}. The second term calculates the probability of classifying x_1 into C_1 when the kNN are multiple instances of x_2 or of {x_1, x_2}. The second group thus contains the sets {x_2} and {x_1, x_2} as the possible kNN of x_1. Proceeding in this manner, I eventually have M terms and consequently M groups. The M-th group contains the sets in which x_M is an element of every set and the other elements, across the different sets, range over all possible combinations of the remaining M − 1 inputs. Notice that this grouping scheme covers all possible kNN, as in the general case stated previously, i.e., ∪_{i=1}^{M} g_i = 2^S − φ, where g_i denotes the i-th group, S = {x_1, x_2, ..., x_M} and φ is the empty set; moreover, any two groups are disjoint, i.e., g_i ∩ g_j = φ for all i, j ∈ {1, 2, ..., M}, i ≠ j, preventing multiple computations of the same probability. The r-th (r > 1) term in the expression for P_{Z(N)}[ζ(x_1)=C_1], given the contingency Table 6-1, is
P_{Z(N)}[ζ(x_1)=C_1, sets in g_r ∈ kNN] = P_{Z(N)}[ ∑_{i=1}^{r} N_{i1} ≥ ∑_{i=1}^{r} N_{ij} ∀j ∈ {2, 3, ..., v},  ∑_{i=1}^{r} ∑_{l=1}^{v} N_{il} ≥ k,  ∑_{i=1}^{r−1} ∑_{l=1}^{v} N_{il} < k ]     (6–5)
where the last two conditions force only sets in g_r to be among the kNN. The first condition ensures that C_1 is the most numerous class among the given kNN. For r = 1 the last condition becomes invalid and unnecessary, hence it is removed. The probability for the second moment is the sum of probabilities that are calculated for two inputs rather than one, and over two groups, one for each input. W.l.o.g. assume that x_1 and x_2 are two inputs with x_1 being the closest input to x_2, that sets in g_r ∈ kNN^{(1)}, i.e., the kNN for input x_1, and sets in g_s ∈ kNN^{(2)}, i.e., the kNN for input x_2, where r, s ∈ {2, 3, ..., M}; then the rs-th term in P_{Z(N)×Z(N)}[ζ(x_1)=C_1, ζ(x_2)=C_2] is,
P_{Z(N)×Z(N)}[ζ(x_1)=C_1, sets in g_r ∈ kNN^{(1)}, ζ(x_2)=C_2, sets in g_s ∈ kNN^{(2)}] = P_{Z(N)×Z(N)}[ ∑_{i=1}^{r} N_{i1} ≥ ∑_{i=1}^{r} N_{ij} ∀j ∈ {2, 3, ..., v},  ∑_{i=1}^{r} ∑_{l=1}^{v} N_{il} ≥ k,  ∑_{i=1}^{r−1} ∑_{l=1}^{v} N_{il} < k,  ∑_{i=1}^{s} N_{i2} ≥ ∑_{i=1}^{s} N_{ij} ∀j ∈ {1, 3, ..., v},  ∑_{i=1}^{s} ∑_{l=1}^{v} N_{il} ≥ k,  ∑_{i=1}^{s−1} ∑_{l=1}^{v} N_{il} < k ]     (6–6)
In this case, when r = 1, remove the ∑_{i=1}^{r−1} ∑_{l=1}^{v} N_{il} < k condition from the above probability; if s = 1, remove the ∑_{i=1}^{s−1} ∑_{l=1}^{v} N_{il} < k condition.
In the general case, there may be multiple inputs that lie at a particular distance from any given input; i.e., the concentric circles in Figure 6-2 may contain more than one input. To accommodate this case, I extend the grouping scheme previously outlined. Previously, the group g_r contained all possible sets formed by the r − 1 distinct closest inputs to a given input, with the r-th closest input being present in every set. Realize that the r-th closest input is not necessarily the r-th NN, since there may be multiple copies of any of the r − 1 closest inputs. In the modified definition, the group g_r contains all possible sets formed by the r − 1 closest inputs, with at least one of the r-th closest inputs being present in every set. I illustrate this with an example. Say I have inputs a_1b_1, a_1b_2, a_2b_1 and a_2b_2; then given input a_1b_1, I know that a_2b_2 is the farthest from the given input, followed by a_1b_2 and a_2b_1, which are equidistant, and a_1b_1 is the closest in terms of Hamming distance. The group g_1 contains only {a_1b_1} as before. The group g_2 in this case contains the sets {a_1b_2}, {a_2b_1}, {a_1b_2, a_2b_1}, {a_1b_2, a_1b_1} and {a_2b_1, a_1b_1}. Observe that each set has at least one of the two inputs a_1b_2, a_2b_1. I now characterize the probabilities in equations 6–5 and 6–6 for this general case. Let q_r denote the set containing the inputs from the closest to the r-th closest, to some input x_i. The function c(·, ·) has the same connotation as before. With this, the r-th term in P_{Z(N)}[ζ(x_i)=C_j], where r ∈ {2, 3, ..., G} and G ≤ M is the number of groups, is
P_{Z(N)}[ζ(x_i)=C_j, sets in g_r ∈ kNN] = P_{Z(N)}[ c(q_r, j) > c(q_r, l) ∀l ∈ {1, 2, ..., v}, l ≠ j,  ∑_{t=1}^{v} c(q_r, t) ≥ k,  ∑_{t=1}^{v} c(q_{r−1}, t) < k ]     (6–7)
where the last condition is removed for r = 1. Similarly, the rs-th term in P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w], where r, s ∈ {2, 3, ..., G}, is
P_{Z(N)×Z(N)}[ζ(x_i)=C_j, sets in g_r ∈ kNN^{(i)}, ζ(x_p)=C_w, sets in g_s ∈ kNN^{(p)}] = P_{Z(N)×Z(N)}[ c(q_r, j) > c(q_r, l) ∀l ∈ {1, 2, ..., v}, l ≠ j,  ∑_{t=1}^{v} c(q_r, t) ≥ k,  ∑_{t=1}^{v} c(q_{r−1}, t) < k,  c(q_s, w) > c(q_s, l) ∀l ∈ {1, 2, ..., v}, l ≠ w,  ∑_{t=1}^{v} c(q_s, t) ≥ k,  ∑_{t=1}^{v} c(q_{s−1}, t) < k ]     (6–8)
where ∑_{t=1}^{v} c(q_{r−1}, t) < k and ∑_{t=1}^{v} c(q_{s−1}, t) < k are removed when r = 1 and s = 1 respectively. From equation 6–7, P_{Z(N)}[ζ(x_i)=C_j] is given by

P_{Z(N)}[ζ(x_i)=C_j] = ∑_{r=1}^{G} T_r     (6–9)

where T_r is the r-th term in P_{Z(N)}[ζ(x_i)=C_j]. From equation 6–8, P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w] is given by
P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w] = ∑_{r,s=1}^{G} T_{rs}     (6–10)

where T_{rs} is the rs-th term in P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w].
With this grouping scheme I have been able to reduce the number of terms in the calculation of P_{Z(N)}[ζ(x_i)=C_j] and P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w] from exponential in M (the number of distinct inputs) to manageable proportions of Ω(M) terms for the first probability and Ω(M^2) terms for the second probability. Moreover, I have accomplished this without compromising on accuracy.
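The thesis computes the terms T_r exactly from the joint distribution; purely as an illustration of what each term measures, the following Monte Carlo sketch (my own, assuming a multinomial joint distribution over the M×v cells of Table 6-1) estimates the r-th term of equation 6–7 by sampling contingency tables and checking its three conditions.

import numpy as np

def term_prob_mc(p, N, shells, r, j, k, trials=20000, seed=0):
    """Monte Carlo estimate of the r-th term in Equation 6-7.
    p      -- M x v array of cell probabilities (rows: inputs, columns: classes)
    N      -- sample size
    shells -- list of lists of row indices, closest shell first (q_r = union of first r shells)
    r, j   -- 1-based shell index and 0-based target class index
    k      -- number of nearest neighbors
    """
    rng = np.random.default_rng(seed)
    M, v = p.shape
    q_r = [i for shell in shells[:r] for i in shell]
    q_r1 = [i for shell in shells[:r - 1] for i in shell]
    hits = 0
    for _ in range(trials):
        counts = rng.multinomial(N, p.ravel()).reshape(M, v)   # one sampled contingency table
        c_r = counts[q_r].sum(axis=0)                          # c(q_r, t) for every class t
        c_r1_total = counts[q_r1].sum() if q_r1 else 0         # count of the first r-1 shells
        majority = all(c_r[j] > c_r[t] for t in range(v) if t != j)
        if majority and c_r.sum() >= k and c_r1_total < k:
            hits += 1
    return hits / trials

For r = 1 the last condition is vacuously satisfied, matching the remark above that it is removed in that case.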
6.5 Scalability Issues
In the previous section I provided the generic characterization and the time efficient
characterization for sample independent distance metrics, relating the two probabilities
required for the computation of the first and second moments, to probabilities that can
be computed using the joint distribution. In this section I discuss approximation schemes
that may be carried out to further speed up the computation. There are two factors on which the time complexity of calculating P_{Z(N)}[ζ(x_i)=C_j] and P_{Z(N)×Z(N)}[ζ(x_i)=C_j, ζ'(x_p)=C_w] depends:
1. the number of terms (or smaller probabilities) that sum up to the above probabilities,
2. the time complexity of each term.
Reduction in number of terms: In the previous section I reduced the number
of terms to a small polynomial in M for a class of distance metrics. The current
enhancement I propose further reduces the number of terms and works even for the general case, at the expense of accuracy, which I can control. The r-th term in the characterizations has the condition that the count of the closest r − 1 distinct inputs is less than k. The probability of this condition being true monotonically decreases with increasing r. After a point, this probability may become "small enough" that the total contribution of the remaining terms in the sum is not worth computing, given the additional computational cost. I can set a threshold: once the probability of this condition falls below it, I avoid computing the terms that follow.
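A sketch of this thresholding idea, with term(r) and prob_fewer_than_k(r) as hypothetical stand-ins for the per-term computation and for the probability that the closest r − 1 distinct inputs contribute fewer than k points:

def truncated_probability(term, prob_fewer_than_k, G, threshold=1e-4):
    """Accumulate the terms of Equation 6-9, skipping the tail once the probability
    that the first r-1 shells hold fewer than k points drops below the threshold."""
    total = 0.0
    for r in range(1, G + 1):
        if r > 1 and prob_fewer_than_k(r) < threshold:
            break                       # remaining terms contribute negligibly
        total += term(r)
    return total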
Reduction in term computation: Each of the terms can be computed directly
from the underlying joint distribution. Different tricks can be employed to speed up the
computation, such as collapsing cells of the table, but even then the complexity is still a small polynomial in N. For example, using a multinomial joint distribution, the
time complexity of calculating a term for the probability of the first moment is quartic
in N and for the probability of the second moment it is octic in N . This problem can
be addressed by using the approximation techniques proposed in Dhurandhar and Dobra
[2009]. Using techniques such as optimization, I can find tight lower and upper bounds for
the terms in essentially constant time.
Parallel computation: Note that each of the terms is self-contained and does not depend on the others. This fact can be used to compute the terms in parallel, eventually merging them to produce the result. This further reduces the computation time.
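Since the terms are independent, a minimal sketch of the parallel evaluation (assuming a picklable term function, as in the previous sketch) is straightforward:

from concurrent.futures import ProcessPoolExecutor

def probability_parallel(term, G, workers=4):
    """Evaluate the independent terms T_1, ..., T_G in parallel and sum them."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(term, range(1, G + 1)))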
With this I have not only proposed analytical expressions for the moments of GE for
the kNN classification model applied to categorical attributes, but have also suggested
efficient methods of computing them.
6.6 Experiments
In this section I portray the manner in which the characterizations can be used
to study the kNN algorithm in conjunction with the model selection measures (viz.
cross-validation). Generic relationships between the moments of GE and moments of CE
(cross-validation error) that are not algorithm specific are given in Dhurandhar and Dobra
[2009]. I use the expressions provided in this thesis and these relationships to conduct the
experiments described below. The main objective of the experiments I report is to provide a flavor of the utility of the expressions as a tool for studying this learning method.
6.6.1 General Setup
I conduct four main studies. The first three studies are on synthetic data and the fourth on two real UCI datasets. In the first study, I observe the performance of the kNN algorithm for different values of k. In the second study, I observe the convergence behavior of the algorithm with increasing sample size. In the third study, I observe the relative performance of cross-validation in estimating the GE for different values of k. In these three studies I vary the correlation (measured using Chi-square Connor-Linton [2003]) between the attributes and the class labels to see the effect it has on the performance of the algorithm. In the fourth study, I choose two UCI datasets and compare the estimates of cross-validation with the true error estimates. I also explain how a multinomial distribution can be built over these datasets. The same idea can be used to build a multinomial over any discrete dataset to represent it precisely.
Setup for studies 1-3: I set the dimensionality of the space to 8. The number of classes is fixed to two, with each attribute taking two values. This gives rise to a multinomial with 2^9 = 512 cells. If I fix the probability of observing a datapoint in cell i to be p_i, such that ∑_{i=1}^{512} p_i = 1, and the sample size to N, I then have a completely specified multinomial distribution with parameters N and the set of cell probabilities {p_1, p_2, ..., p_{512}}. The distance metric I use is the Hamming distance and the class prior is 0.5.
Setup for study 4: In the case of real data I choose two UCI datasets whose attributes are not limited to binary splits. Each dataset can be represented in the form of a contingency table where each cell contains the number of copies of the corresponding input belonging to a particular class. These cell counts divided by the dataset size provide empirical estimates of the individual cell probabilities (p_i). Thus, with the knowledge of N (the dataset size) and the individual p_i, I have a multinomial distribution whose representative sample is the particular dataset. Using this distribution I observe the estimates of the true error (i.e., moments of GE) and the estimates given by cross-validation for different values of k. Notice that these estimates are also applicable (with high probability) to other datasets that are similar to the original.
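A minimal sketch of this construction (my own illustration; the dataset is assumed to be an iterable of (input, class) pairs):

from collections import Counter

def empirical_multinomial(dataset):
    """Return the dataset size N and the empirical cell probabilities p_i,
    one per distinct (input_tuple, class_label) cell of the contingency table."""
    dataset = list(dataset)
    N = len(dataset)
    counts = Counter(dataset)
    probs = {cell: n / N for cell, n in counts.items()}
    return N, probs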
A detailed explanation of these four studies is given below. The expressions in equations 6–9 and 6–10 are used to produce the plots.
6.6.2 Study 1: Performance of the KNN Algorithm for Different Values of k.
In the first study I observe the behavior of the GE of the kNN algorithm for different
values of k and for a sample size of 1000.
In Figure 6-3a the attributes and the class labels are totally correlated (i.e. correlation
= 1). I observe that for a large range of values of k (from small to large) the error is zero.
This is expected since any input lies only in a single class with the probability of lying in
the other class being zero.
In Figure 6-3b I reduce the correlation between the attributes and class labels from
being totally correlated to a correlation of 0.5. I observe that for low values of k the error
is high, it then plummets to about 0.14 and increases again for large values of k. The high
error for low values of k is because the variance of GE is large for these low values. The
reason for the variance being large is that the number of points used to classify a given
input is relatively small. As the value of k increases this effect reduces up to a stage and
then remains constant. This produces the middle portion of the graph where the GE is
the smallest. In the right portion of the graph i.e. at very high values of k, almost the
entire sample is used to classify any given input. This procedure is effectively equivalent
to classifying inputs based on class priors. In the general setup I mentioned that I set the
priors to 0.5, which results in the high errors.
In Figure 6-3c I reduce the correlation still further down to 0 i.e. the attributes and
the class labels are uncorrelated. Here I observe that the error is initially high, then
reduces and remains unchanged. As before the initial upsurge is due to the fact that the
variance for low values of k is high, which later settles down.
From the three figures, Figure 6-3a, Figure 6-3b and Figure 6-3c I observe a gradual
increase in GE as the correlation reduces. The values of k that give low error for the three
values of correlation and a sample size of 1000 can be deciphered from the corresponding
figures. In Figure 6-3a, I notice that small, mid-range and large values of k are all
acceptable. In Figure 6-3b I find that mid-range values (200 to 500) of k are desirable.
In the third figure, i.e. Figure 6-3c I discover that mid-range and large values of k produce
low error.
6.6.3 Study 2: Convergence of the KNN Algorithm with Increasing Sample Size.
In the second study I observe the convergence characteristics of the GE of the kNN
algorithm for different values of k, and with increasing sample size going from 1000 to
100000.
In Figure 6-4a the attributes and class labels are completely correlated. The error
remains zero for small, medium and large values of k irrespective of the sample size. In
this case any value of k is suitable.
In Figure 6-4b the correlation between the attributes and the class labels is 0.5. For
small sample sizes (less than and close to 1000), large and small values of k result in high error
while moderate values of k have low error throughout. The initial high error for low values
of k is because the variance of the estimates is high. The reason for high error at large
values of k is because it is equivalent to classifying inputs based on priors and the prior
is 0.5. At moderate values of k both these effects are diminished and hence the error
produced by them is low. From the figure I see that after around 1500 the errors of the
low and high k converge to the error of moderate k. Thus here a k within the range 200 to
0.5N would be appropriate.
In Figure 6-4c the attributes and the class labels are uncorrelated. The initial high
error for low k is again because of the high variance. Since the attributes and class labels
are uncorrelated with a given prior, the error is 0.5 for moderate as well as high values
of k. Here large values of k do not have higher error than the mid-range values since the
prior is 0.5. The low value of k converges to the errors of the comparatively larger values
at around a sample size of 1500.
Here too, from the three figures, Figure 6-4a, Figure 6-4b and Figure 6-4c, I observe a gradual increase in GE as the correlation reduces. At sample sizes greater than about 1500, large, medium and small values of k all perform equally well.
6.6.4 Study 3: Relative Performance of 10-fold Cross Validation on Synthetic Data.
In the third and final study on synthetic data I observe the performance of 10-fold
Cross validation in estimating the GE for different values of k and sample sizes of 1000
and 10000. The plots for the moments of cross validation error (CE) are produced using
the expressions I derived and the relationships between the moments of GE and the
moments of CE for deterministic classification algorithms given in Dhurandhar and Dobra
[2009].
In Figure 6-5a the correlation is 1 and the sample size is 1000. Cross validation
exactly estimates the GE which is zero irrespective of the value of k. When I increase the
sample size to 10000, as shown in Figure 6-6a, cross validation still does a pretty good job in estimating the actual error (i.e., GE) of kNN.
In Figure 6-5b the correlation is set to 0.5 and the sample size is 1000. I observe that
cross validation initially, i.e. for low values of k underestimates the actual error, performs
well for moderate values of k and grossly overestimates the actual error for large values of
k. At low values of k the actual error is high because of the high variance, which I have
previously discussed. Hence, even though the expected values of GE and CE are close, the variances are far apart, since the variance of CE is low. This leads to the optimistic
estimate made by cross validation. At moderate values of k the variance of GE is reduced
and hence cross validation produces an accurate estimate. When k takes large values most
of the sample is used to classify an input, which is equivalent to classification based on
priors. The effect of this is more pronounced in the case of CE than GE, since a higher
percentage of the training sample ((9/10)N) is used for the classification of an input for a fixed k than when computing GE. Due to this, CE rises more steeply than GE. When I
increase the sample size to 10000, as is depicted in Figure 6-6b, the poor estimate at low
values of k that I saw for a smaller sample size of 1000 vanishes. The reason for this is
that the variance of GE reduces with the increase in sample size. Even for moderate values
of k the performance of cross validation improves though the difference in accuracy of
estimation is not as vivid as in the previous case. For large values of k though the error in
estimation is somewhat reduced it is still noticeable. It is advisable that in the scenario
presented I should use moderate values of k, ranging from about 200 to 0.5N, to achieve a reasonable amount of accuracy in the prediction made by cross-validation.
In Figure 6-5c the attributes are uncorrelated to the class labels and the sample size
is 1000. For low values of k the variance of GE is high while the variance of CE is low
and hence, the estimate of cross validation is off. For medium and large values of k, cross
validation estimates the GE accurately, which has the same reason mentioned above. On
increasing the sample size to 10000, shown in Figure 6-6c the variance of GE for low values
of k reduces and cross validation estimates the GE with high precision. In general, the
GE for any value of k will be estimated accurately by cross validation in this case, but for
lower sample sizes (below and around 1000) the estimates are accurate for moderate and
large values of k.
6.6.5 Study 4: Relative Performance of 10-fold Cross Validation on Real Datasets.
In the fourth and final study I observe the behavior of the true error (E[GE] + Std(GE)) and the error estimated by cross-validation on two UCI datasets. On the Balloon dataset in Figure 6-7, I observe that cross-validation estimates the true error accurately for a k value of 2. When k is increased to 5, the cross-validation estimate becomes pessimistic. This is because of the increase in the variance of CE. I also observe that the true error is lower for k equal to 2. The reason for this is that the expected error is much lower in this case than for k equal to 5, even though the variance for k equal to 2 is comparatively higher. For the dataset on the right in Figure 6-7, cross-validation does a good job for both the small and the larger value of k. The true error in this case is lower for the higher k, since the expectations for both values of k are roughly the same but the variance for the smaller k is larger. This is mainly due to the high covariance between successive runs of cross-validation.
6.7 Discussion
From the previous section I see that the expressions for the moments assist in
providing highly detailed explanations of the observed behavior. Mid-range values of k
were the best in studies 1 and 2 for small sample sizes. The reason for this is that at small values of k the prediction was based on individual cells and, with a small sample size, the estimates were unstable, producing a large variance. For high values of k
the classification was essentially based on class priors and hence the expected error was
high, even-though the variance in this case was low. In the case of mid-range values of
k, the pitfalls of the extreme values of k were circumvented (since k was large enough
to reduce variance but small enough so as to prevent classification based on priors) and
hence the performance was superior. 10-fold cross-validation, which is considered the "holy grail" of error estimation, is not always ideal, as I have seen in the experiments. The most common reason why cross-validation underperformed in certain specific cases was that its variance was high, which in turn was due to the covariance between successive runs of cross-validation being high. The ability to make such subtle observations and provide meticulous explanations for them is the key strength of the deployed methodology: developing and using the expressions.
Another important aspect is that, in the experiments, I built a single distribution on
each test dataset to observe the best value of k. Considering the fact that data can be
noisy I can build multiple distributions with small perturbations in parameters (depending
on the level of noise) and observe the performance of the algorithm for different values of k
using the expressions. Then I can choose a robust value of k for which the estimates of the
error are acceptable on most (or all) built distributions. Notice that this value of k may
not be the best choice, on the distribution built without perturbations. I can thus use the
expressions to make these types of informed decisions.
As I can see, by building expressions for the moments of GE in the manner portrayed,
classification models in conjunction with popular model selection measures can be studied
in detail. The expressions can act as a guiding tool in making the appropriate choice of
model and model selection measure in desired scenarios. For example, in the experiments
I observed that 10-fold cross-validation did not perform well in certain cases. In these
cases I can use the expressions to study cross-validation with different number of folds
and attempt to find the ideal number of folds for our specific situation. Moreover, such
characterizations can aid in finding answers to, or challenging the appropriateness of, questions such as: What value of v in v-fold cross-validation gives the best bias/variance trade-off? The appropriateness of some queries sometimes has to be challenged, since it may very well be the case that no single value of v is truly optimal. In fact, depending on the situation, different values of v, or maybe even other model selection measures (viz. hold-out set etc.), may be optimal. Analyzing such situations and finding the appropriate values of the parameters (i.e., v for cross-validation, or f, the hold-out fraction, for hold-out-set validation) can be accomplished using the methodology I have deployed in the thesis. Sometimes it is intuitive to anticipate the behavior of a learning algorithm in extreme cases, but the behavior in non-extreme cases is not as intuitive. Moreover, determining the precise point at which the behavior of an algorithm starts to emulate a particular extreme case is a non-trivial task. The methodology can be used to study such cases and potentially a wide range of
other relevant questions. Essentially, the studies 1 and 2 in the experimental section are
examples of such studies. In those experiments, at extreme correlations the behavior is
more or less predictable but at intermediate correlations it is not.
What the studies in the experimental section and the discussion above suggest is
that the method in Dhurandhar and Dobra [2009] and developments such as the ones
introduced in this thesis open new avenues for studying learning methods, allowing them to be assessed for their robustness and appropriateness for a specific task, with lucid elucidations given for their behavior. These studies do not replace but complement purely
theoretical and empirical studies usually carried out when evaluating learning methods.
6.8 Possible Extensions
I discussed the importance of the methodology in the previous section. Below, I
touch upon ways of extending the analysis provided in this thesis. An interesting line
of future research would be to efficiently characterize the sample dependent distance
metrics. Another interesting line would be to extend the analysis to the continuous kNN
classification algorithm. A possible way of doing this would be to consider a set of k
points that would be kNN to a given input (recollect that to characterize the moments,
I only need to characterize the behavior of the algorithm on individual inputs e.g.
PZ(N) [ζ(xi)=Cj]) and consider the remaining N-k points to lie outside the smallest ball
encompassing the kNN. Under these conditions I would integrate the density defined on
the input/output space over all possible such N (i.e. k which are kNN and the remaining
N-k) with the appropriate condition for class majority (i.e. to classify an input in Ci,
I would have the condition that, at least bkvc + 1 points that are kNN lie in class Ci).
A rigorous analysis using ideas from this thesis would have to be performed and the
complexity discussed for the continuous kNN. I plan to address these issues in the future.
6.9 Take-aways
I provided a general characterization for the moments of GE of the kNN algorithm
applied to categorical data. In particular, I developed an efficient characterization for
the moments when the distance metric was sample independent. I discussed issues
related to scalability in using the expressions and suggested optimizations to speedup the
computation. I later portrayed the usage of the expressions and hence the methodology
with the help of empirical studies. It remains to be seen how extensible such an analysis
is to other learning algorithms. However, if such an analysis is in fact possible, it can be deployed as a tool for better understanding the statistical behavior of learning models in the non-asymptotic regime.
Table 6-1. Contingency table with v classes, M input vectors and total sample size N = ∑_{i=1}^{M} ∑_{j=1}^{v} N_{ij}.

X     C1     C2     ...   Cv
x1    N11    N12    ...   N1v
x2    N21    N22    ...   N2v
...   ...    ...    ...   ...
xM    NM1    NM2    ...   NMv
Figure 6-1. b, c and d are the 3 nearest neighbours of a.
Figure 6-2. The figure shows the extent to which a point x_i is near to x_1. The radius of the smallest encompassing circle for a point x_i is proportional to its distance from x_1. x_1 is the closest point and x_M is the farthest.
Figure 6-3. Behavior of the GE for different values of k with sample size N = 1000 and the correlation between the attributes and class labels being 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation.
Figure 6-4. Convergence of the GE for different values of k when the sample size (N) increases from 1000 to 100000 and the correlation between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation. In (b) and (c), after about N = 1500, large, mid-range and small values of k give the same error, depicted by the dashed line.
Figure 6-5. Comparison between the GE and the 10-fold cross validation error (CE) estimate for different values of k when the sample size (N) is 1000 and the correlation between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation.
Figure 6-6. Comparison between the GE and the 10-fold cross validation error (CE) estimate for different values of k when the sample size (N) is 10000 and the correlation between the attributes and class labels is 1 in (a), 0.5 in (b) and 0 in (c). Std() denotes standard deviation.
Figure 6-7. Comparison between true error (TE) and CE on 2 UCI datasets.
CHAPTER 7
INSIGHTS INTO CROSS-VALIDATION
A major portion of the research in Machine Learning is devoted to building
classification models. The error that these models make over the entire input, i.e., the expected loss over the entire input space, is defined as their Generalization Error (GE). Ideally, I would want to choose the model with the lowest GE. In practice, though, I do not have the entire input and consequently cannot compute the GE. Nonetheless, a number of methods have been proposed, namely hold-out set, the Akaike information criterion, the Bayesian information criterion, cross-validation etc., which aim at estimating the GE from the available input.
In this chapter I focus on cross-validation, which is arguably one of the most popular
GE estimation methods. In v-fold cross-validation the dataset of size N is divided into v
equal parts. The classification model is trained on v − 1 parts and tested on the remaining
part. This is performed v times and the average error over the v runs is considered as
an estimate of GE. This method is known to have low variance Kohavi [1995], Plutowski
[1996] for about 10-20 folds and is hence commonly used for small sample sizes.
Most of the experimental work on cross-validation focuses on reporting observations Kohavi [1995], Plutowski [1996] and not on understanding the reasons for the observed behavior. Moreover, modeling the covariance between individual runs of cross-validation is not a straightforward task and is hence not adequately studied, though it is considered to have a non-trivial impact on the behavior of cross-validation. The work presented in Bengio and Grandvalet [2003], Markatou et al. [2005] addresses issues related to covariance, but it is focused on building and studying the behavior of estimators for the overall variance of cross-validation. In Markatou et al. [2005] the estimators of the moments of cross-validation error (CE) are primarily studied for the mean estimation problem and in the regression setting. The goal of this chapter is quite different. I do not wish to build estimators for the moments of CE; rather, I want to experimentally observe the behavior of the moments of cross-validation and provide explanations for the observed
behavior in the classification setting. The classification models I run these experiments on
consist of the Naive Bayes Classifier (NBC), a parametric model, and two non-parametric models, namely the K-Nearest Neighbor Classifier (KNN) and Decision Trees (DT). I choose a mix of parametric and non-parametric models so that their hypothesis spaces are varied enough. Additionally, these models are widely used in practice. The moments, however, are computed not using Monte Carlo directly but using the expressions given in Dhurandhar and Dobra [2009, 2008, 2007]. The advantage of using these closed form expressions is that they are exact formulas (not approximations) for the moments of CE, and hence these moments can be studied accurately with respect to any chosen distribution. In fact, as it turns out, approximating certain probabilities in these expressions also leads to significantly higher accuracy in computing the moments when compared with directly using Monte Carlo. The reason for this is that the parameter space of the individual probabilities that need to be computed in these expressions is much smaller than the space over which the moments have to be computed at large, and hence directly using Monte Carlo to estimate the moments can prove highly inaccurate in many cases Dhurandhar and Dobra [2009, 2008]. Another advantage of using the closed form expressions is that they give me more control over the settings I wish to study.
In summary, the goal in this chapter is to empirically study the behavior of
the moments of CE (plotted using the expressions in the Appendix) and to provide
interesting explanations for the observed behavior. As I will see, when studying the
variance of CE, the covariance between the individual runs plays a decisive role and hence
understanding its behavior is critical in understanding the behavior of the total variance
and consequently the behavior of CE. I provide insights into the behavior of the covariance
apropos increasing sample size, increasing correlation between the data and the class labels
and increasing number of folds.
In the next section I review some basic definitions and previous results that are
relevant to the computation of the moments of CE. In Section 7.2 I provide an overview of the expressions customized to compute the moments of CE for the three classification algorithms, namely DT, NBC and KNN. In Section 7.3 I conduct a brief literature survey. In Section 7.4, the experimental section, I provide some keen insights into the behavior of cross-validation, which is the primary goal. I discuss the implications of the study conducted and summarize the major developments of the chapter in Section 7.5.
7.1 Preliminaries
Probability distributions completely characterize the statistical behavior of a random
variable. Moments of a random variable give us information about its probability
distribution. Thus, if I have knowledge of the moments of a random variable I can
make statements about its behavior. In some cases, characterizing a finite subset of moments may prove to be a more desirable alternative than characterizing the entire distribution, which can be computationally expensive. By employing moment analysis
and using linearity of expectation efficient generalized expressions for the moments of GE
and relationships between the moments of GE and the moments of CE were derived in
Dhurandhar and Dobra [2009]. In this section I review the relevant results which are used
in the present study of CE.
Consider that N points are drawn independently and identically (i.i.d.) from a
given distribution and a classification algorithm is trained over these points to produce a
classifier. If multiple such sets of N i.i.d. points are sampled and a classification algorithm
is trained on each of them I would obtain multiple classifiers. Each of these classifiers
would have its own GE, hence the GE is a random variable defined over the space of
classifiers which are induced by training a classification algorithm on each of the datasets
that are drawn from the given distribution. The moments of GE computed over this space
of all possible such datasets of size N , depend on three things: 1) the number of samples
N , 2) the particular classification algorithm and 3) the given underlying distribution. I
denote by D(N) the space of datasets of size N drawn from a given distribution. The
moments taken over this new distribution – the distribution over the space of datasets of a
particular size, are related to the moments taken over the original given distribution which
is over individual inputs in the following manner,
E_{D(N)}[F(ζ)] = E_{(X×Y)×(X×Y)×...×(X×Y)}[F(ζ)]
= ∑_{(x_1,y_1)∈X×Y} ∑_{(x_2,y_2)∈X×Y} ... ∑_{(x_N,y_N)∈X×Y} P[X_1=x_1 ∧ Y_1=y_1 ∧ ··· ∧ X_N=x_N ∧ Y_N=y_N] · F(ζ(x_1, ..., x_N))
= ∑_{(x_1,y_1)∈X×Y} ∑_{(x_2,y_2)∈X×Y} ... ∑_{(x_N,y_N)∈X×Y} ( ∏_{i=1}^{N} P[X_i=x_i ∧ Y_i=y_i] ) F(ζ(x_1, ..., x_N))
where F(.) is some function that operates on a classifier ζ and ζ is a classifier obtained
by training on a particular dataset belonging to D(N). X and Y denote the input and
output space respectively. X1, ..., XN denote a set of N i.i.d. random variables defined over
the input space and Y1, ..., YN denote a set of N i.i.d. random variables defined over the
output space. For simplicity of notation I denote the moments over the space of the new
distribution rather than over the space of the given distribution.
Notice that in the above formula, E_{D(N)}[F(ζ)] was expressed in terms of a product of probabilities, since the independence of the samples (x_1, y_1), ..., (x_N, y_N) allows for the factorization of P[X_1=x_1 ∧ Y_1=y_1 ∧ ··· ∧ X_N=x_N ∧ Y_N=y_N]. By instantiating the function F(·) with GE(·), I have a formula for computing the moments of GE. The
problem with this characterization is that it is highly inefficient (exponential in the
size of the input-output space). Efficient characterizations for computing the moments
were developed in Dhurandhar and Dobra [2009] which I will shortly review. The
characterization reduces the number of terms in the moments from an exponential in
the input-output space to linear for the computation of the first moment and quadratic for
the computation of the second moment.
I define the random variables of interest, namely hold-out error, cross-validation error and generalization error. I also define the moments of the necessary variables. In the notation used
to represent these moments I write the probabilistic space over which the moments are
taken as subscripts. If no subscript is present, the moments are taken over the original
input-output space. This convention is strictly followed in this particular section. In the
remaining sections I drop the subscript for the moments since the formulas can become
tedious to read and the space over which these moments are computed can be easily
deciphered from the context.
Hold-out Error (HE): The hold-out procedure involves randomly partitioning the dataset D into two parts: D_t, the training dataset of size N_t, and D_s, the test dataset of size N_s. A classifier is built over the training dataset and the error is estimated as the average error over the test dataset. Formally,

HE = (1/N_s) ∑_{(x,y)∈D_s} λ(ζ(x), Y(x))

where Y(x) ∈ Y is a random variable modeling the true class label of the input x, λ(·, ·) is a 0-1 loss function which is 0 when its two arguments are equal and 1 otherwise, and ζ is the classifier built on the training data D_t, with ζ(x) being its prediction on the input x.
Cross Validation Error (CE): v-fold cross validation consists of randomly partitioning the available data into v chunks, training v classifiers, each using all the data but one chunk, and testing the performance of each classifier on its held-out chunk. The estimate of the error of the classifier built from the entire data is the average error over the chunks. Denoting by HE_i the hold-out error on the i-th chunk of the dataset D, the cross-validation error is given by,

CE = (1/v) ∑_{i=1}^{v} HE_i
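The following sketch (my own, assuming a train_fn that returns a classifier exposing a predict method) mirrors these definitions: each fold's hold-out error uses the 0-1 loss, and CE is their average over the v folds.

import random

def cross_validation_error(data, train_fn, v=10, seed=0):
    """v-fold cross-validation error: CE = (1/v) * sum_i HE_i with 0-1 loss."""
    data = list(data)
    random.Random(seed).shuffle(data)
    chunks = [data[i::v] for i in range(v)]           # v roughly equal parts
    hold_out_errors = []
    for i in range(v):
        train = [pt for j, c in enumerate(chunks) if j != i for pt in c]
        zeta = train_fn(train)                        # classifier built on v-1 chunks
        test = chunks[i]
        he = sum(zeta.predict(x) != y for x, y in test) / len(test)
        hold_out_errors.append(he)
    return sum(hold_out_errors) / v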
Generalization Error (GE): The GE of a classifier ζ w.r.t. the underlying
distribution over the input-output space X × Y is given by,
GE(ζ) = E[λ(ζ(x), Y(x))] = P[ζ(x) ≠ Y(x)]
where x ∈ X is an input and Y (x) ∈ Y the true class label of the input x. It is thus the
expected error over the entire input.
Moments of GE: Given an underlying input-output distribution and a classification
algorithm, by generating N i.i.d. datapoints from the underlying distribution and training
the classification algorithm on these datapoints I obtain a classifier ζ. If I sample multiple
(in fact all possible) such sets of N datapoints from the given distribution and train the
classification algorithm on each of them, I induce a space of classifiers trained on a space
of datasets of size N denoted by D(N). Since the process of sampling produces a random
sample of N datapoints, the classifier induced by training the classification algorithm
on this sample is a random function. The generalization error w.r.t. the underlying
distribution of classifier ζ, denoted by GE(ζ) is also a random variable that can be studied
and whose moments are given by,
E_{D(N)}[GE(ζ)] = ∑_{x∈X} P[x] ∑_{y∈Y} P_{D(N)}[ζ(x)=y] P[Y(x)≠y]     (7–1)
E_{D(N)×D(N)}[GE(ζ)GE(ζ')] = ∑_{x∈X} ∑_{x'∈X} P[x] P[x'] · ∑_{y∈Y} ∑_{y'∈Y} P_{D(N)×D(N)}[ζ(x)=y ∧ ζ'(x')=y'] · P[Y(x)≠y] P[Y(x')≠y']     (7–2)
where X and Y denote the space of inputs and outputs respectively (if the input is continuous, the sums over X in the above formulas are replaced by integrals, everything else remaining the same). Y(x) represents the true output for a given input x (Y(·) may or may not be randomized). P[x] and P[x'] represent the probability of having a particular input. ζ' in equation 7–2 is a classifier like ζ (may be the same or different) induced
by the classification algorithm trained on a sample from the underlying distribution.
P_{D(N)}[ζ(x)=y] P[Y(x)≠y] represents the probability of error. The first probability in the product, P_{D(N)}[ζ(x)=y], depends on the classification algorithm and the data distribution that determines the training dataset. The second probability, P[Y(x)≠y], depends only on the underlying distribution. Also note that both these probabilities are actually conditioned on x, but I omit writing them explicitly as conditionals since this is obvious and it makes the formulas more readable. E_{D(N)}[·] denotes the expectation taken over all possible datasets of size N drawn from the data distribution. The terms in equation 7–2 have similar semantics but apply to pairs of inputs and outputs.
Thus, by being able to compute each of these probabilities I can compute the moments of
GE.
Moments of CE: The process of sampling a dataset (i.i.d.) of size N from a
probability distribution and then partitioning it randomly into two disjoint parts of size Nt
and Ns, is statistically equivalent to sampling two different datasets of size Nt and Ns i.i.d.
from the same probability distribution. The first moment of CE is just the expected error
of the individual runs of cross-validation. In the individual runs the dataset is partitioned
into disjoint training and test sets. Dt(Nt) and Ds(Ns) denote the space of training sets of
size Nt and test sets of size Ns respectively. Hence, the first moment of CE is taken w.r.t.
the Dt(Nt)×Ds(Ns) space which is equivalent to the space obtained by sampling datasets
of size N = Nt + Ns followed by randomly splitting them into training and test sets. In
1 If input is continuous I replace sum over X by integrals in the above formulas,everything else remaining same.
2 Y(.) may or may not be randomized.
138
the computation of the variance of CE I have to compute the covariance between any two runs of cross-validation (equation 7–3). In the covariance I have to compute the following cross moment: E_{D_t^{ij}(((v−2)/v)N) × D_s^i(N/v) × D_s^j(N/v)}[HE_i HE_j], where D_t^{ij}(k) is the space of overlapped training sets of size k in the i-th and j-th runs of cross-validation (i, j ≤ v and i ≠ j), D_s^f(k) is the space of all test sets of size k drawn from the data distribution in the f-th run of cross-validation (f ≤ v), D_t^f(k) is the space of all training sets of size k drawn from the data distribution in the f-th run of cross-validation (f ≤ v), and HE_f is the hold-out error of the classifier in the f-th run of cross-validation. Since the cross moment considers the interaction between two runs of cross-validation, it is taken over a space consisting of training and test sets involving both runs rather than just one. Hence, the subscript in the cross moment is a cross product of three spaces (the overlapped training sets between two runs and the corresponding test sets). The other moments in the variance of CE are taken over the same space as the expected value. The variance of CE is given by,
Var(CE) = (1/v^2) [ ∑_{i=1}^{v} Var(HE_i) + ∑_{i,j, i≠j}^{v} Cov(HE_i, HE_j) ]
= (1/v^2) [ ∑_{i=1}^{v} ( E_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i^2] − E^2_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i] )
  + ∑_{i,j, i≠j}^{v} ( E_{D_t^{ij}(((v−2)/v)N) × D_s^i(N/v) × D_s^j(N/v)}[HE_i HE_j] − E_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i] E_{D_t^j(((v−1)/v)N) × D_s^j(N/v)}[HE_j] ) ]
The reason I introduced the moments of GE previously is that in Dhurandhar and Dobra [2009] relationships were drawn between these moments and the moments of CE. Thus,
using the expressions for the moments of GE and the relationships which I will state
shortly, I have expressions for the moments of CE.
The relationship between the expected values of CE and GE is given by,
E_{D_t(((v−1)/v)N) × D_s(N/v)}[CE] = E_{D_t(((v−1)/v)N)}[GE(ζ)]
where v is the number of folds, Dt(k) is the space of all training sets of size k that are
drawn from the data distribution and Ds(k) is the space of all test sets of size k drawn
from the data distribution.
In the computation of the variance of CE I need to find the individual variances and the covariances. In Dhurandhar and Dobra [2009] it was shown that E_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i] = E_{D_t(((v−1)/v)N) × D_s(N/v)}[CE] for all i ∈ {1, 2, ..., v}, and hence the expectation of HE_i can be computed using the above relationship between the expected CE and the expected GE. Notice that the space of training and test datasets over which the moments are computed is the same for each fold (since the space depends only on the size and all the folds are of the same size) and hence the corresponding moments are also the same. To compute the remaining terms in the variance I use the following relationships.
The relationship between the second moment of HE_i, ∀i ∈ {1, 2, ..., v}, and the moments of GE is given by,

E_{D_t^i(((v−1)/v)N) × D_s^i(N/v)}[HE_i^2] = (v/N) E_{D_t(((v−1)/v)N)}[GE(ζ)] + ((N−v)/N) E_{D_t(((v−1)/v)N)}[GE(ζ)^2]
The relationship between the cross moment and the moments of GE is given by,
E_{D_t^{ij}(((v−2)/v)N) × D_s^i(N/v) × D_s^j(N/v)}[HE_i HE_j] = E_{D_t^{ij}(((v−2)/v)N)}[GE(ζ^i) GE(ζ^j)]

where ζ^f is the classifier built in the f-th run of cross-validation. All the terms in the variance can be computed using the above relationships.
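Putting these relationships together, the following sketch (my own; the three GE moments are assumed to have already been computed from the customized expressions) returns the expectation and variance of CE.

def ce_moments(e_ge, e_ge_sq, e_ge_cross, N, v):
    """Combine the relationships above.
    e_ge       -- E[GE(zeta)] for training size (v-1)N/v
    e_ge_sq    -- E[GE(zeta)^2] for the same training size
    e_ge_cross -- E[GE(zeta_i) GE(zeta_j)] over overlapped training sets of size (v-2)N/v
    """
    e_ce = e_ge                                         # E[CE] = E[GE]
    e_he = e_ce                                         # every fold has the same expectation
    e_he_sq = (v / N) * e_ge + ((N - v) / N) * e_ge_sq  # second moment of HE_i
    var_he = e_he_sq - e_he ** 2
    cov_he = e_ge_cross - e_he ** 2                     # Cov(HE_i, HE_j)
    var_ce = var_he / v + (v - 1) / v * cov_he          # Equation 7-3
    return e_ce, var_ce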
7.2 Overview of the Customized Expressions
In Section 7.1 I provided the generalized expressions for computing the moments
of GE and consequently moments of CE. In particular the moments I compute are:
E[CE] and V ar(CE). The formula for the variance of CE can be rewritten as a convex
combination of the variance of the individual runs and the covariance between any two
runs. Formally,
Var(CE) = (1/v) Var(HE) + ((v−1)/v) Cov(HE_i, HE_j)     (7–3)
where HE is the error of any individual run and HE_i and HE_j are the errors of the i-th and j-th runs. Since the moments are over all possible datasets, the variances and covariances are the same for all single runs and all pairs of runs respectively, hence the above formula.
In the expressions for the moments the probabilities P [ζ(x)=y] and P [ζ(x)=y ∧ ζ ′(x′)=y′]
are the only terms that depend on the particular classification algorithm. Customizing the
expressions for the moments equates to finding these probabilities. The other terms in the
expressions are straightforward to compute from the data distribution. I now give a high
level overview of how the two probabilities are customized for DT, NBC and KNN. The
precise details are in the previous chapters.
1. DT: To find the probability of classifying an input into a particular class (P[ζ(x)=y]), I have to sum the probabilities over all paths (a path is a set of nodes and edges from root to leaf in a tree) in the set of possible trees that include the input x and in which the majority of the datapoints lying in them belong to class y. These probabilities can be computed by fixing the split attribute selection method, the stopping criterion deployed to curb the growth of the tree, and the data distribution from which an input is drawn. The probability P[ζ(x)=y ∧ ζ'(x')=y'] can be computed similarly, by considering pairs of paths (one for each input) rather than a single path.
2. NBC: The NBC algorithm assumes that the attributes are independent of each other. An input is classified into a particular class if the product of the class conditionals of each attribute for the input, multiplied by the particular class prior, is greater than the same quantity computed for each of the other classes. To compute P[ζ(x)=y], the probability of the described quantity for input x belonging to class y being greater than the same quantity for each of the other classes is computed. The probability for the second moment can be computed analogously by considering pairs of inputs and outputs.
3. KNN: In KNN, to compute P[ζ(x)=y], the probability that the majority of the K nearest neighbors of x belong to class y is found. In the extreme case when K=N (the dataset size), the probability of the empirical prior of class y being greater than the empirical priors of the other classes is computed; in this case the majority classifier and KNN behave the same way. To compute P[ζ(x)=y ∧ ζ'(x')=y'], the probability that the majority of the K nearest neighbors of x belong to class y and the majority of the K nearest neighbors of x' belong to class y' is found.
By using the customized expressions I can accurately study the behavior of
cross-validation. This method is more accurate than directly using Monte-Carlo to
estimate the moments since the parameter space of the individual probabilities that
need to be estimated is much smaller than the entire space over which the moments are
computed.
7.3 Related Work
There is a large body of both experimental and theoretical work that addresses
the problem of studying cross validation. In Efron [1986] cross validation is studied in
the linear regression setting with squared loss and is shown to be biased upwards in
estimating the mean of the true error. More recently the same author in Efron [2004],
compared parametric model selection techniques namely, covariance penalties with the
non-parametric cross validation and showed that under appropriate modeling assumptions
the former is more efficient than cross validation. Breiman [1996] showed that cross
validation gives an unbiased estimate of the first moment of GE. Though cross-validation
has desirable characteristics for estimating the first moment, Breiman stated that its
variance can be significant. In Moore and Lee [1994] heuristics are proposed to speed up
cross-validation, which becomes an expensive procedure as the number of folds increases.
In Zhu and Rohwer [1996] a simple setting was constructed in which cross-validation
performed poorly. Goutte [1997] refuted this proposed setting and claimed that a realistic
scenario in which cross-validation fails is still an open question.
The major theoretical work on cross-validation is aimed at finding bounds. The
current distribution-free bounds for cross-validation Devroye et al. [1996], Kearns and Ron
[1997], Blum et al. [1999], Elisseeff and Pontil [2003], Vapnik [1998] are loose, and some of
them are applicable only in restricted settings, such as bounds that rely on algorithmic
stability assumptions. Thus, finding tight PAC (Probably Approximately Correct) style
bounds for the bias/variance of cross-validation for different values of v is still an open
question Guyon [2002].
Though bounds are useful in their own right, they do not aid in studying trends of
the random variable in question (unless they are extremely tight), in this case CE.
Asymptotic analysis can assist in studying trends with increasing sample size Stone [1977],
Shao [1993], but it is not clear when the asymptotics come into play. This is where
empirical studies are useful. Most empirical studies on cross-validation indicate that the
performance (bias+variance) is best around 10-20 folds Kohavi [1995], Breiman [1996],
while some others Schneider [1997] indicate that the performance improves with increasing
number of folds. In the experimental study that I conduct using the closed-form
expressions, I observe both of these trends and, in addition, provide clear explanations for
why such behavior is observed.
7.4 Experiments
In this section I study cross-validation by explaining its behavior in detail. Especially
noticeable is the behavior of the covariance of cross-validation, which plays a significant
role in the overall variance. The role of covariance is significant since, in the expression
for the overall variance,
\[ \mathrm{Var}(CE) = \frac{1}{v}\,\mathrm{Var}(HE) + \frac{v-1}{v}\,\mathrm{Cov}(HE_i, HE_j), \]
the weighting (v−1)/v of the covariance is always greater than the weighting of the
individual variances (except for v = 2, when the two are equal) and this weight increases
with the number of folds. In the studies I perform, I observe the behavior of E[CE]
(I drop the notation showing the space over which the moments are taken, for readability),
Var(HE) (individual variances), Cov(HE_i, HE_j) (covariances), Var(CE) (total variance)
and E²[CE] + Var(CE) (total mean squared error) with varying amounts of correlation
between the input attributes and the class labels, for the three classification algorithms.
The details regarding the data generation process are as follows.
I fix the number of classes to 2. The results are averaged over multiple dimensionalities
(3, 5, 8 and 10), with each attribute having multiple splits/values (i.e. attributes with
binary splits, attributes with ternary splits) for all three algorithms (i.e. NBC, DT and
KNN) and, additionally, multiple values of K (2, 5 and 8) for the KNN algorithm. Assume
d is the dimensionality of the space (i.e. the number of input attributes) and m is the
number of values of each attribute; then the total number of distinct datapoints (including
the class attribute) is c = 2m^d. I can represent this input-output space as a contingency
table with 2 columns, one for each class, and m^d rows, one for each distinct input. The
number of cells in this contingency table is thus c. If I fix the probability of observing a
datapoint in cell i to be p_i, such that \(\sum_{i=1}^{c} p_i = 1\), and the sample size to N, the
distribution that precisely models this scenario is a multinomial distribution with parameters
N and the set p_1, p_2, ..., p_c. This is our data generation model. In the studies that follow
the p_i are varied and the amount of dependence between the attributes and the class labels
is computed for each set of p_i using the Chi-square test Connor-Linton [2003]. More
precisely, I sum over all i the square of the difference between each p_i and the product of
its corresponding marginals, with each squared difference divided by this product, that is,
\[ \text{correlation} = \sum_i \frac{(p_i - p_{im})^2}{p_{im}}, \]
where p_{im} is the product of the marginals for the ith cell.
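As a concrete illustration of this data generation model, the following is a small Python sketch that stores the cell probabilities as an (m^d x 2) table, draws one multinomial dataset of size N from it, and computes the chi-square style correlation defined above. The helper names and the Dirichlet draw used to pick the p_i are assumptions made only for illustration.

import numpy as np

def correlation(p):
    # Chi-square style dependence between the inputs and the class label.
    # p is an (m**d, 2) array of cell probabilities summing to 1; returns
    # sum_i (p_i - p_im)^2 / p_im, p_im being the product of the marginals.
    row = p.sum(axis=1, keepdims=True)   # marginal probability of each input
    col = p.sum(axis=0, keepdims=True)   # marginal probability of each class
    p_im = row * col
    return float(((p - p_im) ** 2 / p_im).sum())

def sample_dataset(p, n, rng):
    # One dataset of size n from the multinomial with cell probabilities p.
    return rng.multinomial(n, p.ravel()).reshape(p.shape)

rng = np.random.default_rng(0)
m, d = 2, 3                                                # binary attributes, 3 dimensions
p = rng.dirichlet(np.ones(2 * m ** d)).reshape(m ** d, 2)  # c = 2 m^d cells
print("correlation:", correlation(p))
print("one dataset of size 100:")
print(sample_dataset(p, 100, rng))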
I initially set the dataset size to 100 and then increase it to 1000, to observe the
effect that increasing the dataset size has on the behavior of these moments. The reason I
choose these dataset sizes is that for dataset sizes below and around 100 the behavior
is qualitatively similar to that at 100, while for larger dataset sizes, above and around
1000, the behavior is qualitatively similar to that at 1000. Secondly, the shift in trends
in the behavior of cross-validation with increasing dataset size is clearly noticeable for
these dataset sizes. Moreover, cross-validation is primarily used when dataset sizes are
small, since it has low variance (though a high computational cost compared with, for
example, a hold-out set), and hence studying it under these circumstances seems sensible. The
experiments reveal certain interesting facts and provide insights that assist in better
understanding cross-validation.
Observations: I now explain the interesting trends that I observe through these
experiments. The insights are mainly linked to the behavior of the variance (in particular
the covariance) of CE. However, for the sake of completeness I also discuss the behavior
of the expected value of CE.
7.4.1 Variance
Figures 7-1 to 7-6 are plots of the variances of the individual runs of cross-validation.
Figures 7-7 to 7-12 depict the behavior of the covariance between any two runs of
cross-validation. Figures 7-13 to 7-18 showcase the behavior of the total variance of
cross-validation, which, as I have seen, is in fact a convex combination of the individual
variance and the pairwise covariance.
Linear behavior of Var(HE): In Figures 7-1 to 7-6 I see that the individual
variances increase practically linearly with the number of folds. This linear increase
occurs since the size of the test set decreases linearly (more precisely, the expected
test-set size decreases linearly) with the number of folds, and I know that CE is the
average error over the v runs, where the error of each run is the sum of the zero-one loss
function evaluated at each test point, normalized by the size of the test set. Since the
test points are i.i.d. (independent and identically distributed), so are the corresponding
zero-one loss functions, and from theory the variance of the average of T i.i.d. random
variables, each having variance σ² < ∞, is given by σ²/T.
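The following tiny simulation illustrates this scaling. It assumes, purely for illustration, that the zero-one losses are i.i.d. Bernoulli with a fixed error probability (ignoring the dependence of the induced classifier on the particular training fold), so that Var(HE) ≈ σ²/T ≈ σ²v/N grows roughly linearly in v.

import numpy as np

rng = np.random.default_rng(1)
N = 100            # dataset size
p_err = 0.3        # assumed probability that the classifier errs on a test point
sigma2 = p_err * (1 - p_err)

for v in (2, 5, 10, 25, 50):
    T = N // v     # (expected) test-set size of a single run
    # hold-out error of one run = mean of T i.i.d. zero-one losses
    he = rng.binomial(T, p_err, size=200_000) / T
    print(f"v={v:3d}  empirical Var(HE)={he.var():.5f}  sigma^2/T={sigma2 / T:.5f}")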
V-shaped behavior of Cov(HEi,HEj): In Figure 7-7 I observe that the covariance
first decreases as the number of folds increases from 2 up until around 10-20 folds and
then increases up until v = N (called leave-one-out (LOO) validation). This strange
behavior has the following explanation. At low folds, for example at v = 2, the test set for one run is the
training set for the other and vice-versa. The two partitions are negatively correlated,
since each datapoint is in exactly one of the two partitions and cannot be in both; in fact,
their covariance is −N/4. This implies that the test sets for the two runs are negatively
correlated. However, the two partitions are also the training sets for the two runs, and
hence the classifiers induced by them are also negatively correlated in terms of their
predictions on individual datapoints. Since the test sets as well as the classifiers are
negatively correlated, the errors of these classifiers are positively correlated. In other
words, the classifiers make roughly the same number of mistakes on their respective test
sets. The reason is that if the two partitions are similar (say both are representative
samples of the distribution) then the classifiers are similar and so are the test sets, and
hence both errors are low. When the two samples are highly dissimilar (say one is
representative and the other is not), the classifiers built from them are dissimilar and so
are the test sets; hence the errors that both classifiers make on their test sets (each of
which is the training set of the other classifier) are high. Thus, irrespective of the level of
similarity between the two partitions, the correlation between the errors of the classifiers
induced by them is high.
As the number of folds increases this effect diminishes, as the classifiers become more
and more similar due to the overlap of the training sets, while the test sets become
increasingly unstable as they become smaller. The increase in covariance at high folds in
Figure 7-7 is due to the case where LOO fails: if I have a majority classifier and the
dataset contains an equal number of datapoints from each class, then LOO estimates
100% error. Since each run produces this error, the errors of any two runs are highly
correlated. This effect weakens as the number of folds decreases. The classification
algorithms I have chosen classify based on majority in their final inference step, i.e.
locally they classify datapoints based on majority. At low data correlation, as in Figure
7-7, the probability of having an equal number of datapoints from each class for each
input is high, and hence the covariance between the errors of two different runs is high.
Thus, at high folds this effect is predominant and it increases the covariance.
Consequently, I obtain the V-shaped graphs in Figure 7-7, which are a combination of the
first effect and this second effect.
L-shaped behavior of Cov(HEi,HEj): As I increase the correlation between the
input attributes and the class labels, seen in Figures 7-8 and 7-9, the initial effect which
raises the covariance is still dominant, but the latter effect (an equal number of datapoints
from each class for each input) has extremely low probability and is not significant enough
to increase the covariance at high folds. As a result, the covariance drops with increasing v.
On increasing the dataset size the covariance does not increase as much (in fact it
reduces in some cases) at high folds, as in Figures 7-10, 7-11 and 7-12. In Figure 7-10,
though the correlation between the input attributes and the class labels is low, the
probability of having an equal number of datapoints from each class is low since the
dataset size has increased. For a given set of parameters, the probability of a particular
outcome of the counts is essentially reduced (never increased) as N increases, since the
original probability mass now has to be distributed over a larger set of possible outcomes.
Hence, the covariance in Figure 7-10 drops as the number of folds increases. The behavior
observed in Figures 7-11 and 7-12 has the same explanation as that for Figures 7-8 and
7-9 described before.
Finally, the covariance has a V-shape for low data correlations and low dataset sizes
where the classification algorithms classify based on majority at least at some local level.
In the other cases the covariance is high initially and then reduces with increasing number
of folds.
Behavior of Var(CE) similar to covariance: Figures 7-13 to 7-18 represent the
behavior of the total variance of CE. I know that the total variance is given by
\[ \mathrm{Var}(CE) = \frac{1}{v}\,\mathrm{Var}(HE) + \frac{v-1}{v}\,\mathrm{Cov}(HE_i, HE_j). \]
I have also seen from Figures 7-1 to 7-6 that Var(HE) varies almost linearly with respect
to v; in other words, (1/v)Var(HE) is practically a constant as v changes. Hence, the
trends seen for the total variance are similar to those observed for the pairwise
covariances, with an added shift. Thus, at low correlations and low sample sizes the
variance is at its minimum not at the extremes but somewhere in between, say around
10-20 folds, as seen in Figure 7-13. In the other cases, the total variance reduces with
increasing number of folds.
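The following small sketch combines the per-run variance and the pairwise covariance exactly as in the expression above and also forms E²[CE] + Var(CE), the quantity studied in Section 7.4.3. The numbers are toy values, not taken from the experiments, chosen only to show how a V-shaped covariance produces a minimum of the total variance (and hence of the total mean squared error) at intermediate folds.

def var_ce(v, var_he, cov_he):
    # Var(CE) = (1/v) Var(HE) + ((v-1)/v) Cov(HE_i, HE_j)
    return var_he / v + (v - 1) / v * cov_he

def total_mse(e_ce, var_ce_value):
    # E^2[CE] + Var(CE)
    return e_ce ** 2 + var_ce_value

# Toy values: Var(HE) grows linearly with v, the covariance is V-shaped.
for v, var_he, cov in [(2, 0.004, 0.0030), (10, 0.020, 0.0005), (100, 0.200, 0.0020)]:
    vc = var_ce(v, var_he, cov)
    print(v, round(vc, 5), round(total_mse(0.25, vc), 5))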
7.4.2 Expected value
Figures 7-19 to 7-21 depict the behavior of the expected value of CE for different
amounts of correlation between the input attributes and the class labels and for
two different sample sizes. The behavior of the expected value at medium and high
correlations is the same for small and large sample sizes, and hence I plot these scenarios
only for small sample sizes, as shown in Figure 7-21. From the figures I observe that as
the correlation increases the expected CE reduces. This occurs since the input attributes
become increasingly indicative of a particular class. As the number of folds increases the
expected value reduces, since the training set sizes increase in expectation, enhancing
classifier performance.
7.4.3 Expected value squared + variance
Figures 7-22 to 7-27 depict the behavior of E²[CE] + Var(CE), the total mean squared
error of CE. In Figure 7-22 I observe that the best performance of cross-validation is
around 10-20 folds. In the other cases, the behavior improves as the number of folds
increases. In Figure 7-22 the variance at high folds is large and hence the above sum is
large at high folds; as a result I have a V-shaped curve. In the other figures the variance
is low at high folds, and so is the expected value, and hence the performance improves as
the number of folds increases.
7.5 Take-aways
In summary, I observed the behavior of cross-validation under varying amounts of
data correlation, varying sample sizes and varying numbers of folds. I observed that
at low correlations and low sample sizes (a characteristic of many real-life datasets)
10-20 fold cross-validation was best, while in the other cases increasing the number of
folds helped enhance performance. Additionally, I provided in-depth explanations for the
observed behavior and noted that the explanations for the behavior of the covariance are
especially relevant to classification algorithms that classify based on majority at a global
level (e.g. majority classifiers) or at least at some local level (e.g. DT classification at the
leaves). The other interesting fact is that all the experiments and the insights were a
consequence of the theoretical formulas derived previously. I hope that non-asymptotic
studies like the one presented will assist in better understanding popular prevalent
techniques, in this case cross-validation.
Figure 7-1. Var(HE) for small sample size and low correlation.
Figure 7-2. Var(HE) for small sample size and medium correlation.
Figure 7-3. Var(HE) for small sample size and high correlation.
Figure 7-4. Var(HE) for larger sample size and low correlation.
Figure 7-5. Var(HE) for larger sample size and medium correlation.
Figure 7-6. Var(HE) for larger sample size and high correlation.
Figure 7-7. Cov(HEi, HEj) for small sample size and low correlation.
Figure 7-8. Cov(HEi, HEj) for small sample size and medium correlation.
Figure 7-9. Cov(HEi, HEj) for small sample size and high correlation.
Figure 7-10. Cov(HEi, HEj) for larger sample size and low correlation.
Figure 7-11. Cov(HEi, HEj) for larger sample size and medium correlation.
Figure 7-12. Cov(HEi, HEj) for larger sample size and high correlation.
Figure 7-13. Var(CE) for small sample size and low correlation.
Figure 7-14. Var(CE) for small sample size and medium correlation.
Figure 7-15. Var(CE) for small sample size and high correlation.
Figure 7-16. Var(CE) for larger sample size and low correlation.
Figure 7-17. Var(CE) for larger sample size and medium correlation.
Figure 7-18. Var(CE) for larger sample size and high correlation.
Figure 7-19. E[CE] for small sample size and low correlation.
Figure 7-20. E[CE] for larger sample size and low correlation.
Figure 7-21. E[CE] for small sample size at medium and high correlation.
Figure 7-22. E²[CE] + Var(CE) for small sample size and low correlation.
Figure 7-23. E²[CE] + Var(CE) for small sample size and medium correlation.
Figure 7-24. E²[CE] + Var(CE) for small sample size and high correlation.
Figure 7-25. E²[CE] + Var(CE) for larger sample size and low correlation.
Figure 7-26. E²[CE] + Var(CE) for larger sample size and medium correlation.
Figure 7-27. E²[CE] + Var(CE) for larger sample size and high correlation.
CHAPTER 8
CONCLUSION
In this work I have proposed a novel methodology to study learning algorithms which
has the following potential benefits:
Gaining Insight: One of the main advantages of deploying such a methodology is
that it can be used as an exploratory tool and as an analysis tool. I can accurately study
when and why a particular evaluation measure or classification algorithm behaves in the
manner it does.
Finite Sample Convergence: Another benefit of the methodology is that it can
be used to evaluate the performance of the evaluation measures in estimating GE under
different conditions. For example, I can study how well HE and CE estimate GE with
increasing sample size. I can thus use CE below a certain sample size and HE beyond
that sample size so as to estimate GE accurately and efficiently. The methodology can
thus be used as a guidance tool.
Robustness: If an algorithm designer validates his/her algorithm by computing
moments as mentioned earlier, it can instill greater confidence in a practitioner searching
for an appropriate algorithm for his/her dataset. The reason is the following: suppose the
practitioner has a dataset that has a similar structure to, or comes from a similar source
as, the test dataset on which the designer built an empirical distribution and reported
favorable results. The good results then apply not only to that particular test dataset
but to other datasets of a similar type, and since the practitioner's dataset belongs to this
collection, the results also apply to it. Hence, the robustness of the algorithm can be
evaluated using this methodology, which can give the algorithm wider appeal.
Other Benefits: The methodology can be used to evaluate Probably Approximately
Correct (PAC) Bayes bounds McAllester [2003] in certain settings. Roughly speaking,
PAC-Bayes bounds bound the difference between the expected GE and the expected
empirical error, where the expectation is over a distribution defined on the hypothesis
space. In our case this distribution is induced by training a classification algorithm on
all i.i.d. samples of size N. I can compute the moments of GE and the moments of the
evaluation measures using our expressions for this case, and compare them to verify the
tightness of the corresponding PAC-Bayes bounds. The derived expressions can also
be used to focus on specific portions of the data, since the individual probabilities in
the expressions are only concerned with the behavior of the classification algorithm or
evaluation measure on single inputs or pairs of inputs.
Hence, the methodology I discussed can serve as a guidance tool, an analysis tool
and an exploratory tool to accurately study classification algorithms in conjunction
with evaluation measures. In the future it would be interesting to analyze and develop
efficient characterizations for other classification algorithms and evaluation measures in
this framework. This analysis will hopefully assist us in gaining new insights into the
behavior of these techniques. A more ambitious goal is to extend this kind of analysis to
study the more general class of learning algorithms. This includes but is not limited to
regression problems where the output is continuous rather than discrete.
APPENDIX: PROOFS
Proposition 3. The polynomial \((x + a)^r x^2 y + (y + a)^r y^2 x > 0\) iff \(x > 0\) and \(y > 0\), where
\(x, y \in \{-a, -a+1, \ldots, a\}\), \(r = \max_b \left\lfloor \frac{\ln[b]}{\ln\left[\frac{a+1}{a-b}\right]} \right\rfloor + 1\), \(a \in \mathbb{N}\), \(1 < b < a\) and \(b \in \mathbb{N}\).
Proof. One direction is trivial: if x > 0 and y > 0 then the polynomial is greater than zero
for any value of r. Now let us prove the other direction, i.e. if \((x + a)^r x^2 y + (y + a)^r y^2 x > 0\)
then x > 0 and y > 0, where \(x, y \in \{-a, \ldots, a\}\), \(r = \max_b \lfloor \ln[b] / \ln[\frac{a+1}{a-b}] \rfloor + 1\), \(1 < b < a\)
and \(b \in \mathbb{N}\). In other words, if x ≤ 0 or y ≤ 0 then \((x + a)^r x^2 y + (y + a)^r y^2 x \le 0\). We prove
this result by cases.

Case 1: Both x and y are zero. The value of the polynomial is zero.

Case 2: One of x or y is zero. The value of the polynomial is again zero, since each of the
two terms separated by the sum has xy as a factor.

Case 3: Both x and y are less than zero. Consider the first term \((x + a)^r x^2 y\). This term is
non-positive since x + a is always non-negative (because \(x \in \{-a, \ldots, a\}\)) and \(x^2\) is positive
but y is negative. An analogous argument applies to the second term, so it too is non-positive.
Thus their sum is non-positive.

Case 4: One of x or y is negative and the other is positive. Assume w.l.o.g. that x is
positive and y is negative. Since the polynomial factors as \(xy\,[(x + a)^r x + (y + a)^r y]\) and
\(xy < 0\),
\[(x + a)^r x^2 y + (y + a)^r y^2 x \le 0 \iff (x + a)^r x + (y + a)^r y \ge 0,\]
and the latter holds whenever
\[r \ge \frac{\ln[-y/x]}{\ln\left[\frac{x+a}{y+a}\right]}.\]
On fixing the value of y, the value of x at which the right-hand side achieves its maximum
is 1 (since the lower the value of x the higher the right-hand side, but x is positive by our
assumption and \(x \in \{-a, \ldots, a\}\)). Thus the above inequality holds for all such x provided
\[r \ge \frac{\ln[-y]}{\ln\left[\frac{a+1}{y+a}\right]}.\]
Let b = −y; then 1 ≤ b ≤ a since y is negative. Hence, if r satisfies this inequality for all
allowed values of b, the polynomial is less than or equal to zero over the specified range.
For r to satisfy the inequality for all allowed values of b, it must satisfy it at the value of
b at which the right-hand side is maximum. Also, for b = 1 and b = a the right-hand side
is zero, so the range of b over which we seek the maximum is 1 < b < a. With this, the
minimum integer r that satisfies the above inequality, and hence makes the polynomial
less than or equal to zero, is
\[r = \max_b \left\lfloor \frac{\ln[b]}{\ln\left[\frac{a+1}{a-b}\right]} \right\rfloor + 1.\]
The 4 cases cover all the possibilities and thus we have shown that if x ≤ 0 or y ≤ 0 then
\((x + a)^r x^2 y + (y + a)^r y^2 x \le 0\). Having also shown the other direction, we have proved
the proposition.
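The proposition can also be checked by brute force for small a. The following sketch computes r as defined above and exhaustively verifies the equivalence over all integer pairs (x, y) in the allowed range; the function names and the tested values of a are illustrative.

import math
from itertools import product

def r_of(a):
    # r = max over integer 1 < b < a of floor(ln b / ln((a+1)/(a-b))) + 1
    return max(math.floor(math.log(b) / math.log((a + 1) / (a - b)))
               for b in range(2, a)) + 1

def check(a):
    # Verify: (x+a)^r x^2 y + (y+a)^r y^2 x > 0  iff  x > 0 and y > 0.
    r = r_of(a)
    for x, y in product(range(-a, a + 1), repeat=2):
        poly = (x + a) ** r * x * x * y + (y + a) ** r * y * y * x
        assert (poly > 0) == (x > 0 and y > 0), (a, r, x, y, poly)
    return r

for a in range(3, 12):
    print("a =", a, " r =", check(a))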
The probability that two paths of lengths \(l_1\) and \(l_2\) (\(l_2 \ge l_1\)) co-exist in a tree based on
the randomized attribute selection method is given by
\[P[l_1 \text{ and } l_2 \text{ length paths co-exist}] = \sum_{i=0}^{v} {}^{v}Pr_i\,(l_1 - i - 1)!\,(l_2 - i - 1)!\,(r - v)\,prob_i\]
where r is the number of attributes common to the two paths, v is the number of attributes
with the same values in the two paths, \({}^{v}Pr_i = \frac{v!}{(v-i)!}\) denotes the number of permutations, and
\[prob_i = \frac{1}{d(d-1)\cdots(d-i)\,(d-i-1)^2\cdots(d-l_1+1)^2\,(d-l_1)\cdots(d-l_2+1)}.\]
We now prove the above result. The derivation will become clearer through the following
example. Consider the total number of attributes to be d as usual. Let A1, A2 and A3 be
three attributes that are common to both paths and also have the same attribute values.
Let A4 and A5 be common to both paths but have different attribute values in each of
them. Let A6 belong only to the first path and A7, A8 only to the second path. Thus, in
our example l1 = 6, l2 = 7, r = 5 and v = 3.

For the two paths to co-exist, notice that at least one of A4 or A5 has to be at a lower
depth than the non-common attributes A6, A7, A8. This has to be true since, if a
non-common attribute, say A6, is higher than A4 and A5 in a path of the tree, then the
other path cannot exist. Hence, in all the possible ways that the two paths can co-exist,
one of the attributes A4 or A5 has to occur at a maximum depth of v + 1, that is, 4 in
this example. Figure A-1a depicts this case. In the successive tree structures, that is,
Figure A-1b and Figure A-1c, the common attribute with distinct attribute values (A4)
rises higher up in the tree (to lower depths) until in Figure A-1d it becomes the root. To
find the probability that the two paths co-exist we sum the probabilities of such
arrangements/tree structures. The probability of the subtree shown in Figure A-1a is
1/[d(d−1)(d−2)(d−3)(d−4)²(d−5)²(d−6)], considering that we choose attributes without
replacement along a particular path. Thus the probability of choosing the root is 1/d, the
next attribute 1/(d−1), and so on until the subtree splits into two paths at depth 5. After
the split at depth 5, the probability of choosing the respective attributes for the two paths
is 1/(d−4)², since repetitions are allowed in two separate paths. Finally, the first path ends
at depth 6 and only one attribute has to be chosen at depth 7 for the second path, which is
chosen with probability 1/(d−6). We now find the total number of subtrees with such an
arrangement, where the highest common attribute with different values is at a depth of 4.
We observe that A1, A2 and A3 can be permuted in any order without altering the tree
structure; the total number of ways of doing this is 3!, that is, 3Pr3. The attributes below
A4 can also be permuted in 2!·3! ways without changing the tree structure. Moreover, A4
can be replaced by A5. Thus, the total number of ways the two paths can co-exist with
this arrangement is 3Pr3·2!·3!·2, and the probability of the arrangement is hence
3Pr3·2!·3!·2 / [d(d−1)(d−2)(d−3)(d−4)²(d−5)²(d−6)]. Similarly, we find the probabilities of
the arrangements in Figures A-1b, A-1c and A-1d, where the common attribute with
different values is at depth 3, then at depth 2 and finally at the root. The probabilities of
these successive arrangements are 3Pr2·3!·4!·2 / [d(d−1)(d−2)(d−3)²(d−4)²(d−5)²(d−6)],
3Pr1·4!·5!·2 / [d(d−1)(d−2)²(d−3)²(d−4)²(d−5)²(d−6)] and
3Pr0·5!·6!·2 / [d(d−1)²(d−2)²(d−3)²(d−4)²(d−5)²(d−6)] respectively. The total probability
for the paths to co-exist is given by the sum of the probabilities of these individual
arrangements.
In the general case, where we have v attributes with the same values, the number of
possible arrangements is v + 1. This is because the depth at which the two paths separate
out decreases from v + 1 to 1. When the bifurcation occurs at depth v + 1, the total
number of subtrees with this arrangement is vPrv·(l1 − v − 1)!·(l2 − v − 1)!·(r − v): vPrv is
the number of permutations of the common attributes with the same values, (l1 − v − 1)!
and (l2 − v − 1)! are the numbers of permutations of the attributes in paths 1 and 2
respectively after the split, and r − v is the number of choices for the split attribute. The
probability of any one of these subtrees is
1/[d(d−1)···(d−v)(d−v−1)²···(d−l1+1)²(d−l1)···(d−l2+1)], since up to a depth of v + 1 the
two paths are the same and from depth v + 2 onward the two paths separate. The
probability of the first arrangement is thus
vPrv·(l1−v−1)!·(l2−v−1)!·(r−v) / [d(d−1)···(d−v)(d−v−1)²···(d−l1+1)²(d−l1)···(d−l2+1)].
For the second arrangement, with the bifurcation occurring at depth v, the number of
subtrees is vPrv−1·(l1 − v)!·(l2 − v)!·(r − v) and the probability of any one of them is
1/[d(d−1)···(d−v+1)(d−v)²···(d−l1+1)²(d−l1)···(d−l2+1)]; the probability of the arrangement
is thus vPrv−1·(l1−v)!·(l2−v)!·(r−v) / [d(d−1)···(d−v+1)(d−v)²···(d−l1+1)²(d−l1)···(d−l2+1)].
Similarly, the probabilities of the other arrangements can be derived. Hence the total
probability for the two paths to co-exist, which is the sum of the probabilities of the
individual arrangements, is given by
\[P[l_1 \text{ and } l_2 \text{ length paths co-exist}] = \sum_{i=0}^{v} \frac{{}^{v}Pr_i\,(l_1 - i - 1)!\,(l_2 - i - 1)!\,(r - v)}{d(d-1)\cdots(d-i)\,(d-i-1)^2\cdots(d-l_1+1)^2\,(d-l_1)\cdots(d-l_2+1)}.\]
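The following sketch implements this summation directly; the function name and the choice d = 20 for the worked example (l1 = 6, l2 = 7, r = 5, v = 3) are illustrative assumptions.

from math import factorial, perm

def coexist_probability(l1, l2, r, v, d):
    # P[paths of lengths l1 and l2 (l2 >= l1) co-exist] under randomized
    # attribute selection, following the summation derived above.
    # r = attributes common to both paths, v = common attributes that also
    # share the same value, d = total number of attributes.
    total = 0.0
    for i in range(v + 1):
        denom = 1.0
        for k in range(i + 1):          # d (d-1) ... (d-i): shared prefix
            denom *= d - k
        for k in range(i + 1, l1):      # (d-i-1)^2 ... (d-l1+1)^2: both paths grow
            denom *= (d - k) ** 2
        for k in range(l1, l2):         # (d-l1) ... (d-l2+1): longer path continues
            denom *= d - k
        total += perm(v, i) * factorial(l1 - i - 1) * factorial(l2 - i - 1) * (r - v) / denom
    return total

# The worked example above, with d = 20 attributes in total.
print(coexist_probability(6, 7, 5, 3, d=20))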
Figure A-1. Instances of possible arrangements.
REFERENCES
J. Abello, P. M. Pardalos, and M. G. C. Resende, editors. Handbook of massive data sets.Kluwer Academic Publishers, Norwell, MA, USA, 2002. ISBN 1-4020-0489-3.
T. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, 2003.
J. Bartlett, J. Kotrlik, and C. Higgins. Organizational research: Determining appropriatesample size for survey research. Information Technology, Learning, and PerformanceJournal, 19(1):43–50, 2001.
Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold crossvalidation. Journal of Machine Learning Research, 2003.
D. Bertsimas and I. Popescu. Optimal inequalities in probability theory: a convexoptimization approach. Technical report, Dept. Math. O.R., Cambridge, Mass 02139,1998. URL citeseer.ist.psu.edu/bertsimas00optimal.html. Date accessed 5/2006.
A. Blum, A. Kalai, and J. Langford. Beating the hold-out: Bounds for k-fold andprogressive cross-validation. In Computational Learing Theory, pages 203–208, 1999.
S. Boucheron, O. Bousquet, and G. Lugosi. Introduction to statistical learning theory.Date accessed 1/2007, http://www.kyb.mpg.de/publications/pdfs/pdf2819.pdf, 2005.
U. Braga-Neto and E. Dougherty. Exact performance of error estimators for discreteclassifiers. Pattern Recognition, 38(11):1799–1814, 2005.
L. Breiman. Heuristics of instability and stabilization in model selection. The Annals ofStatistics, 1996.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.Wadsworth and Brooks, 1984.
R. Butler and R. Sutton. Saddlepoint approximation for multivariate cumulativedistribution functions and probability computations in sampling theory and outliertesting. Journal of the American Statistical Association, 93(442):596–604, 1998.
R. Chambers and C. Skinner. Analysis of Survey Data. Wiley, 1977.
J. Connor-Linton. Chi square tutorial. Date accessed 8/2006,http://www.georgetown.edu/faculty/ballc/webtools/ web chi tut.html, 2003.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition.Springer-Verlag, 1996.
A. Dhurandhar and A. Dobra. Semi-analytical method for analyzing models and modelselection measures based on moment analysis. ACM Transactions on KnowledgeDiscovery and Data Mining, 3, 2009.
A. Dhurandhar and A. Dobra. Probabilistic characterization of random decision trees.Journal of Machine Learning Research, 9, 2008.
A. Dhurandhar and A. Dobra. Probabilistic characterization of nearest neighbor classifier.Technical Report, 2007.
P. Domingos and M. J. Pazzani. On the optimality of the simple bayesian classifier underzero-one loss. Machine Learning, 29(2-3):103–130, 1997.
A. Edelman and H. Murakami. Polynomial roots from companion matrix eigenvalues.Mathematics of Computation, 64(210):763–776, 1995.
B. Efron. How biased is the apparent error rate of a prediction rule? Journal of theAmerican Statistical Association, 81:461–470, 1986.
B. Efron. The estimation of prediction error: Covariance penalties and cross-validation.Journal of the American Statistical Association, 99:619–642, 2004.
A. Elisseeff and M. Pontil. Learning Theory and Practice, chapter Leave-one-out error andstability of learning algorithms with applications. IOS Press, 2003.
C. Goutte. Note on free lunches and cross-validation. Neural Computation, 9(6):1245–1249,1997.
I. Guyon. Nips. Discussion: Open Problems, 2002.
M. Hall. Correlation-based feature selection for machine learning, 1998.
M. A. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 2003.
P. Hall. The Bootstrap and Edgeworth Expansion. Springer-Verlag, 1992.
K. Isii. The extrema of probability determined by generalized moments(i) boundedrandom variables. Ann. Inst. Stat. Math, 12:119–133, 1960.
K. Isii. On the sharpness of Chebyshev-type inequalities. Ann. Inst. Stat. Math, 14:185–197, 1963.
S. Karlin and L. Shapley. Geometry of moment spaces. Memoirs Amer. Math. Soc., 12,1953.
M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-outcross-validation. In Computational Learing Theory, pages 152–162, 1997.
R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1137–1143. San Mateo, CA: Morgan Kaufmann, 1995.
E. F. Krause. Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Dover, 1987.
J. Langford. Filed under: Prediction theory, problems. Date accessed 6/2006,http://hunch.net/index.php?p=29, 2005.
B. Levin. A representation for multinomial cumulative distribution functions. The Annalsof Statistics, 9(5):1123–1126, 1981.
F. Liu, K. Ting, and W. Fan. Maximizing tree diversity by building complete-randomdecision trees. In PAKDD, pages 605–610, 2005.
W. Liu and A. White. Metrics for nearest neighbour discrimination with categorical attributes. In Research and Development in Expert Systems XIV: Proceedings of the 17th Annual Technical Conference of the BCES Specialist Group, pages 51–59, 1997.
M. Markatou, H. Tian, S. Biswas, and G. Hripcsak. Analysis of variance of cross-validationestimators of the generalization error. J. Mach. Learn. Res., 6:1127–1168, 2005. ISSN1533-7928.
D. McAllester. PAC-Bayesian stochastic model selection. Mach. Learn., 51, 2003.
A. Moore and M. Lee. Efficient algorithms for minimizing cross validation error. InInternational Conference on Machine Learning, pages 190–198, 1994.
M. Plutowski. Survey: Cross-validation in theory and in practice. Date accessed 10/2006,www.emotivate.com/CvSurvey.doc, 1996.
A. Prekopa. The discrete moment problem and linear programming. RUTCOR ResearchReport, 1989.
J. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
I. Rish. An empirical study of the naive Bayes classifier. In IJCAI-01 Workshop on "Empirical Methods in AI", 2001.
J. Schneider. Cross validation. Date accessed 5/2008,http://www.cs.cmu.edu/ schneide/tut5/node42.html, 1997.
J. Shao. Linear model selection by cross validation. Journal of the American StatisticalAssociation, 88, 1993.
J. Shao. Mathematical statistics. Springer-Verlag, 2003.
L. Smith. A tutorial on principal components analysis. 2002.
C. Stanfill and D. Waltz. Toward memory-based reasoning. Commun. ACM, 29(12):1213–1228, 1986. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/7902.7906.
C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–645,1977.
V. Vapnik. Statistical Learning Theory. Wiley & Sons, 1998.
R. Williamson. SRM and VC theory (statistical learning theory).http://axiom.anu.edu.au/ williams/papers/P151.pdf, 2001.
Wolfram-Research. Mathematica. http://www.wolfram.com/.
S.-P. Wu and S. Boyd. Sdpsol: a parser/solver for sdp and maxdet problems with matrixstructure. Date accessed 7/2006, http://www.stanford.edu/ boyd/SDPSOL.html, 1996.
H. Zhu and R. Rohwer. No free lunch for cross validation. Neural Computation, 8(7):1421–1426, 1996.
BIOGRAPHICAL SKETCH
Amit Dhurandhar is originally from Pune, India. He received his B.E. degree in
computer science from the University of Pune in 2004. He then received his master’s
degree in December 2005 and his Ph.D. in summer 2009 from the University of Florida.
His primary research is focused on building theory and scalable frameworks for studying
classification algorithms and related techniques.