Chapter 7: Classification
Introduction
  Classification problem, evaluation of classifiers
Bayesian Classifiers
  Optimal Bayes classifier, naive Bayes classifier, applications
Nearest Neighbor Classifier
  Basic notions, choice of parameters, applications
Decision Tree Classifiers
  Basic notions, split strategies, overfitting, pruning of decision trees
Scalability to Large Databases
  SLIQ, SPRINT, RainForest
Further Approaches to Classification
  Neural networks, genetic algorithms, rough set approach, fuzzy set approaches, support vector machines, prediction
Scalability to Large Databases: Motivation

Construction of decision trees is one of the most important tasks in classification
We considered up to now
small data sets
main memory resident data
New requirements
larger and larger commercial databases
necessity to use secondary storage algorithms
Scalability for databases of arbitrary (i.e., unbounded) size
Scalability to Large Databases: Approaches

Sampling
  use a subset of the data as training set such that the sample fits into main memory
  evaluate a sample of all potential splits (for numerical attributes)
  - poor quality of the resulting decision trees
Support by indexing structures (secondary storage)
  use all data as training set (not just a sample)
  management of the data by a database system
  indexing structures may provide high efficiency
  + no loss in the quality of decision trees
Scalability to Large Databases: Storage and Indexing Structures

Identify expensive operations:
Evaluation of potential splits and selection of the best split
  for numerical attributes: sorting the attribute values
  evaluation of attribute values as potential split points
  for categorical attributes: O(2^m) potential binary splits for m distinct attribute values
Partitioning of the training data
  according to the selected split point
  read and write operations to access the training data
Effort for the growth phase dominates the overall effort
SLIQ: Introduction
[Mehta, Agrawal & Rissanen 1996]
SLIQ: Scalable decision tree classifier
  Binary splits
  Evaluation of the splits by using the Gini index
Special data structures avoid sorting of the training data
  for every node of the decision tree
  for each numerical attribute
gini(T) = 1 − Σ_{j=1}^{k} p_j²   for k classes c_j with relative frequencies p_j
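To make the split criterion concrete, here is a minimal Python sketch of the Gini index and of the weighted Gini value of a binary split, the quantity a SLIQ-style classifier minimizes; function names and example labels are illustrative, not taken from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: gini(T) = 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted Gini index of a binary split of the labels."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)

labels = ["G", "B", "G", "B", "G", "G"]
print(gini(labels))                        # 0.444...
print(gini_split(labels[:1], labels[1:]))  # Gini value of splitting off the first object
```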
SLIQ: Data Structures
Attribute lists
  values of an attribute in ascending order
  in combination with a reference to the respective entry in the class list → sequential access
  secondary storage resident
Class list
  contains the class label for each training object and
  a reference to the respective leaf node in the decision tree → random access
  main memory resident
Histograms
  for each leaf node of the decision tree
  frequencies of the individual classes per partition
SLIQ: Example
Training data:

Id  Age  Income  Class
1   30   65      G
2   23   15      B
3   40   75      G
4   55   40      B
5   55   100     G
6   45   60      G

Attribute lists (sorted by attribute value):

Age  Id      Income  Id
23   2       15      2
30   1       40      4
40   3       60      6
45   6       65      1
55   5       75      3
55   4       100     5

Class list (all objects initially refer to the single leaf node N1):

Id  Class  Leaf
1   G      N1
2   B      N1
3   G      N1
4   B      N1
5   G      N1
6   G      N1
SLIQ: Algorithm
Breadth-first strategy
  For all leaf nodes on the same level of the decision tree, evaluate all possible splits for all attributes
  (standard decision tree classifiers follow a depth-first strategy)
Split of numerical attributes
  Sequentially scan the attribute list of attribute a, and for each value v in the list do:
    Determine the respective entry e in the class list
    Let k be the value of the leaf attribute of e
    Update the histogram of k based on the value of the class attribute of e
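A minimal sketch of this sequential scan, assuming the attribute list, class list, and per-leaf histograms are held in the simple Python structures described in the docstring; all names and encodings are illustrative, not part of the original SLIQ implementation.

```python
def evaluate_numeric_splits(attribute_list, class_list, histograms):
    """Sequential scan over one attribute list (SLIQ-style split evaluation).

    attribute_list: list of (value, record_id) pairs, sorted ascending by value.
    class_list:     dict record_id -> (class_label, leaf_node).
    histograms:     dict leaf_node -> {"below": {}, "above": {}} class counts;
                    initially every object of a leaf is counted under "above".
    Yields (leaf_node, candidate_split_value, gini_of_split) tuples.
    """
    for value, rid in attribute_list:
        label, leaf = class_list[rid]          # entry e in the class list
        hist = histograms[leaf]                # histogram of leaf k
        # move the current object from the "above" side to the "below" side
        hist["above"][label] = hist["above"].get(label, 0) - 1
        hist["below"][label] = hist["below"].get(label, 0) + 1
        yield leaf, value, gini_of_split(hist)

def gini_of_split(hist):
    """Weighted Gini index of the split described by the two histogram sides."""
    total = sum(hist["below"].values()) + sum(hist["above"].values())
    g = 0.0
    for side in ("below", "above"):
        n = sum(hist[side].values())
        if n > 0:
            g += (n / total) * (1.0 - sum((c / n) ** 2 for c in hist[side].values()))
    return g
```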
SPRINT: Introduction
[Shafer, Agrawal & Mehta 1996]
Shortcomings of SLIQ
  The size of the class list grows linearly with the size of the database, i.e. with the number of training examples
  SLIQ scales well only if sufficient main memory for the entire class list is available
Goals of SPRINT
Scalability for arbitrarily large databases
Simple parallelization of the method
SPRINT: Data Structures
Class list
  there is no class list any longer
  additional attribute "class" in the attribute lists (resident in secondary storage)
  no main memory data structures any longer → scalable to arbitrarily large databases
Attribute lists
  no single attribute list for the entire training set
  separate attribute lists for each node of the decision tree instead
  waiving central data structures supports a simple parallelization of SPRINT
SPRINT: Example
Attribute lists for node N1:

Age  Class  Id        Car type  Class  Id
17   high   1         family    high   0
20   high   5         sportive  high   1
23   high   0         sportive  high   2
32   low    4         family    low    3
43   high   2         truck     low    4
68   low    3         family    high   5

Split at node N1: Age ≤ 27.5 → N2,  Age > 27.5 → N3

Attribute lists for node N2:

Age  Class  Id        Car type  Class  Id
17   high   1         family    high   0
20   high   5         sportive  high   1
23   high   0         family    high   5

Attribute lists for node N3:

Age  Class  Id        Car type  Class  Id
32   low    4         sportive  high   2
43   high   2         family    low    3
68   low    3         truck     low    4
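A small Python sketch of how such per-node attribute lists can be partitioned after a split has been chosen: the ids of the objects going to the left child are collected from the split attribute's list, and every other list is split by probing this id set (SPRINT uses a hash table of ids for this step). The data encoding and names are illustrative.

```python
def partition_attribute_lists(attribute_lists, split_attr, split_value):
    """Partition per-node attribute lists after a numerical split (SPRINT-style).

    attribute_lists: dict attr_name -> list of (value, class_label, record_id),
                     each list sorted by value.
    Returns (left_lists, right_lists) for the two child nodes.
    """
    # 1) Ids of the objects that go to the left child, taken from the split attribute.
    left_ids = {rid for value, _, rid in attribute_lists[split_attr]
                if value <= split_value}

    left_lists, right_lists = {}, {}
    # 2) Probe the id set to split every other attribute list,
    #    preserving the sort order within each child list.
    for attr, entries in attribute_lists.items():
        left_lists[attr] = [e for e in entries if e[2] in left_ids]
        right_lists[attr] = [e for e in entries if e[2] not in left_ids]
    return left_lists, right_lists

# Attribute lists of node N1 from the example above
n1 = {
    "age": [(17, "high", 1), (20, "high", 5), (23, "high", 0),
            (32, "low", 4), (43, "high", 2), (68, "low", 3)],
    "car_type": [("family", "high", 0), ("sportive", "high", 1),
                 ("sportive", "high", 2), ("family", "low", 3),
                 ("truck", "low", 4), ("family", "high", 5)],
}
n2, n3 = partition_attribute_lists(n1, "age", 27.5)
```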
SPRINT: Experimental Evaluation
SLIQ is more efficient than SPRINT as long as the class list fits into main memory
SLIQ is not applicable for data sets with more than one million entries

[Figure: runtime in seconds (0 to 8,000) vs. number of objects in millions (0 to 3.0) for SLIQ and SPRINT]
RainForest: Introduction
[Gehrke, Ramakrishnan & Ganti 1998]
Shortcomings of SPRINT
Does not exploit the available main memory
Is applicable to breadth first decision tree construction only
Goals of RainForest
Exploits the available main memory to increase the efficiency
Applicable to all known algorithms
RainForest: Basic idea
Separate scalability aspects from quality aspects of a decision tree classifier
RainForest: Data Structures
AVC set for attribute a and node k
  Contains a class histogram for each value of a
  For all training objects that belong to the partition of node k
  Entries: (a_i, c_j, count)
AVC group for node k
  Set of AVC sets of node k for all attributes
For categorical attributes:
  AVC set is significantly smaller than the attribute lists
  At least one of the AVC sets fits into main memory
  Potentially, the entire AVC group fits into main memory
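A minimal sketch of constructing an AVC set and an AVC group by one scan over a node's partition, assuming the training objects are plain Python dicts; the names and the dict encoding are illustrative, not part of the RainForest paper.

```python
from collections import defaultdict

def build_avc_set(partition, attribute, class_attr="class"):
    """AVC set for one attribute and one node: entries (a_i, c_j, count)."""
    counts = defaultdict(int)
    for obj in partition:
        counts[(obj[attribute], obj[class_attr])] += 1
    return counts

def build_avc_group(partition, attributes):
    """AVC group for a node: one AVC set per attribute."""
    return {a: build_avc_set(partition, a) for a in attributes}

# Hypothetical dict encoding of the training data in the example below
data = [
    {"id": 1, "age": "young", "income": 65, "class": "G"},
    {"id": 2, "age": "young", "income": 15, "class": "B"},
    {"id": 3, "age": "young", "income": 75, "class": "G"},
    {"id": 4, "age": "senior", "income": 40, "class": "B"},
    {"id": 5, "age": "senior", "income": 100, "class": "G"},
    {"id": 6, "age": "senior", "income": 60, "class": "G"},
]
avc_age = build_avc_set(data, "age")   # {('young', 'G'): 2, ('young', 'B'): 1, ...}
```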
RainForest: Example
Training data:

Id  Age     Income  Class
1   young   65      G
2   young   15      B
3   young   75      G
4   senior  40      B
5   senior  100     G
6   senior  60      G

AVC set "age" for N1:          AVC set "income" for N1:
value   class  count           value  class  count
young   B      1               15     B      1
young   G      2               40     B      1
senior  B      1               60     G      1
senior  G      2               65     G      1
                               75     G      1
                               100    G      1

Split at node N1: age = young → N2,  age = senior → N3

AVC set "age" for N2:          AVC set "income" for N2:
value   class  count           value  class  count
young   B      1               15     B      1
young   G      2               65     G      1
                               75     G      1
RainForest: Algorithms
Assumption
  The entire AVC group of the root node fits into main memory
  Then, the AVC groups of each node also fit into main memory
Algorithm RF_Write
  Construction of the AVC group of node k in main memory by a sequential scan over the training set
  Determination of the optimal split for node k by using the AVC group
  Reading the training set and distribution (writing) to the partitions
  → the training set is read twice and written once
RainForest: Algorithms
Algorithm RF_Read
  Avoids explicit writing of the partitions to secondary storage
  Reading of the desired partitions from the entire training data set
  Simultaneous creation of AVC groups for as many partitions as possible
  → the training database is read multiple times for each tree level
Algorithm RF_Hybrid
  Usage of RF_Read as long as the AVC groups of all nodes from the current level of the decision tree fit into main memory
  Subsequent materialization of the partitions by using RF_Write
RainForest: Experimental Evaluation
For all RainForest algorithms, the runtime increases linearly with the number n of training objects
RainForest is significantly more efficient than SPRINT

[Figure: runtime in seconds (up to 20,000) vs. number of training objects in millions (1.0 to 3.0) for SPRINT and RainForest]
Boosting and Bagging
Techniques to increase classification accuracy
Bagging
  Basic idea: Learn a set of classifiers and decide the class prediction by following the majority of the individual votes
Boosting
  Basic idea: Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
Applicable to decision trees or Bayesian classifiers
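A minimal sketch of the bagging idea above: each base classifier is trained on a bootstrap sample and the ensemble predicts by majority vote. The train_classifier callback, which returns a prediction function, is an assumption for illustration only.

```python
import random
from collections import Counter

def bagging_train(train_classifier, data, n_classifiers=25):
    """Train each base classifier on a bootstrap sample (drawn with replacement)."""
    return [train_classifier([random.choice(data) for _ in data])
            for _ in range(n_classifiers)]

def bagging_predict(classifiers, x):
    """Class prediction by majority vote of the individual classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```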
Boosting: Algorithm
Algorithm
  Assign every example an equal weight 1/N
  For t = 1, 2, ..., T do
    Obtain a hypothesis (classifier) h(t) under the weights w(t)
    Calculate the error of h(t) and re-weight the examples based on the error
    Normalize w(t+1) to sum to 1.0
  Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
Boosting requires only linear time and constant space
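A hedged, AdaBoost-style sketch of this weighting scheme for two-class problems with labels in {-1, +1}. The train_weighted callback, which trains a base hypothesis under the current example weights, is an assumption; the update details follow the common AdaBoost formulation rather than a specific algorithm from the slides.

```python
import math

def boost(train_weighted, data, labels, T=10):
    """Train T hypotheses on re-weighted data and return their weighted vote."""
    n = len(data)
    w = [1.0 / n] * n                       # equal initial weights
    hypotheses, alphas = [], []
    for _ in range(T):
        h = train_weighted(data, labels, w)
        # weighted training error of h
        err = sum(wi for wi, x, y in zip(w, data, labels) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)        # accuracy-based weight
        # increase the weight of misclassified examples, decrease the rest
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, data, labels)]
        total = sum(w)
        w = [wi / total for wi in w]                   # normalize to sum 1.0
        hypotheses.append(h)
        alphas.append(alpha)

    def ensemble(x):
        return 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1
    return ensemble
```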
Chapter 7: Classification
Introduction
  Classification problem, evaluation of classifiers
Bayesian Classifiers
  Optimal Bayes classifier, naive Bayes classifier, applications
Nearest Neighbor Classifier
  Basic notions, choice of parameters, applications
Decision Tree Classifiers
  Basic notions, split strategies, overfitting, pruning of decision trees
Scalability to Large Databases
  SLIQ, SPRINT, RainForest
Further Approaches to Classification
  Neural networks, genetic algorithms, rough set approach, fuzzy set approaches, support vector machines, prediction
Neural Networks
Advantages
  prediction accuracy is generally high
  robust, works when training examples contain errors
  output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  fast evaluation of the learned target function
Criticism
  long training time
  difficult to understand the learned function (weights), no explicit knowledge generated
  not easy to incorporate domain knowledge
A Neuron
The n-dimensional input vector x = (x1, x2, ..., xn) is mapped into the variable y by means of the scalar product and a nonlinear function mapping.

[Figure: a neuron with input vector x = (x1, ..., xn), weight vector w = (w1, ..., wn), bias θ_k, weighted sum, activation function f, and output y]
Network Training
The ultimate objective of training
  obtain a set of weights that classifies almost all the tuples in the training data correctly
Steps
  Initialize the weights with random values
  Feed the input tuples into the network one by one
  For each unit
    Compute the net input to the unit as a linear combination of all the inputs to the unit
    Compute the output value using the activation function
    Compute the error
    Update the weights and the bias
Multi-Layer Perceptron
[Figure: feed-forward network with input nodes (input vector x_i), hidden nodes, and output nodes (output vector); w_ij denotes the weight on the connection from unit i to unit j]

Net input of unit j:          I_j = Σ_i w_ij O_i + θ_j
Output of unit j (sigmoid):   O_j = 1 / (1 + e^(−I_j))
Error of an output unit j:    Err_j = O_j (1 − O_j) (T_j − O_j)
Error of a hidden unit j:     Err_j = O_j (1 − O_j) Σ_k Err_k w_jk
Weight update:                w_ij = w_ij + (l) Err_j O_i
Bias update:                  θ_j = θ_j + (l) Err_j
(l is the learning rate, T_j the target value of output unit j)
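A small Python sketch of these formulas for a single unit (forward pass, error terms, and the weight/bias update with learning rate l); the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_unit(inputs, weights, theta):
    """O_j = sigmoid(I_j) with I_j = sum_i w_ij * O_i + theta_j."""
    net = sum(w * o for w, o in zip(weights, inputs)) + theta
    return sigmoid(net)

def output_error(o_j, t_j):
    """Err_j = O_j (1 - O_j)(T_j - O_j) for an output unit."""
    return o_j * (1 - o_j) * (t_j - o_j)

def hidden_error(o_j, downstream_errs, downstream_weights):
    """Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk for a hidden unit."""
    return o_j * (1 - o_j) * sum(e * w for e, w in zip(downstream_errs, downstream_weights))

def update_unit(inputs, weights, theta, err_j, l=0.1):
    """w_ij += l * Err_j * O_i and theta_j += l * Err_j."""
    new_weights = [w + l * err_j * o for w, o in zip(weights, inputs)]
    return new_weights, theta + l * err_j
```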
Network Pruning and Rule Extraction
Network pruning
  A fully connected network is hard to articulate
  N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights
  Pruning: remove some of the links without affecting the classification accuracy of the network
Extracting rules from a trained network
  Discretize the activation values; replace each individual activation value by the cluster average, maintaining the network accuracy
  Enumerate the output from the discretized activation values to find rules between activation value and output
  Find the relationship between the input and the activation values
  Combine the above two to obtain rules relating the output to the input
Genetic Algorithms
GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created, consisting of randomly generated rules
  e.g., "IF A1 AND NOT A2 THEN C2" can be encoded as 100
Based on the evolutionary notion of survival of the fittest, a new population is formed that consists of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
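A minimal sketch of these genetic operators on bit-string encoded rules (one-point crossover, bit-flip mutation, survival of the fittest). The fitness argument is an assumed callback that returns a rule's classification accuracy on the training examples; all names are illustrative.

```python
import random

def crossover(parent1, parent2):
    """One-point crossover of two bit-string encoded rules."""
    point = random.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(rule, p=0.01):
    """Flip each bit with small probability p."""
    return "".join(b if random.random() > p else ("1" if b == "0" else "0")
                   for b in rule)

def next_generation(population, fitness, keep=0.5):
    """Keep the fittest rules and fill the population up with their offspring."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:max(2, int(keep * len(population)))]
    children = []
    while len(survivors) + len(children) < len(population):
        p1, p2 = random.sample(survivors, 2)
        c1, c2 = crossover(p1, p2)
        children += [mutate(c1), mutate(c2)]
    return survivors + children[:len(population) - len(survivors)]
```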
Rough Set Approach
Rough sets are used to approximately or "roughly" define equivalence classes
A rough set for a given class C is approximated by two sets: a lower approximation (certainly in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity
Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as a fuzzy membership graph)
Attribute values are converted to fuzzy values
  e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed
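A small sketch of this voting scheme with a hypothetical triangular fuzzy partition of the attribute income; the membership functions, their breakpoints, and the rules are illustrative, not taken from the slides.

```python
def triangular(x, a, b, c):
    """Triangular membership function peaking at b on the interval [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify_income(income):
    """Map a crisp income value to fuzzy memberships in {low, medium, high}."""
    return {
        "low":    triangular(income, -1, 0, 40),
        "medium": triangular(income, 20, 50, 80),
        "high":   triangular(income, 60, 100, 1000),
    }

def classify(income, rules):
    """Sum the truth values contributed by every applicable rule per class."""
    memberships = fuzzify_income(income)
    votes = {}
    for category, predicted_class in rules:        # e.g. ("high", "G")
        votes[predicted_class] = votes.get(predicted_class, 0.0) + memberships[category]
    return max(votes, key=votes.get)

print(classify(55, [("low", "B"), ("medium", "B"), ("high", "G")]))  # -> "B"
```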
Motivation: Linear Separation

Support Vector Machines (SVM)
  Vectors in R^d represent objects
  Objects belong to exactly one of two respective classes
  For the sake of simpler formulas, the class labels used are y = −1 and y = +1
Classification by linear separation: determine a separating hyperplane which separates both vector sets with maximal stability
Assign unknown elements to the halfspace in which they reside

Acknowledgements: Prof. Dr. Hans-Peter Kriegel and Matthias Schubert (LMU Munich) and Dr. Thorsten Joachims (U Dortmund and Cornell U)
Support Vector Machines
Problems of linear separation
  Definition and efficient determination of the maximum stable hyperplane
  Classes are not always linearly separable
  Computation of selected hyperplanes is very expensive
  Restriction to two classes
Approach to solve these problems
Support Vector Machines (SVMs) [Vapnik 1979, 1995]
Maximum Margin Hyperplane
Observation: there is no unique hyperplane to separate p1 from p2
Question: which hyperplane separates the classes best?
Criteria
  Stability at insertion
  Distance to the objects of both classes
Support Vector Machines: Principle
Basic idea: linear separation with the Maximum Margin Hyperplane (MMH)
  Distance to points from any of the two sets is maximal, i.e. at least the margin
  Minimal probability that the separating hyperplane has to be moved due to an insertion
  Best generalization behaviour
MMH is maximally stable
MMH only depends on the points p_i whose distance to the hyperplane is exactly the margin; such a p_i is called a support vector

[Figure: maximum margin hyperplane between the two point sets p1 and p2, with the margin and the support vectors marked]
Maximum Margin Hyperplane
Recall some algebraic notions for a feature space FS

Inner product of two vectors x, y ∈ FS: ⟨x, y⟩
  e.g., the canonical scalar product  ⟨x, y⟩ = Σ_{i=1}^{d} x_i y_i
Hyperplane H(w, b) with normal vector w and value b:
  H(w, b) = { x ∈ FS | ⟨w, x⟩ + b = 0 }
Distance of a vector x to the hyperplane H(w, b):
  dist(x, H(w, b)) = (⟨w, x⟩ + b) / √⟨w, w⟩
Computation of the Maximum Margin Hyperplane

Two assumptions for classifying x_i (class 1: y_i = +1, class 2: y_i = −1):
1) The classification error is zero: y_i (⟨w, x_i⟩ + b) > 0 for all i
2) The margin is maximal

Let δ denote the minimum distance of any training object x_i to the hyperplane H(w, b):
  δ = min_i y_i · dist(x_i, H(w, b))
Then: maximize δ subject to  y_i · dist(x_i, H(w, b)) ≥ δ  for all i ∈ [1..n]
Dual Optimization Problem
For computational purposes, transform the primal optimization problem into a dual one by using Lagrange multipliers
For the solution, use algorithms from optimization theory
Up to now: only linearly separable data
If the data is not linearly separable: Soft Margin Optimization

Dual optimization problem: find parameters α_i that

  maximize  L(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩

  subject to  Σ_{i=1}^{n} α_i y_i = 0  and  α_i ≥ 0
Soft Margin Optimization
Problem of Maximum Margin Optimization: how to treat non-linearly separable data?
Trade-off between training error and size of the margin
Two typical problems:
  single data points are not separable
  complete separation is not optimal
Soft Margin Optimization
Additionally regard the number of training errors when optimizing:
  ξ_i is the distance from p_i to the margin (often called slack variable)
  C controls the influence of single training vectors

Primary optimization problem with soft margin:
  Find a w that minimizes  ½ ⟨w, w⟩ + C Σ_{i=1}^{n} ξ_i
  subject to  ∀ i ∈ [1..n]:  y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0
Soft Margin Optimization
Dual optimization problem with Lagrange multipliers:

  Dual OP: maximize  L(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩
           subject to  Σ_{i=1}^{n} α_i y_i = 0  and  0 ≤ α_i ≤ C

  0 < α_i < C:  p_i is a support vector with ξ_i = 0
  α_i = C:      p_i is a support vector with ξ_i > 0
  α_i = 0:      p_i is not a support vector

Decision rule:
  h(x) = sign( Σ_{x_i ∈ SV} α_i y_i ⟨x_i, x⟩ + b )
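A minimal sketch of evaluating this decision rule, given the support vectors, Lagrange multipliers α_i, labels y_i, and offset b from some already trained SVM; the toy values are illustrative.

```python
def svm_decide(x, support_vectors, alphas, labels, b):
    """Decision rule h(x) = sign( sum_i alpha_i * y_i * <x_i, x> + b )."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    value = sum(a * y * dot(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if value >= 0 else -1

# Toy example: two support vectors on either side of the hyperplane x1 = 0
print(svm_decide((2.0, 1.0), [(1.0, 0.0), (-1.0, 0.0)], [0.5, 0.5], [+1, -1], 0.0))
```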
Kernel Machines: Non-Linearly Separable Data Sets

Problem: for real data sets, a linear separation with a high classification accuracy often is not possible
Idea: transform the data non-linearly into a new space, and try to separate the data in the new space linearly (extension of the hypotheses space)
Example of a quadratically separable data set
Kernel Machines: Extension of the Hypotheses Space

Principle
  Try to separate the data linearly in the extended feature space
Example
  input space (x, y, z)  →  extended feature space (x, y, z, x², xy, xz, y², yz, z²)
  Here: a hyperplane in the extended feature space is a polynomial of degree 2 in the input space
Kernel Machines: Example
Input space (2 attributes):  x = (x1, x2)
Extended space (6 attributes):  φ(x) = (x1², x2², √2·x1·x2, √2·x1, √2·x2, 1)
Kernel Machines: Example (2)
Input space (2 attributes):  x = (x1, x2)
Extended space (3 attributes):  φ(x) = (x1², x2², √2·x1·x2)

[Figure: the data set in the input space (x1, x2) and its image in the extended space, where it becomes linearly separable]
Kernel Machines
Introduction of a kernel corresponds to a feature transformation
  φ: FS_old → FS_new
The feature transform only affects the scalar product of training vectors
Kernel K is a function:  K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩

Dual optimization problem:
  maximize  L(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨φ(x_i), φ(x_j)⟩
  subject to  Σ_{i=1}^{n} α_i y_i = 0  and  0 ≤ α_i ≤ C
Kernel Machines: Examples
Radial basis kernel:  K(x, y) = exp(−‖x − y‖² / γ)
Polynomial kernel (degree d):  K(x, y) = (⟨x, y⟩ + 1)^d
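A small Python sketch of both kernels, together with the explicit degree-2 feature transform from the earlier example; for the polynomial kernel the sketch checks that K(x, y) equals the inner product in the extended feature space. Parameter names such as gamma and d are illustrative.

```python
import math

def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (<x, y> + 1)^d."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + 1) ** d

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis kernel K(x, y) = exp(-||x - y||^2 / gamma)."""
    return math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / gamma)

def phi(x):
    """Explicit degree-2 feature transform for a 2-D input (see the example above)."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(explicit, poly_kernel(x, y))   # both equal (<x, y> + 1)^2 = 25.0
```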
Support Vector Machines: Discussion
+ generates classifiers with a high classification accuracy
+ relatively weak tendency to overfitting (generalization theory)
+ efficient classification of new objects
+ compact models
− training times may be long (the appropriate feature space may be very high-dimensional)
− expensive implementation
− resulting models rarely provide an intuition
What Is Prediction?
Prediction is similar to classification
  First, construct a model
  Second, use the model to predict unknown values
Major method for prediction is regression
  Linear and multiple regression
  Non-linear regression
Prediction is different from classification
  Classification predicts categorical class labels
  Prediction models continuous-valued functions
Predictive Modeling in Databases

Predictive modeling: predict data values or construct generalized linear models based on the database data
  One can only predict value ranges or category distributions
Method outline:
  Minimal generalization
  Attribute relevance analysis
  Generalized linear model construction
  Prediction
Determine the major factors which influence the prediction
  Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction

Linear regression: Y = α + β X
  The two parameters α and β specify the line and are to be estimated from the data at hand,
  using the least squares criterion on the known values Y1, Y2, ..., X1, X2, ...
Multiple regression: Y = b0 + b1 X1 + b2 X2
  Many nonlinear functions can be transformed into the above
Log-linear models:
  The multi-way table of joint probabilities is approximated by a product of lower-order tables
  Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
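A minimal sketch of estimating α and β with the least squares criterion for a single predictor variable; the sample data is made up for illustration.

```python
def linear_regression(xs, ys):
    """Least-squares estimates of alpha and beta in Y = alpha + beta * X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
           sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

alpha, beta = linear_regression([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(alpha, beta)   # roughly 0.15 and 1.94
```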
Locally Weighted Regression
Construct an explicit approximation to f over a local region surrounding the query instance x_q
Locally weighted linear regression:
  The target function f is approximated near x_q using the linear function:
    f̂(x) = w_0 + w_1 a_1(x) + ... + w_n a_n(x)
  minimize the squared error with a distance-decreasing weight K:
    E(x_q) = ½ Σ_{x ∈ k_nearest_neighbors(x_q)} (f(x) − f̂(x))² K(d(x_q, x))
  the gradient descent training rule:
    Δw_j = η Σ_{x ∈ k_nearest_neighbors(x_q)} K(d(x_q, x)) (f(x) − f̂(x)) a_j(x)
In most cases, the target function is approximated by a constant, linear, or quadratic function.
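A small sketch of locally weighted linear regression for one-dimensional inputs, following the weighted squared error and gradient descent rule above; the kernel, learning rate, number of neighbors, and sample data are illustrative choices, not prescribed by the slides.

```python
import math

def lwr_predict(query, data, k=5, kernel_width=1.0, lr=0.05, epochs=500):
    """Fit f^(x) = w0 + w1 * (x - query) on the k nearest neighbors of the query.

    data: list of (x, f(x)) pairs. Each neighbor is weighted by the
    distance-decreasing kernel K(d) = exp(-d^2 / kernel_width), and the
    weights w0, w1 are trained by gradient descent on the weighted squared error.
    """
    neighbors = sorted(data, key=lambda p: abs(p[0] - query))[:k]
    w0 = w1 = 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, fx in neighbors:
            kd = math.exp(-((x - query) ** 2) / kernel_width)
            err = fx - (w0 + w1 * (x - query))
            g0 += kd * err                    # delta w_j ~ sum K(d) (f - f^) a_j
            g1 += kd * err * (x - query)
        w0, w1 = w0 + lr * g0, w1 + lr * g1
    return w0                                 # prediction at the query point

data = [(x, 2 * x + 1) for x in range(10)]
print(lwr_predict(4.5, data))                 # approximately 10.0
```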
Prediction: Numerical Data
References (I)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop on Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.

References (II)

J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, August 1996.
R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, September 1996.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.