
    WS 2003/04 Data Mining Algorithms 7 81

    Chapter 7: Classification

    Introduction

Classification problem, evaluation of classifiers

Bayesian Classifiers

Optimal Bayes classifier, naive Bayes classifier, applications

Nearest Neighbor Classifier

Basic notions, choice of parameters, applications

Decision Tree Classifiers

Basic notions, split strategies, overfitting, pruning of decision trees

Scalability to Large Databases

SLIQ, SPRINT, RainForest

Further Approaches to Classification

Neural networks, genetic algorithms, rough set approach, fuzzy set approaches, support vector machines, prediction

    WS 2003/04 Data Mining Algorithms 7 82

Scalability to Large Databases: Motivation

Construction of decision trees is one of the most important tasks in classification

    We considered up to now

    small data sets

    main memory resident data

    New requirements

    larger and larger commercial databases

    necessity to use secondary storage algorithms

    Scalability for databases of arbitrary (i.e., unbounded) size


    WS 2003/04 Data Mining Algorithms 7 83

Scalability to Large Databases: Approaches

Sampling

use a subset of the data as training set such that the sample fits into main memory

evaluate a sample of all potential splits (for numerical attributes)

    poor quality of resulting decision trees

    Support by indexing structures (secondary storage)

Use all data as training set (not just a sample)

Management of the data by a database system

    Indexing structures may provide high efficiency

    no loss in the quality of decision trees

    WS 2003/04 Data Mining Algorithms 7 84

    Scalability to Large Databases:Storage and Indexing Structures

    Identify expensive operations:

    Evaluation of potential splits and selection of best split

for numerical attributes: sorting the attribute values

evaluation of attribute values as potential split points

for categorical attributes: O(2^m) potential binary splits for m distinct attribute values

    Partitioning of training data

    according to the selected split point

    read and write operations to access the training data

    Effort for growth phase dominates the overall effort


    WS 2003/04 Data Mining Algorithms 7 85

    SLIQ: Introduction

    [Mehta, Agrawal & Rissanen 1996]

SLIQ: scalable decision tree classifier

Binary splits

Evaluation of the splits by using the Gini index

    Special data structures avoid sorting of the training data

    for every node of the decision tree

    for each numerical attribute

gini(T) = 1 − Σ_{j=1}^{k} p_j²   for k classes c_j with frequencies p_j
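To make the split criterion concrete, here is a minimal Python sketch of the Gini index as defined above and of the weighted Gini index of a binary split; the function names and the example counts (taken from the SLIQ training data used below) are illustrative:

```python
def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2 for class frequencies p_j in partition T."""
    n = sum(class_counts.values())
    return 1.0 - sum((c / n) ** 2 for c in class_counts.values()) if n else 0.0

def gini_split(left_counts, right_counts):
    """Weighted Gini index of a binary split T -> (T1, T2)."""
    n1, n2 = sum(left_counts.values()), sum(right_counts.values())
    return (n1 * gini(left_counts) + n2 * gini(right_counts)) / (n1 + n2)

# Example: 4 objects of class G and 2 of class B (as in the SLIQ training data below)
print(gini({"G": 4, "B": 2}))                   # ~0.444
print(gini_split({"B": 1}, {"G": 4, "B": 1}))   # ~0.267, e.g. for the split Age <= 23
```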

    WS 2003/04 Data Mining Algorithms 7 86

    SLIQ: Data Structures

    Attribute lists

    values of an attribute in ascending order

in combination with a reference to the respective entry in the class list; sequential access

    secondary storage resident

    Class list

contains the class label for each training object and

a reference to the respective leaf node in the decision tree; random access

    main memory resident

    Histograms

    for each leaf node of the decision tree

    frequencies of the individual classes per partition


    WS 2003/04 Data Mining Algorithms 7 87

    SLIQ: Example

    Age Id

    23 2

    30 1

    40 3

    45 6

    55 5

    55 4

    Id Age Income Class

    1 30 65 G

    2 23 15 B

    3 40 75 G

    4 55 40 B

    5 55 100 G

    6 45 60 G

    Income Id

    15 2

40 4

60 6

    65 1

    75 3

    100 5

    Id Class Leaf

    1 G N1

    2 B N1

    3 G N1

    4 B N1

    5 G N1

    6 G N1

    N1

    Training data

    Class list

    Attribute lists

    WS 2003/04 Data Mining Algorithms 7 88

    SLIQ: Algorithm

    Breadth first strategy

For all leaf nodes on the same level of the decision tree, evaluate all possible splits for all attributes

Standard decision tree classifiers follow a depth-first strategy

Split of numerical attributes

Sequentially scan the attribute list of attribute a, and for each value v in the list do:

Determine the respective entry e in the class list

Let k be the value of the leaf attribute of e

Update the histogram of k based on the value of the class attribute of e
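A minimal Python sketch of this sequential scan for the attribute Age of the SLIQ example; the in-memory lists and names are illustrative (SLIQ keeps the attribute lists on secondary storage and only the class list in main memory):

```python
def gini(counts):
    n = sum(counts.values()) or 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Attribute list for "Age": (value, id), sorted ascending; class list: id -> (class, leaf)
age_list = [(23, 2), (30, 1), (40, 3), (45, 6), (55, 5), (55, 4)]
class_list = {1: ("G", "N1"), 2: ("B", "N1"), 3: ("G", "N1"),
              4: ("B", "N1"), 5: ("G", "N1"), 6: ("G", "N1")}

# Per-leaf histograms: class frequencies below / above the current split point
hist = {"N1": {"below": {"G": 0, "B": 0}, "above": {"G": 4, "B": 2}}}

best = None
for value, obj_id in age_list:          # sequential scan of the attribute list
    cls, leaf = class_list[obj_id]      # entry e in the class list
    h = hist[leaf]                      # histogram of leaf k
    h["below"][cls] += 1                # update based on the class attribute of e
    h["above"][cls] -= 1
    n_lo, n_hi = sum(h["below"].values()), sum(h["above"].values())
    g = (n_lo * gini(h["below"]) + n_hi * gini(h["above"])) / (n_lo + n_hi)
    if best is None or g < best[0]:
        best = (g, leaf, value)         # candidate split "Age <= value" for leaf k

print(best)                             # best Gini value, leaf, and split point for Age
```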


    WS 2003/04 Data Mining Algorithms 7 89

    SPRINT: Introduction

    [Shafer, Agrawal & Mehta 1996]

    Shortcomings of SLIQ

The size of the class list grows linearly with the size of the database, i.e., with the number of training examples

SLIQ scales well only if sufficient main memory for the entire class list is available

    Goals of SPRINT

    Scalability for arbitrarily large databases

    Simple parallelization of the method

    WS 2003/04 Data Mining Algorithms 7 90

    SPRINT: Data Structures

Class list: there is no class list any longer

an additional class attribute in the attribute lists (resident in secondary storage); no main-memory data structures any longer; scalable to arbitrarily large databases

Attribute lists: no single attribute list for the entire training set

separate attribute lists for each node of the decision tree instead; waiving central data structures supports a simple parallelization of SPRINT (see the partitioning sketch below)
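A minimal Python sketch of the per-node partitioning of attribute lists described above, using the SPRINT example data from the next slide; probing a set of record ids for the non-split attributes stands in for the hash join SPRINT actually performs, and all names and in-memory lists are illustrative:

```python
# Attribute lists of node N1: rows of (value, class, id)
age_list = [(17, "high", 1), (20, "high", 5), (23, "high", 0),
            (32, "low", 4), (43, "high", 2), (68, "low", 3)]
car_list = [("family", "high", 0), ("sportive", "high", 1), ("sportive", "high", 2),
            ("family", "low", 3), ("truck", "low", 4), ("family", "high", 5)]

split_value = 27.5   # split of N1 on Age

# 1) Partition the list of the split attribute and remember which ids go left
left_ids = {i for v, _, i in age_list if v <= split_value}
age_n2 = [row for row in age_list if row[2] in left_ids]
age_n3 = [row for row in age_list if row[2] not in left_ids]

# 2) Partition every other attribute list by probing the id set
car_n2 = [row for row in car_list if row[2] in left_ids]
car_n3 = [row for row in car_list if row[2] not in left_ids]

print(age_n2, car_n2)   # attribute lists for node N2 (Age <= 27.5)
print(age_n3, car_n3)   # attribute lists for node N3 (Age > 27.5)
```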


    WS 2003/04 Data Mining Algorithms 7 91

    SPRINT: Example

    Age Class Id

    17 high 1

20 high 5

23 high 0

    32 low 4

    43 high 2

    68 low 3

    car type class Id

    family high 0

sportive high 1

sportive high 2

    family low 3

    truck low 4

    family high 5

Attribute lists for node N1

    Age Class Id

    17 high 1

    20 high 5

23 high 0

    age class Id

    32 low 4

    43 high 2

    68 low 3

    car type class Id

    family high 0

    sportive high 1

    family high 5

    car type class Id

    sportive high 2

    family low 3

    truck low 4

Attribute lists for node N2

Attribute lists for node N3

Split at node N1: Age ≤ 27.5 → N2, Age > 27.5 → N3

    WS 2003/04 Data Mining Algorithms 7 92

    SPRINT: Experimental Evaluation

SLIQ is more efficient than SPRINT as long as the class list fits into main memory

SLIQ is not applicable for data sets with more than one million entries

[Figure: runtime (in seconds, 0 to 8000) vs. number of objects (in millions, 0 to 3.0) for SLIQ and SPRINT]


    WS 2003/04 Data Mining Algorithms 7 93

    RainForest: Introduction

    [Gehrke, Ramakrishnan & Ganti 1998]

    Shortcomings of SPRINT

    Does not exploit the available main memory

    Is applicable to breadth first decision tree construction only

    Goals of RainForest

    Exploits the available main memory to increase the efficiency

    Applicable to all known algorithms

    RainForest: Basic idea

Separate scalability aspects from quality aspects of a decision tree classifier

    WS 2003/04 Data Mining Algorithms 7 94

    RainForest: Data Structures

AVC set for attribute a and node k

Contains a class histogram for each value of a

For all training objects that belong to the partition of node k

Entries: (a_i, c_j, count)

AVC group for node k

Set of AVC sets of node k for all attributes

For categorical attributes: the AVC set is significantly smaller than the attribute lists

At least one of the AVC sets fits into main memory

Potentially, the entire AVC group fits into main memory (see the construction sketch below)
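A minimal Python sketch of building an AVC set in a single scan over the partition of a node, using the RainForest example data from the next slide; the dictionary-based counting and the function names are illustrative:

```python
from collections import defaultdict

# Training objects of the partition of node N1: (age, income, class)
partition_n1 = [("young", 65, "G"), ("young", 15, "B"), ("young", 75, "G"),
                ("senior", 40, "B"), ("senior", 100, "G"), ("senior", 60, "G")]

def avc_set(partition, attribute_index):
    """AVC set: histogram of (value, class) -> count for the given attribute."""
    counts = defaultdict(int)
    for row in partition:                       # single sequential scan
        value, cls = row[attribute_index], row[-1]
        counts[(value, cls)] += 1
    return dict(counts)

# AVC group of node N1 = AVC sets of N1 for all attributes
avc_group_n1 = {"age": avc_set(partition_n1, 0), "income": avc_set(partition_n1, 1)}
print(avc_group_n1["age"])
# {('young', 'G'): 2, ('young', 'B'): 1, ('senior', 'B'): 1, ('senior', 'G'): 2}
```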


    WS 2003/04 Data Mining Algorithms 7 95

    RainForest: Example

    Id age income class

    1 young 65 G

    2 young 15 B

    3 young 75 G

    4 senior 40 B

    5 senior 100 G

    6 senior 60 G

    Training data

    value class count

    young B 1

    young G 2

    senior B 1

    senior G 2

    AVC set age for N1

    value class count

    15 B 1

    40 B 1

    60 G 1

    65 G 1

    75 G 1

    100 G 1

    AVC set income for N1

    value class count

    15 B 1

    65 G 1

    75 G 1

    AVC set income for N2

    value class count

    young B 1

    young G 2

    AVC set age for N2

Split at node N1: age = young → N2, age = senior → N3

    WS 2003/04 Data Mining Algorithms 7 96

    RainForest: Algorithms

    Assumption

    The entire AVC group of the root node fits into main memory

    Then, the AVC groups of each node also fit into main memory

    Algorithm RF_Write

Construction of the AVC group of node k in main memory by a sequential scan over the training set

Determination of the optimal split for node k by using the AVC group

Reading the training set and distribution (writing) to the partitions

The training set is read twice and written once


    WS 2003/04 Data Mining Algorithms 7 97

    RainForest: Algorithms

Algorithm RF_Read

Avoids explicit writing of the partitions to secondary storage

Reading of desired partitions from the entire training data set

Simultaneous creation of AVC groups for as many partitions as possible

The training database is read multiple times for each tree level

Algorithm RF_Hybrid

Usage of RF_Read as long as the AVC groups of all nodes from the current level of the decision tree fit into main memory

Subsequent materialization of the partitions by using RF_Write

    WS 2003/04 Data Mining Algorithms 7 98

    RainForest: Experimental Evaluation

For all RainForest algorithms, the runtime increases linearly with the number n of training objects

    RainForest is significantly more efficient than SPRINT

[Figure: runtime (in seconds, up to 20,000) vs. number of training objects (in millions, 1.0 to 3.0) for SPRINT and RainForest]


    WS 2003/04 Data Mining Algorithms 7 99

    Boosting and Bagging

    Techniques to increase classification accuracy

    Bagging

Basic idea: Learn a set of classifiers and decide the class prediction by following the majority of the individual votes

Boosting

Basic idea: Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor

Applicable to decision trees or Bayesian classifiers

    WS 2003/04 Data Mining Algorithms 7 100

    Boosting: Algorithm

    Algorithm

    Assign every example an equal weight 1/N

For t = 1, 2, …, T do

Obtain a hypothesis (classifier) h(t) under the weights w(t)

Calculate the error of h(t) and re-weight the examples based on the error

Normalize w(t+1) to sum to 1.0

Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set

Boosting requires only linear time and constant space
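A minimal AdaBoost-style sketch of this loop in Python; AdaBoost.M1-style reweighting is one common instance of the generic scheme on the slide, and the depth-1 decision tree ("stump") weak learner is an assumption, not prescribed by it:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # weak learner accepting sample weights

def adaboost(X, y, T=10):
    """y in {-1, +1}. Returns a list of (alpha_t, h_t) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # equal initial weights 1/N
    ensemble = []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w[pred != y])               # weighted training error of h(t)
        if err >= 0.5 or err <= 0:               # degenerate cases omitted for brevity
            break
        alpha = 0.5 * np.log((1 - err) / err)    # hypothesis weight grows with accuracy
        w *= np.exp(-alpha * y * pred)           # up-weight misclassified examples
        w /= w.sum()                             # normalize w(t+1) to sum to 1.0
        ensemble.append((alpha, h))
    return ensemble

def boosted_predict(ensemble, X):
    """Weighted vote of all hypotheses."""
    return np.sign(sum(alpha * h.predict(X) for alpha, h in ensemble))

# Usage (toy data): X shape (n, d), y in {-1, +1}
# model = adaboost(X_train, y_train, T=25); y_hat = boosted_predict(model, X_test)
```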


    WS 2003/04 Data Mining Algorithms 7 101

    Chapter 7: Classification

    Introduction

Classification problem, evaluation of classifiers

Bayesian Classifiers

Optimal Bayes classifier, naive Bayes classifier, applications

Nearest Neighbor Classifier

Basic notions, choice of parameters, applications

Decision Tree Classifiers

Basic notions, split strategies, overfitting, pruning of decision trees

Scalability to Large Databases

SLIQ, SPRINT, RainForest

Further Approaches to Classification

Neural networks, genetic algorithms, rough set approach, fuzzy set approaches, support vector machines, prediction

    WS 2003/04 Data Mining Algorithms 7 102

    Neural Networks

    Advantages

    prediction accuracy is generally high

robust, works when training examples contain errors

output may be discrete, real-valued, or a vector of several discrete or real-valued attributes

    fast evaluation of the learned target function

    Criticism

    long training time

difficult to understand the learned function (weights), no explicit knowledge generated

    not easy to incorporate domain knowledge


    WS 2003/04 Data Mining Algorithms 7 103

    A Neuron

The n-dimensional input vector x = (x1, x2, …, xn) is mapped into variable y by means of the scalar product and a nonlinear function mapping

[Figure: a single neuron; the input vector x = (x1, …, xn) is combined with the weight vector w = (w1, …, wn) into a weighted sum plus a bias, which is passed through the activation function f to produce the output y]
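A minimal numpy sketch of this computation, using the sigmoid as the nonlinear activation function f; the concrete numbers are arbitrary:

```python
import numpy as np

def neuron(x, w, bias):
    """Output y = f(<w, x> + bias) of a single neuron with sigmoid activation f."""
    net = np.dot(w, x) + bias            # weighted sum of the input vector
    return 1.0 / (1.0 + np.exp(-net))    # nonlinear activation function

x = np.array([0.5, -1.0, 2.0])           # input vector (x1, ..., xn)
w = np.array([0.4, 0.1, -0.2])           # weight vector (w1, ..., wn)
print(neuron(x, w, bias=0.3))            # output y in (0, 1)
```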

    WS 2003/04 Data Mining Algorithms 7 104

    Network Training

    The ultimate objective of training

obtain a set of weights that makes almost all the tuples in the training data classified correctly

Steps

    Initialize weights with random values

    Feed the input tuples into the network one by one

    For each unit

    Compute the net input to the unit as a linear combination

    of all the inputs to the unit

    Compute the output value using the activation function

    Compute the error

    Update the weights and the bias


    WS 2003/04 Data Mining Algorithms 7 105

    Multi-Layer Perceptron

[Figure: multi-layer perceptron; the input vector x_i feeds the input nodes, which connect via weights w_ij to hidden nodes and then to output nodes producing the output vector]

I_j = Σ_i w_ij O_i + θ_j

O_j = 1 / (1 + e^(−I_j))

Err_j = O_j (1 − O_j)(T_j − O_j)   (output layer)

Err_j = O_j (1 − O_j) Σ_k Err_k w_jk   (hidden layer)

w_ij = w_ij + (l) Err_j O_i

θ_j = θ_j + (l) Err_j
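A minimal numpy sketch of one forward and backward pass that follows the formulas above for a network with a single hidden layer; the layer sizes, the learning rate l, and the random initial weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.1, 0.9])                         # input values O_i of the input layer
target = np.array([1.0])                         # desired output T_j
l = 0.5                                          # learning rate

W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)    # input -> hidden weights w_ij, biases theta_j
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)    # hidden -> output weights and biases

sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))

# Forward pass: I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
O_hidden = sigmoid(W1 @ x + b1)
O_out = sigmoid(W2 @ O_hidden + b2)

# Backward pass: Err_j for output and hidden units
err_out = O_out * (1 - O_out) * (target - O_out)             # output layer
err_hidden = O_hidden * (1 - O_hidden) * (W2.T @ err_out)    # hidden layer

# Weight and bias updates: w_ij += l * Err_j * O_i, theta_j += l * Err_j
W2 += l * np.outer(err_out, O_hidden); b2 += l * err_out
W1 += l * np.outer(err_hidden, x);     b1 += l * err_hidden
```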

    WS 2003/04 Data Mining Algorithms 7 106

    Network Pruning and Rule Extraction

    Network pruning

    Fully connected network will be hard to articulate

N input nodes, h hidden nodes, and m output nodes lead to h(m + N) weights

    Pruning: Remove some of the links without affecting classification

    accuracy of the network

    Extracting rules from a trained network

Discretize activation values; replace each individual activation value by its cluster average, maintaining the network accuracy

Enumerate the output from the discretized activation values to find rules between activation value and output

Find the relationship between the input and activation value

Combine the above two to have rules relating the output to the input


    WS 2003/04 Data Mining Algorithms 7 107

    Genetic Algorithms

    GA: based on an analogy to biological evolution

Each rule is represented by a string of bits

An initial population is created consisting of randomly generated rules

e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100

Based on the evolutionary notion of survival of the fittest, a new population is formed that consists of the fittest rules and their offspring

The fitness of a rule is represented by its classification accuracy on a set of training examples

Offspring are generated by crossover and mutation (see the sketch below)
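A minimal Python sketch of one such evolutionary loop over bit-string rules with single-point crossover and bit-flip mutation; the toy examples and the placeholder fitness function are made up for illustration:

```python
import random

random.seed(0)
RULE_LEN = 3                                    # e.g., "100" encodes "IF A1 AND NOT A2 THEN C2"

# Toy training set: (attribute bit string, class); used only by the placeholder fitness
examples = [("100", 1), ("110", 0), ("101", 1), ("010", 0)]

def fitness(rule):
    """Placeholder: fraction of examples the encoded rule classifies correctly."""
    return sum((rule == bits) == bool(cls) for bits, cls in examples) / len(examples)

def crossover(a, b):
    cut = random.randrange(1, RULE_LEN)          # single-point crossover
    return a[:cut] + b[cut:]

def mutate(rule, p=0.1):
    return "".join(("1" if c == "0" else "0") if random.random() < p else c for c in rule)

def next_generation(population, keep=0.5):
    ranked = sorted(population, key=fitness, reverse=True)      # survival of the fittest
    fittest = ranked[: max(2, int(len(ranked) * keep))]
    offspring = [mutate(crossover(*random.sample(fittest, 2)))
                 for _ in range(len(population) - len(fittest))]
    return fittest + offspring

population = ["".join(random.choice("01") for _ in range(RULE_LEN)) for _ in range(8)]
for _ in range(5):                               # evolve a few generations
    population = next_generation(population)
print(population)
```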

    WS 2003/04 Data Mining Algorithms 7 108

    Rough Set Approach

Rough sets are used to approximately or roughly define equivalence classes

A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)

Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity


    WS 2003/04 Data Mining Algorithms 7 109

Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)

Attribute values are converted to fuzzy values

e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated

For a given new sample, more than one fuzzy value may apply

Each applicable rule contributes a vote for membership in the categories

    Typically, the truth values for each predicted categoryare summed

    WS 2003/04 Data Mining Algorithms 7 110

    Motivation: Linear Separation

    separating hyperplane

    Support Vector Machines (SVM)

Vectors in ℝ^d represent objects

Objects belong to exactly one of two classes

For the sake of simpler formulas, the used class labels are: y = −1 and y = +1

Classification by linear separation: determine a hyperplane which separates both vector sets with maximal stability

Assign unknown elements to the halfspace in which they reside

Acknowledgements: Prof. Dr. Hans-Peter Kriegel and Matthias Schubert (LMU Munich) and Dr. Thorsten Joachims (U Dortmund and Cornell U)


    WS 2003/04 Data Mining Algorithms 7 111

    Support Vector Machines

Problems of linear separation

Definition and efficient determination of the maximally stable hyperplane

Classes are not always linearly separable

Computation of selected hyperplanes is very expensive

    Restriction to two classes

    Approach to solve these problems

    Support Vector Machines (SVMs) [Vapnik 1979, 1995]

    WS 2003/04 Data Mining Algorithms 7 112

    Maximum Margin Hyperplane

Observation: There is no unique hyperplane to separate p1 from p2

Question: Which hyperplane separates the classes best?

Criteria:

Stability at insertion

Distance to the objects of both classes

[Figure: the point sets around p1 and p2 separated by two different candidate hyperplanes]


    WS 2003/04 Data Mining Algorithms 7 113

    Support Vector Machines: Principle

    Basic idea: Linear separation with the

    Maximum Margin Hyperplane (MMH)

Distance to points from any of the two sets is maximal, i.e., at least δ

Minimal probability that the separating hyperplane has to be moved due to an insertion

Best generalization behaviour

MMH is maximally stable

MMH only depends on points p_i whose distance to the hyperplane is exactly δ

Such a point p_i is called a support vector

[Figure: maximum margin hyperplane separating p1 and p2, with margin δ and support vectors lying on the margin]

    WS 2003/04 Data Mining Algorithms 7 114

    Maximum Margin Hyperplane

Recall some algebraic notions for the feature space FS

Inner product of two vectors: ⟨·, ·⟩ : FS × FS → ℝ

e.g., canonical scalar product: ⟨x, y⟩ = Σ_{i=1}^{d} x_i y_i

Hyperplane H(w, b) with normal vector w and value b: H(w, b) = { x ∈ FS | ⟨w, x⟩ + b = 0 }

Distance of a vector x to the hyperplane H(w, b): dist(x, H(w, b)) = (⟨w, x⟩ + b) / √⟨w, w⟩
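A small numpy sketch of the scalar product and the distance formula above; the vectors w, x and the value b are arbitrary examples:

```python
import numpy as np

w = np.array([3.0, 4.0])          # normal vector of the hyperplane H(w, b)
b = -5.0                          # value b
x = np.array([2.0, 1.0])          # some vector in the feature space

inner = np.dot(w, x)                              # canonical scalar product <w, x>
dist = (inner + b) / np.sqrt(np.dot(w, w))        # signed distance of x to H(w, b)
print(dist)                                       # 1.0 -> x lies on the positive side
```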


    WS 2003/04 Data Mining Algorithms 7 115

Computation of the Maximum Margin Hyperplane

Two assumptions for classifying x_i (class 1: y_i = +1, class 2: y_i = −1):

1) The classification error is zero

2) The margin is maximal

Let δ denote the minimum distance of any training object x_i to the hyperplane H(w, b)

Then: Maximize δ subject to y_i · dist(x_i, H(w, b)) ≥ δ for all i ∈ [1..n]


    WS 2003/04 Data Mining Algorithms 7 117

    Dual Optimization Problem

For computational purposes, transform the primal optimization problem into a dual one by using Lagrange multipliers

    For the solution, use algorithms from optimization theory

    Up to now only linearly separable data

    If data is not linearly separable: Soft Margin Optimization

Dual optimization problem: Find parameters α_i that

maximize   L(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩

subject to   Σ_{i=1}^{n} α_i y_i = 0   and   0 ≤ α_i

    WS 2003/04 Data Mining Algorithms 7 118

    Soft Margin Optimization

    Problem of Maximum Margin Optimization: How to treat non-linearly separable data?

    Two typical problems:

    Trade-off between training error and size of margin

data points are not separable

complete separation is not optimal


    WS 2003/04 Data Mining Algorithms 7 119

    Soft Margin Optimization

Additionally regard the number of training errors when optimizing:

ξ_i is the distance from p_i to the margin (often called slack variable)

C controls the influence of single training vectors

Primal optimization problem with soft margin:

Find a w that minimizes   ½ ⟨w, w⟩ + C Σ_{i=1}^{n} ξ_i

subject to   ∀i ∈ [1..n]:  y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i   and   ξ_i ≥ 0

[Figure: soft margin with slack variables ξ_1, ξ_2 for points p1, p2 on the wrong side of the margin]

    WS 2003/04 Data Mining Algorithms 7 120

    Soft Margin Optimization

Dual optimization problem with Lagrange multipliers:

Dual OP: Maximize   L(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩

subject to   Σ_{i=1}^{n} α_i y_i = 0   and   0 ≤ α_i ≤ C

0 < α_i < C:  p_i is a support vector with ξ_i = 0

α_i = C:  p_i is a support vector with ξ_i > 0

α_i = 0:  p_i is not a support vector

[Figure: soft margin with slack variables ξ_1, ξ_2 and support vectors among p1, p2]

Decision rule:

h(x) = sign( Σ_{x_i ∈ SV} α_i y_i ⟨x_i, x⟩ + b )
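A minimal numpy sketch of this decision rule; the support vectors, multipliers α_i, labels y_i, and offset b are assumed to come from a dual solver and are hard-coded here for illustration:

```python
import numpy as np

# Assumed output of the dual optimization: support vectors with alpha_i > 0
support_vectors = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
alphas = np.array([0.4, 0.2, 0.6])      # Lagrange multipliers alpha_i (0 < alpha_i <= C)
labels = np.array([+1, +1, -1])         # class labels y_i
b = -0.3                                # offset of the hyperplane

def h(x):
    """Decision rule h(x) = sign( sum_i alpha_i * y_i * <x_i, x> + b )."""
    return np.sign(np.sum(alphas * labels * (support_vectors @ x)) + b)

print(h(np.array([1.5, 1.0])))          # +1 or -1
```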


    WS 2003/04 Data Mining Algorithms 7 121

Kernel Machines: Non-Linearly Separable Data Sets

Problem: For real data sets, a linear separation with a high classification accuracy often is not possible

Idea: Transform the data non-linearly into a new space, and try to separate the data in the new space linearly (extension of the hypotheses space)

    Example for a quadratically separable data set

    WS 2003/04 Data Mining Algorithms 7 122

Kernel Machines: Extension of the Hypotheses Space

    Principle

    Try to separate in the extended feature space linearly

    Example

Here: a hyperplane in the extended feature space is a polynomial of degree 2 in the input space

input space → extended feature space

(x, y, z) → (x, y, z, x², xy, xz, y², yz, z²)


    WS 2003/04 Data Mining Algorithms 7 123

    Kernel Machines: Example

Input space (2 attributes):  x = (x1, x2)

Extended space (6 attributes):  φ(x) = (x1², x2², √2·x1·x2, √2·x1, √2·x2, 1)
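A small numpy check that the scalar product in this extended space equals the degree-2 polynomial kernel (⟨x, y⟩ + 1)² in the input space, which is what makes the kernel trick below work; the two vectors are arbitrary:

```python
import numpy as np

def phi(x):
    """Extension to 6 attributes: (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s*x1*x2, s*x1, s*x2, 1.0])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (<x, y> + 1)^2."""
    return (np.dot(x, y) + 1.0) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 4.0
print(poly_kernel(x, y))        # 4.0 -- same value, without ever computing phi
```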

    WS 2003/04 Data Mining Algorithms 7 124

    Kernel Machines: Example (2)

Input space (2 attributes):  x = (x1, x2)

Extended space (3 attributes):  φ(x) = (x1², x2², √2·x1·x2)

[Figure: points of classes 0 and 1 that are not linearly separable over (x1, x2) become linearly separable over (x1², x2²)]


    WS 2003/04 Data Mining Algorithms 7 125

    Kernel Machines

Introduction of a kernel corresponds to a feature transformation φ(x): FS_old → FS_new

The feature transform only affects the scalar product of training vectors

Kernel K is the function   K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩

Dual optimization problem:

Maximize   L(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j K(x_i, x_j)

subject to   Σ_{i=1}^{n} α_i y_i = 0   and   0 ≤ α_i ≤ C

    WS 2003/04 Data Mining Algorithms 7 126

    Kernel Machines: Examples

Polynomial kernel (degree d, e.g. d = 2):   K(x, y) = (⟨x, y⟩ + 1)^d

Radial basis kernel:   K(x, y) = exp(−‖x − y‖² / γ)


    WS 2003/04 Data Mining Algorithms 7 127

    Support Vector Machines: Discussion

+ generate classifiers with a high classification accuracy

+ relatively weak tendency to overfitting (generalization theory)

+ efficient classification of new objects

+ compact models

− training times may be long (the appropriate feature space may be very high-dimensional)

− expensive implementation

− resulting models rarely provide an intuition

    WS 2003/04 Data Mining Algorithms 7 128

    What Is Prediction?

    Prediction is similar to classification

    First, construct a model

    Second, use model to predict unknown value

    Major method for prediction is regression

    Linear and multiple regression

    Non-linear regression

    Prediction is different from classification

Classification refers to predicting a categorical class label

Prediction models continuous-valued functions


    WS 2003/04 Data Mining Algorithms 7 129

Predictive Modeling in Databases

Predictive modeling: Predict data values or construct generalized linear models based on the database data

One can only predict value ranges or category distributions

    Method outline:

    Minimal generalization

    Attribute relevance analysis

    Generalized linear model construction

    Prediction

Determine the major factors which influence the prediction

Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.

    Multi-level prediction: drill-down and roll-up analysis


    WS 2003/04 Data Mining Algorithms 7 130

Regression Analysis and Log-Linear Models in Prediction

Linear regression:  Y = α + β X

Two parameters, α and β, specify the line and are to be estimated by using the data at hand, applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, …

Multiple regression:  Y = b0 + b1·X1 + b2·X2

Many nonlinear functions can be transformed into the above

Log-linear models:

The multi-way table of joint probabilities is approximated by a product of lower-order tables

Probability:  p(a, b, c, d) = α_ab · β_ac · χ_ad · δ_bcd
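A minimal numpy sketch of estimating α and β with the least squares criterion; the (X, Y) data is made up for illustration:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])           # roughly Y = 0 + 2*X plus noise

# Least squares: beta = cov(X, Y) / var(X), alpha = mean(Y) - beta * mean(X)
beta = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha = Y.mean() - beta * X.mean()
print(alpha, beta)                                 # estimated line Y = alpha + beta * X

# Multiple regression Y = b0 + b1*X1 + b2*X2 can be solved analogously via np.linalg.lstsq
```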



    WS 2003/04 Data Mining Algorithms 7 131

    Locally Weighted Regression

Construct an explicit approximation to f over a local region surrounding query instance x_q

Locally weighted linear regression:

The target function f is approximated near x_q using the linear function:

f̂(x) = w_0 + w_1 a_1(x) + … + w_n a_n(x)

Minimize the squared error over the k nearest neighbors of x_q, with a distance-decreasing weight K:

E(x_q) = ½ Σ_{x ∈ kNN(x_q)} (f(x) − f̂(x))² · K(d(x_q, x))

The gradient descent training rule:

Δw_j = η Σ_{x ∈ kNN(x_q)} K(d(x_q, x)) · (f(x) − f̂(x)) · a_j(x)

In most cases, the target function is approximated by a constant, linear, or quadratic function.
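A minimal numpy sketch of locally weighted linear regression at one query point; it solves the weighted least-squares problem in closed form instead of using the gradient descent rule above, and the Gaussian weight K, the bandwidth, and the data are illustrative choices:

```python
import numpy as np

def lwr_predict(x_query, X, y, k=5, bandwidth=1.0):
    """Fit a local linear model around x_query using its k nearest neighbors."""
    d = np.abs(X - x_query)                       # distances d(x_q, x) for 1-d inputs
    idx = np.argsort(d)[:k]                       # k nearest neighbors of x_q
    K = np.exp(-(d[idx] / bandwidth) ** 2)        # distance-decreasing weight K
    A = np.column_stack([np.ones(k), X[idx]])     # features (1, a_1(x)) for f^(x) = w0 + w1*x
    W = np.diag(K)
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y[idx])   # weighted least squares
    return w[0] + w[1] * x_query

X = np.linspace(0, 10, 50)
y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=50)
print(lwr_predict(3.0, X, y))                     # local estimate of f near x_q = 3.0
```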

    WS 2003/04 Data Mining Algorithms 7 132

    Prediction: Numerical Data


    WS 2003/04 Data Mining Algorithms 7 135

    References (I)

    C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future

    Generation Computer Systems, 13, 1997.

    L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.

    Wadsworth International Group, 1984.

    P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data

    for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data

    Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.

    U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994

    AAAI Conf., pages 601-606, AAAI Press, 1994.

J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.

T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.

    M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision

    tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop

    Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham,

    England, April 1997.

    References (II)

J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.

M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.

S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery 2(4): 345-389, 1998.

J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, August 1996.

R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.

J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, September 1996.

S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.