Chapter 7: Classification
Introduction
  Classification problem, evaluation of classifiers
Bayesian Classifiers
  Optimal Bayes classifier, naive Bayes classifier, applications
Nearest Neighbor Classifier
  Basic notions, choice of parameters, applications
Decision Tree Classifiers
  Basic notions, split strategies, overfitting, pruning of decision trees
Scalability to Large Databases
  SLIQ, SPRINT, RainForest
Further Approaches to Classification
  Neural networks, genetic algorithms, rough set approach, fuzzy set approaches, support vector machines, prediction
Scalability to Large Databases: Motivation

Construction of decision trees is one of the most important tasks in classification
We considered up to now
small data sets
main memory resident data
New requirements
larger and larger commercial databases
necessity to use secondary storage algorithms
Scalability for databases of arbitrary (i.e., unbounded) size
Scalability to Large Databases: Approaches

Sampling
  use a subset of the data as training set such that the sample fits into main memory
  evaluate a sample of all potential splits (for numerical attributes)
  - poor quality of the resulting decision trees
Support by indexing structures (secondary storage)
  use all data as training set (not just a sample)
  management of the data by a database system
  indexing structures may provide high efficiency
  + no loss in the quality of decision trees
Scalability to Large Databases: Storage and Indexing Structures

Identify expensive operations:
Evaluation of potential splits and selection of the best split
  for numerical attributes: sorting the attribute values
  evaluation of attribute values as potential split points
  for categorical attributes: O(2^m) potential binary splits for m distinct attribute values
Partitioning of the training data
  according to the selected split point
  read and write operations to access the training data
Effort for the growth phase dominates the overall effort
SLIQ: Introduction
[Mehta, Agrawal & Rissanen 1996]
SLIQ: Scalable decision tree classifier
  Binary splits
  Evaluation of the splits by using the Gini index
Special data structures avoid sorting of the training data
  for every node of the decision tree
  for each numerical attribute
gini(T) = 1 − Σ_{j=1}^{k} p_j²   for k classes c_j with relative frequencies p_j
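To make the split criterion concrete, here is a minimal Python sketch of the Gini index and of the weighted Gini value of a binary split, the quantity a SLIQ-style classifier minimizes; function names and example labels are illustrative, not taken from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: gini(T) = 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted Gini index of a binary split of the labels."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)

labels = ["G", "B", "G", "B", "G", "G"]
print(gini(labels))                        # 0.444...
print(gini_split(labels[:1], labels[1:]))  # Gini value of splitting off the first object
```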
SLIQ: Data Structures
Attribute lists
  values of an attribute in ascending order
  in combination with a reference to the respective entry in the class list → sequential access
  secondary storage resident
Class list
  contains the class label for each training object and
  a reference to the respective leaf node in the decision tree → random access
  main memory resident
Histograms
  for each leaf node of the decision tree
  frequencies of the individual classes per partition
SLIQ: Example
Training data:

Id  Age  Income  Class
1   30   65      G
2   23   15      B
3   40   75      G
4   55   40      B
5   55   100     G
6   45   60      G

Attribute lists (sorted by attribute value):

Age  Id      Income  Id
23   2       15      2
30   1       40      4
40   3       60      6
45   6       65      1
55   5       75      3
55   4       100     5

Class list (all objects initially refer to the single leaf node N1):

Id  Class  Leaf
1   G      N1
2   B      N1
3   G      N1
4   B      N1
5   G      N1
6   G      N1
SLIQ: Algorithm
Breadth-first strategy
  For all leaf nodes on the same level of the decision tree, evaluate all possible splits for all attributes
  (standard decision tree classifiers follow a depth-first strategy)
Split of numerical attributes
  Sequentially scan the attribute list of attribute a, and for each value v in the list do:
    Determine the respective entry e in the class list
    Let k be the value of the leaf attribute of e
    Update the histogram of k based on the value of the class attribute of e
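A minimal sketch of this sequential scan, assuming the attribute list, class list, and per-leaf histograms are held in the simple Python structures described in the docstring; all names and encodings are illustrative, not part of the original SLIQ implementation.

```python
def evaluate_numeric_splits(attribute_list, class_list, histograms):
    """Sequential scan over one attribute list (SLIQ-style split evaluation).

    attribute_list: list of (value, record_id) pairs, sorted ascending by value.
    class_list:     dict record_id -> (class_label, leaf_node).
    histograms:     dict leaf_node -> {"below": {}, "above": {}} class counts;
                    initially every object of a leaf is counted under "above".
    Yields (leaf_node, candidate_split_value, gini_of_split) tuples.
    """
    for value, rid in attribute_list:
        label, leaf = class_list[rid]          # entry e in the class list
        hist = histograms[leaf]                # histogram of leaf k
        # move the current object from the "above" side to the "below" side
        hist["above"][label] = hist["above"].get(label, 0) - 1
        hist["below"][label] = hist["below"].get(label, 0) + 1
        yield leaf, value, gini_of_split(hist)

def gini_of_split(hist):
    """Weighted Gini index of the split described by the two histogram sides."""
    total = sum(hist["below"].values()) + sum(hist["above"].values())
    g = 0.0
    for side in ("below", "above"):
        n = sum(hist[side].values())
        if n > 0:
            g += (n / total) * (1.0 - sum((c / n) ** 2 for c in hist[side].values()))
    return g
```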
SPRINT: Introduction
[Shafer, Agrawal & Mehta 1996]
Shortcomings of SLIQ
  The size of the class list grows linearly with the size of the database, i.e. with the number of training examples
  SLIQ scales well only if sufficient main memory for the entire class list is available
Goals of SPRINT
Scalability for arbitrarily large databases
Simple parallelization of the method
SPRINT: Data Structures
Class list
  there is no class list any longer
  additional attribute "class" in the attribute lists (resident in secondary storage)
  no main memory data structures any longer → scalable to arbitrarily large databases
Attribute lists
  no single attribute list for the entire training set
  separate attribute lists for each node of the decision tree instead
  waiving central data structures supports a simple parallelization of SPRINT
SPRINT: Example
Attribute lists for node N1:

Age  Class  Id        Car type  Class  Id
17   high   1         family    high   0
20   high   5         sportive  high   1
23   high   0         sportive  high   2
32   low    4         family    low    3
43   high   2         truck     low    4
68   low    3         family    high   5

Split at node N1: Age ≤ 27.5 → N2,  Age > 27.5 → N3

Attribute lists for node N2:

Age  Class  Id        Car type  Class  Id
17   high   1         family    high   0
20   high   5         sportive  high   1
23   high   0         family    high   5

Attribute lists for node N3:

Age  Class  Id        Car type  Class  Id
32   low    4         sportive  high   2
43   high   2         family    low    3
68   low    3         truck     low    4
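A small Python sketch of how such per-node attribute lists can be partitioned after a split has been chosen: the ids of the objects going to the left child are collected from the split attribute's list, and every other list is split by probing this id set (SPRINT uses a hash table of ids for this step). The data encoding and names are illustrative.

```python
def partition_attribute_lists(attribute_lists, split_attr, split_value):
    """Partition per-node attribute lists after a numerical split (SPRINT-style).

    attribute_lists: dict attr_name -> list of (value, class_label, record_id),
                     each list sorted by value.
    Returns (left_lists, right_lists) for the two child nodes.
    """
    # 1) Ids of the objects that go to the left child, taken from the split attribute.
    left_ids = {rid for value, _, rid in attribute_lists[split_attr]
                if value <= split_value}

    left_lists, right_lists = {}, {}
    # 2) Probe the id set to split every other attribute list,
    #    preserving the sort order within each child list.
    for attr, entries in attribute_lists.items():
        left_lists[attr] = [e for e in entries if e[2] in left_ids]
        right_lists[attr] = [e for e in entries if e[2] not in left_ids]
    return left_lists, right_lists

# Attribute lists of node N1 from the example above
n1 = {
    "age": [(17, "high", 1), (20, "high", 5), (23, "high", 0),
            (32, "low", 4), (43, "high", 2), (68, "low", 3)],
    "car_type": [("family", "high", 0), ("sportive", "high", 1),
                 ("sportive", "high", 2), ("family", "low", 3),
                 ("truck", "low", 4), ("family", "high", 5)],
}
n2, n3 = partition_attribute_lists(n1, "age", 27.5)
```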
SPRINT: Experimental Evaluation
SLIQ is more efficient than SPRINT as long as the class list fits into main memory
SLIQ is not applicable for data sets with more than one million entries

[Figure: runtime in seconds (0 to 8,000) vs. number of objects in millions (0 to 3.0) for SLIQ and SPRINT]
RainForest: Introduction
[Gehrke, Ramakrishnan & Ganti 1998]
Shortcomings of SPRINT
Does not exploit the available main memory
Is applicable to breadth first decision tree construction only
Goals of RainForest
Exploits the available main memory to increase the efficiency
Applicable to all known algorithms
RainForest: Basic idea
Separate scalability aspects from quality aspects of a decision tree classifier
RainForest: Data Structures
AVC set for attribute a and node k
  Contains a class histogram for each value of a
  For all training objects that belong to the partition of node k
  Entries: (a_i, c_j, count)
AVC group for node k
  Set of AVC sets of node k for all attributes
For categorical attributes:
  AVC set is significantly smaller than the attribute lists
  At least one of the AVC sets fits into main memory
  Potentially, the entire AVC group fits into main memory
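A minimal sketch of constructing an AVC set and an AVC group by one scan over a node's partition, assuming the training objects are plain Python dicts; the names and the dict encoding are illustrative, not part of the RainForest paper.

```python
from collections import defaultdict

def build_avc_set(partition, attribute, class_attr="class"):
    """AVC set for one attribute and one node: entries (a_i, c_j, count)."""
    counts = defaultdict(int)
    for obj in partition:
        counts[(obj[attribute], obj[class_attr])] += 1
    return counts

def build_avc_group(partition, attributes):
    """AVC group for a node: one AVC set per attribute."""
    return {a: build_avc_set(partition, a) for a in attributes}

# Hypothetical dict encoding of the training data in the example below
data = [
    {"id": 1, "age": "young", "income": 65, "class": "G"},
    {"id": 2, "age": "young", "income": 15, "class": "B"},
    {"id": 3, "age": "young", "income": 75, "class": "G"},
    {"id": 4, "age": "senior", "income": 40, "class": "B"},
    {"id": 5, "age": "senior", "income": 100, "class": "G"},
    {"id": 6, "age": "senior", "income": 60, "class": "G"},
]
avc_age = build_avc_set(data, "age")   # {('young', 'G'): 2, ('young', 'B'): 1, ...}
```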
RainForest: Example
Training data:

Id  Age     Income  Class
1   young   65      G
2   young   15      B
3   young   75      G
4   senior  40      B
5   senior  100     G
6   senior  60      G

AVC set "age" for N1:          AVC set "income" for N1:
value   class  count           value  class  count
young   B      1               15     B      1
young   G      2               40     B      1
senior  B      1               60     G      1
senior  G      2               65     G      1
                               75     G      1
                               100    G      1

Split at node N1: age = young → N2,  age = senior → N3

AVC set "age" for N2:          AVC set "income" for N2:
value   class  count           value  class  count
young   B      1               15     B      1
young   G      2               65     G      1
                               75     G      1
RainForest: Algorithms
Assumption
  The entire AVC group of the root node fits into main memory
  Then, the AVC groups of each node also fit into main memory
Algorithm RF_Write
  Construction of the AVC group of node k in main memory by a sequential scan over the training set
  Determination of the optimal split for node k by using the AVC group
  Reading the training set and distribution (writing) to the partitions
  → the training set is read twice and written once
RainForest: Algorithms
Algorithm RF_Read
  Avoids explicit writing of the partitions to secondary storage
  Reading of the desired partitions from the entire training data set
  Simultaneous creation of AVC groups for as many partitions as possible
  → the training database is read multiple times for each tree level
Algorithm RF_Hybrid
  Usage of RF_Read as long as the AVC groups of all nodes from the current level of the decision tree fit into main memory
  Subsequent materialization of the partitions by using RF_Write
RainForest: Experimental Evaluation
For all RainForest algorithms, the runtime increases linearly with the number n of training objects
RainForest is significantly more efficient than SPRINT

[Figure: runtime in seconds (up to 20,000) vs. number of training objects in millions (1.0 to 3.0) for SPRINT and RainForest]
Boosting and Bagging
Techniques to increase classification accuracy
Bagging
  Basic idea: Learn a set of classifiers and decide the class prediction by following the majority of the individual votes
Boosting
  Basic idea: Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
Applicable to decision trees or Bayesian classifiers
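A minimal sketch of the bagging idea above: each base classifier is trained on a bootstrap sample and the ensemble predicts by majority vote. The train_classifier callback, which returns a prediction function, is an assumption for illustration only.

```python
import random
from collections import Counter

def bagging_train(train_classifier, data, n_classifiers=25):
    """Train each base classifier on a bootstrap sample (drawn with replacement)."""
    return [train_classifier([random.choice(data) for _ in data])
            for _ in range(n_classifiers)]

def bagging_predict(classifiers, x):
    """Class prediction by majority vote of the individual classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```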
Boosting: Algorithm
Algorithm
  Assign every example an equal weight 1/N
  For t = 1, 2, ..., T do
    Obtain a hypothesis (classifier) h(t) under the weights w(t)
    Calculate the error of h(t) and re-weight the examples based on the error
    Normalize w(t+1) to sum to 1.0
  Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
Boosting requires only linear time and constant space
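A hedged, AdaBoost-style sketch of this weighting scheme for two-class problems with labels in {-1, +1}. The train_weighted callback, which trains a base hypothesis under the current example weights, is an assumption; the update details follow the common AdaBoost formulation rather than a specific algorithm from the slides.

```python
import math

def boost(train_weighted, data, labels, T=10):
    """Train T hypotheses on re-weighted data and return their weighted vote."""
    n = len(data)
    w = [1.0 / n] * n                       # equal initial weights
    hypotheses, alphas = [], []
    for _ in range(T):
        h = train_weighted(data, labels, w)
        # weighted training error of h
        err = sum(wi for wi, x, y in zip(w, data, labels) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)        # accuracy-based weight
        # increase the weight of misclassified examples, decrease the rest
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, data, labels)]
        total = sum(w)
        w = [wi / total for wi in w]                   # normalize to sum 1.0
        hypotheses.append(h)
        alphas.append(alpha)

    def ensemble(x):
        return 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1
    return ensemble
```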
Chapter 7: Classification
Introduction
  Classification problem, evaluation of classifiers
Bayesian Classifiers
  Optimal Bayes classifier, naive Bayes classifier, applications
Nearest Neighbor Classifier
  Basic notions, choice of parameters, applications
Decision Tree Classifiers
  Basic notions, split strategies, overfitting, pruning of decision trees
Scalability to Large Databases
  SLIQ, SPRINT, RainForest
Further Approaches to Classification
  Neural networks, genetic algorithms, rough set approach, fuzzy set approaches, support vector machines, prediction
Neural Networks
Advantages
  prediction accuracy is generally high
  robust, works when training examples contain errors
  output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  fast evaluation of the learned target function
Criticism
  long training time
  difficult to understand the learned function (weights), no explicit knowledge generated
  not easy to incorporate domain knowledge
A Neuron
The n-dimensional input vector x = (x1, x2, ..., xn) is mapped into the variable y by means of the scalar product and a nonlinear function mapping.

[Figure: a neuron with input vector x = (x1, ..., xn), weight vector w = (w1, ..., wn), bias θ_k, weighted sum, activation function f, and output y]
Network Training
The ultimate objective of training
  obtain a set of weights that classifies almost all the tuples in the training data correctly
Steps
  Initialize the weights with random values
  Feed the input tuples into the network one by one
  For each unit
    Compute the net input to the unit as a linear combination of all the inputs to the unit
    Compute the output value using the activation function
    Compute the error
    Update the weights and the bias
Multi-Layer Perceptron
[Figure: feed-forward network with input nodes (input vector x_i), hidden nodes, and output nodes (output vector); w_ij denotes the weight on the connection from unit i to unit j]

Net input of unit j:          I_j = Σ_i w_ij O_i + θ_j
Output of unit j (sigmoid):   O_j = 1 / (1 + e^(−I_j))
Error of an output unit j:    Err_j = O_j (1 − O_j) (T_j − O_j)
Error of a hidden unit j:     Err_j = O_j (1 − O_j) Σ_k Err_k w_jk
Weight update:                w_ij = w_ij + (l) Err_j O_i
Bias update:                  θ_j = θ_j + (l) Err_j
(l is the learning rate, T_j the target value of output unit j)
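A small Python sketch of these formulas for a single unit (forward pass, error terms, and the weight/bias update with learning rate l); the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_unit(inputs, weights, theta):
    """O_j = sigmoid(I_j) with I_j = sum_i w_ij * O_i + theta_j."""
    net = sum(w * o for w, o in zip(weights, inputs)) + theta
    return sigmoid(net)

def output_error(o_j, t_j):
    """Err_j = O_j (1 - O_j)(T_j - O_j) for an output unit."""
    return o_j * (1 - o_j) * (t_j - o_j)

def hidden_error(o_j, downstream_errs, downstream_weights):
    """Err_j = O_j (1 - O_j) * sum_k Err_k * w_jk for a hidden unit."""
    return o_j * (1 - o_j) * sum(e * w for e, w in zip(downstream_errs, downstream_weights))

def update_unit(inputs, weights, theta, err_j, l=0.1):
    """w_ij += l * Err_j * O_i and theta_j += l * Err_j."""
    new_weights = [w + l * err_j * o for w, o in zip(weights, inputs)]
    return new_weights, theta + l * err_j
```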
Network Pruning and Rule Extraction
Network pruning
  A fully connected network is hard to articulate
  N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights
  Pruning: remove some of the links without affecting the classification accuracy of the network
Extracting rules from a trained network
  Discretize the activation values; replace each individual activation value by the cluster average, maintaining the network accuracy
  Enumerate the output from the discretized activation values to find rules between activation value and output
  Find the relationship between the input and the activation values
  Combine the above two to obtain rules relating the output to the input
Genetic Algorithms
GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created, consisting of randomly generated rules
  e.g., "IF A1 AND NOT A2 THEN C2" can be encoded as 100
Based on the evolutionary notion of survival of the fittest, a new population is formed that consists of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
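A minimal sketch of these genetic operators on bit-string encoded rules (one-point crossover, bit-flip mutation, survival of the fittest). The fitness argument is an assumed callback that returns a rule's classification accuracy on the training examples; all names are illustrative.

```python
import random

def crossover(parent1, parent2):
    """One-point crossover of two bit-string encoded rules."""
    point = random.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(rule, p=0.01):
    """Flip each bit with small probability p."""
    return "".join(b if random.random() > p else ("1" if b == "0" else "0")
                   for b in rule)

def next_generation(population, fitness, keep=0.5):
    """Keep the fittest rules and fill the population up with their offspring."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:max(2, int(keep * len(population)))]
    children = []
    while len(survivors) + len(children) < len(population):
        p1, p2 = random.sample(survivors, 2)
        c1, c2 = crossover(p1, p2)
        children += [mutate(c1), mutate(c2)]
    return survivors + children[:len(population) - len(survivors)]
```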
Rough Set Approach
Rough sets are used to approximately or "roughly" define equivalence classes
A rough set for a given class C is approximated by two sets: a lower approximation (certainly in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computation intensity
Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as a fuzzy membership graph)
Attribute values are converted to fuzzy values
  e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed
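A small sketch of this voting scheme with a hypothetical triangular fuzzy partition of the attribute income; the membership functions, their breakpoints, and the rules are illustrative, not taken from the slides.

```python
def triangular(x, a, b, c):
    """Triangular membership function peaking at b on the interval [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify_income(income):
    """Map a crisp income value to fuzzy memberships in {low, medium, high}."""
    return {
        "low":    triangular(income, -1, 0, 40),
        "medium": triangular(income, 20, 50, 80),
        "high":   triangular(income, 60, 100, 1000),
    }

def classify(income, rules):
    """Sum the truth values contributed by every applicable rule per class."""
    memberships = fuzzify_income(income)
    votes = {}
    for category, predicted_class in rules:        # e.g. ("high", "G")
        votes[predicted_class] = votes.get(predicted_class, 0.0) + memberships[category]
    return max(votes, key=votes.get)

print(classify(55, [("low", "B"), ("medium", "B"), ("high", "G")]))  # -> "B"
```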
Motivation: Linear Separation

Support Vector Machines (SVM)
  Vectors in R^d represent objects
  Objects belong to exactly one of two respective classes
  For the sake of simpler formulas, the class labels used are y = −1 and y = +1
Classification by linear separation: determine a separating hyperplane which separates both vector sets with maximal stability
Assign unknown elements to the halfspace in which they reside

Acknowledgements: Prof. Dr. Hans-Peter Kriegel and Matthias Schubert (LMU Munich) and Dr. Thorsten Joachims (U Dortmund and Cornell U)
Support Vector Machines
Problems of linear separation
  Definition and efficient determination of the maximum stable hyperplane
  Classes are not always linearly separable
  Computation of selected hyperplanes is very expensive
  Restriction to two classes
Approach to solve these problems
Support Vector Machines (SVMs) [Vapnik 1979, 1995]
Maximum Margin Hyperplane
Observation: there is no unique hyperplane to separate p1 from p2
Question: which hyperplane separates the classes best?
Criteria
  Stability at insertion
  Distance to the objects of both classes
Support Vector Machines: Principle
Basic idea: linear separation with the Maximum Margin Hyperplane (MMH)
  Distance to points from any of the two sets is maximal, i.e. at least the margin
  Minimal probability that the separating hyperplane has to be moved due to an insertion
  Best generalization behaviour
MMH is maximally stable
MMH only depends on the points p_i whose distance to the hyperplane is exactly the margin; such a p_i is called a support vector

[Figure: maximum margin hyperplane between the two point sets p1 and p2, with the margin and the support vectors marked]
Maximum Margin Hyperplane
Recall some algebraic notions for a feature space FS

Inner product of two vectors x, y ∈ FS: ⟨x, y⟩
  e.g., the canonical scalar product  ⟨x, y⟩ = Σ_{i=1}^{d} x_i y_i
Hyperplane H(w, b) with normal vector w and value b:
  H(w, b) = { x ∈ FS | ⟨w, x⟩ + b = 0 }
Distance of a vector x to the hyperplane H(w, b):
  dist(x, H(w, b)) = (⟨w, x⟩ + b) / √⟨w, w⟩
Computation of the Maximum Margin Hyperplane

Two assumptions for classifying x_i (class 1: y_i = +1, class 2: y_i = −1):
1) The classification error is zero: y_i (⟨w, x_i⟩ + b) > 0 for all i
2) The margin is maximal

Let δ denote the minimum distance of any training object x_i to the hyperplane H(w, b):
  δ = min_i y_i · dist(x_i, H(w, b))
Then: maximize δ subject to  y_i · dist(x_i, H(w, b)) ≥ δ  for all i ∈ [1..n]
Dual Optimization Problem
For computational purposes, transform the primal optimization problem into a dual one by using Lagrange multipliers
For the solution, use algorithms from optimization theory
Up to now: only linearly separable data
If the data is not linearly separable: Soft Margin Optimization

Dual optimization problem: find parameters α_i that

  maximize  L(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩

  subject to  Σ_{i=1}^{n} α_i y_i = 0  and  α_i ≥ 0
Soft Margin Optimization
Problem of Maximum Margin Optimization: how to treat non-linearly separable data?
Trade-off between training error and size of the margin
Two typical problems:
  single data points are not separable
  complete separation is not optimal
Soft Margin Optimization
Additionally regard the number of training errors when optimizing:
  ξ_i is the distance from p_i to the margin (often called slack variable)
  C controls the influence of single training vectors

Primary optimization problem with soft margin:
  Find a w that minimizes  ½ ⟨w, w⟩ + C Σ_{i=1}^{n} ξ_i
  subject to  ∀ i ∈ [1..n]:  y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0
Soft Margin Optimization
Dual optimization problem with Lagrange multipliers:

  Dual OP: maximize  L(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩
           subject to  Σ_{i=1}^{n} α_i y_i = 0  and  0 ≤ α_i ≤ C

  0 < α_i < C:  p_i is a support vector with ξ_i = 0
  α_i = C:      p_i is a support vector with ξ_i > 0
  α_i = 0:      p_i is not a support vector

Decision rule:
  h(x) = sign( Σ_{x_i ∈ SV} α_i y_i ⟨x_i, x⟩ + b )
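A minimal sketch of evaluating this decision rule, given the support vectors, Lagrange multipliers α_i, labels y_i, and offset b from some already trained SVM; the toy values are illustrative.

```python
def svm_decide(x, support_vectors, alphas, labels, b):
    """Decision rule h(x) = sign( sum_i alpha_i * y_i * <x_i, x> + b )."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    value = sum(a * y * dot(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if value >= 0 else -1

# Toy example: two support vectors on either side of the hyperplane x1 = 0
print(svm_decide((2.0, 1.0), [(1.0, 0.0), (-1.0, 0.0)], [0.5, 0.5], [+1, -1], 0.0))
```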
Kernel Machines: Non-Linearly Separable Data Sets

Problem: for real data sets, a linear separation with a high classification accuracy often is not possible
Idea: transform the data non-linearly into a new space, and try to separate the data in the new space linearly (extension of the hypotheses space)
Example of a quadratically separable data set
Kernel Machines: Extension of the Hypotheses Space

Principle
  Try to separate the data linearly in the extended feature space
Example
  input space (x, y, z)  →  extended feature space (x, y, z, x², xy, xz, y², yz, z²)
  Here: a hyperplane in the extended feature space is a polynomial of degree 2 in the input space
Kernel Machines: Example
Input space (2 attributes):  x = (x1, x2)
Extended space (6 attributes):  φ(x) = (x1², x2², √2·x1·x2, √2·x1, √2·x2, 1)
Kernel Machines: Example (2)
Input space (2 attributes):  x = (x1, x2)
Extended space (3 attributes):  φ(x) = (x1², x2², √2·x1·x2)

[Figure: the data set in the input space (x1, x2) and its image in the extended space, where it becomes linearly separable]
Kernel Machines
Introduction of a kernel corresponds to a feature transformation
  φ: FS_old → FS_new
The feature transform only affects the scalar product of training vectors
Kernel K is a function:  K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩

Dual optimization problem:
  maximize  L(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j ⟨φ(x_i), φ(x_j)⟩
  subject to  Σ_{i=1}^{n} α_i y_i = 0  and  0 ≤ α_i ≤ C
Kernel Machines: Examples
Radial basis kernel:  K(x, y) = exp(−‖x − y‖² / γ)
Polynomial kernel (degree d):  K(x, y) = (⟨x, y⟩ + 1)^d
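A small Python sketch of both kernels, together with the explicit degree-2 feature transform from the earlier example; for the polynomial kernel the sketch checks that K(x, y) equals the inner product in the extended feature space. Parameter names such as gamma and d are illustrative.

```python
import math

def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (<x, y> + 1)^d."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + 1) ** d

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis kernel K(x, y) = exp(-||x - y||^2 / gamma)."""
    return math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / gamma)

def phi(x):
    """Explicit degree-2 feature transform for a 2-D input (see the example above)."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(explicit, poly_kernel(x, y))   # both equal (<x, y> + 1)^2 = 25.0
```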
Support Vector Machines: Discussion
+ generates classifiers with a high classification accuracy
+ relatively weak tendency to overfitting (generalization theory)
+ efficient classification of new objects
+ compact models
− training times may be long (the appropriate feature space may be very high-dimensional)
− expensive implementation
− resulting models rarely provide an intuition
What Is Prediction?
Prediction is similar to classification
  First, construct a model
  Second, use the model to predict unknown values
Major method for prediction is regression
  Linear and multiple regression
  Non-linear regression
Prediction is different from classification
  Classification predicts categorical class labels
  Prediction models continuous-valued functions
Predictive Modeling in Databases

Predictive modeling: predict data values or construct generalized linear models based on the database data
  One can only predict value ranges or category distributions
Method outline:
  Minimal generalization
  Attribute relevance analysis
  Generalized linear model construction
  Prediction
Determine the major factors which influence the prediction
  Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction

Linear regression: Y = α + β X
  The two parameters α and β specify the line and are to be estimated from the data at hand,
  using the least squares criterion on the known values Y1, Y2, ..., X1, X2, ...
Multiple regression: Y = b0 + b1 X1 + b2 X2
  Many nonlinear functions can be transformed into the above
Log-linear models:
  The multi-way table of joint probabilities is approximated by a product of lower-order tables
  Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
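A minimal sketch of estimating α and β with the least squares criterion for a single predictor variable; the sample data is made up for illustration.

```python
def linear_regression(xs, ys):
    """Least-squares estimates of alpha and beta in Y = alpha + beta * X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
           sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

alpha, beta = linear_regression([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(alpha, beta)   # roughly 0.15 and 1.94
```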
Locally Weighted Regression
Construct an explicit approximation to f over a local region surrounding the query instance x_q
Locally weighted linear regression:
  The target function f is approximated near x_q using the linear function:
    f̂(x) = w_0 + w_1 a_1(x) + ... + w_n a_n(x)
  minimize the squared error with a distance-decreasing weight K:
    E(x_q) = ½ Σ_{x ∈ k_nearest_neighbors(x_q)} (f(x) − f̂(x))² K(d(x_q, x))
  the gradient descent training rule:
    Δw_j = η Σ_{x ∈ k_nearest_neighbors(x_q)} K(d(x_q, x)) (f(x) − f̂(x)) a_j(x)
In most cases, the target function is approximated by a constant, linear, or quadratic function.
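A small sketch of locally weighted linear regression for one-dimensional inputs, following the weighted squared error and gradient descent rule above; the kernel, learning rate, number of neighbors, and sample data are illustrative choices, not prescribed by the slides.

```python
import math

def lwr_predict(query, data, k=5, kernel_width=1.0, lr=0.05, epochs=500):
    """Fit f^(x) = w0 + w1 * (x - query) on the k nearest neighbors of the query.

    data: list of (x, f(x)) pairs. Each neighbor is weighted by the
    distance-decreasing kernel K(d) = exp(-d^2 / kernel_width), and the
    weights w0, w1 are trained by gradient descent on the weighted squared error.
    """
    neighbors = sorted(data, key=lambda p: abs(p[0] - query))[:k]
    w0 = w1 = 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, fx in neighbors:
            kd = math.exp(-((x - query) ** 2) / kernel_width)
            err = fx - (w0 + w1 * (x - query))
            g0 += kd * err                    # delta w_j ~ sum K(d) (f - f^) a_j
            g1 += kd * err * (x - query)
        w0, w1 = w0 + lr * g0, w1 + lr * g1
    return w0                                 # prediction at the query point

data = [(x, 2 * x + 1) for x in range(10)]
print(lwr_predict(4.5, data))                 # approximately 10.0
```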
Prediction: Numerical Data
References (I)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop on Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.

References (II)

J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, August 1996.
R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, September 1996.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.