CES 514 – Data Mining Lecture 8 classification (contd…)


Page 1: CES 514 – Data Mining Lecture 8 classification (contd…)

CES 514 – Data Mining Lecture 8

classification (contd…)

Page 2: CES 514 – Data Mining Lecture 8 classification (contd…)

Example: PEBLS

PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
– Works with both continuous and nominal features. For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)

– Each record is assigned a weight factor

– Number of nearest neighbors: k = 1

Page 3: CES 514 – Data Mining Lecture 8 classification (contd…)

Example: PEBLS

Class counts by Marital Status:

Class   Single   Married   Divorced
Yes        2        0         1
No         2        4         1

Modified value difference metric (MVDM):

d(V_1, V_2) = \sum_{i} \left| \frac{n_{1i}}{n_1} - \frac{n_{2i}}{n_2} \right|

where n_{1i} is the number of records with value V_1 that belong to class i, n_1 is the total number of records with value V_1, and similarly for n_{2i} and n_2.

Distance between nominal attribute values:

d(Single,Married)

= | 2/4 – 0/4 | + | 2/4 – 4/4 | = 1

d(Single,Divorced)

= | 2/4 – 1/2 | + | 2/4 – 1/2 | = 0

d(Married,Divorced)

= | 0/4 – 1/2 | + | 4/4 – 1/2 | = 1

d(Refund=Yes,Refund=No)

= | 0/3 – 3/7 | + | 3/3 – 4/7 | = 6/7

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Class counts by Refund:

Class   Refund=Yes   Refund=No
Yes          0            3
No           3            4

Page 4: CES 514 – Data Mining Lecture 8 classification (contd…)

Example: PEBLS

Distance between record X and record Y:

\Delta(X, Y) = w_X \, w_Y \sum_{i=1}^{d} d(X_i, Y_i)^2

Tid   Refund   Marital Status   Taxable Income   Cheat
X     Yes      Single           125K             No
Y     No       Married          100K             No

where the weight factor is

w_X = \frac{\text{Number of times X is used for prediction}}{\text{Number of times X predicts correctly}}

w_X \approx 1 if X makes accurate predictions most of the time
w_X > 1 if X is not reliable for making predictions
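As a concrete illustration, here is a small Python sketch (not from the slides) that recomputes the MVDM distances above and the PEBLS distance between records X and Y from the ten training records, assuming equal weights w_X = w_Y = 1; all function and variable names are my own.

```python
from collections import defaultdict

# Training records from the slide: (Refund, Marital Status, Cheat).
records = [
    ("Yes", "Single",   "No"),  ("No", "Married",  "No"),
    ("No",  "Single",   "No"),  ("Yes", "Married", "No"),
    ("No",  "Divorced", "Yes"), ("No", "Married",  "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single",   "Yes"),
    ("No",  "Married",  "No"),  ("No", "Single",   "Yes"),
]
classes = ["Yes", "No"]

def mvdm(attr_index, v1, v2):
    """Modified value difference metric between two values of a nominal attribute."""
    counts = defaultdict(lambda: defaultdict(int))
    for rec in records:
        counts[rec[attr_index]][rec[-1]] += 1   # class counts per attribute value
    n1 = sum(counts[v1].values())
    n2 = sum(counts[v2].values())
    return sum(abs(counts[v1][c] / n1 - counts[v2][c] / n2) for c in classes)

def pebls_distance(x, y, w_x=1.0, w_y=1.0):
    """PEBLS distance: weighted sum of squared per-attribute MVDM distances."""
    return w_x * w_y * sum(mvdm(i, x[i], y[i]) ** 2 for i in range(len(x)))

print(mvdm(1, "Single", "Married"))                          # 1.0
print(mvdm(0, "Yes", "No"))                                  # 6/7, about 0.857
print(pebls_distance(("Yes", "Single"), ("No", "Married")))  # (6/7)^2 + 1^2
```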

Page 5: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data

Page 6: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

One Possible Solution

B1

Page 7: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Another possible solution

B2

Page 8: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Other possible solutions

B2

Page 9: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Which one is better? B1 or B2? How do you define better?

B1

B2

Page 10: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Find the hyperplane that maximizes the margin (e.g., B1 is better than B2).

[Figure: decision boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; B1 has the larger margin]

Page 11: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

[Figure: decision boundary B1 with margin hyperplanes b11 and b12]

w \cdot x + b = 0

w \cdot x + b = +1 \qquad w \cdot x + b = -1

f(x) = \begin{cases} 1 & \text{if } w \cdot x + b \ge 1 \\ -1 & \text{if } w \cdot x + b \le -1 \end{cases}

\text{Margin} = \frac{2}{\|w\|^2}

Page 12: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

We want to maximize:

\text{Margin} = \frac{2}{\|w\|^2}

– This is equivalent to minimizing:

L(w) = \frac{\|w\|^2}{2}

– But subject to the following constraints:

f(x_i) = \begin{cases} 1 & \text{if } w \cdot x_i + b \ge 1 \\ -1 & \text{if } w \cdot x_i + b \le -1 \end{cases}

This is a constrained optimization problem.
– Numerical approaches exist to solve it (e.g., quadratic programming); a small sketch follows below.
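For intuition only, here is a minimal sketch (my own illustration, not the solver used in practice) that hands this constrained problem to a general-purpose optimizer, scipy.optimize.minimize with the SLSQP method, on a tiny made-up 2-D dataset; dedicated SVM packages use specialized quadratic-programming routines instead.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable 2-D dataset with labels in {+1, -1} (made up).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as v = [w1, w2, b].
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)            # minimize ||w||^2 / 2

def constraints(v):
    w, b = v[:2], v[2]
    return y * (X @ w + b) - 1.0         # require y_i (w . x_i + b) >= 1

result = minimize(objective, x0=np.zeros(3), method="SLSQP",
                  constraints=[{"type": "ineq", "fun": constraints}])
w, b = result.x[:2], result.x[2]
print("w =", w, "b =", b)
```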

Page 13: CES 514 – Data Mining Lecture 8 classification (contd…)

Overview of optimization

Simplest optimization problem: Maximize f(x) (one variable)

If the function has nice properties (such as being differentiable), we can use calculus to solve the problem: solve the equation f′(x) = 0. Suppose a is a root; if f″(a) < 0, then a is a maximum.

Tricky issues:
• How do we solve the equation f′(x) = 0?
• What if there are many solutions? Each is only a "local" optimum.

Page 14: CES 514 – Data Mining Lecture 8 classification (contd…)

How to solve g(x) = 0

Even polynomial equations are very hard to solve.

Quadratics have a closed-form solution. What about higher degrees?

Numerical techniques (iterative):
• bisection
• secant
• Newton-Raphson, etc. (a sketch follows below)

Challenges:
• choosing an initial guess
• rate of convergence
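A minimal Newton-Raphson sketch in Python (my own illustration; the function and starting point are made up), used here to solve f′(x) = 0 for the one-variable maximization on the previous slide:

```python
def newton_raphson(g, g_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of g(x) = 0 starting from the initial guess x0."""
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            break
        x = x - gx / g_prime(x)      # Newton step: follow the tangent line to zero
    return x

# Example: maximize f(x) = -x**4 + 3*x**2 by finding a root of f'(x) = 0.
fp  = lambda x: -4 * x**3 + 6 * x    # f'(x)
fpp = lambda x: -12 * x**2 + 6       # f''(x)

a = newton_raphson(fp, fpp, x0=1.0)  # converges to sqrt(1.5), about 1.2247
print(a, "is a maximum" if fpp(a) < 0 else "is not a maximum")
```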

Page 15: CES 514 – Data Mining Lecture 8 classification (contd…)

Functions of several variables

Consider a function of two variables, F(x, y).

To find the maximum of F(x,y), we solve the equations

\frac{\partial F}{\partial x} = 0 \quad \text{and} \quad \frac{\partial F}{\partial y} = 0

If we can solve this system of equations, then we have found a local maximum or minimum of F.

We can solve the equations using numerical techniques similar to the one-dimensional case.
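A quick sketch (my own illustration, with a made-up function) using scipy.optimize.fsolve to solve the two-equation system for a critical point:

```python
from scipy.optimize import fsolve

# F(x, y) = -(x - 1)**2 - (y - 2)**2 has its maximum where both partials vanish.
def gradient(v):
    x, y = v
    return [-2.0 * (x - 1.0),    # dF/dx = 0
            -2.0 * (y - 2.0)]    # dF/dy = 0

critical_point = fsolve(gradient, x0=[0.0, 0.0])
print(critical_point)            # approximately [1.0, 2.0]
```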

Page 16: CES 514 – Data Mining Lecture 8 classification (contd…)

When is the solution a maximum or a minimum?

• Hessian:

H(x, y) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \, \partial y} \\[6pt] \dfrac{\partial^2 f}{\partial y \, \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{pmatrix}

• if the Hessian is positive definite in the neighborhood of a, then a is a minimum.

• if the Hessian is negative definite in the neighborhood of a, then a is a maximum.

• if it is neither, then a is a saddle point.
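In practice the definiteness test can be carried out numerically through the eigenvalues of the Hessian; a small sketch with an example function of my own (not from the slides):

```python
import numpy as np

# Hessian of f(x, y) = x**2 + 3*x*y + y**2 (constant, since f is quadratic).
H = np.array([[2.0, 3.0],
              [3.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)        # eigenvalues of the symmetric matrix
if np.all(eigenvalues > 0):
    print("positive definite: minimum")
elif np.all(eigenvalues < 0):
    print("negative definite: maximum")
else:
    print("indefinite: saddle point")      # here the eigenvalues are -1 and 5
```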

Page 17: CES 514 – Data Mining Lecture 8 classification (contd…)

Application - linear regression

Problem: given (x1,y1), … (xn, yn), find the best linear relation between x and y.

Assume y = Ax + B. To find A and B, we minimize

E(A, B) = \sum_{j=1}^{n} (y_j - A x_j - B)^2

Since this is a function of two variables, we can solve it by setting

\frac{\partial E}{\partial A} = 0 \quad \text{and} \quad \frac{\partial E}{\partial B} = 0
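Solving those two equations gives the familiar closed-form least-squares formulas; a short sketch with made-up data points (illustrative only):

```python
import numpy as np

# Illustrative data points (x_j, y_j).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

n = len(x)
# From dE/dA = 0 and dE/dB = 0 (the normal equations):
A = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
B = (np.sum(y) - A * np.sum(x)) / n
print(f"best fit: y = {A:.3f} x + {B:.3f}")
```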

Page 18: CES 514 – Data Mining Lecture 8 classification (contd…)

Constrained optimization

Maximize f(x,y) subject to g(x,y) = c

Using Lagrange multiplier, the problem is formulated as maximizing:

h(x, y, \lambda) = f(x, y) + \lambda \, (g(x, y) - c)

Now, solve the equations:

\frac{\partial h}{\partial x} = 0, \qquad \frac{\partial h}{\partial y} = 0, \qquad \frac{\partial h}{\partial \lambda} = 0
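A small worked example (my own, not from the slides): maximize f(x, y) = xy subject to x + y = c.

\begin{aligned}
h(x, y, \lambda) &= xy + \lambda (x + y - c) \\
\frac{\partial h}{\partial x} &= y + \lambda = 0, \qquad
\frac{\partial h}{\partial y} = x + \lambda = 0, \qquad
\frac{\partial h}{\partial \lambda} = x + y - c = 0 \\
&\Rightarrow\; x = y = \frac{c}{2}, \qquad f_{\max} = \frac{c^2}{4}
\end{aligned}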

Page 19: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines (contd)

What if the problem is not linearly separable?

Page 20: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

What if the problem is not linearly separable?
– Introduce slack variables

Need to minimize:

L(w) = \frac{\|w\|^2}{2} + C \sum_{i=1}^{N} \xi_i^{k}

Subject to:

f(x_i) = \begin{cases} 1 & \text{if } w \cdot x_i + b \ge 1 - \xi_i \\ -1 & \text{if } w \cdot x_i + b \le -1 + \xi_i \end{cases}

Page 21: CES 514 – Data Mining Lecture 8 classification (contd…)

Nonlinear Support Vector Machines

What if decision boundary is not linear?

Page 22: CES 514 – Data Mining Lecture 8 classification (contd…)

Nonlinear Support Vector Machines

Transform data into higher dimensional space
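As a concrete illustration (my own, not from the slides): points labeled by whether they lie inside the unit circle are not linearly separable in the original two dimensions, but an explicit quadratic mapping makes the boundary linear in the transformed space.

```python
import numpy as np

# 2-D points labeled by whether they fall inside the unit circle:
# not linearly separable in (x1, x2).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)

def transform(X):
    """Map (x1, x2) to (x1^2, x2^2, sqrt(2)*x1*x2): a higher-dimensional space."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

Z = transform(X)
# In the transformed space the circular boundary x1^2 + x2^2 = 1 becomes the
# linear hyperplane z1 + z2 = 1, so a linear SVM can now separate the classes.
print(Z.shape)   # (200, 3)
```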

Page 23: CES 514 – Data Mining Lecture 8 classification (contd…)

Artificial Neural Networks (ANN)

X1  X2  X3  Y
 1   0   0  0
 1   0   1  1
 1   1   0  1
 1   1   1  1
 0   0   1  0
 0   1   0  0
 0   1   1  1
 0   0   0  0

[Figure: black box with inputs X1, X2, X3 and output Y]

Output Y is 1 if at least two of the three inputs are equal to 1.

Page 24: CES 514 – Data Mining Lecture 8 classification (contd…)

Artificial Neural Networks (ANN)

[Truth table as on the previous slide]

[Figure: input nodes X1, X2, X3 connected to an output node by links with weights 0.3, 0.3, 0.3; output threshold t = 0.4]

Y = I(0.3\,X_1 + 0.3\,X_2 + 0.3\,X_3 - 0.4 > 0)

\text{where } I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}
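A quick sanity check in Python (my own sketch) that these weights and threshold reproduce the "at least two of the three inputs are 1" function on all eight input combinations:

```python
from itertools import product

def output_node(x1, x2, x3, w=0.3, t=0.4):
    """Fire (output 1) iff the weighted sum of the inputs exceeds the threshold."""
    return int(w * x1 + w * x2 + w * x3 - t > 0)

for x1, x2, x3 in product([0, 1], repeat=3):
    expected = int(x1 + x2 + x3 >= 2)       # Y = 1 iff at least two inputs are 1
    assert output_node(x1, x2, x3) == expected
print("weights (0.3, 0.3, 0.3) with threshold 0.4 reproduce the truth table")
```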

Page 25: CES 514 – Data Mining Lecture 8 classification (contd…)

Artificial Neural Networks (ANN)

Model is an assembly of inter-connected nodes and weighted links

The output node sums up its input values, weighted according to its links

The weighted sum is compared against a threshold t

[Figure: perceptron with input nodes X1, X2, X3, link weights w1, w2, w3, threshold t, and output node Y]

Perceptron model:

Y = I\left(\sum_i w_i X_i - t > 0\right) \quad \text{or} \quad Y = \operatorname{sign}\left(\sum_i w_i X_i - t\right)

Page 26: CES 514 – Data Mining Lecture 8 classification (contd…)

General Structure of ANN

[Figure: neuron i with inputs I1, I2, I3, weights wi1, wi2, wi3, weighted sum Si, activation function g(Si), output Oi, and threshold t]

[Figure: multilayer network with an input layer (x1, ..., x5), a hidden layer, and an output layer producing y]

Training an ANN means learning the weights of the neurons

Page 27: CES 514 – Data Mining Lecture 8 classification (contd…)

Algorithm for learning ANN

Initialize the weights (w0, w1, …, wk)

Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples
– Objective function:

E = \sum_i \left[ Y_i - f(\mathbf{w}, X_i) \right]^2

– Find the weights w_i that minimize the above objective function, e.g., with the backpropagation algorithm
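For intuition, here is a minimal gradient-descent sketch for a single sigmoid neuron on the "at least two of three" data from the earlier slides (my own illustration; backpropagation applies the same chain-rule idea through multiple layers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training data: Y = 1 iff at least two of the three inputs are 1.
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],
              [0,0,1],[0,1,0],[0,1,1],[0,0,0]], dtype=float)
Y = np.array([0, 1, 1, 1, 0, 0, 1, 0], dtype=float)

w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(5000):
    out = sigmoid(X @ w + b)                       # network output f(w, X_i)
    err = Y - out                                  # residuals of E = sum (Y_i - out_i)^2
    grad_w = -2.0 * X.T @ (err * out * (1 - out))  # dE/dw via the chain rule
    grad_b = -2.0 * np.sum(err * out * (1 - out))
    w -= lr * grad_w                               # gradient-descent update
    b -= lr * grad_b

print(np.round(sigmoid(X @ w + b)))                # matches Y after training
```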

Page 28: CES 514 – Data Mining Lecture 8 classification (contd…)

WEKA

Page 29: CES 514 – Data Mining Lecture 8 classification (contd…)

WEKA implementations

WEKA has implementation of all the major data mining algorithms including:

• decision trees (CART, C4.5, etc.)
• naïve Bayes algorithm and its variants
• nearest neighbor classifier
• linear classifiers
• Support Vector Machines
• clustering algorithms
• boosting algorithms, etc.