CES 514 – Data Mining Lecture 8 classification (contd…)


Page 1: CES 514 – Data Mining Lecture 8 classification (contd…)

CES 514 – Data Mining Lecture 8

classification (contd…)

Page 2: CES 514 – Data Mining Lecture 8 classification (contd…)

Example: PEBLS

PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
– Works with both continuous and nominal features. For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)

– Each record is assigned a weight factor

– Number of nearest neighbors: k = 1

Page 3: CES 514 – Data Mining Lecture 8 classification (contd…)

Example: PEBLS

Class counts by Marital Status:

Class   Single   Married   Divorced
Yes        2        0         1
No         2        4         1

Modified value difference metric (MVDM):

d(V_1, V_2) = \sum_{i} \left| \frac{n_{1i}}{n_1} - \frac{n_{2i}}{n_2} \right|

where n_{1i} is the number of records with value V_1 that belong to class i, n_1 is the total number of records with value V_1, and similarly for n_{2i} and n_2.

Distance between nominal attribute values:

d(Single,Married)

= | 2/4 – 0/4 | + | 2/4 – 4/4 | = 1

d(Single,Divorced)

= | 2/4 – 1/2 | + | 2/4 – 1/2 | = 0

d(Married,Divorced)

= | 0/4 – 1/2 | + | 4/4 – 1/2 | = 1

d(Refund=Yes,Refund=No)

= | 0/3 – 3/7 | + | 3/3 – 4/7 | = 6/7

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Class counts by Refund:

Class   Refund=Yes   Refund=No
Yes          0            3
No           3            4

Page 4: CES 514 – Data Mining Lecture 8 classification (contd…)

Example: PEBLS

Distance between record X and record Y:

\Delta(X, Y) = w_X \, w_Y \sum_{i=1}^{d} d(X_i, Y_i)^2

Tid   Refund   Marital Status   Taxable Income   Cheat
X     Yes      Single           125K             No
Y     No       Married          100K             No

where the weight factor is

w_X = \frac{\text{Number of times X is used for prediction}}{\text{Number of times X predicts correctly}}

w_X \approx 1 if X makes accurate predictions most of the time
w_X > 1 if X is not reliable for making predictions
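As a concrete illustration, here is a small Python sketch (not from the slides) that recomputes the MVDM distances above and the PEBLS distance between records X and Y from the ten training records, assuming equal weights w_X = w_Y = 1; all function and variable names are my own.

```python
from collections import defaultdict

# Training records from the slide: (Refund, Marital Status, Cheat).
records = [
    ("Yes", "Single",   "No"),  ("No", "Married",  "No"),
    ("No",  "Single",   "No"),  ("Yes", "Married", "No"),
    ("No",  "Divorced", "Yes"), ("No", "Married",  "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single",   "Yes"),
    ("No",  "Married",  "No"),  ("No", "Single",   "Yes"),
]
classes = ["Yes", "No"]

def mvdm(attr_index, v1, v2):
    """Modified value difference metric between two values of a nominal attribute."""
    counts = defaultdict(lambda: defaultdict(int))
    for rec in records:
        counts[rec[attr_index]][rec[-1]] += 1   # class counts per attribute value
    n1 = sum(counts[v1].values())
    n2 = sum(counts[v2].values())
    return sum(abs(counts[v1][c] / n1 - counts[v2][c] / n2) for c in classes)

def pebls_distance(x, y, w_x=1.0, w_y=1.0):
    """PEBLS distance: weighted sum of squared per-attribute MVDM distances."""
    return w_x * w_y * sum(mvdm(i, x[i], y[i]) ** 2 for i in range(len(x)))

print(mvdm(1, "Single", "Married"))                          # 1.0
print(mvdm(0, "Yes", "No"))                                  # 6/7, about 0.857
print(pebls_distance(("Yes", "Single"), ("No", "Married")))  # (6/7)^2 + 1^2
```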

Page 5: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data

Page 6: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

One Possible Solution

B1

Page 7: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Another possible solution

B2

Page 8: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Other possible solutions

B2

Page 9: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Which one is better? B1 or B2? How do you define better?

B1

B2

Page 10: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

Find the hyperplane that maximizes the margin (e.g., B1 is better than B2).

[Figure: decision boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22; B1 has the larger margin]

Page 11: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

[Figure: decision boundary B1 with margin hyperplanes b11 and b12]

w \cdot x + b = 0

w \cdot x + b = +1 \qquad w \cdot x + b = -1

f(x) = \begin{cases} 1 & \text{if } w \cdot x + b \ge 1 \\ -1 & \text{if } w \cdot x + b \le -1 \end{cases}

\text{Margin} = \frac{2}{\|w\|^2}

Page 12: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

We want to maximize:

\text{Margin} = \frac{2}{\|w\|^2}

– This is equivalent to minimizing:

L(w) = \frac{\|w\|^2}{2}

– But subject to the following constraints:

f(x_i) = \begin{cases} 1 & \text{if } w \cdot x_i + b \ge 1 \\ -1 & \text{if } w \cdot x_i + b \le -1 \end{cases}

This is a constrained optimization problem.
– Numerical approaches exist to solve it (e.g., quadratic programming); a small sketch follows below.
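For intuition only, here is a minimal sketch (my own illustration, not the solver used in practice) that hands this constrained problem to a general-purpose optimizer, scipy.optimize.minimize with the SLSQP method, on a tiny made-up 2-D dataset; dedicated SVM packages use specialized quadratic-programming routines instead.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable 2-D dataset with labels in {+1, -1} (made up).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as v = [w1, w2, b].
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)            # minimize ||w||^2 / 2

def constraints(v):
    w, b = v[:2], v[2]
    return y * (X @ w + b) - 1.0         # require y_i (w . x_i + b) >= 1

result = minimize(objective, x0=np.zeros(3), method="SLSQP",
                  constraints=[{"type": "ineq", "fun": constraints}])
w, b = result.x[:2], result.x[2]
print("w =", w, "b =", b)
```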

Page 13: CES 514 – Data Mining Lecture 8 classification (contd…)

Overview of optimization

Simplest optimization problem: Maximize f(x) (one variable)

If the function has nice properties (such as being differentiable), we can use calculus to solve the problem: solve the equation f′(x) = 0. Suppose a is a root; if f″(a) < 0, then a is a maximum.

Tricky issues:
• How do we solve the equation f′(x) = 0?
• What if there are many solutions? Each is only a "local" optimum.

Page 14: CES 514 – Data Mining Lecture 8 classification (contd…)

How to solve g(x) = 0

Even polynomial equations are very hard to solve.

Quadratics have a closed-form solution. What about higher degrees?

Numerical techniques (iterative):
• bisection
• secant
• Newton-Raphson, etc. (a sketch follows below)

Challenges:
• choosing an initial guess
• rate of convergence
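A minimal Newton-Raphson sketch in Python (my own illustration; the function and starting point are made up), used here to solve f′(x) = 0 for the one-variable maximization on the previous slide:

```python
def newton_raphson(g, g_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of g(x) = 0 starting from the initial guess x0."""
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            break
        x = x - gx / g_prime(x)      # Newton step: follow the tangent line to zero
    return x

# Example: maximize f(x) = -x**4 + 3*x**2 by finding a root of f'(x) = 0.
fp  = lambda x: -4 * x**3 + 6 * x    # f'(x)
fpp = lambda x: -12 * x**2 + 6       # f''(x)

a = newton_raphson(fp, fpp, x0=1.0)  # converges to sqrt(1.5), about 1.2247
print(a, "is a maximum" if fpp(a) < 0 else "is not a maximum")
```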

Page 15: CES 514 – Data Mining Lecture 8 classification (contd…)

Functions of several variables

Consider a function of two variables, F(x, y).

To find the maximum of F(x,y), we solve the equations

\frac{\partial F}{\partial x} = 0 \quad \text{and} \quad \frac{\partial F}{\partial y} = 0

If we can solve this system of equations, then we have found a local maximum or minimum of F.

We can solve the equations using numerical techniques similar to the one-dimensional case.
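A quick sketch (my own illustration, with a made-up function) using scipy.optimize.fsolve to solve the two-equation system for a critical point:

```python
from scipy.optimize import fsolve

# F(x, y) = -(x - 1)**2 - (y - 2)**2 has its maximum where both partials vanish.
def gradient(v):
    x, y = v
    return [-2.0 * (x - 1.0),    # dF/dx = 0
            -2.0 * (y - 2.0)]    # dF/dy = 0

critical_point = fsolve(gradient, x0=[0.0, 0.0])
print(critical_point)            # approximately [1.0, 2.0]
```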

Page 16: CES 514 – Data Mining Lecture 8 classification (contd…)

When is the solution a maximum or a minimum?

• Hessian:

H(x, y) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \, \partial y} \\[6pt] \dfrac{\partial^2 f}{\partial y \, \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{pmatrix}

• if the Hessian is positive definite in the neighborhood of a, then a is a minimum.

• if the Hessian is negative definite in the neighborhood of a, then a is a maximum.

• if it is neither, then a is a saddle point.
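In practice the definiteness test can be carried out numerically through the eigenvalues of the Hessian; a small sketch with an example function of my own (not from the slides):

```python
import numpy as np

# Hessian of f(x, y) = x**2 + 3*x*y + y**2 (constant, since f is quadratic).
H = np.array([[2.0, 3.0],
              [3.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)        # eigenvalues of the symmetric matrix
if np.all(eigenvalues > 0):
    print("positive definite: minimum")
elif np.all(eigenvalues < 0):
    print("negative definite: maximum")
else:
    print("indefinite: saddle point")      # here the eigenvalues are -1 and 5
```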

Page 17: CES 514 – Data Mining Lecture 8 classification (contd…)

Application - linear regression

Problem: given (x1,y1), … (xn, yn), find the best linear relation between x and y.

Assume y = Ax + B. To find A and B, we minimize

E(A, B) = \sum_{j=1}^{n} (y_j - A x_j - B)^2

Since this is a function of two variables, we can solve it by setting

\frac{\partial E}{\partial A} = 0 \quad \text{and} \quad \frac{\partial E}{\partial B} = 0
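Solving those two equations gives the familiar closed-form least-squares formulas; a short sketch with made-up data points (illustrative only):

```python
import numpy as np

# Illustrative data points (x_j, y_j).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

n = len(x)
# From dE/dA = 0 and dE/dB = 0 (the normal equations):
A = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
B = (np.sum(y) - A * np.sum(x)) / n
print(f"best fit: y = {A:.3f} x + {B:.3f}")
```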

Page 18: CES 514 – Data Mining Lecture 8 classification (contd…)

Constrained optimization

Maximize f(x,y) subject to g(x,y) = c

Using Lagrange multiplier, the problem is formulated as maximizing:

h(x, y, \lambda) = f(x, y) + \lambda \, (g(x, y) - c)

Now, solve the equations:

\frac{\partial h}{\partial x} = 0, \qquad \frac{\partial h}{\partial y} = 0, \qquad \frac{\partial h}{\partial \lambda} = 0
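A small worked example (my own, not from the slides): maximize f(x, y) = xy subject to x + y = c.

\begin{aligned}
h(x, y, \lambda) &= xy + \lambda (x + y - c) \\
\frac{\partial h}{\partial x} &= y + \lambda = 0, \qquad
\frac{\partial h}{\partial y} = x + \lambda = 0, \qquad
\frac{\partial h}{\partial \lambda} = x + y - c = 0 \\
&\Rightarrow\; x = y = \frac{c}{2}, \qquad f_{\max} = \frac{c^2}{4}
\end{aligned}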

Page 19: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines (contd)

What if the problem is not linearly separable?

Page 20: CES 514 – Data Mining Lecture 8 classification (contd…)

Support Vector Machines

What if the problem is not linearly separable?
– Introduce slack variables

Need to minimize:

L(w) = \frac{\|w\|^2}{2} + C \sum_{i=1}^{N} \xi_i^{k}

Subject to:

f(x_i) = \begin{cases} 1 & \text{if } w \cdot x_i + b \ge 1 - \xi_i \\ -1 & \text{if } w \cdot x_i + b \le -1 + \xi_i \end{cases}

Page 21: CES 514 – Data Mining Lecture 8 classification (contd…)

Nonlinear Support Vector Machines

What if decision boundary is not linear?

Page 22: CES 514 – Data Mining Lecture 8 classification (contd…)

Nonlinear Support Vector Machines

Transform data into higher dimensional space
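As a concrete illustration (my own, not from the slides): points labeled by whether they lie inside the unit circle are not linearly separable in the original two dimensions, but an explicit quadratic mapping makes the boundary linear in the transformed space.

```python
import numpy as np

# 2-D points labeled by whether they fall inside the unit circle:
# not linearly separable in (x1, x2).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 1.0).astype(int)

def transform(X):
    """Map (x1, x2) to (x1^2, x2^2, sqrt(2)*x1*x2): a higher-dimensional space."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

Z = transform(X)
# In the transformed space the circular boundary x1^2 + x2^2 = 1 becomes the
# linear hyperplane z1 + z2 = 1, so a linear SVM can now separate the classes.
print(Z.shape)   # (200, 3)
```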

Page 23: CES 514 – Data Mining Lecture 8 classification (contd…)

Artificial Neural Networks (ANN)

X1  X2  X3  Y
 1   0   0  0
 1   0   1  1
 1   1   0  1
 1   1   1  1
 0   0   1  0
 0   1   0  0
 0   1   1  1
 0   0   0  0

[Figure: black box with inputs X1, X2, X3 and output Y]

Output Y is 1 if at least two of the three inputs are equal to 1.

Page 24: CES 514 – Data Mining Lecture 8 classification (contd…)

Artificial Neural Networks (ANN)

[Truth table as on the previous slide]

[Figure: input nodes X1, X2, X3 connected to an output node by links with weights 0.3, 0.3, 0.3; output threshold t = 0.4]

Y = I(0.3\,X_1 + 0.3\,X_2 + 0.3\,X_3 - 0.4 > 0)

\text{where } I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}
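A quick sanity check in Python (my own sketch) that these weights and threshold reproduce the "at least two of the three inputs are 1" function on all eight input combinations:

```python
from itertools import product

def output_node(x1, x2, x3, w=0.3, t=0.4):
    """Fire (output 1) iff the weighted sum of the inputs exceeds the threshold."""
    return int(w * x1 + w * x2 + w * x3 - t > 0)

for x1, x2, x3 in product([0, 1], repeat=3):
    expected = int(x1 + x2 + x3 >= 2)       # Y = 1 iff at least two inputs are 1
    assert output_node(x1, x2, x3) == expected
print("weights (0.3, 0.3, 0.3) with threshold 0.4 reproduce the truth table")
```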

Page 25: CES 514 – Data Mining Lecture 8 classification (contd…)

Artificial Neural Networks (ANN)

Model is an assembly of inter-connected nodes and weighted links

The output node sums up its input values, weighted according to its links

The weighted sum is compared against a threshold t

[Figure: perceptron with input nodes X1, X2, X3, link weights w1, w2, w3, threshold t, and output node Y]

Perceptron model:

Y = I\left(\sum_i w_i X_i - t > 0\right) \quad \text{or} \quad Y = \operatorname{sign}\left(\sum_i w_i X_i - t\right)

Page 26: CES 514 – Data Mining Lecture 8 classification (contd…)

General Structure of ANN

[Figure: neuron i with inputs I1, I2, I3, weights wi1, wi2, wi3, weighted sum Si, activation function g(Si), output Oi, and threshold t]

[Figure: multilayer network with an input layer (x1, ..., x5), a hidden layer, and an output layer producing y]

Training an ANN means learning the weights of the neurons

Page 27: CES 514 – Data Mining Lecture 8 classification (contd…)

Algorithm for learning ANN

Initialize the weights (w0, w1, …, wk)

Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples
– Objective function:

E = \sum_i \left[ Y_i - f(\mathbf{w}, X_i) \right]^2

– Find the weights w_i that minimize the above objective function, e.g., with the backpropagation algorithm
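For intuition, here is a minimal gradient-descent sketch for a single sigmoid neuron on the "at least two of three" data from the earlier slides (my own illustration; backpropagation applies the same chain-rule idea through multiple layers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training data: Y = 1 iff at least two of the three inputs are 1.
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],
              [0,0,1],[0,1,0],[0,1,1],[0,0,0]], dtype=float)
Y = np.array([0, 1, 1, 1, 0, 0, 1, 0], dtype=float)

w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(5000):
    out = sigmoid(X @ w + b)                       # network output f(w, X_i)
    err = Y - out                                  # residuals of E = sum (Y_i - out_i)^2
    grad_w = -2.0 * X.T @ (err * out * (1 - out))  # dE/dw via the chain rule
    grad_b = -2.0 * np.sum(err * out * (1 - out))
    w -= lr * grad_w                               # gradient-descent update
    b -= lr * grad_b

print(np.round(sigmoid(X @ w + b)))                # matches Y after training
```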

Page 28: CES 514 – Data Mining Lecture 8 classification (contd…)

WEKA

Page 29: CES 514 – Data Mining Lecture 8 classification (contd…)

WEKA implementations

WEKA has implementation of all the major data mining algorithms including:

• decision trees (CART, C4.5, etc.)
• naïve Bayes algorithm and its variants
• nearest neighbor classifier
• linear classifiers
• Support Vector Machines
• clustering algorithms
• boosting algorithms, etc.