Linear Discriminant Functions: Discriminant Functions, Least Squares Method, Fisher's Linear Discriminant, Probabilistic Generative Models


Page 1: Linear Discriminant Functions

Discriminant Functions

Least Squares Method

Fisher’s Linear Discriminant

Probabilistic Generative Models

Page 2: Linear Discriminant Functions

A discriminant function is a linear combination of the components of x:

g(x) = w^t x + w0

where
• w is the weight vector
• w0 is the bias or threshold weight

For the two-class problem we can use the following decision rule:

Decide c1 if g(x) > 0 and c2 if g(x) < 0.

For the general case we will have one discriminant function for each class.
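
As a minimal sketch of this decision rule (the weight vector, bias, and sample point below are illustrative values, not taken from the slides):

    import numpy as np

    def g(x, w, w0):
        """Linear discriminant g(x) = w^t x + w0."""
        return np.dot(w, x) + w0

    # Illustrative parameters
    w = np.array([1.0, -2.0])   # weight vector
    w0 = 0.5                    # bias / threshold weight

    x = np.array([3.0, 1.0])
    label = "c1" if g(x, w, w0) > 0 else "c2"   # decide c1 if g(x) > 0, c2 if g(x) < 0
    print(label)                                # -> c1  (g(x) = 3 - 2 + 0.5 = 1.5)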

Page 3: Figure 5.1

Page 4: The Normal Vector w

The hyperplane H divides the feature space into two regions: Region R1 for class c1 and Region R2 for class c2.

For two points x1 and x2 on the decision boundary:

w^t x1 + w0 = w^t x2 + w0

which means

w^t (x1 – x2) = 0

Thus w is normal to any vector in the hyperplane.
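
A small numerical check of this fact (illustrative numbers only; the observation that g(x)/||w|| is the signed distance from x to H is a standard companion result, not stated on this slide):

    import numpy as np

    # Illustrative hyperplane: w^t x + w0 = 0
    w = np.array([2.0, 1.0])
    w0 = -3.0

    # Two points lying on the decision boundary H
    x1 = np.array([1.5, 0.0])
    x2 = np.array([0.0, 3.0])
    print(np.dot(w, x1) + w0, np.dot(w, x2) + w0)   # 0.0 0.0 -> both on H
    print(np.dot(w, x1 - x2))                       # 0.0 -> w is normal to x1 - x2

    # Signed distance from an arbitrary x to H
    x = np.array([4.0, 4.0])
    print((np.dot(w, x) + w0) / np.linalg.norm(w))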

Page 5: Geometry for Linear Models

Page 6: The Problem with Multiple Classes

How do we use a linear discriminant when we have more than two classes?

There are two approaches:

1. Learn one discriminant function for each class
2. Learn one discriminant function for each pair of classes

If c is the number of classes, in the first case we have c functions and in the second case we have c(c-1)/2 functions (e.g., 4 and 6 functions for c = 4).

In both cases we are left with ambiguous regions.

Page 7: Figure 5.3

Page 8: Linear Machines

To avoid the problem of ambiguous regions we can use linear machines:

We define c linear discriminant functions and choose the one with the highest value for a given x.

gk(x) = wk^t x + wk0,   k = 1, …, c

In this case the decision regions are convex and thus limited in flexibility and accuracy.
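
A minimal linear-machine sketch (weights are illustrative, not from the slides): stack the c weight vectors as rows of a matrix, evaluate every gk(x), and pick the class with the largest value.

    import numpy as np

    # Illustrative parameters for c = 3 classes in d = 2 dimensions
    W = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])        # row k holds wk
    w0 = np.array([0.0, -0.5, 0.3])     # bias wk0 of each class

    def classify(x):
        scores = W @ x + w0             # gk(x) = wk^t x + wk0,  k = 1, ..., c
        return int(np.argmax(scores))   # choose the class with the highest gk(x)

    print(classify(np.array([2.0, 0.5])))   # -> 0 (class c1)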

Page 9: Figure 5.4

Page 10: Generalized Linear Discriminant Functions

A linear discriminant function g(x) can be written as:

g(x) = w0 + Σi wi xi,   i = 1, …, d   (d is the number of features).

We could add additional terms to obtain a quadratic discriminant function:

g(x) = w0 + Σi wi xi + Σi Σj wij xi xj

The quadratic discriminant function introduces d(d+1)/2 additional coefficients wij for the products xi xj (counting the squared terms and taking wij = wji). The resulting decision surfaces are more complicated (hyperquadric surfaces).
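
As an illustrative sketch (not from the slides), the quadratic discriminant can be evaluated by expanding x into its linear and product terms and then applying an ordinary linear discriminant in the expanded space:

    import numpy as np
    from itertools import combinations_with_replacement

    def quadratic_features(x):
        """Map x = (x1, ..., xd) to (x1, ..., xd, x1*x1, x1*x2, ..., xd*xd)."""
        d = len(x)
        prods = [x[i] * x[j] for i, j in combinations_with_replacement(range(d), 2)]
        return np.concatenate([x, prods])

    x = np.array([2.0, 3.0])
    print(quadratic_features(x))    # [2. 3. 4. 6. 9.]  -- d + d(d+1)/2 = 5 terms
    # The quadratic discriminant is then g(x) = w0 + w^t quadratic_features(x),
    # i.e. it is linear in the expanded feature vector.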

Page 11: Generalized Linear Discriminant Functions

We could even add more terms wijk xi xj xk and obtain the class of polynomial discriminant functions. The generalized form is

g(x) = Σi wi yi(x)

g(x) = w^t y

where the summation goes over all functions yi(x). The yi(x) are called the phi (φ) functions. The discriminant is now linear in the yi(x).

The φ functions map the d-dimensional x-space into a d'-dimensional y-space.

Example: g(x) = w1 + w2 x + w3 x^2, with y = (1, x, x^2)^t
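
A short sketch of exactly this example (the weights are chosen arbitrarily for illustration): map each scalar x to y = (1, x, x^2) and evaluate g as a dot product, which is linear in y even though it is quadratic in x.

    import numpy as np

    def phi(x):
        """Map a scalar x to the feature vector y = (1, x, x^2)."""
        return np.array([1.0, x, x ** 2])

    w = np.array([-1.0, 0.5, 2.0])      # (w1, w2, w3), illustrative values

    for x in (-2.0, 0.0, 1.0):
        g = np.dot(w, phi(x))           # g(x) = w1 + w2*x + w3*x^2
        print(x, g, "c1" if g > 0 else "c2")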

Page 12: Figure 5.5

Page 13: Mapping to Another Space

Mapping from x to y:

If x follows a certain probability distribution, the induced distribution in the new y-space will be degenerate: it is concentrated on the lower-dimensional surface traced out by the mapping from x to y.

Even with simple functions for y, the decision surfaces in x can be quite complicated.

With a larger space we have more degrees of freedom (parameters to specify). Thus, we need larger samples.

Page 14: Figure 5.6

Page 15: Linear Discriminant Functions

Discriminant Functions

Least Squares Method

Fisher’s Linear Discriminant

Probabilistic Generative Models

Page 16: Least Squares

And how do we compute y(x)? How do we find the values of w0, w1, w2, …, wd?

We can simply find the w that minimizes an error function E(w):

E(w) = ½ Σn (g(xn, w) – tn)^2   (the sum runs over the training examples)

Problems: least squares lacks robustness (it is highly sensitive to outliers), and it implicitly assumes the target values are Gaussian-distributed, which is a poor model for binary class labels.
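
A minimal least-squares sketch on synthetic data (illustrative only): append a constant 1 to each input so that w0 is learned along with the other weights, and solve the resulting linear least-squares problem with np.linalg.lstsq.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic two-class data with targets t = +1 / -1 (illustrative)
    X = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(50, 2)),
                   rng.normal([0.0, 0.0], 0.5, size=(50, 2))])
    t = np.concatenate([np.ones(50), -np.ones(50)])

    Xa = np.hstack([X, np.ones((len(X), 1))])    # augment with 1 for the bias w0
    w, *_ = np.linalg.lstsq(Xa, t, rcond=None)   # minimizes ½ Σn (w^t xn - tn)^2

    g = Xa @ w                                   # decide c1 if g(x) > 0, else c2
    print((np.sign(g) == t).mean())              # training accuracy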

Page 17: Least Squares

Least squares vs. logistic regression (comparison figure)

Page 18: Least Squares

Least squares vs. logistic regression (comparison figure)

Page 19: Carl Friedrich Gauss (German, 1777–1855)

Carl F. Gauss is known as the scientist who developed the idea of the least squares method.

He came up with this idea at the early age of eighteen!

He is considered one of the greatest mathematicians of all time.

He made major discoveries in geometry, number theory, magnetism, and astronomy, among other fields.

Anecdote: as a schoolboy he quickly solved a problem posed by his teacher (summing all the integers from 1 to 100).

Page 20: Linear Discriminant Functions

Discriminant Functions

Least Squares Method

Fisher’s Linear Discriminant

Probabilistic Generative Models

Page 21: Fisher's Linear Discriminant

The idea is to project the data onto a single dimension.

We choose a projection that maximizes class separation,

and minimizes the variance within each class.

We find the w that maximizes a criterion function J(w):

J(w) = (m2 – m1)^2 / (s1^2 + s2^2)

J(w) = (w^t SB w) / (w^t SW w)

where m1 and m2 are the projected class means, s1^2 and s2^2 are the within-class variances of the projected data, SB is the between-class covariance matrix, and SW is the within-class covariance matrix.

Page 22: Fisher's Linear Discriminant

SB = (m2 – m1) (m2 – m1)^t

SW = Σ_{x in c1} (x – m1)(x – m1)^t + Σ_{x in c2} (x – m2)(x – m2)^t
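
A compact sketch of Fisher's discriminant on synthetic data (illustrative only). It uses the standard closed-form result that the w maximizing J(w) is proportional to SW^{-1}(m2 – m1), which follows from the criterion above even though the slides do not state it explicitly.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic two-class data (illustrative)
    X1 = rng.normal([0.0, 0.0], [1.0, 0.3], size=(100, 2))
    X2 = rng.normal([2.0, 1.0], [1.0, 0.3], size=(100, 2))

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

    # Within-class and between-class scatter matrices
    SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    SB = np.outer(m2 - m1, m2 - m1)

    # Maximizing J(w) = (w^t SB w) / (w^t SW w) gives w proportional to SW^{-1}(m2 - m1)
    w = np.linalg.solve(SW, m2 - m1)
    w /= np.linalg.norm(w)

    z1, z2 = X1 @ w, X2 @ w              # 1-D projections of the two classes
    print(z1.mean(), z2.mean())          # well-separated projected means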

Page 23: Fisher's Linear Discriminant

Figure: comparison of two projection directions, labeled "Wrong" and "Right".

Page 24: Linear Discriminant Functions

Discriminant Functions

Least Squares Method

Fisher’s Linear Discriminant

Probabilistic Generative Models

Page 25: Probabilistic Generative Models

We first compute g(x) = w1x1 + w2x2 + … + wdxd + w0

But what we actually want is the posterior probability P(Ck|x).

To get these conditional probabilities we pass g(x) through a logistic function:

L(g(x)) = 1 / ( 1 + exp(-g(x)) )

And L(g(x)) equals the posterior P(C1|x) when the two class-conditional densities are Gaussian with a shared covariance matrix.
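
A minimal sketch of this final step (the weights below are illustrative): compute g(x) and squash it with the logistic function to obtain a posterior probability for class c1.

    import numpy as np

    def logistic(a):
        """L(a) = 1 / (1 + exp(-a))."""
        return 1.0 / (1.0 + np.exp(-a))

    # Illustrative parameters
    w = np.array([1.2, -0.7, 0.3])
    w0 = -0.5

    x = np.array([1.0, 2.0, 0.5])
    g = np.dot(w, x) + w0            # g(x) = w1*x1 + ... + wd*xd + w0
    p_c1 = logistic(g)               # interpreted as P(c1 | x)
    print(p_c1, 1.0 - p_c1)          # posteriors for c1 and c2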