Linear Models for Classification: Probabilistic Methods
Adapted from Seung-Joon Yi, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/


Page 1

Linear Models for Classification: Probabilistic Methods

Adapted from Seung-Joon Yi
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/

Page 2

Recall, Linear Methods for Classification

Problem definition: given training data {x_n, t_n}, find a linear model y_k(x) for each class to partition the feature space into decision regions.

Deterministic models (discriminant functions):
- Fisher discriminant
- Perceptron

Page 3

Probabilistic Approaches for Classification

Generative models:
- Inference: model p(x|Ck) and p(Ck)
- Decision: obtain p(Ck|x) via Bayes' theorem

Discriminative models:
- Model p(Ck|x) directly
- Use the functional form of the generalized linear model explicitly
- Determine the parameters directly using maximum likelihood

Page 4

Logistic Sigmoid Function

Historically, the logistic function arose from models of population growth. The distribution function of a normal random variable has a similar sigmoidal shape, and if the class-conditional densities are normal, the posteriors become logistic sigmoids of linear functions of x. A simple logistic function may be defined by the formula

$$\sigma(a) = \frac{1}{1 + e^{-a}}.$$
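As an illustration (my sketch, not from the original slides), the sigmoid in NumPy, split by sign so the exponential never overflows:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a)), computed stably."""
    a = np.asarray(a, dtype=float)
    out = np.empty_like(a)
    pos = a >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-a[pos]))   # safe: exponent is non-positive
    exp_a = np.exp(a[~pos])                    # safe: a < 0 on this branch
    out[~pos] = exp_a / (1.0 + exp_a)
    return out

print(sigmoid(np.array([-2.0, 0.0, 2.0])))     # note sigma(-a) = 1 - sigma(a)
```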

Page 5

Posterior probabilities can be formulated by:
- 2-class: a logistic sigmoid acting on a linear function of x
- K-class: a softmax transformation of a linear function of x

The parameters of the densities, as well as the class priors, can then be determined using maximum likelihood.

Page 6

Probabilistic Generative Models: 2-Class

Recall, given p(x|Ck) and p(Ck), the posterior can be expressed by a logistic sigmoid:

$$p(C_1|\mathbf{x}) = \frac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_1)\,p(C_1) + p(\mathbf{x}|C_2)\,p(C_2)} = \frac{1}{1+\exp(-a)} = \sigma(a),$$

$$\text{where}\quad a = \ln\frac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_2)\,p(C_2)}.$$

The quantity a is called the logit.

Page 7

Probabilistic Generative Models: K-Class

The posterior can be expressed by the softmax function (normalized exponential), the multiclass generalisation of the logistic sigmoid:

$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_j p(\mathbf{x}|C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)},$$

$$\text{where}\quad a_k = \ln p(\mathbf{x}|C_k)\,p(C_k).$$
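A small sketch (mine, not from the slides) of the softmax with the usual max-shift for numerical stability; the scores a_k may be any real numbers:

```python
import numpy as np

def softmax(a):
    """Softmax p_k = exp(a_k) / sum_j exp(a_j), shifted by max(a) to avoid overflow."""
    a = np.asarray(a, dtype=float)
    a = a - a.max(axis=-1, keepdims=True)   # exp(a_k - max_j a_j) is always <= 1
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# With a_k = ln p(x|C_k) p(C_k), softmax recovers the normalized posterior:
print(softmax(np.log([0.2, 0.5, 0.3])))     # -> [0.2 0.5 0.3]
```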

Page 8

Probabilistic Generative Models: Gaussian Class Conditionals for 2-Class

Assume both classes share the same covariance matrix Σ:

$$p(\mathbf{x}|C_k) = \frac{1}{(2\pi)^{D/2}}\,\frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{\mathrm T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right\}.$$

Then

$$p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^{\mathrm T}\mathbf{x} + w_0),$$

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^{\mathrm T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^{\mathrm T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}.$$

Note: the quadratic terms in x from the exponents cancel, so the resulting decision boundary is linear in input space. The priors enter only through w0, so changing them shifts the decision boundary in parallel.
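The closed-form w and w0 above translate directly into code. A hypothetical helper (names are mine), assuming known means, a shared covariance, and priors:

```python
import numpy as np

def posterior_params(mu1, mu2, Sigma, prior1, prior2):
    """Return (w, w0) such that p(C1|x) = sigma(w @ x + w0)."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2
          + np.log(prior1 / prior2))
    return w, w0

# Two unit-covariance classes straddling the x2-axis with equal priors:
w, w0 = posterior_params(np.array([1.0, 0.0]), np.array([-1.0, 0.0]),
                         np.eye(2), 0.5, 0.5)
print(w, w0)   # boundary w @ x + w0 = 0 is the line x1 = 0, as expected
```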

Page 9

Probabilistic Generative Models: Gaussian Class Conditionals for K Classes

When the covariance matrix is shared across classes, the decision boundaries are linear:

$$a_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm T}\mathbf{x} + w_{k0},$$

$$\mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k, \qquad w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^{\mathrm T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln p(C_k).$$

When each class-conditional density has its own covariance matrix, a_k becomes a quadratic function of x, giving rise to a quadratic discriminant.

Page 10

Probabilistic Generative Models: Maximum Likelihood Solution (Two Classes)

Given the data set {x_n, t_n}, n = 1, ..., N, with t_n = 1 or 0 (denoting C1 and C2, respectively).

Page 11

Q: Find p(C1) = π and p(C2) = 1 − π, together with the parameters of the class-conditional densities p(x|Ck): μ1, μ2, and Σ.

Page 12

Probabilistic Generative Models: Maximum Likelihood Solution

Let p(C1) = π and p(C2) = 1 − π. The likelihood of the data is then

$$p(\mathbf{t}\,|\,\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \left[\pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})\right]^{t_n} \left[(1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})\right]^{1-t_n}.$$

Page 13

Probabilistic Generative Models: Maximize the Log Likelihood w.r.t. π, μ1, μ2, Σ

$$\pi = \frac{1}{N}\sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2},$$

$$\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n \mathbf{x}_n, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N} (1 - t_n)\,\mathbf{x}_n,$$

$$\boldsymbol{\Sigma} = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2, \qquad \mathbf{S}_k = \frac{1}{N_k}\sum_{n \in C_k} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm T}.$$
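These estimators are one NumPy pass over the data. A minimal sketch under the stated conventions (t_n = 1 for C1, 0 for C2); the function name is mine:

```python
import numpy as np

def fit_ml(X, t):
    """Closed-form ML estimates (pi, mu1, mu2, Sigma) for the shared-covariance model.

    X: (N, D) inputs; t: (N,) numeric array with t_n = 1 for C1 and 0 for C2.
    """
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    d1 = X[t == 1] - mu1                 # centered points of class C1
    d2 = X[t == 0] - mu2                 # centered points of class C2
    Sigma = (d1.T @ d1 + d2.T @ d2) / N  # equals (N1/N) S1 + (N2/N) S2
    return pi, mu1, mu2, Sigma
```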

Page 14

Probabilistic Generative Models: Discrete Features

For discrete feature values x_i ∈ {0, 1} with D inputs, a general distribution would require a table of 2^D entries, growing exponentially with the number of features. The naive Bayes assumption, conditioned on the class Ck, gives

$$p(\mathbf{x}|C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i}\,(1 - \mu_{ki})^{1 - x_i},$$

so that

$$a_k(\mathbf{x}) = \ln p(\mathbf{x}|C_k)\,p(C_k) = \sum_{i=1}^{D}\left\{x_i \ln \mu_{ki} + (1 - x_i)\ln(1 - \mu_{ki})\right\} + \ln p(C_k),$$

which is again linear with respect to the features, as in the continuous case.
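For concreteness, a sketch of this naive-Bayes log score; mu_k holds the per-feature Bernoulli parameters μ_ki for class k (all names are mine):

```python
import numpy as np

def log_score(x, mu_k, prior_k):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k)."""
    x = np.asarray(x, dtype=float)
    return float((x * np.log(mu_k) + (1 - x) * np.log(1 - mu_k)).sum()
                 + np.log(prior_k))

# Linear in the binary features x_i, as claimed:
print(log_score([1, 0, 1], np.array([0.8, 0.3, 0.6]), prior_k=0.5))
```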

Page 15

Bayes Decision Boundaries: 2D (Pattern Classification, Duda et al., p. 42)

Page 16

Bayes Decision Boundaries: 3D (Pattern Classification, Duda et al., p. 43)

Page 17

For both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.

Page 18

Probabilistic Generative Models: Exponential Family

Recall that the Bernoulli, binomial, multinomial, and Gaussian distributions can all be expressed in the general form

$$p(\mathbf{x}|\boldsymbol{\lambda}_k) = h(\mathbf{x})\,g(\boldsymbol{\lambda}_k)\exp\left\{\boldsymbol{\lambda}_k^{\mathrm T}\mathbf{u}(\mathbf{x})\right\},$$

for which the two-class posterior is again p(C1|x) = σ(a).

Page 19

Probabilistic Generative Models: Exponential Family (Cont'd)

Restrict attention to the subclass for which u(x) = x, and introduce a scaling parameter s:

$$p(\mathbf{x}|\boldsymbol{\lambda}_k, s) = \frac{1}{s}\,h\!\left(\frac{1}{s}\mathbf{x}\right) g(\boldsymbol{\lambda}_k) \exp\!\left\{\frac{1}{s}\boldsymbol{\lambda}_k^{\mathrm T}\mathbf{x}\right\}.$$

2 classes: the posterior is a logistic function of a linear function of x,

$$a = \frac{1}{s}(\boldsymbol{\lambda}_1 - \boldsymbol{\lambda}_2)^{\mathrm T}\mathbf{x} + \ln g(\boldsymbol{\lambda}_1) - \ln g(\boldsymbol{\lambda}_2) + \ln p(C_1) - \ln p(C_2).$$

K classes: the posterior is a softmax function, again linear with respect to x,

$$a_k = \frac{1}{s}\boldsymbol{\lambda}_k^{\mathrm T}\mathbf{x} + \ln g(\boldsymbol{\lambda}_k) + \ln p(C_k), \qquad \text{where}\quad p(C_k|\mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}.$$

Page 20

Probabilistic Discriminative Models

Goal: find p(Ck|x) directly, with no inference step. Discriminative training maximizes a likelihood defined through p(Ck|x), and improves prediction performance when p(x|Ck) is poorly estimated.

Page 21

Fixed Basis Functions: Φ(x)

Assume a fixed nonlinear transformation: inputs are transformed using a vector of basis functions Φ(x). The resulting decision boundaries will be linear in the feature space:

$$y(\mathbf{x}) = \mathbf{w}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}).$$

Page 22

Posterior Probability of a Class for the Two-Class Problem

$$p(C_1|\boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^{\mathrm T}\boldsymbol{\phi}).$$

The number of adjustable parameters (M-dimensional feature space, 2 classes):
- Two Gaussian class-conditional densities (generative model): 2M parameters for the means plus M(M+1)/2 parameters for the shared covariance matrix; grows quadratically with M.
- Logistic regression (discriminative model): M parameters for w; grows linearly with M.
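To make the growth rates concrete: with M = 100 features, the generative model needs 2·100 = 200 mean parameters plus 100·101/2 = 5050 covariance parameters (5250 in total, plus the class prior), while logistic regression needs only the 100 components of w.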

Page 23

Determining the Parameters Using Maximum Likelihood

Likelihood function:

$$p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}(1 - y_n)^{1 - t_n}, \qquad y_n = \sigma(\mathbf{w}^{\mathrm T}\boldsymbol{\phi}_n).$$

Taking the negative log likelihood gives the cross-entropy error function:

$$E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^{N}\left\{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\right\}.$$

Recall that the cross-entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q rather than the "true" distribution p.
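A direct sketch of this error function (clipping y away from 0 and 1 is my addition, to keep the logarithms finite):

```python
import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0) at confident predictions
    return float(-(t * np.log(y) + (1 - t) * np.log(1 - y)).sum())

print(cross_entropy(np.array([0.9, 0.2]), np.array([1.0, 0.0])))  # small error
```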

Page 24

The Gradient of the Error Function w.r.t. w

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\,\boldsymbol{\phi}_n.$$

This takes the same form as in linear regression: the error (prediction minus target value) times the basis function.

Page 25

Iterative Reweighted Least Squares

Recall the linear regression models of Chapter 3: the ML solution under a Gaussian noise assumption leads to a closed-form solution, as a consequence of the quadratic dependence of the log likelihood on the parameter w. For the logistic regression model there is no longer a closed-form solution, but the error function is convex and has a unique minimum, so an efficient iterative technique can be used: the Newton-Raphson update to minimize E(w),

$$\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - \mathbf{H}^{-1}\nabla E(\mathbf{w}),$$

where H is the Hessian matrix, containing the second derivatives of E(w).

Page 26

Iterative Reweighted Least Squares (Cont'd)

Case 1, sum-of-squares error function (linear regression): with ∇E = ΦᵀΦw − Φᵀt and H = ΦᵀΦ, the Newton-Raphson update

$$\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - (\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi})^{-1}\left\{\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}\mathbf{w}^{\text{old}} - \boldsymbol{\Phi}^{\mathrm T}\mathbf{t}\right\} = (\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t}$$

recovers the standard least-squares solution in a single step.

Case 2, cross-entropy error function (logistic regression): with ∇E = Φᵀ(y − t) and H = ΦᵀRΦ, where R is diagonal with R_nn = y_n(1 − y_n), the Newton-Raphson update becomes iterative reweighted least squares:

$$\mathbf{w}^{\text{new}} = (\boldsymbol{\Phi}^{\mathrm T}\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm T}\mathbf{R}\mathbf{z}, \qquad \mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{\text{old}} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t}).$$
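Putting the update together, a self-contained IRLS sketch for two-class logistic regression; the small constant added to the weights r_n is my own safeguard against a singular R:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

def irls(Phi, t, n_iter=25, tol=1e-8):
    """Newton-Raphson / IRLS for the cross-entropy error.

    Each step solves (Phi^T R Phi) w_new = Phi^T R z with
    R = diag(y_n (1 - y_n)) and z = Phi w - R^{-1} (y - t).
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = y * (1.0 - y) + 1e-10        # diagonal of R, kept strictly positive
        z = Phi @ w - (y - t) / r        # working targets
        w_new = np.linalg.solve(Phi.T @ (r[:, None] * Phi), Phi.T @ (r * z))
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Noisy linear labels so the classes overlap and the ML solution stays finite:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = ((X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(size=200)) > 0).astype(float)
Phi = np.hstack([np.ones((200, 1)), X])  # bias term plus identity basis
print(irls(Phi, t))                      # roughly proportional to [0.5, 2, -1]
```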

Page 27

Multiclass Logistic Regression

Posterior probability for multiclass classification:

$$p(C_k|\boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \mathbf{w}_k^{\mathrm T}\boldsymbol{\phi}.$$

We can use ML to determine the parameters directly. The likelihood function, using the 1-of-K coding scheme, is

$$p(\mathbf{T}|\mathbf{w}_1, \ldots, \mathbf{w}_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}},$$

and its negative logarithm gives the cross-entropy error function for multiclass classification:

$$E(\mathbf{w}_1, \ldots, \mathbf{w}_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}.$$

Page 28

Multiclass Logistic Regression (Cont'd)

The derivative of the error function is

$$\nabla_{\mathbf{w}_j} E = \sum_{n=1}^{N} (y_{nj} - t_{nj})\,\boldsymbol{\phi}_n,$$

the same form as before: the product of the error times the basis function (a sketch follows below). The Hessian matrix has blocks

$$\nabla_{\mathbf{w}_k}\nabla_{\mathbf{w}_j} E = \sum_{n=1}^{N} y_{nk}(I_{kj} - y_{nj})\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm T},$$

and the IRLS algorithm can again be used for batch processing.
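The sketch referenced above: the softmax-regression gradient in matrix form, with one-hot targets T (all shapes and names are my own conventions):

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)  # stabilize before exponentiating
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def grad_E(W, Phi, T):
    """Stacked gradients dE/dw_j = sum_n (y_nj - t_nj) phi_n.

    W: (M, K) weight columns w_j; Phi: (N, M) design matrix; T: (N, K) one-hot.
    """
    Y = softmax_rows(Phi @ W)             # y_nk = p(C_k | phi_n)
    return Phi.T @ (Y - T)                # (M, K): error times basis function
```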

Page 29

Generalized Linear Models

Recall that for a broad range of class-conditional distributions, described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, this is not the case for all choices of class-conditional density, so it might be worth exploring other types of discriminative probabilistic model.

Page 30

Generalized Linear Model: 2 Classes

For example, a noisy threshold model: for each input we evaluate a_n = wᵀφ_n and set the target according to

$$t_n = \begin{cases} 1 & \text{if } a_n \ge \theta, \\ 0 & \text{otherwise,} \end{cases}$$

where the threshold θ is drawn from a probability density p(θ).

Page 31

Noisy Threshold Model

The corresponding activation function, when θ is drawn from p(θ), is the cumulative distribution function

$$f(a) = \int_{-\infty}^{a} p(\theta)\,\mathrm{d}\theta,$$

illustrated for a p(θ) given by a mixture of Gaussians.

Page 32

Probit Function

For the standard Gaussian p(θ) = N(θ|0, 1), the activation function is the probit function

$$\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta|0,1)\,\mathrm{d}\theta,$$

which has a sigmoidal shape. The generalized linear model based on a probit activation function is known as probit regression.

Page 33

Canonical Link Functions

Recall that the derivative of the error function w.r.t. the parameter w takes the form of the error times the feature vector: ∇E(w) = Σ_n (y_n − t_n) φ_n for the logistic regression model with sigmoid activation function, and ∇_{w_j}E = Σ_n (y_nj − t_nj) φ_n for the model with softmax activation function. This is a general result of assuming a conditional distribution for the target variable from the exponential family, together with a corresponding choice for the activation function known as the canonical link function.

Page 34

Canonical Link Functions (Cont'd)

Consider the exponential family of conditional distributions for the target variable,

$$p(t|\eta, s) = \frac{1}{s}\,h\!\left(\frac{t}{s}\right) g(\eta) \exp\!\left\{\frac{\eta t}{s}\right\}.$$

The log likelihood is

$$\ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N}\left\{\ln g(\eta_n) + \frac{\eta_n t_n}{s}\right\} + \text{const},$$

and its derivative with respect to w takes the form

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t}|\eta, s) = \frac{1}{s}\sum_{n=1}^{N}\left\{t_n - y_n\right\} \psi'(y_n)\, f'(a_n)\,\boldsymbol{\phi}_n,$$

where y = f(a), a = wᵀφ, and η = ψ(y) relates the natural parameter to the conditional mean. Choosing the canonical link function f⁻¹(y) = ψ(y) gives f′(a)ψ′(y) = 1, and then

$$\nabla E(\mathbf{w}) = \frac{1}{s}\sum_{n=1}^{N}\left\{y_n - t_n\right\}\boldsymbol{\phi}_n.$$

Page 35

The Laplace Approximation

Goal: find a Gaussian approximation to a non-Gaussian density, centered on a mode z0 of the distribution. Suppose p(z) = (1/Z) f(z) is non-Gaussian. A Taylor expansion of the logarithm of the target function around the mode z0 gives

$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2, \qquad A = -\left.\frac{\mathrm{d}^2}{\mathrm{d}z^2}\ln f(z)\right|_{z = z_0},$$

and the resulting approximating Gaussian distribution is

$$q(z) = \left(\frac{A}{2\pi}\right)^{1/2} \exp\left\{-\frac{A}{2}(z - z_0)^2\right\}.$$
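A one-dimensional sketch of the procedure, using SciPy to locate the mode and a finite difference for the curvature (both choices are mine, not the slides'):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_1d(log_f, lo, hi, h=1e-4):
    """Fit q(z) = N(z | z0, 1/A): z0 is the mode of f, A = -(ln f)''(z0)."""
    res = minimize_scalar(lambda z: -log_f(z), bounds=(lo, hi), method="bounded")
    z0 = res.x
    # Central finite difference for the second derivative of ln f at the mode.
    A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / h**2
    return z0, A

# The example on the next slide: p(z) proportional to exp(-z^2/2) sigma(20 z + 4).
log_f = lambda z: -0.5 * z**2 + np.log(1.0 / (1.0 + np.exp(-(20.0 * z + 4.0))))
z0, A = laplace_1d(log_f, -2.0, 4.0)
print(z0, A)   # mode and precision of the Gaussian approximation
```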

Page 36

Laplace Approximation for p(z) ∝ exp(−z²/2) σ(20z + 4)

Left: the normalized distribution p(z) in yellow, together with the Laplace approximation centred on the mode z0 of p(z) in red. Right: the negative logarithms of the corresponding curves.

Page 37

Model Comparison and BIC

The Laplace approximation to the normalization constant Z is

$$Z = \int f(\mathbf{z})\,\mathrm{d}\mathbf{z} \simeq f(\mathbf{z}_0)\,\frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}}.$$

This result can be used to obtain an approximation to the model evidence, which plays a central role in Bayesian model comparison. Consider a set of models having parameters {θ_i}; the log of the model evidence can be approximated as

$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\text{MAP}}) + \ln p(\boldsymbol{\theta}_{\text{MAP}}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln |\mathbf{A}|.$$

Further approximation, under some additional assumptions, gives the Bayesian Information Criterion (BIC):

$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\text{MAP}}) - \frac{1}{2} M \ln N.$$

Page 38

Bayesian Logistic Regression

Exact Bayesian inference is intractable, so we use the Laplace approximation. Gaussian prior:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0, \mathbf{S}_0).$$

Posterior:

$$p(\mathbf{w}|\mathbf{t}) \propto p(\mathbf{w})\,p(\mathbf{t}|\mathbf{w}).$$

Log of the posterior:

$$\ln p(\mathbf{w}|\mathbf{t}) = -\frac{1}{2}(\mathbf{w} - \mathbf{m}_0)^{\mathrm T}\mathbf{S}_0^{-1}(\mathbf{w} - \mathbf{m}_0) + \sum_{n=1}^{N}\left\{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\right\} + \text{const}.$$

The Laplace approximation of the posterior distribution is then

$$q(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\mathbf{w}_{\text{MAP}}, \mathbf{S}_N), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \sum_{n=1}^{N} y_n(1 - y_n)\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm T}.$$
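The S_N update is a few lines of NumPy. A sketch assuming w_MAP has already been found (e.g. by a penalized IRLS); the names are mine:

```python
import numpy as np

def laplace_posterior_cov(Phi, w_map, S0_inv):
    """Covariance S_N of q(w) = N(w | w_MAP, S_N), from
    S_N^{-1} = S_0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T."""
    y = 1.0 / (1.0 + np.exp(-(Phi @ w_map)))
    r = y * (1.0 - y)
    SN_inv = S0_inv + Phi.T @ (r[:, None] * Phi)
    return np.linalg.inv(SN_inv)
```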

Page 39

Predictive Distribution

The predictive distribution can be obtained by marginalizing w.r.t. the posterior distribution p(w|t), which is approximated by a Gaussian q(w):

$$p(C_1|\boldsymbol{\phi}, \mathbf{t}) = \int \sigma(\mathbf{w}^{\mathrm T}\boldsymbol{\phi})\,p(\mathbf{w}|\mathbf{t})\,\mathrm{d}\mathbf{w} \simeq \int \sigma(a)\,p(a)\,\mathrm{d}a,$$

where a = wᵀφ and p(a), being a marginal distribution of a Gaussian, is also Gaussian: p(a) = N(a|μ_a, σ_a²) with μ_a = w_MAPᵀφ and σ_a² = φᵀ S_N φ.

Page 40

Predictive Distribution (Cont'd)

The resulting approximation to the predictive distribution is

$$p(C_1|\mathbf{t}) = \int \sigma(a)\,\mathcal{N}(a|\mu_a, \sigma_a^2)\,\mathrm{d}a.$$

To integrate over a, we make use of the close similarity between the logistic sigmoid and the probit function, σ(a) ≈ Φ(λa) with λ² = π/8. Then

$$\int \Phi(\lambda a)\,\mathcal{N}(a|\mu, \sigma^2)\,\mathrm{d}a = \Phi\!\left(\frac{\mu}{(\lambda^{-2} + \sigma^2)^{1/2}}\right),$$

where we define

$$\kappa(\sigma^2) = \left(1 + \pi\sigma^2/8\right)^{-1/2}.$$

Finally we get

$$p(C_1|\mathbf{t}) \simeq \sigma\big(\kappa(\sigma_a^2)\,\mu_a\big).$$
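Collecting the pieces, a sketch of the approximate predictive probability for a new feature vector phi (function and variable names are mine):

```python
import numpy as np

def predictive(phi, w_map, S_N):
    """p(C1 | phi, t) ~= sigma(kappa(sigma_a^2) mu_a), with
    mu_a = w_MAP^T phi, sigma_a^2 = phi^T S_N phi, kappa(s2) = (1 + pi s2/8)^(-1/2)."""
    mu_a = w_map @ phi
    s2_a = phi @ S_N @ phi
    kappa = 1.0 / np.sqrt(1.0 + np.pi * s2_a / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mu_a))

# Larger posterior variance pulls the probability toward 0.5:
phi = np.array([1.0, 2.0])
w_map = np.array([0.5, 1.0])
print(predictive(phi, w_map, 0.01 * np.eye(2)))   # confident prediction
print(predictive(phi, w_map, 5.00 * np.eye(2)))   # closer to 0.5
```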