• This supervised learning technique uses Bayes’ rule but is different in philosophy from the well known work of Aitken, Taroni, et al.

• Bayes’ rule:

  Pr(G_i | x) = Pr(x | G_i) · Pr(G_i) / Pr(x)

• Pr denotes probability; G_i is group i and x is the observed evidence

• The equation means: “How does the probability of an item being a member of group i change, given evidence x?”
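*A minimal numeric sketch of this update in R, with made-up priors and likelihoods (the group names G1/G2 and all values are illustrative, not from the slides):

  # Assumed priors Pr(G_i) and likelihoods Pr(x | G_i) for two groups
  prior      <- c(G1 = 0.5,  G2 = 0.5)
  likelihood <- c(G1 = 0.20, G2 = 0.05)

  # Bayes' rule: posterior = likelihood * prior / Pr(x),
  # where Pr(x) = sum_i Pr(x | G_i) * Pr(G_i)
  posterior <- likelihood * prior / sum(likelihood * prior)
  posterior   # G1 = 0.8, G2 = 0.2: the evidence x shifts belief toward group 1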

Bayesian Discriminant Analysis

• The prior probability Pr(G_i) must be chosen in advance: this can be a problem!

Bayesian Discriminant Analysis

• Bayes’ rule can be turned into a classification rule:

  If Pr(G_1 | x) > Pr(G_2 | x)  =>  choose group 1

*If the priors are both 0.5, the decision boundaries are where the likelihood curves cross
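*A small 1D sketch of this rule in R; the Gaussian means, standard deviations, and priors are illustrative values chosen so the curves cross at x = 1.5:

  # Assumed 1D Gaussian likelihoods for two groups
  mu    <- c(G1 = 0, G2 = 3)
  sigma <- c(G1 = 1, G2 = 1)
  prior <- c(G1 = 0.5, G2 = 0.5)

  classify <- function(x) {
    # Unnormalised posteriors Pr(x | G_i) * Pr(G_i); the normaliser Pr(x)
    # is the same for every group, so it can be ignored when classifying
    post <- setNames(dnorm(x, mean = mu, sd = sigma) * prior, names(prior))
    names(which.max(post))
  }

  classify(0.5)   # "G1"
  classify(2.0)   # "G2": with equal priors the boundary sits where the curves cross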

Bayes-Gaussian Discriminant Analysis

• If the data is multivariate normal and drawn from the same population (i.e. the groups share a common covariance matrix Σ), the decision rule becomes: assign x to the group with the largest “distance” d_i(x),

• with the “distance” defined as:

  d_i(x) = μ_iᵀ Σ⁻¹ x − ½ μ_iᵀ Σ⁻¹ μ_i + ln Pr(G_i)

• where μ_i is the mean of group i and Σ is the pooled covariance matrix (like an average of the group covariance matrices)

• Note that if the data is just 1D this is just an equation for a line: a slope term in x plus an intercept
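*A sketch of this linear rule written directly from the formula above; the simulated two-group data, means, and priors are illustrative, and this is essentially the rule that MASS::lda applies:

  # Two simulated 2D groups sharing one covariance structure
  set.seed(1)
  n  <- 50
  X  <- rbind(cbind(rnorm(n, 0), rnorm(n, 0)),    # group 1 around (0, 0)
              cbind(rnorm(n, 2), rnorm(n, 2)))    # group 2 around (2, 2)
  g  <- rep(c("G1", "G2"), each = n)

  mu    <- lapply(split(as.data.frame(X), g), colMeans)      # group means
  S     <- lapply(split(as.data.frame(X), g), cov)           # group covariances
  Sp    <- ((n - 1) * S$G1 + (n - 1) * S$G2) / (2 * n - 2)   # pooled ("average") covariance
  prior <- c(G1 = 0.5, G2 = 0.5)

  # d_i(x) = mu_i' Sp^-1 x - 1/2 mu_i' Sp^-1 mu_i + ln Pr(G_i)
  d_score <- function(x, m, p) {
    drop(t(m) %*% solve(Sp) %*% x) - 0.5 * drop(t(m) %*% solve(Sp) %*% m) + log(p)
  }
  classify <- function(x) {
    scores <- c(G1 = d_score(x, mu$G1, prior[["G1"]]),
                G2 = d_score(x, mu$G2, prior[["G2"]]))
    names(which.max(scores))
  }

  classify(c(0.2, 0.1))   # expected "G1"
  classify(c(1.8, 2.2))   # expected "G2"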

Bayes-Gaussian Discriminant Analysis

• If the data is multivariate normal but drawn from different populations (each group i keeps its own covariance matrix Σ_i), the decision rule is the same but the “decision distance” becomes:

  d_i(x) = −½ ln|Σ_i| − ½ (x − μ_i)ᵀ Σ_i⁻¹ (x − μ_i) + ln Pr(G_i)

• Note that if the data is just 1D this is an equation for a parabola, a·x² + b·x + c: the new quadratic term appears because each group has its own covariance
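*A short sketch of the quadratic rule using MASS::qda, which estimates a separate covariance matrix per group; the simulated data (one tight group, one diffuse group) is illustrative:

  library(MASS)

  set.seed(2)
  X   <- rbind(cbind(rnorm(60, 0, 0.5), rnorm(60, 0, 0.5)),   # tight group around (0, 0)
               cbind(rnorm(60, 1, 2.0), rnorm(60, 1, 2.0)))   # diffuse group around (1, 1)
  grp <- factor(rep(c("G1", "G2"), each = 60))

  fit_q <- qda(X, grouping = grp)   # quadratic rule: per-group covariance matrices
  fit_l <- lda(X, grouping = grp)   # linear rule: one pooled covariance matrix

  newx <- rbind(c(0, 0), c(1, 4))
  predict(fit_q, newx)$class        # quadratic decision boundary
  predict(fit_l, newx)$class        # linear boundary, for comparison

When the group covariances really are similar the two rules give essentially the same boundary; when they differ (as simulated here) the quadratic rule can separate points the linear one cannot.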

• The “quadratic” version is always called quadratic discriminant analysis, QDA

• The “linear” version is called by a number of names!
  • linear discriminant analysis, LDA
  • some combination of the above with the words Gaussian or classification

• A number of techniques use the name LDA!
  • Important to specify the equations used to tell the difference!

Bayes-Gaussian Discriminant Analysis

• Groups have similar covariance structure: the linear discriminant rule should work well

• Groups have different covariance structure: the quadratic discriminant rule may work better

• This supervised technique is called Linear Discriminant Analysis (LDA) in R
  • Also called Fisher linear discriminant analysis

• CVA is closely related to linear Bayes-Gaussian discriminant analysis

• Works on a principle similar to PCA: look for “interesting” directions in data space
  • CVA: find directions in space which best separate the groups
  • Technically: find directions which maximize the ratio of between-group to within-group variation

Canonical Variate Analysis

• Project on PC1: not necessarily good group separation!

• Project on CV1: good group separation!

• Note: there are #groups − 1 or p CVs, whichever is smaller

• Use the between-group to within-group covariance matrix, W⁻¹B, to find the directions of best group separation (the CVA loadings, A_CV): the loadings are the leading eigenvectors of W⁻¹B
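*A compact sketch of that eigendecomposition on R’s built-in iris data; the symbols W, B, and A_cv follow the slide notation, and everything else (data set, number of CVs kept) is an illustrative choice:

  X   <- as.matrix(iris[, 1:4])
  grp <- iris$Species

  mu_all <- colMeans(X)
  groups <- split(as.data.frame(X), grp)

  # Within-group (pooled) and between-group covariance matrices
  W <- Reduce(`+`, lapply(groups, function(g) (nrow(g) - 1) * cov(g))) /
       (nrow(X) - length(groups))
  B <- Reduce(`+`, lapply(groups, function(g) {
         d <- colMeans(g) - mu_all
         nrow(g) * tcrossprod(d)
       })) / (length(groups) - 1)

  # CVA loadings A_cv: leading eigenvectors of W^-1 B
  # (at most min(#groups - 1, p) meaningful CVs; 3 groups here -> 2 CVs)
  eig  <- eigen(solve(W) %*% B)
  A_cv <- Re(eig$vectors[, 1:2])
  cv_scores <- sweep(X, 2, mu_all) %*% A_cv   # project observations onto CV1 and CV2

MASS::lda finds essentially these same directions (its “LD” loadings), up to scaling and sign.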

• CVA can be used for dimension reduction.

• Caution! These “dimensions” are not at right angles (i.e. not orthogonal)

• CVA plots can thus be distorted from reality

• Always check loading angles!

• Caution! CVA will not work well with very correlated data
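*One way to act on the “check loading angles” advice, continuing from the A_cv matrix in the sketch above (the 180/pi factor just reports the angle in degrees):

  # Angle between the first two CVA loading vectors; 90 degrees would mean orthogonal
  v1 <- A_cv[, 1]
  v2 <- A_cv[, 2]
  cos_angle <- sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
  acos(cos_angle) * 180 / pi   # generally not 90: CVA axes are not orthogonal, so plots can distort distances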

Canonical Variate Analysis

[Figure: 2D CVA of the gasoline data set vs. 2D PCA of the gasoline data set]

• Distance metric used in CVA to assign the group identity of an unknown data point x: the squared Mahalanobis distance to each group mean, d_i²(x) = (x − μ_i)ᵀ W⁻¹ (x − μ_i); assign x to the group with the smallest distance

• If data is Gaussian and group covariance structures are the same then CVA classification is the same as Bayes-Gaussian classification.
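*A sketch of that nearest-group assignment with the stats function mahalanobis(), reusing W and the per-group means from the CVA sketch above; the “unknown” measurement vector is made up:

  # Squared Mahalanobis distance from an unknown point to each group mean, using W
  x_new <- c(6.0, 3.0, 4.5, 1.5)          # illustrative iris-scale measurement
  means <- t(sapply(groups, colMeans))    # one row of means per group
  d2    <- apply(means, 1, function(m) mahalanobis(x_new, center = m, cov = W))
  names(which.min(d2))                    # assign to the nearest group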


• PLS-DA is a supervised discrimination technique and very popular in chemometrics
  • Works well with highly correlated variables (like in spectroscopy)
  • Lots of correlation causes CVA to fail!

• Group labels are coded into a “response matrix” Y

• PLS searches for directions of maximum covariance in X and Y

• Loadings for X can be used like PCA loadings
  • Dimension reduction
  • Loading plots

Partial Least Squares Discriminant Analysis

• PLS-DA theory: find an (approximate) linear relationship between the experimental (explanatory) variables and the group labels (response variables):

  Y = X B + E        (X = experimental variables, Y = group labels, E = “error” or “residuals” matrix)

  X = T Pᵀ + E_X
  Y = U Qᵀ + E_Y     (T, U = PLS scores; P, Q = PLS loadings)

• So, substituting: U Qᵀ = T Pᵀ B + E

*Use these “Y-scores” (U) with a “soft-max” or “Bayes” rule to pick the “most-likely” group label
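*A sketch of this decomposition using plsr from the pls package on dummy-coded group labels; the “spectra” are simulated here, so the data, group names, and number of components are all illustrative:

  library(pls)

  # Simulated spectra: 40 samples x 30 highly correlated variables, two groups
  set.seed(3)
  base <- sin(seq(0, pi, length.out = 30))
  X    <- t(sapply(1:40, function(i) base * (1 + 0.1 * (i > 20)) + rnorm(30, sd = 0.02)))
  grp  <- factor(rep(c("A", "B"), each = 20))

  # Code the group labels into a response matrix Y: one indicator column per group
  Y <- model.matrix(~ grp - 1)

  fit <- plsr(Y ~ X, ncomp = 2)     # estimates the scores T, U and loadings P, Q
  scores(fit)[1:3, ]                # X-scores T: can be plotted like PCA scores
  loadings(fit)[1:5, ]              # X-loadings P: can be used like PCA loadings
  Y_hat <- predict(fit, ncomp = 2)  # fitted Y values, interpreted below to assign group labels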

Partial Least Squares Discriminant Analysis

• How do we solve this for T, P and U?

• Objective: maximize the covariance between the X and Y scores, T and U

• Various procedures to do this:

• Kernel-PLS

• SIMPLS

• NIPALS

• These give close, but slightly different, numerical results

• In R, the relevant functions are:

• plsr (pls package)

• spls (spls package)

• Easiest: plsda (caret package)
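*A minimal usage sketch with caret::plsda, reusing the simulated X and grp from the pls sketch above; the choice of two components is arbitrary:

  library(caret)

  # plsda wraps the dummy-coding of Y and the PLS fit in a single call
  fit <- plsda(X, grp, ncomp = 2)
  predict(fit, X)                    # hard group assignments
  predict(fit, X, type = "prob")     # per-group "probabilities" (soft-max or Bayes post-processing)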

Partial Least Squares Discriminant Analysis

[Figure: 2D PLS of the gasoline data set vs. 2D PCA of the gasoline data set]

• Group assignments of observation vectors are made by interpreting the Y scores

• Typically a “soft-max” function is used
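*A sketch of the soft-max step applied to one row of predicted Y values; the numbers are made up:

  # Soft-max turns a row of predicted Y values into pseudo-probabilities per group
  softmax <- function(y) exp(y) / sum(exp(y))
  y_hat   <- c(A = 0.9, B = 0.2)   # illustrative predicted Y scores for one observation
  p       <- softmax(y_hat)        # A ~ 0.67, B ~ 0.33
  names(which.max(p))              # assign the most likely group: "A"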

Partial Least Squares Discriminant Analysis

[Figure: Y-scores for the observation vectors]
