Principal Component Analysis for Novelty Detection
A journal article submitted to and accepted by Pattern Recognition Letters
Jordan McBain, P.Eng.
Markus Timusk, PhD, P.Eng.
Condition Monitoring
A maintenance technique: maintenance undertaken when some indicator of health is flagged
An advanced technique, employed when a cost-benefit analysis justifies the expense of monitoring equipment
An alternative to run-to-failure maintenance and statistically determined time-based maintenance
Employ pattern recognition to automate diagnosis
An expert system is employed to replicate the technician's maintenance insight
A computer and sensors replace the technician and the screwdriver set atop a vibrating machine; the nature of the vibration is used to discern the machine's state
Pattern Recognition
Equality is an insufficient means of classifying real-world members of a class (noise, variance, etc.)
Real-world signals presumed representative of a class are reduced to representative n-dimensional feature vectors
These are plotted in n-dimensional space
A decision boundary is generated with pattern-recognition techniques and employed as the classification rule
Problems
Choice of features: how representative? Maximize the number of features? The curse of dimensionality argues against it
Imbalance of data
Principal Component Analysis
One technique used to find an "optimal" set of features
Finds the axes of normally distributed data
Select the largest axes and omit the smaller ones to define a new basis
Project the data onto this basis to reduce the dimensionality of the problem space
Each feature is presumed to be normally distributed
The n-dimensional scattering of features is presumed independent
Combined probability: $P(A \cap B) = P(A)\,P(B)$

$$p(x) = \prod_{i=1}^{d} p(x_i) = \prod_{i=1}^{d} \frac{1}{(2\pi)^{1/2}\,\sigma_i}\, e^{-\frac{(x_i - m_i)^2}{2\sigma_i^2}} = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - m)^{T}\Sigma^{-1}(x - m)}$$

where $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ under the independence assumption.
Find the principal components (i.e., the axes of the hyper-ellipsoidal distribution)
Select maximum variance (the largest axes)
Eigenvalue problem: the eigenvectors are the principal components; the eigenvalues give the size of each axis
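To make the recipe concrete, a minimal numpy sketch of PCA by eigendecomposition (the function name and interface are illustrative, not the authors' code):

```python
import numpy as np

def pca_basis(X, k):
    """Principal components via eigendecomposition of the sample covariance."""
    Xc = X - X.mean(axis=0)              # center the data at its mean
    C = Xc.T @ Xc / (len(X) - 1)         # sample covariance matrix
    vals, vecs = np.linalg.eigh(C)       # eigh: symmetric matrix, ascending order
    order = np.argsort(vals)[::-1]       # largest axes (variances) first
    return vecs[:, order[:k]], vals[order[:k]]

# Reduce dimensionality by projecting onto the k largest axes:
# W, sizes = pca_basis(X, k=2)
# Z = (X - X.mean(axis=0)) @ W
```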
Novelty Detection
Deals with the imbalance of data between classes
Fault detection in machinery:
It is easy to collect data representative of the healthy state
It is difficult to collect data representative of faulted states: it is costly to break machinery, operationally unacceptable, poor databases of faults are kept, and we can never capture them all!
Model the healthy data with a decision boundary; if test patterns fall outside, classify them as a fault!
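For illustration, one simple way to realize such a boundary (not necessarily the classifier used in this work) is a Mahalanobis-distance threshold fitted to the healthy data; a minimal sketch:

```python
import numpy as np

def fit_healthy(X):
    """Model the healthy class by its mean and (inverse) covariance."""
    m = X.mean(axis=0)
    C_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return m, C_inv

def is_fault(x, m, C_inv, threshold):
    """Classify a test pattern as a fault if it falls outside the boundary."""
    d2 = (x - m) @ C_inv @ (x - m)       # squared Mahalanobis distance
    return d2 > threshold                # threshold chosen from healthy data
```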
Problem
PCA is best for selecting the subspace that best represents the data
In pattern recognition, we seek to discriminate between classes
The objectives of most feature-reduction techniques are not optimized for novelty detection
Feature Reduction Techniques
Feature Reduction Techniques
Feature selection vs. feature extraction
Selection: choosing small subsets of features that are adequate to describe the classes
E.g., "search": examine all subsets of feature combinations to find the one that maximizes some objective function (classifier error may be employed as the objective function)
The number of subsets explodes exponentially; heuristics can mitigate this
If computationally feasible, selection gives the best results (see the sketch below)
Extraction: computes a small number of new features from the set of old features
E.g., PCA
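A minimal sketch of such an exhaustive search (the helper name and the objective-function interface are assumptions):

```python
from itertools import combinations

def exhaustive_search(features, score):
    """Score every subset of features and keep the best one.
    The number of subsets grows exponentially with len(features)."""
    best, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)            # e.g., negative classifier error
            if s > best_score:
                best, best_score = subset, s
    return best, best_score
```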
Principal Component Analysis
Seeks a subspace in which the data representation error is minimal
Development
For a set of n vectors in d-dimensional space, seek the equation of a hyperplane onto which the data may be projected with minimal representation error
The hyperplane is fixed at the data's mean, m
The hyperplane's orientation is defined by a direction vector, w (the normal definition of a plane)
Derive the error function; the optimization problem is a well-known eigenvalue problem (a sketch of the derivation follows this slide)
The resultant feature space is linear and may not represent non-linear or changing data well; kernel PCA and dynamic PCA address this
These techniques are suitable only for representing data, not for discriminating between classes
Source: Duda, 2000
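A sketch of the derivation, following Duda (2000): write each point as $m + a_k w$, minimize the squared representation error over the coefficients $a_k$ and the direction $w$, and the problem reduces to an eigenvalue problem on the scatter matrix $S = \sum_k (x_k - m)(x_k - m)^{T}$:

$$J(w) = \sum_{k=1}^{n} \left\| (m + a_k w) - x_k \right\|^2,\qquad a_k = w^{T}(x_k - m) \;\Longrightarrow\; \min_{\|w\|=1} J(w) \equiv \max_{\|w\|=1} w^{T} S\, w \;\Longrightarrow\; S w = \lambda w$$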
Multiple Discriminant Analysis
Seeks subspaces that are efficient for discrimination rather than representation
Development
Two-class problem with a d-dimensional set of n vectors grouped into D1 and D2
The vectors are projected onto some direction vector w to give scalar samples $y = w^{T}x$
These are consequently grouped into subsets Y1 and Y2
Find the direction vector w such that the distance between the projected sample means m1 and m2 is maximized
Rationalize the distance against the relative scatter of the samples
Reduces to the criterion $J(w) = \dfrac{w^{T} S_B\, w}{w^{T} S_W\, w}$, whose solution is described as "analogous to the well-known Rayleigh quotient"
The technique extends to problems with n classes; the objective is then to maximize the spread between all classes in the projected space
For the two-class case, the optimal direction is:
$$w = S_W^{-1}(m_1 - m_2)$$
Source: Duda, 2000
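A minimal numpy sketch of the two-class solution above (illustrative, not the authors' implementation):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class MDA/Fisher direction: w = Sw^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)               # class-1 within-class scatter
    S2 = (X2 - m2).T @ (X2 - m2)               # class-2 within-class scatter
    return np.linalg.solve(S1 + S2, m1 - m2)   # solves Sw w = m1 - m2
```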
Extraction for Novelty Detection
Development
Objective: distinguish between normal and abnormal classes
KFDA is inappropriate (it assumes the classes group well into separate clusters)
In novelty detection, classes may cluster well, but the abnormal classes are expected to orbit the normal data
The means could overlap, eliminating the previous objective functions
Approach: find the subspace maximizing the difference between the average spread of the normal class and the average spread of the abnormal class, measured from the mean of the normal class
Mathematically, for an outlier class containing b elements and a target class containing a elements with mean m_t
To simplify, introduce the outlier scatter matrix, O, for the outlier data centered at m_t
Reducing to the objective $J(w) = w^{T}(S_t - O)\,w$
Maximize this objective function: find the eigenvectors and eigenvalues of the matrix $S_t - O$
Select the k largest eigenvalues and use the corresponding eigenvectors as the new basis
Project the data onto the new basis and proceed with classification (a sketch follows below)
Limitations
Still dependent on the assumption of normally distributed data (as are the other PCA techniques)
Assumption: the normal data scatter somewhat circularly and the outlier data orbit them without intruding (as with PCA and MDA)
Machinery vibration data are not, in general, Gaussian (a heuristic observation)
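A minimal numpy sketch of the proposed extraction as summarized above (the function name and the per-class averaging convention are assumptions):

```python
import numpy as np

def novelty_basis(X_target, X_outlier, k):
    """Eigendecompose S_t - O, both scatters centered at the target mean m_t."""
    m_t = X_target.mean(axis=0)                 # mean of the normal (target) class
    Xt = X_target - m_t
    Xo = X_outlier - m_t                        # outliers centered at the *target* mean
    S_t = Xt.T @ Xt / len(X_target)             # average target scatter
    O = Xo.T @ Xo / len(X_outlier)              # average outlier scatter
    vals, vecs = np.linalg.eigh(S_t - O)        # symmetric eigenvalue problem
    order = np.argsort(vals)[::-1]              # k largest eigenvalues first
    return vecs[:, order[:k]]

# Project all data onto the new basis, then proceed with classification:
# W = novelty_basis(X_healthy, X_faulty, k=2); Z = X @ W
```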
Validation: Artificial Data
Artificial 3-D data set (a generation sketch follows below)
Normal distribution: spherical (radius 50), centered at the origin
Outlier distribution: a randomly generated spherical distribution (radius 100), not permitted to fall within a cylinder concentric with the normal data's sphere and oriented with its length parallel to [1,1,1]
Validation: Artificial Data
Results (reduced to 2 dimensions): the subspace's normal vector is only 7 degrees off from the expected [1,1,1]
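A sketch of how such a data set might be generated (uniform-in-sphere sampling and the cylinder radius are assumptions not fixed by the text above):

```python
import numpy as np

rng = np.random.default_rng(0)
axis = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)   # cylinder axis direction

def sphere(n, radius):
    """Points drawn uniformly inside a 3-D sphere."""
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return v * (radius * rng.random((n, 1)) ** (1 / 3))

def outliers(n, radius=100.0, cyl_radius=50.0):
    """Rejection-sample outliers outside a cylinder concentric with the origin."""
    pts = []
    while len(pts) < n:
        p = sphere(1, radius)[0]
        if np.linalg.norm(p - (p @ axis) * axis) > cyl_radius:  # distance to axis
            pts.append(p)
    return np.array(pts)

X_normal, X_outlier = sphere(500, 50.0), outliers(500)
```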
Experimental Methodology
Apparatus
SpectraQuest gear dynamics simulator
3-hp motor
Magnetic particle brake loading
National Instruments PXI data acquisition and control
Accelerometers (sampled at 4 kHz)
Faults
4 motors employed: healthy, combo bearing faults, broken rotor bars, rotor unbalance
Gearbox faults: fault-free condition, gear with a missing tooth, chipped tooth, bearing with outer-race faults, bearings with inner- and outer-race faults
Feature Extraction
Autoregressive (AR) model
A model of a statistical process generated by regressing previous values of that process against itself
A model of the sampled signal that best represents the original samples
Order 10 (the standard form is given below)
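For reference, the standard form of an order-p autoregressive model, which the order-10 model above instantiates; the fitted coefficients $a_i$ serve as the feature vector:

$$x_t = \sum_{i=1}^{p} a_i\, x_{t-i} + \varepsilon_t, \qquad p = 10$$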
Segmentation
Vibration data segmented into groups based on intervals with a constant number of shaft rotations
Gaussian window; 70% overlap between segments (a sketch follows below)
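A minimal sketch of such windowed segmentation (fixed-length segments and the window width are simplifying assumptions; the text above segments by shaft rotations):

```python
import numpy as np

def segment(x, seg_len, overlap=0.7):
    """Split a signal into Gaussian-windowed segments with 70% overlap."""
    step = max(1, int(seg_len * (1 - overlap)))
    n = np.arange(seg_len)
    sigma = 0.2 * seg_len                        # window width: an assumption
    w = np.exp(-0.5 * ((n - (seg_len - 1) / 2) / sigma) ** 2)
    return np.array([x[i:i + seg_len] * w
                     for i in range(0, len(x) - seg_len + 1, step)])
```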
Results: Proposed Algorithm
Results: Kernel PCA
Results: Kernel FDA (N.B. potential for singular matrices)
Results: Exhaustive Feature Search
Feature Extraction in the Absence of Outliers
Motivation and Development
The above violates an assumption of novelty detection: only limited data from the fault classes are available
In the case where we know nothing of the outlier classes, work with what we have: the normal data
Minimize the variance of the normal data (formalized below)
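A minimal formalization, assuming "minimize variance" means keeping the directions along which the normal-data scatter $S_t$ is smallest:

$$\min_{\|w\|=1} w^{T} S_t\, w \;\Longrightarrow\; S_t w = \lambda w,\quad \text{keeping the eigenvectors with the smallest } \lambda$$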
Results: Novelty Reduction (Outlier Absence)
Conclusions
Conclusions
Reduce a large feature space to a smaller one
Mitigates the curse of dimensionality
Objective function tailored for novelty detection
Similar to MDA, but modified to accommodate the case where the normal and outlier means are closely separated
Results are good for both the artificial and the machinery data
Future work
Extend the technique with kernels; a difficult problem due to the need for a mean
Thanks
CEMI
Dr. Mechefske, Queen's