PATTERN RECOGNITION

ROBI POLIKAR

Rowan University, Glassboro, New Jersey

1. INTRODUCTION

Pattern recognition stems from the need for automated machine recognition of objects, signals, or images, or the need for automated decision-making based on a given set of parameters. Despite over half a century of productive research, pattern recognition continues to be an active area of research because of many unsolved fundamental theoretical problems as well as a rapidly increasing number of applications that can benefit from pattern recognition.

A fundamental challenge in automated recognition and decision-making is the fact that pattern recognition problems that appear to be simple for even a 5-year-old may in fact be quite difficult when transferred to the machine domain. Consider the problem of identifying the gender of a person by looking at a pictorial representation. Let us use the pictures in Fig. 1 for a simple demonstration, which include actual photographs as well as cartoon renderings. It is relatively straightforward for humans to effortlessly identify the genders of these people, but now consider the problem of having a machine make the same decision. What distinguishing features are there between these two classes, males and females, that the machine should look at to make an intelligent decision? It is not difficult to realize that many of the features that initially come to mind, such as hair length, height-to-weight ratio, body

curvature, facial expressions, or facial bone structure, even when used in combination, may fail to provide correct male/female classification of these images. Although we can naturally and easily identify each person as male or female, it is not so easy to determine how we come to this conclusion, or more specifically, what features we use to solve this classification problem.

Of course, real-world pattern recognition problems are considerably more difficult than even the one illustrated above, and such problems span a very wide spectrum of applications, including speech recognition (e.g., automated voice-activated customer service), speaker identification, handwritten character recognition (such as the one used by the postal system to automatically read the addresses on envelopes), topographical remote sensing, identification of a system malfunction based on sensor data, or loan/credit card application decisions based on an individual's credit report data, among many others. More recently, a growing number of biomedical engineering-related applications have been added to this list, including DNA sequence identification, automated digital mammography analysis for early detection of breast cancer, automated electrocardiogram (ECG) or electroencephalogram (EEG) analysis for cardiovascular or neurological disorder diagnosis, and biometrics (personal identification based on biological data such as iris scan, fingerprint, etc.). The list of applications can be infinitely extended, but all of these applications share a common denominator: automated classification or decision making based on observed parameters, such as a signal, image, or in general a pattern, obtained by combining several observations or measurements.

This article provides an introductory background to pattern recognition and is organized as follows:

Figure 1. Pattern recognition problems that may be trivial for us may be quite challenging for automated systems. What distinguishing features can we use to identify the above pictured people as males or females?


Wiley Encyclopedia of Biomedical Engineering, Copyright © 2006 John Wiley & Sons, Inc.


The terminology commonly used in pattern recognition is introduced first, followed by the different components that make up a typical pattern recognition system. These components, which include data acquisition, feature extraction and selection, classification algorithm (model) selection and training, and evaluation of the system performance, are individually described. Particular emphasis is then given to different approaches that can be used for model selection and training, which constitute the heart of a pattern recognition system. Some of the more advanced topics and current research issues, such as kernel-based learning and combining classifiers, along with their many applications, are discussed last.

2. BACKGROUND AND TERMINOLOGY

2.1. Commonly Used Terminology in Pattern Recognition

Variables believed to carry discriminating and characterizing information about an object to be identified are called features, which are usually measurements or observations about the object. A collection of d such features, ordered in some meaningful way into a d-dimensional column vector, is the feature vector, denoted x, which represents the signature of the object to be identified. The d-dimensional space in which the feature vector lies is referred to as the feature space. A d-dimensional vector in a d-dimensional space constitutes a point in that space. The category to which a given object belongs is called the class (or label) and is typically denoted by ω. A collection of features of an object under consideration, along with the correct class information for that object, is then called a pattern. Any given sample pattern of an object is also referred to as an instance or an exemplar. The goal of a pattern recognition system is therefore to estimate the correct label corresponding to a given feature vector based on some prior knowledge obtained through training. Training is the procedure by which the pattern recognition system learns the mapping relationship between feature vectors and their corresponding labels. This relationship forms the decision boundary in the d-dimensional feature space that separates patterns of different classes from each other. Therefore, we can equivalently state that the goal of a pattern recognition algorithm is to determine these decision boundaries, which are, in general, nonlinear functions. Consequently, pattern recognition can also be cast as a function approximation problem. Figure 2 illustrates these concepts on a hypothetical 2D, four-class problem. For example, feature 1 may be a systolic blood pressure measurement and feature 2 may be the weight of a patient, obtained from a cohort of elderly individuals over 60 years of age. Different classes may then indicate the number of heart attacks suffered within the last 5-year period, such as none, one, two, or more than two.

The pattern recognition algorithm is usually trained using training data, for which the correct labels for each of the instances that make up the data are known a priori. The performance of this algorithm is then evaluated on a separate test or validation dataset, typically collected at the same time or carved out of the existing training data, for which the correct labels are also known a priori.

Unknown data to be classified, for which the pattern recognition algorithm is ultimately trained, are then referred to as field data. The correct class labels for these data are obviously not known a priori, and it is the classifier's job to estimate the correct labels.

A quantitative measure that represents the cost of making a classification error is called the cost function. The pattern recognition algorithm is specifically trained to minimize this function. A typical cost function is the mean square error between numerically encoded values of actual and predicted labels. A pattern recognition system that adjusts its parameters to find the correct decision boundaries, through a learning algorithm using a training dataset, such that a cost function is minimized, is usually referred to as the classifier or, more formally, as the model. Incorrect labeling of the data by the classifier is an error, and the cost of making a decision, in particular an incorrect one, is called the cost of error. We should quickly note that not all errors are equally costly. For example, consider the problem of estimating whether a patient is likely to experience myocardial infarction by analyzing a set of features obtained from the patient's recent ECG. Two possible types of error exist. The case in which the patient is in fact healthy but the classifier predicts that s/he is likely to have a myocardial infarction is known as a false positive (false alarm error), and is typically referred to as a type I error. The cost of making a type I error might be the side effects and the cost of administering certain drugs that are in fact not needed. Conversely, failing to recognize the warning signs in the ECG and declaring the patient perfectly healthy is a false negative, also known as a type II error. The cost of making this type of error may include death. In this case, a type II error is costlier than a type I error. Pattern recognition algorithms can often be fine-tuned to minimize one type of error at the cost of increasing the other type.

Two parameters are often used in evaluating the performance of a trained system. The ability or the performance of the classifier in correctly identifying the classes of the training data, data that it has already seen, is called

Figure 2. Graphical representation of data and decision boundaries. (Axes: Sensor 1 Measurements, feature 1, vs. Sensor 2 Measurements, feature 2; measurements from classes 1-4 are separated by decision boundaries.)


the training performance. The training performance is typically used to determine how long the training should continue, or how well the training data have been learned. The training performance is usually not a good indicator of the more meaningful generalization performance, which is the ability or the performance of the classifier in identifying the classes of previously unseen patterns.

In this article, we focus on the so-called supervised classification algorithms, where it is assumed that a training dataset with pre-assigned class labels is available. Applications exist where the user has access to data without correct class labels. Such applications are handled by unsupervised clustering algorithms. Unsupervised algorithms are not discussed in this article; however, some of the more commonly used clustering algorithms are mentioned in the last section of this article for users whose specific applications may call for unsupervised learning.

2.2. Common Issues in Pattern Recognition

The two-feature, four-class hypothetical example shown in Fig. 2 represents a rather overly optimistic, if not an ideal, scenario: Patterns from any given class are perfectly separable from those of other classes through some boundary, albeit a nonlinear one. In fact, such problems are often considered easy for most pattern recognition applications. In most applications of practical interest, the data are not as cooperative. A more realistic scenario is the one shown in Fig. 3, where the patterns from different classes overlap in the feature space. Looking at the data distribution in Fig. 3, one may think that it is impossible to draw a decision boundary that perfectly separates instances of one class from others, and that the algorithm's task is an impossible one. Not so. The task of the pattern recognition algorithm is to determine the decision boundary that provides the best possible generalization performance (on unseen data), and not one that provides perfect training performance. In fact, even if it were possible to find such a decision boundary that provides perfect separation of training data, such boundaries are usually not desired because much of the overlap is typically caused by noisy data, and finding a decision boundary that provides perfect training data classification would amount to learning the noise in the data.

Learning noise invariably causes an inferior generalization performance on the test data. This phenomenon is observed often in pattern recognition and is referred to as overtraining or overfitting. Several approaches have been proposed to prevent overfitting, such as early termination of the training algorithm, or using a separate validation dataset and adjusting the algorithm parameters until the performance on this validation dataset is optimized.

Uncooperative data can result from a variety of external problems; for example, the dataset may be uncooperative because the number of features is inadequate. For the hypothetical case mentioned above, if we were to add the patient's age and cholesterol levels, the data (then in a 4D space) might be better separable. The opposite situation, the presence of irrelevant features, can also be a problem. Another culprit for overlapping data is the presence of outliers in the data. Even if the features are selected appropriately, outliers may always cause overlapping patterns (for example, it is not unusual to see people with high weight and high blood pressure and no apparent cardiovascular problems). It is therefore essential to use appropriate feature extraction/selection and outlier detection approaches to minimize the effects of uncooperative data.

Finally, a very important issue is the evaluation of the classifier's performance. It is mentioned above that a test dataset is used for this purpose, but considering that the test dataset is really a subset of the original data, a portion of which was carved out for training, how can we ensure that the evaluation we conduct on these data represents the true performance of the system on never-before-seen field data?

All these issues lie within the domain of pattern recognition, and they are discussed below as we describe the individual components of a complete pattern recognition system.

3. COMPONENTS OF A PATTERN RECOGNITION SYSTEM

A classifier model and its associated training algorithm are all that are usually associated with pattern recognition. However, a complete pattern recognition system consists of several components, shown in Fig. 4, of which selection and training of such a model is just one component. We describe each component prior to actual model selection in this section, giving particular emphasis to model selection, training, and evaluation in subsequent sections.

3.1. Data Acquisition

Apart from employing an appropriate classification algorithm, one of the most important requirements for designing a successful pattern recognition system is to have adequate and representative training and test datasets. Adequacy ensures that a sufficient amount of data exists to learn the decision boundary as a functional mapping between the feature vectors and the correct class labels. There is no rule that specifies how much data is sufficient;

Figure 3. Uncooperative data are common in practical pattern recognition applications. (Axes: Sensor 1 Measurements, feature 1, vs. Sensor 2 Measurements, feature 2; measurements from classes 1-4 overlap in the feature space.)


however, a general guideline indicates that there should be either at least 10 times as many training data instances as there are adjustable parameters in the classifier (1), or 10 times the instances per class per feature, to reduce or minimize overfitting (2). Representative data, on the other hand, ensure that all meaningful variations of field data instances that the system is likely to see are sampled by the training and test data. This condition is often more difficult to satisfy, as it is usually not practical to determine whether the training data distribution adequately spans the entire space on which the problem is defined. An educated guess is usually all that is available to the designer in choosing the type of sensors or measurement schemes that will provide the data.

3.2. Preprocessing

An essential, yet often overlooked, step in the design process is preprocessing, where the goal is to condition the acquired data such that noise from various sources is removed to the extent possible. Various filtering techniques can be employed if the user has prior knowledge regarding the spectrum of the noise. For example, if the measurement environment is known to introduce high (low)-frequency noise, an appropriate low (high)-pass DIGITAL FILTER may be employed to remove the noise. ADAPTIVE FILTERS can also be used if the spectral characteristics of the noise are known to change in time.
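As a concrete illustration, the following minimal sketch (not from the original article) removes high-frequency noise with a low-pass Butterworth digital filter; it assumes SciPy is available, and the sampling rate and cutoff frequency are placeholder values.

```python
# Minimal sketch (assumptions noted): low-pass filtering of a noisy signal with SciPy.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250.0      # assumed sampling rate in Hz (e.g., an ECG-like recording)
cutoff = 40.0   # assumed cutoff: keep components below 40 Hz

b, a = butter(N=4, Wn=cutoff / (fs / 2), btype="low")     # 4th-order low-pass filter
t = np.arange(0, 2, 1 / fs)
noisy = np.sin(2 * np.pi * 1.0 * t) + 0.3 * np.random.randn(t.size)
clean = filtfilt(b, a, noisy)                             # zero-phase filtering
```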

Conditioning may also include normalization, as classifiers are known to perform better with feature values that lie in a relatively small range. Normalization can be done with respect to the mean and variance of the feature values or with respect to the amplitude of the data. In the former, instances are normalized such that the normalized data have zero mean and unit variance:

$$x_i' = \frac{x_i - \mu_i}{\sigma_i}, \qquad (1)$$

where x_i indicates the ith feature of the original data instance, x_i' is its normalized value, μ_i is the mean, and σ_i is the standard deviation of x_i. In the latter, instances are simply divided by a constant so that all feature values are restricted to the [−1, 1] range:

$$\mathbf{x}' = \frac{\mathbf{x}}{\sqrt{\sum_{i=1}^{d} x_i^2}}. \qquad (2)$$

Finally, if great magnitude differences exist between the individual feature values, a logarithmic transformation, for example, can be used to reduce the dynamic range of the feature values.
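A brief sketch of both normalization schemes, assuming NumPy and toy data (not from the article):

```python
# Minimal sketch: z-score normalization per feature (Eq. 1) and amplitude
# normalization per instance (Eq. 2).
import numpy as np

X = np.random.rand(100, 4)                    # toy data: 100 instances, d = 4 features

# Eq. 1: zero mean and unit variance for each feature (column-wise statistics)
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eq. 2: divide each instance by its Euclidean norm so values fall in [-1, 1]
X_amp = X / np.linalg.norm(X, axis=1, keepdims=True)
```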

Preprocessing should also include outlier removal when possible. For low-dimensional problems (d ≤ 3), the data can be plotted, which often provides visual clues on whether any outliers are present. For data with d > 3, the Mahalanobis distance can be used, particularly if the data follow a Gaussian or near-Gaussian distribution. In this case, one computes the Mahalanobis distance of each data point from the distribution of training data instances for the class it belongs to, and if this distance is larger than a threshold (such as some multiple of the standard deviation of the data), that instance can be considered an outlier. The Mahalanobis distance of instance x from the training data instances of class ω_j can be computed as

$$M_j = (\mathbf{x} - \boldsymbol{\mu}_j)^T \Sigma_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j), \qquad (3)$$

where μ_j is the mean of the class ω_j instances and Σ_j is their covariance matrix.
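A minimal sketch (not from the article) of Mahalanobis-based outlier flagging for one class, assuming NumPy and roughly Gaussian class data; the threshold of 3 is an assumed choice:

```python
import numpy as np

def mahalanobis_outliers(X_class, threshold=3.0):
    """Return a boolean mask of instances farther than `threshold` from the class mean."""
    mu = X_class.mean(axis=0)                    # class mean vector
    cov = np.cov(X_class, rowvar=False)          # class covariance matrix
    cov_inv = np.linalg.inv(cov)
    diff = X_class - mu
    # Squared Mahalanobis distance of every instance (Eq. 3)
    m2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(m2) > threshold

X_class = np.random.randn(200, 5)                # toy data for one class
print(mahalanobis_outliers(X_class).sum(), "suspected outliers")
```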

3.3. Feature Extraction

Both the feature extraction and feature selection steps (discussed next) are in effect dimensionality reduction procedures. In short, the goal of feature extraction is to find a preferably small number of features that are particularly distinguishing or informative for the classification process, and that are invariant to irrelevant transformations of the data. Consider the identification of a cancerous tissue from an MRI image: The shape, the color contrast ratio of this tissue to that of the surrounding tissue, the 2D Fourier spectrum, and so on are all likely to be relevant and distinguishing features, but the height or eye color of the patient are probably not. Furthermore, although the shape is a relevant feature, tumors whose shapes are small or large,

Figure 4. Components of a pattern recognition system. (A classification/decision-making problem from the real world proceeds through Data Acquisition: how to acquire data and how much data should be acquired; Preprocessing: noise removal, filtering, normalization, outlier removal; Feature Extraction and Feature Selection: extraction of relevant features from the available data, followed by selection of the minimum set of the most relevant features, with both steps also contributing to dimensionality reduction; Model Selection & Training: choosing the type of classification model and training it with an appropriate learning algorithm; and Evaluation: estimating the true generalization performance of the classifier in the real world and the confidence in this estimation; leading to the solution.)


that are a few centimeters to the left or right, or that are rotated in one direction or another, are still tumors. Therefore, an appropriate transformation may be necessary to obtain a translation-, rotation-, and scale-invariant version of the shape feature. The goal of feature selection, on the other hand, is to select a yet smaller subset of the extracted features that are deemed to be the most distinguishing and informative ones.

The importance of dimensionality reduction in pattern recognition cannot be overstated. A small but informative set of features significantly reduces the complexity of the classification algorithm, the time and memory requirements to run this algorithm, as well as the possibility of overfitting. In fact, the detrimental effects of a large number of features are well known within the pattern recognition community, and are affectionately referred to as the curse of dimensionality. It is therefore important to keep the number of features as small as possible, while ensuring that enough discriminating power is retained. Ockham's Razor, the well-known philosophical argument of William of Ockham (1284-1347) that entities are not to be multiplied without necessity, is often mentioned when the virtues of dimensionality reduction are discussed.

Feature extraction is usually obtained through a mathematical transformation of the data. Some of the most widely used transformations are linear transformations, such as PRINCIPAL COMPONENT ANALYSIS and linear discriminant analysis.

3.3.1. Principal Component Analysis (PCA). PCA, also known as the Karhunen-Loeve transformation, is one of the oldest techniques in multivariate analysis and is the most commonly used dimensionality reduction technique in pattern recognition. It was originally developed by Pearson in 1901 and generalized by Loeve in 1963.

The underlying idea is to project the data to a lower dimensional space where most of the information is retained. In PCA, it is assumed that the information is carried in the variance of the features, that is, the higher the variation in a feature, the more information that feature carries. Hence, PCA employs a linear transformation that is based on preserving the most variance in the data using the least number of dimensions. The data instances are projected onto a lower dimensional space where the new features best represent the entire data in the least squares sense. It can be shown that the optimal approximation, in the least square error sense, of a d-dimensional random vector x ∈ ℝ^d by a linear combination of d' < d independent vectors is obtained by projecting the vector x onto the eigenvectors e_i corresponding to the largest eigenvalues λ_i of the covariance matrix (or the scatter matrix) of the data from which x is drawn. The eigenvectors of the covariance matrix of the data are referred to as the principal axes of the data, and the projections of the data instances onto these principal axes are called the principal components. Dimensionality reduction is then obtained by retaining only those axes (dimensions) that account for most of the variance, and discarding all others.

Figure 5 illustrates PCA, where the principal axes are aligned with the most variation in the data. In this 2D case, Principal Axis 1 accounts for more of the variation,

and dimensionality reduction can be obtained by projecting the 2D original data onto the first principal axis, thereby obtaining 1D data that best represent the original data.

In summary, PCA is equivalent to walking around the data to see from which angle one gets the best view. The rotation of the axes is done in such a way that the new axes are aligned with the directions of maximum variance, which are the directions of the eigenvectors of the covariance matrix. Implementation details can be found in the article on PRINCIPAL COMPONENT ANALYSIS.
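A minimal sketch of this procedure (not from the article), assuming NumPy: the covariance matrix of the centered data is eigendecomposed, and the data are projected onto the eigenvectors with the largest eigenvalues.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X (n_samples x d) onto its first n_components principal axes."""
    Xc = X - X.mean(axis=0)                       # center the data
    cov = np.cov(Xc, rowvar=False)                # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues, largest first
    W = eigvecs[:, order[:n_components]]          # principal axes (eigenvectors)
    return Xc @ W                                 # principal components

X = np.random.randn(200, 2) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated toy data
X_1d = pca_project(X, n_components=1)             # best 1D representation (cf. Fig. 5)
```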

One minor problem exists with PCA, however: Although it provides the best representation of the data in a lower dimensional space, there is no guarantee that the classes in the transformed data are better separated than they are in the original data, because PCA does not take class information into consideration. Ensuring that the classes are best separated in the transformed space is better handled by linear discriminant analysis (LDA).

3.3.2. Linear Discriminant Analysis (LDA). The goal of LDA (or the Fisher linear discriminant) is to find a transformation such that the intercluster distances between the classes are maximized and the intracluster distances within a given class are minimized in the transformed lower dimensional space. These distances are measured using the between-class and within-class scatter matrices, respectively, as described below.

Consider a multiclass classification problem and let C be the number of classes. For the ith class, let {x_i} be the set of patterns in this class, m_i be the mean of the vectors x ∈ {x_i}, and n_i be the number of patterns in {x_i}. Let m be the mean of all patterns in all C classes.

Figure 5. Principal axes are along the eigenvectors of the covariance matrix of the data. (In the 2D example, principal axis 1, along eigenvector e1, captures most of the variance; principal axis 2, along e2, captures the remainder.)


Then, the within-class scatter matrix S_W and the between-class scatter matrix S_B are defined as follows (1):

$$S_W = \sum_{i=1}^{C} \sum_{\mathbf{x} \in X_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T, \qquad S_B = \sum_{i=1}^{C} n_i (\mathbf{m} - \mathbf{m}_i)(\mathbf{m} - \mathbf{m}_i)^T, \qquad (4)$$

where T denotes the transpose operator. The transformation, which is also the projection from the original feature space onto a lower dimensional feature space, can be expressed as

$$\mathbf{y} = W^T \mathbf{x}, \qquad (5)$$

where the column vector y is the feature vector in the projected space corresponding to pattern x. The optimum matrix W is obtained by maximizing the criterion function

$$J(W) = S_{BF} / S_{WF}, \qquad (6)$$

where S_BF and S_WF are the corresponding scatter matrices in the (feature) projection space. It can be shown that S_BF and S_WF can be written as

$$S_{BF} = W^T S_B W, \qquad S_{WF} = W^T S_W W. \qquad (7)$$

Therefore, J(W) can be represented in terms of the scatter matrices of the original patterns:

$$J(W) = \frac{W^T S_B W}{W^T S_W W}. \qquad (8)$$

J(W) is a matrix-valued function, and the determinant of this function can be used as a scalar measure of J(W). The columns of W that maximize the determinant of J(W) are then the eigenvectors that correspond to the largest eigenvalues in the generalized eigenvalue equation:

$$S_B \mathbf{w}_i = \lambda_i S_W \mathbf{w}_i. \qquad (9)$$

For nonsingular S_W, Equation 9 can be written as

$$S_W^{-1} S_B \mathbf{w}_i = \lambda_i \mathbf{w}_i. \qquad (10)$$

From Equation 10, we can directly compute the eigenvalues λ_i and the eigenvectors w_i, which then constitute the columns of the matrix W.
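A minimal sketch (not from the article) of this computation, assuming NumPy: the scatter matrices of Eq. 4 are accumulated class by class, and the eigenvectors of S_W^{-1} S_B (Eq. 10) form the projection matrix.

```python
import numpy as np

def lda_fit(X, y, n_components=1):
    """Return W whose columns are the leading eigenvectors of S_W^{-1} S_B."""
    classes = np.unique(y)
    d = X.shape[1]
    m = X.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)                 # within-class scatter (Eq. 4)
        S_B += len(Xc) * np.outer(m - mc, m - mc)      # between-class scatter (Eq. 4)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]             # largest eigenvalues first
    return eigvecs[:, order[:n_components]].real       # columns of W

X = np.vstack([np.random.randn(50, 3) + 2, np.random.randn(50, 3) - 2])  # toy 2-class data
y = np.array([0] * 50 + [1] * 50)
W = lda_fit(X, y)          # project with y_proj = X @ W, as in Eq. 5
```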

Figure 6 illustrates the transformation of axes obtained by LDA, which takes the class information into consideration. LDA is not without its own limitations, however. A close look at the rank properties of the scatter matrices shows that, regardless of the dimensionality of the original pattern, LDA transforms a pattern vector onto a feature vector whose dimension can be at most C−1, where C is the number of classes. This restriction is serious because, for problems with high dimensionality where most of the features do in fact carry meaningful information, a reduction to C−1 dimensions may cause loss of important information. Furthermore, the approach implicitly assumes that the class distributions are unimodal (such as Gaussian). If the original distributions are multimodal or highly overlapping, LDA becomes ineffective. In general, if the discriminatory information lies within the class means (that is, the centers of the classes are sufficiently far away from each other), the LDA technique works well. If the discriminatory information lies within the variances, then PCA works better than LDA.

A variation of LDA, referred to as nonparametric discriminant analysis (3,4), removes the unimodal assumption as well as the restriction of projection to a C−1 dimensional space, which is achieved by redefining the between-class scatter matrix, making it a full-rank matrix:

$$S_B = \frac{1}{N} \sum_{i=1}^{C} \sum_{j=1}^{C} \sum_{\mathbf{x} \in X_i} w_{ijx}\, (\mathbf{x} - \mathbf{m}_{ijx})(\mathbf{x} - \mathbf{m}_{ijx})^T, \qquad (11)$$

where N is the total number of instances,

Figure 6. Linear discriminant analysis. (Left panel: a poor choice of projection direction; right panel: the projection direction chosen by LDA, a good choice.)


m_{ijx} represents the mean of x_i's k-nearest neighbors from class ω_j, and w_{ijx} represents the weight of the feature vector x from class ω_i to class ω_j:

$$w_{ijx} = \frac{\min\!\big(\mathrm{dist}(\mathbf{x}^i_{KNN}),\ \mathrm{dist}(\mathbf{x}^j_{KNN})\big)}{\mathrm{dist}(\mathbf{x}^i_{KNN}) + \mathrm{dist}(\mathbf{x}^j_{KNN})}, \qquad (12)$$

with dist(x^i_{KNN}) being the Euclidean distance from x to its k-nearest neighbors in class ω_i. In general, if a class ω_i instance is far away in the feature space from the cluster of class ω_j instances, w_{ijx} is a small quantity. If, however, an instance of class ω_i is close to the boundary of the class ω_j instances, then w_{ijx} is a large quantity. The rest of the analysis is identical to that of regular LDA, where the generalized eigenvalue equation is solved to obtain the transformation matrix W. Those columns of W representing the eigenvectors of the largest eigenvalues are then retained, and the remaining ones are discarded to achieve the desired dimensionality reduction.

3.3.3. Other Feature Extraction Techniques. Although PCA and LDA are very commonly used, they are not necessarily the best choices for all applications. In fact, depending on the application, the better discriminating information may reside in the spectral domain, for which a Fourier-based transformation (DISCRETE/FAST FOURIER TRANSFORM) may be more appropriate. For NONSTATIONARY SIGNALS (such as ECG or EEG), a WAVELET-based time-frequency representation may be the better feature extraction technique (5). In such cases, dimensionality reduction is obtained by discarding the transformation coefficients that are smaller than a threshold, assuming that those coefficients do not carry much information.

We should add that some sources treat the transformation-based dimensionality reduction techniques as a special case of feature selection, as the end result is the selection of a smaller number of features. These techniques, when considered as feature selection techniques, are referred to as filtering-based feature selection (as opposed to wrapper-based feature selection, described in the next section), as they filter out irrelevant features.

3.4. Feature Selection

By feature selection, this author specifically means selection of the m features that provide the most discriminatory information, out of a possible d features, where m < d. In other words, feature selection here refers to selecting a subset of features from a set of features that have already been identified by a preceding feature extraction algorithm. The main question to answer under this setting is then: Which subset of features provides the most discriminatory information?

A criterion function is used to assess the discriminatory performance of the features, and a common choice for this function is the performance of a subsequent classifier trained on the given set of features. In essence, we are looking for a subset of features that leads to the best generalization performance of the classifier when trained on this subset. It should be noted, of course, that the best subset then inevitably becomes a function of the classifier chosen. For example, the best subset of features for a neural network may be different from the one for a decision tree type classifier. The feature selection is therefore said to be wrapped around the classifier chosen, and, hence, such feature selection approaches are referred to as wrapper approaches (6).

There is, of course, a conceptually trivial solution to this problem: evaluate every subset of features by training a classifier with each such subset, observing its generalization performance, and then selecting the subset that provides the best performance. Such an exhaustive search, as conceptually simple as it may be, is prohibitively expensive (computation-wise) even for a relatively small number of features, because the number of feature subsets to be evaluated grows combinatorially as the number of features increases. For fixed d and m, the number of subsets of features is C(d, m) = d!/(m!(d−m)!). If, on the other hand, we are not interested only in subsets of exactly m features, but rather in any subset whose cardinality is no larger than m, then the total number of subsets to be evaluated becomes Σ_{i=1}^{m} C(d, i). Just for illustrative purposes, if we are interested in finding the best set of features (of any cardinality) out of 12 features, 4095 subsets of features need to be evaluated; and evaluating every subset of exactly 10 features out of 20 requires 184,756 evaluations. It is not unusual for practical problems to have hundreds, if not thousands, of features.
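A small check of these counts (not from the article), using Python's standard-library math.comb:

```python
from math import comb

print(sum(comb(12, i) for i in range(1, 13)))   # 4095 nonempty subsets of 12 features
print(comb(20, 10))                             # 184756 subsets of exactly 10 of 20 features
```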

Of course, more efficient search algorithms exist that avoid the full exhaustive search, such as the well-established depth-first search, breadth-first search, branch and bound search, as well as hill climb search; however, each has its own limitations. For example, the branch and bound algorithm avoids the exhaustive search by searching subspaces and computing upper and lower bounds on the solutions obtained in these subspaces. The algorithm keeps track of the performance of the best solution found thus far, and if the performance of another solution is worse than the current best, the subspaces of this solution are discarded from future search. This, of course, makes sense if and only if the criterion function used to evaluate the performance is monotonic on the feature subsets, that is, the performance of any feature subset must improve with the addition of features. As discussed earlier, this is clearly not the case in classification problems, as the addition of irrelevant features is likely to degrade performance.

Other search algorithms include sequential forward and backward search (also referred to as hill climbing) (7). Forward search starts with no features, evaluates every single feature, one at a time, and finds the best single feature. This feature is then included as part of the optimal feature subset. Then, keeping this one feature, all other features are added, again one at a time, to form two-feature subsets. The best two-feature subset is retained and the search continues by adding one feature at a time to the best subset found thus far. The search terminates when the addition of a new feature no longer improves performance. The backward search works much


the same, but in reverse order: The search starts with all features, and then one feature is removed at a time until removing features no longer improves performance.

Figure 7 illustrates the forward-search-based hill climb on a sample feature set of cardinality 12. Each ordered 12-tuple indicates which features are included or not (indicated by "1" or "0," respectively) in that feature subset. The first row shows the root level, where no features are yet included in the search. Each feature is evaluated one at a time (12 total evaluations). Assume that the second feature provides the best performance. Then, all two-feature combinations including the second feature are evaluated (11 such evaluations). Assume that the second and 12th features together provide the best performance, which is better than the performance of the second feature alone. Then, keeping these two features, one more feature is added, and the procedure continues until adding a feature no longer improves the performance from the previous iteration.

Forward and backward searches are significantly more cost efficient than the exhaustive search, as at each stage the number of possible combinations that need to be evaluated decreases by one. Therefore, the total number of feature subsets that must be evaluated is

$$N + (N-1) + (N-2) + \cdots + \big(N - (N-1)\big) = \frac{N(N+1)}{2}, \qquad (13)$$

where N is the total number of features. In comparison with the 4095 evaluations required by the exhaustive search, the hill climb requires only 66 evaluations for the 12-feature example mentioned above. However, this computational efficiency comes at the cost of optimality: The hill climb is suboptimal, because the optimal feature subset that provides the best performance need not contain the single best feature found earlier.
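A minimal sketch (not from the article) of wrapper-style sequential forward selection; the `evaluate` argument is an assumed user-supplied function that trains a classifier on the given feature subset and returns its validation performance.

```python
def forward_selection(n_features, evaluate):
    """Greedy forward search: add one feature at a time while performance improves."""
    selected, best_score = [], float("-inf")
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Try adding each remaining feature to the current best subset
        scores = {f: evaluate(selected + [f]) for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:      # stop when no candidate improves performance
            break
        selected.append(f_best)
        best_score = scores[f_best]
    return selected, best_score
```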

Despite their suboptimality, these search algorithms are often employed if (1) the total number of features is significantly large, and (2) the features are uncorrelated. We would like to emphasize the second condition: Wrapper-based feature selection algorithms can only be used if the individual features are uncorrelated and, better yet, if they are independent of each other. Such approaches cannot be used for time-series-based features, such as ECG signals, where one feature is clearly and strongly correlated with the features that come before and after it.

4. MODEL SELECTION AND TRAINING

4.1. Setting the Problem as a Function Approximation

Only after acquiring and preprocessing adequate and representative data and extracting and selecting the most informative features is one finally ready to select a classifier and its corresponding training algorithm. As mentioned earlier, one can think of classification as a function approximation problem: find a function that maps a d-dimensional input to appropriately encoded class information (both inputs and outputs must be encoded, unless they are already of a numerical nature). Once the classification is cast as a function approximation problem, a variety of mathematical tools, such as optimization algorithms, can be used. Some of the more common ones are described below. Although most common pattern recognition algorithms are categorized as statistical approaches vs. neural network type approaches, it is possible to show that they are in fact closely related, and even a one-to-one match between certain statistical approaches and their corresponding neural network equivalents can be established.

4.2. Statistical Pattern Recognition

4.2.1. Bayes Classifier. In statistical approaches, data points are assumed to be drawn from a probability distribution, where each pattern has a certain probability of belonging to a class, determined by its class-conditional probability distribution. In order to build a classifier, these distributions must either be known ahead of time or must be learned from the data. The problem is cast as follows: A given d-dimensional feature vector x = (x_1, ..., x_d) needs to be assigned to

Figure 7. Forward search hill climb. (Each ordered 12-tuple of 1s and 0s indicates which features are included in a subset; the root level includes no features. Twelve single-feature subsets are tested first; assuming the 2nd feature gives the best performance, the 11 two-feature subsets containing it are tested next; assuming the 2nd and 12th features together give the best performance, the 10 three-feature subsets containing both follow, and so on. The search continues until the best subset in the current row is not better than that of the previous row.)


one of c classes ω_1, ..., ω_c. The feature vector x that belongs to class ω_j is considered an observation drawn randomly from a probability distribution conditioned on class ω_j, P(x|ω_j). This distribution is called the likelihood: the (conditional) probability of observing a feature value of x, given that the correct class is ω_j. All things being equal, the category with the higher class-conditional probability is more "likely" to be the correct class. All things may not always be equal, however, as certain classes may be inherently more likely. The likelihood is therefore converted into a posterior probability P(ω_j|x): the (conditional) probability of the correct class being ω_j, given that the feature value x has been observed. The conversion is done using the well-known Bayes theorem, which takes the prior likelihood of each class into consideration:

$$P(\omega_j \mid \mathbf{x}) = \frac{P(\mathbf{x} \cap \omega_j)}{P(\mathbf{x})} = \frac{P(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{\sum_{k=1}^{C} P(\mathbf{x} \mid \omega_k)\, P(\omega_k)}, \qquad (14)$$

where P(ω_j) is the prior probability, the previously known probability of the correct class being ω_j (without regard to the observed feature vector), and P(x) is the evidence, the total probability of observing the feature vector x. The prior probability is usually known from prior experience. For example, referring to our original example, if historical data indicate that 20% of all people over the age of 60 have had two or more heart attacks, regardless of weight and blood pressure, the prior probability for this class is 0.2. If such prior experience is not available, it can either be estimated from the ratio of training data patients that fall into this category, or, if that is not reliable (because of small training data), it can be taken as equal for all classes.

The posterior probability is calculated for each class, given x, and the final classification then assigns x to the class for which the posterior probability is largest; that is, ω_i is chosen if P(ω_i|x) > P(ω_j|x) for all j ≠ i, i, j = 1, ..., c. It should be noted that the evidence is the same for all classes, and hence its value is inconsequential for the final classification. A classifier constructed in this manner is usually referred to as a Bayes classifier or the Bayes decision rule, and it can be shown to be the optimal classifier, with the smallest achievable error, in the statistical sense.
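As a brief illustration of Eq. 14 (the numbers below are hypothetical and not from the article), one could compute posteriors for a two-class problem with assumed priors and likelihoods:

```python
# Illustrative sketch with assumed (hypothetical) numbers: apply Bayes' rule (Eq. 14).
priors = {"healthy": 0.8, "at_risk": 0.2}          # assumed prior probabilities
likelihoods = {"healthy": 0.10, "at_risk": 0.60}   # assumed P(x | class) for an observed x

evidence = sum(likelihoods[c] * priors[c] for c in priors)           # P(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
decision = max(posteriors, key=posteriors.get)                       # largest posterior wins
print(posteriors, "->", decision)
```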

A more general form of the Bayes classifier considers the fact that not all errors are equally costly, and hence tries to minimize the expected risk R(α_i|x), the expected loss for taking action α_i. Whereas taking action α_i is usually interpreted as choosing class ω_i, refusing to make a decision can also be an action, allowing the classifier not to make a decision if the expected risk of doing so is smaller than that of choosing any of the classes. The expected risk can be calculated as

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}), \qquad (15)$$

where λ(α_i|ω_j) is the loss incurred for taking action α_i when the correct class is ω_j. If one associates action α_i with selecting ω_i, and if all errors are equally costly, the zero-one loss is obtained:

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{if } i \neq j. \end{cases} \qquad (16)$$

This loss function assigns no loss to a correct classification and a loss of 1 to any misclassification. The risk corresponding to this loss function is then

$$R(\alpha_i \mid \mathbf{x}) = \sum_{\substack{j=1,\ldots,c \\ j \neq i}} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x}), \qquad (17)$$

proving that the class that maximizes the posterior probability minimizes the expected risk.

Out of the three terms in the optimal Bayes decision rule, the evidence is unnecessary and the prior probability can be easily estimated, but we have not mentioned how to obtain the key third term, the likelihood. Yet it is this critical likelihood term whose estimation is usually very difficult, particularly for high dimensional data, rendering the Bayes classifier impractical for most applications of practical interest. One cannot discard the Bayes classifier outright, however, as there are several ways in which it can still be used: (1) If the likelihood is known, it is the optimal classifier; (2) if the form of the likelihood function is known (e.g., Gaussian) but its parameters are unknown, they can be estimated using MAXIMUM LIKELIHOOD ESTIMATION (MLE); (3) even the form of the likelihood function can be estimated from the training data, for example, by using the k-nearest neighbor approach (discussed below) or by using Parzen windows (1), which compute a superposition of (usually Gaussian) kernels, each centered on an available training data point; this approach, however, becomes computationally expensive as the dimensionality increases; and (4) the Bayes classifier can be used as a benchmark against the performance of new classifiers by using artificially generated data whose distributions are known.

4.2.2. Naïve Bayes Classifiers. As mentioned above, the main disadvantage of the Bayes classifier is the difficulty of estimating the likelihood (class-conditional) probabilities, particularly for high dimensional data, because of the curse of dimensionality. There is a highly practical solution to this problem, however, and that is to assume class-conditional independence of the features:

$$P(\mathbf{x} \mid \omega_i) = \prod_{j=1}^{d} P\big(x^{(j)} \mid \omega_i\big), \qquad (18)$$

which yields the so-called Naïve Bayes classifier. Equation 18 basically requires that the jth feature of instance x, denoted x^(j), be independent of all other features, given the class information. It should be noted that this is not nearly as restrictive as assuming full independence, that is,

$$P(\mathbf{x}) = \prod_{j=1}^{d} P\big(x^{(j)}\big). \qquad (19)$$


The classification rule corresponding to the Naïve Bayes classifier is then to compute the discriminant function representing the posterior probabilities as

$$g_i(\mathbf{x}) = P(\omega_i) \prod_{j=1}^{d} P\big(x^{(j)} \mid \omega_i\big) \qquad (20)$$

for each class i, and then to choose the class for which the discriminant function g_i(x) is largest. The main advantage of this approach is that it only requires the univariate densities p(x^(j)|ω_i) to be computed, which are much easier to estimate than the multivariate densities p(x|ω_i). In practice, Naïve Bayes has been shown to provide respectable performance, comparable with that of neural networks, even under mild violations of the independence assumptions.
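A minimal sketch (not from the article) of a Gaussian Naïve Bayes classifier implementing Eq. 20 with univariate normal likelihoods estimated per feature, assuming NumPy:

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate prior, per-feature mean, and per-feature variance for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),            # prior P(omega_i)
                     Xc.mean(axis=0),             # per-feature means
                     Xc.var(axis=0) + 1e-9)       # per-feature variances (small floor)
    return params

def predict_gnb(params, x):
    scores = {}
    for c, (prior, mu, var) in params.items():
        # log of Eq. 20: log prior plus the sum of log univariate Gaussian likelihoods
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)            # class with the largest discriminant
```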

4.2.3. k-Nearest Neighbor (kNN) Classifier. Although the kNN can be used as a nonparametric density estimation algorithm (see NEAREST NEIGHBOR RULES), it is more commonly used as a classification tool. As such, it is one of the most commonly used algorithms, because it is very intuitive, simple to implement, and yet provides remarkably good performance even for demanding applications. Simply put, the kNN takes a test instance x, finds its k nearest neighbors in the training data, and assigns x to the class occurring most often among those k neighbors. The entire algorithm can be summarized as follows (a short code sketch follows the list):

* Out of the N training vectors, identify the k nearest neighbors (based on some distance metric, such as Euclidean) of x, irrespective of the class label. Choose an odd (or prime) k.
* Identify the number of samples k_j that belong to class ω_j, such that Σ_j k_j = k.
* Assign x to the class with the maximum number k_j of samples.
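A minimal sketch (not from the article) of the rule just listed, assuming NumPy, Euclidean distance, and non-negative integer class labels:

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=11):
    dists = np.linalg.norm(X_train - x, axis=1)    # distance to every training vector
    nearest = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    votes = np.bincount(y_train[nearest])          # count neighbors per class label
    return np.argmax(votes)                        # majority class among the k neighbors
```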

Figure 8 illustrates this procedure for k = 11. According to the kNN rule, the test instance indicated by the plus sign would be labeled as class 3, as class-3 instances occur most often among its 11 nearest neighbors in the training data (six instances, as opposed to two from class 2, three from class 4, and none from class 1). Similarly, the test instance indicated by the diamond shape would be labeled as class 1, because its 11 nearest neighbors include seven instances from class 1 and only four from class 4 (and none from the other classes) in the training data. It should be noted that the choice of k as an odd number is not by accident: For two-class problems, k is often chosen as an odd number, and for c-class problems, it is usually chosen as a number that is not divisible by c, to prevent potential ties.

The limiting case of kNN is to choose k = 1, essentially turning the classification algorithm into one that assigns the test instance to the class of its NEAREST NEIGHBOR in the training data. Choosing a large k has the advantage of creating smooth decision boundaries. However, it also has higher computational complexity and, more importantly, loses local information, because instances farther away are averaged into the classification.

Despite its simple structure, the kNN is a formidable competitor to other classification algorithms. In fact, it can be shown that when sufficiently dense data exist, its performance approaches that of the optimal Bayes classifier. Specifically, in the large-sample limit, the error of the 1-NN classifier is at worst twice the error of the optimal Bayes classifier.

The kNN does not really do much learning. In fact, it does no processing until a request to classify an unknown instance is made. It simply compares the unknown data instance with those in the training data, for which it must have access to the entire database. Therefore, this approach is also called lazy learning or memory-based learning, in contrast to "eager" learning algorithms, such as neural networks, which do in fact learn the decision boundary before they are asked to classify an unknown instance. In eager learning algorithms, the training data can be discarded after training: Once the classifier model has been generated, all information contained in the training data is condensed and stored as model parameters, such as the neural network weights. Lazy algorithms have little or no computational cost of training, but a higher computational cost during testing compared with eager learners.

4.3. Neural Networks

Among the countless number of neural network structures, there are two that are used more often than all others: the multilayer perceptron (MLP) and the radial basis function (RBF) network. These networks have been extensively studied and empirically tested on a broad spectrum of applications, and hence their properties are now well known. Furthermore, these two types of networks have proven to be universal approximators (8-11), a term referring to the ability of these networks to approximate any decision boundary of arbitrary dimensionality and arbitrary complexity with arbitrary precision, given an adequate amount of data and a proper selection of their architectural and training parameters.

Figure 8. kNN classification with k = 11. (Axes: Sensor 1 Measurements, feature 1, vs. Sensor 2 Measurements, feature 2; measurements from classes 1-4 shown, with test instances classified by majority vote among their 11 nearest neighbors.)


These two types of networks are explained in detail in this article.

4.3.1. Neuronal Model and the Multilayer Perceptron. Artificial neural networks (ANNs), or simply neural networks, are designed to mimic the decision-making ability of the brain, and hence their structure resembles that of the nervous system, albeit very crudely, featuring a massively interconnected set of elements. In its most basic form, a single element of the nervous system, a neuron, consists of four basic parts, as schematically illustrated in Fig. 9: the dendrites that receive information from other neurons; the cell body (the soma) that contains the nucleus and processes the information received by the neuron; the axon that transmits the processed information away from the soma and toward its intended destination; and the synaptic terminals that relay this information to the dendrites of subsequent neurons, the brain, or the intended sensory/motor organ.

The neuron, modeled as the unit shown in Fig. 10, is typically called a node. It features a set of inputs and a weight for each input, representing the information arriving at the dendrites; a processing unit, representing the cell body; and an output that connects to other nodes (or provides the network output), representing the synaptic terminals. The weights, positive or negative, represent the excitatory or inhibitory nature of neuronal activation, respectively. The axon is implicitly represented by the physical layout of the neural network.

Note that for a d-dimensional feature vector, the node has d + 1 inputs, where the 0th input, indicated as x_0, is held at the constant value of 1 and has an associated weight w_0; this input serves as the bias of the node. The output of the node is calculated as the weighted sum of its inputs, called the net sum, or simply net, passed through the activation function f:

$\mathrm{net} = \sum_{i=0}^{d} w_i x_i = \mathbf{x}\mathbf{w}^{T} + w_0, \qquad y = f(\mathrm{net}).$  (21)

Note that the net sum creates a linear decision boundary, $\mathbf{x}\mathbf{w}^{T} + w_0$, which is then modified by the activation function. Popular choices for the activation function f include

(1) The thresholding function

$f(\mathrm{net}) = \begin{cases} 1, & \mathrm{net} \ge 0 \\ 0, & \text{otherwise} \end{cases} \quad \text{or} \quad f(\mathrm{net}) = \begin{cases} 1, & \mathrm{net} \ge 0 \\ -1, & \text{otherwise.} \end{cases}$  (22)

When used with this activation function, the node is also known as the perceptron.

(2) The identity (linear) function

$f(\mathrm{net}) = \mathrm{net}.$  (23)

(3) The logistic (sigmoid) function

$f(\mathrm{net}) = \dfrac{1}{1 + e^{-\beta\,\mathrm{net}}}.$  (24)

(4) The hyperbolic tangent (sigmoid) function

$f(\mathrm{net}) = \tanh(\beta\,\mathrm{net}) = \dfrac{2}{1 + e^{-2\beta\,\mathrm{net}}} - 1.$  (25)
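As a concrete illustration (a sketch with hypothetical names, not code from the article), the following function computes the net sum of Eq. (21) and applies any of the activation functions in Eqs. (22) through (25):

import numpy as np

def node_output(x, w, w0, activation="logistic", beta=1.0):
    """Single-node (perceptron) model: net sum followed by an activation function."""
    net = np.dot(w, x) + w0            # weighted sum plus bias, Eq. (21)
    if activation == "threshold":      # Eq. (22)
        return 1.0 if net >= 0 else 0.0
    if activation == "identity":       # Eq. (23)
        return net
    if activation == "logistic":       # Eq. (24)
        return 1.0 / (1.0 + np.exp(-beta * net))
    if activation == "tanh":           # Eq. (25)
        return np.tanh(beta * net)
    raise ValueError("unknown activation")

x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(node_output(x, w, w0=0.2, activation="logistic", beta=0.5))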

The thresholding function, the identity function, and the logistic sigmoid function are depicted in Figs. 11a, 11b, and 11c, respectively (the shape of the hyperbolic tangent function looks similar to that of the logistic sigmoid, except that the logistic function takes values between 0 and 1, whereas the range of the hyperbolic tangent is between -1 and 1). The β parameter in the sigmoidal functions controls the sharpness of the function transition at zero; both functions approach the thresholding function as β approaches infinity, and a linear function as β approaches 0.

Figure 11. (a) The thresholding, (b) identity, and (c) logistic sigmoid activation functions (logistic curves shown for β = 0.1, 0.25, 0.5, and 0.9).

Figure 9. Schematic illustration of the neuron: dendrites, cell body (soma), axon, and synaptic terminals.

Figure 10. The node model of a neuron: inputs x_0 = 1, x_1, ..., x_d with weights w_0, w_1, ..., w_d, a summation unit, the activation function f, and the output y.

In the case of the thresholding function, the node simply makes a hard decision to fire (an action potential) or not, depending on whether its weighted input is net excitatory or inhibitory. In the case of the identity function, the node does no processing and relays its input to its output, and in the case of the sigmoid function, it provides a continuous output in the range [0, 1] as a soft decision on whether to fire. Rosenblatt, who coined the term perceptron, showed that the perceptron can learn any linear decision boundary through a simple learning algorithm, where the weights are randomly initialized and then iteratively updated as

$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta\mathbf{w}(t), \qquad \Delta\mathbf{w} = \eta\, e_i\, \mathbf{x}_i$  (26)

for each x_i that is misclassified by the algorithm in the previous iteration, where η is the so-called learning rate and e_i is the error of the perceptron for input x_i. However, if the classes are not linearly separable, then the algorithm loops infinitely without converging on a solution. As most problems have nonlinear decision boundaries, the single perceptron is of limited use.
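A minimal sketch of this perceptron learning rule, under the assumptions of ±1 class labels and the thresholding activation of Eq. (22) (names are illustrative, not from the article):

import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Perceptron learning rule, Eq. (26): update weights only on misclassified inputs.
    Assumes labels y in {-1, +1}; converges only if the classes are linearly separable."""
    n, d = X.shape
    w = np.random.randn(d)
    w0 = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, ti in zip(X, y):
            zi = 1.0 if np.dot(w, xi) + w0 >= 0 else -1.0   # thresholding activation
            e = ti - zi                                      # error for this input
            if e != 0:
                w += eta * e * xi                            # Delta w = eta * e_i * x_i
                w0 += eta * e
                errors += 1
        if errors == 0:        # all training instances correctly classified
            break
    return w, w0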

Although a single perceptron may not be of much practical use, appropriate combinations of perceptrons can be quite powerful and approximate an arbitrarily complex nonlinear decision boundary. Arguably the most popular of classifiers constructed in such a fashion is the ubiquitous multilayer perceptron (MLP). The structure of the MLP is designed to mimic that of the nervous system: a large number of neurons connected together, with information flowing from one neuron to others in a cascade-like structure until it reaches its intended destination (Fig. 12).

Figure 12. (a) The MLP structure mimicking (b) the simplified structure of the nervous system, with information flowing from the input toward the output.

Figure 13 provides a more detailed representation of the MLP structure, on which we identify many architectural properties of this network. An MLP is a feed-forward type neural network, indicating that the information flows in one direction from the input toward the output (as opposed to recurrent networks, which have feedback loops). The first layer, also called the input layer, uses nodes with linear activation functions, where each node has a single input corresponding to one of the features of the input vector. Therefore, the input layer consists of d nodes for a d-dimensional feature vector (an additional input node with a constant value of 1 is also routinely added to serve as the bias term). As a result of the linear activation functions, the input layer does no processing apart from presenting the data to the next layer, called the hidden layer. An MLP may have one or more hidden layers; however, it can be shown that any decision boundary of arbitrary dimensionality and arbitrary complexity can be realized with a single-hidden-layer MLP.

Each input node is fully connected to every node in the following layer through the set of weights W_ji. The hidden layer nodes use a nonlinear activation function, typically a sigmoid. The outputs of the hidden layer nodes are then fully connected either to the next hidden layer's nodes or, more commonly, to the output layer nodes through another set of weights, W_kj. The output layer nodes, of which one exists for each class, also use a sigmoidal activation function. The class labels of the training data are encoded using c binary digits (e.g., class 3 in a 5-class problem is represented as [0 0 1 0 0], whereas class 5 is represented as [0 0 0 0 1]). The logistic function forces each output to be between 0 and 1. The value of each output is then interpreted as the support given by the MLP to the corresponding class. In fact, if a sufficiently dense training dataset is available, if the MLP architecture is chosen appropriately to learn the underlying data distribution, and if the outputs are normalized to add up to 1, then the individual outputs approximate the posterior probabilities of the corresponding classes. The softmax rule is typically used for this normalization (1,9): representing the actual classifier output corresponding to the kth class as z_k(x) and the normalized values as z'_k(x), the approximated posterior probabilities P(ω_k|x) can be obtained as

$P(\omega_k \mid \mathbf{x}) \approx z'_k(\mathbf{x}) = \dfrac{e^{z_k(\mathbf{x})}}{\sum_{c=1}^{C} e^{z_c(\mathbf{x})}} \;\Rightarrow\; \sum_{k=1}^{C} z'_k(\mathbf{x}) = 1.$  (27)

The decision is therefore chosen as the class indicated by the output node that yields the largest value for the given input (that is, choosing the class with the largest posterior probability).

Figure 13. The MLP architecture: d input nodes (x_1, ..., x_d), H hidden layer nodes computing y_j = f(net_j) = f(Σ_{i=1}^{d} w_ji x_i), and c output nodes computing z_k = f(net_k) = f(Σ_{j=1}^{H} w_kj y_j), connected through the weights W_ji and W_kj (i = 1, ..., d; j = 1, ..., H; k = 1, ..., c).
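To make this architecture concrete, the following minimal sketch (not from the article; layer sizes, weights, and names are illustrative assumptions) computes a forward pass through a single-hidden-layer MLP with logistic activations and softmax-normalized outputs, as in Eq. (27):

import numpy as np

def logistic(net, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * net))

def softmax(z):
    """Softmax normalization of Eq. (27); the outputs sum to 1."""
    e = np.exp(z - np.max(z))    # shift for numerical stability
    return e / e.sum()

def mlp_forward(x, W_ji, b_j, W_kj, b_k):
    """Forward pass: d inputs -> H logistic hidden nodes -> c outputs -> softmax."""
    y = logistic(W_ji @ x + b_j)     # hidden layer activations y_j
    z = logistic(W_kj @ y + b_k)     # raw output layer values z_k
    return softmax(z)                # approximate posterior probabilities

d, H, c = 4, 6, 3
rng = np.random.default_rng(0)
posteriors = mlp_forward(rng.normal(size=d),
                         rng.normal(size=(H, d)), np.zeros(H),
                         rng.normal(size=(c, H)), np.zeros(c))
print(posteriors, posteriors.argmax())   # predicted class = largest posterior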

Although the numbers of nodes in the input and output layers are fixed (the number of features and the number of classes, respectively), the number of hidden layer nodes is a free parameter of the algorithm. In general, the predictive power of the MLP (its ability to approximate complex decision boundaries) increases with the number of hidden layer nodes; however, so does its chance of overfitting the data, not to mention its memory and computational burden. Therefore, the smallest number of nodes that provides the desired performance should be used (per Ockham's razor).

The goal of the MLP is to adjust its weights so that a cost function, calculated as the squared error on the labeled training data, is minimized. This is achieved by using an optimization algorithm, typically one of several gradient-descent-based approaches, in which the algorithm tries to find the global minimum of the so-called quadratic error surface defined by the cost function

$J(\mathbf{w}) = \dfrac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} (t_k - z_k)^2,$  (28)

where w represents the entire set of weights of the MLP, N is the number of training data instances, c is the number of classes, t_k is the target (correct) output of the kth output node, and z_k is the actual output of the kth output node. The algorithm by which the weights are iteratively learned is called the BACKPROPAGATION LEARNING RULE, in which the weight update searches for the minimum of the error surface along the direction of the negative gradient of the error:

$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta\mathbf{w}(t), \qquad \Delta\mathbf{w} = -\eta\, \dfrac{\partial J(\mathbf{w})}{\partial \mathbf{w}}.$  (29)

The careful reader will immediately notice that the cost function in Equation 28 does not appear to be a function of the weights w. However, the cost function is in fact an implicit function of the weights, because the values of the output nodes depend on all network weights. The algorithm begins with random initialization of all weights and then calculates the weight update rule for the output weights first. As the actual outputs as well as the desired target outputs are available for the output layer, the update can easily be shown to yield

$\Delta w_{kj} = \eta\, \delta_k\, y_j = \eta\, (t_k - z_k)\, f'(\mathrm{net}_k)\, y_j,$  (30)

where the selection of the logistic sigmoid as the activation function yields the convenient

$f'(\mathrm{net}_k) = f(\mathrm{net}_k)\,[1 - f(\mathrm{net}_k)]$  (31)

as the derivative of the activation function, and where

$\delta_k = (t_k - z_k)\, f'(\mathrm{net}_k)$  (32)

is called the sensitivity of the kth output node.

Representing J as a function of the input layer weights is a little trickier, because the desired outputs of the hidden layer nodes are not known. The trick itself is not a difficult one, however, and involves backpropagating the error, that is, representing the error at the output as a function of the input-to-hidden-layer weights through a series of chain rules, yielding

$\Delta w_{ji} = -\eta\, \dfrac{\partial J}{\partial w_{ji}} = \eta \left[ \sum_{k=1}^{c} \delta_k w_{kj} \right] f'(\mathrm{net}_j)\, x_i = \eta\, \delta_j\, x_i,$  (33)

where

$\delta_j = \left[ \sum_{k=1}^{c} \delta_k w_{kj} \right] f'(\mathrm{net}_j)$  (34)

is called the sensitivity of the jth hidden layer node.
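The update rules of Eqs. (30) through (34) can be summarized in the following sketch of a single backpropagation step for a one-hidden-layer MLP with logistic activations (an illustrative simplification that omits the bias terms; names are not from the article):

import numpy as np

def logistic(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_step(x, t, W_kj, W_ji, eta=0.1):
    """One backpropagation update for a single-hidden-layer MLP with logistic units."""
    # Forward pass
    y = logistic(W_ji @ x)                     # hidden layer outputs y_j
    z = logistic(W_kj @ y)                     # output layer outputs z_k
    # Output layer sensitivities, Eq. (32): delta_k = (t_k - z_k) f'(net_k)
    delta_k = (t - z) * z * (1.0 - z)
    # Hidden layer sensitivities, Eq. (34): delta_j = [sum_k delta_k w_kj] f'(net_j)
    delta_j = (W_kj.T @ delta_k) * y * (1.0 - y)
    # Weight updates, Eqs. (30) and (33)
    W_kj = W_kj + eta * np.outer(delta_k, y)
    W_ji = W_ji + eta * np.outer(delta_j, x)
    return W_kj, W_ji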

Successful implementation of the MLP requires suitable and appropriate choices of its many parameters. These parameters include the learning rate η, the number of hidden layer nodes H, the error goal E (the value of the cost function below which we consider the network trained), the activation function, the way in which the weights are initialized, and so on. Although no hard rules exist for choosing any of these parameters, the following guidelines are often recommended.

4.3.1.1. Learning Rate. The learning rate represents the size of each iterative step taken in search of the minimum error. In theory, the learning rate only affects the convergence time; in practice, however, too large a value can also result in divergence, as the globally optimum solution is seldom found. Too small a choice of η results in a long training time, whereas too large a choice can cause the algorithm to take overly large steps and miss the minimum of the error surface. Typical values of η lie in the [0, 1] range.

4.3.1.2. Number of Hidden Layer Nodes H. As mentioned earlier, H defines the expressive power of the network, and a larger H results in a network that can solve more complicated problems. However, an excessive number of hidden nodes causes overfitting, the phenomenon in which the training error can be made arbitrarily small but the network performance on test data remains poor. Too few hidden nodes, however, may not be enough to solve the more complicated problems. Typically, H is determined by a combination of previous expertise, the amount of data available, the dimensionality and complexity of the problem, trial and error, or validation on an additional training dataset. A common rule of thumb is to choose H such that the total number of weights remains less than N/10, where N is the total number of training data instances available.

4.3.1.3. Error Goal E. If the training data are noisy or overlapping, reaching zero error is usually not possible. Therefore, an error goal is set as a threshold below which training is terminated. Selecting E too large causes premature termination of the algorithm, whereas choosing it too small forces the algorithm to learn the noise in the training data. E is also selected based on prior experience, trial and error, or validation on an additional training dataset, if one is available.

4.3.1.4. Choice of Activation Function and Output Encoding. If the problem is one of classification, then the output node activation function is typically chosen as a sigmoid. If the desired output classes are binary encoded (e.g., [0 0 1 0 0] indicating class 3 in a 5-class problem), then the logistic sigmoid is used. Plus or minus 1 can also be used to encode the outputs (e.g., [-1 -1 +1 -1 -1]), in which case the hyperbolic tangent is employed. It has been empirically shown, however, that better performance may be obtained if softer values are used during training instead of the asymptotic ones, such as 0.05 instead of 0 and ±0.95 instead of ±1. If the MLP is being used strictly for function approximation, a linear activation function is used at the output layer, although the radial basis function network (discussed below) is preferred for such function approximation problems.

4.3.1.5. Momentum. Small plateaus or local minima in the error surface may cause backpropagation to take a long time, or even get stuck at a local minimum, as schematically illustrated in Fig. 14 for the leftmost selection of initial weights. In order to prevent this problem, a momentum term is added, which incorporates the speed at which the weights are learned. This term is loosely related to momentum in physics: a moving object keeps moving unless prevented by outside forces. The modified weight update rule is then obtained as follows:

$\mathbf{w}(t+1) = \mathbf{w}(t) + (1 - \alpha)\,\Delta\mathbf{w}(t) + \alpha\,\Delta\mathbf{w}(t-1),$  (35)

where α is the momentum term. For α = 0, the original backpropagation is obtained: The weight update is determined purely by the gradient descent. For α = 1, the gradient is ignored, and the update is based on the momentum: The weight update continues along the direction in which it was moving previously. Although α is typically taken to be in the [0.9, 0.95] range, an inappropriate choice may lead to oscillations in training. The optimal value is often problem dependent and can be determined by trying several values and choosing the one that leads to the smallest overall error.
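A minimal sketch of the momentum-augmented update of Eq. (35), assuming the current gradient-descent step has already been computed as in the backpropagation sketch above (names are illustrative):

def momentum_update(w, delta_w, prev_delta_w, alpha=0.9):
    """Weight update with momentum, Eq. (35)."""
    step = (1.0 - alpha) * delta_w + alpha * prev_delta_w
    # Return the updated weights and the step, which becomes prev_delta_w next time
    return w + step, step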

Figure 14 shows that the minimum error reached also depends on the weight initialization. If, for example, the rightmost location had been randomly selected, the gradient descent would have easily located the global minimum of the error surface. Although an effective use of the momentum term reduces the dependence of the final solution on the initialization of the weights, it is common practice to train the MLP several times with different random initializations of the weights and to choose the one that leads to the smallest error.

Figure 14. Schematic illustration of the local minima problem in gradient-descent-type optimization: the error surface J(w) over w, with a local and a global minimum, two alternative random initializations of the weight, and the corresponding directions of the negative gradient.

4.3.2. Radial Basis Function (RBF) Networks. As pointed out earlier, the classification problem can also be cast as a function approximation problem, where the goal of the classifier is to approximate the function that provides an accurate mapping of the input patterns to their correct classes. There are many applications, however, where the problem strictly calls for function approximation (also called system identification), or regression, particularly when the output to be determined is not one of several categorical entities but rather a number on a discrete or continuous scale. For example, one may be interested in estimating the severity of dementia, on a scale of 0 to 10, from a series of electroencephalogram (EEG) measurements, or estimating cerebral blood oxygenation levels obtained through near-infrared spectroscopy measurements. Clearly, because the output is not categorical, this is not a classification problem but rather a function approximation problem: We assume that there is an unknown function that maps the features obtained from the signal to a number that represents the sought-after value, and it is this function we wish to approximate from the given training data. It is for such function approximation applications that the RBF networks are most commonly used.

We should note that both the MLP and the RBF network can be used for either classification or function approximation applications, as they are both universal approximators. However, many empirical studies have shown that RBF networks usually perform well on function approximation problems and MLP networks usually perform well on classification problems. As discussed below, the structure of the RBF network makes it a natural fit for function approximation.

Given a set of input/output pairs as training data drawn from an unknown (to be approximated) function, the main idea in the RBF network is to place radial basis functions, also called kernel functions, on top of each training data point and then compute the superposition of these functions to obtain an estimate of the unknown function. Figure 15 illustrates the approach, where the dotted line on the left indicates a 1D unknown function and the stars indicate the training data obtained from this function. A Gaussian-type radial basis function is then placed on top of each training data point (right). The inputs of the training data determine the centers, whereas their desired values determine the amplitudes of the kernel functions. Superposition of these kernel functions then provides an estimate of the unknown function.

Figure 15. Function approximation using radial basis functions: training data points sampled from an unknown function to be approximated (left); Gaussian kernels centered at the training data points, whose superposition approximates the function (right).

The term radial basis function is an appropriately chosen one: Radial refers to the symmetry of the kernel, whereas basis functions are those functions by a linear combination of which any other function (in the same function space) can be represented. The RBF network automates the above-described procedure using the architecture shown in Fig. 16. The RBF network is also a feed-forward network, and its architecture shows great resemblance to that of the MLP.

Figure 16. The RBF network architecture: d input nodes (x_1, ..., x_d), H hidden layer RBFs (receptive fields) φ_1, ..., φ_H computing y_j = φ(net_j = ||x - u_j||), and c output nodes with linear activation computing z_k = net_k = Σ_{j=1}^{H} w_kj y_j, connected through the weights U_ji and W_kj.

The main difference between the RBF and MLP networks is the activation function used in the hidden and output layers. At the hidden layer, the RBF network uses radial basis functions for activation, typically the Gaussian. The hidden layer of the RBF network is also referred to as the receptive field layer. The output of the jth receptive field is calculated as

$y_j = \varphi\big(\mathrm{net}_j = \|\mathbf{x} - \mathbf{u}_j\|\big) = e^{-\left( \|\mathbf{x} - \mathbf{u}_j\| / \sigma \right)^2},$  (36)

where φ is the radial basis function; u_j are the input-to-receptive-field weights, which also constitute the centers of the Gaussian kernels and are usually obtained as prototypes of the training data; and σ, the main free parameter of the algorithm, is the spread (standard deviation) of the Gaussian kernel. The spread should be chosen judiciously: Large σ values result in wider Gaussians, which can be helpful in averaging out the noise in the data; however, local information can also be lost because of this averaging. Small σ values, on the other hand, result in narrow Gaussians, which help retain the local information but may cause spurious peaks where the training data are not very dense, particularly if those areas have noisy data instances.

The hidden layer units essentially compute the distance between the given input and preset centers, called prototypes (determined by the u weights), and calculate a support from each kernel for the given input: The support is large from those kernels whose centers are close to the given input, and small for the others. The kth output node is the linear combination of these supports:

$z_k = f(\mathrm{net}_k) = f\!\left( \sum_{j=1}^{H} w_{kj} y_j \right) = \sum_{j=1}^{H} w_{kj} y_j = \mathbf{y}\mathbf{w}^{T},$  (37)

where the activation function is usually the linear function, H is the number of receptive field units, and w is the set of weights connecting the receptive field layer to the output layer.

An RBF network can be trained in one of four ways:

4.3.2.1. Approach 1. Approach 1 is known as the exact RBF, as it guarantees correct classification of all training data instances. It requires N hidden layer nodes, one for each training instance, and no iterative training is involved. The RBF centers (u) are fixed as the training data points, the spread is set as the variance of the data, and the weights w are obtained by solving a set of linear equations. This approach may lead to overfitting.

4.3.2.2. Approach 2. In order to reduce the overfitting, fewer fixed centers are selected at random from the training data, where H < N. Otherwise, this approach is the same as Approach 1, with no iterative training (a minimal sketch of this approach is given below, after Approach 4).

4.3.2.3. Approach 3. In this approach, the centers are obtained through unsupervised learning (clustering). The spreads are obtained as the variances of the clusters, and w is obtained through the LEAST MEAN SQUARES (LMS) algorithm. Both clustering and the LMS algorithm are iterative, and therefore this approach is more computationally expensive than the previous ones. However, it is quite robust to overfitting and typically provides good results; it is the most commonly used procedure.

4.3.2.4. Approach 4. All unknowns are obtained through gradient-based supervised learning. This is the most computationally expensive approach, and with the exception of certain specialized applications, the additional computational burden does not necessarily translate into better performance.
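The following minimal sketch illustrates Approach 2 under stated assumptions (Gaussian kernels as in Eq. (36), centers drawn at random from the training data, and output weights obtained by ordinary linear least squares); the function names are hypothetical:

import numpy as np

def rbf_design_matrix(X, centers, sigma):
    """Gaussian receptive-field outputs, Eq. (36), for every input/center pair."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(dists / sigma) ** 2)

def train_rbf(X, T, H=10, sigma=1.0, rng=None):
    """Approach 2: H centers drawn at random from the training data,
    output weights obtained by linear least squares (no iterative training)."""
    rng = np.random.default_rng() if rng is None else rng
    centers = X[rng.choice(len(X), size=H, replace=False)]
    Y = rbf_design_matrix(X, centers, sigma)       # N x H hidden layer outputs
    W, *_ = np.linalg.lstsq(Y, T, rcond=None)      # output weights, Eq. (37)
    return centers, W

def rbf_predict(X, centers, W, sigma=1.0):
    return rbf_design_matrix(X, centers, sigma) @ W   # linear output layer

# 1D function approximation example
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 40).reshape(-1, 1)
T = np.sin(X).ravel() + 0.05 * rng.normal(size=40)
centers, W = train_rbf(X, T, H=12, sigma=0.8, rng=rng)
print(rbf_predict(X[:5], centers, W, sigma=0.8))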

In comparing the MLP and the RBF network, their main common denominator is that they are both universal approximators. They have several differences, however: (1) The MLP generates a global approximation to the decision boundaries, as opposed to the RBF, which generates more local approximations; (2) the MLP is more likely to battle with local minima and flat valleys than the RBF and hence, in general, has longer training times; (3) as MLPs generate global approximations, they excel in extrapolating, that is, classifying instances that are outside of the feature space represented by the training data (extrapolating, however, may mean dealing with outliers); (4) MLPs typically require fewer parameters than RBFs to approximate a given function with the same accuracy; (5) all parameters of an MLP are trained simultaneously, whereas RBF parameters can be trained separately in an efficient hybrid manner; (6) RBFs have a single hidden layer, whereas MLPs can have multiple hidden layers; (7) the hidden neurons of an MLP compute the inner product between the input vector and the weight vector, whereas RBFs compute the distance between the input and the radial basis function centers; (8) the hidden layer of an RBF is nonlinear and its output layer is usually linear, whereas all layers of an MLP are nonlinear. The observation that MLPs typically do better on classification-type problems and RBFs do better on function-approximation-type problems is more a matter of historic preference based on empirical success.

5. PERFORMANCE EVALUATION

Once model selection and training are completed, the generalization performance of the classifier needs to be evaluated on previously unseen data to estimate its true performance on field data.

One of the simplest, and hence most popular, methods for evaluating the generalization performance is to split the entire training data into two partitions, where the first partition is used for actual training and the second partition is used for testing the performance of the algorithm. The performance on this latter dataset is then used as an estimate of the algorithm's true (and unknown) field performance. However, estimating the algorithm's future performance in this manner still presents some legitimate concerns:

1. As a portion of the dataset is set aside for testing, not all of the original data instances are available for training. Because an adequate and representative training dataset is of paramount importance for successful training, this approach can result in suboptimal performance.

2. What percentage of the original data should be set aside for testing? If a small portion is used for testing, then the estimate of the generalization performance may be unreliable. If a large portion is used for testing, less data can be used for training, leading to poor training.

3. We can also never know whether the test data that are set aside, regardless of their size, are representative of the data the network will see in the future. After all, the instances in the test dataset may lie mostly near, or mostly far away from, the decision boundary, making the test dataset too difficult or too simple to classify, respectively.


One approach that has been accepted to provide a reasonably good estimate of the true generalization performance uses cross validation to address the above-mentioned issues. In k-fold cross validation, the entire available training dataset is split into k > 2 partitions, creating k blocks of data. Of these k blocks, k - 1 are used for training and the remaining kth block is used for testing. This procedure is then repeated k times, using a different block for testing in each case. The k test performances obtained are averaged, and the average test performance is declared as the estimate of the true generalization performance of the algorithm. The procedure is illustrated in Fig. 17.

Figure 17. k-fold cross validation: the entire original training dataset is split into k blocks; in trial i, block i is used for testing and the remaining k - 1 blocks are used for training, yielding test performance P_i; the estimate of the true generalization performance is the average P̄ = (1/k) Σ_{t=1}^{k} P_t.
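A minimal sketch of the k-fold procedure (illustrative only; `train_and_test` stands for any user-supplied routine that trains a classifier and returns its test performance):

import numpy as np

def kfold_cross_validation(X, y, train_and_test, k=10, rng=None):
    """Estimate generalization performance by k-fold cross validation.
    train_and_test(X_tr, y_tr, X_te, y_te) must train a classifier and
    return its performance (e.g., accuracy) on the test block."""
    rng = np.random.default_rng() if rng is None else rng
    indices = rng.permutation(len(X))
    blocks = np.array_split(indices, k)          # k roughly equal blocks
    performances = []
    for i in range(k):
        test_idx = blocks[i]
        train_idx = np.concatenate([blocks[j] for j in range(k) if j != i])
        performances.append(train_and_test(X[train_idx], y[train_idx],
                                           X[test_idx], y[test_idx]))
    return np.mean(performances), np.array(performances)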

The procedure has several important advantages. First, all data instances are used for training as well as for testing, but never at the same time, allowing full utilization of the data. Second, as the procedure is repeated k times, the probability of an unusually lucky or unlucky partitioning is reduced through averaging. Third, a confidence interval for the true generalization performance can be obtained using a two-tailed z-test if k > 30, or a two-tailed t-test otherwise.

For example, the 100(1 - α)% two-sided confidence interval obtained from the t-test is

$P = \bar{P} \pm t_{\alpha/2,\,k-1}\, \dfrac{s}{\sqrt{k}},$  (38)

where P̄ is the mean of the k individual test performances P_1, ..., P_k; s is the standard error of the k individual performance values (which can be estimated by calculating their standard deviation); t_{α/2,k-1} is the critical value of the t-distribution with k - 1 degrees of freedom at the significance level α (corresponding to the desired confidence level); and P is the range of values in which the true generalization performance is considered to lie.
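A sketch of the corresponding interval computation, assuming SciPy is available for the critical value of the t-distribution (names are illustrative):

import numpy as np
from scipy import stats

def confidence_interval(performances, alpha=0.05):
    """Two-sided 100(1 - alpha)% t-interval for the mean test performance, Eq. (38)."""
    k = len(performances)
    p_mean = np.mean(performances)
    s = np.std(performances, ddof=1)              # sample standard deviation
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=k - 1)
    half_width = t_crit * s / np.sqrt(k)
    return p_mean - half_width, p_mean + half_width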

The choice of k is of course data dependent: Selecting k too large divides the data into many partitions, allowing a larger amount of data for training; however, it also widens the confidence interval because of the large variation in the test performances that results from testing on very few instances. On the other hand, choosing k too small creates too few partitions and reduces the amount of data available for training. For sufficiently large datasets, k is typically taken as 5 or 10; however, for smaller datasets, k may be chosen larger to allow more training data. The extreme case, where k = N, is also known as leave-one-out validation. In this case, all of the data but one instance are used for each training session, and testing is done solely on the remaining instance, repeating the process N times. Although this method provides the best estimate of the generalization performance, it usually yields the widest confidence interval, not to mention an additional computational burden.

6. OTHER TOPICS & FUTURE DIRECTIONS

6.1. Other Common Pattern Recognition Approaches

As this chapter is meant to provide an introductory background to pattern recognition, only those topics that are most fundamental and only those algorithms that are most commonly used have been described in the previous paragraphs. The field of pattern recognition, however, comprises a vast number of algorithms, many of which are specifically geared toward certain applications falling outside the scope of this article. Nevertheless, a few additional well-established algorithms should be mentioned to provide breadth and topical name recognition.

As a preprocessing step, a technique that has some similarities to principal component analysis, but is intended for solving a different problem, is INDEPENDENT COMPONENT ANALYSIS (ICA). Unlike PCA, which seeks to find directions in feature space that best represent the data in the mean square error sense, ICA seeks to find directions along which the data components are most independent from each other (12). Typically used for BLIND SOURCE SEPARATION applications (13), ICA attempts to decouple the different components of composite data that are originally generated by different sources. Isolating the electrooculogram (caused by eye-blink artifacts) from the EEG, or isolating the fetal ECG from that of the mother, are traditional applications of ICA.

As mentioned in the introduction, this chapter focuses on supervised algorithms, for which the availability of a training dataset with pre-assigned class labels is assumed. If such a training dataset is not available, unsupervised clustering algorithms can be used (1,2). The most commonly used examples of such algorithms include k-means (1), several variations of fuzzy c-means (14), vector quantization (8,15), adaptive resonance theory (ART) networks (16), Kohonen networks (8,15), and variations of self-organizing maps (SOMs) (15).

In supervised learning, several other prominent approaches are also commonly used, such as time-delay neural networks (17), recurrent networks (such as the Hopfield network) (8), neuro-fuzzy algorithms (18), ARTMAP networks (19), learning vector quantization (8), genetic algorithms (20), swarm intelligence (21,22), and decision-tree-based approaches (1,23), among others. Also within the realm of supervised learning, multiple classifier (ensemble) systems and kernel-machine-based approaches (such as support vector machines) are rapidly gaining momentum; these are discussed in more detail below as current and developing research areas.

6.2. Current and Developing Research Areas

6.2.1. Kernel-Based Learning and Support Vector Machines. An area that has seen much recent interest is kernel-based learning, with particular emphasis on support vector machines (SVMs). Introduced by Vapnik, SVMs (also called maximum margin classifiers) try to find the optimal decision boundary by maximizing the margin between the boundaries of different classes (24-26), as shown in Fig. 18. To do so, SVMs identify those instances of each class that define the boundary of that class in the feature space. These instances, considered to be the most informative ones, are called the support vectors, and they are calculated through a quadratic-programming-type constrained optimization algorithm.

Identifying margin-defining instances may be easy in the simple 2D illustrative example of Fig. 18, where the classes are linearly separable, but SVMs really shine in their ability to find such instances in more complex, linearly nonseparable, high-dimensional spaces. In fact, one of the main novelties of SVMs is their ability to use a kernel function to map the feature space into a higher dimensional space, where the classes are linearly separable, and to find the support vectors in that high dimensional space. Furthermore, what makes SVMs even more popular is the fact that they do so without actually requiring any calculations in the higher dimensional space, through the procedure affectionately known as the kernel trick. The trick is figuratively illustrated in Figs. 19a and 19b, and on the famous XOR problem in Figs. 19c and 19d.

Figure 18. Support vectors as maximum margin classifiers: among the viable decision boundaries between two classes (plotted in the feature 1 versus feature 2 space), the optimal decision boundary maximizes the margin defined by the support vectors.

Figure 19. (a) A nonlinearly separable two-class problem (b) made linearly separable with the use of a kernel function φ(·). (c) The nonlinearly separable XOR problem, x = [x_1 x_2], (d) made linearly separable with the use of an exponential kernel, φ_1(x) = e^{-||x - t_1||} with t_1 = [1 1]^T and φ_2(x) = e^{-||x - t_2||} with t_2 = [0 0]^T.
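As a small numerical illustration of the idea in Figs. 19c and 19d (a sketch assuming Gaussian kernels centered at t_1 = [1 1]^T and t_2 = [0 0]^T, as in the figure; this is not code from the article):

import numpy as np

# XOR inputs and labels: not linearly separable in the original feature space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Map each input through two Gaussian kernels centered at t1 = [1,1] and t2 = [0,0]
t1, t2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])
phi = np.column_stack([
    np.exp(-np.linalg.norm(X - t1, axis=1) ** 2),   # phi_1(x)
    np.exp(-np.linalg.norm(X - t2, axis=1) ** 2),   # phi_2(x)
])
print(np.round(phi, 3))
# In (phi_1, phi_2) space, (0,1) and (1,0) map to the same point while (0,0) and
# (1,1) map to two distinct points, so a single line now separates the two classes.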

Developing computationally less expensive optimization algorithms for training SVMs, as well as extending the algorithm to multiclass problems, are ongoing research areas.

6.2.2. Mixture Models and Ensemble Learning. Another rapidly developing area within the pattern recognition community is so-called ensemble systems, also known under various names such as multiple classifier systems (MCS), multiple expert models, or mixture models. The idea behind ensemble systems is to employ several classifiers, instead of a single classifier, to solve the particular pattern recognition problem at hand (27). Several interesting properties of ensemble systems have been established over the last several years that have made them quite popular. For example, Schapire has shown that, using an algorithm called boosting (28) and later AdaBoost (29), a weak learner (an algorithm that generates classifiers that merely do better than random guessing) can be turned into a strong learner that generates a classifier with an arbitrarily small error, with a guarantee that the combined system will perform as well as or better than the best individual classifier that makes up the ensemble. Several variations of such ensemble systems can be seen in hierarchical mixtures of experts (HME) (30), which use the expectation maximization (EM) algorithm (31) for training, or in stacked generalization (32), which uses a second-level classifier trained on the outputs of the first-level classifiers. It has been well established that much is to be gained from combining classifiers if the classifiers are as independent, or as diverse, as possible. That way, the individual weakness or instability of each classifier can be effectively averaged out by the combination process (such as voting-based approaches), which may significantly improve the generalization performance of the classification system. A specific area of interest among ensemble systems researchers involves developing effective techniques for measuring diversity among classifiers, as well as investigating various combination schemes.
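As an illustration of the simplest combination scheme mentioned above, the following sketch performs majority voting over an ensemble of independently trained classifiers (the classifiers here are toy stand-ins, not any specific algorithm from the article):

import numpy as np

def ensemble_predict(classifiers, x):
    """Combine an ensemble by simple majority voting over individual predictions."""
    votes = np.array([clf(x) for clf in classifiers])   # each clf maps x -> class label
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]

# Example with three toy "classifiers" standing in for independently trained models
classifiers = [lambda x: 0, lambda x: 1, lambda x: 1]
print(ensemble_predict(classifiers, x=None))   # prints 1 (majority vote)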

More recently, ensemble-systems-based approaches have also been proposed for such applications as incremental learning, data fusion, and confidence estimation using the Learn++ algorithm (33). Incremental learning is defined as the ability of a classifier to learn additional information from new data that may later become available, even if the new data introduce instances from previously unseen classes and even if the previous data are no longer available. Learn++ generates an ensemble of classifiers for each dataset that becomes available, where the individual classifiers are specifically geared toward learning the novel information presented by the new data. The algorithm is also used for data fusion applications, where the additional data that become available may consist of a different set of features. Finally, as ensemble systems employ several classifiers, estimating the confidence of the classification system also becomes possible without necessarily resorting to cross-validation techniques.

6.3. Applications

A large number of applications benefit from current and near-future research in pattern recognition. Some of these areas are in fact very closely related to active research areas in biomedical engineering. For example, computational neuroscience is a rapidly expanding area in which researchers try to understand the more intricate details of how the nervous system really works and propose new models of neurons and local neural circuits, as well as new learning rules based on these discoveries. Spiking neural networks are one type of neural network actively being considered for this area.

Another area related to biomedical engineering is bioinformatics and genomics, where new and existing pattern recognition algorithms are evaluated in gene sequence analysis, the development of artificial immune systems, and genomic data mining. Vision and image processing, biometric identification (face, fingerprint, iris scan, etc.), handwritten character recognition, auditory data and speech processing, speech and speaker identification, and the diagnosis of various cardiovascular and neurological disorders based on features obtained from noninvasive tests or bioimaging modalities are all biomedical application-oriented areas of active research.

Among all biomedical engineering-related applications of pattern recognition, those that involve analysis of biological signals for detection or identification of certain pathological conditions have enjoyed the most attention. For example, analysis of various features extracted from the ECG to diagnose a broad spectrum of cardiovascular disorders, including arrhythmias and heart rate variability, has been a popular area of research. Using EEG signals to diagnose or detect various neurological conditions, from detection of characteristic EEG waves to advanced detection of epileptic conditions to automated early diagnosis of Alzheimer's disease, has also enjoyed significant attention. Analysis of respiratory signals to detect pulmonary disorders such as apnea and hypopnea, analysis of Raman spectra for early detection of skin cancer, analysis of the effects of aging on gait patterns, and analysis of various imaging modalities (such as MRI, PET scans, and digital mammograms) for automated detection of tumors or other abnormalities are just a few of the many applications of pattern recognition in biomedical engineering.

Of course, pattern recognition is also actively used in nonbiomedical applications, such as general signal processing and telecommunications applications, robotics and dynamic control, financial or other time-series data analysis, military and computer security applications, remote sensing and oceanographic applications, geographical information systems, nondestructive evaluation, and hazardous compound identification based on chemical sensor data, among many others.

6.4. Some Final Thoughts

Literally hundreds of pattern recognition approaches and algorithms exist, and it is often asked whether any one of them is consistently better than the others. A cleverly named theorem, called the no-free-lunch theorem, tells us that no algorithm is universally superior to all others in the absence of any additional information (34). In fact, it can be proven that problems exist for which random guessing will outperform any other algorithm. Although such problems may not be of much practical interest, there is still a lesson to be learned: The choice of the appropriate algorithm almost invariably depends on the nature of the problem, the distribution that provides the data for that problem, and the prior knowledge available to the designer.

6.5. For Further Reading

The nature of this chapter makes it impossible to provide a more detailed discussion of all topics, or to provide specific algorithms for all techniques. Therefore, the author's goal has been to provide a fundamental background upon which interested readers can build. The references cited within the article will provide additional depth and breadth; however, a topic-specific list of textbook references is also provided below.

Pattern Classification (2nd edition) by Duda, Hart and Stork (Wiley, 2000), Statistical Pattern Recognition by Webb (Wiley, 2002), Pattern Recognition (2nd edition) by Theodoridis and Koutroumbas (Academic Press, 2003), and The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (Springer, 2002) cover a broad spectrum of pattern recognition topics in considerable detail. For algorithms and techniques that are specifically related to neural networks, Neural Networks: A Comprehensive Foundation by Haykin (Prentice Hall, 1998), Neural Networks for Pattern Recognition by Bishop (Oxford University Press, 1995), and Pattern Recognition and Neural Networks by Ripley (Cambridge University Press, 1996) are classic texts that provide a very comprehensive theoretical coverage of neural-network-based approaches. For practitioners, Netlab by Nabney and Computer Manual in Matlab to Accompany Pattern Classification by Stork and Yom-Tov include Matlab functions for many of the popular pattern recognition algorithms, whereas Neural and Adaptive Systems: Fundamentals through Simulations by Principe, Euliano and Lefebvre (Wiley, 1999) and Pattern Recognition: Concepts, Methods and Applications by Marques de Sa (Springer, 2001) both include simulation software on CD that allows readers to simulate the algorithms discussed in their respective books. Pattern Recognition in Medical Imaging by Meyer-Baese (Academic Press, 2003) also covers most pattern recognition approaches, along with feature extraction and signal/image processing techniques that are particularly applicable to large dimensional datasets, such as medical images. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods by Cristianini and Shawe-Taylor (Cambridge University Press, 2000) provides an excellent introduction to kernel-based methods, whereas Combining Pattern Classifiers: Methods and Algorithms by Kuncheva (Wiley, 2004) is the only text of its kind covering ensemble-based systems. For more light-hearted and recreational reading, Intelligent Data Analysis: An Introduction by Berthold and Hand (Springer, 2003) and How to Solve It: Modern Heuristics by Michalewicz and Fogel (Springer, 2000) can be recommended. Finally, also of interest is PRTools, a set of pattern recognition functions organized as a Matlab toolbox, available from www.prtools.org.

A large number of journals, such as the IEEE Transactions on Pattern Analysis and Machine Intelligence, Neural Networks, Knowledge and Data Engineering, Fuzzy Systems, and Systems, Man and Cybernetics, as well as the Transactions on Biomedical Engineering, Elsevier's Pattern Recognition, Neural Networks, Information Fusion, and Neurocomputing, and many others, publish papers on the theory and applications of pattern recognition. For the most recent advances in pattern recognition, readers may wish to attend some of the major conferences, such as the International Joint Conference on Neural Networks (IJCNN), the International Conference on Artificial Neural Networks (ICANN), the International Conference on Pattern Recognition (ICPR), the Neural Information Processing Systems conference (NIPS), and the International Workshop on Multiple Classifier Systems (MCS). All of these conferences feature sessions on biomedical applications, and all major biomedical engineering conferences, such as IEEE's Engineering in Medicine and Biology Conference, typically feature a healthy number of presentations on pattern recognition.

BIBLIOGRAPHY

1. R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2000.

2. A. K. Jain, R. P. W. Duin, and J. Mao, Statistical pattern recognition: a review. IEEE Trans. Pattern Analysis Machine Intell. 2000; 22(1):4-37.

3. K. Fukunaga and J. Mantock, Nonparametric discriminant analysis. IEEE Trans. Pattern Analysis Machine Intell. 1983; 5:671-678.

4. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. San Diego, CA: Academic Press, 1990.

5. M. Akay, ed., Time Frequency and Wavelets in Biomedical Signal Processing. Piscataway, NJ: IEEE Press, 1998.

6. R. Kohavi and G. H. John, Wrappers for feature subset selection. Artific. Intell. 1997; 97(1-2):273-324.

7. A. Jain and D. Zongker, Feature selection: evaluation, application, and small sample performance. IEEE Trans. Pattern Analysis Machine Intell. 1997; 19(2):153-158.

8. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Englewood Cliffs, NJ: Prentice Hall, 1998.

9. C. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press, 1995.

10. D. Hush and B. Horne, Progress in supervised neural networks. IEEE Signal Proc. Mag. 1993; 10(1):8-39.

11. A. Webb, Statistical Pattern Recognition, 2nd ed. New York: Wiley, 2002.

12. A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York: Wiley, 2001.

13. A. Hyvarinen and E. Oja, A fast fixed point algorithm for independent component analysis. Neural Computation 1997; 9(7):1483-1492.

14. N. R. Pal, K. Pal, J. M. Keller, and J. Bezdek, A possibilistic fuzzy c-means clustering algorithm. IEEE Trans. Fuzzy Syst. 2005; 13(4):517-530.

15. T. Kohonen, Self-Organizing Maps. Berlin, Germany: Springer, 2000.

16. G. A. Carpenter and S. Grossberg, The ART of adaptive pattern recognition by a self-organizing neural network. Computer 1988; 21(3):77-88.

17. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang, Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Proc. 1989; 37(3):328-339.

18. J. R. Jang, C. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Englewood, NJ: Pearson, 1996.

19. G. A. Carpenter, S. Grossberg, S. Markuzon, J. H. Reynolds, and D. B. Rosen, Fuzzy ARTMAP: a neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Trans. Neural Networks 1992; 3(5):698-713.

20. M. Mitchell, An Introduction to Genetic Algorithms (Complex Adaptive Systems). Cambridge, MA: The MIT Press, 1998.

21. R. C. Eberhart, Y. Shi, and J. Kennedy, Swarm Intelligence. Burlington, MA: Morgan Kaufmann, 2001.

22. E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems. Cambridge, UK: Oxford University Press, 1999.

23. J. R. Quinlan, C4.5: Programs for Machine Learning. Burlington, MA: Morgan Kaufmann, 1992.

24. C. J. C. Burges, A tutorial on support vector machines for pattern recognition. Data Mining Knowledge Discov. 1998; 2(2):121-167.

25. V. Vapnik, The Nature of Statistical Learning Theory, 2nd ed. Berlin: Springer, 2000.

26. B. Scholkopf and A. Smola, Learning with Kernels: Support Vector Machines, Optimization and Beyond. Cambridge, MA: MIT Press, 2002.

27. L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. New York: Wiley, 2004.

28. R. E. Schapire, The strength of weak learnability. Machine Learn. 1990; 5:197-227.

29. Y. Freund and R. Schapire, A decision theoretic generalization of on-line learning and an application to boosting. Comput. Syst. Sci. 1997; 57(1):119-139.

30. R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, Adaptive mixtures of local experts. Neural Computation 1991; 3:79-87.

31. M. I. Jordan and R. A. Jacobs, Hierarchical mixtures of experts and the EM algorithm. Neural Computation 1994; 6(2):181-214.

32. D. H. Wolpert, Stacked generalization. Neural Networks 1992; 5(2):241-259.

33. R. Polikar, L. Udpa, S. Udpa, and V. Honavar, Learn++: an incremental learning algorithm for supervised neural networks. IEEE Trans. Syst. Man Cybernet. (C) 2001; 31(4):497-508.

34. D. H. Wolpert and W. G. Macready, No free lunch theorems for optimization. IEEE Trans. Evolutionary Computation 1997; 1(1):67-82.
