Multiclass Filters by a Weighted Pairwise Criterion


1412 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 58, NO. 5, MAY 2011

Multiclass Filters by a Weighted Pairwise Criterion for EEG Single-Trial Classification

Haixian Wang, Member, IEEE

Abstract: The filtering technique for dimensionality reduction of multichannel electroencephalogram (EEG) recordings, modeled using common spatial patterns and its variants, is commonly used in two-class brain-computer interfaces (BCI). For a multiclass problem, the optimization of certain separability criteria in the output space is not directly related to the classification error of EEG single-trial segments. In this paper, we derive a new discriminant criterion, termed weighted pairwise criterion (WPC), for optimizing multiclass filters by minimizing the upper bound of the Bayesian error that is intentionally formulated for classifying EEG single-trial segments. The WPC approach pays more attention to close class pairs that are more likely to be misclassified than far away class pairs that are already well separated. Moreover, we extend WPC by integrating temporal information of EEG series. Computationally, we employ the rank-one update and power iteration technique to optimize the proposed discriminant criterion. The experiments of multiclass classification on the datasets of BCI competitions demonstrate the efficacy of the proposed method.

Index Terms: Bayesian classification error, brain-computer interfaces (BCI), common spatial patterns (CSP), multiclass filters, weighted pairwise criterion (WPC).

I. INTRODUCTION

ACCURATE classification of electroencephalogram (EEG) signals is the core problem in the community of brain-computer interfaces (BCI) [22]. A large number of modern signal processing and machine learning techniques have been used and developed [1], [8], [15]. One powerful and widely used method for processing multichannel EEG series is the filtering technique, represented by the common spatial patterns (CSP) [4]. The CSP approach, designed for the two-class problem [12], [16], [18], [19], seeks a few filters such that the ratio of the filtered variances between the two populations is maximized (or minimized). Because CSP makes use of only spatial information, spatio-temporal versions were also developed, for example, the common spatio-spectral patterns [11] and the local temporal CSP [20]. Many variants of CSP are reviewed in [4].

Manuscript received June 20, 2010; revised October 14, 2010 and December 30, 2010; accepted January 3, 2011. Date of publication January 13, 2011; date of current version April 20, 2011. This work was supported in part by the National Natural Science Foundation of China under Grants 61075009 and 60803059, in part by the Qing Lan Project, and in part by the Fund for the Program of Excellent Young Teachers at Southeast University.

The author is with the Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing 210096, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TBME.2011.2105869

The CSP approach was originally suggested for the two-class paradigm. Multiclass extensions have been investigated in the literature. One trivial extension divides the multiclass problem into many two-class problems and applies CSP repeatedly [4], [7]. Another conventional extension is the joint approximate diagonalization (JAD) of $M$ covariance matrices, where $M$ is the number of classes [7]. This is based on the observation that CSP simultaneously diagonalizes two covariance matrices. The JAD was further investigated from the perspective of mutual information and brain source separation [9], [10]. The JAD approach is actually a decomposition technique rather than a classification method.

Recently, Zheng and Lin [23] presented a multiclass extension via Bayesian classification error estimation. The discriminant criterion is derived by minimizing the upper bound of the Bayesian error of classifying $\omega^T x$, where $x$ is an EEG signal recorded at a specific time point and $\omega$ is a filter. While this is a reasonable criterion to optimize spatial filters, a more direct approach uses the same features for optimizing spatial filters as for classification. Denoting a multivariate time series of band-pass filtered single-trial EEG by $X$, these features are in CSP $\omega^T X X^T \omega$ (cf., [4, eq. (2)], [16, eq. (2)], and [23, eqs. (29)-(31)]). The quantity $\omega^T X X^T \omega$ is the variance of the band-pass filtered EEG signals. It is equal to band power. So, the band power $\omega^T X X^T \omega$ with an appropriate spatial filter $\omega$ in fact corresponds to the effect of event-related desynchronization/synchronization (ERD/ERS), which is an effective neurophysiological feature for classification of brain activities [4]. Specifically, the so-called idle rhythms, reflected around 10 Hz, can be observed over motor and sensorimotor areas in most persons. These idle rhythms are attenuated when processing motor activity. This physiological phenomenon is termed the ERD effect because of the loss of synchrony in the neural population. By contrast, the rebound of the rhythmic activity is termed ERS.

In this paper, we directly target the classification of $\omega^T X X^T \omega$, for which the upper bound of the Bayesian classification error is estimated. Accordingly, by minimizing the upper bound of the Bayesian classification error, we develop a new discriminant criterion that is directly related to the classification of EEG single-trial segments. The proposed criterion takes the form of a sum over weighted class pairs and is referred to as the weighted pairwise criterion (WPC). The WPC approach puts heavier weights onto close class pairs, which are more likely to be misclassified, and de-emphasizes the influence of far away class pairs, where the classes are already well separated. This weighting strategy helps make the criterion suited to producing separability in the output space, as has been observed in pattern classification problems [13], [14]. Computationally,

0018-9294/$26.00 2011 IEEE



the proposed WPC method is conveniently implemented by the rank-one update and power iteration technique. Moreover, we compare WPC with [23]. The classification targets of the two methods are different, and so the discriminant criteria obtained by minimizing the upper bound of the Bayesian classification error are also different. We also extend WPC by integrating temporal information of EEG series into the covariance matrix formulation. The efficacy of the proposed approach is demonstrated on the classification of four motor imagery tasks on two datasets of BCI competitions.

The remainder of this paper is organized as follows. In Section II, the CSP and its multiclass situation via Bayesian classification error estimation based on EEG sampling points are briefly reviewed. In Section III, we derive the upper bound of the Bayesian error that is intentionally formulated for classifying EEG single-trial segments, then propose the WPC to minimize the upper bound, and give the optimization procedure by using the rank-one update and power iteration technique. The comparison with [23] and the extension by integrating temporal information are presented in Section IV. The experimental results are presented in Section V. Finally, Section VI concludes the paper.

II. BRIEF REVIEW OF CSP AND ITS MULTICLASS SITUATION

Let $x \in \mathbb{R}^K$ be an EEG signal at a specific time point with $K$ electrodes. We view $x$, recorded while performing a certain mental task, as a $K$-dimensional random variable that is generated from a Gaussian distribution. Suppose that $C = \{c_1, \ldots, c_M\}$ is the set of mental conditions to be investigated. We consider the multiclass ($M > 2$) classification problem that assigns EEG single-trial segments into the $M$ predefined brain states. Given class $c_l$ ($l \in \{1, 2, \ldots, M\}$), the random variable $x$ is assumed to be Gaussian distributed according to $x \mid c_l \sim N(0, \Sigma_l)$, where $\Sigma_l$ is the covariance matrix. The Gaussian assumption will not sacrifice generality when studying linear filters and statistics up to second order [10]. For the purpose of classification, we wish to learn $G$ ($G < K$) filters (linear transformation vectors) $\omega_g \in \mathbb{R}^K$ using the finite training data such that the filtered features are more discriminative for predicting class labels than the raw EEG data. Hereafter, the terms conditions and class labels are used interchangeably.

A. CSP: Two-Class Paradigm

The CSP method provides a powerful way of extracting EEG features related to the modulation of ERD/ERS. The CSP algorithm applies to the two-class situation only. It solves for filters such that the projected EEG series have the maximum ratio of variances between the two classes. Maximizing the variances actually characterizes the ERD/ERS effects. Let $X = [x_1, \ldots, x_N] \in \mathbb{R}^{K \times N}$ be a segment of EEG series during one trial, where $x_i$ is the multichannel EEG signal at a specific time point $i$, and $N$ denotes the number of sampled time points.

The CSP approach solves spatial filters by simultaneously diagonalizing the estimated covariance matrices under the two conditions. The covariance matrices of the two classes are estimated as

$$\Sigma_l = \frac{1}{N |I_l|} \sum_{t \in I_l} X_t X_t^T \tag{1}$$

where $I_l$ ($l \in \{1, 2\}$) denotes the set of indices of trials belonging to class $c_l$, and $|I_l|$ is the cardinality of set $I_l$. The spatial filters of CSP can be alternatively formulated as an optimization problem [4], [18]

$$\omega = \arg \{\max, \min\}_{\omega \in \mathbb{R}^K} \frac{\omega^T \Sigma_1 \omega}{\omega^T \Sigma_2 \omega} \tag{2}$$

where the notation $\{\max, \min\}$ means that maximizing or minimizing the Rayleigh quotient is of equal interest. The spatial filters are thus obtained by solving the generalized eigenvalue equation

$$\Sigma_1 \omega = \lambda \Sigma_2 \omega. \tag{3}$$

The eigenvalue $\lambda$ measures the ratio of variances between the two classes. For the purpose of classification, the filters are specified by choosing several generalized eigenvectors associated with eigenvalues from both ends of the eigenvalue spectrum. The variances of the spatially filtered EEG data are discriminative features, which are input into a classifier.
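As an illustration of (1)-(3), the following is a minimal NumPy sketch of two-class CSP. It is not the author's implementation: the function names, the whitening route to the generalized eigenproblem, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def class_covariance(trials):
    """Estimate a class covariance as in (1): sum of X_t X_t^T over trials, divided by N * |I_l|."""
    n_samples = trials[0].shape[1]
    return sum(X @ X.T for X in trials) / (n_samples * len(trials))

def csp_filters(sigma1, sigma2, n_pairs=1):
    """Solve Sigma_1 w = lambda * Sigma_2 w and keep eigenvectors from both ends of the spectrum."""
    # Whitening with Sigma_2^(-1/2) turns the generalized problem into an ordinary symmetric one.
    evals2, evecs2 = np.linalg.eigh(sigma2)
    whitener = evecs2 @ np.diag(evals2 ** -0.5) @ evecs2.T
    lam, v = np.linalg.eigh(whitener @ sigma1 @ whitener)  # eigenvalues in ascending order
    w = whitener @ v                                        # map back to sensor space
    keep = np.r_[np.arange(n_pairs), np.arange(len(lam) - n_pairs, len(lam))]
    return lam[keep], w[:, keep]

# Synthetic two-class data (illustrative only): class 1 is strong on channel 0, class 2 on channel 3.
rng = np.random.default_rng(0)
K, N = 4, 200
gain1 = np.diag([3.0, 1.0, 1.0, 1.0])
gain2 = np.diag([1.0, 1.0, 1.0, 3.0])
trials1 = [gain1 @ rng.standard_normal((K, N)) for _ in range(20)]
trials2 = [gain2 @ rng.standard_normal((K, N)) for _ in range(20)]
sigma1, sigma2 = class_covariance(trials1), class_covariance(trials2)
lam, W = csp_filters(sigma1, sigma2)
print(lam)  # variance ratios at the two ends of the spectrum
```

The returned eigenvalues are the variance ratios of (3); the smallest and largest ones correspond to filters that favor class 2 and class 1, respectively.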

B. Multiclass Situation by Bayesian Error Estimation

The CSP is suitable for two-class classification only. Zheng and Lin [23] addressed the multiclass paradigm via Bayesian classification error estimation. By the assumption that the distribution of the EEG sampling point $x$ conditioned on class $c_l$ is Gaussian, i.e., $p_l = N(x; 0, \Sigma_l)$, the filtered EEG signal $y = \omega^T x$ is also Gaussian distributed according to $N(y; 0, \omega^T \Sigma_l \omega)$. Based on the Gaussian distribution, Zheng and Lin [23] obtained the upper bound of the Bayesian error of classifying $\omega^T x$, given by

$$\varepsilon(\omega) \le \frac{qM(M-1)}{2} - \frac{q^3}{32} \left( \sum_{l=1}^{M} \frac{|\omega^T (\Sigma_l - \bar{\Sigma}) \omega|}{\omega^T \bar{\Sigma} \omega} \right)^2 \tag{4}$$

where $q$ is the common a priori probability of the $M$ classes, and $\bar{\Sigma} = \sum_{m=1}^{M} q \Sigma_m$. Minimizing the upper bound of the Bayesian classification error is equivalent to maximizing the discriminant criterion

$$J(\omega) = \sum_{l=1}^{M} \frac{|\omega^T (\Sigma_l - \bar{\Sigma}) \omega|}{\omega^T \bar{\Sigma} \omega}. \tag{5}$$

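So, the $G$ filters are defined as [23]

$$\omega_1 = \arg\max_{\omega} J(\omega) \tag{6}$$

$$\omega_G = \arg\max_{\substack{\omega^T \bar{\Sigma} \omega_g = 0 \\ g = 1, \ldots, G-1}} J(\omega). \tag{7}$$

Using some suitable estimates of $\Sigma_l$ and $\bar{\Sigma}$, the filters so defined can be determined one by one via the rank-one update and power iteration procedure.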




III. WPC

For the specific classification of a multiclass EEG single-trial segment $X$, a more direct approach is to optimize the feature $\omega^T X X^T \omega$ that is actually used for classification, rather than $\omega^T x$. Ideally, one would obtain the optimal criterion based on the Bayesian classification error for the classification of $\omega^T X X^T \omega$. The Bayesian classification error is, in general, too complex to be calculated directly. Therefore, an upper bound of the Bayesian classification error, which is also required to be easy to optimize in practice, is usually estimated as a suboptimal criterion. In this section, we develop a new discriminant criterion based on the upper bound of the Bayesian classification error of EEG single trials. It is noted that we take $\omega^T X X^T \omega$, rather than $\omega^T x$, as our target element in deriving the upper bound of the multiclass Bayesian classification error.

A. Upper Bound of Multiclass Bayesian Error

Recalling $X = [x_1, \ldots, x_N]$, we have $\omega^T X X^T \omega = \sum_{i=1}^{N} (\omega^T x_i)^2$, where $N$ is the number of sample points in one trial. Under the assumption of independent Gaussian distribution $N(0, \Sigma_l)$ of $x_i$ conditioned on class $c_l$, we have that $(\omega^T X X^T \omega)/(\omega^T \Sigma_l \omega)$ follows the $\chi^2$ distribution with $N$ degrees of freedom. Usually, the number of sampling points $N$ is very large, say $N > 30$. By the central limit theorem, $\omega^T X X^T \omega$ conditioned on class $c_l$ approaches the Gaussian distribution with mean $N \omega^T \Sigma_l \omega$ and variance $2N(\omega^T \Sigma_l \omega)^2$. For the time being, we assume that $2N(\omega^T \Sigma_l \omega)^2$ is less than one; the general case is addressed later on.
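The stated mean and variance of the band-power feature can be checked by simulation. The sketch below draws trials with independent Gaussian columns and compares the empirical moments of the feature with $N$ times and $2N$ times the appropriate powers of the filtered variance; the covariance matrix, the filter, and the sample sizes are arbitrary values chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, n_trials = 3, 500, 2000
sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])      # hypothetical class covariance
w = np.array([0.5, -0.2, 0.1])           # hypothetical filter
chol = np.linalg.cholesky(sigma)

# Each trial X is K x N with independent columns x_i ~ N(0, sigma).
X = chol @ rng.standard_normal((n_trials, K, N))
feature = np.sum((w @ X) ** 2, axis=-1)  # w^T X X^T w for every trial

v = w @ sigma @ w                         # filtered variance w^T Sigma w
print(feature.mean(), N * v)              # empirical vs. theoretical mean
print(feature.var(), 2 * N * v ** 2)      # empirical vs. theoretical variance
```

Both pairs of printed numbers should agree up to sampling error, which shrinks as the number of simulated trials grows.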

Denote by $\varepsilon_{lm}(\omega)$ the Bayesian error between classes $c_l$ and $c_m$, i.e.,

$$\varepsilon_{lm}(\omega) = q_l P(f(X) \ne c_l \mid c_l) + q_m P(f(X) \ne c_m \mid c_m) \tag{8}$$

where $P(f(X) \ne c_l \mid c_l)$ is the probability that samples belonging to class $c_l$ are misclassified into class $c_m$, $f(\cdot)$ denotes the Bayesian classifier, and $q_l$ and $q_m$ are the a priori probabilities of classes $c_l$ and $c_m$, respectively. Since the data $\omega^T X X^T \omega$ conditioned on each class are (approximately) Gaussian distributed with variance less than one and mean being, for example, $N \omega^T \Sigma_l \omega$ if conditioned on class $c_l$, it follows that

$$q_l P(f(X) \ne c_l \mid c_l) + q_m P(f(X) \ne c_m \mid c_m) \le \int_{D_m} q_l p_l(x)\,dx + \int_{D_l} q_m p_m(x)\,dx \tag{9}$$

where $p_l(x)$ and $p_m(x)$ are the probability density functions of the Gaussian distributions $N(N \omega^T \Sigma_l \omega, 1)$ and $N(N \omega^T \Sigma_m \omega, 1)$, respectively, and $D_m$ and $D_l$ are defined as

$$D_m = \{x : q_m p_m(x) \ge q_l p_l(x)\} \tag{10}$$

$$D_l = \{x : q_l p_l(x) > q_m p_m(x)\}. \tag{11}$$

Suppose $q_l = q_m = q$. Then, we have [21]

$$\int_{D_m} p_l(x)\,dx + \int_{D_l} p_m(x)\,dx = 1 - \mathrm{erf}\left( \frac{N |\omega^T (\Sigma_l - \Sigma_m) \omega|}{2\sqrt{2}} \right) \tag{12}$$

where the error function (erf) is defined as $\mathrm{erf}(x) = (2/\sqrt{\pi}) \int_0^x e^{-u^2}\,du$. By (8), (9), and (12), the Bayesian error between classes $c_l$ and $c_m$ in the 1-D feature space after being projected onto $\omega$ is expressed as

$$\varepsilon_{lm}(\omega) \le q \left[ 1 - \mathrm{erf}\left( \frac{N |\omega^T (\Sigma_l - \Sigma_m) \omega|}{2\sqrt{2}} \right) \right]. \tag{13}$$

It is still complex to optimize $\omega$ via (13), since $\omega$ is embedded in the error function. We would like to isolate $\omega$ from the error function. We present the following inequality. For $0 \le x \le a$, we have

$$\mathrm{erf}(x) \ge \frac{1}{a}\,\mathrm{erf}(a)\,x. \tag{14}$$

The equality holds when taking $x = 0$ or $x = a$. The proof is given in Appendix A. Let

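$$\Delta_{lm}(\omega) = |\omega^T \Sigma_l \omega - \omega^T \Sigma_m \omega| \tag{15}$$

be the absolute distance between classes $c_l$ and $c_m$ in the reduced 1-D feature space. By (14), we have

$$\mathrm{erf}\left( \frac{N \Delta_{lm}(\omega)}{2\sqrt{2}} \right) \ge \frac{1}{\Delta_{lm}}\,\mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right) \Delta_{lm}(\omega) \tag{16}$$

where $\Delta_{lm}$ is the maximum value of $\Delta_{lm}(\omega)$. Note we have required, in the beginning of this section, that the magnitude of $\omega$ is subject to the constraint $2N(\omega^T \Sigma_l \omega)^2 \le 1$. The left and right expressions of (16) are not equal for all directions of $\omega$. The two expressions are equal when $\Delta_{lm}(\omega)$ arrives at the maximum or the minimum value. Combining (13) and (16), we have

$$\varepsilon_{lm}(\omega) \le q \left[ 1 - \frac{1}{\Delta_{lm}}\,\mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right) \Delta_{lm}(\omega) \right]. \tag{17}$$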
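Inequality (14), on which (16) and (17) rest, says that the error function lies above its chord on the interval from 0 to $a$, which follows from its concavity on the nonnegative half line. A quick numerical check using only the Python standard library is sketched below; the value of `a` and the grid resolution are arbitrary choices for this example.

```python
import math

def erf_chord_gap(x, a):
    """erf(x) minus the chord (erf(a) / a) * x, which (14) claims is nonnegative on [0, a]."""
    return math.erf(x) - math.erf(a) / a * x

a = 1.7
grid = [a * k / 1000 for k in range(1001)]
gaps = [erf_chord_gap(x, a) for x in grid]
print(min(gaps) >= -1e-12)   # the gap never drops below zero (up to rounding)
print(gaps[0], gaps[-1])     # the gap vanishes at both endpoints x = 0 and x = a
```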

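For the $M$-class problem, the upper bound of the Bayesian error is calculated as [5]

$$\varepsilon(\omega) \le \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \varepsilon_{lm}(\omega) \le \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} q \left[ 1 - \frac{1}{\Delta_{lm}}\,\mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right) \Delta_{lm}(\omega) \right]. \tag{18}$$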

B. Discriminant Criterion Based on Upper Bound of Multiclass Bayesian Error

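To minimize the Bayesian error, we should minimize its upper bound, which reduces to maximizing the following discriminant criterion:

$$J_P(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{1}{\Delta_{lm}}\,\mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right) \Delta_{lm}(\omega). \tag{19}$$

Let

$$\alpha_{lm} = \frac{1}{\Delta_{lm}}\,\mathrm{erf}\left( \frac{N \Delta_{lm}}{2\sqrt{2}} \right). \tag{20}$$

Then, $J_P(\omega)$ can be rewritten as

$$J_P(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \alpha_{lm} \Delta_{lm}(\omega). \tag{21}$$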




Fig. 1. Weighting function $\psi(u) = (1/u)\,\mathrm{erf}(u)$.

The $\alpha_{lm}$ can be viewed as the weight imposed on the class pair $c_l$ and $c_m$. Since $N \omega^T \Sigma_l \omega$ and $N \omega^T \Sigma_m \omega$ are the distribution means of classes $c_l$ and $c_m$ in the reduced 1-D feature space, respectively, the quantity $N \Delta_{lm}$ in (20) is the maximum distance (with respect to $\omega$) between the two class means. It reflects the separability of the two classes. Note that $\psi(u) = (1/u)\,\mathrm{erf}(u)$ is a monotonically decreasing function of $u$, as shown in Fig. 1. So, the pairwise class weighting function $\alpha_{lm}$ in (20) is monotonically decreasing with respect to $N \Delta_{lm}$. That is, in (21), we impose heavier weights onto close class pairs, which are more likely to be misclassified. This emphasis on close class pairs helps make the criterion suited to producing separability in the output space.
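The behavior of the weighting function of Fig. 1 and of the pairwise weight in (20) can be reproduced with a few lines of standard-library Python. The function names, the grid, and the example distances below are assumptions made for this sketch.

```python
import math

def weighting(u):
    """Weighting function (1/u) * erf(u) of Fig. 1, defined for u > 0."""
    return math.erf(u) / u

def pair_weight(delta_max, n_samples):
    """Pairwise weight of (20): (1/Delta) * erf(N * Delta / (2 * sqrt(2)))."""
    return math.erf(n_samples * delta_max / (2.0 * math.sqrt(2.0))) / delta_max

us = [0.1 * k for k in range(1, 41)]
values = [weighting(u) for u in us]
print(all(a > b for a, b in zip(values, values[1:])))  # strictly decreasing on the grid

# A close class pair (small maximum distance) receives a heavier weight than a distant one.
print(pair_weight(0.001, 250), pair_weight(0.01, 250))
```

As the argument tends to zero the weighting function approaches its largest value, $2/\sqrt{\pi}$, so the weights are bounded and the closest pairs receive the most emphasis.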

C. Discriminant Criterion: General Case of $\omega$

In the earlier derivation, we require that the variance $2N(\omega^T \Sigma_l \omega)^2$ is less than 1. This requirement is satisfied by restricting the length of $\omega$. It suffices to restrict $\omega$ such that $\omega^T (\sqrt{2N} M \bar{\Sigma}) \omega = 1$. In fact, for any class $c_l$, we have $2N(\omega^T \Sigma_l \omega)^2$




$s$ is the sign that results in the largest first principal eigenvalue of all possible $s$. Once the first vector $\omega_1$ is obtained, we proceed to find the second vector $\omega_2$ in the orthogonal complement of $\omega_1$, i.e., in the space spanned by $I_K - \omega_1 \omega_1^T$, where $I_K$ denotes the $K$-dimensional identity matrix. Therefore, $\omega_2$ is solved as the first principal eigenvector of the deflated matrix $(I_K - \omega_1 \omega_1^T) H(s) (I_K - \omega_1 \omega_1^T)$. Note that we use the same symbol $s$, but it is not necessarily the same as the one producing $\omega_1$. Generally, suppose the first $g$ vectors $\omega_1, \ldots, \omega_g$ have been obtained. The $(g+1)$th vector is determined in the orthogonal complement of the space spanned by $\omega_1, \ldots, \omega_g$, i.e., in the space spanned by $I_K - U_g U_g^T$, where $U_g$ is the matrix of an orthonormal basis of $\omega_1, \ldots, \omega_g$, which, for example, can be obtained by the Schmidt orthogonalization procedure. So, $\omega_{g+1}$ is solved as the first principal eigenvector of $(I_K - U_g U_g^T) H(s) (I_K - U_g U_g^T)$. Given the obtained $\omega_{g+1}$, according to the Schmidt orthogonalization procedure, the basis matrix $U_{g+1}$ is formed by padding $U_g$ as $U_{g+1} = [U_g, u_{g+1}]$, where $u_{g+1} = (\omega_{g+1} - U_g (U_g^T \omega_{g+1})) / \|\omega_{g+1} - U_g (U_g^T \omega_{g+1})\|$. In theory, $\omega_{g+1}$ is orthogonal to $U_g$, i.e., $U_g^T \omega_{g+1} = 0$, which implies that $u_{g+1}$ is simply the normalized $\omega_{g+1}$. We keep the Schmidt orthogonalization step for computational precision in practice. Note that $I_K - U_{g+1} U_{g+1}^T = (I_K - u_{g+1} u_{g+1}^T)(I_K - U_g U_g^T)$, which makes it feasible to compute $(I_K - U_{g+1} U_{g+1}^T) H(s) (I_K - U_{g+1} U_{g+1}^T)$ for the next step by updating $(I_K - U_g U_g^T) H(s) (I_K - U_g U_g^T)$ through multiplying by $I_K - u_{g+1} u_{g+1}^T$ from both sides.

In practice, the covariance matrices $\Sigma_l$ ($l = 1, \ldots, M$) are usually unknown and thus need to be estimated. The expression defined in (1) provides a way of estimation. We summarize the optimization procedure of multiclass filters via the WPC approach in Table I.
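The deflation and Schmidt steps described above can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure of Table I: the matrix `H` is a fixed, hypothetical symmetric matrix standing in for $H(s)$, the search over the signs $s$ is omitted, and the Frobenius-norm shift inside the power iteration is a device added here so that the iteration converges to the eigenvector of the largest (rather than largest-magnitude) eigenvalue.

```python
import numpy as np

def next_filter(H, U, n_iter=500, seed=0):
    """Principal eigenvector of (I - U U^T) H (I - U U^T) inside the complement of span(U)."""
    K = H.shape[0]
    P = np.eye(K) - U @ U.T                  # projector onto the orthogonal complement
    shift = np.linalg.norm(H)                # Frobenius norm bounds every |eigenvalue| of H
    B = P @ (H + shift * np.eye(K)) @ P      # positive semidefinite, zero on span(U)
    w = P @ np.random.default_rng(seed).standard_normal(K)
    for _ in range(n_iter):                  # plain power iteration on the shifted matrix
        w = B @ w
        w /= np.linalg.norm(w)
    return w

def append_basis(U, w):
    """Schmidt step: u = (w - U U^T w) / ||w - U U^T w||, then U <- [U, u]."""
    r = w - U @ (U.T @ w)
    return np.column_stack([U, r / np.linalg.norm(r)])

# Hypothetical symmetric matrix with known eigenvectors (columns of Q) for checking.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
H = Q @ np.diag([5.0, 3.0, 1.0, -1.0, -2.0, -4.0]) @ Q.T

U = np.zeros((6, 0))
for _ in range(3):
    U = append_basis(U, next_filter(H, U))
print(np.abs(np.sum(U * Q[:, :3], axis=0)))  # cosines with the three leading eigenvectors
```

Each pass finds the leading eigenvector that is orthogonal to all previously found vectors, so the columns of `U` match the three leading eigenvectors of `H` up to sign.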

E. Classification

Suppose $\omega_g$ ($g = 1, \ldots, G$) are the $G$ filters obtained by WPC. For any EEG data segment $X_t$, we extract the features as

$$\sigma_g^2 = \omega_g^T X_t X_t^T \omega_g, \quad (g = 1, \ldots, G). \tag{31}$$

The extracted features on training EEG data are used to design a classifier. For a testing EEG segment, its features are extracted in the same way and input into the trained classifier to predict its class label.
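The feature extraction of (31) amounts to filtering the trial with each spatial filter and summing the squared output over time. A minimal sketch, assuming the filters are stored as the columns of a hypothetical matrix `W`:

```python
import numpy as np

def band_power_features(W, X):
    """Features of (31): w_g^T X X^T w_g for each column w_g of W (K x G); X is K x N."""
    Y = W.T @ X                   # G x N spatially filtered time series
    return np.sum(Y ** 2, axis=1)

# Random stand-ins for filters and one trial (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3))
X = rng.standard_normal((8, 100))
features = band_power_features(W, X)
print(features.shape)
```

Computing the filtered series first avoids forming the K x K matrix of the trial for every feature.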

IV. COMPARISON AND EXTENSION

In this section, we compare the proposed WPC approach with [23]. The starting points and formulations of the two methods are completely different. We also extend WPC by integrating temporal information.

A. Comparison With [23]

Both the proposed WPC approach and [23] are based on the Bayesian error estimation. However, the classification targets, and hence the criteria, of the two methods are different, as summarized in Table II. The discriminant criterion of [23] is derived by minimizing the upper bound of the Bayesian error of classifying $\omega^T x$, while WPC directly takes the feature $\omega^T X X^T \omega$ used in EEG single-trial classification as its target.

TABLE I. OPTIMIZATION PROCEDURE OF MULTICLASS FILTERS VIA THE WPC APPROACH

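It is noted that

$$q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} |\omega^T (\Sigma_l - \Sigma_m) \omega| \le \sum_{l=1}^{M} |\omega^T (\Sigma_l - \bar{\Sigma}) \omega| \le 2q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} |\omega^T (\Sigma_l - \Sigma_m) \omega|. \tag{32}$$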




TABLE II. COMPARISON BETWEEN WPC AND ZHENG AND LIN [23]

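The proof is given in Appendix B. Let

$$\tilde{J}(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{|\omega^T (\Sigma_l - \Sigma_m) \omega|}{\omega^T \bar{\Sigma} \omega}. \tag{33}$$

Then, we have $q \tilde{J}(\omega) \le J(\omega) \le 2q \tilde{J}(\omega)$. So, maximizing $J(\omega)$ can be roughly performed by maximizing $\tilde{J}(\omega)$. Using $\tilde{\omega} = \bar{\Sigma}^{1/2} \omega$, maximizing $\tilde{J}(\omega)$ is equivalent to maximizing

$$\tilde{J}(\tilde{\omega}) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{|\tilde{\omega}^T (\tilde{\Sigma}_l - \tilde{\Sigma}_m) \tilde{\omega}|}{\tilde{\omega}^T \tilde{\omega}} \tag{34}$$

where $\tilde{\Sigma}_l = \bar{\Sigma}^{-1/2} \Sigma_l \bar{\Sigma}^{-1/2}$. This is the sum of $M(M-1)/2$ pairwise absolute distances between $\tilde{\omega}^T \tilde{\Sigma}_l \tilde{\omega}$ and $\tilde{\omega}^T \tilde{\Sigma}_m \tilde{\omega}$ subject to $\|\tilde{\omega}\| = 1$.

The maximization of (34), however, may not be very appropriate for classifying multiclass EEG single trials in some cases. For example, consider the situation in which one class differs greatly (in terms of $\tilde{\Sigma}$) from the other classes. To maximize the criterion $\tilde{J}(\tilde{\omega})$, the class pairs (say $c_l$ and $c_m$) that have large differences (between $\tilde{\Sigma}_l$ and $\tilde{\Sigma}_m$) heavily control the selection of the direction of $\tilde{\omega}$. Note that $\tilde{\omega}$ is of unit length. As a result, the remote class is projected as far as possible from the other classes, while close classes are more likely to be merged.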

By contrast, with the weight $(1/\Delta_{lm})\,\mathrm{erf}\left( \sqrt{N} \Delta_{lm} / (4M) \right)$, the criterion $J_P(\tilde{\omega})$ in (25) or $J_P(\omega)$ in (22) de-emphasizes the influence of large class differences, where the classes are already well separated, and gives great emphasis to small class differences, where the close classes are more likely to be confused.

An interesting connection between WPC and [23] is that, when applied in the two-class paradigm, criterion $J_P(\omega)$ and criterion $J(\omega)$ produce the same solution of $\omega$. This, however, does not necessarily imply that the upper bounds of the Bayesian errors of these two methods are equal. On the other hand, comparing the upper bounds of the Bayesian errors derived by the two methods is meaningless since they are derived for classifying different objects.

B. Extension to WPC: Integrating Temporal Information

The set of filters of WPC is obtained by considering the classification of the projected variance $\omega^T X X^T \omega$. Note that $X$ is a segment of an EEG single-trial time course. The covariance formulation $X X^T$, however, is globally independent of time: the temporal information is completely ignored. Neurophysiological studies show that EEG signals are usually nonstationary. It is useful to integrate the temporal information into the covariance formulation, reflecting the temporal manifold of the EEG time course [20]. Specifically, by the fact that $X X^T = (1/2N) \sum_{i,j=1}^{N} (x_i - x_j)(x_i - x_j)^T$ (which holds when the rows of $X$ have zero temporal mean, as is approximately the case for band-pass filtered EEG), we use the temporally local covariance matrix

$$C = \frac{1}{2N} \sum_{i,j=1}^{N} (x_i - x_j)(x_i - x_j)^T A(i, j) \tag{35}$$

for covariance modeling instead of $X X^T$, which is time independent. The time-dependent adjacency value $A(i, j)$ is defined such that only temporally close sample pairs, say $\{x_i, x_j : |i - j| < \tau\}$ with $\tau$ being a temporal range parameter, are selected to contribute to the summation (35). The value $A(i, j)$ is monotonically decreasing with respect to the temporal distance between the selected sample pairs. In this paper, the adjacency matrix $A$ is defined using Tukey's tricube weighting function [6]

$$A(i, j) = \begin{cases} \left( 1 - \left| \dfrac{i - j}{\tau} \right|^3 \right)^3, & |i - j| < \tau \\ 0, & \text{else.} \end{cases} \tag{36}$$

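With some algebraic derivations, $C$ is compactly expressed as $C = (1/N) X E X^T$, where $E = D - A$ is the Laplacian matrix, $A = (A(i, j))_{i,j=1,\ldots,N}$, and $D$ is the diagonal matrix whose diagonal entries are the row sums of $A$. Let $L = (1/N) E$. Then, under the same probabilistic assumption as in the previous section, the Gaussian quadratic form $\omega^T X L X^T \omega$ conditioned on class $c_l$ has mean $\mathrm{tr}(L)\,\omega^T \Sigma_l \omega$ and variance $2\,\mathrm{tr}(L^2)(\omega^T \Sigma_l \omega)^2$, where $\mathrm{tr}(\cdot)$ is the trace operator. If $\omega^T X L X^T \omega$ is treated as the target feature for classification, then (28) is accordingly modified by integrating temporal information as

$$H_{TI}(s) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{1}{\Delta_{lm}}\,\mathrm{erf}\left( \frac{\mathrm{tr}(L)\,\Delta_{lm}}{4M \sqrt{\mathrm{tr}(L^2)}} \right) s_{lm} (\Sigma_l - \Sigma_m). \tag{37}$$

Note that, in this case, the difference between the means of classes $c_l$ and $c_m$ becomes $\mathrm{tr}(L)\,(\omega^T (\Sigma_l - \Sigma_m) \omega)$, and $\omega$ is normalized as $\omega / \sqrt{\omega^T (\sqrt{2\,\mathrm{tr}(L^2)}\, M \bar{\Sigma}) \omega}$.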
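The compact Laplacian form of the temporally local covariance can be verified numerically against the double sum in (35). The sketch below builds the tricube adjacency of (36) and compares both expressions on random data; the function names and the small problem size are assumptions made for this example.

```python
import numpy as np

def tricube_adjacency(n_samples, tau):
    """A(i, j) = (1 - |(i - j) / tau|^3)^3 for |i - j| < tau, else 0, as in (36)."""
    lag = np.abs(np.subtract.outer(np.arange(n_samples), np.arange(n_samples)))
    return np.where(lag < tau, (1.0 - (lag / tau) ** 3) ** 3, 0.0)

def temporal_laplacian(n_samples, tau):
    """L = (D - A) / N, with D the diagonal matrix of the row sums of A."""
    A = tricube_adjacency(n_samples, tau)
    return (np.diag(A.sum(axis=1)) - A) / n_samples

rng = np.random.default_rng(0)
K, N, tau = 3, 40, 5
X = rng.standard_normal((K, N))
A = tricube_adjacency(N, tau)
L = temporal_laplacian(N, tau)

# Direct double sum of (35) versus the compact form X L X^T.
diff = X[:, :, None] - X[:, None, :]                   # diff[:, i, j] = x_i - x_j
C_direct = np.einsum('aij,bij,ij->ab', diff, diff, A) / (2 * N)
C_compact = X @ L @ X.T
print(np.allclose(C_direct, C_compact))
```

The agreement relies on the adjacency matrix being symmetric, which holds here because it depends only on the absolute time lag.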

In the implementation of the temporal extension of WPC, the covariance matrices $\Sigma_l$ ($l = 1, \ldots, M$) are estimated as

$$\Sigma_l = \frac{1}{N |I_l|} \sum_{t \in I_l} X_t L X_t^T. \tag{38}$$

The optimization procedure can be carried out similarly to WPC. The features are extracted as

$$\sigma_g^2 = \omega_g^T X_t L X_t^T \omega_g, \quad (g = 1, \ldots, G) \tag{39}$$

where $\omega_g$ are the filters of the temporal extension of WPC.

1) Choice of $\tau$: The additional parameter $\tau$ is determined from the data using a three-way cross-validation strategy. This strategy contains two nested loops. In the outer loop, the samples are divided into $T_1$ folds, of which one fold is treated as the testing set. The testing samples are used for the estimation of generalization ability and are not involved in solving for the filters and the parameter. In the inner loop, the remaining $T_1 - 1$ folds are further divided into $T_2$ folds, of which one fold is treated as the validation set while the remaining $T_2 - 1$ folds are




treated as the training set. For each $\tau$, the filters are solved on the training set, and then the recognition rate is calculated on the validation set. This procedure is repeated $T_2$ times with a different validation set each time. The average recognition rate is recorded as the recognition accuracy across the $T_2$ folds. We select the $\tau$ that results in the maximum recognition accuracy. We then solve the filters using all the $T_2$ folds with the optimal $\tau$ selected earlier. With the filters obtained, we calculate the recognition rate on the testing set specified in the outer loop. This procedure is repeated $T_1$ times with a different fold as the testing set each time. The average recognition rate is computed as the final recognition accuracy across the $T_1$ folds.
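The two nested loops described above can be sketched generically. In the following illustration, `fit` and `score` are hypothetical user-supplied callables (train a model on given trial indices with a given parameter value, and return its accuracy on other indices), and the folds are formed by a plain random split rather than a per-class partition.

```python
import numpy as np

def fold_indices(n_items, n_folds, rng):
    """Randomly split range(n_items) into n_folds nearly equal index arrays."""
    return np.array_split(rng.permutation(n_items), n_folds)

def three_way_cv(n_trials, taus, fit, score, T1=3, T2=3, seed=0):
    """Nested cross-validation: the inner loop selects tau, the outer loop estimates accuracy."""
    rng = np.random.default_rng(seed)
    outer = fold_indices(n_trials, T1, rng)
    accuracies, chosen = [], []
    for i, test_idx in enumerate(outer):
        rest = np.concatenate([f for j, f in enumerate(outer) if j != i])
        inner = [rest[f] for f in fold_indices(len(rest), T2, rng)]
        mean_acc = []
        for tau in taus:
            accs = []
            for k, val_idx in enumerate(inner):
                train_idx = np.concatenate([f for j, f in enumerate(inner) if j != k])
                accs.append(score(fit(train_idx, tau), val_idx))
            mean_acc.append(np.mean(accs))
        best_tau = taus[int(np.argmax(mean_acc))]
        chosen.append(best_tau)
        # Refit on all inner folds with the selected tau, then evaluate on the held-out fold.
        accuracies.append(score(fit(rest, best_tau), test_idx))
    return float(np.mean(accuracies)), chosen
```

The testing fold of the outer loop never enters `fit` during the selection of the parameter, which is what keeps the final accuracy estimate unbiased by the tuning.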

V. EXPERIMENTS

We evaluate the effectiveness of the proposed multiclass methods on two publicly available datasets of BCI competitions. These two datasets are of four-class motor imagery EEG signals. We compare the classification performances of the proposed multiclass methods with the multiclass CSP using one-versus-rest and using JAD [7], the multiclass information theoretic feature extraction [10], and the multiclass CSP presented in [23].

A. EEG Datasets Used for Evaluation

1) Dataset IIIa of BCI Competition III: This dataset follows a four-class motor imagery paradigm and was recorded from three subjects (k3b, k6b, and l1b) [3]. The subjects, sitting relaxed in a normal chair, were asked to perform four different motor imagery tasks (i.e., left hand, right hand, one foot, and tongue) according to cues, which were presented in a randomized order. In each trial, the cue was displayed from the third second and lasted for 1.25 s. At the same time, the motor imagery started and continued until the fixation cross disappeared at the seventh second. So, the duration of the motor imagery in each trial was 4 s. For subject k3b, there were 90 trials for each mental task. For subjects k6b and l1b, each mental task cue appeared 60 times. In our experiment, we discard four trials of subject k6b because of missing data. The EEG measurements were recorded using 60 sensors by a 64-channel Neuroscan system. The left and right mastoids were used as reference and ground, respectively. The EEG signals were sampled at 250 Hz and filtered with cutoff frequencies of 1 and 50 Hz with the notch filter on.

2) Dataset IIa of BCI Competition IV: This dataset contains EEG signals recorded during a cue-based four-class motor imagery task from nine subjects [17]. Each trial started with a short acoustic warning tone along with a fixation cross displayed on the black screen. After 2 s, a visual cue was presented for 1.25 s, instructing the subjects to carry out the desired motor imagery task (i.e., the imagination of movement of the left hand, right hand, both feet, or tongue) from the third second until the fixation cross disappeared at the sixth second. Each subject participated in two sessions recorded on different days. There were 288 trials in each session for each subject, i.e., 72 trials per task. Twenty-two electrodes were used to record the EEG signals, which were sampled at 250 Hz and filtered with cutoff frequencies of 0.5 and 100 Hz with the notch filter on.

B. Experimental Settings and Results

The data are band-pass filtered between 5 and 35 Hz using a fifth-order Butterworth filter, as in [9] and [10]. The EEG segments recorded during the motor imagery period, i.e., from the third second to the seventh second in dataset IIIa of BCI competition III and from the third second to the sixth second in dataset IIa of BCI competition IV, are used in the experiment. We use three-fold cross-validation to evaluate the classification accuracy. That is, we partition all the trials of each class per subject into three divisions, in which each division is used as testing data while the remaining two divisions are used as training data. This procedure is repeated three times until each division has been used once as testing data. In each repetition, for each filter obtained on the training data, features are obtained by projection on the 15 frequency bands of 2-Hz width in the range 5-35 Hz [9], [10]. Consequently, we obtain a $(15 \times G)$-dimensional feature vector for each trial, where $G$ is the number of filters selected on the training data. That is, we use the three-way cross-validation procedure with $T_1 = T_2 = 3$ to determine the value of $G$, where $G$ varies from 2 to 10 in steps of 2. The $(15 \times G)$-dimensional feature vectors are further reduced to 3-D vectors$^{1}$ by using the Fisher discriminant analysis (FDA) [21]. It should be noted that the spatial filters, the value of $G$, and the FDA weights are all calculated on the basis of the training data and then applied to the testing data. The conventional nearest class mean classifier with Euclidean distance [21] is adopted to predict the class labels of the testing samples.
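The final classification step, a nearest class mean rule with Euclidean distance, is simple enough to sketch directly. The example below omits the FDA reduction and uses synthetic 3-D features with four well-separated classes; the function names and data are assumptions made for illustration.

```python
import numpy as np

def fit_class_means(features, labels):
    """Return the sorted class labels and the mean feature vector of each class."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, means

def predict_nearest_mean(features, classes, means):
    """Assign each sample to the class whose mean is closest in Euclidean distance."""
    d2 = ((features[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]

# Synthetic 3-D features for four classes (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]])
labels = np.repeat(np.arange(4), 20)
feats = centers[labels] + 0.3 * rng.standard_normal((80, 3))
classes, means = fit_class_means(feats, labels)
pred = predict_nearest_mean(feats, classes, means)
print((pred == labels).mean())
```

In an actual evaluation the class means would be fitted on training trials only and applied to held-out trials.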

Table III reports the classification accuracies obtained with the multiclass filters solved by the various methods. Note that we also evaluate the classification accuracy of WPC integrating temporal information (WPC/TI), where the parameter $\tau$ is determined by the three-way cross-validation procedure with $T_1 = T_2 = 3$. Here, $\tau$ is varied logarithmically from 1 to 5 in steps of 1. It is observed that the proposed WPC method achieves much better classification accuracy than the existing multiclass methods, and WPC/TI further improves the results in most cases. The improvement of WPC/TI is attributed to the local temporal modeling. The lower classification accuracies of WPC/TI relative to WPC in a few cases may be due to overfitting.

C. Comparison With BCI Competition IV

For dataset IIa of BCI competition IV, to compare with the results of the winners, we evaluate session-to-session transfer from session one to session two in terms of the kappa score, simulating the competition scenario. The session-to-session transfer procedure is much simpler than cross validation. Specifically, we use the first session as training data and the second session as testing data. All experimental settings are the same as those described in the previous section, except that the training data and the testing data are now fixed. The results are summarized in Table IV. It can be seen that our proposed methods achieve fairly good classification performance compared with the results obtained by the best two competitors. Note that the results obtained by the multiclass CSP using one-versus-rest and using JAD are slightly different from those reported in [9], since different classifiers and time segments are used.

1 Since the number of classes is four, we can obtain at most three dimensions of features by FDA, which is known as the rank-limit problem.

TABLE III
COMPARISON OF THE CLASSIFICATION ACCURACIES (%) OF THE PROPOSED WPC AND WPC/TI METHODS WITH THE EXISTING MULTICLASS METHODS FOR EACH SUBJECT ON THE DATASETS OF BCI COMPETITIONS, WHERE M1–M6 REFER TO MULTICLASS CSP USING ONE-VERSUS-REST, MULTICLASS CSP USING JAD, MULTICLASS INFORMATION THEORETIC FEATURE EXTRACTION, MULTICLASS CSP IN [23], WPC, AND WPC/TI, RESPECTIVELY

TABLE IV
KAPPA SCORES OF VARIOUS MULTICLASS METHODS FOR EACH SUBJECT ON DATASET IIA OF BCI COMPETITION IV USING SESSION-TO-SESSION TRANSFER, WHERE NO. 1 AND NO. 2 REFER TO THE BEST TWO COMPETITORS, AND M1–M6 ARE THE SAME AS THOSE OF TABLE III
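To make the kappa score used in the session-to-session evaluation concrete, the following is a minimal sketch, assuming the score is Cohen's kappa computed from true and predicted trial labels. The function name `cohen_kappa` and the toy labels are illustrative only and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two equal-length label lists."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of trials whose predicted label is correct.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of the product of marginal frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical four-class example: session-two labels versus predictions.
labels_session_two = [1, 1, 2, 2, 3, 3, 4, 4]
predicted_labels = [1, 1, 2, 2, 3, 3, 4, 1]
kappa = cohen_kappa(labels_session_two, predicted_labels)
print(round(kappa, 4))
```

When the four true classes are balanced, the chance term equals 0.25, so kappa reduces to (accuracy - 0.25) / 0.75. In the toy example above, an accuracy of 7/8 therefore corresponds to a kappa of 5/6.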

In our experiment, a simple classification procedure is employed to reveal the effectiveness of the multiclass filters obtained by WPC and WPC/TI. The classification performance may be improved by solving filters in narrower frequency bands, tuning the optimal time segment for each trial, and/or using more sophisticated classifiers. The goal of this paper is to demonstrate the effectiveness of the weighted scheme for solving multiclass filters: while we use the same experimental settings for all the methods, the weighted pairwise design produces a much higher classification accuracy.

VI. CONCLUSION

In this paper, we propose a new discriminant criterion, called WPC, for optimizing multiclass filters. The approach is established by minimizing the upper bound of the Bayesian error of classifying EEG single-trial segments, which results in a weighted sum over individual class pairs, with weights determined by their closeness. We place special emphasis on closer classes, which are more likely to cause misclassification. In other words, the contributions of different class pairs to the discriminant criterion are biased. Computationally, the WPC algorithm is conveniently solved by the rank-one update and power iteration technique.
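As a generic illustration of the power iteration mentioned above, the following sketch finds the dominant eigenpair of a small symmetric positive semidefinite matrix. It is the textbook method only: it does not reproduce the WPC-specific rank-one update, and the names `mat_vec` and `power_iteration` and the example matrix are hypothetical.

```python
import math


def mat_vec(A, w):
    """Multiply a square matrix (list of rows) by a vector (list)."""
    return [sum(a * x for a, x in zip(row, w)) for row in A]


def power_iteration(A, num_iters=1000, tol=1e-12):
    """Dominant eigenpair of a symmetric positive semidefinite matrix A.

    Assumes the uniform starting vector is not orthogonal to the
    dominant eigenvector.
    """
    n = len(A)
    w = [1.0 / math.sqrt(n)] * n
    for _ in range(num_iters):
        v = mat_vec(A, w)
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break  # A maps w to zero; no direction to normalize
        v = [x / norm for x in v]
        converged = max(abs(a - b) for a, b in zip(v, w)) < tol
        w = v
        if converged:
            break
    # Rayleigh quotient of the unit vector w estimates the eigenvalue.
    eigenvalue = sum(x * y for x, y in zip(w, mat_vec(A, w)))
    return eigenvalue, w


# Hypothetical 2x2 example standing in for a criterion matrix.
A = [[4.0, 1.0], [1.0, 3.0]]
eigenvalue, filter_vector = power_iteration(A)
print(eigenvalue, filter_vector)
```

Each iteration costs one matrix-vector product, which is why the method is attractive when only the leading direction is needed.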

The proposed WPC approach is intentionally formulated for classifying EEG single-trial data. It takes into account classification errors of EEG trials between pairs of classes. Although a criterion derived from the Bayesian error of classifying EEG sampling points is reasonable, large pairwise class differences may play an overwhelming role in its optimization. By contrast, WPC directly uses the same features for optimizing spatial filters as for classification. Moreover, we extend WPC by integrating the temporal information of EEG series in the covariance matrix formulation. The effectiveness of the proposed WPC method is demonstrated by the classification of four motor imagery tasks on two datasets of BCI competitions.

Finally, we point out that the Bayesian error estimation relies heavily on the assumption of independent Gaussian distributions. This assumption, however, does not hold strictly in applications, since EEG data usually have an autocorrelation structure. One possible remedy is to use a Gaussian mixture model instead of a single Gaussian distribution. We are studying this issue theoretically and practically.

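APPENDIX A

PROOF OF (14)

For $0 \le x \le a$, in the error function

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-u^{2}}\,du \tag{40}$$

we use the variable substitution $v = \frac{a}{x}u$. Then, we have

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{a} e^{-v^{2}\left(\frac{x}{a}\right)^{2}}\,\frac{x}{a}\,dv. \tag{41}$$

Since $0 \le \frac{x}{a} \le 1$, it follows that

$$\operatorname{erf}(x) \ge \frac{x}{a}\,\frac{2}{\sqrt{\pi}} \int_{0}^{a} e^{-v^{2}}\,dv = \frac{1}{a}\operatorname{erf}(a)\,x \tag{42}$$

which completes the proof.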
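The bound proved in this appendix, erf(x) >= erf(a) x / a for 0 <= x <= a, can be checked numerically. The sketch below uses only the standard library; the helper names and the grid of test points are illustrative choices, not part of the paper.

```python
import math


def erf_linear_lower_bound(x, a):
    """Linear lower bound erf(a) * x / a, valid for 0 <= x <= a with a > 0."""
    if a <= 0 or not (0 <= x <= a):
        raise ValueError("requires a > 0 and 0 <= x <= a")
    return math.erf(a) * x / a


def bound_holds_on_grid(a, steps=1000, eps=1e-12):
    """Check erf(x) >= erf(a) * x / a on an evenly spaced grid over [0, a]."""
    for i in range(steps + 1):
        x = a * i / steps
        # eps absorbs floating-point rounding at the endpoints, where
        # the two sides are equal in exact arithmetic.
        if math.erf(x) < erf_linear_lower_bound(x, a) - eps:
            return False
    return True


for a in (0.5, 2.0, 5.0):
    print(a, bound_holds_on_grid(a))
```

The bound is tight at x = 0 and x = a and is strict in between, which reflects the concavity of erf on the nonnegative axis.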




APPENDIX B

PROOF OF (32)

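We have

$$\sum_{l=1}^{M} \bigl|T^{(l)}\bigr| = \sum_{l=1}^{M} \Bigl| T_{l} - \sum_{m=1}^{M} q\,T_{m} \Bigr| = \sum_{l=1}^{M} \Bigl| \sum_{m=1}^{M} q\,T^{(lm)} \Bigr| \le 2q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \bigl|T^{(lm)}\bigr|. \tag{43}$$

On the other hand, we have

$$\begin{aligned}
q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \bigl|T^{(lm)}\bigr|
&= \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \bigl|T^{(l)} - T^{(m)}\bigr| \\
&\le \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \bigl|T^{(l)}\bigr| + \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \bigl|T^{(m)}\bigr| \\
&= \frac{1}{2} \sum_{l=1}^{M} \bigl|T^{(l)}\bigr| + \frac{1}{2} \sum_{m=1}^{M} \bigl|T^{(m)}\bigr| \\
&= \sum_{l=1}^{M} \bigl|T^{(l)}\bigr|.
\end{aligned} \tag{44}$$

The proof is, thus, established.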
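The two inequalities of this appendix can be sanity-checked numerically under an illustrative reading. The sketch assumes, purely for illustration, that each per-class quantity is the deviation of a scalar from the mean over M classes, that each pairwise quantity is the difference of two such scalars, and that q = 1/M. These assumptions and all names below are hypothetical and are not taken from the text.

```python
import random


def centered_and_pairwise(t):
    """Return (sum_l |t_l - mean|, q * sum_{l<m} |t_l - t_m|) with q = 1/M."""
    M = len(t)
    q = 1.0 / M
    mean = sum(t) / M
    centered_sum = sum(abs(x - mean) for x in t)
    pairwise_sum = sum(
        abs(t[l] - t[m]) for l in range(M - 1) for m in range(l + 1, M)
    )
    return centered_sum, q * pairwise_sum


def bounds_hold(t, eps=1e-9):
    """Check centered <= 2 * scaled_pairwise and scaled_pairwise <= centered."""
    centered_sum, scaled_pairwise = centered_and_pairwise(t)
    first = centered_sum <= 2.0 * scaled_pairwise + eps   # analogue of (43)
    second = scaled_pairwise <= centered_sum + eps        # analogue of (44)
    return first and second


rng = random.Random(0)
all_ok = all(
    bounds_hold([rng.uniform(-5.0, 5.0) for _ in range(4)]) for _ in range(1000)
)
print(all_ok)
```

Under these assumptions the pairwise sum and the centered sum bound each other up to a factor of two, so the two criteria differ only in how individual class pairs are weighted.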

ACKNOWLEDGMENT

The author would like to thank the anonymous referees and the editors for their constructive recommendations, which improved the paper substantially.

REFERENCES

[1] A. Bashashati, M. Fatourechi, R. K. Ward, and G. E. Birch, "A survey of signal processing algorithms in brain-computer interfaces based on electrical brain signals," J. Neural Eng., vol. 4, no. 2, pp. R32–R57, Jun. 2007.

[2] B. Blankertz, K.-R. Müller, G. Curio, T. M. Vaughan, G. Schalk, J. R. Wolpaw, A. Schlögl, C. Neuper, G. Pfurtscheller, T. Hinterberger, M. Schröder, and N. Birbaumer, "The BCI competition 2003: Progress and perspectives in detection and discrimination of EEG single trials," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 1044–1051, Jun. 2004.

[3] B. Blankertz, K.-R. Müller, D. J. Krusienski, G. Schalk, J. R. Wolpaw, A. Schlögl, G. Pfurtscheller, J. R. Millán, M. Schröder, and N. Birbaumer, "The BCI competition III: Validating alternative approaches to actual BCI problems," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 14, no. 2, pp. 153–159, Jun. 2006.

[4] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K.-R. Müller, "Optimizing spatial filters for robust EEG single-trial analysis," IEEE Signal Process. Mag., vol. 25, no. 1, pp. 41–56, Jan. 2008.

[5] J. T. Chu and J. C. Chuen, "Error probability in decision functions for character recognition," J. Assoc. Comput. Mach., vol. 14, no. 2, pp. 273–280, 1967.

[6] W. S. Cleveland, "Robust locally weighted regression and smoothing scatterplots," J. Amer. Stat. Assoc., vol. 74, pp. 829–836, 1979.

[7] G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller, "Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multi-class paradigms," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 993–1002, Jun. 2004.

[8] G. Dornhege, M. Krauledat, K.-R. Müller, and B. Blankertz, General Signal Processing and Machine Learning Tools for BCI. Cambridge, MA: MIT Press, 2007, pp. 207–233.

[9] C. Gouy-Pailler, M. Congedo, C. Brunner, C. Jutten, and G. Pfurtscheller, "Nonstationary brain source separation for multiclass motor imagery," IEEE Trans. Biomed. Eng., vol. 57, no. 2, pp. 469–478, Feb. 2010.

[10] M. Grosse-Wentrup and M. Buss, "Multiclass common spatial patterns and information theoretic feature extraction," IEEE Trans. Biomed. Eng., vol. 55, no. 8, pp. 1991–2000, Aug. 2008.

[11] S. Lemm, B. Blankertz, G. Curio, and K.-R. Müller, "Spatio-spectral filters for improved classification of single trial EEG," IEEE Trans. Biomed. Eng., vol. 52, no. 9, pp. 1541–1548, Sep. 2005.

[12] Y. Li, X. Gao, and S. Gao, "Classification of single-trial electroencephalogram during finger movement," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 1019–1025, Jun. 2004.

[13] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, "Multiclass linear dimension reduction by weighted pairwise Fisher criteria," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 7, pp. 762–766, Jul. 2001.

[14] R. Lotlikar and R. Kothari, "Fractional-step dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 6, pp. 623–627, Jun. 2000.

[15] D. J. McFarland, C. W. Anderson, K.-R. Müller, A. Schlögl, and D. J. Krusienski, "BCI meeting 2005 - workshop on BCI signal processing: Feature extraction and translation," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 14, no. 2, pp. 135–138, Jun. 2006.

[16] J. Müller-Gerking, G. Pfurtscheller, and H. Flyvbjerg, "Designing optimal spatial filters for single-trial EEG classification in a movement task," Clin. Neurophys., vol. 110, no. 5, pp. 787–798, May 1999.

[17] M. Naeem, C. Brunner, R. Leeb, B. Graimann, and G. Pfurtscheller, "Seperability of four-class motor imagery data using independent components analysis," J. Neural Eng., vol. 3, pp. 208–216, 2006.

[18] L. C. Parra, C. D. Spence, A. D. Gerson, and P. Sajda, "Recipes for linear analysis of EEG," NeuroImage, vol. 28, no. 2, pp. 326–341, Nov. 2005.

[19] H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller, "Optimal spatial filtering of single trial EEG during imagined hand movement," IEEE Trans. Rehabil. Eng., vol. 8, no. 4, pp. 441–446, Dec. 2000.

[20] H. Wang and W. Zheng, "Local temporal common spatial patterns for robust single-trial EEG classification," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 16, no. 2, pp. 131–139, Apr. 2008.

[21] A. R. Webb, Statistical Pattern Recognition. London, U.K.: Oxford Univ. Press, 1999.

[22] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, "Brain-computer interfaces for communication and control," Clin. Neurophysiol., vol. 113, no. 6, pp. 767–791, Jun. 2002.

[23] W. Zheng and Z. Lin, "Optimizing multi-class spatio-spectral filters via Bayes error estimation for EEG classification," in Proc. Neural Informat. Process. Syst. (NIPS), 2009, pp. 1–9.

Haixian Wang (M'09) received the B.S. and M.S. degrees in statistics and the Ph.D. degree in computer science from Anhui University, Anhui, China, in 1999, 2002, and 2005, respectively.

During 2002–2005, he was with the Key Laboratory of Intelligent Computing and Signal Processing of the Ministry of Education of China. He is currently with the Key Laboratory of Child Development and Learning Science of the Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing, Jiangsu, China. His research interests include EEG signal processing, statistical pattern recognition, and machine learning.
