1412 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 58, NO. 5, MAY 2011
Multiclass Filters by a Weighted Pairwise Criterion for EEG Single-Trial Classification
Haixian Wang, Member, IEEE
Abstract: The filtering technique for dimensionality reduction of multichannel electroencephalogram (EEG) recordings, modeled using common spatial patterns and its variants, is commonly used in two-class brain-computer interfaces (BCI). For a multiclass problem, the optimization of certain separability criteria in the output space is not directly related to the classification error of EEG single-trial segments. In this paper, we derive a new discriminant criterion, termed weighted pairwise criterion (WPC), for optimizing multiclass filters by minimizing the upper bound of the Bayesian error that is intentionally formulated for classifying EEG single-trial segments. The WPC approach pays more attention to close class pairs that are more likely to be misclassified than far away class pairs that are already well separated. Moreover, we extend WPC by integrating temporal information of EEG series. Computationally, we employ the rank-one update and power iteration technique to optimize the proposed discriminant criterion. The experiments of multiclass classification on the datasets of BCI competitions demonstrate the efficacy of the proposed method.
Index Terms: Bayesian classification error, brain-computer interfaces (BCI), common spatial patterns (CSP), multiclass filters, weighted pairwise criterion (WPC).
I. INTRODUCTION
ACCURATE classification of electroencephalogram (EEG) signals is the core problem in the community of brain-computer interfaces (BCI) [22]. A large number of modern signal processing and machine learning techniques have been used and developed [1], [8], [15]. One powerful and widely used method for processing multichannel EEG series is the filtering technique, represented by the common spatial patterns (CSP) [4]. The CSP approach, designed for the two-class problem [12], [16], [18], [19], seeks a few filters such that the ratio of the filtered variances between the two populations is maximized (or minimized). Because CSP makes use of only spatial information, spatio-temporal versions were also developed, for example, the common spatio-spectral patterns [11] and the local temporal CSP [20]. Many variants of CSP are reviewed in [4].
Manuscript received June 20, 2010; revised October 14, 2010 and December 30, 2010; accepted January 3, 2011. Date of publication January 13, 2011; date of current version April 20, 2011. This work was supported in part by the National Natural Science Foundation of China under Grants 61075009 and 60803059, in part by the Qing Lan Project, and in part by the Fund for the Program of Excellent Young Teachers at Southeast University.
The author is with the Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing 210096, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TBME.2011.2105869
The CSP approach was originally suggested for the two-class paradigm. Multiclass extensions have been investigated in the literature. One trivial extension is to divide the multiclass problem into many two-class situations and then apply CSP repeatedly [4], [7]. Another conventional extension is the joint approximate diagonalization (JAD) of $M$ covariance matrices, where $M$ is the number of classes [7]. This is based on the observation that CSP simultaneously diagonalizes two covariance matrices. The JAD was further investigated from the perspective of mutual information and brain source separation [9], [10]. The JAD approach is actually a decomposition technique rather than a classification method.
Recently, Zheng and Lin [23] presented a multiclass extension via Bayesian classification error estimation. The discriminant criterion is derived by minimizing the upper bound of the Bayesian error of classifying $\omega^T x$, where $x$ is an EEG signal recorded at a specific time point and $\omega$ is a filter. While this is a reasonable criterion to optimize spatial filters, a more direct approach uses the same features for optimizing spatial filters as for classification. Denoting a multivariate time series of band-pass filtered single-trial EEG by $X$, these features are, in CSP, $\omega^T X X^T \omega$ (cf. [4, eq. (2)], [16, eq. (2)], and [23, eqs. (29)-(31)]). The quantity $\omega^T X X^T \omega$ is the variance of the band-pass filtered EEG signals and is equal to band power. So, the band power $\omega^T X X^T \omega$ with an appropriate spatial filter $\omega$ in fact corresponds to the effect of event-related desynchronization/synchronization (ERD/ERS), which is an effective neurophysiological feature for classification of brain activities [4]. Specifically, the so-called idle rhythms, reflected around 10 Hz, can be observed over motor and sensorimotor areas in most persons. These idle rhythms are attenuated when processing motor activity. This physiological phenomenon is termed the ERD effect because of the loss of synchrony in the neural population. By contrast, the rebound of the rhythmic activity is termed ERS.
In this paper, we directly target the classification of $\omega^T X X^T \omega$, for which the upper bound of the Bayesian classification error is estimated. Accordingly, by minimizing the upper bound of the Bayesian classification error, we develop a new discriminant criterion that is directly related to the classification of EEG single-trial segments. The proposed criterion takes the form of a sum over weighted pairwise classes and is referred to as the weighted pairwise criterion (WPC). The WPC approach puts heavier weights onto close class pairs, which are more likely to be misclassified, and de-emphasizes the influence of far away class pairs, where the classes are already well separated. The weighting strategy helps make the criterion suited to producing separability in the output space, as has been observed in pattern classification problems [13], [14]. Computationally, the proposed WPC method is conveniently implemented by the rank-one update and power iteration technique. Moreover, we compare WPC with [23]. The classification targets of the two methods are different, and so the discriminant criteria obtained by minimizing the upper bound of the Bayesian classification error are also different. We also extend WPC by integrating temporal information of EEG series into the covariance matrix formulation. The efficacy of the proposed approach is demonstrated on the classification of four motor imagery tasks on two datasets of BCI competitions.
0018-9294/$26.00 © 2011 IEEE
The remainder of this paper is organized as follows. In
Section II, the CSP and its multiclass situation via Bayesian
classification error estimation based on EEG sampling points
are briefly reviewed. In Section III, we derive the upper bound
of the Bayesian error that is intentionally formulated for clas-
sifying EEG single-trial segments, then propose the WPC to
minimize the upper bound, and give the optimization procedure
by using the rank-one update and power iteration technique. The
comparison with [23] and the extension by integrating temporal
information are presented in Section IV. The experimental results are presented in Section V. Finally, Section VI concludes
the paper.
II. BRIEF REVIEW OF CSP AND ITS MULTICLASS SITUATION
Let $x \in \mathbb{R}^K$ be an EEG signal at a specific time point with $K$ electrodes. We view $x$, recorded while performing a certain mental task, as a $K$-dimensional random variable that is generated from a Gaussian distribution. Suppose that $C = \{c_1, \ldots, c_M\}$ is the set of mental conditions to be investigated. We consider the multiclass ($M > 2$) classification problem that assigns EEG single-trial segments to the $M$ predefined brain states. Given class $c_l$ ($l \in \{1, 2, \ldots, M\}$), the random variable $x$ is assumed to be Gaussian distributed according to $x \mid c_l \sim \mathcal{N}(0, \Sigma_l)$, where $\Sigma_l$ is the covariance matrix. The Gaussian assumption will not sacrifice generality when studying linear filters and statistics less than second order [10]. For the purpose of classification, we wish to learn $G$ ($G < K$) filters (linear transformation vectors) $\omega_g \in \mathbb{R}^K$ using the finite training data such that the filtered features are more discriminative for predicting class labels than the raw EEG data. Hereafter, the terms conditions and class labels are used interchangeably.
A. CSP: Two-Class Paradigm
The CSP method provides a powerful way of extracting EEG features related to the modulation of ERD/ERS. The CSP algorithm applies to the two-class situation only. It solves for the filters such that the projected EEG series have the maximum ratio of variances between the two classes. Maximizing the variances actually characterizes the ERD/ERS effects. Let $X = [x_1, \ldots, x_N] \in \mathbb{R}^{K \times N}$ be a segment of EEG series during one trial, where $x_i$ is the multichannel EEG signal at a specific time point $i$, and $N$ denotes the number of sampled time points.
The CSP approach solves for spatial filters by simultaneously diagonalizing the estimated covariance matrices under the two conditions. The covariance matrices of the two classes are estimated as
$$\Sigma_l = \frac{1}{N |I_l|} \sum_{t \in I_l} X_t X_t^T \tag{1}$$
where $I_l$ ($l \in \{1, 2\}$) denotes the set of indices of trials belonging to class $c_l$, and $|I_l|$ is the cardinality of the set $I_l$. The spatial filters of CSP can be alternatively formulated as an optimization problem [4], [18]
$$\omega = \arg\{\max, \min\}_{\omega \in \mathbb{R}^K} \frac{\omega^T \Sigma_1 \omega}{\omega^T \Sigma_2 \omega} \tag{2}$$
where the notation $\{\max, \min\}$ means that maximizing or minimizing the Rayleigh quotient is of equal interest. The spatial filters are thus obtained by solving the generalized eigenvalue equation
$$\Sigma_1 \omega = \lambda \Sigma_2 \omega. \tag{3}$$
The eigenvalue $\lambda$ measures the ratio of variances between the two classes. For the purpose of classification, the filters are specified by choosing several generalized eigenvectors associated with eigenvalues from both ends of the eigenvalue spectrum. The variances of the spatially filtered EEG data are discriminative features, which are input into a classifier.
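The statement that the eigenvalue measures the variance ratio follows in one step from (3). The worked line below is a sketch that assumes only that $\omega^T \Sigma_2 \omega > 0$; it premultiplies (3) by $\omega^T$.

```latex
\omega^T \Sigma_1 \omega = \lambda\, \omega^T \Sigma_2 \omega
\quad\Longrightarrow\quad
\lambda = \frac{\omega^T \Sigma_1 \omega}{\omega^T \Sigma_2 \omega}
```

Eigenvectors taken from the two ends of the spectrum therefore give the largest and the smallest values of the Rayleigh quotient in (2).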
B. Multiclass Situation by Bayesian Error Estimation
The CSP is suitable for two-class classification only. Zheng and Lin [23] addressed the multiclass paradigm via Bayesian classification error estimation. By the assumption that the distribution of the EEG sampling point $x$ conditioned on class $c_l$ is Gaussian, i.e., $p_l = \mathcal{N}(x; 0, \Sigma_l)$, the filtered EEG signal $y = \omega^T x$ is also Gaussian distributed according to $\mathcal{N}(y; 0, \omega^T \Sigma_l \omega)$. Based on the Gaussian distribution, Zheng and Lin [23] obtained the upper bound of the Bayesian error of classifying $\omega^T x$, given by
$$\varepsilon(\omega) \le \frac{q M (M-1)}{2} - \frac{q^3}{32} \left[ \sum_{l=1}^{M} \frac{|\omega^T (\Sigma_l - \bar{\Sigma}) \omega|}{\omega^T \bar{\Sigma} \omega} \right]^2 \tag{4}$$
where $q$ is the common a priori probability of the $M$ classes, and $\bar{\Sigma} = \sum_{m=1}^{M} q \Sigma_m$. Minimizing the upper bound of the Bayesian classification error is equivalent to maximizing the discriminant criterion
$$J(\omega) = \sum_{l=1}^{M} \frac{|\omega^T (\Sigma_l - \bar{\Sigma}) \omega|}{\omega^T \bar{\Sigma} \omega}. \tag{5}$$
So, the $G$ filters are defined as [23]
$$\omega_1 = \arg\max_{\omega} J(\omega) \tag{6}$$
$$\omega_G = \arg\max_{\omega^T \bar{\Sigma} \omega_g = 0,\; g = 1, \ldots, G-1} J(\omega). \tag{7}$$
Using some suitable estimates for $\Sigma_l$ and $\bar{\Sigma}$, the defined filters can be determined one by one via the rank-one update and power iteration procedure.
III. WPC
For the specific classification of a multiclass EEG single-trial segment $X$, a more direct approach is to optimize the feature $\omega^T X X^T \omega$, rather than $\omega^T x$, as it is the feature used for classification. It is ideal to obtain the Bayesian classification error-based optimal criterion for the classification of $\omega^T X X^T \omega$. The Bayesian classification error is, in general, too complex to be calculated directly. Therefore, the upper bound of the Bayesian classification error, which is also required to be easy to optimize in practice, is usually estimated as a suboptimal criterion. In this section, we develop a new discriminant criterion based on the upper bound of the Bayesian classification error of EEG single trials. It is noted that we take $\omega^T X X^T \omega$, rather than $\omega^T x$, as our target element in deriving the upper bound of the multiclass Bayesian classification error.
A. Upper Bound of Multiclass Bayesian Error
Recalling $X = [x_1, \ldots, x_N]$, we have $\omega^T X X^T \omega = \sum_{i=1}^{N} (\omega^T x_i)^2$, where $N$ is the number of sample points in one trial. Under the assumption of independent Gaussian distribution $\mathcal{N}(0, \Sigma_l)$ of $x_i$ conditioned on class $c_l$, the ratio $(\omega^T X X^T \omega)/(\omega^T \Sigma_l \omega)$ follows the $\chi^2$ distribution with $N$ degrees of freedom. Usually, the number of sampling points $N$ is very large, say $N > 30$. By the central limit theorem, $\omega^T X X^T \omega$ conditioned on class $c_l$ approaches the Gaussian distribution with mean $N \omega^T \Sigma_l \omega$ and variance $2N(\omega^T \Sigma_l \omega)^2$. For the time being, we assume that $2N(\omega^T \Sigma_l \omega)^2$ is less than one; the general case is addressed later on.
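The mean and variance quoted above can be checked directly. The short derivation below is a sketch that writes $z_i = \omega^T x_i$ and $\sigma_l^2 = \omega^T \Sigma_l \omega$ (both are shorthand introduced here, not notation from the paper) and assumes the $z_i$ are independent $\mathcal{N}(0, \sigma_l^2)$ given class $c_l$, so that $\mathrm{E}[z_i^4] = 3\sigma_l^4$.

```latex
\begin{aligned}
\mathrm{E}\Big[\sum_{i=1}^{N} z_i^2\Big] &= N\,\mathrm{E}[z_i^2] = N\sigma_l^2, \\
\mathrm{Var}\Big[\sum_{i=1}^{N} z_i^2\Big] &= N\big(\mathrm{E}[z_i^4] - \mathrm{E}[z_i^2]^2\big)
  = N\big(3\sigma_l^4 - \sigma_l^4\big) = 2N\sigma_l^4 .
\end{aligned}
```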
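Denote by $\varepsilon_{lm}(\omega)$ the Bayesian error between classes $c_l$ and $c_m$, i.e.,
$$\varepsilon_{lm}(\omega) = q_l P(f(X) \ne c_l \mid c_l) + q_m P(f(X) \ne c_m \mid c_m) \tag{8}$$
where $P(f(X) \ne c_l \mid c_l)$ is the probability that samples belonging to class $c_l$ are misclassified into class $c_m$, $f(\cdot)$ denotes the Bayesian classifier, and $q_l$ and $q_m$ are the a priori probabilities of classes $c_l$ and $c_m$, respectively. Since the data $\omega^T X X^T \omega$ conditioned on each class are (approximately) Gaussian distributed with variance less than one and mean being, for example, $N \omega^T \Sigma_l \omega$ if conditioned on class $c_l$, it follows that
$$q_l P(f(X) \ne c_l \mid c_l) + q_m P(f(X) \ne c_m \mid c_m) \le \int_{D_m} q_l p_l(x)\,dx + \int_{D_l} q_m p_m(x)\,dx \tag{9}$$
where $p_l(x)$ and $p_m(x)$ are the probability density functions of the Gaussian distributions $\mathcal{N}(N \omega^T \Sigma_l \omega, 1)$ and $\mathcal{N}(N \omega^T \Sigma_m \omega, 1)$, respectively, and $D_m$ and $D_l$ are defined as
$$D_m = \{x : q_m p_m(x) \ge q_l p_l(x)\} \tag{10}$$
$$D_l = \{x : q_l p_l(x) > q_m p_m(x)\}. \tag{11}$$
Suppose $q_l = q_m = q$. Then, we have [21]
$$\int_{D_m} p_l(x)\,dx + \int_{D_l} p_m(x)\,dx = 1 - \mathrm{erf}\!\left(\frac{N |\omega^T (\Sigma_l - \Sigma_m) \omega|}{2\sqrt{2}}\right) \tag{12}$$
where the error function (erf) is defined as $\mathrm{erf}(x) = (2/\sqrt{\pi}) \int_0^x e^{-u^2}\,du$. By (8), (9), and (12), the Bayesian error between classes $c_l$ and $c_m$ in the 1-D feature space after being projected onto $\omega$ is expressed as
$$\varepsilon_{lm}(\omega) \le q \left[ 1 - \mathrm{erf}\!\left(\frac{N |\omega^T (\Sigma_l - \Sigma_m) \omega|}{2\sqrt{2}}\right) \right]. \tag{13}$$
It is still complex to optimize $\omega$ via (13), since $\omega$ is embedded in the error function. We would like to isolate $\omega$ from the error function. We present the following inequality. For $0 \le x \le a$, we have
$$\mathrm{erf}(x) \ge \frac{1}{a}\,\mathrm{erf}(a)\,x. \tag{14}$$
The equality holds when taking $x = 0$ or $x = a$. The proof is given in Appendix A. Let
$$\Delta_{lm}(\omega) = |\omega^T \Sigma_l \omega - \omega^T \Sigma_m \omega| \tag{15}$$
be the absolute distance between classes $c_l$ and $c_m$ in the reduced 1-D feature space. By (14), we have
$$\mathrm{erf}\!\left(\frac{N \Delta_{lm}(\omega)}{2\sqrt{2}}\right) \ge \frac{1}{\bar{\Delta}_{lm}}\,\mathrm{erf}\!\left(\frac{N \bar{\Delta}_{lm}}{2\sqrt{2}}\right) \Delta_{lm}(\omega) \tag{16}$$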
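where $\bar{\Delta}_{lm}$ is the maximum value of $\Delta_{lm}(\omega)$. Note that we have required, at the beginning of this section, that the magnitude of $\omega$ be subject to the constraint $2N(\omega^T \Sigma_l \omega)^2 < 1$. The left and right expressions of (16) are not equal for all directions of $\omega$. The two expressions are equal when $\Delta_{lm}(\omega)$ arrives at its maximum or minimum value. Combining (13) and (16), we have
$$\varepsilon_{lm}(\omega) \le q \left[ 1 - \frac{1}{\bar{\Delta}_{lm}}\,\mathrm{erf}\!\left(\frac{N \bar{\Delta}_{lm}}{2\sqrt{2}}\right) \Delta_{lm}(\omega) \right]. \tag{17}$$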
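The step from (13) to (17) replaces the erf term by its chord on $[0, \bar{\Delta}_{lm}]$, so the linear bound can never fall below the erf bound. A minimal numerical sketch of that relation follows; the function names and the values of $N$, $q$, and the maximum distance used below are hypothetical and chosen only for illustration.

```python
import math

SQRT8 = 2.0 * math.sqrt(2.0)


def erf_bound(delta, n_samples, q):
    """Right-hand side of (13): q * (1 - erf(N * delta / (2 * sqrt(2))))."""
    return q * (1.0 - math.erf(n_samples * delta / SQRT8))


def linear_bound(delta, delta_max, n_samples, q):
    """Right-hand side of (17): erf is replaced by its chord on [0, delta_max]."""
    slope = math.erf(n_samples * delta_max / SQRT8) / delta_max
    return q * (1.0 - slope * delta)


# Hypothetical setting: N = 100 samples, four equiprobable classes (q = 0.25),
# and a maximum pairwise distance of 0.05.
for k in range(6):
    delta = 0.05 * k / 5
    print(f"{delta:.3f}  {erf_bound(delta, 100, 0.25):.4f}  "
          f"{linear_bound(delta, 0.05, 100, 0.25):.4f}")
```

The two columns coincide at the endpoints of the interval, which is where (14) holds with equality.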
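For the $M$-class problem, the upper bound of the Bayesian error is calculated as [5]
$$\varepsilon(\omega) \le \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \varepsilon_{lm}(\omega) \le \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} q \left[ 1 - \frac{1}{\bar{\Delta}_{lm}}\,\mathrm{erf}\!\left(\frac{N \bar{\Delta}_{lm}}{2\sqrt{2}}\right) \Delta_{lm}(\omega) \right]. \tag{18}$$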
B. Discriminant Criterion Based on Upper Bound of Multiclass
Bayesian Error
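To minimize the Bayesian error, we should minimize its upper bound, which reduces to maximizing the following discriminant criterion:
$$J_P(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{1}{\bar{\Delta}_{lm}}\,\mathrm{erf}\!\left(\frac{N \bar{\Delta}_{lm}}{2\sqrt{2}}\right) \Delta_{lm}(\omega). \tag{19}$$
Let
$$\alpha_{lm} = \frac{1}{\bar{\Delta}_{lm}}\,\mathrm{erf}\!\left(\frac{N \bar{\Delta}_{lm}}{2\sqrt{2}}\right). \tag{20}$$
Then, $J_P(\omega)$ can be rewritten as
$$J_P(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \alpha_{lm} \Delta_{lm}(\omega). \tag{21}$$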
Fig. 1. Weighting function $\phi(u) = (1/u)\,\mathrm{erf}(u)$.
The $\alpha_{lm}$ can be viewed as the weight imposed on the pair of classes $c_l$ and $c_m$. Since $N \omega^T \Sigma_l \omega$ and $N \omega^T \Sigma_m \omega$ are the distribution means of classes $c_l$ and $c_m$ in the reduced 1-D feature space, respectively, the quantity $N \bar{\Delta}_{lm}$ in (20) is the maximum distance (with respect to $\omega$) between the two class means. It reflects the separability of the two classes. Note that $\phi(u) = (1/u)\,\mathrm{erf}(u)$ is a monotonically decreasing function of $u$, as shown in Fig. 1. So, the pairwise class weighting function $\alpha_{lm}$ in (20) is monotonically decreasing with respect to $N \bar{\Delta}_{lm}$. That is, in (21), we impose heavier weights onto close class pairs, which are more likely to be misclassified. The emphasis placed on close class pairs helps make the criterion suited to producing separability in the output space.
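The behavior of the weights can be illustrated numerically. The sketch below evaluates the weighting function of Fig. 1 and the pairwise weight of (20); the function names `phi` and `pair_weight`, and the sample values in the usage lines, are hypothetical and only serve the illustration.

```python
import math


def phi(u):
    """Weighting function of Fig. 1, erf(u) / u, with its limit 2/sqrt(pi) at u = 0."""
    if u == 0.0:
        return 2.0 / math.sqrt(math.pi)
    return math.erf(u) / u


def pair_weight(delta_max, n_samples):
    """Weight of (20): (1 / delta_max) * erf(N * delta_max / (2 * sqrt(2)))."""
    return math.erf(n_samples * delta_max / (2.0 * math.sqrt(2.0))) / delta_max


# Hypothetical maximum distances for three class pairs, with N = 100 samples.
for delta_max in (0.001, 0.01, 0.05):
    print(delta_max, pair_weight(delta_max, 100))
```

With $c = N/(2\sqrt{2})$, the weight equals $c\,\phi(c\,\bar{\Delta}_{lm})$, so for a fixed $N$ the closest pair (smallest maximum distance) receives the largest weight.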
C. Discriminant Criterion: General Case of $\omega$
In the earlier derivation, we require that the variance $2N(\omega^T \Sigma_l \omega)^2$ is less than 1. This requirement is satisfied by restricting the length of $\omega$. It suffices to restrict $\omega$ such that $\omega^T (\sqrt{2N} M \bar{\Sigma}) \omega = 1$. In fact, for any class $c_l$, we have $2N(\omega^T \Sigma_l \omega)^2$
$s$ is the sign that results in the largest first principal eigenvalue of all possible $s$. Once the first vector $\omega_1$ is obtained, we proceed to find the second vector $\omega_2$ in the orthogonal complement of $\omega_1$, i.e., in the space spanned by $I_K - \omega_1 \omega_1^T$, where $I_K$ denotes the $K$-dimensional identity matrix. Therefore, $\omega_2$ is solved as the first principal eigenvector of the deflated matrix $(I_K - \omega_1 \omega_1^T) H(s) (I_K - \omega_1 \omega_1^T)$. Note that we use the same symbol $s$, but it is not necessarily the same as the one producing $\omega_1$. Generally, suppose the first $g$ vectors $\omega_1, \ldots, \omega_g$ have been obtained. The $(g+1)$th vector is determined in the orthogonal complement of the space spanned by $\omega_1, \ldots, \omega_g$, i.e., in the space spanned by $I_K - U_g U_g^T$, where $U_g$ is the matrix of an orthonormal basis of $\omega_1, \ldots, \omega_g$, which, for example, can be obtained by the Schmidt orthogonalization procedure. So, $\omega_{g+1}$ is solved as the first principal eigenvector of $(I_K - U_g U_g^T) H(s) (I_K - U_g U_g^T)$. Given the obtained $\omega_{g+1}$, according to the Schmidt orthogonalization procedure, the basis matrix $U_{g+1}$ is formed by padding $U_g$ as $U_{g+1} = [U_g, u_{g+1}]$, where $u_{g+1} = (\omega_{g+1} - U_g (U_g^T \omega_{g+1})) / \|\omega_{g+1} - U_g (U_g^T \omega_{g+1})\|$. In theory, $\omega_{g+1}$ is orthogonal to $U_g$, i.e., $U_g^T \omega_{g+1} = 0$, which implies that $u_{g+1}$ is simply the normalized $\omega_{g+1}$. We keep the previous Schmidt orthogonalization procedure for computational precision in practice. Note that $I_K - U_{g+1} U_{g+1}^T = (I_K - u_{g+1} u_{g+1}^T)(I_K - U_g U_g^T)$, which makes it feasible to compute $(I_K - U_{g+1} U_{g+1}^T) H(s) (I_K - U_{g+1} U_{g+1}^T)$ for the next step by updating $(I_K - U_g U_g^T) H(s) (I_K - U_g U_g^T)$ through multiplying by $I_K - u_{g+1} u_{g+1}^T$ from both sides.
In practice, the covariance matrices $\Sigma_l$ ($l = 1, \ldots, M$) are usually unknown and thus need to be estimated. The expression defined in (1) provides a way of estimation. We summarize the optimization procedure of multiclass filters via the WPC approach in Table I.
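The deflation step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of Table I: the matrix `h` stands in for a fixed $H(s)$ and is assumed symmetric positive semidefinite so that plain power iteration converges to the principal eigenvector, the search over the signs $s$ is omitted, and all function names are hypothetical.

```python
import math


def matvec(a, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project_out(v, basis):
    """Apply (I - U U^T) to v, where `basis` holds the orthonormal columns of U."""
    for u in basis:
        coeff = sum(ui * vi for ui, vi in zip(u, v))
        v = [vi - coeff * ui for ui, vi in zip(u, v)]
    return v


def leading_vectors(h, count, iters=500):
    """First `count` principal eigenvectors of a symmetric PSD matrix h."""
    k = len(h)
    basis = []
    for _ in range(count):
        # Deterministic start vector, assumed not orthogonal to the target.
        v = normalize(project_out([1.0 + 0.1 * i for i in range(k)], basis))
        for _ in range(iters):
            # One power step on the deflated matrix (I - U U^T) H (I - U U^T).
            v = project_out(matvec(h, project_out(v, basis)), basis)
            v = normalize(v)
        # Re-projecting and normalizing plays the role of the Schmidt step.
        basis.append(v)
    return basis
```

Projecting inside every power step, rather than forming the deflated matrix explicitly, keeps each iterate in the orthogonal complement of the vectors already found; the text instead updates the deflated matrix by a rank-one factor, which is equivalent in exact arithmetic.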
E. Classification
Suppose $\omega_g$ ($g = 1, \ldots, G$) are the $G$ filters obtained by WPC. For any EEG data segment $X_t$, we extract the features as
$$\sigma_g^2 = \omega_g^T X_t X_t^T \omega_g, \quad (g = 1, \ldots, G). \tag{31}$$
The extracted features on the training EEG data are used to design a classifier. For a testing EEG segment, its features are extracted in the same way and input into the trained classifier to predict its class label.
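The feature of (31) is the sum of squared filter outputs over the time samples of a trial. A minimal sketch, assuming a trial is stored as a list of $N$ time samples with $K$ channel values each (the columns of $X_t$); the function name and the toy numbers are hypothetical.

```python
def band_power_features(filters, trial):
    """Features of (31): for each filter w, sum_i (w^T x_i)^2 = w^T X X^T w."""
    features = []
    for w in filters:
        power = sum(sum(wk * xk for wk, xk in zip(w, x)) ** 2 for x in trial)
        features.append(power)
    return features


# Toy trial with K = 2 channels and N = 2 time samples.
trial = [[1.0, 2.0], [3.0, 4.0]]
filters = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(band_power_features(filters, trial))  # [10.0, 20.0, 58.0]
```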
IV. COMPARISON AND EXTENSION
In this section, we compare the proposed WPC approach with
[23]. The starting points and formulations of the two methods
are completely different. We also extend WPC by integrating
temporal information.
A. Comparison With [23]
Both the proposed WPC approach and [23] are based on Bayesian error estimation. However, the classification targets, and hence the criteria, of the two methods are different, as summarized in Table II. The discriminant criterion of [23] is derived by minimizing the upper bound of the Bayesian error of classifying $\omega^T x$, while WPC directly takes the feature $\omega^T X X^T \omega$ used in EEG single-trial classification as its target.

TABLE I
OPTIMIZATION PROCEDURE OF MULTICLASS FILTERS VIA THE WPC APPROACH
TABLE II
COMPARISON BETWEEN WPC AND ZHENG AND LIN [23]

It is noted that
$$q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} |\omega^T (\Sigma_l - \Sigma_m) \omega| \le \sum_{l=1}^{M} |\omega^T (\Sigma_l - \bar{\Sigma}) \omega| \le 2q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} |\omega^T (\Sigma_l - \Sigma_m) \omega|. \tag{32}$$
The proof is given in Appendix B. Let
$$\tilde{J}(\omega) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{|\omega^T (\Sigma_l - \Sigma_m) \omega|}{\omega^T \bar{\Sigma} \omega}. \tag{33}$$
Then, we have $q \tilde{J}(\omega) \le J(\omega) \le 2q \tilde{J}(\omega)$. So, maximizing $J(\omega)$ can be roughly performed by maximizing $\tilde{J}(\omega)$. Using $\nu = \bar{\Sigma}^{1/2} \omega$, maximizing $\tilde{J}(\omega)$ is equivalent to maximizing
$$\tilde{J}(\nu) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{|\nu^T (\tilde{\Sigma}_l - \tilde{\Sigma}_m) \nu|}{\nu^T \nu} \tag{34}$$
with $\tilde{\Sigma}_l = \bar{\Sigma}^{-1/2} \Sigma_l \bar{\Sigma}^{-1/2}$, which is the sum of $M(M-1)/2$ pairs of absolute distances between $\nu^T \tilde{\Sigma}_l \nu$ and $\nu^T \tilde{\Sigma}_m \nu$ subject to $\|\nu\| = 1$.
The maximization of (34), however, may not be very appropriate for classifying multiclass EEG single trials in some cases. For example, consider the situation in which one class has a large difference (in terms of $\tilde{\Sigma}$) from the other classes. To maximize the criterion $\tilde{J}(\nu)$, the class pairs (say $c_l$ and $c_m$) that have large differences (between $\tilde{\Sigma}_l$ and $\tilde{\Sigma}_m$) heavily control the selection of the direction of $\nu$. Note that $\nu$ is of unit length. As a result, the remote class is projected as far as possible from the other classes, while close classes are more likely to be merged.
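Both the sandwich inequality (32) and the dominance of a remote class can be seen on scalar projected variances. The sketch below assumes equal priors $q = 1/M$ and takes a list of hypothetical values $a_l = \omega^T \Sigma_l \omega$ for one fixed filter; the function name is illustrative.

```python
def pairwise_bounds(variances):
    """Return the three terms of (32) for projected variances a_l, with q = 1/M."""
    m = len(variances)
    q = 1.0 / m
    mean = q * sum(variances)  # plays the role of w^T Sigma_bar w
    to_mean = sum(abs(a - mean) for a in variances)
    pairwise = sum(
        abs(variances[l] - variances[k])
        for l in range(m)
        for k in range(l + 1, m)
    )
    return q * pairwise, to_mean, 2.0 * q * pairwise


# Three close classes and one remote class (hypothetical values).
print(pairwise_bounds([1.0, 1.2, 1.1, 5.0]))
```

In this toy example the three pairs that involve the remote class contribute almost all of the unweighted pairwise sum, which is the effect the weights of WPC are meant to counteract.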
By contrast, with the weight $(1/\bar{\Delta}_{lm})\,\mathrm{erf}(\sqrt{N}\,\bar{\Delta}_{lm}/(4M))$, the criterion $J_P(\cdot)$ in (25) or $J_P(\cdot)$ in (22) de-emphasizes the influence of large class differences, where the classes are already well separated, and gives great emphasis to small class differences, where the close classes are more likely to be confused.
An interesting connection between WPC and [23] is that, when applied in the two-class paradigm, criterion $J_P(\cdot)$ and criterion $J(\cdot)$ produce the same solution of $\omega$. This, however, does not necessarily imply that the upper bounds of the Bayesian errors of these two methods are equal. On the other hand, comparing the upper bounds of the Bayesian errors derived by the two methods is meaningless since they are derived for classifying different objects.
B. Extension to WPC: Integrating Temporal Information
The set of filters of WPC is obtained by considering the classification of the projected variance $\omega^T X X^T \omega$. Note that $X$ is a segment of the EEG single-trial time course. The covariance formulation $X X^T$, however, is globally independent of time, so the temporal information is completely ignored. From the study of neurophysiology, EEG signals are usually nonstationary. It is useful to integrate the temporal information into the covariance formulation, reflecting the temporal manifold of the EEG time course [20]. Specifically, by the fact that $X X^T = (1/2N) \sum_{i,j=1}^{N} (x_i - x_j)(x_i - x_j)^T$, we use the temporally local covariance matrix
$$C = \frac{1}{2N} \sum_{i,j=1}^{N} (x_i - x_j)(x_i - x_j)^T A(i, j) \tag{35}$$
for covariance modeling instead of $X X^T$, which is time independent. The time-dependent adjacency value $A(i, j)$ is defined such that only temporally close sample pairs, say $\{x_i, x_j : |i - j| < \tau\}$ with $\tau$ being a temporal range parameter, are selected to contribute to the summation (35). The value $A(i, j)$ is monotonically decreasing with respect to the temporal distance between selected sample pairs. In this paper, the adjacency matrix $A$ is defined using Tukey's tricube weighting function [6]
$$A(i, j) = \begin{cases} \left( 1 - \left| \dfrac{i - j}{\tau} \right|^3 \right)^3, & |i - j| < \tau \\ 0, & \text{else.} \end{cases} \tag{36}$$
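With some algebraic derivations, $C$ is compactly expressed as $C = (1/N) X E X^T$, where $E = D - A$ is the Laplacian matrix, $A = (A(i, j))_{i,j=1,\ldots,N}$, and $D$ is the diagonal matrix whose diagonal entries are the row sums of $A$. Let $L = (1/N) E$. Then, under the same probabilistic assumption as in the previous section, the Gaussian quadratic form $\omega^T X L X^T \omega$ conditioned on class $c_l$ has mean $\mathrm{tr}(L)\,\omega^T \Sigma_l \omega$ and variance $2\,\mathrm{tr}(L^2)(\omega^T \Sigma_l \omega)^2$, where $\mathrm{tr}(\cdot)$ is the trace operator. If $\omega^T X L X^T \omega$ is treated as the target feature for classification, then (28) is accordingly modified by integrating temporal information as
$$H_{TI}(s) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \frac{1}{\bar{\Delta}_{lm}}\,\mathrm{erf}\!\left( \frac{\mathrm{tr}(L)\,\bar{\Delta}_{lm}}{4M \sqrt{\mathrm{tr}(L^2)}} \right) s_{lm} (\Sigma_l - \Sigma_m). \tag{37}$$
Note that, in this case, the difference between the means of classes $c_l$ and $c_m$ becomes $\mathrm{tr}(L)\,(\omega^T (\Sigma_l - \Sigma_m) \omega)$, and $\omega$ is normalized as $\omega / \sqrt{\omega^T (\sqrt{2\,\mathrm{tr}(L^2)}\, M \bar{\Sigma}) \omega}$.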
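The adjacency weights of (36) and the scaled Laplacian $L = (1/N)(D - A)$ are straightforward to build. The following is a minimal sketch with plain Python lists; the function names are hypothetical, and the trial length and $\tau$ in the usage lines are arbitrary small values.

```python
def tricube(i, j, tau):
    """Adjacency weight of (36): (1 - |(i - j)/tau|^3)^3 if |i - j| < tau, else 0."""
    d = abs(i - j)
    if d >= tau:
        return 0.0
    return (1.0 - (d / tau) ** 3) ** 3


def temporal_laplacian(n, tau):
    """Return L = (1/N)(D - A) as a list of rows, with A built from `tricube`."""
    a = [[tricube(i, j, tau) for j in range(n)] for i in range(n)]
    lap = []
    for i in range(n):
        degree = sum(a[i])  # i-th diagonal entry of D (row sum of A)
        lap.append([((degree if i == j else 0.0) - a[i][j]) / n for j in range(n)])
    return lap


def trace_terms(lap):
    """tr(L) and tr(L^2), the two quantities that enter the weight in (37)."""
    n = len(lap)
    tr1 = sum(lap[i][i] for i in range(n))
    tr2 = sum(lap[i][j] * lap[j][i] for i in range(n) for j in range(n))
    return tr1, tr2


lap = temporal_laplacian(6, 3)
print(trace_terms(lap))
```

Every row of a Laplacian built this way sums to zero, which is why only differences $x_i - x_j$ of temporally close samples contribute to $X L X^T$.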
In the implementation of the temporal extension of WPC, the covariance matrices $\Sigma_l$ ($l = 1, \ldots, M$) are estimated as
$$\Sigma_l = \frac{1}{N |I_l|} \sum_{t \in I_l} X_t L X_t^T. \tag{38}$$
The optimization procedure can be carried out in the same way as for WPC. The features are extracted as
$$\sigma_g^2 = \omega_g^T X_t L X_t^T \omega_g, \quad (g = 1, \ldots, G) \tag{39}$$
where $\omega_g$ are the filters of the temporal extension of WPC.
1) Choice of $\tau$: The additional parameter $\tau$ is determined from the data using a three-way cross-validation strategy. This strategy contains two nested loops. In the outer loop, the samples are divided into $T_1$ folds, of which one fold is treated as the testing set. The testing samples are used for the estimation of generalization ability and are not involved in solving the filters or selecting the parameter. In the inner loop, the remaining $T_1 - 1$ folds are further divided into $T_2$ folds, of which one fold is treated as the validation set while the remaining $T_2 - 1$ folds are treated as the training set. For each $\tau$, the filters are solved on the training set, and then the recognition rate is calculated on the validation set. This procedure is repeated $T_2$ times with a different validation set each time. The average recognition rate is recorded as the recognition accuracy across the $T_2$ folds. We select the $\tau$ that results in the maximum recognition accuracy. We then solve the filters using all the $T_2$ folds with the optimal $\tau$ selected earlier. With the filters obtained, we calculate the recognition rate on the testing set specified in the outer loop. The earlier procedure is repeated $T_1$ times with a different fold as the testing set each time. The average recognition rate is computed as the final recognition accuracy across the $T_1$ folds.
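The two nested loops can be sketched as follows. This is an illustration only: `evaluate(train, test, param)` is a hypothetical callback that would fit the filters and the classifier on the `train` trial indices and return the recognition rate on `test`, and the folds here are simple interleaved splits rather than the per-class partition used in the experiments.

```python
def fold_indices(n_items, n_folds):
    """Split range(n_items) into n_folds interleaved folds."""
    return [list(range(f, n_items, n_folds)) for f in range(n_folds)]


def nested_cv(n_trials, candidates, evaluate, t1=3, t2=3):
    """Three-way cross-validation: select the parameter inside, test outside."""
    outer_scores = []
    for test in fold_indices(n_trials, t1):
        held_out = set(test)
        rest = [i for i in range(n_trials) if i not in held_out]
        inner = [[rest[k] for k in fold] for fold in fold_indices(len(rest), t2)]

        def inner_score(param):
            # Average validation accuracy of one candidate over the T2 folds.
            total = 0.0
            for val in inner:
                val_set = set(val)
                train = [i for i in rest if i not in val_set]
                total += evaluate(train, val, param)
            return total / t2

        best = max(candidates, key=inner_score)
        # Refit on all T2 folds with the selected parameter, then test.
        outer_scores.append(evaluate(rest, test, best))
    return sum(outer_scores) / t1
```

The testing fold never reaches `inner_score`, which mirrors the requirement that the testing samples play no part in choosing $\tau$.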
V. EXPERIMENTS
We evaluate the effectiveness of the proposed multiclass
methods on two publicly available datasets of BCI competi-
tions. These two datasets are of four-class motor imagery EEG
signals. We compare the classification performances of the pro-
posed multiclass methods with the multiclass CSP using one-versus-rest and using JAD [7], the multiclass information theo-
retic feature extraction [10], and the multiclass CSP presented
in [23].
A. EEG Datasets Used for Evaluation
1) Dataset IIIa of BCI Competition III: This dataset is of a four-class motor imagery paradigm recorded from three subjects (k3b, k6b, and l1b) [3]. The subjects, sitting relaxed in a normal chair, were asked to perform four different tasks of motor imagery (i.e., left hand, right hand, one foot, and tongue) by cues, which were presented in a randomized order. In each trial, the cue was displayed from the third second and lasted for 1.25 s. At the same time, the motor imagery started and continued until the fixation cross disappeared at the seventh second. So, the duration of the motor imagery in each trial was 4 s. For subject k3b, there were 90 trials for each mental task. For subjects k6b and l1b, each mental task cue appeared 60 times. In our experiment, we discard four trials of subject k6b because of missing data. The EEG measurements were recorded using 60 sensors by a 64-channel Neuroscan system. The left and right mastoids were used as reference and ground, respectively. The EEG signals were sampled at 250 Hz and filtered with cutoff frequencies of 1 and 50 Hz with the notch filter on.
2) Dataset IIa of BCI Competition IV: This dataset contains EEG signals recorded during a cue-based four-class motor imagery task from nine subjects [17]. Each trial started with a short acoustic warning tone along with a fixation cross displayed on the black screen. After 2 s, a visual cue was presented for 1.25 s, instructing the subjects to carry out the desired motor imagery task (i.e., the imagination of movement of the left hand, right hand, both feet, or tongue) from the third second until the fixation cross disappeared at the sixth second. Each subject participated in two sessions recorded on different days. There were 288 trials in each session for each subject, i.e., 72 trials per task. Twenty-two electrodes were used to record the EEG signals, which were sampled at 250 Hz and filtered with cutoff frequencies of 0.5 and 100 Hz with the notch filter on.
B. Experimental Settings and Results
The data are band-pass filtered between 5 and 35 Hz using a fifth-order Butterworth filter, as in [9] and [10]. The EEG segments recorded during the motor imagery period, i.e., from the third second to the seventh second in dataset IIIa of BCI competition III and from the third second to the sixth second in dataset IIa of BCI competition IV, are used in the experiment. We exploit the three-fold cross-validation strategy to evaluate the classification accuracy. That is, we partition all the trials of each class per subject into three divisions, in which each division is used as testing data while the remaining two divisions are used as training data. This procedure is repeated three times until each division is used once as testing data. In each repetition, for each filter obtained on the training data, features are obtained by projection on the 15 frequency bands of 2-Hz width in the range 5-35 Hz [9], [10]. Consequently, we obtain a $(15 \times G)$-dimensional feature vector for each trial, where $G$ is the number of filters selected on the training data. That is, we use the three-way cross-validation procedure with $T_1 = T_2 = 3$ to determine the value of $G$, where $G$ varies from 2 to 10 in steps of 2. The $(15 \times G)$-dimensional feature vectors are further reduced to 3-D¹ vectors by using the Fisher discriminant analysis (FDA) [21]. It should be noted that the spatial filters, the value of $G$, and the FDA weights are calculated on the basis of the training data and then applied to the testing data. The conventional nearest class mean classifier with Euclidean distance [21] is adopted to predict the class labels of the testing samples.
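The nearest class mean rule used as the final classifier is simple enough to sketch directly. The code below assumes the features have already been reduced (for instance to the 3-D FDA vectors); the function names and the toy data are hypothetical.

```python
import math


def class_means(features, labels):
    """Mean feature vector of each class, as a dict from label to mean."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(means, x):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    return min(means, key=lambda y: math.dist(means[y], x))


train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["a", "a", "b", "b"]
means = class_means(train_x, train_y)
print(predict(means, [1.0, 1.0]), predict(means, [5.0, 4.0]))  # a b
```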
Table III reports the classification accuracies obtained by using the multiclass filters solved by the various methods. Note that we also evaluate the classification accuracy of WPC integrating temporal information (WPC/TI), where the parameter $\tau$ is determined by the three-way cross-validation procedure with $T_1 = T_2 = 3$. Here $\tau$ is varied logarithmically from 1 to 5 in steps of 1. It is observed that the proposed WPC method achieves much better classification accuracy than the existing multiclass methods, and WPC/TI further improves the results in most cases. The improvement of WPC/TI is attributed to the local temporal modeling. The lower classification accuracies of WPC/TI relative to WPC in a few cases may be due to overfitting.
C. Comparison With BCI Competition IV
For dataset IIa of BCI competition IV, to compare with the re-
sults of the winners, we use the evaluation of session-to-session
transfer from session one to session two in terms of kappa score,simulating competition scenario. The procedure of the session-
to-session transfer is much simpler than the cross validation.
Specifically, we use the first session as training data and the
second session as testing data. All the experimental settings are
same with the description in the previous section except that the
training data and the testing data are now fixed. The classifica-
tion accuracy is summarized in Table IV. It can be seen that our
proposed methods achieve fairly good classification performance compared with the results obtained by the best two competitors. Note that the results obtained by the multiclass CSP using one-versus-rest and using JAD are slightly different from those reported in [9], since different classifiers and time segments are used.

1 Since the number of classes is four, we can obtain at most three dimensions of features by FDA, which is known as the rank-limit problem.

TABLE III
COMPARISON OF THE CLASSIFICATION ACCURACIES (%) OF THE PROPOSED WPC AND WPC/TI METHODS WITH THE EXISTING MULTICLASS METHODS FOR EACH SUBJECT ON THE DATASETS OF BCI COMPETITIONS, WHERE M1–M6 REFER TO MULTICLASS CSP USING ONE-VERSUS-REST, MULTICLASS CSP USING JAD, MULTICLASS INFORMATION THEORETIC FEATURE EXTRACTION, MULTICLASS CSP IN [23], WPC, AND WPC/TI, RESPECTIVELY

TABLE IV
KAPPA SCORES OF VARIOUS MULTICLASS METHODS FOR EACH SUBJECT ON DATASET IIA OF BCI COMPETITION IV USING SESSION-TO-SESSION TRANSFER, WHERE NO. 1 AND NO. 2 REFER TO THE BEST TWO COMPETITORS, AND M1–M6 ARE THE SAME AS THOSE OF TABLE III
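The kappa score used for the session-to-session evaluation can be illustrated with a short sketch. It assumes the score is Cohen's kappa computed from the true and predicted labels of the test session; the function name `cohen_kappa` and the toy labels are hypothetical and are not taken from the paper or the competition code.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of trials that are classified correctly.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement, computed from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


# Toy four-class example (hypothetical labels): 6 of 8 trials are correct.
y_true = [1, 2, 3, 4, 1, 2, 3, 4]
y_pred = [1, 2, 3, 4, 1, 2, 4, 3]
kappa = cohen_kappa(y_true, y_pred)
```

With four balanced classes, chance-level accuracy (25%) corresponds to a kappa of zero and perfect classification to a kappa of one, which is why kappa is convenient for comparing multiclass results.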
In our experiment, a simple classification procedure is employed to reveal the effectiveness of the multiclass filters obtained by WPC and WPC/TI. The classification performance may be improved if we solve filters in narrower frequency bands, tune the optimal time segment for each trial, and/or use other sophisticated classifiers. The goal of this paper is to demonstrate the effectiveness of the weighted scheme for solving multiclass filters: with the same experimental settings for all the methods, the weighted pairwise design produces a much higher classification accuracy.
VI. CONCLUSION
In this paper, we propose a new discriminant criterion, called WPC, for optimizing multiclass filters. The approach is established by minimizing the upper bound of the Bayesian error of classifying EEG single-trial segments, resulting in a sum of weights imposed on individual class pairs according to their closeness. We place special emphasis on the effect of closer classes, which are more likely to cause misclassification. In other words, the contributions of different class pairs to the discriminant criterion are weighted unequally. Computationally, the WPC algorithm is conveniently solved by the rank-one update and power iteration technique.
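As a reminder of what the power iteration step does, the following is a minimal generic sketch. It is not the WPC solver itself: it assumes a small symmetric positive semidefinite matrix given as nested lists, and the function name `power_iteration` and its parameters are hypothetical.

```python
import math


def power_iteration(A, num_iters=500, tol=1e-12):
    """Approximate the dominant eigenpair of a symmetric PSD matrix A."""
    n = len(A)
    # Start from a uniform unit vector (assumed not orthogonal to the
    # dominant eigenvector).
    v = [1.0 / math.sqrt(n)] * n
    eigval = 0.0
    for _ in range(num_iters):
        # Multiply by A and renormalize.
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        w = [x / norm for x in w]
        # Rayleigh quotient of the new unit vector estimates the eigenvalue.
        new_eigval = sum(
            w[i] * sum(A[i][j] * w[j] for j in range(n)) for i in range(n)
        )
        converged = abs(new_eigval - eigval) < tol
        v, eigval = w, new_eigval
        if converged:
            break
    return eigval, v


# Toy example: the eigenvalues of this matrix are (7 +/- sqrt(5)) / 2.
value, vector = power_iteration([[4.0, 1.0], [1.0, 3.0]])
```

Each iteration costs only one matrix-vector product, which is what makes the technique attractive when a single leading direction is needed at a time.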
The proposed WPC approach is intentionally formulated for classifying EEG single-trial data. It takes into account classification errors of EEG trials between pairs of classes. While the criterion derived from the Bayesian error of classifying EEG sampling points is reasonable, large pairwise class differences may play an overwhelming role in its optimization. By contrast, WPC directly uses the same features for optimizing spatial filters as for classification. Moreover, we extend WPC by integrating the temporal information of EEG series into the covariance matrix formulation. The effectiveness of the proposed WPC method is demonstrated by the classification of four motor imagery tasks on two datasets of BCI competitions.
Finally, we point out that the Bayesian error estimation relies heavily on the assumption of independent Gaussian distributions. This assumption, however, does not hold strictly in applications, since EEG data usually have an autocorrelation structure. One possible remedy is to use a Gaussian mixture model instead of a single Gaussian distribution. We are studying this issue theoretically and practically.
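APPENDIX A
PROOF OF (14)
For $0 \le x \le a$, in the error function

$$
\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-u^{2}} \, du \tag{40}
$$

we use the variable substitution $v = \frac{a}{x} u$. Then, we have

$$
\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{a} e^{-v^{2} \left( \frac{x}{a} \right)^{2}} \, \frac{x}{a} \, dv. \tag{41}
$$

Since $0 \le \frac{x}{a} \le 1$, it follows that

$$
\operatorname{erf}(x) \ge \frac{x}{a} \, \frac{2}{\sqrt{\pi}} \int_{0}^{a} e^{-v^{2}} \, dv = \frac{1}{a} \operatorname{erf}(a) \, x \tag{42}
$$

which completes the proof.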
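The bound proved in this appendix, namely that erf(x) is at least erf(a) x / a for x between 0 and a, can be checked numerically. The sketch below is only an illustration on a grid of points; the function names and the grid size are hypothetical choices.

```python
import math


def erf_linear_lower_bound(x, a):
    """Right-hand side of the bound: the chord of erf from 0 to a."""
    return math.erf(a) * x / a


def check_bound(a, steps=1000):
    """Return True if erf(x) >= erf(a) * x / a on a grid over [0, a]."""
    for i in range(steps + 1):
        x = a * i / steps
        # A small tolerance absorbs floating-point rounding near x = a,
        # where the two sides are equal.
        if math.erf(x) < erf_linear_lower_bound(x, a) - 1e-12:
            return False
    return True


results = [check_bound(a) for a in (0.5, 2.0, 5.0)]
```

The check succeeds because erf is concave on the nonnegative axis, so its graph lies above the chord joining the origin to the point (a, erf(a)).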
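APPENDIX B
PROOF OF (32)
We have

$$
\sum_{l=1}^{M} \left| \omega^{T} \Delta^{(l)} \omega \right|
= \sum_{l=1}^{M} \left| \omega^{T} \Big( \Sigma_{l} - \sum_{m=1}^{M} q \, \Sigma_{m} \Big) \omega \right|
= \sum_{l=1}^{M} \left| \sum_{m=1}^{M} q \, \omega^{T} \Delta^{(lm)} \omega \right|
\le 2q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left| \omega^{T} \Delta^{(lm)} \omega \right|. \tag{43}
$$

On the other hand, we have

$$
\begin{aligned}
q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left| \omega^{T} \Delta^{(lm)} \omega \right|
&= \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \left| \omega^{T} \Delta^{(l)} \omega - \omega^{T} \Delta^{(m)} \omega \right| \\
&\le \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \left| \omega^{T} \Delta^{(l)} \omega \right| + \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \left| \omega^{T} \Delta^{(m)} \omega \right| \\
&= \frac{1}{2} \sum_{l=1}^{M} \left| \omega^{T} \Delta^{(l)} \omega \right| + \frac{1}{2} \sum_{m=1}^{M} \left| \omega^{T} \Delta^{(m)} \omega \right| \\
&= \sum_{l=1}^{M} \left| \omega^{T} \Delta^{(l)} \omega \right|.
\end{aligned} \tag{44}
$$

The proof is, thus, established.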
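Both inequalities in this appendix involve only one scalar per class, obtained by projecting that class's matrix onto the filter, so they can be checked numerically on scalars. The sketch below assumes equal class priors q = 1/M (an assumption consistent with the step from q/2 to 1/2 in (44)); the function name `pairwise_bounds` and the random test values are hypothetical.

```python
import random


def pairwise_bounds(s):
    """Given per-class projected scalars s[l], return the pair
    (sum of |s[l] - weighted mean|, q * sum over l < m of |s[l] - s[m]|)
    with equal priors q = 1 / M (assumed)."""
    M = len(s)
    q = 1.0 / M
    mean = q * sum(s)
    centered = sum(abs(v - mean) for v in s)
    pairwise = sum(
        abs(s[l] - s[m]) for l in range(M - 1) for m in range(l + 1, M)
    )
    return centered, q * pairwise


# Hand-checkable example: mean is 4, centered sum is 10, pairwise sum is 26.
centered, scaled_pairwise = pairwise_bounds([1.0, 2.0, 4.0, 9.0])

# Random check of: q * pairwise <= centered <= 2 * q * pairwise.
rng = random.Random(0)
all_hold = True
for _ in range(200):
    s = [rng.uniform(0.0, 10.0) for _ in range(4)]
    c, p = pairwise_bounds(s)
    if not (p <= c + 1e-12 and c <= 2.0 * p + 1e-12):
        all_hold = False
```

The two-sided relation shows that the centered sum and the scaled pairwise sum differ by at most a factor of two, so bounding one controls the other.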
ACKNOWLEDGMENT
The author would like to thank the anonymous referees and
the editors for their constructive recommendations, which improved
the paper substantially.
REFERENCES
[1] A. Bashashati, M. Fatourechi, R. K. Ward, and G. E. Birch, "A survey of signal processing algorithms in brain–computer interfaces based on electrical brain signals," J. Neural Eng., vol. 4, no. 2, pp. R32–R57, Jun. 2007.
[2] B. Blankertz, K.-R. Müller, G. Curio, T. M. Vaughan, G. Schalk, J. R. Wolpaw, A. Schlögl, C. Neuper, G. Pfurtscheller, T. Hinterberger, M. Schröder, and N. Birbaumer, "The BCI competition 2003: Progress and perspectives in detection and discrimination of EEG single trials," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 1044–1051, Jun. 2004.
[3] B. Blankertz, K.-R. Müller, D. J. Krusienski, G. Schalk, J. R. Wolpaw, A. Schlögl, G. Pfurtscheller, J. R. Millán, M. Schröder, and N. Birbaumer, "The BCI competition III: Validating alternative approaches to actual BCI problems," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 14, no. 2, pp. 153–159, Jun. 2006.
[4] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K.-R. Müller, "Optimizing spatial filters for robust EEG single-trial analysis," IEEE Signal Process. Mag., vol. 25, no. 1, pp. 41–56, Jan. 2008.
[5] J. T. Chu and J. C. Chuen, "Error probability in decision functions for character recognition," J. Assoc. Comput. Mach., vol. 14, no. 2, pp. 273–280, 1967.
[6] W. S. Cleveland, "Robust locally weighted regression and smoothing scatterplots," J. Amer. Stat. Assoc., vol. 74, pp. 829–836, 1979.
[7] G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller, "Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multi-class paradigms," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 993–1002, Jun. 2004.
[8] G. Dornhege, M. Krauledat, K.-R. Müller, and B. Blankertz, General Signal Processing and Machine Learning Tools for BCI. Cambridge, MA: MIT Press, 2007, pp. 207–233.
[9] C. Gouy-Pailler, M. Congedo, C. Brunner, C. Jutten, and G. Pfurtscheller, "Nonstationary brain source separation for multiclass motor imagery," IEEE Trans. Biomed. Eng., vol. 57, no. 2, pp. 469–478, Feb. 2010.
[10] M. Grosse-Wentrup and M. Buss, "Multiclass common spatial patterns and information theoretic feature extraction," IEEE Trans. Biomed. Eng., vol. 55, no. 8, pp. 1991–2000, Aug. 2008.
[11] S. Lemm, B. Blankertz, G. Curio, and K.-R. Müller, "Spatio-spectral filters for improved classification of single trial EEG," IEEE Trans. Biomed. Eng., vol. 52, no. 9, pp. 1541–1548, Sep. 2005.
[12] Y. Li, X. Gao, and S. Gao, "Classification of single-trial electroencephalogram during finger movement," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 1019–1025, Jun. 2004.
[13] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, "Multiclass linear dimension reduction by weighted pairwise Fisher criteria," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 7, pp. 762–766, Jul. 2001.
[14] R. Lotlikar and R. Kothari, "Fractional-step dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 6, pp. 623–627, Jun. 2000.
[15] D. J. McFarland, C. W. Anderson, K.-R. Müller, A. Schlögl, and D. J. Krusienski, "BCI meeting 2005–workshop on BCI signal processing: Feature extraction and translation," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 14, no. 2, pp. 135–138, Jun. 2006.
[16] J. Müller-Gerking, G. Pfurtscheller, and H. Flyvbjerg, "Designing optimal spatial filters for single-trial EEG classification in a movement task," Clin. Neurophys., vol. 110, no. 5, pp. 787–798, May 1999.
[17] M. Naeem, C. Brunner, R. Leeb, B. Graimann, and G. Pfurtscheller, "Seperability of four-class motor imagery data using independent components analysis," J. Neural Eng., vol. 3, pp. 208–216, 2006.
[18] L. C. Parra, C. D. Spence, A. D. Gerson, and P. Sajda, "Recipes for linear analysis of EEG," NeuroImage, vol. 28, no. 2, pp. 326–341, Nov. 2005.
[19] H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller, "Optimal spatial filtering of single trial EEG during imagined hand movement," IEEE Trans. Rehabil. Eng., vol. 8, no. 4, pp. 441–446, Dec. 2000.
[20] H. Wang and W. Zheng, "Local temporal common spatial patterns for robust single-trial EEG classification," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 16, no. 2, pp. 131–139, Apr. 2008.
[21] A. R. Webb, Statistical Pattern Recognition. London, U.K.: Oxford Univ. Press, 1999.
[22] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, "Brain–computer interfaces for communication and control," Clin. Neurophysiol., vol. 113, no. 6, pp. 767–791, Jun. 2002.
[23] W. Zheng and Z. Lin, "Optimizing multi-class spatio-spectral filters via Bayes error estimation for EEG classification," in Proc. Neural Informat. Process. Syst. (NIPS), 2009, pp. 1–9.
Haixian Wang (M'09) received the B.S. and M.S. degrees in statistics and the Ph.D. degree in computer science from Anhui University, Anhui, China, in 1999, 2002, and 2005, respectively.
During 2002–2005, he was with the Key Laboratory of Intelligent Computing and Signal Processing of the Ministry of Education of China. He is currently with the Key Laboratory of Child Development and Learning Science of the Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing, Jiangsu, China. His research interests include EEG signal processing, statistical pattern recognition, and machine learning.