Multiclass Filters by a Weighted Pairwise Criterion


  • 8/12/2019 Multiclass Filters by a Weighted Pairwise Criterion


    1412 IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 58, NO. 5, MAY 2011

    Multiclass Filters by a Weighted Pairwise Criterion for EEG Single-Trial Classification

    Haixian Wang, Member, IEEE

    Abstract: The filtering technique for dimensionality reduction of multichannel electroencephalogram (EEG) recordings, modeled using common spatial patterns and its variants, is commonly used in two-class brain-computer interfaces (BCI). For a multiclass problem, the optimization of certain separability criteria in the output space is not directly related to the classification error of EEG single-trial segments. In this paper, we derive a new discriminant criterion, termed weighted pairwise criterion (WPC), for optimizing multiclass filters by minimizing the upper bound of the Bayesian error that is intentionally formulated for classifying EEG single-trial segments. The WPC approach pays more attention to close class pairs that are more likely to be misclassified than far away class pairs that are already well separated. Moreover, we extend WPC by integrating temporal information of EEG series. Computationally, we employ the rank-one update and power iteration technique to optimize the proposed discriminant criterion. The experiments of multiclass classification on the datasets of BCI competitions demonstrate the efficacy of the proposed method.

    Index Terms: Bayesian classification error, brain-computer interfaces (BCI), common spatial patterns (CSP), multiclass filters, weighted pairwise criterion (WPC).

    I. INTRODUCTION

    ACCURATE classification of electroencephalogram (EEG) signals is the core problem in the community of brain-computer interfaces (BCI) [22]. A large number of modern signal processing and machine learning techniques have been used and developed [1], [8], [15]. One powerful and widely used method for processing multichannel EEG series is the filtering technique, represented by the common spatial patterns (CSP) [4]. The CSP approach, designed for the two-class problem [12], [16], [18], [19], seeks a few filters such that the ratio of the filtered variances between the two populations is maximized (or minimized). Because CSP makes use of only spatial information, spatio-temporal versions were also developed, for example, the common spatio-spectral patterns [11] and the local temporal CSP [20]. Many variants of CSP are reviewed in [4].

    Manuscript received June 20, 2010; revised October 14, 2010 and December 30, 2010; accepted January 3, 2011. Date of publication January 13, 2011; date of current version April 20, 2011. This work was supported in part by the National Natural Science Foundation of China under Grants 61075009 and 60803059, in part by the Qing Lan Project, and in part by the Fund for the Program of Excellent Young Teachers at Southeast University.

    The author is with the Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing 210096, China (e-mail: [email protected]).

    Digital Object Identifier 10.1109/TBME.2011.2105869

    The CSP approach was originally suggested for the two-class paradigm. Multiclass extensions have been investigated in the literature. One trivial extension was to divide the multiclass problem into many two-class situations followed by applying CSP repeatedly [4], [7]. Another conventional extension was the joint approximate diagonalization (JAD) of $M$ covariance matrices, where $M$ was the number of classes [7]. This is based on the observation that CSP simultaneously diagonalizes two covariance matrices. The JAD was further investigated from the perspective of mutual information and brain source separation [9], [10]. The JAD approach is actually a decomposition technique rather than a classification method.

    Recently, Zheng and Lin [23] presented a multiclass extension via Bayesian classification error estimation. The discriminant criterion is derived by minimizing the upper bound of the Bayesian error of classifying $\omega^T x$, where $x$ is an EEG signal recorded at a specific time point and $\omega$ is a filter. While this is a reasonable criterion to optimize spatial filters, a more direct approach uses the same features for optimizing spatial filters as for classification. Denoting a multivariate time series of band-pass filtered single-trial EEG by $X$, in CSP these features are $\omega^T X X^T \omega$ (cf., [4, eq. (2)], [16, eq. (2)], and [23, eqs. (29)-(31)]). The quantity $\omega^T X X^T \omega$ is the variance of the band-pass filtered EEG signals. It is equal to band power. So, the band power $\omega^T X X^T \omega$ with an appropriate spatial filter $\omega$ in fact corresponds to the effect of event-related desynchronization/synchronization (ERD/ERS), which is an effective neurophysiological feature for classification of brain activities [4]. Specifically, the so-called idle rhythms, reflected around 10 Hz, can be observed over motor and sensorimotor areas in most persons. These idle rhythms are attenuated when processing motor activity. This physiological phenomenon is termed the ERD effect because of loss of synchrony in the neural population. By contrast, the rebound of the rhythmic activity is termed ERS.

    In this paper, we directly target the classification of $\omega^T X X^T \omega$, for which the upper bound of the Bayesian classification error is estimated. Accordingly, by minimizing the upper bound of the Bayesian classification error, we develop a new discriminant criterion that is directly related to the classification of EEG single-trial segments. The proposed criterion takes the form of a sum over weighted class pairs and is referred to as weighted pairwise criterion (WPC). The WPC approach puts heavier weights onto close class pairs, which are more likely to be misclassified, and de-emphasizes the influence of far away class pairs, where the classes are already well separated. The weighting strategy helps make the criterion suited to producing separability in the output space, which has been witnessed in pattern classification problems [13], [14]. Computationally,

    0018-9294/$26.00 © 2011 IEEE



    the proposed WPC method is conveniently implemented by the rank-one update and power iteration technique. Moreover, we compare WPC with [23]. The classification targets of the two methods are different, and so the discriminant criteria, which are obtained by minimizing the upper bound of the Bayesian classification error, are also different. We also extend WPC by integrating temporal information of EEG series into the covariance matrix formulation. The efficacy of the proposed approach is demonstrated on the classification of four motor imagery tasks on two datasets of BCI competitions.

    The remainder of this paper is organized as follows. In Section II, the CSP and its multiclass situation via Bayesian classification error estimation based on EEG sampling points are briefly reviewed. In Section III, we derive the upper bound of the Bayesian error that is intentionally formulated for classifying EEG single-trial segments, then propose the WPC to minimize the upper bound, and give the optimization procedure by using the rank-one update and power iteration technique. The comparison with [23] and the extension by integrating temporal information are presented in Section IV. The experimental results are presented in Section V. Finally, Section VI concludes the paper.

    II. BRIEF REVIEW OF CSP AND ITS MULTICLASS SITUATION

    Let $x \in \mathbb{R}^K$ be an EEG signal at a specific time point with $K$ electrodes. We view $x$, recorded while performing a certain mental task, as a $K$-dimensional random variable that is generated from a Gaussian distribution. Suppose that $C = \{c_1, \ldots, c_M\}$ is the set of mental conditions to be investigated. We consider the multiclass ($M > 2$) classification problem that assigns EEG single-trial segments into the $M$ predefined brain states. Given class $c_l$ ($l \in \{1, 2, \ldots, M\}$), the random variable $x$ is assumed to be Gaussian distributed according to $x|c_l \sim N(0, \Sigma_l)$, where $\Sigma_l$ is the covariance matrix. The Gaussian assumption will not sacrifice generality when studying linear filters and statistics up to second order [10]. For the purpose of classification, we wish to learn $G$ ($G < K$) filters (linear transformation vectors) $\omega_g \in \mathbb{R}^K$ using the finite training data such that the filtered features are more discriminative for predicting class labels than the raw EEG data. Hereafter, the terms conditions and class labels are used interchangeably.

    A. CSP: Two-Class Paradigm

    The CSP method provides a powerful way of extracting EEG features related to the modulation of ERD/ERS. The CSP algorithm applies to the two-class situation only. It solves for the filters such that the projected EEG series have the maximum ratio of variances between the two classes. Maximizing this variance ratio in effect characterizes the ERD/ERS effects. Let $X = [x_1, \ldots, x_N] \in \mathbb{R}^{K \times N}$ be a segment of EEG series during one trial, where $x_i$ is the multichannel EEG signal at a specific time point $i$, and $N$ denotes the number of sampled time points.

    The CSP approach solves for spatial filters by simultaneously diagonalizing the estimated covariance matrices under the two conditions. The covariance matrices of the two classes are estimated as

    $$\Sigma_l = \frac{1}{N|I_l|}\sum_{t \in I_l} X_t X_t^T \tag{1}$$

    where $I_l$ ($l \in \{1, 2\}$) denotes the set of indices of trials belonging to class $c_l$, and $|I_l|$ is the cardinality of set $I_l$. The spatial filters of CSP can be alternatively formulated as an optimization problem [4], [18]

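    $$\omega = \arg\,\{\max, \min\}_{\omega \in \mathbb{R}^K}\ \frac{\omega^T \Sigma_1 \omega}{\omega^T \Sigma_2 \omega} \tag{2}$$

    where the notation {max, min} means that maximizing or minimizing the Rayleigh quotient is of equal interest. The spatial filters are thus obtained by solving the generalized eigenvalue equation

    $$\Sigma_1 \omega = \lambda \Sigma_2 \omega. \tag{3}$$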

    The eigenvalue $\lambda$ measures the ratio of variances between the two classes. For the purpose of classification, the filters are specified by choosing several generalized eigenvectors associated with eigenvalues from both ends of the eigenvalue spectrum. The variances of the spatially filtered EEG data are discriminative features, which are input into a classifier.
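As a rough illustration of (1) and of the Rayleigh quotient in (2), the following sketch estimates a class covariance from a list of trials and evaluates the variance ratio for a candidate filter. The function names and the toy trials are assumptions made for illustration only; a real implementation would solve the generalized eigenvalue problem (3) with a linear algebra library instead of testing candidate filters by hand.

```python
def trial_outer(X):
    """Return X X^T for one trial X, given as K channel rows of N samples."""
    K = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(K)]
            for i in range(K)]


def class_covariance(trials):
    """Eq. (1): sum of X_t X_t^T over the trials of a class, divided by N * |I_l|."""
    K, N = len(trials[0]), len(trials[0][0])
    S = [[0.0] * K for _ in range(K)]
    for X in trials:
        P = trial_outer(X)
        for i in range(K):
            for j in range(K):
                S[i][j] += P[i][j] / (N * len(trials))
    return S


def quad_form(w, S):
    """Return w^T S w."""
    n = len(w)
    return sum(w[i] * S[i][j] * w[j] for i in range(n) for j in range(n))


def rayleigh_quotient(w, S1, S2):
    """Objective of eq. (2): ratio of filtered variances of the two classes."""
    return quad_form(w, S1) / quad_form(w, S2)


# Toy data (hypothetical): class 1 is strong on channel 0, class 2 on channel 1.
S1 = class_covariance([[[2, -2, 2, -2], [1, 1, -1, -1]]])
S2 = class_covariance([[[1, -1, 1, -1], [2, 2, -2, -2]]])
print(rayleigh_quotient([1, 0], S1, S2), rayleigh_quotient([0, 1], S1, S2))
```

In this toy case the two axis-aligned filters sit at opposite ends of the variance-ratio range, which mirrors the practice of taking eigenvectors from both ends of the eigenvalue spectrum.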

    B. Multiclass Situation by Bayesian Error Estimation

    The CSP is suitable for two-class classification only. Zheng and Lin [23] addressed the multiclass paradigm via Bayesian classification error estimation. By the assumption that the distribution of the EEG sampling point $x$ conditioned on class $c_l$ is Gaussian, i.e., $p_l = N(x; 0, \Sigma_l)$, the filtered EEG signal $y = \omega^T x$ is also Gaussian distributed according to $N(y; 0, \omega^T \Sigma_l \omega)$. Based on the Gaussian distribution, Zheng and Lin [23] obtained the upper bound of the Bayesian error of classifying $\omega^T x$, given by

    ()qM(M1)2

    q3

    32

    Ml= 1|T (l)|

    T

    2(4)

    where $q$ is the common a priori probability of the $M$ classes, and $\bar\Sigma = \sum_{m=1}^{M} q\,\Sigma_m$. Minimizing the upper bound of the Bayesian classification error is equivalent to maximizing the discriminant criterion

    $$J(\omega) = \sum_{l=1}^{M}\frac{|\omega^T(\Sigma_l - \bar\Sigma)\omega|}{\omega^T\bar\Sigma\,\omega}. \tag{5}$$

    So, the $G$ filters are defined as [23]

    $$\omega_1 = \arg\max_{\omega} J(\omega) \tag{6}$$

    G = arg max

    T g = 0g= 1 , . . . , G1

    J(). (7)

    Using some suitable estimations for $\Sigma_l$ and $\bar\Sigma$, the defined filters can be determined one by one via the rank-one update and power iteration procedure.


    III. WPC

    For the specific classification of a multiclass EEG single-trial segment $X$, a more direct approach is to optimize the feature $\omega^T X X^T \omega$ that is actually used for classification, rather than $\omega^T x$. Ideally, one would obtain the optimal criterion based on the Bayesian classification error for the classification of $\omega^T X X^T \omega$. The Bayesian classification error is, in general, too complex to be calculated directly. Therefore, an upper bound of the Bayesian classification error, which is also required to be easy to optimize in practice, is usually estimated as a suboptimal criterion. In this section, we develop a new discriminant criterion based on the upper bound of the Bayesian classification error of EEG single trials. Note that we take $\omega^T X X^T \omega$, rather than $\omega^T x$, as our target element in deriving the upper bound of the multiclass Bayesian classification error.

    A. Upper Bound of Multiclass Bayesian Error

    Recalling $X = [x_1, \ldots, x_N]$, we have $\omega^T X X^T \omega = \sum_{i=1}^{N} (\omega^T x_i)^2$, where $N$ is the number of sample points in one trial. Under the assumption of independent Gaussian distribution $N(0, \Sigma_l)$ of $x_i$ conditioned on class $c_l$, we have that $(\omega^T X X^T \omega)/(\omega^T \Sigma_l \omega)$ follows the $\chi^2$ distribution with $N$ degrees of freedom. Usually, the number of sampling points $N$ is very large, say $N > 30$. By the central limit theorem, $\omega^T X X^T \omega$ conditioned on class $c_l$ approaches the Gaussian distribution with mean $N \omega^T \Sigma_l \omega$ and variance $2N(\omega^T \Sigma_l \omega)^2$. For the time being, we assume that $2N(\omega^T \Sigma_l \omega)^2$ is less than one; the general case is addressed later on.
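The mean and variance stated above can be checked numerically. The sketch below is only an illustration under the paragraph's own assumption of independent zero-mean Gaussian samples; `sigma2` stands for the scalar $\omega^T \Sigma_l \omega$, and all function names are hypothetical.

```python
import random


def band_power_moments(sigma2, n_samples):
    """Theoretical mean N*sigma2 and variance 2*N*sigma2^2 of sum_i y_i^2."""
    return n_samples * sigma2, 2 * n_samples * sigma2 ** 2


def simulate_band_power(sigma2, n_samples, n_trials, seed=0):
    """Monte Carlo estimate of the mean and variance of sum_i y_i^2, y_i ~ N(0, sigma2)."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    powers = [sum(rng.gauss(0.0, sd) ** 2 for _ in range(n_samples))
              for _ in range(n_trials)]
    mean = sum(powers) / n_trials
    var = sum((p - mean) ** 2 for p in powers) / (n_trials - 1)
    return mean, var


print(band_power_moments(0.5, 100))        # theoretical values
print(simulate_band_power(0.5, 100, 2000))  # empirical values, close to the above
```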

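    Denote by $\varepsilon_{lm}(\omega)$ the Bayesian error between classes $c_l$ and $c_m$, i.e.,

    $$\varepsilon_{lm}(\omega) = q_l P(f(X) \neq c_l \mid c_l) + q_m P(f(X) \neq c_m \mid c_m) \tag{8}$$

    where $P(f(X) \neq c_l \mid c_l)$ is the probability that samples belonging to class $c_l$ are misclassified into class $c_m$, $f(\cdot)$ denotes the Bayesian classifier, and $q_l$ and $q_m$ are the a priori probabilities of classes $c_l$ and $c_m$, respectively. Since the data $\omega^T X X^T \omega$ conditioned on each class are (approximately) Gaussian distributed with variance less than one and mean being, for example, $N \omega^T \Sigma_l \omega$ if conditioned on class $c_l$, it follows that

    $$q_l P(f(X) \neq c_l \mid c_l) + q_m P(f(X) \neq c_m \mid c_m) \le \int_{D_m} q_l\, p_l(x)\,dx + \int_{D_l} q_m\, p_m(x)\,dx \tag{9}$$

    where $p_l(x)$ and $p_m(x)$ are the probability density functions of Gaussian distributions $N(N \omega^T \Sigma_l \omega, 1)$ and $N(N \omega^T \Sigma_m \omega, 1)$, respectively, and $D_m$ and $D_l$ are defined as

    $$D_m = \{x : q_m p_m(x) \ge q_l p_l(x)\} \tag{10}$$

    $$D_l = \{x : q_l p_l(x) > q_m p_m(x)\}. \tag{11}$$

    Suppose $q_l = q_m = q$. Then, we have [21]

    $$\int_{D_m} p_l(x)\,dx + \int_{D_l} p_m(x)\,dx = 1 - \mathrm{erf}\!\left(\frac{N|\omega^T(\Sigma_l - \Sigma_m)\omega|}{2\sqrt{2}}\right) \tag{12}$$

    where the error function (erf) is defined as $\mathrm{erf}(x) = (2/\sqrt{\pi})\int_0^x e^{-u^2}\,du$. By (8), (9), and (12), the Bayesian error between classes $c_l$ and $c_m$ in the 1-D feature space after being projected onto $\omega$ is bounded as

    $$\varepsilon_{lm}(\omega) \le q\left[1 - \mathrm{erf}\!\left(\frac{N|\omega^T(\Sigma_l - \Sigma_m)\omega|}{2\sqrt{2}}\right)\right]. \tag{13}$$

    It is still complex to optimize $\omega$ via (13), since $\omega$ is embedded in the error function. We would like to isolate $\omega$ from the error function. We present the following inequality. For $0 \le x \le a$, we have

    $$\mathrm{erf}(x) \ge \frac{1}{a}\,\mathrm{erf}(a)\,x. \tag{14}$$

    The equality holds when taking $x = 0$ or $x = a$. The proof is given in Appendix A. Let

    $$\Delta_{lm}(\omega) = |\omega^T \Sigma_l \omega - \omega^T \Sigma_m \omega| \tag{15}$$

    be the absolute distance between classes $c_l$ and $c_m$ in the reduced 1-D feature space. By (14), we have

    $$\mathrm{erf}\!\left(\frac{N \Delta_{lm}(\omega)}{2\sqrt{2}}\right) \ge \frac{1}{\Delta_{lm}}\,\mathrm{erf}\!\left(\frac{N \Delta_{lm}}{2\sqrt{2}}\right)\Delta_{lm}(\omega) \tag{16}$$

    where $\Delta_{lm}$ is the maximum value of $\Delta_{lm}(\omega)$. Note that we have required, at the beginning of this section, that the magnitude of $\omega$ is subject to the constraint $2N(\omega^T \Sigma_l \omega)^2 \le 1$. The left and right expressions of (16) are not equal for all directions of $\omega$. The two expressions are equal when $\Delta_{lm}(\omega)$ arrives at its maximum or minimum value. Combining (13) and (16), we have

    $$\varepsilon_{lm}(\omega) \le q\left[1 - \frac{1}{\Delta_{lm}}\,\mathrm{erf}\!\left(\frac{N \Delta_{lm}}{2\sqrt{2}}\right)\Delta_{lm}(\omega)\right]. \tag{17}$$

    For the $M$-class problem, the upper bound of the Bayesian error is calculated as [5]

    $$\varepsilon(\omega) \le \sum_{l=1}^{M-1}\sum_{m=l+1}^{M}\varepsilon_{lm}(\omega) \le \sum_{l=1}^{M-1}\sum_{m=l+1}^{M} q\left[1 - \frac{1}{\Delta_{lm}}\,\mathrm{erf}\!\left(\frac{N \Delta_{lm}}{2\sqrt{2}}\right)\Delta_{lm}(\omega)\right]. \tag{18}$$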
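The two ingredients of this derivation, the pairwise error expression in (13) and the chord inequality (14), can be sanity-checked numerically. The sketch below is illustrative only; `distance` stands for the distance between the two class means, $N|\omega^T(\Sigma_l - \Sigma_m)\omega|$, and the function names are assumptions.

```python
import math


def pairwise_error(distance, q=0.5):
    """Right-hand side of (13) for unit-variance classes whose means are `distance` apart."""
    return q * (1.0 - math.erf(distance / (2.0 * math.sqrt(2.0))))


def erf_chord(x, a):
    """Right-hand side of (14): the chord of erf through (0, 0) and (a, erf(a))."""
    return math.erf(a) * x / a


# Because erf is concave on [0, a], it lies on or above the chord there.
a = 2.0
for k in range(0, 11):
    x = a * k / 10
    print(round(x, 2), math.erf(x) >= erf_chord(x, a) - 1e-12)
```

The bound replaces the nonlinear dependence on the filter by a linear one, which is what allows the weights in the next subsection to be computed once per class pair.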

    B. Discriminant Criterion Based on Upper Bound of Multiclass Bayesian Error

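    To minimize the Bayesian error, we should minimize its upper bound, which reduces to maximizing the following discriminant criterion:

    $$J_P(\omega) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M}\frac{1}{\Delta_{lm}}\,\mathrm{erf}\!\left(\frac{N \Delta_{lm}}{2\sqrt{2}}\right)\Delta_{lm}(\omega). \tag{19}$$

    Let

    $$\alpha_{lm} = \frac{1}{\Delta_{lm}}\,\mathrm{erf}\!\left(\frac{N \Delta_{lm}}{2\sqrt{2}}\right). \tag{20}$$

    Then, $J_P(\omega)$ can be rewritten as

    $$J_P(\omega) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M}\alpha_{lm}\,\Delta_{lm}(\omega). \tag{21}$$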


    Fig. 1. Weighting function $\phi(u) = (1/u)\,\mathrm{erf}(u)$.

    The $\alpha_{lm}$ can be viewed as the weight imposed on the class pair $c_l$ and $c_m$. Since $N \omega^T \Sigma_l \omega$ and $N \omega^T \Sigma_m \omega$ are the distribution means of classes $c_l$ and $c_m$ in the reduced 1-D feature space, respectively, the quantity $N \Delta_{lm}$ in (20) is the maximum distance (with respect to $\omega$) between the two class means. It reflects the separability of the two classes. Note that $\phi(u) = (1/u)\,\mathrm{erf}(u)$ is a monotonically decreasing function of $u$, as shown in Fig. 1. So, the pairwise class weighting function $\alpha_{lm}$ in (20) is monotonically decreasing with respect to $N \Delta_{lm}$. That is, in (21), we impose heavier weights onto close class pairs, which are more likely to be misclassified. The close class pairs are thus emphasized, which helps make the criterion suited to producing separability in the output space.
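The behavior of the weighting function can be seen with a few evaluations. This is a minimal sketch: `phi` follows the caption of Fig. 1 and `pair_weight` follows (20), while the numeric values of the class distances are made up for illustration.

```python
import math


def phi(u):
    """Weighting function (1/u) * erf(u), defined for u > 0."""
    return math.erf(u) / u


def pair_weight(delta_max, n_samples):
    """Weight of eq. (20) for a class pair whose maximum distance is delta_max."""
    return math.erf(n_samples * delta_max / (2.0 * math.sqrt(2.0))) / delta_max


# phi decreases from 2/sqrt(pi) toward 0 as u grows.
print([round(phi(u), 4) for u in (0.1, 0.5, 1.0, 2.0, 4.0)])

# A close class pair receives a larger weight than a well-separated one.
print(pair_weight(0.01, 100) > pair_weight(0.05, 100))
```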

    C. Discriminant Criterion: General Case of $\omega$

    In the earlier derivation, we require that the variance $2N(\omega^T \Sigma_l \omega)^2$ is less than 1. This requirement is satisfied by restricting the length of $\omega$. It suffices to restrict $\omega$ such that $\omega^T(\sqrt{2N}\,M\bar\Sigma)\omega = 1$. In fact, for any class $c_l$, we have $2N(\omega^T \Sigma_l \omega)^2$


    $s$ is the sign that results in the largest first principal eigenvalue of all possible $s$. Once the first vector $\omega_1$ is obtained, we proceed to find the second vector $\omega_2$ in the orthogonally complementary space of $\omega_1$, i.e., in the space spanned by $I_K - \omega_1\omega_1^T$, where $I_K$ denotes the $K$-dimensional identity matrix. Therefore, $\omega_2$ is solved as the first principal eigenvector of the deflated matrix $(I_K - \omega_1\omega_1^T)H(s)(I_K - \omega_1\omega_1^T)$. Note that we use the same symbol $s$, but it is not necessarily the same as the one producing $\omega_1$. Generally, suppose the first $g$ vectors $\omega_1, \ldots, \omega_g$ have been obtained. The $(g+1)$th vector is determined in the orthogonal complement of the space spanned by $\omega_1, \ldots, \omega_g$, i.e., in the space spanned by $I_K - U_gU_g^T$, where $U_g$ is the matrix of an orthonormal basis of $\omega_1, \ldots, \omega_g$, which, for example, can be obtained by the Schmidt orthogonalization procedure. So, $\omega_{g+1}$ is solved as the first principal eigenvector of $(I_K - U_gU_g^T)H(s)(I_K - U_gU_g^T)$. Given the obtained $\omega_{g+1}$, according to the Schmidt orthogonalization procedure, the basis matrix $U_{g+1}$ is formed by padding $U_g$ as $U_{g+1} = [U_g, u_{g+1}]$, where $u_{g+1} = (\omega_{g+1} - U_g(U_g^T\omega_{g+1}))/\|\omega_{g+1} - U_g(U_g^T\omega_{g+1})\|$. In theory, $\omega_{g+1}$ is orthogonal to $U_g$, i.e., $U_g^T\omega_{g+1} = 0$, which implies that $u_{g+1}$ is simply the normalized $\omega_{g+1}$. We keep the Schmidt orthogonalization procedure for computational precision in practice. Note that $I_K - U_{g+1}U_{g+1}^T = (I_K - u_{g+1}u_{g+1}^T)(I_K - U_gU_g^T)$, which makes it feasible to compute $(I_K - U_{g+1}U_{g+1}^T)H(s)(I_K - U_{g+1}U_{g+1}^T)$ for the next step by updating $(I_K - U_gU_g^T)H(s)(I_K - U_gU_g^T)$ through multiplying by $I_K - u_{g+1}u_{g+1}^T$ from both sides.

    In practice, the covariance matrices $\Sigma_l$ ($l = 1, \ldots, M$) are usually unknown and thus need to be estimated. The expression defined in (1) provides a way of estimation. We summarize the optimization procedure of multiclass filters via the WPC approach in Table I.
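The deflation scheme described above can be sketched in a few lines. This is a simplified illustration, not the procedure of Table I: it keeps the matrix `H` fixed (the sign vector $s$ is not updated), it assumes `H` is symmetric with a positive dominant eigenvalue at every step, and all function names are hypothetical.

```python
def mat_vec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def power_iteration(A, n_iter=500):
    """Dominant eigenvector of symmetric A (assumes a positive dominant eigenvalue)."""
    v = normalize([1.0 + 0.1 * i for i in range(len(A))])
    for _ in range(n_iter):
        v = normalize(mat_vec(A, v))
    return v


def deflate(A, u):
    """Return (I - u u^T) A (I - u u^T) for a unit vector u."""
    K = len(A)
    P = [[(1.0 if i == j else 0.0) - u[i] * u[j] for j in range(K)] for i in range(K)]
    PA = [[sum(P[i][k] * A[k][j] for k in range(K)) for j in range(K)] for i in range(K)]
    return [[sum(PA[i][k] * P[k][j] for k in range(K)) for j in range(K)] for i in range(K)]


def leading_filters(H, n_filters):
    """Extract mutually orthogonal directions one by one via deflation."""
    A = [row[:] for row in H]
    basis = []
    for _ in range(n_filters):
        w = power_iteration(A)
        # Schmidt step: remove any residual component along earlier basis vectors.
        for u in basis:
            dot = sum(a * b for a, b in zip(w, u))
            w = [a - dot * b for a, b in zip(w, u)]
        u_new = normalize(w)
        basis.append(u_new)
        # Update the deflated matrix by multiplying (I - u u^T) from both sides.
        A = deflate(A, u_new)
    return basis


filters = leading_filters([[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]], 2)
print(filters)
```

Updating the already deflated matrix with only the newest basis vector, as in the last line of the loop, relies on the product identity for $I_K - U_{g+1}U_{g+1}^T$ noted in the text.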

    E. Classification

    Suppose $\omega_g$ ($g = 1, \ldots, G$) are the $G$ filters obtained by WPC. For any EEG data segment $X_t$, we extract the features as

    $$\sigma_g^2 = \omega_g^T X_t X_t^T \omega_g, \quad (g = 1, \ldots, G). \tag{31}$$

    The extracted features on training EEG data are used to design a classifier. For a testing EEG segment, its features are extracted in the same way and input into the trained classifier to predict its class label.
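A minimal sketch of the feature extraction in (31) follows. It uses the identity $\omega^T X X^T \omega = \sum_i (\omega^T x_i)^2$ from Section III-A; the function name and the toy segment are assumptions for illustration.

```python
def band_power_features(filters, X):
    """Eq. (31): for each filter w, return w^T X X^T w = sum_i (w^T x_i)^2.

    X is one trial given as K channel rows of N samples; each filter has length K.
    """
    n_samples = len(X[0])
    feats = []
    for w in filters:
        projected = [sum(w[k] * X[k][i] for k in range(len(w)))
                     for i in range(n_samples)]
        feats.append(sum(y * y for y in projected))
    return feats


# Toy segment with 2 channels and 3 samples (hypothetical values).
X = [[1, 2, 3], [0, 1, 0]]
print(band_power_features([[1, 0], [0, 1], [1, 1]], X))
```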

    IV. COMPARISON AND EXTENSION

    In this section, we compare the proposed WPC approach with [23]. The starting points and formulations of the two methods are completely different. We also extend WPC by integrating temporal information.

    A. Comparison With [23]

    TABLE I. OPTIMIZATION PROCEDURE OF MULTICLASS FILTERS VIA THE WPC APPROACH

    Both the proposed WPC approach and [23] are based on the Bayesian error estimation. However, the classification targets, and hence the criteria, of the two methods are different, as summarized in Table II. The discriminant criterion of [23] is derived by minimizing the upper bound of the Bayesian error of classifying $\omega^T x$, while WPC directly takes as its target the feature $\omega^T X X^T \omega$ used in EEG single-trial classification.

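    It is noted that

    $$q\sum_{l=1}^{M-1}\sum_{m=l+1}^{M}|\omega^T(\Sigma_l - \Sigma_m)\omega| \le \sum_{l=1}^{M}|\omega^T(\Sigma_l - \bar\Sigma)\omega| \le 2q\sum_{l=1}^{M-1}\sum_{m=l+1}^{M}|\omega^T(\Sigma_l - \Sigma_m)\omega|. \tag{32}$$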


    TABLE II. COMPARISON BETWEEN WPC AND ZHENG AND LIN [23]

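    The proof is given in Appendix B. Let

    $$\tilde J(\omega) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M}\frac{|\omega^T(\Sigma_l - \Sigma_m)\omega|}{\omega^T\bar\Sigma\,\omega}. \tag{33}$$

    Then, we have $q\tilde J(\omega) \le J(\omega) \le 2q\tilde J(\omega)$. So, maximizing $J(\omega)$ can be roughly performed by maximizing $\tilde J(\omega)$. Using $\omega = \bar\Sigma^{-1/2}\tilde\omega$, maximizing $\tilde J(\omega)$ is equivalent to maximizing

    $$\tilde J(\tilde\omega) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M}\frac{|\tilde\omega^T(\tilde\Sigma_l - \tilde\Sigma_m)\tilde\omega|}{\tilde\omega^T\tilde\omega} \tag{34}$$

    where $\tilde\Sigma_l = \bar\Sigma^{-1/2}\Sigma_l\bar\Sigma^{-1/2}$. This is the sum of the $M(M-1)/2$ pairwise absolute distances between $\tilde\omega^T\tilde\Sigma_l\tilde\omega$ and $\tilde\omega^T\tilde\Sigma_m\tilde\omega$ subject to $\|\tilde\omega\| = 1$.

    The maximization of (34), however, may not be very appropriate for classifying multiclass EEG single trials in some cases. For example, consider the situation in which one class differs greatly (in terms of $\tilde\Sigma$) from the other classes. To maximize the criterion $\tilde J(\tilde\omega)$, the class pairs (say $c_l$ and $c_m$) that have large differences (between $\tilde\Sigma_l$ and $\tilde\Sigma_m$) heavily control the selection of the direction of $\tilde\omega$. Note that $\tilde\omega$ is of unit length. As a result, the remote class is projected as far from the other classes as possible, while close classes are more likely to be merged.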

    By contrast, with the weight $(1/\Delta_{lm})\,\mathrm{erf}\big(\sqrt{N}\Delta_{lm}/(4M)\big)$, the criterion $J_P(\cdot)$ in (25) or in (22) de-emphasizes the influence of large class differences, where the classes are already well separated, and gives greater emphasis to small class differences, where the close classes are more likely to be confused.

    An interesting connection between WPC and [23] is that, when applied in the two-class paradigm, criterion $J_P(\omega)$ and criterion $J(\omega)$ produce the same solution of $\omega$. This, however, does not necessarily imply that the upper bounds of the Bayesian errors of these two methods are equal. On the other hand, comparing the upper bounds of the Bayesian errors derived by the two methods is meaningless since they are derived for classifying different objects.

    B. Extension to WPC: Integrating Temporal Information

    The set of filters of WPC is obtained by considering the classification of the projected variance $\omega^T X X^T \omega$. Note that $X$ is a segment of EEG single-trial time course. The covariance formulation $XX^T$, however, is globally independent of time. The temporal information is completely ignored. From the study of neurophysiology, EEG signals are usually nonstationary. It is useful to integrate the temporal information into the covariance formulation, reflecting the temporal manifold of the EEG time course [20]. Specifically, by the fact that $XX^T = (1/2N)\sum_{i,j=1}^{N}(x_i - x_j)(x_i - x_j)^T$, we use the temporally local covariance matrix

    $$C = \frac{1}{2N}\sum_{i,j=1}^{N}(x_i - x_j)(x_i - x_j)^T A(i,j) \tag{35}$$

    for covariance modeling instead of $XX^T$, which is time independent. The time-dependent adjacency value $A(i,j)$ is defined such that only temporally close sample pairs, say $\{x_i - x_j : |i - j| < \tau\}$ with $\tau$ being a temporal range parameter, are selected to contribute to the summation (35). The value $A(i,j)$ is monotonically decreasing with respect to the temporal distance between the selected sample pairs. In this paper, the adjacency matrix $A$ is defined using Tukey's tricube weighting function [6]

    $$A(i,j) = \begin{cases}\left(1 - \left|\dfrac{i-j}{\tau}\right|^3\right)^3, & |i-j| < \tau \\ 0, & \text{else.}\end{cases} \tag{36}$$
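The adjacency matrix of (36) is simple to build directly. The following is a minimal sketch; the function name is an assumption, and `tau` is the temporal range parameter from the text.

```python
def tricube_adjacency(n_samples, tau):
    """Eq. (36): A(i, j) = (1 - |(i - j)/tau|^3)^3 if |i - j| < tau, else 0."""
    A = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(n_samples):
            d = abs(i - j)
            if d < tau:
                A[i][j] = (1.0 - (d / tau) ** 3) ** 3
    return A


# First row for 5 samples and tau = 3: the weights decay with temporal distance.
print([round(a, 3) for a in tricube_adjacency(5, 3)[0]])
```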

    With some algebraic derivations, $C$ is compactly expressed as $C = (1/N)XEX^T$, where $E = D - A$ is the Laplacian matrix, $A = (A(i,j))_{i,j=1,\ldots,N}$, and $D$ is the diagonal matrix whose diagonal entries are the row sums of $A$. Let $L = (1/N)E$. Then, under the same probabilistic assumption as in the previous section, the Gaussian quadratic form $\omega^T X L X^T \omega$ conditioned on class $c_l$ has mean $\mathrm{tr}(L)\,\omega^T\Sigma_l\omega$ and variance $2\,\mathrm{tr}(L^2)(\omega^T\Sigma_l\omega)^2$, where $\mathrm{tr}(\cdot)$ is the trace operator. If $\omega^T X L X^T \omega$ is treated as the target feature for classification, then (28) is accordingly modified by integrating temporal information as

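    $$H_{TI}(s) = \sum_{l=1}^{M-1}\sum_{m=l+1}^{M}\frac{1}{\Delta_{lm}}\,\mathrm{erf}\!\left(\frac{\mathrm{tr}(L)\,\Delta_{lm}}{4M\sqrt{\mathrm{tr}(L^2)}}\right)s_{lm}(\Sigma_l - \Sigma_m). \tag{37}$$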

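    Note that, in this case, the difference between the means of classes $c_l$ and $c_m$ becomes $\mathrm{tr}(L)(\omega^T(\Sigma_l - \Sigma_m)\omega)$, and $\omega$ is normalized as $\omega/\sqrt{\omega^T\big(\sqrt{2\,\mathrm{tr}(L^2)}\,M\bar\Sigma\big)\omega}$.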
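The equivalence between the pairwise form (35) and the compact Laplacian form $C = (1/N)XEX^T$ can be verified numerically on a small example. The sketch below is illustrative; the function names and the toy segment are assumptions, and the adjacency follows the tricube weights of (36).

```python
def tricube(d, tau):
    """Tricube weight for temporal distance d and range tau."""
    return (1.0 - (d / tau) ** 3) ** 3 if d < tau else 0.0


def local_cov_pairwise(X, tau):
    """Eq. (35): (1/2N) * sum_{i,j} A(i,j) (x_i - x_j)(x_i - x_j)^T."""
    K, N = len(X), len(X[0])
    C = [[0.0] * K for _ in range(K)]
    for i in range(N):
        for j in range(N):
            a = tricube(abs(i - j), tau)
            for p in range(K):
                for r in range(K):
                    C[p][r] += a * (X[p][i] - X[p][j]) * (X[r][i] - X[r][j]) / (2.0 * N)
    return C


def local_cov_laplacian(X, tau):
    """Compact form C = (1/N) X (D - A) X^T, with D holding the row sums of A."""
    K, N = len(X), len(X[0])
    A = [[tricube(abs(i - j), tau) for j in range(N)] for i in range(N)]
    E = [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(N)] for i in range(N)]
    XE = [[sum(X[p][i] * E[i][j] for i in range(N)) for j in range(N)] for p in range(K)]
    return [[sum(XE[p][j] * X[r][j] for j in range(N)) / N for r in range(K)]
            for p in range(K)]


# Toy 2-channel segment with 5 samples (hypothetical values).
X = [[1.0, 2.0, 0.0, -1.0, 3.0], [0.0, 1.0, 1.0, 2.0, -2.0]]
print(local_cov_pairwise(X, 3))
print(local_cov_laplacian(X, 3))
```

The compact form avoids the double loop over sample pairs, which matters when a trial contains many time points.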

    In the implementation of the temporal extension of WPC, the covariance matrices $\Sigma_l$ ($l = 1, \ldots, M$) are estimated as

    $$\Sigma_l = \frac{1}{N|I_l|}\sum_{t \in I_l} X_t L X_t^T. \tag{38}$$

    The optimization procedure can be carried out similarly to WPC. The features are extracted as

    $$\sigma_g^2 = \omega_g^T X_t L X_t^T \omega_g, \quad (g = 1, \ldots, G) \tag{39}$$

    where $\omega_g$ are the filters of the temporal extension of WPC.

    1) Choice of $\tau$: The additional parameter $\tau$ is determined from the data using a three-way cross-validation strategy. This strategy contains two nested loops. In the outer loop, the samples are divided into $T_1$ folds, of which one fold is treated as the testing set. The testing samples are used for the estimation of generalization ability and are not involved in solving for the filters and the parameter. In the inner loop, the remaining $T_1 - 1$ folds are further divided into $T_2$ folds, of which one fold is treated as the validation set while the remaining $T_2 - 1$ folds are


    treated as the training set. For each $\tau$, the filters are solved on the training set, and then the recognition rate is calculated on the validation set. This procedure is repeated $T_2$ times with a different validation set each time. The average recognition rate is recorded as the recognition accuracy across the $T_2$ folds. We select the $\tau$ that results in the maximum recognition accuracy. We then solve the filters using all the $T_2$ folds with the optimal $\tau$ selected earlier. With the filters obtained, we calculate the recognition rate on the testing set, which is specified in the outer loop. The earlier procedure is repeated $T_1$ times with a different fold as the testing set each time. The average recognition rate is computed as the final recognition accuracy across the $T_1$ folds.
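The two nested loops can be sketched as follows. This is a schematic under stated assumptions: `score(train, test, param)` is a hypothetical user-supplied callback that fits the filters and classifier on `train` with the given parameter value and returns the recognition rate on `test`, and folds are formed by a simple round-robin split rather than the per-class partition used in the experiments.

```python
def split_folds(indices, n_folds):
    """Deal the indices round-robin into n_folds disjoint folds."""
    return [indices[k::n_folds] for k in range(n_folds)]


def nested_cv(indices, candidates, score, t1=3, t2=3):
    """Three-way cross-validation: inner loop selects the parameter, outer loop tests it."""
    outer = split_folds(indices, t1)
    outer_scores = []
    for k, test in enumerate(outer):
        # Samples outside the current testing fold.
        rest = [i for f, fold in enumerate(outer) if f != k for i in fold]
        inner = split_folds(rest, t2)

        def inner_accuracy(param):
            accs = []
            for v, val in enumerate(inner):
                train = [i for f, fold in enumerate(inner) if f != v for i in fold]
                accs.append(score(train, val, param))
            return sum(accs) / t2

        # Parameter with the best average validation accuracy.
        best = max(candidates, key=inner_accuracy)
        # Refit on all inner folds with the selected parameter, then test.
        outer_scores.append(score(rest, test, best))
    return sum(outer_scores) / t1


# Toy callback (hypothetical): accuracy peaks when the parameter equals 3.
toy_score = lambda train, test, p: 1.0 - abs(p - 3) / 10
print(nested_cv(list(range(18)), [1, 2, 3, 4, 5], toy_score))
```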

    V. EXPERIMENTS

    We evaluate the effectiveness of the proposed multiclass methods on two publicly available datasets of BCI competitions. These two datasets are of four-class motor imagery EEG signals. We compare the classification performances of the proposed multiclass methods with the multiclass CSP using one-versus-rest and using JAD [7], the multiclass information theoretic feature extraction [10], and the multiclass CSP presented in [23].

    A. EEG Datasets Used for Evaluation

    1) Dataset IIIa of BCI Competition III: This dataset is of a four-class motor imagery paradigm recorded from three subjects (k3b, k6b, and l1b) [3]. The subjects, sitting relaxed in a normal chair, were asked to perform four different motor imagery tasks (i.e., left hand, right hand, one foot, and tongue) according to cues, which were presented in a randomized order. In each trial, the cue was displayed from the third second and lasted for 1.25 s. At the same time, the motor imagery started and continued until the fixation cross disappeared at the seventh second. So, the duration of the motor imagery in each trial was 4 s. For subject k3b, there were 90 trials for each mental task. For subjects k6b and l1b, each mental task cue appeared 60 times. In our experiment, we discard four trials of subject k6b because of missing data. The EEG measurements were recorded using 60 sensors by a 64-channel Neuroscan system. The left and right mastoids were used as reference and ground, respectively. The EEG signals were sampled at 250 Hz and filtered with cutoff frequencies of 1 and 50 Hz with the notch filter ON.

    2) Dataset IIa of BCI Competition IV: This dataset contains EEG signals recorded during a cue-based four-class motor imagery task from nine subjects [17]. Each trial started with a short acoustic warning tone along with a fixation cross displayed on the black screen. After 2 s, a visual cue was presented for 1.25 s, instructing the subjects to carry out the desired motor imagery task (i.e., the imagination of movement of the left hand, right hand, both feet, or tongue) from the third second until the fixation cross disappeared at the sixth second. Each subject participated in two sessions recorded on different days. There were 288 trials in each session for each subject, i.e., 72 trials per task. Twenty-two electrodes were used to record the EEG signals, which were sampled at 250 Hz and filtered with cutoff frequencies of 0.5 and 100 Hz with the notch filter ON.

    B. Experimental Settings and Results

    The data are band-pass filtered between 5 and 35 Hz using

    a fifth-order butterworth filter, as in [9] and [10]. The EEG

    segments recorded during the motor imagery period, i.e., from

    the third second to the seventh second in dataset IIIa of BCI

    competition III and from the third second to the sixth second in

    dataset IIa of BCI competition IV, are used in the experiment. Weexploit the three-fold cross-validation strategy to evaluate the

    classification accuracy. That is, we partition all the trials of each

    class per subject into three divisions, in which each division is

    used as testing data while the remainder two divisions are used

    as training data. This procedure is repeated three times until

    each division is used once as testing data. In each repetition, for

    each filter obtained on the training data, features are obtained

    by projection on the 15 frequency bands of 2-Hz width in the

    range 5–35 Hz [9], [10]. Consequently, we obtain a (15 × G)-dimensional

    feature vector for each trial, where G is the number of filters

    selected on the training data. That is, we use the three-way

    cross-validation procedure with T1 = T2 = 3 to determine the value

    of G, where G varies from 2 to 10 in steps of 2. The (15 × G)-dimensional

    feature vectors are further reduced to 3-D¹ vectors by using the

    Fisher discriminant analysis (FDA) [21]. It should be noted that the

    spatial filters, the value of G, as well as the FDA weights are

    calculated on the basis of the training data

    and then applied to the testing data. The conventional classifier

    of the nearest class mean with Euclidean distance [21] is adopted

    to predict the class labels of the testing samples.
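    The evaluation protocol described above (a per-class split into three divisions, with a nearest-class-mean classifier using Euclidean distance) can be sketched as follows. This is a minimal illustration only: the function names and the synthetic 3-D features, which stand in for FDA-reduced feature vectors, are assumptions and are not taken from the paper. The spatial filtering, band-wise projection, selection of G, and FDA steps are omitted.

```python
import random
from collections import defaultdict


def stratified_three_fold(labels, seed=0):
    """Partition trial indices into three divisions, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[], [], []]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % 3].append(index)
    return folds


def class_means(features, labels):
    """Mean feature vector of each class, computed on training data."""
    grouped = defaultdict(list)
    for vector, label in zip(features, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def nearest_class_mean(vector, means):
    """Label whose class mean is closest in Euclidean distance."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(vector, means[label]))
    return min(sorted(means), key=squared_distance)


def three_fold_accuracy(features, labels, seed=0):
    """Pooled accuracy over the three train/test repetitions."""
    folds = stratified_three_fold(labels, seed)
    correct = 0
    for k, test_indices in enumerate(folds):
        train_indices = [i for j, fold in enumerate(folds) if j != k for i in fold]
        means = class_means(
            [features[i] for i in train_indices],
            [labels[i] for i in train_indices],
        )
        correct += sum(
            nearest_class_mean(features[i], means) == labels[i]
            for i in test_indices
        )
    return correct / len(labels)


def synthetic_trials(trials_per_class=72, seed=1):
    """Hypothetical, well-separated 3-D features for four classes."""
    rng = random.Random(seed)
    centers = {0: (0, 0, 0), 1: (10, 0, 0), 2: (0, 10, 0), 3: (0, 0, 10)}
    features, labels = [], []
    for label, center in centers.items():
        for _ in range(trials_per_class):
            features.append([c + rng.uniform(-1, 1) for c in center])
            labels.append(label)
    return features, labels


features, labels = synthetic_trials()
print(three_fold_accuracy(features, labels))
```

    In this sketch the class means are estimated on the two training divisions only, mirroring the requirement that all quantities be computed from the training data and then applied to the testing data.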

    Table III reports the classification accuracies by using the

    multiclass filters solved by the various methods. Note, we also

    evaluate the classification accuracy of WPC integrating temporal

    information (WPC/TI), where the parameter is determined by the

    three-way cross-validation procedure with T1 = T2 = 3. Here, the

    parameter is varied logarithmically from 1 to 5 in steps of 1. It is

    observed that the proposed WPC method achieves much better

    classification accuracy than the existing multiclass methods,

    and WPC/TI further improves the results in most cases. The im-

    provement of WPC/TI is attributed to the local temporal mod-

    eling. The reason that WPC/TI results in lower classification

    accuracies than WPC in a few cases may be due to overfitting.

    C. Comparison With BCI Competition IV

    For dataset IIa of BCI competition IV, to compare with the re-

    sults of the winners, we use the evaluation of session-to-session

    transfer from session one to session two in terms of kappa score,

    simulating the competition scenario. The procedure of the

    session-to-session transfer is much simpler than the cross validation.

    Specifically, we use the first session as training data and the

    second session as testing data. All the experimental settings are the

    same as described in the previous section, except that the

    training data and the testing data are now fixed. The resulting

    kappa scores are summarized in Table IV. It can be seen that our

    proposed methods perform fairly well compared with the results

    obtained by the best two competitors. Note that the results obtained

    by the multiclass CSP using one-versus-rest and using JAD are slightly

    different from those reported in [9], since different classifiers and

    time segments are used.

    ¹Since the number of classes is four, we can obtain at most three dimensions

    of features by FDA, which is known as the rank-limit problem.

    TABLE III. COMPARISON OF THE CLASSIFICATION ACCURACIES (%) OF THE PROPOSED WPC AND WPC/TI METHODS WITH THE EXISTING MULTICLASS METHODS FOR EACH SUBJECT ON THE DATASETS OF BCI COMPETITIONS, WHERE M1–M6 REFER TO MULTICLASS CSP USING ONE-VERSUS-REST, MULTICLASS CSP USING JAD, MULTICLASS INFORMATION THEORETIC FEATURE EXTRACTION, MULTICLASS CSP IN [23], WPC, AND WPC/TI, RESPECTIVELY

    TABLE IV. KAPPA SCORES OF VARIOUS MULTICLASS METHODS FOR EACH SUBJECT ON DATASET IIa OF BCI COMPETITION IV USING SESSION-TO-SESSION TRANSFER, WHERE NO. 1 AND NO. 2 REFER TO THE BEST TWO COMPETITORS, AND M1–M6 ARE THE SAME AS THOSE OF TABLE III
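    For readers unfamiliar with the kappa score used in the session-to-session comparison, the following minimal sketch computes Cohen's kappa from true and predicted labels. It assumes that the kappa score reported here is the standard Cohen's kappa; the function name and the toy labels are illustrative and not taken from the paper.

```python
def cohen_kappa(true_labels, predicted_labels):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(true_labels)
    classes = set(true_labels) | set(predicted_labels)
    # Fraction of trials on which prediction and truth agree.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    # Agreement expected by chance from the two marginal label distributions.
    expected = sum(
        (true_labels.count(c) / n) * (predicted_labels.count(c) / n)
        for c in classes
    )
    if expected == 1.0:
        return 1.0  # degenerate case: a single class in both lists
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical labels): four balanced classes, 5 of 8 correct.
true_labels = [0, 1, 2, 3, 0, 1, 2, 3]
predicted = [0, 1, 2, 3, 0, 2, 3, 1]
print(cohen_kappa(true_labels, predicted))
```

    When the four classes are balanced in the test data, the chance agreement is 0.25, so the kappa score reduces to (accuracy − 0.25) / 0.75.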

    In our experiment, a simple classification procedure is em-

    ployed to reveal the effectiveness of the multiclass filters ob-

    tained by WPC and WPC/TI. The classification performance

    may be improved if we solve filters in narrower frequency bands,

    tune the optimal time segment for each trial, and/or use other

    sophisticated classifiers. The goal of this paper is to demonstrate

    the effectiveness of the weighted scheme for solving multiclass

    filters: while we use the same experimental settings for all the

    methods, the weighted pairwise design produces a much higher

    classification accuracy.

    VI. CONCLUSION

    In this paper, we propose a new discriminant criterion, called

    WPC, for optimizing multiclass filters. The approach is established

    by minimizing the upper bound of the Bayesian error of

    classifying EEG single-trial segments, resulting in a sum of

    weights imposed on individual class pairs according to their

    closeness. We place special emphasis on the effect

    of closer classes that are more likely to cause misclassification.

    In other words, the contributions of different class pairs to the

    discriminant criterion are biased. Computationally, the WPC

    algorithm is conveniently solved by the rank-one update and

    power iteration technique.

    The proposed WPC approach is intentionally formulated for

    classifying EEG single-trial data. It takes into account classi-

    fication errors of EEG trials between pairs of classes. While

    the criterion derived based on the Bayesian error of classifying

    EEG sampling points is reasonable, the large pairwise class

    differences may play an overwhelming role in the optimization.

    By contrast, WPC directly uses the same features for optimizing

    spatial filters as for classification. Moreover, we extend WPC

    by integrating the temporal information of EEG series in the co-

    variance matrix formulation. The effectiveness of the proposed

    WPC method is demonstrated by the classification of four motor

    imagery tasks on two datasets of BCI competitions.

    Finally, we point out that the Bayesian error estimation heav-

    ily relies on the assumption of independent Gaussian distribu-

    tion. This assumption, however, does not hold stringently in ap-

    plications, since EEG data usually has an autocorrelation structure.

    One possible remedy is to use a Gaussian mixture model instead of a

    single Gaussian distribution. We are studying

    this issue theoretically and practically.

    APPENDIX A

    PROOF OF (14)

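    For $0 \le x \le a$, in the error function

    $$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-u^2}\,du \tag{40}$$

    we use the variable substitution $v = \frac{a}{x}u$. Then, we have

    $$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^a e^{-v^2 (x/a)^2}\,\frac{x}{a}\,dv. \tag{41}$$

    Since $0 \le \frac{x}{a} \le 1$, it follows that

    $$\operatorname{erf}(x) \ge \frac{x}{a}\,\frac{2}{\sqrt{\pi}} \int_0^a e^{-v^2}\,dv = \frac{1}{a}\operatorname{erf}(a)\,x \tag{42}$$

    which completes the proof.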
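    The bound proved above, namely that erf(x) is at least erf(a)·x/a on the interval from 0 to a, can also be checked numerically. The following sketch uses only the standard-library `math.erf`; the function names, the grid size, and the tolerance are illustrative choices and not part of the paper.

```python
import math


def erf_linear_lower_bound(x, a):
    """Right-hand side of the bound: the chord of erf through 0 and a."""
    return math.erf(a) * x / a


def bound_holds(a, steps=1000, tol=1e-12):
    """Check erf(x) >= erf(a) * x / a on an even grid of [0, a]."""
    for i in range(steps + 1):
        x = a * i / steps
        if math.erf(x) + tol < erf_linear_lower_bound(x, a):
            return False
    return True


for a in (0.5, 1.0, 3.0):
    print(a, bound_holds(a))
```

    Equality holds at the two endpoints x = 0 and x = a, which is why a small tolerance is used in the comparison.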


    APPENDIX B

    PROOF OF (32)

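    We have

    $$\begin{aligned}
    \sum_{l=1}^{M} \left|\omega^T \Delta^{(l)} \omega\right|
    &= \sum_{l=1}^{M} \left|\omega^T \Big(\Sigma_l - \sum_{m=1}^{M} q\,\Sigma_m\Big) \omega\right| \\
    &= \sum_{l=1}^{M} \left|\sum_{m=1}^{M} q\,\omega^T \Delta^{(lm)} \omega\right| \\
    &\le 2q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left|\omega^T \Delta^{(lm)} \omega\right|.
    \end{aligned} \tag{43}$$

    On the other hand, we have

    $$\begin{aligned}
    q \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} \left|\omega^T \Delta^{(lm)} \omega\right|
    &= \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \left|\omega^T \Delta^{(l)} \omega - \omega^T \Delta^{(m)} \omega\right| \\
    &\le \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \left|\omega^T \Delta^{(l)} \omega\right| + \frac{q}{2} \sum_{l=1}^{M} \sum_{m=1}^{M} \left|\omega^T \Delta^{(m)} \omega\right| \\
    &= \frac{1}{2} \sum_{l=1}^{M} \left|\omega^T \Delta^{(l)} \omega\right| + \frac{1}{2} \sum_{m=1}^{M} \left|\omega^T \Delta^{(m)} \omega\right| \\
    &= \sum_{l=1}^{M} \left|\omega^T \Delta^{(l)} \omega\right|.
    \end{aligned} \tag{44}$$

    The proof is, thus, established.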
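    As a sanity check on the two-sided relation established by (43) and (44), the sketch below tests it numerically on random data. It rests on an assumed reading of the notation: each class contributes a scalar s_l (the projection of its class matrix by a fixed filter), the priors are equal with q = 1/M, and the chain reads sum_l |s_l − mean| ≤ 2q · sum_{l<m} |s_l − s_m| ≤ 2 · sum_l |s_l − mean|. The function names and the random test values are hypothetical.

```python
import random


def pairwise_bounds(projections):
    """Return (lhs, middle, rhs) of the chain for scalar projections s_l."""
    m = len(projections)
    q = 1.0 / m  # equal class priors (assumption)
    mean = sum(q * s for s in projections)
    # Sum of absolute deviations of each class from the prior-weighted mean.
    lhs = sum(abs(s - mean) for s in projections)
    # Weighted sum of absolute pairwise differences over class pairs l < k.
    middle = 2.0 * q * sum(
        abs(projections[l] - projections[k])
        for l in range(m - 1)
        for k in range(l + 1, m)
    )
    return lhs, middle, 2.0 * lhs


def chain_holds(projections, tol=1e-9):
    """True if lhs <= middle <= rhs up to a floating-point tolerance."""
    lhs, middle, rhs = pairwise_bounds(projections)
    return lhs <= middle + tol and middle <= rhs + tol


rng = random.Random(0)
samples = [[rng.uniform(0.0, 5.0) for _ in range(4)] for _ in range(1000)]
print(all(chain_holds(s) for s in samples))
```

    With two classes and projections 0 and 1, the three quantities are 1, 1, and 2, so the first inequality is tight in that case.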

    ACKNOWLEDGMENT

    The author would like to thank the anonymous referees and

    the editors for their constructive recommendations, which improved

    the paper substantially.

    REFERENCES

    [1] A. Bashashati, M. Fatourechi, R. K. Ward, and G. E. Birch, "A survey of signal processing algorithms in brain–computer interfaces based on electrical brain signals," J. Neural Eng., vol. 4, no. 2, pp. R32–R57, Jun. 2007.

    [2] B. Blankertz, K.-R. Müller, G. Curio, T. M. Vaughan, G. Schalk, J. R. Wolpaw, A. Schlögl, C. Neuper, G. Pfurtscheller, T. Hinterberger, M. Schröder, and N. Birbaumer, "The BCI competition 2003: Progress and perspectives in detection and discrimination of EEG single trials," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 1044–1051, Jun. 2004.

    [3] B. Blankertz, K.-R. Müller, D. J. Krusienski, G. Schalk, J. R. Wolpaw, A. Schlögl, G. Pfurtscheller, J. R. Millán, M. Schröder, and N. Birbaumer, "The BCI competition III: Validating alternative approaches to actual BCI problems," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 14, no. 2, pp. 153–159, Jun. 2006.

    [4] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and K.-R. Müller, "Optimizing spatial filters for robust EEG single-trial analysis," IEEE Signal Process. Mag., vol. 25, no. 1, pp. 41–56, Jan. 2008.

    [5] J. T. Chu and J. C. Chuen, "Error probability in decision functions for character recognition," J. Assoc. Comput. Mach., vol. 14, no. 2, pp. 273–280, 1967.

    [6] W. S. Cleveland, "Robust locally weighted regression and smoothing scatterplots," J. Amer. Stat. Assoc., vol. 74, pp. 829–836, 1979.

    [7] G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller, "Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multi-class paradigms," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 993–1002, Jun. 2004.

    [8] G. Dornhege, M. Krauledat, K.-R. Müller, and B. Blankertz, General Signal Processing and Machine Learning Tools for BCI. Cambridge, MA: MIT Press, 2007, pp. 207–233.

    [9] C. Gouy-Pailler, M. Congedo, C. Brunner, C. Jutten, and G. Pfurtscheller, "Nonstationary brain source separation for multiclass motor imagery," IEEE Trans. Biomed. Eng., vol. 57, no. 2, pp. 469–478, Feb. 2010.

    [10] M. Grosse-Wentrup and M. Buss, "Multiclass common spatial patterns and information theoretic feature extraction," IEEE Trans. Biomed. Eng., vol. 55, no. 8, pp. 1991–2000, Aug. 2008.

    [11] S. Lemm, B. Blankertz, G. Curio, and K.-R. Müller, "Spatio-spectral filters for improved classification of single trial EEG," IEEE Trans. Biomed. Eng., vol. 52, no. 9, pp. 1541–1548, Sep. 2005.

    [12] Y. Li, X. Gao, and S. Gao, "Classification of single-trial electroencephalogram during finger movement," IEEE Trans. Biomed. Eng., vol. 51, no. 6, pp. 1019–1025, Jun. 2004.

    [13] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, "Multiclass linear dimension reduction by weighted pairwise Fisher criteria," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 7, pp. 762–766, Jul. 2001.

    [14] R. Lotlikar and R. Kothari, "Fractional-step dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 6, pp. 623–627, Jun. 2000.

    [15] D. J. McFarland, C. W. Anderson, K.-R. Müller, A. Schlögl, and D. J. Krusienski, "BCI meeting 2005–workshop on BCI signal processing: Feature extraction and translation," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 14, no. 2, pp. 135–138, Jun. 2006.

    [16] J. Müller-Gerking, G. Pfurtscheller, and H. Flyvbjerg, "Designing optimal spatial filters for single-trial EEG classification in a movement task," Clin. Neurophys., vol. 110, no. 5, pp. 787–798, May 1999.

    [17] M. Naeem, C. Brunner, R. Leeb, B. Graimann, and G. Pfurtscheller, "Seperability of four-class motor imagery data using independent components analysis," J. Neural Eng., vol. 3, pp. 208–216, 2006.

    [18] L. C. Parra, C. D. Spence, A. D. Gerson, and P. Sajda, "Recipes for linear analysis of EEG," NeuroImage, vol. 28, no. 2, pp. 326–341, Nov. 2005.

    [19] H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller, "Optimal spatial filtering of single trial EEG during imagined hand movement," IEEE Trans. Rehabil. Eng., vol. 8, no. 4, pp. 441–446, Dec. 2000.

    [20] H. Wang and W. Zheng, "Local temporal common spatial patterns for robust single-trial EEG classification," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 16, no. 2, pp. 131–139, Apr. 2008.

    [21] A. R. Webb, Statistical Pattern Recognition. London, U.K.: Oxford Univ. Press, 1999.

    [22] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, "Brain–computer interfaces for communication and control," Clin. Neurophysiol., vol. 113, no. 6, pp. 767–791, Jun. 2002.

    [23] W. Zheng and Z. Lin, "Optimizing multi-class spatio-spectral filters via Bayes error estimation for EEG classification," in Proc. Neural Informat. Process. Syst. (NIPS), 2009, pp. 1–9.

    Haixian Wang (M'09) received the B.S. and M.S. degrees in statistics and the Ph.D. degree in computer science from Anhui University, Anhui, China, in 1999, 2002, and 2005, respectively.

    During 2002–2005, he was with the Key Laboratory of Intelligent Computing and Signal Processing of the Ministry of Education of China. He is currently with the Key Laboratory of Child Development and Learning Science of the Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing, Jiangsu, China. His research interests include EEG signal processing, statistical pattern recognition, and machine learning.