
Generalized Component Analysis and Blind Source Separation Methods for Analyzing Multichannel Brain Signals

Andrzej Cichocki
Riken, Brain Science Institute, Japan
Laboratory for Advanced Brain Signal Processing
and Warsaw University of Technology, Poland

Blind source separation (BSS) and related methods, e.g., independent component analysis (ICA), are generally based on a wide class of unsupervised learning algorithms, and they have found potential applications in many areas from engineering to neuroscience. The recent trend in blind source separation and generalized component analysis (GCA) is to consider problems in the framework of matrix factorization or, more generally, signal decomposition with probabilistic generative models, and to exploit a priori knowledge about the true nature, morphology, or structure of the latent (hidden) variables or sources, such as sparseness, spatio-temporal decorrelation, statistical independence, nonnegativity, smoothness, or lowest possible complexity. The goal of BSS can be considered as the estimation of the true physical sources and the parameters of a mixing system, while the objective of GCA is to find a new reduced or hierarchical and structured representation for the observed (sensor) data that can be interpreted as physically meaningful coding or blind signal decomposition. The key issue is to find such a transformation or coding that has a true physical meaning and interpretation. In this paper we discuss some promising applications of BSS/GCA for analyzing multi-modal, multi-sensory data, especially EEG/MEG data. Moreover, we propose to apply these techniques for early detection of Alzheimer's disease (AD) using EEG recordings. Furthermore, we briefly review some efficient unsupervised learning algorithms for linear blind source separation, blind source extraction, and generalized component analysis using various criteria, constraints, and assumptions.

A fairly general blind signal separation (BSS) problem, often referred to as blind signal decomposition or blind source extraction (BSE), can be formulated as follows (see Figure 1). We observe records of sensor signals x(t) = [x_1(t), . . . , x_m(t)]^T, where t is time and (·)^T denotes the transpose of a vector, from a MIMO (multiple-input/multiple-output) dynamical (mixing and filtering) system. These signals are usually a superposition of unknown source signals and noise. The objective is to find an inverse system, sometimes termed a reconstruction system or neural network, if it exists and is stable, in order to estimate all the primary source signals s(t) = [s_1(t), . . . , s_n(t)]^T, or only some of them with specific properties.

The author would like to thank Professor Michael J. Wenger for his invitation to the NDSQM-2004 Symposium and for very helpful comments and suggestions which substantially improved the presentation. Correspondence may be addressed to Riken, Brain Science Institute, Laboratory for Advanced Brain Signal Processing, Wako-shi, Saitama 351-0198, Japan; email: [email protected]


[Figure 1 diagram: (a) an unknown dynamic (mixing) system transforms the sources s_1(t), . . . , s_n(t) and noises v_1(t), . . . , v_m(t) into sensor signals x_1(t), . . . , x_m(t); a neural network (reconstruction system), tuned by an adaptive learning algorithm, produces the outputs y_1(t) = F(s_i(t)), . . . , y_n(t) = F(s_j(t)). (b) Dipolar brain sources picked up as MEG or EEG signals.]

Figure 1. (a) General model illustrating blind source separation (BSS). (b) Such models are exploited in non-invasive multi-sensor recording of brain activity using EEG (electroencephalography) or MEG (magnetoencephalography). It is assumed that the scalp sensors (electrodes, SQUIDs) pick up a superposition of neuronal brain sources and non-brain sources related, for example, to eye movements, muscle activity, and noise. The objective is to identify the individual signals coming from different areas of the brain.

This estimation is performed on the basis of only the output signals y(t) = [y_1(t), . . . , y_n(t)]^T. Preferably, the inverse (unmixing) system should be adaptive in such a way that it has some tracking capability in non-stationary environments. Instead of estimating the source signals directly, it is sometimes more convenient to identify the unknown mixing and filtering system first (e.g., when the inverse system does not exist, especially when the system is overcomplete, i.e., the number of observations is less than the number of source signals, m < n) and then estimate the source signals implicitly, by exploiting some a priori information about them and applying a suitable optimization procedure.

There appears to be something magical about blind source separation: we are estimating the original source signals without knowing the parameters of the mixing and/or filtering processes. It is difficult to imagine that one can estimate them at all. In fact, without some a priori knowledge, it is not possible to uniquely estimate the original source signals. However, one can usually estimate them up to certain indeterminacies. In mathematical terms these indeterminacies and ambiguities can be expressed as arbitrary scaling and permutation of the estimated source signals (Tong, Liu, Soon, & Huang, 1991). These indeterminacies preserve, however, the waveforms of the original sources. Although these indeterminacies seem to be rather severe limitations, in a great number of applications they are not essential, since the most relevant information about the source signals is contained in the temporal waveforms or time-frequency patterns of the source signals and usually not in their amplitudes or the order in which they are arranged at the output of the system.[1]
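To make the indeterminacy explicit (a short worked remark added here, not part of the original derivation): for any permutation matrix P and any nonsingular diagonal scaling matrix D, the noiseless mixing model satisfies

x(k) = A s(k) = (A P^T D^{-1}) (D P s(k)) = A' s'(k),

so the rescaled and reordered pair (A', s') explains the observations exactly as well as (A, s); only the waveforms of the sources are identifiable.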

[1] For some dynamical models, however, there is no guarantee that the estimated or extracted signals have exactly the same waveforms as the source signals, and then the requirements must sometimes be further relaxed to the extent that the extracted waveforms are distorted (i.e., time-delayed, filtered, or convolved) versions of the primary source signals (see Figure 1(a)).


The problems of separating or extracting the source signals from a sensor array, without knowing the transmission channel characteristics and the sources, can be addressed by a number of related BSS or generalized component analysis (GCA) methods, such as Independent Component Analysis (ICA) and its extensions: Topographic ICA, Multidimensional ICA, Kernel ICA, Tree-dependent Component Analysis, Multiresolution Subband Decomposition ICA (Hyvarinen, Karhunen, & Oja, 2001; Bach & Jordan, 2003; Cichocki & Georgiev, 2003), Sparse Component Analysis (SCA) (Zibulevsky, Kisilev, Zeevi, & Pearlmutter, 2002; Li, Cichocki, & Amari, 2004; Washizawa & Cichocki, 2006; He & Cichocki, 2006), Sparse Principal Component Analysis (SPCA) (Chenubhotla, 2004; Zou, Hastie, & Tibshirani, 2006), Multichannel Morphological Component Analysis (MMCA) (Bobin, Moudden, Starck, & Elad, 2006), Non-negative Matrix Factorization (NMF) (Lee & Seung, 1999; Sajda, Du, & Parra, 2003), Smooth Component Analysis (SmoCA) (Cichocki & Amari, 2003), Parallel Factor Analysis (PARAFAC) (Miwakeichi et al., 2004), Time-Frequency Component Analyzer (TFCA) (Belouchrani & Amin, 1996), and Multichannel Blind Deconvolution (MBD) (Amari & Cichocki, 1998; Zhang, Cichocki, & Amari, 2004; Choi, Cichocki, & Amari, 2002).

The mixing and filtering processes of the unknown input sources s_j may have different mathematical or physical models, depending on the specific application (Hyvarinen et al., 2001; Amari & Cichocki, 1998). Most linear BSS models in their simplest forms can be expressed algebraically as specific problems of matrix factorization: Given the observation (often called sensor or data) matrix X = [x(1), . . . , x(N)] ∈ IR^{m×N}, perform the matrix factorization

X = AS + V, (1)

where N is the number of available samples, m is the number of observations, n is the number of sources, A ∈ IR^{m×n} represents the unknown basis data matrix or mixing matrix (depending on the application), V ∈ IR^{m×N} is an unknown matrix representing errors or noise, and S = [s(1), . . . , s(N)] ∈ IR^{n×N} contains the corresponding latent (hidden) components that give the contribution of each basis vector. Usually these latent components represent unknown source signals with specific statistical properties or temporal structures. The matrices usually have clear statistical properties and meanings. For example, the rows of the matrix S, which represent the sources or components, should be as sparse as possible for SCA, or as statistically mutually independent as possible for ICA. Often it is required that the estimated components be piecewise smooth (SmoCA), take only nonnegative values (NMF), or take values with specific constraints (Lee & Seung, 1999; Cichocki & Georgiev, 2003).
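As one concrete instance of the factorization (1) under nonnegativity constraints, the sketch below implements the multiplicative updates of Lee and Seung for the squared Euclidean cost; it is a minimal illustration only, and the variable names and parameters are our own choices, not prescribed by the text:

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, eps=1e-9, seed=0):
    """Minimal Lee-Seung NMF sketch: X ~ A S with A >= 0 and S >= 0.
    Assumes X has nonnegative entries (X is m x N)."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    A = rng.random((m, n_components))   # nonnegative mixing/basis matrix
    S = rng.random((n_components, N))   # nonnegative components
    for _ in range(n_iter):
        # Multiplicative updates for the cost ||X - A S||_F^2
        S *= (A.T @ X) / (A.T @ A @ S + eps)
        A *= (X @ S.T) / (A @ S @ S.T + eps)
    return A, S
```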

Although some decompositions or matrix factorizations provide an exact reconstruction of the data (i.e., X = AS), we shall consider here decompositions which are approximative in nature. In fact, many problems in signal and image processing can be expressed in such terms of matrix factorization. However, different cost functions and imposed constraints may lead to different types of matrix factorization. In many signal processing applications the data matrix X = [x(1), x(2), . . . , x(N)] is represented by vectors x(k) (k = 1, 2, . . . , N) for a set of discrete time instants (k = t) as multiple measurements or recordings.



[Figure 2 diagram: (a) an unknown mixing matrix H maps the source vector s(k) (n sources) to the observed vector x(k) (m sensors), with additive noise v(k); a demixing matrix W, adapted by a learning algorithm, produces the output y(k). (b) Detailed signal-flow model showing the individual mixing coefficients h_{11}, . . . , h_{mn}, noises v_1(k), . . . , v_m(k), and separating weights w_{11}, . . . , w_{nm}.]

Figure 2. Block diagrams illustrating the linear blind source separation or blind identification problem: (a) General scheme with the mixing matrix A = H. (b) Detailed model. For the overcomplete problem (m < n) the separating matrix W may not exist; in such cases we attempt to identify the mixing matrix A first and then estimate the sources by exploiting some a priori knowledge, such as the sparsity or independence of the unknown sources.

Thus, the compact aggregated matrix equation (1) can be written in vector form as the system of linear equations (see Figure 2):

x(k) = A s(k) + v(k), (k = 1, 2, . . . , N) (2)

where x(k) = [x_1(k), x_2(k), . . . , x_m(k)]^T is the vector of observed signals at the discrete time instant k, while s(k) = [s_1(k), s_2(k), . . . , s_n(k)]^T is the vector of unknown sources at the same time instant.[2] The problems formulated above are closely related to the linear inverse problem or, more generally, to solving a large ill-conditioned system of linear equations (overdetermined or underdetermined, depending on the application), where it is necessary to reliably estimate the vectors s(k) and, in some cases, also to identify the matrix A for noisy data (Kreutz-Delgado et al., 2003; Cichocki & Amari, 2003; Cichocki & Unbehauen, 1994).

[2] Data are often represented not in the time domain but in the complex frequency or time-frequency domain, so the index k may have a different meaning.


[Figure 3 diagram: four boxes surrounding ICA: Mutual Independence / Non-Gaussianity; Temporal Structure / Linear Predictability / Non-whiteness; Non-Stationarity / Time-varying Variances; Time-Frequency, Spectral and/or Spatial Diversities.]

Figure 3. Basic approaches to blind source separation. Each approach exploits some a priori knowledge and specific properties or diversities of the source signals (Cichocki & Amari, 2003).

It is assumed that only the sensor vector x(k) is available, and it is necessary to design a feed-forward or recurrent neural network and an associated adaptive learning algorithm that enable the estimation of the sources and the identification of the mixing matrix A and/or the separating matrix W with good tracking abilities. Often BSS/GCA is achieved by finding an n × m, full-rank, linear transformation (separating) matrix W = A^+, where A^+ denotes the Moore-Penrose pseudo-inverse of A, such that the output signal vector y = [y_1, y_2, . . . , y_n]^T, defined by y = W x, contains the desired components (e.g., sparse, nonnegative, independent, or spatio-temporally decorrelated).
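The sketch below illustrates this notation on synthetic data. Note that it uses an oracle pseudo-inverse of the true A only to make the roles of A, W, x(k), and y(k) concrete; in actual BSS the mixing matrix is unknown and W must be learned. All names and parameters here are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 1000, 3, 5                        # samples, sources, sensors
t = np.arange(N) / 200.0                    # hypothetical 200 Hz sampling

# Synthetic sources with different temporal structures
S = np.vstack([np.sin(2 * np.pi * 7 * t),
               np.sign(np.sin(2 * np.pi * 3 * t)),
               rng.laplace(size=N)])

A = rng.normal(size=(m, n))                 # "unknown" mixing matrix
X = A @ S + 0.01 * rng.normal(size=(m, N))  # sensor signals, eq. (2)

W = np.linalg.pinv(A)                       # oracle separating matrix W = A^+
Y = W @ X                                   # Y approximates S up to small noise
```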

Although many different source separation criteria and algorithms are available, their principles can be summarized by the following four fundamental approaches (see Figure 3):

• The most popular approach exploits as its cost function some measure of the signals' statistical independence, non-Gaussianity, or sparseness. When the original sources are assumed to be statistically independent, higher-order statistics (HOS) are essential (implicitly or explicitly) to solve the BSS problem. In such a case, the method does not allow more than one Gaussian source (a minimal code sketch of this approach is given after this list).

• If the sources have temporal structure, then each source has non-vanishing temporal correlation, and conditions less restrictive than statistical independence can be used; namely, second-order statistics (SOS) are often sufficient to estimate the mixing matrix and the sources. Along this line, several methods have been developed (Molgedey & Schuster, 1994; Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Ziehe, Muller, Nolte, Mackert, & Curio, 2000; Choi, Cichocki, & Belouchrani, 2002; Tong et al., 1991; Cichocki & Belouchrani, 2001; Barros & Cichocki, 2001). Note that these SOS methods do not allow the separation of sources with identical power spectrum shapes or of i.i.d. (independent and identically distributed) sources.

• The third approach exploits non-stationarity (NS) properties together with second-order statistics (SOS). Mainly, we are interested in second-order non-stationarity, in the sense that the source variances vary in time.


Non-stationarity was first taken into account by Matsuoka, Ohya, and Kawamoto (1995), who showed that a simple decorrelation technique is able to perform the BSS task for a wide class of source signals. In contrast to other approaches, methods based on non-stationarity information allow the separation of colored Gaussian sources with identical power spectrum shapes. However, they do not allow the separation of sources with identical non-stationarity properties. There are some recent works on non-stationary source separation (Barros & Cichocki, 2001; Choi, Cichocki, & Belouchrani, 2002; Choi, Cichocki, & Amari, 2002). Methods that exploit either the temporal structure of the sources (second-order correlations) and/or the non-stationary variance of the sources lead, in the simplest scenario, to the SOS BSS methods. In contrast to BSS methods based on HOS, the SOS-based methods do not have to infer the probability distributions of the sources or nonlinear activation (score) functions (see the next sections). This class of methods is referred to as spatio-temporal decorrelation (STD) techniques.

• The fourth approach exploits the various diversities[3] of the signals: typically time, frequency (spectral or "time coherence"), and/or time-frequency diversities or, more generally, joint space-time-frequency (STF) diversity. This approach leads to the concept of the Time-Frequency Component Analyzer (TFCA) (Cichocki & Amari, 2003). TFCA decomposes the signal into specific components in the time-frequency domain and computes the time-frequency representations (TFRs) of the individual components. Usually the components are interpreted here as localized, sparse, and structured signals in the time-frequency plane (e.g., a spectrogram). In TFCA, components are estimated by analyzing the time-frequency distributions of the observed signals and by suitable sparsification or decomposition of the components. TFCA provides an elegant and promising approach to suppressing some artifacts and interference via masking and/or multi-bandpass filtering of undesired components.

More sophisticated or advanced approaches use combinations or integration of some of the above-mentioned approaches (HOS, SOS, NS, and STF diversity) in order to separate or extract sources with various morphologies, structures, or statistical properties and to reduce the influence of noise and undesirable interference.
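As referenced in the first item of the list above, the following minimal sketch illustrates the HOS-based approach using FastICA from scikit-learn; the toolkit choice and the toy two-source mixture are our own assumptions, not prescribed by the text:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * np.pi * 5 * t),           # smooth sinusoid
          np.sign(np.sin(2 * np.pi * 2 * t))]  # square wave (non-Gaussian)
A = np.array([[1.0, 0.5], [0.4, 1.2]])
X = S @ A.T                                    # observed mixtures

ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
Y = ica.fit_transform(X)   # recovered sources, up to permutation and scale
A_hat = ica.mixing_        # estimated mixing matrix
```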

All the above-mentioned BSS methods belong to a wide class of unsupervised learning algorithms. Unsupervised learning algorithms try to discover the structure underlying a data set, extract meaningful features, and find useful representations of the given data. Since data can always be interpreted in many different ways, some knowledge is needed to determine which features or properties represent our true latent (hidden) components. For example, PCA finds a low-dimensional representation of the data that captures most of its variance. On the other hand, SCA tries to explain the data as a mixture of sparse components (usually in the time-frequency domain), and NMF seeks to explain the data by parts-based localized additive representations (with nonnegativity constraints).

Generalized component analysis algorithms, e.g., ICA, SCA, NMF, MMCA, STD, and SPCA, are often considered as pure mathematical formulas: powerful, but rather mechanical procedures. There is an illusion that there is not very much left for the user to do after the machinery has been optimally implemented. The successful and efficient use of such tools strongly depends on a priori knowledge, common sense, and the appropriate use of preprocessing and postprocessing tools.

[3] By diversities we usually mean different morphologies, characteristics, or features of the signals.


[Figure 4 diagram: raw data X → preprocessing (PCA, whitening, model reduction, denoising, filtering) → preprocessed data → BSS/BSE (ICA, SCA, NMF, TFCA) → components Y → postprocessing (FFT, TFR, sparsification, feature extraction, clustering) → reconstructed data.]

Figure 4. The three fundamental stages exploited in BSS/GCA for efficient separation, decomposition, and/or extraction of signals.

In other words, it is in the preprocessing of the data and the postprocessing of the models where expertise is truly needed in order to extract and identify physiologically significant and meaningful components. Typical preprocessing tools include Principal Component Analysis (PCA), Factor Analysis (FA), whitening, model reduction, filtering, the Fast Fourier Transform (FFT), Time-Frequency Representations (TFR), and sparsification (wavelet-packet transforms, DCT, curvelets, ridgelets) of the data (see Figure 4). Postprocessing tools include deflation and reconstruction ("cleaning") of the original raw data by removing undesirable components, noise, or artifacts (see Figure 5). On the other hand, the assumed linear mixing models must be valid at least approximately, and the original source signals should have the specified statistical properties (Cichocki & Amari, 2003; Amari & Cichocki, 1998; Cichocki, Amari, Siwek, Tanaka, et al., 2004).
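As an example of the preprocessing stage, the sketch below implements plain PCA whitening (eigendecomposition of the sample covariance, projection onto the leading eigenvectors, rescaling to unit variance); this is a generic textbook construction, not a specific algorithm from the paper:

```python
import numpy as np

def pca_whiten(X, n_components):
    """PCA whitening sketch: X is channels x samples."""
    Xc = X - X.mean(axis=1, keepdims=True)
    R = Xc @ Xc.T / Xc.shape[1]                # sample covariance matrix
    lam, V = np.linalg.eigh(R)                 # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:n_components] # keep the largest ones
    Q = np.diag(1.0 / np.sqrt(lam[idx])) @ V[:, idx].T
    return Q @ Xc, Q                           # whitened data, whitening matrix
```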

The main objective of this contribution is to propose extended BSS or GCA approaches which integrate or combine several different criteria in order to extract physiologically and neuroanatomically meaningful and plausible brain sources, and to present their potential applications for the analysis of EEG/MEG and fMRI data, especially for very early detection of Alzheimer's disease (AD) using EEG recordings.

Extended Blind Source Separation and Generalized Component Analysis Approach to the Preprocessing of EEG and MEG Recordings

A great challenge in neurophysiology is to assess non-invasively the physiological changes occurring in different parts of the brain. These activations can often be modeled and measured as neuronal brain source signals that indicate the function or malfunction of various physiological sub-systems. To extract the relevant information for diagnosis and therapy, expert knowledge is required not only in medicine and neuroscience but also in statistical signal processing. To understand human neurophysiology, we currently rely on several types of non-invasive neuroimaging techniques. These techniques include electroencephalography (EEG) and magnetoencephalography (MEG) (Makeig, Delorme, et al., 2004; Jahn, Cichocki, Ioannides, & Amari, 1999).

In recent years there has been great interest in applying high-density array EEG systems to analyze patterns and imaging of the human brain, since EEG has the desirable property of excellent time resolution.


This property, combined with other systems such as eye tracking and EMG (electromyography) and the relatively low cost of the instrumentation, makes it attractive for investigating the higher cognitive mechanisms in the brain and opens a unique window into the dynamics of human brain functions, as these systems are able to follow changes in neural activity on a millisecond time scale. In comparison, the other functional imaging modalities, e.g., positron emission tomography (PET) and functional magnetic resonance imaging (fMRI), are limited in temporal resolution to time scales on the order of, at best, one second by physiological and signal-to-noise considerations.

Brain sources are extremely weak, non-stationary signals, usually distorted by large noise, interference, and the ongoing activity of the brain. Moreover, they are mutually superimposed and low-pass filtered by the EEG recording systems (see Figure 1(b)). Besides classical signal analysis tools (such as adaptive supervised filtering, parametric or non-parametric spectral estimation, time-frequency analysis, and higher-order statistics), intelligent blind source separation techniques (IBSS) and generalized component analysis (GCA) can be used for analyzing brain data, especially for noise and artifact reduction and for the enhancement, detection, and estimation of neuronal brain source signals.

Determining the active regions of the brain, given EEG/MEG measurements on the scalp, is an important problem. A more accurate and reliable solution to this problem can give information about higher brain functions and patient-specific cortical activity. However, estimating the location and distribution of electric current sources within the brain from EEG/MEG recordings is an ill-posed problem, since there is no unique solution and the solution does not depend continuously on the data. The ill-posedness of the problem and the distortion of the sensor signals by large noise sources make finding a correct solution a challenging analytic and computational problem.

If one knows the positions and orientations of the sources in the brain, one can calculate the patterns of electric potentials or magnetic fields on the surface of the head. This is called the forward problem. If, conversely, one has only the patterns of electric potentials or magnetic fields at the scalp level, then one needs to calculate the locations and orientations of the sources. This is called the inverse problem. Inverse problems are notoriously more difficult to solve than forward problems. In this case, given only the electric potentials and magnetic fields on the scalp, there is no unique solution to the problem. The only hope is that there is some additional information available that can be used to constrain the infinite set of possible solutions to a single unique solution. This is where intelligent blind source separation can be used.

Every EEG electrode montage acts as a kind of spatial filter of cortical brain activity, and the BSS procedure can also be considered a spatial filter that attempts to cancel the effect of the superposition of various brain activities and, possibly, to estimate components that represent physiologically different processes (Makeig, Delorme, et al., 2004; Makeig, Debener, Onton, & Delorme, 2004; Delorme & Makeig, 2004).

BSS/GCA is a flexible and powerful approach for the elimination of artifacts and noise from EEG/MEG data and for the enhancement of neuronal brain sources. In fact, for these applications, ICA/BSS techniques have been successfully applied to efficiently remove artifacts and noise, including background brain activity, electrical activity of the heart, eye-blink and other muscle activity, and environmental noise (T. Jung et al., 2000; Vorobyov & Cichocki, 2002; Jahn et al., 1999; Miwakeichi et al., 2004). However, most of the methods require manual detection and classification of the interference components, and the estimation of the cross-correlation between independent components and reference signals corresponding to specific artifacts (T. Jung et al., 2000; Makeig, Debener, et al., 2004).


[Figure 5 diagram: sensor signals x_1, . . . , x_m → demixing system W (BSS/ICA) → components y_1, . . . , y_n → optional nonlinear adaptive filters (NAF) and decision blocks with on-line switching (0 or 1) → inverse system → reconstructed sensor signals.]

Figure 5. Basic model of deflation for removing undesirable components such as noise and artifacts and for enhancing multi-sensory (e.g., EEG/MEG) data. Often the estimated components are first filtered, normalized, ranked, ordered, and clustered in order to identify significant and physiologically meaningful sources or artifacts.


A conceptual model for the elimination of noise and other undesirable components from multi-sensory data is depicted in Figure 5. First, BSS is performed using a suitably chosen algorithm, robust with respect to noise, via a linear transformation of the sensory data, y(k) = W x(k), where the vector y(k) represents the specific components (e.g., sparse, smooth, spatio-temporally decorrelated, or statistically independent components). Then the interesting or useful components (e.g., independent activation maps) y_j(k) are projected back onto the sensors (electrodes). The corrected or "cleaned" sensor signals are obtained by the linear transformation[4] x̂(k) = W^{-1} ŷ(k), where W^{-1} is the inverse or pseudo-inverse of the unmixing matrix W and ŷ(k) is the vector obtained from y(k) by removing all the undesirable components (i.e., replacing them with zeros). The entries of the estimated attenuation (mixing) matrix A = W^{-1} indicate how strongly each electrode picks up each individual component. Back projection of the significant components, x̂(k) = W^{-1} ŷ(k), allows us not only to remove some artifacts and noise but also to enhance the EEG data.

[4] For simplicity, we assumed that m = n. In the more general case, for m > n, instead of the inverse matrix W^{-1} we use the Moore-Penrose generalized pseudo-inverse matrix W^+.
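A minimal sketch of this deflation step, assuming an unmixing matrix W has already been estimated by some BSS algorithm and that `keep` is a hypothetical index list of the components judged useful:

```python
import numpy as np

def clean_sensors(X, W, keep):
    """Zero out unwanted components and back-project to the sensor level."""
    Y = W @ X                          # components, y(k) = W x(k)
    Y_hat = np.zeros_like(Y)
    Y_hat[keep] = Y[keep]              # retain only the useful components
    return np.linalg.pinv(W) @ Y_hat   # cleaned sensors (W^+ covers m > n)
```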

In many cases the estimated components must first be filtered or smoothed in order to identify all significant components. Moreover, the EEG/MEG data can first be decomposed into useful signal and noise subspaces using standard techniques such as PCA or Factor Analysis (FA) (see the next sections). Next, we can apply BSS algorithms to decompose the observed signals (signal subspace) into specific components.

In addition to denoising and artifact removal, BSS/GCA techniques can be used to decompose EEG/MEG data into individual components, each possibly representing a physiologically distinct process or brain source. The main idea here is to apply localization and imaging methods to each of these components in turn. The decomposition is usually based on the underlying assumption of sparsity and/or statistical independence between the activations of the different cell assemblies involved. Alternative criteria for the decomposition are nonnegativity, spatio-temporal decorrelation, temporal predictability, or smoothness of the extracted components. The BSS approaches enable us to project each component (localized "brain source") onto an activation map at the skull level. For each activation map, we can apply an EEG/MEG source localization procedure, looking only for a single dipole (or brain source) per map. In order to localize a single component (say, a source s_j = y_j), we compute the sensor-space projection (at the scalp level) for source j:

x_j(k) = A D_j W x(k) = A D_j y(k) = a_j s_j(k), (3)

where D_j is a diagonal matrix with all entries zero except for a one in the j-th row and j-th column (j = 1, 2, . . . , n), and a_j is the j-th column of the estimated mixing matrix A = W^{-1}. It is worth noting that x_j(k) at each time instant k is uniquely represented by the fixed vector a_j scaled by s_j(k), so dipole-fitting algorithms will localize the j-th source to the same location independently of the time instant (under the assumption that A is fixed). The sensor-space back projection of each component can be used as input data to any localization algorithm for source modeling, e.g., the BESA algorithm. By localizing multiple sources independently, one by one, we can dramatically increase the likelihood of efficiently converging to correct and reliable solutions of the electromagnetic inverse problems.
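A small sketch of the rank-one projection in eq. (3), under the same assumption that some BSS algorithm has already supplied the unmixing matrix W:

```python
import numpy as np

def project_single_source(X, W, j):
    """Sensor-space projection of source j: x_j(k) = a_j s_j(k)."""
    A_hat = np.linalg.pinv(W)           # estimated mixing matrix, A = W^+
    s_j = (W @ X)[j]                    # time course of component j
    return np.outer(A_hat[:, j], s_j)   # rank-one map at the scalp level
```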

One of the biggest strengths of the BSS/GCA approach is that it offers a variety of powerful and efficient algorithms able to estimate various kinds of sources (sparse, nonnegative, statistically independent, spatio-temporally decorrelated, smooth, etc.). Some of the algorithms, e.g., AMUSE or TICA (Cichocki & Amari, 2003; S. A. Cruces, Cichocki, & Amari, 2004; S. A. Cruces & Cichocki, 2003), are able to automatically rank and order the components according to their complexity or sparseness measures. Some algorithms are quite robust with respect to noise (e.g., SOBI or SONS) (Belouchrani et al., 1997; Choi, Cichocki, & Amari, 2002; Choi, Cichocki, & Belouchrani, 2002; Choi, Cichocki, & Amari, 1998).

In real-world scenarios the latent (hidden) components (e.g., brain sources) have various complex properties and features. In other words, the true unknown sources are seldom all sparse, or all statistically independent, or all spatio-temporally decorrelated, etc. Therefore, if we apply only one single technique, such as ICA or SCA or STD, we usually fail to extract all the interesting components. We need rather to apply a fusion strategy, employing several suitably chosen criteria and associated algorithms to extract all the desired sources. For this reason, we have suggested using the algorithms in a cascade (sequential) or parallel mode in order to extract components with various features and statistical properties (Cichocki & Amari, 2003). In other words, we may apply two possible approaches. The most promising approach is sequential blind extraction (see Figure 6), in which we extract components one by one, applying a different criterion at each stage (e.g., statistical independence, sparseness, smoothness, etc.). In this way, we can sequentially extract different components with various statistical properties.

In an alternative approach, after suitable preprocessing, we perform several GCA procedures (ICA, SCA, MMCA, STD, TFCA) simultaneously (in parallel). Next, the estimated components are normalized, ranked, clustered, and compared to each other using some similarity measures (see Figure 7). Furthermore, the components are back-projected to the scalp level, and hypothetical brain sources are localized on the basis of the clusters of subcomponents. In this way, on the basis of statistics and a priori knowledge (e.g., information about external stimuli for event-related brain sources), we can identify sources with electrophysiological meaning and specific anatomical localizations.


Figure 6. Conceptual model of sequential blind source extraction. At each stage a different criterion can be used.


In summary, blind source separation and generalized component analysis (BSS/GCA) algorithms allow us to:

1. Extract and remove artifacts and noise from raw EEG/MEG data.
2. Recover neuronal brain sources activated in the cortex (especially in the auditory, visual, somatosensory, motor, and olfactory cortices).
3. Improve the signal-to-noise ratio (SNR) of evoked potentials (EPs), especially AEPs, VEPs, and SEPs.
4. Improve the spatial resolution of EEG and reduce the level of subjectivity involved in brain source localization.

The above-mentioned applications of BSS/GCA show special promise in the area of non-invasive human brain imaging techniques for delineating the neural processes that underlie human cognition and sensori-motor functions. These techniques lead to interesting and exciting new ways of investigating and analyzing brain data and of developing new hypotheses on how neural assemblies communicate and process information. The fundamental problems here are: What are the system's real properties, and how can we get information about them? What is valuable information in the observed data, and what is only noise or interference? How can the observed (sensor) data be transformed into features or components characterizing the brain sources in a reasonable way? These problems constitute an extensive research area, and the presented approaches still remain to be validated, at least experimentally, to obtain their full gain, especially for multi-sensory and multi-modal (EEG, MEG, fMRI, PET) data.


Figure 7. Parallel model of GCA employing a fusion strategy with various component analysis algorithms for the estimation of physiologically and neuroanatomically meaningful event-related brain sources. The reliability of the estimated sources or components should be analyzed by investigating the spread of the obtained components over many trials and, possibly, many subjects. Usually, the reliable and significant components correspond to small clusters that are well separated from the rest of the components, while unreliable components usually do not belong to any cluster.


Early Detection of Alzheimer's Disease Using Blind Source Separation and Generalized Component Analysis of EEG Data

Finding electrophysiological and neuroimaging markers of different diseases and psychiatric disorders is generating increasing interest (Jeong, 2004; Petersen et al., 2001). The series of studies we have undertaken has been an attempt to find such markers for the differential diagnosis of aging, especially for mild cognitive impairment (MCI) and Alzheimer's disease (AD), through BSS/GCA of EEG recordings (Cichocki et al., 2005; Vialatte et al., 2005).

Alzheimer’s disease (AD) is characterized by the degeneration of cortical neuronsand the presence amyloid plaques and neurofibrillary tangles. These pathological changes,which begin in the entorhinal cortex, gradually spread across the brain, destroying the hip-pocampus and neocortex. Recent advances in drug treatment for dementia, particularly theacetylcholinesterase inhibitors for AD, are most effective in early stages which are difficultto accurately diagnosis. Recent studies have demonstrated that AD has a pre-symptomaticphase, likely lasting years, during which neuronal degeneration is occurring but clinicalsymptoms are not clearly observable (Petersen, 2003; Jeong, 2004; DeKosky & Marek,2003). Since early diagnosis of AD may alter the outcome of the disease, the development


Since early diagnosis of AD may alter the outcome of the disease, the development of a practical diagnostic test that can detect the presence of the disease well before clinical symptoms appear is of increasing importance (Musha et al., 2002; Jelic et al., 2000; Petersen, 2003). A diagnostic device should be inexpensive, to make possible the screening of the many individuals who are at risk of developing this dangerous disease. The electroencephalogram (EEG) is one of the most promising candidates for such a device, since it is cheap and portable, and the noninvasive recording of EEG data is a safe and simple procedure. However, while many signal processing techniques have already been applied to reveal pathological changes in EEG associated with AD [see (Jeong, 2004; Petersen, 2003; Cichocki et al., 2005; Vialatte et al., 2005) for a review], the EEG-based detection of AD at its earliest stages is still not sufficiently reliable, and further improvements are necessary. Moreover, the efficiency of early detection of AD is lower for standard EEG compared with modern neuroimaging techniques (fMRI, PET, SPECT), and it is considered only an additional diagnostic tool (Wagner, 2000; DeKosky & Marek, 2003). This is why a number of more sophisticated or more advanced multistage data analysis and signal processing approaches with optimization should be applied to this problem.

Methods for Blind Recovery of Electrophysiological Markers of Alzheimer's Disease

To the best of our knowledge, no study to date has reported the application of BSS and GCA methods as preprocessing tools in AD diagnosis (Cichocki et al., 2005). Generally speaking, in our novel approach we first decompose the EEG data into suitable components (especially spatio-temporally decorrelated, independent, and/or sparse ones), next we rank them according to some measures (such as linear predictability, increasing complexity, or sparseness), and then we project to the scalp level only some significant and specific components, possibly of physiological origin, that may be brain-signal markers for dementia and, more specifically, for Alzheimer's disease (see Figure 8).

We believe that BSS/GCA methods are promising for discriminating dementia for the following reasons:

1) GCA and BSS algorithms allow us to extract and eliminate various artifacts, high-frequency noise, and interference. In this way, we can improve the signal-to-noise ratio and enhance some significant brain source signals.

2) BSS/GCA algorithms, especially those with equivariant and nonholonomic properties, allow us to extract extremely weak brain sources (Cichocki, Unbehauen, & Rummert, 1994; Cichocki & Amari, 2003), probably corresponding to sources located deep in the brain, e.g., in the entorhinal cortex and hippocampus.

One of the hallmarks of EEG abnormality in AD patients is a shift of the power spectrum to lower frequencies. It is a common consensus in the literature that the earliest EEG abnormalities are an increase in theta activity and a decrease in beta activity, which are followed by a decrease in alpha activity (Jeong, 2004; Musha et al., 2002; Jelic et al., 2000). Delta activity typically increases later, during the progression of the disease. This suggests that an increase in theta power and a decrease in beta and alpha power are markers for the subsequent rate of cognitive and functional decline in the early and mild stages of AD. However, these changes in the early stage of AD can be very small and distorted by many factors; thus, detecting them directly in raw EEG data is very difficult.


[Figure 8 diagram: EEG unit → raw EEG data → preprocessing (artifact removal, denoising, filtering, model reduction) → clean EEG data → BSS/BSE (ICA, SCA, NMF, TFCA; matrix W) splitting a signal subspace from a noise subspace → ranking and clustering of components → selection of significant markers → back projection (W^+) → enhanced EEG → feature extraction (wavelet TFR, sparse bump modeling) → classification (neural network) → diagnosis: AD/MCI vs. normal.]

Figure 8. Conceptual block diagram of the proposed preprocessing (enhancement and filtering) method for early diagnosis of AD. The main novelty lies in the suitable ordering of the components, the selection of only a few significant AD markers (components), and the back projection (deflation) of these components to the scalp level for further analysis and processing in the frequency or time-frequency domain using sparsification, bump modeling, and feature extraction. The components are selected via an optimization procedure.

Therefore, we exploit these properties on suitably filtered or enhanced data rather than on the raw EEG data. This filtering is performed using the BSS techniques described in the next sections. The basic idea can be formulated as filtering based on BSS/GCA: we select the most relevant or significant components and project them back to the scalp level (Cichocki et al., 2005; Vialatte et al., 2005).

In the strict sense, BSS means the estimation of the true (original) neuronal brain sources, while GCA means the decomposition of the EEG data into meaningful components; however, exactly the same procedure can be used to separate two or more subspaces of the signal without estimating the true sources (see the next section). One procedure currently becoming popular in EEG analysis is the removal of artifact-related independent components and the back projection of the independent components originating from the brain (T. Jung et al., 2000; Vorobyov & Cichocki, 2002). In this procedure, the components of brain origin are not required to be separated from each other exactly, because they are mixed again by the back projection after the artifact-related components have been removed.


By a similar procedure, we can also filter out the ongoing "brain noise" in a wider sense, improving the signal-to-noise ratio (SNR).

In one of our simplest recently developed procedures, we do not attempt to identify individual brain sources or physiologically meaningful components, but rather identify a whole group or cluster of components significant for AD (Cichocki et al., 2005). In other words, we divide the available EEG data into two subspaces: the brain-signal subspace and the "brain-noise" subspace. Finding a fundamental mechanism or principle for identifying significant and non-significant components is critical in our approach and, in general, will probably require more extensive studies. Here we attempt rather to differentiate clusters or subspaces of components with similar properties or features. In this simple approach, the estimation of all the individual components corresponding to separate and physiologically meaningful brain sources is not required, unlike in the other applications of BSS to EEG processing discussed in the previous section. The use of clusters of components can be especially beneficial when data from different subjects are compared: similarities among the individual components of different subjects are usually low, while subspaces formed by similar components are more likely to be sufficiently consistent. The differentiation of signal and noise subspaces with high and low amounts of diagnostically useful information can be made easier if the components are separated and sorted according to some criterion which, at least to some extent, correlates with the diagnostic value of the components. For this reason, we have applied the AMUSE BSS algorithm, which provides automatic ordering of components according to decreasing variance and, simultaneously, decreasing linear predictability (see Figure 9). The main advantage of AMUSE over other BSS algorithms was its highly reproducible components with respect to their ranking or ordering, also across subjects belonging to the same group of patients. This allows us to identify the significant components and optimize their number.

The proposed AMUSE BSS algorithm (see the next sections for details) performs a linear decomposition of the EEG signals into precisely ordered spatio-temporally decorrelated components that have the lowest possible complexity (in the sense of best linear predictability) (Stone, 2001; Cichocki & Amari, 2003). In other words, in the frequency domain the power spectra of the components have possibly distinct shapes.
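The following is a minimal sketch of the AMUSE idea as described here (whitening followed by the eigendecomposition of one symmetrized time-lagged covariance matrix); it is our own reconstruction for illustration, and the published algorithm includes further details such as robust whitening:

```python
import numpy as np

def amuse(X, tau=1):
    """AMUSE sketch for X (channels x samples): second-order BSS whose
    outputs come out ordered by the eigenvalues of the lagged covariance,
    i.e., roughly by decreasing linear predictability at lag tau."""
    Xc = X - X.mean(axis=1, keepdims=True)
    # 1) Whitening based on the zero-lag covariance
    lam, V = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    Q = np.diag(1.0 / np.sqrt(lam)) @ V.T
    Z = Q @ Xc
    # 2) Symmetrized time-lagged covariance of the whitened data
    R = Z[:, :-tau] @ Z[:, tau:].T / (Z.shape[1] - tau)
    R = 0.5 * (R + R.T)
    # 3) Eigenvectors of R give the final rotation; sort by eigenvalue
    ev, U = np.linalg.eigh(R)
    W = U[:, np.argsort(ev)[::-1]].T @ Q   # unmixing matrix
    return W @ Xc, W
```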

Artifact-free 20-second intervals of raw resting EEG recordings from 23 early-stage AD patients and 38 age-matched healthy controls were decomposed into spatio-temporally decorrelated components using the BSS algorithm AMUSE (see the next sections for details). At the time of the EEG recording, the AD patients had only memory impairment but no apparent loss in general cognitive, behavioral, or functional status. The MMSE (Mini-Mental Status Exam) score was 24 or higher at the time of recording, and the diagnosis was made during follow-up clinical evaluation within the next 12-18 months. The recordings were made with eyes closed in an awake resting condition (with vigilance control) using 21 electrodes placed according to the 10-20 international system, at a sampling frequency of 200 Hz (Musha et al., 2002).

We found through extensive experiments that the filtering and enhancement of the EEG data is best when we reject the components with the lowest variance and back-project only the first 5-7 spatio-temporally decorrelated components with the highest linear predictability and largest variance. The automatic sorting of components by the AMUSE algorithm makes it possible to do this simply and consistently for all subjects by removing the components whose index is higher than some chosen threshold (in our case, we projected the first six components for the 10-20 international EEG system).


Figure 9. Example of raw EEG (a) and its components separated with the AMUSE STD algorithm (b) for a patient with early-stage AD (EarlyAD15). AMUSE was applied to a 20 s artifact-free interval of EEG, but only 2 s are shown here. The scale for the components is arbitrary but linear. Note that the components are automatically ordered according to decreasing linear predictability (increasing complexity).

After back-projecting the six significant components to the scalp level, we performed a standard spectral analysis, based on the FFT of the reconstructed (filtered) EEG data, and applied linear discriminant analysis (LDA) (Cichocki et al., 2005) to combine the relative power in various frequency bands. The relative spectral power in each of six standard frequency bands, delta (1-4 Hz), theta (4-8 Hz), alpha 1 (8-10 Hz), alpha 2 (10-13 Hz), beta 1 (13-18 Hz), and beta 2 (18-22 Hz), was computed by dividing the band power by the power in the 1-22 Hz band. To reduce the number of variables used for classification, we averaged the band-power values over all 21 channels. The relative powers of the filtered data in the six bands were processed with the LDA. We found that the filtering (enhancement) method based on the AMUSE BSS approach increases the differentiability between early-stage AD patients and age-matched healthy controls, improves sensitivity and specificity considerably (from 12% to 18% in comparison with the standard approach without BSS filtering), and gives 86% overall correct classification accuracy.
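A sketch of this feature computation and classification step is given below; the Welch periodograms, the scikit-learn LDA class, and the hypothetical `features`/`labels` arrays are our own choices for illustration:

```python
import numpy as np
from scipy.signal import welch
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha1": (8, 10),
         "alpha2": (10, 13), "beta1": (13, 18), "beta2": (18, 22)}

def relative_band_powers(eeg, fs=200):
    """Six relative band powers, averaged over channels (eeg: channels x
    samples), each normalized by the total power in the 1-22 Hz band."""
    f, P = welch(eeg, fs=fs, nperseg=4 * fs)   # P: channels x frequencies
    total = P[:, (f >= 1) & (f <= 22)].sum(axis=1)
    return np.array([(P[:, (f >= lo) & (f < hi)].sum(axis=1) / total).mean()
                     for lo, hi in BANDS.values()])

# features: subjects x 6 array of band powers; labels: 1 = AD, 0 = control
# clf = LinearDiscriminantAnalysis().fit(features, labels)
```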



Figure 10. Plots of the average distribution of relative power over all 21 electrodes (Fp1 to Oz) for 23 AD patients (blue (dark) lines) and 38 age-matched healthy subjects (green (light) lines) for (a) the theta band (4-8 Hz) and (b) the beta band (13-22 Hz). Error bars indicate the standard deviation around the average values of the relative power. These plots indicate that several specific spatio-temporally decorrelated components (back-projected to the scalp level) in the low-frequency band (theta waves, 4-8 Hz, panel (a)) and the high-frequency band (beta waves, 13-22 Hz, panel (b)) show different magnitudes of relative power for AD patients and controls. In the case of the theta waves, these components have higher relative power for AD patients, while the same components for the beta waves have lower relative power in comparison with the normal age-matched healthy subjects. For delta waves (1-4 Hz) and alpha waves (8-13 Hz) we also observed some (but much smaller) differences in the magnitude of relative power between AD patients and the healthy subjects (not shown in this figure).

In particular, our computer simulations indicate that several specific spatio-temporally decorrelated components in the lower frequency range (theta waves, 4-8 Hz) and also in the higher frequency range (beta waves, 13-22 Hz) have substantially different magnitudes of relative power for early and mild-stage AD patients than for age-matched healthy subjects. In fact, the components in the theta band have a higher magnitude of relative power for AD patients, while the components in the beta band have a lower magnitude of relative power, in comparison with normal age-matched healthy subjects (see Figure 10).

There is obviously room for improvement and extension of the proposed method, both in the ranking and selection of the optimal (significant) components and in the apparatus and postprocessing used to perform the classification task. In particular, we can apply the variety of BSS/GCA methods discussed in this chapter. We are currently investigating several alternative, more sophisticated, and even more promising approaches in which we employ a family of BSS algorithms (rather than the single AMUSE BSS algorithm), as explained in the previous section, in order to extract from the raw EEG data significant components whose time courses and electrode projections correspond to neurophysiologically and neuroanatomically meaningful activations of separate brain regions. We believe that such an approach enables us not only to extract various noise and artifacts from the neuronal signals but also, under favorable circumstances, to estimate physiologically and functionally distinct neuronal brain sources.


Some of them may represent local field activities of the human brain and could be significant markers for the early stage of AD. To confirm these hypotheses, we need more experimental and simulation work. Nevertheless, by virtue of separating the various neuronal and noise/artifact sources, BSS/GCA techniques offer at least a way of improving the effective signal-to-noise ratio (SNR) and enhancing some brain patterns.

Furthermore, instead of standard linear discriminant analysis (LDA), we can use neural network or support vector machine (SVM) classifiers. We expect to obtain even better discrimination and sensitivity if we apply these methods.

Moreover, classification can probably be further improved by supplementing the set of spectral power measures that we used with quite different indices, such as coherence (Jelic et al., 2000) or alpha dipolarity, a new index depending on the prevalence of local vs. distributed sources of EEG alpha activity, which was shown to be very sensitive to mild AD (Musha et al., 2002).

An additional attractive but still open issue is that, using the proposed BSS/GCA approach, we may be able not only to detect but also to measure in a consistent way the progression of AD and the influence of medications. Another important open issue is to establish the relations of the extracted AD-sensitive EEG components to currently existing neuroimaging, genetic, and biochemical markers (Jelic et al., 2000). The proposed method can also be a potentially useful and effective tool for the differential diagnosis of AD from other types of dementia and, possibly, for the diagnosis of other mental diseases. In particular, the possibility of differential diagnosis of AD from vascular dementia (VaD) would be very important. For these purposes, more studies are needed to assess the usefulness and efficiency of the available and future blind source separation and generalized component analysis methods for the enhancement/filtering and extraction of latent EEG/MEG, fMRI, and PET components.

Principal Component Analysis - Signal and Noise Subspaces

PCA is perhaps one of the oldest and best-known techniques in multivariate analysis and data mining. It was introduced by Pearson, who used it in a biological context, and was further developed by Hotelling in work on psychometry. PCA was also derived independently by Karhunen in the context of probability theory and was subsequently generalized by Loeve [see for example (Cichocki, Kasprzak, & Skarbek, 1996; Rosipal, Girolami, Trejo, & Cichocki, 2001; Wang, Lee, Fiori, Leung, & Zhu, 2003) and references therein]. The purpose of principal component analysis (PCA) is to derive a relatively small number of uncorrelated linear combinations (principal components) of a set of random zero-mean variables while retaining as much of the information from the original variables as possible. The principal components (PCs), i.e., the directions along which the input data have the largest variances, are usually regarded as important or significant, while the components with the smallest variances, called minor components (MCs), are usually regarded as unimportant or associated with noise. However, in some applications the MCs are of the same importance as the PCs, for example, in curve and surface fitting or total least squares (TLS) problems (Cichocki & Unbehauen, 1994).

PCA can be converted to the eigenvalue problem of the covariance matrix of x(k), and it is essentially equivalent to the Karhunen-Loeve transform used in image and signal processing. In other words, PCA is a technique for computing the eigenvectors and


eigenvalues of the estimated covariance matrix

R_{xx} = E{x(k) x^T(k)} = V Λ V^T ∈ R^{m×m},   (4)

where Λ = diag{λ_1, λ_2, ..., λ_m} is a diagonal matrix containing the m eigenvalues, and V = [v_1, v_2, ..., v_m] ∈ R^{m×m} is the corresponding orthogonal (or unitary) matrix consisting of the unit-length eigenvectors, referred to as the principal eigenvectors. PCA can also be performed via the singular value decomposition (SVD) of the batch data matrix X = [x(1), x(2), ..., x(N)].

The Karhunen-Loeve transform determines a linear transformation of an input vector x as

y_P = V_S^T x,   (5)

where x = [x_1(k), x_2(k), ..., x_m(k)]^T is the zero-mean input vector, y_P = [y_1(k), y_2(k), ..., y_n(k)]^T (n < m) is the output vector, called the vector of principal components (PCs), and V_S = [v_1, v_2, ..., v_n] ∈ R^{m×n} is the set of signal subspace eigenvectors, with orthonormal vectors v_i = [v_{i1}, v_{i2}, ..., v_{im}]^T, i.e., v_i^T v_j = δ_{ij} for j ≤ i, where the Kronecker delta δ_{ij} equals 1 for i = j and zero otherwise. The vectors v_i (i = 1, 2, ..., n) are eigenvectors of the covariance matrix, while the variances of the PCs y_i(k) = v_i^T x(k) are the corresponding principal eigenvalues. On the other hand, the (m − n) minor components are given by

y_M = V_N^T x,   (6)

where V_N = [v_{n+1}, v_{n+2}, ..., v_m] consists of the (m − n) eigenvectors associated with the smallest eigenvalues.

An important problem arising in many application areas is the determination of the dimension of the signal and noise subspaces. In other words, a central issue in PCA is choosing the number of principal components to be retained. To solve this problem, we usually exploit a fundamental property of PCA: it projects the input data x(k) from their original m-dimensional space onto an n-dimensional output subspace y(k) (typically with n ≪ m), thus performing a dimensionality reduction which retains most of the intrinsic information in the input data vectors. In other words, the principal components y_i(k) = v_i^T x(k) are estimated in such a way that, for n < m, although the dimensionality of the data is strongly reduced, the most relevant information is retained, in the sense that the original input data x can be reconstructed from the output data (signals) y by the transformation x̂ = V_S y, which minimizes a suitable cost function. A commonly used criterion is the minimization of the mean squared error ‖x − V_S V_S^T x‖_2^2.

PCA enables us to divide the observed (measured) sensor signals x(k) = x_s(k) + ν(k) into two subspaces: the signal subspace, corresponding to the principal components associated with the largest eigenvalues λ_1, λ_2, ..., λ_n (m > n), called the principal eigenvalues, and the associated eigenvectors V_S = [v_1, v_2, ..., v_n], called the principal eigenvectors; and the noise subspace, corresponding to the minor components associated with the eigenvalues λ_{n+1}, ..., λ_m. The subspace spanned by the first n eigenvectors v_i can be considered as an approximation of the noiseless signal subspace. One important advantage of this approach is that it not only enables a reduction of the noise level, but also allows us to estimate the number of sources on the basis of the distribution of the eigenvalues. However, a problem arising from this approach is how to correctly set or estimate the threshold which divides the eigenvalues into the two subspaces, especially when the noise is large (i.e., the SNR is low).
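To make the above procedure concrete, the following minimal numpy sketch estimates the covariance matrix (4), splits its eigenvectors into signal and noise subspaces, and projects the data as in (5)-(6). The threshold rule (retaining eigenvalues above a fixed fraction of the largest one) is only one simple heuristic for the open threshold-selection problem mentioned above; the function and parameter names are our own assumptions, not taken from the text.

```python
import numpy as np

def pca_subspaces(X, rel_threshold=1e-2):
    """Split zero-mean data X (m channels x N samples) into signal/noise subspaces.

    rel_threshold is a heuristic (assumption): eigenvalues larger than
    rel_threshold * lambda_max are assigned to the signal subspace.
    """
    m, N = X.shape
    Rxx = (X @ X.T) / N                    # estimated covariance matrix, Eq. (4)
    lam, V = np.linalg.eigh(Rxx)           # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]         # reorder: largest eigenvalues first
    n = int(np.sum(lam > rel_threshold * lam[0]))  # estimated signal dimension
    Vs, Vn = V[:, :n], V[:, n:]            # principal / minor eigenvectors
    yP = Vs.T @ X                          # principal components, Eq. (5)
    yM = Vn.T @ X                          # minor (noise) components, Eq. (6)
    X_denoised = Vs @ yP                   # projection onto the signal subspace
    return yP, yM, X_denoised, n
```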


Probabilistic PCA - Expectation Maximization Algorithm

Let us assume that we model the vector x(k) ∈ R^m as

x(k) = A s(k) + ν(k),   (7)

where A ∈ R^{m×n} is a full column rank mixing matrix with m > n, representing the factor loadings, s(k) ∈ R^n is a vector of zero-mean Gaussian sources with nonsingular covariance matrix R_{ss} = E{s(k) s^T(k)}, and ν(k) ∈ R^m is a vector of zero-mean i.i.d. Gaussian noise with covariance matrix R_{νν} = σ_ν^2 I_m; furthermore, the random vectors s(k) and ν(k) are uncorrelated.

The model given by (7) is often referred to as probabilistic PCA (PPCA) and has been introduced in the machine learning context. Moreover, such a model can also be considered as a special form of factor analysis (FA) with isotropic noise; the only difference is that in FA the noise covariance matrix can in general be an arbitrary positive definite diagonal matrix. Under the above assumptions, the maximum likelihood estimator of the matrix A in the linear generative model (7) is given by

A_{ML} = V_S [Λ_S − σ_ν^2 I_n]^{1/2} R,   (8)

where R ∈ R^{n×n} is an arbitrary orthogonal rotation matrix, so A_{ML} estimates only a rotated version of the principal eigenvectors.⁵

In order to estimate A and S = [s(1), ..., s(N)], we can formulate the following cost function:

J(A, S) = ‖X − A S‖_F^2.   (9)

By minimizing the above cost function using the alternating least squares (ALS) or expectation maximization (EM) principle, we obtain the EM-PCA algorithm for the noiseless model (σ_ν^2 = 0):

E-step:  S = (A^T A)^{−1} A^T X,   (10)
M-step:  A = X S^T (S S^T)^{−1}.   (11)

For noisy data, the EM-PCA algorithm takes the following form:

M = (A^T A + σ_ν^2 I_n)^{−1},   (12)
S = M A^T X,   (13)
A = X S^T (S S^T + N σ_ν^2 M)^{−1},   (14)
σ_ν^2 = (1/(N m)) tr(X X^T − A S X^T).   (15)

For noisy data, the EM-PCA algorithm thus allows us to estimate not only the matrices A and S but also the variance σ_ν^2 of the additive Gaussian noise.

⁵The true principal eigenvectors can be recovered when the columns of R^T are equal to the eigenvectors of the matrix A A^T.
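As a concrete illustration of the alternating updates (10)-(11), the following numpy sketch implements the noiseless EM/ALS-PCA iteration; the random initialization, fixed iteration count and function name are our own choices, not prescribed by the text.

```python
import numpy as np

def em_pca(X, n, n_iter=100, seed=0):
    """Noiseless EM/ALS-PCA, Eqs. (10)-(11): X (m x N) ~ A S with A (m x n)."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    A = rng.standard_normal((m, n))              # random initialization (assumption)
    for _ in range(n_iter):
        S = np.linalg.solve(A.T @ A, A.T @ X)    # E-step, Eq. (10)
        A = X @ S.T @ np.linalg.inv(S @ S.T)     # M-step, Eq. (11)
    return A, S
```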


Figure 11. Sequential extraction of principal components (Cichocki & Amari, 2003).

Sequential Method for PCA

One of the simplest and most intuitively understandable approaches to the derivation of adaptive algorithms for the extraction of true principal components is based on self-association, also called self-supervision or the replicator principle (Cichocki & Unbehauen, 1993, 1994). According to this approach, we first compress the data vector x(k) to a single variable y_1(k) = v_1^T x(k), and next we attempt to reconstruct the original data from y_1(k) by using the transformation x̂(k) = v_1 y_1(k). Let us assume that we wish to extract the principal components (PCs) sequentially, by employing the self-supervision principle (replicator) and a cascade (hierarchical) neural network architecture (Cichocki & Unbehauen, 1994). Let us consider a simple processing unit (see Figure 11),

y_1(k) = v_1^T x(k) = Σ_{j=1}^m v_{1j} x_j(k),   (16)

which extracts the first principal component, with λ_1 = E{y_1^2(k)}. Strictly speaking, the factor y_1(k) is called the first principal component of x(k) if the variance of y_1(k) is maximally large under the constraint that the principal vector v_1 has unit length.

The vector v_1 = [v_{11}, v_{12}, ..., v_{1m}]^T should be determined in such a way that the reconstruction vector x̂(k) = v_1 y_1(k) reproduces (reconstructs) the input training vectors x(k) as correctly as possible, according to the cost function J(v_1) = E{‖e_1(k)‖_2^2} ≈ Σ_{k=1}^N ‖e_1(k)‖_2^2, where e_1(k) = x(k) − v_1 y_1(k). In general, the loss (cost) function is expressed as

J_i(v_i) = Σ_{k=1}^N ‖e_i(k)‖_2^2,   (17)

where

e_i = x_{i+1} = x_i − v_i y_i,   y_i = v_i^T x_i,   x_1(k) = x(k).

In order to increase the convergence speed, we can minimize the cost function (17) by employing the recursive least-squares (RLS) approach for optimal updating of the learning rate η_i (Cichocki & Unbehauen, 1993; Cichocki et al., 1996; Wang et al., 2003):

x_1(k) = x(k),   (18)
y_i(k) = v_i^T(k) x_i(k)   (i = 1, 2, ..., n),   (19)
v_i(k+1) = v_i(k) + (y_i(k)/η_i(k)) [x_i(k) − y_i(k) v_i(k)],   (20)
η_i(k+1) = η_i(k) + |y_i(k)|^2,   (21)
x_{i+1}(k) = x_i(k) − y_i(k) v_i^*,   (22)
η_i(0) = 2 max_k ‖x_i(k)‖_2^2 = 2 x_{i,max},   (23)
v_{i+1}(0) = v_i^* − [v_1^*, ..., v_i^*][v_1^*, ..., v_i^*]^T v_i^*,   (24)

where v_i^* denotes the vector v_i(k) after convergence. The above algorithm is fast and accurate for sequentially extracting an arbitrary number of eigenvectors in PCA.
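The following numpy sketch illustrates the sequential extraction procedure (18)-(24). For simplicity, each new vector is initialized from the deflated data rather than via (24), and the number of sweeps over the data is our own choice; both are assumptions, not part of the algorithm as stated.

```python
import numpy as np

def sequential_pca(X, n, n_sweeps=20):
    """Sequential extraction of n principal eigenvectors via the RLS-type
    updates (18)-(24). X is m x N and assumed zero-mean."""
    m, N = X.shape
    Xi = X.copy()                                  # x_1(k) = x(k), Eq. (18)
    V = np.zeros((m, n))
    for i in range(n):
        v = Xi[:, 0] / (np.linalg.norm(Xi[:, 0]) + 1e-12)  # init (assumption)
        eta = 2.0 * np.max(np.sum(Xi**2, axis=0))          # Eq. (23)
        for _ in range(n_sweeps):
            for k in range(N):
                x = Xi[:, k]
                y = v @ x                                  # Eq. (19)
                v = v + (y / eta) * (x - y * v)            # Eq. (20)
                eta = eta + y**2                           # Eq. (21)
        v = v / np.linalg.norm(v)
        V[:, i] = v
        Xi = Xi - np.outer(v, v @ Xi)                      # deflation, Eq. (22)
    return V
```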

Sparse PCA

The importance of PCA is mainly due to the following three properties:

1. Principal components sequentially capture the maximum variability (variance) of the data matrix X, thus guaranteeing minimal information loss in the mean squared error sense.

2. Principal components are uncorrelated, i.e., E{y_i y_j} = δ_{ij} λ_j (i, j = 1, 2, ..., n).

3. The PCs are hierarchically organized with respect to decreasing values of their variances (the eigenvalues of the covariance matrix).

On the other hand, standard PCA has several disadvantages. A particular disadvantage that is our focus here is that the standard principal components y_i(k) = v_i^T x = Σ_{j=1}^m v_{ij} x_j(k) are usually linear combinations of all the variables x_j(k). In other words, all the weights v_{ij} (referred to as loadings) are nonzero. This means that the principal vectors v_i are dense (not sparse), which often makes physical interpretation of the principal components difficult in some applications (Zou et al., 2006). For example, in many applications (from biology to image understanding) the coordinate axes have physical interpretations (each axis might correspond to a specific feature), but only if the components are sparsely represented, i.e., by very few variables with nonzero loadings (coordinates). Recently, several modifications of PCA have been proposed which impose some sparseness on the principal (basis) vectors; the corresponding components are called sparse principal components (SPCA) (Chenubhotla, 2004; Zou et al., 2006). The main idea in SPCA is to force the basis vectors to be sparse; however, the sparsity profile should be adjustable or well controlled via some parameter, in order to discover specific features in the observed data. In other words, our objective is to estimate sparse principal components, i.e., sets of sparse vectors v_i spanning a low-dimensional space that represents most of the variance present in the data X.

The cost function (17) for standard PCA can easily be modified to perform SPCA (using, for example, the concept of a Laplacian prior) as follows:

J_i(v_i) = Σ_{k=1}^N ‖x_i(k) − v_i y_i(k)‖_2^2 + α ‖v_i‖_1,   (25)

where x_{i+1} = x_i − v_i y_i, y_i = v_i^T x_i, x_1(k) = x(k), and α ≥ 0 is a parameter which controls the sparsity. The first term in the above cost function estimates the PC eigenvectors, while the second (penalty) term promotes sparsity. As α is increased, the sparsity of the SPCA basis vectors also increases, at the cost of a slight decrease in the variance captured by the SPCA and a small correlation between the components. The presence of correlation implies that the reconstruction errors e_i = x_{i+1} of the SPCA will not be optimal (in the mean squared error sense). However, typically, achieving 50-70% sparsity in the basis vectors increases the error by less than 2%.
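As an illustration of the deflation-based SPCA cost (25), the following numpy sketch alternates between computing the scores and soft-thresholding the loadings, in the spirit of thresholded power iteration; the initialization from the dense principal vector and the unit-length renormalization (with the scale absorbed into the scores) are our own choices for this sketch, not details given in the text.

```python
import numpy as np

def sparse_pc(X, alpha, n_iter=100):
    """One sparse principal vector via alternating minimization of Eq. (25):
    for fixed scores y(k) = v^T x(k), the l1-penalized least-squares step
    for v has a closed form given by soft thresholding."""
    m, N = X.shape
    v = np.linalg.eigh((X @ X.T) / N)[1][:, -1]   # start from the dense PC
    for _ in range(n_iter):
        y = v @ X                                  # scores for the current v
        g = X @ y                                  # correlation of data with scores
        v = np.sign(g) * np.maximum(np.abs(g) - alpha / 2.0, 0.0)  # soft threshold
        if np.linalg.norm(v) > 0:
            v = v / np.linalg.norm(v)              # unit length (assumption)
    return v
```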

An alternative approach to performing SPCA is to apply a regression-type minimization with Lasso constraints (Zou et al., 2006):

J(V, W) = Σ_{k=1}^N ‖x(k) − W V^T x(k)‖_2^2 + α Σ_{i=1}^n ‖v_i‖_1,   subject to W^T W = I_n,   (26)

where V = [v_1, v_2, ..., v_n] ∈ R^{m×n} and W ∈ R^{m×n}. The above optimization problem can be solved using standard procedures, for example, semidefinite programming (SDP).

In contrast to standard PCA, SPCA often enables us to reveal multi-scale hierarchical structures in the data (Zou et al., 2006; Chenubhotla, 2004). For example, for EEG/MEG data, SPCA generates spatially localized, narrow band-pass functions as basis vectors, thereby achieving a joint space and frequency representation, which is impossible using standard PCA.

Blind Source Separation Based on Spatio-Temporal Decorrelation

Temporal, spatial and spatio-temporal⁶ decorrelations play important roles in EEG/MEG data analysis. These techniques are based only on second-order statistics (SOS). They are the basis for modern subspace methods of spectrum analysis and array processing, and are often used in a preprocessing stage in order to improve the convergence properties of adaptive systems, to eliminate redundancy, or to reduce noise. Spatial decorrelation or prewhitening is often considered a necessary (but not sufficient) condition for the stronger stochastic independence criteria. After prewhitening, the BSS or ICA tasks usually become somewhat easier and better posed (less ill-conditioned), because the subsequent separating (unmixing) system is described by an orthogonal matrix for real-valued signals and weights, and by a unitary matrix for complex-valued signals and weights. Furthermore, spatio-temporal and time-delayed decorrelation can be used to identify the mixing matrix and to perform blind source separation of colored sources under certain weak conditions (Cichocki & Amari, 2003).

AMUSE Algorithm and its Properties

The AMUSE algorithm belongs to the group of second-order statistics spatio-temporal decorrelation (SOS-STD) algorithms. It provides an identical, or at least very similar, decomposition of raw data as the well-known and popular SOBI and TDSEP algorithms (Belouchrani et al., 1997; Ziehe et al., 2000). This class of algorithms is sometimes classified or referred to as ICA algorithms. However, these algorithms do not exploit, implicitly or explicitly, statistical independence. Moreover, in contrast to the standard higher-order statistics ICA algorithms, they are able to estimate colored Gaussian-distributed sources; on the other hand, they are able to estimate the original sources only if the sources have temporal structure.

⁶Literally, space and time. Spatio-temporal data has both a spatial (i.e., location) and a temporal (i.e., time-related) component.

The AMUSE algorithm has some similarity with standard PCA. The main difference is that AMUSE employs PCA twice (in a cascade) in two separate steps: in the first step, standard PCA is applied to whiten (sphere) the data, and in the second step SVD/PCA is applied to a time-delayed covariance matrix of the pre-whitened data. Mathematically, the AMUSE algorithm is the following two-stage procedure. In the first step, we apply standard or robust prewhitening (sphering) as the linear transformation x_1(k) = Q x(k), where Q = R_{xx}^{−1/2} = (V Λ V^T)^{−1/2} = V Λ^{−1/2} V^T is computed from the standard covariance matrix R_{xx} = E{x(k) x^T(k)}, and x(k) is the vector of observed data at time instant k. In the second step (for the pre-whitened data), SVD is applied to the time-delayed covariance matrix R_{x_1 x_1} = E{x_1(k) x_1^T(k−1)} = U Σ V^T, where Σ is a diagonal matrix with decreasing singular values and U, V are orthogonal matrices of left and right singular vectors. Then, the unmixing (separating) matrix is estimated as W = U^T Q (Cichocki & Amari, 2003).

The main advantage of the AMUSE algorithm in comparison to other BSS/ICA algorithms is that it orders the components automatically, due to the application of SVD (singular value decomposition). In fact, the components are ordered according to the decreasing singular values of the time-delayed covariance matrix. In other words, the AMUSE algorithm exploits the simple principle that the estimated components tend to be less complex or, more precisely, to have better linear predictability than any mixture of the sources. It should be emphasized that all components estimated by AMUSE are uniquely defined and consistently ranked; the consistent ranking is due to the fact that the singular values are always ordered decreasingly.

The main disadvantage of the AMUSE algorithm is that its performance is relatively sensitive to additive noise, since the algorithm exploits only one time-delayed covariance matrix. To alleviate this problem, we can alternatively use the SOBI algorithm, which approximately jointly diagonalizes up to hundreds of time-delayed covariance matrices with various time delays simultaneously (Belouchrani et al., 1997; Belouchrani & Cichocki, 2000).
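The two-stage AMUSE procedure described above can be summarized in a short numpy sketch. The symmetrization of the delayed covariance matrix and the small regularization constant are common practical safeguards and are our additions, not part of the algorithm as stated in the text.

```python
import numpy as np

def amuse(X, tau=1):
    """AMUSE: prewhitening followed by SVD of a time-delayed covariance
    matrix; returns the separating matrix W and components ranked by the
    singular values."""
    X = X - X.mean(axis=1, keepdims=True)
    m, N = X.shape
    # Step 1: whitening x1 = Q x with Q = Rxx^{-1/2}
    lam, V = np.linalg.eigh((X @ X.T) / N)
    Q = V @ np.diag(1.0 / np.sqrt(lam + 1e-12)) @ V.T
    X1 = Q @ X
    # Step 2: SVD of the time-delayed covariance R_{x1 x1}(tau)
    R_tau = (X1[:, tau:] @ X1[:, :-tau].T) / (N - tau)
    R_tau = 0.5 * (R_tau + R_tau.T)   # symmetrization (assumption, for robustness)
    U, s, _ = np.linalg.svd(R_tau)
    W = U.T @ Q                        # separating matrix; components ranked by s
    return W, W @ X, s
```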

Blind Source Extraction Using Linear Predictability and Adaptive Band-Pass Filters

There are two main approaches to solving the problem of blind source separation. The first approach, which was mentioned briefly in the previous section, is to separate all sources simultaneously. In the second, we extract the sources sequentially in a blind fashion, one by one, rather than separating them all simultaneously. In many applications, a large number of sensors (electrodes, microphones or transducers) are available, but only a very few source signals are of interest. For example, with modern high-density array EEG or MEG devices we typically record more than 100 sensor signals, but only a few source signals are interesting; the rest can be considered as interfering noise. In another example, the cocktail party problem, it is usually essential to extract the voices of specific persons rather than to separate all the source signals of all speakers available (in mixed form) from an array of microphones. For such applications it is essential to develop and apply reliable, robust and effective learning algorithms which enable us to extract only the small number of source signals that are potentially interesting and contain useful information.

Several different models and criteria can be used here. The most frequently used criterion is based on higher-order statistics (HOS), which assumes that the sources are mutually statistically independent and non-Gaussian (at most one may be Gaussian). As independence criteria, we can use measures of non-Gaussianity (Cichocki & Amari, 2003).

An alternative criterion, based on the concept of linear predictability, assumes that the source signals have some temporal structure, i.e., the sources are colored with different autocorrelation functions or, equivalently, have different spectral shapes. In this approach, we exploit the temporal structure of the signals rather than their statistical independence (Cichocki & Thawonmas, 2000; Stone, 2001; Barros & Cichocki, 2001). Intuitively speaking, source signals with temporal structure s_j have less complexity than the mixed sensor signals x_j. In other words, the degree of temporal predictability of any source signal is higher than (or equal to) that of any mixture. For example, the waveform of a mixture of two sine waves with different frequencies is more complex and less predictable than either of the original sine waves. This means that, by applying the standard linear predictor model and minimizing the mean squared prediction error, which is a measure of predictability, we can separate or extract signals with different temporal structures. More precisely, by minimizing the error, we maximize a measure of temporal predictability for each recovered signal (Cichocki, Thawonmas, & Amari, 1997; Cichocki, Rutkowski, & Siwek, 2002).

It is worth noting that the two basic criteria used in BSE, temporal linear predictability and non-Gaussianity based on kurtosis, may lead to different results. Temporal predictability forces the extracted signal to be smooth and of the lowest possible complexity, while the non-Gaussianity measure forces the extracted signals to be as independent as possible, with a sparse representation for sources that have positive kurtosis.

Let us assume that the temporally correlated source signals are modeled by autoregressive (AR) processes (see Figure 12) as

s_j(k) = s̃_j(k) + Σ_{p=1}^L a_{jp} s_j(k−p) = s̃_j(k) + A_j(z) s_j(k),   (27)

where A_j(z) = Σ_{p=1}^L a_{jp} z^{−p}, z^{−p} s(k) = s(k−p), and s̃_j(k) are i.i.d. unknown innovation processes. In practice, the AR model can be extended to more general models, such as the autoregressive moving average (ARMA) model or the hidden Markov model (HMM) (Cichocki & Amari, 2003; Hyvarinen et al., 2001; Amari, Hyvarinen, Lee, Lee, & Sanchez, 2002).

For ill-conditioned problems (when the mixing matrix is ill-conditioned and/or the source signals have different amplitudes), we can apply optional preprocessing (prewhitening) to the sensor signals x in the form

x_1 = Q x,

where Q ∈ R^{n×m} is a decorrelation matrix ensuring that the auto-correlation matrix R_{x_1 x_1} = E{x_1 x_1^T} = I_n is an identity matrix. To model the temporal structure of the source signals, we consider a linear processing unit with an adaptive filter with transfer function B_1(z) (which estimates one A_j(z)), as illustrated in Figure 12.


Figure 12. Block diagram illustrating the implementation of the learning algorithm for blind extraction of a temporally correlated source (Cichocki & Amari, 2003).

Let us assume, for simplicity, that we want to extract only one source signal, e.g., s_j(k), from the available sensor vector x(k). For this purpose, we employ a single processing unit described by (see Figure 13):

y_1(k) = w_1^T x(k) = Σ_{j=1}^m w_{1j} x_j(k),   (28)

ε_1(k) = y_1(k) − Σ_{p=1}^L b_{1p} y_1(k−p) = w_1^T x(k) − b_1^T ỹ_1(k),   (29)

where w_1 = [w_{11}, w_{12}, ..., w_{1m}]^T, ỹ_1(k) = [y_1(k−1), y_1(k−2), ..., y_1(k−L)]^T, b_1 = [b_{11}, b_{12}, ..., b_{1L}]^T, and B_1(z) = Σ_{p=1}^L b_{1p} z^{−p} is the transfer function of the corresponding FIR filter. It should be noted that the FIR filter can have a sparse representation; in particular, only a single tap, e.g., with delay p and b_{1p} ≠ 0, can be used instead of L parameters. The processing unit has two outputs: y_1(k), which estimates the extracted source signal, and ε_1(k), which represents the linear prediction error, or an estimator of the innovation, obtained after passing the output signal y_1(k) through the FIR filter.

Our objective is to estimate the optimal values of the vectors w_1 and b_1 in such a way that the processing unit successfully extracts one of the sources. This is achieved if the global vector defined as g_1 = A^T w_1 = (w_1^T A)^T = c_j e_j contains only one nonzero element, e.g., in the j-th row, such that y_1(k) = c_j s_j(k), where c_j is an arbitrary nonzero scaling factor. For this purpose, we reformulate the problem as the minimization of the cost function

J(w_1, b_1) = E{ε_1^2}.   (30)


Figure 13. The neural network structure of a single extraction unit using a linear predictor.

The main motivation for applying such a cost function is the assumption that the primary source signals (signals of interest) have temporal structure and can be modeled, e.g., by an autoregressive model (Cichocki & Amari, 2003; Barros & Cichocki, 2001; H.-Y. Jung & Lee, 2000).

According to the AR model of the source signals, the filter output can be represented as ε_1(k) = y_1(k) − ŷ_1(k), where ŷ_1(k) = Σ_{p=1}^L b_{1p} y_1(k−p) is the prediction of y_1(k), so that ε_1(k) is an estimator of the innovation s̃_j(k). The mean squared error E{ε_1^2(k)} achieves its minimum c_1^2 E{s̃_j^2(k)}, where c_1 is a positive scaling constant, if and only if y_1 = ±c_1 s_j for some j ∈ {1, 2, ..., n}, or y_1 = 0 holds.

Let us consider the processing unit shown in Figure 13. The associated cost function (30) can be evaluated as follows:

E{ε_1^2} = w_1^T R̂_{x_1 x_1} w_1 − 2 w_1^T R̂_{x_1 ỹ_1} b_1 + b_1^T R̂_{ỹ_1 ỹ_1} b_1,   (31)

where R̂_{x_1 x_1} ≈ E{x_1 x_1^T}, R̂_{x_1 ỹ_1} ≈ E{x_1 ỹ_1^T} and R̂_{ỹ_1 ỹ_1} ≈ E{ỹ_1 ỹ_1^T} are estimators of the true correlation and cross-correlation matrices R_{x_1 x_1}, R_{x_1 ỹ_1}, R_{ỹ_1 ỹ_1}, respectively. In order to estimate the vectors w_1 and b_1, we evaluate the gradients of the cost function and set them to zero as follows:

∂J_1(w_1, b_1)/∂w_1 = 2 R_{x_1 x_1} w_1 − 2 R_{x_1 ỹ_1} b_1 = 0,   (32)
∂J_1(w_1, b_1)/∂b_1 = 2 R_{ỹ_1 ỹ_1} b_1 − 2 R_{ỹ_1 x_1} w_1 = 0.   (33)

Solving the above matrix equations, we obtain a simple iterative algorithm:

w_1 = R_{x_1 x_1}^{−1} R_{x_1 ỹ_1} b_1,   w_1 ← w_1 / ‖w_1‖_2,   (34)
b_1 = R_{ỹ_1 ỹ_1}^{−1} R_{ỹ_1 x_1} w_1 = R_{ỹ_1 ỹ_1}^{−1} R_{ỹ_1 y_1},   (35)

where the matrices R_{ỹ_1 y_1} and R_{ỹ_1 ỹ_1} are estimated on the basis of the parameters w_1 obtained in the previous iteration step. In order to avoid the trivial solution w_1 = 0, we normalize the vector w_1 to unit length in each iteration step as w_1(l+1) = w_1(l+1)/‖w_1(l+1)‖_2 (which ensures that E{y_1^2} = 1).


It is worth noting here that in our derivation the matrices R_{ỹ_1 y_1} and R_{ỹ_1 ỹ_1} are assumed to be independent of the vector w_1(l+1), i.e., they are estimated on the basis of w_1(l) from the previous iteration step. This two-phase procedure is similar to the expectation-maximization (EM) scheme: (i) freeze the correlation and cross-correlation matrices and learn the parameters of the processing unit (w_1, b_1); (ii) freeze w_1 and b_1 and learn the new statistics (i.e., the matrices R_{ỹ_1 y_1} and R_{ỹ_1 ỹ_1}) of the estimated source signal; then go back to (i) and repeat. Hence, in phase (i) our algorithm extracts a source signal, whereas in phase (ii) it learns the statistics of the source (Cichocki & Belouchrani, 2001; Cichocki et al., 2002; Gharieb & Cichocki, 2003).
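The two-phase iteration (34)-(35) can be sketched in numpy as follows. The delay length L, the random initialization and the small ridge term are our own choices, and R_{x_1 x_1} = I is assumed because the data are prewhitened; none of these details are fixed by the text.

```python
import numpy as np

def blind_source_extraction(X1, L=5, n_iter=50, seed=0):
    """Two-phase iteration (34)-(35): X1 is prewhitened data (n x N).
    Returns the extraction vector w and the extracted signal y."""
    rng = np.random.default_rng(seed)
    n, N = X1.shape
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ X1                                        # current output, Eq. (28)
        # delayed-output matrix with rows y(k-1), ..., y(k-L)
        Ytil = np.array([np.roll(y, p) for p in range(1, L + 1)])
        Ytil[:, :L] = 0.0                                 # discard wrapped samples
        Ryy = (Ytil @ Ytil.T) / N                         # R_{y~ y~}
        ryy = (Ytil @ y) / N                              # R_{y~ y1}
        b = np.linalg.solve(Ryy + 1e-9 * np.eye(L), ryy)  # AR predictor, Eq. (35)
        Rxy = (X1 @ Ytil.T) / N                           # R_{x1 y~}
        w = Rxy @ b                                       # Eq. (34); R_{x1 x1} = I
        w /= np.linalg.norm(w)                            # avoid the trivial solution
    return w, w @ X1
```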

Independent Component Analysis (ICA)

ICA can be defined as follows: the ICA of a random vector x(k) ∈ R^m is obtained by finding an n×m (with m ≥ n) full-rank separating (transformation) matrix W such that the entries of the output signal vector y(k) = [y_1(k), y_2(k), ..., y_n(k)]^T (independent components), estimated by

y(k) = W x(k),   (36)

are as independent as possible, as evaluated by an information-theoretic cost function such as the Kullback-Leibler divergence (Hyvarinen et al., 2001).

Compared with principal component analysis (PCA), which removes second-order correlations from the observed signals, ICA further removes higher-order dependencies. Independence of random variables is a more general concept than decorrelation. Roughly speaking, we say that random variables y_i and y_j are statistically independent if knowledge of the values of y_i provides no information about the values of y_j. Mathematically, the independence of y_i and y_j can be expressed by the relationship

p(y_i, y_j) = p(y_i) p(y_j),   (37)

where p(y) denotes the probability density function (pdf) of the random variable y. In other words, signals are independent if their joint pdf can be factorized.

If the independent signals are zero-mean, then the generalized covariance matrix of f(y_i) and g(y_j), where f(y) and g(y) are different, odd nonlinear activation functions (e.g., f(y) = tanh(y) and g(y) = y for super-Gaussian sources), is a nonsingular diagonal matrix:

R_{fg} = E{f(y) g^T(y)} = diag{E{f(y_1) g(y_1)}, ..., E{f(y_n) g(y_n)}},   (38)

i.e., the covariances E{f(y_i) g(y_j)} are all zero for i ≠ j. It should be noted that, for odd f(y) and g(y), if the probability density function of each zero-mean source signal is even, then the terms of the form E{f(y_i)} E{g(y_j)} equal zero. The true general condition for statistical independence of signals is the vanishing of all higher-order cross-cumulants (Cichocki et al., 1994; Cichocki & Unbehauen, 1996; Amari & Cardoso, 1997; Cichocki, Bogner, Moszczynski, & Pope, 1997).


The above diagonalization principle can be expressed as (Fiori, 2003; Nishimori, 1999)

R_{fg}^{−1} = Λ^{−1},   (39)

where Λ is any diagonal positive definite matrix (typically, Λ = I or Λ = diag{R_{fg}}). By pre-multiplying the above equation by the separating matrix W and by Λ, we obtain

Λ R_{fg}^{−1} W = W,   (40)

which suggests the following iterative multiplicative learning algorithm:

W(l+1) = Λ R_{fg}^{−1} W(l),   (41)
W(l+1) ← W(l+1) [W^T(l+1) W(l+1)]^{−1/2},   (42)

where the last equation represents a symmetric orthogonalization, which keeps the algorithm stable. The above algorithm is simple and fast, but it requires prewhitening of the data.

In fact, a wide class of ICA algorithms can be expressed in the general form (see Table 1) (Cichocki & Amari, 2003)

ΔW(l) = W(l+1) − W(l) = η F(y) W(l),   (43)

where y(k) = W(l) x(k), and the matrix F(y) can take different forms, for example F(y) = Λ_n − f(y) g^T(y), with suitably chosen nonlinearities f(y) = [f(y_1), ..., f(y_n)]^T and g(y) = [g(y_1), ..., g(y_n)]^T (Cichocki et al., 1994; S. A. Cruces, Castedo, & Cichocki, 2002; S. Cruces, Cichocki, & Castedo, 2000; S. A. Cruces & Cichocki, 2003; Cichocki & Amari, 2003; Hyvarinen & Oja, 2000).

Assuming prior knowledge of the source distributions p_i(y_i), we can estimate W using the maximum likelihood (ML) approach, by minimizing the cost function

J(W, y) = −(1/2) log |det(W W^T)| − Σ_{i=1}^n log p_i(y_i).   (44)

Using natural gradient descent to increase the likelihood, we get

W(l+1) = W(l) + η [I − f(y) y^T] W(l),   (45)

where f(y) = [f_1(y_1), f_2(y_2), ..., f_n(y_n)]^T is an entry-wise nonlinear score function defined by

f_i(y_i) = −p_i′(y_i)/p_i(y_i) = −d log(p_i(y_i))/dy_i.   (46)

It should be noted that ICA can perform blind source separation, i.e., enable the estimation of the true sources, only if they are all statistically independent and non-Gaussian (except possibly one) (Cichocki & Amari, 2003; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996; Karhunen & Pajunen, 1997).
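As a minimal illustration of the natural gradient update (45), the following numpy sketch uses the batch average ⟨f(y) y^T⟩ and the common choice f(y) = tanh(y), which corresponds to super-Gaussian sources; the learning rate, iteration count and this particular score function are our own assumptions.

```python
import numpy as np

def natural_gradient_ica(X, eta=0.01, n_iter=200):
    """Batch natural-gradient ICA, Eq. (45), with the score f(y) = tanh(y)
    (an assumption suitable for super-Gaussian sources)."""
    X = X - X.mean(axis=1, keepdims=True)
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        Y = W @ X
        F = np.tanh(Y)
        # W <- W + eta * (I - <f(y) y^T>) W, with the batch average over N samples
        W = W + eta * (np.eye(n) - (F @ Y.T) / N) @ W
    return W, W @ X
```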


Table 1: Basic equivariant adaptive learning algorithms for ICA. Some of these algorithms require pre-whitening.

1. ΔW = η [Λ − ⟨f(y) g^T(y)⟩] W, where Λ is a diagonal matrix with nonnegative elements λ_ii [Cichocki, Unbehauen, Rummert (1994)];
   W(l+1) = [I ∓ η (I − ⟨f(y) g^T(y)⟩)]^{∓1} W(l) [Cruces, Cichocki, Castedo (2000)]

2. ΔW = η [Λ − ⟨f(y) y^T⟩] W, with f(y_i) = −p′(y_i)/p(y_i) and λ_ii = ⟨f(y_i(k)) y_i(k)⟩ or λ_ii = 1, ∀i [Bell, Sejnowski (1995); Amari, Cichocki, Yang (1995); Choi, Cichocki, Amari (1999)]

3. ΔW = η [I − ⟨y y^T⟩ − ⟨f(y) y^T⟩ + ⟨y f^T(y)⟩] W [Cardoso, Laheld (1996)]

4. ΔW = η [I − ⟨y y^T⟩ − ⟨f(y) y^T⟩ + ⟨f(y) f^T(y)⟩] W [Karhunen, Pajunen (1997)]

5. W ← W + η [Λ − ⟨f(y) y^T⟩] W, with λ_ii = ⟨f(y_i) y_i⟩ and η_ii = [λ_ii + ⟨f′(y_i)⟩]^{−1}, followed by W ← W (W^T W)^{−1/2} [Hyvarinen, Oja (1999)]

6. ΔW = η [I − Λ^{−1} ⟨y y^T⟩] W, with λ_ii(k) = ⟨y_i^2(k)⟩ [Amari, Cichocki (1998); Choi, Cichocki, Amari (2000)]

7. ΔW = η [I − C_{1,q}(y, y) S_{q+1}(y)] W, with C_{1,q}(y_i, y_j) = Cum(y_i, y_j, ..., y_j) (y_j taken q times) [Cruces, Castedo, Cichocki (2002)]

8. W(l+1) = exp(η F(y)) W(l), with F(y) = Λ − ⟨y y^T⟩ − ⟨f(y) y^T⟩ + ⟨y f^T(y)⟩ [Nishimori, Fiori (1999, 2003); Cichocki, Georgiev (2002)]

9. W ← Λ R_{fg}^{−1} W, followed by W ← W (W^T W)^{−1/2} [Fiori (2003)]


Multiresolution Subband Decomposition – Independent Component Analysis (MSD-ICA)

Despite the success of standard ICA in many applications, its basic assumptions may not hold for some kinds of signals, hence some caution should be taken when using standard ICA to analyze real-world problems, especially in the analysis of EEG/MEG data. In fact, by definition, the standard ICA algorithms are not able to estimate statistically dependent original sources, that is, when the independence assumption is violated. In this section, we present a natural extension and generalization of ICA, called multiresolution subband decomposition ICA (MSD-ICA), which considerably relaxes the assumption of mutual independence of the primary sources (Cichocki & Georgiev, 2003; Cichocki & Amari, 2003; Tanaka & Cichocki, 2004; Cichocki, Li, Georgiev, & Amari, 2004). The key idea in this approach is the assumption that the unknown wide-band source signals can be dependent, while some of their (narrow-band or sparse, linearly transformed) sub-components are independent. In other words, we assume that each unknown source can be modeled or represented as a sum (or linear combination) of narrow-band sub-signals (sub-components):

s_i(k) = s_{i1}(k) + s_{i2}(k) + ... + s_{iK}(k).   (47)

For example, in the simplest case, the source signals can be modeled or decomposed into their low- and high-frequency sub-components:

s_i(k) = s_{iL}(k) + s_{iH}(k)   (i = 1, 2, ..., n).   (48)

In practice, the high-frequency sub-components s_{iH}(k) are often found to be mutually independent, while the low-frequency sub-components are only weakly dependent. In such a case, we can use a high-pass filter (HPF) to extract the mixture of the high-frequency sub-components and then apply any standard ICA algorithm to the preprocessed sensor (observed) signals.

The basic concept in subband decomposition ICA is to divide the sensor signal spectra into subspectra or subbands, and then to treat those subspectra individually for the purpose at hand. The subband signals can be ranked and processed independently. Let us assume that only a certain set of sub-components is independent. Provided that for some of the frequency subbands (at least one) all sub-components, say {s_{ij}(k)}_{i=1}^n, are mutually independent or temporally decorrelated, we can easily estimate the mixing or separating system, under the condition that these subbands can be identified by some a priori knowledge or detected by some self-adaptive process. For this purpose, we simply apply any standard ICA algorithm, however not to all available raw sensor data, but only to suitably pre-processed (e.g., subband-filtered or, more generally, transform-domain) sensor signals.

By applying any standard ICA/BSS algorithm to specific sub-bands and to the raw sensor data, we obtain a sequence of separating matrices W_0, W_1, ..., W_L, where W_0 is the separating matrix for the original data x and W_j is the separating matrix for the preprocessed sensor data x_j in the j-th sub-band. In order to identify the sub-bands for which the corresponding source sub-components are independent, we propose to compute the global (mixing-separating) matrices G_{jq} = W_j W_q^{−1}, ∀ j ≠ q (for m = n), where W_q is the separating matrix estimated for the q-th sub-band.

Figure 14. Bank of filters employed in the preprocessing stage for MSD-ICA, with typical frequency bands. For each sensor signal we employ the identical set of filters. The sub-bands may or may not overlap and can have more complex forms. Instead of subband filters, various alternative transforms can be used to sparsify the data.

If the sub-components are mutually independent for at least two sub-bands, say for the sub-band j and the sub-band q, then the global matrix G_{jq} = W_j W_q^{−1} = P_{jq} will be a generalized permutation matrix, with only one nonzero (or dominant) element in each row and each column. This follows from the simple observation that in this case both matrices W_j and W_q represent inverses (for m = n) of the same mixing matrix A (neglecting the nonessential scaling and permutation ambiguities). In this way, we can blindly identify the frequency sub-bands for which the source sub-components are independent, and we can correctly identify the mixing matrix. Furthermore, the same concept can be used to blindly estimate the performance index and to compare the performance of various ICA algorithms, especially for large-scale problems.

In the preprocessing stage we can use any linear transform; in particular, more sophisticated methods, such as block transforms, multirate sub-band filter banks, or wavelet, ridgelet and curvelet transforms, can be applied in order to sparsify the data and extract signals with specific structures or morphological properties. We can extend and generalize this concept further by performing the decomposition of the sensor signals in a composite time-frequency domain rather than in frequency sub-bands as such. This naturally leads to the concept of wavelet packets (sub-band hierarchical trees) and to block transform packets (Cichocki & Amari, 2003; Zibulevsky et al., 2002; Bach & Jordan, 2003; Bobin et al., 2006). Such preprocessing techniques have been extensively tested and implemented in our ICALAB (Cichocki, Amari, et al., 2004).

This explanation can be summarized as follows: MSD-ICA (multiresolution subband decomposition ICA) can be formulated as the task of estimating the separating matrix W and/or the mixing matrix A on the basis of a suitable wavelet-packet or subband decomposition of the sensor signals, by applying a classical ICA algorithm (instead of to the raw sensor data) to one or several preselected subbands for which the source sub-components are independent.
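A minimal sketch of the MSD-ICA preprocessing stage is given below: each sensor signal is passed through the same bank of band-pass filters, and any standard ICA/BSS algorithm is then run on each sub-band data set to obtain W_1, ..., W_L, which can be compared via G_{jq} = W_j W_q^{−1}. The particular bands (typical EEG rhythms), the filter order and the function name are our own assumptions, not prescribed by the text.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def subband_preprocess(X, fs, bands=((1, 4), (4, 8), (8, 13), (13, 30))):
    """Filter-bank preprocessing for MSD-ICA: X is sensor data (m x N),
    fs the sampling rate in Hz. Returns one filtered copy of X per band;
    a standard ICA/BSS algorithm is then applied to each copy."""
    out = []
    for lo, hi in bands:
        sos = butter(4, (lo, hi), btype='bandpass', fs=fs, output='sos')
        out.append(sosfiltfilt(sos, X, axis=1))   # zero-phase sub-band data
    return out
```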

Validity of ICA and BSS Algorithms for Real-World Data

One of the fundamental questions in BSS is whether the result obtained by a specific BSS/ICA algorithm is reliable and represents inherent properties of the model and data, or whether it is just a random, synthetic or purely mathematical decomposition of the data without any physical meaning. In fact, since most BSS algorithms are stochastic in nature, their results can differ somewhat from run to run, even for the same algorithm. Thus, the result obtained in a single run, or for a single data set, of any BSS algorithm should be interpreted with reservation, and the reliability of the estimated sources should be analyzed by investigating the spread of the estimates over many runs (Himberg, Hyvarinen, & Esposito, 2004). Such an analysis can be performed, for example, by using a resampling or bootstrapping method, in which the available data are randomly changed by producing surrogate data sets from the original data (Mainecke, Ziehe, Kawanabe, & Muller, 2002). The specific ICA/BSS algorithm is then run many times with bootstrapped samples that are somewhat different from each other. An alternative approach, called ICASSO, has been developed by (Himberg et al., 2004); it is based on running the specific BSS algorithm many times with different initial conditions and parameters and visualizing the clustering structure of the estimated sources (components) in the signal subspace. In other words, to estimate the algorithmic reliability, it was suggested to run the BSS algorithm many times using different initial conditions and to assess which components are found in almost all runs. For this purpose, the estimated components are clustered and classified. The reliable components correspond to small, well-separated clusters, while unreliable components usually do not belong to any cluster (Himberg et al., 2004; Mainecke et al., 2002).

It is worth noting that the concept of MSD-ICA described in the previous section can easily be extended to more general and flexible multi-dimensional models for checking the validity and reliability of ICA (or, more generally, BSS) algorithms (see Figure 14). In this model we can use a bank of stable filters with transfer functions H_i(z), for example, a set of FIR (finite impulse response) filters. The parameters (coefficients) of such FIR filters can be suitably designed or even randomly generated. In this case, the proposed method has some similarity with the resampling or bootstrap approach proposed by (Mainecke et al., 2002). Similarly as in MSD-ICA, we run a BSS algorithm for a sufficiently large number L of filters and generate a set of separating matrices {W_0, W_1, ..., W_L} or, alternatively, a set of estimated mixing matrices {Â_0, Â_1, ..., Â_L}.⁷ In the next step, we estimate the global mixing-separating matrices G_{pq} = W_p W_q^+ for any p ≠ q.

The performance of blind separation can be characterized by a single performance index (referred to as the Amari performance index), which we call here the blind performance index (since the true mixing matrix is unknown):

BPI = (1/n) Σ_{j=1}^n ( Σ_{i=1}^n |g_{ij}|^2 / max_i |g_{ij}|^2 − 1 ) + (1/n) Σ_{i=1}^n ( Σ_{j=1}^n |g_{ij}|^2 / max_j |g_{ij}|^2 − 1 ),   (49)

⁷The set of matrices can be further extended if the data are bootstrapped and/or the initial conditions are changed for each run.


where g_{ij} is the ij-th element of the matrix G_{pq}. In many cases we are not able to achieve perfect separation for some sources, or we are able to extract only some of the sources (not all of them). In such cases, instead of using one global performance index, we can define the local performance index

BPI_i = Σ_{j=1}^n |g_{ij}|^2 / max_j |g_{ij}|^2 − 1.   (50)

If the performance index BPI_i for a specific index i and filters p, q is close to zero, this means that with high probability this component has been successfully extracted. In order to assess the significant components, all estimated components should be clustered according to their mutual similarities. These similarities can be measured in the time domain or in the frequency domain. A natural measure of similarity between the estimated components is the absolute value of their mutual correlation coefficients |r_{ij}|, i ≠ j, which are the elements of the similarity matrix (Himberg et al., 2004)

R̃ = W̃ R_{xx} W̃^T,   (51)

where W̃ = [W_0^T, W_1^T, ..., W_L^T]^T stacks the separating matrices, and R_{xx} = E{x x^T} = A R_{ss} A^T is the covariance matrix of the observations, under the assumptions that the covariance matrix of the sources R_{ss} = E{s s^T} is a diagonal matrix and the separating matrices W_p are normalized (e.g., to unit-length rows).
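The blind performance indices (49)-(50) are straightforward to compute from two separating matrices, as the following numpy sketch shows; using the pseudo-inverse covers the case m ≥ n, and the function name is our own.

```python
import numpy as np

def blind_performance_index(Wp, Wq):
    """Blind performance indices, Eqs. (49)-(50), for G = Wp Wq^+.
    Values near zero indicate a near-permutation (successful) structure."""
    G2 = np.abs(Wp @ np.linalg.pinv(Wq)) ** 2
    local = G2.sum(axis=1) / G2.max(axis=1) - 1.0       # BPI_i, Eq. (50)
    global_bpi = local.mean() + (G2.sum(axis=0) / G2.max(axis=0) - 1.0).mean()
    return global_bpi, local                             # Eq. (49), Eq. (50)
```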

Sparse Component Analysis and Sparse Signal Representations

Sparse component analysis (SCA) and sparse signal representations (SSR) arise in many scientific problems, especially where we wish to represent signals of interest using a small (sparse) number of basis signals from a much larger set of signals, often called a dictionary (Donoho & Elad, 2004). Such problems arise in many applications, such as electromagnetic and biomagnetic inverse problems (EEG/MEG), feature extraction, filtering, wavelet denoising, time-frequency representation, neural and speech coding, spectral estimation, direction-of-arrival estimation, failure diagnosis and speed-up processing (Cichocki & Amari, 2003; Li, Cichocki, & Amari, 2006).

In contrast to ICA, where the mixing matrix and the source signals are estimated simultaneously, SCA is usually a multi-stage procedure (Zibulevsky et al., 2002; Li et al., 2003, 2004; Georgiev & Cichocki, 2004). In the first stage we need to find a suitable linear transformation which guarantees that the sources in the transformed domain are sufficiently sparse; typically, we represent the observed data in the time-frequency domain using wavelet packets (Li et al., 2004). In the next step, we estimate the columns a_i of the mixing matrix A using an advanced hierarchical clustering technique. This step is the most difficult and challenging task, since it requires the precise identification of the intersections of all hyperplanes on which the observed data are located (Li et al., 2003; Theis, Georgiev, & Cichocki, 2004). In the last step, we estimate the sparse sources using, for example, modified robust linear programming (LP), quadratic programming (QP) or semi-definite programming (SDP) optimization. The big advantage of SCA is its ability to reconstruct the original sources, and even their number, under certain weak conditions, even if the number of observations (sensors) is smaller than the number of sources (Li et al., 2004; Cichocki, Li, et al., 2004; Georgiev & Cichocki, 2004). Moreover, the system can be highly nonstationary (i.e., the number of active sources can change dramatically in time) and the sources can be statistically dependent.

We can state the subset selection sub-problem as follows: find an optimal subset of r ≪ n columns of the matrix A, which we denote by A_r ∈ R^{m×r}, such that A_r s_r^* ≅ x, or equivalently A_r s_r^* + e_r = x, where e_r represents a residual error vector whose norm should be below some threshold. The problem often consists not only in estimating the sparse vector s^*, but also in finding the correct or optimal sparsity profile, that is, the sparsity index r, i.e., in detecting the number of sources.

Usually, we are interested in the sparsest and unique representation, i.e., it is necessary to find the solution having the smallest possible number of nonzero components. The problem can be reformulated as the following robust optimization problem (Cichocki, Li, et al., 2004):

(P_ρ)   J_ρ(s) = ‖s‖_ρ = Σ_{j=1}^n ρ(s_j)   subject to   A s = x,   (52)

where A ∈ R^{m×n} (usually with n ≫ m) and ‖s‖_ρ is a suitably chosen function which measures the sparsity of the vector s. It should be noted that the sparsity measure need not necessarily be a norm, although we use such notation. For example, we can apply the Shannon, Gauss or Renyi entropies, or the normalized kurtosis, as a measure of sparsity (Cichocki & Amari, 2003; Kreutz-Delgado et al., 2003; Zibulevsky et al., 2002). In the standard form, we use the l_p-norm with 0 ≤ p ≤ 1. In particular, the l_0 quasi-norm attracts a lot of attention, since it ensures the sparsest representation (Donoho & Elad, 2004; Li et al., 2003, 2004). Unfortunately, the problem (52) formulated with the l_p-norm for p < 1 is very difficult to solve; in particular, for p = 0 it is NP-hard, so for large-scale problems it is numerically intractable. For this reason, we often use basis pursuit (BP) or standard linear programming (LP) with ‖s‖_ρ = ‖s‖_1, i.e., ρ = p = 1.

In practice, due to noise and other uncertainties (e.g., measurement errors), the system of underdetermined linear equations should be satisfied not precisely but with some prescribed tolerance (i.e., A s ≅ x in the sense that ‖x − A s‖_q ≤ ε). From the practical as well as the statistical point of view, it is convenient and quite natural to replace the exact constraint x = A s by the constraint ‖x − A s‖_q ≤ ε, where the choice of the l_q-norm depends on the noise distribution and the specific application. For noisy and uncertain data, we should use a more flexible and robust cost function (in comparison to the standard (P_ρ) problem), which will be referred to as extended basis pursuit denoising (EBPD) (Cichocki, Li, et al., 2004):

(EBPD)   J_{q,ρ}(s) = ‖x − A s‖_q^q + α ‖s‖_ρ.   (53)

There are several possible basic choices for l_q and for the sparsity criterion ‖s‖_ρ = ‖s‖_p. For example, for uniformly (respectively, Laplacian) distributed noise we should choose the l_∞ Chebyshev norm (respectively, the l_1-norm). Some basic choices of ρ (for l_q = l_2) are: ρ = 0 (minimum l_0 quasi-norm, or atomic decomposition, related to matching pursuit (MP) and the FOCUSS algorithm), ρ = 1 (basis pursuit denoising), and ρ = 2 (ridge regression) (Kreutz-Delgado et al., 2003; Zibulevsky et al., 2002; Donoho & Elad, 2004). The optimal choice of the ρ norm depends on the distribution of the noise in the sparse components. For example, for noisy components we can use robust norms such as the Huber function, defined as ‖s‖_{ρH} = Σ_i ρ_H(s_i), where ρ_H(s_i) = s_i^2/2 if |s_i| ≤ β and ρ_H(s_i) = β|s_i| − β^2/2 if |s_i| > β, and/or the epsilon-insensitive norm, defined as ‖s‖_{+ε} = Σ_j |s_j|_{+ε}, where |s_j|_{+ε} = max{0, |s_j| − ε}.

The practical importance of the EBPD approach, in comparison to the standard LP or BP approach, is that EBPD allows for treating the presence of noise or errors due to mismodeling; in practice, in the presence of noise, the true model is x(k) = A s(k) + v(k). Moreover, using the EBPD approach, we have the possibility to adjust the sparsity profile (i.e., the number of nonzero components) by tuning the parameter α; with the LP approach, in contrast, we do not have such an option. Furthermore, the method can be applied both for undercomplete and overcomplete models (i.e., when the number of sources is smaller or larger than the number of sensors).
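As one concrete solver for the EBPD problem (53) with q = 2 and ρ = 1 (basis pursuit denoising), the following numpy sketch uses the iterative shrinkage-thresholding algorithm (ISTA); ISTA is our choice of solver here, not a method prescribed by the text, and the scaling of α in the objective is immaterial for the sketch.

```python
import numpy as np

def ebpd_ista(A, x, alpha, n_iter=500):
    """ISTA for min_s 0.5*||x - A s||_2^2 + alpha*||s||_1, i.e., the EBPD
    cost (53) with q = 2, rho = 1 (up to a constant rescaling of alpha)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = s - A.T @ (A @ s - x) / L      # gradient step on the data term
        s = np.sign(z) * np.maximum(np.abs(z) - alpha / L, 0.0)  # soft threshold
    return s
```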

Nonnegative Matrix Factorization and Sparse Coding with Nonnegativity Constraints

NMF (nonnegative matrix factorization), introduced by Lee and Seung (Lee & Seung, 1999), and sometimes also called PMF (positive matrix factorization), as first proposed by Paatero, does not assume, explicitly or implicitly, sparseness or mutual statistical independence of the components; nevertheless, it usually provides a sparse decomposition.

The NMF method is designed to capture alternative structures inherent in the data and, possibly, to provide more biological insight. Lee and Seung introduced NMF in its modern formulation as a method to decompose images (Lee & Seung, 1999). In this context, NMF yielded a decomposition of human faces into parts reminiscent of features such as lips, eyes, nose, etc. By contrast to other factorization methods, such as ICA or PCA, applied to image data, NMF often yields nonnegative components with an obvious visual interpretation.

NMF has found wide application in spectroscopy, chemometrics and environmental science, where the factor matrices have physical meanings. NMF also has potential application in the analysis of EEG/MEG data, by extracting hidden, interesting, sparse and localized components from spectra and/or spectrograms of the data. Whereas the original application of NMF focused on grouping elements of images into parts (using the matrix A), we take the dual viewpoint by focusing primarily on grouping samples into components, represented by the matrix S.

NMF decomposes the data matrix X as a product of two matrices A and S having only nonnegative elements. This results in a reduced representation of the original data, in which each feature is a linear combination of the original attribute set. NMF does not allow negative entries in the matrix factors A and S of the model X = AS + V. Unlike other matrix factorizations, these nonnegativity constraints permit the combination of multiple basis signals to represent the original signals or images, but only additive combinations are allowed, because the nonzero elements of A and S are all positive; in such a decomposition, no subtractions can occur. For these reasons, the nonnegativity constraints are compatible with the intuitive notion of combining parts to form a whole signal or image, which is how NMF learns a parts-based representation (Lee & Seung, 1999).


Let us consider, as an illustrative example, the beta divergence as a cost function (Kompass, 2005; Minami & Eguchi, 2002; Cichocki, Zdunek, & Amari, 2006a):

D_K^{(β)}(X‖AS) = Σ_{ik} ( x_{ik} (x_{ik}^β − [AS]_{ik}^β) / (β(β+1)) + [AS]_{ik}^β ([AS]_{ik} − x_{ik}) / (β+1) ) + α_S ‖S‖_1 + α_A ‖A‖_1,   (54)

where α_S and α_A are small positive regularization parameters which control the degree of smoothness or sparseness of the matrices A and S, respectively, and the l_1-norms ‖A‖_1 and ‖S‖_1 are introduced to enforce a sparse representation of the solution. The choice of the β parameter depends on the statistical distribution of the data; the beta divergence corresponds to the Tweedie models (Minami & Eguchi, 2002). On the basis of such a cost function we can derive various kinds of NMF algorithms: multiplicative algorithms based on gradient descent or the exponentiated gradient (EG), additive algorithms using the projected gradient (PG) or interior projected gradient (IPG) approaches, and fixed-point (FP) algorithms.

In order to derive a flexible NMF learning algorithm, we compute the gradient of (54) with respect to the elements of the matrices, s_{jk} = s_j(k) = [S]_{jk} and a_{ij} = [A]_{ij}, as follows:

∂D_K^{(β)}/∂s_{jk} = Σ_{i=1}^m a_{ij} ([AS]_{ik}^β − x_{ik} [AS]_{ik}^{β−1}) + α_S,   (55)
∂D_K^{(β)}/∂a_{ij} = Σ_{k=1}^N ([AS]_{ik}^β − x_{ik} [AS]_{ik}^{β−1}) s_{jk} + α_A.   (56)

The simplest approach to designing a multiplicative algorithm is to apply the exponentiated gradient (EG) method (Cichocki, Amari, et al., 2006):

s_{jk} ← s_{jk} exp(−η_{jk} ∂D_K^{(β)}/∂s_{jk}),   (57)
a_{ij} ← a_{ij} exp(−η_{ij} ∂D_K^{(β)}/∂a_{ij}),   (58)

where the positive learning rates η_{jk} and η_{ij} can take different forms. Typically, in order to guarantee the stability of the algorithm, we assume η_{jk} = η_j = ω (Σ_{i=1}^m a_{ij})^{−1} and η_{ij} = η_j = ω (Σ_{k=1}^N s_{jk})^{−1}, where ω ∈ (0, 2) is an over-relaxation parameter.

Alternatively, similarly to the Lee and Seung approach, by choosing the suitable learning rates

η_{jk} = s_{jk} / (Σ_{i=1}^m a_{ij} [AS]_{ik}^β),   η_{ij} = a_{ij} / (Σ_{k=1}^N [AS]_{ik}^β s_{jk}),   (59)

we obtain the multiplicative update rules (Kompass, 2005; Cichocki, Zdunek, & Amari, 2006a):

s_{jk} ← s_{jk} [Σ_{i=1}^m a_{ij} (x_{ik}/[AS]_{ik}^{1−β}) − α_S]_ε / (Σ_{i=1}^m a_{ij} [AS]_{ik}^β),   (60)
a_{ij} ← a_{ij} [Σ_{k=1}^N (x_{ik}/[AS]_{ik}^{1−β}) s_{jk} − α_A]_ε / (Σ_{k=1}^N [AS]_{ik}^β s_{jk}),   (61)


where the additional nonlinear operator [x]_ε = max{ε, x}, with a small ε, is introduced in practice in order to avoid zero and negative values.
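The multiplicative updates (60)-(61) translate directly into numpy, as the following sketch shows; the random nonnegative initialization and the column normalization of A are standard practical choices and our own additions.

```python
import numpy as np

def beta_nmf(X, n, beta=1.0, alpha_S=0.0, alpha_A=0.0, n_iter=200, seed=0):
    """Multiplicative beta-divergence NMF, Eqs. (60)-(61), with the sparsity
    terms alpha_S, alpha_A and the componentwise [.]_eps operator."""
    rng = np.random.default_rng(seed)
    eps = 1e-16
    m, N = X.shape
    A = np.abs(rng.standard_normal((m, n)))      # nonnegative init (assumption)
    S = np.abs(rng.standard_normal((n, N)))
    for _ in range(n_iter):
        AS = A @ S + eps
        S *= np.maximum(A.T @ (X / AS**(1.0 - beta)) - alpha_S, eps) \
             / (A.T @ AS**beta + eps)                        # Eq. (60)
        AS = A @ S + eps
        A *= np.maximum((X / AS**(1.0 - beta)) @ S.T - alpha_A, eps) \
             / (AS**beta @ S.T + eps)                        # Eq. (61)
        A /= A.sum(axis=0, keepdims=True) + eps              # normalize columns of A
    return A, S
```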

The projected gradient (PG) NMF algorithm with additive updates can be written in the general form (Lin, 2005)

s_{jk} ← P_Ω[s_{jk} − η_{jk} (∂D_K/∂s_{jk})],   (62)
a_{ij} ← P_Ω[a_{ij} − η_{ij} (∂D_K/∂a_{ij})],   (63)

where the learning rates are not fixed but adjusted in each iteration step in such a way that they keep the updates nonnegative, and/or P_Ω(x) ensures the projection of x onto the feasible (nonnegative) set Ω.

Using the Interior Projected Gradient (IPG) technique, the additive algorithm can take the following form (using MATLAB notation):

A ← A − η_A [A ./ (ASS^T)] .* [(AS − X)S^T],   (64)
S ← S − η_S [S ./ (A^T AS)] .* [A^T (AS − X)],   (65)

where η_A and η_S are diagonal matrices with positive entries representing suitably chosen learning rates (Merritt & Zhang, 2004).
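A rough NumPy transcription of one IPG sweep (64)-(65) might look as follows; as a simplifying assumption, a fixed scalar step eta stands in for the adaptively chosen learning rates of Merritt and Zhang (2004), and the helper name is hypothetical.

```python
import numpy as np

def ipg_nmf_step(X, A, S, eta=0.5, eps=1e-16):
    """One interior-point-gradient sweep, cf. Eqs. (64)-(65) (illustrative sketch).

    Merritt & Zhang (2004) choose the step adaptively so that the iterates
    stay strictly positive; a fixed eta <= 1 has the same interior property.
    """
    # Eq. (64): gradient step for A, scaled elementwise by A ./ (A S S^T)
    A = A - eta * (A / (A @ S @ S.T + eps)) * ((A @ S - X) @ S.T)
    # Eq. (65): gradient step for S, scaled elementwise by S ./ (A^T A S)
    S = S - eta * (S / (A.T @ A @ S + eps)) * (A.T @ (A @ S - X))
    return np.maximum(A, eps), np.maximum(S, eps)   # safeguard projection
```

For eta = 1 the scaled gradient step coincides with the multiplicative Lee-Seung update, while eta < 1 keeps the iterates strictly in the interior of the positive orthant.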

Finally, a family of fixed point NMF algorithms can be derived by setting the gradients (for β = 1) to zero (Cichocki & Zdunek, 2006) (compare with the EM-PCA algorithm):

$$
\nabla_{\mathbf{S}} D_F(\mathbf{X}\,\|\,\mathbf{AS}) = \mathbf{A}^T\mathbf{AS} - \mathbf{A}^T\mathbf{X} + \alpha_S = \mathbf{0}, \qquad (66)
$$

$$
\nabla_{\mathbf{A}} D_F(\mathbf{X}\,\|\,\mathbf{AS}) = \mathbf{ASS}^T - \mathbf{XS}^T + \alpha_A = \mathbf{0}. \qquad (67)
$$

These equations suggest the following fixed point update rules:

$$
\mathbf{S} \leftarrow \max\!\left\{ \varepsilon,\; (\mathbf{A}^T\mathbf{A})^{+}(\mathbf{A}^T\mathbf{X} - \alpha_S) \right\} = \left[ (\mathbf{A}^T\mathbf{A})^{+}(\mathbf{A}^T\mathbf{X} - \alpha_S) \right]_{\varepsilon},
$$

$$
\mathbf{A} \leftarrow \max\!\left\{ \varepsilon,\; (\mathbf{XS}^T - \alpha_A)(\mathbf{SS}^T)^{+} \right\} = \left[ (\mathbf{XS}^T - \alpha_A)(\mathbf{SS}^T)^{+} \right]_{\varepsilon},
$$

where $(\cdot)^{+}$ denotes the Moore-Penrose pseudo-inverse and the max function is applied componentwise. The matrices A and S should be updated alternately. Due to some physical constraints, and also in order to achieve a unique solution, it is usually necessary to normalize the columns of A or the rows of S to unit (or other fixed) norm in each iteration.

Useful NMF learning algorithms for various flexible and generalized cost functions are presented in Table 2 (Cichocki, Zdunek, & Amari, 2006b; Cichocki, Amari, et al., 2006; Zdunek & Cichocki, 2006; Cichocki & Zdunek, 2006; Dhillon & Sra, 2005). The algorithms are described in matrix form using MATLAB notation. The operators .*, ./ and .^β denote componentwise multiplication, division, and raising each element of a matrix or vector to the power β, respectively. In practice, in order to avoid division by zero and log(0), a small positive value ε is added where necessary. Furthermore, in order to avoid negative values we use the operator $[x]_{\varepsilon} = \max\{x, \varepsilon\}$, where ε is a small positive number, typically $10^{-16}$.
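For readers more familiar with NumPy than MATLAB, these componentwise conventions translate directly; the arrays below are arbitrary toy values used only to illustrate the operators appearing in Table 2.

```python
import numpy as np

eps = 1e-16
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Y = np.array([[2.0, 0.5],
              [1.0, 4.0]])

Z = X * Y                      # .*     componentwise multiplication
W = X / (Y + eps)              # ./     componentwise division (eps avoids 0-division)
P = X ** 0.5                   # .^beta componentwise power, here beta = 0.5
Q = np.maximum(X - 2.5, eps)   # [x]_eps = max{x, eps}, applied componentwise
```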


Table 2: NMF algorithms and corresponding iterative learning rules (MATLAB notation; minimization of each cost function is subject to a_ij ≥ 0 and s_jk ≥ 0)

Amari alpha divergence, $\sum_{ik} \left( x_{ik}\, \frac{(x_{ik}/[\mathbf{AS}]_{ik})^{\alpha-1} - 1}{(\alpha-1)\alpha} + \frac{[\mathbf{AS}]_{ik} - x_{ik}}{\alpha} \right)$:
    S ← S .* (A^T ((X ./ (AS + ε)).^α)).^(1/α)
    A ← A .* (((X ./ (AS + ε)).^α) S^T).^(1/α)
    A ← A diag(1 ./ sum(A, 1)),   α ≠ 0

Shannon entropy, $\sum_{ik} s_{ik} \log s_{ik}$ s.t. X = AS, and $\sum_{ij} a_{ij} \log a_{ij}$ s.t. X^T = S^T A^T:
    S ← S .* exp(η_S A^T ln(X ./ (AS + ε)))
    A ← A .* exp(ln(X ./ (AS + ε)) S^T η_A)
    A ← A diag(1 ./ sum(A, 1)),
    with η_A = diag{1 ./ sum(A, 1)}, η_S = diag{1 ./ sum(S, 1)}

Euclidean distance with regularization, $\sum_{ik} (x_{ik} - [\mathbf{AS}]_{ik})^2 + \alpha_S \|\mathbf{S}\|_1 + \alpha_A \|\mathbf{A}\|_1$:
    S ← S .* [A^T X − α_S]_ε ./ [A^T AS + ε]
    A ← A .* [X S^T − α_A]_ε ./ [ASS^T + ε]
    A ← A diag(1 ./ sum(A, 1));
    alternatively, the fixed point rules
    S ← [(A^T A)^+ (A^T X − α_S)]_ε
    A ← [(X S^T − α_A)(SS^T)^+]_ε

Beta divergence, $\sum_{ik} \left( x_{ik}\, \frac{x_{ik}^{\beta} - [\mathbf{AS}]_{ik}^{\beta}}{\beta(\beta+1)} + [\mathbf{AS}]_{ik}^{\beta}\, \frac{[\mathbf{AS}]_{ik} - x_{ik}}{\beta+1} \right)$:
    S ← S .* (A^T (X ./ (AS + ε).^(1−β))) ./ (A^T (AS + ε).^β)
    A ← A .* ((X ./ (AS + ε).^(1−β)) S^T) ./ ((AS + ε).^β S^T)
    A ← A diag{1 ./ sum(A, 1)},   β ∈ [−1, 1]

Itakura-Saito distance, $\sum_{ik} \left( -\log \frac{x_{ik}}{[\mathbf{AS}]_{ik}} + \frac{x_{ik}}{[\mathbf{AS}]_{ik}} - 1 \right)$:
    S ← S .* [(A^T P) ./ (A^T Q + ε)].^ω
    A ← A .* [(P S^T) ./ (Q S^T + ε)].^ω
    A ← A diag(1 ./ sum(A, 1)),   ω ∈ [0.5, 1],
    where P = X ./ (AS + ε).^2 and Q = 1 ./ (AS + ε)


An essential feature of the NMF approach is that it reduces the data set from its full data space to a lower dimensional NMF space determined by the rank n (typically, n < mN/(m + N)). For any rank n, the NMF algorithms group the available data into classes or clusters of components. The key open issue is to determine whether a given rank n decomposes the data into "meaningful" components. In general, the NMF algorithms may or may not converge to the same meaningful solutions on each run, depending on the random initial conditions and the kind of algorithm used. If a clustering into n classes is strong, we would expect that sample assignments to clusters would vary little from run to run. Although NMF is a purely algebraic factorization, it has been shown that, as the rank n increases, the method may uncover structures or substructures whose robustness can be evaluated by running the algorithm for gradually increasing n. In fact, NMF may reveal hierarchical structure when it exists, but does not force such structure on the data as SCA or ICA do. Thus, NMF may have some advantages in exposing meaningful components and discovering fine substructures.
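One simple way to quantify this run-to-run variability, loosely in the spirit of consensus clustering, is to average connectivity matrices over repeated factorizations with different random seeds. The sketch below is an illustrative assumption on our part: it reuses the hypothetical beta_nmf routine from the earlier example and assigns each sample to its dominant component.

```python
import numpy as np

def consensus_matrix(X, n, n_runs=20, **nmf_kwargs):
    """Average connectivity matrix over repeated NMF runs.

    Entries near 0 or 1 indicate pairs of samples that are (almost) always
    clustered apart or together, i.e., a stable rank-n structure.
    """
    N = X.shape[1]
    C = np.zeros((N, N))
    for run in range(n_runs):
        _, S = beta_nmf(X, n, seed=run, **nmf_kwargs)  # single-run NMF from the earlier sketch
        labels = S.argmax(axis=0)                      # dominant component per sample
        C += (labels[:, None] == labels[None, :])
    return C / n_runs
```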

The utility of NMF for estimating latent (hidden) components and their clusters or classes from EEG data (represented in the frequency or time-frequency domain) stems from its nonnegativity constraints, which facilitate the detection of sharp boundaries among classes. The estimated components are typically sparse, localized, and relatively independent, which often makes the resulting signal decompositions physiologically and neuroanatomically meaningful. Despite these promising features, NMF has limitations due to the non-uniqueness of its solutions, the difficulty of finding optimal dimensions of the matrices A and S, and the interpretation of some components.

Multi-layer NMF

In order to improve the performance of NMF, especially for ill-conditioned and badly scaled data, and also to reduce the risk of getting stuck in local minima of the non-convex minimization, we have developed a simple hierarchical, multi-stage procedure in which we perform a sequential decomposition of nonnegative matrices as follows. In the first step, we perform the basic decomposition (factorization) $\mathbf{X} = \mathbf{A}_1 \mathbf{S}_1$ using any available NMF algorithm. In the second stage, the result obtained from the first stage is used to perform a similar decomposition, $\mathbf{S}_1 = \mathbf{A}_2 \mathbf{S}_2$, using the same or different update rules, and so on, each time taking into account only the components obtained in the preceding step. The process can be repeated arbitrarily many times until some stopping criterion is satisfied, and in each step we usually obtain a gradual improvement of the performance. Thus, our model has the form $\mathbf{X} = \mathbf{A}_1 \mathbf{A}_2 \cdots \mathbf{A}_L \mathbf{S}_L$, with the basis nonnegative matrix defined as $\mathbf{A} = \mathbf{A}_1 \mathbf{A}_2 \cdots \mathbf{A}_L$. Physically, this means that we build a system consisting of many layers, i.e., a cascade connection of L mixing subsystems. The key point in our novel approach is that the learning (update) process to find the parameters of the sub-matrices $\mathbf{S}_l$ and $\mathbf{A}_l$ is performed sequentially, layer by layer. In each step or each layer, we can use the same cost (loss) function, and consequently the same learning (minimization) rules, or completely different cost functions and/or corresponding update rules.
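A minimal sketch of this sequential procedure follows; as before, the hypothetical beta_nmf routine from the earlier example stands in for the single-layer solver, and any NMF update rule could be substituted for it.

```python
import numpy as np

def multilayer_nmf(X, n, n_layers=3, **nmf_kwargs):
    """Sequential multi-layer NMF: X ~ A1 A2 ... AL S_L (illustrative sketch)."""
    A = np.eye(X.shape[0])     # accumulates the product A1 A2 ... AL
    S = X
    for layer in range(n_layers):
        A_l, S = beta_nmf(S, n, seed=layer, **nmf_kwargs)  # refactorize previous layer's S
        A = A @ A_l
    return A, S
```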

In summary, NMF is a promising technique for extracting, clustering, and classifying latent components. However, the challenge that still remains is to provide a meaningful physiological interpretation for some of the hidden components, or classes of components, discovered by NMF when the structure of the true sources is completely unknown.


Discussion and Conclusions

In this paper we have briefly discussed several extensions and modifications of blind source separation and decomposition algorithms for spatio-temporal decorrelation, independent component analysis, sparse component analysis, and nonnegative matrix factorization, in which various criteria and constraints are imposed, such as linear predictability, smoothness, mutual independence, sparsity, and non-negativity of the extracted components. In particular, we described the generalization and extension of ICA to MSD-ICA, which considerably relaxes the condition of independence of the original sources. Using these concepts, in many cases we are able to reconstruct (recover) the original brain sources and to estimate the mixing and separating matrices, even if the original sources are not independent and are in fact strongly correlated. Moreover, we proposed a simple method for checking the validity and true performance of BSS separation by applying a bank of filters with various frequency characteristics.

Furthermore, we have proposed a simple and efficient BSS approach for the blind extraction of specific components from raw EEG data in order to improve the sensitivity and specificity of early detection of Alzheimer's disease. The basic principle is to automatically order and cluster the estimated components, and then to project back to the scalp level only the group of components that are significant electrophysiological markers of Alzheimer's disease. The suboptimal selection of indices and of the number of ordered components has been performed by extensive computer simulations and an optimization procedure.

References

Amari, S., & Cichocki, A. (1998). Adaptive blind signal processing - neural network approaches. Proceedings of the IEEE, 86, 1186-1187.

Amari, S., Hyvarinen, A., Lee, S., Lee, T., & Sanchez, V. (2002). Blind signal separation and independent component analysis. Neurocomputing, 49(12), 1-5.

Amari, S., & Cardoso, J.-F. (1997). Blind source separation - semi-parametric statistical approach. IEEE Trans. on Signal Processing, 45(11), 2692-2700.

Bach, F., & Jordan, M. (2003). Beyond independent components: trees and clusters. Journal of Machine Learning Research, 4, 1205-1233.

Barros, A. K., & Cichocki, A. (2001). Extraction of specific signals with temporal structure. Neural Computation, 13(9), 1995-2000.

Bell, A., & Sejnowski, T. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129-1159.

Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Trans. Signal Processing, 45(2), 434-444.

Belouchrani, A., & Amin, M. (1996). A new approach for blind source separation using time-frequency distributions. Proc. SPIE, 2846, 193-203.

Belouchrani, A., & Cichocki, A. (2000). Robust whitening procedure in blind source separation context. Electronics Letters, 36(24), 2050-2053.

Bobin, J., Moudden, Y., Starck, J.-L., & Elad, M. (2006). Morphological diversity and source separation. IEEE Signal Processing Letters, 13, 409-412.


Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. Signal Processing, 44(12), 3017-3030.

Chennubhotla, S. C. (2004). Spectral Methods for Multi-scale Features Extraction and Data Clustering. Ph.D. thesis, University of Toronto.

Choi, S., Cichocki, A., & Amari, S. (1998). Flexible independent component analysis. In Proc. of the 1998 IEEE Workshop on NNSP (pp. 83-92). Cambridge, UK.

Choi, S., Cichocki, A., & Amari, S. (2002). Equivariant nonstationary source separation. Neural Networks, 15, 121-130.

Choi, S., Cichocki, A., & Belouchrani, A. (2002). Second order nonstationary source separation. Journal of VLSI Signal Processing, 32(1-2), 93-104.

Cichocki, A., & Amari, S. (2003). Adaptive Blind Signal and Image Processing (new revised and improved edition). New York: John Wiley.

Cichocki, A., Amari, S., Zdunek, R., Kompass, R., Hori, G., & He, Z. (2006). Extended SMART algorithms for non-negative matrix factorization. LNAI-4029, Springer, 548-562.

Cichocki, A., Amari, S., Siwek, K., Tanaka, T., et al. (2004). ICALAB Toolboxes for Signal and Image Processing. www.bsp.brain.riken.go.jp, Japan.

Cichocki, A., & Belouchrani, A. (2001, December 9-13). Source separation of temporally correlated sources from noisy data using a bank of band-pass filters. In Third International Conference on Independent Component Analysis and Signal Separation (ICA-2001) (pp. 173-178). San Diego, USA.

Cichocki, A., Bogner, R., Moszczynski, L., & Pope, K. (1997). Modified Herault-Jutten algorithms for blind separation of sources. Digital Signal Processing, 7(2), 80-93.

Cichocki, A., & Georgiev, P. (2003, January). Blind source separation algorithms with matrix constraints. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E86-A(1), 522-531.

Cichocki, A., Kasprzak, W., & Skarbek, W. (1996). Adaptive learning algorithm for principal component analysis with partial data. In R. Trappl (Ed.), Cybernetics and Systems '96: Thirteenth European Meeting on Cybernetics and Systems Research (Vol. 2, pp. 1014-1019). Austrian Society for Cybernetic Studies, Vienna.

Cichocki, A., Li, Y., Georgiev, P. G., & Amari, S. (2004). Beyond ICA: Robust sparse signal representations. In Proceedings of 2004 IEEE International Symposium on Circuits and Systems (ISCAS2004) (Vol. V, pp. 684-687). Vancouver, Canada.

Cichocki, A., Rutkowski, T. M., & Siwek, K. (2002, September). Blind signal extraction of signals with specified frequency band. In Neural Networks for Signal Processing XII: Proceedings of the 2002 IEEE Signal Processing Society Workshop (pp. 515-524). Martigny, Switzerland: IEEE.

Cichocki, A., Shishkin, S., Musha, T., Leonowicz, Z., Asada, T., & Kurachi, T. (2005). EEG filtering based on blind source separation (BSS) for early detection of Alzheimer disease. Clinical Neurophysiology, 116, 729-737.

Cichocki, A., & Thawonmas, R. (2000). On-line algorithm for blind signal extraction of arbitrarily distributed, but temporally correlated sources using second order statistics. Neural Processing Letters, 12(1), 91-98.

Cichocki, A., Thawonmas, R., & Amari, S. (1997). Sequential blind signal extraction in order specified by stochastic properties. Electronics Letters, 33(1), 64-65.


Cichocki, A., & Unbehauen, R. (1993). Robust estimation of principal components in real time. Electronics Letters, 29(21), 1869-1870.

Cichocki, A., & Unbehauen, R. (1994). Neural Networks for Optimization and Signal Processing (new revised and improved edition). New York: John Wiley & Sons.

Cichocki, A., & Unbehauen, R. (1996). Robust neural networks with on-line learning for blind identification and blind separation of sources. IEEE Trans. Circuits and Systems I: Fundamentals Theory and Applications, 43(11), 894-906.

Cichocki, A., Unbehauen, R., & Rummert, E. (1994). Robust learning algorithm for blind separation of signals. Electronics Letters, 30(17), 1386-1387.

Cichocki, A., & Zdunek, R. (2006). NMFLAB Toolboxes for Signal and Image Processing. www.bsp.brain.riken.go.jp, Japan.

Cichocki, A., Zdunek, R., & Amari, S. (2006a). Csiszar's divergences for non-negative matrix factorization: Family of new algorithms. LNCS 3889, Springer, 32-39.

Cichocki, A., Zdunek, R., & Amari, S. (2006b, May 15-19). New algorithms for non-negative matrix factorization in applications to blind source separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-2006.

Cruces, S., Cichocki, A., & Castedo, L. (2000). An iterative inversion approach to blind source separation. IEEE Trans. on Neural Networks, 11(6), 1423-1437.

Cruces, S. A., Castedo, L., & Cichocki, A. (2002). Robust blind source separation algorithms using cumulants. Neurocomputing, 49, 87-118.

Cruces, S. A., & Cichocki, A. (2003, April). Combining blind source extraction with joint approximate diagonalization: Thin algorithms for ICA. In Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003) (pp. 463-468). Kyoto, Japan.

Cruces, S. A., Cichocki, A., & Amari, S. (2004). From blind signal extraction to blind instantaneous signal separation: criteria, algorithms and stability. IEEE Transactions on Neural Networks, Special Issue on Information Theoretical Learning, 15, 859-873.

DeKosky, S., & Marek, K. (2003). Looking backward to move forward: Early detection of neurodegenerative disorders. Science, 302(5646), 830-834.

Delorme, A., & Makeig, S. (2004). EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics. Journal of Neuroscience Methods, 134, 9-21.

Dhillon, I., & Sra, S. (2005). Generalized nonnegative matrix approximations with Bregman divergences. In NIPS - Neural Information Processing Systems, Vancouver, Canada.

Donoho, D. L., & Elad, M. (2004). Representation via l1 minimization. Proceedings of the National Academy of Sciences, 100, 2197-2202.

Fiori, S. (2003). A fully multiplicative orthogonal-group ICA neural algorithm. Electronics Letters, 39(24), 1737-1738.

Georgiev, P. G., & Cichocki, A. (2004). Sparse component analysis of overcomplete mixtures by improved basis pursuit method. In Proceedings of 2004 IEEE International Symposium on Circuits and Systems (ISCAS2004) (Vol. V, pp. 37-40). Vancouver, Canada.

Gharieb, R. R., & Cichocki, A. (2003). Second-order statistics based blind source separation using a bank of subband filters. Digital Signal Processing, 13, 252-274.


He, Z., & Cichocki, A. (2006). K-EVD clustering and its applications to sparse component analysis. In 6th International Conference on Independent Component Analysis and Blind Signal Separation (pp. 90-97). Springer LNCS 3889.

Himberg, J., Hyvarinen, A., & Esposito, F. (2004). Validating the independent components of neuroimaging time series via clustering and visualization. NeuroImage, 22(3), 1214-1222.

Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent Component Analysis. New York: John Wiley.

Hyvarinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13, 411-430.

Jahn, O., Cichocki, A., Ioannides, A., & Amari, S. (1999). Identification and elimination of artifacts from MEG signals using efficient independent components analysis. In Proc. of the 11th Int. Conference on Biomagnetism BIOMAG-98 (pp. 224-227). Sendai, Japan.

Jelic, V., Johansson, S., Almkvist, O., Shigeta, M., Julin, P., Nordberg, A., et al. (2000). Quantitative electroencephalography in mild cognitive impairment: longitudinal changes and possible prediction of Alzheimer's disease. Neurobiology of Aging, 21(4), 533-540.

Jeong, J. (2004). EEG dynamics in patients with Alzheimer's disease. Clinical Neurophysiology, 115(7), 1490-1505.

Jung, H.-Y., & Lee, S.-Y. (2000). On the temporal decorrelation of feature parameters for noise-robust speech recognition. IEEE Transactions on Speech and Audio Processing, 8(7), 407-416.

Jung, T., Makeig, S., Humphries, C., Lee, T.-W., McKeown, M., Iragui, V., et al. (2000). Removing electroencephalographic artifacts by blind source separation. Psychophysiology, 37, 167-178.

Karhunen, J., & Pajunen, P. (1997, June). Blind source separation and tracking using nonlinear PCA criterion: A least-squares approach. In Proc. 1997 International Conference on Neural Networks (ICNN'97) (Vol. 4, pp. 2147-2152). Houston, Texas, USA.

Kompass, R. (2005). A generalized divergence measure for nonnegative matrix factorization. In Neuroinformatics Workshop, Torun, Poland.

Kreutz-Delgado, K., Murray, J. F., Rao, B. D., Engan, K., Lee, T.-W., & Sejnowski, T. J. (2003). Dictionary learning algorithms for sparse representation. Neural Computation, 15(2), 349-396.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788-791.

Li, Y., Cichocki, A., & Amari, S. (2004, June). Analysis of sparse representation and blind source separation. Neural Computation, 16(6), 1193-1204.

Li, Y., Cichocki, A., & Amari, S. (2006). Blind estimation of channel parameters and source components for EEG signals: A sparse factorization approach. IEEE Transactions on Neural Networks, 17, 419-431.

Li, Y., Cichocki, A., Amari, S., Shishkin, S., Cao, J., & Gu, F. (2003). Sparse representation and its applications in blind source separation. In Seventeenth Annual Conference on Neural Information Processing Systems (NIPS-2003). Vancouver.

Lin, C.-J. (2005). Projected gradient methods for non-negative matrix factorization (Tech. Rep.). Department of Computer Science, National Taiwan University.

Meinecke, F., Ziehe, A., Kawanabe, M., & Muller, K.-R. (2002). A resampling approach to estimate the stability of one-dimensional or multidimensional independent components. NeuroImage, 49(13), 1514-1525.


Makeig, S., Debener, S., Onton, J., & Delorme, A. (2004). Mining event-related brain dynamics. Trends in Cognitive Science, 8, 204-210.

Makeig, S., Delorme, A., Westerfield, M., Townsend, J., Courchesne, E., & Sejnowski, T. (2004). Electroencephalographic brain dynamics following visual targets requiring manual responses. PLoS Biology, 2, 747-762.

Matsuoka, K., Ohya, M., & Kawamoto, M. (1995). A neural net for blind separation of nonstationary signals. Neural Networks, 8(3), 411-419.

Merritt, M., & Zhang, Y. (2004). An interior-point gradient method for large-scale totally nonnegative least squares problems (Tech. Rep.). Department of Computational and Applied Mathematics, Rice University, Houston, Texas, USA.

Minami, M., & Eguchi, S. (2002). Robust blind source separation by beta-divergence. Neural Computation, 14, 1859-1886.

Miwakeichi, F., Martinez-Montes, E., Valdes-Sosa, P. A., Nishiyama, N., Mizuhara, H., & Yamaguchi, Y. (2004). Decomposing EEG data into space-time-frequency components using Parallel Factor Analysis. NeuroImage, 22(3), 1035-1045.

Molgedey, L., & Schuster, H. (1994). Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, 72(23), 3634-3637.

Musha, T., Asada, T., Yamashita, F., Kinoshita, T., Chen, Z., Matsuda, H., et al. (2002). A new EEG method for estimating cortical neuronal impairment that is sensitive to early stage Alzheimer's disease. Clinical Neurophysiology, 113(7), 1052-1508.

Nishimori, Y. (1999). Learning algorithm for ICA by geodesic flows on orthogonal group. In International Joint Conference on Neural Networks (IJCNN'99) (Vol. 2, pp. 1625-1647).

Petersen, R. (2003). Mild Cognitive Impairment: Aging to Alzheimer's Disease. New York: Oxford University Press.

Petersen, R., Stevens, J., Ganguli, M., Tangalos, E., Cummings, J., & DeKosky, S. (2001). Practice parameter: Early detection of dementia: Mild cognitive impairment (an evidence-based review). Neurology, 56, 1133-1142.

Rosipal, R., Girolami, M., Trejo, L. J., & Cichocki, A. (2001). Kernel PCA for feature extraction and de-noising in nonlinear regression. Neural Computing & Applications, 10, 231-243.

Sajda, P., Du, S., & Parra, L. (2003). Recovery of constituent spectra using non-negative matrix factorization. In Proceedings of SPIE, Volume 5207, Wavelets: Applications in Signal and Image Processing (pp. 321-331).

Stone, J. (2001). Blind source separation using temporal predictability. Neural Computation, 13(7), 1559-1574.

Tanaka, T., & Cichocki, A. (2004). Subband decomposition independent component analysis and new performance criteria. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP2004) (Vol. V, pp. 541-544). Montreal, Canada.

Theis, F. J., Georgiev, P. G., & Cichocki, A. (2004). Robust overcomplete matrix recovery for sparse sources using a generalized Hough transform. In Proceedings of the 12th European Symposium on Artificial Neural Networks (ESANN2004) (pp. 223-232). Bruges, Belgium.

Tong, L., Liu, R.-W., Soon, V.-C., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Trans. on Circuits and Systems, 38(5), 499-509.


Vialatte, F., Cichocki, A., Dreyfus, G., Musha, T., Shishkin, S. L., & Gervais, R. (2005). Early diagnosis of Alzheimer's disease by blind source separation, time frequency representation, and bump modeling of EEG signals. In Artificial Neural Networks: Biological Inspirations - ICANN 2005, 15th International Conference, Warsaw, Poland, September 11-15, 2005, Proceedings (Vol. LNCS 3696, pp. 683-692).

Vorobyov, S., & Cichocki, A. (2002). Blind noise reduction for multisensory signals using ICA and subspace filtering, with application to EEG analysis. Biological Cybernetics, 86(4), 293-303.

Wagner, A. (2000). Early detection of Alzheimer's disease: An fMRI marker for people at risk? Nature Neuroscience, 10(3), 973-974.

Wang, Z., Lee, Y., Fiori, S., Leung, C.-S., & Zhu, Y.-S. (2003). An improved sequential method for principal component analysis. Pattern Recognition Letters, 24, 1409-1415.

Washizawa, Y., & Cichocki, A. (2006). On-line k-plane clustering learning algorithm for sparse component analysis. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-2006, Toulouse, France.

Zdunek, R., & Cichocki, A. (2006). Non-negative matrix factorization with quasi-Newton optimization. LNAI 4029, Springer, 870-879.

Zhang, L., Cichocki, A., & Amari, S. (2004). Multichannel blind deconvolution of nonminimum-phase systems using filter decomposition. IEEE Transactions on Signal Processing, 52(5), 1430-1442.

Zibulevsky, M., Kisilev, P., Zeevi, Y., & Pearlmutter, B. (2002). Blind source separation via multinode sparse representation. In Advances in Neural Information Processing Systems (NIPS2001) (pp. 185-191). Morgan Kaufmann.

Ziehe, A., Muller, K.-R., Nolte, G., Mackert, B.-M., & Curio, G. (2000). Artifact reduction in biomagnetic recordings based on time-delayed second order correlations. IEEE Trans. on Biomedical Engineering, 47, 75-87.

Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 265-286.