
Spatio-Temporal Structured Sparse Regression with Hierarchical Gaussian Process Priors

Danil Kuzin, Olga Isupova, and Lyudmila Mihaylova, Senior Member, IEEE

Abstract—This paper introduces a new sparse spatio-temporal structured Gaussian process regression framework for online and offline Bayesian inference. This is the first framework that gives a time-evolving representation of the interdependencies between the components of the sparse signal of interest. A hierarchical Gaussian process describes such structure, and the interdependencies are represented via the covariance matrices of the prior distributions. The inference is based on the expectation propagation method, and the theoretical derivation of the posterior distribution is provided in the paper. The inference framework is thoroughly evaluated over synthetic, real video and electroencephalography (EEG) data, where the spatio-temporal evolving patterns need to be reconstructed with high accuracy. It is shown that it achieves a 15% improvement of the F-measure compared with the alternating direction method of multipliers, the spatio-temporal sparse Bayesian learning method and the one-level Gaussian process model. Additionally, the required memory for the proposed algorithm is less than for the one-level Gaussian process model. This structured sparse regression framework is of broad applicability to source localisation and object detection problems with sparse signals.

I. INTRODUCTION

Sparse regression problems arise often in various applications, e.g., compressive sensing [1], EEG source localisation [2] and direction of arrival estimation [3]. In all these applications, a dictionary of basis functions can be constructed that allows sparse representations of the signals of interest, i.e. many of the coefficients of the basis functions are close to zero. This makes it possible to perform sensing tasks with a smaller number of observations than the signal dimensionality. However, the signal recovery problem becomes more computationally expensive when sparsity assumptions are incorporated.

The sparse signal representation can be expressed as a regression problem of finding a signal x given the vector y of observations and the design matrix A that satisfies the equation

y = Ax + ε, (1)

where ε is the Gaussian noise vector, ε ∼ N(ε; 0, σ²I), σ² is the variance and I is the identity matrix. Therefore, the observations also have a Gaussian distribution

y ∼ N(y; Ax, σ²I). (2)

When the number of observations is less than the number of coefficients, the problem is ill-posed in the sense that it has an infinite number of possible solutions and additional regularisation is required. This is usually achieved by imposing ℓp penalty functions with 0 ≤ p < 2 [4], [5], [6].

D. Kuzin and L. Mihaylova are with the Department of Automatic Control and Systems Engineering, the University of Sheffield, Sheffield, UK, e-mail: dkuzin1@sheffield.ac.uk, l.s.mihaylova@sheffield.ac.uk. O. Isupova is with the Department of Engineering Science, the University of Oxford, Oxford, UK, e-mail: [email protected].

In the compressive sensing literature, it has been shown that if a matrix A satisfies the restricted isometry property (RIP) [7], then a solution of a convex ℓ1-minimisation problem is equivalent to a solution of a sparse ℓ0-minimisation problem. However, the problem of identifying whether a given matrix satisfies the RIP is NP-hard [8]. In contrast, Bayesian models do not impose any restrictions on the matrix A and regularise the problem (1) with sparsity-inducing priors [9].

Bayesian models for sparse regression can be classified into models with a weak sparsity prior and a strong sparsity prior [10]. The weak sparsity prior leads to a unimodal posterior distribution of the signal with a sharp peak at zero; thus each coefficient has a high posterior probability of being close to zero. The strong sparsity prior is a mixture of latent binary variables that explicitly capture whether coefficients are zero or non-zero. In this paper we consider one type of strong sparsity priors: spike and slab models.

In spike and slab models, sparsity is achieved by selecting each component of x from a mixture of a spike distribution, that is, the delta function, and a slab distribution, that is, some flat distribution, usually a Gaussian with a large variance [11]. Following the Bayesian approach, latent variables that are indicators of spikes are added to the model [12] and a relevant distribution is placed over them [13]. Therefore, each signal component has an independent latent variable, which controls whether this component is a spike or a slab.

In many applications, the independence assumption is not valid [14], as non-zero elements tend to appear in groups, and an unknown structure often exists in the field of the latent variables. For example, wavelet coefficients of images are usually organised in trees [15], chromosomes have a spatial structure along a genome [16], and video from single-pixel cameras has a temporal structure [17]. In these cases it is useful to introduce additional hierarchical or group penalties that promote such structures in recovered signals.

A. Contributions

This paper proposes the spike and slab model with a hierarchical Gaussian process prior on the latent variables. Such a hierarchical prior makes it possible to model spatial structural dependencies for signal components that can evolve in time.

The model has a flexible structure which is governed only by the covariance functions of the Gaussian processes. This makes it possible to model different types of structures and does not require any specific knowledge about the structure, such as determination of particular groups of coefficients with similar behaviour. If, however, there is information about the structure, it can easily be incorporated into the covariance functions. The model is flexible as spatial and temporal dependencies are decoupled by different levels of the hierarchical Gaussian process prior. Therefore, the spatial and temporal structures are modelled independently, allowing different assumptions to be encoded for each type of structure. This reduces complexity and makes it possible to process streaming data.

Overall, the main contributions of this work consist in:
1) the proposed novel spike and slab model with the hierarchical Gaussian process prior for signal recovery with spatio-temporal structural dependencies;
2) the developed Bayesian inference algorithm based on expectation propagation;
3) the novel online inference algorithm for streaming data based on Bayesian filtering;
4) a thorough validation and evaluation of the proposed method over synthetic and real data, including the electrical activity data for the EEG source localisation problem and video data for the compressive background subtraction problem.

The paper is organised as follows. Section II reviews the related work. Section III provides an overview of existing spike and slab models. The proposed model and the inference algorithm are presented in Section IV. Section V demonstrates the online version of the algorithm. Section VI presents the complexity evaluation and numerical experiments. Section VII concludes the paper. Appendices provide theoretical derivations of the inference algorithm.

II. RELATED WORK

Different spatial structure assumptions for sparse models have been extensively studied in the literature. The group lasso [18], [19] extends the classical lasso method for group sparsity such that coefficients form groups and all coefficients in a group are either non-zero or zero together, but the groups are required to be defined in advance. In contrast to the group lasso, structural dependencies in our model are defined by the parameters of the covariance functions of the Gaussian processes (GPs) and the actual groups are inferred from the data.

Group constraints for weak sparse models include smooth relevance vector machines [20], spatio-temporal coupling of the parameters for the scale mixture of Gaussians representation [21], [22], row and element sparsity [23], and block sparsity [24].

For spike and slab priors a spatio-temporal structure is modelled with a one-level Gaussian process prior [25], where the prior is imposed on all locations of non-zero components together. The covariance matrix is represented as the Kronecker product of the temporal and spatial matrices.

In contrast to the one-level GP, our model introduces an additional level of a GP prior for temporal dependencies. Therefore, the temporal and spatial structures are decoupled. The proposed model is thus more flexible. Broadly speaking, the top-level GP can encode the slow change of the positions of groups of spikes in time, while the low-level GP models the local changes of each group. The one-level GP prior model also requires significantly more memory to store the covariance function for modelling both spatial and temporal structural dependencies, as it is built as a Kronecker product of spatial and temporal covariance matrices. The resulting size of the covariance matrix scales quadratically with the spatio-temporal dimensionality, which makes it infeasible even for average size problems, whereas for our model the total size of the two covariance matrices scales linearly.

More importantly, in the proposed model structural dependencies are considered at every timestamp, whereas in [25] the GP prior is imposed on the whole batch of data. This consideration of every timestamp allows us to develop an incremental inference algorithm: all latent variables are inferred for the new time moment in the same manner as for the offline inference. Meanwhile, it is unclear how to apply the one-level GP model to incremental data without re-processing the previous data.

GPs are widely used to model complex structures and dynamics in data, not only in sparse problems. In [26] a GP is used as a prior for nonlinear state transition and observation functions in state-space Bayesian filtering. Hierarchical GP models are proposed to model structures in [27].

III. SPARSE MODELS FOR STRUCTURED DATA

This section presents a roadmap of the models that are used in the formulation of the proposed spatio-temporal structured sparse model. It starts from the basic spike and slab model and continues with its extension for structured data.

The generative model for the spatio-temporal regression problem can be formulated in the following way:
• The data is collected for a sequence of T discrete timestamps, indexed by t ∈ [1, . . . , T].
• At each timestamp t the unknown signal of size N is denoted by xt = [x1t, . . . , xNt]⊤. Signals at all timestamps are concatenated into a matrix X = [x1, . . . , xT].
• The observations of size K are denoted by yt = [y1t, . . . , yKt]⊤. They are obtained with the design matrix A ∈ R^{K×N}. Observations at all timestamps are concatenated into a matrix Y = [y1, . . . , yT].
• Independent Gaussian noise with variance σ² is added to the observations.

The probabilistic model can then be expressed as

p(yt|xt) = N(yt; Axt, σ²I) ∀t. (3)

It is assumed that the dimensionality K of the observations yt is less than the dimensionality N of the signals xt; therefore the problem of recovering the signal xt from the observations yt is underdetermined and can have an infinite number of solutions. Sparsity-inducing priors make it possible to specify additional constraints that lead to a unique optimal solution.

A. Factor graphs

For Bayesian models, factor graphs are used to visualise complex distributions [28] in the form of undirected graphical models. They are also important for the approximate inference method described in Section IV.

The joint probability density function p(·) of latent variables ζi can be factorised as a product of factors ψC that are functions of a corresponding set of latent variables ζC:

p(ζ1, . . . , ζm) = (1/Z) ∏_C ψC(ζC), (4)

where Z is a normalisation constant. This factorisation can be represented as a bipartite graph with variable vertices corresponding to ζi, factor vertices corresponding to ψC, and edges connecting corresponding vertices.

The distribution of the latent variables xt in (3) can be represented as a factor

gt(xt) = N(yt; Axt, σ²I). (5)

Factor graphs are used in this paper to visualise the different spike and slab models. In Fig. 1 – 3 circles represent variable vertices and small squares represent factor vertices.

B. Spike and slab model

Sparsity can be induced with the spike and slab model [29], where additional latent variables Ω = {ωit}t=1:T, i=1:N indicate whether signal components xit are zero. This is represented as a mixture of a spike and a slab

p(xit|ωit) = ωit δ0(xit) + (1 − ωit) N(xit; 0, σ²x), (6)

where the spike δ0(·) is the delta function centred at zero, and the slab is a Gaussian distribution with variance σ²x. The conditional distributions p(xit|ωit) are further denoted by the factors fit(ωit, xit).

In this model {ωit}i=1:N are considered conditionally independent given xt. The prior is imposed on the indicators

p(ωit) = Ber(ωit; z), (7)

where Ber(·; z) denotes a Bernoulli distribution with the success probability parameter z. The prior distributions p(ωit) are further denoted by h_it^ind(ωit). The problem (5) – (7) can be solved independently for each t.
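As an illustration of the generative side of (5) – (7), the following minimal Python sketch draws a signal from the independent spike and slab prior and produces noisy observations. The dimensions and hyperparameter values are illustrative assumptions, not settings from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 100, 30                      # signal and observation sizes (illustrative)
    z, sigma_x, sigma = 0.9, 10.0, 0.1  # spike probability, slab std, noise std

    omega = rng.binomial(1, z, size=N)           # 1 -> spike (exact zero), eq. (7)
    x = np.where(omega == 1, 0.0,
                 rng.normal(0.0, sigma_x, size=N))  # slab draws, eq. (6)

    A = rng.normal(size=(K, N))                  # design matrix
    y = A @ x + rng.normal(0.0, sigma, size=K)   # observations, eqs. (1), (3)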

The model can be represented as a factor graph (Fig. 1) with a product of factors (5) – (7) for all t and i.

Fig. 1. Spike and slab model for one time moment (different time moments are independent). All signal components are conditionally independent given the data; therefore structural assumptions cannot be modelled.

The posterior p(X, Ω) of the latent variables X and Ω is

p = ∏_{t=1}^{T} [ gt(xt) ∏_{i=1}^{N} fit(ωit, xit) h_it^ind(ωit) ]. (8)

C. Spike and slab model with a spatial structure

A spatial structure can be implemented by adding interdependencies for the locations of spikes in xit [25], [30], [31]. This is achieved by modelling the probabilities of spikes with the additional latent variables Γ = [γ1, . . . , γT] = {γit}t=1:T, i=1:N that are samples from a Gaussian process. A Gaussian process is a way to specify a prior on functions; it can be defined as an infinite expansion of the multivariate Gaussian distribution. In a GP all finite subsets of variables have a joint Gaussian distribution. The properties of the structure are defined through the covariance function of the GP, which in this paper is assumed to be squared exponential:

p(γt) = N(γt; µt, Σ0),  Σ0(i, j) = αΣ exp(−(i − j)² / (2ℓ²Σ)), (9)

where µt is the mean vector and Σ0 is the covariance matrix with the hyperparameters αΣ and ℓ²Σ.
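For concreteness, a covariance matrix of the form (9) can be built and used to draw one spatial field γt as in the following sketch; the values of αΣ and ℓΣ and the jitter term are illustrative assumptions.

    import numpy as np

    def sq_exp_cov(n, alpha, length):
        """Squared exponential covariance over component indices, as in eq. (9)."""
        idx = np.arange(n)
        d2 = (idx[:, None] - idx[None, :]) ** 2
        return alpha * np.exp(-d2 / (2.0 * length ** 2))

    N = 100
    Sigma0 = sq_exp_cov(N, alpha=10.0, length=10.0)
    mu_t = np.zeros(N)                  # mean vector of the spatial GP
    jitter = 1e-8 * np.eye(N)           # numerical stabilisation
    gamma_t = np.random.default_rng(1).multivariate_normal(mu_t, Sigma0 + jitter)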

The conditional independence assumption for ωit from (7) is replaced by

p(ωit|γit) = Ber(ωit; Φ(γit)), (10)
p(γt) = N(γt; µt, Σ0), (11)

where Φ(·) is the standard Gaussian cumulative distribution function (cdf). Scaling is required to normalise probabilities to the [0, 1] interval, and it is convenient to use Φ(·) for this purpose in derivations with GPs [32]. The conditional distributions p(ωit|γit) are denoted by the factors hit(ωit, γit). The prior distributions p(γt) are denoted by r_t^ind(γt). In this model {γt}t=1:T are independent and therefore the problem can be solved separately for each timestamp. Using the introduced factors (5), (6) and (10) – (11), the factor graph can be built as in Fig. 2. The posterior p(X, Ω, Γ) of the latent variables is given by

p = ∏_{t=1}^{T} [ gt(xt) ∏_{i=1}^{N} [ fit(ωit, xit) hit(ωit, γit) ] rt(γt) ]. (12)

Fig. 2. Spike and slab model with a spatial structure for one time moment. The locations of spikes have a GP distribution, therefore encouraging a structure in space, but they are independent in time.

IV. THE PROPOSED SPATIO-TEMPORAL STRUCTURED SPIKE AND SLAB MODEL

In this paper a spatio-temporal latent structure of the positions of non-zero signal components is considered for the underdetermined recovery problem (3). The following assumptions are introduced:
1) xt is sparse, i.e. it contains many zeros at each timestamp t;
2) non-zero elements in xt are clustered in groups at each timestamp t;
3) these groups can move and evolve in time.

This recovery problem is addressed with the hierarchical Bayesian approach. As in Section III-B, the first assumption can be implemented in the model using the spike and slab prior (6).

Similarly to Section III-C, the second model assumption can be implemented by adding spatial dependencies for the positions of spikes in xit. This is achieved by modelling the probabilities of spikes Ω with the scaled GP on Γ (10), (11). GPs specify a prior over an unknown structure. This is particularly useful as it avoids the specification of any structural patterns: the only parameter for structural modelling is the GP covariance function.

The third condition is addressed with the dynamic hierarchical GP prior. The mean M = [µ1, . . . , µT] for the spatial GP evolves over time according to the top-level temporal GP

µt ∼ N(µt; µt−1, W),  W(i, j) = αW exp(−(i − j)² / (2ℓ²W)), (13)

where W is the squared exponential covariance matrix of the temporal GP with the hyperparameters αW and ℓ²W.

This makes it possible to implicitly specify a prior over the evolution function of the structure. The rate of the evolution is controlled by the top-level GP covariance function.

According to these assumptions, the model can be expressed as a factor graph (Fig. 3), where the factor rt(γt, µt) denotes N(γt; µt, Σ0) and the factor ut(µt, µt−1) denotes N(µt; µt−1, W).

The full posterior distribution p(X, Ω, Γ, M) is then

p = ∏_{t=1}^{T} [ gt(xt) ∏_{i=1}^{N} [ fit(xit, ωit) hit(ωit, γit) ] rt(γt, µt) ] × ∏_{t=2}^{T} ut(µt, µt−1). (14)

The exact posterior for the proposed hierarchical spike and slab model is intractable; therefore approximate inference methods should be used. In this paper expectation propagation (EP) [33] is employed. EP is shown to be the most effective Bayesian inference method for sparse modelling [34].

In this section the description of the EP method and the key components of the inference for the proposed model are presented. The details of the inference algorithm can be found in the appendices.

A. Expectation propagation

EP is a deterministic inference method that approximates the posterior distribution using the factor decomposition (4), where each factor is approximated with a distribution ψ̃C(·) from the exponential family:

p̃(ζ1, . . . , ζm) = (1/Z̃) ∏_C ψ̃C(ζC), (15)

where p̃ is the approximating distribution and Z̃ is a normalisation constant. The approximating factorised distribution is determined by minimisation of the Kullback-Leibler (KL) divergence from the true distribution. The KL-divergence is a common measure of similarity between distributions.

Direct approximation is intractable due to the intractability of the true posterior. Minimisation of the KL divergence between individual factors ψC and ψ̃C may not provide a good approximation for the resulting product. In EP, the approximation of each factor is performed in the context of the other factors to improve the result for the final product. Iteratively, one of the factors is chosen for refinement. The chosen factor ψ̃C is refined to minimise the KL-divergence between the product ψC ∏_{C′≠C} ψ̃C′ and the approximation q ∝ ψ̃C ∏_{C′≠C} ψ̃C′, i.e. in the first product the approximating factor is replaced with the corresponding factor from the true posterior.

Factor refinement consists of five steps, which are summarised below (with details given in Appendices B-E).
1) Compute a cavity distribution q^{\C} ∝ q / ψ̃C: the joint distribution without the factor ψ̃C.
2) Compute a tilted distribution ψC q^{\C}: the product of the cavity distribution and the true factor.
3) Refine the approximation q: q* = argmin KL(ψC q^{\C} || q) by minimising the KL-divergence between the tilted distribution ψC q^{\C} and the approximating distribution q. This is equivalent to matching the moments of the distributions [33].
4) Compute an updated factor ψ̃C^new ∝ q* / q^{\C} using the refined approximation and the cavity distribution.
5) Update the current joint posterior q^new ∝ ψ̃C^new ∏_{C′≠C} ψ̃C′ with the newly updated factor ψ̃C^new.
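As an illustration of step 3, moment matching has a closed form when the true factor is the probit Φ(·) used in (10) and the cavity is a univariate Gaussian; the sketch below reproduces this standard computation (as in EP for GP classification). It illustrates the generic refinement loop and is not the paper's full multivariate update.

    import numpy as np
    from scipy.stats import norm

    def refine_probit_factor(m_cav, v_cav):
        """One EP refinement for a true factor psi(gamma) = Phi(gamma) with a
        Gaussian cavity N(gamma; m_cav, v_cav): match the first two moments
        of the tilted distribution Phi(gamma) * N(gamma; m_cav, v_cav)."""
        z = m_cav / np.sqrt(1.0 + v_cav)
        ratio = norm.pdf(z) / norm.cdf(z)     # can underflow for very negative z
        m_new = m_cav + v_cav * ratio / np.sqrt(1.0 + v_cav)             # tilted mean
        v_new = v_cav - v_cav ** 2 * ratio * (z + ratio) / (1.0 + v_cav)  # tilted var
        # Step 4: updated factor = refined approximation / cavity (natural params).
        prec_factor = 1.0 / v_new - 1.0 / v_cav
        shift_factor = m_new / v_new - m_cav / v_cav
        return m_new, v_new, prec_factor, shift_factor

    m, v, pf, sf = refine_probit_factor(m_cav=0.3, v_cav=2.0)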

B. Approximating factors

Here the key components of the EP inference algorithm for the proposed model are provided. The true posterior p (14) is approximated with the distribution q

q = ∏_{t=1}^{T} qgt qft qht qrt qut, (16)

where each factor qa, a ∈ {gt, ft, ht, rt, ut}, is from the exponential family and all latent variables are separated in the factors.

Fig. 3. Proposed spike and slab model with a spatio-temporal structure. The locations of spikes have a GP distribution in space with parameters that are controlled by a top-level GP and they evolve in time, therefore promoting temporal dependence.

Below the factors qa of the approximating posterior q are introduced. Gaussian and Bernoulli distributions are used in the factors, whose parameters are updated during the iterations of the EP algorithm.

The factors gt = N(yt; Axt, σ²I) from (5) can be viewed as distributions of xt with fixed observed variables yt: qgt = N(xt; mgt, Vgt), where mgt = (A⊤A)^{−1}A⊤yt and Vgt = σ²(A⊤A)^{−1}.
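The factor qgt is the least-squares Gaussian implied by the likelihood; a minimal sketch follows. For K < N the matrix A⊤A is singular, so the Moore-Penrose pseudo-inverse is used here as a pragmatic stand-in, which is an assumption of this sketch rather than the paper's formula.

    import numpy as np

    def likelihood_factor(A, y_t, sigma2):
        """q_gt = N(x_t; m_gt, V_gt) with m_gt = (A^T A)^{-1} A^T y_t and
        V_gt = sigma^2 (A^T A)^{-1}; a pseudo-inverse is used when
        A^T A is singular (K < N)."""
        AtA_pinv = np.linalg.pinv(A.T @ A)
        return AtA_pinv @ A.T @ y_t, sigma2 * AtA_pinv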

The factors ft = ∏_{i=1}^{N} fit and ht = ∏_{i=1}^{N} hit from (6) and (10) are approximated with products of Gaussian and Bernoulli distributions

qft = N(xt; mft, Vft) ∏_{i=1}^{N} Ber(ωit; Φ(zfit)), (17)

qht = N(γt; νht, Sh) ∏_{i=1}^{N} Ber(ωit; Φ(zhit)), (18)

where the components of xt and γt are independent. Therefore, the covariance matrices Vft and Sh are diagonal¹. The distribution parameters mft, Vft, zfit, νht, Sh, and zhit are updated during the EP iterations according to Appendices B and C.

The approximation for the factors rt = N(γt; µt, Σ0) and ut = N(µt; µt−1, W) from (9) and (13) is intended to separate the latent variables, and it is represented as products of Gaussian distributions

qrt = N(γt; νrt, Sr) N(µt; ert, Dr), (19)

qut = N(µt−1; eut←, Du←) N(µt; eut→, Du→). (20)

The distribution parameters ert, Dr, νrt, Sr, eut←, Du←, eut→, and Du→ are updated during the EP iterations according to Appendices D and E.

¹Note that Sh does not depend on time. In this paper, single covariance matrices are used for all time moments for both GP variables γ and µ in the approximating factors. However, the method can be applied with individual covariance matrices for each time moment as well.

The posterior approximation q given by (16) thus contains products of Gaussian and Bernoulli distributions that are equal to unnormalised Gaussian and Bernoulli distributions, respectively (Appendix A). This can be conveniently expressed in terms of the natural parameters, and q can be represented in terms of distributions of the latent variables.

For xt in q this product property leads to the Gaussian distribution N(xt; mt, Vt) with natural parameters

Vt^{−1} = Vgt^{−1} + Vft^{−1},  Vt^{−1} mt = Vgt^{−1} mgt + Vft^{−1} mft. (21)

Similarly, γt in q is distributed as N(γt; νt, S), where the natural parameters are

S^{−1} = Sh^{−1} + Sr^{−1},  S^{−1} νt = Sh^{−1} νht + Sr^{−1} νrt. (22)
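The products of Gaussian factors in (21) – (23) reduce to additions of natural parameters; a minimal sketch, with illustrative example factors:

    import numpy as np

    def combine_gaussians(ms, Vs):
        """Product of Gaussian factors N(m_a, V_a): precisions and
        precision-weighted means add, as in eqs. (21)-(23)."""
        prec = sum(np.linalg.inv(V) for V in Vs)
        eta = sum(np.linalg.inv(V) @ m for m, V in zip(ms, Vs))
        V = np.linalg.inv(prec)
        return V @ eta, V   # combined mean and covariance

    m_g, V_g = np.zeros(3), np.eye(3)
    m_f, V_f = np.ones(3), 2.0 * np.eye(3)
    m_t, V_t = combine_gaussians([m_g, m_f], [V_g, V_f])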

The top GP latent variables µt have the Gaussian distributions N(µt; et, D) with natural parameters

D^{−1} = Dr^{−1} + D_{u→}^{−1} 1_{t>1} + D_{u←}^{−1} 1_{t<T}, (23a)
D^{−1} et = Dr^{−1} ert + D_{u→}^{−1} eut→ 1_{t>1} + D_{u←}^{−1} eut+1← 1_{t<T}, (23b)

where 1 is the indicator function.

The distributions for ωt are ∏_{i=1}^{N} Ber(ωit; Φ(zit)) with the parameters

zit = Φ^{−1}( [ (1 − Φ(zfit))(1 − Φ(zhit)) / (Φ(zfit) Φ(zhit)) + 1 ]^{−1} ). (24)

The full approximating posterior q is then

q = ∏_{t=1}^{T} N(xt; mt, Vt) ∏_{t=1}^{T} ∏_{i=1}^{N} Ber(ωit; Φ(zit)) × ∏_{t=1}^{T} N(γt; νt, S) ∏_{t=1}^{T} N(µt; et, D). (25)

In the EP inference algorithm, each of the introduced approximating factors qft, qht, qrt, qut is iteratively updated according to the factor refinement procedure described in Section IV-A. Note that the factors qgt are not updated, as the corresponding factors gt from the true posterior distribution are already from the exponential family.


C. Implementation details

There are no theoretical guarantees of EP convergence. However, it can be achieved using damping [35]: during step 4 of the factor refinement procedure in Section IV-A the factor is updated as qa^damp = (qa^new)^η (qa^old)^{1−η}, where qa^old is the value of the factor from the previous iteration, qa^new is the updated value of the factor, and η ∈ (0, 1] is the damping coefficient. It is exponentially decreased as η = η^old ξ after each iteration, where ξ ∈ (0, 1] is the parameter that governs the speed of the exponential decrease and η^old is the value of the damping coefficient from the previous iteration.
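In the exponential family this damping rule amounts to a convex combination of natural parameters. A minimal sketch for a univariate Gaussian factor is given below; the factor values are illustrative, while the η and ξ values are those reported for the synthetic data in Table I.

    import numpy as np

    def damped_gaussian_update(m_old, v_old, m_new, v_new, eta):
        """q_damp = (q_new)^eta * (q_old)^(1-eta): the natural parameters are
        interpolated linearly with the damping coefficient eta."""
        prec = eta / v_new + (1.0 - eta) / v_old
        shift = eta * m_new / v_new + (1.0 - eta) * m_old / v_old
        return shift / prec, 1.0 / prec   # damped mean and variance

    eta, xi = 0.999, 0.9999   # synthetic-data values from Table I
    m, v = damped_gaussian_update(0.0, 1.0, 0.5, 0.8, eta)
    eta *= xi                 # exponential decrease after each iteration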

It is also known that negative variances can appear during the EP updates [34]. In this case the negative variances are replaced with a large value representing +∞.

V. ONLINE INFERENCE WITH BAYESIAN FILTERING

In this section the problem (3) is considered for streaming data, i.e. when new data become available at every timestamp. The conventional batch inference can be infeasible for large or streaming data. The developed online Bayesian filtering algorithm for the model presented in Section IV makes it possible to iteratively update the approximation of x based on new samples of data.

Bayesian filtering consists of two steps that are iterated for each new sample of data:
• prediction, where an estimate of the hidden system state at the next time step is predicted based on the observations available at the current time moment;
• update, where this estimate is updated once an observation at the next time moment is obtained.

In the proposed model the hidden state is represented by the latent variables xt, ωt, γt and µt, which should be inferred based on the observations yt.

A. Prediction

At the prediction step for the timestamp t + 1 the current estimate of the posterior distribution of the latent variables p(xt, ωt, γt, µt|y1:t) is available. It is based on all observations y1:t = [y1, . . . , yt] up to the timestamp t. The initial estimate of this posterior can be obtained by the offline inference algorithm applied to the initial Tinit timestamps.

Marginalisation of the latent variables for the current timestamp t provides predictions for the latent variables at the next timestamp t + 1:

p(xt+1, ωt+1, γt+1, µt+1|y1:t) = ∫ p(xt+1, ωt+1, γt+1, µt+1|xt, ωt, γt, µt) p(xt, ωt, γt, µt|y1:t) dxt dωt dγt dµt. (26)

The first term in the integral (26) is factorised according to the generative model (5), (6), (10), and (13):

p(xt+1, ωt+1, γt+1, µt+1|xt, ωt, γt, µt) = p(xt+1|ωt+1) p(ωt+1|γt+1) p(γt+1|µt+1) p(µt+1|µt). (27)

Therefore, the terms related to the variables xt+1, ωt+1 and γt+1 are independent of the integration variables in (26), and the integral can be rewritten as

∫ p(xt+1, ωt+1, γt+1, µt+1|xt, ωt, γt, µt) p(xt, ωt, γt, µt|y1:t) dxt dωt dγt dµt = p(xt+1|ωt+1) p(ωt+1|γt+1) p(γt+1|µt+1) × ∫ p(µt+1|µt) p(µt|y1:t) dµt. (28)

The initial estimate of the posterior p(µTinit|y1:Tinit) obtained from the offline EP algorithm is a Gaussian distribution:

p(µTinit|y1:Tinit) = N(µTinit; e1:Tinit, D1:Tinit), (29)

where e1:Tinit and D1:Tinit are the mean and the covariance matrix of the estimate of the posterior for µTinit obtained based on the observations y1:Tinit.

According to the generative model (13), the first term of the integral in (28) is also Gaussian; therefore the integral is also a Gaussian distribution on µt+1 for t = Tinit:

∫ p(µt+1|µt) p(µt|y1:t) dµt = N(µt+1; e1:t, D^predict_{1:t}) ≜ p(µt+1), (30)

where D^predict_{1:t} = W + D1:t is the covariance of the predicted distribution.

Substitution of (28) and (30) back into (26) provides the predicted distribution:

p(xt+1, ωt+1, γt+1, µt+1|y1:t) = p(xt+1|ωt+1) p(ωt+1|γt+1) p(γt+1|µt+1) p(µt+1). (31)

B. Update

At the update step the predicted distribution (31) of the latent variables for the next timestamp is corrected with the new data yt+1:

p(xt+1, ωt+1, γt+1, µt+1|y1:t+1)
= (1/Z) p(yt+1|xt+1, ωt+1, γt+1, µt+1) p(xt+1, ωt+1, γt+1, µt+1|y1:t)
= (1/Z) p(yt+1|xt+1) p(xt+1|ωt+1) p(ωt+1|γt+1) p(γt+1|µt+1) p(µt+1), (32)

where Z is the normalisation constant.

Since the components of the vectors xt+1 and ωt+1 are conditionally independent, the terms p(xt+1|ωt+1) and p(ωt+1|γt+1) are further factorised:

p(xt+1, ωt+1, γt+1, µt+1|y1:t+1) = (1/Z) p(yt+1|xt+1) [ ∏_{i=1}^{N} p(xit+1|ωit+1) p(ωit+1|γit+1) ] × p(γt+1|µt+1) p(µt+1). (33)

The resulting formula for the update (33) is the same as the posterior distribution (14), with the only exception being the term related to µt. The approximation of this posterior is proposed in Section IV. The algorithm is only required to be adjusted for the new factor p(µt+1).

The factor p(µt+1) is a Gaussian distribution, i.e. it is already from the exponential family, and it only depends on a single latent variable; therefore this factor should not be updated in the EP iterations. The information from this factor is passed through the general approximating distribution q to the other factors.

In the EP algorithm used for inference of the updated distribution (33), the distribution for µt is approximated with a Gaussian distribution for any t. Therefore, the identity (30) holds for any t and the whole procedure can be applied for all timestamps.
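A schematic of the resulting filtering loop for the top-level mean µt is sketched below; ep_update stands for the EP procedure of Section IV applied to (33) and is an assumed routine, not implemented here.

    import numpy as np

    def filtering_step(e, D, W, y_next, ep_update):
        """One Bayesian filtering step for the top-level GP mean.
        Prediction (30): the predicted covariance is W + D, the mean is kept.
        Update (33): EP is run with N(mu; e, W + D) as the new factor p(mu)."""
        D_predict = W + D                               # eq. (30)
        e_new, D_new = ep_update(y_next, e, D_predict)  # eq. (33), assumed routine
        return e_new, D_new

    # e, D come from offline EP on the first T_init timestamps (eq. (29));
    # for y_next in stream: e, D = filtering_step(e, D, W, y_next, ep_update)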

C. Minibatch filtering

The developed Bayesian filtering procedure can easily be extended to the case of inferring minibatches for the timestamps [t + 1 : t + M], where M is the size of a minibatch:

p(xt+1:t+M, ωt+1:t+M, γt+1:t+M, µt+1:t+M | y1:t+M), (34)

rather than for the next timestamp t + 1 only, as in (33). Indeed, due to conditional independence, the marginalisation (26) again comes down to the integral (30), similar to (28). The update step can also be performed by the EP algorithm, with the only difference that it is applied to M timestamps rather than one.

VI. EXPERIMENTS

This section presents validation and evaluation results for the proposed algorithms. The performance of these two-level GP algorithms is compared with:
• the spatio-temporal spike and slab model with a one-level GP prior and its modification with the common precision approximation [25];
• the popular alternating direction method of multipliers (ADMM) [36], a convex optimisation method used here for the lasso problem [4];
• the spatio-temporal sparse Bayesian learning (STSBL) algorithm [37].

For quantitative comparison, the following measures are used:
• NMSE (normalised mean square error) = ‖X − X̂‖²_F / ‖X‖²_F, where X is the true signal, X̂ is the estimate, computed as the mean of the approximated posterior distribution, and ‖·‖_F is the Frobenius norm of a matrix;
• F-measure [13] = 2 (precision · recall) / (precision + recall) between the non-zero elements of the true signal X and the non-zero elements of the estimate X̂.

The NMSE shows the normalised error of signal reconstruction, with 0 corresponding to an ideal match. The F-measure shows how well slab locations are restored. An F-measure equal to 1 means that the true and estimated signals coincide, whilst 0 corresponds to a lack of similarity between them. Arguably, for the sparse regression problem, the NMSE is less meaningful than the F-measure [38].
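Both measures are straightforward to compute from the signal estimate; in the following sketch the zero-detection threshold tol is an illustrative assumption.

    import numpy as np

    def nmse(X_true, X_est):
        """Normalised mean square error in the Frobenius norm."""
        return np.linalg.norm(X_true - X_est, 'fro') ** 2 \
            / np.linalg.norm(X_true, 'fro') ** 2

    def f_measure(X_true, X_est, tol=1e-6):
        """F-measure between the supports (non-zero patterns) of the signals."""
        s_true, s_est = np.abs(X_true) > tol, np.abs(X_est) > tol
        tp = np.sum(s_true & s_est)
        precision = tp / max(np.sum(s_est), 1)
        recall = tp / max(np.sum(s_true), 1)
        return 2 * precision * recall / max(precision + recall, 1e-12)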

Fig. 4. Examples of the true signal X for the synthetic data. In each example two groups of slabs generated at t = 1 evolve in time until t = 50.

Both the two-level and one-level GP algorithms are iterated until convergence, which is measured by the difference in the estimate of the signal X at the current and previous iterations.

A. Synthetic data

In this experiment, the algorithm performance is studied on synthetic data with known true values of the signal X and the slab locations Ω. The synthetic data represents signals that have groups of non-zero elements slowly evolving in time. To create a spatio-temporal structure of slabs, at the first timestamp t = 1 two groups of slab locations are generated with Poisson-distributed sizes for the signal xt of dimensionality N = 100. Then, from t = 2 to t = T = 50, these groups randomly evolve: each border of each group can go up, down, or stay at the same location, with probabilities such that on average the sparsity level remains 95%. In this way, the locations of the slab groups are generated. The values of the non-zero elements of the signal are then drawn from the distribution N(0, 10⁴). This procedure is repeated 10 times to generate 10 data samples. Examples of the generated X are shown in Fig. 4.
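The generation procedure can be sketched as follows; the initial group sizes, the Poisson mean and the border-move probabilities are illustrative assumptions standing in for the exact values used in the paper.

    import numpy as np

    rng = np.random.default_rng(2)
    N, T, sigma_x = 100, 50, 100.0   # dimensionality, timestamps, slab std (10^4 var)

    # Two initial groups of slab locations with Poisson-distributed sizes.
    starts = rng.choice(N - 10, size=2, replace=False)
    groups = [[s, s + 1 + rng.poisson(2.0)] for s in starts]

    X = np.zeros((N, T))
    for t in range(T):
        for g in groups:
            # Each border moves up, down, or stays (illustrative probabilities).
            g[0] = int(np.clip(g[0] + rng.choice([-1, 0, 1]), 0, N - 2))
            g[1] = int(np.clip(g[1] + rng.choice([-1, 0, 1]), g[0] + 1, N - 1))
            X[g[0]:g[1], t] = rng.normal(0.0, sigma_x, size=g[1] - g[0])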

The elements of the design matrix A are generated as independent and identically distributed (iid) samples from the standard Gaussian. For each of the data samples, observations Y = AX of different length K are generated. The value K/N is referred to as the undersampling ratio. It changes from 10% to 55%.

The algorithms are evaluated in terms of the average F-measure, NMSE and time² (Fig. 5) on this data. On the interval between 10% and 20% of the undersampling ratio, both inference methods for the two-level GP model and the full EP inference for the one-level GP model show competitive results in terms of the accuracy metrics, while outperforming the other methods. On the interval between 20% and 30% of the undersampling ratio, the inference methods for the one- and two-level GP models are already able to perfectly reconstruct the sparse signal, while both ADMM and STSBL show less accurate results. STSBL achieves perfect reconstruction starting from the undersampling ratio 30%, and ADMM achieves these results starting from the undersampling ratio 50%.

In the proposed EP algorithm for the two-level GP model (Section IV), the complexity of each iteration is O(N³T), as matrices of size N × N are inverted for each timestamp to compute the cavity distributions for the factors u and r. In the proposed online inference algorithm (Section V), the offline version is first trained on Tinit timestamps. Then, when new data of size M is available, the previous results are used as the prior and the complexity of the update is O(N³M), while in the offline version it is O(N³(Tinit + M)).

²Time is evaluated with a 4.2 GHz Intel Core i7 CPU and 16 GB RAM.

On average, the proposed two-level GP algorithm requires a number of iterations for convergence similar to the full one-level GP algorithm: approximately 30 iterations on the interval between 10% and 20% of the undersampling ratio, 15 iterations on the interval between 20% and 30%, and fewer than 10 iterations for the higher undersampling ratios. The approximate inference algorithm for the one-level GP model takes slightly more iterations to converge.

In the one-level GP algorithm [25] the complexity of one iteration is O(N³T³). This is related to the inversion of the full spatio-temporal covariance matrix. It is addressed with low-rank and common precision approximations [25], which reduce both the computational complexity and the quality of the results. The K-rank approximation, where K is a parameter of the algorithm, reduces the computational complexity to O(N²KT), and the common precision approximation reduces it to O(N²T + T²N).

In terms of the computational time, the full EP inference for the one-level GP model is the slowest method. The approximated inference for the one-level GP model significantly improves its performance in terms of the computational time, while also causing a loss in accuracy. The ADMM method shows similar results to the approximated one-level GP model in terms of the computational time, but has an even bigger loss in terms of both accuracy measures. STSBL takes slightly more time for the lower values of the undersampling ratio, which helps it to achieve better results than the ADMM method in terms of the accuracy measures. The proposed offline and online inference methods for the two-level GP model demonstrate a satisfactory trade-off between computational time and accuracy. They obtain accuracy results competitive with the full EP inference for the one-level GP model while requiring significantly less computational time. In terms of computational time the proposed method demonstrates results competitive with the STSBL method.

The proposed online inference method for the two-level GP model saves computational time while preserving the accuracy of the recovered signal. Note that the developed inference methods for the two-level GP model outperform the competitors in the lowest undersampling ratio interval, i.e. they require fewer measurements to achieve the same quality as the other algorithms.

B. Real data: moving object detection in video

The considered methods for sparse regression are compared on the problem of object detection in video sequences. The Convoy dataset [39] is used, where a background frame is subtracted from each video frame. As moving objects occupy only part of a frame, the considered signal of the subtracted video frames is sparse. Moreover, objects are represented as clusters of pixels, which evolve in time. Therefore, the background subtraction application fully satisfies the proposed spatio-temporal structured model assumptions.

Fig. 5. Performance of the algorithms (two-level GP, two-level GP online, one-level GP full, one-level GP approx, ADMM, STSBL) on the synthetic data: (a) F-measure, (b) NMSE, (c) time. Note that the NMSE plot has a logarithmic y-axis. As the convergence criterion is ‖X̂new − X̂old‖∞ / ‖X̂old‖∞ < 10⁻³, values below 10⁻³ are less significant. The proposed algorithms, referred to as two-level GP and two-level GP online, outperform the others in the 10−20% interval, where the number of observations is the lowest.

The frames with subtracted background are resized to 32×32 pixels and reshaped as vectors xt ∈ R^N, N = 1024. The number of frames in the dataset is T = 260. The sparse observations are obtained as Y = AX, where A ∈ R^{K×N} is a matrix with iid Gaussian elements. 10 different random design matrices A are used to generate 10 data samples. The number of observations K is chosen such that the undersampling ratio K/N changes from 10% to 55%. This procedure corresponds to compressive sensing observations [40].

Fig. 6. Performance of the algorithms (two-level GP, two-level GP online, one-level GP approx, ADMM, STSBL) on the Convoy data: (a) F-measure, (b) NMSE. The proposed algorithms, referred to as two-level GP and two-level GP online, outperform the others in the 20−30% interval. On the interval 10−15% none of the methods can reconstruct the true signal. The NMSE plot shows that the proposed algorithms underperform the competitors for values higher than 30%, but the visual difference in performance becomes insignificant, as demonstrated in Fig. 7.

For this problem the full EP inference for the one-level GP model is infeasible due to its memory requirements; therefore only the common precision approximated inference for the one-level GP model is considered.

The average F-measure and NMSE obtained by all the algorithms on the Convoy data are presented in Fig. 6. The proposed algorithm shows the best results for the undersampling ratio 20−30%. For larger values of the undersampling ratio all the algorithms provide close, almost ideal reconstruction results.

Fig. 7 presents a reconstructed sample frame from the Convoy data. For all the algorithms, the reconstruction results are provided for the undersampling ratio 10%, where the proposed algorithms slightly underperform the competitors in terms of the quality metrics; for the undersampling ratio 20%, where the proposed algorithm outperforms the competitors both in terms of NMSE and the F-measure; and for the undersampling ratio 40%, where the proposed algorithms show a slightly higher NMSE. It is clearly seen that for the undersampling ratio 10% the difference in the quality metrics is insignificant, since none of the methods is able to reconstruct the signal. STSBL represents an exceptional example, but the frame reconstructed by this method still contains a considerable amount of noise. For the undersampling ratio 20% the proposed method provides a clean reconstructed frame, in contrast to the reconstructed frames of all the competitors, which are noisier. Meanwhile, for the undersampling ratio 40% the difference between the reconstruction results of all four algorithms is not remarkable.

Note that, similar to the synthetic data experiment, the proposed algorithms obtain the best results for the lowest undersampling ratio values where the reconstruction is reasonable, i.e. they require a smaller number of observations.

C. Real data: EEG source localisation

The third experiment is devoted to the EEG source localisation problem.

The goal of the non-invasive EEG source localisation problem is to find the 3D locations of dipoles such that their electromagnetic field coincides with the field measured by electrodes on the human head cortex. This is important, for example, for localisation of active areas in human-brain interfaces and treatment of neurological disorders [41], [42]. This problem is ill-posed in the sense that there exists an infinite number of possible active areas inside the brain that could produce the same field on the head cortex. To regularise the problem, we use the idea that slab locations are distributed in space and temporally evolve, similar to [43]. A similar idea applies to MEG source localisation [44].

Using the earlier introduced notation, the EEG source localisation problem is stated as

yt = Axt + εt, ∀t ∈ [1, . . . , T], (35)

where yt ∈ R^K is the vector containing observations of potential differences taken from K = 69 electrodes placed on a human head cortex, A ∈ R^{K×N} is the lead field matrix corresponding to N/3 = 272 voxels, and xt ∈ R^N is the signal, that is, the current density of dipole activation.

Here xt represents the dipole moments corresponding to the grid locations:

xt = [x1x, x1y, x1z, x2x, x2y, x2z, . . . , x_{N/3}z]⊤. (36)

For each grid voxel i inside the brain with location coordinates loc(i) = (xi, yi, zi), the corresponding dipole moments (xix, xiy, xiz) along the 3D axes are considered.

We employ the following covariance function, which promotes close values for collinear dipole moments corresponding to close grid positions:

K(i, j) = αK exp(−d(i, j)² / (2ℓ²K)), K ∈ {Σ0, W}, (37)

where the distance is computed as

d(i, j) = 0, if the axes for dipole moments i, j are different; ‖loc(i) − loc(j)‖²₂, otherwise. (38)

Hyperparameters are selected so that the sampled potential differences have behaviour similar to the provided data. The data and lead field matrix for the experiments are processed with EEGLAB [45]. We use the data provided in EEGLAB for the source localisation problem with annotated events.

Fig. 7. Sample frame with reconstruction results from sparse observations for the Convoy data. (a), (f): the original and static background non-compressed frames; (k): object detection results based on the non-compressed frame difference (the static background frame is subtracted from the original frame); (b), (g), (l): reconstruction of the compressed object detection results by the proposed online two-level GP method; (c), (h), (m): by the one-level GP method; (d), (i), (n): by the ADMM method; (e), (j), (o): by the STSBL method. (b), (c), (d), and (e) show the results for the undersampling rate 10%, where all the algorithms fail to reconstruct the true signal. (g), (h), (i), and (j) show the reconstruction for the undersampling rate 20%, where the difference in performance between the algorithms is visible, while for the undersampling rate 40% ((l), (m), (n), and (o)) the reconstruction results are indistinguishable in quality.

Fig. 8. Dipoles located by the proposed offline two-level GP method for the EEG source localisation problem: (a) 1 ms after the event, (b) 170 ms after the event. There is no brain response immediately after the event, and (a) shows a reconstructed active brain area that remains active during the whole period and is not related to the event, while (b) shows the reconstructed active area when the brain response to the event is detected.

Figure 8 presents the dipoles located by the proposed method for the fourth event at two given time moments. The first time moment is taken right after the event happened, when there is no response to it in the brain activity yet. The second time moment is chosen when the response is detected. Figure 9 shows the comparison of the measured and restored potential differences by the proposed algorithm.

Fig. 9. Reconstruction of the EEG signal by the proposed offline two-level GP method: (a) measured EEG, (b) reconstructed EEG (AX). As the true active dipole areas are not known, reconstruction quality is based on the observations Y. The reconstructed EEG has a lower magnitude, potentially because noise has been taken into account.

The true signal X is unknown for the EEG source localisation problem; therefore, the NMSE between the observations yt and the reconstruction Axt is used for the quantitative comparison in this experiment. The obtained results for all the algorithms around the time of the brain response are presented in Fig. 10. The proposed two-level GP algorithms show the best results among the competitors. Both the proposed offline and online inference methods demonstrate similar performance. Note that in this experiment the undersampling ratio is approximately 8%, which confirms that the proposed method is able to provide better results for lower values of the undersampling ratio.

Fig. 10. NMSE between yt and Axt during the brain response time for two-level GP, two-level GP online, one-level GP approx, and ADMM. The proposed algorithms, referred to as two-level GP and two-level GP online, have the lowest NMSE among the compared methods.

TABLE I
TWO-LEVEL GP HYPERPARAMETERS

Parameter | Synthetic | Convoy | EEG
σ²x       | 10⁴       | 160    | 4·10⁵
σ²        | 10⁻⁴      | 4      | 10⁻³
η         | 0.999     | 0.99   | 0.9
ξ         | 0.9999    | 0.999  | 0.8
ℓW        | 15        | 15     | 22.17
ℓΣ        | 10        | 10     | 0.2217
αW        | 10        | 10     | 10⁻²
αΣ        | 10        | 10     | 0.05

D. Parameters selection

For the proposed algorithm and for the one-level GP, the parameters η and ξ are grid-optimised to make the comparison fair. The prior shape hyperparameters ℓΣ, ℓW, αΣ, αW and the variances σ²x and σ² are specified so that the sampled data has the same form as the training data. ADMM and STSBL use the default values of their parameters. The selected hyperparameter values for the proposed algorithm for all datasets are presented in Table I.

VII. CONCLUSIONS

This paper proposes a new hierarchical Gaussian process model of spatio-temporal structure representation with complex temporal evolution in sparse Bayesian inference methods. This is achieved using a flexible hierarchical GP prior for the spike and slab model, where spatial and temporal structural dependencies are encoded by different levels of the prior. Offline and online methods are developed for posterior inference in this model.

We show that the introduced model can be applied in different areas such as compressive sensing and EEG source localisation. The results show the superiority of the proposed method in comparison with the non-hierarchical GP method, the alternating direction method of multipliers and the spatio-temporal sparse Bayesian learning method. The developed algorithms demonstrate better performance both in terms of signal value reconstruction and localisation of non-zero signal components: within the low-measurement range they achieve around a 15% improvement in terms of slab localisation quality.

Acknowledgments: The authors would like to thank the support from the EC Seventh Framework Programme [FP7 2013-2017] TRAcking in compleX sensor systems (TRAX), Grant agreement no. 607400.

REFERENCES

[1] M. F. Duarte and Y. C. Eldar, “Structured compressed sensing: From theory to applications,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4053–4085, 2011.

[2] I. F. Gorodnitsky and B. D. Rao, “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm,” IEEE Transactions on Signal Processing, vol. 45, no. 3, pp. 600–616, 1997.

[3] J. Yin and T. Chen, “Direction-of-arrival estimation using a sparse representation of array covariance vectors,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4489–4493, 2011.

[4] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.

[5] D. Malioutov, M. Çetin, and A. S. Willsky, “A sparse signal reconstruction perspective for source localization with sensor arrays,” IEEE Transactions on Signal Processing, vol. 53, no. 8, pp. 3010–3022, 2005.

[6] A. Carmi, P. Gurfil, and D. Kanevsky, “Methods for sparse signal recovery using Kalman filtering with embedded pseudo-measurement norms and quasi-norms,” IEEE Transactions on Signal Processing, vol. 58, no. 4, pp. 2405–2409, 2010.

[7] E. J. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.

[8] A. M. Tillmann and M. E. Pfetsch, “The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing,” IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 1248–1259, 2014.

[9] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” The Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.

[10] S. Mohamed, K. Heller, and Z. Ghahramani, “Bayesian and L1 approaches to sparse unsupervised learning,” in Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 751–758.

[11] T. J. Mitchell and J. J. Beauchamp, “Bayesian variable selection in linear regression,” Journal of the American Statistical Association, vol. 83, no. 404, pp. 1023–1032, 1988.

[12] N. G. Polson and J. G. Scott, “Shrink globally, act locally: Sparse Bayesian regularization and prediction,” Bayesian Statistics, vol. 9, pp. 501–538, 2010.

[13] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[14] F. Bach, R. Jenatton, J. Mairal, G. Obozinski et al., “Structured sparsity through convex optimization,” Statistical Science, vol. 27, no. 4, pp. 450–468, 2012.

[15] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed. Academic Press, 2008.

[16] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

[17] J. Yang, X. Yuan, X. Liao, P. Llull, D. Brady, G. Sapiro, and L. Carin, “Video compressive sensing using Gaussian mixture models,” IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4863–4878, 2014.

[18] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.

[19] P. Sprechmann, I. Ramirez, G. Sapiro, and Y. C. Eldar, “C-HiLasso: A collaborative hierarchical sparse modeling framework,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4183–4198, 2011.

[20] A. Schmolck, “Smooth relevance vector machines,” Ph.D. dissertation, University of Exeter, 2008.

[21] M. A. Van Gerven, B. Cseke, F. P. De Lange, and T. Heskes, “Efficient Bayesian multivariate fMRI analysis using a sparsifying spatio-temporal prior,” NeuroImage, vol. 50, no. 1, pp. 150–161, 2010.

[22] A. Wu, M. Park, O. O. Koyejo, and J. W. Pillow, “Sparse Bayesian structure learning with dependent relevance determination priors,” in Advances in Neural Information Processing Systems 27, 2014, pp. 1628–1636.


[23] W. Chen, D. Wipf, Y. Wang, Y. Liu, and I. J. Wassell, “Simultaneous Bayesian sparse approximation with structured sparse models,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6145–6159, 2016.

[24] Z. Zhang and B. D. Rao, “Sparse signal recovery with temporally correlated source vectors using sparse Bayesian learning,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, pp. 912–926, 2011.

[25] M. R. Andersen, A. Vehtari, O. Winther, and L. K. Hansen, “Bayesian inference for spatio-temporal spike and slab priors,” arXiv preprint arXiv:1509.04752, 2015.

[26] M. Deisenroth and S. Mohamed, “Expectation propagation in Gaussian process dynamical systems,” in Advances in Neural Information Processing Systems, 2012, pp. 2609–2617.

[27] N. D. Lawrence and A. J. Moore, “Hierarchical Gaussian process latent variable models,” in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 481–488.

[28] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.

[29] E. I. George and R. E. McCulloch, “Variable selection via Gibbs sampling,” Journal of the American Statistical Association, vol. 88, no. 423, pp. 881–889, 1993.

[30] B. E. Engelhardt and R. P. Adams, “Bayesian structured sparsity from Gaussian fields,” arXiv e-prints, 2014.

[31] Q. Wu, Y. D. Zhang, M. G. Amin, and B. Himed, “High-resolution passive SAR imaging exploiting structured Bayesian compressive sensing,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 8, pp. 1484–1497, 2015.

[32] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.

[33] T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, 2001, pp. 362–369.

[34] J. M. Hernandez-Lobato, D. Hernandez-Lobato, and A. Suarez, “Expectation propagation in linear regression models with spike-and-slab priors,” Machine Learning, vol. 99, no. 3, pp. 437–487, 2015.

[35] T. Minka and J. Lafferty, “Expectation-propagation for the generative aspect model,” in Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, 2002, pp. 352–359.

[36] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.

[37] Z. Zhang, T.-P. Jung, S. Makeig, Z. Pi, and B. D. Rao, “Spatiotemporal sparse Bayesian learning with applications to compressed sensing of multichannel physiological signals,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 22, no. 6, pp. 1186–1197, 2014.

[38] B. Xin, Y. Wang, W. Gao, D. Wipf, and B. Wang, “Maximal sparsity with deep networks?” in Advances in Neural Information Processing Systems, 2016, pp. 4340–4348.

[39] G. Warnell, S. Bhattacharya, R. Chellappa, and T. Basar, “Adaptive-rate compressive sensing using side information,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3846–3857, 2015.

[40] V. Cevher, A. Sankaranarayanan, M. F. Duarte, D. Reddy, and R. G. Baraniuk, “Compressive sensing for background subtraction,” in Proceedings of the 10th European Conference on Computer Vision, 2008, pp. 155–168.

[41] M. A. Jatoi, N. Kamel, A. S. Malik, I. Faye, and T. Begum, “A survey of methods used for source localization using EEG signals,” Biomedical Signal Processing and Control, vol. 11, pp. 42–52, 2014.

[42] S. Baillet, J. C. Mosher, and R. M. Leahy, “Electromagnetic brain mapping,” IEEE Signal Processing Magazine, vol. 18, no. 6, pp. 14–30, 2001.

[43] S. Baillet and L. Garnero, “A Bayesian approach to introducing anatomo-functional priors in the EEG/MEG inverse problem,” IEEE Transactions on Biomedical Engineering, vol. 44, no. 5, pp. 374–385, 1997.

[44] A. Solin, P. Jylänki, J. Kauramäki, T. Heskes, M. A. van Gerven, and S. Särkkä, “Regularizing solutions to the MEG inverse problem using space-time separable covariance functions,” arXiv preprint arXiv:1604.04931, 2016.

[45] A. Delorme and S. Makeig, “EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis,” Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, 2004.

APPENDIX A
PRODUCT AND QUOTIENT RULES

EP updates are based on products and quotients of distributions. This section presents the product and quotient rules for Gaussian and Bernoulli distributions.

A. Product of Gaussians

A product of two Gaussian distributions is an unnormalised Gaussian distribution
$$\mathcal{N}(\mathbf{x}; \mathbf{m}_1, \boldsymbol{\Sigma}_1)\,\mathcal{N}(\mathbf{x}; \mathbf{m}_2, \boldsymbol{\Sigma}_2) \propto \mathcal{N}(\mathbf{x}; \mathbf{m}, \boldsymbol{\Sigma}),$$
where
$$\boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}_1^{-1} + \boldsymbol{\Sigma}_2^{-1}, \qquad \boldsymbol{\Sigma}^{-1}\mathbf{m} = \boldsymbol{\Sigma}_1^{-1}\mathbf{m}_1 + \boldsymbol{\Sigma}_2^{-1}\mathbf{m}_2.$$

B. Quotient of Gaussians

A quotient of two Gaussian distributions is an unnormalised Gaussian distribution³
$$\frac{\mathcal{N}(\mathbf{x}; \mathbf{m}_1, \boldsymbol{\Sigma}_1)}{\mathcal{N}(\mathbf{x}; \mathbf{m}_2, \boldsymbol{\Sigma}_2)} \propto \mathcal{N}(\mathbf{x}; \mathbf{m}, \boldsymbol{\Sigma}),$$
where
$$\boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1}, \qquad \boldsymbol{\Sigma}^{-1}\mathbf{m} = \boldsymbol{\Sigma}_1^{-1}\mathbf{m}_1 - \boldsymbol{\Sigma}_2^{-1}\mathbf{m}_2.$$
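Both rules act on natural parameters: precisions add (product) or subtract (quotient), and so do the precision-weighted means. A minimal numpy sketch of the two rules above (the function names are ours):

```python
import numpy as np

def gaussian_product(m1, S1, m2, S2):
    """Product rule: precisions add, precision-weighted means add."""
    prec = np.linalg.inv(S1) + np.linalg.inv(S2)   # Sigma^{-1}
    S = np.linalg.inv(prec)                        # Sigma
    m = S @ (np.linalg.solve(S1, m1) + np.linalg.solve(S2, m2))
    return m, S

def gaussian_quotient(m1, S1, m2, S2):
    """Quotient rule: precisions subtract; the result can lose positive
    semidefiniteness (footnote 3), so no validity check is performed."""
    prec = np.linalg.inv(S1) - np.linalg.inv(S2)
    S = np.linalg.inv(prec)
    m = S @ (np.linalg.solve(S1, m1) - np.linalg.solve(S2, m2))
    return m, S
```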

C. Product of Bernoulli

A product of two Bernoulli distributions is an unnormalised Bernoulli distribution
$$\mathrm{Ber}(x; \Phi(z_1))\,\mathrm{Ber}(x; \Phi(z_2)) \propto \mathrm{Ber}(x; \Phi(t(z_1, z_2))),$$
where
$$t(z_1, z_2) = \Phi^{-1}\!\left(\left[\frac{(1 - \Phi(z_1))(1 - \Phi(z_2))}{\Phi(z_1)\Phi(z_2)} + 1\right]^{-1}\right).$$

D. Quotient of Bernoulli

A quotient of two Bernoulli distributions is an unnormalised Bernoulli distribution
$$\frac{\mathrm{Ber}(x; \Phi(z_1))}{\mathrm{Ber}(x; \Phi(z_2))} \propto \mathrm{Ber}(x; \Phi(d(z_1, z_2))),$$
where
$$d(z_1, z_2) = \Phi^{-1}\!\left(\left[\frac{(1 - \Phi(z_1))\Phi(z_2)}{(1 - \Phi(z_2))\Phi(z_1)} + 1\right]^{-1}\right).$$
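On the probabilities $p_k = \Phi(z_k)$, both Bernoulli rules have the same odds-ratio form; a scipy sketch of $t(z_1, z_2)$ and $d(z_1, z_2)$ as defined above (the function names are ours):

```python
from scipy.stats import norm

def bernoulli_product_z(z1, z2):
    """t(z1, z2): Ber(x; Phi(z1)) Ber(x; Phi(z2)) is prop. to Ber(x; Phi(t))."""
    p1, p2 = norm.cdf(z1), norm.cdf(z2)
    return norm.ppf(1.0 / ((1 - p1) * (1 - p2) / (p1 * p2) + 1.0))

def bernoulli_quotient_z(z1, z2):
    """d(z1, z2): Ber(x; Phi(z1)) / Ber(x; Phi(z2)) is prop. to Ber(x; Phi(d))."""
    p1, p2 = norm.cdf(z1), norm.cdf(z2)
    return norm.ppf(1.0 / ((1 - p1) * p2 / ((1 - p2) * p1) + 1.0))
```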

³ Although the quotient can lose positive semidefiniteness, we will still refer to it as a Gaussian distribution.


APPENDIX B
EP UPDATE FOR FACTOR $f_{it}$

A. Cavity distribution

The unnormalised cavity distribution $q^{\setminus q_{f_{it}}}(x_{it}, \omega_{it}) = \frac{q(x_{it}, \omega_{it})}{q_{f_{it}}(x_{it}, \omega_{it})}$ can be computed as
$$q^{\setminus q_{f_{it}}} = \frac{\mathcal{N}(x_{it}; \mathbf{m}_t(i), \mathbf{V}_t(i,i))\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}))}{\mathcal{N}(x_{it}; \mathbf{m}_{f_t}(i), \mathbf{V}_{f_t}(i,i))\,\mathrm{Ber}(\omega_{it}; \Phi(z_{f_{it}}))} \propto \mathcal{N}(x_{it}; m_{it}^{\setminus f}, v_{it}^{\setminus f})\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}^{\setminus f})),$$
where
$$(v_{it}^{\setminus f})^{-1} = \mathbf{V}_t^{-1}(i,i) - \mathbf{V}_{f_t}^{-1}(i,i),$$
$$(v_{it}^{\setminus f})^{-1} m_{it}^{\setminus f} = \mathbf{V}_t^{-1}(i,i)\,\mathbf{m}_t(i) - \mathbf{V}_{f_t}^{-1}(i,i)\,\mathbf{m}_{f_t}(i),$$
$$z_{it}^{\setminus f} = z_{h_{it}}.$$

B. Moments matching

The moments of the tilted distribution $q^{\setminus q_{f_{it}}} f_{it}$ are
$$Z_{it} = \Phi(z_{it}^{\setminus f})\,\mathcal{N}(0; m_{it}^{\setminus f}, v_{it}^{\setminus f}) + \bigl(1 - \Phi(z_{it}^{\setminus f})\bigr)\,\mathcal{N}(0; m_{it}^{\setminus f}, v_{it}^{\setminus f} + \sigma_x^2),$$
$$\mathbb{E}x_{it} = \frac{1 - \Phi(z_{it}^{\setminus f})}{Z_{it}}\,\mathcal{N}(0; m_{it}^{\setminus f}, v_{it}^{\setminus f} + \sigma_x^2)\,\frac{m_{it}^{\setminus f}\sigma_x^2}{v_{it}^{\setminus f} + \sigma_x^2},$$
$$\mathbb{E}x_{it}^2 = \frac{1 - \Phi(z_{it}^{\setminus f})}{Z_{it}}\,\mathcal{N}(0; m_{it}^{\setminus f}, v_{it}^{\setminus f} + \sigma_x^2)\left(\frac{(m_{it}^{\setminus f})^2\sigma_x^4}{(v_{it}^{\setminus f} + \sigma_x^2)^2} + \frac{v_{it}^{\setminus f}\sigma_x^2}{v_{it}^{\setminus f} + \sigma_x^2}\right),$$
$$\mathbb{E}\omega_{it} = \frac{\Phi(z_{it}^{\setminus f})}{Z_{it}}\,\mathcal{N}(0; m_{it}^{\setminus f}, v_{it}^{\setminus f}).$$

The new approximation $q^*(x_{it}, \omega_{it})$ is
$$q^* = \mathcal{N}(x_{it}; m_{it}^{q^*}, v_{it}^{q^*})\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}^{q^*})),$$
where
$$m_{it}^{q^*} = \mathbb{E}x_{it}, \qquad v_{it}^{q^*} = \mathbb{E}x_{it}^2 - (\mathbb{E}x_{it})^2, \qquad z_{it}^{q^*} = \Phi^{-1}(\mathbb{E}\omega_{it}).$$
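A minimal scalar sketch of this moment-matching step (the function name is ours; the slab-branch normaliser is written with $v_{it}^{\setminus f} + \sigma_x^2$, consistent with $Z_{it}$ above):

```python
import numpy as np
from scipy.stats import norm

def ep_moments_f(m_cav, v_cav, z_cav, sigma2_x):
    """Tilted moments for the spike-and-slab factor f_it: the branch with
    mass Phi(z_cav) is the spike at zero, the other the slab N(0, sigma2_x)."""
    phi = norm.cdf(z_cav)
    spike = norm.pdf(0.0, loc=m_cav, scale=np.sqrt(v_cav))
    slab = norm.pdf(0.0, loc=m_cav, scale=np.sqrt(v_cav + sigma2_x))
    Z = phi * spike + (1 - phi) * slab
    shrink = sigma2_x / (v_cav + sigma2_x)  # posterior shrinkage under the slab
    Ex = (1 - phi) * slab * m_cav * shrink / Z
    Ex2 = (1 - phi) * slab * ((m_cav * shrink) ** 2 + v_cav * shrink) / Z
    Ew = phi * spike / Z
    # parameters of the new approximation q*
    return Ex, Ex2 - Ex ** 2, norm.ppf(Ew)
```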

C. Factor update

The new factor approximation $q_{f_{it}}^{\mathrm{new}}(x_{it}, \omega_{it}) = \frac{q^*(x_{it}, \omega_{it})}{q^{\setminus q_{f_{it}}}(x_{it}, \omega_{it})}$ can be computed as
$$q_{f_{it}}^{\mathrm{new}} = \frac{\mathcal{N}(x_{it}; m_{it}^{q^*}, v_{it}^{q^*})\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}^{q^*}))}{\mathcal{N}(x_{it}; m_{it}^{\setminus f}, v_{it}^{\setminus f})\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}^{\setminus f}))} \propto \mathcal{N}(x_{it}; \mathbf{m}_{f_t}^{\mathrm{new}}(i), \mathbf{V}_{f_t}^{\mathrm{new}}(i,i))\,\mathrm{Ber}(\omega_{it}; \Phi(z_{f_{it}}^{\mathrm{new}})),$$
where
$$(\mathbf{V}_{f_t}^{\mathrm{new}})^{-1}(i,i) = (v_{it}^{q^*})^{-1} - (v_{it}^{\setminus f})^{-1},$$
$$(\mathbf{V}_{f_t}^{\mathrm{new}})^{-1}(i,i)\,\mathbf{m}_{f_t}^{\mathrm{new}}(i) = (v_{it}^{q^*})^{-1} m_{it}^{q^*} - (v_{it}^{\setminus f})^{-1} m_{it}^{\setminus f},$$
$$z_{f_{it}}^{\mathrm{new}} = d(z_{it}^{q^*}, z_{it}^{\setminus f}).$$
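This step is a direct application of the quotient rules of Appendix A; a self-contained scalar sketch (the function name is ours):

```python
from scipy.stats import norm

def ep_factor_update_f(m_q, v_q, z_q, m_cav, v_cav, z_cav):
    """New factor approximation q_f^new = q* / cavity: the Gaussian quotient
    rule on (m, v) and the Bernoulli quotient d(z1, z2) on z."""
    v_f = 1.0 / (1.0 / v_q - 1.0 / v_cav)    # precisions subtract
    m_f = v_f * (m_q / v_q - m_cav / v_cav)  # precision-weighted means subtract
    p_q, p_cav = norm.cdf(z_q), norm.cdf(z_cav)
    z_f = norm.ppf(1.0 / ((1 - p_q) * p_cav / ((1 - p_cav) * p_q) + 1.0))
    return m_f, v_f, z_f
```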

APPENDIX C
EP UPDATE FOR FACTOR $h_{it}$

A. Cavity distribution

The unnormalised cavity distribution $q^{\setminus q_{h_{it}}}(\gamma_{it}, \omega_{it}) = \frac{q(\gamma_{it}, \omega_{it})}{q_{h_{it}}(\gamma_{it}, \omega_{it})}$ can be computed as
$$q^{\setminus q_{h_{it}}} = \frac{\mathcal{N}(\gamma_{it}; \boldsymbol{\nu}_t(i), \mathbf{S}_t(i,i))\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}))}{\mathcal{N}(\gamma_{it}; \boldsymbol{\nu}_{h_t}(i), \mathbf{S}_h(i,i))\,\mathrm{Ber}(\omega_{it}; \Phi(z_{h_{it}}))} \propto \mathcal{N}(\gamma_{it}; \nu_{it}^{\setminus h}, s_{it}^{\setminus h})\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}^{\setminus h})),$$
where
$$(s_{it}^{\setminus h})^{-1} = \mathbf{S}_t^{-1}(i,i) - \mathbf{S}_h^{-1}(i,i),$$
$$(s_{it}^{\setminus h})^{-1} \nu_{it}^{\setminus h} = \mathbf{S}_t^{-1}(i,i)\,\boldsymbol{\nu}_t(i) - \mathbf{S}_h^{-1}(i,i)\,\boldsymbol{\nu}_{h_t}(i),$$
$$z_{it}^{\setminus h} = z_{f_{it}}.$$

B. Moments matching

The moments of the tilted distribution $q^{\setminus q_{h_{it}}} h_{it}$ are
$$Z_{it} = \Phi(z_{it}^{\setminus h})\Phi(a) + \bigl(1 - \Phi(z_{it}^{\setminus h})\bigr)\bigl(1 - \Phi(a)\bigr),$$
$$\mathbb{E}\gamma_{it} = \frac{1}{Z_{it}}\Bigl(\Phi(z_{it}^{\setminus h})\,K + \bigl(1 - \Phi(z_{it}^{\setminus h})\bigr)\bigl(\nu_{it}^{\setminus h} - K\bigr)\Bigr),$$
$$\mathbb{E}\gamma_{it}^2 = \frac{1}{Z_{it}}\Biggl[\bigl(2\Phi(z_{it}^{\setminus h}) - 1\bigr)\Biggl((\nu_{it}^{\setminus h})^2\Phi(a) + s_{it}^{\setminus h}\Phi(a) + \frac{2\nu_{it}^{\setminus h} s_{it}^{\setminus h}\,\mathcal{N}(a; 0, 1)}{\sqrt{1 + s_{it}^{\setminus h}}} - \frac{(s_{it}^{\setminus h})^2\, a\,\mathcal{N}(a; 0, 1)}{1 + s_{it}^{\setminus h}}\Biggr) + \bigl(1 - \Phi(z_{it}^{\setminus h})\bigr)\bigl(s_{it}^{\setminus h} + (\nu_{it}^{\setminus h})^2\bigr)\Biggr],$$
$$\mathbb{E}\omega_{it} = \frac{\Phi(z_{it}^{\setminus h})\Phi(a)}{Z_{it}},$$
where
$$a = \frac{\nu_{it}^{\setminus h}}{\sqrt{1 + s_{it}^{\setminus h}}}, \qquad K = \frac{s_{it}^{\setminus h}\,\mathcal{N}(a; 0, 1)}{\sqrt{1 + s_{it}^{\setminus h}}} + \nu_{it}^{\setminus h}\Phi(a).$$

The new approximation $q^*(\gamma_{it}, \omega_{it})$ is
$$q^* = \mathcal{N}(\gamma_{it}; \nu_{it}^{q^*}, s_{it}^{q^*})\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}^{q^*})),$$
where
$$\nu_{it}^{q^*} = \mathbb{E}\gamma_{it}, \qquad s_{it}^{q^*} = \mathbb{E}\gamma_{it}^2 - (\mathbb{E}\gamma_{it})^2, \qquad z_{it}^{q^*} = \Phi^{-1}(\mathbb{E}\omega_{it}).$$
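The corresponding scalar sketch for this factor, transcribing the formulae above (the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def ep_moments_h(nu_cav, s_cav, z_cav):
    """Tilted moments for the factor h_it linking gamma_it and omega_it."""
    a = nu_cav / np.sqrt(1 + s_cav)
    K = s_cav * norm.pdf(a) / np.sqrt(1 + s_cav) + nu_cav * norm.cdf(a)
    phi_z, phi_a = norm.cdf(z_cav), norm.cdf(a)
    Z = phi_z * phi_a + (1 - phi_z) * (1 - phi_a)
    Eg = (phi_z * K + (1 - phi_z) * (nu_cav - K)) / Z
    Eg2 = ((2 * phi_z - 1) * (nu_cav ** 2 * phi_a + s_cav * phi_a
            + 2 * nu_cav * s_cav * norm.pdf(a) / np.sqrt(1 + s_cav)
            - s_cav ** 2 * a * norm.pdf(a) / (1 + s_cav))
           + (1 - phi_z) * (s_cav + nu_cav ** 2)) / Z
    Ew = phi_z * phi_a / Z
    # parameters of the new approximation q*
    return Eg, Eg2 - Eg ** 2, norm.ppf(Ew)
```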

C. Factor update

The new factor approximation $q_{h_{it}}^{\mathrm{new}}(\gamma_{it}, \omega_{it}) = \frac{q^*(\gamma_{it}, \omega_{it})}{q^{\setminus q_{h_{it}}}(\gamma_{it}, \omega_{it})}$ can be computed as
$$q_{h_{it}}^{\mathrm{new}} = \frac{\mathcal{N}(\gamma_{it}; \nu_{it}^{q^*}, s_{it}^{q^*})\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}^{q^*}))}{\mathcal{N}(\gamma_{it}; \nu_{it}^{\setminus h}, s_{it}^{\setminus h})\,\mathrm{Ber}(\omega_{it}; \Phi(z_{it}^{\setminus h}))} \propto \mathcal{N}(\gamma_{it}; \boldsymbol{\nu}_{h_t}^{\mathrm{new}}(i), \mathbf{S}_h^{\mathrm{new}}(i,i))\,\mathrm{Ber}(\omega_{it}; \Phi(z_{h_{it}}^{\mathrm{new}})),$$
where
$$(\mathbf{S}_h^{\mathrm{new}})^{-1}(i,i) = (s_{it}^{q^*})^{-1} - (s_{it}^{\setminus h})^{-1},$$
$$(\mathbf{S}_h^{\mathrm{new}})^{-1}(i,i)\,\boldsymbol{\nu}_{h_t}^{\mathrm{new}}(i) = (s_{it}^{q^*})^{-1}\nu_{it}^{q^*} - (s_{it}^{\setminus h})^{-1}\nu_{it}^{\setminus h},$$
$$z_{h_{it}}^{\mathrm{new}} = d(z_{it}^{q^*}, z_{it}^{\setminus h}).$$

APPENDIX D
EP UPDATE FOR FACTOR $r_t$

A. Cavity distribution

The unnormalised cavity distribution $q^{\setminus q_{r_t}}(\boldsymbol{\gamma}_t, \boldsymbol{\mu}_t) = \frac{q(\boldsymbol{\gamma}_t, \boldsymbol{\mu}_t)}{q_{r_t}(\boldsymbol{\gamma}_t, \boldsymbol{\mu}_t)}$ can be computed as
$$q^{\setminus q_{r_t}} = \frac{\mathcal{N}(\boldsymbol{\gamma}_t; \boldsymbol{\nu}_t, \mathbf{S})\,\mathcal{N}(\boldsymbol{\mu}_t; \mathbf{e}_t, \mathbf{D})}{\mathcal{N}(\boldsymbol{\gamma}_t; \boldsymbol{\nu}_{r_t}, \mathbf{S}_r)\,\mathcal{N}(\boldsymbol{\mu}_t; \mathbf{e}_{r_t}, \mathbf{D}_r)} \propto \mathcal{N}(\boldsymbol{\gamma}_t; \boldsymbol{\nu}_t^{\setminus r}, \mathbf{S}^{\setminus r})\,\mathcal{N}(\boldsymbol{\mu}_t; \mathbf{e}_t^{\setminus r}, \mathbf{D}^{\setminus r}),$$
where
$$(\mathbf{S}^{\setminus r})^{-1} = \mathbf{S}^{-1} - \mathbf{S}_r^{-1}, \qquad (\mathbf{S}^{\setminus r})^{-1}\boldsymbol{\nu}_t^{\setminus r} = \mathbf{S}^{-1}\boldsymbol{\nu}_t - \mathbf{S}_r^{-1}\boldsymbol{\nu}_{r_t},$$
$$(\mathbf{D}^{\setminus r})^{-1} = \mathbf{D}^{-1} - \mathbf{D}_r^{-1}, \qquad (\mathbf{D}^{\setminus r})^{-1}\mathbf{e}_t^{\setminus r} = \mathbf{D}^{-1}\mathbf{e}_t - \mathbf{D}_r^{-1}\mathbf{e}_{r_t}.$$

B. Update for the factor $q_{r_t}^{\mathrm{new}}$

For the factor $q_{r_t}$, the parameters of the Gaussian distributions found during the moment-matching step cancel out during the factor-update step, and the resulting formulae are
$$q_{r_t}^{\mathrm{new}}(\boldsymbol{\gamma}_t, \boldsymbol{\mu}_t) \propto \mathcal{N}(\boldsymbol{\gamma}_t; \boldsymbol{\nu}_{r_t}^{\mathrm{new}}, \mathbf{S}_r^{\mathrm{new}})\,\mathcal{N}(\boldsymbol{\mu}_t; \mathbf{e}_{r_t}^{\mathrm{new}}, \mathbf{D}_r^{\mathrm{new}}),$$
where
$$\mathbf{S}_r^{\mathrm{new}} = \mathbf{D}^{\setminus r} + \boldsymbol{\Sigma}_0, \qquad \boldsymbol{\nu}_{r_t}^{\mathrm{new}} = \mathbf{e}_t^{\setminus r},$$
$$\mathbf{D}_r^{\mathrm{new}} = \mathbf{S}^{\setminus r} + \boldsymbol{\Sigma}_0, \qquad \mathbf{e}_{r_t}^{\mathrm{new}} = \boldsymbol{\nu}_t^{\setminus r}.$$
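Because the moment-matching terms cancel, the update reduces to swapping the two cavity means and adding the spatial prior covariance $\boldsymbol{\Sigma}_0$ to each cavity covariance; a minimal sketch (the function name is ours; the arguments are numpy arrays):

```python
def ep_update_r(nu_cav, S_cav, e_cav, D_cav, Sigma0):
    """Closed-form update for the factor r_t: the gamma- and mu-messages
    swap means, and Sigma0 is added to the opposite cavity covariance."""
    S_new, nu_new = D_cav + Sigma0, e_cav  # new gamma-message
    D_new, e_new = S_cav + Sigma0, nu_cav  # new mu-message
    return nu_new, S_new, e_new, D_new
```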

APPENDIX E
EP UPDATE FOR FACTOR $u_t$

A. Cavity distribution

The unnormalised cavity distribution $q^{\setminus q_{u_t}}(\boldsymbol{\mu}_{t-1}, \boldsymbol{\mu}_t) = \frac{q(\boldsymbol{\mu}_{t-1}, \boldsymbol{\mu}_t)}{q_{u_t}(\boldsymbol{\mu}_{t-1}, \boldsymbol{\mu}_t)}$ can be computed as
$$q^{\setminus q_{u_t}} = \frac{\mathcal{N}(\boldsymbol{\mu}_{t-1}; \mathbf{e}_{t-1}, \mathbf{D})\,\mathcal{N}(\boldsymbol{\mu}_t; \mathbf{e}_t, \mathbf{D})}{\mathcal{N}(\boldsymbol{\mu}_{t-1}; \mathbf{e}_{u_t\leftarrow}, \mathbf{D}_{u\leftarrow})\,\mathcal{N}(\boldsymbol{\mu}_t; \mathbf{e}_{u_t\rightarrow}, \mathbf{D}_{u\rightarrow})} \propto \mathcal{N}(\boldsymbol{\mu}_{t-1}; \mathbf{e}_{t-1}^{\setminus u}, \mathbf{D}_{t-1}^{\setminus u})\,\mathcal{N}(\boldsymbol{\mu}_t; \mathbf{e}_t^{\setminus u}, \mathbf{D}_t^{\setminus u}),$$
where
$$(\mathbf{D}_{t-1}^{\setminus u})^{-1} = \mathbf{D}^{-1} - \mathbf{D}_{u\leftarrow}^{-1}, \qquad (\mathbf{D}_{t-1}^{\setminus u})^{-1}\mathbf{e}_{t-1}^{\setminus u} = \mathbf{D}^{-1}\mathbf{e}_{t-1} - \mathbf{D}_{u\leftarrow}^{-1}\mathbf{e}_{u_t\leftarrow},$$
$$(\mathbf{D}_t^{\setminus u})^{-1} = \mathbf{D}^{-1} - \mathbf{D}_{u\rightarrow}^{-1}, \qquad (\mathbf{D}_t^{\setminus u})^{-1}\mathbf{e}_t^{\setminus u} = \mathbf{D}^{-1}\mathbf{e}_t - \mathbf{D}_{u\rightarrow}^{-1}\mathbf{e}_{u_t\rightarrow}.$$

B. Update for the factor $q_{u_t}^{\mathrm{new}}$

For the factor $q_{u_t}$, the parameters of the Gaussian distributions found during the moment-matching step cancel out during the factor-update step, and the resulting formulae are
$$q_{u_t}^{\mathrm{new}}(\boldsymbol{\mu}_{t-1}, \boldsymbol{\mu}_t) \propto \mathcal{N}(\boldsymbol{\mu}_t; \mathbf{e}_{u_t\rightarrow}^{\mathrm{new}}, \mathbf{D}_{u\rightarrow}^{\mathrm{new}})\,\mathcal{N}(\boldsymbol{\mu}_{t-1}; \mathbf{e}_{u_t\leftarrow}^{\mathrm{new}}, \mathbf{D}_{u\leftarrow}^{\mathrm{new}}),$$
where
$$\mathbf{D}_{u\rightarrow}^{\mathrm{new}} = \mathbf{D}_{t-1}^{\setminus u} + \mathbf{W}, \qquad \mathbf{e}_{u_t\rightarrow}^{\mathrm{new}} = \mathbf{e}_{t-1}^{\setminus u},$$
$$\mathbf{D}_{u\leftarrow}^{\mathrm{new}} = \mathbf{D}_t^{\setminus u} + \mathbf{W}, \qquad \mathbf{e}_{u_t\leftarrow}^{\mathrm{new}} = \mathbf{e}_t^{\setminus u}.$$
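The temporal factor has the same swap-and-add structure, with the temporal prior covariance $\mathbf{W}$ in place of $\boldsymbol{\Sigma}_0$; a minimal sketch (the function name is ours; the arguments are numpy arrays):

```python
def ep_update_u(e_cav_prev, D_cav_prev, e_cav_curr, D_cav_curr, W):
    """Closed-form update for the factor u_t: the forward (t-1 -> t) message
    is built from the cavity at t-1, the backward one from the cavity at t."""
    D_fwd, e_fwd = D_cav_prev + W, e_cav_prev  # message to mu_t
    D_bwd, e_bwd = D_cav_curr + W, e_cav_curr  # message to mu_{t-1}
    return e_fwd, D_fwd, e_bwd, D_bwd
```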