
Collaborative Multi-output Gaussian Processes

Trung V. Nguyen
ANU & NICTA

Canberra, Australia

Edwin V. Bonilla
NICTA & ANU

Sydney, Australia

Abstract

We introduce the collaborative multi-output Gaussian process (GP) model for learning dependent tasks with very large datasets. The model fosters task correlations by mixing sparse processes and sharing multiple sets of inducing points. This facilitates the application of variational inference and the derivation of an evidence lower bound that decomposes across inputs and outputs. We learn all the parameters of the model in a single stochastic optimization framework that scales to a large number of observations per output and a large number of outputs. We demonstrate our approach on a toy problem, two medium-sized datasets and a large dataset. The model achieves superior performance compared to single output learning and previous multi-output GP models, confirming the benefits of correlating the sparsity structure of the outputs via the inducing points.

1 INTRODUCTION

Gaussian process models (GPs, Rasmussen and Williams, 2006) are a popular choice in Bayesian regression due to their ability to capture complex dependencies and non-linearities in data. In particular, when having multiple outputs or tasks they have proved effective in modeling the dependencies between the tasks, outperforming competitive baselines and single output learners (Bonilla et al., 2008; Teh et al., 2005; Alvarez and Lawrence, 2009; Wilson et al., 2012). However, the prohibitive cost of performing exact inference in GP models severely hinders their application to large scale multi-output problems. For example, naïve inference in a fully coupled Gaussian process model over $P$ outputs and $N$ data points can have a complexity of $O(N^3P^3)$ and $O(N^2P^2)$ in time and memory, respectively.

A motivating example of a large scale multi-output application is the tracking of movements of a robot arm using 2 or more joint torques. If one of the robot motors malfunctions and fails to record a torque, data collected from the other motors may be used to infer the missing torque values. However, taking 100 measurements per second already results in a total of over 40,000 data points per torque in just 7 minutes. Clearly this problem is well beyond the capabilities of conventional multiple output GPs. Building multi-output GP models that can learn correlated tasks at the scale of these types of problems is thus the main focus of this paper.

In the single output setting, previous attempts to scale up GP inference resort to approximate inference. Most approximation methods can be understood within a single probabilistic framework that uses a set of inducing points in order to obtain an approximate process (or a low-rank approximate covariance) over which inference can be performed more efficiently (Quiñonero-Candela and Rasmussen, 2005). These models have been referred to in the literature as sparse models. Nevertheless, straightforward application of such approximate techniques will yield a computational cost of at least $O(PNM^2)$ in time and $O(PNM)$ in memory, where $M$ is the number of inducing points. This high complexity still prevents us from applying GP models to large scale multi-output problems.

In this work we approach the challenge of building scalable multi-output Gaussian process models based on the following observations. Firstly, inducing variables are the key catalyst for achieving sparsity and dealing with large scale problems in Gaussian process models. In particular, they capture the sufficient statistics of a dataset, allowing the construction of sparse processes that can approximate arbitrarily well the exact GP model (Titsias, 2009).


Secondly, the use of global latent variables (such as the inducing points) allows us to induce dependencies in a highly correlated model efficiently. This observation is exploited in Hensman et al. (2013) for single output GP regression models where, by explicitly representing a distribution over the inducing variables, stochastic variational inference can be used to work with millions of data points. Finally, the key to multi-output and multi-task learning is to model dependencies between the outputs based on realistic assumptions of what can be shared across the tasks. It turns out that sharing "sparsity structure" can not only be a reasonable assumption but also a crucial component when modeling dependencies between different related tasks.

Based on these observations, we propose the collaborative multi-output Gaussian process (COGP) model, where latent processes are mixed to generate dependent outputs. Each process is sparse and characterized by its own set of inducing points. The sparsity structure enabling output correlations is thus created via the shared inducing sets. To learn this structure, we maintain an explicit representation of the posterior over the inducing points, which in turn allows inference to be carried out efficiently. In particular, we obtain a variational lower bound of the model evidence that decomposes across inputs and outputs. This decomposition makes possible the application of stochastic variational inference, thus allowing the model to handle a large number of observations per output and a large number of outputs. Furthermore, learning of all the parameters in the model, including kernel hyperparameters and inducing inputs, can be done in a single stochastic optimization framework.

We analyze our multi-output model on a toy problem where the inducing variables are shown to be conducive to the sharing of information between two related tasks. Additionally, we evaluate our model on two moderate-sized datasets, on which we show that it can outperform previous non-scalable multi-output approaches as well as single output baselines. Finally, on a large scale experiment regarding the learning of robot inverse dynamics, we show the substantial benefits of collaborative learning provided by our model.

Related work. Most GP-based multi-output models create correlated outputs by mixing a set of independent latent processes. The mixing can be a linear combination with fixed coefficients (see e.g. Teh et al., 2005; Bonilla et al., 2008). This is known in the geostatistics community as the "linear model of coregionalization" (Goovaerts, 1997). Such models may also be reformulated in a common Bayesian framework, for example by placing a spike and slab prior over the coefficients (Titsias and Lazaro-Gredilla, 2011). More complex dependencies can be induced by using input-dependent coefficients (Wilson et al., 2012; Nguyen and Bonilla, 2013) or convolving processes (Boyle and Frean, 2005; Alvarez and Lawrence, 2009; Alvarez et al., 2010).

While we also use the mixing construction, the key difference in our model is the role of the inducing variables. In particular, when used in previous models to reduce the computational costs (see e.g. Alvarez and Lawrence, 2009; Alvarez et al., 2010), the inducing points are integrated out or collapsed. In contrast, our model maintains an explicit representation of the posterior over the inducing variables that is learned using data from all outputs. This explicit representation facilitates scalable learning in a similar fashion to the approach in Hensman et al. (2013), making it applicable to very large datasets.

2 MODEL SPECIFICATION

Before diving into the technical details of the model specification, we discuss the modeling philosophy behind our collaborative multi-output Gaussian processes. To learn the outputs jointly, we need a mechanism through which information can be transferred among the outputs. This is achieved in the model by allowing the outputs to share multiple sets of inducing variables, each of which captures a different pattern common to the outputs. These variables play a dual pivotal role in the model: they collaboratively share information across the outputs and provide sufficient statistics so as to induce sparse processes.

Consider the joint regression of $P$ tasks with inputs $X = \{x_n \in \mathbb{R}^D\}_{n=1}^{N}$ and outputs $y = \{y_i\}_{i=1}^{P}$, where $y_i = \{y_{in}\}_{n=1}^{N}$. We model each output as a weighted combination of $Q$ shared latent functions $\{g_j\}_{j=1}^{Q}$, plus an individual latent function $h_i$ unique to that output for greater flexibility. The $Q$ shared functions have independent Gaussian process priors $g_j(x) \sim \mathcal{GP}(0, k_j(\cdot,\cdot))$. Similarly, each individual function of an output also has a GP prior, i.e. $h_i(x) \sim \mathcal{GP}(0, k^h_i(\cdot,\cdot))$.

As we want to sparsify these processes, we introduce a set of shared inducing variables $u_j$ for each $g_j(x)$, i.e. $u_j$ contains the values of $g_j(x)$ at the inducing inputs $Z_j$. Likewise, we have individual inducing variables corresponding to each $h_i(x)$, which we denote by $v_i$, with corresponding inducing inputs $Z^h_i$. The inducing inputs lie in the same space as the inputs $X$. For convenience, we assume all processes have the same number of inducing points, $M$; however, we emphasize that this is not imposed in practice.


We denote the collective variables $g = \{g_j\}$, $h = \{h_i\}$, $u = \{u_j\}$, $v = \{v_i\}$, $Z = \{Z_j\}$, and $Z^h = \{Z^h_i\}$, where $g_j = \{g_j(x_n)\}$ and $h_i = \{h_i(x_n)\}$. Note that we reserve subscript $i$ for indexing the outputs and their corresponding individual processes ($i = 1 \ldots P$), $j$ for the shared latent processes ($j = 1 \ldots Q$), and $n$ for the inputs ($n = 1 \ldots N$).

2.1 PRIOR MODEL

From the definition of the GPs and the independence of the processes, the prior of the multi-output model can be written as:

$$p(g \mid u) = \prod_{j=1}^{Q} p(g_j \mid u_j) = \prod_{j=1}^{Q} \mathcal{N}(g_j;\, \mu_j,\, K_j) \quad (1)$$

$$p(u) = \prod_{j=1}^{Q} p(u_j) = \prod_{j=1}^{Q} \mathcal{N}(u_j;\, 0,\, k(Z_j, Z_j)) \quad (2)$$

$$p(h \mid v) = \prod_{i=1}^{P} p(h_i \mid v_i) = \prod_{i=1}^{P} \mathcal{N}(h_i;\, \mu^h_i,\, K^h_i) \quad (3)$$

$$p(v) = \prod_{i=1}^{P} p(v_i) = \prod_{i=1}^{P} \mathcal{N}(v_i;\, 0,\, k(Z^h_i, Z^h_i)), \quad (4)$$

where the corresponding means and covariances of the Gaussians are given by:

$$\mu_j = k(X, Z_j)\, k(Z_j, Z_j)^{-1} u_j \quad (5)$$

$$\mu^h_i = k(X, Z^h_i)\, k(Z^h_i, Z^h_i)^{-1} v_i \quad (6)$$

$$K_j = k_j(X, X) - k(X, Z_j)\, k(Z_j, Z_j)^{-1} k(Z_j, X) \quad (7)$$

$$K^h_i = k^h_i(X, X) - k(X, Z^h_i)\, k(Z^h_i, Z^h_i)^{-1} k(Z^h_i, X). \quad (8)$$

In the equations and hereafter, we omit the subscripts $j$, $h$, $i$ from the kernels $k_j(\cdot,\cdot)$ and $k^h_i(\cdot,\cdot)$ when it is clear from the parameters inside the parentheses which covariance function is in action.

Equations (2) and (4) follow directly from the properties of GPs, while the expressions for $p(g|u)$ and $p(h|v)$ (Equations (1) and (3)) come from the conditionals of the multivariate Gaussian distributions. Instead of writing the joint priors $p(g,u)$ and $p(h,v)$, the above equivalent equations are given to emphasize the sufficient statistics role of $u$ and $v$ in the model. Here by sufficient statistics we mean that, for any sparse process (say $g_j$), any other set of function values is independent of $g_j$ given the inducing variables $u_j$.
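The conditional means and covariances in Equations (5)-(8) reduce to standard kernel algebra. Below is a minimal numpy sketch of Equations (5) and (7) for one shared process; the squared-exponential kernel helper, the jitter term, and all variable names are our own assumptions rather than the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and B (assumed form)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def conditional_prior(X, Zj, uj, jitter=1e-6):
    """Mean and covariance of p(g_j | u_j), cf. Equations (5) and (7)."""
    Kxz = rbf_kernel(X, Zj)                               # k(X, Z_j)
    Kzz = rbf_kernel(Zj, Zj) + jitter * np.eye(len(Zj))   # k(Z_j, Z_j), jittered for stability
    Kzz_inv = np.linalg.inv(Kzz)
    mu_j = Kxz @ Kzz_inv @ uj                             # Equation (5)
    K_j = rbf_kernel(X, X) - Kxz @ Kzz_inv @ Kxz.T        # Equation (7)
    return mu_j, K_j
```

The same computation, with $Z^h_i$ and $v_i$ in place of $Z_j$ and $u_j$, gives Equations (6) and (8).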

2.2 LIKELIHOOD MODEL

As mentioned above, we assume that observations for each output are (noisy) linear combinations of the $Q$ latent functions $g_j(x)$ plus an independent function $h_i(x)$. Hence the likelihood with standard i.i.d. Gaussian noise is given by:

$$p(y \mid g, h) = \prod_{i=1}^{P} \prod_{n=1}^{N} \mathcal{N}\Big(y_{in};\; \sum_{j=1}^{Q} w_{ij}\, g_j(x_n) + h_i(x_n),\; \beta_i^{-1}\Big), \quad (9)$$

where $w_{ij}$ are the corresponding weights and $\beta_i$ is the noise precision of each Gaussian. As the latent values $g$ are specified conditioned on the inducing variables $u$, this construction implies that each output is a weighted combination of the inducing values. We note that if $u$ and $v$ are marginalized out, we obtain the semiparametric latent factor model (Teh et al., 2005). However, doing so is against the purpose of our model, which encourages sharing of information across outputs via the inducing variables. Furthermore, as we shall see in the next section, explicit representation of these variables is fundamental to scalable inference in the model.
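To make the generative view of Equation (9) concrete, this short sketch (our own, with placeholder weights, precision, and latent samples) draws one noisy output as a weighted combination of latent function values plus an individual function:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q = 200, 2
g = rng.standard_normal((Q, N))   # stand-in draws of the Q shared latent functions at the inputs
h_i = rng.standard_normal(N)      # stand-in draw of output i's individual function
w_i = np.array([1.0, -0.5])       # mixing weights w_ij for output i (assumed values)
beta_i = 100.0                    # noise precision of output i

# Equation (9): y_in ~ N( sum_j w_ij g_j(x_n) + h_i(x_n), 1/beta_i )
y_i = w_i @ g + h_i + rng.normal(scale=beta_i**-0.5, size=N)
```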

3 INFERENCE

We approximate the posterior over the latent variables $g, h, u, v$ given observations $y$ using variational inference (Jordan et al., 1999). In Section 3.1 we derive a lower bound of the marginal likelihood which has the key property of factorizing over the data points and outputs. Section 3.2 takes advantage of this factorization to derive stochastic variational inference, allowing the model to scale to very large datasets. Section 3.3 compares the complexity of the model with previous multi-output methods.

3.1 VARIATIONAL LOWER BOUND

In variational inference, we find the "closest" approximate distribution to the true posterior in terms of the KL divergence. We first observe that the true posterior distribution can be written as:

$$p(g, h, u, v \mid y) = p(g \mid u, y)\, p(h \mid v, y)\, p(u, v \mid y). \quad (10)$$

Here we recall the modeling assumption that each set of inducing variables is the sufficient statistics of the corresponding latent process. This motivates replacing the true posteriors over $g$ and $h$ with their conditional distributions given the inducing variables, leading to a distribution of the form:

$$q(g, h, u, v \mid y) = p(g \mid u)\, p(h \mid v)\, q(u, v), \quad (11)$$

with

$$q(u, v) = \prod_{j=1}^{Q} \underbrace{\mathcal{N}(u_j;\, m_j, S_j)}_{q(u_j)} \; \prod_{i=1}^{P} \underbrace{\mathcal{N}(v_i;\, m^h_i, S^h_i)}_{q(v_i)}. \quad (12)$$


This technique has been used by Titsias (2009) and Hensman et al. (2013) to derive variational inference algorithms for the single output case. Since the conditionals $p(g|u)$ and $p(h|v)$ are known (Equations (1) and (3)), we only need to learn $q(u, v)$ so as to minimize the divergence between the approximate posterior and the true posteriors. The quality of the approximation depends entirely on the posterior over the inducing variables, thus underlining their pivotal role in the model, as previously discussed.

To find the best $q(u, v)$, we optimize the evidence lower bound (ELBO) of the log marginal likelihood:

$$\log p(y) \ge \int q(u, v) \log p(y \mid u, v)\, \mathrm{d}u\, \mathrm{d}v - \sum_{j=1}^{Q} \mathrm{KL}[q(u_j)\,\|\,p(u_j)] - \sum_{i=1}^{P} \mathrm{KL}[q(v_i)\,\|\,p(v_i)], \quad (13)$$

which is derived using Jensen's inequality and the fact that both $q(u, v)$ and $p(u, v)$ fully factorize. Since $q(u_j)$, $q(v_i)$, $p(u_j)$, and $p(v_i)$ are all multivariate Gaussian distributions, the KL divergence terms are analytically tractable. To compute the expected likelihood term in the ELBO, we first see that

$$\log p(y \mid u, v) \ge \big\langle \log p(y \mid g, h) \big\rangle_{p(g, h \mid u, v)} = \sum_{i=1}^{P} \sum_{n=1}^{N} \big\langle \log p(y_{in} \mid g_n, h_{in}) \big\rangle_{p(g \mid u)\, p(h_i \mid v_i)}, \quad (14)$$

where $g_n = \{g_{jn} = (g_j)_n\}_{j=1}^{Q}$. The inequality is due to Jensen's inequality and the equality is due to the factorization of the likelihood.
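The KL terms in Equation (13) are between multivariate Gaussians, so they have the usual closed form. A minimal numpy sketch, with variable names of our own choosing:

```python
import numpy as np

def kl_gaussian(m, S, Kzz):
    """KL[ N(m, S) || N(0, Kzz) ] for an M-dimensional Gaussian (standard closed form)."""
    M = len(m)
    Kzz_inv = np.linalg.inv(Kzz)
    return 0.5 * (np.trace(Kzz_inv @ S) + m @ Kzz_inv @ m - M
                  + np.linalg.slogdet(Kzz)[1] - np.linalg.slogdet(S)[1])
```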

The ELBO can be computed by first solving for the individual expectations $\langle \log p(y_{in} \mid g_n, h_{in}) \rangle$ over $p(g \mid u)\, p(h_i \mid v_i)$ and then substituting these into Equation (13) (see the supplementary material for details). Hence the resulting lower bound is given by:

$$\mathcal{L} = \sum_{i,n} \Big( \log \mathcal{N}(y_{in};\, \mu_{in},\, \beta_i^{-1}) - \tfrac{1}{2}\beta_i \sum_{j=1}^{Q} w_{ij}^2\, k_{jnn} - \tfrac{1}{2}\beta_i\, k^h_{inn} - \tfrac{1}{2}\beta_i \sum_{j=1}^{Q} w_{ij}^2\, \mathrm{tr}\,(S_j \Lambda_{jn}) - \tfrac{1}{2}\beta_i\, \mathrm{tr}\,(S^h_i \Lambda_{in}) \Big)$$
$$\qquad - \sum_{j=1}^{Q} \Big( \tfrac{1}{2} \log |K_{jzz} S_j^{-1}| + \tfrac{1}{2} \mathrm{tr}\,\big( K_{jzz}^{-1} (m_j m_j^T + S_j) \big) \Big) - \sum_{i=1}^{P} \Big( \tfrac{1}{2} \log |K_{izz} (S^h_i)^{-1}| + \tfrac{1}{2} \mathrm{tr}\,\big( K_{izz}^{-1} (m^h_i (m^h_i)^T + S^h_i) \big) \Big), \quad (15)$$

where $K_{jzz} = k(Z_j, Z_j)$, $K_{izz} = k(Z^h_i, Z^h_i)$, and:

$$\mu_{in} = \sum_{j=1}^{Q} w_{ij} A_j(n,:)\, m_j + A^h_i(n,:)\, m^h_i, \quad (16)$$

$$\Lambda_{jn} = A_j(n,:)^T A_j(n,:), \quad (17)$$

$$\Lambda_{in} = A^h_i(n,:)^T A^h_i(n,:), \quad (18)$$

with $k_{jnn} = (K_j)_{nn}$; $k^h_{inn} = (K^h_i)_{nn}$; $\mu_{jn} = (\mu_j)_n$; $\mu^h_{in} = (\mu^h_i)_n$; and we have defined the auxiliary matrices $A_j = k(X, Z_j) K_{jzz}^{-1}$ and $A^h_i = k(X_i, Z^h_i) K_{izz}^{-1}$, and used $A_j(n,:)$ to denote the $n$-th row vector of $A_j$. Notice that this ELBO generalizes the bound for standard GP regression derived in Hensman et al. (2013), which can be recovered by setting $P = Q = 1$, $w_{ij} = 1$ and $h_i(x) = 0$.

The novelty of the variational lower bound in Equation (15) is that it decomposes across both inputs and outputs. This enables the use of stochastic optimization methods, which allow the model to handle very large datasets for which existing GP-based multi-output models are simply impractical.
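To illustrate the decomposition, the sketch below (our own code, not the authors') evaluates the contribution of a single observation $(i, n)$ to the first sum of Equation (15), taking the quantities of Equations (16)-(18) as precomputed inputs:

```python
import numpy as np

def pointwise_elbo_term(y_in, beta_i, w_i, m, S, mh_i, Sh_i,
                        A_rows, Ah_row, k_nn, kh_nn):
    """Per-observation term of Equation (15).

    w_i    : (Q,) mixing weights w_ij for output i
    m, S   : lists of variational means m_j (M,) and covariances S_j (M, M)
    A_rows : list of n-th rows A_j(n, :) of A_j = k(X, Z_j) K_jzz^{-1}
    k_nn   : (Q,) diagonal entries (K_j)_nn of the conditional covariances
    """
    Q = len(w_i)
    mu_in = sum(w_i[j] * A_rows[j] @ m[j] for j in range(Q)) + Ah_row @ mh_i   # Eq. (16)
    log_lik = -0.5 * np.log(2 * np.pi / beta_i) - 0.5 * beta_i * (y_in - mu_in) ** 2
    # variance corrections; note tr(S_j Lambda_jn) = A_j(n,:) S_j A_j(n,:)^T
    corr = sum(w_i[j] ** 2 * (k_nn[j] + A_rows[j] @ S[j] @ A_rows[j]) for j in range(Q))
    corr += kh_nn + Ah_row @ Sh_i @ Ah_row
    return log_lik - 0.5 * beta_i * corr
```

Summing such terms over a mini-batch of observations, and adding the (rescaled) KL-type terms over the inducing variables, gives an unbiased estimate of the bound.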

3.2 STOCHASTIC VARIATIONAL INFERENCE

So far in the description of the model and inference, we have implicitly assumed that every output has full observations at all inputs $X$. To discern where learning occurs for each output, we make the missing-data scenario more explicit. Specifically, each output $i$ can have observations at a different set of inputs $X_i$. We shall use $o_i$ to denote the indices of $X_i$ (in the set $X$) and use the indexing operator $B(o_i)$ to select the rows corresponding to $o_i$ from any arbitrary matrix $B$. We also overload $y_i$ as the observed targets of output $i$.

3.2.1 Learning the Parameters of the Variational Distribution

We can obtain the derivatives of the ELBO in Equation (15) wrt the variational parameters for optimization. The derivatives of $\mathcal{L}$ wrt the parameters of $q(u_j)$ are given by:

$$\frac{\partial \mathcal{L}}{\partial m_j} = \sum_{i=1}^{P} \beta_i w_{ij} A_j(o_i)^T y^{\backslash j}_i - \Big[ K_{jzz}^{-1} + \sum_{i=1}^{P} \beta_i w_{ij}^2 A_j(o_i)^T A_j(o_i) \Big] m_j, \quad (19)$$

$$\frac{\partial \mathcal{L}}{\partial S_j} = \frac{1}{2} S_j^{-1} - \frac{1}{2} \Big[ K_{jzz}^{-1} + \sum_{i=1}^{P} \beta_i w_{ij}^2 A_j(o_i)^T A_j(o_i) \Big], \quad (20)$$

where $y^{\backslash j}_i = y_i - A^h_i(o_i)\, m^h_i - \sum_{j' \neq j} w_{ij'} A_{j'}(o_i)\, m_{j'}$.


The derivatives of $\mathcal{L}$ wrt the parameters of $q(v_i)$ are given by:

$$\frac{\partial \mathcal{L}}{\partial m^h_i} = \beta_i A^h_i(o_i)^T y^{\backslash h}_i - \Big[ K_{izz}^{-1} + \beta_i A^h_i(o_i)^T A^h_i(o_i) \Big] m^h_i, \quad (21)$$

$$\frac{\partial \mathcal{L}}{\partial S^h_i} = \frac{1}{2} (S^h_i)^{-1} - \frac{1}{2} \Big[ K_{izz}^{-1} + \beta_i A^h_i(o_i)^T A^h_i(o_i) \Big], \quad (22)$$

where $y^{\backslash h}_i = y_i - \sum_{j=1}^{Q} w_{ij} A_j(o_i)\, m_j$.

It can be seen that the derivatives of the parameters of $q(v_i)$ only involve the observations of output $i$. The derivatives of the parameters of $q(u_j)$ involve the observations of all outputs, but are a sum of contributions from individual outputs. Computation of the derivatives can therefore be easily distributed or parallelized.
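To make this structure concrete, the following sketch (ours, with assumed variable names) accumulates the per-output contributions to Equations (19)-(20) for one shared process, taking $A_j(o_i)$ and $y^{\backslash j}_i$ as precomputed:

```python
import numpy as np

def grads_q_uj(Kjzz_inv, m_j, S_j, per_output):
    """Gradients of the ELBO wrt (m_j, S_j), cf. Equations (19)-(20).

    per_output : list of tuples (beta_i, w_ij, A_j_oi, y_noj_i), one per output,
                 where A_j_oi = A_j(o_i) and y_noj_i is y_i with the other
                 processes' contributions subtracted.
    """
    M = len(m_j)
    lin = np.zeros(M)
    Lam = Kjzz_inv.copy()
    for beta_i, w_ij, A_j_oi, y_noj_i in per_output:
        lin += beta_i * w_ij * A_j_oi.T @ y_noj_i        # sum_i beta_i w_ij A_j(o_i)^T y_i^{\j}
        Lam += beta_i * w_ij**2 * A_j_oi.T @ A_j_oi      # accumulate Lambda
    grad_m = lin - Lam @ m_j                             # Equation (19)
    grad_S = 0.5 * np.linalg.inv(S_j) - 0.5 * Lam        # Equation (20)
    return grad_m, grad_S
```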

Since the optimal distributions $q(u_j)$ and $q(v_i)$ are in the exponential family, it is more convenient to use stochastic variational inference (Hensman et al., 2012, 2013) to perform updates of their canonical parameters. This works by taking a step of length $l$ in the direction of the natural gradient, approximated by mini-batches of the data. For instance, consider $q(u_j)$, whose canonical parameters are $\Phi_1 = S_j^{-1} m_j$ and $\Phi_2 = -\frac{1}{2} S_j^{-1}$. Their stochastic update equations at time $t+1$ are given by:

$$\Phi_{1(t+1)} = S_{j(t)}^{-1} m_{j(t)} + l \Big( \sum_{i=1}^{P} \beta_i w_{ij} A_j(o_i)^T y^{\backslash j}_i - S_{j(t)}^{-1} m_{j(t)} \Big) \quad (23)$$

$$\Phi_{2(t+1)} = -\frac{1}{2} S_{j(t)}^{-1} + l \Big( \frac{1}{2} S_{j(t)}^{-1} - \frac{1}{2} \Lambda \Big), \quad (24)$$

where $\Lambda = K_{jzz}^{-1} + \sum_{i=1}^{P} \beta_i w_{ij}^2 A_j(o_i)^T A_j(o_i)$.
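A minimal numpy sketch of one such update step for $q(u_j)$, converting back to $(m_j, S_j)$ afterwards; the mini-batch gradient quantities are assumed precomputed and all names are ours:

```python
import numpy as np

def natural_gradient_step(m_j, S_j, grad_Phi1_target, Lambda, lr=0.01):
    """One stochastic natural-gradient update of q(u_j), cf. Equations (23)-(24).

    grad_Phi1_target : sum_i beta_i w_ij A_j(o_i)^T y_i^{\\j}  (mini-batch estimate)
    Lambda           : K_jzz^{-1} + sum_i beta_i w_ij^2 A_j(o_i)^T A_j(o_i)
    """
    S_inv = np.linalg.inv(S_j)
    Phi1 = S_inv @ m_j + lr * (grad_Phi1_target - S_inv @ m_j)   # Equation (23)
    Phi2 = -0.5 * S_inv + lr * (0.5 * S_inv - 0.5 * Lambda)      # Equation (24)
    S_new = np.linalg.inv(-2.0 * Phi2)   # recover covariance from canonical parameters
    m_new = S_new @ Phi1                 # recover mean
    return m_new, S_new
```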

3.2.2 Inducing Inputs and Hyper-parameters

To learn the hyperparameters, which in this model include the mixing weights, the covariance hyperparameters of the latent processes, and the noise precision of each output, we follow standard practice in GP inference. For this model, this involves taking derivatives of the ELBO and applying standard stochastic gradient descent in steps that alternate with the updates of the variational parameters, much like a variational EM algorithm. The derivatives are given in the supplementary material.

Learning of the inducing inputs, which was not considered in the single output case in Hensman et al. (2013), is also possible in our stochastic optimization approach. In the supplementary material, we show that the additional cost of computing the derivatives of the lower bound wrt the inducing inputs is not significantly higher than the cost of updating the variational parameters. This makes optimizing the inducing locations a practical option, which can be critical in high-dimensional problems. Indeed, our experiments on a large scale multi-output problem show that automatic learning of the inducing inputs can lead to significant performance gain with little overhead in computation.

Table 1: Comparison of the time and storage complexity of approximate inference of multi-output GP models. A&W, 2009 refers to Alvarez and Lawrence (2009). COGP is the only method with complexity independent of the number of inputs N and outputs P; thus it can scale to very large datasets.

METHOD                        TIME          STORAGE
COGP, this paper              O(M^3)        O(M^2)
SLFM (Teh et al., 2005)       O(QNM_t^2)    O(QNM_t)
MTGP (Bonilla et al., 2008)   O(PNM_t^2)    O(PNM_t)
CGP-FITC (A&W, 2009)          O(PNM_t^2)    O(PNM_t)
GPRN (Wilson et al., 2012)    O(PQN^3)      O(PQN^2)

3.3 COMPLEXITY ANALYSIS

In this section we analyze the complexity of the model and compare it to existing multi-output approaches. For consistency, we first unify the notation used for all models. We use $P$ as the number of outputs, $N$ as the number of inputs, $Q$ as the number of shared latent processes, and $M_t$ as the total number of inducing inputs. It is worth noting that $M_t = (P + Q) \times M$ in our COGP model, assuming that each sparse process has an equal number of inducing points. Also, COGP has $P$ additional individual processes, one for each output.

The complexity of COGP can be read off by inspecting the ELBO in Equation (15), with the key observation that it contains a sum over the outputs as well as over the inputs. This means a mini-batch containing a small subset of the inputs and outputs can be used for stochastic optimization. Technically, the cost is $O(M^3)$ or $O(N_b M^2)$, where $N_b$ is the size of the mini-batches, depending on which is larger between $M$ and $N_b$. In practice, we may use $N_b > M$, as a large batch size (e.g. $N_b = 1000$) helps reduce the stochasticity of the optimization. However, here we use $O(M^3)$ for easier comparison with other models, whose time and storage demands are given in Table 1. We see that COGP has a computational complexity that is independent of the size of the inputs and outputs, which makes it the only method capable of handling large scale problems.


3.4 PREDICTION

The predictive distribution of the $i$-th output for a test input $x_*$ is given by:

$$p(f_* \mid y, x_*) = \mathcal{N}\Big(f_*;\; \sum_{j=1}^{Q} w_{ij}\, \mu_{j*} + \mu^h_{i*},\; \sum_{j=1}^{Q} w_{ij}^2\, s_{j*} + s^h_{i*}\Big), \quad (25)$$

where $\mu_{j*}$ and $s_{j*}$ are the mean and variance of the prediction for $g_{j*} = g_j(x_*)$, i.e. $p(g_{j*} \mid y, x_*) = \mathcal{N}(g_{j*}; \mu_{j*}, s_{j*})$. Likewise, $\mu^h_{i*}$ and $s^h_{i*}$ are the mean and variance of the prediction for $h_{i*} = h_i(x_*)$, $p(h_{i*} \mid y, x_*) = \mathcal{N}(h_{i*}; \mu^h_{i*}, s^h_{i*})$. These predictive means and variances are given by:

$$\mu_{j*} = k_{j*z} K_{jzz}^{-1} m_j, \quad (26)$$

$$s_{j*} = k_{j**} - k_{j*z} \big( K_{jzz}^{-1} - K_{jzz}^{-1} S_j K_{jzz}^{-1} \big) k_{j*z}^T, \quad (27)$$

$$\mu^h_{i*} = k_{i*z} K_{izz}^{-1} m^h_i, \quad (28)$$

$$s^h_{i*} = k_{i**} - k_{i*z} \big( K_{izz}^{-1} - K_{izz}^{-1} S^h_i K_{izz}^{-1} \big) k_{i*z}^T, \quad (29)$$

where $k_{j**} = k_j(x_*, x_*)$, $k_{i**} = k^h_i(x_*, x_*)$, $k_{j*z}$ is the covariance between $x_*$ and $Z_j$, and $k_{i*z}$ is the covariance between $x_*$ and $Z^h_i$.
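A minimal numpy sketch of Equations (25)-(29) for a single test point; the kernel quantities and names are placeholders of our own:

```python
import numpy as np

def predict_process(k_star_z, k_star_star, Kzz, m, S):
    """Predictive mean and variance of one sparse process at x_*, cf. Eqs. (26)-(29)."""
    Kzz_inv = np.linalg.inv(Kzz)
    mu_star = k_star_z @ Kzz_inv @ m
    s_star = k_star_star - k_star_z @ (Kzz_inv - Kzz_inv @ S @ Kzz_inv) @ k_star_z
    return mu_star, s_star

def predict_output(w_i, shared_preds, individual_pred):
    """Combine shared and individual process predictions for output i, cf. Eq. (25)."""
    mu = sum(w * mu_j for w, (mu_j, _) in zip(w_i, shared_preds)) + individual_pred[0]
    var = sum(w**2 * s_j for w, (_, s_j) in zip(w_i, shared_preds)) + individual_pred[1]
    return mu, var
```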

4 EXPERIMENTS

We evaluate the proposed approach with four experiments. A toy problem is first used to study the transfer of learning between two related processes via the shared inducing points. We then compare the model with existing multi-output models on the tasks of predicting foreign exchange rates and air temperature. In the final experiment, we show that joint learning under sparsity can yield significant performance gains on a large scale dataset of inverse dynamics of a robot arm.

Since we are using stochastic optimization, the learning rates need to be chosen carefully. We found that the rates used in Hensman et al. (2013) also work well for our model. Specifically, we used learning rates of 0.01 for the variational parameters, 1 × 10^-5 for the covariance hyperparameters, and 1 × 10^-4 for the weights, noise precisions, and inducing inputs. We also included a momentum term of 0.9 for all of the parameters except the variational parameters and the inducing inputs. All of the experiments are executed on an Intel(R) Core(TM) i7-2600 3.40GHz CPU with 8GB of RAM using Matlab R2012a.

4.1 TOY PROBLEM

In this toy problem, two related outputs are simulated from the same latent function $\sin(x)$ and corrupted by independent noise: $y_1(x) = \sin(x) + \epsilon$ and $y_2(x) = -\sin(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, 0.01)$. Each output is given 200 observations, with missing values in the $(-7, -3)$ interval for the first output and the $(4, 8)$ interval for the second output. We used $Q = 1$ latent sparse process with a squared exponential kernel, $h_1(x) = h_2(x) = 0$, and $M = 15$ inducing inputs for our model.
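For concreteness, the toy data can be generated as in the sketch below (our own code; the input range (-10, 10) is read off the plotted axes and is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = np.sort(rng.uniform(-10, 10, 200))
x2 = np.sort(rng.uniform(-10, 10, 200))
y1 = np.sin(x1) + rng.normal(scale=np.sqrt(0.01), size=200)
y2 = -np.sin(x2) + rng.normal(scale=np.sqrt(0.01), size=200)

# Hold out the "missing" intervals used for testing
train1 = (x1 < -7) | (x1 > -3)   # output 1: gap on (-7, -3)
train2 = (x2 < 4) | (x2 > 8)     # output 2: gap on (4, 8)
x1_train, y1_train = x1[train1], y1[train1]
x2_train, y2_train = x2[train2], y2[train2]
```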

Figure 1 shows the predictive distributions by our model (COGP) and by independent GPs with stochastic variational inference (SVIGP, one for each output). The locations of the inducing inputs are fixed and identical for both methods. It is apparent from the figure that the independent GPs fail to predict the functions in the unobserved regions, especially for output 1. In contrast, by using information from the observed intervals of one output to interpolate the missing signal of the other, COGP makes perfect prediction for both outputs. This confirms the effectiveness of collaborative learning of sparse processes via the shared inducing variables. Additionally, we note that the inference procedure learned weights of $w_{11} = 1.07$ and $w_{21} = -1.06$, which accurately reflects the correlation between the two outputs.

4.2 FOREIGN EXCHANGE RATE PREDICTION

The first real world application we consider is to predict the foreign exchange rate w.r.t. the US dollar of the top 10 international currencies (CAD, EUR, JPY, GBP, CHF, AUD, HKD, NZD, KRW, and MXN) and 3 precious metals (gold, silver, and platinum)¹. The setting of our experiment described here is identical to that in Alvarez et al. (2010). The dataset consists of all the data available for the 251 working days in the year 2007. There are 9, 8, and 42 days of missing values for gold, silver, and platinum, respectively. We remove from the data the exchange rates of CAD on days 50–100, JPY on days 100–150, and AUD on days 150–200. Note that these 3 currencies are from very different geographical locations, making the problem more interesting. These 153 points are used for testing, and the remaining 3051 data points are used for training. Since the missing data corresponds to long contiguous sections, the objective here is to evaluate the capacity of the model to impute the missing currency values based on other currencies.

For preprocessing, we normalized the outputs to have zero mean and unit variance. Since the exchange rates are driven by a small number of latent market forces (see e.g. Alvarez et al., 2010), we tried different values of Q = 1, 2, 3 and selected Q = 2, which gave the best model evidence (ELBO).

¹ Data is available at http://fx.sauder.ubc.ca


[Figure 1: four panels ("output 1 - COGP", "output 1 - SVIGP", "output 2 - COGP", "output 2 - SVIGP") showing predictions over the input range (-10, 10).]

Figure 1: Simulated data and predictive distributions by COGP (first and third panels) and independent GPs using stochastic variational inference (second and last panels) for the toy problem. Solid black line: predictive mean; grey bar: two standard deviations; magenta dots: real observations; blue dots: missing data. The black crosses show the locations of the inducing inputs. By sharing inducing points across the outputs, COGP accurately interpolates the missing function values.

[Figure 2: three panels showing exchange rate vs. time (days 0–250) for CAD, JPY, and AUD.]

Figure 2: Real observations and predictive distributions for CAD (left), JPY (middle), and AUD (right). The model used information from other currencies to effectively extrapolate the exchange rates of AUD. The color coding scheme is the same as in Figure 1.

Table 2: Performance comparison on the foreign exchange rate dataset. Results are averages of the 3 outputs over 5 repetitions. Smaller figures are better.

METHOD   SMSE     NLPD
COGP     0.2125   -0.8394
CGP      0.2427   -2.9474
IGP      0.5996   0.4082

We used the squared-exponential covariance function for the shared processes and the noise covariance function for the individual process of each output. M = 100 inducing inputs (per sparse process) were randomly selected from the training data and fixed throughout training.

The real data and predictive distributions from our model are shown in Figure 2. They exhibit similar behavior to those of the convolved model with inducing kernels in Alvarez et al. (2010). In particular, both models perform better at capturing the strong depreciation of the AUD than the fluctuations of the CAD and JPY currencies. Further analysis of the dataset found that 4 other currencies (GBP, NZD, KRW, and MXN) also experienced the same trend during days 150–200. This information from these currencies was effectively used by the model to extrapolate the values of the AUD.

We also report in Table 2 the predictive performance of our model compared to the convolved GP model with exact inference (CGP, Alvarez and Lawrence, 2009) and independent GPs (IGP, one for each output). Our model outperforms both CGP and IGP in terms of the standardized mean squared error (SMSE). CGP has a lower negative log predictive density (NLPD), mainly due to the less conservative predictive variance of the exact CGP for the CAD currency. For reference, the convolved GPs with approximation via the variational inducing kernels (CGPVAR, Alvarez et al., 2010) has an SMSE of 0.2795, while the NLPD was not provided. Training took only 10 minutes for our model, compared to 1.4 hours for the full CGP model.

4.3 AIR TEMPERATURE PREDICTION

Next we consider the task of predicting air temperature at 4 different locations on the south coast of England. The air temperatures are recorded by a network of weather sensors (named Bramblemet, Sotonmet, Cambermet, and Chimet) during the period from July 10 to July 15, 2013. Measurements were taken every 5 minutes, resulting in a maximum of 4320 observations. There are missing data for Bramblemet (100 points), Chimet (15 points), and Sotonmet (1002 points), possibly due to network outages or hardware failures. We further simulated failure of the sensors by removing the observations from the time periods [10.2, 10.8] for Cambermet and [13.5, 14.2] for Chimet. The removed data comprises 375 data points and is used for testing. The remaining data, consisting of 15,788 points, is used for training. Similar to the previous experiment, the objective is to evaluate the ability of the model to use the signals from the functioning sensors to extrapolate the missing signals.

Table 3: Performance comparison on the air temperature dataset. Results are averages of 2 outputs over 5 repetitions.

METHOD   SMSE     NLPD
COGP     0.1077   2.1712
CGP      0.1125   2.2219
IGP      0.8944   12.5319

We normalized the outputs to have zero mean and unit variance. We used Q = 2 sparse processes with the squared exponential covariance function and individual processes with the noise covariance function. M = 200 inducing inputs were randomly selected from the training set and fixed throughout training.

The real data and the predictive distributions from our model, CGP with exact inference (Alvarez and Lawrence, 2009), and independent GPs are shown in Figure 3. It is clear that the independent GP model is clueless in the test regions and thus simply uses the average temperature as its prediction. For Cambermet, both COGP and CGP can capture the rise in temperature from the morning until the afternoon and the fall thereafter. The performance of the models is summarized in Table 3, which shows that our model outperforms CGP in terms of both SMSE and NLPD. It took 5 minutes on average to train our model, compared to 3 hours for CGP with exact inference.

It is also worth noting the characteristics of the sparse processes learned by our model, as they correspond to different patterns in the data. In particular, one process has an inverse lengthscale of 136, which captures the global increase in temperature during the training period, while the other has an inverse lengthscale of 0.5 to model the local variations within a single day.

Table 4: Performance comparison on the robot inverse dynamics dataset. In the last two lines, standard GP is applied to output 1 and the other method is applied to output 2. Results are averaged over 5 repetitions.

                    OUTPUT 1           OUTPUT 2
METHOD              SMSE      NLPD     SMSE      NLPD
COGP, learn z       0.2631    3.0600   0.0127    0.8302
COGP, fix z         0.2821    3.2281   0.0131    0.8685
GP, SVIGP           0.3119    3.2198   0.0101    1.1914
GP, SOD             0.3119    3.2198   0.0104    1.9407

4.4 ROBOT INVERSE DYNAMICS

Our last experiment is with a dataset relating to an inverse dynamics model of a 7-degree-of-freedom anthropomorphic robot arm (Vijayakumar and Schaal, 2000). The data consists of 48,933 data points mapping from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. It has been used in previous work (see e.g. Rasmussen and Williams, 2006; Vijayakumar and Schaal, 2000) but only for single task learning. Chai et al. (2008) considered multi-task learning of robot inverse dynamics but on a different and much smaller dataset.

Here we consider joint learning for the 4th and 7th torques, where the former has 2,000 points and the latter has 44,484 points for training. The test set consists of 8,898 observations equally divided between the two outputs.

Since none of the existing multi-output models are applicable to problems of this scale, we compare with independent models that learn each output separately. A standard GP is applied to the first output, as it has only 2,000 observations for training. For the second output, we used two baselines. The first is the subset of data (SOD) approach, where 2,000 data points are randomly selected for training with a standard GP model. The second is the sparse GP with stochastic variational inference (SVIGP) using 500 inducing inputs and a batch size of 1,000. For COGP, we also used a batch size of 1,000 and 500 inducing points for the shared process (Q = 1) and for each of the individual processes.

The performance of all methods in terms of SMSE and NLPD is given in Table 4. The benefits of learning the two outputs jointly are evident, as can be seen from the significantly lower SMSE and NLPD of COGP compared to the full GP for the first output (4th torque). While the SMSE of the second output is essentially the same for all methods, the NLPD of COGP is substantially better than that of the independent SVIGP model, which has the same amount of training data for this torque. These results validate the impact of collaborative learning under sparsity assumptions, opening up new opportunities for improvement over single task learning with independent sparse processes.

[Figure 3: six panels showing air temperature (°C) vs. time (days 10–15) at Cambermet (top row) and Chimet (bottom row) for COGP, CGP, and independent GPs.]

Figure 3: Real data and predictive distributions by our method (COGP, left figures), the convolved GP method with exact inference (CGP, middle figures), and full independent GPs (right figures) for the air temperature problem. The color coding scheme is the same as in Figure 1.

Finally, we see in Table 4 that optimizing the inducing inputs can yield better performance than fixing them. More importantly, the overhead in computation is small, as demonstrated by the training times shown in Figure 4. For instance, the total training time is only 1.9 hours when learning with 500 inducing inputs, compared to 1.6 hours when fixing them. As this dataset is 21-dimensional, this small difference in training time confirms that learning of the inducing inputs is a practical option even when dealing with problems of high dimension.

5 DISCUSSION

We have presented scalable multi-output GPs for learning correlated functions. The formulation around the inducing variables was shown to be conducive to effective and scalable joint learning under sparsity. We note that although our large scale experiments were done with over 40,000 observations (the largest publicly available multi-output dataset we found), the model can easily handle much bigger datasets.

[Figure 4: training time (hours) vs. number of inducing inputs (100–500), for learned ("Learn z") and fixed ("Fixed z") inducing inputs.]

Figure 4: Learning of the inducing inputs is a practical option as the overhead in training time is small.

Acknowledgements

EVB thanks Stephen Hardy for early discussions on inducing-point sharing for multi-output GPs. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.


References

Alvarez, M. and Lawrence, N. D. (2009). Sparse convolved Gaussian processes for multi-output regression. In NIPS.

Alvarez, M. A., Luengo, D., Titsias, M. K., and Lawrence, N. D. (2010). Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS.

Bonilla, E. V., Chai, K. M. A., and Williams, C. K. I. (2008). Multi-task Gaussian process prediction. In NIPS.

Boyle, P. and Frean, M. (2005). Dependent Gaussian processes. In NIPS.

Chai, K. M. A., Williams, C. K., Klanke, S., and Vijayakumar, S. (2008). Multi-task Gaussian process learning of robot inverse dynamics. In NIPS.

Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press.

Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Uncertainty in Artificial Intelligence.

Hensman, J., Rattray, M., and Lawrence, N. D. (2012). Fast variational inference in the conjugate exponential family. In NIPS.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37:183–233.

Nguyen, T. V. and Bonilla, E. V. (2013). Efficient variational inference for Gaussian process regression networks. In AISTATS.

Osborne, M. A., Roberts, S. J., Rogers, A., Ramchurn, S. D., and Jennings, N. R. (2008). Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In Proceedings of the 7th International Conference on Information Processing in Sensor Networks.

Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Teh, Y. W., Seeger, M., and Jordan, M. I. (2005). Semiparametric latent factor models. In AISTATS.

Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In AISTATS.

Titsias, M. K. and Lazaro-Gredilla, M. (2011). Spike and slab variational inference for multi-task and multiple kernel learning. In NIPS.

Vijayakumar, S. and Schaal, S. (2000). Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space. In ICML.

Wilson, A. G., Knowles, D. A., and Ghahramani, Z. (2012). Gaussian process regression networks. In ICML.