
Approximation Methods for Gaussian Process Regression

Joaquin Quiñonero-Candela

Applied Games, Microsoft Research Ltd.

7 J J Thomson Avenue, CB3 0FB Cambridge, UK

[email protected]

Carl Edward Rasmussen

Max Planck Institute for Biological Cybernetics

Spemannstr. 38, D-72076 Tübingen, Germany

[email protected]

Christopher K. I. Williams

Institute for Adaptive and Neural Computation

School of Informatics, University of Edinburgh

5 Forrest Hill, Edinburgh EH1 2QL, Scotland, UK

[email protected]

May 21, 2007

Abstract

A wealth of computationally efficient approximation methods for Gaussian process regression has recently been proposed. We give a unifying overview of sparse approximations, following Quiñonero-Candela and Rasmussen (2005), and a brief review of approximate matrix-vector multiplication methods.

1 Introduction

Gaussian processes (GPs) are flexible, simple to implement, fully probabilistic methods suitable for a wide range of problems in regression and classification. A recent overview of GP methods is provided by Rasmussen and Williams (2006). Gaussian processes allow a Bayesian use of kernels for learning, with the following two key advantages:

• GPs provide fully probabilistic predictive distributions, including estimates of the uncertainty of the predictions.


• The evidence framework applied to GPs makes it possible to learn the parameters of the kernel.

However, a serious problem with GP methods is that the naïve implementation requires computation which grows as O(n^3), where n is the number of training cases. A host of approximation techniques have recently been proposed to allow application of GPs to large problems in machine learning. These techniques can broadly be divided into two classes: 1) those based on sparse methods, which approximate the full posterior by expressions involving matrices of lower rank m < n, and 2) those relying on approximate matrix-vector multiplication (MVM) within conjugate gradient (CG) methods.

The structure of this paper is as follows: In section 2 we provide an overview of Gaussian process regression. In section 3 we review sparse approximations under the unifying view, recently proposed by Quiñonero-Candela and Rasmussen (2005), based on inducing inputs and conditionally independent approximations to the prior. Section 4 describes work on approximate MVM methods for GP regression. The problem of selecting the inducing inputs is discussed in section 5. We address methods for setting hyperparameters in the kernel in section 6. The main focus of the chapter is on GP regression, but in section 7 we briefly describe how these methods can be extended to the classification problem. Conclusions are drawn in section 8.

2 Gaussian Process Regression

Probabilistic regression is usually formulated as follows: given a training set D = {(x_i, y_i), i = 1, ..., n} of n pairs of (vectorial) inputs x_i and noisy (real, scalar) outputs y_i, compute the predictive distribution of the function values f_* (or noisy y_*) at test locations x_*. In the simplest case (which we deal with here) we assume that the noise is additive, independent and Gaussian, such that the relationship between the (latent) function f(x) and the observed noisy targets y is given by

y_i = f(x_i) + \varepsilon_i , \quad \text{where} \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2_{\mathrm{noise}}) ,   (1)

where \sigma^2_{\mathrm{noise}} is the variance of the noise, and we use the notation \mathcal{N}(a, A) for the Gaussian distribution with mean a and covariance A.

Definition 1. A Gaussian process (GP) is a collection of random variables, any finite number of which have consistent^1 joint Gaussian distributions.

Gaussian process (GP) regression is a Bayesian approach which assumes a GP prior^2 over functions, i.e. that a priori the function values behave according to

p(f | x_1, x_2, ..., x_n) = \mathcal{N}(0, K) ,   (2)

^1 The random variables obey the usual rules of marginalization, etc.
^2 For notational simplicity we exclusively use zero-mean priors.


where f = [f_1, f_2, ..., f_n]^\top is a vector of latent function values, f_i = f(x_i), and K is a covariance matrix whose entries are given by the covariance function, K_{ij} = k(x_i, x_j). Valid covariance functions give rise to positive semidefinite covariance matrices. In general, positive semidefinite kernels are valid covariance functions. The covariance function encodes our assumptions about the function we wish to learn, by defining a notion of similarity between two function values, as a function of the corresponding two inputs. A very common covariance function is the Gaussian, or squared exponential:

K_{ij} = k(x_i, x_j) = v^2 \exp\left( - \frac{\|x_i - x_j\|^2}{2 \lambda^2} \right) ,   (3)

where v^2 controls the prior variance, and \lambda is an isotropic lengthscale parameter that controls the rate of decay of the covariance, i.e. determines how far away x_i must be from x_j for f_i to be unrelated to f_j. We term the parameters that define the covariance function hyperparameters. Learning of the hyperparameters based on data is discussed in section 6.
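As a concrete illustration, the squared exponential covariance (3) takes only a few lines of NumPy; the function name and the default hyperparameter values in this sketch are our own choices, not something specified in the text.

import numpy as np

def sq_exp_cov(A, B, v2=1.0, lengthscale=1.0):
    # Squared exponential covariance of equation (3) between input sets
    # A (n x d) and B (m x d): K_ij = v^2 exp(-||a_i - b_j||^2 / (2 lambda^2)).
    sqdist = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
              - 2.0 * A @ B.T)
    return v2 * np.exp(-0.5 * sqdist / lengthscale**2)

# Example: covariance matrix of five one-dimensional inputs.
X = np.linspace(-10, 10, 5).reshape(-1, 1)
K = sq_exp_cov(X, X, v2=0.7, lengthscale=1.2)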

Note that the GP treats the latent function values f_i as random variables, indexed by the corresponding input. In the following, for simplicity we will always neglect the explicit conditioning on the inputs; the GP model and all expressions are always conditional on the corresponding inputs. The GP model is concerned only with the conditional of the outputs given the inputs; we do not model anything about the distribution of the inputs themselves.

As we will see in later sections, some approximations are strictly equivalent to GPs, while others are not. That is, the implied prior may still be multivariate Gaussian, but the covariance function may be different for training and test cases.

Definition 2. A Gaussian process is called degenerate iff the covariance function has a finite number of non-zero eigenvalues.

Degenerate GPs (such as e.g. with polynomial covariance function) correspond to finite linear (-in-the-parameters) models, whereas non-degenerate GPs (such as e.g. the squared exponential or RBF covariance function) do not. The prior for a finite m-dimensional linear model only considers a universe of at most m linearly independent functions; this may often be too restrictive when n ≫ m. Note however, that non-degeneracy on its own doesn't guarantee the existence of the "right kind" of flexibility for a given particular modelling task. For a more detailed background on GP models, see for example Rasmussen and Williams (2006).

Inference in the GP model is simple: we put a joint GP prior on training and test latent values, f and f_*^3, and combine it with the likelihood p(y|f) using Bayes' rule, to obtain the joint posterior

p(f, f_* | y) = \frac{p(f, f_*) \, p(y | f)}{p(y)} ,   (4)

^3 We will mostly consider a vector of test cases f_* (rather than a single f_*) evaluated at a set of test inputs X_*.


where y = [y_1, ..., y_n] is the vector of training targets. The final step needed to produce the desired posterior predictive distribution is to marginalize out the unwanted training set latent variables:

p(f_* | y) = \int p(f, f_* | y) \, df = \frac{1}{p(y)} \int p(y | f) \, p(f, f_*) \, df ,   (5)

or in words: the predictive distribution is the marginal of the renormalized joint prior times the likelihood. The joint GP prior and the independent likelihood are both Gaussian

p(f, f_*) = \mathcal{N}\left( 0, \begin{bmatrix} K_{f,f} & K_{*,f} \\ K_{f,*} & K_{*,*} \end{bmatrix} \right) , \quad \text{and} \quad p(y | f) = \mathcal{N}(f, \sigma^2_{\mathrm{noise}} I) ,   (6)

where K is subscripted by the variables between which the covariance is computed (and we use the asterisk * as shorthand for f_*) and I is the identity matrix. Since both factors in the integral are Gaussian, the integral can be evaluated in closed form to give the Gaussian predictive distribution

p(f_* | y) = \mathcal{N}\big( K_{*,f} (K_{f,f} + \sigma^2_{\mathrm{noise}} I)^{-1} y , \; K_{*,*} - K_{*,f} (K_{f,f} + \sigma^2_{\mathrm{noise}} I)^{-1} K_{f,*} \big) .   (7)

The problem with the above expression is that it requires inversion of a matrix of size n × n which requires O(n^3) operations, where n is the number of training cases. Thus, the simple exact implementation can handle problems with at most a few thousand training cases on today's desktop machines.
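To make the O(n^3) cost concrete, here is a minimal NumPy sketch of the exact predictive equation (7). It assumes the covariance blocks K_{f,f}, K_{*,f} and K_{*,*} have already been evaluated (for instance with a function like sq_exp_cov above); the Cholesky-based solves are our implementation choice, not something prescribed by the text.

import numpy as np

def gp_exact_predict(Kff, Ksf, Kss, y, sigma2_noise):
    # Exact GP prediction, equation (7). The O(n^3) cost is the factorization
    # of the n x n matrix K_{f,f} + sigma^2_noise * I.
    n = Kff.shape[0]
    L = np.linalg.cholesky(Kff + sigma2_noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K_{f,f} + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, Ksf.T)                        # L^{-1} K_{f,*}
    mean = Ksf @ alpha                                   # predictive mean of (7)
    cov = Kss - V.T @ V                                  # predictive covariance of (7)
    return mean, cov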

3 Sparse Approximations Based on Inducing Variables

To overcome the computational limitations of GP regression, numerous authors have recently suggested a wealth of sparse approximations. Common to all these approximation schemes is that only a subset of the latent variables are treated exactly, and the remaining variables are given some approximate, but computationally cheaper treatment. However, the published algorithms have widely different motivations, emphasis and exposition, so it is difficult to get an overview of how they relate to each other, and which can be expected to give rise to the best algorithms.

A useful discussion of some of the approximation methods is given in chapter 8 of Rasmussen and Williams (2006). In this Section we go beyond this, and provide a unifying view of sparse approximations for GP regression, following Quiñonero-Candela and Rasmussen (2005). For each algorithm we analyze the posterior, and compute the effective prior which it is using. Thus, we reinterpret the algorithms as "exact inference with an approximate prior", rather than the existing (ubiquitous) interpretation "approximate inference with the exact prior". This approach has the advantage of directly expressing the approximations in terms of prior assumptions about the function, which makes the consequences of the approximations much easier to understand. While this view of the approximations is not the only one possible (Seeger, 2003, categorizes sparse GP approximations according to likelihood approximations), it has the advantage of putting all existing probabilistic sparse approximations under one umbrella, thus enabling direct comparison and revealing the relation between them.

We now seek to modify the joint prior p(f_*, f) from (6) in ways which will reduce the computational requirements from (7). Let us first rewrite that prior by introducing an additional set of m latent variables u = [u_1, ..., u_m]^\top, which we call the inducing variables. These latent variables are values of the Gaussian process (as also f and f_*), corresponding to a set of input locations X_u, which we call the inducing inputs. Whereas the additional latent variables u are always marginalized out in the predictive distribution, the choice of inducing inputs does leave an imprint on the final solution. The inducing variables will turn out to be generalizations of variables which other authors have referred to variously as "support points", "active set" or "pseudo-inputs". Particular sparse algorithms choose the inducing variables in various different ways; some algorithms choose the inducing inputs to be a subset of the training set, others not, as we will discuss in Section 5. For now consider any arbitrary inducing variables.

Due to the consistency of Gaussian processes, we know that we can recover p(f_*, f) by simply integrating (marginalizing) out u from the joint GP prior p(f_*, f, u)

p(f_*, f) = \int p(f_*, f, u) \, du = \int p(f_*, f | u) \, p(u) \, du , \quad \text{where} \quad p(u) = \mathcal{N}(0, K_{u,u}) .   (8)

This is an exact expression. Now, we introduce the fundamental approximation which gives rise to almost all sparse approximations. We approximate the joint prior by assuming that f_* and f are conditionally independent given u, see Figure 1, such that

p(f_*, f) \simeq q(f_*, f) = \int q(f_* | u) \, q(f | u) \, p(u) \, du .   (9)

The name inducing variable is motivated by the fact that f and f_* can only communicate through u, and u therefore induces the dependencies between training and test cases. As we shall detail in the following sections, the different computationally efficient algorithms proposed in the literature correspond to different additional assumptions about the two approximate inducing conditionals q(f | u), q(f_* | u) of the integral in (9). It will be useful for future reference to specify here the exact expressions for the two conditionals

training conditional:  p(f | u) = \mathcal{N}(K_{f,u} K_{u,u}^{-1} u, \; K_{f,f} - Q_{f,f}) ,   (10a)
test conditional:      p(f_* | u) = \mathcal{N}(K_{*,u} K_{u,u}^{-1} u, \; K_{*,*} - Q_{*,*}) ,   (10b)

where we have introduced the shorthand notation^4 Q_{a,b} := K_{a,u} K_{u,u}^{-1} K_{u,b}.

^4 Note that Q_{a,b} depends on u although this is not explicit in the notation.


Figure 1: Graphical model of the relation between the inducing variables u, the training latent function values f = [f_1, ..., f_n]^\top and the test function value f_*. The thick horizontal line represents a set of fully connected nodes. The observations y_1, ..., y_n, y_* (not shown) would dangle individually from the corresponding latent values, by way of the exact (factored) likelihood (6). Upper graph: the fully connected graph corresponds to the case where no approximation is made to the full joint Gaussian process distribution between these variables. The inducing variables u are superfluous in this case, since all latent function values can communicate with all others. Lower graph: assumption of conditional independence between training and test function values given u. This gives rise to the separation between training and test conditionals from (9). Notice that having cut the communication path between training and test latent function values, information from f can only be transmitted to f_* via the inducing variables u.

We can readily identify the expressions in (10) as special (noise free) cases of the standard predictive equation (7) with u playing the role of (noise free) observations. Note that the (positive semidefinite) covariance matrices in (10) have the form K − Q with the following interpretation: the prior covariance K minus a (non-negative definite) matrix Q quantifying how much information u provides about the variables in question (f or f_*). We emphasize that all the sparse methods discussed in this chapter correspond simply to different approximations to the conditionals in (10), and throughout we use the exact likelihood and inducing prior

p(y | f) = \mathcal{N}(f, \sigma^2_{\mathrm{noise}} I) , \quad \text{and} \quad p(u) = \mathcal{N}(0, K_{u,u}) .   (11)
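Everything that follows is built from a handful of matrix operations on these covariance blocks. As a sketch (our notation; the blocks K_{u,u}, K_{f,u}, K_{f,f} are assumed precomputed, and the small jitter term is a standard numerical safeguard rather than part of the text), here is the low-rank matrix Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b} and the exact training conditional (10a).

import numpy as np

def Q(Kau, Kuu, Kub, jitter=1e-10):
    # Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b}, via a Cholesky factor of K_{u,u}.
    L = np.linalg.cholesky(Kuu + jitter * np.eye(Kuu.shape[0]))
    return np.linalg.solve(L, Kau.T).T @ np.linalg.solve(L, Kub)

def training_conditional(Kfu, Kuu, Kff, u):
    # Moments of p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} - Q_{f,f}), equation (10a).
    mean = Kfu @ np.linalg.solve(Kuu, u)
    cov = Kff - Q(Kfu, Kuu, Kfu.T)
    return mean, cov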

The sparse approximations we present in the following subsections are systematically sorted by decreasing crudeness of the additional approximations to the training and test conditionals. We will present the subset of data (SoD), subset of regressors (SoR), deterministic training conditional (DTC), and the partially and fully independent training conditional (PITC and FITC) approximations.

3.1 The Subset of Data (SoD) Approximation

Before we get started with the more sophisticated approximations, we mention as a baseline method the simplest possible sparse approximation (which doesn't fall inside our general scheme): use only a subset of the data (SoD). The computational complexity is reduced to O(m^3), where m < n. We would not generally expect SoD to be a competitive method, since it would seem impossible (even with fairly redundant data and a good choice of the subset) to get a realistic picture of the uncertainties when only a part of the training data is even considered. We include it here mostly as a baseline against which to compare better sparse approximations.

We will illustrate the various sparse approximations on a toy dataset, as illustrated in Figure 4. There are n = 100 datapoints in 1-D, but for the sparse methods we have randomly selected m = 10 of these data points. The target function is given by sin(x)/x with additive, independent, identically distributed Gaussian noise of amplitude 0.2. The training inputs lie on a uniform grid in [−10, 10]. For the predictions we have used a squared exponential (SE) covariance function (equation 3), with slightly too short a lengthscale λ = 1.2, chosen so as to emphasize the different behaviours, and with amplitude v^2 = 0.7. Figure 4 top left shows the predictive mean and two standard deviation error bars obtained with the full dataset. The 500 test inputs lie on a uniform grid in [−14, 14].

In Figure 4 top right, we see how the SoD method produces wide predictive distributions when training on a randomly selected subset of 10 cases. A fair comparison to other methods would take into account that the computational complexity is independent of n as opposed to other more advanced methods. These extra computational resources could be spent in a number of ways, e.g. larger m, or an active (rather than random) selection of the m points. In this Chapter we will concentrate on understanding the theoretical foundations of the various approximations rather than investigating the necessary heuristics needed to turn the approximation schemes into practical algorithms.

3.2 The Subset of Regressors (SoR)

SoR models are finite linear-in-the-parameters models with a particular prior on the weights. For any input x_*, the corresponding function value f_* is given by:

f_* = \sum_{i=1}^{m} k(x_*, x_u^i) \, w_u^i = K_{*,u} \, w_u , \quad \text{with} \quad p(w_u) = \mathcal{N}(0, K_{u,u}^{-1}) ,   (12)

where there is one weight associated with each inducing input in X_u. Note that the covariance matrix for the prior on the weights is the inverse of that on u, such that we recover the exact GP prior on u, which is Gaussian with zero mean and covariance

u = K_{u,u} w_u \;\Rightarrow\; \langle u u^\top \rangle = K_{u,u} \langle w_u w_u^\top \rangle K_{u,u} = K_{u,u} .   (13)

Using the effective prior on u and the fact that w_u = K_{u,u}^{-1} u we can redefine the SoR model in an equivalent, more intuitive way:

f_* = K_{*,u} K_{u,u}^{-1} u , \quad \text{with} \quad u \sim \mathcal{N}(0, K_{u,u}) .   (14)

The Subset of Regressors (SoR) algorithm was proposed in Wahba (1990, chapter 7), and in Poggio and Girosi (1990, eq. 25) via the regularization framework. It was adapted by Smola and Bartlett (2001) to propose a sparse greedy approximation to Gaussian process regression.

We are now ready to integrate the SoR model in our unifying framework. Given that there is a deterministic relation between any f_* and u, the approximate conditional distributions in the integral in (9) are given by:

q_{SoR}(f | u) = \mathcal{N}(K_{f,u} K_{u,u}^{-1} u, \; 0) , \quad \text{and} \quad q_{SoR}(f_* | u) = \mathcal{N}(K_{*,u} K_{u,u}^{-1} u, \; 0) ,   (15)

with zero conditional covariance, compare to (10). The effective prior implied by the SoR approximation is easily obtained from (9), giving

q_{SoR}(f, f_*) = \mathcal{N}\left( 0, \begin{bmatrix} Q_{f,f} & Q_{f,*} \\ Q_{*,f} & Q_{*,*} \end{bmatrix} \right) ,   (16)

where we recall Q_{a,b} := K_{a,u} K_{u,u}^{-1} K_{u,b}. A more descriptive name for this method would be the Deterministic Inducing Conditional (DIC) approximation. We see that this approximate prior is degenerate. There are only m degrees of freedom in the model, which implies that only m linearly independent functions can be drawn from the prior. The (m+1)-th one is a linear combination of the previous ones. For example, in a very low noise regime, the posterior could be severely constrained by only m training cases.

The degeneracy of the prior can cause unreasonable predictive distributions. For covariance functions that decay to zero for a pair of faraway inputs, it is immediate to see that Q_{*,*} will go to zero when the test inputs X_* are far away from the inducing inputs X_u. As a result, f_* will have no prior variance under the approximate prior (16), and therefore the predictive variances will tend to zero as well far from the inducing inputs. This is unreasonable, because the area around the inducing inputs is where we are gathering the most information about the training data: we would like to be most uncertain about our predictions when far away from the inducing inputs. For covariance functions that do not decay to zero, the approximate prior over functions is still very restrictive, and given enough data only a very limited family of functions will be plausible under the posterior, leading to overconfident predictive variances. This is a general problem of finite linear models with small numbers of weights (for more details see Rasmussen and Quiñonero-Candela, 2005). Figure 4, middle left panel, illustrates the unreasonable predictive uncertainties of the SoR approximation on a toy dataset.^5

^5 Wary of this fact, Smola and Bartlett (2001) propose using the predictive variances of the SoD method, or a more accurate but computationally costly alternative (more details are given in Quiñonero-Candela, 2004, Chapter 3).

The predictive distribution is obtained by using the SoR approximate prior (16) instead of the true prior in (5). For each algorithm we give two forms of the predictive distribution, one which is easy to interpret, and the other which is economical to compute with:

q_{SoR}(f_* | y) = \mathcal{N}\big( Q_{*,f} (Q_{f,f} + \sigma^2_{\mathrm{noise}} I)^{-1} y , \; Q_{*,*} - Q_{*,f} (Q_{f,f} + \sigma^2_{\mathrm{noise}} I)^{-1} Q_{f,*} \big)   (17a)
           = \mathcal{N}\big( \sigma^{-2} K_{*,u} \Sigma K_{u,f} \, y , \; K_{*,u} \Sigma K_{u,*} \big) ,   (17b)

where we have defined \Sigma = (\sigma^{-2} K_{u,f} K_{f,u} + K_{u,u})^{-1}. Equation (17a) is readily recognized as the regular prediction equation (7), except that the covariance K has everywhere been replaced by Q, which was already suggested by (16). This corresponds to replacing the covariance function k with k_{SoR}(x_i, x_j) = k(x_i, u) K_{u,u}^{-1} k(u, x_j). The new covariance function has rank (at most) m. Thus we have the following

Remark 3. The SoR approximation is equivalent to exact inference in the degenerate Gaussian process with covariance function k_{SoR}(x_i, x_j) = k(x_i, u) K_{u,u}^{-1} k(u, x_j).

The equivalent (17b) is computationally cheaper, and with (12) in mind, \Sigma is the covariance of the posterior on the weights w_u. Note that as opposed to the subset of data method, all training cases are taken into account. The computational complexity is O(nm^2) initially, and O(m) and O(m^2) per test case for the predictive mean and variance respectively.
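A minimal sketch of the economical form (17b), assuming the blocks K_{u,u}, K_{u,f} and K_{*,u} are precomputed NumPy arrays; the Cholesky factorization of \Sigma^{-1} is our implementation choice. The dominant O(nm^2) cost is the product K_{u,f} K_{f,u}.

import numpy as np

def sor_predict(Kuu, Kuf, Ksu, y, sigma2_noise):
    # SoR prediction, equation (17b):
    #   Sigma = (sigma^{-2} K_{u,f} K_{f,u} + K_{u,u})^{-1}
    #   mean  = sigma^{-2} K_{*,u} Sigma K_{u,f} y
    #   cov   = K_{*,u} Sigma K_{u,*}
    Sigma_inv = Kuu + Kuf @ Kuf.T / sigma2_noise
    L = np.linalg.cholesky(Sigma_inv)                 # Sigma = (L L^T)^{-1}
    tmp = np.linalg.solve(L, Kuf @ y) / sigma2_noise  # L^{-1} K_{u,f} y / sigma^2
    V = np.linalg.solve(L, Ksu.T)                     # L^{-1} K_{u,*}
    mean = V.T @ tmp
    cov = V.T @ V
    return mean, cov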

3.3 The Deterministic Training Conditional (DTC) Approximation

Taking up ideas contained in the work of Csató and Opper (2002), Seeger et al. (2003) recently proposed another sparse approximation to Gaussian process regression, which does not suffer from the nonsensical predictive uncertainties of the SoR approximation, but that interestingly leads to exactly the same predictive mean. Seeger et al. (2003), who called the method Projected Latent Variables (PLV), presented the method as relying on a likelihood approximation, based on the projection f = K_{f,u} K_{u,u}^{-1} u:

p(y | f) \simeq q(y | u) = \mathcal{N}(K_{f,u} K_{u,u}^{-1} u, \; \sigma^2_{\mathrm{noise}} I) .   (18)

The method has also been called the Projected Process Approximation (PPA) by Rasmussen and Williams (2006, Chapter 8). One way of obtaining an equivalent model is to retain the usual likelihood, but to impose a deterministic training conditional and the exact test conditional from (10b)

q_{DTC}(f | u) = \mathcal{N}(K_{f,u} K_{u,u}^{-1} u, \; 0) , \quad \text{and} \quad q_{DTC}(f_* | u) = p(f_* | u) .   (19)

This reformulation has the advantage of allowing us to stick to our view of exact inference (with exact likelihood) with approximate priors. Indeed, under this model the conditional distribution of f given u is identical to that of the SoR, given in the left of (15). In this framework a systematic name for this approximation is the Deterministic Training Conditional (DTC).

The fundamental difference with SoR is that DTC uses the exact test conditional (10b) instead of the deterministic relation between f_* and u of SoR. The joint prior implied by DTC is given by:

q_{DTC}(f, f_*) = \mathcal{N}\left( 0, \begin{bmatrix} Q_{f,f} & Q_{f,*} \\ Q_{*,f} & K_{*,*} \end{bmatrix} \right) ,   (20)

which is surprisingly similar to the effective prior implied by the SoR approximation (16). The difference is that under the DTC approximation f_* has a prior variance of its own, given by K_{*,*}. This prior variance reverses the behaviour of the predictive uncertainties, and turns them into sensible ones, see Figure 4, middle right, for an illustration.

The predictive distribution is now given by:

q_{DTC}(f_* | y) = \mathcal{N}\big( Q_{*,f} (Q_{f,f} + \sigma^2_{\mathrm{noise}} I)^{-1} y , \; K_{*,*} - Q_{*,f} (Q_{f,f} + \sigma^2_{\mathrm{noise}} I)^{-1} Q_{f,*} \big)   (21a)
           = \mathcal{N}\big( \sigma^{-2} K_{*,u} \Sigma K_{u,f} \, y , \; K_{*,*} - Q_{*,*} + K_{*,u} \Sigma K_{*,u}^\top \big) ,   (21b)

where again we have defined \Sigma = (\sigma^{-2} K_{u,f} K_{f,u} + K_{u,u})^{-1} as in (17). The predictive mean for the DTC is identical to that of the SoR approximation (17), but the predictive variance replaces the Q_{*,*} from SoR with K_{*,*} (which is larger, since K_{*,*} - Q_{*,*} is positive semidefinite). This added term is the predictive variance of the posterior of f_* conditioned on u. It grows to the prior variance K_{*,*} as x_* moves far from the inducing inputs in X_u.

Remark 4. The only difference between the predictive distribution of DTC and SoR is the variance. The predictive variance of DTC is never smaller than that of SoR.

Note that, since the covariances for training cases and test cases are computed differently, see (20), it follows that

Remark 5. The DTC approximation does not correspond exactly to a Gaussian process,

as the covariance between latent values depends on whether they are considered training or test cases, violating consistency, see Definition 1. The computational complexity has the same order as for SoR.
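Since DTC shares the SoR predictive mean and only changes the variance, a sketch of (21b) needs one extra term, the diagonal of K_{*,*} - Q_{*,*}. As before, the covariance blocks are assumed precomputed, and the factorizations and jitter are implementation choices of ours.

import numpy as np

def dtc_predict(Kuu, Kuf, Ksu, Kss_diag, y, sigma2_noise, jitter=1e-10):
    # DTC prediction, equation (21b). Identical to SoR except that the variance
    # gains the term K_{*,*} - Q_{*,*}; only marginal variances are returned here.
    m = Kuu.shape[0]
    Luu = np.linalg.cholesky(Kuu + jitter * np.eye(m))
    Sigma_inv = Kuu + Kuf @ Kuf.T / sigma2_noise
    L = np.linalg.cholesky(Sigma_inv)
    tmp = np.linalg.solve(L, Kuf @ y) / sigma2_noise
    V = np.linalg.solve(L, Ksu.T)           # L^{-1} K_{u,*}
    W = np.linalg.solve(Luu, Ksu.T)         # Luu^{-1} K_{u,*}, so diag(Q_{*,*}) = sum(W*W)
    mean = V.T @ tmp
    var = Kss_diag - np.sum(W * W, axis=0) + np.sum(V * V, axis=0)
    return mean, var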


3.4 Partially Independent Training Conditional Approximations

The sparse approximations we have seen so far, SoR and DTC, both impose a deterministic relation between the training and inducing latent variables, resulting in inducing training conditionals where the covariance matrix has been set to zero, (15) and (19). A less crude approximation to the training conditional is to preserve a block-diagonal of the true covariance matrix, given by (10a), and set the remaining entries to zero. The corresponding graphical model is shown in Figure 2; notice that the variables in f are divided into k groups. Each group corresponds to a block in the block-diagonal matrix. This structure is equivalent to assuming conditional independence only for part of the training function values (those with covariance set to zero). We call this the partially independent training conditional (PITC) approximation:

q_{PITC}(f | u) = \mathcal{N}\big( K_{f,u} K_{u,u}^{-1} u, \; \mathrm{blockdiag}[K_{f,f} - Q_{f,f}] \big) , \quad \text{and} \quad q_{PITC}(f_* | u) = p(f_* | u) ,   (22)

where blockdiag[A] is a block-diagonal matrix (where the blocking structure is not explicitly stated). For now, we consider one single test input given by the exact test conditional (10b). As we discuss later in this section, the joint test conditional can also be approximated in various ways. The effective prior implied by the PITC approximation is given by

q_{PITC}(f, f_*) = \mathcal{N}\left( 0, \begin{bmatrix} Q_{f,f} - \mathrm{blockdiag}[Q_{f,f} - K_{f,f}] & Q_{f,*} \\ Q_{*,f} & K_{*,*} \end{bmatrix} \right) ,   (23)

Note that the sole difference between the DTC and PITC is that in the top left corner of the implied prior covariance, PITC replaces the approximate covariances of DTC by the exact ones on the block-diagonal. The predictive distribution is

q_{PITC}(f_* | y) = \mathcal{N}\big( Q_{*,f} (Q_{f,f} + \Lambda)^{-1} y , \; K_{*,*} - Q_{*,f} (Q_{f,f} + \Lambda)^{-1} Q_{f,*} \big)   (24a)
            = \mathcal{N}\big( K_{*,u} \Sigma K_{u,f} \Lambda^{-1} y , \; K_{*,*} - Q_{*,*} + K_{*,u} \Sigma K_{u,*} \big) ,   (24b)

where we have defined \Sigma = (K_{u,u} + K_{u,f} \Lambda^{-1} K_{f,u})^{-1} and \Lambda = \mathrm{blockdiag}[K_{f,f} - Q_{f,f} + \sigma^2_{\mathrm{noise}} I]. An identical expression was obtained by Schwaighofer and Tresp (2003, Sect. 3), developed from the original Bayesian Committee Machine (BCM) of Tresp (2000). The BCM was originally proposed as a transductive learner (i.e. where the test inputs have to be known before training), and the inducing inputs X_u were chosen to be the test inputs.

However, it is important to realize that the BCM proposes two orthogonal ideas: first, the block-diagonal structure of the partially independent training conditional, and second, setting the inducing inputs to be the test inputs. These two ideas can be used independently and in Section 5 we propose using the first without the second.
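The Λ-based form (24b) covers DTC, FITC and PITC at once: only the definition of Λ changes. The sketch below handles the PITC case with Λ stored block by block; for FITC every block is a single point, so Λ is just a vector. The block partitioning, the helper names and the jitter are our own choices, the covariance blocks are assumed precomputed, and only marginal predictive variances are returned.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def pitc_predict(Kuu, Kuf, Ksu, Kss_diag, Kff_blocks, y, sigma2_noise, block_index):
    # PITC prediction, equation (24b), with
    #   Lambda = blockdiag[K_{f,f} - Q_{f,f}] + sigma^2_noise I
    #   Sigma  = (K_{u,u} + K_{u,f} Lambda^{-1} K_{f,u})^{-1}
    # block_index is a list of index arrays defining the k groups, and
    # Kff_blocks[i] is the exact training covariance of group i.
    m = Kuu.shape[0]
    Luu = cho_factor(Kuu + 1e-10 * np.eye(m), lower=True)
    A = np.zeros((m, m))   # accumulates K_{u,f} Lambda^{-1} K_{f,u}
    b = np.zeros(m)        # accumulates K_{u,f} Lambda^{-1} y
    for idx, Kblk in zip(block_index, Kff_blocks):
        Kui = Kuf[:, idx]                                   # m x n_i
        Qii = Kui.T @ cho_solve(Luu, Kui)                   # block of Q_{f,f}
        Lam = Kblk - Qii + sigma2_noise * np.eye(len(idx))  # block of Lambda
        LamInvKu = np.linalg.solve(Lam, Kui.T)              # Lambda_i^{-1} K_{f_i,u}
        A += Kui @ LamInvKu
        b += LamInvKu.T @ y[idx]
    Sig = np.linalg.inv(Kuu + A)                            # Sigma
    mean = Ksu @ (Sig @ b)
    Qss_diag = np.sum(Ksu.T * cho_solve(Luu, Ksu.T), axis=0)
    var = Kss_diag - Qss_diag + np.sum(Ksu.T * (Sig @ Ksu.T), axis=0)
    return mean, var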


Figure 2: Graphical representation of the PITC approximation. The set of latent function values f_{I_i} indexed by the set of indices I_i is fully connected. The PITC is more general than the FITC (see graph in Fig. 3) in that conditional independence is between k groups of training latent function values. This corresponds to the block-diagonal approximation to the true training conditional given in (22).

The computational complexity of the PITC approximation depends on the blocking structure imposed in (22). A reasonable choice, also recommended by Tresp (2000), may be to choose k = n/m blocks, each of size m × m. The computational complexity is thus O(nm^2). Since in the PITC model the covariance is computed differently for training and test cases, we have

Remark 6. The PITC approximation does not correspond exactly to a Gaussian process.

This is because computing covariances requires knowing whether points are from the training or test set, (23). To obtain a Gaussian process from the PITC, one would need to extend the partial conditional independence assumption to the joint conditional p(f, f_* | u), which would require abandoning our primal assumption that the training and the test function values are conditionally independent, given the inducing variables.

Fully Independent Training Conditional    Recently Snelson and Ghahramani (2006) proposed another likelihood approximation to speed up Gaussian process regression, which they called Sparse Pseudo-input Gaussian processes (SPGP). While the DTC is based on the likelihood approximation given by (18), the SPGP proposes a more sophisticated likelihood approximation with a richer covariance

p(y | f) \simeq q(y | u) = \mathcal{N}\big( K_{f,u} K_{u,u}^{-1} u, \; \mathrm{diag}[K_{f,f} - Q_{f,f}] + \sigma^2_{\mathrm{noise}} I \big) ,   (25)

where diag[A] is a diagonal matrix whose elements match the diagonal of A. As was pointed out by Csató (2005), the SPGP is an extreme case of the PITC approximation, where the training conditional is taken to be fully independent. We call it the Fully Independent Training Conditional (FITC) approximation. The corresponding graphical model is given in Figure 3.


Figure 3: Graphical model for the FITC approximation. Compared to those in Figure 1, all edges between latent function values have been removed: the latent function values are conditionally fully independent given the inducing variables u. Although strictly speaking the SoR and DTC approximations could also be represented by this graph, note that both further assume a deterministic relation between f and u. The FITC is an extreme case of PITC, with k = n unitary groups (blocks), see Figure 2.

The effective prior implied by the FITC is given by

q_{FITC}(f, f_*) = \mathcal{N}\left( 0, \begin{bmatrix} Q_{f,f} - \mathrm{diag}[Q_{f,f} - K_{f,f}] & Q_{f,*} \\ Q_{*,f} & K_{*,*} \end{bmatrix} \right) ,   (26)

and we see that in contrast to PITC, FITC only replaces the approximate covariances of DTC by the exact ones on the diagonal, i.e. the approximate prior variances are replaced by the true prior variances.

The predictive distribution of the FITC is identical to that of PITC (24), except for the alternative definition of \Lambda = \mathrm{diag}[K_{f,f} - Q_{f,f} + \sigma^2_{\mathrm{noise}} I]. The computational complexity is identical to that of SoR and DTC.

For FITC, the question again arises of how to treat the test conditional. For predictions at a single test input, this is obvious. For joint predictions, there are two options: either 1) use the exact full test conditional from (10b), or 2) extend the additional factorizing assumption to the test conditional. Although Snelson and Ghahramani (2006) don't explicitly discuss joint predictions, it would seem that they probably intend the second option. Whereas the additional independence assumption for the test cases is not really necessary for computational reasons, it does affect the nature of the approximation. Under option 1) the training and test covariances are computed differently, and thus this does not correspond to our strict definition of a GP model, but

Remark 7. Iff the assumption of full independence is extended to the test conditional, the FITC approximation is equivalent to exact inference in a non-degenerate Gaussian process with covariance function k_{FIC}(x_i, x_j) = k_{SoR}(x_i, x_j) + \delta_{i,j} [ k(x_i, x_j) - k_{SoR}(x_i, x_j) ] ,

where \delta_{i,j} is Kronecker's delta. A logical name for the method where the conditionals (training and test) are always forced to be fully independent would be the Fully Independent Conditional (FIC) approximation. The effective prior implied by FIC is:

q_{FIC}(f, f_*) = \mathcal{N}\left( 0, \begin{bmatrix} Q_{f,f} - \mathrm{diag}[Q_{f,f} - K_{f,f}] & Q_{f,*} \\ Q_{*,f} & Q_{*,*} - \mathrm{diag}[Q_{*,*} - K_{*,*}] \end{bmatrix} \right) .   (27)
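Under the reading of Remark 7, FIC is just exact GP inference with a modified covariance function, so a hypothetical implementation is a thin wrapper around any base kernel k (such as the sq_exp_cov sketch of Section 2). The same_inputs flag below is our simplification: the Kronecker delta correction is only applied on the diagonal when the two input sets are identical.

import numpy as np

def k_fic(k, A, B, Xu, same_inputs=False, jitter=1e-10):
    # FIC covariance of Remark 7:
    #   k_FIC(x_i, x_j) = k_SoR(x_i, x_j) + delta_ij [k(x_i, x_j) - k_SoR(x_i, x_j)],
    # with k_SoR(x_i, x_j) = k(x_i, Xu) K_{u,u}^{-1} k(Xu, x_j).
    Kuu = k(Xu, Xu) + jitter * np.eye(len(Xu))
    Kfic = k(A, Xu) @ np.linalg.solve(Kuu, k(Xu, B))   # this is k_SoR
    if same_inputs:
        # On matching inputs the delta term restores the exact prior variances.
        Kfic[np.diag_indices_from(Kfic)] = np.diag(k(A, B))
    return Kfic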

In Figure 4, bottom, we show the behaviour of the predictive distribution of FITC and PITC (with 10 uniform blocks).

Summary of sparse approximations    In the table below we give a summary of the way the approximations are built. All these methods have been detailed in the previous subsections. The initial cost and that of the mean and variance per test case are respectively n^3, n and n^2 for the exact GP, and nm^2, m and m^2 for all other listed methods. The "GP?" column indicates whether the approximation is equivalent to a GP. For FITC see Remark 7. Recall that we have defined Q_{a,b} := K_{a,u} K_{u,u}^{-1} K_{u,b}.

Method   q(f_*|u)   q(f|u)             joint prior covariance                                                      GP?
GP       exact      exact              [ K_{f,f}  K_{f,*} ;  K_{*,f}  K_{*,*} ]                                    √
SoR      determ.    determ.            [ Q_{f,f}  Q_{f,*} ;  Q_{*,f}  Q_{*,*} ]                                    √
DTC      exact      determ.            [ Q_{f,f}  Q_{f,*} ;  Q_{*,f}  K_{*,*} ]
FITC     (exact)    fully indep.       [ Q_{f,f} - diag[Q_{f,f} - K_{f,f}]  Q_{f,*} ;  Q_{*,f}  K_{*,*} ]          (√)
PITC     exact      partially indep.   [ Q_{f,f} - blockdiag[Q_{f,f} - K_{f,f}]  Q_{f,*} ;  Q_{*,f}  K_{*,*} ]


Figure 4: Toy data. Panels (left to right, top to bottom): full GP, SoD, SoR, DTC, FITC, PITC. All methods use the squared exponential covariance function, and a slightly too short lengthscale, chosen on purpose to emphasize the different behaviour of the predictive uncertainties. The dots are the training points, the crosses are the targets corresponding to the inducing inputs, randomly selected from the training set. The solid line is the mean of the predictive distribution, and the dotted lines show the 95% confidence interval of the predictions. We include a full GP (top left) for reference.


4 Fast Matrix Vector Multiplication (MVM) Approximations

One straightforward method to speed up GP regression is to note that the linear system (K_{f,f} + \sigma^2_{\mathrm{noise}} I) \alpha = y can be solved by an iterative method, for example conjugate gradients (CG). (See Golub and Van Loan (1989, sec. 10.2) for further details on the CG method.) Each iteration of the CG method requires a matrix-vector multiplication (MVM) which takes O(n^2) time. Conjugate gradients gives the exact solution (ignoring round-off errors) if run for n iterations, but it will give an approximate solution if terminated earlier, say after k iterations, with time complexity O(kn^2). The CG method has been suggested by Wahba et al. (1995) (in the context of numerical weather prediction) and by Gibbs and MacKay (1997) (in the context of general GP regression).
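For reference, a minimal SciPy sketch of the CG route is given below. The matrix is only touched through matrix-vector products, so the exact MVM wrapped in the LinearOperator could be swapped for any of the approximate MVM schemes discussed next; the wrapper and the iteration cap are our own choices.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def cg_solve(Kff, y, sigma2_noise, maxiter=100):
    # Solve (K_{f,f} + sigma^2_noise I) alpha = y by conjugate gradients.
    # Each iteration costs one MVM, O(n^2) if the product is computed exactly.
    n = len(y)
    def mvm(v):
        return Kff @ v + sigma2_noise * v   # replace with a fast approximate MVM
    A = LinearOperator((n, n), matvec=mvm, dtype=float)
    alpha, info = cg(A, y, maxiter=maxiter)
    return alpha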

However, the scaling of the CG method (at least O(n^2)) is too slow to really be useful for large problems. Recently a number of researchers have noted that if the MVM step can be approximated efficiently then the CG method can be speeded up. Each CG iteration involves an MVM of the form a = (K_{f,f} + \sigma^2_{\mathrm{noise}} I) v, which requires n inner products a_i = \sum_{j=1}^{n} k(x_i, x_j) v_j. Each a_i is a weighted sum of kernels between source points {x_j} and target point x_i. Such sums can be approximated using multi-resolution space-partitioning data structures. For example Shen et al. (2006) suggest a simple approximation using a k-d tree, while Gray (2004) and Freitas et al. (2006) make use of dual trees. The Improved Fast Gauss Transform (IFGT) (Yang et al., 2005) is another method for approximating the weighted sums. It is specific to the Gaussian kernel and uses a more sophisticated approximation based on Taylor expansions. Generally experimental results show that these fast MVM methods are most effective when the input space is low dimensional.

Notice that these methods for accelerating the weighted sum of kernel functions can also be used at test time, and may be particularly helpful when n is large.

5 Selecting the Inducing Variables

We have until now assumed that the inducing inputs X_u were given. Traditionally, sparse models have very often been built upon a carefully chosen subset of the training inputs. This concept is probably best exemplified in the popular support vector machine (SVM) (Cortes and Vapnik, 1995). The recent work by Keerthi et al. (2006) seeks to further sparsify the SVM. In sparse Gaussian processes, it has also been suggested to select the inducing inputs X_u from among the training inputs. Since this involves a prohibitive combinatorial optimization, greedy optimization approaches have been suggested using various selection criteria like online learning (Csató and Opper, 2002), greedy posterior maximization (Smola and Bartlett, 2001), maximum information gain (Lawrence et al., 2003; Seeger et al., 2003), matching pursuit (Keerthi and Chu, 2006), and others. As discussed in Section 3.4, selecting the inducing inputs from among the test inputs has also been considered in transductive settings by Tresp (2000). Recently, Snelson and Ghahramani (2006) have proposed to relax the constraint that the inducing variables must be a subset of training/test cases, turning the discrete selection problem into one of continuous optimization. One may hope that finding a good solution is easier in the continuous than the discrete case, although finding the global optimum is intractable in both cases. And it is possible that the less restrictive choice can lead to better performance in very sparse models.

Which criterion should be used to set the inducing inputs? Departing from a fully Bayesian treatment which would involve defining priors on X_u, one could maximize the marginal likelihood (also called the evidence) with respect to X_u, an approach also followed by Snelson and Ghahramani (2006). Each of the approximate methods proposed involves a different effective prior, and hence its own particular effective marginal likelihood conditioned on the inducing inputs

q(y | X_u) = \iint p(y | f) \, q(f | u) \, p(u | X_u) \, du \, df = \int p(y | f) \, q(f | X_u) \, df ,   (28)

which of course is independent of the test conditional. We have in the above equation explicitly conditioned on the inducing inputs X_u. Using Gaussian identities, the effective marginal likelihood is very easily obtained by adding a ridge \sigma^2_{\mathrm{noise}} I (from the likelihood) to the covariance of the effective prior on f. Using the appropriate definitions of \Lambda, the log marginal likelihood becomes

\log q(y | X_u) = -\tfrac{1}{2} \log |Q_{f,f} + \Lambda| - \tfrac{1}{2} y^\top (Q_{f,f} + \Lambda)^{-1} y - \tfrac{n}{2} \log(2\pi) ,   (29)

where \Lambda_{SoR} = \Lambda_{DTC} = \sigma^2_{\mathrm{noise}} I, \Lambda_{FITC} = \mathrm{diag}[K_{f,f} - Q_{f,f}] + \sigma^2_{\mathrm{noise}} I, and \Lambda_{PITC} = \mathrm{blockdiag}[K_{f,f} - Q_{f,f}] + \sigma^2_{\mathrm{noise}} I. The computational cost of the marginal likelihood is O(nm^2) for all methods, and that of its gradient with respect to one element of X_u is O(nm). This of course implies that the complexity of computing the gradient with respect to the whole of X_u is O(dnm^2), where d is the dimension of the input space.

It has been proposed to maximize the effective posterior instead of the effective marginal likelihood (Smola and Bartlett, 2001). However, this is potentially dangerous and can lead to overfitting. Maximizing the whole evidence instead is sound and comes at an identical computational cost (for a deeper analysis, see Quiñonero-Candela, 2004, Sect. 3.3.5 and Fig. 3.2).

The marginal likelihood has traditionally been used to learn the hyperparameters of GPs, in the non-fully-Bayesian treatment (see for example Williams and Rasmussen, 1996). For the sparse approximations presented here, once you are learning X_u it is straightforward to allow for learning hyperparameters (of the covariance function) during the same optimization. For methods that select X_u from the training data, one typically interleaves optimization of the hyperparameters with selection of X_u, as proposed for example by Seeger et al. (2003).


Augmentation    Since in the previous sections we haven't assumed anything about u, for each test input x_* in turn we can simply augment the set of inducing variables by f_*, so that we have one additional inducing variable equal to the current test latent. Let us first investigate the consequences for the test conditional from (10b). Note that the interpretation of the covariance matrix K_{*,*} - Q_{*,*} was "the prior covariance minus the information which u provides about f_*". It is clear that the augmented u (with f_*) provides all possible information about f_*, and consequently Q_{*,*} = K_{*,*}. An equivalent view on augmentation is that the assumption of conditional independence between f_* and f is dropped. This is seen trivially, by adding edges between f_* and the f_i. Because f_* enjoys a full, original prior variance under the test conditional, augmentation helps reverse the misbehaviour of the predictive variances of degenerate GP priors, see (Quiñonero-Candela and Rasmussen, 2005, Section 8.1) for the details. Augmented SoR and augmented DTC are identical models (see Quiñonero-Candela and Rasmussen, 2005, Remark 12).

Augmentation was originally proposed by Rasmussen (2002), and applied in detail to the SoR with RBF covariance by Quiñonero-Candela (2004). Later, Rasmussen and Quiñonero-Candela (2005) proposed to use augmentation to "heal" the relevance vector machine (RVM) (Tipping, 2000), which is also equivalent to a degenerate GP with nonsensical predictive variances that shrink to zero far away from the training inputs (Tipping, 2001, Appendix D). Although augmentation was initially proposed for a narrow set of circumstances, it is easily applied to any of the approximations discussed. Of course, augmentation doesn't make any sense for an exact, non-degenerate Gaussian process model (a GP with a covariance function that has a feature-space which is infinite dimensional, i.e. with basis functions everywhere).

Prediction with an augmented sparse model comes at a higher computational cost, since now f_* interacts directly with all of f, and not just with u. For each new test point O(nm) operations are required, as opposed to O(m) for the mean and O(m^2) for the predictive distribution of all the non-augmented methods we have discussed. Whether augmentation is of practical interest (i.e. increases performance at a fixed computational budget) is unclear, since the extra computational effort needed for augmentation could be invested by the non-augmented methods, for example to use more inducing variables.

6 Approximate Evidence and Hyperparameter Learning

Hyperparameter learning is an issue that is sometimes completely ignored in the literature on approximations to GPs. However, learning good values of the hyperparameters is crucial, and it might come at a high computational cost that cannot be ignored. For clarity, let us distinguish between three phases in GP modelling:

hyperparameter learning: the hyperparameters are learned, for example by maximizing the log marginal likelihood.

pre-computation: all possible pre-computations not involving test inputs are performed (such as inverting the covariance matrix of the training function values, and multiplying it by the vector of targets).

testing: only the computations involving the test inputs are made, which could not have been made previously.

If approximations to GP regression are to provide a computational speedup, then one must, in principle, take into account all three phases of GP modelling.

Let us for a moment group hyperparameter learning and pre-computations into a wider training phase. In practice, there will be situations where one of the two times, training or testing, is more important than the other. Let us distinguish between two extreme scenarios. On the one hand, there are applications for which training time is unimportant, the crucial point being that the time required to make predictions, once the system has been trained, is really small. An example could be applications embedded in a portable device, with limited CPU power. At the other extreme are applications where the system needs to be re-trained often and quickly, and where prediction time matters much less than the quality of the predictions. An example could be a case where we know already that we will only need to make very few predictions, hence training time automatically becomes predominant. A particular application might lie somewhere between these two extremes.

A common choice of objective function for setting the hyperparameters is the marginal likelihood p(y|X). For GP regression, this can be computed exactly (by integrating out f analytically) to obtain

\log p(y | X) = -\tfrac{1}{2} \log |K_{f,f} + \sigma^2_{\mathrm{noise}} I| - \tfrac{1}{2} y^\top (K_{f,f} + \sigma^2_{\mathrm{noise}} I)^{-1} y - \tfrac{n}{2} \log(2\pi) .   (30)

Also, the derivative of the marginal likelihood with respect to a hyperparameter \theta_j is given by

\frac{\partial}{\partial \theta_j} \log p(y | X) = \tfrac{1}{2} y^\top R^{-1} \frac{\partial R}{\partial \theta_j} R^{-1} y - \tfrac{1}{2} \mathrm{tr}\left( R^{-1} \frac{\partial R}{\partial \theta_j} \right) ,   (31)

where R = (K_{f,f} + \sigma^2_{\mathrm{noise}} I), and tr(A) denotes the trace of matrix A. These derivatives can then be fed to a gradient-based optimization routine. The marginal likelihood is not the only criterion that can be used for hyperparameter optimization. For example one can use a cross-validation objective; see Rasmussen and Williams (2006, chapter 5) for further details.
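A direct NumPy transcription of (30) and (31) for a single covariance hyperparameter is given below; dK_dtheta is the user-supplied matrix of derivatives of K_{f,f} with respect to theta_j (e.g., for the lengthscale of the squared exponential (3) it has entries K_{ij} ||x_i - x_j||^2 / lambda^3). The function name is ours, and no attempt is made to exploit structure beyond a single Cholesky factorization.

import numpy as np

def log_marginal_and_grad(Kff, dK_dtheta, y, sigma2_noise):
    # Exact GP log marginal likelihood (30) and its derivative (31) with
    # respect to one covariance hyperparameter theta_j.
    n = len(y)
    R = Kff + sigma2_noise * np.eye(n)
    L = np.linalg.cholesky(R)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # R^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    lml = -0.5 * logdet - 0.5 * y @ alpha - 0.5 * n * np.log(2.0 * np.pi)
    Rinv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(n)))   # R^{-1}
    grad = 0.5 * alpha @ dK_dtheta @ alpha - 0.5 * np.trace(Rinv @ dK_dtheta)
    return lml, grad

These values can be handed to any gradient-based optimizer (applied to the negative log marginal likelihood) to learn the hyperparameters.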

The approximate log marginal likelihood for the SoR, DTC, FITC and PITC approximations is given in equation (29), taking time O(nm^2). For the SoD approximation one simply ignores all training datapoints not in the inducing set, giving

\log q_{SoD}(y | X_u) = -\tfrac{1}{2} \log |K_{u,u} + \sigma^2_{\mathrm{noise}} I| - \tfrac{1}{2} y_u^\top (K_{u,u} + \sigma^2_{\mathrm{noise}} I)^{-1} y_u - \tfrac{m}{2} \log(2\pi) ,   (32)

where the inducing inputs X_u are a subset of the training inputs, and y_u are the corresponding training targets. The computational cost here is O(m^3), instead of the O(n^3) operations needed to compute the evidence of the full GP model. For all of these approximations one can also calculate derivatives with respect to the hyperparameters analytically, for use in a gradient-based optimizer.

In general we believe that for a given sparse approximation it makes most sense to both optimize the hyperparameters and make predictions under the same approximation (Quiñonero-Candela, 2004, Chapter 3). The hyperparameters that are optimal for the full GP model, if we were able to obtain them, may well be very different from those optimal for a specific sparse approximation. For example, for the squared exponential covariance function, see (3), the lengthscale optimal for the full GP might be too short for a very sparse approximation. Certain sparse approximations lead to marginal likelihoods that are better behaved than others. Snelson and Ghahramani (2006) report that the marginal likelihood of FITC suffers much less from local maxima than the marginal likelihood common to SoR and DTC.

Gibbs and MacKay (1997) discussed how the marginal likelihood and its derivatives can be approximated using CG iterative methods, building on ideas in Skilling (1993). It should be possible to speed up such computations using the fast MVM methods outlined in section 4.

7 Classification

Compared with the regression case, the approximation methods for GP classification need to deal with an additional difficulty: the likelihood is non-Gaussian, which implies that neither the predictive distribution (5), nor the marginal likelihood can be computed analytically. With the approximate prior at hand, the most common approach is to further approximate the resulting posterior, p(f | y), by a Gaussian. It can be shown that if p(y | f) is log-concave then the posterior will be unimodal; this is the case for the common logistic and probit response functions. (For a review of likelihood functions for GP classification, see Rasmussen and Williams (2006, Chapter 3).) An evaluation of different, common ways of determining the approximating Gaussian has been made by Kuss and Rasmussen (2005).

For the subset of regressors (SoR) method we have f_* = \sum_{i=1}^{m} k(x_*, x_u^i) w_u^i with w_u \sim \mathcal{N}(0, K_{u,u}^{-1}) from (12). For log-concave likelihoods, the optimization problem to find the maximum a posteriori (MAP) value of w_u is convex. Using the MAP value of the weights together with the Hessian at this point, we obtain an approximate Gaussian predictive distribution of f_*, which can be fed through the sigmoid function to yield probabilistic predictions. The question of how to choose the inducing inputs still arises; Lin et al. (2000) select these using a clustering method, while Zhu and Hastie (2002) propose a forward selection strategy.
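For concreteness, the convex MAP problem for SoR classification can be written down directly. The sketch below uses a logistic likelihood with labels in {-1, +1} and an L-BFGS optimizer; these choices and the function names are ours rather than a method prescribed by the text, and only the MAP step (not the Hessian-based Gaussian approximation) is shown.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def sor_classification_map(Kfu, Kuu, y):
    # MAP estimate of the SoR weights w_u for binary labels y in {-1, +1}.
    # Negative log posterior (up to a constant), with f = K_{f,u} w:
    #   J(w) = sum_i log(1 + exp(-y_i f_i)) + 0.5 w^T K_{u,u} w,
    # since p(w_u) = N(0, K_{u,u}^{-1}) has precision K_{u,u}. J is convex.
    m = Kuu.shape[0]

    def objective(w):
        f = Kfu @ w
        J = np.sum(np.logaddexp(0.0, -y * f)) + 0.5 * w @ Kuu @ w
        s = expit(-y * f)                       # sigmoid(-y_i f_i)
        grad = -Kfu.T @ (y * s) + Kuu @ w
        return J, grad

    res = minimize(objective, np.zeros(m), jac=True, method="L-BFGS-B")
    return res.x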

The subset of data (SoD) method for GP classification was proposed by Lawrence et al. (2003), using an expectation-propagation (EP) approximation to the posterior (Minka, 2001), and an information gain criterion for greedily selecting the inducing inputs from the training set.

The deterministic training conditional (DTC) approximation has also been used for classification. Csató and Opper (2002) present an "online" method, where the examples are processed sequentially, while Csató et al. (2003) give an EP-type algorithm where multiple sweeps across the data are permitted.

The partially independent training conditional (PITC) approximation has also been applied to GP classification. Tresp (2000) generalized the use of the Bayesian committee machine (BCM) to non-Gaussian likelihoods, by applying a Laplace approximation to each of the partially independent blocks.

8 Conclusions

In this chapter we have reviewed a number of methods for dealing with the O(n^3) scaling of a naïve implementation of Gaussian process regression methods. These can be divided into two classes, those based on sparse methods, and those based on fast MVM methods. We have also discussed the selection of inducing variables, hyperparameter learning, and approximation methods for the marginal likelihood.

Probably the question that is most on the mind of the practitioner is "what method should I use on my problem?" Unfortunately this is not an easy question to answer, as it will depend on various aspects of the problem such as the complexity of the underlying target function, the dimensionality of the inputs, the amount of noise on the targets, etc. There has been some empirical work on the comparison of approximate methods, for example in Schwaighofer and Tresp (2003) and Rasmussen and Williams (2006, chapter 8), but more needs to be done. Empirical evaluation is not easy, as there exist very many criteria of comparison, under which different methods might perform best. There is a large choice of the measures of performance: mean squared error, negative log predictive probability, etc., and an abundance of possible measures of speed for the approximations: hyperparameter learning time, pre-computation time, testing time, number of inducing variables, number of CG iterations, etc. One possible approach for empirical evaluation of computationally efficient approximations to GPs is to fix a "computational budget" of available time, and see which approximation achieves the best performance (under some loss) within that time. Empirical comparisons should always include approximations that at first sight might seem too simple or naïve, such as subset of data (SoD), but that might end up performing as well as more elaborate methods.

Acknowledgements

The authors thank Ed Snelson and Iain Murray for recent, very fruitful discussions. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views. CER was supported by the German Research Council (DFG) through grant RA 1030/1.

References

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Lehel Csató, 2005. Personal communication at the 2005 Sheffield Gaussian Process Round Table, organized by Neil D. Lawrence.

Lehel Csató and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14(3):641–669, 2002.

Lehel Csató, Manfred Opper, and Ole Winther. TAP Gibbs Free Energy, Belief Propagation and Sparsity. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 657–663, Cambridge, MA, 2003. MIT Press.

Nando De Freitas, Yang Wang, Maryam Mahdaviani, and Dustin Lang. Fast Krylov methods for N-body learning. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, Cambridge, MA, 2006. MIT Press.

Mark N. Gibbs and David J. C. MacKay. Efficient Implementation of Gaussian Processes. Unpublished manuscript, Cavendish Laboratory, Cambridge, UK. http://www.inference.phy.cam.ac.uk/mackay/BayesGP.html, 1997.

Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989. Second edition.

Alexander Gray. Fast kernel matrix-vector multiplication with application to Gaussian process learning. Technical Report CMU-CS-04-110, School of Computer Science, Carnegie Mellon University, 2004.

S. Sathiya Keerthi, Olivier Chapelle, and Dennis DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7:1493–1515, 2006.

S. Sathiya Keerthi and Wei Chu. A matching pursuit approach to sparse GP regression. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, Cambridge, MA, 2006. MIT Press.

Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.

Neil Lawrence, Matthias Seeger, and Ralf Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 609–616, Cambridge, MA, 2003. MIT Press.

Xiwi Lin, Grace Wahba, Dong Xiang, Fangyu Gao, Ronald Klein, and Barbara Klein. Smoothing Spline ANOVA Models for Large Data Sets With Bernoulli Observations and the Randomized GACV. Annals of Statistics, 28:1570–1600, 2000.

Thomas P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.

Tomaso Poggio and Federico Girosi. Networks for Approximation and Learning. Proceedings of the IEEE, 78:1481–1497, 1990.

Joaquin Quiñonero-Candela. Learning with Uncertainty – Gaussian Processes and Relevance Vector Machines. PhD thesis, Technical University of Denmark, Lyngby, Denmark, 2004.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

Carl Edward Rasmussen. Reduced rank Gaussian process learning. Technical report, Gatsby Computational Neuroscience Unit, UCL, 2002.

Carl Edward Rasmussen and Joaquin Quiñonero-Candela. Healing the relevance vector machine by augmentation. In Luc De Raedt and Stefan Wrobel, editors, Proceedings of the 22nd International Conference on Machine Learning, pages 689–696, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.

Anton Schwaighofer and Volker Tresp. Transductive and inductive methods for approximate Gaussian process regression. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003.

Matthias Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh, 2003.

Matthias Seeger, Christopher K. I. Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Christopher M. Bishop and Brendan J. Frey, editors, Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2003.

Yirong Shen, Andrew Ng, and Matthias Seeger. Fast Gaussian process regression using KD-trees. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, Cambridge, MA, 2006. MIT Press.

John Skilling. Bayesian numerical analysis. In W. T. Grandy, Jr. and P. Milonni, editors, Physics and Probability. Cambridge University Press, 1993.

Alexander J. Smola and Peter L. Bartlett. Sparse greedy Gaussian process regression. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619–625, Cambridge, MA, 2001. MIT Press.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, Cambridge, MA, 2006. MIT Press.

Michael E. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 652–658, Cambridge, MA, 2000. MIT Press.

Michael E. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1:211–244, 2001.

Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.

Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.

Grace Wahba, Donald R. Johnson, Feng Gao, and Jianjian Gong. Adaptive Tuning of Numerical Weather Prediction Models: Randomized GCV in Three- and Four-Dimensional Data Assimilation. Monthly Weather Review, 123:3358–3369, 1995.

Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press.

Changjiang Yang, Ramani Duraiswami, and Larry Davis. Efficient kernel machines using the improved fast Gauss transform. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1561–1568, Cambridge, MA, 2005. MIT Press.

Ji Zhu and Trevor J. Hastie. Kernel Logistic Regression and the Import Vector Machine. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1081–1088. MIT Press, 2002.
