
Variable selection for Gaussian processes via sensitivity analysis of the posterior predictive distribution

Topi Paananen, Juho Piironen, Michael Riis Andersen, Aki Vehtari

Helsinki Institute for Information Technology, HIIT
Aalto University, Department of Computer Science

Abstract

Variable selection for Gaussian process models is often done using automatic relevance determination, which uses the inverse length-scale parameter of each input variable as a proxy for variable relevance. This implicitly determined relevance has several drawbacks that prevent the selection of optimal input variables in terms of predictive performance. To improve on this, we propose two novel variable selection methods for Gaussian process models that utilize the predictions of a full model in the vicinity of the training points and thereby rank the variables based on their predictive relevance. Our empirical results on synthetic and real world data sets demonstrate improved variable selection compared to automatic relevance determination in terms of variability and predictive performance.

1 INTRODUCTION

Often the goal of supervised learning is not only to learn the relationship between the predictors and target variables, but to also assess the predictive relevance of the input variables. A relevant input variable is one with a high predictive power on the target variable (Vehtari et al., 2012). In many applications, simplifying a model by selecting only the most relevant input variables is important for two reasons. Firstly, it makes the model more interpretable and understandable by domain experts. Secondly, it may reduce future costs if there is a price associated with measuring or predicting with many variables. Here, we focus on methods

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

that select a subset of the original variables, as opposed to constructing new features, as this preserves the interpretability of the variables.

Gaussian processes (GPs) are flexible, nonparametric models for regression and classification in the Bayesian framework (Rasmussen and Williams, 2006). The relevance of input variables of a fitted GP model is often inferred implicitly from the length-scale parameters of the GP covariance function. This is called automatic relevance determination (ARD), a term that originated in the neural network literature (MacKay, 1994; Neal, 1995), and since then has been used extensively for both Gaussian processes (Williams and Rasmussen, 1996; Seeger, 2000) and other models (Tipping, 2000; Wipf and Nagarajan, 2008).

As an alternative to ARD, variable selection via sparsifying spike-and-slab priors is also possible with Gaussian processes (Linkletter et al., 2006; Savitsky et al., 2011). The drawback of these methods is that they require using a Markov chain Monte Carlo method for inference, which is computationally expensive with Gaussian processes. Due to space constraints, we will not consider sparsifying priors in this study. The predictive projection method originally devised for generalized linear models (Goutis and Robert, 1998; Dupuis and Robert, 2003) has also been implemented for Gaussian processes (Piironen and Vehtari, 2016). This method can potentially select variables with good predictive performance, but has a substantial computational cost due to the required exploration of the model space.

Due to the close connection between Gaussian processes and kernel methods, it is sometimes possible to utilize variable selection approaches used for kernel methods also with Gaussian processes. For example, Crawford et al. (2019, 2018) derive an analog for the effect size of each input variable for nonparametric methods, and show that it generalizes also to Gaussian process models. They then assess the importance of variable j using Kullback-Leibler divergence between


the marginal distribution of the rest of the variables to their conditional distribution when variable j is set to zero.

The main contributions of this paper are summarized as follows. We present two novel input variable selection methods for Gaussian process models, which directly assess the predictive relevance of the variables via sensitivity analysis. Both methods utilize the posterior of the full model (i.e. one that includes all the input variables) near the training points to estimate the predictive relevance of the variables. We also demonstrate why certain properties of automatic relevance determination make it unsuitable for variable selection, and show that the proposed methods are not affected by these weaknesses. Our empirical evaluations indicate that the proposed methods lead to improved variable selection in terms of predictive performance, and generate the relevance ranking more consistently between different training data sets. We also demonstrate how the pointwise estimates of the methods can be useful for assessing local predictive relevance beyond the global, average relevance of each variable. The methods serve as practical alternatives to automatic relevance determination without increasing the computational cost as much as many alternative variable selection methods in the literature.

2 BACKGROUND

This section briefly reviews Gaussian processes and automatic relevance determination in this context, and discusses the problems associated with variable selection via ARD.

2.1 Gaussian Process Models

Gaussian processes (GPs) are nonparametric models that define a prior distribution directly in the space of latent functions f(x), where x is a p-dimensional input vector. The form and smoothness of the functions generated by a GP are determined by its covariance function k(x, x'), which defines the covariance between the latent function values at the input points x and x'. The prior is typically assumed to have zero mean:

$$ p(f(X)) = p(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, \mathbf{K}), $$

where K is the covariance matrix between the latent function values at the training inputs X = (x^(1), ..., x^(n)) such that K_ij = k(x^(i), x^(j)).

In regression problems with Gaussian observation models, the GP posterior distribution is analytically tractable for both the latent values and noisy observations by conditioning the joint normal distribution of training and test outputs on the observed data.

While other observation models do not have analytically tractable solutions, numerous approximations have been developed for inference with different likelihoods in both regression and classification (Williams and Barber, 1998; Minka, 2001; Vanhatalo et al., 2009).
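For concreteness, the following is a minimal NumPy sketch of the standard GP regression predictive equations for a Gaussian likelihood (cf. Rasmussen and Williams, 2006); the function and argument names are illustrative and not taken from the paper's code.

```python
import numpy as np

def gp_posterior_predictive(K, K_star, k_star_star, y, sigma_n2):
    """Posterior predictive mean and latent variance of a zero-mean GP
    with Gaussian observation noise of variance sigma_n2.

    K           : n x n kernel matrix of the training inputs
    K_star      : n x m kernel matrix between training and test inputs
    k_star_star : length-m vector of prior variances k(x*, x*) at the test inputs
    """
    n = K.shape[0]
    L = np.linalg.cholesky(K + sigma_n2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var = k_star_star - np.sum(v**2, axis=0)
    # Adding sigma_n2 to var gives the predictive variance of noisy observations y*.
    return mean, var
```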

2.2 Automatic Relevance Determination

A widely used covariance function in Gaussian process inference is the squared exponential (SE) with separate length-scale parameters l_i for each of the p input dimensions:

$$ k_{\mathrm{SE}}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left( -\frac{1}{2} \sum_{i=1}^{p} \frac{(x_i - x_i')^2}{l_i^2} \right). \tag{1} $$

While the common hyperparameter σ_f determines the overall variability, the separate length-scale parameters l_i allow the functions to vary at different scales along different variables.

In some contexts, automatic relevance determination (ARD) simply means using the covariance function (1) instead of one with a single length-scale parameter l. Often, however, the term is used in a more specific sense, namely for inferring the predictive relevance of each variable from the inverse of its length-scale parameter. This intuition is based on the fact that an infinitely large length-scale means no correlation between the latent function values in that dimension. In practice, however, the length-scales will never be infinite, and inferring irrelevance from a large length-scale is problematic for two reasons. First, the length-scale parameters alone are not well identified, but only the ratios of l_i and σ_f (Zhang, 2004), which increases the variance of the relevance measure. Second, ARD systematically overestimates the predictive relevance of nonlinear variables relative to linear variables of equal relevance in the squared error sense (Piironen and Vehtari, 2016).
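For illustration, a minimal NumPy sketch of the ARD covariance function in equation (1); the function name and the relevance line at the end are ours and not tied to any particular GP library.

```python
import numpy as np

def k_se_ard(X1, X2, sigma_f, lengthscales):
    """Squared exponential kernel of equation (1) with a separate
    length-scale per input dimension (ARD parameterization)."""
    # Scale each dimension by its length-scale, then form squared distances.
    Z1 = X1 / lengthscales
    Z2 = X2 / lengthscales
    sq_dist = (np.sum(Z1**2, axis=1)[:, None]
               + np.sum(Z2**2, axis=1)[None, :]
               - 2.0 * Z1 @ Z2.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return sigma_f**2 * np.exp(-0.5 * sq_dist)

# ARD reads relevance directly from the fitted length-scales
# (the interpretation this paper argues against):
# ard_relevance = 1.0 / lengthscales
```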

3 PREDICTING FEATURE RELEVANCES VIA SENSITIVITY ANALYSIS

This section describes the two proposed variable selection methods that analyze the sensitivity of the posterior GP at the training data locations. We will first present the outline of both methods and their properties, and then discuss their computational complexity.

3.1 Kullback-Leibler Divergence as a Measure of Predictive Relevance

The Kullback-Leibler divergence (KLD) is a widely used measure of dissimilarity between two probability distributions (Kullback and Leibler, 1951). In this section, we present a method for assessing the predictive relevance of input variables via sensitivity analysis of the posterior predictive distribution. When an input is moved with respect to a single variable, a large KLD between the predictive distributions before and after the move indicates that the variable has a high predictive relevance. In our method, the predictive distributions are compared at the training points and at points that are moved with respect to one variable. The KLD is a favourable measure for predictive relevance because it takes into account changes in both the predictive mean and uncertainty. As the KLD is applicable to arbitrary distributions with compatible support, this procedure is not limited to any specific likelihood or model.

By relating it to the total variation distance via Pinsker's inequality, it is reasonable to use the Kullback-Leibler divergence from density p to q, D_KL(p || q), as a measure of distance in the form (Simpson et al., 2017)

$$ d(p \,\|\, q) = \sqrt{2 \, D_{\mathrm{KL}}(p \,\|\, q)}. $$

Using the square root also allows linear approximation of infinitesimal changes in the predictive distribution via perturbations in the input variables. Applying this to the posterior predictive distribution at a training point i, p(y* | x^(i), y), and a point that is perturbed by amount ∆ with respect to variable j, p(y* | x^(i) + ∆_j, y), we use the following measure of predictive relevance:

$$ r(i, j, \Delta) = \frac{d\!\left( p(y^* \mid \mathbf{x}^{(i)}, \mathbf{y}) \,\big\|\, p(y^* \mid \mathbf{x}^{(i)} + \boldsymbol{\Delta}_j, \mathbf{y}) \right)}{\Delta}, \tag{2} $$

where ∆_j is a vector of zeroes with ∆ on the j'th entry. Averaging this measure over all of the training points i yields a relevance estimate for the j'th variable:

$$ \mathrm{KL}_j = \frac{1}{n} \sum_{i=1}^{n} r(i, j, \Delta). $$

Using this estimate, we can rank the input variables by relevance and select a desired number of them. Henceforth, we will refer to the presented method as the KL method.
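To make the procedure concrete, here is a minimal sketch of the KL method under the assumption that the posterior predictive distributions are univariate Gaussians (e.g. GP regression with a Gaussian likelihood, discussed below); the `predict` callable is a hypothetical stand-in for a fitted model's predictive mean and variance.

```python
import numpy as np

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL divergence KL(p || q) between two univariate Gaussians."""
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q)**2) / var_q - 1.0)

def kl_relevances(predict, X, delta=1e-4):
    """Average relevance KL_j for each input variable j.

    predict(X) is assumed to return the posterior predictive mean and
    variance (arrays of length n) at the rows of X.
    """
    n, p = X.shape
    mu0, var0 = predict(X)
    rel = np.zeros(p)
    for j in range(p):
        X_pert = X.copy()
        X_pert[:, j] += delta                  # perturb variable j by delta
        mu1, var1 = predict(X_pert)
        d = np.sqrt(2.0 * kl_gauss(mu0, var0, mu1, var1))
        rel[j] = np.mean(d / delta)            # average r(i, j, delta) over i
    return rel
```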

For Gaussian observation models, the relevance measure defined in equation (2) is related to the partial derivative of the mean of the latent function with respect to the variable j. In this case, if the latent variance is constant, taking the limit ∆ → 0 simplifies the measure in equation (2) to the form

$$ \lim_{\Delta \to 0} r(i, j, \Delta) = \left( \mathrm{Var}[y^* \mid \mathbf{x}^{(i)}, \mathbf{y}] \right)^{-1/2} \frac{\partial}{\partial x_j} \mathbb{E}[y^* \mid \mathbf{x}^{(i)}, \mathbf{y}], $$

where E[y* | x^(i), y] and Var[y* | x^(i), y] are the mean and variance of the posterior predictive distribution p(y* | x^(i), y), respectively. Hence, in the special case of a Gaussian likelihood, the proposed measure can be interpreted as a partial derivative weighted by the predictive uncertainty. The method thus has a connection to methods that rank variables via partial derivatives, see e.g. (Hardle and Stoker, 1989; Ruck et al., 1990; Lal et al., 2006; Liu et al., 2018).

The choice of the perturbation distance ∆ has to be reasonable with respect to the given data set. According to our empirical evaluations, the proposed method is insensitive to the size of the perturbation. The results of this paper are computed with ∆ = 10^-4 when the inputs were normalized to zero mean and unit standard deviation, and no noticeable differences were observed when ∆ was varied over two orders of magnitude above and below this value. However, very small values should be avoided because of potential numerical instability. For more details, see Figure 7 in the supplementary material.

3.2 Variance of the Posterior Latent Mean

In this section, we present a method for ranking input variables based on the variability of the GP latent mean in the direction of each variable. When the value of a single input variable is changed, large variability in the latent mean indicates that the variable is relevant for predicting the target variable. In contrast to the KL method, this method thus considers only the latent mean, but examines it throughout the conditional distribution of each variable at the training point and not just the immediate vicinity of the point. We thus ignore the uncertainty of the predictions but utilize information from a larger area of the input space. Another benefit of this is that computing the predictive mean of a Gaussian process is computationally cheaper than predicting the marginal variance.

In order to estimate the variance of the mean of the latent function, we will approximate the distribution of the input variables. Under the assumption that the input data has finite first and second moments, we can do this by computing the sample mean µ and sample covariance Σ from the n training inputs X = (x^(1), ..., x^(n)). At any given point, the conditional distribution of variable j at training point i, p(x_j | x_{-j}^(i)), can then be estimated using the conditioning rule of the multivariate Gaussian:

$$ x_j \mid \mathbf{x}_{-j} \sim \mathcal{N}(m_j, s_j^2), \qquad
m_j = \mu_j + \boldsymbol{\sigma}_{j,-j} \boldsymbol{\Sigma}_{-j,-j}^{-1} (\mathbf{x}_{-j} - \boldsymbol{\mu}_{-j}), \qquad
s_j^2 = \sigma_{j,j} - \boldsymbol{\sigma}_{j,-j} \boldsymbol{\Sigma}_{-j,-j}^{-1} \boldsymbol{\sigma}_{-j,j}. \tag{3} $$

Here, the subscript j refers to selecting the row or column j from µ or Σ, whereas the subscript -j refers to excluding them.
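For illustration, a small NumPy sketch of equation (3); it uses a direct linear solve per variable, whereas Section 3.3 describes a more efficient implementation based on rank one Cholesky updates.

```python
import numpy as np

def conditional_gaussian(x, j, mu, Sigma):
    """Mean and variance of x_j | x_{-j} under N(mu, Sigma), equation (3)."""
    idx = np.arange(len(mu)) != j                  # mask selecting the -j entries
    sigma_j_rest = Sigma[j, idx]                   # sigma_{j,-j}
    Sigma_rest = Sigma[np.ix_(idx, idx)]           # Sigma_{-j,-j}
    m_j = mu[j] + sigma_j_rest @ np.linalg.solve(Sigma_rest, x[idx] - mu[idx])
    s2_j = Sigma[j, j] - sigma_j_rest @ np.linalg.solve(Sigma_rest, sigma_j_rest)
    return m_j, s2_j
```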


The variance of the posterior mean f_j^(i) = E[f(x_j) | x_{-j}^(i)] along the j'th dimension is then given by integrating over the conditional distribution p(x_j | x_{-j}^(i)):

$$ \mathrm{Var}[f_j^{(i)}] = \int \big(f_j^{(i)}\big)^2(x_j) \, \mathcal{N}(x_j \mid m_j, s_j^2) \, \mathrm{d}x_j - \left( \int f_j^{(i)}(x_j) \, \mathcal{N}(x_j \mid m_j, s_j^2) \, \mathrm{d}x_j \right)^{2}. $$

With a change of variables k = (x_j - m_j)/(√2 s_j), the variance takes a simpler form that can be numerically approximated with the Gauss-Hermite quadrature:

$$ \mathrm{Var}[f_j^{(i)}] = \int \big(f_j^{(i)}\big)^2\!\big(\sqrt{2}\, s_j k + m_j\big) \, \frac{e^{-k^2}}{\sqrt{\pi}} \, \mathrm{d}k - \left( \int f_j^{(i)}\!\big(\sqrt{2}\, s_j k + m_j\big) \, \frac{e^{-k^2}}{\sqrt{\pi}} \, \mathrm{d}k \right)^{2} $$
$$ \approx \pi^{-1/2} \sum_{n_k=1}^{N_k} w_{n_k} \big(f_j^{(i)}\big)^2\!\big(\sqrt{2}\, s_j k_{n_k} + m_j\big) - \pi^{-1} \left( \sum_{n_k=1}^{N_k} w_{n_k} f_j^{(i)}\!\big(\sqrt{2}\, s_j k_{n_k} + m_j\big) \right)^{2}, $$

where N_k is the number of weights w_{n_k} and evaluation points k_{n_k} of the Gauss-Hermite quadrature approximation. The k_{n_k} are given by the roots of the physicists' version of the Hermite polynomial H_{N_k}(k), and the weights are

$$ w_{n_k} = \frac{2^{N_k - 1} N_k! \, \sqrt{\pi}}{N_k^2 \, [H_{N_k - 1}(k_{n_k})]^2}. \tag{4} $$

The above procedure is repeated with all of the conditional distributions p(x_j | x_{-j}^(i)) from the n training points and the average is computed:

$$ \mathrm{VAR}_j = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[f_j^{(i)}]. \tag{5} $$

The average is then used as an estimate of the predictive relevance of variable j. Henceforth, we will refer to this method as the VAR method.
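Putting the pieces together, the following is a sketch of the VAR estimate in equation (5) using Gauss-Hermite quadrature. It is our reading of the procedure rather than the authors' reference implementation: the hypothetical `predict_mean` callable returns the posterior latent mean of a fitted GP, and `conditional_gaussian` is the helper sketched after equation (3).

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def var_relevances(predict_mean, X, n_quad=32):
    """Average relevance VAR_j of equation (5) for each input variable j.

    predict_mean(X) is assumed to return the posterior mean of the latent
    function (array of length X.shape[0]) at the rows of X.
    """
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    k, w = hermgauss(n_quad)                 # physicists' Gauss-Hermite nodes/weights
    rel = np.zeros(p)
    for j in range(p):
        for i in range(n):
            m_j, s2_j = conditional_gaussian(X[i], j, mu, Sigma)
            # Evaluate the latent mean along x_j at the quadrature nodes.
            X_grid = np.tile(X[i], (n_quad, 1))
            X_grid[:, j] = np.sqrt(2.0 * s2_j) * k + m_j
            f = predict_mean(X_grid)
            mean_f = np.sum(w * f) / np.sqrt(np.pi)
            mean_f2 = np.sum(w * f**2) / np.sqrt(np.pi)
            rel[j] += (mean_f2 - mean_f**2) / n   # Var[f_j^(i)], averaged over i
    return rel
```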

In this paper, we will only consider data sets where the number of data points n is greater than the number of input dimensions p. In the absence of linearly dependent components in the inputs, the resulting sample covariance matrix Σ will be positive definite and its inverse can be computed using the Cholesky decomposition. In order to increase the numerical stability of the decomposition, a small diagonal term is added to ill-conditioned sample covariance matrices. By using more shrinkage when estimating the covariance matrix, the VAR method could be used also with data sets where n < p.

Compared to the KL method, the advantage of the VAR method is that modelling the distribution of the inputs allows us to examine the GP posterior for out-of-sample behavior in a larger area of the input space than just at the training data locations. On the other hand, estimating the input distribution is a task in itself, which may increase the variance of the resulting relevance estimate when data are scarce.

3.3 Computational Complexity

Exact inference with Gaussian processes has complexity O(n^3) for a data set with n observations, which hinders their applicability especially in large data sets. Once a full GP model is fitted, ranking variables using ARD requires no additional computations. By a projection approach (Piironen and Vehtari, 2016), the variables can possibly be ranked more accurately, but the drawback is that the model space exploration to find the submodels increases the complexity to O(p^2 n^3), where p is the number of input variables.

The complexity of Gaussian process inference arises from the unavoidable matrix inversion. However, the same inverse can be used for making an arbitrary number of predictions at new test points, and the costs of predicting the GP mean and variance at a single test point are O(n) and O(n^2), respectively. Both of the proposed methods in this paper utilize a constant number of predictions for each of the n data points and p input variables. As the VAR method does not require the predictive variance, its computational complexity is O(p · n · n) = O(pn^2), whereas the KL method has complexity O(pn^3). For both methods, using a sparse GP approximation with m < n inducing points can reduce the cost of predictions and thus the complexity of the proposed methods to O(pnm) and O(pnm^2), respectively, for the VAR and KL methods (Bui et al., 2017). Alternatively, one may reduce the computational cost by using only a subset of the training points to estimate the predictive relevances of the variables.

In addition to the complexity due to predictions, the VAR method requires inverting the submatrix Σ_{-j,-j} of the sample covariance matrix of the inputs for each of the p variables. Taking advantage of the positive definiteness of the full covariance matrix, its Cholesky decomposition, which is O(p^3) in complexity, needs to be computed only once per training set. The Cholesky decomposition of each submatrix Σ_{-j,-j} is then obtained with a rank one update from the full covariance matrix, resulting in p rank one updates of complexity O(p^2). Thus, the full complexity of the variance method is O(pn^2 + p^3). Because we are considering only the case p < n, the effective complexity is still O(pn^2). The details of the rank one update are described in the supplementary material.


Figure 1: Latent functions f_j(x_j), j = 1, ..., 8, of the two toy examples. Black represents the latent functions with uniform inputs and red represents normally distributed inputs, each function scaled to unit variance according to its corresponding input distribution.

4 EXPERIMENTS

This section presents two toy examples that illustrate how the proposed methods are able to assess the predictive relevance of linear and nonlinear variables more accurately than automatic relevance determination. The section also presents variable selection results in regression and classification tasks on real data sets. In all experiments, the model of choice is a GP model with the ARD covariance function in equation (1). We want to emphasize that our intent is not to criticize the use of this kernel in general, but to show that it is problematic to use it for assessing the relevance of input variables.

4.1 Toy Examples

In the first experiment we consider a toy example, where the target variable is constructed as a sum of eight independent and additive variables whose responses have varying degrees of nonlinearity. We generate the target variable y based on the inputs as follows:

$$ y = f_1(x_1) + \ldots + f_8(x_8) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 0.3^2), \qquad f_j(x_j) = A_j \sin(\phi_j x_j), \tag{6} $$

where the angular frequencies φ_j are equally spaced between π/10 and π, and the scaling factors A_j are such that the variance of each f_j(x_j) is one. We consider two separate mechanisms for generating the input data, so that either x_j ~ U(-1, 1) or x_j ~ N(0, 0.4^2). The functions f_j are presented in Figure 1 for uniformly distributed inputs (black) and normally distributed inputs (red).
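For reference, a sketch of this data-generating process; the scaling factors A_j are estimated numerically from a large reference sample here, which is one possible way (our assumption) to realize the unit-variance scaling described above.

```python
import numpy as np

def generate_toy_data(n=300, p=8, noise_sd=0.3, uniform_inputs=True, seed=0):
    """Generate data from the additive toy model of equation (6)."""
    rng = np.random.default_rng(seed)
    phi = np.linspace(np.pi / 10, np.pi, p)      # angular frequencies phi_j
    if uniform_inputs:
        X = rng.uniform(-1.0, 1.0, size=(n, p))
        X_ref = rng.uniform(-1.0, 1.0, size=(100_000, p))
    else:
        X = rng.normal(0.0, 0.4, size=(n, p))
        X_ref = rng.normal(0.0, 0.4, size=(100_000, p))
    # Scale each additive component f_j to (approximately) unit variance.
    A = 1.0 / np.std(np.sin(phi * X_ref), axis=0)
    f = np.sum(A * np.sin(phi * X), axis=1)
    y = f + rng.normal(0.0, noise_sd, size=n)
    return X, y
```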

For both toy examples, we sampled 300 training points and constructed a Gaussian process model with a covariance function being the squared exponential in equation (1) with an added constant term. Using the full model with hyperparameters optimized to the maximum of the marginal likelihood, we calculated the relevance of each variable either directly via ARD using the inverse length-scale, or by averaging the KL and VAR relevance estimates from each training point. The averaged results of 200 random data sets are presented in Figure 2 for the two examples with inputs distributed uniformly (top) and normally (bottom). Input 1 is the most linear one and input 8 is the most nonlinear.

Figure 2: Relevance estimates for eight variables in the two toy examples in equation (6) with uniformly distributed inputs (top) and normally distributed inputs (bottom). The estimates are computed with ARD (blue), the KL method (red), and the VAR method (cyan). The results are averaged over 200 data realizations and scaled so that the most relevant variable has a relevance of one. Error bars representing 95% confidence intervals are indistinguishable.

Figure 2 demonstrates that in the toy example with uniform inputs, all three methods prefer nonlinear inputs over linear inputs. However, the preference in our methods is not as severe as with ARD, which assigns relevance values close to zero for half of the variables. The bottom figure, representing the toy example with Gaussian distributed inputs, shows that our methods produce almost equal relevance values for all eight variables. Overall, our methods are notably better than ARD in identifying the true relevances of the variables despite the varying degrees of nonlinearity. To ensure that the above results hold even if there are irrelevant variables in the data, we repeated the experiment with the addition of totally irrelevant input variables. The results are comparable, and are shown in Figure 10 in the supplementary material.


Figure 3: Mean log predictive densities (MLPDs) of the test sets with 95% confidence intervals for submodels as a function of variables included in the submodel (panels: Boston Housing, Automobile, Concrete, Crime). Blue depicts variables sorted using ARD, red and cyan depict the KL and VAR methods, respectively. The dashed horizontal line depicts the MLPD of the full model with hyperparameters sampled using the Hamiltonian Monte Carlo algorithm.

4.2 Real World Data

In the second experiment, we compared the variable selection performance of the three methods on five benchmark data sets obtained from the UCI machine learning repository¹. The data sets are summarized in Table 1. The Pima Indians data set is a binary classification problem, and the others are regression tasks. For each method, we used a Gaussian process model with a Gaussian likelihood and a sum of constant and squared exponential kernels as a covariance function. The model was first fitted with all p variables included, and then submodels with 1 to p - 1 of the most relevant variables included were fitted again. The submodel variables were picked based on the relevance ranking given by each method. We performed 50 repetitions, each time splitting the data into random training and test sets with the number of training points shown in Table 1. Both the full model and submodels were trained on the training set, and the predictive performance of the submodels was evaluated by computing the mean log predictive densities (MLPDs) using the independent test set.

¹ https://archive.ics.uci.edu/ml/index.html

Table 1: Summary of real world data set parameters: number of variables p, data points n_tot, and training points used n.

Dataset          p     n_tot   n
Concrete         7     103     80
Boston Housing   13    506     300
Automobile       38    193     150
Crime            102   1992    400
Pima Indians     8     392     300

For the regression tasks, the mean log predictive densities of the submodels on the test sets are presented in Figure 3 as a function of the number of variables included in the submodel. A plot for each data set contains results when the variables are sorted using ARD (blue), the KL method (red), and the VAR method (cyan). Thus, the only difference between the three curves is the choice of variables included in the submodels. The GP models are fitted by maximizing the hyperparameter posterior distribution, with a half-t distribution as the prior for the noise and signal magnitudes, and an inverse-gamma distribution for the length-scales. The inverse-gamma was chosen because it has a sharp left tail that penalizes very small length-scales, but its long right tail allows the length-scales to become large (Stan Development Team, 2017). The plots for the Automobile and Crime data sets are shown only up to a point where the predictive performance saturates. The horizontal line represents the MLPD of the full model on the test sets, which was computed using 100 Hamiltonian Monte Carlo (HMC) samples from the hyperparameter posterior (Duane et al., 1987).
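For clarity, the evaluation metric can be computed as in the following sketch for a Gaussian predictive distribution; this is a generic MLPD helper, not code from the paper.

```python
import numpy as np

def mlpd(y_test, mu_pred, var_pred):
    """Mean log predictive density of Gaussian predictive distributions
    evaluated at the held-out observations."""
    log_dens = (-0.5 * np.log(2.0 * np.pi * var_pred)
                - 0.5 * (y_test - mu_pred)**2 / var_pred)
    return np.mean(log_dens)
```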

The results show that in all four data sets, both of the proposed methods generate a better ranking for the variables than ARD does, resulting in submodels with better predictive performance. The improvement is most distinct in the first three or four variables in all the data sets. This is because ARD, by definition, picks the most nonlinear variables first, but our methods are able to identify variables that are more relevant for prediction, albeit more linear. After the initial improvement, the ranking in the latter variables is never worse than for ARD.

Figure 4: Mean log predictive densities (MLPDs) of the test sets of the Pima Indians data set with 95% confidence intervals for submodels as a function of the number of variables included in the submodel. The dashed horizontal line depicts the MLPD of the full model with hyperparameters sampled using the Hamiltonian Monte Carlo algorithm.

For the binary classification problem, we used a probit likelihood and the expectation propagation (EP) method (Minka, 2001) to approximate the posterior distribution. The mean log predictive densities on independent test sets as a function of the number of variables included in the submodel are presented in Figure 4. The improvement in variable ranking is very similar to the regression tasks, with the largest improvements in submodels with one to three variables. Both of the proposed methods are thus able to identify variables with good predictive performance in regression as well as classification tasks.

4.3 Ranking Variability

The weak identifiability of the length-scale parameters increases the variation of the ARD relevance estimate. To quantify this, we studied the variability of each method in determining the relevance ranking of variables between the 50 random training splits of the four regression data sets. For each consecutive choice of which variable to add to the submodel, we computed the entropy, which depicts the variability of the variable choice between different training sets. If the same variable is chosen in each training set, the resulting entropy is zero, and more variability leads to higher entropy. Because the maximum possible entropy depends on the number of variables to choose from, we divided the entropy values by the maximum possible entropy of each data set. The maximum entropy corresponds to the case where any of the p variables is chosen with equal probability. The variability results are presented in Figure 5.

Figure 5 indicates that the ranking variability is correlated with the predictive performance of the submodels shown in Figure 3. In the Housing, Automobile, and Crime data sets, ARD has the largest variability in the first variable choice. This seems to propagate into the differences in predictive performance seen in small submodels with one to three variables. On the other hand, in the Concrete data set, ARD has more variability in the latter variable choices. For example, the better performance of the submodel with six variables is purely the result of choosing between two variables more consistently, because all three methods always pick the same two variables last, but ARD is more uncertain about their order of relevance. A more detailed analysis of the ranking variability is presented in the supplement.
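As we read it, the relative entropy of a single ranking position can be computed as in the following sketch, where `choices` collects the index of the variable chosen at that position in each of the repeated training splits; the function name is ours.

```python
import numpy as np

def relative_entropy_of_choices(choices, p):
    """Relative entropy of the variable chosen at one ranking position across
    repeated training splits: 0 if the same variable is always chosen,
    1 if any of the p variables is chosen with equal probability."""
    counts = np.bincount(np.asarray(choices), minlength=p)
    probs = counts[counts > 0] / counts.sum()
    entropy = -np.sum(probs * np.log(probs))
    return entropy / np.log(p)
```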

4.4 Pointwise Relevance Estimates

In some cases, a variable might have strong predictive relevance in some region, while being quite irrelevant on average. In some applications, the identification of such locally relevant variables is important. Consider a hypothetical regression problem, where the variables represent measurements to be made on a patient, and the dependent variable represents the progression of a disease. The information that some measurement has little relevance on average, but for some patients it is a clear indication of how far the disease has progressed, may provide essential information for medical professionals. In the context of neural networks, Refenes and Zapranis (1999) discuss using the maximum of pointwise relevance values as a useful indicator in financial applications.

Both of the proposed variable selection methods can be used to assess the relevance of variables in a specific area of the input space. To demonstrate this, we computed the pointwise KL relevance values of the variables 1 and 8 from a sample of 300 training points from the toy example in equation (6), and the results are presented in Figure 6. As mentioned in Section 3.1, in Gaussian process regression with a Gaussian likelihood, the KL relevance value is analogous to the partial derivative of the mean of the latent function divided by the standard deviation of the posterior predictive distribution. This can be clearly seen by comparing Figure 6 to the true latent functions f_1(x_1) and f_8(x_8) in Figure 1.

Figure 5: Relative entropies that depict the variability between different training data sets for each consecutive choice of variables to add to the submodel (panels: Boston Housing, Automobile, Concrete, Crime). Blue depicts variables sorted using ARD, red and cyan depict the KL and VAR methods, respectively.

Figure 6: Plot of the pointwise KL relevance values for variables 1 and 8 computed from a sample of 300 points from the toy example in equation (6), where x_1, x_8 ~ N(0, 0.4^2).

The pointwise predictive relevance values presented in Figure 6 illustrate one of the novel aspects of the proposed methods. While automatic relevance determination outputs only a single number that implicitly represents the relevance of a variable, the KL and VAR methods can compute the relevance estimate of input variables at arbitrary points of the input space. As shown in the experiments of Section 4.2, averaging the pointwise values at the training points is effective in assessing the global predictive relevance, and the ability to consider predictive relevance locally improves the applicability of the methods.

5 CONCLUSIONS

This paper has proposed two new methods for ranking variables in Gaussian process models based on their predictive relevances. Our experiments on simulated and real world data sets indicate that the methods produce an improved variable relevance ranking compared to the commonly used automatic relevance determination via length-scale parameters. Regarding the predictive performance, although the methods were better than ARD, even better results could be obtained by other means, such as the predictive projection method, but at the expense of a much higher computational cost. Additionally, our methods were shown to generate the relevance ranking for variables with less variation compared to ARD, which is an important result in terms of interpretability of the chosen submodels. We also showed how one of the methods is connected to relevance estimation via derivatives, which encourages further research in this direction.

The methods proposed here require computing relevance values for each variable in each point of the training data, thus increasing the computational cost compared to automatic relevance determination. However, this cost is by no means prohibitive compared to the Gaussian process inference, which is computationally expensive in itself. Additionally, the methods are simpler and computationally cheaper than most alternative methods proposed in the literature. We thus discourage interpreting the length-scale of a particular dimension as a measure of predictive relevance, and advise using a more appropriate method for variable selection and relevance assessment.

Python implementations for the methods discussed in the paper are freely available at https://www.github.com/topipa/gp-varsel-kl-var.


References

Bui, T. D., Yan, J., and Turner, R. E. (2017). A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. The Journal of Machine Learning Research, 18(1):3649–3720.

Crawford, L., Flaxman, S., Runcie, D., and West, M. (2019). Variable prioritization in nonlinear black box methods: a genetic association case study. Submitted to the Annals of Applied Statistics.

Crawford, L., Wood, K. C., Zhou, X., and Mukherjee, S. (2018). Bayesian approximate kernel regression with variable selection. Journal of the American Statistical Association, 113(524):1710–1721.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.

Dupuis, J. A. and Robert, C. P. (2003). Variable selection in qualitative models via an entropic explanatory power. Journal of Statistical Planning and Inference, 111(1-2):77–94.

Goutis, C. and Robert, C. P. (1998). Model choice in generalised linear models: A Bayesian approach via Kullback-Leibler projections. Biometrika, 85(1):29–37.

Hager, W. W. (1989). Updating the inverse of a matrix. SIAM Review, 31(2):221–239.

Hardle, W. and Stoker, T. M. (1989). Investigating smooth multiple regression by the method of average derivatives. Journal of the American Statistical Association, 84(408):986–995.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86.

Lal, T. N., Chapelle, O., Weston, J., and Elisseeff, A. (2006). Embedded methods. In Feature Extraction, pages 137–165. Springer.

Linkletter, C., Bingham, D., Hengartner, N., Higdon, D., and Ye, K. Q. (2006). Variable selection for Gaussian process models in computer experiments. Technometrics, 48(4):478–490.

Liu, X., Chen, J., Nair, V., and Sudjianto, A. (2018). Model interpretation: A unified derivative-based framework for nonparametric regression and supervised machine learning. arXiv preprint arXiv:1808.07216.

MacKay, D. J. (1994). Bayesian nonlinear modeling for the prediction competition. ASHRAE Transactions, 100(2):1053–1062.

Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology.

Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of Toronto.

Piironen, J. and Vehtari, A. (2016). Projection predictive model selection for Gaussian processes. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE.

Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge.

Refenes, A.-P. and Zapranis, A. (1999). Neural model identification, variable selection and model adequacy. Journal of Forecasting, 18(5):299–332.

Ruck, D. W., Rogers, S. K., and Kabrisky, M. (1990). Feature selection using a multilayer perceptron. Journal of Neural Network Computing, 2(2):40–48.

Savitsky, T., Vannucci, M., and Sha, N. (2011). Variable selection for nonparametric Gaussian process priors: Models and computational strategies. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 26(1):130.

Seeger, M. (2000). Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In Advances in Neural Information Processing Systems 12, pages 603–609. MIT Press.

Simpson, D., Rue, H., Riebler, A., Martins, T. G., Sørbye, S. H., et al. (2017). Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1):1–28.

Stan Development Team (2017). Stan Modeling Language Users Guide and Reference Manual. http://mc-stan.org. Version 2.16.0.

Tipping, M. E. (2000). The relevance vector machine. In Advances in Neural Information Processing Systems 12, pages 652–658. MIT Press.

Vanhatalo, J., Jylanki, P., and Vehtari, A. (2009). Gaussian process regression with Student-t likelihood. In Advances in Neural Information Processing Systems 22, pages 1910–1918. Curran Associates, Inc.

Vehtari, A., Ojanen, J., et al. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142–228.

Williams, C. K. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.

Williams, C. K. and Rasmussen, C. E. (1996). Gaussian processes for regression. In Advances in Neural Information Processing Systems 8, pages 514–520. MIT Press.

Wipf, D. P. and Nagarajan, S. S. (2008). A new view of automatic relevance determination. In Advances in Neural Information Processing Systems 20, pages 1625–1632. Curran Associates, Inc.

Zhang, H. (2004). Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. Journal of the American Statistical Association, 99(465):250–261.