Advances in using GPs with derivative observations

Gaussian Process approximations 2017 workshop

by Eero Siivola¹, joint work with Aki Vehtari¹, Juho Piironen¹, Javier González², Jarno Vanhatalo³ and Olli-Pekka Koistinen¹

¹Aalto University, Finland; ²Amazon, Cambridge, UK; ³University of Helsinki, Finland


May 30, 2017

Contents of this talk

- Theory behind GPs + derivatives
- GP-NEB
- Automatic monotonicity detection with GPs
- Bayesian optimization with derivative sign information


Theory: GP + derivative observations

How can (partial) derivative observations be used with GPs? Two model components need attention:

- Covariance function
- Likelihood function

Together these determine the posterior, which in turn dictates the inference method.


Covariance function

A nice property (see e.g. Papoulis [1991, ch. 10]): differentiation is a linear operation, so the derivatives of a GP are jointly Gaussian with the function values, with

$$\operatorname{cov}\left(\frac{\partial f^{(1)}}{\partial x^{(1)}_g},\, f^{(2)}\right) = \frac{\partial}{\partial x^{(1)}_g}\operatorname{cov}\left(f^{(1)}, f^{(2)}\right) = \frac{\partial}{\partial x^{(1)}_g}\, k\left(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}\right)$$

and:

$$\operatorname{cov}\left(\frac{\partial f^{(1)}}{\partial x^{(1)}_g},\, \frac{\partial f^{(2)}}{\partial x^{(2)}_h}\right) = \frac{\partial^2}{\partial x^{(1)}_g\, \partial x^{(2)}_h}\operatorname{cov}\left(f^{(1)}, f^{(2)}\right) = \frac{\partial^2}{\partial x^{(1)}_g\, \partial x^{(2)}_h}\, k\left(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}\right)$$
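For a concrete kernel these derivatives have closed forms. A minimal NumPy sketch for the squared-exponential kernel (the kernel choice, hyperparameters and function names are illustrative assumptions, not from the slides), checked against finite differences:

```python
import numpy as np

def k(x1, x2, ell=1.0):
    """Squared-exponential kernel k(x1, x2) with unit variance."""
    return np.exp(-0.5 * np.sum((x1 - x2) ** 2) / ell**2)

def dk_dx1g(x1, x2, g, ell=1.0):
    """cov(df(1)/dx1_g, f(2)) = d k(x1, x2) / d x1_g."""
    return -(x1[g] - x2[g]) / ell**2 * k(x1, x2, ell)

def d2k_dx1g_dx2h(x1, x2, g, h, ell=1.0):
    """cov(df(1)/dx1_g, df(2)/dx2_h) = d^2 k / (d x1_g d x2_h)."""
    delta = 1.0 if g == h else 0.0
    return (delta / ell**2
            - (x1[g] - x2[g]) * (x1[h] - x2[h]) / ell**4) * k(x1, x2, ell)

# finite-difference sanity check of the first-derivative formula
x1, x2 = np.array([0.3, -0.7]), np.array([1.1, 0.4])
e0, eps = np.array([1.0, 0.0]), 1e-6
fd = (k(x1 + eps * e0, x2) - k(x1 - eps * e0, x2)) / (2 * eps)
assert abs(fd - dk_dx1g(x1, x2, 0)) < 1e-8
```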


Let $X = [\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}]^T$ and $\tilde{X} = [\tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(m)}]^T$ be the points where we observe function values and partial derivative values, respectively.

The covariance between the latent function values $\mathbf{f}_X = [f^{(1)}, \ldots, f^{(n)}]^T$ and the latent function derivative values $\mathbf{f}'_{\tilde{X}} = \left[\partial \tilde{f}^{(1)}/\partial \tilde{x}^{(1)}_g, \ldots, \partial \tilde{f}^{(m)}/\partial \tilde{x}^{(m)}_g\right]^T$ is:

$$K_{X,\tilde{X}} = \begin{bmatrix} \frac{\partial}{\partial \tilde{x}^{(1)}_g}\operatorname{cov}(f^{(1)}, \tilde{f}^{(1)}) & \cdots & \frac{\partial}{\partial \tilde{x}^{(m)}_g}\operatorname{cov}(f^{(1)}, \tilde{f}^{(m)}) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial \tilde{x}^{(1)}_g}\operatorname{cov}(f^{(n)}, \tilde{f}^{(1)}) & \cdots & \frac{\partial}{\partial \tilde{x}^{(m)}_g}\operatorname{cov}(f^{(n)}, \tilde{f}^{(m)}) \end{bmatrix} = K_{\tilde{X},X}^T$$


And between the latent function derivative values $\mathbf{f}'_{\tilde{X}}$ themselves:

$$K_{\tilde{X},\tilde{X}} = \begin{bmatrix} \frac{\partial^2}{\partial \tilde{x}^{(1)}_g\, \partial \tilde{x}^{(1)}_g}\operatorname{cov}(\tilde{f}^{(1)}, \tilde{f}^{(1)}) & \cdots & \frac{\partial^2}{\partial \tilde{x}^{(1)}_g\, \partial \tilde{x}^{(m)}_g}\operatorname{cov}(\tilde{f}^{(1)}, \tilde{f}^{(m)}) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial \tilde{x}^{(m)}_g\, \partial \tilde{x}^{(1)}_g}\operatorname{cov}(\tilde{f}^{(m)}, \tilde{f}^{(1)}) & \cdots & \frac{\partial^2}{\partial \tilde{x}^{(m)}_g\, \partial \tilde{x}^{(m)}_g}\operatorname{cov}(\tilde{f}^{(m)}, \tilde{f}^{(m)}) \end{bmatrix}$$
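Stacking these blocks gives the joint covariance of $[\mathbf{f}_X, \mathbf{f}'_{\tilde{X}}]$, which must come out symmetric and positive semidefinite. A sketch for the squared-exponential kernel with derivatives taken along a single coordinate $g$ (all names and hyperparameters are illustrative assumptions):

```python
import numpy as np

def se(X1, X2, ell=1.0):
    """Squared-exponential kernel matrix with unit variance."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell**2)

def joint_cov(X, Xt, g=0, ell=1.0):
    """Joint covariance of [f_X, f'_Xt], where f'_Xt are partial
    derivatives w.r.t. coordinate g at the points Xt."""
    Kff = se(X, X, ell)
    # cov(f(x), df(xt)/dxt_g) = d k(x, xt) / d xt_g
    Kfd = (X[:, None, g] - Xt[None, :, g]) / ell**2 * se(X, Xt, ell)
    # cov(df(xt)/dxt_g, df(xt')/dxt'_g) = d^2 k / (d xt_g d xt'_g)
    dd = Xt[:, None, g] - Xt[None, :, g]
    Kdd = (1.0 / ell**2 - dd**2 / ell**4) * se(Xt, Xt, ell)
    return np.block([[Kff, Kfd], [Kfd.T, Kdd]])

rng = np.random.default_rng(0)
K = joint_cov(rng.normal(size=(5, 2)), rng.normal(size=(3, 2)))
assert np.allclose(K, K.T)                   # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-9   # PSD up to round-off
```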


Likelihood function

Observations are assumed independent given the latent function values:

$$p(\mathbf{y}, \mathbf{y}' \mid \mathbf{f}_X, \mathbf{f}'_{\tilde{X}}) = \left(\prod_{i=1}^{n} p\big(y^{(i)} \mid f^{(i)}\big)\right) \left(\prod_{i=1}^{m} p\!\left(\frac{\partial y^{(i)}}{\partial x^{(i)}_g} \,\middle|\, \frac{\partial f^{(i)}}{\partial x^{(i)}_g}\right)\right)$$

How to select the likelihood of the derivatives?

- If direct derivative values can be observed: Gaussian likelihood.
- If we only have a hint about the direction: probit likelihood with a tuning parameter ν (Riihimäki and Vehtari, 2010),

$$p\!\left(\frac{\partial y^{(i)}}{\partial x^{(i)}_g} \,\middle|\, \frac{\partial f^{(i)}}{\partial x^{(i)}_g}\right) = \Phi\!\left(\frac{\partial f^{(i)}}{\partial x^{(i)}_g}\, \frac{1}{\nu}\right), \quad \text{where } \Phi(a) = \int_{-\infty}^{a} \mathcal{N}(x \mid 0, 1)\, dx.$$
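The role of the tuning parameter ν can be seen directly: a small ν turns the probit into an almost hard constraint on the derivative sign. A small sketch using SciPy's normal CDF for Φ (the function name and default values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def probit_sign_lik(df, sign=1, nu=1e-4):
    """Likelihood of a derivative-sign observation given the latent
    derivative df: Phi(sign * df / nu). As nu -> 0 this approaches a
    step function, i.e. a nearly hard monotonicity constraint."""
    return norm.cdf(sign * df / nu)

assert probit_sign_lik(0.5, sign=1) > 0.999    # consistent sign
assert probit_sign_lik(-0.5, sign=1) < 0.001   # inconsistent sign
```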


Figure: The probit likelihood as a function of the latent derivative value, with ν = 1 (left) and with ν = 1 × 10⁻⁴ (right).


Posterior distribution

Posterior distribution of the joint latent values:

$$p(\mathbf{f}, \mathbf{f}' \mid \mathbf{y}, \mathbf{y}', X, \tilde{X}) = \frac{p(\mathbf{f}, \mathbf{f}' \mid X, \tilde{X}) \left(\prod_{i=1}^{n} p\big(y^{(i)} \mid f^{(i)}\big)\right) \left(\prod_{i=1}^{m} p\!\left(\frac{\partial y^{(i)}}{\partial x^{(i)}_g} \,\middle|\, \frac{\partial f^{(i)}}{\partial x^{(i)}_g}\right)\right)}{Z}$$

The different parts:

- $p(\mathbf{f}, \mathbf{f}' \mid X, \tilde{X})$ is Gaussian
- $p(y^{(i)} \mid f^{(i)})$ are Gaussian
- $p\big(\partial y^{(i)}/\partial x^{(i)}_g \mid \partial f^{(i)}/\partial x^{(i)}_g\big)$ is Gaussian or probit

The posterior distribution is therefore either Gaussian or of the same form as in classification problems:

- We might need posterior approximation methods (e.g. Laplace or EP).
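In the all-Gaussian case no approximation is needed: the posterior follows from the same closed-form conditioning as in standard GP regression, just with the extended covariance blocks. A 1-D toy sketch (the kernel, data and the derivative observation f'(0) = 0 are illustrative assumptions):

```python
import numpy as np

ell, jitter = 1.0, 1e-8
def k(a, b):     return np.exp(-0.5 * (a - b)**2 / ell**2)
def dk_db(a, b): return (a - b) / ell**2 * k(a, b)                   # dk/db
def ddk(a, b):   return (1/ell**2 - (a - b)**2 / ell**4) * k(a, b)   # d2k/(da db)

X  = np.array([-1.0, 0.0, 1.0]); y  = X**2             # samples of f(x) = x^2
Xd = np.array([0.0]);            yd = np.array([0.0])  # derivative obs f'(0) = 0

# joint covariance of [f(X), f'(Xd)], block by block
K = np.block([[k(X[:, None], X[None, :]),      dk_db(X[:, None], Xd[None, :])],
              [dk_db(X[None, :], Xd[:, None]), ddk(Xd[:, None], Xd[None, :])]])
alpha = np.linalg.solve(K + jitter * np.eye(4), np.concatenate([y, yd]))

def post_mean(xs):
    """Posterior mean at xs, conditioned on values and the derivative."""
    kstar = np.concatenate([k(X, xs), dk_db(xs, Xd)])
    return kstar @ alpha

# symmetric data plus f'(0) = 0 give a posterior mean symmetric about 0
assert abs(post_mean(0.5) - post_mean(-0.5)) < 1e-9
assert abs(post_mean(1.0) - 1.0) < 1e-3   # near-interpolation at a data point
```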


Saddle point search using GPs + derivative observations

- The properties of the system can be described by an energy surface.
- Finding a minimum energy path and the saddle point between two states is useful when determining properties of transitions.


Nudged elastic band (NEB)

- Starting from an initial guess, the idea is to move the images downwards on the energy surface while keeping them evenly spaced.
- The images are moved along a force vector, which is the resultant of two components:
  - the (negative) energy gradient component perpendicular to the path, and
  - a spring force parallel to the path, which tends to keep the images evenly spaced.


- The convergence of NEB may require hundreds or thousands of iterations.
- Each iteration requires evaluating the energy gradient for all images, which is often a time-consuming operation.


Speedup of NEB

- Repeat until convergence:
  1. Evaluate the energy (and forces) at the images of the current path.
  2. If the path has not converged, approximate the energy surface using machine learning based on the observations so far.
  3. Find the predicted minimum energy path on the approximate surface and go to 1.
- Details in the paper by Peterson (2016).
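The loop above can be sketched as pseudocode; `evaluate_energy_and_forces`, `is_converged`, `fit_gp` and `relax_path_on_surrogate` are placeholder names standing in for a real implementation, not functions from the cited paper's code:

```python
# Schematic GP/ML-accelerated NEB outer loop (pseudocode).
def accelerated_neb(path, evaluate_energy_and_forces, max_outer=100):
    observations = []
    for _ in range(max_outer):
        # 1. evaluate the expensive energy (and forces) at the images
        results = [evaluate_energy_and_forces(x) for x in path]
        observations.extend(zip(path, results))
        if is_converged(results):
            return path
        # 2. approximate the energy surface from all observations so far
        surrogate = fit_gp(observations)
        # 3. relax the path fully on the cheap surrogate, then repeat
        path = relax_path_on_surrogate(surrogate, path)
    return path
```

The key saving is that the expensive gradient evaluations happen only once per outer iteration, while the inner path relaxation runs entirely on the surrogate.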


Speedup of NEB with GP and derivatives

- Evaluate the energy (and forces) only at the image with the highest uncertainty.
- Re-approximate the energy surface and find a new MEP guess after each image evaluation.
- Convergence check:
  - If the magnitude of the force (accurate or approximated) is below the convergence limit for all images, we do not move the path but evaluate more images, until the convergence limit is exceeded or all images have been evaluated.
  - If we manage to evaluate all images without moving the path, we know for sure whether the path has converged.
- Details in the paper by Koistinen, Maras, Vehtari and Jónsson (2016).


- When evaluating transition rates, the Hessian at the minimum points needs to be evaluated at some phase.
- This information can be used to improve the GP approximation, especially in the beginning when there is little information.


Comparison of methods in heptamer case study


Automatic monotonicity detection

- Derivative sign information can be used to find monotonic input-output directions.
- The basic idea:
  - Add virtual derivative sign observations to the GP model.
  - See whether the additions affect the probability of the data:
    - the dimension is monotonic if they do not.
- Details in the paper by Siivola, Piironen and Vehtari (2016).


Theoretical background

Energy comparison:

$$E(\mathbf{y}, \mathbf{y}' \mid X, \tilde{X}_m) = -\log p(\mathbf{y}, \mathbf{y}' \mid X, \tilde{X}_m) = -\log\Big[\, p(\mathbf{y} \mid X)\, \underbrace{p(\mathbf{y}' \mid \mathbf{y}, X, \tilde{X}_m)}_{\approx 1}\, \Big] \approx E(\mathbf{y} \mid X).$$

If the virtual derivative sign observations are consistent with the data, the second factor is close to one and the energy is almost unchanged; for a non-monotonic dimension it is far from one and the energy increases.
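The slides use probit sign observations, which require approximate inference. As an all-Gaussian stand-in, the same energy logic can be illustrated with direct derivative observations: adding derivative observations that agree with the data barely moves the energy, while conflicting ones increase it sharply. All names and numbers below are illustrative assumptions:

```python
import numpy as np

ell, s2 = 1.0, 1e-2   # lengthscale and observation noise (illustrative)

def k(a, b):   return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
def dk(a, b):  return (a[:, None] - b[None, :]) / ell**2 * k(a, b)
def ddk(a, b): return (1/ell**2 - (a[:, None] - b[None, :])**2 / ell**4) * k(a, b)

def energy(X, y, Xd=None, yd=None):
    """E = -log marginal likelihood of function-value observations y,
    optionally joined with Gaussian derivative observations yd."""
    if Xd is None:
        K, obs = k(X, X), y
    else:
        K = np.block([[k(X, X), dk(X, Xd)], [dk(X, Xd).T, ddk(Xd, Xd)]])
        obs = np.concatenate([y, yd])
    K = K + s2 * np.eye(len(obs))
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (obs @ np.linalg.solve(K, obs) + logdet
                  + len(obs) * np.log(2 * np.pi))

X, Xd, ones = np.linspace(0, 1, 8), np.linspace(0, 1, 4), np.ones(4)
d_consistent = energy(X, X, Xd, ones) - energy(X, X)            # y = x, y' = 1
d_conflict = energy(X, np.zeros(8), Xd, ones) - energy(X, np.zeros(8))
assert d_conflict > d_consistent + 1.0   # conflicting signs cost much more
```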


Figure: Change in the energy of the data, in reality, as a function of the number of virtual derivative sign observations, for a GP with the monotonicity assumption versus a regular GP.


Using automatic monotonicity detection in modelling

- Monotonic dimensions can be detected from the data and used in modelling.
- The method improves the modelling results especially on the borders of the data.


Experiment

- Six different functions of varying monotonicity.
- Different amounts of noise added to the training samples (signal-to-noise ratio (SNR) between 0 and 1).
- Measure the log predictive posterior density of samples from a hold-out set that resembles the 20 % bordermost samples in the training data:

$$\mathrm{lppd} = \sum_{i=1}^{L} \log \int p(y_i \mid f)\, p_{\text{post}}(f \mid \mathbf{x}_i)\, df$$

- Repeat this 200 times for three different models:
  - fixed monotonicity,
  - monotonicity used only if it does not change the energy (adaptive monotonicity),
  - a model without derivative observations.
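The lppd above can be estimated by Monte Carlo when posterior draws of f at the test inputs are available; the Gaussian observation model and all names below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def lppd(y_test, f_draws, noise_sd=0.1):
    """Monte-Carlo lppd: y_test has shape (L,), f_draws has shape (S, L)
    holding S posterior draws of f at each test input x_i."""
    # log (1/S) sum_s p(y_i | f_s(x_i)), then summed over test points i
    logp = norm.logpdf(y_test[None, :], loc=f_draws, scale=noise_sd)
    return np.sum(np.logaddexp.reduce(logp, axis=0) - np.log(f_draws.shape[0]))

# toy check: draws centred on the truth score higher than biased draws
rng = np.random.default_rng(0)
y = np.zeros(5)
assert (lppd(y, rng.normal(0.0, 0.05, size=(1000, 5)))
        > lppd(y, rng.normal(1.0, 0.05, size=(1000, 5))))
```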


Results

Figure: ∆LPPD between the baseline and the named method (adaptive monotonicity, fixed monotonicity) on the y axis, SNR on the x axis; rows N = 30 and N = 90; columns f(x) = −x, f(x) = 0, f(x) = x, f(x) = 1/(1 + e^{−0.5x}), f(x) = 1/(1 + e^{−x}) and f(x) = 1/(1 + e^{−2x}).


Multidimensional experiment

Diabetes data¹:
- Target value: a measure of diabetes progression one year after baseline.
- 10 dimensions.
- Detect monotonic dimensions and use them if needed.

¹ Diabetes data, available at: http://web.stanford.edu/~hastie/Papers/LARS/diabetes.data


Results

Figure: Target value as a function of a single predictor (age, sex, body mass index (bmi), mean arterial pressure (map), triglycerides (tc), low-density lipoprotein (ldl), high-density lipoprotein (hdl), total cholesterol (tch), low-tension glaucoma (ltg), fasting blood glucose (glu)) while the others are kept at the median of the dataset. Solid black lines correspond to the regular GP mean and 90 % posterior central interval. Red dashed lines correspond to the AMD GP's mean and standard deviation when body mass index and low-tension glaucoma are detected as increasing. The black dashed line corresponds to the largest value of the covariate.


Bayesian optimization with virtual derivative sign observations

Bayesian optimization (BO):
- A global optimization strategy designed to find the minimum of expensive black-box functions:
  1. Fit a GP to the available dataset X, y.
  2. Evaluate the function at a new location chosen by some acquisition function.
  3. If the stopping criterion is not met, go to 1.
- Usually the search space is selected so that the minimum is not on the border.
- Over-exploration of the edges is a typical problem.
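The loop can be made concrete with a bare-bones GP and a lower-confidence-bound (LCB) acquisition on a grid; the toy objective, kernel and settings are illustrative assumptions (a real implementation would also optimize the hyperparameters):

```python
import numpy as np

f = lambda x: (x - 0.3) ** 2               # toy expensive black box
grid = np.linspace(0.0, 1.0, 201)          # candidate locations

def gp_posterior(X, y, xs, ell=0.2, noise=1e-6):
    """Zero-mean SE-kernel GP posterior mean and variance on xs."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(xs, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, np.maximum(var, 1e-12)

X, y = np.array([0.1, 0.9]), f(np.array([0.1, 0.9]))   # initial design
for _ in range(10):
    mean, var = gp_posterior(X, y, grid)               # 1. fit GP
    lcb = mean - 2.0 * np.sqrt(var)                    # 2. acquisition
    xn = grid[np.argmin(lcb)]
    X, y = np.append(X, xn), np.append(y, f(xn))       # evaluate, go to 1

assert abs(X[np.argmin(y)] - 0.3) < 0.1   # best point near the true minimum
```

Since posterior uncertainty stays high wherever data is sparse, acquisitions like LCB are drawn toward the edges of the search space, which is how the over-exploration problem arises.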


Figure: Over-exploration of the edges, visualized with LCB as the acquisition function. Circles are initial samples and crosses are acquisitions.


Fixing over-exploration with derivative sign observations

- By adding virtual derivative sign observations on the borders, the over-exploration problem can be solved:
  1. Fit a GP to the available dataset X, y.
  2. Find a new location based on some acquisition function.
  3. If the new location is on the border: add a derivative sign observation at the border.
  4. Else: add the new location to the dataset.
  5. If the stopping criterion is not met, go to 1.
- Details in the paper by Siivola, Vehtari, Vanhatalo and González (2017).
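The modified loop differs from plain BO only in what gets added to the model when an acquisition lands on the border. A pseudocode sketch; `initial_design`, `fit_gp_with_sign_obs`, `minimize_acquisition`, `on_border` and `inward_sign_observation` are placeholder names, not functions from the paper's code:

```python
# Schematic BO loop with virtual derivative sign observations (pseudocode).
def bo_with_virtual_signs(f, bounds, n_iter):
    data, sign_obs = initial_design(f, bounds), []
    for _ in range(n_iter):
        # 1. fit GP to values and any accumulated sign observations
        gp = fit_gp_with_sign_obs(data, sign_obs)
        # 2. propose a new location from the acquisition function
        x_new = minimize_acquisition(gp, bounds)
        if on_border(x_new, bounds):
            # 3. do NOT evaluate f: add a virtual derivative sign
            #    observation pointing into the search space instead
            sign_obs.append(inward_sign_observation(x_new, bounds))
        else:
            # 4. otherwise evaluate the expensive function as usual
            data.append((x_new, f(x_new)))
    return min(data, key=lambda d: d[1])
```

The virtual sign observations encode the prior belief that the function increases toward the border, so the acquisition function stops proposing border points without any extra expensive evaluations.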


Figure: GP prior and acquisition functions for a one-dimensional space. a) and c): without virtual derivative sign observations; b) and d): with derivative sign observations.


Experiments

Metrics for comparing the performance of two BO algorithms:
- Percentual minimum difference (PMD): designed to compare the absolute performance of the algorithms; intuitively, it measures the difference between the best values found by the two algorithms.
- Percentual hit difference (PHD): created for comparing the speeds of the algorithms; intuitively, it measures the difference in how fast the two algorithms find good enough values.
- Percentual border hit difference (PBHD): assuming that the minimum is not near the border, PBHD gives the scaled difference in unnecessary samples taken near the borders.
- Average evaluation distance difference (AED): intuitively, AED measures the overall performance of the algorithm before finding the minimum.
- Virtual derivative observations per dimension (VDO): intuitively, larger VDO is worse, since the virtual observations increase the computational burden of the algorithm, as GPs scale as O((n + q)³).

For PMD, PHD and PBHD, negative values mean that the proposed method is better; the values are always scaled between −1 and 1, and the further the value is from 0, the bigger the difference between the two methods.

Page 34: Advances in using GPs with derivative observationssiivole1/pdf/gpa17_siivola.pdf · Advances in using GPs with derivative observations Gaussian Process approximations 2017 –workshop

Advances in using GPs with derivative observationsMay 30, 2017

34/26

Experiment 1

Page 35: Advances in using GPs with derivative observationssiivole1/pdf/gpa17_siivola.pdf · Advances in using GPs with derivative observations Gaussian Process approximations 2017 –workshop

Advances in using GPs with derivative observationsMay 30, 2017

35/26

Experiment 2

- 100 d-dimensional multivariate normal distribution functions for each d = 1, ..., 11.
- Different amounts of noise added to the functions.
- BO and BO with derivatives run for 100 acquisitions.

Page 36: Advances in using GPs with derivative observationssiivole1/pdf/gpa17_siivola.pdf · Advances in using GPs with derivative observations Gaussian Process approximations 2017 –workshop

Advances in using GPs with derivative observationsMay 30, 2017

36/26

Results


Experiment 3

- Same as experiment 2, but for the sigopt function dataset².
- 113 functions, from 1 to 11 dimensions.

² Dataset available at: https://github.com/sigopt/sigopt-examples


Results


Summary

Derivatives can be used with GPs in many new ways:
- to improve the accuracy of GPs in the simulation of energy surfaces,
- to automatically find monotonic dimensions from data,
- to fix the border over-exploration problem of BO.


Questions?

Email: [email protected]


References

- Papoulis, A. (1991). Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York. Third edition.
- Riihimäki, J. and Vehtari, A. (2010). "Gaussian processes with monotonicity information." In Proceedings of AISTATS 2010, vol. 9, pp. 645-652.
- Peterson (2016). "Acceleration of saddle-point searches with machine learning." J. Chem. Phys., 145, 074106.
- Koistinen, O.-P., Maras, E., Vehtari, A. and Jónsson, H. (2016). "Minimum energy path calculations with Gaussian process regression." Nanosystems: Physics, Chemistry, Mathematics, 7 (6), pp. 925-935.
- Siivola, E., Piironen, J., and Vehtari, A. (2016). "Automatic monotonicity detection for Gaussian processes." arXiv: https://arxiv.org/abs/1610.05440
- Siivola, E., Vehtari, A., Vanhatalo, J., and González, J. (2017). "Bayesian optimization with virtual derivative sign observations." arXiv: https://arxiv.org/abs/1704.00963