Advances in using GPs with derivative observations

Gaussian Process approximations 2017 workshop

by Eero Siivola¹, joint work with Aki Vehtari¹, Juho Piironen¹, Javier González², Jarno Vanhatalo³ and Olli-Pekka Koistinen¹

¹Aalto University, Finland; ²Amazon, Cambridge, UK; ³University of Helsinki, Finland


May 30, 2017

Contents of this talk

- Theory behind GPs + derivatives
- GP-NEB
- Automatic monotonicity detection with GPs
- Bayesian optimization with derivative sign information


Theory: GP + derivative observations

How can (partial) derivative observations be used with GPs? Two model components need attention:

- Covariance function
- Likelihood function

Together these determine the posterior, which in turn dictates the inference method.


Covariance function

A nice property (see e.g. Papoulis [1991, ch. 10]): differentiation is a linear operation, so the derivatives of a GP are jointly Gaussian with the function values, with

$$\operatorname{cov}\left(\frac{\partial f^{(1)}}{\partial x^{(1)}_g},\, f^{(2)}\right) = \frac{\partial}{\partial x^{(1)}_g}\operatorname{cov}\left(f^{(1)}, f^{(2)}\right) = \frac{\partial}{\partial x^{(1)}_g}\, k\left(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}\right)$$

and:

$$\operatorname{cov}\left(\frac{\partial f^{(1)}}{\partial x^{(1)}_g},\, \frac{\partial f^{(2)}}{\partial x^{(2)}_h}\right) = \frac{\partial^2}{\partial x^{(1)}_g\, \partial x^{(2)}_h}\operatorname{cov}\left(f^{(1)}, f^{(2)}\right) = \frac{\partial^2}{\partial x^{(1)}_g\, \partial x^{(2)}_h}\, k\left(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}\right)$$
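For a concrete kernel these derivatives have closed forms. A minimal NumPy sketch for the squared-exponential kernel (the kernel choice, hyperparameters and function names are illustrative assumptions, not from the slides), checked against finite differences:

```python
import numpy as np

def k(x1, x2, ell=1.0):
    """Squared-exponential kernel k(x1, x2) with unit variance."""
    return np.exp(-0.5 * np.sum((x1 - x2) ** 2) / ell**2)

def dk_dx1g(x1, x2, g, ell=1.0):
    """cov(df(1)/dx1_g, f(2)) = d k(x1, x2) / d x1_g."""
    return -(x1[g] - x2[g]) / ell**2 * k(x1, x2, ell)

def d2k_dx1g_dx2h(x1, x2, g, h, ell=1.0):
    """cov(df(1)/dx1_g, df(2)/dx2_h) = d^2 k / (d x1_g d x2_h)."""
    delta = 1.0 if g == h else 0.0
    return (delta / ell**2
            - (x1[g] - x2[g]) * (x1[h] - x2[h]) / ell**4) * k(x1, x2, ell)

# finite-difference sanity check of the first-derivative formula
x1, x2 = np.array([0.3, -0.7]), np.array([1.1, 0.4])
e0, eps = np.array([1.0, 0.0]), 1e-6
fd = (k(x1 + eps * e0, x2) - k(x1 - eps * e0, x2)) / (2 * eps)
assert abs(fd - dk_dx1g(x1, x2, 0)) < 1e-8
```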


Let $X = [\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}]^T$ and $\tilde{X} = [\tilde{\mathbf{x}}^{(1)}, \ldots, \tilde{\mathbf{x}}^{(m)}]^T$ be the points where we observe function values and partial derivative values, respectively.

The covariance between the latent function values $\mathbf{f}_X = [f^{(1)}, \ldots, f^{(n)}]^T$ and the latent function derivative values $\mathbf{f}'_{\tilde{X}} = \left[\partial \tilde{f}^{(1)}/\partial \tilde{x}^{(1)}_g, \ldots, \partial \tilde{f}^{(m)}/\partial \tilde{x}^{(m)}_g\right]^T$ is:

$$K_{X,\tilde{X}} = \begin{bmatrix} \frac{\partial}{\partial \tilde{x}^{(1)}_g}\operatorname{cov}(f^{(1)}, \tilde{f}^{(1)}) & \cdots & \frac{\partial}{\partial \tilde{x}^{(m)}_g}\operatorname{cov}(f^{(1)}, \tilde{f}^{(m)}) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial \tilde{x}^{(1)}_g}\operatorname{cov}(f^{(n)}, \tilde{f}^{(1)}) & \cdots & \frac{\partial}{\partial \tilde{x}^{(m)}_g}\operatorname{cov}(f^{(n)}, \tilde{f}^{(m)}) \end{bmatrix} = K_{\tilde{X},X}^T$$


And between the latent function derivative values $\mathbf{f}'_{\tilde{X}}$ themselves:

$$K_{\tilde{X},\tilde{X}} = \begin{bmatrix} \frac{\partial^2}{\partial \tilde{x}^{(1)}_g\, \partial \tilde{x}^{(1)}_g}\operatorname{cov}(\tilde{f}^{(1)}, \tilde{f}^{(1)}) & \cdots & \frac{\partial^2}{\partial \tilde{x}^{(1)}_g\, \partial \tilde{x}^{(m)}_g}\operatorname{cov}(\tilde{f}^{(1)}, \tilde{f}^{(m)}) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial \tilde{x}^{(m)}_g\, \partial \tilde{x}^{(1)}_g}\operatorname{cov}(\tilde{f}^{(m)}, \tilde{f}^{(1)}) & \cdots & \frac{\partial^2}{\partial \tilde{x}^{(m)}_g\, \partial \tilde{x}^{(m)}_g}\operatorname{cov}(\tilde{f}^{(m)}, \tilde{f}^{(m)}) \end{bmatrix}$$
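Stacking these blocks gives the joint covariance of $[\mathbf{f}_X, \mathbf{f}'_{\tilde{X}}]$, which must come out symmetric and positive semidefinite. A sketch for the squared-exponential kernel with derivatives taken along a single coordinate $g$ (all names and hyperparameters are illustrative assumptions):

```python
import numpy as np

def se(X1, X2, ell=1.0):
    """Squared-exponential kernel matrix with unit variance."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell**2)

def joint_cov(X, Xt, g=0, ell=1.0):
    """Joint covariance of [f_X, f'_Xt], where f'_Xt are partial
    derivatives w.r.t. coordinate g at the points Xt."""
    Kff = se(X, X, ell)
    # cov(f(x), df(xt)/dxt_g) = d k(x, xt) / d xt_g
    Kfd = (X[:, None, g] - Xt[None, :, g]) / ell**2 * se(X, Xt, ell)
    # cov(df(xt)/dxt_g, df(xt')/dxt'_g) = d^2 k / (d xt_g d xt'_g)
    dd = Xt[:, None, g] - Xt[None, :, g]
    Kdd = (1.0 / ell**2 - dd**2 / ell**4) * se(Xt, Xt, ell)
    return np.block([[Kff, Kfd], [Kfd.T, Kdd]])

rng = np.random.default_rng(0)
K = joint_cov(rng.normal(size=(5, 2)), rng.normal(size=(3, 2)))
assert np.allclose(K, K.T)                   # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-9   # PSD up to round-off
```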


Likelihood function

Observations are assumed independent given the latent function values:

$$p(\mathbf{y}, \mathbf{y}' \mid \mathbf{f}_X, \mathbf{f}'_{\tilde{X}}) = \left(\prod_{i=1}^{n} p\big(y^{(i)} \mid f^{(i)}\big)\right) \left(\prod_{i=1}^{m} p\!\left(\frac{\partial y^{(i)}}{\partial x^{(i)}_g} \,\middle|\, \frac{\partial f^{(i)}}{\partial x^{(i)}_g}\right)\right)$$

How to select the likelihood of the derivatives?

- If direct derivative values can be observed: Gaussian likelihood.
- If we only have a hint about the direction: probit likelihood with a tuning parameter ν (Riihimäki and Vehtari, 2010),

$$p\!\left(\frac{\partial y^{(i)}}{\partial x^{(i)}_g} \,\middle|\, \frac{\partial f^{(i)}}{\partial x^{(i)}_g}\right) = \Phi\!\left(\frac{\partial f^{(i)}}{\partial x^{(i)}_g}\, \frac{1}{\nu}\right), \quad \text{where } \Phi(a) = \int_{-\infty}^{a} \mathcal{N}(x \mid 0, 1)\, dx.$$
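The role of the tuning parameter ν can be seen directly: a small ν turns the probit into an almost hard constraint on the derivative sign. A small sketch using SciPy's normal CDF for Φ (the function name and default values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def probit_sign_lik(df, sign=1, nu=1e-4):
    """Likelihood of a derivative-sign observation given the latent
    derivative df: Phi(sign * df / nu). As nu -> 0 this approaches a
    step function, i.e. a nearly hard monotonicity constraint."""
    return norm.cdf(sign * df / nu)

assert probit_sign_lik(0.5, sign=1) > 0.999    # consistent sign
assert probit_sign_lik(-0.5, sign=1) < 0.001   # inconsistent sign
```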


Figure: The probit likelihood as a function of the latent derivative value, with ν = 1 (left) and with ν = 1 × 10⁻⁴ (right).


Posterior distribution

Posterior distribution of the joint latent values:

$$p(\mathbf{f}, \mathbf{f}' \mid \mathbf{y}, \mathbf{y}', X, \tilde{X}) = \frac{p(\mathbf{f}, \mathbf{f}' \mid X, \tilde{X}) \left(\prod_{i=1}^{n} p\big(y^{(i)} \mid f^{(i)}\big)\right) \left(\prod_{i=1}^{m} p\!\left(\frac{\partial y^{(i)}}{\partial x^{(i)}_g} \,\middle|\, \frac{\partial f^{(i)}}{\partial x^{(i)}_g}\right)\right)}{Z}$$

The different parts:

- $p(\mathbf{f}, \mathbf{f}' \mid X, \tilde{X})$ is Gaussian
- $p(y^{(i)} \mid f^{(i)})$ are Gaussian
- $p\big(\partial y^{(i)}/\partial x^{(i)}_g \mid \partial f^{(i)}/\partial x^{(i)}_g\big)$ is Gaussian or probit

The posterior distribution is therefore either Gaussian or of the same form as in classification problems:

- We might need posterior approximation methods (e.g. Laplace or EP).
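In the all-Gaussian case no approximation is needed: the posterior follows from the same closed-form conditioning as in standard GP regression, just with the extended covariance blocks. A 1-D toy sketch (the kernel, data and the derivative observation f'(0) = 0 are illustrative assumptions):

```python
import numpy as np

ell, jitter = 1.0, 1e-8
def k(a, b):     return np.exp(-0.5 * (a - b)**2 / ell**2)
def dk_db(a, b): return (a - b) / ell**2 * k(a, b)                   # dk/db
def ddk(a, b):   return (1/ell**2 - (a - b)**2 / ell**4) * k(a, b)   # d2k/(da db)

X  = np.array([-1.0, 0.0, 1.0]); y  = X**2             # samples of f(x) = x^2
Xd = np.array([0.0]);            yd = np.array([0.0])  # derivative obs f'(0) = 0

# joint covariance of [f(X), f'(Xd)], block by block
K = np.block([[k(X[:, None], X[None, :]),      dk_db(X[:, None], Xd[None, :])],
              [dk_db(X[None, :], Xd[:, None]), ddk(Xd[:, None], Xd[None, :])]])
alpha = np.linalg.solve(K + jitter * np.eye(4), np.concatenate([y, yd]))

def post_mean(xs):
    """Posterior mean at xs, conditioned on values and the derivative."""
    kstar = np.concatenate([k(X, xs), dk_db(xs, Xd)])
    return kstar @ alpha

# symmetric data plus f'(0) = 0 give a posterior mean symmetric about 0
assert abs(post_mean(0.5) - post_mean(-0.5)) < 1e-9
assert abs(post_mean(1.0) - 1.0) < 1e-3   # near-interpolation at a data point
```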


Saddle point search using GPs + derivative observations

- The properties of the system can be described by an energy surface.
- Finding a minimum energy path and the saddle point between two states is useful when determining properties of transitions.


Nudged elastic band (NEB)

- Starting from an initial guess, the idea is to move the images downwards on the energy surface while keeping them evenly spaced.
- The images are moved along a force vector, which is the resultant of two components:
  - the (negative) energy gradient component perpendicular to the path, and
  - a spring force parallel to the path, which tends to keep the images evenly spaced.


- The convergence of NEB may require hundreds or thousands of iterations.
- Each iteration requires evaluating the energy gradient for all images, which is often a time-consuming operation.


Speedup of NEB

- Repeat until convergence:
  1. Evaluate the energy (and forces) at the images of the current path.
  2. If the path has not converged, approximate the energy surface using machine learning based on the observations so far.
  3. Find the predicted minimum energy path on the approximate surface and go to 1.
- Details in the paper by Peterson (2016).
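The loop above can be sketched as pseudocode; `evaluate_energy_and_forces`, `is_converged`, `fit_gp` and `relax_path_on_surrogate` are placeholder names standing in for a real implementation, not functions from the cited paper's code:

```python
# Schematic GP/ML-accelerated NEB outer loop (pseudocode).
def accelerated_neb(path, evaluate_energy_and_forces, max_outer=100):
    observations = []
    for _ in range(max_outer):
        # 1. evaluate the expensive energy (and forces) at the images
        results = [evaluate_energy_and_forces(x) for x in path]
        observations.extend(zip(path, results))
        if is_converged(results):
            return path
        # 2. approximate the energy surface from all observations so far
        surrogate = fit_gp(observations)
        # 3. relax the path fully on the cheap surrogate, then repeat
        path = relax_path_on_surrogate(surrogate, path)
    return path
```

The key saving is that the expensive gradient evaluations happen only once per outer iteration, while the inner path relaxation runs entirely on the surrogate.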


Speedup of NEB with GP and derivatives

- Evaluate the energy (and forces) only at the image with the highest uncertainty.
- Re-approximate the energy surface and find a new MEP guess after each image evaluation.
- Convergence check:
  - If the magnitude of the force (accurate or approximated) is below the convergence limit for all images, we do not move the path but evaluate more images, until the convergence limit is exceeded or all images have been evaluated.
  - If we manage to evaluate all images without moving the path, we know for sure whether the path has converged.
- Details in the paper by Koistinen, Maras, Vehtari and Jónsson (2016).


- When evaluating transition rates, the Hessian at the minimum points needs to be evaluated at some phase.
- This information can be used to improve the GP approximation, especially in the beginning when there is little information.


Comparison of methods in heptamer case study


Automatic monotonicity detection

- Derivative sign information can be used to find monotonic input-output directions.
- The basic idea:
  - Add virtual derivative sign observations to the GP model.
  - See whether the additions affect the probability of the data:
    - the dimension is monotonic if they do not.
- Details in the paper by Siivola, Piironen and Vehtari (2016).


Theoretical background

Energy comparison:

$$E(\mathbf{y}, \mathbf{y}' \mid X, \tilde{X}_m) = -\log p(\mathbf{y}, \mathbf{y}' \mid X, \tilde{X}_m) = -\log\Big[\, p(\mathbf{y} \mid X)\, \underbrace{p(\mathbf{y}' \mid \mathbf{y}, X, \tilde{X}_m)}_{\approx 1}\, \Big] \approx E(\mathbf{y} \mid X).$$

If the virtual derivative sign observations are consistent with the data, the second factor is close to one and the energy is almost unchanged; for a non-monotonic dimension it is far from one and the energy increases.
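The slides use probit sign observations, which require approximate inference. As an all-Gaussian stand-in, the same energy logic can be illustrated with direct derivative observations: adding derivative observations that agree with the data barely moves the energy, while conflicting ones increase it sharply. All names and numbers below are illustrative assumptions:

```python
import numpy as np

ell, s2 = 1.0, 1e-2   # lengthscale and observation noise (illustrative)

def k(a, b):   return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
def dk(a, b):  return (a[:, None] - b[None, :]) / ell**2 * k(a, b)
def ddk(a, b): return (1/ell**2 - (a[:, None] - b[None, :])**2 / ell**4) * k(a, b)

def energy(X, y, Xd=None, yd=None):
    """E = -log marginal likelihood of function-value observations y,
    optionally joined with Gaussian derivative observations yd."""
    if Xd is None:
        K, obs = k(X, X), y
    else:
        K = np.block([[k(X, X), dk(X, Xd)], [dk(X, Xd).T, ddk(Xd, Xd)]])
        obs = np.concatenate([y, yd])
    K = K + s2 * np.eye(len(obs))
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (obs @ np.linalg.solve(K, obs) + logdet
                  + len(obs) * np.log(2 * np.pi))

X, Xd, ones = np.linspace(0, 1, 8), np.linspace(0, 1, 4), np.ones(4)
d_consistent = energy(X, X, Xd, ones) - energy(X, X)            # y = x, y' = 1
d_conflict = energy(X, np.zeros(8), Xd, ones) - energy(X, np.zeros(8))
assert d_conflict > d_consistent + 1.0   # conflicting signs cost much more
```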


Figure: Change in the energy of the data, in reality, as a function of the number of virtual derivative sign observations, for a GP with the monotonicity assumption versus a regular GP.


Using automatic monotonicity detection in modelling

- Monotonic dimensions can be detected from the data and used in modelling.
- The method improves the modelling results especially on the borders of the data.


Experiment

- Six different functions of varying monotonicity.
- Different amounts of noise added to the training samples (signal-to-noise ratio (SNR) between 0 and 1).
- Measure the log predictive posterior density of samples from a hold-out set that resembles the 20 % bordermost samples in the training data:

$$\mathrm{lppd} = \sum_{i=1}^{L} \log \int p(y_i \mid f)\, p_{\text{post}}(f \mid \mathbf{x}_i)\, df$$

- Repeat this 200 times for three different models:
  - fixed monotonicity,
  - monotonicity used only if it does not change the energy (adaptive monotonicity),
  - a model without derivative observations.
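The lppd above can be estimated by Monte Carlo when posterior draws of f at the test inputs are available; the Gaussian observation model and all names below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def lppd(y_test, f_draws, noise_sd=0.1):
    """Monte-Carlo lppd: y_test has shape (L,), f_draws has shape (S, L)
    holding S posterior draws of f at each test input x_i."""
    # log (1/S) sum_s p(y_i | f_s(x_i)), then summed over test points i
    logp = norm.logpdf(y_test[None, :], loc=f_draws, scale=noise_sd)
    return np.sum(np.logaddexp.reduce(logp, axis=0) - np.log(f_draws.shape[0]))

# toy check: draws centred on the truth score higher than biased draws
rng = np.random.default_rng(0)
y = np.zeros(5)
assert (lppd(y, rng.normal(0.0, 0.05, size=(1000, 5)))
        > lppd(y, rng.normal(1.0, 0.05, size=(1000, 5))))
```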


Results

Figure: ∆LPPD between the baseline and the named method (adaptive monotonicity, fixed monotonicity) on the y axis, SNR on the x axis; rows N = 30 and N = 90; columns f(x) = −x, f(x) = 0, f(x) = x, f(x) = 1/(1 + e^{−0.5x}), f(x) = 1/(1 + e^{−x}) and f(x) = 1/(1 + e^{−2x}).


Multidimensional experiment

Diabetes data¹:
- Target value: a measure of diabetes progression one year after baseline.
- 10 dimensions.
- Detect monotonic dimensions and use them if needed.

¹ Diabetes data, available at: http://web.stanford.edu/~hastie/Papers/LARS/diabetes.data


Results

Figure: Target value as a function of a single predictor (age, sex, body mass index (bmi), mean arterial pressure (map), triglycerides (tc), low-density lipoprotein (ldl), high-density lipoprotein (hdl), total cholesterol (tch), low-tension glaucoma (ltg), fasting blood glucose (glu)) while the others are kept at the median of the dataset. Solid black lines correspond to the regular GP mean and 90 % posterior central interval. Red dashed lines correspond to the AMD GP's mean and standard deviation when body mass index and low-tension glaucoma are detected as increasing. The black dashed line corresponds to the largest value of the covariate.


Bayesian optimization with virtual derivative sign observations

Bayesian optimization (BO):
- A global optimization strategy designed to find the minimum of expensive black-box functions:
  1. Fit a GP to the available dataset X, y.
  2. Evaluate the function at a new location chosen by some acquisition function.
  3. If the stopping criterion is not met, go to 1.
- Usually the search space is selected so that the minimum is not on the border.
- Over-exploration of the edges is a typical problem.
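The loop can be made concrete with a bare-bones GP and a lower-confidence-bound (LCB) acquisition on a grid; the toy objective, kernel and settings are illustrative assumptions (a real implementation would also optimize the hyperparameters):

```python
import numpy as np

f = lambda x: (x - 0.3) ** 2               # toy expensive black box
grid = np.linspace(0.0, 1.0, 201)          # candidate locations

def gp_posterior(X, y, xs, ell=0.2, noise=1e-6):
    """Zero-mean SE-kernel GP posterior mean and variance on xs."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(xs, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, np.maximum(var, 1e-12)

X, y = np.array([0.1, 0.9]), f(np.array([0.1, 0.9]))   # initial design
for _ in range(10):
    mean, var = gp_posterior(X, y, grid)               # 1. fit GP
    lcb = mean - 2.0 * np.sqrt(var)                    # 2. acquisition
    xn = grid[np.argmin(lcb)]
    X, y = np.append(X, xn), np.append(y, f(xn))       # evaluate, go to 1

assert abs(X[np.argmin(y)] - 0.3) < 0.1   # best point near the true minimum
```

Since posterior uncertainty stays high wherever data is sparse, acquisitions like LCB are drawn toward the edges of the search space, which is how the over-exploration problem arises.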


Figure: Over-exploration of the edges, visualized with LCB as the acquisition function. Circles are initial samples and crosses are acquisitions.


Fixing over-exploration with derivative sign observations

- By adding virtual derivative sign observations on the borders, the over-exploration problem can be solved:
  1. Fit a GP to the available dataset X, y.
  2. Find a new location based on some acquisition function.
  3. If the new location is on the border: add a derivative sign observation at the border.
  4. Else: add the new location to the dataset.
  5. If the stopping criterion is not met, go to 1.
- Details in the paper by Siivola, Vehtari, Vanhatalo and González (2017).
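The modified loop differs from plain BO only in what gets added to the model when an acquisition lands on the border. A pseudocode sketch; `initial_design`, `fit_gp_with_sign_obs`, `minimize_acquisition`, `on_border` and `inward_sign_observation` are placeholder names, not functions from the paper's code:

```python
# Schematic BO loop with virtual derivative sign observations (pseudocode).
def bo_with_virtual_signs(f, bounds, n_iter):
    data, sign_obs = initial_design(f, bounds), []
    for _ in range(n_iter):
        # 1. fit GP to values and any accumulated sign observations
        gp = fit_gp_with_sign_obs(data, sign_obs)
        # 2. propose a new location from the acquisition function
        x_new = minimize_acquisition(gp, bounds)
        if on_border(x_new, bounds):
            # 3. do NOT evaluate f: add a virtual derivative sign
            #    observation pointing into the search space instead
            sign_obs.append(inward_sign_observation(x_new, bounds))
        else:
            # 4. otherwise evaluate the expensive function as usual
            data.append((x_new, f(x_new)))
    return min(data, key=lambda d: d[1])
```

The virtual sign observations encode the prior belief that the function increases toward the border, so the acquisition function stops proposing border points without any extra expensive evaluations.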


Figure: GP prior and acquisition functions for a one-dimensional space. a) and c): without virtual derivative sign observations; b) and d): with derivative sign observations.


Experiments

Metrics for comparing the performance of two BO algorithms:
- Percentual minimum difference (PMD): designed to compare the absolute performance of the algorithms; intuitively, it measures the difference between the best values found by the two algorithms.
- Percentual hit difference (PHD): created for comparing the speeds of the algorithms; intuitively, it measures the difference in how fast the two algorithms find good enough values.
- Percentual border hit difference (PBHD): assuming that the minimum is not near the border, PBHD gives the scaled difference in unnecessary samples taken near the borders.
- Average evaluation distance difference (AED): intuitively, AED measures the overall performance of the algorithm before finding the minimum.
- Virtual derivative observations per dimension (VDO): intuitively, larger VDO is worse, since the virtual observations increase the computational burden of the algorithm, as GPs scale as O((n + q)³).

For PMD, PHD and PBHD, negative values mean that the proposed method is better; the values are always scaled between −1 and 1, and the further the value is from 0, the bigger the difference between the two methods.

Page 34: Advances in using GPs with derivative observationssiivole1/pdf/gpa17_siivola.pdf · Advances in using GPs with derivative observations Gaussian Process approximations 2017 –workshop

Advances in using GPs with derivative observationsMay 30, 2017

34/26

Experiment 1

Page 35: Advances in using GPs with derivative observationssiivole1/pdf/gpa17_siivola.pdf · Advances in using GPs with derivative observations Gaussian Process approximations 2017 –workshop

Advances in using GPs with derivative observationsMay 30, 2017

35/26

Experiment 2

- 100 d-dimensional multivariate normal distribution functions for each d = 1, ..., 11.
- Different amounts of noise added to the functions.
- BO and BO with derivatives run for 100 acquisitions.

Page 36: Advances in using GPs with derivative observationssiivole1/pdf/gpa17_siivola.pdf · Advances in using GPs with derivative observations Gaussian Process approximations 2017 –workshop

Advances in using GPs with derivative observationsMay 30, 2017

36/26

Results


Experiment 3

- Same as experiment 2, but for the sigopt function dataset².
- 113 functions, from 1 to 11 dimensions.

² Dataset available at: https://github.com/sigopt/sigopt-examples


Results


Summary

Derivatives can be used with GPs in many new ways:
- to improve the accuracy of GPs in the simulation of energy surfaces,
- to automatically find monotonic dimensions from data,
- to fix the border over-exploration problem of BO.


Questions?

Email: [email protected]


References

- Papoulis, A. (1991). Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York. Third edition.
- Riihimäki, J. and Vehtari, A. (2010). "Gaussian processes with monotonicity information." In Proceedings of AISTATS 2010, vol. 9, pp. 645-652.
- Peterson (2016). "Acceleration of saddle-point searches with machine learning." J. Chem. Phys., 145, 074106.
- Koistinen, O.-P., Maras, E., Vehtari, A. and Jónsson, H. (2016). "Minimum energy path calculations with Gaussian process regression." Nanosystems: Physics, Chemistry, Mathematics, 7 (6), pp. 925-935.
- Siivola, E., Piironen, J., and Vehtari, A. (2016). "Automatic monotonicity detection for Gaussian processes." arXiv: https://arxiv.org/abs/1610.05440
- Siivola, E., Vehtari, A., Vanhatalo, J., and González, J. (2017). "Bayesian optimization with virtual derivative sign observations." arXiv: https://arxiv.org/abs/1704.00963