
Gaussian Processes for Machine Learning

Chris Williams

Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, UK

August 2007


Overview

1. What is machine learning?

2. Gaussian Processes for Machine Learning

3. Multi-task Learning


1. What is Machine Learning?

The goal of machine learning is to build computer systems that can adapt and learn from their experience. (Dietterich, 1999)

Machine learning usually refers to changes in systems that perform tasks associated with artificial intelligence (AI). Such tasks involve recognition, diagnosis, planning, robot control, prediction, etc. (Nilsson, 1996)

Some reasons for adaptation:

Some tasks can be hard to define except via examples

Adaptation can improve a human-built system, or track changes over time

Goals can be autonomous machine performance, or enabling humans to learn from data (data mining)


Roots of Machine Learning

Statistical pattern recognition, adaptive control theory (EE)

Artificial Intelligence: e.g. discovering rules using decision trees, inductive logic programming

Brain models, e.g. neural networks

Psychological models

Statistics


Problems Addressed by Machine Learning

Supervised Learning

model p(y|x): regression, classification, etc.

Unsupervised Learning

model p(x): not just clustering!

Reinforcement Learning

Markov decision processes, POMDPs, planning.


[Figure: six image panels showing Mask, Mask * Foreground, and Background layers (Williams and Titsias, 2004)]


Machine Learning and Statistics

Statistics and Machine Learning overlap in probabilistic (graphical) models

Same models, but different problems?

Not all machine learning methods are based on probabilistic models, e.g. SVMs, non-negative matrix factorization


Some Differences

Statistics: focus on understanding data in terms of models

Statistics: interpretability, hypothesis testing
Machine Learning: greater focus on prediction

Machine Learning: focus on the analysis of learning algorithms (not just large dataset issues)


Slide from Rob Tibshirani (early 1990s)

NEURAL NETS                     STATISTICS
network                         model
weights                         parameters
learning                        fitting
generalization                  test set performance
supervised learning             regression/classification
unsupervised learning           density estimation
optimal brain damage            model selection
large grant = $100,000          large grant = $10,000
nice place to have a meeting:   nice place to have a meeting:
Snowbird, Utah, French Alps     Las Vegas in August


2. Gaussian Processes for Machine Learning

Gaussian processes

History

Regression, classification and beyond

Covariance functions/kernels

Dealing with hyperparameters

Theory

Approximations for large datasets


Gaussian Processes

A Gaussian process is a stochastic process specified by its mean and covariance functions

Mean function
µ(x) = E[f(x)]

often we take µ(x) ≡ 0 ∀x

Covariance function

k(x, x') = E[(f(x) − µ(x))(f(x') − µ(x'))]


A Gaussian process prior over functions can be thought of as a Gaussian prior on the coefficients w ∼ N(0, Λ) where

Y(x) = Σ_{i=1}^{N_F} w_i φ_i(x)

In many interesting cases, N_F = ∞
Can choose the φ's as eigenfunctions of the kernel k(x, x') wrt p(x) (Mercer):
∫ k(x, y) p(x) φ_i(x) dx = λ_i φ_i(y)

(For stationary covariance functions and Lebesgue measure we get instead ∫ k(x − x') e^{−2πi s·x'} dx' = S(s) e^{−2πi s·x}, where S(s) is the power spectrum)


Example covariance functions:
k(x, x') = σ_0² + σ_1² x x'
k(x, x') = exp(−|x − x'|)
k(x, x') = exp(−(x − x')²)
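As an illustration (not from the slides), here is a minimal sketch in NumPy of drawing sample functions from a zero-mean GP prior with the squared exponential covariance above; the grid size, number of draws and jitter term are arbitrary choices.

import numpy as np

def k_se(x1, x2):
    # squared exponential covariance k(x, x') = exp(-(x - x')^2)
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2)

x = np.linspace(0, 1, 100)
K = k_se(x, x) + 1e-8 * np.eye(len(x))     # small jitter for numerical stability
L = np.linalg.cholesky(K)
samples = L @ np.random.randn(len(x), 3)   # three draws from N(0, K)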


Prediction with Gaussian Processes

A non-parametric prior over functions

Although GPs can be infinite-dimensional objects, prediction from a finite dataset is O(n³)

[Figure: two panels of sample functions f(x) against input x on [0, 1]]


Gaussian Process Regression

Dataset D = {(x_i, y_i)}_{i=1}^n, Gaussian likelihood p(y_i | f_i) = N(y_i; f_i, σ²)

f(x) = Σ_{i=1}^n α_i k(x, x_i),  where  α = (K + σ²I)⁻¹ y

var(f(x)) = k(x, x) − k(x)ᵀ (K + σ²I)⁻¹ k(x)

in time O(n³), with k(x) = (k(x, x_1), ..., k(x, x_n))ᵀ
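A sketch of these predictive equations in NumPy (illustrative only; the kernel k is any covariance function returning a Gram matrix, and the variable names are mine, not from the slides):

import numpy as np

def gp_predict(X, y, Xstar, k, sigma2):
    # mean and variance of the GP posterior at the test inputs Xstar
    Ky = k(X, X) + sigma2 * np.eye(len(X))    # K + sigma^2 I
    Ks = k(X, Xstar)                          # k(x, x_i) for each test point
    alpha = np.linalg.solve(Ky, y)            # the O(n^3) step
    mean = Ks.T @ alpha                       # sum_i alpha_i k(x, x_i)
    v = np.linalg.solve(Ky, Ks)
    var = np.diag(k(Xstar, Xstar)) - np.sum(Ks * v, axis=0)
    return mean, var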


Some GP History

1940s: Wiener, Kolmogorov (time series)

Geostatistics (Matheron, 1973), Whittle (1963)

O’Hagan (1978); Sacks et al (Design and Analysis of Computer Experiments, 1989)

Williams and Rasmussen (1996), inspired by Neal's (1996) construction of GPs from neural networks with an infinite number of hidden units

Regularization framework (Tikhonov and Arsenin, 1977; Poggio and Girosi, 1990); MAP rather than fully probabilistic

SVMs (Vapnik, 1995): non-probabilistic, use “kernel trick” and quadratic programming


Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Chris Williams, MIT Press, 2006

New: available online


Regression, classification and beyond

Regression with Gaussian noise: e.g. robot arm inverse dynamics (21-d input space)

Classification: binary, multiclass, e.g. handwritten digit classification

ML community tends to use approximations to deal with non-Gaussian likelihoods, cf MCMC in statistics?

MAP solution, Laplace approximation

Expectation Propagation (Minka, 2001; see also Opper and Winther, 2000)

Other likelihoods (e.g. Poisson), observations of derivatives, uncertain inputs, mixtures of GPs


Covariance functions

Covariance function is key entity, determining notion of similarity

Squared exponential (“Gaussian”) covariance function is widely applied in ML; the Matérn kernel is not very widely used

Polynomial kernel k(x, x') = (1 + x · x')^p is popular in the kernel machines literature

Neural network covariance function (Williams, 1998)

k_NN(x, x') = σ_f² sin⁻¹( 2 x̃ᵀ M x̃' / √((1 + 2 x̃ᵀ M x̃)(1 + 2 x̃'ᵀ M x̃')) )

where x̃ = (1, x_1, ..., x_D)ᵀ
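A sketch of this neural network covariance function (NumPy; it assumes M is a positive semi-definite (D+1) × (D+1) matrix acting on the augmented input x̃ above):

import numpy as np

def k_nn(x, xp, M, sigma_f):
    # augment the inputs: x~ = (1, x_1, ..., x_D)
    xa = np.concatenate(([1.0], x))
    xpa = np.concatenate(([1.0], xp))
    num = 2 * xa @ M @ xpa
    den = np.sqrt((1 + 2 * xa @ M @ xa) * (1 + 2 * xpa @ M @ xpa))
    return sigma_f ** 2 * np.arcsin(num / den)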


String kernels: let φ_s(x) denote the number of times a substring s appears in string x

k(x, x') = Σ_s w_s φ_s(x) φ_s(x')   (Watkins, 1999; Haussler, 1999)

Efficient methods using suffix trees compute certain string kernels in time |x| + |x'| (Leslie et al, 2003; Vishwanathan and Smola, 2003)

Extended to tree kernels (Collins and Duffy, 2002)

Fisher kernel

φ_θ(x) = ∇_θ log p(x|θ)
k(x, x') = φ_θ(x)ᵀ F⁻¹ φ_θ(x')
where F is the Fisher information matrix (Jaakkola et al, 2000)
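Returning to the string kernel above, a naive sketch in pure Python (it enumerates substrings directly, with uniform weights w_s = 1 and a cap on substring length, rather than the suffix-tree methods cited):

from collections import Counter

def substring_counts(x, max_len=3):
    # phi_s(x): number of times each substring s (up to length max_len) appears in x
    return Counter(x[i:i + l] for l in range(1, max_len + 1)
                   for i in range(len(x) - l + 1))

def string_kernel(x, xp, max_len=3):
    # k(x, x') = sum_s w_s phi_s(x) phi_s(x'), here with w_s = 1
    cx, cxp = substring_counts(x, max_len), substring_counts(xp, max_len)
    return sum(cx[s] * cxp[s] for s in cx.keys() & cxp.keys())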


Automatic Relevance Determination

k_SE(x, x') = σ_f² exp(−½ (x − x')ᵀ M (x − x'))

Isotropic: M = ℓ⁻² I
ARD: M = diag(ℓ_1⁻², ℓ_2⁻², ..., ℓ_D⁻²)
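A sketch of the SE-ARD covariance above for vector inputs (NumPy; X and Xp are arrays of shape (n, D) and (m, D), and the lengthscales are per-dimension):

import numpy as np

def k_se_ard(X, Xp, sigma_f, lengthscales):
    # k(x, x') = sigma_f^2 exp(-0.5 (x - x')^T M (x - x')), M = diag(lengthscales^-2)
    d = (X[:, None, :] - Xp[None, :, :]) / lengthscales
    return sigma_f ** 2 * np.exp(-0.5 * np.sum(d ** 2, axis=-1))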

[Figure: sample functions, output y over inputs x1 and x2, for an isotropic SE kernel and an ARD SE kernel]


Dealing with hyperparameters

Criteria for model selection

Marginal likelihood p(y|X, θ)

Estimate the generalization error: LOO-CV, Σ_{i=1}^n log p(y_i | y_{−i}, X, θ)

Bound the generalization error (e.g. PAC-Bayes)

Typically do ML-II rather than sampling of p(θ|X, y)

Optimize by gradient descent (etc) on objective function

SVMs do not generally have good methods for kernel selection


How the marginal likelihood works

[Figure: log probability vs characteristic lengthscale, showing the data fit, minus complexity penalty, and marginal likelihood]

log p(y|X, θ) = −½ yᵀ K_y⁻¹ y − ½ log|K_y| − (n/2) log 2π
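A sketch of evaluating this expression stably via a Cholesky factorization (NumPy; K_y = K + σ²I is assumed to be assembled by the caller):

import numpy as np

def log_marginal_likelihood(Ky, y):
    # log p(y|X, theta) = -0.5 y^T Ky^{-1} y - 0.5 log|Ky| - (n/2) log(2 pi)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))        # 0.5 log|Ky| from the Cholesky factor
            - 0.5 * len(y) * np.log(2 * np.pi))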


Marginal Likelihood and Local Optima

[Figure: two GP fits, output y vs input x on [−5, 5], corresponding to different local optima of the marginal likelihood]

There can be multiple optima of the marginal likelihood
These correspond to different interpretations of the data


The Baby and the Bathwater

MacKay (2003 ch 45): In moving from neural networks to kernel machines did we throw out the baby with the bathwater? i.e. the ability to learn hidden features/representations

But consider M = ΛΛᵀ for Λ being D × k, for k < D

The k columns of Λ can identify directions in the input space with specially high relevance (Vivarelli and Williams, 1999)


Theory

Equivalent kernel (Silverman, 1984)

Consistency (Diaconis and Freedman, 1986; Choudhuri, Ghoshal and Roy, 2005; Choi and Schervish, 2004)

Average case learning curves

PAC-Bayesian analysis for GPs (Seeger, 2003)

P_D{ R_L(f_D) ≤ R̂_L(f_D) + gap(f_D, D, δ) } ≥ 1 − δ

where R_L(f_D) is the expected risk, and R̂_L(f_D) is the empirical (training) risk


Approximation Methods for Large Datasets

Fast approximate solution of the linear system

Subset of Data

Subset of Regressors

Inducing Variables

Projected Process Approximation

FITC, PITC, BCM

SPGP

Empirical Comparison


Some interesting recent uses for Gaussian Processes

Modelling transcriptional regulation using Gaussian Processes. Neil D. Lawrence, Guido Sanguinetti, Magnus Rattray (NIPS 2006)

A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo. Oliver Williams (NIPS 2006)

Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods. Yaakov Engel, Peter Szabo, Dmitry Volkinshtein (NIPS 2005)

Worst-Case Bounds for Gaussian Process Models. Sham Kakade, Matthias Seeger, Dean Foster (NIPS 2005)

Infinite Mixtures of Gaussian Process Experts. Carl Rasmussen, Zoubin Ghahramani (NIPS 2002)


3. Multi-task Learning

There are multiple (possibly) related tasks, and we wish to avoid tabula rasa learning by sharing information across tasks

E.g. task clustering, inter-task correlations

Two cases:

With task-descriptor features t

Without task-descriptor features, based solely on task identities

Joint work with Edwin Bonilla & Felix Agakov (AISTATS 2007) and Kian Ming Chai


Multi-task Learning using Task-specific Features

M tasks, learn mapping g_i(x), i = 1, ..., M

t_i is the task descriptor (task-specific feature vector) for task i

g_i(x) = g(t_i, x): potential for transfer across tasks

Our motivation is compiler performance prediction, where there are multiple benchmark programs (= tasks), and x describes sequences of code transformations

Another example: predicting school pupil performance based on pupil and school features

We particularly care about the case when we have very little data from the test task; here inter-task transfer will be most important


Overview

Model setup

Related work

Experimental setup, feature representation

Results

Discussion


Task-descriptor Model

z = (x, t)

k(z, z') = k_x(x, x') k_t(t, t')

Decomposition into task similarity (k_t) and input similarity (k_x)

For the widely-used “Gaussian” kernel, this occurs naturally

Independent tasks if k_t(t_i, t_j) = δ_ij

C.f. co-kriging in geostatistics (e.g. Wackernagel, 1998)
Without task-descriptors, simply parameterize K_t
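A sketch of this product construction (NumPy; kx and kt are assumed to be callables returning Gram matrices, and the Kronecker form applies when every task shares the same inputs, as used later for the predictive mean):

import numpy as np

def multitask_gram(X, T, kx, kt):
    # z_i = (x_i, t_i); K[i, j] = kx(x_i, x_j) * kt(t_i, t_j)
    return kx(X, X) * kt(T, T)          # elementwise (Hadamard) product

def block_gram(Kt, Kx):
    # when all tasks share the same inputs, the joint Gram matrix is Kt (x) Kx
    return np.kron(Kt, Kx)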


Related Work

Work using task-specific features

Bakker and Heskes (2003) use neural networks. These can be tricky to train (local optima, number of hidden units etc)

Yu et al (NIPS 2006, Stochastic Relational Models for Discriminative Link Prediction)


General work on Multi-task Learning

What should be transferred?

Early work: Thrun (1996), Caruana (1997)

Minka and Picard (1999): multiple tasks share the same GP hyperparameters (but are uncorrelated)

Evgeniou et al (2005): induce correlations between tasks based on a correlated prior over linear regression parameters (special case of co-kriging)

Multilevel (or hierarchical) modelling in statistics (e.g. Goldstein, 2003)


Compiler Performance Prediction

Goal: Predict speedup of a new program under a given sequence of compiler transformations

Only have a limited number of runs of the new program, but also have data from other (related?) tasks

Speedup s measured as

s(x) = time(baseline) / time(x)


Example Transformation

Loop unrolling

// original loop
for (i = 0; i < 100; i++)
    a[i] = b[i] + c[i];

// loop unrolled twice
for (i = 0; i < 100; i += 2) {
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
}


Experimental Setup

Benchmarks: 11 C programs from UTDSP

Transformations: Source-to-source using SUIF

Platform: TI C6713 board

13 transformations in sequences up to length 5, using each transformation at most once ⇒ 88214 sequences per benchmark (exhaustively enumerated)

Significant speedups can be obtained (max is 1.84)


Input Features x

Code features (C), or transformation-based representation (T)

Code features: extract features from the transformed program based on knowledge of compiler experts (code size, instructions executed, parallelism)

83 features reduced to 15-d with PCA

Transformation-based representation: length-13 bit vector stating what transformations were used (“bag of characters”)


Task-specific features t

Record the speedup on a small number of canonical sequences: response-based approach

Canonical sequences selected by principal variables method(McCabe, 1984)

A variety of possible criteria can be used, e.g. maximize |Σ_{S(1)}|, minimize tr(Σ_{S(2)|S(1)}). Use greedy selection

We don't use all 88214 sequences to define the canonical sequences, only 2048. In our experiments we use 8 canonical variables

Could consider e.g. code features from untransformed programs, but experimentally the response-based method is superior


Experiments

LOO-CV setup (leave out one task at a time)

Therefore 10 reference tasks for each prediction task; we used n_r = 256 examples per benchmark

Use n_te examples from the test task (n_te ≥ 8)

Assess performance using mean absolute error (MAE) on all remaining test sequences

Comparison to baseline “no transfer” method using just data from the test task

Used GP regression prediction with squared exponential kernel

ARD was used, except for the “no transfer” case when n_te ≤ 64


[Figure: MAE per benchmark (ADP, COM, EDG, FFT, FIR, HIS, IIR, LAT, LMS, LPC, SPE, AVG) for [T] COMBINED, [T] GATING, [C] COMBINED, [C] GATING, MEDIAN CANONICALS and MEDIAN TEST (0.540 ± 0.007)]


Results

T-combined is best overall (av MAE is 0.0576, compared to 0.1162 for median canonicals)

T-combined generally either improves performance or leaves it about the same compared to T-no-transfer-canonicals


[Figure: MAE vs number of test samples included for training (8–128) for the HISTOGRAM and EDGE_DETECT benchmarks, comparing [T] COMBINED, [T] NO TRANSFER RANDOM, [T] NO TRANSFER CANONICALS and MEDIAN CANONICALS]


[Figure: MAE vs number of test samples included for training (8–128) for the ADPCM benchmark and averaged over all benchmarks, comparing [T] COMBINED, [T] NO TRANSFER RANDOM, [T] NO TRANSFER CANONICALS and MEDIAN CANONICALS]

T-combined generally improves performance or leaves it about the same compared to the best “no transfer” scenario


Understanding Task Relatedness

GP predictive mean is

s̄(z*) = kᵀ(z*) (K_f ⊗ K_x + σ²I)⁻¹ s

Can look at K_f, but difficult to interpret?

Predictive mean s̄(z*) = hᵀ(z*) s, where

hᵀ(z) = (h^1_1, ..., h^1_{n_r}, ..., h^M_1, ..., h^M_{n_r}, h^{M+1}_1, ..., h^{M+1}_{n_te})

Measure the contribution of task i on test point z* by computing

r_i(z*) = |h^i(z*)| / |h(z*)|
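A sketch of this contribution measure (NumPy; h is the weight vector hᵀ(z*), and slices gives each task's index range, an illustrative representation):

import numpy as np

def task_contributions(h, slices):
    # r_i(z*) = |h^i(z*)| / |h(z*)| for each task's slice of the weight vector
    total = np.linalg.norm(h)
    return [np.linalg.norm(h[s]) / total for s in slices]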


Average r's over test examples

[Figure: matrix of average task contributions r between the benchmarks adp, com, edg, fft, fir, his, iir, lat, lms, lpc, spe]


Discussion

Our focus is on the hard problem of prediction on a new task given very little data for that task

The presented method allows sharing over tasks. This should be beneficial, but note that the “no transfer” method has the freedom to use different hyperparams on each task

Can learn similarity between tasks directly (unparameterized K_t), but this is not so easy if n_te is very small

Note that there is no inter-task transfer in the noiseless case! (autokrigeability)


General Conclusions

Key issues:

Designing/discovering covariance functions suitable for various types of data

Methods for setting/inference of hyperparameters

Dealing with large datasets


Gaussian Process Regression

Dataset D = {(x_i, y_i)}_{i=1}^n, Gaussian likelihood p(y_i | f_i) = N(y_i; f_i, σ²)

f(x) = Σ_{i=1}^n α_i k(x, x_i),  where  α = (K + σ²I)⁻¹ y

var(f(x)) = k(x, x) − k(x)ᵀ (K + σ²I)⁻¹ k(x)

in time O(n³), with k(x) = (k(x, x_1), ..., k(x, x_n))ᵀ


Fast approximate solution of linear systems

Iterative solution of (K + σ_n²I)v = y, e.g. using Conjugate Gradients, minimizing

½ vᵀ(K + σ_n²I)v − yᵀv

This takes O(kn²) for k iterations.

Fast approximate matrix-vector multiplication Σ_{i=1}^n k(x_j, x_i) v_i

k-d tree / dual tree methods (Gray, 2004; Shen, Ng and Seeger, 2006; De Freitas et al, 2006)

Improved Fast Gauss transform (Yang et al, 2005)
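A sketch of the iterative approach using SciPy's conjugate gradient solver (assuming SciPy is available; the exact kernel matrix-vector product is used here, which is the step that the fast multiplication methods above would replace):

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def cg_solve(K, y, sigma2_n):
    # iteratively solve (K + sigma_n^2 I) v = y; each CG iteration costs one
    # matrix-vector product with the kernel matrix
    n = len(y)
    A = LinearOperator((n, n), matvec=lambda v: K @ v + sigma2_n * v)
    v, info = cg(A, y)
    return v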


Subset of Data

Simply keep m datapoints, discard the rest: O(m³)

Can choose the subset randomly, or by a greedy selection criterion

If we are prepared to do work for each test point, can select training inputs nearby to the test point. Stein (Ann. Stat., 2002) shows that a screening effect operates for some covariance functions


[Figure: n × n kernel matrix partitioned to show the m × m block K_uu and the block K_uf]

K̃ = K_fu K_uu⁻¹ K_uf

Nyström approximation to K
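A sketch of the Nyström construction (NumPy; the inducing subset is chosen at random here, which is one common choice):

import numpy as np

def nystrom(K, m, seed=0):
    # rank-m approximation K~ = K_fu K_uu^{-1} K_uf from a random subset u
    rng = np.random.default_rng(seed)
    u = rng.choice(K.shape[0], size=m, replace=False)
    K_uu = K[np.ix_(u, u)]
    K_fu = K[:, u]
    return K_fu @ np.linalg.solve(K_uu, K_fu.T)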


Subset of Regressors

Silverman (1985) showed that the mean GP predictor can be obtained from the finite-dimensional model

f(x*) = Σ_{i=1}^n α_i k(x*, x_i), with a prior α ∼ N(0, K⁻¹)

A simple approximation to this model is to consider only a subset of regressors

f_SR(x*) = Σ_{i=1}^m α_i k(x*, x_i), with α_u ∼ N(0, K_uu⁻¹)


f̄_SR(x*) = k_u(x*)ᵀ (K_uf K_fu + σ_n² K_uu)⁻¹ K_uf y

V[f_SR(x*)] = σ_n² k_u(x*)ᵀ (K_uf K_fu + σ_n² K_uu)⁻¹ k_u(x*)

SoR corresponds to using a degenerate  GP prior (finite rank)
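A sketch of the SR predictor above (NumPy; K_uf is the m × n kernel block between the subset and the training inputs, and k_u_star is the length-m vector k_u(x*)):

import numpy as np

def sr_predict(K_uf, K_uu, k_u_star, y, sigma2_n):
    # subset-of-regressors mean and variance at a single test point
    A = K_uf @ K_uf.T + sigma2_n * K_uu       # K_uf K_fu + sigma_n^2 K_uu
    mean = k_u_star @ np.linalg.solve(A, K_uf @ y)
    var = sigma2_n * k_u_star @ np.linalg.solve(A, k_u_star)
    return mean, var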


Inducing Variables

Quinonero-Candela and Rasmussen (JMLR, 2005)

p(f*|y) = (1/p(y)) ∫ p(y|f) p(f, f*) df

Now introduce inducing variables u

p(f, f*) = ∫ p(f, f*, u) du = ∫ p(f, f*|u) p(u) du

Approximation

p(f, f*) ≈ q(f, f*) ≜ ∫ q(f|u) q(f*|u) p(u) du

q(f|u) – training conditional
q(f*|u) – test conditional


[Figure: graphical model in which the inducing variables u connect the training latents f and the test latent f*]

Inducing variables can be:
(sub)set of training points
(sub)set of test points
new x points


Projected Process Approximation (PP)

(Csató & Opper, 2002; Seeger et al, 2003; aka PLV, DTC)

Inducing variables are subset of training points

q(y|u) = N(y | K_fu K_uu⁻¹ u, σ_n² I)

K_fu K_uu⁻¹ u is the mean prediction for f given u

Predictive mean for PP is the same as SR, but the variance is never smaller. SR is like PP but with deterministic q(f*|u)

[Figure: graphical model in which u separates the training latents f_1, f_2, ..., f_n from the test latent f*]


FITC, PITC and BCM

See Quinonero-Candela and Rasmussen (2005) for overview

Under PP, q(f|u) = N(f | K_fu K_uu⁻¹ u, 0)

Instead FITC (Snelson and Ghahramani, 2005) uses individual predictive variances diag[K_ff − K_fu K_uu⁻¹ K_uf], i.e. fully independent training conditionals

PP can make poor predictions in low noise [S Q-C M R W]

PITC uses blocks of training points to improve the approximation

BCM (Tresp, 2000) is the same approximation as PITC, except that the test points are the inducing set


Sparse GPs using Pseudo-inputs


(Snelson and Ghahramani, 2006)

FITC approximation, but the inducing inputs are new points, in neither the training nor the test sets

Locations of the inducing inputs are changed along with the hyperparameters so as to maximize the approximate marginal likelihood


Complexity


Method      Storage    Initialization    Mean      Variance
SD          O(m²)      O(m³)             O(m)      O(m²)
SR          O(mn)      O(m²n)            O(m)      O(m²)
PP, FITC    O(mn)      O(m²n)            O(m)      O(m²)
BCM         O(mn)                        O(mn)     O(mn)


Empirical Comparison


Robot arm problem, 44,484 training cases in 21-d, 4,449 test cases

For SD method subset of size m was chosen at random, hyperparameters set by optimizing marginal likelihood (ARD). Repeated 10 times

For SR, PP and BCM methods same subsets/hyperparameters were used (BCM: hyperparameters only)


Method   m      SMSE              MSLL               mean runtime (s)
SD       256    0.0813 ± 0.0198   −1.4291 ± 0.0558   0.8
SD       512    0.0532 ± 0.0046   −1.5834 ± 0.0319   2.1
SD       1024   0.0398 ± 0.0036   −1.7149 ± 0.0293   6.5
SD       2048   0.0290 ± 0.0013   −1.8611 ± 0.0204   25.0
SD       4096   0.0200 ± 0.0008   −2.0241 ± 0.0151   100.7
SR       256    0.0351 ± 0.0036   −1.6088 ± 0.0984   11.0
SR       512    0.0259 ± 0.0014   −1.8185 ± 0.0357   27.0
SR       1024   0.0193 ± 0.0008   −1.9728 ± 0.0207   79.5
SR       2048   0.0150 ± 0.0005   −2.1126 ± 0.0185   284.8
SR       4096   0.0110 ± 0.0004   −2.2474 ± 0.0204   927.6
PP       256    0.0351 ± 0.0036   −1.6940 ± 0.0528   17.3
PP       512    0.0259 ± 0.0014   −1.8423 ± 0.0286   41.4
PP       1024   0.0193 ± 0.0008   −1.9823 ± 0.0233   95.1
PP       2048   0.0150 ± 0.0005   −2.1125 ± 0.0202   354.2
PP       4096   0.0110 ± 0.0004   −2.2399 ± 0.0160   964.5
BCM      256    0.0314 ± 0.0046   −1.7066 ± 0.0550   506.4
BCM      512    0.0281 ± 0.0055   −1.7807 ± 0.0820   660.5
BCM      1024   0.0180 ± 0.0010   −2.0081 ± 0.0321   1043.2
BCM      2048   0.0136 ± 0.0007   −2.1364 ± 0.0266   1920.7


[Figure: SMSE vs time (s, log scale) for SD, SR and PP, and BCM]


[Figure: MSLL vs time (s, log scale) for SD, SR, PP and BCM]


Judged on time, for this dataset SD, SR and PP are on the same trajectory, with BCM being worse

But what about greedy vs random subset selection, methods to set hyperparameters, different datasets?

In general, we must take into account training (initialization), testing and hyperparameter learning times separately [S Q-C M R W]. Balance will depend on your situation.
