
Advanced Introduction to

Machine Learning

CMU-10715

Gaussian Processes

Barnabás Póczos

Introduction

3

http://www.gaussianprocess.org/

Some of these slides in the intro are taken from D. Lizotte, R. Parr, and C. Guestrin.

4

Introduction

Regression

Properties of Multivariate Gaussian distributions

Ridge Regression

Gaussian Processes

• Weight space view: Bayesian Ridge Regression + Kernel trick

• Function space view: Prior distribution over functions + calculation of posterior distributions

Contents

Regression

Why GPs for Regression?

Motivation 1:

All of the above regression methods give only point estimates. We would like a method that can also provide a measure of confidence in its estimates.

Motivation 2:

Let us kernelize linear ridge regression, and see what we get…

Regression methods:

Linear regression, ridge regression, support vector regression, kNN regression, etc…

7

Why GPs for Regression?

GPs can answer the following questions:

• Here’s where the function will most likely be (the expected function).

• Here are some examples of what it might look like (samples from the posterior distribution).

• Here is a prediction of what you’ll see if you evaluate your function at x’, with a confidence estimate.

8

Properties of Multivariate Gaussian

Distributions

9

1D Gaussian Distribution Parameters

• Mean, μ

• Variance, σ²
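For reference, these two parameters define the familiar 1D Gaussian density

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$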

10

Multivariate Gaussian

11

A 2-dimensional Gaussian is defined by

• a mean vector $\mu = [\mu_1, \mu_2]$

• a covariance matrix

$$\Sigma = \begin{pmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} \\ \sigma^2_{2,1} & \sigma^2_{2,2} \end{pmatrix}, \qquad \text{where } \sigma^2_{i,j} = E\big[(x_i - \mu_i)(x_j - \mu_j)\big] \text{ is the (co)variance}$$

Note: $\Sigma$ is symmetric and positive semi-definite: $\forall x:\; x^\top \Sigma x \ge 0$

Multivariate Gaussian

12

Multivariate Gaussian examples

$\mu = (0, 0), \qquad \Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$

13

Multivariate Gaussian examples

$\mu = (0, 0), \qquad \Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$

14

Marginal distributions of Gaussians are Gaussian

Given:

$$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}\right)$$

Marginal Distribution:

$$x_a \sim \mathcal{N}(\mu_a, \Sigma_{aa})$$

Useful Properties of Gaussians

15

Marginal distributions of Gaussians are Gaussian

16

Block Matrix Inversion

Theorem

Definition: Schur complements
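For reference, the standard statement: for a partitioned matrix $M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}$ with $D$ invertible, the Schur complement of $D$ in $M$ is $M/D = A - B D^{-1} C$, and

$$M^{-1} = \begin{pmatrix} (M/D)^{-1} & -(M/D)^{-1} B D^{-1} \\ -D^{-1} C (M/D)^{-1} & D^{-1} + D^{-1} C (M/D)^{-1} B D^{-1} \end{pmatrix}.$$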

17

Conditional distributions of Gaussians are Gaussian

Notation: $x = (x_a, x_b)$ is partitioned as above, with mean $(\mu_a, \mu_b)$ and covariance blocks $\Sigma_{aa}, \Sigma_{ab}, \Sigma_{ba}, \Sigma_{bb}$

Conditional Distribution:

$$x_a \mid x_b \sim \mathcal{N}\big(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\;\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\big)$$

Useful Properties of Gaussians

18
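As a quick numerical illustration (a minimal sketch, not taken from the slides; the block values below are made up for the example), the conditional mean and covariance can be computed directly with NumPy:

import numpy as np

# Partitioned joint Gaussian over (x_a, x_b)
mu_a, mu_b = np.array([0.0]), np.array([1.0])
S_aa = np.array([[2.0]])
S_ab = np.array([[0.8]])
S_ba = S_ab.T
S_bb = np.array([[1.5]])

x_b = np.array([2.0])  # observed value of x_b

# Conditional distribution x_a | x_b is Gaussian with these parameters
cond_mean = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
cond_cov = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)

print(cond_mean, cond_cov)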

Higher Dimensions

Visualizing > 3 dimensions is… difficult

Means and marginals are practical, but then we don’t see the correlations between those variables

Marginals are Gaussian, e.g., f(6) ~ N(µ(6), σ²(6))

Visualizing an 8-dimensional Gaussian f:

[Figure: µ(i) plotted with ±σ(i) error bars at each index i = 1, …, 8]

19

Yet Higher Dimensions

Why stop there?

Don’t panic: It’s just a function

20

Getting Ridiculous

Why stop there?

21

Gaussian Process

Probability distribution indexed by an arbitrary set (integers, reals, finite-dimensional vectors, etc.)

Each element gets a Gaussian distribution over the reals with mean µ(x)

These distributions are dependent/correlated as defined by k(x,z)

Any finite subset of indices defines a multivariate Gaussian distribution

Definition:

22

Gaussian Process

Distribution over functions….

Yayyy! If our regression model is a GP, then its prediction is no longer just a point estimate: it can provide regression estimates with confidence.

Domain (index set) of the functions can be pretty much whatever

• Reals

• Real vectors

• Graphs

• Strings

• Sets

• …

23

Bayesian Updates for GPs

• How can we do regression and learn the GP from data?

• We will be Bayesians today:

• Start with a GP prior

• Get some data

• Compute a posterior

24

Samples from the prior distribution

Picture is taken from Rasmussen and Williams

25
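For concreteness, a minimal sketch (not from the slides) of how such prior samples can be drawn with NumPy, assuming a zero mean function and a squared-exponential kernel; the kernel choice and length-scale are illustrative assumptions:

import numpy as np

def sq_exp_kernel(X1, X2, length_scale=1.0):
    # k(x, z) = exp(-(x - z)^2 / (2 * length_scale^2)) for 1D inputs
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

xs = np.linspace(-5, 5, 100)                          # test inputs
K = sq_exp_kernel(xs, xs)                             # prior covariance matrix
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(xs)))   # jitter for numerical stability

# Each column of (L @ z) is one function sampled from the GP prior
samples = L @ np.random.randn(len(xs), 3)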

Samples from the posterior distribution

Picture is taken from Rasmussen and Williams

26

Prior

27

Data

28

Posterior

29

Introduction

Ridge Regression

Gaussian Processes

• Weight space view: Bayesian Ridge Regression + Kernel trick

• Function space view: Prior distribution over functions + calculation of posterior distributions

Contents

30

Ridge Regression

Linear regression:

Ridge regression:
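As a brief reminder (standard formulations, with training data $\{(x_i, y_i)\}_{i=1}^n$, weight vector $w$, and regularization parameter $\lambda > 0$):

$$\hat{w}_{\text{linear}} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2, \qquad \hat{w}_{\text{ridge}} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|_2^2$$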

The Gaussian Process is a Bayesian generalization of kernelized ridge regression.

31

Introduction

Ridge Regression

Gaussian Processes

• Weight space view: Bayesian Ridge Regression + Kernel trick

• Function space view: Prior distribution over functions + calculation of posterior distributions

Contents

32

Weight Space View

GP = Bayesian ridge regression in feature space

+ Kernel trick to carry out computations

The training data

33

Bayesian Analysis of Linear Regression with Gaussian noise

34

Bayesian Analysis of Linear Regression with Gaussian noise

The likelihood:
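In the standard notation (following Rasmussen & Williams, with $X$ the matrix of training inputs and i.i.d. noise $\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$ in the model $y_i = x_i^\top w + \varepsilon_i$):

$$p(y \mid X, w) = \prod_{i=1}^n \mathcal{N}(y_i \mid x_i^\top w, \sigma_n^2) = \mathcal{N}(X^\top w, \sigma_n^2 I)$$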

35

Bayesian Analysis of Linear Regression with Gaussian noise

The prior:

Now, we can calculate the posterior:
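In the same notation, a Gaussian prior $w \sim \mathcal{N}(0, \Sigma_p)$ combined with the Gaussian likelihood gives the standard Gaussian posterior

$$p(w \mid X, y) = \mathcal{N}\!\left(\tfrac{1}{\sigma_n^2} A^{-1} X y,\; A^{-1}\right), \qquad A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}.$$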

36

Bayesian Analysis of Linear Regression with Gaussian noise

After “completing the square”

MAP estimation

Ridge Regression
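To make the connection explicit (a standard observation, stated here rather than quoted from the slide): the MAP estimate is the posterior mean,

$$w_{\text{MAP}} = \tfrac{1}{\sigma_n^2} A^{-1} X y = \big(X X^\top + \sigma_n^2 \Sigma_p^{-1}\big)^{-1} X y,$$

which for an isotropic prior $\Sigma_p = \tau^2 I$ is exactly the ridge regression solution with $\lambda = \sigma_n^2 / \tau^2$.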

37

Bayesian Analysis of Linear Regression with Gaussian noise

This posterior covariance matrix doesn’t depend on the observations y.

A strange property of Gaussian Processes

38

Projections of Inputs into Feature Space

The Bayesian linear regression reviewed above suffers from limited expressiveness.

To overcome the problem, go to a feature space and do linear regression there, using either

(a) explicit features, or

(b) implicit features (kernels)

39

Explicit Features

Linear regression in the feature space

40

Explicit Features

The predictive distribution after feature map:

41

Explicit Features

Shorthands:

The predictive distribution after feature map:

42

Explicit Features

The predictive distribution after feature map: (*)

A problem with (*) is that it requires an N×N matrix inversion (N = dimension of the feature space)…

(*) can be rewritten:

Theorem:
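The rewritten form, in Rasmussen & Williams’ notation (with $\Phi$ the matrix of training features, $\phi_* = \phi(x_*)$, and $K = \Phi^\top \Sigma_p \Phi$), is the standard one:

$$f_* \mid x_*, X, y \sim \mathcal{N}\Big(\phi_*^\top \Sigma_p \Phi\,(K + \sigma_n^2 I)^{-1} y,\;\; \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi\,(K + \sigma_n^2 I)^{-1}\Phi^\top \Sigma_p \phi_*\Big),$$

which only requires inverting an n×n matrix, where n is the number of training points.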

43

Proofs

• Mean expression. We need:

• Variance expression. We need:

Lemma:

Matrix inversion Lemma:
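For reference, the matrix inversion lemma (Sherman–Morrison–Woodbury identity) in its standard form:

$$(Z + UWV^\top)^{-1} = Z^{-1} - Z^{-1}U\,(W^{-1} + V^\top Z^{-1} U)^{-1} V^\top Z^{-1}$$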

44

From Explicit to Implicit Features

Reminder: This was the original formulation:

45

From Explicit to Implicit Features

The feature space always enters in the form of:

Lemma:

No need to know the explicit N-dimensional features.

Their inner product is enough.
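In symbols (the standard weight-space-view kernel, as in Rasmussen & Williams): every appearance of the features can be reduced to

$$k(x, x') = \phi(x)^\top \Sigma_p\, \phi(x') = \psi(x) \cdot \psi(x'), \qquad \psi(x) = \Sigma_p^{1/2} \phi(x),$$

so only these inner products (kernel values) are needed.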

46

Results

47

Results using Netlab, Sin function

48

Results using Netlab, Sin function

Increased # of training points

49

Results using Netlab, Sin function

Increased noise

50

Results using Netlab, Sinc function

51

Thanks for the Attention!

52

Extra Material

53

Introduction

Ridge Regression

Gaussian Processes

• Weight space view: Bayesian Ridge Regression + Kernel trick

• Function space view: Prior distribution over functions + calculation of posterior distributions

Contents

54

Function Space View

An alternative way to get the previous results

Inference directly in function space

Definition: (Gaussian Processes)

A GP is a collection of random variables such that any finite number of them has a joint Gaussian distribution.

55

Function Space View

Notations:

56

Function Space View

Gaussian Processes:
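In the standard notation, one writes $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, where the mean function and covariance function are

$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big].$$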

57

Function Space View

The Bayesian linear regression is an example of GP
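To see this (a standard argument): for $f(x) = \phi(x)^\top w$ with prior $w \sim \mathcal{N}(0, \Sigma_p)$,

$$\mathbb{E}[f(x)] = 0, \qquad \mathbb{E}[f(x)\,f(x')] = \phi(x)^\top \Sigma_p\, \phi(x'),$$

so $f$ is a GP with zero mean function and covariance function $k(x, x') = \phi(x)^\top \Sigma_p\, \phi(x')$.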

58

Function Space View

Special case

59

Function Space View

Picture is taken from Rasmussen and Williams

60

Function Space View

Observation

Explanation

61

Prediction with noise free observations

noise free observations

62

Prediction with noise free observations

Goal:

63

Prediction with noise free observations

Lemma:
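In its standard form (as in Rasmussen & Williams, with a zero-mean prior): the training outputs $f$ and the test outputs $f_*$ are jointly Gaussian,

$$\begin{pmatrix} f \\ f_* \end{pmatrix} \sim \mathcal{N}\!\left(0,\; \begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix}\right),$$

and conditioning on the observed $f$ gives

$$f_* \mid X_*, X, f \sim \mathcal{N}\big(K(X_*, X)K(X, X)^{-1} f,\;\; K(X_*, X_*) - K(X_*, X)K(X, X)^{-1}K(X, X_*)\big).$$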

Proofs: a bit of calculation using the joint (n+m)-dimensional density

Remarks:

64

Prediction with noise free observations

Picture is taken from Rasmussen and Williams

65

Prediction using noisy observations

The joint distribution:
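With observation noise $y = f(X) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$, the standard joint distribution (zero-mean prior) is

$$\begin{pmatrix} y \\ f_* \end{pmatrix} \sim \mathcal{N}\!\left(0,\; \begin{pmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix}\right).$$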

66

Prediction using noisy observations

The posterior for the noisy observations:
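In the standard form (as in Rasmussen & Williams):

$$f_* \mid X, y, X_* \sim \mathcal{N}\big(\bar{f}_*, \operatorname{cov}(f_*)\big), \quad \text{where}$$

$$\bar{f}_* = K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1} y, \qquad \operatorname{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, X_*).$$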

In the weight space view we had:

67

Prediction using noisy observations

Short notations:

68

Prediction using noisy observations

Two ways to look at it:

• Linear predictor

• Manifestation of the Representer Theorem
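Both bullet points refer to the same standard fact: the predictive mean at a test point $x_*$ is a weighted sum of kernel evaluations at the training points,

$$\bar{f}(x_*) = \sum_{i=1}^n \alpha_i\, k(x_i, x_*), \qquad \alpha = (K + \sigma_n^2 I)^{-1} y.$$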

69

Prediction using noisy observations

Remarks:

70

GP pseudo code

Inputs:

71

GP pseudo code (continued)

Outputs:
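Below is a hedged Python sketch of the standard Cholesky-based GP regression implementation (in the spirit of Rasmussen & Williams, Algorithm 2.1); the kernel, data, and noise level in the usage example are illustrative placeholders, not taken from the slides:

import numpy as np

def gp_regress(X, y, X_star, kernel, noise_var):
    # K(X, X) and K(X, X*)
    K = kernel(X, X)
    K_s = kernel(X, X_star)
    # Cholesky factor of the noisy training covariance
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha                                           # predictive mean
    v = np.linalg.solve(L, K_s)
    var = np.diag(kernel(X_star, X_star)) - np.sum(v ** 2, axis=0) # predictive variance
    log_ml = (-0.5 * y @ alpha
              - np.sum(np.log(np.diag(L)))
              - 0.5 * len(X) * np.log(2 * np.pi))                  # log marginal likelihood
    return mean, var, log_ml

# Usage example: squared-exponential kernel, noisy samples of sin(x)
def sq_exp(A, B, ell=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * np.random.randn(len(X))
X_star = np.linspace(-3, 3, 200)
mean, var, log_ml = gp_regress(X, y, X_star, sq_exp, noise_var=0.01)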

72

Thanks for the Attention!
