COMP90051 Statistical Machine Learning, Semester 2, 2017. Lecturer: Trevor Cohn. Lecture 17: Bayesian inference; Bayesian regression


Page 1

COMP90051 Statistical Machine Learning
Semester 2, 2017

Lecturer: Trevor Cohn

17. Bayesian inference; Bayesian regression

Page 2

Statistical Machine Learning (S2 2017), Lecture 17

Training == optimisation (?)

Stages of learning & inference:

• Formulate model
• Fit parameters to data
• Make prediction

Classification: model $p(y|\mathbf{x}) = \mathrm{sigmoid}(\mathbf{x}'\mathbf{w})$; fit $\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})$; predict $p(y_*|\mathbf{x}_*) = \mathrm{sigmoid}(\mathbf{x}_*'\hat{\mathbf{w}})$

Regression: model $p(y|\mathbf{x}) = \mathrm{Normal}(\mathbf{x}'\mathbf{w}, \sigma^2)$; fit ditto; predict $E[y_*] = \mathbf{x}_*'\hat{\mathbf{w}}$

$\hat{\mathbf{w}}$ is referred to as a 'point estimate'

Page 3

Bayesian Alternative

Nothing special about $\hat{\mathbf{w}}$ … use more than one value?

• Formulate model
• Consider the space of likely parameters: those that fit the training data well
• Make 'expected' prediction

Classification: $p(y|\mathbf{x}) = \mathrm{sigmoid}(\mathbf{x}'\mathbf{w})$; predict $p(y_*|\mathbf{x}_*) = E_{p(\mathbf{w}|\mathbf{X},\mathbf{y})}\left[\mathrm{sigmoid}(\mathbf{x}_*'\mathbf{w})\right]$

Regression: $p(y|\mathbf{x}) = \mathrm{Normal}(\mathbf{x}'\mathbf{w}, \sigma^2)$; predict $p(y_*|\mathbf{x}_*) = E_{p(\mathbf{w}|\mathbf{X},\mathbf{y})}\left[\mathrm{Normal}(\mathbf{x}_*'\mathbf{w}, \sigma^2)\right]$

where $p(\mathbf{w}|\mathbf{X},\mathbf{y})$ is the posterior over the weights.
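The 'expected' prediction can be approximated by Monte Carlo: draw weight samples from the posterior and average the per-sample predictions. A minimal numpy sketch, assuming a hypothetical Gaussian posterior whose mean and covariance are made-up placeholders (in practice they would come from inference, as derived later in the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical posterior over w: placeholder mean and covariance,
# NOT derived from any data -- purely for illustration.
w_mean = np.array([1.0, -0.5])
w_cov = np.array([[0.2, 0.0], [0.0, 0.2]])

x_star = np.array([1.0, 2.0])  # test point (first feature is a bias term)

# Monte Carlo estimate of E_{p(w|X,y)}[sigmoid(x*'w)]
w_samples = rng.multivariate_normal(w_mean, w_cov, size=10_000)
p_star = sigmoid(w_samples @ x_star).mean()

# Contrast with the plug-in (point estimate) prediction
p_plugin = sigmoid(x_star @ w_mean)
```

Averaging over weights generally differs from plugging in the posterior mean, since the expectation of a sigmoid is not the sigmoid of the expectation.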

Page 4

Uncertainty

From small training sets, we rarely have complete confidence in any models learned. Can we quantify the uncertainty, and use it in making predictions?

Page 5

Regression Revisited

Linear regression: $y = w_0 + w_1 x$ (here $y$ = humidity, $x$ = temperature)

• Learn model from data
  * minimise error residuals by choosing weights $\hat{\mathbf{w}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
• But… how confident are we
  * in $\hat{\mathbf{w}}$?
  * in the predictions?
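The normal-equations fit above is a one-liner in numpy. A sketch on synthetic data (the temperature range, true weights, and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: x = temperature, y = humidity
x = rng.uniform(10, 35, size=50)
y = 80.0 - 1.5 * x + rng.normal(0.0, 2.0, size=50)

# Design matrix with a bias column, so w = (w0, w1)
X = np.column_stack([np.ones_like(x), x])

# Normal-equations solution w_hat = (X'X)^{-1} X'y
# (solving the linear system is preferred over forming the inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

Note that `w_hat` is a single point estimate: nothing in it tells us how confident to be in the fitted weights, which is exactly the gap the rest of the lecture addresses.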

Page 6

Prediction uncertainty

• A single prediction is of limited use due to uncertainty
  * a single number is uninformative: it may be wildly off
  * we might want to formulate a decision from the prediction, e.g., based on Pr(y < 70)

Page 7

Confidence in MLE point estimate

• What does it mean to minimise the objective?
  * … are other nearby solutions similarly good?
• Effect of data
  * with lots of data relative to the dimensionality, the MLE is likely to be a good estimate
  * otherwise unreliable
• MAP is a partial solution, but still reliant on a single point

Page 8

Effect of Training Sample on MLE

• Modelling $y = 2x - 3$
  * draw 1000s of training sets of 10 instances
  * small added noise
• Fit weights each time using MLE
  * observe variability in weights
  * peak at $(2, -3)$

[Figure: left, a single data sample with the MLE fit & noise against the ground truth; right, the empirical distribution (histogram) over learned weights $w_0$, $w_1$.]
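This experiment is easy to reproduce. A sketch, assuming inputs drawn uniformly on $[-1, 1]$ and noise standard deviation 0.1 (the slide does not specify these details):

```python
import numpy as np

rng = np.random.default_rng(0)

n_sets, n_points = 1000, 10
w_true = np.array([-3.0, 2.0])  # y = 2x - 3, stored as (w0, w1)

estimates = np.empty((n_sets, 2))
for i in range(n_sets):
    x = rng.uniform(-1, 1, size=n_points)
    X = np.column_stack([np.ones_like(x), x])
    y = X @ w_true + rng.normal(0.0, 0.1, size=n_points)  # small added noise
    estimates[i] = np.linalg.solve(X.T @ X, X.T @ y)      # MLE fit

# The empirical distribution of estimates peaks at the true weights,
# but individual fits vary from training set to training set.
mean_est = estimates.mean(axis=0)
spread = estimates.std(axis=0)
```

Histogramming `estimates` reproduces the figure: a peak at $(w_0, w_1) = (-3, 2)$ with visible spread, which is exactly the estimator variability the slide describes.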

Page 9

Aside: Learning the noise rate

• Can also learn the noise parameter, $\sigma^2$
  * express the NLL as a function of $\sigma^2$; differentiate; set to 0; solve
  * results in

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{x}_i'\mathbf{w})^2$$

• Quantifies the quality of the fit
  * allows smarter decision making, e.g., P(y < 60)

N.b., we compute better error bounds later on.

[Figure: fit shown with a $\pm\sigma$ band (68% confidence interval).]
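The closed-form estimate of $\sigma^2$ is just the mean squared residual. A sketch on synthetic data with a known noise level (the true weights and input range are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with known noise standard deviation
sigma_true = 2.0
x = rng.uniform(0, 10, size=2000)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([1.0, 0.5]) + rng.normal(0.0, sigma_true, size=2000)

# Fit weights by least squares, then plug into the MLE:
#   sigma2_hat = (1/N) * sum_i (y_i - x_i'w)^2
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ w_hat
sigma2_hat = np.mean(residuals ** 2)
```

With enough data, `sigma2_hat` recovers the true noise variance (here $2^2 = 4$), which is what lets us make probabilistic statements such as P(y < 60).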

Page 10

Do we trust the point estimate $\hat{\mathbf{w}}$?

• How stable is learning?
  * $\hat{\mathbf{w}}$ is highly sensitive to noise
  * how much uncertainty is in the parameter estimate?
  * more informative if the NLL objective is highly peaked
• Formalised as the Fisher Information matrix
  * $E[\text{2nd derivative of the NLL}]$
  * measures the curvature of the objective about $\hat{\mathbf{w}}$

$$I = \frac{1}{\sigma^2}\mathbf{X}'\mathbf{X}$$

Figure: Rogers and Girolami, p. 81
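For linear regression the Fisher information is available in closed form, and its inverse approximates the covariance of the MLE. A sketch (the input distribution and noise variance are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

sigma2 = 0.25  # noise variance, assumed known
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])

# Fisher information for linear regression: I = (1/sigma^2) X'X
fisher = (X.T @ X) / sigma2

# Its inverse approximates the covariance of the MLE w_hat:
# a sharply peaked NLL (large I) means low estimator variance.
cov_w_hat = np.linalg.inv(fisher)
```

More data or less noise grows `fisher` and shrinks `cov_w_hat`, matching the intuition that a highly peaked objective is more informative about $\hat{\mathbf{w}}$.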

Page 11

The Bayesian View

Retain and model all unknowns (e.g., uncertainty over parameters) and use this information when making inferences.

Page 12

A Bayesian View

• Could we reason over all parameters that are consistent with the data?
  * weights with a better fit to the training data should be more probable than others
  * make predictions with all these weights, scaled by their probability
• This is the idea underlying Bayesian inference

Page 13

Uncertainty over parameters

• Many reasonable solutions to the objective
  * why select just one?
• Reason under all possible parameter values
  * weighted by their posterior probability
• More robust predictions
  * less sensitive to overfitting, particularly with small training sets
  * can give rise to a more expressive model class (Bayesian logistic regression becomes non-linear!)

Page 14

Frequentist vs Bayesian divide

• Frequentist: learning using point estimates, regularisation, p-values…
  * backed by complex theory relying on strong assumptions
  * mostly simpler algorithms; characterises much practical machine learning research
• Bayesian: maintain uncertainty, marginalise (sum) out unknowns during inference
  * nicer theory with fewer assumptions
  * often more complex algorithms, but not always
  * when possible, results in more elegant models

Page 15

Bayesian Regression

Application of Bayesian inference to linear regression, using a Normal prior over $\mathbf{w}$

Page 16

Revisiting Linear Regression

• Recall the probabilistic formulation of linear regression:

$$y \sim \mathrm{Normal}(\mathbf{x}'\mathbf{w}, \sigma^2) \qquad \mathbf{w} \sim \mathrm{Normal}(\mathbf{0}, \gamma^2 \mathbf{I}_D)$$

where $\mathbf{I}_D$ is the $D \times D$ identity matrix.

• Motivated by Bayes rule:

$$p(\mathbf{w}|\mathbf{X},\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})}{p(\mathbf{y}|\mathbf{X})}$$

$$\max_{\mathbf{w}} p(\mathbf{w}|\mathbf{X},\mathbf{y}) = \max_{\mathbf{w}} p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})$$

(a point estimate is taken here, which avoids computing the marginal likelihood term)

• Gives rise to the penalised RSS objective

Page 17

Bayesian Linear Regression

• Rewind one step, consider the full posterior:

$$p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2) = \frac{p(\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2)\,p(\mathbf{w})}{p(\mathbf{y}|\mathbf{X})} = \frac{p(\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2)\,p(\mathbf{w})}{\int p(\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2)\,p(\mathbf{w})\,d\mathbf{w}}$$

(here we assume the noise variance is known)

• Can we compute the denominator (the marginal likelihood, or evidence)?
  * if so, we can use the full posterior, not just its mode

Page 18

Bayesian Linear Regression (cont.)

• We have two Normal distributions
  * normal likelihood × normal prior
• Their product is also a Normal distribution
  * conjugate prior: when the product of likelihood × prior results in the same distribution as the prior
  * the evidence can be computed easily using the normalising constant of the Normal distribution

$$p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2) \propto \mathrm{Normal}(\mathbf{w}|\mathbf{0}, \gamma^2 \mathbf{I}_D)\,\mathrm{Normal}(\mathbf{y}|\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N) \propto \mathrm{Normal}(\mathbf{w}|\mathbf{w}_N, \mathbf{V}_N)$$

A closed-form solution for the posterior!

Page 19

Bayesian Linear Regression (cont.)

$$p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2) \propto \mathrm{Normal}(\mathbf{w}|\mathbf{0}, \gamma^2 \mathbf{I}_D)\,\mathrm{Normal}(\mathbf{y}|\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N) \propto \mathrm{Normal}(\mathbf{w}|\mathbf{w}_N, \mathbf{V}_N)$$

where

$$\mathbf{w}_N = \frac{1}{\sigma^2}\mathbf{V}_N\mathbf{X}'\mathbf{y} \qquad \mathbf{V}_N = \sigma^2\left(\mathbf{X}'\mathbf{X} + \frac{\sigma^2}{\gamma^2}\mathbf{I}_D\right)^{-1}$$

Note that the mean (and mode) is the MAP solution from before.

Advanced: verify by expressing the product of the two Normals, gathering the exponents together and 'completing the square' to express it as a squared exponential (i.e., a Normal distribution).
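These two equations translate directly into numpy. A sketch, assuming invented values for the noise variance, prior variance, true weights, and input distribution:

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2, gamma2 = 1.0, 10.0   # noise and prior variances, assumed known
w_true = np.array([-3.0, 2.0])

x = rng.uniform(-1, 1, size=30)
X = np.column_stack([np.ones_like(x), x])
y = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), size=30)

D = X.shape[1]
# V_N = sigma^2 (X'X + (sigma^2/gamma^2) I_D)^{-1}
V_N = sigma2 * np.linalg.inv(X.T @ X + (sigma2 / gamma2) * np.eye(D))
# w_N = (1/sigma^2) V_N X'y  (the posterior mean, equal to the MAP solution)
w_N = V_N @ (X.T @ y) / sigma2
```

`w_N` is the ridge-regularised fit, while `V_N` is the extra information the Bayesian treatment provides: a full covariance quantifying uncertainty in the weights.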

Page 20

Bayesian Linear Regression example

[Figure, four panels:]
• Step 1: select prior, here spherical about 0
• Step 2: observe training data
• Step 3: formulate posterior, from prior & likelihood
• Samples from the posterior

Page 21

Sequential Bayesian Updating

• Can formulate $p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2)$ for a given dataset
• What happens as we see more and more data?
  1. Start from the prior $p(\mathbf{w})$
  2. See a new labelled data point
  3. Compute the posterior $p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2)$
  4. The posterior now takes the role of the prior; repeat from step 2
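The four steps above can be sketched by tracking the Gaussian posterior in precision form and folding in one labelled point at a time; because the prior is conjugate, the result matches the batch posterior exactly. A sketch with invented data and variances:

```python
import numpy as np

rng = np.random.default_rng(5)

sigma2, gamma2 = 1.0, 10.0
D = 2

x = rng.uniform(-1, 1, size=20)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([-3.0, 2.0]) + rng.normal(0.0, 1.0, size=20)

# Step 1: start from the prior, mean 0 and covariance gamma^2 I,
# stored in natural form (precision and precision-weighted mean).
prec = np.eye(D) / gamma2
shift = np.zeros(D)

for xi, yi in zip(X, y):
    # Steps 2-4: see one point, compute the posterior, and let it
    # take the role of the prior for the next point.
    prec += np.outer(xi, xi) / sigma2   # posterior precision grows
    shift += xi * yi / sigma2
    # current posterior: mean = prec^{-1} shift, cov = prec^{-1}

w_seq = np.linalg.solve(prec, shift)

# Batch posterior mean, for comparison: identical answer
V_N = sigma2 * np.linalg.inv(X.T @ X + (sigma2 / gamma2) * np.eye(D))
w_batch = V_N @ (X.T @ y) / sigma2
```

Each update only adds a rank-one term to the precision, which is why the posterior becomes progressively more peaked as data accumulate.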

Page 22

Sequential Bayesian Updating

Bishop Fig 3.7, p. 155: illustration of sequential Bayesian learning for a simple linear model of the form $y(x, \mathbf{w}) = w_0 + w_1 x$. A detailed description of this figure is given in the text.

• Initially we know little: many regression lines are licensed
• The likelihood constrains the possible weights such that the regression is close to each observed point
• The posterior becomes more refined/peaked as more data are introduced
• It approaches a point mass about the solution

Page 23

Summary

• Uncertainty is not captured by point estimates (MLE, MAP)
• The Bayesian approach preserves uncertainty
  * care about predictions NOT parameters
  * choose a prior over parameters, then model the posterior
• New concepts:
  * sequential Bayesian updating
  * conjugate prior (Normal-Normal)
• Still to come… using the posterior for Bayesian predictions on test data