COMP90051 Statistical Machine Learning, Semester 2, 2017. Lecturer: Trevor Cohn. Lecture 17: Bayesian inference; Bayesian regression


Page 1

COMP90051 Statistical Machine Learning
Semester 2, 2017

Lecturer: Trevor Cohn

17. Bayesian inference; Bayesian regression

Page 2

Statistical Machine Learning (S2 2017), Lecture 17

Training == optimisation (?)

Stages of learning & inference:

• Formulate model
• Fit parameters to data
• Make prediction

Classification: model $p(y|\mathbf{x}) = \mathrm{sigmoid}(\mathbf{x}'\mathbf{w})$; fit $\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})$; predict $p(y_*|\mathbf{x}_*) = \mathrm{sigmoid}(\mathbf{x}_*'\hat{\mathbf{w}})$

Regression: model $p(y|\mathbf{x}) = \mathrm{Normal}(\mathbf{x}'\mathbf{w}, \sigma^2)$; fit ditto; predict $E[y_*] = \mathbf{x}_*'\hat{\mathbf{w}}$

$\hat{\mathbf{w}}$ is referred to as a 'point estimate'

Page 3

Bayesian Alternative

Nothing special about $\hat{\mathbf{w}}$ … use more than one value?

• Formulate model
• Consider the space of likely parameters: those that fit the training data well
• Make 'expected' prediction

Classification: $p(y|\mathbf{x}) = \mathrm{sigmoid}(\mathbf{x}'\mathbf{w})$; predict $p(y_*|\mathbf{x}_*) = E_{p(\mathbf{w}|\mathbf{X},\mathbf{y})}\left[\mathrm{sigmoid}(\mathbf{x}_*'\mathbf{w})\right]$

Regression: $p(y|\mathbf{x}) = \mathrm{Normal}(\mathbf{x}'\mathbf{w}, \sigma^2)$; predict $p(y_*|\mathbf{x}_*) = E_{p(\mathbf{w}|\mathbf{X},\mathbf{y})}\left[\mathrm{Normal}(\mathbf{x}_*'\mathbf{w}, \sigma^2)\right]$

where $p(\mathbf{w}|\mathbf{X},\mathbf{y})$ is the posterior over the weights.
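The 'expected' prediction can be approximated by Monte Carlo: draw weight samples from the posterior and average the per-sample predictions. A minimal numpy sketch, assuming a hypothetical Gaussian posterior whose mean and covariance are made-up placeholders (in practice they would come from inference, as derived later in the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical posterior over w: placeholder mean and covariance,
# NOT derived from any data -- purely for illustration.
w_mean = np.array([1.0, -0.5])
w_cov = np.array([[0.2, 0.0], [0.0, 0.2]])

x_star = np.array([1.0, 2.0])  # test point (first feature is a bias term)

# Monte Carlo estimate of E_{p(w|X,y)}[sigmoid(x*'w)]
w_samples = rng.multivariate_normal(w_mean, w_cov, size=10_000)
p_star = sigmoid(w_samples @ x_star).mean()

# Contrast with the plug-in (point estimate) prediction
p_plugin = sigmoid(x_star @ w_mean)
```

Averaging over weights generally differs from plugging in the posterior mean, since the expectation of a sigmoid is not the sigmoid of the expectation.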

Page 4

Uncertainty

From small training sets, we rarely have complete confidence in any models learned. Can we quantify the uncertainty, and use it in making predictions?

Page 5

Regression Revisited

Linear regression: $y = w_0 + w_1 x$ (here $y$ = humidity, $x$ = temperature)

• Learn model from data
  * minimise error residuals by choosing weights $\hat{\mathbf{w}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
• But… how confident are we
  * in $\hat{\mathbf{w}}$?
  * in the predictions?
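The normal-equations fit above is a one-liner in numpy. A sketch on synthetic data (the temperature range, true weights, and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: x = temperature, y = humidity
x = rng.uniform(10, 35, size=50)
y = 80.0 - 1.5 * x + rng.normal(0.0, 2.0, size=50)

# Design matrix with a bias column, so w = (w0, w1)
X = np.column_stack([np.ones_like(x), x])

# Normal-equations solution w_hat = (X'X)^{-1} X'y
# (solving the linear system is preferred over forming the inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

Note that `w_hat` is a single point estimate: nothing in it tells us how confident to be in the fitted weights, which is exactly the gap the rest of the lecture addresses.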

Page 6

Prediction uncertainty

• A single prediction is of limited use due to uncertainty
  * a single number is uninformative: it may be wildly off
  * we might want to formulate a decision from the prediction, e.g., based on Pr(y < 70)

Page 7

Confidence in MLE point estimate

• What does it mean to minimise the objective?
  * … are other nearby solutions similarly good?
• Effect of data
  * with lots of data relative to the dimensionality, the MLE is likely to be a good estimate
  * otherwise unreliable
• MAP is a partial solution, but still reliant on a single point

Page 8

Effect of Training Sample on MLE

• Modelling $y = 2x - 3$
  * draw 1000s of training sets of 10 instances
  * small added noise
• Fit weights each time using MLE
  * observe variability in weights
  * peak at $(2, -3)$

[Figure: left, a single data sample with the MLE fit & noise against the ground truth; right, the empirical distribution (histogram) over learned weights $w_0$, $w_1$.]
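This experiment is easy to reproduce. A sketch, assuming inputs drawn uniformly on $[-1, 1]$ and noise standard deviation 0.1 (the slide does not specify these details):

```python
import numpy as np

rng = np.random.default_rng(0)

n_sets, n_points = 1000, 10
w_true = np.array([-3.0, 2.0])  # y = 2x - 3, stored as (w0, w1)

estimates = np.empty((n_sets, 2))
for i in range(n_sets):
    x = rng.uniform(-1, 1, size=n_points)
    X = np.column_stack([np.ones_like(x), x])
    y = X @ w_true + rng.normal(0.0, 0.1, size=n_points)  # small added noise
    estimates[i] = np.linalg.solve(X.T @ X, X.T @ y)      # MLE fit

# The empirical distribution of estimates peaks at the true weights,
# but individual fits vary from training set to training set.
mean_est = estimates.mean(axis=0)
spread = estimates.std(axis=0)
```

Histogramming `estimates` reproduces the figure: a peak at $(w_0, w_1) = (-3, 2)$ with visible spread, which is exactly the estimator variability the slide describes.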

Page 9

Aside: Learning the noise rate

• Can also learn the noise parameter, $\sigma^2$
  * express the NLL as a function of $\sigma^2$; differentiate; set to 0; solve
  * results in

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{x}_i'\mathbf{w})^2$$

• Quantifies the quality of the fit
  * allows smarter decision making, e.g., P(y < 60)

N.b., we compute better error bounds later on.

[Figure: fit shown with a $\pm\sigma$ band (68% confidence interval).]
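The closed-form estimate of $\sigma^2$ is just the mean squared residual. A sketch on synthetic data with a known noise level (the true weights and input range are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with known noise standard deviation
sigma_true = 2.0
x = rng.uniform(0, 10, size=2000)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([1.0, 0.5]) + rng.normal(0.0, sigma_true, size=2000)

# Fit weights by least squares, then plug into the MLE:
#   sigma2_hat = (1/N) * sum_i (y_i - x_i'w)^2
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ w_hat
sigma2_hat = np.mean(residuals ** 2)
```

With enough data, `sigma2_hat` recovers the true noise variance (here $2^2 = 4$), which is what lets us make probabilistic statements such as P(y < 60).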

Page 10

Do we trust the point estimate $\hat{\mathbf{w}}$?

• How stable is learning?
  * $\hat{\mathbf{w}}$ is highly sensitive to noise
  * how much uncertainty is in the parameter estimate?
  * more informative if the NLL objective is highly peaked
• Formalised as the Fisher Information matrix
  * $E[\text{2nd derivative of the NLL}]$
  * measures the curvature of the objective about $\hat{\mathbf{w}}$

$$I = \frac{1}{\sigma^2}\mathbf{X}'\mathbf{X}$$

Figure: Rogers and Girolami, p. 81
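For linear regression the Fisher information is available in closed form, and its inverse approximates the covariance of the MLE. A sketch (the input distribution and noise variance are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

sigma2 = 0.25  # noise variance, assumed known
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])

# Fisher information for linear regression: I = (1/sigma^2) X'X
fisher = (X.T @ X) / sigma2

# Its inverse approximates the covariance of the MLE w_hat:
# a sharply peaked NLL (large I) means low estimator variance.
cov_w_hat = np.linalg.inv(fisher)
```

More data or less noise grows `fisher` and shrinks `cov_w_hat`, matching the intuition that a highly peaked objective is more informative about $\hat{\mathbf{w}}$.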

Page 11

The Bayesian View

Retain and model all unknowns (e.g., uncertainty over parameters) and use this information when making inferences.

Page 12

A Bayesian View

• Could we reason over all parameters that are consistent with the data?
  * weights with a better fit to the training data should be more probable than others
  * make predictions with all these weights, scaled by their probability
• This is the idea underlying Bayesian inference

Page 13

Uncertainty over parameters

• Many reasonable solutions to the objective
  * why select just one?
• Reason under all possible parameter values
  * weighted by their posterior probability
• More robust predictions
  * less sensitive to overfitting, particularly with small training sets
  * can give rise to a more expressive model class (Bayesian logistic regression becomes non-linear!)

Page 14

Frequentist vs Bayesian divide

• Frequentist: learning using point estimates, regularisation, p-values…
  * backed by complex theory relying on strong assumptions
  * mostly simpler algorithms; characterises much practical machine learning research
• Bayesian: maintain uncertainty, marginalise (sum) out unknowns during inference
  * nicer theory with fewer assumptions
  * often more complex algorithms, but not always
  * when possible, results in more elegant models

Page 15

Bayesian Regression

Application of Bayesian inference to linear regression, using a Normal prior over $\mathbf{w}$

Page 16

Revisiting Linear Regression

• Recall the probabilistic formulation of linear regression:

$$y \sim \mathrm{Normal}(\mathbf{x}'\mathbf{w}, \sigma^2) \qquad \mathbf{w} \sim \mathrm{Normal}(\mathbf{0}, \gamma^2 \mathbf{I}_D)$$

where $\mathbf{I}_D$ is the $D \times D$ identity matrix.

• Motivated by Bayes rule:

$$p(\mathbf{w}|\mathbf{X},\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})}{p(\mathbf{y}|\mathbf{X})}$$

$$\max_{\mathbf{w}} p(\mathbf{w}|\mathbf{X},\mathbf{y}) = \max_{\mathbf{w}} p(\mathbf{y}|\mathbf{X},\mathbf{w})\,p(\mathbf{w})$$

(a point estimate is taken here, which avoids computing the marginal likelihood term)

• Gives rise to the penalised RSS objective

Page 17

Bayesian Linear Regression

• Rewind one step, consider the full posterior:

$$p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2) = \frac{p(\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2)\,p(\mathbf{w})}{p(\mathbf{y}|\mathbf{X})} = \frac{p(\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2)\,p(\mathbf{w})}{\int p(\mathbf{y}|\mathbf{X},\mathbf{w},\sigma^2)\,p(\mathbf{w})\,d\mathbf{w}}$$

(here we assume the noise variance is known)

• Can we compute the denominator (the marginal likelihood, or evidence)?
  * if so, we can use the full posterior, not just its mode

Page 18

Bayesian Linear Regression (cont.)

• We have two Normal distributions
  * normal likelihood × normal prior
• Their product is also a Normal distribution
  * conjugate prior: when the product of likelihood × prior results in the same distribution as the prior
  * the evidence can be computed easily using the normalising constant of the Normal distribution

$$p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2) \propto \mathrm{Normal}(\mathbf{w}|\mathbf{0}, \gamma^2 \mathbf{I}_D)\,\mathrm{Normal}(\mathbf{y}|\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N) \propto \mathrm{Normal}(\mathbf{w}|\mathbf{w}_N, \mathbf{V}_N)$$

A closed-form solution for the posterior!

Page 19

Bayesian Linear Regression (cont.)

$$p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2) \propto \mathrm{Normal}(\mathbf{w}|\mathbf{0}, \gamma^2 \mathbf{I}_D)\,\mathrm{Normal}(\mathbf{y}|\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N) \propto \mathrm{Normal}(\mathbf{w}|\mathbf{w}_N, \mathbf{V}_N)$$

where

$$\mathbf{w}_N = \frac{1}{\sigma^2}\mathbf{V}_N\mathbf{X}'\mathbf{y} \qquad \mathbf{V}_N = \sigma^2\left(\mathbf{X}'\mathbf{X} + \frac{\sigma^2}{\gamma^2}\mathbf{I}_D\right)^{-1}$$

Note that the mean (and mode) is the MAP solution from before.

Advanced: verify by expressing the product of the two Normals, gathering the exponents together and 'completing the square' to express it as a squared exponential (i.e., a Normal distribution).
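These two equations translate directly into numpy. A sketch, assuming invented values for the noise variance, prior variance, true weights, and input distribution:

```python
import numpy as np

rng = np.random.default_rng(3)

sigma2, gamma2 = 1.0, 10.0   # noise and prior variances, assumed known
w_true = np.array([-3.0, 2.0])

x = rng.uniform(-1, 1, size=30)
X = np.column_stack([np.ones_like(x), x])
y = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), size=30)

D = X.shape[1]
# V_N = sigma^2 (X'X + (sigma^2/gamma^2) I_D)^{-1}
V_N = sigma2 * np.linalg.inv(X.T @ X + (sigma2 / gamma2) * np.eye(D))
# w_N = (1/sigma^2) V_N X'y  (the posterior mean, equal to the MAP solution)
w_N = V_N @ (X.T @ y) / sigma2
```

`w_N` is the ridge-regularised fit, while `V_N` is the extra information the Bayesian treatment provides: a full covariance quantifying uncertainty in the weights.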

Page 20

Bayesian Linear Regression example

[Figure, four panels:]
• Step 1: select prior, here spherical about 0
• Step 2: observe training data
• Step 3: formulate posterior, from prior & likelihood
• Samples from the posterior

Page 21

Sequential Bayesian Updating

• Can formulate $p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2)$ for a given dataset
• What happens as we see more and more data?
  1. Start from the prior $p(\mathbf{w})$
  2. See a new labelled data point
  3. Compute the posterior $p(\mathbf{w}|\mathbf{X},\mathbf{y},\sigma^2)$
  4. The posterior now takes the role of the prior; repeat from step 2
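The four steps above can be sketched by tracking the Gaussian posterior in precision form and folding in one labelled point at a time; because the prior is conjugate, the result matches the batch posterior exactly. A sketch with invented data and variances:

```python
import numpy as np

rng = np.random.default_rng(5)

sigma2, gamma2 = 1.0, 10.0
D = 2

x = rng.uniform(-1, 1, size=20)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([-3.0, 2.0]) + rng.normal(0.0, 1.0, size=20)

# Step 1: start from the prior, mean 0 and covariance gamma^2 I,
# stored in natural form (precision and precision-weighted mean).
prec = np.eye(D) / gamma2
shift = np.zeros(D)

for xi, yi in zip(X, y):
    # Steps 2-4: see one point, compute the posterior, and let it
    # take the role of the prior for the next point.
    prec += np.outer(xi, xi) / sigma2   # posterior precision grows
    shift += xi * yi / sigma2
    # current posterior: mean = prec^{-1} shift, cov = prec^{-1}

w_seq = np.linalg.solve(prec, shift)

# Batch posterior mean, for comparison: identical answer
V_N = sigma2 * np.linalg.inv(X.T @ X + (sigma2 / gamma2) * np.eye(D))
w_batch = V_N @ (X.T @ y) / sigma2
```

Each update only adds a rank-one term to the precision, which is why the posterior becomes progressively more peaked as data accumulate.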

Page 22

Sequential Bayesian Updating

Bishop Fig 3.7, p. 155: illustration of sequential Bayesian learning for a simple linear model of the form $y(x, \mathbf{w}) = w_0 + w_1 x$. A detailed description of this figure is given in the text.

• Initially we know little: many regression lines are licensed
• The likelihood constrains the possible weights such that the regression is close to each observed point
• The posterior becomes more refined/peaked as more data are introduced
• It approaches a point mass about the solution

Page 23

Summary

• Uncertainty is not captured by point estimates (MLE, MAP)
• The Bayesian approach preserves uncertainty
  * care about predictions NOT parameters
  * choose a prior over parameters, then model the posterior
• New concepts:
  * sequential Bayesian updating
  * conjugate prior (Normal-Normal)
• Still to come… using the posterior for Bayesian predictions on test data