COMP90051 Statistical Machine Learning, Semester 2, 2017
Lecturer: Trevor Cohn

17. Bayesian inference; Bayesian regression
Statistical Machine Learning (S2 2017) Lecture 17

Training == optimisation(?)

Stages of learning & inference:
• Formulate model
  Classification: p(y|x) = sigmoid(x'w)
  Regression: p(y|x) = Normal(x'w, σ²)
• Fit parameters to data
  ŵ = argmax_w p(y|X,w) p(w)
• Make prediction
  Classification: p(y*|x*) = sigmoid(x*'ŵ)
  Regression: E[y*] = x*'ŵ

ŵ is referred to as a 'point estimate'
Bayesian Alternative

Nothing special about ŵ … use more than one value?
• Formulate model
  Classification: p(y|x) = sigmoid(x'w)
  Regression: p(y|x) = Normal(x'w, σ²)
• Consider the space of likely parameters: those that fit the training data well
  posterior: p(w|X,y)
• Make 'expected' prediction
  Classification: p(y*|x*) = E_{p(w|X,y)}[sigmoid(x*'w)]
  Regression: p(y*|x*) = E_{p(w|X,y)}[Normal(x*'w, σ²)]
Uncertainty

From small training sets, we rarely have complete confidence in any models learned. Can we quantify the uncertainty, and use it in making predictions?
Regression Revisited

Linear regression: y = w₀ + w₁x (here y = humidity, x = temperature)
• Learn model from data
  * minimise error residuals by choosing weights ŵ = (X'X)⁻¹X'y
• But … how confident are we
  * in ŵ?
  * in the predictions?
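The closed-form solution above can be sketched directly in NumPy. This is a minimal illustration on synthetic temperature/humidity-style data (the true weights and noise level here are made up for the example):

```python
# Sketch: closed-form least-squares weights via the normal equations,
# w_hat = (X'X)^{-1} X'y, on synthetic data with invented true weights.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=50)            # e.g. temperature
y = 80 - 1.5 * x + rng.normal(0, 2, 50)    # e.g. humidity, plus noise

X = np.column_stack([np.ones_like(x), x])  # design matrix with a bias column
w_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solve, rather than invert, for stability
print(w_hat)                               # close to the true weights [80, -1.5]
```

Using `np.linalg.solve` on the normal equations avoids forming the explicit inverse, which is both cheaper and numerically safer.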
Prediction Uncertainty

• A single prediction is of limited use due to uncertainty
  * a single number is uninformative, and may be wildly off
  * we might want to formulate a decision from the prediction, e.g., based on Pr(y < 70)
Confidence in MLE Point Estimate

• What does it mean to minimise the objective?
  * … are other nearby solutions similarly good?
• Effect of data
  * with lots of data relative to the dimensionality, the MLE is likely to be a good estimate
  * otherwise it is unreliable
• MAP is a partial solution, but still reliant on a single point
Effect of Training Sample on MLE

• Modelling y = 2x - 3
  * draw 1000s of training sets of 10 instances
  * small added noise
• Fit weights each time using MLE
  * observe variability in weights
  * peak at (2, -3)

[Figure: a single data sample with the MLE fit, noise, and ground truth line; and the empirical distribution (histogram) over learned weights w₀, w₁]
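The experiment above can be reproduced in a few lines. This sketch repeats the slide's setup (1000 training sets of 10 points from y = 2x - 3 with small noise; the noise level and x-range are assumed, as the slide does not state them):

```python
# Sketch of the slide's experiment: repeatedly sample small training sets,
# fit MLE weights each time, and inspect the spread of the fitted weights.
import numpy as np

rng = np.random.default_rng(42)
fits = []
for _ in range(1000):                        # 1000 training sets of 10 instances
    x = rng.uniform(-1, 1, size=10)
    y = 2 * x - 3 + rng.normal(0, 0.1, 10)   # small added noise (std assumed 0.1)
    X = np.column_stack([x, np.ones(10)])    # columns: w1 (slope), w0 (bias)
    fits.append(np.linalg.lstsq(X, y, rcond=None)[0])

fits = np.array(fits)
print(fits.mean(axis=0))   # empirical distribution peaks near (2, -3)
print(fits.std(axis=0))    # spread reflects uncertainty from the small sample size
```

Plotting a 2-D histogram of `fits` reproduces the peaked empirical distribution over weights shown in the figure.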
Aside: Learning the Noise Rate

• Can also learn the noise parameter, σ²
  * express the NLL as a function of σ²; differentiate; set to 0; solve
  * results in

    σ² = (1/N) Σᵢ₌₁ᴺ (yᵢ - xᵢ'ŵ)²

• Quantifies the quality of the fit
  * allows smarter decision making, e.g., P(y < 60)

N.b., we compute better error bounds later on.
[Figure: fit shown with ±σ band (68% confidence interval)]
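The noise-variance estimate above is just the mean squared residual under the fitted weights. A minimal sketch on synthetic data (true weights and noise level invented for the example):

```python
# Sketch: MLE of the noise variance, sigma^2 = (1/N) * sum_i (y_i - x_i'w)^2,
# i.e. the mean squared residual under the fitted weights.
import numpy as np

rng = np.random.default_rng(1)
N = 2000
x = rng.uniform(-1, 1, N)
y = 2 * x - 3 + rng.normal(0, 0.5, N)      # true noise std assumed to be 0.5

X = np.column_stack([np.ones(N), x])
w = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2 = np.mean((y - X @ w) ** 2)         # MLE of the noise variance
print(sigma2)                              # close to 0.5**2 = 0.25
```

With the fitted σ², decision-relevant quantities like P(y < 60) follow from the Normal CDF at the predicted mean.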
Do We Trust the Point Estimate ŵ?

• How stable is learning?
  * ŵ is highly sensitive to noise
  * how much uncertainty is there in the parameter estimate?
  * more informative if the NLL objective is highly peaked
• Formalised as the Fisher Information matrix
  * E[2nd derivative of the NLL]
  * measures the curvature of the objective about ŵ

  I = (1/σ²) X'X

Figure: Rogers and Girolami, p. 81
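For linear-Gaussian regression, the Fisher information has the simple closed form above. A small sketch (design matrix and σ² chosen arbitrarily for illustration):

```python
# Sketch: Fisher information for linear-Gaussian regression, I = (1/sigma^2) X'X.
# Large I (sharp NLL curvature) means w_hat is well pinned down; its inverse is
# the asymptotic covariance of the MLE.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.25                               # assumed known noise variance
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])

I = X.T @ X / sigma2                        # Fisher information matrix
cov = np.linalg.inv(I)                      # approximate covariance of w_hat
print(cov)
```

Doubling the amount of data roughly doubles I, halving the variance of the estimate, which matches the intuition that the MLE is reliable only with lots of data relative to the dimensionality.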
The Bayesian View

Retain and model all unknowns (e.g., uncertainty over parameters) and use this information when making inferences.
A Bayesian View

• Could we reason over all parameters that are consistent with the data?
  * weights with a better fit to the training data should be more probable than others
  * make predictions with all these weights, scaled by their probability
• This is the idea underlying Bayesian inference
Uncertainty Over Parameters

• Many reasonable solutions to the objective
  * why select just one?
• Reason under all possible parameter values
  * weighted by their posterior probability
• More robust predictions
  * less sensitive to overfitting, particularly with small training sets
  * can give rise to a more expressive model class (Bayesian logistic regression becomes non-linear!)
Frequentist vs Bayesian Divide

• Frequentist: learning using point estimates, regularisation, p-values …
  * backed by complex theory relying on strong assumptions
  * mostly simpler algorithms; characterises much practical machine learning research
• Bayesian: maintain uncertainty, marginalise (sum) out unknowns during inference
  * nicer theory with fewer assumptions
  * often more complex algorithms, but not always
  * when possible, results in more elegant models
Bayesian Regression

Application of Bayesian inference to linear regression, using a Normal prior over w.
Revisiting Linear Regression

• Recall the probabilistic formulation of linear regression

  y ~ Normal(x'w, σ²)
  w ~ Normal(0, γ²I_D)        (I_D = D×D identity matrix)

• Motivated by Bayes' rule

  p(w|X,y) = p(y|X,w) p(w) / p(y|X)

• Gives rise to the penalised RSS objective

  max_w p(w|X,y) = max_w p(y|X,w) p(w)

  A point estimate is taken here, which avoids computing the marginal likelihood term.
Bayesian Linear Regression

• Rewind one step, and consider the full posterior

  p(w|X,y,σ²) = p(y|X,w,σ²) p(w) / p(y|X)
              = p(y|X,w,σ²) p(w) / ∫ p(y|X,w,σ²) p(w) dw

  (here we assume the noise variance σ² is known)

• Can we compute the denominator (the marginal likelihood or evidence)?
  * if so, we can use the full posterior, not just its mode
Bayesian Linear Regression (cont.)

• We have two Normal distributions
  * normal likelihood × normal prior
• Their product is also a Normal distribution
  * conjugate prior: when the product of likelihood × prior results in the same distribution as the prior
  * the evidence can be computed easily using the normalising constant of the Normal distribution

  p(w|X,y,σ²) ∝ Normal(w|0, γ²I_D) × Normal(y|Xw, σ²I_N)
              ∝ Normal(w|w_N, V_N)

  A closed-form solution for the posterior!
Bayesian Linear Regression (cont.)

  p(w|X,y,σ²) ∝ Normal(w|0, γ²I_D) × Normal(y|Xw, σ²I_N)
              ∝ Normal(w|w_N, V_N)

where

  w_N = (1/σ²) V_N X'y
  V_N = σ² (X'X + (σ²/γ²) I_D)⁻¹

Note that the mean (and mode) is the MAP solution from before.

Advanced: verify by expressing the product of the two Normals, gathering the exponents together and 'completing the square' to express it as a squared exponential (i.e., a Normal distribution).
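The posterior parameters above translate directly into a few lines of NumPy. This sketch uses invented data and arbitrarily chosen values for the noise variance σ² and prior variance γ²:

```python
# Sketch of the closed-form posterior for Bayesian linear regression:
#   V_N = sigma^2 (X'X + (sigma^2/gamma^2) I_D)^{-1}
#   w_N = (1/sigma^2) V_N X'y
import numpy as np

rng = np.random.default_rng(3)
sigma2, gamma2 = 0.25, 1.0                 # noise and prior variances (assumed)
x = rng.uniform(-1, 1, 30)
y = 2 * x - 3 + rng.normal(0, np.sqrt(sigma2), 30)
X = np.column_stack([np.ones(30), x])      # columns: bias, slope
D = X.shape[1]

V_N = sigma2 * np.linalg.inv(X.T @ X + (sigma2 / gamma2) * np.eye(D))
w_N = V_N @ X.T @ y / sigma2               # posterior mean = MAP (ridge) solution
print(w_N)                                 # near the true weights (-3, 2)
print(V_N)                                 # posterior covariance
```

Note that w_N coincides with the ridge-regression solution with penalty σ²/γ², matching the remark that the posterior mean (and mode) is the MAP estimate from before.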
Bayesian Linear Regression Example

[Figure, four panels:
 Step 1: select prior, here spherical about 0
 Step 2: observe training data
 Step 3: formulate posterior, from prior & likelihood
 Samples from the posterior]
Sequential Bayesian Updating

• Can formulate p(w|X,y,σ²) for a given dataset
• What happens as we see more and more data?
  1. Start from the prior p(w)
  2. See a new labelled data point
  3. Compute the posterior p(w|X,y,σ²)
  4. The posterior now takes the role of the prior; repeat from step 2
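The update loop above can be sketched by tracking the posterior in its natural (precision) parameterisation, where each new point simply adds to the precision and the precision-weighted mean. The data-generating weights, noise level, and prior variance are all assumed for the example:

```python
# Sketch of sequential Bayesian updating: process points one at a time, with
# each posterior serving as the prior for the next point. The final result
# matches the batch posterior computed from all data at once.
import numpy as np

rng = np.random.default_rng(7)
sigma2, gamma2 = 0.25, 1.0
D = 2
P = np.eye(D) / gamma2                     # prior precision (inverse covariance)
b = np.zeros(D)                            # precision-weighted mean

for _ in range(50):
    xi = rng.uniform(-1, 1)
    yi = 2 * xi - 3 + rng.normal(0, np.sqrt(sigma2))
    phi = np.array([1.0, xi])              # features of the new point: bias, x
    P += np.outer(phi, phi) / sigma2       # posterior precision update
    b += phi * yi / sigma2                 # posterior mean (scaled) update

V_N = np.linalg.inv(P)                     # posterior covariance after 50 points
w_N = V_N @ b                              # posterior mean, near (-3, 2)
print(w_N)
```

Because the likelihood terms enter additively in this parameterisation, the order in which points arrive does not matter: one point at a time or the whole batch gives the same posterior.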
Sequential Bayesian Updating (cont.)

Bishop Fig 3.7, p. 155: "Illustration of sequential Bayesian learning for a simple linear model of the form y(x,w) = w₀ + w₁x. A detailed description of this figure is given in the text."

• Initially we know little; many regression lines are licensed
• The likelihood constrains the possible weights such that the regression line is close to the observed point
• The posterior becomes more refined/peaked as more data is introduced
• It approaches a point mass about the solution
Summary

• Uncertainty is not captured by point estimates (MLE, MAP)
• The Bayesian approach preserves uncertainty
  * care about predictions NOT parameters
  * choose a prior over parameters, then model the posterior
• New concepts:
  * sequential Bayesian updating
  * conjugate prior (Normal-Normal)
• Still to come … using the posterior for Bayesian predictions on test data