Introduction to Machine Learning Summer School
June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
22 June 2018

Day 5: Generative models, structured classification
Topics so far

• Linear regression
• Classification
  o nearest neighbors, decision trees, logistic regression
• Yesterday
  o maximum margin classifiers, kernel trick
• Today
  o quick review of probability
  o generative models - naive Bayes classifier
  o structured prediction - conditional random fields
Several slides adapted from David Sontag, who in turn credits Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Vibhav Gogate.
Bayesian/probabilistic learning

• Uses probability to model data and/or quantify uncertainties in predictions
  o systematic framework to incorporate prior knowledge
  o framework for composing and reasoning about uncertainty
  o what is the confidence in the prediction given the observations so far?
• Model assumptions need not hold (and often do not hold) in reality
  o even so, many probabilistic models work really well in practice
Quick overview of random variables

• Random variables: a variable about which we (may) have uncertainty
  o e.g., W = weather tomorrow, or T = temperature
• For every random variable X, the domain 𝒳 of X is the set of values X can take
• Discrete random variables: the probability distribution is a table
  o for a discrete RV X: ∀x ∈ 𝒳, Pr(X = x) ≥ 0 and ∑_{x∈𝒳} Pr(X = x) = 1
• Continuous random variable X with domain 𝒳 ⊆ ℝ
  o cumulative distribution function F_X(t) = Pr(X ≤ t)
    § again F_X(t) ∈ [0, 1], and also F_X(−∞) = 0, F_X(+∞) = 1
  o probability density function (if it exists) p_X(t) = dF_X(t)/dt
    § always nonnegative, but can be greater than 1
  o e.g., a distribution table for weather might specify Pr(W = sun) = 0.6
• Expectation: for a discrete RV, E[f(X)] = ∑_{x∈𝒳} f(x) Pr(X = x)
• Mean: E[X]
• Variance: E[(X − E[X])²]
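These definitions are easy to check numerically; a minimal sketch with a made-up distribution table for a discrete RV:

```python
import numpy as np

# Hypothetical distribution table for a discrete RV X taking values 0, 1, 2.
values = np.array([0.0, 1.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])   # nonnegative, sums to 1

# E[f(X)] = sum over x of f(x) * Pr(X = x); here f(x) = x**2
exp_f = (values**2 * probs).sum()

# Mean E[X] and variance E[(X - E[X])**2]
mean = (values * probs).sum()
var = ((values - mean)**2 * probs).sum()
```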
Joint distributions

• The joint distribution of random variables X_1, X_2, …, X_d is defined for all x_1 ∈ 𝒳_1, x_2 ∈ 𝒳_2, …, x_d ∈ 𝒳_d:
  p(x_1, x_2, …, x_d) = Pr(X_1 = x_1, X_2 = x_2, …, X_d = x_d)
• How many numbers are needed for d variables, each having a domain of K values?
  o K^d!! Too many numbers; usually some assumption is made to reduce the number of probabilities
Marginal distribution

• Sub-tables obtained by elimination of variables
• Probability distribution of a subset of the variables
• Given: joint distribution p(x_1, x_2, …, x_d) = Pr(X_1 = x_1, X_2 = x_2, …, X_d = x_d) for x_1 ∈ 𝒳_1, x_2 ∈ 𝒳_2, …, x_d ∈ 𝒳_d
• Say we want the marginal of just x_1, x_2, x_4, that is, we want p(x_1, x_2, x_4) = Pr(X_1 = x_1, X_2 = x_2, X_4 = x_4)
• This can be obtained by marginalizing out the other variables:
  p(x_1, x_2, x_4) = ∑_{z_3∈𝒳_3} ∑_{z_5∈𝒳_5} … ∑_{z_d∈𝒳_d} p(x_1, x_2, z_3, x_4, z_5, …, z_d)
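Marginalization is just summing the joint table over the eliminated axes. A minimal sketch with a made-up 2x2x2 joint table over three binary variables:

```python
import numpy as np

# Hypothetical joint distribution over three binary variables (X1, X2, X3),
# stored as a 2x2x2 table that sums to 1; the numbers are made up.
joint = np.array([[[0.10, 0.05], [0.15, 0.10]],
                  [[0.20, 0.05], [0.25, 0.10]]])

# Marginal of (X1, X2): sum out X3 (axis 2).
p_x1_x2 = joint.sum(axis=2)

# Marginal of X1 alone: sum out both X2 and X3.
p_x1 = joint.sum(axis=(1, 2))
```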
Conditioning

• Probability distributions of a subset of variables with fixed values of the others
• Random variables X and Y with domains 𝒳 and 𝒴:
  Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y)
• Conditional expectation:
  E[f(X) | Y = y] = ∑_{x∈𝒳} f(x) Pr(X = x | Y = y)
• h(y) = E[f(X) | Y = y] is a function of y
• h(Y) is a random variable with distribution given by Pr(h(Y) = h(y)) = Pr(Y = y)
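Conditioning on Y = y amounts to taking a slice of the joint table and renormalizing by Pr(Y = y). A sketch with a made-up 2x2 joint table:

```python
import numpy as np

# Hypothetical joint table over X (rows, values 0/1) and Y (columns, values 0/1).
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

# Conditioning on Y = 0: take the column and renormalize by Pr(Y = 0).
p_y = joint.sum(axis=0)              # marginal of Y
p_x_given_y0 = joint[:, 0] / p_y[0]  # Pr(X = x | Y = 0)

# E[f(X) | Y = 0] with f(x) = x for X in {0, 1}
f = np.array([0.0, 1.0])
cond_exp = (f * p_x_given_y0).sum()
```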
Product rule

• Going from conditional distributions to the joint distribution:
  Pr(X = x | Y = y) = Pr(X = x, Y = y) / Pr(Y = y)
  ⟹ Pr(X = x, Y = y) = Pr(Y = y) Pr(X = x | Y = y)
• What about three variables?
  Pr(X_1 = x_1, X_2 = x_2, X_3 = x_3)
    = Pr(X_1 = x_1) Pr(X_2 = x_2 | X_1 = x_1) Pr(X_3 = x_3 | X_1 = x_1, X_2 = x_2)
• More generally,
  Pr(X_1 = x_1, X_2 = x_2, …, X_d = x_d)
    = Pr(X_1 = x_1) ∏_{i=2}^{d} Pr(X_i = x_i | X_{i−1} = x_{i−1}, X_{i−2} = x_{i−2}, …, X_1 = x_1)
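The product rule can be sanity-checked numerically: factor a made-up joint table into Pr(X = x) and Pr(Y = y | X = x), then multiply them back:

```python
import numpy as np

# Hypothetical 2x2 joint table; rows index X, columns index Y.
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])

p_x = joint.sum(axis=1)             # Pr(X = x)
p_y_given_x = joint / p_x[:, None]  # Pr(Y = y | X = x), row-normalized

# Product rule: Pr(X = x, Y = y) = Pr(X = x) * Pr(Y = y | X = x)
reconstructed = p_x[:, None] * p_y_given_x
```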
Optimal unrestricted classifier

• C-class classification problem, 𝒴 = {1, 2, …, C}
• Population distribution: let (x, y) ∼ 𝒟
• Consider the population 0-1 loss (the risk) of a classifier ŷ(x):
  L(ŷ) ≜ E_{x,y}[1[y ≠ ŷ(x)]] = Pr_{x,y}(y ≠ ŷ(x)) = E_x[Pr(y ≠ ŷ(x) | x)]
• The conditional risk L(ŷ | x) = Pr(y ≠ ŷ(x) | x) = 1 − Pr(y = ŷ(x) | x)
• Optimal unrestricted classifier, or Bayes optimal classifier:
  ŷ**(x) = argmax_c Pr(y = c | x)
• Check that the conditional risk is minimized for ŷ(x) = argmax_c Pr(y = c | x)
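The Bayes optimal rule can be sketched directly from a hypothetical table of conditional probabilities Pr(y = c | x): predict the column argmax, and the risk is one minus the column maximum, averaged over x:

```python
import numpy as np

# Hypothetical: 3 classes, 4 discrete values of x.
# Column j holds Pr(y = c | x = j); each column sums to 1.
p_y_given_x = np.array([[0.6, 0.2, 0.1, 0.3],
                        [0.3, 0.5, 0.2, 0.4],
                        [0.1, 0.3, 0.7, 0.3]])
p_x = np.array([0.25, 0.25, 0.25, 0.25])  # made-up marginal of x

# Bayes optimal classifier: predict argmax_c Pr(y = c | x)
y_hat = p_y_given_x.argmax(axis=0)

# Its risk: E_x[1 - max_c Pr(y = c | x)]
bayes_risk = (p_x * (1.0 - p_y_given_x.max(axis=0))).sum()
```

No classifier can achieve lower 0-1 risk than `bayes_risk` on this distribution, since any other choice per column forgoes the column maximum.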
Generative vs discriminative models

• Recall the optimal unrestricted predictors for the following cases:
  o regression + squared loss → f**(x) = E[y | x]
  o classification + 0-1 loss → ŷ**(x) = argmax_c Pr(y = c | x)
• Non-probabilistic approach: don't deal with probabilities, just fit f(x) directly to the data
• Discriminative models: estimate/infer the conditional density Pr(y | x)
  o typically uses a parametric model f_w(x) of Pr(y | x)
• Generative models: estimate the full joint probability density Pr(y, x)
  o normalize to find the conditional density Pr(y | x)
  o specify models for Pr(x, y), or for Pr(x | y) and Pr(y)
  o why? In two slides!
Bayes rule

• Optimal classifier: ŷ**(x) = argmax_c Pr(y = c | x)
• Bayes rule: Pr(x, y) = Pr(y | x) Pr(x) = Pr(x | y) Pr(y), so
  ŷ**(x) = argmax_c Pr(y = c | x)
         = argmax_c Pr(x | y = c) Pr(y = c) / Pr(x)
         = argmax_c Pr(x | y = c) Pr(y = c)
• Why is this rewriting helpful?
  o one of the conditionals might be tricky to model with prior knowledge, while the other is simple
  o e.g., say we want to specify a model for digit recognition from binary images
    § compare specifying Pr(image | digit = 1) vs Pr(digit = 1 | image)
Generative model for classification

  argmax_c Pr(y = c | x) = argmax_c Pr(x | y = c) Pr(y = c)

• C-class classification with discrete features x = (x_1, x_2, …, x_d) and y ∈ {1, 2, …, C}
• Want to specify Pr(x | y) = Pr(x_1, x_2, …, x_d | y)
• If each of x_1, x_2, …, x_d can take one of K values, how many parameters are needed to specify Pr(x | y)?
  o CK^d!! Too many
Naive Bayes assumption

• Specifying Pr(x | y) = Pr(x_1, x_2, …, x_d | y) in full requires order CK^d parameters
• Naive Bayes assumption: the features are independent given the class y
  o e.g., for two features: Pr(x_1, x_2 | y) = Pr(x_1 | y) Pr(x_2 | y)
  o more generally:
    Pr(x_1, x_2, …, x_d | y) = Pr(x_1 | y) Pr(x_2 | y) ⋯ Pr(x_d | y) = ∏_{i=1}^{d} Pr(x_i | y)
• Number of parameters if each of x_1, x_2, …, x_d can take one of K values?
  o CKd - linear in d instead of exponential
Naive Bayes classifier

• Naive Bayes assumption: the features are independent given the class:
  Pr(x_1, x_2, …, x_d | y) = ∏_{i=1}^{d} Pr(x_i | y)
• C classes 𝒴 = {1, 2, …, C}, d binary features, 𝒳 = {0, 1}^d
• Model parameters: specify from prior knowledge and/or learn from data
  o priors Pr(y = c) → #parameters: C − 1
  o conditional probabilities Pr(x_i = 1 | y = c) → #parameters: Cd
    § if x_1, x_2, …, x_d take one of K discrete values rather than binary → #parameters: (K − 1)Cd
    § if x_1, x_2, …, x_d are continuous, additionally model Pr(x_i | y = c) as some parametric distribution, like a Gaussian Pr(x_i | y = c) ∼ 𝒩(μ_{i,c}, σ), and estimate the parameters (μ_{i,c}, σ) from the data
• Classifier rule:
  ŷ_NB(x) = argmax_c Pr(x_1, x_2, …, x_d | y = c) Pr(y = c)
          = argmax_c Pr(y = c) ∏_{i=1}^{d} Pr(x_i | y = c)
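The classifier rule can be sketched as follows. The prior and conditional probabilities here are made-up numbers, and the product is computed in log space to avoid numerical underflow (an implementation choice, not part of the slides):

```python
import numpy as np

# Hypothetical naive Bayes parameters: C = 2 classes, d = 3 binary features.
prior = np.array([0.6, 0.4])           # Pr(y = c)
theta = np.array([[0.8, 0.1, 0.5],     # Pr(x_i = 1 | y = 0)
                  [0.2, 0.7, 0.5]])    # Pr(x_i = 1 | y = 1)

def predict(x):
    """Return argmax_c Pr(y = c) * prod_i Pr(x_i | y = c), computed in log space."""
    # log Pr(x_i | y = c) for binary x_i: x_i*log(theta) + (1-x_i)*log(1-theta)
    log_lik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return int(np.argmax(np.log(prior) + log_lik))
```

For example, `predict(np.array([1, 0, 1]))` compares 0.6 * 0.8 * 0.9 * 0.5 against 0.4 * 0.2 * 0.3 * 0.5 and picks class 0.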
Digit recognizer
[figure omitted] Slide credit: David Sontag

What has to be learned?
[figure omitted] Slide credit: David Sontag
MLE for parameters of NB

• Training dataset S = {(x^(i), y^(i)) : i = 1, 2, …, N}
• Maximum likelihood estimation for naive Bayes with discrete features and labels
• Assume S has i.i.d. examples
  o prior: what is the probability of observing label y?
    Pr(y = c) = (∑_{i=1}^{N} 1[y^(i) = c]) / N
  o conditional distribution:
    Pr(x_j = z_j | y = c) = (∑_{i=1}^{N} 1[x_j^(i) = z_j, y^(i) = c]) / (∑_{i=1}^{N} 1[y^(i) = c])
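The MLE formulas are just counting. A sketch on a made-up toy dataset (N = 6 examples, d = 2 binary features, C = 2 classes):

```python
import numpy as np

# Made-up training data: rows are examples, columns are binary features.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([0, 0, 1, 1, 0, 1])

classes = np.array([0, 1])

# Prior: Pr(y = c) = (#examples with label c) / N
prior = np.array([(y == c).mean() for c in classes])

# Conditional: Pr(x_j = 1 | y = c) = count(x_j = 1, y = c) / count(y = c)
theta = np.array([X[y == c].mean(axis=0) for c in classes])
```

Note that `theta[1, 0] = 0` here: x_1 = 1 never occurs together with class 1 in this sample, so the MLE assigns it probability zero; that is exactly what smoothing (next) fixes.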
MLE for parameters of NB [worked example, figure omitted] Slide credit: David Sontag
Smoothing for parameters of NB

• Training dataset S = {(x^(i), y^(i)) : i = 1, 2, …, N} with i.i.d. examples
• The maximum likelihood estimates above assign zero probability to any feature value never observed together with a class; smoothing with a small ε > 0 avoids this
  o prior: Pr(y = c) = (∑_{i=1}^{N} 1[y^(i) = c]) / N
  o smoothed conditional distribution:
    Pr(x_j = z_j | y = c) = (∑_{i=1}^{N} 1[x_j^(i) = z_j, y^(i) = c] + ε) / (∑_{i=1}^{N} 1[y^(i) = c] + ∑_{z_j} ε)
  o the denominator adds ε once per possible value of z_j, so the smoothed estimates still sum to 1
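A sketch of the smoothed conditional estimate on a made-up toy dataset, with ε = 1 (which corresponds to Laplace smoothing):

```python
import numpy as np

# Made-up training data: 6 examples, 2 binary features, 2 classes.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([0, 0, 1, 1, 0, 1])

eps = 1.0  # smoothing parameter
K = 2      # number of values each feature can take (binary)

classes = np.array([0, 1])
# Smoothed Pr(x_j = 1 | y = c) = (count(x_j = 1, y = c) + eps) / (count(y = c) + K * eps)
theta = np.array([(X[y == c].sum(axis=0) + eps) / ((y == c).sum() + K * eps)
                  for c in classes])
```

Compared with the plain MLE, no entry of `theta` is 0 or 1, so unseen feature/class combinations no longer zero out the likelihood at prediction time.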
Missing features

One of the key strengths of Bayesian approaches is that they can naturally handle missing data.

• What happens if we don't have the value of some feature x_j^(i)?
  o e.g., an applicant's credit history is unknown
  o e.g., some medical tests were not performed
• How to compute Pr(x_1, x_2, …, x_{j−1}, ?, x_{j+1}, …, x_d | y)?
  o e.g., three coin tosses: E = {H, ?, T}
  o ⟹ Pr(E) = Pr({H, H, T}) + Pr({H, T, T})
• More generally:
  Pr(x_1, …, x_{j−1}, ?, x_{j+1}, …, x_d | y) = ∑_{z_j} Pr(x_1, …, x_{j−1}, z_j, x_{j+1}, …, x_d | y)

Slide credit: David Sontag
Missing features in naive Bayes

  Pr(x_1, …, x_{j−1}, ?, x_{j+1}, …, x_d | y)
    = ∑_{z_j} Pr(x_1, …, x_{j−1}, z_j, x_{j+1}, …, x_d | y)
    = ∑_{z_j} Pr(z_j | y) ∏_{i≠j} Pr(x_i | y)
    = [∏_{i≠j} Pr(x_i | y)] ∑_{z_j} Pr(z_j | y)
    = ∏_{i≠j} Pr(x_i | y)

• Simply ignore the missing values and compute the likelihood based only on the observed features
• No need to fill in or explicitly model the missing values
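The derivation above means prediction simply skips the missing factors, since ∑_{z_j} Pr(z_j | y) = 1. A sketch with made-up parameters, using None to mark a missing feature:

```python
import numpy as np

# Hypothetical naive Bayes parameters: 2 classes, 3 binary features.
prior = np.array([0.6, 0.4])
theta = np.array([[0.8, 0.1, 0.5],   # Pr(x_i = 1 | y = 0)
                  [0.2, 0.7, 0.5]])  # Pr(x_i = 1 | y = 1)

def predict(x):
    """x is a list with None marking missing features; sum only observed log terms."""
    scores = np.log(prior)
    for c in (0, 1):
        for i, xi in enumerate(x):
            if xi is None:
                continue  # sum_z Pr(z | y) = 1, so the missing factor drops out
            p = theta[c, i] if xi == 1 else 1 - theta[c, i]
            scores[c] += np.log(p)
    return int(np.argmax(scores))
```

No imputation is needed: the missing coordinate contributes the same factor 1 to every class score, so it cannot change the argmax.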
Naive Bayes
• Generative model
  o models Pr(x | y) and Pr(y)
• Prediction: models the full joint distribution and uses Bayes rule to get Pr(y | x)
• Can generate data given a label
• Naturally handles missing data

Logistic regression
• Discriminative model
  o models Pr(y | x)
• Prediction: directly models what we want, Pr(y | x)
• Cannot generate data
• Cannot handle missing data easily