CPSC 340: Machine Learning and Data Mining Nonlinear Regression Fall 2019


Page 1: CPSC 340: Machine Learning and Data Mining (fwood/CS340/lectures/L13.pdf)

CPSC 340: Machine Learning and Data Mining

Nonlinear Regression, Fall 2019

Page 2:

Last Time: Linear Regression
• We discussed linear models:
  ŷi = w1xi1 + w2xi2 + … + wdxid = wTxi.
• "Multiply feature xij by weight wj, add them together to get ŷi."
• We discussed the squared error function:
  f(w) = ½ Σi (wTxi – yi)².
• Interactive demo:
  – http://setosa.io/ev/ordinary-least-squares-regression

http://www.bloomberg.com/news/articles/2013-01-10/the-dunbar-number-from-the-guru-of-social-networks

Page 3:

Matrix/Norm Notation (MEMORIZE/STUDY THIS)
• To solve the d-dimensional least squares problem, we use matrix notation:
  – We use 'w' as a "d times 1" vector containing weight 'wj' in position 'j'.
  – We use 'y' as an "n times 1" vector containing target 'yi' in position 'i'.
  – We use 'xi' as a "d times 1" vector containing feature 'j' of example 'i' in position 'j'.
• We're now going to be careful to make sure these are column vectors.
  – So 'X' is a matrix with xiT in row 'i'.

Page 4:

Matrix/Norm Notation (MEMORIZE/STUDY THIS)
• To solve the d-dimensional least squares problem, we use matrix notation:
  – Our prediction for example 'i' is given by the scalar wTxi.
  – Our predictions for all 'i' (an n times 1 vector) are given by the matrix-vector product Xw.

Page 5:

Matrix/Norm Notation (MEMORIZE/STUDY THIS)
• To solve the d-dimensional least squares problem, we use matrix notation:
  – Our prediction for example 'i' is given by the scalar wTxi.
  – Our predictions for all 'i' (an n times 1 vector) are given by the matrix-vector product Xw.
  – The residual vector 'r' gives the difference between the predictions and yi (n times 1): r = Xw – y.
  – Least squares can be written as the squared L2-norm of the residual: f(w) = ½‖Xw – y‖² = ½‖r‖².

Page 6:

Back to Deriving Least Squares for d > 2…
• We can write the vector of predictions ŷ as a matrix-vector product: ŷ = Xw.
• And we can write linear least squares in matrix notation as: f(w) = ½‖Xw – y‖².
• We'll use this notation to derive the d-dimensional least squares 'w'.
  – By setting the gradient ∇f(w) equal to the zero vector and solving for 'w'.

Page 7:

Digression: Matrix Algebra Review
• Quick review of the linear algebra operations we'll use:
  – If 'a' and 'b' are vectors, and 'A' and 'B' are matrices, then:
    • aTb = bTa, aTa = ‖a‖², (A + B)T = AT + BT, (AB)T = BTAT.

Page 8:

Linear and Quadratic Gradients
• From these rules we have (see the post-lecture slide for the steps):
  f(w) = ½‖Xw – y‖² = ½wTXTXw – wTXTy + ½yTy.
• How do we compute the gradient?

Page 9:

Linear and Quadratic Gradients
• We've written f(w) as a d-dimensional quadratic:
  f(w) = ½wTAw – wTb + c, with A = XTX, b = XTy, c = ½yTy.
• The gradient is given by:
  ∇f(w) = Aw – b.
• Using the definitions of 'A' and 'b':
  ∇f(w) = XTXw – XTy.

Page 10:

Normal Equations
• Set the gradient equal to zero to find the "critical" points:
  XTXw – XTy = 0.
• We now move the terms not involving 'w' to the other side:
  XTXw = XTy.
• This is a set of 'd' linear equations called the "normal equations".
  – This is a linear system like "Ax = b" from Math 152.
• You can use Gaussian elimination to solve for 'w'.
  – In Julia, the "\" command can be used to solve linear systems.
  – In Python, you can solve linear systems in one line using numpy.linalg.solve.
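As a quick sketch of the Python route (with made-up toy data, not from the slides), we can solve the normal equations with numpy.linalg.solve and check against NumPy's built-in least squares solver:

```python
import numpy as np

# Made-up toy data: n = 100 examples, d = 3 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

# Solve the normal equations (X^T X) w = (X^T y).
w = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least squares problem
# (and is more numerically stable for ill-conditioned X).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes minimize the same objective, so the two solutions agree whenever the columns of X are linearly independent.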

Page 12:

Incorrect Solutions to Least Squares Problem

Page 13:

Least Squares Cost
• What is the cost of solving the "normal equations" XTXw = XTy?
• Forming the XTy vector costs O(nd).
  – It has 'd' elements, and each is an inner product between 'n' numbers.
• Forming the matrix XTX costs O(nd²).
  – It has d² elements, and each is an inner product between 'n' numbers.
• Solving a d-by-d system of equations costs O(d³).
  – This is the cost of Gaussian elimination on a d-variable linear system.
  – Other standard methods have the same cost.
• The overall cost is O(nd² + d³).
  – Which term dominates depends on 'n' and 'd'.

Page 14:

Least Squares Issues
• Issues with the least squares model:
  – The solution might not be unique.
  – It is sensitive to outliers.
  – It always uses all features.
  – The data might be so big that we can't store XTX.
    • Or you can't afford the O(nd² + d³) cost.
  – It might predict outside the range of the yi values.
  – It assumes a linear relationship between xi and yi.

Page 15:

Non-Uniqueness of Least Squares Solution
• Why isn't the solution unique?
  – Imagine having two features that are identical for all examples.
  – I can increase the weight on one feature, and decrease it on the other, without changing the predictions.
  – Thus, if (w1, w2) is a solution then (w1 + w2, 0) is another solution.
  – This is a special case of features being "collinear":
    • One feature is a linear function of the others.
• But, any 'w' where ∇f(w) = 0 is a global minimizer of 'f'.
  – This is due to the convexity of 'f', which we'll discuss later.
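A minimal sketch of the non-uniqueness (toy data, not from the slides): two identical columns make XTX singular, and different weight vectors give identical predictions.

```python
import numpy as np

# Two identical features for all n = 5 examples.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([x, x])

# (w1, w2) = (1, 1) and (w1 + w2, 0) = (2, 0) make the same predictions.
w_a = np.array([1.0, 1.0])
w_b = np.array([2.0, 0.0])
print(np.allclose(X @ w_a, X @ w_b))  # True

# XTX is singular (rank 1), so the normal equations have many solutions.
print(np.linalg.matrix_rank(X.T @ X))  # 1
```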

Page 16:

(pause)

Page 17:

Motivation: Non-Linear Progressions in Athletics

• Are top athletes going faster, higher, and farther?

http://www.at-a-lanta.nl/weia/Progressie.html
https://en.wikipedia.org/wiki/Usain_Bolt
http://www.britannica.com/biography/Florence-Griffith-Joyner

Page 18:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:

http://www.at-a-lanta.nl/weia/Progressie.html

Page 19:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.

http://www.at-a-lanta.nl/weia/Progressie.html

Page 20:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
    • CPSC 540.

https://en.wikipedia.org/wiki/Multivariate_normal_distribution

Page 21:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression:
      – Find the 'k' nearest neighbours of xi.
      – Return the mean of the corresponding yi.

http://scikit-learn.org/stable/modules/neighbors.html
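The KNN regression steps above can be sketched in a few lines (a hypothetical helper with made-up data, not code from the course):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest training examples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of k closest
    return y_train[nearest].mean()                     # mean of their targets

X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(knn_regress(X_train, y_train, np.array([1.9]), k=3))  # 2.0
```

The query at 1.9 has nearest neighbours at x = 2, 1, 3, so the prediction is the mean of their targets, 2.0.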

Page 22:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression.
    • Could be weighted by distance.
      – Close points 'j' get more "weight" wij.

http://scikit-learn.org/stable/modules/neighbors.html

Page 23:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression.
    • Could be weighted by distance.
    • 'Nadaraya-Watson': weight all yi by distance to xi.

http://www.mathworks.com/matlabcentral/fileexchange/35316-kernel-regression-with-variable-window-width/content/ksr_vw.m

Page 24:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression.
    • Could be weighted by distance.
    • 'Nadaraya-Watson': weight all yi by distance to xi.
    • 'Locally linear regression': for each xi, fit a linear model weighted by distance.
      (Better than KNN and NW at the boundaries.)

http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd423.htm

Page 25:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression.
    • Could be weighted by distance.
    • 'Nadaraya-Watson': weight all yi by distance to xi.
    • 'Locally linear regression': for each xi, fit a linear model weighted by distance.
      (Better than KNN and NW at the boundaries.)
  – Ensemble methods:
    • Can improve performance by averaging across regression models.

Page 26:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression.

• Applications:
  – Regression forests for fluid simulation:
    • https://www.youtube.com/watch?v=kGB7Wd9CudA
  – KNN for image completion:
    • http://graphics.cs.cmu.edu/projects/scene-completion
    • Combined with "graph cuts" and "Poisson blending".
  – KNN regression for "voice photoshop":
    • https://www.youtube.com/watch?v=I3l4XLZ59iw
    • Combined with "dynamic time warping" and "Poisson blending".

• But we'll focus on linear models with non-linear transforms.
  – These are the building blocks for more advanced methods.

http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd423.htm

Page 27:

Why don't we have a y-intercept?
  – The linear model is ŷi = wxi instead of ŷi = wxi + w0 with y-intercept w0.
  – Without an intercept, if xi = 0 then we must predict ŷi = 0.


Page 29:

Adding a Bias Variable
• A simple trick to add a y-intercept ("bias") variable:
  – Make a new matrix 'Z' with an extra feature that is always "1".
• Now use 'Z' as your features in linear regression.
  – We'll use 'v' instead of 'w' as the regression weights when we use the features 'Z'.
• So we can have a non-zero y-intercept by changing the features.
  – This means we can ignore the y-intercept in our derivations, which is cleaner.
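A sketch of the bias trick with made-up data: prepend a column of ones to X, so the first entry of 'v' plays the role of the y-intercept w0.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 1))
y = 3.0 + 2.0 * x[:, 0]  # data with a non-zero y-intercept

# Z = [1 X]: an extra feature that is always 1.
Z = np.column_stack([np.ones(len(x)), x])

# Ordinary least squares on Z recovers intercept and slope together.
v = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(v)  # approximately [3.0, 2.0]
```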

Page 30:

Motivation: Limitations of Linear Models
• On many datasets, yi is not a linear function of xi.

• Can we use least squares to fit non-linear models?

Page 31:

Non-Linear Feature Transforms
• Can we use linear least squares to fit a quadratic model?
  ŷi = w0 + w1xi + w2xi².
• You can do this by changing the features (change of basis):
  zi = [1, xi, xi²].
• Fit the new parameters 'v' under the "change of basis": solve ZTZv = ZTy.
• The model is a linear function of the parameters 'v', but a quadratic function of xi.

Page 32:

Non-Linear Feature Transforms

Page 33:

General Polynomial Features (d = 1)
• We can have a polynomial of degree 'p' by using these features:
  zi = [1, xi, xi², …, xi^p].
• There are polynomial basis functions that are numerically nicer:
  – E.g., Lagrange polynomials (see CPSC 303).
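The degree-p features above can be built as a Vandermonde matrix and fit with the normal equations ZTZv = ZTy (a sketch with made-up quadratic data):

```python
import numpy as np

def poly_features(x, p):
    """Z with row i equal to [1, x_i, x_i^2, ..., x_i^p]."""
    return np.vander(x, N=p + 1, increasing=True)

x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2  # a quadratic relationship

Z = poly_features(x, p=2)
v = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(v)  # approximately [1.0, -2.0, 0.5]
```

Note that for large 'p' the monomial basis becomes ill-conditioned, which is why the slide points to numerically nicer bases.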

Page 34:

Summary
• Matrix notation for expressing the least squares problem.
• Normal equations: the solution of least squares as a linear system.
  – Solve (XTX)w = (XTy).
• The solution might not be unique because of collinearity.
  – But any solution is optimal because of "convexity".
• Tree/probabilistic/non-parametric/ensemble regression methods.
• Non-linear transforms:
  – Allow us to model non-linear relationships with linear models.

• Next time: how to do least squares with a million features.

Page 35:

Linear Least Squares: Expansion Step
  f(w) = ½‖Xw – y‖²
       = ½(Xw – y)T(Xw – y)
       = ½(wTXT – yT)(Xw – y)
       = ½wTXTXw – wTXTy + ½yTy.
  (The two cross terms combine since wTXTy = yTXw, each being a scalar.)

Page 36:

Vector View of Least Squares
• We showed that least squares minimizes:
  f(w) = ½‖Xw – y‖².
• The ½ and the squaring don't change the solution, so this is equivalent to minimizing:
  ‖Xw – y‖.
• From this viewpoint, least squares minimizes the Euclidean distance between the vector of labels 'y' and the vector of predictions Xw.

Page 37:

Bonus Slide: Householder(-ish) Notation
• Householder notation: a set of (fairly-logical) conventions for math.


Page 39:

When does least squares have a unique solution?
• We said that the least squares solution is not unique if we have repeated columns.

• But there are other ways it could be non-unique:
  – One column is a scaled version of another column.
  – One column could be the sum of 2 other columns.
  – One column could be three times one column minus four times another.

• The least squares solution is unique if and only if all columns of X are "linearly independent".
  – No column can be written as a "linear combination" of the others.
  – Many equivalent conditions (see Strang's linear algebra book):
    • X has "full column rank", XTX is invertible, XTX has non-zero eigenvalues, det(XTX) > 0.
  – Note that we cannot have independent columns if d > n.
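The full-column-rank condition can be checked numerically (toy matrices, not from the slides):

```python
import numpy as np

# Independent columns: rank equals d = 2, so the solution is unique.
X_ok = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Second column is 2 times the first: rank 1 < d, so no unique solution.
X_bad = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

print(np.linalg.matrix_rank(X_ok))   # 2
print(np.linalg.matrix_rank(X_bad))  # 1
```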