CPSC 340: Machine Learning and Data Mining Nonlinear Regression Fall 2019


Page 1: CPSC 340: Machine Learning and Data Mining (fwood/CS340/lectures/L13.pdf)

CPSC 340: Machine Learning and Data Mining

Nonlinear Regression, Fall 2019

Page 2:

Last Time: Linear Regression
• We discussed linear models:
  ŷi = w1xi1 + w2xi2 + … + wdxid = wTxi.
• "Multiply feature xij by weight wj, add them together to get ŷi."
• We discussed the squared error function:
  f(w) = ½ Σi (wTxi – yi)².
• Interactive demo:
  – http://setosa.io/ev/ordinary-least-squares-regression

http://www.bloomberg.com/news/articles/2013-01-10/the-dunbar-number-from-the-guru-of-social-networks

Page 3:

Matrix/Norm Notation (MEMORIZE/STUDY THIS)
• To solve the d-dimensional least squares problem, we use matrix notation:
  – We use 'w' as a "d times 1" vector containing weight 'wj' in position 'j'.
  – We use 'y' as an "n times 1" vector containing target 'yi' in position 'i'.
  – We use 'xi' as a "d times 1" vector containing feature 'j' of example 'i' in position 'j'.
• We're now going to be careful to make sure these are column vectors.
  – So 'X' is a matrix with xiT in row 'i'.

Page 4:

Matrix/Norm Notation (MEMORIZE/STUDY THIS)
• To solve the d-dimensional least squares problem, we use matrix notation:
  – Our prediction for example 'i' is given by the scalar wTxi.
  – Our predictions for all 'i' (an n times 1 vector) are given by the matrix-vector product Xw.

Page 5:

Matrix/Norm Notation (MEMORIZE/STUDY THIS)
• To solve the d-dimensional least squares problem, we use matrix notation:
  – Our prediction for example 'i' is given by the scalar wTxi.
  – Our predictions for all 'i' (an n times 1 vector) are given by the matrix-vector product Xw.
  – The residual vector 'r' gives the difference between the predictions and yi (n times 1): r = Xw – y.
  – Least squares can be written as the squared L2-norm of the residual: f(w) = ½‖Xw – y‖² = ½‖r‖².

Page 6:

Back to Deriving Least Squares for d > 2…
• We can write the vector of predictions ŷ as a matrix-vector product: ŷ = Xw.
• And we can write linear least squares in matrix notation as: f(w) = ½‖Xw – y‖².
• We'll use this notation to derive the d-dimensional least squares 'w'.
  – By setting the gradient ∇f(w) equal to the zero vector and solving for 'w'.

Page 7:

Digression: Matrix Algebra Review
• Quick review of the linear algebra operations we'll use:
  – If 'a' and 'b' are vectors, and 'A' and 'B' are matrices, then:
    • aTb = bTa, aTa = ‖a‖², (A + B)T = AT + BT, (AB)T = BTAT.

Page 8:

Linear and Quadratic Gradients
• From these rules we have (see the post-lecture slide for the steps):
  f(w) = ½‖Xw – y‖² = ½wTXTXw – wTXTy + ½yTy.
• How do we compute the gradient?

Page 9:

Linear and Quadratic Gradients
• We've written f(w) as a d-dimensional quadratic:
  f(w) = ½wTAw – wTb + c, with A = XTX, b = XTy, c = ½yTy.
• The gradient is given by:
  ∇f(w) = Aw – b.
• Using the definitions of 'A' and 'b':
  ∇f(w) = XTXw – XTy.

Page 10:

Normal Equations
• Set the gradient equal to zero to find the "critical" points:
  XTXw – XTy = 0.
• We now move the terms not involving 'w' to the other side:
  XTXw = XTy.
• This is a set of 'd' linear equations called the "normal equations".
  – This is a linear system like "Ax = b" from Math 152.
• You can use Gaussian elimination to solve for 'w'.
  – In Julia, the "\" command can be used to solve linear systems.
  – In Python, you can solve linear systems in one line using numpy.linalg.solve.
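As a quick sketch of the Python route (with made-up toy data, not from the slides), we can solve the normal equations with numpy.linalg.solve and check against NumPy's built-in least squares solver:

```python
import numpy as np

# Made-up toy data: n = 100 examples, d = 3 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

# Solve the normal equations (X^T X) w = (X^T y).
w = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least squares problem
# (and is more numerically stable for ill-conditioned X).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes minimize the same objective, so the two solutions agree whenever the columns of X are linearly independent.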

Page 12:

Incorrect Solutions to Least Squares Problem

Page 13:

Least Squares Cost
• What is the cost of solving the "normal equations" XTXw = XTy?
• Forming the XTy vector costs O(nd).
  – It has 'd' elements, and each is an inner product between 'n' numbers.
• Forming the matrix XTX costs O(nd²).
  – It has d² elements, and each is an inner product between 'n' numbers.
• Solving a d-by-d system of equations costs O(d³).
  – This is the cost of Gaussian elimination on a d-variable linear system.
  – Other standard methods have the same cost.
• The overall cost is O(nd² + d³).
  – Which term dominates depends on 'n' and 'd'.

Page 14:

Least Squares Issues
• Issues with the least squares model:
  – The solution might not be unique.
  – It is sensitive to outliers.
  – It always uses all features.
  – The data might be so big that we can't store XTX.
    • Or you can't afford the O(nd² + d³) cost.
  – It might predict outside the range of the yi values.
  – It assumes a linear relationship between xi and yi.

Page 15:

Non-Uniqueness of Least Squares Solution
• Why isn't the solution unique?
  – Imagine having two features that are identical for all examples.
  – I can increase the weight on one feature, and decrease it on the other, without changing the predictions.
  – Thus, if (w1, w2) is a solution then (w1 + w2, 0) is another solution.
  – This is a special case of features being "collinear":
    • One feature is a linear function of the others.
• But, any 'w' where ∇f(w) = 0 is a global minimizer of 'f'.
  – This is due to the convexity of 'f', which we'll discuss later.
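A minimal sketch of the non-uniqueness (toy data, not from the slides): two identical columns make XTX singular, and different weight vectors give identical predictions.

```python
import numpy as np

# Two identical features for all n = 5 examples.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([x, x])

# (w1, w2) = (1, 1) and (w1 + w2, 0) = (2, 0) make the same predictions.
w_a = np.array([1.0, 1.0])
w_b = np.array([2.0, 0.0])
print(np.allclose(X @ w_a, X @ w_b))  # True

# XTX is singular (rank 1), so the normal equations have many solutions.
print(np.linalg.matrix_rank(X.T @ X))  # 1
```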

Page 16:

(pause)

Page 17:

Motivation: Non-Linear Progressions in Athletics

• Are top athletes going faster, higher, and farther?

http://www.at-a-lanta.nl/weia/Progressie.html
https://en.wikipedia.org/wiki/Usain_Bolt
http://www.britannica.com/biography/Florence-Griffith-Joyner

Page 18:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:

http://www.at-a-lanta.nl/weia/Progressie.html

Page 19:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.

http://www.at-a-lanta.nl/weia/Progressie.html

Page 20:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
    • CPSC 540.

https://en.wikipedia.org/wiki/Multivariate_normal_distribution

Page 21:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression:
      – Find the 'k' nearest neighbours of xi.
      – Return the mean of the corresponding yi.

http://scikit-learn.org/stable/modules/neighbors.html
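The KNN regression steps above can be sketched in a few lines (a hypothetical helper with made-up data, not code from the course):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict by averaging the targets of the k nearest training examples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of k closest
    return y_train[nearest].mean()                     # mean of their targets

X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(knn_regress(X_train, y_train, np.array([1.9]), k=3))  # 2.0
```

The query at 1.9 has nearest neighbours at x = 2, 1, 3, so the prediction is the mean of their targets, 2.0.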

Page 22:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression.
    • Could be weighted by distance.
      – Close points 'j' get more "weight" wij.

http://scikit-learn.org/stable/modules/neighbors.html

Page 23:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression.
    • Could be weighted by distance.
    • 'Nadaraya-Watson': weight all yi by distance to xi.

http://www.mathworks.com/matlabcentral/fileexchange/35316-kernel-regression-with-variable-window-width/content/ksr_vw.m

Page 24:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression.
    • Could be weighted by distance.
    • 'Nadaraya-Watson': weight all yi by distance to xi.
    • 'Locally linear regression': for each xi, fit a linear model weighted by distance.
      (Better than KNN and NW at the boundaries.)

http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd423.htm

Page 25:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
  – Non-parametric models:
    • KNN regression.
    • Could be weighted by distance.
    • 'Nadaraya-Watson': weight all yi by distance to xi.
    • 'Locally linear regression': for each xi, fit a linear model weighted by distance.
      (Better than KNN and NW at the boundaries.)
  – Ensemble methods:
    • Can improve performance by averaging across regression models.

Page 26:

Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression.

• Applications:
  – Regression forests for fluid simulation:
    • https://www.youtube.com/watch?v=kGB7Wd9CudA
  – KNN for image completion:
    • http://graphics.cs.cmu.edu/projects/scene-completion
    • Combined with "graph cuts" and "Poisson blending".
  – KNN regression for "voice photoshop":
    • https://www.youtube.com/watch?v=I3l4XLZ59iw
    • Combined with "dynamic time warping" and "Poisson blending".

• But we'll focus on linear models with non-linear transforms.
  – These are the building blocks for more advanced methods.

http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd423.htm

Page 27:

Why don't we have a y-intercept?
  – The linear model is ŷi = wxi instead of ŷi = wxi + w0 with y-intercept w0.
  – Without an intercept, if xi = 0 then we must predict ŷi = 0.


Page 29:

Adding a Bias Variable
• A simple trick to add a y-intercept ("bias") variable:
  – Make a new matrix 'Z' with an extra feature that is always "1".
• Now use 'Z' as your features in linear regression.
  – We'll use 'v' instead of 'w' as the regression weights when we use the features 'Z'.
• So we can have a non-zero y-intercept by changing the features.
  – This means we can ignore the y-intercept in our derivations, which is cleaner.
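A sketch of the bias trick with made-up data: prepend a column of ones to X, so the first entry of 'v' plays the role of the y-intercept w0.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 1))
y = 3.0 + 2.0 * x[:, 0]  # data with a non-zero y-intercept

# Z = [1 X]: an extra feature that is always 1.
Z = np.column_stack([np.ones(len(x)), x])

# Ordinary least squares on Z recovers intercept and slope together.
v = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(v)  # approximately [3.0, 2.0]
```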

Page 30:

Motivation: Limitations of Linear Models
• On many datasets, yi is not a linear function of xi.

• Can we use least squares to fit non-linear models?

Page 31:

Non-Linear Feature Transforms
• Can we use linear least squares to fit a quadratic model?
  ŷi = w0 + w1xi + w2xi².
• You can do this by changing the features (change of basis):
  zi = [1, xi, xi²].
• Fit the new parameters 'v' under the "change of basis": solve ZTZv = ZTy.
• The model is a linear function of the parameters 'v', but a quadratic function of xi.

Page 32:

Non-Linear Feature Transforms

Page 33:

General Polynomial Features (d = 1)
• We can have a polynomial of degree 'p' by using these features:
  zi = [1, xi, xi², …, xi^p].
• There are polynomial basis functions that are numerically nicer:
  – E.g., Lagrange polynomials (see CPSC 303).
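The degree-p features above can be built as a Vandermonde matrix and fit with the normal equations ZTZv = ZTy (a sketch with made-up quadratic data):

```python
import numpy as np

def poly_features(x, p):
    """Z with row i equal to [1, x_i, x_i^2, ..., x_i^p]."""
    return np.vander(x, N=p + 1, increasing=True)

x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2  # a quadratic relationship

Z = poly_features(x, p=2)
v = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(v)  # approximately [1.0, -2.0, 0.5]
```

Note that for large 'p' the monomial basis becomes ill-conditioned, which is why the slide points to numerically nicer bases.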

Page 34:

Summary
• Matrix notation for expressing the least squares problem.
• Normal equations: the solution of least squares as a linear system.
  – Solve (XTX)w = (XTy).
• The solution might not be unique because of collinearity.
  – But any solution is optimal because of "convexity".
• Tree/probabilistic/non-parametric/ensemble regression methods.
• Non-linear transforms:
  – Allow us to model non-linear relationships with linear models.

• Next time: how to do least squares with a million features.

Page 35:

Linear Least Squares: Expansion Step
  f(w) = ½‖Xw – y‖²
       = ½(Xw – y)T(Xw – y)
       = ½(wTXT – yT)(Xw – y)
       = ½wTXTXw – wTXTy + ½yTy.
  (The two cross terms combine since wTXTy = yTXw, each being a scalar.)

Page 36:

Vector View of Least Squares
• We showed that least squares minimizes:
  f(w) = ½‖Xw – y‖².
• The ½ and the squaring don't change the solution, so this is equivalent to minimizing:
  ‖Xw – y‖.
• From this viewpoint, least squares minimizes the Euclidean distance between the vector of labels 'y' and the vector of predictions Xw.

Page 37:

Bonus Slide: Householder(-ish) Notation
• Householder notation: a set of (fairly-logical) conventions for math.


Page 39:

When does least squares have a unique solution?
• We said that the least squares solution is not unique if we have repeated columns.

• But there are other ways it could be non-unique:
  – One column is a scaled version of another column.
  – One column could be the sum of 2 other columns.
  – One column could be three times one column minus four times another.

• The least squares solution is unique if and only if all columns of X are "linearly independent".
  – No column can be written as a "linear combination" of the others.
  – Many equivalent conditions (see Strang's linear algebra book):
    • X has "full column rank", XTX is invertible, XTX has non-zero eigenvalues, det(XTX) > 0.
  – Note that we cannot have independent columns if d > n.
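The full-column-rank condition can be checked numerically (toy matrices, not from the slides):

```python
import numpy as np

# Independent columns: rank equals d = 2, so the solution is unique.
X_ok = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Second column is 2 times the first: rank 1 < d, so no unique solution.
X_bad = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

print(np.linalg.matrix_rank(X_ok))   # 2
print(np.linalg.matrix_rank(X_bad))  # 1
```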