
Page 1:

ML4NLP: Introduction to Classification

Dan Goldwasser

CS 590NLP

Purdue University
[email protected]

Page 2:

Statistical Language Modeling

• Intuition: by looking at large quantities of text we can find statistical regularities
  – Distinguish between correct and incorrect sentences

• Language models define a probability distribution over strings (e.g., sentences) in a language.

• We can use a language model to score and rank sentences

“I don’t know {whether, weather} to laugh or cry”

P(“I don’t .. weather to laugh ..”)  ><  P(“I don’t .. whether to laugh ..”)

Page 3:

Language Modeling with N-grams

Unigram model:  P(w1) P(w2) ... P(wi)
Bigram model:   P(w1) P(w2 | w1) ... P(wi | wi-1)
Trigram model:  P(w1) P(w2 | w1) ... P(wi | wi-2, wi-1)
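A minimal sketch of how such n-gram probabilities can be estimated from counts and used to score a sentence. This is illustrative only; the toy corpus and helper names are assumptions, not from the slides.

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large text collection.
corpus = [["i", "don't", "know", "whether", "to", "laugh", "or", "cry"],
          ["whether", "to", "laugh", "or", "cry"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent)))
total = sum(unigrams.values())

def bigram_score(sentence):
    """Bigram model: P(w1) * P(w2 | w1) * ... using maximum-likelihood counts."""
    p = unigrams[sentence[0]] / total
    for prev, w in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]  # P(w | prev)
    return p

print(bigram_score(["whether", "to", "laugh"]))
```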

Page 4:

Evaluating Language Models

• Assuming that we have a language model, how can we tell if it’s good?

• Option 1: try to generate Shakespeare...
  – This is known as qualitative evaluation

• Option 2: quantitative evaluation
  – Option 2.1: see how well you do on spelling correction
    • This is known as Extrinsic Evaluation
  – Option 2.2: find an independent measure of LM quality
    • This is known as Intrinsic Evaluation

Page 5:

When are LMs applicable?

Deep Visual-Semantic Alignments for Generating Image Descriptions

Finding regularity in language is surprisingly useful! Easy example: weather/whether

But also:
  – Translation (can you produce “legal” French from source English?)
  – Caption generation (combine the output of visual sensors into a grammatical sentence)

Page 6:

Classification

• A fundamental machine learning tool
  – Widely applicable in NLP

• Supervised learning: the learner is given a collection of labeled documents
  – Emails: spam / not spam; Reviews: pos / neg

• Build a function mapping documents to labels
  – Key property: generalization
    • The function should work well on new data

Page 7:

Sentiment Analysis

Dude, I just watched this horror flick! Selling points: nightmares scenes, torture scenes, terrible monsters that was so bad a##!

Don’t buy the popcorn it was terrible, the monsters selling it must have wanted to torture me, it was so bad it gave me nightmares!

What should your learning algorithm look at?

Page 8:

Deceptive Reviews

Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.
What should your learning algorithm look at?

Page 9:

Power Relations

Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.

“Blah. Unacceptable, blah.”

“Your honor, I agree, blah blah blah.”

What should your learning algorithm look at?

Page 10:

Power Relations
Communicative behaviors are “patterned and coordinated, like a dance” [Niederhoffer and Pennebaker 2002]

Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.

Page 11:

Classification

• We assume we have a labeled dataset. How can we build a classifier?

• Decide on a representation and a learning algorithm
  – Essentially: function approximation
    • Representation: what is the domain of the function
    • Learning: how to find a good approximation

• We will look into several simple examples
  – Naïve Bayes, Perceptron

• Let’s start with some definitions...

Page 12:

Basic Definitions

• Given: D, a set of labeled examples {<x, y>}
• Goal: learn a function f(x) s.t. f(x) = y
  – Note: y can be binary, or categorical
  – Typically the input x is represented as a vector of features

• Break D into three parts:
  – Training set (used by the learning algorithm)
  – Test set (evaluate the learned model)
  – Development set (tuning the learning algorithm)

• Evaluation:
  – Performance measure over the test set
  – Accuracy: proportion of correct predictions (test data)
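A minimal sketch of this setup. The split proportions and function names are illustrative assumptions, not from the slides.

```python
import random

def split(D, train=0.8, dev=0.1, seed=0):
    """Break a labeled dataset D = [(x, y), ...] into train / dev / test parts."""
    data = D[:]
    random.Random(seed).shuffle(data)
    n_train, n_dev = int(train * len(data)), int(dev * len(data))
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

def accuracy(f, test):
    """Proportion of correct predictions on the test set."""
    return sum(f(x) == y for x, y in test) / len(test)
```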

Page 13:

Precision and Recall

• Given a dataset, we train a classifier that gets 99% accuracy
• Did we do a good job?

• Build a classifier for brain tumors:
  – 99.9% of brain scans do not show signs of a tumor
  – Did we do a good job?

• By simply saying “NO” to all examples we reduce the error by a factor of 10!
  – Clearly accuracy is not the best way to evaluate the learning system when the data is heavily skewed!

• Intuition: we need a measure that captures the (rare) class we care about!

Page 14:

Precision and Recall

• The learner can make two kinds of mistakes:
  – False Positive
  – False Negative

• Precision: “when we predicted the rare class, how often are we right?”

• Recall: “out of all the instances of the rare class, how many did we catch?”

                     True label: 1      True label: 0
  Predicted: 1       True Positive      False Positive
  Predicted: 0       False Negative     True Negative

Precision = True Pos / Predicted Pos = True Pos / (True Pos + False Pos)

Recall = True Pos / Actual Pos = True Pos / (True Pos + False Neg)
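A small sketch computing these two quantities from gold labels and predictions. The function and variable names are illustrative assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```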

Page 15:

F-Score

• Precision and Recall give us two reference points to compare learning performance
• Which algorithm is better? We need a single score.
  • Option 1: Average = (P + R) / 2
  • Option 2: F-Score = 2PR / (P + R)

                Precision   Recall   Average   F-Score
  Algorithm 1   0.5         0.4      0.45      0.444
  Algorithm 2   0.7         0.1      0.4       0.175
  Algorithm 3   0.02        1        0.51      0.0392

Properties of F-score:
• Ranges between 0 and 1
• Prefers precision and recall with similar values
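A quick sketch that reproduces the table's averages and F-scores from the formula above; a minimal illustration, not part of the slides.

```python
def f_score(p, r):
    """F = 2PR / (P + R); 0 if both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

for name, p, r in [("Algorithm 1", 0.5, 0.4), ("Algorithm 2", 0.7, 0.1), ("Algorithm 3", 0.02, 1.0)]:
    print(name, (p + r) / 2, round(f_score(p, r), 4))
```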

Page 16:

Simple Example: Naïve Bayes

• Naïve Bayes: a simple probabilistic classifier
  – Given a set of labeled data:
    • Documents D, each associated with a label v
    • Simple feature representation: BoW
  – Learning:
    • Construct a probability distribution P(v | d)
  – Prediction:
    • Assign the label with the highest probability

• Relies on strong simplifying assumptions

Page 17:

Simple Representation: BoW

• Basic idea (sentiment analysis):
  – “I loved this movie, it’s awesome! I couldn’t stop laughing for two hours!”
  – Mapping input to label can be done by representing the frequencies of individual words
  – Document = word counts

• Simple, yet surprisingly powerful representation!
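A minimal sketch of turning a document into word counts; the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(document):
    """Represent a document as a dictionary of word frequencies."""
    return Counter(document.lower().split())

print(bag_of_words("I loved this movie, it's awesome! I couldn't stop laughing for two hours!"))
```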

Page 18:

Bayes Rule

• Naïve Bayes is a simple probabilistic classification method, based on Bayes rule.

P(v | d) = P(d | v) P(v) / P(d)

Page 19:

Basics of Naïve Bayes

• P(v) - the prior probability of a label v.
  Reflects background knowledge, before data is observed. If no information: uniform distribution.

• P(D) - the probability that this sample of the data is observed (no knowledge of the label).

• P(D | v): the probability of observing the sample D, given that the label v is the target (likelihood).

• P(v | D): the posterior probability of v. The probability that v is the target, given that D has been observed.

Page 20:

Bayes Rule

• Naïve Bayes is a simple classification method, based on Bayes rule.

P(v | d) = P(d | v) P(v) / P(d)

Check your intuition:
  P(v | d) increases with P(v) and with P(d | v)
  P(v | d) decreases with P(d)

Page 21:

Naïve Bayes

• The learner considers a set of candidate labels, and attempts to find the most probable one v ∈ V, given the observed data.

• Such a maximally probable assignment is called the maximum a posteriori assignment (MAP); Bayes theorem is used to compute it:

  P(v | D) = P(D | v) P(v) / P(D)

  v_MAP = argmax_{v∈V} P(v | D) = argmax_{v∈V} P(D | v) P(v) / P(D)
        = argmax_{v∈V} P(D | v) P(v)        since P(D) is the same for all v ∈ V

Page 22:

Naïve Bayes

• How can we compute P(v | D)?
  – Basic idea: represent the document as a set of features, such as BoW features

  v_MAP = argmax_{vj∈V} P(vj | x1, x2, ..., xn)
        = argmax_{vj∈V} P(x1, x2, ..., xn | vj) P(vj) / P(x1, x2, ..., xn)
        = argmax_{vj∈V} P(x1, x2, ..., xn | vj) P(vj)

Page 23:

NB: Parameter Estimation

• Given training data we can estimate the two terms.
• Estimating P(v) is easy: for each value v, count how many times it appears in the training data.

• However, it is not feasible to estimate P(x1, …, xn | v)
  – In this case we have to estimate, for each target value, the probability of each instance (most of which will not occur)

• In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.

  v_MAP = argmax_v P(x1, x2, …, xn | v) P(v)

Question: assume binary xi’s. How many parameters does the model require?

Page 24:

NB: Independence Assumption

• Bag of words representation:
  – Word position can be ignored

• Conditional independence: assume feature probabilities are independent given the label
  – P(xi | vj) = P(xi | xi-1, vj)

• Both assumptions are not true
  – They help simplify the model
  – Simple models work well

Page 25:

Naive Bayes

  v_MAP = argmax_v P(x1, x2, …, xn | v) P(v)

  P(x1, x2, ..., xn | vj)
    = P(x1 | x2, ..., xn, vj) P(x2, ..., xn | vj)
    = P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3, ..., xn | vj)
    = .......
    = P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3 | x4, ..., xn, vj) ... P(xn | vj)

Assumption: feature values are independent given the target value:

  P(x1, x2, ..., xn | vj) = ∏_{i=1..n} P(xi | vj)

Page 26:

Estimating Probabilities (MLE)

Assume a document classification problem, using word features.

  v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_i P(xi = word_i | v)

How do we estimate P(word_k | v)?

  P(word_k | v) = #(word_k appears in training documents labeled v) / #(documents labeled v) = n_k / n

Sparsity of data is a problem:
  – if n is small, the estimate is not accurate
  – if n_k is 0, it will dominate the estimate: we will never predict v if a word that never appeared in training (with v) appears in the test data
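A minimal sketch of this counting-based estimation and the argmax prediction for document classification. The data format and helper names are illustrative assumptions, not from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns priors P(v) and MLE word probabilities P(word | v)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)          # word_counts[label][word]
    for words, label in docs:
        word_counts[label].update(set(words))   # document-level counts, as in the slide's estimate
    priors = {v: c / len(docs) for v, c in label_counts.items()}
    cond = {v: {w: word_counts[v][w] / label_counts[v] for w in word_counts[v]} for v in label_counts}
    return priors, cond

def predict_nb(words, priors, cond):
    """argmax_v P(v) * prod_i P(word_i | v), computed in log space."""
    def score(v):
        # A word unseen with v would make the product 0; smoothing (next slide) fixes this properly.
        return math.log(priors[v]) + sum(math.log(cond[v].get(w, 1e-10)) for w in set(words))
    return max(priors, key=score)
```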

Page 27:

Robust Estimation of Probabilities

  v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_i P(xi = word_i | v)

  P(x_k | v) = (n_k + m·p) / (n + m)

• This process is called smoothing.
• There are many ways to do it, some better justified than others; an empirical issue.

• Here:
  • n_k is #(occurrences of the word in the presence of v)
  • n is #(occurrences of the label v)
  • p is a prior estimate of v (e.g., uniform)
  • m is the equivalent sample size (# of labels)

• Laplace Rule: for the Boolean case, p = 1/2, m = 2:

  P(x_k | v) = (n_k + 1) / (n + 2)
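A small sketch of the smoothed estimate; parameter names follow the slide, the example call is illustrative.

```python
def smoothed_estimate(n_k, n, p=0.5, m=2):
    """P(x_k | v) = (n_k + m*p) / (n + m); with p = 1/2, m = 2 this is the Laplace rule (n_k + 1) / (n + 2)."""
    return (n_k + m * p) / (n + m)

print(smoothed_estimate(0, 100))   # an unseen word no longer gets probability 0
```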

Page 28:

Naïve Bayes

• Very easy to implement
• Converges very quickly
  – Learning is just counting
• Performs well in practice
  – Applied to many document classification tasks
  – If the dataset is small, NB can perform better than sophisticated algorithms
• Strong independence assumptions
  – If the assumptions hold: NB is the optimal classifier
  – Even if not, it can perform well

• Next: from NB to learning linear threshold functions

Page 29:

Naïve Bayes: Two Classes

• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.

• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} P(x_i | v_j = 1)
  --------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} P(x_i | v_j = 0)

Page 30:

Naïve Bayes: Two Classes

• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.

• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} P(x_i | v_j = 1)
  --------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} P(x_i | v_j = 0)

Denote: p_i = P(x_i = 1 | v = 1), q_i = P(x_i = 1 | v = 0). Then we predict v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} p_i^{x_i} (1 - p_i)^{1 - x_i}
  ----------------------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} q_i^{x_i} (1 - q_i)^{1 - x_i}

Page 31:

Naïve Bayes: Two Classes

In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} p_i^{x_i} (1 - p_i)^{1 - x_i}
  ----------------------------------------------------
  P(v_j = 0) ∏_{i=1..n} q_i^{x_i} (1 - q_i)^{1 - x_i}

    =

  P(v_j = 1) ∏_{i=1..n} (1 - p_i) (p_i / (1 - p_i))^{x_i}
  --------------------------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} (1 - q_i) (q_i / (1 - q_i))^{x_i}

Page 32:

Naïve Bayes: Two Classes

In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} (1 - p_i) (p_i / (1 - p_i))^{x_i}
  --------------------------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} (1 - q_i) (q_i / (1 - q_i))^{x_i}

Take the logarithm; we predict v = 1 iff:

  log P(v_j = 1)/P(v_j = 0) + Σ_i log((1 - p_i)/(1 - q_i)) + Σ_i ( log(p_i/(1 - p_i)) - log(q_i/(1 - q_i)) ) x_i  >  0

Page 33:

Naïve Bayes: Two Classes

In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  log P(v_j = 1)/P(v_j = 0) + Σ_i log((1 - p_i)/(1 - q_i)) + Σ_i ( log(p_i/(1 - p_i)) - log(q_i/(1 - q_i)) ) x_i  >  0

• We get that naive Bayes is a linear separator, with:

  w_i = log( p_i / (1 - p_i) ) - log( q_i / (1 - q_i) ) = log( (p_i / q_i) · ((1 - q_i) / (1 - p_i)) )

  If p_i = q_i then w_i = 0 and the feature is irrelevant.
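A minimal sketch of this view: given estimated p_i, q_i and the priors, build the weight vector and bias and classify with a linear threshold. The function names and inputs are illustrative assumptions (e.g., that smoothed estimates are already available).

```python
import math

def nb_linear_separator(p, q, prior1, prior0):
    """p[i] = P(x_i = 1 | v = 1), q[i] = P(x_i = 1 | v = 0).
    Returns (w, b) such that predicting v = 1 iff b + w·x > 0 matches the two-class NB rule above."""
    w = [math.log(pi / (1 - pi)) - math.log(qi / (1 - qi)) for pi, qi in zip(p, q)]
    b = math.log(prior1 / prior0) + sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
    return w, b

def predict(x, w, b):
    """Linear threshold prediction: 1 iff b + w·x > 0."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
```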

Page 34:

Linear Classifiers

• Linear threshold functions
  – Associate a weight (w_i) with each feature (x_i)
  – Prediction: sign(b + wTx) = sign(b + Σ w_i x_i)
    • If b + wTx ≥ 0, predict y = 1
    • Otherwise, predict y = -1

• NB is a linear threshold function
  – The weight vector (w) is assigned by computing conditional probabilities

• In fact, linear threshold functions are a very popular representation!

Page 35:

Linear Classifiers

Each point in this space is a document.

The coordinates (e.g., x1, x2) are determined by feature activations.

[Figure: positive (+) and negative (−) documents in feature space, separated by the hyperplane sign(b + wTx)]

Page 36:

Expressivity

• Linear functions are quite expressive
  – There exists a linear function that is consistent with the data

• A famous negative example (XOR):

[Figure: the XOR pattern of + and − clusters, which no single linear separator can classify correctly]

Page 37:

Expressivity

By transforming the feature space, these functions can be made linear.

Represent each point in 2D as (x, x²).

Page 38:

Expressivity

A more realistic scenario: the data is almost linearly separable, except for some noise.

[Figure: + and − points separated by sign(b + wTx), with a few noisy points on the wrong side of the separator]

Page 39:

Features

• So far we have discussed the BoW representation
  – In fact, you can use a very rich representation

• Broader definition
  – Functions mapping attributes of the input to a Boolean / categorical / numeric value

  φ1(x) = 1 if x1 is capitalized, 0 otherwise
  φk(x) = 1 if x contains “good” more than twice, 0 otherwise

• Question: assume that you have a lexicon containing positive and negative sentiment words. How can you use it to improve over BoW?
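A small sketch of such indicator features, plus one possible lexicon-based feature for the question above. The toy lexicon and the specific feature choices are illustrative assumptions.

```python
positive_lexicon = {"good", "great", "awesome", "loved"}   # assumed toy lexicon

def phi_1(words):
    """1 if the first token is capitalized, 0 otherwise."""
    return 1 if words and words[0][0].isupper() else 0

def phi_k(words):
    """1 if the document contains 'good' more than twice, 0 otherwise."""
    return 1 if sum(w.lower() == "good" for w in words) > 2 else 0

def phi_lexicon(words):
    """Count of positive-lexicon words: one way to go beyond plain BoW."""
    return sum(w.lower() in positive_lexicon for w in words)
```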

Page 40:

Perceptron

• One of the earliest learning algorithms
  – Introduced by Rosenblatt in 1958 to model neural learning

• Goal: directly search for a separating hyperplane
  – If one exists, the perceptron will find it
  – If not, …

• Online algorithm
  – Considers one example at a time (NB looks at the entire data)

• Error-driven algorithm
  – Updates the weights only when a mistake is made

Page 41:

Perceptron Intuition

Page 42:

Perceptron

• We learn f: X → {-1, +1}, represented as f = sgn(w·x)
  • Where X = {0,1}ⁿ or X = Rⁿ, and w ∈ Rⁿ
• Given labeled examples: {(x1, y1), (x2, y2), … (xm, ym)}

1. Initialize w = 0 ∈ Rⁿ

2. Cycle through all examples:
   a. Predict the label of instance x to be y’ = sgn(w·x)
   b. If y’ ≠ y, update the weight vector:
        w = w + r y x   (r - a constant, the learning rate)
      Otherwise, if y’ = y, leave the weights unchanged.
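A minimal sketch of this algorithm; the learning rate and the number of passes over the data are illustrative assumptions.

```python
def sgn(z):
    return 1 if z >= 0 else -1

def perceptron(examples, n_features, r=1.0, epochs=10):
    """examples: list of (x, y) with x a list of floats and y in {-1, +1}."""
    w = [0.0] * n_features                       # 1. initialize w = 0
    for _ in range(epochs):
        for x, y in examples:                    # 2. cycle through all examples
            y_pred = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            if y_pred != y:                      # error-driven: update only on a mistake
                w = [wi + r * y * xi for wi, xi in zip(w, x)]
    return w
```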

Page 43:

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

Page 44:

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

• The margin of a dataset (𝛾) is the maximum margin possible for that dataset using any weight vector.

Page 45:

Mistake Bound for Perceptron

• Let D = {(xi, yi)} be a labeled dataset that is separable.
• Let ||xi|| < R for all examples.
• Let 𝛾 be the margin of the dataset D.
• Then the perceptron algorithm will make at most R² / 𝛾² mistakes on the data.

Page 46:

Practical Example

Source: Scaling to Very Very Large Corpora for Natural Language Disambiguation. Michele Banko, Eric Brill. Microsoft Research, Redmond, WA. 2001.

Task: context-sensitive spelling: {principle, principal}, {weather, whether}.

Page 47:

Deceptive Reviews

Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.
What should your learning algorithm look at?

Page 48:

Deception Classification

Page 49:

Summary

• Classification is a basic tool for NLP
  – E.g., what is the topic of a document?
• Classifier: a mapping from input to label
  – Label: binary or categorical
• We saw two simple learning algorithms for finding the parameters of linear classification functions
  – Naïve Bayes and Perceptron

• Next:
  – More sophisticated algorithms
  – Applications (or - how to get it to work!)

Page 50:

Questions?