
Page 1:

ML4NLP: Introduction to Classification

Dan Goldwasser

CS 590NLP

Purdue University
[email protected]

Page 2:

Statistical Language Modeling

• Intuition: by looking at large quantities of text we can find statistical regularities
  – Distinguish between correct and incorrect sentences

• Language models define a probability distribution over strings (e.g., sentences) in a language.

• We can use a language model to score and rank sentences

“I don’t know {whether, weather} to laugh or cry”

P(“I don’t .. weather to laugh ..”)  ><  P(“I don’t .. whether to laugh ..”)

Page 3:

Language Modeling with N-grams

Unigram model:  P(w1) P(w2) ... P(wi)
Bigram model:   P(w1) P(w2 | w1) ... P(wi | wi-1)
Trigram model:  P(w1) P(w2 | w1) ... P(wi | wi-2, wi-1)
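A minimal sketch of how such n-gram probabilities can be estimated from counts and used to score a sentence. This is illustrative only; the toy corpus and helper names are assumptions, not from the slides.

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large text collection.
corpus = [["i", "don't", "know", "whether", "to", "laugh", "or", "cry"],
          ["whether", "to", "laugh", "or", "cry"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent)))
total = sum(unigrams.values())

def bigram_score(sentence):
    """Bigram model: P(w1) * P(w2 | w1) * ... using maximum-likelihood counts."""
    p = unigrams[sentence[0]] / total
    for prev, w in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]  # P(w | prev)
    return p

print(bigram_score(["whether", "to", "laugh"]))
```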

Page 4:

Evaluating Language Models

• Assuming that we have a language model, how can we tell if it’s good?

• Option 1: try to generate Shakespeare...
  – This is known as qualitative evaluation

• Option 2: quantitative evaluation
  – Option 2.1: see how well you do on spelling correction
    • This is known as Extrinsic Evaluation
  – Option 2.2: find an independent measure of LM quality
    • This is known as Intrinsic Evaluation

Page 5:

When are LMs applicable?

Deep Visual-Semantic Alignments for Generating Image Descriptions

Finding regularity in language is surprisingly useful! Easy example: weather/whether

But also:
  – Translation (can you produce “legal” French from source English?)
  – Caption generation (combine the output of visual sensors into a grammatical sentence)

Page 6:

Classification

• A fundamental machine learning tool
  – Widely applicable in NLP

• Supervised learning: the learner is given a collection of labeled documents
  – Emails: spam / not spam; Reviews: pos / neg

• Build a function mapping documents to labels
  – Key property: generalization
    • The function should work well on new data

Page 7:

Sentiment Analysis

Dude, I just watched this horror flick! Selling points: nightmares scenes, torture scenes, terrible monsters that was so bad a##!

Don’t buy the popcorn it was terrible, the monsters selling it must have wanted to torture me, it was so bad it gave me nightmares!

What should your learning algorithm look at?

Page 8:

Deceptive Reviews

Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.
What should your learning algorithm look at?

Page 9:

Power Relations

Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.

“Blah. Unacceptable, blah.”

“Your honor, I agree, blah blah blah.”

What should your learning algorithm look at?

Page 10:

Power Relations
Communicative behaviors are “patterned and coordinated, like a dance” [Niederhoffer and Pennebaker 2002]

Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.

Page 11:

Classification

• We assume we have a labeled dataset. How can we build a classifier?

• Decide on a representation and a learning algorithm
  – Essentially: function approximation
    • Representation: what is the domain of the function
    • Learning: how to find a good approximation

• We will look into several simple examples
  – Naïve Bayes, Perceptron

• Let’s start with some definitions...

Page 12:

Basic Definitions

• Given: D, a set of labeled examples {<x, y>}
• Goal: learn a function f(x) s.t. f(x) = y
  – Note: y can be binary, or categorical
  – Typically the input x is represented as a vector of features

• Break D into three parts:
  – Training set (used by the learning algorithm)
  – Test set (evaluate the learned model)
  – Development set (tuning the learning algorithm)

• Evaluation:
  – Performance measure over the test set
  – Accuracy: proportion of correct predictions (test data)
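A minimal sketch of this setup. The split proportions and function names are illustrative assumptions, not from the slides.

```python
import random

def split(D, train=0.8, dev=0.1, seed=0):
    """Break a labeled dataset D = [(x, y), ...] into train / dev / test parts."""
    data = D[:]
    random.Random(seed).shuffle(data)
    n_train, n_dev = int(train * len(data)), int(dev * len(data))
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

def accuracy(f, test):
    """Proportion of correct predictions on the test set."""
    return sum(f(x) == y for x, y in test) / len(test)
```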

Page 13:

Precision and Recall

• Given a dataset, we train a classifier that gets 99% accuracy
• Did we do a good job?

• Build a classifier for brain tumors:
  – 99.9% of brain scans do not show signs of a tumor
  – Did we do a good job?

• By simply saying “NO” to all examples we reduce the error by a factor of 10!
  – Clearly accuracy is not the best way to evaluate the learning system when the data is heavily skewed!

• Intuition: we need a measure that captures the (rare) class we care about!

Page 14:

Precision and Recall

• The learner can make two kinds of mistakes:
  – False Positive
  – False Negative

• Precision: “when we predicted the rare class, how often are we right?”

• Recall: “out of all the instances of the rare class, how many did we catch?”

                     True label: 1      True label: 0
  Predicted: 1       True Positive      False Positive
  Predicted: 0       False Negative     True Negative

Precision = True Pos / Predicted Pos = True Pos / (True Pos + False Pos)

Recall = True Pos / Actual Pos = True Pos / (True Pos + False Neg)
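A small sketch computing these two quantities from gold labels and predictions. The function and variable names are illustrative assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```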

Page 15:

F-Score

• Precision and Recall give us two reference points to compare learning performance
• Which algorithm is better? We need a single score.
  • Option 1: Average = (P + R) / 2
  • Option 2: F-Score = 2PR / (P + R)

                Precision   Recall   Average   F-Score
  Algorithm 1   0.5         0.4      0.45      0.444
  Algorithm 2   0.7         0.1      0.4       0.175
  Algorithm 3   0.02        1        0.51      0.0392

Properties of F-score:
• Ranges between 0 and 1
• Prefers precision and recall with similar values
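A quick sketch that reproduces the table's averages and F-scores from the formula above; a minimal illustration, not part of the slides.

```python
def f_score(p, r):
    """F = 2PR / (P + R); 0 if both are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

for name, p, r in [("Algorithm 1", 0.5, 0.4), ("Algorithm 2", 0.7, 0.1), ("Algorithm 3", 0.02, 1.0)]:
    print(name, (p + r) / 2, round(f_score(p, r), 4))
```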

Page 16:

Simple Example: Naïve Bayes

• Naïve Bayes: a simple probabilistic classifier
  – Given a set of labeled data:
    • Documents D, each associated with a label v
    • Simple feature representation: BoW
  – Learning:
    • Construct a probability distribution P(v | d)
  – Prediction:
    • Assign the label with the highest probability

• Relies on strong simplifying assumptions

Page 17:

Simple Representation: BoW

• Basic idea (sentiment analysis):
  – “I loved this movie, it’s awesome! I couldn’t stop laughing for two hours!”
  – Mapping input to label can be done by representing the frequencies of individual words
  – Document = word counts

• Simple, yet surprisingly powerful representation!
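A minimal sketch of turning a document into word counts; the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(document):
    """Represent a document as a dictionary of word frequencies."""
    return Counter(document.lower().split())

print(bag_of_words("I loved this movie, it's awesome! I couldn't stop laughing for two hours!"))
```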

Page 18:

Bayes Rule

• Naïve Bayes is a simple probabilistic classification method, based on Bayes rule.

P(v | d) = P(d | v) P(v) / P(d)

Page 19:

Basics of Naïve Bayes

• P(v) - the prior probability of a label v.
  Reflects background knowledge, before data is observed. If no information: uniform distribution.

• P(D) - the probability that this sample of the data is observed (no knowledge of the label).

• P(D | v): the probability of observing the sample D, given that the label v is the target (likelihood).

• P(v | D): the posterior probability of v. The probability that v is the target, given that D has been observed.

Page 20:

Bayes Rule

• Naïve Bayes is a simple classification method, based on Bayes rule.

P(v | d) = P(d | v) P(v) / P(d)

Check your intuition:
  P(v | d) increases with P(v) and with P(d | v)
  P(v | d) decreases with P(d)

Page 21:

Naïve Bayes

• The learner considers a set of candidate labels, and attempts to find the most probable one v ∈ V, given the observed data.

• Such a maximally probable assignment is called the maximum a posteriori assignment (MAP); Bayes theorem is used to compute it:

  P(v | D) = P(D | v) P(v) / P(D)

  v_MAP = argmax_{v∈V} P(v | D) = argmax_{v∈V} P(D | v) P(v) / P(D)
        = argmax_{v∈V} P(D | v) P(v)        since P(D) is the same for all v ∈ V

Page 22:

Naïve Bayes

• How can we compute P(v | D)?
  – Basic idea: represent the document as a set of features, such as BoW features

  v_MAP = argmax_{vj∈V} P(vj | x1, x2, ..., xn)
        = argmax_{vj∈V} P(x1, x2, ..., xn | vj) P(vj) / P(x1, x2, ..., xn)
        = argmax_{vj∈V} P(x1, x2, ..., xn | vj) P(vj)

Page 23:

NB: Parameter Estimation

• Given training data we can estimate the two terms.
• Estimating P(v) is easy: for each value v, count how many times it appears in the training data.

• However, it is not feasible to estimate P(x1, …, xn | v)
  – In this case we have to estimate, for each target value, the probability of each instance (most of which will not occur)

• In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.

  v_MAP = argmax_v P(x1, x2, …, xn | v) P(v)

Question: assume binary xi’s. How many parameters does the model require?

Page 24:

NB: Independence Assumption

• Bag of words representation:
  – Word position can be ignored

• Conditional independence: assume feature probabilities are independent given the label
  – P(xi | vj) = P(xi | xi-1, vj)

• Both assumptions are not true
  – They help simplify the model
  – Simple models work well

Page 25:

Naive Bayes

  v_MAP = argmax_v P(x1, x2, …, xn | v) P(v)

  P(x1, x2, ..., xn | vj)
    = P(x1 | x2, ..., xn, vj) P(x2, ..., xn | vj)
    = P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3, ..., xn | vj)
    = .......
    = P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3 | x4, ..., xn, vj) ... P(xn | vj)

Assumption: feature values are independent given the target value:

  P(x1, x2, ..., xn | vj) = ∏_{i=1..n} P(xi | vj)

Page 26:

Estimating Probabilities (MLE)

Assume a document classification problem, using word features.

  v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_i P(xi = word_i | v)

How do we estimate P(word_k | v)?

  P(word_k | v) = #(word_k appears in training documents labeled v) / #(documents labeled v) = n_k / n

Sparsity of data is a problem:
  – if n is small, the estimate is not accurate
  – if n_k is 0, it will dominate the estimate: we will never predict v if a word that never appeared in training (with v) appears in the test data
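A minimal sketch of this counting-based estimation and the argmax prediction for document classification. The data format and helper names are illustrative assumptions, not from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label). Returns priors P(v) and MLE word probabilities P(word | v)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)          # word_counts[label][word]
    for words, label in docs:
        word_counts[label].update(set(words))   # document-level counts, as in the slide's estimate
    priors = {v: c / len(docs) for v, c in label_counts.items()}
    cond = {v: {w: word_counts[v][w] / label_counts[v] for w in word_counts[v]} for v in label_counts}
    return priors, cond

def predict_nb(words, priors, cond):
    """argmax_v P(v) * prod_i P(word_i | v), computed in log space."""
    def score(v):
        # A word unseen with v would make the product 0; smoothing (next slide) fixes this properly.
        return math.log(priors[v]) + sum(math.log(cond[v].get(w, 1e-10)) for w in set(words))
    return max(priors, key=score)
```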

Page 27:

Robust Estimation of Probabilities

  v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_i P(xi = word_i | v)

  P(x_k | v) = (n_k + m·p) / (n + m)

• This process is called smoothing.
• There are many ways to do it, some better justified than others; an empirical issue.

• Here:
  • n_k is #(occurrences of the word in the presence of v)
  • n is #(occurrences of the label v)
  • p is a prior estimate of v (e.g., uniform)
  • m is the equivalent sample size (# of labels)

• Laplace Rule: for the Boolean case, p = 1/2, m = 2:

  P(x_k | v) = (n_k + 1) / (n + 2)
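A small sketch of the smoothed estimate; parameter names follow the slide, the example call is illustrative.

```python
def smoothed_estimate(n_k, n, p=0.5, m=2):
    """P(x_k | v) = (n_k + m*p) / (n + m); with p = 1/2, m = 2 this is the Laplace rule (n_k + 1) / (n + 2)."""
    return (n_k + m * p) / (n + m)

print(smoothed_estimate(0, 100))   # an unseen word no longer gets probability 0
```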

Page 28:

Naïve Bayes

• Very easy to implement
• Converges very quickly
  – Learning is just counting
• Performs well in practice
  – Applied to many document classification tasks
  – If the dataset is small, NB can perform better than sophisticated algorithms
• Strong independence assumptions
  – If the assumptions hold: NB is the optimal classifier
  – Even if not, it can perform well

• Next: from NB to learning linear threshold functions

Page 29:

Naïve Bayes: Two Classes

• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.

• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} P(x_i | v_j = 1)
  --------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} P(x_i | v_j = 0)

Page 30:

Naïve Bayes: Two Classes

• Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier.

• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} P(x_i | v_j = 1)
  --------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} P(x_i | v_j = 0)

Denote: p_i = P(x_i = 1 | v = 1), q_i = P(x_i = 1 | v = 0). Then we predict v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} p_i^{x_i} (1 - p_i)^{1 - x_i}
  ----------------------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} q_i^{x_i} (1 - q_i)^{1 - x_i}

Page 31:

Naïve Bayes: Two Classes

In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} p_i^{x_i} (1 - p_i)^{1 - x_i}
  ----------------------------------------------------
  P(v_j = 0) ∏_{i=1..n} q_i^{x_i} (1 - q_i)^{1 - x_i}

    =

  P(v_j = 1) ∏_{i=1..n} (1 - p_i) (p_i / (1 - p_i))^{x_i}
  --------------------------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} (1 - q_i) (q_i / (1 - q_i))^{x_i}

Page 32:

Naïve Bayes: Two Classes

In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  P(v_j = 1) ∏_{i=1..n} (1 - p_i) (p_i / (1 - p_i))^{x_i}
  --------------------------------------------------------  >  1
  P(v_j = 0) ∏_{i=1..n} (1 - q_i) (q_i / (1 - q_i))^{x_i}

Take the logarithm; we predict v = 1 iff:

  log P(v_j = 1)/P(v_j = 0) + Σ_i log((1 - p_i)/(1 - q_i)) + Σ_i ( log(p_i/(1 - p_i)) - log(q_i/(1 - q_i)) ) x_i  >  0

Page 33:

Naïve Bayes: Two Classes

In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

  log P(v_j = 1)/P(v_j = 0) + Σ_i log((1 - p_i)/(1 - q_i)) + Σ_i ( log(p_i/(1 - p_i)) - log(q_i/(1 - q_i)) ) x_i  >  0

• We get that naive Bayes is a linear separator, with:

  w_i = log( p_i / (1 - p_i) ) - log( q_i / (1 - q_i) ) = log( (p_i / q_i) · ((1 - q_i) / (1 - p_i)) )

  If p_i = q_i then w_i = 0 and the feature is irrelevant.
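A minimal sketch of this view: given estimated p_i, q_i and the priors, build the weight vector and bias and classify with a linear threshold. The function names and inputs are illustrative assumptions (e.g., that smoothed estimates are already available).

```python
import math

def nb_linear_separator(p, q, prior1, prior0):
    """p[i] = P(x_i = 1 | v = 1), q[i] = P(x_i = 1 | v = 0).
    Returns (w, b) such that predicting v = 1 iff b + w·x > 0 matches the two-class NB rule above."""
    w = [math.log(pi / (1 - pi)) - math.log(qi / (1 - qi)) for pi, qi in zip(p, q)]
    b = math.log(prior1 / prior0) + sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
    return w, b

def predict(x, w, b):
    """Linear threshold prediction: 1 iff b + w·x > 0."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
```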

Page 34:

Linear Classifiers

• Linear threshold functions
  – Associate a weight (w_i) with each feature (x_i)
  – Prediction: sign(b + wTx) = sign(b + Σ w_i x_i)
    • If b + wTx ≥ 0, predict y = 1
    • Otherwise, predict y = -1

• NB is a linear threshold function
  – The weight vector (w) is assigned by computing conditional probabilities

• In fact, linear threshold functions are a very popular representation!

Page 35:

Linear Classifiers

Each point in this space is a document.

The coordinates (e.g., x1, x2) are determined by feature activations.

[Figure: positive (+) and negative (−) documents in feature space, separated by the hyperplane sign(b + wTx)]

Page 36:

Expressivity

• Linear functions are quite expressive
  – There exists a linear function that is consistent with the data

• A famous negative example (XOR):

[Figure: the XOR pattern of + and − clusters, which no single linear separator can classify correctly]

Page 37:

Expressivity

By transforming the feature space, these functions can be made linear.

Represent each point in 2D as (x, x²).

Page 38:

Expressivity

A more realistic scenario: the data is almost linearly separable, except for some noise.

[Figure: + and − points separated by sign(b + wTx), with a few noisy points on the wrong side of the separator]

Page 39:

Features

• So far we have discussed the BoW representation
  – In fact, you can use a very rich representation

• Broader definition
  – Functions mapping attributes of the input to a Boolean / categorical / numeric value

  φ1(x) = 1 if x1 is capitalized, 0 otherwise
  φk(x) = 1 if x contains “good” more than twice, 0 otherwise

• Question: assume that you have a lexicon containing positive and negative sentiment words. How can you use it to improve over BoW?
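A small sketch of such indicator features, plus one possible lexicon-based feature for the question above. The toy lexicon and the specific feature choices are illustrative assumptions.

```python
positive_lexicon = {"good", "great", "awesome", "loved"}   # assumed toy lexicon

def phi_1(words):
    """1 if the first token is capitalized, 0 otherwise."""
    return 1 if words and words[0][0].isupper() else 0

def phi_k(words):
    """1 if the document contains 'good' more than twice, 0 otherwise."""
    return 1 if sum(w.lower() == "good" for w in words) > 2 else 0

def phi_lexicon(words):
    """Count of positive-lexicon words: one way to go beyond plain BoW."""
    return sum(w.lower() in positive_lexicon for w in words)
```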

Page 40:

Perceptron

• One of the earliest learning algorithms
  – Introduced by Rosenblatt in 1958 to model neural learning

• Goal: directly search for a separating hyperplane
  – If one exists, the perceptron will find it
  – If not, …

• Online algorithm
  – Considers one example at a time (NB looks at the entire data)

• Error-driven algorithm
  – Updates the weights only when a mistake is made

Page 41:

Perceptron Intuition

Page 42:

Perceptron

• We learn f: X → {-1, +1}, represented as f = sgn(w·x)
  • Where X = {0,1}ⁿ or X = Rⁿ, and w ∈ Rⁿ
• Given labeled examples: {(x1, y1), (x2, y2), … (xm, ym)}

1. Initialize w = 0 ∈ Rⁿ

2. Cycle through all examples:
   a. Predict the label of instance x to be y’ = sgn(w·x)
   b. If y’ ≠ y, update the weight vector:
        w = w + r y x   (r - a constant, the learning rate)
      Otherwise, if y’ = y, leave the weights unchanged.
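A minimal sketch of this algorithm; the learning rate and the number of passes over the data are illustrative assumptions.

```python
def sgn(z):
    return 1 if z >= 0 else -1

def perceptron(examples, n_features, r=1.0, epochs=10):
    """examples: list of (x, y) with x a list of floats and y in {-1, +1}."""
    w = [0.0] * n_features                       # 1. initialize w = 0
    for _ in range(epochs):
        for x, y in examples:                    # 2. cycle through all examples
            y_pred = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            if y_pred != y:                      # error-driven: update only on a mistake
                w = [wi + r * y * xi for wi, xi in zip(w, x)]
    return w
```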

Page 43:

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

Page 44:

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

• The margin of a dataset (𝛾) is the maximum margin possible for that dataset using any weight vector.

Page 45:

Mistake Bound for Perceptron

• Let D = {(xi, yi)} be a labeled dataset that is separable.
• Let ||xi|| < R for all examples.
• Let 𝛾 be the margin of the dataset D.
• Then the perceptron algorithm will make at most R² / 𝛾² mistakes on the data.

Page 46:

Practical Example

Source: Scaling to Very Very Large Corpora for Natural Language Disambiguation. Michele Banko, Eric Brill. Microsoft Research, Redmond, WA. 2001.

Task: context-sensitive spelling: {principle, principal}, {weather, whether}.

Page 47:

Deceptive Reviews

Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.
What should your learning algorithm look at?

Page 48:

Deception Classification

Page 49:

Summary

• Classification is a basic tool for NLP
  – E.g., what is the topic of a document?
• Classifier: a mapping from input to label
  – Label: binary or categorical
• We saw two simple learning algorithms for finding the parameters of linear classification functions
  – Naïve Bayes and Perceptron

• Next:
  – More sophisticated algorithms
  – Applications (or - how to get it to work!)

Page 50:

Questions?