
ML4NLP: Introduction to Classification

Dan Goldwasser

CS 590NLP

Purdue University, dgoldwas@purdue.edu

Statistical Language Modeling

• Intuition: by looking at large quantities of text we can find statistical regularities
– Distinguish between correct and incorrect sentences

• Language models define a probability distribution over strings (e.g., sentences) in a language.

• We can use a language model to score and rank sentences

“I don’t know {whether,weather} to laugh or cry”

Compare: P(“I don’t .. weather to laugh ..”) vs. P(“I don’t .. whether to laugh ..”)

Language Modeling with N-grams

Unigram model: P(w1) P(w2) ... P(wi)

Bigram model: P(w1) P(w2 | w1) ... P(wi | wi-1)

Trigram model: P(w1) P(w2 | w1) ... P(wi | wi-2, wi-1)
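As a concrete illustration of the bigram factorization, here is a minimal counting-based sketch; the toy corpus, tokenization, and function names are assumptions made for illustration, not part of the original slides.

```python
from collections import defaultdict

# Toy corpus; a real model would be estimated from a large collection of text.
corpus = [
    "i don't know whether to laugh or cry",
    "the weather is nice today",
]

bigram_counts = defaultdict(int)
context_counts = defaultdict(int)

for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def bigram_prob(prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); no smoothing."""
    return bigram_counts[(prev, cur)] / context_counts[prev] if context_counts[prev] else 0.0

def sentence_prob(sentence):
    """Score a sentence as P(w1 | <s>) * P(w2 | w1) * ... * P(</s> | wn)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, cur)
    return p

# Rank two candidate sentences by probability, as in the whether/weather example.
print(sentence_prob("the weather is nice today") > sentence_prob("the whether is nice today"))
```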

Evaluating Language Models

• Assuming that we have a language model, how can we tell if it’s good?

• Option 1: try to generate Shakespeare..
– This is known as Qualitative evaluation

• Option 2: Quantitative evaluation
– Option 2.1: See how well you do on spelling correction

• This is known as Extrinsic Evaluation
– Option 2.2: Find an independent measure of LM quality

• This is known as Intrinsic Evaluation

When are LMs applicable?

Deep Visual-Semantic Alignments for Generating Image Descriptions

Finding regularity in language is surprisingly useful! Easy example: weather/whether

But also: Translation (can you produce “legal” French from source English?)

Caption Generation (combine the output of visual sensors into a grammatical sentence)

Classification

• A fundamental machine learning tool
– Widely applicable in NLP

• Supervised learning: the learner is given a collection of labeled documents
– Emails: spam/not spam; Reviews: pos/neg

• Build a function mapping documents to labels
– Key property: Generalization

• The function should work well on new data

Sentiment Analysis

Dude, I just watched this horror flick! Selling points: nightmares scenes, torture scenes, terrible monsters that was so bad a##!

Don’t buy the popcorn it was terrible, the monsters selling it must have wanted to torture me, it was so bad it gave me nightmares!

What should your learning algorithm look at?

Deceptive Reviews

Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.

What should your learning algorithm look at?

Power Relations

Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.

Blah... Unacceptable... blah

Your honor, I agree blah blah blah

What should your learning algorithm look at?

Power Relations

Communicative behaviors are “patterned and coordinated, like a dance” [Niederhoffer and Pennebaker 2002]

Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.

Classification

• We assume we have a labeled dataset. How can we build a classifier?

• Decide on a representation and a learning algorithm
– Essentially: function approximation

• Representation: What is the domain of the function?
• Learning: How to find a good approximation?

– We will look into several simple examples
• Naïve Bayes, Perceptron

• Let’s start with some definitions..

Basic Definitions

• Given: D, a set of labeled examples {<x, y>}
• Goal: Learn a function f(x) s.t. f(x) = y

– Note: y can be binary or categorical
– Typically the input x is represented as a vector of features

• Break D into three parts:
– Training set (used by the learning algorithm)
– Test set (evaluate the learned model)
– Development set (tuning the learning algorithm)

• Evaluation:
– Performance measure over the test set
– Accuracy: proportion of correct predictions (test data)

Precision and Recall

• Given a dataset, we train a classifier that gets 99% accuracy

• Did we do a good job?
• Build a classifier for brain tumor detection:

– 99.9% of brain scans do not show signs of a tumor
– Did we do a good job?

• By simply saying “NO” to all examples we reduce the error by a factor of 10!
– Clearly accuracy is not the best way to evaluate the learning system when the data is heavily skewed!

• Intuition: we need a measure that captures the (rare) class we care about!


Precision and Recall

• The learner can make two kinds of mistakes:
– False Positive
– False Negative

• Precision: “When we predicted the rare class, how often are we right?”

• Recall: “Out of all the instances of the rare class, how many did we catch?”


                  True Label: 1      True Label: 0
Predicted: 1      True Positive      False Positive
Predicted: 0      False Negative     True Negative

Precision = True Pos / Predicted Pos = True Pos / (True Pos + False Pos)

Recall = True Pos / Actual Pos = True Pos / (True Pos + False Neg)
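A minimal sketch of these two definitions computed from gold and predicted labels; the list-based encoding and the use of 1 for the rare class are assumptions made for illustration.

```python
def precision_recall(gold, predicted, positive=1):
    """Precision and recall for the rare ('positive') class."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == positive and g == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if p == positive and g != positive)
    fn = sum(1 for g, p in zip(gold, predicted) if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A classifier that says "no" to almost everything can have high accuracy but low recall.
gold      = [0, 0, 0, 1, 0, 1]
predicted = [0, 0, 0, 0, 0, 1]
print(precision_recall(gold, predicted))  # (1.0, 0.5)
```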

F-Score

• Precision and Recall give us two reference points to compare learning performance

• Which algorithm is better?
• Option 1: Average
• Option 2: F-Score


              Precision   Recall   Average   F-Score
Algorithm 1   0.5         0.4      0.45      0.444
Algorithm 2   0.7         0.1      0.4       0.175
Algorithm 3   0.02        1        0.51      0.0392

F = 2PR / (P + R)

Properties of the F-score:
• Ranges between 0 and 1
• Prefers precision and recall with similar values

We need a single score.
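The F-Score column in the table above follows directly from F = 2PR / (P + R); a quick sketch to verify the values:

```python
def f_score(p, r):
    """F = 2PR / (P + R), the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

for p, r in [(0.5, 0.4), (0.7, 0.1), (0.02, 1.0)]:
    print(round(f_score(p, r), 4))   # 0.4444, 0.175, 0.0392
```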

Simple Example: Naïve Bayes

• Naïve Bayes: a simple probabilistic classifier
– Given a set of labeled data:

• Documents D, each associated with a label v
• Simple feature representation: BoW

– Learning:
• Construct a probability distribution P(v|d)

– Prediction:
• Assign the label with the highest probability

• Relies on strong simplifying assumptions

Simple Representation: BoW

• Basic idea (sentiment analysis):
– “I loved this movie, it’s awesome! I couldn’t stop laughing for two hours!”
– Mapping input to label can be done by representing the frequencies of individual words
– Document = word counts

• Simple, yet surprisingly powerful representation!
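A minimal sketch of the word-count mapping described above; the whitespace tokenization and the lack of punctuation handling are simplifying assumptions for illustration.

```python
from collections import Counter

def bag_of_words(document):
    """Map a document to its word counts (whitespace tokenization, no punctuation handling)."""
    return Counter(document.lower().split())

doc = "I loved this movie, it's awesome! I couldn't stop laughing for two hours!"
print(bag_of_words(doc))
# e.g. Counter({'i': 2, 'loved': 1, 'this': 1, 'movie,': 1, ...})
```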

Bayes Rule

• Naïve Bayes is a simple probabilistic classification method, based on Bayes rule.

P(v | d) = P(d | v) P(v) / P(d)

Basics of Naïve Bayes

• P(v): the prior probability of a label v

Reflects background knowledge, before data is observed. If no information: a uniform distribution.

• P(D): the probability that this sample of the data is observed (no knowledge of the label)

• P(D|v): the probability of observing the sample D, given that the label v is the target (likelihood)

• P(v|D): the posterior probability of v. The probability that v is the target, given that D has been observed.


Bayes Rule

• Naïve Bayes is a simple classification method, based on Bayes rule.

Check your intuition:

P(v|d) increases with P(v) and with P(d|v)

P(v|d) decreases with P(d)

P(v | d) = P(d | v) P(v) / P(d)

Naïve Bayes

• The learner considers a set of candidate labels, and attempts to find the most probable one, v ∈ V, given the observed data.

• Such a maximally probable assignment is called the maximum a posteriori (MAP) assignment; Bayes theorem is used to compute it:

P(v|D)=P(D|v)P(v)/P(D)

v_MAP = argmax_{v ∈ V} P(v|D) = argmax_{v ∈ V} P(D|v) P(v) / P(D)

       = argmax_{v ∈ V} P(D|v) P(v)        (since P(D) is the same for all v ∈ V)

Naïve Bayes

• How can we compute P(v|D)?
– Basic idea: represent the document as a set of features, such as BoW features

v_MAP = argmax_{vj ∈ V} P(vj | x1, x2, ..., xn)

       = argmax_{vj ∈ V} P(x1, x2, ..., xn | vj) P(vj) / P(x1, x2, ..., xn)

       = argmax_{vj ∈ V} P(x1, x2, ..., xn | vj) P(vj)

NB: Parameter Estimation

• Given training data we can estimate the two terms.
• Estimating P(v) is easy: for each value v, count how many times it appears in the training data.

• However, it is not feasible to estimate P(x1, …, xn | v)
– In this case we would have to estimate, for each target value, the probability of each instance (most of which will not occur)

• In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.

v_MAP = argmax_v P(x1, x2, …, xn | v) P(v)


Question: Assume binary xi’s. How many parameters does the model require?

NB: Independence Assumption

• Bag-of-words representation:
– Word position can be ignored

• Conditional Independence: assume feature probabilities are independent given the label
– P(xi | vj) = P(xi | xi-1, vj)

• Both assumptions are not true in practice
– They help simplify the model
– Simple models work well

NaiveBayes


v_MAP = argmax_v P(x1, x2, …, xn | v) P(v)

P(x1, x2, ..., xn | vj)
  = P(x1 | x2, ..., xn, vj) P(x2, ..., xn | vj)
  = P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3, ..., xn | vj)
  = .......
  = P(x1 | x2, ..., xn, vj) P(x2 | x3, ..., xn, vj) P(x3 | x4, ..., xn, vj) ... P(xn | vj)

Assumption: feature values are independent given the target value:

P(x1, x2, ..., xn | vj) = ∏_{i=1}^{n} P(xi | vj)
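Putting the factorized model together, here is a minimal sketch of the resulting prediction rule, computed in log space to avoid underflow; the dictionary-based parameter tables (prior, likelihood) are an assumed format, and their estimation is discussed in the following slides.

```python
import math

def nb_predict(features, prior, likelihood, labels):
    """Return argmax_v [ log P(v) + sum_i log P(x_i | v) ], the naive Bayes MAP label.

    prior[v] approximates P(v); likelihood[(x, v)] approximates P(x | v).
    Both tables are assumed to have been estimated (and smoothed) from training data.
    """
    best_label, best_score = None, float("-inf")
    for v in labels:
        score = math.log(prior[v])
        for x in features:
            score += math.log(likelihood[(x, v)])  # assumes non-zero (smoothed) estimates
        if score > best_score:
            best_label, best_score = v, score
    return best_label
```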

Estimating Probabilities (MLE)


Assume a document classification problem, using word features.

v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_i P(xi = word_i | v)

How do we estimate P(word_k | v)?

P(word_k | v) = n_k / n = #(word_k appears in training documents labeled v) / #(training documents labeled v)

Sparsity of data is a problem:
-- if n_k is small, the estimate is not accurate
-- if n_k is 0, it will dominate the estimate: we will never predict v if a word that never appeared in training (with v) appears in the test data

Robust Estimation of Probabilities

• This process is called smoothing.
• There are many ways to do it, some better justified than others.
• An empirical issue.

• Here:
• n_k is #(occurrences of the word in the presence of v)
• n is #(occurrences of the label v)
• p is a prior estimate of v (e.g., uniform)
• m is the equivalent sample size (# of labels)

• Laplace Rule: for the Boolean case, p = 1/2, m = 2


v_NB = argmax_{v ∈ {like, dislike}} P(v) ∏_i P(xi = word_i | v)

P(x_k | v) = (n_k + m·p) / (n + m)

Laplace Rule (Boolean case): P(x_k | v) = (n_k + 1) / (n + 2)
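A minimal sketch of estimating P(v) and the Laplace-smoothed P(word | v) = (n_k + 1) / (n + 2); the input format (a list of (words, label) pairs with Boolean word features, where n_k counts documents labeled v containing the word) is an assumption made for illustration.

```python
from collections import defaultdict

def train_nb(labeled_docs):
    """Estimate P(v) and Laplace-smoothed P(word | v) = (n_k + 1) / (n + 2).

    labeled_docs: list of (words, label) pairs (an assumed format).
    n is the number of documents labeled v; n_k is the number of those
    documents containing the word (Boolean word features).
    """
    label_counts = defaultdict(int)        # n, per label
    word_label_counts = defaultdict(int)   # n_k, per (word, label)
    vocabulary = set()

    for words, label in labeled_docs:
        label_counts[label] += 1
        vocabulary.update(words)
        for w in set(words):
            word_label_counts[(w, label)] += 1

    total = sum(label_counts.values())
    prior = {v: c / total for v, c in label_counts.items()}

    likelihood = {}
    for v, n in label_counts.items():
        for w in vocabulary:
            likelihood[(w, v)] = (word_label_counts[(w, v)] + 1) / (n + 2)  # Laplace rule
    return prior, likelihood
```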

Naïve Bayes

• Very easy to implement
• Converges very quickly

– Learning is just counting
• Performs well in practice

– Applied to many document classification tasks
– If the dataset is small, NB can perform better than more sophisticated algorithms

• Strong independence assumptions
– If the assumptions hold: NB is the optimal classifier
– Even if not, it can perform well

• Next: from NB to learning linear threshold functions

Naïve Bayes: Two Classes

• Notice that the naïve Bayes method gives a method for predicting, rather than an explicit classifier

• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:


[P(v=1) · ∏_{i=1}^{n} P(xi | v=1)] / [P(v=0) · ∏_{i=1}^{n} P(xi | v=0)]  >  1

Naïve Bayes: Two Classes


• Notice that the naïve Bayes method gives a method for predicting, rather than an explicit classifier.

• In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

[P(v=1) · ∏_{i=1}^{n} P(xi | v=1)] / [P(v=0) · ∏_{i=1}^{n} P(xi | v=0)]  >  1

Denote: pi = P(xi = 1 | v = 1), qi = P(xi = 1 | v = 0)

[P(vj=1) · ∏_{i=1}^{n} pi^{xi} (1 - pi)^{1-xi}] / [P(vj=0) · ∏_{i=1}^{n} qi^{xi} (1 - qi)^{1-xi}]  >  1

Naïve Bayes: Two Classes


In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

[P(vj=1) · ∏_{i=1}^{n} pi^{xi} (1 - pi)^{1-xi}] / [P(vj=0) · ∏_{i=1}^{n} qi^{xi} (1 - qi)^{1-xi}]

  = [P(vj=1) · ∏_{i=1}^{n} (1 - pi) (pi / (1 - pi))^{xi}] / [P(vj=0) · ∏_{i=1}^{n} (1 - qi) (qi / (1 - qi))^{xi}]  >  1

Naïve Bayes: Two Classes


In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

[P(vj=1) · ∏_{i=1}^{n} pi^{xi} (1 - pi)^{1-xi}] / [P(vj=0) · ∏_{i=1}^{n} qi^{xi} (1 - qi)^{1-xi}]

  = [P(vj=1) · ∏_{i=1}^{n} (1 - pi) (pi / (1 - pi))^{xi}] / [P(vj=0) · ∏_{i=1}^{n} (1 - qi) (qi / (1 - qi))^{xi}]  >  1

Take the logarithm; we predict v = 1 iff:

log[P(vj=1) / P(vj=0)] + Σ_i log[(1 - pi) / (1 - qi)] + Σ_i (log[pi / (1 - pi)] - log[qi / (1 - qi)]) · xi  >  0

Naïve Bayes: Two Classes


In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff:

log[P(vj=1) / P(vj=0)] + Σ_i log[(1 - pi) / (1 - qi)] + Σ_i (log[pi / (1 - pi)] - log[qi / (1 - qi)]) · xi  >  0

• We get that naive Bayes is a linear separator, with:

wi = log[pi / (1 - pi)] - log[qi / (1 - qi)] = log[(pi / qi) · ((1 - qi) / (1 - pi))]

If pi = qi then wi = 0 and the feature is irrelevant.
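A small sketch that makes the linear-separator view concrete: it turns the two-class parameters pi, qi and the class priors into the weights wi and the matching bias term from the derivation above; the function names and input format are assumptions made for illustration.

```python
import math

def nb_to_linear(p, q, prior1, prior0):
    """Turn two-class naive Bayes parameters into a linear threshold function.

    p[i] = P(x_i = 1 | v = 1), q[i] = P(x_i = 1 | v = 0); prior1 = P(v = 1), prior0 = P(v = 0).
    Returns (w, b) such that we predict v = 1 iff b + sum_i w[i] * x[i] > 0.
    """
    w = [math.log(pi / (1 - pi)) - math.log(qi / (1 - qi)) for pi, qi in zip(p, q)]
    b = math.log(prior1 / prior0) + sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
    return w, b

def predict(w, b, x):
    """x is a binary feature vector; returns the predicted class in {0, 1}."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
```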

Linear Classifiers

• Linear threshold functions
– Associate a weight (wi) with each feature (xi)
– Prediction: sign(b + wTx) = sign(b + Σ wi xi)

• If b + wTx ≥ 0, predict y = 1
• Otherwise, predict y = -1

• NB is a linear threshold function
– The weight vector (w) is assigned by computing conditional probabilities

• In fact, linear threshold functions are a very popular representation!

Linear Classifiers

Each point in this space is a document.

The coordinates (e.g., x1, x2) are determined by feature activations.

[Figure: documents plotted as + and - points in feature space, separated by the hyperplane sign(b + wTx)]

Expressivity

• Linear functions are quite expressive
– There exists a linear function that is consistent with the data

• A famous negative example (XOR):

[Figure: XOR-style data; no single linear function separates the + and - points]

Expressivity

By transforming the feature space these functions can be made linear.

Represent each point in 2D as (x, x²).
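A minimal illustration of this transformation on made-up 1D data: points that no single threshold on x separates become separable after mapping x to (x, x²); the data values and the threshold are assumptions for illustration.

```python
# 1D points: positives far from zero, negatives near zero. No single threshold
# on x separates them, but a threshold on x**2 does.
points = [(-3.0, +1), (-2.0, +1), (-0.5, -1), (0.2, -1), (0.7, -1), (2.0, +1), (3.0, +1)]

# Map each point x to the 2D representation (x, x**2).
transformed = [((x, x ** 2), y) for x, y in points]

# In the new space, the linear rule "predict +1 iff x**2 > 1.5" is consistent with the data.
for (x, x_sq), y in transformed:
    assert (+1 if x_sq > 1.5 else -1) == y
```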

Expressivity

A more realistic scenario: the data is almost linearly separable, except for some noise.

[Figure: mostly separable + and - points, with a few noisy points on the wrong side of the hyperplane sign(b + wTx)]

Features

• So far we have discussed the BoW representation

– In fact, you can use a very rich representation
• Broader definition:

– Functions mapping attributes of the input to a Boolean/categorical/numeric value

• Question: assume that you have a lexicon containing positive and negative sentiment words. How can you use it to improve over BoW?

φ1(x) = 1 if x1 is capitalized, 0 otherwise

φk(x) = 1 if x contains “good” more than twice, 0 otherwise
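A minimal sketch of such feature functions, plus one possible answer to the lexicon question above (counting positive and negative lexicon words as extra features); the lexicons and helper names are assumptions made for illustration.

```python
# Hypothetical sentiment lexicons; in practice these would come from an external resource.
POSITIVE_WORDS = {"good", "awesome", "loved", "great"}
NEGATIVE_WORDS = {"terrible", "bad", "awful", "horrible"}

def phi_1(tokens):
    """1 if the first token is capitalized, 0 otherwise."""
    return 1 if tokens and tokens[0][0].isupper() else 0

def phi_k(tokens):
    """1 if the document contains 'good' more than twice, 0 otherwise."""
    return 1 if sum(1 for t in tokens if t.lower() == "good") > 2 else 0

def phi_lexicon(tokens):
    """Two extra numeric features on top of BoW: counts of positive and negative lexicon words."""
    pos = sum(1 for t in tokens if t.lower() in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t.lower() in NEGATIVE_WORDS)
    return [pos, neg]
```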

Perceptron

• One of the earliest learning algorithms
– Introduced by Rosenblatt in 1958 to model neural learning

• Goal: directly search for a separating hyperplane
– If one exists, the perceptron will find it
– If not, …

• Online algorithm
– Considers one example at a time (NB looks at the entire dataset)

• Error-driven algorithm
– Updates the weights only when a mistake is made

Perceptron Intuition

Perceptron

• We learn f: X → {-1, +1}, represented as f = sgn(w·x)
• Where X = {0,1}^n or X = R^n, and w ∈ R^n

• Given labeled examples: {(x1, y1), (x2, y2), …, (xm, ym)}


1. Initialize w = 0 ∈ R^n

2. Cycle through all examples:

   a. Predict the label of instance x to be y’ = sgn(w·x)

   b. If y’ ≠ y, update the weight vector:
      w = w + r·y·x   (r is a constant, the learning rate)
      Otherwise, if y’ = y, leave the weights unchanged.
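A direct sketch of the algorithm above, assuming dense feature vectors as Python lists and labels in {-1, +1}; the epoch count and learning rate are illustrative choices.

```python
def perceptron_train(examples, n_features, r=1.0, epochs=10):
    """Perceptron: predict y' = sgn(w . x); on a mistake, set w = w + r * y * x."""
    w = [0.0] * n_features                      # 1. initialize w = 0
    for _ in range(epochs):                     # 2. cycle through all examples
        for x, y in examples:
            dot = sum(wi * xi for wi, xi in zip(w, x))
            y_pred = 1 if dot >= 0 else -1      # a. predict sgn(w . x)
            if y_pred != y:                     # b. update only on a mistake
                w = [wi + r * y * xi for wi, xi in zip(w, x)]
    return w

# Usage on a tiny separable dataset with labels in {-1, +1}:
data = [([1.0, 0.0], +1), ([0.0, 1.0], -1)]
w = perceptron_train(data, n_features=2)
```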

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

• The margin of a dataset (𝛾) is the maximum margin possible for that dataset using any weight vector.

Mistake Bound for Perceptron

• Let D = {(xi, yi)} be a labeled dataset that is separable.
• Let ||xi|| < R for all examples.
• Let 𝛾 be the margin of the dataset D.
• Then, the perceptron algorithm will make at most R² / 𝛾² mistakes on the data.

Practical Example


Source: Scaling to very very large corpora for natural language disambiguation. Michele Banko, Eric Brill. Microsoft Research, Redmond, WA. 2001.

Task: context-sensitive spelling: {principle, principal}, {weather, whether}.

Deceptive Reviews

Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.

What should your learning algorithm look at?

Deception Classification

Summary

• Classification is a basic tool for NLP

– E.g., What is the topic of a document?
• Classifier: a mapping from input to label

– Label: binary or categorical
• We saw two simple learning algorithms for finding the parameters of linear classification functions
– Naïve Bayes and Perceptron

• Next:
– More sophisticated algorithms
– Applications (or: how to get it to work!)

Questions?
