Stats 170A: Project in Data Science
Text Analysis and Classification

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine

(Acknowledgements to Prof Mark Steyvers and Prof Sameer Singh, UCI, for various slides)



Reading, Homework, Lectures

• Reference reading:
  – Python scikit-learn documentation: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

• Homework 7
  – Due by 2pm Wednesday next week

• Next Lectures
  – Today: text analysis and classification
  – Next week: discussion of projects


Room for improvement…


Predicting the Sentiment of Tweets

“Documents” and their Class Labels:

wow :o this was included in the playlist #awesome → positive
I could not be happier with my life right now → positive
HAPPY BIRTHDAY TO THE PANDERIFIC PANDA IK. @TheOrionSound → positive
I'm tired as hell I never get a off day during the week anymore → negative
The movie was really dull and stupid → negative
@washingtonpost @silvajanes How awful!!!!!! → negative


Predicting the Scores of Movie Reviews

“Documents” and their Ratings:

I liked this movie. Well done. Great acting and direction → 4
Terrific movie, saw it 3 times. Loved the scenes in Mexico. → 5
Boring as hell. Who gets paid to create this stuff? → 1
Not one of the director's best efforts. Tom Cruise was terrible. → 2


Entity Recognition

“Documents” and their Ratings:

I liked this movie. Well done. Great acting and direction → 4
Terrific movie, saw it 3 times. Loved the scenes in Mexico. → 5
Boring as hell. Who gets paid to create this stuff? → 1
Not one of the director's best efforts. Tom Cruise was terrible. → 2


Examples of User Location Fields in Twitter

User location: Deutschland
User location: Mountain View, CA
User location: Florida, USA
User location: United States
User location: West Virginia, Appalachia
User location: Columbia Mo
User location: South Florida
User location: Germany/Berlin
User location: St Louis, MO
User location: St. Louis
User location: Central Virginia
User location: United States
User location: Sea Holme (of Norse legend) Oz
User location: California, USA
User location: USA
User location: Cambodia
User location: San Antonio, TX
User location: Between my ears
User location: Chicago, IL


Basic Concepts in Text Representation


Representing Text Documents Numerically

• How do we represent text (strings) for input to machine learning algorithms? (which generally require numeric representations)

• Define a vocabulary = set of words or terms

• Simple method: Bag of Words
  – Document represented as a vector of counts of words in the vocabulary

• More complex: Real-Valued Embeddings
  – Document (or word) represented as a real-valued vector in “embedding space”

• Even more complex: Sequential Representations
  – Sequential state-machine model for sequences of words


Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1          24   21      9         0           0         3      1
d2          32   10      5         0           3         0      1
d3          12   16      5         0           0         0      1
d4           6    7      2         0           0         0      1
d5          43   31     20         0           3         0      1
d6           2    0      0        18           7        16      2
d7           0    0      1        32          12         0      2
d8           3    0      0        22           4         2      2
d9           1    0      0        34          27        25      2
d10          6    0      0        17           4        23      2
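A matrix like this can be built in scikit-learn with CountVectorizer. The snippet below is a minimal sketch; the toy corpus and variable names are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["SQL is a database query language",            # hypothetical toy corpus
        "the derivative of a function in calculus"]

vectorizer = CountVectorizer()        # tokenizes, lowercases, and builds the vocabulary
X = vectorizer.fit_transform(docs)    # sparse document-term matrix of counts

print(vectorizer.get_feature_names())  # the vocabulary (columns of the matrix)
print(X.toarray())                     # one row of counts per document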


Concepts and Terminology

• Document:
  – Collection of words
    • A book, a news article, a report, a Web page, an email, a tweet, etc
  – May contain both text and metadata
  – Examples of metadata: author name(s), date, where published, etc

  Note that the definition of a document is flexible, e.g., a book could be a single document, or each section of a book could be considered a “document”

• Corpus: a collection of documents
  – e.g., all news articles from the Los Angeles Times since 1990
  – e.g., all Wikipedia Web pages
  – e.g., all Yelp reviews for restaurants in Chicago
  – e.g., a random sample of Tweets from Dec 2017


Concepts and Terminology

• Tokens:
  – Groupings of characters in the raw text
  – Individual words (or “word types”) + possibly numbers, punctuation, etc

• Words or Terms
  – Unique tokens (or combinations of tokens) that are meaningful
  – e.g., words such as “cat” or “dog”, terms such as “U.S.” or “92697”
  – Can also include bigrams (“San Diego”), trigrams (“New York City”), etc

• Vocabulary
  – The specific set of unique terms used by an algorithm or application
  – The English language has on the order of 1 million unique words
  – In a particular application we might use a vocabulary of only 10k to 50k terms
    • e.g., relevant/common words (unigrams)
    • Bigrams, trigrams, …, n-grams can also be part of the vocabulary


Typical Pipeline for Building a Text Classifier

Document (string) → Tokenization → Stop Word Removal → Vocabulary Definition → Create Bag of Words → Build Text Classifier

Note that the steps in the pipeline, and how they are implemented (e.g., how tokenization is done), will vary from application to application depending on the problem


Example of a Text Analysis Pipeline (for Information Extraction)

From http://www.nltk.org/book/ch07.html


Example

Document = ‘Chapter 1: The Beginning. In the beginning, life was tough! …’

Tokens = {‘chapter’, ‘1’, ‘the’, ‘beginning’, ‘in’, ‘the’, ‘beginning’, ‘life’, ‘was’, …}
(Punctuation and whitespace are usually ignored)


Vocabulary = {‘chapter’, ‘1’, ‘the’, ‘beginning’, ‘in’, ‘life’, ‘was’, ‘tough’, …}


Bag of Words = {[‘chapter’, 1], [‘1’, 1], [‘the’, 2], [‘beginning’, 2], …}


Tokenization

• Split up text into individual tokens (words/terms)

• Simplest approach is to ignore all punctuation and whitespace and use only unbroken strings of alphabetic characters as tokens

Original:  If you had a magic potion I’d love to have it.
Tokenized: If | you | had | a | magic | potion | I’d | love | to | have | it
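A minimal sketch of this simplest approach (keeping only unbroken alphabetic strings) using Python's re module, on the sentence above:

import re

text = "If you had a magic potion I'd love to have it."
tokens = re.findall(r"[A-Za-z]+", text)   # keep only runs of alphabetic characters
print(tokens)
# ['If', 'you', 'had', 'a', 'magic', 'potion', 'I', 'd', 'love', 'to', 'have', 'it']

Note that this splits “I'd” into “I” and “d”, one example of why purely alphabetic tokenization can lose useful information.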


Issues in Tokenization

• Finland’s capital → Finland Finlands Finland’s ?

• what’re, I’m, isn’t → What are, I am, is not

• Hewlett-Packard → Hewlett Packard ?

• state-of-the-art → state of the art ?

• Lowercase → lower-case lowercase lower case ?

• San Francisco → one token or two?

• m.p.h., PhD. → ??

From https://web.stanford.edu/~jurafsky/slp3/ (Speech and Language Processing, 3rd ed., Jurafsky and Martin)


Tokenization Software

• Instead of writing your own tokenizer with a complex set of rules, use existing software
  – e.g., the tokenizer functions in NLTK
  – e.g., the tokenizer from Stanford's natural language group

• Practical tip:
  – It's useful to keep different representations of the data to use later on, e.g.,
    • data with the original sequence and formatting, the tokenized list, bag of words, etc
  – Sequential order of words is needed for detecting n-grams.
  – Punctuation can contain useful information:

    If you had a magic potion I’d love to have it . If that makes sense

    If we decide to extract n-grams later on, we know that “it” and “if” should not be combined, so it's useful to retain this information.


Sentence Detection: Example using a Decision Tree

From https://web.stanford.edu/~jurafsky/slp3/ (Speech and Language Processing, 3rd ed., Jurafsky and Martin)


Example: Tokenization of Tweets

There are special-purpose tokenizers for different types of text, e.g., for Tweets:

from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize

tknzr = TweetTokenizer()

for i in range(10):
    print("NLTK Twitter Tokenizer: ", tknzr.tokenize(tweets.text[i]))
    print("NLTK Standard Tokenizer:", word_tokenize(tweets.text[i]))


NLTK Twitter Tokenizer: ['RT', '@ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who', '…'] NLTK Standard Tokenizer: ['RT', '@', 'ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who…']

NLTK Twitter Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https://t.co/olA35uCt89'] NLTK Standard Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https', ':', '//t.co/olA35uCt89']

NLTK Twitter Tokenizer: ['RT', '@IndivisibleTeam', ':', "Here's", 'what', 'you', 'need', 'to', 'know', 'about', '#ReleaseTheMemo', '�', '1', '⃣', 'It', '’', 's', 'not', 'really', 'a', 'memo', '.', "They're", 'talking', 'points', 'by', 'Devin', 'Nunes', '…'] NLTK Standard Tokenizer: ['RT', '@', 'IndivisibleTeam', ':', 'Here', "'s", 'what', 'you', 'need', 'to', 'know', 'about', '#', 'ReleaseTheMemo�', '1⃣', 'It’s', 'not', 'really', 'a', 'memo', '.', 'They', "'re", 'talking', 'points', 'by', 'Devin', 'Nunes…']

Sample Output from the 2 Tokenizers


Example: Defining a Vocabulary

Raw text (a string in Python):
raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

There are 14 word tokens in the string raw1 (if we ignore punctuation and spaces):
The, dog, chased, the, cat, and, the, mouse, Why, did, the, dog, do, this

The vocabulary (the unique tokens, normalizing to lowercase) is:
the, dog, chased, cat, and, mouse, why, did, do, this
The vocabulary size is 10.

The counts for a bag-of-words representation are:
the (3), dog (2), chased (1), cat (1), and (1), mouse (1), why (1), did (1), do (1), this (1)

If we remove stop words we decrease our vocabulary size to 4:
dog (2), chased (1), cat (1), mouse (1)
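A small sketch that reproduces this example in Python, assuming NLTK's English stop-word list for the last step:

import re
from collections import Counter
from nltk.corpus import stopwords   # requires nltk.download('stopwords') once

raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

tokens = re.findall(r"[A-Za-z]+", raw1)          # the 14 word tokens
counts = Counter(t.lower() for t in tokens)      # bag of words over the lowercased vocabulary
print(len(counts))                               # vocabulary size: 10
print(counts)                                    # {'the': 3, 'dog': 2, ...}

stop = set(stopwords.words('english'))
content_counts = {w: c for w, c in counts.items() if w not in stop}
print(content_counts)                            # {'dog': 2, 'chased': 1, 'cat': 1, 'mouse': 1}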


Defining the Vocabulary (after Tokenization)

• Vocabulary
  – Set of terms (words) used to construct the document-term matrix

• Basic approach: use single words (unigrams) as terms

• Remove very common terms (e.g., stop words)

• Remove very rare terms: e.g., remove all terms that occur in fewer than K documents in the corpus (e.g., K = 10)
  – Gets rid of misspellings, unusual names, etc

• Can extend the term list with n-grams
  – Frequent word combinations (2-grams, 3-grams, …)
    “feel good” / “New York City”


Frequency of Word in English Usage vs. Rank of Word by Frequency: a very long “tail” of rare words

Graph from www.prismnet.com/~dierdorf/wordfrequency.png


Stopword Removal

• Remove words that are likely to be irrelevant to our problem
• Keep content words (typically verbs, adverbs, nouns, adjectives)

Example:

If you had a magic potion I’d love to have it . If that makes sense

But what about this?

[Prince Hamlet] To be or not to be ...


NLTK Stop Word List

Note: in many applications there may be additional domain-specific “stop words” that are very common and that we may want to remove since they contain little information, e.g., the term “restaurant” in reviews
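The list can be inspected and applied directly in NLTK; a minimal sketch (the extra domain-specific word and the sample tokens are made-up examples):

from nltk.corpus import stopwords    # requires nltk.download('stopwords') once

stop = set(stopwords.words('english'))          # common English function words: 'the', 'is', 'and', ...
stop.add('restaurant')                          # hypothetical domain-specific stop word for reviews

tokens = ['the', 'restaurant', 'had', 'great', 'tacos']
content = [t for t in tokens if t not in stop]  # -> ['great', 'tacos']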


N-grams

• Useful n-grams are groups of n words that commonly appear in sequence
  – “Computer Science”
  – “Los Angeles”
  – “New York City”

  – Note that we can have both terms like “computer” and “computer science”, e.g.,
    “I am studying computer science at UCI and I bought a new computer last week.”

  – Adding an n-gram as a term in the vocabulary may be useful in a model
  – e.g., for a restaurant review classification problem, “wait time” may be more informative than “time”


Automatically Finding N-grams

• e.g., Bigrams
  – Keep track of all unique pairs that occur sequentially (excluding stop words)

    “I visited Microsoft Research for a computer science job interview last week.”
    “* visited Microsoft Research * * computer science job interview last week.”

  – Rank by frequency of occurrence (or the number of docs a bigram appears in)
  – Can also rank by other metrics

• Keep the top K in the vocabulary
  – How large should K be? Depends on the application, amount of data, etc
  – Might need to search over different values of K (e.g., using cross-validation)

• Same idea for tri-grams, 4-grams, etc.

• See also http://www.nltk.org/howto/collocations.html in NLTK (and the sketch below)
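A minimal sketch of bigram finding with NLTK's collocation tools, assuming `tokens` is a list of lowercased word tokens from your corpus:

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
measures = BigramAssocMeasures()

finder = BigramCollocationFinder.from_words(tokens)              # counts all adjacent word pairs
finder.apply_word_filter(lambda w: not w.isalpha() or w in stop)  # drop punctuation and stop words
finder.apply_freq_filter(5)                                       # ignore very rare bigrams

top_bigrams = finder.nbest(measures.raw_freq, 20)                 # top K = 20 bigrams by frequency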


Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1          24   21      9         0           0         3      1
d2          32   10      5         0           3         0      1
d3          12   16      5         0           0         0      1
d4           6    7      2         0           0         0      1
d5          43   31     20         0           3         0      1
d6           2    0      0        18           7        16      2
d7           0    0      1        32          12         0      2
d8           3    0      0        22           4         2      2
d9           1    0      0        34          27        25      2
d10          6    0      0        17           4        23      2

Use labeled training data (supervised learning) to learn a classifier – Homework Assignment 7


Overview of Text Classification


Text Classification

• Text classification has many applications
  – Spam email detection
  – Classifying news articles, e.g., Google News
  – Classifying Web pages into categories

• Data Representation
  – “Bag of words/terms” most commonly used: either counts or binary
  – Can also use other weightings and additional features (e.g., metadata)

• Classification Methods
  – Naïve Bayes: widely used baseline
    • Fast and reasonably accurate
  – Logistic Regression
    • Widely used in industry, accurate, excellent baseline
  – Neural networks and deep learning
    • Can be very accurate
    • Can require very large amounts of labeled training data
    • More complex than other methods


Example of Document by Term Matrix

      predict  finance  stocks  goal  score  team  Class Label
d1          3        1       4     0      0     1            1
d2          2        2       1     0      1     0            1
d3          1        1       2     0      0     0            1
d4          1        2       0     0      0     0            1
d5          3        1       1     0      1     0            1
d6          1        0       0     1      3     1            2
d7          0        0       1     4      1     0            2
d8          2        0       0     1      2     1            2
d9          1        0       0     2      1     2            2
d10         1        0       0     1      3     1            2


Possible Weights for a Linear Classifier

         predict  finance  stocks  goal  score  team  Class Label
Weight       0.1      2.5     1.0  -2.0   -0.2  -1.5
d1             3        1       4     0      0     1            1
d2             2        2       1     0      1     0            1
d3             1        1       2     0      0     0            1
d4             1        2       0     0      0     0            1
d5             3        1       1     0      1     0            1
d6             1        0       0     1      3     1            2
d7             0        0       1     4      1     0            2
d8             2        0       0     1      2     1            2
d9             1        0       0     2      1     2            2
d10            1        0       0     1      3     1            2
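To illustrate one plausible reading of these weights (a hedged sketch, not necessarily the exact decision rule intended in the slides): the score for a document is the dot product of its counts with the weight vector, and the sign of the score separates the two classes.

import numpy as np

weights = np.array([0.1, 2.5, 1.0, -2.0, -0.2, -1.5])   # predict, finance, stocks, goal, score, team

d1 = np.array([3, 1, 4, 0, 0, 1])   # a finance-type document (class 1)
d6 = np.array([1, 0, 0, 1, 3, 1])   # a sports-type document (class 2)

print(weights @ d1)   #  5.3 -> positive score, predict class 1
print(weights @ d6)   # -4.0 -> negative score, predict class 2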


Real Example from Yelp Data

Yelp Dataset
  Number of Reviews:                        706,693
  Number of Reviews w/o Neutral Rating:     595,468
  Number of Tokens:                      85,392,376
  Vocabulary Size w/o Stopwords:            176,114

  Array Dimensions:                (595468, 176114)
  Number of cells in the Array:     104,870,251,352
  Non-zero entries:                      28,357,001
  Density:                                  0.0027%
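At this scale the document-term matrix cannot be stored densely; vectorizers therefore return a sparse matrix, and the density above is just the fraction of non-zero cells. A minimal sketch of checking this (the variable `review_texts` is an assumed list of review strings):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(review_texts)     # scipy.sparse matrix; only non-zeros are stored

n_docs, n_terms = X.shape
density = X.nnz / (n_docs * n_terms)           # fraction of non-zero cells
print(n_docs, n_terms, X.nnz, f"{density:.4%}")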


Example of a Pipeline for Document Classification

Training Documents (corpus) → Tokenization → Lists of Tokens
Lists of Tokens → Stopword and rare-word removal → Vocabulary
Lists of Tokens + Vocabulary → Frequency Counts → Bag of Words
Bag of Words → Machine Learning Algorithm → Document Classifier


Example of a Pipeline for Document Classification

Training phase:
  Training Documents (corpus) → Tokenization → Lists of Tokens
  Lists of Tokens → Stopword and rare-word removal → Vocabulary
  Lists of Tokens + Vocabulary → Frequency Counts → Bag of Words
  Bag of Words → Machine Learning Algorithm → Document Classifier

Prediction phase:
  New Document → Tokenization → Lists of Tokens → Bag of Words (using the same Vocabulary) → Document Classifier → Label Prediction
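In scikit-learn this whole pipeline can be expressed in a few lines; the sketch below uses CountVectorizer plus a Naïve Bayes baseline (one of the classifiers mentioned earlier), with made-up variable names for the training data:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# train_texts: list of document strings, train_labels: list of class labels (assumed available)
clf = Pipeline([
    ('bow', CountVectorizer(stop_words='english', min_df=10)),  # tokenize, drop stop/rare words, count
    ('nb',  MultinomialNB()),                                    # Naive Bayes classifier on the counts
])

clf.fit(train_texts, train_labels)                               # training phase
predictions = clf.predict(["A new unseen document about databases and SQL"])  # prediction phase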


Key Steps in Document Analysis Pipelines (for Bag of Words)

• Tokenization
  – Various options (e.g., how to handle punctuation, non-alphanumeric symbols, etc)

• Vocabulary definition
  – N-grams, stop-word removal, rare-word removal, stemming

• Feature definition
  – Binary (term present or not?)
  – Counts
  – Weighted counts, e.g., TF-IDF (see later in the slides)

• Classifier selection
  – Naïve Bayes, logistic, SVMs, neural networks, etc


TF-IDF Weighting of Features

In practice the inputs can be weighted
  – It can be helpful to use “TF-IDF weights” instead of counts

TF(t, d) = term frequency = count = number of times term t occurs in doc d

IDF(t) = inverse document frequency = log(N / number of docs containing term t)

(where N = total number of docs in the corpus)

TF-IDF(t, d) = TF(t, d) * IDF(t)

The IDF term has the effect of upweighting terms that occur in few docs


TF-IDF Example

N = 1000 in a corpus of news articles

Term 1: t = “city”, appears in 500 documents
  IDF(t) = log(1000/500) = log(2) = 1   (log is base 2; the base is not important)

Term 2: t = “freeway”, appears in 10 documents
  IDF(t) = log(1000/10) = log(100) = 6.64

So occurrences of “freeway” will get upweighted by a factor of 6.64 compared to occurrences of “city”
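The same arithmetic in Python, as a quick check (base-2 logs, as in the slide):

import math

N = 1000
print(math.log2(N / 500))     # IDF("city")    = 1.0
print(math.log2(N / 10))      # IDF("freeway") = 6.64...

tf = 3                                 # e.g., "freeway" occurs 3 times in a document
print(tf * math.log2(N / 10))          # TF-IDF weight for that term in that document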


Classification Methods for Text

• Logistic regression – widely used for bag-of-words
  – Input dimensionality (vocabulary size) is high (e.g., 5k to 500k)
  – regularization is useful

• Recurrent neural networks are widely used for sequential models (see later)
  – More complex, but can pick up sequential information

(Example screenshots of explaining text-classifier predictions with the eli5 library, from http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html and https://github.com/TeamHG-Memex/eli5)


Sentiment Analysis


Very useful tutorial, many practical tips and ideas:

Online at http://sentiment.christopherpotts.net/


Example: Predicting if Review Text is Positive or Negative

• Two basic approaches
  – Use lexicons of positive and negative words
  – Use labeled data to learn a classification model (can also combine both approaches)

• Challenges, e.g., how to handle negation?
  – Positive reviews:
    • “This movie is not depressing the way I thought it would be. I don't think I have anything negative to say about it.”
    • “I wasn't disappointed at all with the acting – in fact the acting was excellent.”
  – Negative reviews:
    • “Sitting through this movie was not something I enjoyed doing”
    • “This wasn't a very good script. The director is usually excellent (some really great movies over the years that I really enjoyed), but this leaves me cold”


Negation Marking (during Tokenization)

• Idea: append a _NEG suffix to every word appearing between a negation and a clause-level punctuation mark

From http://sentiment.christopherpotts.net/lingstruc.html
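A rough sketch of this idea in Python (a simplified version of the marking described in the Potts tutorial; the negation-word list and punctuation set here are illustrative, not the full rule set):

NEGATIONS = {"not", "no", "never", "n't", "cannot"}          # illustrative subset
CLAUSE_PUNCT = {".", ",", ";", "!", "?", ":"}                # clause-level punctuation

def mark_negation(tokens):
    """Append _NEG to tokens between a negation word and the next clause punctuation."""
    out, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_PUNCT:
            in_scope = False
            out.append(tok)
        elif tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
            in_scope = True
            out.append(tok)
        else:
            out.append(tok + "_NEG" if in_scope else tok)
    return out

print(mark_negation("I did n't enjoy the acting at all .".split()))
# ['I', 'did', "n't", 'enjoy_NEG', 'the_NEG', 'acting_NEG', 'at_NEG', 'all_NEG', '.']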


Examples of Negation Marking

From http://sentiment.christopherpotts.net/lingstruc.html


Improvements in Classification Accuracy

From http://sentiment.christopherpotts.net/lingstruc.html


Sentiment Lexicons (Dictionaries)

• Counts, percentages, weights of words in various categories
  – e.g., the degree to which a word tends to be positive or negative

• Example Lexicons
  – General Inquirer (http://www.wjh.harvard.edu/~inquirer)
    • Words categorized according to Positive/Negative, Strong vs Weak, Active vs Passive, etc
  – SentiWordNet (http://sentiwordnet.isti.cnr.it/)
    • Synsets in WordNet 3.0 annotated for degrees of positivity, negativity, and neutrality/objectiveness
  – Linguistic Inquiry and Word Count (http://www.liwc.net/)
  – NRC Word-Emotion Association Lexicon


Yelp Challenge Dataset


Jupyter Notebook Example with Yelp Dataset


Vector Representations for Text


Vector Representations allow Generalization

• Example:
    Cat:    [1.5, 1.5, -0.2, 0.0, 0.0, 0.0]
    Kitten: [1.4, 1.5, -0.2, 0.0, 0.0, 0.0]
    Pet:    [2.0, 1.6, -0.3, 0.0, 0.0, 0.0]

• If the word “cat” occurs in a document then we know that other words like “kitten” and “pet” are similar

• Why is this useful in learning?
  – Consider a classifier based on weights (e.g., logistic regression, neural network)
  – Traditional approach: each word has its own separate weight
  – If we represent words by their vectors, and learn weights on the vectors, then words that are close together will produce similar outputs
  – This can help with generalization


Another Example of Word Embedding


Publicly-Available Pre-Trained Word Embeddings

• Word2Vec: https://code.google.com/archive/p/word2vec/
  – Google News dataset, 3M vocab (words and phrases), from 100B tokens
  – Entity vectors: 1.4M vectors for entities, trained on 100B words from news articles, entity names from Freebase

• GloVe: http://nlp.stanford.edu/projects/glove/
  – Wikipedia: 400k vocab, 50d to 300d vectors, based on 6B tokens
  – Common Crawl: 2.2M vocab, 300d vectors, from 840B tokens
  – Twitter: 1.2M vocab, 25d to 200d vectors, based on 2B tweets

• Note: you may want to consider using these in your projects
  – Easier and faster than training your own embedding model
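One common way to use these vectors is via the gensim library; a minimal sketch, assuming gensim is installed and the Google News binary file from the Word2Vec page above has been downloaded:

from gensim.models import KeyedVectors

# load the pre-trained 300-dimensional Google News vectors (a large binary file)
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

print(wv['cat'].shape)                   # (300,) -- the embedding vector for "cat"
print(wv.most_similar('cat', topn=5))    # nearest words by cosine similarity
print(wv.similarity('cat', 'kitten'))    # cosine similarity between two words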


Cosine Distance between two Vectors

Cosine distance = 1 - cosine similarity; cosine similarity measures the cosine of the angle between 2 vectors, where cos(0) = 1 and cos(90) = 0

Figure: three vectors x, y, and z plotted against Component 1 and Component 2. In this example, y and z are more similar under cosine similarity than y and x because the angle between y and z is much smaller.
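Cosine similarity is straightforward to compute directly; a small sketch with numpy, using the toy “Cat”/“Kitten” vectors from the earlier slide:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat    = np.array([1.5, 1.5, -0.2, 0.0, 0.0, 0.0])
kitten = np.array([1.4, 1.5, -0.2, 0.0, 0.0, 0.0])

sim = cosine_similarity(cat, kitten)     # close to 1.0 since the angle between them is small
dist = 1 - sim                           # cosine distance
print(sim, dist)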


Examples of Similarity between Words
Examples from https://code.google.com/archive/p/word2vec/

(Tables: words with the highest cosine similarity to “France” and to “San Francisco”)

Cosine similarity measures the cosine of the angle between 2 vectors, where cos(0) = 1 and cos(90) = 0


Most Similar Vectors to ‘cat’ from word2vec 300d embeddings

Rank  Word              Cosine Similarity
 1    cat               1.00
 2    cats              0.81
 3    dog               0.76
 4    kitten            0.75
 5    feline            0.73
 6    beagle            0.72
 7    puppy             0.71
 8    pup               0.69
 9    pet               0.69
10    felines           0.68
11    chihuahua         0.67
12    pooch             0.67
13    kitties           0.67
14    dachshund         0.67
15    poodle            0.66
16    stray_cat         0.66
17    Shih_Tzu          0.66
18    tabby             0.66
19    basset_hound      0.65
20    golden_retriever  0.65


Most Similar Vectors to ‘oxycontin’ from word2vec 300d embeddings

Rank  Word                Cosine Similarity
 1    oxycontin           1.00
 2    Oxycontin           0.79
 3    Oxycodone           0.71
 4    OxyContin           0.69
 5    morphine_pills      0.67
 6    hydrocodone         0.67
 7    Lortab              0.67
 8    oxycodone           0.65
 9    OxyContin_pills     0.65
10    Hydrocodone         0.65
11    Alprazolam          0.64
12    Oxycodone_pills     0.63
13    Xanax               0.62
14    Lorcet              0.62
15    Percocet_pills      0.61
16    Valium              0.61
17    Diazepam            0.60
18    methadone_pills     0.60
19    hydrocodone_pills   0.60
20    prescription_pills  0.59


Most Similar Vectors to ‘iran’ from word2vec 300d embeddings

Rank  Word         Cosine Similarity
 1    iran         1.00
 2    pakistan     0.66
 3    israel       0.64
 4    lebanon      0.62
 5    russia       0.62
 6    iraq         0.61
 7    egypt        0.60
 8    Irans        0.59
 9    afghanistan  0.59
10    Iran         0.59
11    arabs        0.58
12    israeli      0.57
13    obama        0.56
14    arab         0.56
15    iraqi        0.56
16    japan        0.56
17    sri_lanka    0.56
18    america      0.55
19    korea        0.55
20    cuba         0.55


Most Similar Vectors to ‘Iran’ from word2vec 300d embeddings

Rank  Word                Cosine Similarity
 1    Iran                1.00
 2    Tehran              0.83
 3    Iranian             0.82
 4    Islamic_republic    0.81
 5    Islamic_Republic    0.81
 6    Iranians            0.78
 7    Teheran             0.76
 8    Syria               0.75
 9    Ahmadinejad         0.69
10    Irans               0.67
11    Larijani            0.66
12    North_Korea         0.64
13    Mottaki             0.63
14    clerical_regime     0.63
15    nuclear_ambitions   0.63
16    Khamenei            0.63
17    Ayatollah_Khamenei  0.63
18    Velayati            0.63
19    Ahmedinejad         0.63
20    Rafsanjani          0.62


Recurrent Neural Networks for Sequential Data


Network Model for Predicting the Next Word in a Sentence

Sentence: The dog saw the cat on the wall.

Figure: each word in the vocabulary (the, dog, saw, cat, on, wall, .) is given a binary “one-hot” input encoding (e.g., 1000000 for “the”, 0100000 for “dog”). The inputs feed into real-valued hidden unit activations (h1, h2), and the network produces a binary “one-hot” target output for the next word.


Standard Recurrent Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

Different to a “feedforward” network: the hidden unit at position t now also has input from the hidden vector at time t-1


State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/


State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

Key points:
– f_W at each position computes the RNN state
– This state is updated recursively from the previous state and the current input


State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

Key points:
– The x→f and f→h arrows are weight matrices
– The same weight matrices are (usually) used across the sequence
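As a concrete sketch of this recursion, the standard “vanilla” RNN update can be written in a few lines of numpy; the dimensions and random weight values here are made up for illustration:

import numpy as np

vocab_size, hidden_size = 7, 16
W_xh = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden weights
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden (recurrent) weights
W_hy = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output weights

def rnn_step(h_prev, x):
    """One time step: new state from the previous state and the current (one-hot) input."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y = W_hy @ h                            # scores over the vocabulary for the next word
    return h, y

h = np.zeros(hidden_size)
x = np.zeros(vocab_size); x[0] = 1.0        # one-hot input, e.g., "the"
h, y = rnn_step(h, x)                       # the same weight matrices are reused at every position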


Learning the Weights in a Recurrent Network

Figures from http://cs231n.stanford.edu/slides/2017/

Learning:
– a loss function compares predictions and targets
– compute the gradient (based on the error) and update the weights


Example: Prediction of Sequences of Characters

Figures from http://cs231n.stanford.edu/slides/2017/


Simulating Sequences from an RNN

Figures from http://cs231n.stanford.edu/slides/2017/



Output from an RNN Model Trained on Shakespeare

KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

DUKE VINCENTIO: Well, your wit is in the care of side and that.

Examples from “The Unreasonable Effectiveness of Recurrent Neural Networks”, Andrej Karpathy, blog, http://karpathy.github.io/2015/05/21/rnn-effectiveness/


Output from an RNN Model Trained on Cooking Recipes

From https://gist.github.com/nylki/1efbaa36635956d35bcc


Different Recurrent Network Models

Example tasks: image classification, image captioning, sentiment analysis, machine translation, synced sequence prediction (video classification)


Part-of-Speech Tagging


Part of Speech (POS) Tagging

• Common POS categories (or tags) in English:
  – Noun, verb, article, preposition, pronoun, adverb, conjunction, interjection

• However there are many more specialized categories
  – E.g., proper nouns: ‘Toronto’, ‘Smith’, …
  – E.g., comparative adverbs: ‘bigger’, ‘smaller’, …
  – E.g., symbols: ‘3.12’, ‘$’, …

• Assigning POS categories to words in text is known as tagging


Universal Tagset (as used in NLTK)

12 universal tags:
  VERB – verbs (all tenses and modes)
  NOUN – nouns (common and proper)
  PRON – pronouns
  ADJ – adjectives
  ADV – adverbs
  ADP – adpositions (prepositions and postpositions)
  CONJ – conjunctions
  DET – determiners
  NUM – cardinal numbers
  PRT – particles or other function words
  X – other: foreign words, typos, abbreviations
  . – punctuation

See “A Universal Part-of-Speech Tagset” by Slav Petrov, Dipanjan Das and Ryan McDonald for more details: http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/


Universal Tagset (used in NLTK)

Tag    Meaning              English Examples
ADJ    adjective            new, good, high, special, big, local
ADP    adposition           on, of, at, with, by, into, under
ADV    adverb               really, already, still, early, now
CONJ   conjunction          and, or, but, if, while, although
DET    determiner, article  the, a, some, most, every, no, which
NOUN   noun                 year, home, costs, time, Africa
NUM    numeral              twenty-four, fourth, 1991, 14:24
PRT    particle             at, on, out, over per, that, up, with
PRON   pronoun              he, their, her, its, my, I, us
VERB   verb                 is, say, told, given, playing, would
.      punctuation marks    . , ; !
X      other                ersatz, esprit, dunno, gr8, univeristy

(from Section 2.3 in Chapter 5 of the NLTK Book)
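Tagging a sentence with this tagset in NLTK is a one-liner; a small sketch (requires the NLTK tokenizer and tagger models to be downloaded first):

import nltk
# one-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('universal_tagset')

tokens = nltk.word_tokenize("The dog saw the cat on the wall.")
print(nltk.pos_tag(tokens, tagset='universal'))
# expected: [('The', 'DET'), ('dog', 'NOUN'), ('saw', 'VERB'), ('the', 'DET'), ('cat', 'NOUN'),
#            ('on', 'ADP'), ('the', 'DET'), ('wall', 'NOUN'), ('.', '.')]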


POS Tagging Algorithms

• Tagging Algorithms: “POS Taggers”
  – Tagging is often done automatically with algorithms
  – These algorithms often use sequential (Markov) models
    • The tag for a particular token can depend on the words/tags before and after it
  – These models are trained using machine learning
    • using various word features and dictionaries as input
    • trained on manually labeled documents

• Tagging performance is best on material that the tagger was originally trained on (often news documents)

• Tags can be helpful “downstream” for various applications
  – E.g., for document classification we might want to only use nouns, adjectives, and verbs and ignore everything else
  – E.g., for information extraction we might focus only on nouns


Challenges in POS Tagging

• Tagging words in text with their correct POS tags is not simply assigning words to tags using a lookup table

• Semantic context
  – The negotiator was able to bridge the gap between the 2 sides
  – Here ‘bridge’ is used as a verb even though we ordinarily think of it as a noun
  – The other words in the sentence and the grammatical structure allow us to interpret ‘bridge’ here as a verb

• Ambiguity, e.g.,
  – The president was entertaining last night
    • Both the adjective and verb tags for “entertaining” work here, i.e., there is ambiguity

• Tokenization issues
  – The algorithm must be able to deal with tokens such as I’ld or ‘pre-specified’


Software for Part of Speech Tagging

• Many software packages and online tools available

• NLTK POS Tagger

• Stanford natural language group provides excellent POS taggers in several languages (English, German, Chinese, French, etc)
  • http://nlp.stanford.edu/software/tagger.shtml
  • Uses the Penn Treebank tagset

• Online demo
  – http://demo.ark.cs.cmu.edu/parse


Example: Removing Stop Words with a POS Tagger

• Filter out any term that does not belong to a (user-defined) target set of POS classes

SPEAKER   WORD     POS TAG
PATIENT   if       IN   (Preposition)
PATIENT   you      PRP  (Personal Pronoun)
PATIENT   had      VBD  (Verb, Past Tense)
PATIENT   a        DT   (Determiner)
PATIENT   magic    JJ   (Adjective)
PATIENT   potion   NN   (Noun)
PATIENT   i        PRP  (Personal Pronoun)
PATIENT   'd       MD   (Modal)
PATIENT   love     VB   (Verb, base form)
PATIENT   to       TO
PATIENT   have     VB   (Verb, base form)
PATIENT   it       PRP  (Personal Pronoun)

Example rule: extract adjectives and nouns only
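A minimal sketch of such a rule using NLTK and the universal tagset (keeping only adjectives and nouns, as in the example rule above):

import nltk

KEEP = {'ADJ', 'NOUN'}   # the target set of POS classes

tokens = nltk.word_tokenize("If you had a magic potion I'd love to have it")
tagged = nltk.pos_tag(tokens, tagset='universal')

content_terms = [word for word, tag in tagged if tag in KEEP]
print(content_terms)   # expected: ['magic', 'potion'] for this sentence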