Stats 170A: Project in Data Science
Text Analysis and Classification

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine

(Acknowledgements to Prof Mark Steyvers and Prof Sameer Singh, UCI, for various slides)



Reading, Homework, Lectures

• Reference reading:
  – Python scikit-learn documentation: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

• Homework 7
  – Due by 2pm Wednesday next week

• Next Lectures
  – Today: text analysis and classification
  – Next week: discussion of projects


Room for improvement…


Predicting the Sentiment of Tweets

“Documents” and their Class Labels:

wow :o this was included in the playlist #awesome → positive
I could not be happier with my life right now → positive
HAPPY BIRTHDAY TO THE PANDERIFIC PANDA IK. @TheOrionSound → positive
I'm tired as hell I never get a off day during the week anymore → negative
The movie was really dull and stupid → negative
@washingtonpost @silvajanes How awful!!!!!! → negative


Predicting the Scores of Movie Reviews

“Documents” and their Ratings:

I liked this movie. Well done. Great acting and direction → 4
Terrific movie, saw it 3 times. Loved the scenes in Mexico. → 5
Boring as hell. Who gets paid to create this stuff? → 1
Not one of the director's best efforts. Tom Cruise was terrible. → 2


Entity Recognition

“Documents” and their Ratings:

I liked this movie. Well done. Great acting and direction → 4
Terrific movie, saw it 3 times. Loved the scenes in Mexico. → 5
Boring as hell. Who gets paid to create this stuff? → 1
Not one of the director's best efforts. Tom Cruise was terrible. → 2


Examples of User Location Fields in Twitter

User location: Deutschland
User location: Mountain View, CA
User location: Florida, USA
User location: United States
User location: West Virginia, Appalachia
User location: Columbia Mo
User location: South Florida
User location: Germany/Berlin
User location: St Louis, MO
User location: St. Louis
User location: Central Virginia
User location: United States
User location: Sea Holme (of Norse legend) Oz
User location: California, USA
User location: USA
User location: Cambodia
User location: San Antonio, TX
User location: Between my ears
User location: Chicago, IL


Basic Concepts in Text Representation


Representing Text Documents Numerically

• How do we represent text (strings) for input to machine learning algorithms? (which generally require numeric representations)

• Define a vocabulary = set of words or terms

• Simple method: Bag of Words
  – Document represented as a vector of counts of words in the vocabulary

• More complex: Real-Valued Embeddings
  – Document (or word) represented as a real-valued vector in “embedding space”

• Even more complex: Sequential Representations
  – Sequential state-machine model for sequences of words


Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1          24   21      9         0           0         3      1
d2          32   10      5         0           3         0      1
d3          12   16      5         0           0         0      1
d4           6    7      2         0           0         0      1
d5          43   31     20         0           3         0      1
d6           2    0      0        18           7        16      2
d7           0    0      1        32          12         0      2
d8           3    0      0        22           4         2      2
d9           1    0      0        34          27        25      2
d10          6    0      0        17           4        23      2
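A matrix like this can be built in scikit-learn with CountVectorizer. The snippet below is a minimal sketch; the toy corpus and variable names are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["SQL is a database query language",            # hypothetical toy corpus
        "the derivative of a function in calculus"]

vectorizer = CountVectorizer()        # tokenizes, lowercases, and builds the vocabulary
X = vectorizer.fit_transform(docs)    # sparse document-term matrix of counts

print(vectorizer.get_feature_names())  # the vocabulary (columns of the matrix)
print(X.toarray())                     # one row of counts per document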


Concepts and Terminology

• Document:
  – Collection of words
    • A book, a news article, a report, a Web page, an email, a tweet, etc
  – May contain both text and metadata
  – Examples of metadata: author name(s), date, where published, etc

  Note that the definition of a document is flexible, e.g., a book could be a single document, or each section of a book could be considered a “document”

• Corpus: a collection of documents
  – e.g., all news articles from the Los Angeles Times since 1990
  – e.g., all Wikipedia Web pages
  – e.g., all Yelp reviews for restaurants in Chicago
  – e.g., a random sample of Tweets from Dec 2017


Concepts and Terminology

• Tokens:
  – Groupings of characters in the raw text
  – Individual words (or “word types”) + possibly numbers, punctuation, etc

• Words or Terms
  – Unique tokens (or combinations of tokens) that are meaningful
  – e.g., words such as “cat” or “dog”, terms such as “U.S.” or “92697”
  – Can also include bigrams (“San Diego”), trigrams (“New York City”), etc

• Vocabulary
  – The specific set of unique terms used by an algorithm or application
  – The English language has on the order of 1 million unique words
  – In a particular application we might use a vocabulary of only 10k to 50k terms
    • e.g., relevant/common words (unigrams)
    • Bigrams, trigrams, …, n-grams can also be part of the vocabulary


Typical Pipeline for Building a Text Classifier

Document (string) → Tokenization → Stop Word Removal → Vocabulary Definition → Create Bag of Words → Build Text Classifier

Note that the steps in the pipeline, and how they are implemented (e.g., how tokenization is done), will vary from application to application depending on the problem


Example of a Text Analysis Pipeline (for Information Extraction)

From http://www.nltk.org/book/ch07.html


Example

Document = ‘Chapter 1: The Beginning. In the beginning, life was tough! …’

Tokens = {‘chapter’, ‘1’, ‘the’, ‘beginning’, ‘in’, ‘the’, ‘beginning’, ‘life’, ‘was’, …}
(Punctuation and whitespace are usually ignored)


Vocabulary = {‘chapter’, ‘1’, ‘the’, ‘beginning’, ‘in’, ‘life’, ‘was’, ‘tough’, …}


Bag of Words = {[‘chapter’, 1], [‘1’, 1], [‘the’, 2], [‘beginning’, 2], …}


Tokenization

• Split up text into individual tokens (words/terms)

• Simplest approach is to ignore all punctuation and whitespace and use only unbroken strings of alphabetic characters as tokens

Original:  If you had a magic potion I’d love to have it.
Tokenized: If | you | had | a | magic | potion | I’d | love | to | have | it
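A minimal sketch of this simplest approach (keeping only unbroken alphabetic strings) using Python's re module, on the sentence above:

import re

text = "If you had a magic potion I'd love to have it."
tokens = re.findall(r"[A-Za-z]+", text)   # keep only runs of alphabetic characters
print(tokens)
# ['If', 'you', 'had', 'a', 'magic', 'potion', 'I', 'd', 'love', 'to', 'have', 'it']

Note that this splits “I'd” into “I” and “d”, one example of why purely alphabetic tokenization can lose useful information.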


Issues in Tokenization

• Finland’s capital → Finland Finlands Finland’s ?

• what’re, I’m, isn’t → What are, I am, is not

• Hewlett-Packard → Hewlett Packard ?

• state-of-the-art → state of the art ?

• Lowercase → lower-case lowercase lower case ?

• San Francisco → one token or two?

• m.p.h., PhD. → ??

From https://web.stanford.edu/~jurafsky/slp3/ (Speech and Language Processing, 3rd ed., Jurafsky and Martin)


Tokenization Software

• Instead of writing your own tokenizer with a complex set of rules, use existing software
  – e.g., the tokenizer functions in NLTK
  – e.g., the tokenizer from Stanford's natural language group

• Practical tip:
  – It's useful to keep different representations of the data to use later on, e.g.,
    • data with the original sequence and formatting, the tokenized list, bag of words, etc
  – Sequential order of words is needed for detecting n-grams.
  – Punctuation can contain useful information:

    If you had a magic potion I’d love to have it . If that makes sense

    If we decide to extract n-grams later on, we know that “it” and “if” should not be combined, so it's useful to retain this information.


Sentence Detection: Example using a Decision Tree

From https://web.stanford.edu/~jurafsky/slp3/ (Speech and Language Processing, 3rd ed., Jurafsky and Martin)


Example: Tokenization of Tweets

There are special-purpose tokenizers for different types of text, e.g., for Tweets:

from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize

tknzr = TweetTokenizer()

for i in range(10):
    print("NLTK Twitter Tokenizer: ", tknzr.tokenize(tweets.text[i]))
    print("NLTK Standard Tokenizer:", word_tokenize(tweets.text[i]))


NLTK Twitter Tokenizer: ['RT', '@ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who', '…'] NLTK Standard Tokenizer: ['RT', '@', 'ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who…']

NLTK Twitter Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https://t.co/olA35uCt89'] NLTK Standard Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https', ':', '//t.co/olA35uCt89']

NLTK Twitter Tokenizer: ['RT', '@IndivisibleTeam', ':', "Here's", 'what', 'you', 'need', 'to', 'know', 'about', '#ReleaseTheMemo', '�', '1', '⃣', 'It', '’', 's', 'not', 'really', 'a', 'memo', '.', "They're", 'talking', 'points', 'by', 'Devin', 'Nunes', '…'] NLTK Standard Tokenizer: ['RT', '@', 'IndivisibleTeam', ':', 'Here', "'s", 'what', 'you', 'need', 'to', 'know', 'about', '#', 'ReleaseTheMemo�', '1⃣', 'It’s', 'not', 'really', 'a', 'memo', '.', 'They', "'re", 'talking', 'points', 'by', 'Devin', 'Nunes…']

Sample Output from the 2 Tokenizers


Example: Defining a Vocabulary

Raw text (a string in Python):
raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

There are 14 word tokens in the string raw1 (if we ignore punctuation and spaces):
The, dog, chased, the, cat, and, the, mouse, Why, did, the, dog, do, this

The vocabulary (the unique tokens, normalizing to lowercase) is:
the, dog, chased, cat, and, mouse, why, did, do, this
The vocabulary size is 10.

The counts for a bag-of-words representation are:
the (3), dog (2), chased (1), cat (1), and (1), mouse (1), why (1), did (1), do (1), this (1)

If we remove stop words we decrease our vocabulary size to 4:
dog (2), chased (1), cat (1), mouse (1)
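A small sketch that reproduces this example in Python, assuming NLTK's English stop-word list for the last step:

import re
from collections import Counter
from nltk.corpus import stopwords   # requires nltk.download('stopwords') once

raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

tokens = re.findall(r"[A-Za-z]+", raw1)          # the 14 word tokens
counts = Counter(t.lower() for t in tokens)      # bag of words over the lowercased vocabulary
print(len(counts))                               # vocabulary size: 10
print(counts)                                    # {'the': 3, 'dog': 2, ...}

stop = set(stopwords.words('english'))
content_counts = {w: c for w, c in counts.items() if w not in stop}
print(content_counts)                            # {'dog': 2, 'chased': 1, 'cat': 1, 'mouse': 1}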


Defining the Vocabulary (after Tokenization)

• Vocabulary
  – Set of terms (words) used to construct the document-term matrix

• Basic approach: use single words (unigrams) as terms

• Remove very common terms (e.g., stop words)

• Remove very rare terms: e.g., remove all terms that occur in fewer than K documents in the corpus (e.g., K = 10)
  – Gets rid of misspellings, unusual names, etc

• Can extend the term list with n-grams
  – Frequent word combinations (2-grams, 3-grams, …)
    “feel good” / “New York City”


Frequency of Word in English Usage vs. Rank of Word by Frequency: a very long “tail” of rare words

Graph from www.prismnet.com/~dierdorf/wordfrequency.png


Stopword Removal

• Remove words that are likely to be irrelevant to our problem
• Keep content words (typically verbs, adverbs, nouns, adjectives)

Example:

If you had a magic potion I’d love to have it . If that makes sense

But what about this?

[Prince Hamlet] To be or not to be ...


NLTK Stop Word List

Note: in many applications there may be additional domain-specific “stop words” that are very common and that we may want to remove since they contain little information, e.g., the term “restaurant” in reviews
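The list can be inspected and applied directly in NLTK; a minimal sketch (the extra domain-specific word and the sample tokens are made-up examples):

from nltk.corpus import stopwords    # requires nltk.download('stopwords') once

stop = set(stopwords.words('english'))          # common English function words: 'the', 'is', 'and', ...
stop.add('restaurant')                          # hypothetical domain-specific stop word for reviews

tokens = ['the', 'restaurant', 'had', 'great', 'tacos']
content = [t for t in tokens if t not in stop]  # -> ['great', 'tacos']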


N-grams

• Useful n-grams are groups of n words that commonly appear in sequence
  – “Computer Science”
  – “Los Angeles”
  – “New York City”

  – Note that we can have both terms like “computer” and “computer science”, e.g.,
    “I am studying computer science at UCI and I bought a new computer last week.”

  – Adding an n-gram as a term in the vocabulary may be useful in a model
  – e.g., for a restaurant review classification problem, “wait time” may be more informative than “time”


Automatically Finding N-grams

• e.g., Bigrams
  – Keep track of all unique pairs that occur sequentially (excluding stop words)

    “I visited Microsoft Research for a computer science job interview last week.”
    “* visited Microsoft Research * * computer science job interview last week.”

  – Rank by frequency of occurrence (or the number of docs a bigram appears in)
  – Can also rank by other metrics

• Keep the top K in the vocabulary
  – How large should K be? Depends on the application, amount of data, etc
  – Might need to search over different values of K (e.g., using cross-validation)

• Same idea for tri-grams, 4-grams, etc.

• See also http://www.nltk.org/howto/collocations.html in NLTK (and the sketch below)
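A minimal sketch of bigram finding with NLTK's collocation tools, assuming `tokens` is a list of lowercased word tokens from your corpus:

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
measures = BigramAssocMeasures()

finder = BigramCollocationFinder.from_words(tokens)              # counts all adjacent word pairs
finder.apply_word_filter(lambda w: not w.isalpha() or w in stop)  # drop punctuation and stop words
finder.apply_freq_filter(5)                                       # ignore very rare bigrams

top_bigrams = finder.nbest(measures.raw_freq, 20)                 # top K = 20 bigrams by frequency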


Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1          24   21      9         0           0         3      1
d2          32   10      5         0           3         0      1
d3          12   16      5         0           0         0      1
d4           6    7      2         0           0         0      1
d5          43   31     20         0           3         0      1
d6           2    0      0        18           7        16      2
d7           0    0      1        32          12         0      2
d8           3    0      0        22           4         2      2
d9           1    0      0        34          27        25      2
d10          6    0      0        17           4        23      2

Use labeled training data (supervised learning) to learn a classifier – Homework Assignment 7


Overview of Text Classification


Text Classification

• Text classification has many applications
  – Spam email detection
  – Classifying news articles, e.g., Google News
  – Classifying Web pages into categories

• Data Representation
  – “Bag of words/terms” most commonly used: either counts or binary
  – Can also use other weightings and additional features (e.g., metadata)

• Classification Methods
  – Naïve Bayes: widely used baseline
    • Fast and reasonably accurate
  – Logistic Regression
    • Widely used in industry, accurate, excellent baseline
  – Neural networks and deep learning
    • Can be very accurate
    • Can require very large amounts of labeled training data
    • More complex than other methods


Example of Document by Term Matrix

      predict  finance  stocks  goal  score  team  Class Label
d1          3        1       4     0      0     1            1
d2          2        2       1     0      1     0            1
d3          1        1       2     0      0     0            1
d4          1        2       0     0      0     0            1
d5          3        1       1     0      1     0            1
d6          1        0       0     1      3     1            2
d7          0        0       1     4      1     0            2
d8          2        0       0     1      2     1            2
d9          1        0       0     2      1     2            2
d10         1        0       0     1      3     1            2


Possible Weights for a Linear Classifier

         predict  finance  stocks  goal  score  team  Class Label
Weight       0.1      2.5     1.0  -2.0   -0.2  -1.5
d1             3        1       4     0      0     1            1
d2             2        2       1     0      1     0            1
d3             1        1       2     0      0     0            1
d4             1        2       0     0      0     0            1
d5             3        1       1     0      1     0            1
d6             1        0       0     1      3     1            2
d7             0        0       1     4      1     0            2
d8             2        0       0     1      2     1            2
d9             1        0       0     2      1     2            2
d10            1        0       0     1      3     1            2
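To illustrate one plausible reading of these weights (a hedged sketch, not necessarily the exact decision rule intended in the slides): the score for a document is the dot product of its counts with the weight vector, and the sign of the score separates the two classes.

import numpy as np

weights = np.array([0.1, 2.5, 1.0, -2.0, -0.2, -1.5])   # predict, finance, stocks, goal, score, team

d1 = np.array([3, 1, 4, 0, 0, 1])   # a finance-type document (class 1)
d6 = np.array([1, 0, 0, 1, 3, 1])   # a sports-type document (class 2)

print(weights @ d1)   #  5.3 -> positive score, predict class 1
print(weights @ d6)   # -4.0 -> negative score, predict class 2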


Real Example from Yelp Data

Yelp Dataset
  Number of Reviews:                        706,693
  Number of Reviews w/o Neutral Rating:     595,468
  Number of Tokens:                      85,392,376
  Vocabulary Size w/o Stopwords:            176,114

  Array Dimensions:                (595468, 176114)
  Number of cells in the Array:     104,870,251,352
  Non-zero entries:                      28,357,001
  Density:                                  0.0027%
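At this scale the document-term matrix cannot be stored densely; vectorizers therefore return a sparse matrix, and the density above is just the fraction of non-zero cells. A minimal sketch of checking this (the variable `review_texts` is an assumed list of review strings):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(review_texts)     # scipy.sparse matrix; only non-zeros are stored

n_docs, n_terms = X.shape
density = X.nnz / (n_docs * n_terms)           # fraction of non-zero cells
print(n_docs, n_terms, X.nnz, f"{density:.4%}")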


Example of a Pipeline for Document Classification

Training Documents (corpus) → Tokenization → Lists of Tokens
Lists of Tokens → Stopword and rare-word removal → Vocabulary
Lists of Tokens + Vocabulary → Frequency Counts → Bag of Words
Bag of Words → Machine Learning Algorithm → Document Classifier


Example of a Pipeline for Document Classification

Training phase:
  Training Documents (corpus) → Tokenization → Lists of Tokens
  Lists of Tokens → Stopword and rare-word removal → Vocabulary
  Lists of Tokens + Vocabulary → Frequency Counts → Bag of Words
  Bag of Words → Machine Learning Algorithm → Document Classifier

Prediction phase:
  New Document → Tokenization → Lists of Tokens → Bag of Words (using the same Vocabulary) → Document Classifier → Label Prediction
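In scikit-learn this whole pipeline can be expressed in a few lines; the sketch below uses CountVectorizer plus a Naïve Bayes baseline (one of the classifiers mentioned earlier), with made-up variable names for the training data:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# train_texts: list of document strings, train_labels: list of class labels (assumed available)
clf = Pipeline([
    ('bow', CountVectorizer(stop_words='english', min_df=10)),  # tokenize, drop stop/rare words, count
    ('nb',  MultinomialNB()),                                    # Naive Bayes classifier on the counts
])

clf.fit(train_texts, train_labels)                               # training phase
predictions = clf.predict(["A new unseen document about databases and SQL"])  # prediction phase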


Key Steps in Document Analysis Pipelines (for Bag of Words)

• Tokenization
  – Various options (e.g., how to handle punctuation, non-alphanumeric symbols, etc)

• Vocabulary definition
  – N-grams, stop-word removal, rare-word removal, stemming

• Feature definition
  – Binary (term present or not?)
  – Counts
  – Weighted counts, e.g., TF-IDF (see later in the slides)

• Classifier selection
  – Naïve Bayes, logistic, SVMs, neural networks, etc


TF-IDF Weighting of Features

In practice the inputs can be weighted
  – It can be helpful to use “TF-IDF weights” instead of counts

TF(t, d) = term frequency = count = number of times term t occurs in doc d

IDF(t) = inverse document frequency = log(N / number of docs containing term t)

(where N = total number of docs in the corpus)

TF-IDF(t, d) = TF(t, d) * IDF(t)

The IDF term has the effect of upweighting terms that occur in few docs


TF-IDF Example

N = 1000 in a corpus of news articles

Term 1: t = “city”, appears in 500 documents
  IDF(t) = log(1000/500) = log(2) = 1   (log is base 2; the base is not important)

Term 2: t = “freeway”, appears in 10 documents
  IDF(t) = log(1000/10) = log(100) = 6.64

So occurrences of “freeway” will get upweighted by a factor of 6.64 compared to occurrences of “city”
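The same arithmetic in Python, as a quick check (base-2 logs, as in the slide):

import math

N = 1000
print(math.log2(N / 500))     # IDF("city")    = 1.0
print(math.log2(N / 10))      # IDF("freeway") = 6.64...

tf = 3                                 # e.g., "freeway" occurs 3 times in a document
print(tf * math.log2(N / 10))          # TF-IDF weight for that term in that document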


Classification Methods for Text

• Logistic regression – widely used for bag-of-words
  – Input dimensionality (vocabulary size) is high (e.g., 5k to 500k)
  – regularization is useful

• Recurrent neural networks are widely used for sequential models (see later)
  – More complex, but can pick up sequential information

(Example screenshots of explaining text-classifier predictions with the eli5 library, from http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html and https://github.com/TeamHG-Memex/eli5)


Sentiment Analysis


Very useful tutorial, many practical tips and ideas:

Online at http://sentiment.christopherpotts.net/


Example: Predicting if Review Text is Positive or Negative

• Two basic approaches
  – Use lexicons of positive and negative words
  – Use labeled data to learn a classification model (can also combine both approaches)

• Challenges, e.g., how to handle negation?
  – Positive reviews:
    • “This movie is not depressing the way I thought it would be. I don't think I have anything negative to say about it.”
    • “I wasn't disappointed at all with the acting – in fact the acting was excellent.”
  – Negative reviews:
    • “Sitting through this movie was not something I enjoyed doing”
    • “This wasn't a very good script. The director is usually excellent (some really great movies over the years that I really enjoyed), but this leaves me cold”


Negation Marking (during Tokenization)

• Idea: append a _NEG suffix to every word appearing between a negation and a clause-level punctuation mark

From http://sentiment.christopherpotts.net/lingstruc.html
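A rough sketch of this idea in Python (a simplified version of the marking described in the Potts tutorial; the negation-word list and punctuation set here are illustrative, not the full rule set):

NEGATIONS = {"not", "no", "never", "n't", "cannot"}          # illustrative subset
CLAUSE_PUNCT = {".", ",", ";", "!", "?", ":"}                # clause-level punctuation

def mark_negation(tokens):
    """Append _NEG to tokens between a negation word and the next clause punctuation."""
    out, in_scope = [], False
    for tok in tokens:
        if tok in CLAUSE_PUNCT:
            in_scope = False
            out.append(tok)
        elif tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
            in_scope = True
            out.append(tok)
        else:
            out.append(tok + "_NEG" if in_scope else tok)
    return out

print(mark_negation("I did n't enjoy the acting at all .".split()))
# ['I', 'did', "n't", 'enjoy_NEG', 'the_NEG', 'acting_NEG', 'at_NEG', 'all_NEG', '.']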


Examples of Negation Marking

From http://sentiment.christopherpotts.net/lingstruc.html


Improvements in Classification Accuracy

From http://sentiment.christopherpotts.net/lingstruc.html


Sentiment Lexicons (Dictionaries)

• Counts, percentages, weights of words in various categories
  – e.g., the degree to which a word tends to be positive or negative

• Example Lexicons
  – General Inquirer (http://www.wjh.harvard.edu/~inquirer)
    • Words categorized according to Positive/Negative, Strong vs Weak, Active vs Passive, etc
  – SentiWordNet (http://sentiwordnet.isti.cnr.it/)
    • Synsets in WordNet 3.0 annotated for degrees of positivity, negativity, and neutrality/objectiveness
  – Linguistic Inquiry and Word Count (http://www.liwc.net/)
  – NRC Word-Emotion Association Lexicon


Yelp Challenge Dataset


Jupyter Notebook Example with Yelp Dataset


Vector Representations for Text


Vector Representations allow Generalization

• Example:
    Cat:    [1.5, 1.5, -0.2, 0.0, 0.0, 0.0]
    Kitten: [1.4, 1.5, -0.2, 0.0, 0.0, 0.0]
    Pet:    [2.0, 1.6, -0.3, 0.0, 0.0, 0.0]

• If the word “cat” occurs in a document then we know that other words like “kitten” and “pet” are similar

• Why is this useful in learning?
  – Consider a classifier based on weights (e.g., logistic regression, neural network)
  – Traditional approach: each word has its own separate weight
  – If we represent words by their vectors, and learn weights on the vectors, then words that are close together will produce similar outputs
  – This can help with generalization


Another Example of Word Embedding


Publicly-Available Pre-Trained Word Embeddings

• Word2Vec: https://code.google.com/archive/p/word2vec/
  – Google News dataset, 3M vocab (words and phrases), from 100B tokens
  – Entity vectors: 1.4M vectors for entities, trained on 100B words from news articles, entity names from Freebase

• GloVe: http://nlp.stanford.edu/projects/glove/
  – Wikipedia: 400k vocab, 50d to 300d vectors, based on 6B tokens
  – Common Crawl: 2.2M vocab, 300d vectors, from 840B tokens
  – Twitter: 1.2M vocab, 25d to 200d vectors, based on 2B tweets

• Note: you may want to consider using these in your projects
  – Easier and faster than training your own embedding model
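One common way to use these vectors is via the gensim library; a minimal sketch, assuming gensim is installed and the Google News binary file from the Word2Vec page above has been downloaded:

from gensim.models import KeyedVectors

# load the pre-trained 300-dimensional Google News vectors (a large binary file)
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

print(wv['cat'].shape)                   # (300,) -- the embedding vector for "cat"
print(wv.most_similar('cat', topn=5))    # nearest words by cosine similarity
print(wv.similarity('cat', 'kitten'))    # cosine similarity between two words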


Cosine Distance between two Vectors

Cosine distance = 1 - cosine similarity; cosine similarity measures the cosine of the angle between 2 vectors, where cos(0) = 1 and cos(90) = 0

Figure: three vectors x, y, and z plotted against Component 1 and Component 2. In this example, y and z are more similar under cosine similarity than y and x because the angle between y and z is much smaller.
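Cosine similarity is straightforward to compute directly; a small sketch with numpy, using the toy “Cat”/“Kitten” vectors from the earlier slide:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat    = np.array([1.5, 1.5, -0.2, 0.0, 0.0, 0.0])
kitten = np.array([1.4, 1.5, -0.2, 0.0, 0.0, 0.0])

sim = cosine_similarity(cat, kitten)     # close to 1.0 since the angle between them is small
dist = 1 - sim                           # cosine distance
print(sim, dist)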


Examples of Similarity between Words
Examples from https://code.google.com/archive/p/word2vec/

(Tables: words with the highest cosine similarity to “France” and to “San Francisco”)

Cosine similarity measures the cosine of the angle between 2 vectors, where cos(0) = 1 and cos(90) = 0


Most Similar Vectors to ‘cat’ from word2vec 300d embeddings

Rank  Word              Cosine Similarity
 1    cat               1.00
 2    cats              0.81
 3    dog               0.76
 4    kitten            0.75
 5    feline            0.73
 6    beagle            0.72
 7    puppy             0.71
 8    pup               0.69
 9    pet               0.69
10    felines           0.68
11    chihuahua         0.67
12    pooch             0.67
13    kitties           0.67
14    dachshund         0.67
15    poodle            0.66
16    stray_cat         0.66
17    Shih_Tzu          0.66
18    tabby             0.66
19    basset_hound      0.65
20    golden_retriever  0.65


Most Similar Vectors to ‘oxycontin’ from word2vec 300d embeddings

Rank  Word                Cosine Similarity
 1    oxycontin           1.00
 2    Oxycontin           0.79
 3    Oxycodone           0.71
 4    OxyContin           0.69
 5    morphine_pills      0.67
 6    hydrocodone         0.67
 7    Lortab              0.67
 8    oxycodone           0.65
 9    OxyContin_pills     0.65
10    Hydrocodone         0.65
11    Alprazolam          0.64
12    Oxycodone_pills     0.63
13    Xanax               0.62
14    Lorcet              0.62
15    Percocet_pills      0.61
16    Valium              0.61
17    Diazepam            0.60
18    methadone_pills     0.60
19    hydrocodone_pills   0.60
20    prescription_pills  0.59


Most Similar Vectors to ‘iran’ from word2vec 300d embeddings

Rank  Word         Cosine Similarity
 1    iran         1.00
 2    pakistan     0.66
 3    israel       0.64
 4    lebanon      0.62
 5    russia       0.62
 6    iraq         0.61
 7    egypt        0.60
 8    Irans        0.59
 9    afghanistan  0.59
10    Iran         0.59
11    arabs        0.58
12    israeli      0.57
13    obama        0.56
14    arab         0.56
15    iraqi        0.56
16    japan        0.56
17    sri_lanka    0.56
18    america      0.55
19    korea        0.55
20    cuba         0.55


Most Similar Vectors to ‘Iran’ from word2vec 300d embeddings

Rank  Word                Cosine Similarity
 1    Iran                1.00
 2    Tehran              0.83
 3    Iranian             0.82
 4    Islamic_republic    0.81
 5    Islamic_Republic    0.81
 6    Iranians            0.78
 7    Teheran             0.76
 8    Syria               0.75
 9    Ahmadinejad         0.69
10    Irans               0.67
11    Larijani            0.66
12    North_Korea         0.64
13    Mottaki             0.63
14    clerical_regime     0.63
15    nuclear_ambitions   0.63
16    Khamenei            0.63
17    Ayatollah_Khamenei  0.63
18    Velayati            0.63
19    Ahmedinejad         0.63
20    Rafsanjani          0.62


Recurrent Neural Networks for Sequential Data


Network Model for Predicting the Next Word in a Sentence

Sentence: The dog saw the cat on the wall.

Figure: each word in the vocabulary (the, dog, saw, cat, on, wall, .) is given a binary “one-hot” input encoding (e.g., 1000000 for “the”, 0100000 for “dog”). The inputs feed into real-valued hidden unit activations (h1, h2), and the network produces a binary “one-hot” target output for the next word.


Standard Recurrent Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

Different to a “feedforward” network: the hidden unit at position t now also has input from the hidden vector at time t-1


State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/


State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

Key points:
– f_W at each position computes the RNN state
– This state is updated recursively from the previous state and the current input


State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

Key points:
– The x→f and f→h arrows are weight matrices
– The same weight matrices are (usually) used across the sequence
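As a concrete sketch of this recursion, the standard “vanilla” RNN update can be written in a few lines of numpy; the dimensions and random weight values here are made up for illustration:

import numpy as np

vocab_size, hidden_size = 7, 16
W_xh = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden weights
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden (recurrent) weights
W_hy = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output weights

def rnn_step(h_prev, x):
    """One time step: new state from the previous state and the current (one-hot) input."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y = W_hy @ h                            # scores over the vocabulary for the next word
    return h, y

h = np.zeros(hidden_size)
x = np.zeros(vocab_size); x[0] = 1.0        # one-hot input, e.g., "the"
h, y = rnn_step(h, x)                       # the same weight matrices are reused at every position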


Learning the Weights in a Recurrent Network

Figures from http://cs231n.stanford.edu/slides/2017/

Learning:
– a loss function compares predictions and targets
– compute the gradient (based on the error) and update the weights


Example: Prediction of Sequences of Characters

Figures from http://cs231n.stanford.edu/slides/2017/


Simulating Sequences from an RNN

Figures from http://cs231n.stanford.edu/slides/2017/



Output from an RNN Model Trained on Shakespeare

KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

DUKE VINCENTIO: Well, your wit is in the care of side and that.

Examples from “The Unreasonable Effectiveness of Recurrent Neural Networks”, Andrej Karpathy, blog, http://karpathy.github.io/2015/05/21/rnn-effectiveness/


Output from an RNN Model Trained on Cooking Recipes

From https://gist.github.com/nylki/1efbaa36635956d35bcc


Different Recurrent Network Models

Example tasks: image classification, image captioning, sentiment analysis, machine translation, synced sequence prediction (video classification)


Part-of-Speech Tagging


Part of Speech (POS) Tagging

• Common POS categories (or tags) in English:
  – Noun, verb, article, preposition, pronoun, adverb, conjunction, interjection

• However there are many more specialized categories
  – E.g., proper nouns: ‘Toronto’, ‘Smith’, …
  – E.g., comparative adverbs: ‘bigger’, ‘smaller’, …
  – E.g., symbols: ‘3.12’, ‘$’, …

• Assigning POS categories to words in text is known as tagging


Universal Tagset (as used in NLTK)

12 universal tags:
  VERB – verbs (all tenses and modes)
  NOUN – nouns (common and proper)
  PRON – pronouns
  ADJ – adjectives
  ADV – adverbs
  ADP – adpositions (prepositions and postpositions)
  CONJ – conjunctions
  DET – determiners
  NUM – cardinal numbers
  PRT – particles or other function words
  X – other: foreign words, typos, abbreviations
  . – punctuation

See “A Universal Part-of-Speech Tagset” by Slav Petrov, Dipanjan Das and Ryan McDonald for more details: http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/


Universal Tagset (used in NLTK)

Tag    Meaning              English Examples
ADJ    adjective            new, good, high, special, big, local
ADP    adposition           on, of, at, with, by, into, under
ADV    adverb               really, already, still, early, now
CONJ   conjunction          and, or, but, if, while, although
DET    determiner, article  the, a, some, most, every, no, which
NOUN   noun                 year, home, costs, time, Africa
NUM    numeral              twenty-four, fourth, 1991, 14:24
PRT    particle             at, on, out, over per, that, up, with
PRON   pronoun              he, their, her, its, my, I, us
VERB   verb                 is, say, told, given, playing, would
.      punctuation marks    . , ; !
X      other                ersatz, esprit, dunno, gr8, univeristy

(from Section 2.3 in Chapter 5 of the NLTK Book)
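Tagging a sentence with this tagset in NLTK is a one-liner; a small sketch (requires the NLTK tokenizer and tagger models to be downloaded first):

import nltk
# one-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('universal_tagset')

tokens = nltk.word_tokenize("The dog saw the cat on the wall.")
print(nltk.pos_tag(tokens, tagset='universal'))
# expected: [('The', 'DET'), ('dog', 'NOUN'), ('saw', 'VERB'), ('the', 'DET'), ('cat', 'NOUN'),
#            ('on', 'ADP'), ('the', 'DET'), ('wall', 'NOUN'), ('.', '.')]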


POS Tagging Algorithms

• Tagging Algorithms: “POS Taggers”
  – Tagging is often done automatically with algorithms
  – These algorithms often use sequential (Markov) models
    • The tag for a particular token can depend on the words/tags before and after it
  – These models are trained using machine learning
    • using various word features and dictionaries as input
    • trained on manually labeled documents

• Tagging performance is best on material that the tagger was originally trained on (often news documents)

• Tags can be helpful “downstream” for various applications
  – E.g., for document classification we might want to only use nouns, adjectives, and verbs and ignore everything else
  – E.g., for information extraction we might focus only on nouns


Challenges in POS Tagging

• Tagging words in text with their correct POS tags is not simply assigning words to tags using a lookup table

• Semantic context
  – The negotiator was able to bridge the gap between the 2 sides
  – Here ‘bridge’ is used as a verb even though we ordinarily think of it as a noun
  – The other words in the sentence and the grammatical structure allow us to interpret ‘bridge’ here as a verb

• Ambiguity, e.g.,
  – The president was entertaining last night
    • Both the adjective and verb tags for “entertaining” work here, i.e., there is ambiguity

• Tokenization issues
  – The algorithm must be able to deal with tokens such as I’ld or ‘pre-specified’


Software for Part of Speech Tagging

• Many software packages and online tools available

• NLTK POS Tagger

• Stanford natural language group provides excellent POS taggers in several languages (English, German, Chinese, French, etc)
  • http://nlp.stanford.edu/software/tagger.shtml
  • Uses the Penn Treebank tagset

• Online demo
  – http://demo.ark.cs.cmu.edu/parse


Example: Removing Stop Words with a POS Tagger

• Filter out any term that does not belong to a (user-defined) target set of POS classes

SPEAKER   WORD     POS TAG
PATIENT   if       IN   (Preposition)
PATIENT   you      PRP  (Personal Pronoun)
PATIENT   had      VBD  (Verb, Past Tense)
PATIENT   a        DT   (Determiner)
PATIENT   magic    JJ   (Adjective)
PATIENT   potion   NN   (Noun)
PATIENT   i        PRP  (Personal Pronoun)
PATIENT   'd       MD   (Modal)
PATIENT   love     VB   (Verb, base form)
PATIENT   to       TO
PATIENT   have     VB   (Verb, base form)
PATIENT   it       PRP  (Personal Pronoun)

Example rule: extract adjectives and nouns only
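A minimal sketch of such a rule using NLTK and the universal tagset (keeping only adjectives and nouns, as in the example rule above):

import nltk

KEEP = {'ADJ', 'NOUN'}   # the target set of POS classes

tokens = nltk.word_tokenize("If you had a magic potion I'd love to have it")
tagged = nltk.pos_tag(tokens, tagset='universal')

content_terms = [word for word, tag in tagged if tag in KEEP]
print(content_terms)   # expected: ['magic', 'potion'] for this sentence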