Bag-of-Words Methods for Text Mining
CSCI-GA.2590 – Lecture 2A
Ralph Grishman
NYU
Bag of Words Models
• do we really need elaborate linguistic analysis?
• look at text mining applications
– document retrieval
– opinion mining
– association mining
• see how far we can get with document-level bag-of-words models
– and introduce some of our mathematical approaches
• document retrieval
• opinion mining
• association mining
Information Retrieval
• Task: given query = list of keywords, identify and rank relevant documents from collection
• Basic idea: find documents whose set of words most closely matches words in query
Topic Vector
• Suppose the document collection has n distinct words, w1, …, wn
• Each document is characterized by an n-dimensional vector whose ith component is the frequency of word wi in the document

Example
• D1 = [The cat chased the mouse.]
• D2 = [The dog chased the cat.]
• W = [The, chased, dog, cat, mouse] (n = 5)
• V1 = [2, 1, 0, 1, 1]
• V2 = [2, 1, 1, 1, 0]
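A minimal sketch of this construction in Python (the tokenization and the function name are my own, not from the lecture):

from collections import Counter

# Build the term-frequency vector for one document over a fixed vocabulary.
def term_vector(doc, vocab):
    counts = Counter(doc.lower().replace(".", "").split())
    return [counts[w] for w in vocab]

vocab = ["the", "chased", "dog", "cat", "mouse"]
print(term_vector("The cat chased the mouse.", vocab))  # [2, 1, 0, 1, 1]
print(term_vector("The dog chased the cat.", vocab))    # [2, 1, 1, 1, 0]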
Weighting the components
• Unusual words like “elephant” determine the topic much more than common words such as “the” or “have”
• can ignore words on a stop list, or
• weight each term frequency tf_i by its inverse document frequency idf_i:
w_i = tf_i × idf_i
idf_i = log(N / n_i)

• where N = size of the collection and n_i = number of documents containing term i
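A short sketch of this weighting over the toy collection above, treating each document as a token list (the helper names are my assumptions):

import math

# idf_i = log(N / n_i): N documents in the collection, n_i containing term i.
def idf(term, docs):
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i) if n_i else 0.0

# Weight each term frequency tf_i by its inverse document frequency idf_i.
def tfidf_vector(doc, vocab, docs):
    return [doc.count(w) * idf(w, docs) for w in vocab]

docs = [["the", "cat", "chased", "the", "mouse"],
        ["the", "dog", "chased", "the", "cat"]]
print(tfidf_vector(docs[0], ["the", "chased", "dog", "cat", "mouse"], docs))

Note that “the” and “chased”, which occur in every document, get weight 0, while “mouse” keeps a positive weight: the common words stop dominating the vector.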
Cosine similarity metric
• Define a similarity metric between topic vectors
• A common choice is cosine similarity (a normalized dot product):
sim(A, B) = Σ_i a_i b_i / ( √(Σ_i a_i²) · √(Σ_i b_i²) )
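Applied to V1 and V2 from the earlier example, a direct sketch:

import math

# sim(A,B) = Σ a_i·b_i / (√(Σ a_i²) · √(Σ b_i²))
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

print(cosine_similarity([2, 1, 0, 1, 1], [2, 1, 1, 1, 0]))  # 6/7 ≈ 0.857

The two toy documents share most of their words, so their similarity is close to 1.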
Cosine similarity metric
• the cosine similarity metric is the cosine of the angle between the term vectors:
[Figure: two term vectors plotted on axes w1 and w2; the similarity is the cosine of the angle between them]
Verdict: a success
• For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years
• document retrieval
• opinion mining
• association mining
Opinion Mining
– Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic
• classification task
• valuable for producers/marketers of all sorts of products
– Simple strategy: bag-of-words (sketched below)
• make lists of positive and negative words
• see which predominate in a given document (and mark as ‘no opinion’ if there are few words of either type)
• problem: hard to make such lists
– lists will differ for different topics
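A minimal sketch of that word-list strategy; the lists here are tiny illustrative stand-ins (building real ones is the hard part, as noted above):

POSITIVE = {"great", "excellent", "wonderful", "superb"}   # illustrative only
NEGATIVE = {"awful", "terrible", "poor", "disappointing"}  # illustrative only

# Count positive vs. negative words; few hits of either type means 'no opinion'.
def word_list_opinion(doc, min_hits=1):
    words = doc.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg < min_hits or pos == neg:
        return "no opinion"
    return "positive" if pos > neg else "negative"

print(word_list_opinion("a superb and wonderful film"))  # -> positive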
Training a Model
– Instead, label the documents in a corpus and then train a classifier from the corpus
• labeling is easier than thinking up words
• in effect, learn positive and negative words from the corpus
– probabilistic classifier
• identify most likely class:

s = argmax_{t ∈ {pos, neg}} P(t | W)
Using Bayes’ Rule
argmax_t P(t | W) = argmax_t P(W | t) P(t) / P(W)
                  = argmax_t P(W | t) P(t)
                  = argmax_t P(w_1, …, w_n | t) P(t)
                  ≈ argmax_t Π_i P(w_i | t) · P(t)

• The last step is based on the naïve assumption of independence of the word probabilities
Training
• We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators
• P(t) = count(docs labeled t) / N
• P(w_i | t) = count(docs labeled t containing w_i) / count(docs labeled t)
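A sketch of the whole pipeline under these estimators; documents are represented as word sets (document-level counts, per the slide), and the data format is an assumption:

from collections import Counter

# Train: maximum likelihood estimates of P(t) and P(w|t) from labeled documents.
def train(labeled_docs):                    # labeled_docs: [(set_of_words, label)]
    n = len(labeled_docs)
    class_count = Counter(t for _, t in labeled_docs)
    word_count = {t: Counter() for t in class_count}
    for words, t in labeled_docs:
        word_count[t].update(words)         # each doc counts a word at most once
    prior = {t: class_count[t] / n for t in class_count}
    cond = {t: {w: c / class_count[t] for w, c in word_count[t].items()}
            for t in class_count}
    return prior, cond

# Classify: s = argmax_t P(t) * Π_i P(w_i | t)
def classify(words, prior, cond):
    def score(t):
        p = prior[t]
        for w in words:
            p *= cond[t].get(w, 0.0)  # an unseen word zeroes the product -- see the problem below
        return p
    return max(prior, key=score)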
A Problem
• Suppose a glowing review GR (with lots of positive words) includes one word, “mathematical”, previously seen only in negative reviews
• P(positive | GR) = ?
P(positive | GR) = 0, because P(“mathematical” | positive) = 0
• The maximum likelihood estimate is poor when there is very little data
• We need to ‘smooth’ the probabilities to avoid this problem
Laplace Smoothing
• A simple remedy is to add 1 to each count
– to keep them as probabilities, we increase the denominator N by the number of outcomes (values of t): 2, for ‘positive’ and ‘negative’
– for the conditional probabilities P(w | t) there are similarly two outcomes (w is present or absent)
P(t) = (count(docs labeled t) + 1) / (N + 2)
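The smoothed estimators as a sketch (two outcomes in each case, as the slide notes):

# P(t) = (count(docs labeled t) + 1) / (N + 2), for the two values of t
def smoothed_prior(count_t, n_docs):
    return (count_t + 1) / (n_docs + 2)

# P(w|t) = (count(docs labeled t containing w) + 1) / (count(docs labeled t) + 2),
# two outcomes: w present or absent
def smoothed_cond(count_t_with_w, count_t):
    return (count_t_with_w + 1) / (count_t + 2)

print(smoothed_cond(0, 10))  # "mathematical" in 0 of 10 positive docs -> 1/12, not 0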
An NLTK Demo
Sentiment Analysis with Python NLTK Text Classification
• http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier)
• http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier
• Is “low” a positive or a negative term?
Ambiguous terms
• “low” can be positive (“low price”) or negative (“low quality”)
Verdict: mixed
• A simple bag-of-words strategy with an NB model works quite well for simple reviews referring to a single item, but fails
– for ambiguous terms
– for comparative reviews
– to reveal aspects of an opinion
• “the car looked great and handled well, but the wheels kept falling off”
• document retrieval
• opinion mining
• association mining
Association Mining
• Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B
– e.g., “people who bought A also bought B”
• For text: documents with term A also have term B
– widely used in scientific and medical literature
Bag-of-words
• Simplest approach: look for words A and B for which
frequency(A and B in same document) >> frequency of A × frequency of B
(see the sketch after this list)
• doesn’t work well:
– want to find names (of companies, products, genes), not individual words
– interested in specific types of terms
– want to learn from a few examples
– need contexts to avoid noise
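A sketch of that co-occurrence test over document-level word sets (the “much greater than” threshold is an arbitrary assumption):

from collections import Counter
from itertools import combinations

# Flag pairs where P(A and B in same doc) >> P(A) * P(B).
def associated_pairs(docs, min_lift=5.0):
    n = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    pair_freq = Counter(p for d in docs
                        for p in combinations(sorted(set(d)), 2))
    hits = []
    for (a, b), f_ab in pair_freq.items():
        lift = (f_ab / n) / ((doc_freq[a] / n) * (doc_freq[b] / n))
        if lift >= min_lift:
            hits.append((a, b, lift))
    return sorted(hits, key=lambda x: -x[2])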
Needed Ingredients
• Effective text association mining needs
– name recognition
– term classification
– preferably: ability to learn patterns
– preferably: a good GUI
• demo: www.coremine.com
Conclusion
• Some tasks can be handled effectively (and very simply) by bag-of-words models,
• but most benefit from an analysis of language structure