Bag-of-Words Methods for Text Mining
CSCI-GA.2590 – Lecture 2A
Ralph Grishman
NYU
Bag of Words Models
• do we really need elaborate linguistic analysis?
• look at text mining applications
– document retrieval
– opinion mining
– association mining
• see how far we can get with document-level bag-of-words models
– and introduce some of our mathematical approaches
• document retrieval
• opinion mining
• association mining
Information Retrieval
• Task: given query = list of keywords, identify and rank relevant documents from collection
• Basic idea: find documents whose set of words most closely matches words in query
Topic Vector
• Suppose the document collection has n distinct words, w1, …, wn
• Each document is characterized by an n-dimensional vector whose ith component is the frequency of word wi in the document

Example
• D1 = [The cat chased the mouse.]
• D2 = [The dog chased the cat.]
• W = [The, chased, dog, cat, mouse] (n = 5)
• V1 = [2, 1, 0, 1, 1]
• V2 = [2, 1, 1, 1, 0]
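A minimal sketch of this construction in Python (the tokenization and the function name are my own, not from the lecture):

from collections import Counter

# Build the term-frequency vector for one document over a fixed vocabulary.
def term_vector(doc, vocab):
    counts = Counter(doc.lower().replace(".", "").split())
    return [counts[w] for w in vocab]

vocab = ["the", "chased", "dog", "cat", "mouse"]
print(term_vector("The cat chased the mouse.", vocab))  # [2, 1, 0, 1, 1]
print(term_vector("The dog chased the cat.", vocab))    # [2, 1, 1, 1, 0]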
Weighting the components
• Unusual words like “elephant” determine the topic much more than common words such as “the” or “have”
• can ignore words on a stop list, or
• weight each term frequency tf_i by its inverse document frequency idf_i:
w_i = tf_i × idf_i
idf_i = log(N / n_i)

• where N = size of the collection and n_i = number of documents containing term i
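A short sketch of this weighting over the toy collection above, treating each document as a token list (the helper names are my assumptions):

import math

# idf_i = log(N / n_i): N documents in the collection, n_i containing term i.
def idf(term, docs):
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i) if n_i else 0.0

# Weight each term frequency tf_i by its inverse document frequency idf_i.
def tfidf_vector(doc, vocab, docs):
    return [doc.count(w) * idf(w, docs) for w in vocab]

docs = [["the", "cat", "chased", "the", "mouse"],
        ["the", "dog", "chased", "the", "cat"]]
print(tfidf_vector(docs[0], ["the", "chased", "dog", "cat", "mouse"], docs))

Note that “the” and “chased”, which occur in every document, get weight 0, while “mouse” keeps a positive weight: the common words stop dominating the vector.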
Cosine similarity metric
• Define a similarity metric between topic vectors
• A common choice is cosine similarity (a normalized dot product):
sim(A, B) = Σ_i a_i b_i / ( √(Σ_i a_i²) · √(Σ_i b_i²) )
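Applied to V1 and V2 from the earlier example, a direct sketch:

import math

# sim(A,B) = Σ a_i·b_i / (√(Σ a_i²) · √(Σ b_i²))
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

print(cosine_similarity([2, 1, 0, 1, 1], [2, 1, 1, 1, 0]))  # 6/7 ≈ 0.857

The two toy documents share most of their words, so their similarity is close to 1.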
Cosine similarity metric
• the cosine similarity metric is the cosine of the angle between the term vectors:
[Figure: two term vectors plotted on axes w1 and w2; the similarity is the cosine of the angle between them]
Verdict: a success
• For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years
• document retrieval
• opinion mining
• association mining
Opinion Mining
– Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic
• classification task
• valuable for producers/marketers of all sorts of products
– Simple strategy: bag-of-words (sketched below)
• make lists of positive and negative words
• see which predominate in a given document (and mark as ‘no opinion’ if there are few words of either type)
• problem: hard to make such lists
– lists will differ for different topics
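A minimal sketch of that word-list strategy; the lists here are tiny illustrative stand-ins (building real ones is the hard part, as noted above):

POSITIVE = {"great", "excellent", "wonderful", "superb"}   # illustrative only
NEGATIVE = {"awful", "terrible", "poor", "disappointing"}  # illustrative only

# Count positive vs. negative words; few hits of either type means 'no opinion'.
def word_list_opinion(doc, min_hits=1):
    words = doc.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg < min_hits or pos == neg:
        return "no opinion"
    return "positive" if pos > neg else "negative"

print(word_list_opinion("a superb and wonderful film"))  # -> positive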
Training a Model
– Instead, label the documents in a corpus and then train a classifier from the corpus
• labeling is easier than thinking up words
• in effect, learn positive and negative words from the corpus
– probabilistic classifier
• identify most likely class:

s = argmax_{t ∈ {pos, neg}} P(t | W)
Using Bayes’ Rule
argmax_t P(t | W) = argmax_t P(W | t) P(t) / P(W)
                  = argmax_t P(W | t) P(t)
                  = argmax_t P(w_1, …, w_n | t) P(t)
                  ≈ argmax_t Π_i P(w_i | t) · P(t)

• The last step is based on the naïve assumption of independence of the word probabilities
Training
• We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators
• P(t) = count(docs labeled t) / N
• P(w_i | t) = count(docs labeled t containing w_i) / count(docs labeled t)
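A sketch of the whole pipeline under these estimators; documents are represented as word sets (document-level counts, per the slide), and the data format is an assumption:

from collections import Counter

# Train: maximum likelihood estimates of P(t) and P(w|t) from labeled documents.
def train(labeled_docs):                    # labeled_docs: [(set_of_words, label)]
    n = len(labeled_docs)
    class_count = Counter(t for _, t in labeled_docs)
    word_count = {t: Counter() for t in class_count}
    for words, t in labeled_docs:
        word_count[t].update(words)         # each doc counts a word at most once
    prior = {t: class_count[t] / n for t in class_count}
    cond = {t: {w: c / class_count[t] for w, c in word_count[t].items()}
            for t in class_count}
    return prior, cond

# Classify: s = argmax_t P(t) * Π_i P(w_i | t)
def classify(words, prior, cond):
    def score(t):
        p = prior[t]
        for w in words:
            p *= cond[t].get(w, 0.0)  # an unseen word zeroes the product -- see the problem below
        return p
    return max(prior, key=score)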
A Problem
• Suppose a glowing review GR (with lots of positive words) includes one word, “mathematical”, previously seen only in negative reviews
• P(positive | GR) = ?
P(positive | GR) = 0, because P(“mathematical” | positive) = 0
• The maximum likelihood estimate is poor when there is very little data
• We need to ‘smooth’ the probabilities to avoid this problem
Laplace Smoothing
• A simple remedy is to add 1 to each count
– to keep them as probabilities, we increase the denominator N by the number of outcomes (values of t): 2, for ‘positive’ and ‘negative’
– for the conditional probabilities P(w | t) there are similarly two outcomes (w is present or absent)
P(t) = (count(docs labeled t) + 1) / (N + 2)
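The smoothed estimators as a sketch (two outcomes in each case, as the slide notes):

# P(t) = (count(docs labeled t) + 1) / (N + 2), for the two values of t
def smoothed_prior(count_t, n_docs):
    return (count_t + 1) / (n_docs + 2)

# P(w|t) = (count(docs labeled t containing w) + 1) / (count(docs labeled t) + 2),
# two outcomes: w present or absent
def smoothed_cond(count_t_with_w, count_t):
    return (count_t_with_w + 1) / (count_t + 2)

print(smoothed_cond(0, 10))  # "mathematical" in 0 of 10 positive docs -> 1/12, not 0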
An NLTK Demo
Sentiment Analysis with Python NLTK Text Classification
• http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier)
• http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier
• Is “low” a positive or a negative term?
Ambiguous terms
• “low” can be positive (“low price”) or negative (“low quality”)
Verdict: mixed
• A simple bag-of-words strategy with an NB model works quite well for simple reviews referring to a single item, but fails
– for ambiguous terms
– for comparative reviews
– to reveal aspects of an opinion
• “the car looked great and handled well, but the wheels kept falling off”
• document retrieval
• opinion mining
• association mining
Association Mining
• Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B
– e.g., “people who bought A also bought B”
• For text: documents with term A also have term B
– widely used in scientific and medical literature
Bag-of-words
• Simplest approach: look for words A and B for which
frequency(A and B in same document) >> frequency of A × frequency of B
(see the sketch after this list)
• doesn’t work well:
– want to find names (of companies, products, genes), not individual words
– interested in specific types of terms
– want to learn from a few examples
– need contexts to avoid noise
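A sketch of that co-occurrence test over document-level word sets (the “much greater than” threshold is an arbitrary assumption):

from collections import Counter
from itertools import combinations

# Flag pairs where P(A and B in same doc) >> P(A) * P(B).
def associated_pairs(docs, min_lift=5.0):
    n = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    pair_freq = Counter(p for d in docs
                        for p in combinations(sorted(set(d)), 2))
    hits = []
    for (a, b), f_ab in pair_freq.items():
        lift = (f_ab / n) / ((doc_freq[a] / n) * (doc_freq[b] / n))
        if lift >= min_lift:
            hits.append((a, b, lift))
    return sorted(hits, key=lambda x: -x[2])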
Needed Ingredients
• Effective text association mining needs
– name recognition
– term classification
– preferably: ability to learn patterns
– preferably: a good GUI
• demo: www.coremine.com
Conclusion
• Some tasks can be handled effectively (and very simply) by bag-of-words models,
• but most benefit from an analysis of language structure