1
http://ebiquity.umbc.edu/ Feature Space for Text Feature Space for Text Analysis Analysis Justin Martineau, Tim Finin, Shamit Patel and Justin Martineau, Tim Finin, Shamit Patel and Anupam Joshi Anupam Joshi University of Maryland, Baltimore County University of Maryland, Baltimore County Bag of words text classification techniques represent documents as a collection of words with associated counts. Machine learning algorithms, such as support vector machines, use these feature spaces to classify new documents using their similarity to a set of manually labeled documents. We describe an efficient way to weight these feature scores to improve classification accuracy. epresenting A Movie Review for City of Angels as a Bag of Words Most bag of words representations use the raw word count as the feature value. The words in this word cloud are sized to this value. Raw Term Frequency Weights: Raw word counts place too much emphasis on stop words. To correct, we can weight the raw score with TFIDF to boost the value of rare words in documents. Words are sized to their TFIDF value. TFIDF Weights: Delta TFIDF Weights: TFIDF values rare words, but sentimental words are common in sentiment analysis corpora. Delta TFIDF values words that are rarer in one type of labeled document than the other. Delta TFIDF calculates inverse document frequency (IDF) scores separately in the positive and negative training sets. Subtracting the positive IDF score from the negative for a word determines the sentiment orientation. The word cloud on the left shows positive words for movies in blue and negative in red. Word size indicates strength. Multiplying the Delta IDF score by the raw word count yields the Delta TFIDF representation for our movie review of the city of angels as depicted on the right. Product Review Dataset 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Fold 0 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Fold 6 Fold 7 Fold 8 Fold 9 Fold Baseline Delta TFID Enron Spam Dataset Spam vs Not Spam 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Fold 0 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Fold 6 Fold 7 Fold 8 Fold 9 Fold Baseline Delta TFID curacy on Binary Classification Tasks using SVMs: TFIDF underperforms the baseline because sentim- ental words are common in a corpus composed entirely of movie reviews. Distance from an SVM’s margin is a good proxy for opinion strength and con-fidence. Delta TFIDF works on both subjectivity detection and sentiment orientation determinations. Congressional debate transcripts are more dif-ficult than movie reviews. Party affiliation is a better indicator of how a con-gress person will vote than the content of their speeches. Delta TFIDF using only half as much training data as the baseline method operates about as well on even the weakest senti- mental expressions. Delta TFIDF works on a variety of domains and is not limited to just movie reviews. Delta TFIDF works on a variety of binary text classification problems and is not limited to sentiment analysis.

Delta TFIDF: an Improved Feature Space for Text Analysis

Embed Size (px)

DESCRIPTION

Raw Term Frequency Weights:. TFIDF Weights:. Most bag of words representations use the raw word count as the feature value. The words in this word cloud are sized to this value. - PowerPoint PPT Presentation

Citation preview

Page 1: Delta TFIDF: an Improved Feature Space for Text Analysis

http://ebiquity.umbc.edu/

Delta TFIDF: an Improved Feature Delta TFIDF: an Improved Feature Space for Text AnalysisSpace for Text AnalysisDelta TFIDF: an Improved Feature Delta TFIDF: an Improved Feature Space for Text AnalysisSpace for Text Analysis

Justin Martineau, Tim Finin, Shamit Patel and Anupam JoshiJustin Martineau, Tim Finin, Shamit Patel and Anupam JoshiUniversity of Maryland, Baltimore CountyUniversity of Maryland, Baltimore County

Justin Martineau, Tim Finin, Shamit Patel and Anupam JoshiJustin Martineau, Tim Finin, Shamit Patel and Anupam JoshiUniversity of Maryland, Baltimore CountyUniversity of Maryland, Baltimore County

Bag of words text classification techniques represent documents as a collection of words with associated counts. Machine learning algorithms, such as support vector machines, use these feature spaces to classify new documents using their similarity to a set of manually labeled documents. We describe an efficient way to weight these feature scores to improve classification accuracy.

Representing A Movie Review for City of Angels as a Bag of Words

Most bag of words representations use the raw word count as the feature value. The words in this word cloud are sized to this value.

Raw Term Frequency Weights:Raw word counts place too much emphasis on stop words. To correct, we can weight the raw score with TFIDF to boost the value of rare words in documents. Words are sized to their TFIDF value.

TFIDF Weights:

Delta TFIDF Weights:TFIDF values rare words, but sentimental words are common in sentiment analysis corpora. Delta TFIDF values words that are rarer in one type of labeled document than the other. Delta TFIDF calculates inverse document frequency (IDF) scores separately in the positive and negative training sets. Subtracting the positive IDF score from the negative for a word determines the sentiment orientation. The word cloud on the left shows positive words for movies in blue and negative in red. Word size indicates strength. Multiplying the Delta IDF score by the raw word count yields the Delta TFIDF representation for our movie review of the city of angels as depicted on the right.

Product Review Dataset

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

Fold 0 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Fold 6 Fold 7 Fold 8 Fold 9

Fold

Accuracy

Baseline

Delta TFIDF

Enron Spam DatasetSpam vs Not Spam

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

Fold 0 Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Fold 6 Fold 7 Fold 8 Fold 9

Fold

Accuracy

Baseline

Delta TFIDF

Accuracy on Binary Classification Tasks using SVMs:

TFIDF underperforms the baseline because sentim-ental words are common in a corpus composed entirely of movie reviews.

Distance from an SVM’s margin is a good proxy for opinion strength and con-fidence. Delta TFIDF works on both subjectivity detection and sentiment orientation determinations.

Congressional debate transcripts are more dif-ficult than movie reviews. Party affiliation is a better indicator of how a con-gress person will vote than the content of their speeches.

Delta TFIDF using only half as much training data as the baseline method operates about as well on even the weakest senti-mental expressions.

Delta TFIDF works on a variety of domains and is not limited to just movie reviews.

Delta TFIDF works on a variety of binary text classification problems and is not limited to sentiment analysis.