A Survey of Opinion Mining. Dongjoo Lee, Intelligent Database Systems Lab., Dept. of Computer Science and Engineering, Seoul National University


Page 1: A Survey of Opinion Mining

A Survey of Opinion Mining

Dongjoo Lee

Intelligent Database Systems Lab.

Dept. of Computer Science and Engineering

Seoul National University

Page 2: A Survey of Opinion Mining

Copyright © 2007 by CEBT

Introduction

The Web contains a wealth of opinions about products, politics, and more in newsgroup posts, review sites, and other web sites

A few problems

What is the general opinion on the proposed tax reform?

How is popular opinion on the presidential candidates evolving?

Which of our customers are unsatisfied? Why?

Opinion Mining (OM)

a recent discipline at the crossroads of information retrieval and computational linguistics, concerned not with the subject of a document but with the opinion it expresses

Related Areas

Data Mining (DM), Information Retrieval (IR), Text Classification (TC), Text Summarization (TS)

Center for E-Business Technology IDS Lab. - 2

Page 3: A Survey of Opinion Mining

Agenda

Introduction

Development of Linguistic Resource

– Conjunction Method

– PMI Method

– WordNet Expansion Method

– Gloss Use Method

Sentiment Classification

– PMI Method

– Machine Learning Method

– NLP Combined Method

Extracting and Summarizing Opinion Expression

– Statistical Approach

– NLP-Based Approach

Discussion

Page 4: A Survey of Opinion Mining

Development of Linguistic Resource (1)

Linguistic resources can be used to extract opinions and to classify the sentiment of text

Appraisal Theory

– Sentiment-related properties are well defined

– A framework of linguistic resources that describes how writers and speakers express inter-subjective and ideological positions

– The underlying linguistic foundation of OM

Tasks

– Determining the subjectivity of a term

– Determining term orientation

– Determining the strength of term attitude

Example

– Objective: vertical, yellow, liquid

– Subjective, positive: good < excellent

– Subjective, negative: bad < terrible

Page 5: A Survey of Opinion Mining

Development of Linguistic Resource (2)

Conjunction Method

PMI Method

– Orientation

– Subjectivity

WordNet Expansion Method

Gloss Use Method

– Orientation

– Subjectivity

– SentiWordNet

Page 6: A Survey of Opinion Mining

Conjunction Method - overview

Hatzivassiloglou and McKeown, 1997

Hypothesis

– Adjectives joined by ‘and’ usually have the same orientation, while adjectives joined by ‘but’ usually have opposite orientations.

Process

– Randomly selected seed adjectives with known positive or negative orientation are used to predict the orientation of the other adjectives.

1. All conjunctions of adjectives are extracted from the corpus.

2. A log-linear regression model combines information from different conjunctions to determine whether each pair of conjoined adjectives has the same or different orientation.

3. A clustering algorithm separates the adjectives into two subsets of different orientation. It places as many words of the same orientation as possible into the same subset.

4. The average frequencies in each group are compared, and the group with the higher frequency is labeled as positive.
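The idea behind steps 1 to 3 can be illustrated with a much simpler stand-in. The sketch below is not the log-linear regression and clustering of Hatzivassiloglou and McKeown: it assumes every ‘and’ link means same orientation and every ‘but’ link means opposite orientation, and it spreads labels outward from seed adjectives. The function names, the toy sentences, and the seed labels are all hypothetical.

```python
import re
from collections import deque

def extract_links(sentences, adjectives):
    """Find 'ADJ and ADJ' / 'ADJ but ADJ' pairs; True marks a same-orientation link."""
    links = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        for left, conj, right in zip(words, words[1:], words[2:]):
            if conj in ("and", "but") and left in adjectives and right in adjectives:
                links.append((left, right, conj == "and"))
    return links

def propagate(links, seeds):
    """Spread seed labels (+1 positive, -1 negative) across the conjunction links."""
    neighbours = {}
    for a, b, same in links:
        neighbours.setdefault(a, []).append((b, same))
        neighbours.setdefault(b, []).append((a, same))
    labels = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for other, same in neighbours.get(word, []):
            if other not in labels:
                # 'and' keeps the label, 'but' flips it (a simplifying assumption).
                labels[other] = labels[word] if same else -labels[word]
                queue.append(other)
    return labels

# Toy data, invented for illustration only.
sentences = [
    "The room was clean and quiet.",
    "The staff were friendly but slow.",
    "It was slow and noisy.",
]
adjectives = {"clean", "quiet", "friendly", "slow", "noisy"}
labels = propagate(extract_links(sentences, adjectives), {"clean": 1, "friendly": 1})
print(labels)
```

A hard rule like this mislabels every exception to the hypothesis, which is presumably why the original method learns link probabilities and clusters on them instead.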

[Figure: conjunction method diagram. Labels: corpus, seed terms, ‘and’ / ‘but’ links, positive group, negative group]

Page 7: A Survey of Opinion Mining

Conjunction Method – objective function and constraints

Select the partition p_min that minimizes Φ(p)

– The dissimilarity between adjectives in the same cluster is minimized, and the dissimilarity between adjectives in different clusters is maximized.

Experiments

– HM term set: 1,336 adjectives (657 positive, 679 negative)

Methods to improve the performance of orientation prediction

– But rule: most conjunctions link adjectives of the same orientation, while conjunctions with ‘but’ mostly link adjectives of opposite orientation

– Log-linear regression model

– Morphological relationships, e.g. adequate-inadequate or thoughtful-thoughtless

– Log-linear model with morphological relationships: 82.5% accuracy

|Ci| : the cardinality of cluster i

d(x, y) : the dissimilarity between adjectives x , y
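The formula image for Φ did not survive on this slide. Using the symbols defined above, the objective in Hatzivassiloglou and McKeown (1997) has the following form for a partition p into two clusters C_1 and C_2. This is a reconstruction from the cited paper, not the slide's own rendering.

```latex
\Phi(p) = \sum_{i=1}^{2} \left( \frac{1}{|C_i|} \sum_{\substack{x, y \in C_i \\ x \neq y}} d(x, y) \right),
\qquad
p_{\min} = \arg\min_{p} \Phi(p)
```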

Page 8: A Survey of Opinion Mining

PMI Method - overview

Pointwise Mutual Information (PMI)

a measure of association used in information theory and statistics

Orientation

– Turney and Littman, 2003

– terms with similar orientation tend to co-occur in documents

Subjectivity

– Baroni and Vegnaduzzo, 2004

– subjective adjectives tend to occur near other subjective adjectives

Page 9: A Survey of Opinion Mining

PMI Method – predicting semantic orientation

A modified PMI was computed from the number of results returned by the AltaVista search engine with the NEAR operator

Predicting the semantic orientation of a term, SO(t)

Experiments

With the HM term set and three corpora

– With a small corpus, accuracy is no higher than that of the conjunction method.

– With a large corpus, accuracy is higher than that of the conjunction method.

t : target term

ti : paradigmatic term
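The formula for SO(t) is missing from the slide. In Turney and Littman's formulation it is the summed association of t with positive paradigmatic terms minus its summed association with negative ones. The set names S_p and S_n below are labels introduced here for those two sets of paradigmatic terms.

```latex
\mathrm{PMI}(t, t_i) = \log_2 \frac{\Pr(t, t_i)}{\Pr(t)\,\Pr(t_i)},
\qquad
\mathrm{SO}(t) = \sum_{t_i \in S_p} \mathrm{PMI}(t, t_i) \;-\; \sum_{t_i \in S_n} \mathrm{PMI}(t, t_i)
```

A positive SO(t) indicates positive orientation and a negative SO(t) indicates negative orientation.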

Corpus                          AV-ENG     AV-CA     TASA
Approx. # of words in corpus    1×10^11    2×10^9    1×10^7
Accuracy                        87.13%     80.31%    61.83%

Page 10: A Survey of Opinion Mining

WordNet Expansion Method

Hu et al., 2004 used synonym and antonym relationships between words

Hypothesis

– Adjectives usually share the same orientation as their synonyms and the opposite orientation to their antonyms

Starting from a set of seed adjectives, the orientation of all adjectives in WordNet can be assigned by a procedure that traverses the cluster graphs.
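A minimal sketch of this propagation, assuming the synonym and antonym relations are already available as plain dictionaries. The tiny word lists are invented stand-ins for WordNet, and `expand_orientation` is a hypothetical name, not a function from the cited work.

```python
def expand_orientation(seeds, synonyms, antonyms):
    """Assign +1 / -1 to every word reachable from the seeds.

    Synonyms inherit the label of the word they were reached from;
    antonyms receive the opposite label.
    """
    labels = dict(seeds)
    frontier = list(seeds)
    while frontier:
        word = frontier.pop()
        for other in synonyms.get(word, ()):
            if other not in labels:
                labels[other] = labels[word]
                frontier.append(other)
        for other in antonyms.get(word, ()):
            if other not in labels:
                labels[other] = -labels[word]
                frontier.append(other)
    return labels

# Toy relations, invented for illustration; a real run would read them from WordNet.
synonyms = {
    "good": ["fine", "nice"],
    "fine": ["good"],
    "bad": ["poor"],
    "poor": ["bad", "inferior"],
}
antonyms = {"good": ["bad"], "bad": ["good"]}

labels = expand_orientation({"good": 1}, synonyms, antonyms)
print(labels)
```

Words that no chain of synonym or antonym links connects to a seed stay unlabeled, which matches the "Limited to WordNet" remark in the summary slide.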


Page 11: A Survey of Opinion Mining

Gloss Use Method - overview

Esuli et al., 2005, 2006

Hypothesis

Orientation

– terms with similar orientation have similar glosses

Subjectivity

– terms with similar orientation have similar glosses

– terms without orientation have non-oriented glosses

SentiWordNet

All words in WordNet have three scores

– positivity, negativity, and objectivity

Each term sense is positioned in an inverted triangle

good: that which is pleasing or valuable or useful; agreeable or pleasing

beautiful: aesthetically pleasing

pretty: pleasing by delicacy or grace; not imposing

yellow: similar to the color of an egg yolk

vertical: at right angles to the plane of the horizon or a base line


Page 12: A Survey of Opinion Mining

Gloss Use Method – classification process

Process

1. A seed set (Lp, Ln) is provided as input.

2. Lexical relations (e.g. synonymy) from a thesaurus or online dictionary are used to extend the seed set. Once added to the original terms, the new terms yield two new, richer sets Trp and Trn; together they form the training set for the learning phase of Step 4.

3. For each term ti in Trp∪Trn or in the test set, a textual representation of ti is generated by collating all the glosses of ti found in a machine-readable dictionary. Each such representation is converted into vectorial form by standard text indexing techniques.

4. A binary text classifier is trained on the terms in Trp∪Trn and then applied to the terms in the test set.
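Steps 3 and 4 can be sketched with a small multinomial Naive Bayes classifier over gloss words. This is an illustration under stated assumptions, not the authors' implementation: the seed expansion of step 2 is skipped, the "text indexing" is plain whitespace tokenization, the two positive glosses are shortened from the examples on the previous slide, and the two negative glosses are invented.

```python
import math
from collections import Counter

def train_nb(glosses_by_label):
    """Count gloss words per label and compute label priors."""
    word_counts = {
        label: Counter(w for gloss in glosses for w in gloss.lower().split())
        for label, glosses in glosses_by_label.items()
    }
    vocab = set().union(*word_counts.values())
    total = sum(len(glosses) for glosses in glosses_by_label.values())
    priors = {label: len(glosses) / total for label, glosses in glosses_by_label.items()}
    return word_counts, vocab, priors

def classify_gloss(gloss, model):
    """Return the label with the highest log-probability for the gloss."""
    word_counts, vocab, priors = model
    scores = {}
    for label, counts in word_counts.items():
        n_tokens = sum(counts.values())
        score = math.log(priors[label])
        for word in gloss.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log((counts[word] + 1) / (n_tokens + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

training = {
    "positive": ["that which is pleasing or valuable or useful",  # gloss of "good"
                 "aesthetically pleasing"],                        # gloss of "beautiful"
    "negative": ["having undesirable qualities",                   # invented gloss
                 "causing fear and dread"],                        # invented gloss
}
model = train_nb(training)
print(classify_gloss("pleasing by delicacy or grace", model))  # gloss of "pretty"
```

The same pipeline could feed an SVM instead; only the classifier in step 4 changes.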

Experiments

– Classifiers: NB, SVM, PrTFIDF

– 87.38% accuracy

Page 13: A Survey of Opinion Mining

Development of Linguistic Resource - Summary

Conjunction Method

– Intuition: adjectives in ‘and’ conjunctions usually have similar orientation, though ‘but’ is used with opposite orientation

– Accuracy: 78.08%

– Characteristics: the first attempt; test data: 1,336 adjectives

PMI Method

– Intuition: terms with similar orientation tend to co-occur in documents

– Accuracy: 87.13%

– Characteristics: no limitation; much time required

WordNet Expansion Method

– Intuition: adjectives usually share the same orientation as their synonyms and the opposite orientation to their antonyms

– Accuracy: N/A

– Characteristics: limited to WordNet

Gloss Use Method

– Intuition: terms with similar orientation have similar glosses; terms without orientation have non-oriented glosses

– Accuracy: 87.38%

– Characteristics: SentiWordNet (all words in WordNet); accuracy depends on the quality of the thesaurus

Page 14: A Survey of Opinion Mining

Sentiment Classification

The process of identifying the sentiment, or polarity, of a piece of text or a document

– Document-level

– Sentence-level, phrase-level

– Feature-level: define the target of the opinion and assign the sentiment of the target

Document-level Sentiment Classification Methods

PMI Method

Machine Learning Method

– Default Classifiers

– Enhanced Classifier

NLP Combined Method

– A Two-Step Classification

– Combining Appraisal Theory

Page 15: A Survey of Opinion Mining

PMI Method

Turney et al., 2002

Process

– Only two-word phrases containing adjectives or adverbs are extracted

– Semantic orientation of a phrase: SO(phrase) = PMI(phrase, “excellent”) – PMI(phrase, “poor”)

– The semantic orientation of a review is the average semantic orientation of its phrases
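When PMI is estimated from search-engine hit counts, the phrase's own frequency cancels in the difference, so SO(phrase) reduces to a single log ratio of four counts. The sketch below assumes that form; the 0.01 smoothing term follows Turney's description of avoiding division by zero, and all hit counts in the example are made up.

```python
import math

def semantic_orientation(near_excellent, near_poor, hits_excellent, hits_poor,
                         smoothing=0.01):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor") from hit counts.

    near_excellent / near_poor: hits for the phrase NEAR each reference word.
    hits_excellent / hits_poor: hits for each reference word alone.
    """
    return math.log2(((near_excellent + smoothing) * hits_poor) /
                     ((near_poor + smoothing) * hits_excellent))

def review_orientation(phrase_scores):
    """A review is scored by the average SO of its extracted phrases."""
    return sum(phrase_scores) / len(phrase_scores)

# Hypothetical counts for two phrases from one review.
scores = [
    semantic_orientation(near_excellent=80, near_poor=20,
                         hits_excellent=1000, hits_poor=1000),
    semantic_orientation(near_excellent=5, near_poor=10,
                         hits_excellent=1000, hits_poor=1000),
]
average = review_orientation(scores)
print("recommended" if average > 0 else "not recommended")
```

Each phrase needs two NEAR queries, which is consistent with the long running time reported in the experiments.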

Experiments

– 410 reviews from Epinions (epinion.com): 170 positive, 240 negative

– Calculating the PMI of 10,658 phrases from the 410 reviews takes about 30 hours

Domain of review       Accuracy
Automobiles            84.00%
- Honda Accord         83.78%
- Volkswagen Jetta     84.21%
Banks                  80.00%
- Bank of America      78.33%
- Washington Mutual    81.67%
Movies                 65.83%
- The Matrix           66.67%
- Pearl Harbor         65.00%
Travel Destination     70.53%
- Cancun               64.41%
- Puerto Vallarta      80.56%

Page 16: A Survey of Opinion Mining

ML - Default Classifier

Pang and Lee, 2002

A special case of text categorization with sentiment- rather than topic-based categories

Document modeling

standard bag-of-features framework
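The results table on this slide distinguishes feature frequency from feature presence. A minimal sketch of the two encodings for unigram features follows; the helper names and the two toy documents are hypothetical, and tokenization is plain whitespace splitting.

```python
from collections import Counter

def build_vocabulary(documents):
    """Fixed, sorted list of unigram features seen in the documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_vector(document, vocabulary, presence=True):
    """Encode a document as presence (0/1) or frequency (count) features."""
    counts = Counter(document.lower().split())
    if presence:
        return [1 if counts[word] else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

docs = ["great great movie", "boring movie"]
vocab = build_vocabulary(docs)
print(vocab)
print(to_vector(docs[0], vocab, presence=True))
print(to_vector(docs[0], vocab, presence=False))
```

Either vector can be passed to a Naive Bayes, Maximum Entropy, or SVM classifier; only the encoding differs.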

Experiments

Data : movie reviews (Internet Movie Database), rating -> negative, neutral, positive

Naïve Bayes, Maximum Entropy, Support Vector Machine

In terms of relative performance, Naïve Bayes tends to do the worst and SVM tends to do the best, although the differences aren’t very large.

Features             # of features   Frequency or presence?   NB     ME     SVM
unigrams             16165           freq.                    78.7   N/A    72.8
unigrams             16165           pres.                    81.0   80.4   82.9
unigrams+bigrams     32330           pres.                    80.6   80.8   82.7
bigrams              16165           pres.                    77.3   77.4   77.1
unigrams+POS         16695           pres.                    81.5   80.4   81.9
adjectives           2633            pres.                    77.0   77.7   75.1
top 2633 unigrams    2633            pres.                    80.3   81.0   81.4
unigrams+position    22430           pres.                    81.0   80.1   81.6

Page 17: A Survey of Opinion Mining

ML - Using Only Subjective Sentences

Pang and Lee, 2004

improved polarity classification by removing objective sentences

A subjectivity detector determines whether each sentence is subjective or not

Standard subjectivity classifier

Subjectivity classifier using proximity relationship

Using subjectivity extracts can improve polarity classification, or at least causes no loss of accuracy.

Page 18: A Survey of Opinion Mining

NLP Combined Method – A Two-Step Classification

Wilson et al., 2005

A Two-Step Contextual Polarity Classification

employ machine learning and 28 linguistic features

document polarity : the average polarity of phrases

Step 1. Neutral-polar classifier classifies each phrase containing a clue as neutral or polar

Step 2. Polarity classifier takes all phrases marked in step 1 as polar and disambiguates their contextual polarity (positive, negative, both, or neutral).
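The control flow of the two steps can be sketched as below. Everything inside the two step functions is a stand-in: Wilson et al. train machine-learned classifiers on linguistic features, whereas this toy version uses a three-word clue lexicon and a single negation rule, both invented for illustration.

```python
# Toy prior-polarity clue lexicon and negators (assumptions, not from the paper).
PRIOR_POLARITY = {"good": "positive", "trust": "positive", "terrible": "negative"}
NEGATORS = {"not", "never", "no"}

def is_polar(phrase):
    """Step 1 stand-in: neutral-polar decision for a phrase."""
    return any(word in PRIOR_POLARITY for word in phrase.lower().split())

def contextual_polarity(phrase):
    """Step 2 stand-in: start from prior polarity, flip it after a negator."""
    words = phrase.lower().split()
    for i, word in enumerate(words):
        if word in PRIOR_POLARITY:
            polarity = PRIOR_POLARITY[word]
            if i > 0 and words[i - 1] in NEGATORS:
                polarity = "negative" if polarity == "positive" else "positive"
            return polarity
    return "neutral"

def classify_phrase(phrase):
    """Run step 2 only on phrases that step 1 marks as polar."""
    if not is_polar(phrase):
        return "neutral"
    return contextual_polarity(phrase)

for phrase in ["not good", "terrible service", "the table"]:
    print(phrase, "->", classify_phrase(phrase))
```

The point of the split is that the second classifier only has to disambiguate phrases already judged polar, which keeps the large neutral class from swamping the polarity decision.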

28 features were extracted using NLP techniques with a dependency parser

4 Word Features, 8 Modification Features, 11 Structure Features, 3 Sentence Features, 1 Document Feature

Experiments

Data : Multi-perspective Question Answering (MPQA) Opinion Corpus

Neutral-polar classification (%)

Features         Accuracy
Word token       73.6
Word+priorpol    74.2
28 features      75.9

Polarity classification (%)

Features         Accuracy
Word token       61.7
Word+priorpol    63.0
10 features      65.7

Page 19: A Survey of Opinion Mining

NLP Combined Method - Combining Appraisal Theory

Whitelaw et al., 2005 applied appraisal theory to the machine learning methods of Pang and Lee

Structure of an appraisal

An example: “not very happy”

Experiments

– A lexicon of 1,329 appraisal entities was produced semi-automatically from 400 seed terms in around twenty man-hours

– Combining attitude type and orientation: 90.2% accuracy

Page 20: A Survey of Opinion Mining

Sentiment Classification - Summary

PMI Method

– Characteristics: uses phrase PMI

– Pros: simple; no prior polarity dictionary needed

– Cons: loss of contextual meaning; slow (time to get PMI)

Machine Learning Method

– Characteristics: bag of words; unigrams to bigrams or n-grams; SVM, NB, MaxEnt

– Pros: simple; no prior polarity dictionary needed

– Cons: loss of contextual meaning; needs a learning phase

NLP Combined Method

– Characteristics: based on ML; parsing or syntactic analysis; prior polarity to contextual polarity

– Pros: considers contextual meaning; easily extendible for various purposes

– Cons: needs a prior polarity dictionary; syntactic analysis overhead

Page 21: A Survey of Opinion Mining

Extracting and Summarizing Opinion Expression

Goal

– Extract opinion expressions from a large number of reviews and present them in an effective way

Tasks

– Feature Extraction: sentiment classification at the feature level requires the extraction of features that are the targets of opinion words

– Sentiment Assignment: each feature is usually classified as being either favorable or unfavorable

– Visualization: extracted opinion expressions are summarized and visualized

Methods

Statistical Approaches

– ReviewSeer (2003)

– Opinion Observer (2004)

– Red Opal (2007)

NLP-Based Approaches

– Kanayama System (2004)

– WebFountain (2005)

– OPINE (2005)

[Figure: pipeline from product reviews through Extract Features, Assign Sentiment, and Summarize]

Page 22: A Survey of Opinion Mining

Opinion Observer - Overview

Hu and Liu, 2005

Extract and summarize opinion expression from customer reviews on the Web.

Only mines the features of the product on which customers have expressed opinions, and whether those opinions are positive or negative

Overall process

1. Review crawling

2. Feature extraction

3. Sentiment assignment

– Opinion word extraction

– Opinion orientation identification

4. Summary generation

[Figure: overall process]

Page 23: A Survey of Opinion Mining

Opinion Observer - Tasks

Feature Extraction

– Product features are extracted from nouns and noun phrases by the association miner CBA

– Compactness pruning, redundancy pruning

Sentiment Assignment

– Opinion sentence: a sentence that contains one or more product features and one or more opinion words

– Adjectives are the only opinion words

– The prior polarity of adjectives is identified by the WordNet expansion method with seed terms

– Infrequent features are extracted by using frequent opinion words

– The polarity of a sentence is the dominant orientation of its opinion words

– Extracted form: (product feature, # of positive sentences, # of negative sentences)
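The sentiment-assignment bullets can be sketched as follows. The sketch assumes single-word features, whitespace tokenization, and a ready-made adjective polarity table; the function names and the sample sentences are hypothetical, and feature extraction with CBA is not shown.

```python
from collections import defaultdict

def sentence_orientation(words, adjective_polarity):
    """Dominant orientation: sign of the summed polarities of the opinion words."""
    score = sum(adjective_polarity.get(word, 0) for word in words)
    return (score > 0) - (score < 0)

def summarize(sentences, features, adjective_polarity):
    """Return (feature, # positive sentences, # negative sentences) tuples."""
    counts = defaultdict(lambda: [0, 0])
    for sentence in sentences:
        words = sentence.lower().split()
        orientation = sentence_orientation(words, adjective_polarity)
        if orientation == 0:
            continue  # no opinion words, or positive and negative cancel out
        for feature in features:
            if feature in words:
                counts[feature][0 if orientation > 0 else 1] += 1
    return [(feature, pos, neg) for feature, (pos, neg) in sorted(counts.items())]

# Toy inputs, invented for illustration.
polarity = {"great": 1, "sharp": 1, "poor": -1, "weak": -1}
reviews = [
    "the battery is great",
    "the battery is poor and weak",
    "the lens is sharp",
]
summary = summarize(reviews, {"battery", "lens"}, polarity)
print(summary)
```

Because the whole sentence receives one orientation, a sentence that praises one feature and criticizes another is counted the same way for both.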

Experiments

– Large collection of reviews of 15 electronic products

– 86.3% recall, 84.0% precision

Page 24: A Survey of Opinion Mining

Opinion Observer - Visualization

Features of products are compared in a bar graph

The numbers of positive and negative sentences for each feature are normalized

[Figure: bar graph with a positive portion and a negative portion for each feature]

Page 25: A Survey of Opinion Mining

WebFountain - Overview

Yi et al., 2005

Extracts target features of the sentiment from the various resources and assigns polarity to the features

System Architecture

Sentiment Miner

Analyzes grammatical sentence structures and phrases by using NLP techniques


Page 26: A Survey of Opinion Mining

WebFountain – Tasks

Feature Extraction

Candidate features have

– a part-of relationship with the given topic

– an attribute-of relationship with the given topic

– an attribute-of relationship with a known feature of the given topic

The bBNP (Beginning definite Base Noun Phrase) heuristic is used

Base noun phrases (bnp) with a high likelihood ratio are selected

Experiments

– Precision: digital camera 97%, music reviews 100%

Sentiment Assignment

Sentences are parsed and traversed with two linguistic resources

– Sentiment lexicon: defines the sentiment polarity of terms

– Sentiment pattern database: contains the sentiment assignment patterns of predicates

Experiments

– Product reviews

– Recall 56%, Precision 87%

Page 27: A Survey of Opinion Mining

WebFountain – Visualization

Web interface listing sentiment-bearing sentences about a given product

Page 28: A Survey of Opinion Mining

Extracting and Summarizing Opinion Expression - Summary

Statistical

ReviewSeer (2003)

– Feature extraction: N/A

– Sentiment assignment: probabilistic model, Naïve Bayes (Accuracy: 85.3%)

– Visualization: lists each feature term with its score and shows sentences containing the feature term

Opinion Observer (2004)

– Feature extraction: CBA miner, infrequent feature selection

– Sentiment assignment: WordNet expansion, prior polarity of adjectives

– Visualization: graph

– Recall: 86.3%, Precision: 84.0%

Red Opal (2007)

– Feature extraction: frequent nouns and noun phrases (Precision: 85%)

– Sentiment assignment: uses the users' ratings (Precision: 80%)

– Visualization: product list ordered by the score of each feature; the confidence of the scoring

NLP-based

Kanayama's system (2004)

– Feature extraction and sentiment assignment: sentiment units, modifying the machine translation framework (Recall: 43%, Precision: 89%)

– Visualization: N/A

WebFountain (2005)

– Feature extraction: bBNP heuristic, likelihood ratio (Precision: 97%)

– Sentiment assignment: sentiment lexicon, sentiment pattern database (Recall: 56%, Precision: 87%)

– Visualization: lists sentiment-bearing sentences about a product

OPINE (2005)

– Feature extraction: Web PMI (Recall: 76%, Precision: 79%)

– Sentiment assignment: relaxation labeling (Recall: 89%, Precision: 86%)

– Visualization: N/A

Page 29: A Survey of Opinion Mining

Discussion

OM is a growing research discipline related to various research areas, such as IR, computational linguistics, TC, TS, and DM.

Three topics were surveyed and summarized.

For Korean OM?

There is no published research on Korean OM.

Language differences may impose some limits on the methods used in the OM subtasks.

– Structural differences between English and Korean may mean that the same heuristics cannot be applied to extract features from text

– The lack of a Korean thesaurus similar to WordNet limits the methods of obtaining the prior polarity of words for the PMI or conjunction methods.

Research into Korean OM must be conducted in conjunction with other related areas.

Page 30: A Survey of Opinion Mining

Discussion - Research Map of OM

Page 31: A Survey of Opinion Mining

Thank you