Document-level Semantic Orientation and Argumentation Presented by Marta Tatu CS7301 March 15, 2005


Page 1: Document-level Semantic Orientation and Argumentation

Document-level Semantic Orientation and Argumentation

Presented by Marta Tatu, CS7301

March 15, 2005

Page 2: Document-level Semantic Orientation and Argumentation

Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews

Peter D. Turney
ACL-2002

Page 3: Document-level Semantic Orientation and Argumentation

Overview

Unsupervised learning algorithm for classifying reviews as recommended or not recommended

The classification is based on the semantic orientation of the phrases in the review that contain adjectives or adverbs

Page 4: Document-level Semantic Orientation and Argumentation

Algorithm

Input: a review

1. Identify phrases that contain adjectives or adverbs by using a part-of-speech tagger

2. Estimate the semantic orientation of each phrase

3. Assign a class to the given review based on the average semantic orientation of its phrases

Output: classification (recommended or not recommended)
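The three steps can be sketched end to end (a minimal sketch; `extract_phrases` and `semantic_orientation` are hypothetical stand-ins for Steps 1 and 2, which later slides detail):

```python
def classify_review(review, extract_phrases, semantic_orientation):
    """Step 3 of the algorithm: average the semantic orientation (SO)
    of a review's phrases and pick a class from the sign."""
    phrases = extract_phrases(review)                   # Step 1 (POS patterns)
    sos = [semantic_orientation(p) for p in phrases]    # Step 2 (PMI-IR)
    average = sum(sos) / len(sos)
    return "recommended" if average > 0 else "not recommended"

# Toy run with stubbed-out steps (SO values borrowed from the
# bank-review example later in the deck):
fake_so = {"direct deposit": 1.288, "true service": -0.732}
label = classify_review("...", lambda r: list(fake_so), fake_so.get)
```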

Page 5: Document-level Semantic Orientation and Argumentation

Step 1

Apply Brill's part-of-speech tagger to the review

Adjectives are good indicators of subjective sentences, but their orientation is unpredictable in isolation: "unpredictable steering" in a car review is negative, while an "unpredictable plot" in a movie review is positive

Extract two consecutive words, where one is an adjective or adverb and the other provides the context:

   First Word          Second Word             Third Word (not extracted)
1. JJ                  NN or NNS               anything
2. RB, RBR, or RBS     JJ                      not NN nor NNS
3. JJ                  JJ                      not NN nor NNS
4. NN or NNS           JJ                      not NN nor NNS
5. RB, RBR, or RBS     VB, VBD, VBN, or VBG    anything
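Assuming the review has already been run through a POS tagger, the five patterns above can be applied to the tag sequence roughly as follows (a sketch; the tag sets follow the Penn Treebank convention used in the table):

```python
# Tag sets from the extraction table above (Penn Treebank tags).
ADJ = {"JJ"}
ADV = {"RB", "RBR", "RBS"}
NOUN = {"NN", "NNS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

def extract_phrases(tagged):
    """tagged: list of (word, POS) pairs, e.g. from Brill's tagger.
    Returns the two-word phrases matching the five patterns."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else ""
        if (t1 in ADJ and t2 in NOUN) or \
           (t1 in ADV and t2 in VERB):
            phrases.append(f"{w1} {w2}")          # patterns 1 and 5
        elif t3 not in NOUN and (
                (t1 in ADV and t2 in ADJ) or      # pattern 2
                (t1 in ADJ and t2 in ADJ) or      # pattern 3
                (t1 in NOUN and t2 in ADJ)):      # pattern 4
            phrases.append(f"{w1} {w2}")
    return phrases
```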

Page 6: Document-level Semantic Orientation and Argumentation

Step 2

Estimate the semantic orientation of the extracted phrases using PMI-IR (Turney, 2001)

Pointwise Mutual Information (Church and Hanks, 1989):

PMI(word1, word2) = log2[ p(word1 & word2) / (p(word1) p(word2)) ]

Semantic Orientation:

SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")

PMI-IR estimates PMI by issuing queries to a search engine (AltaVista, ~350 million pages):

SO(phrase) = log2[ (hits(phrase NEAR "excellent") * hits("poor")) / (hits(phrase NEAR "poor") * hits("excellent")) ]

Page 7: Document-level Semantic Orientation and Argumentation

Step 2 – continued

0.01 is added to the hit counts to avoid division by zero

If both hits(phrase NEAR "excellent") and hits(phrase NEAR "poor") are ≤ 4, the phrase is eliminated

"AND (NOT host:epinions)" is added to the queries so that the Epinions website itself is not counted
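Putting the formula and the two fixes together (a sketch; the hit counts would come from search-engine queries, so the numbers below are hypothetical):

```python
from math import log2

def semantic_orientation(near_excellent, near_poor, hits_excellent, hits_poor):
    """SO(phrase) from the previous slide, with 0.01 added to the
    NEAR counts to avoid division by zero (and log of zero)."""
    return log2(((near_excellent + 0.01) * hits_poor) /
                ((near_poor + 0.01) * hits_excellent))

def keep_phrase(near_excellent, near_poor):
    """Eliminate phrases whose two NEAR counts are both <= 4."""
    return near_excellent > 4 or near_poor > 4

# Hypothetical counts; hits("excellent") = hits("poor") here, so SO
# reduces to the (smoothed) log-ratio of the NEAR counts:
so = semantic_orientation(80, 5, 10_000, 10_000)
```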

Page 8: Document-level Semantic Orientation and Argumentation

Step 3

Calculate the average semantic orientation of the phrases in the given review

If the average is positive, the review is classified as recommended; if negative, as not recommended

Phrase                    POS tags    SO
direct deposit            JJ NN        1.288
local branch              JJ NN        0.421
small part                JJ NN        0.053
online service            JJ NN        2.780
well other                RB JJ        0.237
low fees                  JJ NNS       0.333
true service              JJ NN       -0.732
other bank                JJ NN       -0.850
inconveniently located    RB VBN      -1.541

Average Semantic Orientation: 0.322

Page 9: Document-level Semantic Orientation and Argumentation

Experiments

410 reviews from Epinions: 170 (41%) in the minority class, 240 (59%) in the majority class

Average phrases per review: 26

Baseline accuracy (always guessing the majority class): 59%

Domain                 Accuracy    Correlation
Automobiles            84.00%      0.4618
Banks                  80.00%      0.6167
Movies                 65.83%      0.3608
Travel Destinations    70.53%      0.4155
All                    74.39%      0.5174

Page 10: Document-level Semantic Orientation and Argumentation

Discussion

What makes movies hard to classify?

The average SO tends to classify a recommended movie as not recommended: evil characters make good movies

The whole is not necessarily the sum of its parts: good beaches do not necessarily add up to a good vacation

But good automobile parts usually do add up to a good automobile

Page 11: Document-level Semantic Orientation and Argumentation

Applications

Summary statistics for search engines

Summarization of reviews: pick out the sentences with the highest positive/negative semantic orientation given a positive/negative review

Filtering "flames" for newsgroups: when the semantic orientation drops below a threshold, the message may be a flame

Page 12: Document-level Semantic Orientation and Argumentation

12

Questions ? Comments ? Observations ?

Page 13: Document-level Semantic Orientation and Argumentation

Thumbs up? Sentiment Classification using Machine Learning Techniques

Bo Pang, Lillian Lee and Shivakumar Vaithyanathan
EMNLP-2002

Page 14: Document-level Semantic Orientation and Argumentation

Overview

Consider the problem of classifying documents by overall sentiment

Three machine learning methods, compared against human-generated lists of words:
Naïve Bayes
Maximum Entropy
Support Vector Machines

Page 15: Document-level Semantic Orientation and Argumentation

Experimental Data

Movie-review domain

Source: Internet Movie Database (IMDb)

Star or numerical ratings converted into positive, negative, or neutral => no need to hand-label the data for training or testing

Maximum of 20 reviews per author per sentiment category: 752 negative reviews, 1301 positive reviews, 144 reviewers

Page 16: Document-level Semantic Orientation and Argumentation

List-of-Words Baseline

Maybe there are certain words that people tend to use to express strong sentiments

Classification done by counting the number of positive and negative words in the document

Random-choice baseline: 50%

Page 17: Document-level Semantic Orientation and Argumentation

Machine Learning Methods

Bag-of-features framework:
{f1, …, fm} is a predefined set of m features
ni(d) = number of times fi occurs in document d
Each document is represented as d = (n1(d), n2(d), …, nm(d))

Naïve Bayes:

P(c | d) = P(c) P(d | c) / P(d)

c_NB = argmax_c  P(c) ∏_{i=1..m} P(fi | c)^{ni(d)} / P(d)
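A minimal multinomial Naïve Bayes in this bag-of-features setting might look like the following (a sketch; the add-one smoothing is an assumption, and the paper's exact estimation details may differ):

```python
from math import log
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns log-priors and
    add-one-smoothed log-likelihoods for the NB formula above."""
    labels = Counter(lbl for _, lbl in docs)
    counts = defaultdict(Counter)
    for toks, lbl in docs:
        counts[lbl].update(toks)
    vocab = {t for toks, _ in docs for t in toks}
    logprior = {c: log(n / len(docs)) for c, n in labels.items()}
    loglik = {}
    for c in labels:
        total = sum(counts[c].values()) + len(vocab)
        loglik[c] = {t: log((counts[c][t] + 1) / total) for t in vocab}
    return logprior, loglik

def classify_nb(tokens, logprior, loglik):
    """argmax_c [ log P(c) + sum over occurrences of log P(fi | c) ];
    features never seen in training are simply ignored."""
    def score(c):
        return logprior[c] + sum(loglik[c].get(t, 0.0) for t in tokens)
    return max(logprior, key=score)

# Toy training set:
logprior, loglik = train_nb([(["good", "great", "good"], "pos"),
                             (["bad", "awful"], "neg")])
```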

Page 18: Document-level Semantic Orientation and Argumentation

Machine Learning Methods – continued

Maximum Entropy:

P_ME(c | d) = (1 / Z(d)) exp( Σ_i λ_{i,c} F_{i,c}(d, c) )

where F_{i,c} is a feature/class function:

F_{i,c}(d, c') = 1 if ni(d) > 0 and c' = c, 0 otherwise

Support vector machines: find the hyperplane that maximizes the margin. The solution of the constrained optimization problem has the form:

w = Σ_j α_j c_j d_j,   with α_j ≥ 0 and c_j ∈ {1, -1}

where cj is the correct class of document dj
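For concreteness, the maximum-entropy formula translates into a few lines once the weights λ_{i,c} are given (a sketch with made-up weights; training, which estimates the λ's, is omitted):

```python
from math import exp

def maxent_prob(doc_features, weights, classes):
    """P_ME(c | d) from the formula above. weights maps (feature, class)
    to lambda_{i,c}; F_{i,c} fires when feature i appears in d and the
    candidate class is c. Z(d) normalizes over the classes."""
    scores = {c: exp(sum(weights.get((f, c), 0.0) for f in doc_features))
              for c in classes}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Hypothetical weight: the feature "good" votes for the positive class.
probs = maxent_prob({"good"}, {("good", "pos"): 1.0}, ["pos", "neg"])
```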

Page 19: Document-level Semantic Orientation and Argumentation

Evaluation

700 positive-sentiment and 700 negative-sentiment documents

3 equal-sized folds

The tag "NOT_" was added to every word between a negation word ("not", "isn't", "didn't") and the first punctuation mark: "good" is opposite to "not very good"

Features:
16,165 unigrams appearing at least 4 times in the 1400-document corpus
16,165 most frequently occurring bigrams in the same data
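The NOT_ preprocessing can be sketched as follows (tokenization by regular expression is an assumption, and the negation-word list is limited to the slide's examples):

```python
import re

NEGATIONS = {"not", "isn't", "didn't"}   # the slide's examples only

def tag_negation(text):
    """Prefix NOT_ to every word between a negation word and the first
    following punctuation mark, so "not very good" yields NOT_good."""
    out, negating = [], False
    for token in re.findall(r"[\w']+|[.,!?;:]", text.lower()):
        if token in ".,!?;:":            # punctuation ends the scope
            negating = False
            out.append(token)
        elif negating:
            out.append("NOT_" + token)
        else:
            out.append(token)
            negating = token in NEGATIONS
    return out
```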

Page 20: Document-level Semantic Orientation and Argumentation

Results

POS information added to differentiate between "I love this movie" and "This is a love story"

Page 21: Document-level Semantic Orientation and Argumentation

Conclusion

Results produced by the machine learning techniques are better than the human-generated baselines

SVMs tend to do best

Unigram presence information is the most effective feature

Frequency vs. presence: in "thwarted expectations" reviews, many words are indicative of the sentiment opposite to that of the entire review

Some form of discourse analysis is necessary

Page 22: Document-level Semantic Orientation and Argumentation

22

Questions ? Comments ? Observations ?

Page 23: Document-level Semantic Orientation and Argumentation

Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status

Simone Teufel and Marc Moens
CL-2002

Page 24: Document-level Semantic Orientation and Argumentation

Overview

Summarization of scientific articles: restore the discourse context of extracted material by adding the rhetorical status of each sentence in the document

Gold-standard data: summaries of computational linguistics articles, annotated with the rhetorical status and relevance of each sentence

Supervised learning algorithm that classifies sentences into 7 rhetorical categories

Page 25: Document-level Semantic Orientation and Argumentation

Why?

Knowledge of a sentence's rhetorical status enables tailoring summaries to the user's expertise and task:
Nonexpert summary: background information and the general purpose of the paper
Expert summary: no background; instead, differences between this approach and similar ones

Contrasts or complementarity among articles can be expressed

Page 26: Document-level Semantic Orientation and Argumentation

Rhetorical Status

Generalizations about the nature of scientific texts + information to enable the construction of better summaries:

Problem structure: problems (research goals), solutions (methods), and results

Intellectual attribution: what the new contribution is, as opposed to previous work and background (generally accepted statements)

Scientific argumentation

Attitude toward other people's work: a rival approach, a prior approach with a fault, or an approach contributing parts of the authors' own solution

Page 27: Document-level Semantic Orientation and Argumentation

Metadiscourse and Agentivity

Metadiscourse is an aspect of scientific argumentation and a way of expressing attitude toward previous work: "we argue that", "in contrast to common belief, we"

Agent roles in argumentation: rivals, contributors of part of the solution (they), the entire research community, or the authors of the paper (we)

Page 28: Document-level Semantic Orientation and Argumentation

Citations and Relatedness

Just knowing that an article cites another is often not enough

One needs to read the context of the citation to understand the relation between the articles:
Article cited negatively or contrastively
Article cited positively, or one from which the authors state that their own work originates

Page 29: Document-level Semantic Orientation and Argumentation

Rhetorical Annotation Scheme

Only one category assigned to each full sentence

Nonoverlapping, nonhierarchical scheme

The rhetorical status is determined on the basis of the global context of the paper

Page 30: Document-level Semantic Orientation and Argumentation

Relevance

Select important content from the text

Highly subjective => low human agreement

A sentence is considered relevant if it describes the research goal or states a difference from a rival approach

Other definitions: a sentence is relevant if it shows a high level of similarity with a sentence in the abstract

Page 31: Document-level Semantic Orientation and Argumentation

Corpus

80 conference articles:
Association for Computational Linguistics (ACL)
European Chapter of the Association for Computational Linguistics (EACL)
Applied Natural Language Processing (ANLP)
International Joint Conference on Artificial Intelligence (IJCAI)
International Conference on Computational Linguistics (COLING)

XML markup added

Page 32: Document-level Semantic Orientation and Argumentation

The Gold Standard

3 task-trained annotators
17 pages of guidelines
20 hours of training
No communication between annotators

Evaluation measures of the annotation:
Stability
Reproducibility

Page 33: Document-level Semantic Orientation and Argumentation

Results of Annotation

Kappa coefficient K (Siegel and Castellan, 1988):

K = (P(A) - P(E)) / (1 - P(E))

where P(A) = pairwise agreement and P(E) = random agreement

Stability: K = .82, .81, .76 (N = 1,220 and k = 2)
Reproducibility: K = .71
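For the two-annotator case the coefficient is easy to compute directly (a sketch; Siegel and Castellan's K generalizes this to k annotators, so the two-rater simplification here is an assumption for illustration):

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """K = (P(A) - P(E)) / (1 - P(E)) for two annotators labeling the
    same items: P(A) is the observed agreement, P(E) the agreement
    expected by chance from each annotator's label distribution."""
    n = len(labels_a)
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[lbl] * cb[lbl] for lbl in ca) / (n * n)
    return (p_a - p_e) / (1 - p_e)
```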

Page 34: Document-level Semantic Orientation and Argumentation

The System

Supervised machine learning: Naïve Bayes

Page 35: Document-level Semantic Orientation and Argumentation

Features

Absolute location of a sentence:
Limitations of the authors' own method can be expected toward the end, while limitations of other researchers' work are discussed in the introduction

Page 36: Document-level Semantic Orientation and Argumentation

Features – continued

Section structure: relative and absolute position of the sentence within its section:
First, last, second or third, second-last or third-last, or somewhere in the first, second, or last third of the section

Paragraph structure: relative position of the sentence within its paragraph:
Initial, medial, or final
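One plausible encoding of these position features (the exact bucketing and boundary cases are a guess; the paper's definitions may differ):

```python
def section_position(i, n):
    """Position of sentence i (0-based) among the n sentences of a
    section, as one of the seven values listed above."""
    if i == 0:
        return "first"
    if i == n - 1:
        return "last"
    if i in (1, 2):
        return "second or third"
    if i in (n - 2, n - 3):
        return "second-last or third-last"
    if i < n / 3:
        return "first third"
    if i < 2 * n / 3:
        return "second third"
    return "last third"

def paragraph_position(i, n):
    """Initial, medial, or final position within an n-sentence paragraph."""
    return "initial" if i == 0 else ("final" if i == n - 1 else "medial")
```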

Page 37: Document-level Semantic Orientation and Argumentation

Features – continued

Headlines: type of headline of the current section:
Introduction, Implementation, Example, Conclusion, Result, Evaluation, Solution, Experiment, Discussion, Method, Problems, Related Work, Data, Further Work, Problem Statement, or Non-Prototypical

Sentence length: longer or shorter than 12 words (threshold)

Page 38: Document-level Semantic Orientation and Argumentation

Features – continued

Title word contents: does the sentence contain words that also occur in the title?

TF*IDF word contents:
High values for words that occur frequently in one document but rarely in the overall collection of documents
Do any of the document's 18 highest-scoring TF*IDF words occur in the sentence?

Verb syntax: voice, tense, and modal linguistic features
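The TF*IDF feature can be sketched as follows (a toy implementation; the exact weighting variant and tokenization are not specified on the slide, so both are assumptions):

```python
from math import log

def tfidf_feature(sentence, document, collection, k=18):
    """True iff the sentence contains one of the document's k
    highest-scoring TF*IDF words. All arguments are token lists
    (collection is a list of documents)."""
    def tfidf(word):
        tf = document.count(word)                    # term frequency
        df = sum(word in doc for doc in collection)  # document frequency
        return tf * log(len(collection) / df)
    top_k = sorted(set(document), key=tfidf, reverse=True)[:k]
    return any(word in top_k for word in sentence)

# Toy collection: "the" appears in every document, so it scores 0.
docs = [["parser", "grammar", "the", "the"], ["the", "cat"], ["the", "dog"]]
```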

Page 39: Document-level Semantic Orientation and Argumentation

Features – continued

Citation: citation (self), citation (other), author name, or none + location of the citation in the sentence (beginning, middle, or end)

History: most probable previous category
AIM tends to follow CONTRAST
Calculated in a second pass during training

Page 40: Document-level Semantic Orientation and Argumentation

Features – continued

Formulaic expressions: list of phrases described by regular expressions, divided into 18 classes comprising a total of 644 patterns
Clustering prevents data sparseness

Page 41: Document-level Semantic Orientation and Argumentation

Features – continued

Agent: 13 types, 167 patterns
The placeholder WORK_NOUN can be replaced by any of a set of 37 nouns, including theory, method, prototype, algorithm
Agent classes whose distribution was very similar to the overall distribution of target categories were excluded

Page 42: Document-level Semantic Orientation and Argumentation

Features – continued

Action: 365 verbs clustered into 20 classes based on semantic concepts such as similarity and contrast
PRESENTATION_ACTIONs: present, report, state
RESEARCH_ACTIONs: analyze, conduct, define, observe
Negation is considered

Page 43: Document-level Semantic Orientation and Argumentation

System Evaluation

10-fold cross-validation

Page 44: Document-level Semantic Orientation and Argumentation

Feature Impact

The most distinctive single feature is Location, followed by SegAgent, Citations, Headlines, Agent, and Formulaic

Page 45: Document-level Semantic Orientation and Argumentation

45

Questions ? Comments ? Observations ?

Page 46: Document-level Semantic Orientation and Argumentation


Thank You !