Mining Product Opinions and Reviews on the Web
Felipe Mattosinho
Page 2
Agenda
Introduction
Basics
Requirements
Design
Implementation
Evaluation
Conclusion
Introduction
Page 4
Introduction
What is Opinion Mining?
Opinion Mining
not conventional data mining
retrieves useful information from users' opinions
a sub-area of Web Mining
lies at the crossroads of Information Retrieval (IR), Information Extraction (IE) and Data Mining (DM)
Page 5
Introduction
[Chart: number of user reviews per product: 1279, 748, 576, 128 and 100 reviews. Source: Amazon.com]
Users are not willing to read them all
Difficult to find the necessary information
Difficult to draw conclusions
Opinion overload
Structure data
Build intelligent applications (Web 3.0)
Why use Opinion Mining?
Page 6
Introduction
Existing approaches
Source: Amazon.com
Source: CNET.com
Ranking
Classification
Facts are different from opinions
Asking for "pros" and "cons" can induce opinions
Tiresome task for users
Basics
Page 8
Basics
Opinions highlight strengths and weaknesses of the objects under discussion (OuD)
O = (T, A), where T is a taxonomy of components and A is a set of attributes of O
The word "feature" is used for simplicity, covering both components and attributes
Opinion Model
Page 9
Basics
Levels of Sentiment Analysis
Opinion Level: too coarse-grained, does not cover important information
Sentence Level: a better approach, but still does not cover everything
Feature Level: optimal level, best coverage
Page 10
Basics
Trends in Sentiment Analysis for Opinion Mining
The finer the granularity level, the lower the performance and the higher the complexity
Requirements
Page 12
Requirements
Customers
E-commerce
Manufacturers
Target Audience
Page 13
Requirements
Functional Requirements:
Generate a feature-based summary for the user
System administrator has control over core parameters, policies and mechanisms
Non-functional Requirements:
Fedseeko compatibility
Fault tolerance
Performance
Interoperability
Design
Page 15
Design
System Architecture
System Management Module
POS Tagging Module
Opinion Retriever Module
Opinion Mining Module
Page 16
Design
System Management Module
Long jobs are handled asynchronously
Workers run concurrently, at different times of the day or on different machines
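In the real system these jobs are queued with the Delayed_Job gem; as a minimal, self-contained sketch of the same pattern in plain Ruby (a shared queue drained by concurrent worker threads, with illustrative product names), it could look like this:

```ruby
# Sketch of the asynchronous job pattern used by the System Management
# Module. The real system uses the Delayed_Job gem; here a core Queue
# plus threads stands in for it.
job_queue = Queue.new
results   = Queue.new

# Enqueue long-running jobs (placeholder "mine reviews" tasks).
%w[ipod_touch nikon_d5000 xbox_360].each { |product| job_queue << product }

# Several workers drain the queue concurrently; in production they could
# also run at different times of day or on different machines.
workers = 2.times.map do
  Thread.new do
    loop do
      product = begin
        job_queue.pop(true)          # non-blocking pop
      rescue ThreadError
        break                        # queue drained by another worker
      end
      results << "mined #{product}"  # placeholder for the real mining job
    end
  end
end
workers.each(&:join)
```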
Page 17
Design
Opinion Retrieval Module
Create task descriptions
Web scraping optimization
Page 18
Design
Opinion Composition Model
Other words are also special (negation words, orientation inverter words, “too” words)
Page 19
Design
Opinion Sentence
I needed to take pictures during my last travel to Italy. So far, I’m very happy with this camera. The picture quality is good and the zoom is powerful. One thing that I didn’t like is the LCD resolution.
I_PRP needed_VBD to_TO take_VB pictures_NNS during_IN my_PRP$ last_RB travel_NN to_TO Italy_NNP ._. So_RB far_RB ,_, I_PRP 'm_VBP very_RB happy_JJ with_IN this_DT camera_NN ._. The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._. One_CD thing_NN that_IN I_PRP did_VBD n't_RB like_VB is_VBZ the_DT LCD_NNP resolution_NN ._.
So_RB far_RB ,_, I_PRP 'm_VBP very_RB happy_JJ with_IN this_DT camera_NN ._. The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._. One_CD thing_NN that_IN I_PRP did_VBD n't_RB like_VB is_VBZ the_DT LCD_NNP resolution_NN ._.
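A sketch of how opinion sentences could be filtered out of the tagged text above, assuming a small hypothetical seed list of opinion adjectives and the word_TAG token format shown:

```ruby
# Sketch: keep only sentences that contain an opinion adjective (_JJ).
# The seed list is an illustrative stand-in for the system's word list.
OPINION_ADJECTIVES = %w[happy good powerful bad nice].freeze

def opinion_sentence?(tagged_sentence)
  tagged_sentence.split.any? do |token|
    word, tag = token.split("_")
    tag == "JJ" && OPINION_ADJECTIVES.include?(word.downcase)
  end
end

sentences = [
  "I_PRP needed_VBD to_TO take_VB pictures_NNS during_IN my_PRP$ last_RB travel_NN to_TO Italy_NNP ._.",
  "The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._."
]
# The first (factual) sentence is dropped; the second is an opinion sentence.
opinion = sentences.select { |s| opinion_sentence?(s) }
```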
Page 20
Design
camera_NN (picture_NN quality_NN) zoom_NN thing_NN LCD_NNP resolution_NN wedding_NN car_NN photos_NNS dog_NN road_NN lot_NN [...]
Feature Identification
camera_NN (picture_NN quality_NN) flash_NN thing_NN England_NNP rehearsal_NN photos_NNS [...]
horse_NN (picture_NN quality_NN) flash_NN farm_NN country_NN rehearsal_NN photos_NNS [...]
camera_NN (picture_NN quality_NN) flash_NN photos_NNS [...]
Page 21
Design
Feature Identification
Pros:
Customers use different words to refer to the same feature
Detects additional useful information (not part of the opinion model)
No manually annotated data needed
Cons:
Does not detect infrequent features
Detects non-features
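The frequency-based pruning illustrated above (nouns that recur across many reviews survive, while review-specific nouns such as "wedding" or "horse" are dropped) can be sketched as follows; the threshold and noun lists are illustrative:

```ruby
# Sketch of frequency-based feature identification: nouns and noun
# phrases that occur in at least `threshold` reviews are kept as
# candidate features; infrequent, review-specific nouns are pruned.
def identify_features(noun_lists, threshold)
  counts = Hash.new(0)
  noun_lists.each { |nouns| nouns.uniq.each { |n| counts[n] += 1 } }
  counts.select { |_, c| c >= threshold }.keys
end

# Nouns extracted from three hypothetical reviews of the same camera.
noun_lists = [
  ["camera", "picture quality", "zoom", "thing", "lcd resolution", "wedding", "photos"],
  ["camera", "picture quality", "flash", "thing", "england", "photos"],
  ["horse", "picture quality", "flash", "farm", "photos"]
]
# Only nouns present in all three reviews survive this threshold.
features = identify_features(noun_lists, 3)
```

This simplification counts document frequency only; the thesis uses association (frequent itemset) mining, but the pruning effect is the same.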
Page 22
Design
Search Word Orientation Algorithm
Seed list: good,1,nice,1
Unknown word: bad
Updated seed list: good,1,nice,1,bad,-1
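A sketch of the orientation search, with hand-coded synonym/antonym maps standing in for the WordNet lookup the real system performs (via RiTa.WordNet); the maps and word choices are illustrative:

```ruby
# Sketch of the word-orientation search: an unknown word inherits the
# orientation of a known synonym, or the opposite of a known antonym.
# These tiny maps are hand-coded stand-ins for WordNet queries.
SYNONYMS = { "bad" => ["poor", "awful"], "great" => ["good"] }.freeze
ANTONYMS = { "bad" => ["good"], "poor" => ["good"] }.freeze

def search_orientation(word, seed_list)
  return seed_list[word] if seed_list.key?(word)
  (SYNONYMS[word] || []).each do |syn|
    return seed_list[syn] if seed_list.key?(syn)   # same orientation
  end
  (ANTONYMS[word] || []).each do |ant|
    return -seed_list[ant] if seed_list.key?(ant)  # opposite orientation
  end
  nil # orientation still unknown
end

seed_list = { "good" => 1, "nice" => 1 }
# "bad" is an antonym of "good", so it gets orientation -1 and
# the seed list grows, as in the example above.
seed_list["bad"] = search_orientation("bad", seed_list)
```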
Page 23
Design
Opinion Words in Context
Negation Rules: negation words (e.g. "not", "no") invert the orientation of opinion words (e.g. "good", "bad", "problem")
Too Rules: "too" before adjectives usually denotes negative sentiment, e.g. "This camera is too small"
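These two rules can be sketched as a small context check; the word lists and the window (only the directly preceding word) are simplifying assumptions:

```ruby
# Sketch of two context rules applied to an opinion word's orientation:
#  * Negation rule: "not"/"no" directly before an opinion word flips it.
#  * "Too" rule: "too" before an adjective usually denotes negative
#    sentiment ("This camera is too small").
NEGATION_WORDS = %w[not no n't].freeze

def contextual_orientation(words, index, base_orientation)
  prev = index.positive? ? words[index - 1].downcase : nil
  return -base_orientation if NEGATION_WORDS.include?(prev)
  return -1 if prev == "too" && base_orientation >= 0
  base_orientation
end

negated  = contextual_orientation(%w[This camera is not good], 4, 1)
too_case = contextual_orientation(%w[This camera is too small], 4, 0)
plain    = contextual_orientation(%w[The zoom is powerful], 3, 1)
```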
Page 24
Design
Opinion Words in Context
Orientation Inverter / Sentiment Inverter Words
Find the sentiment/orientation of opinion words with unknown orientation
"The camera is nice, except for the initialization time which takes long"
"The autofocus is great, the battery life lasts long, but I find the functions a little complex."
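A minimal sketch of this inference, assuming a hypothetical list of inverter connectives and that the connective between the known and unknown word has already been located:

```ruby
# Sketch: infer the orientation of an unknown opinion word from a known
# opinion word in the same sentence, via the connective between them.
# Inverter words ("except", "but", ...) flip the orientation; other
# connectives ("and", ",") carry it over unchanged.
INVERTERS = %w[except but however].freeze

def infer_orientation(known_orientation, connective)
  INVERTERS.include?(connective) ? -known_orientation : known_orientation
end

# "The camera is nice, except for the initialization time which takes long"
#   "nice" (+1) + "except" => "long" is negative here.
long_orientation = infer_orientation(1, "except")
# "...the battery life lasts long, but I find the functions a little complex."
#   positive clause + "but" => "complex" is negative here.
complex_orientation = infer_orientation(1, "but")
```

The same surface word ("long") thus receives opposite orientations in the two sentences, which is why orientation must be resolved in context rather than stored globally.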
Page 25
Design
Word positions: 0 1 2 3 4 5 6 7 8 9
Score(image quality) = (1 / |4 - 1|) + (-1 / |9 - 1|) = 0.3333 - 0.125 = 0.2083 (positive)
Score(autofocus) = (1 / |4 - 7|) + (-1 / |9 - 7|) = 0.3333 - 0.5 = -0.1667 (negative)
“The image quality is amazing, but the autofocus is terrible.”
Aggregating opinions for a feature
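The aggregation can be sketched directly: each opinion word contributes its orientation divided by its word distance to the feature, so nearby opinion words dominate. Positions follow the example sentence:

```ruby
# Sketch of distance-weighted score aggregation for a feature: each
# opinion word contributes orientation / |position - feature position|.
def feature_score(feature_pos, opinion_words)
  opinion_words.sum do |pos, orientation|
    orientation.to_f / (pos - feature_pos).abs
  end
end

# "The image quality is amazing, but the autofocus is terrible."
# word positions: image=1, amazing=4 (+1), autofocus=7, terrible=9 (-1)
opinions = [[4, 1], [9, -1]]
image_quality = feature_score(1, opinions)  # 1/3 - 1/8 ≈  0.2083 (positive)
autofocus     = feature_score(7, opinions)  # 1/3 - 1/2 ≈ -0.1667 (negative)
```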
Page 26
Implementation
Overview
Core Technologies
Ruby Gems (Libraries)
Java Libraries
JRuby on Rails (JRuby 1.5.0 RC3 / Rails 2.3.8 )
Mechanize Nokogiri Ruby-aaws Delayed_Job
Rita.Wordnet Stanford POS Tagger API
Page 27
Implementation
Overview
Page 28
Implementation
Overview
Evaluation
Page 30
Evaluation
Product          Opinions   Opinion Sentences
iPod Touch 8GB      120          673
Nikon D5000          86          452
Nikon P90            52          273
Xbox 360             41          181
Test Environment
Sample data
System Configuration
AMD Turion(tm) 64 Mobile Technology ML-32 / 1GB RAM Ubuntu 9.04 32 bits
Page 31
Evaluation
Effectiveness of Feature Identification
Threshold Accuracy Features
Page 32
Evaluation
Xbox 360 had the lowest effectiveness, due to wrong part-of-speech tagging
"Complex" sentences and domain-dependent sentences are also wrongly classified
Sentiment Classification Effectiveness
Page 33
Evaluation
System Efficiency
The lower the threshold, the more features are identified, and hence the more sentences are analyzed
What is the price of addressing many exceptions?
Page 34
Evaluation
Considerations
Complex Sentences / Domain Dependent Sentences / Exceptions
Users may talk about other objects with similar features
Domain-dependent sentences (e.g. "The device heats very fast.")
POS Tagging Errors
"This camera is GOOD" / "[...] the hard drive which comes with the device."
Pluralization cases
May not refer to the same OuD (e.g. "camera" and "cameras")
Page 35
Conclusion
POECS performs well, with a good rate of accuracy
Observations show that many users write "simple", straightforward sentences, which are covered by POECS
Domain-specific annotations can help make the system more effective
Human language is complex; covering many special cases costs a lot of performance
Page 36
Conclusion
Minimize number of manual annotations through recognition of reusable patterns of the human language
Cope with common unsolved problems such as
Safe ways to recognize which features belong to which object
Global opinion knowledge to help improve local analysis (sentence or feature level)
Future Work
Page 37
Conclusion
Questions?