Mining Product Opinions and Reviews on the Web
Felipe Mattosinho
Page 2
Agenda
Introduction
Basics
Requirements
Design
Implementation
Evaluation
Conclusion
Introduction
Page 4
Introduction
What is Opinion Mining?
Opinion Mining
not conventional data mining
retrieves useful information from users' opinions
a sub-area of Web Mining
lies at the crossroads of Information Retrieval (IR), Information Extraction (IE) and Data Mining (DM)
Page 5
Introduction
[Chart: number of user reviews per product: 1279, 748, 576, 128 and 100 reviews. Source: Amazon.com]
Users are not willing to read them all
Difficult to find the necessary information
Difficult to draw conclusions
Opinion overload
Structure data
Build intelligent applications (Web 3.0)
Why use Opinion Mining?
Page 6
Introduction
Existing approaches
Source: Amazon.com
Source: CNET.com
Ranking
Classification
Facts are different from opinions
Asking for "pros" and "cons" can induce opinions
Tiresome task for users
Basics
Page 8
Basics
Opinions highlight strengths and weaknesses of the objects under discussion (OuD)
O = (T, A), where T is a taxonomy of components and A is a set of attributes of O
The word "feature" is used for simplicity, covering both components and attributes
Opinion Model
Page 9
Basics
Levels of Sentiment Analysis
Opinion Level: too coarse-grained, does not cover important information
Sentence Level: a better approach, but still does not cover everything
Feature Level: optimal level, best coverage
Page 10
Basics
Trends in Sentiment Analysis for Opinion Mining
The finer the granularity level, the lower the performance and the higher the complexity
Requirements
Page 12
Requirements
Customers
E-commerce
Manufacturers
Target Audience
Page 13
Requirements
Functional Requirements:
Generate a feature-based summary for the user
System administrator has control over core parameters, policies and mechanisms
Non-functional Requirements:
Fedseeko compatibility
Fault tolerance
Performance
Interoperability
Design
Page 15
Design
System Architecture
System Management Module
POS Tagging Module
Opinion Retriever Module
Opinion Mining Module
Page 16
Design
System Management Module
Long jobs are handled asynchronously
Workers run concurrently, at different times of the day or on different machines
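In the real system these jobs are queued with the Delayed_Job gem; as a minimal, self-contained sketch of the same pattern in plain Ruby (a shared queue drained by concurrent worker threads, with illustrative product names), it could look like this:

```ruby
# Sketch of the asynchronous job pattern used by the System Management
# Module. The real system uses the Delayed_Job gem; here a core Queue
# plus threads stands in for it.
job_queue = Queue.new
results   = Queue.new

# Enqueue long-running jobs (placeholder "mine reviews" tasks).
%w[ipod_touch nikon_d5000 xbox_360].each { |product| job_queue << product }

# Several workers drain the queue concurrently; in production they could
# also run at different times of day or on different machines.
workers = 2.times.map do
  Thread.new do
    loop do
      product = begin
        job_queue.pop(true)          # non-blocking pop
      rescue ThreadError
        break                        # queue drained by another worker
      end
      results << "mined #{product}"  # placeholder for the real mining job
    end
  end
end
workers.each(&:join)
```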
Page 17
Design
Opinion Retrieval Module
Create task descriptions
Web scraping optimization
Page 18
Design
Opinion Composition Model
Other words are also special (negation words, orientation inverter words, “too” words)
Page 19
Design
Opinion Sentence
I needed to take pictures during my last travel to Italy. So far, I’m very happy with this camera. The picture quality is good and the zoom is powerful. One thing that I didn’t like is the LCD resolution.
I_PRP needed_VBD to_TO take_VB pictures_NNS during_IN my_PRP$ last_RB travel_NN to_TO Italy_NNP ._. So_RB far_RB ,_, I_PRP 'm_VBP very_RB happy_JJ with_IN this_DT camera_NN ._. The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._. One_CD thing_NN that_IN I_PRP did_VBD n't_RB like_VB is_VBZ the_DT LCD_NNP resolution_NN ._.
So_RB far_RB ,_, I_PRP 'm_VBP very_RB happy_JJ with_IN this_DT camera_NN ._. The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._. One_CD thing_NN that_IN I_PRP did_VBD n't_RB like_VB is_VBZ the_DT LCD_NNP resolution_NN ._.
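A sketch of how opinion sentences could be filtered out of the tagged text above, assuming a small hypothetical seed list of opinion adjectives and the word_TAG token format shown:

```ruby
# Sketch: keep only sentences that contain an opinion adjective (_JJ).
# The seed list is an illustrative stand-in for the system's word list.
OPINION_ADJECTIVES = %w[happy good powerful bad nice].freeze

def opinion_sentence?(tagged_sentence)
  tagged_sentence.split.any? do |token|
    word, tag = token.split("_")
    tag == "JJ" && OPINION_ADJECTIVES.include?(word.downcase)
  end
end

sentences = [
  "I_PRP needed_VBD to_TO take_VB pictures_NNS during_IN my_PRP$ last_RB travel_NN to_TO Italy_NNP ._.",
  "The_DT picture_NN quality_NN is_VBZ good_JJ and_CC the_DT zoom_NN is_VBZ powerful_JJ ._."
]
# The first (factual) sentence is dropped; the second is an opinion sentence.
opinion = sentences.select { |s| opinion_sentence?(s) }
```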
Page 20
Design
camera_NN (picture_NN quality_NN) zoom_NN thing_NN LCD_NNP resolution_NN wedding_NN car_NN photos_NNS dog_NN road_NN lot_NN [...]
Feature Identification
camera_NN (picture_NN quality_NN) flash_NN thing_NN England_NNP rehearsal_NN photos_NNS [...]
horse_NN (picture_NN quality_NN) flash_NN farm_NN country_NN rehearsal_NN photos_NNS [...]
camera_NN (picture_NN quality_NN) flash_NN photos_NNS [...]
Page 21
Design
Feature Identification
Pros:
Customers use different words to refer to the same feature
Detects additional useful information (not part of the opinion model)
No manually annotated data needed
Cons:
Does not detect infrequent features
Detects non-features
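The frequency-based pruning illustrated above (nouns that recur across many reviews survive, while review-specific nouns such as "wedding" or "horse" are dropped) can be sketched as follows; the threshold and noun lists are illustrative:

```ruby
# Sketch of frequency-based feature identification: nouns and noun
# phrases that occur in at least `threshold` reviews are kept as
# candidate features; infrequent, review-specific nouns are pruned.
def identify_features(noun_lists, threshold)
  counts = Hash.new(0)
  noun_lists.each { |nouns| nouns.uniq.each { |n| counts[n] += 1 } }
  counts.select { |_, c| c >= threshold }.keys
end

# Nouns extracted from three hypothetical reviews of the same camera.
noun_lists = [
  ["camera", "picture quality", "zoom", "thing", "lcd resolution", "wedding", "photos"],
  ["camera", "picture quality", "flash", "thing", "england", "photos"],
  ["horse", "picture quality", "flash", "farm", "photos"]
]
# Only nouns present in all three reviews survive this threshold.
features = identify_features(noun_lists, 3)
```

This simplification counts document frequency only; the thesis uses association (frequent itemset) mining, but the pruning effect is the same.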
Page 22
Design
Search Word Orientation Algorithm
Seed list: good,1,nice,1
Unknown word: bad
Updated seed list: good,1,nice,1,bad,-1
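A sketch of the orientation search, with hand-coded synonym/antonym maps standing in for the WordNet lookup the real system performs (via RiTa.WordNet); the maps and word choices are illustrative:

```ruby
# Sketch of the word-orientation search: an unknown word inherits the
# orientation of a known synonym, or the opposite of a known antonym.
# These tiny maps are hand-coded stand-ins for WordNet queries.
SYNONYMS = { "bad" => ["poor", "awful"], "great" => ["good"] }.freeze
ANTONYMS = { "bad" => ["good"], "poor" => ["good"] }.freeze

def search_orientation(word, seed_list)
  return seed_list[word] if seed_list.key?(word)
  (SYNONYMS[word] || []).each do |syn|
    return seed_list[syn] if seed_list.key?(syn)   # same orientation
  end
  (ANTONYMS[word] || []).each do |ant|
    return -seed_list[ant] if seed_list.key?(ant)  # opposite orientation
  end
  nil # orientation still unknown
end

seed_list = { "good" => 1, "nice" => 1 }
# "bad" is an antonym of "good", so it gets orientation -1 and
# the seed list grows, as in the example above.
seed_list["bad"] = search_orientation("bad", seed_list)
```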
Page 23
Design
Opinion Words in Context
Negation Rules: negation words (e.g. "not", "no") invert the orientation of opinion words (e.g. "good", "bad", "problem")
Too Rules: "too" before adjectives usually denotes negative sentiment, e.g. "This camera is too small"
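These two rules can be sketched as a small context check; the word lists and the window (only the directly preceding word) are simplifying assumptions:

```ruby
# Sketch of two context rules applied to an opinion word's orientation:
#  * Negation rule: "not"/"no" directly before an opinion word flips it.
#  * "Too" rule: "too" before an adjective usually denotes negative
#    sentiment ("This camera is too small").
NEGATION_WORDS = %w[not no n't].freeze

def contextual_orientation(words, index, base_orientation)
  prev = index.positive? ? words[index - 1].downcase : nil
  return -base_orientation if NEGATION_WORDS.include?(prev)
  return -1 if prev == "too" && base_orientation >= 0
  base_orientation
end

negated  = contextual_orientation(%w[This camera is not good], 4, 1)
too_case = contextual_orientation(%w[This camera is too small], 4, 0)
plain    = contextual_orientation(%w[The zoom is powerful], 3, 1)
```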
Page 24
Design
Opinion Words in Context
Orientation Inverter / Sentiment Inverter Words
Find the sentiment/orientation of opinion words with unknown orientation
"The camera is nice, except for the initialization time which takes long"
"The autofocus is great, the battery life lasts long, but I find the functions a little complex."
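A minimal sketch of this inference, assuming a hypothetical list of inverter connectives and that the connective between the known and unknown word has already been located:

```ruby
# Sketch: infer the orientation of an unknown opinion word from a known
# opinion word in the same sentence, via the connective between them.
# Inverter words ("except", "but", ...) flip the orientation; other
# connectives ("and", ",") carry it over unchanged.
INVERTERS = %w[except but however].freeze

def infer_orientation(known_orientation, connective)
  INVERTERS.include?(connective) ? -known_orientation : known_orientation
end

# "The camera is nice, except for the initialization time which takes long"
#   "nice" (+1) + "except" => "long" is negative here.
long_orientation = infer_orientation(1, "except")
# "...the battery life lasts long, but I find the functions a little complex."
#   positive clause + "but" => "complex" is negative here.
complex_orientation = infer_orientation(1, "but")
```

The same surface word ("long") thus receives opposite orientations in the two sentences, which is why orientation must be resolved in context rather than stored globally.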
Page 25
Design
Word positions: 0 1 2 3 4 5 6 7 8 9
Score(image quality) = (1 / |4 - 1|) + (-1 / |9 - 1|) = 0.3333 - 0.125 = 0.2083 (positive)
Score(autofocus) = (1 / |4 - 7|) + (-1 / |9 - 7|) = 0.3333 - 0.5 = -0.1667 (negative)
“The image quality is amazing, but the autofocus is terrible.”
Aggregating opinions for a feature
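The aggregation can be sketched directly: each opinion word contributes its orientation divided by its word distance to the feature, so nearby opinion words dominate. Positions follow the example sentence:

```ruby
# Sketch of distance-weighted score aggregation for a feature: each
# opinion word contributes orientation / |position - feature position|.
def feature_score(feature_pos, opinion_words)
  opinion_words.sum do |pos, orientation|
    orientation.to_f / (pos - feature_pos).abs
  end
end

# "The image quality is amazing, but the autofocus is terrible."
# word positions: image=1, amazing=4 (+1), autofocus=7, terrible=9 (-1)
opinions = [[4, 1], [9, -1]]
image_quality = feature_score(1, opinions)  # 1/3 - 1/8 ≈  0.2083 (positive)
autofocus     = feature_score(7, opinions)  # 1/3 - 1/2 ≈ -0.1667 (negative)
```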
Page 26
Implementation
Overview
Core Technologies
Ruby Gems (Libraries)
Java Libraries
JRuby on Rails (JRuby 1.5.0 RC3 / Rails 2.3.8 )
Mechanize Nokogiri Ruby-aaws Delayed_Job
Rita.Wordnet Stanford POS Tagger API
Page 27
Implementation
Overview
Page 28
Implementation
Overview
Evaluation
Page 30
Evaluation
Product          Opinions   Opinion Sentences
iPod Touch 8GB      120          673
Nikon D5000          86          452
Nikon P90            52          273
Xbox 360             41          181
Test Environment
Sample data
System Configuration
AMD Turion(tm) 64 Mobile Technology ML-32 / 1GB RAM Ubuntu 9.04 32 bits
Page 31
Evaluation
Effectiveness of Feature Identification
Threshold Accuracy Features
Page 32
Evaluation
Xbox 360 had the lowest effectiveness, due to wrong part-of-speech tagging
"Complex" sentences and domain-dependent sentences are also wrongly classified
Sentiment Classification Effectiveness
Page 33
Evaluation
System Efficiency
The lower the threshold, the more features are identified, and hence the more sentences are analyzed
What is the price of addressing many exceptions?
Page 34
Evaluation
Considerations
Complex Sentences / Domain Dependent Sentences / Exceptions
Users may talk about other objects with similar features
Domain-dependent sentences (e.g. "The device heats very fast.")
POS Tagging Errors
"This camera is GOOD" / "[...] the hard drive which comes with the device."
Pluralization cases
May not refer to the same OuD (e.g. "camera" and "cameras")
Page 35
Conclusion
POECS performs well, with a good rate of accuracy
Observations show that many users write "simple", straightforward sentences, which are covered by POECS
Domain-specific annotations can help make the system more effective
Human language is complex; covering many special cases costs a lot of performance
Page 36
Conclusion
Minimize number of manual annotations through recognition of reusable patterns of the human language
Cope with common unsolved problems such as
Safe ways to recognize which features belong to which object
Global opinion knowledge to help improve local analysis (sentence or feature level)
Future Work
Page 37
Conclusion
Questions?