Sentiment Analysis
Sivaji Bandyopadhyay
Jadavpur University, Kolkata, India
Slide 2
Overview
Sentiment Analysis is a multifaceted problem:
- Sentiment Knowledge Acquisition
- Sentiment / Subjectivity Detection
- Sentiment Polarity Detection
- Sentiment Structurization
- Sentiment Summary
Sentiment; Human Intelligence
Slide 3
Sentiment Knowledge Acquisition
- Prior Polarity Sentiment Lexicon
- Automatic Computational Processes: WordNet, Dictionary Based, Antonym
- Involving Human Intelligence: Dr Sentiment
- Cross Lingual Projection of Sentiment Lexicons
Slide 4
Sentiment Analysis
- Sentiment Detection
- Sentiment Classification
Example text: IITH is a very good institution. I love Hyderabad, the city is famous for its Biriyani, Pearl and old Mughal architecture! Summer in Hyderabad is too scorching.
Andrea Esuli and Fabrizio Sebastiani. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of Language Resources and Evaluation (LREC), 2006.
Slide 5
What is SentiWordNet?
Prior Polarity Lexicon

POS        Offset   Positivity  Negativity  Synset
Adjective  1006361  0.875       0.0         happy
Noun       4466580  0.375       0.0         friendliness
Adverb     214589   0.625       0.125       sharply
Verb       2471993  0.0         0.125       shame
Slide 6
Prior Polarity Lexicon
Sentiment-bearing words: love, hate, good, favorite
Challenges for polarity identification:
- Context Information (Pang et al., 2002)
- Domain Pragmatic Knowledge (Aue and Gamon, 2005)
- Time Dimension (Read, 2005)
- Language/Culture Properties (Wiebe and Mihalcea, 2006)
Slide 7
Prior Polarity Lexicon (continued)
- Context Information: "I prefer a limousine as it is longer than a Mercedes." versus "Avoid longer baggage during an excursion in the Amazon."
- Language/Culture Properties: Sahera (a marriage-wear of India); Durgapujo (a festival of Bengal).
- Domain Pragmatic Knowledge: "Sensex goes high." versus "Price goes high."
- Time Dimension: During the 1990s, mobile phone users generally wrote about their color phones in online reviews, but in recent times a color phone alone is not enough. People are fascinated and influenced by touch screens and the ability to install software on these new-generation gadgets.
Slide 8
Prior Polarity Lexicon (continued)
Suppose the word "long" occurs $n$ times in a domain corpus, and that $S_p$ of those occurrences are positive and $S_n$ are negative. In the resulting sentiment lexicon, the word is assigned the following scores:
  Positivity: $S_p / n$
  Negativity: $S_n / n$
These associated positive and negative scores are called the prior polarity.
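A minimal Python sketch of the prior polarity computation, $S_p / n$ and $S_n / n$. The slides do not say how each occurrence is judged positive or negative, so the sketch assumes the occurrences arrive already labelled; the function name `prior_polarity` and the label strings are illustrative only.

```python
from collections import Counter

def prior_polarity(labelled_occurrences):
    """Return {word: (positivity, negativity)} from (word, label) pairs.

    label is "pos" or "neg"; any other label counts towards n only.
    """
    total, pos, neg = Counter(), Counter(), Counter()
    for word, label in labelled_occurrences:
        total[word] += 1            # n: all occurrences of the word
        if label == "pos":
            pos[word] += 1          # S_p
        elif label == "neg":
            neg[word] += 1          # S_n
    return {w: (pos[w] / n, neg[w] / n) for w, n in total.items()}

# Toy data (invented): "long" occurs 4 times, 1 positive, 2 negative, 1 neutral.
occurrences = [("long", "pos"), ("long", "neg"), ("long", "neg"), ("long", "neutral")]
scores = prior_polarity(occurrences)
print(scores["long"])  # (0.25, 0.5)
```

Because neutral occurrences count towards $n$, the two scores need not sum to 1.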
Slide 9
Source Lexicon Acquisition
Available resources for English:
- SentiWordNet (Esuli et al., 2006): an automatically constructed lexical resource for English that assigns a positivity score and a negativity score to each WordNet synset.
- WordNet Affect List (Strapparava et al., 2004): WordNet synsets tagged with six basic emotions: anger, disgust, fear, joy, sadness, surprise.
- Taboada's Adjective List (Voll et al., 2006): an automatically constructed adjective list with positive and negative polarity assignments.
- Subjectivity Word List (Wilson et al., 2005): entries are manually labeled with part-of-speech (POS) tags and with a strong or weak subjective tag, depending on how reliably subjective the entry is.
Slide 10
Source Language Acquisition (continued)
Chosen source lexicon resources:
- SentiWordNet: the most widely used resource in applications such as sentiment analysis, opinion mining and emotion analysis.
- Subjectivity Word List (Wilson et al., 2005): the most reliable resource, since the opinion mining system OpinionFinder, which uses this list, has reported the highest score for opinion/sentiment subjectivity (Wiebe and Riloff, 2006) (Das and Bandyopadhyay, 2010).
Slide 11
Source Language Acquisition (continued)
Noise reduction:
- A merged sentiment lexicon has been built from both resources by removing duplicates.
- 64% of the single-word entries are common to the Subjectivity Word List and SentiWordNet.
- The new merged sentiment lexicon consists of 14,135 tokens.
- Several filtering techniques have been applied to generate the new list.
Slide 12
Source Language Acquisition (continued)

                           SentiWordNet        Subjectivity Word List
                           Single    Multi     Single    Multi
Entries                    115424    79091     5866      990
Unambiguous Words          20789     30000     4745      963
Discarded Ambiguous Words  86944     30000     2652      928

Thresholds for discarding ambiguous words: Orientation Strength, Subjectivity Strength, POS.
Slide 13
Target Language Generation
Generation strategies:
- Bilingual Dictionary Based Approach
- WordNet Based Approach
- Antonym Generation
- Corpus Based Approach
- Dr Sentiment (A Gaming Approach)
Slide 14
Target Language Generation (continued)
Bilingual Dictionary Based Approach:
- A word-level translation technique is adopted.
- Robust and reliable synsets (approximately 9,966) were created by native speakers and linguistics experts of the specific languages as part of the English to Indian Languages Machine Translation (EILMT) systems.
- Various language-specific dictionaries were acquired.
Slide 15
Target Language Generation (continued)
Bilingual Dictionary Based Approach: dictionaries
- Hindi (90,872)
  - SHABDKOSH (http://www.shabdkosh.com/)
  - Shabdanjali (http://www.shabdkosh.com/content/category/downloads/)
- Bengali (102,119)
  - Samsad Bengali-English Dictionary (http://dsal.uchicago.edu/dictionaries/biswas_bengali/)
- Telugu (112,310)
  - Charles Philip Brown English-Telugu Dictionary (http://dsal.uchicago.edu/dictionaries/brown/)
  - Aksharamala English-Telugu Dictionary (https://groups.google.com/group/aksharamala)
  - English-Telugu Dictionary (http://ltrc.iiit.ac.in/onlineServices/Dictionaries/Dict_Frame.html)
Slide 16
Target Language Generation (continued)
Bilingual Dictionary Based Approach: results
- Hindi: the translation process produced 22,708 Hindi entries.
- Bengali: the translation process produced 34,117 Bengali entries.
- Telugu: the translation process produced 30,889 Telugu entries. Almost 88% of the Telugu SentiWordNet was generated by this process.
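The word-level translation step of the bilingual dictionary approach might look like the Python sketch below. It rests on assumptions the slides do not state: source scores are carried over to every translation, and when several source words map to the same target word their scores are averaged. The function name `project_lexicon` and the transliterated toy dictionary entries are for illustration only.

```python
from collections import defaultdict

def project_lexicon(source_lexicon, bilingual_dict):
    """Project {source_word: (pos, neg)} onto target words via a dictionary.

    bilingual_dict maps a source word to a list of target-language words.
    Source words without a dictionary entry are skipped.
    """
    collected = defaultdict(list)
    for word, scores in source_lexicon.items():
        for target in bilingual_dict.get(word, []):
            collected[target].append(scores)
    projected = {}
    for target, score_list in collected.items():
        n = len(score_list)
        # Assumed conflict rule: average the scores of all contributing source words.
        projected[target] = (sum(p for p, _ in score_list) / n,
                             sum(q for _, q in score_list) / n)
    return projected

# Toy inputs (invented scores, transliterated target words).
english = {"good": (0.625, 0.0), "fine": (0.375, 0.0), "bad": (0.0, 0.625), "tepid": (0.0, 0.25)}
toy_dict = {"good": ["bhalo"], "fine": ["bhalo"], "bad": ["kharap"]}
target = project_lexicon(english, toy_dict)
print(target)  # {'bhalo': (0.5, 0.0), 'kharap': (0.0, 0.625)}
```

Entries such as "tepid", which have no translation, are dropped in this sketch, which is one reason a projected lexicon can be smaller than its source.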
Slide 17
Target Language Generation (continued)
WordNet Based Expansion Approach:
- Synonymy Expansion: WordNet-based expansion produces more synset members, for example inactive, motionless and static for the source word still. Prior polarity scores are copied directly.
- Antonymy Expansion: WordNet-based expansion produces more sentiment lexemes, for example ugly for the source word beautiful. Prior polarities are calculated as
  $T_p = 1 - S_p$
  $T_n = 1 - S_n$
  where $S_p$ and $S_n$ are the positivity and negativity scores for the source language (i.e., English) and $T_p$ and $T_n$ are the positivity and negativity scores for the target language.
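The two score-assignment rules can be written as a short Python sketch: synonyms copy the source scores, antonyms take $T_p = 1 - S_p$ and $T_n = 1 - S_n$. The function names and the example scores for "beautiful" are hypothetical.

```python
def synonym_scores(source_pos, source_neg):
    """Synonymy expansion: prior polarity scores are copied directly."""
    return source_pos, source_neg

def antonym_scores(source_pos, source_neg):
    """Antonymy expansion: T_p = 1 - S_p and T_n = 1 - S_n."""
    return 1.0 - source_pos, 1.0 - source_neg

# Hypothetical source scores for "beautiful": positivity 0.75, negativity 0.0.
beautiful = (0.75, 0.0)
print(synonym_scores(*beautiful))  # (0.75, 0.0)
print(antonym_scores(*beautiful))  # (0.25, 1.0)
```

Applying the antonym rule twice returns the original scores, so the mapping is symmetric between a word and its antonym.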
Slide 18
Target Language Generation (continued)
WordNet Based Expansion Approach:
- Hindi: Hindi WordNet (Jha et al., 2001) (http://www.cfilt.iitb.ac.in/wordnet/webhwn/) is a well-structured, manually compiled resource that has been updated over the last nine years. Almost 60% of the Hindi SentiWordNet was generated by this process.
- Bengali: the Bengali WordNet (http://bn.asianwordnet.org/) contains only 1,775 noun synsets, as reported in (Robkop et al., 2010). Only 5% new lexicon entries were generated by this process.
Slide 19
Target Language Generation (continued)
Antonymy Generation

Affix/Suffix   Word        Antonym
abX            Normal      Ab-normal
misX           Fortune     Mis-fortune
imX-exX        Im-plicit   Ex-plicit
antiX          Clockwise   Anti-clockwise
nonX           Aligned     Non-aligned
inX-exX        In-trovert  Ex-trovert
disX           Interest    Dis-interest
unX            Biased      Un-biased
upX-downX      Up-hill     Down-hill
imX            Possible    Im-possible
illX           Legal       Il-legal
overX-underX   Overdone    Under-done
inX            Consistent  In-consistent
rX-irX         Regular     Ir-regular
Xless-Xful     Harm-less   Harm-ful
malX           Function    Mal-function

About 8% of Bengali, 7% of Hindi and 11% of Telugu SentiWordNet entries are generated in this process.
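A rough Python sketch of affix-based antonym generation, using the English patterns from the table. It assumes that candidates are formed by adding or swapping affixes and are kept only if they appear in a known vocabulary; that filtering step and the names `antonym_candidates`, `ADD_PREFIXES`, `SWAP_PREFIXES` and `SWAP_SUFFIXES` are not taken from the slides.

```python
# Prefixes that negate a word when prepended (abX, misX, antiX, ...).
ADD_PREFIXES = ["ab", "mis", "anti", "non", "dis", "un", "im", "il", "in", "ir", "mal"]
# Affix pairs that are exchanged (imX-exX, upX-downX, Xless-Xful, ...).
SWAP_PREFIXES = [("im", "ex"), ("in", "ex"), ("up", "down"), ("over", "under")]
SWAP_SUFFIXES = [("less", "ful")]

def antonym_candidates(word, vocabulary):
    """Return the affix-derived antonym candidates of word found in vocabulary."""
    word = word.lower()
    candidates = {prefix + word for prefix in ADD_PREFIXES}
    for a, b in SWAP_PREFIXES:
        for old, new in ((a, b), (b, a)):
            if word.startswith(old):
                candidates.add(new + word[len(old):])
    for a, b in SWAP_SUFFIXES:
        for old, new in ((a, b), (b, a)):
            if word.endswith(old):
                candidates.add(word[:-len(old)] + new)
    candidates.discard(word)
    # Assumed filter: keep only candidates that are real words.
    return sorted(candidates & vocabulary)

vocab = {"abnormal", "explicit", "downhill", "harmful", "illegal", "irregular", "underdone"}
print(antonym_candidates("normal", vocab))    # ['abnormal']
print(antonym_candidates("implicit", vocab))  # ['explicit']
print(antonym_candidates("harmless", vocab))  # ['harmful']
```

Without the vocabulary filter the rules over-generate heavily (for example "misnormal"), so some validation step of this kind seems necessary in practice.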
Slide 20
Target Language Generation (continued)
Corpus Based Approach:
- Language/culture-specific words: Sahera (a marriage-wear); Durgapujo (a festival of Bengal).
- Technique: the generated sentiment lexicon is used as a seed list.
- Tag set: SWP (Sentiment Word Positive), SWN (Sentiment Word Negative).
- Corpus: EILMT language-specific corpus of approximately 10K sentences.
- Model: Conditional Random Field (CRF). An n-gram (n=4) sequence labeling model has been used for the present task.
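One way to prepare training data for such a sequence labeler is sketched below in Python. The sketch covers only data preparation, not the CRF itself, and makes two assumptions that go beyond the slides: tokens outside the seed list receive an "O" (other) tag, and "n-gram (n=4)" is read as a feature window of the current token plus its three predecessors. The function names are illustrative.

```python
def label_tokens(tokens, positive_seeds, negative_seeds):
    """Tag each token as SWP, SWN or O using the seed lexicon."""
    labels = []
    for token in tokens:
        if token in positive_seeds:
            labels.append("SWP")
        elif token in negative_seeds:
            labels.append("SWN")
        else:
            labels.append("O")  # assumed "other" tag, not listed on the slide
    return labels

def ngram_features(tokens, i, n=4):
    """Features for token i: the token and its n-1 predecessors, padded with <s>."""
    features = {}
    for k in range(n):
        j = i - k
        features["w-%d" % k] = tokens[j] if j >= 0 else "<s>"
    return features

# English toy sentence for readability; the slides target Indian languages.
tokens = ["the", "food", "was", "good", "but", "service", "was", "terrible"]
labels = label_tokens(tokens, {"good"}, {"terrible"})
print(labels)
print(ngram_features(tokens, 3))
```

The per-token feature dictionaries and label sequences produced this way are the usual input shape for CRF toolkits, which would then learn to tag unseen sentiment words from their contexts.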
Slide 21
Limitations
Issues in cross-lingual projection:
- The sentiment score of a target word may not be equal to that of the source-language word.
- A relative sentiment score is needed rather than an absolute score.
- Language/culture-specific lexicons should be included.
- Sentiment scores should be updated over time.
Slide 22
Involving Human Intelligence
WORLD INTERNET USAGE AND POPULATION STATISTICS

World Regions            Population (2010 Est.)  Internet Users Dec. 31, 2000  Internet Users Latest Data  Penetration (% Population)  Growth 2000-2010  Users % of Table
Africa                   1,013,779,050           4,514,400                     110,931,700                 10.9 %                      2,357.3 %         5.6 %
Asia                     3,834,792,852           114,304,000                   825,094,396                 21.5 %                      621.8 %           42.0 %
Europe                   813,319,511             105,096,093                   475,069,448                 58.4 %                      352.0 %           24.2 %
Middle East              212,336,924             3,284,800                     63,240,946                  29.8 %                      1,825.3 %         3.2 %
North America            344,124,450             108,096,800                   266,224,500                 77.4 %                      146.3 %           13.5 %
Latin America/Caribbean  592,556,972             18,068,919                    204,689,836                 34.5 %                      1,032.8 %         10.4 %
Oceania / Australia      34,700,201              7,620,480                     21,263,990                  61.3 %                      179.0 %           1.1 %
WORLD TOTAL              6,845,609,960           360,985,492                   1,966,514,816               28.7 %                      444.8 %           100.0 %
Slide 23
Dr. Sentiment: Q1
Slide 24
Dr. Sentiment: Q2

Word    Positivity  Negativity
Good    0.625       0.0
Better  0.875       0.0
Best    0.980       0.0
Slide 25
Dr. Sentiment: Q3
Slide 26
Dr. Sentiment: Q4
Slide 27
Sentiment: Un-Explored Dimensions
Geo-Spatial
Blue in Islam: In verse 20:102 of the Quran, the word zurq (plural of azraq, 'blue') is used metaphorically for evildoers whose eyes are glazed with fear.
Expected Impact of the Resources
The resources are useful in multiple respects:
- Monolingual sentiment/opinion/emotion analysis tasks.
- The generated language-specific SentiWordNet(s) could be expanded by the other proposed methods (Dictionary, WordNet, Antonym and Corpus Based Approaches).
Other dimensions:
- Geospatial information retrieval
- Personalized search
- Recommender systems, etc.
- Stylometry: a writer's Senti-Mentality
- Plagiarism
- Spamming technique: geo-spatial and user perspective
Slide 31
The Road Ahead
Languages

Afrikaans    Bulgarian  Dutch     German     Irish       Malay       Russian    Thai
Albanian     Catalan    Estonian  Greek      Italian     Maltese     Serbian    Turkish
Arabic       Chinese    Filipino  Haitian    Japanese    Norwegian   Slovak     Ukrainian
Armenian     Croatian   Finnish   Hebrew     Korean      Persian     Slovenian  Urdu
Azerbaijani  Creole     French    Hungarian  Latvian     Polish      Spanish    Vietnamese
Basque       Czech      Galician  Icelandic  Lithuanian  Portuguese  Swahili    Welsh
Belarusian   Danish     Georgian  Indonesian Macedonian  Romanian    Swedish    Yiddish

A basic SentiWordNet has been developed for 56 languages.
A. Das and S. Bandyopadhyay. Towards The Global SentiWordNet, In the Workshop on Model and Measurement of Meaning (M3), PACLIC 24, November 4, Sendai, Japan, 2010. (Accepted)
Slide 32
References
Resources
I. A. Das and S. Bandyopadhyay. Towards The Global SentiWordNet, In the Workshop on Model and Measurement of Meaning (M3), PACLIC 24, November 4, Sendai, Japan, 2010.
II. A. Das and S. Bandyopadhyay. SentiWordNet for Indian Languages, In the 8th Workshop on Asian Language Resources (ALR), August 21-22, Beijing, China, 2010.
III. A. Das and S. Bandyopadhyay. SentiWordNet for Bangla, In Knowledge Sharing Event-4: Task 2: Building Electronic Dictionary, February 23rd-24th, 2010, Mysore.
Slide 33
Sentiment / Subjectivity Detection
Solution architectures explored:
- Rule-Based
- Machine Learning
- Hybrid
- Adaptive Genetic Algorithm: Multiple Objective Optimization, the evolutionary technique to detect sentiment
The Adaptive Genetic Algorithm: Multiple Objective Optimization technique outperformed all other techniques.
Slide 34
Sentence subjectivity: An objective sentence expresses some factual information about the world, while a subjective sentence expresses some personal feelings or beliefs.
Example: Type: Film Review; Film Name: Deep Blue Sea; Holder: Arbitrary (outside the theatre)
  "Oh, this is blue!"
Is this statement objective or subjective?
- "blue" is not an evaluative expression.
- Different cultures have different colour schemes (blue: positive or negative?).
Slide 35
Example: Type: Comment; Holder: Governor of WB; Issue: Nandigram
  "Governor said the government should keep patience."
Is this statement objective or subjective?
- Keep patience regarding what?
- How do we determine whether the Governor's comment is important?
Slide 36
- Subjectivity is a social norm.
- Subjectivity knowledge is pragmatic.
- Prior knowledge always helps to identify subjectivity.
Slide 37
A rule-based approach
- Uses Themes and Ontology as pragmatic knowledge.
- SentiWordNet (Bengali): a prior polarity lexicon.
Features: Frequency, Average Distribution, Functional Word, Positional Aspect, Theme Identification, Ontology List, Stemming Cluster, Part of Speech, Chunk, SentiWordNet (Bengali)
Slide 38
Feature-wise System Performance

Features                Overall performance incremented by
Stemming Cluster        4.05%
Part of Speech          3.62%
Chunk                   4.07%
Functional Word         1.88%
SentiWordNet (Bengali)  5.02%
Ontology List           3.66%

[Chart: feature-wise system performance, scale 0 to 80; labels: English, Bengali, Base-Line, POS-Chunk, Ontology, Position, Distribution]
Slide 39
Overall System Performance

Domain  Precision  Recall
NEWS    72.16%     76.00%
BLOG    74.60%     80.40%

Observations:
- Subjectivity detection is easier for the blog corpus than for the news corpus.
- Using the CRF technique with the same feature set improved performance by only 2% over the rule-based system.
Slide 40
Genetics-Based Machine Learning (GBML) is used to automatically identify the best feature set, based on the principle of natural selection and survival of the fittest. The identified fittest feature set is then optimized locally, and global optimization is then obtained by a multi-objective optimization technique. Local optimization identifies the best range of values for a particular feature. Global optimization identifies the best ranges of values across the given multiple features.
Slide 41
Experimentally Best Identified Feature Set

Types             Features
Lexico-Syntactic  POS, SentiWordNet, Frequency, Stemming
Syntactic         Chunk Label, Dependency Parsing
Discourse Level   Title of the Document, First Paragraph, Average Distribution, Theme Word
Slide 42
GAs are characterized by the following five basic components:
I. Chromosome representation for the feasible solutions to the optimization problem.
II. Initial population of the feasible solutions.
III. A fitness function that evaluates each solution.
IV. Genetic operators that generate a new population from the existing population.
V. Control parameters such as population size, probability of genetic operators, number of generations, etc.
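The five components can be made concrete with a small Python sketch of a plain, single-objective GA that selects a feature subset. This is not the adaptive multi-objective variant described in the talk: the fitness function, the "helpful" feature set, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism) and all parameter values are assumptions chosen only for illustration.

```python
import random

FEATURES = ["POS", "SentiWordNet", "Frequency", "Stemming", "Chunk", "Position"]
# Made-up set of "helpful" features; it only gives the toy fitness a shape.
HELPFUL = {"POS", "SentiWordNet", "Stemming", "Chunk"}

def toy_fitness(chromosome):
    """III. Fitness: reward helpful features, penalise the others."""
    chosen = {f for f, bit in zip(FEATURES, chromosome) if bit}
    return 2 * len(chosen & HELPFUL) - len(chosen - HELPFUL)

def tournament(population, fitness, rng):
    """Selection: pick the fitter of two randomly drawn chromosomes."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def run_ga(fitness, n_genes, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """V. Control parameters are the keyword arguments."""
    rng = random.Random(seed)
    # I. Chromosome: one 0/1 flag per feature. II. Random initial population.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best chromosome forward
        while len(children) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            # IV. Genetic operators: one-point crossover, then bit-flip mutation.
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_genes - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best

best = run_ga(toy_fitness, n_genes=len(FEATURES))
selected = [f for f, bit in zip(FEATURES, best) if bit]
print(selected, toy_fitness(best))
```

In a real subjectivity detector the fitness would come from evaluating a classifier trained on the selected features, which is far more expensive than this toy function and is the main cost of GA-based feature selection.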
Slide 43
Where is the resultant subjectivity function, to be calculated
and is the i-th feature function. If the present model is
represented in a vector space model then the above function could
be re-written as: This equation specifies what is known as the dot
product between vectors. The GBML provides the facility to search
in the Pareto-optimal set of possible features. To make the Pareto
optimality mathematically more rigorous, we state that a feature
vector x is partially less than feature vector y, symbolically
x