DASA Project Data Acquisition for Sentiment Analysis Ali Belcaid © AB Advisory & Consulting High level architecture and components overview – March 2013

Data Acquisition for Sentiment Analysis

DESCRIPTION

Quick start-up for a classic sentiment analysis project

Page 1: Data Acquisition for Sentiment Analysis

DASA Project: Data Acquisition for Sentiment Analysis

Ali Belcaid © AB Advisory & Consulting

High level architecture and components overview – March 2013

Page 2: Data Acquisition for Sentiment Analysis

Objectives

• Streamline and facilitate the process of unstructured data acquisition

• Create and manage corpora for contextual opinions and sentiments

• Detect trends based on contextual reviews, comments, discussions…

• Run and train models for sentiment or opinion analysis

• Provide figures, results and graphs as outputs

Page 3: Data Acquisition for Sentiment Analysis

Software components

• Python – programming language

• Django – web application framework

• Scrapy – web crawler

• Libraries: Twitter, …

• MySQL / MongoDB / HBase – For the time being, no definitive choice has been made, but the final solution could be a mix of different databases depending on the nature of the use.

• R Project – R will be used whenever specific text mining libraries are missing in Python, or when it becomes easier to use R instead of Python. In that case, the R scripts will be encapsulated in Python programs.

• Hadoop – For massive storage we will use Hadoop. The architecture has not been defined yet.

– It is used for raw data storage.

Page 4: Data Acquisition for Sentiment Analysis

Simplified Solution Architecture

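[Figure: simplified solution architecture. Boxes: Web Interface (Django); Crawl Engine & API (Scrapy); Text Mining Engine (NLTK; TM – R Project). Stored content: Configuration, Crawl Content, Pre-processing & Corpora, Output results. The components are numbered 1 to 5.]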

Page 5: Data Acquisition for Sentiment Analysis

Architecture components

1. Data sources: access will be managed via APIs or crawls. Sources are all those related to social media: blogs, forums, advisors, social web… In general, all media where sentiments or opinions are expressed.

2. Web interface to interact with the system: to manage inputs, configurations, outputs…

3. There will be a mix between Scrapy (the crawler) and Python scripts for using APIs. Basically, the engine will be used to gather data from all sources and store it for further processing (pre-processing and analysis).

4. There will be a mix between Scrapy (the crawler) and Python scripts for using APIs. Basically, the engine will be used to gather data from all sources and store it for further processing.

5. The target database solution is not yet selected. The objective is to store all relevant content, whether raw data, configuration items or output results.

Page 6: Data Acquisition for Sentiment Analysis

Characteristics of Sentiment Analysis

Sentiment = Holder + Polarity + Target + Auxiliary

– Holder: who expresses the sentiment

– Target: what/whom the sentiment is expressed to

– Polarity: the nature of the sentiment (e.g., positive or negative)

“The games in iPhone 4s are pretty funny!”

Feature/Aspect: games · Target: iPhone 4s · Polarity: positive

Holder = the user/reviewer

Auxiliary

• Strength: differentiate the intensity

• Confidence: measure the reliability of the sentiment

• Summary: explain the reason inducing the sentiment

• Time
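The decomposition above can be sketched as a small record type. This is only an illustration: the class name `Sentiment`, the field names, and the optional `aspect` field are assumptions, and the example values read the slide's labels as "games" = feature/aspect and "iPhone 4s" = target.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentiment:
    """One opinion: Holder + Polarity + Target + Auxiliary (hypothetical layout)."""
    holder: str                         # who expresses the sentiment
    target: str                         # what/whom the sentiment is expressed to
    polarity: str                       # e.g. "positive" or "negative"
    aspect: Optional[str] = None        # feature/aspect of the target, if any
    # Auxiliary information, all optional
    strength: Optional[float] = None    # intensity of the sentiment
    confidence: Optional[float] = None  # reliability of the sentiment
    summary: Optional[str] = None       # reason inducing the sentiment
    time: Optional[str] = None          # when the sentiment was expressed


# "The games in iPhone 4s are pretty funny!"
example = Sentiment(
    holder="the user/reviewer",
    target="iPhone 4s",
    polarity="positive",
    aspect="games",
)
```

Keeping the auxiliary fields optional lets the basic tasks (holder, target, polarity) fill the record first, with strength or confidence added later if a model provides them.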

Page 7: Data Acquisition for Sentiment Analysis

Basic Tasks

• Holder detection – Find who expresses the sentiment

• Target recognition – Find whom/what the sentiment is expressed towards

• Sentiment (Polarity) classification – Positive, negative, neutral

• Opinion summarization

• Opinion spam detection

Page 8: Data Acquisition for Sentiment Analysis

Subjectivity versus Sentiment

• Sentiment analysis is also known as opinion mining.

• It attempts to identify the opinion/sentiment that a person may hold towards an object.

• It is a finer-grained analysis compared to subjectivity analysis.

Page 9: Data Acquisition for Sentiment Analysis

Lexicon Based Sentiment Classification

Basic idea

• Use the dominant polarity of the opinion words in the sentence to determine its polarity:

• If positive/negative opinion prevails, the opinion sentence is regarded as positive/negative

• Lexicon + Counting

• Lexicon + Grammar Rule + Inference Method

Example lexicons:

http://www.wjh.harvard.edu/~inquirer

http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar

http://sentiwordnet.isti.cnr.it/
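The "Lexicon + Counting" idea can be sketched in a few lines. The two word sets below are a hypothetical toy lexicon, not taken from any of the resources listed; a real project would load one of those lexicons instead. The function name `classify_sentence` is also an assumption.

```python
import re

# Hypothetical toy lexicon for illustration only.
POSITIVE = {"good", "great", "funny", "love", "excellent"}
NEGATIVE = {"bad", "poor", "boring", "hate", "terrible"}


def classify_sentence(sentence: str) -> str:
    """Return the dominant polarity of the opinion words in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    positive_hits = sum(1 for w in words if w in POSITIVE)
    negative_hits = sum(1 for w in words if w in NEGATIVE)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"  # no opinion word, or a tie
```

Plain counting ignores negation ("not good") and intensity, which is the gap the "Lexicon + Grammar Rule + Inference Method" variant is meant to address.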

Page 10: Data Acquisition for Sentiment Analysis

Sentiment Analysis Tasks

Level: task description

Document

• Task: sentiment classification of reviews

• Classes: positive, negative, and neutral

• Assumption: each document (or review) focuses on a single object (not true in many discussion posts) and contains opinion from a single opinion holder.

Sentence

• Task 1: identifying subjective/opinionated sentences

• Classes: objective and subjective (opinionated)

• Task 2: sentiment classification of sentences

• Classes: positive, negative and neutral.

• Assumption: a sentence contains only one opinion; not true in many cases.

• Then we can also consider clauses or phrases.

Feature

• Task 1: Identify and extract object features that have been commented on by an opinion holder (e.g., a reviewer).

• Task 2: Determine whether the opinions on the features are positive, negative or neutral.

• Task 3: Group feature synonyms.

• Produce a feature-based opinion summary of multiple reviews.
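As a rough sketch of the feature-level output, the snippet below groups feature synonyms (Task 3) and counts polarities per feature to build a summary. It assumes Tasks 1 and 2 have already produced (feature, polarity) pairs; the synonym table, the function name `summarize` and the sample pairs are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical synonym table mapping surface forms to one canonical feature.
SYNONYMS = {"picture": "photo", "image": "photo", "battery life": "battery"}


def summarize(opinions):
    """Count polarities per canonical feature from (feature, polarity) pairs."""
    summary = defaultdict(Counter)
    for feature, polarity in opinions:
        canonical = SYNONYMS.get(feature.lower(), feature.lower())
        summary[canonical][polarity] += 1
    return {feature: dict(counts) for feature, counts in summary.items()}


# Made-up pairs, as if extracted from several reviews.
opinions = [
    ("picture", "positive"),
    ("photo", "positive"),
    ("image", "negative"),
    ("battery life", "negative"),
]
result = summarize(opinions)
```

Here `result` maps each canonical feature to its polarity counts, which is the kind of structure a figure or graph output could be drawn from.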

Page 11: Data Acquisition for Sentiment Analysis

Some tools

Lexicon-based tools

• Use sentiment and subjectivity lexicons

• Rule-based classifier

• A sentence is subjective if it has at least two words in the lexicon

• A sentence is objective otherwise
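The rule above is simple enough to sketch directly. The lexicon below is a hypothetical placeholder, and counting every occurrence (rather than distinct words) is an assumption, since the slide does not say which is meant.

```python
import re

# Hypothetical subjectivity lexicon for illustration only.
SUBJECTIVITY_LEXICON = {
    "good", "bad", "funny", "pretty", "love", "hate", "terrible", "great",
}


def is_subjective(sentence, lexicon=SUBJECTIVITY_LEXICON, threshold=2):
    """A sentence is subjective if at least `threshold` of its words are in the lexicon."""
    words = re.findall(r"[a-z']+", sentence.lower())
    hits = sum(1 for w in words if w in lexicon)
    return hits >= threshold
```

For instance, `is_subjective("The games in iPhone 4s are pretty funny!")` matches "pretty" and "funny" and therefore meets the threshold of two.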

Corpus-based tools

• Use corpora annotated for subjectivity and/or sentiment

• Train machine learning algorithms:

• Naïve Bayes

• Decision trees

• SVM

• …

• Learn to automatically annotate new text
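To make the corpus-based approach concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing, written with the standard library only. The four training sentences are an invented toy corpus; in practice a library implementation (for example in NLTK) and a properly annotated corpus would be used. The class and function names are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            n_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy corpus annotated for sentiment.
train = [
    ("I love this phone, great screen", "positive"),
    ("funny games and a good camera", "positive"),
    ("terrible battery, I hate it", "negative"),
    ("bad screen and boring games", "negative"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
```

Once trained, `model.predict(...)` annotates new text with the most probable class, which is the "learn to automatically annotate" step.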

Page 12: Data Acquisition for Sentiment Analysis

Sentiment Analysis : Levels

• Document level – e.g., product/movie review

• Sentence level – e.g., news sentence

• Expression level – e.g., word/phrase

Page 13: Data Acquisition for Sentiment Analysis

Sentiment Analysis : Holder detection

Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns

International officers believe that the EU will prevail. International officers said US officials want the EU to prevail.

• View source identification as an information extraction task and tackle the problem using sequence tagging and pattern matching techniques simultaneously

• Linear-chain CRF model to identify opinion sources

• Patterns incorporated as features

Page 14: Data Acquisition for Sentiment Analysis

Sentiment Analysis : Twitter

Page 15: Data Acquisition for Sentiment Analysis

Sentiment Analysis : Twitter

1. Tweet normalization – a simple rule-based model, e.g. “gooood” to “good”, “luve” to “love”

2. POS tagging – OpenNLP POS tagger

3. Word stemming – a word stem mapping table (about 20,000 entries)

4. Syntactic parsing – a Maximum Spanning Tree dependency parser
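Steps 1 and 3 of this pipeline can be sketched with the standard library. The two small tables below are hypothetical stand-ins for the rule set and the 20,000-entry stem table, and the rule "collapse three or more repeated letters to two" is an assumed normalization rule, not necessarily the one used in the original system. POS tagging and dependency parsing need external tools and are left out.

```python
import re

# Hypothetical tables; a real stem table would hold about 20,000 entries.
SPELLING_FIXES = {"luve": "love", "luv": "love"}
STEM_TABLE = {"games": "game", "loved": "love", "loving": "love"}


def normalize_token(token):
    """Rule-based normalization: squeeze elongated letters, then fix known spellings."""
    token = token.lower()
    # "gooood" -> "good": any letter repeated 3+ times is reduced to 2
    token = re.sub(r"(.)\1{2,}", r"\1\1", token)
    return SPELLING_FIXES.get(token, token)


def stem(token):
    """Look the token up in the stem mapping table; unknown words are kept as-is."""
    return STEM_TABLE.get(token, token)


def preprocess(tweet):
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return [stem(normalize_token(t)) for t in tokens]
```

The squeeze rule is deliberately crude: "sooo" becomes "soo", not "so", so a lookup table or dictionary check is still needed for such cases.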

Page 16: Data Acquisition for Sentiment Analysis

Crawling scenario : Definition

[Figure: a scenario (Scenario x) and its instances (Instance 1, Instance 2, … Instance n). Labels: Selected URLs, Configuration parameters, Name, Key words.]

• Scenario: 1 -> n: Category

• Theme: n -> n: Scenario

• Scenario: 1 -> n: Instance

• The scenario defines the type of crawl we want to run. It is tied to the notion of instance, which is considered a specific configuration of a scenario.
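The scenario and instance notions could be modelled along these lines. This is a sketch under several assumptions: the attribute placement (name, key words, selected URLs and configuration parameters on the instance) is inferred from the diagram labels, the cardinalities are read literally from the list above, and every value in the example (scenario name, URL, `depth` parameter) is made up. In the actual Django application these would more likely be model classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    """A specific configuration of a scenario (hypothetical layout)."""
    name: str
    keywords: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)             # selected URLs
    parameters: Dict[str, object] = field(default_factory=dict)


@dataclass
class Scenario:
    """Defines the type of crawl to run."""
    name: str
    categories: List[str] = field(default_factory=list)       # Scenario 1 -> n Category
    themes: List[str] = field(default_factory=list)           # Theme n -> n Scenario
    instances: List[Instance] = field(default_factory=list)   # Scenario 1 -> n Instance

    def add_instance(self, instance: Instance) -> Instance:
        self.instances.append(instance)
        return instance


# Made-up example values.
scenario = Scenario(name="forum crawl", themes=["smartphones"])
scenario.add_instance(Instance(
    name="instance 1",
    keywords=["iPhone 4s"],
    urls=["https://example.com/forum"],
    parameters={"depth": 2},
))
scenario.add_instance(Instance(name="instance 2", keywords=["games"]))
```

Separating the scenario from its instances means the same crawl type can be rerun with different URLs and key words without redefining it.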

URL management module

Configuration parameter management module

We should look at the GUI currently being developed for Nutch and draw on it for the management of parameters and URLs.

Theme

Category