Data Acquisition for Sentiment Analysis


Quick start for a classic sentiment analysis project

Text of Data Acquisition for Sentiment Analysis

1. DASA Project
Data Acquisition for Sentiment Analysis
Ali Belcaid, AB Advisory & Consulting
High-level architecture and components overview
March 2013

2. Objectives
  • Streamline and facilitate the process of unstructured data acquisition
  • Create and manage corpora for contextual opinions and sentiments
  • Detect trends based on contextual reviews, comments, and discussions
  • Run and train models for sentiment or opinion analysis
  • Provide figures, results, and graphs as outputs

3. Software components
  • Python: programming language
  • Django: web application container
  • Scrapy: web crawler
  • Libraries: Twitter
  • MySQL / MongoDB / HBase: for the time being, no definitive choice has been made, but the final solution could be a mix of different databases depending on the nature of the use.
  • R Project: R will be used whenever specific text mining libraries are missing in Python or when it becomes easier to use R instead of Python. In that case, the R scripts will be encapsulated in Python programs.
  • Hadoop: for massive storage we will use Hadoop. The architecture is not yet defined. It is used for raw data storage.

4. Simplified Solution Architecture
[Diagram] Web Interface (Django); Crawl Engine & API (Scrapy); Text Mining Engine (NLTK, TM R Project); Pre-processing & Corpora; Output results; Configuration; Crawl Content. Callouts 1 to 5 mark the components described on the next slide.

5. Architecture components
  • (1) Data sources: access will be managed via APIs or crawls. Sources are all those related to social media (blogs, forums, advisors, the social web) and, in general, any medium where sentiments or opinions are expressed.
  • (2) Web interface to interact with the system: manage inputs, configurations, and outputs.
  • (3) There will be a mix of Scrapy (the crawler) and Python scripts that use APIs. Basically, the engine will be used to gather all data sources and store them for further processing (pre-processing and analysis).
  • (4) There will be a mix of Scrapy (the crawler) and Python scripts that use APIs. Basically, the engine will be used to gather all data sources and store them for further processing.
  • (5) The target database solution is not yet selected. The objective is to store all the relevant content, whether it is raw data, configuration items, or output results.
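
Because the target database is not yet selected, one way to keep the crawl engine independent of that choice is to hide storage behind a small interface. The following is a minimal sketch under that assumption, not the project's actual design: `RawStore`, `InMemoryStore`, `gather`, and the source names are all hypothetical, and the lambdas merely stand in for Scrapy crawls or API scripts.

```python
from typing import Callable, Dict, Iterable, List, Protocol


class RawStore(Protocol):
    """Hypothetical storage interface; a MySQL, MongoDB, or HBase
    backend could implement it once the database choice is made."""

    def save(self, source: str, document: str) -> None: ...


class InMemoryStore:
    """Stand-in backend that keeps raw documents in a dict, grouped by source."""

    def __init__(self) -> None:
        self.documents: Dict[str, List[str]] = {}

    def save(self, source: str, document: str) -> None:
        self.documents.setdefault(source, []).append(document)


# A fetcher is any callable returning raw documents (crawl or API call).
Fetcher = Callable[[], Iterable[str]]


def gather(sources: Dict[str, Fetcher], store: RawStore) -> int:
    """Run every fetcher, store its raw output, and return the document count."""
    count = 0
    for name, fetch in sources.items():
        for document in fetch():
            store.save(name, document)
            count += 1
    return count


# Usage with two fake sources (assumed names, for illustration only).
store = InMemoryStore()
stored = gather(
    {
        "forum_crawl": lambda: ["post A", "post B"],
        "twitter_api": lambda: ["tweet C"],
    },
    store,
)
```

Swapping `InMemoryStore` for a real backend would leave `gather` unchanged, which fits the idea that the final solution could mix several databases.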
6. Characteristics of Sentiment Analysis
Sentiment = Holder + Polarity + Target + Auxiliary
  • Holder: who expresses the sentiment
  • Target: what/whom the sentiment is expressed toward
  • Polarity: the nature of the sentiment (e.g., positive or negative)
Example: "The games in iPhone 4s are pretty funny!"
  • Annotations shown on the slide: Feature/Aspect, Target, Polarity: Positive
  • Holder = the user/reviewer
Auxiliary
  • Strength: differentiate the intensity
  • Confidence: measure the reliability of the sentiment
  • Summary: explain the reason inducing the sentiment
  • Time

7. Basic Tasks
  • Holder detection: find who expresses the sentiment
  • Target recognition: find whom/what the sentiment is expressed towards
  • Sentiment (polarity) classification: positive, negative, neutral
  • Opinion summarization
  • Opinion spam detection

8. Subjectivity versus Sentiment
  • Sentiment analysis is also known as opinion mining.
  • It attempts to identify the opinion/sentiment that a person may hold towards an object.
  • It is a finer-grained analysis compared to subjectivity analysis.

9. Lexicon-Based Sentiment Classification
Basic idea: use the dominant polarity of the opinion words in the sentence to determine its polarity. If positive/negative opinion prevails, the opinion sentence is regarded as positive/negative.
  • Lexicon + counting
  • Lexicon + grammar rule + inference method
Example lexicon: [figure not reproduced in this transcript]
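
The "lexicon + counting" idea above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the toy `LEXICON` is hypothetical (the slide's example lexicon is not available), and returning "neutral" on a tie is an assumption.

```python
import re

# Hypothetical toy lexicon: +1 for positive opinion words, -1 for negative ones.
LEXICON = {
    "good": 1, "great": 1, "funny": 1, "love": 1,
    "bad": -1, "poor": -1, "boring": -1, "hate": -1,
}


def classify(sentence, lexicon=LEXICON):
    """Return the dominant polarity of the opinion words in a sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    positives = sum(1 for w in words if lexicon.get(w, 0) > 0)
    negatives = sum(1 for w in words if lexicon.get(w, 0) < 0)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # assumption: a tie or no opinion words counts as neutral


example = classify("The games in iPhone 4s are pretty funny!")
```

Plain counting ignores negation ("not good") and intensity, which is presumably where the "lexicon + grammar rule + inference method" variant comes in.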
10. Sentiment Analysis Tasks
Level: Document
  • Task: sentiment classification of reviews
  • Classes: positive, negative, and neutral
  • Assumption: each document (or review) focuses on a single object (not true in many discussion posts) and contains opinion from a single opinion holder.
Level: Sentence
  • Task 1: identifying subjective/opinionated sentences. Classes: objective and subjective (opinionated)
  • Task 2: sentiment classification of sentences. Classes: positive, negative, and neutral
  • Assumption: a sentence contains only one opinion; not true in many cases. Then we can also consider clauses or phrases.
Level: Feature
  • Task 1: identify and extract object features that have been commented on by an opinion holder (e.g., a reviewer).
  • Task 2: determine whether the opinions on the features are positive, negative, or neutral.
  • Task 3: group feature synonyms.
  • Produce a feature-based opinion summary of multiple reviews.

11. Some tools
Lexicon-based tools
  • Use sentiment and subjectivity lexicons
  • Rule-based classifier: a sentence is subjective if it has at least two words in the lexicon; a sentence is objective otherwise
Corpus-based tools
  • Use corpora annotated for subjectivity and/or sentiment
  • Train machine learning algorithms: Naïve Bayes, decision trees, SVM
  • Learn to automatically annotate new text

12. Sentiment Analysis: Levels
  • Document level, e.g., product/movie review
  • Sentence level, e.g., news sentence
  • Expression level, e.g., word/phrase

13. Sentiment Analysis: Holder detection
"Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns"
  • Example: International officers believe that the EU will prevail.
  • Example: International officers said US officials want the EU to prevail.
  • View source identification as an information extraction task and tackle the problem using sequence tagging and pattern matching techniques simultaneously
  • Linear-chain CRF model to identify opinion sources
  • Patterns incorporated as features

14. Sentiment Analysis: Twitter
15. Sentiment Analysis: Twitter
  • Step 1, tweet normalization: a simple rule-based model ("gooood" to "good", "luve" to "love")
  • Step 2, POS tagging: OpenNLP POS tagger
  • Step 3, word stemming: a word stem mapping table (about 20,000 entries)
  • Step 4, syntactic parsing: a Maximum Spanning Tree dependency parser

16. Crawling scenario: Definition
[Diagram] Scenario x; Instance 1, Instance 2, ..., Instance n; selected URLs; configuration parameters; Name; Key words; Theme; Category.
  • Scenario : 1 -> n : Category
  • Theme : n -> n : Scenario
  • Scenario : 1 -> n : Instance
The scenario defines the type of crawl we want to run. It is tied to the notion of instance, which is considered a specific configuration of a scenario.
  • URL management module
  • Configuration parameter management module
We will need to look at the Nutch GUI currently under development and draw on it for managing parameters and URLs.
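
The scenario/instance relations above can be sketched as a small data model. This is a sketch under stated assumptions, not the project's schema: the class and field names are hypothetical, the cardinalities are read literally from the slide (a scenario has several categories and several instances, and themes and scenarios are many-to-many), and attaching the selected URLs and configuration parameters to the instance is one interpretation of the diagram. In the Django application these would presumably become model classes; plain dataclasses are used here to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    """One specific configuration of a scenario (hypothetical fields)."""
    label: str
    selected_urls: List[str] = field(default_factory=list)
    config: Dict[str, str] = field(default_factory=dict)


@dataclass
class Scenario:
    """Defines the type of crawl to run."""
    name: str
    keywords: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)    # Scenario 1 -> n Category
    themes: List[str] = field(default_factory=list)        # Theme n -> n Scenario
    instances: List[Instance] = field(default_factory=list)  # Scenario 1 -> n Instance

    def add_instance(self, label, urls, **config):
        """Attach a new instance with its own URLs and parameters."""
        instance = Instance(label, list(urls), {k: str(v) for k, v in config.items()})
        self.instances.append(instance)
        return instance


def scenarios_by_theme(scenarios):
    """Index scenario names by theme (the many-to-many side)."""
    index: Dict[str, List[str]] = {}
    for scenario in scenarios:
        for theme in scenario.themes:
            index.setdefault(theme, []).append(scenario.name)
    return index


# Usage with made-up values.
scenario = Scenario(
    name="smartphone reviews",
    keywords=["iPhone 4s", "games"],
    categories=["forums"],
    themes=["mobile"],
)
scenario.add_instance("instance 1", ["http://example.com/forum"], depth=2)
theme_index = scenarios_by_theme([scenario])
```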