Sentiment Analysis
Yasen Kiprov
PhD Student, Intelligent Systems
R&D Engineer, NLP
AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● Engineering approach
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
Natural Language Processing
● Enables interaction between computers and humans through natural languages
or “The branch of information science that deals with natural language information”
● Natural language understanding - enabling computers to derive meaning from human input
● Natural language generation
(Not neuro linguistic programming, still some magic applies)
NLP is everywhere
Google translate
Google ads
Google search
Siri / Question Answering
Chat bots
Spam generation / spam filtering
Gene and protein detection
Surveillance / marketing
Text Classification
● Automatically assign a piece of text to one or more classes.
● History: Guess the author based on text specifics and author style
1901: “One author prefers 'em' as a short for 'them' - let's use this as a feature!”
1970s: Who wrote “The Federalist Papers”?
Text Classification
● Spam or not spam
● News analysis: politics, sports, business
● Google ads verticals: 26 root categories, 2200 subcategories
● Terrorist or not
Yes, they read your Facebook, and yes, they know...
Also Text Classification
● Detect truth / lie / sarcasm / joke
● Determine medical condition from hospital records, patient description
● Guess stock prices
● “How will this press release affect the company's share price?”
● Sentiment analysis
Sentiment Analysis
● Determining writer's attitude
● Overall document: positive / negative / neutral
“We totally enjoyed our stay there!”
● Towards a target:
“Battery sucks, bends really well though”
● Detecting emotions: sad, happy, angry, excited
● Scales: number of stars / -10 to +10 / percentage
● Subjective vs Objective
Classification for engineers
● Why bother with AI, keep it simple:
IF text contains “ em ” AND NOT text contains “ them ” THEN author is X
ELSE author is Y
● But what if...
Classification for engineers
● If author X decided to use “them” once?
Let's try a list of words that only author X uses:
IF text contains a word from listX THEN author is X
ELSE try other rules
Find all the features !!!
Classification for engineers
● Build a super smart system of if-else statements to correctly classify each document
● Solving the problem algorithmically
● An “expert system”
● Still used in practice for many applications
● Twitter “sentiment analysis” with a single rule: if text contains :) or :(
When to do engineering
● For very narrow tasks
● Determine if text is a URL or email address
● For a very specific domain
● “If text contains the name of any US president, it's about legislation”
● To create a proof-of-concept
● Twitter “sentiment analysis” with a single rule: if text contains :) or :(
● When it's hard to get enough data (explained later)
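The proof-of-concept rule above fits in a few lines of Python. A minimal sketch; the function name and the returned labels are made up for illustration:

```python
def emoticon_sentiment(text):
    # Single-rule "sentiment analysis": decide only from emoticons
    if ":)" in text:
        return "positive"
    if ":(" in text:
        return "negative"
    return "unknown"   # the rule abstains on everything else

print(emoticon_sentiment("Loved the new phone :)"))   # positive
print(emoticon_sentiment("Battery died again :("))    # negative
print(emoticon_sentiment("It arrived today"))         # unknown
```

This is exactly the kind of narrow, high-precision rule that works as a demo and breaks the moment the input drifts outside the rule's coverage.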
AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● How it's done (by engineers)
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
Supervised learning - Regression
“In statistics, regression analysis is a statistical process for estimating the relationships among variables.”
● Create a hypothesis function based on the blue dots
● When a new X appears, calculate Y
The graph: X values are features, Y values are target values.
Linear Regression Example
● Let X be temperature
● Let Y be chance of rain
Create a function that predicts chance of rain, given temperature
(In reality X is a vector with many feature values)
Hypothesis (function of a line): h_θ(x) = θ₀ + θ₁x
Parameters: θ₀, θ₁
Cost Function: J(θ₀, θ₁) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Goal: minimize J(θ₀, θ₁) over θ₀, θ₁
Linear Regression Maths
Step: repeat until convergence: θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ
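The hypothesis, cost, and update step above can be sketched in plain Python. The data and the learning rate below are invented for illustration:

```python
# Minimal batch gradient descent for one-feature linear regression.
# Hypothesis: h(x) = theta0 + theta1 * x
xs = [1.0, 2.0, 3.0, 4.0]    # e.g. temperatures (made-up data)
ys = [0.9, 2.1, 2.9, 4.2]    # e.g. chance-of-rain targets (made-up data)

theta0, theta1 = 0.0, 0.0
alpha = 0.05                 # learning rate (assumed value)
m = len(xs)

for _ in range(2000):
    # Partial derivatives of the cost J over the whole training set
    grad0 = sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys)) / m
    grad1 = sum(((theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys)) / m
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)  # converges toward the least-squares fit (≈ -0.15, ≈ 1.07)
```

In practice X is a vector of many features, so the two scalar gradients become one gradient vector, but the update rule is the same.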
Supervised Learning - Classification
“Identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.”
Given a set of training instances, predict a discrete class for new ones.
The graph: x1 and x2 are features, dot color is the target class.
Classification Example
● Let X1 be temperature
● Let X2 be humidity
Create a function that predicts rain or no rain.
(In reality X is a vector with many feature values)
2D Example
● Let X be humidity
● Let Y = 0 for no rain
● Let Y = 1 for rain
Linear hypothesis function doesn't really make sense now.
Logistic function can approximate better.
Logistic Regression
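The logistic hypothesis squashes θᵀx through the sigmoid into a probability in (0, 1). A minimal sketch for the humidity example; the parameter values are made up for illustration:

```python
import math

def sigmoid(z):
    # Logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters for the humidity -> rain example
theta0, theta1 = -14.0, 0.2   # assumed values, not fitted to real data

def predict_rain_probability(humidity):
    # h(x) = sigmoid(theta0 + theta1 * x)
    return sigmoid(theta0 + theta1 * humidity)

print(predict_rain_probability(90))  # ≈ 0.98, high humidity -> near 1
print(predict_rain_probability(40))  # ≈ 0.002, low humidity -> near 0
```

Thresholding the probability at 0.5 turns the regression into a rain / no-rain classifier.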
AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● How it's done (by engineers)
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
Agenda Explained
● Until now:
● What is text classification
● What is supervised learning (classification)
● Up next:
● How to apply supervised learning to text?
Statistical Sentiment Analysis
● Document: A piece of text
● Corpus: Set of documents
● Target: Y, positive/negative, emotion, percentage
● Training corpus: Set of documents for which we know Y
● What is X?
● How to convert a document to a (real-valued) vector
● Building training corpus
● Find “enough” data
Defining Features
● Each word: one-hot vector
● I = [0, 0, 0, 1, 0, 0, 0, …, 0]
● like = [1, 0, 0, 0, 0, 0, 0, …, 0]
● cookies = [0, 0, 0, 0, 0, 0, 1, …, 0]
● Number of dimensions = size of vocabulary
● Document: bag of words
● Order of words is lost
● Count of words can be added
● Term frequency / inverse document frequency
"I like cookies" = [1, 0, 0, 1, 0, 0, 1, …, 0]
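The bag-of-words construction can be sketched in Python. The toy corpus and the dimension ordering (a sorted vocabulary) are my own choices, so the exact indices differ from the slide:

```python
# Build bag-of-words vectors over a toy corpus (illustrative data).
corpus = ["I like cookies", "I like tea", "cookies are good"]

# Vocabulary: every distinct word gets one dimension
vocab = sorted({word for doc in corpus for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def bag_of_words(doc):
    # Word order is lost; only counts per vocabulary position remain
    vec = [0] * len(vocab)
    for word in doc.split():
        vec[index[word]] += 1
    return vec

print(vocab)                        # ['I', 'are', 'cookies', 'good', 'like', 'tea']
print(bag_of_words("I like cookies"))  # [1, 0, 1, 0, 1, 0]
```

With a real corpus the vocabulary runs into the hundreds of thousands, which is why these vectors are stored sparsely; tf-idf replaces the raw counts with frequency-weighted values.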
Feature Engineering
● Ngrams (as one-hot)
● I, like, cookies - unigrams
● “I like” = [0, 0, 0, 0, 1, 0, …, 0] - bigrams
● “I like cookies” - trigrams
● Character n-grams:
● li, ik, ke, lik, ike
● Dictionaries:
● Great value for sentiment analysis
● Very good for domain specific text
If document contains any of: {love, like, good, cool}
add this one: [0, 0, 1, 0, …, 0]
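Word and character n-gram extraction can be sketched as:

```python
def ngrams(tokens, n):
    # Consecutive word n-grams from a token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    # Consecutive character n-grams inside a single word
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = "I like cookies".split()
print(ngrams(tokens, 2))        # ['I like', 'like cookies']
print(char_ngrams("like", 2))   # ['li', 'ik', 'ke']
print(char_ngrams("like", 3))   # ['lik', 'ike']
```

Each n-gram then gets its own one-hot dimension, exactly like single words, which is one reason the feature space grows so quickly.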
Feature Engineering
● Simple features
● Document Length
● Emoticons
● elooongated words
● ALL-CAPS
● Stopwords
● Through other classification methods:
● Parts of speech
● Negation contexts: “I don't like cookies”
● Named Entities
● Approximate dimensions of X: 100k – 10m
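A few of the simple surface features listed above, as a sketch; the feature names and the regular expressions are my own:

```python
import re

def simple_features(text):
    # Surface features commonly used in sentiment analysis
    return {
        "length": len(text),
        "has_positive_emoticon": ":)" in text,
        "has_negative_emoticon": ":(" in text,
        # words containing a character repeated 3+ times, e.g. "soooo"
        "elongated_words": len(re.findall(r"\b\w*(\w)\1{2,}\w*\b", text)),
        # words written entirely in capitals, e.g. "LOVE"
        "all_caps_words": len(re.findall(r"\b[A-Z]{2,}\b", text)),
    }

print(simple_features("I LOVE this soooo much :)"))
```

These cheap features punch above their weight on user-generated text, where shouting and elongation carry real sentiment signal.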
Work Process
● Assemble training corpus
● Separate test corpus
● Invent new features
● Generate model (supervised learning)
● Test performance
● Repeat
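The work process above, with the train/test separation, can be sketched with a toy corpus and a deliberately naive "positive-words-only" model; all names and data here are invented for illustration:

```python
import random

# Toy labeled corpus: (document, label), 1 = positive, 0 = negative
corpus = [("great movie", 1), ("loved it", 1), ("awful plot", 0),
          ("terrible acting", 0), ("really good film", 1), ("so bad", 0)]

random.seed(0)
random.shuffle(corpus)
split = len(corpus) * 2 // 3
train, test = corpus[:split], corpus[split:]   # never test on training data

# "Model": words that appear only in positive training documents
pos_words = {w for doc, y in train if y == 1 for w in doc.split()}
neg_words = {w for doc, y in train if y == 0 for w in doc.split()}
pos_only = pos_words - neg_words

def predict(doc):
    return 1 if any(w in pos_only for w in doc.split()) else 0

accuracy = sum(predict(doc) == y for doc, y in test) / len(test)
print(accuracy)
```

A real workflow swaps the word-list "model" for a trained classifier and a much larger corpus, but the loop is the same: hold data out, fit, measure, invent features, repeat.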
Tips & Tricks
● Performance is usually measured as precision / recall / accuracy / f-measure
● Simple Machine Learning with tons of features
● Even a linear classifier works
● Marketing
● Everyone uses a different corpus (can't compare accuracy)
● Showing only what you're sure about
● Generalizing: “overall, 70% of your customers like you”
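Precision, recall, and f-measure for a binary classifier can be computed directly from the confusion counts:

```python
def precision_recall_f1(y_true, y_pred):
    # Binary metrics, treating 1 as the positive class
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1, 0]   # made-up gold labels
y_pred = [1, 0, 0, 1, 1, 0]   # made-up predictions
print(precision_recall_f1(y_true, y_pred))  # precision = recall = f1 = 2/3 here
```

Because accuracy alone is misleading on skewed classes, published sentiment systems usually report f-measure, which is why results on different corpora still cannot be compared directly.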
AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● How it's done (by engineers)
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
A.I. - Why is it not working?
“Algorithmically solvable: A decision problem that can be solved by an algorithm that halts on all inputs in a finite number of steps.”
“Unsolvable problem: A problem that cannot be solved for all cases by any algorithm whatsoever.”
● Artificial Intelligence: Develop intelligent systems, deal with real world problems. It works... kind of...
- “Siri, will you marry me?”
- “My End User License Agreement does not cover marriage. My apologies.”
Challenges
● Annotation Guidelines
● Inter-annotator agreement
● SemEval
● Sentiment analysis corpus (~14k tweets)
● For 40% of tweets annotators didn't agree
“I don't know half of you half as well as I should like; and I like less than half of you half as well as you deserve.”
Bilbo Baggins
Still not convinced?
● Context issues
● Narrowing the domain helps
● “beer is cool”, “soup is cool”
● “No babies yet!” - condoms / fertility drugs
● “Obama goes full Bush on Syria”
● User generated content SUCKS!
● “Polynesian sauce from chik fila a be so bomb”
● Common sense
“I tried the banana slicer and found it unacceptable. […] The slicer is curved from left to right. All of my bananas are bent the other way.”
AGENDA
● Introduction to NLP
● Text Classification & Sentiment Analysis
● How it's done (by engineers)
● Supervised Machine Learning
● Linear & Logistic Regression
● Sentiment analysis for statisticians
● Why is it not working (Discussion)
● Bonus track – word embeddings
Word representations
● One-hot is sparse and meaningless
● N-dimensional vector for each word
● “Ubuntu” close to “Debian”
● “king” to “queen” = “man” to “woman”
● Based solely on word co-occurrence
n = 50 to 1000
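The “king is to queen as man is to woman” property can be illustrated with cosine similarity over tiny, hand-made vectors. Real embeddings are learned from co-occurrence and have 50-1000 dimensions; the numbers here are invented so that the analogy works exactly:

```python
import math

# Hypothetical 3-dimensional embeddings (made up for illustration)
vec = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.1],
    "man":   [0.1, 0.8, 0.0],
    "woman": [0.1, 0.2, 0.0],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Analogy arithmetic: king - man + woman should land near queen
analogy = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
print(cosine(analogy, vec["queen"]))   # ≈ 1.0
```

With learned embeddings the nearest neighbour of the analogy vector is searched over the whole vocabulary, and “queen” wins by cosine similarity rather than matching exactly.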
Deep Learning
● Artificial Neural Networks
● Input - word embeddings
● Output – target class
● Complex layer structure
● No feature engineering
Tools
● NLTK – NLP in python
● GATE – NLP in java + GUI
● Stanford CoreNLP – NLP in java + deep neural networks
● AlchemyAPI – commercial API for NLP (free demo)
● MetaMind – enterprise sentiment analysis and computer vision (deep neural networks)
● WolframAlpha – Smart question answering (knows maths)
Thank you!