
Page 1: Natural Language Processing

NATURAL LANGUAGE PROCESSING

a developer’s perspective to

- Vanya Seth & Dharmendra Prasad

Page 2: Natural Language Processing

WHAT IS NATURAL LANGUAGE PROCESSING?

NLP devises ways to interpret human languages and extract meaning from them, so that the result can be used by other humans (translators), machines, or software systems to achieve larger objectives.

Why do we need this?
- Not everyone speaks the same language (MT)
- Not all languages are written the same way (MT)
- We would like to automate tasks which are tedious (IE)
- We want more accuracy and fewer human errors (Automation)

Page 3: Natural Language Processing

FEW INTERESTING PROBLEMS

1. Spam Detection
2. Spelling Correction
3. Parts of Speech Tagging
4. Named Entity Recognition

1. Coreference Resolution
2. Information Extraction
3. Sentiment Analysis
4. Machine Translation

1. Answering Questions
2. Summarization
3. Paraphrasing
4. Dialog

Page 4: Natural Language Processing

WHY IS NLP A COMPLEX PROBLEM

Ambiguity is pervasive.
Headline: Republicans Grill IRS Chief Over Lost Emails

Possible meanings:
- Republicans harshly question the chief about the emails
- Republicans cook the chief, using emails as the fuel

New ways of writing:
- Twitter hashtags
- ALL CAPITALS
- Abused notations (U for You, FB for Facebook, @ for At)
- New words (Retweet, Unfriend, etc.)
- Emoticons (and many others)

Page 5: Natural Language Processing

HOW DO WE DEAL WITH THIS PROBLEM?

We learn, we remember, and we conquer. What tools do we use?

- We require knowledge about the language
- We require knowledge about the world
- We need a way to combine knowledge sources

How do we do this? Probabilistic models built upon language data for inferring language properties:
- P("fragrant" -> "rose") is high
- P("awful" -> "love") is low

Page 6: Natural Language Processing

TEXT PROCESSING – THE BASE OF NLP

What we mostly do in NLP is "Text Processing": normalizing the text in one way or another.

Word Tokenization, Text Search, Sentence Segmentation, Pattern Recognition, Disambiguating Words etc…

Text processing is important. Some useful tools:
- Regular Expressions (http://regexpal.com)
- Word Tokenization (http://sentiment.christopherpotts.net/tokenizing.html)

Tokenization lets us answer questions such as: how many words are there in the text? What is the size of the vocabulary? And so on.
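As an illustration (not the tokenizer from the linked page), here is a minimal Python sketch that tokenizes with a regular expression and answers both questions; the regex pattern and sample text are assumptions for demonstration only.

```python
import re
from collections import Counter

text = "The weather is cold. How are you? The weather was hot."

# A deliberately simple tokenizer: lower-case the text and pull out
# alphabetic word forms. Real tokenizers (see the links above) also
# handle hashtags, emoticons, URLs, and so on.
tokens = re.findall(r"[a-z']+", text.lower())

counts = Counter(tokens)
print("number of tokens:", len(tokens))   # how many words are in the text
print("vocabulary size:", len(counts))    # how many distinct word types
print("most common:", counts.most_common(3))
```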

Page 7: Natural Language Processing

LANGUAGE MODELING

probabilistic approach

Page 8: Natural Language Processing

WHAT ARE WE TRYING TO SOLVE?

Sentence completion

How are _____?
The weather is ______.

Phrase Rearrangement

little had Mary a lamb
happy everyone during is holidays

How can we solve these problems?

Page 9: Natural Language Processing

STEP BY STEP - LANGUAGE MODELING

Sentence completion – Probability(upcoming word)

Phrase rearrangement – Probability(occurrence of a sentence)

For both of these, we need a machinery, which is called “language model”

In simple words, a language model is a black box that has prior knowledge about the language(s); for any question, we consult the LM for an appropriate answer.

Formally – A language model is a model which computes either P(upcoming word) or P(occurrence of a sentence)

Page 10: Natural Language Processing

WHAT DOES THE LANGUAGE MODEL CONTAIN?

This ideally depends on what kind of answer you want from the model.

An answer to the question:

- P(upcoming word): a list of (phrase, probability) pairs, or the phrase with the highest probability (or any word, depending on the complexity of the algorithm)

- P(phrase rearrangement): a list of (sentence, probability) pairs, or the sentence with the highest probability (or any sentence, depending on the complexity of the algorithm)

How do we get these probabilities?

Page 11: Natural Language Processing

HOW TO CALCULATE THE PROBABILITIES

Goal: Calculating the Probability of a sequence of words

P("The tiger is a fierce animal") =P(The) * P(tiger | The) * P(is | The tiger) * P(a | The tiger

is) * * P(fierce | The tiger is a) * P(animal | The tiger is a

fierce)

This is the joint probability of the sequence of words by the chain rule of probability.

P(animal | The tiger is a fierce) = Count(The tiger is a fierce animal) Count(The tiger is a fierce)

Page 12: Natural Language Processing

A PRACTICAL APPROACH : MARKOV’S ASSUMPTION

It would be mostly sufficient to assume that P(animal | the tiger is a fierce) ≈ P(animal | fierce) or P(animal | a fierce).

Hence, P(the tiger is a fierce animal) ≈ P(the) * P(tiger | the) * P(is | the tiger) * P(a | tiger is) * P(fierce | is a) * P(animal | a fierce)
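To make the bigram version of this concrete, here is a minimal Python sketch that estimates the conditional probabilities from raw counts; the toy corpus is an assumption for illustration only, and no smoothing is applied yet.

```python
from collections import Counter

# Toy corpus (an illustrative assumption); <s> marks the sentence start.
corpus = [
    "<s> the tiger is a fierce animal".split(),
    "<s> the tiger is a big cat".split(),
    "<s> a fierce animal is dangerous".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def p_bigram(word, prev):
    """P(word | prev) estimated from raw counts (no smoothing yet)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    """Bigram (Markov) approximation of the sentence probability."""
    words = ["<s>"] + words
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(word, prev)
    return p

print(p_sentence("the tiger is a fierce animal".split()))
```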

Page 13: Natural Language Processing

EXERCISE 1 : UNIGRAMS

Vocabulary: Cold, You, They, How, Are, The, Weather, Hot, Is, Was

Strategy 1: Assign probabilities to all the words depending on the number of times each occurred in the corpus.

How are Cold? How are You? How are They? How are How?

All these sentences are equally likely.

This model is called the unigram model.

Page 14: Natural Language Processing

EXERCISE 1 : IMPROVING WITH BIGRAMS

Strategy 2: Assign probabilities based on some context. Look for one previous word.

is cold, is hot, is gone
are gone, are you, are they
was cold, was hot
they are, the weather, you are

Complete the sentence : The weather is ________

Three possible options are:
- The weather is hot
- The weather is cold
- The weather is gone

This model is called the bigram model.

Page 15: Natural Language Processing

EXERCISE 1 : GETTING BETTER WITH TRIGRAMS

Strategy 3: Start looking at more than one previous word.

weather is cold, tea is hot, they are gone
how are you, where are they
water was cold, food was hot

Complete the sentence: The weather is ________
Possible option: The weather is cold.

Complete the sentence: How are ________?
Possible option: How are you?

This model is called the trigram model.

Page 16: Natural Language Processing

EXERCISE 1 : HOW FAR CAN WE DO THIS?

The computer I installed in the chemistry laboratory _______

Bigrams may result in :

The computer I installed in the chemistry laboratory equipment

Trigrams may result in :

The computer I installed in the chemistry laboratory apparatus

Long-distance dependencies: the right completion depends on a word far back in the sentence ("computer"), which bigrams and trigrams cannot see. We would need a huge corpus and much longer, common n-grams to capture this.

“The computer I installed in the chemistry laboratory crashed”

Page 17: Natural Language Processing

DEALING WITH UNSEEN NGRAMS

A 2000-word vocabulary has 2000 * 2000 = 4,000,000 possible bigrams.

A document with 16,000 tokens produces around 5,000 unique bigrams.

This means only 5,000 / 4,000,000 (0.125%) of the possible bigrams are ever seen, i.e. 99.875% of them never appear in the document.

This assigns zero probability to all the unseen bigrams; hence, if we calculate the probability of a sentence containing a new bigram, our model will return a value of zero.

To avoid this we use a technique called smoothing. The simplest of all is the Laplace’s Smoothing or the Add One Smoothing

Page 18: Natural Language Processing

SMOOTHING TECHNIQUES

Laplace's Smoothing: pretend that you saw everything one more time. This solves the problem of unseen bigrams (or n-grams).

Applying Laplace's smoothing, we calculate the probability as below:

Without smoothing: P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

Add-one smoothing: P(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + |V|)

Why add |V| to the denominator?

Seeing each word one more time means seeing every unique word one more time; the total number of unique words is |V|, so the denominator increases by |V|.
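A small Python sketch of the add-one estimate above; the counts and vocabulary size below are illustrative assumptions, not values from a real corpus.

```python
from collections import Counter

def p_bigram_laplace(word, prev, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed estimate of P(word | prev) =
    (count(prev, word) + 1) / (count(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# Tiny illustrative counts (an assumption, not a real corpus).
unigram_counts = Counter({"weather": 10, "is": 12})
bigram_counts = Counter({("weather", "is"): 8, ("is", "cold"): 3})
V = 2000  # vocabulary size

print(p_bigram_laplace("is", "weather", bigram_counts, unigram_counts, V))      # seen bigram
print(p_bigram_laplace("purple", "weather", bigram_counts, unigram_counts, V))  # unseen, yet non-zero
```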

Page 19: Natural Language Processing

IMPORTANT LINKS FOR THE SECTION

Online Available Corpus

SRI Language Model ToolKit : http://www.speech.sri.com/projects/srilm/

Google Books Ngram Viewer: https://books.google.com/ngrams

Google Ngram Corpus: http://googleresearch.blogspot.in/2006/08/all-our-n-gram-are-belong-to-you.html

Page 20: Natural Language Processing

SPELLING CORRECTION

Page 21: Natural Language Processing

SPELLING TASKS

Typing is a manual process, and not everyone types correctly. Spelling mistakes occur frequently. There are two spelling tasks in such scenarios:

- Error Detection
- Error Correction

Error correction can take several forms:
- Auto-correct: hte -> the
- Suggest one correction: theate -> theatre
- Suggest a list of corrections: leding -> leading, lending

Page 22: Natural Language Processing

TYPES OF SPELLING ERRORS

Non-word errors:
- opportnity -> opportunity
- graffe -> giraffe

Real-word errors:
- Typographical errors: In dog we trust -> god
- Cognitive errors (homophones): good buy -> bye, withdraw cache -> cash

Page 23: Natural Language Processing

FIXING NON WORD ERRORS

Non-word errors are very easy to correct.

Detection:
- Have a word list from the dictionary of the language
- Check the word against the dictionary; if the word is not present, it is an error

Correction:
- Find candidate words which are similar to the error
- Choose the best candidate based on any of the algorithms or models discussed next

Page 24: Natural Language Processing

FIXING REAL WORD ERRORS

Real-word errors are tough to fix.

Detection:
- Straightforward detection is not possible: the error is itself a real word, so checking it against the dictionary tells us nothing

Correction:
- Find candidate words which are similar (in pronunciation or spelling) to each word in the sentence
- Do this for all the words in the sentence
- Choose the best candidate based on any of the algorithms or models discussed next

Page 25: Natural Language Processing

FINDING CANDIDATES - MINIMUM EDIT DISTANCE

N M E       -> N A M E    (INSERT)
H T E       -> T H E      (TRANSPOSE)
A C R E S S -> A C R E S  (DELETE)
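The candidates above differ from the error by a single insert, transpose, or delete. A standard way to score such candidates is the Damerau-Levenshtein edit distance; the sketch below is a textbook dynamic-programming formulation (an assumption, not code from the slides).

```python
def edit_distance(source, target):
    """Damerau-Levenshtein distance with unit costs for insertion,
    deletion, substitution and adjacent transposition."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("nme", "name"))     # 1 (insertion)
print(edit_distance("hte", "the"))      # 1 (transposition)
print(edit_distance("acress", "acres")) # 1 (deletion)
```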

Page 26: Natural Language Processing

NOISY CHANNEL MODEL – BASIC

The noisy channel is a probabilistic model which represents the real world conditions.

We run a lot of guesses through the channel, and the one that best matches the noisy word is our correct word.

Page 27: Natural Language Processing

NOISY CHANNEL MODEL – BASIC CONT..

What are we trying to find?

For an observation x (the noisy word), we are looking for the word w (from the vocabulary) which maximizes P(w|x) under this noisy channel.

ŵ = argmax P(w|x),  w ∈ V
  = argmax P(x|w) * P(w) / P(x),  w ∈ V   (Bayes' rule)

P(x) is constant for all the w in the vocabulary, because x is the observation.

P(x|w) is called the channel model and P(w) is called the language model. 
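Putting the pieces together, here is a minimal sketch of the argmax over candidate corrections; the candidate list, channel model, and language model passed in below are placeholders for illustration, not real trained models.

```python
def correct(x, candidates, p_channel, p_word):
    """Noisy channel correction: w_hat = argmax_w P(x|w) * P(w).
    P(x) is the same for every candidate, so it can be dropped."""
    return max(candidates, key=lambda w: p_channel(x, w) * p_word(w))

# Hypothetical models for illustration only.
p_word = {"the": 0.05, "thee": 0.0001}.get      # a toy unigram language model
p_channel = lambda x, w: 0.0001                 # a flat channel model placeholder
print(correct("hte", ["the", "thee"], p_channel, p_word))  # -> 'the'
```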

Page 28: Natural Language Processing

NOISY CHANNEL MODEL – EXAMPLE

x = acress

Possible candidates:

Error (x) | Correction (w) | Correct Letter | Error Letter | Error Type
acress    | actress        | t              | -            | Deletion
acress    | cress          | -              | a            | Insertion
acress    | caress         | ca             | ac           | Transposition
acress    | access         | c              | r            | Substitution
acress    | across         | o              | e            | Substitution
acress    | acres          | -              | s            | Insertion
acress    | acres          | -              | s            | Insertion

Page 29: Natural Language Processing

NOISY CHANNEL MODEL – EXAMPLE CONT…

Words Taken from Corpus of Contemporary English (400,000,000 words)

Clearly the probability of across is the highest, so our model will suggest the correction 'across' under a unigram language model.

Word    | Frequency | P(word)
actress | 9448      | 0.00002362
cress   | 220       | 0.00000055
caress  | 686       | 0.00000171
access  | 35310     | 0.00008827
across  | 105559    | 0.00026389
acres   | 12874     | 0.00003218
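As a check, the P(word) column is simply frequency / 400,000,000; a short sketch reproducing it, with the frequencies copied from the table above:

```python
# Word frequencies from the table above; the corpus has 400,000,000 words.
frequencies = {"actress": 9448, "cress": 220, "caress": 686,
               "access": 35310, "across": 105559, "acres": 12874}
corpus_size = 400_000_000

for word, freq in frequencies.items():
    # Matches the P(word) column up to rounding; 'across' comes out highest.
    print(f"P({word}) = {freq / corpus_size:.8f}")
```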

Page 30: Natural Language Processing

NOISY CHANNEL MODEL – IT'S NOT YET OVER

We just talked about the probability of the words (the language model), which is one factor contributing to the correction.

The other factor which is the channel model, still needs to be consulted for a better correction task.

The channel model consults a tool called Confusion Matrix for finding out the likelihood of a type of error:

Types of Error: Insertion, Deletion, Transposition & Substitution

Each confusion matrix gives the probability of a given type of error.

Page 31: Natural Language Processing

CONFUSION MATRICES

Page 32: Natural Language Processing

NOISY CHANNEL MODEL – EXAMPLE CONT…

P(x|w) values are read off the confusion matrices; P(word) values are the unigram probabilities from the Corpus of Contemporary English table shown earlier.

With the channel model included, 'across' still gets the highest score (2.8), narrowly ahead of 'actress' (2.7), so the model again suggests the correction 'across'.

Word    | Correct letter | Error letter | x|w   | P(x|word) | P(word)    | Result (x 10^-9)
actress | t              | -            | c|ct  | 0.000117  | 0.00002362 | 2.7
cress   | -              | a            | a|#   | 0.0000014 | 0.00000055 | 0.00078
caress  | ca             | ac           | ac|ca | 0.0000016 | 0.00000171 | 0.0028
access  | c              | r            | r|c   | 0.0000002 | 0.00008827 | 0.019
across  | o              | e            | e|o   | 0.0000093 | 0.00026389 | 2.8
acres   | -              | s            | es|e  | 0.0000321 | 0.00003218 | 1.0
acres   | -              | s            | ss|s  | 0.0000342 | 0.00003218 | 1.0

Page 33: Natural Language Processing

IMPORTANT LINKS FOR THE SECTION

Wikipedia - Common English Spelling mistakes : https://en.wikipedia.org/wiki/Commonly_misspelled_English_words

Birkbeck Spelling Error Corpus : http://ota.ox.ac.uk/headers/0643.xml

Peter Norvig List of Error: http://norvig.com/ngrams/spell-errors.txt

Page 34: Natural Language Processing

TEXT CLASSIFICATION

Page 35: Natural Language Processing

SPAM DETECTION - CLASSIC CASE OF TEXT CLASSIFICATION

Page 36: Natural Language Processing

SPAM DETECTION - CLASSIC CASE OF TEXT CLASSIFICATION

Spam-like attributes:
- Undisclosed recipients
- Prize!!
- No name, lucky draw
- Suspicious URL

Page 37: Natural Language Processing

WHAT IS TEXT CLASSIFICATION

It is the process of classifying given documents into various classes. For example:

- Is an e-mail SPAM or NOT?
- Is a product review POSITIVE or NEGATIVE?

And so on…

We need a classification model, which works on the given inputs and produces the desired output.

Inputs:
- a document d
- a fixed set of classes C = {c1, c2, c3, ..., cn}

Desired output:

a predicted class c ∈ C to which the document belongs.

Page 38: Natural Language Processing

CLASSIFICATION METHODS

Hand Coded Rules

Based on combinations of words and features, e.g. a blacklisted sender, or words like Viagra, dollars, impress a girl.

Supervised Machine Learning
- Naïve Bayes
- Support Vector Machines
- Logistic Regression
- KNN (K Nearest Neighbors)

Page 39: Natural Language Processing

SENTIMENT ANALYSIS

What is a sentiment? It is an attitude: affectively colored beliefs or dispositions towards objects and persons, such as liking, loving, hating, valuing, desiring.

What is the task of Sentiment Analysis? Detecting the attitude: the holder of the attitude, the target of the attitude, and the type of the attitude.

Types of analysis:
- Simplest: assign a binary value to a sentence or document
- Slightly more complex: rate on a scale of 1-10
- Toughest: detect the target and the source

Page 40: Natural Language Processing

EXAMPLE OF SENTIMENTS FOR A PERFUME

1. The fragrance is just awesome, I love it

2. Keeps you going all day long, this is the best perfume

3. Seriously do you call it a perfume? It’s awful.

4. Thanks for this wonderful fragrance in the classy bottle. Great!!

5. I feel like being cheated after buying this piece of crap

6. If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut.

Attitude – awesome, best, classy, awful, crap, cheated, thanks
Target – perfume, bottle, fragrance
Holder – purchaser, user

We are basically extracting the opinion.

Page 41: Natural Language Processing

WHY IS SENTIMENT ANALYSIS IMPORTANT

1. We prefer to watch movies after reading the reviews

2. We prefer to buy products after reading the reviews

3. We prefer to invest in stocks after understanding the market sentiments

4. We want a profitable asset

5. We don’t want to get cheated with odd surprises

6. We believe that we can predict the future (election results, market outcomes)

All of the above can, more or less, be addressed using the tool called 'Sentiment Analysis'.

Page 42: Natural Language Processing

HOW TO? – A BASELINE ALGORITHM

Input: test documents
- Tokenize the test documents
- Feature extraction (either all the tokens or a group of relevant tokens)
- Classification using any of the classifiers below:
  - Naïve Bayes classifiers
  - Maximum Entropy classifiers
  - Support Vector Machine classifiers

Page 43: Natural Language Processing

TELL ME MORE - NAÏVE BAYES CLASSIFIER

Our goal:

For a document d and set of classes C, we need to calculate probability of all the classes given the document.

P(c|d) = P(d|c) * P(c) / P(d)

The class which maximizes P(c|d) is the class to which the document belongs.
- P(d|c) – conditional probability of the document given the class
- P(c) – prior probability of the class

Page 44: Natural Language Processing

NAÏVE BAYES CLASSIFIER CONT…

What is the practical meaning of prior P(c) and likelihood P(d|c) ?

P(c) means the probability of occurrence of the class, i.e. how often the class occurs in the corpus.

P(d|c) means the probability of the occurrence of some set of features(of certain length) given a class, i.e. P(x1, x2, x3, … xi | c). The joint probability of all the features given a class.

Calculating this joint probability directly requires an enormous number of parameters, on the order of |X|^n, and that many for each class.

That would require an enormous number of training samples, which is mostly not available.

This looks complicated; we must try some simplifications!

Page 45: Natural Language Processing

SIMPLIFYING THE NAÏVE BAYES CLASSIFIER

Simplifying assumptions:
- The position of a word in the document doesn't matter – Bag of Words
- Feature probabilities given a class are independent – Conditional Independence

This simplifies our model to

P(x1, x2, x3, ..., xi | c) = P(x1|c) * P(x2|c) * P(x3|c) * ... * P(xi|c)

And hence, the probability of a class given a document reduces to

P(c|d) ∝ P(c) * P(x1|c) * P(x2|c) * P(x3|c) * ... * P(xn|c)
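A minimal multinomial Naïve Bayes sketch in Python built on exactly this simplification (a prior plus per-word conditionals with add-one smoothing, computed in log space to avoid underflow); the tiny training set is an illustrative assumption, not the corpus behind the numbers on the next slide.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (list_of_tokens, class_label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # word_counts[c][w] = count(w, c)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    """Pick argmax_c  log P(c) + sum_i log P(x_i | c), with add-one smoothing."""
    total_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)         # prior P(c)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / denom)  # P(w | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Illustrative training documents (not the corpus behind the next slide's numbers).
docs = [
    ("delhi mumbai india chennai".split(), "India"),
    ("mumbai hyderabad india".split(), "India"),
    ("lahore islamabad pakistan".split(), "Pakistan"),
]
model = train(docs)
print(classify("lahore hyderabad chennai islamabad".split(), *model))  # -> Pakistan
```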

Page 46: Natural Language Processing

A STEP BY STEP DEMONSTRATION

Calculating the priors: P(India) = 3/5, P(Pakistan) = 2/5

Calculating the conditional probabilities:
P(Delhi | India) = 0.15789473        P(Lahore | Pakistan) = 0.25
P(India | India) = 0.15789473        P(Islamabad | Pakistan) = 0.1875
P(Mumbai | India) = 0.2631579        P(Pakistan | Pakistan) = 0.1875
P(Chennai | India) = 0.10526316      P(Hyderabad | Pakistan) = 0.125
P(Hyderabad | India) = 0.15789473    P(India | Pakistan) = 0.0625
P(Lahore | India) = 0.05263158       P(Mumbai | Pakistan) = 0.0625
P(Islamabad | India) = 0.05263158    P(Chennai | Pakistan) = 0.0625

Vocabulary Size: 8

P(w|c) = (count(w, c) + 1) / (count(c) + |V|)

Page 47: Natural Language Processing

A STEP BY STEP DEMONSTRATION CONT…

Priors: P(India) = 3/5, P(Pakistan) = 2/5

Conditional probabilities:
P(Lahore | India) = 0.05263158       P(Lahore | Pakistan) = 0.25
P(Hyderabad | India) = 0.15789473    P(Hyderabad | Pakistan) = 0.125
P(Chennai | India) = 0.10526316      P(Chennai | Pakistan) = 0.0625
P(Islamabad | India) = 0.05263158    P(Islamabad | Pakistan) = 0.1875

Test doc: Lahore Hyderabad Chennai Islamabad

P(India | test doc) ∝ P(India) * P(Lahore | India) * P(Hyderabad | India) * P(Chennai | India) * P(Islamabad | India)
= 0.6 * 0.0526 * 0.1578 * 0.1052 * 0.0526 = 0.0000275

P(Pakistan | test doc) ∝ P(Pakistan) * P(Lahore | Pakistan) * P(Hyderabad | Pakistan) * P(Chennai | Pakistan) * P(Islamabad | Pakistan)
= 0.4 * 0.25 * 0.125 * 0.0625 * 0.1875 = 0.0001464

Since 0.0001464 > 0.0000275, the classifier assigns the test document to the class Pakistan.

Page 48: Natural Language Processing

CHALLENGES IN SENTIMENT ANALYSIS

Tokenization issues:
- Data is available online in HTML, XML and various other markup languages
- Twitter names, hashtags, etc. pollute the data
- Phone numbers, short forms, new words, emoticons, etc.

Extracting features:
- Handling negations: "I didn't like this movie" vs "I really liked this movie" (see the sketch after this list)
- Which words to use? Choosing the words ("I", "this", "movie", etc. do not belong to the set of words which contribute to attitude)
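One common trick for the negation problem is to prefix every token between a negation word and the following punctuation with NOT_, so that negated and non-negated uses become different features; a sketch follows, where the negation word list and regex are illustrative assumptions.

```python
import re

NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't", "wasn't"}  # illustrative list

def mark_negation(text):
    """Prefix tokens that follow a negation word (until the next punctuation)
    with NOT_, so that "didn't like" and "liked" become different features."""
    out, negating = [], False
    for token in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if token in NEGATIONS:
            negating = True
            out.append(token)
        elif token in ".,!?;":
            negating = False
            out.append(token)
        else:
            out.append("NOT_" + token if negating else token)
    return out

print(mark_negation("I didn't like this movie, but I really liked the soundtrack."))
# ['i', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'i', 'really', 'liked', ...]
```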

Page 49: Natural Language Processing

SENTIMENT LEXICONS

These are the words which matter for sentiment. It's better to train our models on these lexicons instead of the complete list of words in the training documents.

Here are few links for the lexicons:

The General Inquirer:
  http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
  http://www.wjh.harvard.edu/~inquirer/homecat.htm

LIWC (Linguistic Inquiry and Word Count): http://www.liwc.net

  Negative emotions: bad, weird, hate, problem, crap
  Positive emotions: love, wonderful, magnificent, lovely

Bing Liu's page on Opinion Mining: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Page 50: Natural Language Processing

HOW GOOD IS OUR CLASSIFIER?

Not every classifier we build is a state-of-the-art system. We must refine and fine-tune it after building.

What are the parameters to judge a classifier? The contingency matrix.

What is the accuracy of the system?
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Page 51: Natural Language Processing

THE PROBLEM WITH THE ACCURACY PARAMETER

Task : Build a classifier to identify food items on web pages

Most of the tokens on a web page won't be the name of a food item. Let us say that there are 1000 words and only 10 of them are names of food items.

Let us consider that our classifier is a bogus one and it always returns false for each word it encounters.

This means the accuracy of our system = (TP + TN) / (TP + TN + FP + FN) = 990/1000 = 99%

Hence, our 99% accurate system is not able to do what we needed, i.e. detecting food items.

Page 52: Natural Language Processing

HOW TO FIX THIS?

We definitely need a better parameter to judge our model. So we define two parameters PRECISION and RECALL

Precision is the percentage of selected items that are correct.
Recall is the percentage of correct items that the system was able to select.

PRECISION = TP / (TP + FP) = 0/0 = UNDEFINED
RECALL = TP / (TP + FN) = 0/10 = 0

So these parameters judge our classifier more fairly. Here, the recall is zero and the precision is undefined, which exposes the bogus classifier.

Page 53: Natural Language Processing

ADVANTAGES OF PRECISION & RECALL

The figures on the last slide didn’t give much insight into the roles of these two parameters.

A slightly better classifier – one capable of selecting the true food items:
Precision = 10 / (10 + 20) = 33%
Recall = 10 / (10 + 10) = 50%

Hence these parameters judge the classifier fairly, and a combination of the two can be used as a single evaluation criterion: the F measure, F = 2 * Precision * Recall / (Precision + Recall).
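A small sketch computing accuracy, precision, recall, and the F measure from the contingency counts; the true-negative count for the second classifier is an assumption added just to complete the example.

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from contingency-table counts.
    Precision, recall and F1 are None when their denominator is zero (undefined)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# The bogus "always false" food-item classifier: 990 true negatives, 10 missed items.
print(evaluate(tp=0, fp=0, fn=10, tn=990))    # accuracy 0.99, precision undefined, recall 0
# The slightly better classifier from this slide (tn assumed for illustration).
print(evaluate(tp=10, fp=20, fn=10, tn=960))  # precision ~0.33, recall 0.5
```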

Page 54: Natural Language Processing

WHAT NEXT?

Discriminative language models
- Maximum Entropy
- Support Vector Machines

Other advanced applications
- Named Entity Recognition
- POS Tagging
- Machine Translation & Probabilistic Parsing
- CFGs
- Language Grammars

Areas of research
- Information Retrieval (query-based and generic)
- Question Answering
- Summarization

Page 55: Natural Language Processing

Dharmendra Prasad [email protected]

Thank You