
Page 1: Natural Language Processing

NATURAL LANGUAGE PROCESSING

a developer’s perspective to

- Vanya Seth & Dharmendra Prasad

Page 2: Natural Language Processing

WHAT IS NATURAL LANGUAGE PROCESSING?

NLP devises ways to interpret human languages and extract meaning from them, so that the result can be used by other humans (translators), machines, or software systems to achieve larger objectives.

Why do we need this?
- Not everyone speaks the same language (MT)
- Not all languages are written the same way (MT)
- We would like to automate tasks which are tedious (IE)
- We want more accuracy and fewer human errors (Automation)

Page 3: Natural Language Processing

FEW INTERESTING PROBLEMS

1. Spam Detection
2. Spelling Correction
3. Parts of Speech Tagging
4. Named Entity Recognition

1. Coreference Resolution
2. Information Extraction
3. Sentiment Analysis
4. Machine Translation

1. Answering Questions
2. Summarization
3. Paraphrasing
4. Dialog

Page 4: Natural Language Processing

WHY IS NLP A COMPLEX PROBLEM

Ambiguity is pervasive.
Headline: Republicans Grill IRS Chief Over Lost Emails

Possible meanings:
- Republicans harshly question the chief about the emails
- Republicans cook the chief, using emails as the fuel

New ways of writing:
- Twitter hashtags
- ALL CAPITALS
- Abused notations (U for You, FB for Facebook, @ for At)
- New words (Retweet, Unfriend, etc.)
- Emoticons (and many others)

Page 5: Natural Language Processing

HOW DO WE DEAL WITH THIS PROBLEM?

We learn, we remember, and we conquer. What tools do we use?

- We require knowledge about the language
- We require knowledge about the world
- We need a way to combine knowledge sources

How do we do this? Probabilistic models built upon language data for inferring language properties:
- P("fragrant" -> "rose") is high
- P("awful" -> "love") is low

Page 6: Natural Language Processing

TEXT PROCESSING – THE BASE OF NLP

What we mostly do in NLP is "Text Processing": normalizing the text in one way or another.

Word Tokenization, Text Search, Sentence Segmentation, Pattern Recognition, Disambiguating Words etc…

Text processing is important. Some useful tools:
- Regular Expressions (http://regexpal.com)
- Word Tokenization (http://sentiment.christopherpotts.net/tokenizing.html)

Tokenization lets us answer questions such as: how many words are there in the text? What is the size of the vocabulary? And so on.
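As an illustration (not the tokenizer from the linked page), here is a minimal Python sketch that tokenizes with a regular expression and answers both questions; the regex pattern and sample text are assumptions for demonstration only.

```python
import re
from collections import Counter

text = "The weather is cold. How are you? The weather was hot."

# A deliberately simple tokenizer: lower-case the text and pull out
# alphabetic word forms. Real tokenizers (see the links above) also
# handle hashtags, emoticons, URLs, and so on.
tokens = re.findall(r"[a-z']+", text.lower())

counts = Counter(tokens)
print("number of tokens:", len(tokens))   # how many words are in the text
print("vocabulary size:", len(counts))    # how many distinct word types
print("most common:", counts.most_common(3))
```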

Page 7: Natural Language Processing

LANGUAGE MODELING

probabilistic approach

Page 8: Natural Language Processing

WHAT ARE WE TRYING TO SOLVE?

Sentence completion

How are _____?
The weather is ______.

Phrase Rearrangement

little had Mary a lamb
happy everyone during is holidays

How can we solve these problems?

Page 9: Natural Language Processing

STEP BY STEP - LANGUAGE MODELING

Sentence completion – Probability(upcoming word)

Phrase rearrangement – Probability(occurrence of a sentence)

For both of these, we need a machinery, which is called “language model”

In simple words, a language model is a black box that has prior knowledge about the language(s); for any question, we consult the LM for an appropriate answer.

Formally – A language model is a model which computes either P(upcoming word) or P(occurrence of a sentence)

Page 10: Natural Language Processing

WHAT DOES THE LANGUAGE MODEL CONTAIN?

This ideally depends on what kind of answer you want from the model.

An answer to the question:

- P(upcoming word): a list of (phrase, probability) pairs, or the phrase with the highest probability (or any word, depending on the complexity of the algorithm)

- P(phrase rearrangement): a list of (sentence, probability) pairs, or the sentence with the highest probability (or any sentence, depending on the complexity of the algorithm)

How do we get these probabilities?

Page 11: Natural Language Processing

HOW TO CALCULATE THE PROBABILITIES

Goal: Calculating the Probability of a sequence of words

P("The tiger is a fierce animal") =P(The) * P(tiger | The) * P(is | The tiger) * P(a | The tiger

is) * * P(fierce | The tiger is a) * P(animal | The tiger is a

fierce)

This is the joint probability of the sequence of words by the chain rule of probability.

P(animal | The tiger is a fierce) = Count(The tiger is a fierce animal) Count(The tiger is a fierce)

Page 12: Natural Language Processing

A PRACTICAL APPROACH : MARKOV’S ASSUMPTION

It would be mostly sufficient to assume that P(animal | the tiger is a fierce) ≈ P(animal | fierce) or P(animal | a fierce).

Hence, P(the tiger is a fierce animal) ≈ P(the) * P(tiger | the) * P(is | the tiger) * P(a | tiger is) * P(fierce | is a) * P(animal | a fierce)
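To make the bigram version of this concrete, here is a minimal Python sketch that estimates the conditional probabilities from raw counts; the toy corpus is an assumption for illustration only, and no smoothing is applied yet.

```python
from collections import Counter

# Toy corpus (an illustrative assumption); <s> marks the sentence start.
corpus = [
    "<s> the tiger is a fierce animal".split(),
    "<s> the tiger is a big cat".split(),
    "<s> a fierce animal is dangerous".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def p_bigram(word, prev):
    """P(word | prev) estimated from raw counts (no smoothing yet)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    """Bigram (Markov) approximation of the sentence probability."""
    words = ["<s>"] + words
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(word, prev)
    return p

print(p_sentence("the tiger is a fierce animal".split()))
```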

Page 13: Natural Language Processing

EXERCISE 1 : UNIGRAMS

Vocabulary: Cold, You, They, How, Are, The, Weather, Hot, Is, Was

Strategy 1: Assign probabilities to all the words depending on the number of times each occurred in the corpus.

How are Cold? How are You? How are They? How are How?

All these sentences are equally likely.

This model is called the unigram model.

Page 14: Natural Language Processing

EXERCISE 1 : IMPROVING WITH BIGRAMS

Strategy 2: Assign probabilities based on some context. Look for one previous word.

is cold, is hot, is gone
are gone, are you, are they
was cold, was hot
they are, the weather, you are

Complete the sentence : The weather is ________

Three possible options are:
- The weather is hot
- The weather is cold
- The weather is gone

This model is called the bigram model.

Page 15: Natural Language Processing

EXERCISE 1 : GETTING BETTER WITH TRIGRAMS

Strategy 3: Start looking at more than one previous word.

weather is cold, tea is hot, they are gone
how are you, where are they
water was cold, food was hot

Complete the sentence: The weather is ________
Possible option: The weather is cold.

Complete the sentence: How are ________?
Possible option: How are you?

This model is called the trigram model.

Page 16: Natural Language Processing

EXERCISE 1 : HOW FAR CAN WE DO THIS?

The computer I installed in the chemistry laboratory _______

Bigrams may result in :

The computer I installed in the chemistry laboratory equipment

Trigrams may result in :

The computer I installed in the chemistry laboratory apparatus

Long-distance dependencies: the right completion depends on a word far back in the sentence ("computer"), which bigrams and trigrams cannot see. We would need a huge corpus and much longer, common n-grams to capture this.

“The computer I installed in the chemistry laboratory crashed”

Page 17: Natural Language Processing

DEALING WITH UNSEEN NGRAMS

A 2000-word vocabulary has 2000 * 2000 = 4,000,000 possible bigrams.

A document with 16,000 tokens produces around 5,000 unique bigrams.

This means only 5,000 / 4,000,000 (0.125%) of the possible bigrams are ever seen, i.e. 99.875% of them never appear in the document.

This assigns zero probability to all the unseen bigrams; hence, if we calculate the probability of a sentence containing a new bigram, our model will return a value of zero.

To avoid this we use a technique called smoothing. The simplest of all is the Laplace’s Smoothing or the Add One Smoothing

Page 18: Natural Language Processing

SMOOTHING TECHNIQUES

Laplace's Smoothing: pretend that you saw everything one more time. This solves the problem of unseen bigrams (or n-grams).

Applying Laplace's smoothing, we calculate the probability as below:

Without smoothing: P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

Add-one smoothing: P(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + |V|)

Why add |V| to the denominator?

Seeing each word one more time means seeing every unique word one more time; the total number of unique words is |V|, so the denominator increases by |V|.
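A small Python sketch of the add-one estimate above; the counts and vocabulary size below are illustrative assumptions, not values from a real corpus.

```python
from collections import Counter

def p_bigram_laplace(word, prev, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed estimate of P(word | prev) =
    (count(prev, word) + 1) / (count(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# Tiny illustrative counts (an assumption, not a real corpus).
unigram_counts = Counter({"weather": 10, "is": 12})
bigram_counts = Counter({("weather", "is"): 8, ("is", "cold"): 3})
V = 2000  # vocabulary size

print(p_bigram_laplace("is", "weather", bigram_counts, unigram_counts, V))      # seen bigram
print(p_bigram_laplace("purple", "weather", bigram_counts, unigram_counts, V))  # unseen, yet non-zero
```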

Page 19: Natural Language Processing

IMPORTANT LINKS FOR THE SECTION

Online Available Corpus

SRI Language Model ToolKit : http://www.speech.sri.com/projects/srilm/

Google Books Ngram Viewer: https://books.google.com/ngrams

Google Ngram Corpus: http://googleresearch.blogspot.in/2006/08/all-our-n-gram-are-belong-to-you.html

Page 20: Natural Language Processing

SPELLING CORRECTION

Page 21: Natural Language Processing

SPELLING TASKS

Typing is a manual process, and not everyone types correctly. Spelling mistakes occur frequently. There are two spelling tasks in such scenarios:

- Error Detection
- Error Correction

Error correction can take several forms:
- Auto-correct: hte -> the
- Suggest one correction: theate -> theatre
- Suggest a list of corrections: leding -> leading, lending

Page 22: Natural Language Processing

TYPES OF SPELLING ERRORS

Non-word errors:
- opportnity -> opportunity
- graffe -> giraffe

Real-word errors:
- Typographical errors: In dog we trust -> god
- Cognitive errors (homophones): good buy -> bye, withdraw cache -> cash

Page 23: Natural Language Processing

FIXING NON WORD ERRORS

Non-word errors are very easy to correct.

Detection:
- Have a word list from the dictionary of the language
- Check the word against the dictionary; if the word is not present, it is an error

Correction:
- Find candidate words which are similar to the error
- Choose the best candidate based on any of the algorithms or models discussed next

Page 24: Natural Language Processing

FIXING REAL WORD ERRORS

Real-word errors are tough to fix.

Detection:
- Straightforward detection is not possible: the error is itself a real word, so checking it against the dictionary tells us nothing

Correction:
- Find candidate words which are similar (in pronunciation or spelling) to each word in the sentence
- Do this for all the words in the sentence
- Choose the best candidate based on any of the algorithms or models discussed next

Page 25: Natural Language Processing

FINDING CANDIDATES - MINIMUM EDIT DISTANCE

N M E       -> N A M E    (INSERT)
H T E       -> T H E      (TRANSPOSE)
A C R E S S -> A C R E S  (DELETE)
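The candidates above differ from the error by a single insert, transpose, or delete. A standard way to score such candidates is the Damerau-Levenshtein edit distance; the sketch below is a textbook dynamic-programming formulation (an assumption, not code from the slides).

```python
def edit_distance(source, target):
    """Damerau-Levenshtein distance with unit costs for insertion,
    deletion, substitution and adjacent transposition."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("nme", "name"))     # 1 (insertion)
print(edit_distance("hte", "the"))      # 1 (transposition)
print(edit_distance("acress", "acres")) # 1 (deletion)
```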

Page 26: Natural Language Processing

NOISY CHANNEL MODEL – BASIC

The noisy channel is a probabilistic model which represents the real world conditions.

We run a lot of guesses through the channel, and the one that best matches the noisy word is our correct word.

Page 27: Natural Language Processing

NOISY CHANNEL MODEL – BASIC CONT..

What are we trying to find?

For an observation x (the noisy word), we are looking for the word w (from the vocabulary) which maximizes P(w|x) under this noisy channel.

ŵ = argmax P(w|x),  w ∈ V
  = argmax P(x|w) * P(w) / P(x),  w ∈ V   (Bayes' rule)

P(x) is constant for all the w in the vocabulary, because x is the observation.

P(x|w) is called the channel model and P(w) is called the language model. 
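Putting the pieces together, here is a minimal sketch of the argmax over candidate corrections; the candidate list, channel model, and language model passed in below are placeholders for illustration, not real trained models.

```python
def correct(x, candidates, p_channel, p_word):
    """Noisy channel correction: w_hat = argmax_w P(x|w) * P(w).
    P(x) is the same for every candidate, so it can be dropped."""
    return max(candidates, key=lambda w: p_channel(x, w) * p_word(w))

# Hypothetical models for illustration only.
p_word = {"the": 0.05, "thee": 0.0001}.get      # a toy unigram language model
p_channel = lambda x, w: 0.0001                 # a flat channel model placeholder
print(correct("hte", ["the", "thee"], p_channel, p_word))  # -> 'the'
```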

Page 28: Natural Language Processing

NOISY CHANNEL MODEL – EXAMPLE

x = acress

Possible candidates:

Error (x) | Correction (w) | Correct Letter | Error Letter | Error Type
acress    | actress        | t              | -            | Deletion
acress    | cress          | -              | a            | Insertion
acress    | caress         | ca             | ac           | Transposition
acress    | access         | c              | r            | Substitution
acress    | across         | o              | e            | Substitution
acress    | acres          | -              | s            | Insertion
acress    | acres          | -              | s            | Insertion

Page 29: Natural Language Processing

NOISY CHANNEL MODEL – EXAMPLE CONT…

Words Taken from Corpus of Contemporary English (400,000,000 words)

Clearly the probability of across is the highest, so our model will suggest the correction 'across' under a unigram language model.

Word    | Frequency | P(word)
actress | 9448      | 0.00002362
cress   | 220       | 0.00000055
caress  | 686       | 0.00000171
access  | 35310     | 0.00008827
across  | 105559    | 0.00026389
acres   | 12874     | 0.00003218
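As a check, the P(word) column is simply frequency / 400,000,000; a short sketch reproducing it, with the frequencies copied from the table above:

```python
# Word frequencies from the table above; the corpus has 400,000,000 words.
frequencies = {"actress": 9448, "cress": 220, "caress": 686,
               "access": 35310, "across": 105559, "acres": 12874}
corpus_size = 400_000_000

for word, freq in frequencies.items():
    # Matches the P(word) column up to rounding; 'across' comes out highest.
    print(f"P({word}) = {freq / corpus_size:.8f}")
```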

Page 30: Natural Language Processing

NOISY CHANNEL MODEL – IT'S NOT YET OVER

We just talked about the probability of the words (the language model), which is one factor contributing to the correction.

The other factor which is the channel model, still needs to be consulted for a better correction task.

The channel model consults a tool called Confusion Matrix for finding out the likelihood of a type of error:

Types of Error: Insertion, Deletion, Transposition & Substitution

Each confusion matrix gives the probability of a given type of error.

Page 31: Natural Language Processing

CONFUSION MATRICES

Page 32: Natural Language Processing

NOISY CHANNEL MODEL – EXAMPLE CONT…

P(x|w) values are read off the confusion matrices; P(word) values are the unigram probabilities from the Corpus of Contemporary English table shown earlier.

With the channel model included, 'across' still gets the highest score (2.8), narrowly ahead of 'actress' (2.7), so the model again suggests the correction 'across'.

Word    | Correct letter | Error letter | x|w   | P(x|word) | P(word)    | Result (x 10^-9)
actress | t              | -            | c|ct  | 0.000117  | 0.00002362 | 2.7
cress   | -              | a            | a|#   | 0.0000014 | 0.00000055 | 0.00078
caress  | ca             | ac           | ac|ca | 0.0000016 | 0.00000171 | 0.0028
access  | c              | r            | r|c   | 0.0000002 | 0.00008827 | 0.019
across  | o              | e            | e|o   | 0.0000093 | 0.00026389 | 2.8
acres   | -              | s            | es|e  | 0.0000321 | 0.00003218 | 1.0
acres   | -              | s            | ss|s  | 0.0000342 | 0.00003218 | 1.0

Page 33: Natural Language Processing

IMPORTANT LINKS FOR THE SECTION

Wikipedia - Common English Spelling mistakes : https://en.wikipedia.org/wiki/Commonly_misspelled_English_words

Birkbeck Spelling Error Corpus : http://ota.ox.ac.uk/headers/0643.xml

Peter Norvig List of Error: http://norvig.com/ngrams/spell-errors.txt

Page 34: Natural Language Processing

TEXT CLASSIFICATION

Page 35: Natural Language Processing

SPAM DETECTION - CLASSIC CASE OF TEXT CLASSIFICATION

Page 36: Natural Language Processing

SPAM DETECTION - CLASSIC CASE OF TEXT CLASSIFICATION

Spam-like attributes:
- Undisclosed recipients
- Prize!!
- No name, lucky draw
- Suspicious URL

Page 37: Natural Language Processing

WHAT IS TEXT CLASSIFICATION

It is the process of classifying given documents into various classes. For example:

- Is an e-mail SPAM or NOT?
- Is a product review POSITIVE or NEGATIVE?

And so on…

We need a classification model, which works on the given inputs and produces the desired output.

Inputs:
- a document d
- a fixed set of classes C = {c1, c2, c3, ..., cn}

Desired output:

a predicted class c ∈ C to which the document belongs.

Page 38: Natural Language Processing

CLASSIFICATION METHODS

Hand Coded Rules

Based on combinations of words and features, e.g. a blacklisted sender, or words like Viagra, dollars, impress a girl.

Supervised Machine Learning
- Naïve Bayes
- Support Vector Machines
- Logistic Regression
- KNN (K Nearest Neighbors)

Page 39: Natural Language Processing

SENTIMENT ANALYSIS

What is a sentiment? It is an attitude: affectively colored beliefs or dispositions towards objects and persons, such as liking, loving, hating, valuing, desiring.

What is the task of Sentiment Analysis? Detecting the attitude: the holder of the attitude, the target of the attitude, and the type of the attitude.

Types of analysis:
- Simplest: assign a binary value to a sentence or document
- Slightly more complex: rate on a scale of 1-10
- Toughest: detect the target and the source

Page 40: Natural Language Processing

EXAMPLE OF SENTIMENTS FOR A PERFUME

1. The fragrance is just awesome, I love it

2. Keeps you going all day long, this is the best perfume

3. Seriously do you call it a perfume? It’s awful.

4. Thanks for this wonderful fragrance in the classy bottle. Great!!

5. I feel like being cheated after buying this piece of crap

6. If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut.

Attitude – awesome, best, classy, awful, crap, cheated, thanks
Target – perfume, bottle, fragrance
Holder – purchaser, user

We are basically extracting the opinion.

Page 41: Natural Language Processing

WHY IS SENTIMENT ANALYSIS IMPORTANT

1. We prefer to watch movies after reading the reviews

2. We prefer to buy products after reading the reviews

3. We prefer to invest in stocks after understanding the market sentiments

4. We want a profitable asset

5. We don’t want to get cheated with odd surprises

6. We believe that we can predict the future (election results, market outcomes)

All of the above can, more or less, be addressed using the tool called 'Sentiment Analysis'.

Page 42: Natural Language Processing

HOW TO? – A BASELINE ALGORITHM

Input: test documents
- Tokenize the test documents
- Feature extraction (either all the tokens or a group of relevant tokens)
- Classification using any of the classifiers below:
  - Naïve Bayes classifiers
  - Maximum Entropy classifiers
  - Support Vector Machine classifiers

Page 43: Natural Language Processing

TELL ME MORE - NAÏVE BAYES CLASSIFIER

Our goal:

For a document d and set of classes C, we need to calculate probability of all the classes given the document.

P(c|d) = P(d|c) * P(c) / P(d)

The class which maximizes P(c|d) is the class to which the document belongs.
- P(d|c) – conditional probability of the document given the class
- P(c) – prior probability of the class

Page 44: Natural Language Processing

NAÏVE BAYES CLASSIFIER CONT…

What is the practical meaning of prior P(c) and likelihood P(d|c) ?

P(c) means the probability of occurrence of the class, i.e. how often the class occurs in the corpus.

P(d|c) means the probability of the occurrence of some set of features(of certain length) given a class, i.e. P(x1, x2, x3, … xi | c). The joint probability of all the features given a class.

Calculating this joint probability directly requires an enormous number of parameters, on the order of |X|^n, and that many for each class.

That would require an enormous number of training samples, which is mostly not available.

This looks complicated; we must try some simplifications!

Page 45: Natural Language Processing

SIMPLIFYING THE NAÏVE BAYES CLASSIFIER

Simplifying assumptions:
- The position of a word in the document doesn't matter – Bag of Words
- Feature probabilities given a class are independent – Conditional Independence

This simplifies our model to

P(x1, x2, x3, ..., xi | c) = P(x1|c) * P(x2|c) * P(x3|c) * ... * P(xi|c)

And hence, the probability of a class given a document reduces to

P(c|d) ∝ P(c) * P(x1|c) * P(x2|c) * P(x3|c) * ... * P(xn|c)
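A minimal multinomial Naïve Bayes sketch in Python built on exactly this simplification (a prior plus per-word conditionals with add-one smoothing, computed in log space to avoid underflow); the tiny training set is an illustrative assumption, not the corpus behind the numbers on the next slide.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (list_of_tokens, class_label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # word_counts[c][w] = count(w, c)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    """Pick argmax_c  log P(c) + sum_i log P(x_i | c), with add-one smoothing."""
    total_docs = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)         # prior P(c)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / denom)  # P(w | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Illustrative training documents (not the corpus behind the next slide's numbers).
docs = [
    ("delhi mumbai india chennai".split(), "India"),
    ("mumbai hyderabad india".split(), "India"),
    ("lahore islamabad pakistan".split(), "Pakistan"),
]
model = train(docs)
print(classify("lahore hyderabad chennai islamabad".split(), *model))  # -> Pakistan
```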

Page 46: Natural Language Processing

A STEP BY STEP DEMONSTRATION

Calculating the priors: P(India) = 3/5, P(Pakistan) = 2/5

Calculating the conditional probabilities:
P(Delhi | India) = 0.15789473        P(Lahore | Pakistan) = 0.25
P(India | India) = 0.15789473        P(Islamabad | Pakistan) = 0.1875
P(Mumbai | India) = 0.2631579        P(Pakistan | Pakistan) = 0.1875
P(Chennai | India) = 0.10526316      P(Hyderabad | Pakistan) = 0.125
P(Hyderabad | India) = 0.15789473    P(India | Pakistan) = 0.0625
P(Lahore | India) = 0.05263158       P(Mumbai | Pakistan) = 0.0625
P(Islamabad | India) = 0.05263158    P(Chennai | Pakistan) = 0.0625

Vocabulary Size: 8

P(w|c) = (count(w, c) + 1) / (count(c) + |V|)

Page 47: Natural Language Processing

A STEP BY STEP DEMONSTRATION CONT…

Priors: P(India) = 3/5, P(Pakistan) = 2/5

Conditional probabilities:
P(Lahore | India) = 0.05263158       P(Lahore | Pakistan) = 0.25
P(Hyderabad | India) = 0.15789473    P(Hyderabad | Pakistan) = 0.125
P(Chennai | India) = 0.10526316      P(Chennai | Pakistan) = 0.0625
P(Islamabad | India) = 0.05263158    P(Islamabad | Pakistan) = 0.1875

Test doc: Lahore Hyderabad Chennai Islamabad

P(India | test doc) ∝ P(India) * P(Lahore | India) * P(Hyderabad | India) * P(Chennai | India) * P(Islamabad | India)
= 0.6 * 0.0526 * 0.1578 * 0.1052 * 0.0526 = 0.0000275

P(Pakistan | test doc) ∝ P(Pakistan) * P(Lahore | Pakistan) * P(Hyderabad | Pakistan) * P(Chennai | Pakistan) * P(Islamabad | Pakistan)
= 0.4 * 0.25 * 0.125 * 0.0625 * 0.1875 = 0.0001464

Since 0.0001464 > 0.0000275, the classifier assigns the test document to the class Pakistan.

Page 48: Natural Language Processing

CHALLENGES IN SENTIMENT ANALYSIS

Tokenization issues:
- Data is available online in HTML, XML and various other markup languages
- Twitter names, hashtags, etc. pollute the data
- Phone numbers, short forms, new words, emoticons, etc.

Extracting features:
- Handling negations: "I didn't like this movie" vs "I really liked this movie" (see the sketch after this list)
- Which words to use? Choosing the words ("I", "this", "movie", etc. do not belong to the set of words which contribute to attitude)
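One common trick for the negation problem is to prefix every token between a negation word and the following punctuation with NOT_, so that negated and non-negated uses become different features; a sketch follows, where the negation word list and regex are illustrative assumptions.

```python
import re

NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't", "wasn't"}  # illustrative list

def mark_negation(text):
    """Prefix tokens that follow a negation word (until the next punctuation)
    with NOT_, so that "didn't like" and "liked" become different features."""
    out, negating = [], False
    for token in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if token in NEGATIONS:
            negating = True
            out.append(token)
        elif token in ".,!?;":
            negating = False
            out.append(token)
        else:
            out.append("NOT_" + token if negating else token)
    return out

print(mark_negation("I didn't like this movie, but I really liked the soundtrack."))
# ['i', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'i', 'really', 'liked', ...]
```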

Page 49: Natural Language Processing

SENTIMENT LEXICONS

These are the words which matter for sentiment. It's better to train our models on these lexicons instead of the complete list of words in the training documents.

Here are few links for the lexicons:

The General Inquirer:
  http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
  http://www.wjh.harvard.edu/~inquirer/homecat.htm

LIWC (Linguistic Inquiry and Word Count): http://www.liwc.net

  Negative emotions: bad, weird, hate, problem, crap
  Positive emotions: love, wonderful, magnificent, lovely

Bing Liu's page on Opinion Mining: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Page 50: Natural Language Processing

HOW GOOD IS OUR CLASSIFIER?

Not every classifier we build is a state-of-the-art system. We must refine and fine-tune it after building.

What are the parameters to judge a classifier? The contingency matrix.

What is the accuracy of the system?
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Page 51: Natural Language Processing

THE PROBLEM WITH THE ACCURACY PARAMETER

Task : Build a classifier to identify food items on web pages

Most of the tokens on a web page won't be the name of a food item. Let us say that there are 1000 words and only 10 of them are names of food items.

Let us consider that our classifier is a bogus one and it always returns false for each word it encounters.

This means the accuracy of our system = (TP + TN) / (TP + TN + FP + FN) = 990/1000 = 99%

Hence, our 99% accurate system is not able to do what we needed, i.e. detecting food items.

Page 52: Natural Language Processing

HOW TO FIX THIS?

We definitely need a better parameter to judge our model. So we define two parameters PRECISION and RECALL

Precision is the percentage of selected items that are correct.
Recall is the percentage of correct items that the system was able to select.

PRECISION = TP / (TP + FP) = 0/0 = UNDEFINED
RECALL = TP / (TP + FN) = 0/10 = 0

So these parameters judge our classifier more fairly. Here, the recall is zero and the precision is undefined, which exposes the bogus classifier.

Page 53: Natural Language Processing

ADVANTAGES OF PRECISION & RECALL

The figures on the last slide didn’t give much insight into the roles of these two parameters.

A slightly better classifier – one capable of selecting the true food items:
Precision = 10 / (10 + 20) = 33%
Recall = 10 / (10 + 10) = 50%

Hence these parameters judge the classifier fairly, and a combination of the two can be used as a single evaluation criterion: the F measure, F = 2 * Precision * Recall / (Precision + Recall).
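A small sketch computing accuracy, precision, recall, and the F measure from the contingency counts; the true-negative count for the second classifier is an assumption added just to complete the example.

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from contingency-table counts.
    Precision, recall and F1 are None when their denominator is zero (undefined)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# The bogus "always false" food-item classifier: 990 true negatives, 10 missed items.
print(evaluate(tp=0, fp=0, fn=10, tn=990))    # accuracy 0.99, precision undefined, recall 0
# The slightly better classifier from this slide (tn assumed for illustration).
print(evaluate(tp=10, fp=20, fn=10, tn=960))  # precision ~0.33, recall 0.5
```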

Page 54: Natural Language Processing

WHAT NEXT?

Discriminative language models
- Maximum Entropy
- Support Vector Machines

Other advanced applications
- Named Entity Recognition
- POS Tagging
- Machine Translation & Probabilistic Parsing
- CFGs
- Language Grammars

Areas of research
- Information Retrieval (query-based and generic)
- Question Answering
- Summarization

Page 55: Natural Language Processing

Dharmendra Prasad [email protected]

Thank You