Natural Language Processing: Machine Learning
Potsdam, 26 April 2012
Saeedeh Momtazi, Information Systems Group
Introduction
Machine Learning: “Field of study that gives computers the ability to learn without being explicitly programmed.”
[Arthur Samuel, 1959]
Learning Methods:
Supervised learning
  Active learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Outline
1 Supervised Learning
2 Semi-Supervised Learning
3 Unsupervised Learning
Supervised Learning
Renting budget: 1000 €
Size: 180 m²
Age: 2 years
Supervised Learning
[Scatter plot: labeled items plotted by size and age, with a new unlabeled item marked “?”]
Classification
Training:
T1 → C1, T2 → C2, ..., Tn → Cn  →  feature extraction: F1, F2, ..., Fn  →  Model(F, C)

Testing:
Tn+1 → ?  →  Fn+1  →  Model(F, C)  →  Cn+1
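To make the training/testing pipeline concrete, here is a minimal sketch in Python; it assumes scikit-learn as the toolkit and uses invented toy documents, neither of which is prescribed by the slides.

# Minimal supervised text-classification pipeline (sketch, toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training items T1..Tn with categories C1..Cn
train_texts = ["cheap pills buy now", "meeting agenda attached",
               "win money now", "project report draft"]
train_labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()            # T_i -> feature vector F_i (bag of words)
F_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()                   # Model(F, C)
model.fit(F_train, train_labels)

# Testing: T_{n+1} -> F_{n+1} -> predicted category C_{n+1}
F_test = vectorizer.transform(["buy cheap money pills"])
print(model.predict(F_test))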
Applications
Problem | Item | Category
POS Tagging | Word | POS
Named Entity Recognition | Word | Named entity
Word Sense Disambiguation | Word | The word's sense
Spam Mail Detection | Document | Spam / Not spam
Language Identification | Document | Language
Text Categorization | Document | Topic
Information Retrieval | Document | Relevant / Not relevant
Part Of Speech Tagging
“I saw the man on the roof.”
“I[PRON] saw[V] the[DET] man[N] on[PREP] the[DET] roof[N].”

[PRON] Pronoun
[PREP] Preposition
[DET] Determiner
[V] Verb
[N] Noun
...
Named Entity Recognition
“Steven Paul Jobs, co-founder of Apple Inc, was born in California.”
“Steven Paul Jobs[Person], co-founder of Apple Inc[Organization], was born in California[Location].”

Entity types: Person, Organization, Location, Date, ...
Word Sense Disambiguation
“Jim flew his plane to Texas.”
“Alice destroys the item with a plane.”
Spam Mail Detection
Language Identification
Text Categorization
[Diagram: a news document is assigned to one of the categories Culture, Sport, Politics, or Business]
Information Retrieval
Classification Algorithms
K Nearest Neighbor
Support Vector Machines
Naïve Bayes
Maximum Entropy
Linear Regression
Logistic Regression
Neural Networks
Decision Trees
Boosting
...
K Nearest Neighbor

[Scatter plot: a new item “?” is classified according to the labels of its nearest neighbors]

1 Nearest Neighbor: assign the label of the single closest item
3 Nearest Neighbor: assign the majority label among the three closest items
Note that the 3-NN majority vote can differ from the 1-NN decision even though one of the items is closer (“But this item is closer”).
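A minimal k-nearest-neighbor sketch, with invented points and labels, showing how the 1-NN and 3-NN decisions can differ.

# k-nearest-neighbor classification by majority vote (sketch, toy data).
from collections import Counter
import math

def knn_predict(train, query, k):
    """train: list of ((x, y), label); query: (x, y); returns the majority label of the k closest items."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"),
         ((3.0, 3.2), "-"), ((3.1, 2.9), "-"), ((2.0, 2.1), "-")]
query = (1.8, 1.9)
print(knn_predict(train, query, k=1))   # label of the single closest item
print(knn_predict(train, query, k=3))   # majority label; may differ from the 1-NN answer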
Support Vector Machines

Find a hyperplane in the vector space that separates the items of the two categories.
There might be more than one possible separating hyperplane.
Find the hyperplane with the maximum margin; the vectors at the margins are called support vectors.
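A sketch of fitting a maximum-margin linear classifier, assuming scikit-learn; the 2-D points are invented for illustration.

# Linear SVM: find the separating hyperplane with maximum margin (sketch, toy data).
from sklearn.svm import SVC

X = [[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],    # class 0
     [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]]    # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # hyperplane parameters w, b
print(clf.support_vectors_)        # the vectors lying on the margins
print(clf.predict([[3.0, 2.0]]))   # classify a new item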
Naïve Bayes

Selecting the class with the highest probability ⇒ minimizing the number of items with wrong labels:

c = argmaxci P(ci)

The probability should depend on the data d that is to be classified:

c = argmaxci P(ci | d)

Applying Bayes' rule, and dropping P(d) since it has no effect on the argmax:

c = argmaxci P(d | ci) · P(ci) / P(d) = argmaxci P(d | ci) · P(ci)

P(d | ci) is the likelihood; P(ci) is the prior probability.
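A from-scratch sketch of the decision rule c = argmaxci P(d|ci) · P(ci) for bag-of-words documents; the add-one smoothing and the toy training data are assumptions of this sketch, not part of the slides.

# Naive Bayes text classification: c = argmax_c P(d|c) * P(c)  (sketch, toy data).
import math
from collections import Counter, defaultdict

train = [("buy cheap pills now", "spam"), ("cheap money offer", "spam"),
         ("meeting agenda for monday", "ham"), ("project report attached", "ham")]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / len(train))        # log P(c)
        total = sum(word_counts[c].values())
        for w in text.split():
            # log P(w|c) with add-one smoothing; the product over words gives P(d|c)
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict("cheap pills for monday"))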
Maximum Entropy

Assigning a weight λj to each feature fj:
  Positive weight: the feature is likely to be effective
  Negative weight: the feature is likely to be ineffective
Picking out a subset of the data by each feature
Voting for each class based on the sum of weighted features:

c = argmaxci P(ci | d, λ)

P(ci | d, λ) = exp(Σj λj · fj(ci, d)) / Σc′ exp(Σj λj · fj(c′, d))

The expectation of each feature is calculated as follows:

E(fi) = Σ(c,d)∈(C,D) P(c, d) · fi(c, d)
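A sketch of the maximum-entropy scoring rule with a couple of invented binary features and hand-set weights λj; estimating the weights from data (e.g., by gradient-based optimization) is not shown here.

# Maximum entropy classification: P(c|d) = exp(sum_j lambda_j f_j(c,d)) / Z(d)  (sketch).
import math

classes = ["spam", "ham"]

def features(c, doc):
    """Binary features f_j(c, d); these feature templates are invented for illustration."""
    words = doc.split()
    return {
        f"contains_cheap&{c}": 1.0 if "cheap" in words else 0.0,
        f"contains_meeting&{c}": 1.0 if "meeting" in words else 0.0,
    }

# Hand-set weights lambda_j (a trained model would estimate these from data).
weights = {"contains_cheap&spam": 2.0, "contains_cheap&ham": -1.0,
           "contains_meeting&spam": -1.5, "contains_meeting&ham": 1.5}

def predict_proba(doc):
    scores = {c: sum(weights.get(name, 0.0) * value
                     for name, value in features(c, doc).items())
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())     # normalization over all classes
    return {c: math.exp(s) / z for c, s in scores.items()}

print(predict_proba("buy cheap pills"))   # the argmax gives the predicted class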
Feature Selection
[Scatter plot: items plotted by size and age]

Other possible features:
Location
Number of rooms
Balcony
...
Feature Selection
Bag-of-words:
Each document can be represented by the set of words that appear in the document
The result is a high-dimensional feature space
The process is computationally expensive

Solution: using a feature selection method to select informative words
Feature Selection Methods
Information Gain
Mutual Information
χ-Square
Information Gain

Measuring the number of bits required for category prediction w.r.t. the presence or absence of a term in the document
Removing words whose information gain is less than a predefined threshold

IG(w) = − Σi=1..K P(ci) log P(ci)
        + P(w) Σi=1..K P(ci|w) log P(ci|w)
        + P(w̄) Σi=1..K P(ci|w̄) log P(ci|w̄)

The probabilities are estimated from document counts:

P(ci) = Ni / N
P(w) = Nw / N,  P(ci|w) = Niw / Nw
P(w̄) = Nw̄ / N,  P(ci|w̄) = Niw̄ / Nw̄

N: # docs
Ni: # docs in category ci
Nw: # docs containing w
Nw̄: # docs not containing w
Niw: # docs in category ci containing w
Niw̄: # docs in category ci not containing w
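A sketch that estimates the probabilities from document counts and computes IG(w) as in the formula above; the tiny labeled corpus is invented.

# Information gain of a word w from document counts (sketch, toy corpus).
import math

# (document words, category) pairs -- invented example data
docs = [({"cheap", "pills"}, "spam"), ({"cheap", "offer"}, "spam"),
        ({"meeting", "agenda"}, "ham"), ({"project", "report"}, "ham")]

def information_gain(w):
    N = len(docs)
    categories = {c for _, c in docs}
    n_w = sum(1 for words, _ in docs if w in words)               # N_w
    ig = 0.0
    for c in categories:
        n_i = sum(1 for _, ci in docs if ci == c)                 # N_i
        n_iw = sum(1 for words, ci in docs if ci == c and w in words)   # N_iw
        n_iw_bar = n_i - n_iw                                     # N_i,w-bar
        ig -= (n_i / N) * math.log(n_i / N)                       # -sum_i P(c_i) log P(c_i)
        if n_w and n_iw:                                          # P(w) P(c_i|w) log P(c_i|w)
            ig += (n_w / N) * (n_iw / n_w) * math.log(n_iw / n_w)
        if (N - n_w) and n_iw_bar:                                # P(w-bar) term
            ig += ((N - n_w) / N) * (n_iw_bar / (N - n_w)) * math.log(n_iw_bar / (N - n_w))
    return ig

print(information_gain("cheap"))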
Mutual Information
Measuring the effect of each word in predicting the category: how much does its presence or absence in a document contribute to category prediction?

MI(w, ci) = log [ P(w, ci) / (P(w) · P(ci)) ]

Removing words whose mutual information is less than a predefined threshold

The per-category scores are combined either by taking the maximum or by a weighted average:

MI(w) = maxi MI(w, ci)
MI(w) = Σi P(ci) · MI(w, ci)
χ-square
Measuring the dependencies between words and categories

χ²(w, ci) = N · (Niw · Nīw̄ − Niw̄ · Nīw)² / [(Niw + Niw̄) · (Nīw + Nīw̄) · (Niw + Nīw) · (Niw̄ + Nīw̄)]

where Nīw and Nīw̄ count the documents that are not in category ci and do / do not contain w, respectively.

Ranking words based on their χ-square measure:

χ²(w) = Σi=1..K P(ci) · χ²(w, ci)

Selecting the top words as features
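A sketch computing the mutual-information and χ² scores of a word for one category from document counts; the example counts are hypothetical.

# Mutual information and chi-square of word w for category c_i, from document counts (sketch).
import math

def mi_and_chi2(N, n_i, n_w, n_iw):
    """N: total docs, n_i: docs in c_i, n_w: docs containing w, n_iw: docs in c_i containing w."""
    # Mutual information: log( P(w, c_i) / (P(w) * P(c_i)) ); assumes n_iw > 0
    mi = math.log((n_iw / N) / ((n_w / N) * (n_i / N)))
    # Contingency counts: A = in c_i with w, B = not in c_i with w,
    #                     C = in c_i without w, D = not in c_i without w
    A, B = n_iw, n_w - n_iw
    C, D = n_i - n_iw, N - n_i - (n_w - n_iw)
    chi2 = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
    return mi, chi2

# Example: 100 docs, 30 in category c_i, 20 containing w, 15 of them in c_i
print(mi_and_chi2(N=100, n_i=30, n_w=20, n_iw=15))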
Feature Selection
These models perform well for document-level classification:
Spam Mail Detection
Language Identification
Text Categorization

Word-level classification might need other types of features:
POS Tagging
Named Entity Recognition
(will be discussed later)
Shortcoming
Data annotation is labor-intensive.

Solution:
Using a minimum amount of annotated data
Annotating further data by a human only if they are very informative
⇒ Active Learning
Active Learning

[Illustration: unlabeled items marked with the classifier's confidence: High (H), Medium (M), Low (L)]

Annotating a small amount of data
Calculating the confidence score of the classifier on the unlabeled data
Finding the informative unlabeled data (the data with the lowest confidence)
Annotating the informative data by a human
(a sketch of this loop is given below)
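A schematic of the active-learning loop in Python; the scikit-learn classifier and the label_by_human callback are placeholders chosen for this sketch, not prescribed by the slides.

# Active learning loop: query the least-confident unlabeled items for human annotation (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning(X_labeled, y_labeled, X_unlabeled, label_by_human, rounds=5, batch=10):
    """All X/y arguments are numpy arrays; label_by_human(items) stands in for the human annotator."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_labeled, y_labeled)
        proba = model.predict_proba(X_unlabeled)
        confidence = proba.max(axis=1)                 # classifier confidence per unlabeled item
        query = np.argsort(confidence)[:batch]         # least confident = most informative
        new_labels = label_by_human(X_unlabeled[query])
        X_labeled = np.vstack([X_labeled, X_unlabeled[query]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_unlabeled = np.delete(X_unlabeled, query, axis=0)
    return model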
The human annotation can be crowdsourced, e.g., via Amazon Mechanical Turk.
Outline
1 Supervised Learning
2 Semi-Supervised Learning
3 Unsupervised Learning
Semi-Supervised Learning
Problem of data annotation
Solution:
Using a minimum amount of annotated data
Annotating further data automatically, if they are easy to predict
Semi-Supervised Learning

A small amount of labeled data
A large amount of unlabeled data

Solution:
Finding the similarity between the labeled and unlabeled data
Predicting the labels of the unlabeled data
Training the classifier using the labeled data and the predicted labels of the unlabeled data
Shortcoming

Introducing a lot of noisy data to the system

Solution: adding unlabeled data to the training set only if the predicted label has a high confidence (see the sketch below)
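A sketch of the confidence-thresholded procedure just described (often called self-training), assuming a scikit-learn-style classifier; the threshold value is an arbitrary example.

# Semi-supervised self-training: add unlabeled items whose predicted label is highly confident (sketch).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_training(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_iterations=10):
    """All X/y arguments are numpy arrays with non-negative (count) features."""
    model = MultinomialNB()
    for _ in range(max_iterations):
        model.fit(X_labeled, y_labeled)
        if len(X_unlabeled) == 0:
            break
        proba = model.predict_proba(X_unlabeled)
        confident = proba.max(axis=1) >= threshold     # only keep high-confidence predictions
        if not confident.any():
            break
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.concatenate([y_labeled, model.predict(X_unlabeled[confident])])
        X_unlabeled = X_unlabeled[~confident]
    return model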
Related Books
Semi-Supervised Learning, by O. Chapelle, B. Schölkopf, A. Zien. MIT Press, 2006.

Semisupervised Learning for Computational Linguistics, by S. Abney. Chapman & Hall, 2007.
Outline
1 Supervised Learning
2 Semi-Supervised Learning
3 Unsupervised Learning
Unsupervised Learning
[Scatter plots: unlabeled items plotted by area and age]
Clustering
Working based on the similarities between the data items
Assigning similar data items to the same cluster
Applications
Word Clustering:
Speech Recognition
Machine Translation
Named Entity Recognition
Information Retrieval
...

Document Clustering:
Information Retrieval
...
Speech Recognition
⇒ “Computers can recognize speech.”
“Computers can wreck a nice peach.”
Machine Translation
“The cat eats ...” ⇒ “Die Katze frisst ...” or “Die Katze isst ...”
(German uses “frisst” for animals and “isst” for humans; choosing the correct verb requires knowing which class of words the subject belongs to.)
Language Modeling

Corpus texts:
“I have a meeting on Monday evening”
“You should work on Wednesday afternoon”
“The next session of the NLP lecture is on Thursday morning”

The word pair “Monday morning” has no observation in the corpus, yet the model should prefer
⇒ “The talk is on Monday morning.”
over “The talk is on Monday molding.”

Word clusters: {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} and {morning, afternoon, evening, night}

With clusters, the corpus does provide observations for the pattern
⇒ [Week-day] [day-time]

Class-based Language Model
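A tiny sketch of a class-based bigram estimate, P(w2|w1) ≈ P(class(w2)|class(w1)) · P(w2|class(w2)); the hand-made word-to-class mapping stands in for a learned word clustering.

# Class-based bigram language model (sketch, toy corpus).
from collections import Counter

word_class = {"monday": "WEEKDAY", "wednesday": "WEEKDAY", "thursday": "WEEKDAY",
              "morning": "DAYTIME", "afternoon": "DAYTIME", "evening": "DAYTIME"}

corpus = ["i have a meeting on monday evening",
          "you should work on wednesday afternoon",
          "the next session of the nlp lecture is on thursday morning"]

class_bigrams, class_counts, word_in_class = Counter(), Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        c1, c2 = word_class.get(w1, w1), word_class.get(w2, w2)
        class_bigrams[(c1, c2)] += 1
        class_counts[c1] += 1
    for w in words:
        word_in_class[(word_class.get(w, w), w)] += 1

def p_class_bigram(w1, w2):
    c1, c2 = word_class.get(w1, w1), word_class.get(w2, w2)
    p_c2_given_c1 = class_bigrams[(c1, c2)] / class_counts[c1] if class_counts[c1] else 0.0
    total_c2 = sum(n for (c, _), n in word_in_class.items() if c == c2)
    p_w2_given_c2 = word_in_class[(c2, w2)] / total_c2 if total_c2 else 0.0
    return p_c2_given_c1 * p_w2_given_c2

# "monday morning" is unseen as a word bigram, but WEEKDAY -> DAYTIME is observed three times
print(p_class_bigram("monday", "morning"))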
Information Retrieval
“The first car was invented by Karl Benz.”
“Thomas Edison invented the first commercially practical light.”
“Alexander Graham Bell invented the first practical telephone.”
Word cluster: vehicle, automobile, car
Information Retrieval

Clustering documents based on their similarities
Clustering Algorithms
Flat:
  K-means

Hierarchical:
  Top-Down (Divisive)
  Bottom-Up (Agglomerative): Single-link, Complete-link, Average-link
K-means
The best-known clustering algorithm
Works well for many cases
Used as a default / baseline for clustering documents

Algorithm:
Defining each cluster center as the mean or centroid of the items in the cluster:

μ = (1 / |c|) · Σx∈c x

Minimizing the average squared Euclidean distance of the items from their cluster centers

Initialization: randomly choose k items as initial centroids
while the stopping criterion has not been met do
  for each item do
    Find the nearest centroid
    Assign the item to the cluster associated with the nearest centroid
  end for
  for each cluster do
    Update the centroid of the cluster based on the average of all items in the cluster
  end for
end while

Iterating two steps:
Re-assignment: assigning each vector to its closest centroid
Re-computation: computing each centroid as the average of the vectors that were assigned to it in re-assignment
(a sketch in code is given below)
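A compact sketch of the two alternating steps on invented 2-D points; a fixed number of iterations stands in for the stopping criterion.

# K-means clustering: alternate re-assignment and centroid re-computation (sketch, toy 2-D data).
import random
import math

def kmeans(points, k, iterations=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)                  # initialization: k random items
    clusters = []
    for _ in range(iterations):
        # Re-assignment: each point goes to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Re-computation: each centroid becomes the mean of its assigned points
        for j, cluster in enumerate(clusters):
            if cluster:
                centroids[j] = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 7.5), (9, 8)]
print(kmeans(points, k=2))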
K-means demo applet:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
Hierarchical Agglomerative Clustering (HAC)

Creating a hierarchy in the form of a binary tree

Initial mapping: put a single item in each cluster
while the predefined number of clusters has not been reached do
  for each pair of clusters do
    Measure the similarity of the two clusters
  end for
  Merge the two clusters that are most similar
end while

Measuring the similarity in three ways:
Single-link / single-linkage clustering: based on the similarity of the most similar members
Complete-link / complete-linkage clustering: based on the similarity of the most dissimilar members
Average-link / average-linkage clustering: based on the average of all similarities between the members
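A small bottom-up clustering sketch using single-link similarity (with Euclidean distance as the dissimilarity); complete-link or average-link would only change the between-cluster distance function. The points are invented.

# Hierarchical agglomerative clustering with single-link distance (sketch, toy 2-D data).
import math

def single_link(c1, c2):
    # distance between clusters = distance of the two most similar (closest) members
    return min(math.dist(a, b) for a in c1 for b in c2)

def hac(points, target_clusters):
    clusters = [[p] for p in points]                  # initially one item per cluster
    while len(clusters) > target_clusters:
        # find the pair of clusters that are most similar (smallest single-link distance)
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]))
        clusters[i].extend(clusters.pop(j))           # merge them
    return clusters

points = [(1, 1), (1.2, 1.1), (5, 5), (5.1, 4.9), (9, 1)]
print(hac(points, target_clusters=2))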
HAC demo applet:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
Further Reading
Introduction to Information Retrieval, by C.D. Manning, P. Raghavan, H. Schütze. Cambridge University Press, 2008.
http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
Chapters 13, 14, 15, 16, 17