Text Categorization
Chapter 16
Foundations of Statistical Natural Language Processing
Outline
• Preparation
• Decision Tree
• Maximum Entropy Modeling
• Perceptrons
• K Nearest Neighbor Classification
Part I
• Preparation
Classification
• Classification / Categorization
  – The task of assigning objects from a universe to two or more classes (categories)

  Problem                  Object              Categories
  Tagging                  Context of a word   The word's POS tags
  Disambiguation           Context of a word   The word's senses
  PP attachment            Sentence            Parse trees
  Author identification    Document            Document authors
  Language identification  Document            Languages
  Text categorization      Document            Topics
Task Description
• Goal: Given the classification scheme, the system can decide which class(es) a document belongs to.
• A mapping from document space to classification scheme.
  – 1 to 1 / 1 to many
• To build the mapping:
  – observe the known samples classified in the scheme,
  – summarize the features and create rules/formulas,
  – decide the classes for new documents according to the rules.
Task Formulation
• Training set:
  – (text doc, category) pairs
  – for TC, a doc is represented as a vector of (possibly weighted) word counts
• Model class
  – a parameterized family of classifiers
• Training procedure
  – selects one classifier from this family.
• E.g.
A data representation model

[Figure: a linear decision boundary g(x) = w·x + b = 0 in two dimensions, with w = (1, 1) and b = −1. The boundary passes through (0, 1) and (1, 0); points x with w·x + b > 0 fall on one side, points with w·x + b < 0 on the other.]
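A minimal sketch of this decision rule, with w = (1, 1) and b = −1 taken from the figure:

```python
# Linear decision function g(x) = w . x + b from the figure,
# with w = (1, 1) and b = -1.
def g(x, w=(1.0, 1.0), b=-1.0):
    """Score is > 0 on one side of the boundary, < 0 on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# The boundary g(x) = 0 passes through (0, 1) and (1, 0).
print(g((0.0, 1.0)))  # 0.0 (on the boundary)
print(g((1.0, 1.0)))  # 1.0 (positive side)
print(g((0.0, 0.0)))  # -1.0 (negative side)
```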
Evaluation (1)
• Test set
• For binary classification, accuracy = (a + d) / (a + b + c + d)
  (proportion of correctly classified objects)

Contingency table:
                     Yes is correct   No is correct
  Yes was assigned         a                b
  No was assigned          c                d
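The table's cells give the usual measures directly; a short sketch (the sample counts are made up for illustration):

```python
# Evaluation measures from the binary contingency table:
# a = assigned yes, correct yes; b = assigned yes, correct no;
# c = assigned no, correct yes;  d = assigned no, correct no.
def evaluate(a, b, c, d):
    accuracy = (a + d) / (a + b + c + d)   # proportion correctly classified
    precision = a / (a + b)                # correct among "yes" assignments
    recall = a / (a + c)                   # "yes" cases that were found
    return accuracy, precision, recall

acc, p, r = evaluate(a=40, b=10, c=20, d=30)
print(acc, p, r)  # 0.7 0.8 0.666...
```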
Evaluation (2)
• More than two categories
  – Macro-averaging
    • For each category create a contingency table, then compute precision/recall separately
    • Average the evaluation measure over the categories. E.g.
  – Micro-averaging: make a single contingency table for all the data by summing the scores in each cell over all categories.
• Macro-avg: gives equal weight to each class
• Micro-avg: gives equal weight to each object
Part II
• Decision Tree
E.g.

A trained decision tree for category "earnings":

Node 1: 7681 articles, P(c|n1) = 0.300, split on 'cts' at value 2
  cts < 2 → Node 2: 5977 articles, P(c|n2) = 0.116, split on 'net' at value 1
    net < 1 → Node 3: 5436 articles, P(c|n3) = 0.050
    net ≥ 1 → Node 4: 541 articles, P(c|n4) = 0.649
  cts ≥ 2 → Node 5: 1704 articles, P(c|n5) = 0.943, split on 'vs' at value 2
    vs < 2 → Node 6: 301 articles, P(c|n6) = 0.694
    vs ≥ 2 → Node 7: 1403 articles, P(c|n7) = 0.996

Doc = {cts=1, net=3}
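Walking the document through the tree can be sketched directly from the node statistics above (the 0.5 decision threshold is an assumption, not stated on the slide):

```python
# Hard-coded walk through the trained "earnings" tree;
# splits and leaf probabilities are copied from the figure.
def tree_prob(doc):
    if doc.get("cts", 0) < 2:              # Node 1 splits on cts at 2
        if doc.get("net", 0) < 1:          # Node 2 splits on net at 1
            return 0.050                   # Node 3
        return 0.649                       # Node 4
    if doc.get("vs", 0) < 2:               # Node 5 splits on vs at 2
        return 0.694                       # Node 6
    return 0.996                           # Node 7

p = tree_prob({"cts": 1, "net": 3})
print(p)        # 0.649: the document lands in Node 4
print(p > 0.5)  # True, so assign it to "earnings"
```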
A Closer Look at the E.g.
• Doc = {cts=1, net=3}: cts < 2 sends it to Node 2; net ≥ 1 sends it to Node 4, where P(c|n4) = 0.649, so the document is assigned to "earnings".
• Data presentation model
• Model class (the tree structure is decided; the parameters remain to be learned)
• Training procedure
Data Presentation Model (1)
• An art in itself.– Usually depends on the particular categorizatio
n method used.
• In this book, given as an e.g., we present each document as an weighted word vector.
• The words are chosen by X2 (chi-square) method from the training corpus– 20 words are chosen. E.g. vs, mln, 1000, loss,
profit…
Ref to: Chap 5
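The χ² score for one term can be computed from a 2×2 term/category table; this is the standard formula, and the cell roles here are an assumption about how the table is laid out:

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table:
    a = in-category docs containing the term, b = in-category docs without it,
    c = out-of-category docs with the term,  d = out-of-category docs without it."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# Hypothetical counts for one candidate term:
print(chi_square(4, 6, 2, 8))  # 8000/8400 = 0.952...
```

Terms would then be ranked by this score and the top 20 kept, as the slide describes.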
Data Presentation Model (2)
• Each document j is then represented as a vector of K = 20 integers s(1j), …, s(Kj)
• tf(ij): the number of occurrences of term i in document j
• l(j): the length of document j
• E.g. profit
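One way to produce such integer weights from tf(ij) and l(j) is a log-scaled, rounded term frequency in the spirit of the book's scheme; the exact scaling below is an assumption:

```python
from math import log

def weight(tf, l):
    """Integer weight s(ij) for term i in document j.
    tf = occurrences of the term, l = document length.
    Log-scaling keeps the weight in a small 0..10 range
    (this particular formula is one plausible choice)."""
    if tf == 0:
        return 0
    return round(10 * (1 + log(tf)) / (1 + log(l)))

# E.g. 'profit' occurring 6 times in an 89-word document:
print(weight(6, 89))  # 5
```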
Training Procedure: Growing (1)
• Growing a tree
  – Splitting criterion:
    • finding the feature and its value that we will split on
    • Information gain:

      G(a, y) = H(t) − (pL·H(tL) + pR·H(tR))

      where H(t) is the entropy of the parent node and pL (resp. pR = 1 − pL) is the proportion of elements passed on to the left (resp. right) node
  – Stopping criterion
    • Determines when to stop splitting
    • e.g. all elements at a node have an identical representation or the same category

Ref. Machine Learning
Training Procedure: Growing (2)
• E.g. the value of G('cts', 2), using natural logarithms:
  – H(t) = −0.3·log(0.3) − 0.7·log(0.7) = 0.611
  – pL = 5977/7681 ≈ 0.778
  – G('cts', 2) = 0.611 − (0.778·H(0.116) + 0.222·H(0.943)) = 0.611 − 0.328 = 0.283

Node 1: 7681 articles, P(c|n1) = 0.300, split on 'cts' at value 2
  cts < 2 → Node 2: 5977 articles, P(c|n2) = 0.116
  cts ≥ 2 → Node 5: 1704 articles, P(c|n5) = 0.943
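Plugging the node statistics into the gain formula reproduces the numbers on the slide:

```python
from math import log

def H(p):
    """Entropy (in nats, i.e. natural log) of a node where P(c) = p."""
    return -p * log(p) - (1 - p) * log(1 - p)

pL = 5977 / 7681  # fraction of the 7681 articles sent left (cts < 2)
gain = H(0.3) - (pL * H(0.116) + (1 - pL) * H(0.943))
print(round(H(0.3), 3))  # 0.611
print(round(gain, 3))    # 0.283
```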
Training Procedure: Pruning (1)
• Overfitting:
  – E.g.
  – Introduced by errors or noise in the training set
  – Or the insufficiency of the training set
• Solution: Pruning
  – Create a detailed decision tree, then prune the tree to an appropriate size
  – Approaches:
    • Quinlan 1987
    • Quinlan 1993
    • Magerman 1994

Ref to: chap 3 (3.7.1) Machine Learning
Training Procedure: Pruning (2)
• Validation
  – Validation set (cross-validation)
Discussion
• Learning curve
  – Large training set vs. optimal performance
• Can be interpreted easily: the greatest advantage
  – The model is more complicated than classifiers like Naïve Bayes, linear regression, etc.
• Splitting the training set into smaller and smaller subsets makes correct generalization harder (not enough data for reliable prediction)
  – Pruning addresses the problem to some extent.
Part III
• Maximum Entropy Modeling– Data presentation model– Model class– Training procedure
Basic Idea
• Given a set of training documents and their categories
• Select features that represent the empirical data
• Select a probability density function to generate the empirical data
• Find the probability distribution (= decide the parameters of the probability function) that
  – has the maximum entropy H(p) of all the possible p
  – satisfies the constraints given by the features
  – maximizes the likelihood of the data
• A new document is classified under this probability distribution
Data Presentation Model
• Recall the data presentation model used by the decision tree:
  – Each document is represented as a vector of K = 20 integers s(1j), …, s(Kj), where s(ij) is an integer and represents the weight of the feature word
• The features f(i) are defined to characterize any property of a pair (x, c).
Model Class
• Loglinear models:

  p(x, c) = (1/Z) ∏(i=1..K) α(i)^f(i)(x, c)

• K is the number of features, α(i) is the weight of feature f(i), and Z is a normalizing constant, used to ensure that a probability distribution results.
• Classify a new document:
  – Compute p(x, 0) and p(x, 1) and choose the class label with the greater probability
Training Process: Generalized Iterative Scaling
• Given the equation
• Under a set of constraints:
– The expected value of fi for p* is the same as the expected value for the empirical distribution
• There is a unique maximum entropy distribution• There is a computable procedure that converges
to the distribution p* (16.2.1)
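A toy sketch of the GIS iteration (the problem, features, and slack construction below are all invented for illustration; GIS requires the feature values to sum to a constant C for every (x, c), which the slack feature enforces):

```python
# Events are (x, c) pairs; f1 fires only for (1, 1); a slack feature
# f2 = C - f1 keeps the feature sum constant at C = 1, as GIS requires.
def features(x, c):
    f1 = 1 if (x, c) == (1, 1) else 0
    return [f1, 1 - f1]

events = [(0, 0), (0, 1), (1, 0), (1, 1)]
emp_counts = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 3}  # toy training data
n = sum(emp_counts.values())
emp = [sum(emp_counts[e] * features(*e)[i] for e in events) / n for i in range(2)]

alphas, C = [1.0, 1.0], 1
for _ in range(20):
    w = {e: alphas[0] ** features(*e)[0] * alphas[1] ** features(*e)[1]
         for e in events}
    Z = sum(w.values())
    model = [sum(w[e] / Z * features(*e)[i] for e in events) for i in range(2)]
    # GIS update: alpha_i <- alpha_i * (E_emp[f_i] / E_model[f_i]) ** (1 / C)
    alphas = [a * (emp[i] / model[i]) ** (1 / C) for i, a in enumerate(alphas)]

print(model)  # model expectations have converged to the empirical ones
```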
The Principle of Maximum Entropy
• Given by E. T. Jaynes in 1957
• The distribution with maximum entropy is more likely to occur than other distributions.
  – (the entropy of a closed system is continuously increasing?)
• Information entropy as a measurement of 'uninformativeness'
  – If we chose a model with less entropy, we would add 'information' constraints to the model that are not justified by the empirical evidence available to us
Application to Text Categorization
• Feature selection
  – In maximum entropy modeling, feature selection and training are usually integrated
• Test for convergence
  – Compare the log difference between empirical and estimated feature expectations
• Generalized iterative scaling
  – Computationally expensive due to slow convergence
• Vs. Naïve Bayes
  – Both use the prior probability
  – NB supposes no dependency between variables, while MEM doesn't
• Strength
  – Arbitrarily complex features can be defined if the experimenter believes that these features may contribute useful information for the classification decision
  – Unified framework for feature selection and classification
Part IV
• Perceptrons
Models
• Data presentation model:
  – text document is represented as a term vector.
• Model class:
  – f(x) = w·x − θ
• Binary classification
  – For any input text document x,
  – Class(x) = c iff f(x) > 0; else class(x) ≠ c
• Algorithm:
  – The perceptron learning algorithm is a simple example of a gradient descent algorithm
  – The goal is to learn the weight vector w and a threshold θ.
Perceptron Learning Procedure: Gradient Descent
• Gradient descent
  – an optimization algorithm.
  – To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
Perceptron Learning Procedure: Basic Idea
• To find a linear division of the training set
• Procedure:
  – Estimate w and θ; if they make a mistake, we move them in the direction of greatest change for the optimality criterion.
  – For each (x, y) pair
  – Pass (xi, yi, wi) to the update rule w(j)' = w(j) + α(δ − y)x(j)
• Perceptron convergence theorem
  – Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data is linearly separable

Here w(j) is the j-th item in the weight vector, x(j) the j-th item in the input vector, δ the expected output, and y the actual output.
Why
• w(j)' = w(j) + α(δ − y)x(j)
  – Ref: www.cs.ualberta.ca/~sutton/book/8/node3.html
  – Sections 8.1 and 8.2 of Reinforcement Learning; refer to formula 8.2
  – The direction of greatest gradient
  – Where w' = {w1, w2, …, wk, θ}, x' = {x1, x2, …, xk, −1}
E.g.

[Figure: perceptron update example. A misclassified input x is added to the weight vector, moving w to w + x and shifting the decision boundary (s' → s) so that the Yes/No regions classify x correctly.]
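The procedure above can be sketched end to end; the toy data set and learning rate are invented, and θ is absorbed into w via the augmented vectors w' and x' described earlier:

```python
# Perceptron learning with theta folded into the weight vector:
# w' = {w1..wk, theta}, x' = {x1..xk, -1}.
def train(data, k, alpha=1.0, epochs=100):
    w = [0.0] * (k + 1)                     # last slot plays the role of theta
    for _ in range(epochs):
        mistakes = 0
        for x, label in data:               # label (delta) is 0 or 1
            xa = list(x) + [-1.0]           # augmented input x'
            y = 1 if sum(wi * xi for wi, xi in zip(w, xa)) > 0 else 0
            if y != label:                  # update rule w' = w + alpha*(delta - y)*x
                w = [wi + alpha * (label - y) * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:                   # clean pass: converged
            break
    return w

# Linearly separable toy set: class 1 iff at least one coordinate is 1 (OR).
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train(data, k=2)

def predict(x):
    xa = list(x) + [-1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xa)) > 0 else 0

print([predict(x) for x, _ in data])  # [0, 1, 1, 1]
```

Because the data is linearly separable, Novikoff's theorem guarantees the mistake-driven updates stop after finitely many iterations.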
Discussion
• The data set should be linearly separable
  – By 1969, researchers had realized the limitations, and interest in perceptrons remained low
• As a gradient descent algorithm, it doesn't suffer from the local optimum problem
• Back-propagation algorithm, etc.
  – 80's
  – Multi-layer perceptrons, neural networks, connectionist models.
  – Overcome the shortcoming of perceptrons; ideally can learn any classification function (e.g. XOR).
  – Converge more slowly
  – Can get caught in local optima
Part V
• K Nearest Neighbor Classification
Nearest Neighbor
• [Figure: the new point's single nearest neighbor is purple] Category = purple
K Nearest Neighbor
• k = 4
• [Figure: the majority of the new point's 4 nearest neighbors are blue] Category = blue
Discussion
• Similarity metric
  – The complexity of KNN lies in finding a good measure of similarity
  – Its performance is very dependent on the right similarity metric
• Efficiency
  – KNN search can be expensive; however, there are ways of implementing it efficiently, and often there is an obvious choice for a similarity metric
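A brute-force KNN sketch with Euclidean distance as the (assumed) similarity metric; the labeled points are invented to mimic the two-cluster figures above:

```python
from collections import Counter

def knn(train, x, k=4):
    """Classify x by majority vote among its k nearest labeled points,
    using Euclidean distance as the similarity metric."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two clusters labeled "purple" and "blue".
train = [((0, 0), "purple"), ((0, 1), "purple"), ((1, 0), "purple"),
         ((5, 5), "blue"), ((5, 6), "blue"), ((6, 5), "blue"), ((6, 6), "blue")]
print(knn(train, (5.5, 5.5), k=4))  # blue
print(knn(train, (0.5, 0.5), k=1))  # purple (plain nearest neighbor)
```

Brute force is O(n) per query; spatial index structures such as k-d trees are one standard way to speed up the search.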
Thanks!