Text Categorization
Chapter 16
Foundations of Statistical Natural Language Processing
Outline
• Preparation
• Decision Tree
• Maximum Entropy Modeling
• Perceptrons
• K Nearest Neighbor Classification
Part I
• Preparation
Classification
• Classification / Categorization
  – The task of assigning objects from a universe to two or more classes (categories)

  Problem                  Object              Categories
  Tagging                  Context of a word   The word's POS tags
  Disambiguation           Context of a word   The word's senses
  PP attachment            Sentence            Parse trees
  Author identification    Document            Document authors
  Language identification  Document            Languages
  Text categorization      Document            Topics
Task Description
• Goal: Given the classification scheme, the system can decide which class(es) a document belongs to.
• A mapping from document space to classification scheme.
  – 1 to 1 / 1 to many
• To build the mapping:
  – observe the known samples classified in the scheme,
  – summarize the features and create rules/formulas,
  – decide the classes for new documents according to the rules.
Task Formulation
• Training set:
  – (text doc, category) pairs
  – for TC, a doc is represented as a vector of (possibly weighted) word counts
• Model class
  – a parameterized family of classifiers
• Training procedure
  – selects one classifier from this family.
• E.g.
A data representation model

[Figure: a linear decision boundary g(x) = w·x + b = 0 in two dimensions, with w = (1, 1) and b = −1. The boundary passes through (0, 1) and (1, 0); points x with w·x + b > 0 fall on one side, points with w·x + b < 0 on the other.]
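A minimal sketch of this decision rule, with w = (1, 1) and b = −1 taken from the figure:

```python
# Linear decision function g(x) = w . x + b from the figure,
# with w = (1, 1) and b = -1.
def g(x, w=(1.0, 1.0), b=-1.0):
    """Score is > 0 on one side of the boundary, < 0 on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# The boundary g(x) = 0 passes through (0, 1) and (1, 0).
print(g((0.0, 1.0)))  # 0.0 (on the boundary)
print(g((1.0, 1.0)))  # 1.0 (positive side)
print(g((0.0, 0.0)))  # -1.0 (negative side)
```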
Evaluation (1)
• Test set
• For binary classification, accuracy = (a + d) / (a + b + c + d)
  (proportion of correctly classified objects)

Contingency table:
                     Yes is correct   No is correct
  Yes was assigned         a                b
  No was assigned          c                d
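The table's cells give the usual measures directly; a short sketch (the sample counts are made up for illustration):

```python
# Evaluation measures from the binary contingency table:
# a = assigned yes, correct yes; b = assigned yes, correct no;
# c = assigned no, correct yes;  d = assigned no, correct no.
def evaluate(a, b, c, d):
    accuracy = (a + d) / (a + b + c + d)   # proportion correctly classified
    precision = a / (a + b)                # correct among "yes" assignments
    recall = a / (a + c)                   # "yes" cases that were found
    return accuracy, precision, recall

acc, p, r = evaluate(a=40, b=10, c=20, d=30)
print(acc, p, r)  # 0.7 0.8 0.666...
```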
Evaluation (2)
• More than two categories
  – Macro-averaging
    • For each category create a contingency table, then compute precision/recall separately
    • Average the evaluation measure over the categories. E.g.
  – Micro-averaging: make a single contingency table for all the data by summing the scores in each cell over all categories.
• Macro-avg: gives equal weight to each class
• Micro-avg: gives equal weight to each object
Part II
• Decision Tree
E.g.

A trained decision tree for category "earnings":

Node 1: 7681 articles, P(c|n1) = 0.300, split on 'cts' at value 2
  cts < 2 → Node 2: 5977 articles, P(c|n2) = 0.116, split on 'net' at value 1
    net < 1 → Node 3: 5436 articles, P(c|n3) = 0.050
    net ≥ 1 → Node 4: 541 articles, P(c|n4) = 0.649
  cts ≥ 2 → Node 5: 1704 articles, P(c|n5) = 0.943, split on 'vs' at value 2
    vs < 2 → Node 6: 301 articles, P(c|n6) = 0.694
    vs ≥ 2 → Node 7: 1403 articles, P(c|n7) = 0.996

Doc = {cts=1, net=3}
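Walking the document through the tree can be sketched directly from the node statistics above (the 0.5 decision threshold is an assumption, not stated on the slide):

```python
# Hard-coded walk through the trained "earnings" tree;
# splits and leaf probabilities are copied from the figure.
def tree_prob(doc):
    if doc.get("cts", 0) < 2:              # Node 1 splits on cts at 2
        if doc.get("net", 0) < 1:          # Node 2 splits on net at 1
            return 0.050                   # Node 3
        return 0.649                       # Node 4
    if doc.get("vs", 0) < 2:               # Node 5 splits on vs at 2
        return 0.694                       # Node 6
    return 0.996                           # Node 7

p = tree_prob({"cts": 1, "net": 3})
print(p)        # 0.649: the document lands in Node 4
print(p > 0.5)  # True, so assign it to "earnings"
```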
A Closer Look at the E.g.
• Doc = {cts=1, net=3}: cts < 2 sends it to Node 2; net ≥ 1 sends it to Node 4, where P(c|n4) = 0.649, so the document is assigned to "earnings".
• Data presentation model
• Model class (the tree structure is decided; the parameters remain to be learned)
• Training procedure
Data Presentation Model (1)
• An art in itself.– Usually depends on the particular categorizatio
n method used.
• In this book, given as an e.g., we present each document as an weighted word vector.
• The words are chosen by X2 (chi-square) method from the training corpus– 20 words are chosen. E.g. vs, mln, 1000, loss,
profit…
Ref to: Chap 5
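The χ² score for one term can be computed from a 2×2 term/category table; this is the standard formula, and the cell roles here are an assumption about how the table is laid out:

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table:
    a = in-category docs containing the term, b = in-category docs without it,
    c = out-of-category docs with the term,  d = out-of-category docs without it."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# Hypothetical counts for one candidate term:
print(chi_square(4, 6, 2, 8))  # 8000/8400 = 0.952...
```

Terms would then be ranked by this score and the top 20 kept, as the slide describes.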
Data Presentation Model (2)
• Each document j is then represented as a vector of K = 20 integers s(1j), …, s(Kj)
• tf(ij): the number of occurrences of term i in document j
• l(j): the length of document j
• E.g. profit
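One way to produce such integer weights from tf(ij) and l(j) is a log-scaled, rounded term frequency in the spirit of the book's scheme; the exact scaling below is an assumption:

```python
from math import log

def weight(tf, l):
    """Integer weight s(ij) for term i in document j.
    tf = occurrences of the term, l = document length.
    Log-scaling keeps the weight in a small 0..10 range
    (this particular formula is one plausible choice)."""
    if tf == 0:
        return 0
    return round(10 * (1 + log(tf)) / (1 + log(l)))

# E.g. 'profit' occurring 6 times in an 89-word document:
print(weight(6, 89))  # 5
```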
Training Procedure: Growing (1)
• Growing a tree
  – Splitting criterion:
    • finding the feature and its value that we will split on
    • Information gain:

      G(a, y) = H(t) − (pL·H(tL) + pR·H(tR))

      where H(t) is the entropy of the parent node and pL (resp. pR = 1 − pL) is the proportion of elements passed on to the left (resp. right) node
  – Stopping criterion
    • Determines when to stop splitting
    • e.g. all elements at a node have an identical representation or the same category

Ref. Machine Learning
Training Procedure: Growing (2)
• E.g. the value of G('cts', 2), using natural logarithms:
  – H(t) = −0.3·log(0.3) − 0.7·log(0.7) = 0.611
  – pL = 5977/7681 ≈ 0.778
  – G('cts', 2) = 0.611 − (0.778·H(0.116) + 0.222·H(0.943)) = 0.611 − 0.328 = 0.283

Node 1: 7681 articles, P(c|n1) = 0.300, split on 'cts' at value 2
  cts < 2 → Node 2: 5977 articles, P(c|n2) = 0.116
  cts ≥ 2 → Node 5: 1704 articles, P(c|n5) = 0.943
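Plugging the node statistics into the gain formula reproduces the numbers on the slide:

```python
from math import log

def H(p):
    """Entropy (in nats, i.e. natural log) of a node where P(c) = p."""
    return -p * log(p) - (1 - p) * log(1 - p)

pL = 5977 / 7681  # fraction of the 7681 articles sent left (cts < 2)
gain = H(0.3) - (pL * H(0.116) + (1 - pL) * H(0.943))
print(round(H(0.3), 3))  # 0.611
print(round(gain, 3))    # 0.283
```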
Training Procedure: Pruning (1)
• Overfitting:
  – E.g.
  – Introduced by errors or noise in the training set
  – Or the insufficiency of the training set
• Solution: Pruning
  – Create a detailed decision tree, then prune the tree to an appropriate size
  – Approaches:
    • Quinlan 1987
    • Quinlan 1993
    • Magerman 1994

Ref to: chap 3 (3.7.1) Machine Learning
Training Procedure: Pruning (2)
• Validation
  – Validation set (cross-validation)
Discussion
• Learning curve
  – Large training set vs. optimal performance
• Can be interpreted easily: the greatest advantage
  – The model is more complicated than classifiers like Naïve Bayes, linear regression, etc.
• Splitting the training set into smaller and smaller subsets makes correct generalization harder (not enough data for reliable prediction)
  – Pruning addresses the problem to some extent.
Part III
• Maximum Entropy Modeling– Data presentation model– Model class– Training procedure
Basic Idea
• Given a set of training documents and their categories
• Select features that represent the empirical data
• Select a probability density function to generate the empirical data
• Find the probability distribution (= decide the parameters of the probability function) that
  – has the maximum entropy H(p) of all the possible p
  – satisfies the constraints given by the features
  – maximizes the likelihood of the data
• A new document is classified under this probability distribution
Data Presentation Model
• Recall the data presentation model used by the decision tree:
  – Each document is represented as a vector of K = 20 integers s(1j), …, s(Kj), where s(ij) is an integer and represents the weight of the feature word
• The features f(i) are defined to characterize any property of a pair (x, c).
Model Class
• Loglinear models:

  p(x, c) = (1/Z) ∏(i=1..K) α(i)^f(i)(x, c)

• K is the number of features, α(i) is the weight of feature f(i), and Z is a normalizing constant, used to ensure that a probability distribution results.
• Classify a new document:
  – Compute p(x, 0) and p(x, 1) and choose the class label with the greater probability
Training Process: Generalized Iterative Scaling
• Given the equation
• Under a set of constraints:
– The expected value of fi for p* is the same as the expected value for the empirical distribution
• There is a unique maximum entropy distribution• There is a computable procedure that converges
to the distribution p* (16.2.1)
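A toy sketch of the GIS iteration (the problem, features, and slack construction below are all invented for illustration; GIS requires the feature values to sum to a constant C for every (x, c), which the slack feature enforces):

```python
# Events are (x, c) pairs; f1 fires only for (1, 1); a slack feature
# f2 = C - f1 keeps the feature sum constant at C = 1, as GIS requires.
def features(x, c):
    f1 = 1 if (x, c) == (1, 1) else 0
    return [f1, 1 - f1]

events = [(0, 0), (0, 1), (1, 0), (1, 1)]
emp_counts = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 3}  # toy training data
n = sum(emp_counts.values())
emp = [sum(emp_counts[e] * features(*e)[i] for e in events) / n for i in range(2)]

alphas, C = [1.0, 1.0], 1
for _ in range(20):
    w = {e: alphas[0] ** features(*e)[0] * alphas[1] ** features(*e)[1]
         for e in events}
    Z = sum(w.values())
    model = [sum(w[e] / Z * features(*e)[i] for e in events) for i in range(2)]
    # GIS update: alpha_i <- alpha_i * (E_emp[f_i] / E_model[f_i]) ** (1 / C)
    alphas = [a * (emp[i] / model[i]) ** (1 / C) for i, a in enumerate(alphas)]

print(model)  # model expectations have converged to the empirical ones
```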
The Principle of Maximum Entropy
• Given by E. T. Jaynes in 1957
• The distribution with maximum entropy is more likely to occur than other distributions.
  – (the entropy of a closed system is continuously increasing?)
• Information entropy as a measurement of 'uninformativeness'
  – If we chose a model with less entropy, we would add 'information' constraints to the model that are not justified by the empirical evidence available to us
Application to Text Categorization
• Feature selection
  – In maximum entropy modeling, feature selection and training are usually integrated
• Test for convergence
  – Compare the log difference between empirical and estimated feature expectations
• Generalized iterative scaling
  – Computationally expensive due to slow convergence
• Vs. Naïve Bayes
  – Both use the prior probability
  – NB supposes no dependency between variables, while MEM doesn't
• Strength
  – Arbitrarily complex features can be defined if the experimenter believes that these features may contribute useful information for the classification decision
  – Unified framework for feature selection and classification
Part IV
• Perceptrons
Models
• Data presentation model:
  – text document is represented as a term vector.
• Model class:
  – f(x) = w·x − θ
• Binary classification
  – For any input text document x,
  – Class(x) = c iff f(x) > 0; else class(x) ≠ c
• Algorithm:
  – The perceptron learning algorithm is a simple example of a gradient descent algorithm
  – The goal is to learn the weight vector w and a threshold θ.
Perceptron Learning Procedure: Gradient Descent
• Gradient descent
  – an optimization algorithm.
  – To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
Perceptron Learning Procedure: Basic Idea
• To find a linear division of the training set
• Procedure:
  – Estimate w and θ; if they make a mistake, we move them in the direction of greatest change for the optimality criterion.
  – For each (x, y) pair
  – Pass (xi, yi, wi) to the update rule w(j)' = w(j) + α(δ − y)x(j)
• Perceptron convergence theorem
  – Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data is linearly separable

Here w(j) is the j-th item in the weight vector, x(j) the j-th item in the input vector, δ the expected output, and y the actual output.
Why
• w(j)' = w(j) + α(δ − y)x(j)
  – Ref: www.cs.ualberta.ca/~sutton/book/8/node3.html
  – Sections 8.1 and 8.2 of Reinforcement Learning; refer to formula 8.2
  – The direction of greatest gradient
  – Where w' = {w1, w2, …, wk, θ}, x' = {x1, x2, …, xk, −1}
E.g.

[Figure: perceptron update example. A misclassified input x is added to the weight vector, moving w to w + x and shifting the decision boundary (s' → s) so that the Yes/No regions classify x correctly.]
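The procedure above can be sketched end to end; the toy data set and learning rate are invented, and θ is absorbed into w via the augmented vectors w' and x' described earlier:

```python
# Perceptron learning with theta folded into the weight vector:
# w' = {w1..wk, theta}, x' = {x1..xk, -1}.
def train(data, k, alpha=1.0, epochs=100):
    w = [0.0] * (k + 1)                     # last slot plays the role of theta
    for _ in range(epochs):
        mistakes = 0
        for x, label in data:               # label (delta) is 0 or 1
            xa = list(x) + [-1.0]           # augmented input x'
            y = 1 if sum(wi * xi for wi, xi in zip(w, xa)) > 0 else 0
            if y != label:                  # update rule w' = w + alpha*(delta - y)*x
                w = [wi + alpha * (label - y) * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:                   # clean pass: converged
            break
    return w

# Linearly separable toy set: class 1 iff at least one coordinate is 1 (OR).
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train(data, k=2)

def predict(x):
    xa = list(x) + [-1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xa)) > 0 else 0

print([predict(x) for x, _ in data])  # [0, 1, 1, 1]
```

Because the data is linearly separable, Novikoff's theorem guarantees the mistake-driven updates stop after finitely many iterations.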
Discussion
• The data set should be linearly separable
  – By 1969, researchers had realized the limitations, and interest in perceptrons remained low
• As a gradient descent algorithm, it doesn't suffer from the local optimum problem
• Back-propagation algorithm, etc.
  – 80's
  – Multi-layer perceptrons, neural networks, connectionist models.
  – Overcome the shortcoming of perceptrons; ideally can learn any classification function (e.g. XOR).
  – Converge more slowly
  – Can get caught in local optima
Part V
• K Nearest Neighbor Classification
Nearest Neighbor
• [Figure: the new point's single nearest neighbor is purple] Category = purple
K Nearest Neighbor
• k = 4
• [Figure: the majority of the new point's 4 nearest neighbors are blue] Category = blue
Discussion
• Similarity metric
  – The complexity of KNN lies in finding a good measure of similarity
  – Its performance is very dependent on the right similarity metric
• Efficiency
  – KNN search can be expensive; however, there are ways of implementing it efficiently, and often there is an obvious choice for a similarity metric
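A brute-force KNN sketch with Euclidean distance as the (assumed) similarity metric; the labeled points are invented to mimic the two-cluster figures above:

```python
from collections import Counter

def knn(train, x, k=4):
    """Classify x by majority vote among its k nearest labeled points,
    using Euclidean distance as the similarity metric."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two clusters labeled "purple" and "blue".
train = [((0, 0), "purple"), ((0, 1), "purple"), ((1, 0), "purple"),
         ((5, 5), "blue"), ((5, 6), "blue"), ((6, 5), "blue"), ((6, 6), "blue")]
print(knn(train, (5.5, 5.5), k=4))  # blue
print(knn(train, (0.5, 0.5), k=1))  # purple (plain nearest neighbor)
```

Brute force is O(n) per query; spatial index structures such as k-d trees are one standard way to speed up the search.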
Thanks!