Short introduction to text categorization including some famous text categorization algorithms.

Page 1: Text categorization

1

Text Categorization

Quang Nguyen
Saltlux Vietnam Development Center

Sept 17, 2010

Page 2: Text categorization

Contents

Definition
Text Categorization Division
Dimensionality Reduction
Machine Learning Approaches
Text Categorization Evaluation

2

Page 3: Text categorization

Definition

Text categorization (TC, also known as text classification or topic spotting): the activity of labelling natural language texts with thematic categories from a predefined set

Example

3


Page 4: Text categorization

Formal Definition

TC assigns a boolean value to each pair (d_j, c_i) ∈ D × C, where D is a domain of documents and C = {c_1, …, c_|C|} is a predefined set of categories.

Value T (true): file d_j under c_i

Value F (false): do not file d_j under c_i

4

Page 5: Text categorization

Applications of Text Categorization

Document Organization
Text Filtering
Word Sense Disambiguation (WSD): treat each word as a document and its senses as categories

5

Page 6: Text categorization

Text Categorization Division

Single-label vs. multi-label TC
Single-label: exactly one category must be assigned to each d_j ∈ D
Multi-label: any number of categories, from 0 to |C|, may be assigned to the same d_j ∈ D

Category-pivoted vs. document-pivoted TC
Document-pivoted: given d_j ∈ D, find all c_i ∈ C under which it should be filed
Category-pivoted: given c_i ∈ C, find all d_j ∈ D that should be filed under it

6

Page 7: Text categorization

Text Categorization Division

Hard categorization vs. ranking categorization

Hard: require T or F for each pair (d_j, c_i)
Ranking: given d_j ∈ D, rank the categories in C = {c_1, …, c_|C|} according to their estimated appropriateness to d_j

Rule-based categorization vs. machine learning categorization

Rule-based: suffers from the knowledge acquisition bottleneck (rules must be elicited from experts), but can give very good results

7

Page 8: Text categorization

Machine Learning Approach to Text Categorization

To construct a classifier that decides, for each d_j ∈ D and c_i ∈ C, whether d_j belongs to c_i, we need:

A function CSV_i : D → [0, 1], whose value represents the evidence for the fact that d_j ∈ c_i

A category threshold τ_i such that CSV_i(d_j) ≥ τ_i is interpreted as T and CSV_i(d_j) < τ_i is interpreted as F

Training set: the classifier CSV_i is built inductively by observing the characteristics of these documents

Test set: used for testing the effectiveness of the classifiers
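The scoring-function-plus-threshold scheme can be sketched in a few lines of Python. This is a minimal illustration only; the category names and scores below are invented, and a real CSV_i function would come from a trained model.

```python
def hard_decisions(csv_scores, thresholds):
    """Turn a ranking classifier into a hard one: for each category c,
    interpret score >= threshold as T (file the document under c)."""
    return {c: score >= thresholds[c] for c, score in csv_scores.items()}

# Hypothetical CSV scores for one document, with per-category thresholds.
decisions = hard_decisions({"sports": 0.82, "politics": 0.31},
                           {"sports": 0.5, "politics": 0.5})
```

With these made-up numbers the document would be filed under "sports" but not under "politics".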

8

Page 9: Text categorization

Document Representation

Document d_j is usually represented as a vector

d_j = (w_{1j}, …, w_{|T|j}), where T is the set of terms (features) and 0 ≤ w_{kj} ≤ 1

Most systems use TFIDF for w_{kj}:

w_{kj} = TF(t_k, d_j) · IDF(t_k) = TF(t_k, d_j) · log(|D| / DF(t_k))

9
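The TFIDF weighting can be sketched as follows. This is a minimal version assuming documents are plain token lists; the cosine normalization that maps weights into [0, 1] is omitted for brevity.

```python
import math

def tfidf(term, doc, docs):
    """w_kj = TF(t_k, d_j) * log(|D| / DF(t_k)).
    doc is a token list; docs is the whole collection (list of token lists)."""
    tf = doc.count(term)                       # raw term frequency in d_j
    df = sum(1 for d in docs if term in d)     # document frequency of t_k
    return tf * math.log(len(docs) / df) if df else 0.0
```

A term that appears in every document gets weight 0, since log(|D|/DF) vanishes.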

Page 10: Text categorization

Dimensionality Reduction

Definition: DR reduces the size of the vector space from |T| to |T′| << |T|; the set T′ is called the reduced term set

Benefits
Reduces index size
Reduces overfitting

Distinction
DR by term selection: T′ is a subset of T
DR by term extraction: the terms in T′ are not of the same type as the terms in T (e.g. if the terms in T are words, the terms in T′ may not be words at all)

10
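Document frequency is one of the simplest term-selection criteria (the slides do not commit to a specific one; information gain and chi-square are common alternatives). A sketch:

```python
def select_terms_by_df(docs, k):
    """DR by term selection: keep the k terms that occur in the most
    documents, so T' is a subset of T. docs is a list of token lists."""
    df = {}
    for d in docs:
        for t in set(d):          # count each term once per document
            df[t] = df.get(t, 0) + 1
    # Sort by descending document frequency, break ties alphabetically.
    return sorted(df, key=lambda t: (-df[t], t))[:k]
```

Rare terms are dropped first, which tends to shrink the index with little loss of accuracy.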

Page 11: Text categorization

DR by Term Selection

11

Page 12: Text categorization

Latent Semantic Indexing

Use concepts instead of words Mathematical model

relates documents and the concepts Looks for concepts in the documents Stores them in a concept space

related documents are connected to form a concept space

Do not need an exact match for the query
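The usual machinery behind LSI is a truncated singular value decomposition of the term-document matrix. A toy sketch (numpy assumed; the matrix and the choice of k below are illustrative):

```python
import numpy as np

def lsi_doc_vectors(A, k):
    """Project documents into a k-dimensional concept space.
    A is the |T| x |D| term-document matrix; rows = terms, cols = documents."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the top-k singular triplets: documents become k-dim
    # concept vectors, and similarity is measured in concept space
    # rather than by exact word overlap.
    return (np.diag(s[:k]) @ Vt[:k]).T   # one row per document
```

Two documents sharing no words can still end up close in concept space if they co-occur with the same terms elsewhere in the collection.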

12

Page 13: Text categorization

Latent Semantic Indexing

13

Page 14: Text categorization

Probabilistic Classifiers

View CSV_i(d_j) as the probability that the document belongs to c_i, computed using Bayes' theorem:

P(c_i | d_j) = P(c_i) · P(d_j | c_i) / P(d_j)

where P(d_j) is the probability that a randomly picked document is d_j, and P(c_i) is the probability that a randomly picked document belongs to c_i.

Problem: the number of possible vectors d_j is too high, so estimating P(d_j | c_i) directly is problematic.

To alleviate this problem, assume that terms occur independently given the category (naïve Bayes classifiers):

P(d_j | c_i) = ∏_{k=1}^{|T|} P(w_{kj} | c_i)

14

Page 15: Text categorization

Naïve Bayes Classifiers

Use binary-valued vector representations for documents

Writing p_{ki} as short for P(w_{kj} = 1 | c_i), each factor P(w_{kj} | c_i) has a simple closed form

Plugging everything together, defining the classifier for c_i requires estimating the parameters {p_{1i}, …, p_{|T|i}} from the training data

15

Page 16: Text categorization

Naïve Bayes Classifiers

16

P(w_{kj} | c_i) = P(w_{kj} = 1 | c_i)^{w_{kj}} · P(w_{kj} = 0 | c_i)^{1 − w_{kj}} = p_{ki}^{w_{kj}} · (1 − p_{ki})^{1 − w_{kj}}

so that

P(d_j | c_i) = ∏_{k: t_k ∈ d_j} p_{ki} · ∏_{k: t_k ∉ d_j} (1 − p_{ki})
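A minimal naïve Bayes sketch along these lines, using binary word features. Laplace smoothing is an added assumption (not stated on the slides) to avoid zero probabilities, and the training data below is invented for illustration.

```python
import math

def train_nb(docs, labels):
    """Estimate priors P(c_i) and p_ki = P(w_k = 1 | c_i), with Laplace
    smoothing. docs is a list of token lists; labels the category of each."""
    vocab = sorted({t for d in docs for t in d})
    prior, p = {}, {}
    for c in sorted(set(labels)):
        in_c = [set(d) for d, l in zip(docs, labels) if l == c]
        prior[c] = len(in_c) / len(docs)
        p[c] = {t: (sum(t in d for d in in_c) + 1) / (len(in_c) + 2)
                for t in vocab}
    return vocab, prior, p

def classify_nb(doc, vocab, prior, p):
    """argmax_c  log P(c) + sum_k log p_ki^{w_k} (1 - p_ki)^{1 - w_k}."""
    present = set(doc)
    def score(c):
        s = math.log(prior[c])
        for t in vocab:
            s += math.log(p[c][t] if t in present else 1 - p[c][t])
        return s
    return max(prior, key=score)
```

Working in log space avoids underflow when |T| is large.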

Page 17: Text categorization

Decision Tree Classifiers

A decision tree classifier is a tree in which internal nodes are labelled by terms, branches departing from them are labelled by tests on the weight that the term has in the test document, and leaves are labelled by categories

17

Page 18: Text categorization

Decision Rule Classifiers

The classifier is built by an inductive rule learning method and consists of a DNF (disjunctive normal form) rule

DNF rules are similar to decision trees, but rule learners tend to generate more compact classifiers than DT learners

18

Page 19: Text categorization

Rocchio Classifiers

Rocchio's method computes the classifier for c_i as a weight vector

c_i = (w_{1i}, …, w_{|T|i}), with

w_{ki} = β · (1/|POS_i|) Σ_{d_j ∈ POS_i} w_{kj} − γ · (1/|NEG_i|) Σ_{d_j ∈ NEG_i} w_{kj}

where β and γ are parameters adjusting the importance of positive and negative examples (POS_i and NEG_i are the positive and negative training documents for c_i)

Easy to implement, quite efficient
Drawback: misses most documents if the documents in a category tend to occur in disjoint clusters

19
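Rocchio's centroid computation can be sketched directly. The default values β = 16 and γ = 4 below are conventional choices from the relevance-feedback literature, not values fixed by the slides; the input vectors are assumed to be TFIDF weight lists of equal length.

```python
def rocchio_centroid(pos, neg, beta=16.0, gamma=4.0):
    """w_ki = beta * (mean weight over positives) - gamma * (mean over negatives).
    pos and neg are lists of equal-length document weight vectors."""
    n = len(pos[0])
    centroid = []
    for k in range(n):
        p = sum(d[k] for d in pos) / len(pos)
        q = sum(d[k] for d in neg) / len(neg) if neg else 0.0
        centroid.append(beta * p - gamma * q)
    return centroid
```

A test document is then filed under c_i when it is close enough (e.g. by cosine similarity) to this centroid, which is exactly why disjoint clusters of positives cause trouble: one centroid cannot sit near all of them.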

Page 20: Text categorization

Neural Networks Classifiers

A neural network (NN) text classifier is a network of units: input units represent terms, output units represent categories.

The typical way of training NNs is backpropagation: the term weights of a training document are loaded into the input units; if a misclassification occurs, the error is "backpropagated" to adjust the parameters

20

[Figure: feedforward network with input units t_1 … t_|T|, hidden layers, and output units c_1 … c_|C|]

Page 21: Text categorization

K-Nearest Neighbour Classifiers

To decide whether d_j ∈ c_i:
find the k documents most similar to d_j
pick the most popular category among those k documents

A simple method that works well when the document similarity measure is accurate

Training data is not used to build an explicit model
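The kNN decision rule can be sketched as follows. Euclidean distance between weight vectors is assumed here purely for concreteness; the slides leave the similarity measure abstract, and cosine similarity is at least as common for text.

```python
def knn_classify(doc_vec, training, k=3):
    """training is a list of (weight_vector, category) pairs.
    Return the majority category among the k nearest training documents."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(training, key=lambda tc: dist(doc_vec, tc[0]))[:k]
    cats = [c for _, c in nearest]
    return max(set(cats), key=cats.count)   # most popular category
```

Note that all the work happens at classification time; no model is built in advance, matching the last point above.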

21

[Figure: training set with classes c_1, c_2, c_3; a new document is assigned the majority class among its k = 3 nearest neighbours]

Page 22: Text categorization

Rocchio vs. kNN

Rocchio (a): misses most of the documents because the centroid falls outside the clusters

kNN (b): overcomes this problem of Rocchio

22

Page 23: Text categorization

Support Vector Machine

A support vector machine (SVM) looks for the hyperplane with the maximum margin between positive and negative examples in the training documents.

23

[Figure: optimal separating hyperplane with maximum margin; the documents lying on the margin are the support vectors]

Page 24: Text categorization

Support Vector Machine

Does not use all training documents: only the support vectors (documents near the border) determine the classifier

Also applicable to the case in which positives and negatives are not linearly separable

24

Page 25: Text categorization

Text Categorization Evaluation

Precision

Recall

F1: the harmonic mean of precision and recall

25

precision(c_i) = TP_i / (TP_i + FP_i)

recall(c_i) = TP_i / (TP_i + FN_i)

F1(c_i) = 2 · precision · recall / (precision + recall)

Contingency table for c_i: TP_i, FP_i, FN_i, TN_i count the documents correctly/incorrectly assigned to or rejected from c_i
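The three measures follow directly from the contingency counts; a sketch (argument names are illustrative):

```python
def prf1(tp, fp, fn):
    """Per-category precision, recall, and F1 from contingency-table counts,
    with the conventional 0.0 when a denominator is zero."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1
```

For example, 8 true positives with 2 false positives and 2 false negatives give precision, recall, and F1 all equal to 0.8.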

Page 26: Text categorization

Text Categorization Evaluation

Breakeven point: the value at which precision equals recall

[Figure: precision P_i and recall R_i plotted against the category threshold τ_i, with both axes scaled from 0 to 1.0; the two curves cross at the breakeven point]

26

Page 27: Text categorization

References

Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, pp. 1-47, 2002.

Manu Konchady. Text Mining Application Programming. Charles River Media, 2006.

27

Page 28: Text categorization

Thank you!

28