A Survey on Text Categorization with Machine Learning
Chikayama lab. Dai Saito


Page 1:

A Survey on Text Categorization with Machine Learning

Chikayama lab. Dai Saito

Page 2:

Introduction: Text Categorization

Many digital texts are available: e-mail, online news, blogs, …

The need for automatic text categorization, which works without human effort, is increasing; it has merits in time and cost.

Page 3:

Introduction: Text Categorization

Applications: spam filtering, topic categorization.

Page 4:

Introduction: Machine Learning

Make categorization rules automatically from features of the text.

Types of machine learning (ML):
- Supervised learning: labeling
- Unsupervised learning: clustering

Page 5:

Introduction: Flow of ML

1. Prepare labeled training texts and extract features from each text
2. Learn a classifier
3. Categorize new texts

(Figure: training texts tagged Label1 and Label2 flow into the learner.)

Page 6:

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion

Page 7:

Number of labels

- Binary-label: true or false (ex. spam or not); binary classifiers can also be applied to the other types
- Multi-label: many labels, but one text has exactly one label
- Overlapping-label: one text can have several labels

(Figure: a Yes/No decision; a choice of one label among L1..L4; a choice of several labels among L1..L4.)

Page 8:

Types of labels

- Topic categorization: the basic task; compares individual words
- Author categorization
- Sentiment categorization: ex) reviews of products; needs more linguistic information

Page 9:

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion

Page 10:

Feature of Text

How to express a feature of a text? “Bag of Words”: ignore the order of words and the structure.

Ex) I like this car. | I don’t like this car. “Bag of Words” will not work well on such structural differences.

(d: document = text, t: term = word)
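As an aside (not on the slides), a minimal bag-of-words sketch in Python; the example sentences come from the slide:

from collections import Counter

def bag_of_words(document):
    """Count term occurrences, ignoring word order and structure."""
    return Counter(document.lower().split())

d1 = bag_of_words("I like this car")
d2 = bag_of_words("I don't like this car")
# The two bags differ only in the count of "don't"; word order and
# negation scope are lost.
print(d1, d2)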

Page 11:

Preprocessing

Remove stop words: “the”, “a”, “for”, …

Stemming: relational -> relate, truly -> true
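A toy preprocessing pass (my own sketch): the small stop-word list and crude suffix rules below only stand in for a full stop list and a real stemmer such as Porter's:

STOP_WORDS = {"the", "a", "an", "for", "of", "to", "in", "is"}
SUFFIXES = [("ational", "ate"), ("ly", ""), ("ing", ""), ("s", "")]

def preprocess(tokens):
    out = []
    for t in tokens:
        t = t.lower()
        if t in STOP_WORDS:
            continue                                 # remove stop words
        for suffix, replacement in SUFFIXES:
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[:-len(suffix)] + replacement   # crude stemming
                break
        out.append(t)
    return out

print(preprocess("The relational model truly works".split()))
# ['relate', 'model', 'tru', 'work']  (approximate stems)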

Page 12:

Term Weighting

Term frequency (tf): the number of occurrences of a term in a document. Terms frequent in a document seem to be important for categorization.

tf·idf: terms appearing in many documents are not useful for categorization, so tf is scaled down by document frequency, e.g. tfidf(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t.
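A minimal tf·idf sketch, assuming the common variant tf × log(N/df) given above (the slides do not pin the exact formula down):

import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists; returns one {term: weight} dict per doc."""
    N = len(docs)
    df = Counter()                        # document frequency per term
    for d in docs:
        df.update(set(d))
    weights = []
    for d in docs:
        tf = Counter(d)                   # term frequency in this doc
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["price", "car", "car"], ["car", "engine"], ["price", "stock"]]
print(tfidf(docs))
# "car" occurs in 2 of 3 docs -> low idf; "engine" in only 1 -> high idf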

Page 13:

Sentiment Weighting

For sentiment classification, weight a word as positive or negative.

Constructing a sentiment dictionary with WordNet [04 Kamps et al.], a synonym database, using the distance from ‘good’ and from ‘bad’.

(Figure: path through the synonym graph; d(good, happy) = 2, d(bad, happy) = 4.)
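A small sketch of the idea from [04 Kamps et al.], with a hand-made toy synonym graph standing in for WordNet; the edges below are hypothetical, and only the distances d(good, happy) = 2 and d(bad, happy) = 4 come from the slide:

from collections import deque

GRAPH = {                                 # hypothetical synonym edges
    "good": ["fine", "virtuous"],
    "fine": ["good", "happy"],
    "happy": ["fine", "glad"],
    "glad": ["happy", "unhappy"],
    "virtuous": ["good"],
    "bad": ["poor", "evil"],
    "poor": ["bad", "unhappy"],
    "unhappy": ["poor", "glad"],
    "evil": ["bad"],
}

def distance(src, dst):
    """Shortest path length in the synonym graph (breadth-first search)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in GRAPH.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def polarity(word):
    # negative value -> closer to 'good' -> positive word
    return distance("good", word) - distance("bad", word)

print(distance("good", "happy"), distance("bad", "happy"))  # 2 4
print(polarity("happy"))                                    # -2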

Page 14:

Dimension Reduction

The size of the feature matrix is (#terms) × (#documents), with #terms ≒ the size of the dictionary: high calculation cost and a risk of overfitting (best for training data ≠ best for real data).

Choosing effective features improves both accuracy and calculation cost.

Page 15:

Dimension Reduction

df-threshold: terms appearing in very few documents (ex. only one) are not important.

Score(t, cj): measures the dependence between term t and category cj; if t and cj are independent, the Score is equal to zero.
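One common Score with this zero-under-independence property is chi-square; the slide does not name the statistic, so the choice here is an assumption:

# For term t and category c over N documents, the 2x2 contingency table is
#   A = docs in c containing t        B = docs outside c containing t
#   C = docs in c without t           D = docs outside c without t
def chi_square(A, B, C, D):
    N = A + B + C + D
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator if denominator else 0.0

# Independent term: same rate inside and outside c -> Score 0.
print(chi_square(A=10, B=30, C=10, D=30))   # 0.0
# Dependent term: occurs mostly inside c -> large Score.
print(chi_square(A=18, B=2, C=2, D=58))     # about 60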

Page 16:

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion

Page 17:

Learning Algorithm

Many (almost all?) learning algorithms have been used for text categorization.

Simple approaches: Naïve Bayes, k-Nearest Neighbor
High-performance approaches: Boosting, Support Vector Machine
Hierarchical learning

Page 18:

Naïve Bayes

Bayes rule: P(c|d) = P(c) P(d|c) / P(d)

P(d|c) is hard to calculate directly. Assumption: each term occurs independently, so P(d|c) ≈ Πi P(ti|c).
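A minimal multinomial Naïve Bayes sketch built directly on this independence assumption; the add-one smoothing is my addition, not on the slides:

import math
from collections import Counter, defaultdict

def train(docs, labels):
    prior = Counter(labels)                     # class counts for P(c)
    term_counts = defaultdict(Counter)          # per-class term counts
    vocab = set()
    for d, c in zip(docs, labels):
        term_counts[c].update(d)
        vocab.update(d)
    return prior, term_counts, vocab

def classify(d, prior, term_counts, vocab):
    N = sum(prior.values())
    best, best_score = None, -math.inf
    for c in prior:
        total = sum(term_counts[c].values())
        score = math.log(prior[c] / N)          # log P(c)
        for t in d:                             # + sum of log P(t|c)
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["cheap", "pills"], ["meeting", "today"], ["cheap", "offer"]]
labels = ["spam", "ham", "spam"]
model = train(docs, labels)
print(classify(["cheap", "meeting"], *model))   # spam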

Page 19:

k-Nearest Neighbor

Define a “distance” between two texts. Ex) Sim(d1, d2) = d1 · d2 / (|d1||d2|) = cos θ

Check the k texts of highest similarity and categorize by majority vote.

Since all training data must be stored and searched, larger training sets raise memory and search costs.

(Figure: angle θ between d1 and d2; k = 3.)
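A sketch of k-NN with the cosine similarity above, on toy data (k = 3 as in the figure):

import math
from collections import Counter

def cosine(d1, d2):
    """Sim(d1, d2) = d1·d2 / (|d1||d2|) over term-count dicts."""
    dot = sum(v * d2.get(t, 0) for t, v in d1.items())
    n1 = math.sqrt(sum(v * v for v in d1.values()))
    n2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn(query, training, k=3):
    """training: (term-count dict, label) pairs; majority vote of the k nearest."""
    ranked = sorted(training, key=lambda x: cosine(query, x[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    (Counter("cheap pills cheap".split()), "spam"),
    (Counter("meeting agenda today".split()), "ham"),
    (Counter("cheap offer today".split()), "spam"),
    (Counter("project meeting notes".split()), "ham"),
]
print(knn(Counter("cheap offer".split()), training))   # spam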

Page 20:

Boosting

BoosTexter [00 Schapire et al.]: AdaBoost applied to text.

AdaBoost makes many “weak learners” with different parameters. The K-th weak learner checks the performance of learners 1..K-1 and tries to classify correctly the training data on which they scored worst.

BoosTexter uses a Decision Stump as the “weak learner”.
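A toy binary AdaBoost sketch with one-term decision stumps; real BoosTexter is multi-label, so this is a simplification:

import math

def train_adaboost(docs, y, vocab, rounds=3):
    """docs: term sets; y: +1/-1 labels. Stump: +1 if term in doc, else -1."""
    n = len(docs)
    w = [1.0 / n] * n                             # example weights
    ensemble = []                                 # (alpha, term) pairs
    for _ in range(rounds):
        # pick the stump with the lowest weighted error
        best_term, best_err = None, 1.0
        for term in vocab:
            err = sum(wi for wi, d, yi in zip(w, docs, y)
                      if (1 if term in d else -1) != yi)
            if err < best_err:
                best_term, best_err = term, err
        alpha = 0.5 * math.log((1 - best_err) / max(best_err, 1e-10))
        # raise the weight of the examples this stump got wrong
        for i, (d, yi) in enumerate(zip(docs, y)):
            pred = 1 if best_term in d else -1
            w[i] *= math.exp(-alpha * yi * pred)
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, best_term))
    return ensemble

def predict(ensemble, d):
    score = sum(a * (1 if t in d else -1) for a, t in ensemble)
    return 1 if score >= 0 else -1

docs = [{"cheap", "pills"}, {"meeting"}, {"cheap"}, {"agenda", "today"}]
y = [1, -1, 1, -1]                                # +1 = spam, -1 = ham
model = train_adaboost(docs, y, {"cheap", "pills", "meeting", "agenda", "today"})
print(predict(model, {"cheap", "offer"}))         # 1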

Page 21:

Simple example of Boosting

(Figure: three boosting rounds on a toy set of + and - points; each round a new stump splits the points, emphasizing the points misclassified so far.)

Page 22:

Support Vector Machine

Text categorization with SVM [98 Joachims]: maximize the margin between the two classes.

Page 23:

Text Categorization with SVM

SVM works well for text categorization: robustness to high dimensionality and robustness to overfitting.

Most text categorization problems are linearly separable: all of OHSUMED (MEDLINE collection) and most of Reuters-21578 (news collection).
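A short sketch of a linear SVM for text in the spirit of [98 Joachims]; it assumes scikit-learn is available, which the slides do not mention:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["cheap pills offer", "meeting agenda today",
         "cheap offer now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = TfidfVectorizer()            # bag of words with tf-idf weights
X = vectorizer.fit_transform(texts)       # high-dimensional sparse vectors
classifier = LinearSVC()                  # margin-maximizing linear classifier
classifier.fit(X, labels)

print(classifier.predict(vectorizer.transform(["cheap meeting offer"])))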

Page 24:

Comparison of these methods

[02 Sebastiani], on Reuters-21578 (2 versions; the difference is the number of categories):

Method       Ver.1 (90)  Ver.2 (10)
SVM          .920        .870
Boosting     .878        -
Naïve Bayes  .795        .815
k-NN         .860        .823

Page 25:

Hierarchical Learning

TreeBoost [06 Esuli et al.]: a boosting algorithm for hierarchical labels. The training data are the label hierarchy and texts tagged with labels. Applying AdaBoost recursively gives a better classifier than ‘flat’ AdaBoost: accuracy up 2-3%, and both training and categorization time go down.

Hierarchical SVM [04 Cai et al.]

Page 26:

TreeBoost

(Figure: label hierarchy. root -> L1, L2, L3, L4; L1 -> L11, L12; L4 -> L41, L42, L43; L42 -> L421, L422.)
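A sketch of the recursive structure (one classifier per internal node, each choosing among that node's children); a trivial term-overlap scorer stands in for the per-node AdaBoost of the actual paper:

from collections import Counter

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.profiles = {}                       # child name -> term counts

def train(node, docs, depth=0):
    """docs: (terms, label_path) pairs, e.g. (["stock"], ["L4", "L42"])."""
    if not node.children:
        return
    for child in node.children:
        routed = [(d, p) for d, p in docs
                  if len(p) > depth and p[depth] == child.name]
        terms = Counter()
        for d, _ in routed:
            terms.update(d)                      # training data for this branch
        node.profiles[child.name] = terms
        train(child, routed, depth + 1)          # recurse down the hierarchy

def classify(node, d):
    while node.children:                         # one decision per level
        node = max(node.children,
                   key=lambda c: sum(node.profiles[c.name][t] for t in d))
    return node.name

tree = Node("root", [Node("L1"), Node("L4", [Node("L41"), Node("L42")])])
docs = [(["stock", "price"], ["L4", "L41"]),
        (["stock", "merger"], ["L4", "L42"]),
        (["football"], ["L1"])]
train(tree, docs)
print(classify(tree, ["stock", "merger"]))       # L42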

Page 27:

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion

Page 28:

Conclusion

An overview of text categorization with machine learning: features of text and learning algorithms.

Future work: natural language processing with machine learning, especially for Japanese, and calculation cost.