A Survey on Text Categorization with Machine Learning
Chikayama lab. Dai Saito


Page 1:

A Survey on Text Categorization with Machine Learning

Chikayama lab. Dai Saito

Page 2:

Introduction: Text Categorization

Many digital texts are available: e-mail, online news, blogs, …

The need for automatic text categorization, which works without human effort, is increasing; it has merits in time and cost.

Page 3:

Introduction: Text Categorization

Applications: spam filtering, topic categorization.

Page 4:

Introduction: Machine Learning

Make categorization rules automatically from features of the text.

Types of machine learning (ML):
- Supervised learning: labeling
- Unsupervised learning: clustering

Page 5:

Introduction: Flow of ML

1. Prepare labeled training texts and extract features from each text
2. Learn a classifier
3. Categorize new texts

(Figure: training texts tagged Label1 and Label2 flow into the learner.)

Page 6:

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion

Page 7:

Number of labels

- Binary-label: true or false (ex. spam or not); binary classifiers can also be applied to the other types
- Multi-label: many labels, but one text has exactly one label
- Overlapping-label: one text can have several labels

(Figure: a Yes/No decision; a choice of one label among L1..L4; a choice of several labels among L1..L4.)

Page 8:

Types of labels

- Topic categorization: the basic task; compares individual words
- Author categorization
- Sentiment categorization: ex) reviews of products; needs more linguistic information

Page 9:

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion

Page 10:

Feature of Text

How to express a feature of a text? “Bag of Words”: ignore the order of words and the structure.

Ex) I like this car. | I don’t like this car. “Bag of Words” will not work well on such structural differences.

(d: document = text, t: term = word)
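As an aside (not on the slides), a minimal bag-of-words sketch in Python; the example sentences come from the slide:

from collections import Counter

def bag_of_words(document):
    """Count term occurrences, ignoring word order and structure."""
    return Counter(document.lower().split())

d1 = bag_of_words("I like this car")
d2 = bag_of_words("I don't like this car")
# The two bags differ only in the count of "don't"; word order and
# negation scope are lost.
print(d1, d2)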

Page 11:

Preprocessing

Remove stop words: “the”, “a”, “for”, …

Stemming: relational -> relate, truly -> true
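A toy preprocessing pass (my own sketch): the small stop-word list and crude suffix rules below only stand in for a full stop list and a real stemmer such as Porter's:

STOP_WORDS = {"the", "a", "an", "for", "of", "to", "in", "is"}
SUFFIXES = [("ational", "ate"), ("ly", ""), ("ing", ""), ("s", "")]

def preprocess(tokens):
    out = []
    for t in tokens:
        t = t.lower()
        if t in STOP_WORDS:
            continue                                 # remove stop words
        for suffix, replacement in SUFFIXES:
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[:-len(suffix)] + replacement   # crude stemming
                break
        out.append(t)
    return out

print(preprocess("The relational model truly works".split()))
# ['relate', 'model', 'tru', 'work']  (approximate stems)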

Page 12:

Term Weighting

Term frequency (tf): the number of occurrences of a term in a document. Terms frequent in a document seem to be important for categorization.

tf·idf: terms appearing in many documents are not useful for categorization, so tf is scaled down by document frequency, e.g. tfidf(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t.
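A minimal tf·idf sketch, assuming the common variant tf × log(N/df) given above (the slides do not pin the exact formula down):

import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists; returns one {term: weight} dict per doc."""
    N = len(docs)
    df = Counter()                        # document frequency per term
    for d in docs:
        df.update(set(d))
    weights = []
    for d in docs:
        tf = Counter(d)                   # term frequency in this doc
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["price", "car", "car"], ["car", "engine"], ["price", "stock"]]
print(tfidf(docs))
# "car" occurs in 2 of 3 docs -> low idf; "engine" in only 1 -> high idf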

Page 13:

Sentiment Weighting

For sentiment classification, weight a word as positive or negative.

Constructing a sentiment dictionary with WordNet [04 Kamps et al.], a synonym database, using the distance from ‘good’ and from ‘bad’.

(Figure: path through the synonym graph; d(good, happy) = 2, d(bad, happy) = 4.)
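A small sketch of the idea from [04 Kamps et al.], with a hand-made toy synonym graph standing in for WordNet; the edges below are hypothetical, and only the distances d(good, happy) = 2 and d(bad, happy) = 4 come from the slide:

from collections import deque

GRAPH = {                                 # hypothetical synonym edges
    "good": ["fine", "virtuous"],
    "fine": ["good", "happy"],
    "happy": ["fine", "glad"],
    "glad": ["happy", "unhappy"],
    "virtuous": ["good"],
    "bad": ["poor", "evil"],
    "poor": ["bad", "unhappy"],
    "unhappy": ["poor", "glad"],
    "evil": ["bad"],
}

def distance(src, dst):
    """Shortest path length in the synonym graph (breadth-first search)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in GRAPH.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def polarity(word):
    # negative value -> closer to 'good' -> positive word
    return distance("good", word) - distance("bad", word)

print(distance("good", "happy"), distance("bad", "happy"))  # 2 4
print(polarity("happy"))                                    # -2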

Page 14:

Dimension Reduction

The size of the feature matrix is (#terms) × (#documents), with #terms ≒ the size of the dictionary: high calculation cost and a risk of overfitting (best for training data ≠ best for real data).

Choosing effective features improves both accuracy and calculation cost.

Page 15:

Dimension Reduction

df-threshold: terms appearing in very few documents (ex. only one) are not important.

Score(t, cj): measures the dependence between term t and category cj; if t and cj are independent, the Score is equal to zero.
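One common Score with this zero-under-independence property is chi-square; the slide does not name the statistic, so the choice here is an assumption:

# For term t and category c over N documents, the 2x2 contingency table is
#   A = docs in c containing t        B = docs outside c containing t
#   C = docs in c without t           D = docs outside c without t
def chi_square(A, B, C, D):
    N = A + B + C + D
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator if denominator else 0.0

# Independent term: same rate inside and outside c -> Score 0.
print(chi_square(A=10, B=30, C=10, D=30))   # 0.0
# Dependent term: occurs mostly inside c -> large Score.
print(chi_square(A=18, B=2, C=2, D=58))     # about 60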

Page 16:

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion

Page 17:

Learning Algorithm

Many (almost all?) learning algorithms have been used for text categorization.

Simple approaches: Naïve Bayes, k-Nearest Neighbor
High-performance approaches: Boosting, Support Vector Machine
Hierarchical learning

Page 18:

Naïve Bayes

Bayes rule: P(c|d) = P(c) P(d|c) / P(d)

P(d|c) is hard to calculate directly. Assumption: each term occurs independently, so P(d|c) ≈ Πi P(ti|c).
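A minimal multinomial Naïve Bayes sketch built directly on this independence assumption; the add-one smoothing is my addition, not on the slides:

import math
from collections import Counter, defaultdict

def train(docs, labels):
    prior = Counter(labels)                     # class counts for P(c)
    term_counts = defaultdict(Counter)          # per-class term counts
    vocab = set()
    for d, c in zip(docs, labels):
        term_counts[c].update(d)
        vocab.update(d)
    return prior, term_counts, vocab

def classify(d, prior, term_counts, vocab):
    N = sum(prior.values())
    best, best_score = None, -math.inf
    for c in prior:
        total = sum(term_counts[c].values())
        score = math.log(prior[c] / N)          # log P(c)
        for t in d:                             # + sum of log P(t|c)
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["cheap", "pills"], ["meeting", "today"], ["cheap", "offer"]]
labels = ["spam", "ham", "spam"]
model = train(docs, labels)
print(classify(["cheap", "meeting"], *model))   # spam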

Page 19:

k-Nearest Neighbor

Define a “distance” between two texts. Ex) Sim(d1, d2) = d1 · d2 / (|d1||d2|) = cos θ

Check the k texts of highest similarity and categorize by majority vote.

Since all training data must be stored and searched, larger training sets raise memory and search costs.

(Figure: angle θ between d1 and d2; k = 3.)
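A sketch of k-NN with the cosine similarity above, on toy data (k = 3 as in the figure):

import math
from collections import Counter

def cosine(d1, d2):
    """Sim(d1, d2) = d1·d2 / (|d1||d2|) over term-count dicts."""
    dot = sum(v * d2.get(t, 0) for t, v in d1.items())
    n1 = math.sqrt(sum(v * v for v in d1.values()))
    n2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn(query, training, k=3):
    """training: (term-count dict, label) pairs; majority vote of the k nearest."""
    ranked = sorted(training, key=lambda x: cosine(query, x[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    (Counter("cheap pills cheap".split()), "spam"),
    (Counter("meeting agenda today".split()), "ham"),
    (Counter("cheap offer today".split()), "spam"),
    (Counter("project meeting notes".split()), "ham"),
]
print(knn(Counter("cheap offer".split()), training))   # spam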

Page 20:

Boosting

BoosTexter [00 Schapire et al.]: AdaBoost applied to text.

AdaBoost makes many “weak learners” with different parameters. The K-th weak learner checks the performance of learners 1..K-1 and tries to classify correctly the training data on which they scored worst.

BoosTexter uses a Decision Stump as the “weak learner”.
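A toy binary AdaBoost sketch with one-term decision stumps; real BoosTexter is multi-label, so this is a simplification:

import math

def train_adaboost(docs, y, vocab, rounds=3):
    """docs: term sets; y: +1/-1 labels. Stump: +1 if term in doc, else -1."""
    n = len(docs)
    w = [1.0 / n] * n                             # example weights
    ensemble = []                                 # (alpha, term) pairs
    for _ in range(rounds):
        # pick the stump with the lowest weighted error
        best_term, best_err = None, 1.0
        for term in vocab:
            err = sum(wi for wi, d, yi in zip(w, docs, y)
                      if (1 if term in d else -1) != yi)
            if err < best_err:
                best_term, best_err = term, err
        alpha = 0.5 * math.log((1 - best_err) / max(best_err, 1e-10))
        # raise the weight of the examples this stump got wrong
        for i, (d, yi) in enumerate(zip(docs, y)):
            pred = 1 if best_term in d else -1
            w[i] *= math.exp(-alpha * yi * pred)
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, best_term))
    return ensemble

def predict(ensemble, d):
    score = sum(a * (1 if t in d else -1) for a, t in ensemble)
    return 1 if score >= 0 else -1

docs = [{"cheap", "pills"}, {"meeting"}, {"cheap"}, {"agenda", "today"}]
y = [1, -1, 1, -1]                                # +1 = spam, -1 = ham
model = train_adaboost(docs, y, {"cheap", "pills", "meeting", "agenda", "today"})
print(predict(model, {"cheap", "offer"}))         # 1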

Page 21:

Simple example of Boosting

(Figure: three boosting rounds on a toy set of + and - points; each round a new stump splits the points, emphasizing the points misclassified so far.)

Page 22:

Support Vector Machine

Text categorization with SVM [98 Joachims]: maximize the margin between the two classes.

Page 23:

Text Categorization with SVM

SVM works well for text categorization: robustness to high dimensionality and robustness to overfitting.

Most text categorization problems are linearly separable: all of OHSUMED (MEDLINE collection) and most of Reuters-21578 (news collection).
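A short sketch of a linear SVM for text in the spirit of [98 Joachims]; it assumes scikit-learn is available, which the slides do not mention:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["cheap pills offer", "meeting agenda today",
         "cheap offer now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = TfidfVectorizer()            # bag of words with tf-idf weights
X = vectorizer.fit_transform(texts)       # high-dimensional sparse vectors
classifier = LinearSVC()                  # margin-maximizing linear classifier
classifier.fit(X, labels)

print(classifier.predict(vectorizer.transform(["cheap meeting offer"])))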

Page 24:

Comparison of these methods

[02 Sebastiani], on Reuters-21578 (2 versions; the difference is the number of categories):

Method       Ver.1 (90)  Ver.2 (10)
SVM          .920        .870
Boosting     .878        -
Naïve Bayes  .795        .815
k-NN         .860        .823

Page 25:

Hierarchical Learning

TreeBoost [06 Esuli et al.]: a boosting algorithm for hierarchical labels. The training data are the label hierarchy and texts tagged with labels. Applying AdaBoost recursively gives a better classifier than ‘flat’ AdaBoost: accuracy up 2-3%, and both training and categorization time go down.

Hierarchical SVM [04 Cai et al.]

Page 26:

TreeBoost

(Figure: label hierarchy. root -> L1, L2, L3, L4; L1 -> L11, L12; L4 -> L41, L42, L43; L42 -> L421, L422.)
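A sketch of the recursive structure (one classifier per internal node, each choosing among that node's children); a trivial term-overlap scorer stands in for the per-node AdaBoost of the actual paper:

from collections import Counter

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.profiles = {}                       # child name -> term counts

def train(node, docs, depth=0):
    """docs: (terms, label_path) pairs, e.g. (["stock"], ["L4", "L42"])."""
    if not node.children:
        return
    for child in node.children:
        routed = [(d, p) for d, p in docs
                  if len(p) > depth and p[depth] == child.name]
        terms = Counter()
        for d, _ in routed:
            terms.update(d)                      # training data for this branch
        node.profiles[child.name] = terms
        train(child, routed, depth + 1)          # recurse down the hierarchy

def classify(node, d):
    while node.children:                         # one decision per level
        node = max(node.children,
                   key=lambda c: sum(node.profiles[c.name][t] for t in d))
    return node.name

tree = Node("root", [Node("L1"), Node("L4", [Node("L41"), Node("L42")])])
docs = [(["stock", "price"], ["L4", "L41"]),
        (["stock", "merger"], ["L4", "L42"]),
        (["football"], ["L1"])]
train(tree, docs)
print(classify(tree, ["stock", "merger"]))       # L42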

Page 27:

Outline

Introduction
Text Categorization
Feature of Text
Learning Algorithm
Conclusion

Page 28:

Conclusion

An overview of text categorization with machine learning: features of text and learning algorithms.

Future work: natural language processing with machine learning, especially for Japanese, and calculation cost.