A Survey on Text Categorization with Machine Learning
Chikayama Lab., Dai Saito
Introduction: Text Categorization
- Many digital texts are available: e-mail, online news, blogs, ...
- The need for automatic text categorization, without human labor, is increasing: it saves time and cost
Introduction: Text Categorization
- Applications: spam filtering, topic categorization
Introduction: Machine Learning
- Builds categorization rules automatically from features of the text
- Types of machine learning (ML):
  - Supervised learning: labeling
  - Unsupervised learning: clustering
Introduction: Flow of ML
1. Prepare labeled training text data (represented by features of the text)
2. Learn
3. Categorize a new text
[Diagram: documents labeled "Label 1" and "Label 2" train a model, which assigns a label to a new, unlabeled ("?") document]
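A minimal sketch of this three-step flow, assuming scikit-learn and a toy spam/ham task (both are my choices; the slides name no library or data):

```python
# Minimal train-then-categorize flow (scikit-learn is my choice here).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Prepare labeled training texts
train_texts = ["cheap pills, buy now", "meeting moved to 3pm"]
train_labels = ["spam", "ham"]

# 2. Learn: extract features (bag of words) and fit a classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# 3. Categorize a new text
print(model.predict(["buy cheap pills"]))  # -> ['spam']
```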
Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion
Number of labels
- Binary labels: true or false (e.g., spam or not); binary classifiers can be applied to the other label types
- Multi-label: many labels, but each text gets exactly one label
- Overlapping labels: one text can have several labels
[Diagram: binary labels (Yes / No); multi-label (one of L1-L4 per text); overlapping labels (several of L1-L4 per text)]
Types of labels
- Topic categorization: the basic task; compares individual words
- Author categorization
- Sentiment categorization, e.g., reviews of products; needs more linguistic information
Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion
Feature of Text
- How to express the features of a text? "Bag of Words"
- Ignores word order and sentence structure
- Ex) "I like this car." vs. "I don't like this car.": "Bag of Words" will not work well when such structure matters
- Notation: d = document (text), t = term (word)
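A toy illustration of the "Bag of Words" idea (the example sentence is mine):

```python
# Toy bag of words: a document d becomes a term -> count mapping,
# discarding word order and structure.
from collections import Counter

def bag_of_words(d: str) -> Counter:
    return Counter(d.lower().split())

print(bag_of_words("I like this car"))
# Note: "I like this car" and "car this like I" map to the same bag,
# which is exactly the information loss the slide describes.
```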
Preprocessing
- Remove stop words: "the", "a", "for", ...
- Stemming: relational -> relate, truly -> true
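A toy sketch of these two preprocessing steps; the stop-word list and suffix rules below are illustrative stand-ins of my own, not a real stemmer such as Porter's:

```python
# Toy preprocessing: stop-word removal plus a crude suffix stripper.
STOP_WORDS = {"the", "a", "for", "this", "i"}
SUFFIXES = ("ional", "ly", "ing", "s")  # checked in this order

def stem(term: str) -> str:
    # crude suffix stripping; real stemmers are far more careful
    for suf in SUFFIXES:
        if term.endswith(suf) and len(term) > len(suf) + 2:
            return term[:-len(suf)]
    return term

def preprocess(text: str) -> list[str]:
    terms = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in terms]

print(preprocess("the relational terms truly matter"))
```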
Term Weighting
- Term frequency (tf): the number of occurrences of a term in a document; terms frequent within a document seem important for categorization
- tf·idf: weights tf by the inverse document frequency, tf·idf(t, d) = tf(t, d) × log(N / df(t)); terms appearing in many documents are not useful for categorization, so idf discounts them (see the sketch below)
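A small sketch of the standard tf·idf computation (the toy documents are mine):

```python
# tf-idf(t, d) = tf(t, d) * log(N / df(t)): frequent-in-document terms
# score high, but terms spread over many documents are discounted.
import math
from collections import Counter

docs = [["cat", "sat", "mat"], ["cat", "cat", "ran"], ["dog", "ran"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequency

def tfidf(t: str, d: list[str]) -> float:
    return d.count(t) * math.log(N / df[t])

print(tfidf("cat", docs[1]))  # "cat" is frequent in docs[1], df = 2
print(tfidf("ran", docs[1]))
```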
Sentiment Weighting
- For sentiment classification, weight each word as positive or negative
- Constructing a sentiment dictionary with WordNet [04 Kamps et al.], a synonym database, using a word's distance from 'good' and from 'bad'
- Ex) d(good, happy) = 2, d(bad, happy) = 4
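A hedged sketch of the distance idea in [04 Kamps et al.]: the paper measures path length in WordNet's synonymy graph, while the tiny graph and breadth-first search below are stand-ins of my own, wired so the slide's example distances hold:

```python
from collections import deque

# Toy synonymy graph (my stand-in for WordNet); it reproduces the
# slide's example: d(good, happy) = 2, d(bad, happy) = 4.
SYNONYMS = {
    "good": ["fine"],
    "fine": ["good", "happy", "fair"],
    "happy": ["fine"],
    "fair": ["fine", "mediocre"],
    "mediocre": ["fair", "bad"],
    "bad": ["mediocre", "poor"],
    "poor": ["bad"],
}

def d(a: str, b: str):
    """Shortest path length between words a and b (breadth-first search)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        word, dist = queue.popleft()
        if word == b:
            return dist
        for nxt in SYNONYMS.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # not connected

# A word closer to 'good' than to 'bad' leans positive.
print(d("good", "happy"), d("bad", "happy"))  # -> 2 4
```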
Dimension Reduction
- The size of the feature representation is (#terms) × (#documents); #terms ≒ size of the dictionary
- High calculation cost and risk of overfitting: best for the training data ≠ best for real data
- Goal: choose effective features to improve both accuracy and calculation cost
Dimension Reduction
- df-threshold: terms appearing in very few documents (e.g., only one) are not important
- Score-based selection: score each (term t, category c_j) pair; if t and c_j are independent, the score equals zero (one such score is sketched below)
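The slide shows the score only as an image; the chi-square statistic is one standard choice with exactly the stated property (zero when t and c_j are independent), so this sketch assumes it:

```python
# Chi-square feature score from a 2x2 contingency table of documents.
def chi_square(A: int, B: int, C: int, D: int) -> float:
    """A: docs in c_j containing t,     B: docs outside c_j containing t,
    C: docs in c_j without t,           D: docs outside c_j without t."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denom

# Independent term: same presence rate inside and outside c_j -> 0.
print(chi_square(10, 20, 10, 20))  # -> 0.0
# Correlated term: appears mostly inside c_j -> large score.
print(chi_square(30, 5, 10, 55))
```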
Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion
Learning Algorithm
- Many (almost all?) learning algorithms have been used for text categorization
- Simple approaches: Naïve Bayes, k-Nearest Neighbor
- High-performance approaches: Boosting, Support Vector Machine
- Hierarchical learning
Naïve Bayes
- Bayes rule: P(c|d) = P(d|c) P(c) / P(d)
- P(d|c) is hard to calculate directly
- Assumption: each term occurs independently, so P(d|c) ≈ ∏ P(t|c)
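A minimal multinomial Naïve Bayes sketch built from these formulas; the Laplace smoothing and toy data are my additions:

```python
# Naive Bayes: P(c|d) is proportional to P(c) * product of P(t|c);
# log-space avoids underflow, add-one smoothing handles unseen terms.
import math
from collections import Counter, defaultdict

train = [("buy cheap pills now", "spam"), ("lunch meeting today", "ham"),
         ("cheap meds buy", "spam"), ("project meeting notes", "ham")]

class_counts = Counter(label for _, label in train)
term_counts = defaultdict(Counter)
for text, label in train:
    term_counts[label].update(text.split())
vocab = {t for c in term_counts for t in term_counts[c]}

def predict(text: str) -> str:
    scores = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        score = math.log(class_counts[c] / len(train))  # log P(c)
        for t in text.split():
            # log P(t|c) with Laplace smoothing
            score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("cheap pills"))  # -> 'spam'
```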
k-Nearest Neighbor
- Define a "distance" (similarity) between two texts, e.g., Sim(d1, d2) = d1·d2 / (|d1| |d2|) = cos θ
- Check the k training texts with the highest similarity and categorize by majority vote (e.g., k = 3)
- The larger the stored training data, the higher the memory and search costs
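A small k-NN sketch using the cosine similarity above (the toy data is mine):

```python
# k-NN text categorization: cosine similarity plus majority vote.
import math
from collections import Counter

def cosine(d1: Counter, d2: Counter) -> float:
    dot = sum(d1[t] * d2[t] for t in d1)
    n1 = math.sqrt(sum(v * v for v in d1.values()))
    n2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    q = Counter(query.split())
    # take the k most similar training texts
    nearest = sorted(train, key=lambda x: cosine(q, Counter(x[0].split())),
                     reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # majority vote

train = [("cheap pills buy", "spam"), ("meeting today", "ham"),
         ("buy now cheap", "spam")]
print(knn("cheap buy", train))  # -> 'spam'
```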
Boosting
- BoosTexter [00 Schapire et al.], based on AdaBoost
- Builds many "weak learners" with different parameters
- The K-th weak learner checks the performance of learners 1..K-1 and tries to correctly classify the training data they scored worst on
- BoosTexter uses a decision stump as its weak learner
[Diagram: a decision stump splitting + and - training points]
Simple example of Boosting
[Diagram: rounds 1-3; each round's stump focuses on the points the previous stumps misclassified, and the weighted vote of the stumps separates + from - (a code sketch follows)]
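A toy AdaBoost implementation with decision stumps, mirroring the three rounds in the diagrams; the 1-D data and round count are my choices:

```python
import math

# 1-D toy data: (feature value, label in {+1, -1})
data = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1), (6, -1)]

def stump_predict(x, thr, sign):
    # the stump predicts `sign` left of the threshold, `-sign` right of it
    return sign if x < thr else -sign

def weighted_error(thr, sign, weights):
    return sum(w for (x, y), w in zip(data, weights)
               if stump_predict(x, thr, sign) != y)

def best_stump(weights):
    candidates = [(thr, s) for thr in range(1, 8) for s in (1, -1)]
    return min(candidates, key=lambda c: weighted_error(c[0], c[1], weights))

weights = [1.0 / len(data)] * len(data)
ensemble = []
for _ in range(3):  # three rounds, matching the diagrams above
    thr, sign = best_stump(weights)
    err = max(weighted_error(thr, sign, weights), 1e-9)
    alpha = 0.5 * math.log((1 - err) / err)   # this learner's vote weight
    ensemble.append((alpha, thr, sign))
    # raise the weight of misclassified points, then renormalize
    weights = [w * math.exp(-alpha * y * stump_predict(x, thr, sign))
               for (x, y), w in zip(data, weights)]
    total = sum(weights)
    weights = [w / total for w in weights]

def predict(x):
    score = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

print([predict(x) for x, _ in data])  # matches the true labels here
```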
Support Vector Machine
- Text categorization with SVM [98 Joachims]
- Maximizes the margin between the two classes
Text Categorization with SVM
- SVM works well for text categorization: robust to high dimensionality and to overfitting
- Most text categorization problems are linearly separable: all of OHSUMED (MEDLINE collection), most of Reuters-21578 (news collection)
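A hedged sketch of SVM-based text categorization in the spirit of [98 Joachims]; the library (scikit-learn) and the toy data are my choices, not the survey's:

```python
# Linear SVM over tf-idf features: high-dimensional but, as the slide
# notes, text problems are usually linearly separable.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["stock prices rose", "team wins final",
         "markets fell today", "player scores goal"]
labels = ["finance", "sports", "finance", "sports"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["goal in the final"]))  # -> ['sports']
```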
Comparison of these methods
- [02 Sebastiani]: Reuters-21578 (2 versions, differing in the number of categories)
Method        Ver.1 (90)   Ver.2 (10)
SVM           .920         .870
Boosting      .878         -
Naïve Bayes   .795         .815
k-NN          .860         .823
Hierarchical Learning
- TreeBoost [06 Esuli et al.]: a boosting algorithm for hierarchical labels
- Training data: a hierarchy of labels plus texts labeled with its nodes
- Applies AdaBoost recursively at each node of the hierarchy
- A better classifier than 'flat' AdaBoost: accuracy up 2-3%; training and categorization time down
- Hierarchical SVM [04 Cai et al.]
TreeBoost
[Diagram: example label hierarchy: root → L1, L2, L3, L4; L1 → L11, L12; L4 → L41, L42, L43; L42 → L421, L422]
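A heavily hedged sketch of the recursive idea: train one classifier per internal node of this hierarchy and route a new document downward. The per-node training function is a hypothetical placeholder, not TreeBoost's actual AdaBoost:

```python
# Label hierarchy mirroring the diagram above.
TREE = {"root": ["L1", "L2", "L3", "L4"],
        "L1": ["L11", "L12"],
        "L4": ["L41", "L42", "L43"],
        "L42": ["L421", "L422"]}

def train_node_classifier(node, docs):
    """Hypothetical stand-in: in TreeBoost this would be an AdaBoost
    classifier trained only on the documents under `node`."""
    children = TREE[node]
    return lambda doc: children[hash(doc) % len(children)]  # placeholder

def train_tree(docs):
    # one classifier per internal node of the label hierarchy
    return {node: train_node_classifier(node, docs) for node in TREE}

def categorize(doc, classifiers):
    node = "root"
    while node in classifiers:  # descend until a leaf label is reached
        node = classifiers[node](doc)
    return node

classifiers = train_tree(["toy training documents"])
print(categorize("a new document", classifiers))
```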
Outline
- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion
Conclusion
- Overview of text categorization with machine learning: features of text; learning algorithms
- Future work: natural language processing with machine learning, especially in Japanese; calculation cost