Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization
Advisor: Hsin-Hsi Chen
Reporter: Chi-Hsin Yu
Date: 2009.09.24
From NIPS 2008
Outline
•Introduction
•Related Work
•Review SVM
•SSLW (Semi-supervised Learning with Weakly-Related Unlabeled Data)
•Experiments
•Conclusion
Introduction
•Semi-supervised Learning (SSL)
▫takes advantage of a large amount of unlabeled data to enhance classification accuracy
•Cluster assumption
▫puts the decision boundary in low-density areas without crossing the high-density regions
▫is only meaningful when the labeled and unlabeled data are somehow closely related
 If they were weakly related, the labeled and unlabeled data could be well separated
Introduction (cont.)
•This paper aims to
▫identify a new data representation (in feature space) by constructing a new kernel function
•Advantages
 informative to the target class (category)
 consistent with the feature-coherence patterns exhibited in the weakly-related unlabeled data
Related Work
•The two types of semi-supervised learning (SSL)▫Transductive SSL
labels only for the available unlabeled data▫Inductive SSL
also learns a classifier that can be used to predict labels for new data
SSLW
SVM
•Notations
▫L = {(x_1, y_1), …, (x_l, y_l)}: labeled documents
▫U = {(x_{l+1}, y_{l+1}), …, (x_n, y_n)}: unlabeled documents
▫Document-word matrix D = (d_1, d_2, …, d_n), d_i ∈ N^V
 V: the size of the vocabulary
 d_i: word-frequency vector for document i
▫Word-document matrix G = (g_1, g_2, …, g_V), g_i = (g_{i,1}, g_{i,2}, …, g_{i,n})
▫K = DᵀD, K ∈ R^{n×n}: document pairwise similarity
▫α ∘ y = (α_1 y_1, α_2 y_2, …, α_n y_n): element-wise product
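The linear kernel K = DᵀD from the notation above can be computed directly; a minimal numpy sketch with a hypothetical 3-word, 2-document corpus (the frequency values are made up for illustration):

```python
import numpy as np

# Hypothetical document-word matrix D (V = 3 words x n = 2 documents):
# column d_i is the word-frequency vector of document i.
D = np.array([[2, 0],
              [1, 1],
              [0, 3]])

# Linear kernel: K[i, j] = d_i . d_j, the pairwise document
# similarity used by the SVM on the slide (K is n x n).
K = D.T @ D
```

Here K[0, 1] = 1 because documents 1 and 2 share only one occurrence of the second word, while the diagonal entries are the documents' squared norms.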
SSLW
•Kernel modification: K = DᵀD → K = DᵀRD
▫R ∈ R^{V×V}: word-correlation matrix
•Two ways to construct the matrix R
▫Decompose G = UW, with W = (w_1, w_2, …, w_V)
 w_i: internal representation of the i-th word
 U: the top p right eigenvectors of G
▫R = WᵀW
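A minimal numpy sketch of the R = WᵀW construction, under the assumption (one reading of the slide's G = UW decomposition) that the word representations W are obtained from the top-p singular vectors of the word-document matrix; the toy matrix and p are hypothetical:

```python
import numpy as np

# Toy word-document matrix D (V = 4 words x n = 3 documents).
D = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 0., 0.],
              [0., 2., 1.]])

# Rank-p spectral decomposition: take the top-p left singular
# vectors as the internal word representations w_i (assumption).
p = 2
U, s, Vt = np.linalg.svd(D, full_matrices=False)
W = U[:, :p].T           # p x V: column i is w_i, the i-th word's representation

# Word-correlation matrix R = W^T W, then the enriched kernel K = D^T R D.
R = W.T @ W              # V x V
K = D.T @ R @ D          # n x n document kernel
```

Since K = (WD)ᵀ(WD), it is symmetric positive semidefinite and therefore a valid kernel; word pairs with similar low-rank representations now contribute to document similarity even when the documents share no words.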
SSLW (cont.)
•An Efficient Algorithm for SSLW
Experiments
•Corpus
▫Reuters-21578 (9,400 docs)
▫WebKB (4,518 docs)
▫TREC AP88: an external information source for both datasets (1,000 documents, randomly selected)
Evaluation Methodology
•4 positive + 4 negative samples from each training set
•AUR (area under the ROC curve)
•AUR averaged over ten runs of each experiment
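The AUR metric above can be computed without plotting the ROC curve, via the equivalent Mann-Whitney statistic; a self-contained sketch with hypothetical classifier scores:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve as the Mann-Whitney statistic:
    the probability that a randomly drawn positive example is
    scored above a randomly drawn negative one (ties count half)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# A perfect ranking gives AUC = 1.0; all-tied scores give 0.5.
perfect = auc([0.9, 0.8], [0.2, 0.1])
chance = auc([0.5, 0.5], [0.5, 0.5])
```

Repeating an experiment ten times and averaging these per-run AUC values gives the averaged AUR the slide reports.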
Conclusion
•SSLW
▫significantly improves both the accuracy and the reliability of text categorization, given a small training pool and additional unlabeled data that are only weakly related to the test bed
Thanks!!