Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization



Advisor: Hsin-Hsi Chen
Reporter: Chi-Hsin Yu
Date: 2009.09.24
From NIPS 2008


Outline

•Introduction
•Related Work
•Review SVM
•SSLW (Semi-supervised Learning with Weakly-Related Unlabeled Data)
•Experiments
•Conclusion

Introduction

•Semi-supervised Learning (SSL)
▫takes advantage of a large amount of unlabeled data to enhance classification accuracy
•Cluster assumption
▫puts the decision boundary in low-density areas without crossing the high-density regions
▫is only meaningful when the labeled and unlabeled data are somehow closely related
 If they are only weakly related, the labeled and unlabeled data could be well separated

Introduction (cont.)

•This paper aims to
▫Identify a new data representation (in feature space)
 By constructing a new kernel function
▫Advantages
 Informative to the target class (category)
 Consistent with the feature coherence patterns exhibited in the weakly related unlabeled data

Related Work

•The two types of semi-supervised learning (SSL)
▫Transductive SSL
 predicts labels only for the available unlabeled data
▫Inductive SSL
 also learns a classifier that can be used to predict labels for new data


SVM

•Notations
▫$\mathcal{L} = \{(x_1, y_1), \ldots, (x_l, y_l)\}$: labeled documents
▫$\mathcal{U} = \{(x_{l+1}, y_{l+1}), \ldots, (x_n, y_n)\}$: unlabeled documents
▫Document-word matrix $D = (d_1, d_2, \ldots, d_n)$, $d_i \in \mathbb{N}^V$
 $V$: the size of the vocabulary
 $d_i$: word-frequency vector for document $i$
▫Word-document matrix $G = (g_1, g_2, \ldots, g_V)$, $g_i = (g_{i,1}, g_{i,2}, \ldots, g_{i,n})$
▫$K = D^\top D$, $K \in \mathbb{R}^{n \times n}$: document pairwise similarity
▫$\alpha \circ y = (\alpha_1 y_1, \alpha_2 y_2, \ldots, \alpha_n y_n)$: element-wise product
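The notation above can be made concrete with a small sketch. It assumes whitespace tokenization and a three-document toy corpus, neither of which comes from the slides, and the helper names (`doc_word_matrix`, `linear_kernel`, `elementwise`) are hypothetical. It builds the word-frequency vectors $d_i$, the kernel $K = D^\top D$, and the element-wise product $\alpha \circ y$ using only the Python standard library.

```python
from collections import Counter

def doc_word_matrix(docs, vocab):
    """Return D as a list of columns: entry i is the word-frequency vector d_i (length V)."""
    columns = []
    for doc in docs:
        counts = Counter(doc.split())  # assumption: whitespace tokenization
        columns.append([counts[w] for w in vocab])
    return columns

def linear_kernel(columns):
    """K = D^T D: K[i][j] is the inner product of d_i and d_j."""
    n = len(columns)
    return [[sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(n)]
            for i in range(n)]

def elementwise(alpha, y):
    """alpha o y = (alpha_1 y_1, ..., alpha_n y_n)."""
    return [a * b for a, b in zip(alpha, y)]

# Toy corpus (illustrative only, not from the paper's datasets).
docs = ["price of oil rises", "oil price falls", "students take course"]
vocab = sorted({w for d in docs for w in d.split()})
D = doc_word_matrix(docs, vocab)
K = linear_kernel(D)
```

Here `K[0][1]` counts the words shared by the first two documents, while `K[0][2]` is zero because the first and third documents have no word in common.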


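SSLW

•$K = D^\top D$ → $K = D^\top R D$
▫$R \in \mathbb{R}^{V \times V}$: word-correlation matrix
•Two ways to construct the matrix $R$
▫$G = UW$, $W = (w_1, w_2, \ldots, w_V)$
 $w_i$: internal representation of the $i$-th word
▫$R = W^\top W$, $T = U U^\top$
▫the top $p$ right eigenvectors of $G$
▫Constraints: $\alpha_i \ge 0$, $\xi \ge 0$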
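To see what replacing $K = D^\top D$ with $K = D^\top R D$ does, the sketch below uses a hand-picked internal representation $W$ (an assumption for illustration; the slides obtain $W$ from the data, not by hand). With $R = W^\top W$, the new kernel equals the plain inner product of the projected documents $WD$, so two documents with no shared words can still be similar when their words have the same internal representation. The vocabulary, counts, and helper names are all hypothetical.

```python
def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# D is V x n: rows are words (oil, petroleum, course), columns are documents.
D = [[2, 0, 1],   # oil
     [0, 1, 0],   # petroleum
     [0, 0, 1]]   # course

# W is p x V with p = 2; column i is the internal representation w_i.
# Assumption: "oil" and "petroleum" are given the same representation.
W = [[1, 1, 0],
     [0, 0, 1]]

R = matmul(transpose(W), W)                    # word-correlation matrix, V x V
K_plain = matmul(transpose(D), D)              # K = D^T D
K_R = matmul(transpose(D), matmul(R, D))       # K = D^T R D

# Equivalent view: linear kernel on the projected documents W D.
Z = matmul(W, D)
K_projected = matmul(transpose(Z), Z)
```

In this toy case `K_plain[0][1]` is 0 (an "oil" document and a "petroleum" document share no words), whereas `K_R[0][1]` is 2 because the two words are correlated through `R`.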

SSLW (cont.)

•An Efficient Algorithm of SSLW

Experiments

•Corpus
▫Reuters-21578 (9400 docs)
▫WebKB (4518 docs)
▫TREC AP88: an external information source for both datasets (1000 documents, randomly selected)

Evaluation Methodology

•4 positive + 4 negative samples from each training set
•AUR (area under the ROC curve)
•Averaging the AUR (ten times of each
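The evaluation protocol can be sketched as follows. The area under the ROC curve is computed as the fraction of (positive, negative) pairs that the classifier ranks correctly, which is a standard equivalent definition. Reading "ten times" as ten repeated random draws of the 4 + 4 training sample is an assumption, since the slide text is cut off, and `run_trial` is a hypothetical callback standing in for "train on the sample, score the test set, return the AUR".

```python
import random

def auc(scores, labels):
    """Area under the ROC curve: share of (positive, negative) pairs
    where the positive gets the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def sample_training_set(pos_ids, neg_ids, k=4, rng=random):
    """Draw k positive and k negative training examples without replacement."""
    return rng.sample(pos_ids, k), rng.sample(neg_ids, k)

def mean_auc(run_trial, n_trials=10, seed=0):
    """Average the AUR over n_trials runs (assumed: one run per random training draw)."""
    rng = random.Random(seed)
    values = [run_trial(rng) for _ in range(n_trials)]
    return sum(values) / len(values)
```

With only eight labeled examples per run, a single AUR value depends heavily on which examples were drawn, which is why averaging over repeated draws gives a more stable figure.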



Conclusion

•SSLW
▫Significantly improves both the accuracy and the reliability of text categorization, given a small training pool and additional unlabeled data that are weakly related to the test bed.

