Page 1: Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization

Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization

Advisor: Hsin-Hsi Chen
Reporter: Chi-Hsin Yu
Date: 2009.09.24
From NIPS 2008



•Introduction
•Related Work
•Review SVM
•SSLW (Semi-supervised Learning with Weakly-Related Unlabeled Data)
•Experiments
•Conclusion


Introduction
•Semi-supervised Learning (SSL)
▫takes advantage of a large amount of unlabeled data to enhance classification accuracy
•Cluster assumption
▫puts the decision boundary in low-density areas without crossing the high-density regions
▫is only meaningful when the labeled and unlabeled data are closely related
 If they were only weakly related, the labeled and unlabeled data could be well separated


Introduction (cont.)
•This paper aims to
▫Identify a new data representation (in feature space)
 By constructing a new kernel function
▫Advantages
 Informative to the target class (category)
 Consistent with the feature coherence patterns exhibited in the weakly related unlabeled data


Related Work
•The two types of semi-supervised learning (SSL)
▫Transductive SSL
 predicts labels only for the available unlabeled data
▫Inductive SSL
 also learns a classifier that can be used to predict labels for new data



SVM
•Notations
▫$\mathcal{L} = \{(x_1, y_1), \ldots, (x_l, y_l)\}$: labeled documents
▫$\mathcal{U} = \{(x_{l+1}, y_{l+1}), \ldots, (x_n, y_n)\}$: unlabeled documents
▫Document-word matrix $D = (d_1, d_2, \ldots, d_n)$, $d_i \in \mathbb{N}^V$
 $V$: the size of the vocabulary
 $d_i$: word-frequency vector for document $i$
▫Word-document matrix $G = (g_1, g_2, \ldots, g_V)$, $g_i = (g_{i,1}, g_{i,2}, \ldots, g_{i,n})$
▫$K = D^\top D$, $K \in \mathbb{R}^{n \times n}$: document pairwise similarity
▫$\alpha \circ y = (\alpha_1 y_1, \alpha_2 y_2, \ldots, \alpha_n y_n)$: element-wise product
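The notation above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the three toy documents, the vocabulary, and the helper names `doc_word_matrix`, `linear_kernel`, and `elementwise` are hypothetical and do not come from the slides. `D` is stored as a list of columns, so `D[i]` is the word-frequency vector $d_i$.

```python
from collections import Counter


def doc_word_matrix(docs, vocab):
    """Return D as a list of columns; column i is the word-frequency vector d_i."""
    columns = []
    for text in docs:
        counts = Counter(text.split())
        columns.append([counts.get(word, 0) for word in vocab])
    return columns


def linear_kernel(columns):
    """K = D^T D: entry K[i][j] is the dot product of d_i and d_j."""
    n = len(columns)
    return [
        [sum(a * b for a, b in zip(columns[i], columns[j])) for j in range(n)]
        for i in range(n)
    ]


def elementwise(alpha, y):
    """Element-wise product (alpha_1 * y_1, ..., alpha_n * y_n)."""
    return [a * b for a, b in zip(alpha, y)]


# Hypothetical toy corpus (not from the slides).
docs = ["oil price rise", "oil export", "stock market rise"]
vocab = ["oil", "price", "rise", "export", "stock", "market"]

D = doc_word_matrix(docs, vocab)
K = linear_kernel(D)
alpha_y = elementwise([0.5, 1.0, 2.0], [1, -1, -1])
```

In this toy corpus the second and third documents share no word, so their entry in `K` is zero; that is the limitation of the plain kernel that a word-correlation matrix is meant to address.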



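•$K = D^\top D$ → $K = D^\top R D$
▫$R \in \mathbb{R}^{V \times V}$: word-correlation matrix
•Two ways to construct the matrix $R$
 $G = UW$, $W = (w_1, w_2, \ldots, w_V)$
 $w_i$: internal representation of the $i$-th word
 $R = W^\top W$, $T = UU^\top$
 the top $p$ right eigenvectors of $G$
 $\alpha_i \ge 0$, $\xi \ge 0$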
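A minimal sketch of how a word-correlation matrix changes document similarity, assuming $R = W^\top W$ with hand-picked word representations. The matrix `W`, the three-word vocabulary, and the helper names are hypothetical; the slides do not give these values, and this is not the procedure SSLW uses to learn $R$.

```python
def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical vocabulary: index 0 = "oil", 1 = "petroleum", 2 = "stock".
# W is p x V (here p = 2); column i is w_i, the internal representation of word i.
# "oil" and "petroleum" are assumed to share the same representation.
W = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# R = W^T W is V x V: entry (i, j) is the inner product of w_i and w_j.
R = matmul(transpose(W), W)

# D is V x n; column i is d_i. Each toy document contains exactly one word.
D = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

K_plain = matmul(transpose(D), D)             # K = D^T D
K_corr = matmul(matmul(transpose(D), R), D)   # K = D^T R D
```

With the plain kernel the "oil" and "petroleum" documents have similarity 0 because they share no word; with the assumed `R` their similarity becomes 1.0, while both stay at 0 against the "stock" document.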


SSLW (cont.)


SSLW (cont.)

•An Efficient Algorithm of SSLW


Experiments
•Corpus
▫Reuters-21578 (9400 docs)
▫WebKB (4518 docs)
▫TREC AP88: an external information source for both datasets (1000 documents, randomly selected)


Evaluation Methodology
•4 positive + 4 negative samples from each training set
•AUR (area under the ROC curve)
•Averaging the AUR (ten times of each
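For reference, the area under the ROC curve can be computed directly from classifier scores: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below illustrates that definition only; the function name `area_under_roc` and the example scores are hypothetical, and it does not reproduce the slides' sampling or averaging protocol.

```python
def area_under_roc(scores, labels):
    """AUR as the fraction of positive/negative pairs ranked correctly.

    labels use +1 for the positive class; any other value counts as negative.
    Ties between a positive and a negative score contribute 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one positive (0.4) is ranked below one negative (0.8).
aur = area_under_roc([0.9, 0.8, 0.4, 0.3], [1, -1, 1, -1])
perfect = area_under_roc([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1])
```

The pairwise form is quadratic in the number of examples, which is fine for small test sets; a rank-based computation would be the usual choice for larger ones.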




•SSLW
▫Significantly improves both the accuracy and the reliability of text categorization, given a small training pool and additional unlabeled data that are weakly related to the test bed.


