Clustering tagged documents with labeled and unlabeled documents

Presenter: Jian-Ren Chen. Authors: Chien-Liang Liu*, Wen-Hoar Hsaio, Chia-Hoang Lee, Chun-Hsien Chen. 2013, IPM.

Intelligent Database Systems Lab

Presenter: Jian-Ren Chen

Authors: Chien-Liang Liu*, Wen-Hoar Hsaio, Chia-Hoang Lee, Chun-Hsien Chen

2013, IPM

Clustering tagged documents with labeled and unlabeled documents

Outlines

- Motivation

- Objectives

- Methodology

- Experiments

- Conclusions

- Comments

Motivation

Tags can provide semantic information about the resources, and they can help machines perform classification or clustering tasks accurately.

Probabilistic latent semantic analysis (PLSA)

- aspect model

- statistical clustering model

Objectives

This study employs Constrained-PLSA to cluster tagged documents with a small number of seeds.

Constrained-PLSA is based on the statistical clustering model rather than the aspect model.
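
The two PLSA variants named above can be contrasted in formula form. This is a sketch of the standard textbook formulation, not a transcription of the slide equations; the symbols z (latent class), d (document), w (word) and n(d, w) (count of w in d) are assumed notation.

```latex
% Aspect model: a latent class z is drawn for every (document, word) observation,
% so one document can mix several classes.
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)

% Statistical clustering model: a single latent class z is drawn per document,
% so all words of d come from the same class (up to the multinomial coefficient).
P(d) \propto \sum_{z} P(z) \prod_{w} P(w \mid z)^{\,n(d, w)}
```

Under the clustering model every document belongs to exactly one cluster, which is why it pairs naturally with seed documents whose cluster is already known.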

Methodology - PLSA

Terms (keywords) of the document collection

documents

E-step

M-step
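
The slide equations for the E-step and M-step did not survive extraction. As a stand-in, here is a minimal sketch of standard aspect-model PLSA fitted by EM, assuming the common asymmetric parameterization P(z|d), P(w|z); the function name `plsa` and all argument names are hypothetical, and the slide's own notation may differ.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit aspect-model PLSA by EM (illustrative sketch, not the authors' code).

    counts: (n_docs, n_words) array of term counts n(d, w).
    Returns P(z|d) with shape (n_docs, n_topics) and
            P(w|z) with shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random, row-normalized starting parameters.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (doc, topic, word)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both tables from expected counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * post               # (doc, topic, word)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# Toy term-count matrix: 4 documents, 4 terms.
counts = np.array([[5, 4, 0, 0],
                   [4, 5, 0, 0],
                   [0, 0, 5, 4],
                   [0, 0, 4, 5]], dtype=float)
p_z_d, p_w_z = plsa(counts, n_topics=2)
```

The dense (doc, topic, word) array keeps the sketch short but only suits small collections; a sparse loop over nonzero counts would be needed at scale.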

Methodology - Constrained-PLSA

E-step

M-step
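
The Constrained-PLSA update equations are also missing from the extracted slides. The sketch below shows one plausible way seeds can enter EM for a statistical clustering model (a mixture of multinomials with one cluster per document): the posterior of each seed document is clamped to its known cluster after every E-step. The clamping rule, the smoothing constant `alpha`, and all names are assumptions made for illustration, not the authors' published update rules.

```python
import numpy as np

def constrained_mixture(counts, n_clusters, seeds, n_iter=50, alpha=1e-2, rng_seed=0):
    """Seeded EM for a mixture of multinomials (illustrative sketch).

    counts: (n_docs, n_words) array of term counts n(d, w).
    seeds:  dict {doc_index: cluster_index} for the labeled documents.
    Returns P(z|d) with shape (n_docs, n_clusters).
    """
    rng = np.random.default_rng(rng_seed)
    n_docs, _ = counts.shape
    resp = rng.random((n_docs, n_clusters))
    resp /= resp.sum(axis=1, keepdims=True)
    seed_idx = np.array(list(seeds.keys()), dtype=int)
    seed_lab = np.array(list(seeds.values()), dtype=int)

    def clamp(r):
        # Assumed constraint: a seed document always stays in its labeled cluster.
        r[seed_idx] = 0.0
        r[seed_idx, seed_lab] = 1.0
        return r

    resp = clamp(resp)
    for _ in range(n_iter):
        # M-step: cluster priors P(z) and smoothed word distributions P(w|z).
        p_z = resp.sum(axis=0) / n_docs
        p_w_z = resp.T @ counts + alpha
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        # E-step in log space: log P(z) + sum_w n(d,w) * log P(w|z).
        log_post = np.log(p_z + 1e-12) + counts @ np.log(p_w_z).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        resp = clamp(resp)
    return resp

# Toy data: documents 0 and 2 are seeds for clusters 0 and 1.
counts = np.array([[5, 4, 0, 0],
                   [4, 5, 0, 0],
                   [0, 0, 5, 4],
                   [0, 0, 4, 5]], dtype=float)
resp = constrained_mixture(counts, n_clusters=2, seeds={0: 0, 2: 1})
```

Because the seeds also fix which cluster index corresponds to which label, the output clusters can be compared against labeled classes directly, without a separate matching step.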

Experiments - Data set A (CiteULike)

Experiments (Data set A)

Experiments - Data set B (CiteULike)

Experiments (Data set B)

Conclusions

• The “tags as words” representation scheme performs more stably than the “words + tags” representation scheme.

• Unsupervised learning methods fail to function properly on the data set with noisy information, but Constrained-PLSA functions properly and remains stable even when only a small amount of labeled data is available.
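
The slides do not define the two representation schemes. The sketch below assumes a common reading: “tags as words” merges tags into the same vocabulary as the words, while “words + tags” keeps words and tags as two separate feature blocks. Both function names and the toy document are hypothetical.

```python
from collections import Counter

def tags_as_words(words, tags):
    # Assumed reading: one shared vocabulary, so a tag and an identical
    # word fall on the same dimension and their counts add up.
    return Counter(words) + Counter(tags)

def words_plus_tags(words, tags):
    # Assumed reading: two separate feature blocks, kept apart by a prefix,
    # so a tag and an identical word remain distinct dimensions.
    features = Counter({("word", w): c for w, c in Counter(words).items()})
    features.update({("tag", t): c for t, c in Counter(tags).items()})
    return features

doc_words = ["plsa", "clustering", "plsa"]
doc_tags = ["clustering", "em"]
merged = tags_as_words(doc_words, doc_tags)
separate = words_plus_tags(doc_words, doc_tags)
```

Here `merged` has three dimensions and `separate` has four, because "clustering" appears once as a word and once as a tag.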

Comments

• Advantages

- Constrained-PLSA outperforms the other methods

• Disadvantage

- too much artificial processing in the experiments

• Applications

- text mining

- tagged document clustering
