Upload
erasmus-decker
View
22
Download
1
Embed Size (px)
DESCRIPTION
Document Clustering with Cluster Refinement and Model Selection Capabilities. Advisor : Dr. Hsu Presenter : Shu-Ya Li Authors : Xin Liu, Yihong Gong, Wei Xu, Shenghuo Zhu. 2002 . SIGIR . Page(s) : 191 - 198. Outline. Motivation Objective Method Experimental Result - PowerPoint PPT Presentation
Citation preview
1Intelligent Database Systems Lab
國立雲林科技大學National Yunlin University of Science and Technology
Document Clustering with Cluster Refinement and Model Selection Capabilities
Advisor : Dr. Hsu
Presenter : Shu-Ya Li
Authors : Xin Liu, Yihong Gong, Wei Xu,
Shenghuo Zhu
2002 . SIGIR . Page(s) : 191 - 198
2Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Outline
Motivation
Objective
Method
Experimental Result
Conclusion
Personal Opinions
3Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Motivation
The problems and limitations:
1) The user must formulate the query using the keywords.
2) Traditional text search engines is a narrowly specified search for documents matching the user’s query.
3) Traditional search engine returns hundreds, or even thousands of hits.
4Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Objective
We propose a document clustering method that strives to achieve:
1) a high accuracy of document clustering
2) the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability)
5Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Method
Feature Set Term frequencies (TF)
Name entities (NE)
Term pairs (TP)
The documents reporting the Clinton-Lewinsky scandal
The common name entities : ”Clinton”, ”Lewinsky”, ”Ken Starr”, ”Linda Tripp”, etcThe word pairs : ”grand jury”, ”independent counsel”, ”supreme court”
6Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Method - self-refinement process
Apply the iterative voting scheme to refine the document clusters.
1) GMM + EM algorithm
GMM
EM algorithm
7Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Method - self-refinement process
2) Identify discriminative features F = {f1, f2, . . . , fΛ} along with cluster labels S = {σ1, σ2, . . . , σΛ}
Define the discriminative feature metric DFM(fi)
3)
4) Compare the new document cluster set with C.
The result converges →terminate the process Otherwise →set C to the new cluster set, and go to Step 2.
8Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Method - Model Selection
measure the similarity between C and C’
The model selection algorithm1) Guess the possible number of document clusters from the data range (Rl,Rh).
2) Set k = Rl.
3) Cluster the document corpus into k clusters.
4) Compute between each pair of the results, and take the average on all the .
5) If k < Rh, k = k + 1, go to Step 3; otherwise, go to Step 6.
6) Select the k which yields the largest average .
9Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
1) GMM + EM algorithm
2)
3)
Experimental Result - Document Clustering Evaluation
ABC+CNN-01-13-18-32-48-70-71-77-86[ 新聞機構 - 新聞事件類別 - 報導次數 ]
10Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Experimental Result - Model Selection Evaluation
Compared with the BIC-based model selection method
11Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.Conclusion
To accurately cluster the given document corpus by using the GMM Model together with EM algorithm.
The model selection capability has been achieved by guessing a value C for the number of clusters N.