Stochastic Unsupervised Learning on Unlabeled Data
July 2, 2011
Presented by Jianjun Xie – CoreLogic
Collaborated with Chuanren Liu, Yong Ge and Hui Xiong – Rutgers, the State University of New Jersey
Our Story
“Let’s set up a team to compete in another data mining challenge” – a call with Rutgers
Is it a competition on data preprocessing?
Transform the problem into a clustering problem:
How many clusters are we shooting for?
What distance measure works better?
Go with stochastic K-means clustering.
Dataset Recap
Five real-world data sets were extracted from different domains
No labels were provided during the unsupervised learning challenge
The withheld labels are multi-class; some records can belong to multiple labels at the same time
Performance was measured by a global score, defined as the Area Under the Learning Curve (ALC)
A simple linear classifier (Hebbian learner) was used to calculate the learning curve
Focus on small numbers of training samples via log2 scaling on the x-axis of the learning curve
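The global score can be sketched as a trapezoidal area under the learning curve on a log2-scaled x-axis, normalized over the plotted range. This is a minimal illustration, not the challenge's official scoring code; the exact normalization used by the organizers may differ.

```python
import numpy as np

def alc(sample_sizes, auc_scores):
    """Area under the learning curve (ALC) sketch: trapezoidal area
    of the score curve over a log2-scaled x-axis, normalized by the
    width of that axis so a constant curve returns its own value."""
    x = np.log2(np.asarray(sample_sizes, dtype=float))  # log2 scaling on x-axis
    y = np.asarray(auc_scores, dtype=float)
    # Trapezoidal rule computed explicitly to stay portable across NumPy versions
    area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))
    return area / (x[-1] - x[0])
```

For example, a flat learning curve of 0.8 at 1, 2, 4, and 8 training samples scores 0.8; the log2 scaling weights the small-sample regime heavily, matching the challenge's emphasis on few training examples.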
Evolution of Our Approaches
Simple Data Preprocessing
Normalization: Z-scale (std=1, mean=0)
TF-IDF on text recognition (TERRY dataset)
PCA:
PCA on raw data
PCA on normalized data
Normalized PCA vs. non-normalized PCA
K-means Clustering
Cluster on top N normalized PCs
Cosine similarity vs. Euclidean distance
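The preprocessing pipeline above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions: the function name is hypothetical, and cosine-similarity K-means is approximated by L2-normalizing the rows and then running ordinary Euclidean K-means (the spherical K-means trick), since scikit-learn's KMeans does not take a cosine metric directly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def preprocess_and_cluster(X, n_pcs, K, random_state=0):
    """Z-scale the data, project onto the top-N principal components,
    then cluster with a cosine-like K-means."""
    Xz = StandardScaler().fit_transform(X)          # Z-scale: mean=0, std=1
    pcs = PCA(n_components=n_pcs).fit_transform(Xz)  # top-N normalized PCs
    # Unit-length rows make Euclidean K-means act like cosine-similarity
    # clustering (an approximation of spherical K-means)
    pcs = normalize(pcs)
    return KMeans(n_clusters=K, n_init=10,
                  random_state=random_state).fit_predict(pcs)
```

For the TERRY text dataset, a TF-IDF matrix (e.g. from `TfidfVectorizer`) would replace the raw `X` before this pipeline.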
Stochastic Clustering Process
Given data set X, number of clusters K, and iteration count N
For n = 1, 2, …, N:
Randomly choose K seeds from X
Perform K-means clustering, assigning each record a cluster membership I_n
Transform I_n into a binary representation
Combine the N binary representations together as the final result
Example of binary representation of clusters:
Say cluster labels = 1, 2, 3; the binary representations will be (1 0 0), (0 1 0), and (0 0 1)
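The stochastic clustering process above can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' original code; the function name is hypothetical, and scikit-learn's KMeans (seeded with the randomly chosen records via its array-form `init`) stands in for whatever K-means implementation they used.

```python
import numpy as np
from sklearn.cluster import KMeans

def stochastic_clustering(X, K, N, rng=None):
    """Run K-means N times, each with K randomly chosen records as
    seeds, and stack the one-hot cluster memberships side by side."""
    rng = np.random.default_rng(rng)
    blocks = []
    for _ in range(N):
        # Randomly choose K distinct records from X as initial seeds
        seeds = X[rng.choice(len(X), size=K, replace=False)]
        km = KMeans(n_clusters=K, init=seeds, n_init=1).fit(X)
        # Transform the membership vector I_n into its binary
        # (one-hot) representation, e.g. label 2 of 3 -> (0 1 0)
        blocks.append(np.eye(K)[km.labels_])
    # Combine the N binary representations as the final result
    return np.hstack(blocks)
```

The output is a binary matrix with K×N columns per record, which then serves as the submitted feature representation.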
Our final approach
Results of Our Approaches
Dataset Harry – human action recognition
Results
Dataset Rita – object recognition
Results
Dataset Sylvester – ecology
Results
Dataset Terry – text recognition
Results
Dataset Avicenna – Arabic manuscripts
Summary on Results
Overall rank: 2nd.
Dataset    Winner Valid  Winner Final  Winner Rank  Our Valid  Our Final  Our Rank
Avicenna   0.1744        0.2183        1            0.1386     0.1906     6
Harry      0.8640        0.7043        6            0.9085     0.7357     3
Rita       0.3095        0.4951        1            0.3737     0.4782     5
Sylvester  0.6409        0.4569        6            0.7146     0.5828     1
Terry      0.8195        0.8465        1            0.8176     0.8437     2
Discussions
Stochastic clustering can generate better results than PCA in general
Cosine similarity is better than Euclidean distance
Normalized data is better than non-normalized data for K-means in general
The number of clusters (K) is an important factor, but could be relaxed for this particular competition.
Thank you! Questions?