Stochastic Unsupervised Learning on Unlabeled Data


July 2, 2011

Presented by Jianjun Xie – CoreLogic. Collaborated with Chuanren Liu, Yong Ge and Hui Xiong – Rutgers, the State University of New Jersey

Our Story

“Let’s set up a team to compete in another data mining challenge” – a call with Rutgers

Is it a competition on data preprocessing?

Transform the problem into a clustering problem: How many clusters are we shooting for? What distance measure works better? Go with stochastic K-means clustering.

Dataset Recap

Five real-world data sets were extracted from different domains. No labels were provided during the unsupervised learning challenge. The withheld labels are multi-class, and some records can belong to several labels at the same time.

Performance was measured by a global score, defined as the Area Under the Learning Curve (AULC). A simple linear classifier (Hebbian learner) was used to calculate the learning curve, and a log2 scaling on the x-axis focuses the score on small numbers of training samples.
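The score described above can be sketched as a trapezoidal area on a log2 axis. This is a minimal sketch of the metric as the slide describes it, not the challenge's official scoring code; the normalization by the x-range is an assumption so that a constant score s yields an area of s.

```python
import numpy as np

def area_under_learning_curve(n_train, scores):
    """Approximate global score: area under the learning curve on a
    log2-scaled x-axis, normalized to the x-range (an assumption;
    the official challenge formula may differ)."""
    x = np.log2(np.asarray(n_train, dtype=float))
    y = np.asarray(scores, dtype=float)
    # Trapezoidal rule on the log2 axis; dividing by the x-range keeps
    # the score in the same [0, 1] range as the classifier scores
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)) / (x[-1] - x[0]))

# Hypothetical learning curve: classifier score vs. training set size
print(area_under_learning_curve([1, 2, 4, 8, 16], [0.5, 0.6, 0.7, 0.8, 0.85]))
```

Because the x-axis is log2-scaled, doublings of the training set size are equally spaced, which weights the small-sample regime heavily, as the challenge intended.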

Evolution of Our Approaches

Simple Data Preprocessing
- Normalization: Z-scale (std = 1, mean = 0)
- TF-IDF on the text recognition data (TERRY dataset)
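The two preprocessing steps above can be sketched in a few lines. The TF-IDF weighting shown is one common variant; the slides do not specify which variant the team used.

```python
import numpy as np

def z_scale(X):
    """Z-score normalization: zero mean, unit std per feature."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant features
    return (X - mu) / sd

def tf_idf(counts):
    """Plain TF-IDF on a term-count matrix (rows = documents,
    columns = terms). One common weighting, assumed for illustration."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)                      # document frequency
    idf = np.log(counts.shape[0] / np.maximum(df, 1))  # inverse doc frequency
    return tf * idf
```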

PCA
- PCA on raw data
- PCA on normalized data
- Normalized PCA vs. non-normalized PCA

K-means Clustering
- Cluster on the top N normalized PCs
- Cosine similarity vs. Euclidean distance
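The PCA-plus-K-means pipeline above can be sketched as follows. The cosine variant is approximated here by L2-normalizing rows before Euclidean K-means (for unit vectors, squared Euclidean distance is a monotone function of cosine similarity); the parameter values are placeholders, not the team's actual settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pca_kmeans(X, n_pcs=10, k=5, seed=0):
    """Sketch: z-scale the data, keep the top n_pcs principal
    components, then K-means with a cosine-style distance via
    row normalization. n_pcs and k are illustrative placeholders."""
    Xn = (X - X.mean(axis=0)) / np.maximum(X.std(axis=0), 1e-12)
    pcs = PCA(n_components=n_pcs).fit_transform(Xn)
    # Unit-length rows make Euclidean K-means behave like cosine similarity
    pcs /= np.maximum(np.linalg.norm(pcs, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(pcs)
```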

Stochastic Clustering Process

Given data set X, number of clusters K, and iteration count N:

For n = 1, 2, …, N:
- Randomly choose K seeds from X
- Perform K-means clustering; assign each record a cluster membership In
- Transform In into a binary representation

Combine the N binary representations together as the final result.

Example of the binary representation of clusters: say the cluster labels are 1, 2, 3; the binary representations will be (1 0 0), (0 1 0) and (0 0 1).
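The process above can be sketched directly: run K-means N times from random seeds, one-hot encode each run's memberships, and concatenate the encodings. The scikit-learn calls are one possible realization of the slide's description, not the team's original code.

```python
import numpy as np
from sklearn.cluster import KMeans

def stochastic_clustering(X, k, n_iter, seed=0):
    """Stochastic clustering: N independent K-means runs with random
    initial seeds, each run's memberships one-hot encoded, all runs
    concatenated column-wise into the final representation."""
    rng = np.random.RandomState(seed)
    blocks = []
    for _ in range(n_iter):
        # Random initial seeds drawn from X, a single K-means pass per run
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=rng.randint(2**31 - 1))
        labels = km.fit_predict(X)
        # Binary representation: cluster j -> unit vector, e.g. 1 -> (1 0 0)
        blocks.append(np.eye(k)[labels])
    return np.hstack(blocks)  # shape: (n_samples, k * n_iter)
```

Each row of the result sums to N (one active bit per run), so the representation grows with the number of iterations rather than the original feature count.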

Our Final Approach

Results of Our Approaches

Dataset Harry – human action recognition
Dataset Rita – object recognition
Dataset Sylvester – ecology
Dataset Terry – text recognition
Dataset Avicenna – Arabic manuscripts

Summary on Results: Overall rank 2nd.


Dataset     Winner Valid  Winner Final  Winner Rank  Our Valid  Our Final  Our Rank
Avicenna    0.1744        0.2183       1            0.1386     0.1906     6
Harry       0.8640        0.7043       6            0.9085     0.7357     3
Rita        0.3095        0.4951       1            0.3737     0.4782     5
Sylvester   0.6409        0.4569       6            0.7146     0.5828     1
Terry       0.8195        0.8465       1            0.8176     0.8437     2

Discussions

Stochastic clustering can generate better results than PCA in general.

Cosine similarity is better than Euclidean distance. Normalized data is better than non-normalized data for k-means in general.

The number of clusters (K) is an important factor, but can be relaxed for this particular competition.

Thank you! Questions?
