1
Efficient Clustering of High-Dimensional Data Sets
Andrew McCallum (WhizBang! Labs & CMU)
Kamal Nigam (WhizBang! Labs)
Lyle Ungar (UPenn)
2
Large Clustering Problems
• Many examples
• Many clusters
• Many dimensions
Example Domains
• Text
• Images
• Protein Structure
3
The Citation Clustering Data
• Over 1,000,000 citations
• About 100,000 unique papers
• About 100,000 unique vocabulary words
• Over 1 trillion distance calculations
4
Reduce number of distance calculations
• [Bradley, Fayyad, Reina KDD-98] – Sample to find initial starting points for k-means or EM
• [Moore 98] – Use multi-resolution kd-trees to group similar data points
• [Omohundro 89] – Balltrees
5
The Canopies Approach
• Two distance metrics: cheap & expensive
• First Pass
– very inexpensive distance metric
– create overlapping canopies
• Second Pass
– expensive, accurate distance metric
– canopies determine which distances are calculated
6
Illustrating Canopies
7
Overlapping Canopies
8
Creating canopies with two thresholds
• Put all points in D
• Loop (see the code sketch below):
– Pick a point X from D
– Put all points within K_loose of X in a canopy
– Remove all points within K_tight of X from D
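A minimal Python sketch of this two-threshold loop, assuming a cheap_distance function and points held in a list; the function and variable names are mine, not from the paper, and the original may differ in whether already-removed points can join later canopies:

```python
import random

def make_canopies(points, cheap_distance, k_loose, k_tight):
    """Greedy canopy creation with two thresholds (k_tight <= k_loose)."""
    remaining = set(range(len(points)))   # D: indices not yet removed
    canopies = []
    while remaining:
        x = random.choice(tuple(remaining))   # pick a point X from D
        # all points within k_loose of X form X's canopy; overlap arises
        # because points outside k_tight stay in D for later canopies
        canopy = {i for i in remaining
                  if cheap_distance(points[x], points[i]) < k_loose}
        canopies.append(canopy)
        # points within k_tight of X are removed from D
        remaining -= {i for i in canopy
                      if cheap_distance(points[x], points[i]) < k_tight}
        remaining.discard(x)   # guarantee progress even for odd metrics
    return canopies
```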
9
Canopies
• Two distance metrics
– cheap and approximate
– expensive and accurate
• Two-pass clustering
– create overlapping canopies
– full clustering with limited distances
• Canopy property
– points in the same cluster will be in the same canopy
10
Using canopies with GAC (greedy agglomerative clustering)
• Calculate expensive distances between points in the same canopy
• All other distances default to infinity
• Sort the finite distances and iteratively merge the closest pairs (sketched below)
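A Python sketch of the distance step under these rules; expensive_distance and the dict-of-pairs representation are my assumptions for illustration. Pairs that never share a canopy simply never enter the list, which is the "default to infinity" behaviour:

```python
from itertools import combinations

def canopy_distances(points, canopies, expensive_distance):
    """Expensive distances only for pairs that share at least one canopy;
    all other pairs implicitly stay at infinity and are never merged."""
    dists = {}
    for canopy in canopies:
        for i, j in combinations(sorted(canopy), 2):
            if (i, j) not in dists:   # overlapping canopies: compute each pair once
                dists[(i, j)] = expensive_distance(points[i], points[j])
    # GAC consumes the finite distances in ascending order, merging closest first
    return sorted(dists.items(), key=lambda kv: kv[1])
```

A full agglomerative loop would then pop pairs off this sorted list and merge their clusters until a stopping threshold is reached.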
11
Computational Savings
• inexpensive metric << expensive metric
• number of canopies: c (large)
• canopies overlap: each point in f canopies
• roughly f·n/c points per canopy
• O(f²n²/c) expensive distance calculations
• complexity reduction: O(f²/c)
• n = 10^6; k = 10^4; c = 1000; f small: computation reduced by a factor of 1000 (worked example below)
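As a worked check (mine, not on the slide), take f ≈ 1, i.e. canopies that barely overlap:

```latex
\text{naive GAC: } \binom{n}{2} \approx \tfrac{n^2}{2} = \tfrac{(10^6)^2}{2} = 5 \times 10^{11} \text{ pairwise distances}
\text{with canopies: } O\!\left(\tfrac{f^2 n^2}{c}\right) \approx \tfrac{(10^6)^2}{10^3} = 10^{9} \text{ pairwise distances}
\text{reduction: } \approx c / f^2 = 1000\times
```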
12
Experimental Results
Method          F1      Minutes
Complete GAC    0.835   134.09
Canopies GAC    0.838     7.65
13
Preserving Good Clustering
• Small, disjoint canopies → big time savings
• Large, overlapping canopies → original accurate clustering
• Goal: fast and accurate
– requires a good, cheap distance metric
14
Reduced Dimension Representations
15
• Clustering finds groups of similar objects
• Understanding clusters can be difficult
• Important to understand/interpret results
• Patterns waiting to be discovered
16
A picture is worth 1000 clusters
17
Feature Subset Selection
• Find n features that work best for prediction
• Find n features such that distance on them best correlates with distance on all features
• Minimize (see the code sketch below):
$\text{Discrepancy} = \sum_{i<j} \left( d_{ij}^{\text{new}} - d_{ij}^{\text{old}} \right)^2$
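As a sketch, the discrepancy of a candidate subset can be computed directly from pairwise distances (numpy/scipy; the function name and the Euclidean metric are my assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist

def subset_discrepancy(X, feature_idx):
    """Slide's Discrepancy: squared difference between pairwise (Euclidean)
    distances on a feature subset and on the full feature set."""
    d_old = pdist(X)                  # d_ij^old: all features
    d_new = pdist(X[:, feature_idx])  # d_ij^new: candidate subset
    return float(np.sum((d_new - d_old) ** 2))
```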
18
Feature Subset Selection
• Suppose all features are relevant
• Does that mean dimensionality can’t be reduced?
• No!
• The manifold in feature space is what counts, not the relevance of individual features
• The manifold can have lower dimension than the feature space
19
PCA: Principal Component Analysis
• Given data in d dimensions
• Compute:
– d-dim mean vector M
– d×d-dim covariance matrix C
– eigenvectors and eigenvalues
– Sort by eigenvalues
– Select top k < d eigenvalues
– Project data onto the k eigenvectors
20
PCA
Mean vector M: $m_i = \frac{\sum_{\text{pts}} x_i}{\sum_{\text{pts}} 1}$
21
PCA
Covariance C: $c_{ij} = E\left[ (x_i - m_i)(x_j - m_j) \right]$
22
PCA
• Eigenvectors
– Unit vectors in directions of maximum variance
• Eigenvalues
– Magnitude of the variance in the direction of each eigenvector

$M \cdot x = \lambda \cdot x$
$(M - \lambda I) \cdot x = 0$
$M \cdot e_j = \lambda_j \cdot e_j$
23
PCA
• Find the largest eigenvalues $\lambda_1, \lambda_2, \lambda_3, \ldots$ and corresponding eigenvectors $e_1, e_2, e_3, \ldots$
• Project points onto the k principal components (see the code sketch below):
$x' = A^T (x - M)$
• where A is the d × k matrix whose columns are the k principal components; $x'$ is computed for each point x
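The whole recipe from slides 19–23 fits in a few lines of numpy; this is a generic sketch, not the authors' code:

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X (n x d) onto the top-k principal components."""
    M = X.mean(axis=0)                     # d-dim mean vector M
    C = np.cov(X, rowvar=False)            # d x d covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh since C is symmetric
    top = np.argsort(eigvals)[::-1][:k]    # sort by eigenvalue, keep top k < d
    A = eigvecs[:, top]                    # d x k; columns are the components
    return (X - M) @ A                     # x' = A^T (x - M), applied row-wise
```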
24
PCA via Autoencoder ANN
25
Non-Linear PCA by Autoencoder
26
PCA
• need vector representation
• 0-d: sample mean
• 1-d: y = mx + b
• 2-d: y₁ = mx + b; y₂ = m′x + b′
• the fitted representation minimizes the squared reconstruction error:
$\sum_{k=1}^{n} \left[ \left( m + \sum_{i=1}^{d_{\text{new}}} a_{ki} e_i \right) - x_k \right]^2$
27
MDS: Multidimensional Scaling
• PCA requires vector representation
• Given pairwise distances between n points?
• Find coordinates for the points in d-dimensional space s.t. distances are preserved “best” (example below)
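For illustration, scikit-learn's MDS accepts exactly this setting, a precomputed dissimilarity matrix with no vector representation (the toy matrix here is made up):

```python
import numpy as np
from sklearn.manifold import MDS

# n x n pairwise dissimilarities; no feature vectors needed
D = np.array([[0.0, 2.0, 5.0],
              [2.0, 0.0, 4.0],
              [5.0, 4.0, 0.0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # 2-D coordinates approximately preserving D
```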
30
MDS
• Assign points to coordinates x_i in d-dim space, initialized from:
– random coordinate values
– principal components
– dimensions with greatest variance
• Do gradient descent on the coordinates of each point until the distortion is minimized
31
Distortion:
$J_{ee} = \frac{\sum_{i<j} \left( d_{ij}^{\text{new}} - d_{ij}^{\text{old}} \right)^2}{\sum_{i<j} \left( d_{ij}^{\text{old}} \right)^2}$
32
Distortion:
$J_{ff} = \sum_{i<j} \left( \frac{d_{ij}^{\text{new}} - d_{ij}^{\text{old}}}{d_{ij}^{\text{old}}} \right)^2$
33
Distortion:
$J_{ef} = \frac{1}{\sum_{i<j} d_{ij}^{\text{old}}} \sum_{i<j} \frac{\left( d_{ij}^{\text{new}} - d_{ij}^{\text{old}} \right)^2}{d_{ij}^{\text{old}}}$
34
Gradient Descent on Coordinates
$\frac{\partial J_{ee}}{\partial x_k} = \frac{2}{\sum_{i<j} \left( d_{ij}^{\text{old}} \right)^2} \sum_{j \ne k} \left( d_{kj}^{\text{new}} - d_{kj}^{\text{old}} \right) \frac{x_k - x_j}{d_{kj}^{\text{new}}}$
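A numpy sketch of this update, assuming Euclidean d^new and the J_ee distortion from slide 31; the learning rate and iteration count are arbitrary choices of mine:

```python
import numpy as np

def mds_gradient_descent(D_old, dim=2, lr=0.05, iters=500, seed=0):
    """Minimize J_ee by gradient descent on the coordinates x_k."""
    n = D_old.shape[0]
    X = np.random.default_rng(seed).standard_normal((n, dim))  # random init
    norm = np.sum(np.triu(D_old, 1) ** 2)         # sum_{i<j} (d_ij^old)^2
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]      # x_k - x_j for all pairs
        D_new = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(D_new, 1.0)              # avoid divide-by-zero at k = j
        coef = (D_new - D_old) / D_new            # (d_kj^new - d_kj^old) / d_kj^new
        np.fill_diagonal(coef, 0.0)               # enforce the j != k condition
        grad = (2.0 / norm) * np.sum(coef[:, :, None] * diff, axis=1)
        X -= lr * grad
    return X
```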
35
Subjective Distances
• Brazil
• USA
• Egypt
• Congo
• Russia
• France
• Cuba
• Yugoslavia
• Israel
• China
38
How Many Dimensions?
• D too large
– perfect fit, no distortion
– not easy to understand/visualize
• D too small
– poor fit, much distortion
– easy to visualize, but the pattern may be misleading
• D just right?
42
Agglomerative Clustering of Proteins