
On K-Means Cluster Preservation using Quantization Schemes

Deepak Turaga (1), Michalis Vlachos (2), Olivier Verscheure (1)

(1) IBM T.J. Watson Research Center, NY, USA
(2) IBM Zürich Research Laboratory, Switzerland

overview – what we want to do…

• Examine under what conditions compression methodologies retain the clustering outcome
• We focus on the K-Means algorithm

[Figure: k-Means on the original data and on the quantized data gives identical clustering results (clusters 1, 2, 3)]


why we want to do that…

• Reduced Storage – the quantized data take up less space
• Faster execution – since the data can be represented in a more compact form, the clustering algorithm requires less runtime
• Anonymization/Privacy Preservation – the original values are not disclosed
• Authentication – encode a message with the quantization

We will achieve the above and still guarantee the same clustering results.

other cluster preservation techniques

• We do not transform into another space
• Space requirements stay the same – no data simplification
• Shape preservation

[Oliveira04] S. R. M. Oliveira and O. R. Zaïane. Privacy Preservation When Sharing Data For Clustering, 2004.
[Parameswaran05] R. Parameswaran and D. Blough. A Robust Data Obfuscation Approach for Privacy Preservation of Clustered Data, 2005.

[Figure: original vs. quantized data]

K-Means Algorithm:

1. Initialize k cluster centers (k specified by the user) randomly.
2. Repeat until convergence:
   a. Assign each object to the nearest cluster center.
   b. Re-estimate the cluster centers.
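A minimal sketch of these steps in Python (using numpy; the function and variable names are ours, not from the slides):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-Means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1: random initialization
    for _ in range(n_iter):
        # step 2a: assign each object to the nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2b: re-estimate each center as the mean of its members
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                      # converged
            break
        centers = new_centers
    return labels, centers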


k-means overview

[Figure: k-Means example – successive iterations on a 2-D dataset, with points re-assigned and centers re-estimated until convergence]


k-means applications/usage

• Fast pre-clustering
• Real-time clustering (e.g. image and video effects) – color/image segmentation (see the sketch below)
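For instance, color/image segmentation can be done by running k-Means on the pixel colors. A sketch under that assumption, reusing the kmeans() helper defined above (the image is assumed to be an (h, w, 3) RGB numpy array):

import numpy as np
# kmeans() as defined in the earlier sketch

def segment_colors(image, k=4):
    """Quantize an (h, w, 3) RGB image to its k dominant colors."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)     # one 3-D point per pixel
    labels, centers = kmeans(pixels, k)             # cluster in color space
    return centers[labels].reshape(h, w, 3).astype(image.dtype)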

k-means objective function

• Objective: minimize the sum of intra-class variances

$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2$, where $\mu_j$ is the centroid of cluster $C_j$.

After some algebraic manipulation this can be written per cluster and per dimension (or time instance) using only the 2nd and 1st moments:

$J = \sum_{j=1}^{k} \sum_{d=1}^{D} \Big( \sum_{x_i \in C_j} x_{i,d}^2 - N_j \, \mu_{j,d}^2 \Big)$, with $N_j = |C_j|$.
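A quick numeric check of this rewrite for a single cluster (a sketch; the data and names are ours):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                           # one cluster: 50 objects, 3 dimensions
mu = X.mean(axis=0)
direct  = ((X - mu) ** 2).sum()                        # sum of squared distances to the centroid
moments = (X ** 2).sum() - len(X) * (mu ** 2).sum()    # 2nd moment minus N * (1st moment)^2
assert np.isclose(direct, moments)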

k-means objective function

So we can preserve the k-Means outcome if:

• we maintain the cluster assignment, and
• we preserve the 1st and 2nd moments of the cluster objects (per cluster and per dimension/time instance),

since these moments fully determine the objective function above.

moment preserving quantization

• 1st moment: average
• 2nd (central) moment: variance
• 3rd moment: skewness
• 4th moment: kurtosis

In order to preserve the first and second moments we use the following two-level quantizer:

For the $N$ values of one dimension within one cluster, with mean $\mu$, standard deviation $\sigma$, $N_a$ values above the mean, and $N_b = N - N_a$ values below (or equal to) it:

$q_{low} = \mu - \sigma \sqrt{N_a / N_b}$   (everything below the mean value is 'snapped' here)
$q_{high} = \mu + \sigma \sqrt{N_b / N_a}$   (everything above the mean value is 'snapped' here)

Choosing the two levels this way preserves both the mean and the variance of the values.


Example (the values of one dimension for one cluster of N = 20 objects):

original:  -2.4240 -0.2238 0.0581 -0.4246 -0.2029 -1.5131 -1.1264 -0.8150 0.3666 -0.5861 1.5374 0.1401 -1.8628 -0.4542 -0.6521 0.1033 -0.2206 -0.2790 -0.7337 -0.0645   (average = -0.4689)

quantized: -1.4795 0.2049 0.2049 0.2049 0.2049 -1.4795 -1.4795 -1.4795 0.2049 -1.4795 0.2049 0.2049 -1.4795 0.2049 -1.4795 0.2049 0.2049 0.2049 -1.4795 0.2049   (average = -0.4689)

Every value below the mean is snapped to $q_{low}$ = -1.4795 and every value above it to $q_{high}$ = 0.2049, so the average (and the variance) is unchanged.

These are the points for one dimension and for one cluster of objects. The process is repeated for all dimensions (or time instances) and for all clusters: we have one quantizer per class.
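A sketch of this moment-preserving quantizer in Python (the helper name is ours; the numbers reproduce the example above):

import numpy as np

def mpq_levels(x):
    """Two-level moment-preserving quantizer for the values of one
    dimension within one cluster (assumes the values are not all equal)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()            # 1st moment, sqrt of 2nd central moment
    above = x > mu
    n_a, n_b = above.sum(), (~above).sum()   # counts above / below the mean
    q_low  = mu - sigma * np.sqrt(n_a / n_b)
    q_high = mu + sigma * np.sqrt(n_b / n_a)
    return q_low, q_high, np.where(above, q_high, q_low)

x = np.array([-2.4240, -0.2238, 0.0581, -0.4246, -0.2029, -1.5131, -1.1264,
              -0.8150, 0.3666, -0.5861, 1.5374, 0.1401, -1.8628, -0.4542,
              -0.6521, 0.1033, -0.2206, -0.2790, -0.7337, -0.0645])
q_low, q_high, xq = mpq_levels(x)            # q_low ~ -1.4795, q_high ~ 0.2049
print(np.isclose(x.mean(), xq.mean()), np.isclose(x.var(), xq.var()))   # True True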

our quantization

• One quantizer per class
• The quantized data are binary
• Having one quantizer per class means we need to run k-Means once before we quantize
• This is not a shortcoming of the technique: we need to know the cluster boundaries so that we know how much we can simplify the data


why quantization works

• Why does the clustering remain the same before and after quantization?
  – The centers do not change (the averages remain the same)
  – The cluster assignment does not change, because the clusters 'shrink' due to quantization (see the sketch below)
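A small end-to-end check of this claim on synthetic, well-separated clusters (a sketch reusing the mpq_levels() helper defined above; data and names are ours):

import numpy as np
# mpq_levels() as defined in the earlier sketch

rng = np.random.default_rng(1)
# three well-separated 2-D clusters of 40 points each
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in ([0, 0], [4, 0], [0, 4])])
labels = np.repeat([0, 1, 2], 40)
centers = np.array([X[labels == j].mean(axis=0) for j in range(3)])

# quantize each cluster separately: one quantizer per class, per dimension
Xq = X.copy()
for j in range(3):
    for d in range(2):
        Xq[labels == j, d] = mpq_levels(X[labels == j, d])[2]

# the centers are unchanged and the nearest-center assignments are preserved
new_centers = np.array([Xq[labels == j].mean(axis=0) for j in range(3)])
new_labels = np.linalg.norm(Xq[:, None] - centers[None], axis=2).argmin(axis=1)
print(np.allclose(centers, new_centers), (new_labels == labels).all())   # True True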

will it always work?

• The results will be the same for datasets with well-formed clusters
• A discrepancy between the two results means that the clusters were not that dense

recap

• Use moment-preserving quantization to preserve the objective function (per cluster and per dimension, via the 2nd and 1st moments)
• Due to cluster shrinkage, cluster assignments will not change
• Identical results for optimal k-Means
• One quantizer per class
• 1-bit quantizer per dimension


example: shape preservation

[Bagnall06] A. J. Bagnall, C. A. Ratanamahatana, E. J. Keogh, S. Lonardi, and G. J. Janacek. A Bit Level Representation for Time Series Data Mining with Shape Based Similarity. In Data Min. Knowl. Discov. 13(1), pages 11–40, 2006.

example: cluster preservation

• 3 years of Nasdaq stock ticker data
• We cluster into k = 8 clusters

[Figure: confusion matrix and the 8 cluster centers (1–8)]

• 3% mislabeled data after the moment-preserving quantization
• With binary clipping: 80% mislabeled

quantization levels indicate cluster spread

[Figure: the two quantization levels per dimension for each cluster]

example: label preservation

• 2 datasets – contours of fish and contours of leaves
• Clustering and then k-NN voting
• For rotation invariance we use rotation-invariant features

[Figure: leaf contours – Acer platanoides, Salix fragilis, Tilia, Quercus robur]

[Figure: space-time and frequency representations of a contour]

example: label preservation

• Very low mislabeling error for the moment-preserving quantization (MPQ)
• High error rate for binary clipping


other nice characteristics

• Low sensitivity to initial centers – the mismatch when starting from different centers is around 7%
• Neighborhood preservation, even though we are not optimizing for that directly – good results because we are preserving the 'shape' of the objects

• Size reduction by a factor of 3 when using the quantized scheme
• The compression factor decreases for increasing K

summary

• A 1-bit quantizer per dimension is sufficient to preserve k-Means 'as well as possible'
• Theoretically the results will be identical (under conditions)
• Good 'shape' preservation

Future work:
• Multi-bit quantization
• Multi-dimensional quantization

end..
