DENCLUE 2.0: Fast Clustering based on Kernel Density Estimation Alexander Hinneburg Martin-Luther-University Halle-Wittenberg, Germany Hans-Henning Gabriel

DENCLUE 2.0: Fast Clustering based on Kernel Density Estimation

Alexander Hinneburg, Martin-Luther-University Halle-Wittenberg, Germany

Hans-Henning Gabriel, 101tec GmbH, Halle, Germany

Overview

• Density-based clustering and DENCLUE 1.0
• Hill climbing as EM-algorithm
• Identification of local maxima
• Applications of general EM-acceleration
• Experiments

Density-Based Clustering

• Assumption
– clusters are regions of high density in the data space
• How to estimate density?
– parametric models
  • mixture models
– non-parametric models
  • histogram
  • kernel density estimation

Kernel Density Estimation

• Idea
– influence of a data point is modeled by a kernel
– density is the normalized sum of all kernels
– smoothing parameter h

Gaussian Kernel

Density Estimate
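
The two formulas named on this slide can be sketched in code. This is a minimal illustration, assuming the standard isotropic Gaussian kernel and the usual normalization by N·h^d; the function names `gaussian_kernel` and `density` are chosen here for illustration and are not taken from the slides.

```python
import math

def gaussian_kernel(u):
    """Standard isotropic Gaussian kernel for a d-dimensional vector u."""
    d = len(u)
    sq_norm = sum(c * c for c in u)
    return (2.0 * math.pi) ** (-d / 2.0) * math.exp(-0.5 * sq_norm)

def density(x, data, h):
    """Kernel density estimate at x: normalized sum of one kernel per data point."""
    d = len(x)
    total = sum(
        gaussian_kernel([(xi - ti) / h for xi, ti in zip(x, t)]) for t in data
    )
    return total / (len(data) * h ** d)

# Toy 1-D data (made up for illustration): two nearby points and one far away.
data = [[0.0], [0.1], [5.0]]
near_cluster = density([0.05], data, h=0.5)
in_gap = density([2.5], data, h=0.5)
```

On this toy data the estimate is much larger next to the two nearby points than in the empty gap, which is the sense in which clusters are "regions of high density".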

DENCLUE 1.0 Framework

• Clusters are defined by local maxima of the density estimate
– find all maxima by hill climbing
• Problem
– const. step size

Gradient

Hill Climbing

const. step size
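
A constant-step hill climbing of this kind can be sketched as follows. This is an illustration only, assuming a Gaussian kernel, a step of fixed length `delta` along the normalized gradient, and a stop as soon as the density no longer increases; constant factors of the density and gradient are dropped because they do not change the direction or the comparison. All names are hypothetical.

```python
import math

def kernel_weights(x, data, h):
    """Unnormalized Gaussian kernel values K((x - x_t)/h) for every data point."""
    return [
        math.exp(-0.5 * sum(((a - b) / h) ** 2 for a, b in zip(x, t)))
        for t in data
    ]

def density(x, data, h):
    """Density estimate up to a constant normalization factor."""
    return sum(kernel_weights(x, data, h))

def gradient(x, data, h):
    """Gradient of the density estimate, up to a positive constant factor."""
    w = kernel_weights(x, data, h)
    return [sum(wt * (t[j] - x[j]) for wt, t in zip(w, data)) for j in range(len(x))]

def fixed_step_climb(x, data, h, delta, max_iter=10_000):
    """Move a fixed distance delta uphill until the density stops increasing."""
    steps = 0
    for _ in range(max_iter):
        g = gradient(x, data, h)
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            break
        candidate = [xi + delta * gi / norm for xi, gi in zip(x, g)]
        if density(candidate, data, h) <= density(x, data, h):
            break
        x = candidate
        steps += 1
    return x, steps

# Toy 1-D data (made up): the density has a single maximum at 0.2.
data = [[0.0], [0.2], [0.4]]
end_point, steps = fixed_step_climb([2.0], data, h=0.5, delta=0.05)
```

The end point can only be guaranteed to lie within one step length `delta` of the maximum, and the number of steps grows as `delta` shrinks.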

Problem of const. Step Size

• Not efficient
– many unnecessary small steps
• Not effective
– does not converge to a local maximum, just comes close
• Example

New Hill Climbing Approach

• General approach
– differentiate density estimate and set to zero
– no closed-form solution, but the resulting equation can be used for iteration
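
To make the idea concrete: for a Gaussian kernel, setting the gradient of the density estimate to zero leaves x on both sides of the equation, with the right-hand side being the kernel-weighted mean of the data points. Applying that right-hand side repeatedly gives an iteration. The sketch below assumes a Gaussian kernel; the names `ssa_step` and `ssa_climb` and the stopping tolerance are hypothetical choices for illustration.

```python
import math

def ssa_step(x, data, h):
    """One fixed-point update: kernel-weighted mean of the data, weights taken at x."""
    w = [
        math.exp(-0.5 * sum(((a - b) / h) ** 2 for a, b in zip(x, t)))
        for t in data
    ]
    total = sum(w)  # assumes x is not so far from the data that all weights underflow
    return [sum(wt * t[j] for wt, t in zip(w, data)) / total for j in range(len(x))]

def ssa_climb(x, data, h, tol=1e-9, max_iter=1000):
    """Iterate the update until the step length falls below tol."""
    for steps in range(1, max_iter + 1):
        new_x = ssa_step(x, data, h)
        step_length = math.dist(new_x, x)
        x = new_x
        if step_length < tol:
            break
    return x, steps

# Toy 1-D data (made up): the density has a single maximum at 0.2.
data = [[0.0], [0.2], [0.4]]
end_point, steps = ssa_climb([2.0], data, h=0.5)
```

No step size appears anywhere: the length of each move follows from the kernel weights, which have to be computed anyway.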

New DENCLUE 2.0 Hill Climbing

• Efficient
– automatically adjusted step size at no extra cost
• Effective
– converges to local maximum (proof follows)
• Example

Proof of Convergence

• Cast the problem of maximizing the kernel density as maximizing the likelihood of a mixture model
• Introduce hidden variable

Proof of Convergence

• Complete likelihood is maximized by the EM algorithm

• This also maximizes the original likelihood, which is the kernel density estimate

• Starting the EM algorithm at a given point performs the hill climbing for that point

E-Step

M-Step
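
The two steps can be sketched separately. This is a minimal illustration under the assumption of a Gaussian kernel with equal mixture weights: the E-step computes, for each data point, the posterior probability that its kernel is responsible for the current position, and the M-step moves the position to the posterior-weighted mean. The function names are hypothetical, and the density is tracked only up to its constant normalization factor.

```python
import math

def kernel_weights(x, data, h):
    """Unnormalized Gaussian kernel values for every data point."""
    return [math.exp(-0.5 * (math.dist(x, t) / h) ** 2) for t in data]

def e_step(x, data, h):
    """Posterior of each kernel (data point) given the current position x."""
    w = kernel_weights(x, data, h)
    total = sum(w)
    return [wt / total for wt in w]

def m_step(posterior, data):
    """New position: posterior-weighted mean of the data points."""
    dim = len(data[0])
    return [sum(p * t[j] for p, t in zip(posterior, data)) for j in range(dim)]

def density(x, data, h):
    """Density estimate up to a constant normalization factor."""
    return sum(kernel_weights(x, data, h))

# Toy 2-D data (made up): three nearby points and one distant point.
data = [[0.0, 0.0], [0.3, 0.1], [0.1, 0.4], [4.0, 4.0]]
x = [1.0, 1.0]
trace = [density(x, data, 0.5)]
for _ in range(20):
    x = m_step(e_step(x, data, 0.5), data)
    trace.append(density(x, data, 0.5))
```

Recording the density after every iteration, as `trace` does, gives a simple check of the convergence property: the values should never decrease.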

Identification of Local Maxima

• EM algorithm iterates until
– reached end point
– sum of k last step sizes
• Assumption
– true local maximum lies in a ball around the end point, with radius given by the sum of the k last step sizes
• Points whose end points are close enough together belong to the same maximum M
• In case of non-unique assignment, do a few extra EM iterations
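
Grouping end points into maxima can be sketched as below. This is an illustration only: the distance `threshold` is a hypothetical parameter standing in for the closeness criterion on the slide, and end points are merged transitively with a small union-find structure. The handling of non-unique assignments by extra EM iterations is not covered.

```python
import math

def group_end_points(end_points, threshold):
    """Give the same label to end points closer than threshold (transitively)."""
    parent = list(range(len(end_points)))

    def find(i):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(end_points)):
        for j in range(i + 1, len(end_points)):
            if math.dist(end_points[i], end_points[j]) < threshold:
                parent[find(i)] = find(j)

    # Map each representative to a consecutive cluster label.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(end_points))]

# Made-up end points of four hill climbs: two near (0, 0), two near (5, 5).
end_points = [[0.0, 0.0], [0.01, 0.0], [5.0, 5.0], [5.0, 5.01]]
cluster_labels = group_end_points(end_points, threshold=0.1)
```

The pairwise loop is quadratic in the number of end points, which is acceptable for a sketch but would need a spatial index for large data sets.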

Acceleration

• Sparse EM
– update only the p% of points with the largest posterior
– saves (100-p)% of kernel computations after the first iteration
• Data Reduction
– use only p% of the data as representative points
– random sampling
– k-Means
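
The data-reduction variant with random sampling can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel and the kernel-weighted-mean update for the hill climbing; the function names, the seed handling, and the toy data are hypothetical. Sparse EM and the k-Means variant are not shown.

```python
import math
import random

def reduce_by_sampling(data, fraction, seed=0):
    """Pick a random subset of the data to act as representative points."""
    k = max(1, round(fraction * len(data)))
    return random.Random(seed).sample(data, k)

def climb(x, representatives, h, tol=1e-9, max_iter=1000):
    """Kernel-weighted-mean hill climbing, evaluated on the representatives only."""
    for _ in range(max_iter):
        w = [math.exp(-0.5 * (math.dist(x, r) / h) ** 2) for r in representatives]
        total = sum(w)
        new_x = [
            sum(wt * r[j] for wt, r in zip(w, representatives)) / total
            for j in range(len(x))
        ]
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x

# Made-up 2-D data: 500 points from a single Gaussian blob.
rng = random.Random(1)
data = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(500)]
representatives = reduce_by_sampling(data, fraction=0.2)
end_point = climb(data[0], representatives, h=0.5)
```

Every iteration now evaluates 100 kernels instead of 500. The trade-off is that the density is estimated from fewer points, so the located maxima can shift.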

Experiments

• Comparison of DENCLUE 1.0 (FS) vs. 2.0 (SSA)

• 16-dim. artificial data
• both methods are tuned to find the correct clustering

Experiments

• Comparison of acceleration methods

Experiments

• Clustering quality (normalized mutual information, NMI) vs. sample size (RS)
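
For reference, normalized mutual information between a found clustering and the true labels can be computed as sketched below. Several normalizations are in use; this sketch assumes division by the geometric mean of the two entropies, which may differ from the variant used in the experiments. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(labels_a, labels_b):
    """Mutual information divided by sqrt(H(a) * H(b)) (one common normalization)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one side puts everything in a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mi / math.sqrt(h_a * h_b)

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The score does not depend on how clusters are numbered: a clustering identical to the truth up to renaming scores 1, and a statistically independent one scores 0.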

Experiments

• Cluster Quality (NMI) of DENCLUE 2.0 (SSA) and acceleration methods and k-Means on real data

sample sizes 0.8, 0.4, 0.2

Conclusion

• New hill climbing for DENCLUE
• Automatic step size adjustment
• Convergence proof by reduction to EM
• Allows the application of general EM accelerations
• Future work
– automatic setting of smoothing parameter h (so far tuned manually)

Thank you for your attention!
