DENCLUE 2.0: Fast Clustering based on Kernel Density Estimation
Alexander Hinneburg, Martin-Luther-University Halle-Wittenberg, Germany
Hans-Henning Gabriel, 101tec GmbH, Halle, Germany




Overview

• Density-based clustering and DENCLUE 1.0
• Hill climbing as EM-algorithm
• Identification of local maxima
• Applications of general EM-acceleration
• Experiments


Density-Based Clustering

• Assumption
  – clusters are regions of high density in the data space
• How to estimate density?
  – parametric models
    • mixture models
  – non-parametric models
    • histogram
    • kernel density estimation


Kernel Density Estimation

• Idea
  – influence of a data point is modeled by a kernel
  – density is the normalized sum of all kernels
  – smoothing parameter h

Gaussian Kernel

Density Estimate
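The idea of a kernel density estimate can be sketched in Python with NumPy. The function names `gaussian_kernel` and `density`, the toy data, and the normalization by N·h^d (the usual convention for a d-dimensional Gaussian kernel) are assumptions made for illustration, not taken from the slides.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u), evaluated for every row of u (shape n x d)."""
    d = u.shape[1]
    return (2.0 * np.pi) ** (-d / 2.0) * np.exp(-0.5 * np.sum(u * u, axis=1))

def density(x, data, h):
    """Kernel density estimate at x: the normalized sum of all kernels."""
    n, d = data.shape
    u = (x - data) / h              # offsets to every data point, scaled by h
    return gaussian_kernel(u).sum() / (n * h ** d)

# Toy 1-d data (assumed): three points close together and one far away.
data = np.array([[0.0], [0.1], [0.2], [5.0]])
print(density(np.array([0.1]), data, h=0.5))   # inside the dense group
print(density(np.array([2.5]), data, h=0.5))   # in the gap between groups
```

A larger h smooths the estimate and merges nearby bumps; a smaller h produces more local maxima.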


DENCLUE 1.0 Framework

• Clusters are defined by local maxima of the density estimate
  – find all maxima by hill climbing
• Problem
  – constant step size

Gradient

Hill Climbing

const. step size
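Hill climbing with a constant step size can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel and a rule that stops as soon as a step no longer raises the density; the names `kernel_weights`, `gradient`, `fixed_step_climb` and the parameter `delta` are hypothetical, and constant factors of the density are dropped because they do not change the gradient direction.

```python
import numpy as np

def kernel_weights(x, data, h):
    """Unnormalized Gaussian kernel values of x against every data point."""
    diff = (data - x) / h
    return np.exp(-0.5 * np.sum(diff * diff, axis=1))

def gradient(x, data, h):
    """Gradient of the density estimate at x, up to a positive constant factor."""
    w = kernel_weights(x, data, h)
    return (w[:, None] * (data - x)).sum(axis=0)

def fixed_step_climb(x0, data, h, delta, max_steps=1000):
    """Move a constant length delta along the gradient while the density rises."""
    x = np.asarray(x0, dtype=float)
    steps = 0
    while steps < max_steps:
        g = gradient(x, data, h)
        norm = np.linalg.norm(g)
        if norm == 0.0:
            break
        candidate = x + delta * g / norm
        # Stop once a step no longer increases the (unnormalized) density.
        if kernel_weights(candidate, data, h).sum() <= kernel_weights(x, data, h).sum():
            break
        x = candidate
        steps += 1
    return x, steps

# Toy data (assumed): by symmetry the density peak is at 0.2.
data = np.array([[0.0], [0.2], [0.4]])
end, steps = fixed_step_climb([1.0], data, h=0.5, delta=0.07)
print(end, steps)
```

The end point lands within one step length of the maximum but generally not on it, and a smaller `delta` buys accuracy only at the price of more steps.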


Problem of Constant Step Size

• Not efficient
  – many unnecessary small steps
• Not effective
  – does not converge to a local maximum, just comes close
• Example


New Hill Climbing Approach

• General approach
  – differentiate the density estimate and set the result to zero
  – no closed-form solution, but the resulting equation can be used as an iteration
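For a Gaussian kernel, the step from "set the gradient to zero" to an iteration can be written out as follows. This is a sketch in assumed notation (data points x_t, t = 1..N, dimension d, bandwidth h, iteration index l), not a transcription of the slide's formulas.

```latex
\hat{p}(x) = \frac{1}{N h^{d}} \sum_{t=1}^{N} K\!\left(\frac{x - x_t}{h}\right),
\qquad
K(u) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\lVert u \rVert^{2}\right)

\nabla \hat{p}(x)
  = \frac{1}{N h^{d+2}} \sum_{t=1}^{N} K\!\left(\frac{x - x_t}{h}\right)(x_t - x) = 0
\;\Longrightarrow\;
x = \frac{\sum_{t=1}^{N} K\!\left(\frac{x - x_t}{h}\right) x_t}
         {\sum_{t=1}^{N} K\!\left(\frac{x - x_t}{h}\right)}

x^{(l+1)} = \frac{\sum_{t=1}^{N} K\!\left(\frac{x^{(l)} - x_t}{h}\right) x_t}
                 {\sum_{t=1}^{N} K\!\left(\frac{x^{(l)} - x_t}{h}\right)}
```

Because x appears on both sides of the middle equation, it cannot be solved directly; evaluating the right-hand side at the current position gives the next position.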


New DENCLUE 2.0 Hill Climbing

• Efficient
  – automatically adjusted step size at no extra cost
• Effective
  – converges to local maximum (proof follows)
• Example
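A hill climbing with automatically adjusted step size can be sketched as a kernel-weighted mean update. The code assumes a Gaussian kernel and a stopping rule based on the relative density gain; the names `kernel_weights`, `ssa_hill_climb` and the threshold `eps` are hypothetical choices for illustration.

```python
import numpy as np

def kernel_weights(x, data, h):
    """Unnormalized Gaussian kernel values of x against every data point."""
    diff = (data - x) / h
    return np.exp(-0.5 * np.sum(diff * diff, axis=1))

def ssa_hill_climb(x0, data, h, eps=1e-8, max_iter=500):
    """Repeat x <- kernel-weighted mean of the data; return the visited path."""
    x = np.asarray(x0, dtype=float)
    w = kernel_weights(x, data, h)      # also gives the density at x (up to a constant)
    path = [x]
    for _ in range(max_iter):
        x_new = w @ data / w.sum()      # next position, reusing the same weights
        w_new = kernel_weights(x_new, data, h)
        path.append(x_new)
        # Assumed stopping rule: relative density gain below eps.
        if (w_new.sum() - w.sum()) / w_new.sum() <= eps:
            break
        x, w = x_new, w_new
    return np.array(path)

# Toy data (assumed): by symmetry the density peak is at 0.2.
data = np.array([[0.0], [0.2], [0.4]])
path = ssa_hill_climb([1.0], data, h=0.5)
print(path[-1], len(path) - 1)
print(np.linalg.norm(np.diff(path, axis=0), axis=1))   # step lengths shrink near the peak
```

The kernel values needed for the density at the current position are the same ones that define the next position, which is why the step size adjustment comes without additional kernel evaluations.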


Proof of Convergence

• Cast the problem of maximizing the kernel density as maximizing the likelihood of a mixture model
• Introduce hidden variable


Proof of Convergence

• Complete likelihood is maximized by EM-Algorithm

• This also maximizes the original likelihood, which is the kernel density estimate

• When starting the EM at a given point, we do the hill climbing for that point

E-Step

M-Step
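For a Gaussian kernel, the two steps are commonly written as below. The symbols θ_t (posterior weight of data point x_t), x^{(l)} (position after l iterations), N and h are notation assumed here; this is the standard reading of a kernel-weighted mean update as EM, not necessarily the exact formulation on the slide.

```latex
\text{E-step:}\quad
\theta_t^{(l)} =
  \frac{K\!\left(\frac{x^{(l)} - x_t}{h}\right)}
       {\sum_{s=1}^{N} K\!\left(\frac{x^{(l)} - x_s}{h}\right)}

\text{M-step:}\quad
x^{(l+1)} = \sum_{t=1}^{N} \theta_t^{(l)} \, x_t
```

Since an EM iteration never decreases the likelihood, the density estimate at x^{(l)} is non-decreasing in l under this reading.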


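Identification of Local Maxima

• EM-Algorithm iterates until its stopping criterion is met, which yields
  – the reached end point
  – the sum of the k last step sizes
• Assumption
  – the true local maximum lies in a ball around the end point, with the sum of the k last step sizes as radius
• Points whose end points are close enough to each other belong to the same maximum M
• In case of non-unique assignment, do a few extra EM iterations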
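Grouping end points into maxima can be sketched as follows. Two choices here are assumptions: end points are merged when their distance is at most the sum of their two radii, and ambiguous cases are resolved by transitive merging (union-find) rather than by extra EM iterations. The helper names `radius_from_path` and `group_end_points` are hypothetical.

```python
import numpy as np

def radius_from_path(path, k=2):
    """Sum of the lengths of the last k steps of one hill-climbing path."""
    steps = np.linalg.norm(np.diff(np.asarray(path, dtype=float), axis=0), axis=1)
    return steps[-k:].sum()

def group_end_points(end_points, radii):
    """Assign one label per end point; overlapping balls share a label."""
    n = len(end_points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(end_points[i] - end_points[j])
            if dist <= radii[i] + radii[j]:  # assumed merge threshold
                parent[find(i)] = find(j)

    # Relabel union-find roots as consecutive ids 0, 1, 2, ...
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

# Toy end points (assumed): two near 0.2, one near 3.0.
end_points = np.array([[0.200], [0.201], [3.000]])
radii = [0.002, 0.001, 0.002]
print(group_end_points(end_points, radii))
```

The pairwise loop is quadratic in the number of end points; a spatial index would be the natural replacement for large data sets.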


Acceleration

• Sparse EM
  – update only the p% of points with largest posterior
  – saves (100-p)% of kernel computations after the first iteration
• Data Reduction
  – use only p% of the data as representative points
  – random sampling
  – k-Means
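The sparse EM idea can be sketched as below: after one full pass, only the points with the largest posteriors are re-evaluated, and all other posteriors stay frozen. This follows one common formulation of sparse EM and is not claimed to match the authors' implementation; `sparse_hill_climb` is a hypothetical name, and `p` is given as a fraction (0.5 means 50%).

```python
import numpy as np

def kernel_values(x, points, h):
    """Unnormalized Gaussian kernel values of x against the given points."""
    diff = (points - x) / h
    return np.exp(-0.5 * np.sum(diff * diff, axis=1))

def sparse_hill_climb(x0, data, h, p=0.5, n_iter=20):
    """Hill climbing that refreshes only the top-p fraction of posteriors."""
    x = np.asarray(x0, dtype=float)
    k = kernel_values(x, data, h)             # first iteration: all N kernels
    theta = k / k.sum()                       # posteriors of all points
    n_active = max(1, int(np.ceil(p * len(data))))
    active = np.argsort(theta)[-n_active:]    # points with largest posterior
    active_mass = theta[active].sum()         # mass shared among active points
    evaluations = len(data)
    for _ in range(n_iter):
        x = theta @ data                      # M-step: posterior-weighted mean
        k_active = kernel_values(x, data[active], h)   # sparse E-step
        theta[active] = active_mass * k_active / k_active.sum()
        evaluations += n_active               # frozen points cost nothing
    return x, evaluations

# Toy data (assumed): two well-separated groups around 0 and 10.
data = np.concatenate([np.linspace(-0.5, 0.5, 50), np.linspace(9.5, 10.5, 50)])[:, None]
x, evaluations = sparse_hill_climb([0.4], data, h=0.5, p=0.5)
print(x, evaluations, "kernel evaluations instead of", len(data) * 21)
```

Data reduction works at a different level: the same hill climbing is run against a smaller set of representative points, chosen for example by random sampling or as k-Means centers.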


Experiments

• Comparison of DENCLUE 1.0 (FS) vs. 2.0 (SSA)

• 16-dim. artificial data
• both methods are tuned to find the correct clustering


Experiments

• Comparison of acceleration methods


Experiments

• Clustering quality (normalized mutual information, NMI) vs. sample size (RS)
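Normalized mutual information between a found clustering and reference labels can be computed as sketched below. The slides do not state which normalization was used; dividing by the geometric mean of the two entropies is an assumption, and `nmi` is a hypothetical helper name.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Mutual information of two labelings, divided by sqrt(H(a) * H(b))."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    # Joint distribution of the two labelings from the contingency table.
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0                                    # skip empty cells (0 * log 0 = 0)
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    if ha == 0.0 or hb == 0.0:                        # a labeling with a single cluster
        return 1.0 if ha == hb else 0.0
    return mi / np.sqrt(ha * hb)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))   # same partition, labels renamed
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))   # statistically independent partitions
```

The measure depends only on the partition, not on the label names, which makes it suitable for comparing clusterings against a ground truth.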


Experiments

• Cluster Quality (NMI) of DENCLUE 2.0 (SSA) and acceleration methods and k-Means on real data

sample sizes 0.8, 0.4, 0.2


Conclusion

• New hill climbing for DENCLUE
• Automatic step size adjustment
• Convergence proof by reduction to EM
• Allows the application of general EM accelerations
• Future work
  – automatic setting of smoothing parameter h (so far tuned manually)


Thank you for your attention!