39
Efficient Density-Based Clustering of Complex Objects Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle University of Munich Institute for Computer Science Brighton,UK November 01-04, 2004

Efficient Density-Based Clustering of Complex Objects Stefan Brecheisen, Hans-Peter Kriegel, Martin Pfeifle University of Munich Institute for Computer

Embed Size (px)

Citation preview

Efficient Density-Based Clustering of Complex Objects

Stefan Brecheisen, Hans-Peter Kriegel, Martin PfeifleUniversity of Munich

Institute for Computer Science

Brighton,UKNovember 01-04, 2004

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Outline

• Density-Based Clustering

• Clustering of Complex Objects

• Experimental Evaluation

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Outline

• Density-Based ClusteringCore Object · Density-Reachability ·

DBSCAN · OPTICS

• Clustering of Complex Objects

• Experimental Evaluation

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Data Mining

• Larger and larger amounts of data collected automatically

• Too large for humans to analyze manually

• Tools to assist analysis necessary KDD / Data Mining

Hubble Space Telescope Telecommunication Data Market-Basket Data

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Clustering

• Clustering– Efficiently grouping the database into sub-groups (clusters) such that

• similarity within clusters maximized

• similarity between clusters minimized

Flat Clustering one level of clusters

Hierarchical Clusteringnested clusters

e.g. density-based clustering algorithm

DBSCAN [KDD 96]

e.g. density-based clustering algorithm

OPTICS [SIGMOD 99]

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Density-Based Clustering I

• Parameters

– range and minimal weight MinPts

• Definition: core object

– q is core object if | rangeQuery (q,) | MinPts

• Definition: directly density-reachable

– p directly density-reachable from q if

q is a core object and p rangeQuery (q,)

• Definition: density-reachable– density-reachable: transitive closure of

“directly density-reachable”

qMinPts=5

p

q

MinPts=5

oq

r

MinPts=5

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Density-Based Clustering II

• Core Idea of Hierarchical Cluster Ordering:

Order the objects linearly such that objects of a cluster are adjacent in the ordering.

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Density-Based Clustering II

• Core Idea of Hierarchical Cluster Ordering:

Order the objects linearly such that objects of a cluster are adjacent in the ordering.

• Definition: core-distance

otherwise)(dist

|),(rangeQuery|if)(distcore , oMinPts

MinPtsooMinPts

core-distance(o)

o

MinPts = 5

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Density-Based Clustering II

• Core Idea of Hierarchical Cluster Ordering:

Order the objects linearly such that objects of a cluster are adjacent in the ordering.

• Definition: core-distance

• Definition: reachability-distance

otherwise)(dist

|),(rangeQuery|if)(distcore , oMinPts

MinPtsooMinPts

reach dist ( , ) max(core dist ( ),dist( , )), , MinPts MinPtsp o o p o

core-distance(o)

o

reachability-distance(p,o)

p p

reachability-distance(p,o)

MinPts = 5

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

OPTICS Algorithm

A I

B

J

K

L

R

M

P

N

CF

DE G H

44

reach

seedlist:

• Example Database (2-dimensional, 16 points)

• = 44, MinPts = 3

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

OPTICS Algorithm

A I

B

J

K

L

R

M

P

N

CF

DE G H

44

reach

seedlist:

• Example Database (2-dimensional, 16 points)

• = 44, MinPts = 3

A I

B

J

K

L

R

M

P

N

CF

DE G H

A

44

core-distance

(B,40) (I, 40)

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

OPTICS Algorithm

44

reach

• Example Database (2-dimensional, 16 points)

• = 44, MinPts = 3

A

44

BA I

B

J

K

L

R

M

P

N

CF

DE G H

seedlist: (I, 40) (C, 40)

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

OPTICS Algorithm

44

reach

• Example Database (2-dimensional, 16 points)

• = 44, MinPts = 3

A

44

BA I

B

J

K

L

R

M

P

N

CF

DE G H

I

seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

OPTICS Algorithm

44

reach

• Example Database (2-dimensional, 16 points)

• = 44, MinPts = 3

A

44

B IA I

B

J

K

L

R

M

P

N

CF

DE G H

J

seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

OPTICS Algorithm

44

reach

• Example Database (2-dimensional, 16 points)

• = 44, MinPts = 3

A

44

B I JA I

B

J

K

L

R

M

P

N

CF

DE G H

L

seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

OPTICS Algorithm

A I

B

J

K

L

R

M

P

N

CF

DE G H

seedlist: -

A B I J L M K N R P C D F G E H

44

reach

• Example Database (2-dimensional, 16 points)

• = 44, MinPts = 3

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

OPTICS Algorithm

A I

B

J

K

L

R

M

P

N

CF

DE G H

seedlist: -

A B I J L M K N R P C D F G E H

44

reach

• Example Database (2-dimensional, 16 points)

• = 44, MinPts = 3

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Outline

• Foundations of Density-Based ClusteringCore Object · Density-Reachability ·

DBSCAN · OPTICS

• Clustering of Complex ObjectsDirect Integration of the Multi-Step Query Processing

Paradigm

• Experimental Evaluation

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Complex Objects

complex objects

complex

models

complex distance measure

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Single-Step Clustering Approach

Exact information

Density-based Clustering algorithms, like DBSCAN and OPTICS

Query

Q(q,)Result R(q,)

• Performance Problems

• For each database object q,

we perform one range query.

• Expensive exact distance

computation do(o,q) for each

object o of the database

independent of the range

1 2

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Multi-Step Query Processing

• Multi-Step Similarity SearchRange Queries (Faloutsos et al. 94)

k-Nearest Neighbor Queries (Korn et al. 96)

Optimal k- Nearest Neighbor Queries (Seidl, Kriegel 98)

• No False Drops?

Filter Step(index-based)

Refinement Step(exact evaluation)

candidates

results

),(),( qpdqpd of

filter distance object distance

Lower-Bounding Property

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Traditional Multi-Step Clustering Approach

Range query processor(e.g. Faloutsos et al. 94)

Density-based Clustering algorithms, like DBSCAN and OPTICS • Performance Problems

• For each database object q, we

perform one range query (1).

• The range query is first

performed on the filter

information (2,3).

• One expensive exact distance

computation do(o,q) for each

object o of the candidate set

C(q,) is performed (4). This

refinement step is very expensive

for non-selective filters or high

values.

Query (q,)

1

CandidatesC(q,)

Filter information

QueryQ(q,)

using df

2 3

Exact information

refinement-step computation of do(o,q) for

all o C(q,)

4

Result (q,)

5

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Integrated Multi-Step Clustering Approach

Exact information

Filter information

ExtendedDensity-based Clustering algorithms,

like DBSCAN and OPTICS

Query

Q(q,)using df

CandidatesC (q,)

computation of do(o,q) for

Core - properties of q

1 2 3

• Direct integration of the multi-step query processingparadigm into the clustering algorithm

• postponing expensive exact distance computationsas long as possible

• Proposed Solution

• For each database object q, we

perform one range query on the

filter information (1,2).

• Only those exact distances

do(o,q) are computed which are

necessary to determine the

core-properties of q (3).

• A beneficial heuristic for

determining the reachability-

properties is applied which

saves on exact distance

computations (4).

postponed computations of

do(o,q) for Reach.-properties of o

4

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Filter Information

Q

• First, we carry out a range query on the filter for each query object Q.• Second, we order the resulting candidate set in ascending order according to the filter distance.• Third, we walk through the candidate set and perform exact distance calculations until we can be sure that we have found the MinPts nearest neighbors.

MinPts=3=75

df(K,Q)=10

df(Z,Q)=12

df(R,Q)=18

df(M,Q)=55

df(A,Q)=58

df(I,Q)=65

Sorted Distance List

R Z

K

M

A

I do(K,Q)=53

do(Z,Q)=69

do(R,Q)=49

Determination of Core-PropertiesIntegrated Multi-Step Clustering Approach

core-distance of Q =53

do(R,Q)=53

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

df(R,B)=18

df(R,D)=34

df(K,B)=20

df(K,L)=30

df(K,G)=43

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Integrated Multi-Step Clustering Approach

df(K,C)=55

d0(M,C)=65first elements areascendingly ordered

each list of predecessor objects is ascendingly ordered

do(R,Q)=53

do(Z,Q)=69

df(M,Q)=55

df(A,Q)=58

df(I,Q)=65

result list of the current query object Q whichhas to be inserted into the extended seedlist

do(K,Q)=53

Extended Seedlist

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

do(K,Q)=53

df(R,B)=18

df(R,D)=34

df(K,B)=20

df(K,L)=30

df(K,G)=43

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Integrated Multi-Step Clustering Approach

df(K,C)=55

d0(M,C)=65

do(R,Q)=53

do(Z,Q)=69

df(M,Q)=55

df(A,Q)=58

df(I,Q)=65

result list of the current query object Q whichhas to be inserted into the extended seedlist

do(K,Q)=53

Extended Seedlist

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

df(R,B)=18

df(R,D)=34

df(K,B)=20

df(K,L)=30

df(K,G)=43

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Integrated Multi-Step Clustering Approach

d0(M,C)=65

do(R,Q)=53

do(Z,Q)=69

df(M,Q)=55

df(A,Q)=58

df(I,Q)=65

result list of the current query object Q whichhas to be inserted into the extended seedlist

do(K,Q)=53

do(K,Q)=53

d0(Z,Q)=69

Extended Seedlist

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

df(R,B)=18

df(R,D)=34

df(K,B)=20

df(K,L)=30

df(K,G)=43

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Integrated Multi-Step Clustering Approach

do(R,Q)=53

do(Z,Q)=69

df(M,Q)=55

df(A,Q)=58

df(I,Q)=65

result list of the current query object Q whichhas to be inserted into the extended seedlist

do(K,Q)=53

do(K,Q)=53

d0(M,C)=65 d0(Z,Q)=69

d0(R,Q)=53

Extended Seedlist

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

df(R,B)=18

df(R,D)=34

df(K,B)=20

df(K,L)=30

df(K,G)=43

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Integrated Multi-Step Clustering Approach

do(R,Q)=53

do(Z,Q)=69

df(M,Q)=55

df(A,Q)=58

df(I,Q)=65

result list of the current query object Q whichhas to be inserted into the extended seedlist

do(K,Q)=53

do(K,Q)=53

d0(M,C)=65 d0(Z,Q)=69

d0(R,Q)=53

df(M,Q)=55

Extended Seedlist

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

df(R,B)=18

df(R,D)=34

df(K,B)=20

df(K,L)=30

df(K,G)=43

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Integrated Multi-Step Clustering Approach

do(R,Q)=53

do(Z,Q)=69

df(M,Q)=55

df(A,Q)=58

df(I,Q)=65

result list of the current query object Q whichhas to be inserted into the extended seedlist

do(K,Q)=53

do(K,Q)=53

d0(M,C)=65

d0(Z,Q)=69

d0(R,Q)=53

df(M,Q)=55 df(A,Q)=58

Extended Seedlist

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

df(A,Q)=58df(R,B)=18

df(R,D)=34

df(K,B)=20

df(K,L)=30

df(K,G)=43

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Extended SeedlistIntegrated Multi-Step Clustering Approach

do(R,Q)=53

do(Z,Q)=69

df(M,Q)=55

df(A,Q)=58

df(I,Q)=65

result list of the current query object Q whichhas to be inserted into the extended seedlist

do(K,Q)=53

do(K,Q)=53

d0(M,C)=65

d0(Z,Q)=69

d0(R,Q)=53

df(M,Q)=55 df(I,Q)=65

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

df(A,Q)=58df(K,B)=20

df(K,L)=30

df(K,G)=43

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Determination of Next Query ObjectIntegrated Multi-Step Clustering Approach

do(K,Q)=53

d0(M,C)=65

d0(Z,Q)=69

d0(R,Q)=53

df(M,Q)=55 df(I,Q)=65do(R,B)=44df(R,B)=18

df(R,D)=34

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

df(A,Q)=58

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Determination of Next Query ObjectIntegrated Multi-Step Clustering Approach

df(K,B)=20

df(K,L)=30

df(K,G)=43

do(K,Q)=53

d0(M,C)=65

d0(Z,Q)=69df(M,Q)=55 df(I,Q)=65

d0(R,Q)=53

do(R,B)=44df(R,D)=34

do(R,B)=44

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

df(A,Q)=58

• Data Structure “List of Lists”

• Additional information about possible predecessor objects are stored in order to

postpone exact distance calculations as long as possible.

Determination of Next Query ObjectIntegrated Multi-Step Clustering Approach

df(K,B)=20

df(K,L)=30

df(K,G)=43

do(K,Q)=53

d0(M,C)=65

d0(Z,Q)=69df(M,Q)=55 df(I,Q)=65

d0(R,Q)=53

do(R,B)=44df(R,D)=34

do(R,B)=44

d0(K,B)=25

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Outline

• Foundations of Density-Based ClusteringCore Object · Density-Reachability ·

DBSCAN · OPTICS

• Clustering of Complex ObjectsDirect Integration of the Multi-Step Query Processing

Paradigm

• Experimental Evaluation

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Experimental Evaluation

• High dimensional feature vectorsrepresenting CAD objects [DASFAA 03]

• not very selective filter used (Euclidean norm)

• Graphs representing images [DAWAK 03]• Expensive exact distance function• Selective filter used

Test Data Sets

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

1

10

100

1000

10000

500 1000 2000 3000

f ul l table scan

tr adi ti onal mul ti -step quer y pr ocess ing

integr ated mul ti step quer y pr ocess ing

1

10

100

1000

10000

500 1000 2000 3000

f ul l table scan

tr adi ti onal mul ti -step quer y pr ocessing

integr ated mul ti step quer y pr ocessing

no. of objects

run

tim

e [

se

c.]

no. of objects

run

tim

e [

se

c.]

Feature vectors

• Already non-selective filters (feature vectors) are helpful for accelerating DBSCAN by up to an order of magnitude when using the new integrated multi-step query processing approach.

• The traditional multi-step query processing approach does not benefit from non-selective filters (feature vectors), as the cardinality of the candidate set is still high even when small values are used.

• When filters of high selectivity (graphs) are used, our new integrated multi-step query processing approach leads to a speed-up of two orders of magnitude compared to a full table scan.

Graphs

Experimental EvaluationDBSCAN

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

1

10

100

1000

10000

100000

500 1000 2000 3000

f ul l table scan

tr adi ti onal mul ti -step quer y pr ocessingintegr ated mul ti -step quer y pr ocessing

no. of objects

run

tim

e [

se

c.]

no. of objects

run

tim

e [

se

c.]

• When using filters of high selectivity (graphs), our new integrated multi-step query processing approach outperforms the traditional multi-step query processing approach and the full table scan by a factor of up to 30.

• For high values, as used with OPTICS, the full table scan performs even better than the traditional multi-step query processing approach.

1

10

100

1000

10000

500 1000 2000 3000

f ul l table scan

tr adi ti onal mul ti -step quer y pr ocess ingintegr ated mul ti -step quer y pr ocess ing

Feature vectors Graphs

no. of objects

Experimental EvaluationOPTICS

Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK

Conclusions

Summary „Efficient Density-Based Clustering of Complex Objects“• direct integration of the multi-step query processing paradigm into the clustering algorithm• MinPts-nearest neighbor queries on the exact information• postponing expensive exact distance computations as long as possible

Future Work• integration of the multi-step query processing paradigm into other data mining algorithms