Upload
lorena-figge
View
216
Download
2
Tags:
Embed Size (px)
Citation preview
Efficient Density-Based Clustering of Complex Objects
Stefan Brecheisen, Hans-Peter Kriegel, Martin PfeifleUniversity of Munich
Institute for Computer Science
Brighton,UKNovember 01-04, 2004
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Outline
• Density-Based Clustering
• Clustering of Complex Objects
• Experimental Evaluation
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Outline
• Density-Based ClusteringCore Object · Density-Reachability ·
DBSCAN · OPTICS
• Clustering of Complex Objects
• Experimental Evaluation
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Data Mining
• Larger and larger amounts of data collected automatically
• Too large for humans to analyze manually
• Tools to assist analysis necessary KDD / Data Mining
Hubble Space Telescope Telecommunication Data Market-Basket Data
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Clustering
• Clustering– Efficiently grouping the database into sub-groups (clusters) such that
• similarity within clusters maximized
• similarity between clusters minimized
Flat Clustering one level of clusters
Hierarchical Clusteringnested clusters
e.g. density-based clustering algorithm
DBSCAN [KDD 96]
e.g. density-based clustering algorithm
OPTICS [SIGMOD 99]
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Density-Based Clustering I
• Parameters
– range and minimal weight MinPts
• Definition: core object
– q is core object if | rangeQuery (q,) | MinPts
• Definition: directly density-reachable
– p directly density-reachable from q if
q is a core object and p rangeQuery (q,)
• Definition: density-reachable– density-reachable: transitive closure of
“directly density-reachable”
qMinPts=5
p
q
MinPts=5
oq
r
MinPts=5
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Density-Based Clustering II
• Core Idea of Hierarchical Cluster Ordering:
Order the objects linearly such that objects of a cluster are adjacent in the ordering.
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Density-Based Clustering II
• Core Idea of Hierarchical Cluster Ordering:
Order the objects linearly such that objects of a cluster are adjacent in the ordering.
• Definition: core-distance
otherwise)(dist
|),(rangeQuery|if)(distcore , oMinPts
MinPtsooMinPts
core-distance(o)
o
MinPts = 5
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Density-Based Clustering II
• Core Idea of Hierarchical Cluster Ordering:
Order the objects linearly such that objects of a cluster are adjacent in the ordering.
• Definition: core-distance
• Definition: reachability-distance
otherwise)(dist
|),(rangeQuery|if)(distcore , oMinPts
MinPtsooMinPts
reach dist ( , ) max(core dist ( ),dist( , )), , MinPts MinPtsp o o p o
core-distance(o)
o
reachability-distance(p,o)
p p
reachability-distance(p,o)
MinPts = 5
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
OPTICS Algorithm
A I
B
J
K
L
R
M
P
N
CF
DE G H
44
reach
seedlist:
• Example Database (2-dimensional, 16 points)
• = 44, MinPts = 3
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
OPTICS Algorithm
A I
B
J
K
L
R
M
P
N
CF
DE G H
44
reach
seedlist:
• Example Database (2-dimensional, 16 points)
• = 44, MinPts = 3
A I
B
J
K
L
R
M
P
N
CF
DE G H
A
44
core-distance
(B,40) (I, 40)
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
OPTICS Algorithm
44
reach
• Example Database (2-dimensional, 16 points)
• = 44, MinPts = 3
A
44
BA I
B
J
K
L
R
M
P
N
CF
DE G H
seedlist: (I, 40) (C, 40)
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
OPTICS Algorithm
44
reach
• Example Database (2-dimensional, 16 points)
• = 44, MinPts = 3
A
44
BA I
B
J
K
L
R
M
P
N
CF
DE G H
I
seedlist: (J, 20) (K, 20) (L, 31) (C, 40) (M, 40) (R, 43)
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
OPTICS Algorithm
44
reach
• Example Database (2-dimensional, 16 points)
• = 44, MinPts = 3
A
44
B IA I
B
J
K
L
R
M
P
N
CF
DE G H
J
seedlist: (L, 19) (K, 20) (R, 21) (M, 30) (P, 31) (C, 40)
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
OPTICS Algorithm
44
reach
• Example Database (2-dimensional, 16 points)
• = 44, MinPts = 3
A
44
B I JA I
B
J
K
L
R
M
P
N
CF
DE G H
L
…
seedlist: (M, 18) (K, 18) (R, 20) (P, 21) (N, 35) (C, 40)
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
OPTICS Algorithm
A I
B
J
K
L
R
M
P
N
CF
DE G H
seedlist: -
A B I J L M K N R P C D F G E H
44
reach
• Example Database (2-dimensional, 16 points)
• = 44, MinPts = 3
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
OPTICS Algorithm
A I
B
J
K
L
R
M
P
N
CF
DE G H
seedlist: -
A B I J L M K N R P C D F G E H
44
reach
• Example Database (2-dimensional, 16 points)
• = 44, MinPts = 3
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Outline
• Foundations of Density-Based ClusteringCore Object · Density-Reachability ·
DBSCAN · OPTICS
• Clustering of Complex ObjectsDirect Integration of the Multi-Step Query Processing
Paradigm
• Experimental Evaluation
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Complex Objects
complex objects
complex
models
complex distance measure
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Single-Step Clustering Approach
Exact information
Density-based Clustering algorithms, like DBSCAN and OPTICS
Query
Q(q,)Result R(q,)
• Performance Problems
• For each database object q,
we perform one range query.
• Expensive exact distance
computation do(o,q) for each
object o of the database
independent of the range
1 2
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Multi-Step Query Processing
• Multi-Step Similarity SearchRange Queries (Faloutsos et al. 94)
k-Nearest Neighbor Queries (Korn et al. 96)
Optimal k- Nearest Neighbor Queries (Seidl, Kriegel 98)
• No False Drops?
Filter Step(index-based)
Refinement Step(exact evaluation)
candidates
results
),(),( qpdqpd of
filter distance object distance
Lower-Bounding Property
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Traditional Multi-Step Clustering Approach
Range query processor(e.g. Faloutsos et al. 94)
Density-based Clustering algorithms, like DBSCAN and OPTICS • Performance Problems
• For each database object q, we
perform one range query (1).
• The range query is first
performed on the filter
information (2,3).
• One expensive exact distance
computation do(o,q) for each
object o of the candidate set
C(q,) is performed (4). This
refinement step is very expensive
for non-selective filters or high
values.
Query (q,)
1
CandidatesC(q,)
Filter information
QueryQ(q,)
using df
2 3
Exact information
refinement-step computation of do(o,q) for
all o C(q,)
4
Result (q,)
5
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Integrated Multi-Step Clustering Approach
Exact information
Filter information
ExtendedDensity-based Clustering algorithms,
like DBSCAN and OPTICS
Query
Q(q,)using df
CandidatesC (q,)
computation of do(o,q) for
Core - properties of q
1 2 3
• Direct integration of the multi-step query processingparadigm into the clustering algorithm
• postponing expensive exact distance computationsas long as possible
• Proposed Solution
• For each database object q, we
perform one range query on the
filter information (1,2).
• Only those exact distances
do(o,q) are computed which are
necessary to determine the
core-properties of q (3).
• A beneficial heuristic for
determining the reachability-
properties is applied which
saves on exact distance
computations (4).
postponed computations of
do(o,q) for Reach.-properties of o
4
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Filter Information
Q
• First, we carry out a range query on the filter for each query object Q.• Second, we order the resulting candidate set in ascending order according to the filter distance.• Third, we walk through the candidate set and perform exact distance calculations until we can be sure that we have found the MinPts nearest neighbors.
MinPts=3=75
df(K,Q)=10
df(Z,Q)=12
df(R,Q)=18
df(M,Q)=55
df(A,Q)=58
df(I,Q)=65
Sorted Distance List
R Z
K
M
A
I do(K,Q)=53
do(Z,Q)=69
do(R,Q)=49
Determination of Core-PropertiesIntegrated Multi-Step Clustering Approach
core-distance of Q =53
do(R,Q)=53
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
df(R,B)=18
df(R,D)=34
df(K,B)=20
df(K,L)=30
df(K,G)=43
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Integrated Multi-Step Clustering Approach
df(K,C)=55
d0(M,C)=65first elements areascendingly ordered
each list of predecessor objects is ascendingly ordered
do(R,Q)=53
do(Z,Q)=69
df(M,Q)=55
df(A,Q)=58
df(I,Q)=65
result list of the current query object Q whichhas to be inserted into the extended seedlist
do(K,Q)=53
Extended Seedlist
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
do(K,Q)=53
df(R,B)=18
df(R,D)=34
df(K,B)=20
df(K,L)=30
df(K,G)=43
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Integrated Multi-Step Clustering Approach
df(K,C)=55
d0(M,C)=65
do(R,Q)=53
do(Z,Q)=69
df(M,Q)=55
df(A,Q)=58
df(I,Q)=65
result list of the current query object Q whichhas to be inserted into the extended seedlist
do(K,Q)=53
Extended Seedlist
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
df(R,B)=18
df(R,D)=34
df(K,B)=20
df(K,L)=30
df(K,G)=43
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Integrated Multi-Step Clustering Approach
d0(M,C)=65
do(R,Q)=53
do(Z,Q)=69
df(M,Q)=55
df(A,Q)=58
df(I,Q)=65
result list of the current query object Q whichhas to be inserted into the extended seedlist
do(K,Q)=53
do(K,Q)=53
d0(Z,Q)=69
Extended Seedlist
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
df(R,B)=18
df(R,D)=34
df(K,B)=20
df(K,L)=30
df(K,G)=43
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Integrated Multi-Step Clustering Approach
do(R,Q)=53
do(Z,Q)=69
df(M,Q)=55
df(A,Q)=58
df(I,Q)=65
result list of the current query object Q whichhas to be inserted into the extended seedlist
do(K,Q)=53
do(K,Q)=53
d0(M,C)=65 d0(Z,Q)=69
d0(R,Q)=53
Extended Seedlist
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
df(R,B)=18
df(R,D)=34
df(K,B)=20
df(K,L)=30
df(K,G)=43
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Integrated Multi-Step Clustering Approach
do(R,Q)=53
do(Z,Q)=69
df(M,Q)=55
df(A,Q)=58
df(I,Q)=65
result list of the current query object Q whichhas to be inserted into the extended seedlist
do(K,Q)=53
do(K,Q)=53
d0(M,C)=65 d0(Z,Q)=69
d0(R,Q)=53
df(M,Q)=55
Extended Seedlist
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
df(R,B)=18
df(R,D)=34
df(K,B)=20
df(K,L)=30
df(K,G)=43
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Integrated Multi-Step Clustering Approach
do(R,Q)=53
do(Z,Q)=69
df(M,Q)=55
df(A,Q)=58
df(I,Q)=65
result list of the current query object Q whichhas to be inserted into the extended seedlist
do(K,Q)=53
do(K,Q)=53
d0(M,C)=65
d0(Z,Q)=69
d0(R,Q)=53
df(M,Q)=55 df(A,Q)=58
Extended Seedlist
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
df(A,Q)=58df(R,B)=18
df(R,D)=34
df(K,B)=20
df(K,L)=30
df(K,G)=43
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Extended SeedlistIntegrated Multi-Step Clustering Approach
do(R,Q)=53
do(Z,Q)=69
df(M,Q)=55
df(A,Q)=58
df(I,Q)=65
result list of the current query object Q whichhas to be inserted into the extended seedlist
do(K,Q)=53
do(K,Q)=53
d0(M,C)=65
d0(Z,Q)=69
d0(R,Q)=53
df(M,Q)=55 df(I,Q)=65
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
df(A,Q)=58df(K,B)=20
df(K,L)=30
df(K,G)=43
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Determination of Next Query ObjectIntegrated Multi-Step Clustering Approach
do(K,Q)=53
d0(M,C)=65
d0(Z,Q)=69
d0(R,Q)=53
df(M,Q)=55 df(I,Q)=65do(R,B)=44df(R,B)=18
df(R,D)=34
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
df(A,Q)=58
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Determination of Next Query ObjectIntegrated Multi-Step Clustering Approach
df(K,B)=20
df(K,L)=30
df(K,G)=43
do(K,Q)=53
d0(M,C)=65
d0(Z,Q)=69df(M,Q)=55 df(I,Q)=65
d0(R,Q)=53
do(R,B)=44df(R,D)=34
do(R,B)=44
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
df(A,Q)=58
• Data Structure “List of Lists”
• Additional information about possible predecessor objects are stored in order to
postpone exact distance calculations as long as possible.
Determination of Next Query ObjectIntegrated Multi-Step Clustering Approach
df(K,B)=20
df(K,L)=30
df(K,G)=43
do(K,Q)=53
d0(M,C)=65
d0(Z,Q)=69df(M,Q)=55 df(I,Q)=65
d0(R,Q)=53
do(R,B)=44df(R,D)=34
do(R,B)=44
d0(K,B)=25
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Outline
• Foundations of Density-Based ClusteringCore Object · Density-Reachability ·
DBSCAN · OPTICS
• Clustering of Complex ObjectsDirect Integration of the Multi-Step Query Processing
Paradigm
• Experimental Evaluation
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Experimental Evaluation
• High dimensional feature vectorsrepresenting CAD objects [DASFAA 03]
• not very selective filter used (Euclidean norm)
• Graphs representing images [DAWAK 03]• Expensive exact distance function• Selective filter used
Test Data Sets
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
1
10
100
1000
10000
500 1000 2000 3000
f ul l table scan
tr adi ti onal mul ti -step quer y pr ocess ing
integr ated mul ti step quer y pr ocess ing
1
10
100
1000
10000
500 1000 2000 3000
f ul l table scan
tr adi ti onal mul ti -step quer y pr ocessing
integr ated mul ti step quer y pr ocessing
no. of objects
run
tim
e [
se
c.]
no. of objects
run
tim
e [
se
c.]
Feature vectors
• Already non-selective filters (feature vectors) are helpful for accelerating DBSCAN by up to an order of magnitude when using the new integrated multi-step query processing approach.
• The traditional multi-step query processing approach does not benefit from non-selective filters (feature vectors), as the cardinality of the candidate set is still high even when small values are used.
• When filters of high selectivity (graphs) are used, our new integrated multi-step query processing approach leads to a speed-up of two orders of magnitude compared to a full table scan.
Graphs
Experimental EvaluationDBSCAN
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
1
10
100
1000
10000
100000
500 1000 2000 3000
f ul l table scan
tr adi ti onal mul ti -step quer y pr ocessingintegr ated mul ti -step quer y pr ocessing
no. of objects
run
tim
e [
se
c.]
no. of objects
run
tim
e [
se
c.]
• When using filters of high selectivity (graphs), our new integrated multi-step query processing approach outperforms the traditional multi-step query processing approach and the full table scan by a factor of up to 30.
• For high values, as used with OPTICS, the full table scan performs even better than the traditional multi-step query processing approach.
1
10
100
1000
10000
500 1000 2000 3000
f ul l table scan
tr adi ti onal mul ti -step quer y pr ocess ingintegr ated mul ti -step quer y pr ocess ing
Feature vectors Graphs
no. of objects
Experimental EvaluationOPTICS
Martin Pfeifle, University of Munich ICDM 2004, Brighton, UK
Conclusions
Summary „Efficient Density-Based Clustering of Complex Objects“• direct integration of the multi-step query processing paradigm into the clustering algorithm• MinPts-nearest neighbor queries on the exact information• postponing expensive exact distance computations as long as possible
Future Work• integration of the multi-step query processing paradigm into other data mining algorithms