GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING
by Istvan Jonyer, Lawrence B. Holder and Diane J. Cook
The University of Texas at Arlington


Outline

What is hierarchical conceptual clustering?
Overview of Subdue
Conceptual clustering in Subdue
Evaluation of hierarchical clusterings
Experiments and results
Conclusions

What is clustering?

What is hierarchical conceptual clustering?

Unsupervised concept learning
Generating hierarchies to explain data
Applications
– Hypothesis generation and testing
– Prediction based on groups
– Finding taxonomies

Example hierarchical conceptual clustering

Animals
    HeartChamber: four, BodyTemp: regulated, Fertilization: internal
        Name: mammal, BodyCover: hair
        Name: bird, BodyCover: feathers
    BodyTemp: unregulated
        Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
        Fertilization: external
            Name: fish, BodyCover: scales, HeartChamber: two
            Name: amphibian, BodyCover: moist-skin, HeartChamber: three

The Problem

Hierarchical conceptual clustering in discrete-valued structural databases

Existing systems:
– Continuous-valued
– Discrete but unstructured
– We can do better! (Field underexplored)

Related Work

Cobweb
Labyrinth
AutoClass
Snob
In Euclidean space: Chameleon, Cure

Unsupervised learning algorithms

The Solution

Take Subdue and extend it!

Overview of Subdue

Data mining in graph representations of structural databases

[Figure: example input graph with vertices labeled A to F and edges labeled a to g; the substructure on vertices A, B, C, D with edges a, b, c occurs twice]

Overview of Subdue

Iteratively searching for best substructure by MDL heuristic

[Figure: best substructure, with vertices A, B, C, D and edges a, b, c]

Overview of Subdue

Compress using best substructure

[Figure: compressed graph in which both instances of the best substructure are replaced by a vertex S; vertices E, F and edges d, e, f, g remain]
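The compression step can be illustrated with a small sketch. It does not reproduce Subdue's MDL computation: as a stated simplification, the size of a graph is counted as vertices plus edges, and the value of a substructure S is taken to be (size(S) + size(G|S)) / size(G), so lower values mean better compression. The class and function names are hypothetical; the counts follow the example graph on these slides (10 vertices, 10 edges, and a substructure of 4 vertices and 3 edges that occurs twice).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphSize:
    """Vertex and edge counts of a labeled graph."""
    vertices: int
    edges: int

    @property
    def total(self) -> int:
        # Assumed size measure: vertices plus edges (not a bit-level MDL encoding).
        return self.vertices + self.edges


def compress(graph: GraphSize, sub: GraphSize, instances: int) -> GraphSize:
    """Size of the graph after each non-overlapping instance of `sub`
    is collapsed into a single vertex S."""
    return GraphSize(
        vertices=graph.vertices - instances * sub.vertices + instances,
        edges=graph.edges - instances * sub.edges,
    )


def compression_value(graph: GraphSize, sub: GraphSize, instances: int) -> float:
    """(size(S) + size(G|S)) / size(G); lower means better compression."""
    compressed = compress(graph, sub, instances)
    return (sub.total + compressed.total) / graph.total


# Counts taken from the example graph on the slides.
example = GraphSize(vertices=10, edges=10)
best_sub = GraphSize(vertices=4, edges=3)
print(compress(example, best_sub, instances=2))
print(compression_value(example, best_sub, instances=2))
```

Under this simplified measure the compressed graph has 4 vertices (S, S, E, F) and 4 edges (d, e, f, g), which matches what the compressed-graph slide shows.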

Overview of Subdue

Fuzzy match
– Inexact matching of subgraphs
– Applications:
    Defining fuzzy concepts
    Evaluation of clusterings

Conceptual Clustering with Subdue

Use Subdue to identify clusters
– The best subgraph in an iteration defines a cluster

When to stop within an iteration?
1) Use -limit option
2) Use -size option
3) Use first minimum heuristic (new)

The First Minimum Heuristic

Use subgraph at first local minimum
– Detect it using -prune2 option

[Figure: plot illustrating the first local minimum; axis ticks from 0.75 to 1.05]

The First Minimum Heuristic

Not a greedy heuristic!
– Although first local minimum is usually the global minimum
– First local minimum is caused by a smaller, more frequently occurring subgraph
– Subsequent minima are caused by bigger, less frequently occurring subgraphs

=> First subgraph is more general

The First Minimum Heuristic

A multi-minimum search space:

[Figure: plot of a search space with several local minima; axis ticks from 0.6 to 1.1]
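The idea of stopping at the first local minimum can be sketched in a few lines. This is only an illustration of the heuristic as described on these slides, not the implementation behind the -prune2 option; the function name and the sequence of values are hypothetical.

```python
from typing import Optional, Sequence


def first_local_minimum(values: Sequence[float]) -> Optional[int]:
    """Return the index of the first value that is lower than its
    predecessor and not higher than its successor, or None if the
    sequence never turns upward."""
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] <= values[i + 1]:
            return i
    return None


# Hypothetical substructure values seen as the search grows larger subgraphs.
values = [1.0, 0.92, 0.85, 0.80, 0.88, 0.95, 0.70, 0.90]

first = first_local_minimum(values)
lowest = min(range(len(values)), key=values.__getitem__)
print(first, lowest)
```

In this made-up sequence the first local minimum (index 3) is not the global minimum (index 6), which is the multi-minimum situation the slides describe: the heuristic deliberately keeps the earlier, more general subgraph.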

Lattice vs. Tree

Previous work defined classification trees
– Inadequate in structured domains

Better hierarchical description: classification lattice
– A cluster can have more than one parent
– A parent can be at any level (not only one level above)
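The difference between a tree and a lattice shows up directly in the data structure: a tree node stores one parent, a lattice node stores a list of parents. The following is a minimal sketch under that assumption; the `Cluster` class, the `link` and `level` helpers, and the cluster names are hypothetical and not taken from Subdue.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison avoids recursing through parent/child cycles
class Cluster:
    name: str
    parents: list[Cluster] = field(default_factory=list)
    children: list[Cluster] = field(default_factory=list)


def link(parent: Cluster, child: Cluster) -> None:
    """Record that `child` is defined in terms of `parent`."""
    parent.children.append(child)
    child.parents.append(parent)


def level(cluster: Cluster) -> int:
    """Length of the longest path from a root down to this cluster."""
    if not cluster.parents:
        return 0
    return 1 + max(level(p) for p in cluster.parents)


root, a, b, c, d = (Cluster(n) for n in ("Root", "A", "B", "C", "D"))
link(root, a)
link(root, b)
link(a, c)      # C has two parents: A and B
link(b, c)
link(root, d)   # D's parents sit at different levels: Root (0) and C (2)
link(c, d)

print([p.name for p in c.parents], level(d))
```

Cluster D here has one parent at level 0 and one at level 2, which a classification tree cannot express.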

Hierarchical Clustering in Subdue

Subdue can compress by a subgraph after each iteration

Subsequent clusters may be defined in terms of previously defined clusters

This results in a hierarchy

Hierarchical Conceptual Clustering of an Artificial Domain

[Figure: cluster hierarchy found in the artificial domain, starting from Root]

Evaluation of Clusterings

Traditional evaluation:

$$\mathit{ClusteringQuality} = \frac{\mathit{InterClusterDistance}}{\mathit{IntraClusterDistance}}$$

– Not applicable to hierarchical domains

No known evaluation for hierarchical clusterings
– Most hierarchical evaluations are anecdotal
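The traditional ratio of inter-cluster to intra-cluster distance can be sketched for flat clusterings of numeric points. The slides do not say how the two distances are computed, so the definitions below are assumptions: the intra-cluster distance is the mean distance of a point to its own cluster centroid, and the inter-cluster distance is the mean distance between cluster centroids. All names and data are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def intra_cluster_distance(clusters):
    """Mean distance from each point to the centroid of its own cluster."""
    dists = [math.dist(p, centroid(c)) for c in clusters for p in c]
    return sum(dists) / len(dists)


def inter_cluster_distance(clusters):
    """Mean distance between the centroids of every pair of clusters."""
    centers = [centroid(c) for c in clusters]
    dists = [math.dist(a, b) for a, b in combinations(centers, 2)]
    return sum(dists) / len(dists)


def clustering_quality(clusters):
    return inter_cluster_distance(clusters) / intra_cluster_distance(clusters)


tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]   # compact, well separated
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]   # same points, poor grouping
print(clustering_quality(tight), clustering_quality(loose))
```

A measure of this kind assumes a flat partition in a metric space, which is why it does not carry over to hierarchies of structural clusters.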

New Evaluation Heuristic for Hierarchical Clusterings

Properties of a good clustering:
– Small number of clusters
    Large coverage => good generality
– Big cluster descriptions
    More features => more inferential power
– Minimal or no overlap between clusters
    More distinct clusters => better defined concepts

New Evaluation Heuristic for Hierarchical Clusterings

Big clusters: bigger distance between disjoint clusters

Overlap: less overlap => bigger distance

Few clusters: averaging comparisons

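[Equation: clustering quality CQ_C of a cluster C with c child clusters, built from distance(H_{i,k}, H_{j,l}) summed over instances k and l of each pair of child clusters H_i and H_j, with max and size normalization terms and the CQ of the child clusters]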

Experiments and Results

Validation in an artificial domain
Validation in unstructured domains
Comparison to existing systems
Real-world applications

The Animal Domain

Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external

[Figure: graph representation of the mammal row: a vertex labeled animal with edges Name to mammal, BodyCover to hair, HeartChamber to four, BodyTemp to regulated, and Fertilization to internal]
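The mapping from a table row to a star-shaped graph can be sketched as follows. The vertex and edge tuples are a hypothetical in-memory representation chosen for illustration, not Subdue's input file format; the attribute names and values come from the table on this slide.

```python
ATTRIBUTES = ("Name", "BodyCover", "HeartChamber", "BodyTemp", "Fertilization")

ANIMALS = [
    ("mammal", "hair", "four", "regulated", "internal"),
    ("bird", "feathers", "four", "regulated", "internal"),
    ("reptile", "cornified-skin", "imperfect-four", "unregulated", "internal"),
    ("amphibian", "moist-skin", "three", "unregulated", "external"),
    ("fish", "scales", "two", "unregulated", "external"),
]


def animal_to_graph(row, start_id=1):
    """Turn one table row into a star graph.

    Returns (vertices, edges), where vertices are (id, label) pairs and
    edges are (source_id, target_id, label) triples. The center vertex is
    labeled "animal"; each attribute becomes a labeled edge to a vertex
    holding the attribute's value.
    """
    vertices = [(start_id, "animal")]
    edges = []
    for offset, (attribute, value) in enumerate(zip(ATTRIBUTES, row), start=1):
        vertex_id = start_id + offset
        vertices.append((vertex_id, value))
        edges.append((start_id, vertex_id, attribute))
    return vertices, edges


vertices, edges = animal_to_graph(ANIMALS[0])
for source, target, label in edges:
    print(f"animal --{label}--> {dict(vertices)[target]}")
```

Each animal then contributes 6 vertices and 5 edges, and shared attribute values such as `BodyTemp: regulated` appear as repeated subgraphs that a substructure search can pick up.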

Hierarchical Clustering of the Animal Domain

Animals
    HeartChamber: four, BodyTemp: regulated, Fertilization: internal
        Name: mammal, BodyCover: hair
        Name: bird, BodyCover: feathers
    BodyTemp: unregulated
        Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
        Fertilization: external
            Name: fish, BodyCover: scales, HeartChamber: two
            Name: amphibian, BodyCover: moist-skin, HeartChamber: three

Hierarchical Clustering of the Animal Domain by Cobweb

animals
    amphibian/fish
        fish
        amphibian
    mammal/bird
        mammal
        bird
    reptile

Comparison of Subdue and Cobweb

Quality of Subdue’s lattice (tree): 2.60
Quality of Cobweb’s tree: 1.74
Therefore Subdue’s clustering is better under this measure
Reasons for a higher score:
– Better generalization resulting in fewer clusters
– Eliminating overlap between (reptile) and (amphibian/fish)

Chemical Application: Clustering of a DNA sequence

Coverage
– 61%
– 68%
– 71%

[Figure: partial cluster hierarchy rooted at DNA; the cluster definitions are molecular fragments, including O=P-OH groups, C-N and C-C chains, and a fragment linking O=P-OH through O to CH2]

Conclusions

Goal of hierarchical conceptual clustering of structured databases was achieved
Synthesized classification lattice
Developed new evaluation heuristic for hierarchical clusterings
Good performance in comparison to other systems, even in unstructured domains

Future Work

More experiments on real-world domains
Comparison to other systems
Incorporation of evaluation tool into Subdue