Data Mining Using Conceptual Clustering


By Trupti Kadam

What is Data Mining?

• Many definitions:

– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data

– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

• Traditional techniques may be unsuitable due to:

– Enormity of the data

– High dimensionality of the data

– Heterogeneous, distributed nature of the data

Origins of Data Mining

[Diagram: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database Systems]

Clustering Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:

– Data points in one cluster are more similar to one another.

– Data points in separate clusters are less similar to one another.
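A toy Python sketch of this definition (the 2-D points are hypothetical, and Euclidean distance serves as the dissimilarity measure, so smaller distance means more similar):

points = {1: (0, 0), 2: (1, 0), 3: (10, 10), 4: (11, 10)}   # hypothetical data
clusters = [[1, 2], [3, 4]]

def dist(a, b):
    # Euclidean distance between two named points
    return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

intra = [dist(a, b) for c in clusters for a in c for b in c if a < b]
inter = [dist(a, b) for a in clusters[0] for b in clusters[1]]
print(max(intra) < min(inter))   # True: within-cluster pairs are closer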

Conceptual Clustering

• Unsupervised and spontaneous: categorizes or postulates concepts without a teacher

• Conceptual clustering forms a classification tree: all initial observations sit in the root, and new children are created using a single attribute (not good), attribute combinations (all), information metrics, etc. Each node is a class.

• Must decide the quality of a class partition and its significance (noise)

• Many models use search to discover hierarchies that fulfill some heuristic within and/or between clusters (similarity, cohesiveness, etc.)

Concept Under CC

[figure omitted]

Concept Hierarchy

[figure omitted]

Contd..

• Suppose we choose 6 as the threshold value for similarity. The algorithm then produces five distinct clusters, (1,2), (3,4), (5,6,7,8), (5,6), and (5,7,8), after deleting redundant ones, and a hierarchy is formed as follows:

[figure omitted: the resulting cluster hierarchy]

The COBWEB Conceptual Clustering Algorithm

• The COBWEB algorithm was developed by machine learning researchers in the 1980s for clustering objects in an object-attribute data set.

• The COBWEB algorithm yields a clustering dendrogram, called a classification tree, that characterizes each cluster with a probabilistic description.

Contd..

• When given a new instance, COBWEB considers the overall quality of either placing the instance in an existing category or modifying the hierarchy to accommodate it.

• The criterion COBWEB uses for evaluating the quality of a classification is called category utility.

Category utility

• Developed in research on human categorization (Gluck and Corter, 1985)

• Category utility attempts to maximize both the probability that two objects in the same category have attribute values in common and the probability that objects in different categories have different attribute values.

• The Manhattan or Euclidean distance formula can be used to measure cohesion among clusters.
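For reference, minimal Python implementations of the two distance formulas mentioned above:

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

print(manhattan((0, 0), (3, 4)), euclidean((0, 0), (3, 4)))   # 7 5.0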

Category utility

• For a single cluster $C_k$, category utility is defined as

$$CU(C_k) = P(C_k) \sum_i \sum_j \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]$$

where $P(C_k)$ represents the size (prior probability) of cluster $C_k$, $P(A_i = V_{ij})$ represents the probability of attribute $A_i$ taking on value $V_{ij}$ over the entire set, and $P(A_i = V_{ij} \mid C_k)$ is its conditional probability of taking the same value in class $C_k$.

• To evaluate an entire partition made up of $K$ clusters, we use the average CU over the $K$ clusters:

$$CU = \frac{1}{K} \sum_{k=1}^{K} CU(C_k)$$
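A minimal Python sketch of this computation (assuming nominal attributes stored as dicts; the animals in the usage example are made up for illustration):

from collections import Counter

def category_utility(partition):
    # partition: list of clusters; each cluster is a list of objects,
    # each object a dict mapping attribute name -> nominal value
    data = [obj for cluster in partition for obj in cluster]
    n = len(data)
    attrs = data[0].keys()
    # sum of P(A_i = V_ij)^2 over the entire data set
    overall = {a: Counter(obj[a] for obj in data) for a in attrs}
    base = sum((cnt / n) ** 2 for a in attrs for cnt in overall[a].values())
    cu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / n   # P(C_k), the relative size of cluster k
        within = {a: Counter(obj[a] for obj in cluster) for a in attrs}
        cond = sum((cnt / len(cluster)) ** 2
                   for a in attrs for cnt in within[a].values())
        cu += p_ck * (cond - base)   # P(C_k) * sum[ P(.|C_k)^2 - P(.)^2 ]
    return cu / len(partition)       # average over the K clusters

# Hypothetical usage: a "good" partition scores higher than a mixed one
good = [[{"cover": "fur"}, {"cover": "fur"}],
        [{"cover": "feathers"}, {"cover": "feathers"}]]
bad = [[{"cover": "fur"}, {"cover": "feathers"}],
       [{"cover": "fur"}, {"cover": "feathers"}]]
print(category_utility(good) > category_utility(bad))   # True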

The Classification Tree Generated by the COBWEB Algorithm

• COBWEB performs a hill-climbing search of the space of possible taxonomies (trees), using category utility to evaluate and select possible categorizations:

– It initializes the taxonomy to a single category whose features are those of the first example.

– For each subsequent example, the algorithm begins with the root category and moves through the tree.

– At each level it uses category utility to evaluate placing the example according to one of four options:

1. Placing the example in the best existing category
2. Adding a new category containing the example
3. Merging two existing categories and adding the example to the merged category
4. Splitting an existing category and placing the example into the best resulting category in the tree

• Insertion means that the new object is inserted into one of the existing child nodes. The COBWEB algorithm evaluates the respective CU function value of inserting the new object into each of the existing child nodes and selects the one with the highest score.

• The COBWEB algorithm also considers creating a new child node specifically for the new object.

• The COBWEB algorithm considers merging the two existing child nodes with the highest and second highest scores.

[Diagram: Merge. Child nodes A and B of parent P are combined under a new node N; N replaces them as a child of P, with A and B as its children.]

• The COBWEB algorithm considers splitting the existing child node with the highest score.

[Diagram: Split. Child node N of parent P is removed, and its children A and B are promoted to children of P.]

The COBWEB Algorithm

Input: The current node N in the concept hierarchy, and an unclassified (attribute-value) instance I.
Results: A concept hierarchy that classifies the instance.
Top-level call: Cobweb(Top-node, I).
Variables: C, P, Q, and R are nodes in the hierarchy; W, X, Y, and Z are clustering (partition) scores.

Cobweb(N, I)
  If N is a terminal node,
    Then Create-new-terminals(N, I)
         Incorporate(N, I).
  Else Incorporate(N, I).
       For each child C of node N,
         compute the score for placing I in C.
       Let P be the node with the highest score W.
       Let Q be the node with the second highest score.
       Let X be the score for placing I in a new node R.
       Let Y be the score for merging P and Q into one node.
       Let Z be the score for splitting P into its children.
       If W is the best score,
         Then Cobweb(P, I) (place I in category P).
       Else if X is the best score,
         Then initialize R's probabilities using I's values
              (place I by itself in the new category R).
       Else if Y is the best score,
         Then let O be Merge(P, Q, N).
              Cobweb(O, I).
       Else if Z is the best score,
         Then Split(P, N).
              Cobweb(N, I).
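The pseudocode translates into roughly the following Python skeleton, a simplified sketch that reuses the category_utility helper above; the merge and split operators and the Create-new-terminals step are omitted for brevity, and the Node class is invented here for illustration:

class Node:
    def __init__(self, obj=None):
        self.objects = [] if obj is None else [obj]   # instances under this concept
        self.children = []

def cobweb(node, instance):
    node.objects.append(instance)                     # Incorporate(N, I)
    if not node.children:
        return                                        # terminal node (new-terminal creation omitted)
    parts = [c.objects for c in node.children]
    # W: best score over hosting the instance in each existing child
    host = [category_utility(parts[:i] + [parts[i] + [instance]] + parts[i+1:])
            for i in range(len(parts))]
    best = max(range(len(parts)), key=host.__getitem__)
    # X: score for placing the instance in a new singleton child
    new = category_utility(parts + [[instance]])
    if new > host[best]:
        node.children.append(Node(instance))          # new category R
    else:
        cobweb(node.children[best], instance)         # descend into best host P

In the full algorithm, the merge score Y and split score Z would be computed the same way: by scoring the partitions that Merge(P, Q, N) and Split(P, N) would produce.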

• Limitations of COBWEB:

– The assumption that the attributes are independent of each other is often too strong, because correlations may exist.

– It is not suitable for clustering large database data: the classification tree can become skewed, and maintaining per-node probability distributions is expensive.

ITERATE

The algorithm has three primary steps:

1. Derive a classification tree using category utility as a criterion function for grouping instances.

2. Extract a good initial partition of the data from the classification tree as a starting point to focus the search for desirable groupings or clusters.

3. Iteratively redistribute data objects among the groupings to achieve maximally separable clusters.

Derivation of classification tree

Extraction of a good initial partition

• The initial partition structure is extracted by comparing the CU values of classes or nodes along a path in the classification tree. For any path from root to leaf, this value initially increases and then drops.
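A sketch of this extraction step, assuming the Node class above and a cu_of(node) helper that returns a node's CU value:

def extract_partition(root, cu_of):
    # Walk the tree; cut each root-to-leaf path at the node where CU
    # peaks (CU first increases along the path, then drops)
    partition = []
    def walk(node):
        if node.children and any(cu_of(c) > cu_of(node) for c in node.children):
            for c in node.children:
                walk(c)                  # CU still rising: keep descending
        else:
            partition.append(node)       # CU peaks here: this node joins the partition
    walk(root)
    return partition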

Iteratively redistribute data objects

• The iterative redistribution operator is applied to maximize the cohesion measure for the individual classes in the partition.

• The redistribution operator assigns each object d to the class k for which the category match measure CM_dk is maximum.

Evaluating Cluster Partitions

• To assess the result of a clustering operation, we adopt a measure known as cohesion, which measures the degree of intra-class similarity among objects in the same class.

• The increase in predictability for an object d assigned to cluster k, $M_{dk}$, is defined (following the CU definition above, restricted to d's own attribute values) as

$$M_{dk} = P(C_k) \sum_i \left[ P(A_i = V_i^d \mid C_k)^2 - P(A_i = V_i^d)^2 \right]$$

where $V_i^d$ is the value object d takes on attribute $A_i$.
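A sketch of the redistribution loop under these definitions (assuming, as above, that objects are dicts of nominal attribute values; the category match here is the CU sum restricted to the object's own values):

def category_match(obj, cluster, data):
    # M_dk: increase in predictability for object d placed in cluster k
    n, nk = len(data), len(cluster)
    total = 0.0
    for a, v in obj.items():
        p_cond = sum(1 for o in cluster if o[a] == v) / nk   # P(A_i = v | C_k)
        p_base = sum(1 for o in data if o[a] == v) / n       # P(A_i = v)
        total += p_cond ** 2 - p_base ** 2
    return (nk / n) * total                                  # times P(C_k)

def redistribute(clusters, max_passes=10):
    # Repeatedly move every object to the cluster with the highest
    # category match, dropping clusters that become empty
    data = [o for c in clusters for o in c]
    for _ in range(max_passes):
        new = [[] for _ in clusters]
        for obj in data:
            scores = [category_match(obj, c, data) for c in clusters]
            new[scores.index(max(scores))].append(obj)
        new = [c for c in new if c]
        if new == clusters:
            break                      # no object moved: converged
        clusters = new
    return clusters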

THANK YOU
