
Clustering by Pattern Similarity in Large Data Sets

Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu
IBM T. J. Watson Research Center

Presented by Edmond Wu

DB-Seminar Slide 2

Talk Outline

Introduction

Related Work

pCluster Model

Performance Analysis

Conclusion

DB-Seminar Slide 3

Motivation

Why is discovering clusters based on pattern similarity interesting and important?

DNA micro-array analysis

E-commerce: Recommendation systems & target marketing

DB-Seminar Slide 4

Background Knowledge

Clustering: the process of grouping a set of objects into classes of similar objects.

Subspace clustering: discovering clusters embedded in subspaces of a high-dimensional dataset.

Pattern similarity: a coherent pattern on a subset of dimensions. (Objects are not required to have close values on any attribute.)

DB-Seminar Slide 5

Example of similar patterns on a subset of dimensions

DB-Seminar Slide 6

Challenges

Identifying subspace clusters in high-dimensional data sets is difficult.

Traditional distance functions cannot capture pattern similarity among objects.

DB-Seminar Slide 7

How to detect shifting patterns?

Given N attributes a1, …, aN:

Define a derived attribute Aij = ai − aj for every pair of attributes ai, aj. The problem then reduces to mining subspace clusters on the objects under the derived attribute set.

Drawback: the converted dataset has N(N−1)/2 dimensions,

intractable even for a moderate N
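A minimal Python sketch of this transformation (illustrative only, not code from the paper); it shows why shifting patterns become ordinary subspace clusters in the derived space, and why the number of derived dimensions grows quadratically:

```python
from itertools import combinations

def derive_difference_attributes(row):
    """Map one object's attribute vector to the derived space
    A_ij = a_i - a_j, one value per unordered pair of attributes."""
    return {(i, j): row[i] - row[j]
            for i, j in combinations(range(len(row)), 2)}

# Two objects that follow the same shifting pattern (offset by 4)
# get identical derived vectors:
print(derive_difference_attributes([1, 2, 3]))  # {(0, 1): -1, (0, 2): -2, (1, 2): -1}
print(derive_difference_attributes([5, 6, 7]))  # same values

# N(N-1)/2 blow-up: 100 attributes already yield 4950 derived dimensions.
print(len(derive_difference_attributes([0] * 100)))  # 4950
```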

DB-Seminar Slide 8

Related Work

Bicluster Model (Cheng et al):

A_IJ: a submatrix of a DNA array (rows I, columns J), with the following mean squared residue score H(I, J):

H(I, J) = (1 / (|I| |J|)) Σ_{i∈I, j∈J} (d_ij − d_iJ − d_Ij + d_IJ)²

where d_iJ, d_Ij, and d_IJ denote the row mean, column mean, and submatrix mean, respectively.

δ-bicluster: A_IJ is called a δ-bicluster if H(I, J) ≤ δ
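A small Python sketch of this score, assuming the standard Cheng & Church definition (NumPy is used only for brevity):

```python
import numpy as np

def mean_squared_residue(A):
    """Mean squared residue H(I, J) of a submatrix A (Cheng & Church)."""
    A = np.asarray(A, dtype=float)
    row_means = A.mean(axis=1, keepdims=True)  # d_iJ
    col_means = A.mean(axis=0, keepdims=True)  # d_Ij
    overall = A.mean()                         # d_IJ
    return ((A - row_means - col_means + overall) ** 2).mean()

# Shifting pattern (matrix (1) on the next slide): residue is exactly 0.
print(mean_squared_residue([[1, 2, 3], [5, 6, 7]]))   # 0.0
# "Not similar" pattern (matrix (3) on the next slide):
print(mean_squared_residue([[2, 4, 12], [4, 6, 2]]))  # 8.0
```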

DB-Seminar Slide 9

Bicluster Model (Example)

(1) Shifting pattern, H(I,J) = 0:

     a1  a2  a3
O1    1   2   3
O2    5   6   7

(2) Scaling pattern, H(I,J) = 2/3:

     a1  a2  a3
O1    1   2   4
O2    2   4   8

(3) Not a similar pattern, H(I,J) = 8:

     a1  a2  a3
O1    2   4  12
O2    4   6   2

(4) Submatrix of (2), H(I,J) = 2.25 > 2/3:

     a1  a3
O1    1   4
O2    2   8

If we set δ = 2, then (3) and (4) are not δ-biclusters.

DB-Seminar Slide 10

Drawbacks of Bicluster Model

A submatrix of a δ-bicluster is not necessarily a δ-bicluster.

Not guaranteed to find all qualified clusters (the randomized greedy algorithm gives only an approximate answer).

Cannot exclude outliers from a bicluster.

Difficult to design an efficient algorithm.

DB-Seminar Slide 11

Bicluster Model (Example)

The bicluster shown in Figure (a) contains an obvious outlier, yet it still has a fairly small mean squared residue (4.238).

If we try to get rid of such outliers by lowering the δ threshold, we will also exclude many biclusters that do exhibit similar patterns.

DB-Seminar Slide 12

The pCluster Model

pScore of a 2 × 2 matrix:

O: a subset of objects in the database

T: a subset of attributes; (O, T): a submatrix of the dataset

δ: a user-specified clustering threshold

d_xa: the value of object x on attribute a

Given x, y ∈ O and a, b ∈ T, the pScore of the 2 × 2 matrix X formed by objects x, y on attributes a, b is

pScore(X) = |(d_xa − d_xb) − (d_ya − d_yb)|

DB-Seminar Slide 13

The pCluster Model (Cont.)

pScore(X) ≤ δ means that the change of values on the two attributes between the two objects in X is confined by δ, a user-specified threshold.

Pair (O, T) forms a δ-pCluster if, for some δ ≥ 0, every 2 × 2 submatrix X of (O, T) satisfies pScore(X) ≤ δ.
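A brute-force rendering of this definition in Python (a sketch for checking the definition on small inputs; the actual algorithm never enumerates all 2 × 2 submatrices like this):

```python
from itertools import combinations

def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 x 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(data, objects, attrs, delta):
    """(objects, attrs) is a delta-pCluster iff every 2 x 2 submatrix
    has pScore <= delta. `data` maps (object, attribute) -> value."""
    return all(
        pscore(data[x, a], data[x, b], data[y, a], data[y, b]) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(attrs, 2))

# Values from the example on the next slide:
# d_2b = 12, d_2c = 15, d_3b = 40, d_3c = 43.
print(pscore(12, 15, 40, 43))  # 0
```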

DB-Seminar Slide 14

The pCluster Model (Example)

In Figure (a), objects 2, 3 and attributes {b, c} form a 2 × 2 submatrix X: d_2b = 12, d_2c = 15, d_3b = 40, d_3c = 43, so pScore(X) = |(12 − 15) − (40 − 43)| = 0.

Objects {1, 2, 3} and attributes {b, c, h, j, e} form a pCluster (δ = 0).

DB-Seminar Slide 15

The pCluster Model (Cont.)

Compact property of pClusters:

Let (O, T) be a δ-pCluster. Any submatrix (O′, T′) of it is also a δ-pCluster (this follows directly from the definition of pCluster).

The volume of a pCluster: |O| × |T|.

The definition of pCluster is symmetric in objects and attributes:

|(d_xa − d_xb) − (d_ya − d_yb)| = |(d_xa − d_ya) − (d_xb − d_yb)|

DB-Seminar Slide 16

Problem Statement

Task: find all pairs (O, T) such that (O, T) is a δ-pCluster according to the definition, with |O| ≥ nr and |T| ≥ nc.

Parameters:

D: the dataset

δ: the clustering threshold

nc: the minimal number of columns

nr: the minimal number of rows

DB-Seminar Slide 17

The Algorithm

Definition of MDS: assume c = (O, T) is a δ-pCluster. Column set T is a Maximum Dimension Set (MDS) of c if there does not exist T′ ⊃ T such that (O, T′) is also a δ-pCluster.

Objects can form pClusters on multiple MDSs. The algorithm is depth-first, and it only generates pClusters that cluster on MDSs, which avoids enumerating clusters contained in larger ones.

DB-Seminar Slide 18

Pair-wise Clustering

Pairwise Clustering Principle:

Given objects x and y and a dimension set T, let S(x, y, T) = { d_xa − d_ya : a ∈ T }. Then x and y form a δ-pCluster on T iff the difference between the largest and smallest value in S(x, y, T) is at most δ.

In other words, ({x, y}, T) is a δ-pCluster iff

max_{a, b ∈ T} f(a, b) ≤ δ, where f(a, b) = |(d_xa − d_ya) − (d_xb − d_yb)|
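This principle replaces the quadratic scan over attribute pairs with a single pass over the differences; a minimal sketch:

```python
def forms_pcluster_pair(x_vals, y_vals, delta):
    """Objects x and y (value lists over the same dimension set T) form
    a delta-pCluster on T iff the spread of d_xa - d_ya is at most delta."""
    diffs = [xa - ya for xa, ya in zip(x_vals, y_vals)]
    return max(diffs) - min(diffs) <= delta

# A pure shifting pattern: all differences equal, so the spread is 0.
print(forms_pcluster_pair([1, 2, 3], [5, 6, 7], delta=0))  # True
# A scaling pattern fails the shifting test (spread of differences is 3):
print(forms_pcluster_pair([1, 2, 4], [2, 4, 8], delta=1))  # False
```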

DB-Seminar Slide 19

Pair-wise Clustering (Example)

Sort S(x, y, T) into a sequence s_1 ≤ … ≤ s_k ≤ … ≤ s_n.

Objects x and y form a δ-pCluster on the dimensions that correspond to a contiguous subsequence s_i, …, s_k whenever s_k − s_i ≤ δ; the maximal such subsequences (of length at least nc) yield the MDSs.

Three MDSs were found in the example: {e, g, c}, {a, d, b, h}, {h, f}
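A sketch of MDS discovery for one object pair, using a sliding window over the sorted differences (the window bookkeeping is my own reconstruction; the demo values are hypothetical):

```python
def pairwise_mds(x_vals, y_vals, dims, delta, nc):
    """Sort the differences d_xa - d_ya, then report every maximal
    window whose spread is <= delta and which covers >= nc dimensions."""
    pairs = sorted(zip([xa - ya for xa, ya in zip(x_vals, y_vals)], dims))
    result, i = [], 0
    for j in range(len(pairs)):
        while pairs[j][0] - pairs[i][0] > delta:
            i += 1  # shrink the window from the left until spread <= delta
        maximal = (j == len(pairs) - 1
                   or pairs[j + 1][0] - pairs[i][0] > delta)
        if maximal and j - i + 1 >= nc:
            result.append({d for _, d in pairs[i:j + 1]})
    return result

# Hypothetical values for illustration only:
print(pairwise_mds([0, 13, 9, 2, 7], [0, 12, 0, 1, 7],
                   dims=list("abcde"), delta=1, nc=2))
# [{'a', 'b', 'd', 'e'}]  (set order may vary)
```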

DB-Seminar Slide 20

MDS Pruning

MDS Pruning Principle:

Let T_xy be an MDS for objects x, y, and let a ∈ T_xy. For any O and T, a necessary condition for ({x, y} ∪ O, {a} ∪ T) to be a δ-pCluster is that O_ab ⊇ {x, y} for every b ∈ T.

The pruning criterion can be stated as follows: for each dimension a in an MDS T_xy, count the number of O_ab that contain {x, y}. If the number of such O_ab is less than nc − 1, remove a from T_xy. Furthermore, if the removal of a makes |T_xy| < nc, remove T_xy as well.
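A sketch of one pruning pass in Python; the layout of `col_pair_objects` (mapping a column pair {a, b} to the objects of its column-pair MDS O_ab) is an assumption made for illustration, not the paper's data structure:

```python
def prune_object_pair_mds(mds_columns, x, y, col_pair_objects, nc):
    """Prune one object-pair MDS T_xy for objects x, y.
    Returns the surviving column set, or None if it shrinks below nc."""
    cols = set(mds_columns)
    changed = True
    while changed:
        changed = False
        for a in list(cols):
            # count columns b whose O_ab still contains both x and y
            support = sum(
                1 for b in cols
                if b != a and {x, y} <= col_pair_objects.get(frozenset((a, b)), set()))
            if support < nc - 1:
                cols.discard(a)  # a cannot appear in any qualifying pCluster
                changed = True
    return cols if len(cols) >= nc else None
```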

DB-Seminar Slide 21

MDS Pruning (Example)

DB-Seminar Slide 22

The Main Algorithm

First step: scan the dataset to find column-pair MDSs and object-pair MDSs.

Second step: alternately prune object-pair MDSs and column-pair MDSs until no further pruning can be made.

Third step: insert the remaining object-pair MDSs into a prefix tree. (Each node represents a cluster of objects; each edge represents the column selected.)

DB-Seminar Slide 23

Construct a prefix tree

Fix an order on the columns, e.g., a, b, c, …

Insert each 2-object pCluster (O, T) into the prefix tree.

Perform a post-order traversal of the prefix tree and prune nodes with |O| < nr. (Before pruning a node, add the objects in O to the nodes whose column set T′ ⊂ T satisfies |T′| = |T| − 1.)
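A much-simplified sketch of this step: a dict keyed by sorted column tuples stands in for the trie, and a final pair-support check stands in for the paper's per-node bookkeeping. Names and layout are illustrative assumptions:

```python
from itertools import combinations

def prefix_tree_clusters(pair_mds_list, nr, nc):
    """pair_mds_list: (object_pair, column_set) entries, i.e. the
    object-pair MDSs that survived pruning."""
    nodes = {}
    for pair, columns in pair_mds_list:
        nodes.setdefault(tuple(sorted(columns)), set()).update(pair)

    # Post-order step from the slide: longest column sets first, add each
    # node's objects to existing nodes that are one column shorter.
    for key in sorted(nodes, key=len, reverse=True):
        for sub in combinations(key, len(key) - 1):
            if sub in nodes:
                nodes[sub].update(nodes[key])

    # Keep (T, O) only if every object pair in O has a pair-MDS whose
    # columns contain T, and the size thresholds are met.
    support = {}
    for pair, columns in pair_mds_list:
        support.setdefault(frozenset(pair), []).append(set(columns))
    def pair_ok(x, y, key):
        return any(set(key) <= c for c in support.get(frozenset((x, y)), []))
    return {key: objs for key, objs in nodes.items()
            if len(key) >= nc and len(objs) >= nr
            and all(pair_ok(x, y, key)
                    for x, y in combinations(sorted(objs), 2))}

print(prefix_tree_clusters(
    [(("o1", "o2"), {"a", "b", "c"}),
     (("o1", "o3"), {"a", "b", "c"}),
     (("o2", "o3"), {"a", "b"})],
    nr=3, nc=2))
# {('a', 'b'): {'o1', 'o2', 'o3'}}  (set order may vary)
```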

DB-Seminar Slide 24

Construct a prefix tree (Example)

DB-Seminar Slide 25

Algorithm Complexity

The main algorithm for mining pClusters has time complexity

O(M²N log N + N²M log M)

where M is the number of columns and N is the number of objects. The worst case is O(M²kN²). However, the complexity is greatly reduced in practice by the MDS pruning process.

DB-Seminar Slide 26

Experiments

Datasets:

Synthetic datasets (parameters: varying nr, nc, and the number of embedded perfect pClusters with δ = 0)

Gene expression data (yeast microarray)

MovieLens dataset (E-commerce)

DB-Seminar Slide 27

Performance Analysis

Response time vs. data size

DB-Seminar Slide 28

Performance Analysis (Cont.)

Sensitivity to the mining parameters: δ, nc, and nr

DB-Seminar Slide 29

Performance Analysis (Cont.)

Comparison of the pCluster algorithm with an alternative approach based on the subspace clustering algorithm CLIQUE.

DB-Seminar Slide 30

Performance Analysis (Cont.)

The pruning process is essential to the pCluster algorithm.

Without pruning, the pCluster algorithm cannot scale beyond 3,000 objects, because the number of MDSs becomes too large to fit into a prefix tree.

DB-Seminar Slide 31

Conclusion

pCluster model: captures both the closeness of objects and the pattern similarity among objects on subsets of dimensions.

Advantages:

- Discovers all the qualified pClusters.
- The depth-first clustering algorithm avoids generating clusters that are part of other clusters.
- More efficient than existing algorithms.
- Resilient to outliers.

DB-Seminar Slide 32

References

Y. Cheng and G. Church. Biclustering of expression data. In Proc. of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB), 2000.

S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set, 2000. http://arep.med.harvard.edu/biclustering/yeast.matrix

R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns. In SIGKDD, 2000.

J. Yang, W. Wang, H. Wang, and P. S. Yu. δ-clusters: Capturing subspace correlation in a large data set. In ICDE, pages 517–528, 2002.

Thanks!!
