
A Quality-Threshold Data Summarization Algorithm

Viet Ha-Thuc
Computer Science Department, The University of Iowa, Iowa City, IA 52242, USA
[email protected]

Duc-Cuong Nguyen
School of Computer Science and Engineering, International University, Ho Chi Minh City, Vietnam
[email protected]

Padmini Srinivasan
School of Library and Information Science, The University of Iowa, Iowa City, IA 52242, USA
[email protected]

Abstract—As database sizes increase, semantic data summarization techniques have been developed so that data mining algorithms can be run on the summarized set for the sake of efficiency. Clustering algorithms such as K-Means have popularly been used as semantic summarization methods, where cluster centers become the summarized set. The goal of semantic summarization is to provide a summarized view of the original dataset such that the summarization ratio is maximized while the error (i.e., information loss) is minimized. This paper presents a new clustering-based data summarization algorithm in which the quality of the summarized set can be controlled. The algorithm partitions a dataset into a number of clusters until the distortion of each cluster is less than a given threshold, thus guaranteeing that the summarized set incurs less than a fixed amount of information loss. Based on the threshold, the number of clusters is automatically determined. The proposed algorithm, unlike traditional K-Means, adjusts initial centers based on the information about the data space discovered so far, thus significantly alleviating the local optimum effect. Our experiments show that our algorithm generates higher-quality clusters than K-Means does and also guarantees an error bound, an essential criterion for data summarization.

Keywords—Data Summarization (or Compression); K-Means Clustering.

I. INTRODUCTION

In recent years, database sizes have rapidly increased. Thus, data summarization has become necessary and has been applied in many areas. For instance, in distributed data mining, databases often comprise many datasets at many distant sites. Because of expensive data transfer costs and privacy restrictions, the datasets are not likely to be gathered to a central site and then processed there. Instead, the datasets can be semantically summarized locally at each site, and the summarized sets then sent to the central site for further processing. The key issue in semantic data summarization is how to convert a large dataset into a much smaller one (high summarization ratio) while still preserving the macro-structure of the original set (low information loss).

Sampling can be used to generate a summarized set ([6][15]). Although sampling-based summarization methods are relatively simple and straightforward, these techniques hardly guarantee the quality of the summarized set. Moreover, they could disclose private information by sharing the actual data items.

Clustering algorithms have also been used for semantic summarization, where cluster centers become the summarized set on which other data mining algorithms can be run ([1][5][6]). Among those, the K-Means clustering algorithm has been frequently used due to its low complexity and ease of implementation ([3][4][11][17][18]). Each dataset is partitioned into k clusters, and the corresponding summarized set consists of exactly k elements. In general, as k increases, the summarized set becomes higher in quality but lower in summarization ratio.

However, there are several problems inherent in using K-Means as the summarization method. First, K-Means does not provide any error bound on the resulting clusters, so it cannot guarantee the quality of summarized sets. Second, K-Means requires users to pre-specify the number of clusters k (i.e., the size of the summarized set). Last, the final results of K-Means strongly depend on the initial centers, which are often randomly generated ([4][11][17][18]).

In this paper, a novel semantic summarization method is proposed. Our method partitions the whole dataset into a number of clusters whose distortion is within an error bound (or less than a threshold). The number of clusters k is determined based on the characteristics of the dataset and the threshold. Thus, the threshold is the user parameter and it controls the tradeoff between summarization ratio and information loss.

Unlike conventional K-Means, where initial centers for all clusters are randomly generated at the beginning, our algorithm inserts and removes initial centers dynamically based on the characteristics of the dataset that the algorithm has discovered so far. This strategy alleviates the local optimum effect.

The rest of this paper is organized as follows. In Section 2, we discuss related work. Section 3 describes a K-Means based summarization scheme, which we use as a baseline for comparison. Our main contributions are described and discussed in Section 4. Next, in Section 5, we demonstrate the efficiency of the scheme through experiments. In Section 6, we present some concluding remarks.

II. RELATED WORK

Semantic summarization is different from traditional (syntactic) summarization/compression such as Huffman, Shannon, or Lempel-Ziv coding techniques, which consider a dataset as a sequence of bytes. Semantic summarization takes the content of a dataset into account to generate a summarized set. The summarized set does not contain all raw data items of the original dataset; it only shows a summarized view of the original ([16]). Therefore, semantic summarization is a lossy method. Clustering methods (e.g., K-Means and BIRCH) are popularly used techniques for semantic summarization.

In Discovery Net [18], each distributed dataset is locally summarized by the K-Means algorithm. Then, the summarized sets are sent to a central site for global clustering. The quality of this approach is largely dependent on the performance of K-Means.

In [11], a scalable clustering algorithm is proposed to deal with very large datasets. In this approach, the datasets are divided into several equally sized and disjoint segments. Then, the hard K-Means or Fuzzy K-Means algorithm is used to summarize each data segment. Similar to Discovery Net, a clustering algorithm is then run on the union of summarized sets.

Wagstaff, L. K., Shu, P. H., Mazzoni, D., and Castano, R. ([17]) propose a semi-supervised method for clustering hyper-spectral data. This method incorporates existing knowledge into the K-Means algorithm to summarize large collections of hyper-spectral images. More specifically, this pre-specified knowledge is used to seed the initial centers for the K-Means algorithm.

Jagadish, H. V., Ng, R. T., and Ooi, B. C. ([14]) note that the pure K-Means method cannot provide any guaranteed error bound and develop ItCompress, an adaptation of K-Means, to overcome this limitation. ItCompress takes a dataset, an error tolerance vector, and a number k as inputs. It then finds k representatives and tries to assign each data item to one of the representatives such that their difference is within the error bound (i.e., smaller than the tolerance vector). If there is no such assignment, it finds the one with the minimal number of violations. However, like K-Means, ItCompress still uses a fixed value for k and blindly initializes the starting positions of the representatives.

Also dealing with the quality threshold, Heyer, L., Kruglyak, S., and Yooseph, S. ([12]) propose a clustering algorithm, named QT_Clust (quality-threshold clustering), that is specific to gene data. The QT_Clust algorithm takes as input a dataset D and a diameter threshold d, and returns a set of clusters whose diameter is less than d. However, the computational complexity of QT_Clust is O(n³), where n is the number of data items in the dataset. Thus, QT_Clust is not applicable to very large datasets.

Chen, B., Phang, C. T., Harrison, R., and Pan, Y. ([3]) propose a hybrid algorithm, namely H-K-Means, to solve the problem of determining the initial conditions. The H-K-Means algorithm exploits hierarchical clustering as a preprocessing step to determine the value of k and the locations of initial points for K-Means. However, the computational complexity of the hierarchical clustering step is O(n²), so it does not scale well to large n.

BIRCH ([19]) constructs a hierarchical clustering structure, called a CF-Tree. A non-leaf node in a CF-Tree represents a cluster made up of the sub-clusters represented by its child nodes. A leaf node represents a cluster made up of the sub-clusters in its entries. Each entry stores the clustering features (CF) of a cluster, including the mean, the number of items, and the squared error. When inserting a new data item, starting from the root, BIRCH recursively finds the closest branch according to a distance metric and moves down until reaching a leaf entry. If the entry can "absorb" the data item, its CF is updated. Otherwise, it carries out a re-structuring process similar to that of a B+-tree. We note that the right entry (cluster) for each data item is decided locally, which often makes the quality of BIRCH lower than that of K-Means ([4]).

Compared to these previous methods, our proposed algorithm, Quality-threshold summarization, has several advantages: first, it returns clusters satisfying a given quality threshold; second, the value of k is automatically determined based on the threshold and the characteristics of the dataset; third, Quality-threshold summarization adjusts initial points dynamically, alleviating the local optimum effect; and finally, the computational complexity of Quality-threshold summarization is linear in n, so it scales well to large datasets.

III. K-MEANS-BASED SEMANTIC SUMMARIZATION

Clustering algorithms have popularly been used for semantic summarization ([4][5][11][17][18]). In this approach, the whole dataset is partitioned into several clusters, which are groups of similar data items (see Fig. 1(a)). Each of the clusters is then replaced by its center and some clustering features (e.g., the number of data items in the cluster), which together form the summarized set (see Fig. 1(b)).

Figure 1. Clustering-based summarization.


K-Means has been popularly used for data summarization. The goal of K-Means is to obtain a partition of the dataset into k clusters that minimizes the sum of cluster distortions. The distortion of a cluster is defined as the sum of squared Euclidean distances between each data item belonging to that cluster and the cluster center (i.e., the mean of all data items in the cluster). The distortion of a cluster reveals how well the center represents all data items in the cluster.
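For concreteness (our notation, restating the definition above), the distortion of cluster j with center c_j can be written as

```latex
E_j \;=\; \sum_{x \in C_j} \lVert x - c_j \rVert^2,
\qquad
c_j \;=\; \frac{1}{|C_j|} \sum_{x \in C_j} x,
```

where C_j denotes the set of data items assigned to cluster j.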

K-Means takes as input the dataset D, the number of clusters k, and a set I = {I1, I2, …, Ik}, in which Ij is the initial position for the center of cluster j. K-Means returns the final positions {C1, C2, …, Ck} of the cluster centers and the distortions of the clusters {E1, E2, …, Ek}. The K-Means algorithm is described as follows.

Figure 2. K-Means Algorithm for data summarization.

The time complexity of the K-Means algorithm is O(nkl), where n = |D|, k is the number of clusters, and l is the number of iterations. In practice, l is considered a constant, so the time complexity of the K-Means algorithm is O(nk). The space complexity of K-Means is O(n).

As mentioned above, the K-Means algorithm requires the value of k in advance and cannot guarantee clusters within an error bound. Its final result is also sensitive to the starting positions (i.e., the set I). In the next section, we present a novel algorithm that partitions a dataset into a number of clusters satisfying a quality threshold. Based on the threshold, the number of clusters is automatically determined. The algorithm also adjusts initial centers based on the information about the dataset discovered so far, thus alleviating the local optimum effect.

IV. QUALITY THRESHOLD SUMMARIZATION

The Quality-threshold summarization algorithm takes a dataset D and a threshold T as inputs. Its purpose is to obtain a partition of D into a number of clusters minimizing the total distortion, subject to the condition that the distortion of every cluster must be less than the threshold T. Unlike K-Means based summarization, the Quality-threshold summarization algorithm does not require the value of k in advance. Instead, the final number of clusters is determined by the characteristics of D and the threshold T.

Figure 3. Quality-threshold summarization algorithm.

Algorithm: K-Means(D, k, I)
Input: dataset D, the number of clusters k, initial positions for centers I = {I1, I2, …, Ik}
Output: cluster centers {C1, C2, …, Ck} and cluster distortions {E1, E2, …, Ek}
Description:
1. Initialization: Cj = Ij (∀j: 1 ≤ j ≤ k);
2. a. Determine the membership of each data item d ∈ D: assign d to the cluster whose center is closest to d;
   b. Update cluster centers: set Cj equal to the mean of all data items in cluster j (∀j: 1 ≤ j ≤ k);
3. If the cluster centers do not change, then
   a. Compute cluster distortions: set Ej equal to the sum of squared Euclidean distances from each data item in cluster j to Cj (∀j: 1 ≤ j ≤ k);
   b. Return {C1, C2, …, Ck} and {E1, E2, …, Ek};
   else go to step 2;
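A minimal Python/NumPy sketch of the routine in Fig. 2 (our own illustrative code; the function name, the iteration cap max_iter, and the handling of empty clusters are our choices, not part of the paper):

```python
import numpy as np

def kmeans(D, init_centers, max_iter=100):
    """K-Means as used for summarization (Fig. 2): returns final centers,
    the cluster membership of each item, and per-cluster distortions."""
    centers = np.asarray(init_centers, dtype=float)
    k = len(centers)
    for _ in range(max_iter):
        # Step 2a: assign each data item to the cluster with the closest center.
        labels = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        # Step 2b: move each center to the mean of its members (empty clusters stay in place).
        new_centers = np.array([D[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 3: stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final memberships and per-cluster distortions (sum of squared distances to the center).
    labels = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    distortions = np.array([((D[labels == j] - centers[j]) ** 2).sum() for j in range(k)])
    return centers, labels, distortions
```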

Algorithm: Quality-Threshold Summarization(D, T)
Input: dataset D, a threshold T
Output: cluster centers {C1, C2, …, Ck}
Description:
1. Initialization:
   a. C = ∅;
   b. t = 0; k0 = 1;
   c. Randomly choose a data item d ∈ D, and set I0 = {d};
2. (Ct, Et) ← K-Means(D, kt, It);
3. kt+1 = kt;
4. It+1 = Ct;
5. for i := 1 to kt do
6.    if Eti is less than the threshold T then
         a. C = C ∪ {Cti};
         b. Remove cluster i from D;
         c. kt+1 = kt+1 − 1;
         d. It+1 = It+1 − {Cti};
      end if;
   end for;
7. if (D = ∅) then
   a. return C;
   b. Stop;
8. else
   a. Randomly choose a data point d approximately close to the center of the largest cluster, and insert d into It+1;
   b. kt+1 = kt+1 + 1;
   c. t = t + 1;
   d. go to step 2;
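A corresponding Python sketch of the loop in Fig. 3, built on the kmeans() sketch above (again our own code; the noise scale used for Step 8a and all names are our assumptions):

```python
def quality_threshold_summarize(D, T, noise_scale=1e-3, rng=None):
    """Quality-Threshold summarization loop (Fig. 3), built on kmeans() above."""
    rng = rng if rng is not None else np.random.default_rng()
    D = np.asarray(D, dtype=float)
    summary = []                                       # accepted centers (set C)
    remaining = D                                      # items not yet summarized
    init = remaining[[rng.integers(len(remaining))]]   # Step 1c: one random item
    while len(remaining) > 0:
        centers, labels, distortions = kmeans(remaining, init)   # Step 2
        keep = np.ones(len(remaining), dtype=bool)
        survivors, sizes = [], []
        for j, (c, e) in enumerate(zip(centers, distortions)):
            if e < T:                          # Step 6: cluster meets the threshold
                summary.append(c)              # 6a: add its center to the summary
                keep[labels == j] = False      # 6b: remove its items from D
            else:
                survivors.append(c)            # its center seeds the next iteration
                sizes.append(int((labels == j).sum()))
        remaining = remaining[keep]
        if len(remaining) == 0:                # Step 7: done
            break
        # Step 8a: insert a new center near the center of the largest remaining cluster.
        largest = survivors[int(np.argmax(sizes))]
        init = np.vstack(survivors + [largest + rng.normal(scale=noise_scale, size=largest.shape)])
    return np.array(summary)
```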


Quality-threshold summarization (Fig. 3) iterates the K-Means algorithm several times, where the value of kt at iteration t is determined dynamically. At the end of each iteration, the data items of clusters satisfying the quality criterion (i.e., whose distortions are less than T) are removed from dataset D. The centers of the remaining clusters, together with one data point randomly selected in the area around the center of the largest cluster within some radius (Step 8a), serve as the initial centers for the next iteration. In our current implementation, this data point is the center of the largest cluster plus a small amount of random noise, which makes the point close to, but slightly different from, the center. The algorithm terminates when D is empty.

Note that in the above algorithm, a cluster is removed if and only if its distortion is less than the threshold T. Thus, Quality-threshold summarization guarantees that all clusters it returns are within the error bound. Moreover, at every iteration, the two parameters of the K-Means algorithm, the number of clusters (i.e., kt) and the initial positions of the cluster centers (i.e., It), are determined based on information about the dataset discovered in previous iterations. This helps to reduce the local optimum effect. The quality threshold and local optimum problems are discussed further in subsections A and B.

Complexity: Let K be the number of clusters in the final result. Note that at the end of each iteration of the Quality-threshold summarization algorithm, exactly one new initial cluster center is inserted. So the number of iterations (i.e., the number of times K-Means is called) is K. The time complexity of K-Means at iteration t is O(kt·nt), where kt and nt are the number of clusters and the number of remaining data items at iteration t. Clearly kt ≤ t ≤ K and nt ≤ |D| = n, so an upper bound on the time complexity of Quality-threshold summarization is O(K²n). However, the early removal of clusters satisfying the quality threshold decreases both kt and nt: the actual value of kt is often strictly less than t, and nt decreases monotonically. Thus, the removal significantly reduces the running time. The space complexity of Quality-threshold summarization is O(n), the same as that of the K-Means algorithm.
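Written out, the argument above gives (our restatement of the bound):

```latex
\text{total cost} \;=\; \sum_{t=1}^{K} O(k_t\, n_t)
\;\le\; \sum_{t=1}^{K} O(t \cdot n)
\;=\; O(K^2 n).
```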

A. Quality Threshold

The quality threshold is critical in semantic data summarization ([12][13][14]), and particularly in clustering-based summarization, where all similar data items are substituted by one common representative (e.g., the cluster center). To make the summarization reliable, we need to control the amount of information lost when the substitution is performed. In general, the higher the cluster quality, the less information is lost.

There are several measures of cluster quality, for example, cluster diameter (the maximum distance between any pair of data items), cluster radius (the maximum distance between the cluster center and any data item), or cluster distortion. Note that any of these measures can be applied to our Quality-threshold summarization algorithm without increasing its complexity. For example, if we would like to use cluster diameter instead of cluster distortion, we just need to adjust K-Means (Fig. 3, Step 2) so that it returns cluster diameters instead of cluster distortions and replace the corresponding quality measure in Step 6. However, we choose the distortion measure because, while measures like cluster diameter or radius only involve "boundary" data items, cluster distortion is the sum of squared Euclidean distances between the center and all data items in the cluster.
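As a small illustration of this flexibility, any of the three measures could be computed per cluster and substituted for the check in Step 6 (our own helper, not part of the paper):

```python
from itertools import combinations

def cluster_quality(points, center, measure="distortion"):
    """Per-cluster quality under the three measures discussed above.
    'points' is an (m, d) array of the items assigned to one cluster."""
    points = np.asarray(points, dtype=float)
    if measure == "distortion":   # sum of squared distances to the center
        return float(((points - center) ** 2).sum())
    if measure == "radius":       # max distance from the center to any item
        return float(np.linalg.norm(points - center, axis=1).max())
    if measure == "diameter":     # max pairwise distance (O(m^2); fine for a sketch)
        return max((float(np.linalg.norm(a - b)) for a, b in combinations(points, 2)), default=0.0)
    raise ValueError(f"unknown measure: {measure}")
```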

B. Local Optimum Effect

The K-Means algorithm is sensitive to its starting point because it terminates when it converges to one of numerous local optima. For example, if the algorithm starts with the initial positions in Fig. 4(a), it converges to a "bad" local optimal solution (Fig. 4(b)) in which one of the clusters is empty. In another case, if the K-Means algorithm starts with appropriate initial positions (Fig. 4(c)), it converges to a "good" local optimum (Fig. 4(d)).

Figure 4. Local optimum effect. The asterisks denote initialized positions of clusters, and solid dots denote final positions of clusters.

Thus, it is necessary to have good initialization for K-Means. However, summarization or clustering is often the first step in data mining, and there is often no prior knowledge about the dataset available for this step.

Quality-threshold summarization initializes the centers using a gradually increasing strategy. This strategy is illustrated through an example in Fig. 5. For simplicity, we assume that no cluster satisfies the quality threshold before the algorithm terminates. At the first iteration (k0 = 1), the algorithm randomly initializes the single center (Fig. 5(a)) and converges to the (global) optimum (Fig. 5(b)). Next, a new initial center is inserted into the area close to the first center (Fig. 5(c)), and the algorithm converges to the positions in Fig. 5(d). In the next iteration, another new initial center is inserted into the area around the center of the largest cluster (Fig. 5(e)). Finally, the algorithm converges to a "good" local optimum (Fig. 5(f)).

Figure 5. Convergence in Quality-threshold summarization. The asterisks denote initialized positions of clusters, and solid dots denote final positions of clusters.

Quality-threshold summarization follows the heuristic that the best place to initialize a new center is in the area around the center of the largest cluster. Note that the objective function of the Quality-threshold summarization algorithm is the sum of cluster distortions. Thus, this heuristic is natural, because the largest cluster is the one that contributes the most to the overall distortion. Our experimental results on both synthetic and real-world datasets confirm the efficiency of this heuristic.

V. EXPERIMENTAL RESULTS

In this section, we study the efficiency of the Quality-threshold summarization algorithm by comparing it with the K-Means based summarization algorithm. The efficiency of each method is shown through the relationship between two quantities, distortion and summarization ratio, which are computed as follows.

Distortion: the sum of squared Euclidean distances from every data item to its representative (i.e., the cluster center). Distortion represents the amount of information lost. In general, the lower the distortion, the higher the quality of the summarized set.

Summarization ratio: the ratio of the original dataset size to the summarized set size. In our experiments, the summarization ratio is computed as the number of data items in the whole dataset divided by the number of clusters.

For each input, we ran the two algorithms 15 times, and computed the average value of each output result.
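A sketch of how these two quantities, and the averaging over runs, might be computed for a given summarized set (the helper names are ours; the sketch assumes the kmeans() and quality_threshold_summarize() functions defined earlier):

```python
def total_distortion(D, centers):
    """Sum of squared Euclidean distances from every item to its closest center."""
    dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

def summarization_ratio(D, centers):
    """Number of items in the dataset divided by the number of clusters."""
    return len(D) / len(centers)

def evaluate_qt(D, T, runs=15):
    """Run Quality-threshold summarization several times and average the outputs,
    mirroring the experimental protocol described above."""
    summaries = [quality_threshold_summarize(D, T) for _ in range(runs)]
    return (np.mean([len(s) for s in summaries]),                      # avg. number of clusters
            np.mean([summarization_ratio(D, s) for s in summaries]),
            np.mean([total_distortion(D, s) for s in summaries]))
```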

A. Results on a Synthetic Dataset

Dataset description: the synthetic dataset is a mixture of 10-dimensional Gaussians containing 5000 data items in total. The mean and variance of each dimension of each Gaussian are randomly chosen from the intervals [25, 75] and [10, 15], respectively.
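One way such a dataset could be generated (our own sketch; the number of Gaussian components is not stated in the paper and is therefore a free parameter here):

```python
def make_synthetic(n_items=5000, n_components=10, dim=10, rng=None):
    """Mixture of dim-dimensional Gaussians: per-dimension means drawn from
    [25, 75] and variances from [10, 15], as described above."""
    rng = rng if rng is not None else np.random.default_rng()
    means = rng.uniform(25, 75, size=(n_components, dim))
    variances = rng.uniform(10, 15, size=(n_components, dim))
    which = rng.integers(n_components, size=n_items)      # component of each item
    noise = rng.normal(size=(n_items, dim)) * np.sqrt(variances[which])
    return means[which] + noise
```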

We ran the K-Means algorithm with different values of k; distortion and summarization ratio were then computed as described above (Table I).

TABLE I. RESULTS PROVIDED BY K-MEANS ON THE SYNTHETIC DATASET.

K (number of clusters)   Summarization Ratio   Distortion (×10^7)
155                      32.26                 1.914
135                      37.04                 1.980
115                      43.48                 2.054
95                       52.63                 2.141
75                       66.67                 2.248
55                       99.90                 2.390
35                       142.86                2.591

Similarly, we ran the Quality-threshold summarization algorithm with different values of the threshold T and computed distortion and summarization ratio (Table II).

TABLE II. RESULTS PROVIDED BY QUALITY-THRESHOLD SUMMARIZATION ON THE SYNTHETIC DATASET.

T (threshold, ×10^3)   Average number of clusters   Summarization Ratio   Distortion (×10^7)
100                    150.43                       33.23                 1.092
200                    85.40                        58.55                 1.287
300                    60.96                        82.02                 1.402
400                    47.94                        104.30                1.486
500                    40.33                        123.98                1.546
600                    34.76                        143.84                1.599

The relationships between summarization ratio and distortion for the two algorithms are shown in Fig. 6. We see that for any given summarization ratio (i.e., number of clusters), the total distortion of the clusters generated by Quality-threshold summarization is much less than the total distortion of those generated by K-Means. For instance, when the summarization ratio is 100, the distortions of the summarized sets provided by K-Means and the Quality-threshold algorithm are about 2.4×10^7 and 1.45×10^7, respectively.

Figure 6. Experimental results on the synthetic dataset. (1): K-Means based summarization algorithm; (2): Quality-threshold summarization algorithm.

B. Results on a Real-world Dataset

Dataset description: we use the Corel dataset from the UCI KDD archive ([9]). This dataset contains 68040 items, each consisting of 32 numerical attributes.

We ran K-Means and Quality-threshold summarization on the Corel dataset with different values of k and the threshold T, respectively. The results of the two algorithms are shown in Table III and Table IV.
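For illustration, an end-to-end comparison on this dataset might look like the following sketch; the file name, the whitespace-separated format, and the threshold value are hypothetical, and the functions are the sketches defined earlier, not the authors' implementation:

```python
if __name__ == "__main__":
    # Hypothetical path/format for the 32-attribute Corel feature file.
    corel = np.loadtxt("corel_features.txt")                 # expected shape: (68040, 32)

    # Quality-threshold summarization: only the threshold is specified; k is not.
    qt_centers = quality_threshold_summarize(corel, T=50.0)
    print("QT      :", summarization_ratio(corel, qt_centers), total_distortion(corel, qt_centers))

    # K-Means baseline: k must be fixed in advance; initial centers are random data items.
    rng = np.random.default_rng(0)
    init = corel[rng.choice(len(corel), size=len(qt_centers), replace=False)]
    km_centers, _, _ = kmeans(corel, init)
    print("K-Means :", summarization_ratio(corel, km_centers), total_distortion(corel, km_centers))
```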

TABLE III. RESULTS PROVIDED BY K-MEANS ON THE COREL DATASET.

K (number of clusters)   Summarization Ratio   Distortion
170                      400.24                3621.82
155                      438.97                3725.18
140                      486.00                3853.63
125                      544.32                4048.63
110                      618.55                4308.45
95                       716.21                4538.43
80                       850.50                4948.09
65                       1046.77               5490.71
50                       1360.80               6125.29

TABLE IV. RESULTS PROVIDED BY QUALITY-THRESHOLD SUMMARIZATION ON THE COREL DATASET.

T (threshold)   Average number of clusters   Summarization Ratio   Distortion
20              145.33                       468.17                1728.30
30              101.33                       671.45                1746.83
40              88.00                        773.18                2021.55
50              74.33                        915.34                2181.82
60              62.33                        1091.55               2243.38
70              60.33                        1127.73               2414.58
80              55.33                        1229.64               2565.92
90              50.00                        1360.80               2720.32
100             48.67                        1398.08               3002.29

Figure 7. Experimental results on the Corel dataset. (1): K-Means based summarization algorithm; (2): Quality-threshold summarization algorithm.

The summarization ratio vs. distortion curves are shown in Fig. 7. Again, we see a significant improvement of the Quality-threshold algorithm over K-Means. For instance, when the summarization ratio is 1000, the distortions of the summarized sets provided by K-Means and the Quality-threshold algorithm are about 5300 and 2200, respectively. An important pattern in both figures is that the difference in distortion between the outputs of the two algorithms becomes more and more significant as the summarization ratio increases.

VI. CONCLUDING REMARKS

This paper presents a novel algorithm for data summarization, named the Quality-threshold algorithm. Compared to K-Means clustering, a popular algorithm for data summarization in previous work, the Quality-threshold algorithm has the following advantages. First, it provides an error bound on summarized sets. Second, the size of the summarized set is automatically determined by taking the characteristics of the original set and the quality threshold into account. Finally, cluster centers are dynamically initialized based on the information about the data space discovered so far, which alleviates the local optimum effect.

Our experimental results confirm the efficiency of the proposed algorithm. With the same summarization ratio, clusters generated by the Quality-threshold algorithm are not only guaranteed to satisfy an error bound, but they also have significantly lower distortion than clusters generated by the K-Means clustering algorithm.

As a future direction, we plan to extend the applicable scope of the Quality-threshold algorithm to the text domain by combining it with probabilistic topic models, which have recently become popular for document modeling in the information retrieval community ([2][7][8][10][20]). In probabilistic topic models, each document in a dataset is a mixture of k latent topics. The topics and mixture coefficients are automatically learned from the dataset. Each document is then represented by a k-element vector of mixture coefficients. This representation makes it possible to apply the proposed algorithm to summarizing text collections.
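A rough sketch of this idea using scikit-learn's LDA implementation (our own illustration; the paper does not prescribe a specific topic model or library, and the topic count and threshold below are arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def summarize_text_collection(documents, n_topics=20, T=1.0):
    """Represent each document as an n_topics-element topic-mixture vector,
    then summarize those vectors with the Quality-threshold algorithm."""
    counts = CountVectorizer(stop_words="english").fit_transform(documents)
    doc_topic = LatentDirichletAllocation(n_components=n_topics).fit_transform(counts)
    return quality_threshold_summarize(doc_topic, T)
```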

REFERENCES

[1] Babu, S., Garofalakis, M., Rastogi, R., "SPARTAN: a model-based semantic summarization system for massive data tables", In Proceedings of the 21st ACM International Conference on Management of Data (SIGMOD), 2001.

[2] Blei, D. M., Ng, A., Jordan, M., "Latent Dirichlet allocation", Journal of Machine Learning Research, vol. 3, 2003.

[3] Chen, B., Phang, C. T., Harrison, R., Pan, Y., "Novel hybrid hierarchical K-Means clustering method (H-K-means) for microarray analysis", In Proceedings of the IEEE Computational Systems Bioinformatics Conference - Workshops (CSBW), 2005.

[4] Cheng, C., Luo, J., "Comparison of data summarization micro-clustering methods for hierarchical clustering", Technical Report 695, 2003.

[5] Duc-Cuong Nguyen, "Flexible information management strategies in machine learning and data mining", PhD thesis, University of Wales, UK, 2004.

[6] DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., Pregibon, D., "Squashing flat files flatter", In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 1999.

[7] Griffiths, T., Steyvers, M., "Finding scientific topics", In Proceedings of the National Academy of Sciences (PNAS), 2004.

[8] Ha-Thuc, V., Srinivasan, P., "A robust learning approach for text classification", In Proceedings of the 7th SIAM Text Mining Workshop (TMW), 2008.

[9] Hettich, S., Bay, S. D., "The UCI KDD archive" [http://kdd.ics.uci.edu], Department of Information and Computer Science, University of California at Irvine, USA, 1999.

[10] Hofmann, T., "Probabilistic latent semantic indexing", In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI), 1999.

[11] Hore, P., Hall, L. O., "Scalable clustering: a distributed approach", In Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2004.

[12] Heyer, L., Kruglyak, S., Yooseph, S., "Exploring expression data: identification and analysis of coexpressed genes", Genome Research, vol. 9(11), 1999.

[13] Jagadish, H. V., Madar, J., Ng, R. T., "Semantic summarization and pattern extraction with fascicles", In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), 1999.

[14] Jagadish, H. V., Ng, R. T., Ooi, B. C., "ItCompress: an iterative semantic summarization algorithm", In Proceedings of the 20th International Conference on Data Engineering (ICDE), 2004.

[15] Provost, F., "Distributed data mining: scaling up and beyond", In Advances in Distributed and Parallel Knowledge Discovery, MIT Press, 2000.

[16] Saint-Paul, R., Raschia, G., Mouaddib, N., "General purpose database summarization", In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB), 2005.

[17] Wagstaff, L. K., Shu, P. H., Mazzoni, D., Castano, R., "Semi-supervised data summarization: using spectral libraries to improve hyperspectral clustering", The Interplanetary Network Progress Report, vol. 42, 2005.

[18] Wendel, P., Ghanem, M., Guo, Y., "Scalable clustering on the data grid", In Proceedings of the 5th IEEE International Symposium on Cluster Computing and the Grid (CCGrid), 2005.

[19] Zhang, T., Ramakrishnan, R., Livny, M., "BIRCH: an efficient data clustering method for very large databases", In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 1996.

[20] Zhou, D., Manavoglu, E., Li, J., Giles, L., Zha, H., "Probabilistic models for discovering e-communities", In Proceedings of the 15th ACM International World Wide Web Conference (WWW), 2006.
