
Similarity Measures for Categorical Data: A Comparative Evaluation

Shyam Boriah Varun Chandola Vipin Kumar

Department of Computer Science and Engineering, University of Minnesota

<sboriah,chandola,kumar>@cs.umn.edu

Abstract

Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances, but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates the others for all types of problems, some measures exhibit consistently high performance.

1 Introduction

Measuring similarity or distance between two data points is a core requirement for several data mining and knowledge discovery tasks that involve distance computation. Examples include clustering (k-means), distance-based outlier detection, classification (kNN, SVM), and several other data mining tasks. These algorithms typically treat the similarity computation as an orthogonal step and can make use of any measure.

For continuous data sets, the Minkowski Distance is a general method used to compute the distance between two multivariate points. In particular, the Minkowski Distances of order 1 (Manhattan) and order 2 (Euclidean) are the two most widely used distance measures for continuous data. The key observation about the above measures is that they are independent of the underlying data set to which the two points belong. Several data-driven measures, such as the Mahalanobis Distance, have also been explored for continuous data.

The notion of similarity or distance for categorical data is not as straightforward as for continuous data. The key characteristic of categorical data is that the different values that a categorical attribute takes are not inherently ordered. Thus, it is not possible to directly compare two different categorical values. The simplest way to find similarity between two categorical attributes is to assign a similarity of 1 if the values are identical and a similarity of 0 if the values are not identical. For two multivariate categorical data points, the similarity between them will be directly proportional to the number of attributes in which they match. This simple measure is also known as the overlap measure in the literature [33].
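As a concrete illustration, here is a minimal sketch of the overlap measure in Python (the function name and data layout are ours, not from the paper): instances are tuples of categorical values, and the similarity is the fraction of matching attributes.

    def overlap_similarity(x, y):
        """Overlap: fraction of attributes on which the two instances match."""
        assert len(x) == len(y)
        return sum(xi == yi for xi, yi in zip(x, y)) / len(x)

    # Both pairs from the example below receive the same overlap score, even
    # though one pair involves a rare color and the other a frequent one.
    s_rare = overlap_similarity(("green", "square"), ("green", "circle"))
    s_freq = overlap_similarity(("blue", "square"), ("blue", "circle"))
    assert s_rare == s_freq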

One obvious drawback of the overlap measure is that it does not distinguish between the different values taken by an attribute. All matches, as well as mismatches, are treated as equal. For example, consider a categorical data set D, defined over two attributes: color and shape. Let color take 3 possible values in D: {red, blue, green}, and shape take 3 possible values in D: {square, circle, triangle}. Table 1 summarizes the frequency of occurrence for each possible combination in D.

                     shape
 color    square   circle   triangle   Total
 red        30        2         3        35
 blue       25       25         0        50
 green       2        1         2         5
 Total      57       28         5

Table 1: Frequency Distribution of a Simple 2-D Categorical Data Set

The overlap similarity between the two instances (green, square) and (green, circle) is 1/3. The overlap similarity between (blue, square) and (blue, circle) is also 1/3. But the frequency distribution in Table 1 shows that while (blue, square) and (blue, circle) are frequent combinations, (green, square) and (green, circle) are very rare


combinations in the data set. Thus, it would appear that the overlap measure is too simplistic in giving equal importance to all matches and mismatches. Although there is no inherent ordering in categorical data, the previous example shows that there is other information in categorical data sets that can be used to define what should be considered more similar and what should be considered less similar.

This observation has motivated researchers to come up with data-driven similarity measures for categorical attributes. Such measures take into account the frequency distribution of different attribute values in a given data set to define similarity between two categorical attribute values. In this paper, we study a variety of similarity measures proposed in diverse research fields ranging from statistics to ecology, as well as many of their variations. Each measure uses the information present in the data in its own way to define similarity.

Since we are evaluating data-driven similarity measures, it is obvious that their performance is highly related to the data set being analyzed. To understand this relationship, we first identify the key characteristics of a categorical data set. For each of the different similarity measures that we study, we analyze how it relates to the different characteristics of the data set.

1.1 Key Contributions. The key contributions of this paper are as follows:

• We bring together fourteen different categorical measures from different fields and study them together in a single context. Many of these measures have not been investigated outside the domain they were introduced in, nor compared with other measures.

• We classify the categorical measures in three different ways based on how they utilize information in the data.

• We evaluate the various similarity measures for categorical data on a wide variety of benchmark data sets. In particular, we show the utility of data-driven measures for the problem of determining similarity with categorical data.

• We also propose a number of new measures that are either variants of previously proposed measures or derived from previously proposed similarity frameworks. The performance of some of the measures we propose is among the best of all the measures we study.

1.2 Organization of the Paper. The rest of the paper is organized as follows. We first discuss related efforts in the study of similarity measures in Section 2. In Section 3, we identify various characteristics of categorical data that are relevant to this study. We then introduce the 14 different similarity measures studied in this paper in Section 4. We describe our experimental setup, evaluation methodology, and results on public data sets in Section 5.

2 Related Work

Sneath and Sokal discuss categorical similarity measures in some detail in their book [32] on numerical taxonomy. They were among the first to bring together and discuss many of these measures. At the time, two major concerns were (1) biological relevance, since numerical taxonomy was mainly concerned with taxonomies from biology, ecology, etc., and (2) computational efficiency, since computational resources were limited and scarce. Nevertheless, many of the observations made by Sneath and Sokal are quite relevant today and offer key insights into many of the measures.

There are several books [2, 18, 16, 21] on cluster analysis that discuss the problem of determining similarity between categorical attributes. However, most of these books do not offer solutions to the problem or discuss the measures in this paper; the usual recommendation is to binarize the data and then use binary similarity measures.

Wilson and Martinez [36] performed a detailed study of heterogeneous distance functions (for data with categorical and continuous attributes) for instance-based learning. The measures in their study are based on a supervised approach, where each data instance has class information in addition to a set of categorical/continuous attributes. The measures discussed in this paper are orthogonal to the ones proposed in [36], since supervised measures determine similarity based on class information, while data-driven measures determine similarity based on the data distribution. In principle, both ideas can be combined.

A number of new data mining techniques for categorical data have been proposed recently. Some of them use notions of similarity that are neighborhood-based [15, 4, 8, 26, 1, 22], or incorporate the similarity computation into the learning algorithm [13, 17, 12]. Neighborhood-based approaches use some notion of similarity (usually the overlap measure) to define the neighborhood of a data instance, while the measures we study in this paper are directly used to determine similarity between a pair of data instances; hence, we see the measures discussed in this paper as being useful to compute the neighborhood of a point and


neighborhood-based measures as meta-similarity measures. Since techniques that embed similarity measures into the learning algorithm do not explicitly define general categorical similarity measures, we do not discuss them in this paper.

Jones and Furnas [20] studied several similarity measures in the field of information retrieval. In particular, they performed a geometric analysis of continuous measures in order to reveal important differences that would affect retrieval performance. Noreault et al. [25] also studied measures in information retrieval, with the goal of generalizing effectiveness based on empirically evaluating the performance of the measures. Another comparative empirical evaluation, for determining similarity between fuzzy sets, was performed by Zwick et al. [37], followed by several others [27, 35].

3 Categorical Data

Categorical data (also known as nominal or qualitative multi-state data) has been studied for a long time in various contexts. As mentioned earlier, computing similarity between categorical data instances is not straightforward owing to the fact that there is no explicit notion of ordering between categorical values. To overcome this problem, several data-driven similarity measures have been proposed for categorical data. The behavior of such measures directly depends on the data. In this section, we identify the key characteristics of a categorical data set that can potentially affect the behavior of a data-driven similarity measure.

For the sake of notation, consider a categorical data set D containing N objects, defined over a set of d categorical attributes, where A_k denotes the kth attribute. Let the attribute A_k take n_k values in the given data set, and denote this set of values by 𝒜_k. We also use the following notation:

• f_k(x): the number of times attribute A_k takes the value x in the data set D. Note that if x ∉ 𝒜_k, then f_k(x) = 0.

• p_k(x): the sample probability of attribute A_k taking the value x in the data set D. The sample probability is given by

    p_k(x) = f_k(x) / N

• p2_k(x): another probability estimate of attribute A_k taking the value x in a given data set, given by

    p2_k(x) = f_k(x) (f_k(x) − 1) / (N (N − 1))
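The following helper (a sketch with hypothetical names, not code from the paper) computes these three quantities for a single attribute, given the column of values it takes over the N records (N ≥ 2 is assumed):

    from collections import Counter

    def attribute_estimates(column):
        """Compute f_k, p_k and p2_k for one attribute from its column of values."""
        N = len(column)
        f = Counter(column)                                          # f_k(x)
        p = {x: c / N for x, c in f.items()}                         # p_k(x)
        p2 = {x: c * (c - 1) / (N * (N - 1)) for x, c in f.items()}  # p2_k(x)
        return f, p, p2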

3.1 Characteristics of Categorical Data. Since this paper discusses data-driven similarity measures for categorical data, a key task is to identify the characteristics of a categorical data set that affect the behavior of such a similarity measure. We enumerate these characteristics below:

• Size of Data, N. As we will see later, most measures are typically invariant to the size of the data, though there are some measures (e.g., Smirnov) that do make use of this information.

• Number of attributes, d. Most measures are invariant to this characteristic, since they typically normalize the similarity over the number of attributes. But in our experimental results we observe that the number of attributes does affect the performance of the outlier detection algorithms.

• Number of values taken by each attribute, n_k. A data set might contain attributes that take several values and attributes that take very few values. For example, one attribute might take several hundred possible values, while another might take very few. A similarity measure might give more importance to the second attribute, while ignoring the first one. In fact, one of the measures discussed in this paper (Eskin) behaves exactly like this.

• Distribution of f_k(x). This refers to the distribution of the frequency of values taken by an attribute in the given data set. In certain data sets an attribute might be distributed uniformly over the set 𝒜_k, while in others the distribution might be skewed. A similarity measure might give more importance to attribute values that occur rarely, while another similarity measure might give more importance to frequently occurring attribute values.
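As a concrete companion to this list, the sketch below (hypothetical helper names, assuming the data set is a list of equal-length tuples) reports N, d, each n_k, and the per-attribute frequency distributions:

    from collections import Counter

    def profile_data_set(data):
        """Print the Section 3.1 characteristics of a categorical data set."""
        N, d = len(data), len(data[0])
        print(f"N = {N}, d = {d}")
        for k in range(d):
            freqs = Counter(row[k] for row in data)  # f_k(.) for attribute A_k
            print(f"attribute {k}: n_k = {len(freqs)}, f_k = {dict(freqs)}")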

4 Similarity Measures for Categorical Data

The study of similarity between data objects with categorical variables has a long history. Pearson proposed a chi-square statistic in the late 1800s, which is often used to test independence between categorical variables in a contingency table. Pearson's chi-square statistic was later modified and extended, leading to several other measures [28, 24, 7]. More recently, however, the overlap measure has become the most commonly used similarity measure for categorical data. Its popularity is perhaps related to its simplicity and ease of use. In this section, we will discuss the overlap measure and several data-driven similarity measures for categorical data.


Note that we have converted measures that were originally proposed as distance measures into similarity measures in order to make the measures comparable in this study. The measures discussed henceforth will all be in the context of similarity, with distance measures being converted using the formula:

    sim = 1 / (1 + dist)

Almost all similarity measures assign a similarity value between two data instances X and Y belonging to the data set D (introduced in Section 3) as follows:

    S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k)        (4.1)

where S_k(X_k, Y_k) is the per-attribute similarity between two values for the categorical attribute A_k. Note that X_k, Y_k ∈ 𝒜_k. The quantity w_k denotes the weight assigned to the attribute A_k.
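In code, Equation 4.1 corresponds to a simple combiner into which any of the measures below can be plugged as the per-attribute function. This is a sketch under our own naming, not the authors' implementation:

    def similarity(x, y, per_attribute_sim, weights):
        """Equation 4.1: S(X, Y) = sum over k of w_k * S_k(X_k, Y_k)."""
        return sum(w * per_attribute_sim(k, xk, yk)
                   for k, (xk, yk, w) in enumerate(zip(x, y, weights)))

    # The overlap measure as an instance of the framework, with w_k = 1/d:
    d = 2
    overlap = lambda k, a, b: 1.0 if a == b else 0.0
    s = similarity(("green", "square"), ("green", "circle"), overlap, [1 / d] * d)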

To understand how different measures calculate the per-attribute similarity S_k(X_k, Y_k), consider a categorical attribute A that takes one of the values {a, b, c, d}. We have dropped the subscript k for simplicity. The per-attribute similarity computation is equivalent to constructing the (symmetric) matrix shown in Figure 1.

          a        b        c        d
    a  S(a, a)  S(a, b)  S(a, c)  S(a, d)
    b           S(b, b)  S(b, c)  S(b, d)
    c                    S(c, c)  S(c, d)
    d                             S(d, d)

Figure 1: Similarity Matrix for a Single Categorical Attribute

Essentially, in determining the similarity between two values, any categorical measure is filling the entries of this matrix. For example, the overlap measure sets the diagonal entries to 1 and the off-diagonal entries to 0, i.e., the similarity is 1 if the values match and 0 if the values mismatch. Additionally, measures may use the following information in computing a similarity value (all the measures in this paper use only this information):

• f(a), f(b), f(c), f(d), the frequencies of the values in the data set

• N, the size of the data set

• n, the number of values taken by the attribute (4 in the case above)

We can classify measures in several ways, based on: (i) the manner in which they fill the entries of the similarity matrix; (ii) whether the weight assigned is a function of the frequency of the attribute values; (iii) the arguments used to propose the measure (probabilistic, information-theoretic, etc.). In this paper, we will describe the measures by classifying them as follows:

• those that fill the diagonal entries only. These are measures that set the off-diagonal entries to 0 (mismatches are uniformly given the minimum value) and give possibly different weights to matches.

• those that fill the off-diagonal entries only. These measures set the diagonal entries to 1 (matches are uniformly given the maximum value) and give possibly different weights to mismatches.

• those that fill both diagonal and off-diagonal entries. These measures give different weights to both matches and mismatches.

Table 2 gives the mathematical formulas for the measures we will be describing in this paper. Each measure in Table 2 computes the per-attribute similarity S_k(X_k, Y_k) as shown in column 2 and the attribute weight w_k as shown in column 3.

4.1 Measures that fill Diagonal Entries only.

1. Overlap. The overlap measure simply counts the number of attributes that match in the two data instances. The range of per-attribute similarity for the overlap measure is [0, 1], with a value of 0 occurring when there is no match and a value of 1 occurring when the attribute values match.

2. Goodall. Goodall [14] proposed a measure that attempts to normalize the similarity between two objects by the probability that the observed similarity value could be observed in a random sample of two points. This measure assigns higher similarity to a match if the value is infrequent than if the value is frequent. Goodall's original measure details a procedure for combining similarities in the multivariate setting that takes into account dependencies between attributes. Since this procedure is computationally expensive, we use a simpler version of the measure (described next as Goodall1). Goodall's original measure is not empirically evaluated in this paper. We also propose three other variants of Goodall's measure in this paper: Goodall2, Goodall3 and Goodall4.


Measure: per-attribute similarity S_k(X_k, Y_k); attribute weight w_k, k = 1, ..., d.

1. Overlap:
   S_k = 1 if X_k = Y_k; 0 otherwise.                                  w_k = 1/d

2. Eskin:
   S_k = 1 if X_k = Y_k; n_k^2 / (n_k^2 + 2) otherwise.                w_k = 1/d

3. IOF:
   S_k = 1 if X_k = Y_k;
   1 / (1 + log f_k(X_k) · log f_k(Y_k)) otherwise.                    w_k = 1/d

4. OF:
   S_k = 1 if X_k = Y_k;
   1 / (1 + log(N / f_k(X_k)) · log(N / f_k(Y_k))) otherwise.          w_k = 1/d

5. Lin:
   S_k = 2 log p_k(X_k) if X_k = Y_k;
   2 log(p_k(X_k) + p_k(Y_k)) otherwise.        w_k = 1 / Σ_{i=1}^{d} (log p_i(X_i) + log p_i(Y_i))

6. Lin1:
   S_k = Σ_{q∈Q} log p_k(q) if X_k = Y_k;
   2 log Σ_{q∈Q} p_k(q) otherwise.              w_k = 1 / Σ_{i=1}^{d} Σ_{q∈Q} log p_i(q)

7. Goodall1:
   S_k = 1 − Σ_{q∈Q} p2_k(q) if X_k = Y_k; 0 otherwise.                w_k = 1/d

8. Goodall2:
   S_k = 1 − Σ_{q∈Q} p2_k(q) if X_k = Y_k; 0 otherwise.                w_k = 1/d

9. Goodall3:
   S_k = 1 − p2_k(X_k) if X_k = Y_k; 0 otherwise.                      w_k = 1/d

10. Goodall4:
    S_k = p2_k(X_k) if X_k = Y_k; 0 otherwise.                         w_k = 1/d

11. Smirnov:
    S_k = 2 + (N − f_k(X_k)) / f_k(X_k) + Σ_{q∈𝒜_k\{X_k}} f_k(q) / (N − f_k(q)) if X_k = Y_k;
    Σ_{q∈𝒜_k\{X_k,Y_k}} f_k(q) / (N − f_k(q)) otherwise.               w_k = 1 / Σ_{k=1}^{d} n_k

12. Gambaryan:
    S_k = −[p_k(X_k) log_2 p_k(X_k) + (1 − p_k(X_k)) log_2(1 − p_k(X_k))] if X_k = Y_k;
    0 otherwise.                                                       w_k = 1 / Σ_{k=1}^{d} n_k

13. Burnaby:
    S_k = 1 if X_k = Y_k;
    [Σ_{q∈𝒜_k} 2 log(1 − p_k(q))] / [log( p_k(X_k) p_k(Y_k) / ((1 − p_k(X_k))(1 − p_k(Y_k))) ) + Σ_{q∈𝒜_k} 2 log(1 − p_k(q))] otherwise.    w_k = 1/d

14. Anderberg (defined directly on the instance pair, not per attribute):
    S(X, Y) = [ Σ_{k : X_k=Y_k} (1 / p_k(X_k))^2 · 2 / (n_k(n_k + 1)) ] /
              [ Σ_{k : X_k=Y_k} (1 / p_k(X_k))^2 · 2 / (n_k(n_k + 1)) + Σ_{k : X_k≠Y_k} (1 / (2 p_k(X_k) p_k(Y_k))) · 2 / (n_k(n_k + 1)) ]

Table 2: Similarity Measures for Categorical Attributes. Note that S(X, Y) = Σ_{k=1}^{d} w_k S_k(X_k, Y_k). For measure Lin1, Q ⊆ 𝒜_k is the set {q : p_k(X_k) ≤ p_k(q) ≤ p_k(Y_k)}, assuming p_k(X_k) ≤ p_k(Y_k). For measure Goodall1, Q ⊆ 𝒜_k is the set {q : p_k(q) ≤ p_k(X_k)}. For measure Goodall2, Q ⊆ 𝒜_k is the set {q : p_k(q) ≥ p_k(X_k)}.


3. Goodall1. The Goodall1 measure is the same as Goodall's measure on a per-attribute basis. However, instead of combining the similarities by taking into account dependencies between attributes, the Goodall1 measure takes the average of the per-attribute similarities. The range of S_k(X_k, Y_k) for matches in the Goodall1 measure is [0, 1 − 2/(N(N−1))], with the minimum being attained when attribute A_k takes only one value, and the maximum attained when the value X_k occurs twice while all other possible values of A_k occur more than two times.

4. Goodall2. The Goodall2 measure is a variant of Goodall's measure proposed by us. This measure assigns higher similarity if the matching values are infrequent, and at the same time there are other values that are even less frequent, i.e., the similarity is higher if there are many values with approximately equal frequencies, and lower if the frequency distribution is skewed. The range of S_k(X_k, Y_k) for matches in the Goodall2 measure is [0, 1 − 2/(N(N−1))], with the minimum value being attained if attribute A_k takes only one value, and the maximum attained when the value X_k occurs twice while all other possible values of A_k occur only once each.

5. Goodall3. We also propose another variant of Goodall's measure called Goodall3. The Goodall3 measure assigns a high similarity if the matching values are infrequent, regardless of the frequencies of the other values. The range of S_k(X_k, Y_k) for matches in the Goodall3 measure is [0, 1 − 2/(N(N−1))], with the minimum value being attained if X_k is the only value for attribute A_k, and the maximum value attained if X_k occurs only twice (a code sketch of this measure appears at the end of this list).

6. Goodall4. The Goodall4 measure assigns similarity 1 − Goodall3 for matches. The range of S_k(X_k, Y_k) for matches in the Goodall4 measure is [2/(N(N−1)), 1], with the minimum value being attained if X_k occurs only twice, and the maximum value attained if X_k is the only value for attribute A_k.

7. Gambaryan. Gambaryan proposed a measure [11] that gives more weight to matches where the matching value occurs in about half the data set, i.e., in between being frequent and rare. The Gambaryan measure for a single attribute match is closely related to the Shannon entropy from information theory, as can be seen from its formula in Table 2. The range of S_k(X_k, Y_k) for matches in the Gambaryan measure is [0, 1], with the minimum value being attained if X_k is the only value for attribute A_k, and the maximum value attained when X_k has frequency N/2.
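As an example of a diagonal-only measure, here is a sketch of the Goodall3 per-attribute similarity (referenced from item 5 above), built from one attribute's column of values; the helper names are ours:

    from collections import Counter

    def goodall3_per_attribute(column):
        """Goodall3: a match on value x scores 1 - p2_k(x); mismatches score 0."""
        N = len(column)
        p2 = {x: c * (c - 1) / (N * (N - 1)) for x, c in Counter(column).items()}
        def S_k(xk, yk):
            return 1.0 - p2[xk] if xk == yk else 0.0
        return S_k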

4.2 Measures that fill Off-diagonal Entries only.

1. Eskin. Eskin et al. [9] proposed a normalization kernel for record-based network intrusion detection data. The original measure is distance-based and assigns a weight of 2/n_k^2 for mismatches; when adapted to similarity, this becomes a weight of n_k^2/(n_k^2 + 2). This measure gives more weight to mismatches that occur on attributes that take many values. The range of S_k(X_k, Y_k) for mismatches in the Eskin measure is [2/3, N^2/(N^2 + 2)], with the minimum value being attained when the attribute A_k takes only two values, and the maximum value attained when the attribute has all unique values.

2. Inverse Occurrence Frequency (IOF). The inverse occurrence frequency measure assigns lower similarity to mismatches on more frequent values. The IOF measure is related to the concept of inverse document frequency from information retrieval [19], where it is used to signify the relative number of documents that contain a specific word. A key difference is that inverse document frequency is computed on a term-document matrix, which is usually binary, while the IOF measure is defined for categorical data. The range of S_k(X_k, Y_k) for mismatches in the IOF measure is [1/(1 + (log(N/2))^2), 1], with the minimum value being attained when X_k and Y_k each occur N/2 times (i.e., these are the only two values), and the maximum value attained when X_k and Y_k occur only once in the data set (a code sketch of this measure appears at the end of this list).

3. Occurrence Frequency (OF). The occurrence frequency measure gives the opposite weighting of the IOF measure for mismatches, i.e., mismatches on less frequent values are assigned lower similarity and mismatches on more frequent values are assigned higher similarity. The range of S_k(X_k, Y_k) for mismatches in the OF measure is [1/(1 + (log N)^2), 1/(1 + (log 2)^2)], with the minimum value being attained when X_k and Y_k occur only once in the data set, and the maximum value attained when X_k and Y_k occur N/2 times.

4. Burnaby. Burnaby [5] proposed a similarity measure using arguments from information theory. He argues that the set of observed values is like a group of signals conveying information and, as in information theory, attribute values that are rarely observed should be considered more informative. In [5], Burnaby proposed information-weighted measures for binary, ordinal, categorical and continuous data. The measure we present in Table 2 is adapted from Burnaby's categorical measure. This measure assigns low similarity to mismatches on rare values and high similarity to mismatches on frequent values. The range of S_k(X_k, Y_k) for mismatches in the Burnaby measure is [N log(1 − 1/N) / (N log(1 − 1/N) − log(N − 1)), 1], with the minimum value being attained when all values for attribute A_k occur only once, and the maximum value attained when X_k and Y_k each occur N/2 times.
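The measures in this section differ only in how they score a mismatch. Here is a sketch of the IOF and OF per-attribute similarities from Table 2 (our helper names; it assumes every queried value occurs in the column, so the logarithms are well defined):

    import math
    from collections import Counter

    def iof_of_per_attribute(column):
        """Return the IOF and OF per-attribute similarity functions."""
        N = len(column)
        f = Counter(column)
        def iof(xk, yk):  # matches score 1; mismatches on frequent values score low
            if xk == yk:
                return 1.0
            return 1.0 / (1.0 + math.log(f[xk]) * math.log(f[yk]))
        def of(xk, yk):   # matches score 1; mismatches on rare values score low
            if xk == yk:
                return 1.0
            return 1.0 / (1.0 + math.log(N / f[xk]) * math.log(N / f[yk]))
        return iof, of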

4.3 Measures that fill both Diagonal and Off-diagonal Entries.

1. Lin. In [23], Lin describes an information-theoretic framework for similarity, where he argues that when similarity is thought of in terms of assumptions about the space, the similarity measure naturally follows from the assumptions. Lin [23] discusses the ordinal, string, word and semantic similarity settings; we applied his framework to the categorical setting to derive the Lin measure in Table 2. The Lin measure gives higher weight to matches on frequent values, and lower weight to mismatches on infrequent values. The range of S_k(X_k, Y_k) for a match in the Lin measure is [−2 log N, 0], with the minimum value being attained when X_k occurs only once and the maximum value attained when X_k occurs N times. The range of S_k(X_k, Y_k) for a mismatch in the Lin measure is [−2 log(N/2), 0], with the minimum value being attained when X_k and Y_k each occur only once, and the maximum value attained when X_k and Y_k each occur N/2 times (a code sketch of this measure appears at the end of this list).

2. Lin1. The Lin1 measure is another measure we have derived using Lin's similarity framework. This measure gives lower weight to mismatches if either of the mismatching values is very frequent, or if there are several values with frequency in between those of the mismatching values; higher weight is given when there are mismatches on infrequent values and there are few other infrequent values. For matches, lower weight is given to matches on frequent values or on values that have many other values of the same frequency; higher weight is given to matches on rare values. The range of S_k(X_k, Y_k) for matches in the Lin1 measure is [−2 log(N/2), 0], with the minimum value being attained when X_k occurs twice in the data set and no other value of attribute A_k occurs twice, and the maximum value attained when X_k occurs N times. The range of S_k(X_k, Y_k) for mismatches in the Lin1 measure is [−2 log(N/2), 0], with the minimum value being attained when X_k and Y_k both occur only once and all other values of attribute A_k occur more than once, and the maximum value attained when X_k is the most frequent value and Y_k is the least frequent value, or vice versa.

3. Smirnov. Smirnov [31] proposed a measure rooted in probability theory that not only considers a given value's frequency, but also takes into account the distribution of the other values taken by the same attribute. The Smirnov measure is probabilistic for both matches and mismatches. For a match, the similarity is high when the frequency of the matching value is low and the other values occur frequently. The range of S_k(X_k, Y_k) for a match in the Smirnov measure is [2, 2N], with the minimum value being attained when X_k occurs N times; the maximum value is attained when X_k occurs only once and there is only one other possible value for attribute A_k, which occurs N − 1 times. The range of S_k(X_k, Y_k) for a mismatch in the Smirnov measure is [0, N/2 − 1], with the minimum value being attained when attribute A_k takes only two values, X_k and Y_k, and the maximum attained when A_k takes only one more value apart from X_k and Y_k, which occurs N − 2 times (X_k and Y_k occur once each).

4. Anderberg. In his book on cluster analysis [2], Anderberg presents an approach to handle similarity between categorical attributes. He argues that rare matches indicate a strong association and should be given a very high weight, and that mismatches on rare values should be treated as distinctive and should also be given special importance. In accordance with these arguments, the Anderberg measure assigns higher similarity to rare matches, and lower similarity to rare mismatches. The Anderberg measure is unique in the sense that it cannot be written in the form of Equation 4.1. The range of the Anderberg measure is [0, 1]; the minimum value is attained when there are no matches, and the maximum value is attained when all attributes match.
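As promised in the Lin item above, here is a sketch of the full Lin measure, including its pair-dependent normalizing weight from Table 2 (our implementation of the formula; `columns` holds each attribute's list of values over the data set, and both instances are assumed to contain only values seen in the data):

    import math
    from collections import Counter

    def lin_similarity(x, y, columns):
        """Lin: 2*log p_k(X_k) on a match, 2*log(p_k(X_k) + p_k(Y_k)) on a
        mismatch, normalized by sum_i (log p_i(X_i) + log p_i(Y_i))."""
        N = len(columns[0])
        p = [{v: c / N for v, c in Counter(col).items()} for col in columns]
        num = sum(2 * math.log(p[k][xk]) if xk == yk
                  else 2 * math.log(p[k][xk] + p[k][yk])
                  for k, (xk, yk) in enumerate(zip(x, y)))
        den = sum(math.log(p[k][xk]) + math.log(p[k][yk])
                  for k, (xk, yk) in enumerate(zip(x, y)))
        return num / den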

4.4 Further classification of similarity measures. We can further classify categorical similarity measures based on the arguments used to propose them:

1. Probabilistic approaches take into account the probability of a given match taking place. The following measures are probabilistic: Goodall, Smirnov, Anderberg.


2. Information-theoretic approaches incorporate the information content of a particular value/variable with respect to the data set. The following measures are information-theoretic: Lin, Lin1, Burnaby.

Table 3 provides a characterization of each of the 14 similarity measures in terms of how they handle the various characteristics of categorical data. This table shows that the measures Eskin and Anderberg assign weight to every attribute using the quantity n_k, though in opposite ways. Another interesting observation from column 3 is that several measures (Lin, Lin1, Goodall1, Goodall3, Smirnov, Anderberg) assign higher similarity to a match when the attribute value is rare (f_k is low), while Goodall2 and Goodall4 assign higher similarity to a match when the attribute value is frequent (f_k is high). Only Gambaryan assigns the maximum similarity when the attribute value has a relative frequency close to 1/2. Column 4 shows that IOF, Lin, Lin1, Smirnov and Burnaby assign greater similarity when the mismatch occurs between rare values, while OF and Anderberg assign greater similarity for a mismatch between frequent values.

5 Experimental Evaluation

In this section we present an experimental evaluation of the 14 measures (listed in Table 2) on 18 different data sets in the context of outlier detection.

Of these data sets, 16 are based on data sets available at the UCI Machine Learning Repository [3], and two are based on network data generated by SKAION Corporation for the ARDA information assurance program [30]. The details of the 18 data sets are summarized in Table 4. Eleven of these data sets were purely categorical, five (KD1, KD2, Sk1, Sk2, Cen) had a mix of continuous and categorical attributes, and two data sets, Irs and Sgm, were purely continuous. Continuous variables were discretized using the MDL method [10]. The KD1 and KD2 data sets were obtained from the KDDCup data set by discretizing the continuous attributes into 10 and 100 bins, respectively. Another possible way to handle a mixture of attributes is to compute the similarity for continuous and categorical attributes separately, and then do a weighted aggregation. In this study we converted the continuous attributes to categorical to simplify the comparative evaluation.
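The paper's discretization uses the MDL method of Fayyad and Irani [10]; as a simple stand-in, the equal-width binning below mirrors the 10- and 100-bin preprocessing described for KD1 and KD2 (the function name and the choice of equal-width bins are our assumptions, not the paper's procedure):

    import numpy as np

    def discretize_equal_width(values, bins=10):
        """Map continuous values to categorical bin indices 0..bins-1."""
        values = np.asarray(values, dtype=float)
        edges = np.linspace(values.min(), values.max(), bins + 1)
        return np.digitize(values, edges[1:-1])  # interior edges only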

Each data set contains labeled instances belonging to multiple classes. We identified one class as the outlier class, and the rest of the classes were grouped together and called normal. For most of the data sets, the smallest class was selected as the outlier class. The only exceptions were (Cr1, Cr2) and (KD1, KD2), where the original data sets had two similar-sized classes. For each data set in the pair, we sampled 100 points from one of the two classes as the outlier points.

In Section 3.1 we discussed a number of characteristics of categorical data sets; in Table 4 we describe the various public data sets in terms of these characteristics. The first row gives the size of each data set. The second row shows the percentage of outlier points in the original data set. The third row indicates the number of attributes in the data sets. Rows 4 and 5 show the distribution of the number of values taken by each attribute; the difference between the average and the median is a measure of how skewed this distribution is. For example, data set Sk1 has a few attributes that take many values while most other attributes take few values. The next three rows show the distribution of the frequency of values taken by an attribute in the given data set (i.e., f_k(x)). This is done by showing the number of attributes that have a uniform, Gaussian, or skewed distribution in rows 6, 7, and 8, respectively.

The last two rows in Table 4 give the cross-validation classification recall and precision reported by the C4.5 classifier on the outlier class. This quantity indicates the separability between instances belonging to the normal class(es) and instances belonging to the outlier class using the given set of attributes. A low accuracy implies that distinguishing between outliers and normal instances is difficult in that particular data set using a decision tree-based classifier.

5.1 Evaluation Methodology. The performance of the different similarity measures was evaluated in the context of outlier detection using nearest neighbors [29, 34]. All instances belonging to the normal class(es) form the training set. We construct the test set by adding the outlier points to the training set. For each test instance, we find its k nearest neighbors in the training set using the given similarity measure (we chose the parameter k = 10). The outlier score is the distance to the kth nearest neighbor. The test instances are then sorted in decreasing order of outlier scores.

To evaluate a measure, we count the number of true outliers in the top p portion of the sorted test instances, where p = δn, 0 ≤ δ ≤ 1, and n is the number of actual outliers. Let o be the number of actual outliers in the top p predicted outliers. The accuracy of the algorithm is measured as o/p.

In this paper we present results for δ = 1. We have also experimented with other, lower values of δ, and the trends in relative performance are similar. We present these additional results in our extended work, available as a technical report [6].
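A sketch of this evaluation protocol (our own code, with hypothetical names; `sim` is any of the similarity measures above, turned into a distance by a monotone decreasing transform, which preserves the nearest-neighbor ranking; `labels` is 1 for a true outlier and 0 otherwise):

    def knn_outlier_scores(train, test, sim, k=10):
        """Outlier score of each test instance: distance to its k-th nearest
        training neighbor, with distance = 1/(1 + sim)."""
        scores = []
        for t in test:
            dists = sorted(1.0 / (1.0 + sim(t, r)) for r in train)
            scores.append(dists[k - 1])
        return scores

    def accuracy_at(scores, labels, p):
        """The paper's measure o/p: fraction of true outliers among the
        top-p ranked test instances."""
        ranked = sorted(zip(scores, labels), reverse=True)
        return sum(label for _, label in ranked[:p]) / p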


                               {f_k(X_k), f_k(Y_k)}
Measure      n_k          X_k = Y_k                         X_k ≠ Y_k

Overlap                   1                                 0
Eskin        ∝ n_k^2      1                                 0
IOF                       1                                 ∝ 1/(log f_k(X_k) · log f_k(Y_k))
OF                        1                                 ∝ log f_k(X_k) · log f_k(Y_k)
Lin                       ∝ 1/log f_k(X_k)                  ∝ 1/log(f_k(X_k) + f_k(Y_k))
Lin1                      ∝ 1/log f_k(X_k)                  ∝ 1/log |f_k(X_k) − f_k(Y_k)|
Goodall1                  ∝ (1 − f_k^2(X_k))                0
Goodall2                  ∝ f_k^2(X_k)                      0
Goodall3                  ∝ (1 − f_k^2(X_k))                0
Goodall4                  ∝ f_k^2(X_k)                      0
Smirnov                   ∝ 1/f_k(X_k)                      ∝ 1/(f_k(X_k) + f_k(Y_k))
Gambaryan                 maximum at f_k(X_k) = N/2         0
Burnaby                   1                                 ∝ 1/log f_k(X_k), ∝ 1/log f_k(Y_k)
Anderberg    ∝ 1/n_k      ∝ 1/f_k^2(X_k)                    ∝ f_k(X_k) f_k(Y_k)

Table 3: Relation between the per-attribute similarity S_k(X_k, Y_k) and {n_k, f_k(X_k), f_k(Y_k)}.

5.2 Experimental Results on Public Data Sets. Our experimental results verified our initial hypotheses about categorical similarity measures. As can be seen from Table 5, there are many situations where the Overlap measure does not give good performance. This is consistent with our intuition that the use of additional information would lead to better performance. In particular, we expected that since categorical data does not have an inherent ordering, data-driven measures would be able to take advantage of information present in the data set to make more accurate determinations of similarity between a pair of data instances.

We make some key observations about the results in Table 5:

1. No single measure is always superior or inferior. This is to be expected, since each data set has different characteristics.

2. The use of some measures gives consistently better performance on a large variety of data. The Lin, OF, and Goodall3 measures are among the best performers overall in terms of outlier detection performance. This is noteworthy since Lin and Goodall3 have been introduced for the first time in this paper.

3. Some pairs of measures exhibit complementary performance, i.e., one performs well where the other performs poorly, and vice versa. Example complementary pairs are (OF, IOF), (Lin, Lin1) and (Goodall3, Goodall4). This observation means that it may be possible to construct measures that draw on the strengths of two measures in order to obtain superior performance. This is an aspect of this work that needs to be pursued in the future.

4. The performance of an outlier detection algorithm is significantly affected by the similarity measure used (we refer the reader to our extended work [6] for a similar evaluation using a different outlier detection algorithm, LOF, which provides similar conclusions). For example, for the Cn data set, which has a very low classification accuracy for the outlier class, using OF still achieves close to 50% accuracy. We also note that for many of the data sets there is a relationship between decision tree performance (separability) and the performance of the measures. Specifically, for some of the data sets (e.g., Sk1, Tmr, Aud) with low separability there was high variance in the performance of the various measures.

5. The Eskin similarity measure weights attributes proportionally to the number of values taken by the attribute (n_k). For data sets in which the attributes take a large number of values (e.g., KD2, Sk1, Sk2), Eskin performs very poorly.

6. The Smirnov measure assigns similarity to both diagonal and off-diagonal entries in the per-attribute similarity matrix (Figure 1). But it still performs very poorly on most of the data sets. The other measures that operate similarly (Lin, Lin1 and Anderberg) perform better than Smirnov on almost every data set.


Data set    Cr1   Cr2   Irs   Cn    KD1   KD2   Sk1   Sk2   Msh   Sgm   Cen   Can   Hys   Lym   Nur   Tmr   TTT   Aud
Size        1663  1659  150   4155  2100  2100  3480  2606  4308  210   5041  277   132   148   6480  336   727   190
% Outls.    4     4     33    2     5     5     4     4     3     14    16    29    23    4     13    33    14    25
d           6     6     4     42    29    29    10    10    21    18    10    9     4     18    8     15    9     17
avg(n_k)    3.23  3.28  6.80  2.91  4.80  37.3  37.2  86    4.18  6.37  7.09  3.90  2.80  3.10  3.33  2.31  2.80  2.65
med(n_k)    3     3     8     3     4     26    9     9     3     6     7     3     3     3     3     2     3     2
f_k Uni.    6     6     0     3     0     0     2     2     2     1     0     1     1     2     8     1     0     0
f_k Gauss.  0     0     4     7     8     7     3     3     5     8     7     5     3     11    0     2     9     5
f_k Skwd.   0     0     0     32    22    21    5     5     14    9     3     3     0     5     0     12    0     5
Recall      0.75  0.91  0.90  0.91  0.88  0.89  0.00  0.83  1.00  1.00  0.96  0.24  0.63  0.74  0.62  0.11  0.63  0.23
Precision   0.82  0.85  0.98  0.91  0.99  0.89  0.00  0.83  1.00  1.00  0.94  0.76  1.00  0.73  0.66  0.10  0.76  0.26

Table 4: Description of Public Data Sets.

Measure   Cr1   Cr2   Irs   Cn    KD1   KD2   Sk1   Sk2   Msh   Sgm   Cen   Can   Hys   Lym   Nur   Tmr   TTT   Aud   Avg
ovrlp     0.91  0.91  0.78  0.03  0.77  0.77  0.41  0.12  1.00  0.93  0.07  0.30  0.60  0.59  0.00  0.29  1.00  0.40  0.55
eskn      0.00  0.00  0.66  0.00  0.51  0.00  0.00  0.06  1.00  0.07  0.00  0.47  0.60  0.56  0.46  0.29  1.00  0.38  0.34
iof       1.00  0.92  0.20  0.40  0.44  0.10  0.10  0.10  1.00  0.00  0.15  0.52  0.60  0.56  0.49  0.36  1.00  0.36  0.46
of        0.93  0.92  0.88  0.48  0.74  0.90  0.57  0.36  1.00  1.00  0.27  0.36  1.00  0.69  0.47  0.22  0.28  0.62  0.65
lin       0.91  0.91  0.86  0.16  0.78  0.85  0.66  0.21  1.00  0.93  0.21  0.49  1.00  0.70  0.00  0.40  1.00  0.55  0.65
lin1      0.13  0.49  0.94  0.11  0.88  0.66  0.75  0.40  1.00  1.00  0.25  0.49  0.60  0.69  0.31  0.37  0.25  0.49  0.55
good1     0.70  0.68  0.80  0.11  0.77  0.01  0.59  0.40  1.00  1.00  0.36  0.30  0.87  0.72  0.00  0.47  0.75  0.40  0.55
good2     0.91  0.91  0.78  0.44  0.61  0.01  0.64  0.64  1.00  1.00  0.29  0.49  0.60  0.64  0.32  0.26  0.90  0.55  0.61
good3     0.91  0.91  0.80  0.16  0.75  0.01  0.59  0.41  1.00  1.00  0.37  0.46  0.87  0.67  0.04  0.50  1.00  0.43  0.60
good4     0.70  0.98  0.56  0.02  0.52  0.79  0.05  0.08  0.63  0.07  0.18  0.52  0.60  0.56  0.52  0.29  0.60  0.49  0.45
smrnv     0.91  0.91  0.78  0.03  0.00  0.00  0.32  0.04  0.00  0.83  0.15  0.23  0.70  0.38  0.08  0.33  0.55  0.04  0.35
gmbrn     0.91  0.91  0.76  0.47  0.43  0.01  0.14  0.35  1.00  0.97  0.25  0.56  0.60  0.70  0.46  0.35  1.00  0.60  0.58
brnby     0.91  0.91  0.78  0.03  0.66  0.90  0.55  0.17  1.00  0.93  0.13  0.51  0.90  0.59  0.46  0.29  1.00  0.45  0.62
anbrg     0.93  0.92  0.90  0.07  0.52  0.46  0.46  0.07  1.00  0.93  0.26  0.33  1.00  0.64  0.05  0.30  0.70  0.26  0.54
Avg       0.77  0.81  0.75  0.18  0.60  0.39  0.42  0.24  0.90  0.76  0.21  0.43  0.75  0.62  0.26  0.34  0.79  0.43

Table 5: Experimental Results for kNN Algorithm for δ = 1.0.


6 Concluding Remarks and Future Work

Computing similarity between categorical attributes has been discussed in a variety of contexts. In this paper we have brought together several such measures and evaluated them in the context of outlier detection. We have also proposed several variants (Lin, Lin1, Goodall2, Goodall3, Goodall4) of existing similarity measures, some of which perform very well, as shown in our evaluation.

Given this set of similarity measures, the first question that comes to mind is: which similarity measure is best suited for my data mining task? Our experimental results suggest that there is no single best-performing similarity measure. Hence, one needs to understand how a similarity measure handles the different characteristics of a categorical data set, and this needs to be explored in future research.

We used outlier detection as the underlying data mining task for the comparative evaluation in this work. However, similar studies can be performed using classification or clustering as the underlying task. It will be useful to know whether the relative performance of these similarity measures remains the same for other data mining tasks.

In our evaluation methodology we have used one similarity measure across all attributes. Since different attributes in a data set can be of different natures, an alternative is to use different measures for different attributes. This appears especially promising given the complementary nature of several similarity measures.

7 Acknowledgements

We are grateful to the anonymous reviewers for their comments and suggestions, which improved this paper. We would also like to thank Gyorgy Simon for his helpful comments on an early draft of this paper. This work was supported by NSF Grant CNS-0551551, NSF ITR Grant ACI-0325949, NSF Grant IIS-0308264, and NSF Grant IIS-0713227. Access to computing facilities was provided by the University of Minnesota Digital Technology Center and Supercomputing Institute.

References

[1] A. Ahmad and L. Dey. A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recogn. Lett., 28(1):110–118, 2007.

[2] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, 1973.

[3] A. Asuncion and D. J. Newman. UCI machine learning repository. [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2007.

[4] Y. Biberman. A context similarity measure. In ECML '94: Proceedings of the European Conference on Machine Learning, pages 49–63. Springer, 1994.

[5] T. Burnaby. On a method for character weighting a similarity coefficient, employing the concept of information. Mathematical Geology, 2(1):25–38, 1970.

[6] V. Chandola, S. Boriah, and V. Kumar. Similarity measures for categorical data: a comparative study. Technical Report 07-022, Department of Computer Science & Engineering, University of Minnesota, October 2007.

[7] H. Cramer. The Elements of Probability Theory and Some of its Applications. John Wiley & Sons, New York, NY, 1946.

[8] G. Das and H. Mannila. Context-based similarity measures for categorical databases. In PKDD '00: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 201–210, London, UK, 2000. Springer-Verlag.

[9] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection. In D. Barbara and S. Jajodia, editors, Applications of Data Mining in Computer Security, pages 78–100. Kluwer Academic Publishers, Norwell, MA, 2002.

[10] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1029, San Francisco, CA, 1993. Morgan Kaufmann.

[11] P. Gambaryan. A mathematical model of taxonomy. Izvest. Akad. Nauk Armen. SSR, 17(12):47–53, 1964.

[12] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: clustering categorical data using summaries. In KDD '99: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 73–83, New York, NY, USA, 1999. ACM Press.

[13] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: an approach based on dynamical systems. The VLDB Journal, 8(3):222–236, 2000.

[14] D. W. Goodall. A new similarity index based on probability. Biometrics, 22(4):882–907, 1966.

[15] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345–366, 2000.

[16] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, New York, NY, 1975.

[17] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283–304, 1998.

[18] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988.

[19] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. In Document Retrieval Systems, volume 3 of Taylor Graham Series In Foundations Of Information Science, pages 132–142. Taylor Graham Publishing, London, UK, 1988. ISBN 0-947568-21-2.

[20] W. P. Jones and G. W. Furnas. Pictures of relevance: a geometric analysis of similarity measures. J. Am. Soc. Inf. Sci., 38(6):420–442, 1987.

[21] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York, NY, 1990.

[22] S. Q. Le and T. B. Ho. An association-based dissimilarity measure for categorical data. Pattern Recogn. Lett., 26(16):2549–2557, 2005.

[23] D. Lin. An information-theoretic definition of similarity. In ICML '98: Proceedings of the 15th International Conference on Machine Learning, pages 296–304, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[24] K. Maung. Measurement of association in a contingency table with special reference to the pigmentation of hair and eye colours of Scottish school children. Annals of Eugenics, 11:189–223, 1941.

[25] T. Noreault, M. McGill, and M. B. Koll. A performance evaluation of similarity measures, document term weighting schemes and representations in a boolean environment. In SIGIR '80: Proceedings of the 3rd Annual ACM Conference on Research and Development in Information Retrieval, pages 57–76, Kent, UK, 1981. Butterworth & Co.

[26] C. R. Palmer and C. Faloutsos. Electricity based external similarity of categorical attributes. In PAKDD '03: Proceedings of the 7th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 486–500. Springer, 2003.

[27] C. P. Pappis and N. I. Karacapilidis. A comparative assessment of measures of similarity of fuzzy values. Fuzzy Sets and Systems, 56(2):171–174, 1993.

[28] K. Pearson. On the general theory of multiple contingency with special reference to partial contingency. Biometrika, 11(3):145–158, 1916.

[29] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD '00: Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[30] SKAION Corporation. SKAION intrusion detection system evaluation data. [http://www.skaion.com/news/rel20031001.html].

[31] E. S. Smirnov. On exact methods in systematics. Systematic Zoology, 17(1):1–13, 1968.

[32] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman and Company, San Francisco, 1973.

[33] C. Stanfill and D. Waltz. Toward memory-based reasoning. Commun. ACM, 29(12):1213–1228, 1986.

[34] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, Boston, MA, 2005.

[35] X. Wang, B. De Baets, and E. Kerre. A comparative study of similarity measures. Fuzzy Sets and Systems, 73(2):259–268, 1995.

[36] D. R. Wilson and T. R. Martinez. Improved heterogeneous distance functions. J. Artif. Intell. Res. (JAIR), 6:1–34, 1997.

[37] R. Zwick, E. Carlstein, and D. V. Budescu. Measures of similarity among fuzzy concepts: A comparative analysis. International Journal of Approximate Reasoning, 1(2):221–242, 1987.