
Mining Rank-Correlated Sets Of Numerical Attributes


Page 1: Mining Rank-Correlated Sets Of Numerical Attributes


Page 2: Index

Introduction

Need for finding new measures for mining numerical datasets
– Problems with previous association-rule approaches based on discretization
– Discretization example and its disadvantages

Proposed methods and concepts

Rank and rank correlation measures
– What are ranks and rank correlation?
– Introducing new rank correlation measures
– Observations
– Defining the support for the new measures
– Example dataset
– Calculating the rank correlation measure values on the example dataset
– Properties of the new rank correlation measures

Categorical attributes combined with numerical attributes
– Explaining the new measures on categorical and mixed data

Defining confidence and association rules with the new support measures
– Example and explanation

Finding ranks with attribute-value ties in the dataset
– Defining the new support measures with average rank values

Algorithms for finding frequent itemsets using the new rank-based support measures
– Explicit-pair method for converting the dataset into a standard transaction database
– Algorithms using spatial indices
– Search traversal techniques: range trees and kd-trees

Experimental evaluation
– Results of applying the new support measures to meteorological data
– Performance evaluation of the newly proposed support measures


Page 3: Introduction


The motivation for the research reported upon in this paper is an application where we want to mine frequently occurring patterns in a meteorological dataset containing measurements from various weather stations in Belgium over the past few years. Each record contains a set of measurements (such as temperature or pressure) taken at a given station at a given time point, together with extra information about the stations (such as location or altitude).

The classical association rule framework [1], however, is not adequate for dealing with numerical data directly. Most previous approaches to association rule mining for numerical attributes were based on discretization.

For mining rank-correlated sets of numerical attributes, three new support measures are proposed. Their application to a real-life meteorological dataset and the corresponding performance evaluations are also presented.

Page 4: Discretization and Concept Hierarchy


Discretization
• reduces the number of values of a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then replace the actual data values.

Concept hierarchies
• reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute Age) by higher-level concepts (such as young, middle-aged, or senior).

E.g., suppose an attribute Age has the values (22, 23, 24, 35, 36, 37, 55, 56, 57). Applying discretization and breaking this into intervals, we get

Age (Young (20…35), MiddleAge (35…55), Old (55…80))

Here,
Young contains (22, 23, 24)
MiddleAge contains (35, 36, 37)
Old contains (55, 56, 57)
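A tiny sketch of this binning step (the boundaries are treated as half-open here to avoid the overlap at 35 and 55 in the labels above):

```python
def discretize_age(age):
    """Map a raw age onto the interval labels from the example."""
    if 20 <= age < 35:
        return "Young"
    if 35 <= age < 55:
        return "MiddleAge"
    if 55 <= age <= 80:
        return "Old"
    return None  # outside the defined concept hierarchy

ages = [22, 23, 24, 35, 36, 37, 55, 56, 57]
print([discretize_age(a) for a in ages])
# ['Young', 'Young', 'Young', 'MiddleAge', 'MiddleAge', 'MiddleAge', 'Old', 'Old', 'Old']
```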

Page 5: Discretization Example


Table: People

RecordID   Age   Married   NumCars
100        23    No        1
200        25    Yes       1
300        29    No        0
400        34    Yes       2
500        38    Yes       2

Rule                                               Support   Confidence
(Age: 30…39) and (Married: Yes) => (NumCars: 2)    40 %      100 %
(NumCars: 0…1) => (Married: No)                    40 %      66.7 %

Minimum support = 40 %, minimum confidence = 50 %
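A quick sketch that reproduces these numbers from the People table (rule support here counts records matching both sides of the rule):

```python
people = [
    {"RecordID": 100, "Age": 23, "Married": "No",  "NumCars": 1},
    {"RecordID": 200, "Age": 25, "Married": "Yes", "NumCars": 1},
    {"RecordID": 300, "Age": 29, "Married": "No",  "NumCars": 0},
    {"RecordID": 400, "Age": 34, "Married": "Yes", "NumCars": 2},
    {"RecordID": 500, "Age": 38, "Married": "Yes", "NumCars": 2},
]

def metrics(antecedent, consequent):
    """Support = records matching both sides / all records;
    confidence = records matching both sides / records matching the antecedent."""
    ante = [r for r in people if antecedent(r)]
    both = [r for r in ante if consequent(r)]
    return len(both) / len(people), len(both) / len(ante)

# (Age: 30..39) and (Married: Yes) => (NumCars: 2)
print(metrics(lambda r: 30 <= r["Age"] <= 39 and r["Married"] == "Yes",
              lambda r: r["NumCars"] == 2))     # (0.4, 1.0)

# (NumCars: 0..1) => (Married: No)
print(metrics(lambda r: r["NumCars"] in (0, 1),
              lambda r: r["Married"] == "No"))  # (0.4, 0.666...)
```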

Page 6: Drawbacks of Discretization

This simple discretization approach to a quantitative database has two major drawbacks:

“MinSup”: if the number of intervals for a quantitative attribute is large, the support of any single interval can be low. Hence, without using larger intervals, some rules involving this attribute may not be found because they lack minimum support.

“MinConf”: some information is lost whenever we partition values into intervals. Some rules may have minimum confidence only when an item in the antecedent consists of a single value (or a small interval). This information loss increases as the intervals become larger.

For example, in the table on the previous slide, the rule (NumCars: 0) => (Married: No) has 100 % confidence and 20 % support.

But if we had partitioned the attribute NumCars into intervals such that 0 and 1 cars end up in the same partition, the closest rule is (NumCars: 0…1) => (Married: No), which has only 66.7 % confidence and 40 % support.

Conclusion: as the interval size increases, confidence decreases and support increases, and vice versa. This is a “catch-22”: if the intervals are too large, some rules may not reach minimum confidence; if they are too small, some rules may not reach minimum support.

Page 7: Disadvantages of Discretization


The disadvantages of discretization can be listed as follows:

• It always incurs an information loss, since values falling in the same bucket become indistinguishable and small differences in attribute values become unnoticeable.

• On the other hand, very small changes in values close to a discretization border may cause unjustifiably large changes in the set of active rules.

• If there are too many discretization intervals, discovered rules are replicated in each interval, making the overall trends hard to spot.

• It is also possible that rules fail to meet the minimum support criterion when their support is split among many narrow intervals.

Page 8: What Are Ranks and Rank Correlation Measures?

Rank:

Let D be the database and H its header, i.e., its set of attributes. The rank r(t.A) of the value of a numerical attribute A in tuple t is the position of t.A when the values of A in the database are sorted ascending: the smallest value of A gets rank 1, and so on. Consider the following table:

Person   Height   Weight
A        5.5      53
B        5.7      55
C        5.8      45
D        5.9      50
E        6.0      58
F        6.1      65
G        6.2      68
H        6.3      60

Person           A   B   C   D   E   F   G   H
Rank by height   1   2   3   4   5   6   7   8
Rank by weight   3   4   1   2   5   7   8   6

Sample table Person (top) and the same table represented with rank values (bottom).
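A minimal sketch of this rank transformation, assuming all values are distinct as in the table:

```python
heights = {"A": 5.5, "B": 5.7, "C": 5.8, "D": 5.9,
           "E": 6.0, "F": 6.1, "G": 6.2, "H": 6.3}
weights = {"A": 53, "B": 55, "C": 45, "D": 50,
           "E": 58, "F": 65, "G": 68, "H": 60}

def ranks(values):
    """Rank 1 goes to the smallest value, rank 2 to the next, and so on."""
    order = sorted(values, key=values.get)
    return {person: i for i, person in enumerate(order, start=1)}

print(ranks(heights))  # {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'H': 8}
print(ranks(weights))  # {'C': 1, 'D': 2, 'A': 3, 'B': 4, 'E': 5, 'H': 6, 'F': 7, 'G': 8}
```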


Page 9: Rank Correlation

“Rank correlation” is the study of relationships between different rankings of the same set of items. It deals with measuring the correspondence between two rankings and assessing the significance of this correspondence. The computation of the rank correlation coefficients is described in the next section.

Assumption: for most of this paper it is assumed that all values of the numerical attributes are distinct.


Page 10: Rank Correlation Measures

Kendall’s τ

Spearman’s footrule F

Spearman’s ρ
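The coefficient formulas themselves appeared as images on the slide. With $n = |D|$ and $d_t = r(t.A) - r(t.B)$, the standard textbook definitions, which match the observations below, are:

$$\rho(A,B) = 1 - \frac{6\sum_{t \in D} d_t^2}{n(n^2-1)} \qquad F(A,B) = 1 - \frac{3\sum_{t \in D} |d_t|}{n^2-1} \qquad \tau(A,B) = \frac{\#\{\text{concordant pairs}\} - \#\{\text{discordant pairs}\}}{n(n-1)/2}$$

(With this common scaling, F equals 1 for identical rankings and has expectation 0 for independent ones; note that its minimum is −1/2 rather than −1.)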

1. Observations for ρ and F: when two attributes are highly positively correlated, the ranks of their values will be close within most tuples. Hence the sums $\sum_{t \in D} (r(t.A) - r(t.B))^2$ and $\sum_{t \in D} |r(t.A) - r(t.B)|$ will be small if A and B are correlated. On the other hand, if A and B are uncorrelated, the expected absolute difference between the ranks of A and B is $(|D|^2-1)/(3|D|)$, roughly $|D|/3$, resulting in very high values for these sums. The formulas are normalized so that they equal 1 for total positive correlation (e.g., A = B) and reach their minimum for total negative correlation (e.g., A and B in reverse order).

2. Observations for τ: the formula is based on the probability that, for a randomly chosen pair of tuples (s, t), s.A < t.A and s.B < t.B. In this case too, the formula is scaled to be 1 for total positive correlation and −1 for total negative correlation. Due to this direct connection to a probability, this measure is more natural than the other two.


Page 11: Measuring the Support for Numerical Attributes

Using the above coefficients, the following support measures are proposed for a set I of numerical attributes.

supp_ρ and supp_F: aggregate, over all tuples, the difference between the maximum and minimum rank values within the tuple.

supp_τ: over all pairs of tuples (s, t), count the pairs such that s comes before t on all attributes of I; this count is then divided by the maximal possible number of such pairs.
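The formal definitions appeared as images on the slide. A plausible reconstruction, chosen to be consistent with the pairwise coefficients on the previous page (the exact normalizations are an assumption, not verified against the paper):

$$\mathrm{supp}_F(I) = 1 - \frac{3}{|D|^2-1}\sum_{t\in D}\Big(\max_{A\in I} r(t.A) - \min_{A\in I} r(t.A)\Big)$$

$$\mathrm{supp}_\rho(I) = 1 - \frac{6}{|D|\,(|D|^2-1)}\sum_{t\in D}\Big(\max_{A\in I} r(t.A) - \min_{A\in I} r(t.A)\Big)^2$$

$$\mathrm{supp}_\tau(I) = \frac{\big|\{(s,t) : s.A < t.A \text{ for all } A\in I\}\big|}{\binom{|D|}{2}}$$

Under this reconstruction, for a pair I = {A, B} the first two coincide with F(A, B) and ρ(A, B), and supp_τ = (τ(A, B) + 1)/2.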


Page 12: Example Weather Dataset

Consider the following example weather dataset of five tuples t1, …, t5, where W = wind speed, P = pressure, T = temperature, and H = humidity, and the numbers in brackets indicate the rank of each value within its attribute. (The data table itself appeared as a figure on the slide.)

In order to count supp_τ({T, P, H}), we need to count the number of pairs of tuples such that the first tuple is smaller than the second tuple in T, P, and H. E.g., (t5, t1) is such a pair, and (t2, t3) is not. As there are 6 pairs of tuples that are in order, (t5, t3), (t5, t2), (t5, t1), (t4, t3), (t4, t2), and (t4, t1), supp_τ is 6/10.

• supp_τ is very expensive to compute.

Page 13: Properties


1. Average supp_τ: let A1, A2, …, Ak be k statistically independent numerical attributes. Then the expected (average) value of supp_τ({A1, …, Ak}) is given by the formula below.

2. supp_τ(I) is never greater than supp_F(I), the support computed with Spearman’s footrule.

3. The property “the support of a superset can never be greater than that of a subset” also holds for these support measures. For all I ⊆ J:

0 ≤ supp_m(J) ≤ supp_m(I) ≤ 1

These properties can be used to obtain an upper bound for supp_τ that, in contrast to supp_τ itself, is easy to compute.

Page 14: Categorical Attributes

The above definitions of support also apply to ordinal attributes, as their domains are ordered.

Integrating categorical attributes with numerical attributes:

Categorical attributes are used to define the context in which the support of a set of numerical attributes is computed.

Example: suppose we have attributes related to weather data. Then a sample rule,

“In measurements of Antwerp and Brussels, the temperature and wind speed in Antwerp are higher than in Brussels in 70 % of the cases,”

can be written in the pair-based notation sketched below.
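The formal rule appeared as an image on the slide. A plausible rendering of its meaning, using hypothetical attribute names City, T (temperature), and W (wind speed):

$$\frac{\big|\{(t_1,t_2) : t_1.\mathrm{City}=\mathrm{Brussels},\ t_2.\mathrm{City}=\mathrm{Antwerp},\ t_1.T < t_2.T,\ t_1.W < t_2.W\}\big|}{\big|\{(t_1,t_2) : t_1.\mathrm{City}=\mathrm{Brussels},\ t_2.\mathrm{City}=\mathrm{Antwerp}\}\big|} = 0.70$$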


Page 15: Association Rules

Having defined support in the presence of categorical attributes, we can easily define the confidence of a rule.

The interpretation of such an association rule with confidence c is: for all pairs of tuples (t1, t2) with t1 satisfying J1 and t2 satisfying J2, if t1 is smaller than t2 on I1, then there is a chance with confidence c that t1 is also smaller than t2 on I2.
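The confidence formula itself appeared as an image. By analogy with classical association rules, and consistent with the interpretation above, a natural reconstruction is

$$\mathrm{conf}\big((J_1, J_2): I_1 \Rightarrow I_2\big) = \frac{\mathrm{supp}(I_1 \cup I_2)}{\mathrm{supp}(I_1)},$$

with both supports computed over pairs restricted to the contexts J1 and J2.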

Example: consider the rule

“In Antwerp, if the wind speed in one measurement is greater than in another measurement, then there is a probability of 65 % that the cloudiness will be higher as well.”


Page 16: Measuring Ranks with Ties

At the start of this presentation we assumed that all numerical values of an attribute are distinct. That won’t be the case in real-life data.

This happens with ordinal data too, e.g., an attribute Cloudiness that can take the values (low, medium, high). For this attribute, at least one third of the pairs will be tied.

Modifying the support measures:

supp_τ: we can just keep our original definition; that is, if two tuples have a tie on an attribute of I, the pair does not contribute to supp_τ(I).

supp_ρ and supp_F: suppose a value t.A is repeated m times in attribute A, and there are p tuples with an A-value smaller than t.A. Then the rank of the value t.A can be assigned in any of the following ways:

1. Any value in the range p + 1, …, p + m.
2. The average rank ar(t.A) = ((p + 1) + (p + m)) / 2 = p + (m + 1)/2.
3. The maximal rank mr(t.A) = p + m.

Assigning the average rank to tied values is the most natural choice, since the sum of all ranks then stays the same as in the tie-free case. For the minimal and maximal rank, this sum respectively decreases and increases, hence introducing a bias.
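A minimal sketch of the average-rank assignment (it matches, e.g., scipy.stats.rankdata with method='average'):

```python
from collections import Counter

def average_ranks(values):
    """For a value repeated m times with p smaller values in the column,
    every copy gets rank p + (m + 1) / 2."""
    counts = Counter(values)
    smaller = {}
    p = 0
    for v in sorted(counts):
        smaller[v] = p      # number of strictly smaller values
        p += counts[v]
    return [smaller[v] + (counts[v] + 1) / 2 for v in values]

print(average_ranks([3, 1, 4, 1, 5]))  # [3.0, 1.5, 4.0, 1.5, 5.0]
```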


Page 17: supp_ρ and supp_F for Rankings with Ties


The definition of supp_τ does not change; supp_ρ(I) and supp_F(I) are redefined using the average ranks.

For rankings without ties, these modified formulas reduce to the original ones.

Page 18: Algorithms

1. As only ranks are compared instead of the actual values in the database, we first transform the database by replacing every attribute value by its rank w.r.t. its attribute, which can be done in O(|H| |D| log |D|) time.

2. Then all sets of attributes satisfying the minimum support threshold are generated.

3. In the case of supp_ρ or supp_F, this can be done by the standard level-wise search, in which candidate itemsets are generated only when all of their subsets are known to be frequent. Counting the supports requires only a single scan through the data in every iteration (see the sketch below).

4. In the case that supp_τ is used as the support measure, the mining task becomes inherently more difficult, since the number of pairs of transactions is quadratic in the size of the database.


Page 19: Explicit Creation of Pairs of Transactions

A brute-force solution to the presented mining problem is to explicitly combine all pairs of transactions from the original database D into a Boolean database D', in which an attribute A holds for the pair (s, t) exactly when s.A < t.A. With this new database we can apply the general methods for finding frequent itemsets, so the problem is reduced to standard frequent itemset mining on the new database D'.
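A sketch of this construction (illustrative only; it materializes both orientations of every pair, i.e. |D|·(|D|−1) records, whereas the record count reported in the experiments suggests the paper's implementation stores one record per unordered pair):

```python
from itertools import permutations

def explicit_pairs(ranked):
    """Boolean pair database D': one record per ordered pair (s, t),
    containing attribute A iff s is ranked below t on A."""
    attrs = sorted(ranked)
    n = len(next(iter(ranked.values())))
    return [{A for A in attrs if ranked[A][s] < ranked[A][t]}
            for s, t in permutations(range(n), 2)]

def supp_tau(pairs, itemset, n):
    """supp_tau(I): pairs ordered consistently on all of I, over n*(n-1)/2,
    since at most one orientation of each pair can contain all of I."""
    hits = sum(1 for rec in pairs if set(itemset) <= rec)
    return hits / (n * (n - 1) / 2)
```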

• Disadvantages:

1. The size of the new database D' is quadratic in the size of the original database.
2. The time needed to create it.
3. Consequently, it can be applied only to small databases.

• The advantage: it is usually very efficient in combination with depth-first frequent itemset mining algorithms, because all pairs that do not support an itemset are removed from its branch. Such a fine level of pruning is not possible in the other approaches.


Page 20: Using Spatial Indices

Spatial indices: these are data structures especially designed for searching multidimensional data, allowing searches in several dimensions simultaneously.

When the database is too large and the brute-force method is not applicable, we need an efficient mechanism to count supp_τ for all candidate itemsets. For a given attribute set I, this comes down to the nested loop sketched below.
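The nested loop itself was shown as an image; a Python rendering of its assumed shape, with a comment marking the inner loop that a spatial index is meant to replace:

```python
def supp_tau_nested(ranked, itemset):
    """Naive supp_tau: for every tuple t, count the tuples s that are
    ranked below t on every attribute of the candidate set I."""
    n = len(next(iter(ranked.values())))
    count = 0
    for t in range(n):
        for s in range(n):  # <- the inner loop a kd-tree/range-tree query replaces:
            if s != t and all(ranked[A][s] < ranked[A][t] for A in itemset):
                count += 1  #    "how many points are smaller than t in all dimensions?"
    return count / (n * (n - 1) / 2)
```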

Disadvantage: the double loop is not feasible, as it performs a quadratic number of operations for each candidate set. If we replace the inner loop by a search structure that finds the tuples smaller on all attributes of I, this becomes an efficient technique.


Page 21: Optimization Using the Support Property of supp_τ and supp_F

Recall the property of the support measures supp_τ and supp_F: supp_τ(I) ≤ supp_F(I).

As supp_F can be computed a lot more efficiently than supp_τ, we first compute it for every candidate itemset. If it gives a value under the minimum support threshold, we can already prune the itemset from the search space without computing the more expensive supp_τ.

After generating a candidate itemset, the next task is thus to prune the itemsets that do not satisfy the support constraints.

Generating all candidate itemsets in “reverse depth-first” order gives the best result, because it guarantees that all subsets of a given candidate itemset are generated before the candidate itself.

E.g., the subsets of {a, b, c, d} are generated in the following order:

{d}, {c}, {c, d}, {b}, {b, d}, {b, c}, {b, c, d}, {a}, {a, d}, {a, c}, {a, c, d}, {a, b}, {a, b, d}, {a, b, c}, {a, b, c, d}

Kd-trees and range trees are the two search structures used for this process.
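A small illustrative generator (not from the paper) that reproduces exactly this order:

```python
def reverse_depth_first(items):
    """Yield all non-empty subsets of `items` so that every subset of a set
    is yielded before the set itself."""
    generated = []  # subsets built from the items processed so far
    for x in reversed(items):
        new = [[x]] + [[x] + s for s in generated]
        for s in new:
            yield s
        generated += new

for subset in reverse_depth_first(['a', 'b', 'c', 'd']):
    print(subset)   # ['d'], ['c'], ['c', 'd'], ['b'], ... , ['a', 'b', 'c', 'd']
```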


Page 22: Experimental Evaluation

Rules found in the weather data:

The rules involve the altitude of the sun (solar alt), the amount of precipitation (precip), the amount of cloudiness (cloud), and the wind speed (w speed).

These are association rules found in the weather data of one station in Brussels.

• The first rule simply claims that in 66 % of the cases, the temperature is higher if the altitude of the sun is higher. The second rule indicates that if it is raining harder, then there is a probability of 64 % that the cloudiness is higher as well. Intuitively, one would expect these confidences to be higher.

• The last three rules indicate that higher cloudiness or higher wind speed positively influences the amount of rain, and apparently, with both higher cloudiness and wind speed this effect becomes even stronger. Indeed, for two randomly selected tuples, the probability of having a higher precipitation is only 32 %.

• When, however, we know that the cloudiness in the first tuple is higher, or the wind speed in the first tuple is higher, or both, the probability of the first tuple having a higher precipitation increases to 48 %, 44 %, and 57 %, respectively.


Page 23: Performance Evaluation of supp_τ

Data and resources used:
1. A real-life meteorological dataset and a number of benchmark datasets.
2. All algorithms were implemented in the C programming language.
3. Tests were run on a 2.2 GHz Opteron system with 2 GB RAM.
4. The evaluation focuses on supp_τ more than on the other two measures.

Observations:
1. The explicit-pairs implementation ran out of memory at 1,000 records.
2. Kd-trees incur almost no memory overhead and fit in main memory.
3. Range trees use a lot of memory; they scaled to hundreds of thousands of records, depending on the minsup parameter.

Conclusions:
1. Overall, the range-tree method is the most efficient as long as the minimal support is not too low; for very low support it generates a lot of itemsets and runs out of memory.
2. For 1 % minimum support, kd-trees achieve the best performance, because they incur almost no memory overhead.
3. The explicit-pairs method cannot handle 10,000 records at all, because the 49,995,000 generated pair records do not fit into memory.


Page 24: Performance of supp_ρ and supp_F

Mining patterns with supp_ρ and supp_F is much more efficient than with supp_τ. But both methods are very sensitive to the chosen minimum support level; e.g., for supp_ρ the minimum support must be very low, on the order of tenths of millionths of a percent. This shows that the theoretical maximum value is rarely achieved.

Conclusions:

1. Three new support measures for continuous attributes, based on rank correlation methods, are introduced.

2. The discretization previously required for numerical data is no longer needed.

3. Algorithms for mining frequent itemsets under these definitions are presented.

4. Test results on real-world meteorological data are presented, and the efficiency of the algorithms has been verified on large databases.


Page 25: Questions
