Data Mining, Spring 2007: Noisy Data and Data Discretization Using Entropy-Based and ChiMerge Methods


• Noisy data
• Data discretization using entropy-based and ChiMerge methods


Noisy Data

• Noise: random error; data present but not correct.
  – Data transmission errors
  – Data entry problems

• Removing noise:
  – Data smoothing (rounding, averaging within a window)
  – Clustering/merging and detecting outliers

• Data smoothing:
  – First sort the data and partition it into (equi-depth) bins.
  – Then smooth the values in each bin using smoothing by bin means, smoothing by bin medians, smoothing by bin boundaries, etc.


Noisy Data (Binning Methods)

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
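A minimal Python sketch of equi-depth binning with these smoothing options (the function name and parameters are illustrative, not from the slides; it assumes the number of values divides evenly into the bins):

```python
def smooth_by_bins(values, n_bins, method="means"):
    """Equi-depth binning followed by bin smoothing."""
    data = sorted(values)
    size = len(data) // n_bins  # assumes len(values) is divisible by n_bins
    bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]
    smoothed = []
    for b in bins:
        if method == "means":                    # replace by the (rounded) bin mean
            smoothed.append([round(sum(b) / len(b))] * len(b))
        elif method == "medians":                # replace by the bin median
            smoothed.append([b[len(b) // 2]] * len(b))
        else:                                    # snap each value to the nearer bin boundary
            lo, hi = b[0], b[-1]
            smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bins(prices, 3, "means"))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_bins(prices, 3, "boundaries"))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```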


Noisy Data (Clustering)

• Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.

• Values that fall outside of the set of clusters may be considered outliers.
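A toy illustration of the idea for one-dimensional data (the gap-based grouping and the `gap` parameter are illustrative assumptions; a real system would use a proper clustering algorithm such as k-means or DBSCAN):

```python
def outliers_by_clustering(values, gap=10.0):
    """Group sorted values into clusters wherever consecutive points lie
    within `gap` of each other; flag singleton clusters as outliers."""
    data = sorted(values)
    clusters, current = [], [data[0]]
    for v in data[1:]:
        if v - current[-1] <= gap:
            current.append(v)           # close enough: same cluster
        else:
            clusters.append(current)    # gap too large: start a new cluster
            current = [v]
    clusters.append(current)
    return [c[0] for c in clusters if len(c) == 1]

print(outliers_by_clustering([4, 8, 9, 15, 21, 21, 24, 25, 80]))  # [80]
```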


Data Discretization

• The task of attribute (feature) discretization techniques is to discretize the values of a continuous feature into a small number of intervals, where each interval is mapped to a discrete symbol.

• Advantages:
  – Simplified data description; easy-to-understand data and final data-mining results.
  – Only a small set of interesting rules is mined.
  – Processing time for producing end results is decreased.
  – Accuracy of end results is improved.


Effect of Continuous Data on Results Accuracy

Test data:

age     income   age   buys_computer
<=30    medium    9    ?
<=30    medium   11    ?
<=30    medium   13    ?

Training data:

age     income   age   buys_computer
<=30    medium    9    no
<=30    medium   10    no
<=30    medium   11    no
<=30    medium   12    no

Data mining on the training data discovers:

• If ‘age <= 30’ and income = ‘medium’ and age = ‘9’ then buys_computer = ‘no’

• If ‘age <= 30’ and income = ‘medium’ and age = ‘10’ then buys_computer = ‘no’

• If ‘age <= 30’ and income = ‘medium’ and age = ‘11’ then buys_computer = ‘no’

• If ‘age <= 30’ and income = ‘medium’ and age = ‘12’ then buys_computer = ‘no’

Discover only those rules whose support (frequency) is >= 1.

Because the value age = 13 never appears in the training dataset, no rule covers the third test record; the accuracy of prediction therefore decreases to 66.7% (2 of 3 records predicted correctly).


Entropy-Based Discretization

• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

  E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2), where Ent(S1) = −Σi pi log2(pi)

• Where pi is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1.
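These two definitions translate directly into code; a minimal sketch (the function names are illustrative):

```python
import math
from collections import Counter

def ent(labels):
    """Ent(S) = -sum_i p_i * log2(p_i), with p_i the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(values, labels, t):
    """E(S, T): entropy after splitting S at boundary t, weighted by partition size."""
    left  = [k for v, k in zip(values, labels) if v <= t]
    right = [k for v, k in zip(values, labels) if v > t]
    n = len(labels)
    return len(left) / n * ent(left) + len(right) / n * ent(right)
```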


Entropy-Based Discretization

• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

• The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., until the information gain Ent(S) − E(S, T) falls below a small threshold δ.


Example 1

ID      1    2    3    4    5    6    7    8    9
Age    21   22   24   25   27   27   27   35   41
Grade   F    F    P    F    P    P    P    P    P

• Let Grade be the class attribute. Use entropy-based discretization to divide the range of ages into different discrete intervals.

• There are 6 possible boundaries. They are 21.5, 23, 24.5, 26, 31, and 38.

• Let us consider the boundary at T = 21.5. Let S1 = {21} and S2 = {22, 24, 25, 27, 27, 27, 35, 41}.

  (Boundaries are midpoints of adjacent distinct values, e.g., (21+22)/2 = 21.5 and (22+24)/2 = 23.)


Example 1 (cont’)

• The number of elements in S1 and S2: |S1| = 1, |S2| = 8.

• The entropy of S1 is

  Ent(S1) = −P(Grade=F) log2 P(Grade=F) − P(Grade=P) log2 P(Grade=P)
          = −(1) log2(1) − (0) log2(0) = 0

• The entropy of S2 is

  Ent(S2) = −(2/8) log2(2/8) − (6/8) log2(6/8) ≈ 0.8113



Example 1 (cont’)

• Hence, the entropy after partitioning at T = 21.5 is

  E(S, 21.5) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)
             = (1/9)(0) + (8/9)(0.8113) ≈ 0.7212
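The same number falls out of the `split_entropy` helper sketched earlier, as a quick sanity check:

```python
ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "F", "P", "F", "P", "P", "P", "P", "P"]
print(split_entropy(ages, grades, 21.5))  # ~0.7212
```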


Example 1 (cont’)

• The entropies after partitioning are computed for all the boundaries:
  T = 21.5 : E(S, 21.5)
  T = 23   : E(S, 23)
  ...
  T = 38   : E(S, 38)

Select the boundary with the smallest entropy. Suppose the best is T = 23.

Now recursively apply entropy discretization to both partitions.
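A sketch of this recursive step, reusing the `ent` and `split_entropy` helpers from above (the `min_gain` stopping threshold is an illustrative stand-in for whatever stopping criterion is chosen):

```python
def entropy_discretize(values, labels, min_gain=0.01):
    """Return the cut points chosen by recursive entropy-based splitting."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    if not candidates:
        return []
    # Pick the boundary that minimizes E(S, T) over all candidates.
    t = min(candidates, key=lambda c: split_entropy(values, labels, c))
    if ent(labels) - split_entropy(values, labels, t) < min_gain:
        return []  # stopping criterion met: information gain too small
    left  = [(v, k) for v, k in zip(values, labels) if v <= t]
    right = [(v, k) for v, k in zip(values, labels) if v > t]
    return (entropy_discretize([v for v, _ in left],  [k for _, k in left],  min_gain)
            + [t]
            + entropy_discretize([v for v, _ in right], [k for _, k in right], min_gain))

print(entropy_discretize(ages, grades))  # cut points for the Age example
```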


ChiMerge (Kerber92)

• This discretization method uses a merging approach.
• ChiMerge's view:
  – First sort the data on the attribute being discretized.
  – List all possible boundaries/intervals. In the last example the interval boundaries were 0, 21.5, 23, 24.5, 26, 31, and 38.
  – For each pair of adjacent intervals, calculate the χ² test of class independence, e.g.:
    • {0, 21.5} and {21.5, 23}
    • {21.5, 23} and {23, 24.5}
    • ...
  – Pick the pair of adjacent intervals with the lowest χ² value and merge them.


ChiMerge -- The Algorithm

1. Compute the χ² value for each pair of adjacent intervals.

2. Merge the pair of adjacent intervals with the lowest χ² value.

3. Repeat steps 1 and 2 until the χ² values of all adjacent pairs exceed a threshold.


Chi-Square Test

oij = observed frequency of interval i for class j

eij = expected frequency (Ri * Cj) / N

χ² = Σ(i=1..r) Σ(j=1..c) (oij − eij)² / eij

        Class 1   Class 2   Σ
Int 1   o11       o12       R1
Int 2   o21       o22       R2
Σ       C1        C2        N
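Putting the test and the merging loop together, a self-contained sketch (the 0.1 floor for zero expected frequencies follows the convention used in the worked example below; the function names and the threshold default are illustrative):

```python
from collections import Counter

def chi2(counts_a, counts_b, classes, min_expected=0.1):
    """Chi-square statistic for two adjacent intervals; counts_* map
    class label -> observed frequency (the o_ij of the 2 x c table)."""
    rows = [counts_a, counts_b]
    row_tot = [sum(r.get(c, 0) for c in classes) for r in rows]
    col_tot = {c: sum(r.get(c, 0) for r in rows) for c in classes}
    n = sum(row_tot)
    stat = 0.0
    for i, r in enumerate(rows):
        for c in classes:
            e = max(row_tot[i] * col_tot[c] / n, min_expected)  # e_ij, floored at 0.1
            stat += (r.get(c, 0) - e) ** 2 / e
    return stat

def chimerge(values, labels, threshold=2.706):
    """Merge adjacent intervals until all adjacent chi-square values exceed
    the threshold (2.706 is the 90% quantile for d = 1, i.e. two classes)."""
    classes = set(labels)
    # Start with one interval per distinct value, holding its class counts.
    intervals = []
    for v, k in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][k] += 1
        else:
            intervals.append([v, Counter({k: 1})])
    while len(intervals) > 1:
        stats = [chi2(a[1], b[1], classes) for a, b in zip(intervals, intervals[1:])]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break                                  # all adjacent pairs differ enough
        intervals[i][1] += intervals[i + 1][1]     # merge interval i+1 into i
        del intervals[i + 1]
    return [iv[0] for iv in intervals]             # lowest value of each interval
```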


ChiMerge Example

Data set sample:

Sample    F    K
  1       1    1
  2       3    2
  3       7    1
  4       8    1
  5       9    1
  6      11    2
  7      23    2
  8      37    1
  9      39    2
 10      45    1
 11      46    1
 12      59    1

• Interval points for feature F are: 0, 2, 5, 7.5, 8.5, 10, etc.


ChiMerge Example (cont.)

χ² was minimum for the intervals [7.5, 8.5] and [8.5, 10].

                      K=1     K=2
Interval [7.5, 8.5]   A11=1   A12=0   R1=1
Interval [8.5, 10]    A21=1   A22=0   R2=1
                      C1=2    C2=0    N=2

Based on the table's values, we can calculate the expected values: E11 = 2/2 = 1, E12 = 0/2 = 0 ≈ 0.1, E21 = 2/2 = 1, and E22 = 0/2 = 0 ≈ 0.1 (a zero expected frequency is replaced by the small value 0.1 to avoid division by zero),

and the corresponding χ² test:

χ² = (1 − 1)²/1 + (0 − 0.1)²/0.1 + (1 − 1)²/1 + (0 − 0.1)²/0.1 = 0.2

For degrees of freedom d = 1, χ² = 0.2 < 2.706 (MERGE!)



ChiMerge Example (cont.)

Additional iterations:

                     K=1     K=2
Interval [0, 7.5]    A11=2   A12=1   R1=3
Interval [7.5, 10]   A21=2   A22=0   R2=2
                     C1=4    C2=1    N=5

E11 = 12/5 = 2.4, E12 = 3/5 = 0.6, E21 = 8/5 = 1.6, and E22 = 2/5 = 0.4

χ² = (2 − 2.4)²/2.4 + (1 − 0.6)²/0.6 + (2 − 1.6)²/1.6 + (0 − 0.4)²/0.4 = 0.834

For degrees of freedom d = 1, χ² = 0.834 < 2.706 (MERGE!)


ChiMerge Example (cont.)

                        K=1     K=2
Interval [0, 10.0]      A11=4   A12=1   R1=5
Interval [10.0, 42.0]   A21=1   A22=3   R2=4
                        C1=5    C2=4    N=9

E11 = 2.78, E12 = 2.22, E21 = 2.22, and E22 = 1.78

and χ² = 2.72 > 2.706 (NO MERGE!)

Final discretization: [0, 10], [10, 42], and [42, 60]
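Running the sketch above on the sample data should reproduce three groups corresponding to these intervals (the printed values are the lowest F value in each surviving group, not the slide's interval endpoints; the exact merge order may differ from the slide's trace):

```python
F = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
K = [1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1]
print(chimerge(F, K, threshold=2.706))  # expected: [1, 11, 45]
```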


References

– Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2000 (Chapter 3).

– Mehmed Kantardzic, "Data Mining: Concepts, Models, Methods, and Algorithms", John Wiley & Sons, 2003 (Chapter 3).

– Randy Kerber, "ChiMerge: Discretization of Numeric Attributes", Proceedings of the 10th National Conference on Artificial Intelligence (AAAI-92), 1992.