Exploring the Noise Resilience of the Combined Sturges Algorithm
Akrita Agarwal
Advisor: Dr. Anca Ralescu
November 7, 2015
Motivation
Why study Noise?

Real-world datasets are noisy: recordings under normal environmental conditions, equipment measurement error.
Most algorithms ignore noise, and little research has been done on it.

Aim: explore the robustness of classification algorithms to noise.

Which algorithm is least affected by noisy datasets?
Classification
Classification: assigning a new observation to one of a set of known categories.

Companies store large amounts of data.

An effective classifier can assist in making good predictions and informed business decisions.

E.g., whether to recommend Prime products to non-Prime customers, based on their behavior.
Classification Algorithms
Two broad kinds of classifiers are:

Frequency-based classifiers use the frequency of data points in the dataset to determine the class membership of a given test point.
Geometry-based classifiers leverage geometric aspects of the dataset, such as distance.
The Naive Bayes Classifier
Frequency-based classifier.
Computes the probability that a test data point belongs to each class.
Class probabilities are extracted from the training data.
Pros
Intuitive to understand and build.
Easily trained, even with a small dataset.
Fast.
Cons
Assumes conditional independence of the attributes given the class.
Ignores the underlying geometry of the data.
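For reference, the standard Naive Bayes decision rule implied by the description above (with m attributes x_1, ..., x_m; this is the textbook formulation, not spelled out on the slide):

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{j=1}^{m} P(x_j \mid c)
```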
The k Nearest Neighbors Classifier
Geometry-based classifier.
Assigns a class to the test data point by majority vote among its k nearest points.
Pros
Easy to implement and understand.
Classes don't have to be linearly separable.
Cons
Tends to ignore the relative importance of attributes; uses all of them equally.
Only indirectly takes the frequency of the data into account.
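A minimal sketch of the standard kNN rule described above (not the thesis implementation), using Euclidean distance and a small hypothetical training list:

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Majority vote among the k training points nearest (Euclidean) to test_point."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], test_point))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [((3, 2), 1), ((1, 2), 1), ((4, 2), 0), ((3, 2), 1)]
print(knn_predict(train, (3, 4), k=3))  # -> 1
```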
Combined Sturges Classifier
The Combined Sturges (CS) Classifier

Explicitly uses geometry + frequency.
Data are represented as frequency distributions per class.
A classification score is computed for each class.
The test point is assigned to the class with the highest score.

Continuous data values are binned.

No. of bins = ⌈1 + log₂ n⌉ (Sturges, 1926, "The Choice of a Class Interval")
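A small sketch of the bin-count rule, reading the brackets as the ceiling function:

```python
import math

def sturges_bins(n):
    """Number of bins by Sturges' rule: ceil(1 + log2(n))."""
    return math.ceil(1 + math.log2(n))

print(sturges_bins(150))  # 9 bins for a 150-sample dataset such as Iris
```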
Dummy Dataset
Table: Dummy Dataset

A1   A2   Class
 3    2     1
 1    2     1
 4    2     0
 3    2     1
 1    1     0
 2    2     1
 3    3     0
 4    1     0

Table: Frequency Distribution on Classes 0 & 1

Class 0:
A1   f(A1)    A2   f(A2)
 1   0.25      1   0.50
 3   0.25      2   0.25
 4   0.50      3   0.25

Class 1:
A1   f(A1)    A2   f(A2)
 1   0.25      2   0.75
 2   0.25      3   0.25
 3   0.50
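A minimal sketch (not the thesis code) of deriving such per-class, per-attribute frequency tables from the dummy dataset:

```python
from collections import Counter

data = [(3, 2, 1), (1, 2, 1), (4, 2, 0), (3, 2, 1),
        (1, 1, 0), (2, 2, 1), (3, 3, 0), (4, 1, 0)]

freq = {}  # freq[class][attribute_index] -> {value: relative frequency}
for c in {row[-1] for row in data}:
    rows = [row for row in data if row[-1] == c]
    freq[c] = [
        {v: cnt / len(rows) for v, cnt in Counter(col).items()}
        for col in zip(*[r[:-1] for r in rows])
    ]

print(freq[0][0])  # {4: 0.5, 1: 0.25, 3: 0.25} -> f(A1) on class 0
```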
Test Point: T1 = (3, 4)
1. Geometric Criterion

Test Point: T1 = (3, 4)
Classification criterion: Geometric (for each attribute, pick the class value at minimum distance from the test value)
Classification score: Highest posterior probability
Table: Nearest distance of T1 to Classes
Class 0:
A1   f(A1)    A2   f(A2)
 1   0.25      1   0.50
 3   0.25      2   0.25
 4   0.50      3   0.25

Class 1:
A1   f(A1)    A2   f(A2)
 1   0.25      2   0.75
 2   0.25      3   0.25
 3   0.50
Classification Score, S(c), c ∈ {0, 1}

S(0)
A1: P(Class 0) × f(A1) = 0.5 × 0.25
A2: P(Class 0) × f(A2) = 0.5 × 0.25
S(0) = average(0.5 × 0.25, 0.5 × 0.25) = 0.125

S(1)
A1: P(Class 1) × f(A1) = 0.5 × 0.50
A2: P(Class 1) × f(A2) = 0.5 × 0.25
S(1) = average(0.5 × 0.50, 0.5 × 0.25) = 0.1875

S(0) < S(1) ⇒ Class 1
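A minimal Python sketch of this worked example (not the thesis code; the frequency tables and the uniform priors P(Class 0) = P(Class 1) = 0.5 come from the slides above):

```python
# Class-conditional frequency tables and test point from the slides above.
freq = {0: [{1: 0.25, 3: 0.25, 4: 0.50}, {1: 0.50, 2: 0.25, 3: 0.25}],
        1: [{1: 0.25, 2: 0.25, 3: 0.50}, {2: 0.75, 3: 0.25}]}
prior = {0: 0.5, 1: 0.5}
t1 = (3, 4)

def geometric_score(c):
    """Average of prior * frequency of the class value nearest to each test value."""
    terms = []
    for value, table in zip(t1, freq[c]):
        nearest = min(table, key=lambda v: abs(v - value))  # minimum-distance value
        terms.append(prior[c] * table[nearest])
    return sum(terms) / len(terms)

print(geometric_score(0), geometric_score(1))  # 0.125, 0.1875 -> Class 1
```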
2. Statistical Criterion

Test Point: T1 = (3, 4)
Classification criterion: Statistical (for each attribute, pick the class value with maximum frequency)
Classification score: Minimum distance
Table: Maximum Frequency in Classes
Class 0:
A1   f(A1)    A2   f(A2)
 1   0.25      1   0.50
 3   0.25      2   0.25
 4   0.50      3   0.25

Class 1:
A1   f(A1)    A2   f(A2)
 1   0.25      2   0.75
 2   0.25      3   0.25
 3   0.50
Classification Score

S(0)
A1: (4 − 3) = 1
A2: (4 − 1) = 3
S(0) = average(1, 3) = 2

S(1)
A1: (3 − 3) = 0
A2: (4 − 2) = 2
S(1) = average(0, 2) = 1

S(0) > S(1) ⇒ Class 1
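Continuing the example, a sketch of the statistical criterion (same freq and t1 as in the geometric sketch, repeated so the block runs on its own):

```python
freq = {0: [{1: 0.25, 3: 0.25, 4: 0.50}, {1: 0.50, 2: 0.25, 3: 0.25}],
        1: [{1: 0.25, 2: 0.25, 3: 0.50}, {2: 0.75, 3: 0.25}]}
t1 = (3, 4)

def statistical_score(c):
    """Average distance from each test value to the most frequent class value."""
    dists = []
    for value, table in zip(t1, freq[c]):
        mode = max(table, key=table.get)  # maximum-frequency value
        dists.append(abs(value - mode))
    return sum(dists) / len(dists)

print(statistical_score(0), statistical_score(1))  # 2.0, 1.0 -> Class 1 (min distance)
```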
3. Combined Criterion

Test Point: T1 = (3, 4)
For each class value, the distance to T1 is weighted by its frequency: d · f = |T1 − A| · f(A)
Expected distance per class: ED^c = ED^c_A1 × ED^c_A2
Classification score: minimum expected distance, ED
Table: Aggregate Expected Distance, ED
Class 0:
A1   f(A1)   d·f      A2   f(A2)   d·f
 1   0.25    0.50      1   0.50    1.50
 3   0.25    0         2   0.25    0.50
 4   0.50    0.50      3   0.25    0.25
     ED0_A1 = 1.00         ED0_A2 = 2.25

Class 1:
A1   f(A1)   d·f      A2   f(A2)   d·f
 1   0.25    0.50      2   0.75    1.50
 2   0.25    0.25      3   0.25    0.25
 3   0.50    0
     ED1_A1 = 0.75         ED1_A2 = 1.75
Classification Penalty
S(0)
ED = 1.00 × 2.25 = 2.25
S(0) = ED × (1 − P(Class 0)) = 2.25 × 0.5 = 1.125

S(1)
ED = 0.75 × 1.75 ≈ 1.31
S(1) = ED × (1 − P(Class 1)) ≈ 1.31 × 0.5 = 0.655

S(0) > S(1) ⇒ Class 1
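A sketch of the combined expected-distance penalty, with the same inputs as the previous sketches:

```python
freq = {0: [{1: 0.25, 3: 0.25, 4: 0.50}, {1: 0.50, 2: 0.25, 3: 0.25}],
        1: [{1: 0.25, 2: 0.25, 3: 0.50}, {2: 0.75, 3: 0.25}]}
prior = {0: 0.5, 1: 0.5}
t1 = (3, 4)

def combined_penalty(c):
    """ED^c = product over attributes of sum_v |t - v| * f(v), weighted by (1 - prior)."""
    ed = 1.0
    for value, table in zip(t1, freq[c]):
        ed *= sum(abs(value - v) * f for v, f in table.items())
    return ed * (1 - prior[c])

print(combined_penalty(0), combined_penalty(1))  # 1.125, 0.65625 -> Class 1 (min penalty)
```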
The Noise Model
Dealing with Noise
Brodley & Friedl, 1999 – detect and reduce noise.

Kubica & Moore, 2003 – identify noise using a probabilistic model and remove it.

Elias Kalapanidas, 2003 – developed a noise model based on data properties.
Additive noise: x′ = x + δx, where

δx = σ_{x_j} × z_{i,j}
σ_{x_j}: standard deviation of attribute j
z_{i,j} = CDF(p_{i,j})

x_{i,j} = { x′_{i,j}   if p_{i,j} ≥ n
          { x_{i,j}    if p_{i,j} < n        (1)

based on noise level n ∈ {0, 0.15, 0.30, 0.50, 0.80}.
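A minimal sketch of this noise model, with one reading assumption: the slide's z_{i,j} = CDF(p_{i,j}) is taken here as the inverse normal CDF applied to a uniform draw p_{i,j} (turning p into a standard-normal z); the selection rule follows equation (1) as written:

```python
import numpy as np
from scipy.stats import norm

def add_noise(X, n, seed=0):
    """Additive noise per equation (1): cells with p_{i,j} >= n get x + sigma_j * z_{i,j}."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    sigma = X.std(axis=0)          # sigma_{x_j}: std dev of attribute j
    p = rng.uniform(size=X.shape)  # p_{i,j}
    z = norm.ppf(p)                # z_{i,j}: inverse normal CDF of p (assumption)
    return np.where(p >= n, X + sigma * z, X)
```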
Attribute-level Noise
Table: Original Dataset

A1   A2   Class
 3    2     1
 1    2     1
 4    2     0
 3    2     1
 1    1     0
 2    2     1
 3    3     0
 4    1     0

Table: 40% (n = 0.4) Noisy Dataset

A1    A2     Class
8.5   0.55    1
8.9   2       1
4     0.7     0
3     2       1
4.7   1       0
2     2       1
3     3       0
1.6   0.02    0
Datasets
Artificial datasets
Multivariate Normal
x1 = random normal vector, t = random normal vector
x2 = 0.8·x1 + 0.6·t
x3 = 0.6·x1 + 0.8·t
x4 = t
Linear Function with Non-normal inputs
x2 = (x1)² + 0.5·t
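A minimal sketch of the two constructions above. Sample size and class labelling are not specified on this slide, so only the attribute construction is shown; n = 200 is taken from the A1/A2 sample counts reported later:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.standard_normal(n)  # random normal vector
t = rng.standard_normal(n)   # random normal vector

# Multivariate normal: correlated attributes built from x1 and t
x2 = 0.8 * x1 + 0.6 * t
x3 = 0.6 * x1 + 0.8 * t
x4 = t
X_mvn = np.column_stack([x1, x2, x3, x4])

# Non-normal inputs: x2 = x1^2 + 0.5*t
X_nonnormal = np.column_stack([x1, x1**2 + 0.5 * t])
```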
2 artificial datasets (A1, A2), with different imbalance ratios

3 real datasets
Table: Comparison of physical properties of Datasets

Dataset         No. of Samples   No. of Classes   No. of Attributes   Attribute Value   Imbalance Ratio
Haberman        306              2                3                   Integer           2.78
A1              200              3                4                   Real              6.66
A2              200              3                4                   Real              39
Iris            150              3                4                   Real              2
Pima Diabetes   768              2                8                   Integer, Real     1.87
Process Flow
1. Create the artificial datasets
2. Implement the noise model on all datasets
3. Apply the three algorithms
4. Compare the results
Results
Performance Measures
Confusion Matrix
Table: Confusion matrix for 2 classes.
                     Predicted Outcome
                     Positive   Negative
Actual    Positive   TP         FN
values    Negative   FP         TN

Accuracy:   Acc = (TP + TN) / (TP + TN + FP + FN)
Precision:  P = TP / (TP + FP)
Recall:     R = TP / (TP + FN)
F-measure:  F_α = P·R / (α·P + (1 − α)·R)
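A small sketch computing these measures from the confusion-matrix counts; α = 0.5 recovers the familiar balanced F1 under this parameterization:

```python
def metrics(tp, fn, fp, tn, alpha=0.5):
    """Accuracy, precision, recall, and F_alpha from 2-class confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (p * r) / (alpha * p + (1 - alpha) * r)
    return acc, p, r, f

print(metrics(tp=50, fn=10, fp=5, tn=35))  # (0.85, ~0.909, ~0.833, ~0.870)
```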
Non-Noisy Datasets
Artificial datasets:
knn does best: 91.2% and 93.7%.
Good improvement in CS from 65% (A1) to 76% (A2).
Table: Non-Noisy Artificial Datasets - Performance of all algorithms
Dataset   Algorithm     Accuracy   Precision   Recall   F-measure
A1        CS            65.0       63.5        70.1     66.6
A1        knn           91.2       92.8        87.4     89.8
A1        Naive Bayes   60.2       61.6        60.14    64.1
A2        CS            76.0       68.4        71.62    69.7
A2        knn           93.7       94.7        91.9     93.2
A2        Naive Bayes   63.1       61.1        65.2     63.5
Real datasets:
Iris: knn does best, followed by Naive Bayes.
Haberman: CS does best; Naive Bayes is really bad.
Pima-Diabetes: CS is best; Naive Bayes follows.
Table: Non-Noisy Real Datasets - Performance of all algorithms
Dataset         Algorithm     Accuracy   Precision   Recall   F-measure
Iris            CS            94.3       95.1        94.3     94.7
Iris            knn           96.7       96.8        96.7     96.8
Iris            Naive Bayes   96.2       93.7        95       94.3
Haberman        CS            75.2       67.2        61.6     64.2
Haberman        knn           73.4       63.2        54.8     58.5
Haberman        Naive Bayes   0.5        41.9        47.6     47.3
Pima-Diabetes   CS            73.7       74.9        65.1     69.6
Pima-Diabetes   knn           64.5       65.6        66.9     66.3
Pima-Diabetes   Naive Bayes   70.3       59.2        56.7     57.9
Noisy Datasets: A1
knn does best.
For both knn and CS, no change with noise.
Naive Bayes does badly.
Table: Noisy A1 dataset - Performance of all algorithms
Algorithm     Noise %   Accuracy   Precision   Recall   F-measure
CS            0         65         63.5        70.1     66.6
CS            15        64.8       63.4        96.7     96.8
CS            50        65.5       63.2        95       94.3
knn           0         87.5       87.2        61.6     61.6
knn           15        87.3       88.1        54.8     58.5
knn           50        86.7       88.5        47.6     47.3
Naive Bayes   0         ≈ 0        ≈ 0         ≈ 0      ≈ 0
Naive Bayes   15        ≈ 0        ≈ 0         ≈ 0      ≈ 0
Naive Bayes   50        ≈ 0        ≈ 0         ≈ 0      ≈ 0
Noisy Datasets: A2
knn does best, but drops from 92.6% to 86.3%.
For CS, no change with noise.
From A1 to A2, CS improves from 65% to 76%.
Table: Noisy A2 dataset - Performance of all algorithms
Algorithm     Noise %   Accuracy   Precision   Recall   F-measure
CS            0         76.0       68.4        71.6     69.7
CS            15        76.8       64.7        73.1     68.4
CS            50        76.4       66.9        71.7     68.5
knn           0         92.6       86.9        85.5     86.2
knn           15        91.1       84.2        84.2     83.5
knn           50        86.3       83.0        78.2     77.9
Naive Bayes   0         ≈ 0        ≈ 0         ≈ 0      ≈ 0
Naive Bayes   15        ≈ 0        ≈ 0         ≈ 0      ≈ 0
Naive Bayes   50        ≈ 0        ≈ 0         ≈ 0      ≈ 0
Noisy Datasets: Iris
knn does best at 0% noise (96.7%), followed by CS (94.5%).
CS does best at 50% noise (73.1%), followed by knn (63.8%).
Table: Noisy Iris dataset - Performance of all algorithms
Algorithm     Noise %   Accuracy   Precision   Recall   F-measure
CS            0         94.5       94.9        94.5     94.7
CS            15        86.2       87.6        86.2     86.9
CS            50        73.1       74.9        73.1     73.9
knn           0         96.7       96.8        96.7     96.8
knn           15        83.6       84.6        83.6     84.1
knn           50        63.8       63.2        63.8     63.5
Naive Bayes   0         93.3       92.3        91.9     92.1
Naive Bayes   15        92.3       91.5        91.2     91.4
Naive Bayes   50        0.7        18.3        0.7      NaN
Noisy Datasets: Haberman
CS does best, at 74.7%.
Naive Bayes performs badly, at ≈ 43%.
Table: Noisy Haberman dataset - Performance of all algorithms
Algorithm     Noise %   Accuracy   Precision   Recall   F-measure
CS            0         74.7       66.7        61.4     63.9
CS            15        66.1       62.2        61.9     62.0
CS            50        74.5       66.6        63       64.7
knn           0         74.1       65.7        55.1     59.7
knn           15        72.0       56.2        52.3     54.0
knn           50        70.5       51.8        50.6     51.0
Naive Bayes   0         41.0       47.1        46.5     46.8
Naive Bayes   15        43.3       46.2        45.3     45.7
Naive Bayes   50        41.4       34.7        32.4     31.8
Noisy Datasets: Pima-Diabetes
CS does best, followed by knn.
Naive Bayes degrades badly with noise: 70% → 55.7% → 0%.
Table: Noisy Pima-Diabetes dataset - Performance of all algorithms
Algorithm     Noise %   Accuracy   Precision   Recall   F-measure
CS            0         72.8       72.8        64.2     68.2
CS            15        70.8       68.3        65.8     67
CS            50        67.0       64.9        55.9     60.0
knn           0         63.5       64.6        65.9     65.2
knn           15        60.8       61.2        62.3     61.7
knn           50        55.0       55.6        56.1     55.8
Naive Bayes   0         70.3       59.2        56.7     57.9
Naive Bayes   15        55.7       49.4        46.0     NaN
Naive Bayes   50        0          0           0        NaN
Results Summary
Table: Best Algorithm for different Noise Levels
Dataset         0% Noise   15% Noise     50% Noise
A1              knn        knn           knn
A2              knn        knn           knn
Haberman        CS         knn           CS
Iris            knn        Naive Bayes   CS
Pima-Diabetes   CS         CS            CS
Conclusion
No single algorithm is best overall.

In general, knn has better accuracy, but CS is more robust to noise.

Naive Bayes degrades with noise much more than the others.

Also:

CS performs well on imbalanced datasets.
Future Work
Test with more datasets.
Test for performance on imbalanced datasets.
Only the additive noise model was used; try other variations.
Compare with more algorithms.
Questions?