Noise Resilience in Machine Learning Algorithms


Exploring the Noise Resilience of the Combined Sturges Algorithm

Akrita Agarwal
Advisor: Dr. Anca Ralescu

November 7, 2015


Motivation

A study on Noise?

- Real-world datasets are noisy:
  - recordings under normal environmental conditions
  - equipment measurement error
- Most algorithms ignore noise.
- Not much research has been done on noise.

Aim: explore the robustness of algorithms to noise.

Which algorithm is least affected by noisy datasets?


Classification

Classification: assigning a new observation to one of a set of known categories.

Companies store large amounts of data.

An effective classifier can assist in making good predictions and informed business decisions.

E.g., whether to recommend Prime products to non-Prime customers, based on behavior.


Classification Algorithms

Two broad kinds of classifiers:

- Frequency-based classifiers use the frequency of data points in the dataset to determine the class membership of a given test point.
- Geometry-based classifiers leverage the geometric aspects of a dataset, such as distance.


The Naive Bayes Classifier

- Frequency-based classifier
- Computes the probability that a test data point belongs to each class
- Class probabilities are extracted from the training data

Pros

- Intuitive to understand and build
- Easily trained, even with a small dataset
- Fast

Cons

- Assumes conditional independence of the attributes
- Ignores the underlying geometry of the data
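For orientation, this is what a Naive Bayes baseline looks like with scikit-learn (an illustration on Iris, not the experimental code used in this work):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fits class priors and per-class Gaussian likelihoods from the training data.
model = GaussianNB().fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # test accuracy
```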


The k Nearest Neighbors Classifier

- Geometry-based classifier
- Assigns a class to the test data point by taking the majority class of its k nearest points

Pros

- Easy to implement and understand
- Classes don't have to be linearly separable

Cons

- Tends to ignore the relative importance of attributes; uses all of them
- Only indirectly takes the frequency of the data into account
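The corresponding kNN baseline (again an illustration; k = 5 is an arbitrary choice here):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Predicts by majority vote among the k nearest training points.
model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # test accuracy
```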


Combined Sturges Classifier


The Combined Sturges (CS) Classifier

- Explicitly uses geometry + frequency
- Data represented as a frequency distribution per class
- A classification score is computed for each class
- Test point assigned to the class with the best score (highest probability or lowest distance, depending on the criterion)

Continuous data values are binned.

No. of bins = ⌈1 + log2 n⌉ (Sturges, 1926, "The Choice of a Class Interval")
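In code, Sturges' rule is a one-liner (a sketch, using the 8-row dummy dataset below as the example):

```python
import math

def sturges_bins(n_samples: int) -> int:
    """Sturges' rule: ceil(1 + log2(n))."""
    return math.ceil(1 + math.log2(n_samples))

print(sturges_bins(8))  # 4 bins for the 8-row dummy dataset
```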


Dummy Dataset

Table: Dummy Dataset

A1   A2   Class
3    2    1
1    2    1
4    2    0
3    2    1
1    1    0
2    2    1
3    3    0
4    1    0

Table: Frequency Distribution on Classes 0 & 1

Class 0:
A1   f(A1)     A2   f(A2)
1    0.25      1    0.50
3    0.25      2    0.25
4    0.50      3    0.25

Class 1:
A1   f(A1)     A2   f(A2)
1    0.25      2    0.75
2    0.25      3    0.25
3    0.50

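The per-class frequency tables above can be reproduced in a few lines (a sketch; `class_frequencies` is my helper name, and the rows are copied from the dummy dataset). For continuous attributes, values would first be binned with Sturges' rule:

```python
from collections import Counter

# Rows of (A1, A2, class), copied from the dummy dataset above.
data = [(3, 2, 1), (1, 2, 1), (4, 2, 0), (3, 2, 1),
        (1, 1, 0), (2, 2, 1), (3, 3, 0), (4, 1, 0)]

def class_frequencies(rows, cls, attr):
    """Relative frequency of each value of attribute `attr` within class `cls`."""
    values = [r[attr] for r in rows if r[2] == cls]
    return {v: c / len(values) for v, c in sorted(Counter(values).items())}

print(class_frequencies(data, 0, 0))  # A1 | class 0: {1: 0.25, 3: 0.25, 4: 0.5}
print(class_frequencies(data, 1, 0))  # A1 | class 1: {1: 0.25, 2: 0.25, 3: 0.5}
```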


Test Point: T1 (A1 = 3, A2 = 4)


1. Geometric Criterion

Test point T1: A1 = 3, A2 = 4

- Classification criterion: geometric (minimum distance)
- Classification score: highest posterior probability

Table: Nearest distance of T1 to classes (the class frequency distributions above, with f taken at the value nearest to T1)


Classification Score, S(c), c ∈ {0, 1}, where f(Aj) is the class frequency of the attribute value nearest to T1:

S(0)

A1 = P(Class 0) × f(A1), A2 = P(Class 0) × f(A2)
average(A1, A2) = average(0.5 × 0.25, 0.5 × 0.25) = 0.125

S(1)

A1 = P(Class 1) × f(A1), A2 = P(Class 1) × f(A2)
average(A1, A2) = average(0.5 × 0.50, 0.5 × 0.25) = 0.187

S(0) < S(1), so T1 is assigned to Class 1.

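Reading the geometric criterion off this worked example, a minimal sketch (my helper names; the per-class frequency dictionaries are copied from the tables above, and both class priors are 4/8 = 0.5):

```python
# Class-conditional frequencies f(value), copied from the tables above.
f0 = [{1: 0.25, 3: 0.25, 4: 0.50}, {1: 0.50, 2: 0.25, 3: 0.25}]  # class 0: A1, A2
f1 = [{1: 0.25, 2: 0.25, 3: 0.50}, {2: 0.75, 3: 0.25}]           # class 1: A1, A2
T1 = (3, 4)

def geometric_score(test, freqs, prior):
    """Average over attributes of P(class) * f(value nearest to the test point)."""
    terms = []
    for t, f in zip(test, freqs):
        nearest = min(f, key=lambda v: abs(v - t))  # geometrically closest class value
        terms.append(prior * f[nearest])
    return sum(terms) / len(terms)

print(geometric_score(T1, f0, 0.5))  # 0.125
print(geometric_score(T1, f1, 0.5))  # 0.1875 -> highest score: class 1
```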

2. Statistical Criterion

Test point T1: A1 = 3, A2 = 4

- Classification criterion: statistical (maximum frequency)
- Classification score: minimum distance

Table: Maximum frequency in classes (the class frequency distributions above; the score uses each attribute's most frequent value)


Classification Score

Each term is the distance from T1 to the attribute's most frequent value in the class:

S(0)

A1 = (4 − 3) = 1, A2 = (4 − 1) = 3
average(A1, A2) = average(1, 3) = 2

S(1)

A1 = (3 − 3) = 0, A2 = (4 − 2) = 2
average(A1, A2) = average(0, 2) = 1

S(0) > S(1), so T1 is assigned to Class 1 (minimum distance wins).

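The statistical criterion can be sketched the same way, continuing with f0, f1, and T1 from the previous sketch (the score is a distance, so the smaller value wins):

```python
def statistical_score(test, freqs):
    """Average distance from the test point to each attribute's most frequent value."""
    dists = [abs(t - max(f, key=f.get)) for t, f in zip(test, freqs)]
    return sum(dists) / len(dists)

print(statistical_score(T1, f0))  # average(|3-4|, |4-1|) = 2.0
print(statistical_score(T1, f1))  # average(|3-3|, |4-2|) = 1.0 -> minimum: class 1
```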

3. Combined Criterion

Test point T1: A1 = 3, A2 = 4

- Frequency-weighted distance per value: d·f = |T1 − v| × f(v)
- Expected distance per attribute: ED_c(Aj) = Σ d·f
- Aggregate expected distance: ED_c = ED_c(A1) × ED_c(A2); minimum ED wins

Table: Aggregate Expected Distance, ED

Class 0:
A1   f(A1)   d·f       A2   f(A2)   d·f
1    0.25    0.50      1    0.50    1.50
3    0.25    0         2    0.25    0.50
4    0.50    0.50      3    0.25    0.25
ED0(A1) = 1.00         ED0(A2) = 2.25

Class 1:
A1   f(A1)   d·f       A2   f(A2)   d·f
1    0.25    0.50      2    0.75    1.50
2    0.25    0.25      3    0.25    0.25
3    0.50    0
ED1(A1) = 0.75         ED1(A2) = 1.75


Classification Penalty

S(0)

ED = 1.00 × 2.25 = 2.25
S(0) = ED × (1 − P(Class 0)) = 1.125

S(1)

ED = 0.75 × 1.75 = 1.31
S(1) = ED × (1 − P(Class 1)) = 0.655

S(0) > S(1), so T1 is assigned to Class 1.

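Putting the pieces together, the combined criterion reduces to a frequency-weighted expected distance scaled by a class-prior penalty. A minimal sketch, reusing f0, f1, and T1 from the geometric sketch above (lower score wins):

```python
def combined_penalty(test, freqs, prior):
    """Product of per-attribute expected distances, scaled by (1 - class prior)."""
    ed = 1.0
    for t, f in zip(test, freqs):
        ed *= sum(abs(t - v) * fv for v, fv in f.items())  # expected distance ED_c(Aj)
    return ed * (1 - prior)

print(combined_penalty(T1, f0, 0.5))  # 1.00 * 2.25 * 0.5 = 1.125
print(combined_penalty(T1, f1, 0.5))  # 0.75 * 1.75 * 0.5 ≈ 0.656 -> minimum: class 1
```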

The Noise Model


Dealing with Noise

Brodley & Friedl, 1999 - detect and remove noise.

Kubica & Moore, 2003 - identify noise using a probabilistic model and remove it.

Kalapanidas et al., 2003 - developed a noise model based on data properties.


Additive noise: x′ = x + δx

δx(i, j) = σ(x_j) × z(i, j)
  σ(x_j): standard deviation of attribute j
  z(i, j) = CDF(p(i, j))

x(i, j) = x′(i, j) if p(i, j) ≥ n, x(i, j) otherwise    (1)

Based on noise level n ∈ {0, 0.15, 0.30, 0.50, 0.80}
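A minimal NumPy sketch of this noise model. Assumptions on my part: p(i, j) are per-cell uniform draws, z(i, j) is the corresponding standard-normal deviate (the inverse-CDF reading of the z = CDF(p) line), and noise is applied where p < n so that n is the expected fraction of perturbed cells, matching the "40% noisy" example on the next slide:

```python
import numpy as np

def add_noise(X: np.ndarray, n: float, seed: int = 0) -> np.ndarray:
    """Additive noise: perturb roughly a fraction n of cells by sigma_j * z, z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    sigma = X.std(axis=0)                      # standard deviation of each attribute j
    p = rng.random(X.shape)                    # per-cell uniform draws p(i, j)
    z = rng.standard_normal(X.shape)           # normal deviates (inverse-CDF of a uniform)
    return np.where(p < n, X + sigma * z, X)   # noisy where p < n, unchanged otherwise

X = np.array([[3, 2], [1, 2], [4, 2], [3, 2],
              [1, 1], [2, 2], [3, 3], [4, 1]], dtype=float)
print(add_noise(X, n=0.4))  # roughly 40% of the cells perturbed
```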


Attribute-level Noise

Table: Original Dataset

A1   A2   Class
3    2    1
1    2    1
4    2    0
3    2    1
1    1    0
2    2    1
3    3    0
4    1    0

Table: 40% (n = 0.4) Noisy Dataset

A1    A2     Class
8.5   0.55   1
8.9   2      1
4     0.7    0
3     2      1
4.7   1      0
2     2      1
3     3      0
1.6   0.02   0


Datasets


Artificial datasets

Multivariate Normal

x1 = random normal vector, t = random normal vector
x2 = 0.8x1 + 0.6t
x3 = 0.6x1 + 0.8t
x4 = t

Linear Function with Non-normal inputs

x2 = (x1)² + 0.5t
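These two recipes translate directly to NumPy (a sketch; the sample count of 200 matches datasets A1 and A2 in the table below, and class labels and imbalance handling are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200  # matches the sample counts of A1 and A2 in the table below

# Multivariate normal construction
x1 = rng.standard_normal(m)
t = rng.standard_normal(m)
x2 = 0.8 * x1 + 0.6 * t
x3 = 0.6 * x1 + 0.8 * t
x4 = t
X_normal = np.column_stack([x1, x2, x3, x4])

# Linear function with a non-normal input (x1 squared)
x2_nonnormal = x1**2 + 0.5 * t
```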


2. Artificial datasets with different imbalance ratios

3. Real datasets

Table: Comparison of physical properties of datasets.

Dataset         No. of Samples   No. of Classes   No. of Attributes   Attribute Value   Imbalance Ratio
Haberman        306              2                3                   Integer           2.78
A1              200              3                4                   Real              6.66
A2              200              3                4                   Real              39
Iris            150              3                4                   Real              2
Pima-Diabetes   768              2                8                   Integer, Real     1.87


Process Flow

1. Create artificial datasets

2. Implement the noise model on all datasets

3. Apply the three algorithms

4. Compare the results (the loop is sketched below)
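The loop referenced in step 4 might look like this sketch; `datasets` and `run_classifier` are hypothetical placeholders for the dataset list and for the CS, knn, and Naive Bayes implementations:

```python
results = {}
for name, X, y in datasets:                    # hypothetical list of (name, features, labels)
    for n in (0, 0.15, 0.30, 0.50, 0.80):      # noise levels from the noise model
        Xn = add_noise(X, n)                   # sketched in the noise-model section
        for algo in ("CS", "knn", "Naive Bayes"):
            results[(name, n, algo)] = run_classifier(algo, Xn, y)  # hypothetical helper
```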


Results


Performance Measures

Confusion Matrix

Table: Confusion matrix for 2 classes.

                   Predicted Positive   Predicted Negative
Actual Positive    TP                   FN
Actual Negative    FP                   TN

Accuracy: Acc = (TP + TN) / (TP + TN + FP + FN)

Precision: P = TP / (TP + FP)

Recall: R = TP / (TP + FN)

F-measure: Fα = PR / (αP + (1 − α)R)
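The four measures in code (a small helper; the example confusion counts are made up):

```python
def metrics(tp: int, fn: int, fp: int, tn: int, alpha: float = 0.5):
    """Accuracy, precision, recall, and alpha-weighted F-measure from a confusion matrix."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (p * r) / (alpha * p + (1 - alpha) * r)  # alpha = 0.5 recovers the usual F1
    return acc, p, r, f

print(metrics(tp=40, fn=10, fp=5, tn=45))  # (0.85, 0.889, 0.8, 0.842) on made-up counts
```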


Non-Noisy Datasets

Artificial datasets:

- knn does best: 91.2% and 93.7%
- Good improvement in CS, from 65% to 76%

Table: Non-Noisy Artificial Datasets - Performance of all algorithms

Dataset   Algorithm     Accuracy   Precision   Recall   F-measure
A1        CS            65.0       63.5        70.1     66.6
A1        knn           91.2       92.8        87.4     89.8
A1        Naive Bayes   60.2       61.6        60.14    64.1
A2        CS            76.0       68.4        71.62    69.7
A2        knn           93.7       94.7        91.9     93.2
A2        Naive Bayes   63.1       61.1        65.2     63.5


Real datasets:

- Iris: knn does best, followed by Naive Bayes.
- Haberman: CS does best. Naive Bayes is really bad.
- Pima-Diabetes: CS is best. Naive Bayes follows.

Table: Non-Noisy Real Datasets - Performance of all algorithms

Dataset         Algorithm     Accuracy   Precision   Recall   F-Measure
Iris            CS            94.3       95.1        94.3     94.7
Iris            knn           96.7       96.8        96.7     96.8
Iris            Naive Bayes   96.2       93.7        95       94.3
Haberman        CS            75.2       67.2        61.6     64.2
Haberman        knn           73.4       63.2        54.8     58.5
Haberman        Naive Bayes   0.5        41.9        47.6     47.3
Pima-Diabetes   CS            73.7       74.9        65.1     69.6
Pima-Diabetes   knn           64.5       65.6        66.9     66.3
Pima-Diabetes   Naive Bayes   70.3       59.2        56.7     57.9


Noisy Datasets: A1

- knn does best.
- For both knn and CS, no change with noise.
- Naive Bayes does badly.

Table: Noisy A1 dataset - Performance of all algorithms

Algorithm     Noise %   Accuracy   Precision   Recall   F-Measure
CS            0         65         63.5        70.1     66.6
CS            15        64.8       63.4        96.7     96.8
CS            50        65.5       63.2        95       94.3
knn           0         87.5       87.2        61.6     61.6
knn           15        87.3       88.1        54.8     58.5
knn           50        86.7       88.5        47.6     47.3
Naive Bayes   0         ≈ 0        ≈ 0         ≈ 0      ≈ 0
Naive Bayes   15        ≈ 0        ≈ 0         ≈ 0      ≈ 0
Naive Bayes   50        ≈ 0        ≈ 0         ≈ 0      ≈ 0


Noisy Datasets: A2

- knn does best, but drops from 92.6% to 86.3%.
- For CS, no change with noise.
- From A1 to A2, CS improves from 65% to 76%.

Table: Noisy A2 dataset - Performance of all algorithms

Algorithm     Noise %   Accuracy   Precision   Recall   F-Measure
CS            0         76.0       68.4        71.6     69.7
CS            15        76.8       64.7        73.1     68.4
CS            50        76.4       66.9        71.7     68.5
knn           0         92.6       86.9        85.5     86.2
knn           15        91.1       84.2        84.2     83.5
knn           50        86.3       83.0        78.2     77.9
Naive Bayes   0         ≈ 0        ≈ 0         ≈ 0      ≈ 0
Naive Bayes   15        ≈ 0        ≈ 0         ≈ 0      ≈ 0
Naive Bayes   50        ≈ 0        ≈ 0         ≈ 0      ≈ 0


Noisy Datasets: Iris

- knn does best at 0% noise (96.7%), then CS (94.5%).
- CS does best at 50% noise (73.1%), then knn (63.8%).

Table: Noisy Iris dataset - Performance of all algorithms

Algorithm     Noise %   Accuracy   Precision   Recall   F-Measure
CS            0         94.5       94.9        94.5     94.7
CS            15        86.2       87.6        86.2     86.9
CS            50        73.1       74.9        73.1     73.9
knn           0         96.7       96.8        96.7     96.8
knn           15        83.6       84.6        83.6     84.1
knn           50        63.8       63.2        63.8     63.5
Naive Bayes   0         93.3       92.3        91.9     92.1
Naive Bayes   15        92.3       91.5        91.2     91.4
Naive Bayes   50        0.7        18.3        0.7      NaN


Noisy Datasets: Haberman

- CS does best, at 74.7%.
- Naive Bayes performs badly, at ≈ 43%.

Table: Noisy Haberman dataset - Performance of all algorithms

Algorithm     Noise %   Accuracy   Precision   Recall   F-Measure
CS            0         74.7       66.7        61.4     63.9
CS            15        66.1       62.2        61.9     62.0
CS            50        74.5       66.6        63       64.7
knn           0         74.1       65.7        55.1     59.7
knn           15        72.0       56.2        52.3     54.0
knn           50        70.5       51.8        50.6     51.0
Naive Bayes   0         41.0       47.1        46.5     46.8
Naive Bayes   15        43.3       46.2        45.3     45.7
Naive Bayes   50        41.4       34.7        32.4     31.8


Noisy Datasets: Pima-Diabetes

- CS does best, followed by knn.
- Naive Bayes degrades badly with noise: 70% to 55.7% to 0%.

Table: Noisy Pima-Diabetes dataset - Performance of all algorithms

Algorithm     Noise %   Accuracy   Precision   Recall   F-Measure
CS            0         72.8       72.8        64.2     68.2
CS            15        70.8       68.3        65.8     67
CS            50        67.0       64.9        55.9     60.0
knn           0         63.5       64.6        65.9     65.2
knn           15        60.8       61.2        62.3     61.7
knn           50        55.0       55.6        56.1     55.8
Naive Bayes   0         70.3       59.2        56.7     57.9
Naive Bayes   15        55.7       49.4        46.0     NaN
Naive Bayes   50        0          0           0        NaN


Results Summary

Table: Best Algorithm for different Noise Levels

Dataset         0% Noise   15% Noise     50% Noise
A1              knn        knn           knn
A2              knn        knn           knn
Haberman        CS         knn           CS
Iris            knn        Naive Bayes   CS
Pima-Diabetes   CS         CS            CS


Conclusion

No single algorithm is best across the board.

In general, knn has better accuracy, but CS is more robust to noise.

Naive Bayes degrades much more under noise than the others.

Also:

CS performs well on imbalanced datasets.


Future Work

Test with more datasets.

Test for performance on imbalanced datasets.

Only the additive noise model was used; try other variations.

Compare with more algorithms.


Questions?
