LEARNING FROM NOISY DATA


1

LEARNING FROM NOISY DATA

Ivan Bratko, University of Ljubljana

Slovenia

ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY

Acknowledgement: Thanks to Blaz Zupan for his contribution to these slides

2

Overview

• Learning from noisy data
• Idea of tree pruning
• How to prune optimally
• Methods for tree pruning
• Estimating probabilities

3

Learning from Noisy Data

• Sources of “noise”
  – Errors in measurements, errors in data encoding, errors in examples, missing values

• Problems
  – Complex hypotheses
  – Poor comprehensibility
  – Overfitting: hypothesis overfits the data
  – Low classification accuracy on new data

4

Fitting data

[Figure: data points (x, y) with a smooth fitted curve]

Looks good, but does not fit the data exactly!

What is the relation between x and y, y = y(x)? How can we predict y from x?

5

Overfitting data

[Figure: the same data points (x, y), now fitted by a curve that passes through every point]

What is the relation between x and y, y = y(x)? How can we predict y from x?

Makes no error in the training data! But how about predicting new cases?

6

Overfitting in Extreme

• Let default accuracy be the probability of the majority class

• Overfitting may result in accuracy lower than the default

• Example
  – Attributes have no correlation with the class (i.e., 100% noise)
  – Two classes: c1, c2
  – Class probabilities: p(c1) = 0.7, p(c2) = 0.3
  – Default accuracy = 0.7

7

Overfitting in Extreme

Decision tree with one example per leaf

[Figure: fully grown tree; leaves predict c1 (Acc. = 0.7) or c2 (Acc. = 0.3)]

Expected accuracy = 0.7 x 0.7 + 0.3 x 0.3 = 0.58

0.58 < 0.7
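The calculation on this slide generalizes to an expected accuracy of Σi p(ci)² for a fully overfitted tree under pure noise. A minimal Python sketch (the function name is mine, not from the slides):

# Sketch: expected accuracy of a one-example-per-leaf tree when attributes
# carry no information about the class. The tree predicts class c_i for a
# fraction p(c_i) of new examples, which are themselves of class c_i with
# probability p(c_i), so the expected accuracy is sum_i p(c_i)^2.

def expected_overfit_accuracy(class_probs):
    return sum(p * p for p in class_probs)

print(expected_overfit_accuracy([0.7, 0.3]))   # 0.58, below the default 0.7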

8

Pruning of Decision Trees

• Means of handling noise in tree learning

• After pruning the accuracy on previously unseen examples may increase

9

Typical Example from Practice: Locating Primary Tumor

Data set
– 20 classes
– Default classifier: 24.7%

                  # nodes   accuracy
Overfitted tree       150        41%
Pruned tree            15        45%
Experts                 -        42%

10

Effects of Pruning

[Figure: classification accuracy (0.65 to 1) plotted against the value of m (log scale, 0.1 to 1000); one curve for accuracy on the training set, one for accuracy on the test set; bigger trees on the left, smaller trees on the right]

12

How to Prune Optimally?

• Main questions
  – How much pruning?
  – Where to prune?
  – Large number of candidate pruned trees!

• Typical relation between tree size and accuracy on new data

• Main difficulty in pruning: this curve is not known!

[Figure: accuracy plotted against tree size]

13

Two Kinds of Pruning

Pre-pruning (forward pruning)

Post pruning

14

Forward Pruning

• Stop expanding trees if benefits of potential sub-trees seem dubious
  – Information gain low
  – Number of examples very small
  – Example set statistically insignificant
  – Etc.
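A minimal sketch of such a stopping test (thresholds and names are my own, purely illustrative):

def stop_expanding(info_gain, n_examples, min_gain=0.01, min_examples=5):
    # Pre-prune: turn the node into a leaf when the split looks dubious,
    # e.g. the information gain is low or the example set is very small.
    return info_gain < min_gain or n_examples < min_examples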

15

Forward Pruning Inferior

• Myopic
• Depends on parameters which are hard (impossible?) to guess
• Example:

[Figure: example decision tree on attributes x1 and x2 with values a and b]

16

Pre and Post Pruning

• Forward pruning considered inferior and myopic

• Post pruning makes use of sub-trees and in this way reduces the complexity

17

Post pruning

• Main idea: prune unreliable parts of tree

• Outline of the pruning procedure: start at the bottom of the tree and proceed upward, pruning unreliable subtrees

• Main question: How to know whether a subtree is unreliable? Will accuracy improve after pruning?

18

Estimating accuracy of subtree

• One idea: Use special test data set (“pruning set”)

• This is OK if sufficient amount of learning data available

• In case of a shortage of data: try to estimate accuracy directly from the learning data

19

Partitioning data in tree learning

All available data → Training set + Test set
Training set → Growing set + Pruning set

• Typical proportions: training set 70%, test set 30%; growing set 70%, pruning set 30%
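A small sketch of this partitioning in plain Python (names are mine; the 70/30 proportions follow the slide):

import random

def split(examples, fraction, seed=0):
    # Shuffle and split into two disjoint parts of sizes fraction : 1 - fraction.
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(fraction * len(examples))
    return examples[:cut], examples[cut:]

data = list(range(100))                              # placeholder data set
training_set, test_set = split(data, 0.7)            # 70% / 30%
growing_set, pruning_set = split(training_set, 0.7)  # 70% / 30% of the training set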

20

Estimating accuracy with pruning set

Accuracy of a hypothesis on new data = probability of correct classification of a new example

Accuracy of a hypothesis on new data ≈ proportion of correctly classified examples in the pruning set

Error of a hypothesis = probability of misclassification of a new example

Drawback of using a pruning set: less data left for the “growing set”

21

Use the pruning set to estimate the accuracy of subtrees and the accuracy at individual nodes

Let T be a subtree rooted at node v

Define: gain from pruning at v = (# misclassifications in T) − (# misclassifications at v, if v is turned into a leaf)

Reduced error pruning, Quinlan 87

22

Reduced error pruning

Repeat: prune at the node with the largest gain, until only nodes with negative gain remain

“Bottom-up restriction”: T can only be pruned if it does not contain a subtree with lower error than T
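A minimal sketch of the bottom-up variant of reduced error pruning (the node representation and names are my own; the slides' greedy largest-gain loop can be built on the same error counts):

def classify(node, x):
    # A node is a dict: 'attribute', 'children' (value -> subtree), 'majority'.
    if not node['children']:
        return node['majority']
    child = node['children'].get(x[node['attribute']])
    return classify(child, x) if child else node['majority']

def subtree_errors(node, pruning_set):
    return sum(1 for x, y in pruning_set if classify(node, x) != y)

def leaf_errors(node, pruning_set):
    return sum(1 for _, y in pruning_set if y != node['majority'])

def reduced_error_prune(node, pruning_set):
    # Bottom-up: prune the children first, then prune here if the gain
    # (# misclassifications in T) - (# misclassifications at v) is >= 0.
    for value, child in node['children'].items():
        subset = [(x, y) for x, y in pruning_set if x[node['attribute']] == value]
        reduced_error_prune(child, subset)
    if node['children'] and leaf_errors(node, pruning_set) <= subtree_errors(node, pruning_set):
        node['children'] = {}          # replace the subtree by a leaf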

23

Reduced error pruning

• Theorem (Esposito, Malerba, Semeraro 1997):

REP with the bottom-up restriction finds the smallest, most accurate subtree w.r.t. the pruning set.

24

Minimal Error Pruning (MEP)
Niblett and Bratko 86; Cestnik and Bratko 91

Does not require a pruning set for estimating error

Estimates error on new data directly from “growing set”, using the Bayesian method for probability estimation (e.g. Laplace estimate or m-estimate)

Main principle: Prune so that estimated classification error is minimal

25

Minimal Error Pruning

• Deciding about pruning at node v, the root of a tree T

[Figure: node v with branches of probabilities p1, p2, ... leading to subtrees T1, T2, ...]

• E(T) = error of the optimally pruned tree T

26

Static and backed-up errors

Define: static error at v: e(v) = 1 − p( class C | v ), where C is the most likely class at v

If T pruned at v then its error is e(v).

If T not pruned at v then its (backed-up) error is: p1 E(T1) + p2 E(T2) + ...

27

Minimal error pruning

Decision whether to prune or not:

Prune if static error ≤ backed-up error:

E(T) = min( e(v), Σi pi E(Ti) )
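A minimal recursive sketch of this rule (node layout and names are mine; static_error stands for whatever probability-based estimate of e(v) is chosen on the next slides):

def minimal_error_prune(node, static_error):
    # Returns E(T). A node is a dict with 'children' (list of subtrees),
    # 'n_examples', and whatever static_error(node) needs to compute e(v).
    e_static = static_error(node)
    if not node['children']:
        return e_static
    n = sum(child['n_examples'] for child in node['children'])
    backed_up = sum(child['n_examples'] / n * minimal_error_prune(child, static_error)
                    for child in node['children'])
    if e_static <= backed_up:          # prune: static error not worse than backed-up
        node['children'] = []
        return e_static
    return backed_up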

28

Minimal error pruning

Main question: How to estimate the static errors e(v)?
Use the Laplace or m-estimate of probability

At a node v: N examples, nC of them in the majority class C

29

Laplace probability estimate

pC = ( nC + 1 ) / ( N + k )

where k is the number of classes.

Problems with Laplace:
  – Assumes all classes are a priori equally likely
  – Degree of pruning depends on the number of classes

30

m-estimate of probability

pC = ( nC + pCa · m ) / ( N + m )

where: pCa = a priori probability of class C
       m = a non-negative parameter, tuned by the expert
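A small sketch of both estimates from the last two slides (function names are mine):

def laplace_estimate(n_c, N, k):
    # p_C = (n_C + 1) / (N + k), with k classes
    return (n_c + 1) / (N + k)

def m_estimate(n_c, N, p_prior, m):
    # p_C = (n_C + p_prior * m) / (N + m)
    return (n_c + p_prior * m) / (N + m)

# 7 of 10 examples at a node belong to the majority class, two classes:
print(laplace_estimate(7, 10, 2))            # 8/12 = 0.667
print(m_estimate(7, 10, p_prior=0.5, m=2))   # the same value: Laplace is the
                                             # m-estimate with p_prior = 1/k, m = k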

31

m-estimate

Important points:

  – Takes into account prior probabilities
  – Pruning not sensitive to the number of classes
  – Varying m gives a series of differently pruned trees
  – Choice of m depends on confidence in the data

32

m-estimate in pruning

Choice of m:

Low noise → low m → little pruning
High noise → high m → much pruning

Note: Using the m-estimate is as if the examples at a node were a random sample, which they are not. Suitably adjusting m compensates for this.

33

Some other pruning methods

• Error-complexity pruning, Breiman et al. 84 (CART)

• Pessimistic error pruning, Quinlan 87

• Error-based pruning, Quinlan 93 (C4.5)

34

Error-complexity pruning
Breiman et al. 1984, program CART

Considers:

• Error rate on the "growing" set
• Size of tree
• Error rate on the "pruning" set
• Minimise error and complexity, i.e. find a compromise between error and size

35

A subtree T with root v:

• R(v) = # errors on the "growing" set at node v (v treated as a leaf)
• R(T) = # errors on the "growing" set of tree T
• NT = # leaves in T
• Total cost = Error cost + Complexity cost
• Total cost = R + α·NT

36

Error complexity cost

• Total cost = Error cost + Complexity cost
• Total cost = R + α·N, where α = complexity cost per leaf

37

Pruning at v

• Cost of T (T unpruned) = R(T) + α·NT

• Cost of v (T pruned at v) = R(v) + α

• When the costs of T and v are equal: α = ( R(v) − R(T) ) / ( NT − 1 ) = reduction of error per leaf

38

Pruning algorithm

• Compute α for each node in the unpruned tree

• Repeat: prune the subtree with the smallest α, until only the root is left

• This gives a series of increasingly pruned trees; estimate their accuracy
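A sketch of this weakest-link loop (the tree representation and names are mine; each node stores 'errors', its growing-set errors R(v) when treated as a leaf):

def leaves(node):
    return [node] if not node['children'] else [l for c in node['children'] for l in leaves(c)]

def internal_nodes(node):
    if node['children']:
        yield node
        for child in node['children']:
            yield from internal_nodes(child)

def alpha(node):
    # alpha = (R(v) - R(T)) / (N_T - 1): error increase per leaf removed
    r_t = sum(l['errors'] for l in leaves(node))
    return (node['errors'] - r_t) / (len(leaves(node)) - 1)

def error_complexity_prune(root):
    # Repeatedly collapse the subtree with the smallest alpha; yield the
    # resulting series of increasingly pruned trees (here: their sizes).
    while root['children']:
        weakest = min(internal_nodes(root), key=alpha)
        weakest['children'] = []
        yield len(leaves(root))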

39

Selecting best pruned tree

• Finally select the "best" tree from this series

• Select the smallest tree within 1 standard error of minimum error (1-SE rule)

• Standard error = sqrt( Rmin * (1-Rmin) / #exs)
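A sketch of the 1-SE selection (the candidate sizes and error rates below are made up for illustration):

import math

def one_se_rule(candidates, n_examples):
    # candidates: list of (tree_size, error_rate) measured on the pruning set
    r_min = min(err for _, err in candidates)
    se = math.sqrt(r_min * (1 - r_min) / n_examples)
    within = [c for c in candidates if c[1] <= r_min + se]
    return min(within, key=lambda c: c[0])       # smallest tree within 1 SE

print(one_se_rule([(150, 0.30), (40, 0.27), (15, 0.29), (3, 0.40)], 200))  # (15, 0.29)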

40

Comments

• Note: Cost complexity pruning limits selection to a subset of all possible pruned trees.

• Consequence: Best pruned tree may be missed

• Two ways of estimating error on new data: (a) using pruning set (b) using cross-validation in a rather complicated way

41

Comments

• 1-SE rule tends to overprune
• Simply choosing the minimum-error tree ("0-SE rule") performs better in experiments
• Error estimation with cross-validation is complicated and based on a debatable assumption

42

Selecting best tree

• Using a pruning set: measure the error of the candidate pruned trees on the pruning set
• Select the smallest tree within 1 standard error of the minimum error

43

Comparison of pruning methods (Esposito, Malerba, Semeraro 96, IEEE Trans.)

• Experiments with 14 data sets from UCI repository

• Results: Does pruning improve accuracy?
  – Generally yes
  – But the effects of pruning also depend on the domain: in most domains pruning improves accuracy, in some it does not, and in very few it worsens accuracy

44

Pruning in rule learning

• Ideas from pruning decision trees can be adapted to the learning of if-then rules

• Pre-pruning and post-pruning can be combined, and the reduced error pruning idea applies

• Fürnkranz (1997) reviews several approaches and evaluates them experimentally

45

Estimating Probabilities

• Setup

– n experiments (n = r + s)
– r successes
– s failures

• How likely is it that the next experiment will be a success?

• Estimate with relative frequency: r / n

46

Relative Frequency

• Works when we have many experiments, but not with small samples
• Consider
  – flipping a coin
  – we flip a coin twice; both times it comes up heads
  – what is the probability of heads on the next flip?

• A probability of 1.0 (= 2/2) seems unreasonable

47

Coins and mushrooms

• Probability of heads = ?
• Probability that a mushroom is edible = ?
• Make one, two, ... experiments
• Interpret the results in terms of probability
• Relative frequency does not work well

48

Coins and mushrooms

• We need to consider prior expectations
• Prior prob. = 1/2, in both cases not unreasonable
• But is this enough?
• Intuition says: our probability estimates for coins and mushrooms are still different
• The difference lies in the prior probability distribution
• What are sensible prior distributions for coins and for mushrooms?

49

Bayesian Procedure for Estimating Probabilities

• Assume initial probability distribution (prior distribution)

• Based on some evidence E, update this distribution to obtain posterior distribution

• Compute the expected value over posterior distribution. Variance of posterior distribution is related to certainty of this estimate

50

Bayes Formula

• Bayesian process takes prior probability and combines it with new evidence to obtain updated (posterior) probability

51

Bayes in estimating probabilities

• Form of hypothesis H is: P(event) = x

• So: P( H | E) = P( P(event)=x | E)

• That is: the probability that the probability is x
• May appear confusing!

52

Bayes update of probability

P( P(event)=x | E ) = P( P(event)=x ) · P( E | P(event)=x ) / P( E )

Posterior prob. density on the left; prior prob. density: P( P(event)=x )

53

Expected probability

• Expected value of prob. of event: P( event | E )

• P( event | E ) = ∫₀¹ x · p( x | E ) dx, i.e. x weighted by the posterior probability density over [0, 1]

54

Bayes update of probability

Prior prob. distribution
  ↓  Bayes update with evidence E
Posterior prob. distribution
  ↓
Expected value, variance

55

Update of density: Example

[Figure: uniform prior density on [0, 1] and the posterior density on [0, 1] after the update]

56

Choice of Prior Distribution: Beta Distribution β(a, b)

57

Bayesian Update of Beta Distributions

• Let the prior distribution be β(a, b)
• Assume experimental results
  – s successful outcomes
  – f failures

• The updated Beta distribution is β(a+s, b+f)

• Beta probability distributions have a nice mathematical property:

  Beta distribution → Bayes update → Beta distribution
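A small sketch of this update with scipy (assuming scipy is available), using the two-heads coin example from earlier:

from scipy.stats import beta

a, b = 1, 1                  # uniform prior
s, f = 2, 0                  # evidence: two flips, both heads

posterior = beta(a + s, b + f)          # Beta(3, 1)
print(posterior.mean())                 # expected probability of heads = 0.75
print(posterior.var())                  # variance reflects certainty of the estimate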

58

m-estimate of probability

• Cestnik, 1991
  – Replace parameters a and b with m and pa
  – pa is the prior probability

• Assume: N experiments, n positive outcomes

59

m-estimate of probability

p = ( n + m · pa ) / ( N + m ) = N/(N+m) · n/N + m/(N+m) · pa

a weighted combination of the relative frequency n/N and the prior probability pa
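A numeric check of this decomposition (values chosen to match the earlier two-heads coin example):

n, N = 2, 2                  # two positive outcomes in two trials
p_a, m = 0.5, 2              # prior probability and its weight

m_est = (n + m * p_a) / (N + m)
weighted = N / (N + m) * (n / N) + m / (N + m) * p_a
print(m_est, weighted)       # both 0.75: relative frequency 1.0 pulled toward the prior 0.5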

60

Choosing the prior probability distribution

• If we know the prior probability and variance, this determines a, b and m, pa

• A domain expert chooses the prior distribution, defined either by a, b or by m, pa

• m, pa may be more practical than a, b

• The expert hopefully has some idea about pa and m:
  – low variance, more confidence in pa → large m
  – high variance, less confidence in pa → small m

61

Laplace Probability Estimate

• For a problem with two outcomes

• Assumes a prior probability distribution of β(1, 1), i.e. uniform

• Also equals the m-estimate with pa = 1/k and m = k, where k = 2

62

Using domain knowledge to improve accuracy

• If domain-specific knowledge is available prior to learning it may provide useful additional constraints

• Additional constraints may alleviate problems with noise

• One approach is Q2 learning that uses qualitative constraints in numerical learning

63

Q2 Learning
Vladušič, Šuc and Bratko 2004

• Q2 learning, Qualitatively faithful Quantitative learning

• Learning from numerical data is guided by qualitative constraints

• Resulting numerical model fits learning data numerically and respects given qualitative model

• Qualitative model can be provided by domain expert, or induced from data

64

Qualitative difficulties of numerical learning

Learn the time behavior of the water level: h = f( t, initial_outflow )

[Figure: water tank with an outflow; measured water level h over time t]

Predicting water level with M5

[Figure: M5 predictions of the water level h over time, for several values of initial_outflow]

Qualitatively incorrect – the water level cannot increase

66

Predicting water level with Q2

[Figure: Q2 predictions of the water level compared with the true values]
