LEARNING FROM NOISY DATA


1

LEARNING FROM NOISY DATA

Ivan Bratko, University of Ljubljana

Slovenia

ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY

Acknowledgement: Thanks to Blaz Zupan for his contribution to these slides

2

Overview

• Learning from noisy data
• Idea of tree pruning
• How to prune optimally
• Methods for tree pruning
• Estimating probabilities

3

Learning from Noisy Data

• Sources of “noise”
  – Errors in measurements, errors in data encoding, errors in examples, missing values

• Problems
  – Complex hypotheses
  – Poor comprehensibility
  – Overfitting: hypothesis overfits the data
  – Low classification accuracy on new data

4

Fitting data

[Figure: data points (x, y) with a smooth fitted curve]

Looks good, but does not fit the data exactly!

What is the relation between x and y, y = y(x)? How can we predict y from x?

5

Overfitting data

[Figure: the same data points (x, y), now fitted by a curve that passes through every point]

What is the relation between x and y, y = y(x)? How can we predict y from x?

Makes no error in the training data! But how about predicting new cases?

6

Overfitting in Extreme

• Let default accuracy be the probability of the majority class

• Overfitting may result in accuracy lower than the default

• Example
  – Attributes have no correlation with the class (i.e., 100% noise)
  – Two classes: c1, c2
  – Class probabilities: p(c1) = 0.7, p(c2) = 0.3
  – Default accuracy = 0.7

7

Overfitting in Extreme

Decision tree with one example per leaf

[Figure: fully grown tree; leaves predict c1 (Acc. = 0.7) or c2 (Acc. = 0.3)]

Expected accuracy = 0.7 x 0.7 + 0.3 x 0.3 = 0.58

0.58 < 0.7
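The calculation on this slide generalizes to an expected accuracy of Σi p(ci)² for a fully overfitted tree under pure noise. A minimal Python sketch (the function name is mine, not from the slides):

# Sketch: expected accuracy of a one-example-per-leaf tree when attributes
# carry no information about the class. The tree predicts class c_i for a
# fraction p(c_i) of new examples, which are themselves of class c_i with
# probability p(c_i), so the expected accuracy is sum_i p(c_i)^2.

def expected_overfit_accuracy(class_probs):
    return sum(p * p for p in class_probs)

print(expected_overfit_accuracy([0.7, 0.3]))   # 0.58, below the default 0.7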

8

Pruning of Decision Trees

• Means of handling noise in tree learning

• After pruning the accuracy on previously unseen examples may increase

9

Typical Example from Practice: Locating Primary Tumor

Data set
– 20 classes
– Default classifier: 24.7%

                  # nodes   accuracy
Overfitted tree       150        41%
Pruned tree            15        45%
Experts                 -        42%

10

Effects of Pruning

[Figure: classification accuracy (0.65 to 1) plotted against the value of m (log scale, 0.1 to 1000); one curve for accuracy on the training set, one for accuracy on the test set; bigger trees on the left, smaller trees on the right]

12

How to Prune Optimally?

• Main questions
  – How much pruning?
  – Where to prune?
  – Large number of candidate pruned trees!

• Typical relation between tree size and accuracy on new data

• Main difficulty in pruning: this curve is not known!

[Figure: accuracy plotted against tree size]

13

Two Kinds of Pruning

Pre-pruning (forward pruning)

Post pruning

14

Forward Pruning

• Stop expanding trees if benefits of potential sub-trees seem dubious
  – Information gain low
  – Number of examples very small
  – Example set statistically insignificant
  – Etc.
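A minimal sketch of such a stopping test (thresholds and names are my own, purely illustrative):

def stop_expanding(info_gain, n_examples, min_gain=0.01, min_examples=5):
    # Pre-prune: turn the node into a leaf when the split looks dubious,
    # e.g. the information gain is low or the example set is very small.
    return info_gain < min_gain or n_examples < min_examples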

15

Forward Pruning Inferior

• Myopic
• Depends on parameters which are hard (impossible?) to guess
• Example:

[Figure: example decision tree on attributes x1 and x2 with values a and b]

16

Pre and Post Pruning

• Forward pruning considered inferior and myopic

• Post pruning makes use of sub-trees and in this way reduces the complexity

17

Post pruning

• Main idea: prune unreliable parts of tree

• Outline of the pruning procedure: start at the bottom of the tree and proceed upward, pruning unreliable subtrees

• Main question: How to know whether a subtree is unreliable? Will accuracy improve after pruning?

18

Estimating accuracy of subtree

• One idea: Use special test data set (“pruning set”)

• This is OK if sufficient amount of learning data available

• In case of a shortage of data: try to estimate accuracy directly from the learning data

19

Partitioning data in tree learning

All available data → Training set + Test set
Training set → Growing set + Pruning set

• Typical proportions: training set 70%, test set 30%; growing set 70%, pruning set 30%
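A small sketch of this partitioning in plain Python (names are mine; the 70/30 proportions follow the slide):

import random

def split(examples, fraction, seed=0):
    # Shuffle and split into two disjoint parts of sizes fraction : 1 - fraction.
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(fraction * len(examples))
    return examples[:cut], examples[cut:]

data = list(range(100))                              # placeholder data set
training_set, test_set = split(data, 0.7)            # 70% / 30%
growing_set, pruning_set = split(training_set, 0.7)  # 70% / 30% of the training set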

20

Estimating accuracy with pruning set

Accuracy of a hypothesis on new data = probability of correct classification of a new example

Accuracy of a hypothesis on new data ≈ proportion of correctly classified examples in the pruning set

Error of a hypothesis = probability of misclassification of a new example

Drawback of using a pruning set: less data left for the “growing set”

21

Use the pruning set to estimate the accuracy of subtrees and the accuracy at individual nodes

Let T be a subtree rooted at node v

Define: gain from pruning at v = (# misclassifications in T) − (# misclassifications at v, if v is turned into a leaf)

Reduced error pruning, Quinlan 87

22

Reduced error pruning

Repeat: prune at the node with the largest gain, until only nodes with negative gain remain

“Bottom-up restriction”: T can only be pruned if it does not contain a subtree with lower error than T
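A minimal sketch of the bottom-up variant of reduced error pruning (the node representation and names are my own; the slides' greedy largest-gain loop can be built on the same error counts):

def classify(node, x):
    # A node is a dict: 'attribute', 'children' (value -> subtree), 'majority'.
    if not node['children']:
        return node['majority']
    child = node['children'].get(x[node['attribute']])
    return classify(child, x) if child else node['majority']

def subtree_errors(node, pruning_set):
    return sum(1 for x, y in pruning_set if classify(node, x) != y)

def leaf_errors(node, pruning_set):
    return sum(1 for _, y in pruning_set if y != node['majority'])

def reduced_error_prune(node, pruning_set):
    # Bottom-up: prune the children first, then prune here if the gain
    # (# misclassifications in T) - (# misclassifications at v) is >= 0.
    for value, child in node['children'].items():
        subset = [(x, y) for x, y in pruning_set if x[node['attribute']] == value]
        reduced_error_prune(child, subset)
    if node['children'] and leaf_errors(node, pruning_set) <= subtree_errors(node, pruning_set):
        node['children'] = {}          # replace the subtree by a leaf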

23

Reduced error pruning

• Theorem (Esposito, Malerba, Semeraro 1997):

REP with the bottom-up restriction finds the smallest, most accurate subtree w.r.t. the pruning set.

24

Minimal Error Pruning (MEP)
Niblett and Bratko 86; Cestnik and Bratko 91

Does not require a pruning set for estimating error

Estimates error on new data directly from “growing set”, using the Bayesian method for probability estimation (e.g. Laplace estimate or m-estimate)

Main principle: Prune so that estimated classification error is minimal

25

Minimal Error Pruning

• Deciding about pruning at node v, the root of a tree T

[Figure: node v with branches of probabilities p1, p2, ... leading to subtrees T1, T2, ...]

• E(T) = error of the optimally pruned tree T

26

Static and backed-up errors

Define: static error at v: e(v) = 1 − p( class C | v ), where C is the most likely class at v

If T pruned at v then its error is e(v).

If T not pruned at v then its (backed-up) error is: p1 E(T1) + p2 E(T2) + ...

27

Minimal error pruning

Decision whether to prune or not:

Prune if static error ≤ backed-up error:

E(T) = min( e(v), Σi pi E(Ti) )
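A minimal recursive sketch of this rule (node layout and names are mine; static_error stands for whatever probability-based estimate of e(v) is chosen on the next slides):

def minimal_error_prune(node, static_error):
    # Returns E(T). A node is a dict with 'children' (list of subtrees),
    # 'n_examples', and whatever static_error(node) needs to compute e(v).
    e_static = static_error(node)
    if not node['children']:
        return e_static
    n = sum(child['n_examples'] for child in node['children'])
    backed_up = sum(child['n_examples'] / n * minimal_error_prune(child, static_error)
                    for child in node['children'])
    if e_static <= backed_up:          # prune: static error not worse than backed-up
        node['children'] = []
        return e_static
    return backed_up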

28

Minimal error pruning

Main question: How to estimate the static errors e(v)?
Use the Laplace or m-estimate of probability

At a node v: N examples, nC of them in the majority class C

29

Laplace probability estimate

pC = ( nC + 1 ) / ( N + k )

where k is the number of classes.

Problems with Laplace:
  – Assumes all classes are a priori equally likely
  – Degree of pruning depends on the number of classes

30

m-estimate of probability

pC = ( nC + pCa · m ) / ( N + m )

where: pCa = a priori probability of class C
       m = a non-negative parameter, tuned by the expert
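A small sketch of both estimates from the last two slides (function names are mine):

def laplace_estimate(n_c, N, k):
    # p_C = (n_C + 1) / (N + k), with k classes
    return (n_c + 1) / (N + k)

def m_estimate(n_c, N, p_prior, m):
    # p_C = (n_C + p_prior * m) / (N + m)
    return (n_c + p_prior * m) / (N + m)

# 7 of 10 examples at a node belong to the majority class, two classes:
print(laplace_estimate(7, 10, 2))            # 8/12 = 0.667
print(m_estimate(7, 10, p_prior=0.5, m=2))   # the same value: Laplace is the
                                             # m-estimate with p_prior = 1/k, m = k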

31

m-estimate

Important points:

  – Takes into account prior probabilities
  – Pruning not sensitive to the number of classes
  – Varying m gives a series of differently pruned trees
  – Choice of m depends on confidence in the data

32

m-estimate in pruning

Choice of m:

Low noise → low m → little pruning
High noise → high m → much pruning

Note: Using the m-estimate is as if the examples at a node were a random sample, which they are not. Suitably adjusting m compensates for this.

33

Some other pruning methods

• Error-complexity pruning, Breiman et al. 84 (CART)

• Pessimistic error pruning, Quinlan 87

• Error-based pruning, Quinlan 93 (C4.5)

34

Error-complexity pruning
Breiman et al. 1984, program CART

Considers:

• Error rate on the "growing" set
• Size of tree
• Error rate on the "pruning" set
• Minimise error and complexity, i.e. find a compromise between error and size

35

A subtree T with root v:

• R(v) = # errors on the "growing" set at node v (v treated as a leaf)
• R(T) = # errors on the "growing" set of tree T
• NT = # leaves in T
• Total cost = Error cost + Complexity cost
• Total cost = R + α·NT

36

Error complexity cost

• Total cost = Error cost + Complexity cost
• Total cost = R + α·N, where α = complexity cost per leaf

37

Pruning at v

• Cost of T (T unpruned) = R(T) + α·NT

• Cost of v (T pruned at v) = R(v) + α

• When the costs of T and v are equal: α = ( R(v) − R(T) ) / ( NT − 1 ) = reduction of error per leaf

38

Pruning algorithm

• Compute α for each node in the unpruned tree

• Repeat: prune the subtree with the smallest α, until only the root is left

• This gives a series of increasingly pruned trees; estimate their accuracy
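A sketch of this weakest-link loop (the tree representation and names are mine; each node stores 'errors', its growing-set errors R(v) when treated as a leaf):

def leaves(node):
    return [node] if not node['children'] else [l for c in node['children'] for l in leaves(c)]

def internal_nodes(node):
    if node['children']:
        yield node
        for child in node['children']:
            yield from internal_nodes(child)

def alpha(node):
    # alpha = (R(v) - R(T)) / (N_T - 1): error increase per leaf removed
    r_t = sum(l['errors'] for l in leaves(node))
    return (node['errors'] - r_t) / (len(leaves(node)) - 1)

def error_complexity_prune(root):
    # Repeatedly collapse the subtree with the smallest alpha; yield the
    # resulting series of increasingly pruned trees (here: their sizes).
    while root['children']:
        weakest = min(internal_nodes(root), key=alpha)
        weakest['children'] = []
        yield len(leaves(root))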

39

Selecting best pruned tree

• Finally select the "best" tree from this series

• Select the smallest tree within 1 standard error of minimum error (1-SE rule)

• Standard error = sqrt( Rmin * (1-Rmin) / #exs)
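A sketch of the 1-SE selection (the candidate sizes and error rates below are made up for illustration):

import math

def one_se_rule(candidates, n_examples):
    # candidates: list of (tree_size, error_rate) measured on the pruning set
    r_min = min(err for _, err in candidates)
    se = math.sqrt(r_min * (1 - r_min) / n_examples)
    within = [c for c in candidates if c[1] <= r_min + se]
    return min(within, key=lambda c: c[0])       # smallest tree within 1 SE

print(one_se_rule([(150, 0.30), (40, 0.27), (15, 0.29), (3, 0.40)], 200))  # (15, 0.29)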

40

Comments

• Note: Cost complexity pruning limits selection to a subset of all possible pruned trees.

• Consequence: Best pruned tree may be missed

• Two ways of estimating error on new data: (a) using pruning set (b) using cross-validation in a rather complicated way

41

Comments

• 1-SE rule tends to overprune
• Simply choosing the minimum-error tree ("0-SE rule") performs better in experiments
• Error estimation with cross-validation is complicated and based on a debatable assumption

42

Selecting best tree

• Using a pruning set: measure the error of the candidate pruned trees on the pruning set
• Select the smallest tree within 1 standard error of the minimum error

43

Comparison of pruning methods (Esposito, Malerba, Semeraro 96, IEEE Trans.)

• Experiments with 14 data sets from UCI repository

• Results: Does pruning improve accuracy?
  – Generally yes
  – But the effects of pruning also depend on the domain: in most domains pruning improves accuracy, in some it does not, and in very few it worsens accuracy

44

Pruning in rule learning

• Ideas from pruning decision trees can be adapted to the learning of if-then rules

• Pre-pruning and post-pruning can be combined, and the reduced error pruning idea applies

• Fürnkranz (1997) reviews several approaches and evaluates them experimentally

45

Estimating Probabilities

• Setup

– n experiments (n = r + s)
– r successes
– s failures

• How likely is it that the next experiment will be a success?

• Estimate with relative frequency: r / n

46

Relative Frequency

• Works when we have many experiments, but not with small samples
• Consider
  – flipping a coin
  – we flip a coin twice; both times it comes up heads
  – what is the probability of heads on the next flip?

• A probability of 1.0 (= 2/2) seems unreasonable

47

Coins and mushrooms

• Probability of heads = ?
• Probability that a mushroom is edible = ?
• Make one, two, ... experiments
• Interpret the results in terms of probability
• Relative frequency does not work well

48

Coins and mushrooms

• We need to consider prior expectations
• Prior prob. = 1/2, in both cases not unreasonable
• But is this enough?
• Intuition says: our probability estimates for coins and mushrooms are still different
• The difference lies in the prior probability distribution
• What are sensible prior distributions for coins and for mushrooms?

49

Bayesian Procedure for Estimating Probabilities

• Assume initial probability distribution (prior distribution)

• Based on some evidence E, update this distribution to obtain posterior distribution

• Compute the expected value over posterior distribution. Variance of posterior distribution is related to certainty of this estimate

50

Bayes Formula

• Bayesian process takes prior probability and combines it with new evidence to obtain updated (posterior) probability

51

Bayes in estimating probabilities

• Form of hypothesis H is: P(event) = x

• So: P( H | E) = P( P(event)=x | E)

• That is: the probability that the probability is x
• May appear confusing!

52

Bayes update of probability

P( P(event)=x | E ) = P( P(event)=x ) · P( E | P(event)=x ) / P( E )

Posterior prob. density on the left; prior prob. density: P( P(event)=x )

53

Expected probability

• Expected value of prob. of event: P( event | E )

• P( event | E ) = ∫₀¹ x · p( x | E ) dx, i.e. x weighted by the posterior probability density over [0, 1]

54

Bayes update of probability

Prior prob. distribution
  ↓  Bayes update with evidence E
Posterior prob. distribution
  ↓
Expected value, variance

55

Update of density: Example

[Figure: uniform prior density on [0, 1] and the posterior density on [0, 1] after the update]

56

Choice of Prior Distribution: Beta Distribution β(a, b)

57

Bayesian Update of Beta Distributions

• Let the prior distribution be β(a, b)
• Assume experimental results
  – s successful outcomes
  – f failures

• The updated Beta distribution is β(a+s, b+f)

• Beta probability distributions have a nice mathematical property:

  Beta distribution → Bayes update → Beta distribution
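A small sketch of this update with scipy (assuming scipy is available), using the two-heads coin example from earlier:

from scipy.stats import beta

a, b = 1, 1                  # uniform prior
s, f = 2, 0                  # evidence: two flips, both heads

posterior = beta(a + s, b + f)          # Beta(3, 1)
print(posterior.mean())                 # expected probability of heads = 0.75
print(posterior.var())                  # variance reflects certainty of the estimate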

58

m-estimate of probability

• Cestnik, 1991
  – Replace parameters a and b with m and pa
  – pa is the prior probability

• Assume: N experiments, n positive outcomes

59

m-estimate of probability

p = ( n + m · pa ) / ( N + m ) = N/(N+m) · n/N + m/(N+m) · pa

a weighted combination of the relative frequency n/N and the prior probability pa
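A numeric check of this decomposition (values chosen to match the earlier two-heads coin example):

n, N = 2, 2                  # two positive outcomes in two trials
p_a, m = 0.5, 2              # prior probability and its weight

m_est = (n + m * p_a) / (N + m)
weighted = N / (N + m) * (n / N) + m / (N + m) * p_a
print(m_est, weighted)       # both 0.75: relative frequency 1.0 pulled toward the prior 0.5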

60

Choosing the prior probability distribution

• If we know the prior probability and variance, this determines a, b and m, pa

• A domain expert chooses the prior distribution, defined either by a, b or by m, pa

• m, pa may be more practical than a, b

• The expert hopefully has some idea about pa and m:
  – low variance, more confidence in pa → large m
  – high variance, less confidence in pa → small m

61

Laplace Probability Estimate

• For a problem with two outcomes

• Assumes a prior probability distribution of β(1, 1), i.e. uniform

• Also equals the m-estimate with pa = 1/k and m = k, where k = 2

62

Using domain knowledge to improve accuracy

• If domain-specific knowledge is available prior to learning it may provide useful additional constraints

• Additional constraints may alleviate problems with noise

• One approach is Q2 learning that uses qualitative constraints in numerical learning

63

Q2 Learning
Vladušič, Šuc and Bratko 2004

• Q2 learning, Qualitatively faithful Quantitative learning

• Learning from numerical data is guided by qualitative constraints

• Resulting numerical model fits learning data numerically and respects given qualitative model

• Qualitative model can be provided by domain expert, or induced from data

64

Qualitative difficulties of numerical learning

Learn the time behavior of the water level: h = f( t, initial_outflow )

[Figure: water tank with an outflow; measured water level h over time t]

Predicting water level with M5

[Figure: M5 predictions of the water level h over time, for several values of initial_outflow]

Qualitatively incorrect – the water level cannot increase

66

Predicting water level with Q2

[Figure: Q2 predictions of the water level compared with the true values]
