Page 1: LEARNING FROM NOISY DATA

1

LEARNING FROM NOISY DATA

Ivan Bratko
University of Ljubljana

Slovenia

ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY

Acknowledgement: Thanks to Blaz Zupan for his contribution to these slides

Page 2: LEARNING FROM NOISY DATA

2

Overview

• Learning from noisy data
• Idea of tree pruning
• How to prune optimally
• Methods for tree pruning
• Estimating probabilities

Page 3: LEARNING FROM NOISY DATA

3

Learning from Noisy Data

• Sources of “noise”
  – Errors in measurements, errors in data encoding, errors in examples, missing values

• Problems
  – Complex hypothesis
  – Poor comprehensibility
  – Overfitting: hypothesis overfits the data
  – Low classification accuracy on new data

Page 4: LEARNING FROM NOISY DATA

4

Fitting data

[Figure: data points in the (x, y) plane with a smooth fitted curve]

The curve looks good, but does not fit the data exactly!

What is the relation between x and y, y = y(x)? How can we predict y from x?

Page 5: LEARNING FROM NOISY DATA

5

Overfitting data

[Figure: the same data points, fitted by a curve that passes exactly through every point]

What is the relation between x and y, y = y(x)? How can we predict y from x?

This curve makes no error on the training data! But how about predicting new cases?

Page 6: LEARNING FROM NOISY DATA

6

Overfitting in Extreme

• Let default accuracy be the probability of the majority class

• Overfitting may result in accuracy lower than the default

• Example
  – Attributes have no correlation with the class (i.e., 100% noise)
  – Two classes: c1, c2
  – Class probabilities: p(c1) = 0.7, p(c2) = 0.3
  – Default accuracy = 0.7

Page 7: LEARNING FROM NOISY DATA

7

Overfitting in Extreme

A decision tree grown until there is one example per leaf: a leaf labelled c1 classifies a new case correctly with probability 0.7, a leaf labelled c2 with probability 0.3.

Expected accuracy = 0.7 × 0.7 + 0.3 × 0.3 = 0.58

0.58 < 0.7
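In general, with class probabilities p1, ..., pk and attributes that carry no information, the expected accuracy of a fully overfitted tree is the sum of pi², while the default classifier achieves max pi. A minimal Python sketch of this comparison (illustrative only, not part of the original slides):

# Minimal sketch: expected accuracy of a fully overfitted tree under pure
# class noise, compared with the default (majority-class) accuracy.
def overfit_vs_default(class_probs):
    expected_overfit = sum(p * p for p in class_probs)   # sum_i p_i^2
    default = max(class_probs)                           # majority class
    return expected_overfit, default

print(overfit_vs_default([0.7, 0.3]))   # (0.58, 0.7): the overfitted tree loses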

Page 8: LEARNING FROM NOISY DATA

8

Pruning of Decision Trees

• Means of handling noise in tree learning

• After pruning the accuracy on previously unseen examples may increase

Page 9: LEARNING FROM NOISY DATA

9

Typical Example from Practice: Locating Primary Tumor

Data set
– 20 classes
– Default classifier: 24.7%

                  # nodes   accuracy
Overfitted tree     150       41%
Pruned tree          15       45%
Experts               -       42%

Page 10: LEARNING FROM NOISY DATA

10

Effects of Pruning

[Figure ("credit" data set): accuracy (y-axis, 0.65 to 1.0) plotted against the value of m (x-axis, 0.1 to 1000); one curve shows accuracy on the training set, the other accuracy on the test set; small m gives bigger trees, large m gives smaller trees]

Page 11: LEARNING FROM NOISY DATA

12

How to Prune Optimally?

• Main questions
  – How much pruning?
  – Where to prune?
  – Large number of candidate pruned trees!

• Typical relation between tree size and accuracy on new data (see the figure below)

• Main difficulty in pruning: this curve is not known!

[Figure: accuracy as a function of tree size]

Page 12: LEARNING FROM NOISY DATA

13

Two Kinds of Pruning

Pre-pruning (forward pruning)

Post pruning

Page 13: LEARNING FROM NOISY DATA

14

Forward Pruning

• Stop expanding trees if benefits of potential sub-trees seem dubious
  – Information gain low
  – Number of examples very small
  – Example set statistically insignificant
  – Etc.

Page 14: LEARNING FROM NOISY DATA

15

Forward Pruning Inferior

• Myopic
• Depends on parameters which are hard (impossible?) to guess
• Example (see the sketch below):

[Figure: example with two attributes x1, x2 and two classes a, b]
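The slides only sketch this example; a common illustration of why forward pruning is myopic is an XOR-like data set, assumed here: each attribute alone has zero information gain, so forward pruning would stop immediately, yet x1 and x2 together determine the class.

# Hedged sketch (XOR-like data assumed for illustration): each attribute
# alone gives zero information gain, although together they determine the class.
from collections import Counter
from math import log2

data = [((0, 0), 'a'), ((0, 1), 'b'), ((1, 0), 'b'), ((1, 1), 'a')]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    labels = [c for _, c in examples]
    gain = entropy(labels)
    for v in set(x[attr] for x, _ in examples):
        subset = [c for x, c in examples if x[attr] == v]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

print(info_gain(data, 0), info_gain(data, 1))   # both 0.0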

Page 15: LEARNING FROM NOISY DATA

16

Pre and Post Pruning

• Forward pruning considered inferior and myopic

• Post pruning makes use of sub-trees and in this way reduces the complexity

Page 16: LEARNING FROM NOISY DATA

17

Post pruning

• Main idea: prune unreliable parts of tree

• Outline of pruning procedure: start at bottom of tree, proceed upward; that is: prune unreliable subtrees

• Main question: How to know whether a subtree is unreliable? Will accuracy improve after pruning?

Page 17: LEARNING FROM NOISY DATA

18

Estimating accuracy of subtree

• One idea: Use special test data set (“pruning set”)

• This is OK if a sufficient amount of learning data is available

• In case of a shortage of data: try to estimate accuracy directly from the learning data

Page 18: LEARNING FROM NOISY DATA

19

Partitioning data in tree learning

All available data
  ├─ Training set
  │    ├─ Growing set
  │    └─ Pruning set
  └─ Test set

• Typical proportions: training set 70%, test set 30%; within the training set, growing set 70%, pruning set 30%
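A minimal sketch of such a partition with the proportions from the slide (the function name and seed are illustrative, not from the course material):

# Minimal sketch of the partition described above; splits and seed are illustrative.
import random

def partition(examples, train_frac=0.7, grow_frac=0.7, seed=0):
    examples = examples[:]
    random.Random(seed).shuffle(examples)
    n_train = int(train_frac * len(examples))
    training, test = examples[:n_train], examples[n_train:]
    n_grow = int(grow_frac * len(training))
    growing, pruning = training[:n_grow], training[n_grow:]
    return growing, pruning, test

growing, pruning, test = partition(list(range(100)))
print(len(growing), len(pruning), len(test))   # 49 21 30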

Page 19: LEARNING FROM NOISY DATA

20

Estimating accuracy with pruning set

Accuracy of a hypothesis on new data = probability of correct classification of a new example

Accuracy of a hypothesis on new data ≈ proportion of correctly classified examples in the pruning set

Error of a hypothesis = probability of misclassification of a new example

Drawback of using a pruning set: less data for the “growing set”

Page 20: LEARNING FROM NOISY DATA

21

Reduced Error Pruning (Quinlan 87)

Use the pruning set to estimate the accuracy of sub-trees and the accuracy at individual nodes.

Let T be a sub-tree rooted at node v.

Define: gain from pruning at v = (# misclassifications in T) − (# misclassifications at v), both counted on the pruning set

Page 21: LEARNING FROM NOISY DATA

22

Reduced error pruning

Repeat: prune at the node with the largest gain, until only nodes with negative gain remain.

“Bottom-up restriction”: T can only be pruned if it does not contain a sub-tree with lower error than T
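A minimal sketch of reduced error pruning over a simple tree structure; the Node layout, the handling of unseen attribute values, and the single bottom-up pass (instead of repeatedly pruning at the largest-gain node) are simplifying assumptions made for the example, not taken from the slides.

# Hedged sketch of reduced error pruning; Node layout and the single
# bottom-up pass are simplifications made for illustration.
class Node:
    def __init__(self, attr=None, children=None, majority=None):
        self.attr = attr                 # attribute tested at this node (None for a leaf)
        self.children = children or {}   # attribute value -> child Node
        self.majority = majority         # majority class of growing-set examples here

def classify(node, x):
    while node.children:
        node = node.children.get(x.get(node.attr), next(iter(node.children.values())))
    return node.majority

def errors(node, pruning_set):
    # number of misclassifications of the (sub-)tree on the pruning set
    return sum(classify(node, x) != y for x, y in pruning_set)

def reduced_error_prune(node, pruning_set):
    if not node.children:
        return node
    for v, child in node.children.items():              # prune sub-trees first (bottom-up)
        subset = [(x, y) for x, y in pruning_set if x.get(node.attr) == v]
        node.children[v] = reduced_error_prune(child, subset)
    gain = errors(node, pruning_set) - sum(y != node.majority for _, y in pruning_set)
    if gain >= 0:                                        # pruning does not hurt on the pruning set
        return Node(majority=node.majority)
    return node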

Page 22: LEARNING FROM NOISY DATA

23

Reduced error pruning

• Theorem (Esposito, Malerba, Semeraro 1997):

REP with the bottom-up restriction finds the smallest most accurate sub-tree w.r.t. the pruning set.

Page 23: LEARNING FROM NOISY DATA

24

Minimal Error Pruning (MEP)
Niblett and Bratko 86; Cestnik and Bratko 91

Does not require a pruning set for estimating error

Estimates error on new data directly from “growing set”, using the Bayesian method for probability estimation (e.g. Laplace estimate or m-estimate)

Main principle: Prune so that estimated classification error is minimal

Page 24: LEARNING FROM NOISY DATA

25

Minimal Error Pruning

• Deciding about pruning at node v, the root of a sub-tree T with sub-trees T1, T2, ..., reached with probabilities p1, p2, ...

• E(T) = error of the optimally pruned tree T

Page 25: LEARNING FROM NOISY DATA

26

Static and backed-up errors

Define: static error at v: e(v) = 1 − p( class C | v ), where C is the most likely class at v

If T is pruned at v, then its error is e(v).

If T is not pruned at v, then its (backed-up) error is: p1 E(T1) + p2 E(T2) + ...

Page 26: LEARNING FROM NOISY DATA

27

Minimal error pruning

Decision whether to prune or not:

Prune if static error ≤ backed-up error:

E(T) = min( e(v), Σi pi E(Ti) )
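A minimal sketch of this recursion (the node attributes and the error-estimate callback are illustrative assumptions; the slides fix only the min rule, and e(v) is estimated as described on the following slides):

# Hedged sketch of the minimal error pruning recursion.  Assumes each node
# stores its children (dict) and its number of growing-set examples
# (n_examples); static_error(node) returns e(v), e.g. via a Laplace or
# m-estimate (see the following slides).
def mep_error(node, static_error):
    e_static = static_error(node)
    if not node.children:
        return e_static
    n = sum(child.n_examples for child in node.children.values())
    backed_up = sum(child.n_examples / n * mep_error(child, static_error)
                    for child in node.children.values())
    if e_static <= backed_up:        # static error not worse than backed-up error: prune
        node.children = {}
        return e_static
    return backed_up                 # keep the sub-tree; E(T) is the backed-up error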

Page 27: LEARNING FROM NOISY DATA

28

Minimal error pruning

Main question: How to estimate the static errors e(v)?
Use the Laplace or m-estimate of probability.

At a node v: N examples, nC of them in the majority class C

Page 28: LEARNING FROM NOISY DATA

29

Laplace probability estimate

pC = ( nC + 1 ) / ( N + k )

where k is the number of classes.

Problems with Laplace:
– Assumes all classes a priori equally likely
– Degree of pruning depends on the number of classes

Page 29: LEARNING FROM NOISY DATA

30

m-estimate of probability

pC = ( nC + pCa · m ) / ( N + m )

where: pCa = a priori probability of class C
       m = a non-negative parameter, tuned by the expert
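A minimal sketch of the two probability estimates (function names and the example numbers are illustrative, not from the slides):

# Minimal sketch of the Laplace and m-estimates; names and numbers illustrative.
def laplace_estimate(n_c, n, k):
    """n_c examples of class C out of n, with k classes."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, p_prior, m):
    """m-estimate with prior class probability p_prior and parameter m."""
    return (n_c + p_prior * m) / (n + m)

# Static error at a node with N = 10 examples, n_C = 8 in the majority class,
# prior probability 0.7 of that class and m = 2:
p = m_estimate(8, 10, p_prior=0.7, m=2)
print(1 - p)   # estimated static error e(v) = 1 - p, approx. 0.217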

Page 30: LEARNING FROM NOISY DATA

31

m-estimate

Important points:

– Takes into account prior probabilities
– Pruning not sensitive to the number of classes
– Varying m: a series of differently pruned trees
– Choice of m depends on confidence in the data

Page 31: LEARNING FROM NOISY DATA

32

m-estimate in pruning

Choice of m:

Low noise → low m → little pruning
High noise → high m → much pruning

Note: Using the m-estimate is as if the examples at a node were a random sample, which they are not. Suitably adjusting m compensates for this.

Page 32: LEARNING FROM NOISY DATA

33

Some other pruning methods

• Error-complexity pruning, Breiman et al. 84 (CART)

• Pessimistic error pruning, Quinlan 87

• Error-based pruning, Quinlan 93 (C4.5)

Page 33: LEARNING FROM NOISY DATA

34

Error-complexity pruning
Breiman et al. 1984, program CART

Considers:

• Error rate on the "growing" set
• Size of tree
• Error rate on the "pruning" set
• Minimise error and complexity, i.e. find a compromise between error and size

Page 34: LEARNING FROM NOISY DATA

35

A sub-tree T with root v:

• R(v) = # errors on the "growing" set at node v
• R(T) = # errors on the "growing" set of tree T
• NT = # leaves in T
• Total cost = Error cost + Complexity cost
• Total cost = R + α·N

Page 35: LEARNING FROM NOISY DATA

36

Error complexity cost

• Total cost = Error cost + Complexity cost
• Total cost = R + α·N
• α = complexity cost per leaf

Page 36: LEARNING FROM NOISY DATA

37

Pruning at v

• Cost of T (T unpruned) = R(T) + α·NT

• Cost of v (T pruned at v) = R(v) + α

• When the costs of T and v are equal:  α = ( R(v) − R(T) ) / ( NT − 1 )  = reduction of error per leaf

Page 37: LEARNING FROM NOISY DATA

38

Pruning algorithm

• Compute α for each node in the unpruned tree

• Repeat: prune the sub-tree with the smallest α, until only the root is left

• This gives a series of increasingly pruned trees; estimate their accuracy
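A minimal sketch of the α computation (the node layout is an illustrative assumption: each node stores its children and R, its growing-set error count when used as a leaf):

# Hedged sketch: alpha for every internal node of the unpruned tree.
def n_leaves(node):
    return 1 if not node.children else sum(n_leaves(c) for c in node.children.values())

def subtree_errors(node):
    # R(T): growing-set errors of the sub-tree (sum over its leaves)
    return node.R if not node.children else sum(subtree_errors(c) for c in node.children.values())

def alphas(node, out=None):
    """Collect (alpha, node) pairs for all internal nodes."""
    if out is None:
        out = []
    if node.children:
        a = (node.R - subtree_errors(node)) / (n_leaves(node) - 1)
        out.append((a, node))
        for c in node.children.values():
            alphas(c, out)
    return out

# Repeatedly pruning at the node with the smallest alpha (and recomputing the
# alphas) produces the series of increasingly pruned trees mentioned above.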

Page 38: LEARNING FROM NOISY DATA

39

Selecting best pruned tree

• Finally select the "best" tree from this series

• Select the smallest tree within 1 standard error of minimum error (1-SE rule)

• Standard error = sqrt( Rmin * (1-Rmin) / #exs)
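For illustration only (numbers assumed, not from the slides): if the lowest error among the candidate trees is Rmin = 0.25, estimated on 300 pruning-set examples, then SE = sqrt( 0.25 × 0.75 / 300 ) ≈ 0.025, and the 1-SE rule selects the smallest tree whose estimated error does not exceed 0.25 + 0.025 = 0.275.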

Page 39: LEARNING FROM NOISY DATA

40

Comments

• Note: Cost complexity pruning limits selection to a subset of all possible pruned trees.

• Consequence: Best pruned tree may be missed

• Two ways of estimating error on new data: (a) using pruning set (b) using cross-validation in a rather complicated way

Page 40: LEARNING FROM NOISY DATA

41

Comments

• The 1-SE rule tends to overprune
• Simply choosing the min. error tree ("0-SE rule") performs better in experiments
• Error estimation with cross-validation is complicated and based on a debatable assumption

Page 41: LEARNING FROM NOISY DATA

42

Selecting best tree

• Using a pruning set: measure the error of the candidate pruned trees on the pruning set
• Select the smallest tree within 1 standard error of the minimum error

Page 42: LEARNING FROM NOISY DATA

43

Comparison of pruning methods (Esposito, Malerba, Semeraro 96, IEEE Trans.)

• Experiments with 14 data sets from UCI repository

Results: Does pruning improve accuracy? Generally yes. But the effects of pruning also depend on the domain:
– In most domains pruning improves accuracy, in some it does not, in very few it worsens accuracy

Page 43: LEARNING FROM NOISY DATA

44

Pruning in rule learning

• Ideas from pruning decision trees can be adapted to the learning of if-then rules

• Pre-pruning and post-pruning can be combined, and the reduced error pruning idea applies

• Furnkranz (1997) reviews several approaches and evaluates them experimentally

Page 44: LEARNING FROM NOISY DATA

45

Estimating Probabilities

• Setup

  – n experiments (n = r + s)
  – r successes
  – s failures

• How likely is it that the next experiment will be a success?

• Estimate with relative frequency: p ≈ r / n

Page 45: LEARNING FROM NOISY DATA

46

Relative Frequency

• Works when we have many experiments, but not with small samples
• Consider
  – flipping a coin
  – we flip a coin twice, and both times it comes up heads
  – what is the probability of heads on the next flip?

• A probability of 1.0 (= 2/2) seems unreasonable

Page 46: LEARNING FROM NOISY DATA

47

Coins and mushrooms

• Probability of heads = ?
• Probability that a mushroom is edible = ?
• Make one, two, ... experiments
• Interpret the results in terms of probability
• Relative frequency does not work well

Page 47: LEARNING FROM NOISY DATA

48

Coins and mushrooms

• We need to consider prior expectations
• Prior prob. = 1/2 is, in both cases, not unreasonable
• But is this enough?
• Intuition says: our probability estimates for coins and mushrooms are still different
• The difference lies in the prior probability distribution
• What are sensible prior distributions for coins and for mushrooms?

Page 48: LEARNING FROM NOISY DATA

49

Bayesian Procedure for Estimating Probabilities

• Assume an initial probability distribution (prior distribution)

• Based on some evidence E, update this distribution to obtain the posterior distribution

• Compute the expected value over the posterior distribution. The variance of the posterior distribution is related to the certainty of this estimate

Page 49: LEARNING FROM NOISY DATA

50

Bayes Formula

• The Bayesian process takes the prior probability and combines it with new evidence to obtain the updated (posterior) probability:

  P( H | E ) = P( H ) · P( E | H ) / P( E )

Page 50: LEARNING FROM NOISY DATA

51

Bayes in estimating probabilities

• Form of hypothesis H is: P(event) = x

• So: P( H | E) = P( P(event)=x | E)

• That is: the probability that the probability is x
• May appear confusing!

Page 51: LEARNING FROM NOISY DATA

52

Bayes update of probability

P( P(event)=x | E ) = P( P(event)=x ) · P( E | P(event)=x ) / P( E )

(the left-hand side is the posterior probability density, the first factor on the right-hand side the prior probability density)

Page 52: LEARNING FROM NOISY DATA

53

Expected probability

• Expected value of prob. of event: P( event | E )

• P( event | E ) = integral over [0, 1] of x weighted by the posterior probability density, i.e. ∫ x · f( x | E ) dx

Page 53: LEARNING FROM NOISY DATA

54

Bayes update of probability

Prior prob. distribution  →  Bayes update with evidence E  →  Posterior prob. distribution  →  Expected value, variance

Page 54: LEARNING FROM NOISY DATA

55

Update of density: Example

[Figure: a uniform prior density on [0, 1], and the posterior density on [0, 1] after a Bayes update]

Page 55: LEARNING FROM NOISY DATA

56

Choice of Prior Distribution:
Beta Distribution Beta(a, b)

Page 56: LEARNING FROM NOISY DATA

57

Bayesian Update of Beta Distributions

• Let the prior distribution be Beta(a, b)
• Assume experimental results
  – s successful outcomes
  – f failures

• The updated distribution is Beta(a+s, b+f)

• Beta probability distributions have a nice mathematical property: a Bayes update of a Beta distribution is again a Beta distribution
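A minimal sketch of the update and the resulting point estimate, applied to the earlier coin example (the uniform prior Beta(1, 1) and the numbers are illustrative):

# Minimal sketch of the Beta update; prior and observations are illustrative.
def beta_update(a, b, s, f):
    """Prior Beta(a, b) with s successes and f failures -> posterior Beta(a+s, b+f)."""
    return a + s, b + f

def beta_mean(a, b):
    """Expected value of a Beta(a, b) distribution."""
    return a / (a + b)

# Coin example: uniform prior Beta(1, 1), two heads and no tails observed.
a, b = beta_update(1, 1, s=2, f=0)
print(beta_mean(a, b))   # 0.75, instead of the unreasonable relative frequency 2/2 = 1.0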

Page 57: LEARNING FROM NOISY DATA

58

m-estimate of probability

• Cestnik, 1991
  – Replace parameters a and b with m and pa
  – pa is the prior probability

• Assume: N experiments, n positive outcomes

Page 58: LEARNING FROM NOISY DATA

59

m-estimate of probability

p = ( n + pa · m ) / ( N + m )  =  ( N / (N + m) ) · ( n / N )  +  ( m / (N + m) ) · pa

i.e. a weighted combination of the relative frequency n / N and the prior probability pa

Page 59: LEARNING FROM NOISY DATA

60

Choosing prior probability distribution

• If we know the prior probability and its variance, this determines a, b and m, pa

• A domain expert can choose the prior distribution, defined either by a, b or by m, pa

• m, pa may be more practical than a, b

• The expert hopefully has some idea about pa and m:

  – low variance, more confidence in pa → large m

  – high variance, less confidence in pa → small m

Page 60: LEARNING FROM NOISY DATA

61

Laplace Probability Estimate

• For a problem with two outcomes: p = ( n + 1 ) / ( N + 2 )

• Assumes a prior probability distribution of Beta(1, 1), i.e. the uniform distribution

• Also equals the m-estimate with pa = 1/k and m = k, where k = 2

Page 61: LEARNING FROM NOISY DATA

62

Using domain knowledge to improve accuracy

• If domain-specific knowledge is available prior to learning, it may provide useful additional constraints

• Additional constraints may alleviate problems with noise

• One approach is Q2 learning, which uses qualitative constraints in numerical learning

Page 62: LEARNING FROM NOISY DATA

63

Q2 Learning
Vladušič, Šuc and Bratko 2004

• Q2 learning, Qualitatively faithful Quantitative learning

• Learning from numerical data is guided by qualitative constraints

• The resulting numerical model fits the learning data numerically and respects the given qualitative model

• The qualitative model can be provided by a domain expert, or induced from data

Page 63: LEARNING FROM NOISY DATA

64

Qualitative difficulties of numerical learning

Learn the time behaviour of the water level: h = f( t, initial_outflow )

[Figure: a draining water container with level h and an outflow, and the resulting curve of h against time t]

Page 64: LEARNING FROM NOISY DATA

65

Predicting water level with M5

[Figure: M5 predictions of the water level h (0 to 100) over time t = 1 ... 19, one curve per value of initial_outflow (6.25, 7.5, 8.75, 10.0, 11.25, 12.5)]

Qualitatively incorrect – the water level cannot increase

Page 65: LEARNING FROM NOISY DATA

66

Predicting water level with Q2

[Figure: Q2 predictions of the water level compared with the true values]