
Page 1: Data Mining and Machine Learning

Data Mining and Machine Learning

Naive Bayes
David Corne, HWU
[email protected]

Page 2: Data Mining and Machine Learning

A very simple dataset – one field / one class

P34 level    Prostate cancer
High         Y
Medium       Y
Low          Y
Low          N
Low          N
Medium       N
High         Y
High         N
Low          N
Medium       Y

Page 3: Data Mining and Machine Learning

A very simple dataset – one field / one class


A new patient has a blood test – his P34 level is HIGH.

What is our best guess for prostate cancer?

Page 4: Data Mining and Machine Learning

A very simple dataset – one field / one class


It’s useful to know: P(cancer = Y)

Page 5: Data Mining and Machine Learning

A very simple dataset – one field / one class


It’s useful to know: P(cancer = Y)

- on the basis of this tiny dataset, P(c = Y) is 5/10 = 0.5

Page 6: Data Mining and Machine Learning

A very simple dataset – one field / one class


It’s useful to know: P(cancer = Y)

- on the basis of this tiny dataset, P(c = Y) is 5/10 = 0.5

So, with no other info you’d expect P(cancer=Y) to be 0.5

Page 7: Data Mining and Machine Learning

A very simple dataset – one field / one class


But we know that P34 = H, so actually we want: P(cancer=Y | P34 = H)

- the prob that cancer is Y, given that P34 is high

Page 8: Data Mining and Machine Learning

A very simple dataset – one field / one class


P(cancer=Y | P34 = H)

- the prob that cancer is Y, given that P34 is high

- this seems to be 2/3 ≈ 0.67

Page 9: Data Mining and Machine Learning

A very simple dataset – one field / one class


So we have:

P(c=Y | P34 = H) = 0.67
P(c=N | P34 = H) = 0.33

The class value with the highest probability is our best guess.
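A minimal sketch of this "probability by counting" idea in Python (the list literal just mirrors the toy table above):

```python
from collections import Counter

# The toy dataset: (P34 level, prostate cancer) pairs.
data = [("High", "Y"), ("Medium", "Y"), ("Low", "Y"), ("Low", "N"),
        ("Low", "N"), ("Medium", "N"), ("High", "Y"), ("High", "N"),
        ("Low", "N"), ("Medium", "Y")]

def class_probs_given(p34_value):
    """Estimate P(class | P34 = p34_value) by counting matching rows."""
    matching = [cls for p34, cls in data if p34 == p34_value]
    return {cls: n / len(matching) for cls, n in Counter(matching).items()}

probs = class_probs_given("High")
print(probs)                        # {'Y': 0.666..., 'N': 0.333...}
print(max(probs, key=probs.get))    # 'Y' -- the best guess
```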

Page 10: Data Mining and Machine Learning

In general we may have any number of class values

P34 level    Prostate cancer
High         Y
Medium       Y
Low          Y
Low          N
Low          N
Medium       N
High         Y
High         N
High         Maybe
Medium       Y

Suppose again we know that P34 is High; here we have:

P(c=Y | P34 = H) = 0.5
P(c=N | P34 = H) = 0.25
P(c=Maybe | P34 = H) = 0.25

... and again, Y is the winner

Page 11: Data Mining and Machine Learning

That is the essence of Naive Bayes, but: the probability calculations are much trickier when there is more than one field, so we make a 'naive' assumption that makes it simpler.

Page 12: Data Mining and Machine Learning

Bayes’ theorem


As we saw, with this dataset we are illustrating:

P(cancer = Y | P34 = H)

Page 13: Data Mining and Machine Learning

Bayes’ theorem


And now we are illustrating

P(P34 = H | cancer = Y)

This is a different thing, which turns out to be 2/5 = 0.4

Page 14: Data Mining and Machine Learning

Bayes’ theorem is this:

P(A | B) = P(B | A) × P(A) / P(B)

It is very useful when it is hard to get P(A | B) directly, but easier to get the things on the right

Page 15: Data Mining and Machine Learning

Bayes’ theorem in 1-non-class-field DMML context:

P(Class = X | Fieldval = F) = P(Fieldval = F | Class = X) × P(Class = X) / P(Fieldval = F)

Page 16: Data Mining and Machine Learning


We want to check this for each class and choose the class that gives the highest value.

Page 17: Data Mining and Machine Learning


E.g. we compare:

P(Fieldval | Yes) × P(Yes)
P(Fieldval | No) × P(No)
P(Fieldval | Maybe) × P(Maybe)

... we can ignore "P(Fieldval = F)" ... why?
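Why? Because P(Fieldval = F) is the same number whichever class we are scoring, so dividing by it rescales all the scores equally and cannot change which class wins. A tiny check, with made-up numbers:

```python
# Made-up numerators P(F | class) * P(class) for three classes.
numerators = {"Yes": 0.4 * 0.5, "No": 0.2 * 0.3, "Maybe": 0.1 * 0.2}

# The shared denominator: P(F) = sum over classes of P(F | c) * P(c).
p_f = sum(numerators.values())
posteriors = {c: v / p_f for c, v in numerators.items()}

# Same winner whether or not we divide by P(F).
assert max(numerators, key=numerators.get) == max(posteriors, key=posteriors.get)
```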

Page 18: Data Mining and Machine Learning

And that was exactly how we do Naive Bayes for a 1-field dataset.

Page 19: Data Mining and Machine Learning

Note how this relates to our beloved histograms

[Histogram view of the one-field dataset: for each class, bars over P34 = Low / Med / High on a 0–0.6 probability scale, picking out e.g. P(Low | N) and P(High | Y).]

Page 20: Data Mining and Machine Learning

Naïve Bayes with many fields

P34 level   P61 level   BMI      Prostate cancer
High        Low         Medium   Y
Medium      Low         Medium   Y
Low         Low         High     Y
Low         High        Low      N
Low         Low         Low      N
Medium      Medium      Low      N
High        Low         Medium   Y
High        Medium      Low      N
Low         Low         High     N
Medium      High        High     Y

Page 21: Data Mining and Machine Learning

Naïve Bayes with many fields


New patient: P34 = M, P61 = M, BMI = H

Best guess at cancer field?

Page 22: Data Mining and Machine Learning

Naïve Bayes with many fields


New patient: P34 = M, P61 = M, BMI = H

Best guess at cancer field?

P(p34=M | Y) × P(p61=M | Y) × P(BMI=H | Y) × P(cancer = Y)
P(p34=M | N) × P(p61=M | N) × P(BMI=H | N) × P(cancer = N)

which of these gives the highest value?


Page 27: Data Mining and Machine Learning

Naïve Bayes with many fields


New patient: P34 = M, P61 = M, BMI = H

Best guess at cancer field?

0.4 × 0 × 0.4 × 0.5 = 0
0.2 × 0.4 × 0.2 × 0.5 = 0.008

which of these gives the highest value?
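A minimal sketch of where those two numbers come from, just counting in the toy table above (the H/M/L and Y/N codes are the obvious abbreviations):

```python
# The many-fields toy dataset: (P34, P61, BMI, cancer).
rows = [("H","L","M","Y"), ("M","L","M","Y"), ("L","L","H","Y"),
        ("L","H","L","N"), ("L","L","L","N"), ("M","M","L","N"),
        ("H","L","M","Y"), ("H","M","L","N"), ("L","L","H","N"),
        ("M","H","H","Y")]

def cond_prob(i, v, c):
    """P(field i = v | class = c), estimated by counting."""
    in_class = [r for r in rows if r[3] == c]
    return sum(1 for r in in_class if r[i] == v) / len(in_class)

def prior(c):
    return sum(1 for r in rows if r[3] == c) / len(rows)

# New patient: P34=M, P61=M, BMI=H.
for c in ("Y", "N"):
    score = cond_prob(0, "M", c) * cond_prob(1, "M", c) * cond_prob(2, "H", c) * prior(c)
    print(c, score)   # Y -> 0.0, N -> 0.008 (up to floating-point rounding)
```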

Page 28: Data Mining and Machine Learning

In practice, we finesse the zeroes and use logs (note: log(A×B×C×D×…) = log(A) + log(B) + …)


New patient: P34 = M, P61 = M, BMI = H

Best guess at cancer field?

log(0.4) + log(0.001) + log(0.4) + log(0.5) = -4.09
log(0.2) + log(0.4) + log(0.2) + log(0.5) = -2.09

(base-10 logs, with the zero 'finessed' into a small value, here 0.001)

which of these gives the highest value?

Page 29: Data Mining and Machine Learning

Naïve Bayes -- in general

N fields, q possible class values.
New unclassified instance: F1 = v1, F2 = v2, ..., Fn = vn

What is the class value? i.e. is it c1, c2, ... or cq?

Calculate each of these q things – the biggest one gives the class:

P(F1=v1 | c1) × P(F2=v2 | c1) × ... × P(Fn=vn | c1) × P(c1)
P(F1=v1 | c2) × P(F2=v2 | c2) × ... × P(Fn=vn | c2) × P(c2)
...
P(F1=v1 | cq) × P(F2=v2 | cq) × ... × P(Fn=vn | cq) × P(cq)
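A minimal sketch of this decision rule (assuming `cond_prob(i, v, c)` and `prior(c)` estimate P(Fi=v | c) and P(c) from training data by counting, as in the earlier sketch):

```python
def naive_bayes_classify(instance, classes, cond_prob, prior):
    """Return the class c maximising P(F1=v1|c) * ... * P(Fn=vn|c) * P(c).

    `instance` is the tuple (v1, ..., vn); `cond_prob(i, v, c)` and
    `prior(c)` are probability estimates obtained from the training data."""
    best_class, best_score = None, -1.0
    for c in classes:
        score = prior(c)
        for i, v in enumerate(instance):
            score *= cond_prob(i, v, c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# e.g., with the toy counting estimates from the previous sketch:
# naive_bayes_classify(("M", "M", "H"), ("Y", "N"), cond_prob, prior)  -> "N"
```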

Page 30: Data Mining and Machine Learning

Naïve Bayes -- in general

As indicated, what we normally do, when there are more than a handful of fields, is this.

Calculate:

log(P(F1=v1 | c1)) + ... + log(P(Fn=vn | c1)) + log(P(c1))
log(P(F1=v1 | c2)) + ... + log(P(Fn=vn | c2)) + log(P(c2))
...

and choose class based on highest of these. Because … ?

Page 31: Data Mining and Machine Learning

Naïve Bayes -- in general

log( a × b × c × …) = log(a) + log(b) + log(c) + …

and this means we won’t get underflow errors, which we would otherwise get with, e.g.

0.003 × 0.000296 × 0.001 × …[100 fields] × 0.042 …
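A quick illustration of the underflow problem and the log fix (the probabilities here are made up):

```python
import math

# Made-up conditional probabilities for ~100 fields.
probs = [0.003, 0.000296, 0.001, 0.042] + [0.0001] * 96

product = 1.0
for p in probs:
    product *= p
print(product)        # 0.0 -- the raw product has underflowed a 64-bit float

log_score = sum(math.log(p) for p in probs)
print(log_score)      # about -908: large and negative, but perfectly representable
```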

Page 32: Data Mining and Machine Learning

Deriving NB

Essence of Naive Bayes, with 1 non-class field, is to calculate this for each class value, given some new instance with fieldval = F:

P(class = C | Fieldval = F)

For many fields, our new instance is (e.g.) (F1, F2, ...Fn), and the ‘essence of Naive Bayes’ is to calculate this for each class:

P(class = C | F1,F2,F3,...,Fn)

i.e. what is the probability of class C, given all these field values together?

Page 33: Data Mining and Machine Learning

Apply magic dust and Bayes' theorem, and ...

... if we make the naive assumption that all of the fields are independent of each other

(e.g. P(F1 | F2) = P(F1), etc.) ... then

P(class = C | F1 and F2 and F3 and ... Fn)

∝ P(F1 and F2 and ... and Fn | C) × P(C)

= P(F1 | C) × P(F2 | C) × ... × P(Fn | C) × P(C)

… which is what we calculate in NB
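The same chain in standard notation (a sketch; the last equality is exactly the naive conditional-independence assumption):

```latex
P(C \mid F_1, \dots, F_n)
  = \frac{P(F_1, \dots, F_n \mid C)\, P(C)}{P(F_1, \dots, F_n)}
  \propto P(F_1, \dots, F_n \mid C)\, P(C)
  = P(C) \prod_{i=1}^{n} P(F_i \mid C).
```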

Page 34: Data Mining and Machine Learning

Confusion and CW2

Page 35: Data Mining and Machine Learning

80% ‘accuracy’ ...

Page 36: Data Mining and Machine Learning

80% ‘accuracy’ ...

… means our classifier predicts the correct class 80% of the time, right?

Page 37: Data Mining and Machine Learning

80% ‘accuracy’ ...

Could be with this confusion matrix:

True \ Pred    A     B     C
A             80    10    10
B             10    80    10
C              5    15    80

Page 38: Data Mining and Machine Learning

80% ‘accuracy’ ...

Could be with this confusion matrix:

True \ Pred    A     B     C
A             80    10    10
B             10    80    10
C              5    15    80

Or this one:

True \ Pred    A     B     C
A            100     0     0
B              0   100     0
C             60     0    40

Page 39: Data Mining and Machine Learning

80% ‘accuracy’ ...

Could be with this confusion matrix:

True \ Pred    A     B     C     Individual class accuracies
A             80    10    10     80%
B             10    80    10     80%
C              5    15    80     80%

Or this one:

True \ Pred    A     B     C     Individual class accuracies
A            100     0     0     100%
B              0   100     0     100%
C             60     0    40     40%

Page 40: Data Mining and Machine Learning

80% ‘accuracy’ ...

Could be with this confusion matrix:

True \ Pred    A     B     C     Individual class accuracies
A             80    10    10     80%
B             10    80    10     80%
C              5    15    80     80%

Mean class accuracy: 80%

Or this one:

True \ Pred    A     B     C     Individual class accuracies
A            100     0     0     100%
B              0   100     0     100%
C             60     0    40     40%

Mean class accuracy: (100% + 100% + 40%) / 3 = 80%
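A minimal sketch of both measures in Python, assuming the layout above (rows = true class, columns = predicted):

```python
def accuracy_measures(confusion):
    """Overall accuracy, per-class accuracies, and mean class accuracy
    for a confusion matrix with rows = true class, columns = predicted."""
    total = sum(sum(row) for row in confusion)
    overall = sum(confusion[i][i] for i in range(len(confusion))) / total
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    mean_class = sum(per_class) / len(per_class)
    return overall, per_class, mean_class

# The second matrix above:
print(accuracy_measures([[100, 0, 0], [0, 100, 0], [60, 0, 40]]))
# (0.8, [1.0, 1.0, 0.4], 0.8)
```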

Page 41: Data Mining and Machine Learning

Unbalanced classes: take care in interpretation of accuracy figures.

True \ Pred     A      B      C     Individual class accuracies
A             100      0      0     100%
B             400    300    300     30%
C             500   1500   8000     80%

Overall accuracy: (100 + 300 + 8000) / 11100 = 75.7%

Page 42: Data Mining and Machine Learning

True \ Pred     A      B      C     Individual class accuracies
A             100      0      0     100%
B             400    300    300     30%
C             500   1500   8000     80%

Overall accuracy: (100 + 300 + 8000) / 11100 = 75.7%
Mean class accuracy: (100% + 30% + 80%) / 3 = 70%

Unbalanced classes: take care in interpretation of accuracy figures.

Page 43: Data Mining and Machine Learning

True \ Pred     A      B      C     Individual class accuracies
A             100      0      0     100%
B               0    100      0     100%
C            4000   4000   2000     20%

Overall accuracy: (100 + 100 + 2000) / 10200 = 21.6%
Mean class accuracy: (100% + 100% + 20%) / 3 = 73.3%

Unbalanced classes: take care in interpretation of accuracy figures.

Page 44: Data Mining and Machine Learning

Example from https://tidsskrift.dk/index.php/geografisktidsskrift/article/view/2519/4475

Classifying agricultural land use from satellite image data

Page 45: Data Mining and Machine Learning

CW2

Page 46: Data Mining and Machine Learning

datasets

In this coursework you will work with a 'handwritten digit recognition' dataset. Get it from my site here:
http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/optall.txt
Also pick up ten other versions of the same dataset, as follows:
http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/opt0.txt
http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/opt1.txt
http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/opt2.txt
[etc...]
http://www.macs.hw.ac.uk/~dwcorne/Teaching/DMML/opt9.txt

Page 47: Data Mining and Machine Learning

2 lines from opt2.txt:

0 0 5 14 4 0 0 0 0 0 13 8 0 0 0 0 0 3 14 4 0 0 0 0 0 6 16 14 9 2 0 0 0 4 16 3 4 11 2 0 0 0 14 3 0 4 11 0 0 0 10 8 4 11 12 0 0 0 4 12 14 7 0 0 0

0 0 11 16 10 1 0 0 0 4 16 10 15 8 0 0 0 4 16 3 11 13 0 0 0 1 14 6 9 14 0 0 0 0 0 0 12 10 0 0 0 0 0 6 16 6 0 0 0 0 5 15 15 8 8 3 0 0 10 16 16 16 16 6 1

Page 48: Data Mining and Machine Learning

The same 2 lines from opt2.txt, wrapped as 8×8 grids of pixel values (the 65th number on each line is the class):

0  0  5 14  4  0  0  0
0  0 13  8  0  0  0  0
0  3 14  4  0  0  0  0
0  6 16 14  9  2  0  0
0  4 16  3  4 11  2  0
0  0 14  3  0  4 11  0
0  0 10  8  4 11 12  0
0  0  4 12 14  7  0  0

class = 0 (not a 2)

0  0 11 16 10  1  0  0
0  4 16 10 15  8  0  0
0  4 16  3 11 13  0  0
0  1 14  6  9 14  0  0
0  0  0  0 12 10  0  0
0  0  0  6 16  6  0  0
0  0  5 15 15  8  8  3
0  0 10 16 16 16 16  6

class = 1 (a 2)

Page 49: Data Mining and Machine Learning

[The same two 8×8 grids again, now shaded by pixel value (0–8 vs 9–16) so the digit shapes become visible.]

Page 50: Data Mining and Machine Learning

[The same two shaded 8×8 grids, with the 64 pixel fields numbered:]

 1  2  3  4  5  6  7  8
 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32
33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56
57 58 59 60 61 62 63 64

In opt2.txt, which fields correlate well with the class field?

Which fields will tend to have high values when the class is '1'? Which will tend to be low when the class is '1'?

Page 51: Data Mining and Machine Learning

1. Produce a version of optall.txt that has the instances in a randomised order.
2. Run my naïve Bayes awk script on the resulting version of optall.txt.
3. Implement a program or script that allows you to work out the correlation between any two fields.
4. Using your program, find out the correlation between each field and the class field -- using only the first 50% of instances in the data file -- for all the optN files. For each optN.txt file, keep a record of the top five fields, in order of absolute correlation value.

Run my nb.awk on optall.txt:

your-linux-prompt: awk -f nb.awk < optall.txt

• Implement Pearson's correlation somehow (a sketch follows below)
• Work out correlations between fields and class, for each of the optN.txt files
• Select the 'best fields' from each of those experiments
• Build new versions of optall.txt using those fields
• Run my nb.awk again
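For the Pearson's correlation step, a minimal sketch (assuming whitespace-separated values as in the opt files; the 0-indexed field 29 is just an arbitrary example, and constant fields are guarded against):

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    # A field with zero variance has no defined correlation; report 0.
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

# Correlation between field 29 and the class field (last column),
# using only the first 50% of instances, as the coursework asks:
rows = [list(map(int, line.split())) for line in open("opt2.txt")]
half = rows[: len(rows) // 2]
print(pearson([r[29] for r in half], [r[-1] for r in half]))
```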

Page 52: Data Mining and Machine Learning

Next – feature selection