Machine Learning & Data Mining CS/CNS/EE 155
Lecture 10: Generative Models, Naïve Bayes
Announcements
• Kaggle submissions end 1:59pm Thursday
• HW5 will be released by the end of this week
  – Last year people found it tough, so start early and write good code
Where We Are
• Part 1: Discriminative Models
  – Basics: Bias/Variance, Over/Underfitting
  – Linear models: perceptron, SVM, logistic regression
  – Regularization
  – Decision Trees
  – Model Aggregation: bagging and boosting
  – Deep Learning
• Part 2: Generative and Unsupervised Models <- We Are Here
  – Probabilistic generative models
  – Deep generative models
  – Dimensionality Reduction
  – Matrix Factorization, Embeddings, etc.
Note: “Generative” and “Unsupervised” are not synonyms, but generative models are often unsupervised and vice versa.
What’s a Generative Model?
Consider a dataset S = {(x_i, y_i)}_{i=1}^N
• Discriminative models learn P(y | x)
  – Well-defined task, but intelligence can’t be built on discrimination alone
• Generative models learn P(x, y)
  – Less well-defined task, but models can do much more than just discriminate
  – Currently the focus of a significant percent of high-profile ML research
What can Generative Models Do?
y x P(x,y)
SPAM Help! 0.15
NOT Help! 0.1
SPAM Homework 0.05
NOT Homework 0.45
SPAM Winner! 0.2
NOT Winner! 0.05
Example model:
Full joint distribution; sums to 1
Our model specifies these. Don’t worry yet about the model architecture or how we trained it.
What can Generative Models Do?
Discriminate
We can compute P(y | x), so we know, e.g., P(y=SPAM | x=Help!):
P(y=SPAM, x=Help!) = 0.15
P(x=Help!) = 0.25
=> P(y=SPAM|x=Help!) = 0.6
y x P(x,y)
SPAM Help! 0.15
NOT Help! 0.1
SPAM Homework 0.05
NOT Homework 0.45
SPAM Winner! 0.2
NOT Winner! 0.05
Example model:
P(y | x) = P(x, y) / P(x)
P(x) = \sum_y P(y, x)
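To make this concrete, here is a minimal Python sketch (not from the lecture; the JOINT dict and posterior function are hypothetical names) that computes the posterior from the joint table above:

JOINT = {  # (y, x) -> P(x, y); the full joint sums to 1
    ("SPAM", "Help!"): 0.15, ("NOT", "Help!"): 0.10,
    ("SPAM", "Homework"): 0.05, ("NOT", "Homework"): 0.45,
    ("SPAM", "Winner!"): 0.20, ("NOT", "Winner!"): 0.05,
}

def posterior(y, x):
    # P(y | x) = P(x, y) / P(x), with P(x) = sum over y' of P(y', x)
    p_x = sum(p for (y2, x2), p in JOINT.items() if x2 == x)
    return JOINT[(y, x)] / p_x

print(posterior("SPAM", "Help!"))  # 0.15 / 0.25 = 0.6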
What can Generative Models Do?
y x P(x,y)
SPAM Help! 0.15
NOT Help! 0.1
SPAM Homework 0.05
NOT Homework 0.45
SPAM Winner! 0.2
NOT Winner! 0.05
Example model:
Summarize and Predict
E.g. the marginal distribution of y is P(y) = \sum_x P(x, y).
So we know P(y=SPAM) = 0.4 before we’ve even seen x. You can’t get that from a discriminative model.
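A minimal Python sketch of the same marginalization (JOINT is a hypothetical name for the example table above):

JOINT = {("SPAM", "Help!"): 0.15, ("NOT", "Help!"): 0.10,
         ("SPAM", "Homework"): 0.05, ("NOT", "Homework"): 0.45,
         ("SPAM", "Winner!"): 0.20, ("NOT", "Winner!"): 0.05}

# P(y) = sum over x of P(x, y)
p_spam = sum(p for (y, x), p in JOINT.items() if y == "SPAM")
print(p_spam)  # 0.4, known before we have seen any x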
E.g. Variational Autoencoders
Can discover structure in, e.g., the MNIST dataset using just the unlabeled images. This is a “walk” through the structure this generative model discovered.
http://blog.fastforwardlabs.com/2016/08/22/under-the-hood-of-the-variational-autoencoder-in.html
E.g. Latent Dirichlet Allocation (Disclaimer: this was the top result of Google Search “LDA example”)
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
Can identify topics from reading an unlabeled text corpus, e.g. Presidential Campaign, Wildlife:
What can Generative Models Do?
y x P(x,y)
SPAM Help! 0.15
NOT Help! 0.1
SPAM Homework 0.05
NOT Homework 0.45
SPAM Winner! 0.2
NOT Winner! 0.05
Example model:
Generate
We can generate examples straight from the data distribution:
>>> random.uniform(0,1)
0.404…
=> return (NOT, Homework)
Or we can generate conditionally, since going from label to data is as easy as from data to label:
P(x | y) = P(x, y) / P(y), where P(y) = \sum_x P(x, y)
So we know, e.g., P(x=Help!|y=SPAM), P(x=Homework|y=SPAM), and P(x=Winner!|y=SPAM), and can sample from that conditional distribution.
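A minimal Python sketch of both kinds of sampling, assuming the example joint table above (JOINT and sample are hypothetical names):

import random

JOINT = {("SPAM", "Help!"): 0.15, ("NOT", "Help!"): 0.10,
         ("SPAM", "Homework"): 0.05, ("NOT", "Homework"): 0.45,
         ("SPAM", "Winner!"): 0.20, ("NOT", "Winner!"): 0.05}

def sample(dist):
    # invert the CDF: walk through the outcomes until the running total passes R
    r, total = random.uniform(0, 1), 0.0
    for outcome, p in dist.items():
        total += p
        if r < total:
            return outcome
    return outcome  # guard against floating-point round-off

# Sample straight from the joint P(x, y)
print(sample(JOINT))  # e.g. ('NOT', 'Homework')

# Or sample conditionally from P(x | y=SPAM) = P(x, y=SPAM) / P(y=SPAM)
spam_rows = {x: p for (y, x), p in JOINT.items() if y == "SPAM"}
p_spam = sum(spam_rows.values())
cond = {x: p / p_spam for x, p in spam_rows.items()}
print(sample(cond))  # e.g. 'Winner!'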
E.g. Generative Adversarial Networks
http://www.nature.com/news/astronomers-explore-uses-for-ai-generated-images-1.21398?WT.mc_id=FBK_NatureNews
Learn from labeled images, condition on “Volcano”, and generate new images, because why not?
Learn from unlabeled telescope imagery, generate new images; now astronomers have more data.
E.g. Style Transfer
Figure 2: Images that combine the content of a photograph with the style of several well-known artworks. The images were created by finding an image that simultaneously matches the content representation of the photograph and the style representation of the artwork (see Methods). The original photograph depicting the Neckarfront in Tübingen, Germany, is shown in A (Photo: Andreas Praefcke). The painting that provided the style for the respective generated image is shown in the bottom left corner of each panel. B The Shipwreck of the Minotaur by J.M.W. Turner, 1805. C The Starry Night by Vincent van Gogh, 1889. D Der Schrei by Edvard Munch, 1893. E Femme nue assise by Pablo Picasso, 1910. F Composition VII by Wassily Kandinsky, 1913.
http://gitxiv.com/posts/jG46ukGod8R7Rdtud/a-neural-algorithm-of-artistic-style
Condition on a style and a content image, generate new images:
What can Generative Models Do?
Handle Missing Values and Model Uncertainty
(Previous example was too simple to illustrate this.) By conditioning on known values, generative models can handle missing values probabilistically. Further, many generative models are structured to model uncertainty about their missing value estimates.
https://openai.com/blog/generative-models/
Naïve exploration vs. exploration with a generative uncertainty model
What’s the Catch?
• Discriminative models make (much) better predictions
  – Discriminative models are directly optimized to predict
  – Generative models make predictions by combining multiple estimated values:
    P(y | x) = P(x, y) / P(x), with P(x) = \sum_y P(y, x)
What’s the Catch?
• Generative modeling is an underspecified task
  – Goal of discriminative models is clear: improve accuracy
  – Goal of generative models is less clear: the quality of a model is task-dependent and somewhat subjective
What’s the Catch?
• Generative models are more complicated
  – In our example, the generative model had 6 parameters; a discriminative model would have needed 3 to do its job
  – Imagine what happens when feature space gets large, e.g. 784 continuous-valued pixels!
  – Requires very clever model design and assumptions about data

P(y | x) = P(x, y) / P(x)
What if there are so many different x that P(x) underflows?
Example
Suppose we have a binary y label and binary x features.
What’s wrong with the Probability Table approach (arguably the most naïve generative model)?
y x1=Winner! x2=Homework P(x,y)
SPAM 1 1 0.01
NOT 1 1 0.01
SPAM 0 1 0.03
NOT 0 1 0.35
SPAM 1 0 0.25
NOT 1 0 0.05
SPAM 0 0 0.2
NOT 0 0 0.1
Model Complexity is Exponential w.r.t. the length of x!
We need a better model…
Naïve Bayes
Reminder: Conditional Independence
x1 and x2 are conditionally independent given y iff
  P(x1, x2 | y) = P(x1 | y) P(x2 | y)
or, equivalently,
  P(x1 | x2, y) = P(x1 | y)
and
  P(x2 | x1, y) = P(x2 | y)
Reminder: Conditional Independence
If D random variables x_d are all conditionally independent given y, then

  P(x_1, ..., x_D, y) = P(y) P(x_1, ..., x_D | y) = P(y) \prod_{d=1}^D P(x_d | y)

We can express this graphically:

  [Diagram: a node Y with an arrow to each of X1, X2, …, XD]

Roughly speaking, an edge from A to B means B depends on A. Absence of an edge means conditional independence.
“Bayesian Network”, a type of Graphical Model Diagram
But graphical models are a world unto themselves, and we won’t get into them too deeply…
The Naïve Bayes Model
• A generative model, but used in the context of prediction.
• For a label y and D features x_d, Naïve Bayes posits that all the features are conditionally independent given y.
• This is a strong (perhaps naïve) assumption about our data
• But we only have to keep track of P(y) and P(x_d | y)!
[Diagram: a node Y with an arrow to each of X1, X2, …, XD]

P(x_1, ..., x_D, y) = P(y) P(x_1, ..., x_D | y) = P(y) \prod_{d=1}^D P(x_d | y)
Why is Naïve Bayes Convenient?
• Compact representation
• Easy to compute any quantity – P(y|x), P(x_d|y), …
• Easy to estimate model components – P(y), P(x_d|y)
• Easy to sample
• Easy to deal with missing values
Example Model (Discrete)
• Each x_d binary (-1 or +1)
  – E.g., presence (+1) or absence (-1) of word
x1=Homework x2=Winner!
y=SPAM P(x1 =1|y)=0.2 P(x2 =1|y)=0.5
y=NOT P(x1 =1|y)=0.6 P(x2 =1|y)=0.1
P(y)
y=SPAM 0.7
y=NOT 0.3
P(x|y)
P(y)
Model Complexity is Linear w.r.t. the length of x!
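A minimal Python sketch of this model (P_Y and P_XD are hypothetical names), storing only P(y) and P(x_d=1|y) and computing the factorized joint P(x, y) = P(y) \prod_d P(x_d | y):

P_Y = {"SPAM": 0.7, "NOT": 0.3}                 # P(y)
P_XD = {"SPAM": [0.2, 0.5], "NOT": [0.6, 0.1]}  # P(x_d = 1 | y): [Homework, Winner!]

def joint(x, y):
    # P(x, y) = P(y) * prod_d P(x_d | y), for a binary feature vector x
    p = P_Y[y]
    for x_d, p1 in zip(x, P_XD[y]):
        p *= p1 if x_d == 1 else (1.0 - p1)
    return p

print(joint([1, 0], "NOT"))  # 0.3 * 0.6 * (1 - 0.1) = 0.162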
Making Predictions
P(y | x) = P(x, y) / P(x)
         = P(x | y) P(y) / P(x)
         = P(y) \prod_d P(x_d | y) / P(x)
         ∝ P(y) \prod_d P(x_d | y)
Model components we keep track of.
Bayes Rule (hence the name)
x1=Homework x2=Winner!
y=SPAM P(x1 =1|y)=0.2 P(x2 =1|y)=0.5
y=NOT P(x1 =1|y)=0.6 P(x2 =1|y)=0.1
P(y)
y=SPAM 0.7
y=NOT 0.3
Example Prediction
x1=Homework x2=Winner!
y=SPAM P(x1 =1|y)=0.2 P(x2 =1|y)=0.5
y=NOT P(x1 =1|y)=0.6 P(x2 =1|y)=0.1
P(y)
y=SPAM 0.7
y=NOT 0.3
P(y=NOT | x1=1, x2=-1)
  = P(y=NOT) P(x1=1 | y=NOT) P(x2=-1 | y=NOT) / P(x1=1, x2=-1)
  = P(y=NOT) P(x1=1 | y=NOT) P(x2=-1 | y=NOT) / \sum_y P(y) P(x1=1 | y) P(x2=-1 | y)
  = 0.3 * 0.6 * (1 - 0.1) / (0.3 * 0.6 * (1 - 0.1) + 0.7 * 0.2 * (1 - 0.5))
  = 0.698...
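The same computation as a minimal Python sketch (hypothetical names), reproducing the 0.698 result:

P_Y = {"SPAM": 0.7, "NOT": 0.3}          # P(y)
P_X1 = {"SPAM": 0.2, "NOT": 0.6}         # P(x1 = 1 | y), x1 = "Homework"
P_X2 = {"SPAM": 0.5, "NOT": 0.1}         # P(x2 = 1 | y), x2 = "Winner!"

def unnormalized(y, x1, x2):
    # P(y) * P(x1 | y) * P(x2 | y), with x_d in {+1, -1}
    p1 = P_X1[y] if x1 == 1 else 1.0 - P_X1[y]
    p2 = P_X2[y] if x2 == 1 else 1.0 - P_X2[y]
    return P_Y[y] * p1 * p2

num = unnormalized("NOT", 1, -1)
den = sum(unnormalized(y, 1, -1) for y in P_Y)
print(num / den)  # 0.162 / 0.232 = 0.698...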
One Empirical Comparison
• Columns denote how frequently each model places 1st, 2nd, 3rd, etc.
• The only generative model (Naïve Bayes) is in last place
An Empirical Comparison of Supervised Learning Algorithms
Table 4. Bootstrap Analysis of Overall Rank by Mean Performance Across Problems and Metrics
model 1st 2nd 3rd 4th 5th 6th 7th 8th 9th 10th
bst-dt   0.580 0.228 0.160 0.023 0.009 0.000 0.000 0.000 0.000 0.000
rf       0.390 0.525 0.084 0.001 0.000 0.000 0.000 0.000 0.000 0.000
bag-dt   0.030 0.232 0.571 0.150 0.017 0.000 0.000 0.000 0.000 0.000
svm      0.000 0.008 0.148 0.574 0.240 0.029 0.001 0.000 0.000 0.000
ann      0.000 0.007 0.035 0.230 0.606 0.122 0.000 0.000 0.000 0.000
knn      0.000 0.000 0.000 0.009 0.114 0.592 0.245 0.038 0.002 0.000
bst-stmp 0.000 0.000 0.002 0.013 0.014 0.257 0.710 0.004 0.000 0.000
dt       0.000 0.000 0.000 0.000 0.000 0.000 0.004 0.616 0.291 0.089
logreg   0.000 0.000 0.000 0.000 0.000 0.000 0.040 0.312 0.423 0.225
nb       0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.030 0.284 0.686
“An Empirical Comparison of Supervised Learning Algorithms”, Caruana, Niculescu-Mizil, ICML 2006
Example Prediction #2
• What if we want to compute P(x1 | x_{2:D}, y)?
• It’s an explicitly defined model component:
  P(x1 | x_{2:D}, y) = P(x1 | y)
Example Prediction #3
• What if we want to compute P(x1 | x_{2:D})?

P(x1 | x_{2:D}) = P(x1, x_{2:D}) / P(x_{2:D})
               = \sum_y P(y) P(x1, x_{2:D} | y) / \sum_y P(y) P(x_{2:D} | y)
               = \sum_y P(y) \prod_{d=1}^D P(x_d | y) / \sum_y P(y) \prod_{d=2}^D P(x_d | y)
Why is the numerator smaller than the denominator?
“Marginalizing out the y”
Marginalization in Matrix Form
• Compute P(xd=1):
x1=Homework x2=Winner!
y=SPAM P(x1=1|y)=0.2 P(x2=1|y)=0.5
y=NOT P(x1=1|y)=0.6 P(x2=1|y)=0.1
P(y)
y=SPAM 0.7
y=NOT 0.3
Let O be the P(x|y) table above, viewed as a matrix with entries O_{y,d} = P(x_d = 1 | y), and let P be the vector of priors P(y). Then

  P(x_d = 1) = \sum_y P(x_d = 1 | y) P(y) = [O^T P]_d   (the d-th row)
For heaven’s sake, don’t use a for loop
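A minimal NumPy sketch of this (O and prior are hypothetical names for the tables above); one matrix-vector product replaces the loop over y:

import numpy as np

O = np.array([[0.2, 0.5],     # row y=SPAM: P(x1=1|y), P(x2=1|y)
              [0.6, 0.1]])    # row y=NOT
prior = np.array([0.7, 0.3])  # P(SPAM), P(NOT)

p_xd_is_1 = O.T @ prior       # one matrix-vector product, no for loop
print(p_xd_is_1)              # [0.32, 0.38]: 0.2*0.7 + 0.6*0.3 and 0.5*0.7 + 0.1*0.3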
Prediction with Missing Values
• What if we don’t observe x_{2:D}?
• Predict P(y=SPAM | x1)
P(y | x1) = \sum_{x_{2:D}} P(y, x_{2:D} | x1) = \sum_{x_{2:D}} P(x_{1:D}, y) / P(x1)
How to efficiently sum over multiple missing values?
We can marginalize out the missing values!
Conditional Independence to the Rescue!
From the previous slide:
  P(y | x1) = \sum_{x_{2:D}} P(y, x_{2:D} | x1) = \sum_{x_{2:D}} P(x_{1:D}, y) / P(x1)

Definition of Naïve Bayes:
  P(x_{1:D}, y) = P(y) \prod_{d=1}^D P(x_d | y)

Therefore:
  \sum_{x_{2:D}} P(x_{1:D}, y) = \sum_{x_{2:D}} P(y) \prod_{d=1}^D P(x_d | y)
                              = P(y) P(x1 | y) \prod_{d=2}^D [ \sum_{x_d} P(x_d | y) ]   (swap product & sum due to independence!)
                              = P(y) P(x1 | y)   (each inner sum marginalizes to 1!)
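A minimal Python sketch of prediction with x_{2:D} missing (hypothetical names); because the missing features marginalize to 1, only P(y) and P(x1 | y) are needed:

P_Y = {"SPAM": 0.7, "NOT": 0.3}    # P(y)
P_X1 = {"SPAM": 0.2, "NOT": 0.6}   # P(x1 = 1 | y)

def posterior_given_x1(x1):
    # P(y | x1) is proportional to P(y) * P(x1 | y); the missing x_{2:D} sum out to 1
    unnorm = {y: P_Y[y] * (P_X1[y] if x1 == 1 else 1.0 - P_X1[y]) for y in P_Y}
    z = sum(unnorm.values())
    return {y: p / z for y, p in unnorm.items()}

print(posterior_given_x1(1))  # {'SPAM': 0.4375, 'NOT': 0.5625}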
Training
• Maximum Likelihood of Training Set S = {(x_i, y_i)}_{i=1}^N:

  argmax_P P(S) = argmax_P \prod_i P(x_i, y_i) = argmin_P -\sum_i \log P(x_i, y_i)

  where the argmax/min is over all possible Naïve Bayes models.
Training
P(y = SPAM) = N_{y=SPAM} / N
  (frequency of SPAM documents in the training set)

P(x1 = 1 | y = SPAM) = N_{y=SPAM, x1=1} / N_{y=SPAM}
  (frequency of word x1 appearing in SPAM documents in the training set)
• Subject to the Naïve Bayes assumption on the structure of P(x,y), we only need to estimate P(y) and each P(x_d | y)!
• This is just counting! (See the sketch below.)
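A minimal Python sketch of training by counting (the tiny data list is made up for illustration; all names are hypothetical):

from collections import Counter

# toy training set: each example is (x, y), with x a tuple of binary features
data = [((1, 0), "SPAM"), ((0, 0), "NOT"), ((1, 1), "SPAM"),
        ((0, 1), "NOT"), ((1, 0), "SPAM")]

N = len(data)
D = len(data[0][0])
n_y = Counter(y for _, y in data)          # N_{y}
n_xd1 = {y: [0] * D for y in n_y}          # N_{y, x_d = 1}
for x, y in data:
    for d, x_d in enumerate(x):
        n_xd1[y][d] += x_d

P_y = {y: n / N for y, n in n_y.items()}                                    # P(y)
P_xd1_given_y = {y: [n_xd1[y][d] / n_y[y] for d in range(D)] for y in n_y}  # P(x_d = 1 | y)
print(P_y)            # {'SPAM': 0.6, 'NOT': 0.4}
print(P_xd1_given_y)  # {'SPAM': [1.0, 0.333...], 'NOT': [0.0, 0.5]}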
Training Derivation
• Define:
  P(x, y) = w_{x,y} / \sum_{x',y'} w_{x',y'}   (just a re-parameterization)

Then:
  argmin_P -\sum_i \log P(x_i, y_i) = argmin_w \sum_i [ -\log w_{x_i,y_i} + \log \sum_{x',y'} w_{x',y'} ]

Setting the derivative to zero, with N_{x,y} = # training examples (x,y):
  \partial_{w_{x,y}} = -N_{x,y} / w_{x,y} + N / \sum_{x',y'} w_{x',y'} = 0
  => w_{x,y} / \sum_{x',y'} w_{x',y'} = N_{x,y} / N
  => P(x, y) = N_{x,y} / N

Frequency of (x,y) in the training set!
Regularization
• Add “pseudo counts” – i.e. hallucinate some data
P(y = SPAM) = (N_{y=SPAM} + λ P̃_{y=SPAM}) / (N + λ)

P(x1 = 1 | y = SPAM) = (N_{y=SPAM ∧ x1=1} + λ P̃_{y=SPAM ∧ x1=1}) / (N_{y=SPAM} + λ)

(P̃ here denotes the pseudo-count distribution.) Often just set the pseudo counts to a uniform distribution.
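A minimal Python sketch of pseudo counts with a uniform pseudo distribution (the smoothed helper is hypothetical; lam plays the role of λ):

def smoothed(count, total, lam=1.0, pseudo_p=0.5):
    # (N_event + lam * pseudo_p) / (N_total + lam): pseudo counts from a uniform distribution
    return (count + lam * pseudo_p) / (total + lam)

print(smoothed(count=0, total=4))  # 0.1 rather than 0.0: no zero probabilities
print(smoothed(count=4, total=4))  # 0.9 rather than 1.0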
Sampling
• Can sample from distribution P(x,y) (a code sketch follows below):
  – First sample y:
    • Random uniform variable R
    • Set y=SPAM if R < P(y=SPAM) & y=NOT otherwise
  – Then sample each x_d:
    • Sample uniform variable R
    • Set x_d=1 if R < P(x_d=1|y) & x_d=0 otherwise
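A minimal Python sketch of this sampling procedure, using the example model parameters above (variable names are hypothetical):

import random

P_Y_SPAM = 0.7                                     # P(y = SPAM)
P_XD = {"SPAM": [0.2, 0.5], "NOT": [0.6, 0.1]}     # P(x_d = 1 | y)

def sample_naive_bayes():
    # first sample y, then sample each x_d independently given y
    y = "SPAM" if random.uniform(0, 1) < P_Y_SPAM else "NOT"
    x = [1 if random.uniform(0, 1) < p1 else 0 for p1 in P_XD[y]]
    return x, y

print(sample_naive_bayes())  # e.g. ([1, 0], 'SPAM')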
Sampling Example
• Sample P(y) – R = 0.5, so set y = SPAM
• Sample P(x1|y=SPAM) – R = 0.1, so set x1 = 1
• Sample P(x2|y=SPAM) – R = 0.9, so set x2 = 0
x1=Homework x2=Winner!
y=SPAM P(x1=1|y)=0.2 P(x2=1|y)=0.5
y=NOT P(x1=1|y)=0.6 P(x2=1|y)=0.1
P(y)
y=SPAM 0.7
y=NOT 0.3
Can be done in either order
Sampling Example #2
• Sample P(y) – R = 0.9, so set y = NOT
• Sample P(x1|y=NOT) – R = 0.5, so set x1 = 1
• Sample P(x2|y=NOT) – R = 0.05, so set x2 = 1
x1=Homework x2=Winner!
y=SPAM P(x1=1|y)=0.2 P(x2=1|y)=0.5
y=NOT P(x1=1|y)=0.6 P(x2=1|y)=0.1
P(y)
y=SPAM 0.7
y=NOT 0.3
Recap: Generative Models and Naïve Bayes
• Generative models (attempt to) model the whole data distribution
  – Can generate new data, tolerate missing values, etc.
  – Not as good at prediction as discriminative models
Recap: Generative Models and Naïve Bayes
• The Naïve Bayes model assumes all features are conditionally independent given the label
  – Greatly simplified model structure, but still generative
  – Compact representation
  – Easy to train
  – Easy to compute various probabilities
  – Not the most accurate for standard prediction
Next Lecture
• Hidden Markov Models in depth
  – Sequence Modeling
  – Requires Dynamic Programming
  – Implement aspects of HMMs in homework
• Recitation Tuesday:
  – Recap of Dynamic Programming (for HMMs)