Bayesian Learning. Pt 2. 6.7-6.12 Machine Learning. Promethea Pythaitha.

Page 1: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Bayesian Learning. Pt 2.

6.7- 6.12 Machine Learning

Promethea Pythaitha.

Page 2: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Bayes Optimal Classifier.

Gibbs Algorithm.

Naïve Bayes Classifier.

Bayesian Belief networks.

EM algorithm.

Page 3: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Bayes Optimal Classifier. So far we have asked: which is the most likely-to-be-correct hypothesis? That is, which h is the MAP hypothesis?

Recall MAP = maximum a posteriori hypothesis: the hypothesis with the highest likelihood of correctness given the sample data we have seen.

Page 4: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

What we usually want is the classification of a specific instance x that is not in the training data D. One way: find hMAP and return its prediction for x.

Or: decide what is the most probable classification for x.

Page 5: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Boolean classification: suppose we have hypotheses h1 through h6 with the posterior probabilities shown below.

Suppose h1 classifies x as -, and the rest classify it as +. Then the net support for - is 25%, and for + is 75%. The “Bayes Optimal Classification” is +, even though hMAP (= h1) says -.

Hypothesis:   h1    h2    h3    h4    h5    h6
Post. prob:   .25   .15   .15   .15   .15   .15
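As a quick illustration, here is a minimal sketch (not from the slides; the dictionary names are illustrative only) that computes this weighted vote in Python:

posteriors = {"h1": 0.25, "h2": 0.15, "h3": 0.15,
              "h4": 0.15, "h5": 0.15, "h6": 0.15}   # P(h_i | D)
predictions = {"h1": "-", "h2": "+", "h3": "+",
               "h4": "+", "h5": "+", "h6": "+"}      # h_i(x)

# accumulate the posterior-weighted support for each class label
support = {}
for h, p in posteriors.items():
    support[predictions[h]] = support.get(predictions[h], 0.0) + p

print(support)                         # roughly {'-': 0.25, '+': 0.75}
print(max(support, key=support.get))   # '+', even though h_MAP (= h1) says '-'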

Page 6: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Bayes Optimal Classifier: classifies an instance by taking the average of all the hypotheses' predictions, weighted by the credibility (posterior probability) of each hypothesis.

Eqn 6.18.
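For reference, Eqn 6.18 (Mitchell) is this weighted vote:

$$ v_{OB} = \arg\max_{v_j \in V} \; \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) $$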

Page 7: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Any system that classifies instances in this way is a “Bayes Optimal Classifier.”

No other classification method, given the same hypothesis space and prior knowledge, can outperform this method on average!!!

A particularly interesting result is that the predictions made by a BOC can correspond to hypotheses that are not even in its hypothesis space. This helps deal with the limitations imposed by overly restricted hypothesis spaces.

Page 8: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Best performance!!! At what cost?

Bayes Optimal Classification is the best on average, but it can be very computationally costly. First it has to learn the posterior probabilities of all the hypotheses in H. Then it has to poll the hypotheses to find what each one predicts for x's classification. Then it has to compute the big weighted sum (Eqn 6.18).

But remember, hypothesis spaces get very large. Recall the reason why we used the specific and general boundaries in the CELA algorithm.

Page 9: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Gibbs Algorithm. One way to avoid some of the computation is:

1: Select h from H at random, based on the posterior-probability distribution.

So more “credible” hypotheses are selected with higher probability than others.

2: Return h(x). This saves the time of computing the results hi(x) for all hi in H and doing the big weighted sum, but it is less optimal.
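A minimal sketch of these two steps (reusing the illustrative posterior table from the earlier example; not from the slides):

import random

posteriors = {"h1": 0.25, "h2": 0.15, "h3": 0.15,
              "h4": 0.15, "h5": 0.15, "h6": 0.15}   # P(h_i | D)
predictions = {"h1": "-", "h2": "+", "h3": "+",
               "h4": "+", "h5": "+", "h6": "+"}      # h_i(x)

def gibbs_classify():
    # 1: draw a single hypothesis according to the posterior distribution
    hs, ps = zip(*posteriors.items())
    h = random.choices(hs, weights=ps, k=1)[0]
    # 2: return that one hypothesis's prediction -- no weighted sum over all of H
    return predictions[h]

print(gibbs_classify())   # '+' about 75% of the time, '-' about 25% of the time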

Page 10: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

How well can it do?

If we compute the expected misclassification error of the Gibbs algorithm over target concepts drawn at random according to the a priori probability distribution assumed by the learner, then this error is at most twice that of the Bayes Optimal Classifier.

** On average.

Page 11: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Naïve Bayes Classifier [NBC]. A highly practical Bayesian learner.

Under the right conditions, it can perform as well as neural nets or decision trees.

It applies to any learning task where each instance x is described by a conjunction of attributes (or attribute-value pairs) and where the target function f(x) can take any value from a finite set V.

We are given training data D, and asked to classify a new instance x = <a1, a2, …, an>

Page 12: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Bayesian approach: classify a new instance by assigning the most probable target value, vMAP, given the attribute values of the instance.

Eqn 6.19.
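For reference, Eqn 6.19 (after applying Bayes' theorem and dropping the denominator, which does not depend on vj):

$$ v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \dots, a_n)
           = \arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n \mid v_j)\, P(v_j) $$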

Page 13: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

To get the classification we need to find the vj that maximizes P(a1, a2, …, an | vj) * P(vj).

Second term: EASY. It's simply the # of instances with classification vj over the total # of instances.

First term: Hard!!! We need a HUGE set of training data.

Suppose we have 10 attributes (with 2 possible values each) and 15 classifications: that is 2^10 * 15 = 15,360 possible combinations, and we have, say, 200 instances with known classifications. We cannot get a reliable estimate!!!

Page 14: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

The Naïve assumption. One way to ‘fix’ the problem is to assume the attributes are conditionally independent given the target value: assume P(a1, a2, …, an | vj) = Πi P(ai | vj).

Then the Naïve Bayes Classifier uses this for the prediction:

Eqn 6.20.
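For reference, Eqn 6.20:

$$ v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) $$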

Page 15: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Naïve Bayes Algorithm.

1: Learn the P(ai | vj) and P(vj) for all a's and v's, based on the training data. In our example this is 10 * 2 * 15 = 300 conditional probabilities.

Sample size of 200 is plenty!

** This set of numbers is the learned hypothesis.

2: Use this hypothesis to find vNB. If our “naïve assumption” of conditional independence is true, then vNB = vMAP.

Page 16: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Bayesian Learning vs. Other Machine Learning methods. In Bayesian Learning, there is no explicit search through a hypothesis space.

It does not produce an inference rule (D-tree) or a weight vector (NN). Instead it forms the hypothesis by observing the frequencies of various data combinations.

Page 17: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Example. There are four possible end-states of stellar evolution:

1: White dwarf.
2: Neutron star.
3: Black hole.
4: Brown dwarf.

Page 18: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

White Dwarf. About the size of the Earth, with about the mass of our Sun (up to 1.44 solar masses). The little white dot in the center of each image is the white dwarf.

Page 19: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Neutron Stars. About the size of the Gallatin Valley, and about twice the mass of our Sun (up to 2.9 solar masses). Don't go too close! Their gravitational and electromagnetic fields are strong enough to stretch you into spaghetti, rip out every metal atom in your body, and finally spread you across the whole surface!!

They form in Type II supernovae. The neutron star is a tiny speck at the center of that cloud.

Page 20: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Black Holes (3 to 50 solar masses). The ultimate cosmic sink-hole; it even devours light!! Time dilation, etc., come into effect near the event horizon.

Page 21: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Brown Dwarfs. Stars that never got hot enough to start fusion (< 0.1 solar masses).

Page 22: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Classification: because it is hard to get data and impossible to observe these objects from close up, we need an accurate way of identifying these remnants.

Two ways:

1: Computer model of stellar structure. Create a program that models a star and estimates the equations of state governing the more bizarre remnants (such as neutron stars), since they involve super-nuclear densities and are not well predicted by quantum mechanics. Learning algorithms (such as NNs) are sometimes used to tune the model based on known stellar remnants.

2: Group an unclassified remnant with others having similar attributes.

Page 23: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

The latter is more like a Bayesian classifier. Define (for masses of progenitor stars):

0.1 to 10 solar masses = Average
10 to 40 solar masses = Giant
40 to 150 solar masses = Supergiant
0 to 0.1 solar masses = Tiny

Define (for masses of remnants):

< 1.44 solar masses = Small
1.44 to 2.9 solar masses = Medium
> 2.9 solar masses = Large

Define classifications: WD = White Dwarf, NS = Neutron Star, BH = Black Hole, BD = Brown Dwarf.

Page 24: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Some training data:

Progenitor Mass    Present Mass    Remnant Classification
Average            Small           WD
Average            Small           WD
Average            Small           WD
Average            Small           WD
Average            Medium          NS
Giant              Medium          NS
Giant              Medium          NS
Giant              Medium          NS
Giant              Large           BH
Giant              Large           BH
Supergiant         Large           BH
Supergiant         Large           BH
Tiny               Small           BD
Tiny               Small           BD

Page 25: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

If we find a new stellar remnant with attributes <Average, Medium>, we could certainly put its mass into a stellar model that has been fine-tuned by our neural net, or we could simply use a Bayesian classification.

Either would give the same result: comparing with the data we have and matching attributes, this has to be a neutron star.

Similarly we can predict: <Tiny, Small> = Brown Dwarf, <Supergiant, Large> = Black Hole.
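The Bayesian route can be sketched directly from the training table above (a minimal naive Bayes sketch, not from the slides; the function and variable names are illustrative):

from collections import Counter

# the 14 rows of the training table above
data = [
    ("Average", "Small", "WD"), ("Average", "Small", "WD"),
    ("Average", "Small", "WD"), ("Average", "Small", "WD"),
    ("Average", "Medium", "NS"),
    ("Giant", "Medium", "NS"), ("Giant", "Medium", "NS"), ("Giant", "Medium", "NS"),
    ("Giant", "Large", "BH"), ("Giant", "Large", "BH"),
    ("Supergiant", "Large", "BH"), ("Supergiant", "Large", "BH"),
    ("Tiny", "Small", "BD"), ("Tiny", "Small", "BD"),
]

def classify(progenitor, remnant):
    class_counts = Counter(c for _, _, c in data)
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        prior = n_c / len(data)                                                 # P(v_j)
        p_prog = sum(1 for p, _, v in data if v == c and p == progenitor) / n_c  # P(a1 | v_j)
        p_rem = sum(1 for _, r, v in data if v == c and r == remnant) / n_c      # P(a2 | v_j)
        score = prior * p_prog * p_rem
        if score > best_score:
            best, best_score = c, score
    return best

print(classify("Average", "Medium"))     # NS
print(classify("Tiny", "Small"))         # BD
print(classify("Supergiant", "Large"))   # BH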

Page 26: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Quantitative example. Table 3.2, pg 59.

Page 27: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Quantitative example. See Table 3.2, pg 59. Possible target values V = {no, yes}. P(yes) = 9/14 = .64, P(no) = 5/14 = .36. We want to know: PlayTennis? for the instance <sunny, cool, high, strong>. We need P(sunny|no), P(sunny|yes), etc.

P(sunny|no) = # sunny instances in the no category / # no instances = 3/5. P(sunny|yes) = 2/9, etc.

NB classification: NO. Support for no = P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no) = .0206. Support for yes = … = .0053.
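A quick arithmetic check of those two numbers (a sketch; the conditional probabilities not quoted on the slide, for cool, high, and strong, are taken from the same Table 3.2 as commonly reproduced):

p_yes, p_no = 9 / 14, 5 / 14

# attribute order: sunny, cool, high, strong
support_yes = p_yes * (2 / 9) * (3 / 9) * (3 / 9) * (3 / 9)
support_no = p_no * (3 / 5) * (1 / 5) * (4 / 5) * (3 / 5)

print(round(support_yes, 4))   # 0.0053
print(round(support_no, 4))    # 0.0206  ->  v_NB = no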

Page 28: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Estimating probabilities. Usually

P(event) = (# times event occurred)/(total # trials)

Fair coin: 50/50 heads/tails. So out of two tosses, we expect 1 head and 1 tail. Don't bet your life on it!!!

What about 10 tosses? P(all tails) = (1/2)^10 = 1/1024. More likely than winning the lottery, which DOES happen!

Page 29: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Small sample bias. Using the simple ratio induces a bias.

Example: P(heads) = 0/10 = 0. NOT TRUE!! The sample is not representative of the population.

And a zero estimate will dominate the NB classifier, since it multiplies the whole product by 0.

Page 30: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

M-estimate. P(event e) = [# times e occurred + m*p] / [# trials + m]

m = “equivalent sample size,” p = prior estimate of P(e).

Essentially we assume m virtual trials that follow the assumed prior distribution (usually uniform), in addition to the real ones.

This reduces the small-sample bias.
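A small sketch applying this to the earlier coin example (the choice m = 5 and the uniform prior p = 0.5 are just for illustration):

def m_estimate(n_event, n_trials, p, m):
    # (# times e occurred + m*p) / (# trials + m)
    return (n_event + m * p) / (n_trials + m)

# 0 heads in 10 tosses, prior p = 0.5, equivalent sample size m = 5
print(round(m_estimate(0, 10, 0.5, 5), 3))   # 0.167 instead of a hard 0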

Page 31: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Text Classification using Naïve Bayes.

A Naïve Bayes approach was used to decide “like” or “dislike” based on the words in the text. Simplifying assumptions:

1: The position of a word did not matter.
2: The 100 most common words were removed from consideration.

Overall performance = 89% accuracy, versus 5% for random guessing.

Page 32: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Bayesian Belief Networks. The assumption of conditional independence may not be true!!

Example: v = lives in community “k”. Suppose we know that a survey has been done indicating 90% of the people there are young-earth creationists.

h1 = Is a young-earth creationist.
h2 = Discredits Darwinian evolution.
h3 = Likes carrots.

Clearly (h1|v) and (h2|v) are not independent, but (h3|v) is unaffected.

Page 33: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Reality. In any set of attributes, some will be

conditionally independent. And some will not.

Page 34: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Bayesian Belief Networks. Allow conditional independence rules to be

stated for certain subsets of the attributes. Best of both worlds:

More realistic than assuming all attributes are conditionally independent.

Computationally cheaper than if we ignore the possibility of independent attributes.

Page 35: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Formally, a Bayesian belief network describes a probability distribution over a set of variables. If we have variables Y1, …, Yn, where Yk has domain Vk, then the Bayesian belief network is a joint probability distribution over V1 × V2 × … × Vn.

Page 36: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Representation. Each variable in the instance is a node in the BBN. Every node has:

1: Network arcs, which assert that the variable is conditionally independent of its nondescendants, given its parents.

2: A conditional probability table, which defines the distribution of the variable given the values of its parents.

Page 37: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Strongly reminiscent of a Neural Net structure. Here we have conditional probabilities instead of weights. See fig 6.3. pg 186.

Page 38: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

A storm affects the probability of someone lighting a campfire, not the other way around. BBNs allow us to state causality rules!!

Once we have learned the BBN, we can calculate the probability distribution of any attribute. (pg 186.)

Page 39: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Learning the network:

If the structure is known and all attributes are visible, then learn the probabilities as in the NB classifier.

If the structure is known but not all attributes are visible, we have hidden values. Train using NN-style gradient ascent, or the EM algorithm.

If the structure is unknown… various methods.

Page 40: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Gradient ascent training. Maximize P(D|h) by moving in the direction of steepest ascent of ln P(D|h), i.e. along Gradient(ln P(D|h)).

Define the ‘weight’ wijk as the conditional probability that Yi takes value yij given that its parents Ui are in the configuration uik.

General form:

Eqn 6.25.
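For reference, the gradient and the corresponding update step take the standard form below (with learning rate η); this is consistent with Mitchell's Eqn 6.25:

$$ \frac{\partial \ln P(D \mid h)}{\partial w_{ijk}}
   = \sum_{d \in D} \frac{P_h(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}},
\qquad
   w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij},\, u_{ik} \mid d)}{w_{ijk}} $$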

Page 41: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Weight update:

Since, for given i and k, the wijk must sum to 1, and after an update the sum can exceed 1, we update all the wijk and then renormalize:

wijk ← wijk / (Σj wijk)

Page 42: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Works very well in practice, though it can get stuck on local optima, much like Gradient descent for NN’s!!

Page 43: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

EM algorithm. An alternative way of learning hidden variables given training data.

In general, use the mean value for an attribute when its value is not known.

“The EM algorithm can be used even for variables whose value is never directly observed, provided the general form of the probability distribution governing those variables is known.”

Page 44: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Learning the means of several unknown variables (normal distributions).

Guess h = <μ1, μ2>. The variance must be known.

Estimate the probability of data point i coming from each distribution, assuming h is correct.

Update h.

Page 45: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

What complicates the situation is that we have more than one unknown variable. The general idea is:

Pick your estimates of the means at random. Assuming they are correct, and using the standard deviation (which must be known and equal for all variables), figure out which data points are most likely to have come from which distribution. Those points will have the largest effect on the revision of that mean.

Page 46: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Then redefine each approximate mean as a weighted sample mean, where all data points are considered, but their effect on the kth distribution mean is weighted by how likely they were to come from it.

Loop till we get convergence to a set of means.

Page 47: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

In general, we use a quality measure to assess the fit of the sample data to the learned distribution means. For normally distributed variables, we could use a hypothesis test: how certain can we be that the true mean equals the estimated value, given the sample data?

If the quality function has only one maximum, this method will find it. Otherwise, it can get stuck on a local maximum, but in practice it works quite well.

Page 48: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Example: estimating the means of 2 Gaussians. Suppose we have two unknown variables, and we want their means μ1 and μ2 (true-mean 1, true-mean 2).

All we have is a few sample data points from these distributions (the blue and green dots). We cannot see the actual distributions. We know the standard deviation beforehand.

Page 49: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Review of the normal distribution:

The inflection point is at one standard deviation, σ, from the mean. Centered at the mean, the probability of drawing a point at most 1σ away is about 68%; this means the probability of drawing a point to the right of μ + 1σ is about .5(1 - .68) = 16%.

At most 2σ from the mean is about 95%; the probability of drawing a point to the right of μ + 2σ is about .5(1 - .95) = 2.5%.

At most 3σ from the mean is about 99.7%; the probability of drawing a point to the right of μ + 3σ is about .5(1 - .997) = 0.15%.
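These tail figures can be sanity-checked with the standard normal CDF (a quick sketch, not from the slides):

import math

# P(X > mu + k*sigma) for a normal distribution = 0.5 * (1 - erf(k / sqrt(2)))
for k in (1, 2, 3):
    right_tail = 0.5 * (1 - math.erf(k / math.sqrt(2)))
    print(k, round(right_tail, 4))   # 0.1587, 0.0228, 0.0013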

Page 50: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

In the above, the “to the right” probability is the same as the “to the left” probability, but I am doing it one-sided for a reason.

The important part is to note how quickly the “tails” of the normal distribution fall off. In other words, the probability of drawing a data point drops drastically as we move away from the mean.

Page 51: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Back to our example. We randomly pick our estimates for the true means: h = <μ1, μ2>.

Then we drop them onto our data set, and for each data point i we find the probability that it came from distribution 1 and the probability that it came from distribution 2.

Page 52: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Recalling the discussion of how fast the probabilities fall off as we move away from the mean of a normal distribution, we get approximately this:

There is about a 60% chance that the blue dots came from Dist. 1, and less than a .001% chance that they came from Dist. 2. Similarly, there is about a 50% chance that the green dots came from Dist. 2, and less than a .001% chance that they came from Dist. 1.

These figures are estimates for qualitative understanding! Real values can be found using the rigorous definition that follows shortly.

Page 53: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha
Page 54: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

So we now calculate new estimates μ1’, μ2’ for true-mean 1 and true-mean 2. Basically, we say:

Assume the blue points did come from Dist. 1. Then the mean μ1 should be more to the left. Similarly, the mean μ2 should be more to the right, since it seems the green points are coming from Dist. 2.

In fact, we modify μ1 using all the data points (even those that we do not think come from distribution 1), but we weight each point's effect by how likely it is that the point came from distribution 1.

Page 55: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha
Page 56: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Now replace h = < μ1, μ2> with h’ = < μ1’, μ2’>

and repeat till we get convergence of h.

Page 57: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Formally. Initialize h = <μ1, μ2> using random values for the means.

Until h converges, DO:

1: Calculate E(zij) for each hidden variable zij, where E(zij) = the probability that xi came from distribution j.

2: Calculate the evolved hypothesis h’ = <μ1’, μ2’>, assuming zij = E(zij) as calculated using h.

3: Replace h with h’.
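A minimal runnable sketch of this loop for two 1-D Gaussians with known, equal σ (the data set, the value of σ, and the starting guesses below are made up for illustration; only the E and M steps follow the algorithm above):

import math
import random

random.seed(0)
sigma = 1.0
# made-up sample: 30 points from a Gaussian near -2 and 30 from one near +3
data = ([random.gauss(-2.0, sigma) for _ in range(30)] +
        [random.gauss(3.0, sigma) for _ in range(30)])

def em_two_means(xs, sigma, iters=50):
    mu = [random.choice(xs), random.choice(xs)]   # random initial h = <mu1, mu2>
    for _ in range(iters):
        # E step: E[z_ij] = probability that point i came from distribution j, given h
        resp = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M step: each mean becomes the sample mean weighted by those probabilities
        mu = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
              for j in range(2)]
    return mu

print([round(m, 2) for m in em_two_means(data, sigma)])   # typically near -2 and 3 (in some order)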

Page 58: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

It can be proved that this will find a locally optimal h, given the training data. If our quality function has only a single (global) maximum, it is guaranteed to find it. Otherwise, it can get stuck on a local optimum.

Just like gradient ascent for BBNs, or gradient ascent for NNs, etc.

It is also very reminiscent of an EA!! For example, a GA has a population, optimizes based on that population, creates descendants and chooses the best, and then updates the population.

Page 59: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

General EM algorithm. Let X = {x1, …, xm}: observed data in m trials.

Let Z = {z1, …, zm}: unobserved data in m trials.

Let Y = X ∪ Z: the full data. It is correct to treat Z as a random variable with a probability distribution completely determined by X and some parameters Θ to be learned.

Page 60: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Denote the current Θ as h. We are looking for the hypothesis h’ that maximizes E(ln P(Y|h’)). This is the maximum-likelihood hypothesis.

As before, we want to use the current hypothesis to estimate the quality of hypotheses reachable from h:

Q(h’|h) = E[ln P(Y|h’) | h, X]

Here Q(h’|h) depends on h, since we are calculating the expectation using h.

Page 61: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha

Algorithm: until we have convergence in h, DO:

1: Estimation (E) step. Calculate Q(h’|h), using the current hypothesis h and the observed X to estimate the probability distribution over the full data Y:

Q(h’|h) ← E[ln P(Y|h’) | h, X]

2: Maximization (M) step. Replace h with the h’ that maximizes Q.
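In compact form (the standard statement of the two steps, consistent with the above):

$$ \text{E step: } Q(h' \mid h) \leftarrow E\big[\ln P(Y \mid h') \mid h, X\big]
\qquad
\text{M step: } h \leftarrow \arg\max_{h'} Q(h' \mid h) $$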

Page 62: Bayesian Learning.  Pt 2.  6.7- 6.12 Machine Learning Promethea Pythaitha