Bayesian Learning. Pt 2. 6.7- 6.12 Machine Learning Promethea Pythaitha


Bayes Optimal Classifier.

Gibbs Algorithm.

Naïve Bayes Classifier.

Bayesian Belief networks.

EM algorithm.

Bayesian Optimal classifier. So far we have asked: which hypothesis is most likely to be correct? That is, which h is the MAP hypothesis?

Recall MAP = maximum a posteriori hypothesis: the hypothesis with the highest probability of being correct, given the sample data we have seen.

What we usually want is the classification of a specific instance x that is not in the training data D. One way: find hMAP and return its prediction for x.

Or: decide directly what the most probable classification for x is.

Boolean classification: suppose we have hypotheses h1, …, h6 with posterior probabilities:

Hypothesis    h1    h2    h3    h4    h5    h6
Post. prob.   .25   .15   .15   .15   .15   .15

Suppose h1 classifies x as -, and the rest classify it as +. Then the net support for - is 25%, and for + is 75%. The “Bayes Optimal Classification” is +, even though hMAP says -.
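The six-hypothesis example can be checked with a few lines of Python. This is only a minimal sketch: the function name `bayes_optimal_classify` and the dictionary layout are illustrative choices, and it assumes each hypothesis makes a hard (deterministic) prediction, so a hypothesis gives its whole posterior weight to the label it predicts.

```python
from collections import defaultdict

def bayes_optimal_classify(posteriors, predictions):
    """Return (best_label, support_per_label) for a single instance.

    posteriors:  {hypothesis name: posterior probability P(h | D)}
    predictions: {hypothesis name: label that hypothesis assigns to x}
    Assumes deterministic hypotheses (hypothetical helper, not from the text).
    """
    support = defaultdict(float)
    for h, p in posteriors.items():
        # Each hypothesis votes for its label with weight P(h | D).
        support[predictions[h]] += p
    best = max(support, key=support.get)
    return best, dict(support)

# Posterior probabilities and predictions from the table above.
posteriors = {"h1": 0.25, "h2": 0.15, "h3": 0.15, "h4": 0.15, "h5": 0.15, "h6": 0.15}
predictions = {"h1": "-", "h2": "+", "h3": "+", "h4": "+", "h5": "+", "h6": "+"}

label, support = bayes_optimal_classify(posteriors, predictions)
h_map = max(posteriors, key=posteriors.get)  # the single most probable hypothesis

print("Bayes optimal label:", label, "support:", support)
print("hMAP:", h_map, "predicts", predictions[h_map])
```

The weighted vote and the MAP hypothesis disagree here, which is exactly the point of the example.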

Bayes Optimal Classifier. Classifies an instance by taking the average of all the hypotheses' predictions, weighted by the credibility (posterior probability) of each hypothesis.

Eqn 6.18.
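The equation itself did not survive in these notes. Assuming it has the standard form of the Bayes optimal classification, with V the set of possible classifications (the same V used later for Naïve Bayes) and v_BOC a label chosen here for the result, it would read:

```latex
v_{BOC} = \operatorname*{arg\,max}_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)
```

Here P(h_i | D) is the posterior "credibility" of hypothesis h_i, and P(v_j | h_i) is the probability that h_i assigns to classification v_j (0 or 1 for a hypothesis that makes a hard prediction).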

Any system that classifies instances using this rule is a “Bayes Optimal Classifier”.

No other classification method, given the same hypothesis space and prior knowledge, can outperform this method on average!!!

A particularly interesting result is that the predictions made by a BOC can correspond to hypotheses that are not even in its hypothesis space. This helps deal with the limitations imposed by overly restricted hypothesis spaces.

Best performance!!! At what cost?

Bayes Optimal Classification is the best, on average, but it can be very computationally costly. First it has to learn the posterior probabilities of all the hypotheses in H. Then it has to poll the hypotheses to find what each one predicts for x's classification. Then it has to compute the big weighted sum (eqn 6.18).

But remember, hypothesis spaces get very large…. Recall the reason why we used specific and general boundaries in the Candidate-Elimination (CELA) algorithm.

Gibbs Algorithm. One way to avoid some of the computation is:

1: Select h from H at random, according to the posterior probability distribution. So more “credible” hypotheses are selected with higher probability than others.

2: Return h(x).

This saves the time of computing the results hi(x) for all hi in H and doing the big sum…. But it is less optimal.
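The two steps can be sketched in Python as follows. This is a hedged illustration only: the function name `gibbs_classify` is hypothetical, and the posteriors and predictions are the six-hypothesis numbers from the Bayes optimal example, reused here as an assumption.

```python
import random

def gibbs_classify(posteriors, predictions, rng=random):
    """Gibbs algorithm for one instance (illustrative sketch).

    Step 1: draw ONE hypothesis at random, with probability P(h | D).
    Step 2: return that hypothesis's prediction for the instance.
    """
    names = list(posteriors)
    weights = [posteriors[h] for h in names]
    h = rng.choices(names, weights=weights, k=1)[0]
    return predictions[h]

# Same six hypotheses as in the Bayes optimal example (assumed here).
posteriors = {"h1": 0.25, "h2": 0.15, "h3": 0.15, "h4": 0.15, "h5": 0.15, "h6": 0.15}
predictions = {"h1": "-", "h2": "+", "h3": "+", "h4": "+", "h5": "+", "h6": "+"}

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print(gibbs_classify(posteriors, predictions, rng))
```

Only one hypothesis is consulted per query, so the answer is random: over many queries it returns "-" roughly 25% of the time, whereas the Bayes optimal classifier always returns "+".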

How well can it do?

If we compute the expected misclassification error of the Gibbs algorithm over target concepts drawn at random according to the prior probability distribution assumed by the learner, then this error will be at most twice that of the B.O.C.

** On average.

Naïve Bayes Classifier. [NBC] A highly practical Bayesian learner.

Can, under the right conditions, perform as well as neural nets or decision trees.

Applies to any learning task where each instance x is described by a conjunction of attributes (or attribute-value pairs) and where the target function f(x) can take any value in a finite set V.

We are given training data D, and asked to classify a new instance x = <a1, a2, …, an>.

Bayesian approach: classify a new instance by assigning the most probable target value, vMAP, given the attribute values of the instance.

Eqn 6.19.
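The equation is not reproduced in these notes. Assuming it is the usual definition of vMAP followed by an application of Bayes' theorem, it would read:

```latex
\begin{aligned}
v_{MAP} &= \operatorname*{arg\,max}_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \\
        &= \operatorname*{arg\,max}_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
        &= \operatorname*{arg\,max}_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)
\end{aligned}
```

The denominator can be dropped in the last step because it does not depend on v_j.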

To get the classification we need to find the vj that maximizes P(a1, a2, …, an | vj) * P(vj).

Second term: EASY. It's simply (# instances with classification vj) / (total # instances).

First term: Hard!!! Need a HUGE set of training data.

Suppose we have 10 attributes (with 2 possible values each) and 15 classifications: 2^10 * 15 = 15,360 possibilities. And we have, say, 200 instances with known classifications. Cannot get a reliable estimate!!!

The Naïve assumption. One way to ‘fix’ the problem is to assume the attributes are conditionally independent given the target value. Assume P(a1, a2, …, an | vj) = Πi P(ai | vj).

Then the Naïve Bayes Classifier uses this for the prediction:

Eqn 6.20.
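The equation is not reproduced in these notes. Substituting the naïve assumption P(a1, …, an | vj) = Πi P(ai | vj) into the expression being maximized gives what is presumably the intended formula:

```latex
v_{NB} = \operatorname*{arg\,max}_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
```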

Naïve Bayes Algorithm.

1: Learn P(vj) and P(ai | vj) for all a's and v's (based on the training data). In our example this is 10(2)*15 = 300 conditional probabilities. A sample size of 200 is plenty!

** This set of numbers is the learned hypothesis.

2: Use this hypothesis to find vNB.

IF our “naïve assumption” of conditional independence is true, then vNB = vMAP.

Bayesian Learning vs. Other Machine Learning methods. In Bayesian learning, there is no explicit search through a hypothesis space. It does not produce an inference rule (D-tree) or a weight vector (NN). Instead it forms the hypothesis by observing the frequencies of various data combinations.

Example. There are four possible end-states of stellar evolution:

1: White Dwarf.
2: Neutron Star.
3: Black Hole.
4: Brown Dwarf.

White Dwarf. About the size of the Earth, with about the mass of our Sun [up to 1.44 times solar mass]. The little white dot in the center of each is the White Dwarf.

Neutron Stars. About the size of the Gallatin Valley. About twice the mass of our Sun (up to 2.9 solar masses). Don't go too close! Their gravitational and electromagnetic fields are huge enough to stretch you into spaghetti, rip out every metal atom in your body, and finally spread you across the whole surface!!

Form in Type II supernovae. The Neutron Star is a tiny speck at the center of that cloud.

Black Holes. (3 to 50 solar masses) The ultimate cosmic sink-hole, even devours light!! Time-dilation, etc. come into effect near the event horizon.

Brown Dwarfs. Stars that never got hot enough to start fusion (< .1 solar masses).

Classification: Because it is hard to get data and impossible to observe these from close up, we need an accurate way of identifying these remnants.

Two ways:

1: Computer model of stellar structure. Create a program that models a star. It has to estimate the equations of state governing the more bizarre remnants (such as Neutron Stars), as they involve super-nuclear densities and are not well predicted by quantum mechanics. Learning algorithms (such as NN's) are sometimes used to tune the model based on known stellar remnants.

2: Group an unclassified remnant with others having similar attributes.

The latter is more like a Bayesian Classifier.

Define (for masses of progenitor stars):
0 to .1 Solar Masses = Tiny.
.1 to 10 Solar Masses = Average.
10 to 40 Solar Masses = Giant.
40 to 150 Solar Masses = Supergiant.

Define (for masses of remnants):
< 1.44 Solar Masses = Small.
1.44 to 2.9 Solar Masses = Medium.
> 2.9 Solar Masses = Large.

Define classifications:
WD = White Dwarf.
NS = Neutron Star.
BH = Black Hole.
BD = Brown Dwarf.

Some Training Data:

Progenitor Mass     Present Mass        Remnant
(in Solar Masses)   (in Solar Masses)   Classification
Average             Small               WD
Average             Small               WD
Average             Small               WD
Average             Small               WD
Average             Medium              NS
Giant               Medium              NS
Giant               Medium              NS
Giant               Medium              NS
Giant               Large               BH
Giant               Large               BH
Supergiant          Large               BH
Supergiant          Large               BH
Tiny                Small               BD
Tiny                Small               BD

If we find a new stellar remnant with attributes <Average, Medium>, we could certainly put its mass into a stellar model that has been fine-tuned by our Neural Net, or we could simply use a Bayesian classification.

Either would give the same result: comparing with the data we have, and matching attributes, this has to be a Neutron Star.

Similarly we can predict: <Tiny, Small> → Brown Dwarf. <Supergiant, Large> → Black Hole.
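These three predictions can be reproduced with a small Naïve Bayes implementation over the 14 training rows. The sketch below uses plain frequency ratios (no smoothing), and the function names `train` and `classify` are illustrative, not taken from any library.

```python
from collections import Counter

# The 14 training rows: (progenitor mass, present mass, classification).
TRAINING = (
    [("Average", "Small", "WD")] * 4
    + [("Average", "Medium", "NS")]
    + [("Giant", "Medium", "NS")] * 3
    + [("Giant", "Large", "BH")] * 2
    + [("Supergiant", "Large", "BH")] * 2
    + [("Tiny", "Small", "BD")] * 2
)

def train(rows):
    """Count how often each class and each (attribute, value, class) occurs."""
    class_counts = Counter(row[-1] for row in rows)
    cond_counts = Counter()
    for row in rows:
        label = row[-1]
        for i, value in enumerate(row[:-1]):
            cond_counts[(i, value, label)] += 1
    return class_counts, cond_counts

def classify(attrs, class_counts, cond_counts):
    """Return (best label, score per label), score = P(v) * prod_i P(a_i | v)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(v)
        for i, value in enumerate(attrs):
            score *= cond_counts[(i, value, label)] / n  # P(a_i | v) by frequency
        scores[label] = score
    return max(scores, key=scores.get), scores

class_counts, cond_counts = train(TRAINING)
for query in [("Average", "Medium"), ("Tiny", "Small"), ("Supergiant", "Large")]:
    label, scores = classify(query, class_counts, cond_counts)
    print(query, "->", label, scores)
```

For <Average, Medium> every class except NS gets a score of exactly 0, because at least one of its conditional frequencies is 0. That is convenient here, but it is also the small-sample problem that the m-estimate is meant to soften.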

Quantitative example. See table 3.2, pg 59.

Possible target values = {no, yes}.
P(yes) = 9/14 = .64
P(no) = 1 - P(yes) = 5/14 = .36

Want to know: PlayTennis? if <sunny, cool, high, strong>.

Need P(sunny|no), P(sunny|yes), etc…
P(sunny|no) = #sunny's in no category / #no's = 3/5.
P(sunny|yes) = 2/9. etc…

Support for no = P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no) = .0206.
Support for yes = … = .0053.

NB classification: NO.
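The arithmetic can be checked with exact fractions. Only the two "sunny" probabilities are stated in these notes; the remaining conditional probabilities below are filled in from the standard 14-row PlayTennis table (9 yes, 5 no) that Table 3.2 refers to, so treat them as an assumption if your copy of the table differs.

```python
from fractions import Fraction as F

# Priors: 9 "yes" rows and 5 "no" rows out of 14.
prior = {"yes": F(9, 14), "no": F(5, 14)}

# Conditional probabilities P(attribute value | class), counted from the table.
# Values other than "sunny" are assumed from the standard PlayTennis data.
cond = {
    "yes": {"sunny": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "strong": F(3, 9)},
    "no": {"sunny": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "strong": F(3, 5)},
}

instance = ["sunny", "cool", "high", "strong"]

support = {}
for v in prior:
    s = prior[v]
    for a in instance:
        s *= cond[v][a]  # naive assumption: multiply the per-attribute terms
    support[v] = s

decision = max(support, key=support.get)
# Normalizing the two supports gives the posterior probability of "no".
p_no = support["no"] / (support["no"] + support["yes"])

print({v: round(float(s), 4) for v, s in support.items()})
print("decision:", decision, " P(no | instance) ~", round(float(p_no), 3))
```

Normalizing shows that the classifier is not just choosing "no" but doing so with a posterior of roughly 0.8 under the naïve assumption.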

Estimating probabilities. Usually

P(event) = (# times event occurred)/(total # trials)

Fair coin: 50/50 heads/tails. So out of two tosses, we expect 1 head, 1 tail. Don't bet your life on it!!!

What about 10 tosses? P(all tails) = (½)^10 = 1/1024. More likely than winning the lottery, which DOES happen!

Small sample bias. Using the simple ratio induces a bias:

ex: P(heads) = 0/10 = 0. NOT TRUE!!

The sample is not representative of the population.

And the zero will dominate the NB classifier: multiplication by 0.

M-estimate. P(event e) = [#times e occurred + m*p]/[# trials +m]

m = “equivalent sample size.” p = prior estimate of P(e).

Essentially we assume m virtual trials, distributed according to the prior estimate p (usually uniform), in addition to the real ones.

Reduces the small-sample bias.
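As a quick illustration, here is the m-estimate applied to the "10 tosses, 0 heads" case. The function name `m_estimate` and the choice m = 2 are assumptions made for this sketch only.

```python
def m_estimate(n_event, n_trials, p, m):
    """m-estimate of a probability.

    n_event:  number of times the event occurred
    n_trials: number of real trials
    p:        prior estimate of the probability
    m:        equivalent sample size (number of virtual trials)
    """
    return (n_event + m * p) / (n_trials + m)

raw = m_estimate(0, 10, p=0.5, m=0)       # m = 0 reduces to the plain ratio 0/10
smoothed = m_estimate(0, 10, p=0.5, m=2)  # (0 + 2*0.5) / (10 + 2) = 1/12

print(raw, smoothed)
```

With m = 0 the estimate is the raw ratio; as m grows the estimate moves toward the prior p. Any m > 0 with p > 0 keeps the estimate away from exactly 0, so a single unseen attribute value no longer wipes out the whole Naïve Bayes product.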

Text Classification using Naïve Bayes.

Used a Naïve Bayes approach to decide “like” or “dislike” based on the words in the text. Simplifying assumptions:

1: The position of a word did not matter.
2: The 100 most common words were removed from consideration.

Overall performance = 89% accuracy, versus 5% for random guessing (a 5% baseline corresponds to guessing among 20 equally likely classes rather than two).

Bayesian Belief network. The assumption of conditional independence may not be true!!

EX: v = lives in community “k”. Suppose we know that a survey has been done indicating that 90% of the people there are young-earth creationists.

h1 = Is a Young-earth Creationist.
h2 = Discredits Darwinian Evolution.
h3 = Likes Carrots.

Clearly (h1|v) and (h2|v) are not independent, but (h3|v) is unaffected.

Reality. In any set of attributes, some will be conditionally independent, and some will not.

Bayesian Belief Networks. Allow conditional independence rules to be stated for certain subsets of the attributes. Best of both worlds:

More realistic than assuming all attributes are conditionally independent.

Computationally cheaper than if we ignore the possibility of independent attributes.

Formally, a Bayesian belief network describes a probability distribution over a set of variables. If we have variables Y1, …, Yn, where Yk has domain Vk, then the Bayesian belief network describes the joint probability distribution over V1 × V2 × … × Vn.

Representation. Each variable in the instance is a node in the BBN. For every node:

1: The network arcs assert that the variable is conditionally independent of its non-descendants, given its parents.

2: A conditional probability table defines the distribution of the variable given the values of its parents.

Strongly reminiscent of a Neural Net structure. Here we have conditional probabilities instead of weights. See fig 6.3, pg 186.

Storm affects the probability of someone lighting a campfire, not the other way around.

BBN's allow us to state causality rules!! Once we have learned the BBN, we can calculate the probability distribution of any attribute (pg 186).
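To make the representation concrete, here is a minimal sketch of a three-node fragment: Storm and BusTourGroup are parents of Campfire. All numbers (the two priors and the conditional probability table) are illustrative assumptions chosen for this sketch, not values quoted from these notes. The joint probability is the product of each node's probability given its parents, and any other probability follows by summing the joint.

```python
from itertools import product

# Hypothetical priors for the two parentless nodes.
P_STORM = 0.1
P_BUS = 0.3

# Hypothetical conditional probability table:
# P(Campfire = True | Storm, BusTourGroup).
P_CAMPFIRE = {
    (True, True): 0.4,
    (True, False): 0.1,
    (False, True): 0.8,
    (False, False): 0.2,
}

def bern(p_true, value):
    """Probability that a Boolean variable takes `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(storm, bus, campfire):
    """P(storm, bus, campfire) = P(storm) * P(bus) * P(campfire | storm, bus)."""
    return (
        bern(P_STORM, storm)
        * bern(P_BUS, bus)
        * bern(P_CAMPFIRE[(storm, bus)], campfire)
    )

# Marginal: sum the joint over all parent configurations.
p_campfire = sum(joint(s, b, True) for s, b in product([True, False], repeat=2))

# Inference "against" the arc direction: P(Storm | Campfire) by Bayes' rule.
p_storm_given_campfire = sum(joint(True, b, True) for b in [True, False]) / p_campfire

print(round(p_campfire, 3), round(p_storm_given_campfire, 4))
```

The arcs only say how the table is factored; once the joint is available, probabilities can be computed in either direction, which is why seeing a campfire lowers the probability of a storm in this toy setup.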

Learning the Network:

If the structure is known and all attributes are visible, then learn the probabilities as in the NB classifier.

If the structure is known, but not all attributes are visible, we have hidden values. Train using NN-type gradient ascent, or the EM algorithm.

If the structure is unknown… various methods.

Gradient ascent training. Maximize P(D|h) by going in the direction of steepest ascent: Gradient(ln P(D|h)).

Define the ‘weight’ wijk as the conditional probability that Yi takes value yij given that its parents are in the configuration uik.

General form:

Eqn 6.25.

Weight update:

Since for given i and k the sum of the wijk's over j must be 1, and after the update it can exceed 1, we update all wijk's and then normalize:

wijk ← wijk / (Σj wijk)

Works very well in practice, though it can get stuck on local optima, much like gradient descent for NN's!!
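The gradient (Eqn 6.25) and the weight update did not survive in these notes. Assuming they follow the standard derivation for this setting, with η a small learning-rate constant and P_h denoting a probability computed under the current hypothesis h (both symbols introduced here for illustration), they would read:

```latex
\begin{aligned}
\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}}
  &= \sum_{d \in D} \frac{P_h(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}} \\[4pt]
w_{ijk} &\leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij},\, u_{ik} \mid d)}{w_{ijk}} \\[4pt]
w_{ijk} &\leftarrow \frac{w_{ijk}}{\sum_{j} w_{ijk}}
\end{aligned}
```

The first line is the gradient, the second is the ascent step, and the third is the renormalization that keeps each conditional probability table column summing to 1.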

EM algorithm. An alternate way of learning with hidden variables, given training data.

In general, use the mean value for an attribute when its value is not known.

“The EM algorithm can be used even for variables whose value is never even directly observed, provided the general form of the probability distribution governing those variables is known.”

Learning the means of several unknown variables (normal distribution):

Guess h = <μ1, μ2>. Must know the variance.

Estimate the probability of data pt. i coming from each distribution, assuming h is correct.

Update h.

What complicates the situation is that we have more than one unknown variable.

The general idea is:

Pick your initial estimates of the means randomly.

Assuming they are correct, and using the standard deviation (which must be known and equal for all variables), figure out which data points are most likely to have come from which distributions. They will have the largest effect on the revision of that mean.

Then redefine each approximate mean as a weighted sample mean, where all data points are considered, but their effect on the kth distribution mean is weighted by how likely they were to come from it.

Loop till we get convergence to a set of means.

In general, we use a quality test to measure the fit of the sample data with the learned distribution means. For normally distributed variables, we could use a hypothesis test: how certain can we be that true-mean = estimated value, given the sample data?

If the quality function has only one maximum, this method will find it. Otherwise, it can get stuck on a local maximum, but in practice it works quite well.

Example: Estimating means of 2 Gaussians. Suppose we have two unknown variables, and we want their means μ1 and μ2 (true-mean 1, true-mean 2).

All we have is a few sample data points from these distributions {blue and green dots}.

We cannot see the actual distributions. We know the standard deviation beforehand.

Review of the normal distribution:

The inflection point is at one standard deviation σ from the mean.

Centered at the mean, the probability of getting a point at most 1σ away is 68%. This means the probability of drawing a point to the right of μ+1*σ is about .5(1-.68) = 16%.

At most 2σ from the mean is 95%. This means the probability of drawing a point to the right of μ+2*σ is about .5(1-.95) = 2.5%.

At most 3σ from the mean is 99.7%. This means the probability of drawing a point to the right of μ+3*σ is about .5(1-.997) = 0.15%.

In the above, the “to the right” probability is the same for the left, but I am doing one-sided for a reason.

The important part is to note how quickly the “tails” of the normal distribution “fall off”.

In other words, the probability of getting a specific data point drops drastically as we go away from the mean.
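The rounded figures above can be compared with the exact values using the error function from Python's standard library. The helper names `within` and `right_tail` are illustrative.

```python
import math

def within(k):
    """P(|X - mu| <= k*sigma) for a normal distribution."""
    return math.erf(k / math.sqrt(2))

def right_tail(k):
    """P(X > mu + k*sigma): half of the probability outside the central band."""
    return 0.5 * (1.0 - within(k))

for k in (1, 2, 3):
    print(f"{k} sigma: within = {within(k):.4f}, right tail = {right_tail(k):.5f}")
```

The exact right-tail values come out slightly different from the slide's rounded ones (for example about 2.3% rather than 2.5% at 2σ, because 95% is itself a rounding of about 95.4%), but the qualitative point stands: each extra σ shrinks the tail dramatically.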

Back to our example. We randomly pick our estimates for the true means: h = <μ1, μ2>.

Then we drop them on our data set, and for each data point i, find the probability that it came from distribution 1, and the probability that it came from distribution 2.

Recalling how fast the probabilities go down as we go away from the mean in a normal distribution, we get approximately this:

There's about a 60% chance that the blue dots came from Dist. 1, and less than .001% that they came from Dist. 2.

Similarly, there's about a 50% chance that the green dots came from Dist. 2, and less than .001% that they came from Dist. 1.

These figures are estimates for qualitative understanding! Real values can be found using the rigorous definition that follows shortly.

So we now calculate new estimates μ1’, μ2’ for true-mean1 and true-mean2. Basically, we say:

Assume the blue points did come from Dist. 1. Then the mean μ1 should be more to the left.

Similarly, the mean μ2 should be more to the right, since it seems the green points should be coming from Dist. 2.

In fact, we modify μ1 using all the data points (even those that we do not think come from distribution 1), but we weight each point's effect by how likely it is that the point came from distribution 1.

Now replace h = <μ1, μ2> with h’ = <μ1’, μ2’> and repeat till we get convergence of h.

Formally. Initialize h = <μ1, μ2> using random values for the means.

Until h converges: DO

Calculate E(zij) for each hidden variable zij. E(zij) = the probability that xi came from distribution j.

Calculate our evolved hypothesis h’ = <μ1’, μ2’>, assuming zij = E(zij) as calculated using h.

Replace h with h’.
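Written out for two Gaussians with a known, shared standard deviation σ, the two steps of the loop take the following form. This assumes, as the setup suggests, that each point is a priori equally likely to come from either distribution; m denotes the number of data points.

```latex
\begin{aligned}
E[z_{ij}] &= \frac{\exp\!\left(-\frac{(x_i - \mu_j)^2}{2\sigma^2}\right)}
                 {\sum_{n=1}^{2} \exp\!\left(-\frac{(x_i - \mu_n)^2}{2\sigma^2}\right)} \\[6pt]
\mu_j' &= \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}
\end{aligned}
```

The first line is the expectation step (how strongly point x_i "belongs" to distribution j under the current means), and the second is the weighted sample mean used to form h’.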

It can be proved that this will find a locally optimal h, given the training data. If our quality function has only one maximum, it is guaranteed to find it. Otherwise, it can get stuck on a local optimum.

Just like gradient ascent for BBN's, or gradient descent for NN's, etc.
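The loop can be sketched in Python for the two-Gaussian case. This is a minimal illustration under stated assumptions: the standard deviation is known and shared, each point is a priori equally likely to come from either distribution, and the function name `em_two_means`, the synthetic data (true means 0 and 5) and the starting guess (smallest and largest data point, instead of a random guess) are all choices made for this sketch.

```python
import math
import random

def em_two_means(xs, sigma, mu_init, n_iter=100, tol=1e-9):
    """Estimate the means of Gaussians with known, equal sigma via EM."""
    mu = list(mu_init)
    for _ in range(n_iter):
        # E-step: E[z_ij], the probability that point i came from distribution j.
        resp = []
        for x in xs:
            exps = [-((x - m) ** 2) / (2 * sigma ** 2) for m in mu]
            top = max(exps)  # subtract the max exponent to avoid underflow
            w = [math.exp(e - top) for e in exps]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: each new mean is a weighted sample mean over ALL points.
        new_mu = [
            sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
            for j in range(len(mu))
        ]
        converged = max(abs(a - b) for a, b in zip(mu, new_mu)) < tol
        mu = new_mu
        if converged:
            break
    return mu

# Synthetic data: 200 points from N(0, 1) and 200 from N(5, 1).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]

mu_hat = em_two_means(xs, sigma=1.0, mu_init=(min(xs), max(xs)))
print([round(m, 2) for m in mu_hat])
```

Starting both means at the same value would leave them identical forever, since every point would then be split 50/50 at each step; that is one simple way to see why the initial guess matters and why the method only guarantees a local optimum.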

Very reminiscent of an EA!! For example, a GA has a population. Optimize based on that population: create descendants and choose the best…. Update the population.

General EM algorithm. Let X = {x1, …, xm}: observed data in m trials.

Let Z = {z1, …, zm}: unobserved data in m trials.

Let Y = X ∪ Z = full data.

It is correct to treat Z as a random variable with a probability distribution completely determined by X and some parameters Θ to be learned.

Denote the current Θ as h. We are looking for the hypothesis h’ that maximizes E(ln P(Y|h’)). This is the maximum-likelihood hypothesis.

As said before, we want to use the current hypothesis to estimate the quality of states reachable from h:

Q(h’|h) = E[ln P(Y|h’) | h, X]

Here, Q(h’) depends on h since we are calculating it using h.

Algorithm: Until we have convergence in h: DO

1: Estimation (E): Calculate Q(h’|h), using the current hypothesis h and the observed X to estimate the probability distribution over the whole Y.

Q(h’|h) ← E[ln P(Y|h’) | h, X]

2: Maximization (M): Replace h with the h’ that maximizes Q:

h ← argmax over h’ of Q(h’|h)
