5. Maximum Likelihood –II Prof. Yuille. Stat 231. Fall 2004.


Page 1:

5. Maximum Likelihood –II

Prof. Yuille.

Stat 231. Fall 2004.

Page 2:

Topics

• Exponential Distributions, Sufficient Statistics, and MLE.

• Maximum Entropy Principle.

• Model Selection.

Page 3:

Exponential Distributions.

• Gaussians are members of the class of exponential distributions, which take the form P(x|λ) = (1/Z[λ]) exp{ Σ_a λ_a φ_a(x) }.

• Parameters: λ = (λ_1, …, λ_M).

• Statistics: φ(x) = (φ_1(x), …, φ_M(x)).
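As a concrete instance (a standard rewriting, using the notation of the reconstruction above), the one-dimensional Gaussian fits this form with statistics (x, x²):

P(x \mid \mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}
  = \frac{1}{Z[\lambda]}\exp\big\{\lambda_1 x + \lambda_2 x^2\big\},
\qquad \lambda_1 = \frac{\mu}{\sigma^2},\quad \lambda_2 = -\frac{1}{2\sigma^2}.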

Page 4:

Sufficient Statistics.

• The φ_a(x) are the sufficient statistics of the distribution.

• Knowledge of the sums Σ_i φ_a(x_i) is all we need to know about the data {x_i} in order to estimate the parameters.

The rest is irrelevant.

• Almost all commonly used distributions can be expressed as exponentials – Gaussian, Poisson, etc.

Page 5:

Sufficient Statistics of Gaussian

• One-dimensional Gaussian P(x|μ,σ) with samples x_1, …, x_N.

• Sufficient statistics are Σ_i x_i

• and Σ_i x_i².

• These are sufficient to learn the parameters μ and σ² of the distribution from the data.
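A minimal numerical sketch (illustrative only, not from the slides): the maximum-likelihood parameters can be recovered from the two sums alone.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # samples from a 1-D Gaussian

# The two sufficient statistics: the data enter only through these sums.
N = len(x)
s1 = np.sum(x)
s2 = np.sum(x ** 2)

# MLE of the mean and variance, computed from (N, s1, s2) alone.
mu_hat = s1 / N
var_hat = s2 / N - mu_hat ** 2
print(mu_hat, var_hat)   # close to 2.0 and 1.5**2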

Page 6:

MLE for Gaussian

• To estimate the parameters, maximize the likelihood Π_i P(x_i|μ,σ).

• Or, equivalently, maximize the log-likelihood Σ_i log P(x_i|μ,σ).

• The sufficient statistics are chosen so that the log-likelihood depends on the data only through Σ_i x_i and Σ_i x_i².
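Written out (a standard derivation filling in the formulas that were figures in the original slides), the log-likelihood depends on the data only through the two sums, and setting its derivatives to zero gives the usual estimates:

\log L(\mu,\sigma) = \sum_{i=1}^{N} \log P(x_i \mid \mu,\sigma)
 = -\frac{N}{2}\log(2\pi\sigma^2)
   - \frac{1}{2\sigma^2}\Big( \sum_i x_i^2 - 2\mu \sum_i x_i + N\mu^2 \Big),

\hat\mu = \frac{1}{N}\sum_i x_i,
\qquad
\hat\sigma^2 = \frac{1}{N}\sum_i x_i^2 - \hat\mu^2 .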

Page 7:

Sufficient Statistics for Gaussian

• The distribution is of the form P(x|λ) = (1/Z[λ]) exp{ λ_1 x + λ_2 x² }, with λ_2 < 0.

• This is the same as a Gaussian with mean μ = -λ_1 / (2 λ_2)

• and variance σ² = -1 / (2 λ_2).

Page 8:

Exponential Models and MLE.

• MLE corresponds to maximizing the log-likelihood Σ_i log P(x_i|λ) = Σ_a λ_a Σ_i φ_a(x_i) - N log Z[λ].

Equivalent to minimizing F(λ) = log Z[λ] - Σ_a λ_a ψ_a,

where ψ_a = (1/N) Σ_i φ_a(x_i) are the observed values of the statistics.
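One step the slide leaves implicit (a standard property of exponential families): the gradient of F(λ) vanishes exactly when the model reproduces the observed statistics.

\frac{\partial F}{\partial \lambda_a}
 = \frac{\partial \log Z[\lambda]}{\partial \lambda_a} - \psi_a
 = E_{P(\cdot \mid \lambda)}\big[\phi_a(x)\big] - \psi_a = 0
\quad\Longrightarrow\quad
E_{P(\cdot \mid \lambda)}\big[\phi_a(x)\big] = \psi_a .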

Page 9:

Exponential Models and MLE.

• This minimization is a convex optimization problem and hence has a unique solution. But finding this solution may be difficult.

• Algorithms such as Generalized Iterative Scaling are guaranteed to converge.
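A minimal sketch of Generalized Iterative Scaling on a small discrete state space (the state space, statistics, and target values below are illustrative assumptions, not taken from the lecture):

import numpy as np

states = np.arange(5)                                        # x in {0, 1, 2, 3, 4}
phi = np.stack([states, states ** 2], axis=1).astype(float)  # statistics phi_a(x)

# GIS needs non-negative statistics with a constant row sum C: add a slack statistic.
C = phi.sum(axis=1).max()
phi = np.column_stack([phi, C - phi.sum(axis=1)])

psi = np.array([2.0, 5.5])                                   # observed (target) statistics
psi = np.append(psi, C - psi.sum())                          # slack target

lam = np.zeros(phi.shape[1])
for _ in range(2000):
    p = np.exp(phi @ lam)
    p /= p.sum()                                             # current model P(x|lambda)
    expect = p @ phi                                         # model expectations of the statistics
    lam += np.log(psi / expect) / C                          # GIS update

print(p @ phi[:, :2])                                        # approaches the targets [2.0, 5.5]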

Page 10:

Maximum Entropy Principle.

• An alternative way to think of Exponential Distributions and MLE.

• Start with the Statistics, and then estimate the form and the parameters of the probability distribution.

• Using the Maximum Entropy principle.

Page 11:

Entropy

• The entropy of a distribution is H[P] = -Σ_x P(x) log P(x).

• Defined by Shannon as a measure of the information obtained by observing a sample from P(x).

Page 12:

Maximum Entropy Principle

• Maximum Entropy Principle. Select the distribution P(x) which maximizes the entropy subject to constraints.

• Lagrange multipliers λ_a enforce the constraints Σ_x P(x) φ_a(x) = ψ_a, together with the normalization Σ_x P(x) = 1.

• The observed values of the statistics are ψ_a = (1/N) Σ_i φ_a(x_i).

Page 13:

Maximum Entropy

• Minimizing with respect to P(x) gives the (exponential) form of the distribution: P(x|λ) = (1/Z[λ]) exp{ Σ_a λ_a φ_a(x) }.

• Maximizing with respect to the Lagrange parameters ensures that the constraints are satisfied: Σ_x P(x|λ) φ_a(x) = ψ_a.
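The two steps written out (a standard maximum-entropy calculation, using the notation of the reconstruction above):

L[P, \lambda] = -\sum_x P(x) \log P(x)
  + \sum_a \lambda_a \Big( \sum_x P(x)\,\phi_a(x) - \psi_a \Big)
  + \gamma \Big( \sum_x P(x) - 1 \Big),

\frac{\partial L}{\partial P(x)} = 0
 \;\Rightarrow\;
 P(x \mid \lambda) = \frac{1}{Z[\lambda]} \exp\Big\{ \sum_a \lambda_a \phi_a(x) \Big\},
\qquad
\frac{\partial L}{\partial \lambda_a} = 0
 \;\Rightarrow\;
 \sum_x P(x \mid \lambda)\,\phi_a(x) = \psi_a .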

Page 14:

Maximum Entropy.

• This gives the same result as MLE for Exponential Distributions.

• Maximum Entropy + Constraints = Exponential Distribution + MLE Parameters.

• The Max-Ent distribution that reproduces the observed sufficient statistics is the exponential distribution with those statistics, with its parameters set by MLE.

• Example: we obtain a Gaussian by performing Max-Ent on the statistics x and x².

Page 15:

Minimax Principle.

• Construct a distribution incrementally by increasing the number of statistics φ_1, …, φ_M.

• The entropy of the Max-Ent distribution with M statistics is H[P_M] = log Z[λ*] - Σ_a λ*_a ψ_a, evaluated at the fitted parameters λ*.

• Minimax Principle: select the statistics to minimize the entropy of the maximum entropy distribution. This relates to model selection.
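Written as a single expression (a standard statement of the principle; the notation follows the reconstruction above, not the original figure): choose the set of statistics whose maximum-entropy fit has the smallest entropy.

\{\phi_1, \dots, \phi_M\}^{*}
 = \arg\min_{\{\phi_a\}} \; \max_{P \,:\, E_P[\phi_a] = \psi_a} H[P]
 = \arg\min_{\{\phi_a\}} H\big[P_M\big].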

Page 16:

Model Selection.

• Suppose we do not know which model generates the data.

• Two models M_1 and M_2.

• Priors P(M_1) and P(M_2) on the models.

• Model selection enables us to estimate which model is most likely to have generated the data.

Page 17:

Model Selection.

• Calculate P(M_1|D) ∝ P(M_1) ∫ dθ_1 P(D|θ_1, M_1) P(θ_1|M_1).

• Compare with P(M_2|D) ∝ P(M_2) ∫ dθ_2 P(D|θ_2, M_2) P(θ_2|M_2).

• Observe that we must sum (integrate) over all possible values of the model parameters θ_1 and θ_2.
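Combined into one comparison (a standard restatement of the two calculations above, with the same reconstructed notation), the models are ranked by their posterior odds:

\frac{P(M_1 \mid D)}{P(M_2 \mid D)}
 = \frac{P(M_1)}{P(M_2)} \cdot
   \frac{\int P(D \mid \theta_1, M_1)\, P(\theta_1 \mid M_1)\, d\theta_1}
        {\int P(D \mid \theta_2, M_2)\, P(\theta_2 \mid M_2)\, d\theta_2}.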

Page 18:

Model Selection & Minimax.

• The entropy of the Max-Ent distribution, H[P_{λ*}] = log Z[λ*] - Σ_a λ*_a ψ_a,

• is minus the log-probability of the data per sample, evaluated at the MLE parameters: H[P_{λ*}] = -(1/N) log P(data|λ*).

• So the Minimax Principle is a form of model selection. But it estimates the parameters instead of summing them out.

Page 19:

Model Selection.

• Important issue: suppose model M_1 has more parameters than M_2. Then M_1 is more flexible and can fit a wider range of data sets.

• But summing over the parameters θ_1 and θ_2 penalizes this flexibility.

• This gives an “Occam’s Razor” favoring the simpler model (see the sketch below).
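A toy numerical sketch (purely illustrative; the coin-flip setup, grid prior, and numbers are assumptions, not from the slides): when the data do not need the extra flexibility, summing out the parameter makes the simpler model win.

import numpy as np
from math import comb

n, k = 100, 52                       # data: 52 heads in 100 flips

def likelihood(theta):
    # Binomial likelihood P(data | theta)
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

# Model M_1 (flexible): unknown bias theta, uniform prior, summed out over a grid.
thetas = np.linspace(0.005, 0.995, 199)
evidence_m1 = np.mean(likelihood(thetas))   # grid approximation of the integral over theta

# Model M_2 (simple): fair coin, no free parameters.
evidence_m2 = likelihood(0.5)

print(evidence_m1, evidence_m2)             # the evidence favours the simpler M_2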

Page 20:

Model Selection.

• More advanced modeling requires performing model selection in settings where the models are complex.

• This is beyond the scope of this course.