
  • Likelihood and Bayesian Inference

    Joe Felsenstein

    Department of Genome Sciences and Department of Biology


  • Bayes’ Theorem

    Suppose we have related events: B, and some other mutually exclusive events A1, A2, A3, . . . , A8. The probability of A3 given B (for example) is

    Prob (A3 | B) = Prob (A3 and B) / Prob (B)

    and since it is also true that

    Prob (B | A3) = Prob (A3 and B) / Prob (A3)

    we can multiply by Prob (A3) and substitute for Prob (A3 and B) to get

    Prob (A3 | B) = Prob (A3) Prob (B | A3) / Prob (B)

    (Think of B as the data, and the Ai as different hypotheses).


  • Getting Bayes’ Rule

    Since the denominator can be rewritten as

    Prob (B) = Prob (A1) Prob (B | A1) + . . . + Prob (A8) Prob (B | A8)

    We can substitute that in to get the final form of Bayes’ Rule:

    Prob (A3 | B) = Prob (A3) Prob (B | A3) / [ Prob (A1) Prob (B | A1) + . . . + Prob (A8) Prob (B | A8) ]

    What this does is compute the probability of A3 given that we saw B from the prior probabilities of the Ai and the conditional probabilities of the observed data B given each Ai.
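    As a concrete illustration of the rule, here is a minimal Python sketch (not part of the slides; the priors and conditional probabilities are made-up numbers):

        # Bayes' Rule for mutually exclusive hypotheses A1, A2, A3.
        # The priors and the probabilities of the data B under each hypothesis
        # are illustrative numbers only.
        priors = [0.5, 0.3, 0.2]          # Prob(A1), Prob(A2), Prob(A3)
        likelihoods = [0.1, 0.4, 0.7]     # Prob(B | A1), Prob(B | A2), Prob(B | A3)

        prob_B = sum(p * l for p, l in zip(priors, likelihoods))            # the denominator
        posteriors = [p * l / prob_B for p, l in zip(priors, likelihoods)]  # Prob(Ai | B)

        print(posteriors)                 # the three posteriors sum to 1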


  • Odds ratio, Bayes’ Theorem, maximum likelihood

    We start with an “odds ratio” version of Bayes’ Theorem: take the ratio of the numerators for two different hypotheses and we get:

    D = the data,  H1 = Hypothesis 1,  H2 = Hypothesis 2,  | = the symbol for “given”

    Prob (H1 | D) / Prob (H2 | D)  =  [ Prob (D | H1) / Prob (D | H2) ]  ×  [ Prob (H1) / Prob (H2) ]

    posterior odds ratio           =   likelihood ratio                  ×   prior odds ratio

  • A simple example of Bayes’ Theorem

    If a space probe finds no Little Green Men on Mars, when it would have a 1/3 chance of missing them if they were there:

    [Figure: bar charts of the priors, the likelihoods, and the resulting posteriors for the hypotheses “no” and “yes”. With prior odds of 1 : 4 for “yes” versus “no”, and a likelihood ratio of (1/3) / 1 from the negative result, the posterior odds for “yes” are 1/4 × 1/3 = 1/12.]

  • The likelihood ratio term ultimately dominates

    If we see one Little Green Man, the likelihood calculation does the right thing: the likelihood of seeing one is 2/3 if they are there and 0 if they are not, so the posterior odds for “yes” become 1/4 × (2/3) / 0, which is infinite

    (put this way, this is OK but not mathematically kosher)

    If after n missions, we keep seeing none, the likelihood ratio term is

    (1/3)^n

    It dominates the calculation, overwhelming the prior. Thus even if we don’t have a prior we can believe in, we may be interested in knowing which hypothesis the likelihood ratio is recommending ...
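    A small numerical illustration of that dominance (a Python sketch, not part of the slides; the prior odds of 1/4 are the ones reconstructed from the Little Green Men example above):

        # Posterior odds for "yes, Little Green Men" after n missions that all see none.
        prior_odds = 1 / 4                                   # from the example above
        for n in [1, 2, 5, 10]:
            posterior_odds = prior_odds * (1 / 3) ** n       # each mission contributes a 1/3 likelihood ratio
            print(n, posterior_odds)
        # The (1/3)**n term quickly swamps any reasonable prior odds.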

  • Likelihood in simple coin-tossing

    Tossing a coin n = 11 times, with probability p of heads, the probability of the outcome HHTHTTTHTTH is

    p p (1 − p) p (1 − p)(1 − p)(1 − p) p (1 − p)(1 − p) p

    which is

    L = p^5 (1 − p)^6

    Plotting L against p to find its maximum:

    [Figure: the likelihood L plotted against p from 0.0 to 1.0; the curve peaks near p = 0.454.]
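    The same maximum can be located numerically (a minimal sketch, assuming NumPy is available; not part of the slides):

        import numpy as np

        # Likelihood for 5 heads and 6 tails in 11 tosses, on a grid of p values.
        p = np.linspace(0.001, 0.999, 999)
        L = p**5 * (1 - p)**6

        print(p[np.argmax(L)])   # about 0.4545, i.e. 5/11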

  • Differentiating to find the maximum:

    Differentiating the expression for L with respect to p and equating the derivative to 0, the value of p that is at the peak is found (not surprisingly) to be p = 5/11:

    ∂L/∂p  =  ( 5/p − 6/(1 − p) ) p^5 (1 − p)^6  =  0

    5 − 11 p = 0

    p̂ = 5/11
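    The same algebra can be checked symbolically (a sketch using SymPy; not part of the slides):

        import sympy as sp

        p = sp.symbols('p')
        L = p**5 * (1 - p)**6

        # Solve dL/dp = 0: the roots are 0, 5/11 and 1, and 5/11 is the interior maximum.
        print(sp.solve(sp.Eq(sp.diff(L, p), 0), p))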

  • A likelihood curve

    [Figure: a likelihood curve, L plotted against θ.]

  • Its maximum likelihood estimate

    [Figure: the likelihood curve with the maximum likelihood estimate, θ̂ (the MLE), marked at its peak.]

  • Using the Likelihood Ratio Test

    [Figure: the likelihood curve with the MLE marked, and a horizontal line where ln L is reduced from its maximum by 3.841 / 2.]

  • The (approximate, asymptotic) confidence interval

    [Figure: the likelihood curve with the MLE, the line where ln L is reduced by 3.841 / 2, and the resulting confidence interval on θ.]


  • Better to plot log(L) rather than L

    [Figure: ln(L) plotted against θ, with the MLE marked, the line where ln L is reduced by 3.841 / 2, and the resulting confidence interval.]
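    Read numerically from the coin example, the interval is the set of p values whose log-likelihood is within 3.841/2 of the maximum (a sketch, assuming NumPy; not part of the slides):

        import numpy as np

        # Log-likelihood for 5 heads, 6 tails.
        p = np.linspace(0.001, 0.999, 9999)
        lnL = 5 * np.log(p) + 6 * np.log(1 - p)

        inside = p[lnL >= lnL.max() - 3.841 / 2]
        print(inside.min(), inside.max())   # roughly 0.19 to 0.74, the approximate 95% interval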

  • Contours of a likelihood surface in two dimensions

    [Figure: contours of a likelihood surface plotted against length of branch 1 and length of branch 2.]

  • Where the maximum likelihood estimate is

    [Figure: the contour plot of the likelihood surface with the MLE marked.]

  • Using the LRT to define a confidence interval

    [Figure: the contour whose height is less than at the peak by an amount equal to 1/2 the chi-square value with one degree of freedom which is significant at the 95% level; it defines the confidence interval for length of branch 1.]

  • Ditto, in the other variable

    [Figure: the same construction applied to length of branch 2; the height of this contour is less than at the peak by 1/2 the chi-square value with one degree of freedom which is significant at the 95% level (shaded area is the joint confidence interval).]

  • A joint confidence region

    [Figure: the joint confidence region; the height of this contour is less than at the peak by an amount equal to 1/2 the chi-square value with two degrees of freedom which is significant at the 95% level (shaded area is the joint confidence interval).]

  • The Likelihood Ratio Test

    Remember that confidence intervals and tests are related: we test a null hypothesis by seeing whether the observed data’s summary statistic is outside of the confidence interval around the parameter value for the null hypothesis.

    The Likelihood Ratio Test invented by R. A. Fisher does this (a small numerical sketch follows the steps):

    Find the best overall parameter value and the likelihood, which is maximized there: L(θ1).

    Find the best parameter value, and its likelihood, under the constraint that the null hypothesis is true: L(θ0).

    The degrees of freedom is the difference of the number of parameters in these two models, p1 − p0. The null hypothesis model must be a subcase of the general hypothesis, and must be within its parameter space, not on the boundary.

    Take the log of the ratio of these likelihoods, or (what is the same), the difference of the logs of these two likelihoods: ln(L(θ1)/L(θ0)).

    Double it, and look it up on a chi-square distribution with p1 − p0 degrees of freedom.
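    A minimal numerical sketch of this recipe, using the coin-tossing data from earlier (5 heads in 11 tosses) and, purely for illustration, the null hypothesis that the coin is fair (p = 1/2); the chi-square lookup uses SciPy:

        import numpy as np
        from scipy.stats import chi2

        heads, tails = 5, 6                   # the coin data from the earlier slides

        def lnL(p):
            return heads * np.log(p) + tails * np.log(1 - p)

        lnL1 = lnL(heads / (heads + tails))   # maximized over all p (the MLE, 5/11)
        lnL0 = lnL(0.5)                       # maximized under the null p = 1/2 (one fewer free parameter)

        statistic = 2 * (lnL1 - lnL0)         # double the log likelihood ratio
        print(statistic, chi2.sf(statistic, df=1))   # test statistic and its p-value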

  • An example with phylogenies: molecular clock?

    [Figure: a tree for species A, B, C, D, E with branch lengths v1 through v8.]

    Constraints for a clock:

    v1 = v2
    v4 = v5
    v1 + v6 = v3
    v3 + v7 = v4 + v8

  • Testing for a molecular clock

    To test for a molecular clock:

    Obtain the likelihood with no constraint of a molecular clock (For primates data with Ts/Tn = 30 we get ln L1 = −2616.86)

    Obtain the highest likelihood for a tree which is constrained to have a molecular clock: ln L0 = −2679.0

    Look up 2 (ln L1 − ln L0) = 2 × 62.14 = 124.28 on a χ2 distribution with n − 2 = 12 degrees of freedom (in this case the result is significant)
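    The “look it up” step, as a short SciPy sketch with the numbers from this slide:

        from scipy.stats import chi2

        statistic = 2 * (-2616.86 - (-2679.0))    # = 124.28
        print(statistic > chi2.ppf(0.95, df=12))  # True: exceeds the 95% critical value (about 21.0)
        print(chi2.sf(statistic, df=12))          # the corresponding, very small, p-value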

  • An example – samples from a Poisson distribution

    Suppose we have m samples from a Poisson distribution whose (unknown) mean parameter is λ. Suppose the numbers of events we see are n1, n2, . . . , nm. The likelihood is

    L = (e^−λ λ^n1 / n1!) × (e^−λ λ^n2 / n2!) × . . . × (e^−λ λ^nm / nm!)

    Collecting powers and exponentials, this becomes

    L = e^−mλ λ^(n1 + n2 + ... + nm) / (lots of factorials)

    Taking logarithms, which makes it easier:

    ln L = −mλ + (Σ ni) ln λ + (stuff not involving λ)

    Differentiate this, set to zero:

    ∂ ln L / ∂λ = −m + (Σ ni) (1/λ) + 0 = 0

    When you solve this for λ, you find that the MLE of λ is just the average number of events:

    λ̂ = (Σ ni) / m
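    A quick numerical check of that result (a sketch with made-up counts, assuming NumPy; not part of the slides):

        import numpy as np

        counts = np.array([2, 5, 3, 4, 1, 6])      # made-up event counts n1 ... nm

        # Poisson log-likelihood, dropping the terms that do not involve lambda.
        lam = np.linspace(0.1, 10, 2000)
        lnL = -len(counts) * lam + counts.sum() * np.log(lam)

        print(lam[np.argmax(lnL)], counts.mean())  # both are about 3.5, the sample mean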

  • An example of Bayesian inference with coins

    [Figure: the prior on the Heads probability p, a truncated exponential distribution, plotted on 0.0 to 1.0.]

  • An example of Bayesian inference with coins

    [Figure: the likelihood curve for 11 tosses with 5 heads appearing, plotted against p on 0.0 to 1.0.]

  • An example of Bayesian inference with coins

    [Figure: the resulting posterior density on the Heads probability p, plotted on 0.0 to 1.0.]
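    The multiply-and-restandardize recipe behind these three pictures can be carried out on a grid (a sketch; the slides do not give the rate of the truncated exponential prior, so the value 2 below is an assumption):

        import numpy as np

        p = np.linspace(0.0005, 0.9995, 1000)
        dp = p[1] - p[0]

        prior = np.exp(-2 * p)                  # truncated exponential prior (assumed rate 2)
        prior /= prior.sum() * dp               # normalize on the interval (0, 1)

        likelihood = p**5 * (1 - p)**6          # 5 heads in 11 tosses
        posterior = prior * likelihood
        posterior /= posterior.sum() * dp       # restandardize so the area under it is 1

        print(p[np.argmax(posterior)])          # the posterior mode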

  • Bayesian inference

    Bayesian inference uses likelihoods, but has a prior distribution on the unknown parameters.

    In theory it just multiplies the prior density by the likelihood curve,

    ... then it takes the resulting curve and restandardizes it so the area under it is 1. That is the posterior, the very thing we need.

    In practice, for complex models, Markov chain Monte Carlo (MCMC) methods are used to wander in the parameter space and take a large enough sample from the posterior.

    The controversy between Bayesians and non-Bayesians is really over just one thing – whether assuming you know the prior is justified.

    If the prior is flat in that region, the highest point on the likelihood curve (i.e., the MLE) is also the peak of the posterior density.
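    For intuition about the MCMC step, here is a minimal Metropolis sampler for the coin example’s posterior (a sketch only, not the method of any particular program; the prior rate of 2 is the same assumption as above):

        import numpy as np

        rng = np.random.default_rng(1)

        def log_posterior(p):
            # log prior (truncated exponential, assumed rate 2) plus log likelihood (5 heads, 6 tails)
            if p <= 0 or p >= 1:
                return -np.inf
            return -2 * p + 5 * np.log(p) + 6 * np.log(1 - p)

        p, samples = 0.5, []
        for _ in range(20000):
            proposal = p + rng.normal(0, 0.1)               # symmetric random-walk proposal
            if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(p):
                p = proposal                                # accept; otherwise stay put
            samples.append(p)

        print(np.mean(samples[2000:]))                      # posterior mean after discarding burn-in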
