Probability and Maximum Likelihood


Page 1:

Probability and Maximum Likelihood

Page 2:

How are we doing on the pass sequence?

• This fit is pretty good, but…

[Figure: pass-sequence data and the fitted red line; vertical axis label: hand-labeled horizontal coordinate, t]

– The red line doesn’t reveal different levels of uncertainty in predictions
– Cross-validation reduced the training data, so the red line isn’t as accurate as it should be
– Choosing a particular M and w seems wrong – we should hedge our bets

Page 3:

Probability Theory: Example of a random experiment

– We poll 60 users who are using one of two search engines and record the following:

[Figure: scatter of the 60 users by number of “good hits” returned by the search engine (X = 0…8), separated into the two search engines; each point corresponds to one of 60 users]

Page 4:


Probability Theory: Random variables

– X and Y are called random variables
– Each has its own sample space:

• S_X = {0, 1, 2, 3, 4, 5, 6, 7, 8}

• S_Y = {1, 2}

Page 5:


Probability Theory: Probability

– P(X=i,Y=j) is the probability (relative frequency) of observing X = i and Y = j
– P(X,Y) refers to the whole table of probabilities
– Properties: 0 ≤ P(X=i,Y=j) ≤ 1 and Σ_i Σ_j P(X=i,Y=j) = 1

P(X=i,Y=j):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
Y=1:    3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
Y=2:    0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60

Page 6:

Probability Theory: Marginal probability

– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y

[Figure: bar charts of the marginal distributions P(X) over X = 0…8 and P(Y) over Y = 1, 2]

Page 7:

Probability Theory: Marginal probability

– P(X=i) is the marginal probability that X = i, i.e., the probability that X = i, ignoring Y
– From the table: P(X=i) = Σ_j P(X=i,Y=j)

Note that Σ_i P(X=i) = 1 and Σ_j P(Y=j) = 1

P(X=i,Y=j) and its marginals:

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8  | P(Y=j)
Y=1:    3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60 | 34/60
Y=2:    0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60 | 26/60
P(X=i): 3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60

SUM RULE
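To make the sum rule concrete, here is a minimal NumPy sketch of the table above; the array layout and variable names are illustrative choices, not part of the slides:

```python
import numpy as np

# Joint counts from the 60-user experiment: rows are Y = 1, 2; columns are X = 0..8.
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
P_XY = counts / counts.sum()       # joint probabilities P(X=i, Y=j); they sum to 1

# SUM RULE: marginalize by summing out the other variable.
P_X = P_XY.sum(axis=0)             # P(X=i) = sum over j of P(X=i, Y=j)
P_Y = P_XY.sum(axis=1)             # P(Y=j) = sum over i of P(X=i, Y=j)

print(P_X * 60)                    # [3. 6. 8. 9. 9. 8. 9. 6. 2.]
print(P_Y * 60)                    # [34. 26.]
```

The recovered marginals match the 3/60 … 2/60 and 34/60, 26/60 entries in the table above.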

Page 8:

Probability Theory: Conditional probability

– P(X=i|Y=j) is the probability that X = i, given that Y = j
– From the table: P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j)

[Figure: bar chart of the conditional distribution P(X|Y=1) over X = 0…8, alongside P(Y=1)]

Page 9:

Probability Theory: Conditional probability

– How about the opposite conditional probability, P(Y=j|X=i)?

– P(Y=j|X=i) = P(X=i,Y=j) / P(X=i). Note that Σ_j P(Y=j|X=i) = 1

P(X=i,Y=j), with the marginal P(X=i):

         X=0   X=1   X=2   X=3   X=4   X=5   X=6   X=7   X=8
Y=1:    3/60  6/60  8/60  8/60  5/60  3/60  1/60  0/60  0/60
Y=2:    0/60  0/60  0/60  1/60  4/60  5/60  8/60  6/60  2/60
P(X=i): 3/60  6/60  8/60  9/60  9/60  8/60  9/60  6/60  2/60

P(Y=j|X=i), obtained by dividing each column of the joint table by P(X=i):

         X=0  X=1  X=2  X=3  X=4  X=5  X=6  X=7  X=8
Y=1:    3/3  6/6  8/8  8/9  5/9  3/8  1/9  0/6  0/2
Y=2:    0/3  0/6  0/8  1/9  4/9  5/8  8/9  6/6  2/2
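The same table in code, conditioning instead of marginalizing; a minimal NumPy sketch with illustrative names:

```python
import numpy as np

# Joint probabilities P(X=i, Y=j): rows are Y = 1, 2; columns are X = 0..8.
P_XY = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
]) / 60.0

P_X = P_XY.sum(axis=0)             # marginal P(X=i)
P_Y = P_XY.sum(axis=1)             # marginal P(Y=j)

# Conditioning divides the joint by the marginal of the variable being conditioned on.
P_Y_given_X = P_XY / P_X           # P(Y=j | X=i); each column sums to 1
P_X_given_Y = P_XY / P_Y[:, None]  # P(X=i | Y=j); each row sums to 1

print(P_Y_given_X[:, 3])           # [0.888..., 0.111...], i.e. 8/9 and 1/9 for X = 3
print(P_Y_given_X.sum(axis=0))     # all ones: sum over j of P(Y=j | X=i) = 1
```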

Page 10:

Summary of types of probability

• Joint probability: P(X,Y)

• Marginal probability (ignore other variable): P(X) and P(Y)

• Conditional probability (condition on the other variable having a certain value): P(X|Y) and P(Y|X)

Page 11:

Probability Theory: Constructing the joint probability

– Suppose we know
• The probability that the user will pick each search engine, P(Y=j), and
• For each search engine, the probability of each number of good hits, P(X=i|Y=j)

– Can we construct the joint probability, P(X=i,Y=j)?

– Yes. Rearranging P(X=i|Y=j) = P(X=i,Y=j) / P(Y=j), we get P(X=i,Y=j) = P(X=i|Y=j) P(Y=j)

PRODUCT RULE
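A small NumPy sketch of the product rule: rebuilding the joint table from P(Y=j) and P(X=i|Y=j). The numbers come from the earlier table; the code layout and names are just an illustration:

```python
import numpy as np

# What we are given: how often each engine is picked, and each engine's hit distribution.
P_Y = np.array([34/60, 26/60])                          # P(Y=j)
P_X_given_Y = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
], dtype=float)
P_X_given_Y /= P_X_given_Y.sum(axis=1, keepdims=True)   # normalize each row to P(X=i | Y=j)

# PRODUCT RULE: P(X=i, Y=j) = P(X=i | Y=j) * P(Y=j)
P_XY = P_X_given_Y * P_Y[:, None]

print(np.isclose(P_XY.sum(), 1.0))   # True: a valid joint distribution
print(P_XY * 60)                     # recovers the original table of counts out of 60
```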

Page 12:

Summary of computational rules

• SUM RULE: P(X) = Σ_Y P(X,Y)

P(Y) = Σ_X P(X,Y)

– Notation: we write P(X,Y) instead of P(X=i,Y=j) for clarity

• PRODUCT RULE: P(X,Y) = P(X|Y) P(Y)

P(X,Y) = P(Y|X) P(X)

Page 13:

Ordinal variables
• In our example, X has a natural order 0…8
– X is a number of hits, and
– For the ordering of the columns in the table below, nearby X’s have similar probabilities

• Y does not have a natural order

[Joint probability table P(X=i,Y=j) from the earlier slides, columns X = 0…8]

Page 14:

Probabilities for real numbers

• Can’t we treat real numbers as IEEE doubles with 2⁶⁴ possible values?

• Hah, hah. No!

• How about quantizing real variables to a reasonable number of values?

• Sometimes works, but…
– We need to carefully account for ordinality
– Doing so can lead to cumbersome mathematics

Page 15:

Probability theory for real numbers
• Quantize X using bins of width Δ
• Then, X ∈ {…, -2Δ, -Δ, 0, Δ, 2Δ, …}

• Define P_Q(X=x) = probability that x < X ≤ x+Δ

• Problem: P_Q(X=x) depends on the choice of Δ
• Solution: Let Δ → 0
• Problem: In that case, P_Q(X=x) → 0

• Solution: Define a probability density

P(x) = lim_{Δ→0} P_Q(X=x) / Δ
     = lim_{Δ→0} (probability that x < X ≤ x+Δ) / Δ
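A quick numerical check of this limiting definition, using samples from a standard Gaussian; the choice of distribution, the seed, and the variable names are mine, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)   # samples of a real-valued X

# Estimate P_Q(X=0) / Delta: the fraction of samples falling in the bin (0, Delta],
# divided by the bin width, for shrinking Delta.
for delta in [1.0, 0.1, 0.01]:
    p_bin = np.mean((x > 0.0) & (x <= delta))
    print(delta, p_bin / delta)

# The ratio approaches the Gaussian density at x = 0:
print(1.0 / np.sqrt(2.0 * np.pi))                    # 0.3989...
```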

Page 16:

Probability theory for real numbers

Probability density

– Suppose P(x) is a probability density

– Properties

• P(x) ≥ 0
• It is NOT necessary that P(x) ≤ 1
• ∫_x P(x) dx = 1

– Probabilities of intervals:

P(a < X ≤ b) = ∫_{x=a}^{b} P(x) dx
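For a concrete case, here is a sketch that evaluates P(a < X ≤ b) by numerically integrating a unit-Gaussian density and compares the result with the closed form via the error function; the specific density and interval are illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt

def p(x):
    # Density of a unit Gaussian, used here only as an example of P(x).
    return np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

a, b = -1.0, 2.0
xs = np.linspace(a, b, 10_001)

# Trapezoid rule for the integral of P(x) over (a, b].
prob_numeric = np.sum((p(xs[:-1]) + p(xs[1:])) / 2.0 * np.diff(xs))

# Closed form for the Gaussian, for comparison.
Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))
prob_exact = Phi(b) - Phi(a)

print(prob_numeric, prob_exact)   # both about 0.8186
```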

Page 17:

Probability theory for real numbers

Joint, marginal and conditional densities

• Suppose P(x,y) is a joint probability density

– ∫_x ∫_y P(x,y) dx dy = 1
– P((X,Y) ∈ R) = ∫∫_R P(x,y) dx dy

• Marginal density: P(x) = ∫_y P(x,y) dy

• Conditional density: P(x|y) = P(x,y) / P(y)

[Figure: a region R in the (x,y) plane]
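A grid-based sketch of these definitions, using an illustrative correlated two-dimensional Gaussian as the joint density; the density, grid, and names are assumptions made for the example:

```python
import numpy as np

# An illustrative joint density P(x, y): a correlated 2-D Gaussian with unit variances.
xs = np.linspace(-5.0, 5.0, 401)
ys = np.linspace(-5.0, 5.0, 401)
X, Y = np.meshgrid(xs, ys, indexing="ij")
rho = 0.6
P_xy = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2))) / (2*np.pi*np.sqrt(1 - rho**2))

dx, dy = xs[1] - xs[0], ys[1] - ys[0]

# Marginal density: integrate the joint over y (rectangle rule on the grid).
P_x = P_xy.sum(axis=1) * dy
print(np.isclose((P_x * dx).sum(), 1.0, atol=1e-3))            # True: P(x) integrates to ~1

# Conditional density at a fixed y: P(x | y) = P(x, y) / P(y).
j = 200                                                        # grid index where y = 0
P_y0 = (P_xy[:, j] * dx).sum()                                 # marginal density P(y = 0)
P_x_given_y0 = P_xy[:, j] / P_y0
print(np.isclose((P_x_given_y0 * dx).sum(), 1.0, atol=1e-3))   # True: conditional integrates to ~1
```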

Page 18:

The Gaussian distribution

N(x|μ,σ²) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)), where σ is the standard deviation

Page 19:

The Gaussian distribution

The Gaussian can also be parameterized by its “precision”, the inverse variance σ⁻²; σ is the standard deviation

Page 20:

Mean and variance

• The mean of X is E[X] = Σ_X X P(X)

or E[X] = ∫_x x P(x) dx

• The variance of X is VAR(X) = Σ_X (X − E[X])² P(X)

or VAR(X) = ∫_x (x − E[X])² P(x) dx

• The std dev of X is STD(X) = SQRT(VAR(X))

• The covariance of X and Y is

COV(X,Y) = Σ_X Σ_Y (X − E[X]) (Y − E[Y]) P(X,Y)

or COV(X,Y) = ∫_x ∫_y (x − E[X]) (y − E[Y]) P(x,y) dx dy
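Applied to the discrete search-engine table from earlier, these definitions look like this in a minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

# Joint P(X=i, Y=j): rows are Y = 1, 2; columns are X = 0..8.
P_XY = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
]) / 60.0
x_vals = np.arange(9)              # values X can take
y_vals = np.array([1, 2])          # values Y can take

P_X, P_Y = P_XY.sum(axis=0), P_XY.sum(axis=1)

E_X = (x_vals * P_X).sum()                          # E[X] = sum_X X P(X)
VAR_X = ((x_vals - E_X)**2 * P_X).sum()             # VAR(X) = sum_X (X - E[X])^2 P(X)
STD_X = np.sqrt(VAR_X)                              # STD(X) = sqrt(VAR(X))

E_Y = (y_vals * P_Y).sum()
# COV(X,Y) = sum_X sum_Y (X - E[X]) (Y - E[Y]) P(X,Y)
COV_XY = ((x_vals[None, :] - E_X) * (y_vals[:, None] - E_Y) * P_XY).sum()

print(E_X, VAR_X, STD_X, COV_XY)   # covariance is positive: engine Y=2 tends to return more hits
```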

Page 21:

Mean and variance of the Gaussian

E[X] = μ, VAR(X) = σ²

STD(X) = σ

Page 22:

How can we use probability as a framework for machine learning?

Page 23:

Maximum likelihood estimation
• Say we have a density P(x|θ) with parameter θ

• The likelihood of a set of independent and identically distributed (IID) data x = (x_1,…,x_N) is

P(x|θ) = Π_{n=1}^N P(x_n|θ)

• The log-likelihood is L = ln P(x|θ) = Σ_{n=1}^N ln P(x_n|θ)

• The maximum likelihood (ML) estimate of θ is

θ_ML = argmax_θ L = argmax_θ Σ_{n=1}^N ln P(x_n|θ)

• Example: For Gaussian likelihood P(x|θ) = N(x|μ,σ²),

L = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{n=1}^N (x_n − μ)²
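A small sketch of ML estimation for this Gaussian case; the synthetic data, seed, and names are assumptions for the example, and the closed-form estimates come from setting the derivatives of L to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # stand-in for observed IID data x_1..x_N

# Closed-form ML estimates for a Gaussian likelihood N(x | mu, sigma^2).
mu_ml = x.mean()
var_ml = ((x - mu_ml)**2).mean()               # note: divides by N, not N - 1

def log_likelihood(mu, var):
    # L = -(N/2) ln(2*pi*sigma^2) - (1/(2*sigma^2)) sum_n (x_n - mu)^2
    N = len(x)
    return -0.5 * N * np.log(2*np.pi*var) - ((x - mu)**2).sum() / (2*var)

# The ML estimates give a higher log-likelihood than nearby parameter settings.
print(log_likelihood(mu_ml, var_ml) > log_likelihood(mu_ml + 0.5, var_ml))   # True
print(log_likelihood(mu_ml, var_ml) > log_likelihood(mu_ml, 2*var_ml))       # True
print(mu_ml, np.sqrt(var_ml))                  # close to the generating 3.0 and 2.0
```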

Page 24:

Comments on notation from now on
• Instead of Σ_j P(X=i,Y=j), we write Σ_Y P(X,Y)

• P() and p() are used interchangeably

• Discrete and continuous variables are treated the same, so Σ_X, ∫_X, Σ_x and ∫_x are interchangeable

• θ_ML and θ̂_ML are interchangeable

• argmax_θ f(θ) is the value of θ that maximizes f(θ)

• In the context of data x_1,…,x_N, the symbols x, X, 𝐱 and 𝐗 refer to the entire set of data

• N(x|μ,σ²) = (1/√(2πσ²)) exp(−(x−μ)²/(2σ²))

• log(·) = ln(·) and exp(x) = e^x

• p_context(x) and p(x|context) are interchangeable

Page 25:

Maximum likelihood estimation
• Say we have a density P(x|θ) with parameter θ

• The likelihood of a set of independent and identically distributed (IID) data x = (x_1,…,x_N) is

P(x|θ) = Π_{n=1}^N P(x_n|θ)

• The log-likelihood is L = ln P(x|θ) = Σ_{n=1}^N ln P(x_n|θ)

• The maximum likelihood (ML) estimate of θ is

θ_ML = argmax_θ L = argmax_θ Σ_{n=1}^N ln P(x_n|θ)

• Example: For Gaussian likelihood P(x|θ) = N(x|μ,σ²),

L = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{n=1}^N (x_n − μ)²

Page 26:

Questions?

Page 27:

How are we doing on the pass sequence?

• This fit is pretty good, but…

[Figure: pass-sequence data and the fitted red line; vertical axis label: hand-labeled horizontal coordinate, t]

– The red line doesn’t reveal different levels of uncertainty in predictions
– Cross-validation reduced the training data, so the red line isn’t as accurate as it should be
– Choosing a particular M and w seems wrong – we should hedge our bets

No progress!

But…

Page 28:

Maximum likelihood estimation
• Say we have a density P(x|θ) with parameter θ

• The likelihood of a set of independent and identically distributed (IID) data x = (x_1,…,x_N) is

P(x|θ) = Π_{n=1}^N P(x_n|θ)

• The log-likelihood is L = ln P(x|θ) = Σ_{n=1}^N ln P(x_n|θ)

• The maximum likelihood (ML) estimate of θ is

θ_ML = argmax_θ L = argmax_θ Σ_{n=1}^N ln P(x_n|θ)

• Example: For Gaussian likelihood P(x|θ) = N(x|μ,σ²),

L = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{n=1}^N (x_n − μ)²

Page 29:

Maximum likelihood estimation

• Example: For Gaussian likelihood P(x|θ) = N(x|μ,σ²),

L = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{n=1}^N (x_n − μ)²

Objective of regression: Minimize error

E(w) = ½ Σ_n (t_n − y(x_n,w))²
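To spell out the connection this juxtaposition is pointing at: assume each target t_n is Gaussian around the model prediction y(x_n, w) with fixed variance σ² (this parallels the Gaussian example above). Then the log-likelihood is

```latex
\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \sigma^2)
  = \sum_{n=1}^{N} \ln \mathcal{N}\!\bigl(t_n \mid y(x_n, \mathbf{w}), \sigma^2\bigr)
  = -\frac{N}{2}\ln\bigl(2\pi\sigma^2\bigr)
    - \frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl(t_n - y(x_n, \mathbf{w})\bigr)^2 ,
```

and the first term does not depend on w, so maximizing the likelihood over w is the same as minimizing E(w) = ½ Σ_n (t_n − y(x_n,w))².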

Page 30: