
CSC446: Pattern Recognition (LN3)


Page 1: CSC446: Pattern Recognition (LN3)

CSC446 : Pattern Recognition

Prof. Dr. Mostafa G. M. Mostafa Faculty of Computer & Information Sciences

Computer Science Department

AIN SHAMS UNIVERSITY

Lecture Note 3:

Mathematical Foundations


Appendix, Pattern Classification and PRML

Page 2: CSC446: Pattern Recognition (LN3)

CSC446 : Pattern Recognition

Readings: Chapter 1 in Bishop’s PRML

Data Modeling (Regression)

Page 3: CSC446: Pattern Recognition (LN3)

Learning: Data Modeling

• Assume we have examples of pairs (x, y) and we want to learn the mapping 𝑭:𝑿 → 𝒀 to predict y for future values of x.

𝒚(𝒙) = 𝐬𝐢𝐧(𝟐𝝅𝒙)

Page 4: CSC446: Pattern Recognition (LN3)

Polynomial Curve Fitting

• Problem: There exist many possible mapping functions 𝑭:𝑿 → 𝒀! Which one should we choose?

• We could choose the one that minimizes the error:

  E(𝒘) = ½ Σₙ₌₁ᴺ [ y(xₙ, 𝒘) − tₙ ]²   (the sum-of-squares error over the N training pairs)
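The least-squares fit described above can be sketched in a few lines of Python/numpy. This is only an illustrative sketch: the sample size, noise level, the chosen order M, and the use of np.polyfit/np.polyval are our assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate N noisy training pairs (x, t) from the underlying curve sin(2*pi*x).
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)  # illustrative noise level

# Fit a degree-M polynomial y(x, w) = w0 + w1*x + ... + wM*x^M by least squares.
M = 3
w = np.polyfit(x, t, deg=M)          # coefficients, highest power first
y = np.polyval(w, x)                 # model predictions on the training inputs

# Sum-of-squares error E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
E = 0.5 * np.sum((y - t) ** 2)
print("fitted coefficients:", w)
print("training error E(w):", E)
```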

Page 5: CSC446: Pattern Recognition (LN3)

Polynomial Curve Fitting

• Fitting different polynomials (models) to the data:

𝑦(𝑥) = 𝒘₀          𝑦(𝑥) = 𝒘₀ + 𝒘₁𝒙

Page 6: CSC446: Pattern Recognition (LN3)

Polynomial Curve Fitting

• Fitting different polynomials (models) to the data:

𝑦(𝑥) = 𝒘₀ + 𝒘₁𝒙 + 𝒘₂𝒙²          𝑦(𝑥) = 𝒘₀ + 𝒘₁𝒙 + 𝒘₂𝒙² + ⋯ + 𝒘₈𝒙⁸

Page 7: CSC446: Pattern Recognition (LN3)

Overfitting

• At M = 9 we get zero training error, BUT the highest testing error.
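A small numpy sketch that reproduces this behaviour, comparing training and test RMS error as the model order M grows (the data sizes, noise level, and random seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, noise=0.3):
    """Noisy samples of the underlying curve sin(2*pi*x)."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=n)

x_train, t_train = make_data(10)
x_test,  t_test  = make_data(100)

def rms_error(w, x, t):
    """Root-mean-square error of the polynomial with coefficients w."""
    return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

# Training error keeps falling as M grows, but test error eventually rises.
for M in range(10):
    w = np.polyfit(x_train, t_train, deg=M)
    print(f"M={M}: train RMS={rms_error(w, x_train, t_train):.3f} "
          f"test RMS={rms_error(w, x_test, t_test):.3f}")
```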

Page 8: CSC446: Pattern Recognition (LN3)

Effect of Data Size

• As the number of data samples N increases, we get closer to the real data model even with a higher-order polynomial (both panels show fits with M = 9).

Page 9: CSC446: Pattern Recognition (LN3)

Performance Evaluation

• Generalization error is the true error over the population of examples we would like to optimize.
  – The sample mean only approximates it.

• There are two ways to assess the generalization error:

• Theoretical: the Law of Large Numbers
  – statistical bounds on the difference between the true and sample mean errors.

• Practical: use a separate data set with m data samples to test the model:

  (Mean) test error = (1/m) Σᵢ₌₁ᵐ error( 𝑭(𝒙ᵢ), 𝒚ᵢ ),  i.e., the average per-sample error on the held-out data.

Page 10: CSC446: Pattern Recognition (LN3)

Assignment 1

1. Derive an equation for estimating the parameters w from the sample data for the cases M = 1 and M = 2.

2. Use these equations to draw the relation between w and E(w) for each M. Use the estimated values of w as the middle values of the w range.

Page 11: CSC446: Pattern Recognition (LN3)

CSC446 : Pattern Recognition

Readings: Appendix A

Probability & Statistics

Page 12: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Randomness:
  – We call a phenomenon random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions.

• Probability:
  – The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions.
  – Probability is the long-term relative frequency.

Page 13: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Discrete random variables:
  – Let x ∈ 𝑿, where the sample space is 𝑿 = {v₁, v₂, ..., vₘ}.
  – We denote by pᵢ the probability that x = vᵢ:

      pᵢ = Pr{ x = vᵢ },  i = 1, ..., m,

    where the pᵢ must satisfy the following two conditions:

      pᵢ ≥ 0   and   Σᵢ₌₁ᵐ pᵢ = 1

Page 14: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Equally likely outcomes:

“Equally likely outcomes are outcomes that have the same probability of occurring.”

• Examples:
  – Rolling a fair die
  – Tossing a fair coin

• P(x) is a “Uniform Distribution”

Page 15: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Equally likely outcomes:
• If we have ten identical balls numbered from 0 to 9 in a box, find the probability of randomly drawing a ball with a number divisible by 3.
  – The event space (desired outcomes): A = {3, 6, 9}.
  – The sample space (possible outcomes): S = {0, 1, 2, ..., 9}.
• Since the drawing is at random, each outcome is equally likely to occur, i.e. P(0) = P(1) = P(2) = … = P(9) = 1/10.
• P(A) = (number of outcomes in A) / (number of outcomes in S) = 3/10 = 0.3
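The count-based calculation above can be checked with a tiny Python sketch (variable names are ours, purely illustrative):

```python
S = set(range(10))          # sample space: balls numbered 0..9
A = {3, 6, 9}               # event listed on the slide
p = len(A) / len(S)         # equally likely outcomes => |A| / |S|
print(p)                    # 0.3
```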

Page 16: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Biased outcomes (non-uniform distribution):
  “Biased outcomes are outcomes that have different probabilities of occurring.”

• Examples:
  – Rolling an unfair die
  – Tossing an unfair coin

• P(x) is a “Non-uniform Distribution”

Page 17: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Biased outcomes (non-uniform dist.):

• A biased coin, twice as likely to come up tails as heads, is tossed twice:
  – What is the probability that at least one head occurs?

• Solution:
  – Sample space = {HH, HT, TH, TT}
  – P(H = head) = 1/3,  P(T = tail) = 2/3
  – Sample points / probabilities for the event:
      P(HH) = 1/3 × 1/3 = 1/9      P(HT) = 1/3 × 2/3 = 2/9
      P(TH) = 2/3 × 1/3 = 2/9      P(TT) = 2/3 × 2/3 = 4/9
  – Answer: P(at least one head) = P(HH) + P(HT) + P(TH) = 1/9 + 2/9 + 2/9 = 5/9 ≈ 0.56
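The same enumeration, written as a short Python sketch (illustrative only):

```python
from itertools import product

p = {"H": 1/3, "T": 2/3}     # biased coin: tails twice as likely as heads

# Enumerate the four outcomes of two tosses and sum those containing a head.
prob_at_least_one_head = sum(
    p[a] * p[b]
    for a, b in product("HT", repeat=2)
    if "H" in (a, b)
)
print(prob_at_least_one_head)   # 5/9 ≈ 0.556
```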

Page 18: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Probability and Language:
  What is the probability of a random word (from a random dictionary page) being a verb?

• Solution:
  – All words: just count all the words in the dictionary.
  – Number of ways to get a verb: the number of words that are verbs!

      P(drawing a verb) = (# of ways to get a verb) / (# of all words)

  – If a dictionary has 50,000 entries, and 10,000 of them are verbs, then:

      P(Verb) = 10,000 / 50,000 = 1/5 = 0.20

Page 19: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Conditional Probability
  – A way to reason about the outcome of an experiment based on partial information:
    • In a word-guessing game the first letter of the word is a “t”. How likely is it that the second letter is an “h”?
    • How likely is it that a person has a disease given that a medical test was negative?
    • A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
    • I saw your friend; how likely is it that I will see you?

Page 20: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Conditional Probability

• Let A and B be events.
• P(B|A) = the probability of event B occurring given that event A occurs.
• Definition:

      P(A|B) = P(A, B) / P(B)

  [Venn diagram: events A and B, overlapping in the region labeled A,B]

  Note: P(A, B) = P(A|B) · P(B)
  Also: P(A, B) = P(B, A)
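A small Python sketch of this definition, computing P(A|B) from a joint table (the joint probabilities are hypothetical, chosen only to illustrate the formula):

```python
# Hypothetical joint probabilities P(A, B) over two binary events A and B.
P_joint = {
    (True, True): 0.12, (True, False): 0.18,
    (False, True): 0.28, (False, False): 0.42,
}

# Marginal P(B) by summing the joint over A.
P_B = sum(p for (a, b), p in P_joint.items() if b)

# Conditional P(A | B) = P(A, B) / P(B).
P_A_given_B = P_joint[(True, True)] / P_B
print(P_A_given_B)          # 0.12 / 0.40 = 0.3
```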

Page 21: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Conditional Probability

• One of the following 30 items is chosen at random.
• What is P(X), the probability that it is an X?
• What is P(X|red), the probability that it is an X given that it is red?

Page 22: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Statistically independent events
  – Variables x and y are said to be statistically independent if and only if:

      P(x, y) = P(x) P(y)

  – That is, knowing the value of x does not give us any additional knowledge about the possible value of y.

Page 23: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Marginal Probability

• Conditional Probability

• Joint Probability


Page 24: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Sum Rule

• Product Rule


Page 25: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• The Rules of Probability:

  – Sum Rule:      p(X) = Σ_Y p(X, Y)

  – Product Rule:  p(X, Y) = p(Y|X) p(X) = p(X|Y) p(Y)
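Both rules can be verified mechanically on a small joint distribution; here is a numpy sketch using a hypothetical 2×3 joint table p(X, Y):

```python
import numpy as np

# Hypothetical joint distribution p(X, Y) as a 2x3 table (rows: X, columns: Y).
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

# Sum rule: the marginals are obtained by summing the joint over the other variable.
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Product rule: p(X, Y) = p(Y|X) p(X), where p(Y|X) = p(X, Y) / p(X).
p_y_given_x = p_xy / p_x[:, None]
reconstructed = p_y_given_x * p_x[:, None]

print(np.allclose(reconstructed, p_xy))   # True: the product rule holds
print(p_x, p_y)                           # marginals from the sum rule
```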

Page 26: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Bayes’ Theorem

      p(Y|X) = p(X|Y) p(Y) / p(X)

  where

      p(X) = Σ_Y p(X|Y) p(Y)
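Applied to the medical-test question from the Conditional Probability slide, Bayes’ theorem looks like this in code (all numbers are hypothetical and used only for illustration):

```python
# Bayes' theorem with hypothetical numbers: how likely is disease given a negative test?
p_disease = 0.01                  # prior p(Y = disease)
p_neg_given_disease = 0.10        # likelihood p(X = negative | disease)
p_neg_given_healthy = 0.95        # likelihood p(X = negative | healthy)

# Denominator p(X) from the sum rule: sum over both states of Y.
p_neg = (p_neg_given_disease * p_disease
         + p_neg_given_healthy * (1 - p_disease))

# Posterior p(Y = disease | X = negative) = p(X|Y) p(Y) / p(X).
p_disease_given_neg = p_neg_given_disease * p_disease / p_neg
print(p_disease_given_neg)        # about 0.00106
```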

Page 27: CSC446: Pattern Recognition (LN3)

1- Probability Theory

• Probability mass function, P(x):
  – For a discrete variable x ∈ 𝑿, the probability mass function satisfies

      P(x) ≥ 0   and   Σ_{x∈𝑿} P(x) = 1.

  – For a continuous variable with density p(x), the cumulative distribution of p(x) is

      P(x ≤ z) = ∫₋∞ᶻ p(x) dx.

Page 28: CSC446: Pattern Recognition (LN3)

2- Statistics

• Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data.

• The best way of looking at data is to draw its histogram (frequency distribution).

Page 29: CSC446: Pattern Recognition (LN3)

2- Statistics

• Univariate Gaussian/Normal Density:
  – A density that is analytically tractable
  – A continuous density
  – A lot of processes are asymptotically Gaussian

      p(x) = (1 / (√(2π) σ)) exp[ −½ ((x − μ)/σ)² ],   with ∫ p(x) dx = 1

  where:
      μ  = mean (or expected value) of x
      σ² = squared deviation, or variance
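A quick numerical sanity check of this density in numpy (an illustrative sketch; the grid and the parameters μ, σ are arbitrary):

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Univariate normal density 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x-mu)/sigma)**2)."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Numerical check that the density integrates to (approximately) 1.
x = np.linspace(-10.0, 10.0, 100001)
dx = x[1] - x[0]
print(gaussian_pdf(x, mu=1.0, sigma=1.5).sum() * dx)   # ~1.0
```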

Page 30: CSC446: Pattern Recognition (LN3)

2- Statistics

• Univariate Gaussian/Normal Density: the standard normal density p(u) ~ N(0, 1)

Page 31: CSC446: Pattern Recognition (LN3)

2- Statistics

• Multivariate Normal Density
  – The multivariate normal density in d dimensions is:

      p(𝐱) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −½ (𝐱 − 𝛍)ᵗ Σ⁻¹ (𝐱 − 𝛍) ]

  where:
      𝐱 = (x₁, x₂, …, x_d)ᵗ = the multivariate random variable
      𝛍 = (μ₁, μ₂, …, μ_d)ᵗ = the mean vector
      Σ = the d×d covariance matrix; |Σ| and Σ⁻¹ are its determinant and inverse, respectively.
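The formula translates directly into numpy (an illustrative sketch; the mean vector and covariance matrix are arbitrary):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density p(x) in d dimensions."""
    d = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff      # (x - mu)^t Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([1.0, 1.0]), mu, Sigma))
```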

Page 32: CSC446: Pattern Recognition (LN3)

2- Statistics

• Multivariate Density: Statistical Independence
  – If xᵢ and xⱼ are statistically independent, then σᵢⱼ = 0.
  – In this case p(𝐱) reduces to the product of the univariate normal densities for the components of 𝐱. That is, if p(xᵢ) ~ N(xᵢ | μᵢ, σᵢ), then

      p(𝐱) = p(x₁, x₂, …, x_d) = p(x₁) p(x₂) ⋯ p(x_d) = Πᵢ p(xᵢ)
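This factorization can be checked numerically for a diagonal covariance matrix (an illustrative sketch with arbitrary parameters):

```python
import numpy as np

mu = np.array([0.0, 2.0, -1.0])
sigma = np.array([1.0, 0.5, 2.0])       # independent components: diagonal covariance
Sigma = np.diag(sigma ** 2)

x = np.array([0.3, 1.5, -2.0])

# Joint multivariate normal density with the diagonal covariance.
d = len(mu)
diff = x - mu
joint = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (
    (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

# Product of the univariate densities p(x1) p(x2) ... p(xd).
univariate = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
print(np.isclose(joint, np.prod(univariate)))   # True
```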

Page 33: CSC446: Pattern Recognition (LN3)

2- Statistics

• Multivariate Normal Density
  – From the multivariate normal density, the loci of points of constant density are hyperellipsoids for which the quadratic form (𝐱 − 𝛍)ᵗ Σ⁻¹ (𝐱 − 𝛍) is constant.
  – The quantity

      r² = (𝐱 − 𝛍)ᵗ Σ⁻¹ (𝐱 − 𝛍)

    is sometimes called the squared Mahalanobis distance from 𝐱 to 𝛍.
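Computing r² is a one-liner with numpy (illustrative mean and covariance):

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x, mu, Sigma_inv):
    """Squared Mahalanobis distance r^2 = (x - mu)^t Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ Sigma_inv @ diff

# Points with equal r^2 lie on the same constant-density hyperellipsoid.
print(mahalanobis_sq(np.array([2.0, 0.0]), mu, Sigma_inv))
print(mahalanobis_sq(np.array([0.0, 1.0]), mu, Sigma_inv))
```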

Page 34: CSC446: Pattern Recognition (LN3)

2- Statistics

Multivariate Normal Density


Page 35: CSC446: Pattern Recognition (LN3)

2- Statistics

Expected values:

• The expected value, mean, or average of the random variable x is defined by:

      μ = E[x] = ∫ x p(x) dx   (or  Σ_{x∈𝑿} x P(x)  for a discrete variable)

• If f(x) is any function of x, the expected value of f is defined by:

      E[f(x)] = ∫ f(x) p(x) dx

Page 36: CSC446: Pattern Recognition (LN3)

2- Statistics

Expected values:

• The second moment of x is defined by:

      E[x²] = ∫ x² p(x) dx

• The variance of x is defined by:

      Var(x) = σ² = E[(x − μ)²]

  where σ is the standard deviation of x.
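These quantities are easy to estimate from samples; a numpy sketch (the distribution parameters and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)    # samples of a random variable

mean = x.mean()                      # sample estimate of E[x]
second_moment = (x ** 2).mean()      # sample estimate of E[x^2]
variance = ((x - mean) ** 2).mean()  # sample estimate of E[(x - mu)^2]

print(mean, second_moment, variance)                      # ~3.0, ~13.0, ~4.0
print(np.isclose(variance, second_moment - mean ** 2))    # True
```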

Page 37: CSC446: Pattern Recognition (LN3)

3- Mathematical Notations


Page 43: CSC446: Pattern Recognition (LN3)

Next Time

Bayesian Decision Theory
