Introduction to Machine Learning - Statistical Machine Learning · 2020-05-26

Page 1

Introduction to Machine Learning

Statistical Machine Learning

Varun Chandola
Computer Science & Engineering
State University of New York at Buffalo
Buffalo, NY, USA
chandola@buffalo.edu

Chandola@UB CSE 474/574 1 / 30

Page 2

Outline

Statistical Machine Learning - Introduction

Introduction to Probability

Random Variables

Bayes Rule

Continuous Random Variables

Different Types of Distributions

Handling Multivariate Distributions

Page 3

Statistical Machine Learning

Functional Methods
- y = f(x)
- Learn f() using training data
- y∗ = f(x∗) for a test data instance

Statistical/Probabilistic Methods
- Calculate the conditional probability of the target being y, given that the input is x
- Assume that y|x is a random variable generated from a probability distribution
- Learn the parameters of the distribution using training data

Page 5

What is Probability? [2, 1]

- Probability that a coin will land heads is 50%¹
- What does this mean?

¹Persi Diaconis showed that a coin is 51% likely to land facing the same way up as it started.

Page 7

Frequentist Interpretation

- Long-run fraction of times an event will be observed over n trials
- What if the event can only occur once?
  - My winning next month's Powerball.
  - Polar ice caps melting by the year 2020.

Page 9

Bayesian Interpretation

- Uncertainty of the event
- Used for making decisions
  - Should I put in an offer on a sports car?

Page 10

What is a Random Variable (X)?

- Can take any value from its domain X
- Discrete Random Variable - X is finite/countably infinite
  - Categorical?
- Continuous Random Variable - X is infinite (uncountable)
- P(X = x) or P(x) is the probability of X taking the value x
  - an event
- What is a distribution?

Examples

1. Coin toss (X = {heads, tails})
2. Six-sided die (X = {1, 2, 3, 4, 5, 6})

Page 11

Notation, Notation, Notation

- X - random variable (X if multivariate)
- x - a specific value taken by the random variable (x if multivariate)
- P(X = x) or P(x) is the probability of the event X = x
- p(x) is either the probability mass function (discrete) or the probability density function (continuous) for the random variable X at x
  - Probability mass (or density) at x

Page 12

Basic Rules - Quick Review

For two events A and B:
- P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Joint Probability
  - P(A, B) = P(A ∧ B) = P(A|B)P(B)
  - Also known as the product rule
- Conditional Probability
  - P(A|B) = P(A, B) / P(B)

Page 13

Chain Rule of Probability

- Given D random variables, {X1, X2, . . . , XD}:

P(X1, X2, . . . , XD) = P(X1) P(X2|X1) P(X3|X1, X2) · · · P(XD|X1, X2, . . . , XD−1)
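The chain rule can be checked numerically: for any joint distribution, the product of the conditionals recovers the joint probability. The three-variable table below is an assumed toy example, not from the slides.

```python
import itertools

# Assumed joint pmf over three binary variables: arbitrary positive
# probabilities (i + 1)/36 that sum to 1 over the 8 outcomes.
vals = list(itertools.product([0, 1], repeat=3))
joint = {v: (i + 1) / 36.0 for i, v in enumerate(vals)}

def marg(fixed):
    # Marginal probability of the variables pinned in `fixed` (index -> value),
    # obtained by summing the joint over the remaining variables.
    return sum(p for v, p in joint.items()
               if all(v[i] == x for i, x in fixed.items()))

x = (1, 0, 1)
# P(x1) * P(x2|x1) * P(x3|x1,x2), with conditionals written as ratios
chain = (marg({0: x[0]})
         * marg({0: x[0], 1: x[1]}) / marg({0: x[0]})
         * joint[x] / marg({0: x[0], 1: x[1]}))
assert abs(chain - joint[x]) < 1e-12  # chain rule recovers the joint
print(round(chain, 6))
```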

Page 14

Marginal Distribution

- Given P(A, B), what is P(A)?
  - Sum P(A, B) over all values of B

P(A) = ∑_b P(A, B = b) = ∑_b P(A|B = b) P(B = b)

- Sum rule
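The sum rule above can be sketched on a small joint table; the probability values below are assumed for illustration.

```python
# Assumed joint distribution P(A, B) over two binary-ish variables.
P = {('a1', 'b1'): 0.10, ('a1', 'b2'): 0.30,
     ('a2', 'b1'): 0.25, ('a2', 'b2'): 0.35}

# Marginal P(A = a1): sum the joint over all values of B (sum rule)
p_a1 = sum(p for (a, b), p in P.items() if a == 'a1')
print(p_a1)  # 0.4
```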

Page 15

Bayes Rule or Bayes Theorem

- Computing P(X = x|Y = y):

Bayes Theorem

P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y)
               = P(X = x) P(Y = y|X = x) / ∑_x′ P(X = x′) P(Y = y|X = x′)

Page 16

Example

- Medical Diagnosis
- Random event 1: A test is positive or negative (X)
- Random event 2: A person has cancer (Y) - yes or no

- What we know:

1. Test has an accuracy of 80%
2. Probability that the test is positive when the person has cancer:

P(X = 1|Y = 1) = 0.8

3. Prior probability of having cancer is 0.4%

P(Y = 1) = 0.004

Question?

If I test positive, does it mean that I have an 80% chance of having cancer?

Page 17

Base Rate Fallacy

- Ignored the prior information
- What we need is: P(Y = 1|X = 1) = ?
- More information:
  - False positive (alarm) rate for the test
  - P(X = 1|Y = 0) = 0.1

P(Y = 1|X = 1) = P(X = 1|Y = 1) P(Y = 1) / [P(X = 1|Y = 1) P(Y = 1) + P(X = 1|Y = 0) P(Y = 0)]
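Plugging the slide's numbers into this formula answers the question numerically:

```python
# Base-rate computation with the values given on the slides.
p_pos_given_cancer = 0.8      # P(X=1 | Y=1)
p_cancer = 0.004              # P(Y=1), the prior
p_pos_given_healthy = 0.1     # P(X=1 | Y=0), false positive rate

numerator = p_pos_given_cancer * p_cancer
denominator = numerator + p_pos_given_healthy * (1 - p_cancer)
posterior = numerator / denominator
print(round(posterior, 4))  # 0.0311
```

About 3%, not 80%: the tiny prior dominates, which is exactly the base rate fallacy.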

Page 18

Classification Using Bayes Rule

- Given an input example x, find the true class:

P(Y = c|X = x)

- Y is the random variable denoting the true class
- Assuming the class-conditional probability is known:

P(X = x|Y = c)

- Applying Bayes Rule

P(Y = c|X = x) = P(Y = c) P(X = x|Y = c) / ∑_c′ P(Y = c′) P(X = x|Y = c′)

Page 19

Independence

- One random variable does not depend on another
- A ⊥ B ⇐⇒ P(A, B) = P(A)P(B)
- Joint written as a product of marginals
- Conditional Independence:

A ⊥ B|C ⇐⇒ P(A, B|C) = P(A|C)P(B|C)

Page 20

Continuous Random Variables

- X is continuous
- Can take any value
- How does one define probability?
  - The discrete normalization ∑_x P(X = x) = 1 no longer applies: P(X = x) = 0 for any single value
- Probability that X lies in an interval [a, b]?
  - P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a)
  - F(q) = P(X ≤ q) is the cumulative distribution function
  - P(a < X ≤ b) = F(b) − F(a)
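A small sketch of the interval rule P(a < X ≤ b) = F(b) − F(a), using an Exponential distribution as an assumed example (its CDF has the simple closed form F(x) = 1 − e^(−λx)):

```python
import math

# CDF of an Exponential(rate) random variable
def F(x, rate=2.0):
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

a, b = 0.5, 1.5
prob = F(b) - F(a)          # P(a < X <= b) = F(b) - F(a)
print(round(prob, 4))       # ≈ 0.3181
```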

Page 22

Probability Density

Probability Density Function

p(x) = ∂F(x)/∂x

P(a < X ≤ b) = ∫_a^b p(x) dx

- Can p(x) be greater than 1?
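Yes: a density only has to integrate to 1, not stay below 1. A minimal sketch with an assumed Uniform(0, 0.1) distribution, whose density is constant at 10:

```python
# Uniform(0, width): density is 1/width on the support, 0 elsewhere.
width = 0.1
density = 1.0 / width            # p(x) = 10 > 1 everywhere on [0, 0.1]
area = density * width           # total probability mass under p(x)
print(density, area)             # 10.0 1.0
```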

Page 23

Expectation

- Expected value of a random variable: E[X]
- The long-run average value of X (not necessarily the most likely value)
- For discrete variables:

E[X] ≜ ∑_{x∈X} x P(X = x)

- For continuous variables:

E[X] ≜ ∫_X x p(x) dx

- Mean of X (µ)
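The discrete definition above, applied to the six-sided die from the earlier examples:

```python
from fractions import Fraction

# E[X] = sum over x of x * P(X = x) for a fair six-sided die.
faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                      # P(X = x) for every face
expectation = sum(x * p for x in faces)
print(float(expectation))  # 3.5
```

Note that 3.5 is not even a possible outcome: the mean is an average, not the likeliest value.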

Page 24

Expectation of Functions of Random Variable

- Let g(X) be a function of X
- If X is discrete:

E[g(X)] ≜ ∑_{x∈X} g(x) P(X = x)

- If X is continuous:

E[g(X)] ≜ ∫_X g(x) p(x) dx

Properties
- E[c] = c, for a constant c
- If X ≤ Y, then E[X] ≤ E[Y]
- E[X + Y] = E[X] + E[Y]
- E[aX] = aE[X]
- var[X] = E[(X − µ)²] = E[X²] − µ²
- cov[X, Y] = E[XY] − E[X]E[Y]
- Jensen's inequality: if ϕ(X) is convex, then ϕ(E[X]) ≤ E[ϕ(X)]
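Two of the properties above checked numerically on an assumed toy pmf over {0, 1, 2}:

```python
# Assumed pmf: P(X=0)=0.2, P(X=1)=0.5, P(X=2)=0.3.
xs = [0, 1, 2]
ps = [0.2, 0.5, 0.3]

E = lambda g: sum(p * g(x) for x, p in zip(xs, ps))  # E[g(X)]

mu = E(lambda x: x)                       # E[X]
var = E(lambda x: (x - mu) ** 2)          # E[(X - mu)^2]
var_alt = E(lambda x: x ** 2) - mu ** 2   # E[X^2] - mu^2
assert abs(var - var_alt) < 1e-12         # the variance identity holds

# Jensen's inequality with the convex function phi(x) = x^2:
assert mu ** 2 <= E(lambda x: x ** 2)
print(round(mu, 3), round(var, 3))        # 1.1 0.49
```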

Page 25

What is a Probability Distribution?

Discrete
- Binomial, Bernoulli
- Multinomial, Multinoulli
- Poisson
- Empirical

Continuous
- Gaussian (Normal)
- Degenerate pdf
- Laplace
- Gamma
- Beta
- Pareto

Page 26

Binomial Distribution

- X = number of heads observed in n coin tosses
- Parameters: n, θ
- X ∼ Bin(n, θ)
- Probability mass function (pmf):

Bin(k|n, θ) ≜ (n choose k) θ^k (1 − θ)^(n−k)

Bernoulli Distribution
- Binomial distribution with n = 1
- Only one parameter (θ)
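The binomial pmf above, written out directly for a fair-coin example:

```python
import math

# Bin(k | n, theta) = C(n, k) * theta^k * (1 - theta)^(n - k)
def binom_pmf(k, n, theta):
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)

# P(exactly 2 heads in 4 tosses of a fair coin) = C(4,2) / 2^4 = 6/16
print(binom_pmf(2, 4, 0.5))  # 0.375

# Sanity check: the pmf sums to 1 over k = 0..n
assert abs(sum(binom_pmf(k, 4, 0.5) for k in range(5)) - 1.0) < 1e-12
```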

Page 27

Multinomial Distribution

- Simulates a K-sided die
- Random variable x = (x1, x2, . . . , xK)
- Parameters: n, θ
  - θ ∈ ℝ^K
  - θj - probability that the j-th side shows up

Mu(x|n, θ) ≜ (n choose x1, x2, . . . , xK) ∏_{j=1}^K θj^{xj}

Multinoulli Distribution
- Multinomial distribution with n = 1
- x is a vector of 0s and 1s with only one bit set to 1
- Only one parameter (θ)
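The multinomial pmf above written out for counts x = (x1, ..., xK); the 3-sided die and its probabilities below are an assumed example.

```python
import math

# Mu(x | n, theta): multinomial coefficient times the product of theta_j^{x_j}.
def multinomial_pmf(x, theta):
    n = sum(x)
    coef = math.factorial(n)
    for xj in x:
        coef //= math.factorial(xj)     # n! / (x1! x2! ... xK!)
    prob = float(coef)
    for xj, tj in zip(x, theta):
        prob *= tj ** xj
    return prob

# P(counts (2, 1, 1) in 4 rolls with theta = (0.5, 0.25, 0.25))
p = multinomial_pmf([2, 1, 1], [0.5, 0.25, 0.25])
print(p)  # 0.1875
```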

Page 28

Gaussian (Normal) Distribution

N(x|µ, σ²) ≜ (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

- Parameters:

1. µ = E[X]
2. σ² = var[X] = E[(X − µ)²]

- X ∼ N(µ, σ²) ⇔ p(x) = N(x|µ, σ²)
- X ∼ N(0, 1): X is a standard normal random variable
- Cumulative distribution function:

Φ(x; µ, σ²) ≜ ∫_{−∞}^x N(z|µ, σ²) dz
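The pdf and CDF above in code; the CDF has no elementary closed form, but it can be expressed through the error function as Φ(x; µ, σ²) = ½(1 + erf((x − µ)/(σ√2))):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # N(x | mu, sigma^2)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Phi(x; mu, sigma^2) via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(round(normal_pdf(0.0), 4))   # 0.3989, peak of the standard normal
print(round(normal_cdf(0.0), 4))   # 0.5, half the mass lies below the mean
```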

Page 29

Joint Probability Distributions

- Multiple related random variables
- p(x1, x2, . . . , xD) for D > 1 variables (X1, X2, . . . , XD)
- Discrete random variables?
- Continuous random variables?
- What do we measure?

Covariance
- How does X vary with respect to Y?
- For a linear relationship:

cov[X, Y] ≜ E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

Page 30

Covariance and Correlation

- X is a d-dimensional random vector

cov[X] ≜ E[(X − E[X])(X − E[X])ᵀ]

       = | var[X1]       cov[X1, X2]   · · ·   cov[X1, Xd] |
         | cov[X2, X1]   var[X2]       · · ·   cov[X2, Xd] |
         | ...           ...           · · ·   ...         |
         | cov[Xd, X1]   cov[Xd, X2]   · · ·   var[Xd]     |

- Covariances can be arbitrarily large in magnitude (unbounded)
- Normalized covariance ⇒ Correlation
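A sketch of the covariance matrix computed directly from the definition above, on a tiny assumed 2-dimensional dataset (population normalization, dividing by n, to match the expectation form):

```python
# Assumed toy dataset: four 2-dimensional points.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
n = len(data)
mean = [sum(row[j] for row in data) / n for j in range(2)]

def cov(i, j):
    # E[(Xi - E[Xi])(Xj - E[Xj])], estimated by an average over the data
    return sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / n

cov_matrix = [[cov(i, j) for j in range(2)] for i in range(2)]
print(cov_matrix)  # diagonal holds variances, off-diagonal the covariance
```

The matrix is symmetric by construction: cov(i, j) == cov(j, i).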

Page 31

Correlation

- Pearson Correlation Coefficient

corr[X, Y] ≜ cov[X, Y] / √(var[X] var[Y])

- What is corr[X, X]?
- −1 ≤ corr[X, Y] ≤ 1
- When is corr[X, Y] = 1?
  - Y = aX + b (with a > 0)
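A numeric sketch of the last point: with Y an exact linear function of X (here y = 3x + 1, an assumed example), the Pearson correlation comes out to 1.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3 * x + 1 for x in xs]        # perfectly linear, slope a = 3 > 0
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
var_x = sum((x - mx) ** 2 for x in xs) / n
var_y = sum((y - my) ** 2 for y in ys) / n

corr = cov / math.sqrt(var_x * var_y)   # the definition above
print(round(corr, 6))  # 1.0
```

Flipping the slope's sign (a < 0) would give exactly −1 instead.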

Page 33

Multivariate Gaussian Distribution

- Most widely used joint probability distribution

N(x|µ, Σ) ≜ (1 / ((2π)^{D/2} |Σ|^{1/2})) exp[−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)]
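The density above written out explicitly for the D = 2 case, with the determinant and inverse of the 2×2 covariance computed by hand (the mean and covariance used in the call are assumed examples):

```python
import math

def mvn_pdf_2d(x, mu, Sigma):
    (a, b), (c, d) = Sigma
    det = a * d - b * c                               # |Sigma|
    inv = [[d / det, -b / det], [-c / det, a / det]]  # Sigma^{-1} for 2x2
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    norm = 1.0 / (2 * math.pi * math.sqrt(det))       # (2*pi)^{D/2} with D = 2
    return norm * math.exp(-0.5 * q)

# Density at the mean of a standard bivariate normal: 1 / (2*pi)
p = mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(round(p, 6))  # ≈ 0.159155
```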

Page 34

References

[1] E. Jaynes and G. Bretthorst. Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, 2003.

[2] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference (Springer Texts in Statistics). Springer, Oct. 2004.
