
Page 1

Machine Learning
Saarland University, SS 2007

Holger Bast
[with input from Ingmar Weber]

Max-Planck-Institut für Informatik
Saarbrücken, Germany

Lecture 10, Friday June 22nd, 2007
(Everything you always wanted to know

Page 2

Overview of this lecture

Maximum likelihood vs. unbiased estimators

– Example: normal distribution

– Example: drawing numbers from a box

Things you keep on reading in the ML literature [example]

– marginal distribution

– prior

– posterior

Statistical tests

– hypothesis testing

– discussion of its (non)sense

Page 3

Maximum likelihood vs. unbiased estimators

Example: maximum likelihood estimator from Lecture 8, Example 2

– μ(x₁,…,xₙ) = 1/n ∙ Σᵢ xᵢ

– σ²(x₁,…,xₙ) = 1/n ∙ Σᵢ (xᵢ – μ)²

– X₁,…,Xₙ independent, identically distributed random variables with mean μ and variance σ²

– E μ(X₁,…,Xₙ) = μ [blackboard]

– E σ²(X₁,…,Xₙ) = (n–1)/n ∙ σ² ≠ σ² [blackboard]

– unbiased variance estimator = 1/(n–1) ∙ Σᵢ (xᵢ – μ)²

Example: number x drawn from a box with numbers 1..n, for unknown n

– maximum likelihood estimator: n = x [blackboard]

– unbiased estimator: n = 2x – 1 [blackboard]
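
To see both effects numerically, here is a minimal Python simulation sketch; the concrete settings (n = 5 normal samples, a box with n = 100) are illustrative assumptions, not from the lecture:

```python
import random

# Illustrative simulation (settings assumed, not from the lecture):
# average the 1/n and 1/(n-1) variance estimates over many trials.
random.seed(0)
n, trials = 5, 100_000
true_sigma2 = 1.0  # variance of the normal samples

ml_var = unbiased_var = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, true_sigma2 ** 0.5) for _ in range(n)]
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    ml_var += ss / n              # maximum-likelihood estimate (biased)
    unbiased_var += ss / (n - 1)  # unbiased estimate

print(ml_var / trials)        # ~ (n-1)/n * sigma^2 = 0.8
print(unbiased_var / trials)  # ~ sigma^2 = 1.0

# Box example: draw one number x uniformly from 1..n, with n unknown.
true_n = 100
ml_est = unb_est = 0
for _ in range(trials):
    x = random.randint(1, true_n)
    ml_est += x           # ML estimate: n = x
    unb_est += 2 * x - 1  # unbiased estimate: n = 2x - 1
print(ml_est / trials)   # ~ (n+1)/2 = 50.5, i.e. badly biased
print(unb_est / trials)  # ~ n = 100
```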

Page 4

Marginal distribution

Joint probability distribution, for example

– pick a random MPII staff member

– random variables X = department, Y = gender

– for example, Pr(X = D3, Y = female)

          D1     D2     D3     D4     D5
male     0.24   0.09   0.13   0.25   0.11    0.82
female   0.03   0.03   0.04   0.04   0.04    0.18
         0.27   0.12   0.17   0.29   0.15

– column sums give the marginals of X, e.g., Pr(D3) = 0.17

– row sums give the marginals of Y, e.g., Pr(female) = 0.18

Note:

– matrix entries sum to 1

– in general, Pr(X = x, Y = y) ≠ Pr(X = x) ∙ Pr(Y = y)

[holds if and only if X and Y are independent]
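
To make the marginals concrete, here is a minimal Python sketch (not from the slides; the dictionary layout is my own) that recomputes the row and column marginals from the joint table and checks the independence condition:

```python
# Joint distribution Pr(X = dept, Y = gender) from the slide's table.
joint = {
    ("D1", "male"): 0.24, ("D2", "male"): 0.09, ("D3", "male"): 0.13,
    ("D4", "male"): 0.25, ("D5", "male"): 0.11,
    ("D1", "female"): 0.03, ("D2", "female"): 0.03, ("D3", "female"): 0.04,
    ("D4", "female"): 0.04, ("D5", "female"): 0.04,
}

# Marginals: sum the joint probabilities over the other variable.
pr_dept, pr_gender = {}, {}
for (dept, gender), p in joint.items():
    pr_dept[dept] = pr_dept.get(dept, 0.0) + p
    pr_gender[gender] = pr_gender.get(gender, 0.0) + p

print(pr_dept["D3"])        # 0.17
print(pr_gender["female"])  # 0.18

# Independence would require Pr(X=x, Y=y) = Pr(X=x) * Pr(Y=y) everywhere;
# it fails here, e.g. for (D3, female): 0.04 != 0.17 * 0.18 = 0.0306.
print(joint[("D3", "female")], pr_dept["D3"] * pr_gender["female"])
```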

Page 5

Frequentism vs. Bayesianism

Frequentism

– probability = relative frequency in a large number of trials

– associated with random (physical) system

– only applied to well-defined events in a well-defined sample space

for example: probability of a die showing 6

Bayesianism

– probability = degree of belief

– no random process at all needs to be involved

– applied to arbitrary statements

for example: probability that I will like a new movie

Page 6

Prior / Posterior probability

Prior

– a guess about the data, with no random experiment behind it

– one then computes with this guess as if it were a probability

– for example: the Z1,…,Zn from the E-step of the EM algorithm

Posterior

– probability related to an event that has already happened

– for example: all our likelihoods from Lectures 8 and 9

Note: these are not well-defined technical terms

– but they are often used as if they were, which is confusing

– the Bayesian way …
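
As a hedged illustration of this prior/posterior pattern, here is a minimal Python sketch of an E-step responsibility for a two-component Gaussian mixture; the mixture setup and all parameter values are assumptions for illustration, not taken from this slide:

```python
import math

# In the E-step of EM for a two-component Gaussian mixture, the mixing
# weights play the role of the prior, and the posterior probability z
# that a point x came from component k follows from Bayes' rule.
def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

prior = [0.5, 0.5]                    # prior guess: mixing weights
mus, sigmas = [0.0, 4.0], [1.0, 1.0]  # assumed component parameters

x = 1.0
lik = [gauss_pdf(x, mus[k], sigmas[k]) for k in range(2)]
evidence = sum(prior[k] * lik[k] for k in range(2))           # Pr(x)
posterior = [prior[k] * lik[k] / evidence for k in range(2)]  # Bayes' rule
print(posterior)  # ~ [0.98, 0.02]: x = 1.0 almost surely from component 0
```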

Page 7

Hypothesis testing

Example: do two samples have the same mean?

– e.g., two groups of patients in a medical experiment, one group with medication and one group without

– for example, 8.6 4.3 3.2 5.1 and 2.1 4.2 7.6 3.2 2.9

Test

– formulate a null hypothesis, e.g., equal means

– compute the probability p of the given (or more extreme) data, assuming that the null hypothesis is true [blackboard]

Outcome

– p ≤ α = 0.05: the null hypothesis is rejected at significance level α = 0.05, i.e., with 95% confidence

one says: the difference of the means is statistically significant

– p > α = 0.05: the null hypothesis cannot be rejected

one says: the difference of the means is statistically insignificant
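
The slide leaves the computation of p to the blackboard; one standard choice is a two-sample permutation test. The following Python sketch is a hedged illustration on the slide's example data, not necessarily the test used in the lecture:

```python
from itertools import combinations

# Two-sample permutation test: under the null hypothesis of equal
# means, every split of the pooled values into groups of sizes 4 and 5
# is equally likely, so p is the fraction of splits whose absolute
# mean difference is at least the observed one.
a = [8.6, 4.3, 3.2, 5.1]
b = [2.1, 4.2, 7.6, 3.2, 2.9]

pooled = a + b
observed = abs(sum(a) / len(a) - sum(b) / len(b))  # |5.3 - 4.0| = 1.3

extreme = total = 0
for idx in combinations(range(len(pooled)), len(a)):
    group_a = [pooled[i] for i in idx]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diff = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    total += 1
    if diff >= observed:
        extreme += 1

p = extreme / total  # comes out well above 0.05 for this data,
print(p)             # so the null hypothesis cannot be rejected
```

Unlike a t-test, the permutation test needs no normality assumption; it only uses the exchangeability of the samples under the null hypothesis.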

Page 8

Hypothesis testing — BEWARE!

What one would ideally like:

– given this data, what is the probability that my hypothesis is true?

– formally: Pr(H | D)

What one gets from hypothesis testing

– given that my hypothesis is true, what is the probability of this (or more extreme) data?

– formally: Pr(D | H)

– but Pr(D | H) can be low for reasons other than the hypothesis being false!! [blackboard example]

Useful at all?

– OK: challenge a theory by attempting to reject it

– NO: confirm a theory by rejecting the corresponding null hypothesis
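
A toy Bayes'-rule calculation makes the gap between Pr(D | H) and Pr(H | D) concrete; all numbers below are invented for illustration:

```python
# Toy numbers (invented for illustration) showing that a small
# Pr(D | H) does not imply a small Pr(H | D).
pr_h = 0.999            # prior: H is almost certainly true
pr_d_given_h = 0.04     # p-value-style quantity: "significant", < 0.05
pr_d_given_not_h = 0.5  # the data is unsurprising if H is false

# Bayes' rule: Pr(H | D) = Pr(D | H) Pr(H) / Pr(D)
pr_d = pr_d_given_h * pr_h + pr_d_given_not_h * (1 - pr_h)
pr_h_given_d = pr_d_given_h * pr_h / pr_d

print(pr_h_given_d)  # ~ 0.988: H remains overwhelmingly likely,
                     # although a test would "reject" it at p < 0.05
```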

Page 9

Literature

Read the wonderful articles by Jacob Cohen

– Things I have learned (so far). American Psychologist, 45(12):1304–1312, 1990

– The earth is round (p < .05). American Psychologist, 49(12):997–1003, 1994