Probability & Statistics: Intro, summary statistics


Mathematical Tools for Neural and Cognitive Science

Probability & Statistics: Intro, summary statistics, probability

Fall semester, 2018


Recommended reading: Efron & Tibshirani, Introduction to the Bootstrap, 1998


Some history…

• 1600’s: Early notions of data summary/averaging

• 1700’s: Bayesian prob/statistics (Bayes, Laplace)

• 1920’s: Frequentist statistics for science (e.g., Fisher)

• 1940’s: Statistical signal analysis and communication, estimation/decision theory (e.g., Shannon, Wiener, etc.)

• 1950’s: Return of Bayesian statistics (e.g., Jeffreys, Wald, Savage, Jaynes…)

• 1970’s: Computation, optimization, simulation (e.g., Tukey)

• 1990’s: Machine learning (large-scale computing + statistical inference + lots of data)

• Since the 1950’s: statistical neural/cognitive models


Scientific process

A cycle: create/modify hypothesis/model → generate predictions, design experiment → observe/measure data → summarize/fit model(s), compare with predictions → back to creating/modifying the hypothesis/model.


Descriptive statistics: Central tendency

• We often summarize data with the average. Why?

• The average minimizes the squared error (as in regression!):

$$\mu(\vec{x}) = \arg\min_c \frac{1}{N}\sum_{n=1}^{N} (x_n - c)^2 = \frac{1}{N}\sum_{n=1}^{N} x_n$$

• Generalize: minimize the $L_p$ norm:

$$\arg\min_c \left[\frac{1}{N}\sum_{n=1}^{N} |x_n - c|^p\right]^{1/p}$$

– minimize the $L_1$ norm: median, $m(\vec{x})$
– minimize the $L_0$ norm: mode
– minimize the $L_\infty$ norm: midpoint of range

• Issues: outliers, asymmetry, bimodality

• How do we choose?
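These claims are easy to verify numerically. Below is a minimal sketch (assuming NumPy; the exponential data and the brute-force grid search are illustrative choices, not part of the slides) recovering the mean, median, and midrange as $L_p$ minimizers:

```python
import numpy as np

# Recover mean / median / midrange by brute-force minimization of the
# L_p error over a grid of candidate centers c.
rng = np.random.default_rng(0)
x = rng.exponential(size=1000)            # skewed data, so the answers differ

cs = np.linspace(x.min(), x.max(), 2001)  # candidate centers
resid = np.abs(x[:, None] - cs[None, :])  # |x_n - c| for every candidate c

c_l2 = cs[np.argmin((resid**2).mean(axis=0))]   # p = 2   -> mean
c_l1 = cs[np.argmin(resid.mean(axis=0))]        # p = 1   -> median
c_linf = cs[np.argmin(resid.max(axis=0))]       # p = inf -> midrange

print(c_l2, x.mean())                     # nearly equal (up to grid spacing)
print(c_l1, np.median(x))                 # nearly equal
print(c_linf, 0.5 * (x.min() + x.max()))  # nearly equal
```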


Descriptive statistics: Dispersion

• Sample standard deviation:

$$\sigma(\vec{x}) = \min_c \left[\frac{1}{N}\sum_{n=1}^{N} (x_n - c)^2\right]^{1/2} = \left[\frac{1}{N}\sum_{n=1}^{N} \left(x_n - \mu(\vec{x})\right)^2\right]^{1/2}$$

• Mean absolute deviation (MAD) about the median:

$$d(\vec{x}) = \frac{1}{N}\sum_{n=1}^{N} \left|x_n - m(\vec{x})\right|$$

• Quantiles
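A companion sketch (again assuming NumPy; the Gaussian data are illustrative) computing the three dispersion measures:

```python
import numpy as np

# Sample standard deviation, MAD about the median, and quantiles.
rng = np.random.default_rng(1)
x = rng.standard_normal(1000)

sigma = np.sqrt(((x - x.mean())**2).mean())  # sample std (1/N convention, as above)
mad = np.abs(x - np.median(x)).mean()        # mean absolute deviation about median
q25, q75 = np.quantile(x, [0.25, 0.75])      # quantiles (here, the quartiles)

print(sigma, mad, q75 - q25)
```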

Descriptive statistics: Dispersion

Summary statistics (e.g., sample mean/variance) can be interpreted as estimates of model parameters. To formalize this, we need tools from probability…

data $\{x_n\}$ → histogram $\{c_k, h_k\}$ → probability distribution $p(x)$

probabilistic model $p_\theta(\vec{x})$ → data $\{\vec{x}_n\}$: measurement

data $\{\vec{x}_n\}$ → probabilistic model $p_\theta(\vec{x})$: inference

Probabilistic Middleville

In Middleville, every family has two children, brought by the stork. The stork delivers boys and girls randomly, with family probabilities {BB, BG, GB, GG} = {0.2, 0.3, 0.2, 0.3}. [the probabilistic model]

You pick a family at random and discover that one of the children is a girl. [the data]

What are the chances that the other child is a girl? [the inference]
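The slide leaves the question open; one way to explore it is by simulation. A minimal sketch (assuming NumPy), under the reading that "one of the children is a girl" means "at least one child is a girl":

```python
import numpy as np

# Monte Carlo Middleville: sample families from the stork's distribution,
# condition on "at least one girl", and check how often the other is a girl.
rng = np.random.default_rng(2)
kinds = np.array(["BB", "BG", "GB", "GG"])
fams = rng.choice(kinds, size=1_000_000, p=[0.2, 0.3, 0.2, 0.3])

has_girl = fams != "BB"                # at least one girl
both_girls = fams == "GG"
print(both_girls[has_girl].mean())     # ~ 0.3 / 0.8 = 0.375
```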

Statistical Middleville

In Middleville, every family has two children, brought by the stork.

In a survey of 100 of the Middleville families, 32 have two girls, 23 have two boys, and the remainder have one of each. [the data]

The stork delivers boys and girls randomly, with family probabilities {BB, BG, GB, GG} = {0.2, 0.3, 0.2, 0.3}. [the inference]

You pick a family at random and discover that one of the children is a girl. What are the chances that the other child is a girl?

Probability basics (outline)

• distributions: discrete and continuous

• expected value, moments

• cumulative distributions: quantiles, Q-Q plots, drawing samples

• transformations: affine, monotonic nonlinear


Probability: Definitions/notation

Let X, Y, Z be random variables. They can take on values (like ‘heads’ or ‘tails’; or the integers 1–6; or real-valued numbers).

Let x, y, z stand generically for values they can take, and denote events such as X = x.

Write the probability that X takes on value x as P(X = x), or $P_X(x)$, or sometimes just P(x).

P(x) is a function over values x, which we call the probability “distribution” function (pdf); for continuous variables, the “density”.

[Useful to have this notation up on the slide, while introducing concepts on the board]

Probability distributions

Discrete random variable, $P(x)$:

$$0 \le P(x_i) \le 1, \quad \forall i$$
$$\sum_i P(x_i) = 1$$

Continuous random variable, $p(x)$:

$$p(x) \ge 0$$
$$\int_{-\infty}^{\infty} p(x)\, dx = 1$$
$$P(a < x < b) = \int_a^b p(x)\, dx$$

Example distributions

[figures: a not-quite-fair coin; roll of a fair die; sum of two rolled fair dice; clicks of a Geiger counter in a fixed time interval (and time between clicks); horizontal velocity of gas molecules exiting a fan]

Expected value - discrete

$$E(X) = \sum_{i=1}^{N} x_i\, p(x_i) \quad [\text{the mean, } \mu]$$

More generally:

$$E(f(X)) = \sum_{i=1}^{N} f(x_i)\, p(x_i)$$

[figure: histogram of # of credit cards held by ~10⁴ students, and the corresponding probability distribution P(x), with the mean μ marked]
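In code, a discrete expectation is just a weighted sum. A minimal sketch (assuming NumPy; the pmf is illustrative, not the credit-card data from the figure):

```python
import numpy as np

# Expected value of X, and of a function f(X), under a discrete pmf.
x = np.array([0, 1, 2, 3, 4])
p = np.array([0.1, 0.3, 0.3, 0.2, 0.1])   # probabilities, summing to 1

mean = (x * p).sum()                      # E(X)
Ef = ((x - mean)**2 * p).sum()            # E(f(X)) with f(x) = (x - mean)^2
print(mean, Ef)
```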

Expected value - continuous

$$E(x) = \int x\, p(x)\, dx \quad [\text{mean, } \mu]$$

$$E(x^2) = \int x^2\, p(x)\, dx \quad [\text{“second moment”, } m_2]$$

$$E\!\left((x-\mu)^2\right) = \int (x-\mu)^2\, p(x)\, dx = \int x^2\, p(x)\, dx - \mu^2 \quad [\text{variance, } \sigma^2 \text{; equal to } m_2 \text{ minus } \mu^2]$$

$$E(f(x)) = \int f(x)\, p(x)\, dx \quad [\text{“expected value of } f\text{”}]$$

Note: this is an inner product, and thus linear:

$$E(af(x) + bg(x)) = aE(f(x)) + bE(g(x))$$
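A quick numerical check of the identity $\sigma^2 = m_2 - \mu^2$ (assuming NumPy; the Gaussian density and the Riemann-sum integration are illustrative):

```python
import numpy as np

# Approximate E(x), E(x^2), and the variance by Riemann sums over a grid,
# for an illustrative density p(x) = N(1, 0.5^2).
x = np.linspace(-5, 7, 100_001)
dx = x[1] - x[0]
p = np.exp(-(x - 1)**2 / (2 * 0.5**2)) / (0.5 * np.sqrt(2 * np.pi))

mu = (x * p).sum() * dx                   # E(x)
m2 = (x**2 * p).sum() * dx                # E(x^2)
var = ((x - mu)**2 * p).sum() * dx        # E((x - mu)^2)
print(var, m2 - mu**2)                    # both ~ 0.25
```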

Cumulatives

$$c(y) = \int_{-\infty}^{y} p(x)\, dx$$

[figures: a discrete pdf p(x) and a continuous pdf, each shown with its cumulative c(x) rising from 0 to 1]
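For a discrete distribution the cumulative is a running sum. A minimal sketch (assuming NumPy), using the two-dice pmf as an illustration:

```python
import numpy as np

# Cumulative of the pmf of the sum of two fair dice.
vals = np.arange(2, 13)
pmf = np.minimum(vals - 1, 13 - vals) / 36.0   # triangular pmf over 2..12
cdf = np.cumsum(pmf)                           # c(x): running sum of p
print(cdf[-1])                                 # 1.0
```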

Drawing samples - discrete

[figure: a discrete pmf and its cumulative staircase]
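The construction this figure appears to illustrate is inverse-cumulative sampling: draw u uniformly on [0, 1) and return the first value whose cumulative exceeds u. A minimal sketch (assuming NumPy; the pmf is illustrative):

```python
import numpy as np

# Draw samples from a discrete pmf by inverting its cumulative.
rng = np.random.default_rng(3)
vals = np.array([1, 2, 3, 4])
pmf = np.array([0.125, 0.375, 0.25, 0.25])
cdf = np.cumsum(pmf)                           # [0.125, 0.5, 0.75, 1.0]

u = rng.random(100_000)                        # uniform on [0, 1)
samples = vals[np.searchsorted(cdf, u)]        # invert the staircase
print(np.bincount(samples)[1:] / len(samples)) # ~ pmf
```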

Multi-variate probability

• joint distributions
• marginals (integrating)
• conditionals (slicing)
• Bayes’ rule (inverse probability)
• statistical independence (separability)
• linear transformations

[on board]


Joint and conditional probability - discrete

P(Ace), P(Heart), P(Ace & Heart), P(Ace | Heart), P(not Jack of Diamonds), P(Ace | not Jack of Diamonds)

“Independence”
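These work out exactly for a standard 52-card deck; a minimal sketch of the arithmetic in plain Python (the variable names are mine):

```python
from fractions import Fraction as F

# Card probabilities for a standard 52-card deck.
p_ace = F(4, 52)                              # 1/13
p_heart = F(13, 52)                           # 1/4
p_ace_and_heart = F(1, 52)                    # only the ace of hearts

p_ace_given_heart = p_ace_and_heart / p_heart # 1/13: same as p_ace -> independent
p_not_jd = F(51, 52)                          # not the jack of diamonds
p_ace_given_not_jd = p_ace / p_not_jd         # 4/51 > 1/13 -> not independent

print(p_ace, p_ace_given_heart, p_ace_given_not_jd)
```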

Joint distribution (continuous)

[figure: a two-dimensional joint density $p(x, y)$]

Marginal distribution

$$p(x) = \int p(x, y)\, dy$$

[figure: the joint density $p(x, y)$ with its marginal along x]

Conditional probability

$$p(A \mid B) = \text{probability of A given that B is asserted to be true} = \frac{p(A \,\&\, B)}{p(B)}$$

[figure: Venn diagram of events A and B, their intersection A & B, and the region containing neither A nor B]

Conditional distribution

$$p(x \mid y = 68) = p(x, y = 68) \Big/ \int p(x, y = 68)\, dx = p(x, y = 68) \big/ p(y = 68)$$

More generally:

$$p(x \mid y) = p(x, y) / p(y)$$

slice the joint distribution, then normalize (by the marginal)

[figure: the joint density $p(x, y)$ and the conditional slice $p(x \mid y = 68)$]
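For discrete distributions, “slice then normalize” is two lines of array code. A minimal sketch (assuming NumPy; the random joint table is illustrative):

```python
import numpy as np

# Conditioning a discrete joint distribution: slice, then normalize.
rng = np.random.default_rng(4)
joint = rng.random((5, 4))
joint /= joint.sum()                 # p(x, y): rows index x, columns index y

y_idx = 2
slice_ = joint[:, y_idx]             # slice:     p(x, y = y_idx)
cond = slice_ / slice_.sum()         # normalize: p(x | y = y_idx)
print(cond.sum())                    # 1.0
```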

Bayes’ Rule

$$p(A \,\&\, B) = p(B)\, p(A \mid B) = p(A)\, p(B \mid A)$$

$$\Rightarrow\; p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$$

Bayes’ Rule

$$p(x \mid y) = p(y \mid x)\, p(x) / p(y)$$

(a direct consequence of the definition of conditional probability)
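As a worked example, Bayes’ rule gives the Middleville answer directly (again under the “at least one girl” reading; plain Python):

```python
# Bayes' rule on Middleville: A = "two girls", B = "at least one girl".
p_gg = 0.3                        # p(A), from the stork's distribution
p_b_given_gg = 1.0                # p(B | A): a GG family surely has a girl
p_b = 0.3 + 0.2 + 0.3             # p(B): GG, GB, and BG all contain a girl

print(p_b_given_gg * p_gg / p_b)  # 0.375, matching the simulation above
```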

Conditional vs. marginal

[figure: the marginal $P(x)$ and the conditional $P(x \mid Y = 120)$]

In general, the conditionals for different Y values differ. When are they the same? In particular, when are all conditionals equal to the marginal?

Statistical independence

Random variables X and Y are statistically independent if (and only if):

$$p(x, y) = p(x)\, p(y) \quad \forall\, x, y$$

Independence implies that all conditionals are equal to the corresponding marginal:

$$p(x \mid y) = p(x, y) / p(y) = p(x) \quad \forall\, x, y$$

[note: for discrete distributions, this is an outer product!]
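The outer-product remark is easy to see in code. A minimal sketch (assuming NumPy; the marginals are illustrative):

```python
import numpy as np

# An independent discrete joint is the outer product of its marginals.
px = np.array([0.2, 0.5, 0.3])
py = np.array([0.6, 0.4])
joint = np.outer(px, py)                  # p(x, y) = p(x) p(y)

print(joint.sum(axis=1))                  # marginal over y recovers p(x)
print(joint[:, 0] / joint[:, 0].sum())    # p(x | y) equals p(x) for every y
```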

Sums of RVs

Let Z = X + Y. Since expectation is linear:

$$E(X + Y) = E(X) + E(Y)$$

In addition, if X and Y are independent, then

$$E(XY) = E(X)E(Y)$$

$$\sigma_Z^2 = E\!\left(\big((X + Y) - (\mu_X + \mu_Y)\big)^2\right) = \sigma_X^2 + \sigma_Y^2$$

and $p_Z(z)$ is a convolution of $p_X(x)$ and $p_Y(y)$. [on board]
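The convolution claim can be checked with the dice example from earlier. A minimal sketch (assuming NumPy):

```python
import numpy as np

# The pmf of a sum of independent RVs is the convolution of their pmfs:
# e.g., the sum of two fair dice.
die = np.ones(6) / 6.0                # pmf over faces 1..6
two_dice = np.convolve(die, die)      # pmf over sums 2..12
print(np.round(two_dice * 36))        # [1 2 3 4 5 6 5 4 3 2 1]
```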

Mean and variance

• Mean and variance summarize the centroid/width

• Translation and rescaling of random variables

• Mean/variance of a weighted sum of random variables

• The sample average

• ... converges to the true mean (except for bizarre distributions)

• ... with variance $\sigma^2/N$ (checked numerically below)

• ... the most common choice for an estimate ...

Central limit for a uniform distribution...

[figure: histograms of 10⁴ samples of a uniform density (σ = 1), and of (u+u)/√2, (u+u+u+u)/√4, and 10 u’s divided by √10, approaching a Gaussian]
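To reproduce figures like these, one can sum rescaled uniform variates. A minimal sketch (assuming NumPy; here we just verify that the $1/\sqrt{n}$ rescaling keeps the variance at 1, while the histogram shape approaches a Gaussian):

```python
import numpy as np

# Sums of n uniform variates, rescaled by sqrt(n): variance stays 1,
# and the histogram's shape approaches a Gaussian as n grows.
rng = np.random.default_rng(6)
u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(10_000, 10))  # sigma = 1

for n in (1, 2, 4, 10):
    s = u[:, :n].sum(axis=1) / np.sqrt(n)
    print(n, s.var())                     # ~ 1 for every n
```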

Central limit for a binary distribution...

[figure: histograms of one coin, and of the average of 4, 16, 64, and 256 coins, approaching a Gaussian]
