
Probability & Statistics: Intro, summary statistics



Mathematical Tools for Neural and Cognitive Science

Probability & Statistics: Intro, summary statistics, probability

Fall semester, 2018


- Efron & Tibshirani, Introduction to the Bootstrap, 1998



Some history…

• 1600’s: Early notions of data summary/averaging

• 1700’s: Bayesian prob/statistics (Bayes, Laplace)

• 1920’s: Frequentist statistics for science (e.g., Fisher)

• 1940’s: Statistical signal analysis and communication, estimation/decision theory (e.g., Shannon, Wiener, etc)

• 1950’s: Return of Bayesian statistics (e.g., Jeffreys, Wald, Savage, Jaynes…)

• 1970’s: Computation, optimization, simulation (e.g., Tukey)

• 1990’s: Machine learning (large-scale computing + statistical inference + lots of data)

• Since 1950’s!: statistical neural/cognitive models


Scientific process

Observe / measure data → Summarize / fit model(s), compare with predictions → Create / modify hypothesis / model → Generate predictions, design experiment → (back to observe)


Descriptive statistics: Central tendency

• We often summarize data with the average. Why?

• The average minimizes the squared error (as in regression!):

$$\mu(\vec{x}) = \arg\min_c \frac{1}{N}\sum_{n=1}^{N} (x_n - c)^2 = \frac{1}{N}\sum_{n=1}^{N} x_n$$

• Generalize: minimize the $L_p$ norm:

$$\arg\min_c \left[ \frac{1}{N}\sum_{n=1}^{N} |x_n - c|^p \right]^{1/p}$$

– minimize $L_1$ norm: median, $m(\vec{x})$
– minimize $L_0$ norm: mode
– minimize $L_\infty$ norm: midpoint of range

• Issues: outliers, asymmetry, bimodality

• How do we choose?
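A minimal numerical sketch of these minimizers (assuming NumPy; the data vector and the candidate grid are made up for illustration) shows the $L_2$ minimizer recovering the mean and the $L_1$ minimizer recovering the median:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 7.0])   # made-up data with one outlier
c = np.linspace(0.0, 8.0, 8001)           # grid of candidate "centers"

def lp_error(x, c, p):
    """Lp error of each candidate center c, as in the formula above."""
    return (np.abs(x[:, None] - c[None, :]) ** p).mean(axis=0) ** (1.0 / p)

print(c[np.argmin(lp_error(x, c, 2))], x.mean())      # 3.0 3.0  (L2 -> mean)
print(c[np.argmin(lp_error(x, c, 1))], np.median(x))  # 2.0 2.0  (L1 -> median)
```

Note how the outlier at 7 pulls the mean up to 3 but leaves the median at 2, which is the "outliers" issue in the last bullet.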

Descriptive statistics: Dispersion

• Sample standard deviation:

$$\sigma(\vec{x}) = \min_c \left[ \frac{1}{N}\sum_{n=1}^{N} (x_n - c)^2 \right]^{1/2} = \left[ \frac{1}{N}\sum_{n=1}^{N} \left(x_n - \mu(\vec{x})\right)^2 \right]^{1/2}$$

• Mean absolute deviation (MAD) about the median:

$$d(\vec{x}) = \frac{1}{N}\sum_{n=1}^{N} \left| x_n - m(\vec{x}) \right|$$

• Quantiles
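A short sketch computing all three measures (assuming NumPy, reusing the made-up data above; note that np.std defaults to the same $1/N$ convention as the formula on the slide):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 7.0])       # same made-up data as above

sigma = np.std(x)                             # sample std (1/N convention)
mad   = np.mean(np.abs(x - np.median(x)))     # mean absolute deviation about the median
quart = np.quantile(x, [0.25, 0.5, 0.75])     # quartiles

print(sigma, mad, quart)                      # ~2.10, 1.4, [2. 2. 3.]
```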

Summary statistics (e.g., sample mean/var) can be interpreted as estimates of model parameters. To formalize this, we need tools from probability…

data $\{x_n\}$ → histogram $\{c_k, h_k\}$ → probability distribution $p(x)$

data $\{\vec{x}_n\}$ ⇄ probabilistic model $p_\theta(\vec{x})$

(Measurement: model → data; Inference: data → model)

Probabilistic Middleville

probabilistic model: In Middleville, every family has two children, brought by the stork. The stork delivers boys and girls randomly, with family probability {BB, BG, GB, GG} = {0.2, 0.3, 0.2, 0.3}.

data: You pick a family at random and discover that one of the children is a girl.

inference: What are the chances that the other child is a girl?
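A minimal sketch of this inference (assuming "one of the children is a girl" means at least one child is a girl):

```python
# Family probabilities from the model above
p = {'BB': 0.2, 'BG': 0.3, 'GB': 0.2, 'GG': 0.3}

p_at_least_one_girl = sum(v for k, v in p.items() if 'G' in k)   # 0.8

# P(other child is also a girl | at least one girl) = P(GG) / P(at least one girl)
print(p['GG'] / p_at_least_one_girl)   # 0.375
```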

Statistical Middleville

probabilistic model: In Middleville, every family has two children, brought by the stork. The stork delivers boys and girls randomly, with family probability {BB, BG, GB, GG} = {0.2, 0.3, 0.2, 0.3}.

data: In a survey of 100 of the Middleville families, 32 have two girls, 23 have two boys, and the remainder one of each.

inference: You pick a family at random and discover that one of the children is a girl. What are the chances that the other child is a girl?
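In the statistical version we only observe the survey, so a natural sketch replaces the true stork probabilities with plug-in estimates from the counts:

```python
n, n_gg, n_bb = 100, 32, 23                 # survey counts from the slide

p_gg_hat = n_gg / n                         # estimated P(GG) = 0.32
p_at_least_one_girl_hat = (n - n_bb) / n    # estimated P(at least one girl) = 0.77

print(p_gg_hat / p_at_least_one_girl_hat)   # ~0.4156 (vs 0.375 under the true model)
```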

Probability basics (outline)

• distributions: discrete and continuous

• expected value, moments

• cumulative distributions: quantiles, Q-Q plots, drawing samples

• transformations: affine, monotonic nonlinear


Probability: Definitions/notation

Let X, Y, Z be random variables.

They can take on values (like ‘heads’ or ‘tails’; or integers 1–6; or real-valued numbers).

Let x, y, z stand generically for values they can take, and denote events such as X = x.

Write the probability that X takes on value x as P(X = x), or P_X(x), or sometimes just P(x).

P(x) is a function over values x, which we call the probability “distribution” function (pdf) (for continuous variables, “density”).

[Useful to have this notation on the slide while introducing concepts on board]

Probability distributions

Discrete random variable, $P(x)$:

$$0 \le P(x_i) \le 1, \;\; \forall i \qquad\qquad \sum_i P(x_i) = 1$$

Continuous random variable, density $p(x)$:

$$p(x) \ge 0 \qquad \int_{-\infty}^{\infty} p(x)\,dx = 1 \qquad P(a < x < b) = \int_a^b p(x)\,dx$$

Example distributions

[Figure: example probability distributions — roll of a fair die; a not-quite-fair coin; sum of two rolled fair dice; clicks of a Geiger counter in a fixed time interval; horizontal velocity of gas molecules exiting a fan... and time between clicks]

Expected value - discrete

$$E(X) = \sum_{i=1}^{N} x_i\, p(x_i) \qquad \text{[the mean, } \mu \text{]}$$

[Figure: histogram of # of credit cards per student ($10^4$ students), and the corresponding probability distribution P(x), with the mean $\mu$ marked]

More generally:

$$E(f(X)) = \sum_{i=1}^{N} f(x_i)\, p(x_i)$$

Expected value - continuous

$$E(x) = \int x\, p(x)\, dx \qquad \text{[mean, } \mu\text{]}$$

$$E(x^2) = \int x^2\, p(x)\, dx \qquad \text{[“second moment”, } m_2\text{]}$$

$$E\left((x-\mu)^2\right) = \int (x-\mu)^2\, p(x)\, dx = \int x^2\, p(x)\, dx - \mu^2 \qquad \text{[variance, } \sigma^2 \text{; equal to } m_2 \text{ minus } \mu^2\text{]}$$

$$E(f(x)) = \int f(x)\, p(x)\, dx \qquad \text{[“expected value of } f\text{”]}$$

Note: this is an inner product, and thus linear:

$$E(af(x) + bg(x)) = a\,E(f(x)) + b\,E(g(x))$$
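These integrals can be checked numerically; a sketch (assuming NumPy, with a made-up exponential density) verifies that the variance equals $m_2 - \mu^2$:

```python
import numpy as np

x = np.linspace(0.0, 50.0, 5001)         # grid covering essentially all of the mass
p = np.exp(-x)                           # made-up density: exponential
p /= np.trapz(p, x)                      # normalize so it integrates to 1

mu  = np.trapz(x * p, x)                 # mean, ~1
m2  = np.trapz(x**2 * p, x)              # second moment, ~2
var = np.trapz((x - mu)**2 * p, x)       # variance, ~1

print(mu, m2 - mu**2, var)               # variance matches m2 - mu^2
```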

Cumulatives

$$c(y) = \int_{-\infty}^{y} p(x)\, dx$$

[Figure: a discrete distribution p(x) (sum of two dice) and a continuous density, each shown with its cumulative c(x)]
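For a discrete distribution the cumulative is just a running sum; a sketch (assuming NumPy) using the two-dice distribution from the earlier figure:

```python
import numpy as np

vals = np.arange(2, 13)                                  # sum of two fair dice
pmf  = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36

cdf = np.cumsum(pmf)                                     # c(x) = P(X <= x)
print(cdf[vals == 7])                                    # P(X <= 7) = 21/36

print(vals[np.searchsorted(cdf, 0.5)])                   # median: smallest x with c(x) >= 0.5 -> 7
```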

Drawing samples - discrete

[Figure: a discrete probability distribution and its cumulative; uniform samples on the vertical axis of the cumulative map to samples of the variable on the horizontal axis]
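The figure appears to depict the standard inverse-cumulative method: draw a uniform sample and return the first value whose cumulative reaches it. A sketch (assuming NumPy; the two-dice distribution is reused for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_discrete(vals, pmf, n):
    """Draw n samples by inverting the cumulative at uniform heights."""
    cdf = np.cumsum(pmf)
    cdf[-1] = 1.0                          # guard against floating-point round-off
    u = rng.random(n)                      # uniform samples in [0, 1)
    return vals[np.searchsorted(cdf, u)]   # first value with cdf >= u

vals = np.arange(2, 13)
pmf  = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36

samples = sample_discrete(vals, pmf, 100_000)
print(np.bincount(samples)[2:] / len(samples))   # empirical frequencies ~ pmf
```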

Multi-variate probability

• joint distributions

• marginals (integrating)

• conditionals (slicing)

• Bayes’ rule (inverse probability)

• statistical independence (separability)

• linear transformations

[on board]

Joint and conditional probability - discrete

P(Ace)   P(Heart)   P(Ace & Heart)   P(Ace | Heart)   P(not Jack of Diamonds)   P(Ace | not Jack of Diamonds)

“Independence”
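Worked out (assuming the standard 52-card deck the slide implies):

$$P(\text{Ace}) = \tfrac{4}{52} = \tfrac{1}{13} \qquad P(\text{Heart}) = \tfrac{13}{52} = \tfrac{1}{4} \qquad P(\text{Ace \& Heart}) = \tfrac{1}{52}$$

$$P(\text{Ace} \mid \text{Heart}) = \frac{1/52}{13/52} = \tfrac{1}{13} \qquad P(\text{not } J\diamondsuit) = \tfrac{51}{52} \qquad P(\text{Ace} \mid \text{not } J\diamondsuit) = \frac{4/52}{51/52} = \tfrac{4}{51}$$

Conditioning on Heart leaves P(Ace) unchanged, so Ace and Heart are independent; conditioning on "not the Jack of Diamonds" changes it ($4/51 \ne 1/13$), so those two events are not.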

Joint distribution (continuous)

[Figure: a two-dimensional joint density $p(x, y)$]

Marginal distribution

$$p(x) = \int p(x, y)\, dy$$

Conditional probability

p(A | B) = probability of A given that B is asserted to be true = p(A & B) / p(B)

[Figure: Venn diagram showing events A and B, their intersection A & B, and the region where neither A nor B occurs]

Conditional distribution

[Figure: joint density $p(x, y)$ and the conditional slice $p(x \mid y = 68)$]

$$p(x \mid y = 68) = p(x, y = 68) \Big/ \int p(x, y = 68)\, dx = p(x, y = 68) \big/ p(y = 68)$$

More generally:

$$p(x \mid y) = p(x, y) / p(y)$$

(slice the joint distribution, then normalize by the marginal)
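The slice-then-normalize recipe, sketched on a made-up discrete joint table (assuming NumPy):

```python
import numpy as np

# Made-up discrete joint distribution p(x, y): rows index x, columns index y
joint = np.array([[0.10, 0.05],
                  [0.30, 0.15],
                  [0.20, 0.20]])

p_y  = joint.sum(axis=0)         # marginal p(y): sum over x -> [0.6, 0.4]
cond = joint[:, 0] / p_y[0]      # p(x | y=0): slice the joint, normalize by the marginal

print(cond)                      # [0.1667, 0.5, 0.3333], sums to 1
```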

Bayes’ Rule

p(A | B) = probability of A given that B is asserted to be true = p(A & B) / p(B)

$$p(A \,\&\, B) = p(B)\, p(A \mid B) = p(A)\, p(B \mid A) \;\;\Rightarrow\;\; p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$$


Bayes’ Rule

p(x|y) = p(y|x) p(x)/p(y)

(a direct consequence of the definition of conditional probability)


Conditional vs. marginal

[Figure: marginal P(x) and conditional P(x | Y = 120)]

In general, the conditionals for different Y values differ. When are they the same? In particular, when are all conditionals equal to the marginal?

Statistical independence

Random variables X and Y are statistically independent if (and only if):

$$p(x, y) = p(x)\, p(y) \quad \forall\, x, y$$

Independence implies that all conditionals are equal to the corresponding marginal:

$$p(x \mid y) = p(x, y) / p(y) = p(x) \quad \forall\, x, y$$

[note: for discrete distributions, this is an outer product!]
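A sketch of the outer-product note, with made-up marginals:

```python
import numpy as np

px = np.array([0.2, 0.5, 0.3])     # made-up marginal p(x)
py = np.array([0.6, 0.4])          # made-up marginal p(y)

joint = np.outer(px, py)           # independence: p(x, y) = p(x) p(y)

# every conditional p(x | y) equals the marginal p(x)
for j in range(len(py)):
    print(joint[:, j] / joint[:, j].sum())   # [0.2, 0.5, 0.3] each time
```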

Sums of RVs

Let Z = X + Y. Since expectation is linear:

$$E(X + Y) = E(X) + E(Y)$$

In addition, if X and Y are independent, then

$$E(XY) = E(X)\, E(Y)$$

$$\sigma_Z^2 = E\left( \left( (X + Y) - (\mu_X + \mu_Y) \right)^2 \right) = \sigma_X^2 + \sigma_Y^2$$

and $p_Z(z)$ is a convolution of $p_X(x)$ and $p_Y(y)$.

[on board]
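The convolution claim can be checked on the two-dice example from earlier: convolving the pmf of one fair die with itself gives the triangular pmf of the sum (a sketch, assuming NumPy):

```python
import numpy as np

die   = np.ones(6) / 6          # pmf of one fair die, on values 1..6
p_sum = np.convolve(die, die)   # pmf of the sum of two dice, on values 2..12

print(p_sum * 36)               # [1. 2. 3. 4. 5. 6. 5. 4. 3. 2. 1.]
```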

Mean and variance

• Mean and variance summarize the centroid/width

• Translation and rescaling of random variables

• Mean/variance of a weighted sum of random variables

• The sample average

• ... converges to the true mean (except for bizarre distributions)

• ... with variance $\sigma^2 / N$

• ... is the most common choice for an estimate ...

Central limit for a uniform distribution...

[Figure: histograms of 10⁴ samples of a uniform density (sigma = 1): u; (u+u)/sqrt(2); (u+u+u+u)/sqrt(4); 10 u’s divided by sqrt(10)]

Central limit for a binary distribution...

[Figure: histograms of coin-flip averages: one coin; avg of 4 coins; avg of 16 coins; avg of 64 coins; avg of 256 coins]
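Both figures can be reproduced in a few lines; a sketch (assuming NumPy and matplotlib; the sample counts and panel choices here are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_samples = 10_000
fig, axes = plt.subplots(2, 4, figsize=(12, 5))

for ax, n in zip(axes[0], [1, 2, 4, 10]):
    # sums of n zero-mean, unit-variance uniform samples, renormalized by sqrt(n)
    u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n_samples, n))
    ax.hist(u.sum(axis=1) / np.sqrt(n), bins=50)
    ax.set_title(f"{n} u's / sqrt({n})")

for ax, n in zip(axes[1], [1, 4, 16, 64]):
    # averages of n fair coin flips (0 or 1)
    coins = rng.integers(0, 2, size=(n_samples, n))
    ax.hist(coins.mean(axis=1), bins=50)
    ax.set_title(f"avg of {n} coins")

plt.tight_layout()
plt.show()
```

In both rows the histograms become increasingly bell-shaped as n grows, which is the central limit theorem at work.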