Statistics Cookbook



Probability and Statistics Cookbook

Copyright © Matthias Vallentin, [email protected]

    10th October, 2012


This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California in Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics


    1 Distribution Overview

    1.1 Discrete Distributions

Notation¹: CDF $F_X(x)$, PMF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, MGF $M_X(s)$.

Uniform $\text{Unif}\{a, \dots, b\}$:
$F_X(x) = \begin{cases} 0 & x < a \\ \frac{\lfloor x\rfloor - a + 1}{b - a + 1} & a \le x \le b \\ 1 & x > b \end{cases}$, $f_X(x) = \frac{I(a \le x \le b)}{b - a + 1}$, $\mathbb{E}[X] = \frac{a + b}{2}$, $\mathbb{V}[X] = \frac{(b - a + 1)^2 - 1}{12}$, $M_X(s) = \frac{e^{as} - e^{(b+1)s}}{(b - a + 1)(1 - e^s)}$

Bernoulli $\text{Bern}(p)$:
$f_X(x) = p^x(1 - p)^{1-x}$, $\mathbb{E}[X] = p$, $\mathbb{V}[X] = p(1 - p)$, $M_X(s) = 1 - p + pe^s$

Binomial $\text{Bin}(n, p)$:
$F_X(x) = I_{1-p}(n - x, x + 1)$, $f_X(x) = \binom{n}{x}p^x(1 - p)^{n-x}$, $\mathbb{E}[X] = np$, $\mathbb{V}[X] = np(1 - p)$, $M_X(s) = (1 - p + pe^s)^n$

Multinomial $\text{Mult}(n, p)$:
$f_X(x) = \frac{n!}{x_1!\cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}$ with $\sum_{i=1}^{k}x_i = n$, $\mathbb{E}[X_i] = np_i$, $\mathbb{V}[X_i] = np_i(1 - p_i)$, $M_X(s) = \left(\sum_{i=1}^{k}p_ie^{s_i}\right)^n$

Hypergeometric $\text{Hyp}(N, m, n)$:
$f_X(x) = \frac{\binom{m}{x}\binom{N-m}{n-x}}{\binom{N}{n}}$, $\mathbb{E}[X] = \frac{nm}{N}$, $\mathbb{V}[X] = \frac{nm(N - n)(N - m)}{N^2(N - 1)}$

Negative Binomial $\text{NBin}(r, p)$:
$F_X(x) = I_p(r, x + 1)$, $f_X(x) = \binom{x + r - 1}{r - 1}p^r(1 - p)^x$, $\mathbb{E}[X] = r\,\frac{1 - p}{p}$, $\mathbb{V}[X] = r\,\frac{1 - p}{p^2}$, $M_X(s) = \left(\frac{p}{1 - (1 - p)e^s}\right)^r$

Geometric $\text{Geo}(p)$:
$F_X(x) = 1 - (1 - p)^x$ ($x \in \mathbb{N}^+$), $f_X(x) = p(1 - p)^{x-1}$ ($x \in \mathbb{N}^+$), $\mathbb{E}[X] = \frac{1}{p}$, $\mathbb{V}[X] = \frac{1 - p}{p^2}$, $M_X(s) = \frac{pe^s}{1 - (1 - p)e^s}$

Poisson $\text{Po}(\lambda)$:
$F_X(x) = e^{-\lambda}\sum_{i=0}^{\lfloor x\rfloor}\frac{\lambda^i}{i!}$, $f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$, $\mathbb{E}[X] = \lambda$, $\mathbb{V}[X] = \lambda$, $M_X(s) = e^{\lambda(e^s - 1)}$

[Figure: PMFs of the discrete uniform; the Binomial for $(n, p) \in \{(40, 0.3), (30, 0.6), (25, 0.9)\}$; the Geometric for $p \in \{0.2, 0.5, 0.8\}$; and the Poisson for $\lambda \in \{1, 4, 10\}$.]

¹We use the notation $\Gamma(s, x)$ and $\Gamma(x)$ to refer to the Gamma functions (see 22.1), and use $B(x, y)$ and $I_x$ to refer to the Beta functions (see 22.2).
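The $\mathbb{E}[X]$ and $\mathbb{V}[X]$ columns can be checked numerically by summing over the support. A minimal Python sketch (helper names are ours) for the Binomial and Poisson rows:

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    # f_X(x) = C(n, x) p^x (1 - p)^(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def pois_pmf(x, lam):
    # f_X(x) = lam^x e^(-lam) / x!
    return lam ** x * exp(-lam) / factorial(x)

def mean_var(pmf, support):
    # E[X] = sum x f(x),  V[X] = sum (x - E[X])^2 f(x)
    m = sum(x * pmf(x) for x in support)
    v = sum((x - m) ** 2 * pmf(x) for x in support)
    return m, v

n, p, lam = 10, 0.3, 4.0
mb, vb = mean_var(lambda x: binom_pmf(x, n, p), range(n + 1))
mp, vp = mean_var(lambda x: pois_pmf(x, lam), range(100))  # truncated support
# mb ≈ np, vb ≈ np(1 - p), and mp ≈ vp ≈ lam, matching the table
```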


    1.2 Continuous Distributions

Notation: CDF $F_X(x)$, PDF $f_X(x)$, mean $\mathbb{E}[X]$, variance $\mathbb{V}[X]$, MGF $M_X(s)$.

Uniform $\text{Unif}(a, b)$:
$F_X(x) = \begin{cases} 0 & x < a \\ \frac{x - a}{b - a} & a \le x \le b \\ 1 & x > b \end{cases}$, $f_X(x) = \frac{I(a < x < b)}{b - a}$, $\mathbb{E}[X] = \frac{a + b}{2}$, $\mathbb{V}[X] = \frac{(b - a)^2}{12}$, $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b - a)}$

Normal $\mathcal{N}(\mu, \sigma^2)$:
$F_X(x) = \Phi(x) = \int_{-\infty}^{x}\phi(t)\,dt$, $f_X(x) = \phi(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}$, $\mathbb{E}[X] = \mu$, $\mathbb{V}[X] = \sigma^2$, $M_X(s) = \exp\left\{\mu s + \frac{\sigma^2 s^2}{2}\right\}$

Log-Normal $\ln\mathcal{N}(\mu, \sigma^2)$:
$F_X(x) = \frac{1}{2} + \frac{1}{2}\,\text{erf}\left[\frac{\ln x - \mu}{\sqrt{2\sigma^2}}\right]$, $f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(\ln x - \mu)^2}{2\sigma^2}\right\}$, $\mathbb{E}[X] = e^{\mu + \sigma^2/2}$, $\mathbb{V}[X] = (e^{\sigma^2} - 1)\,e^{2\mu + \sigma^2}$

Multivariate Normal $\text{MVN}(\mu, \Sigma)$:
$f_X(x) = (2\pi)^{-k/2}|\Sigma|^{-1/2}e^{-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)}$, $M_X(s) = \exp\left\{\mu^T s + \frac{1}{2}s^T\Sigma s\right\}$

Student's t $\text{Student}(\nu)$:
$f_X(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu + 1)/2}$, $\mathbb{E}[X] = 0$ ($\nu > 1$), $\mathbb{V}[X] = \frac{\nu}{\nu - 2}$ ($\nu > 2$)

Chi-square $\chi^2_k$:
$F_X(x) = \frac{\gamma\left(\frac{k}{2}, \frac{x}{2}\right)}{\Gamma\left(\frac{k}{2}\right)}$, $f_X(x) = \frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2 - 1}e^{-x/2}$, $\mathbb{E}[X] = k$, $\mathbb{V}[X] = 2k$, $M_X(s) = (1 - 2s)^{-k/2}$ ($s < 1/2$)

F $\text{F}(d_1, d_2)$:
$F_X(x) = I_{\frac{d_1 x}{d_1 x + d_2}}\left(\frac{d_1}{2}, \frac{d_2}{2}\right)$, $f_X(x) = \frac{\sqrt{\frac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x\,B\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}$, $\mathbb{E}[X] = \frac{d_2}{d_2 - 2}$ ($d_2 > 2$), $\mathbb{V}[X] = \frac{2d_2^2(d_1 + d_2 - 2)}{d_1(d_2 - 2)^2(d_2 - 4)}$ ($d_2 > 4$)

Exponential $\text{Exp}(\beta)$:
$F_X(x) = 1 - e^{-x/\beta}$, $f_X(x) = \frac{1}{\beta}e^{-x/\beta}$, $\mathbb{E}[X] = \beta$, $\mathbb{V}[X] = \beta^2$, $M_X(s) = \frac{1}{1 - \beta s}$ ($s < 1/\beta$)

Gamma $\text{Gamma}(\alpha, \beta)$:
$F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$, $f_X(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha}x^{\alpha - 1}e^{-x/\beta}$, $\mathbb{E}[X] = \alpha\beta$, $\mathbb{V}[X] = \alpha\beta^2$, $M_X(s) = \left(\frac{1}{1 - \beta s}\right)^\alpha$ ($s < 1/\beta$)

Inverse Gamma $\text{InvGamma}(\alpha, \beta)$:
$F_X(x) = \frac{\Gamma\left(\alpha, \frac{\beta}{x}\right)}{\Gamma(\alpha)}$, $f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{-\alpha - 1}e^{-\beta/x}$, $\mathbb{E}[X] = \frac{\beta}{\alpha - 1}$ ($\alpha > 1$), $\mathbb{V}[X] = \frac{\beta^2}{(\alpha - 1)^2(\alpha - 2)}$ ($\alpha > 2$), $M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)}K_\alpha\left(\sqrt{-4\beta s}\right)$

Dirichlet $\text{Dir}(\alpha)$:
$f_X(x) = \frac{\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\prod_{i=1}^{k}x_i^{\alpha_i - 1}$, $\mathbb{E}[X_i] = \frac{\alpha_i}{\sum_{i=1}^{k}\alpha_i}$, $\mathbb{V}[X_i] = \frac{\mathbb{E}[X_i](1 - \mathbb{E}[X_i])}{\sum_{i=1}^{k}\alpha_i + 1}$

Beta $\text{Beta}(\alpha, \beta)$:
$F_X(x) = I_x(\alpha, \beta)$, $f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha - 1}(1 - x)^{\beta - 1}$, $\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}$, $\mathbb{V}[X] = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$, $M_X(s) = 1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha + r}{\alpha + \beta + r}\right)\frac{s^k}{k!}$

Weibull $\text{Weibull}(\lambda, k)$:
$F_X(x) = 1 - e^{-(x/\lambda)^k}$, $f_X(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k - 1}e^{-(x/\lambda)^k}$, $\mathbb{E}[X] = \lambda\Gamma\left(1 + \frac{1}{k}\right)$, $\mathbb{V}[X] = \lambda^2\Gamma\left(1 + \frac{2}{k}\right) - \mu^2$, $M_X(s) = \sum_{n=0}^{\infty}\frac{s^n\lambda^n}{n!}\Gamma\left(1 + \frac{n}{k}\right)$

Pareto $\text{Pareto}(x_m, \alpha)$:
$F_X(x) = 1 - \left(\frac{x_m}{x}\right)^\alpha$ ($x \ge x_m$), $f_X(x) = \alpha\,\frac{x_m^\alpha}{x^{\alpha + 1}}$ ($x \ge x_m$), $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}$ ($\alpha > 1$), $\mathbb{V}[X] = \frac{x_m^2\alpha}{(\alpha - 1)^2(\alpha - 2)}$ ($\alpha > 2$), $M_X(s) = \alpha(-x_m s)^\alpha\Gamma(-\alpha, -x_m s)$ ($s < 0$)


[Figure: PDFs of the continuous uniform; the Normal for $(\mu, \sigma^2) \in \{(0, 0.2), (0, 1), (0, 5), (-2, 0.5)\}$; the Log-Normal for various $(\mu, \sigma^2)$; Student's t for $\nu \in \{1, 2, 5, \infty\}$; $\chi^2$ for $k = 1, \dots, 5$; the F distribution for various $(d_1, d_2)$; the Exponential for $\beta \in \{2, 1, 0.4\}$; the Gamma for various $(\alpha, \beta)$; the Inverse Gamma; the Beta; the Weibull for $\lambda = 1$, $k \in \{0.5, 1, 1.5, 5\}$; and the Pareto for $x_m = 1$, $\alpha \in \{1, 2, 4\}$.]


    2 Probability Theory

Definitions

- Sample space $\Omega$
- Outcome (point or element) $\omega \in \Omega$
- Event $A \subseteq \Omega$
- $\sigma$-algebra $\mathcal{A}$:
  1. $\emptyset \in \mathcal{A}$
  2. $A_1, A_2, \dots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
  3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
- Probability distribution $\mathbb{P}$:
  1. $\mathbb{P}[A] \ge 0$ for every $A$
  2. $\mathbb{P}[\Omega] = 1$
  3. $\mathbb{P}\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \mathbb{P}[A_i]$ for disjoint $A_i$
- Probability space $(\Omega, \mathcal{A}, \mathbb{P})$

Properties

- $\mathbb{P}[\emptyset] = 0$
- $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
- $\mathbb{P}[\neg A] = 1 - \mathbb{P}[A]$
- $\mathbb{P}[B] = \mathbb{P}[A \cap B] + \mathbb{P}[\neg A \cap B]$
- $\mathbb{P}[\Omega] = 1$, $\mathbb{P}[\emptyset] = 0$
- $\neg\left(\bigcup_n A_n\right) = \bigcap_n \neg A_n$ and $\neg\left(\bigcap_n A_n\right) = \bigcup_n \neg A_n$ (DeMorgan)
- $\mathbb{P}\left[\bigcup_n A_n\right] = 1 - \mathbb{P}\left[\bigcap_n \neg A_n\right]$
- $\mathbb{P}[A \cup B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cap B] \implies \mathbb{P}[A \cup B] \le \mathbb{P}[A] + \mathbb{P}[B]$
- $\mathbb{P}[A \cup B] = \mathbb{P}[A \cap \neg B] + \mathbb{P}[\neg A \cap B] + \mathbb{P}[A \cap B]$
- $\mathbb{P}[A \cap \neg B] = \mathbb{P}[A] - \mathbb{P}[A \cap B]$

Continuity of Probabilities

- $A_1 \subset A_2 \subset \dots \implies \lim_{n\to\infty} \mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcup_{i=1}^{\infty} A_i$
- $A_1 \supset A_2 \supset \dots \implies \lim_{n\to\infty} \mathbb{P}[A_n] = \mathbb{P}[A]$ where $A = \bigcap_{i=1}^{\infty} A_i$

Independence

$A \perp B \iff \mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B]$

Conditional Probability

$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]} \qquad \mathbb{P}[B] > 0$

Law of Total Probability

$\mathbb{P}[B] = \sum_{i=1}^{n} \mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i] \qquad \Omega = \bigcup_{i=1}^{n} A_i$

Bayes' Theorem

$\mathbb{P}[A_i \mid B] = \frac{\mathbb{P}[B \mid A_i]\,\mathbb{P}[A_i]}{\sum_{j=1}^{n} \mathbb{P}[B \mid A_j]\,\mathbb{P}[A_j]} \qquad \Omega = \bigcup_{i=1}^{n} A_i$

Inclusion-Exclusion Principle

$\left|\bigcup_{i=1}^{n} A_i\right| = \sum_{r=1}^{n} (-1)^{r-1} \sum_{i_1 < \dots < i_r} \left|A_{i_1} \cap \dots \cap A_{i_r}\right|$
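For a finite sample space with equally likely outcomes, several of these properties can be verified by direct enumeration; a small Python sketch (the die and event names are ours):

```python
from fractions import Fraction

omega = set(range(1, 7))                      # sample space: one fair die
P = lambda E: Fraction(len(E), len(omega))    # uniform probability measure

A = {2, 4, 6}                                 # event "even"
B = {4, 5, 6}                                 # event "greater than 3"

# inclusion-exclusion for two events
assert P(A | B) == P(A) + P(B) - P(A & B)
# complement rule
assert P(omega - (A | B)) == 1 - P(A | B)
# DeMorgan: complement of a union is the intersection of complements
assert omega - (A | B) == (omega - A) & (omega - B)
# law of total probability with the partition {A, ¬A}
assert P(B) == P(B & A) + P(B & (omega - A))
```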


    3.1 Transformations

Transformation function: $Z = \varphi(X)$

Discrete

$f_Z(z) = \mathbb{P}[\varphi(X) = z] = \mathbb{P}[\{x : \varphi(x) = z\}] = \mathbb{P}\left[X \in \varphi^{-1}(z)\right] = \sum_{x \in \varphi^{-1}(z)} f(x)$

Continuous

$F_Z(z) = \mathbb{P}[\varphi(X) \le z] = \int_{A_z} f(x)\,dx \quad \text{with } A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ is strictly monotone:

$f_Z(z) = f_X\left(\varphi^{-1}(z)\right)\left|\frac{d}{dz}\varphi^{-1}(z)\right| = f_X(x)\left|\frac{dx}{dz}\right| = f_X(x)\frac{1}{|J|}$

The Rule of the Lazy Statistician

$\mathbb{E}[Z] = \int \varphi(x)\,dF_X(x)$
$\mathbb{E}[I_A(X)] = \int I_A(x)\,dF_X(x) = \int_A dF_X(x) = \mathbb{P}[X \in A]$

Convolution

- $Z := X + Y$: $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx \stackrel{X,Y \ge 0}{=} \int_0^z f_{X,Y}(x, z - x)\,dx$
- $Z := |X - Y|$: $f_Z(z) = 2\int_0^\infty f_{X,Y}(x, z + x)\,dx$
- $Z := \frac{X}{Y}$: $f_Z(z) = \int_{-\infty}^{\infty} |x|\,f_{X,Y}(x, xz)\,dx \stackrel{X \perp Y}{=} \int |x|\,f_X(x)f_Y(xz)\,dx$
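The convolution formula for $Z = X + Y$ has a direct discrete analogue, $f_Z(z) = \sum_x f_X(x)f_Y(z - x)$; a small Python sketch for two independent fair dice (the function name is ours):

```python
def convolve_pmf(fx, fy):
    # f_Z(z) = sum_x f_X(x) f_Y(z - x) for Z = X + Y with X ⟂ Y
    fz = {}
    for x, px in fx.items():
        for y, py in fy.items():
            fz[x + y] = fz.get(x + y, 0.0) + px * py
    return fz

die = {k: 1 / 6 for k in range(1, 7)}
total = convolve_pmf(die, die)   # pmf of the sum of two fair dice
# e.g. P[Z = 7] = 6/36, and the pmf still sums to 1
```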

    4 Expectation

Definition and properties

$\mathbb{E}[X] = \mu_X = \int x\,dF_X(x) = \begin{cases} \sum_x x f_X(x) & X \text{ discrete} \\ \int x f_X(x)\,dx & X \text{ continuous} \end{cases}$

- $\mathbb{P}[X = c] = 1 \implies \mathbb{E}[c] = c$
- $\mathbb{E}[cX] = c\,\mathbb{E}[X]$
- $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
- $\mathbb{E}[XY] = \iint xy\,f_{X,Y}(x, y)\,dx\,dy$
- $\mathbb{E}[\varphi(X)] \ne \varphi(\mathbb{E}[X])$ in general (cf. Jensen inequality)
- $\mathbb{P}[X \ge Y] = 1 \implies \mathbb{E}[X] \ge \mathbb{E}[Y]$
- $\mathbb{P}[X = Y] = 1 \implies \mathbb{E}[X] = \mathbb{E}[Y]$
- $\mathbb{E}[X] = \sum_{x=1}^{\infty} \mathbb{P}[X \ge x]$ ($X$ integer-valued)

Sample mean

$\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$

Conditional expectation

- $\mathbb{E}[Y \mid X = x] = \int y f(y \mid x)\,dy$
- $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$
- $\mathbb{E}[\varphi(X, Y) \mid X = x] = \int \varphi(x, y) f_{Y|X}(y \mid x)\,dy$
- $\mathbb{E}[\varphi(Y, Z) \mid X = x] = \iint \varphi(y, z) f_{(Y,Z)|X}(y, z \mid x)\,dy\,dz$
- $\mathbb{E}[Y + Z \mid X] = \mathbb{E}[Y \mid X] + \mathbb{E}[Z \mid X]$
- $\mathbb{E}[\varphi(X)Y \mid X] = \varphi(X)\,\mathbb{E}[Y \mid X]$
- $\mathbb{E}[Y \mid X] = c \implies \text{Cov}[X, Y] = 0$

5 Variance

Definition and properties

- $\mathbb{V}[X] = \sigma_X^2 = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
- $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n}\mathbb{V}[X_i] + 2\sum_{i \ne j}\text{Cov}[X_i, X_j]$
- $\mathbb{V}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n}\mathbb{V}[X_i]$ if $X_i \perp X_j$

Standard deviation

$\text{sd}[X] = \sqrt{\mathbb{V}[X]} = \sigma_X$

Covariance

- $\text{Cov}[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$
- $\text{Cov}[X, a] = 0$
- $\text{Cov}[X, X] = \mathbb{V}[X]$
- $\text{Cov}[X, Y] = \text{Cov}[Y, X]$
- $\text{Cov}[aX, bY] = ab\,\text{Cov}[X, Y]$


- $\text{Cov}[X + a, Y + b] = \text{Cov}[X, Y]$
- $\text{Cov}\left[\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j\right] = \sum_{i=1}^{n}\sum_{j=1}^{m}\text{Cov}[X_i, Y_j]$

Correlation

$\rho[X, Y] = \frac{\text{Cov}[X, Y]}{\sqrt{\mathbb{V}[X]\,\mathbb{V}[Y]}}$

Independence

$X \perp Y \implies \rho[X, Y] = 0 \iff \text{Cov}[X, Y] = 0 \iff \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$

Sample variance

$S^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2$

Conditional variance

- $\mathbb{V}[Y \mid X] = \mathbb{E}\left[(Y - \mathbb{E}[Y \mid X])^2 \mid X\right] = \mathbb{E}[Y^2 \mid X] - \mathbb{E}[Y \mid X]^2$
- $\mathbb{V}[Y] = \mathbb{E}[\mathbb{V}[Y \mid X]] + \mathbb{V}[\mathbb{E}[Y \mid X]]$

    6 Inequalities

Cauchy-Schwarz: $\mathbb{E}[XY]^2 \le \mathbb{E}[X^2]\,\mathbb{E}[Y^2]$

Markov: $\mathbb{P}[\varphi(X) \ge t] \le \frac{\mathbb{E}[\varphi(X)]}{t}$

Chebyshev: $\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le \frac{\mathbb{V}[X]}{t^2}$

Chernoff: $\mathbb{P}[X \ge (1 + \delta)\mu] \le \left(\frac{e^\delta}{(1 + \delta)^{1 + \delta}}\right)^\mu \qquad \delta > -1$

Jensen: $\mathbb{E}[\varphi(X)] \ge \varphi(\mathbb{E}[X])$ for convex $\varphi$

    7 Distribution Relationships

Binomial

- $X_i \sim \text{Bern}(p) \implies \sum_{i=1}^{n} X_i \sim \text{Bin}(n, p)$
- $X \sim \text{Bin}(n, p),\ Y \sim \text{Bin}(m, p) \implies X + Y \sim \text{Bin}(n + m, p)$
- $\lim_{n\to\infty} \text{Bin}(n, p) = \text{Po}(np)$ ($n$ large, $p$ small)
- $\lim_{n\to\infty} \text{Bin}(n, p) = \mathcal{N}(np, np(1 - p))$ ($n$ large, $p$ far from 0 and 1)

Negative Binomial

- $X \sim \text{NBin}(1, p) = \text{Geo}(p)$
- $X \sim \text{NBin}(r, p) = \sum_{i=1}^{r} \text{Geo}(p)$
- $X_i \sim \text{NBin}(r_i, p) \implies \sum X_i \sim \text{NBin}\left(\sum r_i, p\right)$
- $X \sim \text{NBin}(r, p),\ Y \sim \text{Bin}(s + r, p) \implies \mathbb{P}[X \le s] = \mathbb{P}[Y \ge r]$

Poisson

- $X_i \sim \text{Po}(\lambda_i) \wedge X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \text{Po}\left(\sum_{i=1}^{n}\lambda_i\right)$
- $X_i \sim \text{Po}(\lambda_i) \wedge X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \text{Bin}\left(\sum_{j=1}^{n} X_j, \frac{\lambda_i}{\sum_{j=1}^{n}\lambda_j}\right)$

Exponential

- $X_i \sim \text{Exp}(\beta) \wedge X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \text{Gamma}(n, \beta)$
- Memoryless property: $\mathbb{P}[X > x + y \mid X > y] = \mathbb{P}[X > x]$

Normal

- $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
- $X \sim \mathcal{N}(\mu, \sigma^2) \wedge Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
- $X \sim \mathcal{N}(\mu_1, \sigma_1^2) \wedge Y \sim \mathcal{N}(\mu_2, \sigma_2^2) \implies X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
- $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \implies \sum_i X_i \sim \mathcal{N}\left(\sum_i \mu_i, \sum_i \sigma_i^2\right)$
- $\mathbb{P}[a < X \le b] = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right)$
- $\Phi(-x) = 1 - \Phi(x)$, $\phi'(x) = -x\phi(x)$, $\phi''(x) = (x^2 - 1)\phi(x)$
- Upper quantile of $\mathcal{N}(0, 1)$: $z_\alpha = \Phi^{-1}(1 - \alpha)$

Gamma

- $X \sim \text{Gamma}(\alpha, \beta) \iff X/\beta \sim \text{Gamma}(\alpha, 1)$
- $\text{Gamma}(\alpha, \beta) \sim \sum_{i=1}^{\alpha} \text{Exp}(\beta)$
- $X_i \sim \text{Gamma}(\alpha_i, \beta) \wedge X_i \perp X_j \implies \sum_i X_i \sim \text{Gamma}\left(\sum_i \alpha_i, \beta\right)$
- $\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1}e^{-x}\,dx$

Beta

- $\frac{1}{B(\alpha, \beta)}x^{\alpha - 1}(1 - x)^{\beta - 1} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha - 1}(1 - x)^{\beta - 1}$
- $\mathbb{E}[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1}\mathbb{E}[X^{k - 1}]$
- $\text{Beta}(1, 1) \sim \text{Unif}(0, 1)$


    8 Probability and Moment Generating Functions

$G_X(t) = \mathbb{E}\left[t^X\right] \qquad |t| < 1$
$M_X(s) = G_X(e^s) = \mathbb{E}\left[e^{sX}\right]$

10 Convergence

Let $\{X_1, X_2, \dots\}$ be a sequence of rvs and let $X$ be another rv. Types of convergence:

1. In distribution (weakly): $X_n \xrightarrow{D} X$ if $\lim_{n\to\infty} F_{X_n}(t) = F_X(t)$ at all $t$ where $F_X$ is continuous
2. In probability: $X_n \xrightarrow{P} X$ if $(\forall \varepsilon > 0)\ \lim_{n\to\infty} \mathbb{P}[|X_n - X| > \varepsilon] = 0$
3. Almost surely (strongly): $X_n \xrightarrow{as} X$ if $\mathbb{P}\left[\lim_{n\to\infty} X_n = X\right] = \mathbb{P}\left[\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right] = 1$


4. In quadratic mean ($L_2$): $X_n \xrightarrow{qm} X$ if $\lim_{n\to\infty} \mathbb{E}\left[(X_n - X)^2\right] = 0$

Relationships

- $X_n \xrightarrow{qm} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{D} X$
- $X_n \xrightarrow{as} X \implies X_n \xrightarrow{P} X$
- $X_n \xrightarrow{D} X \wedge (\exists c \in \mathbb{R})\ \mathbb{P}[X = c] = 1 \implies X_n \xrightarrow{P} X$
- $X_n \xrightarrow{P} X \wedge Y_n \xrightarrow{P} Y \implies X_n + Y_n \xrightarrow{P} X + Y$
- $X_n \xrightarrow{qm} X \wedge Y_n \xrightarrow{qm} Y \implies X_n + Y_n \xrightarrow{qm} X + Y$
- $X_n \xrightarrow{P} X \wedge Y_n \xrightarrow{P} Y \implies X_n Y_n \xrightarrow{P} XY$
- $X_n \xrightarrow{P} X \implies \varphi(X_n) \xrightarrow{P} \varphi(X)$
- $X_n \xrightarrow{D} X \implies \varphi(X_n) \xrightarrow{D} \varphi(X)$
- $X_n \xrightarrow{qm} b \iff \lim_{n\to\infty} \mathbb{E}[X_n] = b \wedge \lim_{n\to\infty} \mathbb{V}[X_n] = 0$
- $X_1, \dots, X_n$ iid $\wedge\ \mathbb{E}[X] = \mu \wedge \mathbb{V}[X] < \infty \implies \overline{X}_n \xrightarrow{qm} \mu$

Slutsky's Theorem

- $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n + Y_n \xrightarrow{D} X + c$
- $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n Y_n \xrightarrow{D} cX$
- In general: $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{D} Y$ does not imply $X_n + Y_n \xrightarrow{D} X + Y$

    10.1 Law of Large Numbers (LLN)

Let $\{X_1, \dots, X_n\}$ be a sequence of iid rvs with $\mathbb{E}[X_1] = \mu$.

Weak (WLLN): $\overline{X}_n \xrightarrow{P} \mu$ as $n \to \infty$

Strong (SLLN): $\overline{X}_n \xrightarrow{as} \mu$ as $n \to \infty$

    10.2 Central Limit Theorem (CLT)

Let $\{X_1, \dots, X_n\}$ be a sequence of iid rvs with $\mathbb{E}[X_1] = \mu$ and $\mathbb{V}[X_1] = \sigma^2$.

$Z_n := \frac{\overline{X}_n - \mu}{\sqrt{\mathbb{V}[\overline{X}_n]}} = \frac{\sqrt{n}\left(\overline{X}_n - \mu\right)}{\sigma} \xrightarrow{D} Z \quad \text{where } Z \sim \mathcal{N}(0, 1)$

$\lim_{n\to\infty} \mathbb{P}[Z_n \le z] = \Phi(z) \qquad z \in \mathbb{R}$

CLT notations

- $Z_n \approx \mathcal{N}(0, 1)$
- $\overline{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$
- $\overline{X}_n - \mu \approx \mathcal{N}\left(0, \frac{\sigma^2}{n}\right)$
- $\sqrt{n}\left(\overline{X}_n - \mu\right) \approx \mathcal{N}\left(0, \sigma^2\right)$
- $\frac{\sqrt{n}\left(\overline{X}_n - \mu\right)}{\sigma} \approx \mathcal{N}(0, 1)$

Continuity correction

$\mathbb{P}\left[\overline{X}_n \le x\right] \approx \Phi\left(\frac{x + \frac{1}{2} - \mu}{\sigma/\sqrt{n}}\right) \qquad \mathbb{P}\left[\overline{X}_n \ge x\right] \approx 1 - \Phi\left(\frac{x - \frac{1}{2} - \mu}{\sigma/\sqrt{n}}\right)$

Delta method

$Y_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \implies \varphi(Y_n) \approx \mathcal{N}\left(\varphi(\mu), \left(\varphi'(\mu)\right)^2 \frac{\sigma^2}{n}\right)$
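A quick simulation sketch of the CLT for $\text{Unif}(0, 1)$ summands (illustrative only; the sample sizes and seed are arbitrary choices of ours):

```python
import random

random.seed(1)
n, reps = 50, 2000
mu, var = 0.5, 1 / 12          # mean and variance of Unif(0, 1)

# Z_n = sqrt(n) (Xbar_n - mu) / sigma should be approximately N(0, 1)
z = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    z.append((n ** 0.5) * (xbar - mu) / var ** 0.5)

zbar = sum(z) / reps                              # should be near 0
zvar = sum((v - zbar) ** 2 for v in z) / reps     # should be near 1
```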

    11 Statistical Inference

Let $X_1, \dots, X_n \stackrel{iid}{\sim} F$ if not otherwise noted.

11.1 Point Estimation

- Point estimator $\hat\theta_n$ of $\theta$ is a rv: $\hat\theta_n = g(X_1, \dots, X_n)$
- $\text{bias}(\hat\theta_n) = \mathbb{E}[\hat\theta_n] - \theta$
- Consistency: $\hat\theta_n \xrightarrow{P} \theta$
- Sampling distribution: $F(\hat\theta_n)$
- Standard error: $\text{se}(\hat\theta_n) = \sqrt{\mathbb{V}[\hat\theta_n]}$
- Mean squared error: $\text{mse} = \mathbb{E}\left[(\hat\theta_n - \theta)^2\right] = \text{bias}(\hat\theta_n)^2 + \mathbb{V}[\hat\theta_n]$
- $\lim_{n\to\infty} \text{bias}(\hat\theta_n) = 0 \wedge \lim_{n\to\infty} \text{se}(\hat\theta_n) = 0 \implies \hat\theta_n$ is consistent
- Asymptotic normality: $\frac{\hat\theta_n - \theta}{\text{se}} \xrightarrow{D} \mathcal{N}(0, 1)$
- Slutsky's Theorem often lets us replace $\text{se}(\hat\theta_n)$ by some (weakly) consistent estimator $\hat\sigma_n$.


    11.2 Normal-Based Confidence Interval

Suppose $\hat\theta_n \approx \mathcal{N}\left(\theta, \widehat{\text{se}}^2\right)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, i.e., $\mathbb{P}[Z > z_{\alpha/2}] = \alpha/2$ and $\mathbb{P}[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$ where $Z \sim \mathcal{N}(0, 1)$. Then

$C_n = \hat\theta_n \pm z_{\alpha/2}\,\widehat{\text{se}}$

    11.3 Empirical distribution

Empirical Distribution Function (ECDF)

$\hat F_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n} \qquad I(X_i \le x) = \begin{cases} 1 & X_i \le x \\ 0 & X_i > x \end{cases}$

Properties (for any fixed $x$):

- $\mathbb{E}[\hat F_n(x)] = F(x)$
- $\mathbb{V}[\hat F_n(x)] = \frac{F(x)(1 - F(x))}{n}$
- $\text{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
- $\hat F_n(x) \xrightarrow{P} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ($X_1, \dots, X_n \sim F$):

$\mathbb{P}\left[\sup_x \left|F(x) - \hat F_n(x)\right| > \varepsilon\right] \le 2e^{-2n\varepsilon^2}$

Nonparametric $1 - \alpha$ confidence band for $F$:

$L(x) = \max\{\hat F_n(x) - \varepsilon_n, 0\} \qquad U(x) = \min\{\hat F_n(x) + \varepsilon_n, 1\} \qquad \varepsilon_n = \sqrt{\frac{1}{2n}\log\frac{2}{\alpha}}$

$\mathbb{P}[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$
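A minimal Python sketch of the ECDF and the DKW confidence band (helper names and toy data are ours):

```python
from math import log, sqrt

def ecdf(data):
    # Fhat_n(x) = (# of X_i <= x) / n
    xs = sorted(data)
    n = len(xs)
    return lambda x: sum(1 for v in xs if v <= x) / n

data = [0.1, 0.3, 0.3, 0.7, 0.9]
Fn = ecdf(data)

alpha = 0.05
eps = sqrt(log(2 / alpha) / (2 * len(data)))   # DKW band half-width
L = lambda x: max(Fn(x) - eps, 0.0)            # lower envelope
U = lambda x: min(Fn(x) + eps, 1.0)            # upper envelope
```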

    11.4 Statistical Functionals

- Statistical functional: $T(F)$
- Plug-in estimator of $\theta = T(F)$: $\hat\theta_n = T(\hat F_n)$
- Linear functional: $T(F) = \int \varphi(x)\,dF_X(x)$
- Plug-in estimator for linear functional: $T(\hat F_n) = \int \varphi(x)\,d\hat F_n(x) = \frac{1}{n}\sum_{i=1}^{n}\varphi(X_i)$
- Often: $T(\hat F_n) \approx \mathcal{N}\left(T(F), \widehat{\text{se}}^2\right) \implies T(\hat F_n) \pm z_{\alpha/2}\,\widehat{\text{se}}$
- $p^{\text{th}}$ quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
- $\hat\mu = \overline{X}_n$
- $\hat\sigma^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2$
- Skewness: $\hat\kappa = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat\mu)^3}{\hat\sigma^3}$
- Correlation: $\hat\rho = \frac{\sum_{i=1}^{n}(X_i - \overline{X}_n)(Y_i - \overline{Y}_n)}{\sqrt{\sum_{i=1}^{n}(X_i - \overline{X}_n)^2}\sqrt{\sum_{i=1}^{n}(Y_i - \overline{Y}_n)^2}}$

12 Parametric Inference

Let $\mathcal{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subset \mathbb{R}^k$ and parameter $\theta = (\theta_1, \dots, \theta_k)$.

    12.1 Method of Moments

$j^{\text{th}}$ moment: $\alpha_j(\theta) = \mathbb{E}[X^j] = \int x^j\,dF_X(x)$

$j^{\text{th}}$ sample moment: $\hat\alpha_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j$

Method of moments estimator (MoM): $\hat\theta_n$ solves the system

$\alpha_1(\hat\theta_n) = \hat\alpha_1$
$\alpha_2(\hat\theta_n) = \hat\alpha_2$
$\vdots$
$\alpha_k(\hat\theta_n) = \hat\alpha_k$
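For $\mathcal{N}(\mu, \sigma^2)$, where $\alpha_1 = \mu$ and $\alpha_2 = \sigma^2 + \mu^2$, the MoM system solves in closed form; a small Python sketch on toy data (the data values are ours):

```python
data = [1.0, 2.0, 2.0, 3.0, 4.0]
n = len(data)

# sample moments: alpha_hat_j = (1/n) sum X_i^j
a1 = sum(x for x in data) / n
a2 = sum(x * x for x in data) / n

# solve alpha_1(mu, s2) = mu and alpha_2(mu, s2) = s2 + mu^2
mu_hat = a1
s2_hat = a2 - a1 ** 2
```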


    Properties of the MoM estimator

- $\hat\theta_n$ exists with probability tending to 1
- Consistency: $\hat\theta_n \xrightarrow{P} \theta$
- Asymptotic normality: $\sqrt{n}(\hat\theta - \theta) \xrightarrow{D} \mathcal{N}(0, \Sigma)$ where $\Sigma = g\,\mathbb{E}[YY^T]\,g^T$, $Y = (X, X^2, \dots, X^k)^T$, $g = (g_1, \dots, g_k)$, and $g_j = \frac{\partial}{\partial\theta}\alpha_j^{-1}(\theta)$

12.2 Maximum Likelihood

Likelihood: $\mathcal{L}_n : \Theta \to [0, \infty)$

$\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$

Log-likelihood

$\ell_n(\theta) = \log \mathcal{L}_n(\theta) = \sum_{i=1}^{n}\log f(X_i; \theta)$

Maximum likelihood estimator (mle)

$\mathcal{L}_n(\hat\theta_n) = \sup_\theta \mathcal{L}_n(\theta)$

Score function

$s(X; \theta) = \frac{\partial}{\partial\theta}\log f(X; \theta)$

Fisher information

$I(\theta) = \mathbb{V}_\theta[s(X; \theta)] \qquad I_n(\theta) = n\,I(\theta)$

Fisher information (exponential family)

$I(\theta) = -\mathbb{E}_\theta\left[\frac{\partial}{\partial\theta} s(X; \theta)\right]$

Observed Fisher information

$I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^{n}\log f(X_i; \theta)$

Properties of the mle

- Consistency: $\hat\theta_n \xrightarrow{P} \theta$
- Equivariance: $\hat\theta_n$ is the mle $\implies \varphi(\hat\theta_n)$ is the mle of $\varphi(\theta)$
- Asymptotic normality:
  1. $\text{se} \approx \sqrt{1/I_n(\theta)}$ and $\frac{\hat\theta_n - \theta}{\text{se}} \xrightarrow{D} \mathcal{N}(0, 1)$
  2. $\widehat{\text{se}} \approx \sqrt{1/I_n(\hat\theta_n)}$ and $\frac{\hat\theta_n - \theta}{\widehat{\text{se}}} \xrightarrow{D} \mathcal{N}(0, 1)$
- Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde\theta_n$ is any other estimator, the asymptotic relative efficiency is $\text{are}(\tilde\theta_n, \hat\theta_n) = \frac{\mathbb{V}[\hat\theta_n]}{\mathbb{V}[\tilde\theta_n]} \le 1$
- Approximately the Bayes estimator

12.2.1 Delta Method

If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:

$\frac{\hat\tau_n - \tau}{\widehat{\text{se}}(\hat\tau)} \xrightarrow{D} \mathcal{N}(0, 1)$

where $\hat\tau = \varphi(\hat\theta)$ is the mle of $\tau$ and $\widehat{\text{se}} = |\varphi'(\hat\theta)|\,\widehat{\text{se}}(\hat\theta_n)$.

12.3 Multiparameter Models

Let $\theta = (\theta_1, \dots, \theta_k)$ and $\hat\theta = (\hat\theta_1, \dots, \hat\theta_k)$ be the mle.

$H_{jj} = \frac{\partial^2\ell_n}{\partial\theta_j^2} \qquad H_{jk} = \frac{\partial^2\ell_n}{\partial\theta_j\,\partial\theta_k}$

Fisher information matrix

$I_n(\theta) = -\begin{bmatrix} \mathbb{E}_\theta[H_{11}] & \cdots & \mathbb{E}_\theta[H_{1k}] \\ \vdots & \ddots & \vdots \\ \mathbb{E}_\theta[H_{k1}] & \cdots & \mathbb{E}_\theta[H_{kk}] \end{bmatrix}$

Under appropriate regularity conditions

$(\hat\theta - \theta) \approx \mathcal{N}(0, J_n)$


with $J_n(\theta) = I_n^{-1}(\theta)$. Further, if $\hat\theta_j$ is the $j^{\text{th}}$ component of $\hat\theta$, then

$\frac{\hat\theta_j - \theta_j}{\widehat{\text{se}}_j} \xrightarrow{D} \mathcal{N}(0, 1)$

where $\widehat{\text{se}}_j^2 = J_n(j, j)$ and $\text{Cov}[\hat\theta_j, \hat\theta_k] = J_n(j, k)$.

12.3.1 Multiparameter Delta Method

Let $\tau = \varphi(\theta_1, \dots, \theta_k)$ and let the gradient of $\varphi$ be

$\nabla\varphi = \begin{pmatrix} \frac{\partial\varphi}{\partial\theta_1} \\ \vdots \\ \frac{\partial\varphi}{\partial\theta_k} \end{pmatrix}$

Suppose $\nabla\varphi\big|_{\theta = \hat\theta} \ne 0$ and $\hat\tau = \varphi(\hat\theta)$. Then

$\frac{\hat\tau - \tau}{\widehat{\text{se}}(\hat\tau)} \xrightarrow{D} \mathcal{N}(0, 1)$

where $\widehat{\text{se}}(\hat\tau) = \sqrt{\left(\hat\nabla\varphi\right)^T \hat J_n \left(\hat\nabla\varphi\right)}$, $\hat J_n = J_n(\hat\theta)$, and $\hat\nabla\varphi = \nabla\varphi\big|_{\theta = \hat\theta}$.

12.4 Parametric Bootstrap

Sample from $f(x; \hat\theta_n)$ instead of from $\hat F_n$, where $\hat\theta_n$ could be the mle or method of moments estimator.

    13 Hypothesis Testing

$H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$

Definitions

- Null hypothesis $H_0$
- Alternative hypothesis $H_1$
- Simple hypothesis: $\theta = \theta_0$
- Composite hypothesis: $\theta > \theta_0$ or $\theta < \theta_0$
- Two-sided test: $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$
- One-sided test: $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$
- Critical value $c$
- Test statistic $T$
- Rejection region $R = \{x : T(x) > c\}$
- Power function $\beta(\theta) = \mathbb{P}[X \in R]$
- Power of a test: $1 - \mathbb{P}[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1}\beta(\theta)$
- Test size: $\alpha = \mathbb{P}[\text{Type I error}] = \sup_{\theta \in \Theta_0}\beta(\theta)$

Error types:

- $H_0$ true: retaining $H_0$ is correct; rejecting $H_0$ is a Type I error ($\alpha$)
- $H_1$ true: retaining $H_0$ is a Type II error ($\beta$); rejecting $H_0$ is correct (power)

p-value

- p-value $= \sup_{\theta \in \Theta_0}\mathbb{P}_\theta[T(X) \ge T(x)] = \inf\left\{\alpha : T(x) \in R_\alpha\right\}$
- p-value $= \sup_{\theta \in \Theta_0}\mathbb{P}_\theta[T(X^\star) \ge T(X)] \approx 1 - F(T(X))$ since $T(X^\star) \sim F$

Interpretation:

- p-value < 0.01: very strong evidence against $H_0$
- 0.01-0.05: strong evidence against $H_0$
- 0.05-0.1: weak evidence against $H_0$
- > 0.1: little or no evidence against $H_0$

Wald test

- Two-sided test
- Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat\theta - \theta_0}{\widehat{\text{se}}}$
- $\mathbb{P}[|W| > z_{\alpha/2}] \to \alpha$
- p-value $= \mathbb{P}_{\theta_0}[|W| > |w|] \approx \mathbb{P}[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood ratio test (LRT)

$T(X) = \frac{\sup_{\theta \in \Theta}\mathcal{L}_n(\theta)}{\sup_{\theta \in \Theta_0}\mathcal{L}_n(\theta)} = \frac{\mathcal{L}_n(\hat\theta_n)}{\mathcal{L}_n(\hat\theta_{n,0})}$


$\lambda(X) = 2\log T(X) \xrightarrow{D} \chi^2_{r-q}$ where $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ and $Z_1, \dots, Z_k \stackrel{iid}{\sim} \mathcal{N}(0, 1)$

p-value $= \mathbb{P}_{\theta_0}[\lambda(X) > \lambda(x)] \approx \mathbb{P}\left[\chi^2_{r-q} > \lambda(x)\right]$

Multinomial LRT

mle: $\hat p_n = \left(\frac{X_1}{n}, \dots, \frac{X_k}{n}\right)$

$T(X) = \frac{\mathcal{L}_n(\hat p_n)}{\mathcal{L}_n(p_0)} = \prod_{j=1}^{k}\left(\frac{\hat p_j}{p_{0j}}\right)^{X_j}$

$\lambda(X) = 2\sum_{j=1}^{k} X_j\log\left(\frac{\hat p_j}{p_{0j}}\right) \xrightarrow{D} \chi^2_{k-1}$

The approximate size $\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$.

Pearson Chi-square Test

$T = \sum_{j=1}^{k}\frac{(X_j - \mathbb{E}[X_j])^2}{\mathbb{E}[X_j]}$ where $\mathbb{E}[X_j] = np_{0j}$ under $H_0$

$T \xrightarrow{D} \chi^2_{k-1}$, p-value $= \mathbb{P}\left[\chi^2_{k-1} > T(x)\right]$

Faster convergence $\xrightarrow{D} \chi^2_{k-1}$ than the LRT, hence preferable for small $n$.

Independence testing

- $I$ rows, $J$ columns, $X$ multinomial sample of size $n = I \cdot J$
- mles unconstrained: $\hat p_{ij} = \frac{X_{ij}}{n}$
- mles under $H_0$: $\hat p_{0ij} = \hat p_{i\cdot}\hat p_{\cdot j} = \frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}$
- LRT: $\lambda = 2\sum_{i=1}^{I}\sum_{j=1}^{J} X_{ij}\log\left(\frac{nX_{ij}}{X_{i\cdot}X_{\cdot j}}\right)$
- Pearson: $T = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(X_{ij} - \mathbb{E}[X_{ij}])^2}{\mathbb{E}[X_{ij}]}$
- LRT and Pearson $\xrightarrow{D} \chi^2_\nu$, where $\nu = (I - 1)(J - 1)$

    14 Bayesian Inference

Bayes' Theorem

$f(\theta \mid x^n) = \frac{f(x^n \mid \theta)\,f(\theta)}{f(x^n)} = \frac{f(x^n \mid \theta)\,f(\theta)}{\int f(x^n \mid \theta)\,f(\theta)\,d\theta} \propto \mathcal{L}_n(\theta)\,f(\theta)$

Definitions

- $X^n = (X_1, \dots, X_n)$ and $x^n = (x_1, \dots, x_n)$
- Prior density $f(\theta)$
- Likelihood $f(x^n \mid \theta)$: joint density of the data. In particular, $X^n$ iid $\implies f(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \mathcal{L}_n(\theta)$
- Posterior density $f(\theta \mid x^n)$
- Normalizing constant $c_n = f(x^n) = \int f(x^n \mid \theta)\,f(\theta)\,d\theta$
- Kernel: part of a density that depends on $\theta$
- Posterior mean $\bar\theta_n = \int \theta\,f(\theta \mid x^n)\,d\theta = \frac{\int \theta\,\mathcal{L}_n(\theta)\,f(\theta)\,d\theta}{\int \mathcal{L}_n(\theta)\,f(\theta)\,d\theta}$

14.1 Credible Intervals

Posterior interval

$\mathbb{P}[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\,d\theta = 1 - \alpha$

Equal-tail credible interval

$\int_{-\infty}^{a} f(\theta \mid x^n)\,d\theta = \int_{b}^{\infty} f(\theta \mid x^n)\,d\theta = \alpha/2$

Highest posterior density (HPD) region $R_n$:

1. $\mathbb{P}[\theta \in R_n] = 1 - \alpha$
2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$

$R_n$ is unimodal $\implies R_n$ is an interval.

14.2 Function of Parameters

Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$: $H(\tau \mid x^n) = \mathbb{P}[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\,d\theta$

Posterior density: $h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian delta method: $\tau \mid X^n \approx \mathcal{N}\left(\varphi(\hat\theta), \widehat{\text{se}}\,\left|\varphi'(\hat\theta)\right|\right)$


    14.3 Priors

Choice

- Subjective bayesianism
- Objective bayesianism
- Robust bayesianism

Types

- Flat: $f(\theta) \propto$ constant
- Proper: $\int f(\theta)\,d\theta = 1$
- Improper: $\int f(\theta)\,d\theta = \infty$
- Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$; multiparameter: $f(\theta) \propto \sqrt{\det(I(\theta))}$
- Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood (prior, then posterior hyperparameters):

- $\text{Bern}(p)$: Beta$(\alpha, \beta)$; posterior $\left(\alpha + \sum_{i=1}^{n} x_i,\ \beta + n - \sum_{i=1}^{n} x_i\right)$
- $\text{Bin}(p)$: Beta$(\alpha, \beta)$; posterior $\left(\alpha + \sum_{i=1}^{n} x_i,\ \beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i\right)$
- $\text{NBin}(p)$: Beta$(\alpha, \beta)$; posterior $\left(\alpha + rn,\ \beta + \sum_{i=1}^{n} x_i\right)$
- $\text{Po}(\lambda)$: Gamma$(\alpha, \beta)$; posterior $\left(\alpha + \sum_{i=1}^{n} x_i,\ \beta + n\right)$
- $\text{Multinomial}(p)$: Dir$(\alpha)$; posterior $\alpha + \sum_{i=1}^{n} x^{(i)}$
- $\text{Geo}(p)$: Beta$(\alpha, \beta)$; posterior $\left(\alpha + n,\ \beta + \sum_{i=1}^{n} x_i\right)$

Continuous likelihood (subscript $c$ denotes a known constant):

- $\text{Unif}(0, \theta)$: Pareto$(x_m, k)$; posterior $\left(\max\{x_{(n)}, x_m\},\ k + n\right)$
- $\text{Exp}(\lambda)$: Gamma$(\alpha, \beta)$; posterior $\left(\alpha + n,\ \beta + \sum_{i=1}^{n} x_i\right)$
- $\mathcal{N}(\mu, \sigma_c^2)$: $\mathcal{N}(\mu_0, \sigma_0^2)$; posterior $\left(\left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma_c^2}\right)\Big/\left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2}\right),\ \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2}\right)^{-1}\right)$
- $\mathcal{N}(\mu_c, \sigma^2)$: Scaled Inverse Chi-square$(\nu, \sigma_0^2)$; posterior $\left(\nu + n,\ \frac{\nu\sigma_0^2 + \sum_{i=1}^{n}(x_i - \mu_c)^2}{\nu + n}\right)$
- $\mathcal{N}(\mu, \sigma^2)$: Normal-scaled Inverse Gamma$(\mu_0, \nu, \alpha, \beta)$; posterior $\left(\frac{\nu\mu_0 + n\bar x}{\nu + n},\ \nu + n,\ \alpha + \frac{n}{2},\ \beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar x)^2 + \frac{n\nu(\bar x - \mu_0)^2}{2(\nu + n)}\right)$
- $\text{MVN}(\mu, \Sigma_c)$: $\text{MVN}(\mu_0, \Sigma_0)$; posterior $\left(\left(\Sigma_0^{-1} + n\Sigma_c^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + n\Sigma_c^{-1}\bar x\right),\ \left(\Sigma_0^{-1} + n\Sigma_c^{-1}\right)^{-1}\right)$
- $\text{MVN}(\mu_c, \Sigma)$: Inverse-Wishart$(\nu, \Psi)$; posterior $\left(n + \nu,\ \Psi + \sum_{i=1}^{n}(x_i - \mu_c)(x_i - \mu_c)^T\right)$
- $\text{Pareto}(x_{mc}, k)$: Gamma$(\alpha, \beta)$; posterior $\left(\alpha + n,\ \beta + \sum_{i=1}^{n}\log\frac{x_i}{x_{mc}}\right)$
- $\text{Pareto}(x_m, k_c)$: Pareto$(x_0, k_0)$; posterior $\left(x_0,\ k_0 - kn\right)$ where $k_0 > kn$
- $\text{Gamma}(\alpha_c, \beta)$: Gamma$(\alpha_0, \beta_0)$; posterior $\left(\alpha_0 + n\alpha_c,\ \beta_0 + \sum_{i=1}^{n} x_i\right)$
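For instance, the Bern$(p)$-Beta row amounts to a two-line update; a Python sketch (the function name is ours):

```python
def beta_bernoulli_update(alpha, beta, xs):
    # posterior hyperparameters: alpha + sum x_i, beta + n - sum x_i
    s = sum(xs)
    return alpha + s, beta + len(xs) - s

# flat Beta(1, 1) prior updated with 4 successes out of 6 trials
a, b = beta_bernoulli_update(1, 1, [1, 0, 1, 1, 0, 1])
post_mean = a / (a + b)    # posterior mean alpha' / (alpha' + beta')
```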

    14.4 Bayesian Testing

If $H_0: \theta \in \Theta_0$:

Prior probability $\mathbb{P}[H_0] = \int_{\Theta_0} f(\theta)\,d\theta$

Posterior probability $\mathbb{P}[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n)\,d\theta$

Let $H_0, \dots, H_{K-1}$ be $K$ hypotheses with prior densities $f(\theta \mid H_k)$. Then

$\mathbb{P}[H_k \mid x^n] = \frac{f(x^n \mid H_k)\,\mathbb{P}[H_k]}{\sum_{k=1}^{K} f(x^n \mid H_k)\,\mathbb{P}[H_k]}$


Marginal likelihood

$f(x^n \mid H_i) = \int_\Theta f(x^n \mid \theta, H_i)\,f(\theta \mid H_i)\,d\theta$

Posterior odds (of $H_i$ relative to $H_j$)

$\frac{\mathbb{P}[H_i \mid x^n]}{\mathbb{P}[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes factor } BF_{ij}} \times \underbrace{\frac{\mathbb{P}[H_i]}{\mathbb{P}[H_j]}}_{\text{prior odds}}$

Bayes factor evidence scale:

- $\log_{10} BF_{10} \in (0, 0.5)$, $BF_{10} \in (1, 3.2)$: weak
- $\log_{10} BF_{10} \in (0.5, 1)$, $BF_{10} \in (3.2, 10)$: moderate
- $\log_{10} BF_{10} \in (1, 2)$, $BF_{10} \in (10, 100)$: strong
- $\log_{10} BF_{10} > 2$, $BF_{10} > 100$: decisive

$p^* = \frac{\frac{p}{1 - p}BF_{10}}{1 + \frac{p}{1 - p}BF_{10}}$ where $p = \mathbb{P}[H_1]$ and $p^* = \mathbb{P}[H_1 \mid x^n]$

15 Exponential Family

Scalar parameter

$f_X(x \mid \theta) = h(x)\exp\{\eta(\theta)T(x) - A(\theta)\} = h(x)g(\theta)\exp\{\eta(\theta)T(x)\}$

Vector parameter

$f_X(x \mid \theta) = h(x)\exp\left\{\sum_{i=1}^{s}\eta_i(\theta)T_i(x) - A(\theta)\right\} = h(x)\exp\{\eta(\theta)\cdot T(x) - A(\theta)\} = h(x)g(\theta)\exp\{\eta(\theta)\cdot T(x)\}$

Natural form

$f_X(x \mid \eta) = h(x)\exp\{\eta\cdot T(x) - A(\eta)\} = h(x)g(\eta)\exp\{\eta\cdot T(x)\} = h(x)g(\eta)\exp\left\{\eta^T T(x)\right\}$

16 Sampling Methods

16.1 The Bootstrap

Let $T_n = g(X_1, \dots, X_n)$ be a statistic.

1. Estimate $\mathbb{V}_F[T_n]$ with $\mathbb{V}_{\hat F_n}[T_n]$.
2. Approximate $\mathbb{V}_{\hat F_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^*_{n,1}, \dots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat F_n$:
       i. Sample uniformly $X^*_1, \dots, X^*_n \sim \hat F_n$.
       ii. Compute $T^*_n = g(X^*_1, \dots, X^*_n)$.
   (b) Then
       $v_{boot} = \hat{\mathbb{V}}_{\hat F_n} = \frac{1}{B}\sum_{b=1}^{B}\left(T^*_{n,b} - \frac{1}{B}\sum_{r=1}^{B}T^*_{n,r}\right)^2$
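The two steps above can be sketched in a few lines of Python (names, $B$, and the toy data are ours):

```python
import random

def bootstrap_var(data, stat, B=2000, seed=0):
    rng = random.Random(seed)
    n = len(data)
    # T*_{n,b}: statistic on a resample drawn uniformly from Fhat_n
    ts = [stat([rng.choice(data) for _ in range(n)]) for _ in range(B)]
    tbar = sum(ts) / B
    # v_boot = (1/B) sum (T*_{n,b} - mean of T*)^2
    return sum((t - tbar) ** 2 for t in ts) / B

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
v = bootstrap_var(data, lambda xs: sum(xs) / len(xs))
# for the sample mean, v should be close to the plug-in value sigma_hat^2 / n
```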

    16.1.1 Bootstrap Confidence Intervals

Normal-based interval

$T_n \pm z_{\alpha/2}\,\widehat{\text{se}}_{boot}$

Pivotal interval

1. Location parameter $\theta = T(F)$
2. Pivot $R_n = \hat\theta_n - \theta$
3. Let $H(r) = \mathbb{P}[R_n \le r]$ be the cdf of $R_n$
4. Let $R^*_{n,b} = \hat\theta^*_{n,b} - \hat\theta_n$. Approximate $H$ using the bootstrap:
   $\hat H(r) = \frac{1}{B}\sum_{b=1}^{B} I\left(R^*_{n,b} \le r\right)$
5. $\theta^*_\beta$ = $\beta$ sample quantile of $(\hat\theta^*_{n,1}, \dots, \hat\theta^*_{n,B})$
6. $r^*_\beta$ = $\beta$ sample quantile of $(R^*_{n,1}, \dots, R^*_{n,B})$, i.e., $r^*_\beta = \theta^*_\beta - \hat\theta_n$
7. Approximate $1 - \alpha$ confidence interval $C_n = (\hat a, \hat b)$ where
   $\hat a = \hat\theta_n - \hat H^{-1}\left(1 - \frac{\alpha}{2}\right) = \hat\theta_n - r^*_{1-\alpha/2} = 2\hat\theta_n - \theta^*_{1-\alpha/2}$
   $\hat b = \hat\theta_n - \hat H^{-1}\left(\frac{\alpha}{2}\right) = \hat\theta_n - r^*_{\alpha/2} = 2\hat\theta_n - \theta^*_{\alpha/2}$

Percentile interval

$C_n = \left(\theta^*_{\alpha/2},\ \theta^*_{1-\alpha/2}\right)$


    16.2 Rejection Sampling

Setup

- We can easily sample from $g(\theta)$
- We want to sample from $h(\theta)$, but it is difficult
- We know $h(\theta)$ up to a proportional constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\,d\theta}$
- Envelope condition: we can find $M > 0$ such that $k(\theta) \le Mg(\theta)$ for all $\theta$

Algorithm

1. Draw $\theta^{cand} \sim g(\theta)$
2. Generate $u \sim \text{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{Mg(\theta^{cand})}$
4. Repeat until $B$ values of $\theta^{cand}$ have been accepted
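A Python sketch of the algorithm for the unnormalized target $k(\theta) = 2\theta$ on $(0, 1)$ with a uniform envelope and $M = 2$ (names and the example target are ours):

```python
import random

def rejection_sample(k, g_sample, g_pdf, M, B, seed=0):
    rng = random.Random(seed)
    out = []
    while len(out) < B:
        cand = g_sample(rng)          # 1. draw theta_cand ~ g
        u = rng.random()              # 2. u ~ Unif(0, 1)
        if u <= k(cand) / (M * g_pdf(cand)):
            out.append(cand)          # 3. accept
    return out

# target h(x) = 2x on (0, 1); envelope g = Unif(0, 1), M = 2
xs = rejection_sample(lambda x: 2 * x, lambda r: r.random(),
                      lambda x: 1.0, 2.0, 4000)
m = sum(xs) / len(xs)   # E[X] = 2/3 for the density f(x) = 2x
```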

Example

- We can easily sample from the prior $g(\theta) = f(\theta)$
- Target is the posterior $h(\theta) \propto k(\theta) = f(x^n \mid \theta)\,f(\theta)$
- Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat\theta_n) = \mathcal{L}_n(\hat\theta_n) \equiv M$
- Algorithm:
  1. Draw $\theta^{cand} \sim f(\theta)$
  2. Generate $u \sim \text{Unif}(0, 1)$
  3. Accept $\theta^{cand}$ if $u \le \frac{\mathcal{L}_n(\theta^{cand})}{\mathcal{L}_n(\hat\theta_n)}$

16.3 Importance Sampling

Sample from an importance function $g$ rather than the target density $h$. Algorithm to obtain an approximation to $\mathbb{E}[q(\theta) \mid x^n]$:

1. Sample from the prior $\theta_1, \dots, \theta_B \stackrel{iid}{\sim} f(\theta)$
2. $w_i = \frac{\mathcal{L}_n(\theta_i)}{\sum_{i=1}^{B}\mathcal{L}_n(\theta_i)}$ for $i = 1, \dots, B$
3. $\mathbb{E}[q(\theta) \mid x^n] \approx \sum_{i=1}^{B} q(\theta_i)\,w_i$

17 Decision Theory

Definitions

- Unknown quantity affecting our decision: $\theta \in \Theta$
- Decision rule: synonymous with an estimator $\hat\theta$
- Action $a \in A$: possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat\theta(x)$.
- Loss function $L$: consequences of taking action $a$ when the true state is $\theta$, or the discrepancy between $\theta$ and $\hat\theta$; $L: \Theta \times A \to [-k, \infty)$.

Loss functions

- Squared error loss: $L(\theta, a) = (\theta - a)^2$
- Linear loss: $L(\theta, a) = \begin{cases} K_1(\theta - a) & a - \theta < 0 \\ K_2(a - \theta) & a - \theta \ge 0 \end{cases}$
- Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2 = 1$)
- Zero-one loss: $L(\theta, a) = \begin{cases} 0 & a = \theta \\ 1 & a \ne \theta \end{cases}$


    17.3 Bayes Rule

Bayes rule (or Bayes estimator)

- A Bayes rule minimizes the Bayes risk: $r(f, \hat\theta) = \inf_{\tilde\theta} r(f, \tilde\theta)$
- It minimizes the posterior risk for every $x$: $\hat\theta(x) = \inf_a r(a \mid x)\ \forall x \implies r(f, \hat\theta) = \int r(\hat\theta \mid x)\,f(x)\,dx$

Theorems

- Squared error loss: posterior mean
- Absolute error loss: posterior median
- Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk

$\bar R(\hat\theta) = \sup_\theta R(\theta, \hat\theta) \qquad \bar R(a) = \sup_\theta R(\theta, a)$

Minimax rule

$\sup_\theta R(\theta, \hat\theta) = \inf_{\tilde\theta}\bar R(\tilde\theta) = \inf_{\tilde\theta}\sup_\theta R(\theta, \tilde\theta)$

$\hat\theta$ = Bayes rule $\wedge\ \exists c: R(\theta, \hat\theta) = c \implies \hat\theta$ is minimax

Least favorable prior

$\hat\theta^f$ = Bayes rule $\wedge\ R(\theta, \hat\theta^f) \le r(f, \hat\theta^f)\ \forall\theta \implies \hat\theta^f$ is minimax

18 Linear Regression

Definitions

- Response variable $Y$
- Covariate $X$ (aka predictor variable or feature)

    18.1 Simple Linear Regression

Model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad \mathbb{E}[\varepsilon_i \mid X_i] = 0,\ \mathbb{V}[\varepsilon_i \mid X_i] = \sigma^2$

Fitted line: $\hat r(x) = \hat\beta_0 + \hat\beta_1 x$

Predicted (fitted) values: $\hat Y_i = \hat r(X_i)$

Residuals: $\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - \left(\hat\beta_0 + \hat\beta_1 X_i\right)$

Residual sums of squares (rss): $\text{rss}(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^{n}\hat\varepsilon_i^2$

Least square estimates: $\hat\beta^T = (\hat\beta_0, \hat\beta_1)^T$ minimizing $\text{rss}$:

$\hat\beta_0 = \overline{Y}_n - \hat\beta_1\overline{X}_n$

$\hat\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \overline{X}_n)(Y_i - \overline{Y}_n)}{\sum_{i=1}^{n}(X_i - \overline{X}_n)^2} = \frac{\sum_{i=1}^{n} X_iY_i - n\overline{X}\,\overline{Y}}{\sum_{i=1}^{n} X_i^2 - n\overline{X}^2}$

$\mathbb{E}[\hat\beta \mid X^n] = \begin{pmatrix}\beta_0\\\beta_1\end{pmatrix} \qquad \mathbb{V}[\hat\beta \mid X^n] = \frac{\sigma^2}{n s_X^2}\begin{pmatrix}\frac{1}{n}\sum_{i=1}^{n}X_i^2 & -\overline{X}_n \\ -\overline{X}_n & 1\end{pmatrix}$

$\widehat{\text{se}}(\hat\beta_0) = \frac{\hat\sigma}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^{n}X_i^2}{n}} \qquad \widehat{\text{se}}(\hat\beta_1) = \frac{\hat\sigma}{s_X\sqrt{n}}$

where $s_X^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2$ and $\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat\varepsilon_i^2$ (unbiased estimate).

Further properties:

- Consistency: $\hat\beta_0 \xrightarrow{P} \beta_0$ and $\hat\beta_1 \xrightarrow{P} \beta_1$
- Asymptotic normality: $\frac{\hat\beta_0 - \beta_0}{\widehat{\text{se}}(\hat\beta_0)} \xrightarrow{D} \mathcal{N}(0, 1)$ and $\frac{\hat\beta_1 - \beta_1}{\widehat{\text{se}}(\hat\beta_1)} \xrightarrow{D} \mathcal{N}(0, 1)$
- Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$: $\hat\beta_0 \pm z_{\alpha/2}\,\widehat{\text{se}}(\hat\beta_0)$ and $\hat\beta_1 \pm z_{\alpha/2}\,\widehat{\text{se}}(\hat\beta_1)$
- Wald test for $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat\beta_1/\widehat{\text{se}}(\hat\beta_1)$

$R^2$

$R^2 = \frac{\sum_{i=1}^{n}(\hat Y_i - \overline{Y})^2}{\sum_{i=1}^{n}(Y_i - \overline{Y})^2} = 1 - \frac{\sum_{i=1}^{n}\hat\varepsilon_i^2}{\sum_{i=1}^{n}(Y_i - \overline{Y})^2} = 1 - \frac{\text{rss}}{\text{tss}}$
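The closed-form least squares estimates and $R^2$ can be computed directly; a Python sketch on toy data (data values are ours):

```python
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
n = len(data)
xbar = sum(x for x, _ in data) / n
ybar = sum(y for _, y in data) / n

# beta1_hat = sum (X_i - Xbar)(Y_i - Ybar) / sum (X_i - Xbar)^2
b1 = sum((x - xbar) * (y - ybar) for x, y in data) / \
     sum((x - xbar) ** 2 for x, _ in data)
b0 = ybar - b1 * xbar          # beta0_hat = Ybar - beta1_hat * Xbar

resid = [y - (b0 + b1 * x) for x, y in data]
rss = sum(e * e for e in resid)
tss = sum((y - ybar) ** 2 for _, y in data)
r2 = 1 - rss / tss             # R^2 = 1 - rss / tss
```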


Likelihood

$\mathcal{L} = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i) \cdot \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) = \mathcal{L}_1 \cdot \mathcal{L}_2$

$\mathcal{L}_1 = \prod_{i=1}^{n} f_X(X_i)$

$\mathcal{L}_2 = \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_i\left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2\right\}$

Under the assumption of normality, the least squares estimator is also the mle, with

$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\hat\varepsilon_i^2$

18.2 Prediction

Observe $X = x_*$ of the covariate and predict the outcome $Y_*$.

$\hat Y_* = \hat\beta_0 + \hat\beta_1 x_*$

$\mathbb{V}[\hat Y_*] = \mathbb{V}[\hat\beta_0] + x_*^2\,\mathbb{V}[\hat\beta_1] + 2x_*\,\text{Cov}[\hat\beta_0, \hat\beta_1]$

Prediction interval

$\hat\xi_n^2 = \hat\sigma^2\left(\frac{\sum_{i=1}^{n}(X_i - x_*)^2}{n\sum_i(X_i - \overline{X})^2} + 1\right)$

$\hat Y_* \pm z_{\alpha/2}\,\hat\xi_n$

    18.3 Multiple Regression

$Y = X\beta + \varepsilon$

where

$X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix} \qquad \beta = \begin{pmatrix}\beta_1\\\vdots\\\beta_k\end{pmatrix} \qquad \varepsilon = \begin{pmatrix}\varepsilon_1\\\vdots\\\varepsilon_n\end{pmatrix}$

Likelihood

$\mathcal{L}(\beta, \sigma^2) = \left(2\pi\sigma^2\right)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\text{rss}\right\}$

$\text{rss} = (y - X\beta)^T(y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{N}\left(Y_i - x_i^T\beta\right)^2$

If the $(k \times k)$ matrix $X^TX$ is invertible:

$\hat\beta = (X^TX)^{-1}X^TY$
$\mathbb{V}[\hat\beta \mid X^n] = \sigma^2(X^TX)^{-1}$
$\hat\beta \approx \mathcal{N}\left(\beta, \sigma^2(X^TX)^{-1}\right)$

Estimate regression function: $\hat r(x) = \sum_{j=1}^{k}\hat\beta_j x_j$

Unbiased estimate for $\sigma^2$:

$\hat\sigma^2 = \frac{1}{n - k}\sum_{i=1}^{n}\hat\varepsilon_i^2 \qquad \hat\varepsilon = X\hat\beta - Y$

mle: $\hat\sigma^2_{mle} = \frac{n - k}{n}\hat\sigma^2$

$1 - \alpha$ confidence interval: $\hat\beta_j \pm z_{\alpha/2}\,\widehat{\text{se}}(\hat\beta_j)$

    18.4 Model Selection

Consider predicting a new observation $Y^*$ for covariates $X^*$ and let $S \subseteq J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues

- Underfitting: too few covariates yields high bias
- Overfitting: too many covariates yields high variance

Procedure

1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing: $H_0: \beta_j = 0$ vs. $H_1: \beta_j \ne 0\ \forall j \in J$

Mean squared prediction error (mspe): $\text{mspe} = \mathbb{E}\left[(\hat Y(S) - Y^*)^2\right]$

Prediction risk: $R(S) = \sum_{i=1}^{n}\text{mspe}_i = \sum_{i=1}^{n}\mathbb{E}\left[(\hat Y_i(S) - Y_i^*)^2\right]$

Training error: $\hat R_{tr}(S) = \sum_{i=1}^{n}\left(\hat Y_i(S) - Y_i\right)^2$


R^2

R^2(S) = 1 - \frac{\mathrm{rss}(S)}{\mathrm{tss}} = 1 - \frac{\hat R_{tr}(S)}{\mathrm{tss}}
       = 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i(S))^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}

The training error is a downward-biased estimate of the prediction risk:

E[\hat R_{tr}(S)] < R(S)
\qquad
\mathrm{bias}(\hat R_{tr}(S)) = E[\hat R_{tr}(S)] - R(S) = -2\sum_{i=1}^n Cov[\hat Y_i, Y_i]

Adjusted R^2:

\bar R^2(S) = 1 - \frac{n-1}{n-k}\,\frac{\mathrm{rss}}{\mathrm{tss}}

Mallows' C_p statistic:

\hat R(S) = \hat R_{tr}(S) + 2k\hat\sigma^2 = \text{lack of fit} + \text{complexity penalty}

Akaike information criterion (AIC):

AIC(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - k

Bayesian information criterion (BIC):

BIC(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - \frac{k}{2}\log n

Validation and training:

\hat R_V(S) = \sum_{i=1}^m (\hat Y_i^*(S) - Y_i^*)^2
\qquad m = |\{\text{validation data}\}|, \text{ often } \frac{n}{4} \text{ or } \frac{n}{2}

Leave-one-out cross-validation:

\hat R_{CV}(S) = \sum_{i=1}^n (Y_i - \hat Y_{(i)})^2
             = \sum_{i=1}^n \left(\frac{Y_i - \hat Y_i(S)}{1 - U_{ii}(S)}\right)^2

U(S) = X_S(X_S^TX_S)^{-1}X_S^T \quad \text{(hat matrix)}
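The leverage shortcut for leave-one-out cross-validation is exact: each held-out residual equals the full-fit residual divided by 1 - U_ii. A minimal sketch for simple linear regression (where U_ii has the closed form 1/n + (x_i - \bar x)^2 / S_{xx}), compared against brute-force refitting; the data here are illustrative:

```python
import random

def fit(xs, ys):
    # least squares for y = b0 + b1 * x
    n = len(xs)
    xb = sum(xs) / n
    yb = sum(ys) / n
    b1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) / \
         sum((x - xb) ** 2 for x in xs)
    return yb - b1 * xb, b1

random.seed(3)
n = 30
X = [random.uniform(0, 5) for _ in range(n)]
Y = [1 + 0.8 * x + random.gauss(0, 1) for x in X]

b0, b1 = fit(X, Y)
xbar = sum(X) / n
Sxx = sum((x - xbar) ** 2 for x in X)

# Shortcut: divide each residual by (1 - U_ii)
cv_short = sum(((y - (b0 + b1 * x)) / (1 - (1 / n + (x - xbar) ** 2 / Sxx))) ** 2
               for x, y in zip(X, Y))

# Brute force: refit n times, each time leaving one observation out
cv_brute = 0.0
for i in range(n):
    a0, a1 = fit(X[:i] + X[i + 1:], Y[:i] + Y[i + 1:])
    cv_brute += (Y[i] - (a0 + a1 * X[i])) ** 2
```

The two computations agree to floating-point precision, turning n refits into one.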

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where P[X \in A] = \int_A f(x)\,dx.

Integrated square error (ise):

L(f, \hat f_n) = \int \left(f(x) - \hat f_n(x)\right)^2 dx = J(h) + \int f^2(x)\,dx

Frequentist risk:

R(f, \hat f_n) = E[L(f, \hat f_n)] = \int b^2(x)\,dx + \int v(x)\,dx

b(x) = E[\hat f_n(x)] - f(x)
\qquad
v(x) = V[\hat f_n(x)]

19.1.1 Histograms

Definitions:

- Number of bins m
- Binwidth h = 1/m
- Bin B_j has \nu_j observations
- Define \hat p_j = \nu_j/n and p_j = \int_{B_j} f(u)\,du

Histogram estimator:

\hat f_n(x) = \sum_{j=1}^m \frac{\hat p_j}{h}\, I(x \in B_j)

E[\hat f_n(x)] = \frac{p_j}{h}
\qquad
V[\hat f_n(x)] = \frac{p_j(1-p_j)}{nh^2}

R(\hat f_n, f) \approx \frac{h^2}{12}\int (f'(u))^2\,du + \frac{1}{nh}

h^* = \frac{1}{n^{1/3}}\left(\frac{6}{\int (f'(u))^2\,du}\right)^{1/3}

R^*(\hat f_n, f) \approx \frac{C}{n^{2/3}}
\qquad
C = \left(\frac{3}{4}\right)^{2/3}\left(\int (f'(u))^2\,du\right)^{1/3}

Cross-validation estimate of E[J(h)]:

\hat J_{CV}(h) = \int \hat f_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^n \hat f_{(-i)}(X_i)
             = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h}\sum_{j=1}^m \hat p_j^2
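The histogram estimator above is a few lines of code. A minimal sketch on [0, 1] with simulated Beta data (sample size, bin count, and the Beta(2, 5) source are illustrative):

```python
import random

random.seed(4)
n, m = 1000, 20
h = 1.0 / m                         # binwidth on [0, 1]
data = [random.betavariate(2, 5) for _ in range(n)]

counts = [0] * m
for x in data:
    j = min(int(x / h), m - 1)      # bin index; clamp x == 1 into the last bin
    counts[j] += 1
p_hat = [c / n for c in counts]     # p_hat_j = nu_j / n

def f_hat(x):
    j = min(int(x / h), m - 1)
    return p_hat[j] / h             # histogram estimator p_hat_j / h

# sanity: the estimator integrates to one, since sum_j (p_hat_j / h) * h = 1
total = sum(p_hat)
```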


19.1.2 Kernel Density Estimator (KDE)

Kernel K:

- K(x) \ge 0
- \int K(x)\,dx = 1
- \int xK(x)\,dx = 0
- \int x^2 K(x)\,dx \equiv \sigma_K^2 > 0

KDE:

\hat f_n(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h} K\left(\frac{x - X_i}{h}\right)

R(f, \hat f_n) \approx \frac{1}{4}(h\sigma_K)^4 \int (f''(x))^2\,dx + \frac{1}{nh}\int K^2(x)\,dx

h^* = \frac{c_1^{-2/5} c_2^{1/5} c_3^{-1/5}}{n^{1/5}}
\qquad
c_1 = \sigma_K^2, \quad c_2 = \int K^2(x)\,dx, \quad c_3 = \int (f''(x))^2\,dx

R^*(f, \hat f_n) = \frac{c_4}{n^{4/5}}
\qquad
c_4 = \frac{5}{4}\left(\sigma_K^2\right)^{2/5}\left(\int K^2(x)\,dx\right)^{4/5}\left(\int (f'')^2\,dx\right)^{1/5}
    = \frac{5}{4}\, C(K) \left(\int (f'')^2\,dx\right)^{1/5}

Epanechnikov kernel:

K(x) = \begin{cases}
  \frac{3}{4\sqrt 5}\left(1 - x^2/5\right) & |x| < \sqrt 5 \\
  0 & \text{otherwise}
\end{cases}
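A minimal KDE sketch with a Gaussian kernel. The bandwidth rule below (1.06 \hat\sigma n^{-1/5}, the normal-reference rule) is a standard plug-in choice consistent with the h^* \propto n^{-1/5} rate above, but the constant 1.06 is an assumption of this sketch, not a formula from the cookbook:

```python
import math
import random

random.seed(5)
n = 500
data = [random.gauss(0, 1) for _ in range(n)]

mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
h = 1.06 * sd * n ** (-1 / 5)       # normal-reference bandwidth (assumed rule)

def K(u):
    # Gaussian kernel: nonnegative, integrates to 1, zero mean, sigma_K^2 = 1
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def f_hat(x):
    return sum(K((x - xi) / h) for xi in data) / (n * h)
```

With standard normal data, the estimate should be much larger near 0 than in the tails.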

20.1 Markov Chains

Transition probabilities p_{ij} = P[X_{n+1} = j \mid X_n = i], with \sum_j p_{ij} = 1.

Chapman-Kolmogorov:

p_{ij}(m+n) = \sum_k p_{ik}(m)\,p_{kj}(n)

P_{m+n} = P_m P_n
\qquad
P_n = P \cdots P = P^n

Marginal probability:

\mu_n = (\mu_n(1), \ldots, \mu_n(N)) \quad \text{where } \mu_n(i) = P[X_n = i]

\mu_0 = \text{initial distribution}
\qquad
\mu_n = \mu_0 P^n
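The Chapman-Kolmogorov identity P_{m+n} = P_m P_n and the marginal update \mu_n = \mu_0 P^n can be verified with plain matrix products; a minimal sketch on an illustrative two-state chain:

```python
def matmul(A, B):
    # plain row-by-column matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P = [[0.9, 0.1],
     [0.5, 0.5]]                    # transition matrix; rows sum to one
mu0 = [[1.0, 0.0]]                  # start deterministically in state 0

P2 = matmul(P, P)
P4a = matmul(P2, P2)                # P^4 = P^2 P^2 (Chapman-Kolmogorov)
P4b = matmul(matmul(P2, P), P)      # P^4 = P^3 P, must agree
mu4 = matmul(mu0, P4a)              # marginal distribution after 4 steps
```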

20.2 Poisson Processes

Poisson process:

- \{X_t : t \in [0, \infty)\} = number of events up to and including time t
- X_0 = 0
- Independent increments:

  \forall t_0 < \cdots < t_n : \quad X_{t_1} - X_{t_0} \perp \cdots \perp X_{t_n} - X_{t_{n-1}}

- Intensity function \lambda(t):

  P[X_{t+h} - X_t = 1] = \lambda(t)h + o(h)
  \qquad
  P[X_{t+h} - X_t = 2] = o(h)

- X_{s+t} - X_s \sim Po\left(m(s+t) - m(s)\right) where m(t) = \int_0^t \lambda(s)\,ds

Homogeneous Poisson process:

\lambda(t) \equiv \lambda \implies X_t \sim Po(\lambda t) \qquad \lambda > 0

Waiting times:

W_t := \text{time at which } X_t \text{ occurs}
\qquad
W_t \sim Gamma\left(t, \frac{1}{\lambda}\right)

Interarrival times:

S_t = W_{t+1} - W_t
\qquad
S_t \sim Exp\left(\frac{1}{\lambda}\right)
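Because interarrival times are exponential, a homogeneous Poisson process can be simulated by accumulating Exp(1/\lambda) gaps; the count in [0, T] should then average \lambda T. A minimal sketch (rate, horizon, and replication count are illustrative):

```python
import random

random.seed(6)
lam, T, reps = 2.0, 10.0, 2000

counts = []
for _ in range(reps):
    t, k = 0.0, 0
    while True:
        t += random.expovariate(lam)   # interarrival time, mean 1/lam
        if t > T:
            break
        k += 1                         # one more event inside [0, T]
    counts.append(k)

mean_count = sum(counts) / reps        # should be close to lam * T = 20
```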


21 Time Series

Mean function:

\mu_{xt} = E[x_t] = \int_{-\infty}^{\infty} x f_t(x)\,dx

Autocovariance function:

\gamma_x(s,t) = E[(x_s - \mu_s)(x_t - \mu_t)] = E[x_s x_t] - \mu_s\mu_t
\qquad
\gamma_x(t,t) = E[(x_t - \mu_t)^2] = V[x_t]

Autocorrelation function (ACF):

\rho(s,t) = \frac{Cov[x_s, x_t]}{\sqrt{V[x_s]V[x_t]}} = \frac{\gamma(s,t)}{\sqrt{\gamma(s,s)\gamma(t,t)}}

Cross-covariance function (CCV):

\gamma_{xy}(s,t) = E[(x_s - \mu_{xs})(y_t - \mu_{yt})]

Cross-correlation function (CCF):

\rho_{xy}(s,t) = \frac{\gamma_{xy}(s,t)}{\sqrt{\gamma_x(s,s)\gamma_y(t,t)}}

Backshift operator: B^k(x_t) = x_{t-k}

Difference operator: \nabla^d = (1-B)^d

White noise:

- w_t \sim wn(0, \sigma_w^2)
- Gaussian: w_t \overset{iid}{\sim} N(0, \sigma_w^2)
- E[w_t] = 0 \quad \forall t \in T
- V[w_t] = \sigma_w^2 \quad \forall t \in T
- \gamma_w(s,t) = 0 \quad s \ne t \quad s,t \in T

Random walk:

- Drift \delta
- x_t = \delta t + \sum_{j=1}^t w_j
- E[x_t] = \delta t

Symmetric moving average:

m_t = \sum_{j=-k}^{k} a_j x_{t-j}
\quad \text{where } a_j = a_{-j} \ge 0 \text{ and } \sum_{j=-k}^{k} a_j = 1
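A small check of the symmetric moving average: with uniform weights a_j = 1/(2k+1) and a purely linear series, the filter reproduces the input exactly at interior points, because the symmetric weights sum to one and \sum_j j\,a_j = 0. The trend coefficients below are illustrative:

```python
k = 5
a = [1 / (2 * k + 1)] * (2 * k + 1)       # a_j = a_{-j} >= 0, sum to 1
x = [2.0 + 0.3 * t for t in range(50)]    # pure linear trend, no noise

# m_t = sum_{j=-k}^{k} a_j x_{t-j}, defined for interior t only
m = [sum(aj * x[t + j] for j, aj in zip(range(-k, k + 1), a))
     for t in range(k, 50 - k)]
# a linear trend passes through this filter undistorted: m_t == x_t
```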

21.1 Stationary Time Series

Strictly stationary:

P[x_{t_1} \le c_1, \ldots, x_{t_k} \le c_k] = P[x_{t_1+h} \le c_1, \ldots, x_{t_k+h} \le c_k]
\qquad \forall k \in \mathbb{N},\ t_k, c_k, h \in \mathbb{Z}

Weakly stationary:

- E[x_t^2] < \infty \quad \forall t \in \mathbb{Z}
- E[x_t] = m \quad \forall t \in \mathbb{Z}
- \gamma_x(s,t) = \gamma_x(s+r, t+r) \quad \forall r,s,t \in \mathbb{Z}

Autocovariance function:

\gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)] \quad h \in \mathbb{Z}
\qquad
\gamma(0) = E[(x_t - \mu)^2]

- \gamma(0) \ge 0
- \gamma(0) \ge |\gamma(h)|
- \gamma(h) = \gamma(-h)

Autocorrelation function (ACF):

\rho_x(h) = \frac{Cov[x_{t+h}, x_t]}{\sqrt{V[x_{t+h}]V[x_t]}}
          = \frac{\gamma(t+h, t)}{\sqrt{\gamma(t+h, t+h)\gamma(t, t)}}
          = \frac{\gamma(h)}{\gamma(0)}

Jointly stationary time series:

\gamma_{xy}(h) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]
\qquad
\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\gamma_y(0)}}

Linear process:

x_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j w_{t-j}
\quad \text{where } \sum_{j=-\infty}^{\infty} |\psi_j| < \infty

\gamma(h) = \sigma_w^2 \sum_{j=-\infty}^{\infty} \psi_{j+h}\psi_j


21.2 Estimation of Correlation

Sample mean:

\bar x = \frac{1}{n}\sum_{t=1}^{n} x_t

Sample variance:

V[\bar x] = \frac{1}{n}\sum_{h=-n}^{n}\left(1 - \frac{|h|}{n}\right)\gamma_x(h)

Sample autocovariance function:

\hat\gamma(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar x)(x_t - \bar x)

Sample autocorrelation function:

\hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}

Sample cross-covariance function:

\hat\gamma_{xy}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar x)(y_t - \bar y)

Sample cross-correlation function:

\hat\rho_{xy}(h) = \frac{\hat\gamma_{xy}(h)}{\sqrt{\hat\gamma_x(0)\hat\gamma_y(0)}}

Properties:

- \sigma_{\hat\rho_x(h)} = \frac{1}{\sqrt n} if x_t is white noise
- \sigma_{\hat\rho_{xy}(h)} = \frac{1}{\sqrt n} if x_t or y_t is white noise
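The sample autocovariance and ACF above translate directly to code; a minimal sketch on simulated white noise, where \hat\rho(h) for h \ne 0 should be near zero with standard error about 1/\sqrt n:

```python
import random

random.seed(8)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]   # white noise
xbar = sum(x) / n

def gamma_hat(h):
    # sample autocovariance: (1/n) * sum_{t=1}^{n-h} (x_{t+h}-xbar)(x_t-xbar)
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n

def rho_hat(h):
    return gamma_hat(h) / gamma_hat(0)

r1, r5 = rho_hat(1), rho_hat(5)              # both should be small
```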

21.3 Non-Stationary Time Series

Classical decomposition model:

x_t = \mu_t + s_t + w_t

- \mu_t = trend
- s_t = seasonal component
- w_t = random noise term

21.3.1 Detrending

Least squares:

1. Choose trend model, e.g., \mu_t = \beta_0 + \beta_1 t + \beta_2 t^2
2. Minimize rss to obtain trend estimate \hat\mu_t = \hat\beta_0 + \hat\beta_1 t + \hat\beta_2 t^2
3. Residuals \approx noise \hat w_t

Moving average:

The low-pass filter v_t is a symmetric moving average m_t with a_j = \frac{1}{2k+1}:

v_t = \frac{1}{2k+1}\sum_{i=-k}^{k} x_{t-i}

If \frac{1}{2k+1}\sum_{j=-k}^{k} w_{t-j} \approx 0, a linear trend function \mu_t = \beta_0 + \beta_1 t passes without distortion.

Differencing:

\mu_t = \beta_0 + \beta_1 t \implies \nabla x_t = \beta_1 + \nabla w_t

21.4 ARIMA models

Autoregressive polynomial:

\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \qquad z \in \mathbb{C},\ \phi_p \ne 0

Autoregressive operator:

\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p

Autoregressive model of order p, AR(p):

x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t \iff \phi(B)x_t = w_t

AR(1):

x_t = \phi^k x_{t-k} + \sum_{j=0}^{k-1}\phi^j w_{t-j}
\;\xrightarrow{k \to \infty,\ |\phi| < 1}\;
\sum_{j=0}^{\infty}\phi^j w_{t-j}


Moving average polynomial:

\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q \qquad z \in \mathbb{C},\ \theta_q \ne 0

Moving average operator:

\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q

MA(q) (moving average model of order q):

x_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff x_t = \theta(B)w_t

E[x_t] = \sum_{j=0}^{q}\theta_j E[w_{t-j}] = 0

\gamma(h) = Cov[x_{t+h}, x_t] =
\begin{cases}
  \sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h} & 0 \le h \le q \\
  0 & h > q
\end{cases}

MA(1): x_t = w_t + \theta w_{t-1}

\gamma(h) = \begin{cases}
  (1+\theta^2)\sigma_w^2 & h = 0 \\
  \theta\sigma_w^2 & h = 1 \\
  0 & h > 1
\end{cases}
\qquad
\rho(h) = \begin{cases}
  \frac{\theta}{1+\theta^2} & h = 1 \\
  0 & h > 1
\end{cases}

ARMA(p, q):

x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}
\iff \phi(B)x_t = \theta(B)w_t

Partial autocorrelation function (PACF):

- x_i^{h-1} = regression of x_i on \{x_{h-1}, x_{h-2}, \ldots, x_1\}
- \phi_{hh} = corr(x_h - x_h^{h-1},\, x_0 - x_0^{h-1}) \quad h \ge 2
- E.g., \phi_{11} = corr(x_1, x_0) = \rho(1)

ARIMA(p, d, q):

\nabla^d x_t = (1-B)^d x_t \text{ is ARMA}(p,q)
\qquad
\phi(B)(1-B)^d x_t = \theta(B)w_t

Exponentially weighted moving average (EWMA):

x_t = x_{t-1} + w_t - \lambda w_{t-1}

x_t = \sum_{j=1}^{\infty}(1-\lambda)\lambda^{j-1} x_{t-j} + w_t \quad \text{when } |\lambda| < 1


- Amplitude A
- Phase \phi
- U_1 = A\cos\phi and U_2 = A\sin\phi, often normally distributed rvs

Periodic mixture:

x_t = \sum_{k=1}^{q}\left(U_{k1}\cos(2\pi\omega_k t) + U_{k2}\sin(2\pi\omega_k t)\right)

U_{k1}, U_{k2}, for k = 1, \ldots, q, are independent zero-mean rvs with variances \sigma_k^2

\gamma(h) = \sum_{k=1}^{q}\sigma_k^2\cos(2\pi\omega_k h)
\qquad
\gamma(0) = E[x_t^2] = \sum_{k=1}^{q}\sigma_k^2

Spectral representation of a periodic process:

\gamma(h) = \sigma^2\cos(2\pi\omega_0 h)
          = \frac{\sigma^2}{2}e^{-2\pi i\omega_0 h} + \frac{\sigma^2}{2}e^{2\pi i\omega_0 h}
          = \int_{-1/2}^{1/2} e^{2\pi i\omega h}\,dF(\omega)

Spectral distribution function:

F(\omega) = \begin{cases}
  0 & \omega < -\omega_0 \\
  \sigma^2/2 & -\omega_0 \le \omega < \omega_0 \\
  \sigma^2 & \omega \ge \omega_0
\end{cases}

F(-\infty) = F(-1/2) = 0
\qquad
F(\infty) = F(1/2) = \gamma(0)

Spectral density:

f(\omega) = \sum_{h=-\infty}^{\infty}\gamma(h)e^{-2\pi i\omega h}
\qquad -\frac{1}{2} \le \omega \le \frac{1}{2}

- Needs \sum_h |\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2} e^{2\pi i\omega h} f(\omega)\,d\omega \quad h = 0, \pm 1, \ldots
- f(\omega) \ge 0
- f(\omega) = f(-\omega)
- f(\omega) = f(1-\omega)
- \gamma(0) = V[x_t] = \int_{-1/2}^{1/2} f(\omega)\,d\omega
- White noise: f_w(\omega) = \sigma_w^2
- ARMA(p, q), \phi(B)x_t = \theta(B)w_t:

  f_x(\omega) = \sigma_w^2\,\frac{|\theta(e^{-2\pi i\omega})|^2}{|\phi(e^{-2\pi i\omega})|^2}

  where \phi(z) = 1 - \sum_{k=1}^{p}\phi_k z^k and \theta(z) = 1 + \sum_{k=1}^{q}\theta_k z^k

Discrete Fourier Transform (DFT):

d(\omega_j) = n^{-1/2}\sum_{t=1}^{n} x_t e^{-2\pi i\omega_j t}

Fourier/fundamental frequencies: \omega_j = j/n

Inverse DFT:

x_t = n^{-1/2}\sum_{j=0}^{n-1} d(\omega_j)e^{2\pi i\omega_j t}

Periodogram: I(j/n) = |d(j/n)|^2

Scaled periodogram:

P(j/n) = \frac{4}{n}I(j/n)
       = \left(\frac{2}{n}\sum_{t=1}^{n} x_t\cos(2\pi t j/n)\right)^2
       + \left(\frac{2}{n}\sum_{t=1}^{n} x_t\sin(2\pi t j/n)\right)^2
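The DFT and periodogram above pick out periodic components exactly: for a pure cosine of amplitude A at a Fourier frequency j/n, the scaled periodogram at that frequency equals A^2. A minimal sketch (n = 128 and the frequency 10/n are illustrative):

```python
import cmath
import math

n = 128
freq = 10 / n                       # a Fourier frequency j/n with j = 10
x = [3.0 * math.cos(2 * math.pi * freq * t) for t in range(n)]  # amplitude 3

def d(omega):
    # DFT at frequency omega, scaled by n^(-1/2)
    return sum(x[t] * cmath.exp(-2j * math.pi * omega * t)
               for t in range(n)) / math.sqrt(n)

I = [abs(d(j / n)) ** 2 for j in range(n // 2 + 1)]   # periodogram
peak = max(range(len(I)), key=lambda j: I[j])          # index of largest peak
scaled = 4 / n * I[peak]                               # scaled periodogram = A^2
```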

22 Math

22.1 Gamma Function

- Ordinary: \Gamma(s) = \int_0^\infty t^{s-1}e^{-t}\,dt
- Upper incomplete: \Gamma(s, x) = \int_x^\infty t^{s-1}e^{-t}\,dt
- Lower incomplete: \gamma(s, x) = \int_0^x t^{s-1}e^{-t}\,dt
- \Gamma(\alpha + 1) = \alpha\Gamma(\alpha) \quad \alpha > 0
- \Gamma(n) = (n-1)! \quad n \in \mathbb{N}
- \Gamma(1/2) = \sqrt\pi
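The gamma identities above can be checked directly with the standard library's `math.gamma` (the test point 3.7 is arbitrary):

```python
import math

a = 3.7
lhs = math.gamma(a + 1)
rhs = a * math.gamma(a)            # recurrence Gamma(a+1) = a * Gamma(a)

fact5 = math.gamma(6)              # Gamma(n) = (n-1)!, so Gamma(6) = 5! = 120
half = math.gamma(0.5)             # Gamma(1/2) = sqrt(pi)
```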

22.2 Beta Function

- Ordinary: B(x, y) = B(y, x) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}
- Incomplete: B(x;\, a, b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt
- Regularized incomplete:

  I_x(a, b) = \frac{B(x;\, a, b)}{B(a, b)}
  \overset{a, b \in \mathbb{N}}{=}
  \sum_{j=a}^{a+b-1}\frac{(a+b-1)!}{j!\,(a+b-1-j)!}\,x^j(1-x)^{a+b-1-j}


- I_0(a, b) = 0 \qquad I_1(a, b) = 1
- I_x(a, b) = 1 - I_{1-x}(b, a)

22.3 Series

Finite:

\sum_{k=1}^{n} k = \frac{n(n+1)}{2}
\qquad
\sum_{k=1}^{n}(2k-1) = n^2
\qquad
\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}

\sum_{k=1}^{n} k^3 = \left(\frac{n(n+1)}{2}\right)^2
\qquad
\sum_{k=0}^{n} c^k = \frac{c^{n+1}-1}{c-1} \quad c \ne 1

Binomial:

\sum_{k=0}^{n}\binom{n}{k} = 2^n
\qquad
\sum_{k=0}^{n}\binom{r+k}{k} = \binom{r+n+1}{n}
\qquad
\sum_{k=0}^{n}\binom{k}{m} = \binom{n+1}{m+1}

Vandermonde's identity:

\sum_{k=0}^{r}\binom{m}{k}\binom{n}{r-k} = \binom{m+n}{r}

Binomial theorem:

\sum_{k=0}^{n}\binom{n}{k}a^{n-k}b^k = (a+b)^n

Infinite:

\sum_{k=0}^{\infty} p^k = \frac{1}{1-p}
\qquad
\sum_{k=1}^{\infty} p^k = \frac{p}{1-p}
\qquad |p| < 1


[Figure: univariate distribution relationships, courtesy of Leemis and McQueston [2].]