
Statistical techniques

Malathi Veeraraghavan, University of Virginia

Last updated: Nov. 25, 2014

1. Basic measures (univariate data):

• R Book [1]: The sample mean of the numeric data set $x_1, x_2, \ldots, x_n$ is

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

• R Book: The sample median, $m$, of $x_1, x_2, \ldots, x_n$ is the middle value of the sorted values. Let the sorted data be denoted by $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Then

$m = x_{(k+1)}$ if $n = 2k+1$ (odd), and $m = \frac{1}{2}\left(x_{(k)} + x_{(k+1)}\right)$ if $n = 2k$ (even).

• R Book: The sample variance and sample standard deviation. For a numeric data set $x_1, x_2, \ldots, x_n$, the sample variance is

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$,

and the sample standard deviation, $s$, is the square root of the sample variance.

• R Book: The $p$th quantile is at position $1 + p(n-1)$ in the sorted data. If this is not an integer, a weighted average is used. For example, as the median is the 0.5 quantile, if $n$ is even, the position is a fraction, and hence an average is used as shown above in the definition of the median. The $p$th quantile splits the dataset so that $100p$% is smaller and $100(1-p)$% is larger.

• R Book: Percentiles do the same thing as quantiles, except that a scale of 0 to 100 is used instead of 0 to 1. Page 171, Devore book [2]: Order the $n$ sample observations from smallest to largest. Then the $i$th smallest observation in the list is taken to be the $[100(i-0.5)/n]$th sample percentile.

• R Book: Quartiles refer to the 0, 25, 50, 75 and 100 percentiles. Quintiles refer to the 0, 20, 40, 60, 80, 100 percentiles.

2. Visualization

Represent univariate data with boxplots, barplots, histograms (counts, breaks, density)
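A minimal R sketch of the measures in item 1 and the plots in item 2; the data vector x is made up for illustration:

x <- c(12, 7, 9, 15, 11, 8, 14, 10)   # made-up numeric data set
mean(x)            # sample mean
median(x)          # sample median
var(x)             # sample variance (divides by n - 1)
sd(x)              # sample standard deviation
quantile(x, 0.25)  # 0.25 quantile, at position 1 + p(n - 1) with interpolation

hist(x, breaks = 5, freq = FALSE)  # histogram on the density scale
boxplot(x)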

3. Random variables vs. sample values:

Devore book, page 87: Random variables are customarily denoted by uppercase letters, such as X and Y, near the end of the alphabet. Lowercase letters are used to represent some particular value of the corresponding random variable, e.g., $X = x$.



4. Probability plots (Bivariate data):

These are used to check a distributional assumption (Page 170, Devore book). As it compares the distribution of one variable to that of another, it is listed under bivariate data operations.

• Page 173, Devore book: A plot of the pairs

$\left( [100(i-0.5)/n]\text{th } z \text{ percentile}, \; i\text{th smallest observation} \right)$

is a normal probability plot. If the sample observations are drawn from a normal distribution with mean $\mu$ and standard deviation $\sigma$, the points should fall close to a straight line with slope $\sigma$ and intercept $\mu$. Thus, a plot for which the points fall close to some straight line suggests that the assumption of a normal population distribution is plausible.

5. Quantile-quantile plot (qqplot):

R Book: Plots the quantiles of one variable against the quantiles of another as points. If the distributions have similar shapes (parameters of the distributions do not need to be the same), then the points will fall roughly along a straight line.

6. Normal quantile plot (qqnorm):

Plots the quantiles of a data set against the quantiles of the standard normal distribution.
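A minimal R sketch of items 5 and 6, using simulated data:

set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)   # normal sample
y <- rexp(200, rate = 1)            # exponential sample

qqnorm(x); qqline(x)  # roughly straight line: normality plausible
qqnorm(y); qqline(y)  # curved: normality not plausible

# qqplot: quantiles of one sample against quantiles of another
qqplot(rexp(500, rate = 1), y)  # same shape, so roughly straight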

7. Basic normal distribution concepts:

Devore, page 151: The $(100p)$th percentile of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is related to the $(100p)$th percentile of the standard normal distribution as:

$(100p)\text{th percentile for normal}(\mu, \sigma) = \mu + [(100p)\text{th percentile for standard normal}] \cdot \sigma$

Devore, page 221: If $X_1, X_2, \ldots, X_n$ are independent, normally distributed random variables (with possibly different means and/or variances), then any linear combination of the $X_i$'s also has a normal distribution.

8. Correlation coefficient of two data vectors, $x$ and $y$:

R Book:

$r = \mathrm{cor}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

9. Interpretation of correlation coefficient:

The relationship between two variables is strong if $|r| \ge 0.8$, moderate if $0.5 < |r| < 0.8$, and weak if $|r| \le 0.5$.

10. Properties of correlation coefficient

$-1 \le r \le 1$. $|r| = 1$ iff $y_i = a x_i + b$ for some numbers $a$ and $b$ with $a \neq 0$.
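A short R check of the correlation formula against the built-in cor(), on simulated data:

set.seed(2)
x <- rnorm(100)
y <- 3 * x + rnorm(100)   # linear relationship plus noise

r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual
cor(x, y)                 # same value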



Devore's book, page 201: If $|r| < 1$, it just indicates that the relationship is not completely linear, but there may still be a strong non-linear relationship. Also, $r = 0$ does not imply that $X$ and $Y$ are independent, but only that there is a "complete absence of a linear relationship." When $r = 0$, it means $X$ and $Y$ are uncorrelated. Two variables could be uncorrelated and yet highly dependent because there is a strong nonlinear relationship. Therefore, if $r = 0$, do not conclude that the variables have no relationship. If two variables are independent, the correlation coefficient is 0.

11. Linear model (regression): For a bivariate sample

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ and $1 \le i \le n$, the sample data consists of $n$ observed pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. $Y_i$ is a random variable, and $y_i$ is the observed value of $Y_i$. Independence of the $Y_i$'s follows from the independence of the $\varepsilon_i$. Sources of error are measurement errors, omitted variables, non-linear form, and human behavior.

12. Rsquared

$R^2 = 1 - \frac{SSE}{SST}$ is the fraction of variance explained by the model; it is called the coefficient of determination (R Book), where $SSE$ is the sum of squared errors and $SST$ is the total sum of squares. SSE is also called RSS or residual sum of squares. SSReg or SSR is the regression sum of squares, and $SST = RSS + SSReg$. Explicitly,

$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and $\bar{y} = \frac{\sum_i y_i}{n}$. The term fitted values is used for the $\hat{y}_i$, and the residuals are $(y_i - \hat{y}_i)$.

13. Regression coefficients in a linear regression between two variables and variance of error term:

Suppose $Y_i = \beta_1 x_i + \beta_0 + \varepsilon_i$, where $\varepsilon_i$ is an error term, normally distributed with 0 mean and variance $\sigma^2$. Given a sample $(x_i, y_i)$ of the two random variables, using the principles of least squares (see Devore's book, page 455), we can compute point estimates for the regression coefficients:

$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.

The variance of the error term is estimated as

$\hat{\sigma}^2 = s^2 = \frac{SSE}{n-2}$.

Devore, page 470: The mean value of the estimated slope coefficient is $E(\hat{\beta}_1) = \beta_1$, where $\beta_1$ is the true regression slope coefficient. The estimated standard deviation, or standard error, of the slope coefficient is given by $s_{\hat{\beta}_1} = \frac{s}{\sqrt{S_{xx}}}$. The estimator $\hat{\beta}_1$ for the true slope coefficient is a normally distributed statistic (random variable) because $\hat{\beta}_1$ can be expressed as a linear function of the independent random variables $Y_1, Y_2, \ldots, Y_n$.



Devore, page 471: The assumptions of the simple linear regression model imply that the standardized variable

$T = \frac{\hat{\beta}_1 - \beta_1}{S_{\hat{\beta}_1}}$

has a t-distribution with $(n-2)$ degrees of freedom. Therefore a $100(1-\alpha)$% confidence interval for the slope coefficient $\beta_1$ of the true regression line is

$\hat{\beta}_1 \pm \left( t_{\alpha/2, n-2} \cdot s_{\hat{\beta}_1} \right)$
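A brief R illustration of items 11-13 on simulated data; lm() and confint() are standard R:

set.seed(3)
x <- runif(50, 0, 10)
y <- 2 + 1.5 * x + rnorm(50, sd = 2)   # true beta0 = 2, beta1 = 1.5

fit <- lm(y ~ x)
coef(fit)                   # point estimates beta0-hat, beta1-hat
summary(fit)$r.squared      # R^2 = 1 - SSE/SST
summary(fit)$sigma          # s = sqrt(SSE/(n-2))
confint(fit, level = 0.95)  # 95% CIs, including the t-based slope CI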

14. Relation between correlation coefficient and coefficient of determination

R Book: Square of sample correlation coefficient = coefficient of determination, i.e., $r^2 = R^2$.

This explains the interpretation noted in point 9. If $r = 0.5$, then $r^2 = 0.25$, which means only 25% of the observed variation in $y$ is explained by the model.

15. Smoothing spline

A spline is a function defined piecewise by polynomials. For example, a cubic spline is defined as follows. Given $a < t_1 < t_2 < \cdots < t_n < b$, a function $g$ is a cubic spline if

a. On each interval $(a, t_1), (t_1, t_2), \ldots, (t_n, b)$, $g$ is a cubic polynomial.

b. The polynomial pieces fit together at the points $t_i$ (called knots) such that $g$ itself and its first and second derivatives are continuous at each $t_i$, and hence on the whole $[a, b]$.

A cubic spline is specified as

$g(t) = d_i (t - t_i)^3 + c_i (t - t_i)^2 + b_i (t - t_i) + a_i$ for $t_i \le t \le t_{i+1}$

A cubic spline is a Natural Cubic Spline (NCS) if its second and third derivatives are zero at the extreme intervals, which implies $d_0 = c_0 = d_n = c_n = 0$; in other words, $g$ is linear on the two extreme intervals $[a, t_1]$ and $[t_n, b]$.

The aims of curve fitting are two-fold: (i) obtain a good fit to the data, and (ii) obtain a curve estimate that does not display too much rapid fluctuation (as the goal is to detect trends and hence the need for smoothing). Therefore, it is necessary to make a compromise between these two different goals in curve fitting. The roughness penalty is a measure of roughness in the fitting process. It is quantified as follows:

$\int_a^b [g''(t)]^2 \, dt$, where $g(t)$ is a twice-differentiable curve.

[Digression: A function may be differentiable at a point but not twice differentiable (i.e., the first derivative exists, but the second derivative does not). A function $f$ is said to be differentiable at $a$ if the limit

$f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}$

exists. Even a function with a smooth graph may not be differentiable at a point, for example if its tangent is vertical. For instance, the function $y = \sqrt[3]{x}$ is not differentiable at $x = 0$.]



An intuitive explanation for why this is a good measure of roughness is as follows. "Roughness should not be affected by the addition of a constant or linear function. This leads naturally to the idea of a roughness function that depends on the second derivative of the curve under consideration" [3].

Penalized sum of squares:

$S(g) = \sum_{i=1}^{n} [Y_i - g(t_i)]^2 + \alpha \int_a^b [g''(t)]^2 \, dt$   (1)

where $g$ is any twice-differentiable function on $[a, b]$, and $\alpha$ is a smoothing parameter ("rate of exchange" between residual error and local variation).

Penalized least squares estimator: $\hat{g} = \arg\min S(g)$, where recall that the $\arg\min$ of a function is the value of the argument at which the function attains its minimum: $\arg\min f(x) = \{ x : f(y) \ge f(x), \forall y \}$.

The cost $S(g)$ of a curve is determined not only by its goodness-of-fit to the data, quantified by the residual sum of squares, but also by its roughness (second term of Eq. (1)). For a given $\alpha$, minimizing $S(g)$ will give the best compromise between smoothness and goodness-of-fit [3]. The curve estimator $\hat{g}$ is necessarily a natural cubic spline with knots at the $t_i$, $1 \le i \le n$.
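A minimal R sketch of roughness-penalized fitting with the built-in smooth.spline(); its smoothing parameter spar plays the role of alpha here, and the data are simulated:

set.seed(4)
t <- seq(0, 10, length.out = 100)
y <- sin(t) + rnorm(100, sd = 0.3)

fit_smooth <- smooth.spline(t, y)              # penalty chosen by cross-validation
fit_rough  <- smooth.spline(t, y, spar = 0.2)  # small penalty: wigglier curve

plot(t, y)
lines(predict(fit_smooth, t), lwd = 2)
lines(predict(fit_rough, t), lty = 2)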

16. Linear combination of independent normal variables

Devore, page 219: Given a collection of $n$ random variables $X_1, X_2, \ldots, X_n$ and $n$ numerical constants $a_1, a_2, \ldots, a_n$, let the random variable $Y$ be a linear combination of the $X$'s:

$Y = a_1 X_1 + a_2 X_2 + \cdots + a_n X_n = \sum_{i=1}^{n} a_i X_i$.

Let $X_1, X_2, \ldots, X_n$ have mean values $\mu_1, \mu_2, \ldots, \mu_n$, respectively, and variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$, respectively.

a. Whether or not the $X_i$'s are independent, the mean is $E(Y) = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_n \mu_n$.

b. If $X_1, X_2, \ldots, X_n$ are independent, the variance is $V(Y) = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \cdots + a_n^2 \sigma_n^2$.

c. For any $X_1, X_2, \ldots, X_n$, $V(Y) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, \mathrm{Cov}(X_i, X_j)$, where $\mathrm{Cov}(X_i, X_j) = E(X_i X_j) - \mu_i \mu_j$.



17. Central Limit Theorem

Devore, page 215: Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with mean $\mu$ and variance $\sigma^2$. Then if $n$ is sufficiently large, $\bar{X}$ has approximately a normal distribution with $\mu_{\bar{X}} = \mu$ and $\sigma^2_{\bar{X}} = \sigma^2/n$. The larger the $n$, the better the approximation. Note that the underlying population distribution can be anything. Devore's book notes on page 217 that a rule of thumb is that if $n > 30$, then the Central Limit Theorem can be used.

[Recall a random sample of size $n$ consists of random variables $X_1, X_2, \ldots, X_n$ if the $X_i$ are i.i.d. (independent and identically distributed).]

18. Samples drawn from a population with normal distribution

Devore book, page 214: Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Then, for any $n$, $\bar{X}$ is normally distributed with $\mu_{\bar{X}} = \mu$ and $\sigma^2_{\bar{X}} = \sigma^2/n$. See the difference with the Central Limit Theorem, which holds for any underlying population distribution, but requires $n$ to be large.

19. Confidence intervals

Case 1: If population variance $\sigma^2$ is known [Devore, page 257]

If after observing $(X_1 = x_1), (X_2 = x_2), \ldots, (X_n = x_n)$ we compute the observed sample mean $\bar{x}$, the resulting fixed interval

$\left( \bar{x} - \frac{1.96\sigma}{\sqrt{n}}, \; \bar{x} + \frac{1.96\sigma}{\sqrt{n}} \right)$

is a 95% confidence interval for $\mu$, the population mean.

More generally, a $100(1-\alpha)$% CI for the mean $\mu$ of a normal population when the value of $\sigma$ is known is given by

$\left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \; \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$.

Case 2: If population variance is unknown and sample size is large [Devore, page 264]

If $n$ is sufficiently large, the standardized variable

$Z = \frac{\bar{X} - \mu}{S/\sqrt{n}}$

has approximately a standard normal distribution. This implies that

$\left( \bar{x} - z_{\alpha/2} \frac{s}{\sqrt{n}}, \; \bar{x} + z_{\alpha/2} \frac{s}{\sqrt{n}} \right)$

is a large-sample confidence interval for $\mu$ with a confidence level of approximately $100(1-\alpha)$%. This formula is valid regardless of the shape of the population distribution. Generally speaking, $n > 40$ is sufficient to use this interval.

Note that these are confidence intervals for the population mean. If we want a confidence interval for a population proportion $p$, then there is a different equation.

Theorem: [Devore, page 270]



When $\bar{X}$ is the mean of a random sample of size $n$ from a normal distribution with mean $\mu$, the random variable

$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$

has a t distribution with $n - 1$ degrees of freedom.

Case 3: Population variance unknown and sample size is small [Devore, page 272]

Let $\bar{x}$ and $s$ be the sample mean and sample standard deviation computed from the results of a random sample from a normal population with mean $\mu$. Then a $100(1-\alpha)$% CI for the mean is

$\left( \bar{x} - t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}, \; \bar{x} + t_{\alpha/2, n-1} \frac{s}{\sqrt{n}} \right)$

Corollary [Devore, page 276]: This confidence interval is robust to small to even moderate departures from normality unless $n$ is quite small.
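A small R sketch of these intervals on simulated data (made-up parameters; t.test() is standard R):

set.seed(5)
x <- rnorm(20, mean = 10, sd = 2)   # small sample from a normal population
n <- length(x); alpha <- 0.05
sigma <- 2                          # Case 1: population sd known
mean(x) + c(-1, 1) * qnorm(1 - alpha/2) * sigma / sqrt(n)

# Case 3: sigma unknown, small n, t-based interval
mean(x) + c(-1, 1) * qt(1 - alpha/2, n - 1) * sd(x) / sqrt(n)
t.test(x, conf.level = 0.95)$conf.int   # same t interval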

20. t distribution:

The t-distribution has only one parameter $n$, the degrees of freedom. Type help(pt) to see the PDF of the t-distribution. The mean is 0 when $n > 1$, and the variance is $n/(n-2)$ for $n > 2$. For $n \ge 40$, the t distribution becomes approximately equal to the normal distribution.

$P\left( -t_{\alpha/2, n-1} < T < t_{\alpha/2, n-1} \right) = 1 - \alpha$

$P(T < t_{\alpha, n}) = 1 - \alpha$; the value $t_{\alpha, n}$ is called a t critical value.

21. Confidence interval interpretation

Devore book, page 257, Interpreting a confidence interval

Recall that each sample of a variable will yield a different mean $\bar{x}$. This makes $\bar{X}$ a statistic, i.e., a random variable. From the Central Limit Theorem, $\bar{X}$ will be normally distributed if the sample size is sufficiently large, irrespective of the underlying population distribution (if the population distribution is normal, then $\bar{X}$ will be normally distributed for all sample sizes). Because $\bar{X}$ is normally distributed, we can write

$P\left( \bar{X} - \frac{1.96\sigma}{\sqrt{n}} < \mu < \bar{X} + \frac{1.96\sigma}{\sqrt{n}} \right) = 0.95$

where $\mu$ and $\sigma$ are the population mean and standard deviation, respectively. We can write this as follows:

$\left( \bar{x} - \frac{1.96\sigma}{\sqrt{n}}, \; \bar{x} + \frac{1.96\sigma}{\sqrt{n}} \right)$ is a 95% confidence interval for $\mu$,

where $\bar{x}$ is the mean value from one sample, but we cannot write

$P\left( \bar{x} - \frac{1.96\sigma}{\sqrt{n}} < \mu < \bar{x} + \frac{1.96\sigma}{\sqrt{n}} \right) = 0.95$,

since the item in the parentheses is not a random variable. The population mean $\mu$ is a fixed parameter (but unknown), and $\bar{x}$ is a value known to us - it is the mean value from one sample.



22. Confidence interval for the mean value of the dependent variable at a particular value of the independent variable:

The dependent variable is normally distributed, and so its mean is a parameter of this distribution, for which we can provide a confidence interval.

Devore book, page 477, section 12.4

Suppose, for example, that $\beta_0 = 50$ and $\beta_1 = 2$. Recall that $\beta_0, \beta_1, \sigma^2$ are the true values of the regression coefficients and error-term variance in the population. From a first sample of $(x, y)$, we might estimate $\hat{\beta}_0 = 52.35$ and $\hat{\beta}_1 = 1.895$. A second sample might result in $\hat{\beta}_0 = 46.52$ and $\hat{\beta}_1 = 2.056$, etc. From this it follows that $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x^*$ itself varies from sample to sample, and is hence a statistic, for a fixed value $x = x^*$ of the independent variable. From one sample, we can then provide a confidence interval for the parameter, the mean of $Y$. Recall that the $Y$ corresponding to a given $x^*$ is a normally distributed random variable; therefore its mean is a parameter.

a. The mean value of $\hat{Y}$ is $E(\hat{Y}) = E(\hat{\beta}_0 + \hat{\beta}_1 x^*) = \mu_{Y \cdot x^*}$

b. The variance of $\hat{Y}$ is

$V(\hat{Y}) = \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right]$, and $\sigma_{\hat{Y}}$, the standard deviation of $\hat{Y}$, is $\sqrt{V(\hat{Y})}$. Replace $\sigma$ by its estimate $s$, which is $\sqrt{SSE/(n-2)}$, to obtain $s_{\hat{Y}}$.

c. A $100(1-\alpha)$% confidence interval for $\mu_{Y \cdot x^*}$, the expected value of $Y$ when $x = x^*$, is

$(\hat{\beta}_0 + \hat{\beta}_1 x^*) \pm t_{\alpha/2, n-2} \cdot s_{\hat{\beta}_0 + \hat{\beta}_1 x^*} = \hat{Y} \pm \left( t_{\alpha/2, n-2} \cdot s_{\hat{Y}} \right)$

because the variable

$T = \frac{\hat{Y} - (\beta_0 + \beta_1 x^*)}{s_{\hat{Y}}}$

has a Student-t distribution with $n - 2$ degrees of freedom.

23. Prediction interval for the dependent variable value at a specific value of the independent variable:

Statistical inference about the predicted value of $Y$ based on the sample is done with a prediction interval. First run a linear regression of the dependent variable $Y$ on the independent variable $x$ based on one sample of $(x, y)$ values.

R Book: Prediction interval:

$\hat{y} \pm t^* \cdot SE$, with the standard error $SE = s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{s_{xx}}}$

The value of $t^*$ comes from the $t$-distribution with $n - 2$ degrees of freedom, where $n$ is the number in the sample.
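A compact R illustration of items 22 and 23 using predict() on a fitted lm, with simulated data:

set.seed(7)
x <- runif(40, 0, 10)
y <- 50 + 2 * x + rnorm(40, sd = 3)
fit <- lm(y ~ x)

new <- data.frame(x = 5)  # the particular value x*
predict(fit, new, interval = "confidence", level = 0.95)  # CI for mean of Y at x*
predict(fit, new, interval = "prediction", level = 0.95)  # PI for a future Y; wider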



Devore book, page 481:

Because the future value $Y$ is independent of the observed $Y_i$'s,

$V[Y - (\hat{\beta}_0 + \hat{\beta}_1 x^*)] = V[Y] + V(\hat{\beta}_0 + \hat{\beta}_1 x^*)$

[Recall from Devore, page 220, that $E(X_1 - X_2) = E(X_1) - E(X_2)$, and if $X_1$ and $X_2$ are independent, then variances add: $V(X_1 - X_2) = V(X_1) + V(X_2)$.]

Continuing the prediction analysis:

$V[Y - (\hat{\beta}_0 + \hat{\beta}_1 x^*)] = V[Y] + V(\hat{\beta}_0 + \hat{\beta}_1 x^*) = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{s_{xx}} \right]$.

Substituting $s$ for $\sigma$, again the t-distribution can be used.

A $100(1-\alpha)$% prediction interval for a future observation $Y$ to be made when $x = x^*$ is

$(\hat{\beta}_0 + \hat{\beta}_1 x^*) \pm t_{\alpha/2, n-2} \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{s_{xx}}}$, or equivalently, $\hat{y} \pm t_{\alpha/2, n-2} \sqrt{s^2 + s_{\hat{Y}}^2}$,

because

$T = \frac{Y - (\hat{\beta}_0 + \hat{\beta}_1 x^*)}{s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{s_{xx}}}}$

has a Student-t distribution with $n - 2$ degrees of freedom.

24. Compare the above two:

The mean value $E(Y)$ for a particular value $x = x^*$ can be predicted to be $(\hat{\beta}_0 + \hat{\beta}_1 x^*)$.

Error of estimation of mean $= (\beta_0 + \beta_1 x^*) - (\hat{\beta}_0 + \hat{\beta}_1 x^*)$.

$(\beta_0 + \beta_1 x^*)$ is a fixed but unknown value, while $(\hat{\beta}_0 + \hat{\beta}_1 x^*)$ is a random variable.

In contrast, if we are asked to predict the value of $Y$ at a given value $x^*$ of the independent variable,

Error of prediction $= Y - (\hat{\beta}_0 + \hat{\beta}_1 x^*)$,

where both terms are random variables.

Therefore, prediction intervals for a particular value of $Y$ will be wider than confidence intervals for the mean value of $Y$. A CI refers to a population characteristic whose value is fixed but unknown to us. In contrast, a future value of $Y$ is not a parameter but instead a random variable. Hence the prediction interval shows more variability, since $Y$ could be anything.

25. The classical assumptions that must hold for the linear model [From Econometrics book by Studenmund, page 89]

a. The regression model is linear, is correctly specified, and has an additive error term.
b. The error term has a zero population mean.
c. All explanatory (independent) variables are uncorrelated with the error term.
d. Observations of the error term are uncorrelated with each other (no serial correlation).
e. The error term has a constant variance (no heteroskedasticity).
f. No explanatory variable is a perfect linear function of any other explanatory variable (no perfect multicollinearity).



g. The error term is normally distributed (this assumption is optional but usually is invoked).

26. Aptness of the model and model checking (Devore book, page 501)

Use error terms (residuals) or standardized residuals.

Residuals: $e_i = y_i - \hat{y}_i$.

To derive the properties of residuals, let $e_i = Y_i - \hat{Y}_i$ represent the $i$th residual as a random variable (before the observations are actually made). Then

$E(Y_i - \hat{Y}_i) = E(Y_i) - E(\hat{\beta}_0 + \hat{\beta}_1 x_i) = \beta_0 + \beta_1 x_i - (\beta_0 + \beta_1 x_i) = 0$

Variance:

$V(Y_i - \hat{Y}_i) = \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}} \right]$

Standardized residuals are:

$e_i^* = \frac{y_i - \hat{y}_i}{s \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}}}$, $i = 1, \ldots, n$

Diagnostic plots are drawn:

a. Residual plot: $e_i^*$ or $e_i$ on the vertical axis vs. $x_i$ on the horizontal axis - the residuals should not be correlated with the independent variable, their variance should not change with $x_i$, and there should be no serial correlation.
b. Residual plot: $e_i^*$ or $e_i$ on the vertical axis vs. $\hat{y}_i$ (fitted values) on the horizontal axis - should not show any discernible patterns.
c. $\hat{y}_i$ on the vertical axis vs. $y_i$ on the horizontal axis - should be close to a 45° line - then the estimated regression function gives accurate predictions of the values actually observed. This plot is the single most often recommended for multiple regression analysis.
d. A normal probability plot of the standardized residuals --> check if the residuals are normally distributed.



[Figure omitted: plots of the standardized residuals vs. the independent variable x. Panel (c) of that figure shows one discrepant observation (see circled point) that could have unduly influenced the best fit.]
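A quick R sketch of the diagnostics in item 26 for a fitted simple regression (simulated data; rstandard() gives standardized residuals):

set.seed(8)
x <- runif(60, 0, 10)
y <- 1 + 2 * x + rnorm(60)
fit <- lm(y ~ x)

plot(x, rstandard(fit))             # (a) standardized residuals vs x
plot(fitted(fit), resid(fit))       # (b) residuals vs fitted values
plot(y, fitted(fit)); abline(0, 1)  # (c) fitted vs observed, 45-degree line
qqnorm(rstandard(fit)); qqline(rstandard(fit))  # (d) normality check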

27. Simulation

Devore, Section 5.3

A statistic is any quantity whose value can be calculated from sample data. Prior to obtaining the data, there is uncertainty as to the value of any particular statistic, and hence a statistic is a random variable. Its distribution is called the sampling distribution, to describe how the statistic varies in value across all samples that might be selected.

A random sample of size $n$ consists of random variables $X_1, X_2, \ldots, X_n$ if the $X_i$ are i.i.d. (independent and identically distributed).

[Scanned-in figure from Devore's book, page 13.2, omitted.]


To run a simulation experiment:

a. Choose your statistic of interest, e.g., $\bar{X}$, $S$, trimmed mean, etc.
b. Choose the population distribution (e.g., normal, uniform, with specified parameters).
c. Choose the sample size, e.g., $n = 10$ or $n = 15$.
d. Choose the number of replications, e.g., $k = 500$ or $k = 1000$.

Generate $k$ random samples, each of size $n$, from the specified population distribution. Find the value of the statistic for each sample. Plot a histogram of the statistic. This histogram gives an approximate sampling distribution of the statistic. The larger the $k$, the better the approximation. For simple statistics, in practice $k = 500$ or $1000$ is sufficient.
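A minimal R version of this experiment, using the sample mean as the statistic and a Uniform(0,1) population:

set.seed(9)
n <- 15; k <- 1000
xbar <- replicate(k, mean(runif(n)))  # k samples of size n, statistic per sample
hist(xbar, breaks = 30)               # approximate sampling distribution of the mean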

28. Distributions

Lognormal [Devore, page 166]

A nonnegative random variable $X$ is said to have a lognormal distribution if the random variable $Y = \ln(X)$ has a normal distribution. The resulting pdf of a lognormal random variable, when $\ln(X)$ is normally distributed with mean $\mu$ and variance $\sigma^2$, is

$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma x} e^{-[\ln(x) - \mu]^2 / (2\sigma^2)}$ for $x \ge 0$, and $0$ for $x < 0$.

The mean is $e^{\mu + \sigma^2/2}$. The median is $e^{\mu}$. The variance is $\left( e^{\sigma^2} - 1 \right) e^{2\mu + \sigma^2}$.

Gamma function [Devore, page 159]

For $\alpha > 0$, the gamma function $\Gamma(\alpha)$ is defined by

$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x} \, dx$

Important properties are:

a. For any $\alpha > 1$, $\Gamma(\alpha) = (\alpha - 1)\Gamma(\alpha - 1)$
b. For any positive integer $n$, $\Gamma(n) = (n - 1)!$
c. $\Gamma(1/2) = \sqrt{\pi}$

Gamma distribution [Devore, page 160]

A continuous random variable $X$ is said to have a gamma distribution if the pdf of $X$ is

$f_X(x) = \frac{1}{\beta^{\alpha} \Gamma(\alpha)} x^{\alpha - 1} e^{-x/\beta}$ for $x \ge 0$, and $0$ otherwise,

where $\alpha$ and $\beta$ are both greater than 0. The standard gamma distribution has $\beta = 1$. If $\alpha = 1$ and $\beta = 1/\lambda$, the gamma distribution reduces to an exponential distribution. The mean is $\alpha\beta$ and the variance is $\alpha\beta^2$. The parameter $\beta$ is called the scale parameter because values other than 1 either stretch or compress the pdf in the x direction (Devore, page 160), and $\alpha$ is called the shape parameter.



Chi-squared distribution [Devore, page 161]

Let $\nu$ be a positive integer. Then a random variable $X$ is said to have a chi-squared distribution with parameter $\nu$ if the pdf of $X$ is the gamma density with $\alpha = \nu/2$ and $\beta = 2$. The parameter $\nu$ is called the number of degrees of freedom (df) of $X$. The symbol $\chi^2$ is used instead of "chi-squared." The mean is $\nu$ and the variance is $2\nu$.

If $X$ is a standard normal random variable and $Y = X^2$, then $Y$ has a chi-squared distribution with $\nu = 1$ degree of freedom [Devore, page 163, problem 71].

Devore, page 278: Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. Then the rv

$\frac{(n-1)S^2}{\sigma^2} = \frac{\sum (X_i - \bar{X})^2}{\sigma^2}$

has a chi-squared ($\chi^2$) distribution with $n - 1$ df.

A $100(1-\alpha)$% CI for the variance $\sigma^2$ of a normal population has lower limit $(n-1)s^2 / \chi^2_{\alpha/2, n-1}$ and upper limit $(n-1)s^2 / \chi^2_{1-\alpha/2, n-1}$.

F distribution [Devore, 360]

If $X_1$ and $X_2$ are independent chi-squared rvs with $\nu_1$ and $\nu_2$ df, respectively, then the rv

$F = \frac{X_1 / \nu_1}{X_2 / \nu_2}$

has an F distribution. A property of the F distribution is $F_{1-\alpha, \nu_1, \nu_2} = \frac{1}{F_{\alpha, \nu_2, \nu_1}}$.

Exponential distribution [Devore, 157]

A random variable $X$ is said to have an exponential distribution with parameter $\lambda$ ($\lambda > 0$) if the pdf of $X$ is

$f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, and $0$ otherwise.

The mean is $1/\lambda$ and the variance is $1/\lambda^2$.

Cauchy distribution

It is the same as the t-distribution with 1 degree of freedom. The mean and variance are undefined. The median and mode are both 0. Density function:

$f_X(x) = \frac{1}{\pi (1 + x^2)}$, $-\infty < x < \infty$.

Bernoulli distribution

A discrete random variable $X$ that is either 0 or 1 has the following pmf:

$p_X(0) = 1 - p = q$, $p_X(1) = p$



The mean is $p$ and the variance is $pq$.

Binomial distribution

A discrete random variable $X$ is the number of successes in $n$ Bernoulli trials, in each of which the probability of success is $p$. Its pmf is as follows:

$p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $0 \le k \le n$, and $0$ otherwise.

The mean is $np$ and the variance is $np(1-p)$, or $npq$.

Uniform distribution

A continuous random variable X has a uniform distribution on the interval $[A, B]$ if the pdf of X is

$f_X(x) = \frac{1}{B - A}$ for $A < x < B$, and $0$ otherwise.

The mean is $(A + B)/2$ and the variance is $(B - A)^2 / 12$. There is also a discrete version of the uniform distribution:

$p_X(i) = \frac{1}{n}$ for $1 \le i \le n$.

Its mean is $(n+1)/2$ and its variance is $(n^2 - 1)/12$.

Poisson distribution

A discrete random variable X has a Poisson distribution with parameter $\lambda$ ($\lambda > 0$) if the pmf of X is

$p_X(x) = \frac{e^{-\lambda} \lambda^x}{x!}$ for $x = 0, 1, 2, \ldots$

The mean and variance are both $\lambda$.

Pareto distribution

A continuous random variable X has a Pareto distribution with scale parameter $\theta$ and shape parameter $\alpha$ if the pdf of X is

$f_X(x) = \frac{\alpha \theta^{\alpha}}{x^{\alpha + 1}}$, $x \ge \theta$

The mean is $\alpha\theta / (\alpha - 1)$ for $\alpha > 1$, and $\infty$ if $\alpha \le 1$. The variance is $\theta^2 \alpha / \left( (\alpha - 2)(\alpha - 1)^2 \right)$ if $\alpha > 2$, and $\infty$ if $1 < \alpha \le 2$.
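A small R sketch checking some of these moments by simulation; the parameter values are made up, and the built-in r* generators cover these cases:

set.seed(10)
x <- rgamma(1e5, shape = 2, scale = 3)   # alpha = 2, beta = 3
c(mean(x), 2 * 3)      # mean close to alpha * beta = 6
c(var(x), 2 * 3^2)     # variance close to alpha * beta^2 = 18

y <- rlnorm(1e5, meanlog = 0, sdlog = 0.5)
c(mean(y), exp(0 + 0.5^2 / 2))  # lognormal mean e^(mu + sigma^2/2)
c(median(y), exp(0))            # lognormal median e^mu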

29. Central Limit Theorem (CLT) criteria checking with simulations



Run simulations as per Item 27. Draw $k$ random samples, each of size $n$, from a population distribution. Know the population mean $\mu$ and population variance $\sigma^2$ (as this is a simulation, we know these values a priori). Compute the sample mean $\bar{x}_i$ and sample variance $s_i^2$ for each sample $i$ of the $k$ samples (i.e., for $1 \le i \le k$). Note that the two statistics, $\bar{X}$ and $S$, representing the sample mean and sample standard deviation respectively, are random variables. Then run these five tests (an R sketch of these checks appears at the end of this item):

a. Test if the sampling distribution of the sample mean, $\bar{X}$, is normal, using qqnorm and/or a histogram.

b. Compare the mean $E[\bar{X}]$ of the sample mean (take the average of all $k$ sample means) with the population mean. These should be approximately equal.

c. Compare the standard deviation $\sqrt{\mathrm{Var}(\bar{X})}$ of the sample mean (find the square root of the variance of all $k$ sample means) with $\sigma / \sqrt{n}$, which is the population standard deviation divided by the square root of the sample size $n$. These should be approximately equal.

d. Compare the mean $E[S^2]$ of the sample variance (the average of the sample variance across all $k$ samples) with $\sigma^2$, the population variance. These should be approximately equal.

e. This final step has three parts:

i. For each of the $k$ samples, find the $100(1-\alpha)$% confidence interval for the population mean $\mu$ using the t-distribution (see that this uses the sample mean, $\bar{x}$, and sample standard deviation, $s$):

$\left( \bar{x} - t_{\alpha/2, n-1} \frac{s}{\sqrt{n}}, \; \bar{x} + t_{\alpha/2, n-1} \frac{s}{\sqrt{n}} \right)$

Draw a matplot showing vertical lines connecting the upper limit and lower limit for the $k$ confidence intervals (see Simulation.R).

ii. Check if the proportion of the above set of $k$ intervals that contain the population mean is 95%. Set this proportion to $\hat{p}$. If the CLT approximation is met, $\hat{p}$ will be close to 0.95.

iii. If not, find a 95% confidence interval for the true coverage probability of these confidence intervals. Following the procedure for obtaining a confidence interval for a population proportion (see item 31), we obtain this true coverage probability as:

$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{k}}$

What exactly is the relation between this final single confidence interval and that of the $k$ confidence intervals we obtained for the population mean from the $k$ samples? Recognize that the upper limit $U$ of the confidence interval containing the population mean is a "statistic" because it varies from sample to sample. Similarly, the lower limit $L$ is also a statistic. Therefore, we can talk about $P(L < \mu < U)$, where $\mu$ is the population mean. Let's say $\hat{p}_1 = \hat{p} + z_{\alpha/2}\sqrt{\hat{p}\hat{q}/k}$ and $\hat{p}_2 = \hat{p} - z_{\alpha/2}\sqrt{\hat{p}\hat{q}/k}$, these being the limits found from the sample of size $k$ for the proportion of this sample that has confidence intervals that contain the population mean. Think of this last step as being executed on a random sample of size $k$, each of which has a value for the random variables $U$ and $L$.

sample that has confidence intervals that contain the population mean. Think of this last step asbeing executed on a random sample of size , each of which has a value for the random variable

k n

µ σ2

xi si2

i k 1 i k≤ ≤ X S

X

E X[ ] k

Var X( )

k σ n( )⁄n

E S2[ ]

k σ2

100 1 α–( )µ x

s

x tα 2⁄ n 1–,s

n-------– x tα 2 n 1–,⁄

s

n-------+,

k

k

p p

p zα 2⁄pqn------±

k

k

U

L

P L µ U< <( ) µ p1 p zα 2⁄pqn------+=

p2 p zα 2⁄pqn------–= k

k


Therefore, if we build a future confidence interval for the population mean, and the interval is $(l, u)$, we are $100(1-\alpha)$% confident that the true confidence level represented by $(l, u)$ is in the interval $(\hat{p}_1, \hat{p}_2)$. So even if we constructed the interval using the 95% value for the t critical value, we say the true confidence level is in the interval $(\hat{p}_1, \hat{p}_2)$.

Finally, there is a theorem that if the underlying distribution is symmetric, even if the CLT approximation does not hold, the true confidence level is at least as large as the nominal level (i.e., the level found with the t-test). Statisticians are inherently conservative: if we say that the confidence level is 95%, we are happier if it is in fact larger than 95% than if it is smaller than 95%. So for symmetric distributions, we can use the t-distribution based confidence interval. An example of this was the Cauchy distribution (the mean does not exist, but we were finding a confidence interval for the population median). But since the chi-squared distribution with two degrees of freedom is not symmetric, this theorem does not apply. In fact, here the CIs for the 5000 samples using the t-test for the 95% confidence level were actually overly optimistic, because the true coverage value was only 88-89%.
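A condensed R sketch of the five checks in item 29, under assumed simulation settings (exponential population with mu = 2, so sigma = 2; n and k are made up):

set.seed(11)
n <- 30; k <- 1000; mu <- 2; sigma <- 2
samples <- matrix(rexp(n * k, rate = 1 / mu), nrow = k)  # one sample per row
xbar <- apply(samples, 1, mean)
s    <- apply(samples, 1, sd)

qqnorm(xbar); qqline(xbar)    # (a) sampling distribution of the mean
c(mean(xbar), mu)             # (b) mean of sample means vs mu
c(sd(xbar), sigma / sqrt(n))  # (c) sd of sample means vs sigma/sqrt(n)
c(mean(s^2), sigma^2)         # (d) mean of sample variances vs sigma^2

# (e) coverage of the k t-based 95% CIs
tcrit <- qt(0.975, n - 1)
lower <- xbar - tcrit * s / sqrt(n)
upper <- xbar + tcrit * s / sqrt(n)
phat <- mean(lower < mu & mu < upper)
phat                                                          # (e.ii) close to 0.95?
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / k)  # (e.iii)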

30. Normal approximation of the binomial distribution [Devore, page 217]

The CLT is used to justify the normal approximation to the binomial distribution. Let $X_1, X_2, \ldots, X_n$ be

$X_i = 1$ if the $i$th trial results in a success, and $X_i = 0$ if the $i$th trial results in a failure, for $i = 1, 2, \ldots, n$.

If $X$ is the number of successes in these $n$ trials, it is a binomial random variable, and $X_1 + \cdots + X_n = X$ since every success is a 1 and every failure is a 0. The sample mean is $X/n$, the sample proportion of successes. $\bar{X}$ is approximately normal when $n$ is large by the CLT. As $n$ is a constant, $X/n$ is also approximately normal when $n$ is large. The rule of thumb is that $np \ge 10$ and $n(1-p) \ge 10$; this ensures that $n$ is large enough to overcome any skewness in the underlying Bernoulli distribution ($p$ much larger or much smaller than $1-p$), and the CLT holds.

If $X \sim \mathrm{Binomial}(n, p)$, then:

$Z = \frac{X - E[X]}{\sqrt{\mathrm{Var}(X)}} = \frac{X - np}{\sqrt{np(1-p)}} = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$ is approximately $N(0, 1)$.

The normal approximation is usually reasonably accurate if $np \ge 10$ and $n(1-p) \ge 10$.

R Book: Continuity correction. If $X \sim \mathrm{Binom}(n, p)$, then $P(X \le 22) \approx P(Z \le (22.5 - \mu)/\sigma)$, where $\mu = np$ and $\sigma = \sqrt{np(1-p)}$. For a general binomial random variable with mean $\mu$ and standard deviation $\sigma$, the approximation $P(a \le X \le b) \approx P\left( (a - 0.5 - \mu)/\sigma \le Z \le (b + 0.5 - \mu)/\sigma \right)$ is an improvement to the Central Limit Theorem.
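A quick R comparison of the exact binomial probability with the continuity-corrected normal approximation (n and p are made up):

n <- 50; p <- 0.4
mu <- n * p; sg <- sqrt(n * p * (1 - p))
pbinom(22, n, p)          # exact P(X <= 22)
pnorm((22.5 - mu) / sg)   # normal approximation with continuity correction

# P(15 <= X <= 25):
pbinom(25, n, p) - pbinom(14, n, p)
pnorm((25.5 - mu) / sg) - pnorm((14.5 - mu) / sg)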

31. Confidence interval for a population proportion

Devore, page 231

When $X$ is a binomial random variable with parameters $n$ and $p$, the sample proportion $\hat{p} = X/n$ is an unbiased estimator of $p$:

$E(\hat{p}) = E\left( \frac{X}{n} \right) = \frac{1}{n} E(X) = \frac{1}{n}(np) = p$



Devore, page 265

A random sample of size $n$ is selected, and $X$ is the number of successes in the sample. If $n$ is small compared to the population size, $X$ can be regarded as a binomial random variable with $E(X) = np$ and $\sigma_X = \sqrt{np(1-p)}$. Furthermore, if both $np \ge 10$ and $n(1-p) \ge 10$, then $X$ has an approximately normal distribution.

As seen earlier, because $X$ is approximately normal and $\hat{p} = X/n$, $\hat{p}$ is approximately normal, with $E(\hat{p}) = p$ and $\sigma_{\hat{p}} = \sigma_X / n = \sqrt{p(1-p)/n}$. Note that the standard deviation $\sigma_{\hat{p}}$ involves the unknown parameter $p$:

$P\left( -z_{\alpha/2} < \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} < z_{\alpha/2} \right) \approx 1 - \alpha$.

Replace the less-than signs with equal signs and solve the resulting quadratic equation for $p$. This gives two roots.

Proposition: A confidence interval for a population proportion $p$ with confidence level approximately $100(1-\alpha)$% has

lower confidence limit $= \frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} - z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n} + \frac{z_{\alpha/2}^2}{4n^2}}}{1 + z_{\alpha/2}^2 / n}$ and

upper confidence limit $= \frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} + z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n} + \frac{z_{\alpha/2}^2}{4n^2}}}{1 + z_{\alpha/2}^2 / n}$

If the sample size $n$ is large, $z_{\alpha/2}^2 / (2n)$ is negligible compared to $\hat{p}$, the $z_{\alpha/2}^2 / (4n^2)$ under the square root is negligible compared to $\hat{p}\hat{q}/n$, and $z_{\alpha/2}^2 / n$ is negligible compared to 1. The approximate confidence limits are then:

$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}\hat{q}}{n}}$

This was implemented in the Simulation case study for the Cauchy distribution.
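A short R sketch comparing the quadratic-equation limits above with the simple large-sample approximation; the counts x and n are made up:

x <- 37; n <- 100; alpha <- 0.05
phat <- x / n; qhat <- 1 - phat
z <- qnorm(1 - alpha / 2)

# Quadratic-equation (score) limits
center <- phat + z^2 / (2 * n)
half   <- z * sqrt(phat * qhat / n + z^2 / (4 * n^2))
c(center - half, center + half) / (1 + z^2 / n)

# Large-sample approximation
phat + c(-1, 1) * z * sqrt(phat * qhat / n)

# prop.test() without continuity correction gives the same score interval
prop.test(x, n, correct = FALSE)$conf.int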

32. Stochastic process

A stochastic process (SP) is a family of random variables $\{X(t), t \in T\}$ defined on a given probability space, indexed by the time variable $t$, where $t$ varies over an index set $T$.¹



Just as a random variable assigns a number to each outcome $s$ in a sample space $S$, a stochastic process assigns a sample function $x(t, s)$ to each outcome $s$.

A sample function $x(t, s)$ is the time function associated with outcome $s$ of an experiment.

The ensemble of a stochastic process (sp) is the set of all possible time functions that can result from an experiment.

Example of an sp: Cumulative earnings in the roulette experiment.

Types of stochastic processes: Discrete value and continuous value; discrete time and continuous time. If each random variable $X(t)$, for different values of $t$, is a discrete rv, then the sp is a discrete value sp. If the process is defined only for discrete time instants, then it is a discrete time sp.

Random sequence: for a discrete time process, a random sequence $X_n$ is an ordered sequence of random variables $X_0$, $X_1$, .... Essentially, a random sequence is a discrete-time stochastic process.

Relation between sp and rv: A discrete value sp is defined by the joint PMF $P_{X(t_1), \ldots, X(t_k)}(x_1, \ldots, x_k)$, while a continuous value sp is defined by the joint PDF.

¹ Most of the statements in this writeup have been taken verbatim from [4]; exceptions are noted.

[Figure 1: Representation of a stochastic process - relation to random variables. Sample functions $x(t, s_1)$, $x(t, s_2)$, $x(t, s_3)$ for outcomes $s_1, s_2, s_3$; at $t = t_1$, $X(t_1)$ is a random variable.]


Independent, identically distributed random sequence: Let $X_n$ be an iid random sequence. All $X_i$ have the same distribution, so $P_{X_i}(x) = P_X(x)$. For a discrete value process, the sample vector $(X_{n_1}, \ldots, X_{n_k})$ has the joint PMF (by the property of independent random variables)

$P_{X(t_1), \ldots, X(t_k)}(x_1, \ldots, x_k) = P_X(x_1) P_X(x_2) \cdots P_X(x_k) = \prod_{i=1}^{k} P_X(x_i)$   (1)

For a continuous value sp, the same operation applies, but with the pdf for an independent process.

The simplest and most important type of dependence is first-order dependence, or Markov dependence.

A Markov process [4] is a stochastic process whose dynamic behavior is such that the probability distribution for its future development depends only on its present state, and not on how the process arrived in that state. If the state space is discrete, then it is a Markov chain, i.e., the random variables $X_n$ or $X(t)$ are discrete. We will see that a random walk is a special type of Markov chain.

33. Random walk

[Trivedi's book, 1st edition, page 280 [4]] Consider a sequence of independent Bernoulli trials. The random variable $Y_i$ denotes the result of the $i$th trial, with $Y_i = 1$ denoting success and $Y_i = 0$ denoting failure. The probability of success of the $i$th trial is $P(Y_i = 1) = p$, which is independent of the index $i$. As the $Y_i$'s are mutually independent, the process $\{Y_i, i = 1, 2, \ldots\}$ is a Bernoulli process (discrete-value, discrete-time), with

$E[Y_i] = p$ and $\mathrm{Var}[Y_i] = p(1-p)$.

A binomial process is the sequence of partial sums $\{S_n, n = 1, 2, \ldots\}$, where $S_n = Y_1 + Y_2 + \cdots + Y_n$:

$P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}$, $E[S_n] = np$, $\mathrm{Var}[S_n] = np(1-p)$.

A generalization of the Bernoulli process allows each trial to have more than two possible outcomes. Let $\{Y_i, i = 1, 2, \ldots, n\}$ be a sequence of independent discrete random variables, and define the partial sums $S_n = Y_1 + Y_2 + \cdots + Y_n$. Then the sum process $\{S_n, n = 1, 2, \ldots\}$ is a random walk.

A simple random walk is defined as follows. It is a process whose increments $Z_t = X_t - X_{t-1}$ take on values +1 or -1, and are i.i.d. and independent of the starting value $X_0$. Note that the $X_t$'s are not independent.



$X_t = X_0 + \sum_{k=1}^{t} Z_k$, $t = 1, 2, \ldots$, with $P(Z_k = 1) = p$ and $P(Z_k = -1) = 1 - p$ for all $k$.

Linear interpolation of the points $(t, X_t)$ reflects the time evolution of the process and is called a path of a simple random walk.

Letting the process go up by a constant $u$ and down by a constant $d$ makes it a general binomial process:

$P(Z_k = u) = p$ and $P(Z_k = -d) = 1 - p$ for all $k$.

A symmetric simple random walk is one where $p = 0.5$.

Example of a simple random walk: cumulative earnings in the roulette case study. Each trial is a Bernoulli trial with $p_{win} = \frac{18}{38}$. If after $n$ trials we ask what is the number of games won, the answer is that this is a binomial process. The number of games won at discrete time instant $n-1$ is the only number required, along with the $n$th trial, to determine the number of games won at discrete time instant $n$. Previous values, at say $n-2$, do not matter. Cumulative earnings is a simple random walk, since on each trial the increment is +1 or -1. See the table below for how the variance of the cumulative earnings increases as the number $n$ of games played increases. Cumulative earnings at each discrete time step are not independent of each other. The earnings at one step depend upon just the value at the previous step. Hence a random walk is a Markov chain.

$n$ | Cumulative earnings after $n$ games
1 | +1 or -1
2 | +2 (both games won), 0 (one won, one lost), -2 (both lost)
3 | +3, +1, -1, -3
4 | +4, +2, 0, -2, -4
5 | +5, +3, +1, -1, -3, -5
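A minimal R simulation of this cumulative-earnings random walk:

set.seed(12)
n <- 1000
z <- sample(c(1, -1), n, replace = TRUE, prob = c(18/38, 20/38))  # per-game earnings
x <- cumsum(z)   # simple random walk: cumulative earnings
plot(x, type = "l", xlab = "game", ylab = "cumulative earnings")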

34. Function of random variable

[Trivedi's book [4]] If a random variable $Y = \Phi(X)$, i.e., $Y$ is a function of another random variable $X$, then from the density function of $X$, the density of $Y$ can be computed for some functions. For example, if $Y = X^2$, then the

CDF: $F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y})$

PDF: $f_Y(y) = \frac{1}{2\sqrt{y}} \left[ f_X(\sqrt{y}) + f_X(-\sqrt{y}) \right]$ for $0 < y < \infty$, and $0$ otherwise.




35. Rate of returns:

Arithmetic return is defined as follows:

$R_t = \frac{P_t - P_{t-1}}{P_{t-1}} = \frac{P_t}{P_{t-1}} - 1$, because the principal grows as follows: $P_t = P_{t-1}(1 + R_t)$.

Simple gross return is $(1 + R_t)$.

Compound returns over $k$ periods:

$1 + R_t(k) = (1 + R_t)(1 + R_{t-1}) \cdots (1 + R_{t-k+1}) = \frac{P_t}{P_{t-1}} \cdot \frac{P_{t-1}}{P_{t-2}} \cdots \frac{P_{t-k+1}}{P_{t-k}} = \frac{P_t}{P_{t-k}}$

(the principal grows: $P_{t-k+1} = P_{t-k}(1 + R_{t-k+1})$, etc.)

Annualized returns - per year return:

$r = \left[ \prod_{j=0}^{k-1} (1 + R_{t-j}) \right]^{n/k} - 1$, with $n$ compounding periods per year. The number of years is $k/n$, and

$1 + R_t(k) = (1 + r)^{k/n}$

Logarithmic or continuously compounded return:

$r_t = \ln(1 + R_t) = \ln\left( \frac{P_t}{P_{t-1}} \right)$

Taking logarithms turns products into sums:

$r_t(k) = \ln(1 + R_t(k)) = r_t + r_{t-1} + \cdots + r_{t-k+1}$

If the annual rate is $R_{nom}$, then the annualized rate of return, if there are $n$ compounding periods per year, is:

$\left( 1 + \frac{R_{nom}}{n} \right)^n - 1 \to e^{R_{nom}} - 1$ as $n \to \infty$, because $\lim_{n \to \infty} \left( 1 + \frac{a}{n} \right)^n = e^a$

In the SP500 dataset analysis, yield is defined as:

$Y_t = 100 \ln\left( \frac{X_t}{X_{t-1}} \right)$, where $X_t$ is the closing S&P 500 index value on day $t$.

In R, log without a base specification is ln (natural logarithm, base $e$). Therefore, $Y_t$ is the same as $100\,r_t$, and the cumulative sum of the $Y_t$ is equivalent to $100\,r_t(k)$.

36. Autocorrelation

Autocorrelation in a time series $Y_t$, $t = 1, 2, \ldots, n$, at lag $k$ is given by:

$r_k = \mathrm{cor}(Y_t, Y_{t+k}) = \frac{\sum_{t=1}^{n-k} (Y_t - \bar{Y})(Y_{t+k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}$, where $\bar{Y} = \frac{1}{n} \sum_{t=1}^{n} Y_t$.

The lag-one sample autocorrelation coefficient is given by:

$r_1 = \mathrm{cor}(Y_t, Y_{t-1}) = \frac{\sum_{t=2}^{n} (Y_t - \bar{Y})(Y_{t-1} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}$
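A brief R sketch of log returns and autocorrelation on simulated prices; acf() computes the $r_k$ above (up to a slightly different normalization):

set.seed(13)
p <- cumprod(c(100, 1 + rnorm(500, 0, 0.01)))  # simulated daily prices
r <- diff(log(p))   # log returns r_t = ln(P_t / P_{t-1})
y <- 100 * r        # yield as in the SP500 analysis
sum(r[1:20])        # cumulative log return r_t(k) over 20 periods
log(p[21] / p[1])   # same value

acf(y, lag.max = 10)       # sample autocorrelations r_k
cor(y[-1], y[-length(y)])  # approximately the lag-one coefficient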



37. Empirical distribution

Let $F(x)$ be the true unknown CDF. Recall the CDF is $F(x) = \Pr(X \le x)$. Without loss of generality, we can reorder $x_1, x_2, \ldots, x_N$ so that $x_1 \le x_2 \le \cdots \le x_N$. Then the empirical distribution function $\hat{F}_N$ is

$\hat{F}_N(x) = 0$ for $x < x_1$
$\hat{F}_N(x) = \frac{1}{N}$ for $x_1 \le x < x_2$
$\hat{F}_N(x) = \frac{r}{N}$ for $x_r \le x < x_{r+1}$
$\hat{F}_N(x) = \frac{N-1}{N}$ for $x_{N-1} \le x < x_N$
$\hat{F}_N(x) = 1$ for $x_N \le x$

As $N \to \infty$, $\hat{F}_N \to F$. Since we cannot sample from $F$, we sample from $\hat{F}_N$ when we draw bootstrap samples.
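A tiny R illustration: ecdf() gives the empirical CDF, and sampling with replacement from the data is exactly sampling from it (a bootstrap draw); the data are simulated:

set.seed(14)
x <- rnorm(50)
Fhat <- ecdf(x)   # empirical distribution function
Fhat(0)           # proportion of observations <= 0
plot(Fhat)

boot_sample <- sample(x, length(x), replace = TRUE)  # one bootstrap sample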

38. Hypothesis testing - Example setup

[Source: Ted Chang] Two hypotheses: $H_0$ - null hypothesis, $H_a$ - alternative hypothesis.

Either $H_0$ is true or $H_a$ is true, but not both. Important: we need to define a rejection region.

Data is assumed to come from some distribution that is partially unknown. We formulate a hypothesis about the unknown parameter of some random variable's distribution. We assume the type of distribution, but formulate the hypothesis about the value of its parameter, and test the hypothesis against one sample.

For example, consider red-black roulette: $X_i$ = outcome of the $i$th game of red-black roulette, $i = 1, 2, \ldots, n$:

$X_i = 1$ if the house wins the $i$th game
$X_i = 0$ if the house loses the $i$th game

We assume $X_1, \ldots, X_n$ are independent, and each $X_i$ has the probability distribution

$P(X_i = 1) = p$, $P(X_i = 0) = 1 - p$



$p$ is unknown, but it equals $20/38$ if the roulette wheel is properly balanced.

In this example, our hypotheses can be:

$H_0$: $p = 20/38$ (wheel is properly balanced)
$H_a$: $p \neq 20/38$ (wheel is not properly balanced)

Question: Does the data so contradict the truth of $H_0$ that we must accept the truth of $H_a$?

Take one random sample of size $n$, i.e., we play $n$ games. Let

$\hat{p} = \frac{1}{n} \sum_i X_i$ = proportion of games won by the house.

If the number of games $n$ is large and $\hat{p}$ is much different from $20/38$, we would tend to believe that $H_0$: $p = 20/38$ is not true and that $H_a$: $p \neq 20/38$ is true.

39. Hypothesis testing - General form of reasoning

[Source: Ted Chang]

a. If $H_0$ is true, then event $A$ has a very low probability.
Example: If $p = 20/38$ is true, then it is highly improbable that $\hat{p}$ will be much different from $20/38$. $A$ is the event that "$\hat{p}$ is much different from $20/38$."

b. Suppose $A$ occurs; then we "reject $H_0$ and accept $H_a$."
Example: if $\hat{p}$ is much different from $20/38$, we reject $H_0$ and accept $H_a$: $p \neq 20/38$.

c. Suppose $A$ does not occur. Then we "fail to reject $H_0$." This might mean that $H_0$ is true, but it might also mean that the evidence is inconclusive.
Example: Suppose $n = 5$, and we decide to reject $H_0$ if the house wins 0 games or all five games. $A$ is the event that $\hat{p}$ is either 0 or 1. Suppose the house wins 3 games, so $A$ does not occur. Does this imply that $H_0$: $p = 20/38$ is true? NO!

five games. is the event that is either 0 or 1. Suppose the house wins 3 games, so does not occur. Does this imply that : is true? NO!

40. Hypothesis testing - probabilistic reasoning by contradiction

[Source: Ted Chang] The following reasoning is logically correct:

a. If is true, then is false.

b. is true. Therefore is false.

The following reasoning is logically INCORRECT:a. If is true, then is false.

b. is false. Therefore is true.

NO! If is false, then can either be true or false.

Example: If (the animal is a cow) is true, then (the animal has two legs) is false.

P Xi 1=( ) p=

P Xi 0=( ) 1 p–=

p 20 38⁄

H0 p 20 38⁄=

Ha p 20 38⁄≠

H0 Ha

n n

p1n--- Xi

i

∑=

n p 20 38⁄H0 p 20 38⁄= Ha p 20 38⁄≠

H0 A

p 20 38⁄= p

20 38⁄A p 20 38⁄

A H0 Ha

p 20 38⁄ H0 Ha p 20 38⁄≠

A H0 H0

n 5= H0

A p

A H0 p 20 38⁄=

H0 A

A H0

H0 A

A H0

A H0

H0 A


We see an animal with 2 legs (so $A$ is true). Can we conclude that the animal is not a cow ($H_0$ is false)? YES.

We see an animal with 4 legs (so $A$ is false). Can we conclude that the animal is a cow ($H_0$ is true)? NO.

41. How do we choose which hypothesis is $H_0$, and which is $H_a$?

[Source: Ted Chang]

• Using hypothesis testing, we can only establish the truth of $H_a$. So make $H_a$ the hypothesis you want to establish.
• All inconclusive data is resolved in favor of $H_0$. So make $H_0$ the hypothesis you want to "give the benefit of the doubt" to.
• It must be possible to collect data to contradict $H_0$.

Example (red-black roulette): Suppose we are a manufacturer of roulette wheels. We might want to prove that our wheels are fair. So we might want to use:

$H_0$: $p \neq 20/38$
$H_a$: $p = 20/38$

BUT! It is impossible to collect evidence to contradict $H_0$. Even if $p = 20/38 + 10^{-100}$, this does not contradict $H_0$: $p \neq 20/38$. In other words, it is impossible to establish that $p = 20/38$ to infinite precision. The best we can do is to set up the problem as:

$H_0$: $p = 20/38$
$H_a$: $p \neq 20/38$

and hope that we "fail to reject $H_0$," that is, the evidence is not inconsistent with the fairness of the roulette wheel.

42. Hypothesis testing overview and terminology

[Source: Ted Chang]

Question: Does the data so contradict the truth of $H_0$ that we must accept the truth of $H_a$?

Test Statistic: a function of the sample data.

Rejection (Critical) Region: the range of values of the test statistic that would tend to make us doubt $H_0$ and accept $H_a$.

Significance level (denoted $\alpha$): the probability, assuming the truth of $H_0$, that the test statistic would lie in the rejection region. $\alpha$ is usually set to 0.05. $\alpha$ measures how convincing the evidence is when we reject $H_0$ and accept $H_a$. Smaller values of $\alpha$ mean the evidence in favor of $H_a$ is more convincing, i.e., the case for rejecting $H_0$ is stronger. It is the probability of a type-I error, i.e., that $H_0$ is rejected when it is actually true.

Devore, page 313: It is customary to call the data significant when $H_0$ is rejected, and not significant otherwise.

p-value is the significance level at which the data just crosses the border from not significant to significant. If a hypothesis test is done at a level $\alpha$:

• if $\alpha \ge$ p-value, $H_0$ is rejected



• if $\alpha <$ p-value, we fail to reject $H_0$

Refer to item 39:

a. If $H_0$ is true, the probability that the test statistic lies in the rejection region is extremely small ($\alpha$).
Item 39 formulation: If $H_0$ is true, then event $A$ has a very low probability. Event $A$ is the event that the test statistic lies in the rejection region.

b. Suppose the test statistic does lie in the rejection region; then we "reject $H_0$ and accept $H_a$."
Item 39 formulation: Suppose $A$ occurs; then we "reject $H_0$ and accept $H_a$."

c. Suppose the test statistic does not lie in the rejection region; then we "fail to reject $H_0$." This might mean that $H_0$ is true, but it might also mean that the evidence is inconclusive.
Item 39 formulation: Suppose $A$ does not occur. Then we "fail to reject $H_0$."

43. Hypothesis testing - roulette example

[Source: Ted Chang] See Item 38 - red-black roulette.

We spin a wheel $n = 1000$ times. Define $X$ as the number of games won by the house.

$H_0$: $p = 20/38$, approximately 0.526
$H_a$: $p > 20/38$ (the likely charge against the house is that the wheel is rigged in its favor)

Define event $A$ (the event that the test statistic, the number of wins, lies in the rejection region) as the event $X \ge 550$. Therefore, using

$Z = \frac{X - np}{\sqrt{np(1-p)}} = \frac{X - 1000 \cdot \frac{20}{38}}{\sqrt{1000 \cdot \frac{20}{38} \cdot \frac{18}{38}}}$,

$\alpha_1 = P\left( X > 549.5 \mid p = 20/38 \right) = P\left( Z > \frac{549.5 - 1000 \cdot \frac{20}{38}}{\sqrt{1000 \cdot \frac{20}{38} \cdot \frac{18}{38}}} \right) = P(Z > 1.468333) = 0.071$

Change event $A$ to denote $X \ge 600$. For the corresponding rejection region, the significance level is

$\alpha_2 = P\left( X > 595.5 \mid p = 20/38 \right) = P\left( Z > \frac{595.5 - 1000 \cdot \frac{20}{38}}{\sqrt{1000 \cdot \frac{20}{38} \cdot \frac{18}{38}}} \right) = P(Z > 4.381667) = 5.9 \times 10^{-6}$

Since $\alpha_2 < \alpha_1$, the event $A_2$, if it occurs, is more convincing. If out of 1000 games more than 600 games are won, then it is easier to reject $H_0$, i.e., that the wheel is properly balanced. If we make our rejection region 550 games or over, it is less of a strong case to bring charges against the house of an unbalanced wheel. See the effect of the choice of rejection region on the significance level.
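These significance levels can be computed directly in R, both exactly and via the normal approximation:

p0 <- 20/38; n <- 1000
# Rejection region X >= 550
pbinom(549, n, p0, lower.tail = FALSE)   # exact binomial P(X >= 550)
pnorm((549.5 - n * p0) / sqrt(n * p0 * (1 - p0)), lower.tail = FALSE)  # about 0.071
# Rejection region X >= 600
pbinom(599, n, p0, lower.tail = FALSE)   # exact binomial P(X >= 600)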



On why we use a normal approximation: the exact Binomial is more difficult to compute and needs computers. Note the conditions np > 10 and nq > 10, which may not hold if p is close to 0 or 1; in that case the exact Binomial CI is needed. See the two links posted under the Hypothesis lecture.
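A quick numerical check (an R sketch; the numbers are those of Item 43, computed both by the normal approximation and exactly):

# Significance levels for the two rejection regions in Item 43.
n <- 1000; p0 <- 20/38
mu <- n * p0; sdev <- sqrt(n * p0 * (1 - p0))
# Normal approximation with continuity correction
pnorm(549.5, mean = mu, sd = sdev, lower.tail = FALSE)  # alpha1 ~0.071
pnorm(599.5, mean = mu, sd = sdev, lower.tail = FALSE)  # alpha2 ~1.8e-06
# Exact Binomial tail probabilities P(X >= 550) and P(X >= 600)
pbinom(549, n, p0, lower.tail = FALSE)
pbinom(599, n, p0, lower.tail = FALSE)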

44. Type-I and Type-II errors (significance testing)

Two types of errors:
a. α: a Type I error occurs if the null hypothesis H0 is rejected when it is true.
α = Pr(reject H0 | H0 is true), i.e., the probability of rejecting H0 when H0 is true.
b. β: a Type II error occurs if the null hypothesis H0 is not rejected when it is false.
β = Pr(not rejecting H0 | H0 is false).

Significance testing controls the probability of a type I error. 1 − β is called the "power." Usually power calculations are used to set a sample size.

Example (red-black roulette): 1000 spins of the wheel:
H0: p = 20/38 = 0.5263 (wheel is properly balanced)
Ha: p > 20/38 (wheel is improperly balanced in favor of the house)
X is the number of games won by the house.
If we use X ≥ 550 as the rejection region, α = P(X > 549.5 | p = 20/38) = 0.071 (rather high!).
We fail to reject H0 when X ≤ 549.
β(p) = Pr[not rejecting H0 | H0 is false]. Note that β depends upon the true value of p. For example, using the normal approximation,

β(0.53) = P(X ≤ 549 | p = 0.53) = P(Z < (549.5 − 1000·0.53)/√(1000·0.53·0.47)) = P(Z < 1.23551) = 0.8917
β(0.6) = P(X ≤ 549 | p = 0.6) = P(Z < (549.5 − 1000·0.6)/√(1000·0.6·0.4)) = P(Z < −3.25976) = 0.00055

It is more difficult to detect an unbalanced wheel if it is only slightly unbalanced. We can calculate the power at each value of p from

β(p) = P(X ≤ 549) = P(Z < (549.5 − 1000·p)/√(1000·p·(1 − p)))

α is calculated using the value of p in H0; β(p) is calculated using the value of p in Ha.

Rule: Suppose an experiment and sample size are fixed, and a test statistic is chosen. Then decreasing the size of the rejection region to obtain a smaller value of α results in a larger value of β for any particular value p consistent with Ha.

Why is it called power: "The correct decision is to reject a false null hypothesis. There is always some probability that we decide that the null hypothesis is false when it is indeed false. This decision is called the power of the decision-making process. It is called power because it is the decision we aim for. Remember that we are only testing the null hypothesis because we think it is wrong. Deciding to reject a false null hypothesis, then, is the power, inasmuch as we learn the most about populations when we accurately reject false notions of truth.


The power in hypothesis testing is the probability of rejecting a false null hypothesis. Specifically, it is the probability that a randomly selected sample will show that the null hypothesis is false when the null hypothesis is indeed false." From http://www.sagepub.com/upm-data/40007_Chapter8.pdf.
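The type-II error curve above is easy to tabulate (an R sketch using the normal approximation of Item 44; the function name beta.fn is ours, not the source's):

# beta(p) and power 1 - beta(p) for the rejection region X >= 550.
beta.fn <- function(p, n = 1000, cutoff = 549.5) {
  pnorm((cutoff - n * p) / sqrt(n * p * (1 - p)))
}
beta.fn(0.53)                       # ~0.8917
beta.fn(0.60)                       # ~0.00055
1 - beta.fn(seq(0.53, 0.65, 0.01))  # power rises as p moves away from 20/38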

45. Hypothesis testing (significance tests): example from Devore's book

Say an engineer has a new technique that will result in a reduced defect rate. The current state is that 10% of the boards manufactured are defective.
Null hypothesis: H0: p = 0.1, where p is the true proportion of defective boards under the new manufacturing technique.
Alternative hypothesis: Ha: p < 0.1
The reason for not making the null hypothesis p ≥ 0.1 is that it is simpler to decide the rejection region if the null value is a single number. Note what happens to β, the Type-II error probability - it is dependent on p, since p can be any value less than 0.1.
If X is the number of defective boards in a sample of 200 boards, X is a binomial random variable with parameter p = 0.1 under H0, and therefore E[X] = np = 200 × 0.1 = 20.
Choose the rejection region to be [0, 15], which means that if the number of defective boards in a sample of 200 lies in this region, the null hypothesis is rejected.

α = P(H0 is rejected when it is true) = P(0 ≤ X ≤ 15) when X ~ Bin(200, 0.1)

Using R's pbinom(15,200,0.1) function, we find this probability to be 0.14. Therefore, even when the hypothesis is true that the new technique still yields a defective board proportion of 10%, there is a 14% probability that the number of defective boards in a sample of 200 lies between 0 and 15.
The Type-II error probability is a function of p, because if p = 0.05, then β(0.05) = P(X ≥ 16 when X ~ Bin(200, 0.05)), which is equal to 1 − Bin(15, 200, 0.05) = 0.04. If p = 0.06, then β(0.06) = 0.15.
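The same numbers in R (a sketch of the calculations just described):

# Defective-boards example with rejection region [0, 15].
alpha <- pbinom(15, 200, 0.10)        # ~0.14, type-I error probability
beta05 <- 1 - pbinom(15, 200, 0.05)   # beta(0.05) ~0.04
beta06 <- 1 - pbinom(15, 200, 0.06)   # beta(0.06) ~0.15
c(alpha, beta05, beta06)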

46. Summary: The steps of a hypothesis test

Test procedure:
a. Identify the parameter of interest - it could be a population mean, proportion, correlation between two variables, etc.
b. State the null hypothesis H0 about that parameter (pick a null value for the parameter of interest, e.g., the proportion p = 0.1 in the above example).
c. State the alternative hypothesis Ha about that parameter.
d. Choose an appropriate test statistic T and substitute the parameter's null value and other known parameter values, but leave sample-related variables as unknowns.
e. State the rejection region R for the selected significance level α. Choose a rejection region so that
i. T ∈ R is evidence which would tend to contradict H0.


ii. α = P(T ∈ R | H0 is true)
f. Collect the data (sample) and compute the value of the test statistic T by filling in the sample-related variable values.
i. If T ∈ R, we reject H0 and accept Ha; this is strong evidence that Ha is true and H0 is false.
ii. If T ∉ R, we fail to reject H0: either H0 is true, or the evidence is inconclusive.

47. Power and sample size determination for population proportion

[Devore, page 308] If H0: p = p0 is true, then the test statistic

Z = (p̂ − p0)/√(p0(1 − p0)/n)

has approximately a standard normal distribution. But if H0 is not true, and p = p', then Z still has approximately a normal distribution, but its mean and variance are no longer 0 and 1, respectively. Instead,

E[Z] = (p' − p0)/√(p0(1 − p0)/n) and V(Z) = (p'(1 − p')/n)/(p0(1 − p0)/n)

For an upper-tailed test, the probability of a type-II error is β(p') = P(Z < zα | p = p'). Use the above provided mean and variance, standardize, and then refer to the standard normal CDF.

Thus, the sample size n for which a level α test also has β(p') = β at the alternative value p' is:

n = [(zα√(p0(1 − p0)) + zβ√(p'(1 − p')))/(p' − p0)]² for a one-tailed test
n = [(zα/2√(p0(1 − p0)) + zβ√(p'(1 − p')))/(p' − p0)]² for a two-tailed test (an approximate solution)

These calculations are often used in clinical trials.
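For instance (an R sketch; the values p0 = 0.10, p' = 0.05, α = 0.05, β = 0.10 are assumed for illustration):

# Sample size from the one-tailed formula above.
p0 <- 0.10; p1 <- 0.05
za <- qnorm(0.95); zb <- qnorm(0.90)
n <- ((za * sqrt(p0 * (1 - p0)) + zb * sqrt(p1 * (1 - p1))) / (p1 - p0))^2
ceiling(n)   # smallest sample size meeting both error requirements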

48. Concept of P-value

Consider an example where α is chosen to be 0.05 for an upper-tailed test: α = P(Type I error) = P(H0 is rejected when it is true) = 0.05 = P(z ≥ 1.645), where z ~ N(0,1). This means that one sample could have yielded a sample mean that is µ0 + 1.645·(σ/√n) and yet the hypothesis µ = µ0 was true. The probability of this occurring is 5%. This 5% significance level was arbitrarily chosen, and for it the rejection region was defined as z ≥ 1.645. Instead, give the P-value, which is the probability, calculated assuming H0 is true, of obtaining a test statistic value at least as contradictory to H0 as the value that actually resulted (in that sample). The smaller the P-value, the more contradictory is the sample data to H0.

To compute the P-value, replace the last two steps of the test procedure in Item 46 with the following:
Step 5: Calculate the test statistic value for that sample.


Step 6: Determine the P-value corresponding to this test statistic value.
Step 7: If a user wants an α that is smaller than the P-value, then do not reject the hypothesis. But if the user wants a Type I error probability α that is larger than or equal to the P-value, reject the hypothesis. So the smaller the P-value, the easier it is to reject.
In the defective boards example, if we saw a particular value of 15 defective boards, the P-value corresponding to this would be 0.14, because P(X ≤ 15 | p = 0.1) = pbinom(15,200,0.1) = 0.14. Now if the manager is okay with the risk of a type-I error probability of 14% or greater, then the hypothesis is rejected and the new technique deployed. If not, the hypothesis is not rejected. So p-values make it easier for different data scientists to decide whether or not to reject a hypothesis based on their own requirements of Type-I error probabilities.

49. Pearson's χ² statistic

[Devore, page 569] Generalize the binomial to the multinomial by allowing each trial to result in one of k possible outcomes. A hypothesis with k = 3 may be H0: p1 = 0.5, p2 = 0.3, and p3 = 0.2. The alternative states that H0 is not true, that is, at least one of the pi's has a different value from that stated in H0.

Theorem: Provided that npi ≥ 5 for every i (i = 1, 2, ..., k), the Pearson's χ² statistic

χ² = Σi=1..k (Ni − npi)²/(npi) = Σ (observed − expected)²/expected

has approximately a chi-squared distribution with k − 1 df.

Intuitive explanation: we need an overall measure of discrepancy across all k outcomes, and hence we take the obvious square of the deviations, (ni − npi0)², where H0: p1 = p10, ..., pk = pk0. But if np10 = 100 and np20 = 10, then n1 = 95 and n2 = 5 contribute the same squared deviations, yet n1 is only 5% less than expected, while n2 is 50% less! Therefore, we divide by the expected value.

Null hypothesis: H0: p1 = p10, ..., pk = pk0
Alternative hypothesis: Ha: at least one pi does not equal pi0
Test statistic value: χ² = Σi=1..k (ni − npi0)²/(npi0) = Σ (observed − expected)²/expected
Rejection region: χ² ≥ χ²α,k−1, where α is the significance level.

R's prop.test therefore uses this Pearson's chi-squared statistic, with continuity correction.

R Book: When k = 2 (binomial case), the Pearson's χ² statistic reduces to:

χ² = (p̂1 − p1)²/(p1(1 − p1)/n)
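In R (a sketch with hypothetical counts; chisq.test implements exactly this statistic):

# Pearson chi-squared test of H0: p1 = 0.5, p2 = 0.3, p3 = 0.2.
counts <- c(92, 70, 38)                    # hypothetical counts, n = 200
chisq.test(counts, p = c(0.5, 0.3, 0.2))   # X-squared with k - 1 = 2 df
# Single-proportion (k = 2) case, with continuity correction by default:
prop.test(110, 200, p = 0.5)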


50. Relation between hypothesis testing and confidence intervals

R Book: A significance (hypothesis) test with significance level α will reject if and only if the 100(1 − α)% confidence interval around X̄ does not contain µ0. This is for a hypothesis test that the population mean is µ0, i.e., H0: µ = µ0 and Ha: µ ≠ µ0. The test statistic is T = (X̄ − µ)/SE(X̄), and the CI is given by P(−t* ≤ T ≤ t*) = 1 − α. The above uses the mean as the parameter, but the same concept applies for a proportion.
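A small simulated check of this duality (an R sketch):

# The two-sided t test at level 0.05 rejects exactly when the 95% CI
# around the sample mean excludes mu0.
set.seed(8)
x <- rnorm(25, mean = 10.5, sd = 2)
t.test(x, mu = 10)$p.value              # < 0.05 exactly when ...
t.test(x, conf.level = 0.95)$conf.int   # ... this interval excludes 10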

51. Hypothesis about population means

Since the Central Limit Theorem states that the sample mean X̄ is normally distributed irrespective of the population distribution if the sample size is large, if we have a hypothesis that the population mean is µ0, we can construct a null hypothesis, alternative hypothesis, test statistic (this doesn't have to be the sample mean; it could be the median, trimmed mean, average of extreme values, etc.), and rejection region as follows:

Null Hypothesis: H0: µ = µ0
Test statistic value: z = (x̄ − µ0)/(s/√n) (the population distribution can be other than normal, but the sample size should be large)

Alternative Hypothesis | Rejection Region for significance level α test
Ha: µ > µ0 | z ≥ zα (upper-tailed test)
Ha: µ < µ0 | z ≤ −zα (lower-tailed test)
Ha: µ ≠ µ0 | z ≤ −zα/2 or z ≥ zα/2 (two-tailed test)

The above is Case 2 of Item 19. Consider the other two cases. If the population distribution is normal and the standard deviation of the population is known (Case 1), the Z statistic listed above is changed by replacing the sample standard deviation S with the population value σ. The rejection regions are the same as in Case 2.


where Φ is the standard normal CDF.

For Case 3 of Item 19 (normal population distribution but small sample size), use a test statistic that is t-distributed (also see the corollary for moderate departures from normality):

Null Hypothesis: H0: µ = µ0
Test statistic value: t = (x̄ − µ0)/(s/√n)

Alternative Hypothesis | Rejection Region for significance level α test
Ha: µ > µ0 | t ≥ tα,n−1
Ha: µ < µ0 | t ≤ −tα,n−1
Ha: µ ≠ µ0 | t ≤ −tα/2,n−1 or t ≥ tα/2,n−1

p-value:
P(T ≤ t | H0) for Ha: µ < µ0
P(T ≥ t | H0) for Ha: µ > µ0
P(|T − µ0| ≥ |t − µ0| | H0) for Ha: µ ≠ µ0
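In R (a sketch with simulated data; t.test performs the Case 3 test directly):

# One-sample t test of H0: mu = 5 against an upper-tailed alternative.
set.seed(9)
x <- rnorm(15, mean = 5.4, sd = 1)
t.test(x, mu = 5, alternative = "greater")
(mean(x) - 5) / (sd(x) / sqrt(length(x)))   # the t statistic by hand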

52. Power and sample size determination for population mean (normal population with known σ)

Devore, pages 297-298: Consider an upper-tailed test for a population mean (see Item 51). The rejection region is z ≥ zα. This is equivalent to x̄ ≥ µ0 + zα·σ/√n. This means that H0 will not be rejected if x̄ < µ0 + zα·σ/√n. Now let µ' denote a particular value of µ that exceeds the null value µ0. Then,

β(µ') = P(H0 is not rejected when µ = µ')

Alternative Hypothesis | Type-II error probability β(µ') for a significance level α test
Ha: µ > µ0 | Φ(zα + (µ0 − µ')/(σ/√n))
Ha: µ < µ0 | 1 − Φ(−zα + (µ0 − µ')/(σ/√n))
Ha: µ ≠ µ0 | Φ(zα/2 + (µ0 − µ')/(σ/√n)) − Φ(−zα/2 + (µ0 − µ')/(σ/√n))


For the upper-tailed test,

β(µ') = P(X̄ < µ0 + zα·σ/√n when µ = µ') = P((X̄ − µ')/(σ/√n) < zα + (µ0 − µ')/(σ/√n) when µ = µ')

β(µ') = Φ(zα + (µ0 − µ')/(σ/√n)), where Φ(z) = P(Z ≤ z) is the CDF of the standard normal.

As µ' increases, µ0 − µ' becomes more negative, so β(µ') will be small when µ' greatly exceeds µ0 (because the value at which Φ is evaluated will be quite negative).

Suppose we have two restrictions: P(type I error) = α and β(µ') = β for specified values of α, µ', β. Then, for an upper-tailed test, the sample size n should be chosen to satisfy:

Φ(zα + (µ0 − µ')/(σ/√n)) = β, or −zβ = zα + (µ0 − µ')/(σ/√n) (see the above statement about µ0 − µ' being negative).

Thus, the sample size n for which a level α test also has β(µ') = β at the alternative value µ' is:

n = [σ(zα + zβ)/(µ0 − µ')]² for a one-tailed (upper or lower) test
n = [σ(zα/2 + zβ)/(µ0 − µ')]² for a two-tailed test (an approximate solution)
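For instance (an R sketch; µ0 = 100, µ' = 102, σ = 5, α = 0.05, β = 0.10 are assumed values):

# Sample size for an upper-tailed z test from the formula above.
mu0 <- 100; mu1 <- 102; sigma <- 5
za <- qnorm(0.95); zb <- qnorm(0.90)
ceiling((sigma * (za + zb) / (mu0 - mu1))^2)   # ~54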

53. Box-Cox transformation

Ref: http://www.stat.uconn.edu/~studentjournal/index_files/pengfi_s05.pdf
Transform the dependent variable y as follows:

y(λ) = (y^λ − 1)/λ if λ ≠ 0
y(λ) = log(y) if λ = 0    (2)

The aim of this transformation is to ensure that the Classical assumptions are met, i.e., the dependent variable is transformed so that it is normally distributed: Y ~ N(β0 + β1x, σ²).

The reference goes on to say that "clearly not all data could be power-transformed to Normal. Draper and Cox (1969) studied this problem and conclude that even in cases that no power-transformation could bring the distribution to exactly normal, the usual estimates of λ will lead to a distribution that satisfies certain restrictions on the first 4 moments, thus will be usually symmetric." Box-Cox used two methods to make inference on the transformation parameter λ, the first of which is the maximum likelihood estimation method (MLE). A 95% confidence interval is provided for λ. See the above reference for details.

The equations shown in the decline data set explanation notes can be derived from Eq. (2):

prod^λ = β0 + β1·time + error, for λ ≠ 0
log(prod) = β0 + β1·time + error, for λ = 0
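In R, MASS::boxcox profiles the log likelihood over λ (a sketch using the built-in trees data rather than the decline data):

library(MASS)
bc <- boxcox(Volume ~ Height, data = trees, lambda = seq(-1, 1, 0.05))
bc$x[which.max(bc$y)]   # MLE of lambda; the plot also marks a 95% CI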


54. Estimators

An estimator θ̂ is an unbiased estimator of θ if E[θ̂] = θ for every possible value of θ [Devore's book, page 231]. An estimator is robust if it performs well for many underlying population distributions. An estimator is consistent if, as the sample size increases, it gets closer to the true value. One estimator θ̂1 is more efficient than another θ̂2 if both are unbiased and Var(θ̂1) < Var(θ̂2).

The sample variance is an unbiased estimator of the population variance:

S² = (1/(n−1)) Σi=1..n (Xi − X̄)² = (1/(n−1)) Σ(Xi² − 2XiX̄ + X̄²), which reduces to

S² = (1/(n−1)) Σi=1..n Xi² − (n/(n−1)) X̄²

Finding the mean, E[S²] = (1/(n−1)) Σ E[Xi²] − (n/(n−1)) E[X̄²], with

E[Xi²] = Var[Xi] + (E[Xi])² = σ² + µ², and E[X̄²] = Var[X̄] + (E[X̄])² = σ²/n + µ².

E[S²] = (1/(n−1)) n(σ² + µ²) − (n/(n−1))(σ²/n + µ²) = σ².
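A simulation check of unbiasedness (an R sketch; R's var() uses the n − 1 denominator):

set.seed(1)
sims <- replicate(10000, {
  x <- rnorm(20, mean = 5, sd = 2)            # true variance 4
  c(unbiased = var(x), mle = var(x) * 19/20)  # n-1 vs. n denominator
})
rowMeans(sims)   # ~4.0 for the unbiased S^2, ~3.8 for the biased MLE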

55. Maximum likelihood function

Devore, page 246
Let X1, X2, ..., Xn have a joint pmf or pdf

f(x1, x2, ..., xn; θ1, θ2, ..., θm)    (3)

where the parameters θ1, θ2, ..., θm have unknown values. When x1, ..., xn are the observed sample values and (3) is regarded as a function of θ1, θ2, ..., θm, it is called a likelihood function. The maximum likelihood estimates (mle's) θ̂1, θ̂2, ..., θ̂m are those values of θ1, θ2, ..., θm that maximize the likelihood function, so that

f(x1, ..., xn; θ̂1, ..., θ̂m) ≥ f(x1, ..., xn; θ1, ..., θm) for all θ1, θ2, ..., θm

In other words, we find the set of values for the distribution parameters such that the likelihood of obtaining this sample of observed values x1, ..., xn is maximized. The reason the log likelihood is used is that the log of a product term is the sum of the logs of the terms.

Example: Let X1, ..., Xn be a random sample from a normal distribution. Because of independence, the joint pdf is the product of the individual pdfs:

f(x1, ..., xn; µ, σ²) = (1/√(2πσ²))e^(−(x1−µ)²/(2σ²)) ··· (1/√(2πσ²))e^(−(xn−µ)²/(2σ²)) = (1/(2πσ²))^(n/2) e^(−Σ(xi−µ)²/(2σ²))


To maximize the likelihood, take the partial derivatives of the log likelihood function,

ln f(x1, ..., xn; µ, σ²) = −(n/2)ln(2πσ²) − (1/(2σ²))Σ(xi − µ)²

with respect to µ and σ², equate them to zero, and solve for the two parameters:

µ̂ = X̄ and σ̂² = Σ(Xi − X̄)²/n

Therefore the MLE (maximum likelihood estimator) of σ² is not an unbiased estimator (recall that for the unbiased estimator, the denominator is n − 1). The two principles of estimation (unbiasedness and maximum likelihood) yield two different estimators. When n is large, the MLE is approximately unbiased.
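A numerical check (an R sketch; optim maximizes the same log likelihood):

set.seed(2)
x <- rnorm(50, mean = 10, sd = 3)
negloglik <- function(par) -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
fit <- optim(c(mean(x), sd(x)), negloglik)
fit$par                                  # numerical MLEs
c(mean(x), sqrt(mean((x - mean(x))^2)))  # closed forms: x-bar and /n std dev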

56. Durbin-Watson test

Econometrics by Studenmund, pages 322-329
Serial correlation causes the OLS estimates of the standard errors of the regression coefficient estimators (e.g., the SE of β̂1) to be biased, leading to unreliable hypothesis testing about the true regression coefficients. Typically, the bias is negative, which means OLS underestimates the SE; this makes the t-score higher, which makes it more likely to reject the hypothesis that the slope coefficient is 0.

The Durbin-Watson d statistic is used to determine if there is first-order serial correlation in the error term. It should be used only when the following assumptions are met:
a. The regression model includes an intercept term.
b. The serial correlation is first-order in nature: εt = ρεt−1 + ut, where ρ is the coefficient of serial correlation and u is the classical (normally distributed) error term.
c. The regression model does not include a lagged dependent variable as an independent variable.

The equation for the Durbin-Watson d statistic for T observations is:

d = Σt=2..T (et − et−1)² / Σt=1..T et²

a. With extreme positive correlation, because et = et−1, d = 0.
b. With extreme negative correlation, because et = −et−1, d ≈ 4.
c. With no serial correlation, d ≈ 2:

d = Σt=2..T (et − et−1)²/Σt=1..T et² = [Σt=2..T et² − 2Σt=2..T et·et−1 + Σt=2..T et−1²]/Σt=1..T et² ≈ [Σt=2..T et² + Σt=2..T et−1²]/Σt=1..T et² ≈ 2


If there is no serial correlation, then since et and et−1 are not related, Σ et·et−1 will, on average, be 0.

Studenmund, page 17: Note the difference between the residuals, ei = yi − ŷi, the difference between the observed and estimated values, and the error term, εi = Yi − (β0 + β1xi), which is a random variable; recall that β0 and β1 are the true regression coefficients. The error term is a theoretical concept that cannot be observed (a population concept), while residuals are real-world observed values. ei could have been written as ε̂i, as residuals are estimates of the error terms.

Procedure:
a. Obtain the OLS residuals.
b. Determine the sample size and number of explanatory variables, and from statistical tables, find dU and dL, the upper and lower critical values of d for given levels of significance. These are provided in tables, http://www.nd.edu/~wevans1/econ30331/Durbin_Watson_tables.pdf, which is posted on our course web site, or in appendices in Studenmund's book (or other Econometrics books). Also found at http://www.stanford.edu/~clint/bench/dwcrit.htm -- useful for our decline data (with 120 observations).
c. Given the null hypothesis of no serial correlation, and a two-sided alternative hypothesis:
H0: ρ = 0 (no serial correlation)
Ha: ρ ≠ 0 (serial correlation)
the appropriate decision rule is:
if d < dL: Reject H0
if d > 4 − dL: Reject H0
if 4 − dU > d > dU: Do Not Reject H0
otherwise: Inconclusive
d. If testing
H0: ρ = 0 (no serial correlation)
Ha: ρ > 0 (positive serial correlation)
the appropriate decision rule is:
if d < dL: Reject H0
if d > dU: Do Not Reject H0
if dL ≤ d ≤ dU: Inconclusive

Remedies for serial correlation are Generalized Least Squares and Newey-West standard errors (see Studenmund or another Econometrics book).
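Computing d from the OLS residuals is direct (an R sketch with simulated AR(1) errors):

set.seed(3)
x <- 1:100
y <- 2 + 0.5 * x + arima.sim(list(ar = 0.6), n = 100)  # positively correlated errors
e <- residuals(lm(y ~ x))
sum(diff(e)^2) / sum(e^2)   # d well below 2 signals positive serial correlation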

57. Adjusted R-squared



R̄² = 1 − [Σi (yi − ŷi)²/(N − K − 1)] / [Σi (yi − ȳ)²/(N − 1)]

where K is the number of independent variables (not counting the intercept). This removes the effect of improving R-squared simply by adding a variable just because the df decreases (the degrees of freedom is N − K − 1), making it a better measure even if the added variable is irrelevant.
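The formula matches what lm reports (an R sketch with the built-in mtcars data):

fit <- lm(mpg ~ wt + hp, data = mtcars)
N <- nrow(mtcars); K <- 2
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - (rss / (N - K - 1)) / (tss / (N - 1))   # equals summary(fit)$adj.r.squared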

58. ANOVA

ANOVA (Analysis of variance) is a generalization of the t-test for two independent samples that allows us to compare the means of several independent samples. The goal of one-way analysis of variance is to decide whether the difference in the sample means is indicative of a difference in the population means of each sample, or is attributable to sampling variation. We have I samples, each from a different population (Verzani).

The rest of this section is from Devore's book:
Assumptions:

Xij = µi + εij,    (4)

where Xij is a rv that denotes the jth measurement from the ith population, or the measurement taken on the jth experimental unit that receives the ith treatment, and the εij are independent with εij ~ N(0, σ²). The I population or treatment distributions are all normal with the same variance σ², i.e., each Xij is normally distributed with E[Xij] = µi and V(Xij) = σ².

Hypothesis: H0: µ1 = µ2 = ... = µI and Ha: µi ≠ µj for at least one pair i and j.

Test statistic F = MSTr/MSE, where

Mean square for treatments MSTr = (J/(I−1)) Σi (X̄i· − X̄··)²
Mean square for error MSE = (S1² + S2² + ... + SI²)/I
Sample means X̄i· = Σj=1..J Xij / J, i = 1, 2, ..., I
Grand mean X̄·· = Σi Σj Xij / (IJ)


Sample variances Si² = Σj=1..J (Xij − X̄i·)²/(J − 1), i = 1, 2, ..., I

When H0 is true, the test statistic F has an F distribution with ν1 = I − 1 and ν2 = I(J − 1). With f denoting the computed value of F, the rejection region f ≥ Fα,I−1,I(J−1) then specifies a test with significance level α. The null hypothesis value is F = 1 (MSTr and MSE are equal).

Explanation: we compare the "between-samples" variation to a measure of variation computed "within" each of the samples. If H0 is true, i.e., the sample means are equal and equal to the grand mean, it implies that MSTr should be small. However, if the µi's are different, some X̄i· will differ significantly from the grand mean. So MSTr is affected by whether H0 is true or false. But MSE will stay unchanged irrespective of whether H0 is true or false, because Si² depends only on σ², and not on where the various distributions are centered.

Proposition: When H0 is true,

E[MSTr] = E[MSE] = σ²

but when H0 is false,

E[MSTr] > E[MSE] = σ²

Both are unbiased estimates of the population variance σ² when H0 is true, but if H0 is false, MSTr will overestimate σ².

Each mean X̄i· has variance σ²/J. The sample variance of these means is unbiased if we divide by I − 1 (see item 54). Therefore, Σi (X̄i· − X̄··)²/(I − 1) estimates σ²/J, which explains why there is a multiplier of J in the MSTr definition.

Total sum of squares SST = Σi Σj (xij − x̄··)²
Treatment sum of squares SSTr = Σi Σj (x̄i· − x̄··)²
Error sum of squares SSE = Σi Σj (xij − x̄i·)²

Identities: SST = SSTr + SSE; MSTr = SSTr/(I − 1); MSE = SSE/(I(J − 1))


SSE measures variation that would be present even if H0 is true. SSTr is the amount of variation that can be explained by possible differences in the mean values.

ANOVA table:
Source of variation | df | Sum of squares | Mean square | F statistic | p-value
Treatments | I − 1 | SSTr | MSTr = SSTr/(I−1) | F = MSTr/MSE | Pr(>F)
Error | I(J − 1) | SSE | MSE = SSE/(I(J−1)) | |
Total | IJ − 1 | SST | | |

The p-value is calculated from P(F ≥ observed value | H0).
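In R (a sketch with simulated data, I = 3 treatments and J = 10 observations each):

set.seed(4)
dat <- data.frame(group = gl(3, 10),
                  y = rnorm(30, mean = rep(c(5, 5, 7), each = 10)))
fit <- aov(y ~ group, data = dat)
summary(fit)   # df, SSTr/SSE, MSTr/MSE, F = MSTr/MSE, and Pr(>F)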

59. Tukey's procedure (T method or Tukey Honest Significant Differences method)

Devore's book
If H0 is rejected in an ANOVA test, then we will most likely want to know which of the I population means differ. This is called a multiple comparisons method. If I = 4, then there are six required CIs for the pairwise mean differences: µ1 − µ2, µ1 − µ3, µ1 − µ4, µ2 − µ3, µ2 − µ4, µ3 − µ4. If an interval does not contain 0, the two µ's are considered to differ significantly. But instead of running the pairwise t test multiple times, we execute Tukey's method, which controls a "simultaneous" confidence level for all I(I − 1)/2 intervals. Later we explain what "simultaneous" means.

Studentized range distribution: Let Qα,m,ν denote the upper-tail α critical value of the Studentized range distribution with m numerator df and ν denominator df. [In R, this distribution can be called with ptukey.]

Proposition: With probability 1 − α,

X̄i· − X̄j· − Qα,I,I(J−1)·√(MSE/J) ≤ µi − µj ≤ X̄i· − X̄j· + Qα,I,I(J−1)·√(MSE/J)

for every i and j (i = 1, ..., I and j = 1, ..., I) with i < j.

[Note that i and j are two indices for identifying the population and hence both go from 1 to I.] In other words, the proposition gives us the confidence intervals for the differences in the population means on a pairwise basis with i < j.

Interpretation of the simultaneous confidence level: If we compute two 95% confidence intervals for two independent population characteristics (two means or one mean and one proportion, etc.), then the simultaneous or joint confidence level is roughly 0.95², which is approx. 0.9. With three, the number is 100(0.95³), which is roughly 86%. Applying Tukey's method, the samples are not independent - MSE appears in every case, and the


various intervals share the same sample means (x̄i·). There is no simple way to compute the simultaneous confidence level from the individual confidence levels. All that can be said is that to obtain a 95% simultaneous level, the individual level for each interval must be considerably larger than 95%. In other words, to obtain a 5% experimentwise or family error rate, the individual per-comparison error rate for each interval must be considerably smaller than 0.05.

Devore, page 481: Consider CIs for two population parameters that are related; e.g., the CIs for µY·x*, the expected value of Y when x = x*, for two values of x* are not independent since both depend on β̂0 and β̂1 (see Item 22). But we can say that the joint confidence level on the resulting pair of CIs for µY·v and µY·w is at least 100(1 − 2α)% if the individual CIs are 100(1 − α)%. This can be generalized to k CIs, where the joint confidence level is at least 100(1 − kα)%. This result is based on the Bonferroni inequality, and hence the joint CIs are called Bonferroni intervals. In R, help(p.adjust) to see that "The adjustment methods include the Bonferroni correction ("bonferroni") in which the p-values are multiplied by the number of comparisons."
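In R, TukeyHSD computes these simultaneous intervals (a sketch continuing the aov fit from the Item 58 example above):

TukeyHSD(fit, conf.level = 0.95)    # all pairwise mu_i - mu_j intervals
qtukey(0.95, nmeans = 3, df = 27)   # the Q critical value they are based on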

60. Unequal sample sizes

Let J1, J2, ..., JI denote the I sample sizes, and let n = Σi Ji. ANOVA formulas:

SST = Σi Σj=1..Ji (xij − x̄··)², df = n − 1
SSTr = Σi Σj=1..Ji (x̄i· − x̄··)², df = I − 1
SSE = Σi Σj=1..Ji (xij − x̄i·)², df = Σi (Ji − 1) = n − I

Test statistic value: f = MSTr/MSE, where MSTr = SSTr/(I − 1) and MSE = SSE/(n − I)
Rejection region: f ≥ Fα,I−1,n−I

Tukey's confidence intervals for multiple comparisons:

X̄i· − X̄j· − Qα,I,n−I·√((MSE/2)(1/Ji + 1/Jj)) ≤ µi − µj ≤ X̄i· − X̄j· + Qα,I,n−I·√((MSE/2)(1/Ji + 1/Jj))
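The same machinery handles unequal Ji (an R sketch):

set.seed(10)
g <- factor(rep(1:3, times = c(5, 8, 7)))   # J1 = 5, J2 = 8, J3 = 7, n = 20
y <- rnorm(20, mean = c(5, 6, 7)[g])
anova(lm(y ~ g))   # df are I - 1 = 2 and n - I = 17, matching the formulas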

61. Skewness:

Install and load the e1071 library; help(skewness) for a definition. Type-1 skewness is

g1 = m3 / m2^(3/2)

where m3 is the third moment and m2 is the second moment,


with mk = (1/n) Σi=1..n (xi − x̄)^k the kth moment of the sample.

A NIST Web site http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm states "Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail."

From Devore's book:
A nonnormal distribution can fall into one of three categories:
a. Symmetric but with lighter tails than normal: S-shaped, with the smallest points above and the largest points below a straight line that can be drawn through most of the points.
b. Symmetric but with heavier tails than normal: the opposite S-shaped curve.
c. Skewed. Positively skewed: concave; the smallest and largest points above a straight line that can be drawn through most of the points. The opposite is true for negatively skewed.
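Checking the definition against e1071 (an R sketch with a right-skewed sample):

library(e1071)
set.seed(5)
x <- rexp(1000)   # right-skewed, so g1 should be positive
m2 <- mean((x - mean(x))^2); m3 <- mean((x - mean(x))^3)
c(by.hand = m3 / m2^(3/2), e1071 = skewness(x, type = 1))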

62. Multinomial distribution

This is a generalization of the binomial distribution in which there are k possible outcomes. If the random variable Yi is the number of times outcome number i is observed over n trials, the vector Y = (Y1, Y2, ..., Yk) follows a multinomial distribution with parameters n and vector Π = (p1, p2, ..., pk):

P(Y1 = y1, Y2 = y2, ..., Yk = yk) = [n!/(y1! y2! ... yk!)] p1^y1 p2^y2 ... pk^yk

The textbook gives a probability in the goodness-of-fit test chapter, but the equation provided is for a system without replacement. It is similar to the hypergeometric distribution, which is the probability of k successes in n draws without replacement (in contrast to the binomial, which assumes replacement).

The chi-squared significance test for goodness of fit (From Verzani's R book):
Let y1, y2, ..., yn be the observed cell counts in a table that arise from random sampling. Suppose their joint distribution is described by the multinomial distribution with probabilities p1, p2, ..., pn. A significance test of

H0: p1 = π1, ..., pk = πk, where the πi are the specified probabilities
Ha: pi ≠ πi for at least one i

can be performed with the χ² statistic with k − 1 degrees of freedom. Expected cell counts should all be five or more. Large values of the statistic support the alternative.
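The pmf above is available directly (an R sketch):

dmultinom(c(5, 3, 2), size = 10, prob = c(0.5, 0.3, 0.2))
# same as 10!/(5!*3!*2!) * 0.5^5 * 0.3^3 * 0.2^2; the gof test itself is
# chisq.test(counts, p = c(0.5, 0.3, 0.2)) as in Item 49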

63. Goodness-of-fit test for continuous distributions

The Kolmogorov-Smirnov goodness-of-fit test (From Verzani's R book):
Assume x1, x2, ..., xn is an i.i.d. sample from a continuous distribution with CDF F(x). Let Fn(x) be the empirical CDF,

Fn(x) = #{i: xi ≤ x}/n

A significance test of


H0: F(x) = F0(x) versus Ha: F(x) ≠ F0(x), where F0(x) is a specified distribution,

can be constructed with the test statistic

D = the maximum over x of |Fn(x) − F(x)|

Large values of D support the alternative hypothesis.

The K-S test works when the null hypothesis is fully specified without looking at the data. One cannot assume population parameter values by looking at the sample.

Shapiro-Wilk test for normality (From Verzani's R book):
Assume x1, x2, ..., xn is an i.i.d. sample from a continuous distribution. A significance test of

H0: the parent distribution is normal
Ha: the parent distribution is not normal

can be carried out with the Shapiro-Wilk test statistic.
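Both tests are built in (an R sketch; note the fully specified null for ks.test):

set.seed(6)
x <- rnorm(100)
ks.test(x, "pnorm", 0, 1)   # H0: F = N(0,1), specified without using x
shapiro.test(x)             # H0: parent distribution is normal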

64. Two-way ANOVA with Kij = 1

Devore book:
When factor A consists of I levels and factor B consists of J levels, there are IJ different combinations of levels of the two factors, each called a treatment. Kij = 1 implies that there is only one observation at level i of factor A and level j of factor B. This means the total number of observations is IJ. Consider a fixed-effects model

Xij = µ + αi + βj + εij    (5)

where Σi=1..I αi = 0, Σj=1..J βj = 0, and the εij's are assumed independent, normally distributed, with mean 0 and variance σ². This is an additive model. While an ideal model would be Xij = µij + εij, there are IJ means (µij) and one variance term for εij, which means there is a total of IJ + 1 unknowns in the model. However, with the Kij = 1 assumption, we have only IJ observations. Therefore, this ideal model's parameters cannot be estimated. Instead we choose an additive model, Xij = αi + βj + εij, with i = 1, 2, ..., I and j = 1, 2, ..., J, so that µij = αi + βj. This model has only I + J + 1 parameters and so can be solved if I ≥ 3 and J ≥ 3. To eliminate non-uniqueness (whereby different sets of αi and βj values would be equally feasible solutions), the model is rewritten as (5).

Unbiased estimators for the parameters are:

µ̂ = X̄··, α̂i = X̄i· − X̄··, β̂j = X̄·j − X̄··

There are two different hypotheses in a two-factor experiment with Kij = 1:

H0A: α1 = α2 = ... = αI = 0 versus HaA: at least one αi ≠ 0
H0B: β1 = β2 = ... = βJ = 0 versus HaB: at least one βj ≠ 0

No factor A effect implies that all the αi's are equal, and hence all 0 since they sum to 0. The same applies for factor B.

Definition:


SST = Σi Σj (xij − x̄··)², with df = IJ − 1
SSA = Σi Σj (x̄i· − x̄··)², with df = I − 1
SSB = Σi Σj (x̄·j − x̄··)², with df = J − 1
SSE = Σi Σj (xij − x̄i· − x̄·j + x̄··)², with df = (I − 1)(J − 1)

and SST = SSA + SSB + SSE.

Hypotheses | Test Statistic Value | Rejection Region
H0A vs. HaA | fA = MSA/MSE | fA ≥ Fα,I−1,(I−1)(J−1)
H0B vs. HaB | fB = MSB/MSE | fB ≥ Fα,J−1,(I−1)(J−1)

where MSA = SSA/(I − 1), MSB = SSB/(J − 1), and MSE = SSE/((I − 1)(J − 1)).
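In R (a sketch with simulated data, one observation per cell):

set.seed(7)
d <- expand.grid(A = gl(3, 1), B = gl(4, 1))   # I = 3, J = 4, IJ = 12 cells
d$y <- 2 + rep(c(-1, 0, 1), 4) + rep(c(-2, 0, 1, 1), each = 3) + rnorm(12, sd = 0.5)
summary(aov(y ~ A + B, data = d))   # rows give fA = MSA/MSE and fB = MSB/MSE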

References
[1] J. Verzani, "Using R for Introductory Statistics," Chapman & Hall/CRC.
[2] J. Devore, "Probability and Statistics for Engineering and the Sciences," Seventh Edition.
[3] http://www.ms.uky.edu/~mai/biostat277/smsp.ppt
[4] K. S. Trivedi, "Probability, Statistics with Reliability, Queueing and Computer Science Applications," First Edition, Prentice Hall, ISBN 0-13-711564-4.
