9/26/2018 Statistics - Lecture 04
file:///users/home/npaul/enseignement/esbs/2018-2019/cours/04/index.html#40 1/40
Statistics - Lecture 04
Nicodème Paul, Faculté de médecine, Université de Strasbourg
Correlation

In many situations the objective in studying the joint behavior of two variables is to see whether they are related.

Given $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, it is natural to speak of $X$ and $Y$ having a positive relationship if large x's are paired with large y's and small x's with small y's. Similarly, if large x's are paired with small y's and small x's with large y's, then a negative relationship between the variables is implied.

- X: temperature and Y: body length in drosophila
- X: Height and Y: Weight
- X: parent Height and Y: child Height
- X: eye Color and Y: hair Color
- X: Color and Y: perceived Taste
Correlation

In a population, we collect a random sample of 38 individuals. For each individual, we measure his or her height (mm) and his or her weight. The first ten observations are:

        Height  Weight
    1     1629    71.0
    2     1569    56.5
    3     1561    56.0
    4     1619    61.0
    5     1566    65.0
    6     1639    62.0
    7     1494    53.0
    8     1568    53.0
    9     1540    65.0
    10    1530    57.0
Independence

Is the variable Height independent of the variable Weight?

- No
- Yes
- I cannot tell
Correlation

A bivariate data set consists of measurements or observations on two variables, X and Y.
Sample correlation

The Pearson sample correlation coefficient $r$ of a bivariate data set $(x_i, y_i)$, for $i = 1, \dots, n$, is given by:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
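As a sketch, the formula can be computed directly; the function name and sample data below are illustrative, not from the lecture:

```python
import math

# Direct computation of the Pearson sample correlation coefficient:
# covariance term over the square root of the two sum-of-squares terms.
def pearson_r(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear, upward-sloping data gives r = 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # → 1.0
```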
Sample correlation

Some properties of the Pearson correlation coefficient are as follows:

- The value of $r$ does not depend on the unit of measurement for either variable.
- The value of $r$ does not depend on which of the two variables is labeled $X$.
- The value of $r$ is between $-1$ and $1$.
- A correlation coefficient of $r = 1$ occurs only when all the points in a scatterplot of the data lie exactly on a straight line that slopes upward. Similarly, $r = -1$ only when all the points lie exactly on a downward-sloping line.
Examples of scatterplots

Possible values: -1, -0.8, -0.4, 0, 0.4, 0.8, 1
Examples of scatterplots with correlation coefficients
Hypothesis testing on ρ

- Let $X_i$ and $Y_i$ be jointly normal random variables.
- $R(X, Y) = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \; \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$ is an estimator for $\rho$.
- Hypotheses: $H_0: \rho = 0$ vs $H_1: \rho \neq 0$
- Test statistic: $T = R \sqrt{\dfrac{n-2}{1-R^2}}$ follows a t distribution with $n - 2$ degrees of freedom.
- The decision is based on the critical value $t_{n-2,\,1-\alpha/2}$ or on the p-value $= P_{H_0}(|t_{n-2}| > |T|)$.
Example - linear relationship

- For the height-weight data, $r = 0.62$ and $n = 38$, so the value of the t statistic is:

$$ t = r \sqrt{\frac{n-2}{1-r^2}} = 4.74 $$

- For $\alpha = 0.05$, $t_{36,\,0.975} = 2.028$. As $4.74 > 2.028$, we reject the null hypothesis, meaning that there is strong evidence that $\rho$, the population parameter, is significantly different from $0$.
- Note that you can neither find this critical value nor calculate the p-value using the t table provided: there is no entry for the t distribution with 36 degrees of freedom in the table.
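The test statistic above can be checked with a short sketch (plain Python; `r` and `n` are the values from the example):

```python
import math

# Test statistic for H0: rho = 0, computed from the sample correlation.
def t_statistic(r, n):
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = t_statistic(0.62, 38)
print(round(t, 2))  # → 4.74
print(t > 2.028)    # exceeds the critical value t_{36, 0.975} → True
```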
If a significant sample correlation coefficient between two variables X and Y is observed, what could be its meaning?

- X causes Y
- Y causes X
- Some third factor, either directly or indirectly, causes both X and Y
- An unlikely event has occurred and a large sample correlation coefficient has been generated by chance from a population in which X and Y are, in fact, not correlated
- The correlation is purely nonsensical, a situation that may arise when measurements of X and Y are not taken on a common unit of association
Goodness of fit - checking for normality

- A normal probability plot is a scatterplot of the (normal score $z_i$, observed value $x_i$) pairs for $i = 1, 2, \dots, n$. The empirical cumulative distribution is defined as:

$$ P(X \le x_i) = \begin{cases} 1 - 0.5^{1/n} & \text{if } i = 1 \\ 0.5^{1/n} & \text{if } i = n \\ \dfrac{i - 0.3175}{n + 0.365} & \text{otherwise} \end{cases} $$

- The scores $z_i$ are found such that $P(Z \le z_i) = P(X \le x_i)$.
- In software packages, the graphical representation can be obtained using a quantile-quantile plot.
- A strong linear pattern in a normal probability plot suggests that population normality is plausible. On the other hand, systematic departure from a straight-line pattern (such as curvature in the plot) indicates that it is not reasonable to assume that the population distribution is normal.
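As a sketch, the normal scores defined above can be computed with the standard library (the function name is illustrative):

```python
from statistics import NormalDist

def normal_scores(n):
    # Plotting positions from the piecewise empirical CDF above,
    # then the scores z_i such that P(Z <= z_i) = P(X <= x_i).
    def position(i):
        if i == 1:
            return 1 - 0.5 ** (1 / n)
        if i == n:
            return 0.5 ** (1 / n)
        return (i - 0.3175) / (n + 0.365)
    z = NormalDist()  # standard normal
    return [z.inv_cdf(position(i)) for i in range(1, n + 1)]

# Plotting the sorted sample against these scores gives the normal
# probability (quantile-quantile) plot described above.
scores = normal_scores(10)
print(len(scores))  # → 10
```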
Checking for normality
Assess normality with Shapiro test

Hypotheses: $H_0: L(X) = N$ vs $H_1: L(X) \neq N$

    Shapiro-Wilk normality test
    data: rnorm(50, mean = 1, sd = 3)
    W = 0.97627, p-value = 0.4073

    Shapiro-Wilk normality test
    data: rf(50, df1 = 3, df2 = 2)
    W = 0.67196, p-value = 2.687e-09
Assess normality with Shapiro test

Hypotheses: $H_0: L(X) = N$ vs $H_1: L(X) \neq N$

    Shapiro-Wilk normality test
    data: scatdat$Height
    W = 0.93336, p-value = 0.02562

    Shapiro-Wilk normality test
    data: scatdat$Weight
    W = 0.95284, p-value = 0.1105
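The outputs above come from R's `shapiro.test`. An equivalent check in Python, assuming SciPy is available, might look like this (the random draws are illustrative, not the lecture's data):

```python
import numpy as np
from scipy import stats

# Analogous to shapiro.test(rnorm(...)) and shapiro.test(rf(...)) in R.
rng = np.random.default_rng(0)

normal_sample = rng.normal(loc=1, scale=3, size=50)
w, p = stats.shapiro(normal_sample)        # normal data

skewed_sample = rng.f(3, 2, size=50)       # F(3, 2) data, far from normal
w2, p2 = stats.shapiro(skewed_sample)
print(p, p2)  # the second p-value is typically essentially zero
```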
Dealing with non-normal data

- Transform the data with the log or the square root function.
- The Wilcoxon signed rank test can be used in place of a 1-sample or paired t-test.
- The Wilcoxon rank sum test substitutes for the 2-sample t-test.
- Bootstrapping (testing for a population mean):
  1. Draw a new sample of n observations, with replacement, from the original set of n.
  2. Calculate the mean of this new sample of n.
  3. Repeat steps (1) and (2) an arbitrarily large number of times, say 5000 times.
  4. Use the estimated distribution of the sample mean for inference.
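The bootstrap steps above can be sketched with the standard library; the data values and the choice of a percentile interval are illustrative:

```python
import random
import statistics

def bootstrap_means(sample, reps=5000, seed=0):
    # Steps 1-3: resample n observations with replacement from the
    # original n, record each resample's mean, repeat many times.
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(reps)]

data = [53.0, 56.5, 57.0, 61.0, 62.0, 65.0, 65.0, 71.0]  # illustrative
means = sorted(bootstrap_means(data, reps=2000))
# Step 4: use the estimated distribution, e.g. a 95% percentile interval.
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(lo < statistics.mean(data) < hi)  # → True
```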
Goodness of fit

Suppose data generated from an experiment give the following frequencies for the values of X:

    X            1    2    ...  k
    Frequencies  n1   n2   ...  nk

We want to test the hypotheses:

$$ H_0: L(X) = (p_1, p_2, \dots, p_k) \quad \text{vs} \quad H_1: L(X) \neq (p_1, p_2, \dots, p_k) $$

with a significance level $\alpha$.
Goodness of fit

Genetics

Consider two different characteristics of tomatoes: leaf shape and plant size. The leaf shape may be potato-leafed or cut-leafed, and the plant may be tall or dwarf.
Goodness of fit - Genetics

- Tall cut-leaf (TTCC, TTCc, TtCc, TtCC): $\tfrac{9}{16}$; Tall potato-leaf (TTcc, Ttcc): $\tfrac{3}{16}$
- Dwarf cut-leaf (ttCC, ttCc): $\tfrac{3}{16}$; Dwarf potato-leaf (ttcc): $\tfrac{1}{16}$
Notation

- Given a categorical random variable $X$ with $k$ possible values ($k$ different levels, categories, or cells): $L(X) = P_0 = (p_1, p_2, \dots, p_k)$
- $p_i$ is the true proportion of category $i$, and we have $p_1 + p_2 + \dots + p_k = 1$.
- Here $X \in \{1, 2, 3, 4\}$, with tall cut-leaf: 1, tall potato-leaf: 2, dwarf cut-leaf: 3, dwarf potato-leaf: 4.
Goodness of fit

Under the null hypothesis, given a sample size $n$, the expected count for category $i$ is $n \times p_i$. We can define the goodness-of-fit statistic, KH, as:

$$ KH = \sum_{i=1}^{k} \frac{(n_i - n \times p_i)^2}{n \times p_i} $$

Expressed differently, referring to each category as a cell, we write:

$$ KH = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}} $$

Under $H_0$, with additional conditions such as $n \ge 50$ and $n p_i \ge 5$ for $i = 1, \dots, k$, we have:

$$ L(KH) \approx \chi^2_{k-1-d} $$

where $d$ is the number of estimated parameters in the null hypothesis.
Goodness of fit - Example

Suppose we perform an experiment with tomato plants, and we observe the following counts: tall cut-leaf 926, tall potato-leaf 288, dwarf cut-leaf 293, dwarf potato-leaf 104 ($n = 1611$).

We want to test:

$$ H_0: L(X) = \left(\tfrac{9}{16}, \tfrac{3}{16}, \tfrac{3}{16}, \tfrac{1}{16}\right) \quad \text{vs} \quad H_1: L(X) \neq \left(\tfrac{9}{16}, \tfrac{3}{16}, \tfrac{3}{16}, \tfrac{1}{16}\right) $$

with a level of significance $\alpha = 0.01$.
Goodness of fit

We verify the conditions: $n = 1611 \ge 50$ and $n p_i \ge 5$ for every category (the smallest expected count is $100.6875$).

The value of the test statistic:

$$ kh = \frac{(926 - 906.1875)^2}{906.1875} + \frac{(288 - 302.0625)^2}{302.0625} + \frac{(293 - 302.0625)^2}{302.0625} + \frac{(104 - 100.6875)^2}{100.6875} \approx 1.47 $$
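The kh computation above can be checked with a short sketch in plain Python:

```python
# Goodness-of-fit statistic for the tomato example, computed directly.
observed = [926, 288, 293, 104]        # counts from the example
probs = [9/16, 3/16, 3/16, 1/16]       # Mendelian 9:3:3:1 ratio under H0
n = sum(observed)                      # 1611
expected = [n * p for p in probs]
kh = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(kh, 2))  # → 1.47
```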
Goodness of fit

For a significance level $\alpha$, we calculate the critical value $c_\alpha$ such that $P(\chi^2_{k-1-d} > c_\alpha) = \alpha$. We would fail to reject $H_0$ if $kh < c_\alpha$.

In the example, $kh \approx 1.47$, $d = 0$, $\alpha = 0.01$, and the number of degrees of freedom is $k - 1 - d = 3$.

The critical value is 11.345. Therefore, we fail to reject the null hypothesis. The p-value is approximately 0.69.
Contingency table

A rectangular table used to summarize a categorical data set; two-way tables are used to compare several populations on the basis of a categorical variable or to determine if an association exists between two categorical variables.

Example of a contingency table:
Review

Given two discrete random variables $X$ and $Y$, the joint distribution is defined by $p_{ij} = P(X = x_i, Y = y_j)$:

    X \ Y   y1    y2    ...  ym
    x1      p11   p12   ...  p1m
    x2      p21   p22   ...  p2m
    ...     ...   ...   ...  ...
    xl      pl1   pl2   ...  plm

- The marginal distribution of $X$: $P(X = x_i) = p_{i.} = \sum_{j=1}^{m} p_{ij}$
- The marginal distribution of $Y$: $P(Y = y_j) = p_{.j} = \sum_{i=1}^{l} p_{ij}$
- If $X$ and $Y$ are independent: $p_{ij} = p_{i.} \times p_{.j}$
Testing independence

The sample data observed from $(X, Y)$ form a contingency table:

    X \ Y   y1    y2    ...  ym
    x1      n11   n12   ...  n1m
    x2      n21   n22   ...  n2m
    ...     ...   ...   ...  ...
    xl      nl1   nl2   ...  nlm

- The marginal frequencies for $X$: $n_{i.} = \sum_{j=1}^{m} n_{ij}$
- The marginal frequencies for $Y$: $n_{.j} = \sum_{i=1}^{l} n_{ij}$
- If $X$ and $Y$ are independent: $n_{ij} \approx \dfrac{n_{i.} \times n_{.j}}{n}$
Testing independence

- $H_0$: X and Y are independent, i.e. $p_{ij} = p_{i.} \times p_{.j}$ with $i = 1, 2, \dots, l$ and $j = 1, 2, \dots, m$
- $H_1$: X and Y are not independent
- The number of parameters to estimate is $d = (l - 1) + (m - 1)$.
- Test statistic:

$$ KH = \sum_{i=1}^{l} \sum_{j=1}^{m} \frac{\left(n_{ij} - \frac{n_{i.} n_{.j}}{n}\right)^2}{\frac{n_{i.} n_{.j}}{n}} $$

- Under $H_0$, when $n \ge 50$ and $\frac{n_{i.} n_{.j}}{n} \ge 5$ for all $i, j$:

$$ L(KH) \approx \chi^2_{(l-1)(m-1)} $$
Example

Let $X$ be a random variable with values $F$ (Family) and $S$ (alone), and $Y$ another random variable with values $P$ (Present) and $A$ (Absent). From the sample, we can estimate:

$$ P(F, P) \approx \tfrac{40}{260}, \quad P(F, A) \approx \tfrac{60}{260}, \quad P(S, P) \approx \tfrac{100}{260}, \quad P(S, A) \approx \tfrac{60}{260} $$

$$ P(F) \approx \tfrac{100}{260}, \quad P(S) \approx \tfrac{160}{260}, \quad P(P) \approx \tfrac{140}{260}, \quad P(A) \approx \tfrac{120}{260} $$
Example

$X$ and $Y$ are independent if:

$$ P(X = x, Y = y) = P(X = x) \times P(Y = y) $$

- $H_0$: X and Y are independent
- $H_1$: X and Y are not independent
Example

$$ P(F) \times P(P) \approx \tfrac{100}{260} \times \tfrac{140}{260}, \quad P(F) \times P(A) \approx \tfrac{100}{260} \times \tfrac{120}{260} $$

$$ P(S) \times P(P) \approx \tfrac{160}{260} \times \tfrac{140}{260}, \quad P(S) \times P(A) \approx \tfrac{160}{260} \times \tfrac{120}{260} $$
Example

$$ \text{expected cell count} = \frac{(\text{row marginal total}) \times (\text{column marginal total})}{\text{grand total}} $$

$$ df = (\text{number of rows} - 1) \times (\text{number of columns} - 1) $$

$$ kh = \frac{(40 - 54)^2}{54} + \frac{(60 - 46)^2}{46} + \frac{(100 - 86.4)^2}{86.4} + \frac{(60 - 73.6)^2}{73.6} \approx 12.54 $$

The $\chi^2$ distribution has $(2 - 1) \times (2 - 1) = 1$ degree of freedom. $c_\alpha = 3.841$ for $\alpha = 0.05$. We therefore reject the null hypothesis.
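As a sketch, the whole independence computation for this example can be reproduced in plain Python (rounding only at the end, rather than rounding the expected counts as the slides do; both give 12.54):

```python
# Chi-square test of independence for the 2x2 example:
# rows = Family/Alone, columns = Present/Absent.
table = [[40, 60],     # Family: Present, Absent
         [100, 60]]    # Alone:  Present, Absent
n = sum(sum(row) for row in table)            # 260
row_tot = [sum(row) for row in table]         # [100, 160]
col_tot = [sum(col) for col in zip(*table)]   # [140, 120]
kh = 0.0
for i in range(2):
    for j in range(2):
        exp = row_tot[i] * col_tot[j] / n     # expected count under H0
        kh += (table[i][j] - exp) ** 2 / exp
print(round(kh, 2))  # → 12.54
# df = 1; 12.54 > 3.841 (critical value at alpha = 0.05), so reject H0.
```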
Can she taste the difference?

Tea-tasting experiment: a woman claimed to be able to judge whether tea or milk was poured in a cup first. Fisher designed an experiment to test her ability.

- What should be done about chance variations in the temperature, sweetness, and so on?
- How many cups should be used in the test? Should they be paired? In what order should the cups be presented? Fisher suggests: if discrimination of the kind under test is absent, the result of the experiment will be wholly governed by the laws of chance.
- What conclusion could be drawn from a perfect score or from one with one or more errors?
Fisher's exact test

Tea-tasting experiment: a woman claimed to be able to judge whether tea or milk was poured in a cup first. The woman was given eight cups of tea, in four of which milk was poured first, and was told to guess which four had milk poured first.

- The contingency table for this design is:
- $p_1$: probability to select a cup milk first
- $p_2$: probability to select a cup tea first
- The hypotheses to test: $H_0: p_1 = p_2$ vs $H_1: p_1 > p_2$
Fisher's exact test

Tea-tasting experiment: a woman claimed to be able to judge whether tea or milk was poured in a cup first. The woman was given eight cups of tea, in four of which milk was poured first, and was told to guess which four had milk poured first.

- Let $X$ be a random variable with values $x = 0, 1, 2, 3, 4$ (the number of milk-first cups correctly identified).
- The hypergeometric distribution:

$$ P_{H_0}(X = x) = \frac{\binom{4}{x}\binom{4}{4-x}}{\binom{8}{4}}, \quad x = 0, 1, 2, 3, 4 $$

- There are $\binom{8}{4} = 70$ possible ways to classify 4 of the 8 cups as milk first.
Check yourself

Tea-tasting experiment: a woman claimed to be able to judge whether tea or milk was poured in a cup first. The woman was given eight cups of tea, in four of which milk was poured first, and was told to guess which four had milk poured first. Suppose she identifies 3 cups milk first correctly. The p-value is:

- 1/70
- 0.04
- 16/70
- 17/70
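The hypergeometric formula from the previous slide makes this p-value easy to check with a minimal sketch:

```python
from math import comb

# Null distribution for the tea-tasting design: 8 cups, 4 milk-first,
# the woman selects 4 as milk-first.
def pmf(x):
    return comb(4, x) * comb(4, 4 - x) / comb(8, 4)

# One-sided p-value: probability of 3 or more correct identifications.
p_value = pmf(3) + pmf(4)
print(p_value)  # → 17/70 ≈ 0.243
```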
Fisher's exact test

- The marginal counts $a + b$, $c + d$, $a + c$, and $b + d$ are fixed.
- The hypotheses to test: $H_0: p_1 = p_2$ vs $H_1: p_1 > p_2$, $p_1 < p_2$, or $p_1 \neq p_2$
- The test statistic $T$ is the number of observations in cell $(1, 1)$, and

$$ P_{H_0}(T = x) = \frac{\binom{a+c}{x}\binom{b+d}{a+b-x}}{\binom{n}{a+b}}, \quad x = 0, 1, 2, \dots, \min\{a+b,\ a+c\} $$
See you next time