9/26/2018 Statistics - Lecture 04
file:///users/home/npaul/enseignement/esbs/2018-2019/cours/04/index.html#40 1/40
Statistics - Lecture 04
Nicodème Paul, Faculté de médecine, Université de Strasbourg
Correlation

In many situations the objective in studying the joint behavior of two variables is to see whether they are related.

Given $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, it is natural to speak of $X$ and $Y$ having a positive relationship if large x's are paired with large y's and small x's with small y's. Similarly, if large x's are paired with small y's and small x's with large y's, then a negative relationship between the variables is implied.

- X: temperature and Y: body length in drosophila
- X: Height and Y: Weight
- X: parent Height and Y: child Height
- X: eye Color and Y: hair Color
- X: Color and Y: perceived Taste
Correlation

In a population, we collect a random sample of 38 individuals. For each individual, we measure his or her height (mm) and his or her weight. The first ten observations are:

        Height  Weight
    1     1629    71.0
    2     1569    56.5
    3     1561    56.0
    4     1619    61.0
    5     1566    65.0
    6     1639    62.0
    7     1494    53.0
    8     1568    53.0
    9     1540    65.0
    10    1530    57.0
Independence

Is the variable Height independent of the variable Weight?

- No
- Yes
- I cannot tell
Correlation

A bivariate data set consists of measurements or observations on two variables, X and Y.
Sample correlation

The Pearson sample correlation coefficient $r$ of a bivariate data set $(x_i, y_i)$, for $i = 1, \dots, n$, is given by:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
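As a sketch, the formula can be computed directly; the function name and sample data below are illustrative, not from the lecture:

```python
import math

# Direct computation of the Pearson sample correlation coefficient:
# covariance term over the square root of the two sum-of-squares terms.
def pearson_r(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear, upward-sloping data gives r = 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # → 1.0
```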
Sample correlation

Some properties of the Pearson correlation coefficient are as follows:

- The value of $r$ does not depend on the unit of measurement for either variable.
- The value of $r$ does not depend on which of the two variables is labeled $X$.
- The value of $r$ is between $-1$ and $1$.
- A correlation coefficient of $r = 1$ occurs only when all the points in a scatterplot of the data lie exactly on a straight line that slopes upward. Similarly, $r = -1$ only when all the points lie exactly on a downward-sloping line.
Examples of scatterplots

Possible values: -1, -0.8, -0.4, 0, 0.4, 0.8, 1
Examples of scatterplots with correlation coefficients
Hypothesis testing on ρ

- Let $X_i$ and $Y_i$ be jointly normal random variables.
- $R(X, Y) = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \; \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$ is an estimator for $\rho$.
- Hypotheses: $H_0: \rho = 0$ vs $H_1: \rho \neq 0$
- Test statistic: $T = R \sqrt{\dfrac{n-2}{1-R^2}}$ follows a t distribution with $n - 2$ degrees of freedom.
- The decision is based on the critical value $t_{n-2,\,1-\alpha/2}$ or on the p-value $= P_{H_0}(|t_{n-2}| > |T|)$.
Example - linear relationship

- For the height-weight data, $r = 0.62$ and $n = 38$, so the value of the t statistic is:

$$ t = r \sqrt{\frac{n-2}{1-r^2}} = 4.74 $$

- For $\alpha = 0.05$, $t_{36,\,0.975} = 2.028$. As $4.74 > 2.028$, we reject the null hypothesis, meaning that there is strong evidence that $\rho$, the population parameter, is significantly different from $0$.
- Note that you can neither find this critical value nor calculate the p-value using the t table provided: there is no entry for the t distribution with 36 degrees of freedom in the table.
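The test statistic above can be checked with a short sketch (plain Python; `r` and `n` are the values from the example):

```python
import math

# Test statistic for H0: rho = 0, computed from the sample correlation.
def t_statistic(r, n):
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = t_statistic(0.62, 38)
print(round(t, 2))  # → 4.74
print(t > 2.028)    # exceeds the critical value t_{36, 0.975} → True
```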
If a significant sample correlation coefficient between two variables X and Y is observed, what could be its meaning?

- X causes Y
- Y causes X
- Some third factor, either directly or indirectly, causes both X and Y
- An unlikely event has occurred and a large sample correlation coefficient has been generated by chance from a population in which X and Y are, in fact, not correlated
- The correlation is purely nonsensical, a situation that may arise when measurements of X and Y are not taken on a common unit of association
Goodness of fit - checking for normality

- A normal probability plot is a scatterplot of the (normal score $z_i$, observed value $x_i$) pairs for $i = 1, 2, \dots, n$. The empirical cumulative distribution is defined as:

$$ P(X \le x_i) = \begin{cases} 1 - 0.5^{1/n} & \text{if } i = 1 \\ 0.5^{1/n} & \text{if } i = n \\ \dfrac{i - 0.3175}{n + 0.365} & \text{otherwise} \end{cases} $$

- The scores $z_i$ are found such that $P(Z \le z_i) = P(X \le x_i)$.
- In software packages, the graphical representation can be obtained using a quantile-quantile plot.
- A strong linear pattern in a normal probability plot suggests that population normality is plausible. On the other hand, systematic departure from a straight-line pattern (such as curvature in the plot) indicates that it is not reasonable to assume that the population distribution is normal.
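As a sketch, the normal scores defined above can be computed with the standard library (the function name is illustrative):

```python
from statistics import NormalDist

def normal_scores(n):
    # Plotting positions from the piecewise empirical CDF above,
    # then the scores z_i such that P(Z <= z_i) = P(X <= x_i).
    def position(i):
        if i == 1:
            return 1 - 0.5 ** (1 / n)
        if i == n:
            return 0.5 ** (1 / n)
        return (i - 0.3175) / (n + 0.365)
    z = NormalDist()  # standard normal
    return [z.inv_cdf(position(i)) for i in range(1, n + 1)]

# Plotting the sorted sample against these scores gives the normal
# probability (quantile-quantile) plot described above.
scores = normal_scores(10)
print(len(scores))  # → 10
```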
Checking for normality
Assess normality with Shapiro test

Hypotheses: $H_0: L(X) = N$ vs $H_1: L(X) \neq N$

    Shapiro-Wilk normality test
    data: rnorm(50, mean = 1, sd = 3)
    W = 0.97627, p-value = 0.4073

    Shapiro-Wilk normality test
    data: rf(50, df1 = 3, df2 = 2)
    W = 0.67196, p-value = 2.687e-09
Assess normality with Shapiro test

Hypotheses: $H_0: L(X) = N$ vs $H_1: L(X) \neq N$

    Shapiro-Wilk normality test
    data: scatdat$Height
    W = 0.93336, p-value = 0.02562

    Shapiro-Wilk normality test
    data: scatdat$Weight
    W = 0.95284, p-value = 0.1105
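The outputs above come from R's `shapiro.test`. An equivalent check in Python, assuming SciPy is available, might look like this (the random draws are illustrative, not the lecture's data):

```python
import numpy as np
from scipy import stats

# Analogous to shapiro.test(rnorm(...)) and shapiro.test(rf(...)) in R.
rng = np.random.default_rng(0)

normal_sample = rng.normal(loc=1, scale=3, size=50)
w, p = stats.shapiro(normal_sample)        # normal data

skewed_sample = rng.f(3, 2, size=50)       # F(3, 2) data, far from normal
w2, p2 = stats.shapiro(skewed_sample)
print(p, p2)  # the second p-value is typically essentially zero
```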
Dealing with non-normal data

- Transform the data with the log or the square root function.
- The Wilcoxon signed rank test can be used in place of a 1-sample or paired t-test.
- The Wilcoxon rank sum test substitutes for the 2-sample t-test.
- Bootstrapping (testing for a population mean):
  1. Draw a new sample of n observations, with replacement, from the original set of n.
  2. Calculate the mean of this new sample of n.
  3. Repeat steps (1) and (2) an arbitrarily large number of times, say 5000 times.
  4. Use the estimated distribution of the sample mean for inference.
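The bootstrap steps above can be sketched with the standard library; the data values and the choice of a percentile interval are illustrative:

```python
import random
import statistics

def bootstrap_means(sample, reps=5000, seed=0):
    # Steps 1-3: resample n observations with replacement from the
    # original n, record each resample's mean, repeat many times.
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(reps)]

data = [53.0, 56.5, 57.0, 61.0, 62.0, 65.0, 65.0, 71.0]  # illustrative
means = sorted(bootstrap_means(data, reps=2000))
# Step 4: use the estimated distribution, e.g. a 95% percentile interval.
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(lo < statistics.mean(data) < hi)  # → True
```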
Goodness of fit

Suppose data generated from an experiment give the following frequencies for the values of X:

    X            1    2    ...  k
    Frequencies  n1   n2   ...  nk

We want to test the hypotheses:

$$ H_0: L(X) = (p_1, p_2, \dots, p_k) \quad \text{vs} \quad H_1: L(X) \neq (p_1, p_2, \dots, p_k) $$

with a significance level $\alpha$.
Goodness of fit

Genetics

Consider two different characteristics of tomatoes: leaf shape and plant size. The leaf shape may be potato-leafed or cut-leafed, and the plant may be tall or dwarf.
Goodness of fit - Genetics

- Tall cut-leaf (TTCC, TTCc, TtCc, TtCC): $\tfrac{9}{16}$; Tall potato-leaf (TTcc, Ttcc): $\tfrac{3}{16}$
- Dwarf cut-leaf (ttCC, ttCc): $\tfrac{3}{16}$; Dwarf potato-leaf (ttcc): $\tfrac{1}{16}$
Notation

- Given a categorical random variable $X$ with $k$ possible values ($k$ different levels, categories, or cells): $L(X) = P_0 = (p_1, p_2, \dots, p_k)$
- $p_i$ is the true proportion of category $i$, and we have $p_1 + p_2 + \dots + p_k = 1$.
- Here $X \in \{1, 2, 3, 4\}$, with tall cut-leaf: 1, tall potato-leaf: 2, dwarf cut-leaf: 3, dwarf potato-leaf: 4.
Goodness of fit

Under the null hypothesis, given a sample size $n$, the expected count for category $i$ is $n \times p_i$. We can define the goodness-of-fit statistic, KH, as:

$$ KH = \sum_{i=1}^{k} \frac{(n_i - n \times p_i)^2}{n \times p_i} $$

Expressed differently, referring to each category as a cell, we write:

$$ KH = \sum_{\text{all cells}} \frac{(\text{observed cell count} - \text{expected cell count})^2}{\text{expected cell count}} $$

Under $H_0$, with additional conditions such as $n \ge 50$ and $n p_i \ge 5$ for $i = 1, \dots, k$, we have:

$$ L(KH) \approx \chi^2_{k-1-d} $$

where $d$ is the number of estimated parameters in the null hypothesis.
Goodness of fit - Example

Suppose we perform an experiment with tomato plants, and we observe the following counts: tall cut-leaf 926, tall potato-leaf 288, dwarf cut-leaf 293, dwarf potato-leaf 104 ($n = 1611$).

We want to test:

$$ H_0: L(X) = \left(\tfrac{9}{16}, \tfrac{3}{16}, \tfrac{3}{16}, \tfrac{1}{16}\right) \quad \text{vs} \quad H_1: L(X) \neq \left(\tfrac{9}{16}, \tfrac{3}{16}, \tfrac{3}{16}, \tfrac{1}{16}\right) $$

with a level of significance $\alpha = 0.01$.
Goodness of fit

We verify the conditions: $n = 1611 \ge 50$ and $n p_i \ge 5$ for every category (the smallest expected count is $100.6875$).

The value of the test statistic:

$$ kh = \frac{(926 - 906.1875)^2}{906.1875} + \frac{(288 - 302.0625)^2}{302.0625} + \frac{(293 - 302.0625)^2}{302.0625} + \frac{(104 - 100.6875)^2}{100.6875} \approx 1.47 $$
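The kh computation above can be checked with a short sketch in plain Python:

```python
# Goodness-of-fit statistic for the tomato example, computed directly.
observed = [926, 288, 293, 104]        # counts from the example
probs = [9/16, 3/16, 3/16, 1/16]       # Mendelian 9:3:3:1 ratio under H0
n = sum(observed)                      # 1611
expected = [n * p for p in probs]
kh = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(kh, 2))  # → 1.47
```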
Goodness of fit

For a significance level $\alpha$, we calculate the critical value $c_\alpha$ such that $P(\chi^2_{k-1-d} > c_\alpha) = \alpha$. We would fail to reject $H_0$ if $kh < c_\alpha$.

In the example, $kh \approx 1.47$, $d = 0$, $\alpha = 0.01$, and the number of degrees of freedom is $k - 1 - d = 3$.

The critical value is 11.345. Therefore, we fail to reject the null hypothesis. The p-value is approximately 0.69.
Contingency table

A rectangular table used to summarize a categorical data set; two-way tables are used to compare several populations on the basis of a categorical variable or to determine if an association exists between two categorical variables.

Example of a contingency table:
Review

Given two discrete random variables $X$ and $Y$, the joint distribution is defined by $p_{ij} = P(X = x_i, Y = y_j)$:

    X \ Y   y1    y2    ...  ym
    x1      p11   p12   ...  p1m
    x2      p21   p22   ...  p2m
    ...     ...   ...   ...  ...
    xl      pl1   pl2   ...  plm

- The marginal distribution of $X$: $P(X = x_i) = p_{i.} = \sum_{j=1}^{m} p_{ij}$
- The marginal distribution of $Y$: $P(Y = y_j) = p_{.j} = \sum_{i=1}^{l} p_{ij}$
- If $X$ and $Y$ are independent: $p_{ij} = p_{i.} \times p_{.j}$
Testing independence

The sample data observed from $(X, Y)$ form a contingency table:

    X \ Y   y1    y2    ...  ym
    x1      n11   n12   ...  n1m
    x2      n21   n22   ...  n2m
    ...     ...   ...   ...  ...
    xl      nl1   nl2   ...  nlm

- The marginal frequencies for $X$: $n_{i.} = \sum_{j=1}^{m} n_{ij}$
- The marginal frequencies for $Y$: $n_{.j} = \sum_{i=1}^{l} n_{ij}$
- If $X$ and $Y$ are independent: $n_{ij} \approx \dfrac{n_{i.} \times n_{.j}}{n}$
Testing independence

- $H_0$: X and Y are independent, i.e. $p_{ij} = p_{i.} \times p_{.j}$ with $i = 1, 2, \dots, l$ and $j = 1, 2, \dots, m$
- $H_1$: X and Y are not independent
- The number of parameters to estimate is $d = (l - 1) + (m - 1)$.
- Test statistic:

$$ KH = \sum_{i=1}^{l} \sum_{j=1}^{m} \frac{\left(n_{ij} - \frac{n_{i.} n_{.j}}{n}\right)^2}{\frac{n_{i.} n_{.j}}{n}} $$

- Under $H_0$, when $n \ge 50$ and $\frac{n_{i.} n_{.j}}{n} \ge 5$ for all $i, j$:

$$ L(KH) \approx \chi^2_{(l-1)(m-1)} $$
Example

Let $X$ be a random variable with values $F$ (Family) and $S$ (alone), and $Y$ another random variable with values $P$ (Present) and $A$ (Absent). From the sample, we can estimate:

$$ P(F, P) \approx \tfrac{40}{260}, \quad P(F, A) \approx \tfrac{60}{260}, \quad P(S, P) \approx \tfrac{100}{260}, \quad P(S, A) \approx \tfrac{60}{260} $$

$$ P(F) \approx \tfrac{100}{260}, \quad P(S) \approx \tfrac{160}{260}, \quad P(P) \approx \tfrac{140}{260}, \quad P(A) \approx \tfrac{120}{260} $$
Example

$X$ and $Y$ are independent if:

$$ P(X = x, Y = y) = P(X = x) \times P(Y = y) $$

- $H_0$: X and Y are independent
- $H_1$: X and Y are not independent
Example

$$ P(F) \times P(P) \approx \tfrac{100}{260} \times \tfrac{140}{260}, \quad P(F) \times P(A) \approx \tfrac{100}{260} \times \tfrac{120}{260} $$

$$ P(S) \times P(P) \approx \tfrac{160}{260} \times \tfrac{140}{260}, \quad P(S) \times P(A) \approx \tfrac{160}{260} \times \tfrac{120}{260} $$
Example

$$ \text{expected cell count} = \frac{(\text{row marginal total}) \times (\text{column marginal total})}{\text{grand total}} $$

$$ df = (\text{number of rows} - 1) \times (\text{number of columns} - 1) $$

$$ kh = \frac{(40 - 54)^2}{54} + \frac{(60 - 46)^2}{46} + \frac{(100 - 86.4)^2}{86.4} + \frac{(60 - 73.6)^2}{73.6} \approx 12.54 $$

The $\chi^2$ distribution has $(2 - 1) \times (2 - 1) = 1$ degree of freedom. $c_\alpha = 3.841$ for $\alpha = 0.05$. We therefore reject the null hypothesis.
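As a sketch, the whole independence computation for this example can be reproduced in plain Python (rounding only at the end, rather than rounding the expected counts as the slides do; both give 12.54):

```python
# Chi-square test of independence for the 2x2 example:
# rows = Family/Alone, columns = Present/Absent.
table = [[40, 60],     # Family: Present, Absent
         [100, 60]]    # Alone:  Present, Absent
n = sum(sum(row) for row in table)            # 260
row_tot = [sum(row) for row in table]         # [100, 160]
col_tot = [sum(col) for col in zip(*table)]   # [140, 120]
kh = 0.0
for i in range(2):
    for j in range(2):
        exp = row_tot[i] * col_tot[j] / n     # expected count under H0
        kh += (table[i][j] - exp) ** 2 / exp
print(round(kh, 2))  # → 12.54
# df = 1; 12.54 > 3.841 (critical value at alpha = 0.05), so reject H0.
```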
Can she taste the difference?

Tea-tasting experiment: a woman claimed to be able to judge whether tea or milk was poured in a cup first. Fisher designed an experiment to test her ability.

- What should be done about chance variations in the temperature, sweetness, and so on?
- How many cups should be used in the test? Should they be paired? In what order should the cups be presented? Fisher suggests: if discrimination of the kind under test is absent, the result of the experiment will be wholly governed by the laws of chance.
- What conclusion could be drawn from a perfect score or from one with one or more errors?
Fisher's exact test

Tea-tasting experiment: a woman claimed to be able to judge whether tea or milk was poured in a cup first. The woman was given eight cups of tea, in four of which milk was poured first, and was told to guess which four had milk poured first.

- The contingency table for this design is:
- $p_1$: probability to select a cup milk first
- $p_2$: probability to select a cup tea first
- The hypotheses to test: $H_0: p_1 = p_2$ vs $H_1: p_1 > p_2$
Fisher's exact test

Tea-tasting experiment: a woman claimed to be able to judge whether tea or milk was poured in a cup first. The woman was given eight cups of tea, in four of which milk was poured first, and was told to guess which four had milk poured first.

- Let $X$ be a random variable with values $x = 0, 1, 2, 3, 4$ (the number of milk-first cups correctly identified).
- The hypergeometric distribution:

$$ P_{H_0}(X = x) = \frac{\binom{4}{x}\binom{4}{4-x}}{\binom{8}{4}}, \quad x = 0, 1, 2, 3, 4 $$

- There are $\binom{8}{4} = 70$ possible ways to classify 4 of the 8 cups as milk first.
Check yourself

Tea-tasting experiment: a woman claimed to be able to judge whether tea or milk was poured in a cup first. The woman was given eight cups of tea, in four of which milk was poured first, and was told to guess which four had milk poured first. Suppose she identifies 3 cups milk first correctly. The p-value is:

- 1/70
- 0.04
- 16/70
- 17/70
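The hypergeometric formula from the previous slide makes this p-value easy to check with a minimal sketch:

```python
from math import comb

# Null distribution for the tea-tasting design: 8 cups, 4 milk-first,
# the woman selects 4 as milk-first.
def pmf(x):
    return comb(4, x) * comb(4, 4 - x) / comb(8, 4)

# One-sided p-value: probability of 3 or more correct identifications.
p_value = pmf(3) + pmf(4)
print(p_value)  # → 17/70 ≈ 0.243
```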
Fisher's exact test

- The marginal counts $a + b$, $c + d$, $a + c$, and $b + d$ are fixed.
- The hypotheses to test: $H_0: p_1 = p_2$ vs $H_1: p_1 > p_2$, $p_1 < p_2$, or $p_1 \neq p_2$
- The test statistic $T$ is the number of observations in cell $(1, 1)$, and

$$ P_{H_0}(T = x) = \frac{\binom{a+c}{x}\binom{b+d}{a+b-x}}{\binom{n}{a+b}}, \quad x = 0, 1, 2, \dots, \min\{a+b,\ a+c\} $$
See you next time