Is cross-fertilization good or bad?: An analysis of Darwin’s Zea Mays Data By Jamie Chatman and Charlotte Hsieh



Page 2: Zea mays

Outline

Short biography of Charles Darwin and Ronald Fisher
Description of the Zea Mays data
Analysis of the data
  Parametric tests (t-test, confidence intervals)
  Nonparametric test (i.e. Wilcoxon signed rank)
  Bootstrap tests
Conclusion

Page 3: Zea mays

Short Biography of Charles Darwin

Darwin was born in 1809 in Shrewsbury, England.

At 16 he went to Edinburgh University to study medicine, but did not finish.

He then went to Cambridge University, where he received his degree while studying to become a clergyman.

Darwin worked as an unpaid naturalist on a five-year scientific expedition to South America that began in 1831.

Darwin’s research led to his book, On the Origin of Species by Means of Natural Selection, published in 1859.

1809-1882

Page 4: Zea mays

Short Biography of Ronald Fisher

Fisher was born in East Finchley, London in 1890.

Fisher went to Cambridge University and received a degree in mathematics.

Fisher made many discoveries in statistics, including maximum likelihood, analysis of variance, and sufficiency, and was a pioneer of the design of experiments.

1890-1962

Page 5: Zea mays

Darwin’s Zea Mays Data

Page 6: Zea mays

Hypothesis

Null Hypothesis:
Ho: There is no difference in stalk height between the cross-fertilized and self-fertilized plants.

Alternative Hypothesis:
HA: Cross-fertilized stalk heights are not equal to self-fertilized heights (two-sided)
HA: Cross-fertilization leads to increased stalk height (one-sided)

Page 7: Zea mays

Galton’s Approach to the Data

Original Data (heights in inches, paired within pots)

Pot      Crossed   Self-Fert.
Pot I    23.500    17.375
         12.000    20.375
         21.000    20.000
Pot II   22.000    20.000
         19.125    18.375
         21.500    18.625
Pot III  22.125    18.625
         20.375    15.250
         18.250    16.500
         21.625    18.000
         23.250    16.250
Pot IV   21.000    18.000
         22.125    12.750
         23.000    15.500
         12.000    18.000

Galton’s Approach (each column sorted from tallest to shortest, then differenced)

Crossed   Self-Fert.   Difference
23.500    20.375        3.125
23.250    20.000        3.250
23.000    20.000        3.000
22.125    18.625        3.500
22.125    18.625        3.500
22.000    18.375        3.625
21.625    18.000        3.625
21.500    18.000        3.500
21.000    18.000        3.000
21.000    17.375        3.625
20.375    16.500        3.875
19.125    16.250        2.875
18.250    15.500        2.750
12.000    15.250       -3.250
12.000    12.750       -0.750
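The two ways of forming differences can be sketched in a few lines of Python. This is only an illustration: the variable names are hypothetical, and it assumes the fifth crossed height is 19.125 inches (19 1/8), the value consistent with the sum of differences of 39.25 used on the later slides.

```python
# Heights in inches, listed pair by pair in pot order (Pots I to IV).
# Assumption: the fifth crossed height is 19.125 (19 1/8 inches).
crossed = [23.5, 12.0, 21.0, 22.0, 19.125, 21.5, 22.125, 20.375,
           18.25, 21.625, 23.25, 21.0, 22.125, 23.0, 12.0]
selfed = [17.375, 20.375, 20.0, 20.0, 18.375, 18.625, 18.625, 15.25,
          16.5, 18.0, 16.25, 18.0, 12.75, 15.5, 18.0]

# Paired differences (crossed minus self-fertilized) in the original pairing.
paired_diffs = [c - s for c, s in zip(crossed, selfed)]

# Galton's approach: sort each column separately, then subtract row by row.
galton_diffs = [c - s for c, s in zip(sorted(crossed, reverse=True),
                                      sorted(selfed, reverse=True))]

print("paired differences:", paired_diffs)
print("Galton differences:", galton_diffs)
print("sums:", sum(paired_diffs), sum(galton_diffs))
```

Both sets of differences have the same total because sorting only rearranges the columns, but Galton’s sorting discards the pairing and so understates the spread of the differences.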

Page 8: Zea mays

Parametric Test

Fisher made an assumption that the stalk heights were normally distributed:

Crossed: $X \sim N(\mu_X, \sigma_X^2)$

Self-fertilized: $Y \sim N(\mu_Y, \sigma_Y^2)$

Difference: $d = X - Y \sim N(\mu_X - \mu_Y, \sigma_X^2 + \sigma_Y^2)$

Sample statistics: $\bar{d} = 2.6166$, $s_d^2 = 22.26$, d.f. = 14

Test statistic: $t = \dfrac{2.6166 - 0}{\sqrt{22.26/15}} = 2.148$

p-value: 0.0497

Reject the null hypothesis that $\mu_X = \mu_Y$ at the .05 level.
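The t statistic on this slide can be reproduced from the 15 paired differences. The sketch below uses only the Python standard library, so it compares the statistic with the two-sided critical value 2.145 (the value used on the confidence-interval slide) rather than computing the p-value. Variable names are illustrative, and the fifth difference is assumed to be 0.75.

```python
import math
import statistics

# Paired differences (crossed minus self-fertilized), in pair order.
# Assumption: the fifth difference is 0.75, consistent with the sum 39.25.
diffs = [6.125, -8.375, 1.0, 2.0, 0.75, 2.875, 3.5, 5.125,
         1.75, 3.625, 7.0, 3.0, 9.375, 7.5, -6.0]

n = len(diffs)
d_bar = statistics.mean(diffs)        # mean difference
s2_d = statistics.variance(diffs)     # sample variance (divides by n - 1)
t_stat = (d_bar - 0) / math.sqrt(s2_d / n)

# Two-sided 5% critical value for 14 degrees of freedom, taken from the slides.
t_crit = 2.145
reject = abs(t_stat) > t_crit

print(f"mean = {d_bar:.4f}, variance = {s2_d:.2f}, t = {t_stat:.3f}, reject = {reject}")
```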

Page 9: Zea mays

Parametric Test

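95% confidence interval

$\bar{d} - t_{.025}\,(s_d/\sqrt{n}) \le \mu_X - \mu_Y \le \bar{d} + t_{.025}\,(s_d/\sqrt{n})$

$2.6167 - 2.145 \cdot 4.7181/\sqrt{15} \le \mu_X - \mu_Y \le 2.6167 + 2.145 \cdot 4.7181/\sqrt{15}$

$0.00364 \le \mu_X - \mu_Y \le 5.2298$

Since zero is not in the interval, the null hypothesis that the mean difference is 0 (that is, that the means are equal) is rejected.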
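As a quick arithmetic check, the interval can be recomputed from the summary values quoted on this slide (mean 2.6167, standard deviation 4.7181, n = 15, critical value 2.145). The names below are illustrative only.

```python
import math

# Summary values as quoted on the slide.
d_bar = 2.6167    # mean of the 15 differences
s_d = 4.7181      # sample standard deviation of the differences
n = 15
t_crit = 2.145    # t critical value for 14 d.f., upper 2.5% point

half_width = t_crit * s_d / math.sqrt(n)
ci = (d_bar - half_width, d_bar + half_width)

print(f"95% CI: ({ci[0]:.5f}, {ci[1]:.4f})")
print("zero inside interval:", ci[0] <= 0 <= ci[1])
```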

Page 10: Zea mays

Fisher’s Non-Parametric Approach

If Ho is true, and the heights of the crossed and self-fertilized plants are equal, then there should be an equal chance that each member of a pair came from the self-fertilized or the crossed group.

If we look at all possible swaps in each pair, there are $2^{15} = 32{,}768$ possibilities.

The observed sum of the differences is 39.25.

But only 863 of these cases have a sum of differences as great as 39.25.

So, counting both tails, the null hypothesis would be rejected at the $2 \cdot 863 / 32{,}768 \approx 0.0526$ level.
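Because there are only 32,768 sign assignments, the exact count can be checked by brute-force enumeration. The following is a minimal sketch, not Fisher’s own procedure (he counted the cases by hand); it assumes the fifth difference is 0.75 and uses illustrative names. All differences are multiples of 1/8, so the floating-point sums are exact.

```python
from itertools import product

# Paired differences (crossed minus self-fertilized), in pair order.
# Assumption: the fifth difference is 0.75, consistent with the sum 39.25.
diffs = [6.125, -8.375, 1.0, 2.0, 0.75, 2.875, 3.5, 5.125,
         1.75, 3.625, 7.0, 3.0, 9.375, 7.5, -6.0]

observed = sum(diffs)
abs_diffs = [abs(d) for d in diffs]

n_total = 0
n_extreme = 0
# Under Ho each difference is equally likely to be positive or negative,
# so every one of the 2**15 sign patterns is equally likely.
for signs in product((1, -1), repeat=len(abs_diffs)):
    total = sum(s * a for s, a in zip(signs, abs_diffs))
    n_total += 1
    if total >= observed:
        n_extreme += 1

p_two_sided = 2 * n_extreme / n_total
print(n_total, n_extreme, round(p_two_sided, 4))
```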

Page 11: Zea mays

Fisher’s Nonparametric Approach

The results of the nonparametric test agreed with the results of the t-test.

Fisher was happy with this.

However, Fisher believed that removing the assumption of normality in the nonparametric test would result in a less powerful test than the t-test.

“[Nonparametric tests] assume less knowledge, or more ignorance, of the experimental material than does the standard test…”

We disagree.

Page 12: Zea mays

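Non-Parametric Test

Wilcoxon Signed Rank Test

Differences (crossed minus self-fertilized, in pair order):
6.125, -8.375, 1, 2, 0.75, 2.875, 3.5, 5.125, 1.75, 3.625, 7, 3, 9.375, 7.5, -6

Absolute differences ranked from smallest to largest:

Rank   |Diff.|   Sign   Signed rank $R_i$
1      0.75      +        1
2      1         +        2
3      1.75      +        3
4      2         +        4
5      2.875     +        5
6      3         +        6
7      3.5       +        7
8      3.625     +        8
9      5.125     +        9
10     6         -      -10
11     6.125     +       11
12     7         +       12
13     7.5       +       13
14     8.375     -      -14
15     9.375     +       15

$W = \sum_{i=1}^{n} R_i = 72$

Under Ho, each signed rank $R_i$ equals $+i$ or $-i$ with probability 1/2, so

$E(R_i) = \tfrac{1}{2}(i) + \tfrac{1}{2}(-i) = 0$

$E(W) = \sum_{i=1}^{n} E(R_i) = 0$

$Var(W) = 1^2 + 2^2 + \dots + n^2 = \dfrac{n(n+1)(2n+1)}{6}$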

Page 13: Zea mays

Non-Parametric Test

Wilcoxon Signed Rank Test

When n is large, $W \sim N(0, Var(W))$:

$z = \dfrac{W - 0}{\sqrt{Var(W)}} = \dfrac{72 - 0}{\sqrt{15(15+1)(30+1)/6}} = 2.045$

This gives a p-value of 0.0409. Thus we reject the null hypothesis.
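The signed-rank statistic and its normal approximation can be sketched with the standard library alone. The names are illustrative, the fifth difference is assumed to be 0.75, and the ranking shortcut below relies on there being no ties or zeros among the absolute differences, which holds for these data.

```python
import math

# Paired differences (crossed minus self-fertilized), in pair order.
# Assumption: the fifth difference is 0.75, consistent with the sum 39.25.
diffs = [6.125, -8.375, 1.0, 2.0, 0.75, 2.875, 3.5, 5.125,
         1.75, 3.625, 7.0, 3.0, 9.375, 7.5, -6.0]

# Rank the differences by absolute value (no ties here), then attach the signs.
ordered = sorted(diffs, key=abs)
W = sum(rank if d > 0 else -rank for rank, d in enumerate(ordered, start=1))

n = len(diffs)
var_W = n * (n + 1) * (2 * n + 1) / 6     # 1^2 + 2^2 + ... + n^2
z = (W - 0) / math.sqrt(var_W)

# Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"W = {W}, Var(W) = {var_W:.0f}, z = {z:.3f}, p = {p_two_sided:.4f}")
```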

Page 14: Zea mays

Bootstrap Methods

Introduced by Bradley Efron (1979), 44 years after Fisher’s analysis

"If statistics had evolved at a time when computers existed, it wouldn't be what it is today." (Efron)

Uses repeated re-samples of the data

Allows the use of computer sampling approaches that are asymptotically equivalent to tests where exact significance levels require complicated manipulations

A sampling simulation approximation to Fisher’s nonparametric approach

The data “pull themselves up by their own bootstraps” by generating new data sets through which their reliability can be determined.

Page 15: Zea mays

Bootstrap: Random Sign Change

If Ho is true, there is an equal chance that the plants in each pair are cross-fertilized or self-fertilized.

Method:
1. Randomly swap the crossed and self-fertilized labels within each pair (equivalently, flip the sign of each difference with probability 1/2)
2. Compute sum of differences
3. Repeat 5,000 times
4. Plot histogram of summed differences
5. Find the number of summed differences > 39.25
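The steps above can be sketched as follows. This is a minimal simulation under stated assumptions: the fifth difference is taken as 0.75, the seed is arbitrary, and the histogram step is omitted. Because the result depends on the random draws, the count will not reproduce the figure on the slides exactly.

```python
import random

# Paired differences (crossed minus self-fertilized), in pair order.
# Assumption: the fifth difference is 0.75, consistent with the sum 39.25.
diffs = [6.125, -8.375, 1.0, 2.0, 0.75, 2.875, 3.5, 5.125,
         1.75, 3.625, 7.0, 3.0, 9.375, 7.5, -6.0]

observed = sum(diffs)
n_rep = 5000
rng = random.Random(0)   # arbitrary seed, chosen only for reproducibility

# Steps 1-3: flip each sign with probability 1/2 and sum, 5,000 times.
sums = [sum(d if rng.random() < 0.5 else -d for d in diffs)
        for _ in range(n_rep)]

# Step 5: count resampled sums strictly greater than the observed sum.
n_greater = sum(1 for s in sums if s > observed)
p_two_sided = 2 * n_greater / n_rep

print(f"{n_greater}/{n_rep} sums exceed {observed}; p = {p_two_sided:.4f}")
```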

Page 16: Zea mays

Bootstrap: Random Sign Change

[Figure: Histogram of 5000 Resampled Sums of (Sign) Randomized Zea Mays Differences. x-axis: Total of Differences; y-axis: Frequency]

Results

124/5000 are > 39.25.

The p-value is 2*(124/5000) = 0.0496.

Compare to the exact combinatorial p-value of 0.0526.

Page 17: Zea mays

Bootstrap: Resample Within Pots

Experimenters will tend to present data in such a way as to get significant results.

In order to be sure that pairings in each pot are random, we can resample within pots.

We assume equality of heights in each pot.

Method:
1. Sample 3 crossed plants in pot 1 with replacement
2. Sample 3 self-fert. plants in pot 1 with replacement
3. Repeat for pots 2-4
4. Compute sum of differences
5. Repeat 5,000 times
6. Plot histogram of summed differences
7. Find the number of summed differences < 0
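A sketch of the within-pot resampling follows. Two assumptions are made that the slides do not spell out: each pot is resampled at its own size (3, 3, 5 and 4 pairs, so that 15 differences result), and the second crossed height in Pot II is 19.125. The function name and seed are illustrative, and the simulated count will differ somewhat from the slides.

```python
import random

# Heights by pot as (crossed, self-fertilized), from the original data table.
# Assumption: the second crossed height in Pot II is 19.125 (19 1/8 inches).
pots = {
    "I": ([23.5, 12.0, 21.0], [17.375, 20.375, 20.0]),
    "II": ([22.0, 19.125, 21.5], [20.0, 18.375, 18.625]),
    "III": ([22.125, 20.375, 18.25, 21.625, 23.25],
            [18.625, 15.25, 16.5, 18.0, 16.25]),
    "IV": ([21.0, 22.125, 23.0, 12.0], [18.0, 12.75, 15.5, 18.0]),
}

def resampled_sum(rng):
    """Resample crossed and self-fertilized plants within each pot; return the summed difference."""
    total = 0.0
    for crossed, selfed in pots.values():
        c_star = rng.choices(crossed, k=len(crossed))   # with replacement
        s_star = rng.choices(selfed, k=len(selfed))     # with replacement
        total += sum(c_star) - sum(s_star)
    return total

rng = random.Random(0)   # arbitrary seed
n_rep = 5000
sums = [resampled_sum(rng) for _ in range(n_rep)]

n_negative = sum(1 for s in sums if s < 0)
p_two_sided = 2 * n_negative / n_rep
print(f"{n_negative}/{n_rep} sums are < 0; p = {p_two_sided:.4f}")
```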

Page 18: Zea mays

Bootstrap: Resample Within Pots

[Figure: Histogram of Sums of Differences in 5000 Resamplings with Resampling Within Pots. x-axis: Value of Sum of Differences; y-axis: Frequency]

Results

27/5000 are < 0.

The p-value is 2*(27/5000) = 0.0108.

Page 19: Zea mays

Resampling-Based Sign Test

Disregard the size of the difference and look only at the sign of the difference.

If Ho is true, the probability of any difference being positive or negative is 0.5, and we can use a binomial approach, where we would expect about half of the 15 pairs to have a positive difference and half to have a negative difference.

We can count the number of positive differences in resampled pairs of size 15.

Method:
1. Sample 3 crossed plants in pot 1 with replacement
2. Sample 3 self-fert. plants in pot 1 with replacement
3. Repeat for pots 2-4
4. Count the number of positive differences
5. Repeat 5,000 times
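The counting procedure might look like the sketch below. It assumes that the i-th resampled crossed plant in a pot is paired with the i-th resampled self-fertilized plant, that each pot is resampled at its own size (3, 3, 5 and 4 pairs), and that the second crossed height in Pot II is 19.125; none of these details is stated on the slides. Names and seed are illustrative.

```python
import random

# Heights by pot as (crossed, self-fertilized), from the original data table.
# Assumption: the second crossed height in Pot II is 19.125 (19 1/8 inches).
pots = {
    "I": ([23.5, 12.0, 21.0], [17.375, 20.375, 20.0]),
    "II": ([22.0, 19.125, 21.5], [20.0, 18.375, 18.625]),
    "III": ([22.125, 20.375, 18.25, 21.625, 23.25],
            [18.625, 15.25, 16.5, 18.0, 16.25]),
    "IV": ([21.0, 22.125, 23.0, 12.0], [18.0, 12.75, 15.5, 18.0]),
}

def count_positive(rng):
    """Resample within pots, pair the draws in order, and count positive differences."""
    count = 0
    for crossed, selfed in pots.values():
        k = len(crossed)
        c_star = rng.choices(crossed, k=k)   # with replacement
        s_star = rng.choices(selfed, k=k)    # with replacement
        count += sum(1 for c, s in zip(c_star, s_star) if c > s)
    return count

rng = random.Random(0)   # arbitrary seed
n_rep = 5000
counts = [count_positive(rng) for _ in range(n_rep)]

n_below_6 = sum(1 for c in counts if c < 6)
n_below_8 = sum(1 for c in counts if c < 8)
print(f"# pos. diff. < 6: {n_below_6}/{n_rep}")
print(f"# pos. diff. < 8: {n_below_8}/{n_rep}")
```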

Page 20: Zea mays

Resampling-Based Sign Test

Results

Almost every time out of 5,000, we get over 8 positive differences out of 15.

# pos. diff. < 6: 0/5000

# pos. diff. < 8: 2/5000

The p-value is essentially 0.

[Figure: Histogram of Number of Positive Differences Between Crossed and Self-Fertilized in 5000 Resamplings of Size 15 from the Zea Mays Data with Randomization Within Pots. x-axis: Number of Positive Differences; y-axis: Frequency]

Page 21: Zea mays

Randomization Within Pots

Disregard information about cross or self-fertilized.

Find the distribution of summed differences by resampling from pooled data.

Method:
1. Pool plants in pot 1
2. Sample 3 plants from the pool w/replacement, treat as crossed
3. Sample 3 plants from the pool w/replacement, treat as self-fert.
4. Repeat for pots 2-4
5. Compute sum of differences
6. Repeat 5,000 times
7. Plot histogram of summed differences (= distribution under the null hypothesis)
8. Find the number of summed differences > 39.25
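A sketch of the pooled resampling is given below, under the same assumptions as before: each pot is resampled at its own size (3, 3, 5 and 4 plants per group), and the second crossed height in Pot II is 19.125. The function name and seed are illustrative, and the simulated count will vary from run to run.

```python
import random

# Heights by pot as (crossed, self-fertilized), from the original data table.
# Assumption: the second crossed height in Pot II is 19.125 (19 1/8 inches).
pots = {
    "I": ([23.5, 12.0, 21.0], [17.375, 20.375, 20.0]),
    "II": ([22.0, 19.125, 21.5], [20.0, 18.375, 18.625]),
    "III": ([22.125, 20.375, 18.25, 21.625, 23.25],
            [18.625, 15.25, 16.5, 18.0, 16.25]),
    "IV": ([21.0, 22.125, 23.0, 12.0], [18.0, 12.75, 15.5, 18.0]),
}

# Observed sum of differences from the real labels.
observed = sum(sum(c) - sum(s) for c, s in pots.values())

def null_sum(rng):
    """Ignore the labels: draw both groups from each pot's pooled heights."""
    total = 0.0
    for crossed, selfed in pots.values():
        pool = crossed + selfed
        k = len(crossed)
        as_crossed = rng.choices(pool, k=k)   # with replacement
        as_selfed = rng.choices(pool, k=k)    # with replacement
        total += sum(as_crossed) - sum(as_selfed)
    return total

rng = random.Random(0)   # arbitrary seed
n_rep = 5000
null_sums = [null_sum(rng) for _ in range(n_rep)]

n_greater = sum(1 for s in null_sums if s > observed)
p_two_sided = 2 * n_greater / n_rep
print(f"{n_greater}/{n_rep} null sums exceed {observed}; p = {p_two_sided:.4f}")
```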

Page 22: Zea mays

Randomization Within Pots

Results

38/5000 are > 39.25.

The p-value is 2*(38/5000) = 0.0152.

[Figure: Histogram of Null Hypothesis Randomization Test Distribution (resample of 5000). x-axis: Sum of Differences; y-axis: Frequency]

Page 23: Zea mays

Resampling Approach to Confidence Intervals

Using Darwin’s original differences:
1. Sample 15 differences with replacement
2. Compute the sum of differences
3. Repeat 5,000 times
4. Plot histogram of summed differences
5. Take the 125th and 4875th summed differences (in sorted order)
6. Divide by sample size = 15
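The percentile interval described above can be sketched as follows. The fifth difference is assumed to be 0.75, the seed is arbitrary, and the names are illustrative; the endpoints will differ slightly from the slides because they depend on the random draws.

```python
import random

# Paired differences (crossed minus self-fertilized), in pair order.
# Assumption: the fifth difference is 0.75, consistent with the sum 39.25.
diffs = [6.125, -8.375, 1.0, 2.0, 0.75, 2.875, 3.5, 5.125,
         1.75, 3.625, 7.0, 3.0, 9.375, 7.5, -6.0]

n = len(diffs)
n_rep = 5000
rng = random.Random(0)   # arbitrary seed

# Steps 1-3: resample 15 differences with replacement and sum, 5,000 times.
sums = sorted(sum(rng.choices(diffs, k=n)) for _ in range(n_rep))

# Steps 5-6: the 125th and 4875th ordered sums (2.5% and 97.5% points),
# divided by the sample size to give an interval for the mean difference.
lower = sums[125 - 1] / n
upper = sums[4875 - 1] / n

print(f"95% percentile interval: ({lower:.4f}, {upper:.4f})")
```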

[Figure: Histogram of 5000 Sums of 15 Resampled Differences in Galton's Zea Mays Data. x-axis: Sum of 15 Differences; y-axis: Frequency]

We get 95% CI: (0.1749, 4.817), which is shorter than the t-interval (.0036, 5.230)

Page 24: Zea mays

Resampling Approach to Confidence Intervals

In the resampling approaches, “95% of the resampled average differences were between 0.1749 and 4.817.”

This is not equivalent to the t-procedure, where “with probability 95%, the true value of the difference estimate lies between 0.0036 and 5.230.”

Page 25: Zea mays

Conclusion

We can conclude from our tests that cross-fertilization leads to increased stalk heights.

Despite Fisher’s concern that tests which drop the normality assumption are less intelligible than the t-test, nonparametric resampling-based methods are powerful and efficient.

Page 26: Zea mays

Is there anything else to consider?

Not using randomization, which might lead to environmental advantages and disadvantages:
  Soil conditions or fertility
  Lighting
  Air currents
  Irrigation/evaporation

Page 27: Zea mays

References

Fisher, R.A. (1935). The Design of Experiments. Edinburgh: Oliver & Boyd, 29-49.

Thompson, J.R. (2000). Simulation: A Modeler’s Approach. New York: Wiley-Interscience Publication, 199-210.

http://www.fact-index.com/r/ro/ronald_fisher.html

http://www.lib.virginia.edu/science/parshall/darwin.html

http://www.mste.uiuc.edu/stat/bootarticle.html

http://www.psych.usyd.edu.au/difference5/scholars/galton.html