
GEOS 36501/EVOL 33001, 13 January 2012

III. Sampling

1 Overview of Sampling, Error, Bias

1.1 Biased vs. random sampling

1.2 Biased vs. unbiased statistic (or estimator)

1.3 Precision vs. accuracy

2 Error Estimates With Assumed Sampling Distribution

2.1 Standard Error:

Standard deviation of the distribution of sample statistics that would result from an infinite number of trials of drawing a sample from the underlying probability distribution and calculating the sample statistic.
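When the underlying distribution is known, the definition above can be illustrated by brute force. The sketch below assumes a standard normal population, a sample size of 25, and the mean as the statistic (all chosen for illustration, not taken from the notes), and uses a large finite number of trials in place of an infinite one.

```r
# Assumed setup (not from the notes): standard normal population, n = 25.
set.seed(1)
n <- 25
ntrial <- 20000                       # finite stand-in for "infinite" trials
means <- replicate(ntrial, mean(rnorm(n, mean = 0, sd = 1)))
se.sim <- sd(means)                   # SD of the sample statistics
se.theory <- 1 / sqrt(n)              # known SE of the mean, sigma/sqrt(n)
print(c(simulated = se.sim, theory = se.theory))
```

The simulated value should lie close to sigma/sqrt(n), the standard approximation for the mean.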

2.2 In practice we generally do not estimate error by repeated sampling from the underlying distribution (expensive and time-consuming), although there are exceptions.

2.3 Approximations based on sample distribution (from Sokal and Rohlf):



2.4 Limitations:

2.4.1 Many approximation formulae make assumptions about shape of distribution and sample size.

2.4.2 We may be interested in a novel statistic or one whose sampling distribution is not well characterized.

3 Bootstrap Error Estimates

3.1 Estimate standard error by resampling from the single sample we have.

3.2 This approach uses sampling with replacement from the observed sample to simulate sampling without replacement from the underlying distribution.

3.3 Procedure

3.3.1 Start with observed sample of size n and observed sample statistic, call it Z.

3.3.2 Randomly pick a sample of size n, with replacement, from the observed sample.

3.3.3 Calculate the sample statistic of interest on this random sample; call it Zboot.

3.3.4 Repeat many times (generally hundreds to thousands, ideally until the estimate of SE stabilizes).

3.3.5 Calculate the standard deviation of the Zboot values.

This is an estimate of the standard error of the observed sample statistic Z: SD(Zboot) ≈ SE(Z).
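One way to judge whether the estimate of SE has stabilized (step 3.3.4) is to recompute it from increasing numbers of replicates. This is only a sketch: the data are simulated, the median is an arbitrary choice of statistic, and the replicate counts are illustrative.

```r
# Hypothetical sample and statistic (median); any statistic could be substituted.
set.seed(1)
x <- rnorm(20, mean = 10, sd = 3)     # stand-in for an observed sample
n <- length(x)
nrep <- 5000
Zboot <- replicate(nrep, median(sample(x, n, replace = TRUE)))
# SE estimate based on the first k replicates, for increasing k
k <- c(100, 500, 1000, 2500, 5000)
se.running <- sapply(k, function(kk) sd(Zboot[1:kk]))
print(data.frame(replicates = k, SE = se.running))
```

If the last few values of SE differ only in the second or third significant digit, adding more replicates is unlikely to change the estimate much.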

3.4 Simple (but not necessarily most useful) example: trimmed mean

• Define p-% trimmed mean as mean of sample with p% lowest and p% highest observations discarded. (Idea is to try to reduce effect of outliers.)

• Suppose data consist of 10 (ordered) observations: 1,2,3,4,8,10,12,15,20,30. Let the 20% trimmed mean (two observations discarded from each end) be denoted Z. Then Z = (3 + 4 + 8 + 10 + 12 + 15)/6 = 8.67.


• R code to estimate SE(Z):

    #define function
    trim.mean<-function(x,ntrim){
      n<-length(x)                     #use length of x, not a global n
      ii<-order(x)
      xtmp<-x[ii]
      return(mean(xtmp[(ntrim+1):(n-ntrim)]))
    }
    data<-c(1,2,3,4,8,10,12,15,20,30)  #specify data
    n<-length(data)
    ntrim<-2                           #specify number to trim from each side
    Zobs<-trim.mean(data,ntrim)        #get observed value
    nrep<-10000                        #specify number of bootstrap replicates
    Zboot<-rep(NA,nrep)                #assign memory
    for (i in 1:nrep){                 #get bootstrap replicates
      Zboot[i]<-trim.mean(sample(data,n,replace=TRUE),ntrim)
    }
    SE<-sd(Zboot)                      #calculate bootstrap std. error
    hist(Zboot,breaks=50)              #plot histogram of results

    #alternative code, without loops
    DATA<-matrix(sample(data,nrep*n,replace=TRUE),n,nrep)
    #each column is a bootstrap replicate
    Zboot<-apply(DATA,2,trim.mean,ntrim)
    SE<-sd(Zboot)

• This yields Zobs = 8.67 and SE(Z) ≈ 3.1.

[Figure: Histogram of Zboot. Horizontal axis: Zboot; vertical axis: Frequency.]


3.5 Useful R function: sample(x,n,replace=TRUE[or FALSE])

returns a random sample of size n from the vector x with or without replacement.

3.6 To sample from array X so that the variables (columns) stay together:

• nr<-dim(X)[1] #get number of rows

• i<-sample(1:nr,n,replace=TRUE[or FALSE])

#returns vector of n integers sampled on [1,nr]

• XSAMP<-X[i,]
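A small check that row-wise resampling keeps the variables together. The array X below is hypothetical: its second column is always ten times its first, so every resampled row should preserve that relationship.

```r
# Hypothetical 5 x 2 array: column b is 10 times column a, so any
# row that stays intact must still satisfy b == 10 * a.
set.seed(1)
X <- cbind(a = 1:5, b = 10 * (1:5))
nr <- dim(X)[1]                       # number of rows
n <- nr                               # sample size (assumed equal to nr here)
i <- sample(1:nr, n, replace = TRUE)  # row indices sampled on [1, nr]
XSAMP <- X[i, , drop = FALSE]         # drop = FALSE keeps a matrix even if n == 1
print(XSAMP)
```

Sampling each column separately would instead break the pairing between a and b, which is usually not what is wanted for multivariate data.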

4 Parametric bootstrap

4.1 Take observed sample and estimate relevant parameter from it.

4.2 Resample from parametric distribution with parameter equal to sample estimate (rather than resampling from observed distribution).

4.3 This approach can also be applied to more complicated situations:

for example, simulating a process with parameters estimated from data.

4.3.1 We’ll do lots of this later...
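A minimal sketch of steps 4.1 and 4.2, under assumptions that are not in the notes: the data are taken to be exponentially distributed, the parameter is the rate (estimated as 1/mean), and the statistic of interest is the sample median.

```r
# Assumptions (not from the notes): exponential model, statistic = median.
set.seed(1)
x <- rexp(40, rate = 0.5)             # hypothetical observed sample
n <- length(x)
rate.hat <- 1 / mean(x)               # maximum-likelihood estimate of the rate
Zobs <- median(x)
nrep <- 5000
# Resample from the fitted exponential rather than from x itself
Zboot <- replicate(nrep, median(rexp(n, rate = rate.hat)))
SE.param <- sd(Zboot)
print(c(Zobs = Zobs, rate.hat = rate.hat, SE.param = SE.param))
```

The result is only as good as the assumed parametric family; if the exponential model is wrong, the error estimate can be misleading.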


5 Examples of Finite-sample Bias (sample-size bias)

5.1 Sample variance

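5.1.1 $\sum (x - \bar{x})^2 / n$ is biased.

This is systematically too low, which makes sense since it is based on squared deviations from the sample mean, and the sample mean is by construction the value that minimizes the sum of squared deviations for that sample.

5.1.2 $\sum (x - \bar{x})^2 / (n - 1)$ is unbiased.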
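The bias can be seen in a quick simulation. The sketch assumes a normal population with variance 4 and a small sample size of 5 (both arbitrary choices); for such samples the divide-by-n estimator has expected value sigma^2 (n - 1)/n.

```r
# Assumed population: normal with variance 4; sample size 5.
set.seed(1)
n <- 5
ntrial <- 50000
sigma2 <- 4
v.n <- numeric(ntrial)                # divide by n
v.n1 <- numeric(ntrial)               # divide by n - 1
for (j in 1:ntrial) {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  ss <- sum((x - mean(x))^2)          # sum of squared deviations from sample mean
  v.n[j] <- ss / n
  v.n1[j] <- ss / (n - 1)
}
print(c(mean.div.n = mean(v.n), mean.div.n1 = mean(v.n1),
        expected.div.n = sigma2 * (n - 1) / n))
```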

5.2 Number of taxa

5.2.1 Rarefaction method (from Raup 1975)

• Abundance of species i is $N_i$; $N = \sum N_i$.

• Consider a particular species, i.

• $\binom{N - N_i}{n}$ is the number of ways of drawing a sample of n individuals that contains only non-i individuals.

• $\binom{N}{n}$ is the number of ways of drawing a sample of n from all N individuals.

• Therefore, the ratio of these two is the probability of not drawing any individuals of species i.

• Therefore 1 minus this ratio is the probability of drawing at least one individual of species i.

• So the expected number of species is just the sum of this probability, calculated for each species in turn.
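Written out, the steps above give an expected number of species of $\sum_i \left[1 - \binom{N - N_i}{n} / \binom{N}{n}\right]$ in a sample of n individuals. A minimal R sketch follows; the function name `rarefy.expected` and the abundance vector are hypothetical, and logarithms of binomial coefficients are used only to avoid overflow for large N.

```r
# Expected number of species in a random subsample of n individuals,
# given a vector of abundances Ni. The function name is hypothetical.
rarefy.expected <- function(Ni, n) {
  N <- sum(Ni)
  stopifnot(n >= 0, n <= N)
  # Pr(species i absent) = choose(N - Ni, n) / choose(N, n), computed on log scale.
  # When N - Ni < n, lchoose gives -Inf, so the probability of absence is 0.
  p.absent <- exp(lchoose(N - Ni, n) - lchoose(N, n))
  sum(1 - p.absent)
}
Ni <- c(50, 30, 10, 5, 3, 1, 1)       # hypothetical abundances, N = 100
print(sapply(c(1, 10, 50, 100), function(n) rarefy.expected(Ni, n)))
```

Two sanity checks: a sample of one individual always contains exactly one species, and a sample of all N individuals contains every species.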

5.2.2 Caveats

• Rarefaction for interpolation rather than extrapolation

• Collecting curves vs. rarefaction curves

• Apparent “leveling off” of curves does not imply that nearly everything has been found (only that you’re unlikely to find it with modest effort).

• Curves affected by factors other than sample size (sampling method, taxonomic treatment, size of geographic area, etc.).

• Crossing of rarefaction curves can make interpretation difficult.



5.2.3 Examples of application of taxonomic rarefaction (Raup 1975; Raup and Schopf 1978)

This example suggests that the increase in observed family diversity in post-Paleozoic echinoids cannot be accounted for by an increase in the number of species sampled.

This example suggests that much of the variation in the number of observed echinoid orders is consistent with differences in number of sampled species. (But does this mean that’s really all that is going on?!)


5.2.4 Interpretation of taxonomic rarefaction curves not entirely straightforward.

Sampling standardization to be treated in more detail later.


5.3 Range

5.3.1 Example: Range of samples from normal distribution
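The dependence of the range on sample size can be sketched by simulation. The sample sizes and number of trials below are arbitrary choices, and a standard normal is assumed.

```r
# Mean range of samples of increasing size from a standard normal.
set.seed(1)
sizes <- c(5, 10, 50, 100, 500)       # illustrative sample sizes
ntrial <- 2000
mean.range <- sapply(sizes, function(n)
  mean(replicate(ntrial, diff(range(rnorm(n))))))
print(data.frame(n = sizes, mean.range = mean.range))
```

The mean range keeps increasing with n even though the underlying distribution is unchanged, which is why ranges from samples of different size are not directly comparable.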





5.3.2 Example: Test for nonrandomness of sampling with respect to morphology

(Foote 1997, Paleobiology 23:181)


5.3.3 Correction in general case via rarefaction (random subsampling at controlled sample size)

(Foote 1992, Paleobiology 18:1)

Caveat: Range at standardized sample size may not convey any information that isn’t conveyed by sample variance.


6 Extreme value statistics

6.1 Introduction to problem

6.1.1 Previous look at standard errors considered sampling distribution of quantities such as mean.

6.1.2 We may also be interested in distribution of extremes:

For example, how is the largest of n observations distributed, or the second smallest, etc.?

6.1.3 Applications: earthquakes, floods, etc.; evolutionary “constraints”

6.2 Probability of number of observations exceeding some value, if distribution known

6.2.1 $\Pr(X > x) = 1 - F(x)$, where $F(x)$ is the cumulative distribution.

6.2.2 If there are N observations, then the probability that exactly k of them exceed some value x is given by a simple binomial:

$$\binom{N}{k}\,[1 - F(x)]^{k}\,F(x)^{N-k}$$

6.2.3 Example: standard normal with N = 10, x = 0.67, and k = 3:

$F(0.67) = 0.75$, so the probability is $\binom{10}{3}\,0.25^{3}\,0.75^{7} = 0.25$.
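The worked example can be checked directly in R. The sketch evaluates the binomial expression by hand and with the built-in `dbinom`, using `pnorm` for the standard normal cumulative distribution.

```r
# Probability that exactly k of N standard-normal observations exceed x.
N <- 10; k <- 3; x <- 0.67
Fx <- pnorm(x)                        # cumulative probability, about 0.75
p.formula <- choose(N, k) * (1 - Fx)^k * Fx^(N - k)
p.dbinom <- dbinom(k, size = N, prob = 1 - Fx)   # same quantity via dbinom
print(c(Fx = Fx, p.formula = p.formula, p.dbinom = p.dbinom))
```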

6.2.4 Future observations

• Suppose we have n1 past observations ranked from m = 1 (largest) to m = n1 (smallest), and we take n2 future observations.

• What is the probability that exactly k of n2 observations will exceed the mth value from the first set of n1 observations?

• Simply find F(x) corresponding to the mth value and plug into previous binomial equation.

• Clearly this works only if we know the distribution.


6.3 Probability of number of observations exceeding some value, even if distribution is not known

6.3.1 General expressions:


6.3.2 Derivation:

See Gumbel pp. 57-60

6.3.3 Intuitive explanation for insensitivity to distribution:

A given number of points should “cover” a given proportion of the cumulative distribution, regardless of the shape of the distribution (provided that it is continuous).
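The insensitivity to distribution can be illustrated by simulation. In this sketch the helper `exceed.dist` is hypothetical, and the choices n1 = 10, n2 = 5, m = 2 and the two distributions (normal and exponential) are arbitrary; the point is only that the two sets of estimated probabilities should agree to within simulation error.

```r
# Distribution of the number of n2 future observations that exceed the
# m-th largest of n1 past observations. rdist is any function that returns
# random draws from a continuous distribution. Helper name is hypothetical.
exceed.dist <- function(rdist, n1, n2, m, ntrial) {
  x <- replicate(ntrial, {
    past <- sort(rdist(n1), decreasing = TRUE)
    sum(rdist(n2) > past[m])          # number of exceedances in this trial
  })
  p <- tabulate(x + 1, nbins = n2 + 1) / ntrial
  names(p) <- 0:n2
  p
}
set.seed(1)
p.norm <- exceed.dist(rnorm, n1 = 10, n2 = 5, m = 2, ntrial = 20000)
p.exp  <- exceed.dist(rexp,  n1 = 10, n2 = 5, m = 2, ntrial = 20000)
print(round(rbind(normal = p.norm, exponential = p.exp), 3))
```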

6.3.4 Example (table 2.2.1 from Gumbel):

Note symmetry in table. Probability of x exceedances above largest is the same as probability of x exceedances below lowest, etc.


6.3.5 Application to crinoid evolution (Foote 1994)




6.4 Relationship to theory of records

6.4.1 Let there be n1 past trials and n2 future trials. What is the probability that the record (m = 1) set by the first set of trials will still stand after the second set (i.e. x = 0)?

This is $w(0)$. Now, suppose we let $n_1 = n_2$. The general expression is

$$w(x) = \frac{\binom{n_1}{m}\, m\, \binom{n_2}{x}}{(n_1 + n_2)\binom{n_1 + n_2 - 1}{x + m - 1}},$$

which, for $n_1 = n_2$, $m = 1$, and $x = 0$, gives

$$w(0) = \frac{\binom{n_1}{1}\binom{n_1}{0}}{(2n_1)\binom{2n_1 - 1}{0}} = \frac{n_1}{2n_1},$$

which is equal to 1/2.

6.4.2 What is the expected number of exceedances above the past record?

With m = 1 and $n_2 = n_1$:

$$E(x) = \frac{m\,n_2}{n_1 + 1} = \frac{n_1}{n_1 + 1} \approx 1 \text{ for large } n_1$$
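Both results in this section can be checked numerically from the expression for w(x). The sketch below uses n1 = n2 = 20 as an arbitrary illustration; the function name `w` simply mirrors the notation in the notes.

```r
# Probability that exactly x of n2 future observations exceed the m-th
# largest of n1 past observations (the expression for w(x) in 6.4.1).
w <- function(x, m, n1, n2) {
  choose(n1, m) * m * choose(n2, x) /
    ((n1 + n2) * choose(n1 + n2 - 1, x + m - 1))
}
n1 <- 20; n2 <- 20; m <- 1            # illustrative values
wx <- w(0:n2, m, n1, n2)
print(w(0, m, n1, n2))                # probability that the record stands
print(sum(wx))                        # probabilities over x = 0..n2 sum to 1
print(c(expected = sum((0:n2) * wx), formula = m * n2 / (n1 + 1)))
```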

6.4.3 Thus, for athletic contests, if all trials reflect the same underlying pool of talent, equipment, etc., the waiting time between successive records should progressively double.

6.4.4 Likewise for discoveries of largest dinosaur, oldest primate etc.

Deviations suggest change in “rules” or nonrandom searching.
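A null expectation for records can be simulated. For independent draws from a single continuous distribution, trial i sets a new record with probability 1/i, so the expected number of records in n trials is the harmonic sum 1 + 1/2 + ... + 1/n. The sketch assumes normal draws and arbitrary values for the number of trials and sequences.

```r
# Records in sequences of independent draws from one fixed distribution.
set.seed(1)
ntrial <- 5000                        # number of simulated sequences
n <- 200                              # trials per sequence
# is.record[i, j] is TRUE when trial i of sequence j equals the running maximum
is.record <- replicate(ntrial, {
  x <- rnorm(n)
  x == cummax(x)
})
p.record <- rowMeans(is.record)       # estimated Pr(trial i sets a record)
idx <- c(1, 2, 5, 10, 50, 100, 200)
print(round(cbind(trial = idx, simulated = p.record[idx], one.over.i = 1 / idx), 3))
print(c(mean.records = mean(colSums(is.record)), harmonic = sum(1 / (1:n))))
```

Observed record sequences that accumulate much faster than this would be candidates for a change in "rules" or nonrandom searching.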
