Environmental Data Analysis with MatLab Lecture 24: Confidence Limits of Spectra; Bootstraps

Page 1: Environmental Data Analysis with  MatLab

Environmental Data Analysis with MatLab

Lecture 24:

Confidence Limits of Spectra; Bootstraps

Page 2: Environmental Data Analysis with  MatLab

Housekeeping

This is the last lecture

The final presentations are next week

The last homework is due today

Page 3: Environmental Data Analysis with  MatLab

SYLLABUS

Lecture 01 Using MatLab
Lecture 02 Looking At Data
Lecture 03 Probability and Measurement Error
Lecture 04 Multivariate Distributions
Lecture 05 Linear Models
Lecture 06 The Principle of Least Squares
Lecture 07 Prior Information
Lecture 08 Solving Generalized Least Squares Problems
Lecture 09 Fourier Series
Lecture 10 Complex Fourier Series
Lecture 11 Lessons Learned from the Fourier Transform
Lecture 12 Power Spectral Density
Lecture 13 Filter Theory
Lecture 14 Applications of Filters
Lecture 15 Factor Analysis
Lecture 16 Orthogonal functions
Lecture 17 Covariance and Autocorrelation
Lecture 18 Cross-correlation
Lecture 19 Smoothing, Correlation and Spectra
Lecture 20 Coherence; Tapering and Spectral Analysis
Lecture 21 Interpolation
Lecture 22 Hypothesis testing
Lecture 23 Hypothesis Testing continued; F-Tests
Lecture 24 Confidence Limits of Spectra, Bootstraps

Page 4: Environmental Data Analysis with  MatLab

purpose of the lecture

continue

develop a way to assess the significance of a spectral peak

and

develop the Bootstrap Method of determining confidence intervals

Page 5: Environmental Data Analysis with  MatLab

Part 1

assessing the confidence level of a spectral peak

Page 6: Environmental Data Analysis with  MatLab

what does confidence in a spectral peak mean?

Page 7: Environmental Data Analysis with  MatLab

one possibility

indefinitely long phenomenon

you observe a short time window (looks “noisy” with no obvious periodicities)

you compute the p.s.d. and detect a peak

you ask: would this peak still be there if I observed some other time window? or did it arise from random variation?

Page 8: Environmental Data Analysis with  MatLab

example

[Figure: time series d versus time t (0 to 1000), with four panels of amplitude spectral density (a.s.d.) versus frequency f below it; the four panels are marked Y, N, N, N.]

Page 9: Environmental Data Analysis with  MatLab

[Figure: time series d versus time t (0 to 1000), with four panels of amplitude spectral density (a.s.d.) versus frequency f below it; the four panels are marked Y, Y, Y, Y.]

Page 10: Environmental Data Analysis with  MatLab

Null Hypothesis

The spectral peak can be explained by random variation in a time series that consists of nothing but random noise.

Page 11: Environmental Data Analysis with  MatLab

Easiest Case to Analyze

Random time series that is:

Normally-distributed
uncorrelated
zero mean
variance that matches power of time series under consideration

Page 12: Environmental Data Analysis with  MatLab

So what is the probability density function p(s²) of points in the power spectral density s² of such a time series?

Page 13: Environmental Data Analysis with  MatLab

Chain of Logic, Part 1

The time series is Normally-distributed

The Fourier Transform is a linear function of the time series

Linear functions of Normally-distributed variables are Normally-distributed, so the Fourier Transform is Normally-distributed too

For a complex FT, the real and imaginary parts are individually Normally-distributed

Page 14: Environmental Data Analysis with  MatLab

Chain of Logic, Part 2

The time series has zero mean

The Fourier Transform is a linear function of the time series

The mean of a linear function is the function of the mean value, so the mean of the FT is zero

For a complex FT, the means of the real and imaginary parts are individually zero

Page 15: Environmental Data Analysis with  MatLab

Chain of Logic, Part 3

The time series is uncorrelated

The Fourier Transform has [GᵀG]⁻¹ proportional to I

So by the usual rules of error propagation, the Fourier Transform is uncorrelated too

For a complex FT, the real and imaginary parts are uncorrelated

Page 16: Environmental Data Analysis with  MatLab

Chain of Logic, Part 4

The power spectral density is proportional to the sum of squares of the real and imaginary parts of the Fourier Transform

The sum of squares of two uncorrelated Normally-distributed variables with zero mean and unit variance is chi-squared distributed with two degrees of freedom.

Once the p.s.d. is scaled so that the real and imaginary parts of the Fourier Transform have unit variance, it is chi-squared distributed with two degrees of freedom.

Page 17: Environmental Data Analysis with  MatLab

so

s²/c is chi-squared distributed

where c is a yet-to-be-determined scaling factor
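The chain of logic can be checked numerically. The sketch below is in Python (standard library only) rather than the lecture's MatLab, and every name in it is illustrative. It assumes white Normally-distributed noise of known standard deviation sigma, computes a plain discrete Fourier Transform, scales the sum of squares of the real and imaginary parts by their common variance N·sigma²/2, and compares the result with the chi-squared distribution with two degrees of freedom, whose cumulative distribution is 1 − exp(−x/2).

```python
import cmath
import math
import random

def scaled_power(d, sigma):
    """(Re^2 + Im^2) of the DFT, divided by the variance N*sigma^2/2
    that the real and imaginary parts each have for white noise."""
    n = len(d)
    out = []
    for k in range(1, n // 2):  # interior frequencies only (skip zero and Nyquist)
        ft = sum(d[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))
        out.append((ft.real ** 2 + ft.imag ** 2) / (n * sigma ** 2 / 2))
    return out

random.seed(1)
sigma, n, trials = 2.0, 64, 200  # illustrative values, not from the lecture
values = []
for _ in range(trials):
    d = [random.gauss(0.0, sigma) for _ in range(n)]
    values.extend(scaled_power(d, sigma))

# 95% point of chi-squared with 2 degrees of freedom, about 5.99
level95 = -2.0 * math.log(0.05)
fraction_below = sum(v < level95 for v in values) / len(values)
mean_value = sum(values) / len(values)
print(fraction_below, mean_value)
```

If the reasoning holds, about 95% of the scaled values should fall below the 95% point and their mean should be close to 2, the mean of a chi-squared variable with two degrees of freedom.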

Page 18: Environmental Data Analysis with  MatLab

in the text, it is shown that

where:
σd² is the variance of the data
Nf is the length of the p.s.d.
Δf is the frequency sampling
ff is the variance of the taper; it adjusts for the effect of tapering.
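The equation for c did not survive in this transcript. A reconstruction that is consistent with the symbols defined here is sketched below. It assumes the p.s.d. is normalized so that its sum over the Nf frequencies, times Δf, equals the power of the tapered time series; under a different normalization the constant would differ.

```latex
% Hedged reconstruction, not copied from the slide.
% A chi-squared variable with 2 degrees of freedom has mean 2,
% so the mean level of the p.s.d. is 2c.
c = \frac{f_f \, \sigma_d^2}{2 N_f \, \Delta f},
\qquad
\overline{s^2} = 2c = \frac{f_f \, \sigma_d^2}{N_f \, \Delta f},
\qquad
N_f \, \Delta f \; \overline{s^2} = f_f \, \sigma_d^2 .
```

The last relation is the consistency check: the mean p.s.d. level, accumulated over all frequencies, returns the variance of the data scaled by the taper factor.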

Page 19: Environmental Data Analysis with  MatLab

example 1: a completely random time series

[Figure: A) tapered time series, d(i) versus time t, seconds, with the +2sd and −2sd levels marked; B) power spectral density, s²(f) versus frequency f, Hz, with the mean and 95% levels marked.]

Page 20: Environmental Data Analysis with  MatLab

example 1: histogram of spectral values

[Figure: counts versus power spectral density, s²(f), with the mean and 95% levels marked.]

Page 21: Environmental Data Analysis with  MatLab

example 2: random time series consisting of 5 Hz cosine plus noise

[Figure: A) tapered time series, d(i) versus time t, seconds, with the +2sd and −2sd levels marked; B) power spectral density, s²(f) versus frequency f, Hz, with the mean and 95% levels marked.]

Page 22: Environmental Data Analysis with  MatLab

example 2: histogram of spectral values

[Figure: counts versus power spectral density, s²(f), with the mean, the 95% level and the peak marked.]

Page 23: Environmental Data Analysis with  MatLab

so how confident are we of a peak at 5 Hz?

P(s² < level of the peak) = 0.99994

the p.s.d. is predicted to be less than the level of the peak 99.994% of the time

But here we must be very careful
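For two degrees of freedom the chi-squared cumulative distribution has a closed form, P(x) = 1 − exp(−x/2), so levels such as the 95% line can be computed without a statistics library. The sketch below is in Python rather than the lecture's MatLab and its function names are illustrative; it works with the scaled variable x = s²/c, so it says nothing about the unscaled peak height, which the slides do not give.

```python
import math

def chi2_cdf_2dof(x):
    """P(chi-squared with 2 degrees of freedom <= x)."""
    return 1.0 - math.exp(-x / 2.0)

def chi2_ppf_2dof(p):
    """Inverse of chi2_cdf_2dof: the x at which the cumulative probability is p."""
    return -2.0 * math.log(1.0 - p)

# Scaled p.s.d. level, s^2/c, that random noise stays below 95% of the time
level95 = chi2_ppf_2dof(0.95)

# Scaled level implied by the probability 0.99994 quoted for the 5 Hz peak
peak_scaled = chi2_ppf_2dof(0.99994)

print(level95, peak_scaled, chi2_cdf_2dof(peak_scaled))
```

Multiplying either scaled level by c converts it back to the units of the p.s.d.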

Page 24: Environmental Data Analysis with  MatLab

two alternative Null Hypotheses

a peak of the observed amplitude at 5 Hz is caused by random variation

a peak at the observed amplitude somewhere in the p.s.d. is caused by random variation

Page 25: Environmental Data Analysis with  MatLab

two alternative Null Hypotheses

a peak of the observed amplitude at 5 Hz is caused by random variation

a peak at the observed amplitude somewhere in the p.s.d. is caused by random variation

much more likely, since p.s.d. has many frequency points

(513 in this case)

Page 26: Environmental Data Analysis with  MatLab

two alternative Null Hypotheses

a peak of the observed amplitude at 5 Hz is caused by random variation

a peak at the observed amplitude somewhere in the p.s.d. is caused by random variation

under the first Null Hypothesis, a peak of the observed amplitude or greater occurs only 1 − 0.99994 = 0.00006, or 0.006%, of the time

The Null Hypothesis can be rejected to high certainty

Page 27: Environmental Data Analysis with  MatLab

two alternative Null Hypotheses

a peak of the observed amplitude at 5 Hz is caused by random variation

a peak at the observed amplitude somewhere in the p.s.d. is caused by random variation

under the second Null Hypothesis, a peak of the observed amplitude or greater occurs somewhere in the p.s.d. 1 − (0.99994)⁵¹³ ≈ 0.03, or 3%, of the time

The Null Hypothesis can be rejected to acceptable certainty
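The arithmetic behind the two Null Hypotheses can be written out as below. This is a Python sketch with illustrative variable names; it assumes, as the slide's formula does, that the 513 frequency points of the p.s.d. behave as independent draws.

```python
p_single = 0.99994  # probability that one p.s.d. point is below the peak level
n_freq = 513        # number of frequency points in the p.s.d.

# Null Hypothesis 1: random variation produces the peak at 5 Hz specifically
prob_at_5hz = 1.0 - p_single

# Null Hypothesis 2: random variation produces such a peak at any frequency
# (assumes the frequency points are independent)
prob_anywhere = 1.0 - p_single ** n_freq

print(f"{100 * prob_at_5hz:.3f}%  {100 * prob_anywhere:.1f}%")
```

The second probability is roughly 513 times the first, which is why a peak that looks overwhelming at one pre-selected frequency is only acceptably significant when any frequency would have counted.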

Page 28: Environmental Data Analysis with  MatLab

Part 2

The Bootstrap Method

Page 29: Environmental Data Analysis with  MatLab

The Issue

What do you do when you have a statistic that can test a Null Hypothesis

but you don’t know its probability density function?

Page 30: Environmental Data Analysis with  MatLab

If you could repeat the experiment many times, you could address the problem empirically

perform experiment
calculate statistic, s
(repeat)
make histogram of s’s
normalize histogram into empirical p.d.f.

Page 31: Environmental Data Analysis with  MatLab

The problem is that it’s not usually possible to repeat an experiment many times over

Page 32: Environmental Data Analysis with  MatLab

Bootstrap Method

create approximate repeat datasets by randomly resampling (with duplications) the one existing data set

Page 33: Environmental Data Analysis with  MatLab

example of resampling

original data set
index:  1    2    3    4    5    6
value:  1.4  2.1  3.8  3.1  1.5  1.7

random integers in range 1-6
        3    1    3    2    5    1

resampled data set
index:  1    2    3    4    5    6
value:  3.8  1.4  3.8  2.1  1.5  1.4

Page 34: Environmental Data Analysis with  MatLab

example of resampling

original data set
index:  1    2    3    4    5    6
value:  1.4  2.1  3.8  3.1  1.5  1.7

random integers in range 1-6
        3    1    3    2    5    1

new data set
index:  1    2    3    4    5    6
value:  3.8  1.4  3.8  2.1  1.5  1.4
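The resampling example can be reproduced in a few lines. The sketch below is in Python rather than the lecture's MatLab; `resample` is an illustrative helper, not a function from the course.

```python
import random

original = [1.4, 2.1, 3.8, 3.1, 1.5, 1.7]

# The random integers shown in the example (1-based, as on the slide)
slide_indices = [3, 1, 3, 2, 5, 1]
resampled = [original[i - 1] for i in slide_indices]
print(resampled)  # [3.8, 1.4, 3.8, 2.1, 1.5, 1.4]

def resample(data, rng):
    """Draw len(data) values from data at random, with duplications allowed."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
new_data = resample(original, rng)
print(new_data)
```

Each call to `resample` returns a data set of the same length as the original, in which some values typically appear more than once and others not at all.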

Page 35: Environmental Data Analysis with  MatLab

interpretation of resampling

[Figure: schematic relating p(d) to p′(d), with stages labeled sampling, duplication and mixing.]

Page 36: Environmental Data Analysis with  MatLab

Example

[Figure: data d(i) versus time t, hours.]

what is the p(b), where b is the slope of a linear fit?

Page 37: Environmental Data Analysis with  MatLab

This is a good test case, because we know the answer

if the data are Normally-distributed, uncorrelated with variance σd²,

and given the linear problem d = G m where m = [intercept, slope]ᵀ

The slope is also Normally-distributed with a variance that is the lower-right element of σd² [GᵀG]⁻¹
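For a straight-line fit the matrix GᵀG is 2×2, so the lower-right element of its inverse can be written out by hand: it is N / (N Σt² − (Σt)²). The Python sketch below (an illustrative helper, not the lecture's MatLab code) evaluates the slope variance from that expression.

```python
def slope_variance(t, sigma_d):
    """Variance of the least-squares slope: the lower-right element of
    sigma_d^2 [G^T G]^-1, where the rows of G are [1, t_i]."""
    n = len(t)
    sum_t = sum(t)
    sum_tt = sum(ti * ti for ti in t)
    det = n * sum_tt - sum_t * sum_t  # determinant of G^T G
    return sigma_d ** 2 * n / det

# Small check: for t = 0..4 the sum of squared deviations of t from its mean is 10,
# so unit-variance data give a slope variance of 1/10.
var_b = slope_variance([0.0, 1.0, 2.0, 3.0, 4.0], 1.0)
print(var_b)
```

The same quantity equals σd² divided by the sum of squared deviations of t from its mean, which is why a longer observation span gives a tighter slope.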

Page 38: Environmental Data Analysis with  MatLab
Page 39: Environmental Data Analysis with  MatLab

create resampled data set

returns N random integers from 1 to N

Page 40: Environmental Data Analysis with  MatLab

usual code for least squares fit of line

save slopes
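The code on these slides did not survive in the transcript. The steps they annotate (create a resampled data set, fit a line by least squares, save the slope) can be sketched as below, in Python rather than the lecture's MatLab. The synthetic data, the true slope of 0.53, the noise level and the number of resamplings are all assumptions made for illustration.

```python
import random

def fit_line(t, d):
    """Least-squares straight line through (t, d); returns (intercept, slope)."""
    n = len(t)
    sum_t, sum_d = sum(t), sum(d)
    sum_tt = sum(ti * ti for ti in t)
    sum_td = sum(ti * di for ti, di in zip(t, d))
    det = n * sum_tt - sum_t * sum_t  # zero only if every t is identical
    slope = (n * sum_td - sum_t * sum_d) / det
    intercept = (sum_d - slope * sum_t) / n
    return intercept, slope

def bootstrap_slopes(t, d, n_boot, rng):
    """Refit the line to n_boot resampled (t, d) data sets and save the slopes."""
    n = len(t)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # N random indices, with duplications
        slopes.append(fit_line([t[i] for i in idx], [d[i] for i in idx])[1])
    return slopes

rng = random.Random(42)
t = [float(i) for i in range(50)]                        # hypothetical times
d = [1.0 + 0.53 * ti + rng.gauss(0.0, 1.0) for ti in t]  # hypothetical noisy line
slopes = bootstrap_slopes(t, d, 1000, rng)
mean_slope = sum(slopes) / len(slopes)
print(mean_slope)
```

Resampling the (t, d) pairs together keeps each datum attached to its own time, which is what makes each resampled set a plausible repeat of the experiment.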

Page 41: Environmental Data Analysis with  MatLab

histogram of slopes

Page 42: Environmental Data Analysis with  MatLab

integrate p(b) to P(b)

2.5% and 97.5% bounds
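One way to obtain the 2.5% and 97.5% bounds is sketched below in Python. Note the swapped-in technique: instead of binning a histogram and integrating it, the sketch sorts the samples and reads the bounds off the empirical cumulative distribution, which approaches the same answer as the bins become fine. The function name and the Normally-distributed test samples are illustrative.

```python
import random

def confidence_bounds(samples, lower=0.025, upper=0.975):
    """Values below which the fractions `lower` and `upper` of the samples fall,
    read from the sorted samples (the empirical cumulative distribution)."""
    s = sorted(samples)
    n = len(s)
    lo = s[int(round(lower * (n - 1)))]
    hi = s[int(round(upper * (n - 1)))]
    return lo, hi

# Check on samples with a known answer: for a unit-variance, zero-mean
# Normal distribution the 95% bounds are close to -1.96 and +1.96.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
lo, hi = confidence_bounds(samples)
print(lo, hi)
```

Applied to the saved bootstrap slopes, the same call would return the 95% confidence interval for b.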

Page 43: Environmental Data Analysis with  MatLab

[Figure: p(b) versus slope, b, comparing standard error propagation with the bootstrap; the 95% confidence interval is marked.]

Page 44: Environmental Data Analysis with  MatLab

a more complicated example

p(r), where r is the CaO to Na2O ratio of the second varimax factor of the Atlantic Rock dataset

Page 45: Environmental Data Analysis with  MatLab

[Figure: p(r) versus CaO / Na2O ratio, r, with the mean and the 95% confidence interval marked.]

Page 46: Environmental Data Analysis with  MatLab

we can use this histogram to write confidence intervals for r

r has a mean of 0.486

95% probability that r is between 0.458 and 0.512

and roughly, since p(r) is approximately symmetrical

r = 0.486 ± 0.025 (95% confidence)