
Introduction to Geological Data Analysis


Introduction to Geological Data Analysis

GS-134 By William A. Prothero, Jr.

Winter, 2002

Table of Contents:

Chapter 1. Kinds of Data
Chapter 2. Plotting Data with 2 Variables
Chapter 3. Correlation and Regression
Chapter 4. The Statistics of Discrete Value Random Variables
Chapter 5. Probability Distributions and Statistical Inference
Chapter 6. Statistical Inference and t, χ², and F Distributions
Chapter 7. Propagation of Errors

This material is extracted from an unpublished book written by William A. Prothero on Geological Statistics. It is an "in-progress" work. Please do not copy or reproduce any of this work without my permission. Please also note that the course syllabus, homework, and lab activities are available at http://oceanography.geol.ucsb.edu/ (click on "Classes"). Thank you, William A. Prothero, Jr.


CHAPTER 1 By William A. Prothero, Jr.

Kinds of Data

A measurement can come in many forms. It may be the color of a rock, the number that comes up on a die, or a measurement from an instrument. We define several types of measurement scales.

nominal: Classified as belonging to one of a number of defined categories. For example, a rock may be 'measured' as igneous, metamorphic or sedimentary. The simplest type of scale is a nominal scale with only two categories, that is, a scale where an object can have one of two possible states. For example, we might 'measure' a rock as either containing a particular mineral or not containing that mineral.

ordinal: Classified as belonging to one of a number of defined categories where the different categories have a definite rank or order. An example of the ordinal scale of measurement is the Mohs hardness scale for minerals. A mineral is classified as 1, 2, 3, ... 10, where a mineral with a hardness of 10 is harder than a mineral with a hardness of 9, which is harder than a mineral with a hardness of 8, and so forth. A mineral with a hardness of 4, however, is not necessarily twice as hard as a mineral with a hardness of 2.

counting: The counting measurement scale also has discrete values. An example of data measured on a counting scale is the number of earthquakes above a certain magnitude recorded in a particular location in one year.

interval: Measurements made on these scales have a continuous scale of values. Temperature is measured on an interval scale. Although the centigrade and Kelvin scales have different zeros, the difference between the boiling point of water and the freezing point of water is 100° on both scales.

ratio: The ratio scale is the same as the interval scale, but it has a true zero. Length, mass, velocity and time are examples of measurements made on a ratio scale.

angular: Measurements of strikes and dips are examples of data measured on an angular scale. An angular scale is a continuous scale between 0° and 360°.

Each of the above data types may require variations in plotting strategies. The following sections show how to construct histograms for these data types, and later chapters show how to plot these data when more than a single variable is associated with the data.


Parametric and non-parametric statistics
There are two important types of statistics. The first type is parametric statistics. Parametric statistics concern the use of sample parameters to estimate population parameters. For example, if we were interested in the porosity of a particular sandstone bed, we might take a sample of 10 porosity measurements and estimate the mean and standard deviation of the bed based on the mean and standard deviation of our sample using parametric statistics. Parametric statistics are limited to data measured on a continuous scale of values and require that a number of assumptions be satisfied, including that the individuals in the population are independent and that the population is normally distributed. As we will see in later chapters, the Central Limit Theorem greatly increases the number of problems that can be addressed with parametric statistics. Non-parametric statistics do not involve the parameters of the population from which the sample was taken and may be used when data are measured on a discrete scale of values or when the assumptions required by parametric statistics cannot be satisfied. When the number of samples is large, we can sometimes treat data measured on a discrete scale as if they were continuous. We stress that when using any statistical method, it is very important to make sure that the assumptions on which the method is based are appropriate to the problem.

Measurement Errors
If data were error free, it would not be necessary to read the remainder of this text. Errors come from many sources. In fact, they are physically required by the Heisenberg Uncertainty Principle, which states that there is an inherent uncertainty in any measurement that can be made in a finite length of time. On a practical level, errors occur because the instruments that we use have noise and because of naturally occurring variations in the earth. For example, suppose you are measuring the composition of rocks sampled from a particular region. You would expect composition to vary because of (hopefully) small variations in the history of the rock, variations in chemical composition of the source, and varying contamination from other sources (crustal rock contamination of igneous intrusive rocks, weathering, leaching, etc.). In seismology, the earthquake signals will vary from site to site because of varying surface soil conditions, scattering of the seismic waves on the way from the source, and instrument errors. However, one person's noise may become another person's signal. If the problem is to determine the magnitude of the quake, the variations in signal due to scattering and surface structure variations will be "noise". But, if the problem is to study scattering or site response, the variations due to these effects are "signal" to be studied and explained.

Accuracy and Precision:
Accuracy is the closeness of a measurement to the "true" value of the quantity being measured. Precision is the repeatability of a measurement. If our measurements are very precise, then all our values will cluster about the same value. To make the distinction between these two terms clear, consider the data plotted in Figure 1.1. Five measurements of the concentration of chemical X in a given water sample are made with each of four different instruments. The "true" concentration of the sample is 50 mg/l.

[Figure 1.1: Measured concentrations of chemical X (mg/l), on a y axis from 0 to 80, for instruments A, B, C, and D.]
Figure 1.1 Plot of concentration of chemical X. The correct value of the concentration is 50 mg/l. A shows a precise and accurate measurement, B is accurate but not precise, C is precise but not accurate, and D is neither precise nor accurate.

Instrument A is both precise and accurate. Instrument B is accurate but not precise. Instrument C is precise but not accurate. Instrument D is neither precise nor accurate.

Bias:
Bias will be discussed in more detail in Chapter 7. Cases C and D in figure 1.1 (above) demonstrate "bias" in the data. In this case, the bias is caused by an inaccurate measurement device. A good example is when you measure your weight on the bathroom scale. You may consistently get the same weight, but if the zero of the scale is not set properly, the result will consistently be high (or low), or biased.

Significant figures and rounding:
Another important concern with respect to data measurement is the correct use of significant figures. This has become a problem as hand calculators have come into universal use. For example, suppose you measure the length of a fossil with a ruler and find it to be 5 cm. Now suppose that you decide to divide that length by 3, for whatever reason. The answer is 1.666666666.... Obviously, since the ruler measurement is probably accurate to less than 0.1 cm, there is no reason to carry all of the sixes after the decimal point. Significant figures are the accurate digits, not counting leading zeros, in a number. When the number of digits on the right hand side of the decimal point of a number is reduced, that is called "rounding off". If the leading digit of the portion being dropped is greater than 5, you round up; if it is less than 5, you round down. So, 5.667 would round to 5.67, 5.7, or 6, while 5.462 would round to 5.46, 5.5, or 5. You set your own consistent convention for when the truncated digit is exactly 5. Generally, it is rounded up, but sometimes it is alternately rounded up, then down. Some conventions and rules exist regarding the number of "significant figures" to carry in your answer. When a number is put into a formula, the answer need have no more precision than the original number. Precision is also implied by how many digits are written. Writing "16.0" implies 16.0 ± 0.05, so that the number is known to within 0.1 accuracy, whereas writing "16.000" would indicate that the number is known to within 0.001 accuracy. The number 41.653 has 5 significant figures, 32.0 has 3, 0.0005 has 1, and so forth. In calculations involving addition and subtraction, the final result should have no more digits after the decimal point than the number with the fewest digits after the decimal point used in the calculation. For example, 6.03 + 7.2 = 13.2. In calculations involving multiplication and division, the final result has no more significant figures than the number with the least number of significant figures used in the calculation. For example, 1.53 × 10^1 * 7.21 = 1.10 × 10^2. Note that it may be clearer if you use scientific notation in these calculations. Consider the number 1,000; 1.000 × 10^3 has 4 significant figures whereas 1 × 10^3 only has 1. If your calculation involves multiple operations, it is best to carry additional significant figures until the final result so round-off errors don't accumulate.

It is extremely important to maintain precision when performing repetitive numeric operations. Keep as much precision as possible during intermediate calculations, but show the answer with the correct number of significant figures.
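To see why carrying extra precision matters, here is a minimal Python sketch (Python is not part of the original course materials, which use Excel; the numbers are hypothetical) comparing a sum kept at full precision with one where every intermediate value is rounded first:

```python
# Hypothetical example: a 5 cm ruler reading divided by 3, summed 100 times.
values = [5 / 3] * 100

full = sum(values)                          # keep full precision until the end
rounded = sum(round(v, 1) for v in values)  # round each intermediate value first

print(round(full, 1))     # 166.7 -- correct to the precision of the data
print(round(rounded, 1))  # 170.0 -- the round-off errors have accumulated
```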

Data distributions
It is very difficult to extract meaning from a large compilation of numbers, but graphical methods help us extract meaning from data because we see it visually. When the data values consist of single numbers, such as porosity, density, composition, amplitude, etc., the histogram is most convenient. Data are divided into ranges of values, called "classes", and the number of data points within each class is then plotted as a bar chart. This bar chart represents the data distribution. The shape of the distribution can tell us about errors in the data and underlying processes that influence it. A histogram can be described as a plot of measurement values versus frequency of occurrence, and for that reason, histograms are sometimes referred to as frequency diagrams. Some examples of histograms are shown in Figure 1.2(a-f). The histogram in Figure 1.2(b) is called a cumulative histogram or a cumulative frequency diagram. Note that in this figure, the values for weight % always increase as φ increases. The weight % at any point on this graph is the weight % for all φ values equal to or less than the value of φ at that point. (φ is a measure of grain size equal to -log₂(grain diameter in mm).) The histogram in Figure 1.2(d) is called a circular histogram and is a more meaningful way of displaying directional data than a simple x,y histogram.
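As an illustration of sorting data into classes and building the cumulative counts, here is a minimal Python sketch (assuming the numpy library is available; the data values are hypothetical):

```python
import numpy as np

# hypothetical measurements on a continuous scale
data = np.array([5.0, 5.8, 6.1, 6.3, 6.9, 7.4, 7.5, 7.5, 7.6, 7.8])

# divide the range 5-8 into 5 classes and count the data points in each
counts, edges = np.histogram(data, bins=5, range=(5.0, 8.0))

# a cumulative histogram is just the running sum of the class counts
cumulative = np.cumsum(counts)

print(counts)      # number of data points in each class
print(edges)       # the class boundaries
print(cumulative)  # cumulative counts, as in a cumulative frequency diagram
```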


[Figure 1.2a and b: two bar charts of WEIGHT % versus φ (2 to 10); the left y axis runs 0 to 50, the right (cumulative) y axis runs 0 to 100.]
Figure 1.2a and b: Grain size plots. The plot on the right is the cumulative histogram, which is just the integral of the curve on the left.

[Figure 1.2c: bar chart of % MAP AREA (0 to 60) for the categories igneous, metamorphic, and sedimentary. Figure 1.2d: circular histogram of directional data.]
Figure 1.2c and d: Examples of a nominal histogram and an angular histogram.

[Figure 1.2e: horizontal bar chart of % OF EARTH (0 to 30) versus altitude (km), in bins running from +4-5 km down to -6-7 km.]
Figure 1.2e: Example of a histogram with bars running horizontally.


[Figure 1.2f: bar chart of # Students (0 to 100) scoring in the ranges 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, and 91-100.]
Figure 1.2f: Example of a histogram of interval data. Each bar is the number of students achieving scores between specific values.

The general shape of a distribution can be described by terms such as symmetrical, bimodal and skewed (figure 1.3). The central tendency of a distribution is described by the mean, mode and median value.

Figure 1.3 Symmetric, bimodal, and skewed distributions. Here there is sufficient data in the sample so that the histogram bars follow a relatively smooth curve. The mean is the sum of all measurements divided by the total number of measurements. The mean of N data values is:

$$m = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad \text{(mean of the data)}$$

This value is also referred to as the arithmetic mean. There is also a geometric mean, defined as:

$$x_g = (x_1 x_2 x_3 \cdots x_N)^{1/N}$$


It is interesting to note that the logarithm of the geometric mean is equal to the arithmetic mean of the logs of the individual numbers. For four values a, b, c, d:

$$\log\left[(abcd)^{1/4}\right] = \frac{\log(a) + \log(b) + \log(c) + \log(d)}{4}$$

For a data distribution, important definitions are:
mode: the most frequently occurring value.
median: the value one-half of the measured values lie above and one-half the measured values lie below.
For a symmetrical distribution, the mean, mode and median values are the same.

Figure 1.4 For a symmetric distribution (bottom), the mode, median, and mean are at the same place, but for a skewed distribution, they are not.

The dispersion or variation of a data distribution can be described by the variance and standard deviation. The standard deviation is defined below.

$$\text{sample variance} = \frac{\sum_{i=1}^{N}(x_i - m)^2}{N}$$

$$\text{unbiased sample variance} = \frac{\sum_{i=1}^{N}(x_i - m)^2}{N - 1}$$

The formula for the sample mean, m, given above, uses the Greek summation symbol, $\sum_{i=1}^{N} x_i$. This means to add all values of x (your data values): $x_1 + x_2 + x_3 + x_4 + \ldots$ etc. This is a common symbol in statistical formulas. The summation symbol may also be shown as $\sum_n x_i$, which means to sum over all values of i. To test your understanding of this notation, imagine a very simple data set with values of x = 1, 2 and 3. Then N = 3. Do the calculation by hand. If you can do it, you probably understand the notation. The answer is 2, for the mean of the data.



$$s_b = \sqrt{\frac{\sum_{i=1}^{N}(x_i - m)^2}{N}} \qquad \text{(standard deviation of the sampled data)}$$

$$s_x = \sqrt{\frac{\sum_{i=1}^{N}(x_i - m)^2}{N - 1}} \qquad \text{(unbiased standard deviation of x; use this)}$$

where the mean, m, was defined previously. The larger the standard deviation, the larger the spread of values about the mean. The variance is the square of the standard deviation. The meaning of "unbiased", used above, will be discussed in Chapter 5.
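As a quick check of these definitions, here is a minimal Python sketch (assuming numpy; not part of the original text) that computes the mean and both forms of the standard deviation for the simple three-point data set used earlier:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
N = len(x)
m = x.mean()   # sample mean, m = 2.0

s_b = np.sqrt(((x - m) ** 2).sum() / N)        # divide by N (biased)
s_x = np.sqrt(((x - m) ** 2).sum() / (N - 1))  # divide by N-1 (unbiased; use this)

# numpy's ddof argument selects the divisor: ddof=0 gives N, ddof=1 gives N-1
assert np.isclose(s_b, x.std(ddof=0))
assert np.isclose(s_x, x.std(ddof=1))

print(m, s_b, s_x)   # 2.0  0.816...  1.0
```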


Figure 1.5. This figure shows how the standard deviation is a measure of the spread of values about the mean.

Symbols to be used throughout this book:
i'th data value: x_i
number of data in the experiment: N
mean of data values: m or x̄
unbiased standard deviation of the data: s
variance of the data: s²

A data distribution can also be described in terms of percentiles. The median value is the 50th percentile. The value which 75% of the measured values lie below is the 75th percentile. The value which 25% of the measured values lie below is the 25th percentile, and so forth. The 25th, 50th and 75th percentiles are also known as the quartiles. Given the range of data values and the quartile values for a distribution, we can tell if the distribution is symmetrical or highly skewed.


To illustrate the definitions of the terms defined above, we consider the set of measurements shown in Table 1.1.

5.0 5.8 6.1 6.3 6.9 7.4 7.5 7.5 7.6 7.8
8.3 8.3 8.8 8.8 8.9 9.0 9.1 9.2 9.4 9.4
9.4 9.4 9.5 9.6 10.0 10.0 10.3 10.3 10.5 10.5
10.6 10.7 10.8 10.8 10.9 11.0 11.4 11.6 11.9 11.9
12.2 12.3 12.4 12.8 12.8 13.1 13.4 13.5 14.0 14.1

Table 1.1. Length of fossil A (cm)

There are 50 data points in this data set, ranging in value from 5.0 cm to 14.1 cm. The mean, median and mode are 10.0 cm, 10.0 cm and 9.4 cm, respectively. The standard deviation is 2.2 cm. The 25th and 75th percentiles are 8.8 cm and 11.9 cm, respectively. See if you can find the median and mode by inspection.

Height of the Histogram Bars
Now you know how, using scripting, to assign a data value to a particular class. You also need to know how to compute the height of the histogram bar. There are several ways that this may be done.

1. Raw: The bar height is simply the number of values in the class:

Bar Height = number in class

This method may be used for any kind of data, but it is the only way that nominal data can be plotted.

2. Frequency: The bar height is the number of values in the class divided by the class width:

Bar Height = number in class / width of class

This method may only be used for interval, ratio, and a modified version of angular data. It requires that the data can be expressed on an interval number scale (it also works for integers). This scale is most common when a comparison with "expected" values is wanted. The number of values within the class is then equal to the area of the bar (height × width). This normalization has the advantage that the width of the bars may vary without the undesirable effect that wider bars also get taller. Figure 1.6 demonstrates the appearance of the plot of a dice tossing experiment when bar heights are calculated according to "number in class" and according to "number in class/class width", when faces "5" and "6" are combined into a single class.

Dice tossing experiment: A die is tossed N times. The number of times each face is "expected" to come up is N/6, since there are six faces and it is equally likely that any face comes up. When some number of tosses are made, the number for each face will normally NOT be N/6. This is due to a natural randomness that will be discussed further in later chapters.


Figure 1.6 Comparison of two methods of plotting a histogram when it is modified so that the numbers "5" and "6" are combined into a single class. The plot on the left shows the effect of simply adding the "5" and "6" values into a single bar. The plot on the right shows what happens when the bar height is the number of "5" and "6" occurrences divided by the number of faces included in the bar, or class interval. [Figure 1.6 panels: observed vs expected counts of "# showing on die" for 60 tosses, with y axes running from 0 to 30.]

3. Probability: Here, the bar height for option 2 is divided by the total number of data, N:

Bar Height = number in class / (width of class × number of data)

In this case, the histogram bars can be directly compared to the "probability density distribution", which will be introduced in Chapter 5. This has the advantage, for the dice toss experiment, of allowing us to plot any number of dice tosses without resetting the maximum Y value on the plot, and makes it easy to compare actual results to "expected" results.

Practice: It is important that you know how to process data by hand before entrusting it to the computer. This exercise gives you some practice in this. The problem is to plot a histogram of the data in the table below. You decide to sort the data into 5 equal-width classes divided between 0 and 10. First, enter the upper and lower boundaries of each class into the table below. There should be 5 equally spaced intervals, with the upper boundary of class n equal to the lower boundary of class n+1.

Class #   Lower Boundary   Upper Boundary
1         0
2
3
4
5                          10



Next, enter the class # for each of the numbers in the table of numbers below. You can do this by inspection. Verify also that equation 4 gives the correct class number. This formula is needed when the script is written to find the class.

Values   Class #
7.64
6.22
8.75
1.61
4.17
6.91
1.88
1.96
8.23
4.31
8.84
5.59
1.5
5.94
5.78
2.66
2.8
8.34
3.68
4.91

Now, count the number of data values in each class and enter them into column (1) of the table below. Then enter the class frequency and class probability values according to normalizations 1, 2, and 3 described above.

Class #   (1) # in Class   (2) Class Frequency   (3) Class Probability Density   Area of Class
1
2
3
4
5

For (1) you should have 4,3,6,3,4 and for (2), you should have 2,1.5,3,1.5,2 and for (3) you should have 0.1,0.075,0.15,0.075,0.1 and for the area of each class, you should have 0.2,0.15,0.3,0.15,0.2.
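These answers can be checked with a short script. The following Python sketch (the course itself uses Excel scripting; this is only an illustration) assigns each value to a class and applies the three normalizations:

```python
values = [7.64, 6.22, 8.75, 1.61, 4.17, 6.91, 1.88, 1.96, 8.23, 4.31,
          8.84, 5.59, 1.5, 5.94, 5.78, 2.66, 2.8, 8.34, 3.68, 4.91]

lo, hi, nclasses = 0.0, 10.0, 5
width = (hi - lo) / nclasses   # class width = 2

# class number = 1 + int((x - lo)/width), the kind of formula a script would use
classes = [1 + int((v - lo) // width) for v in values]

counts = [classes.count(k) for k in range(1, nclasses + 1)]
frequency = [c / width for c in counts]                 # number in class / class width
density = [c / (width * len(values)) for c in counts]   # ... / (class width * N)
area = [d * width for d in density]

print(counts)     # [4, 3, 6, 3, 4]
print(frequency)  # [2.0, 1.5, 3.0, 1.5, 2.0]
print(density)    # [0.1, 0.075, 0.15, 0.075, 0.1]
print(area)       # [0.2, 0.15, 0.3, 0.15, 0.2]
```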


If the above data were sampled from a continuous uniform distribution, where every number between 0 and 10 is equally likely, the probability of getting a number in any class is 1/5 = 0.2. Notice that the area of the class dithers around 0.2 and that the sum of all of the areas is 1.0. Now, draw up the histogram on some scratch paper.

Circular Histograms
Circular histograms must also be constructed in a particular way. The important point is that the area of the circular histogram element must be proportional to the number of data points it contains. Figure 1.7 shows a portion of an angular histogram. The area of a circle of radius R is πR². Since a complete circle represents 2π radians of angular rotation (360°), the area of a pie-shaped segment of a circle that spans W radians is A_class = πR²(W/2π). W is analogous to the width of the histogram bar from before, but its units are radians. What remains is to normalize the area. The area of a segment divided by the total area of the histogram should be given by the number of data in the class divided by the total number of data. That is, the fractional area represented by a class should equal the fraction of the total that is contained within the class. This gives us the relation A_class/A = f/N, where A_class is the area of the pie-shaped histogram segment, A is the total area of the histogram, f is the number of data points in the class, and N is the total number of data points.

[Figure 1.7: Histogram for directional data, showing one pie-shaped segment of radius R.]

The equation is, then:

$$\frac{A_{\text{class}}}{A} = \frac{\pi R^2 \, W / 2\pi}{A} = \frac{f}{N}$$

and

$$R^2 = \frac{2Af}{WN} \qquad \text{(radius of pie segment in circular histogram)}$$
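Here is a minimal Python sketch of this computation (the class counts f are hypothetical, and the total area A is an arbitrary choice that sets the physical size of the plot):

```python
import math

f = [12, 7, 3, 9, 14, 5, 6, 11, 8, 10]   # hypothetical counts in each class

n_classes = len(f)
N = sum(f)                     # total number of data points
W = 2 * math.pi / n_classes    # angular width of each pie segment, in radians
A = 100.0                      # total area of the histogram (sets the plot size)

# R^2 = 2*A*f/(W*N) for each class, so the segment areas sum to A
R = [math.sqrt(2 * A * fk / (W * N)) for fk in f]
print([round(r, 2) for r in R])
```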


So, you first decide how many segments you want for the histogram. If you want 10 segments, you have W = 2π/10 radians. All that is needed is to decide the total area of the histogram based on how physically large you want the plot to be, determine the class "frequencies" (how many are in each class), compute R, and make the plot.

Thinking in "Statistics"

Some of the terms of statistics were defined for studies of a population of people. A pollster wants to predict the outcome of an upcoming election. He/she can't poll every person. It would be too expensive. So, a sample of individual members of the population is called up on the telephone and asked their opinion. It is the responsibility of the pollster to sample the population in such a way that the results will reflect the opinions of the population as a whole. Data taken from a sample are used to infer the opinions of the population. In fact, this is the central problem of statistics. We take a sample of measurements of our study area, then infer the properties of the entire study area from that sample.

But, it isn't enough just to produce a single number that is the "answer". Some measure of the accuracy of that number is required. This is usually expressed as a "confidence limit". We ask: if this experiment were repeated many times, what are the upper and lower limits within which 95% of the results would lie? If we wanted to be safer, we could specify 99%, or some other percentage. But the critical piece of information is the statement that yes, there are errors, but there is an XX% chance that our result lies between the two specified limits.

The ideas of a sample and a population apply to geological statistics, as well as to opinion polling. The term population as it is used in statistics refers to the set of all possible measurements of a particular quantity. For example, if we are interested in the nature of the pebbles in a given conglomerate, the population will consist of all the pebbles in the conglomerate. For a dice toss experiment, the population can be considered the infinity of possible dice toss outcomes. When the experiment is repeated, the population of all possible dice throws is being sampled repeatedly. In other words, we visualize an abstract collection of all possible values of the quantity of interest. Then, when we make a measurement, toss the die, throw the coin, ...whatever, we are sampling this abstract population. I like to think of it as a giant grab bag full of small pieces of paper with a number written on each one. A sample is taken by reaching in and grabbing out N pieces of paper and reading the numbers. Suppose, for example, that you drill some cores of a rock to determine the orientation of the remanent magnetism for a paleomagnetic study. The rock most likely does not have a constant magnetic direction throughout its body, so a single core would give a very uncertain result. The result can be improved by taking a number of cores and averaging the results. The level of uncertainty will be affected by how many cores are averaged together and the amount of variation of magnetic direction within the rock itself. The entire rock formation would be considered the population and the collection of cores would be the sample. It is then the statistician's task to infer the properties of the population from the measurements on the sample.


Sampling Methods
Simply going out and making some measurements sounds easy. In practice, the process is prone to errors and misjudgments. It can be very expensive to launch a field program, gather data at great expense, then find in the analysis that there are not enough data or the data are sampled incorrectly. You can usually improve the sampling strategy if you understand as much as possible about the process or system you are sampling. When this is impossible, small test experiments can be useful. The following paragraphs discuss some of the issues involved in designing a sampling strategy.

It is important to take a truly random sample so that errors tend to average out. But getting truly random sampling is not always straightforward, especially in the earth sciences, where values of interest may not be randomly accessible or where, for example, only certain kinds of rock formations are exposed. Suppose you want to sample the soil properties in a 1 km² area. You might be measuring a soil contaminant, nutrients, porosity, moisture, or any other property appropriate to the study. You must first answer some very basic questions. Is the parameter you want to measure distributed randomly within the area, or is there a systematic variation? You must also determine the source of noise in the measurement. Is the noise due to error in the measurement instrument, or is it due to natural variations in the properties of the soil? An example of a systematic variation is a slow change of the parameter across the area you are sampling. For example, if you are measuring nutrients but portions of the study area have large trees and some have low plants, you would expect a dependence on this. If you want to study the properties averaged over a large area, you may want to consider natural variations as noise to be averaged out. On the other hand, variations of the nutrients that are caused by vegetation differences may be of interest. It all depends on the goals of your study. One sampling option is to adopt simple random sampling. The area is divided into a grid and sampling takes place at randomly selected grid points. Grid points may be selected using random number tables or a computer's random number generator. In the field, you could toss a die. One method (Cheeney, 1983) is to divide the length into 6 locations and select one by tossing the die. The chosen interval is divided into 6 subintervals and one of these is chosen by die toss. This subdivision can continue as far as needed. Another sampling method is stratified sampling. This method prevents the bunching of data points that may occur with simple random sampling. For example, we might lay out a grid of 10 x 10 squares and take a number of randomly located samples within each of the 100 squares. If you are measuring the magnetization direction of a rock outcrop by taking cores and the random selection system bunched all of the samples in a small portion of the rock, it would be wasteful to blindly take these data. However, if this selection of random data points were rejected until a more "satisfactory" distribution was determined, the statistical assumptions of randomness would be violated and conclusions based on statistics would be suspect. If you will reject bunched data locations, stratified sampling is the method to choose. Methods of identifying systematic variations will be discussed in later chapters, when correlation is discussed.
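As a sketch of these two strategies, the following Python fragment (grid dimensions and sample sizes are hypothetical) draws simple random grid points and a stratified set with one random point per grid cell:

```python
import random

random.seed(0)    # fixed seed so the example is repeatable
nx, ny = 10, 10   # a 10 x 10 grid laid over the field area

# simple random sampling: grid points chosen completely at random (may bunch up)
simple = [(random.randrange(nx), random.randrange(ny)) for _ in range(20)]

# stratified sampling: one randomly located point inside each grid square,
# which prevents the bunching that simple random sampling can produce
stratified = [(i + random.random(), j + random.random())
              for i in range(nx) for j in range(ny)]

print(simple[:5])
print(stratified[:5])
```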


Another approach would be a systematic sampling scheme, in which you pick a location at random and distribute the data points at even intervals from the start point over the remainder of the field area. You must be careful that this approach doesn't introduce any bias. If there is any reason to believe that the property being measured varies systematically, this approach may not work. This method generally reduces the number of data points needed in a sample, but produces somewhat less precise results than other methods. A systematic sampling method is used in point counting work in petrography. How many samples should be taken? As we will see, the required sample size depends on two major factors. The first is the precision required by the study. The more precision you want in your results, the more samples you will need to take. The second is the inherent variability in the population you are sampling. The greater this variability, the greater the necessary sample size. Of course there are practical limits which must be considered. These may include the availability of possible samples and the costs involved in sampling. More complete discussions of sampling theory and problems are given in Chapters 5 and 6. Also, for a discussion of sampling methods, see Cheeney (1983) and Cochran, W.G. (1977), Sampling Techniques (3rd ed.), New York: Wiley.

Modeling Statistical Interpretations
If you do not aspire to be a mathematician, and most geologists don't, there is a very easy way to test your statistical inferences. This is by simulating the experiment on the computer. It is a great way to prove, without mathematics, that your results are valid. Even more useful, the use of random numbers can also help us understand the principles of statistics. Statistical simulations will be an important component of the lab exercises.

Generating random numbers in Excel: There are a couple of ways of generating random numbers in Excel. The first is to use the RAND() function. It generates a number between 0 and 1. To generate a number between a and b, use =RAND()*(b-a)+a. RAND() generates numbers with an equal chance of taking on any value between 0 and 1. To get another distribution, use the "Data Analysis" tool, which can be accessed under the "Tools" menu. When the dialog box comes up, scroll down the list of tools and select "Random numbers". You will be able to select the distribution you want. Note that these numbers will only be computed once. If you use the RAND() function, the numbers will change every time you do a "recalculate" operation ("Apple =" on the Mac and "Ctrl =" on the PC).
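For those working outside Excel, the same kinds of random numbers can be generated in Python with the standard random module (a sketch, not part of the original course materials):

```python
import random

u = random.random()                # uniform on [0, 1), like Excel's RAND()
a, b = 2.0, 5.0
x = random.random() * (b - a) + a  # uniform on [a, b), like =RAND()*(b-a)+a

# simulate the dice toss experiment of this chapter: 60 tosses of one die
tosses = [random.randint(1, 6) for _ in range(60)]
counts = [tosses.count(face) for face in range(1, 7)]
print(counts)   # compare with the "expected" count of 60/6 = 10 per face
```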


Review
After reading this chapter, you should:
• Be able to discuss the types of measurement scales discussed in this chapter: discrete, continuous, nominal, ordinal, counting, interval, ratio and angular.
• Understand the difference between accuracy and precision and know how to use significant figures correctly.
• Be able to describe a data distribution in terms of overall shape, central tendency and dispersion. Know how to find the mean, mode, median, standard deviation and percentiles for a data distribution.
• Be able to construct a histogram of various kinds of data and compute the correct bar heights.
• Be able to describe the important considerations in designing a sampling strategy.

Vocabulary:
sample, population, sample mean, sample variance, sample standard deviation, histogram, bias


Problems
1. Describe the following data as discrete, continuous, nominal, ordinal, counting, interval, ratio and/or angular.
a. the mineral phases present in a rock
b. the concentration of iron in a rock
c. the age of a rock as determined by U-Pb dating
d. the age of a rock as determined from fossils
e. the size of earthquakes measured on the Richter scale
f. the daily high and daily low temperatures in an area
g. the amount of rainfall in a given locality
h. paleocurrent directions determined from ripple marks
i. δ18O values relative to SMOW (Standard Mean Ocean Water)

2. Define the terms accuracy and precision by way of a dartboard analogy. That is, draw a dartboard with 5 darts on it thrown by someone who is accurate but not precise, precise but not accurate, precise and accurate, and neither precise nor accurate.

3. Give the answers to the following problems to the correct number of significant figures.
a. 13.67 + 4.2 =
b. 2.4 * 4.11 =

4. Using Excel, plot a regular histogram and a cumulative frequency histogram for the following data set. Be sure to indicate the mean and standard deviation of the data on the plot of the histogram. Note: you can plot a histogram with any desired bar heights by directly entering the bar heights in the field labeled "class frequencies" and clicking on the "Plot Data" button.

43 47 48 49 49 52 53 54 55 55 56 57 57 58 58 59 60 61 62 63 64 64 64 65 65 65 65 65 65 65 67 68 68 69 69 69 70 70 70 70 72 72 73 74 74 78 78 79 79 83

5. In designing a water well, you need to select a screen slot size that will retain about 90% of the filter pack material surrounding the well hole. Data from a sieve analysis of this filter pack material are shown below. Construct a cumulative frequency diagram, using Excel, and determine the necessary slot size by plotting the cumulative % caught on the sieves on the y-axis and the sieve slot size on the x-axis.

weight % caught on sieve   sieve slot size (mm)
2                          0.0
8                          0.4
20                         0.6
30                         0.8
30                         1.1
10                         1.7

6. What is the difference between parametric and non-parametric statistics?


Chapter 2

Plotting Data With Two Variables
Often each data item has more than a single number associated with it. Porosity may change with height, earthquake signal amplitude changes with distance from the source, radiogenic composition changes with age, oxygen isotope ratios change with temperature, etc. It is these relationships that tell us the story we want to extract from the data. There may be a large number of variables associated with each data item; that is the topic of "Multi-Variate Analysis", and is beyond the scope of this book. However, the geologist will often face the problem of processing data with only two variables. This chapter treats the scaling and plotting of x-y data, the fitting of basic equations to data, and how noise in the data can affect its interpretation.

Plotting X-Y Data
Most of your X-Y data plots will be created in a charting program. A simple x-y chart created in Microsoft Excel is shown below. The chart has a title, a label for each axis, and a legend that describes the symbols that represent the two data sets that are plotted. The importance of making clear data plots cannot be overemphasized. The reader should be able to understand the content of the plot by looking at the plot and its caption.

Figure 2.1a. This is a sample data plot showing correct axis and data labeling. When plotting data, error bars should also be shown on the plot.

Logarithmic Scaling
Often, it is useful to plot values on a logarithmic scale: the logarithm of either, or both, of the X and Y axis values is plotted. The most common reason for plotting on a logarithmic scale is that the data values span many orders of magnitude. This is true for the earthquake magnitude scale, where ground motion induced by quakes varies from sub-micron to meter amplitudes, a range of six decades or more.

[Figure 2.1a: x-y chart titled "Plot of atmospheric gas concentrations at Mauna Loa Observatory"; x axis "Time-years before the present" (0 to 15), y axis "Concentration-ppm" (0 to 40), with two series labeled Data #1 and Data #2.]


[Figure 2.1b: chart titled "X-Y plot with linear axes"; y axis 0 to 500,000, x axis 0 to 15.]

Figure 2.1b. When data vary over many decades, a logarithmic scale is used.

Figure 2.1b illustrates the need for log plots. Very little detail is shown for most of the data points; the largest data value determines the plot scaling, and the other points lie along the X-axis. Figure 2.1c shows a conventionally labeled X vs log(Y) plot. This is most commonly used because it is easy to read the original data values from the Y-axis. In the rightmost option, we numerically take the log of the Y data values, then make the plot using the transformed Y values. Thus, what we see on the Y-axis is the true logarithm of the Y data. This is the simplest method to use when determining the best-fit coefficients of the equation that describes the data (the method is described in the next section), because the fitting equations require the slope of the line, and this slope is best calculated from the log(Y) axis values. The student generally gets confused when trying to use the left plot to do this. Excel labels log scales according to the left figure.

Properties of logarithms reviewed: We ask, what is the value of X in the formula B^X = N, where N is the number of interest, B is the "base", and X is the logarithm of N. For example, suppose we are interested in base = 10, the base we will use almost exclusively in this chapter. If N = 100, then we ask what is the value of X in 10^X = 100? It is easy to see that 10^2 = 100, so log(100) = 2. It is simple to get orders of magnitude from log values. From this, it is simple to derive other properties of logarithms. Some of the important properties of logarithms are:

log(ab) = log(a) + log(b)
log(a^b) = b log(a)
log(a/b) = log(a) - log(b)

** Note: The logarithm of a negative number does not exist. If you try to take the log of a negative number in Excel, the returned value will be the "#NUM!" error.
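These properties are easy to verify numerically. Here is a minimal Python check (the values of a and b are arbitrary):

```python
import math

a, b = 100.0, 4.0

# the three properties quoted above, for base-10 logarithms
assert math.isclose(math.log10(a * b), math.log10(a) + math.log10(b))
assert math.isclose(math.log10(a ** b), b * math.log10(a))
assert math.isclose(math.log10(a / b), math.log10(a) - math.log10(b))

# the log of a negative number does not exist; Python raises an error
try:
    math.log10(-5.0)
except ValueError:
    print("log of a negative number is undefined")
```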


Figure 2.1c. These two figures illustrate two ways of labeling the Y axis for a log plot. The left plot is the most conventional and is what Excel produces. The right side is most useful for calculating the coefficients of the equation of the line that fits the data. Notice, on the right, that the logs of the y values are taken, then a linear y axis is used to plot the values.

Data Plots and Determining Functional Dependence The use of logarithmic plot scales can both illuminate and obscure important facts about your data. It is possible to determine the functional form of the underlying equation followed by the data, by selecting the correct kind of plot.

[Figure 2.2: a straight line with a rise of 0.75 over a run of 0.5 (Slope = 0.75/0.5) and the intercept marked where the line crosses x = 0.]
Figure 2.2 Plot of a straight line, showing computation of slope.

[Figure 2.1c panels: two charts titled "X-Y plot with log(Y) axis", each with X from 0 to 15. One labels the Y axis 1 to 1,000,000 on a log scale; the other labels it 0 to 6 as log(Y) on a linear axis.]


The following functional forms commonly occur in problems of interest to earth scientists:

y = mx + b (linear dependence) (2-1)
y = Ax^n + b (power law dependence) (2-2)
y = Ae^(nx) + b (exponential dependence) (2-3)

Equation 2-1 is the familiar equation of a straight line. It is characterized by its slope and intercept, which is the value of y at x = 0. A diagram is shown in figure 2.2, where the slope, m, is 1.5 and the intercept, b, is 2. It is possible to determine, using graphical methods, the unknown constants in equations 2-2 and 2-3. The following operations demonstrate how this is done.

Power Law – Equation 2-2
Rearranging equation 2-2 slightly, it becomes y - b = Ax^n. Taking the log of both sides, we have:

log(y - b) = log(A) + log(x^n)
log(y - b) = log(A) + n log(x)

So, if we define new variables Y_l = log(y - b) and X_l = log(x), the equation becomes:

Y_l = n X_l + log(A)

Method: If the data follow the power law dependence (eq. 2-2), a plot of log(x) vs log(y - b) will produce a straight line. The b is problematic. For many power law dependencies, it is zero. If it is not zero, you will need to use a computer to fit the best line to the data. For the purposes of this class, always try to get a fit with b = 0 first. If you get a straight line with log(x) vs log(y), then find the slope of the line. You should use the calculated values of log(x) and log(y) to get the slope. This slope is then equal to n in the above equation. The intercept is equal to log(A), so you can solve for A. The important thing to remember is to plot the calculated values to determine the slope and intercept. Also, verify your answer by putting one or two values of x into the fitted equation and see if they agree with the y values you are trying to fit. Don't omit this important self-test check!


We can plot Y_l vs X_l, and the slope will be equal to n, the power of x in equation 2-2. The intercept will be the value of log(A).

Exponential – Equation 2-3
Similarly, for equation 2-3, we have:

y - b = Ae^(nx)
log(y - b) = log(A) + nx log(e)

We let Y_l = log(y - b), so

Y_l = n log(e) x + log(A)   (the form of a linear equation)

We can see that the Y axis should be plotted on a logarithmic scale as log(y - b) and the X axis on a linear scale. The slope will be the value of n log(e). There is a complication in this procedure for these two functional forms: we do not know the value of the constant b. Often we expect that b = 0, as in the case of radioactive decay. If it is strongly suspected that the data follow a power law with a non-zero b, then b could be varied in the plot until the "best" straight line is achieved. So far, appeals to intuition are being made so that you obtain an understanding of the underlying principles. However, the fitting of straight lines in the presence of noisy data is fraught with dangerous traps in interpretation. Questions that must be asked of any data fit are: a) what other values of the parameters produce an equally "good" fit? b) do other functional dependencies produce an equally "good" fit? A more quantitative definition of a "good" fit will be given when computer curve fitting is discussed using Excel.

Helpful hints: It is not necessary to have the X = 0 value plotted to determine the intercept (the "b" of equation 2-1, or log(A) for the transformed forms of equations 2-2 and 2-3). Once it has been determined that the data follow a straight-line dependence, any X,Y value from the straight line may be used to solve for the intercept. Just read an X,Y value from the graph, substitute it into the equation (the slope is known, but the intercept is not), and solve for it.
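As a sketch of the power-law method, the following Python fragment (assuming numpy; the data are synthetic, generated with assumed constants A = 3 and n = 1.5, and b = 0) recovers the constants from a least squares fit of log(y) against log(x):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 3.0 * x ** 1.5            # synthetic power-law data, y = A*x^n with b = 0

# fit log(x) vs log(y): the slope is n and the intercept is log(A)
slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)

n = slope
A = 10 ** intercept
print(n, A)   # 1.5 and 3.0, recovering the assumed constants
```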


Example of finding the function's constants: Suppose that we have the following data, calculated using the formula y = exp(2*x). Let's plot it in several different ways, then see how to recover the original constants of this equation. But first, note that this equation is of the form

y - b = Ae^(nx)

This is the same form as equation 2-3, but with the b moved to the left hand side. For now, don't pay attention to the column labeled log(y).

x    y          log(y)
1    7.389056   0.868589
4    2980.958   3.474356
5    22026.47   4.342945
10   4.85E+08   8.68589
20   2.35E+17   17.37178

The first thing to notice about the numbers in the y column is that they range from 7.39 to 2.35 × 10^17, an extremely large range. An x-y plot of this data is shown in figure 2.3 below.

Figure 2.3. x-y plot of the data in the example.

Notice that the extreme range of the data causes all of the data except the largest to be plotted on the x axis. We suspect that we should make the y axis into a log axis. Figure 2.4 shows this.

[Figure 2.3: plot titled "y = exp(2*x)" with linear axes; y from 0 to 2.5E+17, x from 0 to 30.]


Figure 2.4. The example data plotted on a log y axis. Note: exp(x) means e^x.

Notice that the data plots as a straight line. We can measure the slope and intercept of this straight line and find the "A" and "n" coefficients of the equation, to make sure they agree with what we already know. But first, we note that the labels on the Y axis still show the original data values. If we use these numbers to calculate the slope and intercept of the straight line, we will get the wrong answer. This is because the Excel plot routine, as a convenience for those who want to read the original numbers from the Y scale, did not really label the log(y) values. The easiest way to get the log values is to make a third column that is log(y), then do a new plot of x vs log(y). This plot is shown in figure 2.5 below. The slope of this line can be measured from the plot itself, or calculated from the table of numbers (don't do this with real data; it's best to do a least squares fit when data have errors):

slope = (17.37 - 4.34) / (20 - 5) = 0.87

See if you can get these numbers yourself. The plot also shows the y intercept to be 0. So, referring back to our equation

Y_l = n log(e) x + log(A)

we can see that it has the form Y = mx + b, where m = n log(e) and b = log(A). In Excel, log(e) is computed as =LOG(EXP(1)), which equals about 0.434. So, solving for m, we have 0.87 = n*0.434, or n = 2.00. Hurray, that was our value!

[Figure 2.5: x vs log(y) plotted on linear axes (log(y) from 0 to 20, x from 0 to 30); the points fall on a straight line through the origin.]


Also, since b = 0 (the y intercept), we solve log(A) = 0. Since log(1) = 0, our original value for A is 1. So, we have created some data artificially, pretended we didn't know where it came from, then worked backwards to get our initial equation. This is the procedure for all of the other functional forms. Complications: If the values of one of the variables are negative, you can't take the log, because the logarithm of a negative number has no meaning. But you can make the substitution x' = -x in the equations. This lets you take log(x') = log(-x), the log of a positive number, for all values; you then need to adjust the equation coefficients accordingly. Also, most data have errors. We did the example with noise-free data, so it worked out perfectly. In the presence of errors, the coefficients that you solve for will have errors too. Also, sometimes you cannot tell whether a log or linear axis gives the best fit. You have to use what you know about the process that created the data and use your best judgment. It is never wise to blindly apply mathematical techniques without knowing something about the processes that created the data.
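The worked example can be reproduced in a few lines of Python (a sketch assuming numpy; np.polyfit performs the least squares straight-line fit that the text recommends for real, noisy data):

```python
import numpy as np

# the example data, generated from y = exp(2*x) with A = 1, n = 2, b = 0
x = np.array([1.0, 4.0, 5.0, 10.0, 20.0])
y = np.exp(2.0 * x)

# fit x vs log10(y): slope = n*log(e), intercept = log(A)
slope, intercept = np.polyfit(x, np.log10(y), 1)

n = slope / np.log10(np.e)   # log(e) is about 0.434
A = 10 ** intercept
print(n, A)                  # 2.0 and 1.0, as found in the text
```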

Review
After reading this chapter, you should:
• Know how to find the functional dependence of common forms and find the unknown parameters in the function from the plot. Be sure that you can derive the equations for slope and intercept in all three cases.
• Understand how to make linear and logarithmic plots.


Problems:
Problem 1: This problem is designed to support your understanding of the simple derivations of the slopes and intercepts for the 3 functional forms of equations that have been discussed.
1a) If all values of x are negative, you cannot take the log of these numbers to test for a power law or exponential dependence. Derive the equation for slope and intercept for power law and exponential dependencies when all values of x are negative.

Problem 2: Do problem 1, but when all values of y are negative.

Problem 3: Seismologists have noticed that the relationship between the magnitude of earthquakes and their frequency is described by the equation y* = a - bx, where y* is the log of the number of earthquakes and x is the magnitude of the earthquakes. For the following data, find 'b'. Use the mid-point of the range of values given for x.

magnitude of earthquakes   number of earthquakes
6.5-7.0                    2
6.0-6.5                    3
5.5-6.0                    10
5.0-5.5                    16
4.5-5.0                    72
4.0-4.5                    181
3.5-4.0                    483
3.0-3.5                    846
2.5-3.0                    302
2.0-2.5                    73

Problem 4: Determine the half-life of chemical B based on the following experimental data. (The half-life is the time at which one-half of the chemical remains.)

fraction of chemical left   time (days)
0.97                        0.2
0.92                        0.5
0.84                        1.0
0.71                        2.0
0.42                        5.0
0.18                        10.0
0.03                        20.0
0.006                       30.0


Problem 5: The following data were collected during an experiment to determine the relationship between temperature and vapor pressure for an organic chemical. From previous experience you know that the general form of the equation that describes this relationship is:

$$\ln P = \frac{A}{T} + B$$

Find A and B. What is the vapor pressure at 37°C (310 K)?

P (atm)   T (K)
0.059     283
0.13      298
0.36      323
0.75      343

Hint: make the 1/T dependence linear with a substitution of variables.

Problem 6: For the following datasets, determine the functional form of the underlying equation and its unknown parameters. You can assume that b = 0 for equations that are not linear.

Dataset 1: (0,3) (1,5) (2,7) (3,9) (4,11) (5,13) (6,15) (7,17) (8,19) (9,21)
Dataset 2: (0,2) (1,1.1) (2,0.6) (3,0.33) (4,0.18) (5,0.1) (6,0.05) (7,0.03) (8,0.02) (9,0.01)
Dataset 3: (0,0) (1,1) (2,5.66) (3,15.59) (4,32) (5,55.9) (6,88.18) (7,129.64) (8,181.02) (9,243)
Dataset 4: (0,2.02) (0.83,1.23) (1.67,0.71) (2.5,0.45) (3.33,0.26) (4.17,0.15) (5,0.16) (5.83,0.08) (6.67,0.02) (7.5,0.03) (8.33,0.02) (9.17,0.04)


CHAPTER 3
Correlation and Regression

In this chapter, we discuss correlation and regression for two sets of data measured on a continuous scale. We begin with a discussion of scatter diagrams.

Scatter Diagrams
A scatter diagram is simply an x,y plot of the two variables of concern. For example, figure 3.1 shows a scatter diagram of length and width of fossil A. These data are listed in Table 3.1.

length   width
18.4     15.4
16.9     15.1
13.6     10.9
11.4     9.7
7.8      7.4
6.3      5.3

Table 3.1

Powerful data analysis software has made it easy to perform complex statistical analyses on your data. This is very good, but there are pitfalls in relying too much on sophisticated computer calculations when you do not completely understand how to do the calculations yourself. It is important to develop intuition about the data and the expected results from a particular analysis. This intuition will help you avoid stupid mistakes in interpretation and also catch numerical errors in data entry. Before you do a computer calculation, you should always estimate a range of reasonable output values. Then, when/if the result of the computer calculation is quite different from what you expected, you have either made an error in specifying the analysis to the computer software, or you don't understand what you are computing. Either situation requires careful investigation. A good example of the need to understand the calculation at more than a superficial level is the computation of the correlation between two variables, x and y. An x-y scatter plot is always done first. Then you can visually determine whether there might be a correlation and whether it is reasonable to calculate a correlation coefficient. Some interesting misinterpretations of the correlation coefficient will be illustrated in the following pages. Even though the computer is a great tool for doing extensive computation, you should do the calculation by hand, at least once, to make sure you understand the process.

[Figure 3.1. Plot of the data in Table 3.1: width of fossil A (mm) vs length of fossil A.]


Variance and Covariance

The variance and covariance are important quantities, and are introduced here so we can use them in the next section. The variance is given by:

\mathrm{var}(x) = s_x^2 = \frac{\sum_i^n (x_i - \bar{x})^2}{n-1} = \frac{\sum_i^n x_i^2 - \left(\sum_i^n x_i\right)^2 / n}{n-1} \qquad (A)

Notice that the variance, in the above equation, is the standard deviation of the data squared. The standard deviation was defined in chapter 1. The second form of the variance (right hand side of the equation) is exactly equivalent to the standard definition, but is sometimes convenient to use when calculating with a calculator, or when deriving equations. In general, the variance will increase as scatter in the data values increases. Another important quantity is the "covariance" between two variables, x and y. The formula for the covariance is given below. It is very analogous to the variance, but includes both x and y values. Notice the similarities between the two equations. Instead of squaring x, we have x times y values. This keeps the dimensions the same.

\mathrm{cov}(x,y) = s_{xy}^2 = \frac{\sum_i^n (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{\sum_i^n x_i y_i - \left(\sum_i^n x_i\right)\left(\sum_i^n y_i\right)/n}{n-1} \qquad (B)

The covariance is an expression of the relationship between the x and y data points. Notice that it is similar to the standard deviation of a single variable squared, but instead of squaring values, x and y values are multiplied.

Hints on understanding these formulas: It is very important to become familiar with the summation notation. A few minutes spent focusing on this notation will be well worth your while when you try to understand more complex concepts and formulas later in this chapter. Suppose there are n values of x, and suppose these values are 1, 2 and 3 (for simplicity). The formula \sum^n x_i means to add all values of x together. Since there are 3 values, n = 3. The n on the \sum sign means to sum over all of the n values of x. So, for this simple data set we do: \sum^n x_i = 1 + 2 + 3 = 6. Now find the variance of this simple dataset using formula A above. Use both forms of the formula to convince yourself that they are equivalent. After you do this, assume a y dataset to be 2, 3, and 4. Now do the covariance formula (B) and see what you get. Do both forms. They should agree. If they do, you will have mastered the summation notation.
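To see these formulas in action, here is a minimal Python sketch (not part of the original text) that evaluates both forms of formulas (A) and (B) for the simple datasets suggested above. Both forms should agree.

```python
# Minimal sketch (not from the original text): verify that the two forms of
# formulas (A) and (B) agree for the simple datasets x = 1, 2, 3 and y = 2, 3, 4.

def var_two_ways(x):
    n = len(x)
    xbar = sum(x) / n
    # First form: sum of squared deviations from the mean.
    v1 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    # Second form: "calculator" form using sums and sums of squares.
    v2 = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)
    return v1, v2

def cov_two_ways(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    c1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    c2 = (sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)
    return c1, c2

x, y = [1, 2, 3], [2, 3, 4]
print(var_two_ways(x))    # both forms give 1.0
print(cov_two_ways(x, y)) # both forms give 1.0
```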


More details: Since we can write the mean of x as \bar{x} = \frac{1}{n}\sum^n x_i, the second form of equation A above can be written as:

s_x^2 = \frac{\sum^n x_i^2 - n\bar{x}^2}{n-1}

We will use this form later in this chapter. Find an equivalent simplification for formula B above.

Correlation coefficients

The problem with the covariance is that its value depends on the scales of x and y, so it is not simply related to the strength of the relationship between them. It would be more elegant if we had a scale where 0 implied no relationship and 1 (or -1) implied the maximum relationship. This can be achieved by defining the Pearson's correlation coefficient. The Pearson's correlation coefficient, denoted by r_xy, is a linear correlation coefficient; that is, it is used to assess the linear relationship between two variables. It is used for data that are random and normally distributed. The xy subscripts are used to emphasize the fact that the correlation is between the variables x and y. This coefficient is very important in the least squares fit of a straight line to x and y data.

The value of rxy can vary from -1 to +1. When the two variables covary exactly in a linear manner and one variable increases as the other increases, rxy =+1. When one variable increases as the other decreases, rxy =-1. When there is no linear correlation between the two variables, rxy =0. Figure 3.2 shows some scatter diagrams for various values of rxy.


[Figure 3.2: four scatter plots, for r = +1, r = -1, r = +0.8, and r = 0.]

Figure 3.2. Plots showing the value of the Pearson’s correlation coefficient for different rxy values. Notice that the plots show increasing scatter as the r value decreases toward 0.



The correlation coefficient, r_xy, is calculated as

r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})/(n-1)}{\sqrt{\dfrac{\sum_i (x_i - \bar{x})^2}{n-1}}\,\sqrt{\dfrac{\sum_i (y_i - \bar{y})^2}{n-1}}} = \frac{s_{xy}^2}{s_x s_y} \qquad (3-1)

Another form is:

r_{xy} = \frac{\sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)/n}{\sqrt{\sum_i x_i^2 - \left(\sum_i x_i\right)^2/n}\,\sqrt{\sum_i y_i^2 - \left(\sum_i y_i\right)^2/n}} \qquad (3-2)

where x is one variable, y is the second variable and n is the sample size. It doesn't matter which variable we call 'x' and which we call 'y' in this case. Notice that all we had to do to convert from the covariance was to divide by s_x s_y. This division "normalizes" the value of the covariance so that it varies between -1 and +1. For the data in Table 3.1 (to make sure you understand the calculation, see if you can duplicate the numbers given below; they correspond to eq. 3-2):

r_{xy} = \frac{888.48 - (74.4)(63.8)/6}{\sqrt{1039.62 - (74.4)^2/6}\,\sqrt{760.92 - (63.8)^2/6}} = 0.99
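As a cross-check of the hand calculation above, here is a minimal Python sketch (not part of the original text) that evaluates equation 3-2 for the Table 3.1 data.

```python
# Minimal sketch (not from the original text): Pearson's r for the fossil
# data of Table 3.1, using the sums-only form of equation 3-2.
from math import sqrt

length = [18.4, 16.9, 13.6, 11.4, 7.8, 6.3]   # x, from Table 3.1
width  = [15.4, 15.1, 10.9, 9.7, 7.4, 5.3]    # y, from Table 3.1
n = len(length)

sum_x, sum_y = sum(length), sum(width)
sum_xx = sum(v * v for v in length)
sum_yy = sum(v * v for v in width)
sum_xy = sum(a * b for a, b in zip(length, width))

# Equation 3-2, written entirely in terms of running sums.
r = (sum_xy - sum_x * sum_y / n) / (
    sqrt(sum_xx - sum_x ** 2 / n) * sqrt(sum_yy - sum_y ** 2 / n))
print(round(r, 2))  # 0.99, matching the hand calculation above
```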

Extra help: When calculating the above values, it is very important to do the calculations in the correct order. For example, suppose you have 3 numbers, say 3, 2, and 5. The sum of them is 10. In our summation notation, this is:

\sum^n x_i = (3 + 2 + 5) = 10

Now if we square the result, we get:

\left(\sum^n x_i\right)^2 = 10^2 = 100

But, suppose we square the values before adding them together. This is indicated as:

\sum_{i=1}^{N} x_i^2 = 3^2 + 2^2 + 5^2 = 9 + 4 + 25 = 38

So, we got a value of 100 by adding the numbers first, then squaring, but a value of 38 by squaring the numbers first, then adding. Clearly, this is an important effect and you should be very careful about which of the procedures is indicated. It is very important that you become familiar with the summation notation. This is best done by substituting


numbers into the examples in this book until you are comfortable and get answers that agree with the book's. We have calculated r_xy, but as of yet, we have said nothing about the significance of the correlation. The term "significance" has meaning both in real life and in statistics. An experimental result may give us a number, but is that number "significant"? In statistics, we ask whether this number is highly probable, given the errors (or randomness) in the data. For example, suppose you are a psychic studying psycho-kinesis, which is the use of the mind to influence matter. You concentrate on "heads". A coin is tossed once and the side that comes up is "heads". Wow! Is this significant? Does it mean anything, or could the side just as easily have been "tails"? The probability of heads coming up is 1/2. Most would agree that a 50-50 probability is pretty "insignificant" and psycho-kinetic powers remain unproven. But, suppose, after 100 tries, the coin toss favors heads 75% of the time. This result is highly unlikely to be due to randomness. Therefore the "significance" of the result is much greater. This is an important point, and applies to correlation as well. Intuitively, the significance of a correlation of 0.9 is much greater when the sample size is very large than when the sample size is very small. For a small sample size, the alignment of data points along a straight line may be fortuitous, but when many data points lie along a straight line, the case becomes much more convincing, or "significant". The calculation of "significance" will be discussed in greater depth in later chapters.

[Figure 3.3: two scatter plots, (a) with r = 0.8 and (b) with r = 0.]

Figure 3.3. Data which show a low correlation coefficient, yet are obviously correlated. These kinds of data illustrate inappropriate applications of the Pearson correlation coefficient.

A few words of warning

There are several factors of which you should be aware when interpreting the significance of correlation coefficients. First, the Pearson's correlation coefficient, as we have said, is a measure of linear correlation. Data may be highly correlated but have a zero linear correlation coefficient. Also, an outlier in the data set can have a large effect on the value of r_xy and lead to erroneous conclusions. This does not mean that you should ignore outliers, but you should be aware of their effect on r_xy. Figure 3.3 illustrates these points.


Obviously, these data are not randomly distributed, and a quick look at the scatter plot verifies this. A problem also occurs when the data are acquired from a 'closed system'. A closed system is one in which the values of x and y are not completely independent because the fixed total of all measurements must add to 100% or some other fixed sum. Closed system data occur frequently in geologic studies. For example, closed systems exist in measurements of percentage compositions in studies of whole rock chemistry and in work with ternary plots of rock composition. Because the sum of the various measurements must add to a fixed sum, an increase in the proportion of one variable can only occur at the expense of one of the other variables. Therefore, negative correlations are artificially induced. One final point is a reminder that a significant correlation between two variables does not imply a cause and effect relationship. We may notice that at the end of the month, our bank account balance is at its lowest level. Does this mean our bank account is somehow linked to the calendar? No, it's the fact that our paycheck is deposited on the first of the month. The day of the month doesn't CAUSE our bank account to go down; it is just a variable that varies in the same way.


Figure 3.4. A plot of depth vs temperature that will be used in the least squares fit example.

Least squares regression

Often, we wish to quantify the relation between two variables and we do this by fitting a line or curve to the data. Fitting a curve to data is called regression. In this section, we will discuss linear regression by the method of least squares. This method assumes that a linear correlation exists between the two variables of concern and that the variables are normally distributed. In this section, we will use the data listed in Table 3.2 and plotted in Figure 3.4 as an example. The purpose of linear regression, of course, is to find the "best" straight line fit through data that have errors, or some kind of natural variation. The data may have an underlying physical basis for lying on a straight line, or may just plot in a linear way, and the regression just allows us a more convenient description of the behavior of the data.



There are two situations that will affect our approach to the regression:

1. The error or variation is almost exclusively in one of the two variables. This situation would occur, for example, if one was measuring fault offset vs time. The time measurement would be very precise, but the offset measurement would be subject to measurement errors and natural variations in distance due to shifts in monuments.

2. The error or variation is inherent in both variables. In this case, we compute the "reduced major axis line".

depth    temperature (°C)
0.25     25
0.5      35
1.0      60
2.0      80
3.0      105
Table 3.2

With a least squares regression, we fit a straight line (y = ax + b) to the data by minimizing the sum of the squares of the distances between each data point and the best-fit line. This distance is measured in the y-direction. In calculating a correlation coefficient, it did not matter which variable we called 'x' and which we called 'y'. In least squares regression, it matters. If we are regressing y on x, x is the independent variable and y is the dependent variable. We assume that the error involved in the measurement of x is negligible compared to the error involved in the measurement of y.

So, our job now is to find the best a and best b constants for the y = ax + b equation, so that the data are fit as well as possible. There are many ways to do this, but the most common way is to do a "Least Squares Fit". We want all of the e_i "fit errors" (see figure 3.5) to be as small as possible. Suppose we calculate the sum of squares of all of the fit errors. All data are referenced to an arbitrary straight line with slope = a and intercept = b. We don't expect the line to pass through each data point. At each x data point, there will be a difference, or "error", between the line and the data point's y value. Here is the equation:

Figure 3.5. Plot of temperature/depth data. (x_i, y_i) are the coordinates of the i'th data point. e_i is the difference between the value predicted by the straight line and the actual data value.


y_i = a x_i + b + e_i \qquad (3-3)

y_i and x_i are the x,y values of the i'th data point, a is the slope of the straight line, and b is its intercept. e_i is the "error", or misfit. In order to get the best values for a and b, we want the sum of squares of all of the errors to be as small as possible, or:

R_d = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where \hat{y}_i is the predicted y (from the straight line) at point i:

\hat{y}_i = a x_i + b

To find the best values for a and b, we can differentiate R_d with respect to a and set the result to zero, then do the same for b. Then we solve the two equations for a and b. We do:

\frac{\partial R_d}{\partial a} = 0 \quad \text{and} \quad \frac{\partial R_d}{\partial b} = 0

When we do the operations indicated in the two equations, we have 2 equations and 2 unknowns, and can then solve for a and b. The 2 equations are, after differentiating and simplifying:

\sum_{i=1}^{n} x_i y_i = b\sum_{i=1}^{n} x_i + a\sum_{i=1}^{n} x_i^2 \qquad (3-4)

and

\sum_{i=1}^{n} y_i = nb + a\sum_{i=1}^{n} x_i \qquad (3-5)

Working on eq. 3-5, we divide each side by n and use \bar{x} = \frac{1}{n}\sum^n x_i. Then we get:

\bar{y} = b + a\bar{x} \qquad (C)

Now we can substitute the above equation into eq. 3-4, where we get:

\sum_i^n x_i y_i = (\bar{y} - a\bar{x})\sum_i^n x_i + a\sum_i^n x_i^2

We multiply out the terms and get:

\sum_i^n x_i y_i = n\bar{x}\bar{y} - a n\bar{x}^2 + a\sum_i^n x_i^2 = n\bar{x}\bar{y} + a\left(\sum_i^n x_i^2 - n\bar{x}^2\right) = n\bar{x}\bar{y} + a(n-1)s_x^2

Ok, now rearrange the last line of the above equation:

\frac{1}{n-1}\left(\sum_i^n x_i y_i - n\bar{x}\bar{y}\right) = a\,\frac{(n-1)}{n-1}\,s_x^2

Notice that the left side of the equation is s_{xy}^2 and that the n-1 cancels on the right side. This leaves us with the simple formula:

s_{xy}^2 = a\,s_x^2


or:

a = \frac{s_{xy}^2}{s_x^2}

Remembering our definition of r_xy, we get equation 3-6 below. Putting this value for a into eq. (C) above, we get equation 3-7 below.

a = r_{xy}\frac{s_y}{s_x} \qquad (3-6)

b = \bar{y} - r_{xy}\frac{s_y}{s_x}\bar{x} \qquad (3-7)

The s_y and s_x in the equations are the standard deviation of the y values and the standard deviation of the x values. r_xy is the Pearson correlation coefficient, which was defined earlier. Notice the relationship in equation 3-6 between the slope and the correlation coefficient. As r_xy gets larger, the slope a gets larger also, and if r_xy = 0, then the slope of the best fit line is zero too. Finally, we can write the equation for the best fit line as:

\hat{y}_i = \left(\bar{y} - r_{xy}\frac{s_y}{s_x}\bar{x}\right) + r_{xy}\frac{s_y}{s_x}x_i \qquad (3-8)

Another useful form, easier to remember, for the above equation is:

\frac{\hat{y}_i - \bar{y}}{s_y} = r_{xy}\,\frac{(x_i - \bar{x})}{s_x} \qquad (3-9)

Discussion: It is important to remember that the best fit line will not go through each data point. From algebra, we remember that we need at least two equations to solve exactly for two unknowns. A straight line has only two unknowns, the slope and the intercept, so two data points are enough to determine a straight line that passes exactly through both of them. When we have more than two values for x and y, we have more equations than unknowns. In fact, if the data have errors, the straight line slope and intercept will be different for each pair of data points. The problem is that we have too many data points to exactly fit the line to all of them. This is called an over-determined problem. In fact, it would be meaningless to try to exactly fit each data point, since there are errors in real data. That is why we only try to find the "best fit" line for the data.
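As a concrete illustration of equations 3-6 and 3-7, here is a minimal Python sketch (not part of the original text); it reproduces the hand calculation for the Table 3.2 data that follows.

```python
# Minimal sketch (not from the original text): least squares slope and
# intercept via equations 3-6 and 3-7 for the depth/temperature data of
# Table 3.2.
from statistics import mean, stdev

depth = [0.25, 0.5, 1.0, 2.0, 3.0]   # x
temp  = [25, 35, 60, 80, 105]        # y
n = len(depth)

xbar, ybar = mean(depth), mean(temp)
sx, sy = stdev(depth), stdev(temp)
# Pearson r from the covariance divided by s_x * s_y (equation 3-1).
sxy2 = sum((x - xbar) * (y - ybar) for x, y in zip(depth, temp)) / (n - 1)
r = sxy2 / (sx * sy)

a = r * sy / sx        # slope, equation 3-6
b = ybar - a * xbar    # intercept, equation 3-7
print(round(a, 2), round(b, 2))  # 28.27 22.84
```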

For the data in Table 3.2,

a = \frac{558.75 - (6.75)(305)/5}{14.3125 - (6.75)^2/5} = 28.27

b = 61 - (28.27)(1.35) = 22.84

and y = 22.8 + 28.3x. As a check on the calculation, the best-fit line should pass through the point (\bar{x}, \bar{y}).

Reduced major axis line


In the least squares regression line, we assumed that we knew one of the variables much better than the other. If this is not the case, then a least squares regression is not appropriate. A reduced major axis line is another type of linear regression line. Here, the quantity minimized is the sum of the areas of the triangles between the data points and the best-fit line, as shown in Figure 3.5.

[Figure 3.5: comparison of a least squares regression (vertical misfits) with a reduced major axis line (triangle areas).]

The equations for a and b for a reduced major axis line are:

b = \sqrt{\frac{\sum_i y_i^2 - \left(\sum_i y_i\right)^2/n}{\sum_i x_i^2 - \left(\sum_i x_i\right)^2/n}}

and

a = \bar{y} - b\bar{x}

For the data in Table 3.1,

b = \sqrt{\frac{760.92 - (63.8)^2/6}{1039.62 - (74.4)^2/6}} = 0.84 \quad \text{and} \quad a = 10.63 - (0.84)(12.4) = 0.22

so y = 0.22 + 0.84x.
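For comparison, here is a minimal Python sketch (not part of the original text) of the reduced major axis calculation for the Table 3.1 data. Note that the slope formula above reduces to the ratio of the two standard deviations.

```python
# Minimal sketch (not from the original text): reduced major axis line for
# the Table 3.1 fossil data. The slope magnitude is s_y / s_x, with the sign
# taken from the correlation (positive for these data).
from statistics import mean, stdev

length = [18.4, 16.9, 13.6, 11.4, 7.8, 6.3]   # x
width  = [15.4, 15.1, 10.9, 9.7, 7.4, 5.3]    # y

b = stdev(width) / stdev(length)     # slope
a = mean(width) - b * mean(length)   # intercept
print(round(b, 2), round(a, 2))      # 0.84 0.22
```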


Transformations and weighted regression

Often, x and y will not show a linear relationship, but a linear relationship can be found by transforming one or both of the variables. We discussed data transformations at some length in Chapter 2. Transforming the data changes the weighting of the individual data points. Even if you do not transform the data, if some data points are known with more precision than others, it may be desirable to give those points more weight.

Residual analysis

Residual analysis is a good way to decide if a linear fit was a good choice. In a residual analysis, the differences for each data point between the true y-value and the y-value as determined from the best-fit line are plotted for each x-value of the data points. If the linear fit was a good choice, then the scatter above and below the zero line should be about the same. If this analysis shows a bias (for example, the residual grows larger as x increases), another curve might be a better choice. We can compute the quality of the fit using the standard deviation of the errors, e_i.

s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n} e_i^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = s_y^2\left(1 - r_{xy}^2\right)
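Here is a minimal Python sketch (not part of the original text) of a residual analysis for the Table 3.2 fit; it lists the fit errors and checks their standard deviation against the s_y\sqrt{1 - r_{xy}^2} form of the formula above.

```python
# Minimal sketch (not from the original text): residual analysis for the
# Table 3.2 fit (y = 22.84 + 28.27x found earlier).
from math import sqrt
from statistics import stdev

depth = [0.25, 0.5, 1.0, 2.0, 3.0]
temp  = [25, 35, 60, 80, 105]
a, b = 28.27, 22.84                      # slope and intercept found earlier

resid = [y - (a * x + b) for x, y in zip(depth, temp)]
print([round(e, 1) for e in resid])      # scatter above and below zero

n = len(depth)
se = sqrt(sum(e * e for e in resid) / (n - 1))
# r for these data is about 0.9865, so s_y * sqrt(1 - r^2) should be close:
print(round(se, 2), round(stdev(temp) * sqrt(1 - 0.9865 ** 2), 2))
```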

Calculating in Excel using the "Regression" tool

In Excel, there is a data analysis tool called "regression". Unfortunately, this tool does not produce the same result for slope and intercept as when the formulae derived above are used. Which one is best? Has a mistake been made in the calculations, and shouldn't we just trust Excel, which has been around for quite a while? After all, the programming gods made it for us. It is easy to test which one is "best". Just compute the residual standard deviation, s_e, using the formula above and notice which values of slope and intercept result in the smallest value. It turns out that the smallest value is gotten from the values computed from the formulas derived in this text. The Excel regression tool computes a slope and intercept that result in a higher residual. Since we are looking for the "best fit", we choose the formulas in this text. Clearly, Excel (Office 95) has computed a value under different assumptions than we are making here. The Excel help files do not provide the answer. Therefore, we cannot trust the value that Excel produces unless we can determine why Excel's answer is different from ours. This illustrates the importance of knowing how to do the calculation by "hand" before trusting a powerful computer program to give you a number, which could be wrong, or could be assuming a different set of conditions than you expected.


Review

After reading this chapter, you should:

• Know what a scatter diagram is.
• Know what a correlation coefficient is and how to calculate Pearson's correlation coefficient, r.
• Understand what the terms 'correlation' and 'regression' mean.
• Be aware of some of the pitfalls of correlation and regression statistics.
• Know what least squares regression and reduced major axis lines are and how to calculate them.
• Know what residual analysis is and how it is used.


Exercises

1. Construct a scatter diagram, calculate 'r' and determine the significance of 'r' for the following data. Show all your work!

island age (million years)    distance of island from current hot spot
0                             0
0.5                           200
2.8                           400
7.8                           800
11.2                          1050

island age (million years) distance of island from current hot spot 0 0 0.5 200 2.8 400 7.8 800 11.2 1050

2. Determine the least squares regression line and 90% confidence interval for the data in Exercise #1 above. Which variable should be called 'x' and which should be called 'y'? Does it matter? Show all your work!

3. Construct a scatter diagram, calculate 'r' and determine the significance of 'r' for the following data. Show all your work!

Na2O and K2O (weight %)    SiO2 (weight %)
2                          45
5                          50
7                          55
1.8                        44
6                          53
3.7                        48

4. Determine the reduced major axis line for the data in Exercise #3 above. Which variable should be called 'x' and which should be called 'y'? Does it matter? Why is a reduced major axis line more appropriate than a least squares regression line, assuming the error in the analytical techniques used for all analyses is the same? Show all your work!

5. During a Journal Club talk, a student states that the correlation between two variables is 98%.

Should you be impressed by this statistic or do you need more information? Explain.

6. List four pitfalls to watch out for when working with correlation and regression statistics.


CHAPTER 4

The Statistics of Discrete Value Random Variables

In the study of statistics, it is useful to first study variables having discrete values. Familiar examples are coin and dice tosses. This gives us a chance to better understand beginning statistical principles and leads naturally to the study of continuous variables and statistical inference.

Combinations

An understanding of combinations is the first step in learning about probability and statistical inference. Let's begin with an analysis of the coin toss. When you toss a coin 10 times, how many heads and tails do you expect? Right now, it would be a good idea for you to toss a coin 10 times and see how many heads you get. Did you expect to get that number?

Simulating coin tosses using Excel: The random number function (rand()) generates a random number between 0 and 1. You can use Excel's "IF" statement to test whether the random number is greater or less than 0.5 to give it a two-state value. To do this, make a column of 10 random numbers in B2 to B11 using "=rand()". In C2, enter "=IF(B2<0.5,0,1)". Extend the formula to C11. Notice that the value in column C is 0 or 1, depending on whether the random number is <0.5 or >0.5. You can sum up the number of "heads" by putting "=sum(C2:C11)" in cell C12. Press the "Apple=" keys simultaneously (or "Ctl=" on a PC) to get new simulated toss experimental results. You don't need to use Excel to simulate coin tossing. Go ahead and toss 4 coins right now. Do it several times. You can either toss one coin four times, or 4 coins once. The statistics are the same.

When you toss a coin, you expect to get heads half the time and tails the other half. There are 2 possible "outcomes" in a single coin toss. These are a) heads and b) tails. There is only one outcome that we are interested in (heads), so we define a heads as a success, and we have 1 of the outcomes that is a success. To get the ratio of heads to tosses, you do:

\mathrm{Ratio} = P = \frac{\#\ \text{of possible outcomes that you define as successes}}{\#\ \text{of possible outcomes}} = \frac{1}{2}

So, this shows how we find that half of the tosses are expected to be heads. Ratio is the probability that a single toss will come out to be a head. Suppose you are performing an experiment that consists of tossing a coin 4 times. On average, you should get 4*P = 4*(1/2) = 2 heads. Of course, sometimes you get 0 heads, 1 head, 3 heads, or 4 heads. But, if you toss the coin many, many times, you expect the ratio of heads to tosses to become closer and closer to 0.5. For the 4 coin toss experiment, is it possible to predict the number of times we expect to get some number of heads different from 2? It has already been shown that it is possible to predict the


probability of getting a head in a single toss using the number of possible outcomes of a toss. Let’s write down all of the possible outcomes when we toss 4 coins. Each outcome is equally likely. The possibilities are shown below, with the first letter representing the outcome of the first toss, the second letter representing the outcome of the second toss, so TTHH would be tails for the first toss, tails for the second, heads for the third, and heads for the fourth.

TTTT, TTTH, TTHT, TTHH, THTT, THTH, THHT, THHH
HTTT, HTTH, HTHT, HTHH, HHTT, HHTH, HHHT, HHHH

There are several facts to notice. The first is that the counting started with all T's and progressed as if counting in binary, where a T was a "0" and an H was a "1". There are other ways to do this, but binary counting will come in handy later. The 4 tosses become analogous to a 4 "bit" number, which has 2^4 = 16 possible values. Notice that there are 16 combinations of heads and tails that can occur for the 4 coin toss sample we are discussing. So, how many outcomes are there with 0 heads? Count 'em. The answer is 1. There are 4 outcomes with 1 head, 6 outcomes with 2 heads, 4 outcomes with 3 heads, and 1 outcome with 4 heads. Of course, the total of all the outcomes is 16, as it has to be. We can't just apply the formula that we used before. The number of possible outcomes is 16, since there are that many combinations of heads and tails when a coin is tossed 4 times. So, the probability for a single outcome must be 1/16. That is the probability for any one of the above combinations happening in the sample. But, when we are going to say we have a success when several of the above outcomes occur, we then add the probabilities for all successful outcomes. Said another way, if we defined success such that half of the 16 combinations were successes, then the probability would have to be 1/2, wouldn't it? So, the formula becomes:

P(1\ \text{head}) = \frac{\#\ \text{Successes}}{\#\ \text{Outcomes}} = \frac{4}{16}

So, the probability of getting 1 head is 1/4, which means that if we conduct the 4 coin toss sample 12 times, we expect to get 0 heads 12*(1/16) times, 1 head 12*(4/16) times, 2 heads 12*(6/16) times, 3 heads 12*(4/16) times, and 4 heads 12*(1/16) times.

When asked to determine the probability of a particular random combination occurring, you can “Brute Force” the result by writing down all possible combinations, then counting the number of combinations that you consider “successes” and dividing that number by the total possible combinations.
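The brute-force count is easy to automate. Here is a minimal Python sketch (not part of the original text) that enumerates all 16 outcomes of the 4-coin toss and tallies the number of heads in each.

```python
# Minimal sketch (not from the original text): the "brute force" count for
# the 4-coin toss, enumerating all 2^4 = 16 equally likely outcomes.
from itertools import product

outcomes = list(product("HT", repeat=4))   # HHHH, HHHT, ..., TTTT
counts = {k: 0 for k in range(5)}
for o in outcomes:
    counts[o.count("H")] += 1

print(len(outcomes))              # 16
print(counts)                     # {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
print(counts[1] / len(outcomes))  # P(1 head) = 4/16 = 0.25
```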

Rules of Probability The probability of rolling a "5" with a single die is 1/6, but what is the probability of rolling a "5" or a "6"? What is the probability of rolling two "6"s with two dice or that the sum of the faces of two dice will add up to "5"? We know that the probability of any one coin toss being "heads" is 0.5, but what is the probability that if we toss 4 coins, all will be "heads" or that 3 of the 4 coins will be "heads"?


There are two basic rules of probability which you need to know. Rule #1 is that the probability of occurrence of more than one mutually exclusive event is the sum of the probabilities of the separate events. Mutually exclusive means that only one possible event can occur at one time (e.g., a coin toss is either "heads" or "tails", it cannot be both). The probability of rolling a "5" or a "6" with a single throw of a die is 1/6 + 1/6, which equals 1/3. The probability of throwing either a "heads" or "tails" is 1/2 + 1/2, which equals 1, which makes sense since there are no other choices. Rule #2 is that the probability of the occurrence of a number of independent events is the product of the separate probabilities. Independent means that the occurrence of one event does not affect the probability of the occurrence of any other event. The probability of rolling two "6"s with two dice is therefore 1/6 * 1/6, which equals 1/36, and the probability of tossing 4 "heads" in a row is 1/2 * 1/2 * 1/2 * 1/2, which equals 1/16. In some problems, the probability of a particular event may not be constant. For example, consider the following problem. Two cards are dealt from a deck of 52 cards. What is the probability that both cards are aces? Since there are 4 aces in the deck and there is equal probability of receiving any card, the probability of being dealt an ace with a single card is 4/52. The probability of being dealt a second ace is 3/51, since one ace has been removed from the deck. So the probability of being dealt 2 aces with 2 cards is 4/52 * 3/51, which equals 0.0045, or about 0.5%.

Probability distributions

A probability distribution is a plot of the "expected" frequency of an event occurring. Remember that you have already been exposed to sample distributions, where the actual data sampled is being plotted. Much of statistics involves the comparison of probability distributions with sample distributions. Sometimes we use the term "expected frequency" rather than probability. There are several important probability distributions in the field of statistics. In this chapter, we will be concerned with the Binomial Distribution.

Expectation Values and Ensembles

It is important to understand that the average result of an experiment, repeated many, many times, in the limit of an infinity of times, will approach the expected result. The imagined infinity of experiments is called an "ensemble" and the average of this infinity of experiments is called an "ensemble average". Now that this has been presented at an intuitive level, it is time to put this idea on a firmer mathematical basis. We express the expectation value of a variable as:

E[x] = \mu \qquad (4-1)

where E[x] is the expectation value and µ is the average of an infinity of experiments (ensemble average). µ is the "population mean" because it samples the entire population. Suppose we apply this to the experimental result: the difference between the number of heads and the number of tails in a coin toss experiment.

We define:

d = \frac{\#\ \text{heads} - \#\ \text{tails}}{N}


We know that for any particular experiment, d will generally not be zero. We express the average of d, for an infinite number of experiments, as E[d]. We get:

E[d] = E\left[\frac{\#\ \text{heads} - \#\ \text{tails}}{N}\right] = 0

The above equation states that, even though d may only rarely be exactly zero, when many coin toss experiments are averaged, the positive and negative d's ultimately cancel and d asymptotically approaches zero. It is exactly zero in the limit of the average of an infinity of experiments. There are some general properties of E[] that are properties of the normal average also. Several useful relations are:

a) E[x] = \mu, where x is a random variable and µ is the true average of x
b) E[ax] = a\mu, where x is a random variable and a is a constant
c) E[ax + b] = a\mu + b, where x is a random variable and a and b are constants
d) E\left[\frac{\sum (x - \mu)^2}{N}\right] = \sigma^2, where x is a random variable and N is the number of data values in a "sample". σ² is the "population" variance.

in a “sample”. σ2 is the “population” variance. You can verify the above by considering the way a normal average behaves when it is multiplied by a constant, or a constant is added to it. The above formulae only change this conceptualization by using and infinite number of terms in the average. Note that a new symbol was introduced, µ. This is the population mean of x, which is the average of x in the limit of an infinite number of measurements. We also introduced the population variance of x, which is the variance averaged over an infinity of experiments. So for the coin toss case discussed previously, if c is a constant, the quantities: E[cd] = 0 (4-2) and E[c+d] = c since d averages to 0. (4-3) Also: E[(x - µ)2] = σ2 which is the expected variance A more rigorous and general derivation of the expectation value will be given in a later section.


Binomial distribution

The discrete probability distribution for a measurement, observation or event which has two possible outcomes is described by the binomial distribution. Examples of such observations include a coin toss ("heads" or "tails"), a quality control laboratory test (defective or not defective) and the drilling of an oil well (a strike or a dry well). Another important example where the binomial theorem applies is fluctuations in the values of a particular class in a histogram. The binomial theorem will be derived for this example. Figure 4.1 shows the continuous probability distribution where any value is equally likely between 0 and xMax. The uniform distribution is used here for simplicity, but any distribution could be used.


Figure 4.1. Description of variables for the derivation of the binomial theorem.

Suppose an experiment is run where X is measured N times. The probability that X is inside the class interval is:

P = \frac{X_u - X_l}{x_{Max}}

Expectation values and Ensemble Averages: This seems to be difficult for students to grasp, yet it is very simple in concept. An ensemble is defined as the results of a number of experiments. If the experiment is a single coin toss, then the ensemble is the results of some number of coin tosses. An infinite ensemble would be the result of an infinite number of coin tosses. The Expectation Value of a quantity is just the average value of the ensemble results. So, we do an experiment and get a result, say x. Imagine that the value of x is determined an infinite number of times. We then can take the expectation value of x, x^2, etc., as discussed above. Why is this useful? It's useful because it gives us a way to think about random processes. If we had the perfect experiment, with an infinite amount of data, we would expect to get the "Expectation Value" of whatever variable we were measuring. However, if we have a less than perfect experiment, the result will be more or less close to the expectation value, depending on the variance. If we can theoretically calculate the variance, we would use the expectation value of the variance (above). If not, we have to estimate it from the data.


Note: For continuous data, the probability that any particular value will occur is zero. Why do you think this is? Think about how many real numbers there are in any finite interval. Is it intuitive that because there are so many possibilities, the probability of getting any one of them is very small? To fix this problem, for continuous numbers, it is necessary to consider the probability that a value will lie between two other values. This is expressed as p(x_l < x < x_u), which is the probability that x lies between x_l and x_u. The probability is the area under the curve (cross-hatched in figure 4.1 above). For discrete data (like a dice toss and the binomial theorem), x values are not continuous, and there are only a finite number of possible values, so it is possible to evaluate the probability of getting a particular value.

Continuous distributions will be discussed later in the chapter. The important element of this discussion is that the data point can be either a "hit" (with probability P) or a "miss" (with probability 1-P). For N data, the number of "hits" will be NP. Conversely, the number of "misses" will be N(1-P). We define Q = 1 − P, so (1 − P)N = QN. We now plot the histogram of the "expected" results of this experiment. Figure 4.2 shows the expected number of "misses" (0% on the x axis) and "hits" (100% on the x axis). Let's put this in terms of a dice toss. If we are watching the number "2", then a toss with "2" showing will be a "hit" and a toss where any other number is showing will be a "miss". P will be 1/6, and Q will be 5/6. Our experiment has N=1, because we toss the die once, then count the result as a "hit" or "miss".


Figure 4.2. The acquisition of a single data point is the experiment, the probabilities of 0% inside the class and 100% inside the class are plotted. When the first data point is taken, there are only 2 possible outcomes, with probability P and Q. These are that 0 data are in the class (0%, or 0 "hits") and 1 data are in the class (100% or 1 "hit"). Suppose that we now have 2 data points in our experiment. In our dice analogy, this would correspond for tossing the die twice, the counting "hits" or "misses". But now we have 3 possibilities: 0 "hits", 1 "hit", or 2 "hits". Figure 4.3 illustrates the probabilities. Keep in mind that when the data is "in the class", we count it as a "hit". Examining figure 4.3, we can see that each of the first two outcomes generates 2 more possible outcomes. This is analogous to the toss of 2 coins. If the first toss produces a head, the second toss


can produce a head, or a tail (HH or HT). If the first toss produces a tail, the second toss can produce a head or a tail (TH or TT). So, there are 4 possible outcomes. For the coin toss example, P = Q = 0.5.


Figure 4.3. Expected numbers of data points within each of the 4 different outcomes that are possible with 2 data values in the experiment.

Figure 4.4 is a histogram of the expected outcomes when 2 data points are taken. There are 3 possible values for the probabilities. None, half, or all of the 2 data values are between X_u and X_l. So the probabilities of the 3 outcomes (2 in, 1 in, and 0 in) are P^2, 2QP, and Q^2.


Figure 4.4. Histogram of outcomes when 2 data points have been taken. Notice that the R value (0, 1, 2) is also shown along the x axis.

The next step in the derivation is to take 3 data values. Notice how this is reminiscent of counting all of the possible outcomes of coin tossing. The only difference is that we are using P and Q instead of H and T. We are also counting the possible outcomes in a slightly different way. For 3 data values, the outcomes are, counting the same way we did with the coin tosses:

PPP PPQ PQP PQQ QPP QPQ QQP QQQ


These can be evaluated, and are:

P^3, P^2Q, P^2Q, PQ^2, P^2Q, PQ^2, PQ^2, Q^3

To get the probabilities for each outcome, we add the probabilities where the outcome is the same. Thus, we have:

P^3, 3P^2Q, 3PQ^2, Q^3

for the probabilities for each class.


Figure 4.5. Probabilities of outcomes when 3 data values have been taken.

The following table summarizes the results so far. The probabilities of the individual classes are the coefficients of the equations.

Number of data points    Probabilities are the coefficients of
N=1                      (P + Q)
N=2                      (Q + P)^2
N=3                      (Q + P)^3

By inference, we expect that for any value of N, the probabilities are the coefficients of the expansion of (Q + P)^N. Using the binomial expansion, the coefficients for arbitrary N are:

Q^N;\quad NQ^{N-1}P;\quad \frac{N(N-1)}{1\cdot 2}Q^{N-2}P^2;\quad \frac{N(N-1)(N-2)}{1\cdot 2\cdot 3}Q^{N-3}P^3;\quad \ldots;\quad \frac{N(N-1)(N-2)\cdots 2}{1\cdot 2\cdot 3\cdots (N-1)}QP^{N-1};\quad P^N


The R'th term is, where 1-P is substituted for Q:

P(R) = \frac{N!}{R!\,(N-R)!}\,(1-P)^{N-R}\,P^R \qquad (4-4)

Important information: Do you know what the "!" sign means? It is the "Factorial" symbol. This is just a shorthand that allows us to write down a sequence of numbers more concisely. Its definition is: N! = N(N-1)(N-2)···1. So, 4! would be 4·3·2·1. There is an interesting property that isn't obvious: 0! = 1. Odd, but you will see that this definition of 0! is the most useful for formulas like equation 4-4.

This is the answer. To apply this to an example, suppose that we have a 5-sided die and will throw it 10 times. P = 1/5 = 0.2 and N = 10. The expression for a particular side coming up R times is:

P(R) = \frac{10!}{R!\,(10-R)!}\left(\frac{4}{5}\right)^{10-R}\left(\frac{1}{5}\right)^{R}

Figure 4.6 shows the histogram of the probabilities of getting a particular die value R times in 10 throws. R = 2 times shows the greatest probability, since 10(1/5) is the expected value.


Figure 4.6. Histogram of probabilities of a particular die face showing R times out of N throws (x axis: number of times a given die face occurs).

Next, we consider a more interesting example. Suppose that the probability of striking oil with a wildcat well is 10%. If 10 wells are to be drilled, what is the probability that all 10 will be dry? If we assume that the probability of finding oil at any one well is independent of finding oil at any other, and that a well is either a strike or dry, nothing in between, then we can apply the binomial probability distribution to this problem. Here, P = 0.1, R = 0 and N = 10, so

P = \frac{10!}{0!\,(10!)}\,(0.9)^{10}\,(0.1)^0

which equals 0.35 or 35%. Note that we arrive at the same result if we simply follow the rules of probability discussed in the beginning of this chapter. Since the probability of drilling a dry well is 0.9,


the probability of drilling 10 dry wells is 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9 × 0.9, which equals 0.35. What is the probability of drilling one successful well? Now, P = 0.1, R = 1 and N = 10, so

P = \frac{10!}{1!\,(9!)}\,(0.9)^{9}\,(0.1)^{1}

which equals ~0.39 or ~39%. To find the probability of drilling 2 successful wells, let P = 0.1, R = 2 and N = 10. In this case, P(R) = 0.19 or 19%. What if we want to know the probability of drilling at least 1 successful well? In this case, we must add the probability of drilling 1 successful well to the probability of drilling 2, 3, 4 or more successful wells. Note that often a problem such as this can be simplified by rewording the original question. For example, to find the probability of at least 1 successful well, we can find the probability of 0 successful wells. The probability of more than 0 successful wells is just 1 minus this number. As an exercise, do these calculations. The sum of the probabilities for 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10 successful wells should be 1.

[Figure 4.7: binomial distributions for (a) N=7, P=0.5 and (b) N=7, P=0.25.]

Figure 4.7. Binomial distribution for 2 values of N and P.

The binomial distribution for N = 7 is given in Figure 4.7 for different values of P. Note that the distribution is symmetric for P = 0.5 and asymmetric for P ≠ 0.5. Important properties of the binomial distribution are its approximate mean and variance, if 0.1 ≤ P ≤ 0.9 and NP > 5:

E[x] = NP \quad \text{and} \quad E[s^2] = NP(1-P)


From the above two equations, the ratio of the standard deviation of binomially distributed data to the mean is given by:

f = \frac{\sqrt{NP(1-P)}}{NP} = \sqrt{\frac{1-P}{PN}} \qquad (4-5)

So, as N increases, the width of the distribution expressed as a fraction of the mean decreases. Suppose N=10 and P=0.2. Then:

f = \sqrt{\frac{1-0.2}{0.2(10)}} = 0.632

which means that the standard deviation of the distribution is a bit more than half of the mean. But if we let N=1000, then

f = \sqrt{\frac{1-0.2}{0.2(1000)}} = 0.0632

and the standard deviation is only 6.3% of the mean value, a factor of 10 lower. The standard deviation relative to the mean varies as:

f \propto \frac{1}{\sqrt{N}}

This very important result will be used many times.
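Here is a minimal Python sketch (not part of the original text) that evaluates equation 4-4 for the wildcat-well example and the narrowing factor of equation 4-5. It uses Python's built-in math.comb for the binomial coefficient.

```python
# Minimal sketch (not from the original text): equation 4-4 via math.comb,
# applied to the wildcat-well example (N = 10, P = 0.1), plus the 1/sqrt(N)
# narrowing of equation 4-5.
from math import comb, sqrt

def binom(R, N, P):
    # P(R) = N!/(R!(N-R)!) * (1-P)^(N-R) * P^R   (equation 4-4)
    return comb(N, R) * (1 - P) ** (N - R) * P ** R

print(round(binom(0, 10, 0.1), 2))      # all 10 wells dry: 0.35
print(round(binom(1, 10, 0.1), 2))      # exactly 1 strike: 0.39
print(round(1 - binom(0, 10, 0.1), 2))  # at least 1 strike
print(round(sum(binom(r, 10, 0.1) for r in range(11)), 6))  # sums to 1

for N in (10, 1000):
    print(N, round(sqrt((1 - 0.2) / (0.2 * N)), 4))  # f = 0.632, 0.0632
```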

Stirling's factorial approximation for large N:

N! \approx N^N e^{-N}\sqrt{2\pi N}

This is extremely useful when large factorials will eventually cancel, but the computer's number range is too small to evaluate them prior to cancellation.
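A minimal Python sketch (not part of the original text) comparing Stirling's approximation with the exact factorial; the relative error shrinks as N grows.

```python
# Minimal sketch (not from the original text): Stirling's approximation
# against the exact factorial.
from math import e, factorial, pi, sqrt

def stirling(n):
    return n ** n * e ** (-n) * sqrt(2 * pi * n)

for n in (5, 10, 20):
    exact = factorial(n)
    approx = stirling(n)
    print(n, exact, round(approx), round(abs(approx - exact) / exact, 4))
```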

Problem 1: Compute and plot the distribution of the number of heads in 100 coin tosses, using the binomial theorem. To compute the large factorials, use Excel and Stirling's approximation.


Review

After reading this chapter, you should:

• Have a basic understanding of the concept of probability.
• Be able to do simple probability problems involving coins, dice and cards.
• Understand the definition of an "expectation value". What is it, in statistical terms? How can you define it in words?
• Understand how the binomial distribution is derived.
• Know when the binomial distribution applies and how to use it to solve problems.


Exercises

1. If you select one card at random from a deck of 52 cards, what is the probability that the card will be
   a. the queen of spades?
   b. of the suit clubs?
   c. an ace?
   d. the ace of spades or the queen of hearts?
   e. either a jack, queen or king?

2. If you select two cards at random from a deck of 52 cards, putting the first card back before you select the second card, what is the probability of
   a. the queen of spades both times?
   b. an ace both times?
   c. the same card both times?
   d. a card of the suit hearts or diamonds both times?

3. If you select two cards at random from a deck of 52 cards, and do not put the first card back before you select the second card, what is the probability of
   a. the cards both being aces?
   b. the cards belonging to the same suit?
   c. at least one of the cards being an ace?

4. Assume that when you drill a wildcat oil well, there are only two possible outcomes - either the well is dry or it strikes oil. Assume that the chance of success at any one well is independent of the success at any other. Suppose that 8 wells are to be drilled and P = 0.1. Using the equation that describes the binomial probability distribution,
   a. what is the probability that they will all be dry?
   b. what is the probability that 1 out of the 8 will be a success?
   c. what is the probability that 2 out of the 8 will be a success?
   d. what is the probability that at least 1 out of the 8 wells will be a success?

5. Given the assumptions in Exercise 4 above, how many wells must be drilled to guarantee a 75% chance of at least 1 success?

6. Write an Excel sheet to calculate the binomial probability, P, given input by the user for N, R and P.


7. Modify the program from exercise #6 to calculate the cumulative binomial probability. That is, calculate the probability of at least R successes in N events.

8. Make a game based on probability. It could involve cards, dice or whatever. Use your imagination!


CHAPTER 5

Probability Distributions and Statistical Inference

In the previous chapter, we discussed probability, probability distributions for discrete variables, and expectation values. In this chapter, we introduce continuous probability distributions and sampling distributions and begin our discussion of statistical inference. At this point, you might want to review page 1-9, where the height of the histogram bars is discussed.

Continuous Distributions and Expectation Values

When a variable is continuous, the distribution of sampled values will be continuous. This will remind you of the differences in histograms between discrete and continuous data.

Referring to figure 5.1 below, imagine that we have done an experiment and plotted the histogram. The class boundaries are at X1, X2, etc. The area of the far left bar of the histogram is equal to (X2 - X1)F1, where F1 is the height of bar 1. The frequency (vertical height of the bar) is the number of occurrences within the class divided by the width of the class, (X2 - X1), so the area of the bar is the number of occurrences within the class. The sum of all the areas of the histogram bars is the total number of data values in the experiment, which we have been defining as N. Remember, in Chapter 1 (p 1-9) it was suggested that the scaling of the histogram plots would be more convenient if the frequencies were divided by N. This makes the total area of the histogram equal to 1, exactly. Suppose we now want to compute the expected mean of a data sample. The expected mean will not depend on any particular experiment. According to the intuitive explanation of expectation that has already been presented, we imagine calculating the individual means for the experiment conducted an infinity of times and averaging the individual means to get the expected mean. We can't do this in reality, of course, but there must be a way. We know from probability theory that the expected mean of the number of heads in a coin toss is half the number of tosses. There must be a way to use the probability to calculate the expected mean. In fact, there is.

First we need to figure out how to calculate the mean using only the histogram bar height values. Here we go! Suppose, in a histogram for a discrete random variable, there are 5 data values equal to 1, 2 values equal to 3, and 6 values equal to 4. The mean would be:

m = \frac{5 \cdot 1 + 2 \cdot 3 + 6 \cdot 4}{13} = 2.7

If you don't believe this formula, work it out for yourself on paper.



Figure 5.1. Plot of a histogram of a continuous distribution whose area has been approximated by 8 bars.

Notice that we computed m by grouping equal values. Similarly, we compute the mean of the distribution of figure 5.1 by making the approximation that all data values within each class are the same, and are at the center of the class. The number of data within the i'th class is the area of the class, which is F_i(X_{i+1} - X_i). So, the mean will be:

m = \frac{F_1(X_2 - X_1)\frac{X_1 + X_2}{2} + F_2(X_3 - X_2)\frac{X_2 + X_3}{2} + \cdots + F_8(X_9 - X_8)\frac{X_8 + X_9}{2}}{N}

F_i are the heights of the individual bars, and the X_i's are the values of X at the class boundaries. If we define:

X_{ci} = \frac{X_i + X_{i+1}}{2}

which is the center of the bar, we have:

m = \frac{F_1(X_2 - X_1)X_{c1} + F_2(X_3 - X_2)X_{c2} + \cdots + F_8(X_9 - X_8)X_{c8}}{N}

So, the mean is computed by multiplying the area of each bar (its height times its width) by its center value of X, summing over all bars, and then dividing by N. Now suppose that N gets very large and the bars also become very narrow. Defining \Delta X = X_{i+1} - X_i, we have:

\mu = \lim_{N \to \infty} \frac{1}{N}\sum_{i} F_i X_{ci}\,\Delta X = \int_{-\infty}^{+\infty} \frac{x F(x)}{N}\,dx = \int_{-\infty}^{+\infty} x\,p(x)\,dx \qquad (5-1)


where p(x) is given by F(x)/N. Note that E[x] = µ, consistent with equation 4-1.

Notation: m will be used to indicate the mean of observed data µ will be used to indicate the “expected” mean, or “population” mean obtained by averaging many repeated experiments. So, E[m] = µ.

So, in the limit of infinite data, the distribution becomes continuous and we compute the mean as shown in the previous equation. In general, the expectation value of an arbitrary function f(x) is

E[f(x)] = \int_{-\infty}^{+\infty} f(x)\,p(x)\,dx \qquad (5-2)

The above equation is the general formula for computing the expectation value of a general function of a random variable x which is distributed according to the probability distribution p(x). It is now possible to derive several extremely important algebraic properties of expectation values. Multiplication of the function f(x) by a constant results in the following:

E[a f(x)] = \int_{-\infty}^{+\infty} a f(x)\,p(x)\,dx = a\int_{-\infty}^{+\infty} f(x)\,p(x)\,dx

So,

E[a f(x)] = a\,E[f(x)] \qquad (5-3)

Similarly, the following relationship can be proven, where x and y are random variables and a and b are constants:

E[a f(x) + b f(y)] = a\,E[f(x)] + b\,E[f(y)] \qquad (5-4)

For example, the expectation of the squared value of the data is computed as:

E[x^2] = \int_{-\infty}^{+\infty} x^2\,p(x)\,dx \qquad (5-6)

The expectation value of the data variance is much more interesting. We compute:

E[(x - \mu)^2] = E[x^2 - 2x\mu + \mu^2] = E[x^2] - E[2x\mu] + E[\mu^2]

From equations 4-1 and 4-2, and considering that µ is a constant,

E[(x - \mu)^2] = E[x^2] - 2\mu E[x] + \mu^2

Since E[x] = µ,


\sigma^2 = E[(x - \mu)^2] = E[x^2] - \mu^2 = \int_{-\infty}^{+\infty} x^2\,p(x)\,dx - \mu^2 \qquad (5-7)

σ² is the variance of the continuous distribution. This will be called the population variance in the next chapter.

It is important to note the basic difference between the continuous and discrete distributions when data with continuous values are considered. When data values are continuous, the only time that discrete histograms are used is in the processing of actual data. Continuous distributions apply only when we are considering the limit of an infinity of data. Continuous distributions are used when performing computations to find “expected values”. When working with real data, the observed values may vary considerably from expected values, as has been demonstrated by the simulations in the previous chapters.

Gaussian and Normal Distributions The Gaussian and Normal distributions are extremely important in the field of statistics. Many populations follow a normal distribution. Moreover, as we will see in a later chapter, the sample means of arbitrary populations follow a normal distribution.. The central limit theorem (discussed later) tells us this. The gaussian distribution is familiar to you as the bell curve upon which students' grades are often based. The equation that describes the gaussian distribution is

p(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x - \mu)^2}{2\sigma^2}} \qquad (5-8)

where µ is the mean and σ is the standard deviation.
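As a quick plausibility check on equation 5-8, the density can be integrated numerically. The following short Python sketch (an illustration added here, not part of the course materials) recovers the familiar result that about 68.3% of the area lies within one standard deviation of the mean:

    import math

    def gaussian_pdf(x, mu, sigma):
        # Equation 5-8: the gaussian probability density.
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    # Crude rectangle-rule integration from mu - sigma to mu + sigma.
    mu, sigma, dx = 0.0, 1.0, 0.001
    steps = int(2 * sigma / dx)
    area = sum(gaussian_pdf(mu - sigma + (i + 0.5) * dx, mu, sigma) * dx
               for i in range(steps))
    print(area)   # ~0.6827, i.e. ~68.3% of values fall within one sigma of the mean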


Figure 5.2. Plots of the gaussian distribution for µ = 0 with σ = 1 and σ = 2. The area between X = -2 and X = +2 for the curve with σ = 2 is the same as the area between X = -1 and X = +1 for the curve with σ = 1. These areas are filled in for each curve.

It can be proven mathematically that the continuous gaussian distribution describes the discrete binomial distribution when n approaches infinity. The gaussian distribution stretches from -∞ to +∞ and is described completely by two parameters, µ and σ. Figure 5.2 shows two examples of the gaussian distribution for different values of µ and σ. By definition, the total area under the normal curve is 1. Therefore the area under the curve between any two x values, x1 and x2, gives us the fraction of the total number of x values that lie between x1 and x2. This means that simply by knowing µ and σ for a gaussian distribution, we can determine the probability of the occurrence of an x value between any given values x1 and x2. Figure 5.3 divides the area under the curve into percentages. As you can see from this figure, for any gaussian curve, ~68.3% of the x values lie within ±1σ of µ and ~95.5% of the values lie within ±2σ of µ. To illustrate how we use this information, we consider the following example. Assume that we have a list of the mid-term exam scores for a class of 1,000 Geology 4 students. The mean of the exam is 80 and the standard deviation is 5. Assume that the exam scores follow a perfect normal distribution. Based on Figure 5.3, we know that the percentage of students who scored between 80 and 85 is 34.13% of the class, the percentage who scored between 75 and 85 is 68.26%, and the percentage who scored above 90 is 2.27%. We can see that 15.87% of the class scored below 75 and that 2.27% scored below 70.

[Figure 5.3 shows the gaussian curve with its area divided into segments: 34.13% between µ and µ+σ and between µ and µ-σ, 13.60% between 1σ and 2σ on each side of µ, and 2.27% beyond 2σ on each side.]

Figure 5.3. Areas beneath various segments of the Gaussian curve.

We can calculate the area under this curve for any two x values by numerical integration, but in practice, we rely on existing tables such as Table A1 in the Appendix. Take a look at this table now. This table is organized according to z-values, which describe position on the normal curve, where

Z = \frac{x - \mu}{\sigma} \qquad (5-9)

and z = 1 corresponds to µ + 1σ and z = -1 corresponds to µ - 1σ. This table gives the percentage of the area under the normal curve between positive infinity and the z-value of interest. Armed with the fact that the normal curve is symmetric and a few mathematical manipulations, we can easily find the


percentage of the area under the curve that lies between any z-values, which we can translate back to x values. Let us continue with our example of Geology 4 exam grades to illustrate the use of Table A1. Suppose we are interested in finding the percentage of students who scored above 88. First we need to determine the appropriate z-value. Here, z = (88-80)/5 = 1.6. We look up z = 1.6 in Table A1 and find that the area under the curve above 88 is 0.0548 or 5.48%. If we wanted to find the percentage of students who scored below 88, we would look up the percentage of students who scored above 88 (z = 1.6) and subtract that value from 1, since the total area under the curve must add to 1. So the percentage of students who scored lower than 88 is 94.52%. If we wanted to find the percentage of students who scored between 80 and 88, we would find the percentage of students who scored above 80 (z = 0), 0.50, and subtract from this value the percentage of students who scored above 88 (z = 1.6), 0.0548, leaving 0.4452. It is best to draw yourself a sketch of the area of the curve in which you are interested. To test your understanding, find the area under the normal curve between the z-values listed below and see if your answers agree with the ones given. Note that P(-Z) = 1-P(Z).

Z range          area       Z range            area
0.00 and ∞       0.500      1.96 and ∞         0.025
0.55 and ∞       0.2912     -∞ and -1.27       0.1020
1.23 and ∞       0.1093     -∞ and -0.88       0.1894
2.34 and ∞       0.00996    -0.70 and 0.70     0.516
1.00 and ∞       0.1587     -1.00 and 1.00     0.6826

Practice reading the Z tables in Appendix A1. Verify that you can read the table to get the values for the following situations:
1. What is the area beneath the Z distribution curve for Z > 1.5? (Ans = 0.0668)
2. What is the probability that Z > 2? (Ans = 0.0228)
3. What is the probability that 0 < Z < 1? (Ans = 0.3413)
4. If µ = 2 and σ = 4, what is the probability that a data value > 4? (Ans = 0.3085)
5. If µ = 5 and σ = 2, what is the probability that x < 3? (Ans = 0.1587)
6. If µ = 5 and σ = 2, what is the probability that 3 < x < 7? (Ans = 0.6827)
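If a Z table is not at hand, the same areas can be computed with the error function. A short Python sketch (ours, not part of the text) that reproduces the practice answers above:

    import math

    def upper_tail(z):
        # Area under the standard normal curve from z to +infinity.
        return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

    print(upper_tail(1.5))                     # ~0.0668  (item 1)
    print(upper_tail(2.0))                     # ~0.0228  (item 2)
    print(0.5 - upper_tail(1.0))               # ~0.3413  (item 3)
    print(upper_tail((4 - 2) / 4))             # ~0.3085  (item 4)
    print(1.0 - upper_tail((3 - 5) / 2))       # ~0.1587  (item 5)
    print(upper_tail(-1.0) - upper_tail(1.0))  # ~0.6827  (item 6)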

You should also know how to use Table A1 to find the z-value that a given percent of the curve lies above. For example, what z-value does 2.5% of the curve lie above? The answer is +1.96. Find the z-values that 10%, 5%, 1% and 0.25% of the curve lies above. Make sure that your answers are +1.285, +1.645, +2.325 and +2.81.



Figure 5.4. 95% of the area under the normal curve is contained between Z values of +1.96 and -1.96. This corresponds to µ ± 1.96σ in the general case.

Poisson distribution

The poisson distribution describes the probability distribution of discrete events occurring within a given finite interval or object, such as time, length, area, volume, body of water, host specimen, etc. For example, the poisson distribution may be used to describe radioactive decay, where the number of decay particles is counted for a specified length of time. Conditions for a process obeying a poisson distribution are:
• The probability of a single occurrence of the event is proportional to the interval size.
• The probability of 2 or more events occurring within a sufficiently small interval is negligible.
• Events occur in non-overlapping intervals independently. That is, the occurrence of one event does not influence the occurrence of the other event.

The form of the poisson distribution is:

p(y; \lambda) = \frac{e^{-\lambda}\,\lambda^y}{y!}

for y = 0, 1, 2, 3, etc. Here y is the value of the random variable (the number of occurrences within the interval) and λ is the expected number of events. For example, suppose you are observing radioactive decay and expect 10 events/second. The probability of getting y events during any particular second is:

p(y; 10) = \frac{e^{-10}\,10^y}{y!}
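These probabilities are simple to evaluate directly. A small Python sketch (an added illustration, using the expected rate λ = 10 of the decay example):

    import math

    def poisson_pmf(y, lam):
        # Probability of exactly y events when lam events are expected.
        return math.exp(-lam) * lam ** y / math.factorial(y)

    for y in (5, 10, 15):
        print(y, poisson_pmf(y, 10.0))
    # The probabilities over all possible y sum to 1:
    print(sum(poisson_pmf(y, 10.0) for y in range(100)))   # ~1.0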

Uniform distribution

In a uniform distribution, the probability of the occurrence of any x value is the same as the probability of the occurrence of any other x value on the interval between Xl and Xu. Its probability density distribution is given by:


p(x) = \frac{1}{X_u - X_l}

The denominator maintains the normalization, which requires that the area of p(x) is equal to 1.

Von Mises distribution

The von Mises distribution is also called the circular normal distribution. It is the equivalent of the normal distribution for directional data, such as paleocurrent directions or grain orientation data. It is given by:

M(\mu_0, k) = \frac{1}{2\pi I_0(k)}\,e^{k\cos(\vartheta - \mu_0)}

where k (always > 0) is called the “concentration parameter” (analogous to σ), I0(k) is a modified Bessel function of the first kind, and µ0 is the mean of the distribution. Its values are tabulated in math tables texts.

Log-normal distribution

In a log-normal distribution, the logarithms of a set of values form a normal distribution. For example, grain sizes, trace element concentrations and the sizes of oil fields all follow a log-normal distribution. If a distribution of data values is skewed toward the low end, try taking the log of each data value and plotting the distribution. If it looks like a normal distribution, the data most likely follow the log-normal distribution. The form of the log-normal distribution is:

p(x) = \frac{1}{x\,\sigma_n\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{\ln x - \mu_n}{\sigma_n}\right)^2}

where µn and σn are the mean and standard deviation of the ln(x)’s. Note that p(x) is defined only for x > 0.
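The defining property, that the logarithms of the values are normally distributed, is easy to verify by simulation. A minimal Python sketch (the values of µn and σn below are arbitrary choices for illustration):

    import math, random

    mu_n, sigma_n = 1.0, 0.5   # mean and std dev of ln(x); arbitrary example values
    # Log-normal values are exponentials of normal draws.
    data = [math.exp(random.gauss(mu_n, sigma_n)) for _ in range(100000)]

    logs = [math.log(x) for x in data]
    m = sum(logs) / len(logs)
    s = math.sqrt(sum((v - m) ** 2 for v in logs) / (len(logs) - 1))
    print(m, s)   # ~1.0 and ~0.5, recovering mu_n and sigma_n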


Figure 5.5. This is a plot of the gaussian distribution for µ = 5 and σ = 1.25. The tails of the distribution are at µ ± 1.96σ, and we are defining any data point in the tails as a rare event.

Sample Distribution of a Single Data Value when σ is Known

Suppose that a sample consisting of a single data point is randomly taken from a population with a Gaussian distribution. We want to find the lowest and highest values of the population mean, µ, that would produce that sample 5% or less of the time. We will call an event in the tails a "rare event". Given a particular sample value, xi, figure 5.6 shows the highest and lowest values of µ that are possible, assuming that xi is within the 95% limits. Suppose that µ = xi - 1.96σ. This is the case in the left hand plot of figure 5.6 and is the dividing line between a "rare event" and a non-"rare event" at the upper end of the distribution. We can quantitatively define the non-"rare event" by the inequality xi < µ + 1.96σ; values beyond this limit occur 2.5% of the time. So, we can say that, under the conditions we have specified, µ > xi - 1.96σ. This can be seen in the plot, or found from the first inequality by subtracting 1.96σ from each side. The right hand plot of figure 5.6 shows that for a non-"rare event", xi > µ - 1.96σ (values beyond this limit occur 2.5% of the time). So, we could say that µ < xi + 1.96σ. More succinctly, we could say that we have the following result 95% of the time:

x_i - 1.96\sigma < \mu < x_i + 1.96\sigma \qquad (5-3)

Now, here is where we have to be extremely careful in how we think about this result. Obviously, for a single value of xi, if we repeated the experiment, the computed limits on µ would jump around all over the place. What we can say is that: if the value for µ is outside the two limits of equation 5-3, then we will get the sampled value for xi only 5% (or less) of the time, on the average.


Figure 5.6. Smallest and largest values of µ that are consistent with a particular sample xi, if we require that at least 95% of repeated experiments produce an xi within this range. Notice that the left figure shows the distribution offset to the left by the maximum amount and the right hand figure shows the distribution offset to the right by the maximum amount.

Suppose the “true” value of µ is exactly xi. You would really get xi much more often than 5% of the time. However, that is not the point. All you have is a single value of xi, and you have to draw whatever conclusion you can from the information you have. Note also that you don’t know σ from the data, but must know its value independently. This presents a fatal complication if all you really know about the data is a single value. In the absence of more information than that provided by a single data point, it would be extremely optimistic to make any conclusion at all. The next section shows how to make inferences about the population mean and variance when more than a single data point is available.


Sampling Distributions

Sampling distributions are probability distributions of a sample statistic (e.g. m and s²) calculated for all possible samples of size N from a given population, as discussed in the previous paragraph. For example, if our population consists of the numbers 5, 8, 10 and 6 and we are interested in the sampling distribution of the sample mean, m, for a sample size of 2 (N=2), our sampling distribution includes the values 6.5, 7.5, 5.5, 9, 7, and 8, where 6.5 is the mean of 5 and 8, 7.5 is the mean of 5 and 10, and so forth. For this same population and a sample size of 3, our sampling distribution is 7.67, 6.33, 8 and 7, where 7.67 is the mean of 5, 8 and 10, 6.33 is the mean of 5, 8 and 6, and so forth.

Consider the sampling distribution of the sample mean, m, for a population with a normal distribution. The distribution of the sample means is normally distributed. The mean of the sample means is equal to the population mean,

\mu_m = \mu \qquad (5-4)

and the standard deviation of the sample means is related to the standard deviation of the population by

\sigma_m = \frac{\sigma}{\sqrt{N}} \qquad (5-5)

where N is the sample size.


Figure 5.7 . This shows a possible distribution of sample means, m. The peak is at the population mean. The total area under the distribution is 1, so if the shaded area is 0.95, 95% of the times the experiment is repeated (for a large number of repeats), the sample mean will lie between the upper and lower confidence limits. The standard deviation of the distribution of sample means is “expected“ to be the standard deviation of the population divided by the square root of the number of data points in the sample, N.
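Equation 5-5 is easy to confirm by simulation. A short Python sketch (an added illustration, not from the text): with σ = 1 and N = 25, the standard deviation of the sample means should come out near 1/√25 = 0.2.

    import random, statistics

    N, repeats = 25, 20000
    sample_means = [statistics.mean(random.gauss(0.0, 1.0) for _ in range(N))
                    for _ in range(repeats)]
    print(statistics.stdev(sample_means))   # ~0.2 = sigma / sqrt(N)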

Distribution of Sample Means: Determining Optimum Sample Size

Since the distribution of sample means follows a normal distribution, we can express this distribution in terms of Z-values,


Z = \frac{m - \mu}{\sigma_m} = \frac{m - \mu}{\sigma/\sqrt{N}} \qquad (5-6)

where we have used the relationship between the variance of the sample means and the population variance that we discussed above. We can use the standard normal distribution to determine the sample size necessary to estimate the population mean, µ, to a required confidence level. The confidence level is the percentage of times the experiment is expected to produce the result lying between some upper and lower bound (figure 5.7). Referring to figure 5.6 and substituting the value of σm given by equation 5-5 for σ in equation 5-3, and m for xi (because now we are working with the sample mean), we have:

m - 1.96\frac{\sigma}{\sqrt{N}} < \mu < m + 1.96\frac{\sigma}{\sqrt{N}} \qquad (5-7)

Remember that the above equation means that if the experiment were repeated a large number of times, and µ were beyond either one of the extremes shown, we would arrive at the measured value of m, for a sample size of N, 5% (or less) of the time. Notice also that we must know σ. The next chapter will discuss how we find the limits on µ when σ is not known. For example, suppose an experimenter needs to measure the ratio of two isotopes to within ±0.06 at a 95% confidence level. That is, if this experiment were repeated many times and the 'true' isotopic ratio were known, then this 'true' value would lie within the specified range for 95% of the experiments. Suppose that the errors in the measurements follow a normal distribution and the standard deviation for the technique is known to be 0.1. How many measurements should be made? Are 5 measurements enough? From Appendix A1, we see that the z-value for which 2.5% of the curve lies above is 1.96. Since 2.5% of the curve also lies below z = -1.96, that leaves 95% of the curve between z = +1.96 and z = -1.96. We refer to the area above the positive z-value as the upper tail and the area below the negative z-value as the lower tail. By using the above expression for z with z = 1.96, we can find the maximum difference between m and µ that will keep us in the center 95% of the curve. Subtracting m from each side of equation 5-7, we can determine the limits of m - µ as:

-1.96\frac{\sigma}{\sqrt{N}} < \mu - m < 1.96\frac{\sigma}{\sqrt{N}} \qquad (5-8)

or at the extreme limits:

m - \mu = \pm 1.96\frac{\sigma}{\sqrt{N}} \qquad (5-9)

In the example discussed above,


m - \mu = \frac{(1.96)(0.1)}{\sqrt{5}} = 0.09

This value is too large. We try N=15. m - µ is:

m - \mu = \frac{z\sigma}{\sqrt{N}} = \frac{(1.96)(0.1)}{\sqrt{15}} = 0.05

which is lower than our required value of ±0.06 for m - µ, so we may be taking more samples than necessary. We can determine the minimum number of samples by solving for N in equation 5-9:

N = \left(\frac{z\sigma}{m - \mu}\right)^2 \qquad (5-10)

and in this case, N = 11.
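A minimal Python sketch of this calculation (our illustration; the helper name required_n is invented) reproduces the result for the isotope-ratio example:

    import math

    def required_n(z, sigma, tolerance):
        # Equation 5-10, rounded up to the next whole sample.
        return math.ceil((z * sigma / tolerance) ** 2)

    print(required_n(1.96, 0.1, 0.06))   # 11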

Central Limit Theorem

Many kinds of randomness can be interpreted in terms of a normal distribution. This is not an accident. In fact, when the measurement is derived from a process that has some kind of intrinsic averaging effect, the distribution will be normal. There is even a theorem that states this fundamental property. The Central Limit Theorem states that the distribution of sample means tends toward a normal distribution as sample size becomes large, even if the population from which those samples were taken is not normally distributed. This theorem is very important in statistics because it tells us that even if a population does not follow a normal distribution, we can still assume normally distributed data if we are studying the means.

Statistical Inference

We introduce the concepts of statistical inference and begin our discussion of hypothesis testing and estimation here. We will present these concepts in more detail in the next chapter when we discuss the t, F and χ² sampling distributions. In general, statistical inference involves either the formulation of a hypothesis about a population and the testing of that hypothesis, or the estimation of a confidence interval for a population parameter, as given by equation 5-7. Both hypothesis testing and parameter estimation are based on a sample and a sampling distribution (figure 5.7). For example, we might state as a hypothesis that "the mean of this population is not significantly different from 10 at the 95% confidence level" or "the means of these two populations are not statistically different at the 95% confidence level". If we are interested in estimation, our question might be 'between what two values can we be 95% confident that the mean of this population lies?'. We imagine the results of an experiment repeated many times. The confidence intervals are the upper and lower values between which a particular "statistic" (e.g. sample mean) lies some percentage of the time (often 95%).


In statistics, we never prove anything! We simply state the probability that a given hypothesis is true or that a population parameter lies within a particular interval. There is nothing magical about the value of 95%. We could have chosen to consider the 80% confidence level or the 99% confidence level, but 95% is the value most commonly chosen.

Estimating population parameters from sample parameters

One of the most important values we compute is the sample mean. The value of the population mean, µ, must be determined by taking samples. In chapter 1, we defined the sample mean, m, as the arithmetic mean of the sample values, x:

m = \frac{\sum_{i=1}^{N} x_i}{N} \qquad (5-11)

where N is the sample size (the number of individuals in the sample). The sample mean, m, is the best, unbiased estimator of the population mean, µ. By unbiased, we mean that if we conducted the same experiment many times (infinity, in the limit), then the average of the m values will tend towards µ exactly; that is, E[m] = µ. The other property of the population that is of interest is its standard deviation. This parameter defines how much variation from the mean the individual population values take. Again, this must be estimated from the sample. We define the sample variance, s²:

s^2 = \frac{\sum_{i=1}^{N}(x_i - m)^2}{N - 1} = \frac{\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2}{N - 1} \qquad (5-12)

where xi is the sample value and N is the sample size. Using the above equation for the sample variance, s², we arrive at the best, unbiased estimator (see next section) of the population variance, σ². The standard deviation is the square root of the variance. The two equations for variance given above are equivalent. The first is a simpler expression; the second may be more convenient under some circumstances.

When computing s2 for large N, you must be careful as you accumulate the sum of squared data values, so that the sum of the squared numbers does not overflow the capacity of the number system used by the computer. The first form of the equation will be superior in this respect, because the mean is subtracted before the number is squared. When this is not enough, the sums must be computed separately for sub-”blocks” of numbers, then divided by N-1, then added together.
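The two forms of equation 5-12 can be compared directly. A Python sketch (added for illustration) of both; the two-pass form is the numerically safer one described above:

    def variance_two_pass(data):
        # First form of eq 5-12: subtract the mean before squaring.
        n = len(data)
        m = sum(data) / n
        return sum((x - m) ** 2 for x in data) / (n - 1)

    def variance_one_pass(data):
        # Second form of eq 5-12: accumulates a sum of squares, which can
        # lose precision or overflow for large data sets.
        n = len(data)
        s1, s2 = sum(data), sum(x * x for x in data)
        return (s2 - s1 * s1 / n) / (n - 1)

    print(variance_two_pass([4, 8, 10, 12]))   # ~11.67, identical for both forms
    print(variance_one_pass([4, 8, 10, 12]))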


Bias and Unbiased Estimators

An "Estimator" is the equation you use to compute the desired quantity from the data. The goal is to make the best possible estimate of the population value from the data resulting from an experiment. Some estimators may consistently give you a value that is lower (or higher) than the actual population value. This means that although you may get more data and repeat the experiment many times, the answer may still be off. You may wonder about the N-1 term in the denominator of equation 5-12 above. It would intuitively seem that N is the proper denominator to use, since we are essentially taking an average of the squared difference between the sample value and the sample mean. However, if we do this, the sample standard deviation will come out slightly under the true population standard deviation, on the average. You could prove this using a computer simulation (see problem 1 below).

Problem 1. Prove that equation 5-12 does produce an unbiased estimate of the population variance. Do this by generating random numbers in Excel using the "Data Analysis…" selection from the "Tools" menu. Generate a large number of random numbers with a normal distribution (with µ = 0 and σ² = 1). Then, repeatedly sample and average the computed values for s². Show that dividing by N-1 instead of N in the formula produces the correct variance, in the limit of many samples. Hint: start with small values of N.
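For readers working outside Excel, a Python sketch of the same kind of simulation (ours, not part of the problem statement) shows the bias directly for N = 2:

    import random

    N, trials = 2, 100000
    sum_div_n, sum_div_nm1 = 0.0, 0.0
    for _ in range(trials):
        xs = [random.gauss(0.0, 1.0) for _ in range(N)]
        m = sum(xs) / N
        ss = sum((x - m) ** 2 for x in xs)
        sum_div_n += ss / N          # biased: divide by N
        sum_div_nm1 += ss / (N - 1)  # unbiased: divide by N - 1
    print(sum_div_n / trials)    # ~0.5, well below the true variance of 1
    print(sum_div_nm1 / trials)  # ~1.0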

There is another way to intuitively understand this biasing effect. It is caused by the use of m (the mean computed from the data) rather than µ (the true population mean) in the variance formula. Consider the computation of the variance of a sample consisting of a single value. The sample mean is exactly equal to the sample value. In the variance formula, the numerator is zero, and the denominator is 0, so we get 0/0, which is undetermined. It’s a good thing, because: how can we get a variance with only one sample? There is not enough data! If we used N in the denominator for a sample size of 1, s² would be 0, which would be wrong. Undetermined is a better answer. However, if we somehow knew the population mean, µ, we would have some information on the variance, even from one sample. When the population mean is known, the correct denominator is N, which is 1. Now, if we take 2 data points for our sample, we can get a somewhat better value for s². The denominator (N-1) is “1”. The sample mean is computed as (x1 + x2)/2 and the variance is {(x1 - m)² + (x2 - m)²}/1. It is important to note that m is halfway between x1 and x2. No matter what the population mean, m (for 2 data points) is always halfway between x1 and x2. The values of (x1 - m)² and (x2 - m)² will be slightly reduced, on the average, from those which would have been obtained using µ. We speak of this as the statistics of the variance having N-1 degrees of freedom.


When the population mean, µ, is known, the correct formula for variance (the unbiased estimator of the population variance) is:

s^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}

When only the sample mean, m, is known, the correct formula for variance (the unbiased estimator of the population variance) is:

s^2 = \frac{\sum_{i=1}^{N}(x_i - m)^2}{N - 1}

The formula for a property of a sample distribution that tends toward the same value of the property of the population (when its values for a large number of samples are averaged) is called an "Unbiased Estimator". The above formulas for s² are "Unbiased Estimators".

Review

After reading this chapter, you should:
• Be able to calculate the sample mean and the sample variance. Know these formulas.
• Understand what is meant when we say that the sample mean and the sample variance are the 'best, unbiased estimators' of the population mean and population variance.
• Understand the term 'sampling distribution' as it is used in this chapter.
• Be able to describe the distribution of sample means for a normal population, the nature of the mean and variance of this distribution, and the relationship between these parameters and sample size.
• Be able to determine the sample size necessary to estimate the population mean from a sample, based on the distribution of the sample means. Assume that σ is known.
• Be able to state and understand the Central Limit Theorem and appreciate why this theorem is so important in statistics.
• Be familiar in a general way with the concepts of statistical inference, parameter estimation and hypothesis testing.
• Understand what an unbiased estimator is, and why it is important to use unbiased estimators.


Exercises

1. Calculate m and s for the samples given below.
   a. 1 2 2 3 3 3 4 4 5 5
   b. 4 8 10 12 6 9 10 13 6 9 10 14 8 9 11 15 8 10 11 17

2. Identify the population, the individual and a sample for each of the following problems. Note that there is no one correct answer for many of these.
   a. a study of the minerals and rock fragments in a thin section
   b. a study of the porosity of a sandstone unit
   c. a study of the opinions of professional geologists in California on offshore oil issues
   d. a study of the occurrence of a particular fossil in the Jurassic
   e. a study of the isotopic composition of lavas from hot spots
   f. a study of the average number of hours spent by UCSB students doing homework per week
   g. a survey of the extent of chemical contamination on a 1 km² industrial lot
   h. a study of fault orientation in a map area
   i. a study of minerals present in drill core

3. Suggest appropriate sampling schemes for each of the problems outlined in exercise #2 above. Note that there is no one correct answer for these questions.

4. What is a random sample? What is a biased sample?

5. What factors influence the choice of sample size?


6. Define the term 'sampling distribution' in your own words.

7. Suppose you wish to measure the concentration of chemical X with an accuracy of ±0.1 g, to a 95% confidence level. The measurement process you will be using has a known variance of 0.01 g². How many measurements must you make?

8. What is the Central Limit Theorem and why is it important?

9. Use the Central Limit Theorem to make a program that will compute a gaussian distributed random number using the rand() function. This function, as specified, returns a number with a uniform distribution. Prove that this number is gaussian distributed by computing the number of times the random number is outside of the ±1s bounds. Of course, you will have to make a pretty good computation of what s is before you can do this (Hint: use your simulation to do this).

10. Assume that the final exams for a Geology 4 class follow perfectly a normal distribution. The mean of the exam was 55 and the standard deviation is 20. Using Figure 5.3 from this chapter, determine how many students received a score
    a. above 75?
    b. below 35?
    c. between 35 and 55?
    d. above 95?
    e. between 35 and 95?
Each of your answers should be accompanied by a sketch of the area of the curve you are trying to find.

11. Use Table A1 to answer the following questions pertaining to the problem described in exercise #10. How many students received a score
    a. above 70?
    b. between 70 and 90?
    c. between 50 and 80?
    d. above 85?
    e. below 50?
Each of your answers should be accompanied by a sketch of the area of the curve you are trying to find.

12. For the problem described in exercise #10, between what two scores did 95% of the class score?


CHAPTER 6

Statistical Inference and the t, F and χ²-distributions

In this chapter, we introduce the concepts of statistical inference and the Student’s-t, F and χ² (Chi-squared) sampling distributions, which apply to Gaussian distributed populations. We discuss the use of these distributions in a variety of parametric statistical tests. Throughout this chapter, we stress the basic principles of statistical inference that underlie each of these tests - hypothesis testing or parameter estimation at a specified level of confidence or significance.

The Student’s-t distribution

This distribution was introduced by a statistician named William S. Gossett, who wrote under the pen name of Student. The distribution acquired the name “Student’s-t”, after Gossett’s pen name. The Student’s-t distribution solves a problem that is created when we want to compute the z value, but don’t know σ, the population standard deviation. We had, from chapter 5:

Z = \frac{m - \mu}{\sigma/\sqrt{N}} \qquad (6-1)

If σ is unknown, the z tables cannot be used. It is logical to substitute the sample standard deviation, s for the population standard deviation, σ. If we take all possible (infinity in the limit) samples of size N from a normal distribution with mean µ and variance σ2 and calculate a t-statistic defined as:

t = \frac{m - \mu}{s/\sqrt{N}} \qquad (6-2)

for each sample, we will have a Student’s-t distribution, which we shall call the “t distribution” from now on. For large N, the t-distribution approaches the normal distribution. For practical purposes, when N > 30, the t-distribution and the normal distribution are identical. For N < 30 the t-distribution curve will be symmetric, as is the normal curve; however, as N decreases the curve will flatten. This makes intuitive sense: with a smaller sample size, m is less likely to be close to µ than for a larger sample size, and therefore relatively more of the distribution lies at large values of t. Curves for a normal distribution and for a t-distribution with N=4 are shown in Figure 6.1.


Tables of t-values

In Chapter 5, we used tables with z-values for a standard normal distribution to find the proportion of the area under the normal distribution curve between a given z-value and z equal to infinity. A table for t-values is contained in Appendix A2.

Figure 6.1. Distribution of t values.

We are interested in finding the t-value that a certain proportion of the area under the t-distribution curve lies above. In some cases we want to find a t-value such that 5% of the area under the curve is in the upper tail. In other cases, we want 5% of the area under the curve to be contained in both tails, so we want to find a t-value such that 2.5% of the area under the curve is in the upper tail. Appendix A2 gives the t-values for which 2.5% and 5% of the area under the curve is in the upper tail. These values are given for various degrees of freedom. For a t-distribution, the number of degrees of freedom is N-1, where N is the sample size. A table of t-values is given as Table A2 in the Appendix. It is important that you feel comfortable with reading this table and that you understand the relationship between the numbers on this table and the t-distribution curve. For example, suppose we want to find the t-value above which 5% of the area beneath the t-distribution curve lies for a sample of size 11. In this case, we have 10 degrees of freedom. Since we are interested in the upper tail only, we find the t-value for which we are looking, 1.812, in the row corresponding to 10 degrees of freedom and the column for 0.05. Next, suppose we want to know between what two t-values 95% of the distribution curve lies, for 10 degrees of freedom. In this case we want to find the t-value for which 2.5% of the area is in the upper tail and 2.5% of the area is in the lower tail. We look in the 0.025 column for 10 degrees of freedom and read a value of 2.228 from the table. This means that 95% of the area of the curve lies between t = -2.228 and t = +2.228. Often we have a t-value and want to know whether that value lies in the tail or tails of the curve. For example, suppose we calculate a t-value of 1.900 for a sample with a sample size of 16. Is this t-value in the 5% of the area beneath the distribution curve contained in the two tails of the curve? We look in the column 0.025 and the row for degrees of freedom equal to 15



Figure 6.2. Curves showing t distribution. The areas in black show the 5% region of “unlikely event” for a two tail test (a) and for a single tail test (b).

and find the t-value 2.131. It helps a great deal to draw a sketch of the curve and at this stage you should always do so. From Figure 6.2a, it is clear that our t-value of 1.900 is not contained in the two tails of the curve. What if instead we asked whether this t-value was contained in the upper tail of the curve that contained 5% of the area under the curve? The answer this time is yes it is, as Figure 6.2b shows.


Practice reading the Student’s-t Tables in Appendix A2. Verify that you can read the table to get the t value for the following situations:
1. N = 5, find the t value that is in the upper 10% of the range (ans: t = 1.533).
2. N = 10, find the t value that is in the upper 2.5% of the range (ans: t = 2.262).
3. N = 12, t = 1.785. Is the t value within the upper 10% range? (ans: Yes)
4. For example 3, is the t value within the upper 5% range? (ans: No)
5. N = 11. We want the t values for the 95%, two-tailed confidence limit. (ans: -2.228 < t < 2.228)
6. N = 4. Find the t values representing the two-tailed 90% confidence limit. (ans: -2.353 < t < 2.353)
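If a computer is handy, the same critical values can be read from software instead of Table A2. A sketch in Python (assuming the scipy package is available):

    from scipy import stats

    print(stats.t.ppf(0.90, 4))    # ~1.533  (item 1: df = 4, upper 10%)
    print(stats.t.ppf(0.975, 9))   # ~2.262  (item 2: df = 9, upper 2.5%)
    print(stats.t.ppf(0.975, 10))  # ~2.228  (item 5: df = 10, two-tailed 95%)
    print(stats.t.ppf(0.95, 3))    # ~2.353  (item 6: df = 3, two-tailed 90%)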

t-test: Estimating the Population Mean

We can use a t-test based on the t-distribution to estimate the mean of a population, within a specified confidence level, from a sample. The procedure is identical to that followed in Chapter 5, except that we use the sample standard deviation rather than the population standard deviation. Using equation 6-2 instead of equation 5-6, we find, in analogy to the derivation of equation 5-7, the following:

m - t\frac{s}{\sqrt{N}} < \mu < m + t\frac{s}{\sqrt{N}} \qquad (6-3)

The application of this equation is demonstrated in figure 6.3. The figure demonstrates several important properties. Given a particular value for t, the allowable range of µ will decrease as N increases. This is because the standard deviation of the distribution of the means decreases with increasing sample size (refer to equation 5-5 and the associated discussion). Also, the larger the value of t, the wider the limits on µ.


Figure 6.3. Largest and smallest population mean, µ, that is allowed by a specified t value. Note on the above plots that the sample mean, m, is at the same place on each plot and the probability distribution curve is shifted to the right and to the left to reflect its most extreme allowed positions. See figure 5.6.



The value of t is selected from the t distribution table according to the number of degrees of freedom (N - 1) and the desired confidence interval. Suppose we want to find µ for the porosity of a sandstone unit based on a sample of 11 measurements for which m is 9.4 and s is 3.1. We won't worry about units of porosity here. We assume that the porosity of the sandstone follows a normal distribution. Specifically, we want to find a range of values which we can be 95% confident contains µ. This means that if the experiment were repeated many times, 95% of the sample means, m, would be within this range. Since our sample size is 11, our t-statistic will have 10 degrees of freedom, so we are interested in the t-distribution for 10 degrees of freedom. Since we want the 95% confidence interval and we have no reason to be interested in only the upper tail or only the lower tail of the t-curve, we use a two-tailed test. That is, we wish to find the positive t-value such that 2.5% of the area under the curve lies above that value and the negative t-value such that 2.5% of the area under the curve lies below that value. Since the curve is symmetric and most tables give values only for positive t-values, we consider the upper tail only. We look up the value for the t-statistic in column 0.025 for 10 degrees of freedom in Table A2. This value is +2.228. This means that when samples of size 11 are taken from a normal population, only 5% of the samples will have values for m and s extreme enough that t falls outside the relation -2.228 ≤ t ≤ 2.228; that is, t satisfies this relation for 95% of all possible samples of size 11. From equation 6-2, we can express this relation as

-2.228 \le \frac{m - \mu}{s/\sqrt{N}} \le 2.228

If we rewrite this expression in terms of µ (equation 6-3), we obtain

m - t\frac{s}{\sqrt{N}} \le \mu \le m + t\frac{s}{\sqrt{N}} \qquad (6-4)

m - \frac{2.228\,(3.1)}{\sqrt{11}} \le \mu \le m + \frac{2.228\,(3.1)}{\sqrt{11}}

which, if we substitute our value for m, gives:

7.3 \le \mu \le 11.5

which says that the 95% confidence limits on µ are between 7.3 and 11.5. If we specify a confidence level greater than 95%, our range of values would be larger. Conversely, if we specify a confidence level of less than 95%, our range of values would be smaller.
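The arithmetic of this example is compact enough to script. A Python sketch (an added illustration using the numbers above):

    import math

    m, s, N = 9.4, 3.1, 11
    t_crit = 2.228                         # Table A2: 10 degrees of freedom, 0.025 column
    half_width = t_crit * s / math.sqrt(N)
    print(m - half_width, m + half_width)  # ~7.3 and ~11.5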


In the example above we used a two-tailed approach. In other problems, we might be interested in estimating µ so that we are 95% confident that µ exceeds a certain value or that µ is below a specific value. In such cases, a one-tailed approach would be appropriate.

Single tailed test: Consider again the problem of determining the mean porosity of a sandstone unit discussed above. Suppose we want to know the lowest value that we can be 95% confident µ exceeds. Restated: if we repeated this experiment an infinity of times, in 95% of the experiments the result would be consistent with µ exceeding this value. We look up the value for the t-statistic in column 0.05 for 10 degrees of freedom in Table A2. This value is +1.812. This means that when samples of size 11 are taken from a normal population, only 5% of the samples will have extreme values for m and s such that t will be greater than 1.812. Another way of saying this is that for 95% of all possible samples of size 11, t will be less than 1.812. We write

-\infty \le t \le 1.812

-\infty \le \frac{m - \mu}{s/\sqrt{N}} \le 1.812

which gives µ ≥ 7.7.

Decreasing the error limits: The error limits for an experiment can be reduced by increasing the sample size, N (see eqn 6-4). The error approaches zero as N approaches infinity. This is a great help in designing an experiment where you don’t want to over-sample, because taking samples is time-consuming and often expensive. You will somehow get an estimate of the population variance using a small trial experiment, existing data, or theory. If the estimate of the population variance is close, you will be able to accurately estimate the number of samples you need to estimate the population value with a particular accuracy. Of course, there is also the possibility that your estimate of the population variance is poor, in which case you may over- or under-sample by a considerable margin.

Introduction to Hypothesis Testing

Since it is so important to state statistical tests precisely, the formalism of “Hypothesis Testing” has been developed. In hypothesis testing, we state the hypothesis we are testing (the “null hypothesis”, or H0), and also the alternative hypothesis that is true if the original hypothesis is refuted (the “alternative hypothesis”, or Ha).

Using the Student’s t-test:
1. Take the sample.
2. Compute the mean, m, and sample variance, s².
3. Calculate s_m = s/√N.
4. Find t from the t tables for the desired confidence level and degrees of freedom (N-1).
5. Calculate the expected error limits (eq 6-4).
6. Interpret the results: 95% of values will be within the limits (if you used t for 95%).



For example, we might want to ask the question: “Can we state, at a 95% confidence level, that the mean of the population from which this sample was taken is 15?” In this case, we would pose the null hypothesis, Ho, that the mean of our population is 15. Our alternative hypothesis, Ha is that the mean of our population is not 15. We state this formally as

H_0(\mu_1 = 15) \quad \text{and} \quad H_a(\mu_1 \ne 15)

and we set our significance level at 95%. This means that we will reject H0 if there is less than a 5% probability of H0 being true. We calculate a t-statistic based on the expression for t given above. Since m is 9.4, s is 3.1, and our sample size is 11, our t-value is -6.0, as calculated below.

t = \frac{m - \mu}{s/\sqrt{N}} = \frac{9.4 - 15}{3.1/\sqrt{11}} = -6.0

In this case, we have 10 degrees of freedom. A two-tailed test is appropriate here since we do not care whether the porosity of our sandstone is greater than 15 or less than 15, only that it is significantly different from 15. Therefore, we look up the critical value for t in a t-table with 10 degrees of freedom at the 0.025 level. One way to remember the appropriate column of the t-table to look at is to take the desired significance level for rejection (here 0.05) and divide this number by '1' if we have a one-tailed test or '2' if we are using a two-tailed test. In this case, the critical t-value is 2.228. Now we have all the information we need to either accept or reject our null hypothesis at the 95% significance level. Figure 6.3 shows the critical and calculated t-values for this problem. Since our t-value of -6.0 lies far out in the curve's tail, we can say that there is less than a 5% chance that the mean porosity of the sandstone is 15. Therefore, we reject H0 at the 95% confidence level and conclude that the mean porosity of the sandstone is not 15.


Figure 6.3. t distribution showing 95% significance levels and the example t value of magnitude 6.0.

We could also ask a question about the sample size necessary for a given study. In the last chapter, when we asked this question, we knew σ². Usually, we only know s², and we must use the t-distribution. If we have no idea what the value for s is in a particular case, we might perform a pilot study to obtain this information before going ahead with the main study.

Discussion: In general, the critical question is whether the data are consistent with the null hypothesis at the required significance level. What is required is the sample distribution and confidence levels. An important issue in this process is the selection of the correct sample distribution. In this text we emphasize the gaussian distribution, where the Central Limit Theorem assures us that the distribution of sample means will be close to gaussian. Siegel (1956) suggests six steps that should be carried out in hypothesis testing. These are:

1. State the null and alternative hypotheses. The null hypothesis is usually called H0. The alternative hypothesis is called Ha here. In chapter 5, problem 6, where the problem was to decide whether dice were weighted or not, H0 might be stated as: “there is no difference between the probability of each die face showing” and Ha might be stated as: “there is a difference between the probabilities of die faces showing”.

2. Choose the statistical test. There are many statistical tests to choose from. Each test requires that the data conform to certain assumptions, for example that the population is normally distributed. Tests vary in their power to discriminate. That is, the limits set on the parameters to be tested may vary from very wide, when a test has little “power”, to narrow, when a test has high “power”. Tests with very few assumptions often have little power relative to tests which have many assumptions. It is important to make efficient use of the data. For example, if there is good reason to assume that the data are normally distributed, gaussian statistics should be used.


If the data are not normally distributed, less powerful tests will be required. The power of a test will be discussed more later.

3. Choose the size of the sample, N, and the size of a small quantity, α. The choice of α determines the “confidence limit” that you require for your data. α is the probability that the result falls outside of the confidence limits. An α of 0.05 would correspond to a confidence level of 95%. It is expected that as N increases, the result will be more accurate. This is reflected in the limits set by equation 5-7 for gaussian distributions. It is important to choose neither too small nor too large a value of N. Too large a value is a waste of effort, and too small a value means that the results will not be reliable.

4. Evaluate or determine the frequency distribution of the “test statistic”. The “test statistic” is the value that we are going to test. In the dice toss problem, the “test statistic” was the number of “3’s” (or the face you were testing). In that problem, the frequency distribution of the number of times a particular die face occurred was given by the binomial distribution. The test statistic might be the sample mean or sample variance. If the data are gaussian distributed, the correct distribution might be the “t”, “Chi-squared”, or “F” distributions (coming later). If there is not enough fundamental knowledge about the process that produces variations in the data, it may be necessary to “prove” that the data follow a particular frequency distribution. If measured values are put into an equation (e.g. to compute an age date), the distribution of the statistic may be influenced by the operations in that equation.

5. Define the “critical region”, or region of rejection of the null hypothesis. This is the region where the result of the experiment falls outside of the “confidence limits” that were set in step 3. For example, if the frequency distribution of the “test statistic” is normal and its standard deviation is 1, an experimental result greater than 1.96 or less than -1.96 would be within the region of rejection at the α = 0.05 level. So, if the “test statistic” is outside of the “critical region”, we can accept the null hypothesis at the specified confidence level, but if it is inside the “critical region”, we can reject the null hypothesis at the specified confidence level.

6. Make the decision. Suppose the “test statistic” is within the “critical region”. There are two possible conclusions: either H0 is false, or an unlikely event has occurred and H0 is actually true.

Example of t-test: Comparing Two Means

The t-distribution can also be used to test the probability that two samples were taken from identical populations. In this case we calculate a t-statistic according to the following expression

t = \frac{m_1 - m_2}{S\sqrt{\frac{1}{N_1} + \frac{1}{N_2}}} \qquad (6-5)

where m1 and m2 are the sample means and N1 and N2 are the sample sizes of the first and second samples, respectively, and where


S = \sqrt{\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}} \qquad (6-6)

and S is called the combined (pooled) standard deviation. Because our t-distribution was determined by sampling from a single population, one of the requirements for performing this t-test is that the variance estimates not be significantly different. This can be tested using the F distribution discussed in the next section. If we take all possible samples of size N1 and N2 from the same population and calculate a t-statistic as defined here for each pair of samples, the result will be a t-distribution with N1+N2-2 degrees of freedom. We illustrate how we use this t-test with an example. Consider the following problem concerning ore deposits. Data on the concentrations of ore on opposite sides of a fault suggest that the ores are significantly different. If so, then it is likely that the fault was formed before the ore was emplaced. If not, then it is likely that the fault formed after ore deposition. Based on the data provided below, is there a difference, at the 95% confidence level, between the concentration of ore on either side of the fault? In other words, is the probability less than 5% that the sample from north of the fault and the sample from south of the fault are from identical populations? We assume that the distribution of ore concentrations follows a normal distribution and that σ² from the sample north of the fault is not statistically different from σ² from the sample south of the fault. The data for this problem are given below in Table 6.1.

                   Mean ore concentration   Variance   Number of data in sample
North of fault     33 mg/kg                 13         10
South of fault     23 mg/kg                 12         15

Table 6.1. Data from the example discussed in the text.

Our null hypothesis, H0, is that the mean of the ore concentration north of the fault is not statistically different from the mean of the ore concentration south of the fault, assuming that the variances for the two populations are the same. We state this formally as

H_0(\mu_1 = \mu_2;\ \sigma_1^2 = \sigma_2^2)

where the subscript '1' refers here to the ore north of the fault and the subscript '2' refers to the ore south of the fault. If Ho is true then we can state that our two samples come from identical populations within the confidence level of the test. Our alternative hypothesis, Ha, is that the mean of the ore concentration north of the fault is statistically different from the mean of the ore concentration south of the fault. We state this formally as


H_a(\mu_1 \ne \mu_2)

Next we must state our desired significance level. Remember, with statistics we never prove anything. We can only state that our null hypothesis is supported at a stated level of confidence or significance. For this problem, we set the required significance level at 95%. Another way of stating this is that we will reject H0 only if, were H0 true and the experiment repeated a large number of times, a result as extreme as ours would occur in fewer than 5% of those experiments. Next, we calculate the t-statistic for the comparison of two means with equations 6-5 and 6-6.

S = \sqrt{\frac{(10 - 1)(13) + (15 - 1)(12)}{10 + 15 - 2}} = 3.5

t = \frac{33 - 23}{3.5\sqrt{\frac{1}{10} + \frac{1}{15}}} = 7.0

Here, our t-statistic is 7.0 and we have 23 degrees of freedom. A two-tailed test is appropriate here since we do not care whether the concentration of ore on one side of the fault is higher or lower than the other, only that one side is significantly different. Therefore, we look up the critical value for t for this problem in a t-table with 23 degrees of freedom at the 0.025 level. Here, the critical t-value is 2.069.

Figure 6.4. t distribution for 23 degrees of freedom, with 95% confidence limits marked.

Now we have all the information we need to either accept or reject our null hypothesis at the 95% significance level, as shown in Figure 6.4. Since our t-value of 7.0 lies in the curve's tail, we can say that there is less than a 5% chance that these two samples are from identical populations.


Therefore, we reject Ho at the 95% confidence level and conclude that we can be 95% confident that the fault was formed before the ore was emplaced.
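A Python sketch of equations 6-5 and 6-6 (our illustration; the function name pooled_t is invented) reproduces the ore-deposit result:

    import math

    def pooled_t(m1, var1, n1, m2, var2, n2):
        # Equations 6-6 and 6-5: pooled standard deviation, then the t statistic.
        S = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
        return (m1 - m2) / (S * math.sqrt(1.0 / n1 + 1.0 / n2))

    # Table 6.1: north of fault vs south of fault.
    print(pooled_t(33, 13, 10, 23, 12, 15))   # ~7.0, well beyond the critical 2.069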

Figure 5.4. Illustration of type I and type II errors. Curve A corresponds to the distribution specified in H0. Curve B is another distribution, which might also produce the experiment’s outcome.

Coming to the Wrong Conclusion

Just as it is possible for a gambler to win "against the odds" at the roulette table, or to toss 10 heads in a row, it is possible to come to the wrong conclusion in statistics. This is because even rare events have some probability of occurring.

Figure 5.4 shows the situation where there are two populations. Population A has a mean of 10 and population B has a mean of 20. The overlap between the two distributions allows for a finite probability of making an erroneous conclusion. Let’s state the null hypothesis, H0, that the population we are sampling is represented by curve A. The alternative hypothesis, Ha, is that the population we are sampling is represented by curve B. Suppose the population we are sampling is really population A. Suppose also that the sample lies in the tail of curve A. We reject the null hypothesis, which was correct, and conclude that the sample came from population B. What happened is that we got a rare event and came to the wrong conclusion. This is called a Type I error. At the 95% confidence level, there is a 5% chance of making this type of error. You must be careful about one-tail or two-tail significance levels here.

Suppose, however, that the correct answer is population B. The chance that a sample taken from population B lies within the "accept" range of our test is the area β shown in figure 5.4 (remember: H0 stated that the sample was from population A). When the sample lies within this region of the probability curve, we would accept H0, incorrectly concluding that the sample came from population A. This is called a Type II error. According to the curves shown in the figure, there is a pretty high probability that a sample from population B will be interpreted as a sample from population A. The smaller the β value, the smaller the probability of getting a type II error. The power of a test is defined as 1 - β.

How can we improve the situation in a real experiment? The best way is to reduce the standard deviation of our test statistic by increasing the number of data values in a sample. Increasing the confidence level (smaller α) only increases β. Reducing the confidence level (larger α) decreases β, and this is one of the ways to increase the power of a test, but it does compromise the overall significance of the test. Since the width of the distribution of the means decreases as 1/√N, sampling more data is one of the best ways to increase the power of a test.


                              Reality
Decision                 H0 true              H0 false
accept H0                1 - α                β (type II error)
accept Ha (reject H0)    α (type I error)     1 - β

Table 5.1. Matrix of possible interpretations of an experiment based on the true population value and the sampled outcome.

Example:

Figure 5.4 shows two distributions that might produce the same experimental result. Suppose we define H0 as "µ = 15" and Ha as "µ ≠ 15". If X is greater than 17 (in the region defined by "α"), we reject H0. The probability that a type I error occurred (we falsely rejected H0) is given by α (0.05 for a 95% confidence level). However, suppose that X is 15, well within the "accept" zone of H0. The probability that µ is really 20 but X falls within the "accept" zone of H0 is given by the area indicated by "β", which is the probability of a type II error. The type II error occurs when we erroneously accept H0.

It is simple to compute the probability β, given the value of α chosen for the test and σ of the population. Suppose that µA = 10 and µB = 20, and suppose also that the standard deviation of the sample mean, σ/√N, is 4. We choose α = 0.05, so that the upper "reject" region begins at 10 + 1.96 × 4 = 17.84. Since the mean of curve B is 20, the boundary lies 20 − 17.84 = 2.16 below it, which is 2.16/4 = 0.54 standard deviations from the mean of curve B. From Appendix A1, the area of the normal curve below Z = −0.54 is 0.295. Be sure to check this in Appendix A1 to be sure you understand where this number came from; use the graphic at the top of the figure to be sure what area the table produces. The result means that the probability β of a type II error in this case is 0.295: 29.5% of the time, under the given circumstances, if the distribution of curve B was the "real" distribution, one would falsely accept H0.

The F-distribution

The F-distribution describes the distribution of the ratio of the variances of two independent samples from the same normal population. If we take all possible samples of size N1 from one normal population and size N2 from a second normal population, where both populations have the same variance σ², and calculate an F-statistic defined as

$$F = \frac{s_1^2}{s_2^2}$$

for each pair of samples, we will have an F-distribution, where s₁² is the variance of the first sample and s₂² is the variance of the second sample. The sample with the largest variance is always put on top in the equation. The F-distribution has N1 − 1 and N2 − 1 degrees of freedom. An example of an F-distribution is shown below in Figure 6.6. The choice of which sample should be sample 1 and which should be sample 2 should be made so that F > 1, in order to use most tables of F-values. Note that there are no negative values for F.


[Figure: F-distribution curve; x-axis: F-value (4 degrees of freedom for both samples), y-axis: frequency of F]

Figure 6.6. The F distribution, which is the ratio of variances of two samples from a Gaussian distributed population, for 4 degrees of freedom in both samples.

A table of F-values is given as Table A3 in the Appendix. As with the standard normal and t distributions, we are interested in finding the value above which a certain proportion of the area under the distribution curve lies. It is necessary to specify the degrees of freedom for both the numerator and the denominator to use a table of F-values. F-values corresponding to 5% probabilities are given in Table A3. For example, where 5% of the area under the curve lies in the upper tail for 4 degrees of freedom in both the numerator and the denominator, the F-value is 6.39.

F-test

An F-test based on the F-distribution can be used to test the probability that two samples were taken from populations with equal variances. For example, consider the problem of ore concentrations across a fault that we discussed earlier. In performing a t-test, we assumed that the variances of the two populations represented by the two samples were not statistically different. Let us now test whether or not this was a good assumption. Our null hypothesis is that the two variances are equal and our alternative hypothesis is that they are not. We state

H₀: σ₁² = σ₂²  and  Hₐ: σ₁² ≠ σ₂²

and this time let us set our confidence level at 95%. We look up the F-value for this case, where N1 = 10 and N2 = 15, corresponding to a 5% area in the upper tail. This value is 2.65. The variance of the first sample is 13 and the variance of the second sample is 12, so F = 1.08. Our F-value is not in the tail of


the curve, so we accept our null hypothesis. We were justified in assuming that the two samples came from populations with the same variances at the 95% confidence level.
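A sketch of this F-test in Python, using the variances and sample sizes quoted above (scipy's f.ppf plays the role of Table A3):

from scipy import stats

s1_sq, n1 = 13.0, 10     # larger variance goes on top
s2_sq, n2 = 12.0, 15

F = s1_sq / s2_sq                              # 1.08
F_crit = stats.f.ppf(0.95, n1 - 1, n2 - 1)     # ~2.65 for (9, 14) dof
print(f"F = {F:.2f}, critical F = {F_crit:.2f}")
print("accept H0" if F < F_crit else "reject H0")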

DO THIS NOW! Practice reading the F tables in Appendix A3. Verify that you can read the table to get the F-value for the following situations:

1. N1 = 5, N2 = 8. Find the F-value that is in the upper 5% of the range. Assume sample 1 has the smallest variance. (ans: F = 6.09)
2. N1 = 10, N2 = 20. Find the F-value that is in the upper 5% of the range. Assume sample 2 has the smallest variance. (ans: F = 2.42)
3. N1 = 12, N2 = 11, s₁² = 4.5, s₂² = 1.2. Are the variances from the same population, to a 95% significance? (ans: No, F = 3.75, the F limit = 2.85)
4. N1 = 4, N2 = 6, s₁² = 4.6, s₂² = 2.0. Are the variances from the same population, to a 95% confidence level? (ans: Yes, F = 2.3, F limit = 5.41)

χ²-distribution

If Gaussian distributed variables are squared, they follow the χ²-distribution. For example, if Y is a single Gaussian distributed variable, then

$$\chi^2 = \frac{(Y - \mu)^2}{\sigma^2}$$

follows a χ² distribution with 1 degree of freedom. If we form a sum of N terms,

$$\chi^2 = \sum_{i=1}^{N} \frac{(Y_i - \mu)^2}{\sigma^2}$$

we have a χ² distribution with N degrees of freedom. Figure 6.7 shows a plot of the χ²-distribution for 4 degrees of freedom. As the number of degrees of freedom becomes large, the χ²-distribution approaches a normal distribution. As with the other distributions we have discussed so far, the total area under the curve is one. Note that the value of χ² is always positive.


[Figure: χ² distribution curve; x-axis: chi-squared value (4 degrees of freedom), y-axis: frequency of chi-squared]

Figure 6.7. Chi-squared distribution for 4 degrees of freedom. Random variables with a Gaussian distribution become χ² distributed when they are squared.

The mean of a χ² distributed variable with N degrees of freedom is E[χ²] = N.

The variance of a χ² distributed variable with N degrees of freedom is var[χ²] = 2N.

Table A4 in the Appendix gives the values of χ² which define the upper tail of the curve for various degrees of freedom. Critical χ²-values are given corresponding to various areas under the curve in the upper tail.

χ²-tests

A very useful application of the χ²-test is in testing whether a sample came from a Gaussian distribution. To do this, we form a "statistic" which is related to the difference between the "expected" and "observed" number of data values within each class. The χ²-statistic for this situation is:

$$\chi^2 = \sum_{i=1}^{c} \frac{(O_i - E_i)^2}{E_i}$$

where O_i is the observed frequency in the i'th class of the distribution, E_i is the expected frequency in the i'th class according to some probability distribution, and c is the number of classes. The number of degrees of freedom is c − k − 1, where k is the number of estimated parameters (k = 2 if m and s² are used as estimates for µ and σ). So, if an analysis used 6 histogram bars, and µ and σ were both estimated from the data, the number of degrees of freedom would be 6 − 2 − 1 = 3.
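A short sketch of the computation, with hypothetical observed counts in six classes and hypothetical expected counts from a fitted Gaussian (the numbers here are illustrative only, not from the text):

import numpy as np
from scipy import stats

observed = np.array([3, 8, 14, 12, 9, 4])                # hypothetical class counts
expected = np.array([3.5, 8.5, 13.0, 13.0, 8.5, 3.5])    # from a fitted Gaussian

chi2_stat = np.sum((observed - expected) ** 2 / expected)
dof = len(observed) - 2 - 1        # c - k - 1, with k = 2 fitted parameters
chi2_crit = stats.chi2.ppf(0.95, dof)
print(f"chi2 = {chi2_stat:.2f}, critical value = {chi2_crit:.2f} for {dof} dof")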


The χ²-distribution is important because it can be used in many parametric and non-parametric tests.

Concept Review

It is important to understand the similarities in how the various distributions are used to test a hypothesis. All of the distributions discussed in this chapter are derived from Gaussian distributed data. When the data are transformed in specific ways (e.g. we may be interested in a squared parameter: chi-squared, or a ratio of variances: F test), a certain distribution results. This is the distribution of a Gaussian distributed variable that has been squared or ratio'd, or had some other operation performed on it. For example, the normal distribution results if we transform the Gaussian distributed data according to:

$$Z = \frac{x_i - \mu}{\sigma}$$

The t distribution results if we transform the Gaussian distributed data according to:

$$t = \frac{x_i - m}{s}$$

The t test is used for putting confidence limits on the distribution of sample means. It is important that the sample means follow a normal distribution; use the χ² test to check this. The χ² distribution results if we square Gaussian distributed variables. Use the χ² test to test the confidence with which a distribution is normal (p 6-16). The F distribution results if we compute the sample variances and divide the largest variance by the smallest variance. It is used to test the confidence with which the sample variances of two samples are from populations with equal population variances (p 6-14).

$$F = \frac{s_1^2}{s_2^2}$$

Of course, you should remember that the distribution comes from visualizing the repeating of the experiment many times and plotting the histogram that is the average of all of the histograms, in the limit where the class interval gets very small. Reading each of the tables is similar: you figure out the degrees of freedom and the confidence limits, read the value, then see if the computed sample "statistic" is within the "accept" or "reject" range.

Encouragement


While statistical thinking represents a radical departure from the way you normally think, it is really not so hard if you concentrate on a few facts. When making statistical inferences, it is helpful to remember the sampling paradigm discussed earlier. There exists a “population” of values and you have taken a sample from that population. The “test statistic” follows some kind of distribution, based on the population statistics (for Gaussian population distributions the mean and variance are enough). Once you know that distribution, the confidence limits follow immediately by considering the area underneath the distribution curve. After that, it is a simple matter to test whether your sample value falls within those limits.

Problem 2. Suppose you are performing an experiment to determine whether a sample of seawater is derived from deep bottom water or from surface water. Deep bottom water (DBW) has a mean concentration of constituent A of 100 parts per million and surface water (SW) has a mean concentration of constituent A of 120 ppm. Assume also that you have found out by independent means that the standard deviation of the population of both DBW and SW is 20 ppm. You take a sample consisting of N analyses of the water. The problem is to choose whether the sample of water is from DBW or SW. Make an analysis of this problem using the principles discussed above.

a) Discuss and perform the six steps required of hypothesis testing.
b) Analyze the problem in terms of type I and type II errors. Plot α and β vs N and determine the optimum number of samples for a 95% confidence that you can discriminate between DBW and SW (a starting sketch follows below).
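As a possible starting point for part (b), the sketch below tabulates β against N for a one-tailed test with DBW as H0; all numbers follow the problem statement.

import numpy as np
from scipy import stats

mu0, mu1, sigma = 100.0, 120.0, 20.0   # DBW mean, SW mean, population std
alpha = 0.05

for N in range(1, 13):
    se = sigma / np.sqrt(N)                        # std of the sample mean
    cutoff = mu0 + stats.norm.ppf(1 - alpha) * se  # one-tailed reject boundary
    beta = stats.norm.cdf(cutoff, loc=mu1, scale=se)
    print(f"N = {N:2d}  beta = {beta:.3f}")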


Review

After reading this chapter, you should:
• Know what a t-distribution is and how it is generated.
• Know what an F-distribution is and how it is generated.
• Know what a χ²-distribution is and how it is generated.
• Be able to read tables of t-values, F-values and χ²-values and understand the relationship between these values and the t, F and χ²-distribution curves.
• Be able to perform a t-test to:
  - estimate the population mean from a sample;
  - determine whether or not the mean of a population is different from (or higher or lower than) a specified value; and
  - compare two samples to test if they are from identical populations to a certain level of confidence.
• Be able to perform an F-test to determine whether two samples come from populations with equal variances.
• Be thoroughly familiar with the method of hypothesis testing.
• Understand Type I and Type II errors, and the "power" of a test, and be able to calculate the probability of each.


Exercises

State the null hypothesis and alternative hypothesis for all problems where you are asked to perform a statistical test involving hypothesis testing.

1. For 9 degrees of freedom,
   a. what is the t-value above which 5% of the area beneath the t-distribution curve lies?
   b. what is the t-value above which 2.5% of the area beneath the t-distribution curve lies?
   c. between what two t-values do 95% of the t-values lie?

2. A t-value of 1.8 is calculated for 20 degrees of freedom.
   a. Is this value in the 5% of the area beneath the t-distribution curve in both tails?
   b. Is this value in the 5% of the area beneath the t-distribution curve in the upper tail?

3a. For the purpose of using a t-test to estimate the population mean from a sample, how many degrees of freedom are there?

3b. For the purpose of using a t-test to compare two sample means, how many degrees of freedom are there?

4. List the basic steps in the hypothesis testing procedure.

5. What does it mean to say "I am 95% confident that the population mean lies between 140 and 150"?

6. What is meant by the phrase 'the power of the test'?

7. In using a t-distribution to estimate the population mean from a sample, does the size of the range of values specified for the population mean increase or decrease with
   a. greater required precision in the estimate?
   b. increased sample size?
   c. increased variability in the population?

8. To determine the concentration of chemical X in a given liquid, 12 measurements were made. The error in the measurements is normally distributed. Given the data below, what two values can we be 95% confident the true concentration lies between?

   15 17 15 25 21 14 23 25 19 23 18 26


9. The recommended safe limit for chemical Y in drinking water is 10 mg/l. Water samples are taken once a month to monitor this chemical. The data for the first 6 months of testing are given below. Can we be 95% confident that the concentration of Y is less than 10 mg/l?

   11 8 8 9 9 9

10. After reviewing some measurements made in the lab, the lab supervisor notices a seemingly systematic bias in the data. The supervisor suspects that the two lab assistants who made the measurements are using a slightly different measurement technique and that this is the root of the problem. One day, both assistants are given the same materials to measure. Based on the following data, can we be 95% confident that the technique of the two assistants is different?

   Assistant A   Assistant B
   52            57
   58            59
   57            65
   70            68
   65            60

11. For 10 degrees of freedom in the numerator and 10 degrees of freedom in the denominator, what is the F-value above which 5% of the area beneath the F-distribution curve lies?

12. The variances of errors in measurements made by two different labs are given below. Are these differences in variances statistically significant at the 95% significance level?

           sample size   standard deviation
   Lab A   11            66
   Lab B   21            40

13. This very important problem demonstrates the use of the chi-squared distribution to test whether a sample could have come from a Gaussian distributed population. 20 random data points are taken; m = 2.995, s = 1.049. The data were plotted on a histogram consisting of 10 equal classes beginning at a value of 0 and ending at a value of 6. The number of data within the classes is: 0, 1, 1, 4, 4, 5, 2, 2, 1, 0.

   a) Assuming that the data are sampled from a Gaussian distribution, compute the expected number of data in each class. Approximate µ and σ with m and s.
   b) Compute the χ² statistic for these data.


   c) Within a 95% confidence level, could you reject the null hypothesis that the data are sampled from a Gaussian distribution with µ = 2.995 and σ = 1.049?


Chapter 7

Propagation of Errors and Noise

In some cases, the data values are sampled directly in the form that is needed. An example is the length between two markers: the length is measured directly. The most common case arises when the measurements are put into a formula to produce another quantity. In the case of surveying, distances between elements of an array of monuments across an earthquake fault might be used to compute the surface strain. The amount of radioactive products in a rock may be measured and put into a formula to compute its age. The volume and weight of a rock may be put into a formula to determine its density, or the amplitude of a seismic wave will be put into a formula to determine the magnitude of an earthquake. The distribution of the data errors and the kind of formula will affect the interpretation of the answer. This chapter will show you how to determine the accuracy of the answer and identify some pitfalls in interpreting results from noisy data.

Errors When Data Values are Added or Subtracted

A common situation occurs when data values are added. An example of this is the measurement of the distance between two widely separated points. Suppose that the distance is sufficiently great and the topography sufficiently rough that you must make a series of end-to-end length measurements. Each length measurement is subject to a certain error, which we will assume to be Gaussian distributed with a zero mean and standard deviation, σ. Assume that N length measurements are required. Assuming that all of the individual length measurements are corrected exactly to horizontal distance, the total length plus an error is:

$$L + \varepsilon = \sum_{i=1}^{N} (l_i + \Delta_i)$$

The total length, L, is the sum of all the individual lengths, l_i. The error in the i'th length is given by Δ_i. This results in an overall length error of ε. The individual errors would be expected to both add and cancel randomly, so it would be incorrect to simply add the errors. Since the total length will be a random variable, we compute its "expected" value. Since the mean value of the individual errors is zero, we have:

$$E[L + \varepsilon] = E\left[\sum_{i=1}^{N}(l_i + \Delta_i)\right] = \sum_{i=1}^{N} E[l_i + \Delta_i]$$

$$E[L] + 0 = \sum_{i=1}^{N}\left(E[l_i] + E[\Delta_i]\right) = \sum_{i=1}^{N}(l_i + 0) = L$$

So, the expectation value of L is just L, which is equal to the sum of all of the individual distances, which agrees with our intuition. This only tells us that for repeated experiments, the errors average to zero.


But, for an individual experiment, we need the standard deviation of the error. We get this by computing the expectation value of the variance using equation 7-9. We have:

$$\sigma_L^2 = E[(L_e - L)^2] = E\left[\left(\sum_i (l_i + \Delta_i) - \sum_i l_i\right)^2\right] = E\left[\left(\sum_i \Delta_i\right)^2\right]$$

L is the error free length and Le is the length measured as a result of a single experiment. Notice that the term on the right is the square of a sum of terms. Multiplying out some of the terms, this will look like:

$$\sigma_L^2 = E\left[\left(\sum_{i=1}^{N} \Delta_i\right)\left(\sum_{k=1}^{N} \Delta_k\right)\right] = \sum_{i=1}^{N}\sum_{k=1}^{N} E[\Delta_i \Delta_k]$$

There are terms that are sums over Δ_iΔ_k. If N = 2 we can multiply the terms by hand, resulting in

$$E[(\Delta_1 + \Delta_2)(\Delta_1 + \Delta_2)] = E[\Delta_1^2 + 2\Delta_1\Delta_2 + \Delta_2^2]$$

The expectation of all Δ₁Δ₂ terms will be zero, since Δ₁ and Δ₂ are independent and will average to zero. The Δ₁² and Δ₂² terms will not, since they are squared and will never have negative numbers to cancel with the positive ones. So, we have:

$$\sigma_L^2 = \sum_i E[\Delta_i^2]$$

In the general case, when i = k cancellation will not occur and E[Δ_iΔ_i] ≠ 0, but when i ≠ k, E[Δ_iΔ_k] = 0. The expectation E[Δ_i²] is the variance of the Δ population (the population of errors of each of the individual length measurements), which we will call σ_l². The final answer is:

$$\sigma_L^2 = E\left[\left(\sum_i (l_i + \Delta_i) - L\right)^2\right] = \sum_i E[\Delta_i^2] = N\sigma_l^2$$

So, the variance of the total length is given by the sum of the variances of each of the individual length measurements. If the variances of each of the terms in the sum are different, the individual variances are summed to get the final answer as shown in equation 7-1 below.

$$\sigma_L^2 = \sum_{i=1}^{N} \sigma_i^2$$   (7-1)

Interestingly, the above formula also applies to the case when measurements are subtracted. This is because the minus sign is eliminated by the variance computation, which squares the error.
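Equation 7-1 is easy to verify by simulation. A minimal sketch with hypothetical numbers (10 segments, each with a standard error of 0.02 length units):

import numpy as np

rng = np.random.default_rng(0)
N, sigma, trials = 10, 0.02, 100_000

# Each row is one survey: N segment errors that are summed into a total error
total_error = rng.normal(0.0, sigma, size=(trials, N)).sum(axis=1)

print(f"simulated std of total   = {total_error.std():.4f}")
print(f"predicted sqrt(N)*sigma  = {np.sqrt(N) * sigma:.4f}")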


Problem 1: Prove equation 7-1 for the case when 3 lengths are added to get the total length. Let each of the individual length measurements have random errors with standard deviation of σ_e.

Problem 2: Suppose that measurement A has a Gaussian distributed error with variance σ_a² and measurement B has a Gaussian distributed error with variance σ_b². Prove that the variance of the difference, A − B, is given by σ_a² + σ_b².

Errors When Data are Multiplied or Divided

Data values with random errors are often multiplied or divided. Suppose the density is being measured by computing the volume and mass of an object. Then, the density is given by:

$$\rho = \frac{M}{V}$$

If the mass and volume each have errors, how will these combine to produce the error for the density? To approximate the effect on ρ of a small change in M and V, we use the chain rule of differentiation, which says:

$$df(x, y) = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy$$

The above formula gives us a relationship that can be used to compute a small change in the function f(x, y) caused by small changes in either x or y. It only applies exactly to infinitesimally small changes in x and y. Here we don't need an exact result, so we can extend it to larger changes (we say the result is accurate "to first order"). We indicate that the changes are finite by using the notation δx and δy instead of dx and dy. So, the chain rule takes the form:

$$\delta f(x, y) = \frac{\partial f}{\partial x}\delta x + \frac{\partial f}{\partial y}\delta y + \text{small error}$$

This equation is the first order term of the Taylor's expansion for a function of two variables. The small error will become important when "bias" is treated. For the density formula, the change in density due to a small change in mass and volume is given by:

$$\delta\rho(M, V) \cong \frac{\partial \rho}{\partial M}\delta M + \frac{\partial \rho}{\partial V}\delta V$$


and since:

$$\frac{\partial \rho}{\partial M} = \frac{1}{V}; \qquad \frac{\partial \rho}{\partial V} = -\frac{M}{V^2}$$

Then:

$$\delta\rho = \frac{1}{V}\delta M - \frac{M}{V^2}\delta V$$

Expressing the above equation as the fraction of the total density (note that we are dropping the ≅ symbol, so we must remember that the equations are only accurate to first order):

$$\frac{\delta\rho}{\rho} = \frac{\delta M}{M} - \frac{\delta V}{V}$$

We can compute the variance of the fractional density changes as:

$$E\left[\left(\frac{\delta\rho}{\rho}\right)^2\right] = \left(\frac{\sigma_\rho}{\rho}\right)^2 = E\left[\left(\frac{\delta M}{M} - \frac{\delta V}{V}\right)^2\right] = \left(\frac{\sigma_M}{M}\right)^2 + \left(\frac{\sigma_V}{V}\right)^2$$

Note that once the chain rule is used, the results follow those derived for sums and differences of random variables. If we define c as the ratio of the standard deviation of a parameter to the value of the parameter, then according to the above equation we have:

$$c_\rho^2 = c_M^2 + c_V^2 \quad \text{where} \quad c_\rho^2 = \frac{\sigma_\rho^2}{\rho^2}; \; c_V^2 = \frac{\sigma_V^2}{V^2}; \; c_M^2 = \frac{\sigma_M^2}{M^2}$$

We can then write a general law of propagation of errors, which states that if:

$$f(x, y, z, \ldots, p, q, r, \ldots) = \frac{x \cdot y \cdot z \cdots}{p \cdot q \cdot r \cdots}$$

then the total error, expressed in terms of the fractional variation defined above, is

$$c_f^2 = c_x^2 + c_y^2 + c_z^2 + \cdots + c_p^2 + c_q^2 + c_r^2 + \cdots$$   (7-2)

So, equation 7-1 expresses the total variance of the result when data are summed, and equation 7-2 above expresses the total variance of the result when data are multiplied and divided.


Induced Bias

Mathematical operations on noisy data can affect the result in unexpected ways. A simple case occurs when noisy data values are squared. The randomness which previously averaged to zero, because of cancellation of positive and negative values, will no longer average to zero, because all of the squared numbers have a positive sign. There will be a non-zero average, or bias, added by this effect. For example, suppose data follow the form of equation 7-7, where Y = y + aY_noise. Y_noise is a Gaussian distributed random quantity with mean 0 and standard deviation σ_noise. Suppose Y is squared. We have:

$$Y^2 = (y + aY_{noise})^2 = y^2 + a^2 Y_{noise}^2 + 2ya Y_{noise}$$

Now taking the expectation of each side of the above equation, we have:

$$E[Y^2] = E[y^2] + E[a^2 Y_{noise}^2] + E[2ya Y_{noise}] = y^2 + a^2(\sigma_{noise}^2 + \mu_{noise}^2) + 0$$   (7-3)

So, when Y is squared, its average value (which is the expectation) is “biased” by the variance of the noise. If the mean of the noise is zero, as it has been defined here, then µnoise = 0. So, if Gaussian distributed data will be used in a formula which squares the values, it is much better to find the average of the values in the sample prior to squaring each value, as opposed to squaring each sample, then taking the average.
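A short simulation makes the bias of equation 7-3 visible; the values of y and a below are hypothetical, and σ_noise = 1:

import numpy as np

rng = np.random.default_rng(2)
y, a = 5.0, 2.0                       # exact value and noise amplitude

Y = y + a * rng.normal(0.0, 1.0, 100_000)
print(f"mean of Y^2       = {np.mean(Y**2):.2f}")
print(f"y^2 + a^2*sigma^2 = {y**2 + a**2:.2f}")   # sigma_noise = 1 here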

[Figure: plot of Y and Y + Bias vs x; y-axis 0-120, x-axis 0-10]

Figure 7.1. Plot of the result of squaring noisy data. The dotted line shows how the expected value of Y without noise is increased by the bias, aσ, where σ is the standard deviation of the noise. In this case, aσ = 4. This would lead the experimenter to estimate too high a value for the quantity represented by Y.

It is very common to put noisy data values into a formula, so it is important to understand the effect that the formula will have on the answer. Will the noise bias the answer? Is the distribution of the answer Gaussian if the data are Gaussian? It is important to answer these questions if we are to apply statistical tests based on the assumption that errors are Gaussian distributed. Are the statistical tests applied to the data first, or should they be applied to the answer? This section will give guidance on this question and follow with an example in age dating.

Assume that the data, x, will be entered into a general formula, given by:

Y = f(x)   (7-4)

Y is the value computed from the data. Generally, x will also have a variation due to noise. The experimenter would hope that this variation would be small relative to the data value (high signal to noise ratio). This variation can be expressed as:

Y = f(x + Δ)   (7-5)

We are interested in the case where Δ/x is small (relatively high signal to noise), so we use a Taylor's expansion of f(x), which is given by:

$$f(x + \Delta) = f(x) + \Delta\frac{\partial f(x)}{\partial x} + \frac{\Delta^2}{2!}\frac{\partial^2 f(x)}{\partial x^2} + \cdots + \frac{\Delta^{n-1}}{(n-1)!}\frac{\partial^{n-1} f(x)}{\partial x^{n-1}} + \text{error}$$   (7-6)

The Taylor's series expansion for several functional forms is given below. The expansion is carried only to second order. This is good enough to show the effect of bias. If the bias in a result is large, one should also look at the higher order terms or take a different approach to the noise analysis. If the function has an exponential dependence,

$$Y = f(x) = Ae^{nx} + b$$

$$\frac{\partial f(x)}{\partial x} = Ane^{nx}; \qquad \frac{\partial^2 f(x)}{\partial x^2} = An^2 e^{nx}$$

So

$$Y = f(x) + \Delta(Ane^{nx}) + \frac{\Delta^2}{2}(An^2 e^{nx}) + \cdots$$   (7-7)

Here, x is the value of the data and Δ is the random variation or noise in the data. The Δ(Ane^{nx}) term is the first order randomness in Y (the result of the calculation), which is caused by randomness in x (the data). The last term is also random and causes the bias in Y, since it will not average to zero. To get the expected bias in Y, we take the expectation value of Y:

$$E[Y] = E[f(x)] + E[\Delta Ane^{nx}] + E\left[\frac{\Delta^2}{2}An^2 e^{nx}\right] = f(x) + Ane^{nx}E[\Delta] + \frac{An^2 e^{nx}}{2}E[\Delta^2]$$


The following derivations all assume that the random variable that is being input to the equation is gaussian distributed.

Now, E[Δ] = 0, since the average of the noise is taken to be zero, and E[Δ²] = σ_noise². Remember that we are evaluating the noise effect at a particular value of x, so f(x) is unvarying in the above derivation, and E[f(x)] = f(x). So, the result is:

$$E[Y] = f(x) + \frac{An^2 e^{nx}}{2}\sigma_{noise}^2$$   (7-8)

The second term is the bias effect, which gets larger as the square of n. The ratio of the bias to the actual value is given by:

$$R = \frac{\text{bias}}{f(x)} = \frac{An^2 e^{nx}\sigma_{noise}^2 / 2}{Ae^{nx}} = \frac{n^2\sigma_{noise}^2}{2}$$   (7-9)

So, the bias (relative to the signal) gets larger as n and σ increase.

Practice: Using the above techniques, prove that the expansion to second order and the bias ratio, R, are correct for the following useful functional forms:

1. Linear:

$$f(x) = mx + b; \quad Y = f(x + \Delta) = mx + b + m\Delta; \quad E[Y] = mx + b + 0; \quad R = 0$$   (7-10)

2. Variable in the denominator:

$$f(x) = \frac{A}{x}; \quad Y = f(x + \Delta) = \frac{A}{x} - \Delta\frac{A}{x^2} + \Delta^2\frac{A}{x^3} + \cdots$$

$$E[Y] \approx \frac{A}{x} + \sigma^2\frac{A}{x^3}; \quad R \approx \frac{\sigma^2}{x^2}$$   (7-11)

3. Power law form (we assume n > 1):

$$f(x) = Ax^n; \quad Y = f(x + \Delta) = Ax^n + \Delta Anx^{n-1} + \frac{\Delta^2}{2}An(n-1)x^{n-2} + \cdots$$

$$E[Y] \approx Ax^n + \frac{\sigma^2}{2}An(n-1)x^{n-2}; \quad R \approx \frac{\sigma^2}{2x^2}n(n-1)$$   (7-12)

4. Logarithmic:

$$f(x) = A\ln(x); \quad Y = f(x + \Delta) = A\ln(x) + \Delta\frac{A}{x} - \Delta^2\frac{A}{2x^2} + \cdots$$

$$E[Y] = A\ln(x) + 0 - \frac{\sigma^2 A}{2x^2}; \quad R = -\frac{\sigma^2}{2x^2\ln(x)}$$   (7-13)

5. Exponential:

$$f(x) = Ae^{bx} + c; \quad Y = f(x + \Delta) = Ae^{b(x+\Delta)} = Ae^{bx}\left(1 + b\Delta + \frac{(b\Delta)^2}{2!} + \cdots\right)$$

$$E[Y] \approx Ae^{bx}\left(1 + \frac{b^2\sigma^2}{2}\right); \quad R \approx \frac{b^2\sigma^2}{2}$$   (7-14)

Problem 3: Write and implement a button script that shows that equation 7-13 is true by repeatedly adding random values to x and computing the running average of the value of f(x), as was done in Chapter 5 for coin tossing. Show that the value computed from the equation for R agrees with the value found from the simulation (a Python sketch of the same idea follows below).
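For readers without the HyperCard stack, here is one possible version of the simulation, applied to the logarithmic form of equation 7-13 with hypothetical values of A, x, and σ:

import numpy as np

rng = np.random.default_rng(3)
A, x, sigma = 1.0, 5.0, 0.5

samples = A * np.log(x + rng.normal(0.0, sigma, 1_000_000))
bias_sim = samples.mean() - A * np.log(x)
bias_pred = -sigma**2 * A / (2 * x**2)    # second-order term of equation 7-13
print(f"simulated bias = {bias_sim:.5f}, predicted = {bias_pred:.5f}")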

Case Study - Errors in Age Dating Using U-Pb Analyses


Age dating based on radioactive decay relies on the fact that radioactive elements decay at a known rate. In general, we can represent the concentration of the radiogenic element by:

$$A = A_0 e^{-\lambda t}$$   (7-14)

where A₀ is the original concentration of the "parent" element at time t = 0 and λ is the decay constant. The time at which A is equal to half of the original concentration is called the half-life, and is defined by:

$$\frac{A_0}{2} = A_0 e^{-\lambda t_{1/2}} \quad \text{or} \quad T_{1/2} = \frac{1}{\lambda}\log_e 2 \approx \frac{0.693}{\lambda}$$

If the "parent" element decays to the "daughter" element, then after a time t the concentration of "daughter" atoms will be:

$$D = A_0 - A = A_0 - A_0 e^{-\lambda t} = A_0(1 - e^{-\lambda t})$$   (7-15)

If we take the ratio D/A and solve for t, the result is:

$$t = \frac{1}{\lambda}\log_e\left(1 + \frac{D}{A}\right)$$

So, if it is known that the "daughter" atoms were the result only of radioactive decay of the "parent" atom, the age can be computed. But, it is often the case that there is an initial concentration of the "daughter" element. When more than one age dating method is used, the results (if they agree) are said to be "concordant". For this case study, we treat the 207Pb/206Pb isotope system. 238U decays to 206Pb and 235U decays to 207Pb. The decay equations (from equation 7-15) are:

$$[^{206}Pb]_{now} = [^{238}U]_{now}(e^{\lambda_{238}t} - 1) \quad \text{and} \quad [^{207}Pb]_{now} = [^{235}U]_{now}(e^{\lambda_{235}t} - 1)$$

Dividing the two equations, we obtain:

$$\frac{[^{207}Pb]_{now}}{[^{206}Pb]_{now}} = \frac{[^{235}U]_{now}(e^{\lambda_{235}t} - 1)}{[^{238}U]_{now}(e^{\lambda_{238}t} - 1)} = \frac{e^{\lambda_{235}t} - 1}{137.88\,(e^{\lambda_{238}t} - 1)}$$   (7-16)

[207Pb]/[206Pb] is the measured present day lead isotope ratio, and the present day [235U]/[238U] ratio is 1/137.88 and is assumed to be a constant which does not depend on the age and history of the sample. So, it is possible to compute an age from a single analysis. The best mineral for use of this system is zircon, because it retains uranium and its decay products, crystallizes with almost no lead, and is widely distributed. Equation 7-16 cannot be solved explicitly for age (t). The "Simulations" stack included with this book provides a button whose script solves this equation numerically (a sketch of the same computation follows below). An important question to be asked is: how sensitive is the age determination to errors in the various constants that are in the equation? Currently, the best available measurement accuracy in the 207Pb/206Pb ratios contributes only about one-fifth of the age uncertainty contributed by uncertainties in the decay constants. The decay constants of the uranium-to-lead systems are 9.85 × 10⁻¹⁰ yr⁻¹ (±0.10%) for 235U and 1.55 × 10⁻¹⁰ yr⁻¹ (±0.08%) for 238U, and have been defined by international convention. The 235U/238U ratio is also uncertain, to about 0.15%. The measurement of the 207Pb/206Pb ratios requires complex instrumentation and precise analytical techniques. This ratio can be measured to accuracies as great as 0.1% to 0.03%. Another important source of error is the correction for common lead, which is lead that is present in the sample from sources other than decay of the "parent" isotopes of uranium. The sources of common lead are original lead in the sample as it crystallized, lead introduced by exchange with external sources, and lead added during handling prior to analysis. A complete analysis of common lead errors is beyond the scope of this text.
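For readers without the "Simulations" stack, the numerical inversion of equation 7-16 might look like the following sketch, which uses a standard root finder and the decay constants quoted above (the bracketing interval is an assumption):

import numpy as np
from scipy.optimize import brentq

lam235 = 9.85e-10        # /yr, decay constant of 235U
lam238 = 1.55e-10        # /yr, decay constant of 238U

def pb_ratio(t):
    """Present-day 207Pb/206Pb ratio predicted by equation 7-16 for age t (years)."""
    return np.expm1(lam235 * t) / (137.88 * np.expm1(lam238 * t))

def age_from_ratio(r):
    """Invert equation 7-16 for t by root finding over an assumed age bracket."""
    return brentq(lambda t: pb_ratio(t) - r, 1e3, 5e9)

r = pb_ratio(500e6)                  # ratio for a 500 Ma zircon
print(f"recovered age = {age_from_ratio(r) / 1e6:.1f} Ma")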

Problem 4: Determine the error of the age determination of a zircon when the 207Pb/206Pb ratio changes by 0.1%, for ages of 100 Ma, 200 Ma, 300 Ma, and 500 Ma (1 Ma = 1 × 10⁶ years; at 100 Ma and a ratio error of 1%, the age error is 23.7%). Draw a graph of the age error vs the age of the sample.

Problem 5: Make graphs of the error in the age determination vs the age of the sample caused by the errors in each of the two decay constants.

Problem 6: Determine the error of the age determination due to the error in the 235U/238U ratio.


Problem 7: Study the problem of bias in the result due to random errors in the measured 207Pb/206Pb ratio. Because the age equation cannot be solved analytically, this simulation will need to be implemented using the computer. Repeatedly compute the age, each time adding a random error {σ exRandom("g", -1)} to the 235U/238U ratio. Keep a running average of the age determination and put the current average value into a card field, as was done in Chapter 5. Bias may show up as a higher age, on the average, than the actual age. See if you can think of any way to determine bias without doing the repeated sampling simulation.

The Distribution of the Errors

If y + e = f(x + δx), and δx is a random error in x, then e will be a random error in y. In general, if δx is Gaussian distributed, e will not be Gaussian distributed. This will affect the validity of the statistical test that is applied to determine our confidence limits on y. Equation 7-1 shows that when data values are added, the expectation value of the variance of the answer is just the sum of the expectation values of the individual variances of the data points. So, if N data values, each with standard deviation σ, are added together, the variance of the sum is given by σ_sum² = Nσ_individual². The mean of the answer is just the mean of the sum of the individual values. Since the answer is the result of linear, additive operations, the distribution of the errors remains Gaussian, with µ = 0 and standard deviation as given above. This result holds, to first order, when data are divided or multiplied; equation 7-2 gives the standard deviation of the answer in that case. The answer remains Gaussian distributed because we only took the first term in the expansion for M/V. When the errors are so large that the second order term is required, bias results and the distribution is no longer Gaussian.

Example: Determination of Density

Let's look at the expansion of the equation ρ = M/V to higher orders. The density, ρ, has a small change, δρ, caused by small changes in M and V. We can write this as:

$$\rho + \delta\rho = \frac{M + \delta M}{V + \delta V}$$

Rewrite this as:

$$\rho + \delta\rho = \frac{M + \delta M}{V}\cdot\frac{1}{1 + \frac{\delta V}{V}}$$

We can make the following simple expansion:

$$\frac{1}{1 + \frac{\delta V}{V}} = 1 - \frac{\delta V}{V} + \left(\frac{\delta V}{V}\right)^2 - \left(\frac{\delta V}{V}\right)^3 + \left(\frac{\delta V}{V}\right)^4 - \cdots$$


After simplifying by subtracting out the ρ on the left and the M/V on the right, we can rearrange the density equation as:

$$\delta\rho = \frac{M + \delta M}{V}\left(1 - \frac{\delta V}{V} + \left(\frac{\delta V}{V}\right)^2 - \left(\frac{\delta V}{V}\right)^3 + \cdots\right) - \frac{M}{V}$$

Multiply out so we can better see the small terms multiplied together:

$$\delta\rho = \frac{1}{V}\delta M - \frac{M}{V}\frac{\delta V}{V} - \frac{\delta M}{V}\frac{\delta V}{V} + \frac{M}{V}\left(\frac{\delta V}{V}\right)^2 + \frac{\delta M}{V}\left(\frac{\delta V}{V}\right)^2 - \frac{M}{V}\left(\frac{\delta V}{V}\right)^3 + \cdots$$

The first and second terms have only one variable with δ. This is why they are called "first order". The third and fourth terms have two δ variables multiplied together, and are called "second order". Terms 5 and 6 have three δ variables, and are called "third order", etc. First order terms are linear in the Gaussian distributed random errors for mass and volume (δM and δV), so their distributions will be Gaussian. However, second order terms are squared. The δV² term is χ² distributed, since the χ² distribution is the one that describes the distribution of squared Gaussian variables (Ch 9). But, what is the distribution of the δMδV term? We know that δV² will always be positive. However, in this case it is possible that δM will be positive while δV is negative. So, right off, we know that it will not be the same distribution as the one for δV². If δM and δV are completely independent of each other, the product will average to zero, and the contribution of this term to the standard deviation of δρ will be the product of the standard deviations of the volume and mass errors. The concept of "independence" will be discussed further in a later chapter. It is sufficient to say, for now, that when two random variables are independent of one another, the standard deviation of their product is just the product of the standard deviations of each of the individual random variables. The third order terms begin to get even more complicated. Term 5 is manageable, because it has δV² and δM: the δV² portion will be χ² distributed, as before, and δM will be Gaussian distributed, so we will have the product of a χ² distributed variable and a Gaussian distributed variable. The term with δV³ will have another unique distribution. This is best modeled on the computer using a simulation. It is rarely necessary to go beyond second order. It is the second order terms in the error expansion that produce "bias" in the result. This bias cannot be eliminated by increasing the sample size; it exists even for an infinite number of data. This can be easily seen by taking the expectation of δρ, as follows:

$$E[\delta\rho] = E\left[\frac{1}{V}\delta M - \frac{M}{V}\frac{\delta V}{V} - \frac{\delta M}{V}\frac{\delta V}{V} + \frac{M}{V}\left(\frac{\delta V}{V}\right)^2 + \cdots\right]$$

Simplifying,

$$E[\delta\rho] = \frac{1}{V}E[\delta M] - \frac{M}{V^2}E[\delta V] - \frac{1}{V^2}E[\delta M\,\delta V] + \frac{M}{V^3}E[(\delta V)^2] + \cdots$$


Since the averages of δM, δV and δMδV will be zero over many repetitions of the experiment, the only term that will remain is the δV² term, which is the cause of the bias. So, the second order bias in ρ is given by:

$$E[\delta\rho] = \frac{M}{V^3}E[(\delta V)^2] = \frac{M}{V^3}\sigma_v^2$$

where σ_v is the standard deviation of the volume measurement. It is interesting that the bias is controlled by errors in the volume alone. The mass is in the numerator of the equation, so its effect is linear and will average to zero at all orders.
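This bias is easy to confirm by simulation; the sketch below uses hypothetical values of M, V, and σ_v (kept small relative to V so that the expansion applies):

import numpy as np

rng = np.random.default_rng(4)
M, V, sigma_v = 2.0, 10.0, 1.0

v = rng.normal(V, sigma_v, 5_000_000)
rho = M / v

bias_sim = rho.mean() - M / V
bias_pred = M * sigma_v**2 / V**3
print(f"simulated bias = {bias_sim:.5f}, predicted (to second order) = {bias_pred:.5f}")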

Problem 8: Suppose δV/V = 0.5 and δM/M = 0.5. Find the distribution(s) of the relative error in density, δρ/ρ, up to second order. Find the expected mean value of ρ and its standard deviation if V = 10 m³ and M = 2 kg.

Problem 9: Create a button that simulates problem 8. Show that the values for the standard deviation of each of the "orders" of the error expansion that result from your simulation agree with the values you expect from problem 8.

The General Case

In practice, the equations describing the relationship between the data and the answer may be quite complex. Certain problems, such as determining velocity structure from earthquake arrival time measurements, are nonlinear and require extensive computation to determine the correct uncertainties in the computed velocity structure. Other kinds of problems, such as trying to predict the weather, have a result that is so strongly affected by small perturbations and errors that a meaningful error analysis is impossible. If you are fortunate, you will encounter simpler equations of the forms 7-10 to 7-14. If you can perform a Taylor's series expansion on the equation, you can do an error analysis. You can also approach the problem as a simulation that is done on the computer. This is best for those who are uncertain of their math skills, and it provides a meaningful check when strong nonlinearities or complex equations are involved.

When Should Averaging Be Done?

When measuring the density of an object by finding its mass and volume, it is possible to approach the data analysis in two ways. Suppose there are N measurements of mass and volume. One could first find the value for the volume by computing the average of all of the volume measurements. The standard deviation of the volume errors would then be σ_V̄ = σ_v/√N. Applying the same procedure to the mass measurement, σ_M̄ = σ_m/√N. Here σ_v and σ_m are the standard deviations of the population distributions of the volume and mass measurements. After the averaging, the standard deviation of the errors is reduced by a factor of √N. When the averages are put in the M/V formula, the bias caused by the second order error terms is reduced by a factor of N, and the first order error terms are reduced by a factor of √N, as expected. The other option for performing the analysis is to compute the value of ρ for each of the M and V pairs and then, after all of the values of ρ are computed, take the average of the ρ's. This has the extreme disadvantage of leaving the second order error terms at full size, and these cause bias in the final result (a sketch comparing the two orders of operation follows below).
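Problem 10 below asks for the full simulation; as a preview, here is a sketch of the two orders of operation (with a smaller relative error than problem 8's 0.5, to keep the ratio M/V numerically stable):

import numpy as np

rng = np.random.default_rng(5)
M, V, N, trials = 2.0, 10.0, 25, 20_000
m = rng.normal(M, 0.2 * M, (trials, N))
v = rng.normal(V, 0.2 * V, (trials, N))

rho_avg_first = m.mean(axis=1) / v.mean(axis=1)   # average M and V, then divide
rho_avg_last = (m / v).mean(axis=1)               # divide each pair, then average

print(f"average-first: mean = {rho_avg_first.mean():.4f}")
print(f"average-last : mean = {rho_avg_last.mean():.4f}  (true rho = {M/V})")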

Problem 10: Write a button script to simulate the effect described in the above paragraph and show quantitatively that the results of the simulation agree with the results of the above analysis. Generate N random values of the mass and volume using the xRandom function. Use the parameters of problem 8, where δV/V = 0.5 and δM/M = 0.5. Then compare the results (standard deviation and bias of the answer) when the mass and volume values are averaged first to the results when the densities are computed first and then averaged to get a final density.

Noise in Data Revisited

Chapter 6 treated the case where the noise level is constant. Here we expand the treatment to include the case where the standard deviation of the noise depends on the signal level. We also show how the choice of plot scale affects the appearance of the plotted data. When making logarithmic plots, it is important to be aware that the plot scale is expanded at low values and compressed at high values. For example, if variability of 0.5 exists in all data values, this will show up as large variations in the 0.1 range, but very small ones in the 10³ range. Logarithmic axes most accurately reflect data variation when that variation is proportional to the value of the data. The data variation (noise) is proportional to the signal in earthquake magnitude determinations. This is true because most of the variation in seismic signal levels is due to scattering of the seismic wave caused by heterogeneities in the earth, which is proportional to the signal amplitude. On the other hand, when most of the noise in data is from the measuring instrument, such as a voltmeter or mass spectrometer, the variation in the data will be more or less constant. A log plot of this kind of data would show large variations in plotted data at small values and small variations at large values.

We saw in Chapter 6 that noise in data is also called "random error". A measurement may be expressed as the sum of the noise free, or exact, value and an added noise component, in the following way: Y = y + aY_noise, where y is the exact value and Y_noise is the random error. Here we will consider Y_noise to have an average of zero and a standard deviation of 1. The constant a is the standard deviation of the added noise. If the average value of the noise was not zero, we would say it was "biased". As discussed in Chapter 6, two important cases are 1) when the amplitude of the noise is a constant and 2) when the amplitude of the noise is proportional to the signal, y. Below are the two forms:

Noise constant: Y = y + aY_noise   (7-7)

Noise proportional to y: Y = y + aY_noise·y = y(1 + aY_noise)   (7-8)

The distinction between these two cases is shown in the two log-log plots of figure 7.5. Data are generated according to each of equations 7-7 and 7-8.

Figure 7.5. The left plot is a log-log plot of y = x² + Y_noise, where the noise is constant. The right plot is a log-log plot of y = x² + 0.2x²·Y_noise, where the noise is proportional to the noise-free signal, y.

The left hand plot in figure 7.5 shows a log-log plot with signal (y) plus constant noise. The right hand plot shows signal (y) plus noise proportional to the value of y. The important feature here is that the randomness in the left plot decreases at larger x and y, while in the right plot the randomness remains relatively constant. This has important consequences when fitting straight lines to log-log plotted data. Obviously, in the first case, one would not want to fit a straight line to the lower values of the data, where the noise is high. In the second case, the noise is relatively uniform over the range, and a fit will be forced to take into account the full range of the data.

Non Gaussian Data Distributions

Note that the previous discussions all assumed that the data were distributed according to a Gaussian distribution. This is often not the case, and that will affect which analysis procedures produce the optimum result. For example, it is not uncommon for data to follow a log normal distribution. A log normal distribution is suspected for data that cannot go negative or whose distribution is skewed to higher values (Ch 5). If the data are log-normal, but the result involves an equation that takes the log (which transforms the distribution back to Gaussian), it is better to compute individual values of the result, then average. If you average before taking the log, you introduce bias.
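A short demonstration of this bias, with hypothetical log-normal data:

import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(mean=1.0, sigma=0.8, size=100_000)

log_then_average = np.mean(np.log(data))     # unbiased estimate of mu = 1.0
average_then_log = np.log(np.mean(data))     # biased upward by about sigma^2/2
print(f"log then average = {log_then_average:.3f}")
print(f"average then log = {average_then_log:.3f}")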

* If the data follow a log normal distribution, and the result we want requires we apply the equation y = log(data), can you figure out why bias is introduced if the data are averaged first?
* It is always necessary to be aware of the distribution that the data are following. What tests can you apply to determine if data follow a particular distribution?


Review:

After reading this chapter and working the problems, you should:
• Understand the relationship between the mean and variance of the data and the mean and variance of the answer, after putting data into an equation.
• Understand what bias is, how to compute it analytically for simple functional forms, and how to model it on the computer using simulations.
• Be able to determine the distribution of errors in the answer as a function of the equation used to get the answer and the distribution of the data.