11
Statistics for Biology Descriptive Statistics Repeated measurements in biology are rarely identical, due to random errors and natural variation. If enough measurements are repeated they can be plotted on a histogram, like the one on the right. This usually shows a normal distribution , with most of the repeats close to some central value. Many biological phenomena follow this pattern: eg. peoples' heights, number of peas in a pod, the breathing rate of insects, etc. The central value of the normal distribution curve is the mean (also known as the arithmetic mean or average ). But how reliable is this mean? If the data are all close together, then the mean is probably good, but if they are scattered widely, then the calculated mean may not be very reliable. The width of the normal distribution curve is given by the standard deviation (SD), and the larger the SD, the less reliable the data. For comparing different sets of data, a better measure is the 95% confidence interval (CI). This is derived from the SD, and is the range above and below the mean within which 95% of the repeated measurements lie (marked on the histogram above). You can be pretty confident that the real mean lies somewhere in this range. Whenever you calculate a mean you should also calculate a confidence limit to indicate the quality of your data. In Excel the mean is calculated using the formula =AVERAGE (range) , the SD is calculated using =STDEV (range) and the 95% CI is calculated using =CONFIDENCE (0.05, STDEV(range), COUNT(range)) . This spreadsheet shows two sets of data with the same mean. In group A the confidence interval is small compared to the mean, so the data are reliable and you can be confident that the real mean is close to your calculated mean. But in group B the confidence interval is large compared to the mean, so the data are unreliable, as the real mean could be quite far away from your calculated mean. Note that Excel will always return the results of a calculation to about 8 small confidence limit, low variability, data close together, mean is reliable large confidence limit, high variability, data scattered, mean is unreliable values number of times each valu occurs mean normal distribution curve 95% CI 95% CI

Bio Stats

Embed Size (px)

DESCRIPTION

Bio Stats

Citation preview

Page 1: Bio Stats

Statistics for

Biology

Descriptive Statistics Repeated measurements in biology are

rarely identical, due to random errors

and natural variation. If enough

measurements are repeated they can be

plotted on a histogram, like the one on

the right. This usually shows a normal

distribution, with most of the repeats

close to some central value. Many

biological phenomena follow this

pattern: eg. peoples' heights, number of

peas in a pod, the breathing rate of

insects, etc.

The central value of the normal

distribution curve is the mean (also

known as the arithmetic mean or

average). But how reliable is this

mean? If the data are all close

together, then the mean is probably

good, but if they are scattered

widely, then the calculated mean

may not be very reliable. The width of the normal distribution curve is given by the standard

deviation (SD), and the larger the SD, the less reliable the data. For comparing different sets of data,

a better measure is the 95% confidence interval (CI). This is derived from the SD, and is the range

above and below the mean within which 95% of the repeated measurements lie (marked on the

histogram above). You can be pretty confident that the real mean lies somewhere in this range.

Whenever you calculate a mean you should also calculate a confidence limit to indicate the quality of

your data.

In Excel the mean is calculated using the formula =AVERAGE (range) , the SD is calculated using

=STDEV (range) , and the 95% CI is calculated using =CONFIDENCE (0.05, STDEV(range), COUNT(range)) .

This spreadsheet shows two sets of

data with the same mean. In group A

the confidence interval is small

compared to the mean, so the data are

reliable and you can be confident that

the real mean is close to your

calculated mean. But in group B the

confidence interval is large compared

to the mean, so the data are unreliable,

as the real mean could be quite far

away from your calculated mean. Note

that Excel will always return the

results of a calculation to about 8

small confidence limit,

low variability,

data close together,

mean is reliable

large confidence limit,

high variability,

data scattered,

mean is unreliable

values

nu

mb

er

of

time

s e

ach

va

lue

occ

urs

mean

normal

distribution

curve

95% CI95% CI

Page 2: Bio Stats

decimal places of precision. This is meaningless, and cells with calculated results should always be

formatted to a more sensible precision (Format menu > Cells > Number tab > Number).

Plotting Data Once you have collected data you will want to plot a graph or chart to show trends or relationships

clearly. With a little effort, Excel produces very nice charts. First enter the data you want to plot into

two columns (or rows) and select them.

Drawing the Graph. Click on the chart wizard . This has four steps:

1. Graph Type. For a bar graph choose Column and for a scatter graph (also known as a line graph)

choose XY(Scatter) then press Next. Do not choose Line.

2. Source Data. If the sample graph looks OK, just hit Next. If it looks wrong you can correct it by

clicking on the Series tab, then the red arrow in the X Values box, then highlight the cells

containing the X data on the spreadsheet. Repeat for the Y Values box.

3. Chart Options. You can do these now or change them later, but you should at least enter suitable

titles for the graph and the axes and probably turn off the gridlines and legend.

4. Graph Location. Just hit Finish. This puts the chart beside the data so you can see both.

Changing the Graph. Once you have drawn the graph, you can now change any aspect of it by

double-clicking (or sometimes right-clicking) on the part you want to change. For example you can: move and re-shape the graph change the background colour (white is usually best!) change the shape and size of the markers (dots) change the axes scales and tick marks add a trend line or error bars (see below)

Lines. To draw a straight "line of best fit" right click on a point, select Add Trendline, and choose

linear. In the option tab you can force it to go through the origin if you think it should, and you can

even have it print the line equation if you are interested in the slope or intercept of the trend line. If

instead you want to "join the dots" (and you don't often) double-click on a point and set line to

automatic.

Error bars. These are used to show the confidence intervals on the graph. You must already have

entered the 95% confidence limits on the spreadsheet beside the X and Y data columns. Then

double-click on the points on the graph to get the Format Data Series dialog box and choose the Y

Error Bars tab. Click on the red arrow in the Custom + box, and highlight the range of cells

containing your confidence limits. Repeat for the Custom - box.

Page 3: Bio Stats

Problems

1. Here are the results of an investigation into the rate of photosynthesis in the pond weed Elodea.

The number of bubbles given off in one minute was counted under different light intensities, and

each measurement was repeated 5 times. Use Excel to calculate the means and 95% confidence

limits of these results, then plot a graph of the mean results with error bars and a line of best fit.

light

intensity (Lux)

repeat 1 repeat 2 repeat 3 repeat 4 repeat 5

0 5 2 0 2 1

500 12 4 5 8 7

1000 7 20 18 14 24

2000 42 25 31 14 38

3500 45 40 36 50 28

5000 65 54 72 58 36

Page 4: Bio Stats

There is a bewildering variety of statistical tests available, and it is important to choose the right one.

This flow chart will help you to decide which statistical test to use, and the tests are described in

detail on the next 5 pages.

Pearson correlation coefficient=CORREL (range 1, range 2)0 = no correlation1 = perfect correlation

Spearman correlation coefficient=CORREL (range 1, range 2) on of data0=no correlation/ 1=perfect correlation

ranks

Linear regressionAdd Trendline to graph and Display Equation.Gives slope and intercept of line

Paired testt-=TTEST(range1, range2, 2, 1)If <5% then significant differenceIf >5% then no significant difference

PP

Unpaired testt-=TTEST(range1, range2, 2, 2)If <5% then significant differenceIf >5% then no significant difference

PP

2 test

=CHITEST(obs range, exp range)If <5% then disagree with theoryIf >5% then agree with theory

PP

2 test

=CHITEST(obs range, exp range)If <5% then significant differenceIf >5% then no significant difference

PP

2 test for association

=CHITEST(obs range, exp range)If <5% then significant associationIf >5% then no significant association

PP

ANOVATools menu > Data analysis > AnovaIf <5% then significant differenceIf >5% then no significant difference

PP

Testing for acorrelation

Finding how onefactor affects another

2 sets

>2 sets

Comparing observed counts to a theory

Testing for a differencebetween counts

Testing for an associationbetween groups of counts

Testing fora relation

between 2 sets

Testing fora difference

between sets

Fre

qu

encie

s (c

ounts

)

starthere

normaldata

non-normaldata

sameindividuals

differentindividuals

Measure

ments

Plot scatter graph

Plotbar

graph

Calculatemean and

95% CI fromreplicates

Whatkindof

test?

Whatkindof

data?

Whatkindof

test?

Page 5: Bio Stats

Statistics to Test for a Correlation Correlation statistics are used to investigate an association between two factors such as age and

height; weight and blood pressure; or smoking and lung cancer. After collecting as many pairs of

measurements as possible of the two factors, plot a scatter graph of one against the other. If both

factors increase together then there is a positive correlation, or if one factor decreases when the

other increases then there is a negative correlation. If the scatter graph has apparently random points

then there is no correlation.

variable 1

variab

le 2

variable 1

variab

le 2

variable 1

variab

le 2

Positive Correlation Negative Correlation No Correlation

There are two statistical tests to quantify a correlation: the Pearson correlation coefficient (r), and

Spearman's rank-order correlation coefficient (rs). These both vary from +1 (perfect correlation)

through 0 (no correlation) to –1 (perfect negative correlation). If your data are continuous and

normally-distributed use Pearson, otherwise use Spearman. In both cases the larger the absolute

value (positive or negative), the stronger, or more significant, the correlation. Values grater than 0.8

are very significant, values between 0.5 and 0.8 are probably significant, and values less than 0.5 are

probably insignificant.

In Excel the Pearson coefficient r is calculated using the formula: =CORREL (X range, Y range) . To

calculate the Spearman coefficient rs, first make two new columns showing the ranks (or order) of

the two sets of data, and then calculate the Pearson correlation on the rank data. The highest value is

given a rank of 1, the next highest a rank of 2 and so on. Equal values are given the same rank, but

the next rank should allow for this (e.g. if there are two values ranked 3, then the next value is

ranked 5).

In this example the size of breeding

pairs of penguins was measured to see

if there was correlation between the

sizes of the two sexes. The scatter

graph and both correlation

coefficients clearly indicate a strong

positive correlation. In other words

large females do pair with large males.

Of course this doesn't say why, but it

shows there is a correlation to

investigate further.

Page 6: Bio Stats

Linear Regression to Investigate a Causal Relationship. If you know that one variable causes the changes in the other variable, then there is a causal

relationship. In this case you can use linear regression to investigate the relation in more detail.

Regression fits a straight line to the data, and gives the values of the slope and intercept of that line

(m and c in the equation y = mx + c).

The simplest way to do this in Excel is to

plot a scatter graph of the data and use the

trendline feature of the graph. Right-click on

a data point on the graph, select Add

Trendline, and choose Linear. Click on the

Options tab, and select Display equation on

chart. You can also choose to set the

intercept to be zero (or some other value).

The full equation with the slope and

intercept values are now shown on the chart.

In this example the absorption of a yeast cell suspension is plotted against its cell concentration from

a cell counter. The trendline intercept was fixed at zero (because 0 cells have 0 absorbance), and the

equation on the graph shows the slope of the regression line.

The regression line can be used to make quantitative predictions. For example, using the graph

above, we could predict that a cell concentration of 9 x 107 cells per cm

3 would have an absorbance

of 1.37 (9 x 0.152).

Page 7: Bio Stats

T-Test to Compare Two Sets of Data Another common form of data analysis is to compare two sets of measurements to see if they are the

same or different. For example are plants treated with fertiliser taller than those without? If the

means of the two sets are very different, then it is easy to decide, but often the means are quite close

and it is difficult to judge whether the two sets are the same or are significantly different. To

compare two sets of data use the t-test, which tells you the probability (P) that there is no

difference between the two sets. This is called the null hypothesis.

P varies from 0 (impossible) to 1 (certain). The higher the probability, the more likely it is that the

two sets are the same, and that any differences are just due to random chance. The lower the

probability, the more likely it is that that the two sets are significantly different, and that any

differences are real. Where do you draw the line between these two conclusions? In biology the

critical probability is usually taken as 0.05 (or 5%). This may seem very low, but it reflects the facts

that biology experiments are expected to produce quite varied results. So if P > 5% then the two sets

are the same (i.e. accept the null hypothesis), and if P < 5% then the two sets are different (i.e. reject

the null hypothesis). For the t test to work, the number of repeats should be at least 5.

In Excel the t-test is performed using the formula: =TTEST (range1, range2, tails, type) . For the

examples you'll use in biology, tails is always 2 (for a "two-tailed" test), and type can be either 1 for a

paired test (where the two sets of data are from the same individuals), or 2 for an unpaired test

(where the sets are from different

individuals). The cell with the t test P

should be formatted as a percentage

(Format menu > cell > number tab >

percentage). This automatically multiplies

the value by 100 and adds the % sign.

This can make P values easier to read and

understand. It’s also a good idea to plot

the means as a bar chart with error bars

to show the difference graphically.

In the first example the yield of potatoes

in 10 plots treated with one fertiliser was

compared to that in 10 plots treated with

another fertiliser. Fertiliser B delivers a

larger mean yield, but the unpaired t-test

P shows that there is a 8% probability

that this difference is just due to chance.

Since this is >5% we accept the null

hypothesis that there is no significant

difference between the two fertilisers.

In the second example the pulse rate of 8

individuals was measured before and

after eating a large meal. The mean pulse

rate is certainly higher after eating, and

the paired t-test P shows that there is

only a tiny 0.005% probability that this

difference is due to chance, so the pulse

rate is significantly higher after a meal.

Page 8: Bio Stats

ANOVA to Compare >2 sets of Data The t test is limited to comparing two sets of data, so to compare many groups at once you need

analysis of variance (ANOVA). From the Excel Tools menu select Data Analysis then ANOVA

Single Factor. This brings up the ANOVA dialogue box, shown here.

Enter the Input Range by clicking in

the box then selecting the range of

cells containing the data, including

the headings.

Check that the columns/rows choice

is correct (this example is in three

columns), and click in Labels in First

Row if you have included these. The

column headings will appear in the

results table.

Leave Alpha at 0.05 (for the usual

5% significance level).

Click in the Output Range box and

click on a free cell on the worksheet,

which will become the top left cell of

the 8 x 15-cell results table.

Finally press OK.

The output is a large

data table, and you may

need to adjust the

column widths to read

it all. At this point you

should plot a bar graph

using the averages

column for the bars and

the variance column for

the error bars.

The most important cell

in the table is the P-

value, which as usual is

the probability that the

null hypothesis (that

there is no difference

between any of the data

sets) is true. This is the

same as a t-test

probability, and in fact if you try ANOVA with just two data sets, it returns the same P as a t test. If

P > 5% then there is no significant difference between any of the data sets (i.e. the null hypothesis is

true), but if P < 5% then at least one of the groups is significantly different from the others.

In the example on this page, which concerns the grain yield from three different varieties of wheat, P

is 0.14%, so is less than 5%, so there is a significant difference somewhere. The problem now is to

identify where the difference lies. This is done by examining the variance column in the summary

table. In this example, varieties 2 and 3 are very similar, but variety 1 is obviously the different one.

So the conclusion would be that variety 1 has a significantly lower yield than varieties 2 and 3.

Page 9: Bio Stats

Chi-squared Test for Frequency Data Sometimes the data from an experiment are not measurements but counts (or frequencies) of things,

such as counts of different phenotypes in a genetics cross, or counts of species in different habitats.

With frequency data you can’t usually calculate averages or do a t test, but instead you do a chi-

squared (2) test. This compares observed counts with some expected counts and tells you the

probability (P) that there is no difference between them. In Excel the 2 test is performed using the

formula: =CHITEST (observed range, expected range) . There are three different uses of the test

depending on how the expected data are calculated.

Sometimes the expected data can be calculated from a quantitative theory, in which case you

are testing whether your observed data agree with the theory. If P < 5% then the data do not

agree with the theory, and if P > 5% then the data do agree with the theory. A good example is a

genetic cross, where Mendel’s laws can be used to

predict frequencies of different phenotypes. In this

example Excel formulae are used to calculate the

expected values using a 3:1 ratio of the total

number of observations. The 2

P is 53%, which is

much greater than 5%, so the results do indeed

support Mendel’s law. Incidentally a very high P

(>80%) is suspicious, as it means that the results are

just too good to be true.

Other times the expected data are calculated by assuming that the counts in all the categories

should be the same, in which case you are testing whether there is a difference between the

counts. If P < 5% then the counts are significantly

different from each other, and if P > 5% then there is

no significant difference between the counts. In the

example above the sex of children born in a hospital

over a period of time is compared. The expected

values are calculated by assuming there should be

equal numbers of boys and girls, and the 2

P of

6.4% is greater than 5%, so there is no significant

difference between the sexes.

If the count data are for categories in two groups, then the expected data can be calculated by

assuming that the two groups are independent. If P < 5% then there is a significant association

between the two groups, and if P > 5% then the two groups are independent. Each group can have

counts in two or more categories, and the observed frequency data are set out in a table, called a

contingency table. A copy of this table is then made for the expected data, which are calculated for

each cell from the corresponding totals of the observed data, using the formula

E = column total x row total / grand total . In this example the flow rate of a stream (the two categories

fast / slow) is compared to the type of stream bed (the four categories weed-choked / some weeds /

shingle / silt) at 50

different sites to see if

there is an association

between them. The 2

P

of 1.1% is less than 5%,

so there is an association

between flow rate and

stream bed.

1.

2.

3.

Page 10: Bio Stats

Problems 1. In a test of two drugs 8 patients were given one drug and 8 patients another drug. The number of hours of relief

from symptoms was measured with the following results:

Drug A 3.2 1.6 5.7 2.8 5.5 1.2 6.1 2.9

Drug B 3.8 1.0 8.4 3.6 5.0 3.5 7.3 4.8

Find out which drug is better by calculating the mean and 95% confidence limit for each drug, then use an

appropriate statistical test to find if it is significantly better than the other drug.

2. In one of Mendel's dihybrid crosses, the following types and numbers of pea plants were recorded in the F2

generation:

Yellow round seeds Yellow wrinkled seeds Green round seeds Green wrinkled seeds

289 122 96 39

According to theory these should be in the ratio of 9:3:3:1. Do these observed results agree with the expected ratio?

3. The areas of moss growing on the north and south sides of a group of trees were compared.

North side of tree 20 43 53 86 70 54

South side of tree 63 11 21 54 9 74

Is there a significant difference between the north and south sides?

4. Five mammal traps were placed in randomly-selected positions in a deciduous wood. The numbers of field mice

captured in each trap in one day were recorded. The results were:

Trap A B C D E

no. of mice 22 26 21 8 23

Trap D caught far fewer mice than the others. Did this happen by chance or is the result significant?

5. In an investigation into pollution in a stream, the concentration of nitrates was measured at six different sites, and

a diversity index was calculated for the species present.

Site 1 2 3 4 5 6

Conductivity (S) 413.3 439.7 726 850 567.3 766.7

Diversity index 7.51 5.17 4.49 3.82 5.88 3.74

Is there a correlation between conductivity and diversity, and how strong is it? (The diversity index is calculated

from biotic data, so is not normally distributed.)

6. The blood groups of 400 individuals, from 4 different ethnic groups were recorded with the following results:

Ethnic group Blood Group O Blood Group A Blood Group B Blood Group AB

1 46 40 7 3

2 48 39 12 2

3 53 33 12 4

4 55 30 13 3

Is there as association between blood group and ethnic group?

7. The effect of enzyme concentration on rate of a reaction was investigated with the following results.

Enzyme concentration (mM) 0 0.1 0.2 0.5 0.8 1.0

Rate (arbitrary units) 0 0.8 1.1 3.2 6.6 7.2

Plot a graph of these results, fit a straight line to the data, and find the slope of this line. Use the slope to predict

the rate at an enzyme concentration of 0.7mM.

Page 11: Bio Stats

Alternative flow chart, showing the non-parametric tests (which are not available in Excel):

Pearson correlation coefficient=CORREL (range 1, range 2)0 = no correlation1 = perfect correlation

Spearman correlation coefficient=CORREL (range 1, range 2) on of data0=no correlation1=perfect correlation

ranks

Linear regressionAdd Trendline to graph and Display Equation.Gives slope and intercept of line

Paired testt-=TTEST(range1, range2, 2, 1)If <5% then sig. differenceIf >5% then no sig. difference

PP

Unpaired testt-=TTEST(range1, range2, 2, 2)If <5% then sig. differenceIf >5% then no sig. difference

PP

2 test

=CHITEST(obs range, exp range)If <5% then disagree with theoryIf >5% then agree with theory

PP

2 test

=CHITEST(obs range, exp range)If <5% then sig. differenceIf >5% then no sig. difference

PP

2 test for association

=CHITEST(obs range, exp range)If <5% then sig. associationIf >5% then no sig. association

PP

ANOVATools menu >Data analysis >AnovaIf <5% then sig. differenceIf >5% then no sig. difference

PP

Mann-Whitney testU-Not available in Excel

Kruskal-Wallis testNot available in Excel

Wilcoxon Matched Pairs testNot available in Excel

Testing for acorrelation

Finding how one factor affects another

2 sets

>2 sets

Comparing observed counts to a theory

Testing for a differencebetween counts

Testing for an associationbetween groups of counts

Testing for a relation

between 2 sets

Testing for a difference

between sets

Fre

qu

en

cie

s (c

oun

ts)

starthere

sameindividuals

differentindividuals

Me

asu

rem

ents

Plot scatter graph

Plotbar

graph

Whatkindof

test?

Whatkindof

test?

Whatkindof

data?

Calculatemean and

95% CI fromreplicates

Parametric Test Non-Parametric Test

Parametric Test Non-Parametric Test

Parametric Test Non-Parametric Test

Parametric Test Non-Parametric Test