Unbiased Estimation of Evolutionary Distance between Nucleotide Sequences 1

Fumio Tajima Department of Population Genetics, National Institute of Genetics

A new algorithm for estimating the number of nucleotide substitutions per site (i.e., the evolutionary distance) between two nucleotide sequences is presented. This algorithm can be applied to many estimation methods, such as Jukes and Cantor’s method, Kimura’s transition/transversion method, and Tajima and Nei’s method. Unlike ordinary methods, this algorithm is always applicable. Numerical computations and computer simulations indicate that this algorithm gives an almost unbiased estimate of the evolutionary distance, unless the evolutionary distance is very large. This algorithm should be especially useful when short nucleotide sequences are analyzed. It can also be applied to amino acid sequences, for estimating the number of amino acid replacements.

Introduction

The number of nucleotide substitutions between nucleotide sequences is one of the fundamental quantities for the study of molecular evolution. The proportion of different nucleotide pairs between two nucleotide sequences can be converted into the number of nucleotide substitutions per nucleotide site (i.e., evolutionary distance) if an appropriate method is used (Kimura 1983; Nei 1987). Although there are many methods for such transformation (Jukes and Cantor 1969; Kimura 1980, 1981; Takahata and Kimura 1981; Gojobori et al. 1982; Tajima and Nei 1984; Tamura 1992), these methods have some problems. First, these methods give an overestimate when the nucleotide sequence is short. Second, they cannot be applied when the argument of a logarithm in the estimation formula becomes negative; such inapplicable cases occur quite frequently when two distantly related nucleotide sequences are compared (Gojobori et al. 1982; Tajima and Nei 1984). Both problems are caused by logarithms; that is, when $x$ is a random variable, the expectation of $-\log_e(1-x)$ is larger than $-\log_e[1-E(x)]$, and $x$ can be larger than unity even if $E(x)$ is smaller than unity, where $E(x)$ is the expectation of $x$. Here I present algorithms for estimating the evolutionary distance without using logarithms.

Theory

I consider three methods for estimating the evolutionary distance: Jukes and Cantor’s (1969) method, Tajima and Nei’s (1984) method, and Kimura’s (1980) transition/transversion method.

1. Key words: evolutionary distance, number of nucleotide differences, unbiased estimation, Jukes and Cantor’s method, Kimura’s method, Tajima and Nei’s method.

Address for correspondence and reprints: Fumio Tajima, Department of Population Genetics, National Institute of Genetics, Mishima, Shizuoka-ken 411, Japan.

Mol. Biol. Evol. 10(3):677-688. 1993. © 1993 by The University of Chicago. All rights reserved. 0737-4038/93/1003-0013$02.00


Jukes and Cantor’s Method

Denote the expected proportion of different nucleotide pairs between two nucleotide sequences by $p$. When the rates of substitution are the same among different nucleotide pairs, the evolutionary distance, $d$, can be given by

$$d = -b \log_e[1-(p/b)], \tag{1}$$

where $b = 0.75$ (Jukes and Cantor 1969; Kimura and Ohta 1972). Denote the number of pairs of nucleotides examined (or the length of nucleotide sequence) by $n$ and denote the observed number of nucleotide pairs that differ between two nucleotide sequences by $k$. Then $p$ can be estimated by $\hat p = k/n$. Substituting $\hat p$ into equation (1), we obtain

$$\hat d_{JK} = -b \log_e[1-(\hat p/b)], \tag{2}$$

which is usually called Jukes and Cantor’s method. This formula is applicable only when $\hat p < b$ and gives an overestimation when $\hat p$ is small because of the logarithm, as will be shown later.
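A second-order expansion gives a rough idea of where the overestimation comes from. The following is only a sketch and not part of the original derivation: it assumes that $k$ is binomial, so that $V(\hat p) = p(1-p)/n$, ignores inapplicable cases, and drops all terms beyond the second order.

$$E(\hat d_{JK}) \approx d + \frac{1}{2}\,\frac{\partial^2 d}{\partial p^2}\,V(\hat p) = d + \frac{p(1-p)}{2nb\,[1-(p/b)]^2}.$$

For $d = 0.5$ and $n = 100$ (so that $p \approx 0.365$), the extra term is about 0.006, which is of the same size as the excess of the mean of $\hat d_{JK}$ over the true $d$ reported for that case in table 1. For smaller $n$ the neglected higher-order terms matter more, so this approximation should be read as a lower-order guide only.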

Without use of the logarithm, equation (1) can be expressed as

$$d = \sum_{i=1}^{\infty} \frac{p^i}{i\,b^{i-1}}, \tag{3}$$

which can be obtained by the Taylor-series expansion. Because $k$ follows a binomial distribution with parameters $n$ and $p$, $p^i$ can be estimated by $k^{(i)}/n^{(i)}$ for $i \le k$, where $x^{(i)}$ [$k^{(i)}$ or $n^{(i)}$] is defined as

$$x^{(i)} = x(x-1)(x-2)\cdots(x-i+1) = x!/(x-i)!. \tag{4}$$

Note that the expectation of $k^{(i)}/n^{(i)}$ is $p^i$ for $i \le k$. Therefore, ignoring all terms higher than the $k$th order, we obtain

$$\hat d = \sum_{i=1}^{k} \frac{k^{(i)}}{i\,b^{i-1}\,n^{(i)}}, \tag{5}$$

which is the formula to be used for estimating the evolutionary distance. The expectation of equation (5) then becomes

$$E(\hat d) = \sum_{i=1}^{n} \frac{p^i}{i\,b^{i-1}}. \tag{6}$$

Therefore, equation (5) is expected to give an almost unbiased estimate when $p$ is not close to $b$, namely, when the evolutionary distance is not very large. Furthermore, equation (5) is always applicable, although it might give an unreasonable estimate when the evolutionary distance is very large.
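The two estimators can be compared side by side in a few lines of Python. This is a minimal sketch, assuming integer counts with 0 ≤ k ≤ n; the function names `jukes_cantor_log` and `distance_series` are illustrative and do not come from the paper or from the author's program.

```python
import math

def jukes_cantor_log(k, n, b=0.75):
    """Equation (2): ordinary log-based estimate, or None when k/n >= b."""
    p_hat = k / n
    if p_hat >= b:
        return None  # the argument of the logarithm is not positive
    return -b * math.log(1.0 - p_hat / b)

def distance_series(k, n, b=0.75):
    """Equation (5): sum over i = 1..k of k^(i) / (i * b^(i-1) * n^(i))."""
    total = 0.0
    ratio = 1.0  # running value of k^(i) / n^(i)
    for i in range(1, k + 1):
        ratio *= (k - i + 1) / (n - i + 1)
        total += ratio / (i * b ** (i - 1))
    return total

if __name__ == "__main__":
    # k = 16 differences out of n = 20 sites: equation (2) is inapplicable,
    # whereas equation (5) still returns a finite value.
    print(jukes_cantor_log(16, 20), distance_series(16, 20))
```

Updating the ratio $k^{(i)}/n^{(i)}$ incrementally avoids forming large factorials, which is a design choice of this sketch rather than something prescribed by the text.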

To discover the accuracy of this algorithm, I have conducted numerical computations. First, for a given value of $d$, $p$ was computed by $p = b(1 - e^{-d/b})$, and then the probability that the number of nucleotide differences between two nucleotide sequences is $k$ was computed by noting that $k$ follows a binomial distribution with parameters $n$ and $p$, where $0 \le k \le n$. For each $k$ the evolutionary distance ($\hat d$) was estimated by equation (5). For a comparison the distance ($\hat d_{JK}$) was also estimated by equation (2) if it was applicable.

Table 1
Estimates of d Obtained by Numerical Computations for the Case of the One-Parameter Model

True d        Mean of $\hat d$    Mean of $\hat d_{JK}$ (f)
n = 20:
  0.5         0.5000              0.5342 (0.0005)
  1.0         0.9998              1.0517 (0.0577)
  1.5         1.4900              1.3553 (0.2410)
  2.0         1.9223              1.4966 (0.4083)
n = 100:
  0.5         0.5000              0.5060 (<10^-10)
  1.0         1.0000              1.0265 (3 × 10^-5)
  1.5         1.5000              1.5772 (0.0196)
  2.0         1.9999              1.9916 (0.1522)
  2.5         2.4957              2.2035 (0.3181)
n = 500:
  0.5         0.5000              0.5012 (<10^-10)
  1.0         1.0000              1.0048 (<10^-10)
  1.5         1.5000              1.5178 (7 × 10^-7)
  2.0         2.0000              2.0673 (0.0058)
  2.5         2.5000              2.5593 (0.0981)
  3.0         3.0000              2.8368 (0.2606)

NOTE: $\hat d$ is the estimate of d obtained by equation (5). $\hat d_{JK}$ is the estimate obtained by equation (2), excluding inapplicable cases. f is the probability of inapplicable cases.

The results for $n$ = 20, 100, and 500 are shown in table 1. In this table the probability ($f$) of having inapplicable cases, namely, cases of $k \ge bn$, is also shown, and the mean of $\hat d_{JK}$ was computed by excluding inapplicable cases and noting that the unconditional expectation of $\hat d_{JK}$ is infinite if we define $\hat d_{JK} = \infty$ in inapplicable cases. We can see from this table that the mean of $\hat d$ is close to the true value of $d$ unless $d$ is very large, as expected. On the other hand, the mean of $\hat d_{JK}$ is larger than the true value of $d$ when $d$ is small. As $d$ increases, the probability of inapplicable cases ($f$) increases, and the mean of $\hat d_{JK}$ computed by excluding inapplicable cases becomes smaller than the true value of $d$. This table also shows that the extent of overestimation is substantial when $n$ is small. Thus, the new algorithm gives an almost unbiased estimate unless $d$ is very large.
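The numerical computation described above can be sketched as an exact sum over the binomial distribution of k. This is an illustration under stated assumptions, not the author's program: the names `series_estimate` and `expected_estimates` are hypothetical, and the direct use of `math.comb` is adequate for the sequence lengths considered here but may overflow floating-point range for much larger n.

```python
import math

def series_estimate(k, n, b=0.75):
    """Equation (5): log-free estimate built from falling-factorial ratios."""
    total, ratio = 0.0, 1.0
    for i in range(1, k + 1):
        ratio *= (k - i + 1) / (n - i + 1)  # k^(i) / n^(i)
        total += ratio / (i * b ** (i - 1))
    return total

def expected_estimates(d, n, b=0.75):
    """Return (mean of eq. 5, mean of eq. 2 over applicable cases, f)."""
    p = b * (1.0 - math.exp(-d / b))
    mean_new = 0.0   # expectation of equation (5)
    sum_log = 0.0    # weighted sum of equation (2) over applicable k
    prob_ok = 0.0    # probability that equation (2) is applicable
    for k in range(n + 1):
        w = math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        mean_new += w * series_estimate(k, n, b)
        if k < b * n:  # equation (2) requires k/n < b
            sum_log += w * (-b * math.log(1.0 - k / (b * n)))
            prob_ok += w
    return mean_new, sum_log / prob_ok, 1.0 - prob_ok

if __name__ == "__main__":
    # Should be comparable to the n = 20, d = 0.5 row of table 1.
    print(expected_estimates(0.5, 20))
```

Dividing the log-based sum by the probability of applicable cases mirrors the convention in the text, where the mean of the ordinary estimate is taken after excluding inapplicable cases.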

Tajima and Nei’s Method

Let $q_1$, $q_2$, $q_3$, and $q_4$ be the frequencies of nucleotides A, T, C, and G, respectively, and let $x_{ij}$ ($i<j$) be the relative frequency of nucleotide pair $i$ and $j$ between two nucleotide sequences. Then, Tajima and Nei (1984) showed that the evolutionary distance can be estimated by

$$\hat d_{TN} = -b \log_e[1-(\hat p/b)], \tag{7}$$

where $b$ is given either by

$$b = 1 - \sum_{i=1}^{4} q_i^2 \tag{8}$$

or by

$$b = \left[1 - \sum_{i=1}^{4} q_i^2 + (\hat p^2/h)\right]/2, \tag{9}$$

in which $h$ is given by

$$h = \sum_{i=1}^{3} \sum_{j=i+1}^{4} x_{ij}^2/(2 q_i q_j). \tag{10}$$

In the same way as before, we obtain

$$\hat d = \sum_{i=1}^{k} \frac{k^{(i)}}{i\,b^{i-1}\,n^{(i)}}, \tag{11}$$

which is the same as equation (5), except for the value of $b$. As in the case of Jukes and Cantor’s method, equation (11) is expected to give an almost unbiased estimate when $d$ is not very large, and it is always applicable.
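The two choices of b can be written down directly from the nucleotide frequencies and the pair frequencies. The sketch below assumes that `q` is a sequence of the four nucleotide frequencies and that `x` maps index pairs `(i, j)` with `i < j` to relative pair frequencies; both data layouts, the function names, and the fallback used when no differences are observed (where the ratio in the second formula is 0/0) are assumptions of this example, not part of the paper.

```python
from itertools import combinations

def b_equal_input(q):
    """b = 1 - sum of squared nucleotide frequencies."""
    return 1.0 - sum(qi * qi for qi in q)

def b_tajima_nei(q, x):
    """b = [1 - sum q_i^2 + p_hat^2 / h] / 2, with h summed over pairs i < j."""
    p_hat = sum(x.values())  # proportion of differing pairs
    h = sum(x.get((i, j), 0.0) ** 2 / (2.0 * q[i] * q[j])
            for i, j in combinations(range(len(q)), 2))
    if h == 0.0:
        # No differences observed: the ratio p_hat^2 / h is undefined.
        # Falling back to the first formula is an assumption of this sketch.
        return b_equal_input(q)
    return (1.0 - sum(qi * qi for qi in q) + p_hat ** 2 / h) / 2.0

if __name__ == "__main__":
    q = [0.25, 0.25, 0.25, 0.25]
    x = {(i, j): 0.05 for i in range(4) for j in range(i + 1, 4)}
    # With equal frequencies and equal pair frequencies both give b = 0.75.
    print(b_equal_input(q), b_tajima_nei(q, x))
```

The resulting b is then used in the series estimator in place of 0.75.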

In the above derivation, $b$ was assumed to be constant, although this is not the case. To know the accuracy of equation (11), I have conducted two series of computer simulations in which nucleotide substitutions follow the equal-input model (Tajima and Nei 1982). Let $r_{ij}$ be the rate of substitution from the $i$th nucleotide to the $j$th nucleotide, per unit evolutionary time. In the first series of simulations, $r_{21} = r_{31} = r_{41} = 0.4c$, $r_{12} = r_{32} = r_{42} = 0.3c$, $r_{13} = r_{23} = r_{43} = 0.2c$, and $r_{14} = r_{24} = r_{34} = 0.1c$ were assumed, where $c \ll 1$. In the second series of simulations, $r_{21} = r_{31} = r_{41} = 0.45c$, $r_{12} = r_{32} = r_{42} = 0.05c$, $r_{13} = r_{23} = r_{43} = 0.05c$, and $r_{14} = r_{24} = r_{34} = 0.45c$ were assumed. In these simulations $b$ was estimated by equation (8), and the evolutionary distance ($\hat d$) was estimated by equation (11). For comparison, the distance ($\hat d_{TN}$) was also estimated by equation (7), if it was applicable, namely, if $k < bn$. For each set of parameters, simulations were conducted 2,000 times.

The results of computer simulations for $n$ = 20 and 100 are shown in tables 2 and 3. In these tables, the proportion ($f$) of inapplicable cases for $\hat d_{TN}$ is also shown. The tables show that the new algorithm gives an almost unbiased estimate when $d$ is not very large. On the other hand, equation (7) gives an overestimate when $d$ is not very large, and the extent of overestimation is substantial when $n$ is small. Furthermore, equation (7) cannot be applied in many cases when $d$ is large.

Table 2
Estimates of d Obtained by Computer Simulations for the Case of the Equal-Input Model

True d        Mean $\hat d$ ± SE      Mean $\hat d_{TN}$ ± SE (f)
n = 20:
  0.5         0.5074 ± 0.0051         0.5562 ± 0.0063 (0.0020)
  1.0         1.0106 ± 0.0114         1.1119 ± 0.0184 (0.0900)
  1.5         1.4789 ± 0.0201         1.5276 ± 0.0331 (0.2705)
  2.0         1.9596 ± 0.0598         1.7304 ± 0.0455 (0.4305)
n = 100:
  0.5         0.5036 ± 0.0021         0.5110 ± 0.0022 (0.0000)
  1.0         1.0029 ± 0.0047         1.0422 ± 0.0060 (0.0005)
  1.5         1.4921 ± 0.0098         1.5994 ± 0.0119 (0.0410)
  2.0         2.0095 ± 0.0202         2.0016 ± 0.0169 (0.2200)
  2.5         2.4900 ± 0.1050         2.1868 ± 0.0206 (0.3700)

NOTE: $\hat d$ is the estimate of d obtained by equation (11). $\hat d_{TN}$ is the estimate obtained by equation (7), excluding inapplicable cases. f is the proportion of inapplicable cases. The rates of nucleotide substitution were assumed to be $r_{21} = r_{31} = r_{41} = 0.4c$, $r_{12} = r_{32} = r_{42} = 0.3c$, $r_{13} = r_{23} = r_{43} = 0.2c$, and $r_{14} = r_{24} = r_{34} = 0.1c$, where $r_{ij}$ is the rate of substitution from the ith nucleotide to the jth nucleotide, per unit evolutionary time, and $c \ll 1$. The number of replications is 2,000 for each set of parameters.

Kimura’s Transition/Transversion Method

In some genes, such as those in mitochondrial DNA, transitional substitutions occur more frequently than do transversional substitutions. In such cases, Kimura (1980) has shown that the evolutionary distance can be given by

$$d = -\tfrac{1}{2} \log_e(1-2P-Q) - \tfrac{1}{4} \log_e(1-2Q), \tag{12}$$

where $P$ and $Q$ are the expected proportions of transition-type and transversion-type pairs between two nucleotide sequences, respectively. Therefore, if the observed numbers of transition-type and transversion-type pairs between two nucleotide sequences are denoted by $k_s$ and $k_v$, respectively, the evolutionary distance can be estimated by

$$\hat d_K = -\tfrac{1}{2} \log_e(1-2\hat P-\hat Q) - \tfrac{1}{4} \log_e(1-2\hat Q), \tag{13}$$

where $\hat P$ and $\hat Q$ are given by $\hat P = k_s/n$ and $\hat Q = k_v/n$, respectively. This formula is known as Kimura’s method and is applicable only when $2\hat P + \hat Q < 1$ and $2\hat Q < 1$ because of logarithms.

Using the Taylor-series and binomial expansions, we can express equation (12) as

$$d = \sum_{i=1}^{\infty} \frac{1}{i} \left[ \sum_{j=0}^{i} \binom{i}{j} 2^{j-1} P^j Q^{i-j} + 2^{i-2} Q^i \right]. \tag{14}$$

Table 3
Estimates of d Obtained by Computer Simulations for the Case of the Equal-Input Model

True d        Mean $\hat d$ ± SE      Mean $\hat d_{TN}$ ± SE (f)
n = 20:
  0.5         0.5085 ± 0.0058         0.5776 ± 0.0077 (0.0150)
  1.0         0.9992 ± 0.0205         1.0237 ± 0.0139 (0.1880)
  1.5         1.3760 ± 0.0224         1.2381 ± 0.0172 (0.3815)
  2.0         1.7472 ± 0.0564         1.3340 ± 0.0191 (0.4845)
n = 100:
  0.5         0.5052 ± 0.0024         0.5167 ± 0.0026 (0.0000)
  1.0         1.0004 ± 0.0061         1.0739 ± 0.0077 (0.0120)
  1.5         1.4646 ± 0.0167         1.5222 ± 0.0135 (0.1650)
  2.0         1.9609 ± 0.0375         1.7347 ± 0.0171 (0.3695)
  2.5         2.2413 ± 0.0826         1.8456 ± 0.0206 (0.4445)

NOTE: $\hat d$ is the estimate of d obtained by equation (11). $\hat d_{TN}$ is the estimate obtained by equation (7), excluding inapplicable cases. f is the proportion of inapplicable cases. The rates of nucleotide substitution were assumed to be $r_{21} = r_{31} = r_{41} = 0.45c$, $r_{12} = r_{32} = r_{42} = 0.05c$, $r_{13} = r_{23} = r_{43} = 0.05c$, and $r_{14} = r_{24} = r_{34} = 0.45c$, where $r_{ij}$ is the rate of substitution from the ith nucleotide to the jth nucleotide, per unit evolutionary time, and $c \ll 1$. The number of replications is 2,000 for each set of parameters.

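Since $k_s$, $k_v$, and $n-k_s-k_v$ follow a multinomial distribution with parameters $n$, $P$, $Q$, and $1-P-Q$, $P^j Q^{i-j}$ can be estimated by $k_s^{(j)} k_v^{(i-j)}/n^{(i)}$ for $j \le k_s$ and $i-j \le k_v$, and $Q^i$ can be estimated by $k_v^{(i)}/n^{(i)}$ for $i \le k_v$. Therefore, ignoring all higher-order terms that cannot be estimated, we obtain

$$\hat d = \sum_{i=1}^{k} \frac{1}{i\,n^{(i)}} \left[ \sum_{j=\min}^{\max} \binom{i}{j} 2^{j-1} k_s^{(j)} k_v^{(i-j)} + 2^{i-2} k_v^{(i)} \right], \tag{15}$$

where $k = k_s+k_v$, min is the larger one of 0 and $i-k_v$, max is the smaller one of $i$ and $k_s$, and $\binom{i}{j} = i!/[j!(i-j)!]$.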
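The log-free transition/transversion estimator can be sketched next to Kimura's ordinary formula as follows. This is an illustrative implementation only: it assumes integer counts with ks + kv ≤ n, and the names `falling`, `kimura_log`, and `kimura_series` are hypothetical. The counts in the usage line (4 transitions and 4 transversions among 23 sites) are chosen for illustration.

```python
import math

def falling(x, i):
    """Falling factorial x^(i) = x (x-1) ... (x-i+1); equals 1 for i = 0."""
    out = 1
    for t in range(i):
        out *= x - t
    return out

def kimura_log(ks, kv, n):
    """Ordinary log-based estimate, or None when a log argument is <= 0."""
    P, Q = ks / n, kv / n
    a1, a2 = 1.0 - 2.0 * P - Q, 1.0 - 2.0 * Q
    if a1 <= 0.0 or a2 <= 0.0:
        return None
    return -0.5 * math.log(a1) - 0.25 * math.log(a2)

def kimura_series(ks, kv, n):
    """Series estimate using only falling-factorial moments of ks and kv."""
    total = 0.0
    for i in range(1, ks + kv + 1):
        lo, hi = max(0, i - kv), min(i, ks)
        inner = sum(math.comb(i, j) * 2.0 ** (j - 1)
                    * falling(ks, j) * falling(kv, i - j)
                    for j in range(lo, hi + 1))
        # falling(kv, i) is zero automatically once i exceeds kv
        inner += 2.0 ** (i - 2) * falling(kv, i)
        total += inner / (i * falling(n, i))
    return total

if __name__ == "__main__":
    print(kimura_series(4, 4, 23), kimura_log(4, 4, 23))
```

Because the falling factorial vanishes whenever its order exceeds its argument, the terms "that cannot be estimated" drop out without any special-case code.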

In order to check the accuracy of this algorithm, I have again conducted computer simulations. In the simulations, it was assumed that transitional substitutions occur four times as frequently as transversional ones. The evolutionary distance was estimated by equation (15). For a comparison, the distance was also estimated by equation (13) if it was applicable. For each set of parameters, the simulations were repeated 2,000 times.

The results of computer simulations for $n$ = 20 and 100 are shown in table 4. In this table, the proportion ($f$) of inapplicable cases for $\hat d_K$ is also shown, as before. We can see from this table that the new algorithm gives an almost unbiased estimate when $d$ is not very large. On the other hand, equation (13) gives an overestimate when $d$ is small, and the extent of overestimation is substantial when $n$ is small. Furthermore, equation (13) cannot be applied in many cases when $d$ is large, and, because of many inapplicable cases, the mean of $\hat d_K$ obtained from equation (13) when inapplicable cases are excluded is much smaller than the true value of $d$.

Variance of $\hat d$

In the cases of Jukes and Cantor’s and Tajima and Nei’s methods, the variance of $\hat d$ is approximately given by

$$V(\hat d) = \left(\frac{\partial d}{\partial p}\right)^2 V(\hat p). \tag{16}$$

Since $V(\hat p)$ and $\partial d/\partial p$ can be estimated by $\hat p(1-\hat p)/(n-1)$ and $\exp(\hat d/b)$, respectively, the variance of $\hat d$ can be estimated by

$$\hat V(\hat d) = \frac{\hat p(1-\hat p)}{n-1} \exp(2\hat d/b). \tag{17}$$
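A short sketch of this variance estimate for the one-parameter case is given below. The function names are illustrative, and the example counts (k = 8 differences among n = 23 sites) are an assumption used only to show the call.

```python
import math

def series_estimate(k, n, b=0.75):
    """Log-free distance estimate: sum of k^(i) / (i * b^(i-1) * n^(i))."""
    total, ratio = 0.0, 1.0
    for i in range(1, k + 1):
        ratio *= (k - i + 1) / (n - i + 1)  # k^(i) / n^(i)
        total += ratio / (i * b ** (i - 1))
    return total

def series_variance(k, n, b=0.75):
    """Estimated variance: p_hat (1 - p_hat) / (n - 1) * exp(2 d_hat / b)."""
    p_hat = k / n
    d_hat = series_estimate(k, n, b)
    return p_hat * (1.0 - p_hat) / (n - 1) * math.exp(2.0 * d_hat / b)

if __name__ == "__main__":
    d_hat = series_estimate(8, 23)
    se = math.sqrt(series_variance(8, 23))
    print(f"{d_hat:.2f} +/- {se:.2f}")
```

The square root of the returned value is the standard error that accompanies a distance estimate.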

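In the case of Kimura’s transition/transversion method, the variance of $\hat d$ is approximately given by

$$V(\hat d) = \left(\frac{\partial d}{\partial P}\right)^2 V(\hat P) + \left(\frac{\partial d}{\partial Q}\right)^2 V(\hat Q) + 2 \frac{\partial d}{\partial P} \frac{\partial d}{\partial Q} \mathrm{Cov}(\hat P, \hat Q). \tag{18}$$

$V(\hat P)$, $V(\hat Q)$, and $\mathrm{Cov}(\hat P, \hat Q)$ can be estimated by $\hat P(1-\hat P)/(n-1)$, $\hat Q(1-\hat Q)/(n-1)$, and $-\hat P\hat Q/(n-1)$, respectively. $\partial d/\partial P$ and $\partial d/\partial Q$ are given by $\exp(2d_1)$ and $[\exp(2d_1)+\exp(4d_2)]/2$, respectively, where $d_1 = -\log_e(1-2P-Q)/2$ and $d_2 = -\log_e(1-2Q)/4$. Since $d_1$ and $d_2$ are the first and second terms in the right side of equation (12), $d_1$ and $d_2$ can be estimated by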

$$\hat d_1 = \sum_{i=1}^{k} \frac{1}{i\,n^{(i)}} \sum_{j=\min}^{\max} \binom{i}{j} 2^{j-1} k_s^{(j)} k_v^{(i-j)} \tag{19}$$

and

$$\hat d_2 = \sum_{i=1}^{k_v} \frac{2^{i-2} k_v^{(i)}}{i\,n^{(i)}}. \tag{20}$$

Table 4
Estimates of d Obtained by Computer Simulations, When It Is Assumed That the Rate of Transitional Substitution Is Four Times Larger than That of Transversional Substitution

True d        Mean $\hat d$ ± SE      Mean $\hat d_K$ ± SE (f)
n = 20:
  0.5         0.5062 ± 0.0060         0.5510 ± 0.0065 (0.0310)
  1.0         1.0090 ± 0.0256         0.9111 ± 0.0094 (0.2455)
  1.5         1.4080 ± 0.0326         1.0780 ± 0.0104 (0.4095)
  2.0         1.6461 ± 0.0423         1.1582 ± 0.0117 (0.5060)
n = 100:
  0.5         0.4971 ± 0.0024         0.5099 ± 0.0026 (0.0000)
  1.0         1.0146 ± 0.0074         1.0645 ± 0.0077 (0.0465)
  1.5         1.4947 ± 0.0171         1.4227 ± 0.0109 (0.2480)
  2.0         1.9153 ± 0.0365         1.5717 ± 0.0116 (0.3915)
  2.5         2.4167 ± 0.2172         1.7196 ± 0.0129 (0.4685)

NOTE: $\hat d$ is the estimate of d obtained by equation (15). $\hat d_K$ is the estimate obtained by equation (13), excluding inapplicable cases. f is the proportion of inapplicable cases. The number of replications is 2,000 for each set of parameters.

Then, the variance of $\hat d$ can be estimated by

$$\hat V(\hat d) = \frac{a_1^2 \hat P + a_2^2 \hat Q - (a_1 \hat P + a_2 \hat Q)^2}{n-1}, \tag{21}$$

where $a_1 = \exp(2\hat d_1)$ and $a_2 = [\exp(2\hat d_1)+\exp(4\hat d_2)]/2$.
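The variance for the transition/transversion case can be sketched by first computing the two series components and then combining them. As before, this is an assumption-laden illustration rather than the author's code: the function names are hypothetical, integer counts with ks + kv ≤ n are assumed, and the example counts are illustrative.

```python
import math

def falling(x, i):
    """Falling factorial x^(i); equals 1 for i = 0."""
    out = 1
    for t in range(i):
        out *= x - t
    return out

def kimura_components(ks, kv, n):
    """Series estimates of d_1 (transition part) and d_2 (transversion part)."""
    d1 = 0.0
    for i in range(1, ks + kv + 1):
        lo, hi = max(0, i - kv), min(i, ks)
        inner = sum(math.comb(i, j) * 2.0 ** (j - 1)
                    * falling(ks, j) * falling(kv, i - j)
                    for j in range(lo, hi + 1))
        d1 += inner / (i * falling(n, i))
    d2 = sum(2.0 ** (i - 2) * falling(kv, i) / (i * falling(n, i))
             for i in range(1, kv + 1))
    return d1, d2

def kimura_variance(ks, kv, n):
    """[a1^2 P + a2^2 Q - (a1 P + a2 Q)^2] / (n - 1)."""
    d1, d2 = kimura_components(ks, kv, n)
    a1 = math.exp(2.0 * d1)
    a2 = (math.exp(2.0 * d1) + math.exp(4.0 * d2)) / 2.0
    P, Q = ks / n, kv / n
    return (a1 * a1 * P + a2 * a2 * Q - (a1 * P + a2 * Q) ** 2) / (n - 1)

if __name__ == "__main__":
    d1, d2 = kimura_components(4, 4, 23)
    print(d1 + d2, math.sqrt(kimura_variance(4, 4, 23)))
```

The sum of the two components equals the log-free distance estimate itself, so a single pass yields both the estimate and its standard error.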

Numerical Example

The theoretical studies presented above have shown that the new algorithm is useful especially when we analyze short nucleotide sequences. I now show this by using the nucleotide sequences of the human preproinsulin and rat preproinsulin I (Sures et al. 1980). Preproinsulin consists of four polypeptides: prepeptide, B-chain, C-peptide, and A-chain. The A- and B-chains produce active insulin, whereas the prepeptide and C-peptide are removed before active insulin is produced. Since the substitution rates might be different among different polypeptides, I have analyzed them separately. Furthermore, I have analyzed the first, second, and third positions in codons separately (Kimura 1980, 1981). Thus, the number of nucleotides ($n$) used in each comparison is quite small: $n$ = 23 for prepeptide (excluding the initiation codon), $n$ = 30 for B-chain, $n$ = 31 for C-peptide, and $n$ = 21 for A-chain. We have already seen that, in the case where $n$ is small, the extent of overestimation is substantial if the ordinary algorithms such as (2), (7), and (13) are used for estimating $d$.

The numbers of different pairs of nucleotides between the nucleotide sequences for the human and rat genes are given in table 5, from which the evolutionary distances were estimated. I have used three methods: Jukes and Cantor’s method (JC method), Tajima and Nei’s method, and Kimura’s transition/transversion method (K method). In the case of Tajima and Nei’s method, $b$’s were estimated by equations (8) (TN1 method) and (9) (TN2 method). The results are shown in table 6, where values in parentheses were obtained by the ordinary algorithms. The difference between the estimates obtained by the new and ordinary algorithms is small when the estimate of distance is small ($\hat d < 0.2$). On the other hand, the difference is substantial when the estimate is large, especially when the third position in C-peptide is compared. Incidentally, it can be seen from this table that the evolutionary distance in the third position is larger than that in the first and second positions and that in the first and second positions the evolutionary distance of prepeptide and C-peptide is larger than that of A- and B-chains. These results are consistent with the neutral theory of molecular evolution (Kimura 1968, 1983).

Table 5
Observed Numbers of Different Pairs of Nucleotides between the Nucleotide Sequences for the Human and Rat Insulin Genes

Position in Codon   AA   TT   CC   GG   AG   TC   AT   AC   TG   CG    n
Prepeptide:
  First              1    2   11    6    1    1    0    0    0    1   23
  Second             1    9    7    3    1    1    0    1    0    0   23
  Third              0    0    6    9    2    2    1    0    0    3   23
B-chain:
  First              3    7    9    9    0    1    1    0    0    0   30
  Second             9   10    5    6    0    0    0    0    0    0   30
  Third              2    2    8    5    3    4    2    1    1    2   30
C-peptide:
  First              0    1   10   15    1    0    0    1    1    2   31
  Second             9    6    5    4    2    2    0    0    2    1   31
  Third              0    0    4   12    6    2    0    2    4    1   31
A-chain:
  First              6    7    4    4    0    0    0    0    0    0   21
  Second             8    5    2    6    0    0    0    0    0    0   21
  Third              0    1   13    3    2    1    1    0    0    0   21

SOURCE: Sures et al. (1980).
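The Jukes and Cantor figures for the third codon positions can be approximated with a few lines of Python. This is a sketch under stated assumptions: the function names are illustrative, and each k below is derived here as n minus the number of identical pairs (AA + TT + CC + GG) in the corresponding row of table 5, so it is an inference from the table rather than a value quoted in the text.

```python
import math

def series_estimate(k, n, b=0.75):
    """New algorithm: sum of k^(i) / (i * b^(i-1) * n^(i)) for i = 1..k."""
    total, ratio = 0.0, 1.0
    for i in range(1, k + 1):
        ratio *= (k - i + 1) / (n - i + 1)
        total += ratio / (i * b ** (i - 1))
    return total

def log_estimate(k, n, b=0.75):
    """Ordinary algorithm: -b log(1 - k / (n b)), or None if inapplicable."""
    if k >= b * n:
        return None
    return -b * math.log(1.0 - k / (n * b))

# (region, k, n) for third codon positions; k is derived from table 5.
THIRD_POSITIONS = [
    ("Prepeptide", 8, 23),
    ("B-chain", 13, 30),
    ("C-peptide", 15, 31),
    ("A-chain", 4, 21),
]

if __name__ == "__main__":
    for region, k, n in THIRD_POSITIONS:
        new = series_estimate(k, n)
        old = log_estimate(k, n)
        print(f"{region:10s} new = {new:.2f}  ordinary = {old:.2f}")
```

The gap between the two columns widens as k/n grows, which is the pattern described for the C-peptide comparison.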

Discussion

In this paper I have developed new algorithms for estimating the evolutionary distance that are always applicable and that give an almost unbiased estimate unless the distance is very large. In the analysis of nucleotide sequences, one often divides them into several regions when the substitution rates are different among different regions, as shown in the Numerical Example section. In such cases the number of nucleotides in each region might not be large, and the ordinary algorithms such as (2), (7), and (13) are expected to give overestimates. In this case the new algorithms will be helpful.

In the case where the number of nucleotides compared ($n$) is very large, the difference in the estimate, between the new and ordinary algorithms, is expected to be negligibly small. Although the amount of bias in the estimate obtained by the ordinary algorithm depends on the substitution model or the method used (see tables 1-4), the new algorithm is recommended in the case of $n$ < 1,000.

In the case where $n$ is as small as 20, if the evolutionary distance estimated by the ordinary algorithm is small ($\hat d < 0.2$), the extent of overestimation is also small (see table 6). Therefore, in such cases, the ordinary algorithms as well as the new algorithms are applicable.

Tajima (1992) has shown that the estimated evolutionary distance ($\hat d$) can be divided into two components: the distance that is actually realized ($d_a$) and the error caused by the estimation process ($e$), where the mean of $e$ is zero and $d_a$ and $e$ are independent of each other. M. Nei and A. Rzhetsky (personal communication) have shown, however, that when Jukes and Cantor’s formula, i.e., equation (2), is used to estimate these two components, the mean of $e$ is not zero and there is a positive correlation between $d_a$ and $e$. To know whether this is also the case when equation (5) is used for the estimation, I have conducted computer simulations, and the results are shown in table 7, where the number of nucleotides is 100, the number of replications is 10,000, and the rates of substitution are assumed to be the same among different nucleotide pairs. The results indicate that the mean of $e$ is not significantly different from zero and that the correlation coefficient between $d_a$ and $e$ also is not significantly different from zero. Thus, it can be concluded that the new algorithm is recommended for use in Tajima’s (1992) method if nucleotide sequences are short. If nucleotide sequences are long, however, the difference in the estimate between the two algorithms is not very large (see table 1), so that either algorithm can be used in his method.

Incidentally, when we are interested in the estimation of the evolutionary distance ($d_a$) that is actually realized (i.e., the actual number of nucleotide substitutions per nucleotide site), we can use $V(e) = V(\hat d) - \hat d/n$ as its variance. Table 7 gives the observed variances of $e$ obtained by computer simulations, together with the expected variances, i.e., $V(e) = V(\hat d) - d/n$, and shows that the agreement between them is satisfactory.

Table 6
Estimates of d between the Human and Rat Insulin Genes

Position in Codon     JC Method                     TN1 Method                    TN2 Method                    K Method
Prepeptide (n = 23):
  First               0.14 ± 0.09 (0.14 ± 0.09)     0.14 ± 0.09 (0.15 ± 0.09)     0.15 ± 0.10 (0.16 ± 0.10)     0.14 ± 0.09 (0.15 ± 0.09)
  Second              0.14 ± 0.09 (0.14 ± 0.09)     0.14 ± 0.09 (0.15 ± 0.09)     0.15 ± 0.10 (0.16 ± 0.10)     0.14 ± 0.09 (0.15 ± 0.09)
  Third               0.45 ± 0.18 (0.47 ± 0.19)     0.48 ± 0.22 (0.52 ± 0.24)     0.59 ± 0.42 (0.76 ± 0.63)     0.45 ± 0.19 (0.48 ± 0.20)
B-chain (n = 30):
  First               0.07 ± 0.05 (0.07 ± 0.05)     0.07 ± 0.05 (0.07 ± 0.05)     0.07 ± 0.05 (0.07 ± 0.05)     0.07 ± 0.05 (0.07 ± 0.05)
  Second              0                             0                             0                             0
  Third               0.62 ± 0.21 (0.65 ± 0.22)     0.63 ± 0.22 (0.66 ± 0.23)     0.68 ± 0.27 (0.74 ± 0.30)     0.63 ± 0.23 (0.68 ± 0.25)
C-peptide (n = 31):
  First               0.18 ± 0.09 (0.18 ± 0.09)     0.18 ± 0.09 (0.19 ± 0.09)     0.19 ± 0.11 (0.21 ± 0.11)     0.18 ± 0.09 (0.18 ± 0.09)
  Second              0.26 ± 0.11 (0.27 ± 0.11)     0.26 ± 0.11 (0.27 ± 0.11)     0.27 ± 0.12 (0.29 ± 0.12)     0.26 ± 0.11 (0.27 ± 0.11)
  Third               0.74 ± 0.24 (0.78 ± 0.26)     0.86 ± 0.37 (0.96 ± 0.44)     1.06 ± 0.76 (1.65 ± 2.41)     0.76 ± 0.28 (0.83 ± 0.32)
A-chain (n = 21):
  First               0                             0                             0                             0
  Second              0                             0                             0                             0
  Third               0.21 ± 0.12 (0.22 ± 0.12)     0.22 ± 0.13 (0.24 ± 0.14)     0.25 ± 0.20 (0.30 ± 0.24)     0.22 ± 0.12 (0.23 ± 0.13)

NOTE: Data are from table 5. Values in parentheses are the estimates obtained by the ordinary algorithms, such as (2), (7), and (13).

Table 7
Results of Computer Simulation

                        $d_a$                   $e$                        $e_{JK}$
                    Mean     Variance      Mean      Variance       Mean         Variance
d = 0.1 (n = 100):
  Observed          0.0998   0.00103       0.0001    0.00011        0.0008***    0.00011
  Expected          0.1      0.001         0         0.00011        0            0.00011
                                           (0.010 ± 0.010)          (0.035 ± 0.010***)
d = 0.2 (n = 100):
  Observed          0.2000   0.00199       0.0003    0.00047        0.0019***    0.00048
  Expected          0.2      0.002         0         0.00047        0            0.00047
                                           (0.002 ± 0.010)          (0.023 ± 0.010*)
d = 0.4 (n = 100):
  Observed          0.4000   0.00405       0.0001    0.00226        0.0043***    0.00234
  Expected          0.4      0.004         0         0.00222        0            0.00222
                                           (0.000 ± 0.010)          (0.021 ± 0.010*)
d = 0.8 (n = 100):
  Observed          0.8007   0.00816       -0.0009   0.01317        0.0140***    0.01440
  Expected          0.8      0.008         0         0.01310        0            0.01310
                                           (-0.006 ± 0.010)         (0.026 ± 0.010**)

NOTE: $e = \hat d - d_a$; $e_{JK} = \hat d_{JK} - d_a$. The rates of nucleotide substitution are assumed to be the same among different nucleotide pairs, and the number of replications is 10,000 for each set of parameters. Values in parentheses are the correlation coefficient between $d_a$ and $e$ (or $e_{JK}$) and its standard error.

* P < 0.05. ** P < 0.01. *** P < 0.001.

In this paper I have considered only nucleotide sequences. The present algorithm, however, can also be applied to amino acid sequences, for estimating the number of amino acid replacements per site, and the estimation formula is the same as equation (5) if $n$ and $k$ are the number of amino acids in the sequence and the number of different amino acid pairs between two sequences, respectively, and if an appropriate value of $b$ is used. Under the infinite-allele model (Kimura and Crow 1964) $b$ is equal to unity, and $b$ = 0.95 if the rates of amino acid replacement are the same among different amino acid pairs. The value of $b$ can also be estimated in the same way as in Tajima and Nei’s (1984) method. A computer program for estimating the evolutionary distance is available on request.

Acknowledgments

I thank Dr. M. Nei and two anonymous reviewers for their valuable suggestions and comments. This is contribution 1932 from the National Institute of Genetics, Mishima, Shizuoka 411, Japan.

LITERATURE CITED

GOJOBORI, T., K. ISHII, and M. NEI. 1982. Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide. J. Mol. Evol. 18:414-423.

JUKES, T. H., and C. R. CANTOR. 1969. Evolution of protein molecules. Pp. 21-132 in H. N. MUNRO, ed. Mammalian protein metabolism. Academic Press, New York.

KIMURA, M. 1968. Evolutionary rate at the molecular level. Nature 217:624-626.

KIMURA, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.

KIMURA, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78:454-458.

KIMURA, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge.

KIMURA, M., and J. F. CROW. 1964. The number of alleles that can be maintained in a finite population. Genetics 49:725-738.

KIMURA, M., and T. OHTA. 1972. On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol. 2:87-90.

NEI, M. 1987. Molecular evolutionary genetics. Columbia University Press, New York.

SURES, I., D. V. GOEDDEL, A. GRAY, and A. ULLRICH. 1980. Nucleotide sequence of human preproinsulin complementary DNA. Science 208:57-59.

TAJIMA, F. 1992. Statistical method for estimating the standard errors of branch lengths in a phylogenetic tree reconstructed without assuming equal rates of nucleotide substitution among different lineages. Mol. Biol. Evol. 9:168-181.

TAJIMA, F., and M. NEI. 1982. Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. J. Mol. Evol. 18:115-120.

TAJIMA, F., and M. NEI. 1984. Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1:269-285.

TAKAHATA, N., and M. KIMURA. 1981. A model of evolutionary base substitutions and its application with special reference to rapid change of pseudogenes. Genetics 98:641-657.

TAMURA, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678-687.

MASATOSHI NEI, reviewing editor

Received July 30, 1992; revision received October 6, 1992

Accepted October 6, 1992