Journal of Theoretical Biology 227 (2004) 503–511

ARTICLE IN PRESS

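*Corresponding author. Tel.: +1223-765-945; fax: +1223-333-345.

E-mail address: [email protected] (M. Spencer).

0022-5193/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

doi:10.1016/j.jtbi.2003.11.022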

Phylogenetics of artificial manuscripts

Matthew Spencer*, Elizabeth A. Davidson, Adrian C. Barbrook, Christopher J. Howe

Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1QW, UK

Abstract

Biological evolution has parallels with the development of natural languages, man-made artifacts, and manuscript texts. As a result, phylogenetic methods developed for evolutionary biology are increasingly being used in linguistics, anthropology, archaeology, and textual criticism. Despite this popularity, there have been few critical tests of their suitability. Here, we apply phylogenetic methods to artificial manuscripts with a known true phylogeny, produced by modern ‘scribes’. Although the survival of ancestral forms and multiple descendants from a single ancestor are probably much more common in manuscript evolution than biological evolution, we were able to reconstruct most of the true phylogeny. This is important because phylogenetic methods are influencing the production of critical editions of major written works. We also show that the variation in rates of change at different locations in the text follows a gamma distribution, as is often the case in DNA sequences.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Experimental phylogenetics; Manuscript evolution; Natural language; Stemmatics; Textual criticism

1. Introduction

The important biological problem of reconstructing evolutionary relationships among living and extinct organisms has parallels in other historical disciplines. As a result, methods developed for biological phylogenetics have also been used to reconstruct the relationships among natural languages (e.g. Forster and Toth, 2003; Forster et al., 1998; Gray and Jordan, 2000; Holden, 2002; McMahon and McMahon, 2003), behavioural patterns (e.g. Robson-Brown, 1999), archaeological artifacts (e.g. Collard and Shennan, 2000; O’Brien et al., 2001; O’Brien and Lyman, 2002; Tehrani and Collard, 2002), and written works such as chain letters (Bennett et al., 2003) and medieval manuscripts (e.g. Barbrook et al., 1998; Lee, 1989; Spencer et al., 2003).

In the field of manuscripts, texts were replicated by copying, with ‘mutations’ occurring whenever a scribe did not exactly reproduce his or her source (Howe et al., 2001). Phylogenetics and other quantitative methods are already influencing the preparation of critical editions (Robinson, 1997b). To convince sceptics (e.g. Cartlidge, 2001; Hanna, 2000; Jones, 2001) that phylogenetic methods are appropriate, it is necessary to show that they can recover the correct relationships when these are known. Unfortunately, such cases are very rare. In the only example we know of, Robinson and O’Hara (1996) used maximum parsimony to reconstruct a phylogeny (stemma) for a set of manuscript copies (text tradition) of an Icelandic saga. The sources for 12 of the 46 extant manuscripts are known from external evidence (usually statements by the scribes), and the maximum parsimony phylogeny closely matched many of these known relationships.

Here, we describe the production and phylogenetic analysis of a set of artificial manuscripts. We know the true relationships among all the manuscripts, and can quantify the accuracy of several different phylogenetic methods. As well as being a useful general test of the accuracy of phylogenetic reconstruction, we can examine several features specific to the analysis of text traditions. One of the major differences between biological evolution and text traditions is the manner of divergence (Lee, 1989; O’Hara and Robinson, 1993, pp. 60–61). Although an evolutionary lineage can split into many species, the probability of several molecular lineages diverging at exactly the same time (a hard polytomy) is very small (Slowinski, 2001). At the species level, hard polytomies may occur, but only in special circumstances such as simultaneous isolation of subpopulations (Hoelzer and Melnick, 1994). Phylogenetic studies therefore usually assume that a bifurcating tree is a good representation of the evolutionary process, although polytomies may be useful representations of uncertainty in branching order (Page and Holmes, 1998, p. 13). In text traditions, many copies can be made from a single exemplar, so hard polytomies are likely. Another important difference between biological evolution and text traditions is that biological evolution is continual. After an evolutionary divergence, both species continue to change, so ancestral forms do not persist unless either the rate of change is very low or the time interval since divergence is very small. Phylogeneticists therefore assume that contemporary species should always appear on the tips (terminal nodes) of the tree. In contrast, once a manuscript is produced, the text it contains does not change (except occasionally due to corrections and damage). Some extant manuscripts may be the ancestors of others, and should therefore be represented by internal nodes of the phylogeny (Cameron, 1987). By creating an artificial text tradition with a known phylogeny that includes polytomies and extant manuscripts on internal nodes, we can find out how much these features affect the accuracy of our phylogenetic reconstructions.

In our artificial text tradition, we have every manuscript that was generated. In contrast, it is likely that most of the manuscripts in real text traditions have been lost (Weitzman, 1987). By reconstructing phylogenies using random subsets of our artificial text tradition, we can see how this may affect our ability to recover the true relationships among the manuscripts that remain.

Some locations in protein or DNA sequences mutate more frequently than others. This influences the measurement of true evolutionary distances and the reconstruction of phylogenies (Page and Holmes, 1998, pp. 146–162). We currently have no information on the distribution of scribal errors, but it seems likely that some locations in the text change more frequently than others. We previously showed that variation in error rates among locations affects the estimation of distances among manuscripts (Spencer and Howe, 2001). At locations where the error rate is high, there will be a high probability of more than one change occurring between distantly related manuscripts. However, we will see at most only one difference between any pair of manuscripts at each location, and will underestimate the distance between the manuscripts. An artificial text tradition will allow us to measure the frequency distribution of changes at each location and compare it with models of distributions from biology.

Fig. 1. The true phylogeny. Artificial texts are indicated by their scribes’ initials. Edges are proportional to the observed mean number of changes per variant location (the mean character difference, scale bar 0.01 units). The first text is LM (copied directly from the printed edition).

2. Methods

2.1. Creating the text tradition

Our text was the first eight paragraphs (834 words, 49 sentences) of the medieval German poem Parzival by Wolfram von Eschenbach, in the English translation by A.T. Hatto (von Eschenbach, 1980). We chose this because it is a work of literature, short enough to be copied by volunteer scribes in a reasonable time (between 20 and 50 min), and having unusual language which we hoped would ensure enough errors for our analyses.

The first text in our artificial tradition was copied directly from the printed edition. A further 20 copies were made, each from a photocopy of a previous handwritten copy. We asked the scribes to copy their exemplars carefully and legibly, but did not inform them of the purpose of the study. We gave no guidance on whether to correct apparent mistakes in spelling and punctuation. Most of our scribes were graduate students at the University of Cambridge, and all but three were native English speakers (the exceptions were one Chinese, one German, and one Kutzki). We allocated the texts for copying haphazardly (in the technical sense of unsystematically, but without any formal randomization procedure). Fig. 1 shows the true relationships between the manuscripts. For the purposes of this paper, we will call the true relationships among a set or subset of manuscripts the ‘true phylogeny’.

2.2. Transcription and collation

We made two independent transcriptions of each manuscript into ASCII text files. We used the collation package COLLATE (Robinson, 1994) to list the differences between the transcriptions, and corrected these differences against the originals. We aligned the texts of all manuscripts using a version of the progressive multiple alignment algorithm (Spencer and Howe, submitted). The result was an array with one row per manuscript and one column per word, with the same number of columns in each manuscript. Corresponding words were placed in the same column, and gaps were introduced to represent insertions or deletions of words. We converted the alignment into the NEXUS format used by many phylogenetic programs (Maddison et al., 1997). We replaced words in the alignment by single symbols in the NEXUS file, such that if two manuscripts had the same word in a column, they shared the same symbol at the corresponding location in the NEXUS file. We included punctuation and did not regularize spelling or capitalization, because modern scribes (unlike their medieval counterparts) are likely to treat these features as significant. The alignment contained 856 columns, with at least two different readings appearing at 280 of these. One hundred and twenty-four locations were parsimony informative (at least two different readings each appeared in at least two manuscripts). The mean number of changes per column over the 20 exemplar–copy pairs was 0.03 (median 0.03, range 0.01 to 0.08). This is similar to the range for real scribes in two cases of known exemplar–copy pairs, and to an estimate of mean scribal accuracy for Lydgate’s Kings of England derived from a mathematical model (Spencer and Howe, 2002). Typical changes in the artificial texts included substitutions of one word for another (e.g. ‘horse’ for ‘hare’; ‘bit’, ‘bet’ or ‘bile’ for ‘bite’), changes in word division and hyphenation (e.g. ‘gad flies’ for ‘gadflies’; ‘out-landish’ for ‘outlandish’), and insertions or deletions of small words (e.g. ‘of’ and ‘a’).
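The two column counts reported above (variant locations and parsimony-informative locations) follow directly from the definitions in the text. The sketch below illustrates those definitions on a tiny invented alignment. It is not the procedure used in the study: the function names and the four-row alignment are hypothetical, and a gap would simply count as one more reading here.

```python
from collections import Counter


def column_readings(alignment, col):
    """Count how often each reading occurs in one column of the alignment."""
    return Counter(row[col] for row in alignment)


def is_variant(counts):
    """A column is variant if at least two different readings appear in it."""
    return len(counts) >= 2


def is_parsimony_informative(counts):
    """At least two different readings must each appear in at least two manuscripts."""
    return sum(1 for n in counts.values() if n >= 2) >= 2


def summarize(alignment):
    """Return (number of variant columns, number of parsimony-informative columns)."""
    variant = informative = 0
    for col in range(len(alignment[0])):
        counts = column_readings(alignment, col)
        variant += is_variant(counts)
        informative += is_parsimony_informative(counts)
    return variant, informative


# Invented example: one row per manuscript, one column per word.
alignment = [
    ["the", "hare", "bit", "gadflies"],
    ["the", "hare", "bite", "gadflies"],
    ["the", "horse", "bite", "gad"],
    ["the", "horse", "bet", "gadflies"],
]
print(summarize(alignment))
```

In this toy alignment only the second column is parsimony informative, because it is the only one in which two different readings are each shared by two manuscripts.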

2.3. Blind phylogenetic analysis

We gave the alignment and the NEXUS file to ACB, who attempted to reconstruct the phylogeny for the tradition. ACB was not involved in any previous stage of the study and we did not provide him with any additional information or advice.

In a number of cases (e.g. writing a whole sentence twice in some manuscripts), a sequence of gaps was apparently the result of a single scribal event. ACB therefore decided to recode these consecutive gaps as a single gap followed by missing data. Phylogenetic programs allow several different treatments of gaps. ACB used three approaches: ignoring gaps, treating them as an extra reading, or excluding entire columns that contained gaps.

ACB chose two different tree reconstruction methods: neighbor-joining on mean character difference (Saitou and Nei, 1987; Studier and Keppler, 1988) and maximum parsimony (Page and Holmes, 1998, pp. 187–193), both implemented in PAUP* (Swofford, 2001). In both methods, ACB also used a bootstrap analysis (Felsenstein, 1985), a resampling technique that indicates the level of support for each group. A large number of bootstrap replicate data sets are generated by sampling with replacement from the columns of the original alignment, and a phylogeny is reconstructed for each replicate. The results are represented as a consensus tree, which retains only those groups of manuscripts supported by at least 50% of the replicate trees. The methods used are summarized in Table 1.

Table 1
Partition distances (‘Partition’, scaled to their maximum possible values) and triplet symmetric differences (‘Triplet’) between the true phylogeny and trees reconstructed using a range of different phylogenetic methods (NJ: neighbor-joining; MP: maximum parsimony)

Method | Treatment of gaps | Partition | Triplet
NJ bootstrap 50% consensus | Ignored | 0.19 | 0.16
NJ bootstrap 50% consensus | Extra state | 0.22 | 0.28
MP bootstrap 50% consensus | Ignored | 0.22 | 0.38
MP | Ignored | 0.33, 0.39 | 0.42, 0.42
MP | Extra state | 0.33, 0.39 | 0.42, 0.42
MP | Gap/missing excluded | 0.33, 0.39 | 0.42, 0.42
NJ | Ignored | 0.39 | 0.41
NJ | Extra state | 0.39 | 0.42

Methods are sorted in order of increasing partition distance from the true phylogeny. Where two distances are given, a method produced two different trees.

All these methods of phylogeny reconstruction produce branching trees with manuscripts as terminal nodes. We used the partition distance (Penny and Hendy, 1985) to measure how different these trees are from the true phylogeny. Removing any single edge on a tree divides the set of manuscripts into two disconnected subsets. The partition distance between two trees is the number of edges on the first tree for which there is no edge on the second tree whose removal divides the manuscripts into the same two subsets. A lower partition distance indicates greater similarity. For bootstrap trees, we calculated partition distances between the consensus and the true phylogeny, rather than between the individual bootstrap replicates and the true phylogeny. We expressed this distance as a fraction of its maximum possible value, 2n − 6, where n is the number of manuscripts. Partition distance can be sensitive to the placement of a few rogue taxa, even if the rest of the trees agree closely. We therefore also calculated the triplet symmetric difference between reconstructed trees and the true phylogeny. The triplet symmetric difference is less sensitive to the placement of a few taxa, and is implemented in COMPONENT 2 (R.D.M. Page, http://taxonomy.zoology.gla.ac.uk/rod/cpw/index.html, downloaded 23/10/03). To determine whether an observed tree is more similar to the true phylogeny than would be expected by chance, we used PAUP* (Swofford, 2001) to generate random Markovian bifurcating trees having the same number of terminal nodes. We then calculated the proportion of random trees whose partition distance to the true phylogeny was less than the partition distance from the observed tree to the true phylogeny.

ACB also generated phylogenies using split decomposition (Dress et al., 1996; Huson, 1998), which we have previously applied to manuscripts of Chaucer’s Wife of Bath’s Prologue (Barbrook et al., 1998). Unlike the methods above, split decomposition does not necessarily produce tree-like phylogenies. In cases where manuscripts contain information from more than one source, split decomposition can represent these relationships as a network. The degree of reticulation in the network gives a visual impression of the extent to which a simple branching model is inadequate for the tradition. We also used NeighborNet (Bryant and Moulton, 2002, implemented in Splitstree 4 beta 2, downloaded from http://www-ab.informatik.uni-tuebingen.de/software/jsplits/welcome_en.html, 13/10/03) to reconstruct a second phylogenetic network. NeighborNet is an agglomerative method related to neighbor-joining, and produces networks that are often substantially more resolved than split decomposition networks, especially when the number of taxa is large. Since removing a single edge does not divide a network into two disconnected components, we cannot use the partition distance to measure the similarity between phylogenies produced by split decomposition or NeighborNet and the true phylogeny.
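The scaled partition distance described in this section can be sketched in a few lines. The code below is an illustration, not the PAUP* or COMPONENT implementation. It assumes each tree is supplied as a collection of groups (the manuscripts on one side of each internal edge), and it counts unmatched edges in both trees, which is the reading under which 2n − 6 is the maximum. All function names and the five-taxon example are hypothetical.

```python
def nontrivial_splits(groups, taxa):
    """Turn groups (taxa on one side of an edge) into canonical nontrivial splits."""
    taxa = frozenset(taxa)
    reference = min(taxa)  # fixed taxon used to pick one side of each split
    splits = set()
    for group in groups:
        side = frozenset(group)
        other = taxa - side
        if len(side) < 2 or len(other) < 2:
            continue  # terminal edges are shared by every tree on these taxa
        # Store the side that does not contain the reference taxon.
        splits.add(other if reference in side else side)
    return splits


def scaled_partition_distance(groups_a, groups_b, taxa):
    """Edges found in only one of the two trees, divided by the maximum 2n - 6."""
    a = nontrivial_splits(groups_a, taxa)
    b = nontrivial_splits(groups_b, taxa)
    return (len(a - b) + len(b - a)) / (2 * len(taxa) - 6)


taxa = ["A", "B", "C", "D", "E"]
tree_1 = [{"A", "B"}, {"D", "E"}]   # fully resolved
tree_2 = [{"A", "C"}, {"D", "E"}]   # fully resolved, one group differs
tree_3 = [{"D", "E"}]               # contains a polytomy

print(scaled_partition_distance(tree_1, tree_2, taxa))
print(scaled_partition_distance(tree_1, tree_3, taxa))
```

Comparing a fully resolved tree with a tree containing a polytomy always yields a nonzero distance, because the resolved tree has internal edges that the polytomy lacks.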

2.4. Phylogenetic analysis of subsets

In most real text traditions, a large but unknown proportion of the original manuscripts will have been lost. We want to know how this may affect the accuracy of reconstructions of the phylogeny. We selected random subsets of 5, 10, 15, and 20 of the transcriptions (100 replicates of each size) and used the neighbor-joining method (Studier and Keppler, 1988) to reconstruct the phylogeny for each subset. Neighbor-joining is a very fast method (which is necessary when taking many random subsets), and works with a matrix of distances between manuscripts. We used the mean character difference (also known as mean observed variant distance, the proportion of locations at which a pair of manuscripts differed). Asymptotic Jukes–Cantor and gamma-corrected distance measures, designed to deal with variation in rates of change among locations (Eqs. (15) and (20) in Spencer and Howe, 2001), were very similar because all distances were small, and are not presented here. We measured the distance between each reconstructed tree and the corresponding subset of the true phylogeny using the partition distance expressed as a fraction of its maximum possible value. We also reconstructed the phylogeny for all manuscripts using the same method.
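As an illustration of the distance used here, the following sketch computes the mean character difference for each pair of rows in an alignment and draws random subsets of manuscripts. It is a minimal sketch under stated assumptions: columns in which either manuscript has a gap are skipped (the text does not say how gaps were handled in this analysis), the gap symbol and all function names are hypothetical, and the three short rows are invented data that merely borrow scribes' initials as labels.

```python
import random
from itertools import combinations

GAP = "-"  # hypothetical gap symbol, chosen for illustration


def mean_character_difference(row_a, row_b, gap=GAP):
    """Proportion of compared locations at which two manuscripts differ."""
    compared = differing = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # assumption: ignore locations with a gap in either row
        compared += 1
        differing += a != b
    return differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Map each unordered pair of manuscript names to its distance."""
    return {
        frozenset((p, q)): mean_character_difference(alignment[p], alignment[q])
        for p, q in combinations(sorted(alignment), 2)
    }


def random_subsets(names, size, replicates, seed=0):
    """Draw `replicates` random subsets of `size` manuscripts without replacement."""
    rng = random.Random(seed)
    pool = sorted(names)
    return [sorted(rng.sample(pool, size)) for _ in range(replicates)]


# Invented rows; not the readings of the actual artificial manuscripts.
alignment = {
    "LM": ["a", "hare", "bite", "of"],
    "ND": ["a", "horse", "bite", "-"],
    "JS": ["a", "hare", "bit", "of"],
}
distances = distance_matrix(alignment)
subsets = random_subsets(alignment, size=2, replicates=100)
print(distances[frozenset(("LM", "JS"))], len(subsets))
```

Each subset's distance matrix would then be passed to a neighbor-joining implementation, which is not shown here.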

2.5. Distribution of error rates across variant locations

We previously suggested that the gamma distribution might be a good description of variation in error rates among locations (Spencer and Howe, 2001). Since we know the true phylogeny for the artificial text tradition, we can find out whether this is the case. A gamma distribution of rates with shape parameter c (small c means that most errors occur at a few locations) will result in a negative binomial distribution of number of changes per location, with the same shape parameter (Wakely, 1993). With no variation in rates, the number of changes per location will follow a Poisson distribution. We recorded the actual number of changes at each location along the true phylogeny. We then estimated the shape parameter c for the negative binomial distribution by maximum likelihood (Crawley, 1993, p. 339), and used a likelihood ratio test to compare the goodness of fit of the two distributions (Sokal and Rohlf, 1995, Appendix A.14).
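The comparison between the Poisson and negative binomial models can be sketched with the standard library alone. This is an illustration under several assumptions, not the fitting procedure cited above: the mean is fixed at the sample mean, the shape parameter c is found by a golden-section search on log c, the p-value uses the usual chi-square approximation with one degree of freedom, and the counts are invented.

```python
import math


def poisson_loglik(counts, mean):
    """Log-likelihood of the counts under a Poisson distribution."""
    return sum(k * math.log(mean) - mean - math.lgamma(k + 1) for k in counts)


def negbin_loglik(counts, mean, c):
    """Log-likelihood under a negative binomial with the given mean and shape c."""
    log_p = math.log(c / (c + mean))
    log_q = math.log(mean / (c + mean))
    return sum(
        math.lgamma(k + c) - math.lgamma(c) - math.lgamma(k + 1)
        + c * log_p + k * log_q
        for k in counts
    )


def fit_shape(counts, mean, lo=1e-3, hi=1e3, iterations=200):
    """Maximize the negative binomial log-likelihood over c (search on log c)."""
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5) - 1) / 2
    for _ in range(iterations):
        x1 = b - inv_phi * (b - a)
        x2 = a + inv_phi * (b - a)
        f1 = negbin_loglik(counts, mean, math.exp(x1))
        f2 = negbin_loglik(counts, mean, math.exp(x2))
        if f1 < f2:
            a = x1  # the maximum lies to the right of x1
        else:
            b = x2  # the maximum lies to the left of x2
    return math.exp((a + b) / 2)


def likelihood_ratio_test(counts):
    """Return (c, R, p) for negative binomial versus Poisson, one degree of freedom."""
    mean = sum(counts) / len(counts)
    c = fit_shape(counts, mean)
    r = max(0.0, 2 * (negbin_loglik(counts, mean, c) - poisson_loglik(counts, mean)))
    p = math.erfc(math.sqrt(r / 2))  # upper tail of chi-square with 1 df
    return c, r, p


# Invented counts of changes per location: many zeros, a few large values.
counts = [0] * 60 + [1] * 20 + [2] * 10 + [3] * 5 + [5] * 3 + [8] * 2
c, r, p = likelihood_ratio_test(counts)
print(round(c, 2), round(r, 1), p)
```

A small fitted c together with a large R would indicate, as in the text, that changes are concentrated at a few locations rather than spread evenly.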

3. Results

3.1. Blind phylogenetic analysis

The partition distances and triplet symmetric differences from the trees reconstructed by each method to the true phylogeny are shown in Table 1. The rankings from partition distances and triplet symmetric differences were very similar. There was no clear difference between neighbor-joining and maximum parsimony, or between different ways of treating gaps. Bootstrapping gave more accurate reconstructions because the true tree included polytomies. Without bootstrapping, parsimony and neighbor-joining resolve polytomies into binary subtrees with short and poorly supported internal edges. When we bootstrap, we obtain many different binary subtrees at these points, and the consensus is usually the correct polytomy.
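The way a 50% consensus collapses inconsistently resolved groups into a polytomy can be shown with a small sketch. It assumes each bootstrap replicate tree has already been reduced to the set of groups it contains. The function name and the five invented replicates are hypothetical, and this is not the PAUP* consensus routine.

```python
from collections import Counter


def majority_rule_consensus(replicate_trees, threshold=0.5):
    """Keep groups found in at least `threshold` of the replicates, with support in %."""
    counts = Counter(group for tree in replicate_trees for group in tree)
    n = len(replicate_trees)
    return {
        group: 100 * k / n
        for group, k in counts.items()
        if k / n >= threshold
    }


def group(*names):
    """Helper that builds a hashable group of manuscripts."""
    return frozenset(names)


# Invented replicates: A, B and C form a true polytomy, so each replicate
# resolves them differently, while the group {D, E} is recovered every time.
replicates = [
    {group("A", "B"), group("D", "E")},
    {group("A", "C"), group("D", "E")},
    {group("B", "C"), group("D", "E")},
    {group("A", "B"), group("D", "E")},
    {group("A", "C"), group("D", "E")},
]
consensus = majority_rule_consensus(replicates)
print(consensus)
```

No resolution of A, B and C reaches 50% support, so the consensus keeps only {D, E} and leaves the other three manuscripts as a polytomy. Groups at exactly 50% can conflict with one another, so a strict majority is the safer choice if compatibility must be guaranteed.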


The most accurate reconstruction (neighbor-joining bootstrap, 50% consensus, gaps ignored) is shown in Fig. 2. All edges that were present in at least half of the bootstrap samples are included in this tree, and are labelled with the percentage of bootstrap samples in which they appeared. The true phylogeny (Fig. 1) differs from the best reconstruction (Fig. 2) in the following ways:

1. IMBL, JS, LM, ND, and SYL are correctly located except that they should be internal nodes.

2. The branching order of the group {GN, SD, YT} is wrong. SD should be the ancestor of the group, but in the reconstruction YT branches off before GN and SD. The bootstrap support for this branching order was 52%.

3. The branching order of the group {JS2, KH, HS} is wrong. KH should be the ancestor of the group, but the reconstruction has HS branching off before KH and JS2, with a bootstrap support of 53%.

4. JH is grouped with ND, instead of being with HS2 and WP. The bootstrap supports for these pairs are {JH, ND} 69%, {HS2, WP} 61%.

5. {AW, CC, DL, LF, SYL, ZM} are not resolved into a group.

6. The relationships between LM (the first text in the tradition) and its descendants {DL, JS, ND}, the ancestors of each of the three major groups of manuscripts, are not resolved.

Fig. 2. The most accurate blind reconstruction of the phylogeny (neighbor-joining bootstrap, 50% consensus, gaps ignored). Edges are proportional to lengths estimated by least squares on the bootstrap consensus tree (the scale bar is a mean character difference of 0.01). Internal edges are labelled with bootstrap support percentages. Artificial texts are indicated by their scribes’ initials.

The failure to place ancestral manuscripts on internal nodes (point 1) is a consequence of the assumption by both neighbor-joining and maximum parsimony that extant manuscripts are terminal nodes. If every change introduced by an exemplar was transmitted to its copy, the exemplar would be connected to an internal node by an edge of length zero, which would correctly represent the true relationship. This does not usually happen in practice because not every change is transmitted. All the incorrect groupings (points 2, 3, and 4) have bootstrap support less than 70%. Low bootstrap support should therefore be taken as indicating unreliable groups. The failure to detect the group {AW, CC, DL, LF, SYL, ZM} (point 5) is probably due to the true edge separating DL from LM (Fig. 1) being very short. The inability of the most accurate method to resolve the relationships among larger groups of manuscripts (point 6) suggests that there are conflicting signals in the data. This is common in phylogenetic analyses of text traditions (Spencer et al., 2002, 2004a). The longest edges on the best reconstruction (leading to ZM and DF) were generated by scribes who were not native English speakers, but the third non-native English speaker (SYL) was not particularly inaccurate.

Even the reconstructions from the least successful methods were much closer to the true phylogeny than would be expected by chance. The largest scaled partition distance between any method and the true phylogeny was 0.39. Out of 1000 random bifurcating trees, the median scaled partition distance to the true phylogeny was 0.72 (range 0.61–0.72). This means that the estimated probability of obtaining by chance a tree as similar to the true phylogeny as the worst reconstruction is less than 0.001. Reconstructions from different methods were also much more similar to each other than would be expected by chance. The maximum partition distance among methods was 0.56, while the range of partition distances among 1000 random bifurcating trees was 0.72–1 (median 1).

The best fitting split decomposition phylogeny (treating gaps as an extra character state, fit=58.3; results were very similar when gaps were ignored) is shown in Fig. 3a. Three groups of manuscripts were clearly resolved ({GN, SD, YT}, {DF, IMBL}, and {HS, JS2, KH}), although in each case the true ancestral manuscript (SD, IMBL, and KH respectively) was connected to the ancestor of the group by a short edge, rather than being an internal node. All these groups had at least 99% bootstrap support in Fig. 2. All the other manuscripts radiated from a central point, and their relationships were not clearly resolved. A fit of 58.3 means that on average, distances between pairs of manuscripts on the split decomposition phylogeny underestimate the input distances by 41.7%. Thus, the split decomposition graph is not an accurate representation of the distances among manuscripts. However, it is conservative in that it fails to resolve some groupings of manuscripts, rather than forming incorrect groupings. There are no strong reticulations in the split decomposition phylogeny, which suggests that (as was in fact the case) there was no contamination in the tradition (there are some reticulations present at a scale that cannot be visually distinguished on the phylogeny).

In contrast, the NeighborNet graph (Fig. 3b) resolves much more detail. The large number of reticulations makes it difficult to compare the NeighborNet graph with the true phylogeny in detail, but it is at least approximately correct, except that ancestral manuscripts are placed on terminal nodes. In this case, the reticulations indicate conflicting data rather than recombination, because we know that the true phylogeny is a tree.

Fig. 3. (a) The best split decomposition graph (fit=58.3, gaps treated as an extra character state). (b) The NeighborNet graph for the same data. Edges are proportional to estimated lengths (the scale bars are mean character difference of 0.01 in each case). Artificial texts are indicated by their scribes’ initials.

Table 2
Mean and standard error of scaled partition distance between reconstructed trees from subsets of artificial texts and the true phylogeny for the corresponding subset

Subset size | 5 | 10 | 15 | 20 | 21
Mean scaled partition distance | 0.19 | 0.30 | 0.37 | 0.44 | 0.44
Standard error | 0.02 | 0.01 | 0.01 | 0.003 | 0

We took 100 replicate random subsets of each size, and reconstructed trees using neighbor-joining on mean character difference.

3.2. Phylogenetic analysis of subsets

The scaled partition distances between the true phylogeny and the reconstructions for each subset size are shown in Table 2. In general, reconstructions from larger subsets were more different from the true phylogeny than those from smaller subsets. This is because neighbor-joining always produces bifurcating trees, while the true tree contained several polytomies. The larger the subset, the greater the chance that one of these polytomies will be sampled. The loss of many manuscripts did not severely affect our ability to reconstruct the phylogeny for those that remained in this text tradition.

With all manuscripts included, the best subset tree was slightly worse (scaled partition distance 0.44 from the true phylogeny) than the corresponding neighbor-joining tree produced from ACB’s version of the coded data (scaled partition distance 0.39). This difference arose because ACB discarded all except the first character of long gap sequences, on the basis that large insertions or deletions may function as single events, no matter how many words are added or deleted. Insertions or deletions often occur when similar words appear close together in the text, and the scribe omits the text between them. It is probably not much more difficult to omit several lines than a few words in such cases.
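The recoding of long gap sequences can be illustrated as follows. This is a hedged sketch of the idea only: the exact rules ACB applied (for example, how long a run had to be) are not given in the text, so every run of consecutive gaps is recoded here, and the gap symbol, the missing-data symbol, and the function name are all hypothetical.

```python
GAP = "-"      # hypothetical gap symbol
MISSING = "?"  # hypothetical missing-data symbol


def recode_gap_runs(row, gap=GAP, missing=MISSING):
    """Keep the first gap of each run and mark the rest of the run as missing."""
    recoded = []
    in_run = False
    for reading in row:
        if reading == gap:
            # The first gap records the event; later gaps carry no extra weight.
            recoded.append(missing if in_run else gap)
            in_run = True
        else:
            recoded.append(reading)
            in_run = False
    return recoded


row = ["a", "-", "-", "-", "b", "-", "c"]
print(recode_gap_runs(row))
```

With this recoding, an omitted sentence contributes one difference between two manuscripts instead of one difference per omitted word.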

3.3. Distribution of error rates across variant locations

There were no changes at most locations, but the readings at a few locations changed many times (Fig. 4). The negative binomial distribution with shape parameter c = 0.38 fits the data much better than the Poisson distribution (likelihood ratio test, R = 409 with one degree of freedom, p < 0.001). The Poisson distribution predicts too few locations with no changes and with many changes, but too many with one or two changes. If the distances among manuscripts were large (many changes of reading), including the variation in error rates among locations might improve the accuracy of phylogeny reconstruction (Spencer and Howe, 2001).

Fig. 4. Observed distribution of number of changes among locations in the artificial text tradition (circles), with fitted negative binomial (solid line, shape parameter c = 0.38) and Poisson (dashed line) distributions. The mean number of changes per location is 0.68. No reading changed more than 10 times. (Horizontal axis: number of changes, 0 to 10; vertical axis: frequency, 0 to 600.)
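The comparison of Poisson and negative binomial fits described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the analysis reported here: the counts are invented for the example (they are not the data in Fig. 4), the function names are hypothetical, the shape parameter is found by a simple grid search over an assumed range, and the p-value uses the chi-squared approximation with one degree of freedom.

```python
import math


def poisson_loglik(counts, mean):
    """Log-likelihood of the counts under a Poisson distribution."""
    return sum(k * math.log(mean) - mean - math.lgamma(k + 1) for k in counts)


def negbin_loglik(counts, mean, shape):
    """Log-likelihood under a negative binomial with given mean and shape c.

    The variance is mean + mean**2 / c, so small c means strong
    variation among locations; the Poisson is the limit as c grows.
    """
    log_p0 = shape * math.log(shape / (shape + mean))
    log_q = math.log(mean / (shape + mean))
    return sum(
        math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
        + log_p0 + k * log_q
        for k in counts
    )


def fit_negbin_shape(counts, mean):
    """Grid search for the shape on a log scale from 0.001 to 1000 (assumed range)."""
    grid = [10 ** (e / 100) for e in range(-300, 301)]
    return max(grid, key=lambda c: negbin_loglik(counts, mean, c))


def likelihood_ratio_test(counts):
    """Return (shape, test statistic R, approximate p-value)."""
    mean = sum(counts) / len(counts)  # maximum likelihood mean for both models
    shape = fit_negbin_shape(counts, mean)
    stat = 2 * (negbin_loglik(counts, mean, shape) - poisson_loglik(counts, mean))
    stat = max(stat, 0.0)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-squared tail area, 1 d.f.
    return shape, stat, p_value


# Invented, overdispersed counts of changes per location (illustration only).
example_counts = [0] * 70 + [1] * 15 + [2] * 7 + [3] * 4 + [5] * 2 + [8] * 2
shape, stat, p_value = likelihood_ratio_test(example_counts)
print(f"shape c = {shape:.2f}, R = {stat:.1f}, p = {p_value:.2g}")
```

A small fitted shape together with a large test statistic indicates that a few locations change much more often than a single common rate would predict, which is the pattern described above.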

4. Discussion

Although our reconstructions were not perfect, they were close to the true phylogeny, certainly much closer than would be expected by chance. There was little to choose between neighbor-joining and maximum parsimony, and both were reasonably accurate. Furthermore, in our most successful method (neighbor-joining bootstrap), all the incorrect groupings of artificial texts had low bootstrap support. In the real world, we will therefore know which parts of a phylogeny are likely to be unreliable, even where we do not know the true phylogeny.

Simulating the evolution of manuscripts is another way to generate data for testing phylogenetic methods. Unfortunately, we do not yet have a good general mathematical model for the way changes occur when copying written text (for some attempts see Finney, 2002; Gjessing and Pierce, 1994; Spencer and Howe, 2001). Thus, our approach is the best demonstration so far that phylogenetic methods are appropriate for studying the evolution of manuscripts.

Parsimony has been widely used for reconstructing text phylogenies (e.g. Lee, 1989; O’Hara and Robinson, 1993; Robinson, 1997a; Robinson and O’Hara, 1996; Salemans, 1996, 2000; Spencer et al., 2002), probably because the principles of parsimony are similar to traditional methods of stemmatic reconstruction (e.g. Maas, 1958). There are fewer examples of distance methods applied to text phylogenies (e.g. Spencer et al., 2004a,b), despite significant early theoretical work in this field (Buneman, 1971). Both approaches are likely to give reasonable results if the true phylogeny is tree-like, and our artificial text tradition did not indicate strong superiority of one approach over the other. We might expect distance methods to be more reliable than maximum parsimony if the manuscripts are highly divergent, provided that an appropriate measure of evolutionary distance is used to correct for multiple changes (Spencer and Howe, 2001). Split decomposition did not give a well-resolved phylogeny for our artificial manuscript tradition. Nevertheless, the groups of manuscripts it did resolve were correct. This suggests that the results of previous studies using split decomposition (e.g. Barbrook et al., 1998; Mooney et al., 2001; Stolz, 2003) are likely to be conservative rather than misleading. The recently invented NeighborNet method (Bryant and Moulton, 2002) is very promising, as it appears to be able to give a quick indication of relationships among manuscripts and the level of conflicting phylogenetic signals, while overcoming the poor resolution of split decomposition when the number of manuscripts is large.

There are many other phylogenetic methods that could be applied to texts. The information-based distance used by Bennett et al. (Bennett et al., 2003; Li et al., 2001) to study the evolution of chain letters is particularly interesting because it does not require texts to be aligned. This is useful when there are a large number of rearrangements of the order of words, sentences, or longer blocks of text (for example, extant manuscripts of the Canterbury Tales have the tales and linking passages in many different orders: Spencer et al., 2003). Nevertheless, it will usually be necessary to align the text for two reasons. Firstly, medieval scribes often followed their own spelling and punctuation system rather than that of their exemplar. Unrelated manuscripts produced by scribes close together in time, training, or regional background will tend to group unless spelling and punctuation are regularized before analysis (e.g. Robinson, 1997a, pp. 72–74). Identifying corresponding locations to regularize requires alignment. Secondly, an editor usually wants to know the locations in the text at which differences occur, and again, this requires that texts are aligned.

Maximum likelihood (Felsenstein, 1981) and Bayesian (Huelsenbeck et al., 2001) phylogenetic methods are now widely used in biology, and have much stronger theoretical foundations than parsimony and distance methods. We are not yet able to use maximum likelihood and Bayesian methods for text traditions, because we do not know the probabilities of transitions between words. Estimating these probabilities directly is difficult because the number of possible words is very large, and we are unlikely ever to observe most transitions. Simple likelihood models for morphological characters have been suggested, in which all non-self-transition probabilities are equal (Lewis, 2001), but we cannot use these if the number of possible states is indeterminate. Instead, it will probably be necessary to predict transition probabilities for categories of words. Psychological experiments show that longer words are less accurately recognized than shorter words (Pelli et al., 2003), and preliminary analyses of our data suggest that the copying error rate is higher for longer words, at least for simple spelling errors (M. Spencer et al., unpublished analyses). Word length is therefore a promising predictor of the error rate, except in cases where there are no word divisions in the text, such as continuous script manuscripts of the New Testament (Metzger, 1992, p. 13). Many other factors might also be important, such as rhyming constraints in poetry.

Polytomies and cases in which both exemplar and copy survive (so that the exemplar should be placed on an internal node) cannot be reconstructed exactly by most phylogenetic methods. Provided we remember this, it will usually be possible to broadly interpret the results of a phylogenetic analysis in these cases. For example, a manuscript connected to an internal node by a very short branch (e.g. IMBL on Fig. 2) could really be the ancestor, possibly separated by several intermediate copies, of another extant manuscript (e.g. DF on Fig. 2). More detailed examination of the manuscripts will be required to determine the exact relationships in these cases, which may be more common in text traditions and archaeology than in biology.

We have not considered recombination (the production of entities containing information from more than one source) here, although it occurs in biology, linguistics, archaeology, and text traditions. Detecting and representing recombination is a major focus of phylogenetic research (e.g. Posada and Crandall, 2001a, b). Artificial text traditions with recombination could be produced in several ways. For example, one could ask scribes to make a single copy, given two or more exemplars, some of which could be made incomplete or partly illegible. Nevertheless, we do not often know exactly how or why medieval scribes made use of more than one exemplar.

In conclusion, we have shown that phylogenetic methods can accurately reconstruct the transmission of written texts copied by hand. This is important for two reasons. Firstly, it adds to the set of phylogenetic reconstructions in cases with known true relationships (e.g. Hillis et al., 1992). Secondly, reconstructing the transmission of written texts is an important part of textual criticism, in which phylogenetic methods are becoming increasingly popular.

Acknowledgements

We are very grateful to the volunteer scribes. Two anonymous referees, Barbara Bordalejo, Linne Mooney, David Parker, and Peter Robinson made many useful suggestions. This work is part of the STEMMA project, funded by the Leverhulme Trust.

References

Barbrook, A.C., Howe, C.J., Blake, N., Robinson, P., 1998. The phylogeny of The Canterbury Tales. Nature 394, 839.

Bennett, C.H., Li, M., Ma, B., 2003. Chain letters and evolutionary histories. Sci. Am. 288, 64–69.

Bryant, D., Moulton, V., 2002. NeighborNet: an agglomerative method for the construction of planar phylogenetic networks. In: Guigo, R., Gusfield, D. (Eds.), WABI 2002, Lecture Notes in Computer Science, Vol. 2452. Springer, Berlin, pp. 375–391.

Buneman, P., 1971. Filiation of manuscripts. In: Hodson, F.R., Kendall, D.G., Tautu, P. (Eds.), Mathematics in the Archaeological and Historical Sciences. Edinburgh University Press, Edinburgh, pp. 387–395.

Cameron, H.D., 1987. The upside-down cladogram: problems in manuscript affiliation. In: Hoenigswald, H.M., Wiener, L.F. (Eds.), Biological Metaphor and Cladistic Classification: An Interdisciplinary Perspective. Frances Pinter, London, pp. 227–242.

Cartlidge, N., 2001. The Canterbury Tales and cladistics. Neuphilol. Mitt. 102, 135–150.

Collard, M., Shennan, S., 2000. Processes of culture change in prehistory: a case study from the European Neolithic. In: Renfrew, C., Boyle, K. (Eds.), Archaeogenetics: DNA and the Population Prehistory of Europe. McDonald Institute for Archaeological Research, Cambridge, pp. 89–97.

Crawley, M.J., 1993. GLIM for Ecologists. Blackwell Scientific Publications, Oxford.

Dress, A., Huson, D., Moulton, V., 1996. Analyzing and visualizing sequence and distance data using SPLITSTREE. Discrete Appl. Math. 71, 95–109.

Felsenstein, J., 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376.

Felsenstein, J., 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791.

Finney, T.J., 2002. MSS: a manuscript copying simulation. Online: http://purl.org/TC/downloads/simulation.

Forster, P., Toth, A., 2003. Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European. Proc. Natl Acad. Sci. USA 100, 9079–9084.

Forster, P., Toth, A., Bandelt, H.-J., 1998. Evolutionary network analysis of word lists: visualising the relationships between Alpine Romance Languages. J. Quant. Ling. 5, 174–187.

Gjessing, H.K., Pierce, R.H., 1994. A stochastic model for the presence/absence of readings in Niorstigningar Saga. World Archaeol. 26, 268–294.

Gray, R.D., Jordan, F.M., 2000. Language trees support the express-train sequence of Austronesian expansion. Nature 405, 1052–1055.

Hanna, R., 2000. The application of thought to textual criticism in all modes – with apologies to A.E. Housman. Stud. Bibliogr. 53, 163–172.

Hillis, D.M., Bull, J.J., White, M.E., Badgett, M.R., Molineux, I.J., 1992. Experimental phylogenetics: generation of a known phylogeny. Science 255, 589–592.

Hoelzer, G.A., Melnick, D.J., 1994. Patterns of speciation and limits to phylogenetic resolution. Trends Ecol. Evol. 9, 104–107.

Holden, C.J., 2002. Bantu language trees reflect the spread of farming across sub-Saharan Africa: a maximum-parsimony analysis. Proc. R. Soc. London Ser. B. 269, 793–799.

Howe, C.J., Barbrook, A.C., Spencer, M., Robinson, P., Bordalejo, B., Mooney, L.R., 2001. Manuscript evolution. Trends Genet. 17, 147–152.


Huelsenbeck, J.P., Ronquist, F., Nielsen, R., Bollback, J.P., 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 2310–2314.

Huson, D.H., 1998. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14, 68–73.

Jones, A., 2001. The properties of a stemma: relating the manuscripts in two texts from The Canterbury Tales. Parergon 18, 35–53.

Lee, A.R., 1989. Numerical taxonomy revisited: John Griffith, cladistic analysis and St. Augustine’s Quaestiones in Heptateuchem. Stud. Patrist. 20, 24–32.

Lewis, P.O., 2001. A likelihood approach to estimating phylogeny from discrete morphological character data. Syst. Biol. 50, 913–925.

Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P., Zhang, H., 2001. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 149–154.

Maas, P., 1958. Textual Criticism. Oxford University Press, Oxford.

Maddison, D.R., Swofford, D.L., Maddison, W.P., 1997. NEXUS: an extensible file format for systematic information. Syst. Biol. 46, 590–621.

McMahon, A., McMahon, R., 2003. Finding families: quantitative methods in language classification. T. Philol. Soc. 101, 7–55.

Metzger, B.M., 1992. The Text of the New Testament: Its Transmission, Corruption, and Restoration. Oxford University Press, New York.

Mooney, L.R., Barbrook, A.C., Howe, C.J., Spencer, M., 2001. Stemmatic analysis of Lydgate’s “Kings of England”: a test case for the application of software developed for evolutionary biology to manuscript stemmatics. Rev. Hist. Textes 31, 275–297.

O’Brien, M.J., Lyman, R.L., 2002. Evolutionary archaeology: current status and future prospects. Evol. Anthropology 11, 26–36.

O’Brien, M.J., Darwent, J., Lyman, R.L., 2001. Cladistics is useful for reconstructing archaeological phylogenies: palaeoindian points from the Southeastern United States. J. Archaeol. Sci. 28, 1115–1136.

O’Hara, R., Robinson, P., 1993. Computer-assisted methods of stemmatic analysis. In: Blake, N., Robinson, P. (Eds.), The Canterbury Tales Project Occasional Papers, Vol. 1. Office for Humanities Communication Publications, London, pp. 53–74.

Page, R.D.M., Holmes, E.C., 1998. Molecular Evolution: A Phylogenetic Approach. Blackwell Science, Oxford.

Pelli, D.G., Farell, B., Moore, D.C., 2003. The remarkable inefficiency of word recognition. Nature 423, 752–756.

Penny, D., Hendy, M.D., 1985. The use of tree comparison metrics. System. Zool. 34, 75–82.

Posada, D., Crandall, K.A., 2001a. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc. Natl Acad. Sci. USA 98, 13757–13762.

Posada, D., Crandall, K.A., 2001b. Intraspecific gene genealogies: trees grafting into networks. Trends Ecol. Evol. 16, 37–45.

Robinson, P., 1997a. A stemmatic analysis of the fifteenth-century witnesses to The Wife of Bath’s Prologue. In: Blake, N., Robinson, P. (Eds.), The Canterbury Tales Project: Occasional Papers, Vol. II. Office for Humanities Communication Publications, London, pp. 69–132.

Robinson, P.M.W., 1994. Collate: Interactive Collation of Large Textual Traditions. Oxford University Centre for Humanities Computing, Oxford.

Robinson, P.M.W., 1997b. New directions in critical editing. In: Sutherland, K. (Ed.), Electronic Text: Investigations in Method and Theory. Clarendon Press, Oxford, pp. 145–171.

Robinson, P.M.W., O’Hara, R.J., 1996. Cladistic analysis of an Old Norse manuscript tradition. In: Hockey, S., Ide, N. (Eds.), Research in Humanities Computing, Vol. 4. Oxford University Press, Oxford, pp. 115–137.

Robson-Brown, K., 1999. Cladistics as a tool in comparative analysis. In: Lee, P.C. (Ed.), Comparative Primate Socioecology. Cambridge University Press, Cambridge, pp. 23–43.

Saitou, N., Nei, M., 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425.

Salemans, B.J.P., 1996. Cladistics or the resurrection of the method of Lachmann: on building the stemma of Yvain. In: van Reenen, P., van Mulken, M. (Eds.), Studies in Stemmatology. John Benjamins Publishing Company, Amsterdam, pp. 3–70.

Salemans, B.J.P., 2000. Type-2 Variations: A Necessary Evil? Stemmatology Meeting. Free University, Amsterdam.

Slowinski, J.B., 2001. Molecular polytomies. Mol. Phylogenet. Evol. 19, 114–120.

Sokal, R.R., Rohlf, F.J., 1995. Biometry. W.H. Freeman & Co., New York.

Spencer, M., Howe, C.J., 2001. Estimating distances between manuscripts based on copying errors. Lit. Ling. Comp. 16, 467–484.

Spencer, M., Howe, C.J., 2002. How accurate were scribes? A mathematical model. Lit. Ling. Comp. 17, 311–322.

Spencer, M., Howe, C.J., submitted. Collating texts using progressive multiple alignment. Comput. Humanities.

Spencer, M., Wachtel, K., Howe, C.J., 2002. The Greek vorlage of the Syra Harclensis: a comparative study on method in exploring textual genealogy. TC: a Journal of Biblical Textual Criticism [http://purl.org/TC] 7.

Spencer, M., Bordalejo, B., Wang, L.-S., Barbrook, A.C., Mooney, L.R., Robinson, P., Warnow, T., Howe, C.J., 2003. Analyzing the order of items in manuscripts of The Canterbury Tales. Comput. Humanities 37, 97–109.

Spencer, M., Bordalejo, B., Robinson, P., Howe, C.J., 2004a. How reliable is a stemma? An analysis of Chaucer’s Miller’s Tale. Lit. Ling. Comp., in press.

Spencer, M., Mooney, L.R., Barbrook, A.C., Bordalejo, B., Howe, C.J., Robinson, P., 2004b. The effects of weighting kinds of variants. In: van Reenen, P., den Hollander, A. (Eds.), Studies in Stemmatology II. John Benjamins Publishing Company, Amsterdam.

Stolz, M., 2003. New philology and new phylogeny: aspects of a critical electronic edition of Wolfram’s Parzival. Lit. Ling. Comp. 18, 139–150.

Studier, J.A., Keppler, K.J., 1988. A note on the neighbor-joining algorithm of Saitou and Nei. Mol. Biol. Evol. 5, 729–731.

Swofford, D.L., 2001. PAUP*. Phylogenetic Analysis Using Parsimony (*and other methods). Sinauer Associates, Sunderland, MA.

Tehrani, J., Collard, M., 2002. Investigating cultural evolution through biological phylogenetic analyses of Turkmen textiles. J. Anthropol. Archaeol. 21, 443–463.

von Eschenbach, W., 1980. Parzival. Penguin Books, London.

Wakely, J., 1993. Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. J. Mol. Evol. 37, 613–623.

Weitzman, M.P., 1987. The evolution of manuscript traditions. J. R. Stat. Soc. Ser. A 150, 287–308.