

Collating Texts Using Progressive Multiple Alignment

MATTHEW SPENCER¹ and CHRISTOPHER J. HOWE²

¹ Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, B3H 3J5, Canada
E-mail: [email protected]
² Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1QW, UK

Computers and the Humanities 38: 253–270, 2004. © 2004 Kluwer Academic Publishers. Printed in the Netherlands.

Abstract. To reconstruct a stemma or do any other kind of statistical analysis of a text tradition, one needs accurate data on the variants occurring at each location in each witness. These data are usually obtained from computer collation programs. Existing programs either collate every witness against a base text or divide all texts up into segments as long as the longest variant phrase at each point. These methods do not give ideal data for stemma reconstruction. We describe a better collation algorithm (progressive multiple alignment) that collates all witnesses word by word without a base text, adding groups of witnesses one at a time, starting with the most closely related pair.

Key words: dynamic programming, multiple alignment, stemma reconstruction, text collation, variants

1. Introduction

Collation is used in the preparation of a critical edition from a set of witnesses. Making a collation is time consuming, and there have been many attempts to automate it (see the bibliography in Sabourin, 1994). The result is an apparatus listing the variants occurring at each location in the text, which may then be used in further analyses, such as the reconstruction of a stemma. Before reconstructing a stemma (or doing any other kind of analysis), the information on variant readings obtained in the collation must be encoded in a computer-readable form. We require a set of discrete variants, each having a different location in the text, such that within any variant, two witnesses share the same code if and only if they have the same reading. For example, consider the witnesses X, Y, and Z, and a possible coding:

X = 'A text',          coding 00
Y = 'Another version', coding 11
Z = 'A reading',       coding 02


The first column of the coding indicates that X and Z share the reading 'A', while Y has the reading 'Another'. The second column indicates that each of the witnesses has a different reading at this point. The encoding is easy if the witnesses are collated into a matrix such that each row i is a witness, each column j is a variant, and the word or phrase at i, j is the reading in witness i at variant j (e.g. Thorpe, 2002, paragraphs 22–28). Here, we show that existing computer collation methods do not achieve this goal, and describe a better approach known as progressive multiple alignment.
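As a concrete illustration of this encoding (an illustration only, not part of the original article), the following Python sketch assigns such codes to a small collation matrix; the function name encode_collation and the data layout are our own assumptions.

```python
def encode_collation(matrix):
    """Encode a collation matrix (one row per witness, one column per
    variant location) as integer codes: within each column, witnesses
    that share a reading receive the same code."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    codes = [[None] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        seen = {}                        # reading -> code for this column
        for i in range(n_rows):
            reading = matrix[i][j]
            if reading not in seen:
                seen[reading] = len(seen)
            codes[i][j] = seen[reading]
    return codes

# Witnesses X, Y, and Z from the example above, split word by word.
collation = [
    ["A", "text"],           # X
    ["Another", "version"],  # Y
    ["A", "reading"],        # Z
]
print(encode_collation(collation))   # [[0, 0], [1, 1], [0, 2]]
```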

2. Collating Against a Base Text

Most practical collation algorithms collate each witness against a base text (Ott, 1992, pp. 219–220; Robinson, 1994a, p. 5; Salemans, 2000, p. 106). The list of variants between each witness and the base text is sorted and merged to produce an apparatus (Ott, 1979). Collation against a base text is the only practical option when collating by hand, because the time needed to compare every witness with every other would be prohibitive, and the witnesses may not all be available at the same time. Generally, each manuscript is compared with the same printed edition, and the differences recorded (see West, 1973, pp. 66–67). The familiarity of this approach may be one reason why collation against a base text is also used by the powerful software packages Collate 2 (Robinson, 1994b) and TUSTEP (Ott, 2000). These packages have been used to produce many critical editions. Collate allows the user to control exactly how the text will be broken up into variants, while TUSTEP automates the entire process.

Choosing only one witness against which to compare all the others is not an ideal solution when we want to construct a stemma. Let v_{X,Y} be a collation of two witnesses X and Y. Suppose we have an algorithm that maximizes the quality Q of a collation of a set of N witnesses, where Q is some function of all N(N-1)/2 pairwise comparisons among witnesses in the collation:

Q = f(v_{1,2}, v_{1,3}, \ldots, v_{N-1,N})    (1)

If we choose one witness B as the base text, and collate every other witness against B, we will be maximizing a function Q' of only N-1 of these pairwise comparisons:

Q' = f(v_{B,1}, v_{B,2}, \ldots, v_{B,B-1}, v_{B,B+1}, \ldots, v_{B,N})    (2)

When preparing an apparatus, Q' might be a good measure of quality, because we only want to represent differences from a base text that we have already chosen. When constructing a stemma, we cannot assume that the relationships among most witnesses are unimportant. We also cannot assume that maximizing Q' will maximize Q, because the terms in Equation (1) are not independent. Choosing the best possible collation v_{1,2} of witnesses 1 and 2, and independently choosing the best possible collation v_{1,3} of witnesses 1 and 3, may result in a poor collation v_{2,3} of witnesses 2 and 3. The situation is similar if we construct a composite base text that is not in the set of witnesses to be collated. Furthermore, we would then need an algorithm for constructing a suitable composite, and we are not aware of the existence of any such algorithm.

For example, consider the three texts V = 'abcdef', W = 'xyzde' and X = 'abxyz' (for brevity, we are replacing words with single characters). Let the base text be V. Then a reasonable pairwise collation of V against W (representing insertions or deletions of words using the gap symbol '-') is

V = abcdef
W = xyzde-

The corresponding words 'd' and 'e' are lined up. Similarly, a reasonable pairwise collation of V against X, in which the corresponding words 'a' and 'b' are lined up, is

V = abcdef
X = abxyz-

Putting these together gives the collation

V = abcdef
W = xyzde-
X = abxyz-

However, we have missed the shared words 'x', 'y', and 'z' in all witnesses other than the base text. A more sensible collation would be

V = a b - c - d e f
W = - - x y z d e -
X = a b x y z - - -

To obtain such a collation, we need to consider the relationships among all witnesses, not just the relationships between each witness and the base text (Salemans, 2000, p. 82, note 65, p. 106, note 93).

A further problem is that we may not be able to identify the corresponding locations in witnesses if each has been compared to a base text. Our initial experiments with phylogenetic analysis of texts (e.g. Barbrook et al., 1998) used NEXUS files (Maddison et al., 1997) generated by Collate. A NEXUS file encodes variants in a suitable form for phylogenetic analyses. Each column in the file corresponds to a variant (either located automatically or specified by the user), and each row to a witness. Within a column, each different symbol corresponds to a different reading found in one or more witnesses. Unfortunately, because each witness is only compared to a base text, the resulting codes do not always convey the right information. Table I shows how Collate encodes the first few words of a collation of the Miller's Tale in Chaucer's Canterbury Tales (manuscripts identified by the sigils listed in Blake and Robinson, 1997, pp. 180–181). This collation was made by the Canterbury Tales Project (De Montfort University, Leicester) in order to produce an apparatus listing differences from a constructed base text. The numbered columns in Table I panel B correspond to the variants:

1. 'Whilom' vs. 'Some tyme'
2. 'ther' vs. 'thet'
3. 'ther' vs. omission of 'ther'
4. 'was dwelling' vs. 'dwellid'
5. Absence or presence of deleted 'ther was' before 'dwelling'

Table I. Panel A: Extant versions of the start of the Miller's Tale, Chaucer's Canterbury Tales (other witnesses differ only in spelling, regularized here) and the collation base text (data from the Canterbury Tales Project, De Montfort University, Leicester). Panel B: The corresponding NEXUS encoding from Collate 2 ('?' indicates missing data). Sigils as in Blake and Robinson (1997, pp. 180–181).

Panel A

  Base:  Whilom ther was dwelling
  Ad1:   Whilom ther was dwelling
  He:    Whilom ther dwellid
  Hk:    Whilom was dwelling
  La:    Whilom ther was [ther was deleted] dwelling
  Pw:    Whilom thet was dwelling
  Ra1:   Some tyme ther was dwelling

Panel B

          1 2 3 4 5
  Base    0 0 0 0 0
  Ad1     0 0 0 ? 0
  He      0 0 0 1 ?
  Hk      0 ? 1 ? 0
  La      0 0 0 ? 1
  Pw      0 1 ? ? 0
  Ra1     1 0 0 ? 0

There are many missing data ('?') in Table I panel B, even though none of the witnesses is unreadable here. The missing data occur because each witness has been compared only with the base text, not with each other witness. For example, He differs from the base text at the fourth location, in having 'dwellid' instead of 'was dwelling'. A difference is recorded between the base text (which has a 0 in column 4) and He (which has a 1 in column 4). All the other witnesses are identical to the base text at this point, so no variant information is recorded for them. Collate does not know what location in the other witnesses corresponds to the difference between 'dwellid' and 'was dwelling'. This leads to a very counter-intuitive coding, in which Ad1 contains identical words to the base text but is coded differently. Furthermore, the information on the difference between He and all other witnesses at this location has been lost, because all other witnesses have missing data in column 4. This is clearly unsatisfactory.

3. Parallel Segmentation

Another method is to align the variants in all witnesses in segments whose length is that of the longest variant found in any witness, then treat each segment as a single variant. This is known as parallel segmentation, and is implemented in Collate, although there is no published description of the algorithm. Parallel segmentation is useful because word variants naturally occur in the context of phrases. However, it is not ideal for all kinds of analysis. The variants that are generated can be long phrases such as 'Whan that men sholde haue droghte, or ellis' and 'If men sholde haue droghte or' (from the Miller's Tale). Treating each phrase as an equally different reading loses information, which would be retained if comparisons were made word by word. The Text Encoding Initiative guidelines for critical apparatus (Sperberg-McQueen and Burnard, 2002) comment that parallel segmentation "will become less convenient as traditions become more complex and tension develops between the need to segment on the largest variation found and the need to express the finest detail of agreement between witnesses".

4. Progressive Multiple Alignment

Neither collation against a base text nor parallel segmentation is entirely suitable for reconstructing a stemma, although both are useful for producing critical editions. We now describe a different approach, in which all witnesses are included in a word-by-word collation which is built up step by step, starting with those witnesses that are most closely related on a guide tree. This is based on the progressive multiple alignment method (Feng and Doolittle, 1987; Durbin et al., 1998, pp. 143–146), which is widely used in bioinformatics to line up nucleic acid or protein sequences so that items (components of the sequence) falling in the same column have an evolutionary relationship. We assume that all the texts we wish to align are related by descent. Except in special situations where the vocabulary is very restricted, the expected similarity between unrelated texts is so low that we will not have any difficulty in recognizing them. In most cases, we will want to include all available transcriptions of extant manuscripts in the tradition. This does mean that the collation will be dominated by the most abundant groups of closely related witnesses (Notredame, 2002, p. 140). In very large traditions such as the Greek New Testament, one might therefore try to select witnesses so that the whole tradition is evenly sampled.

There are three steps:

1. Collate each pair of witnesses using a dynamic programming algorithm, and obtain an approximate measure of pairwise distance.

2. Build a guide tree using the distances from step 1.

3. Starting with the most closely related pair of witnesses, collate witnesses or groups of already-collated witnesses on the guide tree until all are included in a single collation. The reason for starting with the most closely related pair of witnesses is that we have to hypothesize gaps (insertions or omissions of words) in order to line up witnesses containing different numbers of words. If a pair of witnesses are very similar, the locations of gaps are much more certain than if the witnesses are very different. Once gaps have been introduced, they are not altered by the addition of less closely related witnesses to the collation.

4.1. PAIRWISE COLLATION USING DYNAMIC PROGRAMMING

In a pairwise collation, we want to arrange the two texts with one row per text and one column per location, such that corresponding words occupy the same columns in both texts. In contrast, collation algorithms such as OPCOL (Cannon, 1976) identify the shortest possible variant phrases, separated by regions in which the texts agree. Our problem is analogous to calculating the minimum edit distance between a pair of strings (where edit distance is some function of the number of additions, deletions and substitutions of characters needed to transform one string into the other), except that we are matching word by word rather than character by character. For simplicity, we treat transpositions of words as pairs of substitutions. The edit distance problem has received a considerable amount of attention in computer science (reviewed in Kruskal, 1983; Navarro, 2001) and biology (Durbin et al., 1998, pp. 17–22). There are closely related problems in the alignment of bilingual texts (Manning and Schutze, 1999, Section 13.1). We use a dynamic programming solution (Gotoh, 1982).

We first define the score for a given pairwise collation, such that higher-scoring collations require fewer additions, deletions and substitutions. We think that insertions or deletions of words are relatively rare, so a better collation is one that requires fewer of these events. Let the cost of an addition or deletion of a word from either text be -g. Although we do not know much about the processes that generate scribal changes, we do know that most spelling errors result in words that only differ from the correct form by one insertion, deletion, substitution, or transposition of individual letters (Kukich, 1992, Section 2.1.1). It is therefore reasonable to assume that words that appear at corresponding locations in a text tradition should be similar. Other collation programs make the same assumption (Robinson, 1989; Ott, 1992, p. 218).

Let the cost or benefit of aligning two words in the same column be s, where s is positive if the words are very similar and negative if they are very different. We could base s on edit distance in its usual form, but n-gram distance (Ukkonen, 1992) is faster to calculate and has been successfully used in spelling correction and information retrieval problems (Kukich, 1992; Robertson and Willett, 1998). The set of n-grams for a word is the set of all substrings of n adjacent characters. For our application, we use n = 2. For example, the word 'ther' has the 2-grams 'th', 'he', and 'er'. A fast approximation to the 2-gram distance D_2(x, y) between two words x and y (the sum of absolute differences in the number of occurrences of each 2-gram that occurs in either word) is

D'_2(x, y) = |G(x)| + |G(y)| - 2 |G(x) \cap G(y)|    (3)

(Petrakis and Tzeras, 2000) where G(x) is the set of all (not necessarily different) 2-grams that occur in x. We use this to calculate a similarity s with the range [-1, 1], where words sharing no 2-grams score -1, and words with identical sets of 2-grams score 1.

s = 1 - \frac{2 D'_2(x, y)}{|G(x)| + |G(y)|}    (4)

For example, s('ther', 'thet') is 1/3. In the first implementation described here, a score of 1 does not necessarily mean the words are identical: for example, 'gag' and 'aga' have the same 2-grams. Our most recent implementation adds padding characters (represented here by '*') at the start and end of words: '*gag*' and '*aga*' are then distinguishable (Robertson and Willett, 1998).
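As a hedged illustration of Equations (3) and (4) (a sketch in Python rather than the authors' Matlab implementation; the function names are ours), the 2-gram similarity can be computed as follows. The optional pad flag corresponds to the padded variant just described.

```python
def two_grams(word, pad=False):
    """List (multiset) of 2-grams of a word; optional '*' padding at both
    ends distinguishes, for example, 'gag' from 'aga'."""
    if pad:
        word = "*" + word + "*"
    return [word[i:i + 2] for i in range(len(word) - 1)]

def similarity(x, y, pad=False):
    """2-gram similarity s in [-1, 1] from Equations (3) and (4)."""
    gx, gy = two_grams(x, pad), two_grams(y, pad)
    remaining = list(gy)
    shared = 0
    for g in gx:                       # multiset intersection |G(x) and G(y)|
        if g in remaining:
            remaining.remove(g)
            shared += 1
    d2 = len(gx) + len(gy) - 2 * shared          # D'_2(x, y), Equation (3)
    return 1 - 2 * d2 / (len(gx) + len(gy))      # s, Equation (4)

print(similarity("ther", "thet"))      # 0.333..., the 1/3 quoted in the text
```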

Given two texts X and Y containing words x_1, ..., x_m and y_1, ..., y_n, we construct an (m + 1) × (n + 1) matrix C whose elements c_{ij} contain the score of the best possible collation of words 1 to i of X and words 1 to j of Y. The rows and columns of C are indexed from 0 to m and n, respectively. Setting c_{0,0} to 0, we can calculate all other elements of C iteratively by choosing the alternative with the highest score from aligning two words x_i and y_j, adding a gap in X, and adding a gap in Y:

c_{ij} = \max \begin{cases} c_{i-1,j-1} + s(x_i, y_j) \\ c_{i,j-1} - g \\ c_{i-1,j} - g \end{cases}    (5)


We set g (the cost of a gap) to 1, so we add a gap if it allows us more than the equivalent of one extra perfect match. Table II panel A shows the C matrix for a simple example. When doing all possible pairwise collations, we are interested only in the resulting dissimilarities, but if necessary we can recover the sequence of words and gaps by recording the decision made at each step in Equation (5), and tracing the optimal path backwards from c_{mn}. When there is more than one equally good alternative, we choose one at random. This method requires the 2-gram distance between every pair of words, and runs in worst-case time proportional to m × n. There are faster but more complex methods (Navarro, 2001).
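A minimal Python sketch of the recurrence in Equation (5) and the traceback just described is given below (again an illustration, not the published implementation). It reuses similarity() from the previous sketch, uses '-' as the gap symbol, and breaks ties in a fixed order rather than at random.

```python
def pairwise_collate(x_words, y_words, g=1.0):
    """Dynamic programming collation of two word lists (Equation (5)),
    returning the two gapped rows of the optimal pairwise collation."""
    m, n = len(x_words), len(y_words)
    # c[i][j] = score of the best collation of the first i words of X
    # and the first j words of Y
    c = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        c[i][0] = -g * i
    for j in range(1, n + 1):
        c[0][j] = -g * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c[i][j] = max(
                c[i - 1][j - 1] + similarity(x_words[i - 1], y_words[j - 1]),
                c[i][j - 1] - g,     # gap in X
                c[i - 1][j] - g,     # gap in Y
            )
    # trace the optimal path back from c[m][n]
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                c[i][j] == c[i - 1][j - 1] + similarity(x_words[i - 1], y_words[j - 1])):
            row_x.append(x_words[i - 1])
            row_y.append(y_words[j - 1])
            i -= 1
            j -= 1
        elif j > 0 and c[i][j] == c[i][j - 1] - g:
            row_x.append("-")
            row_y.append(y_words[j - 1])
            j -= 1
        else:
            row_x.append(x_words[i - 1])
            row_y.append("-")
            i -= 1
    return row_x[::-1], row_y[::-1]

ad1 = "Whilom ther was dwelling".split()
he = "Whilom ther dwellid".split()
print(pairwise_collate(ad1, he))
# (['Whilom', 'ther', 'was', 'dwelling'], ['Whilom', 'ther', '-', 'dwellid'])
```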

As an approximate measure H_{XY} of distance between the aligned sequences, we count the number of words (excluding gaps, because we have not yet chosen the definitive locations for them) that match exactly, divide it by the number of columns in the collation at which neither text has a gap, and make an approximate correction for the frequency of multiple changes at the same location (Spencer and Howe, 2001):

H_{XY} = -\ln\left(\frac{\#\,\text{perfect matches}}{\#\,\text{columns without gaps}}\right)    (6)

If the number of possible words at a location is very large and transitions among words occur with equal frequencies, this is an estimate of the actual number of changes that occurred at this location along the line of transmission relating a pair of manuscripts.
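Continuing the same sketch, the distance of Equation (6) can be computed from the two gapped rows returned by pairwise_collate() above (illustrative only):

```python
import math

def distance(row_x, row_y):
    """Approximate distance H_XY of Equation (6) between two gapped rows."""
    gap_free = [(a, b) for a, b in zip(row_x, row_y) if a != "-" and b != "-"]
    perfect = sum(1 for a, b in gap_free if a == b)
    return -math.log(perfect / len(gap_free))

row_ad1, row_he = pairwise_collate(ad1, he)
print(round(distance(row_ad1, row_he), 4))   # 0.4055 = -ln(2/3)
```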

Table II panel B shows pairwise distances for the texts in Table I panel A (excluding the base text). The distances between the pairs {Ad1, Hk}, {Ad1, La}, {Hk, La} and {Hk, Pw} are zero because in each case the only differences are gaps. It is unlikely that this would occur when analyzing longer texts.

Table II. Panel A: Pairwise collation by dynamic programming of the texts Ad1 and He from Table I panel A. The completed matrix C is shown, with the entries on the optimal path marked '*'. The optimal collation requires the introduction of a gap after 'ther' in He and gives a distance H_{XY} = -ln(2/3) = 0.4055 from Equation (6). Panel B: Matrix of pairwise distances H_{XY} (Equation (6)) for the texts in Table I panel A (excluding the base text).

Panel A

              (He)    Whilom   ther    dwellid
  (Ad1)        0*      -1       -2      -3
  Whilom      -1        1*       0      -1
  ther        -2        0        2*      1
  was         -3       -1        1*      1
  dwelling    -4       -2        0       1.5385*

Panel B

         He       Hk       La       Pw       Ra1
  Ad1    0.4055   0        0        0.2877   0.2877
  He              1.0986   0.4055   1.0986   1.0986
  Hk                       0        0        0.4055
  La                                0.2877   0.2877
  Pw                                         0.6931

4.2. RECONSTRUCTION OF A GUIDE TREE

We use the fast neighbor-joining algorithm (Studier and Keppler, 1988) to construct an unrooted stemma (guide tree) from the matrix of all pairwise distances H_{XY}. The guide tree does not need to be particularly accurate, as it is only used to choose the order in which groups of witnesses are added to the collation. Figure 1 shows the guide tree for the texts in Table I panel A, constructed using the distances in Table II panel B (some estimated edge lengths are negative on the guide tree because of the zero distances in Table II panel B, but this is less likely to occur in longer samples of text).
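For readers who want to experiment, the sketch below implements the Studier and Keppler (1988) formulation of neighbor-joining in Python with NumPy and applies it to the distances in Table II panel B. It is our own illustrative code, not the authors' implementation; the internal node names ('node6', 'node7', ...) are arbitrary placeholders, and the returned edge list is just one possible representation of the guide tree.

```python
import numpy as np

def neighbor_joining(D, names):
    """Build an unrooted guide tree from a distance matrix D, returned as a
    list of edges (node_a, node_b, branch_length)."""
    D = np.asarray(D, dtype=float)
    nodes = list(names)
    edges = []
    next_id = len(nodes)
    while len(nodes) > 2:
        n = D.shape[0]
        r = D.sum(axis=1)
        # Studier-Keppler criterion: join the pair minimizing Q
        Q = (n - 2) * D - r[:, None] - r[None, :]
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        # branch lengths from the joined pair to the new internal node
        d_i = 0.5 * D[i, j] + (r[i] - r[j]) / (2 * (n - 2))
        d_j = D[i, j] - d_i
        new = f"node{next_id}"
        next_id += 1
        edges.append((nodes[i], new, d_i))
        edges.append((nodes[j], new, d_j))
        # distances from the new node to all remaining nodes
        d_new = 0.5 * (D[i] + D[j] - D[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        D_next = np.zeros((len(keep) + 1, len(keep) + 1))
        D_next[:-1, :-1] = D[np.ix_(keep, keep)]
        D_next[:-1, -1] = D_next[-1, :-1] = d_new[keep]
        nodes = [nodes[k] for k in keep] + [new]
        D = D_next
    edges.append((nodes[0], nodes[1], D[0, 1]))
    return edges

# Distances from Table II panel B (symmetric matrix, zero diagonal).
names = ["Ad1", "He", "Hk", "La", "Pw", "Ra1"]
D = [[0,      0.4055, 0,      0,      0.2877, 0.2877],
     [0.4055, 0,      1.0986, 0.4055, 1.0986, 1.0986],
     [0,      1.0986, 0,      0,      0,      0.4055],
     [0,      0.4055, 0,      0,      0.2877, 0.2877],
     [0.2877, 1.0986, 0,      0.2877, 0,      0.6931],
     [0.2877, 1.0986, 0.4055, 0.2877, 0.6931, 0]]
for a, b, length in neighbor_joining(D, names):
    print(f"{a} -- {b}: {length:.3f}")
```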

4.3. MULTIPLE COLLATION USING THE GUIDE TREE

We can be more certain about the locations of gaps when a pair of witnesses are very similar than when they are quite different. We therefore build up a multiple collation step by step, starting with the most closely related pair of neighboring witnesses on the guide tree (the pair connected directly to the same internal node by the shortest sum of edge lengths), and leaving the location of any existing gaps unchanged. After each pairwise collation, we delete the pair that has been collated from the guide tree and place the collation at the internal node to which both members of the pair were connected. For example, in Figure 1, we first collate Hk and Pw and place the result at the internal node (labeled {Hk, Pw} in Figure 1) to which they are both directly connected. Next, Ad1 and He are collated and their collation placed at the internal node {Ad1, He}. Then La and {Ad1, He} are collated, followed by {Hk, Pw} and {Ad1, He, La}. The final step is to collate the two remaining groups, Ra1 and {Ad1, He, Hk, La, Pw}.
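The ordering rule described in this paragraph can be sketched as follows, assuming the guide tree is given as the edge list produced by the neighbor_joining() sketch above. This is illustrative only: it schedules which groups are merged at each step (cheapest pair attached to the same internal node first) but does not perform the collation itself.

```python
from collections import defaultdict

def merge_order(edges):
    """Order in which witness groups are merged on the guide tree: repeatedly
    take the two current groups attached to the same internal node by the
    smallest total edge length, and let the merged group replace them at
    that node."""
    adj = defaultdict(dict)
    for a, b, length in edges:
        adj[a][b] = length
        adj[b][a] = length
    # witness groups start at the terminal (degree-one) nodes
    groups = {node: frozenset([node]) for node in adj if len(adj[node]) == 1}
    schedule = []
    while len(groups) > 1:
        if len(groups) == 2:             # last two groups: just merge them
            u, v = groups
            schedule.append((groups[u], groups[v]))
            break
        best = None
        for hub in adj:                  # cheapest pair sharing an internal node
            attached = [n for n in adj[hub] if n in groups]
            for i in range(len(attached)):
                for j in range(i + 1, len(attached)):
                    u, v = attached[i], attached[j]
                    cost = adj[hub][u] + adj[hub][v]
                    if best is None or cost < best[0]:
                        best = (cost, u, v, hub)
        _, u, v, hub = best
        schedule.append((groups[u], groups[v]))
        groups[hub] = groups.pop(u) | groups.pop(v)   # merged group sits at the hub
        for node in (u, v):                           # detach the merged pair
            del adj[hub][node]
            del adj[node]
    return schedule

for left, right in merge_order(neighbor_joining(D, names)):
    print(sorted(left), "+", sorted(right))
```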

When collating groups containing more than one witness, we modify Equation (5) to work with the average scores over all witnesses in each group:

c_{ij} = \max \begin{cases} c_{i-1,j-1} + \bar{s}_{ij} \\ c_{i,j-1} - \bar{g}_j \\ c_{i-1,j} - \bar{g}_i \end{cases}    (7)

where


\bar{s}_{ij} = \frac{1}{ab} \sum_{1 \le k \le a} \sum_{1 \le l \le b} \begin{cases} 0 & \text{if either } x_{ki} \text{ or } y_{lj} \text{ has a gap} \\ s(x_{ki}, y_{lj}) & \text{otherwise} \end{cases}

\bar{g}_j = \frac{1}{b} \sum_{1 \le l \le b} \begin{cases} 0 & \text{if } y_{lj} \text{ has a gap} \\ g & \text{otherwise} \end{cases}

\bar{g}_i = \frac{1}{a} \sum_{1 \le k \le a} \begin{cases} 0 & \text{if } x_{ki} \text{ has a gap} \\ g & \text{otherwise} \end{cases}    (8)

a and b are the numbers of witnesses in the groups X and Y, and x_{ki} is the ith word in the kth witness in group X. Gaps added at previous steps have no cost or benefit when compared to new gaps or other words, and are never altered at a later stage. This means that a gap having one independent origin on the stemma is counted only once. As before, we break ties at random (choosing one particular alignment among all those that are equally good, and hoping that this choice will not have adverse consequences at a later stage). Unlike the initial pairwise collations, we store the sequence of words and gaps produced by each step in the collation (found by tracing back through the optimal sequence), and use the texts with added gaps as inputs at later stages in the collation.
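The averaged quantities in Equation (8) are straightforward to compute; the sketch below (our illustration, reusing similarity() from the Section 4.1 sketch) represents each group as a list of gapped word lists and uses 1-based positions i and j as in Equation (7).

```python
def averaged_scores(group_x, group_y, i, j, g=1.0):
    """Averaged match score and gap costs of Equation (8), for use in the
    group-level recurrence of Equation (7). group_x and group_y are lists of
    gapped witness rows (lists of words, with '-' for gaps added earlier)."""
    a, b = len(group_x), len(group_y)
    s_bar = sum(
        0.0 if x[i - 1] == "-" or y[j - 1] == "-"
        else similarity(x[i - 1], y[j - 1])
        for x in group_x for y in group_y
    ) / (a * b)
    # previously introduced gaps cost nothing when compared with new gaps
    g_bar_j = sum(0.0 if y[j - 1] == "-" else g for y in group_y) / b
    g_bar_i = sum(0.0 if x[i - 1] == "-" else g for x in group_x) / a
    return s_bar, g_bar_j, g_bar_i
```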

[Figure 1 near here: an unrooted guide tree with terminal nodes Ad1, He, Hk, La, Pw, and Ra1, and internal nodes {Ad1, He}, {Ad1, He, La}, {Hk, Pw}, and {Ad1, He, Hk, La, Pw}.]

Figure 1. A guide tree for multiple collation of the texts in Table I panel A (excluding the base text), constructed using neighbor-joining on the distance matrix in Table II panel B. Terminal nodes are labeled with sigils. Internal nodes are labeled with the sets of collated witnesses they receive as the collation is constructed. Edges are drawn to scale in units of H_{XY} (Equation (6)) and labeled with their lengths.

The pairwise collations are the most time-consuming step, and need time proportional to the square of the number of witnesses and the square of the average number of words per witness. Once a multiple collation has been produced, generating a NEXUS file (or any similar encoding) is trivial. We need only scan down each column in the collation, assigning a different symbol to each different reading. Table III panel A shows a multiple collation of the texts from Table I panel A. The only thing we might want to change here is that 'Some tyme' in Ra1 is treated as a substitution of 'Some' for 'Whilom' and an insertion of 'tyme'. It would be better to regularize 'Some tyme' to a single word. Table III panel B shows the NEXUS encoding. Gaps ('-') are not missing data, and are treated as an additional state in the analysis, because the addition or omission of a word can be informative when constructing a stemma. Nevertheless, long sequences of gaps are likely to correspond to single events (e.g. the absence of a whole phrase due to eye skip), and it would probably be better to treat all but the first as missing data.

Table III. Panel A: A collation of the texts in Table I panel A (excluding the base text) produced by the progressive multiple alignment algorithm; gaps are '-'. Panel B: The NEXUS encoding of this collation.

Panel A

  Ad1   Whilom   -      ther   was   -                     dwelling
  He    Whilom   -      ther   -     -                     dwellid
  Hk    Whilom   -      -      was   -                     dwelling
  La    Whilom   -      ther   was   [ther was deleted]    dwelling
  Pw    Whilom   -      thet   was   -                     dwelling
  Ra1   Some     tyme   ther   was   -                     dwelling

Panel B

        1   2   3   4   5   6
  Ad1   0   -   0   0   -   0
  He    0   -   0   -   -   1
  Hk    0   -   -   0   -   0
  La    0   -   0   0   0   0
  Pw    0   -   1   0   -   0
  Ra1   1   0   0   0   -   0

Many texts are divisible into blocks (e.g. lines of poetry or paragraphs of prose) that can be collated separately. This is much quicker than collating the entire text at once. For two texts of equal length n, the time needed to collate them as a single block is in theory proportional to n^2. If we divide the texts into p blocks of equal length and collate each block separately, the time needed is theoretically proportional to p(n/p)^2 = n^2/p (although this ignores the extra computation necessary to handle the blocks, and variation in the size of blocks). For example, collating a section of the Parzival tradition (data from Michael Stolz, University of Basel: 15 witnesses each with about 550 words) took 652 minutes as a single block, and 18 minutes when each of the 115 lines was collated separately (collations run in Matlab 6.5, The Mathworks, Inc., Natick, MA, on an IBM-compatible PC with a 400 MHz AMD K-6 processor). The improvement is not as large as the theoretical value, but is still worthwhile.

5. Example and Code

An implementation of the algorithm described here in Matlab 6.5 (The Mathworks, Inc., Natick, MA) can be downloaded from [address of CHum ArticlePlus website], together with an example collation of an artificial text tradition containing 21 witnesses (Spencer et al., 2004a). The collation has 856 columns divided into 49 sentence blocks, and took approximately 1 hour on an IBM-compatible PC with a 400 MHz AMD K-6 processor.

6. Discussion

Collations for different purposes need different approaches. Progressive multiple alignment gives more useful results for statistical analysis, while existing methods such as collating against a base text or parallel segmentation are better for producing an apparatus and require less computation. Nevertheless, editors and readers need a clear understanding of the relationships among witnesses in order to benefit from an apparatus, and a stemma is a good way to represent these relationships. If the tradition is heavily contaminated, the stemma may need to allow more than one ancestor for each witness. Producing a stemma in these cases is more complicated, but is not necessarily impossible (e.g. Lee, 1990; Spencer et al., 2004b).

Progressive multiple alignment allows the production of stemmata based on data that represent the text tradition more accurately than existing methods. Intuitively, reconstructing a stemma is like reconstructing a map from the distances between points. If we know the distance between each pair of cities in the set {Birmingham, Liverpool, London, Manchester}, we could correctly reconstruct their locations, although we would not know which way was North. This is analogous to comparing each witness with every other witness in a multiple alignment. If we only know the distances between each city and London (equivalent to collating each witness against a base text), we cannot reconstruct the relative locations of the other cities. Phylogenetic methods are increasingly used for reconstructing stemmata (e.g. Platnick and Cameron, 1977; Cameron, 1987; Lee, 1989; Robinson and O'Hara, 1996; Salemans, 1996; Robinson, 1997; Mooney et al., 2001; Spencer et al., 2002; Stolz, 2003; Lantin et al., 2004). These methods require input data on the state of each witness at a set of corresponding locations. Progressive multiple alignment is an efficient way to generate these data. We argued in Section 2 that collating against a base text is less suitable because it does not consider the relationships among all witnesses. We argued in Section 3 that parallel segmentation is less suitable because it loses information on differences within long variants. We therefore suggest that multiple alignment is a good choice of collation method whenever a phylogenetic method will be used.

There are many possible improvements to the algorithm described here. In bioinformatics, progressive multiple alignment remains the most widely-used technique, and performs well in a wide range of situations (Notredame, 2002). One problem is that sequences are added in order of relatedness, and mistakes made in the first stages of the alignment cannot later be corrected. More complicated consistency-based algorithms such as T-COFFEE (Notredame et al., 2000) try to find a multiple alignment that agrees well with most of the pairwise alignments, and are less subject to this problem. Another problem is that the best collation depends on the relationships among texts, but our subsequent inferences about the relationships among texts require a collation. Simultaneous optimization of the collation and the phylogeny is therefore appealing (e.g. Gotoh, 1996; Durbin et al., 1998, pp. 180–188), but computationally expensive. We would like to see improvements in these areas, but evaluating the performance of competing algorithms requires a database of high-quality manual text alignments. Such a database does not yet exist. So far, we have only evaluated the performance of our method by comparing its output with our intuition about what a good text alignment should look like (the two are usually close).

Our algorithm represents changes in word order as sets of substitutions. This corresponds to increasing the weight given to rearrangements relative to other kinds of changes. The effects of differential weighting of kinds of changes on stemmata for Lydgate's Kings of England were generally small (Spencer et al., in press). This will usually be true if different kinds of changes were transmitted along the same lines of descent. Nevertheless, recognizing and encoding transpositions would be useful. In general, the problem of aligning strings when transpositions are permitted is NP-hard, although some choices of the cost of transpositions relative to other operations allow solutions in polynomial time (Wagner, 1975). Dynamic programming is not usually possible, but several heuristics have been developed. For example, Greedy String Tiling (GST) has been used to detect plagiarism in computer programs (Wise, 1996) and text re-use in journalism (Clough et al., 2002). The aim is to find a maximal set of non-overlapping substrings having one-to-one matchings between a pair of texts. Walking Tree methods represent the structure of a sequence as a tree with one leaf for each unit (e.g. a word in a text or a nucleotide in a DNA sequence), and allow rearrangements of some sections of the tree (Cull and Hsu, 1999). We do not know whether any of these methods has been successfully used in multiple alignment problems. It is always possible to produce an initial collation using our method, then examine the aligned texts for transpositions. One could manually recode small transpositions as single columns in the alignment. Similarly, we usually deal with larger transpositions by editing the witness files so that all have the same word order, then manually adding columns to the NEXUS file that represent each change of order as a single column. Where a group of several words has been inserted or deleted, there will be a corresponding sequence of gaps. If we believe that the insertion or deletion of all the words corresponds to a single event, we usually recode all but the first gap as missing data.

More generally, our method treats each word as an independent unit, but substitutions may be made at units of sense larger than words. For example, in Table III panel A, the witness He replaces 'was dwelling' with 'dwellid'. The deletion of 'was' and the change from 'dwelling' to 'dwellid' might be more likely to occur together than independently, because 'was dwelling' and 'dwellid' perform corresponding roles in the sentence, but 'was dwellid' is not grammatically acceptable. The progressive multiple alignment algorithm corresponds to a regular grammar, the lowest level of the Chomsky hierarchy of formal languages (Durbin et al., 1998, p. 238). The next level of sophistication, a context-free grammar, is able to deal with dependencies by building a parse tree for each sentence, representing the relationships among components such as noun phrases and verb phrases (Karttunen and Zwicky, 1985, pp. 3–5). Stochastic context-free grammars have been successfully used in aligning RNA sequences (Brown, 2000). For text data, an alignment algorithm based on a context-free grammar would have to parse each sentence in each witness, then align the parse trees. Parsing sentences using a context-free grammar is complex (Lari and Young, 1990; Durbin et al., 1998, pp. 252–258), and requires lexical and syntactic information for the language. Furthermore, the text produced by scribes will not necessarily be grammatically correct, especially if they were not particularly familiar with the language they were writing. Even human experts find many real-life sentences difficult to parse (Sampson, 2000). In contrast, our method is simple and independent of the language (cf. Ott, 1992, p. 219). At an intermediate level of complexity, it might be worth using part-of-speech tags, which improve the accuracy of bilingual text alignments (Toutanova et al., 2002).


It is usual to do stemmatic analyses with regularized spelling, punctuation, and word division (e.g. Robinson, 1997). These features are generally thought to reveal more about scribes' dialect and habits than about stemmatic relationships. We do not know of any algorithm for automatic regularization. At present, regularization is usually done either entirely by hand, or with the aid of programs such as Collate. If progressive multiple alignment is used for collation and regularized spelling, punctuation, and word division are required, the regularization could be done before collation using another program. Alternatively, we can produce an initial collation using unregularized witness files. It is then easy to identify differences at corresponding locations. Where these differences should be regularized, the witness files can be edited and a second collation produced. In practice, spelling and punctuation make little difference to the collations produced by our method, because n-grams are an effective way of identifying corresponding words even when they differ at a few characters. For example, when collating the artificial text tradition mentioned in Section 5, we used progressive multiple alignment with no regularization, because we thought spelling and punctuation differences were likely to be important in modern English (Spencer et al., 2004a). Differences in word division will usually produce gaps in the alignment, but these are easy to spot and correct where necessary.

Tools such as Collate can handle almost all stages of editing from transcription to the production of an apparatus. Integrating progressive multiple alignment, statistical analysis of variants and the production of stemmata into this process would make stemmatic methods of editing more accessible to scholars.

Acknowledgements

This work is part of the STEMMA project funded by the Leverhulme Trust. We are grateful to the Canterbury Tales Project (De Montfort University) for the Chaucer data, and to Michael Stolz for the Parzival data. Peter Robinson, Adrian Barbrook, Barbara Bordalejo, and Linne Mooney made many helpful suggestions. The manuscript was improved by comments from four anonymous referees.

References

Barbrook A.C., Howe C.J., Blake N., Robinson P. (1998) The Phylogeny of The Canterbury Tales. Nature, 394, p. 839.
Blake N., Robinson P. (eds.) (1997) The Canterbury Tales Project Occasional Papers, Vol. II. Office for Humanities Communication Publications, London, 184 p.
Brown M.P.S. (2000) Small Subunit Ribosomal RNA Modeling Using Stochastic Context-free Grammars. ISMB Proceedings 2000. American Association for Artificial Intelligence, pp. 57–66.
Cameron H.D. (1987) The Upside-Down Cladogram: Problems in Manuscript Affiliation. In Hoenigswald, H.M., Wiener, L.F. (eds.), Biological Metaphor and Cladistic Classification: An Interdisciplinary Perspective. Frances Pinter, London, pp. 227–242.
Cannon R.L., Jr. (1976) OPCOL: An Optimal Text Collation Algorithm. Computers and the Humanities, 10, pp. 33–40.
Clough P., Gaizauskas R., Piao S.S.L., Wilks Y. (2002) METER: MEasuring TExt Reuse. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL-02). University of Pennsylvania, Philadelphia, USA, pp. 152–159.
Cull P., Hsu T. (1999) Improved Parallel and Sequential Walking Tree Methods for Biological String Alignments. Supercomputing '99.
Durbin R., Eddy S., Krogh A., Mitchison G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge, 356 p.
Feng D.-F., Doolittle R.F. (1987) Progressive Sequence Alignment as a Prerequisite to Correct Phylogenetic Trees. Journal of Molecular Evolution, 25, pp. 351–360.
Gotoh O. (1982) An Improved Algorithm for Matching Biological Sequences. Journal of Molecular Biology, 162, pp. 705–708.
Gotoh O. (1996) Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinement as Assessed by Reference to Structural Alignments. Journal of Molecular Biology, 264, pp. 823–838.
Karttunen L., Zwicky A.M. (1985) Introduction. In Dowty, D.R., Karttunen, L., Zwicky, A.M. (eds.), Natural Language Parsing: Psychological, Computational, and Theoretical Perspectives. Cambridge University Press, Cambridge, pp. 1–25.
Kruskal J.B. (1983) An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules. SIAM Review, 25, pp. 201–237.
Kukich K. (1992) Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, 24, pp. 377–439.
Lantin A.-C., Baret P.V., Mace C. (2004) Phylogenetic Analysis of Gregory of Nazianzus' Homily 27. Le poids des mots: Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data. Louvain-la-Neuve, pp. 700–707.
Lari K., Young S.J. (1990) The Estimation of Stochastic Context-Free Grammars Using the Inside–Outside Algorithm. Computer Speech and Language, 4, pp. 35–56.
Lee A.R. (1989) Numerical Taxonomy Revisited: John Griffith, Cladistic Analysis and St. Augustine's Quaestiones in Heptateuchem. Studia Patristica, 20, pp. 24–32.
Lee A.R. (1990) BLUDGEON: A Blunt Instrument for the Analysis of Contamination in Textual Traditions. In Choueka, Y. (ed.), Computers in Literary and Linguistic Research. Champion-Slatkine, Paris, pp. 261–292.
Maddison D.R., Swofford D.L., Maddison W.P. (1997) NEXUS: An Extensible File Format for Systematic Information. Systematic Biology, 46, pp. 590–621.
Manning C.D., Schutze H. (1999) Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 680 p.
Mooney L.R., Barbrook A.C., Howe C.J., Spencer M. (2001) Stemmatic Analysis of Lydgate's "Kings of England": A Test Case for the Application of Software Developed for Evolutionary Biology to Manuscript Stemmatics. Revue d'Histoire des Textes, 31, pp. 275–297.
Navarro G. (2001) A Guided Tour to Approximate String Matching. ACM Computing Surveys, 33, pp. 31–88.
Notredame C. (2002) Recent Progresses in Multiple Sequence Alignment: A Survey. Pharmacogenomics, 3, pp. 131–144.
Notredame C., Higgins D.G., Heringa J. (2000) T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment. Journal of Molecular Biology, 302, pp. 205–217.
Ott W. (1979) The Output of Collation Programs. In Ager, D.E., Knowles, F.E., Smith, J. (eds.), Advances in Computer-Aided Literary and Linguistic Research. Department of Modern Languages, University of Aston, Birmingham, pp. 41–51.
Ott W. (1992) Computers and Textual Editing. In Butler, C.S. (ed.), Computers and Written Texts. Blackwell, Oxford, pp. 205–226.
Ott W. (2000) Strategies and Tools for Textual Scholarship: The Tübingen System of Text Processing Programs (TUSTEP). Literary and Linguistic Computing, 15, pp. 93–108.
Petrakis E.G.M., Tzeras K. (2000) Similarity Searching in the CORDIS Text Database. Software – Practice and Experience, 30, pp. 1447–1464.
Platnick N.I., Cameron H.D. (1977) Cladistic Methods in Textual, Linguistic, and Phylogenetic Analysis. Systematic Zoology, 26, pp. 380–385.
Robertson A.M., Willett P. (1998) Applications of n-grams in Textual Information Systems. Journal of Documentation, 54, pp. 48–69.
Robinson P. (1994a) Collate 2: A User Guide. Oxford University Computing Services, Oxford, 137 p.
Robinson P. (1997) A Stemmatic Analysis of the Fifteenth-Century Witnesses to The Wife of Bath's Prologue. In Blake, N., Robinson, P. (eds.), The Canterbury Tales Project: Occasional Papers Vol. II. Office for Humanities Communication Publications, London, pp. 69–132.
Robinson P.M.W. (1989) The Collation and Textual Criticism of Icelandic Manuscripts. (1): Collation. Literary and Linguistic Computing, 4, pp. 99–105.
Robinson P.M.W. (1994b) Collate: Interactive Collation of Large Textual Traditions. Oxford University Centre for Humanities Computing, Oxford.
Robinson P.M.W., O'Hara R.J. (1996) Cladistic Analysis of an Old Norse Manuscript Tradition. In Hockey, S., Ide, N. (eds.), Research in Humanities Computing 4. Oxford University Press, Oxford, pp. 115–137.
Sabourin C.F. (1994) Literary Computing. Infolingua, Montreal, 581 p.
Salemans B.J.P. (1996) Cladistics or the Resurrection of the Method of Lachmann: On Building the Stemma of Yvain. In van Reenen, P., van Mulken, M. (eds.), Studies in Stemmatology. John Benjamins Publishing Company, Amsterdam, pp. 3–70.
Salemans B.J.P. (2000) Building Stemmas with the Computer in a Cladistic, Neo-Lachmannian Way. Katholieke Universiteit, Nijmegen, 351 p.
Sampson G. (2000) The Role of Taxonomy in Language Engineering. Philosophical Transactions of the Royal Society of London Series A, 358, pp. 1339–1355.
Spencer M., Davidson E.A., Barbrook A.C., Howe C.J. (2004a) Phylogenetics of Artificial Manuscripts. Journal of Theoretical Biology, 227, pp. 503–511.
Spencer M., Howe C.J. (2001) Estimating Distances between Manuscripts Based on Copying Errors. Literary and Linguistic Computing, 16, pp. 467–484.
Spencer M., Mooney L.R., Barbrook A.C., Bordalejo B., Howe C.J., Robinson P. (in press) The Effects of Weighting Kinds of Variants. In den Hollander, A. (ed.), Studies in Stemmatology II. John Benjamins Publishing Company, Amsterdam.
Spencer M., Wachtel K., Howe C.J. (2002) The Greek Vorlage of the Syra Harclensis: A Comparative Study on Method in Exploring Textual Genealogy. TC: A Journal of Biblical Textual Criticism, 7.
Spencer M., Wachtel K., Howe C.J. (2004b) Representing Multiple Pathways of Textual Flow in the Greek Manuscripts of the Letter of James Using Reduced Median Networks. Computers and the Humanities, 38, pp. 1–14.
Sperberg-McQueen C.M., Burnard L. (eds.) (2002) TEI P4: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium, XML Version, Oxford, Providence, Charlottesville, Bergen.
Stolz M. (2003) New Philology and New Phylogeny: Aspects of a Critical Electronic Edition of Wolfram's Parzival. Literary and Linguistic Computing, 18, pp. 139–150.
Studier J.A., Keppler K.J. (1988) A Note on the Neighbor-Joining Algorithm of Saitou and Nei. Molecular Biology and Evolution, 5, pp. 729–731.
Thorpe J.C. (2002) Multivariate Statistical Analysis for Manuscript Classification. TC: A Journal of Biblical Textual Criticism, 7.
Toutanova K., Ilhan H.T., Manning C.D. (2002) Extensions to HMM-Based Statistical Word Alignment Models. Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pp. 87–94.
Ukkonen E. (1992) Approximate String-Matching with q-grams and Maximal Matches. Theoretical Computer Science, 92, pp. 191–211.
Wagner R.A. (1975) On the Complexity of the Extended String-to-String Correction Problem. Proceedings of the 7th Annual ACM Symposium on Theory of Computing, Albuquerque, New Mexico, pp. 218–223.
West M.L. (1973) Textual Criticism and Editorial Technique Applicable to Greek and Latin Texts. B.G. Teubner, Stuttgart, 155 p.
Wise M.J. (1996) YAP3: Improved Detection of Similarities in Computer Program and Other Texts. SIGCSE '96, Philadelphia, USA, pp. 130–134.