BIOINFORMATICS Vol. 20 no. 17 2004, pages 3156–3165
doi:10.1093/bioinformatics/bth380

The UniMarker (UM) method for synteny mapping of large genomes

Ben-Yang Liao1,†, Yu-Jung Chang2,3,†, Jan-Ming Ho2 and Ming-Jing Hwang1,∗

1Institute of Biomedical Sciences and 2Institute of Information Science, Academia Sinica, Taipei, Taiwan and 3Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

Received on January 28, 2004; revised on June 3, 2004; accepted on June 19, 2004

Advance Access publication June 24, 2004

ABSTRACT
Motivation: Synteny mapping, or detecting regions that are orthologous between two genomes, is a key step in studies of comparative genomics. For completely sequenced genomes, this is increasingly accomplished by whole-genome sequence alignment. However, such methods are computationally expensive, especially for large genomes, and require rather complicated post-processing procedures to filter out non-orthologous sequence matches.
Results: We have developed a novel method that does not require sequence alignment for synteny mapping of two large genomes, such as the human and mouse. In this method, the occurrence spectra of genome-wide unique 16mer sequences present in both the human and mouse genome are used to directly detect orthologous genomic segments. Being sequence alignment-free, the method is very fast and able to map the two mammalian genomes in one day of computing time on a single Pentium IV personal computer. The resulting human–mouse synteny map was shown to be in excellent agreement with those produced by the Mouse Genome Sequencing Consortium (MGSC) and by the Ensembl team; furthermore, the syntenic relationship of segments found only by our method was supported by BLASTZ sequence alignment.
Availability: The source code of our method and the resulting human–mouse synteny maps have been placed at http://synteny.ibms.sinica.edu.tw/ for free access.
Supplementary information: Seven supplementary figures can be found at the same website.
Contact: [email protected]

∗To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

INTRODUCTION
With the number of completely sequenced genomes increasing rapidly, comparative genomics is becoming an indispensable approach for genome annotation and for studying genome evolution. Essential to this approach is whole-genome alignment, which is computationally demanding, particularly for large genomes, such as those of mammals. Thus, using conventional approaches, scores or even hundreds of computing processors are required to compare the human and mouse genomes in a time period of hours or days (Waterston et al., 2002; Schwartz et al., 2003), a practical time scale for doing competitive research in such a rapidly evolving field as genomics. Moreover, there appears to be considerable discrepancy in the various human–mouse synteny maps created independently by several research groups (Waterston et al., 2002; Gregory et al., 2002; Clamp et al., 2003), even though they may use similar alignment algorithms and strategies (Ureta-Vidal et al., 2003).

As many more large genomes will be sequenced in the next few years (Ureta-Vidal et al., 2003), there is a pressing need to develop a whole-genome alignment tool that can render the task feasible and practical using minimal computing facilities, such as a single desktop computer. To achieve this goal, methods that deviate significantly from existing approaches using sequence alignment, such as BLAST (Altschul et al., 1990), MegaBLAST (Zhang et al., 2000), BLAT (Kent, 2002), BLASTZ (Schwartz et al., 2003) or PatternHunter (Ma et al., 2002), merit exploration.

Various articles have demonstrated that the use of a hash table (Schuler, 1997; Ning et al., 2001) or suffix-tree (Delcher et al., 2002; Bray et al., 2003) can significantly speed up the computation time required for sequence mapping. Our previous work (Chen et al., 2002) showed that, by matching unique 15mer words (those that appear exactly once in the genome and are therefore called UniMarkers or UMs), it is possible to dispense with the usual requirement for sequence alignment and to genomically position the entire database of human single nucleotide polymorphism (SNP) sequences in just a few days of computing time on a single desktop computer. In the present study, we introduced a new concept of using UMs to detect sequence orthologues without doing sequence alignment and extended the UM method for whole-genome synteny mapping.

3156 Bioinformatics vol. 20 issue 17 © Oxford University Press 2004; all rights reserved.

Synteny mapping using UniMarkers

To align two very long DNA sequences, such as those of metazoan genomes, the most common approach starts by finding the so-called high scoring pairs (HSPs) of sequence fragments that are derived from words matched either by consecutive seeds (Altschul et al., 1990; Zhang et al., 2000) or by spaced seeds first proposed by Li and coworkers in PatternHunter (Ma et al., 2002) and later adopted in BLASTZ (Schwartz et al., 2003). These HSPs, in which a word or segment in one sequence may have multiple matches in the other sequence, then serve as seeds, which are subsequently filtered and combined to identify a set of longer segments that are thought to be orthologous between the two sequences. In the final step, these segments, often called anchors or landmarks, are extended or processed to yield an alignment or mapping of the two sequences (Ureta-Vidal et al., 2003). Our UM method differs from these approaches by avoiding the time-consuming step of finding and processing the HSP seeds; instead, orthologous anchoring segments are detected directly from a genome-wide occurrence spectrum of UMs common to the two genomes compared. Consequently, and as detailed below, the UM method is very fast and can map the entire human genome against the entire mouse genome and vice versa, in just one day on a single Pentium IV personal computer. To evaluate the quality of the resulting UM human–mouse map, it was compared with the map produced by the Mouse Genome Sequencing Consortium (MGSC) (Waterston et al., 2002) and with that produced by the Ensembl team (Clamp et al., 2003; Hubbard et al., 2002). The UM map was shown to be in excellent agreement with the MGSC map, missing only a few small MGSC segments, while having several small unique segments of its own. The agreement with the Ensembl map was also very good, though not as good as that with the MGSC map. Sequence alignment using BLASTZ (Schwartz et al., 2003) on segments that were map-unique or disagreed between maps indicated that the UM method, despite being sequence alignment-free, achieved high specificity and sensitivity in finding syntenic regions of the two mammalian genomes.

METHODS
pUMp versus hUMp
Orthologous regions, by definition, are homologous regions shared by two genomes from a speciation event. The basic idea of our approach is that, between two genomes, orthologous regions should share more UniMarker pairs (UMps; a UMp connects identical UMs in both genomes) than non-orthologous regions. However, there are two kinds of UMp, those inherited from a common ancestor, hereafter referred to as primitive UMps (pUMps), and those that have arisen by random mutation, referred to as homoplastic UMps (hUMps) (Fig. 1). Although it is not possible to tell whether a given UMp is a pUMp or a hUMp, the two kinds can be distinguished as collective groups, as illustrated in Figure 1. This is because, by definition, pUMps can exist only between orthologous regions, whereas hUMps can exist between any two regions, be they orthologous or not. Consequently, pUMps can provide a signal for pairs of orthologous regions against a background noise of hUMps, and, as long as the signal/noise ratio is sufficiently high, i.e. the evolutionary distance between the two genomes is not too great, orthologous pairs should be detectable by analyzing the UMp distribution in the two genomes.

Occurrence spectra of UMps and anchoring islands
A simple, but efficient, method to identify k-mer UMs in the human genome has been described (Chen et al., 2002). This method was used in the present study to identify 16mer UMs for each of several assemblies of the human genome and for the draft mouse genome sequence. Those UMs common to a particular assembly of the human genome and the mouse genome were extracted; each of these constitutes a UMp, as defined above.
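The definitions of a UM and a UMp can be illustrated with a small in-memory sketch. This is not the method of Chen et al. (2002), whose details are not given here; it is a naive counting approach that only works for short sequences. The function names `unique_kmers` and `unimarker_pairs` are hypothetical, and the sketch assumes forward-strand matching only.

```python
from collections import Counter


def unique_kmers(seq, k=16):
    """Map each k-mer that occurs exactly once in seq to its 0-based start.

    Naive illustration only: a real genome would need a far more
    memory-efficient scheme than counting every k-mer in a dictionary.
    """
    n_positions = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_positions))
    return {seq[i:i + k]: i
            for i in range(n_positions)
            if counts[seq[i:i + k]] == 1}


def unimarker_pairs(seq_a, seq_b, k=16):
    """Return sorted (pos_a, pos_b) pairs for k-mers unique in both sequences.

    Assumption: only the forward strand is considered.
    """
    ums_a = unique_kmers(seq_a, k)
    ums_b = unique_kmers(seq_b, k)
    shared = ums_a.keys() & ums_b.keys()
    return sorted((ums_a[w], ums_b[w]) for w in shared)
```

With k lowered to 4 for a toy example, `unimarker_pairs("ACGTACGA", "TTACGTT", k=4)` pairs the two words (ACGT and TACG) that are unique in both sequences.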

The UM method for mapping two genomes, A and B, involves the following. Each chromosome of genome B is divided into a set of minimally overlapped fragments, each containing an equal number of UMps, which, in this work, was set at 300 000, i.e. a number slightly greater than that (∼290 000) on the human Y chromosome (consequently, the entire human Y chromosome was a fragment). We then scan genome A using a sliding window of 50 kb and a moving step of 10 kb to compute Mij, the ratio of the number of UMps common to both the i-th window of genome A and the j-th chromosomal fragment of genome B (Nij) to the total number of UMps found in the i-th window of genome A (Ni) (i.e. Mij = Nij/Ni). The values of these parameters, and of those described below, were empirically determined in trial runs to minimize the computational cost while maintaining good resolution in the resulting human and mouse synteny map.
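The window scan that yields Mij = Nij/Ni can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes one chromosome of genome A and an input list in which each UMp is given as its position on that chromosome together with the identifier of the genome B fragment holding its partner. The function name `mij_spectrum` and the input layout are assumptions.

```python
from collections import defaultdict


def mij_spectrum(umps, chrom_length, window=50_000, step=10_000):
    """Compute M_ij = N_ij / N_i for sliding windows on one chromosome of genome A.

    umps: iterable of (pos_a, fragment_j), where pos_a is the UMp position on
    the genome A chromosome and fragment_j labels the genome B fragment that
    contains the matching UM (hypothetical input layout).
    Returns {(i, j): M_ij} for every window i that holds at least one UMp.
    """
    n_windows = max(1, (chrom_length - window) // step + 1)
    n_i = defaultdict(int)    # UMps per window
    n_ij = defaultdict(int)   # UMps per (window, fragment)
    for pos_a, frag_j in umps:
        # Window i covers [i * step, i * step + window).
        first = max(0, (pos_a - window) // step + 1)
        last = min(pos_a // step, n_windows - 1)
        for i in range(first, last + 1):
            n_i[i] += 1
            n_ij[(i, frag_j)] += 1
    return {(i, j): n / n_i[i] for (i, j), n in n_ij.items()}
```

Because the step is smaller than the window, each UMp contributes to several overlapping windows, which is why the inner loop runs over a range of window indices.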

B.-Y.Liao et al.

Fig. 1. The two types of UMp. All UMps shared by segments from two different genomes can be classified into two types, those that have descended from a common ancestor, called primitive UMps (pUMps; black solid lines), and those that have arisen by random mutation, called homoplastic UMps (hUMps; gray dashed lines). (A) Following evolutionary changes, a certain pUMp could change its pairing randomly, resulting in a pUMp evolving into a hUMp. UMs (illustrated by four-letter words) found in both genomes are represented by shaded boxes. The site of mutation causing a change in UM pairings is marked by a black triangle. (B) The distribution of pUMps and hUMps. When two genomes are compared, orthologous genomic segments will share both pUMps (shown as white boxes) and hUMps (shown by black boxes), but any two evolutionarily unrelated regions (e.g. the first segment of genome A and the second half of the genome B segment) can only share hUMps.

As illustrated in the example in Figure 2, the Mij spectrum allowed us to find orthologous regions, hereafter referred to as anchoring islands, without doing sequence alignment. For a segment to qualify as an anchoring island, at this stage in genome A only (Fig. 2A), we specified that at least four consecutive windows must have an Mij value in the top 1.5% of all Mij (Fig. 2B) to suggest the presence of pUMps, or orthologous relationship, between these windows of genome A and a chromosomal fragment of genome B. To pin down the region in this chromosomal fragment of genome B with which the anchoring island of genome A was orthologous, we moved the sliding window to genome B, and operated it on the fragment-containing chromosome to compute Nkl, the number of UMps shared by the k-th window (on the chromosome of genome B) and the l-th island (on genome A). The Nkl spectrum (Fig. 2C) allowed us to delimit the matching anchoring island on genome B, which was specified as containing at least two consecutive windows with (i) Nkl values of at least 25 or (ii) Nkl values of at least 10 and within the top 3% of all Nkl for that particular l-th island of genome A. Note that, for this stage, there was no need to compute Nk or Nkl/Nk (i.e. Mkl), and the reason for the expansion to include the whole chromosome, instead of just the fragment, in the computation of Nkl was to provide sufficient background noise (hUMps) to distinguish the signal (pUMps). For multiple matches, i.e. when two or more matching anchoring islands were found on the fragment of genome B, the procedure for computing Nkl was repeated after switching the sliding window back to operate on the anchoring island-containing chromosome of genome A. This procedure was repeated until all anchoring islands were uniquely matched between the two genomes. For the present work on the human and mouse genomes, we found that multiple matches occurred in ∼30% of cases; most of these could be resolved after Nkl was calculated for the second time, and all could be resolved after the fourth calculation.
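The two island criteria described above both reduce to finding runs of consecutive qualifying windows. The sketch below shows that step only, under stated assumptions: the percentile cutoffs (top 1.5% of Mij, top 3% of Nkl) are taken as precomputed inputs, "in the top fraction" is read as "greater than or equal to the cutoff", and all function names are hypothetical.

```python
def consecutive_runs(flags, min_run):
    """Return (first, last) index pairs for runs of True of length >= min_run."""
    runs, start = [], None
    for idx, flag in enumerate(flags):
        if flag and start is None:
            start = idx                      # a run begins
        elif not flag and start is not None:
            if idx - start >= min_run:       # run ended at idx - 1
                runs.append((start, idx - 1))
            start = None
    if start is not None and len(flags) - start >= min_run:
        runs.append((start, len(flags) - 1))  # run reaches the last window
    return runs


def islands_in_a(mij_values, mij_cutoff, min_run=4):
    """Genome A stage: at least four consecutive windows at or above the cutoff."""
    return consecutive_runs([m >= mij_cutoff for m in mij_values], min_run)


def islands_in_b(nkl_values, top3_cutoff, min_run=2):
    """Genome B stage: N_kl >= 25, or N_kl >= 10 and within the top 3%."""
    flags = [n >= 25 or (n >= 10 and n >= top3_cutoff) for n in nkl_values]
    return consecutive_runs(flags, min_run)
```

The returned pairs are window indices; converting them to genomic coordinates (for example, using the midpoints of the first and last windows) is left out of this sketch.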

Fig. 2. Identification of the anchoring islands. (A) The Mij spectrum (see text for definition) for mouse chromosome 16 computed from two human chromosomal fragments, denoted by 16.2f (the 2nd fragment on human chromosome 16 in the forward orientation) and 3.18f. The detected islands, regions containing at least four consecutive overlapping windows (each of 50 kb and with an Mij value above threshold, see text) are labeled as vertical bars on the mouse chromosome shown below the x-axis. The boundaries for each island were set at the midpoint of the first and last of its consecutive windows. (B) The distribution of Mij [for all windows (i) and all chromosomal fragments (j), see text]. The lower boundary of the top 1.5% of the distribution (dark area) was chosen as the Mij threshold in the present work. (C) The Nkl spectrum for determining the matching island on the human chromosome, which, as indicated, was divided into minimally overlapped fragments with equal number of UMs, rather than base pairs (see text). For each mouse chromosome, such as chromosome 16 shown here, there were a total of 612 Mij spectra, as the human genome was divided into 612 chromosomal fragments (half forward and half backward); for clarity, only two are shown in (A).

Overlapped anchoring islands
A few (500–800 or 4–7%, depending on the version of genome assembly used) of the resulting anchoring islands overlapped; this was due to the pUMp signal being independently detected in overlapping windows. There were four types of such overlaps (Supplement Figure S1). For the first type of partial overlaps, which accounted for ∼60–75% of overlaps, we simply set the boundary of the anchoring island at the midpoint of the overlap. The second and third types (accounting for 20–40% of overlaps) occurred when a small island (usually <100 kb) was embedded in a large island. Further analysis indicated that embedded islands of the second type, which comprised ∼80% of the embedded cases, probably resulted from lineage-specific duplication, while those of the third type resulted from microrearrangements. Accordingly, we discarded embedded islands of the second type, but kept those of the third type and split their encompassing island into three, as illustrated in Figure S1 (Supplement). The fourth type occurred when a very small island (∼40 kb) of one genome contained two separable clusters of UMps, each of which was mapped to one of two distinct, usually even smaller, islands of the other genome. The fourth type was rare, accounting for <2% of the overlaps. For the sake of computational convenience and automation, we kept the first of the two pairings and discarded the other.
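The midpoint rule for the first overlap type can be sketched as below. This covers partial overlaps only; embedded islands (the second and third types) and the fourth type are left untouched, since their handling depends on analysis not reproduced here. The function name and the (start, end) tuple layout are assumptions.

```python
def split_partial_overlaps(islands):
    """Resolve partial overlaps between islands by cutting at the overlap midpoint.

    islands: list of (start, end) coordinates on one chromosome.
    Only type-one (partial) overlaps are handled; an island fully embedded in
    its predecessor is passed through unchanged (assumption: handled elsewhere).
    """
    result = []
    for start, end in sorted(islands):
        if result:
            prev_start, prev_end = result[-1]
            if start < prev_end < end:               # partial overlap
                midpoint = (start + prev_end) // 2   # middle of the shared region
                result[-1] = (prev_start, midpoint)
                start = midpoint
        result.append((start, end))
    return result
```

For example, islands (0, 100) and (80, 200) share the region 80 to 100, so both boundaries are moved to 90.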

Although the use of a smaller window and moving step can eliminate most of the overlaps, particularly those of the first type, this would force the method to operate on fewer UMs, which could decrease the signal/noise ratio, especially for regions containing a lower density of UMs (e.g. <1000 UMs/50 kb).

Bidirectional mapping
At this stage, we had a set of non-overlapping, one-to-one matched, anchoring islands for genomes A and B. We called this set the A → B set, since the Mij for this set was computed on windows of genome A. To further reduce the likelihood of the identified anchors being false positives, we also computed the B → A set, using identical procedures and parameters to those described above, and extracted the overlaps of the two sets. The bidirectional mapping also helped us set the thresholds for Mij and Nkl (see above); with the chosen thresholds, more than 95% of the mapped anchoring islands were either identical or substantially overlapped between the two directions.
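One possible reading of "extracted the overlaps of the two sets" is an interval intersection on both genomes, sketched below. This is an interpretation, not the authors' code. It assumes a single chromosome pair, island pairs given as (a_start, a_end, b_start, b_end), and a simple quadratic scan; all names are hypothetical.

```python
def overlap(s1, e1, s2, e2):
    """Return the intersection (start, end) of two intervals, or None if empty."""
    start, end = max(s1, s2), min(e1, e2)
    return (start, end) if start < end else None


def bidirectional_overlaps(a_to_b, b_to_a):
    """Keep the parts of island pairs supported by both mapping directions.

    Each island pair is (a_start, a_end, b_start, b_end). A pair from the
    A -> B set is kept only where it intersects a pair from the B -> A set
    on genome A and on genome B at the same time.
    """
    kept = []
    for a1, a2, b1, b2 in a_to_b:
        for c1, c2, d1, d2 in b_to_a:
            on_a = overlap(a1, a2, c1, c2)
            on_b = overlap(b1, b2, d1, d2)
            if on_a and on_b:
                kept.append(on_a + on_b)
    return kept
```

A production version would index the islands by chromosome and use a sweep or interval tree rather than comparing every pair.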

Conserved segments and syntenic blocks
The bidirectionally mapped and non-overlapping anchoring islands were then merged into conserved segments for any two adjacent islands in one genome that were also adjacent, as well as in the same orientation, in the other genome [see Nadeau and Sankoff (1998) for definitions of ‘conserved segment’ and ‘syntenic block’ (aka ‘conserved synteny’)]. Finally, the resulting conserved segments were grouped into syntenic blocks, each of which consisted of conserved segments that were contiguously matched, irrespective of the order and the orientation of their matching, in both genomes and on a single chromosome.
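The merging rule for conserved segments (adjacent in both genomes and in the same orientation) can be sketched as follows. The sketch assumes a single chromosome pair, islands given as (a_start, a_end, b_start, b_end, strand) with strand "+" or "-", and that a reversed run appears in descending order on genome B. The function name is hypothetical, and the grouping of segments into syntenic blocks is not shown.

```python
def merge_into_segments(islands):
    """Group anchoring islands into conserved segments.

    islands: list of (a_start, a_end, b_start, b_end, strand) tuples.
    Two islands adjacent on genome A are merged when they are also adjacent
    on genome B and share the same orientation. For "-" islands, adjacency
    on genome B is expected in descending order (assumption).
    Returns a list of segments, each a list of islands in genome A order.
    """
    by_a = sorted(islands)                               # order along genome A
    by_b = sorted(islands, key=lambda isl: isl[2])       # order along genome B
    b_rank = {isl: rank for rank, isl in enumerate(by_b)}
    segments = []
    for isl in by_a:
        if segments:
            prev = segments[-1][-1]
            expected_step = 1 if isl[4] == "+" else -1
            same_strand = isl[4] == prev[4]
            if same_strand and b_rank[isl] - b_rank[prev] == expected_step:
                segments[-1].append(isl)
                continue
        segments.append([isl])
    return segments
```

In the test data used to check this sketch, two forward islands merge, two reversed islands merge, and a final forward island starts a new segment.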

Comparison with other maps
It is not a trivial process to compare two different synteny maps, because different degrees of concordance may arise for conserved segments that are equivalent between the two maps on either of the two genomes. We therefore devised a set of parameters to assign equivalent (i.e. overlapped) conserved segments to four categories (Supplement Figure S2): ‘Agree (strong)’, ‘Agree (weak)’, ‘Disagree’ and ‘Unique’, with decreasing degrees of overlap. The main distinction between the ‘Agree’ and ‘Disagree’ category was whether a substantial overlap in the segments was shared in both or just one, of the two genomes; those that were not substantially overlapped in either genome, or were overlapped, but not in the same orientation, were assigned to ‘Unique’. For the comparison with the MGSC and Ensembl maps, the same versions of the genome assembly for either human or mouse used in those maps were used to produce the corresponding UM maps. These genome assemblies were retrieved from ftp://ftp.ncbi.gov/genomes/H_sapiens/ and ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ at the National Center for Biotechnology Information (NCBI). The MGSC map, i.e. the genomic start and end positions and the orientation of mapped conserved segments, was provided by Michael Kamal (Whitehead Institute, MIT). The Ensembl map was downloaded from http://www.ensembl.org/Homo_sapiens/syntenyview/ and its segments parsed.

BLASTZ evaluation
To evaluate the segments classed as ‘Disagree’ or ‘Unique’ between two maps, we subjected them to BLASTZ (Schwartz et al., 2003) sequence alignment, using parameters B = 2, C = 0, T = 1 and K = 5000, 9000 or 12 000. Each of the resulting alignments was displayed as a dot plot using the alignment viewer, Laj (Wilson et al., 2001), inspected, and assigned to one of five outcomes (see Fig. 3 for illustrative examples): ‘Concordant’, ‘Shifted’, ‘Multiple’, ‘Reversed’ and ‘Unsupported’. Those that showed no clear evidence of homology were considered ‘Unsupported’ by sequence alignment and were probably false positives. All the assignments could be made without much ambiguity, although, for a few segments with few and very small patches of matches in the dot plot, their assignment to one of the last four outcomes could be subjective.

Software
Computer modules for the UM method and synteny map visualization were written in Perl, C/C++ and Delphi/Object Pascal. The run-time to produce a human–mouse map, which included both the bidirectional mapping and the merging of anchoring islands into conserved segments and syntenic blocks, was <23 h on one personal computer (2.8 GHz Pentium IV, 768 MB memory). The preprocessing, i.e. the identification of 16mer UMs and UMps, took 20.4 h on the same machine equipped with 2 GB memory (1 h for UMps). With code optimization and modified data structures, our new version of the UM method, updated after the completion of the present work, has reduced the entire process of human–mouse mapping to ∼7 h total. The new version is freely available at the UM synteny website: http://synteny.ibms.sinica.edu.tw/.

RESULTS
Maps from various versions of the human genome
The speed of the UM method for producing a whole-genome synteny map allowed us to produce multiple maps resulting from different versions of genome assembly. Maps using different human genome assemblies differ mainly in the number of small conserved segments, which decreased with each update of the genome (Supplement Figure S3). This corroborates the argument that errors in sequence assembly are more likely to produce artifactual microrearrangements than to affect large (e.g. >1 Mb) synteny blocks (Pevzner and Tesler, 2003). Given the results shown in Figure S3 (Supplement), we can expect a further reduction in the number of small conserved segments when a ‘finished’ mouse genome becomes available.


Fig. 3. Examples of BLASTZ alignment, shown as a dot plot, of conserved segments assigned as ‘Disagree’ or ‘Unique’. (A) Concordant, (B) Shifted, (C) and (D) Multiple, (E) Reversed, (F) Unsupported. (A) and (C) segments are from the UM map, (B) and (E) segments from the Ensembl map and (D) and (F) segments from the MGSC map. For visual clarity, BLASTZ parameter K (threshold for the maximal segment pair score) was set at 12 000 in cases (B) and (D), 9000 in cases (A), (C) and (F), and 5000 in case (E).

Some parameters for the UM map using the ‘essentially complete’ human genome (NCBI build 33) and the mouse genome NCBI build 30 (the only NCBI build for mouse available at the time of this work) are summarized in Table 1. Maps using human builds 30 and 31 gave quite similar results (data not shown). For the conserved segments and synteny blocks, these data, except for those for N50, are quite comparable with those reported by MGSC (Waterston et al., 2002); in contrast, the 10 999 anchoring islands are only a fraction of the 558 000 ‘landmarks’ (high scoring and bidirectional best sequence matches) identified by MGSC. Since the two sets of syntenic anchors eventually produced very similar maps (details below), our much larger ‘islands’ (846.9 Mb total length covering 33.9% of the mouse genome; Table 1) are, in effect, clusters of the ‘landmarks’ obtained by sequence alignment using PatternHunter (Ma et al., 2002) [188 Mb total length and 7.5% mouse genome coverage (Waterston et al., 2002)].

Table 1. Size and genome coverage of anchoring islands, conserved segments and syntenic blocksᵃ

                                                  Mouse         Human
10 999 anchoring islands         Average          77.0 kb       81.8 kb
                                 N50              50.0 kb       50.0 kb
                                 Largest          1.27 Mb       1.30 Mb
                                 Total length     846.9 Mb      899.9 Mb
                                 (% genome)ᵇ      (33.9%)       (31.8%)
                                 Spacing ave.     150.1 kb      182.2 kb
                                 Spacing N50      70 kb         80 kb
365 (≥100 kb) conserved segments Average          6.33 Mb       7.08 Mb
                                 N50              2.46 Mb       2.94 Mb
                                 Largest          64.49 Mb      79.65 Mb
                                 Total length     2309.3 Mb     2585.3 Mb
                                 (% genome)       (92.3%)       (91.3%)
224 syntenic blocks              Average          10.55 Mb      12.01 Mb
                                 N50              4.78 Mb       5.58 Mb
                                 Largest          146.01 Mb     143.27 Mb
                                 Total length     2363.8 Mb     2689.1 Mb
                                 (% genome)       (94.5%)       (94.9%)

ᵃThese data are for the UM human–mouse synteny map using the ‘essentially complete’ human genome (NCBI build 33) and the draft mouse genome (NCBI build 30).
ᵇGenome size was calculated by omitting the telomeres, centromeres and gaps between supercontigs (mouse, 2.501 Gb; human, 2.832 Gb).

Comparison with maps produced by MGSC and Ensembl
As the key component of a synteny map is a list of conserved segments, the easiest way to compare two synteny maps is to compare two corresponding lists of conserved segments. Using the criteria for comparing two maps described in the Methods, the comparison of the results for UM versus MGSC and UM versus Ensembl is presented in Tables 2 and 3, respectively. A graphical overview of these results is also presented in Figure 4. As can be seen, the UM map agreed well with both the MGSC and the Ensembl maps, having ∼99% of the mapped regions cross-covered with the former (Table 2) and up to 95% with the latter (Table 3). Furthermore, the vast majority of the ‘Agree’ segments were in strong agreement (i.e. high degree of overlap; Supplement Figure S2), and the ‘Disagree’ or ‘Unique’ segments were mainly relatively small segments (Table 4), the largest being a few Mb in the comparison with the MGSC map and 24 Mb in the comparison with the Ensembl map. The somewhat smaller genome coverage and the smaller conserved segments obtained using the UM map were probably due to the fact that, unlike in the other two maps, the anchoring islands were not extended to include as much alignable sequence as possible.

Tables 2 and 3 also show that, for all categories, the agreement between UM and MGSC was significantly better than that between UM and Ensembl. This is attributable in part to the smaller minimal conserved segments used in the Ensembl map (100 versus 300 kb for the MGSC map) and to the fact that, unlike the UM and MGSC maps, the Ensembl map is not cleanly resolved, in that some of its segments are substantially overlapping with, or entirely embedded in, other segments. The MGSC and Ensembl maps could not be precisely compared, because they were generated using different genome versions.

Evaluation with sequence alignment
Although a good sequence alignment, i.e. one resulting in a clear diagonal in the dot plot, does not necessarily mean that a pair of conserved segments are orthologous, the converse usually holds. Table 4 gives the results of sequence alignment, using BLASTZ (Schwartz et al., 2003), for the ‘Disagree’ and ‘Unique’ segments from Tables 2 and 3. The results showed that all but 2 of the total 93 (12 + 71 + 10) UM ‘Unique’ or ‘Disagree’ pairs of segments were concordant with BLASTZ alignment, and the two exceptions were neither in the wrong orientation (‘Reversed’) nor without clear evidence of sequence similarity (‘Unsupported’). In comparison, 2 of the 26 MGSC ‘Unique’ and 10 of the 35 Ensembl ‘Unique’ segment pairs were ‘Unsupported’ by BLASTZ alignment. Further examination (Figures S4 and S5 in the supplement) showed that 17 of the 23 MGSC ‘Unique’, BLASTZ-concordant pairs, and 8 of the 11 Ensembl ‘Unique’, BLASTZ-concordant pairs, were actually detected by the UM method, but were not included in the comparison because the corresponding UM segments were too small (<300 or <100 kb for the comparison with the MGSC or Ensembl map, respectively). These relatively small UM segments could probably be brought into agreement with the corresponding MGSC and Ensembl segments, if they were allowed to extend by sequence alignment, as discussed above. The remaining 6 (23 − 17) MGSC and 3 (11 − 8) Ensembl pairs not detected by UM were all small (most <1 Mb), and, interestingly, the density of their UMps was significantly smaller than typical (Figures S4 and S5 in the supplement). We did not carry out the same evaluation on the ‘Agree’ segments due to limited computing resources, but, given the consensus of the results using two very different approaches (UM versus MGSC or UM versus Ensembl), together with the results presented below of the Largest Increasing Subsequence (LIS) analysis (Gusfield, 1997) of UMps, it is unlikely that they would be BLASTZ-unsupported.

Evaluation with LIS analysis of UMps
For a pair of conserved segments or anchoring islands, one expects the largest subset of UMps matched in the same direction (Fig. 1) or LIS UMp, to be composed mainly of pUMps. An LIS analysis of UMps can, therefore, be used instead of sequence alignment to detect questionable segment or island pairs. Remarkably, the results of such an analysis (Supplement Figure S6) showed that, for 91% (10 014/10 999) of the UM anchoring islands, the LIS UMp ratio was 1.0, i.e. all the UMps matched within paired islands were ordered in the same forward or backward orientation, and only 7 (out of 10 999) pairs had a LIS UMp ratio smaller than 0.8. Furthermore, all of these seven pairs with a low LIS UMp ratio, including two in regions full of repetitive elements, showed evidence of homology as assessed by BLASTZ alignment (Supplement Figure S6). As the islands were merged into segments (see Methods section), the percentage of ordered UMps would decrease (Supplement Figure S7); however, the sequence similarity of several less promising pairs, as suggested by the LIS analysis (Supplement Figure S7), was validated by BLASTZ alignment (data not shown).
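The LIS UMp ratio can be illustrated with a standard longest-increasing-subsequence routine. The exact definition of the ratio is not spelled out above, so the sketch assumes it is the size of the largest co-linear subset of UMps, taken in whichever of the forward or backward direction is larger, divided by the total number of UMps in the pair. The function names are hypothetical.

```python
from bisect import bisect_left


def lis_length(values):
    """Length of the longest strictly increasing subsequence, in O(n log n)."""
    tails = []  # tails[m] = smallest tail of any increasing run of length m + 1
    for v in values:
        idx = bisect_left(tails, v)
        if idx == len(tails):
            tails.append(v)
        else:
            tails[idx] = v
    return len(tails)


def lis_ump_ratio(umps):
    """Fraction of UMps in the largest same-direction subset (assumed definition).

    umps: list of (pos_a, pos_b) positions for one island or segment pair.
    """
    if not umps:
        return 0.0
    b_order = [pos_b for _, pos_b in sorted(umps)]    # genome B positions in genome A order
    forward = lis_length(b_order)
    backward = lis_length([-pos_b for pos_b in b_order])
    return max(forward, backward) / len(umps)
```

A ratio of 1.0 means every UMp in the pair follows a single forward or backward order; lower values flag pairs that may deserve inspection by sequence alignment.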

DISCUSSION

Identifying sequence orthologues is a key step in comparing two genomes. For this task, sequence alignment has been the method of choice, despite two well-recognized problems: (1) the assumption of contiguity in homologous sequences is intrinsically incorrect (Vinga and Almeida, 2003) and (2) sequence pairs with the highest alignment score may


Table 2. Comparison between the UM map and the MGSC map on conserved segments^a

            Agree    Disagree          Unique    Total
                     Strong    Weak
UM          310      8         0       12        330
MGSC        308      8         0       26        342

            Agree                  Disagree                          Unique           Total
                                   Strong           Weak
            Size (Mb)   % Mapped   Size (largest)   Size (largest)   Size (largest)   Size      % Genome
UM
  Mouse     2260.6      99.2       9.5 (3.1)        0.0 (–)          9.4 (2.6)        2279.5    91.7
  Human     2512.2      99.0       7.6 (2.8)        0.0 (–)          19.1 (3.2)       2539.0    90.3
MGSC
  Mouse     2321.7      98.7       11.6 (0.8)       0.0 (–)          19.7 (4.2)       2353.0    94.6
  Human     2583.8      98.5       11.7 (0.5)       0.0 (–)          28.2 (3.9)       2623.6    93.3

^a Human assembly NCBI build 30 versus mouse assembly MGSCv3, with the minimum segment size cut at 300 kb.

Table 3. Comparison between the UM map and the Ensembl map on conserved segments^a

            Agree    Disagree          Unique    Total
                     Strong    Weak
UM          261      23        10      71        365
Ensembl     277      21        5       35        338

            Agree                  Disagree                          Unique           Total
                                   Strong           Weak
            Size (Mb)   % Mapped   Size (largest)   Size (largest)   Size (largest)   Size      % Genome
UM
  Mouse     2148.0      93.0       17.9 (3.3)       6.7 (1.7)        136.8 (18.9)     2309.3    92.3
  Human     2387.7      92.4       32.2 (4.6)       7.6 (2.0)        157.7 (24.0)     2585.3    91.3
Ensembl
  Mouse     2274.1      94.5       59.5 (15.1)      34.9 (21.3)      37.8 (11.5)      2406.3    96.2
  Human     2514.2      93.9       72.9 (17.3)      46.6 (7.2)       43.5 (12.0)      2677.2    94.5

^a Human assembly NCBI build 33 versus mouse assembly NCBI build 30, with the minimum segment size cut at 100 kb.

Fig. 4. A graphical overview of the comparisons of the human–mouse synteny maps obtained by the UM method and the corresponding map of either MGSC (A) or Ensembl (B). The UM map is shown in the left chromosomes. Each color corresponds to a particular human chromosome. Regions within a dashed box indicate that the human orthologous regions are in the backward strand.


Table 4. BLASTZ-evaluation on the ‘Unique’ and ‘Disagree’ conserved segments from UM versus MGSC (Table 2) and UM versus Ensembl (Table 3) comparisons

            Concordant^a   Shifted   Multiple   Reversed   Unsupported   Total
Unique
  UM        11 (3)         0         1          0          0             12
  MGSC      23 (2)         0         1          0          2             26
Unique
  UM        70 (33)        0         1          0          0             71
  Ensembl   11 (1)         6         5          3          10            35
Disagree
  UM        10 (3)         0         0          0          0             10
  Ensembl   0              1         4          0          0             5

^a In parentheses is the number of conserved segments whose mouse segment is ≥1 Mb in size.

not be orthologues, necessitating a post-processing step to distinguish between orthologous and paralogous similarities (Ureta-Vidal et al., 2003). Another problem with sequence alignment is that its computational cost, both in time and memory, escalates as the genome gets bigger.

The UM method can overcome many of these problems. First, it identifies two segments as orthologous only by the relative number (which must be above a background noise value), not the order, of their shared UMps. It follows that non-contiguous orthologous sequences may still be detected, because the number of their shared UMps would only be changed a little, if at all, by rearrangements in the two sequences. Second, it searches for homologous sequences by detecting a signal of non-randomly shared UMps over a large region of the genome (a 50 kb window in genome A and a chromosome-size fragment of genome B), which, in effect, avoids detecting, and hence dealing with, the numerous qualified local similarities found using sequence alignment-based methods. Indeed, the iterations required to resolve multiple mapping relationships for some anchoring islands (see Methods section) were mainly needed to divide an island in one genome to match rearranged or substantially gapped islands in the other genome. Third, as demonstrated above, the use of unique sequences at a fixed length can render mapping two mammalian genomes feasible on a personal computer with limited memory. Schemes that utilize a much larger word set, such as a suffix-tree (Delcher et al., 2002; Bray et al., 2003), to speed up sequence mapping may have the flexibility for a wider range of applications, but their high demand on memory space (Kurtz, 1999; Lefebvre et al., 2003) could strain laboratories with moderate computing resources.
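The first two points, counting shared unique markers per window while ignoring their order, can be sketched on toy data. The code below is only an illustration under stated assumptions and is not the published implementation. It uses k = 4 and a 100-base window instead of 16mers and 50 kb windows, considers the forward strand only, and compares an absolute count with a hypothetical `noise` threshold, whereas the method itself relies on a relative number above background. All function and variable names are invented for this example.

```python
from collections import Counter, defaultdict


def unique_kmers(seqs, k):
    """Return {kmer: (seq_name, position)} for k-mers that occur exactly once
    across all sequences of one genome (forward strand only, for brevity)."""
    counts = Counter()
    where = {}
    for name, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] += 1
            where[kmer] = (name, i)
    return {kmer: loc for kmer, loc in where.items() if counts[kmer] == 1}


def shared_marker_counts(genome_a, genome_b, k, window):
    """For every window of genome A, count the markers shared with each
    chromosome of genome B. Marker order inside the window is ignored."""
    markers_a = unique_kmers(genome_a, k)
    markers_b = unique_kmers(genome_b, k)
    hits = defaultdict(Counter)
    for kmer, (chrom_a, pos_a) in markers_a.items():
        if kmer in markers_b:
            chrom_b, _ = markers_b[kmer]
            hits[(chrom_a, pos_a // window)][chrom_b] += 1
    return hits


def call_partners(hits, noise):
    """Keep, per window, the genome B chromosome with the most shared markers,
    provided its count exceeds the (hypothetical) noise threshold."""
    calls = {}
    for win, per_chrom in hits.items():
        chrom_b, n = per_chrom.most_common(1)[0]
        if n > noise:
            calls[win] = chrom_b
    return calls


# Toy genomes: "a1" is embedded in "b1"; "b2" contains only a repeated 4-mer.
genome_a = {"a1": "ACGTTGCA"}
genome_b = {"b1": "GGACGTTGCAGG", "b2": "AAAAAAA"}
hits = shared_marker_counts(genome_a, genome_b, k=4, window=100)
calls = call_partners(hits, noise=2)
```

In this toy case the single window of `a1` shares five unique 4-mers with `b1` and none with `b2`, so `calls` pairs that window with `b1`. Because only counts are compared, shuffling the order of the shared markers within the window would not change the call.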

In terms of computational speed for mapping large genomes, other novel methods may rival or even better our method. For example, PatternHunter (Ma et al., 2002) requires just 20 days to do the human–mouse comparison on a Pentium III (supplement in Waterston et al., 2002), which can be reduced to just hours if Pentium IV and longer seeds are used. In terms of finding sequence matches, clearly, UMs of a fixed length cannot be expected to compete with PatternHunter (http://www.bioinformaticssolutions.com/products/ph.php) or, for that matter, any other general purpose homology search tool. However, it should be pointed out that the UM method has been developed for the special purpose of sequence mapping where genome-wide uniqueness can be exploited (Chen et al., 2002). Indeed, it is notable that for the purpose of synteny mapping, the novel use of UMs to detect orthologous signals (Fig. 1) can largely compensate for the loss of sensitivity in finding sequence matches, as demonstrated by the comparisons made between the UM map and the MGSC map or the Ensembl map (Tables 2–4).

Analysis of Table 4 showed that the UM method missed very few, and mainly small, MGSC- or Ensembl-unique segments and that the reason why these segments were missed was their uncharacteristically low UMp density (Figures S4 and S5, Supplement). Conversely, the reasons why an alignment-based method, such as that adopted by MGSC or Ensembl, missed UM-unique, but BLASTZ-concordant, segments (Table 4) are not clear, but, presumably, they were lost during the post-processing of seed matches. Although it should be noted that BLASTZ-concordant segments are not necessarily orthologues, the UM method of contrasting the pUMp/hUMp signal and noise (Fig. 1) presents a more direct and natural way of distinguishing ancestor-inherited similarity (i.e. homology) from similarity acquired independently by two taxa (i.e. homoplasy). The use of UMps also facilitates the incorporation of LIS analysis, a fast and established algorithm (Gusfield, 1997), to quickly identify potentially questionable conserved segments (Figures S6 and S7, Supplement) or regions that may have undergone rapid evolutionary changes, including rearrangements.

ACKNOWLEDGEMENTS

We are grateful to Michael Kamal (MIT) for providing the MGSC map. This work would not have been completed as quickly were it not for Glenn Tesler (UCSD) and Abel Ureta-Vidal, Xose Fernandez and Michele Clamp (EBI), who generously answered our questions and gave us helpful suggestions. We thank laboratory members Szu-Hsien Lu, Chia-Hao Ou, Austin Chiang and Leslie Chen for stimulating discussions, and Richie Gan for maintaining a reliable computing system. We also thank Prof. Cheng-Yan Kao of the National Taiwan University for valuable advice and comments. This work was supported by the Genomics and Proteomics Program of the Academia Sinica, under grant AS92IBMS1.

REFERENCES

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Bray,N., Dubchak,I. and Pachter,L. (2003) AVID: a global alignment program. Genome Res., 13, 97–102.

Chen,L.Y., Lu,S.H., Shih,E.S. and Hwang,M.J. (2002) Single nucleotide polymorphism mapping using genome-wide unique sequences. Genome Res., 12, 1106–1111.

Clamp,M., Andrews,D., Barker,D., Bevan,P., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V. et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31, 38–42.

Delcher,A.L., Phillippy,A., Carlton,J. and Salzberg,S.L. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res., 30, 2478–2483.

Gregory,S.G., Sekhon,M., Schein,J., Zhao,S., Osoegawa,K., Scott,C.E., Evans,R.S., Burridge,P.W., Cox,T.V., Fox,C.A. et al. (2002) A physical map of the mouse genome. Nature, 418, 743–750.

Gusfield,D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, pp. 290–292.

Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl genome database project. Nucleic Acids Res., 30, 38–41.

Kent,W.J. (2002) BLAT-the BLAST-like alignment tool. Genome Res., 12, 656–664.

Kurtz,S. (1999) Reducing the space requirement of suffix trees. Softw. Pract. Exp., 29, 1149–1171.

Lefebvre,A., Lecroq,T., Dauchel,H. and Alexandre,J. (2003) FORRepeats: detects repeats on entire chromosomes and between genomes. Bioinformatics, 19, 319–326.

Ma,B., Tromp,J. and Li,M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.

Nadeau,J.H. and Sankoff,D. (1998) Counting on comparative maps. Trends Genet., 14, 495–501.

Ning,Z., Cox,A.J. and Mullikin,J.C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res., 11, 1725–1729.

Pevzner,P. and Tesler,G. (2003) Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution. Proc. Natl Acad. Sci. USA, 100, 7672–7677.

Schuler,G.D. (1997) Sequence mapping by electronic PCR. Genome Res., 7, 541–550.

Schwartz,S., Kent,W.J., Smit,A., Zhang,Z., Baertsch,R., Hardison,R.C., Haussler,D. and Miller,W. (2003) Human–mouse alignments with BLASTZ. Genome Res., 13, 103–107.

Ureta-Vidal,A., Ettwiller,L. and Birney,E. (2003) Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat. Rev. Genet., 4, 251–262.

Vinga,S. and Almeida,J. (2003) Alignment-free sequence comparison-a review. Bioinformatics, 17, 391–397.

Waterston,R.H., Lindblad-Toh,K., Birney,E., Rogers,J., Abril,J.F., Agarwal,P., Agarwala,R., Ainscough,R., Alexandersson,M., An,P. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

Wilson,M.D., Riemer,C., Martindale,D.W., Schnupf,P., Boright,A.P., Cheung,T.L., Hardy,D.M., Schwartz,S., Scherer,S.W., Tsui,L.C. et al. (2001) Comparative analysis of the gene-dense ACHE/TFR2 region on human chromosome 7q22 with the orthologous region on mouse chromosome 5. Nucleic Acids Res., 29, 1352–1365.

Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.
